
OpenMP
Author: Blaise Barney, Lawrence Livermore National Laboratory

UCRL-MI-133316

Table of Contents
1. Abstract
2. Introduction
3. OpenMP Programming Model
4. OpenMP API Overview
5. Compiling OpenMP Programs
6. OpenMP Directives
   1. Fortran Directive Format
   2. C/C++ Directive Format
   3. Directive Scoping
   4. PARALLEL Construct
   5. Exercise 1
   6. Work-Sharing Constructs
      1. DO / for Directive
      2. SECTIONS Directive
      3. SINGLE Directive
   7. Combined Parallel Work-Sharing Constructs
   8. TASK Construct
   9. Exercise 2
   10. Synchronization Constructs
       1. MASTER Directive
       2. CRITICAL Directive
       3. BARRIER Directive
       4. TASKWAIT Directive
       5. ATOMIC Directive
       6. FLUSH Directive
       7. ORDERED Directive
   11. THREADPRIVATE Directive
   12. Data Scope Attribute Clauses
       1. PRIVATE Clause
       2. SHARED Clause
       3. DEFAULT Clause
       4. FIRSTPRIVATE Clause
       5. LASTPRIVATE Clause
       6. COPYIN Clause
       7. COPYPRIVATE Clause
       8. REDUCTION Clause
   13. Clauses / Directives Summary
   14. Directive Binding and Nesting Rules
7. Run-Time Library Routines
8. Environment Variables
9. Thread Stack Size and Thread Binding
10. Monitoring, Debugging and Performance Analysis Tools for OpenMP
11. Exercise 3
12. References and More Information
13. Appendix A: Run-Time Library Routines

Abstract
OpenMP is an Application Program Interface (API), jointly defined by a group of major computer hardware and software vendors.
OpenMP provides a portable, scalable model for developers of shared memory parallel applications. The API supports C/C++ and
Fortran on a wide variety of architectures. This tutorial covers most of the major features of OpenMP 3.1, including its various constructs
and directives for specifying parallel regions, work sharing, synchronization and data environment. Runtime library functions and
environment variables are also covered. This tutorial includes both C and Fortran example codes and a lab exercise.


Level/Prerequisites: This tutorial is ideal for those who are new to parallel programming with OpenMP. A basic understanding of
parallel programming in C or Fortran is required. For those who are unfamiliar with Parallel Programming in general, the material
covered in EC3500: Introduction to Parallel Computing would be helpful.

Introduction
What is OpenMP?
OpenMP Is:
An Application Program Interface (API) that may be used to explicitly direct multi-threaded,
shared memory parallelism.
Comprised of three primary API components:
Compiler Directives
Runtime Library Routines
Environment Variables
An abbreviation for: Open Multi-Processing

OpenMP Is Not:
Meant for distributed memory parallel systems (by itself)
Necessarily implemented identically by all vendors
Guaranteed to make the most efficient use of shared memory
Required to check for data dependencies, data conflicts, race conditions, deadlocks, or code sequences that cause a program
to be classified as non-conforming
Designed to handle parallel I/O. The programmer is responsible for synchronizing input and output.

Goals of OpenMP:
Standardization:
Provide a standard among a variety of shared memory architectures/platforms
Jointly defined and endorsed by a group of major computer hardware and software vendors

Lean and Mean:


Establish a simple and limited set of directives for programming shared memory machines.
Significant parallelism can be implemented by using just 3 or 4 directives.
This goal is becoming less meaningful with each new release, apparently.

Ease of Use:
Provide capability to incrementally parallelize a serial program, unlike message-passing libraries which typically require an
all or nothing approach
Provide the capability to implement both coarse-grain and fine-grain parallelism

Portability:
The API is specified for C/C++ and Fortran
Public forum for API and membership
Most major platforms have been implemented including Unix/Linux platforms and Windows

History:
In the early 90's, vendors of shared-memory machines supplied similar, directive-based, Fortran programming extensions:
The user would augment a serial Fortran program with directives specifying which loops were to be parallelized
The compiler would be responsible for automatically parallelizing such loops across the SMP processors
Implementations were all functionally similar, but were diverging (as usual)
First attempt at a standard was the draft for ANSI X3H5 in 1994. It was never adopted, largely due to waning interest as
distributed memory machines became popular.
However, not long after this, newer shared memory machine architectures started to become prevalent, and interest resumed.
The OpenMP standard specification started in the spring of 1997, taking over where ANSI X3H5 had left off.


Led by the OpenMP Architecture Review Board (ARB). Original ARB members and contributors are shown below. (Disclaimer: all partner names derived from the OpenMP web site.)

ARB Members:
    Compaq / Digital
    Hewlett-Packard Company
    Intel Corporation
    International Business Machines (IBM)
    Kuck & Associates, Inc. (KAI)
    Silicon Graphics, Inc.
    Sun Microsystems, Inc.
    U.S. Department of Energy ASCI program

Endorsing Application Developers:
    ADINA R&D, Inc.
    ANSYS, Inc.
    Dash Associates
    Fluent, Inc.
    ILOG CPLEX Division
    Livermore Software Technology Corporation (LSTC)
    MECALOG SARL
    Oxford Molecular Group PLC
    The Numerical Algorithms Group Ltd. (NAG)

Endorsing Software Vendors:
    Absoft Corporation
    Edinburgh Portable Compilers
    GENIAS Software GmbH
    Myrias Computer Technologies, Inc.
    The Portland Group, Inc. (PGI)
For more news and membership information about the OpenMP ARB, visit: openmp.org/wp/about-openmp.

Release History
OpenMP continues to evolve - new constructs and features are added with each release.
Initially, the API specifications were released separately for C and Fortran. Since 2005, they have been released together.
The table below chronicles the OpenMP API release history.

Date        Version
Oct 1997    Fortran 1.0
Oct 1998    C/C++ 1.0
Nov 1999    Fortran 1.1
Nov 2000    Fortran 2.0
Mar 2002    C/C++ 2.0
May 2005    OpenMP 2.5
May 2008    OpenMP 3.0
Jul 2011    OpenMP 3.1
Jul 2013    OpenMP 4.0
Nov 2015    OpenMP 4.5

This tutorial refers to OpenMP version 3.1. Syntax and features of newer releases are not currently covered.

References:
OpenMP website: openmp.org
API specifications, FAQ, presentations, discussions, media releases, calendar, membership application and more...
Wikipedia: en.wikipedia.org/wiki/OpenMP

OpenMP Programming Model


Shared Memory Model:
OpenMP is designed for multi-processor/core, shared memory machines. The underlying architecture can be shared memory
UMA or NUMA.


Figures: Uniform Memory Access (UMA) and Non-Uniform Memory Access (NUMA) architectures.

Thread Based Parallelism:


OpenMP programs accomplish parallelism exclusively through the use of threads.
A thread of execution is the smallest unit of processing that can be scheduled by an operating system. The idea of a subroutine
that can be scheduled to run autonomously might help explain what a thread is.
Threads exist within the resources of a single process. Without the process, they cease to exist.
Typically, the number of threads matches the number of machine processors/cores. However, the actual use of threads is up to the application.

Explicit Parallelism:
OpenMP is an explicit (not automatic) programming model, offering the programmer full control over parallelization.
Parallelization can be as simple as taking a serial program and inserting compiler directives....
Or as complex as inserting subroutines to set multiple levels of parallelism, locks and even nested locks.

Fork - Join Model:


OpenMP uses the fork-join model of parallel execution:

All OpenMP programs begin as a single process: the master thread. The master thread executes sequentially until the first
parallel region construct is encountered.

FORK: the master thread then creates a team of parallel threads.


The statements in the program that are enclosed by the parallel region construct are then executed in parallel among the various
team threads.

JOIN: When the team threads complete the statements in the parallel region construct, they synchronize and terminate, leaving
only the master thread.
The number of parallel regions and the threads that comprise them are arbitrary.

Compiler Directive Based:


Most OpenMP parallelism is specified through the use of compiler directives which are embedded in C/C++ or Fortran source code.

Nested Parallelism:
The API provides for the placement of parallel regions inside other parallel regions.
Implementations may or may not support this feature.

Dynamic Threads:
The API provides for the runtime environment to dynamically alter the number of threads used to execute parallel regions.
Intended to promote more efficient use of resources, if possible.
Implementations may or may not support this feature.

I/O:
OpenMP specifies nothing about parallel I/O. This is particularly important if multiple threads attempt to write/read from the same
file.
If every thread conducts I/O to a different file, the issues are not as significant.
It is entirely up to the programmer to ensure that I/O is conducted correctly within the context of a multi-threaded program.

Memory Model: FLUSH Often?


OpenMP provides a "relaxed-consistency" and "temporary" view of thread memory (in their words). In other words, threads can
"cache" their data and are not required to maintain exact consistency with real memory all of the time.
When it is critical that all threads view a shared variable identically, the programmer is responsible for ensuring that the variable is FLUSHed by all threads as needed.
More on this later...

OpenMP API Overview


Three Components:
The OpenMP API is comprised of three distinct components:
Compiler Directives (44)
Runtime Library Routines (35)
Environment Variables (13)
The application developer decides how to employ these components. In the simplest case, only a few of them are needed.
Implementations differ in their support of all API components. For example, an implementation may state that it supports nested
parallelism, but the API makes it clear that may be limited to a single thread - the master thread. Not exactly what the developer
might expect?

Compiler Directives:
Compiler directives appear as comments in your source code and are ignored by compilers unless you tell them otherwise - usually by specifying the appropriate compiler flag, as discussed in the Compiling section later.
OpenMP compiler directives are used for various purposes:
Spawning a parallel region
Dividing blocks of code among threads
Distributing loop iterations between threads
Serializing sections of code
Synchronization of work among threads
Compiler directives have the following syntax:

    sentinel   directive-name   [clause, ...]

For example:

    Fortran:   !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(BETA,PI)

    C/C++:     #pragma omp parallel default(shared) private(beta,pi)
Compiler directives are covered in detail later.

Run-time Library Routines:


The OpenMP API includes an ever-growing number of run-time library routines.


These routines are used for a variety of purposes:
Setting and querying the number of threads
Querying a thread's unique identifier (thread ID), a thread's ancestor's identifier, the thread team size
Setting and querying the dynamic threads feature
Querying if in a parallel region, and at what level
Setting and querying nested parallelism
Setting, initializing and terminating locks and nested locks
Querying wall clock time and resolution
For C/C++, all of the run-time library routines are actual subroutines. For Fortran, some are actually functions, and some are
subroutines. For example:

    Fortran:   INTEGER FUNCTION OMP_GET_NUM_THREADS()

    C/C++:     #include <omp.h>
               int omp_get_num_threads(void)
Note that for C/C++, you usually need to include the <omp.h> header file.
Fortran routines are not case sensitive, but C/C++ routines are.
The run-time library routines are briefly discussed as an overview in the Run-Time Library Routines section, and in more detail in
Appendix A.

Environment Variables:
OpenMP provides several environment variables for controlling the execution of parallel code at run-time.
These environment variables can be used to control such things as:
Setting the number of threads
Specifying how loop iterations are divided
Binding threads to processors
Enabling/disabling nested parallelism; setting the maximum levels of nested parallelism
Enabling/disabling dynamic threads
Setting thread stack size
Setting thread wait policy
Setting OpenMP environment variables is done the same way you set any other environment variables, and depends upon which
shell you use. For example:

    csh/tcsh:  setenv OMP_NUM_THREADS 8

    sh/bash:   export OMP_NUM_THREADS=8
OpenMP environment variables are discussed in the Environment Variables section later.

Example OpenMP Code Structure:

Fortran - General Code Structure

       PROGRAM HELLO

       INTEGER VAR1, VAR2, VAR3

!      Serial code
       .
       .
       .

!      Beginning of parallel region. Fork a team of threads.
!      Specify variable scoping

!$OMP PARALLEL PRIVATE(VAR1, VAR2) SHARED(VAR3)

!      Parallel region executed by all threads
       .
!      Other OpenMP directives
       .
!      Run-time Library calls
       .
!      All threads join master thread and disband

!$OMP END PARALLEL

!      Resume serial code
       .
       .
       .
       END

C / C++ - General Code Structure

#include <omp.h>

main ()  {

  int var1, var2, var3;

  /* Serial code */
  .
  .
  .

  /* Beginning of parallel region. Fork a team of threads. */
  /* Specify variable scoping */

  #pragma omp parallel private(var1, var2) shared(var3)
  {

    /* Parallel region executed by all threads */
    .
    /* Other OpenMP directives */
    .
    /* Run-time Library calls */
    .
    /* All threads join master thread and disband */

  }

  /* Resume serial code */
  .
  .
  .
}

Compiling OpenMP Programs


LC OpenMP Implementations:
As of June 2016, the documentation sources for LC's default compilers claim the following OpenMP support:

Compiler                             Version    Supports
Intel C/C++, Fortran                 14.0.3     OpenMP 3.1
GNU C/C++, Fortran                   4.4.7      OpenMP 3.0
PGI C/C++, Fortran                   8.0.1      OpenMP 3.0
IBM Blue Gene C/C++                  12.1       OpenMP 3.1
IBM Blue Gene Fortran                14.1       OpenMP 3.1
IBM Blue Gene GNU C/C++, Fortran     4.4.7      OpenMP 3.0

OpenMP 4.0 Support (according to vendor and openmp.org documentation):


GNU: supported in 4.9 for C/C++ and 4.9.1 for Fortran
Intel: 14.0 has "some" support; 15.0 supports "most features"; version 16 supported
PGI: not currently available
IBM BG/Q: not currently available
OpenMP 4.5 Support:
Not currently supported on any of LC's production cluster compilers.
Supported in a beta version of the Clang compiler on the non-production rzmist and rzhasgpu clusters (June 2016).
To view all LC compiler packages by version, use the command: use -l compilers
To view LC's default compiler versions see: https://computing.llnl.gov/?set=code&page=compilers
Best place to view OpenMP support by a range of compilers: http://openmp.org/wp/openmp-compilers/.

Compiling:
All of LC's compilers require you to use the appropriate compiler flag to "turn on" OpenMP compilations. The table below shows
what to use for each compiler.

Compiler / Platform        Compiler                   Flag

Intel                      icc                        -openmp
  Linux Opteron/Xeon       icpc
                           ifort

PGI                        pgcc                       -mp
  Linux Opteron/Xeon       pgCC
                           pgf77
                           pgf90

GNU                        gcc                        -fopenmp
  Linux Opteron/Xeon       g++
  IBM Blue Gene            g77
                           gfortran

IBM                        bgxlc_r, bgcc_r            -qsmp=omp
  Blue Gene                bgxlC_r, bgxlc++_r
                           bgxlc89_r
                           bgxlc99_r
                           bgxlf_r
                           bgxlf90_r
                           bgxlf95_r
                           bgxlf2003_r

*Be sure to use a thread-safe compiler - its name ends with _r
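
For example, compiling a single source file with OpenMP enabled might look like the following (the file names here are hypothetical, not part of the exercises):

    icc   -openmp  omp_hello.c  -o omp_hello       (Intel C)
    ifort -openmp  omp_hello.f  -o omp_hello       (Intel Fortran)
    gcc   -fopenmp omp_hello.c  -o omp_hello       (GNU C)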


Compiler Documentation:
IBM BlueGene: www-01.ibm.com/software/awdtools/fortran/ and www-01.ibm.com/software/awdtools/xlcpp
Intel: www.intel.com/software/products/compilers/
PGI: www.pgroup.com
GNU: gnu.org
All: See the relevant man pages and any files that might relate in /usr/local/docs

OpenMP Directives
Fortran Directives Format


Format: (case insensitive)

    sentinel   directive-name   [clause ...]

    sentinel          All Fortran OpenMP directives must begin with a sentinel. The accepted
                      sentinels depend upon the type of Fortran source. Possible sentinels are:
                          !$OMP
                          C$OMP
                          *$OMP

    directive-name    A valid OpenMP directive. Must appear after the sentinel and before any
                      clauses.

    [clause ...]      Optional. Clauses can be in any order, and repeated as necessary unless
                      otherwise restricted.

Example:

    !$OMP PARALLEL DEFAULT(SHARED) PRIVATE(BETA,PI)

Fixed Form Source:


!$OMP C$OMP *$OMP are accepted sentinels and must start in column 1
All Fortran fixed form rules for line length, white space, continuation and comment columns apply for the entire directive line
Initial directive lines must have a space/zero in column 6.
Continuation lines must have a non-space/zero in column 6.

Free Form Source:


!$OMP is the only accepted sentinel. Can appear in any column, but must be preceded by white space only.
All Fortran free form rules for line length, white space, continuation and comment columns apply for the entire directive line
Initial directive lines must have a space after the sentinel.
Continuation lines must have an ampersand as the last non-blank character in a line. The following line must begin with a sentinel
and then the continuation directives.

General Rules:
Comments can not appear on the same line as a directive
Only one directive-name may be specified per directive
Fortran compilers which are OpenMP enabled generally include a command line option which instructs the compiler to activate
and interpret all OpenMP directives.
Several Fortran OpenMP directives come in pairs and have the form shown below. The "end" directive is optional but advised for
readability.

    !$OMP  directive

        [ structured block of code ]

    !$OMP  end  directive

OpenMP Directives
C / C++ Directives Format
Format:

    #pragma omp   directive-name   [clause, ...]   newline

    #pragma omp       Required for all OpenMP C/C++ directives.

    directive-name    A valid OpenMP directive. Must appear after the pragma and before any
                      clauses.

    [clause, ...]     Optional. Clauses can be in any order, and repeated as necessary unless
                      otherwise restricted.

    newline           Required. Precedes the structured block which is enclosed by this
                      directive.

Example:

    #pragma omp parallel default(shared) private(beta,pi)

General Rules:
Case sensitive
Directives follow conventions of the C/C++ standards for compiler directives
Only one directive-name may be specified per directive
Each directive applies to at most one succeeding statement, which must be a structured block.
Long directive lines can be "continued" on succeeding lines by escaping the newline character with a backslash ("\") at the end of
a directive line.

OpenMP Directives
Directive Scoping
Do we do this now...or do it later? Oh well, let's get it over with early...

Static (Lexical) Extent:


The code textually enclosed between the beginning and the end of a structured block following a directive.
The static extent of a directive does not span multiple routines or code files

Orphaned Directive:
An OpenMP directive that appears independently from another enclosing directive is said to be an orphaned directive. It exists
outside of another directive's static (lexical) extent.
Will span routines and possibly code files

Dynamic Extent:
The dynamic extent of a directive includes both its static (lexical) extent and the extents of its orphaned directives.

Example:

       PROGRAM TEST
       ...
!$OMP  PARALLEL
       ...
!$OMP  DO
       DO I=...
          ...
          CALL SUB1
          ...
       ENDDO
!$OMP  END DO
       ...
       CALL SUB2
       ...
!$OMP  END PARALLEL


       SUBROUTINE SUB1
       ...
!$OMP  CRITICAL
       ...
!$OMP  END CRITICAL
       END


       SUBROUTINE SUB2
       ...
!$OMP  SECTIONS
       ...
!$OMP  END SECTIONS
       ...
       END

STATIC EXTENT:        The DO directive occurs within an enclosing parallel region.

ORPHANED DIRECTIVES:  The CRITICAL and SECTIONS directives occur outside an enclosing parallel
                      region.

DYNAMIC EXTENT:       The CRITICAL and SECTIONS directives occur within the dynamic extent of
                      the DO and PARALLEL directives.

Why Is This Important?


OpenMP specifies a number of scoping rules on how directives may associate (bind) and nest within each other
Illegal and/or incorrect programs may result if the OpenMP binding and nesting rules are ignored
See Directive Binding and Nesting Rules for specific details

OpenMP Directives
PARALLEL Region Construct
Purpose:
A parallel region is a block of code that will be executed by multiple threads. This is the fundamental OpenMP parallel construct.

Format:

Fortran:

    !$OMP PARALLEL [clause ...]
                   IF (scalar_logical_expression)
                   PRIVATE (list)
                   SHARED (list)
                   DEFAULT (PRIVATE | FIRSTPRIVATE | SHARED | NONE)
                   FIRSTPRIVATE (list)
                   REDUCTION (operator: list)
                   COPYIN (list)
                   NUM_THREADS (scalar-integer-expression)

       block

    !$OMP END PARALLEL

C/C++:

    #pragma omp parallel [clause ...] newline
                         if (scalar_expression)
                         private (list)
                         shared (list)
                         default (shared | none)
                         firstprivate (list)
                         reduction (operator: list)
                         copyin (list)
                         num_threads (integer-expression)

       structured_block
Notes:
When a thread reaches a PARALLEL directive, it creates a team of threads and becomes the master of the team. The master is
a member of that team and has thread number 0 within that team.
Starting from the beginning of this parallel region, the code is duplicated and all threads will execute that code.
There is an implied barrier at the end of a parallel region. Only the master thread continues execution past this point.
If any thread terminates within a parallel region, all threads in the team will terminate, and the work done up until that point is
undefined.

How Many Threads?


The number of threads in a parallel region is determined by the following factors, in order of precedence:
    1. Evaluation of the IF clause
    2. Setting of the NUM_THREADS clause
    3. Use of the omp_set_num_threads() library function
    4. Setting of the OMP_NUM_THREADS environment variable
    5. Implementation default - usually the number of CPUs on a node, though it could be dynamic (see next bullet).
Threads are numbered from 0 (master thread) to N-1
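
For illustration, a minimal C sketch (not part of the tutorial's exercises) showing two of these mechanisms interacting; the NUM_THREADS clause overrides the earlier library call:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        omp_set_num_threads(4);               /* library call requests 4 threads         */

        #pragma omp parallel num_threads(2)   /* clause has higher precedence: 2 threads */
        {
            #pragma omp master
            printf("team size = %d\n", omp_get_num_threads());
        }
        return 0;
    }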

Dynamic Threads:
Use the omp_get_dynamic() library function to determine if dynamic threads are enabled.
If supported, the two methods available for enabling dynamic threads are:
1. The omp_set_dynamic() library routine
2. Setting of the OMP_DYNAMIC environment variable to TRUE

Nested Parallel Regions:


Use the omp_get_nested() library function to determine if nested parallel regions are enabled.
The two methods available for enabling nested parallel regions (if supported) are:
1. The omp_set_nested() library routine
2. Setting of the OMP_NESTED environment variable to TRUE
If not supported, a parallel region nested within another parallel region results in the creation of a new team, consisting of one
thread, by default.
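
The following small C sketch (illustrative only) shows the library routines mentioned above being used to set and query both features:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        omp_set_dynamic(0);   /* disable dynamic adjustment of team sizes   */
        omp_set_nested(1);    /* request nested parallelism (if supported)  */

        printf("dynamic enabled? %d   nested enabled? %d\n",
               omp_get_dynamic(), omp_get_nested());
        return 0;
    }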

Clauses:
IF clause: If present, it must evaluate to .TRUE. (Fortran) or non-zero (C/C++) in order for a team of threads to be created.
Otherwise, the region is executed serially by the master thread.
The remaining clauses are described in detail later, in the Data Scope Attribute Clauses section.

Restrictions:
A parallel region must be a structured block that does not span multiple routines or code files
It is illegal to branch (goto) into or out of a parallel region
Only a single IF clause is permitted
Only a single NUM_THREADS clause is permitted
A program must not depend upon the ordering of the clauses

Example: Parallel Region


Simple "Hello World" program
Every thread executes all code enclosed in the parallel region
OpenMP library routines are used to obtain thread identifiers and total number of threads

Fortran - Parallel Region Example

       PROGRAM HELLO

       INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +         OMP_GET_THREAD_NUM

!      Fork a team of threads with each thread having a private TID variable
!$OMP PARALLEL PRIVATE(TID)

!      Obtain and print thread id
       TID = OMP_GET_THREAD_NUM()
       PRINT *, 'Hello World from thread = ', TID

!      Only master thread does this
       IF (TID .EQ. 0) THEN
          NTHREADS = OMP_GET_NUM_THREADS()
          PRINT *, 'Number of threads = ', NTHREADS
       END IF

!      All threads join master thread and disband
!$OMP END PARALLEL

       END

C / C++ - Parallel Region Example

#include <omp.h>
#include <stdio.h>

main(int argc, char *argv[]) {

  int nthreads, tid;

  /* Fork a team of threads with each thread having a private tid variable */
  #pragma omp parallel private(tid)
  {

    /* Obtain and print thread id */
    tid = omp_get_thread_num();
    printf("Hello World from thread = %d\n", tid);

    /* Only master thread does this */
    if (tid == 0)
      {
      nthreads = omp_get_num_threads();
      printf("Number of threads = %d\n", nthreads);
      }

  }  /* All threads join master thread and terminate */

}

OpenMP Exercise 1
Getting Started
Overview:
Login to the workshop cluster using your workshop username and OTP token
Copy the exercise files to your home directory
Familiarize yourself with LC's OpenMP environment
Write a simple "Hello World" OpenMP program
Successfully compile your program
Successfully run your program
Modify the number of threads used to run your program
GO TO THE EXERCISE HERE

Approx. 20 minutes


OpenMP Directives
Work-Sharing Constructs
A work-sharing construct divides the execution of the enclosed code region among the members of the team that encounter it.
Work-sharing constructs do not launch new threads
There is no implied barrier upon entry to a work-sharing construct; however, there is an implied barrier at the end of a work-sharing construct.

Types of Work-Sharing Constructs:


NOTE: The Fortran workshare construct is not shown here.

    DO / for  - shares iterations of a loop across the team. Represents a type of "data
                parallelism".

    SECTIONS  - breaks work into separate, discrete sections. Each section is executed by a
                thread. Can be used to implement a type of "functional parallelism".

    SINGLE    - serializes a section of code.

Restrictions:
A work-sharing construct must be enclosed dynamically within a parallel region in order for the directive to execute in parallel.
Work-sharing constructs must be encountered by all members of a team or none at all
Successive work-sharing constructs must be encountered in the same order by all members of a team

OpenMP Directives
Work-Sharing Constructs
DO / for Directive
Purpose:
The DO / for directive specifies that the iterations of the loop immediately following it must be executed in parallel by the team.
This assumes a parallel region has already been initiated, otherwise it executes in serial on a single processor.

Format:
!$OMP DO [clause ...]
SCHEDULE (type [,chunk])
ORDERED
PRIVATE (list)
FIRSTPRIVATE (list)
LASTPRIVATE (list)
SHARED (list)
Fortran
REDUCTION (operator | intrinsic : list)
COLLAPSE (n)

do_loop
!$OMP END DO

[ NOWAIT ]

#pragma omp for [clause ...] newline


schedule (type [,chunk])
ordered
converted by W eb2PDFConvert.com

private (list)
firstprivate (list)
lastprivate (list)
shared (list)
reduction (operator: list)
collapse (n)
nowait

C/C++

for_loop
Clauses:
SCHEDULE: Describes how iterations of the loop are divided among the threads in the team. The default schedule is
implementation dependent. For a discussion on how one type of scheduling may be more optimal than others, see
http://openmp.org/forum/viewtopic.php?f=3&t=83.
STATIC
Loop iterations are divided into pieces of size chunk and then statically assigned to threads. If chunk is not specified, the
iterations are evenly (if possible) divided contiguously among the threads.
DYNAMIC
Loop iterations are divided into pieces of size chunk, and dynamically scheduled among the threads; when a thread
finishes one chunk, it is dynamically assigned another. The default chunk size is 1.
GUIDED
Iterations are dynamically assigned to threads in blocks as threads request them until no blocks remain to be assigned.
Similar to DYNAMIC except that the block size decreases each time a parcel of work is given to a thread. The size of the
initial block is proportional to:
number_of_iterations / number_of_threads
Subsequent blocks are proportional to
number_of_iterations_remaining / number_of_threads
The chunk parameter defines the minimum block size. The default chunk size is 1.
RUNTIME
The scheduling decision is deferred until runtime by the environment variable OMP_SCHEDULE. It is illegal to specify a
chunk size for this clause.
AUTO
The scheduling decision is delegated to the compiler and/or runtime system.

NOWAIT / nowait: If specified, then threads do not synchronize at the end of the parallel loop.
ORDERED: Specifies that the iterations of the loop must be executed as they would be in a serial program.
COLLAPSE: Specifies how many loops in a nested loop should be collapsed into one large iteration space and divided
according to the schedule clause. The sequential execution of the iterations in all associated loops determines the order of the
iterations in the collapsed iteration space.
Other clauses are described in detail later, in the Data Scope Attribute Clauses section.

Restrictions:
The DO loop can not be a DO WHILE loop, or a loop without loop control. Also, the loop iteration variable must be an integer and
the loop control parameters must be the same for all threads.
Program correctness must not depend upon which thread executes a particular iteration.
It is illegal to branch (goto) out of a loop associated with a DO/for directive.
The chunk size must be specified as a loop invariant integer expression, as there is no synchronization during its evaluation by different threads.
ORDERED, COLLAPSE and SCHEDULE clauses may appear once each.
See the OpenMP specification document for additional restrictions.


Example: DO / for Directive


Simple vector-add program
Arrays A, B, C, and variable N will be shared by all threads.
Variable I will be private to each thread; each thread will have its own unique copy.
The iterations of the loop will be distributed dynamically in CHUNK sized pieces.
Threads will not synchronize upon completing their individual pieces of work (NOWAIT).

Fortran - DO Directive Example

       PROGRAM VEC_ADD_DO

       INTEGER N, CHUNKSIZE, CHUNK, I
       PARAMETER (N=1000)
       PARAMETER (CHUNKSIZE=100)
       REAL A(N), B(N), C(N)

!      Some initializations
       DO I = 1, N
          A(I) = I * 1.0
          B(I) = A(I)
       ENDDO
       CHUNK = CHUNKSIZE

!$OMP PARALLEL SHARED(A,B,C,CHUNK) PRIVATE(I)

!$OMP DO SCHEDULE(DYNAMIC,CHUNK)
       DO I = 1, N
          C(I) = A(I) + B(I)
       ENDDO
!$OMP END DO NOWAIT

!$OMP END PARALLEL

       END

C / C++ - for Directive Example

#include <omp.h>
#define N 1000
#define CHUNKSIZE 100

main(int argc, char *argv[]) {

  int i, chunk;
  float a[N], b[N], c[N];

  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;

  #pragma omp parallel shared(a,b,c,chunk) private(i)
  {

    #pragma omp for schedule(dynamic,chunk) nowait
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];

  }  /* end of parallel region */

}

OpenMP Directives
Work-Sharing Constructs
SECTIONS Directive
Purpose:
The SECTIONS directive is a non-iterative work-sharing construct. It specifies that the enclosed section(s) of code are to be
divided among the threads in the team.
Independent SECTION directives are nested within a SECTIONS directive. Each SECTION is executed once by a thread in the
team. Different sections may be executed by different threads. It is possible for a thread to execute more than one section if it is
quick enough and the implementation permits such.

Format:

Fortran:

    !$OMP SECTIONS [clause ...]
                   PRIVATE (list)
                   FIRSTPRIVATE (list)
                   LASTPRIVATE (list)
                   REDUCTION (operator | intrinsic : list)

    !$OMP SECTION

       block

    !$OMP SECTION

       block

    !$OMP END SECTIONS  [ NOWAIT ]

C/C++:

    #pragma omp sections [clause ...] newline
                         private (list)
                         firstprivate (list)
                         lastprivate (list)
                         reduction (operator: list)
                         nowait
      {

      #pragma omp section  newline

         structured_block

      #pragma omp section  newline

         structured_block

      }

Clauses:
There is an implied barrier at the end of a SECTIONS directive, unless the NOWAIT/nowait clause is used.
Clauses are described in detail later, in the Data Scope Attribute Clauses section.

Questions:
    What happens if the number of threads and the number of SECTIONs are different? More threads than SECTIONs? Fewer threads than SECTIONs?
    Which thread executes which SECTION?
Restrictions:
It is illegal to branch (goto) into or out of section blocks.
SECTION directives must occur within the lexical extent of an enclosing SECTIONS directive (no orphan SECTIONs).

Example: SECTIONS Directive


Simple program demonstrating that different blocks of work will be done by different threads.

Fortran - SECTIONS Directive Example

       PROGRAM VEC_ADD_SECTIONS

       INTEGER N, I
       PARAMETER (N=1000)
       REAL A(N), B(N), C(N), D(N)

!      Some initializations
       DO I = 1, N
          A(I) = I * 1.5
          B(I) = I + 22.35
       ENDDO

!$OMP PARALLEL SHARED(A,B,C,D), PRIVATE(I)

!$OMP SECTIONS

!$OMP SECTION
       DO I = 1, N
          C(I) = A(I) + B(I)
       ENDDO

!$OMP SECTION
       DO I = 1, N
          D(I) = A(I) * B(I)
       ENDDO

!$OMP END SECTIONS NOWAIT

!$OMP END PARALLEL

       END

C / C++ - sections Directive Example

#include <omp.h>
#define N 1000

main(int argc, char *argv[]) {

  int i;
  float a[N], b[N], c[N], d[N];

  /* Some initializations */
  for (i=0; i < N; i++) {
    a[i] = i * 1.5;
    b[i] = i + 22.35;
  }

  #pragma omp parallel shared(a,b,c,d) private(i)
  {

    #pragma omp sections nowait
    {

      #pragma omp section
      for (i=0; i < N; i++)
        c[i] = a[i] + b[i];

      #pragma omp section
      for (i=0; i < N; i++)
        d[i] = a[i] * b[i];

    }  /* end of sections */

  }  /* end of parallel region */

}

OpenMP Directives
Work-Sharing Constructs
SINGLE Directive
Purpose:
The SINGLE directive specifies that the enclosed code is to be executed by only one thread in the team.
May be useful when dealing with sections of code that are not thread safe (such as I/O)

Format:
!$OMP SINGLE [clause ...]
PRIVATE (list)
FIRSTPRIVATE (list)

Fortran

block
!$OMP END SINGLE [ NOWAIT ]

#pragma omp single [clause ...] newline


private (list)
firstprivate (list)
C/C++
nowait

structured_block
Clauses:
Threads in the team that do not execute the SINGLE directive, wait at the end of the enclosed code block, unless a
NOWAIT/nowait clause is specified.
Clauses are described in detail later, in the Data Scope Attribute Clauses section.

Restrictions:
It is illegal to branch into or out of a SINGLE block.
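
A minimal C sketch (illustrative, not one of the tutorial's exercise codes) using SINGLE for work that only one thread should perform, such as I/O:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        {
            /* Executed by only one thread; the others wait at the implied barrier */
            #pragma omp single
            printf("input read by thread %d\n", omp_get_thread_num());

            /* All threads continue from here after the single block completes */
            printf("thread %d working\n", omp_get_thread_num());
        }
        return 0;
    }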


OpenMP Directives
Combined Parallel Work-Sharing Constructs
OpenMP provides three directives that are merely conveniences:
PARALLEL DO / parallel for
PARALLEL SECTIONS
PARALLEL WORKSHARE (Fortran only)
For the most part, these directives behave identically to an individual PARALLEL directive being immediately followed by a
separate work-sharing directive.
Most of the rules, clauses and restrictions that apply to both directives are in effect. See the OpenMP API for details.
An example using the PARALLEL DO / parallel for combined directive is shown below.

Fortran - PARALLEL DO Directive Example

       PROGRAM VECTOR_ADD

       INTEGER N, I, CHUNKSIZE, CHUNK
       PARAMETER (N=1000)
       PARAMETER (CHUNKSIZE=100)
       REAL A(N), B(N), C(N)

!      Some initializations
       DO I = 1, N
          A(I) = I * 1.0
          B(I) = A(I)
       ENDDO
       CHUNK = CHUNKSIZE

!$OMP PARALLEL DO
!$OMP& SHARED(A,B,C,CHUNK) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)

       DO I = 1, N
          C(I) = A(I) + B(I)
       ENDDO

!$OMP END PARALLEL DO

       END

C / C++ - parallel for Directive Example

#include <omp.h>
#define N         1000
#define CHUNKSIZE  100

main(int argc, char *argv[]) {

  int i, chunk;
  float a[N], b[N], c[N];

  /* Some initializations */
  for (i=0; i < N; i++)
    a[i] = b[i] = i * 1.0;
  chunk = CHUNKSIZE;

  #pragma omp parallel for          \
      shared(a,b,c,chunk) private(i) \
      schedule(static,chunk)
  for (i=0; i < N; i++)
    c[i] = a[i] + b[i];

}

OpenMP Directives
TASK Construct
Purpose:
The TASK construct defines an explicit task, which may be executed by the encountering thread, or deferred for execution by any
other thread in the team.
The data environment of the task is determined by the data sharing attribute clauses.
Task execution is subject to task scheduling - see the OpenMP 3.1 specification document for details.
Also see the OpenMP 3.1 documentation for the associated taskyield and taskwait directives.

Format:

Fortran:

    !$OMP TASK [clause ...]
               IF (scalar logical expression)
               FINAL (scalar logical expression)
               UNTIED
               DEFAULT (PRIVATE | FIRSTPRIVATE | SHARED | NONE)
               MERGEABLE
               PRIVATE (list)
               FIRSTPRIVATE (list)
               SHARED (list)

       block

    !$OMP END TASK

C/C++:

    #pragma omp task [clause ...] newline
                     if (scalar expression)
                     final (scalar expression)
                     untied
                     default (shared | none)
                     mergeable
                     private (list)
                     firstprivate (list)
                     shared (list)

       structured_block
Clauses and Restrictions:
Please consult the OpenMP 3.1 specifications document for details.
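
As a simple illustration (a sketch, not one of the tutorial's exercise codes), the following C fragment has one thread create two explicit tasks from within a SINGLE block; any thread in the team may execute them, and TASKWAIT waits for both to finish:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        {
            #pragma omp single            /* one thread generates the tasks           */
            {
                #pragma omp task
                printf("task A run by thread %d\n", omp_get_thread_num());

                #pragma omp task
                printf("task B run by thread %d\n", omp_get_thread_num());

                #pragma omp taskwait      /* wait for the two child tasks to complete */
                printf("both tasks done\n");
            }
        }
        return 0;
    }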

OpenMP Exercise 2
Work-Sharing Constructs
Overview:
Login to the LC workshop cluster, if you are not already logged in

Work-Sharing DO/for construct examples: review, compile and run


Work-Sharing SECTIONS construct example: review, compile and run
GO TO THE EXERCISE HERE

Approx. 20 minutes

OpenMP Directives
Synchronization Constructs
Consider a simple example where two threads on two different processors are both trying to increment a variable x at the same
time (assume x is initially 0):

THREAD 1:                              THREAD 2:

increment(x)                           increment(x)
{                                      {
    x = x + 1;                             x = x + 1;
}                                      }


THREAD 1:                              THREAD 2:

10  LOAD A, (x address)                10  LOAD A, (x address)
20  ADD A, 1                           20  ADD A, 1
30  STORE A, (x address)               30  STORE A, (x address)

One possible execution sequence:


1. Thread 1 loads the value of x into register A.
2. Thread 2 loads the value of x into register A.
3. Thread 1 adds 1 to register A
4. Thread 2 adds 1 to register A
5. Thread 1 stores register A at location x
6. Thread 2 stores register A at location x
The resultant value of x will be 1, not 2 as it should be.
To avoid a situation like this, the incrementing of x must be synchronized between the two threads to ensure that the correct result
is produced.
OpenMP provides a variety of Synchronization Constructs that control how the execution of each thread proceeds relative to other
team threads.

OpenMP Directives
Synchronization Constructs
MASTER Directive
Purpose:
The MASTER directive specifies a region that is to be executed only by the master thread of the team. All other threads on the
team skip this section of code
There is no implied barrier associated with this directive

Format:

Fortran:

    !$OMP MASTER

       block

    !$OMP END MASTER

C/C++:

    #pragma omp master  newline

       structured_block

Restrictions:
It is illegal to branch into or out of a MASTER block.
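
A minimal C sketch (hypothetical, not from the tutorial) confining a status message to the master thread; note that the other threads do not wait:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        {
            /* Only thread 0 executes this block; no implied barrier for the others */
            #pragma omp master
            printf("report from master, team size = %d\n", omp_get_num_threads());
        }
        return 0;
    }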

OpenMP Directives
Synchronization Constructs
CRITICAL Directive
Purpose:
The CRITICAL directive specifies a region of code that must be executed by only one thread at a time.

Format:

Fortran:

    !$OMP CRITICAL [ name ]

       block

    !$OMP END CRITICAL [ name ]

C/C++:

    #pragma omp critical [ name ]  newline

       structured_block

Notes:
If a thread is currently executing inside a CRITICAL region and another thread reaches that CRITICAL region and attempts to
execute it, it will block until the first thread exits that CRITICAL region.
The optional name enables multiple different CRITICAL regions to exist:
Names act as global identifiers. Different CRITICAL regions with the same name are treated as the same region.
All CRITICAL sections which are unnamed, are treated as the same section.

Restrictions:
It is illegal to branch into or out of a CRITICAL block.
Fortran only: The names of critical constructs are global entities of the program. If a name conflicts with any other entity, the
behavior of the program is unspecified.

Example: CRITICAL Construct


All threads in the team will attempt to execute in parallel, however, because of the CRITICAL construct surrounding the increment
of x, only one thread will be able to read/increment/write x at any time

Fortran - CRITICAL Directive Example

       PROGRAM CRITICAL

       INTEGER X
       X = 0

!$OMP PARALLEL SHARED(X)

!$OMP CRITICAL
       X = X + 1
!$OMP END CRITICAL

!$OMP END PARALLEL

       END

C / C++ - critical Directive Example

#include <omp.h>

main(int argc, char *argv[]) {

  int x;
  x = 0;

  #pragma omp parallel shared(x)
  {

    #pragma omp critical
    x = x + 1;

  }  /* end of parallel region */

}

OpenMP Directives
Synchronization Constructs
BARRIER Directive
Purpose:
The BARRIER directive synchronizes all threads in the team.
When a BARRIER directive is reached, a thread will wait at that point until all other threads have reached that barrier. All threads
then resume executing in parallel the code that follows the barrier.

Format:
Fortran

!$OMP BARRIER

C/C++ #pragma omp barrier

newline

Restrictions:
All threads in a team (or none) must execute the BARRIER region.
The sequence of work-sharing regions and barrier regions encountered must be the same for every thread in a team.
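
A small C sketch (illustrative only) showing a barrier separating two phases of work within one parallel region:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        #pragma omp parallel
        {
            printf("thread %d: phase 1\n", omp_get_thread_num());

            /* No thread proceeds past this point until all threads reach it */
            #pragma omp barrier

            printf("thread %d: phase 2\n", omp_get_thread_num());
        }
        return 0;
    }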


OpenMP Directives
Synchronization Constructs
TASKWAIT Directive
Purpose:
OpenMP 3.1 feature
The TASKWAIT construct specifies a wait on the completion of child tasks generated since the beginning of the current task.

Format:
Fortran

!$OMP TASKWAIT

C/C++ #pragma omp taskwait

newline

Restrictions:
Because the taskwait construct does not have a C language statement as part of its syntax, there are some restrictions on its
placement within a program. The taskwait directive may be placed only at a point where a base language statement is allowed.
The taskwait directive may not be used in place of the statement following an if, while, do, switch, or label. See the OpenMP 3.1
specifications document for details.

OpenMP Directives
Synchronization Constructs
ATOMIC Directive
Purpose:
The ATOMIC directive specifies that a specific memory location must be updated atomically, rather than letting multiple threads
attempt to write to it. In essence, this directive provides a mini-CRITICAL section.

Format:
!$OMP ATOMIC

Fortran

statement_expression
#pragma omp atomic

C/C++

newline

statement_expression

Restrictions:
The directive applies only to a single, immediately following statement
An atomic statement must follow a specific syntax. See the most recent OpenMP specs for this.
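
For example, a minimal C sketch (not from the tutorial's exercises) in which each thread atomically increments a shared counter:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int count = 0;

        #pragma omp parallel shared(count)
        {
            /* The update of count is performed atomically; without the
               directive this increment would be a race condition */
            #pragma omp atomic
            count++;
        }

        printf("count = %d\n", count);   /* equals the number of threads */
        return 0;
    }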

OpenMP Directives
Synchronization Constructs
FLUSH Directive
Purpose:
The FLUSH directive identifies a synchronization point at which the implementation must provide a consistent view of memory.

Thread-visible variables are written back to memory at this point.


There is a fair amount of discussion on this directive within OpenMP circles that you may wish to consult for more information.
Some of it is hard to understand? Per the API:
If the intersection of the flush-sets of two flushes performed by two different threads is non-empty, then the two flushes must
be completed as if in some sequential order, seen by all threads.
Say what?
To quote from the openmp.org FAQ:

Q17: Is the !$omp flush directive necessary on a cache coherent system?


A17: Yes the flush directive is necessary. Look in the OpenMP specifications for examples of its use. The directive is necessary to instruct the compiler that the variable must be written to/read from the memory system, i.e. that the variable cannot be kept in a local CPU register over the flush "statement" in your code.
Cache coherency makes certain that if one CPU executes a read or write instruction from/to memory, then all other CPUs in
the system will get the same value from that memory address when they access it. All caches will show a coherent value.
However, in the OpenMP standard there must be a way to instruct the compiler to actually insert the read/write machine
instruction and not postpone it. Keeping a variable in a register in a loop is very common when producing efficient machine
language code for a loop.
Also see the most recent OpenMP specs for details.

Format:
Fortran !$OMP FLUSH

(list)

C/C++ #pragma omp flush (list)

newline

Notes:
The optional list contains a list of named variables that will be flushed in order to avoid flushing all variables. For pointers in the
list, note that the pointer itself is flushed, not the object it points to.
Implementations must ensure any prior modifications to thread-visible variables are visible to all threads after this point; i.e., compilers must restore values from registers to memory, hardware might need to flush write buffers, etc.
The FLUSH directive is implied for the directives shown in the table below. The directive is not implied if a NOWAIT clause is
present.

Fortran
BARRIER
END PARALLEL
CRITICAL and END CRITICAL
END DO
END SECTIONS
END SINGLE
ORDERED and END ORDERED

C / C++
barrier
parallel - upon entry and exit
critical - upon entry and exit
ordered - upon entry and exit
for - upon exit
sections - upon exit
single - upon exit
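
To make the use of FLUSH concrete, here is a minimal C sketch of a producer/consumer flag handoff, written under the assumption of a two-thread team (illustrative only; see the OpenMP examples document for the authoritative version of this pattern):

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int data = 0, flag = 0;    /* shared between the two threads */

        #pragma omp parallel num_threads(2) shared(data, flag)
        {
            if (omp_get_thread_num() == 0) {
                data = 42;
                #pragma omp flush(data)     /* make data visible before raising the flag */
                flag = 1;
                #pragma omp flush(flag)
            } else {
                while (1) {                 /* spin until the producer's flag is seen */
                    #pragma omp flush(flag)
                    if (flag) break;
                }
                #pragma omp flush(data)     /* now a consistent value of data is visible */
                printf("consumer read data = %d\n", data);
            }
        }
        return 0;
    }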

OpenMP Directives
Synchronization Constructs
ORDERED Directive
Purpose:
The ORDERED directive specifies that iterations of the enclosed loop will be executed in the same order as if they were
executed on a serial processor.
Threads will need to wait before executing their chunk of iterations if previous iterations haven't completed yet.
Used within a DO / for loop with an ORDERED clause


The ORDERED directive provides a way to "fine tune" where ordering is to be applied within a loop. Otherwise, it is not required.

Format:

Fortran:

    !$OMP DO ORDERED [clauses...]
       (loop region)

    !$OMP ORDERED
       (block)
    !$OMP END ORDERED

       (end of loop region)
    !$OMP END DO

C/C++:

    #pragma omp for ordered [clauses...]
       (loop region)

    #pragma omp ordered  newline
       structured_block

       (end of loop region)


Restrictions:
An ORDERED directive can only appear in the dynamic extent of the following directives:
DO or PARALLEL DO (Fortran)
for or parallel for (C/C++)
Only one thread is allowed in an ordered section at any time
It is illegal to branch into or out of an ORDERED block.
An iteration of a loop must not execute the same ORDERED directive more than once, and it must not execute more than one
ORDERED directive.
A loop which contains an ORDERED directive must be a loop with an ORDERED clause.
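
A small C sketch (illustrative only): the loop body runs in parallel, but the ordered block forces the printed output to appear in iteration order:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int i;

        #pragma omp parallel for ordered
        for (i = 0; i < 8; i++) {
            int sq = i * i;                        /* may execute out of order     */

            #pragma omp ordered
            printf("i = %d, i*i = %d\n", i, sq);   /* output appears in order 0..7 */
        }
        return 0;
    }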

OpenMP Directives
THREADPRIVATE Directive
Purpose:
The THREADPRIVATE directive is used to make global file scope variables (C/C++) or common blocks (Fortran) local and
persistent to a thread through the execution of multiple parallel regions.

Format:
    Fortran:   !$OMP THREADPRIVATE (/cb/, ...)     (cb is the name of a common block)

    C/C++:     #pragma omp threadprivate (list)


Notes:
The directive must appear after the declaration of listed variables/common blocks. Each thread then gets its own copy of the
variable/common block, so data written by one thread is not visible to other threads. For example:

Fortran - THREADPRIVATE Directive Example

       PROGRAM THREADPRIV

       INTEGER A, B, I, TID, OMP_GET_THREAD_NUM
       REAL*4 X
       COMMON /C1/ A

!$OMP THREADPRIVATE(/C1/, X)

!      Explicitly turn off dynamic threads
       CALL OMP_SET_DYNAMIC(.FALSE.)

       PRINT *, '1st Parallel Region:'
!$OMP PARALLEL PRIVATE(B, TID)
       TID = OMP_GET_THREAD_NUM()
       A = TID
       B = TID
       X = 1.1 * TID + 1.0
       PRINT *, 'Thread',TID,':   A,B,X=',A,B,X
!$OMP END PARALLEL

       PRINT *, '************************************'
       PRINT *, 'Master thread doing serial work here'
       PRINT *, '************************************'

       PRINT *, '2nd Parallel Region: '
!$OMP PARALLEL PRIVATE(TID)
       TID = OMP_GET_THREAD_NUM()
       PRINT *, 'Thread',TID,':   A,B,X=',A,B,X
!$OMP END PARALLEL

       END

Output:

    1st Parallel Region:
    Thread 0 :   A,B,X= 0 0 1.000000000
    Thread 1 :   A,B,X= 1 1 2.099999905
    Thread 3 :   A,B,X= 3 3 4.300000191
    Thread 2 :   A,B,X= 2 2 3.200000048
    ************************************
    Master thread doing serial work here
    ************************************
    2nd Parallel Region:
    Thread 0 :   A,B,X= 0 0 1.000000000
    Thread 2 :   A,B,X= 2 0 3.200000048
    Thread 3 :   A,B,X= 3 0 4.300000191
    Thread 1 :   A,B,X= 1 0 2.099999905

C/C++ - threadprivate Directive Example

#include <omp.h>
#include <stdio.h>

int a, b, i, tid;
float x;

#pragma omp threadprivate(a, x)

main(int argc, char *argv[]) {

  /* Explicitly turn off dynamic threads */
  omp_set_dynamic(0);

  printf("1st Parallel Region:\n");
  #pragma omp parallel private(b,tid)
  {
    tid = omp_get_thread_num();
    a = tid;
    b = tid;
    x = 1.1 * tid + 1.0;
    printf("Thread %d:   a,b,x= %d %d %f\n",tid,a,b,x);
  }  /* end of parallel region */

  printf("************************************\n");
  printf("Master thread doing serial work here\n");
  printf("************************************\n");

  printf("2nd Parallel Region:\n");
  #pragma omp parallel private(tid)
  {
    tid = omp_get_thread_num();
    printf("Thread %d:   a,b,x= %d %d %f\n",tid,a,b,x);
  }  /* end of parallel region */

}

Output:

    1st Parallel Region:
    Thread 0:   a,b,x= 0 0 1.000000
    Thread 2:   a,b,x= 2 2 3.200000
    Thread 3:   a,b,x= 3 3 4.300000
    Thread 1:   a,b,x= 1 1 2.100000
    ************************************
    Master thread doing serial work here
    ************************************
    2nd Parallel Region:
    Thread 0:   a,b,x= 0 0 1.000000
    Thread 3:   a,b,x= 3 0 4.300000
    Thread 1:   a,b,x= 1 0 2.100000
    Thread 2:   a,b,x= 2 0 3.200000

On first entry to a parallel region, data in THREADPRIVATE variables and common blocks should be assumed undefined, unless
a COPYIN clause is specified in the PARALLEL directive
THREADPRIVATE variables differ from PRIVATE variables (discussed later) because they are able to persist between different
parallel regions of a code.

Restrictions:
Data in THREADPRIVATE objects is guaranteed to persist only if the dynamic threads mechanism is "turned off" and the number
of threads in different parallel regions remains constant. The default setting of dynamic threads is undefined.
The THREADPRIVATE directive must appear after every declaration of a thread private variable/common block.
Fortran: only named common blocks can be made THREADPRIVATE.

OpenMP Directives
Data Scope Attribute Clauses
Also called Data-sharing Attribute Clauses
An important consideration for OpenMP programming is the understanding and use of data scoping
Because OpenMP is based upon the shared memory programming model, most variables are shared by default


Global variables include:


Fortran: COMMON blocks, SAVE variables, MODULE variables
C: File scope variables, static
Private variables include:
Loop index variables
Stack variables in subroutines called from parallel regions
Fortran: Automatic variables within a statement block
The OpenMP Data Scope Attribute Clauses are used to explicitly define how variables should be scoped. They include:
PRIVATE
FIRSTPRIVATE
LASTPRIVATE
SHARED
DEFAULT
REDUCTION
COPYIN
Data Scope Attribute Clauses are used in conjunction with several directives (PARALLEL, DO/for, and SECTIONS) to control the
scoping of enclosed variables.
These constructs provide the ability to control the data environment during execution of parallel constructs.
They define how and which data variables in the serial section of the program are transferred to the parallel regions of the
program (and back)
They define which variables will be visible to all threads in the parallel regions and which variables will be privately allocated
to all threads.
Data Scope Attribute Clauses are effective only within their lexical/static extent.

Important: Please consult the latest OpenMP specs for important details and discussion on this topic.
A Clauses / Directives Summary Table is provided for convenience.

PRIVATE Clause
Purpose:
The PRIVATE clause declares variables in its list to be private to each thread.

Format:
Fortran PRIVATE (list)
C/C++ private (list)
Notes:
PRIVATE variables behave as follows:
A new object of the same type is declared once for each thread in the team
All references to the original object are replaced with references to the new object
Variables declared PRIVATE should be assumed to be uninitialized for each thread
Comparison between PRIVATE and THREADPRIVATE:

                   PRIVATE                                     THREADPRIVATE

Data Item          C/C++: variable                             C/C++: variable
                   Fortran: variable or common block           Fortran: common block

Where Declared     At start of region or work-sharing group    In declarations of each routine
                                                               using block or global file scope

Persistent?        No                                          Yes

Extent             Lexical only - unless passed as an          Dynamic
                   argument to subroutine

Initialized        Use FIRSTPRIVATE                            Use COPYIN

SHARED Clause
Purpose:
The SHARED clause declares variables in its list to be shared among all threads in the team.

Format:
Fortran SHARED (list)
C/C++ shared (list)
Notes:
A shared variable exists in only one memory location and all threads can read or write to that address
It is the programmer's responsibility to ensure that multiple threads properly access SHARED variables (such as via CRITICAL
sections)

DEFAULT Clause
Purpose:
The DEFAULT clause allows the user to specify a default scope for all variables in the lexical extent of any parallel region.

Format:

    Fortran:   DEFAULT (PRIVATE | FIRSTPRIVATE | SHARED | NONE)

    C/C++:     default (shared | none)

Notes:
Specific variables can be exempted from the default using the PRIVATE, SHARED, FIRSTPRIVATE, LASTPRIVATE, and
REDUCTION clauses
The C/C++ OpenMP specification does not include private or firstprivate as a possible default. However, actual implementations
may provide this option.
Using NONE as a default requires that the programmer explicitly scope all variables.

Restrictions:
Only one DEFAULT clause can be specified on a PARALLEL directive

FIRSTPRIVATE Clause
Purpose:
The FIRSTPRIVATE clause combines the behavior of the PRIVATE clause with automatic initialization of the variables in its list.

Format:
Fortran FIRSTPRIVATE (list)
C/C++ firstprivate (list)


Notes:
Listed variables are initialized according to the value of their original objects prior to entry into the parallel or work-sharing
construct.
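
A small C sketch (illustrative only) showing the initialization behavior:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int offset = 100;                  /* set before the parallel region */

        #pragma omp parallel firstprivate(offset)
        {
            /* Each thread starts with its own copy of offset, initialized to 100 */
            offset += omp_get_thread_num();
            printf("thread %d: offset = %d\n", omp_get_thread_num(), offset);
        }

        /* The original variable is unchanged (still 100) after the region */
        printf("after region: offset = %d\n", offset);
        return 0;
    }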

LASTPRIVATE Clause
Purpose:
The LASTPRIVATE clause combines the behavior of the PRIVATE clause with a copy from the last loop iteration or section to the
original variable object.

Format:
Fortran LASTPRIVATE (list)
C/C++ lastprivate (list)
Notes:
The value copied back into the original variable object is obtained from the last (sequentially) iteration or section of the enclosing
construct.
For example, the team member which executes the final iteration for a DO section, or the team member which does the last
SECTION of a SECTIONS context performs the copy with its own values
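
For example, a minimal C sketch (illustrative only) where the value from the sequentially last iteration (i = 99) is copied back:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int i, last;

        #pragma omp parallel for lastprivate(last)
        for (i = 0; i < 100; i++)
            last = i * 2;                 /* every iteration writes its private copy */

        printf("last = %d\n", last);      /* prints 198, the value from i = 99 */
        return 0;
    }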

COPYIN Clause
Purpose:
The COPYIN clause provides a means for assigning the same value to THREADPRIVATE variables for all threads in the team.

Format:
    Fortran:   COPYIN (list)

    C/C++:     copyin (list)

Notes:
List contains the names of variables to copy. In Fortran, the list can contain both the names of common blocks and named
variables.
The master thread variable is used as the copy source. The team threads are initialized with its value upon entry into the parallel
construct.
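
A small C sketch (illustrative only) combining THREADPRIVATE with COPYIN; every thread's copy of counter starts from the master thread's value:

    #include <omp.h>
    #include <stdio.h>

    int counter = 10;
    #pragma omp threadprivate(counter)

    int main(void) {
        counter = 50;                          /* set by the master thread       */

        #pragma omp parallel copyin(counter)   /* each thread's copy starts at 50 */
        {
            counter += omp_get_thread_num();
            printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
        }
        return 0;
    }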

COPYPRIVATE Clause
Purpose:
The COPYPRIVATE clause can be used to broadcast values acquired by a single thread directly to all instances of the private
variables in the other threads.
Associated with the SINGLE directive
See the most recent OpenMP specs document for additional discussion and examples.

Format:
    Fortran:   COPYPRIVATE (list)

    C/C++:     copyprivate (list)
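
A minimal C sketch (illustrative only): one thread obtains a value inside a SINGLE block, and copyprivate broadcasts it to the private copies held by the other threads:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        int value;

        #pragma omp parallel private(value)
        {
            /* One thread sets value; copyprivate broadcasts it to all threads */
            #pragma omp single copyprivate(value)
            value = 42;                        /* e.g. read from input by one thread */

            printf("thread %d sees value = %d\n", omp_get_thread_num(), value);
        }
        return 0;
    }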


REDUCTION Clause
Purpose:
The REDUCTION clause performs a reduction on the variables that appear in its list.
A private copy for each list variable is created for each thread. At the end of the reduction, the reduction operation is applied to all private copies of the shared variable, and the final result is written to the global shared variable.

Format:
Fortran REDUCTION (operator|intrinsic: list)
C/C++ reduction (operator: list)
Example: REDUCTION - Vector Dot Product:
Iterations of the parallel loop will be distributed in equal sized blocks to each thread in the team (SCHEDULE STATIC)
At the end of the parallel loop construct, all threads will add their values of "result" to update the master thread's global copy.

Fortran - REDUCTION Clause Example


       PROGRAM DOT_PRODUCT

       INTEGER N, CHUNKSIZE, CHUNK, I
       PARAMETER (N=100)
       PARAMETER (CHUNKSIZE=10)
       REAL A(N), B(N), RESULT

!      Some initializations
       DO I = 1, N
         A(I) = I * 1.0
         B(I) = I * 2.0
       ENDDO
       RESULT= 0.0
       CHUNK = CHUNKSIZE

!$OMP  PARALLEL DO
!$OMP& DEFAULT(SHARED) PRIVATE(I)
!$OMP& SCHEDULE(STATIC,CHUNK)
!$OMP& REDUCTION(+:RESULT)

       DO I = 1, N
         RESULT = RESULT + (A(I) * B(I))
       ENDDO

!$OMP  END PARALLEL DO

       PRINT *, 'Final Result= ', RESULT
       END

C / C++ - reduction Clause Example


#include <omp.h>
#include <stdio.h>

int main (int argc, char *argv[])  {

int   i, n, chunk;
float a[100], b[100], result;

/* Some initializations */
n = 100;
chunk = 10;
result = 0.0;
for (i=0; i < n; i++) {
  a[i] = i * 1.0;
  b[i] = i * 2.0;
  }

#pragma omp parallel for      \
  default(shared) private(i)  \
  schedule(static,chunk)      \
  reduction(+:result)

  for (i=0; i < n; i++)
    result = result + (a[i] * b[i]);

printf("Final result= %f\n",result);

return 0;
}

Restrictions:
Variables in the list must be named scalar variables. They cannot be array or structure type variables. They must also be
declared SHARED in the enclosing context.
Reduction operations may not be associative for real numbers.
The REDUCTION clause is intended to be used on a region or work-sharing construct in which the reduction variable is used only
in statements which have one of the following forms:

Fortran:
    x = x operator expr
    x = expr operator x (except subtraction)
    x = intrinsic(x, expr)
    x = intrinsic(expr, x)

    x is a scalar variable in the list
    expr is a scalar expression that does not reference x
    intrinsic is one of MAX, MIN, IAND, IOR, IEOR
    operator is one of +, *, -, .AND., .OR., .EQV., .NEQV.

C / C++:
    x = x op expr
    x = expr op x (except subtraction)
    x binop= expr
    x++
    ++x
    x--
    --x

    x is a scalar variable in the list
    expr is a scalar expression that does not reference x
    op is not overloaded, and is one of +, *, -, /, &, ^, |, &&, ||
    binop is not overloaded, and is one of +, *, -, /, &, ^, |

OpenMP Directives
Clauses / Directives Summary
The table below summarizes which clauses are accepted by which OpenMP directives (an "x" marks an accepted clause).

Clause          PARALLEL   DO/for   SECTIONS   SINGLE   PARALLEL DO/for   PARALLEL SECTIONS
IF                 x                                           x                 x
PRIVATE            x          x        x          x            x                 x
SHARED             x                                           x                 x
DEFAULT            x                                           x                 x
FIRSTPRIVATE       x          x        x          x            x                 x
LASTPRIVATE                   x        x                       x                 x
REDUCTION          x          x        x                       x                 x
COPYIN             x                                           x                 x
COPYPRIVATE                                       x
SCHEDULE                      x                                x
ORDERED                       x                                x
NOWAIT                        x        x          x

The following OpenMP directives do not accept clauses:
MASTER
CRITICAL
BARRIER
ATOMIC
FLUSH
ORDERED
THREADPRIVATE

Implementations may (and do) differ from the standard in which clauses are supported by each directive.

OpenMP Directives
Directive Binding and Nesting Rules
This section is provided mainly as a quick reference on rules which govern OpenMP directives and binding. Users should
consult their implementation documentation and the OpenMP standard for other rules and restrictions.
Unless indicated otherwise, rules apply to both Fortran and C/C++ OpenMP implementations.
Note: the Fortran API also defines a number of Data Environment rules. Those have not been reproduced here.

Directive Binding:
The DO/for, SECTIONS, SINGLE, MASTER and BARRIER directives bind to the dynamically enclosing PARALLEL, if one exists.
If no parallel region is currently being executed, the directives have no effect.
The ORDERED directive binds to the dynamically enclosing DO/for.
The ATOMIC directive enforces exclusive access with respect to ATOMIC directives in all threads, not just the current team.
The CRITICAL directive enforces exclusive access with respect to CRITICAL directives in all threads, not just the current team.
A directive can never bind to any directive outside the closest enclosing PARALLEL.

Directive Nesting:
A worksharing region may not be closely nested inside a worksharing, explicit task, critical, ordered, atomic, or master region.
A barrier region may not be closely nested inside a worksharing, explicit task, critical, ordered, atomic, or master region.
A master region may not be closely nested inside a worksharing, atomic, or explicit task region.
An ordered region may not be closely nested inside a critical, atomic, or explicit task region.
An ordered region must be closely nested inside a loop region (or parallel loop region) with an ordered clause.
A critical region may not be nested (closely or otherwise) inside a critical region with the same name. Note that this restriction is
not sufficient to prevent deadlock.
parallel, flush, critical, atomic, taskyield, and explicit task regions may not be closely nested inside an atomic region.


Run-Time Library Routines


Overview:
The OpenMP API includes an ever-growing number of run-time library routines.
These routines are used for a variety of purposes as shown in the table below:

Routine                          Purpose

OMP_SET_NUM_THREADS              Sets the number of threads that will be used in the next parallel region
OMP_GET_NUM_THREADS              Returns the number of threads that are currently in the team executing the parallel region from which it is called
OMP_GET_MAX_THREADS              Returns the maximum value that can be returned by a call to the OMP_GET_NUM_THREADS function
OMP_GET_THREAD_NUM               Returns the thread number of the thread, within the team, making this call
OMP_GET_THREAD_LIMIT             Returns the maximum number of OpenMP threads available to a program
OMP_GET_NUM_PROCS                Returns the number of processors that are available to the program
OMP_IN_PARALLEL                  Used to determine if the section of code which is executing is parallel or not
OMP_SET_DYNAMIC                  Enables or disables dynamic adjustment (by the run time system) of the number of threads available for execution of parallel regions
OMP_GET_DYNAMIC                  Used to determine if dynamic thread adjustment is enabled or not
OMP_SET_NESTED                   Used to enable or disable nested parallelism
OMP_GET_NESTED                   Used to determine if nested parallelism is enabled or not
OMP_SET_SCHEDULE                 Sets the loop scheduling policy when "runtime" is used as the schedule kind in the OpenMP directive
OMP_GET_SCHEDULE                 Returns the loop scheduling policy when "runtime" is used as the schedule kind in the OpenMP directive
OMP_SET_MAX_ACTIVE_LEVELS        Sets the maximum number of nested parallel regions
OMP_GET_MAX_ACTIVE_LEVELS        Returns the maximum number of nested parallel regions
OMP_GET_LEVEL                    Returns the current level of nested parallel regions
OMP_GET_ANCESTOR_THREAD_NUM      Returns, for a given nested level of the current thread, the thread number of the ancestor thread
OMP_GET_TEAM_SIZE                Returns, for a given nested level of the current thread, the size of the thread team
OMP_GET_ACTIVE_LEVEL             Returns the number of nested, active parallel regions enclosing the task that contains the call
OMP_IN_FINAL                     Returns true if the routine is executed in the final task region; otherwise it returns false
OMP_INIT_LOCK                    Initializes a lock associated with the lock variable
OMP_DESTROY_LOCK                 Disassociates the given lock variable from any locks
OMP_SET_LOCK                     Acquires ownership of a lock
OMP_UNSET_LOCK                   Releases a lock
OMP_TEST_LOCK                    Attempts to set a lock, but does not block if the lock is unavailable
OMP_INIT_NEST_LOCK               Initializes a nested lock associated with the lock variable
OMP_DESTROY_NEST_LOCK            Disassociates the given nested lock variable from any locks
OMP_SET_NEST_LOCK                Acquires ownership of a nested lock
OMP_UNSET_NEST_LOCK              Releases a nested lock
OMP_TEST_NEST_LOCK               Attempts to set a nested lock, but does not block if the lock is unavailable
OMP_GET_WTIME                    Provides a portable wall clock timing routine
OMP_GET_WTICK                    Returns a double-precision floating point value equal to the number of seconds between successive clock ticks

For C/C++, all of the run-time library routines are actual functions. For Fortran, some are functions and some are
subroutines. For example:

Fortran INTEGER FUNCTION OMP_GET_NUM_THREADS()


C/C++ #include <omp.h>
int omp_get_num_threads(void)
Note that for C/C++, you usually need to include the <omp.h> header file.
Fortran routines are not case sensitive, but C/C++ routines are.
For the Lock routines/functions:
The lock variable must be accessed only through the locking routines
For Fortran, the lock variable should be of type integer and of a kind large enough to hold an address.
For C/C++, the lock variable must have type omp_lock_t or type omp_nest_lock_t, depending on the function being
used.
Implementation notes:
Implementations may or may not support all OpenMP API features. For example, if nested parallelism is supported, it may
be only nominal, in that a nested parallel region may only have one thread.
Consult your implementation's documentation for details - or experiment and find out for yourself if you can't find it in the
documentation.
The run-time library routines are discussed in more detail in Appendix A.
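As a brief illustration of the lock routines listed above (a sketch added here, not an LC example), the usual sequence is init / set / unset / destroy, which serializes updates to shared data:

C / C++ - lock routines sketch

   #include <omp.h>
   #include <stdio.h>

   int main(void)  {
   omp_lock_t lock;
   int total = 0;

   omp_init_lock(&lock);              /* the lock starts out unlocked */

   #pragma omp parallel
   {
   /* Only one thread at a time can hold the lock, so the update to the
      shared counter is serialized.                                     */
   omp_set_lock(&lock);
   total = total + omp_get_thread_num();
   omp_unset_lock(&lock);
   }

   omp_destroy_lock(&lock);
   printf("total = %d\n", total);
   return 0;
   }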

Environment Variables
OpenMP provides the following environment variables for controlling the execution of parallel code.
All environment variable names are uppercase. The values assigned to them are not case sensitive.

OMP_SCHEDULE
Applies only to DO, PARALLEL DO (Fortran) and for, parallel for (C/C++) directives which have their schedule clause
set to RUNTIME. The value of this variable determines how iterations of the loop are scheduled on processors. For example:
setenv OMP_SCHEDULE "guided, 4"
setenv OMP_SCHEDULE "dynamic"

OMP_NUM_THREADS
Sets the maximum number of threads to use during execution. For example:
setenv OMP_NUM_THREADS 8

OMP_DYNAMIC
Enables or disables dynamic adjustment of the number of threads available for execution of parallel regions. Valid values are
TRUE or FALSE. For example:
setenv OMP_DYNAMIC TRUE

OMP_PROC_BIND
Enables or disables threads binding to processors. Valid values are TRUE or FALSE. For example:
setenv OMP_PROC_BIND TRUE

OMP_NESTED
Enables or disables nested parallelism. Valid values are TRUE or FALSE. For example:
setenv OMP_NESTED TRUE

OMP_STACKSIZE
Controls the size of the stack for created (non-Master) threads. Examples:
setenv OMP_STACKSIZE 2000500B
setenv OMP_STACKSIZE "3000 k "
setenv OMP_STACKSIZE 10M
setenv OMP_STACKSIZE " 10 M "
setenv OMP_STACKSIZE "20 m "
setenv OMP_STACKSIZE " 1G"
setenv OMP_STACKSIZE 20000

OMP_WAIT_POLICY
Provides a hint to an OpenMP implementation about the desired behavior of waiting threads. A compliant OpenMP
implementation may or may not abide by the setting of the environment variable. Valid values are ACTIVE and PASSIVE.
ACTIVE specifies that waiting threads should mostly be active, i.e., consume processor cycles, while waiting. PASSIVE specifies
that waiting threads should mostly be passive, i.e., not consume processor cycles, while waiting. The details of the ACTIVE and
PASSIVE behaviors are implementation defined. Examples:
setenv OMP_WAIT_POLICY ACTIVE
setenv OMP_WAIT_POLICY active
setenv OMP_WAIT_POLICY PASSIVE
setenv OMP_WAIT_POLICY passive

OMP_MAX_ACTIVE_LEVELS
Controls the maximum number of nested active parallel regions. The value of this environment variable must be a non-negative
integer. The behavior of the program is implementation defined if the requested value of OMP_MAX_ACTIVE_LEVELS is
greater than the maximum number of nested active parallel levels an implementation can support, or if the value is not a nonnegative integer. Example:
setenv OMP_MAX_ACTIVE_LEVELS 2

OMP_THREAD_LIMIT
Sets the number of OpenMP threads to use for the whole OpenMP program. The value of this environment variable must be a
positive integer. The behavior of the program is implementation defined if the requested value of OMP_THREAD_LIMIT is
greater than the number of threads an implementation can support, or if the value is not a positive integer. Example:
setenv OMP_THREAD_LIMIT 8

Thread Stack Size and Thread Binding


Thread Stack Size:
The OpenMP standard does not specify how much stack space a thread should have. Consequently, implementations will differ in
the default thread stack size.
Default thread stack size can be easy to exhaust. It can also be non-portable between compilers. Using past versions of LC
compilers as an example:

Compiler                Approx. Stack Limit     Approx. Array Size (doubles)
Linux icc, ifort        4 MB                    700 x 700
Linux pgcc, pgf90       8 MB                    1000 x 1000
Linux gcc, gfortran     2 MB                    500 x 500

Threads that exceed their stack allocation may or may not seg fault. An application may continue to run while data is being
corrupted.


Statically linked codes may be subject to further stack restrictions.


A user's login shell may also restrict stack size
If your OpenMP environment supports the OpenMP 3.0 OMP_STACKSIZE environment variable (covered in previous section),
you can use it to set the thread stack size prior to program execution. For example:
setenv OMP_STACKSIZE 2000500B
setenv OMP_STACKSIZE "3000 k "
setenv OMP_STACKSIZE 10M
setenv OMP_STACKSIZE " 10 M "
setenv OMP_STACKSIZE "20 m "
setenv OMP_STACKSIZE " 1G"
setenv OMP_STACKSIZE 20000

Otherwise, at LC, you should be able to use the method below for Linux clusters. The example shows setting the thread stack size
to 12 MB, and as a precaution, setting the shell stack size to unlimited.

csh/tcsh        setenv KMP_STACKSIZE 12000000
                limit stacksize unlimited

ksh/sh/bash     export KMP_STACKSIZE=12000000
                ulimit -s unlimited

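A small illustrative sketch of how thread stacks get exhausted: each thread below places a roughly 8 MB automatic array on its own stack, which exceeds several of the default limits listed above. Raising OMP_STACKSIZE (for example to 16M) before running should avoid the overflow; the array size and names are chosen only for illustration.

C / C++ - thread stack usage sketch

   #include <omp.h>
   #include <stdio.h>

   /* Each thread gets its own copy of this automatic array on its stack:
      1000 x 1000 doubles is about 8 MB, more than some default thread stacks. */
   void work(void)  {
   double a[1000][1000];
   a[999][999] = (double) omp_get_thread_num();
   printf("thread %d wrote %f\n", omp_get_thread_num(), a[999][999]);
   }

   int main(void)  {
   #pragma omp parallel
   work();
   return 0;
   }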

Thread Binding:

In some cases, a program will perform better if its threads are bound to processors/cores.
"Binding" a thread to a processor means that a thread will be scheduled by the operating system to always run on a the same
processor. Otherwise, threads can be scheduled to execute on any processor and "bounce" back and forth between processors
with each time slice.
Also called "thread affinity" or "processor affinity"
Binding threads to processors can result in better cache utilization, thereby reducing costly memory accesses. This is the primary
motivation for binding threads to processors.
Depending upon your platform, operating system, compiler and OpenMP implementation, binding threads to processors can be
done several different ways.
The OpenMP version 3.1 API provides an environment variable to turn processor binding "on" or "off". For example:
setenv OMP_PROC_BIND TRUE
setenv OMP_PROC_BIND FALSE

At a higher level, processes can also be bound to processors.


Detailed information about process and thread binding to processors on LC Linux clusters can be found HERE.

Monitoring, Debugging and Performance Analysis Tools for OpenMP


Monitoring and Debugging Threads:
Debuggers vary in their ability to handle threads. The TotalView debugger is LC's recommended debugger for parallel programs.
It is well suited for both monitoring and debugging threaded programs.
An example screenshot from a TotalView session using an OpenMP code is shown below.
1. Master thread Stack Trace Pane showing original routine
2. Process/thread status bars differentiating threads
3. Master thread Stack Frame Pane showing shared variables
4. Worker thread Stack Trace Pane showing outlined routine.
5. Worker thread Stack Frame Pane
6. Root Window showing all threads
7. Threads Pane showing all threads plus selected thread


See the TotalView Debugger tutorial for details.


The Linux ps command provides several flags for viewing thread information. Some examples are shown below. See the man
page for details.
% ps -Lf
UID     PID    PPID   LWP    C NLWP STIME TTY         TIME CMD
blaise  22529  28240  22529  0    5 11:31 pts/53  00:00:00 a.out
blaise  22529  28240  22530 99    5 11:31 pts/53  00:01:24 a.out
blaise  22529  28240  22531 99    5 11:31 pts/53  00:01:24 a.out
blaise  22529  28240  22532 99    5 11:31 pts/53  00:01:24 a.out
blaise  22529  28240  22533 99    5 11:31 pts/53  00:01:24 a.out

% ps -T
  PID  SPID TTY          TIME CMD
22529 22529 pts/53   00:00:00 a.out
22529 22530 pts/53   00:01:49 a.out
22529 22531 pts/53   00:01:49 a.out
22529 22532 pts/53   00:01:49 a.out
22529 22533 pts/53   00:01:49 a.out

% ps -Lm
  PID   LWP TTY          TIME CMD
22529     - pts/53   00:18:56 a.out
    - 22529 -        00:00:00 -
    - 22530 -        00:04:44 -
    - 22531 -        00:04:44 -
    - 22532 -        00:04:44 -
    - 22533 -        00:04:44 -

LC's Linux clusters also provide the top command to monitor processes on a node. If used with the -H flag, the threads
contained within a process will be visible. An example of the top -H command is shown below. The parent process is PID
18010 which spawned three threads, shown as PIDs 18012, 18013 and 18014.

Performance Analysis Tools:


There are a variety of performance analysis tools that can be used with OpenMP programs. Searching the web will turn up a
wealth of information.
At LC, the list of supported computing tools can be found at: computing.llnl.gov/code/content/software_tools.php.
These tools vary significantly in their complexity, functionality and learning curve. Covering them in detail is beyond the scope of
this tutorial.
Some tools worth investigating, specifically for OpenMP codes, include:
Open|SpeedShop
TAU
PAPI
Intel VTune Amplifier
ThreadSpotter

OpenMP Exercise 3
Assorted
Overview:
Login to the workshop cluster, if you are not already logged in
Orphaned directive example: review, compile, run
Get OpenMP implementation environment information
Check out the "bug" programs

GO TO THE EXERCISE HERE

This completes the tutorial.


Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it
at the end of the exercise.


References and More Information


Author: Blaise Barney, Livermore Computing.
The OpenMP web site, which includes the C/C++ and Fortran Application Program Interface documents.
www.openmp.org

Appendix A: Run-Time Library Routines

OMP_SET_NUM_THREADS
Purpose:
Sets the number of threads that will be used in the next parallel region. Must be a positive integer.

Format:
Fortran SUBROUTINE OMP_SET_NUM_THREADS(scalar_integer_expression)
#include <omp.h>

C/C++ void omp_set_num_threads(int num_threads)


Notes & Restrictions:
The dynamic threads mechanism modifies the effect of this routine.
Enabled: specifies the maximum number of threads that can be used for any parallel region by the dynamic threads
mechanism.
Disabled: specifies exact number of threads to use until next call to this routine.
This routine can only be called from the serial portions of the code
This call has precedence over the OMP_NUM_THREADS environment variable

OMP_GET_NUM_THREADS
Purpose:
Returns the number of threads that are currently in the team executing the parallel region from which it is called.

Format:
Fortran INTEGER FUNCTION OMP_GET_NUM_THREADS()
#include <omp.h>

C/C++ int omp_get_num_threads(void)


Notes & Restrictions:
If this call is made from a serial portion of the program, or a nested parallel region that is serialized, it will return 1.
The default number of threads is implementation dependent.

OMP_GET_MAX_THREADS
Purpose:
Returns the maximum value that can be returned by a call to the OMP_GET_NUM_THREADS function.

Fortran INTEGER FUNCTION OMP_GET_MAX_THREADS()


#include <omp.h>

C/C++ int omp_get_max_threads(void)


Notes & Restrictions:
Generally reflects the number of threads as set by the OMP_NUM_THREADS environment variable or the
OMP_SET_NUM_THREADS() library routine.
May be called from both serial and parallel regions of code.

OMP_GET_THREAD_NUM
Purpose:
Returns the thread number of the thread, within the team, making this call. This number will be between 0 and
OMP_GET_NUM_THREADS-1. The master thread of the team is thread 0.

Format:
Fortran INTEGER FUNCTION OMP_GET_THREAD_NUM()
#include <omp.h>

C/C++ int omp_get_thread_num(void)


Notes & Restrictions:
If called from a nested parallel region, or a serial region, this function will return 0.

Examples:
Example 1 is the correct way to determine the number of threads in a parallel region.
Example 2 is incorrect - the TID variable must be PRIVATE
Example 3 is incorrect - the OMP_GET_THREAD_NUM call is outside the parallel region


Fortran - determining the number of threads in a parallel region


Example 1: Correct
PROGRAM HELLO
INTEGER TID, OMP_GET_THREAD_NUM
!$OMP PARALLEL PRIVATE(TID)
TID = OMP_GET_THREAD_NUM()
PRINT *, 'Hello World from thread = ', TID
...
!$OMP END PARALLEL
END
Example 2: Incorrect
PROGRAM HELLO
INTEGER TID, OMP_GET_THREAD_NUM
!$OMP PARALLEL
TID = OMP_GET_THREAD_NUM()
PRINT *, 'Hello World from thread = ', TID
...
!$OMP END PARALLEL
END
Example 3: Incorrect
PROGRAM HELLO
INTEGER TID, OMP_GET_THREAD_NUM
TID = OMP_GET_THREAD_NUM()
PRINT *, 'Hello World from thread = ', TID
!$OMP PARALLEL
...
!$OMP END PARALLEL
END

OMP_GET_THREAD_LIMIT
Purpose:
Returns the maximum number of OpenMP threads available to a program.

Format:
Fortran INTEGER FUNCTION OMP_GET_THREAD_LIMIT()
#include <omp.h>

C/C++ int omp_get_thread_limit (void)


Notes:
Also see the OMP_THREAD_LIMIT environment variable.


OMP_GET_NUM_PROCS
Purpose:
Returns the number of processors that are available to the program.

Format:
Fortran INTEGER FUNCTION OMP_GET_NUM_PROCS()
#include <omp.h>

C/C++ int omp_get_num_procs(void)

OMP_IN_PARALLEL
Purpose:
May be called to determine if the section of code which is executing is parallel or not.

Format:
Fortran LOGICAL FUNCTION OMP_IN_PARALLEL()
#include <omp.h>

C/C++ int omp_in_parallel(void)


Notes & Restrictions:
For Fortran, this function returns .TRUE. if it is called from the dynamic extent of a region executing in parallel, and .FALSE.
otherwise. For C/C++, it will return a non-zero integer if parallel, and zero otherwise.

OMP_SET_DYNAMIC
Purpose:
Enables or disables dynamic adjustment (by the run time system) of the number of threads available for execution of parallel
regions.

Format:
Fortran SUBROUTINE OMP_SET_DYNAMIC(scalar_logical_expression)
#include <omp.h>

C/C++ void omp_set_dynamic(int dynamic_threads)


Notes & Restrictions:
For Fortran, if called with .TRUE. then the number of threads available for subsequent parallel regions can be adjusted
automatically by the run-time environment. If called with .FALSE., dynamic adjustment is disabled.
For C/C++, if dynamic_threads evaluates to non-zero, then the mechanism is enabled, otherwise it is disabled.
The OMP_SET_DYNAMIC subroutine has precedence over the OMP_DYNAMIC environment variable.
The default setting is implementation dependent.
Must be called from a serial section of the program.

OMP_GET_DYNAMIC
Purpose:
Used to determine if dynamic thread adjustment is enabled or not.

Format:
Fortran LOGICAL FUNCTION OMP_GET_DYNAMIC()

#include <omp.h>

C/C++ int omp_get_dynamic(void)


Notes & Restrictions:
For Fortran, this function returns .TRUE. if dynamic thread adjustment is enabled, and .FALSE. otherwise.
For C/C++, non-zero will be returned if dynamic thread adjustment is enabled, and zero otherwise.

OMP_SET_NESTED
Purpose:
Used to enable or disable nested parallelism.

Format:
Fortran SUBROUTINE OMP_SET_NESTED(scalar_logical_expression)
#include <omp.h>

C/C++ void omp_set_nested(int nested)


Notes & Restrictions:
For Fortran, calling this function with .FALSE. will disable nested parallelism, and calling with .TRUE. will enable it.
For C/C++, if nested evaluates to non-zero, nested parallelism is enabled; otherwise it is disabled.
The default is for nested parallelism to be disabled.
This call has precedence over the OMP_NESTED environment variable

OMP_GET_NESTED
Purpose:
Used to determine if nested parallelism is enabled or not.

Format:
Fortran LOGICAL FUNCTION OMP_GET_NESTED()
#include <omp.h>

C/C++ int omp_get_nested (void)


Notes & Restrictions:
For Fortran, this function returns .TRUE. if nested parallelism is enabled, and .FALSE. otherwise.
For C/C++, non-zero will be returned if nested parallelism is enabled, and zero otherwise.

OMP_SET_SCHEDULE
Purpose:
This routine sets the schedule type that is applied when the loop directive specifies a runtime schedule.

Format:
Fortran SUBROUTINE OMP_SET_SCHEDULE(KIND, MODIFIER)
        INTEGER (KIND=OMP_SCHED_KIND) KIND
        INTEGER MODIFIER

C/C++   #include <omp.h>
        void omp_set_schedule(omp_sched_t kind, int modifier)


OMP_GET_SCHEDULE
Purpose:
This routine returns the schedule that is applied when the loop directive specifies a runtime schedule.

Format:
Fortran SUBROUTINE OMP_GET_SCHEDULE(KIND, MODIFIER)
        INTEGER (KIND=OMP_SCHED_KIND) KIND
        INTEGER MODIFIER

C/C++   #include <omp.h>
        void omp_get_schedule(omp_sched_t * kind, int * modifier)

OMP_SET_MAX_ACTIVE_LEVELS
Purpose:
This routine limits the number of nested active parallel regions.

Format:
SUBROUTINE OMP_SET_MAX_ACTIVE_LEVELS (MAX_LEVELS)

Fortran INTEGER MAX_LEVELS


#include <omp.h>

C/C++ void omp_set_max_active_levels (int max_levels)


Notes & Restrictions:
If the number of parallel levels requested exceeds the number of levels of parallelism supported by the implementation, the value
will be set to the number of parallel levels supported by the implementation.
This routine has the described effect only when called from the sequential part of the program. When called from within an explicit
parallel region, the effect of this routine is implementation defined.

OMP_GET_MAX_ACTIVE_LEVELS
Purpose:
This routine returns the maximum number of nested active parallel regions.

Format:
Fortran INTEGER FUNCTION OMP_GET_MAX_ACTIVE_LEVELS()
#include <omp.h>

C/C++ int omp_get_max_active_levels(void)

OMP_GET_LEVEL
Purpose:
This routine returns the number of nested parallel regions enclosing the task that contains the call.

Format:
Fortran INTEGER FUNCTION OMP_GET_LEVEL()
#include <omp.h>

C/C++ int omp_get_level(void)


Notes & Restrictions:
The omp_get_level routine returns the number of nested parallel regions (whether active or inactive) enclosing the task that
contains the call, not including the implicit parallel region. The routine always returns a non-negative integer, and returns 0 if it is
called from the sequential part of the program.

OMP_GET_ANCESTOR_THREAD_NUM
Purpose:
This routine returns, for a given nested level of the current thread, the thread number of the ancestor or the current thread.

Format:
INTEGER FUNCTION OMP_GET_ANCESTOR_THREAD_NUM(LEVEL)

Fortran INTEGER LEVEL

#include <omp.h>

C/C++ int omp_get_ancestor_thread_num(int level)


Notes & Restrictions:
If the requested nest level is outside the range of 0 and the nest level of the current thread, as returned by the omp_get_level
routine, the routine returns -1.

OMP_GET_TEAM_SIZE
Purpose:
This routine returns, for a given nested level of the current thread, the size of the thread team to which the ancestor or the current
thread belongs.

Format:
INTEGER FUNCTION OMP_GET_TEAM_SIZE(LEVEL)

Fortran INTEGER LEVEL

#include <omp.h>

C/C++ int omp_get_team_size(int level);


Notes & Restrictions:
If the requested nested level is outside the range of 0 and the nested level of the current thread, as returned by the omp_get_level
routine, the routine returns -1. Inactive parallel regions are regarded like active parallel regions executed with one thread.

OMP_GET_ACTIVE_LEVEL
Purpose:
The omp_get_active_level routine returns the number of nested, active parallel regions enclosing the task that contains the call.

Format:
Fortran INTEGER FUNCTION OMP_GET_ACTIVE_LEVEL()
#include <omp.h>

C/C++ int omp_get_active_level(void);


Notes & Restrictions:
The routine always returns a nonnegative integer, and returns 0 if it is called from the sequential part of the program.

OMP_IN_FINAL
Purpose:
This routine returns true if the routine is executed in a final task region; otherwise, it returns false.

Format:
Fortran LOGICAL FUNCTION OMP_IN_FINAL()
#include <omp.h>

C/C++ int omp_in_final(void)

OMP_INIT_LOCK
OMP_INIT_NEST_LOCK
Purpose:
This subroutine initializes a lock associated with the lock variable.

Format:
SUBROUTINE OMP_INIT_LOCK(var)

Fortran SUBROUTINE OMP_INIT_NEST_LOCK(var)


#include <omp.h>

C/C++ void omp_init_lock(omp_lock_t *lock)


void omp_init_nest_lock(omp_nest_lock_t *lock)

Notes & Restrictions:


The initial state is unlocked
For Fortran, var must be an integer large enough to hold an address, such as INTEGER*8 on 64-bit systems.

OMP_DESTROY_LOCK
OMP_DESTROY_NEST_LOCK
Purpose:
This subroutine disassociates the given lock variable from any locks.

Format:
SUBROUTINE OMP_DESTROY_LOCK(var)

Fortran SUBROUTINE OMP_DESTROY_NEST_LOCK(var)


#include <omp.h>

C/C++ void omp_destroy_lock(omp_lock_t *lock)


void omp_destroy_nest_lock(omp_nest_lock_t *lock)

Notes & Restrictions:


It is illegal to call this routine with a lock variable that is not initialized.
For Fortran, var must be an integer large enough to hold an address, such as INTEGER*8 on 64-bit systems.

OMP_SET_LOCK
OMP_SET_NEST_LOCK
Purpose:
This subroutine forces the executing thread to wait until the specified lock is available. A thread is granted ownership of a lock
when it becomes available.

Format:
Fortran SUBROUTINE OMP_SET_LOCK(var)
        SUBROUTINE OMP_SET_NEST_LOCK(var)

C/C++   #include <omp.h>
        void omp_set_lock(omp_lock_t *lock)
        void omp_set_nest_lock(omp_nest_lock_t *lock)

Notes & Restrictions:


It is illegal to call this routine with a lock variable that is not initialized.
For Fortran, var must be an integer large enough to hold an address, such as INTEGER*8 on 64-bit systems.

OMP_UNSET_LOCK
OMP_UNSET_NEST_LOCK
Purpose:
This subroutine releases the lock from the executing thread.

Format:
Fortran SUBROUTINE OMP_UNSET_LOCK(var)
        SUBROUTINE OMP_UNSET_NEST_LOCK(var)

C/C++   #include <omp.h>
        void omp_unset_lock(omp_lock_t *lock)
        void omp_unset_nest_lock(omp_nest_lock_t *lock)

Notes & Restrictions:


It is illegal to call this routine with a lock variable that is not initialized.
For Fortran, var must be an integer large enough to hold an address, such as INTEGER*8 on 64-bit systems.

OMP_TEST_LOCK
OMP_TEST_NEST_LOCK
Purpose:
This subroutine attempts to set a lock, but does not block if the lock is unavailable.

Format:
Fortran LOGICAL FUNCTION OMP_TEST_LOCK(var)
        INTEGER FUNCTION OMP_TEST_NEST_LOCK(var)

C/C++   #include <omp.h>
        int omp_test_lock(omp_lock_t *lock)
        int omp_test_nest_lock(omp_nest_lock_t *lock)

Notes & Restrictions:


For Fortran, .TRUE. is returned if the lock was set successfully, otherwise .FALSE. is returned.
For Fortran, var must be an integer large enough to hold an address, such as INTEGER*8 on 64-bit systems.
For C/C++, non-zero is returned if the lock was set successfully, otherwise zero is returned.
It is illegal to call this routine with a lock variable that is not initialized.

OMP_GET_WTIME
Purpose:
Provides a portable wall clock timing routine
Returns a double-precision floating point value equal to the number of elapsed seconds since some point in the past. Usually
used in "pairs" with the value of the first call subtracted from the value of the second call to obtain the elapsed time for a block of
code.
Designed to be "per thread" times, and therefore may not be globally consistent across all threads in a team - depends upon
what a thread is doing compared to other threads.

Format:
Fortran DOUBLE PRECISION FUNCTION OMP_GET_WTIME()
#include <omp.h>

C/C++ double omp_get_wtime(void)
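A short usage sketch (added for illustration): the routine is typically called in pairs, with the difference between the two readings giving the elapsed wall-clock time for the enclosed block:

C / C++ - omp_get_wtime sketch

   #include <omp.h>
   #include <stdio.h>

   int main(void)  {
   double start, elapsed, sum = 0.0;
   long i;

   start = omp_get_wtime();                 /* first reading */

   #pragma omp parallel for reduction(+:sum)
   for (i=0; i < 10000000; i++)
     sum = sum + 1.0 / (double)(i + 1);

   elapsed = omp_get_wtime() - start;       /* difference of the pair = elapsed seconds */
   printf("sum = %f  elapsed = %f s  tick = %g s\n", sum, elapsed, omp_get_wtick());
   return 0;
   }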



OMP_GET_WTICK
Purpose:
Provides a portable wall clock timing routine
Returns a double-precision floating point value equal to the number of seconds between successive clock ticks.

Format:
Fortran DOUBLE PRECISION FUNCTION OMP_GET_WTICK()
#include <omp.h>

C/C++ double omp_get_wtick(void)


https://computing.llnl.gov/tutorials/openMP/
Last Modified: 06/30/2016 15:25:18 blaiseb@llnl.gov
UCRL-MI-133316



POSIX Threads Programming


Author: Blaise Barney, Lawrence Livermore National Laboratory

UCRL-MI-133316

Table of Contents
1. Abstract
2. Pthreads Overview
1. What is a Thread?
2. What are Pthreads?
3. Why Pthreads?
4. Designing Threaded Programs
3. The Pthreads API
4. Compiling Threaded Programs
5. Thread Management
1. Creating and Terminating Threads
2. Passing Arguments to Threads
3. Joining and Detaching Threads
4. Stack Management
5. Miscellaneous Routines
6. Exercise 1
7. Mutex Variables
1. Mutex Variables Overview
2. Creating and Destroying Mutexes
3. Locking and Unlocking Mutexes
8. Condition Variables
1. Condition Variables Overview
2. Creating and Destroying Condition Variables
3. Waiting and Signaling on Condition Variables
9. Monitoring, Debugging and Performance Analysis Tools for Pthreads
10. LLNL Specific Information and Recommendations
11. Topics Not Covered
12. Exercise 2
13. References and More Information
14. Appendix A: Pthread Library Routines Reference

Abstract
In shared memory multiprocessor architectures, threads can be used to implement parallelism. Historically, hardware vendors have
implemented their own proprietary versions of threads, making portability a concern for software developers. For UNIX systems, a
standardized C language threads programming interface has been specified by the IEEE POSIX 1003.1c standard. Implementations that
adhere to this standard are referred to as POSIX threads, or Pthreads.
The tutorial begins with an introduction to concepts, motivations, and design considerations for using Pthreads. Each of the three major
classes of routines in the Pthreads API are then covered: Thread Management, Mutex Variables, and Condition Variables. Example codes
are used throughout to demonstrate how to use most of the Pthreads routines needed by a new Pthreads programmer. The tutorial concludes
with a discussion of LLNL specifics and how to mix MPI with pthreads. A lab exercise, with numerous example codes (C Language) is also
included.

Level/Prerequisites: This tutorial is ideal for those who are new to parallel programming with pthreads. A basic understanding of parallel
programming in C is required. For those who are unfamiliar with Parallel Programming in general, the material covered in EC3500:
Introduction to Parallel Computing would be helpful.

Pthreads Overview
What is a Thread?
Technically, a thread is defined as an independent stream of instructions that can be scheduled to run as such by the operating system.
But what does this mean?


To the software developer, the concept of a "procedure" that runs independently from its main program may best describe a thread.
To go one step further, imagine a main program (a.out) that contains a number of procedures. Then imagine all of these procedures
being able to be scheduled to run simultaneously and/or independently by the operating system. That would describe a "multi-threaded"
program.
How is this accomplished?
Before understanding a thread, one first needs to understand a UNIX process. A process is created by the operating system, and
requires a fair amount of "overhead". Processes contain information about program resources and program execution state, including:
Process ID, process group ID, user ID, and group ID
Environment
Working directory.
Program instructions
Registers
Stack
Heap
File descriptors
Signal actions
Shared libraries
Inter-process communication tools (such as message queues, pipes, semaphores, or shared memory).

UNIX PROCESS

THREADS WITHIN A UNIX PROCESS

Threads use and exist within these process resources, yet are able to be scheduled by the operating system and run as independent
entities largely because they duplicate only the bare essential resources that enable them to exist as executable code.
This independent flow of control is accomplished because a thread maintains its own:
Stack pointer
Registers
Scheduling properties (such as policy or priority)
Set of pending and blocked signals
Thread specific data.
So, in summary, in the UNIX environment a thread:
Exists within a process and uses the process resources
Has its own independent flow of control as long as its parent process exists and the OS supports it
Duplicates only the essential resources it needs to be independently schedulable
May share the process resources with other threads that act equally independently (and dependently)
Dies if the parent process dies - or something similar
Is "lightweight" because most of the overhead has already been accomplished through the creation of its process.
Because threads within the same process share resources:
Changes made by one thread to shared system resources (such as closing a file) will be seen by all other threads.
Two pointers having the same value point to the same data.
Reading and writing to the same memory locations is possible, and therefore requires explicit synchronization by the programmer.

Pthreads Overview


What are Pthreads?


Historically, hardware vendors have implemented their own proprietary versions of threads. These implementations differed substantially
from each other making it difficult for programmers to develop portable threaded applications.
In order to take full advantage of the capabilities provided by threads, a standardized programming interface was required.
For UNIX systems, this interface has been specified by the IEEE POSIX 1003.1c standard (1995).
Implementations adhering to this standard are referred to as POSIX threads, or Pthreads.
Most hardware vendors now offer Pthreads in addition to their proprietary API's.
The POSIX standard has continued to evolve and undergo revisions, including the Pthreads specification.
Some useful links:
standards.ieee.org/findstds/standard/1003.1-2008.html
www.opengroup.org/austin/papers/posix_faq.html
www.unix.org/version3/ieee_std.html
Pthreads are defined as a set of C language programming types and procedure calls, implemented with a pthread.h header/include
file and a thread library - though this library may be part of another library, such as libc, in some implementations.

Pthreads Overview
Why Pthreads?
Light Weight:
When compared to the cost of creating and managing a process, a thread can be created with much less operating system overhead.
Managing threads requires fewer system resources than managing processes.
For example, the following table compares timing results for the fork() subroutine and the pthread_create() subroutine. Timings
reflect 50,000 process/thread creations, were performed with the time utility, and units are in seconds, no optimization flags.
Note: don't expect the system and user times to add up to real time, because these are SMP systems with multiple CPUs/cores working
on the problem at the same time. At best, these are approximations run on local machines, past and present.
                                                   fork()                pthread_create()
Platform                                       real   user    sys       real   user   sys

Intel 2.6 GHz Xeon E5-2670 (16 cores/node)      8.1    0.1    2.9        0.9    0.2    0.3
Intel 2.8 GHz Xeon 5660 (12 cores/node)         4.4    0.4    4.3        0.7    0.2    0.5
AMD 2.3 GHz Opteron (16 cores/node)            12.5    1.0   12.5        1.2    0.2    1.3
AMD 2.4 GHz Opteron (8 cores/node)             17.6    2.2   15.7        1.4    0.3    1.3
IBM 4.0 GHz POWER6 (8 cpus/node)                9.5    0.6    8.8        1.6    0.1    0.4
IBM 1.9 GHz POWER5 p5-575 (8 cpus/node)        64.2   30.7   27.6        1.7    0.6    1.1
IBM 1.5 GHz POWER4 (8 cpus/node)              104.5   48.6   47.2        2.1    1.0    1.5
INTEL 2.4 GHz Xeon (2 cpus/node)               54.9    1.5   20.8        1.6    0.7    0.9
INTEL 1.4 GHz Itanium2 (4 cpus/node)           54.5    1.1   22.2        2.0    1.2    0.6

fork_vs_thread.txt

Efficient Communications/Data Exchange:


The primary motivation for considering the use of Pthreads in a high performance computing environment is to achieve optimum
performance. In particular, if an application is using MPI for on-node communications, there is a potential that performance could be
improved by using Pthreads instead.
MPI libraries usually implement on-node task communication via shared memory, which involves at least one memory copy operation
(process to process).
For Pthreads there is no intermediate memory copy required because threads share the same address space within a single process.
There is no data transfer, per se. It can be as efficient as simply passing a pointer.
In the worst case scenario, Pthread communications become more of a cache-to-CPU or memory-to-CPU bandwidth issue. These
speeds are much higher than MPI shared memory communications.


For example: some local comparisons, past and present, are shown below:

Platform                        MPI Shared Memory Bandwidth     Pthreads Worst Case Memory-to-CPU
                                (GB/sec)                        Bandwidth (GB/sec)

Intel 2.6 GHz Xeon E5-2670       4.5                            51.2
Intel 2.8 GHz Xeon 5660          5.6                            32
AMD 2.3 GHz Opteron              1.8                             5.3
AMD 2.4 GHz Opteron              1.2                             5.3
IBM 1.9 GHz POWER5 p5-575        4.1                            16
IBM 1.5 GHz POWER4               2.1                             -
Intel 2.4 GHz Xeon               0.3                             4.3
Intel 1.4 GHz Itanium 2          1.8                             6.4

Other Common Reasons:


Threaded applications offer potential performance gains and practical advantages over non-threaded applications in several other
ways:
Overlapping CPU work with I/O: For example, a program may have sections where it is performing a long I/O operation. While one
thread is waiting for an I/O system call to complete, CPU intensive work can be performed by other threads.
Priority/real-time scheduling: tasks which are more important can be scheduled to supersede or interrupt lower priority tasks.
Asynchronous event handling: tasks which service events of indeterminate frequency and duration can be interleaved. For
example, a web server can both transfer data from previous requests and manage the arrival of new requests.
A perfect example is the typical web browser, where many interleaved tasks can be happening at the same time, and where tasks can
vary in priority.
Another good example is a modern operating system, which makes extensive use of threads. A screenshot of the MS Windows OS and
applications using threads is shown below.


Pthreads Overview
Designing Threaded Programs
Parallel Programming:
On modern, multi-core machines, pthreads are ideally suited for parallel programming, and whatever applies to parallel programming in
general, applies to parallel pthreads programs.
There are many considerations for designing parallel programs, such as:
What type of parallel programming model to use?
Problem partitioning
Load balancing
Communications

Data dependencies
Synchronization and race conditions
Memory issues
I/O issues
Program complexity
Programmer effort/costs/time
...
Covering these topics is beyond the scope of this tutorial, however interested readers can obtain a quick overview in the Introduction to
Parallel Computing tutorial.
In general though, in order for a program to take advantage of Pthreads, it must be able to be organized into discrete, independent
tasks which can execute concurrently. For example, if routine1 and routine2 can be interchanged, interleaved and/or overlapped in real
time, they are candidates for threading.

Programs having the following characteristics may be well suited for pthreads:
Work that can be executed, or data that can be operated on, by multiple tasks simultaneously:
Block for potentially long I/O waits
Use many CPU cycles in some places but not others
Must respond to asynchronous events
Some work is more important than other work (priority interrupts)
Several common models for threaded programs exist:
Manager/worker: a single thread, the manager assigns work to other threads, the workers. Typically, the manager handles all
input and parcels out work to the other tasks. At least two forms of the manager/worker model are common: static worker pool and
dynamic worker pool.
Pipeline: a task is broken into a series of suboperations, each of which is handled in series, but concurrently, by a different thread.
An automobile assembly line best describes this model.
Peer: similar to the manager/worker model, but after the main thread creates other threads, it participates in the work.

Shared Memory Model:


All threads have access to the same global, shared memory
Threads also have their own private data
Programmers are responsible for synchronizing access (protecting) globally shared data.


Shared Memory Model

Thread-safeness:
Thread-safeness: in a nutshell, refers to an application's ability to execute multiple threads simultaneously without "clobbering" shared data
or creating "race" conditions.
For example, suppose that your application creates several threads, each of which makes a call to the same library routine:
This library routine accesses/modifies a global structure or location in memory.
As each thread calls this routine it is possible that they may try to modify this global structure/memory location at the same time.
If the routine does not employ some sort of synchronization constructs to prevent data corruption, then it is not thread-safe.

The implication to users of external library routines is that if you aren't 100% certain the routine is thread-safe, then you take your
chances with problems that could arise.
Recommendation: Be careful if your application uses libraries or other objects that don't explicitly guarantee thread-safeness. When in
doubt, assume that they are not thread-safe until proven otherwise. This can be done by "serializing" the calls to the uncertain routine,
etc.
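One way to "serialize" such calls is sketched below with a Pthreads mutex (the library routine and its global state are hypothetical stand-ins): a small wrapper takes a lock around every call, so only one thread at a time can be inside the uncertain routine. Mutexes and pthread_join are covered later in this tutorial.

Serializing calls to a possibly non-thread-safe routine - sketch

   #include <pthread.h>
   #include <stdio.h>

   /* Stand-in for a library routine that modifies global state and may
      not be thread-safe.                                                */
   static int global_state = 0;
   static void uncertain_library_call(int arg) { global_state += arg; }

   static pthread_mutex_t call_lock = PTHREAD_MUTEX_INITIALIZER;

   /* Wrapper: only one thread at a time runs the uncertain routine */
   static void *worker(void *arg)
   {
      pthread_mutex_lock(&call_lock);
      uncertain_library_call((int)(long)arg);
      pthread_mutex_unlock(&call_lock);
      return NULL;
   }

   int main(void)
   {
      pthread_t t[4];
      long i;
      for (i=0; i<4; i++) pthread_create(&t[i], NULL, worker, (void *)i);
      for (i=0; i<4; i++) pthread_join(t[i], NULL);
      printf("global_state = %d\n", global_state);
      return 0;
   }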

Thread Limits:
Although the Pthreads API is an ANSI/IEEE standard, implementations can, and usually do, vary in ways not specified by the standard.
Because of this, a program that runs fine on one platform, may fail or produce wrong results on another platform.
For example, the maximum number of threads permitted, and the default thread stack size are two important limits to consider when
designing your program.
Several thread limits are discussed in more detail later in this tutorial.


The Pthreads API


The original Pthreads API was defined in the ANSI/IEEE POSIX 1003.1 - 1995 standard. The POSIX standard has continued to evolve
and undergo revisions, including the Pthreads specification.
Copies of the standard can be purchased from IEEE or downloaded for free from other sites online.
The subroutines which comprise the Pthreads API can be informally grouped into four major groups:
1. Thread management: Routines that work directly on threads - creating, detaching, joining, etc. They also include functions to
set/query thread attributes (joinable, scheduling etc.)
2. Mutexes: Routines that deal with synchronization, called a "mutex", which is an abbreviation for "mutual exclusion". Mutex
functions provide for creating, destroying, locking and unlocking mutexes. These are supplemented by mutex attribute functions
that set or modify attributes associated with mutexes.
3. Condition variables: Routines that address communications between threads that share a mutex. Based upon programmer
specified conditions. This group includes functions to create, destroy, wait and signal based upon specified variable values.
Functions to set/query condition variable attributes are also included.
4. Synchronization: Routines that manage read/write locks and barriers.
Naming conventions: All identifiers in the threads library begin with pthread_. Some examples are shown below.

Routine Prefix

Functional Group

pthread_

Threads themselves and miscellaneous subroutines

pthread_attr_

Thread attributes objects

pthread_mutex_

Mutexes

pthread_mutexattr_

Mutex attributes objects.

pthread_cond_

Condition variables

pthread_condattr_

Condition attributes objects

pthread_key_

Thread-specific data keys

pthread_rwlock_

Read/write locks

pthread_barrier_

Synchronization barriers

The concept of opaque objects pervades the design of the API. The basic calls work to create or modify opaque objects - the opaque
objects can be modified by calls to attribute functions, which deal with opaque attributes.
The Pthreads API contains around 100 subroutines. This tutorial will focus on a subset of these - specifically, those which are most likely
to be immediately useful to the beginning Pthreads programmer.
For portability, the pthread.h header file should be included in each source file using the Pthreads library.
The current POSIX standard is defined only for the C language. Fortran programmers can use wrappers around C function calls. Some
Fortran compilers may provide a Fortran pthreads API.
A number of excellent books about Pthreads are available. Several of these are listed in the References section of this tutorial.

Compiling Threaded Programs


Several examples of compile commands used for pthreads codes are listed in the table below.

Compiler / Platform      Compiler Command         Description

INTEL                    icc  -pthread            C
Linux                    icpc -pthread            C++

PGI                      pgcc -lpthread           C
Linux                    pgCC -lpthread           C++

GNU                      gcc -pthread             GNU C
Linux, Blue Gene         g++ -pthread             GNU C++

IBM                      bgxlc_r / bgcc_r         C (ANSI / non-ANSI)
Blue Gene                bgxlC_r, bgxlc++_r       C++

Thread Management
Creating and Terminating Threads
Routines:
pthread_create (thread,attr,start_routine,arg)
pthread_exit (status)
pthread_cancel (thread)
pthread_attr_init (attr)
pthread_attr_destroy (attr)

Creating Threads:
Initially, your main() program comprises a single, default thread. All other threads must be explicitly created by the programmer.
pthread_create creates a new thread and makes it executable. This routine can be called any number of times from anywhere
within your code.
pthread_create arguments:
thread: An opaque, unique identifier for the new thread returned by the subroutine.
attr: An opaque attribute object that may be used to set thread attributes. You can specify a thread attributes object, or NULL for
the default values.
start_routine: the C routine that the thread will execute once it is created.
arg: A single argument that may be passed to start_routine. It must be passed by reference as a pointer cast of type void. NULL
may be used if no argument is to be passed.
The maximum number of threads that may be created by a process is implementation dependent. Programs that attempt to exceed the
limit can fail or produce wrong results.
Querying and setting your implementation's thread limit - Linux example shown. Demonstrates querying the default (soft) limits and then
setting the maximum number of processes (including threads) to the hard limit. Then verifying that the limit has been overridden.

bash / ksh / sh:

$ ulimit -a
core file size          (blocks, -c) 16
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 255956
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 1024
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

$ ulimit -Hu
7168

$ ulimit -u 7168

$ ulimit -a
core file size          (blocks, -c) 16
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 255956
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 1024
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) unlimited
cpu time               (seconds, -t) unlimited
max user processes              (-u) 7168
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited

tcsh / csh:

% limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize 16 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024
memorylocked 64 kbytes
maxproc      1024

% limit maxproc unlimited

% limit
cputime      unlimited
filesize     unlimited
datasize     unlimited
stacksize    unlimited
coredumpsize 16 kbytes
memoryuse    unlimited
vmemoryuse   unlimited
descriptors  1024
memorylocked 64 kbytes
maxproc      7168
Once created, threads are peers, and may create other threads. There is no implied hierarchy or dependency between threads.
Peer Threads

Thread Attributes:
By default, a thread is created with certain attributes. Some of these attributes can be changed by the programmer via the thread
attribute object.
pthread_attr_init and pthread_attr_destroy are used to initialize/destroy the thread attribute object.
Other routines are then used to query/set specific attributes in the thread attribute object. Attributes include:
Detached or joinable state
Scheduling inheritance
Scheduling policy
Scheduling parameters
Scheduling contention scope
Stack size
Stack address
Stack guard (overflow) size
Some of these attributes will be discussed later.

Thread Binding and Scheduling:


Question: After a thread has been created, how do you know a)when it will be scheduled to run by the operating system, and b)which
processor/core it will run on?

Answer
The Pthreads API provides several routines that may be used to specify how threads are scheduled for execution. For example, threads
can be scheduled to run FIFO (first-in first-out), RR (round-robin) or OTHER (operating system determines). It also provides the ability to
set a thread's scheduling priority value.
These topics are not covered here, however a good overview of "how things work" under Linux can be found in the
sched_setscheduler man page.
The Pthreads API does not provide routines for binding threads to specific cpus/cores. However, local implementations may include this
functionality - such as providing the non-standard pthread_setaffinity_np routine. Note that "_np" in the name stands for "non-portable".
Also, the local operating system may provide a way to do this. For example, Linux provides the sched_setaffinity routine.

Terminating Threads & pthread_exit():


There are several ways in which a thread may be terminated:
The thread returns normally from its starting routine. Its work is done.
The thread makes a call to the pthread_exit subroutine - whether its work is done or not.
The thread is canceled by another thread via the pthread_cancel routine.
The entire process is terminated due to a call to either the exec() or exit() subroutines.
If main() finishes first, without calling pthread_exit explicitly itself
The pthread_exit() routine allows the programmer to specify an optional termination status parameter. This optional parameter is
typically returned to threads "joining" the terminated thread (covered later).
In subroutines that execute to completion normally, you can often dispense with calling pthread_exit() - unless, of course, you want
to pass the optional status code back.
Cleanup: the pthread_exit() routine does not close files; any files opened inside the thread will remain open after the thread is
terminated.

Discussion on calling pthread_exit() from main():


There is a definite problem if main() finishes before the threads it spawned if you don't call pthread_exit() explicitly. All of the
threads it created will terminate because main() is done and no longer exists to support the threads.
By having main() explicitly call pthread_exit() as the last thing it does, main() will block and be kept alive to support the
threads it created until they are done.

Example: Pthread Creation and Termination


This simple example code creates 5 threads with the pthread_create() routine. Each thread prints a "Hello World!" message, and
then terminates with a call to pthread_exit().

Pthread Creation and Termination Example


#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NUM_THREADS     5

void *PrintHello(void *threadid)
{
   long tid;
   tid = (long)threadid;
   printf("Hello World! It's me, thread #%ld!\n", tid);
   pthread_exit(NULL);
}

int main (int argc, char *argv[])
{
   pthread_t threads[NUM_THREADS];
   int rc;
   long t;
   for(t=0; t<NUM_THREADS; t++){
      printf("In main: creating thread %ld\n", t);
      rc = pthread_create(&threads[t], NULL, PrintHello, (void *)t);
      if (rc){
         printf("ERROR; return code from pthread_create() is %d\n", rc);
         exit(-1);
      }
   }

   /* Last thing that main() should do */
   pthread_exit(NULL);
}


Thread Management
Passing Arguments to Threads
The pthread_create() routine permits the programmer to pass one argument to the thread start routine. For cases where multiple
arguments must be passed, this limitation is easily overcome by creating a structure which contains all of the arguments, and then
passing a pointer to that structure in the pthread_create() routine.
All arguments must be passed by reference and cast to (void *).
Question: How can you safely pass data to newly created threads, given their non-deterministic start-up and scheduling?

Answer

Example 1 - Thread Argument Passing


This code fragment demonstrates how to pass a simple integer to each thread. The calling thread uses a unique
data structure for each thread, ensuring that each thread's argument remains intact throughout the program.
long taskids[NUM_THREADS];
for(t=0; t<NUM_THREADS; t++)
{
taskids[t] = t;
printf("Creating thread %ld\n", t);
rc = pthread_create(&threads[t], NULL, PrintHello, (void *) taskids[t]);
...
}

Example 2 - Thread Argument Passing


This example shows how to setup/pass multiple arguments via a structure. Each thread receives a unique instance
of the structure.
struct thread_data{
int thread_id;
int sum;
char *message;
};
struct thread_data thread_data_array[NUM_THREADS];
void *PrintHello(void *threadarg)
{
struct thread_data *my_data;
...
my_data = (struct thread_data *) threadarg;
taskid = my_data->thread_id;
sum = my_data->sum;
hello_msg = my_data->message;
...
}
int main (int argc, char *argv[])
{
...
thread_data_array[t].thread_id = t;
thread_data_array[t].sum = sum;
thread_data_array[t].message = messages[t];
rc = pthread_create(&threads[t], NULL, PrintHello,
(void *) &thread_data_array[t]);
...
}


Example 3 - Thread Argument Passing (Incorrect)


This example performs argument passing incorrectly. It passes the address of variable t, which is shared memory
space and visible to all threads. As the loop iterates, the value of this memory location changes, possibly before
the created threads can access it.
int rc;
long t;
for(t=0; t<NUM_THREADS; t++)
{
printf("Creating thread %ld\n", t);
rc = pthread_create(&threads[t], NULL, PrintHello, (void *) &t);
...
}

Thread Management
Joining and Detaching Threads
Routines:
pthread_join (threadid,status)
pthread_detach (threadid)
pthread_attr_setdetachstate (attr,detachstate)
pthread_attr_getdetachstate (attr,detachstate)

Joining:
"Joining" is one way to accomplish synchronization between threads. For example:

The pthread_join() subroutine blocks the calling thread until the specified threadid thread terminates.
The programmer is able to obtain the target thread's termination return status if it was specified in the target thread's call to
pthread_exit().
A joining thread can match one pthread_join() call. It is a logical error to attempt multiple joins on the same thread.
Two other synchronization methods, mutexes and condition variables, will be discussed later.

Joinable or Not?
When a thread is created, one of its attributes defines whether it is joinable or detached. Only threads that are created as joinable can
be joined. If a thread is created as detached, it can never be joined.
The final draft of the POSIX standard specifies that threads should be created as joinable.
To explicitly create a thread as joinable or detached, the attr argument in the pthread_create() routine is used. The typical 4
step process is:
1. Declare a pthread attribute variable of the pthread_attr_t data type
2. Initialize the attribute variable with pthread_attr_init()
3. Set the attribute detached status with pthread_attr_setdetachstate()
4. When done, free library resources used by the attribute with pthread_attr_destroy()

Detaching:
The pthread_detach() routine can be used to explicitly detach a thread even though it was created as joinable.
There is no converse routine.
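A minimal sketch (the worker routine is hypothetical) of explicitly detaching a thread that was created joinable:

#include <pthread.h>

void *worker(void *arg)           /* hypothetical thread routine */
{
   /* ... independent work; no other thread will ever join this one ... */
   return NULL;
}

int main(void)
{
   pthread_t thr;
   if (pthread_create(&thr, NULL, worker, NULL) == 0)
      pthread_detach(thr);        /* resources are reclaimed as soon as worker ends */

   pthread_exit(NULL);            /* keep the process alive until worker finishes */
}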

Recommendations:
If a thread requires joining, consider explicitly creating it as joinable. This provides portability as not all implementations may create
threads as joinable by default.
If you know in advance that a thread will never need to join with another thread, consider creating it in a detached state. Some system
resources may be able to be freed.

Example: Pthread Joining


This example demonstrates how to "wait" for thread completions by using the Pthread join routine.
Since some implementations of Pthreads may not create threads in a joinable state, the threads in this example are explicitly created in
a joinable state so that they can be joined later.

Pthread Joining Example



#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#define NUM_THREADS 4
void *BusyWork(void *t)
{
int i;
long tid;
double result=0.0;
tid = (long)t;
printf("Thread %ld starting...\n",tid);
for (i=0; i<1000000; i++)
{
result = result + sin(i) * tan(i);
}
printf("Thread %ld done. Result = %e\n",tid, result);
pthread_exit((void*) t);
}
int main (int argc, char *argv[])
{
pthread_t thread[NUM_THREADS];
pthread_attr_t attr;
int rc;
long t;
void *status;
/* Initialize and set thread detached attribute */
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for(t=0; t<NUM_THREADS; t++) {
printf("Main: creating thread %ld\n", t);
rc = pthread_create(&thread[t], &attr, BusyWork, (void *)t);
if (rc) {
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
/* Free attribute and wait for the other threads */

pthread_attr_destroy(&attr);
for(t=0; t<NUM_THREADS; t++) {
rc = pthread_join(thread[t], &status);
if (rc) {
printf("ERROR; return code from pthread_join() is %d\n", rc);
exit(-1);
}
printf("Main: completed join with thread %ld having a status
of %ld\n",t,(long)status);
}
printf("Main: program completed. Exiting.\n");
pthread_exit(NULL);
}

Thread Management
Stack Management
Routines:
pthread_attr_getstacksize (attr, stacksize)
pthread_attr_setstacksize (attr, stacksize)
pthread_attr_getstackaddr (attr, stackaddr)
pthread_attr_setstackaddr (attr, stackaddr)

Preventing Stack Problems:


The POSIX standard does not dictate the size of a thread's stack. This is implementation dependent and varies.
Exceeding the default stack limit is often very easy to do, with the usual results: program termination and/or corrupted data.
Safe and portable programs do not depend upon the default stack limit, but instead, explicitly allocate enough stack for each thread by
using the pthread_attr_setstacksize routine.
The pthread_attr_getstackaddr and pthread_attr_setstackaddr routines can be used by applications in an
environment where the stack for a thread must be placed in some particular region of memory.

Some Practical Examples at LC:


Default thread stack size varies greatly. The maximum size that can be obtained also varies greatly, and may depend upon the number
of threads per node.
Both past and present architectures are shown to demonstrate the wide variation in default thread stack size.

Node Architecture       #CPUs   Memory (GB)   Default Size (bytes)
Intel Xeon E5-2670        16        32             2,097,152
Intel Xeon 5660           12        24             2,097,152
AMD Opteron                8        16             2,097,152
Intel IA64                 4         8            33,554,432
Intel IA32                 2         4             2,097,152
IBM Power5                 8        32               196,608
IBM Power4                 8        16               196,608
IBM Power3                16        16                98,304

Example: Stack Management



This example demonstrates how to query and set a thread's stack size.

Stack Management Example



#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#define NTHREADS 4
#define N 1000
#define MEGEXTRA 1000000
pthread_attr_t attr;
void *dowork(void *threadid)
{
double A[N][N];
int i,j;
long tid;
size_t mystacksize;
tid = (long)threadid;
pthread_attr_getstacksize (&attr, &mystacksize);
printf("Thread %ld: stack size = %li bytes \n", tid, mystacksize);
for (i=0; i<N; i++)
for (j=0; j<N; j++)
A[i][j] = ((i*j)/3.452) + (N-i);
pthread_exit(NULL);
}
int main(int argc, char *argv[])
{
pthread_t threads[NTHREADS];
size_t stacksize;
int rc;
long t;
pthread_attr_init(&attr);
pthread_attr_getstacksize (&attr, &stacksize);
printf("Default stack size = %li\n", stacksize);
stacksize = sizeof(double)*N*N+MEGEXTRA;
printf("Amount of stack needed per thread = %li\n",stacksize);
pthread_attr_setstacksize (&attr, stacksize);
printf("Creating threads with stack size = %li bytes\n",stacksize);
for(t=0; t<NTHREADS; t++){
rc = pthread_create(&threads[t], &attr, dowork, (void *)t);
if (rc){
printf("ERROR; return code from pthread_create() is %d\n", rc);
exit(-1);
}
}
printf("Created %ld threads.\n", t);
pthread_exit(NULL);
}

Thread Management
Miscellaneous Routines
pthread_self ()
pthread_equal (thread1,thread2)
pthread_self returns the unique, system assigned thread ID of the calling thread.
pthread_equal compares two thread IDs. If the two IDs are different, 0 is returned; otherwise a non-zero value is returned.
Note that for both of these routines, the thread identifier objects are opaque and cannot be easily inspected. Because thread IDs are
opaque objects, the C language equivalence operator == should not be used to compare two thread IDs against each other, or to
compare a single thread ID against another value.
pthread_once (once_control, init_routine)
pthread_once executes the init_routine exactly once in a process. The first call to this routine by any thread in the process
executes the given init_routine, without parameters. Any subsequent call will have no effect.
The init_routine routine is typically an initialization routine.
The once_control parameter is a synchronization control structure that requires initialization prior to calling pthread_once. For
example:
pthread_once_t once_control = PTHREAD_ONCE_INIT;
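A small, hedged example (the worker and init_routine names are illustrative) showing pthread_once performing one-time initialization and pthread_equal comparing thread IDs:

#include <pthread.h>
#include <stdio.h>

pthread_once_t once_control = PTHREAD_ONCE_INIT;
pthread_t main_tid;

/* Runs exactly once per process, no matter how many threads reach it. */
void init_routine(void)
{
   printf("One-time initialization\n");
}

void *worker(void *arg)
{
   pthread_once(&once_control, init_routine);

   /* pthread_equal() is the portable way to compare thread IDs. */
   if (pthread_equal(pthread_self(), main_tid))
      printf("This never prints: a worker is not the main thread\n");
   return NULL;
}

int main(void)
{
   pthread_t t1, t2;
   main_tid = pthread_self();
   pthread_create(&t1, NULL, worker, NULL);
   pthread_create(&t2, NULL, worker, NULL);
   pthread_join(t1, NULL);
   pthread_join(t2, NULL);
   return 0;
}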

Pthread Exercise 1
Getting Started and Thread Management Routines
Overview:
Login to an LC cluster using your workshop username and OTP token
Copy the exercise files to your home directory
Familiarize yourself with LC's Pthreads environment
Write a simple "Hello World" Pthreads program
Successfully compile your program
Successfully run your program - several different ways
Review, compile, run and/or debug some related Pthreads programs (provided)
GO TO THE EXERCISE HERE

Mutex Variables
Overview
Mutex is an abbreviation for "mutual exclusion". Mutex variables are one of the primary means of implementing thread synchronization
and for protecting shared data when multiple writes occur.
A mutex variable acts like a "lock" protecting access to a shared data resource. The basic concept of a mutex as used in Pthreads is
that only one thread can lock (or own) a mutex variable at any given time. Thus, even if several threads try to lock a mutex, only one thread
will be successful. No other thread can own that mutex until the owning thread unlocks that mutex. Threads must "take turns" accessing
protected data.
Mutexes can be used to prevent "race" conditions. An example of a race condition involving a bank transaction is shown below:

Thread 1                      Thread 2                      Balance
Read balance: $1000                                          $1000
                              Read balance: $1000            $1000
Deposit $200                                                 $1000
                              Deposit $200                   $1000
Update balance $1000+$200                                    $1200
                              Update balance $1000+$200      $1200


In the above example, a mutex should be used to lock the "Balance" while a thread is using this shared data resource.
Very often the action performed by a thread owning a mutex is the updating of global variables. This is a safe way to ensure that when
several threads update the same variable, the final value is the same as what it would be if only one thread performed the update. The
variables being updated belong to a "critical section".
A typical sequence in the use of a mutex is as follows:
Create and initialize a mutex variable
Several threads attempt to lock the mutex
Only one succeeds and that thread owns the mutex
The owner thread performs some set of actions
The owner unlocks the mutex
Another thread acquires the mutex and repeats the process
Finally the mutex is destroyed
When several threads compete for a mutex, the losers block at that call - a non-blocking alternative is available with the "trylock"
call instead of the "lock" call.
When protecting shared data, it is the programmer's responsibility to make sure every thread that needs to use a mutex does so. For
example, if 4 threads are updating the same data, but only one uses a mutex, the data can still be corrupted.

Mutex Variables
Creating and Destroying Mutexes
Routines:
pthread_mutex_init (mutex,attr)
pthread_mutex_destroy (mutex)
pthread_mutexattr_init (attr)
pthread_mutexattr_destroy (attr)

Usage:
Mutex variables must be declared with type pthread_mutex_t, and must be initialized before they can be used. There are two ways
to initialize a mutex variable:
1. Statically, when it is declared. For example:
pthread_mutex_t mymutex = PTHREAD_MUTEX_INITIALIZER;
2. Dynamically, with the pthread_mutex_init() routine. This method permits setting mutex object attributes, attr.
The mutex is initially unlocked.
The attr object is used to establish properties for the mutex object, and must be of type pthread_mutexattr_t if used (may be
specified as NULL to accept defaults). The Pthreads standard defines three optional mutex attributes:
Protocol: Specifies the protocol used to prevent priority inversions for a mutex.
Prioceiling: Specifies the priority ceiling of a mutex.
Process-shared: Specifies the process sharing of a mutex.
Note that not all implementations may provide the three optional mutex attributes.
The pthread_mutexattr_init() and pthread_mutexattr_destroy() routines are used to create and destroy mutex
attribute objects respectively.
pthread_mutex_destroy() should be used to free a mutex object which is no longer needed.
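A minimal sketch of the dynamic initialization approach (the setup_and_teardown wrapper is illustrative; passing NULL instead of &attr accepts the defaults):

#include <pthread.h>

pthread_mutex_t mymutex;

void setup_and_teardown(void)
{
   pthread_mutexattr_t attr;             /* optional; NULL in init accepts defaults */

   pthread_mutexattr_init(&attr);
   pthread_mutex_init(&mymutex, &attr);  /* the mutex starts out unlocked */
   pthread_mutexattr_destroy(&attr);

   /* ... threads lock and unlock mymutex here ... */

   pthread_mutex_destroy(&mymutex);      /* free the mutex when no longer needed */
}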

Mutex Variables
Locking and Unlocking Mutexes
Routines:
pthread_mutex_lock (mutex)
pthread_mutex_trylock (mutex)
pthread_mutex_unlock (mutex)

Usage:
The pthread_mutex_lock() routine is used by a thread to acquire a lock on the specified mutex variable. If the mutex is already
locked by another thread, this call will block the calling thread until the mutex is unlocked.
pthread_mutex_trylock() will attempt to lock a mutex. However, if the mutex is already locked, the routine will return immediately
with a "busy" error code. This routine may be useful in preventing deadlock conditions, as in a priority-inversion situation.
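A brief sketch of the non-blocking variant (the names here are illustrative, not from the examples above); the caller can do other work whenever the mutex is busy instead of blocking:

#include <pthread.h>

pthread_mutex_t count_mutex = PTHREAD_MUTEX_INITIALIZER;
long shared_count = 0;

/* Returns 1 if the update succeeded, 0 if the mutex was busy. */
int try_update(void)
{
   if (pthread_mutex_trylock(&count_mutex) != 0)
      return 0;                     /* busy: caller can do other work and retry later */
   shared_count++;                  /* we own the lock */
   pthread_mutex_unlock(&count_mutex);
   return 1;
}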
pthread_mutex_unlock() will unlock a mutex if called by the owning thread. Calling this routine is required after a thread has
completed its use of protected data if other threads are to acquire the mutex for their work with the protected data. An error will be
returned if:
The mutex was already unlocked
The mutex is owned by another thread
There is nothing "magical" about mutexes...in fact they are akin to a "gentlemen's agreement" between participating threads. It is up to
the code writer to ensure that the necessary threads all make the mutex lock and unlock calls correctly. The following scenario
demonstrates a logical error:
Thread 1       Thread 2       Thread 3
Lock           Lock
A = 2          A = A+1        A = A*B
Unlock         Unlock

Question: When more than one thread is waiting for a locked mutex, which thread will be granted the lock first after it is released?

Answer

Example: Using Mutexes


This example program illustrates the use of mutex variables in a threads program that performs a dot product.
The main data is made available to all threads through a globally accessible structure.
Each thread works on a different part of the data.
The main thread waits for all the threads to complete their computations, and then it prints the resulting sum.

Using Mutexes Example



#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
/*
The following structure contains the necessary information
to allow the function "dotprod" to access its input data and
place its output into the structure.
*/
typedef struct
{
   double   *a;
   double   *b;
   double   sum;
   int      veclen;
} DOTDATA;
/* Define globally accessible variables and a mutex */
#define NUMTHRDS 4
#define VECLEN 100
DOTDATA dotstr;
pthread_t callThd[NUMTHRDS];
pthread_mutex_t mutexsum;
/*
The function dotprod is activated when the thread is created.


All input to this routine is obtained from a structure
of type DOTDATA and all output from this function is written into
this structure. The benefit of this approach is apparent for the
multi-threaded program: when a thread is created we pass a single
argument to the activated function - typically this argument
is a thread number. All the other information required by the
function is accessed from the globally accessible structure.
*/
void *dotprod(void *arg)
{
/* Define and use local variables for convenience */
int i, start, end, len ;
long offset;
double mysum, *x, *y;
offset = (long)arg;
len = dotstr.veclen;
start = offset*len;
end   = start + len;
x = dotstr.a;
y = dotstr.b;
/*
Perform the dot product and assign result
to the appropriate variable in the structure.
*/
mysum = 0;
for (i=start; i<end ; i++)
{
mysum += (x[i] * y[i]);
}
/*
Lock a mutex prior to updating the value in the shared
structure, and unlock it upon updating.
*/
pthread_mutex_lock (&mutexsum);
dotstr.sum += mysum;
pthread_mutex_unlock (&mutexsum);
pthread_exit((void*) 0);
}
/*
The main program creates threads which do all the work and then
print out result upon completion. Before creating the threads,
the input data is created. Since all threads update a shared structure,
we need a mutex for mutual exclusion. The main thread needs to wait for
all threads to complete; it waits for each one of the threads. We specify
a thread attribute value that allows the main thread to join with the
threads it creates. Note also that we free up handles when they are
no longer needed.
*/
int main (int argc, char *argv[])
{
long i;
double *a, *b;
void *status;
pthread_attr_t attr;
/* Assign storage and initialize values */
a = (double*) malloc (NUMTHRDS*VECLEN*sizeof(double));
b = (double*) malloc (NUMTHRDS*VECLEN*sizeof(double));
for (i=0; i<VECLEN*NUMTHRDS; i++)
{
a[i]=1.0;
b[i]=a[i];
}

dotstr.veclen = VECLEN;
dotstr.a = a;
dotstr.b = b;
dotstr.sum=0;
pthread_mutex_init(&mutexsum, NULL);
/* Create threads to perform the dotproduct */
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
for(i=0; i<NUMTHRDS; i++)
{
/*
Each thread works on a different set of data. The offset is specified
by 'i'. The size of the data for each thread is indicated by VECLEN.
*/
pthread_create(&callThd[i], &attr, dotprod, (void *)i);
}
pthread_attr_destroy(&attr);
/* Wait on the other threads */
for(i=0; i<NUMTHRDS; i++)
{
pthread_join(callThd[i], &status);
}
/* After joining, print out the results and cleanup */
printf ("Sum = %f \n", dotstr.sum);
free (a);
free (b);
pthread_mutex_destroy(&mutexsum);
pthread_exit(NULL);
}

Serial version
Pthreads version

Condition Variables
Overview
Condition variables provide yet another way for threads to synchronize. While mutexes implement synchronization by controlling thread
access to data, condition variables allow threads to synchronize based upon the actual value of data.
Without condition variables, the programmer would need to have threads continually polling (possibly in a critical section), to check if the
condition is met. This can be very resource consuming since the thread would be continuously busy in this activity. A condition variable
is a way to achieve the same goal without polling.
A condition variable is always used in conjunction with a mutex lock.
A representative sequence for using condition variables is shown below.

Main Thread
  Declare and initialize global data/variables which require synchronization (such as "count")
  Declare and initialize a condition variable object
  Declare and initialize an associated mutex
  Create threads A and B to do work

Thread A
  Do work up to the point where a certain condition must occur (such as "count" must reach a specified value)
  Lock associated mutex and check value of a global variable
  Call pthread_cond_wait() to perform a blocking wait for signal from Thread-B. Note that a call to
  pthread_cond_wait() automatically and atomically unlocks the associated mutex variable so that it can be used by Thread-B.
  When signalled, wake up. Mutex is automatically and atomically locked.
  Explicitly unlock mutex
  Continue

Thread B
  Do work
  Lock associated mutex
  Change the value of the global variable that Thread-A is waiting upon.
  Check value of the global Thread-A wait variable. If it fulfills the desired condition, signal Thread-A.
  Unlock mutex.
  Continue

Main Thread
  Join / Continue

Condition Variables
Creating and Destroying Condition Variables
Routines:
pthread_cond_init (condition,attr)
pthread_cond_destroy (condition)
pthread_condattr_init (attr)
pthread_condattr_destroy (attr)

Usage:
Condition variables must be declared with type pthread_cond_t, and must be initialized before they can be used. There are two
ways to initialize a condition variable:
1. Statically, when it is declared. For example:
pthread_cond_t myconvar = PTHREAD_COND_INITIALIZER;
2. Dynamically, with the pthread_cond_init() routine. The ID of the created condition variable is returned to the calling thread
through the condition parameter. This method permits setting condition variable object attributes, attr.
The optional attr object is used to set condition variable attributes. There is only one attribute defined for condition variables: process-shared, which allows the condition variable to be seen by threads in other processes. The attribute object, if used, must be of type
pthread_condattr_t (may be specified as NULL to accept defaults).
Note that not all implementations may provide the process-shared attribute.
The pthread_condattr_init() and pthread_condattr_destroy() routines are used to create and destroy condition
variable attribute objects.
pthread_cond_destroy() should be used to free a condition variable that is no longer needed.
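A minimal sketch of dynamic initialization and cleanup for a condition variable and its associated mutex (the setup/teardown wrappers are illustrative):

#include <pthread.h>

pthread_mutex_t count_mutex;
pthread_cond_t  count_cv;

void setup(void)
{
   pthread_mutex_init(&count_mutex, NULL);   /* NULL = default attributes */
   pthread_cond_init(&count_cv, NULL);
}

void teardown(void)
{
   pthread_cond_destroy(&count_cv);
   pthread_mutex_destroy(&count_mutex);
}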

Condition Variables
Waiting and Signaling on Condition Variables
Routines:
pthread_cond_wait (condition,mutex)
pthread_cond_signal (condition)
pthread_cond_broadcast (condition)

Usage:
pthread_cond_wait() blocks the calling thread until the specified condition is signalled. This routine should be called while mutex
is locked, and it will automatically release the mutex while it waits. After signal is received and thread is awakened, mutex will be
automatically locked for use by the thread. The programmer is then responsible for unlocking mutex when the thread is finished with it.

Recommendation: Using a WHILE loop instead of an IF statement (see watch_count routine in example below) to check the waited for
condition can help deal with several potential problems, such as:
If several threads are waiting for the same wake up signal, they will take turns acquiring the mutex, and any one of them can then
modify the condition they all waited for.


If the thread received the signal in error due to a program bug
The Pthreads library is permitted to issue spurious wake ups to a waiting thread without violating the standard.
The pthread_cond_signal() routine is used to signal (or wake up) another thread which is waiting on the condition variable. It
should be called after mutex is locked, and must unlock mutex in order for the pthread_cond_wait() routine to complete.
The pthread_cond_broadcast() routine should be used instead of pthread_cond_signal() if more than one thread is in a
blocking wait state.
It is a logical error to call pthread_cond_signal() before calling pthread_cond_wait().
Proper locking and unlocking of the associated mutex variable is essential when using these routines. For example:
Failing to lock the mutex before calling pthread_cond_wait() may cause it NOT to block.
Failing to unlock the mutex after calling pthread_cond_signal() may not allow a matching pthread_cond_wait()
routine to complete (it will remain blocked).
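The recommendations above boil down to the following canonical pattern (a hedged sketch; the variable names are illustrative, not from the example below):

#include <pthread.h>

pthread_mutex_t mutex = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  cv    = PTHREAD_COND_INITIALIZER;
int condition_is_met  = 0;      /* the shared "condition" being waited on */

void *waiter(void *arg)
{
   pthread_mutex_lock(&mutex);
   while (!condition_is_met)               /* WHILE guards against spurious wake ups */
      pthread_cond_wait(&cv, &mutex);      /* atomically unlocks, waits, re-locks    */
   /* ... the condition is known to hold here, and the mutex is held ... */
   pthread_mutex_unlock(&mutex);
   return NULL;
}

void *signaler(void *arg)
{
   pthread_mutex_lock(&mutex);             /* lock before changing shared data */
   condition_is_met = 1;
   pthread_cond_signal(&cv);               /* wake one waiting thread          */
   pthread_mutex_unlock(&mutex);           /* unlock so the waiter can return  */
   return NULL;
}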

Example: Using Condition Variables


This simple example code demonstrates the use of several Pthread condition variable routines.
The main routine creates three threads.
Two of the threads perform work and update a "count" variable.
The third thread waits until the count variable reaches a specified value.

Using Condition Variables Example



#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#define NUM_THREADS 3
#define TCOUNT 10
#define COUNT_LIMIT 12
int   count = 0;
int   thread_ids[3] = {0,1,2};
pthread_mutex_t count_mutex;
pthread_cond_t count_threshold_cv;
void *inc_count(void *t)
{
int i;
long my_id = (long)t;
for (i=0; i<TCOUNT; i++) {
pthread_mutex_lock(&count_mutex);
count++;
/*
Check the value of count and signal waiting thread when condition is
reached. Note that this occurs while mutex is locked.
*/
if (count == COUNT_LIMIT) {
pthread_cond_signal(&count_threshold_cv);
printf("inc_count(): thread %ld, count = %d Threshold reached.\n",
my_id, count);
}
printf("inc_count(): thread %ld, count = %d, unlocking mutex\n",
my_id, count);
pthread_mutex_unlock(&count_mutex);
/* Do some "work" so threads can alternate on mutex lock */
sleep(1);
}
pthread_exit(NULL);
}

void *watch_count(void *t)
{
long my_id = (long)t;
printf("Starting watch_count(): thread %ld\n", my_id);
/*
Lock mutex and wait for signal. Note that the pthread_cond_wait
routine will automatically and atomically unlock mutex while it waits.
Also, note that if COUNT_LIMIT is reached before this routine is run by
the waiting thread, the loop will be skipped to prevent pthread_cond_wait
from never returning.
*/
pthread_mutex_lock(&count_mutex);
while (count<COUNT_LIMIT) {
pthread_cond_wait(&count_threshold_cv, &count_mutex);
printf("watch_count(): thread %ld Condition signal received.\n", my_id);
count += 125;
printf("watch_count(): thread %ld count now = %d.\n", my_id, count);
}
pthread_mutex_unlock(&count_mutex);
pthread_exit(NULL);
}
int main (int argc, char *argv[])
{
int i, rc;
long t1=1, t2=2, t3=3;
pthread_t threads[3];
pthread_attr_t attr;
/* Initialize mutex and condition variable objects */
pthread_mutex_init(&count_mutex, NULL);
pthread_cond_init (&count_threshold_cv, NULL);
/* For portability, explicitly create threads in a joinable state */
pthread_attr_init(&attr);
pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_JOINABLE);
pthread_create(&threads[0], &attr, watch_count, (void *)t1);
pthread_create(&threads[1], &attr, inc_count, (void *)t2);
pthread_create(&threads[2], &attr, inc_count, (void *)t3);
/* Wait for all threads to complete */
for (i=0; i<NUM_THREADS; i++) {
pthread_join(threads[i], NULL);
}
printf ("Main(): Waited on %d threads. Done.\n", NUM_THREADS);
/* Clean up and exit */
pthread_attr_destroy(&attr);
pthread_mutex_destroy(&count_mutex);
pthread_cond_destroy(&count_threshold_cv);
pthread_exit(NULL);
}

Monitoring, Debugging and Performance Analysis Tools for Pthreads


Monitoring and Debugging Pthreads:
Debuggers vary in their ability to handle Pthreads. The TotalView debugger is LC's recommended debugger for parallel programs. It is
well suited for both monitoring and debugging threaded programs.
An example screenshot from a TotalView session using a threaded code is shown below.
1. Stack Trace Pane: Displays the call stack of routines that the selected thread is executing.
2. Status Bars: Show status information for the selected thread and its associated process.
3. Stack Frame Pane: Shows a selected thread's stack variables, registers, etc.
4. Source Pane: Shows the source code for the selected thread.
5. Root Window: Shows all threads.
6. Threads Pane: Shows threads associated with the selected process.

See the TotalView Debugger tutorial for details.


The Linux ps command provides several flags for viewing thread information. Some examples are shown below. See the man page for
details.
% ps -Lf
UID      PID   PPID    LWP  C NLWP STIME TTY      TIME     CMD
blaise 22529  28240  22529  0    5 11:31 pts/53   00:00:00 a.out
blaise 22529  28240  22530 99    5 11:31 pts/53   00:01:24 a.out
blaise 22529  28240  22531 99    5 11:31 pts/53   00:01:24 a.out
blaise 22529  28240  22532 99    5 11:31 pts/53   00:01:24 a.out
blaise 22529  28240  22533 99    5 11:31 pts/53   00:01:24 a.out

% ps -T
  PID  SPID TTY      TIME     CMD
22529 22529 pts/53   00:00:00 a.out
22529 22530 pts/53   00:01:49 a.out
22529 22531 pts/53   00:01:49 a.out
22529 22532 pts/53   00:01:49 a.out
22529 22533 pts/53   00:01:49 a.out

% ps -Lm
  PID   LWP TTY      TIME     CMD
22529     - pts/53   00:18:56 a.out
    - 22529 -        00:00:00 -
    - 22530 -        00:04:44 -
    - 22531 -        00:04:44 -
    - 22532 -        00:04:44 -
    - 22533 -        00:04:44 -

LC's Linux clusters also provide the top command to monitor processes on a node. If used with the -H flag, the threads contained
within a process will be visible. An example of the top -H command is shown below. The parent process is PID 18010 which spawned
three threads, shown as PIDs 18012, 18013 and 18014.

Performance Analysis Tools:


There are a variety of performance analysis tools that can be used with threaded programs. Searching the web will turn up a wealth of
information.
At LC, the list of supported computing tools can be found at: computing.llnl.gov/code/content/software_tools.php.
These tools vary significantly in their complexity, functionality and learning curve. Covering them in detail is beyond the scope of this
tutorial.
Some tools worth investigating, specifically for threaded codes, include:
Open|SpeedShop
TAU
HPCToolkit
PAPI
Intel VTune Amplifier
ThreadSpotter

LLNL Specific Information and Recommendations


This section describes details specific to Livermore Computing's systems.

Implementations:
All LC production systems include a Pthreads implementation that follows draft 10 (final) of the POSIX standard. This is the preferred
implementation.
Implementations differ in the maximum number of threads that a process may create. They also differ in the default amount of thread
stack space.

Compiling:
LC maintains a number of compilers, and usually several different versions of each - see the LC's Supported Compilers web page.
The compiler commands described in the Compiling Threaded Programs section apply to LC systems.

Mixing MPI with Pthreads:


This is the primary motivation for using Pthreads at LC.
Design:
Each MPI process typically creates and then manages N threads, where N makes the best use of the available cores/node.
Finding the best value for N will vary with the platform and your application's characteristics.
In general, there may be problems if multiple threads make MPI calls. The program may fail or behave unexpectedly. If MPI calls
must be made from within a thread, they should be made only by one thread.
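A hedged sketch of this funneled design (the thread count and work routine are illustrative, not LC-specific): each MPI rank asks for MPI_THREAD_FUNNELED support and keeps all MPI calls in the main thread.

#include <mpi.h>
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

void *work(void *arg)
{
   long tid = (long)arg;
   /* compute-only work here: no MPI calls from worker threads */
   printf("worker thread %ld running\n", tid);
   return NULL;
}

int main(int argc, char *argv[])
{
   int provided, rank;
   long t;
   pthread_t threads[NTHREADS];

   /* Ask for FUNNELED support: only the main thread will make MPI calls. */
   MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
   MPI_Comm_rank(MPI_COMM_WORLD, &rank);

   for (t = 0; t < NTHREADS; t++)
      pthread_create(&threads[t], NULL, work, (void *)t);
   for (t = 0; t < NTHREADS; t++)
      pthread_join(threads[t], NULL);

   /* All communication is funneled through the main thread */
   MPI_Barrier(MPI_COMM_WORLD);
   MPI_Finalize();
   return 0;
}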

Compiling:
Use the appropriate MPI compile command for the platform and language of choice
Be sure to include the required Pthreads flag as shown in the Compiling Threaded Programs section.
An example code that uses both MPI and Pthreads is available below. The serial, threads-only, MPI-only and MPI-with-threads versions
demonstrate one possible progression.
Serial
Pthreads only
MPI only
MPI with pthreads
makefile

Topics Not Covered


Several features of the Pthreads API are not covered in this tutorial. These are listed below. See the Pthread Library Routines Reference
section for more information.
Thread Scheduling
Implementations will differ on how threads are scheduled to run. In most cases, the default mechanism is adequate.
The Pthreads API provides routines to explicitly set thread scheduling policies and priorities which may override the default
mechanisms.
The API does not require implementations to support these features.
Keys: Thread-Specific Data
As threads call and return from different routines, the local data on a thread's stack comes and goes.
To preserve stack data you can usually pass it as an argument from one routine to the next, or else store the data in a global
variable associated with a thread.
Pthreads provides another, possibly more convenient and versatile, way of accomplishing this through keys.
Mutex Protocol Attributes and Mutex Priority Management for the handling of "priority inversion" problems.
Condition Variable Sharing - across processes
Thread Cancellation
Threads and Signals
Synchronization constructs - barriers and locks

Pthread Exercise 2
Mutexes, Condition Variables and Hybrid MPI with Pthreads
Overview:
Login to the LC workshop cluster, if you are not already logged in
Mutexes: review and run the provided example codes
Condition variables: review and run the provided example codes
Hybrid MPI with Pthreads: review and run the provided example codes
GO TO THE EXERCISE HERE


This completes the tutorial.


Please complete the online evaluation form - unless you are doing the exercise, in which case please complete it at the
end of the exercise.


References and More Information


Author: Blaise Barney, Livermore Computing.
POSIX Standard: www.unix.org/version3/ieee_std.html
"Pthreads Programming". B. Nichols et al. O'Reilly and Associates.
"Threads Primer". B. Lewis and D. Berg. Prentice Hall.
"Programming With POSIX Threads". D. Butenhof. Addison Wesley.
"Programming With Threads". S. Kleiman et al. Prentice Hall.

Appendix A: Pthread Library Routines Reference


For convenience, an alphabetical list of Pthread routines, linked to their corresponding man page, is provided below.
pthread_atfork
pthread_attr_destroy
pthread_attr_getdetachstate
pthread_attr_getguardsize
pthread_attr_getinheritsched
pthread_attr_getschedparam
pthread_attr_getschedpolicy
pthread_attr_getscope
pthread_attr_getstack
pthread_attr_getstackaddr
pthread_attr_getstacksize
pthread_attr_init
pthread_attr_setdetachstate
pthread_attr_setguardsize
pthread_attr_setinheritsched
pthread_attr_setschedparam
pthread_attr_setschedpolicy
pthread_attr_setscope
pthread_attr_setstack
pthread_attr_setstackaddr
pthread_attr_setstacksize
pthread_barrier_destroy
pthread_barrier_init
pthread_barrier_wait
pthread_barrierattr_destroy
pthread_barrierattr_getpshared
pthread_barrierattr_init
pthread_barrierattr_setpshared
pthread_cancel
pthread_cleanup_pop
pthread_cleanup_push

pthread_cond_broadcast
pthread_cond_destroy
pthread_cond_init
pthread_cond_signal
pthread_cond_timedwait
pthread_cond_wait
pthread_condattr_destroy
pthread_condattr_getclock
pthread_condattr_getpshared
pthread_condattr_init
pthread_condattr_setclock
pthread_condattr_setpshared
pthread_create
pthread_detach
pthread_equal
pthread_exit
pthread_getconcurrency
pthread_getcpuclockid
pthread_getschedparam
pthread_getspecific
pthread_join
pthread_key_create
pthread_key_delete
pthread_kill
pthread_mutex_destroy
pthread_mutex_getprioceiling
pthread_mutex_init
pthread_mutex_lock
pthread_mutex_setprioceiling
pthread_mutex_timedlock
pthread_mutex_trylock
pthread_mutex_unlock
pthread_mutexattr_destroy
pthread_mutexattr_getprioceiling
pthread_mutexattr_getprotocol
pthread_mutexattr_getpshared
pthread_mutexattr_gettype
pthread_mutexattr_init
pthread_mutexattr_setprioceiling
pthread_mutexattr_setprotocol
pthread_mutexattr_setpshared
pthread_mutexattr_settype
pthread_once
pthread_rwlock_destroy
pthread_rwlock_init
pthread_rwlock_rdlock
pthread_rwlock_timedrdlock
pthread_rwlock_timedwrlock
pthread_rwlock_tryrdlock
pthread_rwlock_trywrlock
pthread_rwlock_unlock
pthread_rwlock_wrlock
pthread_rwlockattr_destroy
pthread_rwlockattr_getpshared
pthread_rwlockattr_init
pthread_rwlockattr_setpshared
pthread_self
pthread_setcancelstate
pthread_setcanceltype
pthread_setconcurrency
pthread_setschedparam
pthread_setschedprio
pthread_setspecific
pthread_sigmask
pthread_spin_destroy
pthread_spin_init
pthread_spin_lock
pthread_spin_trylock
pthread_spin_unlock
pthread_testcancel
https://computing.llnl.gov/tutorials/pthreads/

Last Modified: 06/16/2016 22:08:34 blaiseb@llnl.gov


UCRL-MI-133316

