
Introduction to OpenMP

Christian Terboven, Dirk Schmidl


IT Center, RWTH Aachen University
Member of the HPC Group
{terboven,schmidl}@itc.rwth-aachen.de

IT Center der RWTH Aachen University


History

• De-facto standard for Shared-Memory Parallelization.

• 1997: OpenMP 1.0 for FORTRAN
• 1998: OpenMP 1.0 for C and C++
• 1999: OpenMP 1.1 for FORTRAN (errata)
• 2000: OpenMP 2.0 for FORTRAN
• 2002: OpenMP 2.0 for C and C++
• 2005: OpenMP 2.5 now includes both programming languages
• 05/2008: OpenMP 3.0 release
• 07/2011: OpenMP 3.1 release
• 07/2013: OpenMP 4.0 release
• 11/2015: OpenMP 4.5 release

http://www.OpenMP.org

RWTH Aachen University has been a member of the OpenMP Architecture Review Board (ARB) since 2006.

2 Introduction to OpenMP
C. Terboven | IT Center der RWTH Aachen University
Multi-Core System Architecture
Moore's Law still holds!

• The number of transistors on a chip is still doubling every 24 months …
• … but the clock speed is no longer increasing that fast!
• Instead, we will see many more cores per chip!

Source: Herb Sutter
www.gotw.ca/publications/concurrency-ddj.htm
Example for an SMP system

• Dual-socket Intel Woodcrest (dual-core) system
  • Two cores per chip, 3.0 GHz
  • Each chip has 4 MB of L2 cache on-chip, shared by both cores
  • No off-chip cache
  • Bus: Frontside bus

• SMP: Symmetric Multi-Processor
  • Memory access time is uniform on all cores
  • Limited scalability

[Diagram: two dual-core chips, each with a shared on-chip cache, connected via a bus to memory]
OpenMP Overview & Parallel Region
OpenMP's machine model

• OpenMP: Shared-Memory Parallel Programming Model.

• All processors/cores access a shared main memory.
• Real architectures are more complex, as we have seen.
• Parallelization in OpenMP employs multiple threads.

[Diagram: four processors, each with its own cache, connected through a crossbar / bus to a shared memory]
OpenMP Execution Model

• OpenMP programs start with just one thread: the Master.
• Worker threads are spawned at Parallel Regions; together with the Master they form the Team of threads.
• In between Parallel Regions the Worker threads are put to sleep. The OpenMP Runtime takes care of all thread management work.
• Concept: Fork-Join.
• Allows for an incremental parallelization!

[Diagram: serial parts executed by the Master thread alternate with Parallel Regions executed by the whole team (fork-join)]
Parallel Region and Structured Blocks

• The parallelism has to be expressed explicitly.

C/C++
#pragma omp parallel
{
   ... structured block ...
}

Fortran
!$omp parallel
   ... structured block ...
!$omp end parallel

• Structured Block
  • Exactly one entry point at the top
  • Exactly one exit point at the bottom
  • Branching in or out is not allowed
  • Terminating the program is allowed (abort / exit)

• Specification of the number of threads:
  • Environment variable: OMP_NUM_THREADS=…
  • Or via the num_threads clause: add num_threads(num) to the parallel construct
Demo: Hello OpenMP World
Demo: Hello orphaned OpenMP World
Starting OpenMP Programs on Linux

• From within a shell, global setting of the number of threads:

export OMP_NUM_THREADS=4
./program

• From within a shell, one-time setting of the number of threads:

OMP_NUM_THREADS=4 ./program
For Worksharing Construct
For Worksharing

• If only the parallel construct is used, each thread executes the Structured Block.
• Program speedup requires Worksharing.
• OpenMP's most common Worksharing construct: for

C/C++
int i;
#pragma omp for
for (i = 0; i < 100; i++)
{
   a[i] = b[i] + c[i];
}

Fortran
INTEGER :: i
!$omp do
DO i = 0, 99
   a(i) = b(i) + c(i)
END DO

• Distribution of loop iterations over all threads in a Team.
• Scheduling of the distribution can be influenced.
• Loops often account for most of a program's runtime!
Worksharing illustrated

Serial pseudo-code:

do i = 0, 99
   a(i) = b(i) + c(i)
end do

With 4 threads, the iteration space is split into four blocks:

Thread 1:  i = 0, 24
Thread 2:  i = 25, 49
Thread 3:  i = 50, 74
Thread 4:  i = 75, 99

Each thread executes a(i) = b(i) + c(i) for its block. All threads work on the same shared arrays A(0:99), B(0:99) and C(0:99) in memory.
Demo: Vector Addition
Influencing the For Loop Scheduling

• for-construct: OpenMP allows to influence how the iterations are scheduled among the threads of the team, via the schedule clause:
  • schedule(static [, chunk]): Iteration space divided into blocks of chunk size; blocks are assigned to threads in a round-robin fashion. If chunk is not specified: #threads blocks.
  • schedule(dynamic [, chunk]): Iteration space divided into blocks of chunk size (if not specified: 1); blocks are scheduled to threads in the order in which threads finish previous blocks.
  • schedule(guided [, chunk]): Similar to dynamic, but the block size starts with an implementation-defined value, then is decreased exponentially down to chunk.
• Default on most implementations is schedule(static).
Synchronization Overview

• Can all loops be parallelized with for-constructs? No!
• Simple test: If the results differ when the code is executed backwards, the loop iterations are not independent. BUT: this test alone is not sufficient:

C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

• Data Race: If between two synchronization points at least one thread writes to a memory location from which at least one other thread reads, the result is not deterministic (race condition).
Synchronization: Critical Region

• A Critical Region is executed by all threads, but by only one thread simultaneously (Mutual Exclusion).

C/C++
#pragma omp critical (name)
{
   ... structured block ...
}

• Do you think this solution scales well?

C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
   #pragma omp critical
   { s = s + a[i]; }
}
Data Scoping
Scoping Rules

• Managing the Data Environment is the challenge of OpenMP.

• Scoping in OpenMP: dividing variables into shared and private:
  • private-list and shared-list on Parallel Region
  • private-list and shared-list on Worksharing constructs
  • General default is shared for Parallel Region, firstprivate for Tasks.
  • Loop control variables on for-constructs are private
  • Non-static variables local to Parallel Regions are private
  • private: A new, uninitialized instance is created for each thread
    • firstprivate: Initialization with the Master's value
    • lastprivate: Value of the last loop iteration is written back to the Master
  • Static variables are shared
Privatization of Global/Static Variables

• Global / static variables can be privatized with the threadprivate directive
  • One instance is created for each thread
    • Before the first parallel region is encountered
    • Instance exists until the program ends
    • Does not work (well) with nested Parallel Regions
  • Based on thread-local storage (TLS)
    • TlsAlloc (Win32 threads), pthread_key_create (POSIX threads), keyword __thread (GNU extension)

C/C++
static int i;
#pragma omp threadprivate(i)

Fortran
SAVE INTEGER :: i
!$omp threadprivate(i)
The Barrier Construct

• OpenMP barrier (implicit or explicit)
  • Threads wait until all threads of the current Team have reached the barrier

C/C++
#pragma omp barrier

• All worksharing constructs contain an implicit barrier at the end
Back to our bad scaling example

C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
   #pragma omp critical
   { s = s + a[i]; }
}
It's your turn: Make It Scale!

#pragma omp parallel
{
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s = s + a[i];
   }
} // end parallel

The serial loop

do i = 0, 99
   s = s + a(i)
end do

should be split so that each of the four threads sums one block (i = 0..24, 25..49, 50..74, 75..99). Note that the code above still has a race on s; the following slides show how to fix it.
The Reduction Clause

• In a reduction operation the operator is applied to all variables in the list. The variables have to be shared.
  • reduction(operator:list)
  • The result is provided in the associated reduction variable

C/C++
int i, s = 0;
#pragma omp parallel for reduction(+:s)
for (i = 0; i < 100; i++)
{
   s = s + a[i];
}

• Possible reduction operators with initialization value:
  + (0), * (1), - (0), & (~0), | (0), && (1), || (0), ^ (0), min (largest number), max (smallest number)
False Sharing

double s_priv[nthreads];
#pragma omp parallel num_threads(nthreads)
{
   int t = omp_get_thread_num();
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s_priv[t] += a[i];
   }
} // end parallel
for (i = 0; i < nthreads; i++)
{
   s += s_priv[i];
}
Data in Caches

• When data is used, it is copied into caches.
• The hardware always copies chunks into the cache, so-called cache lines.
• This is useful when:
  • the data is used frequently (temporal locality)
  • consecutive data on the same cache line is used (spatial locality)

[Diagram: two cores with private on-chip caches, connected via a bus to memory]
False Sharing

• False Sharing occurs when
  • different threads use elements of the same cache line
  • one of the threads writes to the cache line
• As a result the cache line is moved back and forth between the threads, although there is no real dependency.
• Note: False Sharing is a memory performance problem, not a correctness issue.

[Diagram: two cores with private on-chip caches, connected via a bus, bouncing a shared cache line]
False Sharing

• No performance benefit for more threads
• Reason: false sharing of s_priv
• Solution: padding, so that only one variable per cache line is used

[Plot: MFLOPS over 1 to 12 threads; without false sharing the performance scales, with false sharing it does not]

Memory layout:
  Standard:      consecutive s_priv elements share cache lines
  With padding:  each s_priv element occupies a cache line of its own
False Sharing avoided

double s_priv[nthreads * 8];
#pragma omp parallel num_threads(nthreads)
{
   int t = omp_get_thread_num();
   #pragma omp for
   for (i = 0; i < 100; i++)
   {
      s_priv[t * 8] += a[i];
   }
} // end parallel
for (i = 0; i < nthreads; i++)
{
   s += s_priv[i * 8];
}
Example: PI
Example: Pi (1/2)

The integral to approximate:

π = ∫₀¹ 4 / (1 + x²) dx

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH   = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}

(Note: fX and fSum are shared here, so this version contains a data race.)
Example: Pi (1/2)

The same code, now with correct data scoping and a reduction:

double f(double x)
{
   return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
   const double fH   = 1.0 / (double) n;
   double fSum = 0.0;
   double fX;
   int i;

   #pragma omp parallel for private(fX,i) reduction(+:fSum)
   for (i = 0; i < n; i++)
   {
      fX = fH * ((double)i + 0.5);
      fSum += f(fX);
   }
   return fH * fSum;
}
Example: Pi (2/2)

• Results:

# Threads   Runtime [sec.]   Speedup
1           1.11             1.00
2
4
8           0.14             7.93

• Scalability is pretty good:
  • About 100% of the runtime has been parallelized.
  • As there is just one parallel region, there is virtually no overhead introduced by the parallelization.
  • Problem is parallelizable in a trivial fashion ...
Single and Master Construct
The Single Construct

C/C++
#pragma omp single [clause]
   ... structured block ...

Fortran
!$omp single [clause]
   ... structured block ...
!$omp end single

• The single construct specifies that the enclosed structured block is executed by only one thread of the team.
• It is up to the runtime which thread that is.
• Useful for:
  • I/O
  • Memory allocation and deallocation, etc. (in general: setup work)
  • Implementation of the single-creator parallel-executor pattern, as we will see now…
The Master Construct

C/C++
#pragma omp master [clause]
   ... structured block ...

Fortran
!$omp master [clause]
   ... structured block ...
!$omp end master

• The master construct specifies that the enclosed structured block is executed only by the master thread of a team.
• Note: The master construct is not a worksharing construct and does not contain an implicit barrier at the end.
Section and Ordered Construct
How to parallelize a Tree Traversal?

• How would you parallelize this code?

void traverse (Tree *tree)
{
   if (tree->left) traverse(tree->left);
   if (tree->right) traverse(tree->right);
   process(tree);
}

• One option: Use OpenMP's parallel sections.
The Sections Construct

C/C++
#pragma omp sections [clause]
{
   #pragma omp section
      ... structured block ...
   #pragma omp section
      ... structured block ...
   ...
}

Fortran
!$omp sections [clause]
   !$omp section
      ... structured block ...
   !$omp section
      ... structured block ...
   ...
!$omp end sections

• The sections construct contains a set of structured blocks that are to be distributed among and executed by the team of threads.
How to parallelize a Tree Traversal?!

• How would you parallelize this code?

void traverse (Tree *tree)
{
   #pragma omp parallel sections   // Nested Parallel Regions
   {
      #pragma omp section
      if (tree->left) traverse(tree->left);
      #pragma omp section
      if (tree->right) traverse(tree->right);
   } // end omp parallel -- Barrier here!
   process(tree);
}

• Downsides of this option:
  • Unnecessary overhead and synchronization points
  • Not always well supported (how many threads are to be used?)
The ordered Construct

• Allows executing a structured block within a parallel loop in sequential order
• In addition, an ordered clause has to be added to the for construct in which any ordered construct may occur

#pragma omp parallel for ordered
for (i = 0; i < 10; i++) {
   ...
   #pragma omp ordered
   {
      ...
   }
   ...
}

• Use Cases:
  • Can be used e.g. to enforce ordering on printing of data
  • May help to determine whether there is a data race
Runtime Library

• C and C++:
  • If OpenMP is enabled during compilation, the preprocessor symbol _OPENMP is defined. To use the OpenMP runtime library, the header omp.h has to be included.
  • omp_set_num_threads(int): The specified number of threads will be used for the parallel region encountered next.
  • int omp_get_num_threads(): Returns the number of threads in the current team.
  • int omp_get_thread_num(): Returns the number of the calling thread in the team; the Master always has the id 0.
• Additional functions are available, e.g. to provide locking functionality.
Tasking
Recursive approach to compute Fibonacci

int main(int argc, char* argv[])
{
   [...]
   fib(input);
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x = fib(n - 1);
   int y = fib(n - 2);
   return x + y;
}

• On the following slides we will discuss three approaches to parallelize this recursive code with Tasking.
The Task Construct

C/C++
#pragma omp task [clause]
   ... structured block ...

Fortran
!$omp task [clause]
   ... structured block ...
!$omp end task

• Each encountering thread/task creates a new Task
  • Code and data is being packaged up
  • Tasks can be nested
    • Into another Task directive
    • Into a Worksharing construct
• Data scoping clauses:
  • shared(list)
  • private(list), firstprivate(list)
  • default(shared | none)
Tasks in OpenMP: Data Scoping

• Some rules from Parallel Regions apply:
  • Static and Global variables are shared
  • Automatic Storage (local) variables are private
• If shared scoping is not derived by default:
  • Orphaned Task variables are firstprivate by default!
  • Non-Orphaned Task variables inherit the shared attribute!
  • Variables are firstprivate unless shared in the enclosing context
First version parallelized with Tasking (omp-v1)

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x + y;
}

• Only one Task / Thread enters fib() from main(); it is responsible for creating the two initial work tasks.
• The taskwait is required, as otherwise x and y would be lost.
Fibonacci Illustration

• T1 enters fib(4)
• T1 creates tasks for fib(3) and fib(2)
• T1 and T2 execute tasks from the queue
• T1 and T2 create 4 new tasks
• T1 - T4 execute tasks

[Diagram: call tree of fib(4), with fib(3) and fib(2) on the first level and fib(2), fib(1), fib(1), fib(0) below, plus the Task Queue holding the created tasks]
Fibonacci Illustration

• T1 enters fib(4)
• T1 creates tasks for fib(3) and fib(2)
• T1 and T2 execute tasks from the queue
• T1 and T2 create 4 new tasks
• T1 - T4 execute tasks
• …

[Diagram: the same call tree, expanded one level further down to fib(1) and fib(0)]
Scalability measurements (1/3)

• Overhead of task creation prevents better scalability!

[Plot: speedup of Fibonacci with tasks over 1, 2, 4, 8 threads; omp-v1 stays far below the optimal speedup]
if Clause

• If the expression of an if clause on a task evaluates to false
  • The encountering task is suspended
  • The new task is executed immediately
  • The parent task resumes when the new task finishes
→ Used for optimization, e.g., to avoid creation of small tasks
Improved parallelization with Tasking (omp-v2)

• Improvement: Don't create yet another task once a certain (small enough) n is reached

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   int x, y;
   #pragma omp task shared(x) if(n > 30)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y) if(n > 30)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x + y;
}
Scalability measurements (2/3)

• Speedup is ok, but we still have some overhead when running with 4 or 8 threads

[Plot: speedup over 1, 2, 4, 8 threads; omp-v2 is much closer to optimal than omp-v1, but still falls off at higher thread counts]
Improved parallelization with Tasking (omp-v3)

• Improvement: Skip the OpenMP overhead once a certain n is reached (no issue w/ production compilers)

int main(int argc, char* argv[])
{
   [...]
   #pragma omp parallel
   {
      #pragma omp single
      {
         fib(input);
      }
   }
   [...]
}

int fib(int n) {
   if (n < 2) return n;
   if (n <= 30)
      return serfib(n);   // serial version, no task creation at all
   int x, y;
   #pragma omp task shared(x)
   {
      x = fib(n - 1);
   }
   #pragma omp task shared(y)
   {
      y = fib(n - 2);
   }
   #pragma omp taskwait
   return x + y;
}
Scalability measurements (3/3)

• Everything ok now :-)

[Plot: speedup over 1, 2, 4, 8 threads; omp-v3 follows the optimal speedup closely]
Data Scoping Example (1/7)

int a = 1;
void foo()
{
   int b = 2, c = 3;
   #pragma omp parallel shared(b)
   #pragma omp parallel private(b)
   {
      int d = 4;
      #pragma omp task
      {
         int e = 5;

         // Scope of a:
         // Scope of b:
         // Scope of c:
         // Scope of d:
         // Scope of e:
      } } }
Data Scoping Example (6/7)

Hint: Use default(none) to be forced to think about every variable if you do not see clearly.

int a = 1;
void foo()
{
   int b = 2, c = 3;
   #pragma omp parallel shared(b)
   #pragma omp parallel private(b)
   {
      int d = 4;
      #pragma omp task
      {
         int e = 5;

         // Scope of a: shared
         // Scope of b: firstprivate
         // Scope of c: shared
         // Scope of d: firstprivate
         // Scope of e: private
      } } }
Data Scoping Example (7/7)

int a = 1;
void foo()
{
   int b = 2, c = 3;
   #pragma omp parallel shared(b)
   #pragma omp parallel private(b)
   {
      int d = 4;
      #pragma omp task
      {
         int e = 5;

         // Scope of a: shared,       value of a: 1
         // Scope of b: firstprivate, value of b: 0 / undefined
         // Scope of c: shared,       value of c: 3
         // Scope of d: firstprivate, value of d: 4
         // Scope of e: private,      value of e: 5
      } } }
The Barrier and Taskwait Constructs

• OpenMP barrier (implicit or explicit)
  • All tasks created by any thread of the current Team are guaranteed to be completed at barrier exit

C/C++
#pragma omp barrier

• Task barrier: taskwait
  • Encountering Task suspends until child tasks are complete
    • Only direct children, not descendants!

C/C++
#pragma omp taskwait
Task Synchronization

• Task Synchronization explained:

#pragma omp parallel num_threads(np)
{
   #pragma omp task          // np Tasks created here, one per thread
   function_A();

   #pragma omp barrier       // all Tasks guaranteed to be completed here

   #pragma omp single
   {
      #pragma omp task       // 1 Task created here
      function_B();
   }                         // B-Task guaranteed to be completed here
}
Questions?
