OpenMP 01 Introduction

Introduction to OpenMP
Christian Terboven, Dirk Schmidl

IT Center, RWTH Aachen University
Member of the HPC Group
{terboven,schmidl}@itc.rwth-aachen.de

History
 De-facto standard for Shared-Memory Parallelization.
 1997: OpenMP 1.0 for FORTRAN

 1998: OpenMP 1.0 for C and C++
(errata) http://www.OpenMP.org
 2002: OpenMP 2.0 for C and C++
 2005: OpenMP 2.5 now includes both programming languages.
 05/2008: OpenMP 3.0 release
 07/2011: OpenMP 3.1 release
 07/2013: OpenMP 4.0 release
 11/2015: OpenMP 4.5 release
RWTH Aachen University has been a member of the OpenMP Architecture Review Board (ARB) since 2006.
Multi-Core System Architecture
Moore's Law still holds!
The number of transistors on a chip is still doubling every 24 months …
… but the clock speed is no longer increasing that fast!
Instead, we will see many more cores per chip!
Source: Herb Sutter, www.gotw.ca/publications/concurrency-ddj.htm
Example of an SMP system
 Dual-socket Intel Woodcrest (dual-core) system
 Two cores per chip, 3.0 GHz
 Each chip has 4 MB of L2 cache on-chip, shared by both cores
 No off-chip cache
 Bus: Front-side bus
 SMP: Symmetric Multi Processor
 Memory access time is uniform on all cores
 Limited scalability
[Figure: four cores, two per chip; each chip has one on-chip cache; both chips are connected via a bus to memory]
OpenMP Overview & Parallel Region
OpenMP's machine model
 OpenMP: Shared-Memory Parallel Programming Model.
All processors/cores access a shared main memory.
Real architectures are more complex, as we will see later / as we have seen.
Parallelization in OpenMP employs multiple threads.
[Figure: four processors (Proc), each with its own cache, connected through a crossbar / bus to a shared memory]
OpenMP Execution Model
 OpenMP programs start with just one thread: The Master.
 Worker threads are spawned at Parallel Regions; together with the Master they form the Team of threads.
 In between Parallel Regions the Worker threads are put to sleep. The OpenMP Runtime takes care of all thread management work.
 Concept: Fork-Join.
 Allows for an incremental parallelization!
[Figure: fork-join timeline; the Master Thread runs the Serial Parts, Worker (Slave) Threads join it in each Parallel Region]
Parallel Region and Structured Blocks
 The parallelism has to be expressed explicitly.
C/C++
#pragma omp parallel
{
    ...
    structured block
    ...
}
Fortran
!$omp parallel
    ...
    structured block
    ...
!$omp end parallel
 Structured Block
 Exactly one entry point at the top
 Exactly one exit point at the bottom
 Branching in or out is not allowed
 Terminating the program is allowed (abort / exit)
 Specification of number of threads:
 Environment variable: OMP_NUM_THREADS=…
 Or: Via num_threads clause: add num_threads(num) to the parallel construct
Demo
Hello OpenMP World
Demo
Hello orphaned OpenMP World
Starting OpenMP Programs on Linux
 From within a shell, global setting of the number of threads:

export OMP_NUM_THREADS=4
./program
 From within a shell, one-time setting of the number of threads:

OMP_NUM_THREADS=4 ./program
For Worksharing Construct
For Worksharing
 If only the parallel construct is used, each thread executes the Structured Block.
 Program Speedup: Worksharing
 OpenMP's most common Worksharing construct: for
C/C++
int i;
#pragma omp for
for (i = 0; i < 100; i++)
{
    a[i] = b[i] + c[i];
}
Fortran
INTEGER :: i
!$omp do
DO i = 0, 99
    a(i) = b(i) + c(i)
END DO
 Distribution of loop iterations over all threads in a Team.
 Scheduling of the distribution can be influenced.
 Loops often account for most of a program's runtime!
Worksharing illustrated
Pseudo-Code (here: 4 Threads)
Serial:
do i = 0, 99
    a(i) = b(i) + c(i)
end do
Thread 1:
do i = 0, 24
    a(i) = b(i) + c(i)
end do
Thread 2:
do i = 25, 49
    a(i) = b(i) + c(i)
end do
Thread 3:
do i = 50, 74
    a(i) = b(i) + c(i)
end do
Thread 4:
do i = 75, 99
    a(i) = b(i) + c(i)
end do
Memory: A(0) … A(99), B(0) … B(99), C(0) … C(99)
Demo
Vector Addition
Influencing the For Loop Scheduling
 for-construct: OpenMP allows you to influence how the iterations are scheduled among the threads of the team, via the schedule clause:
 schedule(static [, chunk]): The iteration space is divided into blocks of chunk size; blocks are assigned to threads in a round-robin fashion. If chunk is not specified: #threads blocks, i.e. one block of approximately equal size per thread.
 schedule(dynamic [, chunk]): The iteration space is divided into blocks of chunk size (if not specified: 1); blocks are scheduled to threads in the order in which threads finish previous blocks.
 schedule(guided [, chunk]): Similar to dynamic, but the block size starts with an implementation-defined value and is then decreased exponentially down to chunk.
 Default on most implementations is schedule(static).
Synchronization Overview
 Can all loops be parallelized with for-constructs? No!
 Simple test: If the results differ when the code is executed backwards, the loop iterations are not independent. BUT: This test alone is not sufficient:
C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
    s = s + a[i];
}
 Data Race: If, between two synchronization points, at least one thread writes to a memory location from which at least one other thread reads, the result is not deterministic (race condition).
Synchronization: Critical Region
 A Critical Region is executed by all threads, but by only one thread at a time (Mutual Exclusion).
C/C++
#pragma omp critical (name)
{
    ... structured block ...
}
 Do you think this solution scales well?
C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
    #pragma omp critical
    { s = s + a[i]; }
}
Data Scoping
Scoping Rules
 Managing the Data Environment is the challenge of OpenMP.
 Scoping in OpenMP: Dividing variables into shared and private:
 private-list and shared-list on Parallel Region
 private-list and shared-list on Worksharing constructs
 General default is shared for Parallel Region, firstprivate for Tasks.
 Loop control variables on for-constructs are private
 Non-static variables local to Parallel Regions are private
 private: A new uninitialized instance is created for each thread
 firstprivate: Initialization with Master's value
 lastprivate: Value of last loop iteration is written back to Master
 Static variables are shared

Privatization of Global/Static Variables
 Global / static variables can be privatized with the threadprivate directive
 One instance is created for each thread
 Before the first parallel region is encountered
 Instance exists until the program ends
 Does not work (well) with nested Parallel Region
 Based on thread-local storage (TLS)
 TlsAlloc (Win32-Threads), pthread_key_create (Posix-Threads), keyword __thread (GNU extension)
C/C++
static int i;
#pragma omp threadprivate(i)
Fortran
INTEGER, SAVE :: i
!$omp threadprivate(i)
The Barrier Construct
The Barrier Construct
 OpenMP barrier (implicit or explicit)

 Threads wait until all threads of the current Team have reached the barrier
C/C++
#pragma omp barrier
 All worksharing constructs contain an implicit barrier at the end
Back to our bad scaling example
C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
    #pragma omp critical
    { s = s + a[i]; }
}
It's your turn: Make It Scale!
#pragma omp parallel
{
    #pragma omp for
    for (i = 0; i < 100; i++)
    {
        s = s + a[i];
    }
} // end parallel
Pseudo-Code (4 Threads)
Serial:
do i = 0, 99
    s = s + a(i)
end do
Thread 1:
do i = 0, 24
    s = s + a(i)
end do
Thread 2:
do i = 25, 49
    s = s + a(i)
end do
Thread 3:
do i = 50, 74
    s = s + a(i)
end do
Thread 4:
do i = 75, 99
    s = s + a(i)
end do
The Reduction Clause
 In a reduction operation the operator is applied to all variables in the list. The variables have to be shared.
 reduction(operator:list)
 The result is provided in the associated reduction variable
C/C++
int i, s = 0;
#pragma omp parallel for reduction(+:s)
for (i = 0; i < 100; i++)
{
    s = s + a[i];
}
 Possible reduction operators with initialization value:
+ (0), * (1), - (0), & (~0), | (0), && (1), || (0), ^ (0), min (largest number), max (least number)
False Sharing
double s_priv[nthreads];
for (i = 0; i < nthreads; i++)
{
    s_priv[i] = 0.0; // a variable-length array cannot have an initializer
}
#pragma omp parallel num_threads(nthreads)
{
    int t=omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < 100; i++)
    {
        s_priv[t] += a[i];
    }
} // end parallel
for (i = 0; i < nthreads; i++)
{
    s += s_priv[i];
}
Data in Caches
 When data is used, it is copied into caches.
 The hardware always copies chunks into the cache, so-called cache-lines.
 This is useful, when:
 the data is used frequently (temporal locality)
 consecutive data is used which is on the same cache-line (spatial locality)
[Figure: two cores, each with an on-chip cache, connected via a bus to memory]
False Sharing
 False Sharing occurs when
 different threads use elements of the same cache-line
 one of the threads writes to the cache-line
 As a result the cache line is moved between the threads, although there is no real dependency
 Note: False Sharing is a memory performance problem, not a correctness issue
[Figure: two cores, each with an on-chip cache, connected via a bus]
False Sharing
 no performance benefit for more threads
 Reason: false sharing of s_priv
 Solution: padding so that only one variable per cache line is used
[Figure: MFLOPS (0 to 4000) over #threads (1 to 12); two series: with false sharing, without false sharing]
[Figure: layout of s_priv in cache line 1 and cache line 2; Standard: elements 1 2 3 4 … stored contiguously; With padding: elements 1 2 3 … separated by padding]
False Sharing avoided
double s_priv[nthreads * 8];
for (i = 0; i < nthreads; i++)
{
    s_priv[i * 8] = 0.0; // a variable-length array cannot have an initializer
}
#pragma omp parallel num_threads(nthreads)
{
    int t=omp_get_thread_num();
    #pragma omp for
    for (i = 0; i < 100; i++)
    {
        s_priv[t * 8] += a[i];
    }
} // end parallel
for (i = 0; i < nthreads; i++)
{
    s += s_priv[i * 8];
}
Example
PI
Example: Pi (1/2)
\pi = \int_0^1 \frac{4}{1 + x^2} \, dx
double f(double x)
{
    return (4.0 / (1.0 + x*x));
}
double CalcPi (int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;

    for (i = 0; i < n; i++)
    {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
    }
    return fH * fSum;
}
Example: Pi (1/2)
double f(double x)
{
    return (4.0 / (1.0 + x*x));
}
double CalcPi (int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
    #pragma omp parallel for private(fX,i) reduction(+:fSum)
    for (i = 0; i < n; i++)
    {
        fX = fH * ((double)i + 0.5);
        fSum += f(fX);
    }
    return fH * fSum;
}
Example: Pi (2/2)
 Results:
# Threads Runtime [sec.] Speedup
1 1.11 1.00
2
4
8 0.14 7.93
 Scalability is pretty good:
 About 100% of the runtime has been parallelized.
 As there is just one parallel region, there is virtually no overhead introduced by the parallelization.
 Problem is parallelizable in a trivial fashion ...
Single and Master Construct
The Single Construct
C/C++
#pragma omp single [clause]
... structured block ...
Fortran
!$omp single [clause]
... structured block ...
!$omp end single
 The single construct specifies that the enclosed structured block is executed by only one thread of the team.
 It is up to the runtime which thread that is.
 Useful for:
 I/O
 Memory allocation and deallocation, etc. (in general: setup work)
 Implementation of the single-creator parallel-executor pattern as we will see now…
The Master Construct
C/C++
#pragma omp master
... structured block ...
Fortran
!$omp master
... structured block ...
!$omp end master
 The master construct specifies that the enclosed structured block is executed only by the master thread of a team.
 Note: The master construct is not a worksharing construct and does not contain an implicit barrier at the end.
Section and Ordered Construct
How to parallelize a Tree Traversal?
 How would you parallelize this code?

void traverse (Tree *tree)
{
if (tree->left) traverse(tree->left);
if (tree->right) traverse(tree->right);
process(tree);
}
 One option: Use OpenMP‘s parallel sections.
The Sections Construct
C/C++
#pragma omp sections [clause]
{
    #pragma omp section
    ... structured block ...
    #pragma omp section
    ... structured block ...
    ...
}
Fortran
!$omp sections [clause]
!$omp section
... structured block ...
!$omp section
... structured block ...
...
!$omp end sections
 The sections construct contains a set of structured blocks that are to be distributed among and executed by the team of threads.
How to parallelize a Tree Traversal?!
 How would you parallelize this code?
void traverse (Tree *tree)
{
    // Nested Parallel Regions
    #pragma omp parallel sections
    {
        #pragma omp section
        if (tree->left) traverse(tree->left);
        #pragma omp section
        if (tree->right) traverse(tree->right);
    } // end omp parallel: Barrier here!
    process(tree);
}
 Downsides of this option:
 Unnecessary overhead and synchronization points
 Not always well supported (how many threads to be used?)
The ordered Construct
 Allows a structured block within a parallel loop to be executed in sequential order
 In addition, an ordered clause has to be added to the for construct in which any ordered construct may occur
#pragma omp parallel for ordered
for (i=0 ; i<10 ; i++){
    ...
    #pragma omp ordered
    {
        ...
    }
    ...
}
 Use Cases:
 Can be used e.g. to enforce ordering on printing of data
 May help to determine whether there is a data race
Runtime Library
Runtime Library
 C and C++:
 If OpenMP is enabled during compilation, the preprocessor symbol _OPENMP is defined. To use the OpenMP runtime library, the header omp.h has to be included.
 omp_set_num_threads(int): The specified number of threads will be used for the parallel region encountered next.
 int omp_get_num_threads(): Returns the number of threads in the current team.
 int omp_get_thread_num(): Returns the number of the calling thread in the team; the Master always has the id 0.
 Additional functions are available, e.g. to provide locking functionality.
Tasking
Recursive approach to compute Fibonacci
int main(int argc, char* argv[])
{
    [...]
    fib(input);
    [...]
}
int fib(int n) {
    if (n < 2) return n;
    int x = fib(n - 1);
    int y = fib(n - 2);
    return x+y;
}
 On the following slides we will discuss three approaches to parallelize this recursive code with Tasking.
The Task Construct
C/C++
#pragma omp task [clause]
... structured block ...
Fortran
!$omp task [clause]
... structured block ...
!$omp end task
 Each encountering thread/task creates a new Task
 Code and data is being packaged up
 Tasks can be nested
 Into another Task directive
 Into a Worksharing construct
 Data scoping clauses:
 shared(list)
 private(list)
 firstprivate(list)
 default(shared | none)
Tasks in OpenMP: Data Scoping
 Some rules from Parallel Regions apply:

 Static and Global variables are shared
 Automatic Storage (local) variables are private
 If shared scoping is not derived by default:

 Orphaned Task variables are firstprivate by default!
 Non-Orphaned Task variables inherit the shared attribute!
 Variables are firstprivate unless shared in the enclosing context
First version parallelized with Tasking (omp-v1)
int main(int argc, char* argv[])
{
    [...]
    #pragma omp parallel
    {
        #pragma omp single
        {
            fib(input);
        }
    }
    [...]
}
int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    {
        x = fib(n - 1);
    }
    #pragma omp task shared(y)
    {
        y = fib(n - 2);
    }
    #pragma omp taskwait
    return x+y;
}
o Only one Task / Thread enters fib() from main(); it is responsible for creating the two initial work tasks
o Taskwait is required, as otherwise x and y would be lost
Fibonacci Illustration
 T1 enters fib(4)
 T1 creates tasks for fib(3) and fib(2)
 T1 and T2 execute tasks from the queue
 T1 and T2 create 4 new tasks
 T1 - T4 execute tasks
 …
[Figure: task tree with root fib(4), children fib(3) and fib(2), grandchildren fib(2), fib(1), fib(1), fib(0), and below them fib(1), fib(0); next to it the Task Queue holding the tasks that are waiting for execution]
Scalability measurements (1/3)
 Overhead of task creation prevents better scalability!
[Figure: Speedup of Fibonacci with Tasks; Speedup (0 to 9) over #Threads (1, 2, 4, 8); series: optimal, omp-v1]
if Clause
 If the expression of an if clause on a task evaluates to false
 The encountering task is suspended
 The new task is executed immediately
 The parent task resumes when the new task finishes
→ Used for optimization, e.g., avoid creation of small tasks
Improved parallelization with Tasking (omp-v2)
 Improvement: Don't create yet another task once a certain (small enough) n is reached
int main(int argc, char* argv[])
{
    [...]
    #pragma omp parallel
    {
        #pragma omp single
        {
            fib(input);
        }
    }
    [...]
}
int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x) \
        if(n > 30)
    {
        x = fib(n - 1);
    }
    #pragma omp task shared(y) \
        if(n > 30)
    {
        y = fib(n - 2);
    }
    #pragma omp taskwait
    return x+y;
}
 Speedup is ok, but we still have some overhead when running with 4 or 8 threads
[Figure: Speedup of Fibonacci with Tasks; Speedup (0 to 9) over #Threads (1, 2, 4, 8); series: optimal, omp-v1, omp-v2]
Improved parallelization with Tasking (omp-v3)
 Improvement: Skip the OpenMP overhead once a certain n is reached (no issue w/ production compilers)
int main(int argc, char* argv[])
{
    [...]
    #pragma omp parallel
    {
        #pragma omp single
        {
            fib(input);
        }
    }
    [...]
}
int fib(int n) {
    if (n < 2) return n;
    if (n <= 30)
        return serfib(n);
    int x, y;
    #pragma omp task shared(x)
    {
        x = fib(n - 1);
    }
    #pragma omp task shared(y)
    {
        y = fib(n - 2);
    }
    #pragma omp taskwait // required before x and y are read
    return x+y;
}
 Everything ok now :-)
[Figure: Speedup of Fibonacci with Tasks; Speedup (0 to 9) over #Threads (1, 2, 4, 8); series: optimal, omp-v1, omp-v2, omp-v3]
Data Scoping Example
int a = 1;
void foo()
{
    int b = 2, c = 3;
    #pragma omp parallel shared(b)
    #pragma omp parallel private(b)
    {
        int d = 4;
        #pragma omp task
        {
            int e = 5;
            // Scope of a: shared, value of a: 1
            // Scope of b: firstprivate, value of b: 0 / undefined
            // Scope of c: shared, value of c: 3
            // Scope of d: firstprivate, value of d: 4
            // Scope of e: private, value of e: 5
        }
    }
}
Hint: Use default(none) to be forced to think about every variable if you do not see clearly.
The Barrier and Taskwait Constructs
 OpenMP barrier (implicit or explicit)
 All tasks created by any thread of the current Team are guaranteed to be completed at barrier exit
C/C++
#pragma omp barrier
 Task barrier: taskwait
 Encountering Task suspends until child tasks are complete
 Only direct children, not descendants!
C/C++
#pragma omp taskwait
Task Synchronization
 Task Synchronization explained:
#pragma omp parallel num_threads(np)
{
    // np Tasks created here, one for each thread
    #pragma omp task
    function_A();
    // All Tasks guaranteed to be completed here
    #pragma omp barrier
    #pragma omp single
    {
        // 1 Task created here
        #pragma omp task
        function_B();
    }
    // B-Task guaranteed to be completed here
}
Questions?
