3 Introduction to OpenMP
C. Terboven | IT Center der RWTH Aachen University
Moore's Law still holds!
Example of an SMP system
Bus: front-side bus
Limited scalability
OpenMP Overview & Parallel Region
OpenMP's machine model
Parallelization in OpenMP employs multiple threads.
[Figure: four processors (Proc)]
OpenMP Execution Model
Worker threads are spawned at Parallel Regions; together with the Master they form the Team of threads.
In between Parallel Regions the Worker threads are put to sleep.
The OpenMP Runtime takes care of all thread management work.
Concept: Fork-Join.
Allows for an incremental parallelization!
[Figure: the Master thread runs the Serial Part; Worker threads join it in each Parallel Region]
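The fork-join behaviour described above can be sketched in C as follows. This is a minimal sketch and not taken from the slides: the function name `count_team_members` is hypothetical. Compiled without OpenMP, the pragmas are ignored and the region runs on the single initial thread.

```c
/* Hypothetical helper: forks a team, lets every thread check in once,
   joins again, and returns how many threads took part. */
int count_team_members(void)
{
    int members = 0;              /* shared: declared outside the region */

    #pragma omp parallel          /* fork: worker threads join the master */
    {
        #pragma omp atomic        /* protect the update of the shared counter */
        members++;
    }                             /* join: implicit barrier at the region end */

    return members;               /* serial part again, master only */
}
```

The returned value equals the team size chosen by the runtime, which can typically be influenced through the `OMP_NUM_THREADS` environment variable.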
Parallel Region and Structured Blocks
Demo
Starting OpenMP Programs on Linux
For Worksharing Construct
For Worksharing
Worksharing illustrated
Here: 4 Threads
Pseudo-Code
Serial:
    do i = 0, 99
        a(i) = b(i) + c(i)
    end do
Thread 1:
    do i = 0, 24
        a(i) = b(i) + c(i)
    end do
Thread 2:
    do i = 25, 49
        a(i) = b(i) + c(i)
    end do
Thread 3:
    do i = 50, 74
        a(i) = b(i) + c(i)
    end do
Thread 4:
    do i = 75, 99
        a(i) = b(i) + c(i)
    end do
Memory: A(0) ... A(99), B(0) ... B(99), C(0) ... C(99)
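The pseudo-code above corresponds to a single worksharing loop in C. The following is a minimal sketch; the function name `vector_add` and its signature are assumptions made for illustration, not part of the slides.

```c
/* Adds b and c element-wise into a. The iterations are divided among the
   threads of the team; with n = 100, 4 threads and a static schedule each
   thread gets 25 consecutive iterations, as in the illustration. */
void vector_add(double *a, const double *b, const double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = b[i] + c[i];
    }
}
```

The loop variable `i` is private to each thread, while the arrays are shared. Since every iteration writes a different element of `a`, no synchronization is needed.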
Demo
Vector Addition
Influencing the For Loop Scheduling
Synchronization: Critical Region
Scoping Rules
C/C++:
    static int i;
    #pragma omp threadprivate(i)
Fortran:
    INTEGER, SAVE :: i
    !$omp threadprivate(i)
Privatization of Global/Static Variables
C/C++:
    static int i;
    #pragma omp threadprivate(i)
Fortran:
    INTEGER, SAVE :: i
    !$omp threadprivate(i)
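To illustrate what the threadprivate directive buys, here is a minimal sketch in C. The names `calls`, `record_call` and `each_thread_counts_alone` are hypothetical. Each thread works on its own copy of the static variable, so the updates need no synchronization.

```c
static int calls = 0;                  /* one copy per thread, see next line */
#pragma omp threadprivate(calls)

/* Increments the calling thread's own copy of the counter. */
void record_call(void)
{
    calls++;
}

/* Returns 1 if every thread saw exactly k calls in its own copy. */
int each_thread_counts_alone(int k)
{
    int ok = 1;                        /* shared result flag */

    #pragma omp parallel
    {
        calls = 0;                     /* reset this thread's copy */
        for (int j = 0; j < k; j++) {
            record_call();
        }
        if (calls != k) {              /* would fail if the copies interfered */
            #pragma omp atomic write
            ok = 0;
        }
    }
    return ok;
}
```

Without the threadprivate directive, `calls` would be shared and the unsynchronized increments would form a data race.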
The Barrier Construct
Back to our bad scaling example
C/C++
int i, s = 0;
#pragma omp parallel for
for (i = 0; i < 100; i++)
{
    #pragma omp critical
    { s = s + a[i]; }
}
It's your turn: Make It Scale!
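The Reduction Clause
double s_priv[nthreads];   // every entry must be set to 0.0 before the parallel region
#pragma omp parallel num_threads(nthreads)
{
    int t = omp_get_thread_num();   // index of this thread's partial sum
    #pragma omp for
    for (i = 0; i < 100; i++)
    {
        s_priv[t] += a[i];
    }
} // end parallel
for (i = 0; i < nthreads; i++)
{
    s += s_priv[i];   // combine the partial sums serially
}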
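The manual partial-sum scheme above is what the reduction clause automates. A minimal sketch follows; the function name `sum_reduction` is an assumption made for illustration.

```c
/* Sums a[0..n-1]. Each thread accumulates into a private copy of s,
   initialized to 0 for the '+' operator; the private copies are combined
   into the original s at the end of the loop. */
double sum_reduction(const double *a, int n)
{
    double s = 0.0;

    #pragma omp parallel for reduction(+:s)
    for (int i = 0; i < n; i++) {
        s += a[i];
    }
    return s;
}
```

Compared with the critical-region version, the threads no longer serialize on every iteration. Note that the summation order may differ from the serial loop, so floating-point results can differ in the last bits.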
Data in Caches
False Sharing
As a result the cache line is moved between the threads, although there is no real dependency.
[Figure: cache line transferred over the bus]
False Sharing
no performance benefit for more threads
Reason: false sharing of s_priv
Solution: padding so that only one variable per cache line is used
[Figure: MFLOPS (0 to 4000) over #threads (1 to 12)]
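The padding idea can be sketched as follows. This is a sketch under stated assumptions, not the code from the slides: the cache-line size of 64 bytes, the type `padded_double` and the function `sum_padded` are all chosen for illustration.

```c
#include <stdlib.h>
#ifdef _OPENMP
#include <omp.h>
#endif

#define CACHE_LINE 64                       /* assumed cache-line size in bytes */

typedef struct {
    double value;                           /* the partial sum of one thread */
    char pad[CACHE_LINE - sizeof(double)];  /* fills the rest of the 64 bytes */
} padded_double;

/* Sums a[0..n-1] with one padded partial sum per thread.
   Returns 0.0 if the allocation fails (kept simple for the sketch). */
double sum_padded(const double *a, int n)
{
    int nthreads = 1;
#ifdef _OPENMP
    nthreads = omp_get_max_threads();
#endif
    padded_double *s_priv = calloc((size_t)nthreads, sizeof *s_priv);
    if (s_priv == NULL) return 0.0;

    #pragma omp parallel num_threads(nthreads)
    {
        int t = 0;
#ifdef _OPENMP
        t = omp_get_thread_num();
#endif
        #pragma omp for
        for (int i = 0; i < n; i++) {
            s_priv[t].value += a[i];        /* values lie 64 bytes apart */
        }
    }

    double s = 0.0;
    for (int t = 0; t < nthreads; t++) {
        s += s_priv[t].value;               /* combine serially */
    }
    free(s_priv);
    return s;
}
```

Because consecutive `value` fields are 64 bytes apart, two threads never update the same cache line, provided the assumed line size matches the hardware.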
PI
Example: Pi (1/2)
$\pi = \int_0^1 \frac{4}{1 + x^2}\,dx$
double f(double x)
{
    return (4.0 / (1.0 + x*x));
}

double CalcPi (int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;
    double fX;
    int i;
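The listing above breaks off after the declarations. One plausible way to complete it is the midpoint rule with a reduction on the sum. The loop body and the reduction clause below are assumptions, not recovered from the slides.

```c
double f(double x)
{
    return 4.0 / (1.0 + x * x);
}

/* Approximates the integral of f over [0, 1] with n strips of width fH,
   evaluating f at the midpoint of each strip (assumed completion). */
double CalcPi(int n)
{
    const double fH = 1.0 / (double) n;
    double fSum = 0.0;

    #pragma omp parallel for reduction(+:fSum)
    for (int i = 0; i < n; i++) {
        double fX = fH * ((double) i + 0.5);   /* midpoint of strip i */
        fSum += f(fX);
    }
    return fH * fSum;
}
```

Declaring `fX` inside the loop makes it private to each thread without an explicit private clause.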
Example: Pi (2/2)
Results:
# Threads    Runtime [sec.]    Speedup
1            1.11              1.00
2
4
8            0.14              7.93
Single and Master Construct
The Single Construct
C/C++:
    #pragma omp single [clause]
    ... structured block ...
Fortran:
    !$omp single [clause]
    ... structured block ...
    !$omp end single
Useful for:
I/O
The Master Construct
C/C++:
    #pragma omp master
    ... structured block ...
Fortran:
    !$omp master
    ... structured block ...
    !$omp end master
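The difference between the two constructs can be sketched as follows. The function name `single_then_use` and the variable names are hypothetical. The sketch relies on two facts: single ends with an implicit barrier (unless nowait is given), while master has no implied barrier.

```c
/* Returns 1 if every thread of the team saw the value set inside single
   and the master-only block ran as well. */
int single_then_use(void)
{
    int config = 0;        /* shared */
    int all_saw_it = 1;    /* shared */

    #pragma omp parallel
    {
        #pragma omp single
        { config = 42; }           /* one arbitrary thread, then implicit barrier */

        if (config != 42) {        /* every thread reads after the barrier */
            #pragma omp atomic write
            all_saw_it = 0;
        }

        #pragma omp barrier        /* make sure all reads above are finished */
        #pragma omp master
        { config = 0; }            /* master thread only; the others do not wait */
    }
    return all_saw_it && config == 0;
}
```

The explicit barrier before the master block is needed here because the master thread would otherwise be free to reset `config` while other threads are still reading it.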
Section and Ordered Construct
How to parallelize a Tree Traversal?
if (tree->right) traverse(tree->right);
process(tree);
}
The Sections Construct
C/C++:
    #pragma omp sections [clause]
    {
        #pragma omp section
        ... structured block ...
        #pragma omp section
        ... structured block ...
        ...
    }
Fortran:
    !$omp sections [clause]
    !$omp section
    ... structured block ...
    !$omp section
    ... structured block ...
    ...
    !$omp end sections
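One way to apply sections to the tree traversal question is to handle the left and the right subtree in separate sections. This is a sketch under assumptions: the `Node` type and the function `tree_sum` are hypothetical, and summing the node values stands in for the slides' `process(tree)`.

```c
#include <stddef.h>

/* Hypothetical binary tree node. */
typedef struct Node {
    int value;
    struct Node *left, *right;
} Node;

/* Returns the sum of all values in the tree rooted at `tree`. */
long tree_sum(const Node *tree)
{
    long left = 0, right = 0;      /* shared inside the region below */

    if (tree == NULL) return 0;

    #pragma omp parallel sections
    {
        #pragma omp section
        { if (tree->left) left = tree_sum(tree->left); }

        #pragma omp section
        { if (tree->right) right = tree_sum(tree->right); }
    }                              /* implicit barrier: both sums are ready */

    return left + right + tree->value;
}
```

Every recursion level opens a nested parallel region. Whether the inner regions get more than one thread depends on the runtime's nesting settings, so this approach is mainly illustrative.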
How to parallelize a Tree Traversal?!
The Ordered Construct
Use Cases:
Can be used, e.g., to enforce ordering when printing data
May help to determine whether there is a data race
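The ordering use case can be sketched with the ordered construct inside a parallel loop. The function name `collect_squares_in_order` is hypothetical, and appending to an array stands in for printing so that the result can be checked.

```c
/* Computes i*i for i = 0..n-1 in parallel, but appends the results to
   out[] strictly in iteration order. Returns the number of entries written. */
int collect_squares_in_order(int n, int *out)
{
    int next = 0;                      /* shared write position */

    #pragma omp parallel for ordered
    for (int i = 0; i < n; i++) {
        int sq = i * i;                /* this part may run in parallel */

        #pragma omp ordered
        { out[next++] = sq; }          /* executed in sequential loop order */
    }
    return next;
}
```

The ordered clause on the loop directive is required whenever the loop body contains an ordered region. Only the work outside that region runs concurrently.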
Runtime Library
C and C++:
If OpenMP is enabled during compilation, the preprocessor symbol _OPENMP
is defined. To use the OpenMP runtime library, the header omp.h has to
be included.
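The `_OPENMP` symbol makes it possible to keep a single source file that builds with and without OpenMP. A minimal sketch, in which the function name `max_threads` is hypothetical:

```c
#ifdef _OPENMP
#include <omp.h>      /* only available when OpenMP is enabled */
#endif

/* Returns the upper bound on the team size of the next parallel region,
   or 1 when the file was compiled without OpenMP. */
int max_threads(void)
{
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}
```

Guarding both the include and the runtime call avoids compile and link errors in a serial build.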
Recursive approach to compute Fibonacci
int main(int argc, char* argv[])
{
    [...]
    fib(input);
    [...]
}

int fib(int n) {
    if (n < 2) return n;
    int x = fib(n - 1);
    int y = fib(n - 2);
    return x+y;
}
The Task Construct
C/C++:
    #pragma omp task [clause]
    ... structured block ...
Fortran:
    !$omp task [clause]
    ... structured block ...
    !$omp end task
Clauses:
    private(list) firstprivate(list)
    default(shared | none)
Tasks in OpenMP: Data Scoping
First version parallelized with Tasking (omp-v1)
int main(int argc, char* argv[])
{
    [...]
    #pragma omp parallel
    {
        #pragma omp single
        {
            fib(input);
        }
    }
    [...]
}

int fib(int n) {
    if (n < 2) return n;
    int x, y;
    #pragma omp task shared(x)
    {
        x = fib(n - 1);
    }
    #pragma omp task shared(y)
    {
        y = fib(n - 2);
    }
    #pragma omp taskwait
    return x+y;
}
o Only one Task / Thread enters fib() from main(); it is responsible for creating the two initial work tasks
o Taskwait is required, as otherwise x and y would be lost
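A self-contained variant of omp-v1 that also returns the result might look as follows. The wrapper `fib_parallel` is a hypothetical addition; the slides call `fib(input)` directly from `main()` and elide the surrounding code.

```c
/* Task-parallel Fibonacci as in omp-v1. */
int fib(int n)
{
    if (n < 2) return n;
    int x, y;

    #pragma omp task shared(x)     /* x must be shared to carry the result out */
    x = fib(n - 1);

    #pragma omp task shared(y)
    y = fib(n - 2);

    #pragma omp taskwait           /* wait for both child tasks before reading */
    return x + y;
}

/* Hypothetical wrapper: one thread starts the recursion, the other
   threads of the team pick up the generated tasks. */
int fib_parallel(int n)
{
    int result = 0;

    #pragma omp parallel
    {
        #pragma omp single
        result = fib(n);
    }
    return result;
}
```

The parameter `n` is firstprivate in each task by default, so every task works on its own copy.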
Fibonacci Illustration
T1 enters fib(4)
T1 creates tasks for fib(3) and fib(2)
T1 and T2 execute tasks from the queue
T1 and T2 create 4 new tasks
T1 - T4 execute tasks
[Figure: call tree of fib(4) with children fib(3) and fib(2), down to fib(1) and fib(0), next to the Task Queue]
Scalability measurements (1/3)
[Figure: scaling plot over #Threads (1, 2, 4, 8); series: optimal, omp-v1]
if Clause
Improved parallelization with Tasking (omp-v2)
Speedup is ok, but we still have some overhead when running with 4 or 8 threads
[Figure: scaling plot over #Threads (1, 2, 4, 8); series: optimal, omp-v1, omp-v2]
Improved parallelization with Tasking (omp-v3)
Everything ok now
[Figure: scaling plot over #Threads (1, 2, 4, 8); series: optimal, omp-v1, omp-v2, omp-v3]
Data Scoping Example (1/7)
int a = 1;
void foo()
{
    int b = 2, c = 3;
    #pragma omp parallel shared(b)
    #pragma omp parallel private(b)
    {
        int d = 4;
        #pragma omp task
        {
            int e = 5;
            // Scope of a:
            // Scope of b:
            // Scope of c:
            // Scope of d:
            // Scope of e:
        }
    }
}
Data Scoping Example (6/7)
Hint: Use default(none) to be forced to think about every variable if the scoping is not clear to you.
int a = 1;
void foo()
{
    int b = 2, c = 3;
    #pragma omp parallel shared(b)
    #pragma omp parallel private(b)
    {
        int d = 4;
        #pragma omp task
        {
            int e = 5;
            // Scope of a: shared
            // Scope of b: firstprivate
            // Scope of c: shared
            // Scope of d: firstprivate
            // Scope of e: private
        }
    }
}
Task Synchronization
    function_A();
    #pragma omp barrier        // All Tasks guaranteed to be completed here
    #pragma omp single
    {
        #pragma omp task       // 1 Task created here
        function_B();
    }
                               // B-Task guaranteed to be completed here
}
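The guarantees annotated above can be checked with a small self-contained sketch. The function `task_sync_demo` and its counters are hypothetical; the simple assignments stand in for `function_A()` and `function_B()`, and the opening of the parallel region is an assumption since it is missing from the listing.

```c
/* Returns 1 if the task completion guarantees at the barriers hold. */
int task_sync_demo(void)
{
    int a_done = 0, b_done = 0, ok = 1;    /* all shared in the region */

    #pragma omp parallel
    {
        #pragma omp task shared(a_done)    /* one A task per thread */
        {
            #pragma omp atomic
            a_done++;
        }
        #pragma omp barrier                /* all A tasks are complete here */

        #pragma omp single
        {
            if (a_done < 1) ok = 0;        /* at least one A task has run */

            #pragma omp task shared(b_done) /* exactly one B task */
            b_done = 1;
        }                                  /* implicit barrier: B task complete */

        if (b_done != 1) {
            #pragma omp atomic write
            ok = 0;
        }
    }
    return ok;
}
```

Both guarantees come from barriers: the explicit one after the A tasks and the implicit one at the end of single. A taskwait would only wait for the child tasks of the current task.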
Questions?