
Intel Threading Building Blocks

Dr. M. Schwind
Prof. Praktische Informatik

Winter Term 2013/2014



Overview

1 Introduction
2 Basics
    Generic Programming with C++
    Concepts of Threading Building Blocks
    Initialization
3 Parallel Constructs
    parallel_for
    parallel_reduce
    parallel_do
    pipeline
    parallel_sort
    Additional Algorithm Templates
4 Synchronization
    Mutex
    Atomic Operations
5 Container
    concurrent_vector
    concurrent_hash_map
    concurrent_queue
6 Task-Programming


Introduction

- C++ library for shared-memory parallel programming, mainly for multicore CPUs
- Implements important parallel programming patterns:
  - Parallel loops
  - Pipelining
  - Task programming
- Provides data structures that allow parallel access from several threads:
  - Queue (FIFO)
  - Associative container
  - Vector
- Developed by Intel
- Not restricted to Intel processors
- Implementation uses generic programming (C++ templates)


- Available in a commercial and an open-source version
- Homepage: http://www.threadingbuildingblocks.org/
- Literature:
  - Website: http://www.threadingbuildingblocks.org/documentation.php
    - Reference Manual
    - Installation Guide
    - Getting Started Guide
  - Book: James Reinders, Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. O'Reilly, 2007. ISBN: 0596514808


Motivation for Generic Programming

- Example: simplified implementation of a stack for storing integer values
- Problem: a type-safe stack for values of different types requires one implementation per type
- Solution:
  - Use of the preprocessor (awkward, confusing, difficult to maintain)
  - Use of templates

Example

    class IntStack {
    public:
        void push(const int& item) { mem[pos++] = item; }
        int pop() { return mem[--pos]; }
        int mem[100];
        int pos = 0;   // index of the next free slot
    };

    IntStack s;        // Usage
    s.push(5);
    int x = s.pop();

Generic Programming with C++

- Functions, classes, and methods can be declared with types that remain variable until compile time
- To define a class with variable types, the class declaration is preceded by template<typename T1, typename T2, ...>
  - T1, T2, ... are identifiers for the variable types
  - typename is a keyword preceding each identifier

Example

    template<typename T>   // use type T instead of int
    class Stack {
    public:
        void push(const T& item) { mem[pos++] = item; }
        T pop() { return mem[--pos]; }
        T mem[100];
        int pos = 0;       // index of the next free slot
    };


Declaration of Objects using Variable Types

- An object of a template class is declared with the class name followed by the concrete types in angle brackets <>

Example

    // Declaration of an integer stack
    Stack<int> int_stack;
    int_stack.push(5);
    // Declaration of a stack of double-precision numbers
    Stack<double> double_stack;
    double_stack.push(5.0);

Type Requirements

- Templates can be used with self-defined types
- Example: use of the stack class for storing self-defined tuple classes
- Analysis of the implementation of the stack class shows that an assignment operator must be defined for the element type
- Template implementations impose certain syntactic and semantic requirements on their type arguments

Example

    class IntTupel {
    public:
        // Constructor; the default arguments also allow the array T mem[100] in Stack
        IntTupel(int a = 0, int b = 0) : s1(a), s2(b) {}
        // Assignment operator
        IntTupel& operator=(const IntTupel& other) {
            s1 = other.s1;
            s2 = other.s2;
            return *this;
        }
        int s1, s2;   // elements of the tuple
    };
    ...
    // Usage
    Stack<IntTupel> s;
    s.push(IntTupel(5, 6));


Concepts and Models

- A concept is a collection of requirements for a type
  - Syntactic requirements (e.g. a class defines a method with a specific name)
  - Semantic requirements (a method does a computation in a specific way)
- A model is a type that fulfills all requirements of a concept
- In Threading Building Blocks, concepts are described by pseudo signatures

Example (CopyConstructible)

    pseudo signature               semantics
    T(const T&)                    Copy constructor
    ~T()                           Destructor
    T* operator&()                 Address of T
    const T* operator&() const     Address of const T

Splittable Concept

    pseudo signature               semantics
    X::X(X& x, split)              Splits x into x and a newly constructed object

- The splitting constructor splits an object into two parts
- The argument split is used to distinguish the splitting constructor from the copy constructor
- Used for:
  - Partitioning an index range into two subranges that can be computed in parallel
  - Duplicating function objects that are computed in parallel
- Models:
  - blocked_range and blocked_range2d
  - Body objects of parallel_reduce and parallel_scan
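The splitting constructor can be illustrated without TBB. The following is a minimal sketch in standard C++; the names `split_tag` and `IndexRange` are hypothetical stand-ins (they are not TBB types), and the choice to split at the midpoint is an assumption made for the illustration.

```cpp
#include <cassert>
#include <cstddef>

// Tag type that selects the splitting constructor
// (hypothetical stand-in for the split argument of the concept).
struct split_tag {};

// A half-open index range [begin, end) that follows the Splittable idea.
class IndexRange {
public:
    IndexRange(std::size_t begin, std::size_t end) : begin_(begin), end_(end) {
        assert(begin <= end);
    }

    // Splitting constructor: the new object takes the upper half,
    // the argument r keeps the lower half.
    IndexRange(IndexRange& r, split_tag)
        : begin_(r.begin_ + (r.end_ - r.begin_) / 2), end_(r.end_) {
        r.end_ = begin_;
    }

    std::size_t begin() const { return begin_; }
    std::size_t end() const { return end_; }
    std::size_t size() const { return end_ - begin_; }

private:
    std::size_t begin_;
    std::size_t end_;
};
```

The extra tag parameter is what keeps this constructor distinct from the copy constructor: `IndexRange b(a, split_tag())` modifies `a`, whereas `IndexRange c(a)` leaves it untouched.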


Range Concept

- Ranges represent index sets
- Typically used in parallel loops

    pseudo signature                   semantics
    R::R(const R&)                     Copy constructor
    R::~R()                            Destructor
    bool R::empty() const              true if the index range is empty
    bool R::is_divisible() const       true if the index range can be divided
    R::R(R& r, split)                  Subdivision of r into two index sets

- Models: blocked_range and blocked_range2d

Models for Ranges

blocked_range

    template<typename Value> class blocked_range;

- Represents the half-open interval [i, j); i and j have type Value
- Models for Value are built-in types such as int, unsigned int, or pointers to vector elements

    template<typename Value>
    class blocked_range {
    public:
        typedef size_t size_type;
        typedef Value const_iterator;

        blocked_range(Value begin, Value end, size_type grainsize = 1);
        blocked_range(blocked_range& r, split);

        size_type size() const;
        bool empty() const;

        size_type grainsize() const;
        bool is_divisible() const;

        const_iterator begin() const;
        const_iterator end() const;
    };


Concept of Value

    pseudo signature                               semantics
    Value::Value(const Value&)                     Copy constructor
    Value::~Value()                                Destructor
    operator-(const Value& i, const Value& j)      Number of elements in the range [i, j)
    operator+(const Value& i, size_t k)            k-th value after i

Initialization of TBB

Example

    #include <cstdlib>                       // EXIT_SUCCESS
    #include "tbb/task_scheduler_init.h"
    using namespace tbb;

    int main() {
        task_scheduler_init init;
        ...
        return EXIT_SUCCESS;
    }

- Each program requires a tbb::task_scheduler_init object
- After initialization, threads are started and wait for work to be assigned
- An additional parameter can specify the number of threads
  - Example: task_scheduler_init init(8) creates 8 threads
- The threads stay alive as long as the task_scheduler_init object is not destroyed
  - When the task_scheduler_init object is destructed, the threads are destroyed



Template for Parallel Loops

parallel_for

    template<typename Range, typename Body>
    void parallel_for(const Range& range, const Body& body);

- Parallel iteration over a range object
- The range object is subdivided into parts
- For each part, operator() of the body object is called
- An additional version of parallel_for takes a partitioner as its third argument
- Requirements for the body:

    pseudo signature                                semantics
    Body::Body(const Body&)                         Copy constructor
    Body::~Body()                                   Destructor
    void Body::operator()(Range& r) const           Application of operator() to r



Operation of parallel_for

- parallel_for subdivides the index range recursively until the call to is_divisible() returns false
- For each part of the index range, the body object is replicated and applied to that part
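The subdivide-and-apply behaviour described above can be sketched with plain `std::thread`. This is only an illustration under stated assumptions, not how TBB is implemented: TBB schedules tasks on a fixed pool of worker threads, whereas the hypothetical helper `sketch_parallel_for` below simply starts one thread per split. The body type `DoubleRange` is likewise made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not part of TBB: applies body to [begin, end),
// halving the range recursively while it is longer than grainsize.
template <typename Body>
void sketch_parallel_for(std::size_t begin, std::size_t end,
                         std::size_t grainsize, const Body& body) {
    assert(grainsize > 0 && begin <= end);
    if (end - begin <= grainsize) {   // the range is no longer divisible
        Body local(body);             // replicate the body for this part
        local(begin, end);
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    // One half runs in a new thread, the other half in the current thread.
    std::thread left([=, &body] { sketch_parallel_for(begin, mid, grainsize, body); });
    sketch_parallel_for(mid, end, grainsize, body);
    left.join();
}

// Body object that doubles every element of the sub-range it is given.
struct DoubleRange {
    int* data;
    void operator()(std::size_t begin, std::size_t end) const {
        for (std::size_t i = begin; i != end; ++i) data[i] *= 2;
    }
};
```

The parts never overlap, so the body copies can write to the array without any locking.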

Example (parallel_for)

    class DoubleAll {
        int* input;
    public:
        DoubleAll(int* _input) : input(_input) {}
        // Doubles all elements of the sub-range handed over by parallel_for
        void operator()(const blocked_range<int>& range) const {
            for (int i = range.begin(); i != range.end(); ++i)
                input[i] *= 2;
        }
    };

    void ParallelDoubleAll(int* input, size_t n) {
        DoubleAll da(input);
        parallel_for(blocked_range<int>(0, n, 1000), da);
    }

Reduction Operation

parallel_reduce

    template<typename Range, typename Body>
    void parallel_reduce(const Range& range, Body& body);

- Builds a single object by applying a reduction operator to a set of objects
- Computes e.g. the sum, minimum, or maximum of vector elements
- An additional version takes a partitioner
- The reduction operator should be associative
- Requirements for the body:

    pseudo signature                     semantics
    Body::Body(Body&, split)             Splitting constructor
    Body::~Body()                        Destructor
    void Body::operator()(Range& r)      Reduction of the elements in the subrange r
    void Body::join(Body& rhs)           Combines the values of subranges: merges rhs into the value of *this


Operation of parallel_reduce

- Recursive subdivision of the range object into subranges until a call to is_divisible() returns false
- Body object:
  - Is replicated for each subrange
  - operator() of the body object is applied to each subrange
  - Stores the value of the reduction over a subrange
- Intermediate results are combined with join

Example

    class Sum {
    public:
        float* array;
        float value;
        Sum(float* _array) : array(_array), value(0) {}
        // Splitting constructor: the new body starts with an empty sum
        Sum(Sum& s, split) { value = 0; array = s.array; }
        void operator()(const blocked_range<int>& range) {
            float temp = value;
            for (int i = range.begin(); i != range.end(); ++i)
                temp += array[i];
            value = temp;
        }
        void join(Sum& rhs) { value += rhs.value; }
    };

    float ParallelSum(float* array, size_t n) {
        Sum total(array);
        parallel_reduce(blocked_range<int>(0, n, 1000), total);
        return total.value;
    }
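The split/join flow can also be traced in standard C++. The sketch below is an assumption-laden illustration, not TBB's algorithm: `split_tag`, `SumBody` and `sketch_parallel_reduce` are hypothetical names, every split creates a new body and a new thread, and the two partial results are merged with `join` once both halves are done. TBB itself only splits the body when a subrange is actually handed to another thread.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct split_tag {};  // hypothetical stand-in for the split argument

// Body with splitting constructor, operator() and join.
struct SumBody {
    const float* array;
    float value;

    explicit SumBody(const float* a) : array(a), value(0.0f) {}
    SumBody(SumBody& other, split_tag) : array(other.array), value(0.0f) {}

    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i != end; ++i) value += array[i];
    }
    void join(const SumBody& rhs) { value += rhs.value; }
};

// Hypothetical helper, not part of TBB.
template <typename Body>
void sketch_parallel_reduce(std::size_t begin, std::size_t end,
                            std::size_t grainsize, Body& body) {
    assert(grainsize > 0 && begin <= end);
    if (end - begin <= grainsize) {  // leaf: reduce sequentially
        body(begin, end);
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    Body right(body, split_tag());   // fresh body for the right half
    std::thread worker([&right, mid, end, grainsize] {
        sketch_parallel_reduce(mid, end, grainsize, right);
    });
    sketch_parallel_reduce(begin, mid, grainsize, body);
    worker.join();
    body.join(right);                // combine the partial results
}
```

Because the grouping of the additions depends on where the range is split, the result is only independent of the subdivision when the reduction operator is associative.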


Partitioner

- Controls the subdivision of range objects and the assignment of subranges to threads
- Used for parallel_for, parallel_reduce and parallel_scan
- simple_partitioner
  - Recursive subdivision of range objects until Range::is_divisible returns false
- auto_partitioner
  - Subdivides the range object, but not necessarily until Range::is_divisible returns false
  - Balances the work between processors by ensuring that the ranges assigned to threads have nearly equal size
- affinity_partitioner
  - Subdivision similar to auto_partitioner
  - When the same range object is iterated several times, the partitioner assigns subranges to the same threads in all iterations
  - Increases cache efficiency if the data fits in the cache

parallel_do

    template<typename InputIterator, typename Body>
    void parallel_do(InputIterator first, InputIterator last, Body& body);

- Iterates over the elements of a container and applies operator() of the body object to each element; the traversal itself is sequential, the body invocations can run in parallel
- Particularly useful when the elements of the container are not randomly accessible, e.g. in a list
- Iterator objects are required:
  - An iterator is an abstract interface for accessing the elements of a container
  - Iterator objects are defined for STL (Standard Template Library) containers and TBB containers
- The body can also be applied to objects that are generated while the computation proceeds

Pseudo signature of the body (parallel_do)

    pseudo signature                                            semantics
    void B::operator()(B::argument_type& item,                  Applies the operator to the element item;
        parallel_do_feeder<B::argument_type>& feed) const       feed is used to store newly created elements
    typename B::argument_type                                   Type of the elements
    B::argument_type(const B::argument_type&)                   Copy constructor of argument_type
    ~B::argument_type()                                         Destructor of argument_type

Example

    class ListEl {};   // is CopyConstructible

    struct Body {
        typedef ListEl argument_type;
        void operator()(ListEl c, tbb::parallel_do_feeder<ListEl>& feed) const {
            ListEl& new_item = process_item(c);
            feed.add(new_item);   // hand the new element over for processing
        }
    };

    std::list<ListEl> list;
    ...
    tbb::parallel_do(list.begin(), list.end(), Body());

Pipeline

Class definition

    class pipeline {
    public:
        pipeline();
        virtual ~pipeline();
        void add_filter(filter& f);
        void run(size_t max_number_of_live_tokens);
        void clear();
    };

- A pipeline object (class pipeline) uses several pipeline stages f1, ..., fn, called filters in TBB
- Filters are created outside of the pipeline and put into the pipeline by calling pipeline::add_filter()
- The method pipeline::run() starts the pipeline; max_number_of_live_tokens limits the number of items that are in the pipeline at the same time
- pipeline::clear() removes all filters from the pipeline; after that call the filters can be destroyed

Class Definition of Filter

    class filter {
    public:
        enum mode { parallel, serial };
        filter(mode);
        bool is_serial() const;
        virtual void* operator()(void* item) = 0;
        virtual ~filter();
    };

- Each filter class has to override the virtual method void* filter::operator()(void*)
- The return value of operator() is used as the argument of the next pipeline stage
- The first filter f1 generates the data; a return value of NULL tells TBB that no more items need to be processed
- The last stage fn should manage the output; the return value of that stage is ignored
- A filter can be marked as a parallel filter: several items are then computed in parallel in that stage

Parallel Sorting

parallel_sort

    template<typename RandomAccessIterator, typename Compare>
    void parallel_sort(RandomAccessIterator begin,
                       RandomAccessIterator end,
                       const Compare& comp);

- Used for sorting a container object
- Unstable: the order of elements with the same key is not preserved
- Deterministic: the same sequence of elements yields the same sorted sequence in every sorting run
- RandomAccessIterator is defined in the STL; it allows random access to elements

Example

    const int N = 100000;
    float b[N];
    ...
    parallel_sort(b, b + N, std::greater<float>());

Additional Algorithm Templates

- parallel_scan
  - Computes the prefix sum in parallel
  - Used e.g. for parallel sorting
- parallel_for_each
  - Parallel application of a function to the elements of a sequence given by iterators
- parallel_invoke
  - Parallel invocation of up to 10 functions
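For readers unfamiliar with the term, the prefix sum that parallel_scan computes is defined sequentially as follows. This sketch uses `std::partial_sum` from the standard library and shows only the definition of the result, not the parallel algorithm; the function name `prefix_sum` is chosen for the illustration.

```cpp
#include <numeric>
#include <vector>

// Inclusive prefix sum: out[i] = in[0] + in[1] + ... + in[i].
std::vector<int> prefix_sum(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    std::partial_sum(in.begin(), in.end(), out.begin());
    return out;
}
```

For the input 1, 2, 3, 4 the result is 1, 3, 6, 10. Each output element depends on all earlier inputs, which is why a parallel version needs more than a simple parallel loop.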



Scoped Locking Pattern

Motivation: use of exceptions

    void fun1() {
        ...
        throw new Exception();
    }

    void fun2() {
        lock.lock();
        fun1();          // throws: the mutex is not unlocked
        lock.unlock();   // never reached
    }

- Problem: in case fun1 throws an exception, the lock variable lock is not unlocked

Solution 1: Modification of fun2

    void fun2() {
        lock.lock();
        try {
            fun1();
        }
        catch (Exception* e) {
            lock.unlock();
            // exception handling
            throw;           // rethrow so that fun3 can handle it
        }
        catch (...) {
            lock.unlock();
            throw;
        }
        lock.unlock();
    }

    void fun3() {
        try {
            fun2();
        } catch (Exception* e) {
            // exception handling
        }
    }

Disadvantages:
- Unlocking may be forgotten, and not only when exceptions are used
- The complexity of the program text increases
- More programming effort for the programmer


Solution 2: Division of the lock variable and the locking functionality into two objects

- Mutex: globally visible
- Scoped lock: used for locking the mutex
  - For each thread and each mutex, one scoped lock instance exists
  - Locks the mutex at its construction
  - Unlocks the mutex at its destruction
- Tip: declaring a scoped lock object at the beginning of a code block (braces { } in C++) locks the associated mutex for the whole code block

Example:

    ...
    {
        // Construction of myLock locks the mutex myMutex
        mutex::scoped_lock myLock(myMutex);
        // Computations are protected by myMutex
        ...
        // Unlocking of myMutex
        // (the destructor of myLock is called implicitly)
    }

Mutex Concept

All of the following mutex models have to implement the following functions:

    pseudo signature                                semantics
    M()                                             Construction of a mutex object
    ~M()                                            Destruction of a mutex object
    typename M::scoped_lock                         Type of the scoped lock class
    M::scoped_lock()                                Construction of a scoped lock object without locking a mutex
    M::scoped_lock(M& mutex)                        Construction of a scoped lock object and locking of the mutex
    M::~scoped_lock()                               Releases the mutex, if locked
    M::scoped_lock::acquire(M& mutex)               Locks the mutex
    bool M::scoped_lock::try_acquire(M& mutex)      Tries to lock the mutex; returns false if already locked, otherwise true
    M::scoped_lock::release()                       Unlocks the mutex
    static const bool M::is_rw_mutex                true if reader-writer mutex
    static const bool M::is_recursive_mutex         true if recursive mutex
    static const bool M::is_fair_mutex              true if fair mutex
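The same scoped locking pattern exists in the C++ standard library as `std::lock_guard`. The following sketch uses it instead of TBB's `mutex::scoped_lock` so that it compiles without TBB; the names `counter_mutex`, `counter` and `guarded_increment` are made up for the illustration.

```cpp
#include <mutex>
#include <stdexcept>

std::mutex counter_mutex;   // globally visible mutex
int counter = 0;            // data protected by counter_mutex

// Increments the counter; throws while the lock is held if fail is true.
void guarded_increment(bool fail) {
    std::lock_guard<std::mutex> guard(counter_mutex);  // locks the mutex here
    if (fail) {
        throw std::runtime_error("failure inside the critical section");
    }
    ++counter;
}   // the destructor of guard unlocks, on normal return and on exception
```

A call such as `guarded_increment(true)` leaves the mutex unlocked even though the function is left through an exception, so a subsequent `guarded_increment(false)` does not deadlock. No try/catch block is needed inside the function.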

Models Implementing the Mutex Concept

mutex class
- Wrapper for the operating system implementation of locks

recursive_mutex class
- Wrapper for the recursive operating system implementation (e.g. for pthread_mutex_t)
- A recursive lock can be locked several times by one and the same thread
- If a mutex was locked n times, the thread has to unlock it n times too

spin_mutex class
- Lock implementation using a busy-waiting loop; uses a flag variable in memory
- Good for short delays only, since processor time and memory bandwidth are consumed while waiting
- Unfair implementation: the order of locking requests is ignored

queuing_mutex class
- Implementation using a busy-waiting loop
- Fair implementation: locking requests are served in FIFO order
- Implementation scales
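The idea of a busy-waiting lock on a flag variable can be sketched with `std::atomic_flag`. This is a minimal illustration of the principle, not the implementation of tbb::spin_mutex; `SpinLock` and `count_with_spin_lock` are hypothetical names.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Minimal busy-waiting lock built on a flag variable.
// Unfair: whichever waiting thread sees the cleared flag first wins.
class SpinLock {
public:
    void lock() {
        while (flag_.test_and_set(std::memory_order_acquire)) {
            // busy wait: consumes CPU time until the flag is cleared
        }
    }
    void unlock() { flag_.clear(std::memory_order_release); }

private:
    std::atomic_flag flag_ = ATOMIC_FLAG_INIT;
};

// Hypothetical driver: several threads increment one counter under the lock.
int count_with_spin_lock(int threads, int increments_per_thread) {
    SpinLock lock;
    int counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&] {
            for (int i = 0; i < increments_per_thread; ++i) {
                lock.lock();
                ++counter;       // short critical section
                lock.unlock();
            }
        });
    }
    for (std::thread& th : pool) th.join();
    return counter;
}
```

The critical section here is a single increment, which is the kind of short delay for which spinning is reasonable; with long critical sections the waiting threads would waste processor time.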


ReaderWriterMutex Concept

Read/write locks:
- Several threads that only read the protected data structure are allowed to read in parallel
- A thread that modifies the data structure needs exclusive write access
- The mutex can be locked either by several readers or by one writer
- Additional requirements compared to the mutex concept:

    pseudo signature                                                semantics
    M::scoped_lock(M& mutex, bool write=true)                       Constructs a scoped lock object and locks the mutex
    M::scoped_lock::acquire(M& mutex, bool write=true)              Locks the mutex
    bool M::scoped_lock::try_acquire(M& mutex, bool write=true)     Tries to lock the mutex; returns false if locked, otherwise true
    bool M::scoped_lock::upgrade_to_writer()                        Reader lock -> writer lock
    bool M::scoped_lock::downgrade_to_reader()                      Writer lock -> reader lock

- write=true: exclusive write access; write=false: read access
- Models: class spin_rw_mutex and class queuing_rw_mutex

Summary of Locks

    Class              scalable       fair           recursive   release of CPU
    mutex              OS dependent   OS dependent   -           x
    recursive_mutex    OS dependent   OS dependent   x           x
    spin_mutex         -              -              -           -
    queuing_mutex      x              x              -           -
    spin_rw_mutex      -              -              -           -
    queuing_rw_mutex   x              x              -           -
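The reader-writer idea can be tried out with the standard library, which offers `std::shared_mutex` (C++17) as an analog: `std::shared_lock` corresponds to a scoped lock with write=false, `std::unique_lock` to one with write=true. The class `SharedValue` and the driver function are hypothetical; note that the standard library has no counterpart to upgrade_to_writer().

```cpp
#include <mutex>
#include <shared_mutex>
#include <thread>
#include <vector>

// Hypothetical example class: a value guarded by a reader-writer lock.
class SharedValue {
public:
    int read() const {
        std::shared_lock<std::shared_mutex> lock(mutex_);   // read access, shared
        return value_;
    }
    void add(int delta) {
        std::unique_lock<std::shared_mutex> lock(mutex_);   // write access, exclusive
        value_ += delta;
    }

private:
    mutable std::shared_mutex mutex_;
    int value_ = 0;
};

// Starts writer threads that each add 1 adds_per_writer times,
// plus two reader threads that read concurrently.
int run_readers_and_writers(int writers, int adds_per_writer) {
    SharedValue shared;
    std::vector<std::thread> pool;
    for (int w = 0; w < writers; ++w) {
        pool.emplace_back([&] {
            for (int i = 0; i < adds_per_writer; ++i) shared.add(1);
        });
    }
    for (int r = 0; r < 2; ++r) {
        pool.emplace_back([&] {
            for (int i = 0; i < 1000; ++i) (void)shared.read();
        });
    }
    for (std::thread& t : pool) t.join();
    return shared.read();
}
```

Readers may hold the lock at the same time as other readers, while each call to `add` excludes both readers and other writers.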

Example

- Data structure queue (FIFO)
- Implementation using a linked list
- Attaching elements at the end: enq operation
- Taking elements from the start: deq operation
- Taking an element from an empty queue: exception
- Later: implementation provided by TBB (concurrent_queue)

    template<typename T>
    struct Node {
        Node() : next(NULL) {}
        Node(const T& v) : val(v), next(NULL) {}
        T val;        // value
        Node* next;   // pointer to the next element
    };

    struct EmptyException {};   // thrown by deq() on an empty queue

    template<typename T>
    class LockQueue {
        // Mutexes for attaching and taking elements
        mutex enqLock, deqLock;
        // Pointers to the beginning and the end of the linked list
        Node<T> *head, *tail;
    public:
        // The queue always contains one sentinel element
        LockQueue() {
            head = new Node<T>();
            tail = head;
        }
        ~LockQueue() {
            // Delete the sentinel and all remaining elements
            while (head != NULL) {
                Node<T>* next = head->next;
                delete head;
                head = next;
            }
        }
        void enq(const T& x) {
            mutex::scoped_lock l(enqLock);
            Node<T>* e = new Node<T>(x);
            tail->next = e;
            tail = e;
        }
        T deq() {
            mutex::scoped_lock l(deqLock);
            if (head->next == NULL)
                throw EmptyException();
            T val = head->next->val;
            Node<T>* h = head;
            head = head->next;   // the removed element becomes the new sentinel
            delete h;
            return val;
        }
    };

Notes:
- By using two mutexes, it is possible to take elements from the queue and attach elements to it in parallel
- Deadlock free, since no thread holds two locks at the same time
- head points to the sentinel element; its successor is the first element of the queue


Atomic Operations

    template<typename T>
    struct atomic {
        typedef T value_type;

        value_type fetch_and_add(value_type addend);       // x = x + addend
        value_type fetch_and_increment();                  // x = x + 1
        value_type fetch_and_decrement();                  // x = x - 1
        value_type compare_and_swap(value_type new_value,
                                    value_type comparand); // (*)
        value_type fetch_and_store(value_type new_value);  // swap(x, new_value)
        operator value_type() const;
        value_type operator+=(value_type);
        value_type operator-=(value_type);
        value_type operator++();
        value_type operator--();
    };

- T: integer or pointer type
- The operations are executed atomically
- (*) compare_and_swap:
  - Compares comparand with the value of *this; if they are equal, sets *this = new_value
  - Returns the old value of *this
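A typical use of compare-and-swap is a retry loop. The sketch below uses `std::atomic` from the standard library rather than `tbb::atomic`; its `compare_exchange_weak` differs from the compare_and_swap described above in that it returns a bool and writes the observed value back into its first argument instead of returning the old value. The functions `atomic_max` and `parallel_max` are hypothetical examples.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Atomically replaces target with max(target, candidate)
// using a compare-and-swap retry loop.
void atomic_max(std::atomic<int>& target, int candidate) {
    int observed = target.load();
    // compare_exchange_weak stores candidate only if target still equals
    // observed; on failure it refreshes observed and the loop retries.
    while (observed < candidate &&
           !target.compare_exchange_weak(observed, candidate)) {
    }
}

// Hypothetical driver: one thread per value, all updating the same maximum.
int parallel_max(const std::vector<int>& values) {
    std::atomic<int> result(values.empty() ? 0 : values.front());
    std::vector<std::thread> pool;
    for (int v : values) {
        pool.emplace_back([&result, v] { atomic_max(result, v); });
    }
    for (std::thread& t : pool) t.join();
    return result.load();
}
```

The loop never overwrites a larger value written by another thread in the meantime, because the store only succeeds if the value has not changed since it was read.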

concurrent_vector

    template<typename T> class concurrent_vector;

Properties:
- Random access to elements (addressed by index)
- The data structure can grow
- After growing, indices and iterators are still valid
- No shrinking is possible

Selected Methods

Access to elements:
- T& operator[](size_type i): access to the i-th element without index checking
- T& at(size_type i): access to the i-th element; throws std::out_of_range when the index is invalid
- T& front() and T& back(): access to the first or last element

Enlargement:
- size_type grow_by(size_type delta, const T& t=T()): enlargement by delta elements
- void grow_to_at_least(size_type n): enlargement to at least n elements
- size_t push_back(const T& val): attaches the value val at the end; returns the number of elements stored

Selected Methods (Continued)

Memory:
- size_type size(): number of elements stored
- bool empty(): returns size() == 0
- size_type capacity(): maximum number of elements before new memory is allocated
- size_type max_size(): maximum number of elements

Iterators and ranges:
- iterator begin() and iterator end(): random-access iterators for visiting the vector elements in increasing order of indices
- reverse_iterator rbegin() and reverse_iterator rend(): random-access iterators for visiting the vector elements in reverse order
- range_type range(int grainsize): range object for the vector

concurrent_hash_map

    template<typename Key, typename T, typename HashCompare>
    class concurrent_hash_map;

- Hash table for the storage of key-value pairs with parallel access
- Key: type of the keys; T: type of the values
- HashCompare: class for mapping keys to integer values
- Concept of HashCompare:

    pseudo signature                                        semantics
    HashCompare::HashCompare(const HashCompare&)            Copy constructor
    HashCompare::~HashCompare()                             Destructor
    bool HashCompare::equal(const Key& j, const Key& k)     true if j and k are equal
    size_t HashCompare::hash(const Key& k) const            Maps k to an integer

- Conditions: i and j have type Key; h is an object that implements the concept HashCompare. If h.equal(i,j) is true, then h.hash(i) == h.hash(j) must hold.

Element Access

- An accessor object (proxy) allows concurrent access to key-value pairs
- The accessor object uses an implicit lock for each key-value pair
  - Construction of an accessor object: locking of the corresponding key-value pair
  - Destruction of the accessor object: unlocking of the implicit lock
- There are two different accessor types:

    type              access        lock
    const_accessor    read          read lock
    accessor          read/write    read/write lock

const_accessor

    template<typename Key, typename T,
             typename HashCompare, typename A>
    class concurrent_hash_map<Key, T, HashCompare, A>::const_accessor {
        ...
        typedef const std::pair<const Key, T> value_type;

        bool empty() const;                       // true if the accessor holds no element
        const value_type& operator*() const;      // reference to the entry
        const value_type* operator->() const;     // pointer to the entry
        void release();                           // unlocks the implicit lock
    };

Selected Methods

- size_type count(const Key& key) const
  - Returns one if key is present, zero otherwise
- bool find(accessor& res, const Key& key)
  - Searches for key; if present, returns the entry in res and locks the entry with a write lock
- bool insert(accessor& res, const Key& key)
  - Similar to find; difference: if the entry is not present, a new key-value pair is created and inserted with pair<Key,T>(key, T())
- bool erase(const Key& key)
  - Searches for key; if present, deletes it
- Iteration over elements:
  1. by using iterator begin() and iterator end()
  2. by using a range object returned by range_type range(size_t grainsize)

Example: Compute the frequency of words

    struct MyHashCompare {
        // Maps a string to a hash value
        static size_t hash(const string& x) {
            size_t h = 0;
            for (const char* s = x.c_str(); *s; s++)
                h = (h * 17) ^ *s;
            return h;
        }
        // Two keys are equal if the strings are equal
        static bool equal(const string& x, const string& y) {
            return x == y;
        }
    };

    typedef concurrent_hash_map<string, int, MyHashCompare> StringTable;


Example

    struct Tally {
        StringTable& table;
        Tally(StringTable& _table) : table(_table) {}
        void operator()(const blocked_range<string*>& r) const {
            for (string* p = r.begin(); p != r.end(); ++p) {
                StringTable::accessor a;
                table.insert(a, *p);   // creates the entry with count 0 if missing
                a->second += 1;        // entry stays locked while a is alive
            }
        }
    };

    // grainsize is not defined on the slide; it is assumed to be set elsewhere
    void CountOccurrences(string* data, int nitems) {
        task_scheduler_init init;
        StringTable table;
        parallel_for(blocked_range<string*>(data, data + nitems, grainsize),
                     Tally(table));
        for (StringTable::iterator i = table.begin(); i != table.end(); ++i)
            cout << i->first << " " << i->second << endl;
    }

concurrent_queue

    template<typename T> class concurrent_queue;

- FIFO queue
- Elements can be inserted and removed concurrently
- Limited capacity
- Implementation uses locks
- Busy waiting on some (blocking) operations
- Important methods:
  - void push(const T& source): inserts an element at the end
  - void pop(T& destination): removes and returns the element at the beginning; blocks if the queue is empty
  - bool pop_if_present(T& destination): removes and returns the element at the beginning if one is present; does not block
  - size_type size() const: number of elements stored; if the queue is empty, returns the number of waiting threads as a negative number
  - size_t capacity() const: returns the maximum capacity

Task-Programming

- A task is composed of data and code that uses the data for computation
- Tasks can be executed in parallel
- Tasks can be divided into subtasks: the parent-child relationship creates a tree of tasks
- Child tasks should be independent, so that they can be computed on different cores
- The programmer defines the subdivision
- The scheduler component within TBB manages the order of computation
- Examples of algorithms:
  - Linear algebra (matrix multiplication, matrix decomposition)
  - Sorting (merge sort, quicksort)
  - Search

Split-Join
Decomposition of a task into subtasks: split operation
Waiting for the completion of the children: join operation
Split/join parallelism; two possible methods:
    Continuation-Passing
    Blocking

Task-Depth
Each task implicitly carries its task depth.
The task depth of a child is one greater than the task depth of its parent
The root task has task depth 0

Reference counter
Each task has a reference counter
The reference counter counts the number of existing children
If the reference counter reaches zero, the task is deleted and the reference counter of its parent is decremented

Blocking

    task* T::execute() {
        if (there is no further division possible) {
            /* sequential computation */
        } else {
            set_ref_count(k + 1);        // k children plus one for the wait
            task& tk = *new(allocate_child()) T(...); spawn(tk);
            ...
            task& t2 = *new(allocate_child()) T(...); spawn(t2);
            task& t1 = *new(allocate_child()) T(...);
            spawn_and_wait_for_all(t1);  // spawns t1 and waits for all children
        }
        return NULL;
    }

Explanation:
T inherits from the class task and reimplements the method execute().
execute() controls the subdivision into tasks; Steps:
    Allocation of the task objects
    set_ref_count() initializes the reference counter to #children + 1
    spawn() marks a task as ready for execution
    spawn_and_wait_for_all() waits until the reference counter reaches 1.
Important: call set_ref_count() before the first spawn() call
execute() returns a task which is executed immediately; if it returns NULL, the scheduler chooses the next task
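The reference-counter protocol of the blocking style can be modelled with an atomic counter. This is a simplified sketch and not TBB code: `run_blocking` is a hypothetical function, and plain `std::thread` objects play the role of the child tasks. The counter is set to k + 1 before any child starts, every finished child decrements it, and the parent continues once it has dropped to 1.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Runs all children and returns only after every one has finished.
// Returns the counter value the parent observes when it resumes.
int run_blocking(const std::vector<std::function<void()>>& children) {
    const int k = static_cast<int>(children.size());
    // Analogous to set_ref_count(k + 1): k children plus one for the wait.
    std::atomic<int> ref_count(k + 1);

    std::vector<std::thread> workers;
    for (const auto& child : children) {
        // Analogous to spawn(): the child becomes eligible to run.
        workers.emplace_back([&ref_count, &child] {
            child();
            ref_count.fetch_sub(1);  // a finished child decrements the counter
        });
    }
    // Analogous to wait_for_all(): block until only the "+1" is left.
    while (ref_count.load() > 1) std::this_thread::yield();
    const int seen = ref_count.load();
    assert(seen == 1);
    for (auto& w : workers) w.join();
    return seen;
}
```

The sketch also shows why the counter has to be set before the first child is started: a child that finished earlier would decrement a counter that does not yet have its final value.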

Example (Blocking)

    #include "tbb/task.h"
    using namespace tbb;

    struct Tree { int val; Tree *left, *right; };

    class SumTask : public task {
        Tree* tree;
        int*  sum;
    public:
        SumTask(Tree* _tree, int* _sum) : tree(_tree), sum(_sum) {}

        task* execute() {
            SumTask *a = NULL, *b = NULL;
            int ref = 1, x = 0, y = 0;   // ref starts at 1 for the wait
            if (tree->right != NULL) {
                a = new(allocate_child()) SumTask(tree->right, &x);
                ref++;
            }
            if (tree->left != NULL) {
                b = new(allocate_child()) SumTask(tree->left, &y);
                ref++;
            }
            if (ref > 1) {
                set_ref_count(ref);      // must precede the first spawn
                if (a != NULL) spawn(*a);
                if (b != NULL) spawn(*b);
                wait_for_all();          // blocks until all children have finished
            }
            *sum = tree->val + x + y;    // x, y stay 0 for missing subtrees
            return NULL;
        }
    };

Problems with Blocking
Problems:
    Local variables of task::execute() remain on the stack of the executing OS thread while task::spawn_and_wait_for_all() is being called.
    Task stealing in conjunction with blocking may result in stack growth; remember that the stack size is limited
    The scheduler tries to limit the stack growth by choosing only ready tasks with a task depth higher than that of the last blocking task; this limits parallelism
Solution:
    Instead of calling task::spawn_and_wait_for_all(), the method task::execute() ends.
    The computation using the results from the child tasks is moved into a continuation task.
    The continuation task is executed after all children have finished.
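A minimal model of the continuation idea described in the solution above. It is a sketch under simplifying assumptions and not TBB code: `run_with_continuation` is a hypothetical function and `std::thread` objects stand in for child tasks. The parent does not wait; whichever child finishes last decrements the counter to zero and runs the continuation itself.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <memory>
#include <thread>
#include <vector>

// Starts all children and returns their threads without waiting for results.
// The continuation is run by the child that finishes last.
std::vector<std::thread> run_with_continuation(
        std::vector<std::function<void()>> children,
        std::function<void()> continuation) {
    assert(!children.empty());  // with no children nobody would run the continuation
    // Counter = number of children only; no "+1", since nobody blocks on it.
    auto ref_count = std::make_shared<std::atomic<int>>(
        static_cast<int>(children.size()));
    std::vector<std::thread> workers;
    for (auto& child : children) {
        workers.emplace_back([ref_count, child, continuation] {
            child();
            // fetch_sub returns the previous value: 1 means "I was the last".
            if (ref_count->fetch_sub(1) == 1) continuation();
        });
    }
    return workers;  // the caller's "execute()" ends here without blocking
}
```

The caller only has to join the returned threads before the program ends. No stack frame of the parent stays alive while the children run, which is the property that avoids the stack growth of the blocking style.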


Continuation-Passing

    task* T::execute() {
        if (there is no further division possible) {
            /* sequential computation */
            return NULL;
        } else {
            set_ref_count(k);            // k children; no extra count, nobody waits
            recycle_as_continuation();
            task& tk = *new(allocate_child()) T(...); spawn(tk);
            ...
            task& t1 = *new(allocate_child()) T(...);
            return &t1;                  // t1 is executed next, so it is not spawned
        }
    }

In the example there is no further computation once the children have been started
From the algorithm's point of view there is no need for a continuation task.
The internals of TBB require a continuation task
recycle_as_continuation() marks the parent as a continuation task; its execute() runs again after all children have finished
Additional possibility: specifying a separate continuation task (shown in the next example)

Example (Continuation-Passing)

    // Tree as defined in the blocking example
    class SumContTask : public task {
    public:
        int* sum;
        int  val, x, y;                  // x, y receive the children's sums
        SumContTask(int* _sum, int _val) : sum(_sum), val(_val), x(0), y(0) {}
        task* execute() { *sum = val + x + y; return NULL; }
    };

    class SumTask : public task {
        Tree* tree;
        int*  sum;
    public:
        SumTask(Tree* _tree, int* _sum) : tree(_tree), sum(_sum) {}

        task* execute() {
            if (tree->left == NULL && tree->right == NULL) {
                *sum = tree->val;        // leaf: no continuation needed
                return NULL;
            }
            SumContTask* c = new(allocate_continuation()) SumContTask(sum, tree->val);
            SumTask *a = NULL, *b = NULL;
            int ref = 0;
            if (tree->right != NULL) {
                a = new(c->allocate_child()) SumTask(tree->right, &c->x);
                ref++;
            }
            if (tree->left != NULL) {
                b = new(c->allocate_child()) SumTask(tree->left, &c->y);
                ref++;
            }
            c->set_ref_count(ref);       // children only; nobody waits
            if (a != NULL) spawn(*a);
            if (b != NULL) spawn(*b);
            return NULL;
        }
    };

Important Methods of the Class task

    void wait_for_all();                              Waits for the children to finish
    void spawn(task& child);                          Marks the child for execution
    void spawn(task_list& list);                      Marks a list of children for execution
    void spawn_and_wait_for_all(task& child);         Marks the child for execution and waits for the children
    void spawn_and_wait_for_all(task_list& list);     Marks the children in the list for execution and waits for the children
    depth_type depth();                               Returns the task depth
    void set_depth(depth_type new_depth);             Sets the task depth
    void add_to_depth(int delta);                     Increments the task depth
    int ref_count() const;                            Returns the reference counter
    void set_ref_count(int count);                    Sets the reference counter
    void recycle_as_continuation();                   Recycles the task as a continuation task
    void recycle_as_child_of(task& parent);           Recycles the task as a child of parent
    void recycle_to_reexecute();                      Recycles the task so that it is executed again

Initialization of the Task Scheduler

Example

    int ParallelSum(Tree* tree) {
        int sum = 0;
        SumTask& a = *new(task::allocate_root()) SumTask(tree, &sum);
        task::spawn_root_and_wait(a);
        return sum;
    }

The root task starts the task computation
The root task has to be allocated with new(task::allocate_root())
The result is stored in sum
The static method task::spawn_root_and_wait(task& root) executes the root task and waits for its completion.
The static method task::spawn_root_and_wait(task_list& root) can be used for executing a list of root tasks


Execution Orders
Depth-first execution: small memory footprint, good cache locality, no parallelism
Breadth-first execution: high memory footprint, poor cache locality, high parallelism

Ready-Pool
Each OS thread manages a ready pool
Organization of the ready pool:
    For each task depth there is a list of tasks that are ready to be executed.
    The lists are managed by an array; the task depth is the index
New tasks are inserted at the beginning of the list corresponding to their task depth and are also removed from the beginning of that list (LIFO).
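The organization of the ready pool can be sketched as an array of lists indexed by task depth. This is a simplified, single-threaded model and not TBB's data structure: `ReadyPool` is a hypothetical class, and tasks are represented by integer ids for brevity.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// Simplified model of one thread's ready pool.
class ReadyPool {
public:
    // New tasks go to the beginning of the list for their depth.
    void push(std::size_t depth, int task_id) {
        if (depth >= lists_.size()) lists_.resize(depth + 1);
        lists_[depth].push_front(task_id);
    }

    // Takes the most recently added task of the given depth (LIFO).
    // Returns false if there is no ready task at that depth.
    bool pop(std::size_t depth, int& task_id) {
        if (depth >= lists_.size() || lists_[depth].empty()) return false;
        task_id = lists_[depth].front();
        lists_[depth].pop_front();
        return true;
    }

private:
    std::vector<std::deque<int>> lists_;  // index = task depth
};
```

Pushing the ids 10 and then 11 at depth 2 and popping twice yields 11 first and 10 second, so the task created last is the one executed first.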


Operation of the task scheduler

Tasks are executed in the following order:
1. The task returned by task::execute().
2. The task which is the parent of the last executed task, if all of its children have finished.
3. A task from the list with the highest task depth.
4. A task with an affinity for that thread.
5. A task from the ready pool of another thread with the lowest depth (task stealing).
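Rules 3 and 5 of the order above can be illustrated with two per-depth pools. The following is a single-threaded sketch under simplifying assumptions, not the TBB scheduler: `Pool`, `take_deepest`, `steal_shallowest` and `next_task` are hypothetical names, tasks are integer ids, and rules 1, 2 and 4 are left out.

```cpp
#include <cstddef>
#include <deque>
#include <vector>

// One list of ready task ids per task depth (index = depth).
typedef std::vector<std::deque<int>> Pool;

// Rule 3: take a task from the deepest non-empty list of the own pool.
bool take_deepest(Pool& own, int& task_id) {
    for (std::size_t d = own.size(); d-- > 0;) {
        if (!own[d].empty()) {
            task_id = own[d].front();
            own[d].pop_front();
            return true;
        }
    }
    return false;
}

// Rule 5: steal a task from the shallowest non-empty list of another pool.
bool steal_shallowest(Pool& victim, int& task_id) {
    for (std::size_t d = 0; d < victim.size(); ++d) {
        if (!victim[d].empty()) {
            task_id = victim[d].front();
            victim[d].pop_front();
            return true;
        }
    }
    return false;
}

// Tries the own pool first and falls back to stealing.
bool next_task(Pool& own, Pool& victim, int& task_id) {
    return take_deepest(own, task_id) || steal_shallowest(victim, task_id);
}
```

Taking the deepest task locally keeps the working set small, while a stolen task of low depth usually stands for a large subtree of work, so a thief has to steal less often.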
