
Multi-Thread Skeletons for Multi-GPUs Computing

Le Duc Tung
High Performance Computing Center, HUST, Hanoi, Vietnam

Nguyen Huu Duc
School of Information and Communication Technology, HUST, Hanoi, Vietnam

Pham Tuan Anh
School of Information and Communication Technology, HUST, Hanoi, Vietnam

Ngo Huy Hoang
High Performance Computing Center, HUST, Hanoi, Vietnam

Nguyen Minh Thap
High Performance Computing Center, HUST, Hanoi, Vietnam

tungldducnh@soict.hut.edu.vn, anhpt@soict.hut.edu.vn, hpcc@mail.hut.edu.vn, hoang.ngo.h@gmail.com, towernguyenminh@gmail.com

ABSTRACT
This paper introduces a library that helps programmers write parallel programs for GPU architectures, in particular for systems with multiple GPUs. The library is designed around skeletons, which let us write parallel programs easily and quickly, much as if we were writing sequential programs. Skeletons are usually described in functional languages, which fully support higher-order functions. Because of the performance and popularity of C++, we annotate C++ code so that higher-order functions are fully supported, which makes it convenient to build general-purpose skeletons.

General Terms
Parallel Programming

Keywords
Multi-GPUs, parallel programming, CUDA, Skeleton

1. INTRODUCTION

In the trend toward multi-core and many-core architectures, NVIDIA introduced the Compute Unified Device Architecture (CUDA), a parallel-processing architecture for GPUs that includes both hardware and software. GPUs are now powerful programmable machines with computing capacity up to a hundred times higher than ordinary CPUs. The power of GPUs inspires us to design and implement a skeletal library for GPU systems. Similar libraries have been introduced in [7] and [9], but both still lack important skeletons such as scan and do not support multiple GPUs in their CUDA implementations.

Skeletons [5] are frequently used functions that serve as building blocks in many algorithms. A skeletal library can be implemented on various back ends while providing a relatively uniform interface. Skeletons are implemented by experts for a specific architecture with a view to squeezing performance out of that architecture. Users can call skeletons in a sequential manner and do not have to understand their actual parallel implementations. Re-implementing such a function with high performance is hard and unnecessary for inexpert programmers because of the many issues of parallelism such as communication, load balancing, and synchronization. The power of skeletons is fully exploited when their implementations support higher-order functions, which accept a function as input and return another function. The input function is usually defined by users when they write their program, which makes the program flexible.

CUDA programs consist of two parts of code: one running on the CPU and the other on the GPU. Although the CUDA framework is based on the C language, users still have to remember many kinds of modified keywords for functions that run on the GPU. Therefore, supporting user-defined functions that execute seamlessly on the GPU architecture is an important feature.

The remainder of the paper is organized into four sections: the first introduces related work on the skeleton approach to programming and gives brief information on the CUDA architecture. Our main contribution is presented in the second section, which covers the design and implementation of the library. In the third section, we run experiments on two basic problems to evaluate the performance and convenience of our library. The final section concludes.

In this paper, we use Haskell notation to represent skeletons. The symbol ⊕ denotes an operator that is both associative and commutative. f x and f x y mean applying a function f to one or two arguments, respectively. The notation f :: a → b denotes the type of a function f that accepts an input of type a and returns an output of type b. g ∘ f denotes the composition of two functions f and g, where the output of f is the input of g.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. SoICT '11, Hanoi, Vietnam. Copyright 2011 ACM ...$10.00.

2. RELATED WORK

2.1 CUDA Architecture

The NVIDIA GeForce 8800, introduced back in 2006, was the first graphics processing unit (GPU) supporting the Compute Unified Device Architecture (CUDA) and programmable in C. Its processing cores execute both graphics processing and general-purpose parallel computing on the same hardware platform. The GPU can effectively execute up to 12,288 threads concurrently on its 128 cores. The GeForce 8800 was also the first GPU to implement scalar threads instead of vector ones, supporting standard scalar programming languages such as C. It also supports other general-purpose languages through its instructions for integer and floating-point arithmetic as defined by IEEE 754.

In 2009, NVIDIA introduced its third-generation architecture for parallel computing, code-named Fermi. With Fermi, NVIDIA summed up lessons learned from the two previous architectures, as well as from the applications executed on them, to create the first GPU truly aimed at high performance computing. The improvements include:

- improved double-precision performance,
- a true cache hierarchy,
- faster context switching,
- more shared memory,
- faster atomic operations.

CUDA is a unified processing architecture including both hardware and software designed for parallel computing. CUDA enables NVIDIA's modern GPUs to execute programs implemented in C, C++, FORTRAN and many other languages. These languages have one thing in common: they are designed for a sequential execution model; CUDA maintains this model and extends it with a minimal set of abstractions describing the parallel portion of the implementation. This way, programmers can focus on designing parallel algorithms instead of spending effort on mechanisms for controlling concurrency.

A CUDA program consists of two parts: the host part runs on the CPU(s), and the other part contains one or more kernels and device functions. When a CUDA program is executed, the host part can invoke a parallel kernel on the device. The kernel is then executed concurrently N times by N different CUDA threads. For convenience, these threads are organized into a thread hierarchy. Threads can be grouped into one-, two-, or three-dimensional thread blocks, which in turn can be organized into a one- or two-dimensional grid of thread blocks. Threads in a block can synchronize through the built-in barrier and communicate with each other through the fast but limited shared memory. CUDA does not support synchronization between threads in different blocks; however, those threads can communicate through the global memory and atomic functions.

During execution, each thread is mapped to a streaming processor (SP), or CUDA core. For instance, a Tesla C1060 has 240 CUDA cores, so 240 threads execute in parallel at any moment. Each block of threads is run by a streaming multiprocessor (SM); depending on hardware resources, an SM can execute multiple blocks. With the introduction of the new Fermi GPUs, different kernels can be executed concurrently.

(At the time of this writing, the limit is sixteen concurrent kernels, corresponding to the sixteen SMs on a Fermi GPU.) With the Tesla GPUs, only one kernel can use the GPU at a time, so different kernels have to be executed sequentially. In the SIMT model [8], threads are executed in groups of 32 called warps. Any divergence in the execution of threads within the same warp lowers performance. Threads in the same warp can access memory freely, but to obtain high performance they should access consecutive words in memory, because such accesses can be coalesced into a single aggregate transaction, which results in higher memory throughput.
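To make the thread hierarchy and the grid-stride access pattern concrete, the following minimal CUDA C++ sketch (not taken from the paper; the kernel name, sizes and launch configuration are illustrative assumptions) launches a kernel over a one-dimensional grid of one-dimensional blocks and lets each thread process elements separated by the total number of launched threads, so that neighbouring threads touch consecutive words and their accesses can be coalesced.

#include <cuda_runtime.h>

// Illustrative kernel: each thread starts at its global index and then strides
// by the total number of threads in the grid, so consecutive threads always
// read consecutive elements (coalesced accesses).
__global__ void scaleKernel(int n, const float *in, float *out) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;   // global thread index
    int stride = blockDim.x * gridDim.x;               // total threads launched
    for (int i = tid; i < n; i += stride)
        out[i] = 2.0f * in[i];
}

int main() {
    const int n = 1 << 20;
    float *in, *out;
    cudaMalloc(&in, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(in, 0, n * sizeof(float));
    // One-dimensional grid of one-dimensional blocks, as described above.
    int block = 256;
    int grid = (n + block - 1) / block;
    scaleKernel<<<grid, block>>>(n, in, out);
    cudaDeviceSynchronize();
    cudaFree(in);
    cudaFree(out);
    return 0;
}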

2.2 The approach of skeletons for parallel programming

Skeletal parallel programming [5], [6] is known as a promising approach to parallel programming. In this approach, programmers build their programs from pre-defined functions (skeletons); these components make parallelism easier and hide many issues of lower-level parallel implementation on different architectures. There are many definitions of a skeleton, but in general it is like a pattern in software engineering, with a clearly defined interface. The source code of a skeletal parallel program looks very simple and easy to understand; it is almost the same as a sequential program. Consider a simple skeleton called the divide-and-conquer skeleton. This skeleton applies to a class of data-parallel problems. Its definition is as follows:
DC indivisible split join f = F
  where F P = f P,                       if indivisible P
        F P = join (map F (split P)),    otherwise

The skeleton takes three auxiliary functions: indivisible checks whether the input data can be divided further, split divides the input data into equal parts, and join combines the results returned from the computations. These functions are run automatically by the system; the only burden on users is to define a function f to apply to each indivisible piece P of the input data. f is usually called the user-defined function. Libraries providing divide-and-conquer skeletons include SkeTo [2] for C++ and Skandium [1] for Java. Besides this, there are other kinds of skeletons such as the iterative combination skeleton, the cluster skeleton and the task queue skeleton [5].
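As an illustration only (not code from cuSkel or SkeTo), the divide-and-conquer skeleton above can be written as a sequential higher-order function in C++; all names here are hypothetical and the sketch ignores parallel execution.

#include <vector>
#include <functional>

// Sequential sketch of the DC skeleton: the four parameters mirror
// indivisible, split, join and f from the definition above.
template <class P, class R>
R dc(std::function<bool(const P&)> indivisible,
     std::function<std::vector<P>(const P&)> split,
     std::function<R(const std::vector<R>&)> join,
     std::function<R(const P&)> f,
     const P& p) {
    if (indivisible(p))
        return f(p);                      // base case: solve directly
    std::vector<R> partial;
    for (const P& sub : split(p))         // divide ...
        partial.push_back(dc(indivisible, split, join, f, sub));
    return join(partial);                 // ... and combine
}

// Example use (hypothetical): summing an array by recursive halving, where
// split cuts the vector in two, f returns the single remaining element and
// join adds the partial sums:
//   int total = dc<std::vector<int>, int>(isSmall, halve, addAll, takeFirst, data);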

3. CUSKEL LIBRARY

3.1 Architecture

The library is built on top of the CUDA framework and the pthreads library, with the architecture shown in Figure 1. The library acts as an abstraction layer, so users do not need to pay attention to the details of the CUDA architecture and implementation, or to how multiple threads are created to control multiple GPUs. Because of the limitations of function pointers on the CUDA architecture, we use a source-to-source compiler built on the Clang library [3]. The compiler converts the function pointers passed to skeletons in the user's source code into direct function calls inside the body of the skeleton. These calls are then annotated to generate CUDA-compatible code.

Skeletons: this is the main component of the library, containing the skeletons for parallel computation on GPUs.

To support multi-threading in the memory-management functions and the skeletons, one of two models can be used: fork-join or thread pool. In the thread-pool model, threads sleep when there are no jobs to execute; when a job arrives, the system sends a message to wake the threads up and they process the assigned jobs. The advantage of this model is that threads are created only once. In the fork-join model, threads terminate after finishing their assigned jobs and are destroyed to release memory; they are re-created for each execution. Compared to the thread-pool model, the fork-join model has lower performance because of the cost of thread creation. However, implementing the thread-pool model in the context of skeletons is nontrivial. Because threads are created only once, we cannot use the template feature of C++ to implement the skeletons: different skeletons have different user-defined functions, in other words the input types of the skeletons are not the same, while each thread is bound to a specific function and data at the time of its creation. Therefore, it is not easy to share a common thread among many skeletons.

In this paper, our approach is to use the fork-join model. When a memory-management function or a skeleton is called, the library spawns N sub-threads corresponding to the N GPUs in the system. While the sub-threads do their jobs, the main thread simply waits for the results returned by each sub-thread. This process is shown in Figure 4.

Figure 1: The architecture of cuSkel library.

After this step, the final source code fully respects the constraints of GPU hardware with compute capability 1.x. Using such a compiler also increases the extensibility of the execution environment of the cuSkel library. For example, if we wanted to add more computing power by using multiple CPUs, we would only modify the code generated by the compiler, without changing the programming interface. After processing, the user's program is compiled by nvcc or gcc to run on both the GPU and the CPU. Figure 2 shows the central role of the source-to-source compiler in our library.

Figure 2: Role of source-to-source compiler
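To illustrate the kind of rewriting such a compiler performs (a hypothetical example, not the actual output of the cuSkel compiler), a user-defined function passed to a skeleton by pointer can be turned into a direct, device-qualified call inside the generated kernel:

// Hypothetical user code (before the transformation):
//
//     float square(float x) { return x * x; }
//     ...
//     cuMap(square, n, in, out);
//
// Hypothetical generated code (after the transformation): the function is
// annotated for the device and the indirect call through the pointer is
// replaced by a direct call, because GPUs of compute capability 1.x cannot
// dereference function pointers.
__device__ float square(float x) { return x * x; }

__global__ void mapKernel_square(int n, const float *in, float *out) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < n) {
        out[tid] = square(in[tid]);   // direct call instead of (*f)(in[tid])
        tid += blockDim.x * gridDim.x;
    }
}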

APIs in the library are categorized into three groups, as shown in Figure 3:

Environment management: these APIs are called when users want to start or finish computing on the GPUs.

Memory management: users call these functions when they want to perform operations relating to memory on the GPUs.

Figure 4: fork-join model for multi-threads.

3.2 Design of APIs and Skeletons

Because the environment functions run on a single thread, we only describe the multi-threaded APIs here. The implementation of a multi-threaded API consists of three parts, as shown in Figure 5.

1. Function cuSkel_api_function: the user calls this function directly when they want to use cuSkel. It is executed by the main thread, whose role is to create the sub-threads and then wait for their completion.

2. Function thread_code: this function drives one sub-thread that uses a specific GPU in the system. Based on the initial values received from cuSkel_api_function, it initializes the private values of its thread and sets the area of data needed for the computation.

Figure 3: Main components in cuSkel library.

cuSkel_api_function(arg):
    init_thread_argument()
    for i = 1 to num_of_GPUs:
        pthread_create(threads[i], thread_properties,
                       thread_code, thread_args[i])
    wait_for_threads()

thread_code(arg):
    init_thread_private_data()
    kernel_function(thread_arg)
    wait_for_kernel()
    pthread_exit()

kernel_function(arg):
    do_work()

Pattern        Function
cuInit()       Initializes parameters for the environment and global variables. cuInit() must be called before any operation on the GPU.
cuFinalize()   Releases the global variables.

Table 1: Functions of environment management.

Figure 5: Pseudo-code of the multi-threaded APIs.

The thread_code function then calls kernel_function with the corresponding data to actually execute on the GPU. For the memory-management APIs, this function calls the memory-management functions of the CUDA library instead of kernel_function. The function also waits until execution on the GPU has completed.

3. Function kernel_function: this function actually performs the parallel computation on the GPU. At run time it only calls functions that can execute on the GPU, not functions running on the CPU (host).

Because thread_code takes a single argument, we need a suitable data structure to store its input. The structure should include three main pieces of information:

thread_code: a value that is unique to each thread and is used to distinguish different threads; it also determines the input data assigned to a thread.

data: both the input and the output data; this field is usually a pointer.

size_of_data: based on thread_code, a pointer to the first element of the data array and the size of the data, it is easy to work out the data offset for each thread. This is discussed in Section 3.2.2.
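A minimal sketch of such an argument structure (the struct and field layout below are illustrative assumptions, not the actual cuSkel definitions) could look like this:

// Hypothetical per-thread argument block passed to thread_code().
struct thread_arg_t {
    int    thread_code;    // unique id; selects this thread's GPU and data slice
    void  *data;           // pointer covering both input and output data
    size_t size_of_data;   // element count, used to compute each thread's offset
};

// Each sub-thread would then locate its slice roughly as:
//   size_t size   = size_of_data / num_of_GPUs;
//   size_t offset = size * thread_code;    // the last GPU takes the remainder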

To map a user pointer to the actual memory on the GPUs, a global variable memTLU of type std::map is used. For convenience, a variable memCounter indexes the memory areas; its initial value is zero. The details of these variables are discussed in Section 3.2.2. The work of cuFinalize() is the opposite of cuInit(): cuFinalize() releases the variables initialized by cuInit(), such as memTLU and data.

3.2.2 Memory Management

To manage memory on the GPUs, cuSkel provides four APIs:

cuAlloc(): allocates memory.
cuFree(): releases allocated memory.
cuGetVector(): transfers data from the GPU to main memory on the host (CPU).
cuSetVector(): transfers data from main memory to the GPU.

Note that these functions only operate on GPU memory. For host RAM, the programmer still uses malloc and free, or the operators new and delete, as usual.

The cuSkel library executes parallel computations on many GPUs concurrently, so the data has to be distributed over the GPUs, and a mechanism that hides this distribution from the user is very important. Memory management in cuSkel is based on an array called data, of type std::vector, that stores pointers to memory blocks on the GPUs: the i-th element of the array stores the memory information of the i-th GPU in the system. When a thread accesses GPU memory, it uses thread_code to select an element of this array. Using both data and memTLU, we can map a single user pointer to several pointers that manage the data on multiple GPUs. The key stored in memTLU is the address of a memory cell, which guarantees that it is unique, and the global variable memCounter numbers the allocated memory areas. To allocate a memory area on multiple GPUs, cuSkel proceeds as follows:

1. Allocate memory for one element of the user-defined type, obtaining a pointer a to this area. The user will then use the pointer a to refer to the actual memory on the GPUs.

2. memTLU[a] = memCounter++; store the key a in memTLU with the value of memCounter, then increase memCounter by one.

3. Initialize the main thread and create the sub-threads.

3.2.1 Environment Management

The cuSkel library provides two functions to manage the execution environment: cuInit() and cuFinalize(). Unlike the other APIs, the implementation of these functions does not use the pthreads library. Table 1 lists the details of these functions. Before working with GPU memory or skeletons, the programmer must call cuInit() to initialize the variables needed for executing the program on the GPU. First, cuInit() determines the number of GPUs in the system, and hence the number of threads to create, through the CUDA library. For each thread, the system initializes an array data of type std::vector to store the locations of allocated memory.

4. Each sub-thread calculates its memory area. For an array of N elements, the calculation is as follows:
size = numOfElement / numOfGPU
if (threadID == (numOfGPU - 1)) {
    size = numOfElement - (size * threadID)
}

mapKernel - the core kernel

Besides the function f, the other arguments of cuMap are assigned to the corresponding fields of the structure map_info, which is passed to the thread mapWorker. Based on the fields of map_info, mapWorker determines the size of the data to process, resolves the address spaces pointed to by the pointers in and out, and calls the kernel mapKernel corresponding to the function f. Implementing the map skeleton itself is fairly simple, as shown in Figure 6. Each thread running on the GPU computes its identification number from the built-in CUDA variables threadIdx, blockIdx and blockDim and then processes its data element. After processing an element, the thread jumps to the next element with a stride s equal to the total number of threads running on the GPU. This process is repeated until all elements of the input array have been treated.
template <class T1, class T2>
__global__ void mapKernel(int n, T1 *in, T2 *out) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < n) {
        out[tid] = f(in[tid]);
        tid += blockDim.x * gridDim.x;
    }
}

5. Each sub-thread actually allocates its memory area on the GPU it manages and stores the address of that area in the array data.

Compared to allocation, releasing memory is simpler:

1. memID = memTLU[a]; the sub-threads get the index of the memory area pointed to by a.

2. aOnGPU = data[threadID][memID]; based on that index, each sub-thread gets its pointer from data[threadID].

3. The sub-threads release the memory on the GPU using the CUDA library.

For the data-transfer functions, in addition to obtaining the actual address of the memory block and computing the size of the memory on each GPU, the threads also have to compute the offset into main memory (the host memory). The formula is

offset = size * threadID

Each sub-thread transfers the region of size elements starting at address a + offset in main memory to the GPU it manages.
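As an illustration of how these APIs fit together (the exact cuSkel signatures are not given in the paper, so the prototypes below are assumptions made for the sake of the example; only the cuMap signature is taken from Section 3.3.1), a typical sequence would allocate a distributed array, copy the input to the GPUs, run a skeleton and copy the result back:

// Assumed shapes, modelled on the descriptions above (not verbatim cuSkel):
//   T*   cuAlloc<T>(int n);                   allocate n elements across the GPUs
//   void cuSetVector(T *dst, T *src, int n);  host -> GPUs
//   void cuGetVector(T *dst, T *src, int n);  GPUs -> host
//   void cuFree(T *p);

float sqr(float x) { return x * x; }

int main() {
    const int n = 1 << 20;
    float *h_in  = new float[n];
    float *h_out = new float[n];

    cuInit();                               // set up the environment, memTLU, data
    float *d_in  = cuAlloc<float>(n);       // user-level pointers; the library
    float *d_out = cuAlloc<float>(n);       //   maps them to per-GPU blocks
    cuSetVector(d_in, h_in, n);             // distribute the input over the GPUs
    cuMap(sqr, n, d_in, d_out);             // run the map skeleton
    cuGetVector(h_out, d_out, n);           // gather the result on the host
    cuFree(d_in);
    cuFree(d_out);
    cuFinalize();

    delete[] h_in;
    delete[] h_out;
    return 0;
}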

Figure 6: The core of skeleton map

3.3 Parallel Skeletons

3.3.2 reduce skeleton


The reduce skeleton is a primitive and important data-parallel operation. It takes an associative binary operator ⊕ and an array of n elements [a_0, a_1, ..., a_{n-1}] and returns the scalar value

a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-1}

Corresponding to this definition, the prototype of cuReduce, the function that implements the reduce skeleton, is
template <class T1>
void cuReduce(T1 (*f)(T1, T1), int N, T1 *in, T1 *out)
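The paper does not show the reduce kernel itself, so the following CUDA C++ sketch is only one standard way such a kernel could be written (a shared-memory tree reduction per block, with the per-block results combined on the host or in a second pass); it is not cuSkel's actual code, and f stands for the user-defined associative operator inlined by the source-to-source compiler.

// Sketch of a per-block tree reduction; each block writes one partial result.
// Assumes blockDim.x is a power of two and T() is the identity of f.
template <class T>
__global__ void reduceKernel(int n, const T *in, T *blockOut) {
    extern __shared__ unsigned char smemRaw[];
    T *smem = reinterpret_cast<T *>(smemRaw);

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;

    // Grid-stride accumulation into a per-thread value.
    T acc = T();
    for (; i < n; i += stride)
        acc = f(acc, in[i]);
    smem[tid] = acc;
    __syncthreads();

    // Tree reduction within the block.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            smem[tid] = f(smem[tid], smem[tid + s]);
        __syncthreads();
    }
    if (tid == 0)
        blockOut[blockIdx.x] = smem[0];   // combined again outside this kernel
}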

The cuSkel library consists of six skeletons for parallel computation on multiple GPUs. map, reduce and scan are primitive skeletons; zipWith, mapReduce and zipWithReduce are advanced skeletons optimized for certain classes of problems.

3.3.1 map skeleton

map is a skeleton derived from functional languages. It applies a function concurrently to every element of a list. The formal definition of map is

map f :: [a] → [b]

where f is a function of type f :: a → b. In the cuSkel library, the map skeleton is implemented by the function cuMap, declared as follows:
template <class T1, class T2>
void cuMap(T2 (*f)(T1), int N, T1 *in, T2 *out)
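For example, converting an array of floats to rounded integers with cuMap could look like the following sketch (the calling shape is inferred from the prototype above; the surrounding allocation is assumed):

// User-defined function; the input and output element types may differ
// (T1 = float, T2 = int in this example).
int roundToInt(float x) { return (int)(x + 0.5f); }

// Assuming in (float*) and out (int*) were obtained through cuAlloc:
//   cuMap(roundToInt, n, in, out);   // out[i] = roundToInt(in[i]) on all GPUs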


3.3.3 scan skeleton


The scan skeleton looks like an inherently sequential calculation, but an efficient parallel algorithm for it is given in [4]. The inclusive scan takes an associative binary operator ⊕ and an array of n elements [a_0, a_1, ..., a_{n-1}] and returns

[a_0, (a_0 ⊕ a_1), ..., (a_0 ⊕ a_1 ⊕ ... ⊕ a_{n-1})]
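The paper does not list a prototype for its scan skeleton, so purely as a reference for the semantics, a sequential C++ version of the inclusive scan defined above is sketched here (a parallel CUDA implementation would follow Blelloch's algorithm [4]):

#include <vector>

// Sequential reference: prefix[i] = a_0 (+) a_1 (+) ... (+) a_i.
template <class T, class Op>
std::vector<T> inclusiveScan(const std::vector<T> &a, Op op) {
    std::vector<T> prefix(a.size());
    if (a.empty()) return prefix;
    prefix[0] = a[0];
    for (size_t i = 1; i < a.size(); ++i)
        prefix[i] = op(prefix[i - 1], a[i]);   // running combination
    return prefix;
}

// e.g. inclusiveScan(v, [](int x, int y){ return x + y; }) yields running sums.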

Using the template feature of C++, the skeleton accepts arguments of any type. The map skeleton is implemented with the structure described earlier in Figure 5:

cuMap - the API function
mapWorker - the thread function

3.3.4 zipWith skeleton

zipWith is not an essential skeleton, but it is very useful for applications that deal with two arrays of the same type. The zipWith skeleton takes a function f and two arrays of n elements [a_0, a_1, ..., a_{n-1}] and [b_0, b_1, ..., b_{n-1}] and returns the array

[f(a_0, b_0), f(a_1, b_1), ..., f(a_{n-1}, b_{n-1})]

Like the map skeleton, zipWith requires no communication, so it is a linear operation.

3.3.5 zipWithReduce and mapReduce skeletons

mapReduce and zipWithReduce are complex skeletons. mapReduce is the composition of map and reduce: it takes a function f, an associative binary operator ⊕ and an array of n elements [a_0, a_1, ..., a_{n-1}] and returns

f(a_0) ⊕ f(a_1) ⊕ ... ⊕ f(a_{n-1})

zipWithReduce is the composition of zipWith and reduce: it takes a function f, an associative binary operator ⊕ and two arrays of n elements [a_0, a_1, ..., a_{n-1}] and [b_0, b_1, ..., b_{n-1}] and returns the scalar value

f(a_0, b_0) ⊕ f(a_1, b_1) ⊕ ... ⊕ f(a_{n-1}, b_{n-1})

Compared to composing the skeletons map, reduce and zipWith, the two complex skeletons are more effective because no intermediate arrays are required: instead of making two skeleton calls, a single complex skeleton is invoked. The programmer is encouraged to use the complex skeletons whenever possible.
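To show why the fused skeletons avoid intermediate arrays, the following CUDA C++ fragment sketches the core of a fused zipWithReduce kernel (an illustration under the same assumptions as the reduce sketch above, not cuSkel's actual implementation; f and op stand for the user-defined functions inlined by the compiler):

// Fused zipWith + reduce: each thread combines f(a[i], b[i]) directly into its
// accumulator, so the intermediate array [f(a_i, b_i)] is never written to memory.
template <class T>
__global__ void zipWithReduceKernel(int n, const T *a, const T *b, T *blockOut) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    T acc = T();                              // assumes T() is the identity of op
    for (; i < n; i += stride)
        acc = op(acc, f(a[i], b[i]));         // fuse the two steps element-wise
    // ... followed by a per-block tree reduction of acc and a write of
    //     blockOut[blockIdx.x], exactly as in the reduce sketch above.
}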


Size    CPU        1GPU      4GPU
2^25    45.029     2.937     0.816
2^26    90.533     5.848     1.554
2^27    187.190    12.046    2.987
2^28    362.486    X         5.889

Table 2: The experimental results for the map skeleton (in milliseconds).


The map skeleton running on one GPU is about 15 times faster than the same operation on a CPU. Note also that when the number of GPUs is increased four times, the execution time is divided by roughly four; this is because the map operation is purely data-parallel and requires no communication. The map experiment with 2^28 elements on a single GPU fails because of insufficient memory space, which is precisely why cuSkel supports operation on many GPUs.

reduce skeleton. The reduce skeleton is tested with an associative operator ⊕ and arrays of 2^25 to 2^28 single-precision floating-point numbers. Table 3 gives the running time of the reduce skeleton in milliseconds.
Size    CPU         1GPU     4GPU
2^25    171.226     1.358    1.176
2^26    342.173     2.611    1.727
2^27    684.435     5.162    2.550
2^28    1369.800    X        4.786

Table 3: The experimental results for the reduce skeleton (in milliseconds).

4. EXPERIMENT

The experiments were performed at the High Performance Computing Center on a machine with the following configuration:

- one AMD Athlon X4 620 2.6 GHz quad-core processor,
- 8 GB of RAM,
- two NVIDIA GeForce GTX 295 cards providing 4 GPUs in total, each GPU with 240 cores at 1242 MHz and 896 MB of GDDR3 memory at 999 MHz.

First, we test each skeleton separately. Then the skeletons are used to implement two problems: computing the scalar product of two vectors and computing the Pearson coefficient of correlation.

It can be observed that the performance of the reduce skeleton is impressive: the parallel operation on 4 GPUs is roughly 250 times faster than the sequential operation on one CPU, thanks to the optimization techniques applied in the implementation. The reduce operation does not have as much data parallelism as map, so when the number of GPUs is increased four times, the processing time only decreases by about a factor of two.

4.2 Examples

Scalar product.
The scalar product of two vectors a = [a_1, a_2, ..., a_n] and b = [b_1, b_2, ..., b_n], where n is the size of the vectors, is defined as

a · b = Σ_{i=1}^{n} a_i b_i = a_1 b_1 + a_2 b_2 + ... + a_n b_n

4.1 Primitive skeletons

map skeleton. The map skeleton is tested with the function f(x) = x^2 and arrays of 2^25 to 2^28 single-precision floating-point elements. Table 2 gives the computing time of the map skeleton in milliseconds.

To compute the element-wise products a_i b_i, we use the zipWith skeleton with the multiplication function:

zipWith(×, a, b) = [a_1 b_1, a_2 b_2, ..., a_n b_n]

Feeding the result of the zipWith skeleton into the reduce skeleton with the operator +, we obtain the scalar product of the two vectors a and b:

a · b = reduce(+, zipWith(×, a, b))
      = reduce(+, [a_1 b_1, a_2 b_2, ..., a_n b_n])
      = a_1 b_1 + a_2 b_2 + ... + a_n b_n

We may also use the complex skeleton zipWithReduce with the multiplication function and the operator +:

a · b = zipWithReduce(×, +, a, b)

Since the scalar product is used so frequently, it is provided by many libraries. In this experiment, to obtain an objective evaluation, we compare the scalar-product function built with cuSkel against CUBLAS, the implementation of BLAS (Basic Linear Algebra Subprograms) developed by NVIDIA for CUDA. The scalar product is tested with two arrays of 2^24 to 2^28 floating-point numbers. The experimental results are presented in Table 4. Again, the programs running on the GPU are 50 to 200 times faster than on the CPU. It is remarkable that the single-GPU cuSkel version runs about 10% faster than the CUBLAS version; this result demonstrates, to some extent, the quality of the cuSkel library. Note also that the complex-skeleton version runs up to two times faster than the version built from two primitive skeletons.
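In user code, the two variants described above could look roughly like the sketch below. Only the cuMap and cuReduce prototypes are given in the paper, so the zipWith-style entry points (cuZipWith, cuZipWithReduce) and their argument order are assumptions used purely for illustration.

// User-defined operators inlined by the source-to-source compiler.
float mul(float x, float y) { return x * y; }
float add(float x, float y) { return x + y; }

// Variant 1: two primitive skeletons, materializing the intermediate array tmp.
//   cuZipWith(mul, n, a, b, tmp);      // tmp[i] = a[i] * b[i]     (assumed API)
//   cuReduce(add, n, tmp, &dot);       // dot = tmp[0] + ... + tmp[n-1]

// Variant 2: one complex skeleton, no intermediate array (assumed API).
//   cuZipWithReduce(mul, add, n, a, b, &dot);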

Size    CPU         1GPU cuBLAS   1GPU 2 skeletons   1GPU 1 skeleton   4GPU 2 skeletons   4GPU 1 skeleton
2^24    89,020      1,408         2,915              1,333             1,458              1,200
2^25    177,657     2,946         5,721              2,645             1,882              1,523
2^26    354,700     6,299         11,562             5,505             3,173              2,621
2^27    710,407     X             X                  X                 6,355              3,910
2^28    1420,532    X             X                  X                 12,937             6,913

Table 4: The experimental results for the scalar product (in milliseconds).

Size    CPU         1GPU cuBLAS   1GPU 1 skeleton   4GPU 1 skeleton
2^24    68,648      5,290         4,365             5,250
2^25    136,874     10,789        8,025             6,208
2^26    270,313     22,834        17,638            8,356
2^27    552,128     X             X                 13,429
2^28    1089,452    X             X                 24,713

Table 5: The experimental results for the coefficient of correlation (in milliseconds).

For the programs running on the GPU, the ratio of arithmetic operations to global-memory accesses does not change, and the multi-GPU program has to initialize and run several thread groups, which is quite expensive. In comparison to the program implemented with the CUBLAS library, the skeleton program is still about 20% faster.

5. CONCLUSIONS

Pearson coefficient of correlation.


In statistics, the Pearson coefficient of correlation is a measure of the linear dependence between two variables X and Y. Its mathematical expression is as follows:

r = (n ΣXY − ΣX ΣY) / √[(n ΣX^2 − (ΣX)^2)(n ΣY^2 − (ΣY)^2)]

The Pearson coefficient of correlation is calculated using skeletons as follows:

sum1  = reduce(+, [x_1, x_2, ..., x_n])
sum2  = reduce(+, [y_1, y_2, ..., y_n])
sumpr = zipWithReduce(×, +, [x_1, x_2, ..., x_n], [y_1, y_2, ..., y_n])

sumsq1 = mapReduce(sqr, +, [x_1, x_2, ..., x_n])
sumsq2 = mapReduce(sqr, +, [y_1, y_2, ..., y_n])

r = (n·sumpr − sum1·sum2) / √[(n·sumsq1 − sum1^2)(n·sumsq2 − sum2^2)]

where sqr(x) = x^2.

In this experiment, we calculate the coefficient of correlation of two arrays of 2^24 to 2^28 floating-point numbers. The experimental results are presented in Table 5. The gap between the execution times of the CPU program and the GPU programs is smaller here: even the 4-GPU program is only 44 times faster than the sequential program. This happens because the number of arithmetic operations in the main loop of this program is five times that of the CPU program computing the scalar product.
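Expressed with the cuSkel-style calls used earlier (again, only the cuMap and cuReduce signatures appear in the paper; cuMapReduce and cuZipWithReduce below are assumed entry points shown purely to mirror the formulas above):

float add(float x, float y) { return x + y; }
float mul(float x, float y) { return x * y; }
float sqr(float x)          { return x * x; }

// Five passes over the data, one per aggregate, then a scalar combination on the host.
//   cuReduce(add, n, x, &sum1);
//   cuReduce(add, n, y, &sum2);
//   cuZipWithReduce(mul, add, n, x, y, &sumpr);   // assumed API
//   cuMapReduce(sqr, add, n, x, &sumsq1);         // assumed API
//   cuMapReduce(sqr, add, n, y, &sumsq2);         // assumed API
//   float r = (n * sumpr - sum1 * sum2)
//           / sqrtf((n * sumsq1 - sum1 * sum1) * (n * sumsq2 - sum2 * sum2));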

With the aim of easing programming on GPUs, this paper has introduced a library of high-performance algorithmic skeletons with a simple programming interface. The library contains two main components: skeletons for the one-dimensional array data structure and a source-to-source compiler. Six basic algorithmic skeletons have been implemented; they accept data of any type and run in parallel on multiple GPUs. Memory management and the distribution of work among tasks are performed automatically and transparently to the programmer. The library obviously cannot outperform an experienced developer, but the experimental results show that it delivers good quality and high performance compared to other libraries and to sequential programs. One drawback of cuSkel is that it relies heavily on template metaprogramming: because all template code is processed, evaluated and generated at compile time, programs using cuSkel take longer to compile. Given the limits of C++, this is unavoidable if the skeletons are to accept arguments of any type. At present, the skeletons in cuSkel can only handle the one-dimensional array data structure on the multi-GPU architecture. In the future, we will develop skeletons for more complicated data structures such as two-dimensional arrays or trees. Support for new platforms such as multi-CPU systems or GPU clusters is also a potential research direction.

6. REFERENCES

[1] Skandium library project, 2010.
[2] SkeTo library project, 2010.
[3] Clang: a C language family frontend for LLVM, 2011.
[4] G. E. Blelloch. Scans as primitive parallel operations. IEEE Trans. Comput., 38:1526-1538, November 1989.
[5] M. Cole. Algorithmic skeletons: structured management of parallel computation. MIT Press, Cambridge, MA, USA, 1991.


[6] J. Darlington, A. J. Field, P. G. Harrison, P. H. J. Kelly, D. W. N. Sharp, and Q. Wu. Parallel programming using skeleton functions. In Proceedings of the 5th International PARLE Conference on Parallel Architectures and Languages Europe, PARLE '93, pages 146-160, London, UK, 1993. Springer-Verlag.
[7] J. Enmyren and C. W. Kessler. SkePU: a multi-backend skeleton programming library for multi-GPU systems. In Proceedings of the Fourth International Workshop on High-Level Parallel Programming and Applications, HLPP '10, pages 5-14, New York, NY, USA, 2010. ACM.
[8] N. Satish, M. Harris, and M. Garland. Designing efficient sorting algorithms for manycore GPUs. In Proceedings of the 2009 IEEE International Symposium on Parallel & Distributed Processing, pages 1-10, Washington, DC, USA, 2009. IEEE Computer Society.
[9] S. Sato and H. Iwasaki. A skeletal parallel framework with fusion optimizer for GPGPU programming. In Proceedings of the 7th Asian Symposium on Programming Languages and Systems, APLAS '09, pages 79-94, Berlin, Heidelberg, 2009. Springer-Verlag.
