
Mosaic: A Comprehensive API Framework for Heterogeneous Multi-core Systems

Prashanth Thinakaran, RamSrivatsa Kannan, Srinivas Avirdy


Abstract- In this paper we propose Mosaic, a comprehensive Application Programming Interface (API) framework for heterogeneous multi-core systems. Recent trends show that adopting heterogeneity in multi-core processors can deliver significantly higher performance and energy efficiency. To extract the maximum potential of the underlying hardware, the API must efficiently automate the mapping of computation onto the available computational resources. However, most existing approaches rely on the programmer to specify this mapping manually, which demands considerable effort. Mosaic is a novel API framework that uses efficient heuristics to automate the mapping of computations to processing elements, and it incorporates a map-reduce model to exploit the explicit parallelism of an application. By using Mosaic to map computations to processing elements on a hyper-threaded CPU+GPU system, we reduce both energy consumption per instruction and execution time, and we show that the proposed API adapts to a variety of benchmark sets and system configurations. We also present qualitative analyses of architectural parameters such as compiler optimization, scheduling, and dynamic mapping of computation across the CPU and application-specific accelerators.

Keywords- heterogeneous architectures; multi-core; API; map-reduce; GPGPU; CUDA

I. INTRODUCTION AND RELATED WORK

Recent trends have shown that adopting heterogeneity in multi-core processors improves the power-to-performance ratio [14] and yields better Energy Per Instruction (EPI) [5]. Fundamentally, as we scale up the number of cores in a multi-core environment, it becomes essential to design ultra-low-EPI cores [5] [21]. Most personal computers today contain a multi-core CPU and a GPU, as in Figure 1. In a typical heterogeneous environment, a General Purpose Processor (GPP) is combined with one or more specialized cores such as a GPU; computations specific to the specialized cores are mapped onto those devices, while conventional general-purpose compute instructions are mapped onto the GPPs [4] [7] [12] [14]. As these multi-core chips become ubiquitous, efficiently harnessing their computational power requires resorting to parallel programming techniques, which are widely regarded as the need of the multi-core era. However, most traditional parallel programming techniques, such as message passing and shared-memory threading, demand a great deal of laborious work from developers. It is cumbersome for a programmer to manage concurrency, locality, and synchronization in a multi-core, multi-threaded processing environment, so it is very difficult to write a scalable parallel program that efficiently uses the underlying computational resources. For developers to fully harness the potential of heterogeneous multi-core processing elements, the step that maps computation onto heterogeneous processing elements must be automated. Most APIs for heterogeneous systems, however, rely on the programmer to perform this mapping manually [21]. Merge [15], a CHI-based programming environment for heterogeneous multi-core systems, programs heterogeneous processing elements together using a library-based approach, but the mapping of computations to processing elements is not determined dynamically at run time. This cumbersome manual approach should be eliminated by fully automating the mapping process, which is what the Mosaic API framework does.

This paper presents Mosaic, a map-reduce [11] based API that maps computations dynamically at run time. The Map function splits the input data into chunks and generates an intermediate set of key-value pairs for the individual data chunks; pairs with the same key are then merged by the Reduce function. At run time, after the workload has been realized as a Directed Acyclic Graph (DAG), the application-specific instructions are identified and segregated, and map-reduce automatically parallelizes the computation by mapping the disjoint chunks of input data. The Mosaic implementation uses threads to spawn the map and reduce functions in parallel. Run-time mapping facilitates load balancing, and locality is managed by varying the granularity of the application, thereby achieving maximum throughput. Thus the Mosaic API, as shown in Figure 2, handles concurrency, locality, and load balancing automatically and facilitates dynamic mapping for a heterogeneous environment by realizing the workload as a DAG. The rest of the paper is organized as follows: Section 2 describes the Mosaic framework in detail, Section 3 describes dynamic compilation, Section 4 discusses the Mosaic programming model with block and flat data, and the results of running various workloads are analyzed in Section 5.

II. MOSAIC FRAMEWORK

The Mosaic framework uses TBB [9] and CUDA [2] [13] to create and implement an API that suits a wide range of heterogeneous architectures. A sketch of Mosaic is shown in Figure 1. Mosaic supports a virtual multi-threaded programming model for heterogeneous systems. The implementation of Mosaic is similar to OpenMP [6] [19], with the major difference that standard OpenMP parallelizes code only on the CPU, whereas Mosaic parallelizes it both on the CPU and on domain-specific accelerators such as GPUs. Beneath the Mosaic API is the Mosaic system layer, which consists of a scheduler and a set of libraries that are loaded dynamically at run time. The compiler dynamically converts the API code into assembly code by linking the associated libraries; to promote code reusability, the compiled code is stored in a code cache so that it can be reused without recompilation. After compilation, the scheduler dispatches the native machine code to the underlying CPU and to accelerators such as GPUs.

Figure 1. Heterogeneous architecture comprising a general-purpose CPU and one or more accelerators

Mosaic defines mArray, a new data structure built with C++ templates that represents a multi-dimensional array of a generic type; an mArray<int>, for example, represents an array of integers. These customized data types facilitate domain-specific computation accuracy. Mosaic also allows programmers to select the heterogeneous processing element explicitly: for example, mArray<int> Msum = Add(Mx, My, PE_Select) lets the programmer choose the domain-specific processing element of his own choice. Mosaic uses Intel Threading Building Blocks [1] [9] [10] to exploit CPU parallelism through hardware threads. This approach is known as the threading-API approach, in which the programmer implements the function intrinsics explicitly, using TBB on the CPU and CUDA [2] [16] [18] [20] on the GPU; Mosaic then automatically maps each computation to its specific processing element. Figure 3 shows the implementation of an image filter program using this threaded approach. CpuFilter() and GpuFilter() are executed in parallel across their respective processing units and then combined by the Mosaic run-time bind function, BindFilter(). The keyword Mosaic_PARTIONABLE indicates that the associated computation can be partitioned across the CPU and other accelerators such as the GPU. Finally, ApplymArrayOp is applied to BindFilter(), which yields a Boolean array mSuccess that is reduced to a single Boolean value.
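The code listing of Figure 3 is not reproduced in this text, so the following is a minimal, self-contained sketch of the structure just described. It uses std::async in place of the Mosaic runtime; the names CpuFilter(), GpuFilter() and BindFilter() come from the text, but their bodies, the simplified mArray layout, and the merge logic are assumptions made purely for illustration. In the actual framework GpuFilter() would be a CUDA kernel launch, and the parallel execution and binding would be handled by the Mosaic runtime rather than by the programmer.

// Illustrative sketch only: mArray's layout and the filter bodies are assumed;
// in Mosaic the run-time, not the programmer, overlaps the two filters.
#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

template <typename T>
struct mArray {               // simplified stand-in for Mosaic's generic array
    std::vector<T> data;
};

// CPU half of the filter: in Mosaic this would be written with TBB.
mArray<int> CpuFilter(const mArray<int>& img) {
    mArray<int> out = img;
    for (auto& px : out.data) px = std::min(px + 10, 255);   // toy brighten
    return out;
}

// GPU half of the filter: in Mosaic this would be a CUDA kernel launch.
mArray<int> GpuFilter(const mArray<int>& img) {
    mArray<int> out = img;
    for (auto& px : out.data) px = 255 - px;                  // toy invert
    return out;
}

// Stand-in for the Mosaic run-time bind function BindFilter().
mArray<int> BindFilter(const mArray<int>& a, const mArray<int>& b) {
    mArray<int> out = a;
    for (std::size_t i = 0; i < out.data.size(); ++i)
        out.data[i] = (out.data[i] + b.data[i]) / 2;          // merge results
    return out;
}

int main() {
    mArray<int> image{{10, 20, 30, 40}};

    // CpuFilter() and GpuFilter() run in parallel on their processing elements.
    auto cpuPart = std::async(std::launch::async, CpuFilter, image);
    auto gpuPart = std::async(std::launch::async, GpuFilter, image);

    mArray<int> filtered = BindFilter(cpuPart.get(), gpuPart.get());
    return filtered.data.empty() ? 1 : 0;
}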

III. DYNAMIC COMPILATION

Mosaic uses dynamic compilation to translate Mosaic API calls into machine code at run time. One of the most striking features of Mosaic is this dynamic compilation and its ability to adapt to changes in the runtime environment. Mosaic compilation comprises six stages, as shown in Figure 5.

1. Categorization and segregation of instructions: In the program that implements the function intrinsics of the naive programmer, the coding logic must be split up in compliance with the underlying processing units. The domain-specific native instructions are therefore judiciously categorized and segregated.

2. Building Directed Acyclic Graphs (DAGs): DAGs are built according to the data dependencies among the mArray data structures in the API calls. These DAGs are the intermediate representation used to realize the application's dependency graph across CPUs and accelerators such as DSPs, FPGAs, and GPUs. Figure 4 shows the steps in building a DAG for varied application characteristics by analyzing the communication pattern the application exhibits.

3. Map-Reduce: In Mosaic's map-reduce pattern, all computations are decomposed into a set of map operations and a reduce operation, with all map operations independent and potentially concurrent [3] [11]. Through this paradigm the application's inherent parallelism can be extracted by decomposing the data hierarchically, from simple computations to complex algorithms. The map-reduce paradigm is thus intended to capture the maximum amount of parallelism exhibited by the application, as the sketch below illustrates.
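As a concrete illustration of stage 3, the following sketch decomposes a computation into independent map operations over disjoint input chunks followed by a single reduce. It uses plain C++ threads rather than the Mosaic runtime, and the chunking, the toy workload, and the combine step are assumptions made only for illustration.

// Minimal map-reduce sketch: independent map operations over disjoint chunks,
// followed by a single reduce; Mosaic would schedule these across CPU and GPU,
// here they simply run on std::threads.
#include <cstddef>
#include <numeric>
#include <thread>
#include <vector>

// Map: each chunk is processed independently and produces a partial result.
long long MapChunk(const std::vector<int>& data, std::size_t begin, std::size_t end) {
    long long partial = 0;
    for (std::size_t i = begin; i < end; ++i)
        partial += static_cast<long long>(data[i]) * data[i];
    return partial;
}

int main() {
    std::vector<int> input(1 << 20, 3);           // toy input data
    const std::size_t chunks = 4;                 // one map task per chunk
    const std::size_t step = input.size() / chunks;

    std::vector<long long> partials(chunks, 0);
    std::vector<std::thread> workers;

    // Map phase: all map operations are independent and run concurrently.
    for (std::size_t c = 0; c < chunks; ++c) {
        workers.emplace_back([&, c] {
            partials[c] = MapChunk(input, c * step, (c + 1) * step);
        });
    }
    for (auto& t : workers) t.join();

    // Reduce phase: merge the per-chunk partial results into one value.
    long long total = std::accumulate(partials.begin(), partials.end(), 0LL);
    return total > 0 ? 0 : 1;
}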

4. Load balancing: Mosaic's load balancing decides the mapping from computations to processing elements judiciously in order to increase the utilization of the underlying PEs. The scheduler then schedules the instructions accordingly through Mosaic's automatic mapping technique.

5. Performing optimizations on DAGs: A number of optimizations are applied to the DAGs produced by Mosaic's DAG builder. The two most common are (i) operation coalescing and (ii) removal of redundant temporary arrays. Operation coalescing identifies a group of operations running on the same processing element and abstracts it into a single super-operation, which reduces scheduler overhead because operations are scheduled collectively rather than separately; in particular, element-wise operations are coalesced into one abstracted super-operation. The second optimization removes the redundant temporary arrays used in the Mosaic paradigm.

6. Code generation: In the final step, after DAG optimization, the mapping of computations to their respective processing elements is decided. Because the GPU has limited memory, Mosaic schedules operations to the GPU by splitting a DAG into several smaller DAGs and scheduling them in a pipelined manner. Through the Mosaic API, even the glue instructions needed to bind the CPU- and accelerator-specific results are generated.

Figure 2. Mosaic software architecture comprising several layers

IV. PROGRAMMING ENVIRONMENT

The Mosaic programming model is a library-based programming approach that raises the level of abstraction in order to improve resource utilization in heterogeneous systems. Encapsulating the accelerator-specific code in software libraries hides the code's complexity from the view of naive programmers. Library-based programming paradigms, designed to improve programmer productivity by increasing the level of abstraction, are therefore highly encouraged; examples of this approach are the BLAS linear algebra libraries [8] [18] released for the Cell processor and for GPUs [17]. In Mosaic's runtime library environment, the instructions of every application given to the heterogeneous system are segregated according to the underlying heterogeneous processing elements, and the associated libraries are linked dynamically with the domain-specific computation. Accelerator-specific libraries are thus mapped dynamically.
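The dynamic linking of accelerator-specific libraries described above could, for instance, be realized on top of the standard POSIX dynamic-loading interface. The sketch below only illustrates that idea; the library name libgpu_filter.so and the symbol gpu_filter are hypothetical and are not part of Mosaic.

// Illustration of run-time linking of an accelerator-specific library using
// POSIX dlopen/dlsym; the library and symbol names are hypothetical.
#include <dlfcn.h>
#include <cstdio>

int main() {
    // Load the accelerator-specific library only when the workload needs it.
    void* handle = dlopen("libgpu_filter.so", RTLD_LAZY);
    if (!handle) {
        std::fprintf(stderr, "falling back to CPU path: %s\n", dlerror());
        return 0;   // a real runtime would schedule the CPU version instead
    }

    // Look up the accelerator entry point and bind it to a function pointer.
    using FilterFn = void (*)(const int*, int*, int);
    auto gpu_filter = reinterpret_cast<FilterFn>(dlsym(handle, "gpu_filter"));
    if (gpu_filter) {
        int in[4] = {1, 2, 3, 4};
        int out[4] = {0, 0, 0, 0};
        gpu_filter(in, out, 4);   // dispatch the domain-specific computation
    }

    dlclose(handle);
    return 0;
}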

Figure 3. An image filter program implemented using the Mosaic-based paradigm

Figure 4. Dynamic linking of application-specific libraries

Figure 5. Compilation sequence of the Mosaic architecture

A. Programming with block data

Blocking is one of the major techniques used in task-based programming languages. Grouping data into blocks leads to a corresponding grouping of operations, such as CPU operations and accelerator-specific operations, and resorting to this methodology yields good granularity. A huge problem set is thereby decomposed into smaller problems; for example, algorithms such as matrix multiplication, Jacobi transformation, and Cholesky factorization can easily be grouped into blocks by abstracting the data set as hyper-matrices. The dependency complexity is high even for hyper-matrices with few blocks. This methodology gives the programmer the freedom to adjust the degree of locality and parallelism.
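As an illustration of this blocking, the following self-contained sketch performs a blocked matrix multiplication in plain C++. Mosaic itself is not used, and the block size and matrix layout are arbitrary choices; the point is that each block-by-block product forms an independent unit of work that a framework such as Mosaic could map to a CPU thread or to an accelerator.

// Blocked (tiled) matrix multiply: each block-by-block product is an
// independent unit of work that a framework like Mosaic could map to a
// CPU thread or to an accelerator.
#include <vector>

int main() {
    const int n = 64;        // matrix dimension (assumed divisible by block)
    const int block = 16;    // block (tile) size
    std::vector<double> A(n * n, 1.0), B(n * n, 2.0), C(n * n, 0.0);

    for (int bi = 0; bi < n; bi += block)
        for (int bj = 0; bj < n; bj += block)
            for (int bk = 0; bk < n; bk += block)
                // One block operation: C[bi..][bj..] += A[bi..][bk..] * B[bk..][bj..]
                for (int i = bi; i < bi + block; ++i)
                    for (int j = bj; j < bj + block; ++j) {
                        double sum = C[i * n + j];
                        for (int k = bk; k < bk + block; ++k)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] = sum;
                    }

    return C[0] == 2.0 * n ? 0 : 1;   // each entry should equal 2*n
}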

B. Programming with flat data

Not all algorithms go hand in hand with blocks. LU decomposition, for instance, is usually implemented so that the algorithm receives a 2D matrix as input and performs its computations by pivoting, swapping rows and columns and overwriting the matrix contents. This makes a blocked implementation hard: the data blocks being operated on vary sporadically in a pipelined manner and are therefore difficult to categorize as blocks. Merge sort likewise falls under this flat-data category, because the array is accessed repeatedly with two different and overlapping block sizes.

V. EVALUATION AND RESULTS

The Mosaic framework was evaluated on a heterogeneous computer consisting of a multi-core, multi-threaded Intel x86 CPU and a high-end NVIDIA GPU, using standard benchmarks. These benchmarks are computation-demanding applications covering financial modeling, image processing, and scientific computing, and they demand varied computation and communication at run time.

Figure 6. Mapping of Mosaic with a less powerful GPU

Figure 7. Mapping of Mosaic with a less powerful CPU

We use Intel Threading Building Blocks [9] [10] for CPU-bound applications and the CUDA Software Development Kit (SDK) [2] [18] for programming GPU-related applications where possible. One important advantage of Mosaic is its capability to adjust to hardware changes, which we demonstrate with two experiments. In the first experiment, we replaced our GPU with a less powerful one, with fewer stream processors and less memory, but kept the same 4-core CPU. The performance of Mosaic's mapping with this hardware configuration is shown in Figure 6: the GPU-always speedup with the new, less powerful GPU is 6x, compared to the 7.3x speedup with the original GPU. In the second experiment, we went to the other extreme and replaced our 4-core CPU with a single-core CPU while keeping the original GPU. The performance with this configuration is shown in Figure 7: with a single-core CPU, the average speedup of CPU-always drops to 1.5x, so Mosaic decides to shift its computation to the GPU rather than the CPU. From these evaluations we infer that the Mosaic framework can adapt itself dynamically and schedule work to the PEs accordingly, such that the overall throughput of the application is maintained.
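The adaptive behaviour above can be pictured as a simple run-time heuristic that compares the observed speedup of each processing element and steers the computation accordingly. The sketch below is only a schematic of that idea; the speedup figures are taken from the experiments above, but the decision rule itself is an assumption for illustration and is not Mosaic's actual scheduling policy.

// Schematic of a run-time PE-selection heuristic: pick the processing element
// with the better measured speedup over the serial baseline. The numbers and
// the rule are illustrative only.
#include <cstdio>

enum class PE { CPU, GPU };

// Choose a processing element from profiled speedups (e.g., 1.5x CPU-always on
// a single-core CPU vs. 7.3x GPU-always would steer work to the GPU).
PE ChoosePE(double cpuSpeedup, double gpuSpeedup) {
    return (gpuSpeedup > cpuSpeedup) ? PE::GPU : PE::CPU;
}

int main() {
    double cpuSpeedup = 1.5;   // measured on the single-core CPU configuration
    double gpuSpeedup = 7.3;   // measured on the original GPU configuration

    PE target = ChoosePE(cpuSpeedup, gpuSpeedup);
    std::printf("scheduling computation on %s\n", target == PE::GPU ? "GPU" : "CPU");
    return 0;
}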

REFERENCES
[1] CPU+GPU integration. http://www.google.com/search?hl=en&lr=&rls=GGLG%2CGGLG%2005-47%2CGGLG%3Aen&q=intel+amd+nvidia+ati+cpu+gp%u+integrated+&btnG=Search
[2] CUDA. http://developer.nvidia.com/object/cuda.html
[3] C. Ranger, R. Raghuraman, A. Penmetsa, G. Bradski, and C. Kozyrakis. Evaluating MapReduce for multi-core and multiprocessor systems. In Proc. of HPCA, pages 13-24, 2007.
[4] D. Tarditi, S. Puri, and J. Oglesby. Accelerator: Using data parallelism to program GPUs for general-purpose uses. In Proc. of ASPLOS, pages 325-335, 2006.
[5] E. Grochowski and M. Annavaram. Energy per Instruction Trends in Intel Microprocessors. Technology@Intel Magazine, March 2006.
[6] GLSL - OpenGL Shading Language. www.wikipedia.org/wiki/GLSL
[7] GPGPU: General Purpose Computation Using Graphics Hardware. www.gpgpu.org
[8] IBM. Basic Linear Algebra Subprograms Library Programmer's Guide and API Reference, SC33-8426-01 edition, 2008.
[9] Intel. Intel Threading Building Blocks. http://www3.intel.com/cd/software/products/asmona/eng/294797.htm
[10] Intel. Intel C++ Compiler. http://www3.intel.com/cd/software/products/asmona/eng/compilers/284132.htm
[11] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. Google, Inc.
[12] J. L. Hennessy and D. A. Patterson. Computer Architecture: A Quantitative Approach, volume 4. Morgan Kaufmann, 2007.
[13] J. Nickolls and I. Buck. NVIDIA CUDA software and GPU computing platform. In Microprocessor Forum, 2007.
[14] R. Kumar, D. Tullsen, N. Jouppi, and P. Ranganathan. Heterogeneous Chip Multiprocessors. IEEE Computer, November 2005, pages 32-38.
[15] M. D. Linderman, J. D. Collins, H. Wang, and T. H. Meng. Merge: A Programming Model for Heterogeneous Multi-core Systems. In Proceedings of ASPLOS, March 2008.
[16] M. McCool, K. Wadleigh, B. Henderson, and H. Y. Lin. Performance Evaluation of GPUs Using the RapidMind Development Platform. In Proceedings of the 20th International Conference on Supercomputing, 2006.
[17] A. Munshi. OpenCL: Parallel Computing on the GPU and CPU. In ACM SIGGRAPH, 2008.
[18] NVIDIA. CUDA CUBLAS Library, 2.0 edition, March 2008.
[19] K. O'Brien, K. O'Brien, Z. Sura, T. Chen, and T. Zhang. Supporting OpenMP on Cell. International Journal on Parallel Programming 36 (2008), pages 289-311.
[20] J. Reinders. Intel Threading Building Blocks. O'Reilly, July 2007.
[21] P. Wang, J. D. Collins, G. Chinya, H. Jiang, X. Tian, M. Girkar, N. Yang, G.-Y. Lueh, and H. Wang. EXOCHI: Architecture and Programming Environment for a Heterogeneous Multi-core Multithreaded System. In Proceedings of the ACM SIGPLAN '07 Conference on PLDI, June 2007, pages 156-166.

