
Parallel and Distributed Computing: Opportunities and Challenges

Gruppe WG2

W. Robitza (a0700504), F. Schwarz (a0700830), and D. Selig (a0603678)

University of Vienna

Abstract. At the dawn of a new computing era stands parallel computing. Driven by the efforts of hardware manufacturers, the capabilities of parallel computing will become available in almost every consumer device, not only in large distributed systems. When we talk about the advantages of multicore systems, we often forget that these advantages only come with a deep understanding of what parallelism means. It is the responsibility of programmers to move from an age where sequential programming seemed to be everlasting to an age of natively parallel code. We present an overview of the struggles and issues in the parallel computing revolution and provide insight into the efforts of research groups that try to overcome most of those problems. We also discuss the possible future of parallel computing and describe how the principles of computer science can be applied in this area.


1 Introduction

Today's omnipresence of multiple cores or even multiple CPUs in personal computing devices marks an important step in the transition to a world of computing that no longer relies on sequential programming. Until only a few years ago, it was widely believed that single-CPU clock rates could be doubled every 18 months in accordance with Moore's Law [1]. Theoretical sequential processing power would have increased by an enormous amount. This assumption has since been proven wrong for clock rates [2], although the number of cores per chip can still grow according to the law. The use of manycore systems (i.e. systems with a comparably high number of cores) over multicore systems is undeniably the only future of high performance computing, because it allows for cheaper and more power-efficient operation [3]. Parallel computing will take place on the desktop PC as well as on large grids of distributed machines. The introduction of parallel architectures, however, came without deep knowledge in the field of parallel programming. Problems arise from the fact that programmers need to understand the implications of parallelism and how to transform common tasks into parallel programs. This is essential in order to optimize the usage of available computational power, since sequential programs will not benefit from the parallel architecture.

In this paper an overview of the current problems in the field of parallel and distributed computing is given. This is achieved by analyzing recent work in Section 2, where a representative body of work in this field is described and the current status of the parallel landscape is evaluated. In Section 3, different approaches to parallel design and optimization are presented and, where applicable, compared to each other. Also, the effects of the Great Principles of P. Denning and C. Martell on the parallel computing domain are highlighted. Section 4 discusses the proposed methodologies and underlying principles. Finally, Section 5 concludes with a summary of the findings in this paper, along with an outlook on the possible future of parallel computing.


2 Background Work
2.1 Parallel Computing History

The problems of parallel computing are by no means new. In fact, Amdahl's Law [4] is considered the major underlying principle for estimating the maximum improvement to be expected when only parts of a system are improved. The speedup that can be achieved by using, for example, multiple CPUs is limited by the sequential parts of the program. Therefore, parallel computing is not an out-of-the-box solution for overcoming today's computational problems if they are inherently sequential. Programmers need to optimize sequential code as well as parallel code to achieve optimal performance. In fact, in the 1960s and 1970s parallel computing was already deemed the solution for high performance computing. Yet, many problems faced then still apply today [5]. Until recently, no considerable progress has been made in improving efficiency and optimally exploiting the parallel architectures of today. About 15 years ago, massively parallel computing was shown to be cost-ineffective despite its promising nature [6]. Now that the world cannot afford to boost single-core performance any further, it is necessary to deal with these problems.

2.2 The Dwarfs of Scientific Computing

In the widely known 2006 technical report from the University of California at Berkeley [7], several main recommendations for the foreseeable future of parallel computing have been stated. Also, conventional wisdoms that no longer apply have been reformulated for a new landscape of parallel computing. One of these old wisdoms stated that the mathematical operation of multiplication is slower than storing and loading data; today, this is no longer the case. In 2004, the so-called Seven Dwarfs of scientific computing were formulated [8]. These are numerical methods that are expected to appear in most scientific problems over the next decade. The dwarfs are described on a high level so that their presence in multiple applications can be identified more easily. To summarize, they are the following:


1. Dense Linear Algebra: Operations on dense matrices or vectors of data (i.e. including a low number of zero values).
2. Sparse Linear Algebra: Operations on sparse matrices (i.e. including a high number of zero values). These formats of data can often be found in linear programming problems. Data is stored in a compressed fashion.
3. Spectral Methods: Data is transformed from a time or space domain into a frequency domain, e.g. by applying the Fast Fourier Transform (FFT).
4. N-Body Methods: Problems addressing the fact that each point in the data depends on all others, thus leading to O(N^2) calculations.
5. Structured Grids: Points in a grid will be spatially updated together.
6. Unstructured Grids: Points in a grid will be conceptually updated together.
7. Monte Carlo: Calculations depend on the results of random trials.

It should be noted that those Seven Dwarfs have not been designed with parallel computing in mind. Quite the contrary, some of the calculations are very hard to transform into naturally parallel code although they contain some parallelism. Some of the optimizations used to overcome this problem will be explained in more detail in Section 3. Still, not only the parallel optimization is of importance in addressing the possible speedup of these dwarfs; memory speed and throughput are obstacles as well, as also outlined in Section 3. Moreover, if we think about high-level applications that might use parallel processing, we are bound to ask whether they would actually benefit from parallelization. Six more dwarfs have been added to the seven original ones:

1. Combinational Logic: Functions that depend on logical operations and state that is stored locally (e.g. encryption).
2. Graph Traversal: A typical traversal of many nodes in a graph.
3. Dynamic Programming: Especially favorable in optimization problems, dynamic programming tries to compute a solution by addressing subproblems.
4. Backtrack, Branch and Bound: Like the above, a solution is found by dividing a problem into subproblems but eliminating non-optimal results.
5. Graphical Models: Representations of nodes and edges used to convey, for example, Markov models or Bayesian networks.
6. Finite State Machine: The representation of a machine that consists of well-defined states and transition rules depending on the current state and input values.

Of these, all seem tractable; only the Finite State Machine may not profit from parallel architectures, since its semantics are clearly sequential. We can now refer to those dwarfs whenever discussing the optimization of computations, as their impact on most computational problems and their ubiquity have been proven.

2.3 The Learning Curve: Thinking Parallel

Today's programmers are widely accustomed to delivering sequential code. They rely on compilers that have been improved over ten years or more and that already optimize given code for different architectures. Concurrency

in software and hardware introduces a layer of abstraction that not every programmer is able to understand natively. It is undeniable that the rise and fall of parallel computing goes hand in hand with the ability of developers to make use of such a promising technology and to think parallel. Several aspects of parallelism can be considered barriers to easy parallel programming, including locks, deadlocks, load balancing, scheduling and memory consistency [2]. Programming often involves finding a tradeoff between opacity and visibility, the first meaning the abstraction of any underlying architecture, the latter a full disclosure of the architecture design [7]. Although almost every computer lab can be equipped with natively parallel processors, most computer science curricula lack a course dedicated to parallel programming. However, not only low-level programming is important: the impact of parallelism potentially applies to every major topic, for example algorithms, operating systems, databases and data structures. New course models have to be developed that specifically address the new paradigms [9].

2.4 Approaches to Parallelism

When thinking about making use of parallel processing power, mostly bottom-up approaches are considered. Since every architecture is specific in its implementation of operations, memory access and caching, different compilers will have to be written. Generic optimizations need to be included at this low-level stage to allow for all possible applications that may run on a specific platform. Another way of addressing the lack of compelling applications for parallel architectures is to develop them in a top-down fashion. Computational problems nowadays are inherently driven by an underlying business case, since the applications are human-centered and potentially groundbreaking. The ideas for applications are elaborated upon first; then the appropriate platform is chosen. This approach is followed in the Berkeley Par Lab, which is explained in more detail in Section 3. The general question is whether any framework will allow programmers to write parallel code as easily as they were accustomed to before. In other terms, such a framework would parallelize code automatically and scale it well over even 128 cores or more.
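Today's standard libraries already hint at what such a framework could look like: a data-parallel map that hides thread and process management entirely. The sketch below uses Python's `concurrent.futures` purely to illustrate the programming model; the worker count and the `expensive` function are made-up examples, and no claim is made that this scales to 128 cores:

```python
from concurrent.futures import ProcessPoolExecutor

def expensive(x):
    """Stand-in for a CPU-heavy, independent per-item computation."""
    return sum(i * i for i in range(x))

def parallel_map(func, items, workers=4):
    """Apply func to each item, distributing the work over worker processes."""
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))

if __name__ == "__main__":
    inputs = [10_000, 20_000, 30_000, 40_000]
    # Same result as the sequential list comprehension, computed in parallel.
    assert parallel_map(expensive, inputs) == [expensive(x) for x in inputs]
```

The point of such an interface is precisely the one raised above: the programmer states *what* is independent, and the framework decides how to schedule it.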


3 Comparison of Different Approaches

3.1 Modeling Parallel Performance

Amdahl's Law: When discussing the performance of parallel or distributed computing, one mostly refers to Amdahl's Law [4]. Briefly explained, it allows one to calculate the possible speedup of a program depending on its parallel portion and the number of cores. The speedup can be generalized as

S = 1 / ((1 - P) + P/N)


where P denotes the parallel portion and N is the number of processing units. A plot of different combinations can be seen in Figure 1. The x-axis shows the number of cores, while the y-axis shows the possible speedup according to the parallel portion of the program indicated by the lines.


Fig. 1. Amdahl's Law visualizes speedup (y-axis) for different combinations of processing units (x-axis) and amounts of parallel code.

This model is very generic and clearly shows the problems we have to deal with in parallel and distributed computing. If only 50% of a program is parallel code, even with thousands of cores the speedup will not be more than 2. Even with a parallel portion of 95%, a speedup of 20 is only reached with a number of cores so large that, for the foreseeable future, it is not likely to be implemented in everyday-use computers. The lack of possible speedup is simply due to the fact that most algorithms today are simple sequences of operations; there is no parallelism in this definition. Amdahl's Law only tells us how much speedup can be achieved; it does not clarify which factors really play an important role in that process. Therefore, it is only of modest practical use. This is also due to the fact that not only the processor's throughput is an important factor in computing; the memory is, too.
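These numbers follow directly from the formula; a few lines of plain Python (our own illustration) make the limits explicit:

```python
def amdahl_speedup(p, n):
    """Speedup for a program with parallel fraction p on n processing units."""
    return 1.0 / ((1.0 - p) + p / n)

# 50% parallel code: even a huge machine yields a speedup below 2.
print(amdahl_speedup(0.50, 100_000))  # ~2.0, never more

# 95% parallel code: the speedup approaches, but never exceeds, 1/(1-p) = 20.
print(amdahl_speedup(0.95, 100_000))  # ~20.0 only with 100,000 cores
print(amdahl_speedup(0.95, 10))       # ~6.9 on ten cores
```

The asymptote 1/(1-P) is the whole story: the sequential fraction alone dictates the ceiling, no matter how many cores are added.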

The Roofline Model: An alternative for modeling parallel performance is the Roofline Model presented in [10]. It does not focus on the possible speedup of an arbitrary program; instead it offers a very simple model that shows the interdependencies between processor and memory speed. Processor performance can be measured in floating point operations per second (FLOPS), which is currently the most widely used metric; it is determined by standardized benchmarks. Memory performance is measured by its throughput, e.g. how many bits per second can be streamed from point A to point B in a processor architecture; the authors focus on the traffic between the cache and the main memory. Another term is introduced, called operational intensity. It refers to the operations per byte of main memory traffic and is a typical feature of any kernel. In this sense, kernels are generic computational problems; for example, each of the dwarfs mentioned before can be referred to as a kernel.
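To make the term concrete, consider a dot product of two double-precision vectors: each element contributes two floating point operations (one multiply, one add) but requires loading two 8-byte values from memory. This back-of-the-envelope calculation is our own illustration, not taken from the Roofline paper:

```python
def operational_intensity(flops, bytes_moved):
    """Operations per byte of memory traffic, in Flops/Byte."""
    return flops / bytes_moved

n = 1_000_000            # vector length (arbitrary example value)
flops = 2 * n            # one multiply + one add per element
bytes_moved = 2 * n * 8  # two 8-byte doubles loaded per element
print(operational_intensity(flops, bytes_moved))  # 0.125 Flops/Byte
```

Kernels with such a low intensity are almost always memory-bound; dense matrix-matrix multiplication, by contrast, reuses each loaded value many times and reaches much higher intensities.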

Fig. 2. The Roofline Model: attainable GFlops/s (y-axis) plotted against operational intensity in Flops/Byte (x-axis). Performance is bounded by the slanted peak memory bandwidth line on the left and the horizontal peak floating point performance line on the right; the ridge point marks where the two bounds meet.

The Roofline Model can best be explained by the example shown in Figure 2. The x-axis refers to the operational intensity defined above, the y-axis shows the attainable floating point performance. The horizontal line to the right is the maximum floating point performance for the given system. Obviously, the overall performance can never rise above that line for any operational intensity; this is the roofline. The slanted line to the left is the maximum memory performance the system can deliver under a given operational intensity. It shows that a system can be slowed down by the memory architecture. This conforms to the principle of memory hierarchy. If all computation were to

take place in higher-level caches, processing would be faster; access to main memory, in contrast, takes longer. Essentially, the Roofline Model can be condensed to the formula

Attainable GFlops/s = min(Peak Floating Point Performance, Peak Memory Bandwidth × Operational Intensity)
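The min of these two terms is trivial to evaluate in code. The sketch below uses made-up peak values for an imaginary machine (16 GFlops/s, 20 GB/s), chosen only to reproduce the two regimes of the model:

```python
def attainable_gflops(oi, peak_gflops=16.0, peak_bw_gbs=20.0):
    """Roofline bound for a kernel with operational intensity oi (Flops/Byte).

    peak_gflops and peak_bw_gbs are illustrative machine parameters,
    not measurements of any real system.
    """
    return min(peak_gflops, peak_bw_gbs * oi)

ridge = 16.0 / 20.0  # ridge point: the intensity where both bounds coincide
print(attainable_gflops(0.1))    # 2.0  -> memory-bound regime
print(attainable_gflops(ridge))  # 16.0 -> exactly at the ridge point
print(attainable_gflops(8.0))    # 16.0 -> compute-bound regime
```

Below the ridge point the bandwidth term wins and the kernel is memory-bound; above it, the floating point ceiling takes over.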


The ridge point marks an important factor in determining the potential of a system. Its position on the x-axis shows how operationally intensive a problem has to be to achieve the maximum performance. In essence, it is a hint for programmers that shows how difficult it is to optimize for the system. As a concrete example, we could imagine that a kernel A has an operational intensity of 0.8 Flops/Byte, whereas another kernel B could use 8 Flops/Byte. The dashed line in Figure 2 marks the operational intensity of kernel A. As it hits the roof exactly at the ridge point, the maximum possible speed for this operation is bound by the peak floating point performance of the CPU, and not by the memory. If the kernel had a lower operational intensity, the memory would impose limits on the maximum speed, and vice versa.

3.2 Optimization of Computations
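The canonical kernel treated in this section is sparse matrix-vector multiplication. A minimal sketch of the widely used CSR (compressed sparse row) format shows how zero entries are eliminated from storage; the format choice is ours for illustration, and [11] and [12] discuss many alternatives:

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """y = A @ x for a matrix A stored in CSR form.

    values  : non-zero entries, listed row by row
    col_idx : column index of each non-zero entry
    row_ptr : index into values where each row starts (length = rows + 1)
    """
    y = []
    for r in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[r], row_ptr[r + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y

# The 3x3 matrix [[1, 0, 2], [0, 0, 3], [4, 0, 0]] stores only 4 values:
values, col_idx, row_ptr = [1.0, 2.0, 3.0, 4.0], [0, 2, 2, 0], [0, 2, 3, 4]
print(spmv_csr(values, col_idx, row_ptr, [1.0, 1.0, 1.0]))  # [3.0, 3.0, 4.0]
```

The thread blocking described below would simply split `row_ptr` into contiguous chunks holding roughly equal numbers of non-zero values, one per thread.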

Parallel and distributed systems offer new opportunities for optimizing the various computations that make up the fundamental operations in scientific computing. As already mentioned in Section 2.2, several attempts were made to classify these; one of them is the set of dwarfs. One kind of computation often mentioned in the literature is the Sparse Matrix-Vector Multiplication, an applied case from the field of sparse linear algebra (dwarf number 2).

Sparse Matrix-Vector Multiplication: Sparse matrices are very important for various scientific computations. Those matrices hold only a relatively low number of non-zero values, so storing them naively wastes a lot of memory on redundant zeros. As the Roofline Model [10] shows, memory bandwidth is the bottleneck of most systems. Computations on sparse matrices therefore rely on the Sparse Matrix-Vector Multiplication (SpMV), whose storage formats eliminate unnecessary information in the matrices to reduce their size. SpMV is one of the most used kernels and an important factor of speed in many applications across various fields of computer science, including scientific computing, economic modeling and information retrieval [11]. Therefore, a lot of effort is put into its optimization. The authors of [11] divide optimization techniques into three areas, which are the following (including an example for each):

Low-level Code Optimizations: An example of a code optimization is software prefetching: data that might be needed in the near future is heuristically picked and loaded into the cache. Usually, in an x86 architecture, the data has to be transferred from the main memory to the L2 cache and from there to the L1 cache if it is not already stored in the

cache hierarchy. With software prefetching, however, some architectures allow data to be placed directly into the L1 cache. This reduces memory latency by avoiding the L2 cache entirely. As we can see, the principles of memory hierarchy and caching apply to this problem, since the L1 cache will always be faster than the L2, yet with a tradeoff in storage capacity.

Data Structure Optimizations: These include index size selection: as caches have only limited storage, the temporary indices can be compressed to fit the cache's size, no matter the size of the matrix stored in the main memory. By using such memory-efficient indices, performance may increase by up to 20%. The optimization of data structures is a highly complex topic in scientific computing and mathematics; different approaches to optimizing the data structure of sparse matrices include using different internal representation formats, as explained in [12].

Parallelization Optimizations: One of these is thread blocking. The main idea follows a divide-and-conquer approach: first, the matrix is partitioned by row blocks, column blocks or segments of both. The parts should be determined such that they all hold a similar number of non-zero values. Next, each part of the matrix is assigned to and processed by a certain thread. Now each thread can be mapped to a certain core to allow parallelization. Especially systems that support a high number of threads per core benefit from this approach.

Several libraries already exist for the automated optimization of sparse matrix computations in serial code, such as OSKI [13], from which most of the optimizations described here are derived.

3.3 Towards a Parallel Paradigm: The Berkeley Par Lab

Considering parallel computing, much can be learned from the work of a research group at the University of California at Berkeley. They founded the Parallel Computing Laboratory (Par Lab) [7] and have been publishing visionary papers since 2006. Their work is mainly based on a running project about creating applications that are optimized for parallel systems. The researchers of the Par Lab concentrate on parallel systems in both embedded and high performance computing, trying to find similarities to profit from. Those systems have more in common now than they had in the past, including:

Energy: Embedded systems aim to get the maximum lifespan out of a battery, while data centers need to have low electricity and cooling costs.

Hardware: The price of hardware plays a role, as cheaper devices are better received by customers. Also, data center operators are keen on keeping their hardware costs low.

Security: As embedded systems become more complex year by year, their role in everyday life gains importance as well. As a result, they are more and more included in networks and connect to the internet. Data centers on the

other hand are usually connected to networks as well. This leads to an increased danger of virus infections and unauthorized access. Therefore, security plays a very important role for both.

The parallel approach of the last decades may work for a smaller number of processors (multicore systems), but there is a great chance of difficulties in larger systems (manycore systems). Therefore, paradigms in parallel thinking have to change, because the conditions have altered over the years. Electricity used to be very cheap whereas transistors were rather expensive; nowadays it is the other way round: electricity prices are rising while transistors have become cheaper. Likewise, in earlier days load and store operations were quick while multiplications used to be slow; today, memory access is the bottleneck of a system's performance. As mentioned before, single processor performance used to double every 18 months according to Moore's Law; during the last years, this period has grown to about five years. As a result, programmers didn't bother to parallelize their code, as they could just as well wait for the next generation of sequential chips; nowadays, the waiting period is much longer, hence parallelism has gained importance [7].

The Par Lab Approach: For the future, it is crucial that programs that run efficiently on parallel systems are easy to develop. This has already been explained in Section 2.3. In order to facilitate easy programming for parallel systems, the Par Lab researchers introduce two development layers. The efficiency layer can be understood as a framework provided by parallel programming experts. On top of that, the productivity layer offers a toolkit for programmers having difficulties in making use of parallel features themselves. The efficiency layer has to work as close to the machine as possible. Unfortunately, many multicore systems do not offer interfaces supporting common parallel operations.
Therefore, the Par Lab researchers introduce another layer in addition to the two development layers, providing an API for parallel job creation, synchronization, memory allocation and bulk memory access, which they call the portability layer [2], in order to be able to write applications for different architectures. In [7] it is mentioned that so-called autotuners should replace the compilers that have been in use for many decades. At the moment, compilers that parallelize sequential code still work quite well. However, this cannot be guaranteed for multicore systems of all sizes, because the efficiency of compiler-driven optimizations might be lower than expected. Today, most consumer CPUs do not have more than four cores. As the number of cores on a chip increases, the architecture of processors may change. There will be a variety of chips with different numbers of cores, and each of those cores may have a variable clock frequency to be power-efficient. As a result, there will be a great diversity of chips, making it difficult to maintain the high expectations for compilers in regards to program efficiency, scalability and portability [10]. Autotuners work differently from compilers, as they aim to optimize the code by generating various library kernels and comparing their performance on the


given system. The autotuner uses various uncommon ways of optimization; this trial-and-error approach can present results that are even better than those of code that is manually optimized for the same target system. Autotuners can assist with generating parallel code; however, a universal solution has not been found yet.

Par Lab Application Development: As mentioned before, the researchers at the Par Lab are working on various programming experiments for parallel systems. Although the Par Lab is an institution for scientific computing, the project is about developing client applications in a top-down approach. It is believed that data centers can also profit from the Par Lab results, as they usually use similar hardware for their distributed systems. As explained in Section 2.4, the traditional approach was to design software to run on certain hardware and applications to run on top of that software. To achieve higher user acceptance for the end products, this bottom-up approach has been reversed. A new top-down approach for parallel application development typically begins with finding ideas for needed applications, then designing software to fit the applications' demands. In the end, the hardware has to be chosen according to the software's requirements. The researchers invited local experts from the respective application fields to help design, use and evaluate the technology. The applications range from audio signal processing including speech recognition, content-based image retrieval and medical applications to a web browser and more. It is often stated that challenges arising from new inventions require fundamental rethinking of technology and know-how [2]. To build a solid base for the applications to run on, the Par Lab team introduces a deconstructed OS: the operating system divides the physical resources (cores, guaranteed network bandwidth, and the like) into partitions.
These partitions are flexible and virtualized by time-multiplexing onto the actual hardware. There are two different scheduling levels: one for the operating system and one for application runtimes. The main idea is to give the partitions either exclusive access to resources (such as a single core) or to guarantee a quality of service via a contract (such as a certain network bandwidth). Each partition has full access to its resources within the assigned scheduling slot. This approach is similar to the one used by the so-called Exokernel [14]. The operating system developed by the Par Lab is only responsible for low-level resource management and scheduling as well as communication between partitions. As a result, its kernel is much thinner than that of a traditional operating system. As the Par Lab's project uses a top-down approach, the hardware has to be designed in accordance with the requirements of the software. Four ways are presented to achieve this goal:

Supporting OS partitioning: According to the deconstructed operating system introduced above, the hardware should support its partitioning. This renders the components' performance predictable.


Optional explicit control of the memory hierarchy: Programmers have the possibility of prefetching data into the caches manually (see Section 3.2). In addition, the traditional self-organized memory hierarchy can still be used.

Accurate, complete counters of performance and energy: Current processors lack accurate and comparable performance and energy measurements. In order to track enhancements, such measures are of great importance. The counters should be integrated into the software stack, so that both the efficiency and the productivity layer programmer can use them.

Intuitive performance model: As the use of multicore systems bears new problems for all developers, the Par Lab researchers developed a performance model for eliminating bottlenecks in the aforementioned dwarfs.

With this approach, it could become much easier for programmers to create natively parallel programs. The top-down solution will assure that the results are user-centered and of great value to a broad community. The division of the programming community into productivity and efficiency layer experts is a fundamental break with software development traditions, but seems promising in regards to the problems still faced today.


4 Discussion

As mentioned several times before, parallel computing has been deemed the future of computing for many decades now. However, until recently, no remarkable breakthroughs have been achieved. When the industry started to implement multicore chips in consumer hardware, the potential of parallel computing became obvious, as the possible target audience grew.

4.1 The Underlying Principles

The Great Principles of computer science apply to many of the problems already described. A principle like Moore's Law [1] has been found valid for predicting the near future in terms of computational performance. However, over the years, the perspective had to be changed. At some point, it suddenly became obvious to the industry that the clock rate of CPUs could no longer be doubled every eighteen months. This is shown in Figure 3. As clearly visible, CPU manufacturers had to recalculate their roadmaps for predicted clock rates within the last five years. Take as an example the roadmap from the year 2005: it had been predicted that in the year 2011, single-core CPUs would be capable of clock rates of up to 17-18 GHz. Only two years later, the roadmap was recalculated and the prognosis for 2011 was reduced to half of the originally estimated speed. Even now we see that those speeds of 6-7 GHz cannot be reached by single cores. It was soon found that Moore's Law could no longer be applied to the clock rate of single-core CPUs, but instead to the number of cores per CPU. Thus we can see that Moore's Law is a true principle, as it has not been suddenly abandoned or proven wrong, but reinterpreted.

Fig. 3. CPU roadmap predictions [2]: clock rate in GHz (y-axis) over the years 2001 to 2013 (x-axis), comparing the 2005 and 2007 roadmaps with actual Intel single-core and multi-core CPUs.

As became obvious, rethinking the principles of computing was necessary. Similar to Moore's Law, the principles of memory hierarchy and caching have already been addressed in this paper. The Roofline Model shows the implications of a fact that has been known for years now: the gap between the growth curves of memory and CPU power [15]. Memory speed improvements have always been accomplished at a slower rate than those of CPU power. Also, through the use of shared caches, memory boundaries are even more severe in high performance computing [11]. We can see that the principle of caching still applies and is important in optimization for parallel computing, as explained in Section 3.2.

4.2 The Future of Parallel Computing

What can be done to overcome the issues that are still omnipresent in the field of parallel computing? Given the history of attempts and failures, it is reasonable to say that the next years will probably not bring the desired revolution. However, considering the optimistic approaches of research groups such as the Par Lab, it is possible that parallel computing will gain attention not only in the scientific community, but also in the minds of producers of consumer applications. The recent rise of Cloud Computing also undeniably goes hand in hand with the improvements in distributed and parallel systems. Without the underlying computational power, it would not have been possible. Also, the concept of Software as a Service (SaaS) relies on massively parallel systems as a backbone. Whole operating systems and logical business infrastructure will be moved to the cloud and into the hands of providers like Amazon, Google or Microsoft [2]. While it is clear that mobile devices will never be able to reach the computational power of stationary devices, they are connected to the Internet and


therefore have access to the most powerful systems that exist. Creating compelling software that is designed to be used by the average consumer is, in our opinion, the main goal for the near future of parallel computing. With the support of the microprocessor industry, parallel products will be available in all price segments, from cheap embedded CPUs to massively parallel and scalable manycore servers. We think that the future of parallel computing depends on the ability of programmers to understand the paradigms of parallelism. A deep knowledge of the Great Principles is also desirable, as their influence can be seen in all fields of computer science. It is, however, possible to create frameworks like that of the Par Lab (i.e. the approach based on the preliminary work in [7] and extensively described in [2]), where two development layers separate the tasks of programmers, so that everyone involved may focus on a condensed problem without having to worry about details.


5 Conclusion

In this paper we have presented an overview of the current state of parallel computing. Early attempts to utilize parallel processing for dealing with computationally intensive problems have failed; the widespread implementation of multicore systems, however, changed the views dramatically. In Section 2 a short history of parallel computing was presented. While we mainly addressed the problems leading to a need for parallel computing, we also provided insight into current publications. Several key issues for the adoption of parallel principles were enumerated, such as the need for a simple programming framework and the ability of programmers to think parallel. Also, the main computational problems of scientific computing were identified and presented. In Section 3, two models that measure parallel performance were shown: Amdahl's Law being an older and more generic approach, in contrast to the Roofline Model, which is dedicated to the tradeoff between computational intensity and memory performance. We also gave an example of one of the dwarfs, the Sparse Matrix-Vector Multiplication (SpMV), and explained the principles that are used to optimize such a task for parallel processing. Lastly, the innovative and future-oriented approach of the Berkeley Par Lab has been presented in detail. Section 4 discussed the current state of parallel computing with regard to the Great Principles, which are often referred to in basic problems of computer science. We assumed a steady growth of the number of cores implemented in computers and tried to give a prediction about the future, along with the assumptions of several other authors. To conclude, it has to be mentioned that not only the whole field of high performance computing, but also desktop computing will surely benefit from the current parallel research activity. In fact, research in this area has been extensive for several years now and continues to spread among institutes of different backgrounds.
Not only scientic computing researchers are interested in


delivering fast results from massive computations, also programmers of desktop software are bound to be drawn into the spiral that is the parallel revolution, with prots for the industry and the consumer.
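As a brief illustration of the two performance models summarized above, the following sketch evaluates Amdahl's Law and a simplified Roofline bound. The peak figures and the SpMV arithmetic intensity used here are hypothetical example values, not measurements from the cited work.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Amdahl's Law: overall speedup is limited by the sequential fraction,
    no matter how many cores execute the parallel portion."""
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def roofline_gflops(intensity: float, peak_gflops: float, peak_gbs: float) -> float:
    """Roofline model: attainable performance is the minimum of the compute
    peak and the memory-bandwidth ceiling (bandwidth * arithmetic intensity)."""
    return min(peak_gflops, peak_gbs * intensity)


if __name__ == "__main__":
    # Even 95% parallel code gains less than a 20x speedup on 1024 cores.
    print(amdahl_speedup(0.95, 1024))

    # SpMV has a low arithmetic intensity (assumed here as 0.25 flops/byte),
    # so on a hypothetical 100 GFLOP/s, 50 GB/s machine it is memory-bound
    # and capped at 12.5 GFLOP/s.
    print(roofline_gflops(0.25, 100.0, 50.0))
```

The two functions make the trade-off discussed in Section 3 concrete: Amdahl's Law bounds speedup by the sequential fraction of a program, while the Roofline bound shows why low-intensity kernels such as SpMV are limited by memory bandwidth rather than by raw compute power.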

1. Moore, G.E.: Cramming more components onto integrated circuits. Electronics 38(8) (1965)
2. Asanovic, K., Bodik, R., Demmel, J., Keaveny, T., Keutzer, K., Kubiatowicz, J., Morgan, N., Patterson, D., Sen, K., Wawrzynek, J., Wessel, D., Yelick, K.: A view of the parallel computing landscape. Commun. ACM 52(10) (2009) 56–67
3. Chandrakasan, A., Sheng, S., Brodersen, R.: Low-power CMOS digital design. IEEE Journal of Solid-State Circuits 27(4) (1992) 473–484
4. Amdahl, G.M.: Validity of the single processor approach to achieving large scale computing capabilities. In: Proc. of AFIPS Spring Joint Computer Conference. Volume 30. (1967) 483–485
5. Denning, P.J., Dennis, J.B.: The resurgence of parallelism. Commun. ACM 53 (2009) 30–32
6. Kumm, E., Lea, R.: Parallel computing efficiency: climbing the learning curve. In: TENCON '94. IEEE Region 10's Ninth Annual International Conference. Theme: Frontiers of Computer Technology. Proceedings of 1994. (1994) 728–732 vol. 2
7. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley (2006)
8. Colella, P.: Defining software requirements for scientific computing. Presentation (2004)
9. Marowka, A.: Think parallel: Teaching parallel programming today. IEEE Distributed Systems Online 9(8) (2008)
10. Williams, S., Waterman, A., Patterson, D.: Roofline: an insightful visual performance model for multicore architectures. Commun. ACM 52(4) (2009) 65–76
11. Williams, S., Oliker, L., Vuduc, R., Shalf, J., Yelick, K., Demmel, J.: Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In: SC '07: Proceedings of the 2007 ACM/IEEE Conference on Supercomputing, New York, NY, USA, ACM (2007)
12. Hugues, M., Petiton, S.: Sparse matrix formats evaluation and optimization on a GPU. In: High Performance Computing and Communications (HPCC), 2010 12th IEEE International Conference on. (2010) 122–129
13. Vuduc, R., Demmel, J.W., Yelick, K.A.: OSKI: A library of automatically tuned sparse matrix kernels. Journal of Physics: Conference Series 16 (2005) 521–530
14. Engler, D.R., Kaashoek, M.F., O'Toole, Jr., J.: Exokernel: an operating system architecture for application-level resource management. In: Proceedings of the Fifteenth ACM Symposium on Operating Systems Principles. SOSP '95, New York, NY, USA, ACM (1995) 251–266
15. Carvalho, C.: The gap between processor and memory speeds. In: Proc. of IEEE International Conference on Control and Automation. (2002)