
Multi-core processor

A multi-core processor is an integrated circuit (IC) to which two or more processor cores have been attached for enhanced performance, reduced power consumption, and more efficient simultaneous processing of multiple tasks. Because the cores sit in the same package and share a socket, communication between them is faster than between separate physical processors. Companies that have produced or are working on multi-core products include AMD, ARM, Broadcom, Intel, and VIA.

As hardware designers turn toward multicore processors to improve computing power, software programmers must find new programming strategies that harness the power of parallel computing. One technique that effectively takes advantage of multicore processors is data parallelism: splitting a large data set into smaller chunks that can be operated on in parallel and then combined back into a single data set once processed. With this technique, programmers can modify a process that typically could not exploit multicore processing power so that it efficiently uses all of the processing power available.

Fig-1: Simple dual-core architecture

Now consider the implementation shown in Fig-2, which uses data parallelism to fully harness the processing power offered by a quad-core processor. In this case, the large data set is broken into four subsets, each of which is assigned to an individual core for processing. After processing is complete, the subsets are re-joined into a single full data set.

Fig-2: Using data parallelism, a large data set is processed in parallel on multiple CPU cores.
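As a rough sketch only, here is the split/process/merge pattern in C#. The array contents, the chunking scheme, and the Process function are illustrative assumptions, not part of the original text:

using System;
using System.Threading;

class DataParallelismSketch
{
    // Stand-in for any per-element computation whose iterations are independent.
    static double Process(double x) => Math.Sqrt(x) * x;

    static void Main()
    {
        double[] data = new double[1000000];
        for (int i = 0; i < data.Length; i++) data[i] = i;

        int cores = Environment.ProcessorCount;   // e.g. 4 on a quad-core machine
        int chunk = data.Length / cores;
        var workers = new Thread[cores];

        for (int c = 0; c < cores; c++)
        {
            int start = c * chunk;
            int end = (c == cores - 1) ? data.Length : start + chunk;
            workers[c] = new Thread(() =>
            {
                // Each core processes its own subset in place, so re-joining
                // the subsets afterwards requires no extra copying here.
                for (int i = start; i < end; i++) data[i] = Process(data[i]);
            });
            workers[c].Start();
        }

        foreach (var t in workers) t.Join();       // all subsets done: full data set ready
    }
}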

The introduction of multicore processors presents a new challenge for software developers, who must master the programming techniques necessary to capitalize on multicore potential. One of these techniques is task parallelism: simply the concurrent execution of independent tasks in software. Consider a single-core processor that is running a Web browser and a word-processing program at the same time. Although these applications run on separate threads, they still ultimately share the same processor. Now consider a second scenario in which the same two programs run on a dual-core processor. On the dual-core machine, the two applications can run essentially independently of one another. Although they may share some resources that prevent them from running completely independently, the dual-core machine handles the two parallel tasks more efficiently.

When programming multicore applications, special consideration is needed to harness the power of today's processors. Pipelining is a technique that can yield a performance increase on a multicore CPU even for an inherently serial task: the task is divided into concrete stages that execute in assembly-line fashion, with different stages running on different cores (the original text refers to a conceptual illustration of a pipelined application running on several CPU cores, which is not reproduced here). When creating real-world multicore applications using pipelining, a programmer must take several important concerns into account. In particular, balancing the pipeline stages and minimizing memory transfer between cores are critical to realizing performance gains with pipelining.
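Task parallelism, sketched as a fragment only (RenderPages and CheckSpelling are hypothetical stand-ins for the two independent applications):

using System.Threading.Tasks;

// Two independent activities run as concurrent tasks; on a dual-core
// machine the scheduler is free to place them on separate cores.
Task browser = Task.Run(() => RenderPages());
Task editor = Task.Run(() => CheckSpelling());
Task.WaitAll(browser, editor);

And a sketch of a three-stage pipeline, assuming the TPL and BlockingCollection available since .NET 4; the stage bodies and queue capacities are illustrative choices:

using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

class PipelineSketch
{
    static void Main()
    {
        // Each queue is a stage boundary that hands work down the line.
        // Small bounds limit the data queued (and moved) between cores.
        var toScale = new BlockingCollection<int>(boundedCapacity: 64);
        var toPrint = new BlockingCollection<int>(boundedCapacity: 64);

        var produce = Task.Run(() =>            // stage 1: generate items
        {
            for (int i = 0; i < 100; i++) toScale.Add(i);
            toScale.CompleteAdding();
        });

        var scale = Task.Run(() =>              // stage 2: transform items
        {
            foreach (int i in toScale.GetConsumingEnumerable()) toPrint.Add(i * i);
            toPrint.CompleteAdding();
        });

        var print = Task.Run(() =>              // stage 3: consume results
        {
            foreach (int i in toPrint.GetConsumingEnumerable()) Console.WriteLine(i);
        });

        Task.WaitAll(produce, scale, print);    // all stages drain in order
    }
}

While stage 2 squares one item, stage 1 can already be producing the next, which is exactly the assembly-line overlap described above.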
Software Dependence

While the concept of multiple processors sounds very appealing, there is a major limitation. For the true benefits of multiple processors to be seen, the software running on the computer must be written to support multithreading. Without such support, threads will run primarily through a single processor, degrading efficiency. All of the major current operating systems have multithreading capability, but multithreading must also be written into the application software, and most applications the average user runs currently lack it. The performance improvement gained from a multi-core processor therefore depends very much on the software algorithms used and their implementation. In particular, possible gains are limited by the fraction of the software that can be parallelized to run on multiple cores simultaneously.

Working

Windows (and any other sensible operating system) allocates activity between the two processors in a fine-grained way. Only for a very CPU-intensive application will a single running program keep one core to itself while Windows tasks use the other. In the normal case, where an application does something, waits for I/O or other system services, then does a bit more and waits again, Windows simply allocates the threads requiring CPU time to whichever processor is free at the moment. Except in fairly unusual circumstances, the total CPU load is therefore distributed evenly across both processors.

The downside of multiprocessor systems is that very CPU-intensive tasks (those that hog the CPU for a significant amount of time, such as picture-manipulation work that touches every pixel of a big image, or heavy number crunching) run a bit slower than they would on a single-CPU system that was twice as powerful, though not at half speed, because Windows still has to do everything else (redrawing the screen, dealing with the network, I/O, and so on). The upside is that the system inherently feels more responsive, because no one task can grab all of the processor power; other things (responding to the mouse, graphics redraws, etc.) can get CPU resources more easily.

Optimize Managed Code For Multi-Core Machines

The key to performance improvement is therefore to run a program on multiple processors in parallel. Unfortunately, it is still very hard to write algorithms that actually take advantage of multiple processors; we need to write our programs in a new way. The Task Parallel Library (TPL) is designed to make it much easier to write managed code that can automatically use multiple processors. Using the library, you can conveniently express potential parallelism in existing sequential code, and the exposed parallel tasks will be run concurrently on all available processors, usually resulting in significant speedups. TPL requires no language extensions; it was first previewed for the .NET Framework 3.5 and has shipped as part of the framework since .NET 4. For example, suppose you have the following for loop that squares the elements of an array:
for (int i = 0; i < 100; i++) { a[i] = a[i]*a[i]; }

Since the iterations are independent of each other (that is, subsequent iterations do not read state updates made by prior iterations), you can use TPL to express the potential parallelism with a call to the Parallel.For method, like this:
Parallel.For(0, 100, delegate(int i) { a[i] = a[i]*a[i]; });
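For reference, here is the same loop as a small self-contained program, using the lambda syntax equivalent to the delegate form above and the System.Threading.Tasks namespace in which TPL lives; the array initialization and printout are illustrative additions:

using System;
using System.Threading.Tasks;

class Squares
{
    static void Main()
    {
        int[] a = new int[100];
        for (int i = 0; i < a.Length; i++) a[i] = i;

        // Iterations are independent, so TPL may spread them across all cores.
        Parallel.For(0, a.Length, i => { a[i] = a[i] * a[i]; });

        Console.WriteLine(a[99]);   // prints 9801
    }
}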

Cluster Computing
A computer cluster consists of a set of loosely connected computers that work together so that in many respects they can be viewed as a single system. The components of a cluster are usually connected to each other through fast local area networks, with each node running its own instance of an operating system. The desire for more computing horsepower and better reliability, obtained by orchestrating a number of low-cost commercial off-the-shelf computers, has given rise to a variety of architectures and configurations. The activities of the computing nodes are coordinated by "clustering middleware", a software layer that sits atop the nodes and allows users to treat the cluster as, by and large, one cohesive computing unit, e.g. via a single system image concept.

The world's fastest machine in 2011, the K computer, had a distributed-memory cluster architecture.

Attributes of clusters

Computer clusters may be configured for different purposes, ranging from general-purpose business needs such as web-service support to computation-intensive scientific calculations. "Load-balancing" clusters are configurations in which cluster nodes share the computational workload to provide better overall performance. For example, a web server cluster may assign different queries to different nodes, so the overall response time is optimized. However, approaches to load balancing may differ significantly among applications: a high-performance cluster used for scientific computations would balance load with different algorithms than a web server cluster, which may just use a simple round-robin method, assigning each new request to a different node (a minimal sketch of this policy appears at the end of this subsection). "Compute clusters" are used for computation-intensive purposes rather than for I/O-oriented operations such as web service or databases; for instance, a compute cluster might support computational simulations of weather or vehicle crashes. Very tightly coupled compute clusters are designed for work that may approach "supercomputing". "High-availability clusters" (also known as failover clusters, or HA clusters) improve on the availability of the basic cluster approach. They operate by having redundant nodes, which are used to provide service when system components fail. HA cluster implementations attempt to use redundancy of cluster components to eliminate single points of failure. Commercial implementations of high-availability clusters exist for many operating systems.

Design and configuration

One of the issues in designing a cluster is how tightly coupled the individual nodes should be. For instance, a single computer job may require frequent communication among nodes, which implies that the cluster shares a dedicated network, is densely located, and probably has homogeneous nodes. The other extreme is where a computer job uses one or a few nodes and needs little or no inter-node communication, approaching grid computing. Clusters are homogeneous while grids are heterogeneous: the computers that are part of a grid can run different operating systems and have different hardware, whereas the cluster computers all have the same hardware and OS. A grid can make use of spare computing power on a desktop computer, while the machines in a cluster are dedicated to working as a single unit and nothing else. In a Beowulf system, the application programs never see the computational nodes (also called slave computers) but only interact with the "master", a specific computer handling the scheduling and management of the slaves. In a typical implementation the master has two network interfaces, one that communicates with the private Beowulf network for the slaves, the other for the general-purpose network of the organization. The slave computers typically have their own version of the same operating system, plus local memory and disk space. However, the private slave network may also have a large shared file server that stores global persistent data, accessed by the slaves as needed.
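The round-robin policy mentioned under "Attributes of clusters" above reduces to a few lines. This is a sketch only; the node list is an illustrative assumption, and a real balancer would also have to handle node failure:

using System.Threading;

// Each new request is assigned to the next node in turn.
class RoundRobinBalancer
{
    private readonly string[] nodes;   // e.g. hostnames of the cluster nodes
    private int next = -1;

    public RoundRobinBalancer(string[] nodes) => this.nodes = nodes;

    public string PickNode()
    {
        // Interlocked keeps the counter correct under concurrent requests
        // (counter overflow is ignored for brevity).
        int i = Interlocked.Increment(ref next);
        return nodes[i % nodes.Length];
    }
}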

Computer clusters have historically run on separate physical computers with the same operating system. With the advent of virtualization, the cluster nodes may run on separate physical computers with different operating systems that are overlaid with a virtualization layer so as to appear similar. The cluster may also be virtualized in various configurations as maintenance takes place.

Fig-3: A typical Beowulf configuration

Data sharing and communication

Data sharing

As the early computer clusters were appearing during the 1970s, so were supercomputers. One of the elements that distinguished the two classes at that time was that the early supercomputers relied on shared memory. To date, clusters do not typically use physically shared memory, while many supercomputer architectures have also abandoned it.

Message passing and communication

Two widely used approaches for communication between cluster nodes are MPI, the Message Passing Interface, and PVM, the Parallel Virtual Machine.[11] PVM was developed at the Oak Ridge National Laboratory around 1989, before MPI was available. PVM must be directly installed on every cluster node and provides a set of software libraries that present the node as a "parallel virtual machine". PVM provides a run-time environment for message passing, task and resource management, and fault notification, and can be used by user programs written in C, C++, FORTRAN, and other languages. MPI emerged in the early 1990s out of discussions among 40 organizations. The initial effort was supported by ARPA and the National Science Foundation. Rather than starting anew, the design of MPI drew on various features available in commercial systems of the time. The MPI specifications then gave rise to specific implementations, which typically use TCP/IP and socket connections. MPI is now a widely available communications model that enables parallel programs to be written in languages such as C, FORTRAN, and Python.
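MPI programs are usually written in C or FORTRAN against the MPI library itself, which is outside the scope of the C# snippets used here. Purely as an in-process analogy to the send/receive model, the pattern can be sketched with .NET channels (System.Threading.Channels, .NET Core 3.0+); the two "nodes" below are only tasks within one process, with the channel standing in for the network link:

using System;
using System.Threading.Channels;
using System.Threading.Tasks;

class MessagePassingSketch
{
    static async Task Main()
    {
        // The channel plays the role of the network link between two "nodes".
        var link = Channel.CreateUnbounded<double[]>();

        // "Node 0" sends a chunk of work, then signals that it is done sending.
        var sender = Task.Run(async () =>
        {
            await link.Writer.WriteAsync(new double[] { 1, 2, 3 });
            link.Writer.Complete();
        });

        // "Node 1" receives and processes whatever arrives.
        var receiver = Task.Run(async () =>
        {
            await foreach (double[] chunk in link.Reader.ReadAllAsync())
                Console.WriteLine($"received {chunk.Length} values");
        });

        await Task.WhenAll(sender, receiver);
    }
}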

GPU Architecture
GPU computing, or GPGPU, is the use of a GPU (graphics processing unit) to do general-purpose scientific and engineering computing. The model for GPU computing is to use a CPU and GPU together in a heterogeneous co-processing model: the sequential part of the application runs on the CPU, and the computationally intensive part is accelerated by the GPU. From the user's perspective, the application simply runs faster because it uses the high performance of the GPU to boost throughput. The GPU has evolved over the years to offer teraflops of floating-point performance. NVIDIA revolutionized the GPGPU and accelerated-computing world in 2006-2007 by introducing its massively parallel architecture, called CUDA. The CUDA architecture consists of hundreds of processor cores that operate together to crunch through the data set of the application. The success of GPGPU in the past few years owes much to the ease of programming of the associated CUDA parallel programming model. In this model, application developers take the compute-intensive kernels of their application and map them to the GPU; the rest of the application remains on the CPU. Mapping a function to the GPU involves rewriting it to expose its parallelism and adding C keywords to move data to and from the GPU. The developer launches tens of thousands of threads simultaneously; the GPU hardware manages the threads and does the thread scheduling. The Tesla 20-series GPU is based on the Fermi architecture, the latest CUDA architecture at the time of writing. Fermi is optimized for scientific applications, with key features such as 500+ gigaflops of IEEE-standard double-precision floating-point hardware support, L1 and L2 caches, ECC memory error protection, local user-managed data caches in the form of shared memory dispersed throughout the GPU, coalesced memory accesses, and so on.

History of GPU Computing

Graphics chips started as fixed-function graphics pipelines. Over the years, these chips became increasingly programmable, which led NVIDIA to introduce the first GPU, or graphics processing unit. In the 1999-2000 timeframe, computer scientists, along with researchers in fields such as medical imaging and electromagnetics, started using GPUs to run general-purpose computational applications. They found that the excellent floating-point performance of GPUs gave a huge performance boost for a range of scientific applications. This was the advent of the movement called GPGPU, or general-purpose computing on GPUs.

The problem was that GPGPU required using graphics programming languages like OpenGL and Cg to program the GPU. Developers had to make their scientific applications look like graphics applications, mapping them onto problems that drew triangles and polygons. This limited the accessibility of the GPU's tremendous performance for science. NVIDIA realized the potential of bringing this performance to the larger scientific community and invested in making the GPU fully programmable for scientific applications, adding support for high-level languages like C, C++, and FORTRAN. This led to the CUDA architecture for the GPU.

CUDA Parallel Architecture and Programming Model

The CUDA parallel hardware architecture is accompanied by the CUDA parallel programming model, which provides a set of abstractions for expressing both coarse-grained and fine-grained data and task parallelism. The programmer can express the parallelism in high-level languages such as C, C++, and FORTRAN, or through driver APIs such as OpenCL and DirectX 11 Compute. NVIDIA today provides support for programming the GPU with C, C++, FORTRAN, OpenCL, and DirectCompute, and a set of software development tools, libraries, and middleware is available to developers. This allows the GPU to be programmed using C with a minimal set of keywords or extensions.

The CUDA parallel programming model guides programmers to partition the problem into coarse sub-problems that can be solved independently in parallel. The fine-grained parallelism within each sub-problem is then expressed so that the sub-problem can be solved cooperatively in parallel.
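CUDA kernels themselves are written in CUDA C, which is outside the scope of the C# snippets used in this document. Purely as an analogy, the same two-level decomposition can be mimicked on a multicore CPU, where the block size mirrors a CUDA thread block and each block is an independent coarse sub-problem; the array size, block size, and arithmetic are illustrative choices:

using System;
using System.Threading.Tasks;

class TwoLevelDecomposition
{
    static void Main()
    {
        float[] data = new float[1 << 20];
        int blockSize = 256;                    // plays the role of a CUDA thread block
        int blocks = data.Length / blockSize;

        // Coarse level: independent sub-problems ("blocks") run in parallel.
        Parallel.For(0, blocks, b =>
        {
            int start = b * blockSize;
            // Fine level: the elements of one block, which on a GPU would be
            // handled cooperatively by the threads of that block.
            for (int t = 0; t < blockSize; t++)
                data[start + t] = data[start + t] * 2f + 1f;
        });

        Console.WriteLine(data[0]);             // 0 * 2 + 1 = 1
    }
}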
