
Paper 2 Heterogeneous Multicore Systems: A Future Trend Analysis

David Patterson predicted that the memory wall and the power wall would impede the growth of the semiconductor industry, which otherwise largely follows the trend of Moore's Law. The architectural paradigm shift towards multicore computing helped usher in unprecedented growth in the past decade. The computing paradigm shift towards mobile handheld devices is now emphasizing the use of heterogeneous multicore architectures. The main distinction of these evolving technologies is low power usage. There are well-understood challenges to the design of future computer architectures, solutions to which are just starting to appear. This paper discusses those solutions and predicts the current and future trends in heterogeneous multicore systems.

The memory wall is the limitation of memory access posed by multicore computing, which uses shared coherent memory for message passing among the multiple cores. In this paper I discuss ways to move forward and some of the present technologies for climbing these walls. With the shift in focus of the major computing platform towards mobile platforms of varying sizes, newer kinds of applications are being developed. The scope and reach of a mobile device put more emphasis on applications that demand power-aware, smaller and specialized cores in chip multiprocessors. This follows directly from the earlier analysis of the human brain. Another takeaway from that analysis is that the processing of everything we humans hear, see, taste, smell, think or do, number crunching included, is done by various specialized locations in the brain, which vary in their size, properties and functioning. Taking a cue from this, we can argue that the processors of future devices, which have to support natural user interfaces, need non-uniform types of specialized data processors, one for each kind of sensing the system needs to support.

Thus, the future of computer architecture lies not only in multicore processors, but in the heterogeneous variety of them. The International Technology Roadmap envisions two kinds of semiconductor devices for future computers: devices following the More-Moore trend and devices following the More-than-Moore trend. The former follow the trend predicted by Moore's Law, i.e. they depend on advancements in semiconductor feature technology, whereas the latter are the ones used to interface with the analog world. Section 2 describes the problem statement of the memory and power walls, and section 3 provides some of the ideal and cutting-edge solutions to them. In section 4, I delve deeper into the trends in computer architecture and present my views on what the number of cores in future SoCs will be, and I present a conclusion in section 5.

Heterogeneous Multicore Processors, Wide I/O, Intelligent RAM, Hybrid Memory Cube, TOMI, Interconnects, Epiphany.



The intelligence and number-crunching capabilities of a present-day laptop or tablet are beyond the documented capabilities of supercomputers from two or three decades ago, as shown in Figure 1a. This opens up vistas for the development of new applications on personal devices, which make us, the users of these devices, feel more connected to our environment. Computers have already transcended the physical walls of our homes and moved into our pockets. They are as powerful as, if not more powerful than, the desktop that sat on a table in the corner of the room a decade or two ago. The future of computing lies in developing newer forms of interaction with it. This raw processing power has ushered in a race in the design and implementation of natural user interfaces, which encompass the seamless integration of our computing infrastructure with our natural physical world. Our future goal is to emulate, up to a practical limit, our natural-world interactions, the richness of the physical world and human intelligence using computers. The best-known efficient processor with huge intelligence that does similar work continually is the human brain. Analyzing the working of the human brain, we can say without much contention that it is largely parallel in its operation.

The processor development race until the last decade was predominantly focused on increasing the working speed of the central processor. But this trend in computer architecture was hit by the walls predicted by Dr. Patterson, namely the memory wall and the power wall. The power wall states that the switching of logical states inside the processor is converted into heat, which hampers performance and thus puts an upper limit on switching speed. The answer to this was the historic shift from scaling up operating frequencies towards increasing the number of processors working in parallel, i.e. multicore processors.

The caveat to this is the memory wall, which is the reason why a system with n computing cores working in parallel is not n times as fast as a single-core system.
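The claim that n cores do not give an n-fold speedup can be illustrated with a roofline-style bound. The following is a minimal sketch, not a model taken from this paper: it assumes every core competes for one shared memory channel, and the per-core rate, memory bandwidth and bytes-per-operation values are hypothetical numbers chosen only for illustration.

```python
def attainable_ops_per_sec(n_cores, ops_per_core, mem_bytes_per_sec, bytes_per_op):
    """Return the lower of the compute limit and the shared-memory limit."""
    compute_limit = n_cores * ops_per_core
    memory_limit = mem_bytes_per_sec / bytes_per_op
    return min(compute_limit, memory_limit)


def speedup(n_cores, ops_per_core, mem_bytes_per_sec, bytes_per_op):
    """Speedup of an n-core system over a single core under the same bound."""
    single = attainable_ops_per_sec(1, ops_per_core, mem_bytes_per_sec, bytes_per_op)
    multi = attainable_ops_per_sec(n_cores, ops_per_core, mem_bytes_per_sec, bytes_per_op)
    return multi / single


# Hypothetical values, assumed for illustration only.
OPS_PER_CORE = 1e9          # operations per second per core
MEM_BYTES_PER_SEC = 12.8e9  # shared memory bandwidth
BYTES_PER_OP = 4.0          # memory traffic generated by each operation

for n in (1, 2, 4, 8, 16):
    s = speedup(n, OPS_PER_CORE, MEM_BYTES_PER_SEC, BYTES_PER_OP)
    print(f"{n:2d} cores -> speedup {s:.1f}x")
```

With these assumed values the speedup tracks the core count only until the shared memory saturates, after which adding cores changes nothing.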



Supercomputers of today are video games of tomorrow. An analysis of the problems of the former can give valuable insight into the problems of future computers of all sizes. Figure 1a compares the computing power of a 1997 Sandia Labs supercomputer with that of a 2006 Sony Playstation, which clearly shows the disparity. Acquiring and transferring detailed information from sensors for the intuitive human-computer interfaces of the future, and reducing their response times to match and then surpass human response to stimuli, is the single most critical requirement for the computer systems of tomorrow. This requires fast instruction execution and faster sensory systems, which inevitably means advancements in both More-Moore and More-than-Moore devices as described earlier. For More-Moore devices, which are predominantly multicore systems, the main concerns are getting around the memory wall and the power wall. The execution time of multicore systems is heavily dominated by the memory access time of multiple CPUs. The obvious ways to reduce this access time are to (i) increase the speed of memory transfer, (ii) increase the memory transfer size, or (iii) bring memory close to the CPU.
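The three options can be compared with a simple transfer-time model: a fixed latency plus payload size divided by bandwidth. This is a sketch under stated assumptions, not a measurement. The model treats "closer memory" as lower latency, and all baseline values (64-byte payload, 50 ns latency, 32-bit bus, 400 million transfers per second) are hypothetical.

```python
def transfer_time_s(payload_bytes, latency_s, bus_width_bits, transfers_per_sec):
    """Time to fetch a payload: fixed latency plus serialization time on the bus."""
    bandwidth_bytes_per_sec = bus_width_bits / 8 * transfers_per_sec
    return latency_s + payload_bytes / bandwidth_bytes_per_sec


# Hypothetical baseline, assumed for illustration only.
BASE = dict(payload_bytes=64, latency_s=50e-9,
            bus_width_bits=32, transfers_per_sec=400e6)

options = {
    "baseline": BASE,
    "(i) faster transfers": {**BASE, "transfers_per_sec": 800e6},
    "(ii) wider bus": {**BASE, "bus_width_bits": 64},
    "(iii) memory closer to CPU": {**BASE, "latency_s": 10e-9},
}

for name, params in options.items():
    print(f"{name:28s} {transfer_time_s(**params) * 1e9:6.1f} ns")
```

In this toy setup options (i) and (ii) shorten only the serialization term, while option (iii) attacks the latency term, which is why the text treats them as distinct levers.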

Figure 1a Comparison of Supercomputer (1997) and Video Game (2006). (Picture credit )

The problem with increasing memory transfer speed is that it runs straight into the power wall, i.e. increasing the transfer speed consumes more power. Other techniques that do not work by increasing frequency, like the Intel-backed Rambus technology, were developed and proved to be effective, but the technology was highly expensive and did not do well commercially. Memory transfer size has increased over the years from fetching a single instruction at a time to fetching multiple instructions. But wide memory buses running over long traces consume more power, because a wide bus means more memory I/O pins. Thus, unless the CPU and memory are close together, this would not work for portable systems.

                              Cost per Billion Transistors
Microprocessor Transistors    $387
Memory Transistors            $0.67

The Hybrid Memory Cube is called HMC hereafter. The process technologies for fabricating processors and memory are inherently different. DRAM can pack massive memory capacity into a small silicon die at an amazingly low price, but this inherently brings limitations on logic density and I/O performance, which are the strong points of the manufacturing technology for CPUs. But all possibilities for overcoming the memory wall suggest that we need to blur the distinction between processor and memory as much as possible. HMC is precisely a 3D stack of multiple DRAM layers over an I/O buffer built in CPU process technology. This also frees the processor of memory-controller-related burdens, thereby making memory access much faster than with the planar technology currently employed to interface memories with processors. Intel has demonstrated an I/O prototype, optimized for hybrid 3D stacking, that achieved an energy efficiency of 1.4 mW per Gbit/sec. The HMC has demonstrated a sustained transfer rate of 1 Tbit/sec. To put this in perspective, it has 10 times more bandwidth and is 7 times more energy efficient than the most advanced present-day DDR3 module.

HMC uses a new technology, Through-Silicon Via, called TSV hereafter, to stack the DRAM array onto the logic die. TSV allows a large number of parallel interconnects between the memory and processor dies, which is virtually impossible with present technology. TSV technology supports dies manufactured in different process technologies. The performance comparison in Figure 3a shows the critical lead of HMC over other DRAM technologies such as DDR3 and DDR4 (extrapolated).
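The HMC figures quoted above can be turned into more familiar units with a few lines of arithmetic. This is a sketch only: applying the 1.4 mW per Gbit/sec efficiency of the I/O prototype to the full 1 Tbit/sec rate is an assumption made here for illustration, and the DDR3 bandwidth is merely what the "10 times" claim implies, not a figure from the source.

```python
HMC_BITS_PER_SEC = 1e12       # sustained transfer rate quoted in the text
IO_MW_PER_GBIT_PER_SEC = 1.4  # I/O prototype efficiency quoted in the text
BANDWIDTH_ADVANTAGE = 10      # "10 times more bandwidth" than DDR3


def io_power_watts(bits_per_sec, mw_per_gbit_per_sec):
    """I/O power if the quoted efficiency held at the given rate (assumption)."""
    gbit_per_sec = bits_per_sec / 1e9
    return gbit_per_sec * mw_per_gbit_per_sec / 1000.0


def energy_per_bit_joules(mw_per_gbit_per_sec):
    """Convert mW per Gbit/sec into joules per bit."""
    return (mw_per_gbit_per_sec * 1e-3) / 1e9


hmc_gbytes_per_sec = HMC_BITS_PER_SEC / 8 / 1e9
implied_ddr3_gbytes_per_sec = hmc_gbytes_per_sec / BANDWIDTH_ADVANTAGE

print(f"HMC bandwidth:          {hmc_gbytes_per_sec:.1f} GByte/sec")
print(f"Implied DDR3 bandwidth: {implied_ddr3_gbytes_per_sec:.1f} GByte/sec")
print(f"Energy per bit:         {energy_per_bit_joules(IO_MW_PER_GBIT_PER_SEC) * 1e12:.1f} pJ")
print(f"I/O power at full rate: {io_power_watts(HMC_BITS_PER_SEC, IO_MW_PER_GBIT_PER_SEC):.1f} W")
```

Under these assumptions 1 Tbit/sec corresponds to 125 GByte/sec, and an efficiency of 1.4 mW per Gbit/sec is the same as 1.4 pJ per bit.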

Table 2a Comparison of cost per billion transistors used for making microprocessors vs memory. (Data source )

Moving the data close to the CPU would hugely reduce access time. One way of doing this is to increase the size of the cache memory, which is built from the fastest type of memory cells. But since caches are placed on the microprocessor die and amount to almost 50% of the die real estate in present-day chips, they make chips highly expensive, which puts a commercial limit on their size. A comparison of the cost per billion transistors for microprocessor transistors and DRAM transistors in Table 2a shows why caches are so expensive.

These are the basic questions that arise in predicting the future trend of chip multiprocessors: how many IP cores will be present in CPUs in the near future, what their types will be, and even whether the number will be well defined or will depend on the applications the future devices support.

Figure 3a Comparison between DDR and HMC performance (Picture credit ).

This technology also helps the layout of DRAM by locating I/O channels nearer the memory array, which implies a higher bank count per die. However, this being a new technology, the cost of developing such systems will be on the higher side of the pricing spectrum, and Intel is initially targeting server and data-center applications. The long-term reliability of TSV is also as yet unverified. But since, as mentioned earlier, the supercomputers of today will be the video games of tomorrow, this technology is worth keeping an eye on.

The previous section describes the grueling problems that researchers in numerous architecture groups around the globe are currently facing. Here I present a few potential solutions for overcoming the memory wall and the power wall.


Samsung Wide I/O Memory


Intel and Micron Hybrid Memory Cube

Intel, in collaboration with DRAM technology giant Micron, has developed a highly efficient, fast and advanced DRAM technology, named the Hybrid Memory Cube.

Samsung's Wide I/O is a DDR technology for mobile devices which holds the promise of highly power-efficient data transfers between the processor and memory. Mobile devices are already the most popular computing devices, and data on Samsung mobile units show a trend of roughly 56% YoY growth in Samsung smartphones. The trend is similar across all vendors. Wide parallel interfaces had been written off by the electronics industry due to their power-hungry nature. They are also expensive, owing to the wide buses and complex printed circuit boards (PCBs) they require. But contrary to this industry trend, Samsung went in the opposite direction, building a wider I/O bus and arguing that it gives faster and more energy-efficient data transfer, and it has the numbers to back this up. Wide I/O is a 512-bit wide data bus, compared to conventional 32-bit wide data buses. Counting the power and control pins, the Wide I/O bus can have up to 1200 pins. The new Wide I/O mobile 1Gb DDR has a reported data rate of 12.8 GByte/sec, almost 4 times the bandwidth of LPDDR2 of the same size. The power consumption is reported to be reduced by 87% in the same comparison. Thus Wide I/O, as compared to conventional serial I/O, gives higher bandwidth at lower power, but its economics are still complex and under consideration.

The CPU cores sit between memory banks. TOMI Aurora is a quad-core 32-bit CPU with 64M DRAM and 4096-bit wide internal memory buses; it draws 20mW per CPU at 500MHz and is packed into an area of 5.6 x 6.7 sq. mm. The cost of this quad-core chip is astonishingly low: $0.99. This is the closest that memory can get to the processor, and thus the greatest performance enhancement that current-technology processors can get. It is very similar in concept to Patterson's Intelligent RAM project. Table 3a compares the properties of memory transistors and processor transistors. The microprocessor and memory transistor fabrication technologies are different.

                 Microprocessor Transistors    Memory Transistors
Speed            1x                            0.8x
Leakage Power    1x                            0.00001x
Cost             1x                            0.01x

Table 3a Property comparison between Microprocessor and Memory Transistor (Source ).

But as we can see, barring the speed difference, memory transistors are better in terms of cost and leakage power. Functionally both are the same, which means processors can reasonably be fabricated using memory transistors. But there were two main challenges to creating TOMI's microprocessor:

1. Memory-transistor-specific wiring libraries are not available from any standard manufacturer. Such libraries are created to reduce design time. TOMI engineers at first created a small library of 60 digital cells and 14 analog cells to make a test processor.
2. Because memory cells are identical, wiring in memories is minimal. So memories use only 3-4 layers of wiring, whereas processors have millions of wires which need to be spread over 10-12 layers. TOMI designers created a newer and simpler architecture with which the microprocessors and the memory arrays are fabricated together, all in those 3 layers and with fewer transistors, only 22,000 per core.

But as of now, these are limited to working with huge unstructured data in cloud computing. TOMI™ technology has various implications:

1. 5x-20x power reduction as compared to ARM or Intel Atom architecture CPUs, and thus longer battery life for mobile devices.
2. 5x-10x cost reduction as compared to similar ARM architecture CPUs, and thus cheaper ubiquitous computing systems.
3. Lighter and smaller system designs.
4. Highly enhanced performance at very small scale, enabling previously impossible low-power product designs such as HUD computers and implant computers. SoC designs aimed at realizing natural user interfaces would be made possible with such architectures.
5. A cost-effective solution to massively parallel problems like DNA sequencing, MapReduce, medical applications etc.

Thus TOMI technology can be regarded as a breakthrough enabler for future computing hardware, making highly complex data processing possible at cheap prices. This is surely the hardware architecture that will help the development of tomorrow's natural user interfaces.

Figure 3b Comparison between LPDDR2 and Wide I/O (Picture Source )

The conventional method of reducing the power budget of mobile memory was to reduce the operating voltage of the DDR, e.g. DDR2 at 1.8V, DDR3 at 1.5V. But as semiconductor feature sizes shrink below 10nm, operating voltages will practically cease to fall, because they approach near-threshold-voltage operation. Wide I/O solves some of these sub-10nm power criticalities. To put this in perspective, a comparison between conventional 3D PoP stacking of DRAM, itself an advanced technology, and TSV Wide I/O memory shows a fourfold reduction in I/O power consumption, as shown in Table 3c and Figure 3b.

                                  I/O Power
Conventional 3D PoP for LPDDR2    176mW
TSV SiP for Samsung Wide I/O      44mW

Table 3c I/O Power Comparison (Source )
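Both power arguments in this passage can be checked numerically. The sketch below uses the standard approximation that dynamic power scales with the square of the supply voltage (P proportional to C·V²·f); holding capacitance and frequency constant is an assumption made here, not a claim from the source. The I/O power values are the ones quoted in the table above.

```python
def relative_dynamic_power(v_new, v_old):
    """Dynamic power ratio when only the supply voltage changes (P ~ C * V^2 * f)."""
    return (v_new / v_old) ** 2


# Voltage scaling from DDR2 (1.8V) to DDR3 (1.5V), with C and f assumed unchanged.
ddr3_vs_ddr2 = relative_dynamic_power(1.5, 1.8)

# I/O power values quoted in the text, in milliwatts.
POP_LPDDR2_MW = 176.0
TSV_WIDE_IO_MW = 44.0

reduction_factor = POP_LPDDR2_MW / TSV_WIDE_IO_MW
fraction_saved = 1.0 - TSV_WIDE_IO_MW / POP_LPDDR2_MW

print(f"DDR3 dynamic power relative to DDR2: {ddr3_vs_ddr2:.2f}")
print(f"Wide I/O power reduction factor:     {reduction_factor:.1f}x")
print(f"Wide I/O share of I/O power saved:   {fraction_saved:.0%}")
```

Under the constant-capacitance, constant-frequency assumption, the voltage step from 1.8V to 1.5V alone saves about 30% of dynamic power, which is a smaller gain than the fourfold I/O power reduction reported for TSV Wide I/O.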


Venray TOMI™ Aurora Chip

Venray Technologies (www.venraytechnology.com) is a promising startup which has finally succeeded in pushing existing technology to place memory and CPU on a single die. TOMI is the first multi-core milliwatt microprocessor. This is a brand-new approach to the co-design of memory cells and processor cells on the same die. TOMI is manufactured by packing CPU cores between memory banks.

Tomorrow's natural user interfaces would require massive data handling and processing capabilities in ever-decreasing form factors.
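The TOMI Aurora figures quoted in this section (20mW per CPU at 500MHz, four CPUs, a 5.6 x 6.7 mm die) can be combined into a few derived quantities. This is a back-of-the-envelope sketch; the derived values are arithmetic on the quoted numbers, not additional data from Venray, and they ignore any power drawn outside the CPU cores.

```python
POWER_PER_CPU_W = 20e-3            # 20mW per CPU, as quoted
CLOCK_HZ = 500e6                   # 500MHz, as quoted
N_CPUS = 4                         # quad-core
DIE_WIDTH_MM, DIE_HEIGHT_MM = 5.6, 6.7

total_cpu_power_w = N_CPUS * POWER_PER_CPU_W
energy_per_cycle_j = POWER_PER_CPU_W / CLOCK_HZ
die_area_mm2 = DIE_WIDTH_MM * DIE_HEIGHT_MM

print(f"Total CPU power:      {total_cpu_power_w * 1e3:.0f} mW")
print(f"Energy per CPU cycle: {energy_per_cycle_j * 1e12:.0f} pJ")
print(f"Die area:             {die_area_mm2:.2f} sq. mm")
```

The quoted per-CPU power and clock rate correspond to 40 pJ per clock cycle per CPU, which gives a sense of why the text calls TOMI a milliwatt microprocessor.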


As seen in the previous section, we are at a technological juncture where we are capable of overcoming both the power wall and the memory wall, and can thus think of SoC designs with higher numbers of IP cores. Let us foray into the SoC trends of future computing infrastructure.


Number and Types of Unique IP Cores

For mobile systems, design optimization with respect to area and power results in the use of heterogeneous multiprocessors, because they best fit the inherently varying workload that mobile systems pose. Heterogeneous CMPs are best suited to systems where multithreading is used extensively. By definition, a SoC is a heterogeneous system and is thus well suited for use in mobile devices. SoC workloads are primarily data-flow problems, i.e. data from various sources has to be processed at any given time, most of it real-time data such as audio, video, network and camera data on a smartphone. To access and process these large amounts of data, the chip needs to access memory frequently and repeatedly, which is the real bottleneck for most systems. Also, because of hard real-time constraints on most of the data processing on mobile platforms, memory access times are critical. But with TOMI technology, memory has come inside the SoC, which would not only mitigate this problem but also help reduce power consumption.

The trend chart in Figure 4a, compiled by Semico Research (www.semico.com), shows the number of differentiating IP cores on a single chip; we can see that with advancing technology nodes, the number of unique IP cores increases rapidly. It shows that in 2012 there are 75-80 unique IP cores per SoC, with a YoY growth of 18.7% in the number of different IP cores on a single chip. We have to keep in mind that these are unique IP cores, and a core can be used multiple times on a die. Extrapolating this chart to five years from now, the technology node will move below 20nm. From the trend we can say that the number of unique IP cores on an SoC will be close to 140-160.

These would include IP cores for CPU, GPU and DSP for computational purposes, similar sets of cores for wifi and networks, hardware accelerators, ADC and DAC cores, RF cores, memory interfaces, audio, video, 3D graphics, camera, image processing, VOIP, NFC, and the various input/output protocols for communicating with the world. However, these trends are for chips that support today's hardware technologies. With future natural user interfaces, the number and types of input/output IP cores will certainly increase greatly. Integrating communications processors for 3G/4G onto the same SoC would also increase the number of IP cores considerably. Another important class is security subsystems, such as SoC firewalls and content protection, which will be present on future SoCs and add to the IP core count.
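The extrapolation of unique IP cores can be cross-checked by compounding the quoted 18.7% YoY growth from the quoted 2012 range of 75-80 cores. This is a sketch that assumes the growth rate stays constant, which the chart itself may not imply; the function names are illustrative.

```python
import math


def project(count, yoy_growth, years):
    """Compound a starting count at a constant year-over-year growth rate."""
    return count * (1.0 + yoy_growth) ** years


def years_to_reach(start, target, yoy_growth):
    """Years of constant compounding needed to grow from start to target."""
    return math.log(target / start) / math.log(1.0 + yoy_growth)


YOY_GROWTH = 0.187       # quoted YoY growth
LOW, HIGH = 75.0, 80.0   # quoted 2012 range of unique IP cores per SoC

for years in (4, 5):
    low, high = project(LOW, YOY_GROWTH, years), project(HIGH, YOY_GROWTH, years)
    print(f"after {years} years: {low:.0f} to {high:.0f} unique IP cores")

print(f"years for 80 cores to reach 160: {years_to_reach(HIGH, 160.0, YOY_GROWTH):.1f}")
```

Strict compounding for five full years gives roughly 177 to 189 cores, somewhat above the 140-160 read from the chart; the chart-based range is closer to four years of compounding. The gap may come from the base year or from reading values off the chart, so the two figures should be treated as a range rather than a contradiction.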

Figure 4a Trend for SoC IP Core (Source )

Often data from all these interfaces, which obviously don't operate at the same pace, will need to be operated on simultaneously. Right now, the maximum number of computational cores on any SoC across the industry is 4+1 CPUs and 1 GPU, on Nvidia's Tegra3 processor. TOMI™ Aurora also has 4 CPUs. With increasing pixel density in displays such as the New-iPad Retina Display, and the invention of newer display technologies like Mirasol Displays (www.mirasoldisplays.com) from Qualcomm, the number of GPUs can very well go to 6-8 per SoC (compare the quad-GPU-core A5X in the New-iPad). As for CPU performance, as benchmarked by Anandtech (www.anandtech.com), the Qualcomm Snapdragon S4 dual-core CPU running at 1.5GHz beats the Nvidia Tegra3 quad-core CPU running at 1.3GHz in single-threaded as well as multi-threaded applications. The website notes that at present, because all video functionality is offloaded to GPUs, there are very few applications that need more than 2 cores to run efficiently. It also notes that having more cores is better, but having 2 faster cores is better than having 4 slower cores. However, all these present-day SoCs have homogeneous CPUs. As noted earlier, mobile applications are best suited to heterogeneous cores. Although software and operating systems are not optimized for them at present, the only remaining mode of system improvement is to increase the thread count, in which case more cores will be required. Also, with the invention of TOMI technologies, tighter integration of memory and processor would help overcome the memory limitation, and so in future we would see more CPU cores on a die. Based on the trends in mobile processors, I predict up to 10-14 CPU cores on a single SoC in the next 5 years. The reason for this would be a considerable increase in thread performance and further development of advanced OS and software techniques. However, there are a few rebellious architectures which go beyond these predictions. These are being patented and are about to be commercialized in the near future. They are discussed in the next section.
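The remark that two faster cores can be better than four slower ones can be illustrated with Amdahl's law scaled by clock frequency. This is a sketch using a technique not taken from the source or from the cited benchmark: it considers only clock rate and core count and ignores per-clock efficiency differences between the two designs, so it cannot reproduce the measured results. It only shows how the answer depends on how much of a workload runs in parallel.

```python
def relative_throughput(clock_ghz, n_cores, parallel_fraction):
    """Amdahl's law scaled by clock: higher is better, arbitrary units."""
    serial_fraction = 1.0 - parallel_fraction
    return clock_ghz / (serial_fraction + parallel_fraction / n_cores)


def crossover_fraction(clock_a, cores_a, clock_b, cores_b):
    """Parallel fraction at which configurations a and b perform equally."""
    numerator = clock_a - clock_b
    denominator = clock_a * (1.0 - 1.0 / cores_b) - clock_b * (1.0 - 1.0 / cores_a)
    return numerator / denominator


DUAL = (1.5, 2)  # clock in GHz, cores: the dual-core configuration quoted
QUAD = (1.3, 4)  # clock in GHz, cores: the quad-core configuration quoted

for p in (0.0, 0.2, 0.4, 0.6, 0.8):
    dual = relative_throughput(*DUAL, p)
    quad = relative_throughput(*QUAD, p)
    winner = "dual" if dual > quad else "quad"
    print(f"parallel fraction {p:.1f}: dual {dual:.2f}, quad {quad:.2f} -> {winner}")

print(f"crossover at parallel fraction {crossover_fraction(*DUAL, *QUAD):.3f}")
```

In this clock-only model the dual-core configuration wins whenever less than about 42% of the work is parallelizable, which is consistent with the observation that few current applications benefit from more than two cores.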


A Few Different Approaches

A new approach is taken by Adapteva Inc (www.adapteva.com), a startup making smartphone processors. Its Epiphany-IV, built in 28nm technology, has 64 independent RISC cores, each with 32 Kbytes of memory, consumes 25mW per core, and fits in an area of 8.2 sq. mm, making it the most energy-efficient floating-point processor at 70 GigaFlops/Watt; it can scale up to 1000s of cores. It is designed for use as an accelerator for DSP-related tasks such as speech recognition, image processing, machine vision, medical diagnostics, SDR etc. In per-core CoreMark benchmarking, Adapteva's Epiphany-IV chips prove to be four times better in performance than Intel Xeon, while consuming less than 2 watts of peak power. The main innovation from Adapteva is a patented low-power network-on-a-chip architecture that sustains 25 Gbytes per second of local memory bandwidth and 6.4 Gbytes per second of network bandwidth per processor, as per the report.

Another approach to SoC development is to use reconfigurable architectures. One way to look at conventional heterogeneous computer architectures is that each IP core does the same thing all the time and, when not in use, is powered down. Why not build a reconfigurable architecture on a gate array and let improvements in technology increase the size and speed of the reconfigurable gate array? This means the IP cores are not hard-coded at all, and the system designer has the flexibility to program any part of the array as whichever core they wish. This will definitely help redefine computer architecture. A patent has been awarded to RMT Inc. for such a computer architecture. Here an FPGA is the controller of the computer, with a CPU and other hardware as its peripherals. This device can be reprogrammed as a re-definable circuit board, a CPU, any peripheral device, a memory device etc. without any hardware support. This brings the new perspective of an on-demand reconfigurable parallel processing facility, which is not available in any other architecture at present.
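The Epiphany-IV figures quoted earlier in this section can be checked against each other with simple arithmetic. This is a sketch only: the implied throughput is derived here by multiplying the quoted efficiency by the summed core power, and is not a figure stated in the source.

```python
N_CORES = 64                 # independent RISC cores, as quoted
POWER_PER_CORE_W = 25e-3     # 25mW per core, as quoted
MEMORY_PER_CORE_KB = 32      # 32 Kbytes per core, as quoted
GFLOPS_PER_WATT = 70.0       # quoted efficiency
PEAK_POWER_LIMIT_W = 2.0     # "less than 2 watts of peak power"

total_core_power_w = N_CORES * POWER_PER_CORE_W
total_memory_kb = N_CORES * MEMORY_PER_CORE_KB
# Derived under the assumption that the quoted efficiency applies at this power.
implied_gflops = GFLOPS_PER_WATT * total_core_power_w

print(f"Total core power:    {total_core_power_w:.1f} W")
print(f"Within 2 W limit:    {total_core_power_w < PEAK_POWER_LIMIT_W}")
print(f"Total on-chip memory: {total_memory_kb} Kbytes")
print(f"Implied throughput:  {implied_gflops:.0f} GigaFlops")
```

The summed core power of 1.6 W is consistent with the quoted peak power of under 2 watts, and the 64 local memories add up to 2 Mbytes on chip.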

In this paper we have seen that we presently stand at a position in the computer-architecture timeline where we have some promising technologies which present innovative solutions to age-old problems of high-performance computing. Still, they are so new that it cannot be said with certainty whether all or some of these technologies will prevail in today's highly competitive market. Only time will tell.