
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 46, NO. 1, JANUARY 2011

A 48-Core IA-32 Processor in 45 nm CMOS Using On-Die Message-Passing and DVFS for Performance and Power Scaling
Jason Howard, Saurabh Dighe, Sriram R. Vangal, Gregory Ruhl, Member, IEEE, Nitin Borkar, Shailendra Jain, Vasantha Erraguntla, Michael Konow, Michael Riepen, Matthias Gries, Guido Droege, Tor Lund-Larsen, Sebastian Steibl, Shekhar Borkar, Vivek K. De, and Rob Van Der Wijngaart

Abstract: This paper describes a multi-core processor that integrates 48 cores, 4 DDR3 memory channels, and a voltage regulator controller in a 6×4 2D-mesh network-on-chip architecture. Located at each mesh node is a five-port virtual cut-through packet-switched router shared between two IA-32 cores. Core-to-core communication uses message passing while exploiting 384 KB of on-die shared memory. Fine-grain power management takes advantage of 8 voltage and 28 frequency islands to allow independent DVFS of cores and mesh. At the nominal 1.1 V supply, the cores operate at 1 GHz while the 2D-mesh operates at 2 GHz. As performance and voltage scale, the processor dissipates between 25 W and 125 W. The 567 mm² processor is implemented in 45 nm Hi-K CMOS and has 1.3 billion transistors.

Index Terms: 2D routing, CMOS digital integrated circuits, DDR3 controllers, dynamic voltage frequency scaling (DVFS), IA-32, message passing, network-on-chip (NoC).

I. INTRODUCTION

A FUNDAMENTAL shift in microprocessor design from frequency scaling to increased core counts has facilitated the emergence of many-core architectures. Recent many-core designs have demonstrated high performance while achieving higher energy efficiency [1]. However, the complexity of maintaining coherency across traditional memory hierarchies in many-core designs is causing a dilemma. Simply stated, the computational value gained through additional cores will at some point be exceeded by the protocol overhead needed to maintain cache coherency among the cores. Architectural techniques can be used to delay this crossover point for only so long. Alternatively, another approach is to eliminate cache coherency altogether and rely on software to maintain data consistency between cores. Many-core architectures also face steep design challenges with respect to power consumption.

The seemingly endless compaction and density increase of transistors, as stated by Moore's Law [2], has both enabled the growth in core counts and worsened thermal gradients through the exponential surge in power density. In an effort to mitigate these effects, many-core architectures will be required to employ a variety of power saving techniques.

The prototype processor (Fig. 1) described in this paper is an evolutionary approach toward many-core Network-on-Chip (NoC) architectures that removes the dependence on hardware-maintained cache coherency while remaining within a constrained power budget. The 48 cores communicate over an on-die network using a message passing architecture that allows data sharing with software-maintained memory consistency. The processor also uses voltage and frequency islands with Dynamic Voltage and Frequency Scaling (DVFS) to improve energy efficiency.

The remainder of the paper is organized as follows. Section II gives a more in-depth architectural description of the 48-core processor and describes key building blocks. The section also highlights enhancements made to the IA-32 core and describes the accompanying non-core logic. Router architectural details and packet formats are also described, followed by an explanation of the DDR3 memory controller. Section III presents a novel message-passing-based software protocol used to maintain data consistency in shared memory. Section IV describes the DVFS power reduction techniques. Details of an on-die voltage regulator controller are also discussed. Experimental results, chip measurements, and programming methodologies are given in Section V.

II. TOP LEVEL ARCHITECTURE

The processor is implemented in 45 nm high-K metal-gate CMOS [3] with a total die area of 567 mm² and contains 1.3 billion transistors. The architecture integrates 48 Pentium-class IA-32 cores [4] using a tiled design methodology; the 24 tiles are arrayed in a 6×4 grid with 2 cores per tile. High speed, low latency routers are also embedded within each tile to provide a 2D-mesh interconnect network with sufficient bandwidth, an essential ingredient in complex, many-core NoCs. Four DDR3 memory channels reside on the periphery of the 2D-mesh network to provide up to 64 GB of system memory. Additionally, an 8-byte bidirectional high-speed I/O interface is used for all off-die communication. Included within a tile are two 256 KB unified L2 caches, one for each core, and supporting network interface (NI) logic required for core-to-router communication.

Manuscript received April 15, 2010; revised July 16, 2010; accepted August 30, 2010. Date of publication November 09, 2010; date of current version December 27, 2010. This paper was approved by Guest Editor Tanay Karnik. J. Howard, S. Dighe, S. Vangal, G. Ruhl, N. Borkar, S. Borkar, and V. K. De are with Intel Corporation, Hillsboro, OR 97124 USA (e-mail: jason.m.howard@intel.com). S. Jain and V. Erraguntla are with Intel Labs, Bangalore, India. M. Konow, M. Riepen, M. Gries, G. Droege, T. Lund-Larsen, and S. Steibl are with Intel Labs, Braunschweig, Germany. R. Van Der Wijngaart is with Intel Labs, Santa Clara, CA. Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSSC.2010.2079450



Fig. 1. Block diagram and tile architecture.

Fig. 2. Full-chip and tile micrograph and characteristics.

Each tile's NI logic also features a Message Passing Buffer (MPB), 16 KB of on-die shared memory. The MPB is used to increase the performance of a message passing programming model in which cores communicate through local shared memory. Total die power is kept to a minimum by dynamically scaling both voltage and performance. Fine-grained voltage change commands are transmitted over the on-die network to a Voltage Regulator Controller (VRC). The VRC interfaces with two on-package voltage regulators. Each voltage regulator has a 4-rail output to supply 8 on-die voltage islands. Further power savings are achieved through active frequency scaling at a tile granularity. Frequency change commands are issued to a tile's un-core logic, where the frequency adjustments are processed. Tile performance is scalable from 300 MHz at 700 mV to 1.3 GHz at 1.3 V. The on-chip network scales from 60 MHz at 550 mV to 2.6 GHz at 1.3 V. The design target for nominal usage is 1 GHz for tiles and 2 GHz for the 2-D network when supplied by 1.1 V. Full-chip and tile micrographs and characteristics are shown in Fig. 2.

A. Core Architecture

The core is an enhanced version of the second-generation Pentium processor [4]. The L1 instruction and data caches have been upsized to 16 KB, over the previous 8 KB design, and support 4-way set associativity and both write-through and write-back modes for increased performance. Additionally, data cache lines have been modified with a new status bit used to mark the content of the cache line as Message Passing Memory Type (MPMT). The MPMT is introduced to differentiate between normal memory data and message passing data. The cache line's MPMT bit is determined by page table information found in the core's TLB and must be set up properly by the operating system. The Pentium instruction set architecture has been extended to include a new instruction, INVDMB, used to support software-managed coherency. When executed, an INVDMB instruction invalidates all MPMT cache lines in a single clock cycle. Subsequently, reads or writes to the MPMT cache lines are guaranteed to miss, and data will be fetched or written. The instruction gives the programmer direct control of cache management while in a message passing environment.
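To make the intended usage concrete, the sketch below wraps INVDMB for use from C. This is only an illustration: the invdmb() binding, the use of inline assembly, and the assumption that the toolchain recognizes the mnemonic are ours, not details given in the paper, which defines only the instruction's semantics.

```c
/* Minimal sketch of exposing INVDMB to C code. The inline-asm binding and
 * the assumption that the assembler knows the mnemonic are illustrative. */
static inline void invdmb(void)
{
    /* Invalidate every L1 line tagged with the MPMT attribute in one cycle,
     * so subsequent MPMT reads and writes are guaranteed to miss.          */
    __asm__ __volatile__("invdmb" ::: "memory");
}
```

Section III shows where such an invalidation fits into the message passing protocol.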


Fig. 3. Router architecture and latency.

The addressable space of the core has been extended from 32 bits to 36 bits to support 64 GB of system memory. This is accomplished using a 256-entry look-up table (LUT) extension. To ensure proper physical routing of system addresses, the LUTs also provide each address's destination and routing information. A bypass status bit in an LUT entry allows direct access to the local tile's MPB. To give applications and designers the most flexibility, the LUT can be reconfigured dynamically.

B. L2 Cache and Controller

Each core's L1 data and instruction caches are reinforced by a unified 256 KB 4-way write-back L2 cache. The L2 cache uses a 32-byte line size to match the line sizes internal to the core's L1 caches. Salient features of the L2 cache include: a 10-cycle hit latency, in-line double-error detection and single-error correction for improved performance, several programmable sleep modes for power reduction, and a programmable time-out and retry mechanism for increased system reliability. Evicted cache lines are determined through a strict least-recently-used (LRU) algorithm. After every reset deassertion in the L2 cache, 4000 cycles are needed to initialize the state array and LRU bits. During that time the grants to the core are deasserted and no requests are accepted.

Several architectural attributes of the core were considered during the design of the L2 cache and associated cache controller. Simplifications to the controller's pipeline were made because the core allows only one outstanding read/write request at a time. Additionally, inclusion is not maintained between the core's L1 caches and its L2 cache. This eliminates the need for snoop or inquire cycles between the L1 and L2 caches and allows data to be evicted from the L2 cache without an inquire cycle to the L1 cache. Finally, there is no allocate-on-write capability in the core's L1 caches. Thus, an L1 cache write miss combined with an L2 cache write hit does not write the cache line back into the L1 cache.

High post-silicon visibility into the L2 cache was achieved through a comprehensive scan approach. All L2 cache lines were made scan addressable, including both tag and data arrays and the LRU status bits. A full self-test feature was also included that allowed the L2 cache controller to write either random or programmed data to all cache lines, followed by a read comparison of the results.

C. Router Architecture

The 5-port router [5] uses two 144-bit unidirectional links to connect with its four neighboring routers and one local port, creating the 2-D mesh on-die network. In place of the wormhole routing used in earlier work [1], virtual cut-through switching is used to reduce mesh latency. The router has 4 pipe stages (Fig. 3) and an operational frequency of 2 GHz at 1.1 V. The first stage covers link traversal of the incoming packet and the input buffer write. Switch arbitration is done in the second stage, and the third and fourth stages are the VC allocation and switch traversal stages, respectively. Two message classes (MCs) and eight virtual channels (VCs) ensure deadlock-free routing and maximize bandwidth utilization. Two VCs are reserved: VC6 for request MCs and VC7 for response MCs. Dimension-ordered XY routing eliminates network deadlock, and route pre-computation in the previous hop allows fast output port identification on packet arrival. Input port and output port arbitrations are done concurrently using a centralized conflict-free wrapped wave-front arbiter [6] formed from a 5×5 array of asymmetric cells (Fig. 4).


Fig. 4. Wrapped wave-front arbiter.

A cell with a row (column) token that is unable to use the token passes it to the right (down), wrapping around at the end of the array. These tokens propagate in a wave-front from the priority diagonal group. If a cell with a request receives both a row and a column token, it grants the request and stops the propagation of tokens. Crossbar switch allocation is done in a single clock cycle, at packet granularity. No-load router latency is 4 clock cycles, including link traversal. Individual links offer 64 GB/s of interconnect bandwidth, enabling the total network to support 256 GB/s of bisection bandwidth.

D. Communication and Router Packet Format

A packet is the granularity at which all 2-D mesh agents communicate with each other. However, a packet may be subdivided into one or more FLITs, or flow control units. The header (Fig. 5) is the first FLIT of any packet and contains routing-related fields and flow control commands. Protocol layer packets are divided into request and response packets through the message class field. When required, data payload FLITs follow the header FLIT. Important fields within the header FLIT include:
- Route: identifies the tile/router a packet will traverse to.
- DestID: identifies the final agent within a tile that the packet is addressed to.
- SourceID: identifies the mesh agent within each node/tile the packet is from.
- Command: identifies the type of MC (request or response).
- TransactionID: a unique identifier assigned at packet generation time.
Packetization of core requests and responses into FLITs is handled by the associated un-core logic. It is incumbent on the user to correctly packetize all FLITs generated off die.
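As a deliberately simplified picture of these fields, the struct below collects them in C. The field widths, ordering, and payload sizing are assumptions made for this sketch; the paper names the fields (Fig. 5) but this is not the chip's actual 144-bit FLIT encoding.

```c
/* Illustrative grouping of the header-FLIT fields listed above; widths
 * and packing are assumptions, not the real FLIT layout. */
#include <stdint.h>

enum mc_command { MC_REQUEST, MC_RESPONSE };   /* two message classes */

typedef struct {
    uint32_t route;           /* tile/router path the packet traverses    */
    uint8_t  dest_id;         /* final agent within the destination tile  */
    uint8_t  source_id;       /* mesh agent within the source node/tile   */
    uint8_t  command;         /* message class: request or response       */
    uint16_t transaction_id;  /* unique ID assigned at packet generation  */
} header_flit_t;

/* A packet is one header FLIT optionally followed by data-payload FLITs;
 * the payload count here is arbitrary for the sketch. */
typedef struct {
    header_flit_t hdr;
    uint64_t      payload[4];
    unsigned      payload_flits;
} packet_t;
```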

E. DDR3 Memory Controller

Memory transactions are serviced by four DDR3 [7] integrated memory controllers (IMCs) positioned at the periphery of the 2D-mesh network. The controllers support the DDR3-800, -1066, and -1333 speed grades and reach 75% bandwidth efficiency with rank and bank interleaving in closed-page mode. By supporting dual rank and two DIMMs per channel, a system memory of 64 GB is realized using 8 GB DIMMs.

An overview of the IMC is shown in Fig. 6. All memory access requests enter the IMC through the Mesh Interface Unit (MIU). The MIU reassembles memory transactions from the 2-D mesh packet protocol and passes each transaction to the Access Controller (ACC) block, or controller state machine. The Analog Front End (AFE) circuits provide the actual I/O buffers and fine-grain DDR3-compliant compensation and training control. The IMC's AFE is a derivative of productized IP [8]. The ACC block is responsible for buffering and issuing up to eight data transfers in order while interleaving control sequences such as refresh and ZQ calibration commands. Control sequence interleaving results in a 5X increase in achievable bandwidth, since activate and precharge delays can be hidden behind data transfers on the DDR3 bus. The ACC also applies closed-page mode by precharging activated memory pages as soon as one burst access is finished, using auto-precharge. A complete feature list of the IMC is shown in Table I.

III. MESSAGE PASSING

Shared memory coherency is maintained through software in an effort to eliminate the communication and hardware overhead required for a memory-coherent 2D-mesh. Inspired by software coherency models such as SHMEM [9], MPI, and OpenMP [10], the message passing protocol is based on one-sided put and get primitives that efficiently move data from the L1 cache of one core to the L1 cache of another [11].


Fig. 5. Request (a) and Response (b) Header FLITs.

Fig. 6. Integrated memory controller block diagram.

As described earlier, the new Message Passing Memory Type (MPMT) is introduced in conjunction with the new INVDMB instruction as an architectural enhancement to optimize data sharing using these software procedures. The MPMT retains all the performance benefits of a conventional cache line but distinguishes itself by addressing non-coherent shared memory. The strict message passing protocol proceeds as follows (Fig. 7(a)): a core initiates a message write by first invalidating all message passing data cache lines. Next, when the core attempts to write data from a private address to a message passing address, a write miss occurs and the data is written to memory. Similarly, when a core reads a message, it begins by invalidating all message passing data cache lines. The read attempt then causes a read miss, and the data is fetched from memory.
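The sketch below restates this protocol as one-sided put/get helpers in C. It assumes the hypothetical invdmb() wrapper sketched in Section II-A and a pointer into MPMT-mapped shared space; the names and signatures are illustrative and are not the programming interface shipped with the chip.

```c
/* Hedged sketch of the strict protocol above: invalidate MPMT lines,
 * then let the guaranteed misses move data to or from shared memory.
 * mp_put/mp_get are illustrative names, not the chip's actual API.    */
#include <stddef.h>
#include <stdint.h>

extern void invdmb(void);   /* invalidates all MPMT-tagged L1 lines */

/* put: publish a message from private memory into MPMT-mapped space. */
void mp_put(volatile uint8_t *mpmt_dst, const uint8_t *src, size_t len)
{
    invdmb();                            /* drop stale MPMT lines        */
    for (size_t i = 0; i < len; i++)     /* each store misses L1 and     */
        mpmt_dst[i] = src[i];            /* lands in shared memory       */
}

/* get: read a message from MPMT-mapped space into private memory. */
void mp_get(uint8_t *dst, const volatile uint8_t *mpmt_src, size_t len)
{
    invdmb();                            /* force fresh fetches          */
    for (size_t i = 0; i < len; i++)
        dst[i] = mpmt_src[i];
}
```

Whether the MPMT region maps to a tile's 16 KB MPB or to DRAM is what drives the latency difference discussed next.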

The 16 KB MPB, found in each tile and used as on-die shared memory, further optimizes the design by decreasing the latency of shared memory accesses. Messages smaller than 16 KB see a 15x latency improvement when passed through the MPB rather than sent through main memory (Fig. 7(b)). However, messages larger than 16 KB lose this performance edge, since the MPB is completely filled and the remaining portion must be sent to main memory.


TABLE I INTEGRATED MEMORY CONTROLLER FEATURES

Fig. 7. Message passing protocol (a) and message passing versus DDR3-800 (b).

IV. DVFS

To maximize power savings, the processor is implemented using 8 Voltage Islands (VIs) and 28 Frequency Islands (FIs) (Fig. 8). Software-based power management protocols take advantage of the voltage/frequency islands through Dynamic Voltage and Frequency Scaling (DVFS). Two voltage islands supply the 2-D mesh and the die periphery, with the remaining six voltage islands divided among the core area. The on-die VRC interfaces with two on-package voltage regulators [12] to scale the voltages of the 2-D mesh and core area dynamically from 0 V to 1.3 V in 6.25 mV steps. Since the VRC acts as any other 2-D mesh agent, it is addressable by all cores. Upon reception of a new voltage change command, the VRC and on-package voltage regulators respond in under a millisecond. VIs for idle cores can be set to 0.7 V, a safe voltage for state retention, or completely collapsed to 0 V if retention is unnecessary. Voltage level isolation and translation circuitry allows VIs with active cores to continue execution with no impact from collapsed VIs and provides a clean interface across voltage domains.
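As a rough illustration of the granularity involved, the helper below converts a target rail voltage into a step code using the 6.25 mV resolution and 0-1.3 V range quoted above. The code format, the helper name, and the clamping policy are assumptions for this sketch; the actual VRC command encoding is not described at this level in the paper.

```c
/* Hedged sketch: mapping a target voltage to a 6.25 mV VRC step code.
 * The encoding and helper are illustrative, not the chip's interface. */
#include <stdint.h>

#define VRC_STEP_UV  6250u       /* 6.25 mV resolution             */
#define VRC_VMAX_UV  1300000u    /* 1.3 V upper bound of the range */

static uint16_t vrc_code_from_uv(uint32_t target_uv)
{
    if (target_uv > VRC_VMAX_UV)
        target_uv = VRC_VMAX_UV;                   /* clamp to legal range */
    return (uint16_t)((target_uv + VRC_STEP_UV / 2) / VRC_STEP_UV);
}

/* Example: vrc_code_from_uv(1100000) -> 176 steps for the nominal 1.1 V;
 * vrc_code_from_uv(700000)  -> 112 steps for the 0.7 V retention level. */
```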

The processor's 28 FIs are divided as follows: one FI for each tile (24 total), one FI for the entire 2-D mesh, and the remaining three FIs for the system interface, the VRC, and the memory controllers, respectively. Similar to the VIs, all core-area FIs are dynamically adjustable to an integer division (up to 16) of the globally distributed clock. However, unlike voltage changes, the response time of frequency changes is significantly faster: around 20 ns when a 1 GHz clock is being used. Thus, frequency changes are much more common than voltage changes in power-optimized software. Deterministic first-in-first-out (FIFO) based clock-crossing units (CCFs) are used for synchronization across clocking domains [5]. Frequency-aware read pointers ensure that the same FIFO location is not read from and written to simultaneously. Embedded level shifters (Fig. 9) within the clock-crossing unit handle voltage translation.
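The divide-by-1..16 scheme can be pictured with the small helper below, which picks the fastest divided clock that does not exceed a requested tile frequency. The globally distributed clock value and the helper name are assumptions made only for this sketch.

```c
/* Hedged sketch of per-tile frequency selection under the integer
 * divide-by-1..16 scheme described above. GLOBAL_CLK_HZ is an assumed
 * value for illustration, not a figure taken from the paper.          */
#include <stdint.h>

#define GLOBAL_CLK_HZ 2000000000u   /* assumed distributed clock */
#define MAX_DIVIDER   16u

static unsigned tile_divider_for(uint32_t target_hz)
{
    /* Pick the smallest divider whose output does not exceed the target. */
    for (unsigned div = 1; div <= MAX_DIVIDER; div++)
        if (GLOBAL_CLK_HZ / div <= target_hz)
            return div;
    return MAX_DIVIDER;             /* floor at the deepest division */
}
```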


Fig. 8. Voltage/frequency islands, clock crossing FIFOs and clock gating.

Fig. 9. Voltage level translation circuit.

V. EXPERIMENTAL RESULTS & PROGRAMMING

The board and packaged processor used for evaluation and testing are shown in Fig. 10. The die is packaged in a 14-layer (5-4-5), 1567-pin LGA package with 970 signal pins, most of which are allocated to the 4 DDR3 channels. A standard Xeon server socket is used to house the package on the board. A standard PC running customized software interfaces with the processor and populates the DDR3 memory with a bootable OS for every core. After reset de-assertion, all 48 cores boot independent OS threads. Silicon has been validated to be fully functional.

The Lower-Upper Symmetric Gauss-Seidel solver (LU) and Block Tri-diagonal solver (BT) benchmarks from the NAS parallel benchmarks [13] were successfully ported to the processor architecture with minimal effort.

LU employs a symmetric successive over-relaxation scheme to solve regular, sparse, block 5×5 lower and upper triangular systems of equations. LU uses a pencil decomposition to assign a column block of a 3-dimensional discretization grid to each core. A 2-dimensional pipeline algorithm is used to propagate a wavefront communication pattern across the cores. BT solves multiple independent systems of block tri-diagonal equations with 5×5 blocks. BT decomposes the problem into a larger number of blocks that are distributed among the cores with a cyclic distribution. The communication patterns are regular and emphasize nearest-neighbor communication as the algorithm sweeps successively over each plane of blocks. It is important to note that these two benchmarks have distinctly different communication patterns. Results for runs on the processor, with cores running at 533 MHz and the mesh at 1 GHz, for the benchmarks running on a 102×102×102 discretization grid are shown in Fig. 11. As expected, the speedup is effectively linear across the range of problems studied.

Measured maximum frequency for both the core and the router as a function of voltage is shown in Fig. 12. Silicon measurements were taken while maintaining a constant case temperature of 50 °C using a 400 W external chiller. As voltage for the router is scaled from 550 mV to 1.34 V, a resulting Fmax increase is observed from 60 MHz to 2.6 GHz. Likewise, as voltage for the IA core is scaled from 730 mV to 1.32 V, Fmax increases from 300 MHz to 1.3 GHz. The offset between the two profiles is explained by the difference in design target points; the core was designed for 1 GHz operation at 1.1 V while the router was designed for 2 GHz operation at the same voltage. The chart in Fig. 13 illustrates the increase in total power as supply voltage is scaled.


Fig. 10. Package and test board.

Fig. 13. Measured chip power versus supply.

Fig. 11. NAS parallel benchmark results with increasing core count.

During data collection, the processor's operating frequency was always set to the Fmax of the supply voltage being measured. Thus, we see a cubic trend in the power increase, since power is proportional to frequency times voltage squared. At 700 mV the processor consumes 25 W, while at 1.28 V it consumes 201 W. During nominal operation we see a 1 GHz core and a 2 GHz router operating at 1.14 V and consuming 125 W of power.
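The cubic trend follows directly from the stated proportionality once frequency tracks the supply. A brief hedged derivation, using the standard dynamic-power form (the constants here are symbolic, not measured values from this work):

```latex
% Dynamic power with activity factor \alpha and effective capacitance C_eff:
P_{\mathrm{dyn}} \approx \alpha\, C_{\mathrm{eff}}\, V^{2} f .
% Running at the Fmax the supply allows, f grows roughly linearly with V
% over the super-threshold portion of Fig. 12, i.e. f \approx k\,V, so
P_{\mathrm{dyn}} \approx \alpha\, C_{\mathrm{eff}}\, k\, V^{3} ,
```

with leakage adding a further voltage-dependent term on top of the cubic dynamic component.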

Fig. 12. Maximum frequency (Fmax) versus supply.

Fig. 14 presents the measured power breakdown at full-power and low-power operation. When the processor is dissipating 125 W, 69% of this power, or 88 W, is attributed to the cores. When both voltage and frequency are reduced and the processor dissipates 25 W, only 21% of this power, or 5.1 W, is due to the cores. At this low-power operating point the memory controllers become the major power consumer, largely because the analog I/O voltages cannot be scaled due to the DDR3 specification.

VI. CONCLUSION

In this paper, we have presented a 48-core IA-32 processor in a 45 nm Hi-K CMOS process that utilizes a 2D-mesh network and 4 DDR3 channels. The processor uses a new message passing protocol and 384 KB of on-die shared memory for increased performance of core-to-core communication. It employs dynamic voltage and frequency scaling with 8 voltage islands and 28 frequency islands for power management. Silicon operates over a wide voltage and frequency range, from 0.7 V and 125 MHz up to 1.3 V and 1.3 GHz. Measured results show a power consumption of 125 W at 50 °C when operating under typical conditions, 1.14 V and 1 GHz. With active DVFS, measured power is reduced by 80% to 25 W at 50 °C. These results demonstrate the feasibility of many-core architectures and of high-performance, energy-efficient computing in the near future.


Fig. 14. Measured full power and low power breakdowns.

ACKNOWLEDGMENT

The authors thank Yatin Hoskote, D. Finan, D. Jenkins, H. Wilson, G. Schrom, F. Paillet, T. Jacob, S. Yada, S. Marella, P. Salihundam, J. Lindemann, T. Apel, K. Henriss, T. Mattson, J. Rattner, J. Schutz, M. Haycock, G. Taylor, and J. Held for their leadership, encouragement, and support, and the entire mask design team for chip layout.

Jason Howard is a senior technical research lead for the Advanced Microprocessor Research team within Intel Labs, Hillsboro, Oregon. During his time with Intel Labs, Howard has worked on projects ranging from high-performance low-power digital building blocks to the 80-Tile TeraFLOPS NoC Processor. His research interests include alternative microprocessor architectures, energy-efficient design techniques, variation-aware and variation-tolerant circuitry, and exascale computing. Jason Howard received the B.S. and M.S. degrees in electrical engineering from Brigham Young University, Provo, UT, in 1998 and 2000, respectively. He joined Intel Corporation in 2000. He has authored and co-authored several papers and has several patents issued and pending.

REFERENCES
[1] S. Vangal et al., "An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS," ISSCC Dig. Tech. Papers, pp. 98-99, Feb. 2007.
[2] G. Moore, "Cramming more components onto integrated circuits," Electronics, vol. 38, no. 8, Apr. 1965.
[3] K. Mistry et al., "A 45 nm logic technology with high-k + metal gate transistors, strained silicon, 9 Cu interconnect layers, 193 nm dry patterning, and 100% Pb-free packaging," IEDM Dig. Tech. Papers, Dec. 2007.
[4] J. Schutz, "A 3.3 V 0.6 μm BiCMOS superscalar microprocessor," ISSCC Dig. Tech. Papers, pp. 202-203, Feb. 1994.
[5] P. Salihundam et al., "A 2 Tb/s 6×4 mesh network with DVFS and 2.3 Tb/s/W router in 45 nm CMOS," in Symp. VLSI Circuits Dig. Tech. Papers, Jun. 2010.
[6] Y. Tamir and H.-C. Chi, "Symmetric crossbar arbiters for VLSI communication switches," IEEE Trans. Parallel Distrib. Syst., vol. 4, no. 1, pp. 13-27, Jan. 1993.
[7] JEDEC Solid State Technology Association, DDR3 SDRAM Specification, JESD79-3B, Apr. 2008.
[8] R. Kumar and G. Hinton, "A family of 45 nm IA processors," ISSCC Dig. Tech. Papers, pp. 58-59, Feb. 2009.
[9] SHMEM Technical Note for C, Cray Research, Inc., SG-2516 2.3, 1994.
[10] L. Smith and M. Bull, "Development of hybrid mode MPI/OpenMP applications," Scientific Programming, vol. 9, no. 2-3, pp. 83-98, 2001.
[11] T. Mattson et al., "The Intel 48-core single-chip cloud computer (SCC) processor: Programmer's view," in Proc. Int. Conf. High Performance Computing, 2010.
[12] G. Schrom, F. Paillet, and J. Hahn, "A 60 MHz 50 W fine-grain package-integrated VR powering a CPU from 3.3 V," in Proc. Applied Power Electronics Conf., 2010.
[13] D. H. Bailey et al., "The NAS parallel benchmarks," Int. J. Supercomputer Applications, vol. 5, no. 3, pp. 63-73, 1991.

Saurabh Dighe received his M.S. degree in computer engineering from the University of Minnesota, Minneapolis, in 2003. He was with Intel Corporation, Santa Clara, working on front-end logic and validation methodologies for the Itanium processor and the Core processor design team. Currently he is a member of the Advanced Microprocessor Research team at Intel Labs, Oregon, involved in the definition, implementation, and validation of future tera-scale computing technologies such as the Intel Teraflops processor and the 48-core IA-32 message passing processor. His research interests are in the area of energy-efficient computing and low-power high-performance circuits.


Sriram R. Vangal (S'90-M'98) received the B.S. degree from Bangalore University, India, the M.S. degree from the University of Nebraska, Lincoln, and the Ph.D. degree from Linköping University, Sweden, all in electrical engineering. He is currently a Principal Research Scientist with Advanced Microprocessor Research, Intel Labs. Sriram was the technical lead for the advanced prototype team that designed the industry's first single-chip 80-core, sub-100 W Polaris TeraFLOPS processor (2006) and co-led the development of the 48-core Rock Creek prototype (2009). His research interests are in the area of low-power high-performance circuits, power-aware computing, and NoC architectures. He has published 20 journal and conference papers and has 16 issued patents with 8 pending.


Gregory Ruhl (M'07) received the B.S. degree in computer engineering and the M.S. degree in electrical and computer engineering from the Georgia Institute of Technology, Atlanta, in 1998 and 1999, respectively. He joined Intel Corporation, Hillsboro, OR, in 1999 as a part of the Rotation Engineering Program, where he worked on the PCI-X I/O switch, Gigabit Ethernet validation, and clocking circuit and test automation research projects. After completing the REP program, he joined Intel's Circuits Research Lab, where he worked on design, research, and validation on a variety of topics ranging from SRAMs and signaling to terascale computing. In 2009, Greg became a part of the Microprocessor Research Lab within Intel Labs, where he has since been designing and working on tera- and exa-scale research silicon and near-threshold voltage computing projects.

Michael Konow manages an engineering team working on research and prototyping of future processor architectures within Intel Labs, Braunschweig, Germany. During his time with Intel Labs, Michael led the development of the first in-socket FPGA prototype of an x86 processor and has worked on several FPGA and silicon prototyping projects. His research interests include future microprocessor architectures and FPGA prototyping technologies. Michael received his diploma degree in electrical engineering from the University of Braunschweig in 1996. Since then he has worked on the development of integrated circuits for various companies and a wide range of applications. He joined Intel in 2000 and Intel Labs in 2005.

Nitin Borkar received the M.Sc. degree in physics from the University of Bombay, India, in 1982, and the M.S.E.E. degree from Louisiana State University in 1985. He joined Intel Corporation in Portland, OR, in 1986, where he worked on the design of the i960 family of embedded microcontrollers. In 1990, he joined the i486DX2 microprocessor design team and led the design and the performance verification program. After successful completion of the i486DX2 development, he worked on high-speed signaling technology for the Teraflop machine. He now leads the prototype design team in the Microprocessor & Programming Research Laboratory, developing novel technologies in the high-performance low-power circuit areas and applying them toward future computing and systems research.

Michael Riepen is a senior technical research engineer in the Advanced Processor Prototyping Team within Intel Labs, Braunschweig, Germany. During his time in Intel Labs, Michael has worked on FPGA prototyping of processor architectures as well as on efficient many-core pre-silicon validation environments. His research interests include exascale computing, many-core programmability, and efficient validation methodologies. Michael Riepen received the Master of Computer Science degree from the University of Applied Sciences Wedel, Germany, in 1999. He joined Intel Corporation in 2000. He has authored and co-authored several papers and has several patents issued and pending.

Shailendra Jain received the B.Tech. degree in electronics engineering from Devi Ahilya Vishwavidyalaya, Indore, India, in 1999, and the M.Tech. degree in microelectronics and VLSI design from IIT Madras, India, in 2001. With Intel since 2004, he is currently a technical research lead at the Bangalore Design Lab of Intel Labs, Bangalore, India. His research interests include near-threshold-voltage digital circuit design, energy-efficient design techniques for TeraFLOPS NoC processors and floating-point arithmetic units, and many-core advanced rapid prototyping. He has co-authored ten papers in these areas.

Matthias Gries joined Intel Labs at Braunschweig, Germany, in 2007, where he is working on architectures and design methods for memory subsystems. Before that, he spent three years at Infineon Technologies in Munich, Germany, refining micro-architectures for network applications in the Corporate Research and Communication Solutions departments. He was a post-doctoral researcher at the University of California, Berkeley, in the Computer-Aided Design group, implementing design methods for application-specific programmable processors from 2002 to 2004. He received the Doctor of Technical Sciences degree from the Swiss Federal Institute of Technology (ETH) Zurich in 2001 and the Dipl.-Ing. degree in electrical engineering from the Technical University Hamburg-Harburg, Germany, in 1996. His interests include architectures, methods, and tools for developing x86 platforms, resource management, and MP-SoCs.

Vasantha Erraguntla received her B.E. in electrical engineering from Osmania University, India, and an M.S. in computer engineering from the University of Louisiana. She joined Intel in 1991 to be a part of the Teraflop machine design team and worked on its high-speed router technology. Since June 2004, Vasantha has been heading Intel Labs' Bangalore Design Lab to facilitate the world's first programmable Terascale processor and the 48-IA-core Single-Chip Cloud Computer. Vasantha has co-authored over 13 IEEE journal and conference papers and holds 3 patents with 2 pending. She is also a member of the IEEE. She served on the organizing committee of the 2008 and 2009 International Symposium on Low Power Electronics and Design (ISLPED) and on the Technical Program Committee of ISLPED 2007 and of the Asia Solid-State Circuits Conference (A-SSCC) in 2008 and 2009. She is also a Technical Program Committee member for energy-efficient digital design for ISSCC 2010 and ISSCC 2011, and is serving on the Organizing Committee for the VLSI Design Conference 2011.

Guido Droege received his Diploma in electrical engineering from the Technical University Braunschweig, Germany, in 1992 and the Ph.D. degree in 1997. His academic work focused on analog circuit design automation. After graduation, Droege worked at an ASIC company, Sican GmbH, and later at Infineon Technologies. He designed RF circuits for telecommunication and worked on MEMS technology for automotive applications. In 2001 he joined Intel Corporation, where he started with high-speed interface designs for optical communication. As part of Intel Labs he was responsible for the analog front end of several silicon prototype designs. Currently, he works in the area of high-bandwidth memory research.


Tor Lund-Larsen is the Engineering Manager for the Advanced Memory Research team within Intel Labs Germany in Braunschweig, Germany. During his time with Intel Labs, Tor has worked on projects ranging from FPGA-based prototypes for multi-radio and adaptive clocking to analog clocking concepts and memory controllers. His research interests include multi-level memory, computation-in-memory for many-core architectures, resiliency, and high-bandwidth memory. Tor Lund-Larsen received M.B.A. and M.S.E. degrees from the Engineering and Manufacturing Management program at the University of Washington, Seattle, in 1996. He joined Intel Corporation in 1997.

Vivek K. De is an Intel Fellow and director of Circuit Technology Research in Intel Labs. In his current role, De provides strategic direction for future circuit technologies and is responsible for aligning Intel's circuit research with technology scaling challenges. De received his bachelor's degree in electrical engineering from the Indian Institute of Technology, Madras, India, in 1985 and his master's degree in electrical engineering from Duke University in 1986. He received a Ph.D. in electrical engineering from Rensselaer Polytechnic Institute in 1992. De has published more than 185 technical papers and holds 169 patents with 33 patents pending.

Sebastian Steibl is the Director of Intel Labs Braunschweig in Germany and leads a team of researchers and engineers developing technologies ranging from next-generation Intel CPU architectures, high-bandwidth memory, and memory architectures to emulation and FPGA many-core prototyping methodology. His research interests include on-die message passing and embedded many-core microprocessor architectures. Sebastian Steibl has a degree in electrical engineering from the Technical University of Braunschweig and holds three patents.

Rob Van Der Wijngaart is a senior software engineer in the Developer Products Division of Intel's Software and Services Group. At Intel he has worked on parallel programming projects ranging from benchmark development, programming model design and implementation, and algorithmic research to fine-grain power management. He developed the first application to break the TeraFLOP-on-a-chip barrier on the 80-Tile TeraFLOPS NoC Processor. Van der Wijngaart received an M.S. degree in applied mathematics from Delft University of Technology, the Netherlands, and a Ph.D. degree in mechanical engineering from Stanford University, CA, in 1982 and 1989, respectively. He joined Intel in 2005. Before that he worked at NASA Ames Research Center for 12 years, focusing on high performance computing. He was one of the implementers of the NAS Parallel Benchmarks.

Shekhar Borkar is an Intel Fellow, an IEEE Fellow, and director of Exascale Technologies in Intel Labs. He received B.S. and M.S. degrees in physics from the University of Bombay, India, in 1979, and the M.S. degree in electrical engineering from the University of Notre Dame, Notre Dame, IN, in 1981, and then joined Intel Corporation, where he has worked on the design of the 8051 family of microcontrollers, the iWarp multi-computer, and high-speed signaling technology for Intel supercomputers. Shekhar's research interests are low-power, high-performance digital circuits and high-speed signaling. He has published over 100 articles and holds 60 patents.
