
194

IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 47, NO. 1, JANUARY 2012

A Fully Integrated Multi-CPU, Processor Graphics, and Memory Controller 32-nm Processor
Marcelo Yuffe, Moty Mehalel, Ernest Knoll, Joseph Shor, Senior Member, IEEE, Tsvika Kurts, Eran Altshuler, Eyal Fayneh, Kosta Luria, and Michael Zelikson

Abstract—This paper describes the second-generation Intel Core processor, a 32-nm monolithic die integrating four IA cores, a processor graphics, and a memory controller. Special attention is given to the circuit design challenges associated with this kind of integration. The paper describes the chip floor plan, the power delivery network, energy conservation techniques, the clock generation and distribution, the on-die thermal sensors, and a novel debug port.

Index Terms—Clocking, Intel second-generation core, low Vccmin, modularity, power gates, thermal sensors.

I. INTRODUCTION

THE desktop and mobile computer marketplace is constantly looking for system performance improvements, lower power dissipation density, and better form factors for miniaturization; these three vectors seem to contradict each other. The 32-nm Second Generation Intel Core (SGIC) processor tackles this challenge by integrating up to four high-performance Intel Architecture (IA) cores, a power/performance-optimized processor graphics (PG), and memory and PCIe controllers in the same die. The chip is manufactured using Intel's 32-nm process, which incorporates the second generation of Intel's high-k metal gates for improved leakage current control; the process also provides nine copper interconnect metal layers that were well exploited for top-level interconnect as well as for robust power delivery. The SGIC architecture block diagram is shown in Fig. 1, and the floor plan of the four-IA-core version is shown in Fig. 2. The SGIC IA core implements an improved branch prediction algorithm, a micro-operation (Uop) cache, a floating-point advanced vector extension (AVX), a second load port in the L1 cache, and bigger register files in the out-of-order part of the machine; all of these architecture improvements boost the IA core performance without increasing the thermal power dissipation envelope or the average power consumption (to preserve battery life in mobile systems). Although these architectural advances are beyond the scope of this paper, the Intel AVX, which enhanced the SSE 128-bit vectors into 256 b, is worth mentioning (more information about the SGIC architectural features can be found in [1]). The AVX

Fig. 1. SGIC block diagram.

Fig. 2. SGIC floorplan, power planes, and choppability axes.

Manuscript received April 28, 2011; revised June 29, 2011; accepted July 29, 2011. Date of publication October 13, 2011; date of current version December 23, 2011. This paper was approved by Guest Editor Alice Wang. The authors are with Intel Corporation, Haifa 31015, Israel (e-mail: marcelo.yuffe@intel.com). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/JSSC.2011.2167814

architecture supports three-operand syntax, which allows more efficient coding by the compiler. Additional instructions were added to simplify auto-vectorization of high-level languages to assembly by the compiler. The SGIC architecture added instructions which support single- and double-precision floating-point data types. The additional state needed for the growth of the 16 registers to 256 b is supported by new xSave/xRestor instructions that were designed to support additional future extensions of the Intel 64 architecture. The CPUs and PG share the same 8-MB level-3 cache (L3$) memory. The data flow is optimized by a high-performance on-die interconnect fabric (called the ring) that connects the CPUs, the PG, the L3 cache, and the system agent (SA) unit. The SA houses a 1600-MT/s dual-channel DDR3 memory controller, a 20-lane PCIe gen2 controller, a two-parallel-pipe display engine, the power management control unit, and the testability logic. An on-die PROM is used for configurability and yield optimization.

0018-9200/$26.00 © 2011 IEEE

II. MODULAR FLOOR PLAN

From the beginning of the project, the SGIC was conceived as a modular design that would allow the integration of different blocks into a single chip. The SGIC team opted to divide the chip into several modules: IA core, SA, PG, L3 cache, and I/O; the modules were designed independently and assembled together by a dedicated full-chip team that took care of the integration of the different modules and the full-chip validation aspects. The PG module is of special interest because it was designed using completely different design methodologies and CAD tools; this module even used a completely different standard cell library and a separate power delivery network. The key to the smooth integration of this block into the rest of the chip was the ring bus, which provides a common protocol for all of the modules of the chip, allowing resource sharing between the different modules (for example, the L3$ space can be accessed by any of the modules). The ring protocol and the ring distributed controller take care of the ring traffic to minimize the performance impact of data traffic congestion. The design team also took advantage of the common interconnect protocol and physical layer provided by the ring to bridge between the different design methodologies used for the different modules; this was especially important for the integration of the PG. The modular ring interconnect enables the four-core die to be easily converted into a two-core die by chopping out two cores and two L3 cache modules, as described in Fig. 2. Additional optimizations can be done by reducing the number of execution units of the PG or by reducing the L3 cache size.
This modular floor plan technique converts the tedious and time-consuming task of creating die variations into a simple database management exercise, considerably reducing the time it takes to bring the different flavors of the product to the market. The SGIC was implemented in three different flavors: the die size of the i7 2820QM model (four IA cores, 8-MB L3$, 12-EU PG) is 216 mm², the die size of the i7 2620M model (two IA cores, 4-MB L3$, 12-EU PG) is 149 mm², and the die size of the i3 2100 model (two IA cores, 3-MB L3$, 6-EU PG) is 130 mm².

III. POWER DELIVERY NETWORK AND EMBEDDED POWER GATES

Fig. 3. SGIC core IREM image (a) when the processor is idle and (b) during deep sleep.

Fig. 4. Residual gated voltage in C6 state with (strong) and without (weak) negative bias.

Although the SGIC implements Intel SpeedStep technology for minimizing the power consumed by the CPU, the product requirements clearly indicated that efficient power gating is a must in order to meet the aggressive average power goals. The definition of the power delivery network (PDN) topology and, in particular, the implementation of power gates in the SGIC were based on several criteria: 1) co-optimize the quality of power delivery on both die and package levels; 2) enable flexible power management, i.e., support fine granularity for gated and ungated regions and support different power states; 3) minimize the energy penalty associated with power gate switching; and 4) minimize the amount of switching noise injected by the power gate switching.

The configuration chosen based on the above criteria comprises P-type embedded power gates (EPGs), i.e., power switches that reside inside the gated region, forming a grid of power transistors connected among themselves by a gated power grid. The total width of the EPGs is approximately 2 m, a fraction of the accumulated width of the SNB IA core transistors. Due to the dense and very regular layout, the area consumed by the power gates is 3.8% of the core area. Since a nongated version of the SGIC core is not available, it is practically impossible to quantify the effect of the power gates on timing; however, the product has met all timing/performance goals. An ungated power grid, which spans the whole core, shares the two top metal layer resources with the gated PDN. Fig. 3 shows an infrared emission microscopy (IREM) photo of the SGIC core for two power states: idle, C1 [Fig. 3(a)], and deep sleep, C6 [Fig. 3(b)]. As can be seen from Fig. 3(b), in the C6 state, most of the supply voltage falls across the power gates, resulting in the visible EPG grid (thin vertical lines spread over the core). Such a PDN topology enables allocation of selected fubs or individual circuits to either the gated or the ungated supply. Bright spots in Fig. 3(b) represent circuitry that is fed by the ungated power supply, immersed in a region of gated logic: the control hub that supports snoops during the C6 state, and the PLL. Small light dots in Fig. 3(b) represent individual ungated


Fig. 5. EPG gate biasing circuit.

Fig. 6. Voltage dependence of the supply path resistance. The simulation temperature was adjusted to account for self-heating.

circuitry like thermal probes and ungated repeaters. Such local coexistence of gated and ungated logic is used more extensively in the System Agent, where all logic blocks associated with PCIe are gated individually. The SGIC PDN topology enables minimization of the EPG switching losses: 1) all on-package decoupling capacitors are connected to the ungated power supply, which does not switch, and 2) since the power gates are immersed in the gated power grid, it is possible to discharge most of the energy accumulated at the EPG switching nodes into the gated power supply. In order to increase the efficiency of power gating, the gate-source voltage of the pMOS switches was negatively biased (i.e., the gate is driven by a voltage higher than the source voltage), driving the transistors into a deeper subthreshold regime. The biasing circuit is situated in the ring area, near the core, and the bias voltage is distributed across the whole core using a dedicated low-resistance grid. Fig. 4 presents the results of a measurement of the residual gated voltage during the C6 state. The principal schematic of the biasing circuit is shown in Fig. 5: VCCA is an on-die high-voltage power supply and VCCB is the voltage driven into the gates of the power

gating transistors. The switching speed of this circuit is tuned to control the power gate switching strength in order to minimize the switching noise injected by this operation. The core PDN performance was analyzed with a commercially available grid simulator for several different real stresses, including a power virus and the idle state. Simulation results for the idle state were compared to corresponding measured data, as shown in Fig. 6. Die power dissipation in the C6 state was measured with enabled and disabled embedded power gates; the difference between the two measurements yields the corresponding power savings. The same was performed in the System Agent in order to quantify the effectiveness of the PCIe gating. By measuring the part power dissipation when the power gates are enabled and when they are disabled, it is possible to measure the average power merit of the power gates. This was done on a representative set of SGIC units at two different temperatures (110 °C and 50 °C) and at two different power supply voltages (0.88 V and 1.10 V); the measured power gating savings translate to savings of more than 90% of the IA core power dissipation.
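The enabled/disabled measurement procedure described above reduces to simple arithmetic. The sketch below mirrors it in Python; the wattage figures are invented placeholders, since the paper does not publish per-corner numbers:

```python
def power_gate_merit(p_gates_disabled_w, p_gates_enabled_w):
    """Average-power merit of the embedded power gates from two C6-state
    measurements of the same part: EPGs forced off vs. operating normally."""
    saved_w = p_gates_disabled_w - p_gates_enabled_w
    return saved_w, saved_w / p_gates_disabled_w

# Hypothetical corner sweep mirroring the measurement conditions in the text:
# (temperature in deg C, supply in V) -> (power, gates disabled; power, gates enabled).
corners = {
    (110, 0.88): (4.1, 1.9),
    (110, 1.10): (7.6, 3.2),
    (50, 0.88): (1.3, 0.6),
    (50, 1.10): (2.4, 1.0),
}
merits = {corner: power_gate_merit(*powers) for corner, powers in corners.items()}
```

Repeating the subtraction over a representative set of units and corners, as the text describes, averages out the part-to-part leakage spread.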


Fig. 7. RF shared strength control pMOS devices.

IV. VCCMIN MINIMIZATION

As shown in Fig. 2, in the SGIC, the cores, the ring, and the L3$ share the same power plane. The key challenge in this scheme is bringing all these components to meet comparable Vccmin levels. This approach guarantees that the overall power consumption at the full-chip level will be minimal at low power states, when the chip is running at the lowest possible voltage needed to support a specific operating frequency. In addition, this approach eliminates any redundant design overkill in one of the components. For example, if the L3$ were limited to run at a much higher Vccmin than the core register files (RFs), then the RFs could have been designed to a higher Vccmin with no impact on the overall Vccmin at the full-chip level, but with area benefits by using smaller cells, and with power benefits by using low-leakage transistors. There are three components that limit Vccmin: logic paths, RFs, and small-signal arrays (SSAs). The L3$ is the largest SSA, as it includes the largest number of devices, most of them at minimal sizing. As a result, the random statistical variations in the L3$ are very significant, so the L3$ was the biggest design and process challenge to make it run at the same low power supply voltages as the core logic. Previous processors solved this problem by connecting the L3$ to a separate, higher voltage power plane; however, this approach considerably increases the power dissipated by the L3 cache itself, and taking into account that the SGIC implements 3, 4, or 8 MB of L3 cache capacity (depending on the chip configuration), the power dissipated by the cache memory accounts for a big portion of the overall power consumption of the die. Enabling the L3$ to run at low Vccmin has saved 1.2-W average power for the SGIC quad-core die. The other arrays, like the RFs and smaller SSAs, required design focus and attention as well. These arrays run at higher bandwidth than the L3$, so the timing constraints are tighter.
Several circuit and logic design techniques have been developed to minimize the Vccmin of the SSAs and the RFs of the chip, to bring them to a lower level than the core logic.
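To see why a tens-of-millions-of-bits cache of minimum-sized cells dominates the Vccmin statistics, a toy analytic model helps: if each cell fails to write when its random V_T shift exceeds a margin that grows with Vcc, the array Vccmin rises roughly with the logarithm of the bit count. The sketch below is not the authors' simulation algorithm; the margin slope, the V_T sigma, and the yield target are invented for illustration, and a Gaussian-tail formula stands in for their transient ac simulations:

```python
import math

def gaussian_tail(x):
    """P(standard normal variable > x)."""
    return 0.5 * math.erfc(x / math.sqrt(2))

def cell_fail_prob(vcc, vt_sigma=0.030, margin_per_volt=0.35, vcc_ref=0.6):
    """Write-failure probability of one cell: it fails when its random V_T
    shift exceeds a write margin that grows with the supply voltage.
    All parameter values here are illustrative, not process data."""
    margin = margin_per_volt * (vcc - vcc_ref)
    if margin <= 0:
        return 1.0
    return gaussian_tail(margin / vt_sigma)

def array_vccmin(n_cells, yield_target=0.999, lo=0.6, hi=1.2, steps=600):
    """Lowest Vcc (1-mV grid) at which the whole array meets the yield target."""
    for i in range(steps + 1):
        vcc = lo + (hi - lo) * i / steps
        p_all_cells_ok = (1.0 - cell_fail_prob(vcc)) ** n_cells
        if p_all_cells_ok >= yield_target:
            return vcc
    return None
```

With these made-up numbers, an 8-MB cache (about 67 million bits) needs a noticeably higher supply than a few-kilobit register file, which is the gap the shared-pMOS programming technique and the accurate presilicon statistical model are meant to close.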

Fig. 8. Vccmin improvement after applying SGIC Vccmin reduction circuits.

Fig. 7 illustrates one of these techniques in the RFs. Random fabrication variations may cause RF write-ability degradation at low voltages; this technique weakens the memory-cell pull-up device effective strength, solving the low-voltage write-ability issue caused by a too-strong pMOS device in the memory cells. The effective size of the shared pMOS is set during production testing by enabling any combination of the three parallel transistors. Similar techniques have been developed for the L3$ and other SSAs. Fig. 8 shows the Vccmin distribution of the baseline and its improvement in the SGIC. A key component in the success in meeting the aggressive Vccmin target is accurate modeling at the presilicon stage. Large arrays, as opposed to logic paths, cannot be fixed at postsilicon stages because of their large area and high density. An accurate statistical simulation algorithm has been developed to cover all of the failure modes of the arrays: write-ability, read stability, and data retention (soft errors may become an important Vccmin limiter if not taken into account properly, but this limitation can be easily solved by protecting the few problematic state elements with parity, by using an error correction code, or by simply increasing the area of those state elements). The input data to the model include the variations of the main parameters of the transistors, like threshold voltage, effective channel length, mobility, and velocity saturation. All of the failure modes are simulated in


Fig. 9. Silicon versus simulation Vccmin results at various modes and programming levels.

Fig. 10. SGIC final Vccmin results of the L3 SSA, RFs, and random logic.

transient ac mode to accurately model the real activation conditions of the cells. This approach is different from the traditional algorithm of modeling read stability by using static noise margin analysis [2]. Fig. 9 shows the correlation between silicon results and the simulation model. The x-axis includes various operation modes of the L3. WRpgm* refers to the write operation at different programming levels of the shared pMOS device (see Fig. 7). In production, the optimal setting is used (WRpgm2 in Fig. 9). AC_Ret stands for retention failure during a write operation; this failure mechanism is related to the voltage droop that the shared pMOS creates on the memory array power supply. SLEEP_Ret is the retention Vccmin when the L3 sleep transistor is enabled for power reduction. The final outcome of this work is shown in Fig. 10. The obtained Vccmin of the three components of the core/ring is equalized and there is no clear limiter. The L3 result is obtained when the optimal programming setting is used, with the redundancy recovery mechanism applied and the chip running under normal operating conditions with all the power saving mechanisms enabled. The logic part has no special circuit techniques, and the result reflects the speed limitation of the critical paths at low-voltage operation, after post-silicon speed-up

at the minimum operating voltage/minimum operating frequency point.

V. HIGH-BANDWIDTH LOW-LATENCY CACHE ACCESS THROUGH THE RING

The ring provides the common platform used to connect the different modules (CPUs, shared L3$s, PG, SA). To maximize performance, high-bandwidth, low-latency cache access is required. Cache access messages are synchronously staged by high-phase transparent latches in the ring stops and by low-phase latches at clock domain crossings (Fig. 11). Instantaneous clock skew (systematic skew and random jitter) degrades the timing accuracy and limits the ring frequency. Synchronization buffers can allow a higher frequency at the cost of added latency that affects overall performance. The clocking scheme (described in Section VI) provides the skew and jitter required by the ring performance. If the instantaneous skew is less than the skew budget, the data propagate through all latches in transparency. A double de-skew latch (Fig. 12) provides robust cross-domain race margin without extra latency. The number of wires was locally doubled to halve the switching frequency, allowing a larger skew between the clock domains; thanks to the locality, the global routing resource accounting was not impacted and


Fig. 11. Ring path: phase-one (PH1) ring stop and phase-two (PH2) de-skew latches (single cross-domain example).

Fig. 12. Cross clock domain path.

therefore the die size was not affected. During an even cycle, the even latch and the even side of the mux-latch are transparent; the same holds for the odd cycle. The local traffic (latch to mux-latch) is at half frequency, thus improving the race margin without affecting the max delay (the mux-latch output is at full speed). A Valid signal is sent from a similar structure to the next ring stop a few gate delays before the data; it enables the clock rise that opens the latches. The Valid rising edge propagates in transparency through the clock path; a late falling edge may cause sampling of undetermined data, which is not further used since it is not valid. Sharing the latch and the arbitration in a mux-latch reduces the data propagation delay. The Valid signal participates in the ring stop arbitration between the passing message and a new message (Request) through a mux-latch control during the transparency window, as seen in Fig. 12. The propagation of the Valid signal through the local clock network during the transparency window allows it to be generated no earlier than the data, thus reducing latency. Static timing tools model the path through the latches and through the local clock drivers, from data or clock gate inputs to the latch output, as one transparency path consisting of several stages. A multicycle path is modeled from the latch opening, through several ring stops, to the receiver off the ring or to a latch capture. The clock-domain information is preserved for pruning and for max-skew-aware margin calculations. This accurate, non-worst-case timing model allows the low-latency ring implementation. The skew budget is only two thirds of a phase, allowing a shorter transparency window for improved race

Fig. 13. SGIC clocking scheme.

immunity. This is twice the real skew, to avoid nonreproducible cross-domain path failures due to PLL jitter.

VI. CLOCK GENERATION AND DISTRIBUTION

The clocking scheme shown in Fig. 13 employs 13 PLLs to generate the clocks for the different domains [3], [4]. The IA cores, the L3 cache, and the ring (which share the same power plane) run at the same frequency ([5] provides an excellent discussion of the architecture tradeoffs involved in core frequency selection). In order to minimize the skew and power in the clock distribution, each slice (CPU, L3 cache, and ring stop) uses its own PLL; the RCLK PLL assures low clock skew among the reference clocks of the entire die's PLLs despite the different operating voltages of the different clock networks. A


Fig. 14. (a) LNPLL VCO block diagram. (b) VCO tuning network.

low-jitter PLL (LNPLL) design enables the ring data flow with minimal latency. The PLL random jitter is mainly determined by the VCO quality. The LNPLL VCO [Fig. 14(a)] is a three-CMOS-stage, full-swing ring oscillator. Each VCO stage is loaded by two tuning networks [based on varactors, see Fig. 14(b)] that change their loading under the control of a dc input voltage. The bottom load is controlled by the PLL control voltage, thus adjusting the PLL frequency and phase. The upper load is controlled by a PTAT circuit. The temperature-inverse-proportional loading of the VCO stages stabilizes the VCO frequency over temperature changes; this assures that the PLL remains locked at all of the operating temperatures despite the relatively low gain of the VCO. Metal capacitors are used in series with the varactors to separate the dc control voltage from the ac signal at the VCO stage output. Resistors are used to isolate the output of the VCO stage from the low-impedance dc source, either the PLL loop filter or the PTAT circuit. Due to the limited capacitance ratio of the varactors, the required frequency range is covered by five overlapping frequency bands. The bands are implemented by five parallel switched varactor blocks. Within a band, the frequency ratio is better than 1.35, while the overall VCO frequency ratio is better than 2.5. A banding finite-state machine (FSM) is used to select the appropriate VCO band for the required clock frequency. The banding FSM can operate in two modes: automatic-band-select (ABS) mode and open-loop frequency mapping (OLFM) mode. In the ABS mode, the banding is determined by the FSM during the closed-loop locking process. In the OLFM mode, the VCO band limits are measured at open loop, and then the required PLL output frequencies are mapped to a corresponding VCO band by a lookup table. The OLFM mode is used by fast-relock frequency-change clock generators to support Intel SpeedStep technology.
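In OLFM mode, the mapping step is essentially a table search over the measured band edges. A sketch follows, with invented band limits chosen so that each band spans a ratio of at least 1.35 and adjacent bands overlap, as the text requires; the guard margin and all numbers are illustrative, not SGIC calibration data:

```python
# Hypothetical per-part VCO band limits in GHz, measured once at open loop.
BAND_LIMITS = [(0.8, 1.2), (1.1, 1.6), (1.5, 2.2), (2.0, 2.9), (2.7, 3.8)]

def select_band(f_target_ghz, limits=BAND_LIMITS, guard=0.02):
    """Map a required PLL output frequency to the lowest VCO band that
    covers it with a small guard margin at both band edges."""
    for i, (f_lo, f_hi) in enumerate(limits):
        if f_lo * (1 + guard) <= f_target_ghz <= f_hi * (1 - guard):
            return i
    raise ValueError(f"{f_target_ghz} GHz is outside the calibrated VCO range")

def build_lookup_table(frequencies_ghz, limits=BAND_LIMITS):
    """Precompute the frequency-to-band table used for fast relock."""
    return {f: select_band(f, limits) for f in frequencies_ghz}
```

A SpeedStep-style frequency ladder can then be resolved once, e.g., `build_lookup_table([1.2, 1.6, 2.4, 2.8, 3.3])`, so that a frequency change never has to wait for a closed-loop band search.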
The PTAT temperature compensation limits the open-loop VCO frequency change to better than 4% over a 120 °C temperature change. The CMOS ring oscillator VCO provides good random noise performance but has a poor power supply rejection ratio (PSRR). In order to achieve the required performance, the PLL

is supplied by an on-die low-noise linear voltage regulator, as reported in [6], with better than 40-dB PSRR and less than 50-μV random voltage noise over the whole frequency spectrum; this power supply source is completely separated from the CPU main power supply to avoid noise injection from the digital parts of the die (and the power gates) into the PLL power supply rail. The measured rms long-term jitter is better than 2 ps for all bands and all frequency ranges; the period jitter is less than 0.2 ps. As an example, the measured phase noise, integrated from 1.5 MHz to 1 GHz, for a 3.3-GHz clock signal is presented in Fig. 15. The clock distribution within one slice (CPU and adjacent L3 and ring stop) is shown in Fig. 16. The slice PLL generates the clock that is distributed through vertical spines, two in the L3 cache and three within the IA core. The spines drive global clock islands, allowing fine-granularity clock gating for power savings. The skew within a slice is kept low using clock compensators, controlled by dedicated state machines. The slice PLL closes the loop through the L0 spine; thus, the L0 spine is deskewed to the PLL reference and can act as the slice timing reference. The phases of the two adjacent spines are compared with the timing reference, and the compensators are controlled to practically eliminate the skew due to within-die variation. The C0 and L1 spines are deskewed to L0, then C1 to C0, and finally C2 is deskewed to C1. The spine-to-spine maximum skew is 10 ps, while the overall slice max skew is 16 ps. The scope image in Fig. 17 was probed on a specific die between two adjacent spines with a skew of 1.4 ps. The overall slice clock distribution power is 600 mW at 1 V and 3.3 GHz.

VII. THERMAL

One of the important functions in a processor is temperature control. When the chip gets too hot, its frequency needs to be lowered in order to allow it to cool down; this process is called throttling.
It is important to have an accurate thermal sensor to provide this temperature information, since the sensing accuracy directly influences performance in this case. In addition, the thermal sensor provides information for fan regulation in the


Fig. 15. Measured phase noise.

Fig. 16. Slice clock distribution.
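The deskew sequence (C0 and L1 to L0, then C1 to C0, then C2 to C1) can be imitated numerically to show how each pairwise comparison bounds adjacent-spine skew while residuals accumulate along the chain. The starting phase offsets and the ±3-ps compensator residual below are invented for illustration, not measured SGIC values:

```python
import random

def deskew_chain(raw_phase_ps, order, residual_ps=3.0, seed=7):
    """Align each spine to its reference in sequence; every phase comparison
    leaves at most one compensator step (residual_ps) of error behind."""
    rng = random.Random(seed)
    phase = dict(raw_phase_ps)
    phase["L0"] = 0.0  # L0 closes the PLL loop and is the slice timing reference
    for spine, ref in order:
        phase[spine] = phase[ref] + rng.uniform(-residual_ps, residual_ps)
    return phase

# Invented within-die phase offsets (ps, relative to L0) before compensation.
raw = {"L0": 0.0, "L1": 18.0, "C0": -22.0, "C1": 30.0, "C2": -15.0}
order = [("C0", "L0"), ("L1", "L0"), ("C1", "C0"), ("C2", "C1")]
after = deskew_chain(raw, order)
# Each compensated pair now differs by at most residual_ps, even though the
# error of C2 relative to L0 can accumulate over three chained comparisons.
adjacent_skews = [abs(after[a] - after[b]) for a, b in order]
```

The chained structure explains why the reported overall slice skew (16 ps) is larger than the adjacent spine-to-spine skew (10 ps): residuals add up with distance from the L0 reference.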

temperature range of 50 °C to 100 °C. There is also a fail-safe catastrophic function, which shuts down the chip in the event that the temperature spikes significantly above the throttle point. The SGIC has two types of thermal sensors. The first is a diode-based thermal sensor, described in [7], that compares the diode voltage (which has a negative temperature coefficient) to a reference voltage to output the temperature. This sensor functions over a very large temperature range of operation (−25 °C to 150 °C), providing information for throttling, the catastrophic function, and fan regulation. This sensor has been used in many generations of Intel processors. However, the diode-based sensor is rather large (83 000 μm²), so there is only one such sensor per core. Our simulations and silicon studies have determined that, during different applications, different areas of the core can get hot. In order to measure these

hot-spots, we have introduced a miniaturized CMOS-based thermal sensor [8]. This sensor has a substantially reduced area (5100 μm²) compared with the diode sensor (Fig. 18), but has a more limited accurate temperature range when single-point calibration is used (due to within-die variation, every sensor must be calibrated independently). It is shaped as a tall, thin block, enabling it to be placed into the repeater channels, which are normally used for cross-chip signaling buffers. These channels are very heavily populated with higher level metals but contain very few transistors and lower metals. Therefore, the placement of the CMOS sensors in these channels makes them essentially free (since the sensors are fed by an independent low-noise power supply source, the area needed for this dedicated power network should be taken into account while planning the global chip routing resources). The CMOS sensor allows the SGIC to throttle accurately based on localized hot-spot sensing, and it is also used for DFT, real-time measurements, and burn-in. The CMOS sensors are used heavily in burn-in, when the temperature gradients on the chip become very high; the sensors ensure that the localized temperature will not exceed the intended burn-in temperature. As described in [7], the CMOS sensor output is proportional to the transistor threshold voltage and the mobility; silicon measurements confirm this. For the reader's convenience, the CMOS sensor is explained here again to allow better interpretation of the silicon results presented in this paper.
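Because the count falls roughly linearly with temperature over the accurate range, single-point calibration amounts to storing one (count, temperature) pair per sensor plus a process-characterized slope. A sketch with invented numbers follows; neither the 12 000-count reading nor the −40 counts/°C slope comes from the paper:

```python
def make_converter(count_ref, t_ref_c, slope_counts_per_c):
    """Single-point calibration: each sensor stores its own reference pair,
    which cancels its within-die offset; the (assumed linear) slope is
    characterized once per process and shared by all sensors."""
    def to_celsius(count):
        return t_ref_c + (count - count_ref) / slope_counts_per_c
    return to_celsius

# One converter per on-die sensor, each calibrated at a known 50 deg C soak.
to_c = make_converter(count_ref=12_000, t_ref_c=50.0, slope_counts_per_c=-40.0)
```

With these assumed numbers, a reading of 10 000 counts converts to 100 °C; the count drops as the die heats because both the mobility and V_T fall with temperature.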


Fig. 17. Measured spine-to-spine clock skew.

Fig. 18. Area comparison of the diode-based thermal sensor [7] and the CMOS sensor [8].

A simplified circuit schematic of the CMOS sensor is shown in Fig. 19. The voltage reference circuit on the left is used to generate the bias currents and a bias voltage $V_{\mathrm{BIAS}}$. The amplifier A2 forces the drain voltage of M5 to be equal to $V_{\mathrm{BIAS}}$. M5 is in the linear mode of operation because it shares a gate bias with M4. Thus, the equation describing its current is

$$I_{M5} = \mu_n C_{ox} \frac{W}{L}\left[(V_{GS}-V_T)V_{DS} - \frac{V_{DS}^2}{2}\right] \quad (1)$$

where $\mu_n$ is the electron mobility and $V_T$ is the threshold voltage, both of which decrease with temperature for the range of interest. This current is mirrored by M2 and M3 and integrated over the capacitor $C_1$. The voltage on $C_1$ is compared with $V_{\mathrm{BIAS}}$ and is used to trigger a pulse generator, which is used to discharge $C_1$ and is also the frequency output of the circuit. The frequency obeys the following equation:

$$f = \frac{I_{M5}}{C_1 V_{\mathrm{BIAS}}} \quad (2)$$

The CMOS sensor frequency is input to a counter such that the output count is proportional to the frequency. This count was compared with electrical test parameters of devices in the scribe lines of the wafers. This is shown in Fig. 20(a) and (b) for several


Fig. 19. Schematic diagram of the CMOS sensor.

Fig. 20. Correlation of the CMOS sensor count to (a) nMOS Idsat and (b) nMOS $V_T$.

thousands of units from different lots, during wafer sort. It was found that the count was well correlated to the nMOS Idsat (which is proportional to the mobility), as in Fig. 20(a), and to the nMOS $V_T$, as in Fig. 20(b). The count was uncorrelated to other sort parameters (e.g., pMOS $V_T$, pMOS mobility, and resistance). The linear correlation of the count (i.e., frequency) of the CMOS sensor to the nMOS mobility and $V_T$ proves the validity of (2).

VIII. GDXC: A NOVEL ON-DIE PROBING TECHNIQUE

Due to the high integration of the SGIC, external buses are not observable by external equipment. To overcome this issue, the SGIC incorporates a dedicated die probing port called Generic Debug eXternal Connection (GDXC), which outputs internal information in a packet format to be used for debug and validation purposes. GDXC is an essential debug tool for

the SGIC, from power-on of the first silicon until after launch, when the part is in mass production. GDXC comprises a debug bus that allows monitoring of the traffic between the IA cores, PG, caches, and System Agent. GDXC has a dedicated port through which the SGIC exposes selected parts of its internal ring buses and functions, as well as power management event information. GDXC is a nonintrusive methodology that answers many of the debug community's concerns by providing some observability of the ring at high speed and of out-of-order protocols. Its output may be connected to a third-party logic analyzer or to custom sampling logic. The GDXC port is composed of 16 output lanes, using a PCIe-based protocol. GDXC also includes a hardware method for triggering, called G-ODLAT, which can enable early triggering on a failure scenario. The GDXC location allows observation of the ring for protocol correctness. It provides observability of the four functional sub-rings that comprise the ring. In addition, GDXC provides observability of power management transactions and Serial VID commands, to monitor the interaction between the SGIC and the external voltage regulator. GDXC helps to understand the scenario that leads to a failure: the packet format, with its unique time-stamp method, makes it possible to time-align the events observed by GDXC from different on-die modules to the same time scale. GDXC comprises a set of queues that hold packets until they are issued out to the logic analyzer (Fig. 21). Since many queues lead to a narrow pipe of x16 PCIe lanes, GDXC is susceptible to overflow. Thus, at the entrance of the queues there is a Qualifier, which is used to filter out packets that are not essential to the current debug; these qualifiers significantly reduce the GDXC susceptibility to overflow. The connection to the external logic analyzer is made through a top-side port, which is located on the top side of the package. This approach avoids the need for package pin allocation while improving in situ accessibility for debugging in a system (Fig. 22).

Fig. 21. GDXC queue architecture.

Fig. 22. GDXC top side connector.

Fig. 23. SGIC die photograph.

IX. CONCLUSION

The Second Generation Intel Core was introduced to the market in early 2011. The part is offered in a variety of configurations and packages for optimal performance, cost, power consumption, and form-factor adaptation to the target system requirements. The thermal design power (TDP) of the SGIC ranges from 17 to 45 W for the two-core and four-core mobile parts and up to 95 W for a high-end desktop part. The IA cores and PG are powered from independent 0.65–1.15-V variable-voltage power supply sources, all controlled by the SVID bus. The DDR3 interface uses a 1.5-V power plane, while the PCIe interface uses a 1.05-V power plane. The die photograph is shown in Fig. 23.

REFERENCES
[1] Intel 64 and IA-32 Architectures Optimization Reference Manual [Online]. Available: http://www.intel.com/Assets/PDF/manual/248966.pdf
[2] E. Seevinck, F. J. List, and J. Lohstroh, "Static-noise margin analysis of MOS SRAM cells," IEEE J. Solid-State Circuits, vol. SC-22, no. 5, pp. 748–754, Oct. 1987.
[3] S. Rusu et al., "A 45 nm 8-core enterprise Xeon processor," in IEEE ISSCC Dig. Tech. Papers, 2009.


[4] E. Fayneh and E. Knoll, "Clock generation and distribution for Intel Banias mobile microprocessor," in Proc. VLSI Circuits Symp., 2003, pp. 17–20.
[5] E. Rotem, A. Mendelson, R. Ginosar, and U. Weiser, "Multiple clock and voltage domains for chip multi processors," in Proc. 42nd Annu. IEEE/ACM Int. Symp. Microarchitecture (MICRO 42), New York, 2009, pp. 459–468.
[6] J. Shor, "Low noise linear voltage regulator for use as an on-chip PLL supply in microprocessors," in Proc. IEEE Int. Symp. Circuits Syst., Paris, France, May 2010, pp. 841–844.
[7] D. Duarte, G. Geannopoulos, U. Mughal, K. L. Wong, and G. Taylor, "Temperature sensor design in a high volume manufacturing 65 nm CMOS digital process," in Proc. IEEE Custom Integr. Circuits Conf., Sep. 2007, pp. 221–224.
[8] K. Luria and J. Shor, "Miniaturized CMOS thermal sensor array for temperature gradient measurement in microprocessors," in Proc. IEEE Int. Symp. Circuits Syst., Paris, France, May 2010, pp. 1855–1858.

Marcelo Yuffe received the B.Sc. degree in electrical engineering from the Technion–Israel Institute of Technology, Haifa, Israel, in 1991. He joined Intel Corporation, Haifa, Israel, in 1990, where he is a Senior Principal Engineer. He works on special circuit design for CPUs, mainly I/O, clock, and power-delivery circuits.

Tsvika Kurts received the B.Sc. and M.Sc. degrees from the Technion–Israel Institute of Technology, Haifa, Israel, in 1984 and 1992, respectively. He is currently a Principal Engineer/Architect with Intel's Microprocessor Chipset Division, Haifa, Israel, leading the debug architecture of the Sandy Bridge processor. He has been with Intel for 26 years. He was part of the core team that developed the Pentium M and the Centrino platform, and he led the quad-core architecture of the Core 2 Duo. Earlier at Intel, he was part of the Intel Pentium Pro processor bus architecture and system validation team and was involved in the Intel Pentium 4 processor bus protocol development.

Michael Zelikson was born in Leningrad, Russia, in 1962. He received the B.Sc., M.Sc., and D.Sc. degrees from the Technion–Israel Institute of Technology, Haifa, Israel, in 1989, 1991, and 1995, respectively. His academic research focused on electrical modulation of optical constants in a-Si:H-based waveguides. After completing his D.Sc. work, he joined the IBM Research Division, working in the field of analog and mixed-signal design in SiGe technology, including high-bandwidth interconnect modeling and linear amplifiers. Since 2003, he has been with Intel Corporation, Haifa, where his main fields of expertise are power-delivery design and analysis, power management, and voltage regulation.

Moty Mehalel received the B.Sc. degree in electrical engineering from the Technion–Israel Institute of Technology, Haifa, Israel, in 1980. He is a Senior Principal Engineer with Intel Corporation, Haifa, Israel. He was with Tadiran Communication Ltd. from 1984 to 1988, focusing on DSP hardware development. He joined Intel in 1988 as a Design Engineer. Since then, he has been working on cache design, cache testing, techniques for lowering the minimum operational voltage, global circuit design methodologies, and low-power design.

Eran Altshuler received the B.Sc. and M.Sc. degrees from the Technion–Israel Institute of Technology, Haifa, Israel, in 1987 and 1990, respectively, both in electrical engineering. In 1991, he joined Intel Corporation, Haifa, Israel, as a Digital Circuit Design Engineer in the processors group, where he is a Principal Engineer.

Ernest Knoll received the B.S.E.E. degree from the Polytechnic University of Iasi, Romania, in 1980. He joined Intel Corporation, Haifa, Israel, in 1990 and has worked on several CPU generations, with a focus on clock generation and distribution. He is currently a Senior Principal Engineer. He holds 18 U.S. patents, all in the analog circuit design area, and has authored or coauthored five technical papers.

Eyal Fayneh received the B.Sc. degree from Tel Aviv University, Tel Aviv, Israel, in 1991. He then designed RF and frequency-generation circuits for radio applications and clock-generation circuits at Motorola, and in 1996 he joined Intel Corporation, Haifa, Israel, where he is a Principal Engineer. Currently, he designs high-performance clock generators for CPU cores and I/O.

Joseph Shor (SM'11) received the B.A. degree in physics from Queens College, Queens, NY, in 1986, and the Ph.D. degree in electrical engineering from Columbia University, New York, NY, in 1993. From 1988 to 1994, he was a Senior Research Scientist with Kulite Semiconductor, where he developed processes and devices for silicon carbide and diamond microsensors. From 1994 to 1999, he was a Senior Analog Designer with Motorola Semiconductor in the DSP Division. From 1999 to 2004, he was with Saifun Semiconductor as a Staff Engineer, where he established the analog activities for Flash and EEPROM NROM memories. Since 2004, he has been with Intel Corporation, Haifa, Israel, where he is presently a Principal Engineer and head of the Analog Team at Intel Yakum. He has authored or coauthored more than 50 papers in refereed journals and conference proceedings in the areas of analog circuit design and device physics. He holds 35 issued patents and several pending patents. His present interests include switching and linear voltage regulators, thermal sensors, PLLs, and I/O circuits, all for microprocessor applications.

Kosta Luria was born in Moscow, USSR, in 1962. He received the B.S. degree in electrical engineering from Tel Aviv University, Tel Aviv, Israel, in 1991. From 1991 to 1997, he was with Motorola Communications Ltd., developing analog circuits for wireless modems used in SCADA irrigation applications. In 1997, he joined the analog team at the startup company Friendly Robotics, which developed a robotic lawn mower and a robotic vacuum cleaner. In 2001, he joined Intel, where he works on analog chip design. His interests include smart temperature sensors, high-quality voltage regulators, bandgap references, A/D converters, and special circuits for new applications.
