
Foundations and Trends® in sample, Vol. xx, No. xx (xxxx) 1–153 © xxxx xxxxxxxxx DOI: xxxxxx

Three-dimensional Integrated Circuits: Design, EDA, and Architecture


Yuan Xie, Yibo Chen, Guangyu Sun, Jin Ouyang

Computer Science and Engineering Department, Pennsylvania State University, University Park, PA 16802, USA, yuanxie@cse.psu.edu

Abstract
The emerging three-dimensional (3D) integration technology is one of the promising solutions to overcome the barriers in interconnect scaling, thereby offering an opportunity to continue performance improvements using CMOS technology. As the fabrication of 3D integrated circuits has become viable, developing CAD tools and architectural techniques is imperative for the successful adoption of 3D integration technology. In this article, we first give a brief introduction to the 3D integration technology, then review the EDA challenges and solutions that can enable the adoption of 3D ICs, and finally present design and architectural techniques for the application of 3D ICs, including a survey of various approaches to designing future 3D ICs that leverage the benefits of lower latency, higher bandwidth, and heterogeneous integration capability offered by 3D technology.

Contents

1 Introduction

2 3D Integration Technology
2.1 The Impact of 3D Technology on 3D IC Design Partitioning

3 Benefits of 3D Integrated Circuits
3.1 Wire Length Reduction
3.2 Memory Bandwidth Improvement
3.3 Heterogeneous Integration
3.4 Cost-effective Architecture

4 3D IC EDA Design Tools and Methodologies
4.1 Thermal Analysis for 3D ICs
4.2 Physical Design Tools and Algorithms for 3D ICs
4.3 3D Testing
4.4 3DCacti: An Early Analysis Tool for 3D Cache Design

5 3D FPGA Design
5.1 FPGA Background
5.2 3D FPGA Design
5.3 PCM-Based 3D Non-Volatile FPGA
5.4 Summary

6 3D Architecture Exploration
6.1 Single-core Fine-Granularity Design
6.2 Multi-core Memory Stacking
6.3 3D Network-on-Chip

7 Cost Analysis for 3D ICs
7.1 Introduction
7.2 Early Design Estimation for 3D ICs
7.3 3D Cost Model
7.4 Cost Evaluation for Fully-Customized ASICs
7.5 Cost Evaluation for Many-Core Microprocessor Designs
7.6 3D IC Testing Cost Analysis
7.7 Conclusion

8 Challenges for 3D IC Design

Acknowledgements

References

1 Introduction

With continued technology scaling, interconnect has emerged as the dominant source of circuit delay and power consumption. The reduction of interconnect delay and power consumption is of paramount importance for deep-submicron designs. Three-dimensional integrated circuits (3D ICs) [1] are attractive options for overcoming the barriers in interconnect scaling, thereby offering an opportunity to continue performance improvements using CMOS technology. 3D integration technologies offer many benefits for future microprocessor designs. Such benefits include: (1) the reduction in interconnect wire length, which results in improved performance and reduced power consumption; (2) improved memory bandwidth, by stacking memory on microprocessor cores with TSV connections between the memory layer and the core layer; (3) the support for heterogeneous integration, which could enable novel architecture designs; and (4) a smaller form factor, which results in higher packing density and a smaller footprint due to the addition of a third dimension to the conventional two-dimensional layout, and potentially results in a lower-cost design.

This book first presents the background on 3D integration technology and shows the major benefits offered by 3D integration. As a key part, EDA design tools and methodologies for 3D ICs are reviewed. The designs of 3D FPGAs and micro-architectures are then discussed, which leverage the benefits of lower latency, higher bandwidth, and heterogeneous integration capability offered by 3D technology. The cost of 3D integration is also analyzed in the last section.

2 3D Integration Technology

The 3D integration technologies [2, 3] can be classified into one of the two following categories. (1) Monolithic approach. This approach involves a sequential device process: the frontend processing (to build the device layer) is repeated on a single wafer to build multiple active device layers before the backend processing builds interconnects among devices. (2) Stacking approach, which can be further categorized into wafer-to-wafer, die-to-wafer, or die-to-die stacking methods. This approach processes each layer separately, using conventional fabrication techniques; the multiple layers are then assembled to build up a 3D IC using bonding technology. Since the stacking approach does not require changes to the conventional fabrication process, it is much more practical than the monolithic approach and has become the focus of recent 3D integration research. Several 3D stacking technologies have been explored recently, including wire bonding, microbump, contactless (capacitive or inductive), and through-silicon via (TSV) vertical interconnects [1]. Among all these integration approaches, TSV-based 3D integration has the potential to offer the greatest vertical interconnect density, and therefore is the most promising of the vertical interconnect technologies.

Fig. 2.1: Illustration of F2F and F2B 3D bonding (Face-to-Face and Face-to-Back stacks, showing metal layers, device layers, substrates, microbumps, C4 pads, and through-silicon vias (TSVs)).

Figure 2.1 shows a conceptual 2-layer 3D integrated circuit with TSVs and microbumps. 3D stacking can be carried out using one of two main techniques [4]: (1) Face-to-Face (F2F) bonding: two wafers (dies) are stacked so that the very top metal layers are connected. Note that the die-to-die interconnects in face-to-face wafer bonding do not go through a thick buried silicon layer and can be fabricated as microbumps; the connections to C4 I/O pads are formed as TSVs. (2) Face-to-Back (F2B) bonding: multiple device layers are stacked together, with the top metal layer of one die bonded to the substrate of the other die, and direct vertical interconnects (called through-silicon vias, TSVs) tunnel through the substrate. In such F2B bonding, TSVs are used for both between-layer connections and I/O connections. Figure 2.1 illustrates a conceptual 2-layer 3D IC with either F2F or F2B bonding, with both TSV connections and microbump connections between layers. All TSV-based 3D stacking approaches share the following three common process steps [4]: (i) TSV formation; (ii) wafer thinning; and (iii) aligned wafer or die bonding, which could be wafer-to-wafer (W2W) bonding or die-to-wafer (D2W) bonding. Wafer thinning is used to reduce the impact of TSVs: the thinner the wafer, the smaller (and shorter) the TSV, under the same aspect-ratio constraint [4].

The wafer thickness could be in the range of 10 μm to 100 μm, and the TSV size is in the range of 0.2 μm to 10 μm [1].

2.1 The Impact of 3D Technology on 3D IC Design Partitioning

One of the key issues for 3D IC design is what level of logic granularity should be considered in 3D partitioning: designers may perform fine-granularity partitioning at the logic gate level, or coarse-granularity partitioning at the core level (for example, keeping the processor core as a 2D design, with cache stacked on top of the core). Which partitioning strategy should be adopted is heavily influenced by the underlying 3D process technology. In TSV-based 3D stacking, the dimension of the TSVs is not expected to scale at the same rate as the feature size, because alignment tolerance during bonding limits the scaling of the vias. The TSV size, length, and pitch density, as well as the bonding method (face-to-face or face-to-back bonding, SOI-based 3D or bulk-CMOS-based 3D), can have a significant impact on 3D IC design. For example, the relatively large size of TSVs can hinder partitioning a design at very fine granularity across multiple device layers and make true 3D component design less practical. On the other hand, monolithic 3D integration provides more flexibility in vertical 3D connection because the vertical 3D via can potentially scale down with feature size due to the use of local wires for connection; the availability of such technologies makes it possible to partition the design at a very fine granularity. Furthermore, face-to-face bonding or SOI-based 3D integration may have a smaller via pitch and a higher via density than face-to-back bonding or bulk-CMOS-based integration. Such influence of the 3D technology parameters on the microprocessor design must be thoroughly studied before an appropriate partitioning strategy is adopted.

For TSV-based 3D stacking, the partitioning strategy is determined by the TSV pitch and the via diameter. As shown in Figure 2.1, a TSV goes through the substrate and incurs area overhead; the larger the via diameter, the bigger the area overhead. For example, Figure 2.2 shows the number of connections for different partitioning granularities, and Figure 2.3 shows the area overhead for different 3D via diameters.¹ Fine-granularity partitioning requires a large number of connections, and with a relatively large via diameter the area overhead of fine-granularity partitioning becomes very high. Consequently, for most existing 3D process technologies, with via diameters usually larger than 1 μm, it makes more sense to perform the 3D partitioning at the unit level or core level rather than at the gate level, which can result in a large area overhead.

Fig. 2.2: Comparison of Interconnects at Different Partitioning Granularity (Courtesy: Ruchir Puri, IBM).

Fig. 2.3: Area Overhead (in relative ratio, via diameter in units of μm) at Different Partitioning Granularity (Courtesy: Ruchir Puri, IBM).

¹ These two tables are based on IBM 65nm technology for high-performance microprocessor design.
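To make the area-overhead argument concrete, the following C++ sketch estimates the silicon area consumed by TSVs at different partitioning granularities. The die area, the via counts assumed per granularity, and the keep-out rule are hypothetical placeholders chosen only for illustration; they are not the IBM 65nm data behind Figures 2.2 and 2.3.

// tsv_overhead.cpp -- rough estimate of TSV area overhead vs. partitioning
// granularity. Via counts, die area, and the keep-out assumption are
// hypothetical placeholders, not the IBM data of Figures 2.2 and 2.3.
#include <cstdio>

int main() {
    const double die_area_mm2 = 100.0;            // assumed die area
    // assumed number of vertical connections at each granularity
    const struct { const char* level; double vias; } granularity[] = {
        {"core level", 1e3}, {"unit level", 1e4}, {"gate level", 1e6}
    };
    const double via_diameters_um[] = {0.5, 1.0, 5.0, 10.0};

    for (auto g : granularity) {
        for (double d : via_diameters_um) {
            // keep-out: assume each TSV blocks a square of side 2*d
            double cell_um2     = (2.0 * d) * (2.0 * d);
            double overhead_mm2 = g.vias * cell_um2 * 1e-6;   // um^2 -> mm^2
            std::printf("%-10s  d=%4.1f um  overhead = %7.2f%% of a %.0f mm^2 die\n",
                        g.level, d, 100.0 * overhead_mm2 / die_area_mm2,
                        die_area_mm2);
        }
    }
    return 0;
}

Even with these rough numbers, gate-level partitioning with multi-micron vias quickly consumes a large fraction of the die, while core-level partitioning stays well below one percent, which is the trend the figures report.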

3 Benefits of 3D Integrated Circuits

The following subsections discuss various 3D IC design approaches that leverage the different benefits that 3D integration technology can offer, namely wire length reduction, high memory bandwidth, heterogeneous integration, and cost reduction.

3.1 Wire Length Reduction

Designers have traditionally relied on technology scaling to improve microprocessor performance. Although the size and switching speed of transistors improve as technology feature sizes continue to shrink, global interconnect wire delay does not scale accordingly. The increasing wire delays have become a major impediment to performance improvement. Three-dimensional integrated circuits (3D ICs) are attractive options for overcoming the barriers in interconnect scaling, thereby offering an opportunity to continue performance improvements using CMOS technology. One of the important benefits of a 3D chip over a traditional two-dimensional (2D) design is the reduction in global interconnects.

It has been shown that three-dimensional architectures reduce wiring length by a factor of the square root of the number of layers used [5]. The reduction of wire length due to 3D integration results in two obvious benefits: latency improvement and power reduction.

Latency Improvement. Latency improvement can be achieved due to the reduction of the average interconnect length and the critical path length. Early work on fine-granularity 3D partitioning of processor components shows that the latency of a 3D component can be reduced. For example, since interconnects dominate the delay of cache accesses, which determine the critical path of a microprocessor, and the regular structure and long wires in a cache make it one of the best candidates for 3D designs, 3D cache design is one of the early design examples of fine-granularity 3D partitioning [2]. Wordline partitioning and bitline partitioning approaches divide a cache bank into multiple layers and reduce the global interconnects, resulting in a faster cache access time. Depending on the design constraints, the 3DCacti tool [6] automatically explores the design space for a cache design and finds the optimal partitioning strategy; the latency reduction can be as much as 25% for a two-layer 3D cache. 3D arithmetic-component designs also show latency benefits. For example, various designs [7-10] have shown that 3D arithmetic unit designs can achieve around 6%-30% delay reduction due to the wire length reduction. Such fine-granularity 3D partitioning was also demonstrated by Intel [11], showing that by targeting the heavily pipelined wires, the pipeline modifications resulted in approximately 15% improved performance when the Intel Pentium-4 processor was folded onto a 2-layer 3D implementation. Note that such fine-granularity design of 3D processor components increases the design complexity, and the latency improvement varies depending on the partitioning strategies and the underlying 3D process technologies. For example, for the same Kogge-Stone adder design, a partitioning based on logic level [7] demonstrates that the delay improvement diminishes as the number of 3D layers increases, while a bit-slicing partitioning strategy [9] has better scalability as the bit-width or the number of layers increases. Furthermore, the delay improvement for such bit-slicing 3D arithmetic units is about 6% when using a bulk-CMOS-based 180nm 3D process [10], while the improvement could be as much as 20% when using an SOI-based 180nm 3D process technology [9], because the SOI-based process has much smaller and shorter TSVs (and therefore much smaller TSV delay) than the bulk-CMOS-based process.

Power Reduction. Interconnect power consumption becomes a large portion of the total power consumption as technology scales. The reduction of wire length translates into power savings in 3D IC design. For example, 7% to 46% power reduction for 3D arithmetic units was demonstrated in [9]. In the 3D Intel Pentium-4 implementation [11], because of the reduction in long global interconnects, the number of repeaters and repeating latches in the implementation is reduced by 50%, and the 3D clock network has 50% less metal RC than the 2D design, resulting in better skew and jitter and lower power. This 3D stacked redesign of the Intel Pentium 4 processor improves performance by 15% and reduces power by 15% with a temperature increase of 14 degrees. After using voltage scaling to lower the peak temperature to be the same as the baseline 2D design, the 3D Pentium 4 processor still showed a performance improvement of 8%.
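A minimal sketch of the wire-length rule of thumb quoted above follows. The baseline wire length, the assumption that an unrepeated wire's RC delay grows quadratically with its length, and the assumption that wire switching power grows linearly with length are illustrative simplifications, not numbers from the cited designs.

// wirelength_scaling.cpp -- illustrates the sqrt(N)-layer wire-length rule
// of thumb and its first-order effect on unrepeated RC wire delay and
// switching power. Baseline numbers are illustrative only.
#include <cmath>
#include <cstdio>

int main() {
    const double base_wire_mm = 10.0;   // assumed 2D global wire length
    for (int layers = 1; layers <= 4; ++layers) {
        double wire = base_wire_mm / std::sqrt(static_cast<double>(layers));
        // unrepeated RC delay grows ~quadratically with length;
        // wire dynamic power grows ~linearly with length (capacitance)
        double delay_ratio = (wire / base_wire_mm) * (wire / base_wire_mm);
        double power_ratio = wire / base_wire_mm;
        std::printf("%d layer(s): wire %.2f mm, RC delay x%.2f, wire power x%.2f\n",
                    layers, wire, delay_ratio, power_ratio);
    }
    return 0;
}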

3.2 Memory Bandwidth Improvement

It has been shown that circuit limitations and limited instruction-level parallelism diminish the benefits of modern superscalar microprocessors as architectural complexity increases, which has led to the advent of Chip Multiprocessors (CMPs) as a viable alternative to complex superscalar architectures. The integration of multi-core or many-core microarchitectures on a single die is expected to accentuate the already daunting memory-bandwidth problem. Supplying enough data to a chip with a massive number of on-die cores will become a major challenge for performance scalability. Traditional off-chip memory will not suffice due to the I/O pin limitations. Three-dimensional integration has been envisioned as a solution for future micro-architecture design (especially for multi-core and many-core architectures), to mitigate the interconnect crisis and the memory wall problem [12-14]. It is anticipated that memory stacking on top of logic would be one of the early commercial uses of 3D technology for future chip-multiprocessor design, by providing improved memory bandwidth for such multi-core/many-core microprocessors. In addition, such approaches of stacking memory on top of core layers do not have the design complexity problem of the fine-granularity design approaches, which require re-designing all processor components for wire length reduction (as discussed in Section 3.1).

Intel [11] explored the memory bandwidth benefits using a baseline Intel Core2 Duo processor, which contains two cores. By having memory stacking, the on-die cache capacity is increased, and the performance is improved by capturing larger working sets, reducing off-chip memory bandwidth requirements. For example, one option is to stack an additional 8MB L2 cache on top of the baseline 2D processor (which contains a 4MB L2 cache), and another option is to replace the SRAM L2 cache with a denser stacked DRAM L2 cache. Their study demonstrated that a 32MB 3D stacked DRAM cache can reduce the cycles per memory access by 13% on average and by as much as 55%, with negligible temperature increases.

The PicoServer project [15] follows a similar approach of stacking DRAM on top of multi-core processors. Instead of using the stacked memory as a larger L2 cache (as in Intel's work [11]), the fast on-chip 3D stacked DRAM main memory enables wide low-latency buses to the processor cores and eliminates the need for an L2 cache, whose silicon area is allocated to accommodate more cores. Increasing the number of cores by removing the L2 cache can help improve the computation throughput, while each core can run at a much lower frequency, resulting in an energy-efficient many-core design. For example, it can achieve a 14% performance improvement and 55% power reduction over a baseline multi-core architecture.

As the number of cores on a single die increases, such memory stacking becomes more important to provide enough memory bandwidth for processor cores. Recently, Intel [16] demonstrated an 80-tile terascale chip with a network-on-chip. Each core has a local 256KB SRAM memory (for data and instruction storage) stacked on top of it.

TSVs provide a bandwidth of 12GB/second for each core, for a total of about 1TB/second bandwidth for Tera-Flop computation. In this chip, the thin memory die is placed on top of the CPU die, and the power and I/O signals go through the memory to the CPU.

Since DRAM is stacked on top of the processor cores, the memory organization should also be optimized to take full advantage of the benefits that TSVs offer [12, 17]. For example, the numbers of ranks and memory controllers are increased in order to leverage the memory bandwidth benefits. A multiple-entry row buffer cache is implemented to further improve the performance of the 3D main memory. Comprehensive evaluation shows that a 1.75× speedup over a commodity DRAM organization is achieved [12]. In addition, the design of the MSHR was explored to provide scalable L2 miss handling before accessing the 3D stacked main memory. A data structure called the Vector Bloom Filter with dynamic MSHR capacity tuning is proposed; such a structure provides an additional 17.8% performance improvement. If stacked DRAM is used as the last-level cache (LLC) in chip multiprocessors (CMPs), the DRAM cache sets are organized into multiple queues [17]. A replacement policy is proposed for the queue-based cache to provide performance isolation between cores and reduce the lifetimes of dead cache lines. Approaches are also proposed to dynamically adapt the queue size and the policy of advancing data between queues.

The latency improvement due to 3D technology can also be demonstrated by such memory stacking designs. For example, Li et al. [18] proposed a 3D chip multiprocessor design using a network-in-memory topology. In this design, instead of partitioning each processor core or memory bank into multiple layers (as in [2, 6]), each core or cache bank remains a 2D design. Communication among cores and cache banks is via the network-on-chip (NoC) topology. The core layer and the L2 cache layer are connected with a TSV-based bus. Because of the short distance between layers, TSVs provide fast access from one layer to another, and effectively reduce the cache access time because of the faster access to cache banks through TSVs.
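The aggregate bandwidth quoted above for the 80-tile chip follows directly from the per-core TSV bus parameters. The sketch below redoes that arithmetic; the bus width and transfer rate are assumptions chosen only to roughly reproduce the published 12 GB/s-per-core and ~1 TB/s-total figures, not values reported in [16].

// stacked_bw.cpp -- back-of-the-envelope aggregate bandwidth of per-core
// TSV buses to stacked memory. Bus width and transfer rate are assumptions
// picked to roughly match the 12 GB/s-per-core / ~1 TB/s-total figures.
#include <cstdio>

int main() {
    const int    cores         = 80;     // tiles in the example design
    const int    bus_bits      = 128;    // assumed TSV data bus width per core
    const double transfers_ghz = 0.75;   // assumed transfer rate (GT/s)

    double per_core_gbps = bus_bits / 8.0 * transfers_ghz;   // GB/s per core
    double total_tbps    = per_core_gbps * cores / 1000.0;   // TB/s aggregate

    std::printf("per-core bandwidth: %.1f GB/s\n", per_core_gbps);
    std::printf("aggregate bandwidth for %d cores: %.2f TB/s\n", cores, total_tbps);
    return 0;
}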

3.3 Heterogeneous Integration

3D integration also provides new opportunities for future architecture design, with a new dimension of design space exploration. In particular, the heterogeneous integration capability enabled by 3D integration gives designers a new perspective when designing future CMPs. 3D integration technologies provide feasible and cost-effective approaches for integrating architectures composed of heterogeneous technologies to realize future microprocessors targeted at the "More than Moore" direction projected by the ITRS. 3D integration supports heterogeneous stacking because different types of components can be fabricated separately, and layers can be implemented with different technologies. It is also possible to stack optical device layers or non-volatile memories (such as magnetic RAM (MRAM) or phase-change memory (PCRAM)) on top of microprocessors to enable cost-effective heterogeneous integration. The addition of new stacking layers composed of new device technologies will provide greater flexibility in meeting often conflicting design constraints (such as performance, cost, power, and reliability), and enable innovative designs in future microprocessors.

Non-volatile Memory Stacking. Stacking layers of non-volatile memory technologies such as Magnetic Random Access Memory (MRAM) [19] and Phase Change Random Access Memory (PRAM) [20] on top of processors can enable a new generation of processor architectures with unique features. Several characteristics of MRAM and PRAM make them promising candidates for on-chip memory. In addition to their non-volatility, they have zero standby power, low access power, and are immune to radiation-induced soft errors. However, integrating these non-volatile memories along with a logic core involves additional fabrication challenges that need to be overcome (for example, the MRAM process requires growing a magnetic stack between metal layers). Consequently, it may incur extra cost and additional fabrication complexity to integrate MRAM with conventional CMOS logic into a single 2D chip. The ability to integrate two different wafers developed with different technologies using 3D stacking offers an ideal solution to overcome this fabrication challenge and exploit the benefits of PRAM and MRAM technologies.

For example, Sun et al. [21] demonstrated that an optimized MRAM L2 cache stacked on top of a multi-core processor can improve performance by 4.91% and reduce power by 73.5% compared to a conventional SRAM L2 cache with similar area.

Optical Device Layer Stacking. Even though 3D memory stacking can help mitigate the memory bandwidth problem, when it comes to off-chip communication, the pin limitations, the energy cost of electrical signaling, and the non-scalability of chip-length global wires are still significant bandwidth impediments. Recent developments in silicon nanophotonic technology have the potential to meet the off-chip communication bandwidth requirements at acceptable power levels. With the heterogeneous integration capability that 3D technology offers, one can integrate an optical die together with CMOS processor dies. For example, HP Labs proposed the Corona architecture [22], a 3D many-core architecture that uses nanophotonic communication for both inter-core communication and off-stack communication to memory or I/O devices. A photonic crossbar fully interconnects its 256 low-power multithreaded cores at 20 terabytes per second of bandwidth, with much lower power consumption.

Fig. 3.1: An illustration of a 3D heterogeneous architecture with non-volatile memory stacking and optical die stacking (a processor core layer, DRAM layer, and non-volatile memory layer stacked with an optical die on the package, with fibers to off-chip).

Figure 3.1 illustrates such a 3D heterogeneous processor architecture, which integrates non-volatile memories and an optical die together through 3D integration technology.

3.4 Cost-effective Architecture

Increasing integration density has resulted in large die sizes for microprocessors. With a constant defect density, a larger die typically has a lower yield. Consequently, partitioning a large 2D microprocessor into multiple smaller dies and stacking them together may result in a much higher yield for the chip, even though 3D stacking incurs extra manufacturing cost due to the extra steps for 3D integration and may cause yield loss during stacking. Depending on the original 2D microprocessor die size, it may be cost-effective to implement the chip using 3D stacking [23], especially for large microprocessors. The heterogeneous integration capability that 3D provides can also help reduce the cost. In addition, as the technology feature size scales toward its physical limits, it has been predicted that moving to the next technology node will be not only difficult but also prohibitively expensive. 3D stacking can potentially provide a cost-effective integration solution compared to traditional technology scaling.
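A rough cost-per-good-chip comparison can make this argument concrete. The C++ sketch below uses the common negative-binomial yield model and assumes that the two smaller dies are tested before stacking (a known-good-die flow); the wafer cost, defect density, clustering parameter, bonding cost, and bonding yield are all hypothetical values, not data from [23].

// split_cost.cpp -- cost-per-good-chip comparison of one large 2D die vs.
// two stacked half-size dies with known-good-die (KGD) testing before
// stacking. Negative-binomial yield model; all parameters are hypothetical.
#include <cmath>
#include <cstdio>

double die_yield(double area_cm2, double d0, double alpha) {
    // negative-binomial yield model: Y = (1 + A*D0/alpha)^(-alpha)
    return std::pow(1.0 + area_cm2 * d0 / alpha, -alpha);
}

int main() {
    const double wafer_area_cm2 = 700.0;  // ~300 mm wafer, edge loss ignored
    const double wafer_cost     = 5000.0; // assumed processed-wafer cost ($)
    const double d0             = 0.5;    // defects per cm^2 (assumed)
    const double alpha          = 3.0;    // defect clustering parameter (assumed)
    const double die_cm2        = 4.0;    // large 2D die
    const double bond_cost      = 5.0;    // assumed per-stack bonding cost ($)
    const double bond_yield     = 0.98;   // assumed stacking yield

    auto cost_per_good_die = [&](double a) {
        double dies = wafer_area_cm2 / a;
        return wafer_cost / (dies * die_yield(a, d0, alpha));
    };

    double cost2d = cost_per_good_die(die_cm2);
    // two pre-tested half-size dies plus one bonding step
    double cost3d = (2.0 * cost_per_good_die(die_cm2 / 2.0) + bond_cost) / bond_yield;

    std::printf("2D: $%.1f per good %.1f cm^2 chip\n", cost2d, die_cm2);
    std::printf("3D: $%.1f per good two-layer stack of %.1f cm^2 dies\n",
                cost3d, die_cm2 / 2.0);
    return 0;
}

With these illustrative numbers the two half-size dies win, because each small die enjoys a much higher yield and pre-bond testing keeps bad dies out of the stack; without known-good-die testing the compounded stacking yield can instead be worse than the 2D baseline, which is why the text hedges the claim.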

4 3D IC EDA Design Tools and Methodologies

3D integration technology will not be commercially viable without the support of Electronic Design Automation (EDA) tools and methodologies that allow architects and circuit designers to develop new architectures or circuits using this technology. To efficiently exploit the benefits of 3D technologies, design tools and methodologies to support 3D designs are imperative. This section surveys existing EDA techniques that are the key enablers for the adoption of 3D technology for microarchitecture design. EDA design tools that are essential for 3D microarchitecture design adoption can be divided into two categories: physical design tools and early design analysis tools. In the following subsections, we first give an overview of various CAD tools and techniques and then show a few case studies with detailed descriptions.

4.1 Thermal Analysis for 3D ICs

3D technology offers several benefits compared to traditional 2D technology. However, one of the major concerns in the adoption of 3D technology is the increased power density that can result from placing one power-hungry block over another in the multi-layered 3D stack. Since increasing power density and the resulting thermal impact are already major concerns in 2D ICs, the move to 3D ICs could accentuate the thermal problem, resulting in higher on-chip temperatures. High temperature has adverse impacts on circuit performance: interconnect delay increases and the driving strength of a transistor decreases with increasing temperature. Leakage power has an exponential dependence on temperature, and increasing on-chip temperature can even result in thermal runaway. In addition, at sufficiently high temperatures, many failure mechanisms, including electromigration (EM), stress migration (SM), time-dependent dielectric (gate oxide) breakdown (TDDB), and thermal cycling (TC), are significantly accelerated, which leads to an overall decrease in reliability. Consequently, it is critical to model the thermal behavior of 3D ICs and investigate possible solutions to mitigate thermal problems in order to take full advantage of the benefits that 3D technologies offer. This section surveys several thermal modeling approaches for 3D ICs. A detailed 3D modeling method is first introduced, and then compact thermal models are described.

4.1.1 3D Thermal Analysis Based on the Finite Difference Method or Finite Element Method

Sapatnekar et al. proposed a detailed 3D thermal model [24]. The heat equation (4.1), a parabolic partial differential equation (PDE), defines on-chip thermal behavior at the macroscale:

ρ c_p ∂T(r, t)/∂t = k_t ∇²T(r, t) + g(r, t)   (4.1)

where ρ represents the density of the material (in kg/m³), c_p is the heat capacity of the chip material (in J/(kg·K)), T is the temperature (in K), r is the spatial coordinate of the point where the temperature is determined, t is time (in sec), k_t is the thermal conductivity of the material (in W/(m·K)), and g is the power density per unit volume (in W/m³). The solution of Equation (4.1) is the transient thermal response. Since all derivatives with respect to time go to zero in the steady state, steady-state analysis amounts to solving the time-independent form of the PDE, which is the well-known Poisson's equation.

A set of boundary conditions must be added in order to get a well-defined solution to Equation (4.1). This typically involves building a package macro model and assuming a constant ambient temperature interacting with the model. There are two methods to discretize the chip and form a system of linear equations representing the temperature distribution and power density distribution: the finite difference method (FDM) and the finite element method (FEM). The difference between them is that the FDM discretizes the differential operator while the FEM discretizes the temperature field. Both can handle complicated material structures, such as non-uniform interconnect distributions in a chip.

In the FDM, heat transfer theory is used to build an equivalent thermal circuit through the thermal-electrical analogy. The steady-state equation represents the network with thermal resistors connected between nodes and with thermal current sources mapping to power sources. The voltage, and thus the temperature, at the nodes can be computed by solving the circuit. The ground node is treated as a constant-temperature node, typically at the ambient temperature. The FEM takes a different perspective to solve Poisson's equation: the design space is discretized into elements with shapes such as tetrahedra or hexahedra, a reasonable discretization divides the chip into 8-node rectangular hexahedral elements, and the temperature within an element is calculated using an interpolation function.

Since FDM formulations are similar to the power grid analysis problem, similar solution techniques can be used, such as multigrid-based approaches. Li et al. [25] proposed multigrid (MG) techniques for fast chip-level thermal steady-state and transient simulation. This approach avoids an explicit construction of the matrix problem, which is intractable for most full-chip problems. Specific MG treatments are proposed to cope with the strong anisotropy of the full-chip thermal problem that is created by the vast difference in material thermal properties and chip geometries. Importantly, this work demonstrates that only with careful thermal modeling assumptions and appropriate choices for grid hierarchy, MG operators, and smoothing steps across grid points can a full-chip thermal problem be accurately and efficiently analyzed.

The large thermal transient simulations are further sped up by incorporating reduced-order thermal models that can be efficiently extracted under the same MG framework. The FEM provides another avenue to solve Poisson's equation. In finite element analysis, the design space is first discretized, or meshed, into elements. Different element shapes can be used, such as tetrahedra and hexahedra. For the on-chip problem, where all heat sources are modeled as rectangular, a reasonable discretization for the FEM divides the chip into 8-node rectangular hexahedral elements [26]. The temperatures at the nodes of the elements constitute the unknowns that are computed during finite element analysis, and the temperature within an element is calculated using an interpolation function that approximates the solution to the heat equation within the elements.
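To illustrate the FDM view described above, the following C++ sketch solves a tiny steady-state problem on a two-layer grid using the thermal-electrical analogy: nodes exchange heat through thermal conductances, block power acts as a "current" source, and the ambient acts as the "ground" node. The grid size, conductance values, and power map are arbitrary illustrative choices, not a calibrated 3D chip model.

// fdm_thermal.cpp -- minimal steady-state FDM thermal solver using the
// thermal-electrical analogy: grid nodes, conductances to neighbors,
// power sources as currents, ambient as the ground node.
// All parameters are illustrative, not calibrated.
#include <cstdio>
#include <vector>

int main() {
    const int nx = 16, ny = 16, nz = 2;   // two device layers
    const double g_lat  = 1.0;            // lateral conductance (W/K), assumed
    const double g_vert = 0.5;            // inter-layer conductance, assumed
    const double g_sink = 2.0;            // per-node conductance to the heat sink
    const double t_amb  = 45.0;           // ambient/package temperature (C)

    auto idx = [&](int x, int y, int z) { return (z * ny + y) * nx + x; };
    std::vector<double> T(nx * ny * nz, t_amb), P(nx * ny * nz, 0.0);
    // a hot block on the top layer (farther from the sink)
    for (int y = 4; y < 8; ++y)
        for (int x = 4; x < 8; ++x) P[idx(x, y, 1)] = 0.8;   // W per node

    for (int iter = 0; iter < 2000; ++iter) {                // Gauss-Seidel sweeps
        for (int z = 0; z < nz; ++z)
            for (int y = 0; y < ny; ++y)
                for (int x = 0; x < nx; ++x) {
                    double gsum = 0.0, flux = P[idx(x, y, z)];
                    auto add = [&](int xx, int yy, int zz, double g) {
                        gsum += g; flux += g * T[idx(xx, yy, zz)];
                    };
                    if (x > 0)      add(x - 1, y, z, g_lat);
                    if (x < nx - 1) add(x + 1, y, z, g_lat);
                    if (y > 0)      add(x, y - 1, z, g_lat);
                    if (y < ny - 1) add(x, y + 1, z, g_lat);
                    if (z > 0)      add(x, y, z - 1, g_vert);
                    if (z < nz - 1) add(x, y, z + 1, g_vert);
                    if (z == 0) { gsum += g_sink; flux += g_sink * t_amb; } // sink side
                    T[idx(x, y, z)] = flux / gsum;   // node balance equation
                }
    }
    double tmax = t_amb;
    for (double t : T) if (t > tmax) tmax = t;
    std::printf("peak steady-state temperature: %.2f C\n", tmax);
    return 0;
}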

4.1.2 Compact 3D IC Thermal Modeling

3D IC thermal analysis with the FDM or FEM can be very time consuming and is therefore not suitable for design space exploration, where the thermal analysis has to be performed iteratively. A compact thermal model for 3D ICs is therefore desirable. For traditional 2D designs, one widely used compact thermal model is HotSpot [27], which is based on an equivalent circuit of thermal resistances and capacitances that correspond to microarchitecture components and thermal packages. In the well-known duality between heat transfer and electrical phenomena, heat flow can be described as a current, while a temperature difference is analogous to a voltage: the temperature difference is caused by heat flow passing through a thermal resistance. It is also necessary to include thermal capacitances for modeling transient behavior, to capture the delay before the temperature reaches steady state after a change in power consumption. Like electrical RC constants, the thermal RC time constants characterize the rise and fall times caused by the thermal resistances and capacitances. In this analogy, current and heat flow are described by exactly the same differential equations for a potential difference. These equivalent circuits are called compact models, or dynamic compact models if thermal capacitors are included. For a microarchitecture unit, the dominant mechanism determining its temperature is heat conduction to the thermal package and to neighboring units. In HotSpot, the temperature is tracked at the granularity of individual microarchitectural units, and the equivalent RC circuits have at least one node for each unit. The thermal model component values do not depend on the initial temperature or on particular configurations. HotSpot is a simple library that generates the equivalent RC circuit automatically and computes the temperature at the center of each block, given the power dissipation over any chosen time step.

Based on the original HotSpot model, a 3D thermal estimation tool named HS3D was introduced. HS3D allows 3D thermal evaluation with a largely unchanged computational model and methods from HotSpot. The inter-layer thermal vias can be approximated by changing the vertical thermal resistance of the materials. The HS3D library accepts incompletely specified floorplans as input and ensures accurate thermal modeling of large floorplan blocks. Many routines have been recreated to optimize loop accesses, cache locality, and memory paging. These improvements offer reduced memory usage and a runtime reduction of over three orders of magnitude when simulating a large number of floorplans. To verify the correctness and efficiency of the HS3D library, a comparison between this new library and a commercial FEM software package was performed. First, a 2D sample device and package was used for the verification: the difference in average chip temperature between HS3D and the FEM software is only 0.02 degrees Centigrade. Multi-layer (3D) device modeling verification used 10-micron-thick silicon layers and 2-micron-thick interlayer material; the test case includes two layers with a sample processor in each layer. The experimental results show that the average temperature misestimation is 3 degrees Centigrade. In addition, the thermal analysis using the FEM software took 7 minutes, while HS3D took only one second. This indicates that HS3D provides not only high accuracy but also high performance (low run time). The HS3D extension was integrated into later versions of HotSpot to support 3D thermal modeling.

Cong and Zhang derived a closed-form compact thermal model for thermal via planning in 3D ICs [?]. In their thermal resistive model, a tile structure is imposed on the circuit stack, with each tile the size of a via pitch, as shown in Figure 4.1(a).

Fig. 4.1: Resistive thermal model for a 3-D IC: (a) tile stack array, (b) single tile stack, (c) tile stack analysis (neighboring tile stacks are connected by lateral resistances R_lateral).

Each tile stack contains an array of tiles, one tile for each device layer, as shown in Figure 4.1(b). A tile either contains one via at its center or no via at all. A tile stack is modeled as a resistive network, as shown in Figure 4.1(c). A voltage source is used for the isothermal base, and current sources are present in each silicon layer to represent heat sources. The tile stacks are connected by lateral resistances. The values of the resistances in the network are determined by a commercial FEM-based thermal simulation tool. If no via exists in a tile, unnecessary nodes can be deleted to speed up the computation. To further improve analysis accuracy without losing efficiency, Yang et al. proposed an incremental and adaptive chip-package thermal analysis flow named ISAC [28]. During thermal analysis, both time complexity and memory usage are linearly or superlinearly related to the number of thermal elements. ISAC incorporates an efficient technique for adapting the spatial resolution of thermal elements during thermal analysis. This technique uses incremental refinement to generate a tree of heterogeneous rectangular parallelepipeds that supports fast thermal analysis without loss of accuracy. Within ISAC, this technique is combined with an efficient multigrid numerical analysis method, yielding a comprehensive steady-state thermal analysis solution.


4.2 Physical Design Tools and Algorithms for 3D ICs

Physical design tools play an important role in 3D design automation, since one essential difference between 2D and 3D CAD tools lies in the physical design. The physical design tools for 3D ICs can be classified into three categories: floorplanning, placement, and routing. In addition, one major concern in 3D ICs is the increased power density due to placing one thermally critical block over another in the 3D stack. Consequently, thermal-aware physical design tools should be developed to address both thermal and physical design issues.

The physical design methodology for 3D ICs depends on the partitioning strategy, which is determined by the 3D process technology. As discussed in Chapter 2, the partitioning strategy in 3D ICs can be classified into two categories: fine-granularity partitioning and coarse-granularity partitioning. For the monolithic 3D approach (such as the MLBS approach) or TSV-based 3D ICs with extremely small TSVs, it is possible to perform fine-granularity partitioning with gate-level 3D placement and routing. On the other hand, with a large TSV size (via diameter larger than 1 μm), it makes more sense to perform coarse-granularity partitioning at the unit level or core level to avoid large area overhead. For coarse-granularity partitioning, since each unit or core is still a 2D design, the physical placement and routing for each unit/core can still be done with conventional placement and routing EDA tools, and only 3D floorplanning is needed. Consequently, 3D floorplanning is the most critically needed physical design tool for near-term 3D IC design. In this section, we first describe in detail a thermal-aware 3D floorplanning framework [29], and then give a quick survey of thermal-aware 3D placement and routing.

4.2.1 Thermal-aware 3D Floorplanning

3D IC design is fundamentally related to the topological arrangement of logic blocks. Therefore, physical design tools play an important role in the adoption of 3D technologies. New placement and routing tools are necessary to optimize 3D instantiations, and new design tools are required to optimize interlayer connections. A major concern in the adoption of 3D architectures is the increased power density that can result from placing one computational block over another in the multi-layered 3D stack. Since power density is already a major bottleneck in 2D architectures, the move to 3D architectures could accentuate the thermal problem. Even though 3D chips could offer some respite due to reduced interconnect power consumption (as a result of the shortening of many long wires), it is imperative to develop thermally aware physical design tools; for example, the design can be partitioned so that highly loaded, active gates are placed in the layer closest to the heat sink.

Tools for modeling thermal effects on chip-level placement have been developed. Cong et al. [34] proposed a thermal-driven floorplanning algorithm for 3D ICs. Chu and Wong [35] proposed using a matrix synthesis problem (MSP) to model the thermal placement problem. A standard-cell placement tool for even thermal distribution was introduced by Tsai and Kang [36] with their compact finite difference method (FDM) based temperature modeling. In [37], the thermal effect was formulated as another force in a force-directed approach to guide the placement procedure toward a thermally even standard-cell placement. Another design metric, reliability, was addressed in [38] for multi-layer System-on-Package floorplanning, although the thermal issue was neglected there.

This section presents a thermal-aware floorplanner for 3D microprocessors [39]. The floorplanner is unique in that it accounts for the effects of interconnect power consumption in estimating the peak temperatures. We describe how to estimate temperatures for 3D microprocessors and show the effectiveness of the thermal-aware tool in reducing peak temperatures using one microprocessor design and four MCNC benchmarks.

4.2.1.1 B*-tree Floorplan Model

The floorplanner used in this work is based on the B*-tree representation proposed by Chang et al. [42]. A B*-tree is an ordered binary tree that can represent a non-slicing admissible floorplan. Figure 4.2 shows a B*-tree representation and its corresponding floorplan. The root of a B*-tree is located at the bottom-left corner. The B*-tree of an admissible floorplan can be obtained by a depth-first-search procedure in a recursive fashion: starting from the root, the left subtree is visited first, followed by the right subtree, recursively. The left child of node ni is the lowest unvisited module in Ri, which denotes the set of modules adjacent to the right boundary of module bi. The module representing the right child of ni is adjacent to and above module bi. We assume the coordinates of the root node are always (0,0), residing at the bottom-left corner. The geometric relationship between two modules in a B*-tree is maintained as follows: the x-coordinate of node nj is xi + wi if node nj is the left child of node ni, that is, bj is located adjacent to the right-hand side of bi; for the right child nk of ni, its x-coordinate is the same as that of ni, with module bk sitting adjacent to and above bi. While the original B*-tree structure was developed for the 2D floorplanning problem, we extend this model to represent a multi-layer 3D floorplan (see Figure 4.3) and modify the perturbation function to handle 3D floorplans. There are six perturbation operations used in our algorithm:

(1) Node swap, which swaps two modules.
(2) Rotation, which rotates a module.
(3) Move, which moves a module.
(4) Resize, which adjusts the aspect-ratio of a soft module.
(5) Interlayer swap, which swaps two modules at different layers.
(6) Interlayer move, which moves a module to a different layer.

The first three perturbations are the original moves defined in [42]. Since these moves only influence the floorplan within a single layer, the interlayer moves (5) and (6) are needed to explore the 3D floorplan solution space.
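A minimal C++ sketch of the B*-tree geometric rules described above follows: a left child is placed immediately to the right of its parent (x = x_parent + w_parent) and a right child directly above it (same x), with the y-coordinate taken from a simple discretized contour. The contour bookkeeping and the module sizes are illustrative assumptions, not code from the floorplanner of [29, 39].

// bstar_pack.cpp -- minimal B*-tree packing sketch for one device layer.
// Left child: adjacent to the right of its parent (x = x_parent + w_parent).
// Right child: adjacent to and above its parent (same x).
// y-coordinates come from a simple discretized contour (skyline).
#include <algorithm>
#include <cstdio>
#include <vector>

struct Node { int w, h, left, right, x, y; };   // children are indices, -1 = none

std::vector<int> contour(200, 0);               // skyline height per unit x

void place(std::vector<Node>& n, int i, int x) {
    if (i < 0) return;
    Node& b = n[i];
    b.x = x;
    b.y = *std::max_element(contour.begin() + x, contour.begin() + x + b.w);
    std::fill(contour.begin() + x, contour.begin() + x + b.w, b.y + b.h);
    place(n, b.left,  b.x + b.w);   // left child: to the right of b
    place(n, b.right, b.x);         // right child: above b
}

int main() {
    //                 w  h  left right
    std::vector<Node> n = {
        {4, 3,  1,  2, 0, 0},       // n0: root at the bottom-left corner
        {3, 2,  3, -1, 0, 0},       // n1: to the right of n0
        {4, 2, -1, -1, 0, 0},       // n2: above n0
        {2, 2, -1, -1, 0, 0},       // n3: to the right of n1
    };
    place(n, 0, 0);
    for (std::size_t i = 0; i < n.size(); ++i)
        std::printf("module b%zu at (%d,%d), size %dx%d\n",
                    i, n[i].x, n[i].y, n[i].w, n[i].h);
    return 0;
}

Extending this to the multi-layer case amounts to keeping one such tree (and one contour) per device layer and letting the interlayer swap/move perturbations reassign modules between trees.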

4.2.1.2 Simulated Annealing Engine

A simulated annealing engine is used to generate floorplanning solutions. The inputs to our floorplanning algorithm are the areas of all functional modules and the interconnects among modules.

Fig. 4.2: (a) An example floorplan and (b) the corresponding B*-tree.

Fig. 4.3: Extending the B*-tree model to represent 3D floorplanning.

However, the actual dimensions of each module are unknown a priori; only its area is known before placement. That is, we have to treat the modules as soft modules. Thus, we provide the choice of adjusting the aspect-ratio as one perturbation operation. During the simulated annealing process, each module dynamically adjusts its aspect-ratio to fit closely with the adjacent modules, that is, with no dead space between two modules. A traditional weighted cost representing the optimization objectives (area and wire length) is generated after each perturbation.

Different from 2D floorplanning, our 3D floorplanner uses a two-stage approach. The first stage tries to partition the blocks to appropriate layers, and tries to minimize the packed-area difference between layers and the total wire length using all perturbation operations (one through six listed in the previous subsection). However, because the first stage tries to balance the packed areas of the different layers, the floorplans of some layers may not be compactly packed. The second stage is intended to overcome this problem. Thus, in the second stage, we start with the partitioning solution generated by the first stage and focus on adjusting the floorplan of each layer simultaneously with the first four operations. At this point, there are no inter-layer operations to disturb the module partition of each layer obtained from stage one.

One problem of 3D floorplanning is that the final packed area of each layer must match to avoid penalties in chip area. For example, if the final width of the packed modules of layer L1 is larger than that of layer L2 and the height of L1 is smaller than that of L2, a significant portion of chip area is wasted due to the need for the layer dimensions to match for manufacturing. To make sure the dimensions of each layer are compatible, we adopt the concept of dimension deviation dev(F) from [38]. The goal is to minimize dev(F), which measures the deviation of the upper-right corner of each layer's floorplan from the average values Avex and Avey. The value Avex can be calculated as Σi ux(fi)/k, where ux(fi) is the x-coordinate of the upper-right corner of floorplan i and k indicates the number of layers; the value Avey is obtained in a similar manner. Thus, dev(F) is formulated as Σi (|Avex − ux(fi)| + |Avey − uy(fi)|). The modified cost function for the 3D floorplanner can be written as

cost = area + wl + dev(F)   (4.2)

where area and wl are the chip area and wire length, respectively.

4.2.1.3 Temperature Approximation

Although HotSpot-3D can be used to provide temperature feedback, when evaluating a large number of solutions during the simulated annealing procedure it is not wise to invoke the time-consuming temperature calculation every time. Instead of using actual temperature values, we have adopted a power density metric as the thermal-conscious mechanism in our floorplanner. The temperature is heavily dependent on power density, based on a general temperature-power equation: T = P·R = P·(t/(k·A)) = (P/A)·(t/k) = d·(t/k), where t is the thickness of the chip, k is the thermal conductivity of the material, R is the thermal resistance, and d is the power density. Thus, according to the equation above, we can substitute the power density for the temperature to approximate the 3-tie temperature function CT = (T − To)/To proposed in [34], which reflects the thermal effect on a chip. As such, the 3-tie power density function is defined as ΔP = (Pmax − Pavg)/Pavg, where Pmax is the maximum power density and Pavg is the average power density of all modules. The cost function for the 2D architecture used in simulated annealing can be written as

cost = area + wl + ΔP   (4.3)

For 3D architectures, we also adopt the same temperature approximation for each layer as the horizontal thermal consideration. However, since there are multiple layers in a 3D architecture, the horizontal consideration alone is not enough to capture the coupling effect of heat. The vertical relation among modules also needs to be considered, and is defined as OP(TPm) = Σi (Pm + Pmi) × overlap_area(m, mi), where OP(TPm) is the summation, over all modules mi overlapping module m, of their combined power densities multiplied by the corresponding overlapped area. The rationale is that for a module with relatively high power density in one layer, we want to minimize the accumulated power density contributed by overlapping modules located in different layers. We can define the set of modules to be inspected, so that the total overlap power density is TOP = Σi OP(TPi) over all modules in this set. The cost function for the 3D architecture is thus modified as follows:

cost = area + wl + dev(F) + ΔP + TOP   (4.4)

At the end of the algorithm execution, the actual temperature profile is reported by our HS3D tool.
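The following C++ sketch evaluates the cost of Equation (4.4) for a toy two-layer module assignment, pulling together dev(F), the power-density spread ΔP, and the inter-layer overlap term TOP. The module data, the placeholder wirelength, and the term weights are arbitrary assumptions for illustration; the actual floorplanner's weighting is not specified here.

// floorplan_cost.cpp -- sketch of the 3D floorplanning cost of Eq. (4.4):
// cost = a*area + b*wl + c*dev(F) + d*dP + e*TOP, where dP is the power-
// density spread and TOP the inter-layer overlapped power density.
// Module data, netlist wirelength, and weights are illustrative values.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <vector>

struct Module { double x, y, w, h, power; int layer; };

int main() {
    std::vector<Module> m = {
        {0, 0, 4, 3, 2.0, 0}, {4, 0, 3, 4, 1.0, 0},   // layer 0
        {0, 0, 5, 2, 3.0, 1}, {0, 2, 4, 3, 0.5, 1},   // layer 1
    };
    const int layers = 2;
    const double a = 1, b = 1, c = 1, d = 10, e = 0.1;   // assumed weights

    // per-layer bounding boxes, dev(F), and chip area
    std::vector<double> ux(layers, 0), uy(layers, 0);
    for (const auto& blk : m) {
        ux[blk.layer] = std::max(ux[blk.layer], blk.x + blk.w);
        uy[blk.layer] = std::max(uy[blk.layer], blk.y + blk.h);
    }
    double avex = 0, avey = 0, dev = 0, area = 0;
    for (int l = 0; l < layers; ++l) { avex += ux[l] / layers; avey += uy[l] / layers; }
    for (int l = 0; l < layers; ++l) dev += std::fabs(avex - ux[l]) + std::fabs(avey - uy[l]);
    for (int l = 0; l < layers; ++l) area = std::max(area, ux[l] * uy[l]);

    // power-density spread dP = (Pmax - Pavg) / Pavg
    double pmax = 0, pavg = 0;
    for (const auto& blk : m) {
        double pd = blk.power / (blk.w * blk.h);
        pmax = std::max(pmax, pd);
        pavg += pd / m.size();
    }
    double dp = (pmax - pavg) / pavg;

    // TOP: overlapped power density between modules on different layers
    double top = 0;
    for (std::size_t i = 0; i < m.size(); ++i)
        for (std::size_t j = i + 1; j < m.size(); ++j) {
            if (m[i].layer == m[j].layer) continue;
            double ox = std::min(m[i].x + m[i].w, m[j].x + m[j].w) - std::max(m[i].x, m[j].x);
            double oy = std::min(m[i].y + m[i].h, m[j].y + m[j].h) - std::max(m[i].y, m[j].y);
            if (ox > 0 && oy > 0)
                top += (m[i].power / (m[i].w * m[i].h) + m[j].power / (m[j].w * m[j].h)) * ox * oy;
        }

    double wl = 12.0;   // half-perimeter wirelength from the netlist (placeholder)
    double cost = a * area + b * wl + c * dev + d * dp + e * top;
    std::printf("area=%.1f wl=%.1f dev=%.2f dP=%.2f TOP=%.2f  ->  cost=%.2f\n",
                area, wl, dev, dp, top, cost);
    return 0;
}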


Table 4.1: Floorplanning results of 2D architecture.

                     2D                               2D (thermal)
Circuit   wire (um)   area (mm^2)   peakT    wire (um)   area (mm^2)   peakT
Alpha     339672      29.43         114.50   381302      29.68         106.64
xerox     542926      19.69         123.75   543855      19.84         110.45
hp        133202      8.95          119.34   192512      8.98          116.91
ami33     44441       1.21          128.21   51735       1.22          116.97
ami49     846817      37.43         119.42   974286      37.66         108.86

Table 4.2: Floorplanning results of 3D architecture.

                     3D                               3D (thermal)
Circuit   wire (um)   area (mm^2)   peakT    wire (um)   area (mm^2)   peakT
Alpha     210749      15.49         135.11   240820      15.94         125.47
xerox     297440      9.76          137.51   294203      9.87          127.31
hp        124819      4.45          137.90   110489      4.50          134.39
ami33     27911       0.613         165.61   27410       0.645         155.57
ami49     547491      18.55         137.71   56209       18.71         132.69

4.2.1.4 Experimental Results

We implemented the proposed floorplanning algorithm in C++. The thermal model is based on HS3D. In order to effectively explore the architecture-level interconnect power consumption of a modern microprocessor, we need a detailed model that is representative of current-generation high-performance microprocessor designs. We have used IVM (http://www.crhc.uiuc.edu/ACS/tools/ivm), a register-transfer-level Verilog implementation of an Alpha-like architecture (denoted as Alpha in the rest of this section), to evaluate the impact of both interconnect and module power consumption at the granularity of functional modules. A diagram of the processor is shown in Figure 4.4. Each functional block in Figure 4.4 represents a module used in our floorplanner. The registers between pipeline stages are also modeled but not shown in the figure.
Fig. 4.4: Processor model diagram (pipeline stages Fetch, Decode, Rename, Schedule, RegRead, Execute, and Retire, including the L1 instruction/data caches, TLBs, register file, scheduler, ALUs and address-generation units, reorder buffer, and architectural RAT).

The microprocessor implementation has been mapped to a commercial 160 nm standard cell library by Design Compiler and placed and routed by First Encounter under a 1 GHz performance requirement. In total, 34 functional modules and 168 nets were extracted from the processor design. The area and power consumption from the actual layout served as inputs to our algorithm. Besides the Alpha processor, we have also used MCNC benchmarks to verify our approach. An approach similar to [36] is used to assign the average power density for each module in the range of 2.2×10^4 W/m² to 2.4×10^6 W/m². The total net power is assumed to be 30% of the total module power, due to the lack of information for the MCNC benchmarks, and the total wire length used for scaling during floorplanning is the average over 100 test runs that considered the area factor alone. The widely used half-perimeter bounding box model is adopted to estimate the wire length.


Throughout the experiments, a two-layer 3D architecture was assumed due to the limited number of functional modules and the excessively high power density beyond two layers; however, our approach is capable of dealing with multi-layer architectures.

Tables 4.1 and 4.2 show the experimental results of our approach when considering traditional metrics (area and wire) and the thermal effect. When taking the thermal effect into account, our thermal-aware floorplanner reduces the peak temperature by 7% on average while increasing wirelength by 18% and providing a comparable chip area, as compared to the floorplan generated using traditional metrics. When we move to 3D architectures, the peak temperatures increase by 18% on average compared to the 2D floorplan due to the increased power density; however, the wire length and chip area are reduced by 32% and 50%, respectively. The adverse effect of the increased power density in the 3D design can be mitigated by our thermal-aware 3D floorplanner, which lowers the peak temperature by 8 degrees on average with little area increase, compared to the 3D floorplanner that does not account for thermal behavior. As expected, the chip temperature is higher when we step from 2D to 3D architectures without thermal consideration. Although the wire length is reduced when moving to 3D, and interconnect power consumption is reduced accordingly, the temperature of the 3D architecture is still relatively high due to the accumulated power densities and the smaller chip footprint. After applying our thermal-aware floorplanner, the peak temperature is lowered to a moderate level through the separation of high-power-density modules into different layers.

4.2.2 Thermal-aware 3D Placement and Routing

Placement is an important step in the physical design flow, since the quality of placement results has a significant impact on performance, power, temperature, and routability. Therefore, a thermal-aware 3D placement tool is needed to take full advantage of 3D IC technology. The 3D thermal-aware placement problem is defined as follows: given a circuit with a set of cell instances and a set of nets, the number of device layers, and the per-layer placement region, find a placement for every cell in that region so that an objective function of weighted total wirelength is minimized, under certain constraints such as overlap-free constraints, performance constraints, and temperature constraints. Since the weighted total wirelength is a widely accepted metric for placement quality, it is considered the objective function of the placement problem. The objective depends on the placement and is obtained from a weighted sum of the wirelength of instances and the number of through-silicon vias (TSVs) over all the nets. For thermal consideration, temperatures are not directly formulated as constraints but are appended as a penalty to the wirelength objective. There are several ways to formulate this penalty: a weighted temperature penalty transformed into thermal-aware net weights, a thermal distribution cost penalty, or the distance from the cell location to the heat sink during legalization.

Goplen and Sapatnekar [26] proposed an iterative force-directed approach for 3D thermal-aware placement, in which thermal forces direct cells away from areas of high temperature. Finite element analysis (FEA) is used to calculate temperatures efficiently during each iteration. Balakrishnan et al. [30] presented a 3D global placement technique that determines the region in which a group of cells should be located. This approach involves a two-stage refinement procedure: initially, a multilevel min-cut based method is used to minimize the congestion and power dissipation within confined areas of the chip; this is followed by a simulated annealing-based technique that serves to minimize the congestion created by global wires as well as to improve the temperature distribution of the circuit. Cong et al. [31] proposed a thermal-aware 3D cell placement approach, named T3Place, based on transforming a 2D placement with good wirelength into a 3D placement, with the objectives of half-perimeter wirelength, through-the-silicon (TS) via number, and temperature. T3Place is composed of two steps: transformation from a 2D placement to a 3D placement, and refinement of the resulting 3D placement. Later, Cong and Luo [32] improved the approach with a multilevel nonlinear-programming-based 3D placement technique. This approach relaxes the discrete layer assignments so that they are continuous in the z-direction and the problem can be solved by an analytical global placer. A density penalty function is added to guide the overlap removal and device layer assignment simultaneously.

Fig. 4.5: Thermal-Driven 3-D Multilevel Routing with TS-via Planning.

A thermal-driven multilevel routing tool is presented by Cong et al. [33]. The routing problem is defined as: given the design rules, the height and thermal conductivity of each material layer, a 3D placement or floorplan result with white space reserved for interconnect, and a maximum allowed temperature, route the circuit so that the weighted cost of wirelength and the total number of TSVs is minimized while satisfying the design rules. Figure 4.5 shows the flow of thermal-driven 3D multilevel routing with TS-via (through-silicon via) planning. A weighted area sum model is used to estimate the routing resources, which are reserved for local nets at each level. In addition, the average power density of each tile is computed during the coarsening process. In the initial routing stage, each multi-pin net generates a 3D Steiner tree, which starts from a minimum spanning tree (MST). Then, at each edge of the initial MST, a point-to-path maze-searching engine is applied. Steiner edges are generated when the searching algorithm reaches existing edges of the tree. After maze searching, the number of dummy TS-vias to insert is estimated through binary search. The upper bound on dummy TS-vias is estimated from the area of white space between the blocks. After the estimation, a TS-via number distribution step assigns the dummy TS-vias. The TS-vias are refined to minimize wirelength and maximum temperature during each refinement step. During the planning process, the multilevel routing system communicates with the thermal model, which provides the latest thermal profile. The thermal model returns a temperature map by reading

the tile structure and the TS-via count at each tile, and assigns a temperature value to each tile. Cong and Zhang later improved the TS-via planning algorithm to further reduce the chip temperature [?]. The TS-via minimization problem with temperature constraints is formulated as a constrained nonlinear programming (NLP) problem based on the thermal resistive model. An efficient heuristic algorithm, named m-ADVP, is developed to solve a sequence of simplified via planning subproblems in alternating directions in a multilevel framework. The vertical via distribution is formulated as a convex programming problem, and the horizontal via planning is based on two efficient techniques: path counting and heat propagation. Experimental results show that the m-ADVP algorithm is more than 200x faster than directly solving the NLP formulation for via planning, with very similar solution quality (within 1% of the TS-via count). Compared with the TS-via planning algorithm described in the previous paragraph, the improved algorithm can reduce the total TS-via number by over 68% for the same required temperature with similar runtime.

4.2.3 Other Physical Design Problems

4.2.3.1 Clock Tree Design

To leverage the multifaceted benefits of latency, power, and high density from inter-die vias, researchers have also proposed methods to minimize the wire length and power of a zero-skew clock tree for a 3D IC. By doing so, the clock signals are directly relayed from other die layers by routing them through the inter-die vias, reducing the traveling distance by several orders of magnitude. Although this approach improves the efficiency of constructing the clock distribution network (that is, better routability), it inadvertently introduces another problem that makes prebond testability much more challenging. Such 3D routing algorithms often generate a clock distribution network in which most of the die layers contain a plethora of isolated clock islands that are not fully connected. Testing each prebond layer then requires multiple synchronous clock input sources, that is, using multiple test probes to feed those disjoint

clock islands during the prebond test phase. Nonetheless, applying multiple test probes could also undermine the requirement of maintaining a zero-skew clock tree. As discussed in the 3D Testing section, prebond test is indispensable for the commercial success of 3D ICs due to yield concerns. As a result, 3D clock tree design has two goals to achieve. First, each prebond layer under test should contain a fully testable, zero-skew clock tree. Second, a zero-skew clock tree must also be guaranteed for the final integrated 3D chip after bonding. To address this problem, Zhao et al. [43] employ a redundancy design that provides a configurable, complete 2D clock tree on each layer for prebond test purposes, while simultaneously maintaining a 3D clock routing tree for normal operation postbond. On each layer, TSV buffers are used to ensure zero skew of the 2D clock trees. The 2D trees are used only for prebond test and are gated off afterward. One drawback of this approach is that routability could be worsened rather than improved as a result of the redundant clock trees. Kim et al. [44] propose an improved solution using far fewer buffer resources with a new tree topology generation algorithm, in which the transmission gate control lines used in [43] are replaced by a specially designed component called a self-controlled clock transmission gate (SCCTG).

4.2.3.2 Power Supply Network Design

Among various physical design problems, power/ground (P/G) network synthesis plays an important role in the early stages of the IC design flow. P/G network synthesis sizes and places the power and ground lines for the chip. There are two important optimization goals for P/G network synthesis: (1) minimize IR drop, which can have a negative impact on the performance of the circuit. IR drop has become more important as technology scales, because as the wires shrink with smaller feature sizes, the resistances along the power lines increase. (2) Minimize the routing area of the P/G wires, which otherwise causes congestion during later routing stages. However, there is a potential conflict between these two optimization goals. For example, using wider wires for the P/G network helps reduce IR drop, but the routing area is increased. Consequently, a design tradeoff between the P/G area and IR drop is needed.
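The tradeoff can be made concrete with a first-order IR-drop estimate. The Python sketch below models a power line as a chain of resistive segments carrying tap currents and shows how widening the wire lowers the voltage drop while increasing the routed area; the sheet resistance and current values are illustrative numbers, not data from the works cited here.

    def ir_drop_and_area(wire_width_um, length_um, tap_current_ma,
                         n_taps=10, sheet_res_ohm_sq=0.05):
        """First-order IR drop along a power line with evenly spaced taps.

        The line is split into n_taps segments; each tap draws
        tap_current_ma.  Segment resistance = Rsheet * (L/n) / W, and the
        current through segment i is the sum of all downstream tap currents.
        Returns (worst-case IR drop in mV, routed wire area in um^2).
        """
        seg_len = length_um / n_taps
        seg_res = sheet_res_ohm_sq * seg_len / wire_width_um  # ohms
        drop_mv = 0.0
        for i in range(n_taps):
            downstream_ma = tap_current_ma * (n_taps - i)
            drop_mv += seg_res * downstream_ma  # ohm * mA = mV
        return drop_mv, wire_width_um * length_um

    # Doubling the width roughly halves the IR drop but doubles the area:
    # ir_drop_and_area(2.0, 1000.0, 1.0)  vs  ir_drop_and_area(4.0, 1000.0, 1.0)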

A few works have explored 3D P/G construction. Decoupling capacitors (decaps) can be used during placement to reduce the power supply noise [45, 46]. In these works, the resistive and inductive effects of a 3D P/G grid were used to calculate the decap budget for each placed module. Considering decaps during placement reduced the total number of decaps needed to adequately reduce the power supply noise [47]. In addition to using decaps to reduce P/G noise, the density of P/G TSVs can be adjusted to reduce the power supply noise [48]. In a traditional design flow, floorplanning and P/G network design are two sequential steps: floorplanning is performed first to optimize chip area and wire length, and then the P/G network is synthesized. However, it has been shown by Liu and Chang [49] that, in 2D IC design, if the floorplan does not consider the P/G network, the resulting floorplan may lead to an inefficient P/G network with high IR drops and several P/G voltage violations. It is very difficult to repair a P/G network after the post-layout stage, so the P/G network should be considered as early as possible. Consequently, Liu and Chang proposed a floorplan and P/G network co-synthesis method to efficiently solve the problem [49]. In floorplan and P/G co-synthesis, the P/G network is created during floorplanning. During co-synthesis, the floorplan not only tries to minimize the wirelength and area of the design, but also ensures that there are no power violations for the blocks in the floorplan. This results in a floorplan with an efficient P/G network. Falkenstern et al. [50] proposed a 3D floorplanning and P/G co-synthesis flow, in which P/G network synthesis is added to the floorplanning iterations guided by simulated annealing.

4.3 3D Testing

One of the challenges for 3D technology adoption is the insufficient understanding of 3D testing issues and the lack of DFT solutions. Test techniques and DFT solutions for 3D ICs have remained largely unexplored in the research community. This section describes testing challenges for 3D ICs, including problems that are unique to 3D integration, and summarizes early research results in this area.

4.3.1 Pre-bond Die Testing

To make 3D IC designs commercially viable, the most critical step is the final integration, which must ensure that only known-good dies (KGDs) are bonded and packaged. As the number of 3D layers increases, a random bonding strategy will never be economically practical, because it could decrease the overall yield exponentially. To effectively address this issue, designers must ensure that each individual die layer is designed to be testable before bonding takes place. The primary challenge posed by many of these architectural partitioning proposals is that each layer, prior to bonding, can contain only partial circuits of the entire system. For example, in a planar processor design, the instruction queue, the ALU, and the physical register file are logically connected in series on a 2D floorplan, whereas with 3D partitioning they can be placed on different layers to take advantage of the short communication latency of inter-die vias. Enabling prebond tests for such 3D IC designs will require completely reconsidering conventional DFT strategies. Lewis and Lee [51] described a general test architecture that enables pre-bond testability of an architecturally partitioned 3D processor and provided mechanisms for basic layer functionality. They also proposed test methods for designs partitioned at the circuit level, in which the gates and transistors of individual circuits could be split across multiple die layers [52]. They investigated a bit-partitioned adder unit and a port-split register file, which represents the most difficult circuit-partitioned design to test pre-bond but which is used widely in many circuits.

4.3.2 Testing of TSVs

TSVs are key components of a 3D IC and are used for delivering power as well as clock and functional signals. In addition, they provide test access to logic blocks on different layers. One critical issue that requires taking DFT into account in 3D IC design is the assembly yield of the TSVs after bonding. Even a single TSV defect between any two layers can void the entire chip stack, reducing the overall yield. To guarantee the full functionality of 3D TSV interconnects, designers must take the precaution of implementing redundancy and self-repairing mechanisms for each TSV, especially those transmitting signals. TSVs for power delivery and clock distribution can be more tolerant of defects as long as the power or clock distribution network remains a connected graph, and the skew or the current needed to drive loads is unaffected in the presence of defects. Self-repairing interconnects can be achieved by inserting redundant TSVs along with one or a cluster of the required TSVs and their selection logic. Since the yield of TSVs has a dramatic influence on the overall chip yield, pre-bond TSV testing schemes are needed in order to reduce the risk of bonding dies that have irreparable TSV failures. As via-first TSVs have one end that is not only floating but also buried in the wafer substrate before thinning and bonding, double-ended probing for TSV resistance measurement is not feasible. Alternatively, a single-ended probing based TSV test method was proposed [53] to diagnose prebond TSVs using parasitic capacitance measurements. The TSV is modeled as a conductor whose parasitic capacitance lies in a given range. If there is any defect in the TSV or in the insulator surrounding the TSV, the measured parasitic capacitance will deviate from the nominal value. In this way, faulty TSVs can be detected before bonding.
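The capacitance-based screening just described amounts to a simple threshold test. The Python sketch below flags TSVs whose measured capacitance falls outside a tolerance band around the nominal value; the nominal capacitance and tolerance are illustrative parameters, not values from [53].

    def screen_tsvs(measured_caps_ff, nominal_ff=40.0, tolerance=0.15):
        """Return the indices of TSVs whose measured parasitic capacitance
        deviates from the nominal value by more than the given fraction.

        measured_caps_ff: list of single-end probe measurements in fF.
        A low reading typically indicates an open/void defect, while a
        high reading suggests insulator (leakage) problems.
        """
        faulty = []
        for idx, cap in enumerate(measured_caps_ff):
            if abs(cap - nominal_ff) > tolerance * nominal_ff:
                faulty.append(idx)
        return faulty

    # Example: TSV 2 reads far below nominal and is flagged as faulty.
    # screen_tsvs([39.1, 41.0, 22.5, 40.3])  ->  [2]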

4.3.3 Scan Chain Design and Test Access Optimization

Modular testing, which is based on test access mechanisms (TAMs) and IEEE Std 1500 core test wrappers, provides a low-cost solution to the test access problem for an SoC. For today's 2D ICs, several optimization techniques have been reported in the literature for test infrastructure design to minimize test time. Similar techniques are needed for 3D ICs, but we are now confronted with an even more difficult test access problem: the embedded cores in a 3D IC might be on different layers, and even the same embedded core could have blocks placed on different layers. Only a limited number of TSVs can be reserved for use by the TAM [54]. Recent research on this issue has addressed the dual problems of designing scan chains for cores placed on multiple layers [55] and the optimization of the TAM for a core-based 3D IC [56]. Consider the example shown in Figure 4.6, which shows five (wrapped) cores in a 3D IC.

Fig. 4.6: Two scenarios for test access mechanism (TAM) optimization for a 3D core-based SoC: Approach 1 (a); Approach 2 (b).

Four cores are placed on Layer 0, and the fifth core is placed on Layer 1. Suppose a total of 2W channels (that is, wires from the TAM) are available from the tester to access the cores, out of which W channels must be used to transport test stimuli, and the remaining W channels must be used to collect the test responses. We also have an upper limit on the number of TSVs that can be used for the test access infrastructure. The TAM design needs to be optimized to minimize the test application time. Therefore, the objective here is to allocate wires (bit widths) to the different TAMs used to access the cores. In Figure 4.6a, we have two TAMs of widths w1 and w2 (w1 + w2 = W) and the core on Layer 1 is accessed using the first TAM. A total of 2w1 TSVs are needed for this test infrastructure design. Figure 4.6b presents an alternative design in which three TAMs are used, and the core on Layer 1 is accessed using the third TAM (v1 + v2 + v3 = W). Note that both 2w1 and 2v3 must be no more than the number of TSVs dedicated for test access. Optimization methods and design tools are needed to determine a test access architecture that leads to the minimum test time under the constraints of wire length and the number of TSVs. Wu et al. presented techniques to address this problem [56]. An alternative technique does not impose any limits on the number of TSVs used for the TAM, but considers

prebond test considerations and wire-length limits [57]. Continued advances are needed to incorporate thermal constraints that limit the amount of test parallelism, and to make effective use of prebond test access hardware for testing after assembly.
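As a small illustration of the optimization problem above, the Python sketch below enumerates ways of splitting a total TAM width W among a set of TAMs, keeps only allocations whose TSV usage stays within the budget, and picks the one with the smallest estimated test time. The test-time model (time inversely proportional to the width of the TAM serving each core) is a deliberately crude stand-in for the wrapper/TAM co-optimization used in [56].

    from itertools import product

    def best_tam_allocation(core_bits, core_layer, total_width, tsv_budget):
        """Brute-force TAM width allocation for a tiny core-based 3D SoC.

        core_bits:  test data volume (bits) per core, indexed by core id
        core_layer: layer of each core (0 = bottom); each core gets its own TAM
        total_width: W, the total stimulus (and response) channel width
        tsv_budget: maximum TSVs available for test access (2 per wire to layer 1)
        Returns (best widths, estimated test time) or None if infeasible.
        """
        n = len(core_bits)
        best = None
        for widths in product(range(1, total_width + 1), repeat=n):
            if sum(widths) != total_width:
                continue
            tsvs = sum(2 * w for w, l in zip(widths, core_layer) if l > 0)
            if tsvs > tsv_budget:
                continue
            # Cores on separate TAMs are tested in parallel; the slowest
            # TAM determines the session length.
            time = max(bits / w for bits, w in zip(core_bits, widths))
            if best is None or time < best[1]:
                best = (widths, time)
        return best

    # Example: three cores, 16 total channels, at most 8 TSVs for test.
    # best_tam_allocation([4000, 2500, 6000], [0, 0, 1], 16, 8)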

4.4 3DCacti: an Early Analysis Tool for 3D Cache Design

To justify the 3D cost overhead, it is essential to study the benefits of 3D integration early in the design cycle. This usually requires a strong link between architectural analysis tools and 3D physical planning tools. Early analysis tools study the tradeoffs among the number of layers, power density, and performance. In this section, we describe an example of an early design analysis tool for 3D ICs: 3DCacti, which explores the architectural design space of cache memories using 3D structures. Since interconnects dominate the delay of cache accesses, which often determines the critical path of a microprocessor, exploring the benefits of advanced technologies is particularly important. The regular structure and long wires of a cache make it one of the best candidates for 3D design. A tool to predict the delay and energy of a cache at the early design stage is crucial, because the timing profile and the optimized configurations of a cache depend on the number of active device layers available as well as the way a cache is partitioned across those layers. This section examines possible partitioning techniques for caches designed using 3D structures and presents a delay and energy model to explore different options for partitioning a cache across device layers.

4.4.1 3D Cache Partitioning Strategies

In this section, we discuss different approaches to partitioning a cache into multiple device layers. The finest granularity of partitioning a cache is at the SRAM cell level. At this level of partitioning, any of the six transistors of an SRAM cell can be assigned to any layer. For example, the pull-up PMOS transistors can be in one device layer, and the access transistors and the pull-down NMOS transistors can be in another layer. The benefits of cell-level partitioning include the reduction in footprint of the cache

arrays and, consequently, the routing distance of the global signals. However, the feasibility of partitioning at this level is constrained by the 3D via size as compared to the SRAM cell size. Assuming that, limited by its aspect ratio, the 3D via cannot be scaled below 1 µm x 1 µm, the 3D via has a size comparable to that of a 2D 6T SRAM cell in 180 nm technology and is much larger than a single cell in 70 nm technology. Consequently, when the 3D via size does not scale with feature size (which is the case for wafer-bonding 3D integration), partitioning at the cell level is not feasible. In contrast, partitioning at the SRAM cell level is feasible in technologies such as MLBS, because no limitation is imposed on via scaling with feature size. However, it should be noted that even if the size of a 3D via can be scaled to as small as a nominal contact in a given technology, the total SRAM cell area reduction (as compared to a 2D design) due to the use of additional layers is limited, because metal routing and contacts occupy a significant portion of the 2D SRAM cell area [58]. Consequently, partitioning at a higher level of granularity is more practical. For example, individual sub-arrays in the 2D cache can be partitioned across multiple device layers. Partitioning at this granularity reduces the footprint of the cache arrays and the routing lengths of global signals. However, it also changes the complexity of the peripheral circuits. In this section, we consider two options for partitioning the sub-array into multiple layers: the 3D divided wordline (3DWL) strategy and the 3D divided bitline (3DBL) strategy. 3D Divided Wordline (3DWL) - In this partitioning strategy, the wordlines in a sub-array are divided and mapped onto different active device layers (see Fig. 4.9). The local wordline decoder corresponding to the original wordline in the 2D sub-array is placed on one layer and is used to feed the wordline drivers on different layers through the 3D vias. We duplicate wordline drivers for each layer. The duplication overhead is offset by the resized drivers for the smaller capacitive load on the partitioned wordline. Further, the delay of pulling a wordline decreases as the number of pass transistors connected to a wordline driver becomes smaller. The delay calculation of the 3DWL also accounts for the 3D via area utilization. The area overhead due to 3D vias is small relative to the number of cells on a wordline.

Another benefit of 3DWL is that the length of the address lines from the periphery of the core to the wordline decoder decreases in proportion to the number of device layers. Similarly, the routing distance from the output of the pre-decoder to the local decoder is reduced. The select lines for the writes and muxes as well as the wires from the output drivers to the periphery also become shorter. 3D Divided Bitline (3DBL) - This approach is akin to the 3DWL strategy and applies partitioning to the bitlines of a sub-array (see Fig. 4.7). The bitline length in the sub-array and the number of pass transistors connected to a single bitline are reduced. In the 3DBL approach, the sense amplifiers can either be duplicated across different device layers or shared among the partitioned sub-arrays in different layers. The former approach is more suitable for reducing access time, while the latter is preferred for reducing the number of transistors and leakage. In the latter approach, the sharing increases the complexity of bitline multiplexing and reduces performance as compared to the former. Similar to 3DWL, the lengths of the global lines are reduced in this scheme.
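To see why dividing wordlines or bitlines helps, a first-order RC estimate is enough: the distributed wire delay grows quadratically with length, so splitting a line across N layers cuts that term roughly by N squared. The Python sketch below captures this intuition with a lumped Elmore-style model; all device and wire parameters are illustrative, not 3DCacti's internal values.

    def line_delay_ns(n_cells, n_layers, r_per_cell=2.0, c_per_cell=1.5e-3,
                      r_driver=5000.0):
        """Rough Elmore delay of a wordline/bitline split across n_layers.

        r_per_cell (ohm) and c_per_cell (pF) model the wire segment and the
        transistor load contributed by each cell; r_driver (ohm) models the
        (resized) driver.  After partitioning, each layer's segment carries
        only n_cells / n_layers cells.
        """
        cells = n_cells / n_layers
        c_total = cells * c_per_cell                   # pF on this segment
        wire = 0.5 * (cells * r_per_cell) * c_total    # distributed-wire term
        drive = r_driver * c_total                     # driver charging term
        return (wire + drive) * 1e-3                   # ohm*pF = ps; return ns

    # Splitting a 256-cell line over 2 layers shortens both terms:
    # line_delay_ns(256, 1)  vs  line_delay_ns(256, 2)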

4.4.1.1 3D Cache Delay-Energy Estimator (3DCacti)

In order to explore the 3D cache design space, a 3D cache delay-energy estimation tool called 3DCacti was developed [59]. The tool was built on top of the Cacti 3.0 2D cache tool (Cache Access and Cycle Time) [60]. 3DCacti searches for the optimized configuration that achieves the best delay, power, and area-efficiency trade-off, according to a cost function, for a given number of 3D device layers. In the original Cacti tool [60], several configuration parameters (see Table 4.3) are used to divide a cache into sub-arrays to explore delay, energy, and area-efficiency trade-offs. In the 3DCacti implementation, two additional parameters, Nx and Ny, are added to model the intra-subarray 3D partitions. The additional effects of varying each parameter, other than the impact on the length of global routing signals, are listed in Table 4.3. Note that the tag array is optimized independently of the data array, and the configuration parameters for the tag array (Ntwl, Ntbl, and Ntspd) are not listed.

Fig. 4.7: Cache with 3D divided bitline partitioning, mapped onto 2 active device layers.

Fig. 4.8: An example showing how each configuration parameter (Ndwl, Ndbl, Nspd, Nx:Ny) affects the cache structure, for a cache with size C = 32 kB, block size B = 16 B, and associativity A = 2.

Fig. 4.8 shows an example of how these configuration parameters used in 3DCacti affect the cache structure. The cell-level partitioning approach (using MLBS) is implicitly simulated using a different cell width and height within Cacti.

Table 4.3: Design parameters for 3DCacti and their impact on cache design.

Ndbl - the number of cuts on a cache to divide bitlines. Affects: (1) the bitline length in each sub-array; (2) the number of sense amplifiers; (3) the size of the wordline driver; (4) the decoder complexity; (5) the multiplexor complexity in the data output path.

Ndwl - the number of cuts on a cache to divide wordlines. Affects: (1) the wordline length in each sub-array; (2) the number of wordline drivers; (3) the decoder complexity.

Nspd - the number of sets connected to a wordline. Affects: (1) the wordline length in each sub-array; (2) the size of the wordline drivers; (3) the multiplexor complexity in the data output path.

Nx - the number of 3D partitions by dividing wordlines. Affects: (1) the wordline length in each sub-array; (2) the size of the wordline driver.

Ny - the number of 3D partitions by dividing bitlines. Affects: (1) the bitline length in each sub-array; (2) the multiplexor complexity in the data output path.

Fig. 4.9: Cache with 3D divided wordline partitioning, mapped onto 2 active device layers.

4.4.2 Design Exploration using 3DCacti

By using 3DCacti, we can explore various 3D partitioning options of caches to understand their impact on delay and power at the very early design stage. Note that the data presented in this section is in 70 nm technology, assuming 1 read/write port and 1 bank in each cache, unless

otherwise stated. First, we explore the best configurations for various degrees of 3DWL and 3DBL in terms of delay. Fig. 4.10 and Fig. 4.11 show the access delay and the energy consumption per access for 4-way set-associative caches of various sizes and different 3D partitioning settings. Recall that Nx (Ny) in the configuration refers to the degree of 3DWL (3DBL) partitioning. First, we observe that delay decreases as the number of layers increases. From Fig. 4.12, we observe that the reduction in the global wiring length of the decoder is the main reason for this delay reduction. We also observe that, for the 2-layer case, partitioning a single cell using MLBS provides delay reduction benefits similar to the best intra-subarray partitioning technique as compared to the 2D design.
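The exploration itself is a search over the partitioning parameters. The Python sketch below enumerates the (Nx, Ny) splits that use a given number of device layers and keeps the configuration with the lowest cost returned by a user-supplied delay/energy model; the evaluate_config callback stands in for 3DCacti's internal delay and energy models and is purely an assumption here.

    def explore_3d_partitions(n_layers, evaluate_config,
                              ndwl_opts=(1, 2, 4, 8, 16, 32),
                              ndbl_opts=(1, 2, 4, 8),
                              nspd_opts=(1, 2, 4, 8, 16, 32)):
        """Enumerate 2D sub-array parameters and 3D splits (Nx, Ny).

        Only (Nx, Ny) pairs with Nx * Ny == n_layers are considered,
        mirroring the Nx*Ny notation used in the figures.  evaluate_config
        must return a single scalar cost (e.g. weighted delay and energy)
        for a configuration dictionary.
        """
        best_cfg, best_cost = None, float("inf")
        splits = [(nx, n_layers // nx) for nx in range(1, n_layers + 1)
                  if n_layers % nx == 0]
        for ndwl in ndwl_opts:
            for ndbl in ndbl_opts:
                for nspd in nspd_opts:
                    for nx, ny in splits:
                        cfg = {"Ndwl": ndwl, "Ndbl": ndbl, "Nspd": nspd,
                               "Nx": nx, "Ny": ny}
                        cost = evaluate_config(cfg)
                        if cost < best_cost:
                            best_cfg, best_cost = cfg, cost
        return best_cfg, best_cost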
Fig. 4.10: Access time for different partitionings. Data for caches of associativity 4 are shown.

Another general trend observed for all cache sizes is that partitioning more aggressively using 3DWL results in faster access times. For example, in the 4-layer case, the 4*1 configuration has an access time that is 16.3% less than that of the 1*4 configuration for a 1MB cache. The benefits of more aggressive 3DWL stem from the global wires being longer in the X direction than in the Y direction before 3D partitioning is performed. The preference for shorter bitlines for delay minimization in each of the sub-arrays, and the resulting wider sub-arrays in the optimal 2D configuration, is the reason for the difference in wire lengths along the two directions.

Fig. 4.11: Energy for different partitionings when setting the weightage of delay higher. Data for caches of associativity 4 are shown.
Fig. 4.12: Access time breakdown of a 1MB cache corresponding to the results shown in Fig. 4.10.

Fig. 4.13: Energy breakdown of a 1MB cache corresponding to the results shown in Fig. 4.11.

For example, in Fig. 4.14(a), the best sub-array configuration for the 1MB cache in the 2D design results in the global wire length being longer in the X direction. Consequently, when wordlines are divided along the third dimension, a more significant reduction in critical global wiring length can be achieved. Note that, because 3DCacti explores partitioning across the dimensions simultaneously, some configurations can result in 2D configurations that have greater wirelengths in the Y direction (see Fig. 4.14(c)), as in the 1MB cache 1*2 configuration for 2 layers. The 3DBL helps reduce the global wire delays by reducing the Y-direction length. However, it is still not as effective as the corresponding 2*1 configuration, as both the bitline delays in the core and the routing delays are larger (see Fig. 4.12 and Fig. 4.13). These trends are difficult to analyze without the help of a tool that partitions across multiple dimensions simultaneously. The energy reduction for the corresponding best-delay configurations tracks the delay reduction in many cases. For example, the energy of the 1MB cache increases when moving from an 8*1 configuration to a 1*8 configuration. In these cases,

Fig. 4.14: Critical paths in 3DWL and 3DBL for a 1MB cache. Dashed lines represent the routing of address bits from the pre-decoder to the local decoder, while the solid arrow lines are the routing paths from the address inputs to the pre-decoders.

the capacitive loading that affects delay also determines the energy trends. However, in some cases, the energy reduces significantly when changing configurations and does not track the performance behavior. For example, for the 512KB cache using 8 layers, the energy reduces when moving from the 2*4 to the 1*8 configuration. This stems from the difference in the number of sense amplifiers activated in these configurations, due to the different number of bitlines in each sub-array and the presence of the column decoders after the sense amplifiers. Specifically, the optimum (Ndwl, Ndbl, Nspd) for the 512KB case is (32,1,16) for the 2*4 configuration and (32,1,8) for the 1*8 configuration. Consequently, the number of sense amplifiers activated per access for

the 1*8 configuration is only half as much as that of the 2*4 configuration, resulting in a smaller energy.

5 3D FPGA Design

5.1 FPGA Background

Field Programmable Gate Arrays (FPGAs) have become a viable alternative to custom Integrated Circuits (ICs) by providing flexible computing platforms with short time-to-market and low cost. In an FPGA-based system, a design is mapped onto an array of reconfigurable logic blocks, which are connected by reprogrammable interconnections composed of wire segments and switch boxes. The configuration and programming bits are stored in SRAM cells, and they are usually loaded from off-chip ROM or flash at boot-up. The baseline fabric of SRAM-based FPGAs, as shown in Fig. 5.1, consists of a 2-D array of logic blocks (LBs) that can be interconnected via programmable routing. Each LB contains four slices, each consisting of two four-input lookup tables (LUTs). The programmable routing comprises interconnect segments that can be connected to the LBs via connection boxes (CBs) and to each other via switching boxes (SBs). Programming bits are used to implement logic functions in LUTs and to determine logic connectivity in CBs and SBs. Conventionally, these bits are loaded from off-chip at power-up and stored in SRAM cells.

Fig. 5.1: Classical SRAM-based FPGA architecture. (LB: logic blocks; SB: switching blocks; CB: connection blocks; M: memory elements)

Modern FPGAs also have embedded RAM blocks for high-bandwidth data exchange during the computation.

5.2 3D FPGA Design

To provide the required reconfigurable functionality, FPGAs provide a large amount of programmable interconnect resources in the form of wire segments, switches, and signal repeaters. Typically, the delay of FPGAs is dominated by these interconnects. Reducing the lengths of the interconnects can lead to significant improvements in the performance of FPGAs. Moreover, these programmable interconnect resources typically consume a large portion of the FPGA silicon die area. Since die area is one of the main factors that determine manufacturing cost, reducing the silicon footprint can lead to significant reductions in the manufacturing cost of FPGAs. As an emerging technology for integration, three-dimensional (3D) ICs can increase the performance, functionality, and logic density of integrated systems. In the past, various approaches for implementing 3-D FPGAs have been considered [61-68]. They include FPGAs based on 3-D routing switches with electrical or optical inter-stratum interconnections [66-68] or partitioning of memory and routing functions onto different strata [63]. These earlier works focused on either routing

architectures or FPGA technologies. Rahman et al. [64] presented analytical models for predicting interconnect requirements in field-programmable gate arrays (FPGAs), and examined opportunities for three-dimensional (3-D) implementation of FPGAs. In their exploration, 3D (6-way) switching blocks were used and all the components in the FPGA were evenly distributed between layers at fine granularity. Experimental results in a 0.25 µm technology showed that in FPGAs with 70K logic cells, the improvement in LUT density can be 25-60% for a 3-D implementation with two to four strata. The interconnect delay is significantly reduced, and with 3-D integration using 2-4 strata at the same clock frequency as the 2-D FPGA, the reduction in power dissipation is 35%-55%. Lin et al. [65] proposed a 3D FPGA architecture with special stacking techniques, in which all SRAM cells were moved to a separate layer and monolithic stacking was used so that there was no overhead on inter-layer connections. In monolithic stacking, electronic components and their connections (wiring) are lithographically built in layers on a single wafer, hence no TSV is needed. In this case, the area overhead of TSVs is eliminated and the logic density is greatly improved by stacking. However, applications of this method are currently limited because creating normal transistors requires enough heat to destroy any existing wiring. To facilitate design and evaluation using 3D FPGAs, new CAD tools for 3D FPGA integration are needed. Alexander et al. [62] proposed 3D placement and routing algorithms. Ababei et al. [69] presented a fast placement tool for 3D FPGAs called TPR (Three-dimensional Place and Route). The tool effectively investigated the effect of 3D integration on the delay of FPGAs in addition to the interconnect wirelength.

5.3 PCM-Based 3D Non-Volatile FPGA

In this section, we propose a novel 3D non-volatile FPGA architecture (3D-NonFAR), covering the renovation of the basic FPGA structures as well as the layer partitioning and logic density evaluation for 3D die stacking.

Fig. 5.2: Renovated FPGA basic structures with PCM MLC cells.

5.3.1 PCM-Based FPGA Basic Structures

Firstly, we choose the Xilinx Virtex-4 island-based FPGA [70] as the baseline architecture for our study. With PCM technology, the SRAM cells in FPGAs can be replaced with PCM cells as follows, and the logic density can be significantly improved.

Configuration bits in LBs, CBs, and SBs. Writing to these bits only happens at configuration time; during FPGA run time, these bits are only read. Thus the write speed for these bits is not an issue, and they can be implemented using PCM MLC cells, which have a slightly slower write speed than SLC cells. The logic density improvement per bit of PCM MLC against SRAM is about 16x [71]. The renovated basic FPGA structures with PCM MLC cells are shown in Fig. 5.2.

Embedded RAM blocks. The read/write speed of RAM blocks is critical for the performance of FPGAs, so they are implemented using PCM SLC cells. From Table 5.1, the PCM SLC write latency is 10-40x larger than that of SRAM. However, through a set of write optimizations such as write coalescing [72] and aggressive word-line/bit-line cutting [73], PCM can achieve about the same performance as DRAM ([72] reported 1.2x slower than DRAM). Simulation results [73] show that the logic density improvement per bit against SRAM is about 10x.

5.3.2 3D PCM-Based FPGA with Die Stacking

In addition to the high-performance reconfigurable logic fabric, modern FPGAs also have many built-in system-level blocks, including microprocessor cores, DSP cores, I/O transceivers, etc. These features allow logic designers to build the highest levels of performance and functionality into their FPGA-based systems. In order to efficiently integrate these system-level blocks in the new architecture, their characteristics need to be accurately modeled. The areas of components including system-level blocks are estimated using the models presented in previous studies [64, 74]. An area breakdown of Xilinx Virtex-4 Platform FPGAs based on SRAM is depicted in Fig. 5.3, where SRAM occupies about 40% of the total area of planar FPGAs. Replacing the SRAM with high-density PCM cells can significantly improve the logic density and reduce chip area.
Fig. 5.3: Area breakdown of Xilinx Virtex-4 Platform FPGAs.

Heterogeneous three-dimensional (3D) die stacking is an effective way to further improve the logic density and reduce the chip footprint. Stacking an FPGA begins with partitioning the resources and assigning them to different layers. We use a simulated-annealing based layer assignment algorithm similar to [75] to minimize the following cost function:

    cost = α (A + dev(A)) + β n_TSV,    dev(A) = Σ_{i=1}^{N} |A(i) - A_avg|    (5.1)

where A(i) is the area of the die in layer i, A = max_i A(i) is the footprint of the chip, A_avg = (Σ_{i=1}^{N} A(i)) / N is the average area per layer, and n_TSV is the number of TSVs. To reduce the integration cost and improve the chip performance, the following constraints are also

set for the layer assignment algorithm:

- Basic FPGA infrastructures are preserved so that existing FPGA CAD tools can still be used for the new architecture, keeping the integration cost low;
- The partitioning and integration of non-configurable system-level blocks (processor cores, DSP blocks, etc.) are fully addressed;
- All non-volatile memory elements are aggregated in a single layer to reduce manufacturing cost.
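The cost of Eq. (5.1) together with these constraints can be evaluated very cheaply inside the annealer. Below is a minimal Python sketch of that evaluation; the weights and the constraint check are illustrative stand-ins rather than the exact formulation of [75].

    def layer_assignment_cost(layer_areas, n_tsv, alpha=1.0, beta=0.01):
        """Cost of Eq. (5.1): penalize footprint, area imbalance, and TSVs.

        layer_areas: list of per-layer die areas A(i); n_tsv: TSV count.
        """
        n = len(layer_areas)
        footprint = max(layer_areas)                  # A = max_i A(i)
        avg = sum(layer_areas) / n                    # A_avg
        dev = sum(abs(a - avg) for a in layer_areas)  # dev(A)
        return alpha * (footprint + dev) + beta * n_tsv

    def violates_constraints(assignment):
        """Illustrative constraint check used to reject an annealing move:
        every non-volatile (PCM) memory block must sit on the same layer."""
        mem_layers = {blk.layer for blk in assignment if blk.is_memory}
        return len(mem_layers) > 1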

Fig. 5.4: 3D-NonFAR: the proposed 3D non-volatile FPGA architecture with two-layer die stacking and TSVs.

With all the specifications listed above, the layer assignment algorithm produces 3D-NonFAR, a novel 3D non-volatile FPGA architecture with two-layer die stacking, as shown in Fig. 5.4. In 3D-NonFAR, all CBs and SBs, as well as the memory elements in the LBs and the embedded block memories, are placed in the upper layer, while the LBs are located in the lower layer and TSVs are used to connect the LBs to the corresponding memory elements above. Since all the configurable routing tracks are in the upper layer, no TSVs are needed for routing. The non-configurable system-level blocks and the clock network are also located in the same layer as the LBs, so no TSVs are needed to deliver clock signals to the flip-flops in the LBs.

Fig. 5.5: Chip footprint comparison between different 3D FPGA stacking scenarios: (a) 3-D SRAM fine-grained die stacking (footprint = 0.60A), (b) 3-D SRAM monolithic stacking (footprint = 0.433A), and (c) 3-D PRAM die stacking (footprint = 0.403A). The numbers in each block represent the relative area compared to the area of the baseline 2D FPGA. (A: area of baseline 2D FPGA; RR: routing resource; LB-SRAM: configuration memory cells in LB; RR-SRAM: configuration memory cells in RR; ST: switch transistor)

We compare the logic density improvement of 3D-NonFAR with two other 3D SRAM-based FPGA stacking scenarios, namely 3D-FineGrain [64], in which 3D (6-way) switching blocks are used and all the resources are evenly distributed between layers at fine granularity, and 3D-Mono [65], in which all SRAM cells are moved to a separate layer and monolithic stacking is used so that there is no overhead on inter-layer connections.¹ We conduct the comparison on the Virtex-4 XC4VFX60 Platform FPGA, which is in the same technology node (90nm) as the latest PCM cells in the literature [71]. The pitch of the Cu bumps used in 3D-FineGrain for inter-layer connection is 10 µm. Based on data reported in [76], we choose a typical 3 x 3 µm² TSV pitch for the experiments (the TSV size is 1 µm²). The numbers of Cu bumps and TSVs are estimated according to the partitioning style of the logic fabrics and the usage of

¹ In monolithic stacking, electronic components and their connections (wiring) are lithographically built in layers on a single wafer, hence no TSV is needed. Applications of this method are currently limited because creating normal transistors requires enough heat to destroy any existing wiring.

Fig. 5.6: Logic density improvement of three 3D FPGA stacking styles across the Xilinx Virtex-4 hybrid FPGA family (FX12 to FX140). By improvement, we mean the ratio of the footprint of the baseline 2D FPGA to that of the 3D FPGA.

non-configurable blocks. For 3D-FineGrain,

    n_bump ≈ W · n_SB / 2,    (5.2)

where W is the channel width of the routing fabric and n_SB is the number of SBs. For 3D-NonFAR,

    n_TSV ≈ (4 + 4 + 1) · n_LUT4 + 2 · n_GlobalWire,    (5.3)

where n_LUT4 is the number of 4-input LUTs, and n_GlobalWire is the number of global interconnects that cross layers. Fig. 5.5(a) shows the case of 3D-FineGrain, in which the 3D switching blocks incur massive usage of Cu bumps and the footprint reduction is limited to around 40%. In Fig. 5.5(b), 3D-Mono leads to unbalanced layer partitions with a separate SRAM layer, and the footprint reduction is about 56% with three-layer stacking. Fig. 5.5(c) shows that in 3D-NonFAR, PCM cells occupy a much smaller area than SRAM cells, so the upper layer can accommodate more blocks, yielding the most area-efficient stacking with a footprint reduction of nearly 60%. The logic density improvement across the whole Xilinx XC4VFX FPGA family is depicted in Fig. 5.6, which shows that 3D-NonFAR is more favorable for larger devices.


5.3.3 Performance and Power Evaluation of 3D-NonFAR

In this section, we quantify the improvements in delay and power consumption of 3D-NonFAR. To accurately model the impact of integrating PCM as a universal memory replacement in FPGAs, we first evaluate the timing and power characteristics of the memories with CACTI [77] and PCRAMsim [73], as well as data from the latest literature [71]. A set of evaluation results is listed in Table 5.1. We then proceed to evaluate the delay and power of the whole chip.

Table 5.1: Characteristics of 32MB RAM blocks in different memory technologies at a 90nm process.

                        SRAM     DRAM     PCM (SLC)   PCM (MLC)
Area (mm2)              523.5    32.2     59.2        36.0
Read Latency (ns)       13.1     6.92     10.7        15
Write Latency (ns)      13.1     6.92     151.8       285.7
Read Energy (nJ)        5.03     1.07     2.26        3.49
Write Energy (nJ)       2.59     0.55     4.78        7.17
Leakage Power (nW)      5.14     1.71     0.0091      0.0137

5.3.3.1 Delay Improvement in 3D-NonFAR

The path delay in FPGAs can be divided into logic block delay and interconnect delay. In 3D-NonFAR, the LUTs in logic blocks are implemented with PCM MLC cells, which are about 5% slower in read speed than conventional SRAM cells, as shown in Table 5.1. However, this overhead can be easily compensated by the reduction of interconnect delay, which will be assessed later in this section. It also has to be mentioned that, although the write operation of PCM MLC cells is much slower, it will not affect the runtime performance of LUTs, because writes to LUTs occur only during configuration time. The block RAMs in 3D-NonFAR are implemented with PCM SLC cells. As mentioned above, they can achieve the same performance as SRAM blocks with aggressive optimizations [73], so we assume

that no delay overhead is incurred by the block RAMs. The reduction of interconnect delay is quantified as follows. FPGAs have wire segments of different lengths that can be used to implement interconnections with different requirements. In Xilinx Virtex-4 FPGAs, there are four types of wire segments, with lengths 1 (Single), 3 (Hex-3), 6 (Hex-6), and 24 (Global); the unit of wire length is the tile width L. The RC delay of each type of wire segment can be computed from the Elmore delay model in [65]. For 3D FPGA stacking, the unit wire length L is reduced according to a logic density scaling factor. While we assume that all Single, Hex-3, and Hex-6 wires reside in the upper layer, Global wires can cross layers through TSVs for interconnect reduction. For TSVs on Global wires, we use resistance and capacitance values of 43 mΩ and 40 fF, respectively, for the delay calculation. We feed the improved unit length into the Elmore model [65] and compute the delay improvement for each type of wire segment. The results are listed in Table 5.2.

Table 5.2: Relative delay improvement of different types of interconnect wire segments.

Interconnect Type   3D-FineGrain            3D-Mono                 3D-NonFAR
                    (scaling factor 0.62)   (scaling factor 0.44)   (scaling factor 0.41)
Single              1.28                    1.39                    1.62
HEX-3               1.51                    1.60                    2.05
HEX-6               1.61                    1.73                    2.16
Global              1.63                    1.87                    2.58
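The per-segment calculation can be sketched with a simple distributed-RC (Elmore) approximation. The snippet below is not the exact model of [65]; the per-unit wire resistance/capacitance, the way the footprint scaling factor shrinks the tile width, and the TSV parasitics added to Global wires are all illustrative assumptions.

    import math

    def segment_delay_ps(seg_len_tiles, scale_factor, tile_width_um=100.0,
                         r_um=0.2, c_um=0.0002, is_global=False,
                         tsv_r_ohm=0.043, tsv_c_pf=0.04):
        """Rough Elmore delay (ps) of one routing segment after 3D stacking.

        seg_len_tiles: segment length in tiles (1, 3, 6, or 24).
        scale_factor:  footprint scaling; tile width shrinks by sqrt(scale).
        r_um (ohm/um), c_um (pF/um): per-unit wire parasitics (assumed).
        Global segments additionally pick up one TSV's R and C.
        """
        length_um = seg_len_tiles * tile_width_um * math.sqrt(scale_factor)
        r_wire, c_wire = r_um * length_um, c_um * length_um
        delay = 0.5 * r_wire * c_wire                 # distributed wire term
        if is_global:
            delay += tsv_r_ohm * (c_wire + tsv_c_pf)  # TSV in series
        return delay  # ohm * pF = ps

    # Relative improvement of a Hex-6 segment for a 0.41 footprint scale:
    # segment_delay_ps(6, 1.0) / segment_delay_ps(6, 0.41)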

We then evaluate the improvement in overall system performance for 3D-NonFAR as well as the 3D-FineGrain [64] and 3D-Mono [65] architectures. Since both 3D-NonFAR and 3D-Mono maintain the basic infrastructure of a conventional 2D FPGA, they can be evaluated using the commonly used VPR toolset [78]. 3D-FineGrain can be modeled within TPR [69], which is a variant of VPR that supports the modeling of fine-grained 3D FPGA integration. The evaluation is performed with the following procedure: (1) We modify VPR/TPR so that they recognize the cluster structure in Virtex-4 FPGAs and generate the correct rout-

Fig. 5.7: The total wirelength reduction from 3D FPGA stacking. For each benchmark, the wirelength values are normalized against the total wirelength of the baseline 2D FPGA.

Fig. 5.8: The critical path delay reduction from 3D FPGA stacking. For each benchmark, the delay values are normalized against the critical path delay of the baseline 2D FPGA.

ing graph. (2) We then modify the architecture models in VPR/TPR to reflect the revised delay values of logic blocks and wire segments, according to the parameters scaled from Table 5.1 and Table 5.2. The sizes of the buffers inserted in long interconnects are also re-optimized with the new delay parameters. (3) We place and route the 20 largest MCNC benchmark circuits


with the modified VPR/TPR, and quantify the relative improvement against the baseline 2D FPGA. Fig. 5.7 depicts the total wire length reduction brought by 3D FPGA stacking. The average reductions against the baseline 2D FPGA for 3D-FineGrain, 3D-Mono, and 3D-NonFAR are 37.7%, 43.7%, and 54.9%, respectively. Fig. 5.8 depicts the critical path delay reduction. The average reductions for 3D-FineGrain, 3D-Mono, and 3D-NonFAR are 22.9%, 36.5%, and 44.9%, respectively. We can see that the total wire length is reduced nearly uniformly across all the benchmarks, while the critical path delay improvement largely depends on the circuit structure.

5.3.3.2 Reduction of Power Consumption in 3D-NonFAR

There are two major components of power consumption in FPGAs: static power and dynamic power. Although the dynamic component still dominates in FPGAs at the 90nm process node, static power occupies a significant portion of the total power consumption [79]. We thus have to quantify the reduction in both dynamic and static power in 3D-NonFAR. The breakdown of dynamic power in FPGAs is:

    P_dyn = P_logic,dyn + P_mem,dyn + P_net,dyn + P_clk,dyn
          = P_logic,dyn + P_mem,dyn + Σ_{all nets} V_DD² · f_net · C_net + C_clk · V_DD² · f_clk,    (5.4)

where P_logic,dyn is the dynamic power consumed in the logic circuits (including the non-configurable blocks, e.g., DSPs), P_mem,dyn in the memory elements, P_net,dyn in the interconnect wires, and P_clk,dyn in the clock network. In the 2D baseline FPGA, the ratio between P_logic,dyn, P_mem,dyn, P_net,dyn, and P_clk,dyn is estimated to be 0.3:0.15:0.4:0.15 using the Xilinx XPower tool [80]. During the transition from 2D FPGAs to 3D FPGAs, P_logic,dyn is assumed to be unaffected, while the change in P_mem,dyn can be estimated with the parameters in Table 5.1. According to Table 5.1, the write energy of PCM cells is much higher than that

60 3D FPGA Design of SRAM cells. However, the total write energy can be reduced using narrow rows and partial writes as presented in [72]. Pint,dyn and Pclk,dyn are determined by the equivalent capacitances of the signal net and the clock network (Cnet and Cclk , respectively). While other parameters contributing to Pint,dyn and Pclk,dyn remain unaected, Cnet and Cclk are scaled down with the logic density scaling factor , leading to the major reduction in dynamic power. Note that for Global wires involving TSVs, the capacitance of TSVs are included in the re-calculation of Cnet . As for the leakage power, the portion consumed on logic circuits in 3D-NonFAR is the same as that in the baseline 2D FPGA, while the portion of memory cells is signicantly reduced, due to the zero leakage of PCM cells. This can be updated according to Table 5.1. To evaluate the total power reduction of 3D-NonFAR, we assume the ratio between the total dynamic power and static power consumed in baseline 2D FPGAs is 0.8:0.2 [79]. The power numbers for the logic circuits including LBs, DPSs, PowerPC cores, and IO blocks are provided by Xilinx XPower [80], while the values for interconnects and clock network are estimated using the breakdown ratios. The total power for each of the 20 MCNC benchmark circuit is estimated according to the resource usage, and the relative power reduction against the baseline 2D FPGA for 3D stacking is depicted in Fig. 5.9, where the average reductions for 3D-FineGrain, 3D-Mono, and 3D-NonFAR are 18.7%, 24.0% and 35.1%, respectively.

5.4 Summary

3D FPGAs have shown their potential for increasing logic density, reducing power, and improving performance. In particular, the interconnect benefit brought by 3D stacking removes one of the critical roadblocks to the scaling of FPGAs, and helps close the gap between FPGAs and ASICs. The 3D FPGA is thus a promising alternative for future FPGA innovation, as well as a valuable proving ground for 3D silicon integration.

Fig. 5.9: The total power reduction from 3D FPGA stacking. For each benchmark, the power values are normalized against the total power of the baseline 2D FPGA.

6 3D Architecture Exploration

As the process technologies for 3D integration emerge, 3D stacking provides great opportunities for improvements in the microarchitecture. In this chapter, we introduce some recent 3D research at the architecture level. These techniques leverage the advantages of 3D to improve performance, reduce power consumption, and so on. Since 3D integration can reduce wire length, it is natural to partition the structures of a planar processor and stack them to improve performance. There are two different methods: (1) coarse-granularity stacking, also known as the memory+logic strategy, in which some on-chip memories are separated from and stacked on the part containing the logic components [81-85], and (2) fine-granularity stacking, in which various functional units of the processor are separated and stacked, and some components are internally partitioned and implemented with 3D integration [86-90]. 3D stacking enables a denser form factor and more cost-efficient integration of heterogeneous process technologies. It is possible to stack DRAM or other emerging memories on-chip. Consequently, the advantages of different memory technologies can be leveraged. For example, more data can be stored on-chip and the static power consumption can

be reduced. At the same time, the high bandwidth provided by 3D integration can be exploited by the large-capacity on-chip memory to further improve performance [82-85, 91, 92]. Another on-chip component that potentially benefits most from 3D integration is the on-chip interconnection network (i.e., the network-on-chip, NoC). In recent years, to address the burgeoning bandwidth requirements of many-core architectures, researchers have proposed substituting multi-hop networks-on-chip, with their superior scalability, for conventional buses with lagging bandwidth improvement. As a communication fabric, the network-on-chip takes advantage of 3D integration in two ways: (a) wire-length reduction, which allows links to run faster and more power-efficiently, and (b) the implementation of truly 3D network topologies, which improves connectivity and reduces network diameter, and therefore benefits both performance and power. In addition, with the rapid advancement of many-core architectures and their data throughput, the network-on-chip becomes a critical performance factor for future processors, which are projected to consist of tens to hundreds of cores.

6.1 Single-core Fine-Granularity Design

In this section, we first discuss how to partition and stack the on-chip memory (SRAM arrays), as they have regular structures. Then, we discuss the benefits and issues of partitioning the logic components. Puttaswamy et al. provided a thorough study of 3D integrated SRAM components for high-performance microprocessors [86]. In this paper, they explored various design options for 3D integrated SRAM arrays with functional block partitioning. They studied two different types of SRAM arrays: banked SRAM arrays (e.g., caches) and multiported SRAM arrays (e.g., register files). For the banked SRAM arrays, the paper discussed the methods of 3D bank stacking and 3D array splitting. Bank stacking is straightforward, as the SRAM arrays have already been partitioned into banks in the 2D case. The decision for a horizontal or vertical split largely depends on which dimension has more wire delay. The array splitting method may help to reduce the lengths of either the wordlines or the bit-

Fig. 6.1: An illustration of a 3D stacked two-port SRAM cell.

lines. When the wordline is split, die-to-die vias are required because the row decoder needs to drive the wordlines on both dies, and the column select multiplexers are split across the two dies. For row-stacked SRAM, the row decoder needs to be partitioned across the two dies. If the peripheral circuitry, such as the sense amplifiers, is shared between the two dies, extra multiplexers may be required. In their work, latency and energy results were evaluated for caches from 16KB to 4MB. The results show that array-split configurations provide additional latency reductions as compared to the bank-stacked configurations. The 3D organization provides more benefit to the larger arrays because these structures have substantially longer global wires to route signals between the array edge and the farthest bank. For large (2-4 MB) caches, the 3D organization exhibits the greatest reduction in global bank-level routing latency, since that latency is dominant. On the other hand, moderate-sized (64-512 KB) cache delays are not as dominated by the global routing. In these instances, a 3D organization that targets the intra-SRAM delays provides more benefit. Similar to the latency reduction, the 3D organization reduces the energy in the most wire-dominated portions of the arrays, namely the bank-level routing and the bitlines, in the respective configurations [86]. For the multiported SRAM, Puttaswamy et al. took the register

files (RF) as an example to show different 3D stacking strategies. First, an RF can split half of its entries and stack them on the other half. This is called register partitioning (RP). The bitlines and row decoder are halved so that the access latency is reduced. Second, a bit-partitioned (BP) 3D register file stacks the higher-order and lower-order bits of the same register across different dies. Such a strategy reduces the load on the wordline. Third, the large footprint of a multiported SRAM cell provides the opportunity to allocate one or two die-to-die vias to each cell. Consequently, it is feasible for each die to contain the bitlines, wordlines, and access transistors for half of the ports. This strategy is called port splitting (PS), which provides benefits in reducing the area footprint. Figure 6.1 gives an illustration of the 3D structure of a two-port cell. The area reduction translates into latency and energy savings. Register files with sizes ranging from 16 up to 192 entries were simulated in the work. The results show that, for 2-die stacks, the BP design provides the largest latency reduction when compared to the corresponding RP and PS designs. The wordline is heavily loaded by the two access transistors per bitline column, and splitting the wordline across two dies reduces a major component of the wire latency. As the number of entries increases, the latency reduction benefits of BP designs decrease because the height of the overall structure increases, which makes the row decoder and bitline/sense-amplifier delays increasingly critical. The benefits of the other two strategies, however, increase for the same reason. The 3D configuration that minimizes energy consumption is not necessarily the configuration that has the lowest latency. For smaller numbers of entries, the BP organization requires the least energy. As the number of entries increases above 64 entries, the RP organization provides the most benefit by effectively reducing the bitline length [86]. When there are more than two stacked layers, the situation is more complicated because of the many stacking options; more details can be found in the paper. We have discussed how to split a planar SRAM cache and integrate it with 3D stacking. Black et al. compared the performance, power consumption, and temperature of splitting the logic components of a processor and stacking them in two layers. Such a strategy is called logic+logic

Using logic+logic stacking, a new 3D floorplan can be developed that requires only 50% of the original footprint. The goal is to reduce inter-block interconnect by stacking, and to reduce intra-block (within-block) interconnect through block splitting [87]. Some pipeline components, which are far from each other in the worst-case data path of a 2D floorplan, can be stacked on each other in different layers so that the number of pipeline stages can be reduced. An example is shown in Figure 6.2, where the path between the first-level data cache (D$) and the data input to the functional units (F) is drawn illustratively. The worst-case path in the planar processor occurs in Figure 6.2(a), when load data must travel from the far edge of the D$ to the farthest functional unit. Figure 6.2(b) shows that a 3D floorplan can overlap the D$ and F. In the 3D floorplan, the load data only travels to the center of the D$, at which point it is routed to the other die to the center of the functional units. As a result of stacking, that same worst-case path contains half as much routing distance, since the data only traverses half of the data cache and half of the functional units [87]. Using logic+logic stacking, 25% of all pipe stages in the microarchitecture are eliminated with the new 3D floorplan simply by reducing metal runs, resulting in a 15% performance improvement [87]. Although the performance (IPC) increase causes more activity that increases the power, the power per instruction is reduced because of the elimination of pipeline stages. The temperature increase, however, is a serious concern because of the doubling of power density in 3D stacked logic. The worst case shows a 26-degree increase if there were no power savings from the 3D floorplan and the stacking were to result in a 2x power density; with a simple iterative process of placing blocks and observing the thermal results, the outcome is a 1.3x power density increase and a 14-degree temperature increase [87]. Balaji et al also explore design methodologies for processor components in 3D technology, in order to maximize the performance and power efficiency of these components [90]. They proposed systematic partitioning strategies for the custom design of several example components: a) an instruction scheduler, b) a Kogge-Stone adder, and c) a logarithmic shifter. In addition, they also developed a 3D ASIC design flow leveraging both widely used 2D CAD tools



Fig. 6.2: (a) Planar floorplan of a deeply pipelined microprocessor with the register-read-to-execute paths shown, (b) 3D floorplan of the planar microprocessor in (a).

(Synopsys Design Compiler, Cadence Silicon Ensemble, Synopsys PrimeTime) and emerging 3D CAD tools (MIT PR3D and an in-house netlist extraction and timing tool). The custom designs are implemented with MIT 3D Magic, a layout tool customized for 3D designs, together with in-house netlist and timing extraction scripts. For the instruction scheduler, they found that the tag drive latency is a major component of the overall latency and is sensitive to wire delay. Thus they proposed to either horizontally or vertically partition the tag line to reduce the tag drive latency. From the experimental results, horizontal partitioning achieves a significant latency reduction (44% when moving from 2D to 2-tier 3D implementations) while vertical partitioning only achieves a marginal improvement (4%). For both the KS adder and the logarithmic shifter, the intensive wiring in the design becomes the limiting factor for performance and power in advanced technology nodes. In the 3D implementation, the KS adder is partitioned horizontally and the logarithmic shifter is partitioned vertically. Significant latency reductions are observed,

20.23% to 32.7% for the KS adder and 13.4% for the logarithmic shifter as the number of tiers varies. In addition to the custom designs, they also exercised the proposed ASIC design flow with a range of arithmetic units. The implementation results show that significant latency reductions (9.6% to 42.3%) are achieved for the 3D designs generated by the proposed flow. Last but not least, it is observed that moving from 1 tier to 2 tiers produces the most significant performance improvement, while this improvement saturates when more tiers are used. The architectural performance impact of the 3D components is evaluated by simulating a 3D processor running SPEC2000 benchmarks. The data path width of the processor is scaled up, since the 3D components have much lower latency than 2D components of the same width. As a result, simulation results show an average IPC speedup of 11% for the 3D processor. So far, the partitionings of the components in the 3D designs have all been pre-specified, and most partitions are limited to two layers. In order to further explore the 3D design space, Ma et al proposed a microarchitecture-physical co-design framework to handle fine-grain 3D design [?]. First, the components in the design are modeled across multiple silicon layers. The effects of different partitioning approaches are analyzed with respect to area, timing, and power, and all these approaches are considered as potential design strategies. Note that the number of layers that a component can be partitioned into is not fixed; in particular, the single-layer design of a component is also included in the total design space. In addition, the authors analyzed the impact of scaling the sizes of different architectural structures. Given these design alternatives for the components, an architectural building engine is used to choose the optimized configurations among a wide range of implementations. Some heuristic methods are proposed to speed up the convergence and reduce redundant searches on infeasible solutions. At the same time, a thermal-aware packing engine with a temperature simulation tool is employed to optimize the thermal characteristics of the entire design so that the hotspot temperatures stay below the given thermal thresholds. With these engines, the framework takes the frequency target, the architectural netlist, and a pool of alternative block implementations as inputs and finds the optimized design solution in terms of performance, temperature, or both.



Fig. 6.3: The conceptual view of a 3D stacked register file: (a) data bits on the lower three layers are all zeros, (b) data in all layers are non-zero.

In the experiments, the authors used a superscalar processor as an example of design space exploration using the framework. The critical components such as the issue queue and caches could be partitioned into 2-4 layers. The partitioning methods included block folding and port partitioning, as introduced before. The experimental results show a 36% performance improvement over traditional 2D designs and a 14% improvement over 3D designs with single-layer unit implementations. We have discussed that 3D stacking aggravates the thermal issues, and that a thermal-aware placement can help alleviate the problem [87]. There are also architectural-level techniques that can be used to control thermal hotspots. Puttaswamy et al proposed thermal herding techniques for the fine-grain partitioned 3D microarchitecture. For different components in the processor, various techniques are introduced to reduce the 3D power density and to locate a majority of the power on the die (layer) closest to the heat sink. Assume there are four 3D stacked layers and the processor is 64-bit based.

Some 64-bit components (e.g. register files, arithmetic units) are equally partitioned so that 16 bits are placed in each layer, as shown in Figure 6.3. Such a partition not only reduces the access latency, but also provides opportunities for thermal herding. For example, in Figure 6.3(a) the most significant 48 bits (MSBs) of the data are zero. We do not have to load/store these zero bits in the register files, or process these bits in the arithmetic units, such as the adder. If we only process the least significant 16 bits (LSBs), the power is reduced. In addition, if the LSBs are located in the layer next to the heat sink, the temperature is also decreased. For the data shown in Figure 6.3(b), however, we have to process all bits since they are non-zero in all four layers. For some other components, such as the instruction scheduler and data caches, entries are partitioned and placed in different layers. Accesses to these components are controlled so that the entries located in the layer next to the heat sink are accessed more frequently. Consequently, the energy is herded toward the heat sink and the temperature is reduced. Extra hardware, however, is required to implement these techniques. For example, an extra bit is introduced in the register file to indicate whether the MSBs are zero; this bit is propagated to other components for further control. Compared to a conventional planar processor, the 3D processor achieves a 47.9% frequency increase, which results in a 47.0% performance improvement, while simultaneously reducing total power by 20%. Without thermal herding techniques, the worst-case 3D temperature increases by 17 degrees; with thermal herding, the temperature increase is only 12 degrees [88].
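
To make the data-dependent gating concrete, the following minimal sketch (our own illustration, not the implementation in [88]) models how a 64-bit operand, split into 16-bit slices across four layers, is checked for zero upper bits so that only the layer next to the heat sink needs to be activated. The slice width, layer count, and the msb_zero flag mirror the description above, but the code itself is only a behavioral approximation.

LAYER_BITS = 16
NUM_LAYERS = 4

def slice_operand(value):
    """Split a 64-bit value into 16-bit slices, layer 0 holding the LSBs."""
    return [(value >> (LAYER_BITS * i)) & 0xFFFF for i in range(NUM_LAYERS)]

def active_layers(value):
    """Return the layers that must be powered for this operand, plus the
    zero-detect flag that would be stored alongside the register entry."""
    slices = slice_operand(value)
    msb_zero = all(s == 0 for s in slices[1:])   # upper 48 bits all zero?
    if msb_zero:
        # Only the layer next to the heat sink (layer 0) is activated.
        return {0}, msb_zero
    # Non-zero data in the upper slices: all four layers participate.
    return set(range(NUM_LAYERS)), msb_zero

# Example: a small constant (Figure 6.3(a)) keeps only layer 0 active,
# while a value with non-zero upper bits (Figure 6.3(b)) activates all layers.
print(active_layers(42))        # ({0}, True)
print(active_layers(1 << 40))   # ({0, 1, 2, 3}, False)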

6.2 Multi-core Memory Stacking

In this section, we will focus on the memory+logic strategy in multi-core processors. Memories of various technologies can be stacked on top of cores as caches or even as on-chip main memories. Different from the research in the previous section, which focuses on fine-granularity optimizations (e.g. wire length reduction), the approaches in this section consider the memories as a whole structure and explore high-level improvements, such as access interfaces, replacement policies, etc.


Fig. 6.4: Memory stacking options: (a) 4MB baseline; (b) 8MB stacked for a total of 12MB; (c) 32MB of stacked DRAM with no SRAM; (d) 64MB of stacked DRAM.

6.2.1 3D Cache Stacking

Black et al proposed to stack SRAM/DRAM L2 caches on the cores, based on Intel's Core 2 Duo model. It was assumed that the 4MB L2 cache occupies 50% of the area in the planar two-core processor. They proposed three different stacking options that do not increase the footprint: (1) stacking an extra 8MB of L2 cache on the original processor; (2) removing the 4MB SRAM cache and stacking a 32MB DRAM cache on the logic (the 2MB DRAM tag array is placed in the same layer as the logic); (3) removing the 4MB SRAM cache and stacking a 64MB DRAM cache (the 4MB DRAM tag array is placed in the same layer as the logic). These three options are shown in Figure 6.4 together with the 2D baseline. The results showed that stacking DRAM caches could significantly reduce the average memory access time and reduce the off-chip bandwidth for benchmarks with working sets larger than 4MB. In addition, the off-chip bus power was also greatly reduced due to the reduction of off-chip memory accesses. Since DRAM consumes less power than SRAM of the same area, stacking a DRAM cache also reduces the power compared to the case of stacking SRAM caches. The thermal simulations also showed that the peak temperature increased by less than three degrees with the 64MB DRAM cache stacked, meaning that the thermal impact of stacking DRAM was not significant [87]. While these prior approaches have shown the performance benefits of using 3D stacked DRAM as a large last-level cache (LLC), Loh proposed new cache management policies for further improvements, leveraging the hardware organization of DRAM architectures [82].

Fig. 6.5: Structure of the multi-queue cache management scheme for (a) a single core and (b) multiple cores.

The management scheme organizes the ways of a cache set into multiple FIFOs (queues). As shown in Figure 6.5, each entry of a queue represents a cache line, and the entries of the queues together with the LRU-managed ways cover one cache set. The first-level and second-level queues are similar except that, in multi-core processors, there is one first-level queue for each thread, while the second-level queue is shared, as shown in Figure 6.5(b). Data loaded into the LLC is first inserted into these queues before being placed into the cache ways managed by the LRU policy. A u-bit is employed to indicate whether a cache line is re-used (hit) during its stay in these queues. When cache lines are evicted from the queues, they are moved into the LRU-based cache ways only if they have been re-used; otherwise, these cache lines are evicted from the queues directly. Under many cache behaviors, these queues can effectively prevent useful data from being evicted by newly inserted data that are never re-used. The first-level queues are per-core, so the cache requests of different cores are placed into separate queues. Consequently, the cache replacements of different cores are isolated from each other in the first-level queues. Such a scheme helps improve cache utilization efficiency because, without such isolation, a core with a high access rate can quickly evict the cache lines of another core from the shared cache.



This scheme raises the issue of how to allocate the size of the first-level queue for each core. To solve this problem, the paper proposed an adaptive multi-queue (AMQ) method, which leverages the set-dueling principle [82] to dynamically allocate the size of each first-level queue. There are several pre-defined configurations of queue sizes, and AMQ dynamically decides which configuration should be used according to the actual cache access pattern. A stability control mechanism is used to avoid unstable and rapid switching among the different first-level queue configurations. The results showed that the AMQ management policy can improve performance by 27.6% over the baseline of simply using the stacked DRAM as the LLC. This method, however, incurs extra overhead for managing multiple queues; for example, it requires a head pointer for each queue. In addition, it may not work well when the cache associativity is not large enough (64-way associativity was used in the paper) because the queues need to be large enough to record the re-use of a cache line. Since the first-level queues are per-core, they have a scalability limitation for the same reason.
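
The following short software model illustrates the insertion flow described above for one cache set: lines enter a first-level FIFO, fall through to the shared second-level FIFO, and are promoted into the LRU-managed ways only if their u-bit shows re-use while queued. The queue depths, the single-queue simplification, and the class interface are our own illustrative assumptions, not the exact design in [82].

from collections import deque

class QueueManagedSet:
    """One cache set managed by a first-level FIFO, a shared second-level
    FIFO, and a small number of LRU ways (illustrative sizes)."""

    def __init__(self, l1_size=4, l2_size=4, lru_ways=8):
        self.l1, self.l2 = deque(), deque()
        self.l1_size, self.l2_size, self.lru_ways = l1_size, l2_size, lru_ways
        self.lru = []        # LRU-managed ways, most recently used last
        self.ubit = {}       # re-use bit for lines still sitting in a queue

    def access(self, line):
        if line in self.ubit:          # hit while queued: remember the re-use
            self.ubit[line] = True
            return True
        if line in self.lru:           # hit in the LRU ways
            self.lru.remove(line)
            self.lru.append(line)
            return True
        self.l1.append(line)           # miss: insert into the first-level FIFO
        self.ubit[line] = False
        if len(self.l1) > self.l1_size:
            self.l2.append(self.l1.popleft())    # fall through to shared FIFO
        if len(self.l2) > self.l2_size:
            self._leave_queues(self.l2.popleft())
        return False

    def _leave_queues(self, line):
        # Promote into the LRU ways only if the line was re-used while queued;
        # otherwise it is evicted directly and never pollutes the LRU ways.
        if self.ubit.pop(line):
            self.lru.append(line)
            if len(self.lru) > self.lru_ways:
                self.lru.pop(0)        # evict the least recently used way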

6.2.2 3D Main Memory Stacking

In addition to cache stacking, many studies have explored the potential of building main memory on-chip with the help of 3D integration [83-85, 93-95]. The most straightforward way is to implement a vertical bus across the stacked DRAM layers to the processor cores [93-95]. The topology and overall architecture of the 3D stacked memory are kept the same as in traditional off-chip memory. Previous work shows that simply stacking the DRAM in this way provides a 34.7% increase in performance on memory-intensive workloads from SPEC2006 [83]. The benefits come from reducing the wire delay between the memory controller and main memory, and from running the memory controller at a faster speed. Due to the density of TSVs, the interface between the cores and the stacked memories can be implemented with much wider buses than those of 2D off-chip memory. After increasing the bus width from 64 bits (the 2D front-side bus width) to a 64-byte TSV-based 3D bus, the performance improvement over the 2D case increases to 71.8% [83].


Increasing the bus width alone does not fully exploit 3D stacking because the memory organization is still inherently two-dimensional. Tezzaron Corporation has announced true 3D DRAMs in which the individual bitcell arrays are stacked in a 3D fashion [96]. Different from the prior approaches, this true 3D DRAM isolates the DRAM cells from the peripheral circuitry: the DRAM bitcells are implemented in an NMOS technology optimized for density, whereas the peripheral circuitry is implemented on a CMOS layer optimized for speed. The combination of reduced bitline capacitance and high-speed logic provides a significant improvement in memory access time. Consequently, the true 3D structure results in a 116.8% performance improvement over the 2D case [83]. The high bandwidth of 3D stacking also makes it a viable option to increase the number of ranks and memory controller (MC) interfaces of 3D stacked memories; note that this is not practical off-chip due to pin and chip-count limitations [83]. Loh modified the banking of the L2 cache to route each set of L2 banks to one and only one MC, so that the interleaving of the L2 cache occurs on 4KB boundaries rather than 64-byte boundaries.


Fig. 6.6: A structural view of 4 MSHR banks, 4 MCs, and 16 ranks.


Such a structure aligns each L2 cache bank with its own MSHR and MC, so that communication is reduced when there are multiple ranks and MCs. Figure 6.6 shows the structure of the 4-core processor with the aligned L2 cache banks and the corresponding MSHRs and MCs. The results show another 1.75x speedup in addition to that provided by the true 3D structure.

Loh's work also pointed out that the significant increase in memory system performance makes the L2 miss handling architecture (MHA) a new bottleneck. The experiments showed that simply increasing the capacity of the MSHRs cannot improve performance consistently. Consequently, a novel data structure called the Vector Bloom Filter (VBF) with dynamic MSHR capacity tuning was proposed to achieve a scalable MSHR. The VBF-based MSHR is effectively a hash table with linear probing, which speeds up MSHR lookups. The experimental results show that the VBF-based dynamic MSHR can provide a robust, scalable, high-performance L2 MHA for 3D-stacked memory architectures. Overall, a 23.0% improvement is observed on memory-intensive workloads of SPEC2006 over the baseline L2 MSHR architecture for the dual-MC (quad-MC) configuration [83].

Kgil et al studied the advantages of 3D stacked main memory with respect to the energy efficiency of processors [93]. In their work, the structure of the DRAM-based main memory was not changed, and it was stacked directly on the processor. The bus width of the 3D main memory was assumed to be up to 1024 bits. With this large bus width, one interesting observation was that the 3D main memory could achieve a bandwidth similar to that of the L2 cache. Consequently, the paper pointed out that the L2 cache was no longer necessary, and its area could be used for additional processing cores. With more processing cores, the frequency of the cores could be lowered without degrading the computing bandwidth. Energy efficiency improves because the power consumption is reduced as the frequency decreases, especially for applications with high thread-level parallelism. The paper provided comprehensive comparisons among different configurations of 2D and 3D processors with similar die areas, in terms of processing bandwidth and power consumption.

Woo et al further explored the high bandwidth of 3D stacked main memory by modifying the structure of the L2 cache and the corresponding interface between the cache and the 3D stacked main memory [84]. The paper first revisited the impact of cache line size on cache miss rate when there is no constraint on the main memory bandwidth. The experimental results show an interesting conclusion: most modern applications benefit from a smaller L1 cache line size, but a larger cache line is helpful for a much larger L2 cache. In particular, the maximum line size, an entire page (4KB), is found to be very effective in a large cache. The paper then proposed a technique named SMART-3D to exploit the benefits of a large cache line size. The L2 cache line size is kept at 64B, and reads/writes between the L1 and L2 caches use the traditional 64B bus structures. The bus between the L2 cache and main memory, however, is 4KB wide, so 64 cache lines can be filled with data from main memory at the same time. To achieve such highly parallel data filling, the L2 cache is divided into 64 subarrays, and one cache line in each subarray is written in parallel. Two cache eviction policies were proposed so that either one or 64 cache lines can be evicted on demand. Besides the modification to the L2 cache, the 3D stacked DRAM was carefully re-designed, since a large number (32K) of TSVs is required in SMART-3D. The paper also noted the potential exacerbation of the false-sharing problem caused by SMART-3D; the cache coherence policy was modified so that either one or 64 cache lines are fetched in different cases. The experimental results show that performance is improved with SMART-3D. For a single core, the average speedup of SMART-3D is 2.14x over the 2D case for memory-intensive benchmarks from SPEC2006, which is much higher than that of the baseline 3D stacked DRAM. In addition, using a 1MB L2 cache and 3D stacked DRAM with SMART-3D achieves a 2.31x speedup for a two-core processor, whereas a 4MB L2 cache without SMART-3D only achieves 1.97x over the 2D case. Furthermore, the analysis shows that SMART-3D can even lower the energy consumption in the L2 cache and the 3D DRAM because it reduces the total number of row buffer misses.
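
As a rough illustration of the wide-fill idea, the sketch below (our simplification, not the SMART-3D hardware) shows how a missed address expands into a 4KB fill of 64 contiguous 64B lines, one per L2 subarray, and, for comparison, how the 4KB-boundary interleaving used by Loh to bind L2 banks to memory controllers differs from conventional 64B interleaving. The function names, the line-to-subarray assignment, and the simple modulo mapping are assumptions for illustration only.

LINE_SIZE = 64                      # bytes per L2 cache line
LINES_PER_PAGE = 64                 # 64 lines x 64 B = one 4 KB fill
PAGE_SIZE = LINE_SIZE * LINES_PER_PAGE

def wide_fill(addr):
    """Return the 4 KB page base and the (subarray, line address) pairs
    written in parallel on a SMART-3D-style fill of that page."""
    page_base = addr & ~(PAGE_SIZE - 1)
    return page_base, [(i, page_base + i * LINE_SIZE)
                       for i in range(LINES_PER_PAGE)]

def mc_of(addr, num_mcs=4, granularity=4096):
    """Which memory controller owns this address: 4 KB-boundary interleaving
    (granularity=4096) versus conventional 64 B interleaving (granularity=64)."""
    return (addr // granularity) % num_mcs

# A miss to address 0x12345 triggers one 4 KB fill of 64 lines, and two
# addresses 64 B apart map to the same MC under 4 KB interleaving but to
# different MCs under 64 B interleaving.
base, fills = wide_fill(0x12345)
print(hex(base), len(fills))                                                # 0x12000 64
print(mc_of(0x12000, granularity=4096), mc_of(0x12040, granularity=4096))  # same MC
print(mc_of(0x12000, granularity=64), mc_of(0x12040, granularity=64))      # different MCs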


It is known that periodic refresh is required for DRAMs to maintain the information stored in them. Since a read access is self-destructive, refreshing a DRAM row involves reading out the stored data in each cell and immediately rewriting it back to the same cell. The refresh incurs considerable power and bandwidth overhead. 3D integration is a promising technique that benefits DRAM design, particularly from the capacity and performance perspectives; nevertheless, 3D stacked DRAMs potentially exacerbate the power and bandwidth overhead incurred by the refresh process. Ghosh et al proposed Smart Refresh [85], a low-cost technique implemented in the memory controller to eliminate unnecessary periodic refresh operations and mitigate the energy consumption overhead in DRAMs. By employing a time-out counter in the memory controller for each memory row of a DRAM module, a DRAM row that was recently accessed is not refreshed during the next periodic refresh operation. The simulation results show that the proposed technique can reduce refresh operations by 53% on average for a 2GB DRAM, and achieves a 52.6% energy saving for refresh operations and a 12.13% overall energy saving on average. An average 9.3% energy saving can be achieved for a 64MB 3D DRAM with a 64ms refresh interval, and a 6.8% energy saving can be achieved for the same DRAM capacity with a 32ms refresh interval.
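
A minimal model of the bookkeeping behind this idea is sketched below: the memory controller keeps a per-row counter that is reset on every access and decremented on every refresh tick, and only rows whose counters have expired receive an explicit refresh. The counter width, the tick granularity, and the class interface are illustrative assumptions rather than the exact design in [85].

RETENTION_TICKS = 4    # illustrative: retention window split into 4 refresh ticks

class SmartRefresh:
    """Per-row time-out counters kept in the memory controller (a sketch)."""

    def __init__(self, num_rows):
        self.counter = [0] * num_rows     # 0 means the row is about to expire

    def on_access(self, row):
        # A normal activate (read or write) implicitly restores the row,
        # so its counter is reset to the full retention window.
        self.counter[row] = RETENTION_TICKS

    def refresh_tick(self):
        """Called once per refresh interval; returns only the rows that
        actually need an explicit refresh command."""
        need_refresh = []
        for row, ticks in enumerate(self.counter):
            if ticks == 0:
                need_refresh.append(row)               # not touched recently
                self.counter[row] = RETENTION_TICKS    # refreshed now
            else:
                self.counter[row] = ticks - 1          # still fresh: skip it
        return need_refresh

# Rows that the workload keeps touching are silently skipped at refresh time.
ctrl = SmartRefresh(num_rows=8)
ctrl.on_access(3)
print(ctrl.refresh_tick())   # every row except row 3 is refreshed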

6.3 3D Network-on-Chip

Another on-chip component that potentially benefits most from 3D integration is the on-chip interconnection network (i.e., the network-on-chip, NoC). In recent years, to address the burgeoning bandwidth requirements of many-core architectures, researchers have proposed to substitute multi-hop networks-on-chip, with their superior scalability, for conventional buses with lagging bandwidth improvement. As a communication fabric, the network-on-chip takes advantage of 3D integration in two ways: a) wire-length reduction, which allows links to run faster and more power-efficiently, and b) the implementation of truly 3D network topologies, which improves connectivity and reduces the network diameter, and therefore improves both performance and power. In addition, with the rapid advancement of many-core architectures and their data throughput, the network-on-chip becomes a critical performance factor of future processors that are projected to consist of tens to hundreds of cores.

Among the pioneers exploring this field, researchers from the Pennsylvania State University have proposed a series of novel architectures for 3D NoCs [97-99], each of which incorporates different techniques and presents varying performance-power contours. Li et al [97] proposed a 3D NoC, primarily used to interconnect cache banks, based on dTDMA buses as the vertical links. While this architecture is superior to a 2D NoC and to a hop-by-hop 3D NoC, it suffers from a scalability issue due to the insufficient bandwidth provided by the vertical bus as the number of layers in the stack increases. To address this issue, Kim et al [98] later proposed a 3D NoC router with a dimensionally decomposed crossbar, termed DimDe, which shows better performance and energy metrics for larger numbers of layers. Furthermore, while the previous two works assumed that cores, cache banks, and routers are stacked on each other in their entirety, Park et al [99] took a different approach that finely decomposes each component into sub-modules and stacks the sub-modules. We discuss each work in detail in the following paragraphs.
Fig. 6.7: The 3D bus-hybrid NoC proposed by Li et al [97]. (a) The vertical buses interconnecting all nodes in a cylinder. (b) The dTDMA bus and the arbiter. (c) The 3D router with the capability for vertical communication.

Naively, 3D NoCs can be constructed as straightforward extensions of 2D NoCs, with the same topology but increased dimensionality. Such naive extensions, while potentially improving performance due to the reduced network diameter, incur severe problems identified by Li et al [97]: a) naively increasing the dimensionality increases the size of the router's crossbar from 5x5 to 7x7, which almost doubles the area and power of the crossbar [98]; b) data are transferred in the vertical dimension on a hop-by-hop basis, which sabotages the low-latency advantage of TSVs. For these reasons, Li et al abandon the hop-by-hop NoC and choose a router architecture that has only one port dedicated to vertical communication (Figure 6.7c). This design reduces the size of the crossbar from 7x7 to 6x6 and therefore saves significant amounts of area and power. A dTDMA bus is used to connect all the vertical ports of the routers in a cylinder (Figures 6.7a and 6.7b), such that moving data from any layer to any other layer takes only one-hop latency (which fully leverages the latency advantage of TSVs). The authors also account for the area overheads of the vertical interconnections (TSVs) and for thermal issues, and address these problems by having multiple cores share one vertical bus and by offsetting cores to keep them away from each other. They develop a systematic core placement algorithm embodying considerations of area overheads and thermal issues, which is shown to significantly reduce peak temperatures even under a TSV density constraint. Finally, the existing CMP-DNUCA L2 cache management is adapted to manage cache line migrations both intra- and inter-layer, a policy coined CMP-DNUCA-3D.

Li et al [97] study the performance impact of the proposed 3D NoC and NUCA architecture in comparison with the 2D NUCA. To isolate the effects of cache line migration, a 3D NUCA without dynamic migration (CMP-SNUCA-3D) is also studied. From the results, CMP-SNUCA-3D on average saves 10 cycles of L2 hit latency, and CMP-DNUCA-3D further reduces it by 7 cycles. The improvement in L2 hit latency translates into IPC improvements of up to 18.0% and 37.1%, respectively, especially for applications with intensive L2 accesses. We see from the results that even with static cache line placement, combining a network-on-chip with an L2 NUCA cache achieves significant performance improvement over the 2D NUCA. Another interesting result observed by Li et al is that, due to the increased connectivity and reduced network diameter, the 3D DNUCA has far fewer data migrations than the 2D NUCA, which in part contributes to the reduction in hit latency. One problem with the dTDMA bus based 3D NoC is the limited scalability of the vertical bus. While Li et al focused on a 2-layer 3D CMP, where the scalability problem does not surface, it is found by Kim et al [98] that the dTDMA bus based 3D NoC saturates quickly with an increasing number of layers. Motivated by this problem, Kim et al propose a dimensionally decomposed router architecture (DimDe) whose vertical bandwidth scales better.
Fig. 6.8: The first two 3D NoCs examined by Kim et al [98]: (a) a hop-by-hop 3D symmetric NoC and (b) a 3D NoC-bus hybrid, the same as the one proposed by Li et al [97].
Z) dimension will be ejected to the lo Via Pad Module Up/Down In Router destination layers router. If, however, a d North In South In Down to Pass Layer X-1 Transistor Vertical is used, which allows its coming from di s options, we can envision a true 3D Interconnect (b) their traversal in the destination layer, th h enables seamless integration of the Figure 10:of Overview ofBox the(CB) 3D DimDe NoC Architecture bitration stage is required to handle con (a) Internal Details a Connection (b) Inter-layer Via Layout outer operation. Figure 6 illustrates The second two 3D NoCs examined by Kim et al [98]. (a) A different layers and its residing in from should be noted at this point that theFig. 6.9: Figure 7: Side View of the Inter-Layer Via Structure in a 3D Inter-Layer Number of Delay third arbitration stage, illustrated at the bo true 3D crossbar (b) the proposed dimensionally-decomposed (DimDe) Link Length Repeaters bar - in the context of a 2D physical Crossbar take care of such Inter-Intra Layer (IIL) c 50 m (Layer 1 to 2) 0 7.86 ps ach input is connected to each outputarchitecture. It will be shown in Section 4 that adding a 128-bit vertical link, 100 m (Layer 1 to 3) 0 19.05 ps XYZ algorithms also complicates the re oint. However, extending this denialong with its associated control about 0.01 150 m (Layer 1 to 4)signals, 0 consumes 36.12 only ps different layers. It is no longer enough to would imply a switch of enormous mm2 of silicon real estate. 150 m (Layer 1 to 4) 1 (layer 3) 105.14 ps tination layer; the output port designation he increased numbers of input- and However, despite this encouraging result, Propagation Delay B Table 2: [98], The Inter-Layer Distance In Kims work 4 dierent kinds on of 3D NoC architecturesalso areneeds to be sent. IIL conicts high ith the various layers). Therefore, in there is an opposite side to the coin which volved in coordinating it traversal in a 3 To indicate the fact thatAdding each vertical link in the proposed ar- 6.8a) is a) a 3D symmetric NoC architecture (Figure r structure which can accommodateconsidered: paints a rather bleak picture. a large An example of the use of a non-XYZ routi chitecture is composed of a of wires, we thereby refer to 3D of extension fromin the 2Dcrossbar NoC mentioned above, which in has to an output port through more thanthe naive number vertical links a number 3D Figure 12, which tracks the path of a these links as bundles. The presence of a segmented wire bunsuch a conguration can be viewed7-port torouters increaseand NoCmulti-hop connectivity results in into the eastern output of Layer X+1. In t latencies moving across layers; b) a 3D dle dictates the use of one central arbiter for each vertical bundle, layer and continues traversal in a differen twork, we still call this structure aNoC-bus creased path diversity. This translates into A which is assigned the task of controlling trafc along the vertihybrid architecture (Figureall 6.8b) is simply the 3D NoC Each vertical bundle in DimDe consists city. multiple possible paths between source and cal link. If arbitration is carried out at a local level alone, then the limited by Li pairs. 
et al While abovethis feihui , having the problem with (128 bits in this work), and a number o mbedded in the crossbar and extendproposed destination increased dibenet of concurrent communication along a single vertical Figure 8: bunA central arbiter which coordinates it mov a initially true 3D NoC (Figure 6.9a) fuses the crosshe use of a 55 crossbar, since noscalability; versity may look like layer a router positive dlec) cannot be realized; each wouldatsimply be unaware of the 3D 3 3 3
South
Via Pad

ately limit vertical via density in 3D

Box router (CB) is in Figure Figure10. 7(a). horizontal pass transistor is is shown shown in AsThe illustrated in the gure, the gatedotted, because it is not needed in our proposed 3D crossbar impleway to different layers is facilitated by the inclusion of the third, mentation, is presented in Section The vertical link segVerticalwhich Module. The 3D DimDe router4. uses vertical links which mentation also affects via layout, illustrated Figure 7(b). are segmented at thethe different device as layers through in the use of comWhile this layout is more than7(a) thatshows shown in Figure 5, the pact Connection Boxescomplex (CB). Figure a side view crossarea section between the offset can of still be utilized bywhich other of such a CB. vertical Each boxvias consists 5 pass transistors can connect the by vertical (inter-layer) to the 7(b). horizontal (intracircuitry, as shown the dotted ellipselinks in Figure layer) links. The dotted transistor is not needed in DimDe, because Hence, the 2D crossbars of all layers are physically fused into the design was architected in such a way as to avoid the case where one single three-dimensional crossbar. Multiple internal paths are intra-layer needs to pass athrough a of CB. The CB present, and a communication traveling it goes through number switching structure allows simultaneous transmission in two directions, e.g. a points and links between the input and output ports. Moreover, its it coming from layer X+1 and connecting to the left link of Layer re-entering another layer do not go through an intermediate buffer; X, and a it coming from layer X-1 connecting to the right link instead, they directly connect to the output port of the destination of layer X (see Figure 7(a)). The inclusion of pass transistors in layer. For example, a it can move from the western input port of 3D Network-on-Chip the data path adds delay and degrades the 6.3. signal strength due to layer 2 to the northern output port of layer 4 in a single hop. the associated voltage drop. However, this design decision is fully

Up/Down

East

West

North

PE

West Out

Connection Box

Pass Transistors

need to be dedicated for inter-layerbars of tribute, it actually to a dramatic in- which forms a 555 3D all the routers leads in a vertical column, Crossbar in Table 1, a 55 crossbar is signi-crossbar. crease in the complexity of the central The benet of this architecture aris low cross-layer Conceptuallatency and hungry than the 66 crossbar of the biter, which coordinates inter-layer commuForm improved cross-layer bandwidth. However, it has high area and power 77 crossbar of the 3D Symmetric nication in the 3D crossbar. The arbiter now asbetween arbitration complexity, which compromises its fean the various links in a 3D crossbaroverheads needs as to well decide a multitude of possible interconnections, dedicated connection boxes at each and requires an excessive number of control signals to enable all s can facilitate linkage between verthese interconnections. Even if the arbiter functionality can be disallowing exible it traversal within tributed to multiple smaller arbiters, then the coordination between

Layer X-1

Layer X

Layer X+1

South Out

North Out

East Out

PE Out

n t io ec Ej

d) a partially-connected 3D NoC router (Figure 6.9b) limits the connectivity along the vertical dimension by dimensionally decomposing the crossbar into x-, y-, and z-modules. Kim et al propose this novel architecture to address the complexity problem of the true 3D crossbar. In this architecture, flits traveling in different dimensions are guided to different switching modules. In the 3D router, three modules (row, column, and vertical) are dedicated to the corresponding dimensions. In addition, to avoid re-buffering flits that switch from the vertical to the planar dimensions, the vertical module has direct connections fused into the crossbars of the row and column modules. To constrain the complexity, the arbiter for the decomposed crossbar is designed with multiple arbitration stages, where the first stage is the local arbitration among flits requesting to change layers and the second stage is the global arbitration that decides which winners from the first stage can use which segments of the vertical crossbar. The key stage is the global arbitration, which should be designed to maximize parallel transmissions in the vertical dimension. If flits are allowed to route in the planar dimensions after traveling in the vertical dimension (ZXY or adaptive routing), an additional stage is needed to resolve inter-intra layer (IIL) conflicts between change-of-layer and local flits.
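
The two-stage arbitration can be summarized by the following behavioral sketch (our simplification, not the DimDe implementation): stage one picks one candidate flit per layer, and stage two greedily grants those vertical requests whose source-to-destination spans do not overlap on the segmented link, so that several inter-layer transfers can proceed concurrently. The greedy policy, the data types, and the example traffic are assumptions used only for illustration.

def local_arbitrate(requests_per_layer):
    """Stage 1: per-layer arbitration, reduced here to 'first requester wins'."""
    return [reqs[0] for reqs in requests_per_layer.values() if reqs]

def global_arbitrate(winners):
    """Stage 2: greedily grant vertical requests whose spans of the
    segmented link do not overlap, so they can proceed concurrently."""
    granted, used_segments = [], set()
    for src, dst in sorted(winners):
        span = set(range(min(src, dst), max(src, dst)))  # link segments used
        if span and not (span & used_segments):
            granted.append((src, dst))
            used_segments |= span
        # conflicting requests simply retry in a later arbitration cycle
    return granted

# Example: flits want to move 0->2, 1->3, and 3->2. The 0->2 and 3->2
# transfers use disjoint link segments and are granted together, while
# 1->3 conflicts on segment 1 and must wait, mirroring the concurrent
# inter-layer transfers DimDe aims to exploit.
winners = local_arbitrate({0: [(0, 2)], 1: [(1, 3)], 3: [(3, 2)]})
print(global_arbitrate(winners))   # [(0, 2), (3, 2)]
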

Fig. 6.10: Decomposition of the separable input buffer: (a) the 2D/3D baseline input buffer, (b) the 3D decomposed input buffer, and (c) the decomposed buffer with the unused portion powered off.

Fig. 6.11: Decomposition of the crossbar: the 2D baseline crossbar (2DB), the 3D decomposed crossbar (3DM), and the 3D decomposed crossbar with support for express channels (3DM-E), which slightly increases the crossbar size.

Fig. 6.12: Decomposition of the inter-router links.

Park et al [99] take a different design paradigm from the previous two works. Inspired by the single-core fine-granularity designs, they propose to finely partition each module of a router into sub-modules and stack the sub-modules to achieve a real 3D router. First, router components are identified as separable and non-separable modules. The separable modules are the input buffer, the crossbar, and the inter-router links; the non-separable modules are the arbitration and routing logics. To decompose the separable modules, they propose to bit-slice the input buffer (Figure 6.10), the crossbar (Figure 6.11), as well as the inter-router links (Figure 6.12), such that the data width of each component is reduced from W to W/n, where n is the number of layers. Bit-slicing the input buffer reduces the length of the word-lines and saves power. In addition, bit-slicing the input buffer allows for selectively switching off any unused slices of the buffer to save power during run time. The same techniques can also be applied to the crossbar and the inter-router links. As a result of 3D partitioning, the area of the router reduces and the available link bandwidth for each router increases. This excess bandwidth is leveraged to construct express physical links in this work and is shown to significantly improve performance. In addition, with reduced crossbar size and link length, the latencies of both the crossbar and link stages decrease, and they can be combined into one stage without violating timing constraints. According to their experimental results, the proposed architecture with express links performs best among the architectures examined (2D NoC, 3D hop-by-hop NoC, and the proposed architecture without and with express links). The latency improvements are up to 51% and 38%, and the power improvements are up to 42% and 67%, for synthetic traffic and real workloads respectively. Aside from electrical NoCs, 3D optical NoCs leveraging heterogeneous technology integration have also been proposed [100, 101]. The advancement in silicon photonics has promoted research on building on-chip optical interconnects with monolithic optical components, e.g., waveguides, light modulators, and detectors. However, building feasible optical on-chip networks requires massively fabricating these components together with electrical devices. Since the optimal processes are different


for optical and electrical devices, this presents an issue for technology integration. This problem is resolved with 3D integration by using different technologies on different layers. Ye et al [100] have demonstrated an optical NoC using 3D stacking, based on the Cygnus optical router architecture. Cygnus leverages waveguides and micro-ring resonators in the optical crossbar to switch data between directions. The micro-ring resonators can be viewed as couplers that can divert signals from waveguides. A micro-ring can be switched between the on and off states by electrical signals: in the on state it diverts optical signals, while in the off state it lets optical signals pass by. Due to the absence of optical buffering, Cygnus proposes using circuit switching, and the optical path set-up and tear-down are done with an electrical hop-by-hop network. As discussed in the previous paragraph, however, it is technologically difficult to integrate the optical network with the electrical network. A solution surfaces with 3D integration, where the two networks are fabricated on different layers; control signals and inject/eject links are implemented by TSVs. Comparing 3D Cygnus with an electrical network of equal size, experimental results show up to 70% power savings and 50% latency savings. The drawback of the 3D Cygnus, however, is the slightly lower maximum throughput due to the overheads of circuit switching. Zhang et al [101] propose yet another 3D optical NoC using packet switching. This work is motivated by Corona [102], which adapts token rings for optical interconnects. The problem with token rings, however, is the reduced utilization when the network size becomes large, due to the elongated token circulating time. Zhang et al propose to divide a large network into smaller regions and implement optical token rings inside and between regions. While this approach reduces the token circulating time, it introduces a large number of waveguide crossings that severely affect optical signal integrity. To address this problem, they leverage the recently demonstrated optical vias that can couple light from one layer to another. They propose to stack crossing waveguides on different layers such that crosstalk and signal loss are largely reduced. As a result, the proposed 3D NoC architecture outperforms Corona and electrical networks in both performance and power. For example, the zero-load latency of the proposed 3D optical NoC is 28%

lower than that of Corona, while the throughput is 2.4 times higher. In addition, the proposed 3D optical NoC consumes 6.5% less power than Corona. We have seen that 3D integration benefits NoC architectures with shorter links, smaller network diameters, and heterogeneous integration. Some researchers, on the other hand, also find that the potentially large size and the high parasitics of TSVs may create new problems for 3D NoCs. The area overhead imposed by TSV bundles used as the vertical links can be significant even with a moderate 16µm pitch [103]. This problem stresses the allocation of valuable on-chip real estate and should be taken into consideration when designing practical 3D chips. In [103], data serialization is proposed to constrain the area overhead imposed by TSVs. Specifically, light-weight hardware serializers/de-serializers are designed to implement an efficient serial protocol. It is shown that with 4:1 serialization, up to 70% of the vertical link area can be saved, with less than 0.8% power and 5% performance overheads. Using asynchronous transmission, the serial links can be clocked at as high as 5GHz, and the performance overheads are reduced to under 3%. Qian et al [104] consider the performance of 3D NoCs from a different perspective: the worst-case performance. They utilize a canonical tool called network calculus to obtain analytical worst-case delay bounds of packets, and also verify these bounds by extensive simulation. Their findings with 3D NoC interestingly show that while the average packet latency of a 3D NoC is lower than that of an equal-size 2D NoC, the worst-case latency is notably worse. The primary factor for this performance degradation is the reduced vertical bandwidth due to serialization. While [103] shows that serialization is effective in saving area and introduces small overheads to average performance, Qian et al.'s work presents an alarming observation that the worst-case performance may suffer from 3D integration, which is especially detrimental to hard real-time applications.

7
Cost Analysis for 3D ICs

7.1

Introduction

Three-dimensional integrated circuit (3D IC) [1, 105–108] is emerging as an attractive option for overcoming the barriers in interconnect scaling, thereby offering an opportunity to continue performance improvements using CMOS technology. There are various vertical 3D interconnect technologies that have been explored [1], including wire bonding, microbump, contactless (capacitive or inductive), monolithic multiple-layer buried structure (MLBS) [109], and through-silicon via (TSV) vertical interconnect. A comparison in terms of vertical density and practical limits can be found in [1]. TSV-based 3D integration has the potential to offer the greatest vertical interconnect density with minimum impact on conventional IC fabrication approaches, and therefore it is the most promising one among these vertical interconnect technologies. Consequently, we focus our study on TSV-based 3D ICs in this research. In a TSV-based 3D chip, multiple dies are first fabricated with traditional 2D processes separately and later stacked together with direct TSVs. Fig. 7.1 shows a conceptual 2-layer 3D-stacked chip.

Fig. 7.1: A conceptual 3D IC: two dies are stacked together with TSVs.

In general, the benefits of TSV-based 3D integration are listed as follows:
- Shorter global interconnects [107, 108];
- Higher packing density and smaller footprint due to the addition of a third dimension;
- Higher performance due to reduced average interconnect length and higher memory bandwidth [105];
- Lower interconnect power consumption due to the total wiring length reduction [107]; and
- Support for heterogeneous chips with mixed technologies [110, 111].

The majority of 3D IC research so far has focused on how to take advantage of the performance, power, smaller form-factor, and heterogeneous integration benefits offered by 3D integration [1, 105–108, 110, 111]. However, when it comes to the discussion on the adoption of such an emerging technology as a mainstream design approach, it all comes down to the question of the cost of 3D integration. All the advantages of 3D ICs ultimately have to be translated into cost savings when a design strategy has to be decided [112]. For example, designers may ask themselves questions like,


- Do all the benefits of 3D IC design come with a much higher cost? For example, 3D bonding incurs extra process cost, and the Through-Silicon Vias (TSVs) may increase the total die area, which has a negative impact on the cost. However, smaller die sizes in 3D ICs may result in higher yield than that of a larger 2D die, and reduce the cost.
- How to do 3D integration in a cost-effective way? For example, re-designing a small chip may not gain the cost benefit of the improved yield resulting from 3D integration. In addition, if a chip is to be implemented in 3D, how many layers of 3D integration would be cost-effective, and should one use wafer-to-wafer or die-to-wafer stacking [107]?
- Are there any design options to compensate for the extra 3D bonding cost? For example, in a 3D IC, since some global interconnects are now implemented by TSVs, it may be feasible to use fewer metal layers for each 2D die. In addition, heterogeneous integration via 3D could also help cost reduction.

Cost analysis for 3D ICs at the early design stage is critical to answer these questions. Considering that 3D integration has both positive and negative effects on the manufacturing cost, there is no clear answer yet as to whether 3D integration is cost-effective or how to achieve 3D integration in a cost-effective way. In order to maximize the positive effects and compensate for the negative effects on cost, it becomes critical to analyze the 3D IC cost at the early design stage. This early-stage cost analysis can help chip designers decide whether 3D integration should be used, or which 3D partitioning strategy should be adopted. Cost-efficient design is the key for the future wide adoption of the emerging 3D IC design, and 3D IC cost analysis needs close coupling between 3D IC design and 3D IC process 1 . In this chapter, we first propose a 3D IC cost analysis methodology
1 IC Cost analysis needs a close interaction between designers and foundry. We work closely with

our industrial partners to perform 3D IC cost analysis. However, we cannot disclose absolute numbers for the cost, and therefore in this paper, we either use arbitrary units (a.u.) or normalized values to present the data.

with a complete set of cost models that includes wafer cost, 3D bonding cost, package cost, and cooling cost. Using this cost analysis methodology along with existing performance, power, and area estimation tools, such as McPAT [113], we estimate the 3D IC cost in two cases: one for fully-customized ASICs, the other for many-core microprocessor designs. During these case studies, we explore some guidelines showing what a cost-effective 3D IC fabrication option should be in the future. Note that since 3D integration is not yet a mature technology with very well-developed and tested cost models, the optimal conditions concluded by this work are subject to parameter changes. However, the major contribution of this work is to provide a cost estimation methodology for 3D ICs. To the best of our knowledge, this is the first effort to model the 3D IC cost with package and cooling cost included, while the majority of existing 3D IC research activities have mainly focused on circuit performance and power consumption. Abundant research effort has been put into 3D integration since it has been widely recognized as a promising technology and an alternative approach to preserve the chip scaling trend predicted by Moore's Law. However, most of the previous work on 3D integration focused on its architectural benefits [106, 114–116] or its associated design automation tools [117–120]. Only a few of them considered the cost issue of 3D ICs. Coskun et al. [121] and Liu et al. [122] investigated the cost model for 3D Systems-on-Chips (SoCs). Based on their studies, the key factor that affects the wafer cost is considered to be the die area. However, both of their cost models are based on simplified die area estimation methodologies. Coskun et al. [121] only considered the area of cores and memories, and they assumed the core area and the memory area to be equal for experimental convenience. In addition, they assumed the number of TSVs per chip to be a fixed value. Liu et al. [122] ignored the area overhead caused by TSVs in their study. A yield and cost model [123] was developed for 3D ICs with particular emphasis on stacking yield and how wafer yield is affected by vertical interconnects. However, there is no further analysis of the cooling cost, which is a non-negligible part of the final system-level cost. Weerasekera et al. [124] also described a quantitative cost model for


3D ICs. Their cost model features a detailed analysis of the cost per production step, including the package cost. However, the package cost presented in that work is based on a plastic package, which is not an appropriate package type for either SoCs or microprocessors. In this chapter, we propose a complete set of 3D cost models, which not only includes the wafer cost and the 3D bonding cost, but also includes another two critical cost components, the package cost and the cooling cost. Neither of them should be ignored, because while 3D integration can potentially reduce the package cost by having a smaller chip footprint, the multiple 3D-stacked dies might increase the cooling cost due to the increased power density. By using the proposed cost model set, we conduct two case studies on fully-customized ASIC and many-core microprocessor designs, respectively. The experimental results show that our 3D IC cost model makes it feasible to estimate the final system-level cost of the target chip design at the early design stage. More importantly, it gives some guidelines to determine what a cost-effective 3D fabrication option should be.

7.2

Early Design Estimation for 3D ICs

It is estimated that design choices made within the first 20% of the total design cycle time will ultimately determine up to 80% of the final product cost [124]. Hence, to facilitate the design decision of using 3D integration from a cost perspective, it is necessary to perform cost analysis at the early design stage. While it is straightforward to sum up all the incurred cost after production, predicting the final chip cost at the early design stage is a big challenge because most of the detailed design information is not available at that stage. Taking two different IC design styles (the ASIC design style and the many-core microprocessor design style) as examples:
- ASIC design style: At the early design stage, probably only the system-level (such as behavioral-level or RTL-level) design specification is available, and a rough gate count can be estimated from a quick high-level synthesis or logic synthesis tool, or simply from past design experience.
- Microprocessor design style: Using a many-core microprocessor

design as an example, the information available at the early design stage is also very limited. The design specification may just include information as follows: (1) the number of microprocessor cores; (2) the type of microprocessor cores (in-order or out-of-order); and (3) the number of cache levels and the cache capacity of each level. All these specifications are at the architectural level. Referring to previous design experience, it is feasible to achieve a rough estimation of the gate count for logic-intensive cores and the cell count for memory-intensive caches, respectively.

Consequently, it is very likely that at the early design stage, the cost estimation is simply based on the design style and a rough gate count for the design as the initial starting point. In this section, we describe how to translate the logic gate count (or the memory cell count) into higher-level estimations, such as the die area, the metal layer count, and the TSV count, given a specified fabrication process node. The die area estimator and the metal layer estimator are two key components in our cost model: (i) a larger die area usually causes lower die yield, and thus leads to higher chip cost; (ii) the fewer metal layers required, the fewer fabrication steps (and fabrication masks) needed, which reduces the chip cost. It is also important to note that the 3D partitioning strategy can directly affect these two numbers: (1) 3D integration can partition the original 2D design into several smaller dies; (2) TSVs can potentially shorten the total interconnect length and thus reduce the number of metal layers.

7.2.1 Die Area Estimation

At the early design stage, the relationship between the die area and the gate count can be roughly described as follows,

$$A_{die} = N_{gate} \times A_{gate} \qquad (7.1)$$

where Ngate is the gate count and Agate is an empirical parameter that shows the proportional relationship between area and gate counts. Based on empirical data from many industrial designs, in this work, we


assume that Agate is equal to $3125\lambda^2$, in which $\lambda$ is half of the feature size for a specific technology node. Although this area estimation methodology is straightforward and highly simplified, it accords with real-world measurements quite well2.

7.2.2 Wire Length Estimation

As the only inputs at the early design stage are the estimated gate counts, it is necessary to further estimate the design complexity, which can be represented by the total length of wire interconnects. Rent's Rule [125] is a well-known and powerful tool that reveals the trend between the number of signal terminals and the number of internal gates. Rent's Rule can be expressed as follows,

$$T = k N_{gate}^{p} \qquad (7.2)$$

where the parameters k and p are Rent's coefficient and exponent, and T is the number of signal terminals. Although Rent's Rule is an empirical result based on observations of previous designs and it is not proper to use it for non-traditional designs, it does provide a useful framework to compare similar architectures, and it is plausible to use it as a part of our 3D IC cost model. Based on Rent's Rule, Donath discovered that it can be used to estimate the average wire length [126], and Davis et al. found that it can be used to estimate the wire length distribution in VLSI chips [127]. As a result, in this work, we use the derivation of Rent's Rule to predict the wire length distribution function i(l) [127], which has the following forms,

Region I: $1 \le l < \sqrt{N_{gate}}$
$$i(l) = \frac{\alpha k \Gamma}{2}\left(\frac{l^3}{3} - 2\sqrt{N_{gate}}\,l^2 + 2 N_{gate}\,l\right) l^{2p-4}$$

Region II: $\sqrt{N_{gate}} \le l < 2\sqrt{N_{gate}}$
$$i(l) = \frac{\alpha k \Gamma}{6}\left(2\sqrt{N_{gate}} - l\right)^3 l^{2p-4} \qquad (7.3)$$

where l is the interconnect length in units of the gate pitch, $\alpha$ is the fraction of the on-chip terminals that are sink terminals and is related to the average fanout of a gate (f.o.) as follows,

$$\alpha = \frac{f.o.}{f.o. + 1} \qquad (7.4)$$

and $\Gamma$ is a normalization constant calculated as follows,

$$\Gamma = \frac{2 N_{gate}\left(1 - N_{gate}^{p-1}\right)}{-N_{gate}^{p}\,\dfrac{1 + 2p - 2^{2p-1}}{p(2p-1)(p-1)(2p-3)} - \dfrac{1}{6p} + \dfrac{2\sqrt{N_{gate}}}{2p-1} - \dfrac{N_{gate}}{p-1}} \qquad (7.5)$$

2 Note that it is necessary to update Agate along with technology advances because the gate length does not scale strictly linearly.
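For illustration, the following Python sketch evaluates the die area of Equation 7.1 and the wire length distribution of Equation 7.3 for an assumed gate count. The function names, the 65nm example, and the choice of normalizing i(l) numerically against the total interconnect count of Equation 7.9 (rather than the closed-form Γ of Equation 7.5) are our own simplifications, not part of any released tool.

```python
import numpy as np

def die_area_um2(n_gate, feature_size_nm):
    """Equation 7.1: A_die = N_gate * A_gate, with A_gate = 3125 * lambda^2."""
    lam_um = (feature_size_nm / 2.0) * 1e-3   # lambda = half the feature size, in um
    return n_gate * 3125.0 * lam_um ** 2

def wire_length_distribution(n_gate, p=0.63, k=0.14, fanout=4.0):
    """Davis-style distribution i(l) of Equation 7.3, scaled so that it sums to
    the total interconnect count alpha*k*N*(1 - N^(p-1)) of Equation 7.9."""
    alpha = fanout / (fanout + 1.0)            # Equation 7.4
    sqrt_n = np.sqrt(n_gate)
    l = np.arange(1.0, 2.0 * sqrt_n)           # wire lengths in gate pitches
    region1 = (alpha * k / 2.0) * (l**3 / 3.0 - 2.0 * sqrt_n * l**2 + 2.0 * n_gate * l)
    region2 = (alpha * k / 6.0) * (2.0 * sqrt_n - l) ** 3
    i_l = np.where(l < sqrt_n, region1, region2) * l ** (2.0 * p - 4.0)
    total_wires = alpha * k * n_gate * (1.0 - n_gate ** (p - 1.0))
    return l, i_l * total_wires / i_l.sum()

if __name__ == "__main__":
    n_gate = 5_000_000                          # an assumed 5M-gate logic design
    print("die area (mm^2): %.1f" % (die_area_um2(n_gate, 65) / 1e6))
    l, i_l = wire_length_distribution(n_gate)
    print("average wire length (gate pitches): %.1f" % ((l * i_l).sum() / i_l.sum()))
```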

7.2.3

Metal Layer Estimation

After estimating the die area and the length of wires, we are able to further predict the number of metal layers that are required to route all the interconnects within the die area constraint. The number of required metal layers for routing depends on the complexity of the interconnects. A simplified metal layer estimation can be derived from the average wire length [124] as follows,

$$n_{wire} = \frac{f.o. \cdot \bar{R}_m \cdot w \cdot N_{gate}}{\eta \cdot A_{die}} \qquad (7.6)$$

where f.o. refers to the average gate fanout, w to the wire pitch, η to the utilization efficiency of metal layers, R̄m to the average wire length, and nwire to the number of required metal layers. Such a simplified model is based on the assumptions that each metal layer has the same utilization efficiency and the same wire width [124]. However, such assumptions may not be valid in real designs [128]. To improve the estimation of the number of metal layers needed for feasible routing, a more sophisticated metal layer estimation method is derived from the wire length distribution rather than from the simplified average wire length estimation. The basic flow of this method is explained as follows. Estimate the available routing area of each metal layer with


the expression:

$$K_i = \frac{A_{die}\,\eta_i - 2 A_{vi}\left(N_{gate} \cdot f.o. - I(l_i)\right)}{w_i} \qquad (7.7)$$

where i is the metal layer index, ηi is the layer's utilization efficiency, wi is the layer's wire pitch, Avi is the layer's via blockage area, and the function I(l) is the cumulative integral of the wire length distribution function i(l), which is expressed in Equation 7.3. Assume that shorter interconnects are routed on lower metal layers. Starting from Metal 1, we route as many interconnects as possible on the current metal layer until the available routing area is used up. The interconnects routed on each metal layer can be expressed as:

$$\chi\left(L(l_i) - L(l_{i-1})\right) \le K_i \qquad (7.8)$$

where $\chi = 4/(f.o. + 3)$ is a factor accounting for the sharing of wires between interconnects on the same net [127, 129], and the function L(l) is the first-order moment of i(l). Repeat the same calculation for each metal layer in a bottom-up manner until all the interconnects are routed properly. By applying the estimation methodology introduced above, we can predict the die area and the number of metal layers at the early design stage where we only have the gate count as the input. Table 7.1 lists the values of all the related parameters [128, 130]. Fig. 7.2 shows an example which estimates the area and the number of metal layers of 65nm designs with different gate counts. Fig. 7.2 also shows an important implication for 3D IC cost reduction: when a large 2D chip is partitioned into multiple smaller dies with 3D stacking, each smaller die requires fewer metal layers to satisfy the interconnect routability requirements. Such a metal layer reduction opportunity can potentially offset the cost of the extra steps caused by 3D integration, such as 3D bonding and test.
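The bottom-up layer assignment described above can be sketched in a few lines of Python. The sketch reuses the distribution from the previous sketch, takes the utilization efficiencies and wire pitches from Table 7.1, and treats the via blockage of Equation 7.7 only approximately, so it is an illustrative approximation rather than the exact tool flow.

```python
import numpy as np

def estimate_metal_layers(l, i_l, n_gate, a_die_um2,
                          fanout=4.0, feature_um=0.065, max_layers=20):
    """Assign shorter wires to lower layers until each layer's routing
    capacity K_i (in the spirit of Eq. 7.7) is exhausted, per Eq. 7.8."""
    chi = 4.0 / (fanout + 3.0)                     # wire-sharing factor (Eq. 7.8)
    gate_pitch_um = np.sqrt(a_die_um2 / n_gate)    # physical size of one gate pitch

    def eta(i):                                    # utilization efficiency (Table 7.1)
        table = {1: 0.09, 2: 0.25, 3: 0.54, 4: 0.38}
        return table.get(i, 0.60 if i % 2 else 0.42)

    def pitch(i):                                  # wire pitch w_i (Table 7.1), in um
        return (i + 1) * feature_um if i < 9 else 10.0 * feature_um

    demand = chi * i_l * l * gate_pitch_um         # track length needed per length bin
    routed, cursor = 0.0, 0
    for layer in range(1, max_layers + 1):
        # wires not yet routed still need vias through this layer (approximation)
        blocked = 2.0 * (pitch(layer) + 3.0 * feature_um) ** 2 * (i_l.sum() - routed)
        k_i = max(a_die_um2 * eta(layer) - blocked, 0.0) / pitch(layer)
        used = 0.0
        while cursor < len(l) and used + demand[cursor] <= k_i:
            used += demand[cursor]
            routed += i_l[cursor]
            cursor += 1
        if cursor == len(l):
            return layer                           # all interconnects routed
    return max_layers

# Example usage (values carried over from the previous sketch):
# layers = estimate_metal_layers(l, i_l, n_gate=5_000_000,
#                                a_die_um2=die_area_um2(5_000_000, 65))
```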


Fig. 7.2: An example of early design estimation of the die area and the metal layer count under a 65nm process. (This estimation is well correlated with state-of-the-art microprocessor designs. For example, Sun SPARC T2 [131] contains about 500 million transistors (roughly equivalent to 125 million gates), with an area of 342mm² and 11 metal layers.)

7.2.4 TSV Estimation

The existence of TSVs in a 3D IC has effects on the wafer cost as well. These effects are twofold: (1) In 3D ICs, some global interconnects are now implemented by TSVs going between stacked dies. This can lead to a reduction of the total wire length, and provides an opportunity for metal layer reduction on each smaller die; (2) On the other hand, 3D stacking with TSVs may increase the total die area, since the silicon area where TSVs punch through may not be utilized for building devices or 2D metal layer connections3. While the first effect is already explained by Fig. 7.2, showing that the existence of TSVs leads to a potential metal layer reduction in the
3 Based on current TSV fabrication technologies, the diameter of TSVs ranges from 0.2µm to 10µm [106]. In this work, we use a TSV diameter of 10µm, and assume that the keep-out area diameter is 2.5X the TSV diameter, i.e., 25µm.


Table 7.1: The parameters used in the metal layer estimation model
Rent's exponent for logic, p_logic: 0.63
Rent's coefficient for logic, k_logic: 0.14
Rent's exponent for memory, p_memory: 0.12
Rent's coefficient for memory, k_memory: 6.00
Average gate fanout, f.o.: 4
Utilization efficiency of layer 1, η1: 9%
Utilization efficiency of layer 2, η2: 25%
Utilization efficiency of layer 3, η3: 54%
Utilization efficiency of layer 4, η4: 38%
Utilization efficiency of other odd layers, η5,7,9,11,...: 60%
Utilization efficiency of other even layers, η6,8,10,12,...: 42%
Wire pitch of lower metal layers, wi (i < 9): (i + 1)F
Wire pitch of higher metal layers, wi (i ≥ 9): 10F
Blockage area per via on layer i, Avi: (wi + 3F)²

horizontal direction, modeling the second effect needs extra effort: we must add an estimation of the number of TSVs and their associated silicon area overhead. To predict the number of required TSVs for a certain partition pattern, a derivation of Rent's Rule [126] describing the relationship between the interconnect count (X) and the gate count (Ngate) can be used. This interconnect-gate relationship is formulated as follows,

$$X = \alpha k N_{gate}\left(1 - N_{gate}^{p-1}\right) \qquad (7.9)$$

where $\alpha$ is calculated by Equation 7.4. As illustrated in Fig. 7.3, if a planar 2D design is partitioned into two separate dies whose gate counts are N1 and N2, respectively, then by using Equation 7.9, the interconnect counts on each layer, X1 and X2, can be calculated, respectively, as follows,

$$X_1 = \alpha k_1 N_1\left(1 - N_1^{p_1-1}\right) \qquad (7.10)$$
$$X_2 = \alpha k_2 N_2\left(1 - N_2^{p_2-1}\right) \qquad (7.11)$$

Fig. 7.3: The basic idea of how to estimate the number of TSVs.

Irrespective of the partition pattern, the total interconnect count of a certain design always remains constant. Thus, the number of TSVs can be calculated as follows,

$$X_{TSV} = \alpha k_{1,2}(N_1 + N_2)\left(1 - (N_1 + N_2)^{p_{1,2}-1}\right) - \alpha k_1 N_1\left(1 - N_1^{p_1-1}\right) - \alpha k_2 N_2\left(1 - N_2^{p_2-1}\right) \qquad (7.12)$$

where k1,2 and p1,2 are the equivalent Rent's coefficient and exponent, and they are derived as follows [132],

$$p_{1,2} = \frac{p_1 N_1 + p_2 N_2}{N_1 + N_2} \qquad (7.13)$$

$$k_{1,2} = \left(k_1^{N_1}\, k_2^{N_2}\right)^{1/(N_1 + N_2)} \qquad (7.14)$$

where Ni is the gate count on layer i. Therefore, after adding the area overhead caused by TSVs, we formulate the new TSV-included area estimation, A3D, as follows:

$$A_{3D} = A_{die} + N_{TSV/die} \times A_{TSV} \qquad (7.15)$$

where Adie is calculated by the die area estimator, NTSV/die is the equivalent number of TSVs on each die, and ATSV is the size of a TSV (including the keep-out area). In practice, the final TSV-included die area, A3D, is used in the wafer cost estimation later, while Adie is used in Equation 7.7 for metal layer estimation because that is the actual die area available for routing.
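The following sketch applies Equations 7.9 through 7.15 to an (assumed) even two-way split of a logic design; note that Equation 7.14 is evaluated in log space to avoid numerical underflow for large gate counts, and the keep-out size follows the earlier 25µm assumption.

```python
import math

def external_wires(n, k, p, fanout=4.0):
    """Interconnect count of a block with n gates (Equation 7.9)."""
    alpha = fanout / (fanout + 1.0)                        # Equation 7.4
    return alpha * k * n * (1.0 - n ** (p - 1.0))

def tsv_count(n1, n2, k1=0.14, p1=0.63, k2=0.14, p2=0.63, fanout=4.0):
    """TSVs needed between two dies of a partitioned design (Eqs. 7.12-7.14)."""
    p12 = (p1 * n1 + p2 * n2) / (n1 + n2)                  # Equation 7.13
    k12 = math.exp((n1 * math.log(k1) + n2 * math.log(k2)) / (n1 + n2))  # Eq. 7.14
    return (external_wires(n1 + n2, k12, p12, fanout)
            - external_wires(n1, k1, p1, fanout)
            - external_wires(n2, k2, p2, fanout))

def area_with_tsvs_um2(a_die_um2, n_tsv_per_die, keepout_um=25.0):
    """TSV-included die area of Equation 7.15, assuming a square keep-out."""
    return a_die_um2 + n_tsv_per_die * keepout_um ** 2

if __name__ == "__main__":
    half = 2_500_000                      # a 5M-gate design split evenly (assumed)
    print("estimated TSVs between the two dies: %.0f" % tsv_count(half, half))
```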

7.3

3D Cost Model

After estimating the die area of each layer, the number of required metal layers, and the number of TSVs across planar dies in the previous



Fig. 7.4: Overview of the proposed 3D cost model, which includes four components: wafer cost model, 3D bonding cost model, package cost model, and cooling cost model.

section, a complete set of cost models is proposed in this section. Our 3D IC cost estimation is composed of four separate parts: a wafer cost model, a 3D bonding cost model, a package cost model, and a cooling cost model. An overview of this proposed cost model set is illustrated in Fig. 7.4.

7.3.1 Wafer Cost Model

The wafer cost model estimates the cost of each separate planar die. This cost estimation includes all the cost incurred before the 3D bonding process. Before estimating the die cost, we first model the wafer cost. The wafer cost is determined by several parameters, such as the fabrication process type (i.e., CMOS logic or DRAM memory), the process node (from 180nm down to 22nm), the wafer diameter (200mm or 300mm), and the number of metal layers (polysilicon, aluminum, or copper depending


on the fabrication process type). We model the wafer cost by dividing it into two parts: a fixed part that is determined by the process type, process node, wafer diameter, and the actual fabrication vendor; and a variable part that is mainly attributable to the number of metal layers. We call the fixed part the silicon cost and the variable part the metal cost. Therefore, the wafer cost is expressed as follows:

$$C_{wafer} = C_{silicon} + N_{metal} \times C_{metal} \qquad (7.16)$$

where Cwafer is the wafer cost, Csilicon is the silicon cost (the fixed part), Cmetal is the metal cost per layer, and Nmetal is the number of metal layers. All the parameters used in this model are from the ones collected by IC Knowledge LLC [133], which considers several factors including material cost, labor cost, foundry margin, number of reticles, cost per reticle, and other miscellaneous costs. Fig. 7.5 shows the predicted wafer cost of 90nm, 65nm, and 45nm processes, with 9 or 10 layers of metal, for three different foundries, respectively. Besides the number of metal layers affecting the cost as demonstrated by Equation 7.16, the die area is another key factor that affects the bare die cost, since a smaller die area implies more dies per wafer. The number of dies per wafer is formulated by Equation 7.17 [134] as follows,

$$N_{die} = \frac{\pi\,(\phi_{wafer}/2)^2}{A_{die}} - \frac{\pi\,\phi_{wafer}}{\sqrt{2\,A_{die}}} \qquad (7.17)$$

where Ndie is the number of dies per wafer, φwafer is the diameter of the wafer, and Adie is the die area. After having estimated the wafer cost by Equation 7.16 and the number of dies per wafer by Equation 7.17, the bare die cost can be expressed as follows,

$$C_{die} = \frac{C_{wafer}}{N_{die}} \qquad (7.18)$$

In addition, the die area also relates to the die yield, which in turn affects the net die cost. By assuming a rectangular defect density distribution, the relationship between the die area and the die yield can be formulated as follows [135],

$$Y_{die} = Y_{wafer} \cdot \frac{1 - e^{-2 A_{die} D_0}}{2 A_{die} D_0} \qquad (7.19)$$

where Ydie and Ywafer are the yields of dies and wafers, respectively, and D0 is the wafer defect density. Therefore, the net die cost is Cdie/Ydie.
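A minimal sketch of the per-die cost calculation (Equations 7.16 through 7.19) is shown below. Because the actual foundry figures cannot be disclosed, the silicon cost, per-metal-layer cost, wafer yield, and defect density used here are arbitrary placeholder values in arbitrary units (a.u.).

```python
import math

def wafer_cost(n_metal, c_silicon=1000.0, c_metal=150.0):
    """Equation 7.16: fixed silicon cost plus a per-metal-layer cost (a.u.)."""
    return c_silicon + n_metal * c_metal

def dies_per_wafer(a_die_mm2, wafer_diameter_mm=300.0):
    """Equation 7.17: gross dies per wafer."""
    return (math.pi * (wafer_diameter_mm / 2.0) ** 2 / a_die_mm2
            - math.pi * wafer_diameter_mm / math.sqrt(2.0 * a_die_mm2))

def die_yield(a_die_mm2, d0_per_mm2=0.002, y_wafer=0.98):
    """Equation 7.19: yield under a rectangular defect density distribution."""
    x = 2.0 * a_die_mm2 * d0_per_mm2
    return y_wafer * (1.0 - math.exp(-x)) / x

def net_die_cost(a_die_mm2, n_metal):
    """Bare die cost (Equation 7.18) divided by the die yield."""
    c_die = wafer_cost(n_metal) / dies_per_wafer(a_die_mm2)
    return c_die / die_yield(a_die_mm2)

if __name__ == "__main__":
    # a 250 mm^2 die with 10 metal layers vs. two 125 mm^2 dies with 8 layers each
    print("2D net die cost (a.u.): %.2f" % net_die_cost(250.0, 10))
    print("per-die cost after a 2-way split (a.u.): %.2f" % net_die_cost(125.0, 8))
```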


Fig. 7.5: A batch of data calculated by the wafer cost model. The wafer cost varies with different processes, different numbers of metal layers, different foundries, and some other factors.

7.3.2 3D Bonding Cost Model

The 3D bonding cost model estimates the cost incurred during the process that integrates several planar 2D dies together using the TSV-based technology. The extra fabrication steps required by 3D integration consist of TSV forming, thinning, and bonding. There are two ways to build 3D TSVs: laser drilling or etching. Laser drilling is only suitable for a small number of TSVs (hundreds to thousands), while etching is suitable for a large number of TSVs.


Fig. 7.6: Fabrication steps for 3D ICs: (a) TSVs are formed before the BEOL process, thus TSVs only punch through the silicon substrate but not the metal layers; (b) TSVs are formed after the BEOL process, thus TSVs punch through not only the silicon substrate but the metal layers as well.

Furthermore, there are two approaches for TSV etching: (1) TSV-first approach: TSVs can be formed during the 2D die fabrication process, before the Back-End-of-Line (BEOL) processes. Such an approach is called the TSV-first approach, and is shown in Fig. 7.6(a); (2) TSV-last approach: TSVs can also be formed after the completion of 2D fabrication, after the BEOL processes. Such an approach is called the TSV-last approach, and is shown in Fig. 7.6(b). Either approach has advantages and disadvantages. The TSV-first approach builds TSVs before the metal layers, thus there are no TSVs punching through metal layers and hence no TSV area overhead; the TSV-last approach has TSV area overhead, but it isolates the die fabrication from 3D bonding, so it does not need to change the traditional fabrication process. In order to separate the wafer cost model and the 3D bonding cost model, we assume the TSV-last approach is used in 3D IC fabrication. The TSV area overhead caused by the TSV-last approach is modeled by Equation 7.15. The data of the 3D bonding cost, including the cost of TSV forming, wafer thinning, and wafer (or die) bonding, are obtained from our industry partner, with the assumption that the yield of each 3D process step is 99%. Combined with the wafer cost model, the 3D bonding cost model can be used to estimate the cost of a 3D-stacked chip with multiple


dies. In addition, the entire 3D-stacked chip cost depends on some design options. For instance, it depends on whether die-to-wafer (D2W) or wafer-to-wafer (W2W) bonding is used, and it also depends on whether face-to-face (F2F) or face-to-back (F2B) bonding is used. If D2W bonding is selected, the cost of the Known-Good-Die (KGD) test should also be included [112]. For D2W bonding, the cost of a bare N-layer 3D-stacked chip before packaging is calculated as follows,

$$C_{D2W} = \frac{\sum_{i=1}^{N}\left(C_{die_i} + C_{KGDtest}\right)/Y_{die_i} + (N-1)\,C_{bonding}}{Y_{bonding}^{N-1}} \qquad (7.20)$$

where CKGDtest is the KGD test cost, which we model as Cwafersort/Ndie, and the wafer sort cost, Cwafersort, is a constant value for a specific process in a specific foundry in our model. Note that the testing cost for 3D ICs is itself a complicated problem. Our other study has demonstrated a more detailed test cost analysis model with the study of design-for-test (DFT) circuitry and various testing strategies, showing the variants of testing cost estimations [136]. For example, adding extra DFT circuitry to improve the yield of each die in D2W stacking can help the cost reduction, but the increased area may increase the cost. In this paper, we adopt the above testing cost assumptions to simplify the total cost estimation, due to space limitations. For W2W bonding, the cost is calculated as follows,

$$C_{W2W} = \frac{\sum_{i=1}^{N} C_{die_i} + (N-1)\,C_{bonding}}{\prod_{i=1}^{N} Y_{die_i} \cdot Y_{bonding}^{N-1}} \qquad (7.21)$$

In order to support multiple-layer bonding, the default bonding mode is F2B. If the F2F mode is used, there is one more component die that does not need the thinning process, and the thinning cost of this die is subtracted from the total cost.
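To make the comparison between the two stacking styles concrete, the short sketch below evaluates Equations 7.20 and 7.21 for a uniform stack; the per-die costs, yields, bonding cost, and KGD test cost are assumed values chosen only for illustration.

```python
def cost_d2w(die_costs, die_yields, c_bonding=50.0, c_kgd_test=5.0, y_bonding=0.99):
    """Die-to-wafer cost of a bare 3D stack (Equation 7.20)."""
    n = len(die_costs)
    tested = sum((c + c_kgd_test) / y for c, y in zip(die_costs, die_yields))
    return (tested + (n - 1) * c_bonding) / y_bonding ** (n - 1)

def cost_w2w(die_costs, die_yields, c_bonding=50.0, y_bonding=0.99):
    """Wafer-to-wafer cost of a bare 3D stack (Equation 7.21)."""
    n = len(die_costs)
    stack_yield = 1.0
    for y in die_yields:
        stack_yield *= y
    return (sum(die_costs) + (n - 1) * c_bonding) / (stack_yield * y_bonding ** (n - 1))

if __name__ == "__main__":
    # two identical dies obtained by splitting a large 2D design (assumed numbers, a.u.)
    costs, yields = [120.0, 120.0], [0.85, 0.85]
    print("D2W chip cost: %.1f" % cost_d2w(costs, yields))
    print("W2W chip cost: %.1f" % cost_w2w(costs, yields))
```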

7.3.3 Package Cost Model

The package cost model estimates the cost incurred during the packaging process, which can be determined by three factors: package type, package area, and pin count.


Fig. 7.7: The package cost depends on both pin count and die area.

As an example, we select the package type to be flip-chip Land Grid Array (fcLGA) in this paper. fcLGA is a common package type for multiprocessors, while other types of packages, such as fcPGA, PGA, and pPGA, are also available in our model. Besides the package type, the package area and the pin count are the other two key factors that determine the package cost. By using the early-stage estimation methodology mentioned in Section 7.2, the package area can be estimated from the die area, A3D in Equation 7.15, and the pin count can be estimated by Rent's Rule in Equation 7.2. We analyze actual package cost data and find that the number of pins becomes the dominant factor in the package cost when the die area is much smaller than the total area of the pin pads. This is easy to understand, since there is a base material and production cost per pin which will not be reduced by die area shrinking. Fig. 7.7 shows the sample data of package cost obtained from IC Knowledge LLC [133]. Based on this set of data, we use curve fitting to derive a package cost function with the parameters of the die area


and the pin count, which can be expressed as follows,

$$C_{package} = \beta_1 N_p + \beta_2 A_{3D}^{\gamma} \qquad (7.22)$$

where Np is the pin count and A3D is the die area. In addition, β1, β2, and γ are the coefficients and exponent, which are 0.00515, 0.141, and 0.35, respectively. By observing Equation 7.22, we can also find that the pin count will dominate the package cost if the die area is sufficiently small, since the order of the pin count term is higher than that of the area term. Fig. 7.7 also shows the curve fitting result, which is well aligned with the raw data.
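In code, the fitted package cost of Equation 7.22 is a one-liner; the pin count estimate from Rent's Rule (Equation 7.2) is included for convenience, and the example gate count and package area below are assumptions.

```python
def package_cost(pin_count, a3d_mm2, beta1=0.00515, beta2=0.141, gamma=0.35):
    """Fitted fcLGA package cost of Equation 7.22 (arbitrary units)."""
    return beta1 * pin_count + beta2 * a3d_mm2 ** gamma

def pin_count_estimate(n_gate, k=0.14, p=0.63):
    """Signal terminal count from Rent's Rule (Equation 7.2)."""
    return k * n_gate ** p

if __name__ == "__main__":
    pins = pin_count_estimate(10_000_000)       # an assumed 10M-gate design
    print("estimated pin count: %.0f" % pins)
    print("package cost (a.u.): %.2f" % package_cost(pins, a3d_mm2=200.0))
```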

7.3.4 Cooling Cost Model

Although the combination of the aforementioned wafer cost model, 3D bonding cost model, and package cost model can already offer the cost of a 3D-stacked chip after packaging, we add the cooling cost model because it has been widely noticed that a 3D-stacked chip has a higher working temperature than its 2D counterpart and might need a more powerful cooling solution that incurs extra cost. Gunther et al. [137] noticed that the cooling cost is related to the power dissipation of the system. Furthermore, the system power dissipation is highly related to the chip working temperature [138]. Our cooling cost estimation is based on the peak steady-state temperature of the targeted 3D-stacked multiprocessor. There are many types of cooling solutions, ranging from a simple extruded aluminum heatsink to an elaborate vapor-phase refrigeration system. Depending on the cooling mechanisms, cooling solutions used today can be classified as convection cooling, phase-change cooling, thermoelectric cooling (TEC), or liquid cooling [139]. Typical convection cooling solutions are heatsinks and fans, which are widely adopted for the microprocessor chips in today's desktop computers. Phase-change cooling solutions, such as heatpipes, might be used in laptop computers. In addition, thermoelectric cooling and liquid cooling are used in some high-end computers. We collect the cost of all the cooling solutions from the commercial market by searching the data from Digikey [140] and Heatsink Factory [141]. We find that more powerful types of cooling


Table 7.2: The cost of various cooling solutions
Cooling Solution    Unit Price ($)
Heatsink            3–6
Fan                 10–40
TEC                 15–20
Heatpipe            30–70
Liquid Cooling      75

Table 7.3: The values of Kc and c in Equation 7.23, which are related to the chip temperature
Chip Temperature (°C)    Kc     c
< 60                     0.2    6
60–90                    0.4    16
90–120                   0.2    2
120–150                  1.6    170
150–180                  2      230

solutions often lead to higher costs, as expected. Table 7.2 lists typical prices of these cooling solutions. In our model, we further assume that the cooling cost increases linearly with the rise of the chip temperature if the same type of cooling solution is adopted. Based on this assumption and the data listed in Table 7.2, the cooling cost is therefore estimated as follows:

$$C_{cooling} = K_c \cdot T + c \qquad (7.23)$$

where T is the temperature that the cooling solution must bring the chip down from, and Kc and c are the cooling cost parameters, which are determined by Table 7.3. Fig. 7.8 shows the cost of these five types of cooling solutions. It can be observed that chips with higher steady-state temperatures require more powerful cooling solutions, which lead to higher costs. It is also illustrated in Fig. 7.8 that the cooling cost is not a globally linear function of the temperature; rather, there are several regions with a


Fig. 7.8: A plot for the cooling cost model: the cooling cost increases linearly with the chip temperature if the same cooling solution is used; more powerful cooling solutions result in higher costs.

linear increase of the cost. Each of these regions corresponds to one type of cooling solution.
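A piecewise-linear implementation of Equation 7.23 is sketched below; the (Kc, c) pairs and temperature bands are taken directly from Table 7.3 as printed, so they should be revisited if more precise vendor data are available.

```python
def cooling_cost(peak_temp_c):
    """Piecewise-linear cooling cost of Equation 7.23 using Table 7.3; each
    temperature band maps to one class of cooling solution (heatsink, fan,
    TEC, heatpipe, liquid cooling)."""
    bands = [             # (upper bound in Celsius, Kc, c)
        (60.0,  0.2,   6.0),
        (90.0,  0.4,  16.0),
        (120.0, 0.2,   2.0),
        (150.0, 1.6, 170.0),
        (180.0, 2.0, 230.0),
    ]
    for upper, kc, c in bands:
        if peak_temp_c <= upper:
            return kc * peak_temp_c + c
    raise ValueError("temperature outside the range covered by Table 7.3")

if __name__ == "__main__":
    for t in (55, 85, 115, 145, 175):
        print("%d C -> %.1f (a.u.)" % (t, cooling_cost(t)))
```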

7.4

Cost Evaluation for Fully-Customized ASICs

The ideal 3D fabrication gives the freedom that transistors can be built on their optimal layers. In this case, we first use our early design estimation methodology and the proposed cost model to demonstrate how the cost of fully-customized ASICs can be reduced by 3D fabrication. In this section, we use the IBM Common Platform foundry cost model as an example to perform a series of analyses for logic circuitry.

7.4.1 TSV Area Overhead

As mentioned in Section 7.2, building TSVs not only incurs additional process cost but also causes area overhead. To estimate the TSV area overhead, we set the exponent parameter p and the coefficient parameter k for logic designs (in Equation 7.2) to 0.63 and 0.14, respectively, according to Bakoglu's research [130]. We further assume that all the gates are uniformly partitioned into N layers. We choose the TSV size


Fig. 7.9: The total area and the percentage of area occupied by TSVs: for small designs, the TSV area overhead is nearly 10%, but for large designs, the TSV area overhead is less than 4%.

to be 8µm × 8µm. Using the early design estimation methodology described in Section 7.2, the predicted TSV area overhead under a 65nm process is shown in Fig. 7.9. Consistent with Equation 7.15, the TSV area overhead increases as the number of 3D layers increases. The overhead can reach as high as 10% for a small design (with 5 million gates) using 4 layers. However, when the design is sufficiently large (with 200 million gates) and 3D integration only stacks two dies together, the TSV area overhead can be as low as 2%. To summarize, for large designs, the area overhead due to TSVs is acceptable.

7.4.2 Potential Metal Layer Reduction

As discussed in Section 7.2, the total wire length decreases as the number of 3D layers increases, since the existence of TSVs removes the necessity of global interconnects. After subtracting the global interconnects implemented by TSVs, we further assume the remaining interconnects are evenly partitioned into N dies. Therefore, the routing


Table 7.4: The Number of Required Metal Layers Per Die (65nm Process)
Gate count     2D    2-layer    3-layer    4-layer
5 million       5       5          5          4
10 million      6       5          5          5
20 million      7       6          5          5
50 million      8       7          7          6
100 million    10       8          7          7
200 million    12      10          9          8

complexity on each die is much less than that of the original 2D design. As a result, it becomes possible to remove one or more metal layers on each smaller die after 3D partitioning. Using the 3D routability estimation methodology discussed in Section 7.2, we project the metal layer reduction effect for various design scales ranging from 5 million gates to 200 million gates, when they are partitioned into 1, 2, 3, and 4 layers, respectively. The estimation of the metal layer reduction is listed in Table 7.4. Although the result shows that there are few opportunities to reduce the number of required metal layers in a relatively small design (i.e., with 5 million gates), the effect of the metal layer reduction becomes more and more obvious with the growth of the design complexity. For instance, the number of required metal layers can be reduced by 2, 3, and 4 when a large 2D design (i.e., with 200 million gates) is uniformly partitioned into 2, 3, and 4 separate dies, respectively. To summarize, in large-scale 3D IC designs, it is possible to use fewer metal layers for each smaller die, compared to the baseline 2D design.

7.4.3 Bonding Methods: D2W or W2W

Die-to-wafer (D2W) and wafer-to-wafer (W2W) bonding are two different ways to bond multiple dies together in 3D integration [1]. How to model these two bonding methods is discussed in Section 7.3. D2W bonding can achieve a higher yield by introducing the Known-Good-Die (KGD) test, while W2W bonding waives the necessity of testing before bonding at the


Fig. 7.10: The cost comparison among 2D, 2-layer D2W, and 2-layer W2W under a 65nm process with the consideration of TSV area overheads and metal layer reductions.

expense of yield loss [1]. Since both D2W and W2W have their pros and cons, we use our 3D cost model to find out which one is more cost-effective. Fig. 7.10 shows the cost comparison among conventional 2D processes, 2-layer D2W bonding, and 2-layer W2W bonding. It can be observed that, although the cost of W2W is lower than that of D2W for small designs, the cost of W2W is always higher than that of 2D processes in our results. This phenomenon can be explained by the relationship between die area and die yield as expressed by Equation 7.19. For W2W bonding, the yield of each component die does increase due to the area reduction, but when all the dies are stacked together without pre-stacking tests, the chip yield is approximately the product of the yields of the individual planar dies. Thus, the overall yield of W2W-bonded 3D chips becomes similar to that of 2D chips. Since the extra bonding cost is inevitable for W2W bonding, but not for 2D processes, W2W-bonded chips are then more expensive than their 2D counterparts. However, while D2W bonding has to pay extra cost for the KGD test,


Table 7.5: The enabling point of 3D fabrication (the data from IBM Common Platform is used in this example)
Process Node    Gate count      Die area
130nm           21 million      277.3mm²
90nm            40 million      253.1mm²
65nm            76 million      250.9mm²
45nm            143 million     226.2mm²

it can ensure that only good dies are integrated together. Therefore, the yield benefit achieved from partitioning the design into multiple smaller dies is preserved. Although this might not offset the KGD test cost for small dies whose yields are already high, the yield improvement can potentially offset the bonding cost and the KGD test cost when the original design scale is too large to implement using 2D processes with high yields. Based on our evaluation results, D2W bonding starts to show its cost efficiency when the design scale is larger than 100 million gates per chip. To summarize, from a yield perspective, Die-to-Wafer (D2W) stacking has a cost advantage over Wafer-to-Wafer (W2W) stacking, and the cost of D2W-bonded chips can be lower than that of 2D chips when the design scale is sufficiently large.

7.4.4 Enabling Point of 3D

It is critical to recognize the enabling point beyond which 3D-stacked chips would be cheaper than their 2D counterparts. For this purpose, we select the IBM Common Platform 65nm fabrication technology as our experiment target and compare the cost of the 2D baseline design with its 3D counterparts, which have 2, 3, and 4 layers. We calculate the cost of the bare chips before packaging by using our proposed wafer cost model and 3D bonding cost model. Fig. 7.11 shows the cost estimations based on the assumption that the 2D baseline design is partitioned uniformly. It can be observed from Fig. 7.11 that the cost dramatically


Fig. 7.11: The cost of 3D ICs with different numbers of layers under a 65nm process with the consideration of TSV area overheads and metal layer reductions.

increases with the growth of the chip size due to the exponential relationship between die sizes and die yields. Because die yields become more sensitive to large die areas, splitting a large-scale 2D design into multiple dies is more likely to reduce the total cost than splitting a small 2D design. Another observation is that the optimal number of 3D layers in terms of cost might vary depending on how large the design scale is. For example, the most cost-effective way for a large-scale design with 200 million gates is to partition it into 4 layers, whereas for a smaller design with 100 million gates, the most cost-effective option is to use 2-layer partitioning. Finally, when the design scale is only 50 million gates per chip, the conventional 2D fabrication is always the most cost-effective one, since the 3D bonding cost is the dominant cost in this case and the yield improvement brought by 3D partitioning is too small. Running the same experiment using different fabrication processes ranging from 130nm down to 45nm, we obtain an estimate of the boundary that indicates where 2-layer 3D stacking starts to be more cost-effective than the baseline 2D design. The data are listed in Table 7.5. If we convert the number of gates into chip area, the


enabling point of the 2-layer 3D process is about 250mm². This is mainly because the yield of fabricating a die larger than 250mm² starts to drop quickly. If the design scale keeps increasing, which means the chip area keeps growing, more 3D layers are required to achieve the highest cost efficiency. To summarize, 3D integration is cost-effective for large designs, but not for small designs; the optimal number of 3D layers from a cost-saving perspective increases as the gate count increases.

7.4.5 Cost-Driven 3D Design Flow

The 3D IC cost analysis discussed above is conducted before the real design, and all the inputs of the cost model are predicted from early design estimation. However, if the same cost analysis methodology is applied during design time, using the real design data, such as die area, TSV interconnects, and metal interconnects, as the inputs of the cost model, then a cost-driven 3D IC design flow becomes possible. Fig. 7.12 shows a proposed cost-driven 3D IC design flow. The integration of 3D IC cost models into design flows guides designers to optimize their 3D IC designs and eventually to manufacture low-cost products. Such a unique and close integration of cost analysis with the 3D EDA design flow has two advantages. Firstly, as we discussed earlier, many design decisions (such as partitioning and placement & routing) can affect the cost analysis. Closely coupling cost analysis with the 3D EDA flow can result in more accurate cost estimation. Secondly, the cost analysis result can drive 3D EDA tools to carry out a more cost-effective optimization, in addition to considering other design goals (such as performance and power consumption).

7.5

Cost Evaluation for Many-Core Microprocessor Designs

In the previous section, we evaluated the cost benefit of 3D fabrication for fully-customized ASIC designs. However, the gate-level fine-granularity 3D partitioning is not feasible in the near future since it requires all the basic modules (such as adders, multipliers, etc.) to be redesigned in a 3D-partitioned way. Hence, as a near-term goal, the module-level coarse-granularity 3D partitioning is regarded as a more



Fig. 7.12: The scheme of a cost-driven 3D IC design flow.

realistic way for 3D IC implementation, such as the system-on-chip (SoC) style or the many-core microprocessor design style4, where each IP block (such as each core/memory IP) remains a 2D block and the 3D
4 In this paper, a multi-core chip refers to a chip with eight or fewer homogeneous cores, whereas a many-core chip has more than eight homogeneous cores in one microprocessor package.


partitioning is performed at the block level. In such a design style, the Rent's rule-based TSV estimation is not valid, and the TSV estimation has to be carried out based on the system-level specification (such as the signal connections between cores and memory). In this section, we demonstrate how to use the remaining parts of the proposed 3D cost model to estimate the 3D many-core microprocessor cost. We use the 45nm IBM Common Platform cost model in this case study.

7.5.1

Baseline 2D Configuration

In order to have a scalable architecture for many-core microprocessors, the baseline 2D design used in our study is an NoC-based architecture, in which the processing tile is the basic building block. Each processing tile is composed of one in-order SPARC-like microprocessor core [131], one L1 cache, one L2 cache, and one router. Fig. 7.13 shows an example of such an architecture with 64 tiles connected by an 8-by-8 mesh structure. Table 7.6 lists the details of the cores, L1 and L2 caches, and routers, respectively. The gate counts of the core and the router modules are obtained from empirical data of existing designs [131], and the cell count of the cache module is obtained by only considering the SRAM cells and ignoring the cost of the peripheral circuits. In this case study, we simply assume logic gates and SRAM cells (6-transistor model) occupy the same die area. We consider the resulting estimation error to be tolerable since it is an early design stage estimation. Note that we set the memory cell count (from L1 and L2 caches) to be close to the logic gate count (from cores and routers) by tuning cache capacities. This manual tuning is just for the convenience of our later heterogeneous stacking. Generally, there are no constraints on the ratio between the logic and the memory modules. Fig. 7.14 illustrates two ways to implement a 3D dual-core microprocessor, and it can be conceptually extended to the many-core microprocessor case. Generally, the two types of partitioning strategies are: homogeneous and heterogeneous. As shown in Fig. 7.14(a), all the layers after homogeneous partitioning are identical, while heterogeneous partitioning leads to core (logic) layers and cache (memory) layers as shown in Fig. 7.14(b).


Fig. 7.13: Structure of the baseline 2D multiprocessor, which consists of 64 tiles.

Fig. 7.14: Three ways to implement a 3D dual-core microprocessor.

7.5.2 Cost Evaluation with Homogeneous Partitioning

We first explore the homogeneous partitioning strategies, in which all the processing tiles are distributed onto 1, 2, 4, 8, and 16 layers, respectively, without breaking tiles into finer granularity.


Table 7.6: The configuration of the SPARC-like core in a tile of the baseline 2D multiprocessor.

Processor cores:  clock frequency 1.2 GHz; architecture type in-order; INT pipeline 6 stages; FP pipeline 6 stages; ALU count 1; FPU count 1; empirical gate count 1.7 million.
Routers:          type 4 ports; empirical gate count 1.0 million.
Caches:           cache line size 64 B; L1 I-cache capacity 32 KB; L1 D-cache capacity 32 KB; L2 cache capacity 256 KB; empirical cell count 2.6 million.

Fig. 7.15 illustrates the conceptual view of a 4-layer homogeneous stacking. The role of the TSVs in the homogeneous 3D-stacked chip is to provide the inter-layer NoC connections and to deliver power from the I/O pins to the inner layers. Instead of using Equation 7.12, the TSV count is predicted on the basis of the NoC bus width. We assume a 3D mesh is used and the NoC flit size is 128 bits. Therefore, the number of TSVs between two layers, N_TSV, is estimated as

N_{TSV} = \frac{2 \times 128 \times N_{tiles/layer}}{50\%}    (7.24)

where N_{tiles/layer} is the number of tiles per layer, the factor of 2 accounts for the bidirectional channel between two tiles, and the division by 50% reflects our assumption that half of the TSVs are used for power delivery. After calculation, the TSV area overhead is around 3.8%.
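To make Equation 7.24 concrete, the following Python sketch computes the inter-layer TSV budget of the 64-tile baseline for several homogeneous partitionings. The flit width, the tile count, and the power-delivery assumption come from the text above; the reading that the power-delivery assumption doubles the signal-TSV count, and the assumption of one TSV interface per adjacent layer pair, are ours.

```python
# Sketch of the homogeneous-stacking TSV budget (Eq. 7.24) for the 64-tile
# baseline.  Each vertically adjacent tile pair uses a pair of 128-bit
# channels (factor 2), and the signal TSV count is divided by 50% because
# half of all TSVs are assumed to be reserved for power delivery.
FLIT_WIDTH = 128          # NoC flit width in bits
TOTAL_TILES = 64          # 8x8 mesh in the baseline 2D design
SIGNAL_FRACTION = 0.5     # fraction of TSVs that carry signals (rest: power)

def tsvs_per_interface(tiles_per_layer: int) -> int:
    """Number of TSVs between two adjacent layers, Eq. 7.24."""
    return int(2 * FLIT_WIDTH * tiles_per_layer / SIGNAL_FRACTION)

for layers in (2, 4, 8, 16):
    tiles_per_layer = TOTAL_TILES // layers
    per_interface = tsvs_per_interface(tiles_per_layer)
    total = per_interface * (layers - 1)  # assumed: one interface per adjacent layer pair
    print(f"{layers:2d} layers: {per_interface:5d} TSVs per interface, {total:6d} total")
```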


Fig. 7.15: The conceptual view of a 4-layer homogeneous stacking composed of 4 identical layers.

Fig. 7.16: The cost break-down of a 16-core microprocessor design with different homogeneous partitioning.

In the next step, we use the cost model set to estimate the cost of the wafer, 3D bonding, package, and cooling, respectively. We analyze microprocessor designs of different scales. Fig. 7.16 to Fig. 7.19 show the estimated price break-down of 16-core, 32-core, 64-core, and 128-core microprocessors when they are fabricated with 1-layer, 2-layer, 4-layer, 8-layer, and 16-layer processes, respectively. In these figures, all the prices are divided into 5 parts (the first bar, which represents the 2D planar process, only has 3 parts, as the 2D process has only one die and does not need the bonding step). From the bottom up, they are the cost of one die, the cost of the remaining dies, the bonding cost, the package cost, and the cooling cost.


Fig. 7.17: The cost break-down of a 32-core microprocessor design with different homogeneous partitioning.

Fig. 7.18: The cost break-down of a 64-core microprocessor design with different homogeneous partitioning.

As the figures illustrate, the cost of one die decreases as the number of layers increases. This decreasing trend is mainly due to the yield improvement brought by the smaller die size. However, the trend flattens out at the end, since the yield improvement is no longer significant once the die size is sufficiently small. As a result, the cumulative cost of all the dies does not follow a decreasing trend. Combined with the 3D bonding cost, which is an increasing function of the layer count, the wafer cost usually hits its minimum at some intermediate point. For example, if only the wafer cost and the 3D bonding cost are considered, the optimal numbers of layers for the 16-core, 32-core, 64-core, and 128-core designs are 2, 4, 4, and 8, respectively, which is consistent


Fig. 7.19: The cost break-down of a 128-core microprocessor design with different homogeneous partitioning.

with our rules of thumb discussed in Section 7.4. However, the cost before packaging is not the final cost. System-wise, the final cost needs to include the package cost and the cooling cost. As mentioned, the package cost is mainly determined by the pin count; thus, the package cost remains almost constant in all the cases, since the pin count does not change from one partitioning to another. The cooling cost mainly depends on the chip temperature. As more layers are vertically stacked, the chip temperature rises, so the cooling cost grows with the number of 3D layers, and this growth is quite fast because a 3D-stacked chip with more than 8 layers can easily exceed a peak temperature of 150 °C. In fact, the extreme cases in this experiment (such as 16-layer stacking) are not practical, as the underlying chip is severely overheated. In this experiment, we first obtain the power data of each component (the core, router, L1 cache, and L2 cache) from a power estimation tool, McPAT [113]. Combined with the area data from the early-design-stage estimation, we obtain the power density, feed it into HotSpot [139], a 3D-aware thermal estimation tool, and finally obtain the estimated chip temperature and its corresponding cooling-solution cost. Fig. 7.16 to Fig. 7.19 also show the cost of packaging and cooling. As we can see, the cooling cost is only a small portion for few-layer stacking (such as 2 layers), but it starts to dominate the total cost for aggressive stacking (such as 8 and 16 layers). Therefore, the optimal partitioning solution in terms of the minimum total


cost differs from the one determined by the pre-packaging cost alone. From the results, the optimal numbers of layers for the 16-core, 32-core, 64-core, and 128-core designs instead become 1, 2, 2, and 4, respectively. This result also explains why the package cost and the cooling cost are included in the decision on the optimal number of 3D stacking layers.
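The trend behind these observations can be illustrated with a toy calculation: the per-die silicon cost falls with the layer count because smaller dies yield better, while the bonding and cooling costs rise with the layer count, so the total is minimized at an intermediate point. The sketch below uses placeholder numbers (wafer price, defect density, bonding cost per interface, a cooling-cost table), not the 45nm IBM Common Platform values behind Figs. 7.16 to 7.19.

```python
import math

# Toy model of total chip cost versus homogeneous layer count.
# All constants are illustrative placeholders.
WAFER_COST = 4000.0                       # cost of one processed wafer (a.u.)
WAFER_AREA = math.pi * 150.0 ** 2         # usable area of a 300mm wafer, mm^2
D0, ALPHA = 0.01, 2.0                     # defect density (1/mm^2), clustering parameter
BOND_COST = 30.0                          # cost of bonding one extra layer (a.u.)
COOLING_COST = {1: 10, 2: 15, 4: 40, 8: 150, 16: 500}  # rises with stacked power density
CHIP_AREA_2D = 400.0                      # equivalent 2D silicon area, mm^2

def die_yield(area_mm2: float) -> float:
    """Negative-binomial die yield (cf. Eq. 7.46 later in this chapter)."""
    return (1.0 + D0 * area_mm2 / ALPHA) ** (-ALPHA)

def total_cost(layers: int) -> float:
    die_area = CHIP_AREA_2D / layers
    cost_per_good_die = WAFER_COST / (WAFER_AREA / die_area) / die_yield(die_area)
    return (cost_per_good_die * layers        # all dies of the stack
            + BOND_COST * (layers - 1)        # bonding cost grows with layer count
            + COOLING_COST[layers])           # cooling cost grows with temperature

for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} layers -> total cost {total_cost(n):7.1f} (a.u.)")
```

With these placeholder values the minimum falls at 2 layers: the yield benefit of splitting the die is quickly exhausted, while the bonding and cooling terms keep growing.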

7.5.3 Cost Evaluation with Heterogeneous Partitioning

Similar to our baseline design illustrated in Fig. 7.13, today's high-performance microprocessors have a large portion of their silicon area occupied by on-chip SRAM or DRAM caches. In addition, non-volatile memory can also be integrated as on-chip memory [110]. However, the different fabrication processes are not compatible. For example, a CMOS logic module might require a 1-poly-9-copper-1-aluminum interconnect stack, while a DRAM module needs 7-poly-3-copper and a flash memory module needs 4-poly-1-tungsten-2-aluminum. As a result, integrating heterogeneous modules in a planar 2D chip would dramatically increase the chip cost: Intel shows that heterogeneous integration in a large 2D System-on-Chip (SoC) would boost the chip cost by 3 times [142]. However, fabricating the heterogeneous modules separately could be a cost-effective way to build such systems. For example, if the SRAM cache modules are fabricated separately, our estimation shows that only 6 metal layers are required, while the logic modules (such as cores and routers) usually need more than 10 metal layers. Hence, we re-evaluate the many-core microprocessor cost using heterogeneous integration, in which each tile is broken into a logic part (i.e., the core and router) and a memory part (i.e., the L1 and L2 caches). Fig. 7.20 illustrates a conceptual view of a 4-layer heterogeneous partitioning where the logic modules (cores and routers) and the memory modules (L1 and L2 caches) are on separate dies. As shown in Fig. 7.20, there are two types of TSVs in the heterogeneously stacked chip: one for the NoC mesh interconnect, which is the same as in the homogeneous stacking case, and one for the interconnect between cores and caches, which is caused by the separation of the logic and memory modules. The number of extra core-to-cache TSVs is calculated as follows,


N_{extraTSV} = (Data + \sum_i Address_i) \times N_{tiles/layer}    (7.25)

Fig. 7.20: The conceptual view of a 4-layer heterogeneous stacking composed of 2 logic layers and 2 memory layers.

where Data is the cache line size and Address_i is the address width of cache i. Note that the number of tiles per layer, N_{tiles/layer}, is doubled in the heterogeneous partitioning compared to its homogeneous counterpart. For our baseline configuration (Table 7.6), in which the L1 I-cache is 32KB, the L1 D-cache is 32KB, the L2 cache is 256KB, and the cache line size is 64B, Data + Σ_i Address_i is 542. As a side effect, the extra core-to-cache TSVs increase the TSV area overhead from 3.8% to 15.7%. Compared to its homogeneous counterpart, the advantages and disadvantages of partitioning an NoC-based many-core microprocessor in the heterogeneous way are: (1) fabricating the logic and memory dies separately reduces the wafer cost; (2) putting the logic dies closer to the heat sink reduces the cooling cost; (3) the higher TSV area overhead increases the wafer cost and the package cost.
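The sketch below contrasts the per-interface TSV budgets of the homogeneous and heterogeneous styles for a 4-layer, 64-tile design, combining Equations 7.24 and 7.25 with the 542 core-to-cache signals per tile quoted above. The assumption that only the NoC TSVs are doubled for power delivery is ours; with it, the roughly four-fold increase in TSV count per layer is consistent with the overhead growing from about 3.8% to 15.7%.

```python
FLIT_WIDTH = 128       # NoC flit width in bits
CORE_TO_CACHE = 542    # data width + address widths per tile (Eq. 7.25)
TOTAL_TILES = 64
LAYERS = 4

def noc_tsvs(tiles_per_layer: int) -> int:
    # Eq. 7.24: bidirectional 128-bit channels, doubled again because half
    # of the NoC-related TSVs are assumed to carry power.
    return int(2 * FLIT_WIDTH * tiles_per_layer / 0.5)

# Homogeneous stacking: complete tiles on every layer.
tiles_hom = TOTAL_TILES // LAYERS
hom = noc_tsvs(tiles_hom)

# Heterogeneous stacking: each layer holds only the logic half or the memory
# half of a tile, so the tile count per layer doubles and every tile also
# needs core-to-cache TSVs (Eq. 7.25).
tiles_het = 2 * tiles_hom
het = noc_tsvs(tiles_het) + CORE_TO_CACHE * tiles_het

print(f"homogeneous  : {hom:6d} TSVs per layer interface")
print(f"heterogeneous: {het:6d} TSVs per layer interface ({het / hom:.1f}x)")
```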


Fig. 7.21: The cost break-down of a 16-core microprocessor design with different heterogeneous partitioning.


Fig. 7.22: The cost break-down of a 32-core microprocessor design with different heterogeneous partitioning.

To determine whether heterogeneous integration is more cost-effective, we repeat the cost estimation for the 16-core, 32-core, 64-core, and 128-core microprocessor designs using heterogeneous partitioning. The price break-down of heterogeneous stacking is shown in Fig. 7.21 to Fig. 7.24. Compared to the homogeneous integration cost illustrated in Fig. 7.16 to Fig. 7.19, we can observe that the cost of heterogeneous integration is higher than that of homogeneous integration in most cases, which is mainly due to the increased die area caused by the higher TSV count. However, in some cases, heterogeneous integration can be cheaper than its homogeneous counterpart. Fig. 7.25 demonstrates the cost comparison between homogeneous and heterogeneous integration for a 64-core microprocessor design using



Fig. 7.23: The cost break-down of a 64-core microprocessor design with different heterogeneous partitioning.


Fig. 7.24: The cost break-down of a 128-core microprocessor design with different heterogeneous partitioning.

the 45nm 4-layer 3D process. It can be observed that, although the fabrication efficiency of the separated logic and memory dies cannot offset the extra cost caused by the increased die area, the cooling cost is reduced by putting the logic layers closer to the heat sink (or other cooling solution). While our experiment shows that heterogeneous integration has cost advantages from cheaper memory layers and a lower chip temperature, in


Fig. 7.25: The cost comparison between homogeneous and heterogeneous integration for a 64-core microprocessor using the 45nm 2-layer 3D process.

Table 7.7: The optimal number of 3D layers for many-core microprocessor designs (45nm).

             Homogeneous   Heterogeneous
  16-core         1              1
  32-core         2              2
  64-core         2              4
  128-core        4              8

reality, some designers might still prefer homogeneous integration, which has an identical layout on each die. The reciprocal design symmetry (RDS) technique was proposed by Alam et al. [143] to allow one mask set to be re-used for multiple 3D-stacked layers. RDS could relieve the design effort and thus reduce the design time, which means a cost reduction in human resources. However, this personnel cost is not included in our cost model set and is out of the scope of this work. We list the optimal partitioning options for 45nm many-core microprocessor designs in Table 7.7, which shows how our proposed cost estimation methodology helps decide the 3D partitioning at the early design stage.


7.6 3D IC Testing Cost Analysis

DFT plays a key role in reducing the test effort and improving test quality. Since it also incurs a certain overhead for the design, the economics of incorporating DFT has been explored [144]. In 3D IC design, various testing strategies and different integration methods can also affect the testing cost dramatically, and the interaction with other cost factors can result in different trade-offs. For example, wafer-to-wafer stacking can save the Known-Good-Die (KGD) testing cost and improve the throughput compared to die-to-wafer stacking; however, the overall yield can be much lower, and therefore the total IC cost can be higher. It is therefore very important to estimate the test economics with a testing cost model, and to integrate it into a cost-driven 3D IC design flow to strike a balance between cost and other benefits (such as performance/power/area) [72]. To understand the impact of testing strategies on 3D IC design and to explore the trade-offs between testing strategies and integration options, in this section we propose a testing cost analysis methodology with a break-down of the testing cost for different integration strategies. We address several unique issues raised by 3D integration, including the overall yield calculation, the impact of new defect types introduced by 3D integration, and the interaction with the fabrication cost. With the testing cost model we perform a set of trade-off analyses and show that the design choices can be different after the testing cost is taken into account.

7.6.1 Testing Cost Modeling for 3D ICs

In this section, we first present a basic cost model that captures the economics of testing in 3D IC design. We then discuss the impact of new issues raised by 3D stacking, including the use of different stacking strategies, new defect types introduced during stacking, and the interaction between the testing cost and the fabrication cost of 3D ICs.


7.6.1.1 Basic Cost Model of 3D Testing

Our testing cost model is adapted from the 2D DFT model in [144]. For the sake of brevity, we mainly address the aspects that are unique to 3D integration. The testing cost of 3D ICs consists of four main components: the preparation cost, the execution cost, the test-related silicon cost, and the imperfect-test-quality cost. Hence, we compute the testing cost C_test (per good chip) as [144]:

C_{test} = C_{prep} + C_{exec} + C_{silicon} + C_{quality}    (7.26)

C_prep captures the fixed costs of test generation, tester-program creation, and any design effort for incorporating test-related features (all non-recurring costs, including software systems). Note that in 3D ICs this may include the preparation cost for pre-bond testing of TSVs. C_exec consists of the cost of test-related hardware, such as probe cards, and the cost incurred by tester use. The tester-use cost depends on factors including the tester setup time, the test execution time (as a function of die area and yield), the capital cost of the tester (as a function of die area and number of I/O pins), and the capital-equipment depreciation rate; it should also be adjusted if pre-bond TSV testing is applied. C_silicon is the cost required to incorporate the DFT features; we model it as a function of the wafer size, die size, yield, and the extra area required by DFT. C_quality is the penalty due to faulty chips that escape the test. This component becomes increasingly important for 3D integration, with its relatively low yield and high chip cost.

Test preparation cost for 3D ICs. For the cost of test preparation for 3D ICs, we start from the model presented in [144] for the 2D case and augment it for 3D integration. Generally, the preparation cost contains the test-pattern generation cost (C_test_gen), the tester-program preparation cost (C_test_prog), and the additional design cost for DFT (C_DFT_design). In 3D IC designs, different bonding technologies might be used to assemble the die or wafer layers into a single chip. For example, if W2W bonding is used, no pre-bond test will be performed. In this case, all of the test-related preparations, including the DFT design,


test-pattern generation, and tester-program preparation, should target the whole chip instead of a single die. Therefore, the testing cost only applies to the final test after wafer bonding and sorting. The preparation cost for a W2W-bonded 3D chip can be calculated as

C_{prep}^{overall} = \frac{1}{Y_{W2W}} \left( C_{DFT\_design}^{overall} + C_{test\_gen}^{overall} + C_{test\_prog}^{overall} \right)    (7.27)

where Y_{W2W} denotes the final yield of the chip after bonding and will be discussed in the next sub-section. If D2W bonding is used, each layer should be tested before bonding, so the preparation cost becomes

C_{prep}^{overall} = \frac{1}{Y_{D2W}} \left( Y_N \, C_{prep}(N) + \sum_{i=1}^{N-1} C_{prep}(i) \right)    (7.28)

where

C_{prep}(i) = \frac{1}{Y_i} \left( C_{DFT\_design}(i) + C_{test\_gen}(i) + C_{test\_prog}(i) \right)    (7.29)

Here 1 . . . N denote the N layers in the chip, Y_i is the yield of each die before bonding, and C_{DFT\_design}(i), C_{test\_gen}(i), and C_{test\_prog}(i) are calculated with respect to the area of each die using the 2D model in [144]. Note that layer N is the bottom layer in the stack, closest to the I/O pads, and it is usually tested only after bonding.
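A minimal sketch of Equations 7.27 to 7.29 is given below; the function names and the example yields and per-die costs are illustrative only.

```python
def c_prep_die(y_i, c_dft, c_gen, c_prog):
    """Per-die preparation cost, Eq. 7.29."""
    return (c_dft + c_gen + c_prog) / y_i

def c_prep_w2w(y_w2w, c_dft, c_gen, c_prog):
    """Eq. 7.27: one chip-level preparation effort, amortized over good W2W chips."""
    return (c_dft + c_gen + c_prog) / y_w2w

def c_prep_d2w(y_d2w, layers):
    """Eq. 7.28: layers 1..N-1 are prepared and pre-bond tested individually;
    layer N (the bottom, I/O-facing die, last list entry) is tested after bonding.
    Each entry of `layers` is a tuple (y_i, c_dft, c_gen, c_prog)."""
    *upper, bottom = layers
    total = bottom[0] * c_prep_die(*bottom)            # Y_N * C_prep(N)
    total += sum(c_prep_die(*layer) for layer in upper)
    return total / y_d2w

# Example: a 4-layer stack with identical per-die preparation costs.
layers = [(0.9, 5.0, 3.0, 2.0)] * 4
print(c_prep_w2w(0.6, 5.0, 3.0, 2.0), c_prep_d2w(0.8, layers))
```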

Test execution cost for 3D ICs. The test execution cost is the total per-chip cost of hardware consumption and testing-device expenses. In 2D IC testing, all individual ICs on a wafer are tested before they are packaged. The wafer test is performed using a piece of hardware called a wafer prober or probe card; therefore, the hardware cost for testing is mainly determined by the cost of the probe cards [144]. We use N_probe to represent the number of wafers a probe card can test and C_probe to denote the price of a probe card. The hardware cost is then calculated as

C_{hw} = C_{probe} \, \lceil V / N_{probe} \rceil / V    (7.30)

In plain terms, a chip with N layers must be tested N times for D2W stacking but only once for W2W stacking. Thus, the hardware cost


for 3D ICs can be summarized as

C_{W2W,hw} = C_{probe} \, \lceil V_{chip} / N_{probe} \rceil / V_{chip}, \qquad
C_{D2W,hw} = C_{probe} \, N \, \lceil V_{chip} / N_{probe} \rceil / V_{chip}    (7.31)

Besides the cost of the probe cards, the cost of the testing devices should also be considered in the execution cost. The cost of the testing device can be calculated as

C_{device} = R_{test} T_{test} + R_{idle} T_{idle}    (7.32)

where R_test is the cost rate of the testing devices when they are operating, including the cost of the operator, electricity, device depreciation, etc., and R_idle denotes the cost rate of the testing devices when they are inactive, which is much smaller than R_test. T_test and T_idle are the testing time and the idle time of the testing device. We relate the testing rate and the idle rate as R_test = β · R_idle, where β is the ratio between the testing and idle cost rates and satisfies β ≥ 1. The testing time and idle time satisfy T_test + T_idle = T_year, where T_year is the total number of seconds in one year. The testing and idle rates are also proportional to the testing device's price; the unit price of a testing device is

Q = K_{capital} \, K_{pin} \, A_{die}    (7.33)

where K_capital is the average device price per pin and K_pin is the average pin density of a die. Therefore, the cost rate of the testing devices for a 3D chip can be summarized as

R_{D2W,test,i} = \delta_{depr} \, K_{capital} \, K_{pin} \, (A_i + A_{DFT,i})    (7.34)

R_{W2W,test,i} = \begin{cases} \delta_{depr} \, K_{capital} \, K_{pin} \, A_i & \text{if } i < N \\ \delta_{depr} \, K_{capital} \, K_{pin} \, (A_N + A_{DFT}) & \text{if } i = N \end{cases}    (7.35)

where δ_depr is the annual depreciation rate of the testing device. The testing time can be simply modeled as

T_{test} = T_{setup} + K_{ave\_t} \, A_{die}^2    (7.36)


where K_{ave\_t} is a constant multiplier that relates the testing time to the die area. Thus, the per-die testing time for a 3D chip is

T_{D2W,test,i} = T_{setup} + K_{ave\_t} \, (A_i + A_{DFT,i})^2    (7.37)

T_{W2W,test,i} = \begin{cases} T_{setup} + K_{ave\_t} \, A_i^2 & \text{if } i < N \\ T_{setup} + K_{ave\_t} \, (A_N + A_{DFT})^2 & \text{if } i = N \end{cases}    (7.38)
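The per-die test times of Equations 7.36 to 7.38 can be sketched as follows; the setup time and the area coefficient are illustrative defaults, not calibrated values.

```python
def t_test_single(area, t_setup=2.0, k_ave_t=0.05):
    """Eq. 7.36: per-die test time grows with the square of the tested area.
    t_setup (s) and k_ave_t (s per mm^4) are illustrative defaults."""
    return t_setup + k_ave_t * area ** 2

def t_test_d2w(areas, a_dft):
    """Eq. 7.37: every layer is tested pre-bond, its own DFT area included."""
    return [t_test_single(a + d) for a, d in zip(areas, a_dft)]

def t_test_w2w(areas, a_dft_bottom):
    """Eq. 7.38: only the bottom layer (last entry) carries the chip-level DFT,
    and the stack is tested once, layer by layer, after bonding."""
    return [t_test_single(a) for a in areas[:-1]] + \
           [t_test_single(areas[-1] + a_dft_bottom)]

areas = [50.0, 50.0, 50.0, 50.0]          # mm^2 per layer
print(t_test_d2w(areas, [2.5] * 4))
print(t_test_w2w(areas, 10.0))
```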

Testing circuit overhead. Testing circuits usually occupy a certain fraction of the total silicon area. In 3D integration with KGD testing (D2W stacking), test structures have to be implemented in each layer, so the total area overhead of the testing circuits is larger than in the 2D case, and the relative area overhead increases as the total number of layers goes up. For a given die with DFT structures, we denote the area ratio between the DFT circuits and the whole die by α_DFT. For W2W stacking we have A_{DFT,total} = α_{DFT} A_{2D}, while for D2W stacking:

A_{DFT,i} = \alpha_{DFT} \, A_i = \alpha_{DFT} \, \frac{A_{2D}}{N}    (7.39)

A_{DFT,total} = \alpha_{DFT} \, A_{2D} + \sum_{i=1}^{N-1} A_{DFT,i} = \alpha_{DFT} \, A_{2D} \, \frac{2N-1}{N}    (7.40)

Therefore, the DFT circuit overhead A_{DFT,total} for D2W stacking is an increasing function of the layer count N. The silicon cost of the DFT circuit overhead is given by

C_{silicon,D2W} = \frac{Q_{wafer}}{\pi R_{wafer}^2} \cdot \frac{A_{DFT,total}}{Y_{overall}}    (7.41)
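The growth of the DFT silicon overhead with the layer count (Equations 7.39 and 7.40) is easy to tabulate; the α_DFT value and the 200 mm² design below are illustrative (the same area as the benchmark used later in Section 7.6.2).

```python
def a_dft_total(a_2d, alpha_dft, n_layers, stacking="D2W"):
    """Total DFT silicon area, Eqs. 7.39-7.40.
    W2W: a single chip-level DFT block on the bottom layer only.
    D2W: in addition, each of the N-1 upper layers carries its own
         pre-bond test structure sized for a 1/N slice of the design."""
    if stacking == "W2W":
        return alpha_dft * a_2d
    return alpha_dft * a_2d * (2 * n_layers - 1) / n_layers

for n in (2, 4, 8):
    w2w = a_dft_total(200.0, 0.05, n, "W2W")
    d2w = a_dft_total(200.0, 0.05, n, "D2W")
    print(f"N={n}: A_DFT(W2W) = {w2w:.1f} mm^2, A_DFT(D2W) = {d2w:.1f} mm^2")
```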

Imperfect test quality. For simplicity in modeling, we only model the escape cost due to imperfect test quality. The test escape rate is a function of the fault coverage and the fault occurrence rate (or simply the yield) at a particular test stage. Williams and Brown [145] give the escape rate, i.e., the ratio of defective ICs that pass testing to all ICs that pass testing, as

E_r = 1 - Y^{1-fc}    (7.42)

where fc is the fault coverage. The same relation holds for W2W stacking of 3D ICs. Similarly, if D2W stacking is used, the escape rate becomes

E_{r,D2W} = 1 - \prod_{i=1}^{N} Y_i^{1-fc_i}    (7.43)

where Y_i and fc_i are the yield and fault coverage of the die in layer i, respectively. The cost of imperfect test quality is then calculated as

C_{quality} = C_{penalty} \, E_r    (7.44)

where C_penalty is the economic penalty for allowing a defective IC to escape testing. Depending on the importance attached to escaping ICs, the penalty can be as much as 10 times the manufacturing cost of a good IC.
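A small sketch of the escape-rate and quality-cost calculation (Equations 7.42 to 7.44); the yields, fault coverages, and penalty below are example values only.

```python
from math import prod

def escape_rate_w2w(y_overall, fault_coverage):
    """Williams-Brown escape rate, Eq. 7.42."""
    return 1.0 - y_overall ** (1.0 - fault_coverage)

def escape_rate_d2w(layer_yields, layer_coverages):
    """Eq. 7.43: an escape can originate in any pre-bond-tested layer."""
    return 1.0 - prod(y ** (1.0 - fc) for y, fc in zip(layer_yields, layer_coverages))

def quality_cost(penalty, escape_rate):
    """Eq. 7.44: expected penalty per shipped chip."""
    return penalty * escape_rate

# Example: 4 layers with 90% die yield and 99% fault coverage each,
# versus a W2W chip with 60% overall yield and the same coverage.
er_d2w = escape_rate_d2w([0.9] * 4, [0.99] * 4)
er_w2w = escape_rate_w2w(0.6, 0.99)
print(f"D2W escapes {er_d2w:.3%}, W2W escapes {er_w2w:.3%}")
print(f"quality cost at 10x penalty: {quality_cost(10.0, er_w2w):.3f} (a.u.)")
```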

7.6.1.2 The Impact of 3D Stacking Strategies on Chip Yield

3D stacking strategies have a significant impact on the overall chip yield, which in turn affects the average cost per chip. The yield model for a single die has been well investigated [146, 147]. Assuming that the defects are randomly distributed on a wafer, the number of defective dies on a wafer can be described as a binomial random variable and approximated by a Poisson random variable. Therefore, the yield of a die can be simply modeled as

Y_P = e^{-D_0 A_{die}}    (7.45)

where D_0 is the average defect density and A_die is the area of the die in question. However, it has been shown that defects are usually not randomly distributed across the wafer but are clustered; in this case, the die yield is higher than the result of Equation (7.45). Therefore, a compound Poisson distribution was proposed by Murphy [146],


which applies a weighting function to modulate the original Poisson distribution. A widely used weighting function is the Gamma function [148]. The Gamma-function-based yield model is

Y_G = \left( 1 + \frac{D_0 A_{die}}{\alpha} \right)^{-\alpha}    (7.46)

The parameter α is defined as α = (D̄/σ_D)², where D̄ and σ_D are the mean and standard deviation of the defect density, and it depends upon the complexity of the manufacturing process; D_0 is the silicon defect density parameter.
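The sketch below compares the two die-yield models of Equations 7.45 and 7.46, using the defect density and clustering parameter that appear later in Table 7.8; the listed die areas are arbitrary.

```python
import math

D0 = 0.004     # point-defect density per mm^2 (Table 7.8)
ALPHA = 2.0    # clustering parameter (Table 7.8)

def yield_poisson(area_mm2):
    """Eq. 7.45: Poisson yield, random (unclustered) defects."""
    return math.exp(-D0 * area_mm2)

def yield_gamma(area_mm2):
    """Eq. 7.46: Gamma-compounded (negative-binomial) yield, clustered defects."""
    return (1.0 + D0 * area_mm2 / ALPHA) ** (-ALPHA)

for area in (50, 100, 200, 400):
    print(f"A = {area:3d} mm^2: Poisson {yield_poisson(area):.3f}, "
          f"Gamma {yield_gamma(area):.3f}")
```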

The yield clearly decreases rapidly as the die area increases. Thus, the smaller dies in a 3D IC design may have a higher yield than a larger 2D die and therefore reduce the cost. To calculate the overall yield of a 3D chip, the following aspects should be taken into consideration: (1) for 3D ICs, both the die yield and the stacking yield must be considered at the same time, since defects in each die affect the overall chip yield after stacking, while an unsuccessful stacking operation can also cause the chip to fail; (2) in 3D ICs, the dies on different layers do not necessarily have the same size, and as shown in Equation (7.46), different die sizes result in different yields; (3) the stacking yield should be defined carefully: there are two sources of stacking failure, failures resulting from imperfect TSVs and failures resulting from an imperfect stacking operation, and since the TSV-related failures are already captured when calculating the stacking yield, they should not be counted as defects of a single die; (4) different stacking methods, such as D2W stacking and W2W stacking, require different yield models.

Yield model for W2W stacking. During W2W stacking, a die cannot be tested until the stacking is finished. The DFT circuitry is located at the bottom layer of the chip, which is closest to the I/Os. Based on these considerations, the yield for W2W stacking can be modeled


as

Y_{overall,W2W} = \left( 1 + \frac{D_0 (A_N + A_{DFT})}{\alpha} \right)^{-\alpha} \prod_{i=1}^{N-1} \left[ Y_{S,W2W,i} \left( 1 + \frac{D_0 A_i}{\alpha} \right)^{-\alpha} \right]    (7.47)

where Y_{S,W2W,i} denotes the stacking yield between layer i and layer i+1, and A_DFT is the area of the DFT circuitry on the bottom layer.

Yield model for D2W stacking. For a D2W-stacked design, a higher yield can be achieved by introducing KGD testing. The dies in layers 1 to N−1 are tested separately before the stacking operation, so during the stacking step all of these dies can be considered good dies. Thus the chip yield is
Y_{overall,D2W} = \left( 1 + \frac{D_0 (A_N + A_{DFT,N})}{\alpha} \right)^{-\alpha} \prod_{i=1}^{N-1} Y_{S,D2W,i}    (7.48)

where Y_{S,D2W,i} is the D2W stacking yield factor, which will be discussed in the next subsection.
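Putting the pieces together, the sketch below evaluates the overall chip yield of Equations 7.47 and 7.48 for an equal-area split of a 200 mm² design with a common per-bond stacking yield; the DFT area and stacking-yield values are illustrative.

```python
def die_yield(area, d0=0.004, alpha=2.0):
    """Per-die yield, Eq. 7.46 (Table 7.8 defaults)."""
    return (1.0 + d0 * area / alpha) ** (-alpha)

def overall_yield_w2w(layer_areas, a_dft_bottom, y_stack):
    """Eq. 7.47: untested upper dies and every bonding step degrade the yield."""
    y = die_yield(layer_areas[-1] + a_dft_bottom)     # bottom layer carries the DFT
    for area in layer_areas[:-1]:
        y *= y_stack * die_yield(area)                # upper dies are not pre-tested
    return y

def overall_yield_d2w(layer_areas, a_dft_bottom, y_stack):
    """Eq. 7.48: KGD testing removes the upper-die yield terms."""
    return die_yield(layer_areas[-1] + a_dft_bottom) * y_stack ** (len(layer_areas) - 1)

for n in (2, 4, 8):
    areas = [200.0 / n] * n
    print(f"N={n}: W2W {overall_yield_w2w(areas, 10.0, 0.95):.3f}, "
          f"D2W {overall_yield_d2w(areas, 10.0, 0.95):.3f}")
```

With these placeholder inputs the W2W yield drops quickly with the layer count while the D2W yield stays relatively flat, matching the qualitative trend shown in Fig. 7.27.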

7.6.1.3 New Defect Types Due to 3D Stacking

Because 3D integration incurs additional processing steps, such as TSV formation and bonding, new defect mechanisms (unique to 3D integration) must be addressed as part of the test strategy. First, in TSV-based 3D ICs, TSVs suffer from conductor open defects and dielectric short defects during manufacturing, so the TSV failure rate is undesirably high [53, 149]. Moreover, the thinning, alignment, and stacking of the wafers add extra steps to the manufacturing process. During bonding, any foreign particle caught between the wafers can lead to peeling as well as delamination, which dramatically reduces the bonding quality and yield. In addition, to maintain good conductivity and minimize resistance, the interconnect TSVs and micropads between wafers must be precisely aligned. Due to the new defect types introduced by 3D stacking, the stacking


yield factor in Equations (7.47)–(7.48) can be modeled as

Y_S = Y_{bonding} \, Y_{TSV} = Y_{bonding} \, (1 - f_{TSV})^{N_{TSV}}    (7.49)

where Y_bonding captures the yield loss of the chip due to faults in the bonding process, f_TSV is the TSV failure rate, N_TSV is the total number of TSVs, and Y_TSV describes the yield loss due to failed TSVs. In current TSV process technology, f_TSV varies from 50 ppm to 5%. A simple calculation shows that, given a design with 200 TSVs and a typical TSV failure rate of 0.1%, the yield loss due to failed TSVs will be as much as 20%, which is barely acceptable. For this reason, designers have proposed several techniques, including pre-bond TSV testing [53] and redundant TSV insertion [150], to mitigate the impact of the high TSV failure rate.

Pre-bond TSV testing. As previously discussed, with pre-bond TSV testing the faulty TSVs can be detected before bonding and the failure rate f_TSV is greatly reduced. Correspondingly, the area overhead of the pre-bond testing circuitry is translated into testing cost, as it increases A_DFT in Equations (7.39)–(7.41). The trade-off between f_TSV and the increase of A_DFT is then explored with our 3D testing cost model.

Redundant TSV insertion. Another effective way to reduce the impact of faulty TSVs is to create TSV redundancy. A straightforward doubling of TSVs can tolerate any single TSV failure; however, it fails if both TSVs fail. The degree of redundancy m is defined as the number of TSVs used for a single signal. The TSV yield and TSV area with redundancy are then given by

Y_{TSV} = \left( 1 - (f_{TSV})^m \right)^{N_{TSV}}    (7.50)

A_{TSV,redundant} = m \, A_{TSV}    (7.51)
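The impact of the TSV failure rate and of m-way redundancy (Equations 7.49 to 7.51) can be checked with a few lines of Python; the example reproduces the 200-TSV, 0.1% failure-rate case quoted above, which evaluates to a loss of roughly 18-20%.

```python
def tsv_yield(n_tsv, f_tsv, m=1):
    """Eq. 7.50: probability that every signal connection survives when each
    signal uses m redundant TSVs; m=1 reduces to the (1-f_TSV)^N_TSV term of Eq. 7.49."""
    return (1.0 - f_tsv ** m) ** n_tsv

def tsv_area(n_tsv, a_tsv, m=1):
    """Eq. 7.51: redundancy multiplies the TSV area."""
    return m * n_tsv * a_tsv

# Worked example from the text: 200 TSVs, 0.1% per-TSV failure rate.
print(f"no redundancy : yield loss = {1 - tsv_yield(200, 0.001):.1%}")
print(f"2x redundancy : yield loss = {1 - tsv_yield(200, 0.001, m=2):.4%}, "
      f"TSV area = {tsv_area(200, 10.0, m=2):.0f} um^2")
```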

7.6.1.4 Interaction with Fabrication Cost

The fabrication cost of 3D ICs has been analyzed in [73]; however, DFT-related cost concerns have not yet been addressed.



Fig. 7.26: Dependency graph of the key quantities in the 3D IC cost model.

According to the discussion in the previous sub-sections, incorporating DFT in a 3D IC design has a significant impact on the chip fabrication cost and the total chip cost. Figure 7.26 shows the dependencies between the key quantities in the 3D IC cost model when the testing cost is considered. Testing cost is closely related to fabrication cost through the inclusion of TSVs and DFT structures; both the fabrication and testing costs are strongly correlated with the chip area and yield, and the trade-offs between all the quantities in the graph are explored by our 3D cost model. To facilitate the total cost analysis, we first update the total chip area to be

A_{overall} = A_{2D} + A_{DFT} + A_{TSV} \, N_{TSV}    (7.52)

where A_TSV is the chip area occupied by a single TSV and N_TSV is the TSV count, which can be estimated from A_2D using a methodology similar to the one presented in [73]. With this we calculate the A_i terms in Equations (7.34)–(7.41). The total cost of a 3D IC is then calculated as

C_{total} = C_{fabrication} + C_{test}    (7.53)

We reuse the fabrication cost analysis in [73] to estimate C_fabrication and calculate C_test using the model presented in this section.


7.6.2 Cost Analysis for 3D ICs with Testing Cost

In this section we apply our testing cost model to 3D IC designs and analyze the interactions between the IC cost and the key quantities in 3D ICs, including the layer count, the DFT alpha factor, and the TSV failure rate reduction strategies.

7.6.2.1 Model Parameters

We assume various input values to perform experiments with our testing cost model. Essentially, the model uses three categories of parameters: independent variables, 3D integration factors, and constants.

Independent variables: the key independent variables used in our studies are the production volume V and the die area A_2D. We explore a spectrum of values for V from small, ASIC-like volumes (100,000) to large, microprocessor-like volumes (100 million). For A_2D, we explore die areas ranging from 0.5 cm² to 4 cm².

3D alpha factors: the impact of 3D integration on testing is difficult to estimate without specific attributes of the IC design. Since very little data on modeling these attributes has been published, we explore a range of values that are reasonable to the best of our knowledge. Note that the proposed 3D IC test model is parametric, so the analysis can be applied to other parameter values without loss of generality.

Constants: the constants used here come from both our industry partners (arbitrary units (A.U.) are used for cost values due to the non-disclosure agreement with our industry partners) and data in the previous literature; Table 7.8 lists them. All the cost analysis is performed with the IBM 65nm Common Platform technology. The TSV parameters are scaled down from the MIT Lincoln Lab 130nm 3D technology data.


Table 7.8: Constant values used in our cost model.

  Eqn No.   Constant    Value        Remarks
  (7.46)    D_0         0.004/mm²    The density of point defects per unit area
  (7.46)    α           2.0          The yield model clustering parameter
  (7.42)    fc          0.99         The default fault coverage
  (7.49)    Y_bonding   0.95         The yield loss due to wafer thinning, etc.
  (7.50)    f_TSV       10 ppm       The default TSV defect rate
  (7.52)    A_TSV       10 μm²       The TSV cross-sectional area


Fig. 7.27: Overall chip yield with different numbers of layers and stacking strategies.

7.6.2.2 Trade-off Analysis

The impact of layer count and stacking strategies. Given a design with a specified equivalent 2D chip area, we first address, from a DFT perspective, the questions of how many layers the design should be partitioned into and which stacking strategy should be used. As discussed earlier, the layer count has a significant impact on the overall chip yield as well as on the DFT circuit overhead. To verify this, we perform simulations on a benchmark design with a silicon area of 200 mm²


Fig. 7.28: Testing cost with different numbers of layers and stacking strategies on IBM 65nm technology.

and a production volume of 100,000. Figure 7.27 shows the overall chip yield with different design partitionings and stacking strategies. In W2W stacking, the overall chip yield decreases rapidly as the number of layers increases: although using more layers leads to a smaller die area for each layer, the corresponding yield improvement cannot compensate for the yield loss due to stacking untested, imperfect dies. On the other hand, the overall chip yield in D2W stacking improves slightly as the number of layers increases from 2 to 5, thanks to the KGD testing before stacking; when the layer count increases further, the yield loss from the stacking operations starts to dominate. In Figure 7.28, a breakdown of the testing cost for the benchmark design is depicted. We can see that for W2W stacking the penalty for test escapes (C_quality) rises as the overall yield goes down, while in D2W stacking the test execution cost occupies the largest portion, as testing is carried out on each die. Figure 7.29 summarizes the total cost of the benchmark design with different layer counts and stacking strategies, including both the testing cost and the silicon fabrication cost. When only 2 layers are used


Fig. 7.29: Total chip cost with different numbers of layers and stacking strategies on IBM 65nm technology.

for the design, W2W stacking has almost the same cost as D2W stacking. Both the fabrication and testing costs then increase rapidly with the layer count, due to the drop in the overall yield. This suggests that W2W stacking might be more favorable for designs with low layer counts, because it can achieve a higher production throughput without performing KGD testing and therefore reduce the time-to-market of the design.

7.7 Conclusion

To overcome the barriers in technology scaling, three-dimensional integrated circuits (3D ICs) are emerging as an attractive option for future microprocessor designs. However, fabrication cost is one of the important considerations for the wide adoption of 3D integration, and system-level cost analysis at the early design stage, to help decide whether 3D integration should be used for a given application, is therefore critical. To facilitate this early-design-stage cost analysis, we propose a set of cost models that include the wafer cost, 3D bonding cost, package cost,


and cooling cost. Based on the cost analysis, we identify design opportunities for cost reduction in 3D ICs and provide a few design guidelines for cost-effective 3D IC designs. Our research is complementary to the existing research on 3D microprocessors that focuses on other design goals, such as performance and power consumption.

8 Challenges for 3D IC Design

Even though 3D integrated circuits show great benefits, there are several challenges to the adoption of 3D technology for future architecture design: (1) Thermal management. The move from 2D to 3D design could accentuate thermal concerns due to the increased power density. To mitigate the thermal impact, thermal-aware design techniques must be adopted for 3D architecture design [2]. (2) Design tools and methodologies. 3D integration technology will not be commercially viable without the support of EDA tools and methodologies that allow architects and circuit designers to develop new architectures or circuits using this technology. To efficiently exploit the benefits of 3D technologies, design tools and methodologies that support 3D designs are imperative [3]. (3) Testing. One of the barriers to 3D technology adoption is the insufficient understanding of 3D testing issues and the lack of design-for-testability (DFT) techniques for 3D ICs, which have remained largely unexplored in the research community.


Acknowledgements

This is an example of acknowledgement. This is an example of acknowledgement. This is an example of acknowledgement.


References

[1] W. R. Davis, J. Wilson, S. Mick, et al. Demystifying 3D ICs: the pros and cons of going vertical. IEEE Design and Test of Computers, 22(6):498510, 2005. [2] Y. Xie, G Loh, B Black, and K. Bernstein. Design space exploration for 3D architectures. ACM Journal of Emerging Technologies in Compuing Systems, 2006. [3] Yuan Xie, Jason Cong, and Sachin Sapatnekar. Three-Dimensional Integrated Circuit Design: EDA, Design and Microarchitectures. Springer, 2009. [4] P. Garrou. Handbook of 3D Integration: Technology and Applications using 3D Integrated Circuits, chapter Introduction to 3D Integration. Wiley-CVH, 2008. [5] J.W. Joyner, P. Zarkesh-Ha, and J.D. Meindl. A stochastic global net-length distribution for a three-dimensional system-on-a-chip (3D-SoC). In Proc. 14th Annual IEEE International ASIC/SOC Conference, September 2001. [6] Yuh-fang Tsai, Feng Wang, Yuan Xie, N. Vijaykrishnan, and M. J. Irwin. Design space exploration for three- dimensional cache. IEEE Transactions on Very Large Scale Integration Systems, 2008. [7] Balaji Vaidyanathan, Wei-Lun Hung, Feng Wang, Yuan Xie, Vijaykrishnan Narayanan, and Mary Jane Irwin. Architecting microprocessor components in 3D design space. In Intl. Conf. on VLSI Design, pages 103108, 2007. [8] Kiran Puttaswamy and Gabriel H. Loh. Scalability of 3d-integrated arithmetic units in high-performance microprocessors. In Design Automation Conference, pages 622625, 2007.


[9] Jin Ouyang, Guangyu Sun, Yibo Chen, Lian Duan, Tao Zhang, Yuan Xie, and Mary Irwin. Arithmetic unit design using 180nm TSV-based 3D stacking technology. In IEEE International 3D System Integration Conference, 2009. [10] R Egawa, J Tada, H Kobayashi, and G Goto. Evaluation of ne grain 3D integrated arithmetic units. In IEEE International 3D System Integration Conference, 2009. [11] B. Black et al. Die stacking 3D microarchitecture. In MICRO, pages 469479, 2006. [12] G. H. Loh. 3d-stacked memory architectures for multi-core processors. In International Symposium on Computer Architecture (ISCA), pages 453464, 2008. [13] P. Jacob et al. Mitigating memory wall eects in high clock rate and multicore cmos 3D ICs: Processor memory stacks. Proceedings of IEEE, 96(10), 2008. [14] Gabriel Loh, Yuan Xie, and Bryan Black. Processor design in threedimensional die-stacking technologies. IEEE Micro, 27(3):3148, 2007. [15] Taeho Kgil, Shaun DSouza, Ali Saidi, Nathan Binkert, Ronald Dreslinski, Trevor Mudge, Steven Reinhardt, and Krisztian Flautner. PicoServer: using 3D stacking technology to enable a compact energy ecient chip multiprocessor. In ASPLOS, pages 117128, 2006. [16] S. Vangal et al. An 80-tile Sub-100-W TeraFLOPS processor in 65-nm CMOS. IEEE Journal of Solid-State Circuits, 43(1):2941, 2008. [17] G.H. Loh. Extending the eectiveness of 3D-stacked dram caches with an adaptive multi-queue policy. In International Symposium on Microarchitecture (MICRO), Dec. 2009. [18] F. Li, C. Nicopoulos, T. Richardson, Y. Xie, N. Vijaykrishnan, and M. Kandemir. Design and management of 3D chip multiprocessors using network-inmemory. In International Symposium on Computer Architecture (ISCA06), 2006. [19] Xiangyu Dong, Xiaoxia Wu, Guangyu Sun, Y. Xie, H. Li, and Yiran Chen. Circuit and microarchitecture evaluation of 3D stacking Magnetic RAM (MRAM) as a universal memory replacement. In Design Automation Conference, pages 554559, 2008. [20] Xiaoxia Wu, Jian Li, Lixi Zhang, Evan Speight, and Yuan Xie. Hybrid cache architecture. In International Symposium on Computer Architecture (ISCA), 2009. [21] Guangyu Sun, Xiangyu Dong, Yuan Xie, Jian Li, and Yiran Chen. A novel 3D stacked MRAM cache architecture for CMPs. In International Symposium on High Performance Computer Architecture, 2009. [22] Dana Vantrease, Robert Schreiber, Matteo Monchiero, Moray McLaren, Norman P. Jouppi, Marco Fiorentino, Al Davis, Nathan Binkert, Raymond G. Beausoleil, and Jung Ho Ahn. Corona: System implications of emerging nanophotonic technology. In Proceedings of the 35th International Symposium on Computer Architecture, pages 153164, 2008.


[23] Xiangyu Dong and Yuan Xie. Cost analysis and system-level design exploration for 3D ICs. In Asia and South Pacic Design Automation Conference, 2009. [24] S.S. Sapatnekar. Addressing Thermal and Power Delivery Bottlenecks in 3D Circuits. In Proceedings of the Asia and South Pacic Design Automation Conference, 2009., pages 423428, 2009. [25] Peng Li, L. T. Pileggi, M. Asheghi, and R. Chandra. Ic thermal simulation and modeling via ecient multigrid-based approaches. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25:17631776, 2006. [26] B. Goplen and S. Sapatnekar. Ecient thermal placement of standard cells in 3D ICs using a force directed approach. In International Conference on Computer Aided Design., pages 8689, 2003. [27] W. Huang, S. Ghosh, and et al. Hotspot: a compact thermal modeling methodology for early-stage VLSI design. TVLSI, 14(5):501513, 2006. [28] Y. Yang, Z. Gu, C. Zhu, R. P. Dick, and L. Shang. Isac: Integrated space-andtime-adaptive chip-package thermal analysis. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., 26(1):8699, 2007. [29] W. L. Hung, G. M. Link, Xie Yuan, N. Vijaykrishnan, and M. J. Irwin. Interconnect and thermal-aware oorplanning for 3d microprocessors. In International Symposium on Quality Electronic Design, page 6 pp., 2006. [30] Karthik Balakrishnan, Vidit Nanda, Siddharth Easwar, and Sung Kyu Lim. Wire congestion and thermal aware 3D global placement. In Proc. of Asia South Pacic Design Automation Conference (ASPDAC), pages 11311134, 2005. [31] Jason Cong, Guojie Luo, Jie Wei, and Yan Zhang. Thermal-aware 3d ic placement via transformation. In Proc. Asia and South Pacic Design Automation Conf. ASP-DAC 07, pages 780785, 2007. [32] J. Cong and Guojie Luo. A multilevel analytical placement for 3d ics. In Proc. Asia and South Pacic Design Automation Conf. ASP-DAC 2009, pages 361 366, 2009. [33] J. Cong and Yan Zhang. Thermal-driven multilevel routing for 3d ics. In Proc. Asia and South Pacic Design Automation Conf the ASP-DAC 2005, volume 1, pages 121126, 2005. [34] J. Cong, J. Wei, and Y. Zhang. A thermal-driven oorplanning algorithm for 3d ics. In Proceedings of the ICCAD, 2004. [35] C. Chu and D. Wong. A matrix synthesis approach to thermal placement. In Proceedings of the ISPD, 1997. [36] C. Tsai and S. Kang. Cell-level placement for improving substrate thermal distributio. In IEEE Trans. on Computer-Aided Design of Integrated Circuits and System, 2000. [37] Brent Goplen and Sachin Sapatnekar. Ecient Thermal Placement of Standard Cells in 3D ICs using a Force Directed Approach. Proceedings of ICCAD, 2003. [38] P. Shiu and S. K. Lim. Multi-layer oorplanning for reliable system-onpackage. In Proceedings of ISCAS, 2004.


[39] W. Hung, G. Link, Y. Xie, V. Narayanan, and M.J.Irwin. Interconnect and thermal-aware oorplanning for 3d microprocessors. In Proceedings of the International Symposium of Quality Electronic Devices, 2006. [40] K. Skadron, M. R. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan. Temperature-Aware Microarchitecture. Proc. ISCA-30, pages 2 13, June 2003. [41] Greg Link and V. Narayanan. Thermal trends in emergent technologies. In Proceedings of the International Symposium of Quality Electronic Devices, 2006. [42] Y. Chang, Y. Chang, G.-M. Wu, and S.-W. Wu. B*-trees: a new representation for non-slicing oorplans . Proceedings of the Annual ACM/IEEE Design Automation Conference, 2000. [43] Xin Zhao, D. L. Lewis, H.-H. S. Lee, and Sung Kyu Lim. Pre-bond testable low-power clock tree design for 3D stacked ics. In Proc. IEEE/ACM Int. Conf. Computer-Aided Design (ICCAD), pages 184190, 2009. [44] Tak-Yung Kim and Taewhan Kim. Clock tree synthesis with pre-bond testability for 3d stacked ic designs. In Proc. 47th ACM/IEEE Design Automation Conf. (DAC), pages 723728, 2010. [45] J. R. Minz, Eric Wong, M. Pathak, and Sung Kyu Lim. Placement and routing for 3-d system-on-package designs. IEEE Trans. Compon. Packag. Technol., 29(3):644657, 2006. [46] Eric Wong, J. Minz, and Sung Kyu Lim. Multi-objective module placement for 3-d system-on-package. IEEE Trans. VLSI Syst., 14(5):553557, 2006. [47] E. Wong, J. Minz, and Sung Kyu Lim. Eective thermal via and decoupling capacitor insertion for 3d system-on-package. In Proc. 56th Electronic Components and Technology Conf, 2006. [48] Hao Yu, J. Ho, and Lei He. Simultaneous power and thermal integrity driven via stapling in 3d ics. In Proc. IEEE/ACM Int. Conf. Computer-Aided Design ICCAD 06, pages 802808, 2006. [49] Chen-Wei Liu and Yao-Wen Chang. Power/ground network and oorplan cosynthesis for fast design convergence. IEEE Trans. Comput.-Aided Design Integr. Circuits Syst., 26(4):693704, 2007. [50] P. Falkenstern, Yuan Xie, Yao-Wen Chang, and Yu Wang. Three-dimensional integrated circuits (3d ic) oorplan and power/ground network co-synthesis. In Proc. 15th Asia and South Pacic Design Automation Conf. (ASP-DAC), pages 169174, 2010. [51] D. L. Lewis and H.-H. S. Lee. A scanisland based design enabling prebond testability in die-stacked microprocessors. In Proc. IEEE Int. Test Conf. ITC 2007, pages 18, 2007. [52] D. L. Lewis and H.-H. S. Lee. Testing circuit-partitioned 3d ic designs. In Proc. IEEE Computer Society Annual Symp. VLSI ISVLSI 09, pages 139 144, 2009. [53] Po-Yuan Chen, Cheng-Wen Wu, and Ding-Ming Kwai. On-chip TSV testing for 3D IC before bonding using sense amplication. In Proc. of Asian Test Symp (ATS), pages 450455, 2009.


[54] H.-H. S. Lee and K. Chakrabarty. Test challenges for 3D integrated circuits. IEEE Design & Test of Computers, 26(5):2635, 2009. [55] X. Wu, P. Falkenstern, K. Chakrabarty, and Y. Xie. Scan-chain design and optimization for 3D ICs. ACM Journal of Emerging Technologies for Computer Systems, 2009. [56] X. Wu, Y. Chen, K. Chakrabarty, and Y. Xie. Test-access mechanism optimization for core-based three-dimensional SOCs. Proc. International Conference on Computer Design, 2008. [57] Li Jiang, Lin Huang, and Qiang Xu. Test architecture design and optimization for three-dimensional socs. In Proc. DATE 09. Design, Automation & Test in Europe Conf. & Exhibition, pages 220225, 2009. [58] K. Zhang et al. A SRAM Design on 65nm CMOS technology with Integrated Leakage Reduction Scheme. IEEE Symp. On VLSI Circuits Digest of Technical Papers, 2004. [59] Yuhfang Tsai, Yuan Xie, Vijaykrishnan Narayanan, and Mary Jane Irwin. Three-dimensional cache design exploration using 3dcacti. Proceedings of the IEEE International Conference on Computer Design (ICCD 2005), pages 519 524, 2005. [60] P. Shivakumar et al. Cacti 3.0: An Integrated Cache Timing, Power, and Area Model. Western Research Lab Research Report, 2001. [61] C. Ababei, H. Mogal, and K. Bazargan. Three-dimensional place and route for fpgas. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(6):11321140, 2006. [62] M. J. Alexander, J. P. Cohoon, J. L. Colesh, J. Karro, and G. Robins. Threedimensional eld-programmable gate arrays. In Proc. Eighth Annual IEEE Int. ASIC Conf. and Exhibit, pages 253256, 1995. [63] S. M. S. A. Chiricescu and M. M. Vai. A three-dimensional fpga with an integrated memory for in-application reconguration data. In Proc. IEEE Int. Symp. Circuits and Systems ISCAS 98, volume 2, pages 232235, 1998. [64] A. Rahman, S. Das, and et al. Wiring requirement and three-dimensional integration technology for eld programmable gate arrays. TVLSI, 11(1):44 54, 2003. [65] M. Lin, A. El Gamal, and et al. Performance benets of monolithically stacked 3-D FPGA. TCAD, 26(2):216229, 2007. [66] A. Gayasen, V. Narayanan, M. Kandemir, and A. Rahman. Designing a 3-d fpga: Switch box architecture and thermal issues. IEEE Transactions on Very Large Scale Integration (VLSI) Systems, 16(7):882893, 2008. [67] M. Leeser, W. M. Meleis, M. M. Vai, S. Chiricescu, Weidong Xu, and P. M. Zavracky. Rothko: a three-dimensional fpga. IEEE Design & Test of Computers, 15(1):1623, 1998. [68] W. M. Meleis, M. Leeser, P. Zavracky, and M. M. Vai. Architectural design of a three dimensional fpga. In Proc. Seventeenth Conf Advanced Research in VLSI, pages 256268, 1997. [69] C. Ababei, P. Maidee, and K. Bazargan. Exploring potential benets of 3D FPGA integration. In FPL, 2004.


[70] Xilinx. Virtex-4 FPGA data sheets. http://www.xilinx.com/support/ documentation/virtex-4_data_sheets.htm. [71] F. Bedeschi, R. Fackenthal, and et al. A bipolar-selected phase change memory featuring multi-level cell storage. JSSC, 44(1):217227, 2009. [72] B. C. Lee, E. Ipek, and et al. Architecting phase change memory as a scalable DRAM alternative. In ISCA, 2009. [73] X. Dong, N. Jouppi, and Y. Xie. PCRAMsim: System-level performance, energy, and area modeling for phase-change RAM. ICCAD, 2009. [74] C. H. Ho, W. Leong, and et al. Virtual embedded blocks: A methodology for evaluating embedded elements in FPGAs. In FCCM, 2006. [75] P.H. Shiu, R. Ravichandran, S. Easwar, and S.K. Lim. Multi-layer oorplanning for reliable system-on-package. 2004. [76] Subhash Gupta, Mark Hilbert, and et al. Techniques for producing 3D ICs with high-density interconnect. http://www.tezzaron.com/about/papers/ ieee%20vmic%202004%20finalsecure.pdf, 2004. [77] S. Thoziyoor, N. Muralimanohar, and N. P. Jouppi. CACTI 5.1. Technical Report HPL-2008-20, HP Laboratories, 2008. [78] V. Betz and J. Rose. VPR: A new packing, placement and routing tool for FPGA research. In Field-Programmable Logic Applications, 1997. [79] A. Telikepalli. Power-performance inection at 90 nm process node FPGAs in focus. In Chip Design Magazine, April 2008. [80] Xilinx. XPower estimator. http://www.xilinx.com/products/design_ tools/logic_design/verification/xpower.htm. [81] Feihui Li, Chrysostomos Nicopoulos, Thomas Richardson, Yuan Xie, Vijaykrishnan Narayanan, and Mahmut Kandemir. Design and management of 3d chip multiprocessors using network-in-memory. In ISCA 06: Proceedings of the 33rd annual international symposium on Computer Architecture, pages 130141, Washington, DC, USA, 2006. IEEE Computer Society. [82] Gabriel H. Loh. Extending the eectiveness of 3d-stacked dram caches with an adaptive multi-queue policy. In MICRO 42: Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, pages 201212, New York, NY, USA, 2009. ACM. [83] Gabriel H. Loh. 3d-stacked memory architectures for multi-core processors. In ISCA 08: Proceedings of the 35th Annual International Symposium on Computer Architecture, pages 453464, Washington, DC, USA, 2008. IEEE Computer Society. [84] Dong Hyuk Woo, Nak Hee Seong, Dean L. Lewis, and Hsien-Hsin S. Lee. An optimized 3d-stacked memory architecture by exploiting excessive, highdensity tsv bandwidth. In HPCA 10: Proceedings of the 2007 IEEE 16th International Symposium on High Performance Computer Architecture, Bangalore, Indian, 2010. IEEE Computer Society. [85] Mrinmoy Ghosh and Hsien-Hsin S. Lee. Smart refresh: An enhanced memory controller design for reducing energy in conventional and 3d die-stacked drams. In MICRO 40: Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, pages 134145, Washington, DC, USA, 2007. IEEE Computer Society.


[86] Kiran Puttaswamy and Gabriel H. Loh. 3d-integrated sram components for high-performance microprocessors. IEEE Trans. Comput., 58(10):13691381, 2009. [87] Bryan Black, Murali Annavaram, Ned Brekelbaum, John DeVale, Lei Jiang, Gabriel H. Loh, Don McCaule, Pat Morrow, Donald W. Nelson, Daniel Pantuso, Paul Reed, Je Rupley, Sadasivan Shankar, John Shen, and Clair Webb. Die stacking (3d) microarchitecture. In MICRO 39: Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, pages 469479, Washington, DC, USA, 2006. IEEE Computer Society. [88] Kiran Puttaswamy and Gabriel H. Loh. Thermal herding: Microarchitecture techniques for controlling hotspots in high-performance 3d-integrated processors. In HPCA 07: Proceedings of the 2007 IEEE 13th International Symposium on High Performance Computer Architecture, pages 193204, Washington, DC, USA, 2007. IEEE Computer Society. [89] Bryan Black, Donald W. Nelson, Clair Webb, and Nick Samra. 3d processing technology and its impact on ia32 microprocessors. In ICCD 04: Proceedings of the IEEE International Conference on Computer Design, pages 316318, Washington, DC, USA, 2004. IEEE Computer Society. [90] Balaji Vaidyanathan, Wei-Lun Hung, Feng Wang, Yuan Xie, Vijaykrishnan Narayanan, and Mary Jane Irwin. Architecting microprocessor components in 3d design space. In VLSID 07: Proceedings of the 20th International Conference on VLSI Design held jointly with 6th International Conference, pages 103108, Washington, DC, USA, 2007. IEEE Computer Society. [91] Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, and Yuan Xie. Hybrid cache architecture with disparate memory technologies. In ISCA 09: Proceedings of the 36th annual international symposium on Computer architecture, pages 3445, New York, NY, USA, 2009. ACM. [92] Guangyu Sun, Xiangyu Dong, Yuan Xie, Jian Li, and Yiran Chen. A novel architecture of the 3d stacked mram l2 cache for cmps. pages 239 249, feb. 2009. [93] Taeho Kgil, Ali Saidi, Nathan Binkert, Steve Reinhardt, Krisztian Flautner, and Trevor Mudge. Picoserver: Using 3d stacking technology to build energy ecient servers. J. Emerg. Technol. Comput. Syst., 4(4):134, 2008. [94] C.C. Liu, I. Ganusov, M. Burtscher, and Sandip Tiwari. Bridging the processor-memory performance gap with 3d ic technology. Design Test of Computers, IEEE, 22(6):556 564, nov. 2005. [95] Gian Luca Loi, Banit Agrawal, Navin Srivastava, Sheng-Chih Lin, Timothy Sherwood, and Kaustav Banerjee. A thermally-aware performance analysis of vertically integrated (3-d) processor-memory hierarchy. In DAC 06: Proceedings of the 43rd annual Design Automation Conference, pages 991996, New York, NY, USA, 2006. ACM. [96] Tezzaron Semiconductors. Leo fastack 1gb ddr sdram datasheet. In http://www.tezzaron.com/memory/TSC Leo.htm, 2005. [97] Feihui Li, Chrysostomos Nicopoulos, Thomas Richardson, Yuan Xie, Vijaykrishnan Narayanan, and Mahmut Kandemir. Design and management of 3d


References chip multiprocessors using network-in-memory. SIGARCH Comput. Archit. News, 34(2):130141, 2006. Jongman Kim, Chrysostomos Nicopoulos, Dongkook Park, Reetuparna Das, Yuan Xie, Vijaykrishnan Narayanan, Mazin S. Yousif, and Chita R. Das. A novel dimensionally-decomposed router for on-chip communication in 3d architectures. SIGARCH Comput. Archit. News, 35(2):138149, 2007. Dongkook Park, Soumya Eachempati, Reetuparna Das, Asit K. Mishra, Yuan Xie, N. Vijaykrishnan, and Chita R. Das. Mira: A multi-layered on-chip interconnect router architecture. In ISCA 08: Proceedings of the 35th Annual International Symposium on Computer Architecture, pages 251261, Washington, DC, USA, 2008. IEEE Computer Society. Yaoyao Ye, Lian Duan, Jiang Xu, Jin Ouyang, Mo Kwai Hung, and Yuan Xie. 3d optical networks-on-chip (noc) for multiprocessor systems-on-chip (mpsoc). pages 1 6, sep. 2009. Xiang Zhang and Ahmed Louri. A multilayer nanophotonic interconnection network for on-chip many-core communications. In DAC 10: Proceedings of the 47th Design Automation Conference, pages 156161, New York, NY, USA, 2010. ACM. Dana Vantrease, Robert Schreiber, Matteo Monchiero, Moray McLaren, Norman P. Jouppi, Marco Fiorentino, Al Davis, Nathan Binkert, Raymond G. Beausoleil, and Jung Ho Ahn. Corona: System implications of emerging nanophotonic technology. In ISCA 08: Proceedings of the 35th Annual International Symposium on Computer Architecture, pages 153164, Washington, DC, USA, 2008. IEEE Computer Society. Sudeep Pasricha. Exploring serial vertical interconnects for 3d ics. In DAC 09: Proceedings of the 46th Annual Design Automation Conference, pages 581586, New York, NY, USA, 2009. ACM. Yue Qian, Zhonghai Lu, and Wenhua Dou. From 2d to 3d nocs: a case study on worst-case communication performance. In ICCAD 09: Proceedings of the 2009 International Conference on Computer-Aided Design, pages 555562, New York, NY, USA, 2009. ACM. Kerry Bernstein. New dimension in performance: harnessing 3D integration technology. EDA Forum, 3(2), 2006. Gabriel Loh, Yuan Xie, and Bryan Black. Processor design in threedimensional die-stacking technologies. IEEE Micro, 27(3):3148, 2007. Y. Xie, G. H. Loh, B. Black, and K. Bernstein. Design space exploration for 3D architectures. ACM Journal on Emerging Technologies in Computing Systems, 2(2):65103, 2006. Cristinel Ababei, Yan Feng, Brent Goplen, et al. Placement and routing in 3D integrated circuits. IEEE Design and Test of Computers, 22(6):520531, 2005. Soon-Moon Jung, Jae-Hoon Jang, Won-Seok Cho, et al. The revolutionary and truly 3-dimensional 25F 2 SRAM technology with the smallest S 3 (Stacked Single-Crystal Si) cell, 0.16m2 , and SSTFT (Atacked Single-Crystal Thin Film Transistor) for ultra high density SRAM. In Proceedings of the Symposium on VLSI Technology, pages 228229, 2004.

[98]

[99]

[100]

[101]

[102]

[103]

[104]

[105] [106] [107]

[108]

[109]

References

151

[110] Xiangyu Dong, Xiaoxia Wu, Guangyu Sun, et al. Circuit and microarchitecture evaluation of 3D stacking Magnetic RAM (MRAM) as a universal memory replacement. In Proceedings of the Design Automation Conference, pages 554–559, 2008.
[111] Guangyu Sun, Xiangyu Dong, Yuan Xie, et al. A novel architecture of the 3D stacked MRAM L2 cache for CMPs. In Proceedings of the International Symposium on High Performance Computer Architecture, pages 239–249, 2009.
[112] Larry Smith, Greg Smith, Sharath Hosali, and Sitaram Arkalgud. 3D: it all comes down to cost. In Proceedings of the RTI Conference on 3D Architecture for Semiconductors and Packaging, 2007.
[113] Sheng Li, Jung Ho Ahn, Richard D. Strong, et al. McPAT: an integrated power, area, and timing modeling framework for multicore and manycore architectures. In Proceedings of the International Symposium on Microarchitecture, pages 469–480, 2009.
[114] Taeho Kgil, Shaun D'Souza, Ali Saidi, et al. PicoServer: using 3D stacking technology to enable a compact energy efficient chip multiprocessor. In Proceedings of the International Conference on Architectural Support for Programming Languages and Operating Systems, pages 117–128, 2006.
[115] Balaji Vaidyanathan, Wei-Lun Hung, Feng Wang, et al. Architecting microprocessor components in 3D design space. In Proceedings of the International Conference on VLSI Design, pages 103–108, 2007.
[116] G. L. Loi, B. Agrawal, N. Srivastava, et al. A thermally-aware performance analysis of vertically integrated (3-D) processor-memory hierarchy. In Proceedings of the Design Automation Conference, pages 991–996, 2006.
[117] Eric Wong and Sung Kyu Lim. 3D floorplanning with thermal vias. In Proceedings of the Design, Automation and Test in Europe, pages 878–883, 2006.
[118] Tianpei Zhang, Yong Zhan, and Sachin S. Sapatnekar. Temperature-aware routing in 3D ICs. In Proceedings of the Asia and South Pacific Design Automation Conference, pages 309–314, 2006.
[119] Hao Hua, Chris Mineo, Kory Schoenfliess, et al. Exploring compromises among timing, power and temperature in three-dimensional integrated circuits. In Proceedings of the Design Automation Conference, pages 997–1002, 2006.
[120] B. Goplen and S. S. Sapatnekar. Placement of thermal vias in 3D ICs using various thermal objectives. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 25(4):692–709, 2006.
[121] Ayse Coskun, Andrew Kahng, and Tajana Simunic Rosing. Temperature- and cost-aware design of 3D multiprocessor architectures. In Proceedings of the Euromicro Conference on Digital System Design, pages 183–190, 2009.
[122] Christianto C. Liu, Jeng-Huei Chen, Rajit Manohar, and Sandip Tiwari. Mapping System-on-Chip designs from 2-D to 3-D ICs. In Proceedings of the International Symposium on Circuits and Systems, pages 2939–2942, 2005.
[123] P. Mercier, S. R. Singh, K. Iniewski, et al. Yield and cost modeling for 3D chip stack technologies. In Proceedings of the Custom Integrated Circuits Conference, pages 357–360, 2006.
[124] Roshan Weerasekera, Li-Rong Zheng, Dinesh Pamunuwa, and Hannu Tenhunen. Extending Systems-on-Chip to the third dimension: performance, cost and technological tradeoffs. In Proceedings of the International Conference on Computer-Aided Design, pages 212–219, 2007.
[125] B. S. Landman and R. L. Russo. On a pin versus block relationship for partitions of logic graphs. IEEE Transactions on Computers, 20(12):1469–1479, 1971.
[126] Wilm E. Donath. Placement and average interconnection lengths of computer logic. IEEE Transactions on Circuits and Systems, 26(4):272–277, 1979.
[127] Jeffrey A. Davis, Vivek K. De, and James D. Meindl. A stochastic wire-length distribution for gigascale integration (GSI), Part I: derivation and validation. IEEE Transactions on Electron Devices, 45(3):580–589, 1998.
[128] Andrew B. Kahng, Stefanus Mantik, and Dirk Stroobandt. Toward accurate models of achievable routing. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 20(5):648–659, 2001.
[129] Philip Chong and Robert K. Brayton. Estimating and optimizing routing utilization in DSM design. In Proceedings of the Workshop on System-Level Interconnect Prediction, pages 97–102, 1999.
[130] H. B. Bakoglu. Circuits, Interconnections, and Packaging for VLSI. Addison-Wesley, 1990.
[131] Marc Tremblay and Shailender Chaudhry. A third-generation 65nm 16-core 32-thread plus 32-scout-thread CMT SPARC(R) processor. In Proceedings of the International Solid-State Circuits Conference, pages 82–83, 2008.
[132] P. Zarkesh-Ha, J. A. Davis, W. Loh, and J. D. Meindl. On a pin versus gate relationship for heterogeneous systems: heterogeneous Rent's rule. In Proceedings of the IEEE Custom Integrated Circuits Conference, pages 93–96, 1998.
[133] IC Knowledge LLC. IC Cost Model, 2009 Revision 0906, 2009.
[134] Jan Rabaey, Anantha Chandrakasan, and Borivoje Nikolic. Digital Integrated Circuits. Prentice-Hall, 2003.
[135] B. T. Murphy. Cost-size optima of monolithic integrated circuits. Proceedings of the IEEE, 52(12):1537–1545, Dec. 1964.
[136] Yibo Chen, Dimin Niu, Xiangyu Dong, Yuan Xie, and Krishnendu Chakrabarty. Testing cost analysis in three-dimensional (3D) integration technology. Technical Report, CSE Department, Penn State, http://www.cse.psu.edu/yuanxie/3d.html, 2010.
[137] S. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, 5(1):1–9, 2001.
[138] Kevin Skadron, Mircea R. Stan, Karthik Sankaranarayanan, et al. Temperature-aware microarchitecture: modeling and implementation. ACM Transactions on Architecture and Code Optimization, 1(1):94–125, 2004.
[139] W. Huang, K. Skadron, S. Gurumurthi, et al. Differentiating the roles of IR measurement and simulation for power and temperature-aware design. In Proceedings of the International Symposium on Performance Analysis of Systems and Software, pages 1–10, 2009.
[140] Digikey, 2009. www.digikey.com.
[141] Heatsink Factory, 2009. www.heatsinkfactory.com.
[142] Shekhar Borkar. 3D technology: a system perspective. In Proceedings of the International 3D-System Integration Conference, pages 1–14, 2008.
[143] Syed M. Alam, Robert E. Jones, Scott Pozder, and Ankur Jain. Die/wafer stacking with reciprocal design symmetry (RDS) for mask reuse in three-dimensional (3D) integration technology. In Proceedings of the International Symposium on Quality Electronic Design, pages 569–575, 2009.
[144] P. K. Nag, A. Gattiker, Sichao Wei, R. D. Blanton, and W. Maly. Modeling the economics of testing: a DFT perspective. IEEE Design & Test of Computers, 19(1):29–41, 2002.
[145] T. W. Williams and N. C. Brown. Defect level as a function of fault coverage. IEEE Transactions on Computers, (12):987–988, 1981.
[146] B. T. Murphy. Cost-size optima of monolithic integrated circuits. Proceedings of the IEEE, 52(12):1537–1545, 1964.
[147] T. L. Michalka, R. C. Varshney, and J. D. Meindl. A discussion of yield modeling with defect clustering, circuit repair, and circuit redundancy. IEEE Transactions on Semiconductor Manufacturing, 3(3):116–127, 1990.
[148] P. Mercier, S. R. Singh, K. Iniewski, B. Moore, and P. O'Shea. Yield and cost modeling for 3D chip stack technologies. In Proc. IEEE Custom Integrated Circuits Conf. (CICC), pages 357–360, 2006.
[149] E. J. Marinissen and Y. Zorian. Testing 3D chips containing through-silicon vias. In Proc. Int. Test Conf. (ITC), pages 1–11, 2009.
[150] D.-M. Kwai and C.-W. Wu. 3D integration opportunities, issues, and solutions: a designer's perspective. In Proc. SPIE Lithography Asia, 2009.
