
Network-on-chips on 3-D ICs: Past, Present, and Future

M. Pawan Kumar, Srinivasan Murali1 and Kamakoti Veezhinathan
Department of Computer Science and Engineering, Indian Institute of Technology Madras Chennai, India, 1iNoCs, Switzerland
Abstract
Interconnects have become the chief bottleneck in today's era of chip design. Along the road of interconnect evolution, Network-on-Chips (NoCs) have emerged as a structured and scalable solution for connecting computational elements on a very large scale integration chip. Also, with deep-submicron technology allowing integration of billions of transistors, chips have grown very complex and large in size. The global wire-length problem was addressed with the integration of devices in the third dimension (3-D). The combination of 3-D integration and a scalable interconnect, like NoCs, promises to revolutionize design for Chip Multi-processors, System-on-chips, and System-in-package. This paper surveys the advancements in 3-D NoCs.
Keywords
Interconnects, Network-on-chips, System-on-chips, 3-D Integration.
1. Introduction
With the shrinking feature size in today's deep-submicron era, chips are complex. The complexity arises due to several factors that include global wire delay, clock propagation issues, and hence, reduced yield. Also, chips hit a physical limitation in miniaturization; any further decrease in size would negate the gains obtained by making chips smaller. To mitigate this problem, 3-D integration was sought [1]. The 3-D integration, though not widely in use today, is a non-conventional and emerging paradigm which promises to widen the horizons of chip design [2-5]. Research studies on 3-D integration indicate that this paradigm provides promising solutions for CAD design challenges [6,7]. With freedom in the third dimension, architectures that were prohibitive due to wiring constraints are now possible, and many 3-D implementations can outperform their 2-D counterparts [8]. The 3-D integration comes with several advantages: noise isolation by separating analog/RF and digital circuits into separate substrates with a metal or dielectric bonding layer [9,10]; reduced wire lengths [11]; heterogeneous mixed-technology integration enabling non-Complementary Metal-Oxide-Semiconductor (CMOS) integration (such as memories, Micro Electro-Mechanical Systems (MEMS), etc., on a single die); a higher order of connectivity; improved power consumption by restriction to on-chip signaling; and high-bandwidth vertical links [12-14]. The 3-D chips are also a promising hardware solution for a parallel computing environment [15]. IMEC has designed some 3-D test prototypes, namely, the 3-D130C, which consisted of several test structures to assess design issues of Cu-TSV technology, thermal hot-spots, electrostatic discharge, etc., and the 3-D65, which was designed and manufactured to extend the learning to advanced CMOS technologies manufactured on 300 mm wafers.
On the flip side of the coin, the design of 3-D ICs is quantitatively constrained by several factors. The via-pitch (the distance between two adjacent vias) and the via-diameter determine how many vias can be placed together in a single inter-layer TSV bundle. TSVs are bound on both ends by landing pads. The landing pads on the strata compete for chip real estate with other IPs, thereby playing a role in determining the dimensions of the chip. Also, the non-negligible capacitance of TSVs constrains the number of inter-layer vias. Temperature is another crucial constraint for a 3-D stack [16]. Crossing the temperature threshold could burn the components and render the circuit useless.
As mentioned above, with the shrinking feature sizes, the gate delays were brought down and the interconnect delays went up [17]. Thus, the interconnect turned out to be a limiting factor for the next generation of high-performance circuits and System-on-Chips (SoCs) [18,19]. Long interconnects started becoming impediments from a latency and power standpoint [20]. Consequently, Network-on-Chips (NoCs) emerged as a very scalable communication backbone, well suited for Globally Asynchronous Locally Synchronous (GALS)-based systems [21,22], offering ample bandwidth, facilitating concurrent transfers, providing a regular structure, and offering high flexibility [21,23]. Due to the regular and structured wiring, the electrical parameters of NoCs are well-defined and controllable. This, in turn, enables design of high-performance, aggressive signaling circuits that can reduce the power dissipation by a factor of ten
IETE TECHNICAL REVIEW | VOl 29 | ISSUE 4 | JUL-AUG 2012
and increase propagation velocity by three times [24]. NoCs are now widely used as the interconnect in Chip Multi-Processors (CMPs) and SoCs. However, the design of an NoC also entails several constraints. To mention a few, the queue sizes in the switches of an interconnect are constrained to keep power consumption under control. For custom topologies, the number of ports per switch is constrained to keep the arbitration simple. A large number of switches adds to the power profile of the chip; therefore, the number of switches must also be chosen carefully during the synthesis of custom NoCs. Despite the constraints, NoCs are a necessity for 3-D chips, for they provide scalability of the interconnects across additional layers, efficiently parallelize communication in each layer, and help control the number of vertical wires for inter-layer communication [25]. The performance improvement arising from the architectural advantages of a scalable interconnect like an NoC will be significantly enhanced if 3-D ICs are adopted as the underlying fabrication standard. The amalgam of the two promises to be a robust paradigm with unprecedented capabilities, overcoming many individual limitations, and a viable solution for building high-performance SoCs in the near future. This survey covers the advancements, hitherto, in the field of NoCs for 3-D ICs.
2. Design Challenges of Network-on-chips for 3-D ICs
In this section, we discuss the challenges of 3-D integration and NoCs individually and later show how these challenges become even more complex when the combination of the two is considered.
2.1 Challenges for 3-D Integration
First, we discuss challenges that are solely pertinent to 3-D integration.
Yield: With immature manufacturing infrastructure, defects are much more probable than in conventional two-dimensional circuits. Also, too many vertical vias can bring down the yield.
Testing: There are no mechanisms for the pre-bond and post-bond testing of Through Silicon Vias (TSVs) [26].
Power Dissipation: Thermal buildup between layers is inevitable and must be addressed.
Circuit Architecture Challenges: A design might incur new defects due to TSV formation, which could arise because of misalignment or poor bonding. Such defects could affect signal integrity and power integrity, and induce delay [26].
CAD Tools: CAD tools specifically meant for 3-D integration are needed to take full advantage of 3-D for taping out elegant multi-level designs. Also, manufacturing of vertical vias, such as TSVs, is a complex and expensive process [27].
2.2 Challenges of Network-on-chips
Second, NoCs also have their own set of challenges. The design and implementation challenges through the lens of the interconnect are as follows.
At the architectural level:
Buffer-sizing: Sizing the queues of the switches in the interconnect can deeply affect the power and area of the overall interconnect.
Choice of Topology: An optimal topology can vastly change the performance of the system by directly affecting latency, power dissipation, area footprint, etc.
At the application level:
Task Mapping: Task mapping, done statically or dynamically, can affect system performance; for instance, dynamic task mapping can reduce congestion.
At the communication level:
Choice of the routing protocol: Researchers have used both deterministic and adaptive routing based on the nature of the application.
2.3 Challenges of Network-on-chips for 3-D ICs
The combination of the two brings some unique challenges, such as placing the switches of an NoC onto the various layers of the 3-D stack in such a way that the number of TSVs is minimized [28]. To mention another, the routing protocols might have to change from the standard dimension-order routing (DOR) schemes so that traffic hotspots are not created [29]. Notwithstanding these unique challenges, the advantages are also numerous, which has driven research in various directions at various levels. The classification of these directions is presented in Section 4, where the works in each direction are elaborated. Before that, we summarize in the next section a few terms that are pertinent to 3-D integration and NoCs. These terms are used widely across the rest of the paper. In Section 5, we conclude the paper.
Appendix A, at the end of the paper, should give the reader an idea of the tools/simulators/frameworks used for either implementation or simulation of 3-D and NoC platforms.
3. Terminology
At the outset, we define a few terms that are native to NoCs and 3-D integration. The terms mentioned here will be used throughout the survey. First, we describe some common interconnect technologies pertinent to 3-D integration. Each integration approach differs slightly from the others and suits different needs depending on the underlying platform. They are as follows:
Wire Bonding - Vertical wires connect the individual dies on a stack. Wire bonds are possible only on the chip's periphery, which limits the interconnect density.
Microbump - This technology involves using solder or gold bumps on the surface of the die to make connections.
Contactless - This technology uses capacitive (voltage-driven) or inductive (current-driven) coupling to provide inter-chip communication between layers. Whether inductive or capacitive, Alternating Current coupling, in general, suffers less from mechanical stress and parasitic load [30].
Through-Silicon Via - These are high-density vertical electrical connections connecting one wafer to another by etching holes through the layers and filling the holes with Tungsten to establish connectivity, as shown in Figure 1. Apart from the shortcoming of expensive manufacture, TSVs are the most favorable vertical interconnect technology [4]. TSV is the most widely used vertical interconnect, being the most promising among the aforementioned ones [32,33]. Most of the research undertaken uses TSVs as the vertical interconnect.
Figure 1: Vertical vias and pads to connect multiple layers [31]. (Labels in the figure: Layer 0, Layer 1, Layer 2, Core, Switch, TSV macro, Vertical link, Horizontal link.)
NoCs also have a jargon of their own. The following is a brief list:
Network-on-Chip - It refers to the interconnection network that connects various components inside a chip.
Processing Elements (PE) - It refers to the IP cores that are connected to the NoC network fabric. These are the sources/destinations for various traffic flows in the NoC.
Topology - This refers to the modular structure in which the NoC components are connected. Mesh, Torus, and Fat Tree are some NoC topologies.
Packets and Flits - The messages from the PEs are segmented into packets appended with routing information. Each packet is broken up into smaller manageable units of data known as flits. A flit is a basic unit of bandwidth and storage allocation. The head flit of a packet generally contains the route information.
Flow - A flow is a logical communication channel for an application, defined by a source, a destination, and the bandwidth required.
Route - This is a specific path taken by a flow, from a source PE to a destination.
Globally Asynchronous Locally Synchronous - This refers to circuits that consist of locally synchronous modules which communicate through asynchronous wrappers.
Dimension Order Routing - The routing strategy where flits are routed first in a vertical and then in a horizontal direction, or vice-versa. This class of routing algorithms ensures freedom from deadlocks.
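The definition of Dimension Order Routing above can be illustrated with a minimal sketch. It assumes a 3-D mesh whose nodes are addressed by integer (x, y, z) coordinates. The function name `dor_route` and the `order` parameter are hypothetical and are not taken from any of the surveyed works. The sketch only shows the order in which dimensions are corrected; it does not model flits, buffers, or virtual channels.

```python
from typing import List, Tuple

Coord = Tuple[int, int, int]  # assumed (x, y, z) addressing of a mesh node


def dor_route(src: Coord, dst: Coord,
              order: Tuple[int, int, int] = (0, 1, 2)) -> List[Coord]:
    """Return the nodes visited from src to dst on a 3-D mesh.

    One dimension is fully corrected before the next one is touched;
    `order` lists the dimension indices (0 = x, 1 = y, 2 = z) in the
    sequence in which they are corrected.
    """
    path = [src]
    cur = list(src)
    for dim in order:
        # Move one hop at a time along this dimension until it matches.
        step = 1 if dst[dim] > cur[dim] else -1
        while cur[dim] != dst[dim]:
            cur[dim] += step
            path.append((cur[0], cur[1], cur[2]))
    return path


if __name__ == "__main__":
    # XYZ order: correct x, then y, then z.
    print(dor_route((0, 0, 0), (2, 1, 1)))
    # ZYX order: take the vertical hop first.
    print(dor_route((0, 0, 0), (2, 1, 1), order=(2, 1, 0)))
```

Both orders use the same number of hops, equal to the Manhattan distance between source and destination; only the sequence of intermediate nodes differs.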
4. Classification of Research of Network-on-chips for 3-D ICs
In this section, we classify the various research directions of NoCs for 3-D ICs by segregating them into sub-domains, and we discuss research work based on these sub-domains. The taxonomy of research in NoCs for 3-D ICs is shown in Figure 2. We shall delve into each of the domains described in Figure 2 one by one in the following sub-sections.
Figure 2: Classification of design issues. (Boxes in the figure: NoCs on 3-D ICs; Architectural advancements; Topological advancements; Vertical interconnects; Router designs; Communication media; Fault tolerance; Design & synthesis; Migration from 2-D to 3-D NoC; Implementation; Mapping & placement; Floor planning.)
4.1 Architectural Advancements
From an architectural standpoint, there have been advancements in various directions. The following are the aspects of architecture which have witnessed advancements: (1) topological advancements; (2) communication media; (3) vertical interconnects; and (4) router micro-architectures. First, we describe a few works that address more than one aspect of architectural advancement, before we describe other works that are directed specifically at one sub-domain.
Chien et al. [34] proposed a scheme to mitigate the hotspot effect in a many-core system partitioned into many core groups (CGs). The authors focus on a 1024-node mesh. A GALS-based Digital Signal Processor (DSP) design is adopted with each DSP core constituting a tile. Each core, in turn, is composed of an on-chip oscillator with its own clock and a switch with associated buffers. The design has three layers of die stack with the many-core NoC layer sandwiched in the middle. The cores are arranged in a 32×32 fashion, divided into an 8×8 arrangement of CGs. Each of these CGs contains 16 cores arranged in a 4×4 fashion. Two kinds of thermal ridges are used to separate CGs depending on their location. A low-density thermal ridge is placed where routing logic dominates most of the silicon area. A high-density thermal ridge lies in the intersectional area, which has no wires passing through. Thermal ridges are introduced with a constraint of occupying at most 20% extra area and are crucial for heat dissipation. The thermal profiling is done with the help of the temperature distribution. If the distribution is asymmetric, the cores in a CG, or the CGs themselves, are rotated until the temperature distribution is uniform. The authors reduce the chip temperature by trading off at most 20% extra area.
Ramanujam and Lin [35] presented a novel Layer-Multiplexed (LM) architecture for 3-D NoCs that combines the optimality of an oblivious routing algorithm called Randomized Partially Minimal routing with the short inter-layer wiring delays enabled by 3-D technology. The authors replace the one-layer-per-hop routing in the vertical dimension with simpler vertical demultiplexing and multiplexing structures. Since the power of crossbars does not scale well, the routers remain 5×5 in size. Each processor does not connect directly to a router; instead, it connects to a packet injection stage at the same (X, Y) location, which in turn connects to the 5-port router at the same location.
The packet injection stage is like a 4-port switch with a typical router pipeline, with route selection modified to implement load-balancing so that traffic is well divided across all layers. This packet injection stage, comprising a demultiplexer, is spread over all the k layers with the control logic located in the middle layer. Once the packets enter a layer, the routers select XY or YX routing with equal probability to reach the (X, Y) location of the destination. The packets then reach the packet ejection multiplexers. Each horizontal-plane router sees at its ejection port four virtual channels, each of which corresponds to the ejection queue of a processor connected to a router. The flits from different layers queue up in their respective virtual channels, which are the inputs to the packet ejection multiplexers. The flits are then multiplexed, and one channel at a time is allowed access for its flits to reach the processor. The authors compare this scheme with standard 3-D symmetric meshes and an asymmetric 8×8×4 topology. This architecture consumes 27% less power than the 3-D meshes. The LM architecture achieves a worst-case hop count reduction of 33% for the symmetric topology and 20% for the asymmetric topology.
Kostas et al. [36] showed that the existing way of designing homogeneous NoCs is not efficient. The authors performed a study with four applications and concluded that routers are not utilized uniformly, and that only a small percentage of them need to serve high-packet streams. So, rather than supplying the whole architecture with a high Vdd, it is possible to identify parts of the chip where a low Vdd will suffice. The authors introduced a 3-D NoC architecture with multiple supply voltages. Furthermore, they proposed a high-level mapping algorithm for supporting application mapping onto such NoCs. They proposed an architecture with two layers where the top layer is powered with a high supply voltage and the lower layer with a low supply voltage. They show that such an arrangement leads to a better thermal profile.
Initially, the application is profiled in order to determine how the target NoC platform has to be tuned. Next, the application's functionalities are clustered into groups based on communication demand. Finally, these derived groups are partitioned and mapped to regions with different supply voltages. This scheme achieves 15% energy savings compared with 3-D NoCs with a single supply voltage.
4.1.1 Topological Advancements
This sub-section covers a few works dedicated to topological advancements, which demonstrate how certain topologies are a good fit for the 3-D domain. Feero and Pande [8] discussed performance in terms of throughput, latency, energy dissipation, and wiring area overhead compared with traditional 2-D implementations. The throughput of 3-D meshes and stacked 3-D
meshes is shown to be much higher in comparison with 2-D meshes. The results also showed the link traversals dropping by 30% due to decreased hop count and greater interconnectivity. Intuitively, latency also drops because of the reduced hop count. The 3-D topologies occupy more silicon area because of more ports per switch. For a ciliated mesh, since the number of switches is half that of a normal 3-D topology, the area overhead is accordingly smaller. Pavlidis and Friedman [6] studied the possible topologies for 3-D NoCs and presented analytical models for the zero-load latency and the power consumption, under delay constraints, of these networks; the models capture the effects of the topology on the performance of 3-D NoCs. Weldezion et al. [37] evaluated the throughput and latency of three architectures, namely, a 2-D mesh, a 3-D mesh with switch connectivity between layers, and a 3-D mesh with bus connectivity between layers. The objective of this study was to derive key design guidelines. The routing strategy is based on non-minimal and load-dependent deflection-type packet switching, with adaptive per-hop routing. A hot-potato protocol implementation is deployed with the switch architecture described in [38]. All the switches are bufferless. The 2-D mesh has baseline routers with 5 ports, and the 3-D mesh router is an extension of the baseline 2-D router. The 3-D mesh with bus connectivity between layers has a centrally arbitrated Time Division Multiple Access (TDMA) bus along each vertical line of routers. The bus protocol adopts a least-served first-priority scheme. This bus operates at 16 times the network frequency and uses 10-packet-deep First-In-First-Out queues (FIFOs) to account for the serialization and the non-serialization delay. Each packet is one flit long. In comparison with 2-D meshes, the 3-D meshes remain stable for a wider range of network sizes and injection rates. 
The architectures are evaluated over two standard traffic patterns: uniform random and local. Results showed that the 3-D mesh with switch connectivity between layers is the best performer in terms of throughput and normalized latency as the number of nodes is scaled. The TDMA bus performed the worst among the three, succumbing under local traffic. With a lot of communication on the bus, the traffic and, hence, the contention increase, resulting in diminished throughput. The same authors [39] evaluated the performance of processor-memory architectures formed within a 3-D structure, with an NoC architecture as the communication backbone for a mobile platform. The authors suggested that the on-chip network latency is hardly dependent on the number of processors in the setup but is predominantly determined by the flash technology implemented in the memory blocks. This work highlights
how processor-memory architectures in 3-D enable massive memory capacities in a small footprint with high performance and scalability, suitably catering to next-generation mobile applications. The NoC model is a 3-D NoC VHDL model developed based on the Nostrum Architecture [40]. The model has 16 layers with two kinds of blocks: processors and memories. Each layer is a 4×4 array of nodes. Different layouts were considered by the authors, namely, Dance Hall, Terminal, Per-layer, Sandwich, and Mixed, as shown in Figure 3. The layouts differ in the positions of the processors and the memory blocks. Modules with heavy communication are placed closer to each other and modules with light loads are placed farther away, following the locality principle. These architectures are compared against several parameters such as cache coherency, power consumption (crucial for embedded/mobile platforms), off-chip input/output access (pins on the top/bottom layer to connect to other chips), thermal stress and management, manufacturing cost, throughput, performance in terms of latency (normalized and actual), latency in hops, and hop-count ratio.
Matsutani et al. [41] used tree networks in the 3-D paradigm and showed gains in comparison with the 2-D counterparts. A tree network, as shown in Figure 4, has a specified out-degree. Each router has a fan-out and the tree expands from the center to the edges of the chip. They show how tree networks, when implemented in a 3-D IC, can help reduce the wire lengths, thereby presenting an attractive solution to the wire-length problem of tree networks in 2-D NoCs. In 2-D tree layouts, top-rank links unavoidably become long near the root of the tree, whereas in a 3-D structure, the vertical dimension can be used for these long top-rank router links. 
The laying out of top-rank router links in the vertical dimension makes the links between rank-1 and rank-2 routers the longest (a rank-1 router is the one deepest in the tree, attached to the IPs). That is, compared with the original 2-D layout, the longest link length in the 3-D layout is reduced by half. The authors use a 4-split and a 2-split method to divide the original planar Fat Trees and Fat-H Trees into several parts and connect them using vertical links in order to reduce wire length, wire delay, etc. Apart from reducing the wire lengths by as much as 50% and hence reducing wire delay, the results also assert improvements in flit transmission energy and area overhead.
Figure 3: 3-D memory architectures [39] (a) dance hall; (b) sandwich; (c) per-layer; (d) terminal; (e) mixed. (Legend: Memory, Processor.)
Figure 4: A tree network [41].
The same authors [42] proposed a class of 3-D topologies called Xbar-connected Network-on-Tiers, which consist of multiple network layers tightly connected via crossbar switches. The authors exploit the short delay and the high density of inter-tier links, and proposed crossbar switches that connect different tiers and their cores. The routers are of two kinds in this work: tier routers, which connect cores in the same router and have inter-tier links in the third dimension, and pillar routers, which connect all inter-tier links and cores in a pillar via a crossbar switch. A pillar is defined to be a set of inter-tier links placed on all cores with the coordinates (x,y,z) where 0 <= z <= n, and n is the number of tiers. Deadlock-free routing is ensured by imposing a deadlock-free algorithm on a tier as well as between tiers. Between tiers, it is ensured that the flits travel from a higher-numbered tier to a lower-numbered tier. The proposed topologies are evaluated in terms of ideal throughput, average hop count, simulated throughput, component count, network logic area, and energy consumption.
Amir-Mohammed et al. [43] consolidated all the topological advancements in their survey and conducted a detailed analysis using different design metrics.
Zia et al. [44] proposed an easily scalable, high-performance, and low-power Clos [45] NoC (CNOC) for many-core CMPs. Any high-radix network, including CNOC, will have long wires that will in turn lead to increased area and power consumption. In order to make it ultra-scalable, they use 3-D integration to make the wires shorter. The design presented is a 512-node 5-stage 3-D CNOC composed of radix-8 routers. The topology has 8 blocks of 64 nodes each. Independent blocks are interconnected with wires between stages 2 and 3 and between stages 3 and 4. The first-stage Switching Modules (SMs) are called Input Modules and the last-stage SMs are called Output Modules. The other, intermediary SMs are referred to as Center Modules. Each SM is implemented as an input-buffered crossbar switch. All layers have the same floor plan to enable mask re-use, thereby reducing the design time. SMs on the same tier are placed adjacent to each other and TSVs are used to connect SM stages 2 and 3. The SM stages 3 and 4 are used to connect different blocks. The total TSV count in this design came to 504. The TSVs are strategically placed in an area where there are few horizontal wires and no dense logic. The authors compare this CNOC with its planar counterpart and also with a 2-D and a 3-D mesh with the same 512-node configuration. Results showed that the proposed CNOC consumes 57% less power than the planar counterpart. CNOC also performs better on throughput in comparison with the mesh topologies. Yet another major benefit of the 3-D CNOC is the easy scalability to accommodate any number of additional cores in the 5-stage network. This scalability is attributed to 3-D integration.
Yiou et al. [46] proposed a de-Bruijn topology for 3-D NoCs, inspired by [47], where NoCs are constructed based on the de-Bruijn graph. The simple, reliable, high-throughput, and low-latency characteristics of de-Bruijn graphs make them a topology choice for NoCs, whether 2-D or 3-D. The authors construct a de-Bruijn graph structure in both the horizontal and the vertical directions, as shown in Figure 5. They take a 16-node horizontal-plane graph and divide it into four parts. Each part from every horizontal plane is connected together in the vertical plane in a de-Bruijn fashion. Nodes are represented using a k-bit sequence where k is the diameter of the graph. To route between nodes, an algorithm based on left/right shifting of the bit sequence representing the nodes is suggested. With this shifting, a neighbor can be reached. Both left and right shifting are used to calculate a path to the destination, and the shorter of the two paths is chosen for routing. DOR is adopted as the routing strategy. The authors use the Message Passing Interface Style NoC Simulator [48] to evaluate the performance of the scheme. They compared their topology with the Mesh-Pillar NoC [42] and a 3-D Mesh NoC using uniform and hot-spot traffic patterns. The results showed encouraging numbers for latency for a de-Bruijn network, but the meshes were superior with respect to power consumption.
Figure 5: A 3-D NoC topology based on De Bruijn graph [46].
4.1.2 Vertical Interconnects
Vertical links across layers of a 3-D chip are high-bandwidth links and much shorter than horizontal links. But these advantages come at a price. TSV processing, for example, is shown to dominate the cost of processing a 3-D wafer [49]. Also, TSV fabrication has low yield relative to standard 2-D processes. This subsection describes a few works which have paid attention to this trade-off.
In a 3-D NoC, the number of vertical connections has an impact on both cost and performance. Xu [50] studied the trade-off between the lower execution time obtained with many TSVs and the easier design implementation obtained with very few TSVs. They use an 80-node network which models a single-chip CMP. The 3-D model has 5 layers, with one layer having processors and the other layers having shared cache memories. The routers which connect layers have two extra ports and the corresponding virtual channels, buffers, and crossbars to connect to the pillars which go up and down. The setup uses a DOR algorithm to avoid deadlocks. They used a cycle-accurate 3-D NoC simulator to run the experiments. The metric used to evaluate the setup is the Performance Power Product.
Liu et al. [51] proposed a TSV squeezing mechanism based primarily on the assumption that, in a symmetric mesh 3-D NoC, TSV utilization is low and adjacent routers rarely transmit packets via their TSVs at the same time. So, the authors propose a mechanism where neighboring routers share the TSVs in a time-division multiplexed mode. The authors run experiments on a 4×4×2 3-D mesh to support their assumption, and show that even at a high load of 0.3 the TSVs are still idle 80% of the time. The routers used in this design have a TSV arbiter which decides to which input port the TSV bundle is granted. The routers are grouped into 2×2 configurations (and some corner-case configurations) and the TSV bundle is shared among the routers in a configuration. The path length increases slightly, by two extra horizontal links: from the router to the TSV sharing pad (located in the center of the sharing nodes) in one layer, and from the sharing pad to the router in a different layer. Although each router in this design incorporates an extra TSV arbiter, TSV bundle sharing brings the TSV planar footprint down by 60%, and the authors also achieved throughput improvements of as much as 30% without a loss in network latency.
Amir-Mohammed et al. [52] proposed self-reconfigurable bidirectional bisynchronous vertical TSVs that can be used for either outgoing or incoming transfers, in place of a pair of unidirectional TSVs, in an effort to reduce the area footprint of the TSVs, increase TSV utilization, and improve routability. The authors exploit the high-speed inter-layer channels by using mixed-clock FIFOs. The inter-layer channels in this study were clocked at twice the frequency of normal intra-layer links. For the vertical output port, the normal FIFO is replaced by a bi-synchronous FIFO. The routers in this design additionally have a control module to decide the direction of transfer (incoming or outgoing) at every UP and DOWN port of a router. These bidirectional links are managed by finite state machines. To change the direction of transfer, a token-passing technique is used. When the router is receiving data, it keeps open the data out port to listen and vice-versa. Also, credit-based flow control is used to inform neighboring routers whether the router is ready to receive data through the vertical channel. The authors show a 47% gain in the TSV area footprint and also an increase in the bandwidth utilization of the TSVs in the topology.
Loi et al. [53] introduced a circuit-level model for TSV-based interconnects based on accurate 3-D parasitic extraction, and describe a 3-D design flow that allows post-layout verification of the 3-D stack. They additionally focus on reliability enhancement techniques for 3-D NoCs based on fault tolerance and post-silicon calibration. The starting point for this work is reported in [33] and [54], where a thorough physical and timing analysis of the vertical links was conducted on a 3-D NoC. To model the vertical vias, the authors base their integration effort on the xpipes NoC library [55]. 
TSVs are accurately inserted into the design based on these models, which are used to build the Library Exchange Format and Liberty descriptions of vertical pads and vias. For reliability enhancement, the authors focus on countermeasures for random defects, namely, stuck-at defects and stuck-open defects, which happen due to various physical phenomena. For these defects, they use the simplest remedy of via duplication. One of the vias is configured for transfer of flits during setup. The setup is stored in a set of configuration registers. The TSVs are inserted in ring oscillators to measure their parasitics and variability. The data collected using this setup can be used for technology tuning and post-silicon calibration. They taped out a 16-bit 2-tier 3-D NoC using the xpipes [55] tool chain with extensions for vertical links as mentioned before. Each tier consists of a traffic generator, a JTAG controller, a 3×3 switch, a slave memory, and the above-mentioned test structures for TSV characterization and tuning of the 3-D technology.
Jueping et al. [56] proposed an accurate energy consumption model of a 3-D TSV for the power estimation of a 3-D NoC. The authors also analyzed the capacitance model of an isolated TSV in detail. The model is written in such a way that it can be integrated with simulators written in C++ or Register Transfer Level code. The power equations for TSV links, which differ from those of planar links, are formulated. The dynamic power of the TSV is formulated based on the capacitance of the TSV, the supply voltage, and the activity factor of the link. Also, the total capacitance of the TSV is formulated as the series combination of the oxide (dielectric) capacitance and the depletion capacitance. To evaluate the capacitance modeling, the TSV structure is simulated by Silvaco DevEdit3d [57] (DevEdit creates standard structures which can be easily integrated into 2-D and 3-D simulators and other support tools; refer to Appendix A), and the simulation results are compared with TSV high-frequency C-V measurements in [58].
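The relationship described for the TSV model of [56] can be illustrated with a short numerical sketch. It assumes the common textbook switching-power form (activity factor times capacitance times supply voltage squared times clock frequency); the clock frequency term, the function names, and all numeric values are assumptions for illustration and are not taken from [56].

```python
def series_capacitance(c_oxide: float, c_depletion: float) -> float:
    """Series combination of the oxide and depletion capacitances (farads)."""
    return (c_oxide * c_depletion) / (c_oxide + c_depletion)


def tsv_dynamic_power(c_tsv: float, vdd: float, activity: float, freq_hz: float) -> float:
    """Assumed textbook switching-power form: activity * C * Vdd^2 * f (watts)."""
    return activity * c_tsv * vdd ** 2 * freq_hz


# Hypothetical values chosen only to make the arithmetic easy to follow.
c_tsv = series_capacitance(c_oxide=60e-15, c_depletion=30e-15)  # about 20 fF
power = tsv_dynamic_power(c_tsv, vdd=1.0, activity=0.5, freq_hz=1e9)  # about 10 uW
```

Because the two capacitances are in series, the total is always smaller than the smaller of the two, which is why the depletion region lowers the effective TSV capacitance in this sketch.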
4.1.3 Communication Media
The growing communication bandwidth requirements (both off-chip and on-chip), the strict on-chip temperature limits, and the power budget requirements together make interconnect power consumption a critical problem for multi-core SoC design. Whether electronic NoCs can continue satisfying these constraints in the foreseeable future remains an open research problem [59]. This subsection overviews research that considers various communication media for on-chip data transfer.
Photonic NoCs have been explored as an alternative for reducing the impact of communication on the overall power budget of the interconnect. In [60], the authors studied the performance improvement obtained by using a CMOS-compatible photonic on-chip bus for future CMPs. The authors of [61] illustrated the gains obtained by photonic NoCs against electronic NoCs. In [62], the authors presented a 3-D many-core architecture that used photonic communication for both inter-core communication and off-stack communication to memory or I/O devices. Photonic NoCs need new design environments and tools from a Computer-Aided Design (CAD) standpoint, but they are impactful because they provide ultra-high throughputs, minimal access latencies, and, fundamentally, dissipate less power. NoCs in other emerging paradigms, such as those with wireless interconnects, are discussed in [63].
Ye et al. [64] proposed a 3-D electronic-controlled optical NoC for Multi-Processor SoCs using Cygnus optical routers [65]. The authors implemented this in a TSV-based two-layer 3-D chip, as shown in Figure 6. The upper layer is an optical layer which provides high-speed optical payload transmission. The lower layer is the electronic layer responsible for control packet transmission. The routers consist of an optical switching fabric in the top layer and electronic control units in the bottom layer, stacked together by TSVs. The optical switching fabric of every router consists of a switching function which deviates light with the help of micro-resonators. The electronic control unit uses control signals to configure the optical switching fabric by powering the micro-resonators on or off. The authors use deterministic XY routing for the transfer of control packets. For the payload transmission, circuit switching is adopted, where the optical path is fixed and computed beforehand. A single transfer involves optical path setup, payload transmission without any buffering, and optical path teardown. A comparison was drawn between an 8×8 mesh-based 3-D optical NoC and an 8×8 mesh 2-D NoC. The optical NoC achieves a 70% saving in power consumption per packet for 2048-byte packets. The optical NoC also showed reductions in end-to-end delay.
Figure 6: The 2-layer optical NoC [64]. Labeled elements: optical layer, electronic layer, optical switching fabric, optical interconnect, electronic control unit, functional core, metallic interconnect.
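The three phases of a circuit-switched transfer described for the optical NoC of [64] (path setup along a deterministic XY route, bufferless payload transmission, and teardown) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the class name OpticalLayer, the choice to reserve whole routers rather than individual links, and the policy of rejecting a conflicting setup are all assumptions.

```python
def xy_route(src, dst):
    """Deterministic XY routing on a 2-D mesh: move along X first, then along Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


class OpticalLayer:
    """Hypothetical bookkeeping of which routers have their optical switch configured."""

    def __init__(self):
        self.reserved = set()

    def setup(self, src, dst):
        """Phase 1: reserve every router on the XY path, or fail on a conflict."""
        path = xy_route(src, dst)
        if self.reserved.intersection(path):
            return None  # assumed policy: the sender retries later
        self.reserved.update(path)
        return path

    def teardown(self, path):
        """Phase 3: release the routers once the payload has been sent."""
        self.reserved.difference_update(path)
```

Phase 2, the payload transmission, needs no code in this model because the payload crosses the reserved path without buffering at intermediate routers.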
4.1.4 Router Microarchitectures
Routers of the NoC fabric are crucial for the performance of the interconnect. Extending a traditional 2-D baseline router to the third dimension with extra ports in the vertical dimension might not be a good option, because the arbitration complexity would be too high due to the ample path diversity resulting from the additional interconnects in the third dimension. In this subsection, we overview a few router microarchitectures which explore the third dimension and tap the potential of short and fast vertical links.
Kim et al. [66] proposed a true 3-D crossbar, namely, the DimDe router. The dimensional decomposition is inspired by the RoCo router [67], where the incoming flits are segregated into different crossbars depending on the dimension of travel. DimDe extends RoCo to the third dimension by having another well-fused vertical module along with the row and column modules of RoCo, as shown in Figure 7. To arbitrate the vertical links within the crossbar, a two-stage arbitration mechanism is proposed. In the first stage, a local arbiter selects which flit gets to use the vertical link, and in the second stage, namely, the global level, a central arbiter decides which of the local winners finally uses the vertical link. The arbiter is designed to facilitate as many concurrent vertical transfers as possible, just like the row-column modules. The vertical links are segmented at the device layers with Connection Boxes, which help in vertical-to-horizontal traversal of flits. An early ejection mechanism is adopted and the XYZ routing algorithm is deployed. The extra connections between the vertical module and the row and column modules make them a 4×2 crossbar, which is still significantly smaller than the 7×7 configuration of a 3-D baseline router or the 6×6 configuration of a 3-D NoC Hybrid Bus. The propagation delays for vertical vias were modeled using HSpice [68].
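The two-stage arbitration described above for DimDe can be illustrated with a simplified sketch. The request format (layer, port, destination layer), the fixed lowest-port priority in the local stage, and the greedy layer-order grant in the global stage are assumptions made for illustration; [66] does not necessarily use these policies.

```python
def local_winners(requests):
    """Stage 1: in each layer, pick one request (assumed fixed priority: lowest port)."""
    winners = {}
    for layer, port, dest in sorted(requests):
        winners.setdefault(layer, (layer, port, dest))
    return list(winners.values())


def global_grant(winners):
    """Stage 2: grant local winners whose vertical segments do not overlap.

    Segment i is the piece of the vertical link between layer i and layer i + 1,
    so transfers that use disjoint segments can proceed concurrently.
    """
    busy = set()
    granted = []
    for layer, port, dest in winners:
        segments = set(range(min(layer, dest), max(layer, dest)))
        if segments and not (segments & busy):
            busy |= segments
            granted.append((layer, port, dest))
    return granted


# Hypothetical requests: (source layer, input port, destination layer).
requests = [(0, 1, 1), (0, 0, 2), (2, 0, 3), (3, 1, 2)]
granted = global_grant(local_winners(requests))
```

In this example the transfer from layer 0 to layer 2 and the transfer from layer 2 to layer 3 use disjoint segments and are both granted, while the request from layer 3 to layer 2 loses because its segment is already taken.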
The proposed design was compared with four other router architectures, namely, 2-D NoC, 3-D symmetric NoC, 3-D Hybrid NoC, and a full 3-D crossbar. DimDe was 27% better in performance in comparison with 3-D Symmetric and Hybrid NoCs and 4% better than a full 3-D crossbar. Though technologically aggressive, DimDe was superior to all the other designs in both design and performance.
The hierarchical router proposed by Lafi et al. [69] comprises two decoupled modules: one for intra-layer communication and another for inter-layer communication. The baseline 3-D router for meshes would have a 7×7 crossbar. However, in this hierarchical design, the 7×7 crossbar is replaced with a 5×5 horizontal module for intra-layer communication and a 4×4 vertical module for inter-layer communication, as shown in Figure 8. The PEs are connected to the vertical modules. Deterministic routing is deployed in such a way that flits whose target resource is located on another layer are directed first to the targeted layer before being routed horizontally to the destination. Although flits are buffered one extra time at both the sender and the receiver, the design shows that latency is still better, because of the intrinsic delay of the large 7×7 crossbar of a baseline router. The hierarchical router proposed in this work is based on the asynchronous NoC [70] proposed in the FAUST chip [71]. The areas of hierarchical routers were compared with those of the standard 7×7 3-D baseline routers and turned out to be almost the same. They use Transaction-Level Modeling in SystemC to develop the class of this asynchronous router and also to speed up the simulations. Traffic generators are used as PEs and a network of 256 nodes was evaluated. Uniform and transpose traffic were used for experimentation. The results show a latency gain of 20% for uniform traffic and 30% for transpose traffic. This explains the related gains in throughput for both these synthetic traffic patterns.
Since the crossbar delay increases quadratically with the number of inputs, the authors also assert that the gains will become more significant as the number of PEs increases.
Park et al. [72] proposed a 3-D stacked NoC router architecture called MIRA, where a baseline 2-D router (2DB), with
Figure 7: (a) Baseline conventional 2D NoC router overview; (b) the 2D Row-column (RoCo) decoupled router; (c) the proposed 3D DimDe router architecture [66].
no extra functionality, is divided into layers like the rest of the interconnect fabric with the objective of exploiting the benefits of 3-D technology. They explored three router designs: 3-D baseline router (3DB), 3-D multi-layered router (3DM), and 3-D multi-layered router with express channels (3DM-E). The input buffers, crossbar, and the inter-router links in a router are all separable modules and are split across layers, while the routing logic and the arbitration logic of the router are non-separable and are locked to (typically) the top layer due to its proximity to the heat sink. The IP blocks used are CPUs and cache banks. The CPUs are placed on the top layer, closest to the heat sink. The input buffer for a flit is split vertically, with the LSB of the flit at the top and the MSB of the flit in the bottom layer. The proposed technique also deals with short flits, in which only one of the three words is used while the others remain 0s. In the case of such short flits, clock-gating was used to selectively switch off the bottom layers in the input buffer storing the 0-words of the short flits. A large crossbar is divided into a set of smaller crossbars positioned in different layers. With five ports, the proposed router's crossbar is still four times smaller than that of a baseline router. Also, with the inter-router links being split across layers and the cores being true 3-D cores [73], the bandwidth for a node is now doubled, as shown in Figure 9. The virtual channel allocation logic is split into two stages: the first is positioned in a single layer and the second is split across layers. All these designs were implemented in Hardware Description Language (HDL) and each of these modules was synthesized using a 90 nm Taiwan Semiconductor Manufacturing Company (TSMC) standard cell library. In the third design, 3DM-E, the extra bandwidth available in the multi-layered design is used to support an additional physical port per direction.
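The short-flit handling described above for MIRA can be sketched in a few lines: the flit is split into one word per layer, and a layer's buffer slice stays active only if its word is non-zero. The word width of 32 bits, the three-layer split, and the rule that the top layer is always active are assumptions for illustration and are not taken from [72].

```python
def split_flit(flit: int, layers: int, word_bits: int):
    """Split a flit into per-layer words; index 0 is the top layer and holds the LSB word."""
    mask = (1 << word_bits) - 1
    return [(flit >> (i * word_bits)) & mask for i in range(layers)]


def active_layers(flit: int, layers: int = 3, word_bits: int = 32):
    """Return one flag per layer: True if that layer's buffer slice must be clocked.

    The top layer is always clocked (assumed); lower layers are gated when
    the word they would store is all zeros.
    """
    words = split_flit(flit, layers, word_bits)
    return [i == 0 or word != 0 for i, word in enumerate(words)]


short_flit = 0xAB            # only the LSB word carries data
long_flit = (1 << 64) | 0x5  # the top and the bottom words carry data
```

For the short flit only the top layer is clocked, so the two lower buffer slices can be gated for that cycle.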
So, the extra physical port at each router is used to support multi-hop express channels to expedite flit transfer. The router's radix increases to 9 physical ports, though the total area is only 0.7 times that of a 2-D baseline router because the per-layer size still remains small. The authors use a cycle-accurate simulator for the performance analysis of all the designs and HotSpot for the thermal analysis [74]. In the latency analysis, 3DM-E turns out to be the best design because of the reduced hop-count in comparison with the traditional 2-D and 3-D designs. The 3DM-E also enables higher injection rates. The power consumption of 3DM and 3DM-E also turns out to be lower in comparison with 2DB and 3DB, primarily because of the clock-gating detection circuits which switch off bottom layers for short flits. This evaluation was conducted both on traces and on synthetic traffic.
4.2 Fault Tolerance
As technology scales, fault tolerance is becoming a key concern in on-chip communication. Currently available processes for TSV fabrication have low yields relative to
Figure 8: The blueprint of a hierarchical router: (a) the classic monolithic 7x7 router (Up, Down, North, South, East, West, Resource); (b) the proposed hierarchical router [69], with a 5x5 router for intra-layer communication (North, South, East, West) and a 4x4 router for inter-layer communication (Up, Down, Resource).
Figure 9: Inter-router link distribution [72]: (a) 3DB bandwidth usage; (b) 3DM bandwidth usage.
standard 2-D processes, thereby impacting the feasibility of high-bandwidth vertical connectivity. This makes it a key challenge for 3-D circuits and 3-D NoCs. This section describes efforts pertinent to fault-tolerant architectures for 3-D NoCs.
Loi et al. [75] described the design of a defect-tolerant TSV-based multi-bit vertical link which enables significant yield improvement with respect to random open defects at an extremely low cost. The switches used were extended from the xpipes NoC library [76] by adding a couple of vertical ports. To increase wafer yield and fault tolerance, the authors resort to hardware redundancy deployed at design time with some amount of post-manufacturing configuration. In the proposed dynamic routing scheme, all the pads are driven by a small 2×1 crossbar and each signal can be routed to two different
TSVs. The number of extra pads is varied across a user-defined range. During a fault, the signals are shifted to the extra pad in their path. Furthermore, the displaced connections are shifted over to other adjacent pads until all connections are across safe electrical structures. The proposed technique combines the design of routers with testing resources such as scan-chains, which aid in testing for any failed TSV, isolating it, and reconfiguring the link to restore normal functionality. The scheme proves capable of yields up to 98% with a minimum silicon cost of just 17% per TSV link in 130 nm technology.
4.3 Design and Synthesis
Design and synthesis of 3-D NoCs introduce several new challenges, such as the number of TSVs that are allowed between any two layers (which is strongly dependent on the underlying 3-D fabrication technology), the placement of TSV macros for vertical link connections, and the assignment of switches to layers. We describe a few works which have addressed many aspects of the design flow of 3-D NoCs before describing works pertaining to (1) mapping and placement and (2) floorplanning in subsequent subsections.
Seiculescu et al. [77] proposed a design flow for building application-specific NoCs for both 2-D and 3-D SoCs with multiple voltage frequency islands (VFIs). The algorithm takes as inputs the application description (number of cores, their sizes, position, communication description, and VFI assignment), the optimization objective (either power or latency), and library of the area with an optional input floor plan. A set of design points with different trade-offs between power, area, and latency can be explored by varying parameters such as the number of switches, the width of links, etc. The algorithm initially finds the minimum operating frequency in each VFI before calculating the number of switches per VFI.
Next, cores are assigned to switches in such a way that cores with high bandwidth requirements or tight latency constraints are connected to the same switch. Finally, the routing between cores is done to produce a minimally disturbed floor plan. The synthesis achieved up to a 30% power reduction for 3 VFIs in 3-D in comparison with 2-D.
Yan and Lin [78] proposed a 3-D NoC synthesis algorithm called Ripup-reroute and router merging that is based on a rip-up and reroute formulation for routing flows and a router-merging procedure for network optimization. They show reductions in hop-count and power consumption over both normal and optimized 3-D mesh implementations. The choice of a rip-up and reroute mechanism, as put by the authors, is to have a provision in the heuristics to iteratively identify increasingly better solutions. The output of the architecture
is a topology with pre-determined routes such that all data requirements are satisfied. The rip-up takes place flow-wise. The resources are first de-allocated and, for this flow, the shortest path is selected. Then, the network links and router ports are inserted to implement the routing, and the network topology is derived by repeating this for all flows. The rip-up and rerouting of flows serves to refine and optimize the network topology.
Chih-Hao et al. [29] proposed a run-time management (RTM) scheme to ensure thermal safety while minimally affecting performance. The scheme comprises traffic-aware downward routing and thermal-aware vertical throttling. Given the thermal limit, traffic distribution, network topology, router architecture, power, and the thermal model, the design goal is to find a framework and a policy for RTM such that the achievable throughput is optimized with the constraint that the temperature never overshoots the thermal limit and the network has maximum availability. They extend their framework for 2-D NoCs [79] to 3-D. This design uses the pillar routers which were proposed as a part of [42]. The two goals of the proposed traffic-aware downward routing are proactively migrating power from top to bottom and adaptively adjusting the amount of migration to prevent network saturation. In minimal path routing, by changing XYZ routing to ZXY routing, they make power migrate with negligible migration overhead. In downward routing, which is a non-minimal path routing algorithm, the layer in which the horizontal routing path lies is shifted down toward the heat sink. In each vertical pillar, the downward level is determined by the network status. Each layer has a counter which is used for traffic load estimation and distribution. The routing behavior is in the order of vertical-horizontal-vertical, as shown in Figure 10.
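The vertical-horizontal-vertical order of the downward routing of [29] can be sketched as a path computation. This is a simplified illustration under stated assumptions: coordinates are (x, y, layer), a larger layer index is taken to be closer to the heat sink, and the downward level is passed in as a plain parameter rather than derived from per-layer traffic counters as in [29].

```python
def _walk(path, axis, target):
    """Append unit hops along one axis until the coordinate reaches target."""
    pos = list(path[-1])
    while pos[axis] != target:
        pos[axis] += 1 if target > pos[axis] else -1
        path.append(tuple(pos))


def downward_route(src, dst, down_level, n_layers):
    """Route vertically to a lower layer, then in X and Y, then vertically to dst."""
    path = [tuple(src)]
    routing_layer = min(src[2] + down_level, n_layers - 1)
    _walk(path, 2, routing_layer)  # vertical: descend toward the heat sink
    _walk(path, 0, dst[0])         # horizontal: X
    _walk(path, 1, dst[1])         # horizontal: Y
    _walk(path, 2, dst[2])         # vertical: reach the destination layer
    return path


minimal = downward_route((0, 0, 0), (1, 1, 0), down_level=0, n_layers=4)
shifted = downward_route((0, 0, 0), (1, 1, 0), down_level=1, n_layers=4)
```

With a downward level of 0 the path is minimal (two hops here); with a downward level of 1 the horizontal hops happen one layer lower at the cost of two extra vertical hops, which is the non-minimal behavior discussed in the text.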
Non-minimal path routing increases the zero-load latency and power consumption, but because of the short vertical distances between layers, the vertical latency is only one cycle [42] and the vertical driving power is also low. Also, by thermal-aware vertical throttling,
Figure 10: Traffic-aware downward routing [29].
flow across a router is restricted when the router is throttled, a decision based on the temperature threshold. This design has three levels of throttling: full, half, and no throttling. The level of throttling increases while the temperature threshold remains violated; otherwise, the level is lowered. The authors ensure a deadlock-free routing algorithm. They compare their approach with distributed traffic throttling, and show an 82.7% gain in the average throttling time. In addition, the throttling time of the proposed thermal-aware throttling is reduced by 69.6%. Furthermore, an improvement in the network throttling ratio is seen.
Wang et al. [80] proposed a thermal management method for NoCs via task scheduling at the OS level, considering both the temperature constraint and the memory access delay. The proposed temperature management scheme is compared, under the same temperature constraint, with two existing approaches: coldest scheduling, where the scheduler tries to find the coldest processor and exchanges the thread with the processor that reached the temperature threshold, and random scheduling, where the scheduler selects a cool processor randomly rather than the coldest one to balance the performance constraint. The authors use HotSpot [74,81] to obtain temperature values of each IP and router by inputting power and floor plan information. The prime idea of the algorithm is to minimize performance degradation. The design includes a temperature controller to take these temperature management actions in the simulation. The algorithm is as follows: when a processor exceeds the temperature threshold, the controller computes R_hot + R_cool (where R represents the ratio of total scheduling expense time to total execution cycles of a thread) for threads assigned to cool processors. The thread with the minimal R_cool + R_hot is selected for scheduling. The temperature distribution was compared and the proposed data-aware algorithm was shown to be least susceptible to hotspots.
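The selection rule described for [80] reduces to picking the candidate with the smallest sum of the two ratios. The sketch below shows only that rule; the tuple layout, the function names, and the reading of R_hot and R_cool as per-candidate values supplied by the caller are assumptions, and the numbers are hypothetical.

```python
def expense_ratio(scheduling_cycles: float, execution_cycles: float) -> float:
    """R: scheduling expense time divided by the total execution cycles of a thread."""
    return scheduling_cycles / execution_cycles


def pick_thread(candidates):
    """Return the name of the candidate thread with the minimal R_hot + R_cool.

    candidates: iterable of (thread_name, r_hot, r_cool) for threads that are
    currently assigned to cool processors.
    """
    return min(candidates, key=lambda c: c[1] + c[2])[0]


# Hypothetical ratios for three threads on cool processors.
candidates = [("t1", 0.10, 0.05), ("t2", 0.02, 0.04), ("t3", 0.03, 0.20)]
chosen = pick_thread(candidates)
```

Here t2 is chosen because its combined ratio is the lowest of the three, so exchanging it is expected to cost the least relative to useful execution time.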
Random scheduling had the lowest thread scheduling frequency and the worst temperature distribution. Coldest scheduling was unbalanced between cool threads but had a better temperature distribution than random scheduling. The proposed data-aware algorithm caused the most frequent scheduling and had the flattest temperature distribution.
4.3.1 Mapping and Placement
Interconnect and cell placement problems are known to be NP-complete. A wide repertoire of heuristic algorithms exists in the literature for mapping and placement of cells. The increasing complexity of placement in 3-D ICs has driven research in this direction. This subsection overviews works that have addressed the problem of mapping and placement of NoCs to 3-D ICs.
Arjomand and Sarbazi-Azad [82] proposed a design methodology to efficiently assign tasks of an application to IPs in a regular 3-D NoC at the system level, and voltage-frequency planning at the circuit level. They augment their discrete event simulator with Orion (refer to Appendix A) to calculate the power and silicon area of the network components [83]. An application is represented by an Application Characteristic Graph (ACG). Each vertex in the ACG refers to a set of tasks which should run on a specific IP core, and the aggregate communication bandwidth is the weight of the edge in the graph. The mapping algorithm is as follows: a tag for every task is computed, taking its computational and communication requirements into account. The tasks are then arranged in decreasing order of the tag. The task at the front of the queue is picked and assigned to a layer-1 (bottommost) IP. The tasks adjacent to this task have their tag incremented by a constant at this juncture, before the queue is reordered. If the next extracted task communicates with previously mapped tasks, then it should be assigned to a tile with the least Manhattan distance (vertically close if the task is communication centric, and on the same layer or lower if computation centric) to those clusters. This results in a mapping of computationally intensive tasks closer to the heat sink, with heavily communicating clusters predominantly using vertical vias. Voltage blocks are specified by a 3-tuple: supply voltage, threshold voltage, and frequency. Every link that connects two voltage blocks is equipped with a frequency wrapper or a voltage-level shifter which converts from one voltage to another as per the specifications of the data flow. The authors use asynchronous queues from [84] and equations for 3-tuple representations from [85]. These synchronizers impose a latency and power overhead, though enhancing bandwidth.
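The queue-driven part of the mapping procedure of [82] can be sketched as follows. This is a deliberately simplified illustration: the tag values, the bump constant, and the tile list are hypothetical, the distinction between communication-centric and computation-centric tasks is omitted, and tiles are given as (x, y, z) with z = 0 standing for the bottommost layer.

```python
def manhattan(a, b):
    """Manhattan distance between two (x, y, z) tiles."""
    return sum(abs(p - q) for p, q in zip(a, b))


def map_tasks(tags, edges, tiles, bump=2.0):
    """Repeatedly map the task with the highest tag, then bump its neighbors' tags."""
    tags = dict(tags)
    free = list(tiles)
    placed = {}
    while tags:
        task = max(tags, key=tags.get)  # front of the queue (highest tag)
        del tags[task]
        anchors = [placed[n] for n in edges.get(task, ()) if n in placed]
        if anchors:
            # Closest free tile to the already mapped communication partners.
            tile = min(free, key=lambda t: sum(manhattan(t, a) for a in anchors))
        else:
            tile = min(free, key=lambda t: t[2])  # lowest layer first
        free.remove(tile)
        placed[task] = tile
        for n in edges.get(task, ()):
            if n in tags:
                tags[n] += bump  # "reordering" happens at the next max() call
    return placed


tiles = [(0, 0, 0), (1, 0, 0), (0, 0, 1), (1, 0, 1)]
tags = {"a": 5.0, "b": 3.0, "c": 4.0, "d": 1.0}
edges = {"a": {"b"}, "b": {"a", "d"}, "c": set(), "d": {"b"}}
mapping = map_tasks(tags, edges, tiles)
```

In this example task b overtakes task c after a is mapped, because its tag is bumped from 3 to 5, and it lands next to a.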
To address the power overhead of communications due to these voltage-level shifters, the work is extended to merge different voltage blocks using a heuristic algorithm. The heuristics consider power and thermal characteristics for different pairs of voltage blocks before merging, such that the requirements of all IPs are satisfied. This approach does mapping and placement and further optimizes temperature and power consumption. The drawback of this approach is its long running time.
Christianto et al. [86] proposed an approach to SoC design addressing the issues of heat dissipation and manufacturing cost. In this paper, the authors present a design methodology to analyze how 2-D SoC designs, with any underlying interconnect, can be effectively mapped to 3-D ICs, trading off performance and cost subject to a temperature constraint. They conduct two case studies, namely, the multi-window display application (MWD) and the graphics processor design. The authors propose three design regimes with one concentrating on manufacturing cost, one on the delay, and one with
equal emphasis on both. In all three design schemes, temperature was brought down by 20 degrees when the application was mapped to 3-D, and there were gains in delay reduction as well. In the case of the graphics processor with no temperature constraint, the delay was reduced by a significant 43%, though this setting is not practical. Even with temperature constraints, there was a 29% reduction in delay, which is significant in comparison with MWD. By mapping to 3-D designs, the short vertical links can be better leveraged to bring down delay. The increase of 40% to 50% in cost was, however, unavoidable, primarily because of the wafer and the bonding yields and the increase in the total silicon area.
Seiculescu et al. [31] proposed a tool called SunFloor 3D, which synthesizes a custom NoC topology for 3-D SoCs, finds paths for communication flows, and does the switch-to-layer assignment. The inputs to the setup are the communication characteristics, the TSV constraints across adjacent layers, and (optionally) the floor plan of the SoC. The authors try to achieve a minimally perturbed floor plan from the input and also minimize wire-lengths. The synthesis procedure consists of the following steps: establishing core-to-switch connectivity, obtaining deadlock-free paths for traffic flows, and finally, placement of switches and TSV macros on the layers. In the first step, a partitioning graph is constructed, with vertices representing IP cores and edges defined by a combination of bandwidth and latency constraints of a traffic flow between two cores. The weight of an edge is a function of the latency and the bandwidth constraints between the cores connected by this edge. The number of switches is varied till min-cut partitions are obtained for the current switch count. Cores belonging to the same partition are assigned to the same switch.
To compute the paths for the different flows, the TSV constraints between adjacent layers are considered along with the deadlock removal algorithm used earlier in [87] and [88]. In the third step, the layer of each switch is computed as an average of the layers of the cores the switch connects to. However, due to the possibility of device overlap, one switch at a time is considered and placed near the ideal place calculated, thus minimally perturbing the input topology. Additionally, they use their synthesis procedure of NoCs for 2-D SoCs [87] and compare it with the proposed synthesis tool for 3-D SoCs, thus highlighting the gains of 3-D through latency and power performance improvements. The power gains were mainly due to the shorter wires in 3-D.
In [25], the same authors presented a direct extension of the 2-D NoC synthesis procedure [87] meeting application performance and technology constraints. Here, the authors design the NoC for each layer separately and the connectivity of the switches is determined later. This method connects cores to switches which belong to the
same layer, thereby incurring a large power and latency penalty. In comparison with SunFloor 3D [31], this paper has a more restricted topology synthesis. In the previous work, cores across layers can share switches, whereas in this work, the floor plan of each layer is fixed. This work also does not address the placement of TSV macros and other network components, such as the mapping of switches to layers. The work restricts itself to the placement of switches. Due to the restriction of cores being assigned to a switch of the same layer, the inter-layer traffic traverses more hops, thus leading to higher power consumption and latency.
4.3.2 Floorplanning
Floorplanning is an important stage of physical design. The intractable nature of the floorplanning problem has attracted researchers to provide solutions using genetic algorithms and meta-heuristics. This subsection surveys research done in the field of floorplanning.
Quaye [89] showed that with an appropriate mapping of tasks, communication volume can be reduced. A genetic algorithm was proposed that yielded solutions with better thermal characteristics than those reported earlier. A brief description of any genetic algorithm is as follows: it generates an initial random pool of possible solutions (chromosomes), which are evaluated on each iteration (generation) by the assignment of fitness scores using an objective function. The fitness scores thus obtained guide the drive toward an optimized solution to the problem. A detailed view on genetic algorithms can be found in [90]. The authors encode a chromosome as an integer of n bits, where n is the number of physical PEs times the degree of virtualization (the number of tasks mapped onto a PE). With data-power, traffic, connectivity, dimensions of IPs and chip, number of layers as inputs, the algorithm first generates a floor plan.
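The generic genetic-algorithm loop described above can be sketched for a task-to-PE mapping problem. This is not the algorithm of [89]: the chromosome here is a plain list of PE indices, the fitness uses only a communication cost (traffic volume times Manhattan distance) plus an assumed penalty for tasks sharing a PE, and temperature is not modeled at all. All names and parameter values are hypothetical.

```python
import random


def manhattan(a, b):
    """Manhattan distance between two (x, y, z) PE coordinates."""
    return sum(abs(p - q) for p, q in zip(a, b))


def cost(chrom, traffic, coords):
    """Lower is better: communication cost plus an assumed penalty for shared PEs."""
    comm = sum(vol * manhattan(coords[chrom[s]], coords[chrom[d]])
               for (s, d), vol in traffic.items())
    shared = len(chrom) - len(set(chrom))
    return comm + 100 * shared


def evolve(n_tasks, coords, traffic, pop_size=20, generations=30, seed=0):
    rng = random.Random(seed)
    n_pes = len(coords)
    # Initial random pool of chromosomes: chrom[task] = index of the PE it runs on.
    pop = [[rng.randrange(n_pes) for _ in range(n_tasks)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=lambda c: cost(c, traffic, coords))  # rank by fitness
        history.append(cost(pop[0], traffic, coords))
        parents = pop[: pop_size // 2]                    # the better half survives
        children = []
        while len(parents) + len(children) < pop_size:
            p, q = rng.sample(parents, 2)
            cut = rng.randrange(1, n_tasks)
            child = p[:cut] + q[cut:]                     # one-point crossover
            if rng.random() < 0.3:                        # mutation
                child[rng.randrange(n_tasks)] = rng.randrange(n_pes)
            children.append(child)
        pop = parents + children
    best = min(pop, key=lambda c: cost(c, traffic, coords))
    return best, history


coords = [(x, y, z) for z in range(2) for y in range(2) for x in range(2)]
traffic = {(0, 1): 10, (1, 2): 5, (2, 3): 1}
best, history = evolve(n_tasks=4, coords=coords, traffic=traffic)
```

Because the better half of each generation is carried over unchanged, the best cost recorded in history never gets worse from one generation to the next.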
A certain number of solutions are generated and, for each one, mapping and placement are done and the communication cost and peak temperature are computed. For the thermal profiling, the proposed methodology uses the 3-D extension of HotSpot [81]. Solutions are ranked and then the genetic operators (crossover and mutation) are applied again to generate new chromosomes till a best solution is selected. The authors compare the performance of each application mapped on a 3-D and a 2-D design. The expected increase in temperature was evident in the 3-D design, but at the same time, the 3-D design provided a much more efficient communication framework. For the thermal profiling, they feed as input to the HotSpot [81] tool the average power consumption of every IP, the physical dimensions and location of every IP, and the number of layers in the 3-D design. The tool returns the temperature of the chip based on the values provided. The solutions in the solution space are ranked
on the basis of the cost function (logarithmic). The genetic algorithm iterates till the best solution is obtained.
4.4 Migration from 2-D Network-on-chip to 3-D Network-on-chip
The 3-D integration can well harness the power of a scalable interconnect like NoC and vice versa, especially for interconnect-dominated architectures. The combined advantages of 3-D and NoC are a strong motivation to migrate existing designs in 2-D to the 3-D platform. Some works have compared how designs vary performance-wise in 2-D and 3-D.
In our work [28], we presented an algorithm that addresses the crucial issue of migrating NoC topologies onto 3-D IC layers. Our proposed algorithm determines the assignment of switches to the different layers, with the objective of either minimizing the number of TSVs needed or minimizing power consumption while meeting TSV constraints. The assignment of cores to the different layers, the application communication characteristics, and the NoC topology and paths for flows are taken as inputs to the method. We integrated the method with an existing NoC floorplanner and design flow, thereby automating switch and NoC component placement in the layers as well. We only considered the communication architecture design, that is, mapping and placement of NoC components on the layers. The assignment of the processor/memory and other hardware cores is part of 3-D floorplanning, which is a complementary problem that considers thermal issues as well. Our method can be used in conjunction with any of the existing floorplanning methods, by taking their output core assignment as our input. Moreover, as the switch power consumption is a fraction of the overall chip power consumption, our switch assignment to the layers does not perturb the thermal profile of the chip.
Our experiments on many SoC benchmarks showed a reduction of 8 to 10% in the NoC power consumption and a significant 49% reduction in the number of vertical links (and hence, the TSVs) when compared with existing approaches.
Qian et al. [91] presented a detailed case study of the per-flow worst-case communication performance in 2-D and 3-D NoCs. The authors use a 3-D baseline router with a 7×7 crossbar for analysis and simulation. Apart from running simulations, they use their worst-case delay-bound analysis technique [92], which combines a network calculus model [93,94] with a network contention tree model [95], and they validate the analysis against the simulations. A 2-D mesh with 64 nodes and a 4-layer 3-D chip with the same number of nodes are compared under a corner-to-corner communication traffic pattern. Closed-form formulae are derived for this traffic pattern to calculate the worst-case delay bound for any flow in the interconnect fabric. The authors follow a four-step process to calculate the delay bound. First, they construct a contention tree to model the contention between flows. Second, the contention tree is scanned to derive the output arrival curves. In network calculus, the arrival curve bounds the amount of traffic that a flow can inject over any time interval. Third, the equivalent service curves are computed. A service curve, in turn, bounds from below the service that the network guarantees to a flow, including any initial waiting time before the data are serviced. Finally, the delay bound for the flow under consideration is computed from these arrival and service curves. They also explored the combined effect of vertical bandwidth and traffic burstiness. The results suggest that the average performance of 3-D NoCs is better than that of their 2-D counterparts, but the worst-case performance of 3-D NoCs may be worse.
4.5 Implementation
A few research groups have extended their existing 2-D NoC implementations to the 3-D domain, while a few others have studied the implementation feasibility of 3-D NoCs on various platforms. This section surveys work related to the implementation of 3-D NoCs on various target platforms.
Akram et al. [96] extended their 2-D OASIS NoC [97] to the third dimension and presented a hardware design for the 3-D OASIS NoC. The design is tested with the JPEG codec application [98]. The 3-D OASIS NoC has a 2×2×4 mesh topology with 7×7 routers and adopts a wormhole switching policy. A stall-and-go flow control mechanism is deployed, and a deterministic X-Y-Z routing algorithm is used. The target device for this design is Altera's 65 nm Stratix III FPGA. They compared their 3-D design with its 2-D counterpart and found that the logic area increased by 52%.
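The deterministic X-Y-Z routing mentioned above is a dimension-ordered scheme: a packet first corrects its X coordinate, then Y, then Z. The sketch below illustrates that general idea only and is not taken from the OASIS design [96]; the coordinate convention, the function names, and the example endpoints are assumptions.

```python
def xyz_next_hop(current, destination):
    """Return the next router on a dimension-ordered X-Y-Z path, or None on arrival."""
    for axis in range(3):  # 0 = X, 1 = Y, 2 = Z (assumed ordering of the tuple)
        if current[axis] != destination[axis]:
            step = 1 if destination[axis] > current[axis] else -1
            nxt = list(current)
            nxt[axis] += step
            return tuple(nxt)
    return None  # already at the destination


def xyz_route(source, destination):
    """List every router visited from source to destination, both included."""
    path = [source]
    while True:
        nxt = xyz_next_hop(path[-1], destination)
        if nxt is None:
            return path
        path.append(nxt)


# Hypothetical endpoints in a mesh addressed as (x, y, z).
route = xyz_route((0, 0, 0), (1, 1, 3))
print(route)
```

Because the order of dimensions is fixed, every packet between the same pair of routers takes the same path, which is what makes the scheme deterministic and simple to implement in hardware.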
In terms of clock speed, the 3-D OASIS NoC under-performs the 2-D NoC architecture by 8.5% on average, with a small 1.74% power overhead. The results also show a 22% reduction in flit latency when compared with the 2-D OASIS NoC.
Li et al. [99] proposed a router architecture and topology design for a NUCA [100] that combines the benefits of a NoC and 3-D technology to reduce L2 cache latencies in CMP-based systems [101]. The topology involves careful placement of the CPUs, with the remaining space filled by L2 cache banks. The topology is designed with two kinds of routers. The routers at the vertical pillars differ from the baseline 2-D routers, which do not connect to any other layer. The routers at a vertical pillar connect to a dynamic TDMA bus. A single-hop vertical link is preferred to a 7×7 baseline 3-D router for performance reasons. The bus arbiter is kept in the middle layer for uniformity in wire lengths. Each processor has a dedicated pillar unless there are fewer pillars than processors. The placement algorithm tries not to stack CPUs on top of each other. The pillars are also placed as far apart as possible to avoid congested areas. After the placement of the CPUs, the L2 banks are placed as shown in Figure 11. The cache placement and replacement policies used are similar to those of CMP-NUCA [102]. The architecture is simulated using Simics [103] interfaced with a 3-D NoC simulator, which is based on its 2-D counterpart [104]. The cache access latencies are extracted using Cacti 3.2 [105]. This work shows that the vertical interconnections have an impact on the cache latencies and emphasizes the importance of 3-D technology for future processors.
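The placement goals described above (pillars far apart, CPUs not stacked on top of each other) can be illustrated with a simple greedy heuristic. This is a sketch under stated assumptions and not the placement algorithm of [99]: it assumes a rectangular grid of candidate pillar sites, Manhattan distance as the spacing measure, a seed pillar at one corner, and a round-robin choice of layer per CPU. All names and sizes are hypothetical.

```python
from itertools import product


def manhattan(a, b):
    """Manhattan distance between two (x, y) grid sites."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def spread_pillars(width, height, count):
    """Greedily choose `count` pillar sites that stay far from one another.

    Requires count <= width * height.
    """
    sites = list(product(range(width), range(height)))
    chosen = [sites[0]]  # assumption: seed with the corner site (0, 0)
    while len(chosen) < count:
        candidates = [s for s in sites if s not in chosen]
        # Pick the site whose nearest already-chosen pillar is farthest away.
        chosen.append(max(candidates,
                          key=lambda s: min(manhattan(s, c) for c in chosen)))
    return chosen


def place_cpus(pillars, num_layers):
    """Put one CPU on each pillar, cycling through the layers."""
    return [(x, y, i % num_layers) for i, (x, y) in enumerate(pillars)]


# Hypothetical setup: 8 pillars and 8 CPUs on a 4x4 grid with 2 layers.
pillars = spread_pillars(width=4, height=4, count=8)
cpus = place_cpus(pillars, num_layers=2)
print(cpus)
```

Since every CPU sits on its own pillar and the pillars occupy distinct (x, y) sites, no two CPUs end up directly above one another, while the layer cycling offsets them in the third dimension as well.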
Figure 11: Placement of Caches and CPUs in a 3-D setup [99]. (Figure labels: cache bank nodes and CPU nodes; 8 communication pillars, 8 CPUs, 1 CPU per pillar; CPUs offset in all three dimensions to avoid hotspots.)
5. Conclusion
The 3-D NoCs are a revolutionary advancement in the direction of scalable chip design and are well suited to the next generation of nanoscale electronic systems. Although 3-D NoCs are not commercially viable at present because of many technology roadblocks, they are a promising technology for the integration of CMPs and SoCs over the next few decades. The structural design of the paradigm gives designers an extra spatial dimension, which enables the construction of complex circuits catering to today's multipurpose systems. The 3-D NoC is an answer to the inability of traditional transistor scaling to meet the performance and integration requirements of SoCs.
Appendix A: Some NoC Simulation Platforms and Tools
In this section, some of the simulators and NoC evaluation platforms built by the NoC research community to study various NoC designs are discussed. Some of the significant features of each implementation are also highlighted.
Orion - A power and area model of on-chip interconnection networks that helps in quick exploration of different NoC designs [83]. The motivation behind building such a framework is to estimate the power and area of the interconnect early in the design phase, as they are crucial for an optimized design. As a feature, this framework automatically updates itself as and when a new technology library becomes available. The simulator models all the router components and also the links that form the interconnection network.
HotSpot - HotSpot [81] is an accurate and fast thermal model suitable for use in architectural studies. It is based on an equivalent circuit of thermal resistances and capacitances that correspond to micro-architecture blocks and essential aspects of the thermal package. The model has been validated using finite element simulation. HotSpot has a simple set of interfaces and hence can be integrated with most power-performance simulators like Wattch [106]. The chief advantage of HotSpot is that it is compatible with the kinds of power/performance models used by the computer-architecture community, requiring no detailed design or synthesis description. HotSpot makes it possible to study thermal evolution over long periods of real, full-length applications.
HSpice - HSpice [68] is the industry's standard for accurate and comprehensive circuit simulation and offers foundry-certified MOS device models with state-of-the-art simulation and analysis algorithms.
DevEdit - DevEdit, a product of Silvaco, Inc., is a tool that can be used either to create a device from scratch or to edit an existing device. DevEdit creates standard Silvaco structures that are easily integrated into Silvaco 2-D or 3-D simulators and other support tools.
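The equivalent-circuit idea behind the HotSpot entry can be illustrated with a single lumped thermal node. The sketch below is not HotSpot code and does not use its interfaces; it assumes one block with a thermal resistance R to ambient and a thermal capacitance C, and integrates C dT/dt = P - (T - T_amb)/R with a forward-Euler step. The function name and all parameter values are made up for illustration.

```python
def simulate_block_temperature(power_trace, r_th, c_th, t_ambient, dt):
    """Forward-Euler integration of C dT/dt = P - (T - T_amb) / R for one block.

    power_trace: power samples in W, one per time step
    r_th: thermal resistance to ambient in K/W
    c_th: thermal capacitance in J/K
    t_ambient: ambient temperature in degrees Celsius
    dt: time step in seconds (should be much smaller than r_th * c_th)
    """
    temperature = t_ambient
    history = []
    for power in power_trace:
        d_temp = (power - (temperature - t_ambient) / r_th) / c_th
        temperature += d_temp * dt
        history.append(temperature)
    return history


# Assumed values: a block dissipating a constant 10 W for 20 s.
history = simulate_block_temperature(
    power_trace=[10.0] * 2000, r_th=2.0, c_th=0.5, t_ambient=45.0, dt=0.01
)
steady_state = 45.0 + 10.0 * 2.0  # T_amb + P * R
print(round(history[-1], 3), steady_state)
```

With constant power the temperature settles at T_amb + P·R, and the product R·C sets how quickly it gets there. A tool of this kind solves a network of many such nodes, one per block and package element, coupled through additional resistances.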
References
1. K. Banerjee, S. Souri, P. Kapur, and K. Saraswat, 3-D ICs: A Novel Chip Design for Improving Deep-submicrometer Interconnect Performance and Systems-on-Chip Integration, Proceedings of the IEEE, vol. 89, p. 602-33, May 2001.
2. S. Al-Sarawi, D. Abbott, and P. Franzon, A review of 3-D packaging technology, IEEE Transactions on Components, Packaging, and Manufacturing Technology, Part B: Advanced Packaging, vol. 21, p. 2-14, Feb. 1998.
3. R. Zhang, K. Roy, C.-K. Koh, and D. Janes, Stochastic interconnect modeling, power trends, and performance characterization of 3-D circuits, IEEE Transactions on Electron Devices, vol. 48, p. 638-52, Apr. 2001.
4. W. Davis, J. Wilson, S. Mick, J. Xu, H. Hua, C. Mineo, et al., Demystifying 3D ICs: the pros and cons of going vertical, IEEE Design Test of Computers, vol. 22, p. 498-510, Nov.-Dec. 2005.
5. Y. Deng, and W. Maly, 2.5D system integration: A design driven system implementation schema, in Proceedings of the ASP-DAC, p. 450-5, Jan. 2004.
6. V. Pavlidis, and E. Friedman, 3-D Topologies for Networks-on-Chip, in IEEE International SOC Conference, p. 285-8, Sep. 2006.
7. F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and M. Kandemir, Design and Management of 3D Chip Multiprocessors Using Network-in-Memory, in 33rd International Symposium on Computer Architecture, p. 130-41, 2006.
8. B. Feero, and P. Pande, Networks-on-Chip in a Three-Dimensional Environment: A Performance Evaluation, IEEE Transactions on Computers, vol. 58, p. 32-45, Jan. 2009.
9. S. Kim, C. Liu, L. Xue, and S. Tiwari, Crosstalk reduction in mixed-signal 3-D integrated circuits with interdevice layer ground planes, IEEE Transactions on Electron Devices, vol. 52, p. 1459-67, July 2005.
10. J. U. Knickerbocker, P. S. Andry, B. Dang, R. R. Horton, M. J. Interrante, C. S. Patel, et al., Three-dimensional silicon integration, IBM Journal of Research and Development, vol. 52, p. 553-69, Nov. 2008.
11. J. Joyner, P. Zarkesh-Ha, and J. Meindl, A stochastic global net-length distribution for a three-dimensional system-on-a-chip (3D-SoC), in Proceedings. 14th Annual IEEE International ASIC/SOC Conference, p. 147-51, 2001.
12. M. Ieong, K. Guarini, V. Chan, K. Bernstein, R. Joshi, J. Kedzierski, et al., Three dimensional CMOS devices and integrated circuits, in Proceedings of the IEEE Custom Integrated Circuits Conference, p. 207-13, Sep. 2003.
13. C. Liu, I. Ganusov, M. Burtscher, and S. Tiwari, Bridging the processor-memory performance gap with 3D IC technology, IEEE Design Test of Computers, vol. 22, p. 556-64, Nov.-Dec. 2005.
14. S. Lim, Physical design for 3D system on package, IEEE Design Test of Computers, vol. 22, p. 532-9, Nov.-Dec. 2005.
15. S. D. Bhabani Shankar Prasad Mishra, Parallel Computing Environments: A Review, in The IETE Technical Review, vol. 28, p. 240-7, June 2009.
16. M. S. Bakir, G. Huang, D. Sekar, and C. King, 3-D Integrated Circuits: Liquid Cooling and Power Delivery, in The IETE Technical Review, vol. 26, p. 407-16, Nov. 2009.
17. International Technology Roadmap for Semiconductors, 2005. [Online]. Available from: http://www.itrs.com [Last accessed on 2011 July 31].
18. L. Benini, and G. De Micheli, Networks on chips: Technology and Tools, Morgan Kaufmann, First Edition, July 2006.
19. Networks-on-chips: A New SoC paradigm, Computer, vol. 35, p. 70-8, Jan. 2002.
20. R. Ho, K. Mai, and M. Horowitz, The future of wires, Proceedings of the IEEE, vol. 89, p. 490-504, Apr. 2001.
21. W. Dally, and B. Towles, Route packets, not wires: On-chip interconnection networks, in Proceedings of the Design Automation Conference, p. 684-9, 2001.
22. D. Lackey, P. Zuchowski, T. Bednar, D. Stout, S. Gould, and J. Cohn, Managing power and performance for system-on-chip designs using Voltage Islands, in IEEE/ACM International Conference on Computer Aided Design, p. 195-202, Nov. 2002.
23. R. Kumar, V. Zyuban, and D. Tullsen, Interconnections in multicore architectures: understanding mechanisms, overheads and scaling, in Proceedings. 32nd International Symposium on Computer Architecture, p. 408-19, June 2005.
24. W. Dally, and J. W. Poulton, Digital Systems Engineering, in Cambridge University Press, 1998, June 1998.
25. S. Murali, C. Seiculescu, L. Benini, and G. De Micheli, Synthesis of networks on chips for 3D systems on chips, in Proceedings of the ASP-DAC, p. 242-7, Jan. 2009.
26. K. Salah, A. El Rouby, H. Ragai, and Y. Ismail, 3D/TSV enabling technologies for SOC/NOC: Modeling and design challenges, in International Conference on Microelectronics (ICM), p. 268-71, Dec. 2010.
27. D. Velenis, M. Stucchi, E. Marinissen, B. Swinnen, and E. Beyne, Impact of 3D design choices on manufacturing cost, in IEEE International Conference on 3D System Integration, p. 1-5, Sep. 2009.
28. M. Pawan Kumar, S. Anish Kumar, S. Murali, L. Benini, and V. Kamakoti, A Method for Integrating Network-on-Chip Topologies with 3-D ICs, in Proceedings of the International Symposium on VLSI, p. 60-5, July 2011.
29. C. H. Chao, K. Y. Jheng, H. Y. Wang, J. C. Wu, and A. Y. Wu, Traffic- and Thermal-Aware Run-Time Thermal Management Scheme for 3D NoC Systems, in Networks-on-Chip (NOCS), 2010 Fourth ACM/IEEE International Symposium on, p. 223-30, May 2010.
30. R. Canegallo, L. Ciccarelli, F. Natali, A. Fazzi, R. Guerrieri, and P. Rolandi, 3d contactless communication for ic design, in IEEE International Conference on Integrated Circuit Design and Technology, p. 241-4, June 2008.
31. C. Seiculescu, S. Murali, L. Benini, and G. De Micheli, SunFloor 3D: A tool for Networks On Chip topology synthesis for 3D systems on chips, in Design, Automation Test in Europe Conference Exhibition, 2009. DATE 09., p. 9-14, Apr. 2009.
32. W. Dally, and J. W. Poulton, Handbook of 3-D Integration, in Wiley VCH, 2008.
33. I. Loi, F. Angiolini, and L. Benini, Supporting vertical links for 3D networks-on-chip: toward an automated design and analysis flow, in Proceedings of the 2nd international conference on NanoNetworks, vol. 15, p. 115:5, 2007.
34. J. H. Chien, C. L. Lung, C. C. Hsu, Y. F. Chou, and D. M. Kwai, Floorplanning 1024 cores in a 3D-stacked network-on-chip with thermal-aware redistribution, in 12th IEEE Intersociety Conference on Thermal and Thermomechanical Phenomena in Electronic Systems (ITherm), p. 1-6, June 2010.
35. R. S. Ramanujam, and B. Lin, A Layer-Multiplexed 3D On-Chip Network Architecture, IEEE Embedded Systems Letters, vol. 1, p. 50-5, Aug. 2009.
36. K. Siozios, I. Anagnostopoulos, and D. Soudris, A High-Level Mapping Algorithm Targeting 3D NoC Architectures with Multiple Vdd, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), p. 444-5, July 2010.
37. A. Weldezion, M. Grange, D. Pamunuwa, Z. Lu, A. Jantsch, R. Weerasekera, et al., Scalability of Network-on-Chip Communication Architecture for 3-D Meshes, in Networks-on-Chip, 2009. NoCS 2009. 3rd ACM/IEEE International Symposium on, p. 114-23, May 2009.
38. E. Nilsson, Design and Implementation of a hot-potato Switch in a Network-on-Chip, Memoire, Department of Microelectronics and Information Technology, Royal Institute of Technology, Jan. 2002.
39. A. Weldezion, Z. Lu, R. Weerasekera, and H. Tenhunen, 3-D Memory Organization and Performance Analysis for MultiProcessor Network-on-Chip Architecture, in IEEE International Conference on 3D System Integration, p. 1-7, Sep. 2009.
40. A. Jantsch, and H. Tenhunen, in Networks on Chip, Kluwer Academic Publishers, 2003.
41. H. Matsutani, M. Koibuchi, D. Hsu, and H. Amano, Three-Dimensional Layout of On-Chip Tree-Based Networks, in International Symposium on Parallel Architectures, Algorithms, and Networks, p. 281-8, May 2008.
42. H. Matsutani, M. Koibuchi, and H. Amano, Tightly-Coupled Multi-Layer Topologies for 3-D NoCs, in International Conference on Parallel Processing, p. 75, Sep. 2007.
43. A. M. Rahmani, K. Latif, P. Liljeberg, J. Plosila, and H. Tenhunen, Research and practices on 3D networks-on-chip architectures, in NORCHIP, 2010, p. 1-6, Nov. 2010.
44. A. Zia, S. Kannan, G. Rose, and H. Chao, Highly-scalable 3D CLOS NOC for many-core CMPs, in 8th IEEE International NEWCAS Conference (NEWCAS), p. 229-32, June 2010.
45. C. Clos, A Study of Non-Blocking Switching Networks, p. 406-24, Mar. 1953.
46. Y. Chen, J. Hu, and X. Ling, De Bruijn graph based 3D Network on Chip architecture design, in International Conference on Communications, Circuits and Systems, p. 986-90, July 2009.
47. M. Hosseinabady, M. R. Kakoee, J. Mathew, and D. K. Pradhan, Reliable network-on-chip based on generalized de Bruijn graph, in Proceedings of the 2007 IEEE International High Level Design Validation and Test Workshop, p. 3-10, 2007.
48. Z. Li, X. Ling, and J. Hu, MSNS: A Top-Down MPI-Style Hierarchical Simulation Framework for Network-on-Chip, in Proceedings of the WRI International Conference on Communications and Mobile Computing, Vol. 02, p. 609-14, 2009.
49. D. Velenis, M. Stucchi, E. Marinissen, B. Swinnen, and E. Beyne, Impact of 3D Design Choices on Manufacturing Cost, in IEEE International Conference on 3D System Integration, p. 1-5, Sep. 2009.
50. T. Xu, P. Liljeberg, and H. Tenhunen, A study of Through Silicon Via impact to 3D Network-on-Chip design, in International Conference On Electronics and Information Engineering (ICEIE), vol. 1, p. 333-7, Aug. 2010.
51. C. Liu, L. Zhang, Y. Han, and X. Li, Vertical interconnects squeezing in symmetric 3D mesh Network-on-Chip, in Proceedings of the 16th ASP-DAC, p. 357-62, Jan. 2011.
52. A. M. Rahmani, P. Liljeberg, J. Plosila, and H. Tenhunen, BBVC3D-NoC: An Efficient 3D NoC Architecture Using Bidirectional Bisynchronous Vertical Channels, in IEEE Computer Society Annual Symposium on VLSI (ISVLSI), p. 452-3, July 2010.
53. I. Loi, P. Marchal, A. Pullini, and L. Benini, 3D NoCs: Unifying inter & intra chip communication, in Proceedings of 2010 IEEE International Symposium on Circuits and Systems (ISCAS), p. 3337-40, June 2010.
54. I. Loi, F. Angiolini, and L. Benini, Developing mesochronous synchronizers to enable 3D NoCs, in Proceedings of the conference on Design, automation and test in Europe, p. 1414-9, 2008.
55. F. Angiolini, P. Meloni, S. Carta, L. Benini, and L. Raffo, Contrasting a NoC and a traditional interconnect fabric with layout awareness, in Proceedings of the conference on Design, automation and test in Europe, p. 124-9, 2006.
56. C. Jueping, J. Peng, Y. Lei, H. Yue, and L. Zan, Through-silicon via (TSV) Capacitance Modeling for 3D NoC Energy Consumption Estimation, in 10th IEEE International Conference on Solid-State and Integrated Circuit Technology (ICSICT), p. 815-7, Nov. 2010.
57. Silvaco, DevEdit, 3-D Device Simulator. [Online]. Available from: http://www.silvaco.com/products/interactive tools/devedit/ devedit br.html [Last accessed on 2011 July 31].
58. D. H. Kim, and S. K. Lim, Through-silicon-via-aware delay and power prediction model for buffered interconnects in 3d ics, in Proceedings of the 12th ACM/IEEE international workshop on System level interconnect prediction, p. 25-32, 2010.
59. J. Owens, W. Dally, R. Ho, D. Jayasimha, S. Keckler, and L.-S. Peh, Research Challenges for On-Chip Interconnection Networks, Micro, IEEE, vol. 27, p. 96-108, Sep. 2007.
60. N. Kirman, M. Kirman, R. K. Dokania, J. F. Martinez, A. B. Apsel, M. A. Watkins, et al., On-Chip Optical Technology in Future Bus-Based Multicore Designs, IEEE Micro, vol. 27, p. 56-66, Jan. 2007.
61. A. Shacham, K. Bergman, and L. P. Carloni, Maximizing GFLOPS-per-Watt: High-bandwidth, Low-power Photonic On-Chip Networks, P = ac2, vol. 27, p. 96-108, Sep. 2007.
62. D. Vantrease, R. Schreiber, M. Monchiero, M. McLaren, N. Jouppi, M. Fiorentino, A. Davis, N. Binkert, R. Beausoleil, and J. Ahn, in 35th International Symposium on Computer Architecture, p. 153-64, June 2008.
63. L. Carloni, P. Pande, and Y. Xie, Networks-on-Chip in Emerging Interconnect Paradigms: Advantages and Challenges, in Networks-on-Chip, NoCS 2009. 3rd ACM/IEEE International Symposium on, p. 93-102, 2009.
64. Y. Ye, L. Duan, J. Xu, J. Ouyang, M. K. Hung, and Y. Xie, 3D optical networks-on-chip (NoC) for multiprocessor systems-on-chip (MPSoC), in IEEE International Conference on 3D System Integration, p. 1-6, Sep. 2009.
65. H. Gu, K. H. Mo, J. Xu, and W. Zhang, A Low-power Low-cost Optical Router for Optical Networks-on-Chip in Multiprocessor Systems-on-Chip, in Proceedings of the IEEE Computer Society Annual Symposium on VLSI, p. 19-24, 2009.
66. J. Kim, C. Nicopoulos, D. Park, R. Das, Y. Xie, V. Narayanan, et al., A Novel Dimensionally Decomposed Router for On-Chip Communication in 3D Architectures, in Proceedings. 32nd International Symposium on Computer Architecture, p. 138-49, 2007.
67. J. Kim, C. Nicopoulos, D. Park, V. Narayanan, M. Yousif, and C. Das, A Gracefully Degrading and Energy-Efficient Modular Router Architecture for On-Chip Networks, in International Symposium on Computer Architecture, p. 4-15, July 2006.
68. 70nm PTM Technology Model. [Online]. Available from: http://www.eas.asu.edu/ptm [Last accessed on 2011 July 31].
69. W. Lafi, D. Lattard, and A. Jerraya, An Efficient Hierarchical Router for Large 3D NoCs, in 21st IEEE International Symposium on Rapid System Prototyping (RSP), p. 1-5, June 2010.
70. D. Lattard, E. Beigne, F. Clermidy, Y. Durand, R. Lemaire, P. Vivet, and F. Berens, A Reconfigurable Baseband Platform Based on an Asynchronous Network-on-Chip, IEEE Journal of Solid-State Circuits, vol. 43, p. 223-35, Jan. 2008.
71. Y. Thonnart, P. Vivet, and F. Clermidy, A fully-asynchronous low-power framework for GALS NoC integration, in Design, Automation Test in Europe Conference Exhibition, p. 33-8, Mar. 2010.
72. D. Park, S. Eachempati, R. Das, A. Mishra, Y. Xie, N. Vijaykrishnan, et al., MIRA: A Multi-layered On-Chip Interconnect Router Architecture, in 35th International Symposium on Computer Architecture, p. 251-61, June 2008.
73. K. Puttuswamy, and G. H. Loh, Thermal Herding: Microarchitecture Techniques for Controlling Hotspots in High-Performance 3D-Integrated Processors, in Proceedings of the 13th HPCA, p. 193-204, 2007.
74. W. Huang, S. Member, S. Ghosh, S. Velusamy, K. Sankaranarayanan, and K. S. et al., Hotspot: A compact thermal modeling method for CMOS VLSI systems, IEEE Transactions on Very Large Scale Integration, vol. 14, p. 501-13, 2006.
75. I. Loi, S. Mitra, T. Lee, S. Fujita, and L. Benini, A low-overhead fault tolerance scheme for TSV-based 3D network on chip links, in IEEE/ACM International Conference on Computer-Aided Design, p. 598-602, Nov. 2008.
76. M. Dall'Osso, G. Biccari, L. Giovannini, D. Bertozzi, and L. Benini, Xpipes: A Latency Insensitive Parameterized Network-on-Chip Architecture for Multiprocessor SoCs, in Proceedings. 21st International Conference on Computer Design, p. 536-9, Oct. 2003.
77. C. Seiculescu, S. Murali, L. Benini, and G. De Micheli, Comparative Analysis of NoCs for Two-Dimensional Versus Three-Dimensional SoCs Supporting Multiple Voltage and Frequency Islands, IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 57, p. 364-8, May 2010.
78. S. Yan, and B. Lin, Design of Application-Specific 3D Networks-on-Chip Architectures, in IEEE International Conference on Computer Design, p. 142-9, Oct. 2008.
79. L. Shang, L.-S. Peh, A. Kumar, and N. K. Jha, Thermal Modeling, Characterization and Management of On-Chip Networks, in Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, p. 67-78, 2004.
80. H. Wang, Y. Fu, T. Liu, and J. Wang, Thermal Management via Task Scheduling for 3D NoC based Multi-Processor, in International SoC Design Conference (ISOCC), p. 440-4, Nov. 2010.
81. K. Skadron, M. Stan, W. Huang, S. Velusamy, K. Sankaranarayanan, and D. Tarjan, Temperature-aware microarchitecture, in Proceedings. 30th Annual International Symposium on Computer Architecture, p. 2-13, June 2003.
82. M. Arjomand, and H. Sarbazi-Azad, Voltage-Frequency Planning for Thermal-Aware, Low-Power Design of Regular 3-D NoCs, in 23rd International Conference on VLSI Design, p. 57-62, Jan. 2010.
83. H. S. Wang, X. Zhu, L. S. Peh, and S. Malik, Orion: A power-performance simulator for interconnection networks, in Proceedings. 35th Annual IEEE/ACM International Symposium on Microarchitecture, p. 294-305, 2002.
84. T. Chelcea, and S. Nowick, Robust interfaces for mixed-timing systems with application to latency-insensitive protocols, in Design Automation Conference. Proceedings, p. 21-6, 2001.
85. W. Dally, and J. W. Poulton, The VLSI Handbook, in Taylor & Francis Group, CRC Press, 2007.
86. C. Liu, J.-H. Chen, R. Manohar, and S. Tiwari, Mapping system-on-chip designs from 2-D to 3-D ICs, in IEEE International Symposium on Circuits and Systems, vol. 3, p. 2939-42, May 2005.
87. S. Murali, P. Meloni, F. Angiolini, D. Atienza, S. Carta, L. Benini, et al., Designing Application-Specific Networks on Chips with Floorplan Information, in IEEE/ACM International Conference on Computer-Aided Design, p. 355-62, Nov. 2006.
88. A. Hansson, A. L. Spaanenburg, and K. Goossens, A Unified Approach to Mapping and Routing in a Combined Guaranteed Service and Best-Effort Network-on-Chip Architecture, 2005.
89. C. Addo-Quaye, Thermal-aware mapping and placement for 3-D NoC designs, in Proceedings. IEEE International SOC Conference, p. 25-8, Sep. 2005.
90. D. E. Goldberg, Genetic Algorithms in Search, Optimization and Machine Learning, 1st ed., 1989.
91. Y. Qian, Z. Lu, and W. Dou, From 2D to 3D NoCs: A Case Study on Worst-case Communication Performance, in Proceedings of the 2009 International Conference on Computer-Aided Design, p. 555-62, 2009.
92. Y. Qian, Z. Lu, and W. Dou, Analysis of Communication Delay Bounds for Network on Chips, in Proceedings of the ASP-DAC, p. 7-12, 2009.
93. C. S. Chang, Performance Guarantees in Communication Networks. Germany: Springer-Verlag, 2000.
94. R. Cruz, A calculus for network delay. i. network elements in isolation, IEEE Transactions on Information Theory, vol. 37, p. 114-31, Jan. 1991.
95. Z. Lu, A. Jantsch, and I. Sander, Feasibility analysis of messages for on-chip networks using wormhole routing, in Proceedings of the ASP-DAC, p. 960-4, 2005.
96. A. Ben Ahmed, A. Ben Abdallah, and K. Kuroda, Architecture and Design of Efficient 3D Network-on-Chip (3D NoC) for Custom Multicore SoC, in International Conference on Broadband, Wireless Computing, Communication and Applications (BWCCA), p. 67-73, Nov. 2010.
97. K. Mori, A. Esch, A. Abdallah, and K. Kuroda, Advanced Design Issues for OASIS Network-on-Chip Architecture, in International Conference on Broadband, Wireless Computing, Communication and Applications, p. 74-9, Nov. 2010.
98. J. Rosenthal, JPEG Image Compression using an FPGA, Dec. 2006.
99. F. Li, C. Nicopoulos, T. Richardson, Y. Xie, V. Narayanan, and M. Kandemir, Design and Management of 3D Chip Multiprocessors Using Network-in-Memory, in 33rd International Symposium on Computer Architecture, p. 130-41, 2006.
100. C. Kim, D. Burger, and S. W. Keckler, An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches, SIGOPS Oper. Syst. Rev., vol. 36, p. 211-22, Oct. 2002.
101. K. Olukotun, B. A. Nayfeh, L. Hammond, K. Wilson, and K. Chang, The case for a single-chip multiprocessor, SIGPLAN Not., p. 2-11, Sep. 1996.
102. B. M. Beckmann, and D. A. Wood, Managing Wire Delay in Large Chip-Multiprocessor Caches, in Proceedings of the 37th annual IEEE/ACM International Symposium on Microarchitecture, p. 319-30, 2004.
103. P. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hallberg, J. Hogberg, et al., Simics: A full system simulation platform, IEEE Computer Journal, vol. 35, p. 50-8, Feb. 2002.
104. J. Kim, D. Park, C. Nicopoulos, N. Vijaykrishnan, and C. R. Das, Design and Analysis of an NoC Architecture from Performance, Reliability and Energy Perspective, in Proceedings of the ACM symposium on Architecture for networking and communications systems, p. 173-82, 2005.
105. P. Shivakumar, N. P. Jouppi, and P. Shivakumar, CACTI 3.0: An Integrated Cache Timing, Power, and Area Model, Tech. Rep., 2001.
106. D. Brooks, V. Tiwari, and M. Martonosi, Wattch: A framework for architectural-level power analysis and optimizations, in Proceedings of the 27th annual international symposium on Computer architecture, p. 83-94, 2000.
AUTHORS
M. Pawan Kumar is pursuing his Master of Science (by research) at the Department of Computer Science and Engineering, Indian Institute of Technology Madras. His areas of interest include Computer Architecture and VLSI Design. E-mail: mpkpawan@cse.iitm.ac.in
Srinivasan Murali holds a master's degree and a PhD in Electrical Engineering from Stanford University, CA, USA, on the subject of Networks-on-Chips. He is the recipient of a Best PhD Dissertation Award from the EDAA Council in 2008. His background includes CAD tooling and design automation for system design. He is currently a visiting faculty member at the Department of Computer Science and Engineering, Indian Institute of Technology, Madras. E-mail: srinivasan.murali@epfl.ch
Kamakoti Veezhinathan is a Professor at the Department of Computer Science and Engineering, Indian Institute of Technology, Madras. His areas of interest include Computer Architecture, Secure Systems Engineering and VLSI Design. E-mail: kama@cse.iitm.ac.in
DOI: 10.4103/0256-4602.101313; Paper No. TR 241_11; Copyright 2012 by the IETE
IETE TECHNICAL REVIEW | VOL 29 | ISSUE 4 | JUL-AUG 2012
