
Embedded Systems

Series Editors
Nikil D. Dutt, Department of Computer Science, Zot Code 3435,
Donald Bren School of Information and Computer Sciences, University of
California, Irvine, CA 92697-3435, USA
Peter Marwedel, TU Dortmund, Informatik 12, Otto-Hahn-Str. 16, 44227,
Dortmund, Germany
Grant Martin, Tensilica Inc., 3255-6 Scott Blvd., Santa Clara, CA 95054, USA

For further volumes:
http://www.springer.com/series/8563
Ian O'Connor • Gabriela Nicolescu
Editors

Integrated Optical
Interconnect Architectures
for Embedded Systems
Editors

Ian O'Connor
Ecole Centrale de Lyon - Lyon Institute of Nanotechnology
Ecully, France

Gabriela Nicolescu
Dépt. Génie Informatique & Génie Logiciel
École Polytechnique de Montreal
Montreal, QC, Canada

ISSN 2193-0155 ISSN 2193-0163 (electronic)


ISBN 978-1-4419-6192-1 ISBN 978-1-4419-6193-8 (eBook)
DOI 10.1007/978-1-4419-6193-8
Springer New York Heidelberg Dordrecht London

Library of Congress Control Number: 2012948541

© Springer Science+Business Media New York 2013


This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of
the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation,
broadcasting, reproduction on microfilms or in any other physical way, and transmission or information
storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology
now known or hereafter developed. Exempted from this legal reservation are brief excerpts in connection
with reviews or scholarly analysis or material supplied specifically for the purpose of being entered and
executed on a computer system, for exclusive use by the purchaser of the work. Duplication of this
publication or parts thereof is permitted only under the provisions of the Copyright Law of the Publisher's
location, in its current version, and permission for use must always be obtained from Springer. Permissions
for use may be obtained through RightsLink at the Copyright Clearance Center. Violations are liable to
prosecution under the respective Copyright Law.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication
does not imply, even in the absence of a specific statement, that such names are exempt from the relevant
protective laws and regulations and therefore free for general use.
While the advice and information in this book are believed to be true and accurate at the date of
publication, neither the authors nor the editors nor the publisher can accept any legal responsibility for
any errors or omissions that may be made. The publisher makes no warranty, express or implied, with
respect to the material contained herein.

Printed on acid-free paper

Springer is part of Springer Science+Business Media (www.springer.com)


Preface

Integrated optical interconnect is increasingly perceived as a viable alternative to
conventional electrical interconnect to support high-speed communication between
processors in high-performance distributed multi-processor systems-on-chips
(MPSoCs). The shift to such architectures as mainstream computing devices is the
recognized route to address, in particular, power issues by reducing individual pro-
cessor frequency while retaining the same overall computing capacity. This ratio-
nale answers the need for flexible and scalable computing platforms capable (1) of
achieving future required application performance in terms of resolution (audio,
video, and computing) and CPU power/total MIPS (real-time encoding–decoding,
data encryption–decryption) and (2) of working with multiple standards and with
constrained power, which are both particularly important for mobile applications.
Aggregated on-chip data transfer rates in MPSoC are critical and are expected to
reach over 100 Tb/s in this decade. As such, interconnects will play a significant
role for MPSoC design in order to support these high data rates.
Besides a huge data rate, optical interconnects also allow for additional flexibility
through the use of wavelength division multiplexing. It is possible to exploit this to design
more intelligent interconnect systems, such as passive, wavelength-reconfigurable optical
networks on chip. Such structures, supplying reconfigurable channels of high-speed com-
munication to each IP block, are thus suitable candidates for the basis of fast and flexible
interconnect structures, removing key processor communication bottlenecks.
This book intends to give a broad overview of current thinking in optical inter-
connect technologies and architectures. Introductory chapters on high-performance
computing and the associated issues in conventional interconnect architectures and
on the fundamental building blocks for integrated optical interconnect provide the
foundations for the bulk of the book which brings together leading experts in the
field of optical interconnect architectures for data communication. Particular empha-
sis is given to the ways in which the photonic components are assembled into archi-
tectures to address the needs of data-intensive on-chip communication and to the
performance evaluation of such architectures for specific applications. In this way,
it is hoped that the reader can glean insight into suitable contexts for the use of opti-
cal interconnect.


Basics for High-Performance Computing and Optical Interconnect

In the first part of this book, we examine both ends of the optical interconnect
domain, as a convergence of application needs (performance metrics of communi-
cation infrastructure in systems on chip) and enabling technology (building blocks
of silicon photonics).
In Chap. 1, the system on chip concept is introduced, with a particular focus on
SoC communication systems and the main features and limitations of various types
of on-chip interconnect. The author examines both performance and physical inte-
gration issues and stresses that on-chip interconnect, rather than logic gates, is the
bottleneck to system performance. Much research and industry effort are focused
today on vertical solutions, at packaging level (system in package or SiP) or at inte-
gration level (3D integrated circuits or 3DICs); these approaches can indeed for a
time relax the SoC interconnect bottleneck and allow the implementation of com-
plex, heterogeneous, and high-performance systems. However, the author concludes
with the observation that increasing complexity and requirements in terms of com-
putation capability of new generation systems will reach the limit of electrical inter-
connect quite soon, driving the need for novel solutions and different approaches for
reliable and effective on-chip and die-to-die communication. Optical interconnect is
one such potential solution.
In Chap. 2, the authors give a review of silicon photonics technology and focus
on explaining the main principles and orders of magnitude of the various compo-
nents that are required for on-chip optical interconnects and in particular for WDM
(wavelength division multiplexing) links. Achieving true CMOS compatibility at
material and process level is a driving factor for silicon photonics, and the high
refractive index contrast also makes it possible to scale down photonic building
blocks to a footprint compatible with thousands of components on a single chip.
The authors highlight the fast pace of progress of this technology and their convic-
tion that on-chip optical links will become a reality before 2020, while also singling
out the two most significant issues that still need to be solved when using silicon
photonics for on-chip links: which approach to use for the light source and how to handle
thermal issues (both to lower thermal dissipation and to minimize sensitivity to
temperature variation).

On-Chip Optical Communication Topologies

The second part of this book looks at various proposals for communication topolo-
gies based on silicon photonics, in particular for MPSoCs at a scale of tens to hun-
dreds of cores. Indeed, there have been several proposals in recent years for optical
interconnect networks attempting to provide improved performance and energy
efficiency compared to electrical networks. Three chapters review some of these

topologies and make further novel proposals, while introducing critical concepts to
this domain such as multilevel design and analysis and system integration and
interfacing.
Chapter 3 introduces this part of the book with a further review of basic nano-
photonic devices as integrated with a standard CMOS process. The authors then
propose a structured approach to clearly analyze previous proposals at relevant
abstraction levels (here considered to be architectural, microarchitectural, and phys-
ical) and use this approach to identify opportunities for new designs and make the
link between application requirements and technology constraints. The design pro-
cess is illustrated in an on-chip tile-to-tile network, processor-to-DRAM network,
and DRAM memory channel, and the authors conclude with a discussion of lessons
learned throughout such a design process with a set of guidelines for designers.
Chapter 4 proposes a fat tree-based optical NoC (FONoC) at several levels of
detail including the topology, floorplan, and protocols. Central to the proposal is a
low-power and low-cost optical turnaround router (OTAR) with an associated rout-
ing algorithm. In contrast to some other optical NoCs, FONoC does not require a
separate electronic NoC for network control, since it carries both payload data and
network control data on the same optical network. The authors describe the proto-
cols, which are designed to minimize network control data and related power con-
sumption. The overall power consumption and performance (delay and throughput)
is evaluated by an analytical model and compared to a matched electronic 64-node
NoC in 45 nm CMOS under different offered loads and packet sizes.
In Chap. 5, the authors describe an optical ring bus (ORB)-based hybrid opto-
electric on-chip communication architecture. This topology uses an optical ring
waveguide to replace global pipelined electrical interconnects while maintaining
the interface with typical bus protocol standards such as AMBA AXI3. The pro-
posed ORB architecture supports serialization of uplinks/downlinks to optimize
communication power dissipation and is shown to reduce transfer latency and power
consumption compared to a pipelined, electrical, bus-based communication archi-
tecture at the 22 nm CMOS technology node.

System Integration and Optical-Enhanced MPSoC Performance

The concepts of system integration, multilevel performance/power analyses, and
network/application scalability, introduced in the previous part, are taken further in
the final part of this book. As indicated in the very first chapter, the most important
bottlenecks to the performance of next-generation MPSoCs will be the power
efficiency and the available communication speed between cores. Hence, as a can-
didate solution for the communication infrastructure of the SoC, the development of
proper hierarchical models and tools for the design and analysis of optical networks
on chip, taking into account their heterogeneous nature, becomes a necessity.
Chapter 6 studies a class of optical interconnect employing a single central pas-
sive-type optical router using wavelength division multiplexing as a routing mechanism.

Using this as a platform, the authors develop a novel 4-layered hardware stack
architecture consisting of the physical layer, the physical-adapter layer, the data link
layer, and the network layer, allowing the modular design of each building block
and boosting the interoperability and design reuse. Crucial to proving the industrial
viability of the approach, the authors have made significant effort to model and
integrate the proposed protocol stack within an industrial simulation environment
(ST OCCS GenKit) using an industrial standard (VSTNoC) protocol. As in
Chap. 3, the authors use this approach to introduce the micro-architecture of a
new electrical distributed router as a wrapper for the ONoC and evaluate the perfor-
mance of the layered architecture both at the system level (for network latency and
throughput) and at the physical (optical) level. Experimental results prove the scal-
ability of the network and demonstrate that it is able to deliver comparable or even
better bandwidth for large network sizes.
In Chap. 7, the authors exploit application characteristics by examining the
behavior of on-chip network traffic to understand how its locality in space and time
can be advantageously exploited by slowly reconfiguring networks, such as a
reconfigurable photonic NOC. The authors provide implementation details and a
performance and power characterization in which the topology is adapted automati-
cally (at the microsecond scale) to the evolving traffic situation by use of silicon
microrings.
Finally, while the previous chapter focused on exploiting application character-
istics, Chap. 8 explores new physical integration strategies by coupling the optical
interconnect concept to the emerging paradigm of 3DICs. The authors investigate
design trade-offs for a 3D MPSoC using a specific optical interconnect layer and
highlight current and short-term design trends. A system-level design space explo-
ration flow is also proposed, taking routing capabilities of optical interconnect into
account. The resulting application-to-architecture mappings demonstrate the
benefits of the 3D MPSoC architectures and the efficiency of the system-level
exploration flow.
We would like to take this opportunity to thank all the contributors of this book
for having undertaken the writing of each chapter and for their patience during the
review process. We also wish to extend our appreciation to the team at Springer for
their editorial guidance as well as of course for giving us the opportunity to compile
this book together.

Ecully, France Ian O'Connor


Montreal, QC, Canada Gabriela Nicolescu
Contents

Part I Basics for High-Performance Computing and Optical Interconnect

1 Interconnect Issues in High-Performance Computing
Architectures .......................................................................................... 3
Alberto Scandurra
2 Technologies and Building Blocks for On-Chip
Optical Interconnects............................................................................. 27
Wim Bogaerts, Liu Liu, and Gunther Roelkens

Part II On-Chip Optical Communication Topologies

3 Designing Chip-Level Nanophotonic Interconnection Networks ...... 81
Christopher Batten, Ajay Joshi, Vladimir Stojanović,
and Krste Asanović
4 FONoC: A Fat Tree Based Optical Network-on-Chip
for Multiprocessor System-on-Chip ..................................................... 137
Jiang Xu, Huaxi Gu, Wei Zhang, and Weichen Liu
5 On-Chip Optical Ring Bus Communication Architecture
for Heterogeneous MPSoC .................................................................... 153
Sudeep Pasricha and Nikil D. Dutt

Part III System Integration and Optical-Enhanced MPSoC Performance

6 A Protocol Stack Architecture for Optical Network-on-Chip:
Organization and Performance Evaluation ......................................... 179
Atef Allam and Ian O'Connor


7 Reconfigurable Networks-on-Chip ....................................................... 201
Wim Heirman, Iñigo Artundo, and Christof Debaes
8 System Level Exploration for the Integration of Optical
Networks on Chip in 3D MPSoC Architectures .................................. 241
Sébastien Le Beux, Jelena Trajkovic, Ian O'Connor,
Gabriela Nicolescu, Guy Bois, and Pierre Paulin

Index ................................................................................................................ 263


Contributors

Atef Allam Ecole Centrale de Lyon - Lyon Institute of Nanotechnology,
University of Lyon, Ecully, France
Iñigo Artundo Universidad Politécnica de Valencia, Valencia, Spain
Krste Asanović University of California at Berkeley, Berkeley, CA, USA
Christopher Batten Cornell University, Ithaca, NY, USA
Wim Bogaerts Ghent University – IMEC, Ghent, Belgium
Guy Bois École Polytechnique de Montreal, Montreal, QC, Canada
Christof Debaes Vrije Universiteit Brussel, Brussel, Belgium
Nikil D. Dutt University of California, Irvine, CA, USA
Huaxi Gu Hong Kong University of Science and Technology, Hong Kong, China
Wim Heirman University of Ghent, Ghent, Belgium
Ajay Joshi Boston University, Boston, MA, USA
Sébastien Le Beux Lyon Institute of Nanotechnology, University of Lyon,
Ecully, France
Liu Liu South China Normal University, Guangzhou, China
Weichen Liu Hong Kong University of Science and Technology, Hong Kong, China
Gabriela Nicolescu École Polytechnique de Montreal, Montreal, QC, Canada
Ian O'Connor Ecole Centrale de Lyon - Lyon Institute of Nanotechnology,
University of Lyon, Ecully, France
Sudeep Pasricha Colorado State University, Fort Collins, CO, USA


Pierre Paulin STMicroelectronics, Ottawa, ON, Canada


Gunther Roelkens Ghent University – IMEC, Ghent, Belgium
Alberto Scandurra OCCS Group, STMicroelectronics, Catania, Italy
Vladimir Stojanović Massachusetts Institute of Technology, Cambridge,
MA, USA
Jelena Trajkovic École Polytechnique de Montreal, Montreal, QC, Canada
Jiang Xu Hong Kong University of Science and Technology, Hong Kong, China
Wei Zhang Nanyang Technological University, Singapore
Part I
Basics for High-Performance Computing
and Optical Interconnect
Chapter 1
Interconnect Issues in High-Performance
Computing Architectures

Alberto Scandurra

Abstract Systems on chip (SoCs) are complex systems containing billions of transis-
tors integrated in a single silicon chip, implementing highly complex functionalities by
means of a variety of modules communicating with the system memories and/or between
them through a proper communication system. Integration density is now so high that
many issues arise when a SoC has to be implemented, and the electrical limits of inter-
connect wires are a limiting factor for performance. The main SoC building-block to be
affected by these problems is the on-chip communication system (or on-chip intercon-
nect), whose task is to ensure effective and reliable communication between all the
functional blocks of the SoC. A novel methodology aiming at solving the problems
mentioned above consists of splitting a complex system over multiple dice, exploiting the
so-called system in package (SiP) approach and opening the way to dedicated high-
performance communication layers such as optical interconnect. This chapter deals with
the SoC technology, describes current solutions for on-chip interconnect, illustrates the
issues faced during the SoC design and integration phases and introduces the SiP con-
cept and its benefits.

Keywords System on chip (SoC) • Interconnect • Bus • Network on chip (NoC) •
Integration • System in package (SiP)

A. Scandurra (*)
OCCS Group, STMicroelectronics,
Stradale Primosole 50, 95121, Catania, Italy
e-mail: Alberto.scandurra@st.com

Outlook

Systems on chip (SoCs) are complex systems containing billions of transistors
integrated in a single silicon chip, implementing highly complex functionalities by
means of a variety of modules communicating with the system memories and/or
between them through a distinct and organized communication system.
Ever-increasing integration density has led to the emergence of many issues in the
implementation of systems on chip, not least the electrical limits of interconnect
wires as a limiting factor for performance. In this context, a new technology is
required for on-chip interconnect, in order to overcome current physical and
performance issues.
In order to cover all the topics introduced above, this chapter is organized as
follows:
Section "Introduction to Systems on Chip" describes the SoC as the modern
approach for designing and integrating complex systems.
Section "On-Chip Communication Systems" deals with the SoC communication
infrastructure, illustrating the concepts of the on-chip bus and network on chip.
Section "SoC Performance and Integration Issues" describes physical and per-
formance issues usually met during the SoC integration phase.
Section "The Interconnect Bottleneck" describes how the interconnect, rather
than logic gates, is now the major origin of performance and physical issues.
Section "3D Interconnect" deals with systems in package and die-to-die
communication.

Introduction to Systems on Chip

The system on chip (SoC) is now the essential solution for delivering competitive
and cost-efficient performance in today's challenging electronics market. Consumers
using PCs, PDAs, cell-phones, games, toys and many other products demand more
features, instant communications and massive data storage in ever smaller and more
affordable products. The unstoppable drive in silicon fabrication has delivered tech-
nology to meet this demand: chips with hundreds of millions of gates using 130 nm
processes are no more than the size of a thumbnail. These SoCs present one of the
biggest challenges that engineers have ever faced; how to manage and integrate
enormously complex designs that combine the richest imaginable mix of micropro-
cessors, memories, buses, architectures, communication standards, protocol proces-
sors, interfaces and other intellectual property components where system level
considerations of synchronization, testability, conformance and verification are cru-
cial. Integrated circuit (IC) design has become a multi-million-gate challenge for
which the demands on design teams are ever greater.
The techniques used in designing multi-million-gate SoCs employ the world's
most advanced electronic design automation (EDA), with a level of sophistication
that requires highly trained and experienced engineers. Key issues to be managed in
the design process include achieving timing closure that accounts for wire delays in
the metal interconnects inside the chip, and design for test so that the chips can be
manufactured economically. Early prediction of the right architecture, design-flow
and best use of EDA solutions is required to achieve first-silicon success and to
decrease the time-to-market from years to months.
Fig. 1.1 Typical organization of a SoC: initiators (processors, real-time blocks, DMAs), external fast memories, slow memories and peripheral controllers, linked by the on-chip communication system

The building-blocks of a SoC can be distinguished as initiators or processing
elements (PEs), targets or storage elements (SEs), and communication infrastruc-
ture blocks, composing as a whole the on-chip interconnect (see Fig. 1.1); initiators
represent all blocks able to generate traffic, i.e., write data into a SE and read data
from a SE; targets are blocks able to manage the traffic generated by the initiators.
Among the initiators of the system the following classes can be identified:
Processors
Real time initiators
DMAs (direct memory access)
Processors, such as the ST20, ST40, ST50 and LX from STMicroelectronics,
have strict requirements in terms of latency and bandwidth, and their bandwidth
must further be in some way limited to allow the other initiators to be serviced.
Real time initiators, such as audio/video blocks, are more latency-tolerant than
processors, but have strict requirements in terms of bandwidth.
DMAs do not have any particular requirements in terms of latency or bandwidth,
and can normally work using the remaining bandwidth, i.e. the part of the band-
width not used by the processors and real time initiators.
Among the targets the following classes can be identified:
External fast memories
Internal slow memories
Peripherals
External fast memories comprise high performance memories such as SDRAM
(synchronous dynamic random access memory) and DDR (double data rate) SDRAM,
used mainly for real time applications (e.g. video), and today operating at around
400 MHz. Their speed is limited by physical constraints imposed by pads.
Slow memories are usually low-performance memories such as SRAM and Flash,
used for the storage of huge amounts of data, whose access is managed by caches, and
operating at around 200 MHz. Their speed is limited by application requirements.

Peripherals are slow devices such as I2C and Smartcard interfaces, used where no
high performance is required, and operating at around 50–100 MHz.
Normally the CPUs run at the highest speed and the memory system represents
the SoC bottleneck in terms of performance.
Hence within a single chip, different circuit islands run at different frequen-
cies; this approach is called GALS (globally asynchronous locally synchronous)
and is widely used today. The different clock frequencies required to operate the
various subsystems are generated by the clock generator (clockgen), while the sub-
systems are linked together by the on-chip interconnect, such as the STBus/STNoC
[1] in the case of STMicroelectronics products. Typically the on-chip interconnect
optimizes the CPU path, i.e. the interconnect structure normally operates at the
same frequency as the CPU. Since the other subsystems often operate at a different
frequency, dedicated frequency converters have to be placed between the intercon-
nect and the other subsystems to enable inter-block communication.

On-Chip Communication Systems

As already shown in Fig. 1.1, a SoC can be seen as a number of intellectual proper-
ties (IPs) properly connected by an on-chip communication architecture (OCCA),
an infrastructure that interconnects the various IPs and provides the communication
mechanisms necessary for distributed computation over a set of heterogeneous pro-
cessing modules. The throughput and latency of the communication infrastructure,
and also the relevant power consumption, often limit the overall SoC performance.
Until now the prominent type of OCCA has been the on-chip bus, such as the
STBus from STMicroelectronics, the AMBA bus from ARM [2], CoreConnect
from IBM [3], which represent the traditional shared-communication medium. This
type of OCCA, while not at all scalable, has been able to fulfill SoC requirements
because the performance bottleneck has always been the memory system. However,
with the growing requirements of more modern SoCs and CMOS technology scal-
ing, the performance bottleneck is moving from memories to interconnect, as
detailed in Sect. 4.
In order to overcome this limit, a new generation architecture, called network on
chip (NoC), has been deeply studied and proposed; it is an attempt to translate the
networking and parallel computing domain experience into the SoC world, relying on
a packet-switched micro-network backbone based on a well-defined protocol stack.
Innovative NoC architectures include STNoC from STMicroelectronics [4], Æthereal
from Philips Research Lab [5], and Xpipes from the University of Bologna [6].

On-Chip Bus

On-chip buses are communication systems composed of intelligent logic, respon-
sible for arbitration among the possible traffic flows injected by the different SoC

initiators (PEs able to generate traffic), and a set of physical channels through which
the traffic flows are routed from initiators to targets (PEs able to receive and process
traffic) and vice versa.
The peculiarities of a bus, which are also the main drawbacks, are:
Limited available bandwidth, given by the product of the bus size (width) and the
bus operating frequency (a short numerical sketch follows this list). Achieving a
higher available bandwidth implies either widening the bus, thereby amplifying
physical issues such as wire congestion, or increasing the operating frequency,
which leads to increased power consumption and is moreover limited by physical
issues such as capacitive load and capacitive coupling.
Lack of bandwidth scalability, since connecting more IPs to the bus implies
dividing the total available bandwidth among all the IPs, thereby allocating a
lower bandwidth to each of them.
Limited system scalability, since connecting more IPs to the bus results in an
increase of the capacitive load, which leads to a drop in operating frequency.
Limited quality of service, since there is no possibility to process different classes
of traffic (such as low latency CPUs, high bandwidth video/audio processors,
DMAs) in a different way.
High occupation area, due to the large number of wires required to transport all
the protocol information, i.e. data and control signals (STBus interfaces for
example are characterized by hundreds of wires).
High power consumption, which is determined by the switching activity and
potentially affects all the wires of the bus.
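
As a rough numerical illustration of the first two drawbacks above, consider the minimal Python sketch below; the 64-bit width, 400 MHz clock and IP counts are assumed figures chosen for illustration, not values given in this chapter.

    # Available bus bandwidth = bus width x operating frequency,
    # and this total is shared among all connected IPs.
    bus_width_bits = 64
    clock_hz = 400e6

    available_bw = bus_width_bits * clock_hz / 8            # bytes per second
    print(f"available bandwidth: {available_bw / 1e9:.1f} GB/s")   # 3.2 GB/s

    for n_ips in (4, 8, 16):
        # connecting more IPs divides the same total bandwidth among them
        print(f"{n_ips:2d} IPs -> {available_bw / n_ips / 1e9:.2f} GB/s each")

Doubling the width or the frequency doubles the total, but only at the cost of the congestion, power and capacitive-load issues listed above.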

Network on Chip

The new requirements of modern applications impose the need for new solutions to
overcome the previously mentioned drawbacks of on-chip buses, both for the clas-
sic shared-bus (such as AMBA AHB) and the more advanced communication sys-
tems supporting crossbar structures (such as the STBus). In conjunction with the
most recent technology features, a novel on-chip communication architecture, called
network on chip (NoC), has been proposed.
It is important to highlight that the NoC concept is not merely an adaptation to
the SoC context of parallel computing or wide area network domains; many issues
are in fact still open in this new field, and the highly complex design space requires
detailed exploration. The key open points are, for instance, the choice of the net-
work topology, the message format, the end-to-end services, the routing strategies,
the flow control and the queuing management. Moreover, the type of quality of
service (QoS) to be provided is another open issue, as is the most suitable software
view to allow the applications to exploit NoC infrastructure peculiarities.
From lessons learned by the telecommunications community, the global on-chip
communication model is decomposed into layers similar to the ISO-OSI reference
model (see Fig. 1.2). The protocol stack enables different services and allows QoS,
providing to the programmer an abstraction of the communication framework.

Fig. 1.2 The ISO-OSI protocol stack: application, presentation, session, transport, network, data link and physical layers


Layers interact through well-defined interfaces and they hide any low-level physical
DSM (Deep SubMicron) issues.
The Physical layer refers to all that concerns the electronic details of wires, the
circuits and techniques to drive information (drivers, repeaters, layout), while the
Data link layer ensures reliable transfer despite the physical unreliability and deals
with medium access (sharing/contention). At the Network level there are issues
related to the topology and the consequent routing scheme, while the Transport
layer manages the end-to-end services and the packet segmentation/re-assembly.
The other levels, up to the Application layer, can be viewed as a sort of merged
adaptation layer that implements (in hardware or through part of an operating sys-
tem) services and exposes the NoC infrastructure according to a proper program-
ming model [e.g. the message passing (MP) paradigm].
Despite the similarity discussed above, it is clear that the micro-network in the
single chip domain differs from the wide-area network. Distinct features of NoCs
include the spatial locality of connected modules, the reduced non-determinism of
the on-chip traffic, the stringent energy and latency constraints, the possibility of
application specific stack services, and the need for low cost solutions.
An open issue in NoC literature is the trade-off between the QoS provided by the
network and the relevant implementation cost. QoS must be supported at all layers,
and basic services are a fixed bandwidth, a maximum latency, the correctness (no
errors) and the completion (no packet loss) of the transmission. Another approach
consists of using a best effort service strategy, which allows for a better average
utilization but cannot support a QoS. Since users demand application predictability,
mixing both approaches could be a good solution.

Fig. 1.3 Various NoC topologies

NoC communication is packet-based and the generally accepted forwarding
scheme is wormhole, because it allows for a deeper pipeline and a reduced buffering
cost. Packets are divided into basic units called flits; the queues in each node have flit
granularity and the physical node-to-node links are managed by a flow control that
works on a flit per flit basis.
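
The flit-level flow control just described can be sketched as a toy Python model of a single link using credit-based flow control (one of the schemes mentioned later in this section); the buffer depth and packet length are invented for illustration.

    from collections import deque

    packet = [f"flit{i}" for i in range(6)]   # header, body and tail flits of one packet
    credits = 2                               # free buffer slots in the downstream node
    in_flight = deque()                       # flits currently buffered downstream

    def try_send(flits, credits):
        # a flit advances only when the downstream buffer has a free slot (a credit)
        if flits and credits > 0:
            in_flight.append(flits.pop(0))
            return credits - 1
        return credits

    def drain_one(credits):
        # the downstream router forwards one flit and returns the credit
        if in_flight:
            in_flight.popleft()
            return credits + 1
        return credits

    cycle = 0
    while packet or in_flight:
        credits = try_send(packet, credits)
        credits = drain_one(credits)
        cycle += 1
    print(f"packet delivered in {cycle} cycles")

The packet advances flit by flit without ever needing buffer space for the whole packet, which is precisely what makes wormhole forwarding cheap.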
Another key point is the network topology, which has to be regular and simple.
The literature points to hybrid solutions, with local clusters based on shared buses,
and global communication using NoC. Some NoC state-of-the-art projects are based
on the simple ring, two-dimensional mesh, fat tree [7], and Octagon [4] topologies,
as shown in Fig. 1.3.
As far as the routing policy is concerned, it is possible to choose between deter-
ministic, adaptive, source, arithmetic or table-driven schemes; deadlock handling is
topology dependent. Input queues are suitable for a low cost implementation, but
they show limited performance with respect to output buffering. In terms of control
flow, many solutions select a simple request/grant scheme, others a more efficient
credit-based one. Links can be noisy channels, so the literature begins to present
work on error detection code or error correction code applied to on-chip intercon-
nections, with distributed or end-to-end error recovery strategies.
Besides routers, a significant amount of area is consumed by the so-called net-
work interface (NI) that is the access to the NoC, translating the connected IP
transactions to packets that are exchanged in the network. The NI hides network
dependent aspects to the PE, covering the transport layer (connection handling, de-
assembling of messages, higher level services).
To summarize, the main benefits of the NoC approach are:
Modularity, thanks to standard basic components, the NI and the Router
Abstraction as an inherent property of the layered approach, fitting also the
demands of QoS

Flexibility/scalability of the network as a benefit of packet-based communication
Regular and well controlled structure to cope with DSM issues
Re-use of the communication infrastructure viewed as a platform

Topology

A first parameter for the topology is its scalability; a topology is said to be scalable
if it is possible to create larger networks of any size, by simply adding new nodes.
Two different approaches can be followed for the specification of the topology of a
NoC: topology-dependent and topology-independent. The former approach specifies
the network architecture and its building blocks assuming a well defined topology.
The latter aims at providing flexibility to the SoC architect in choosing the topology
for the interconnect, depending on the application. This means that it is possible to
build any kind of topology by plugging together the NoC building-blocks in the
proper way. While this second approach is more versatile because of the higher
configurability allowed, it also has the following drawbacks:
A very wide design and verification space, which would require significant effort
to ensure a high quality product to the NoC user.
Exposure of the complexity of the network layer design (including issues such as
deadlock) to the SoC architect, thus requiring novel specific competencies and a
high effort in defining an effective (in terms of performance) and deadlock-free
architecture.
A need for highly parametric building blocks, with few cost optimization possibilities.
Moreover, a NoC built on top of a specific topology still needs a high degree of
flexibility (routing, flow control, queues, QoS) in order to properly configure the
interconnect to support different application requirements.

Routing Algorithms

Routing algorithms are responsible for the selection of a path from a source node to
a destination node in a particular topology of a network. A good routing algorithm
balances the load across the various network channels even in the presence of non-
uniform and heavy traffic patterns. A well designed routing algorithm also keeps
path lengths as short as possible, thus reducing the overall latency of a message.
Another important aspect of a routing algorithm is its ability to operate in the
presence of faults in the network. If a particular algorithm is hardwired into the rout-
ers and a link or node fails, the entire network fails. However, if the algorithm can
be reprogrammed or adapted to bypass the failure, the system can continue to oper-
ate with only a slight loss in performance.
Routing algorithms are classified depending on how they select between the possible
paths from a source node to a destination node. Three main categories are specified:

Deterministic, where the same path is always chosen between a source and a
destination node, even if multiple paths exist.
Oblivious, where the path is chosen without taking into account the present state
of the network; oblivious routing algorithms include deterministic routing algo-
rithms as a subset.
Adaptive, where the current state of the network is used to select the path.
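
As a concrete instance of the deterministic category, the Python sketch below implements dimension-order (XY) routing on a two-dimensional mesh, one of the topologies shown in Fig. 1.3; the coordinates are invented and the routing function itself is not taken from this chapter. The packet always travels along X first and then along Y, so the path between a given source and destination never changes.

    # Deterministic dimension-order (XY) routing on a 2D mesh:
    # route fully along X first, then along Y.
    def xy_route(src, dst):
        x, y = src
        path = [src]
        while x != dst[0]:                 # move along X first
            x += 1 if dst[0] > x else -1
            path.append((x, y))
        while y != dst[1]:                 # then along Y
            y += 1 if dst[1] > y else -1
            path.append((x, y))
        return path

    print(xy_route((0, 0), (2, 3)))
    # [(0, 0), (1, 0), (2, 0), (2, 1), (2, 2), (2, 3)]

Because the choice never depends on network state, XY routing is simple and deadlock-free on a mesh, but it cannot balance load around congested or faulty links the way adaptive schemes can.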

Deadlock

A deadlock occurs in an interconnection network when a set of packets are unable
to make any progress because they are waiting for one another to release network
resources, such as buffers or channels.
Deadlock is a catastrophic event for the network. After a few resources are kept
busy by deadlocked packets, other packets get blocked on these resources, thus
paralyzing the network operation. In order to prevent such a problem, two solutions
can be put into place:

Deadlock avoidance, a method to guarantee that the network cannot become deadlocked.
Deadlock recovery, a method consisting of detecting and correcting deadlock.

If deadlock is caused by dependencies external to the network, it is called high-level
deadlock or protocol deadlock (hereafter we term low-level deadlock as that related to
the dependencies of the topology plus the relevant routing algorithm). For instance a
simple request/response protocol could lead to deadlock conditions when dependencies
occur in target devices between the incoming requests and the outgoing responses.
A network must always be free of deadlock, livelock, and starvation. A livelock
refers to packets circulating the network without making any progress towards their
destination. Starvation refers to packets indefinitely waiting at a network buffer (due
to an unfair queuing policy). Both livelock and starvation reflect problems of fair-
ness in network routing or scheduling policies.
As far as deadlock is concerned, in the case of deterministic routing, deadlock is
avoided by eliminating cycles in the resource dependency graph; this is a directed
graph, which depends on the topology and the routing, where the vertices are the
resources and the edges represent the relationships due to the routing function. In
the case of wormhole packet switching, these resources are the virtual channels; so
we talk about a virtual channel dependency graph. A virtual channel (VC) provides
logical links over the same shared physical channels, by establishing a number of
independently allocated flit buffers in the corresponding transmitter/receiver nodes.
When the physical link is not multiplexed among different VCs, the resource depen-
dency graph could be simply called a channel dependency graph.
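
The acyclicity condition above can be illustrated with a toy Python check: build the channel dependency graph as a dictionary of directed edges and search it for cycles. The four-channel ring used here is an invented example, not a topology analysed in this chapter.

    # A deadlock-free routing function must leave the channel
    # dependency graph acyclic.
    def has_cycle(graph):
        WHITE, GREY, BLACK = 0, 1, 2
        color = {node: WHITE for node in graph}

        def visit(node):
            color[node] = GREY
            for nxt in graph.get(node, []):
                if color.get(nxt, WHITE) == GREY:
                    return True            # back edge -> cycle -> possible deadlock
                if color.get(nxt, WHITE) == WHITE and visit(nxt):
                    return True
            color[node] = BLACK
            return False

        return any(color[n] == WHITE and visit(n) for n in graph)

    # channels of a 4-node ring with unrestricted routing: cyclic dependencies
    ring = {"c0": ["c1"], "c1": ["c2"], "c2": ["c3"], "c3": ["c0"]}
    print(has_cycle(ring))                  # True  -> deadlock is possible
    # removing the last dependency (e.g. via an extra virtual channel) breaks the cycle
    print(has_cycle({**ring, "c3": []}))    # False -> deadlock-free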
Protocol (or high-level) deadlock refers to a deadlock condition due to resource
dependencies external to the network. For instance, when a request-response proto-
col, such as STBus from STMicroelectronics or AMBA AXI from ARM, is adopted
as end-to-end in the network, a node connected as target introduces dependencies

between incoming requests and outgoing responses: the node does not perform as a
sink for incoming packets, due to the finite size of the buffers and the dependencies
between requests and responses.
In shared memory architectures, complex cache-coherent protocols could lead to
a deeper level of dependencies. The effect of these protocol dependencies can be
eliminated by using disjoint networks to handle requests and replies. The following
two approaches are possible:

Two physical networks, i.e., separated physical data buses for requests and
responses.
Two virtual networks, i.e., separated virtual channels for requests and responses.

Quality of Service

The set of services requested by the IPs connected to the network (called network
clients) and the mechanisms used to provide these services are commonly referred
to as QoS.
Generally, it is useful to classify the traffic across the network into a number of
classes, in order to efficiently allocate network resources to packets. Different
classes of packets usually have different requirements in terms of importance, toler-
ance to latency, bandwidth and packet loss.
Two main traffic categories are specified:
Guaranteed service
Best effort
Traffic classes belonging to the former category are guaranteed a certain level of
performance as long as the injected traffic respects a well-defined set of constraints.
Traffic classes belonging to the latter category do not get any strong guarantee:
the network will simply make its best effort to deliver the packets to their
destinations. Best effort packets may then have arbitrary delay, or even be
dropped.
The key quality of service concern in implementing best effort services is provid-
ing fairness among all the best effort flows. Two alternative solutions exist in terms
of fairness:
Latency-based fairness, aiming at providing equal delays to flows competing for
the same resource.
Throughput-based fairness, aiming at providing equal bandwidth to flows com-
peting for the same resource.
While latency-based fairness can be achieved by implementing a fair arbitration scheme
[such as round-robin or least recently used (LRU)], throughput-based fairness can be
achieved in hardware by separating each flow requesting a resource into a separate
queue, and then serving the queues in round-robin fashion. The implementation of such
a separation can be expensive; in fact while physical channels (links) do not have to be
replicated because of their dynamic allocation, virtual channels and buffers, requiring
FIFOs, have to be replicated for each different class of traffic. So it is very important to
choose the proper number of classes needing true isolation, keeping in mind that in
many situations it may be possible to combine classes without a significant degradation
of quality of service but gaining a reduction in hardware complexity.
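
A minimal Python sketch of the throughput-based scheme just described, with one queue per traffic class served in round-robin order; the class names and queue contents are invented.

    from collections import deque
    from itertools import cycle

    queues = {
        "cpu":   deque(["c1", "c2"]),
        "video": deque(["v1", "v2", "v3", "v4"]),   # heavy flow
        "dma":   deque(["d1"]),
    }

    order = cycle(queues)                  # fixed round-robin visiting order
    served = []
    while any(queues.values()):
        q = queues[next(order)]
        if q:
            served.append(q.popleft())     # grant one packet per turn

    print(served)   # ['c1', 'v1', 'd1', 'c2', 'v2', 'v3', 'v4']

Even though the video class injects most of the traffic, the service order interleaves all classes, which is exactly the isolation the per-class FIFOs pay for in area.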

Error Recovery

A high performance, reliable and energy efficient NoC architecture requires a good
utilization of error-avoidance and error-tolerance techniques, at most levels of its
layered organization. Using modern technologies to implement the present day sys-
tems (in order to improve performance and reduce power consumption) means
adopting lower levels of power supply voltage, leading to lower margins of noise
immunity for the signals transmitted over the communication network of the sys-
tem. This leads to a noisy interconnect, which behaves as an unreliable transport
medium, and introduces errors in the transmitted signals. So the communication
process needs to be fault-tolerant to ensure correct information transfer. This can be
achieved through the use of channel coding. Such schemes introduce a controlled
amount of redundancy in the transmitted data, increasing its noise immunity.
Linear block codes are commonly used for channel encoding. Using an (n, k)
linear block code, a data block of length k bits is mapped onto an n-bit code word,
which is transmitted over the channel. The receiver examines the received signal
and declares an error if it is not a valid code word.
Once an error has been detected, it can be handled in one of two different ways:
Forward error correction (FEC), where the properties of the code are used to cor-
rect the error.
Retransmission, also called automatic repeat request (ARQ), where the receiver
asks the sender to retransmit the code word affected by the error.
FEC schemes require a more complex decoder, while ARQ schemes require the
existence of a reverse channel from the receiver to the transmitter, in order to ask for
the retransmission.
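
As a toy illustration of the two options, the Python sketch below uses the simplest possible (n, k) linear block code, a (3, 1) repetition code; a real NoC link would use a stronger code, but the FEC-versus-ARQ distinction is the same.

    # Toy (3,1) repetition code: each data bit is transmitted three times.
    def encode(bit):
        return [bit, bit, bit]

    def decode_fec(word):
        # forward error correction: majority vote corrects any single bit error
        return 1 if sum(word) >= 2 else 0

    def decode_arq(word):
        # error detection only: if the copies disagree, ask for retransmission
        return word[0] if word[0] == word[1] == word[2] else "RETRANSMIT"

    received = [1, 0, 1]          # one bit flipped by noise on the link
    print(decode_fec(received))   # 1  (error corrected at the receiver)
    print(decode_arq(received))   # RETRANSMIT (error detected, sender asked again)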

SoC Performance and Integration Issues

In decananometric CMOS technologies, DSM effects are significant and the physi-
cal design of a SoC is increasingly faced with two types of issue:
Performance issues, related mainly to the bandwidth requirements of the different
IPs, that in order to be fulfilled, would require SoCs to run at very high speeds.
Integration issues, related to the difficulties encountered mainly during the place-
ment of the hard macros and the standard cells, and during the routing of clock
nets and communication system wires.

Performance Issues

New generation systems will be composed of functional building blocks with a
computation capability requiring a very high bandwidth (i.e. the number of bytes
transferred per time unit) compared to those currently exploited. Bandwidth increase
can be obtained in a variety of ways:
Increasing the physical channel size
Increasing the clock frequency
While this can be done with a few problems at IP level, for example with wider
interfaces and/or faster transmission frequency, various problems affect the com-
munication system to achieve the same target (the so called offered throughput),
mainly in terms of congestion and crosstalk.
In fact, wider physical channels imply the need to route a higher number of wires
between different points of the chip, resulting in routing and congestion issues.
Increasing transmission frequency results in a higher level of energy coupling effects
(crosstalk) between wires, leading to corruption of the transmitted signal. This is
true for both bus-based interconnects and Networks on Chip, where the offered
throughput is the aggregated throughput of all the links between different nodes.
The throughput an on-chip interconnect can offer is also limited by physical
implications. As far as the overall operating frequency of a SoC is concerned, two
main factors influence it, namely the device switching times and the bandwidth
offered by metallic wires. Current technologies can achieve unprecedented transis-
tor transition frequencies due to short transistor lengths. However, the same is not
true for interconnects. Indeed, continually shrinking feature sizes, higher clock fre-
quencies, and growth in complexity are all negative factors as far as switching
charges on metallic interconnect are concerned. This situation is shifting the IC
design bottleneck from computing capability to communication.
Feature sizes on integrated circuits and also, therefore, circuit speed have fol-
lowed Moores law for over four decades and CMOS integration capability is still
increasing. In this respect, according to the international technology roadmap for
semiconductors (ITRS) [8], the RC time constants associated with metallic inter-
connects will not be able to decrease sufficiently for the high-bandwidth applica-
tions destined to appear in the next few years (see Fig. 1.4).
Internal data rates of processors fabricated in deep submicron CMOS technology
have exceeded gigahertz rates. While processing proceeds at GHz internally,
off-chip wires have held inter-chip clock rates to hundreds of MHz.

Integration Issues

Figure 1.5 is an illustration of the physical issues; it shows the floorplan of an exam-
ple CMOS chip for a consumer application.

Fig. 1.4 Average interconnect delay as a function of process

In this figure the rectangles represent the various IPs of the chip (both initiators
and targets); the space available for the communication system is the very irregular
shape between all the different IPs. In such an area the Network Interfaces, repre-
senting the access points of the IPs to the on-chip network, the nodes, responsible
for arbitration and propagation of information, and all the physical channels con-
necting the different NoC building-blocks have to be placed. Because of the shape,
which is quite irregular and with thin regions, and the area size, it is evident that the
placement of the interconnect standard cells can be difficult, and the routing of the
wires, which can also be very long, will likely suffer congestion.

Electrical Interconnect Classification

From a technological point of view, interconnects can be classified into the following
categories (see Fig. 1.6):
Local interconnect, used for short-distance communication, typically between
logic units, and comprising the majority of on-chip wires; they have the smallest
pitch and a delay of less than one clock cycle.
Global interconnect, providing communication between large functional blocks
(IPs); they are fewer than local interconnects, but are no less important. Improving
the performance of a small number of critical global links can significantly
enhance the total system performance. Global interconnects have the largest
pitch and a delay typically longer than one or two clock cycles.

Fig. 1.5 Example CMOS chip floorplan
Intermediate interconnect, having dimensions that are between those of local and
global interconnects.
A key difference between local and global interconnect is that the length of the
former scales with the technology node, while for the latter the length is approxi-
mately constant.
From a functional point of view, the two most important and performance-
demanding applications of interconnects in SoC are signaling (i.e. the communica-
tion of different logic units) and clock distribution. In this context they can be
classified as:
Point-to-point links, used for critical data-intensive links, such as CPU-memory
buses in processor architectures.
Fig. 1.6 Interconnect classification

Broadcast links, representing physical channels where the number of receivers
(and therefore repeaters) is high and switching activity is also high.
Network links, targeted at system buses and reconfigurable networks, aiming at
serving complete system architectures, whose typical communication is around
several tens of GB/s.

The Interconnect Bottleneck

The continuous evolution and scaling down of CMOS technologies has been the
basis of most of today's information technologies. It has allowed the improvement
of the performance of electronic circuits, increasing their yield and lowering the
cost per function on chip. Through this, the processing and storage of information
(in particular digitally encoded information) has become a cheap commodity.
Computing powers not imaginable only a few years ago have been brought to the
desktops of every researcher and every engineer. Electronic ICs and their ever
increasing degree of integration have been at the core of our current knowledge-
based society and they have formed the basis of a large part of the growth of
efficiency and competitiveness of large as well as small industries.
Continuing this evolution will however require a major effort. A further scaling
down of feature sizes in microelectronic circuits will be necessary. To reach this
goal, major challenges have to be overcome, and one of these is the interconnect
bottleneck.
The rate of inter-chip communication is now the limiting factor in high perfor-
mance systems. The function of an interconnect or wiring system is to distribute
clock and other signals to and among the various circuits/systems on a chip. The
fundamental development requirement for interconnect is to meet the high-speed
transmission needs of chips despite further scaling of feature sizes. This scaling
down however, has been shown to increase the signal runtime delays in the global

interconnect layers severely. Indeed, while the reduction in transistor gate lengths
increases the circuit speed, the signal delay time for global wires continues to
increase with technology scaling, primarily due to the increasing resistance of the
wires and their increasing lengths. Current trends to decrease the runtime delays,
the power consumption and the crosstalk, focus on lowering the RC-product of the
wires, by using metals with lower resistivity (e.g. Copper instead of Aluminum) and
by the use of insulators with lower dielectric constant. Examples of the latter include
nanoporous SiOC-like or organic (SiLK-type) materials, which have dielectric con-
stants below 2.0, or air-gap approaches, which reach values close to 1.7–1.8.
Integration of these materials results in an increased complexity however, and they
have inherent mechanical weaknesses. Moreover, introducing ultra low dielectric
constant materials finds its fundamental physical limit when one considers that the
film permittivity cannot be less than 1 (that of a vacuum).
Therefore, several researchers have come to the conclusion that the global inter-
connect performance needed for future generations of ICs cannot be achieved even
with the most optimistic values of metal resistivity and dielectric constants.
Evolutionary solutions will not suffice to meet the performance roadmap and there-
fore radical new approaches are needed.
Several such possibilities are now envisaged, the most prominent of which are
the use of RF or microwave interconnects, optical interconnects, 3D interconnects
and cooled conductors. The ITRS roadmap suggests that research and evaluation is
greatly needed for all these solutions for the next few years. Subsequently, a narrow-
ing down of remaining solutions and start of an actual development effort is
expected.
As has already been stated, the main limitations due to metallic interconnects are
the crosstalk between lines and the noise on transmitted signals, the delay, the con-
nection capability and the power consumption (due to repeaters). As a result, the
Semiconductor Research Corporation has cited interconnect design and planning as
a primary research thrust.

Electrical Interconnect Metrics

An ideal interconnect should be able to transmit any signal with no delay, no degra-
dation (either inherent or induced by external causes), over any distance without
consuming any power, requiring zero physical footprint and without disturbing the
surrounding environment.
According to this, a number of metrics have been defined in order to characterize
the performance and the quality of real interconnects, such as:
Propagation delay
Bandwidth density
Power-delay product
Bit error rate
Fig. 1.7 Interconnect delay per unit length (ps/mm) as a function of normalized interconnect width, for the 90, 65, 45, 32 and 22 nm technology nodes

Propagation Delay

The propagation delay is the time required by a signal to cross a wire. Pure intercon-
nect delay depends on the link length and the speed of propagation of the wavefront
(time of flight). Electrical regeneration introduces additional delay through buffers
and transistor switching times. Additionally, delay can be induced by crosstalk.
It can be reduced by increasing the interconnect width at the expense of a smaller
bandwidth density.
Technology scaling has insignificant effect on the delay of an interconnect with
an optimal number of repeaters. The minimum achievable interconnect delay
remains effectively fixed at approximately 20 ps/mm when technology scales from
90 to 22 nm, as shown in Fig. 1.7.
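
To give a feel for the numbers, for a 10 mm global wire (a length assumed here purely for illustration) at the roughly constant 20 ps/mm quoted above:

    delay ≈ 20 ps/mm × 10 mm = 200 ps

that is, a full clock period at 5 GHz and still 20% of the period at 1 GHz, before any crosstalk-induced delay is added.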

Bandwidth Density

Bandwidth density is a metric that characterizes information throughput through a
unit cross section of an interconnect. Generally, it is defined by the pitch of the
electrical wires.
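
In first-order terms (an expression assumed here for illustration, not one given explicitly in the text), for parallel wires on pitch p, each carrying a bit rate B, the bandwidth density is roughly

    bandwidth density ≈ B / p   (bit/s per unit of cross-sectional width)

so shrinking the pitch improves the density only as long as the achievable per-wire bit rate does not fall faster than the pitch.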

Power-Delay Product

Signal transmission always requires power. In the simplest case, it is required to
change the charge value on the equivalent capacitor of a metallic wire. In more

realistic cases, power will also be required in emitter and receiver circuitry, and in
regeneration circuits.
A distinction can also be made between static and dynamic power consumption
by introducing a factor α representing the switching activity of the interconnect link (0 < α < 1).
The power-delay product (PDP) is routinely used in the technology design pro-
cess to evaluate circuit performance.
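As an illustration of how these quantities combine (a minimal sketch; all numerical values are assumed, representative figures rather than data from this chapter), the dynamic power of a link can be estimated as P = α·C·V²·f, and the PDP as that power multiplied by the propagation delay:

    alpha = 0.15        # switching activity factor, 0 < alpha < 1 (assumed)
    C = 2e-12           # total wire plus repeater capacitance (F), assumed 2 pF
    V = 1.0             # supply voltage (V)
    f = 1e9             # toggle frequency (Hz)
    delay = 200e-12     # end-to-end delay (s), e.g. 20 ps/mm over 10 mm

    P_dyn = alpha * C * V ** 2 * f   # ~0.3 mW of dynamic power
    PDP = P_dyn * delay              # ~60 fJ, the power-delay product
    print(P_dyn, PDP)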

Bit Error Rate

The bit error rate (BER) may be defined as the rate of error occurrences and is the
main criterion in evaluating the performance of digital transmission systems.
For an on-chip communication system a BER of 10⁻¹⁵ is acceptable; electrical
interconnects typically achieve BER figures better than 10⁻⁴⁵. That is why the BER
is not commonly considered in integrated circuit design circles. However, future
operation frequencies are likely to change this, since the combination of necessarily
faster rise and fall times, lower supply voltages and higher crosstalk increases the
probability of wrongly interpreting the signal that was sent.
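As a hedged aside (a standard result, not taken from this chapter), the BER of a binary link is commonly related to the signal quality factor Q by BER = ½·erfc(Q/√2); the snippet below shows that a Q of about 8 already corresponds to the 10⁻¹⁵ figure quoted above:

    import math

    def ber_from_q(q):
        # Gaussian-noise approximation: BER = 0.5 * erfc(Q / sqrt(2))
        return 0.5 * math.erfc(q / math.sqrt(2))

    print(ber_from_q(8.0))   # ~6e-16, i.e. around the 1e-15 on-chip target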
Errors come from signal degradation. Real signals are characterized by their
actual frequency content and by their voltage or current value limits. The frequency
content will define the necessary channel bandwidth, according to the Shannon–Hartley
theorem. Analogue signals are highly sensitive to degradation and the
preferred mode of signal transmission over interconnect is digital.
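For reference, a minimal worked example of the Shannon–Hartley bound C = B·log₂(1 + SNR); the bandwidth and signal-to-noise ratio below are assumed values chosen purely for illustration:

    import math

    B = 10e9                           # channel bandwidth (Hz), assumed 10 GHz
    snr_db = 20                        # signal-to-noise ratio (dB), assumed
    snr = 10 ** (snr_db / 10)
    capacity = B * math.log2(1 + snr)  # ~66.6 Gbit/s upper bound on error-free rate
    print(capacity / 1e9)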
Signal degradation can be classed as time-based, inherent and externally
induced:
Time-based: non-zero rise time, overshoot, undershoot and ringing. For digital signals, these degradations can be folded into the delay term: as long as they remain quasi-deterministic and do not exceed the noise margins of the digital circuits, they can be mapped into the time domain (contributing to the regeneration delay term). This assumption will, however, break down in nanometric technologies, because of more probabilistic behavior and, especially, weaker noise margins.
Inherent: attenuation (dB/cm), skin effect, and reflections (dB).
Externally induced: crosstalk (dB/cm) and sensitivity to ambient noise.
The allowable tolerance on signal degradation and delay for a given bandwidth
and power budget forces a limit to the transmission distance. The maximum inter-
connect segment length can in fact be calculated, a segment being defined as a por-
tion of interconnect not requiring regeneration at a receiver point spatially distant
from its emission point.
Signal regeneration in turn leads to a further problem, i.e., the energy used to
propagate the signal in the transmission medium can escape into the surrounding
environment and perturb the operation of elements close to the transmission path.

3D Interconnect

The typical electronics product/system of the near future is expected to include all
the following types of building-blocks:

Digital processors (CPU)
Digital signal processors (DSP)
ASICs
Memories
Busses and NoC
Peripheral and interface devices
Analog baseband front-end
RF and microwave processing stages
Discrete components (R, L, C)
Micro-electro-mechanical-systems (MEMS)
Displays
User interfaces

Several studies and technology roadmaps have highlighted that these electronics
products of the future will be characterized by a high level of heterogeneity, in terms
of the following mix:
Technology: digital, analog, RF, optoelectronic, MEMS, embedded passives.
Frequency: from hundreds of MHz in the digital domain up to hundreds of GHz in the RF, microwave and optical domains.
Signal: digital circuits coexisting with ultra low-noise amplifier RF circuits.
Architecture: heterogeneous architectures, i.e. event driven, data driven and time
driven models of computation, regular versus irregular structures, tradeoffs
required between function, form and fit over multiple domains of computational
elements and multiple hierarchies of design abstraction.
Design: electrical design to be unified with physical and thermal design across
multiple levels of design abstraction.
In order to simplify the design and manufacturing of such complex and hetero-
geneous systems, relying on different technologies, an adequate approach would be
to split them over a number of independent dice. Some, or even many, of the dice
will need to be in communication with each other. This approach is known as sys-
tem in package (SiP) [9]; however, many almost synonymous terms are in use: high density packaging (HDP), multi-chip module (MCM), multi-chip package (MCP), few-chip package (FCP) [10]. In general the term SiP is used when
a whole system, rather than a part, is placed into a single MCM.
The SiP paradigm moves packaging design to the early phases of system design
including chip/package functionality partitioning and integration, which is a para-
digm shift from the conventional design approach. Packaging has always played an important role in electronic products manufacturing; however, in the early days its role was primarily structural in nature, whereas today and tomorrow it plays an increasingly important role in delivering the product's function and performance.

Fig. 1.8 Example of heterogeneous integration

Such a technology offers many significant benefits, including:

Footprint: more functionality fits into a small space. This extends Moore's Law and enables a new generation of tiny but powerful devices.
Heterogeneous integration: circuit layers can be built with different processes, or even on different types of wafers. This means that components can be optimized to a much greater degree than if they were built together on a single wafer. Even more interesting, components with completely incompatible manufacturing processes can be combined in a single device (see Fig. 1.8). It is worth noting that non-digital functions (memory, analog) are best built in non-digital processes, and can be integrated at low noise and low cost by placing them in a package rather than adding process steps to a single chip.
Speed: the average wire length becomes much shorter. Because propagation delay is proportional to the square of the wire length, overall performance increases.
Power: keeping a signal on-chip reduces its power consumption by 10 to 100 times. Shorter wires also reduce power consumption because they present less parasitic capacitance. Reducing the power budget leads to less heat generation, extended battery life, and lower cost of operation.

Fig. 1.9 Detail of electrical wires between dice

Design: the vertical dimension adds a higher order of connectivity and opens a world of new design possibilities (see Fig. 1.9).
Circuit security: the stacked structure hinders attempts to reverse engineer the circuitry. Sensitive circuits may also be divided among the layers in such a way as to obscure the function of each layer.
Bandwidth: the lack of memory bandwidth is increasingly becoming the pri-
mary constraint for improved system performance, in particular in multimedia
and data-intensive applications. Moreover, the random nature of memory accesses
in many applications results in relatively ineffective caches and the memory
bandwidth becoming strongly dependent on SDRAM accesses. 3D integration
allows large numbers of vertical vias between the layers. This allows the con-
struction of wide bandwidth buses between functional blocks in different layers.
A typical example would be a processor plus memory 3D stack, with the cache
memory stacked on top of the processor. This arrangement allows a bus much
wider than the typical 128 or 256 bits between the cache and processor. Wide
buses in turn alleviate the memory wall problem. Figure 1.10 highlights the com-
munication wires between two dice, in both cross section view and top view.

Summarizing, system in package technology offers the possibility to significantly improve the overall system performance when the system is too large to fit on a single chip, or when the system is a mixed-signal one and putting everything onto a single chip is not technologically possible.
However, in spite of the significant advantages the SiP approach gives with
respect to the more traditional SoC paradigm, the fact that chip count, clock speed
and number of I/O per chip are growing rapidly in electronic systems is pushing the
limits of electrical I/O channels between dice. Using other interconnect technologies (as previously mentioned) within single chips, or even a dedicated interconnect layer in a chip stack, may alleviate these issues.

Fig. 1.10 Die-to-die physical link wires

Conclusion

In this chapter the system on chip concept is introduced, and current SoC commu-
nication systems are described. The main features, as well as the limitations, of the
various types of on-chip interconnect are illustrated. Some details are given about
both performance issues and physical integration issues, highlighting why today
interconnect, rather than logic gates, is seen as the system bottleneck.
The system in package approach is then introduced, seen as a possibility to relax
the issues affecting SoC technology and allow the implementation of complex, het-
erogeneous and high performance systems.
However, the increasing complexity and computation requirements of new-generation systems will reach the limits of electrical interconnect quite soon, calling for novel solutions and different approaches for reliable and effective on-chip and die-to-die communication.

References

1. STMicroelectronics. UM0484 User manual: STBus communication system concepts and definitions. http://www.st.com/internet/com/TECHNICAL_RESOURCES/TECHNICAL_LITERATURE/USER_MANUAL/CD00176920.pdf. Last accessed on October 8, 2012
2. ARM Ltd. AMBA open specifications. http://www.arm.com/products/system-ip/amba/amba.open-specifications.php. Last accessed on October 8, 2012
3. IBM Microelectronics. CoreConnect Bus Architecture. https://www-01.ibm.com/chips/techlib/techlib.nsf/productfamilies/CoreConnect_Bus_Architecture. Last accessed on October 8, 2012
4. Coppola M, Locatelli R, Maruccia G, Pieralisi L, Scandurra A (2004) Spidergon: a novel on-chip communication network. In: SOC working conference, Tampere
5. Goossens K, Dielissen J, Radulescu A (2005) AEthereal network on chip: concepts, architectures, and implementations. In: Design & test of computers. IEEE, New York, NY, USA
6. Dall'Osso M, Biccari G, Giovannini L, Bertozzi D, Benini L (2003) Xpipes: a latency insensitive parameterized network-on-chip architecture for multiprocessor SoCs. In: 21st international conference on computer design, San Jose, CA, USA
7. Dally WJ, Towles B (2003) Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco
8. ITRS web site, http://www.itrs.net. Last accessed on October 8, 2012
9. Madisetti VK. The System-on-Package (SOP) Thrust, NSF ERC on Packaging, Georgia Tech. http://users.ece.gatech.edu/~vkm/sop.html. Last accessed on October 8, 2012
10. Tummala R. High Density Packaging in 2010 and beyond. IEEE 4th International Symposium on Electronic Materials and Packaging, Taipei, Taiwan, December 4th–6th, 2002
Chapter 2
Technologies and Building Blocks for On-Chip
Optical Interconnects

Wim Bogaerts, Liu Liu, and Gunther Roelkens

Abstract In this chapter we discuss the elemental building blocks to implement


optical interconnects on a chip: light sources, photodetectors, switches and multi-
plexers and of course, the optical waveguides. We discuss how these building blocks
can be implemented using silicon technology and evaluate the different integration
strategies of the optical layer with electronics.

Keywords Silicon photonics · Optical interconnects · Waveguides · Modulators · Photodetectors · Hybrid integration

Introduction

In this chapter we will discuss the most common technology aspects to implement an
optical interconnect system, and more specifically an on-chip optical interconnect sys-
tem. From an application point of view, optical interconnects should seamlessly replace
the function of electrical interconnects. This means that an optical interconnect, or an
interconnect fabric, should always have an electrical interface. With this in mind, an
optical interconnect should be a self-contained system, including the electro-optical
and opto-electrical conversions, as well as the control and switching electronics.
We will dissect optical interconnects into their constituent building blocks, and
will explore the different options for their technological implementation. We will
then go into more details on the most promising technology for on-chip intercon-
nects: silicon photonics.

W. Bogaerts (*) G. Roelkens


Department of Information Technology, Photonics Research Group, Ghent University IMEC,
Building 41, Office 1.41, Sint-Pietersnieuwstraat 41, 9000 Gent, Belgium
e-mail: wim.bogaerts@ugent.be
L. Liu
School of Information and Optoelectronic Science and Engineering,
South China Normal University, 510006 Guangzhou, China


Anatomy of an Optical Link

The simplest optical interconnect is a point-to-point optical link connecting two
electrical systems. Such a link typically consists of a unit converting an electrical
signal into an optical signal, a medium to carry the optical signal and a unit to con-
vert it back into an electrical signal. In an on-chip link, the medium is typically an
optical waveguide, confining light along an optical transmission line. Such wave-
guides are discussed in detail in section Waveguide Circuits. Converting the opti-
cal signal into an electrical one is done through a photodetector, typically combined
with a trans-impedance amplifier to convert the photocurrent into a voltage.
Photodetectors are discussed in section Photodetectors. For the conversion of the
electrical signal into an optical one, there are basically two main approaches, based
on the choice of the light source. This is shown in Fig. 2.1.
The most straightforward way of converting an electrical signal into light is by
directly modulating a light source. In the case of a high-speed optical interconnect,
this would be a laser. In case of many links on a chip, this would require a dedicated
laser per link. As will be discussed in section Light Sources, integrating many
small laser sources on a chip is certainly technologically feasible. However, these
sources can also generate a significant amount of heat.
An alternative is to use a continuous wave (CW) light source, and subsequently
modulate a signal onto it. This approach has the advantage that only a single common

Fig. 2.1 Optical link implementation using (a) an internal directly modulated light source, and
(b) a CW external light source with a signal modulator

light source is required, and this source can even be placed off-chip and fed through
an optical waveguide or fiber. As will be shown in section Modulation, Switching,
Tuning, the actual signal modulators can be implemented in simpler technology
than the lasers. They should also reduce the on-chip heat generation. Another
advantage of modulating a CW source, compared to a directly modulated laser, is
the possibility to use advanced phase modulation formats, effectively coding more
bits in the same bandwidth. This is very difficult to achieve using a directly modu-
lated source, where typically intensity modulation is used.
However, an external source could pose additional topological constraints, as it
requires feed-in lines for all the modulators. This could be alleviated by integrating
an on-chip CW light source per link, accompanied by a signal modulator, but this
would again carry a penalty in power consumption and chip area.

On-Chip Optical Networks

An on-chip interconnect system typically contains a large number of links. When


moving towards optical interconnects, this means that many links need to be accom-
modated together on the same chip. This multiplexing of links can be done on vari-
ous levels.
The most obvious way of implementing numerous links is space-division multi-
plexing, i.e. provide each link with its designated waveguide. However, this is far
from trivial: in contrast with electrical interconnections, which can be arranged as a
multi-layered mesh (in a first approximation, an electrical interconnect just needs an
electrical contact between layers to transport the signal, so a standard via contact
works), it is not straightforward to (a) fabricate multilayer optical waveguide cir-
cuits, and (b) transfer light from one layer to the next (at least in the situation where
layers are sufficiently far apart to avoid unintentional optical crosstalk). So for the
remainder of this chapter, we will consider single-layered optical networks.
So, in order to accommodate a large number of links on a chip, the optical wave-
guides should not only be shared between links, but each transmitter should also be
able to address the exact receiver it wants to target. This can be done through a
switched network, which can be reconfigured to set up a certain link between two
points. The mechanism to implement such switches are discussed in more detail
in section Modulation, Switching, Tuning.
The alternative is using wavelength division multiplexing (WDM), where differ-
ent links are transported through the same waveguide but are modulated on different
carrier wavelengths, effectively propagating independently. A WDM network on a
chip can be configured in a bus configuration, where end points know which wave-
length to dial into (and ignore the rest), or in a routed network, where the wave-
length is used as a label to route the signal to the correct end point.
The first configuration has the advantage of simplicity, and reconfigurability
(including broadcasting) but could carry a power penalty, as all wavelength signals are distributed over the entire chip. The routing scheme makes better use of

Fig. 2.2 Optical networks on a chip. (a) Circuit switched, (b) wavelength switched, (c) WDM bus

available bandwidth, but is technologically more complex as it requires


(reconfigurable) wavelength routing devices throughout the optical network
(Fig. 2.2). Wavelength routers are discussed in section Waveguide Circuits.

High-Contrast Photonics

Now we have an idea of which building blocks are required for on-chip optical
interconnects, we should look for the best suited technologies and materials to
implement them. Unlike integrated electronics, photonic integrated circuits come in
a large variety of materials: glasses, semiconductors (silicon, germanium and IIIV
compounds), lithium niobate, polymers, etc. and each of these has its strong and
weak points. But when we look towards technologies for optical interconnects, we
can already impose some boundary conditions. The foremost constraint is one of
density: in compliance with Moores law, electronics are steadily shrinking, and if

optical interconnects are to be a useful extension of electronics, they need to occupy


as little floor space as possible on a chip. So the important requirement is to keep the
optical building blocks, especially the waveguides, as compact as possible.
Optical waveguides come in many materials, and the material system essentially
dictates the size of the waveguide core, i.e. the area where the light is confined. In
most waveguides, this confinement is in a material with a refractive index n that is
slightly higher than the surrounding cladding. The stronger the index contrast, the
smaller the core can be made. Optical fibers, made out of two glasses with a very slight index contrast, have a core diameter of the order of 10 µm. On the other side of the spectrum, a waveguide made in a high-index semiconductor (nSi = 3.45) surrounded by air (nair = 1.0) or glass/oxide (nSiO2 = 1.45) can confine the same light in a core less than 500 nm across. In addition, such high-contrast waveguides also allow sharp bends, with a radius of a few micrometers.

Silicon Photonics

Silicon is the most prominent semiconductor for electronics. But in recent years it
has been shown to be a promising material for integrated photonics as well [10, 36]. It has
a high refractive index contrast with its own native oxide, and is transparent at the
commonly used communications wavelengths around 1,550 and 1,310 nm. But the
main attraction for silicon as a material for photonic integration is that it can be
processed with the same tools and similar chemistry as now used for the fabrication
of electronic circuitry [10] and even monolithically with CMOS on the same sub-
strate [7, 36, 85]. This not only leverages the huge investments in wafer-scale pro-
cessing and patterning technologies, but also facilitates the direct integration of
silicon photonics with electronics.
Silicon may seem like a good material for waveguides, but it is notori-
ously bad for active photonic functions, especially the emission of light. So to
implement a full optical link there will always be a requirement to integrate other
materials for sources and detectors. As will be discussed in section Photodetectors,
detectors can be implemented in germanium, a material that can be deposited or
epitaxially grown on silicon. However, efficient light sources may need the inclu-
sion of efficient light emitters, and IIIV semiconductors are currently considered
to be the best option.

IIIV Semiconductors and Silicon

IIIV materials, either based on gallium arsenide (GaAs) or indium phosphide (InP)
are commonly used for efficient light sources and photodetectors. They can also be
used for photonic integrated circuits, and can provide a similar index contrast with
glass as silicon. Also, different integration schemes to integrate active and passive
functions on the same IIIV chips have been demonstrated, and some are commer-
cially available today. However, the wafer-scale fabrication technologies for IIIV

semiconductors are somewhat lagging behind those for silicon, missing the drive of the
electronics industry, and typical IIIV semiconductors are not available in large-
size wafers (200 or 300 mm).
Therefore, an attractive route is to combine active functions in IIIV semiconduc-
tors with silicon photonics. This can be done by integrating ready-made IIIV com-
ponents onto a silicon photonics chip. This is definitely possible using flip-chip based
technologies, but it is a relatively cumbersome process that limits the number of
components that can be integrated simultaneously. Also, alignment tolerances can be
quite tough to meet, which translates into a significantly higher integration cost.
The alternative is to integrate unprocessed IIIV material onto the silicon in the
form of a thin (local) film, and subsequently use wafer scale processing techno-
logies to pattern the IIIV devices. The obvious technique to integrate the IIIV
material would seem to be direct epitaxy, but the crystal lattice mismatch of IIIV
materials with silicon is typically too large to effectively do this, and while there are
some demonstrations of IIIV growth on silicon (directly or through a germanium
interface layer), the large number of dislocations generated degrade the optical
quality of the IIIV material.
The alternative to direct epitaxy is the use of bonding. Small IIIV dies are
locally bonded to a silicon wafer, which can already be patterned with photonic
circuitry. After bonding, the IIIV material can be thinned down to a thin film. The
actual bonding can be done in different ways, either directly making use of molecu-
lar forces or through the use of an intermediate adhesive or metal layer. The merits
of the different technologies are discussed in section Light Sources.
After the integration of the IIIV material on silicon, the actual devices can be
further processed on wafer scale, and patterned using the same lithographic tech-
niques used for silicon processing. However, when this processing is done in silicon
fabs that also process electronics, care should be taken not to contaminate tools with
IIIV compounds. Also, the integration of IIIV material into a fully functional
photonic/electronic chip, including the electrical contacting, is not straightforward.

Integrating Photonics and Electronics

An optical interconnect only makes sense when it is tightly integrated with the elec-
tronic systems it needs to interconnect. While the optical interconnect is primarily
devised to support the electronics, the interconnect subsystems also require dedi-
cated electronics for driving and control. The actual integration strategy, i.e. how to
combine the optical interconnect layer with the electronics, can have a strong impact
on the performance, the floor space and ultimately the cost of the full component.
However, the main essential point from an integration point of view, and that holds
for all the technologies discussed throughout this chapter, is that everything should
ultimately be compatible with wafer-scale processing.
In section Integration in an Electronics Interconnect we discuss a number of
integration options for silicon photonics interconnect layers in a traditional

electronics chip. One of the main criteria is the position of the photonics fabrica-
tion in the entire electronics fabrication flow. Here we can discern between front-
end-of-line processes (the photonics sitting at the same level as the transistors),
back-end-of-line processing (the photonics is positioned between or directly on
top of the metal interconnect layers) or 3-D integrated (the photonics is processed
separately and integrated as a complete stack on the electronics). These options
are also illustrated in Fig. 2.21.

Waveguide Circuits

Photonic integrated circuits can combine many functions on a single chip. Key to
this, and especially in the context of interconnects, is to transport light efficiently
between functional elements of the chip. The most straightforward way for this is
through optical waveguides which confine light to propagate along a line-shaped
path. As we will see further, these waveguides can also be used as functional ele-
ments themselves, especially by manipulating multiple delays to obtain interfer-
ence, which in turn can be used to construct filters for particular wavelengths.

Optical Waveguides

Optical waveguides need to confine light along a path on chip, so it can be used to
transport a signal between two points. The most straightforward way to construct a
waveguide is to use a core with a high refractive index surrounded by a lower refrac-
tive index. A well-known example of such a waveguide is an optical fiber, consist-
ing of two types of glass with a slight difference in refractive index. Most optical
waveguides have an invariant cross section along the propagation direction. The
propagation inside the waveguide can then be described in terms of eigenmodes: a
field distribution in and around the core that propagates as a single entity at a fixed
velocity. Such an eigenmode is characterized by a propagation vector b or an effec-
tive refractive index neff. The propagation speed of the mode in the waveguide is
given by c neff, with c the speed of light in vacuum. Depending on their dimensions,
waveguides can support multiple eigenmodes, that propagate independently with
their own neff.
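For reference, the eigenmode picture can be summarized compactly (a textbook relation, not reproduced from this chapter); in LaTeX notation:

    \[
      E(x, y, z, t) = E_0(x, y)\, e^{\,j(\omega t - \beta z)}, \qquad
      \beta = \frac{2\pi\, n_\mathrm{eff}}{\lambda}, \qquad
      v_\mathrm{phase} = \frac{c}{n_\mathrm{eff}} .
    \]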
On a chip, there are much more different ways to construct a high index wave-
guide core: glasses, polymers and different types of semiconductor are the most
straightforward. Especially the last category is relevant: as already explained, optical
waveguides can be made more compact when there is a high index contrast between
core and cladding. This makes semiconductors extremely attractive, and silicon in
particular, because of its compatibility with CMOS fabrication processes.
To construct a submicrometer waveguide in silicon, a cladding material with a
low refractive index is needed. Silica (SiO2) is perfectly suited for the purpose,

Fig. 2.3 High-contrast silicon waveguide geometries. (a) Photonic wire strip waveguide, (b) rib
waveguide

resulting in an index contrast of 3.45–1.45. The cladding material should surround
the entire waveguide core, however, and this requires a layer stack of silicon and
silica. Such silicon-on-insulator (SOI) wafers are already used for the fabrication of
electronics, and can be commercially purchased from specialized manufacturers,
such as SOITEC. The high-quality substrates are typically fabricated through wafer
bonding, where partially oxidized wafers are fused together by molecular bonding.
By carefully implanting one of the wafers with hydrogen prior to bonding, a defect
layer can be formed at a precise depth, and the substrate of the wafer can be removed,
leaving a thin layer of silicon on top of a buried oxide (BOx). Such an SOI stack for
nanophotonic waveguides has a typical silicon thickness of 200–400 nm, and a buried oxide of at least 1 µm, preferably 2 µm thick, to avoid leakage of light into the sili-
con substrate. This gives a high refractive index contrast in the vertical direction.
To create an in-plane index contrast, the SOI layer is patterned, typically using
a combination of lithography and plasma etching [10, 15, 81]. Depending on the
etch depth, different waveguide geometries can be obtained. The most common are
illustrated in Fig. 2.3. A strip waveguide, often called a photonic wire, has a fully
etched-through cladding and offers the highest possible contrast in all directions.
Alternatively, a rib waveguide has a partially etched cladding and has a weaker
lateral contrast.
The lateral contrast has a direct impact on the confinement. The larger the lateral
confinement, the smaller the mode size can be, the closer the waveguides can be
spaced without inducing crosstalk, and the tighter the bend radius can be.
Photonic wires typically consist of a silicon core of 300–500 nm width and 200–400 nm height. Several groups have standardized on 220 nm thick silicon, as sub-
strates with this thickness can be purchased off the shelf. The core dimensions are
dictated by several factors. First, it is best to confine the light as tightly as
possible. This does not mean that the core can be shrunk indefinitely. At certain
dimensions, the size of the optical mode will be minimal, and for smaller cores the

mode will expand again. For a 220 nm thick silicon core, the optical mode is smallest
for a width around 450 nm at wavelengths around 1,550 nm. For this configuration,
not all the light is confined to the silicon, but a significant fraction of the light (about
25%) is in the cladding. With such waveguides, it is possible to make bends with
a 2–3 µm bend radius with no significant losses.
A second thing to consider is the single-mode behavior of the waveguide.
As optical waveguides get larger (for the same index contrast), they can support
more eigenmodes. While these propagate independently they can couple when there
is a change in cross section (e.g. a bend, a crossing, a splitter). This can give rise to
unwanted effects such as multi-mode interference, losses and crosstalk. Therefore,
it is best to have a waveguide which only supports a single guided mode. This can
be done by keeping the cross section sufficiently small. In the same SOI layer stack
of 220 nm thickness, all higher-order modes are suppressed for widths below 480 nm,
again for wavelengths around 1,550 nm.
Finally, there is the issue of polarization: modes in optical waveguides can be
classified according to their polarization: the orientation of the electric field compo-
nents. On an optical chip, this classification is typically done with respect to the plane
of the chip. We find quasi-TE (TransverseElectric field) modes with the E-field
(almost) in the plane of the chip, and quasi-TM (TransverseMagnetic field) modes
which have their E-field in the (mostly) vertical direction. In the case of a vertically
symmetric waveguide cross section (e.g. a rectangular silicon wire completely sur-
rounded by oxide), the waveguide will always support both a TE and a TM mode (so
the waveguide is never truly single-mode), but the TE and TM modes are fully decou-
pled: as long as the vertical symmetry is maintained, there will be no mode mixing
between the TE and the TM ground mode, not even in bends or splitters. Whether the
TE or the TM mode is the actual ground mode of the waveguide depends on the cross
section: the mode with the E-field along the largest core dimension will have the high-
est effective index. In the case of a waveguide cross section which is wider than its
height, the TE mode is the ground mode. For a perfectly square waveguide cross sec-
tion, the TE and TM modes are degenerate. Typically, photonic wires have a larger
width than height, because this is easier on fabrication (printing wider lines, and etch-
ing less deep). They are therefore most commonly used in the TE-polarization.
The essential figure of merit for photonic wires is their propagation loss: the
lower the loss, the longer an optical link can be for a given power budget. Photonic
wires fabricated with high-resolution e-beam lithography have been demonstrated
with losses as low as 1 dB/cm [48], meaning they still retain 80% of the optical
power after 1 cm propagation. For waveguides defined with optical lithography,
such as used for the fabrication of electronics, the propagation losses are slightly
higher, of the order of 1.4 dB/cm [13]. These losses are mainly attributed to scatter-
ing at roughness induced by the fabrication process, and absorption at surface states.
Making waveguides wider reduces the modal overlap with the sidewall, which
reduces the waveguide loss, even down to 0.3 dB/cm [99], but at the cost of a taper-
ing section and loss of single-mode behaviour.
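To relate these figures, a small sketch converting a propagation loss in dB/cm into the fraction of optical power remaining after a given distance (the loss values are simply the ones quoted in the text, re-used as inputs):

    def remaining_fraction(loss_db_per_cm, length_cm):
        # Power remaining after propagation: P/P0 = 10^(-total_loss_dB / 10)
        return 10 ** (-(loss_db_per_cm * length_cm) / 10)

    print(remaining_fraction(1.0, 1.0))   # ~0.79, i.e. ~80% left after 1 cm at 1 dB/cm
    print(remaining_fraction(0.3, 1.0))   # ~0.93 for the widened 0.3 dB/cm waveguide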
Because of their small feature size, the properties of photonic wires are fairly
wavelength dependent: the effective index as well as the exact mode profile changes

as a function of wavelength. As a result of this dispersion, signals in the waveguide travel at a group velocity that is considerably smaller than the speed of light. For 1,550 nm and a silicon wire of 450 × 220 nm, the group velocity is c/4.3, i.e. the group index ng = 4.3.
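As a hedged aside, using only the group index quoted above, this lets us estimate the raw on-chip optical propagation delay and compare it with the ~20 ps/mm of an optimally repeated electrical wire from Chap. 1:

    c = 3.0e8                      # speed of light in vacuum (m/s)
    n_g = 4.3                      # group index of a 450 x 220 nm wire at 1,550 nm
    delay_per_mm = n_g / c * 1e-3  # delay for 1 mm of waveguide (s)
    print(delay_per_mm * 1e12)     # ~14.3 ps/mm, with no repeaters or regeneration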
SOI rib waveguides are typically made by partially etching the silicon layer [13],
although it is also possible through oxidation [120]. Because there is still remaining
silicon on the sides, the lateral refractive index contrast is less than in a strip wave-
guide. This increases the minimum bend radius that can be afforded. However, the
shallow etch has two main advantages: it creates less sidewall roughness than in the
deeply etched waveguides, and the remaining silicon can also be used for electrical
contacting. This is especially advantageous for making modulators, as will be dis-
cussed in section Modulation, Switching, Tuning.
Silicon on insulator is a very good material for high-quality waveguides, because
it uses pure crystalline silicon which has hardly any material loss. However, SOI
wafers are only available from a limited number of sources, and in only a few
predefined layer stacks. This limits the flexibility of optical waveguide geometries,
but also of the substrates on which waveguides can be integrated.
An alternative to bonded silicon layers is the use of deposited silicon. Silicon can
be applied in polycrystalline or amorphous form through chemical vapor deposition
(LP-CVD or PE-CVD). However, this material is of lower optical quality than single-crystal silicon: polycrystalline material has grain boundaries which can scatter or
absorb light. The best propagation losses in photonic wires made out of polysilicon
are around 9 dB/cm [1, 42, 127]. Amorphous silicon has no grain boundaries, but
the amorphous structure gives rise to many unsaturated SiSi bonds, which can
absorb light. Therefore, these bonds need to be passivated, typically by hydrogen.
Hydrogen can be added in-situ, during deposition, or afterwards during an anneal
phase. Experiments have shown that in-situ hydrogenation during a low-tempera-
ture PECVD deposition can generate good-quality material, with a-silicon film
losses of only 0.8 dB/cm [52, 93, 98]. The best photonic wires in such material have
losses of the order of 3.4 dB/cm [93].
The advantage of such deposited silicon films is that they could, in principle, be
deposited on top of, or even inside, an electronics interconnect stack (see section
Integration in an Electronics Interconnect). However, the deposition process or
the material itself could impose restrictions on the further processing due to thermal
budget or contamination. For instance, the amorphous silicon cannot withstand
high temperatures without crystallizing and losing its passivation. This severely
limits the functions that could be implemented in this material, such as modulators
(section Modulation, Switching, Tuning).

Coupling Structures

An essential aspect of many waveguide circuits on a chip is efficient coupling of


light between the chip and the outside world, typically an optical fiber. In on-chip

Fig. 2.4 Coupling structures for optical chips. (a) Spot-size converter for edge coupling, (b) vertical
grating coupler

optical interconnects, this is not a main concern, as light does not have to leave the
chip. However, there are two easily identified exceptions: in the case where an
external light source is used, this light has to be coupled to the chip. The second is the extension of on-chip interconnects to multi-chip modules. For the sake
of completeness, we will briefly discuss the two main options for coupling light to
a chip: edge coupling in the plane of the chip, and vertical coupling. As the on-chip
waveguides typically have a different cross-section than the off-chip mode, a spot-
size converter will be necessary. The most relevant coupling structures are illus-
trated in Fig. 2.4.
As the on-chip waveguides are oriented in the plane of the chip, it is relatively
easy to transport the light to the edge of the chip. At the edge, the small wire mode
should be converted to a fiber-matched mode. The traditional approach to this is
including an adiabatic taper, consisting of a gradually narrowing silicon waveguide:
for very small widths, the light is no longer confined and the mode expands. This
larger mode is then captured by a larger waveguide (in oxide, oxynitride or poly-
mers) which can couple directly to a fiber at the polished facet of the chip. This
tapering approach has two advantages: it is a fairly simple and tolerant concept to
manufacture once you have a patterning technology capable of sub-100 nm features,
and it works over a broad wavelength range. Coupling efficiencies of 90% have
been demonstrated [96]. However, the edge-coupling approach has significant draw-
backs as well. The number of ports that can be accommodated at the edge is limited,
and the path of the optical waveguide to the edge should not be crossed by any
obstacle, such as metal interconnects. The taper structures are also quite large, requiring lengths of several hundred micrometers. Finally, the optical ports are only

accessible after dicing the wafer and polishing the facets: this makes wafer-scale
testing and selecting known-good-dies for further processing difficult.
The alternative is vertical coupling: using a diffraction grating, light can be cou-
pled from an on-chip waveguide to a fiber positioned above the chip. The grating
can be implemented as etched grooves [10, 102], metal lines [103], or subwave-
length structures [74]. Such structures attain coupling efficiencies of over 30%. By
engineering the grating layer structure, higher coupling efficiencies of 70% have
been demonstrated [112, 114]. The gratings can be made quite compact by design-
ing them such that the fiber-size spot is focused directly into the core of a photonic
wire waveguide [111].
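The grating period itself follows from the usual first-order phase-matching condition; the sketch below is illustrative only, and the effective index of the grating region and the fiber tilt angle are assumed values, not figures from this chapter:

    import math

    lam = 1.55e-6              # wavelength (m)
    n_eff = 2.5                # assumed effective index of the etched grating region
    n_clad = 1.45              # oxide top cladding
    theta = math.radians(10)   # assumed fiber tilt angle

    # Lambda = lambda / (n_eff - n_clad * sin(theta))
    period = lam / (n_eff - n_clad * math.sin(theta))
    print(period * 1e9)        # ~690 nm, the right order of magnitude for SOI couplers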
However, because the grating is a diffractive structure, its behavior is wavelength
dependent. The typical operational bandwidth (at 50% or 3 dB) is quite large:
60–80 nm. This is possible because of the very high refractive index contrast of the
silicon waveguides.
There is also the matter of fabrication: the best devices require more complex
fabrication techniques, and deviations in the fabrication will quickly lead to a shift
in wavelength or a drop in efficiency. The vertical couplers do have significant oper-
ational advantages: they are more tolerant to fiber alignment errors, can be used
directly on the wafer for testing and die selection, and can be positioned anywhere
on the chip, giving more flexibility for packaging or testing.
The term "vertical" should be treated with a question mark, though. When designing
the diffraction grating for true vertical coupling, one introduces several (unwanted)
parasitic effects. For one, the grating will become strongly reflective: it will also act
as a Bragg reflector, reflecting light from the waveguide back into the waveguide.
This can be partially reduced by engineering the grating [89]. Also, a vertical grat-
ing is symmetric, and symmetry-breaking schemes should be implemented to avoid
the grating coupling to both directions of the waveguide. For fiber coupling, the
solution is to use fibers polished at an angle. But in situations when vertical cou-
pling is a necessity (e.g. integration of a vertical light source) additional measures,
such as a lens or refracting wedge are required [92].

Wavelength Filters and Routers

Optical waveguides typically have an extremely large bandwidth compared to elec-


trical interconnects. Therefore, they are often limited by the electro/optical conver-
sion at the end points. A solution is WDM: multiplexing lower-bandwidth signals
on different carrier wavelengths. This requires components to combine and separate
the different wavelength channels. Photodetectors operate for all wavelengths, so
only the correct wavelength channel should be guided to the detector, while others
are ignored.
In addition, WDM can also be used to provide a more distributed interconnect
infrastructure: the carrier wavelength can be used to route signals over the chip.
Again, filters and wavelength routers are needed to correctly distribute the signals.

Both the multiplexing and the routing require wavelength selective elements.
On a chip, these are best implemented by interference of two or more waves with a
wavelength-dependent path length difference. This can be self-interference in a
resonator, two-path interference in a (cascaded) Mach–Zehnder interferometer, or
multipath interference. In all cases, the physical length of the delay line scales
inversely with the group index, so photonic wires are well placed to implement
these wavelength selective functions. On the other hand, the free spectral range
(FSR) of a filter is the wavelength spacing between two adjacent filter peaks, and it
should have a sufficiently large FSR to cover a broad band of signal wavelengths in
WDM. For this, the delay length should be sufficiently short, and here the photonic wire's sharp bend radius and tight spacing allow FSRs which are difficult to obtain with other waveguide technologies.

Resonant Ring Filters

In a ring resonator light is circulated in a ring-shaped waveguide. The structure is


in resonance if the phase is matched after a full round trip. When coupled to
access waveguides, typically using directional couplers, the ring can drop the part
of the spectrum around a resonance wavelength from one access waveguide (bus
waveguide) to the other access waveguide (drop waveguide). This way, a single
wavelength channel can be dropped from, or when used in reverse, added to the
bus waveguide (Fig. 2.5).
Photonic wire-based rings can be very compact, resulting in FSRs of tens of nm.
A single ring resonator has a Lorentzian-shaped transmission spectrum [11, 55], but
by cascading multiple rings one can construct a ring filter with a more uniform pass
band [24, 121]. In addition, multiple rings with different FSRs can be combined in a
Vernier configuration, creating a filter with a much larger overall FSR. Also, ring
resonators can be used to build a wavelength router that directs the inputs to an
output of choice based on the input wavelength [59].
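A back-of-the-envelope sketch of why photonic wire rings reach such large FSRs; the ring radius is an assumed, representative value and the group index is the one quoted earlier for a silicon wire:

    import math

    lam = 1.55e-6               # operating wavelength (m)
    n_g = 4.3                   # group index of the photonic wire
    R = 5e-6                    # assumed ring radius of 5 um
    L = 2 * math.pi * R         # round-trip length
    fsr = lam ** 2 / (n_g * L)  # FSR = lambda^2 / (n_g * L)
    print(fsr * 1e9)            # ~18 nm, i.e. indeed tens of nm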
While ring resonators are probably the most compact way to implement custom
add/drop filters, they have the disadvantage that they rely on a resonance: this implies
a significant power buildup inside the filter. In silicon wires, this will induce nonlinear behavior which will result in a resonance shift or even kill the resonance. These nonlinear effects put an upper limit on the power budget of the link.

Fig. 2.5 Ring-resonator filters. (a) All-pass filter consisting of a single ring on a bus waveguide. (b) Add-drop filter, which drops wavelength channels from the bus waveguide to the drop port

Mach–Zehnder Waveguide Filters

A Mach–Zehnder interferometer (MZI) is a simple interferometer where light is


split up in two paths which are then brought together. The resulting intensity depends
on the phase relation in the two paths, with a maximum when both arms are in phase
(constructive interference), and a minimum when the arms are in opposite phase
(destructive interference). When the path lengths are unequal, the phase delay between the two arms is wavelength dependent, varying periodically with a period (free spectral range, or FSR) inversely proportional to the arm length difference and the group index of the waveguide (Fig. 2.6).
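A short sketch of this relation, FSR = λ²/(ng·ΔL); the target FSR is an assumed value and the group index is again the one quoted earlier:

    lam = 1.55e-6        # wavelength (m)
    n_g = 4.3            # group index of the photonic wire
    target_fsr = 20e-9   # assumed FSR of 20 nm to span the WDM band

    delta_L = lam ** 2 / (n_g * target_fsr)   # required arm length difference
    print(delta_L * 1e6)                      # ~28 um, easily folded into a compact layout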
For splitting and combining the light, one can make use of directional couplers
or multi-mode interferometers (MMI). In the former, light can couple between two
adjacent waveguides, and the coupling strength can be controlled by the length of
the coupler or the width of the gap. However, in a photonic wire geometry this gap
is difficult to control accurately. MMIs use a broad waveguide section which supports
multiple modes to distribute the light to two or more output waveguides. They have
been proven to be more tolerant than directional couplers for 50% coupling ratios,
but arbitrary ratios are more difficult to design accurately.
While single MZIs have a sinusoidal wavelength response, they can be cascaded
to obtain more complex filter behavior. This can be done through a cascade where
the MZIs are stacked in series, or directly using common splitter and combiner sec-
tions [12, 35, 105, 123].
As MZI-based filters are nonresonant, they don't suffer from nonlinear effects,
but they typically require a much larger footprint than a ring-resonator-based filter
for a similar filter response.

Arrayed Waveguide Gratings

The principle of an MZI can be extended to multiple delay lines: input light can be
split up between an array of waveguides with a wavelength-dependent phase delay.
When the outputs of these delay lines are arranged in a grating configuration, the
distributed light will be refocused in a different location depending on the phase

Fig. 2.6 Mach–Zehnder interferometer-based wavelength filters. (a) Single Mach–Zehnder interferometer with sinusoidal transmission. (b) Cascaded higher-order MZI filter with flat-top transmission spectrum

delay (and thus, on the wavelength). This way, one component can (de)multiplex a
multitude of wavelength channels simultaneously [32]. Again, silicon photonic
wires can make the delay lines of the arrayed waveguide grating (AWG) shorter and
arrange them in a more compact way than other waveguide technologies (Fig. 2.7).
A 1 × N AWG with one input waveguide can serve as a multiplexer. However, if designed properly, an N × N AWG can be used to route light from any input to any output, based on the choice of wavelength at the input. This is done by carefully matching the FSR of the AWG to N times the wavelength channel spacing [34].
Because of the high index contrast, silicon AWGs typically perform worse than
glass-based components (but with a much smaller footprint), with crosstalk levels
around 20–25 dB [14].
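To make the routing property concrete, an illustrative sketch of the ideal cyclic behavior of such a router; the port count and channel spacing are assumed numbers, and real devices deviate from this idealization:

    N = 8                    # assumed 8 x 8 AWG router
    channel_spacing = 3.2    # assumed spacing (nm); the FSR must then be N * 3.2 = 25.6 nm

    def output_port(input_port, channel_index, n=N):
        # In an ideal cyclic N x N AWG, channel k entering port i exits port (i + k) mod N
        return (input_port + channel_index) % n

    print(output_port(2, 5))  # channel 5 entering input 2 leaves through output 7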
AWGs typically have a Gaussian-shaped transmission band. However, by engi-
neering the geometry of the access waveguides, a more uniform pass band can be
obtained, typically with a penalty in insertion loss of approximately 3 dB. More
elaborate synchronized schemes can reduce this loss by cascading an additional
interferometer to the AWG [33, 118]. Such techniques have been demonstrated for
mature silica waveguide technology.

Planar Concave Gratings

An alternative distributed interference approach is to use the slab waveguide


area, instead of an array of waveguides. Echelle gratings or planar concave grat-
ings (PCG) use etched grating facets to obtain a set of matched phase delays in

Fig. 2.7 Arrayed waveguide grating in silicon on insulator. (a) Operating principle. (b) Eight-
channel AWG in silicon, presented in [14]. (c) Plot of the transmission of the eight output channels
based on the data of the device from [14]

the different output waveguides [19, 20]. Here, too, it is possible to configure
the input and output waveguides in a router configuration. Performance of PCGs
is similar to that of silicon AWGs, with crosstalk levels around 20–30 dB [19].
The choice whether to use an AWG or PCG is very much dependent on the chan-
nel spacing and number of channels [14] (Fig. 2.8).

Tolerances in Wavelength Filters

All filters discussed here rely on a wavelength-dependent phase delay. This implies
a good control of the dispersion (wavelength dependence of the neff and ng) of the waveguide or slab area. In silicon photonics, the dispersion is very dependent on the
actual fabricated geometry. For wire-based filters, nanometer-variations in line
width can result in wavelength shifts in the order of nanometers. Therefore, accurate
control of the fabrication process is extremely important, and on top of that, active
tuning or trimming of the delay lines is often necessary.

Fabrication Accuracy

As already mentioned, silicon photonics is compatible with electronics manufactur-


ing technology, bringing on board the immensely well-controlled processes and

Fig. 2.8 Planar curved grating (or Echelle grating) in silicon on insulator. (a) Operation principle. (b)
Example of a four-channel PCG from [20]. (c) The transmission plotted based on the data from [20]

fabrication environments. It has been shown that it is indeed possible to control the
average delay line width (and therefore the peak wavelength) of a ring resonator or
an MZI to within a nanometer between two devices on the same chip, and within
2–3 nm for devices on the same wafer or even between wafers [94].
Even with this process control, it is not possible to manufacture wavelength
filters with subnanometer accuracy while still maintaining practical process toler-
ances. In a typical CMOS fabrication process, tolerances of 5–10% are used.
While photonic wire waveguides are much larger than today's state-of-the-art
transistor features, the tolerances are stricter, and thus well below 1% of the criti-
cal dimensions.

Temperature Control

In addition to fabrication control, the WDM filters should also be tolerant to different
operational conditions, most notably a broad temperature range. Temperatures can
vary wildly within an electronics chip, with hot-spots popping up irregularly. Silicon
photonic wires are very susceptible to temperature variations, and in filters this results

in a peak wavelength shift of the order of 50–100 pm/°C. This effect can be reduced
in various ways. One can design the waveguide to have no thermal dependence, by
incorporating materials which have the opposite thermal behavior as silicon: some
polymer claddings have been demonstrated to work well for this purpose [104], but
these then introduce many questions on fabrication and reliability.
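A rough, hedged illustration of why this matters; the temperature swing and the channel spacing below are assumed numbers:

    shift_per_degC = 80e-12    # mid-range of the 50-100 pm/degC figure above (m per degC)
    delta_T = 40               # assumed on-chip temperature swing (degC)
    channel_spacing = 1.6e-9   # assumed WDM channel spacing of 1.6 nm (200 GHz grid)

    drift = shift_per_degC * delta_T
    print(drift * 1e9, drift / channel_spacing)   # ~3.2 nm of drift, i.e. two full channels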
Alternatively, active thermal compensation could be used by including heaters,
or even coolers, in or near the waveguides. Such thermal tuning can compensate the
remaining process variations, but it introduces additional power consumption for
the heaters, as well as the necessity for control and monitoring circuitry.
The heaters themselves can be incorporated as metallic resistors [29, 44], using
silicides or doped silicon [108, 117] or even use the silicon of the waveguide itself
as a heater element [50].

Light Sources

The light source problem is probably the most controversial technological challenge in silicon photonic optical interconnects. It is well known that
crystalline silicon cannot emit light due to its indirect bandgap. This makes mono-
lithically integrated lasers very difficult, and opens the door to a large number of
light source alternatives.
In the specific case of on-chip optical interconnect, a number of requirements
are imposed on candidate light sources. First of all, they have to be electrically
pumped and work under continuous wave or be directly modulated, depending on
the interconnect scheme from Fig. 2.1. Also, they should be efficient and have a low
threshold current, in order to reduce the energy per bit in the link.
The most straightforward solution is to use a commercially available InP based
laser diode and integrate it onto the SOI circuits. The laser power will then be dis-
tributed over the whole chip and shared as a common optical power supply by all
the links, as shown in Fig. 2.1b. A more challenging scheme is to implement an
individual on-chip laser for each link, which can then either be used in CW or be
directly modulated (cf. Fig. 2.1a). In the following part of the section, we will dis-
cuss in detail the implementation and challenges of the two schemes.

Off-Chip Lasers and Interfaces to the SOI Circuits

Using an off-chip laser decouples the light source problem from the silicon photon-
ics, obviating the need to build a light source in/on silicon. Also, the laser diode can
be tested and selected prior to assembly. The challenging issue here is the optical
coupling interface between a laser diode and an SOI waveguide.
In its simplest form, the laser diode is just fiber-pigtailed and connected to the sili-
con chip using the fiber couplers discussed in section Waveguide Circuits. But the

Fig. 2.9 Off-chip laser sources. (a) Fiber pigtailed source, (b) laser subassembly mounted on a
non-vertical grating coupler [36]. (c) VCSEL mounted on a vertical grating coupler, (d) VCSEL
mounted on a non-vertical grating coupler with a refracting wedge [92]

laser diode can also be mounted on the chip itself, either as a bare chip or a subas-
sembly. In that case, the coupling scheme should be adapted to the particular laser
diode. An example is the laser package developed by Luxtera, which couples the
horizontal laser light into a vertical grating coupler by means of a reflecting mirror
and a ball lens integrated in a micropackage on top of a silicon photonics chip [36].
When using vertical coupling, vertical-cavity surface emitting lasers (VCSEL)
are very attractive: such devices can be flip-chipped directly on top of a grating
coupler. However, this means the grating coupler should work for the vertical
direction. As discussed in section Waveguide Circuits it is not straightforward
to implement truly vertical grating couplers. One problem is that due to symmetry
the grating coupler will diffract the source light into the waveguide on both sides
of the grating. A solution suggested by Schrauwen et al. is to use a refractive
angled interface to deflect the perfectly vertical light to a non-vertical grating
coupler (Fig. 2.9).
The disadvantage of using a off-chip device is the strict optical alignment needed for
integration, and such a process has to be repeated sequentially if multiple lasers are

Fig. 2.10 Processing flow of device fabrication based on IIIV/SOI bonding

going to be used. This is especially true in a WDM environment, where a light source
for each wavelength channel is needed. This can be accomplished by a multi-wavelength
laser or comb-laser, or by connecting an individual laser for each channel.

On-Chip Lasers

When bringing the lasers to the chip, a number of new possibilities arise. When the
lasers are integrated with the photonic circuitry, a much higher density can be
achieved. Lasers can be integrated close to the modulators, and much denser link
networks can be built. Also, the lasers can be batch processed.
In order to achieve optical gain on SOI, one needs to integrate new materials with
optical gain or modify silicon itself at the early stage of the chip fabrication. Various
methods have been proposed. One of the most successful so far is the heterogeneous
integration of III–V materials on SOI based on bonding technology, which will be
discussed in the following part. Some advanced on-chip lasers based on other
approaches will also be reviewed.

III–V/SOI Bonding Technology

When III–V semiconductors need to be integrated on silicon, direct epitaxy is not
really a simple solution due to the crystal lattice mismatch: this will cause a lot of
dislocation defects which will make it impossible to get the high-quality quantum
well layers needed for good optical gain. Therefore, a more attractive approach is to
use high-quality laser-grade III–V stacks and bond them onto the silicon. Figure 2.10
shows the processing flow of the device fabrication based on this III–V/SOI bond-
ing technology. Generally, a III–V die or wafer of appropriate size is first bonded

upside down on top of an SOI wafer. The SOI wafer can be either patterned or
unpatterned. If necessary, multiple III–V dies with different epi-structures can be
bonded on the same SOI wafer to realize different functionalities. The III–V dies
are still unpatterned at this stage. Thus, only a coarse alignment to the underlying
SOI structures is necessary. Then, the InP substrate is removed by mechanical grind-
ing and chemical etching. To isolate the etching solution from the device layers, an
etch stop layer (usually InGaAs/InP), which will be removed subsequently, is
embedded between these layers and the substrate. The devices in the III–V layers
are then lithographically aligned and fabricated with standard wafer-scale process-
ing. Compared to the approach of using off-chip lasers mentioned in section
Off-Chip Lasers and Interfaces to the SOI Circuits, the alignment tolerance
during the bonding process is much more relaxed here.
To realize bonding between the III–V dies and the SOI wafer, there are two
common techniques: direct (molecular) bonding and adhesive bonding. In the first
approach, a thin layer of SiO2 is first deposited on top of the III–V dies and the SOI
wafer. For a patterned SOI, the wafer should be planarized and polished through a
chemical mechanical polishing (CMP) process [53, 110]. The initial bonding of the
III–V dies and SOI wafer is achieved through the van der Waals force. Such an
attraction force is only noticeable when the two surfaces are brought to within a few
atomic layers of each other. Thus, in order to make the van der Waals attraction take
place over a large portion of the bonded interfaces, the surfaces of the III–V dies and
SOI wafer must be particle-free, curvature-free, and ultra-smooth. The bonded
stack is subsequently annealed, usually at a relatively low temperature (up to
300 °C), in order to avoid cracks induced by the thermal expansion coefficient mis-
match between the III–V and silicon. A stronger covalent bond will then form
if the two bonded surfaces are chemically activated before contacting. Without
the aid of SiO2, direct bonding of III–V material and silicon is also possible through
O2 plasma activation of both surfaces and incorporation of vertical outgassing
channels on SOI [66].
Alternatively, in the adhesive bonding approach, a bonding agent, usually a poly-
mer film, will be applied in between the two bonded surfaces. Due to the liquid form
of the polymer before curing, the topography of surfaces can thus be planarized, and
some particles, at least with diameters smaller than the polymer layer thickness, are
acceptable. The whole stack also undergoes a curing step at a temperature
appropriate to the chosen polymer. The most successful implementation of
this technology for the devices discussed in this book uses DVS-
BCB polymer, owing to its good planarization properties, low curing temperature
(250 °C), and resistance to common acids and bases [88].

III–V/SOI-Based Micro-lasers

The optical coupling from the bonded active laser cavities to the passive SOI wave-
guides is one of the most challenging issues in designing a micro-laser based on the
III–V/SOI heterogeneous integration technology. In order to accommodate the

Fig. 2.11 III–V bonded stripe laser geometries. (a) Fabry–Perot laser with integrated polymer
mode converter between III–V and silicon waveguides [90], (b) fabricated laser, (c) bonded III–V
laser with thick gain section and inverted-taper mode conversion to the silicon waveguide [63], (d)
III–V/SOI hybrid waveguide structure with evanescent gain section [40]

p-i-n junction and facilitate efficient current injection, a thick III–V epi-layer
structure is necessary. This normally results in a low index contrast in the vertical
direction for the III–V waveguide. However, a single-mode SOI wire waveguide
has a high index contrast in all directions. The mismatch in the mode profiles and
the effective mode indices makes the out-coupling of the laser light difficult. A solution
is to use a mode converter, made of an SOI inverse taper and a polymer waveguide,
for interfacing a single-mode SOI waveguide to a III–V Fabry–Perot (FP) laser
cavity, as shown in Fig. 2.11a, b [90]. The structure is designed and fabricated in a
self-aligned manner. Despite the fact that this mode converter has a large footprint,
efficient light output with power up to 1 mW in the SOI waveguide was obtained,
and subsequent optimizations of such mode converters (Fig. 2.11c) have demonstrated
power up to 3 mW from both ends of the laser cavity [63]. The advantage of this
approach is that in the bonded region most of the light is in the III–V material and
will experience strong gain, while the laser mirrors can be implemented in the
silicon.
An alternative approach, proposed by Fang and coworkers, is to use an ultra-thin
bonding layer [40]. As shown in Fig. 2.11d, the III–V layers and silicon in this case
can be considered together as one hybrid waveguide. Here, a large portion of the

guided power is still located in the silicon, and the overlap with the active III–V materials
is smaller. This implies that the gain per unit length of such a structure will also be
smaller. Still, with proper design a sufficient overlap with the gain medium can be
achieved. Based on such a waveguide structure, stand-alone FP lasers were introduced
initially, and integrated distributed feedback (DFB) lasers, distributed Bragg reflector
(DBR) lasers, and ring lasers were also demonstrated subsequently [38–40].
Partly because of the limited gain caused by the small modal overlap, the laser
devices mentioned above still have a relatively large footprint (100 µm to 1 mm).
They can deliver a lasing power of several mW, with performance similar to that of an
off-chip laser. This kind of device is also ideal for the implementation shown in
Fig. 2.1b, where a CW laser is used as an optical power supply. However, because
they are still quite large, such lasers cannot be modulated directly at the speeds
required for optical links, as in Fig. 2.1a.
For this, a true micro-laser with a dimension of several microns is the logical
candidate. Such small lasers can be implemented at any position where an electro-
optical interface is needed. The best examples of such micro-lasers are based on
microdisks, coupled to a single-mode silicon wire waveguide [110]. This is shown
in Fig. 2.12a, b. Different from the approaches mentioned above, the out-coupling
here is based on evanescent coupling from the cavity resonant mode to the
guided mode in the silicon waveguide. Due to the mode index mismatch, a very high
coupling is still difficult to achieve, but the coupling should not be too high anyway,
so as not to destroy the cavity resonance. Single-mode output power over 100 µW
under continuous-wave operation, with a microdisk cavity of 7.5 µm diameter and a threshold
current of 0.38 mA, was obtained, as shown in Fig. 2.12c, d [100]. These lasers are
quite small, and they can be directly modulated. Direct current modulation up to
4 Gb/s was achieved [75]. Also, as the lasers are evanescently coupled to the bus
waveguides, several microdisks can be cascaded on one silicon waveguide: for
instance, a 4-channel multiwavelength laser source for WDM applications has been
demonstrated, as shown in Fig. 2.12e, f [109].
A different form of such a micro-laser uses a micro-ring cavity which is laterally
coupled to a silicon waveguide. Continuous wave lasing was achieved with rings of
diameters as small as 50 µm [67].

Other Advanced On-Chip Lasers

Optical gain in silicon can be achieved through various optical nonlinear effects [16,
43, 91], which led to the first realization of a silicon laser [16, 17]. However, such a
device based on a purely optical effect cannot possibly be pumped electrically, which
makes such a laser unsuitable for on-chip optical interconnect. Gain through carrier
population inversion, which can be electrically pumped, is practically impossible in sili-
con, since it is an indirect bandgap material with very inefficient radiative recombi-
nation of carriers. Still, locally confining the carriers in, e.g., silicon nanocrystals,
provides an approach to increase the radiative recombination probability, and net
optical gain was demonstrated [84]. However, for silicon nanocrystals the gain

Fig. 2.12 (a) Schematic structure, (b) light–current–voltage curve, and (c) spectrum of a III–V
microdisk laser on an SOI waveguide [100, 113]. The light power was measured in the access fiber,
which is about one third of that in the SOI waveguide. (d) Spectrum and fabricated structure (inset)
of a multiwavelength laser [109]

wavelength is within the visible band, which is not suitable for integration with sili-
con waveguides. A nanocrystal-based gain material for longer wavelengths would
require IV–VI semiconductors, with bulk bandgaps beyond 2 µm.
Erbium doping, which is widely used in fiber amplifiers, provides another route
to implement gain in silicon. Net material gain was achieved in the 1.55 µm wave-
length band, but no laser action has been reported so far [51, 77].
Finally, an approach which has drawn a considerable amount of interest and some
recent promising results is the epitaxial growth of germanium on silicon for mono-
lithic lasers. Although Ge is also an indirect bandgap material, the offset between the
direct and the indirect bandgap is sufficiently small that bandgap engineering can be
done to stimulate radiative recombination from the direct bandgap valley. By using a
combination of strain and heavy n-type doping the germanium can be turned into a
direct-bandgap material [73]. Based on this approach, a FP laser working under
pulsed operation has been demonstrated through optical pumping [72]. With electri-
cal pumping, such a laser could provide an ideal light source for on-chip optical
interconnects, as germanium is already present in many CMOS fabs.

Modulation, Switching, Tuning

For many practical purposes it is essential that the function of an optical chip can
be electrically controlled. This is especially true in interconnects, where an electri-
cal signal should be imprinted on an optical carrier, transported through an optical
link or network, and then converted back to an electrical signal. This requires sev-
eral functions where electrical actuation of optical components is required:
Signal modulation. The electrical signal should be imposed on an optical carrier,
which requires a very fast mechanism to change the optical properties of a wave-
guide circuit. Required modulation speeds range from 1 GHz through 10 GHz and
40 GHz to even beyond 100 GHz.
Switching. The optical signal should be routed through the network. In the case of
a switched network topology, the switch should be sufficiently fast to rapidly estab-
lish and reroute connections, but it should consume as little power as possible to
maintain its state once the switching operation is performed. Depending on the
configuration, switching speeds can range from ms down to ns.
Tuning. As discussed in the section on passive waveguides, the fabrication technol-
ogy is far from perfect, and especially in WDM configurations the operating condi-
tions often require active tuning to keep the WDM filters spectrally aligned. Tuning
is typically a rather slow process (µs to ms) but should require low power.

Electro-Optical Signal Modulation

To modulate a signal onto an optical carrier one can modulate either the
amplitude or the phase. When propagating through a (waveguide) medium this
involves a modulation of the absorption or the refractive index, respectively. The
simplest form of modulation is direct amplitude modulation, or on–off keying
(OOK). This scheme is exceptionally easy to decode at the receiver side, as it only
requires a photodetector. Electrical amplitude modulators can be based on elec-
troabsorption effects, i.e. band-edge shifts driven by an external electric field, but
this only works at a given wavelength. Alternatively, phase modulation encodes the
signal in the phase of the light, and this is generally a broadband effect. This makes
more efficient use of the spectrum, and an electrical modulator now requires only
a change in refractive index, not absorption: this is much easier to achieve. At the
receiver side things become more complicated though, requiring multiple detectors
or interferometric structures. More advanced modulation schemes involve multiple
amplitude or phase levels, making much more efficient use of the spectrum, but
requiring much more complex detection schemes at the receiving end. The modu-
lation format (phase or absorption) can be decoupled from the actual physical
modulation effect (absorption or index change). This is shown in Fig. 2.13. OOK
can be achieved using direct absorption modulation, but also using phase modula-
tors in conjunction with an interferometer or resonator. A phase modulator can also

Fig. 2.13 Electro-optic amplitude and phase modulation. (a) An electrical signal drives an elec-
tro-absorber, where the absorption edge shifts as a function of the electric field. (b) An electro-
optic phase shifter changes the optical length of the cavity, and thus shifts the resonance wavelength.
(c) A phase shifter in the MZI changes the phase difference between the two arms, switching the output from
constructive to destructive interference. (d) Two amplitude modulators in an MZI will act as a phase modulator

be converted to an amplitude modulator, by combining it with an interferometer or
a resonator. A phase modulator embedded in an arm of an MZI with equal arm
lengths can flip the phase difference at the combiner from 0° to 180°, flipping the
transmission at the output from constructive to destructive interference. Likewise,
a phase modulator embedded in a resonator will modify the optical roundtrip
length, shifting the resonance wavelength. Vice versa, two amplitude modulators
in an interferometer can work together as a binary phase modulator: in an interfer-
ometer with a 180° phase shift section in one arm, two amplitude modulators
driven with complementary signals will open either one or the other channel,
resulting in a 180° phase difference at the output, but with at least 3 dB (50%)
insertion loss compared to the input.
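
This phase-to-amplitude conversion can be summarized in a minimal Python sketch, assuming an ideal lossless interferometer with 50/50 splitters: the output power follows cos²(Δφ/2), so flipping the phase difference between 0° and 180° switches the output between full transmission and extinction.

    import math

    def mzi_transmission(delta_phi_rad):
        """Ideal, lossless Mach-Zehnder interferometer with 50/50 splitters:
        output power fraction as a function of the phase difference between the arms."""
        return math.cos(delta_phi_rad / 2.0) ** 2

    for phi in (0.0, math.pi / 2, math.pi):
        print(f"delta_phi = {phi:.2f} rad -> transmission = {mzi_transmission(phi):.2f}")
    # delta_phi = 0 gives 1.00 (constructive); delta_phi = pi gives 0.00 (destructive)
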

Fig. 2.14 Electro-optic modulation mechanisms on a magnitude and length scale

As refractive index modulation is a broadband effect, and we intend to operate
silicon photonics at multiple wavelengths, we will spend the rest of this discussion
on phase modulation effects, knowing that we can convert this into amplitude mod-
ulation with the right waveguiding structure. There are various mechanisms that
affect the refractive index: mechanical, thermal, electrical carrier density, and direct
electro-optic effects (e.g. the Pockels effect). Depending on the materials used,
these effects have different strengths and time constants, as shown in Fig. 2.14. For
multi-GHz signals in an unstrained silicon waveguide, only carrier-induced disper-
sion can be leveraged. On the other hand, efficient thermal and mechanical effects
can be used for tuning or switching on millisecond or microsecond timescales.
Especially in WDM links, tuning is essential to compensate fabrication nonunifor-
mity and varying operating environments.
Driving electro-optic modulators generally requires dedicated driver electron-
ics: most E/O modulators are voltage driven. For a small device this can be a local
CMOS driver, but for larger modulators (when using a quite weak effect) distrib-
uted drivers or microwave strip lines are required. This becomes an issue when the
modulator length becomes a sizable fraction of the bit length. In addition, depend-
ing on the intrinsic and parasitic load, the drivers might require high voltages or
preemphasis.
For modulators, the figures of merit are the insertion loss (expressed in dB), the
modulation bandwidth (expressed in GHz) and the energy per modulated bit. That last met-
ric is often not straightforward to calculate, as it does not necessarily include the driver
electronics. Therefore, for modulators in a waveguide configuration, the common figure
of merit is the voltage–length product VπLπ required to obtain a π phase shift.
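
As a sketch of how this figure of merit is used in practice (assuming, purely for illustration, a VπLπ of 1 V cm, the value quoted later for depletion-type silicon modulators, and a 2 V drive swing), the required phase-shifter length follows directly:

    # Length needed for a pi phase shift, given a VpiLpi figure of merit.
    # VpiLpi = 1 V*cm and a 2 V drive are illustrative assumptions.
    v_pi_l_pi_v_cm = 1.0   # assumed figure of merit (V*cm)
    drive_voltage_v = 2.0  # assumed CMOS-compatible drive swing (V)

    length_cm = v_pi_l_pi_v_cm / drive_voltage_v
    print(f"Phase-shifter length for a pi shift: {length_cm * 10:.1f} mm")  # 5.0 mm
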

Thermal and Mechanical Effects for Tuning and Switching

Thermal Tuning

The refractive index of silicon is quite temperature sensitive: dn/dT = 1.8 × 10⁻⁴ /°C.
In an interferometric structure, this can easily lead to tens of pm of shift in the spec-
tral response per degree of temperature change. As discussed in section Waveguide
Circuits, this has a detrimental effect on the operating requirements of silicon pho-
tonics: the temperature should be kept very stable, or special measures should be
taken to obtain an athermal response [104].
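
A back-of-the-envelope Python sketch of this sensitivity, assuming a resonance near 1550 nm, a group index of about 4.3 and taking the full silicon dn/dT as the effective thermo-optic response (i.e. neglecting cladding and mode-overlap corrections), indeed gives a shift of a few tens of pm per degree:

    # Thermal shift of a silicon resonator/interferometer resonance.
    # Assumptions: lambda = 1550 nm, group index ~4.3, and the effective index
    # shifts with the full silicon thermo-optic coefficient (no cladding correction).
    wavelength_nm = 1550.0
    dn_dT = 1.8e-4      # silicon thermo-optic coefficient (1/K)
    group_index = 4.3   # assumed typical value for a silicon wire waveguide

    shift_pm_per_K = wavelength_nm * 1e3 * dn_dT / group_index
    print(f"Resonance shift: ~{shift_pm_per_K:.0f} pm per degree")  # a few tens of pm/K
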
However, the strong temperature dependence can also be used for active tuning
purposes: heaters can be fitted to WDM filters to actively shift the spectral response.
Such tuning should of course be controlled by a feedback loop which requires some
photodetector and additional electronic circuitry. The heaters themselves can be
incorporated as metallic resistors [29, 44], using silicides or doped silicon [107], or
the silicon of the waveguide itself can even be used as the heater element [50]. Examples of
such cross sections are shown in Fig. 2.15.
A significant drawback of thermal tuning is that it can only be used in one direction
(heating, not cooling). So to compensate for thermal variations in the chip (e.g. hotspots),
the operating temperature of the photonics layer should be kept at the upper boundary of
the operation specs: this requires a continuous power consumption to drive the heaters.
The power consumption of heaters for tuning depends entirely on the volume of
material that needs to be heated in order to raise the temperature of the silicon wave-
guide core to the desired temperature. Ideally, the waveguide and heaters are close
together and thermally isolated from the environment by an insulating material or
using an undercut etch, locally removing the thermally conductive silicon substrate
[28, 101]. Obviously, heating a small resonator is also more efficient than heating
long delay lines. In addition to the power consumption, small thermal volumes will
also have a smaller time constant, making faster operation possible. This is espe-
cially true for switching applications [101]. This way, thermal switches can be quite
efficient [37], but it depends on the actual use case. Thermo-optic switches have a
continuous power consumption, as they need to keep the temperature stable. So if
the time between switching operations is small, thermo-optic switching can be quite
efficient. If the time between switching actions is long compared to the switching
operation itself, the overall power consumption could be quite high.

Mechanical Tuning

An alternative to thermal tuning/switching is the use of mechanical effects.
This makes use of a combination of MEMS and optics. Free-space optical switches
based on micro-electro-mechanical systems (MEMS) have existed for many years
[64], but it is also possible to use MEMS, or rather NEMS (nano-electro-mechanical
systems), in combination with waveguides: free-standing silicon waveguides can be
actuated electrically. The

Fig. 2.15 Integrated heating mechanisms: (a) metal top heater. (b) Silicide (or highly-doped) side
heater. (c) Top metal heater with insulation trenches. (d) Heater inside the waveguide core

easiest configuration here is a directional coupler: two adjacent waveguides can be
electrostatically attracted or repelled, changing the coupling constant [22, 56]. This can
be used to tune wavelength filters, but it has more potential for switching: with electro-
static actuation, there is only power consumption while charging/discharging the
capacitors that control the waveguide position. This means that in between switch-
ing operations, the power consumption is limited to small leakage currents. Depending
on the spatial configuration, the time constants could also be quite low (Fig. 2.16).
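
A minimal sketch of why electrostatic actuation is attractive for infrequent switching, with an assumed actuator capacitance and drive voltage (real NEMS devices will differ): the only energy drawn per switching event is the ½CV² needed to charge or discharge the actuator, while the static hold power is set by leakage alone.

    # Energy per switching event of an electrostatically actuated waveguide coupler.
    # Capacitance and voltage are illustrative assumptions for a NEMS-scale actuator.
    capacitance_f = 10e-15   # assumed actuator capacitance: 10 fF
    voltage_v = 5.0          # assumed actuation voltage

    switch_energy_j = 0.5 * capacitance_f * voltage_v ** 2
    print(f"Energy per switching event: {switch_energy_j * 1e15:.1f} fJ")  # 125 fJ
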
Instead of tuning the coupling strength of a directional coupler, one can also
electromechanically actuate a slot waveguide: a waveguide consisting of a silicon
core with an etched slot in the middle. When the total waveguide cross section is
sufficiently small and the slot sufficiently narrow, such slot waveguides support a
single guided mode. Also, such slot waveguides can have a very high field intensity
in the low-index slot itself [4]. By moving the two parts of the slot waveguide, a
strong change in effective index can be obtained [3], making such waveguides
efficient phase modulators with a low power consumption [106].

Fig. 2.16 Mechanical waveguide actuation: (a) apply strain by actuating from the substrate. (b)
Electrostatically moving waveguide butt coupling. (c) Actuating the spacing in a directional cou-
pler. (d) Actuating the slot width of a slot waveguide

Carrier-Based Silicon Modulators

Carrier Manipulation

As discussed, the fastest and most efficient phase modulators are based on direct electro-
optic effects. However, because it has a centro-symmetric lattice, silicon does not have
the required second-order (Pockels) effect. While it is possible to induce this effect using
strain [57] to break the lattice symmetry, this requires substantial substrate engineering.
Therefore, the most common solution today for all-silicon modulators is to use
the carrier dispersion effect [86]: the refractive index (both the real and imaginary
part) of silicon depends on the concentration of electrons and holes in the material
[97]. Injection into or extraction of carriers out of a waveguide core will change
its effective index, and therefore its optical length. This results in a phase modula-
tion at the output. To manipulate the carrier density, one can use injection, deple-
tion or accumulation mechanisms, as shown in Fig. 2.17. The strongest effect is
carrier injection into the intrinsic region of a p-i-n diode, located in the center of
the waveguide core to maximize the overlap with the optical mode. Applying a
forward bias to the diode forces majority carriers from the p and n regions into

Fig. 2.17 Silicon modulator geometries. (a) Forward-biased p-i-n diode, (b) reverse-biased p-n
diode, (c) vertical p-n diode and (d) vertical silicon–oxide–silicon capacitor

the core [49, 122]. As this involves a lot of carriers, the effect is quite strong.
However, it is limited in speed by the recombination time of the carriers in the
core. To obtain modulation speeds well in excess of 1 Gbps, special driving
schemes using pre-emphasis are required.
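
The magnitude of the underlying carrier dispersion effect can be estimated with the empirical fits for silicon at λ ≈ 1.55 µm that are commonly attributed to Soref and Bennett; the coefficients below are the widely quoted approximate values, and the injected carrier density is an illustrative assumption.

    # Carrier-induced index and absorption change in silicon near 1.55 um,
    # using the commonly quoted Soref-Bennett empirical fits (approximate values).
    def delta_n(dNe_cm3, dNh_cm3):
        return -(8.8e-22 * dNe_cm3 + 8.5e-18 * dNh_cm3 ** 0.8)

    def delta_alpha_cm(dNe_cm3, dNh_cm3):   # extra absorption in 1/cm
        return 8.5e-18 * dNe_cm3 + 6.0e-18 * dNh_cm3

    dN = 1e18  # assumed injected electron and hole density (cm^-3)
    print(f"delta_n     = {delta_n(dN, dN):.2e}")        # about -3e-3
    print(f"delta_alpha = {delta_alpha_cm(dN, dN):.1f} 1/cm")
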
A faster alternative is based on the depletion of a pn diode in the core. Reverse-
biasing such a diode will expand or shrink the depletion region in the junction.
Because the number of carriers involved is much smaller than with the injection
scheme, the effect is much weaker. However, it is not limited by the carrier recom-
bination time, only by the mobility and the capacitance formed by the depletion
region [45, 69, 71]. The effect can be enhanced by using complex junction geome-
tries, or multiple junctions inside the waveguide core, to create a larger overlap
with the optical mode [78, 79]. However, as the modulation efficiency is directly
linked to the amount of carriers that are moved around, a high modulation efficiency
is typically combined with rather high absorption losses. Reverse-biased p-n diode
configurations have been demonstrated with a VπLπ of about 1 V cm.

Instead of a junction, it is also possible to use carrier accumulation in a capacitor
[70]. However, it is not straightforward to make a good capacitor with a vertical
insulator, so the most promising geometry is the use of a layered capacitor. This
involves somewhat more elaborate processing, but it also allows making very
efficient capacitors with a thin oxide, which can accumulate a lot of carriers for a
given operating voltage. Such waveguide configurations can reach even better VπLπ
values, as low as 0.8 V cm [119].

Silicon Modulator Components

The main effect of the carrier manipulation is a change in refractive index, even
though a change in absorption is also induced. To make an amplitude modulator
out of the resulting phase modulator, the junction or capacitor must be incorpo-
rated in an interferometer or (ring) resonator. In a Mach–Zehnder interferometer,
one can put a modulator in both arms and operate the device in push–pull: this
essentially halves the device length or operating voltage. Injection modulators with
very high modulation efficiency have been demonstrated with a length of only
150 µm [49], small enough to be driven as a lumped electrical load. Carrier deple-
tion modulators, on the other hand, require lengths of millimeters to get a decent
modulation depth at CMOS operating voltages. As the effects could support modu-
lation at 40 GHz or beyond, special care has to be taken with the electrical drivers
to avoid unwanted RF effects over the length of the modulator. The simplest
approach is to drive the diode from a coplanar microwave waveguide which runs
parallel to the optical waveguide: the electrical wave will copropagate with the
optical mode, and with careful design the propagation velocities can be matched.
The drawback of this approach is that the microwave waveguide needs to be termi-
nated, which dissipates a lot of power.
The alternative is to use a resonator-based structure in combination with a phase
modulator. The most common and practical resonator geometry for this purpose is
a ring resonator [11]: the modulator diode is curved into a compact ring [122, 124]
or disk [119], and on resonance the light circulates in this ring thousands of times.
The rings can be as small as 10 µm in diameter, which means that they can be electri-
cally actuated as lumped elements. This obviates the need for coplanar electrodes
and significantly reduces the power dissipation. Making use of a resonator intro-
duces some drawbacks: the main one is that the modulator resonance should be
spectrally aligned with the operating wavelength. This imposes stringent fabrication
requirements and the requirement of some tuning mechanism to compensate for
operating conditions. The modulator could be tuned by applying a bias to the modu-
lation voltage, but as the modulation effects are typically quite small, in most cases
the tuning range will be too small. So an additional tuning mechanism, such as a
heater, is required.
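
To see why this spectral alignment is demanding, a small sketch (assuming a loaded Q of 10,000 at 1550 nm, an illustrative value) gives the resonance linewidth, which sets the scale within which fabrication spread, temperature drift and the tuning mechanism must keep the resonance:

    # Resonance linewidth of a ring modulator and the implied alignment tolerance.
    # A loaded Q of 10,000 at 1550 nm is an illustrative assumption.
    wavelength_nm = 1550.0
    loaded_q = 10_000

    linewidth_pm = wavelength_nm / loaded_q * 1e3
    print(f"Resonance linewidth: {linewidth_pm:.0f} pm")  # 155 pm
    # The carrier wavelength and the ring must stay aligned to a fraction of this,
    # i.e. tens of pm, while (per the earlier estimate) one degree of temperature
    # drift already shifts the resonance by a comparable amount.
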

Carrier-Based Switches

Instead of using heaters or carriers for modulation, they can also be used for switching.
While the mechanism is the same, the operational requirements for switching are
different. Response times are of the order of µs or ms, and power efficiency is impor-
tant, as switches operate as passive devices, and as with WDM components, all
dissipation will add to the link power budget. In this respect, thermal switches seem
the simplest solution, and such switches have been demonstrated [27, 47, 117].
Alternatively, one can use carrier injection. This effect is still quite strong, and one of
the main drawbacks can now be turned into an advantage: as a modulator, the carrier
injection device is limited by the carrier lifetime in the intrinsic region of the junc-
tion. However, if the structure can be engineered to increase this lifetime, a switch
can maintain its state for longer without additional power consumption [108]. This is
an effect which also applies to charge accumulation devices, where the switching
action is controlled by charging or draining a capacitor.

Hybrid Silicon Modulators

As already mentioned, silicon is not necessarily the best material for electro-optic
modulation, given its lack of intrinsic first-order electro-optic effects. Therefore, an
efficient way could be to integrate the silicon with other optical materials or struc-
tures which allow efficient modulation. One possibility is the integration of IIIV
semiconductors, in a similar way as the light sources. Alternatively, electro-optic
materials can be directly integrated with the silicon.

Silicon/III–V Modulators

III–V semiconductors are well known for their good electro-optic properties, making
them an interesting candidate to realize high-performance modulators on a silicon
photonic platform. Similar to silicon, typically carrier-depletion-type modulators [25]
and Stark-effect electro-absorption modulators [62] are used. Also, III–V microdisk
modulator structures relying on the change in Q-factor by bleaching of the quantum
well absorption through current injection were demonstrated [76]. The first two
approaches were implemented in a hybrid waveguide approach, in which the optical
mode is partially confined to the silicon and partially to the III–V waveguide, similar
to the hybrid III–V/silicon laser platform. Realized electro-absorption modulators
show a 5 dB extinction ratio at 10 Gbit/s with a sub-volt drive and 30 nm optical band-
width. Mach–Zehnder-type modulators, based on a carrier-depletion approach, show
a modulation efficiency of 1.5 V mm and over 100 nm optical bandwidth. In these

cases the modulation bandwidth was RC-limited. By applying proper traveling-wave
electrode designs and terminations, much higher speeds can be envisioned. In the
microdisk approach, evanescent coupling between a silicon waveguide layer and a
III–V microdisk mode is used. The microdisk supports several resonances, the
Q-factor of which, and hence the transmission characteristic of the disk, can be
altered by current injection in the quantum well active region, which bleaches the
absorption.

Slot or Sandwich Modulators

Silicon modulators can be significantly improved by adding other materials
which do have a strong χ(2) effect. Such materials include polymers, perovskites or
silicon nanocrystals. To integrate such materials with silicon waveguides, and have
a strong overlap of the light with the active material, slot waveguides can be used.
As already mentioned, such slot waveguides can have a very high optical field inside
the slot, as long as the refractive index of the material in the slot is substantially
lower than that in the core of the silicon [4]. An external electric field will then
change the refractive index of the electro-optic material and thus the effective index
of the waveguide (Fig. 2.18).
The slot can be etched vertically in the waveguide (working for the TE polarization)
[4] or sandwiched as a thin layer into a multilayer silicon core (working for the
TM polarization) [80, 125].
Modulators based on this effect have been demonstrated, using an electro-optic
polymer filling in the slot [2, 6, 31, 65]. In such modulators the electro-optic
modulation effect is intrinsically faster as they are less limited by carrier dynam-
ics, and the main limitations are presented by the need for high-speed RF elec-
trodes and low RC parasitic time constants. Also, horizontal sandwich structures
filled with nanocrystals have been demonstrated for switches [80]. As with the
electromechanically actuated waveguides, such switches could also have a low
power consumption while not switching, limited by leakage current through the
slot or sandwich layer.

Photodetectors

Introduction

At the end of the optical link the optical signals need to be converted to the electrical
domain again. This has to be done at high speed and with as little signal
degradation due to noise as possible (i.e. with high sensitivity). Integrated photodetectors, which convert
the incident optical power into a photocurrent, connected to an integrated transim-
pedance amplifier, enable the optical-electrical conversion on the photonic intercon-
nection layer. For an intra-chip optical interconnect application, the photodetector

Fig. 2.18 Hybrid silicon modulator concepts. (a) Vertical sandwich waveguide, (b) slot-based sili-
con hybrid modulator

should satisfy several requirements. The speed of the detector, its responsivity and
its dark current are important performance metrics. However, also the device foot-
print and the available thermal budget for the incorporation of the photodetectors in
the electronic/photonic integrated circuit are important. Several material systems
can be considered to realize the photodetectors. While crystalline silicon is transpar-
ent for near-infrared wavelengths (λ > 1.1 µm), silicon photodetectors can still be used
in an on-chip interconnect context, for example for optical clock distribution through
free space using 850 nm wavelengths, or by inducing defects in the silicon (through
ion implantation) which render the material absorbing at near-infrared communica-
tion wavelengths. The use of silicon-germanium or III–V semiconductors is, how-
ever, a more straightforward route to realize high-performance integrated
photodetectors on a silicon waveguide circuit. In the following subsections we will
give a brief overview of the state-of-the-art in the integration of photodetectors on a
silicon waveguide platform.

Photodetector Geometry

Basically, two types of photodetector structures are used, differing in the way an elec-
tric field is applied in the absorbing region. In one approach, a reverse-biased p-i-n
structure is used to extract the generated electron–hole pairs from the absorbing

Fig. 2.19 Integrated photodetector geometries: p-i-n photodetectors versus metal–semiconductor–metal

region, while in another approach a metal–semiconductor–metal structure is used,
consisting of two back-to-back Schottky contacts. Two approaches can be
considered for the illumination of the photodiodes. Either surface illumination can be
used or a waveguide geometry can be used to provide efficient absorption of light. In
the case of surface illumination, there is a trade-off to be made between device speed
and responsivity of the photodetector. The bandwidth of a photodetector is determined
by the speed with which it responds to variations in the incident optical power. There
are three major factors which influence the speed of response: the RC time constant of
the detector and load, the transit time resulting from the drift of carriers across the
depletion layer and the delay resulting from the diffusion of carriers generated outside
the depletion layer. In a well-designed photodetector, the third contribution can be
neglected. The carrier transit time and RC time constant play a dominant role.
Typically, for high-speed photodetectors, the carrier transit time cannot be neglected
and a trade-off between responsivity (a high responsivity requires a thick absorbing
layer for surface illuminated photodetectors) and bandwidth (a short carrier transit
time requires a thin absorbing layer) needs to be made (Fig. 2.19). Waveguide based
photodetectors are considered to be more suited for on-chip optical interconnects,
since there is no trade-off to be made between device speed and responsivity
(Fig. 2.20).
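
This trade-off can be sketched numerically; the absorber thickness, carrier saturation velocity, junction capacitance and load below are illustrative assumptions, and the transit-time expression is the usual 0.45·v/d rule of thumb rather than a device-specific model.

    import math

    # Illustrative photodetector bandwidth estimate (assumed, not measured, values).
    d_um = 1.0            # absorber / depletion layer thickness (um)
    v_sat_cm_s = 6e6      # assumed carrier saturation velocity in the absorber (cm/s)
    c_det_ff = 10.0       # assumed junction capacitance (fF)
    r_load_ohm = 50.0     # load resistance (ohm)

    f_transit_ghz = 0.45 * v_sat_cm_s / (d_um * 1e-4) / 1e9   # 0.45*v/d rule of thumb
    f_rc_ghz = 1.0 / (2 * math.pi * r_load_ohm * c_det_ff * 1e-15) / 1e9
    f_total_ghz = 1.0 / math.sqrt(1 / f_transit_ghz**2 + 1 / f_rc_ghz**2)

    print(f"transit-limited: {f_transit_ghz:.0f} GHz, RC-limited: {f_rc_ghz:.0f} GHz, "
          f"combined: {f_total_ghz:.0f} GHz")
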

Fig. 2.20 Integrated photodetector geometries: coupling from optical waveguide to integrated
photodetector

Silicon Photodetectors

The idea of generating a clock signal on an electronic IC by means of an optical
signal stems from the delay, skew and jitter a normal electronic clock distribution
suffers from, especially when the clock frequency is increased. In optics, it is rela-
tively straightforward to generate extremely short pulses (100 fs to 10 ps) by mode-
locking a laser at a high repetition rate. The repetition rate of such an optical pulse
train is solely defined by the round trip time in the laser, making the generated pulse
stream a very stable clock source. In one approach, this optical clock is distributed
over the chip by free space optics, and silicon photodetectors are used to detect the
850 nm signal. By making the device capacitance sufficiently small, the voltage
swing that is generated by the photodetector (when terminated with a sufficiently
large load) is large enough to allow operation without a receiving amplifier circuit
at all. This is the idea of so-called receiverless data injection [30].
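
A minimal sketch of the receiverless idea, with illustrative numbers for responsivity, received power, bit period and node capacitance: the photocurrent integrated over one bit on a sufficiently small capacitance already builds up a logic-scale voltage swing.

    # Voltage swing built up on a small photodetector capacitance during one bit.
    # All values are illustrative assumptions.
    responsivity_a_w = 0.5     # detector responsivity (A/W)
    optical_power_w = 100e-6   # received optical power (W)
    bit_period_s = 100e-12     # 10 Gb/s -> 100 ps per bit
    capacitance_f = 5e-15      # assumed total detector + node capacitance: 5 fF

    charge_c = responsivity_a_w * optical_power_w * bit_period_s
    swing_v = charge_c / capacitance_f
    print(f"Voltage swing per bit: {swing_v:.2f} V")  # ~1 V for these numbers
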
While this is an elegant approach for optical clock distribution, a waveguide-
based approach is needed when a dense intra-chip optical interconnect is required.
As discussed in section Waveguide Circuits, silicon waveguide circuits are very
well suited for this task. The required transparency of the silicon waveguide makes
it impossible to realize efficient photodetectors in this same material although some
mechanisms exist: mid-bandgap absorption, surface-state absorption, internal
photoemission absorption and two-photon absorption (TPA) [23].

For instance, radiation-damaged silicon used as the photodetector material will
produce a photocurrent when illuminated with sub-bandgap wavelengths. By locally
implanting Si ions in the pristine silicon waveguide, divacancies and interstitial
clusters are formed, inducing substantial optical attenuation (> 100 dB/cm) in an
otherwise transparent material. The electron–hole pairs that are created in this way
can be extracted from the photodetector by applying an electric field over the
implanted region, e.g. by reverse biasing a silicon p-i-n diode. This approach, how-
ever, still requires relatively large photodetectors for an on-chip optical interconnect
application (a problem that could be overcome by implementing the photodetector
in a resonator structure, but this makes the responsivity of the photodetector very
wavelength dependent) and requires large bias voltages [46].
Even in pure, undamaged silicon, two-photon absorption (TPA) can be used to
create a detector for in-line monitoring purposes [18, 68]. In a waveguide it can be
implemented as a simple p-i-n diode in the waveguide core, similar to a carrier-
injection modulator. Carriers generated by TPA will be extracted as photocurrent,
and as TPA requires two photons to generate an electron, the response of the detec-
tor will be quadratic. Such a detector scheme is especially useful in a resonator
structure, as on resonance there is a high power in the resonator. The detector itself
does not necessarily introduce additional losses in the resonator: TPA is a process
that occurs anyway, and by extracting the generated free carriers as photocurrent, a
TPA detector will even reduce the losses introduced by free carrier absorption.

III–V Photodetectors

High-frequency (> 1 GHz) infrared detectors, sensitive between 1,100 and
1,700 nm, are usually fabricated in semiconducting InGaAs. Although this semicon-
ductor has the advantage of bandgap tailoring, it requires additional technology to
integrate these III–V semiconductor materials on the electronic/photonic integrated
circuit. A straightforward approach (and the most rugged one) is to use flip-
chip integration. This approach, however, limits the density of integration. With dis-
crete devices, receiver sensitivity is limited by the capacitance of the bulky detector.
Thanks to the much smaller capacitance of waveguide detectors, the receiver elec-
tronics can be redesigned with much higher performance, implying also a perfor-
mance improvement when detectors can be integrated.
In order to achieve the integration of III–V semiconductors on the silicon
waveguide platform, a die-to-wafer bonding procedure can be used, as discussed
in section Light Sources, to transfer the III–V epitaxial layer stack onto the sili-
con waveguide circuit. This approach has the advantage that a dense integration
of photodetectors can be achieved and that all alignment is done by means of
lithographic techniques. An alternative approach would be to hetero-epitaxially
grow III–V compounds on the silicon waveguide circuit. The large mismatch in
lattice constant between silicon and InP-based semiconductors makes it difficult,
however, to form high-quality III–V semiconductor layers on silicon, although a

lot of progress has been made in this field in recent years. Layer quality has a
direct influence on photodetector dark current, responsivity and maximum oper-
ation speed.
Most of the research so far has been geared towards the demonstration of high-
performance III–V semiconductor photodetectors on silicon, however without
addressing issues such as compatibility of the metallization with CMOS integration.
For example, typically Au-based electrodes are used for these devices. Surface illu-
mination on a silicon waveguide platform can be accomplished by using a diffrac-
tion grating to deflect the light from the silicon waveguide to the III–V layer stack.
This approach has the advantage that the photodetector does not have to be closely
integrated with the silicon waveguide layer. It can easily be placed a few micrometers
away from the silicon waveguide layer. Proof-of-principle devices based on this
concept were realized in [87] on a 10 × 10 µm footprint, however showing a limited
responsivity due to the sub-optimal epitaxial layer structure that was used.
Both p-i-n-type [9] and MSM-type [21] waveguide photodetectors were realized
on a silicon-on-insulator waveguide platform. Responsivities in the range of
0.5–1 A/W were realized this way, in a device of about 50 µm² in size. In the case of
the metal–semiconductor–metal photodetector, the device speed is determined by
the spacing between the electrodes and the applied bias. Using a conservative 1 µm
spacing between the electrodes and applying 5 V reverse bias, simulations predict a
bandwidth over 35 GHz. In the p-i-n structure, a bandwidth of 33 GHz was experi-
mentally obtained.

Germanium Photodetectors

Monolithic integration of a photodetector in an SOI waveguide technology
requires an active material compatible with silicon technology. Germanium wave-
guide photodetectors enable the design of high-speed optical receivers with very
high performance. The integration of bulk germanium results in the shortest
absorption length and is hence the preferred option for high-performance photo-
detectors, given the reduced device capacitance of short photodetectors. The
germanium can be integrated in two ways, either through epitaxial growth or by
wafer bonding. Epitaxial growth is the most widely followed route for integration since it
leverages the SiGe integration technologies developed in micro-electronics.
Hetero-epitaxial growth also brings along considerable challenges, given the lat-
tice mismatch between Ge and Si and also given the fact that thermal budget limi-
tations constrain the type of process that can be used for epitaxy and, eventually,
the resulting material quality. Therefore, it has been essential to develop a low-
thermal-budget process that could fit into a standard CMOS wafer fabrication
flow without affecting the performance of the transistors and other optical devices,
while maintaining a reasonable material quality. This way, very high speed photo-
detectors were realized (with a bandwidth of over 40 GHz) with close to theoreti-
cal responsivities of 1 A/W [26, 50, 82, 115, 116].
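
The phrase "close to theoretical" can be checked against the standard responsivity limit R = ηqλ/(hc); the sketch below assumes unity quantum efficiency, so 1 A/W at 1550 nm corresponds to a quantum efficiency of roughly 80%.

    # Upper bound on photodetector responsivity, R = eta * q * lambda / (h * c).
    q = 1.602e-19       # electron charge (C)
    h = 6.626e-34       # Planck constant (J*s)
    c = 2.998e8         # speed of light (m/s)
    wavelength_m = 1.55e-6
    quantum_efficiency = 1.0   # assume every photon yields one electron-hole pair

    r_max = quantum_efficiency * q * wavelength_m / (h * c)
    print(f"Maximum responsivity at 1550 nm: {r_max:.2f} A/W")   # ~1.25 A/W
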

For the detection of very-low-power optical signals at very high speed, as can be
expected in on-chip optical interconnect applications, avalanche photodetectors can
be used, exploiting charge amplification close to avalanche breakdown. These ava-
lanche photodetectors allow for an improved sensitivity compared to standard pho-
todetectors. In [5, 58] such devices are reported, which achieve an avalanche gain
of about 10 dB with an operational speed higher than 30 GHz. Moreover, integrating
the photodetector allows the device to be made very compact, which reduces the
required voltage over the photodetector to about 1.5 V.

Integration in an Electronics Interconnect

When all building blocks of a photonic interconnect link or network are there, they
need to be integrated with electronics. Within the electronics we should distinguish
between the actual functional blocks that need to be interconnected (e.g. processor
cores or large blocks of memory) and the additional electronics that supports the
actual optical link (laser drivers, tuning current sources, monitor readouts,
amplifiers, etc.). In essence, the latter is an essential part of the photonic circuitry,
rather than of the electronics themselves.
links, it is essential that the driver electronics are as close as possible to the actual
photonic components.
In the integration of photonics and electronics there are many trade-offs that
need to be considered. Operation speed is definitely one of them, but also power
consumption, heat dissipation strategies, chip real estate, yield, and finally bring-
ing the electronics and photonics together. The most commonly considered inte-
gration scenarios are illustrated in Fig. 2.21: Integration of the photonics layer
directly with the transistors, integrating the photonics in or on top of the metal
interconnect layers, or fabricating the photonics layer separately and using a 3-D
integration strategy to bring both layers together. In this section we will compare
the merits of those options.
When integrating photonics and electronics, one will always be faced with simi-
lar questions: what will be the impact of one technology on the other, and the result-
ing compromises. And of course there is the problem of compound yield: the overall
yield of the integrated photonicelectronic circuit is the product of the electronics
yield, the photonics yield, and the yield of the integration process. If one of the steps
has a low yield, the compound yield might make the approach unviable, unless one
can incorporate a selection step with intermediate testing.
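
The compound-yield argument is simply multiplicative; a small sketch with assumed (illustrative) per-step yields makes the point.

    # Compound yield of an integrated photonic-electronic process (illustrative numbers).
    yield_electronics = 0.95
    yield_photonics = 0.90
    yield_integration = 0.90

    compound = yield_electronics * yield_photonics * yield_integration
    print(f"Compound yield: {compound:.2f}")   # ~0.77 even though each step is >= 0.9
    # Intermediate testing (known-good dies) avoids multiplying in the losses of bad parts.
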

Front-End-of-Line

As we have discussed in the previous sections, silicon-on-insulator is a very attrac-
tive material for compact low-power photonic circuits. SOI is also used by a

Fig. 2.21 Integration strategies for photonics and electronics. A photonic circuit can be integrated
in the front-end-of-line (FEOL), back-end-of-line (BEOL) and using 3-D stacking

number of manufacturers for the fabrication of electronics, so it might seem natu-
ral to try and integrate both photonics and electronics in the same SOI substrate.
However, the requirements for photonic versus electronic SOI substrates are very
different: compared to advanced SOI CMOS nodes, the photonic waveguiding
layer is quite thick, and more resembles a bulk silicon substrate. Also, the buried
oxide cladding for photonics is around 2 µm thick, which results in a very high
thermal barrier for the electronics. So it does not seem to be straightforward to include
both electronics and photonics on the same substrate.
Still, this approach is being pursued by a few actors in the field. The most nota-
ble is Luxtera, which has adapted a Freescale 130 nm technology node to accom-
modate photonics: for this, serious tradeoffs were required in the photonic circuitry,
and it was not possible to take full advantage of the high-contrast nature of silicon
wire waveguides [36]. Luxtera's technology currently supports passive circuitry
with active tuning, electrical modulators and integrated germanium photodetec-
tors, enabling a full transceiver [85]. The laser source is integrated in a later stage
during packaging. Their electronic/photonic chips also include the necessary driving

circuitry as well as elementary logic. Even at a 130 nm node, the electronics on the
chip consumes much less real-estate than the photonics. Given the fact that transis-
tors and photonics compete for the same real-estate, this approach only makes
sense in situations where the primary function of the chip is photonic, and not
where photonics supports the electronics. The current products of Luxtera are
therefore focused towards active optical cables, and not on-chip interconnects [36].
Given the fact that chip real estate is extremely precious, front-end photonic/
electronic integration for on-chip interconnects only makes sense if the photon-
ics does not encroach on the transistor space of the logic it is serving. Still, the argu-
ment for keeping the driver electronics close to the photonics holds. The solution is
to use an SOI process for cointegrating the driver electronics and the photonics, and
then use a 3-D integration technology to connect this photonic/electronic link layer
to the actual logic. Such techniques are discussed further.
While front-end integration does not seem to make sense for on-chip optical
interconnects, for longer interconnects it can still be the most attractive proposition.
However, only a few electronics manufacturers run their transistors in an SOI pro-
cess; the majority makes CMOS on bulk silicon. Because a CMOS manufacturer is
very unlikely to modify their processes to such an extent as to accommodate SOI
processes (which would need redevelopment or at least recalibrating their entire
front-end process), several groups in the world have explored the possibilities of
building a photonic substrate in a bulk CMOS process.
The main obstacle for making a waveguide in bulk silicon is the lack of a buffer
layer which optically insulates the waveguide from the high-index substrate. One
solution is to build a hybrid substrate with local SOI regions where the photonic
waveguides will be. This can be done in two ways: starting from a bulk Si wafer, or
starting from an SOI wafer. When using an SOI wafer, one can etch away the top
silicon and the buried oxide in the regions which will accommodate electronics.
Subsequently, a selective silicon epitaxial regrowth can be done to create a bulk sili-
con substrate at the same level as the waveguide layer. To finish, a chemical/mechan-
ical polishing (CMP) step is required. Alternatively, one could start with a bulk
silicon substrate, where a deep trench is etched in the waveguide regions. Using an
oxide deposition and CMP, a planar substrate with local areas of buried oxide is
created. Subsequently, the core layer of silicon can be deposited. This can be amor-
phous silicon, which can be recrystallized using solid-phase epitaxy, seeding off the
bulk silicon substrate [95]. Finally, it is even possible to create a local waveguide
layer by undercutting the bulk silicon substrate [83]. This results in waveguides
formed in the polysilicon gate layer which do have a higher propagation loss than
high-quality single-crystal waveguides.
Both approaches allow the integration of SOI waveguides in a bulk CMOS pro-
cess. However, this does not solve all issues. CMOS processing typically relies on a
very uniform distribution of features to achieve reliable processing over the chip
and the wafer (especially for dry etching and CMP). The same also holds for photo-
nics, and the density and length-scale of the features do not necessarily match.
Therefore, careful consideration is needed when combining both types of features
onto the same substrate in the same process layer, respecting the proper spacing and
inclusion of dummy structures to guarantee the correct densities.

Also, while one could devise a process flow where many steps are shared between
the transistors and the photonics, there will be a need for additional steps, which
could have an impact on chip/wafer yield. The compound yield of the process flow
can drop dramatically with the number of steps.

In/On the Metal Interconnect Layer

Electronic chips already have multiple metal interconnect layers. An optical inter-
connect layer embedded in or deposited on top of these metal layers would definitely
make sense from a separation-of-concerns point of view. However, this conflicts
somewhat with the technologies required for silicon photonics: especially the high-
quality single-crystal silicon layer needed for waveguides and modulators is impos-
sible to incorporate monolithically: there is no epitaxial substrate in the
back-end-of-line interconnect layers, and the temperature budget does not allow
silicon epitaxy: in BEOL processes, the process temperature is limited to ca. 450 °C.
As we have discussed, amorphous silicon is a possibility, but with the penalty of
higher optical losses and the difficulty of making good junctions for carrier-based
modulators. Other optical materials will not allow the same integration density as
silicon photonics and might only be suitable for real global interconnects.

3D Integration

To overcome this problem, the photonics layer can also be integrated on top of the
electronics using 3-D integration techniques. This would allow both layers to be
fabricated separately (in their own optimized process flow, or even in different fabs).
This means fewer compromises are needed in both the electronics and photonics,
and there is no real competition for real estate. It is now also possible to make the
photonics layer in one technology, and still remain compatible with various CMOS
technology nodes: the photonics need not scale down as aggressively as the
advanced CMOS.
3-D integration can be accomplished in different ways, depending on the appli-
cation. The photonics can be stacked on the electronics or the other way around,
depending on the die size. On-chip interconnects will likely require a similar die
size for photonics and electronics, but for applications in sensing or spectroscopy,
or even off-chip datacomm, the photonics die could be larger than the electronics
die. In general, the smaller die will be stacked on the larger die.
3-D integration technology in general relies on through-silicon-vias (TSV) to
connect the metal layers of both chips. Here we can distinguish between processes
where this TSV is processed before or after the stacking. In via-first
processes, the TSVs could be fabricated in the photonics wafer, which then requires no
modifications to the electronics process. The photonics die would then be on the top
of the electronics die, and face upwards (needed for applications where access to the

Fig. 2.22 3-D integration approaches. (a) Photonics face-down. The photonics wafer is bonded
upside-down on the electronics wafer and metal TSV connections are processed after bonding and
substrate removal. (b) Photonics face-up. TSVs are processed in the photonics wafer and stick out
after substrate thinning. No wafer-scale processing is needed after stacking

waveguides is essential). Such TSVs are typically large, with the diameter and pitch
proportional to the thickness of the substrate. Large TSVs will then introduce para-
sitic resistance and capacitance [8, 54, 60], which can be a dominant speed bump for
high-speed interconnect.
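A crude first-order estimate helps make this concrete; in the sketch below the TSV is modeled as a cylindrical capacitor through its oxide liner, and all dimensions and material constants are illustrative assumptions rather than values from the cited studies.

    import math

    # Rough parasitic-capacitance estimate for a large TSV, modeled as a
    # cylindrical capacitor (metal via, SiO2 liner, silicon substrate).
    # All dimensions are illustrative assumptions.
    eps0, eps_ox = 8.854e-12, 3.9     # vacuum permittivity (F/m), SiO2 rel. permittivity
    length = 50e-6                    # 50 um via through a thinned substrate
    r_via, t_liner = 2.5e-6, 0.2e-6   # 5 um diameter via with a 200 nm oxide liner

    c_tsv = 2 * math.pi * eps0 * eps_ox * length / math.log((r_via + t_liner) / r_via)
    print("TSV capacitance ~ %.0f fF" % (c_tsv * 1e15))   # ~140 fF
    # Driving this through a 50-ohm source adds an RC time constant of several
    # picoseconds per via, which quickly erodes the timing margin of a multi-Gb/s link.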
When the vias are processed post-bonding, this problem can be largely over-
come. Silicon photonics uses an SOI wafer, so the buried oxide can be used as a very
selective stopping layer for substrate removal: an SOI photonics wafer could be bonded upside-down on a CMOS wafer, and the entire silicon substrate, and even the buried oxide, could be removed [41, 61]. Afterwards, deep-etched vias connect to the underlying CMOS metallization layers, and additional metal interconnects can even be processed on top. The layers here can be so thin that the parasitics of the large TSVs can be avoided (Fig. 2.22).

Backside Integration

To decouple the photonics and the electronics process but still process everything on a
single high-quality substrate, one can make use of the back side of the wafer as well as the front side: e.g., the silicon photonics layer could be processed on the front side of an SOI wafer, and the bulk electronics on the back side. The high-temperature steps for both sides could be executed first, after which the metal interconnects and the TSVs are
defined. Different TSV technologies could require wafer thinning and bonding to a
handling wafer. Using an unthinned wafer requires relatively large TSVs [41].

Competition for chip area is less than with FEOL integration, but as with some
3D integration approaches the TSVs need to pass through the transistor layer [41,
61]. Even though this approach makes optimal use of a single high-quality substrate, two-sided processing introduces problems for wafer handling, packaging, and testing.

Flip-Chip Integration

3-D integration is similar in topography to flip-chip integration [126]. However, flip-chip integration typically has the photonics and electronics layers facing one another, so no TSVs are necessary. A flip-chipped assembly therefore provides no direct access to the electronic or the photonic surface, which makes input/output a real issue. Still, flip-chipping shares the main advantages of 3-D integration: very dense 2D arrays of connections between photonics and electronics, no compound yield issue (both layers can be tested separately), and no conflicts over chip area.

Summary

In this chapter we took a closer look at the different components that are required
for on-chip optical interconnects, and more particularly WDM links. The technology
we focused on was that of silicon photonics, as it is the most obvious candidate to
realize on-chip optical links: the materials and processes are the closest to true
CMOS compatibility, and the high refractive index contrast makes it possible to
scale down the photonic building blocks to a footprint which allows thousands of
components on a single chip.
While most of the technology is already there, there are many issues that still
need to be solved when using silicon photonics for on-chip links. The biggest open question is that of the light source: while we discussed the various options, there is as
yet no clear-cut winner, and all options have their advantages and disadvantages.
The second main challenge with silicon photonic links is the thermal manage-
ment. Especially in a WDM setting, where spectral filters are required, silicon
photonics is extremely temperature-sensitive, and it is not inconceivable that a
significant portion of the power budget of the optical link will be needed for ther-
mal feedback and control.
The world of silicon photonics is moving extremely rapidly, and new break-
through developments are being reported every year. This chapter is therefore
only intended as a snapshot, and for that reason we focused mostly on explaining the principles, rather than giving a complete report of the latest and greatest results.
Given this fast technological progress, and the strong need for higher bandwidth,
we are convinced that on-chip optical links will become a reality later in this
decade.

References

1. Agarwal AM, Liao L, Foresi JS, Black MR, Duan X, Kimerling LC (1996) Low-loss poly-
crystalline silicon waveguides for silicon photonics. J Appl Phys 80(11):61206123
2. Alloatti L, Korn D, Hillerkuss D, Vallaitis T, Li J, Bonk R, Palmer R, Schellinger T, Koos C,
Freude W, Leuthold J, Fournier M, Fedeli J, Barklund A, Dinu R, Wieland J, Bogaerts W,
Dumon P, Baets R (2010) Silicon high-speed electro-optic modulator. In: 2010 7th IEEE
international conference on group IV photonics (GFP), pp 195197 Beijing, China
3. Almeida VR, Panepucci RR (2007) Noems devices based on slot-waveguides. In: Conference
on lasers and electro-optics/quantum electronics and laser science conference and photonic
applications systems technologies, p JThD104 Washington DC, USA
4. Anderson PA, Schmidt BS, Lipson M (2006) High confinement in silicon slot waveguides
with sharp bends. Opt Express 14(20):91979202
5. Assefa S, Xia F, Vlasov YA (2010) Reinventing germanium avalanche photodetector for
nanophotonic on-chip optical interconnects. Nature 464(7285):U80U91
6. Baehr-Jones T, Hochberg M, Wang GX, Lawson R, Liao Y, Sullivan PA, Dalton L, Jen AKY,
Scherer A (2005) Optical modulation and detection in slotted silicon waveguides. Opt Express
13(14):52165226
7. Barwicz T, Watts MR, Popovic MA, Rakich PT, Socci L, Kartner FX, Ippen EP, Smith HI
(2007) Polarization-transparent microphotonic devices in the strong confinement limit. Nat
Photon 1:5760
8. Bermond C, Cadix L, Farcy A, Lacrevaz T, Leduc P, Flechet B (2009) High frequency char-
acterization and modeling of high density TSV in 3d integrated circuits. In: 2009 SPI 09
IEEE workshop on signal propagation on interconnects, pp 14 Strasbourgh, France
9. Binetti PRA, Leijtens XJM, de Vries T, Oei YS, Di Cioccio L, Fedeli J-M, Lagahe C, Van
Campenhout J, Van Thourhout D, van Veldhoven PJ, Notzel R, Smit MK (2009) Inp/InGaAs
photodetector on SOI circuitry. In: Group IV photonics, pp 214216 2009 6th IEEE interna-
tional conference on group IV photonics (GFP), San Francisco, USA
10. Bogaerts W, Baets R, Dumon P, Wiaux V, Beckx S, Taillaert D, Luyssaert B, Van Campenhout
J, Bienstman P, Van Thourhout D (2005) Nanophotonic waveguides in silicon-on-insulator
fabricated with CMOS technology. J Lightwave Technol 23(1):401412
11. Bogaerts W, De Heyn P, Van Vaerenbergh T, De Vos K, Kumar Selvaraja S, Claes T, Dumon
P, Bienstman P, Van Thourhout D, Baets R (2012) Silicon microring resonators. Laser Photon Rev 6(1):47–73. doi: 10.1002/lpor.201100017
12. Bogaerts W, Dumon P, Van Thourhout D, Taillaert D, Jaenen P, Wouters, J, Beckx S, Wiaux
V, Baets R (2006) Compact wavelength-selective functions in silicon-on-insulator photonic
wires. J Sel Top Quantum Electron 12(6):13941401
13. Bogaerts W, Selvaraja SK (2011) Compact single-mode silicon hybrid rib/strip waveguide
with adiabatic bends. IEEE Photon J 3(3):422432
14. Bogaerts W, Selvaraja SK, Dumon P, Brouckaert J, De Vos K, Van Thourhout D, Baets R
(2010) Silicon-on-insulator spectral filters fabricated with CMOS technology. J Sel Top
Quantum Electron 16(1):3344
15. Bogaerts W, Wiaux V, Taillaert D, Beckx S, Luyssaert B, Bienstman, P, Baets R (2002)
Fabrication of photonic crystals in silicon-on-insulator using 248-nm deep UV lithography.
IEEE J Sel Top Quantum Electron 8(4):928934
16. Boyraz O, Jalali B (2004) Demonstration of a silicon Raman laser. Opt Express 12(21):
52695273
17. Boyraz O, Jalali B (2005) Demonstration of directly modulated silicon Raman laser. Opt
Express 13(3):796800
18. Bravo-Abad J, Ippen EP, Soljacic M (2009) Ultrafast photodetection in an all-silicon chip
enabled by two-photon absorption. Appl Phys Lett 94:241103
19. Brouckaert J, Bogaerts W, Dumon P, Van Thourhout D, Baets R (2007) Planar concave grat-
ing demultiplexer fabricated on a nanophotonic silicon-on-insulator platform. J Lightwave
Technol 25(5):12691275

20. Brouckaert J, Bogaerts W, Selvaraja S, Dumon P, Baets R, Van Thourhout D (2008) Planar
concave grating demultiplexer with high reflective Bragg reflector facets. IEEE Photon
Technol Lett 20(4):309311
21. Brouckaert J, Roelkens G, Van Thourhout D, Baets R (2007) Compact InAlAs/InGaAs
metalsemiconductormetal photodetectors integrated on silicon-on-insulator waveguides.
IEEE Photon Techol Lett 19(19):14841486
22. Bulgan E, Kanamori Y, Hane K (2008) Submicron silicon waveguide optical switch driven by
microelectromechanical actuator. Appl Phys Lett 92(10):101110
23. Casalino M, Coppola G, Iodice M, Rendina I, Sirleto L (2010) Near-infrared sub-bandgap
all-silicon photodetectors: state of the art and perspectives. Sensors 10:1057110600
24. ChaiChuay C, Yupapin PP, Saeung P (2009) The serially coupled multiple ring resonator
filters and Vernier effect. Opt Appl XXXIX(1):175194
25. Chen H-W, Kuo Y-H, Bowers JE (2008) High speed hybrid silicon evanescent mach-zehnder
modulator and switch. Opt Express 16:2057120576
26. Chen L, Lipson M (2009) Ultra-low capacitance and high speed germanium photodetectors
on silicon. Opt Express 17(10):79017906
27. Chu T, Yamada H, Ishida S, Arakawa Y (2005) Compact 1 x n thermo-optic switches based
on silicon photonic wire waveguides. Opt Express 13(25):1010910114
28. Cunningham JE, Shubin I, Zheng X, Pinguet T, Mekis A, Luo Y, Thacker H, Li G, Yao J, Raj
K, Krishnamoorthy AV (2010) Highly-efficient thermally-tuned resonant optical filters. Opt
Express 18(18):1905519063
29. Dai D, Yang L, He S (2008) Ultrasmall thermally tunable microring resonator with a submi-
crometer heater on Si nanowires. J Lightwave Technol 26(58):704709
30. Debaes C, Agarwal D, Bhatnagar A, Thienpont H, Miller DAB (2002) High-impedance high-frequency silicon detector response for precise receiverless optical clock injection. In: SPIE Photonics West 2002, San Jose, CA. Proc SPIE 4654:78–88
31. Ding R, Baehr-Jones T, Liu Y, Bojko R, Witzens J, Huang S, Luo J, Benight S, Sullivan P,
Fedeli J-M, Fournier M, Dalton L, Jen A, Hochberg M (2010) Demonstration of a low V pi
L modulator with GHz bandwidth based on electro-optic polymer-clad silicon slot wave-
guides. Opt Express 18(15):1561815623
32. Dragone C (1991) An NxN optical multiplexer using a planar arrangement of two star cou-
plers. IEEE Photon Technol Lett 3(9):812814
33. Dragone C (1998) Efficient techniques for widening the passband of a wavelength router.
J Lightwave Technol 16(10):18951906
34. Dumon P, Bogaerts W, Van Thourhout D, Taillaert D, Baets R, Wouters, J, Beckx S, Jaenen
P (2006) Compact wavelength router based on a silicon-on-insulator arrayed waveguide grat-
ing pigtailed to a fiber array. Opt Express 14(2):664669
35. Dumon P, Bogaerts W, Wiaux V, Wouters J, Beckx S, Van Campenhout J, Taillaert D, Luyssaert
B, Bienstman P, Van Thourhout D, Baets R (2004) Low-loss SOI photonic wires and ring reso-
nators fabricated with deep UV lithography. IEEE Photon Technol Lett 16(5): 13281330
36. Duran P (2008) Blazar 40 Gbps optical active cable. Luxtera white paper, www.luxtera.com
37. Espinola RL, Tsai M-C, Yardley JT, Osgood RM Jr (2003) Fast and low-power thermooptic
switch on thin silicon-on-insulator. IEEE Photon Technol Lett 15(10):13661368
38. Fang AW, Koch BR, Gan K-G, Park H, Jones R, Cohen O, Paniccia MJ, Blumenthal DJ, Bowers
JE (2008) A racetrack mode-locked silicon evanescent laser. Opt Express 16(2): 13931398
39. Fang AW, Koch BR, Jones R, Lively E, Liang D, Kuo Y-H, Bowers JE (2008) A distributed
Bragg reflector silicon evanescent laser. IEEE Photon Technol Lett IEEE 20(20): 16671669
40. Fang AW, Park H, Cohen O, Jones R, Paniccia MJ, Bowers JE (2006) Electrically pumped
hybrid algainas-silicon evanescent laser. Opt Express 14(20):92039210
41. Fedeli JM, Augendre E, Hartmann JM, Vivien L, Grosse P, Mazzocchi, V, Bogaerts W, Van
Thourhout D, Schrank F (2010) Photonics and electronics integration in the Helios project.
In: 2010 7th IEEE international conference on group IV photonics (GFP), pp 356358
Beijing, China

42. Foresi JS, Black MR, Agarwal AM, Kimerling LC (1996) Losses in polycrystalline silicon
waveguides. Appl Phys Lett 68(15):20522054
43. Foster MA, Turner AC, Sharping JE, Schmidt BS, Lipson M, Gaeta AL (2006) Broad-band
optical parametric gain on a silicon photonic chip. Nature 441(7096):960963
44. Gan F, Barwicz T, Popovic MA, Dahlem MS, Holzwarth CW, Rakich PT, Smith HI, Ippen
EP, Kartner FX (2007) Maximizing the thermo-optic tuning range of silicon photonic struc-
tures. In: 2007 photonics in switching, pp 6768 San Francisco, USA
45. Gardes F, Reed G, Emerson N, Png C (2005) A sub-micron depletion-type photonic modula-
tor in silicon on insulator. Opt Express 13(22):88458854
46. Geis MW, Spector SJ, Grein ME, Yoon JU, Lennon DM, Lyszczarz TM (2009) Silicon wave-
guide infrared photodiodes with over 35 ghz bandwidth and phototransistors with 50 a/w
response. Opt Express 17(7):51935204
47. Geis MW, Spector SJ, Williamson RC, Lyszczarz TM (2004) Submicrosecond submilliwatt
silicon-on-insulator thermooptic switch. IEEE Photon Technol Lett 16(11):25142516
48. Gnan M, Thoms S, Macintyre DS, De La Rue RM, Sorel M (2008) Fabrication of low-loss
photonic wires in silicon-on-insulator using hydrogen silsesquioxane electron-beam resist.
Electron Lett 44(2):115116
49. Green WMJ, Rooks MJ, Sekaric L, Vlasov YuA (2007) Ultra-compact, low RF power, 10
Gb/s silicon MachZehnder modulator. Opt Express 15(25):1710617113
50. Gunn C (2006) Cmos photonics for high-speed interconnects. IEEE Micro 26(2):5866
51. Han H-S, Seo S-Y, Shin JH, Park N (2002) Coefficient determination related to optical gain in
erbium-doped silicon-rich silicon oxide waveguide amplifier. Appl Phys Lett 81(20): 37203722
52. Harke A, Krause M, Mueller J (2005) Low-loss singlemode amorphous silicon waveguides.
Electron Lett 41(25):13771379
53. Hattori HT, Seassal C, Touraille E, Rojo-Romeo P, Letartre X, Hollinger G, Viktorovitch P,
Di Cioccio L, Zussy M, Melhaoui LE, Fedeli JM (2006) Heterogeneous integration of micro-
disk lasers on silicon strip waveguides for optical interconnects. Photon Technol Lett IEEE
18(1):223225
54. Healy MB, Lim SK (2009) A study of stacking limit and scaling in 3d ICS: an interconnect
perspective. In: 2009 ECTC 2009 59th Electronic components and technology conference,
pp 12131220 San Diego, USA
55. Heebner J, Grover R, Ibrahim T (2008) Optical microresonators: theory, fabrication and
applications. In: Springer series in optical sciences, 1st edn. Springer, Berlin
56. Ikeda T, Takahashi K, Kanamori Y, Hane K (2010) Phase-shifter using submicron silicon wave-
guide couplers with ultra-small electro-mechanical actuator. Opt Express 18(7):70317037
57. Jacobsen RS, Andersen KN, Borel PI, Fage-Pedersen J, Frandsen LH, Hansen O, Kristensen
M, Lavrinenko AV, Moulin G, Ou H, Peucheret, C, Zsigri B, Bjarklev A (2006) Strained sili-
con as a new electro-optic material. Nature 441(7090):199202
58. Kang Y, Liu H-D, Morse M, Paniccia MJ, Zadka M, Litski S, Sarid G, Pauchard A, Kuo Y-H,
Chen H-W, Zaoui WS, Bowers JE, Beling A, McIntosh DC, Zheng X, Campbell JC (2009)
Monolithic germanium/silicon avalanche photodiodes with 340 GHz gain-bandwidth
product. Nat Photon 3(1):5963
59. Kazmierczak A, Bogaerts W, Drouard E, Dortu F, Rojo-Romeo P, Gaffiot F, Van Thourhout
D, Giannone D (2009) Highly integrated optical 4 x 4 crossbar in silicon-on-insulator tech-
nology. J Lightwave Technol 27(16):33173323
60. Kim DH, Mukhopadhyay S, Lim SK (2009) Tsv-aware interconnect length and power pre-
diction for 3d stacked ICS. In: 2009 IITC 2009 IEEE international interconnect technology
conference, pp 2628 Sapporo, Japan
61. Koester SJ, Young AM, Yu RR, Purushothaman S, Chen K-N, La Tulipe DC, Rana N, Shi L,
Wordeman MR, Sprogis EJ (2008) Wafer-level 3d integration technology. IBM J Res Dev
52(6):583597
62. Kuo Y-H, Chen Y-H, Bowers J E (2008) High speed hybrid silicon evanescent electroabsorp-
tion modulator. Opt Express 16:99369941

63. Lamponi M, Keyvaninia S, Pommereau F, Brenot R, de Valicourt G, Lelarge F, Roelkens G,


Van Thourhout D, Messaoudene S, Fedeli J-M, Duan G-H (2010) Heterogeneously integrated
InP/SOI laser using double tapered single-mode waveguides through adhesive die to wafer
bonding. In: 2010 7th IEEE international conference on group IV photonics (GFP), pp 2224
Beijing, China
64. Lee S-S, Huang L-S, Kim C-J, Wu MC (1999) Free-space fiber-optic switches based on
mems vertical torsion mirrors. J Lightwave Technol 17(1):713
65. Leuthold J, Freude W, Brosi J-M, Baets R, Dumon P, Biaggio I, Scimeca ML, Diederich F,
Frank B, Koos C (2009) Silicon organic hybrid technology: a platform for practical nonlinear
optics. Proc IEEE 97(7):13041316
66. Liang D, Bowers JE (2008) Highly efficient vertical outgassing channels for low-temperature
InP-to-silicon direct wafer bonding on the silicon-on-insulator substrate. J Vac Sci Technol B
26(4):15601568
67. Liang D, Fiorentino M, Okumura T, Chang H-H, Spencer DT, Kuo, Y-H, Fang AW, Dai D,
Beausoleil RG, Bowers JE (2009) Electrically-pumped compact hybrid silicon microring
lasers for optical interconnects. Opt Express 17(22):2035520364
68. Liang TK, Tsang HK, Day IE, Drake J, Knights AP, Asghari M (2002) Silicon waveguide
two-photon absorption detector at 15 mm wavelength for autocorrelation measurements. Appl
Phys Lett 81:13231325
69. Liao L, Liu A, Basak J, Nguyen H, Paniccia M, Rubin D, Chetrit Y, Cohen R, Izhaky N (2007)
40 gbit/s silicon optical modulator for highspeed applications. Electron Lett 43(22)
70. Liao L, Samara-Rubio D, Morse M, Liu A, Hodge D, Rubin D, Keil U, Franck T (2005) High
speed silicon MachZehnder modulator. Opt Express 13(8):31293135
71. Liu A, Liao L, Rubin D, Nguyen H, Ciftcioglu B, Chetrit Y, Izhaky N, Paniccia M (2007)
High-speed optical modulation based on carrier depletion in a silicon waveguide. Opt Express
15(2):660668
72. Liu J, Sun X, Camacho-Aguilera R, Kimerling LC, Michel J (2010) Ge-on-Si laser operating
at room temperature. Opt Lett 35(5):679681
73. Liu J, Sun X, Pan D, Wang X, Kimerling LC, Koch TL, Michel J (2007) Tensile-strained,
n-type Ge as a gain medium formonolithic laser integration on Si. Opt Express 15(18):
1127211277
74. Liu L, Pu M, Yvind K, Hvam JM (2010) High-efficiency, large-bandwidth silicon-on-insula-
tor grating coupler based on a fully-etched photonic crystal structure. Appl Phys Lett
96(5):051126
75. Liu L, Roelkens G, Van Campenhout J, Brouckaert J, Van Thourhout D, Baets R (2010) Iiiv/
silicon-on-insulator nanophotonic cavities for optical network-on-chip. J Nanosci Nanotechnol
10(3):14611472
76. Liu L, Van Campenhout J, Roelkens G, Soref R A, Van Thourhout D, Rojo-Romeo P, Regreny P,
Seassal C, Fédéli J-M, Baets R (2008) Carrier-injection-based electro-optic modulator on silicon-on-insulator with a heterogeneously integrated III-V microdisk cavity. Opt Lett 33(21):2518–2520
77. Lourenço MA, Gwilliam RM, Homewood KP (2007) Extraordinary optical gain from silicon
implanted with erbium. Appl Phys Lett 91(14):141122
78. Marris-Morini D, Le Roux X, Vivien L, Cassan E, Pascal D, Halbwax, M, Maine S, Laval S,
Fédéli J-M, Damlencourt J-F (2006) Optical modulation by carrier depletion in a silicon pin
diode. Opt Express 14(22):1083810843
79. Marris-Morini D, Vivien L, Fédéli J-M, Cassan E, Lyan P, Laval S (2008) Low loss and high
speed silicon optical modulator based on a lateral carrier depletion structure. Opt Express
16(1):334339
80. Martinez A, Blasco J, Sanchis P, Galan JV, Garcia-Ruperez J, Jordana EP, Gautier LY,
Hernandez S, Guider R, Daldosso N, Garrido BJ-M, Fedeli PL, Marti J, Spano R (2010)
Ultrafast all-optical switching in a silicon-nanocrystal-based silicon slot waveguide at tele-
com wavelengths. Nano Lett 10(4):15061511

81. McNab SJ, Moll N, Vlasov YA (2003) Ultra-low loss photonic integrated circuit with mem-
brane-type photonic crystal waveguides. Opt Express 11(22):29272939
82. Michel J, Liu J, Kimerling LC (2010) High-performance Ge-on-Si photodetectors. Nat
Photon 4(8):527534
83. Orcutt JS, Khilo A, Holzwarth CW, Popović MA, Li H, Sun J, Bonifield T, Hollingsworth R, Kärtner FX, Smith HI, Stojanović V, Ram RJ (2011) Nanophotonic integration in state-of-
the-art CMOS foundries. Opt Express 19(3):23352346
84. Pavesi L, Dal Negro L, Mazzoleni C, Franzo G, Priolo F (2000) Optical gain in silicon nano-
crystals. Nature 408(6811):440444
85. Pinguet T, Analui B, Balmater E, Guckenberger D, Harrison M, Koumans R, Kucharski D,
Liang Y, Masini G, Mekis A, Mirsaidi S, Narasimha A, Peterson M, Rines D, Sadagopan V,
Sahni S, Sleboda TJ, Song, D, Wang Y, Welch B, Witzens J, Yao J, Abdalla S, Gloeckner S, De
Dobbelaere P (2008) Monolithically integrated high-speed CMOS photonic transceivers. In:
2008 5th IEEE international conference on group IV photonics, pp 362364 Sorrento, Italy
86. Reed GT, Mashanovich G, Gardes FY, Thomson DJ (2010) Silicon optical modulators. Nat
Photon 4(8):518526
87. Roelkens G, Brouckaert J, Taillaert D, Dumon P, Bogaerts W, Van Thourhout D, Baets R
(2005) Integration of InP/InGaAsP photodetectors onto silicon-on-insulator waveguide cir-
cuits. Opt Express 13(25):1010210108
88. Roelkens G, Brouckaert J, Van Thourhout D, Baets R, Notzel R, Smit M (2006) Adhesive
bonding of InP/InGaAsP dies to processed silicon-on-insulator wafers using DVS-bis-
benzocyclobutene. J Electrochem Soc 153(12):G1015G1019
89. Roelkens G, Van Thourhout D, Baets R (2007) High efficiency grating couplers between sili-
con-on-insulator waveguides and perfectly vertical optical fibers. Opt Lett 32(11): 14951497
90. Roelkens G, Van Thourhout D, Baets R, Nötzel R, Smit M (2006) Laser emission and photo-
detection in an InP/InGaAsP layer integrated on and coupled to a silicon-on-insulator wave-
guide circuit. Opt Express 14(18):81548159
91. Rong HS, Liu AS, Jones R, Cohen O, Hak D, Nicolaescu R, Fang A, Paniccia M (2005) An
all-silicon Raman laser. Nature 433(7023):292294
92. Schrauwen J, Scheerlinck S, Van Thourhout D, Baets R (2009) Polymer wedge for perfectly
vertical light coupling to silicon. In: Broquin J-M, Greiner CM (eds) Integrated optics: devices,
materials, and technologies, vol XIII. Proceedings of SPIE, vol 7218, SPIE, p 72180B
93. Selvaraja S, Sleeckx E, Schaekers M, Bogaerts W, Van Thourhout D, Dumon P, Baets R
(2009) Low-loss amorphous silicon-on-insulator technology for photonic integrated circuitry.
Opt Commun 282(9):17671770
94. Selvaraja SK, Bogaerts W, Dumon P, Van Thourhout D, Baets R (2010) Subnanometer lin-
ewidth uniformity in silicon nanophotonic waveguide devices using CMOS fabrication tech-
nology. J Sel Top Quantum Electron 16(1):316324
95. Shin DJ, Lee KH, Ji H-C, Na KW, Kim SG, Bok JK, You YS, Kim SS, Joe IS, Suh SD, Pyo
J, Shin YH, Ha KH, Park YD, Chung CH (2010) MachZehnder silicon modulator on bulk
silicon substrate; toward dram optical interface. In: 2010 7th IEEE international conference
on group IV photonics (GFP), pp 210212 Beijing, China
96. Shoji T, Tsuchizawa T, Watanabe T, Yamada K, Morita H (2002) Low loss mode size converter
from 03mm square Si waveguides to singlemode fibres. Electron Lett 38(25):16691700
97. Soref R, Bennett B (1987) Electrooptical effects in silicon. J Quantum Electron 23(1):
123129
98. Sparacin DK, Sun R, Agarwal AM, Beals MA, Michel J, Kimerling LC, Conway TJ,
Pomerene AT, Carothers DN, Grove MJ, Gill DM, Rasras MS, Patel SS, White AE (2006)
Low-loss amorphous silicon channel waveguides for integrated photonics. In: 2006 3rd IEEE
international conference on group IV photonics, pp 255257 Ottawa, Canada
99. Spector S, Geis MW, Lennon D, Williamson RC, Lyszczarz TM (2004) Hybrid multi-mode/
single-mode waveguides for low loss. In: Optical amplifiers and their applications/integrated
photonics research. Optical Society of America, p IThE5 San Francisco

100. Spuesens T, Liu L, De Vries T, Rojo-Romeo P, Regreny P, Van Thourhout D (2009) Improved
design of an InP-based microdisk laser heterogeneously integrated with SOI. In: 6th IEEE
international conference on group IV photonics, p FA3 Sorrento, Italy
101. Sun P, Reano RM (2010) Submilliwatt thermo-optic switches using free-standing silicon-on-
insulator strip waveguides. Opt Express 18(8):84068411
102. Taillaert D, Bogaerts W, Bienstman P, Krauss TF, Van Daele P, Moerman I, Verstuyft S, De
Mesel K, Baets R (2002) An out-of-plane grating coupler for efficient butt-coupling between
compact planar waveguides and single-mode fibers. J Quantum Electron 38(7):949955
103. Taillaert D, Van Laere F, Ayre M, Bogaerts W, Van Thourhout D, Bienstman P, Baets R
(2006) Grating couplers for coupling between optical fibers and nanophotonic waveguides.
Jpn J Appl Phys 45(8A):60716077
104. Teng J, Dumon P, Bogaerts W, Zhang H, Jian X, Han X, Zhao M, Morthier G, Baets R (2009)
Athermal silicon-on-insulator ring resonators by overlaying a polymer cladding on narrowed
waveguides. Opt Express 17(17):1462714633
105. Tsuchizawa T, Yamada K, Fukuda H, Watanabe T, Takahashi J, Takahashi, M, Shoji T,
Tamechika E, Itabashi S, Morita H (2005) Microphotonics devices based on silicon microfab-
rication technology. IEEE J Sel Top Quantum Electron 11(1):232240
106. Van Acoleyen K, Roels J, Claes T, Van Thourhout D, Baets RG (2011) Nems-based optical
phase modulator fabricated on silicon-on-insulator. In: 2011 8th IEEE international confer-
ence on group IV photonics, p FC6 London, UK
107. Van Campenhout J, Green WMJ, Assefa S, Vlasov YA (2010) Integrated nisi waveguide heat-
ers for CMOS-compatible silicon thermooptic devices. Opt Lett 35(7):10131015
108. Van Campenhout J, Green WM, Assefa S, Vlasov YuA (2009) Low-power, 2x2 silicon elec-
tro-optic switch with 110-nm bandwidth for broadband reconfigurable optical networks. Opt
Express 17(26):2402024029
109. Van Campenhout J, Liu L, Romeo PR, Van Thourhout D, Seassal C, Regreny P, Di Cioccio
L, Fedeli J-M, Baets R (2008) A compact SOI-integrated multiwavelength laser source based
on cascaded InP microdisks. IEEE Photon Technol Lett 20(16):13451347
110. Van Campenhout J, Rojo RP, Regreny P, Seassal C, Van Thourhout D, Verstruyft S, Di Ciocco
L, Fedeli J-M, Lagahe C, Baets R (2007) Electrically pumped InP-based microdisk lasers
integrated with a nanophotonic silicon-on-insulator waveguide circuit. Opt Express 15(11):
67446749
111. Van Laere F, Claes T, Schrauwen J, Scheerlinck S, Bogaerts W, Taillaert D, O'Faolain L, Van
Thourhout D, Baets R (2007) Compact focusing grating couplers for silicon-on-insulator
integrated circuits. Photon Technol Lett 19(23):19191921
112. Van Laere F, Roelkens G, Ayre M, Schrauwen J, Taillaert D, Van Thourhout D, Krauss TF,
Baets R (2007) Compact and highly efficient grating couplers between optical fiber and nano-
photonic waveguides. J Lightwave Technol 25(1):151156
113. Van Thourhout D, Spuesens T, Selvaraja SK, Liu L, Roelkens G, Kumar R, Morthier G, Rojo-
Romeo P, Mandorlo F, Regreny P, Raz O, Kopp C, Grenouillet L (2010) Nanophotonic
devices for optical interconnect. J Sel Top Quantum Electron 16(5):13631375
114. Vermeulen D, Selvaraja S, Verheyen P, Lepage G, Bogaerts W, Absil P, Van Thourhout D,
Roelkens G (2010) High-efficiency fiber-to-chip grating couplers realized using an advanced
CMOS-compatible silicon-on-insulator platform. Opt Express 18(17):1827818283
115. Vivien L, Osmond J, Fédéli J-M, Marris-Morini D, Crozat P, Damlencourt J-F, Cassan E,
Lecunff Y, Laval S (2009) 42 ghz pin germanium photodetector integrated in a silicon-on-
insulator waveguide. Opt Express 17(8):62526257
116. Vivien L, Rouvière M, Fédéli J-M, Marris-Morini D, Damlencourt J-F, Mangeney J, Crozat
P, El Melhaoui L, Cassan E, Le Roux X, Pascal D, Laval S (2007) High speed and high
responsivity germanium photodetector integrated in a silicon-on-insulator microwaveguide.
Opt Express 15(15):98439848
117. Vlasov Y, Green WMJ, Xia F (2008) High-throughput silicon nanophotonic wavelength-
insensitive switch for on-chip optical networks. Nat Photon 2(4):242246

118. Wang Z, Chen Y-Z, Doerr CR (2009) Analysis of a synchronized flattop AWG using low
coherence interferometric method. IEEE Photon Technol Lett 21(8):498500
119. Watts MR, Trotter DC, Young RW, Lentine AL (2008) Ultralow power silicon microdisk modu-
lators and switches. In: 2008 5th IEEE international conference on group IV photonics, pp 46
Sorrento, Italy
120. Webster MA, Pafchek RM, Sukumaran G, Koch TL (2005) Low-loss quasi-planar ridge
waveguides formed on thin silicon-on-insulator. Appl Phys Lett 87(23), p.231108
121. Xia F, Rooks M, Sekaric L, Vlasov Yu (2007) Ultra-compact high order ring resonator filters
using submicron silicon photonic wires for on-chip optical interconnects. Opt Express
15(19):1193411941
122. Xu Q, Manipatruni S, Schmidt B, Shakya J, Lipson M (2007) 125 gbit/s carrier-injection-
based silicon micro-ring silicon modulators. Opt Express 15(2):430436
123. Yamada K, Shoji T, Tsuchizawa T, Watanabe T, Takahashi J, Itabashi S (2005) Silicon-wire-
based ultrasmall lattice filters with wide free spectral ranges. J Sel Topics Quantum Electron
11:232240
124. Ye T, Cai X (2010) On power consumption of silicon-microring-based optical modulators.
J Lightwave Technol 28(11):16151623
125. Zhang L, Yue Y, Xiao-Li Y, Wang J, Beausoleil RG, Willner AE (2010) Flat and low disper-
sion in highly nonlinear slot waveguides. Opt Express 18(12):1318713193
126. Zheng X, Patil D, Lexau J, Liu F, Li G, Thacker H, Luo Y, Shubin I, Li J, Yao J, Dong P, Feng
D, Asghari M, Pinguet T, Mekis A, Amberg P, Dayringer M, Gainsley J, Moghadam H F,
Alon E, Raj K, Ho R, Cunningham J, Krishnamoorthy A (2011) Ultra-efficient 10gb/s hybrid
integrated silicon photonic transmitter and receiver. Opt Express 19(6):51725186
127. Zhu S, Fang Q, Yu MB, Lo GQ, Kwong DL (2009) Propagation losses in undoped and
n-doped polycrystalline silicon wire waveguides. Opt Express 17(23):2089120899
Part II
On-Chip Optical Communication
Topologies
Chapter 3
Designing Chip-Level Nanophotonic
Interconnection Networks

Christopher Batten, Ajay Joshi, Vladimir Stojanović, and Krste Asanović

Abstract Technology scaling will soon enable high-performance processors with


hundreds of cores integrated onto a single die, but the success of such systems could
be limited by the corresponding chip-level interconnection networks. There have
been many recent proposals for nanophotonic interconnection networks that attempt
to provide improved performance and energy-efficiency compared to electrical net-
works. This chapter discusses the approach we have used when designing such net-
works, and provides a foundation for designing new networks. We begin by
reviewing the basic nanophotonic devices before briefly discussing our own silicon-
photonic technology that enables monolithic integration in a standard CMOS pro-
cess. We then outline design issues and categorize previous proposals in the literature
at the architectural level, the microarchitectural level, and the physical level. In
designing our own networks, we use an iterative process that moves between these
three levels of design to meet application requirements given our technology con-
straints. We use our ongoing work on leveraging nanophotonics in an on-chip tile-to-tile network, a processor-to-DRAM network, and a DRAM memory channel to
illustrate this design process.

C. Batten ()
School of Electrical and Computer Engineering,
College of Engineering, Cornell University, 323 Rhodes Hall, Ithaca, NY 14853, USA
e-mail: cbatten@cornell.edu
A. Joshi
Department of Electrical and Computer Engineering, Boston University, Boston, MA 02215, USA
V. Stojanović
Department of Electrical Engineering and Computer Science, Massachusetts Institute of
Technology, Cambridge, MA 02139, USA
K. Asanović
Department of Electrical Engineering and Computer Science, University of California at
Berkeley, Berkeley, CA 94720, USA


Keywords Nanophotonics · Optical interconnect · Multicore/manycore processors · Interconnection networks · Network architecture

Introduction

Today's graphics, network, embedded, and server processors already contain many
cores on one chip, and this number will continue to increase over the next decade.
Intra-chip and inter-chip communication networks are becoming critical compo-
nents in such systems, affecting not only performance and power consumption, but
also programmer productivity. Any future interconnect technology used to address
these challenges must be judged on three primary metrics: bandwidth density,
energy efficiency, and latency. Enhancements of current electrical technology
might enable improvements in two metrics while sacrificing a third. Nanophotonics
is a promising disruptive technology that can potentially achieve simultaneous
improvements in all three metrics, and could therefore radically transform chip-
level interconnection networks. Of course, there are many practical challenges
involved in using any emerging technology including economic feasibility, effec-
tive system design, manufacturing issues, reliability concerns, and mitigating vari-
ous overheads.
There has recently been a diverse array of proposals for network architectures
that use nanophotonic devices to potentially improve performance and energy
efficiency. These proposals explore different single-stage topologies from buses [9,
14, 29, 53, 74, 76] to crossbars [39, 44, 64, 65, 76] and different multistage topolo-
gies from quasi-butterflies [6, 7, 26, 32, 34, 41, 56, 63] to tori [18, 48, 69]. Note that
we specifically focus on chip-level networks as opposed to cluster-level optical net-
works used in high-performance computing and data-centers. Most proposals use
different routing algorithms, flow control mechanisms, optical wavelength organi-
zations, and physical layouts. While this diversity makes for an exciting new
research field, it also makes it difficult to see relationships between different propos-
als and to identify promising directions for future network design.
In previous work, we briefly described our approach for designing nanophoto-
nic interconnection networks, which is based on thinking of the design at three
levels: the architectural level, the microarchitectural level, and the physical
level [8]. In this chapter, we expand on this earlier description, provide greater
detail on design trade-offs at each level, and categorize previous proposals in the
literature. Architectural-level design focuses on choosing the best logical network
topology and routing algorithm. This early phase of design should also include a
detailed design of an electrical baseline network to motivate the use of nanophoto-
nic devices. Microarchitectural-level design considers which buses, channels, and
routers should be implemented with electrical versus nanophotonic technology.
This level of design also explores how to best implement optical switching, tech-
niques for wavelength arbitration, and effective flow control. Physical-level design
determines where to locate transmitters and receivers, how to map wavelengths to

waveguides, where to lay out waveguides for intra-chip interconnect, and where to


place optical couplers and fibers for inter-chip interconnect. We use an inherently
iterative process to navigate these levels in order to meet application requirements
given our technology constraints.
This chapter begins by briefly reviewing the underlying nanophotonic technol-
ogy, before describing in more detail our three-level design process and surveying
recent proposals in this area. The chapter then presents three case studies to illus-
trate this design process and to demonstrate the potential for nanophotonic intercon-
nection networks, before concluding with several general design themes that can be
applied when designing future nanophotonic interconnection networks.

Nanophotonic Technology

This section briefly reviews the basic devices used to implement nanophotonic
interconnection networks, before discussing the opportunities and challenges
involved with this emerging technology. See [10, 68] for a more detailed review of
recent work on nanophotonic devices. This section also describes in more detail the
specific nanophotonic technology that we assume for the case studies presented
later in this chapter.

Overview of Nanophotonic Devices

Figure 3.1 illustrates the devices in a typical wavelength-division multiplexed


(WDM) nanophotonic link used to communicate between chips. Light from an off-
chip two-wavelength (λ1, λ2) laser source is carried by an optical fiber and then cou-
pled into an optical power waveguide on chip A. A splitter sends both wavelengths
down parallel branches on opposite sides of the chip. Transmitters along each branch
use silicon ring modulators to modulate a specific wavelength of light. The diameter
of each ring sets its default resonant frequency, and the small electrical driver uses
charge injection to change the resonant frequency and thus modulate the corre-
sponding wavelength. Modulated light continues through the waveguides to the
other side of the chip where passive ring filters can be used to shuffle wavelengths
between the two waveguides. It is possible to shuffle multiple wavelengths at the
same time with either multiple single-wavelength ring filters or a single multiple-
wavelength comb filter. Additional couplers and single-mode fiber are used to con-
nect chip A to chips B and C. On chips B and C, modulated light is guided to
receivers that each use a passive ring filter to drop the corresponding wavelength
from the waveguide into a local photodetector. The photodetector turns absorbed
light into current, which is sensed by the electrical amplifier. Ultimately, the exam-
ple in Fig. 3.1 creates four point-to-point channels that connect the four inputs (I1–I4) to the four outputs (O1–O4), such that input I1 sends data to output O1, input I2 sends

Fig. 3.1 Nanophotonic devices. Four point-to-point nanophotonic channels implemented with
wavelength-division multiplexing. Such channels can be used for purely intra-chip communication
or seamless intra-chip/inter-chip communication. Number inside ring indicates resonant wave-
length; each input (I1–I4) is passively connected to the output with the corresponding subscript (O1–O4); the link corresponding to I2 → O2 on wavelength λ2 is highlighted (from [8], courtesy of
IEEE)

data to output O2, and so on. For higher bandwidth channels we can either increase
the modulation rate of each wavelength, or we can use multiple wavelengths to
implement a single logical channel. The same devices can be used for a purely intra-
chip interconnect by simply integrating transmitters and receivers on the same
chip.
As shown in Fig. 3.1, the silicon ring resonator is used in transmitters, passive
filters, and receivers. Although other photonic structures (e.g., MachZehnder inter-
ferometers) are possible, ring modulators are extremely compact (3–10 μm radius)
resulting in reduced area and power consumption. Although not shown in Fig. 3.1,
many nanophotonic interconnection networks also use active filtering to implement
optical switching. For example, we might include multiple receivers with active
filters for wavelength λ1 on chip B. Each receiver's ring filter would be detuned by
default, and we can then actively tune a single receivers ring filter into resonance
using charge injection. This actively steers the light to one of many possible outputs.
Some networks use active ring filters in the middle of the network itself. For exam-
ple, we might replace the passive ring filters on chip A in Fig. 3.1 with active ring
filters to create an optical switch. When detuned, inputs I1, I2, I3, and I4 are con-
nected to outputs O1, O4, O3, and O2, respectively. When the ring filters are actively
tuned into resonance, then the inputs are connected to the outputs with the corre-
sponding subscripts. Of course, one of the challenges with these actively switched
filters is in designing the appropriate electrical circuitry for routing and flow control
that determines when to tune or detune each filter.
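To make the switching behavior concrete, the toy Python model below encodes the two configurations of the Fig. 3.1 example described above; the port names are just labels, and the model ignores optical losses and timing.

    # Toy model of the 4x4 optical switch described above. With the ring filters
    # detuned, the light follows the default paths (I1->O1, I2->O4, I3->O3, I4->O2);
    # with the rings tuned into resonance, each input reaches the output with the
    # matching subscript.
    DETUNED = {"I1": "O1", "I2": "O4", "I3": "O3", "I4": "O2"}
    TUNED   = {"I1": "O1", "I2": "O2", "I3": "O3", "I4": "O4"}

    def route(input_port, rings_tuned):
        return (TUNED if rings_tuned else DETUNED)[input_port]

    assert route("I2", rings_tuned=False) == "O4"
    assert route("I2", rings_tuned=True) == "O2"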
Most recent nanophotonic interconnection designs use the devices shown in
Fig. 3.1, but some proposals also use alternative devices such as vertical cavity surface
emitting lasers combined with free-space optical channels [1, 78] or planar wave-
guides [48]. This chapter focuses on the design of networks with the more common
ring-based devices and linear waveguides, and we leave a more thorough treatment of
interconnect network design using alternative devices for future work.

Nanophotonic Technology Opportunities and Challenges

Nanophotonic interconnect can potentially provide significant advantages in terms


of bandwidth density and energy efficiency when compared to long electrical intra-
chip and inter-chip interconnect [55]. The primary bandwidth density advantage
comes from packing dozens of wavelengths into the same waveguide or fiber, with
each wavelength projected to operate at 5–10 Gb/s for purely intra-chip communication and 10–40 Gb/s for purely inter-chip communication. With waveguide pitches on the order of a couple of microns and fiber coupling pitches on the order of tens of
microns, this can translate into a tremendous amount of intra- and inter-chip band-
width. The primary energy-efficiency advantage comes from ring-modulator trans-
ceivers that are projected to require sub-150 fJ/bit of data-dependent electrical
energy regardless of the link length and fanout. This improvement in bandwidth
density and energy efficiency can potentially be achieved with comparable or
improved latency, making nanophotonics a viable disruptive technology for chip-
level communication.
Of course, there are many practical challenges to realizing this emerging technol-
ogy including economic feasibility, effective system design, manufacturing issues,
reliability concerns, and mitigating various overheads [22]. We now briefly dis-
cuss three of the most pressing challenges: opto-electrical integration, temperature
and process variability, and optical power overhead.

Opto-Electrical Integration

Tightly integrating optical and electrical devices is critical for achieving the potential
bandwidth density and energy efficiency advantages of nanophotonic devices. There
are three primary approaches for opto-electrical integration in intra-chip and inter-
chip interconnection networks: hybrid integration, monolithic back-end-of-line
(BEOL) integration, and monolithic front-end-of-line (FEOL) integration.
Hybrid Integration. The highest-performing optical devices are fabricated
through dedicated processes customized for building such devices. These optical
chips can then be attached to a micro-electronic chip fabricated with a standard
electrical CMOS process through package-level integration [2], flip-chip bonding
the two wafers/chips face-to-face [73, 81], or 3D integration with through-silicon
vias [35]. Although this approach is feasible using integration technologies avail-
able currently or in the near future, it requires inter-die electrical interconnect (e.g.,
micro-bumps or through-silicon vias) to communicate between the micro-electronic
and active optical devices. It can be challenging to engineer this inter-die intercon-
nect to avoid mitigating the energy efficiency and bandwidth density advantages of
chip-level nanophotonics.
Monolithic BEOL Integration. Nanophotonic devices can be deposited on top of
the metal interconnect stack using amorphous silicon [38], poly-silicon [67], silicon
nitride [5], germanium [52], and polymers [15, 33]. Ultimately, a combination of these
materials can be used to create a complete nanophotonic link [79]. Compared to

hybrid integration, BEOL integration brings the optical devices closer to the micro-
electronics which can improve energy efficiency and bandwidth density. BEOL inte-
gration does not require changes to the front end, does not consume active area, and
can provide multiple layers of optical devices (e.g., multi-layer waveguides). Although
some specialized materials can be used in BEOL integration, the nanophotonic devices
must be deposited within a strict thermal processing envelope and of course require
modifications to the final layers of the metal interconnect stack. This means that
BEOL devices often must trade-off bandwidth density for energy efficiency (e.g.,
electro-optic modulator devices [79] operate at relatively high drive voltages to achieve
the desired bandwidth and silicon-nitride waveguides have large bending losses limit-
ing the density of photonic devices). BEOL integration is suitable for use with both
SOI and bulk CMOS processes, and can potentially also be used in other applications
such as for depositing optics on DRAM or FLASH chips.
Monolithic FEOL Integration. Photonic devices without integrated electrical cir-
cuitry have been implemented in monocrystalline silicon-on-insulator (SOI) dies
with a thick layer of buried oxide (BOX) [23, 49], and true monolithic FEOL integra-
tion of electrical and photonic devices have also been realized [25, 28]. Thin-BOX
SOI is possible with localized substrate removal under the optical devices [31]. On
the one hand, FEOL integration can support high-temperature process modifications and enables the tightest possible coupling to the electrical circuits; on the other hand, it consumes valuable active area and requires modifications to the sensitive front-end processing.
These modifications can include incorporating pure germanium or high-percentage
silicon-germanium on the active layer, additional processing steps to reduce wave-
guide sidewall roughness, and improving optical cladding with either a custom thick
buried-oxide or a post-processed air gap under optical devices. In addition, FEOL
integration usually requires an SOI CMOS process, since the silicon waveguides are
implemented in the same silicon film used for the SOI transistors. There has, how-
ever, been work on implementing FEOL polysilicon nanophotonic devices with
localized substrate removal in a bulk process [58, 61].

Process and Temperature Variation

Ring-resonator devices have extremely high Q-factors, which enhance the electro-optical
properties of modulators and active filters and enables dense wavelength division mul-
tiplexing. Unfortunately, this also means small unwanted changes in the resonance can
quickly shift a device out of the required frequency operating range. Common sources
of variation include process variation that can result in unwanted ring geometry varia-
tion within the same die, and thermal variation that can result in spatial and temporal
variation in the refractive index of silicon-photonic devices. Several simulation-based
and experimental studies have reported that a 1 nm variation in the ring width can shift a ring's resonance by approximately 0.5 nm [47, 70], and a single-degree change in temperature can shift a ring's resonance by approximately 0.1 nm [22, 47, 51]. Many
nanophotonic network proposals assume tens of wavelengths per waveguide [6, 32, 63,
74, 76], which results in a channel spacing of less than 1 nm (100 GHz). This means

ring diameter variation of 2 nm or temperature variation of 10 °C can cause a ring reso-


nator to filter the incorrect neighboring wavelength.
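Plugging the sensitivities quoted above into a short calculation shows why this matters; the geometry and temperature excursions used below are simply the values mentioned in the text.

    # Resonance shift from the sensitivities quoted above: ~0.5 nm of shift per
    # 1 nm of ring-width variation and ~0.1 nm per degree Celsius, versus a
    # channel spacing of less than 1 nm.
    shift_from_width = 2.0 * 0.5    # 2 nm width variation      -> 1.0 nm shift
    shift_from_temp = 10.0 * 0.1    # 10 degC temperature swing -> 1.0 nm shift
    channel_spacing = 1.0           # nm (upper bound used above)
    print(shift_from_width >= channel_spacing, shift_from_temp >= channel_spacing)
    # Either effect alone already moves the resonance by a full channel spacing,
    # so the ring can lock onto the neighboring wavelength.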
Process Variation. A recent study of FEOL devices fabricated in a 0.35 μm process found
that intra-die variation resulted in a 100 GHz change in ring resonance, and intra-
wafer variation resulted in a 1 THz change in ring resonance across the 300 mm
wafer [84]. A different study of FEOL devices in a much more advanced technology
generation found a mean relative mismatch of 31 GHz within a multi-ring filter bank
but a much more significant mean absolute mismatch of 600 GHz inter-die varia-
tion [58]. These results suggest that design-time frequency matching for rings in
close proximity might be achievable at advanced technology nodes, but that fre-
quency matching rings located far apart on the same die or on different dies might
require some form of resonance frequency tuning.
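To relate these frequency mismatches to the channel spacing discussed above, they can be converted into a wavelength detuning via Δλ = λ²Δf/c; the 1200 nm operating wavelength used below is an assumption for illustration, borrowed from the detector discussion later in this section.

    # Convert the reported ring-to-ring frequency mismatches into wavelength
    # detuning using delta_lambda = lambda^2 * delta_f / c. The 1200 nm operating
    # wavelength is an assumption for illustration only.
    c, lam = 3.0e8, 1200e-9
    for label, df in (("intra-bank mismatch", 31e9), ("inter-die mismatch", 600e9)):
        print(label, "%.2f nm" % (lam ** 2 * df / c * 1e9))
    # ~0.15 nm within a filter bank, but ~2.9 nm between dies: the latter spans
    # several channel spacings and clearly calls for active tuning.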
Thermal Variation. Spatial and temporal temperature gradients are more troubling,
since these can be difficult to predict; greater than 10 °C variation is common in modern high-performance microprocessors. Simulation-based chip-level models suggest maximum temperature differentials up to 17 °C in space [47, 72] and up to 28 °C in time across different benchmarks [47]. An experimental study measured various blocks in an AMD Athlon microprocessor increasing from an idle ambient temperature of 45 °C to a steady-state temperature of 70 °C in the L1 data cache and 80 °C in the integer instruction scheduler, and measured peak spatial variation at approximately 35 °C between the heavily used blocks and idle blocks in the
chip [54].
There have been a variety of device-level proposals for addressing these chal-
lenges including injecting charge to use the electro-optic effect to compensate for
variation [51] (can cause self-heating and thermal runaway), adding thermal micro-
heaters to actively maintain a constant device temperature or compensate for pro-
cess variation [3, 21, 77] (requires significant static tuning power), using athermal
device structures [27] (adds extra area overhead), and using extra polymer materials
for athermal devices [16, 83] (not necessarily CMOS compatible). There has been
relatively less work studying variation in CMOS-compatible nanophotonic devices
at the system-level. Some preliminary work has been done on integrating thermal
modeling into system-level nanophotonic on-chip network simulators [57], and
studying run-time thermal management techniques for a specific type of nanopho-
tonic on-chip network [47]. Recent work has investigated the link-level implications
of local thermal tuning circuitry and adding extra rings to be able to still receive
wavelengths even after they have shifted due to thermal drift [24].

Optical Power Overhead

A nanophotonic link consumes several types of data-independent power: fixed


power in the electrical portions of the transmitters and receivers (e.g., clock and
static power), tuning power to compensate for process and thermal variation, and
optical laser power. The laser power depends on the amount of optical loss that any
given wavelength experiences as it travels from the laser, through the various devices

shown in Fig. 3.1, and eventually to the photodetector. In addition to the photonic
device losses, there is also a limit to the total amount of optical power that can be
transmitted through a waveguide without large non-linear losses. High optical losses
per wavelength necessitate distributing those wavelengths across many waveguides
(increasing the overall area) to stay within this non-linearity limit. Minimizing optical loss is a key device design objective, and meaningful system-level design must
take into account the total optical power overhead.
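The sketch below illustrates what such an accounting can look like for a single wavelength; every loss and sensitivity number in it is an illustrative assumption, not a value claimed in this chapter.

    # Illustrative per-wavelength optical power budget: the laser must supply the
    # receiver's minimum power plus all losses along the path. All numbers are
    # assumptions for illustration.
    losses_db = {
        "fiber couplers (x2)":        2 * 1.0,
        "waveguide (2 cm, 3 dB/cm)":  2.0 * 3.0,
        "through-port passes (x63)":  63 * 0.01,
        "drop filter":                1.5,
    }
    total_loss_db = sum(losses_db.values())          # ~10.1 dB
    receiver_sensitivity_dbm = -17.0                 # assumed minimum detectable power
    laser_power_dbm = receiver_sensitivity_dbm + total_loss_db
    print("on-chip laser power per wavelength: %.1f dBm" % laser_power_dbm)  # ~ -6.9 dBm
    # Dividing the corresponding optical power by the laser's wall-plug efficiency
    # gives the electrical power overhead charged to the link.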

MIT Monolithic FEOL Nanophotonic Technology

In the case studies presented later in this chapter, we will be assuming a monolithic
FEOL integration strategy. Our approach differs from other integration strategies,
since we attempt to integrate nanophotonics into state-of-the-art bulk-CMOS micro-
electronic chips with no changes to the standard CMOS fabrication process. In this
section, we provide a brief overview of the specific technology we are developing
with our colleagues at the Massachusetts Institute of Technology. We use our experi-
ences with a 65 nm test chip [60], our feasibility studies for a prototype 32 nm pro-
cess, predictive electrical device models [80], and interconnect projections [36] to
estimate both electrical and photonic device parameters for a target 22 nm technology
node. Device-level details about the MIT nanophotonic technology assumed in the
rest of this chapter can be found in [30, 5861], although the technology is rapidly
evolving such that more recent device-level work uses more advanced device and
circuit techniques [24, 25, 46]. Details about the specific technology assumptions for
each case study can be found in our previous system-level publications [6, 7, 9, 32].
Waveguide. To avoid process changes, we design our waveguides in the polysili-
con layer on top of the shallow-trench isolation in a standard bulk CMOS process
(see Fig. 3.2a). Unfortunately, the shallow-trench oxide is too thin to form an effec-
tive cladding and shield the core from optical-mode leakage into the silicon sub-
strate. We have developed a novel self-aligned post-processing procedure to etch
away the silicon substrate underneath the waveguide forming an air gap. A reason-
ably deep air gap provides a very effective optical cladding. For our case studies, we
assume eight-waveguide bundles can use the same air gap with a 4-μm waveguide pitch and an extra 5 μm of spacing on either side of the bundle. We estimate a time-
of-flight latency of approximately 10.5 ps/mm which enables raw interconnect
latencies for crossing a 400-mm2 chip to be on the order of one to three cycles for a
5-GHz core clock frequency.
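A quick check of the quoted latency: using the 10.5 ps/mm figure and a 5-GHz clock, the sketch below converts a few path lengths into cycles (the 20-mm edge follows from the 400-mm² die mentioned above; the longer paths are illustrative routed distances).

    # Waveguide time-of-flight in core clock cycles, from the figures quoted above.
    tof_ps_per_mm = 10.5
    cycle_ps = 1e12 / 5e9              # 200 ps per cycle at 5 GHz
    for path_mm in (20, 40, 55):       # die edge, two edges, longer routed path
        print("%d mm -> %.1f cycles" % (path_mm, path_mm * tof_ps_per_mm / cycle_ps))
    # 20 mm is about 1.1 cycles and 55 mm about 2.9 cycles, consistent with the
    # one-to-three-cycle range given above.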
Transmitter. Our transmitter design is similar to past approaches that use minor-
ity charge-injection to change the resonant frequency of ring modulators [50]. Our
racetrack modulator design is implemented by doping the edges of a polysilicon
modulator structure creating a lateral PiN diode with undoped polysilicon as the
intrinsic region (see Fig. 3.2b). Our device simulations indicate that with polysilicon
carrier lifetimes of 0.1–1 ns it is possible to achieve sub-100 fJ per bit time (fJ/bt)
modulator driver energy for random data at up to 10 Gb/s with advanced digital

Fig. 3.2 MIT monolithic FEOL nanophotonic devices. (a) Polysilicon waveguide over SiO2 film
with an air gap etched into the silicon substrate to provide optical cladding; (b) polysilicon ring
modulator that uses charge injection to modulate a single wavelength: without charge injection the
resonant wavelength is filtered to the drop port while all other wavelengths continue to the
through port; with charge injection, the resonant frequency changes such that no wavelengths are
filtered to the drop port; (c) cascaded polysilicon rings that passively filter the resonant wave-
length to the drop port while all other wavelengths continue to the through port (adapted
from [7], courtesy of IEEE)

equalization circuits. To avoid robustness and power issues from distributing a mul-
tiple-GHz clock to hundreds of transmitters, we propose implementing an optical
clock delivery scheme using a simple single-diode receiver with duty-cycle correc-
tion. We estimate the serialization and driver circuitry will consume less than a single cycle of latency at a 5-GHz core clock frequency.
Passive Filter. We use polysilicon passive filters with two cascaded rings for
increased frequency roll-off (see Fig. 3.2c). As mentioned earlier in this section, the
rings resonance is sensitive to temperature and requires active thermal tuning.
Fortunately, the etched air gap under the ring provides isolation from the thermally
conductive substrate, and we add in-plane polysilicon heaters inside most rings to
improve heating efficiency. Thermal simulations suggest that we will require 40–100 μW of static power for each double-ring filter assuming a temperature range of 20 K. These ring filters can also be designed to behave as active filters by using
charge injection as in our transmitters, except at lower data rates.
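Amortizing this static tuning power over the traffic it serves shows its contribution to the link energy; the sketch assumes each wavelength runs at the 10 Gb/s rate used for the transmitters above.

    # Amortize the 40-100 uW static tuning power of a double-ring filter over a
    # 10 Gb/s wavelength to express it as an energy per bit.
    bit_rate = 10e9
    for heater_w in (40e-6, 100e-6):
        print("%3.0f uW -> %4.1f fJ/bit" % (heater_w * 1e6, heater_w / bit_rate * 1e15))
    # 4-10 fJ/bit per filter: a modest but non-negligible slice of the
    # 100-250 fJ/bit total link energy projected later in this section.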
Receiver. The lack of pure Ge presents a challenge for mainstream bulk CMOS
processes. We use the embedded SiGe (20–30 % Ge) in the p-MOSFET transistor
source/drain regions to create a photodetector operating at around 1200 nm. Simulation
results show good capacitance (less than 1 fF/mm) and dark current (less than 10 fA/mm)
at near-zero bias conditions, but the sensitivity of the structure needs to be improved
to meet our system specifications. In advanced process nodes, the responsivity and
speed should improve through better coupling between the waveguide and the photo-
detector in scaled device dimensions, and an increased percentage of Ge for device
strain. Our photonic receiver circuits would use the same optical clocking scheme as
our transmitters, and we estimate that the entire receiver will consume less than 50 fJ/
bt for random data. We estimate that the deserialization and driver circuitry will add
less than a single cycle of latency at a 5-GHz core clock frequency.

Based on our device simulations and experiments we project that it may be pos-
sible to multiplex 64 wavelengths per waveguide at a 60-GHz spacing, and that by
interleaving wavelengths traveling in opposite directions (which helps mitigate
interference) we can possibly have up to 128 wavelengths per waveguide. With a
4-μm waveguide pitch and 64–128 wavelengths per waveguide, we can achieve a
bandwidth density of 160–320 Gb/s/μm for intra-chip nanophotonic interconnect.
With a 50-μm fiber coupler pitch, we can achieve a bandwidth density of
12–25 Gb/s/μm for inter-chip nanophotonic interconnect. Total link latencies includ-
ing serialization, modulation, time-of-flight, receiving, and deserialization could
range from three to eight cycles depending on the link length. We also project that
the total electrical and thermal on-chip energy for a complete 10 Gb/s nanophotonic
intra-chip or inter-chip link (including a racetrack modulator and a double-ring filter
at the receiver) can be as low as 100–250 fJ/bt for random data. These projections
suggest that optical communication should support significantly higher bandwidth
densities, improved energy efficiency, and competitive latency compared to both
optimally repeated global intra-chip electrical interconnect (e.g., [36]) and projected
inter-chip electrical interconnect.
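The bandwidth-density figures above follow directly from the stated per-wavelength rate and the waveguide and fiber-coupler pitches; the short sketch below simply reproduces that arithmetic.

```python
# Reproduces the bandwidth-density arithmetic from the stated parameters:
# 10 Gb/s per wavelength, 4-um waveguide pitch, 50-um fiber coupler pitch.

RATE_GBPS = 10           # modulation rate per wavelength (Gb/s)
WG_PITCH_UM = 4          # intra-chip waveguide pitch (um)
FIBER_PITCH_UM = 50      # inter-chip fiber coupler pitch (um)

for n_lambda in (64, 128):
    per_wg = n_lambda * RATE_GBPS            # Gb/s carried by one waveguide
    print(f"{n_lambda} wavelengths: intra-chip {per_wg / WG_PITCH_UM:.0f} Gb/s/um, "
          f"inter-chip {per_wg / FIBER_PITCH_UM:.1f} Gb/s/um")
```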

Designing Nanophotonic Interconnection Networks

In this section, we describe three levels of nanophotonic interconnection network
design: the architectural level, the microarchitectural level, and the physical level.
At each level, we use insight gained from designing several nanophotonic net-
works to discuss the specific implications of using this emerging technology, and
we classify recent nanophotonic network proposals to illustrate various different
approaches. Each level of design enables its own set of qualitative and quantitative
analysis and helps motivate design decisions at both higher and lower levels.
Although these levels can help focus our design effort, network design is inher-
ently an iterative process with a designer moving between levels as necessary to
meet the application requirements.

Architectural-Level Design

The design of nanophotonic interconnection networks usually begins at the archi-
tectural level and involves selecting a logical network topology that can best lever-
age nanophotonic devices. A logical network topology connects a set of input
terminals to a set of output terminals through a collection of buses and routers
interconnected by point-to-point channels. Symmetric topologies have an equal
number of input and output terminals, usually denoted as N. Figure 3.3 illustrates
several topologies for a 64-terminal symmetric network ranging from single-stage
global buses and crossbars to multi-stage butterfly and torus topologies (see [20] for
a more extensive review of logical network topologies, and see [4] for a study
specifically focused on intra-chip networks).

Fig. 3.3 Logical topologies for various 64-terminal networks. (a) 64-writer/64-reader single global
bus; (b) 64×64 global non-blocking crossbar; (c) 8-ary 2-stage butterfly; (d) 8-ary 2-dimensional
torus. Squares: input and/or output terminals; dots: routers; in (c) inter-dot lines: uni-directional
channels; in (d) inter-dot lines: two channels in opposite directions (from [8], courtesy of IEEE)

At this preliminary phase of design,
we can begin to determine the bus and channel bandwidths that will be required to
meet application requirements assuming ideal routing and flow-control algorithms.
Usually this analysis is in terms of theoretical upper bounds on the network's band-
width and latency, but we can also begin to explore how more realistic routing
algorithms might impact the network's performance. When designing nanophotonic
interconnection networks, it is also useful to begin by characterizing state-of-the-art
electrical networks. Developing realistic electrical baseline architectures early in
the design process can help motivate the best opportunities for leveraging nanopho-
tonic devices. This subsection discusses a range of topologies used in nanophotonic
interconnection networks.
A global bus is perhaps the simplest of logical topologies, and involves N input
terminals arbitrating for a single shared medium so that they can communicate with
one of N − 1 output terminals (see Fig. 3.3a). Buses can make good use of scarce
wiring resources, serialize messages which can be useful for some higher-level pro-
tocols, and enable one input terminal to easily broadcast a message to all output
terminals. Unfortunately, using a single shared medium often limits the performance
of buses due to practical constraints on bus bandwidth and arbitration latency as the
number of network terminals increases. There have been several nanophotonic bus
designs that explore these trade-offs, mostly in the context of implementing efficient
DRAM memory channels [9, 29, 53, 74, 76] (discussed further in case study #3),
although there have also been proposals for specialized nanophotonic broadcast
buses to improve the performance of application barriers [14] and cache-coherence
protocols [76]. Multiple global buses can be used to improve system throughput,
and such topologies have also been designed using nanophotonic devices [62].
A global crossbar topology is made up of N buses with each bus dedicated to a
single terminal (see Fig. 3.3b). Such topologies present a simple performance model
to software and can sustain high performance owing to their strictly non-blocking
connectivity. This comes at the cost, however, of many global buses crossing the
network bisection and long global arbitration delays. Nanophotonic crossbar topol-
ogies have been particularly popular in the literature [39, 40, 44, 64, 65, 76], and we
will see in the following sections that careful design at the microarchitectural and
physical levels is required to help mitigate some of the challenges inherent in any
global crossbar topology.
To avoid global buses and arbitration, we can move to a multi-stage topology
such as a k-ary n-stage butterfly where radix-k routers are arranged in n stages with
N/k routers per stage (see Fig. 3.3c). Although multi-stage topologies increase the
hop-count as compared to a global crossbar, each hop involves a localized lower-
radix router that can be implemented more efficiently than a global crossbar. The
reason for the butterfly topologys efficiency (distributed routing, arbitration, and
flow-control), also leads to challenges in reducing zero-load latencies and balancing
channel load. For example, a butterfly topology lacks any form of path diversity
resulting in poor performance on some traffic patterns. Nanophotonic topologies
have been proposed that are similar in spirit to the butterfly topology for multichip-
module networks [41], on-chip networks [56], and processor-to-DRAM networks
[6, 7]. The latter is discussed further as a case study in section Case Study #2:
Manycore Processor-to-DRAM Network. In these networks, the lack of path diver-
sity may not be a problem if application requirements specify traffic patterns that
are mostly uniform random. Adding additional stages to a butterfly topology
increases path diversity, and adding n − 1 stages results in an interesting class of
network topologies known as Clos topologies [19] and fat-tree topologies [45]. Clos
and fat-tree topologies can offer the same non-blocking guarantees as global cross-
bars with potentially lower resource requirements. Clos and fat-tree topologies have
been proposed that use nanophotonic devices in low-radix [26] and high-radix
[32, 34] configurations. The latter is discussed further as a case study in section Case
Study #1: On-Chip Tile-to-Tile Network. Nanophotonic Clos-like topologies that
implement high-radix routers using a subnetwork of low-radix routers have also
been explored [63].
A k-ary n-dimensional torus topology is an alternative multi-stage topology
where each terminal is associated with a router, and these routers are arranged in an
n-dimensional logical grid with k routers in each dimension (see Fig. 3.3d). A mesh
topology is similar to the torus topology except with the logically long wrap-
around channels eliminated in each dimension. Two-dimensional torus and mesh
topologies are particularly attractive in on-chip networks, since they naturally map
to the planar chip substrate. Unfortunately, low-dimensional torus and mesh topolo-
gies have high hop counts resulting in longer latencies and possibly higher energy
consumption. Moving from low-dimensional to high-dimensional torus or mesh
topologies (e.g., a 4-ary 3-dimensional topology) reduces the network diameter, but
requires long channels when mapped to a planar substrate. Also, higher-radix rout-
ers are required, potentially resulting in more area and higher router energy. Instead
of adding network dimensions, we can use concentration to reduce network dia-
meter [43]. Internal concentration multiplexes/demultiplexes multiple input/output

terminals across a single router port at the edge of the network, while external con-
centration integrates multiple terminals into a unified higher-radix router. There has
been some work investigating how to best use nanophotonics in both two-dimen-
sional torus [69] and mesh [18, 48] topologies.
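To make the diameter argument concrete, the sketch below applies the standard k-ary n-cube diameter formulas (router-to-router hops only); these are textbook formulas assumed here rather than taken from the chapter.

```python
# Hop-count trade-off for k-ary n-dimensional torus and mesh topologies:
# low-dimensional networks have large diameters, adding dimensions shrinks
# them. Standard formulas, counting router-to-router hops only.

def torus_diameter(k, n):
    return n * (k // 2)      # wrap-around channels halve the worst case per dimension

def mesh_diameter(k, n):
    return n * (k - 1)

for k, n in [(8, 2), (4, 3)]:    # two 64-terminal options: 8-ary 2-dim, 4-ary 3-dim
    print(f"{k}-ary {n}-dim: torus diameter {torus_diameter(k, n)}, "
          f"mesh diameter {mesh_diameter(k, n)}")
```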
While many nanophotonic interconnection networks can be loosely categorized
as belonging to one of the four categories shown in Fig. 3.3, there are also more radi-
cal alternatives. For example, Koohi et al. propose a hierarchical topology for an
on-chip nanophotonic network where a set of global rings connect clusters each
with their own local ring [42].
Table 3.1 is an example of the first-order analysis that can be performed at the
architectural level of design. In this example, we compare six logical topologies for
a 64-terminal on-chip symmetric network. For the first-order latency metrics we
assume a 22-nm technology, 5-GHz clock frequency, and a 400-mm² chip. The bus
and channel bandwidths are sized so that each terminal can sustain 128 b/cycle
under uniform random traffic assuming ideal routing and flow control. Even from
this first-order analysis we can start to see that some topologies (e.g., crossbar,
butterfly, and Clos) require fewer channels but they are often long, while other
topologies (e.g., torus and mesh) require more channels but they are often short.
We can also see which topologies (e.g., crossbar and Clos) require more global
bisection wiring resources, and which topologies require higher-radix routers (e.g.,
crossbar, butterfly, Clos, and CMesh). First-order zero-load latency calculations can
help illustrate trade-offs between hop count, router complexity, and serialization
latency. Ultimately, this kind of rough analysis for both electrical and nanophoto-
nic networks helps motivate the microarchitectural-level design discussed in the
next section.
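These zero-load latency figures can be approximated with a simple additive model; the sketch below shows one assumed reading of the T0 column in Table 3.1 (router hops times router latency, plus channel latency on the hops between routers, plus serialization latency) and reproduces the butterfly and mesh entries.

```python
# First-order zero-load latency: H_R routers each add T_R, the (H_R - 1)
# channels between them each add T_C, and serialization adds T_S.
# This additive model is an assumed reading of Table 3.1, not a quoted formula.

def zero_load_latency(h_r, t_r, t_c, t_s):
    return h_r * t_r + (h_r - 1) * t_c + t_s

# 8-ary 2-stage butterfly row: H_R=2, T_R=2, T_C=2..10, T_S=4  ->  T0 = 10..18
print(zero_load_latency(2, 2, 2, 4), zero_load_latency(2, 2, 10, 4))
# 8-ary 2-dim mesh row: H_R=2..15, T_R=2, T_C=1, T_S=2  ->  T0 = 7..46
print(zero_load_latency(2, 2, 1, 2), zero_load_latency(15, 2, 1, 2))
```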

Microarchitectural-Level Design

For nanophotonic interconnection networks, microarchitectural-level design involves
choosing which buses, channels, and routers to implement electrically and which to
implement with nanophotonic devices. We must decide where nanophotonic trans-
mitters and receivers will be used in the network, how to use active filters to imple-
ment nanophotonic routers, the best way to arbitrate for wavelengths, and how to
manage electrical buffering at the edges of nanophotonic network components. At
this level of design, we often use nanophotonic schematics to abstractly illustrate
how the various components are integrated (see Fig. 3.4 for symbols that will be used
in nanophotonic schematics and layouts). When working at the microarchitectural
level, we want to focus on the higher-level operation of the nanophotonic devices, so
it is often useful to assume we have as many wavelengths as necessary to meet our
application requirements and to defer some practical issues related to mapping wave-
lengths to waveguides or waveguide layout until the final physical level of design.
Although this means detailed analysis of area overheads or optical power require-
ments is not possible at this level of the design, we can still make many qualitative
and quantitative comparisons between various network microarchitectures.

Table 3.1 Architectural-level analysis for various 64-terminal networks

                            Buses and channels             Routers          Latency
  Topology                  NC   NBC   bC   NBC×bC    NR   Radix    HR     TR   TC     TS   T0
  Crossbar 64×64            64    64   128   8,192     1   64×64    1      10   n/a    4    14
  Butterfly 8-ary 2-stage   64    32   128   4,096    16    8×8     2       2   2–10   4    10–18
  Clos (8,8,8)             128    64   128   8,192    24    8×8     3       2   2–10   4    14–32
  Torus 8-ary 2-dim        256    32   128   4,096    64    5×5     2–9     2   2      4    10–38
  Mesh 8-ary 2-dim         224    16   256   4,096    64    5×5     2–15    2   1      2    7–46
  CMesh 4-ary 2-dim         48     8   512   4,096    16    8×8     1–7     2   2      1    3–25

Networks are sized to sustain 128 b/cycle per input terminal under uniform random traffic. Latency
calculations assume an electrical implementation with an 8×8 grid of input/output terminals and the
following parameters: 22-nm technology, 5-GHz clock frequency, and 400-mm² chip. NC = number of
channels or buses, bC = bits/channel or bits/bus, NBC = number of bisection channels or buses,
NR = number of routers, HR = number of routers along minimal routes, TR = router latency,
TC = channel latency, TS = serialization latency, T0 = zero-load latency (from [8], courtesy of IEEE)

Fig. 3.4 Symbols used in nanophotonic schematics and layouts. For all ring-based devices, the
number next to the ring indicates the resonant wavelength, and a range of numbers next to the ring
indicates that the symbol actually represents multiple devices, each tuned to a distinct wavelength
in that range. The symbols shown include: (a) coupler for attaching a fiber to an on-chip wave-
guide; (b) transmitter including driver and ring modulator for λ1; (c) multiple transmitters includ-
ing drivers and ring modulators for each of λ1–λ4; (d) receiver including passive ring filter for λ1
and photodetector; (e) receiver including active ring filter for λ1 and photodetector; (f) passive ring
filter for λ1; (g) active ring filter for λ1 (from [8], courtesy of IEEE)

For example, we can compare different microarchitectures based on the number of opto-
electrical conversions along a given routing path, the total number of transmitters and
receivers, the number of transmitters or receivers that share a single wavelength, the
amount of active filtering, and design complexity. It should be possible to narrow our
search in promising directions that we can pursue with a physical-level design, or to
iterate back to the architectural level to explore other topologies and routing algo-
rithms. This subsection discusses a range of microarchitectural design issues that
arise when implementing the logical topologies described in the previous section.
Nanophotonics can help mitigate some of the challenges with global electrical
buses, since the electrical modulation energy in the transmitter is independent of both
bus length and the number of terminals. However, the optical power strongly depends
on these factors making it necessary to carefully consider the networks physical
design. In addition, an efficient global bus arbitration is required which is always
challenging regardless of the implementation technology. A nanophotonic bus topol-
ogy can be implemented with a single wavelength as the shared communication
medium (see Fig. 3.5).

Fig. 3.5 Microarchitectural schematics for nanophotonic four-terminal buses. The buses connect one
or more input terminals (I1–I4) to one or more output terminals (O1–O4) via a single shared wavelength:
(a) single-writer broadcast-reader bus; (b) single-writer multiple-reader bus; (c) multiple-writer sin-
gle-reader bus; (d) multiple-writer multiple-reader bus (adapted from [8], courtesy of IEEE)

Assuming a fixed modulation rate per wavelength, we can
increase the bus bandwidth by using multiple parallel wavelengths. In the sin-
gle-writer broadcast-reader (SWBR) bus shown in Fig. 3.5a, a single input terminal
modulates the bus wavelength that is then broadcast to all four output terminals. This
form of broadcast bus does not need any arbitration because there is only one input
terminal. The primary disadvantage of a SWBR bus is simply the large amount of
optical power required to broadcast packets to all output terminals. If we wish to send
a packet to only one of many outputs, then we can significantly reduce the optical
power by using active filters in each receiver. Figure 3.5b shows a single-writer mul-
tiple-reader (SWMR) bus where by default the ring filters in each receiver are
detuned such that none of them drop the bus wavelength. When the input terminal
sends a packet to an output terminal, it first ensures that the ring filter at the destina-
tion receiver is actively tuned into the bus wavelength. The control logic for this
active tuning usually requires additional optical or electrical communication from
the input terminal to the output terminals. Figure 3.5c illustrates a different bus net-
work called a multiple-writer single-reader (MWSR) bus where four input terminals
arbitrate to modulate the bus wavelength that is then dropped at a single output ter-
minal. MWSR buses require global arbitration, which can be implemented either
electrically or optically. The most general bus network enables multiple input termi-
nals to arbitrate for the shared bus and also allows a packet to be sent to one or more
output terminals. Figure 3.5d illustrates a multiple-writer multiple-reader (MWMR)
bus with four input terminals and four output terminals, but multiple-writer broad-
cast-reader (MWBR) buses are also possible. Here arbitration will be required at
both the transmitter side and the receiver side. MWBR/MWMR buses will require
O(Nbλ) transceivers, where N is the number of terminals and bλ is the number of
shared wavelengths used to implement the bus.
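A little bookkeeping makes the transceiver-count argument concrete; the sketch below tallies transmitters and receivers for the four bus styles of Fig. 3.5 under an assumed reading of the schematics (one ring per wavelength per writer or reader).

```python
# Rough transceiver counts for the bus styles in Fig. 3.5; the per-style
# breakdown is an assumed reading of the schematics, with n terminals and
# b wavelengths per bus.

def bus_devices(style, n, b):
    tx = b if style.startswith("SW") else n * b   # single vs. multiple writers
    rx = b if style.endswith("SR") else n * b     # single vs. multiple/broadcast readers
    return tx, rx

for style in ("SWBR", "SWMR", "MWSR", "MWMR"):
    tx, rx = bus_devices(style, n=64, b=64)
    print(f"{style}: {tx} transmitters, {rx} receivers")
```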
There are several examples of nanophotonic buses in the literature. Several
researchers have described similar techniques for using a combination of nanopho-
tonic SWBR and MWSR buses to implement the command, write-data, and read-
data buses in a DRAM memory channel [29, 53, 74, 76]. In this context the
arbitration for the MWSR read-data bus is greatly simplified since the memory
controller acts as a master and the DRAM banks act as slaves. We investigate various
ways of implementing such nanophotonic DRAM memory channels as part of the
case study in section Case Study #3: DRAM Memory Channel.

Fig. 3.6 Microarchitectural schematics for nanophotonic 4×4 crossbars. The crossbars connect all
inputs (I1–I4) to all outputs (O1–O4) and are implemented with either: (a) four single-writer multi-
ple-reader (SWMR) buses; (b) four SWMR buses with additional output buffering; or (c) four
multiple-writer single-reader (MWSR) buses (adapted from [8], courtesy of IEEE)

Binkert et al. dis-
cuss both single-wavelength SWBR and SWMR bus designs for use in implement-
ing efficient on-chip barrier networks, and the results suggest that a SWMR bus
can significantly reduce the required optical laser power as compared to a SWBR
bus [14]. Vantrease et al. also describe a nanophotonic MWBR bus used to broad-
cast invalidate messages as part of the cache-coherence protocol [76]. Arbitration
for this bus is performed optically with tokens that are transferred between input
terminals using a specialized arbitration network with a simple ring topology. Pan
et al. proposed several techniques to help address scaling nanophotonic MWMR
buses to larger numbers of terminals: multiple independent MWMR buses improve
the total network bisection bandwidth while still enabling high utilization of all
buses, a more optimized optical token scheme improves arbitration throughput,
and concentrated bus ports shared by multiple terminals reduce the total number of
transceivers [62].
Global crossbars have several attractive properties including high throughput and
a short fixed latency. Nanophotonic crossbars use a dedicated nanophotonic bus per
input or output terminal to enable every input terminal to send a packet to a different
output terminal at the same time. Implementing such crossbars with nanophotonics
has many of the same advantages and challenges as nanophotonic buses except at
a larger scale. Figure 3.6 illustrates three types of nanophotonic crossbars. In the
SWMR crossbar shown in Fig. 3.6a, there is one bus per input and every output can

read from any of these buses. As an example, if I2 wants to send a packet to O3 it first
arbitrates for access to the output terminal, then (assuming it wins arbitration) the
receiver for wavelength λ2 at O3 is actively tuned, and finally the transmitter at I2
modulates wavelength λ2 to send the packet. SWBR crossbars are also possible
where the packet is broadcast to all output terminals, and each output terminal is
responsible for converting the packet into the electrical domain and determining if
the packet is actually destined for that terminal. Although SWBR crossbars enable
broadcast communication they use significantly more optical power than a SWMR
crossbar for unicast communication. Note that even SWMR crossbars usually
include a low-bandwidth SWBR crossbar to implement distributed redundant arbi-
tration at the output terminals and/or to determine which receivers at the destination
should be actively tuned. A SWMR crossbar needs only O(Nbλ) transmitters (one set per
input), but requires O(N²bλ) receivers. Figure 3.6b illustrates an alternative called a buffered
SWMR crossbar that avoids the need for any global or distributed arbitration. Every
input terminal can send a packet to any output terminal at any time assuming it has
space in the corresponding queue at the output. Each output locally arbitrates among
these queues to determine which packet can access the output terminal. Buffered
SWBR/SWMR crossbars simplify global arbitration at the expense of an additional
O(N²) buffering. Buffered SWMR crossbars can still include a low-bandwidth
SWBR crossbar to determine which receivers at the destination should be actively
tuned. The MWSR crossbar shown in Fig. 3.6c is an alternative microarchitecture
that uses one bus per output and allows every input to write any of these buses. As
an example, if I2 wants to send a packet to O3 it first arbitrates, and then (assuming
it wins arbitration) it modulates wavelength λ3. A MWSR crossbar needs only O(Nbλ)
receivers (one set per output), but requires O(N²bλ) transmitters. For larger networks with
wider channel bitwidths, the quadratic number of transmitters or receivers required
to implement nanophotonic crossbars can significantly impact optical power, ther-
mal tuning power, and area.
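The O(Nbλ) versus O(N²bλ) scaling is easy to instantiate; the sketch below computes device counts for a 64×64 crossbar and matches the rough magnitudes quoted for the proposals discussed next (hundreds of thousands of receivers at 64 wavelengths per bus, about a million transmitters at 256).

```python
# Device-count scaling for the crossbars of Fig. 3.6: SWMR needs O(N*b)
# transmitters and O(N^2*b) receivers; MWSR is the reverse.

def swmr_crossbar_devices(n, b):        # n terminals, b wavelengths per bus
    return n * b, n * n * b             # (transmitters, receivers)

def mwsr_crossbar_devices(n, b):
    return n * n * b, n * b

print(swmr_crossbar_devices(64, 64))    # (4096, 262144)   rx count ~ several hundred thousand
print(mwsr_crossbar_devices(64, 256))   # (1048576, 16384) tx count ~ a million
```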
There have been several diverse proposals for implementing global crossbars
with nanophotonics. Many of these proposals use global on-chip crossbars to imple-
ment L2-to-L2 cache-coherence protocols for single-socket manycore processors.
Almost all of these proposals include some amount of concentration, so that a small
number of terminals locally arbitrate for access to a shared crossbar port. This con-
centration helps leverage electrical interconnect to reduce the radix of the global
crossbar, and can also enable purely electrical communication when sending a
packet to a physically close output terminal. Kırman et al. describe three on-chip
SWBR nanophotonic crossbars for addresses, snoop responses, and data for imple-
menting a snoopy-based cache-coherence protocol [39]. The proposed design uses
distributed redundant arbitration to determine which input port can write to which
output port. A similar design was proposed by Pasricha et al. within the context of a
multiprocessor system-on-chip [64]. Kırman et al. have recently described a more
sophisticated SWMR microarchitecture with connection-based arbitration that is
tightly coupled to the underlying physical layout [40]. Miller et al. describe a buff-
ered SWBR nanophotonic crossbar for implementing a directory-based cache-
coherence protocol, and the broadcast capabilities of the SWBR crossbar are used

for invalidation messages [44]. The proposed design requires several hundred thou-
sand receivers for a 64×64 crossbar with each shared bus using 64 wavelengths
modulated at 10 Gb/s. Vantrease et al. describe a MWSR nanophotonic crossbar for
implementing a directory-based cache-coherence protocol, and a separate MWBR
nanophotonic bus for invalidation messages [76]. The proposed design requires
about a million transmitters for a 64×64 crossbar with each shared bus using 256
wavelengths modulated at 10 Gb/s. Arbitration in the MWSR nanophotonic cross-
bar is done with a specialized optical token scheme, where tokens circle around a
ring topology. Although this scheme does enable round-robin fairness, later work
by Vantrease et al. investigated techniques to improve the arbitration throughput for
these token-based schemes under low utilization [75]. Petracca et al. proposed a
completely different microarchitecture for a nanophotonic crossbar that uses optical
switching inside the network and only O(Nbλ) transmitters and completely passive
receivers [65]. The proposed design requires a thousand optical switches for a 64×64
crossbar with each shared bus using 96 wavelengths modulated at 10 Gb/s. Each
switch requires around O(8bλ) actively tuned filters. The precise number of active
filters depends on the exact switch microarchitecture and whether single-wavelength
or multiple-wavelength active filters are used. Although such a microarchitecture
has many fewer transmitters and receivers than the designs shown in Fig. 3.6, a
separate multi-stage electrical network is required for arbitration and to setup the
optical switches.
There are additional design decisions when implementing a multi-stage topol-
ogy, since each network component can use either electrical or nanophotonic
devices. Figure 3.7 illustrates various microarchitectural designs for a 2-ary
2-stage butterfly topology. In Fig. 3.7a, the routers are all implemented electrically
and the channels connecting the first and second stage of routers are implemented
with point-to-point nanophotonic channels. This is a natural approach, since we
can potentially leverage the advantages of nanophotonics for implementing long
global channels and use electrical technology for buffering, arbitration, and
switching. Note that even though these are point-to-point channels, we can still
draw the corresponding nanophotonic implementations of these channels as being
wavelength-division multiplexed in a microarchitectural schematic. Since a sche-
matic is simply meant to capture the high-level interaction between electrical and
nanophotonic devices, designers should simply use the simplest representation at
this stage of the design. Similarly, the input and output terminals may be co-
located in the physical design, but again the schematic is free to use a more abstract
representation. In Fig. 3.7b, just the second stage of routers are implemented with
nanophotonic devices and the channels are still implemented electrically. Since
nanophotonic buffers are currently not feasible in intra-chip and inter-chip net-
works, the buffering is done electrically and the router's 2×2 crossbar is imple-
mented with a nanophotonic SWMR microarchitecture. As with any nanophotonic
crossbar, additional logic is required to manage arbitration for output ports. Such
a microarchitecture seems less practical since the router crossbars are localized,
and it will be difficult to outweigh the opto-electrical conversion overhead when
working with short buses. In Fig. 3.7c, both the channels and the second stage of
routers are implemented with nanophotonic devices.

Fig. 3.7 Microarchitectural schematics for nanophotonic 2-ary 2-stage butterflies. Networks con-
nect all inputs (I1–I4) to all outputs (O1–O4) with each network component implemented with either
electrical or nanophotonic technology: (a) electrical routers and nanophotonic channels; (b) elec-
trical first-stage routers, electrical channels, and nanophotonic second-stage routers; (c) electrical
first-stage routers, nanophotonic channels, and nanophotonic second-stage routers; (d) similar to
previous subfigure except that the channels and intra-router crossbars are unified into a single stage
of nanophotonic interconnect (adapted from [8], courtesy of IEEE)

This requires opto-electrical conversions at two locations, and also needs electrical
buffering to be inserted
between the channels and the second-stage routers. Figure 3.7d illustrates a more
promising microarchitecture where the nanophotonic channels and second-stage
routers are unified and requires a single opto-electrical conversion. This does,
however, force the electrical buffering to the edge of the nanophotonic region of
the network. It is also possible to implement all routers and all channels with
nanophotonics to create a fully optical multi-stage network, although the micro-
architecture for each router will need to be more complicated and a second control
network is required to setup the active ring filters in each router.
Most proposals for nanophotonic butterfly-like topologies in the literature focus
on high-radix, low-diameter butterflies and use electrical routers with nanophotonic
point-to-point channels. Koka et al. explore both single-stage and two-stage butterfly-
like topologies as the interconnect for large multichip modules [41]. Morris et al. pro-
posed a two-stage butterfly-like topology for a purely on-chip network [56]. Neither of
these proposals is a true butterfly topology, since both incorporate some amount
of flattening as in the flattened butterfly topology [37], or viewed differently some of
the configurations resemble a generalized hypercube topology [12]. In addition,

some of the configurations include some amount of shared nanophotonic buses
instead of solely using point-to-point channels. In spite of these details, both micro-
architectures are similar in spirit to that shown in Fig. 3.7a. The evaluation in both of
these works suggests that only implementing the point-to-point channels using nano-
photonic devices in a multi-stage topology might offer some advantages in terms of
static power, scalability, and design complexity, when compared to more compli-
cated topologies and microarchitectures. We will investigate a butterfly-like topol-
ogy for processor-to-DRAM networks that only uses nanophotonic channels as a
case study in section Case Study #2: Manycore Processor-to-DRAM Network. All
of these butterfly networks have no path diversity, resulting in poor performance on
adversarial traffic patterns when using simple routing algorithms. Pan et al. proposed
a three-stage high-radix Clos-like topology for an on-chip network to enable much
better load balancing [63]. In this design, the first and third stage of the topology
effectively require radix-16 or radix-24 routers for a 64-terminal or 256-terminal
network respectively. These high-radix routers are implemented with a mesh subnet-
work, and the middle-stage routers connect corresponding mesh routers in each sub-
network. The middle-stage routers and the channels connecting the stages are all
implemented with a unified nanophotonic microarchitecture similar in spirit to that
shown in Fig. 3.7d with buffered SWMR crossbars and a separate SWBR crossbar to
determine which receivers should be actively tuned. Gu et al. proposed a completely
different Clos microarchitecture that uses low-radix 2×2 routers and implements all
routers and channels with nanophotonic devices [26]. We will investigate a Clos
topology for global on-chip communication as a case study in section Case Study
#1: On-Chip Tile-to-Tile Network.
Designing nanophotonic torus topologies requires similar design decisions at the
microarchitectural level as when designing butterfly topologies. Figure 3.8 illus-
trates two different microarchitectures for a 4-ary 1-dimensional torus (i.e., a four-
node ring). In Fig. 3.8a, the four radix-2 routers are implemented electrically and the
channels between each pair of routers are implemented with nanophotonic devices.
In Fig. 3.8b, both the routers and the channels are implemented with nanophotonic
devices. The active ring filters in each router determine whether the packet exits the
network at that router or turns clockwise and continues on to the next router. Since
this creates a fully optical multi-stage network, a separate control network,
implemented either optically or electrically, will be required to setup the control
signals at each router. As with the butterfly microarchitecture in Fig. 3.7d, buffering
must be pushed to the edge of the nanophotonic region of the network.
Proposals in the literature for chip-level nanophotonic torus and mesh networks
have been mostly limited to two-dimensional topologies. In addition, these propos-
als use fully optical microarchitectures in the spirit of Fig. 3.8b, since using electri-
cal routers with short nanophotonic channels as in Fig. 3.8a yields little benefit.
Shacham et al. proposed a fully optical two-dimensional torus with a combination
of radix-4 blocking routers and specialized radix-2 injection and ejection rout-
ers [69]. A separate electrical control network is used to setup the control signals at
each nanophotonic router. In this hybrid approach, the electrical control network
uses packet-based flow control while the nanophotonic data network uses circuit-
switched flow control.

Fig. 3.8 Microarchitectural schematics for a nanophotonic 4-ary 1-dim torus. Networks connect all
inputs (I1–I4) to all outputs (O1–O4) with each network component implemented with either electri-
cal or nanophotonic technology: (a) electrical routers and nanophotonic channels or (b) nanopho-
tonic routers and channels. Note that this topology uses a single unidirectional channel to connect
each of the routers (from [8], courtesy of IEEE)

The radix-4 blocking routers require special consideration by
the routing algorithm, but later work by Sherwood-Droz et al. fabricated alternative
non-blocking optical router microarchitectures that can be used in this nanophoto-
nic torus network [71]. Poon et al. survey a variety of designs for optical routers that
can be used in on-chip multi-stage nanophotonic networks [66]. Li et al. propose a
two-dimensional circuit-switched mesh topology with a second broadcast nanopho-
tonic network based on planar waveguides for the control network [48]. Cianchetti
et al. proposed a fully optical two-dimensional mesh topology with packet-based
flow control [18]. This proposal sends control bits on dedicated wavelengths ahead
of the packet payload. These control bits undergo an opto-electrical conversion at
each router hop in order to quickly conduct electrical arbitration and flow control. If
the packet wins arbitration, then the router control logic sets the active ring filters
such that the packet payload proceeds through the router optically. If the packet
loses arbitration, then the router control logic sets the active ring filters to direct the
packet to local receivers so that it can be converted into the electrical domain and
buffered. If the packet loses arbitration and no local buffering is available then the
packet is dropped, and a nack is sent back to the source using dedicated optical
channels. Later work by the same authors explored optimizing the optical router
microarchitecture, arbitration, and flow control [17]. To realize significant advan-
tages over electrical networks, fully optical low-dimensional torus networks need to
carefully consider waveguide crossings, drop losses at each optical router, the total
tuning cost for active ring filters in all routers, and the control network overhead.

Physical-Level Design

The final phase of design is at the physical level and involves mapping wavelengths
to waveguides, waveguide layout, and placing nanophotonic devices along each
waveguide. We often use abstract layout diagrams that are similar to microarchitec-
tural schematics but include additional details to illustrate the physical design.
Ultimately, we must develop a detailed layout diagram that specifies the exact place-
ment of each device, and this layout is then used to calculate the area consumed by
nanophotonic devices and the total optical power required for all wavelengths. This
subsection discusses a range of physical design issues that arise when implementing
the nanophotonic microarchitectures described in the previous section.
Figure 3.9 illustrates general approaches for the physical design of nanophoto-
nic buses. These examples implement a four-wavelength SWMR bus, and they
differ in how the wavelengths are mapped to each waveguide. Figure 3.9a illus-
trates the most basic approach where all four wavelengths are multiplexed onto the
same waveguide. Although this produces the most compact layout, it also requires
all nanophotonic devices to operate on the same waveguide which can increase the
total optical loss per wavelength. In this example, each wavelength would experi-
ence one modulator insertion loss, O(Nbλ) through losses in the worst case, and a
drop loss at the desired output terminal. As the number of wavelengths for this bus
increases, we will need to consider techniques for distributing those wavelengths
across multiple waveguides both to stay within the waveguide's total bandwidth
capacity and within the waveguide's total optical power limit. Figure 3.9b illus-
trates wavelength slicing, where a subset of the bus wavelengths are mapped to
distinct waveguides. In addition to reducing the number of wavelengths per wave-
guide, wavelength slicing can potentially reduce the number of through losses and
thus the total optical power. Figure 3.9c–e illustrate reader slicing, where a subset
of the bus readers are mapped to distinct waveguides. The example shown in
Fig. 3.9c doubles the number of transmitters, but the input terminal only needs to
drive transmitters on the waveguide associated with the desired output terminal.
Reader slicing does not reduce the number of wavelengths per waveguide, but it
does reduce the number of through losses. Figure 3.9d illustrates a variation of
reader slicing that uses optical power splitting. This split nanophotonic bus requires
a single set of transmitters, but requires more optical power since this power must
be split between the multiple bus branches. Figure 3.9e illustrates another variation
of reader slicing that uses optical power guiding. This guided nanophotonic bus
also only requires a single set of transmitters, but it uses active ring filters to guide
the optical power down the desired bus branch. Guided buses require more control
overhead but can significantly reduce the total optical power when the optical loss
per branch is large. Reader slicing can be particularly effective in SWBR buses,
since it can reduce the number of drop losses per wavelength. It is possible to
implement MWSR buses using a similar technique called writer slicing, which can
help reduce the number of modulator insertion losses per wavelength. More com-
plicated physical design (e.g., redundant transmitters and optical power guiding)
may have some implications on the electrical control logic and thus the network's
microarchitecture, but it is important to note that these techniques are solely
focused on mitigating physical design issues and do not fundamentally change the
logical network topology.

Fig. 3.9 Physical design of nanophotonic buses. The four wavelengths for an example four-output
SWMR bus are mapped to waveguides in various ways: (a) all wavelengths mapped to one wave-
guide; (b) wavelength slicing with two wavelengths mapped to one waveguide; (c) reader slicing
with two readers mapped to one waveguide and two redundant sets of transmitters; (d) reader slic-
ing with a single transmitter and optical power passively split between two branches; (e) reader
slicing with a single transmitter and optical power actively guided down one branch (adapted
from [8], courtesy of IEEE)

Most nanophotonic buses in the literature use wave-
logical network topology. Most nanophotonic buses in the literature use wave-
length slicing [29, 74, 76] and there has been some exploration of the impact of
using a split nanophotonic bus [14, 74]. We investigate the impact of using a guided
nanophotonic bus in the context of a DRAM memory channel as part of the case
study in section Case Study #3: DRAM Memory Channel.
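A per-wavelength optical power budget of this kind can be sketched as below; the loss structure (coupler, modulator insertion, a through loss per ring passed, a drop loss, optional splitter losses) follows the discussion above, but the dB values themselves are illustrative assumptions rather than figures from the text.

```python
# Minimal per-wavelength laser power estimate. All dB values are assumed,
# illustrative numbers; only the loss structure follows the text.

def laser_power_mw(p_detector_uw, coupler_db, modulator_db, rings_passed,
                   through_db, drop_db, splits=0, split_db=3.5):
    total_db = (coupler_db + modulator_db + rings_passed * through_db
                + drop_db + splits * split_db)
    return (p_detector_uw / 1000.0) * 10 ** (total_db / 10.0)

# Single-waveguide SWMR bus, worst case ~N*b rings passed (64 outputs, 64 lambda)
print(laser_power_mw(10, 1.0, 0.5, rings_passed=64 * 64, through_db=0.01, drop_db=1.5))
# Wavelength slicing (8 lambda per waveguide) cuts the rings passed and the power
print(laser_power_mw(10, 1.0, 0.5, rings_passed=64 * 8, through_db=0.01, drop_db=1.5))
```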
Most nanophotonic crossbars use a set of shared buses, and thus wavelength slic-
ing, reader slicing, and writer slicing are all applicable to the physical design of
these crossbars. Figure 3.10a illustrates another technique called bus slicing, where
a subset of the crossbar buses are mapped to each waveguide. In this example, a 4×4
SWMR crossbar with two wavelengths per bus is sliced such that two buses are
mapped to each of the two waveguides. Bus-sliced MWSR crossbars are also pos-
sible. Bus slicing reduces the number of wavelengths per waveguide and the number
of through losses in both SWMR and MWSR crossbars. In addition to illustrating
how wavelengths are mapped to waveguides, Fig. 3.10a also illustrates a serpentine
layout. Such layouts minimize waveguide crossings by snaking all waveguides
around the chip, and they result in looped, U-shaped, and S-shaped waveguides.

Fig. 3.10 Physical design of nanophotonic crossbars. In addition to the same techniques used with
nanophotonic buses, crossbar designs can also use bus slicing: (a) illustrates a 4×4 SWMR cross-
bar with two wavelengths per bus and two buses per waveguide. Colocating input and output ter-
minals can impact the physical layout. For example, a 4×4 SWMR crossbar with one wavelength
per bus and a single waveguide can be implemented with either: (b) a double-serpentine layout
where the light travels in one direction or (c) a single-serpentine layout where the light travels in
two directions (from [8], courtesy of IEEE)

The
example in Fig. 3.10a assumes that the input and output terminals are located on
opposite sides of the crossbar, but it is also common to have pairs of input and out-
put terminals co-located. Figure 3.10b illustrates a double-serpentine layout for a
4×4 SWMR crossbar with one wavelength per bus and a single waveguide. In this
layout, waveguides are snaked by each terminal twice with light traveling in one
direction. Transmitters are on the first loop, and receivers are on the second loop.
Figure 3.10c illustrates an alternative single-serpentine layout where waveguides
are snaked by each terminal once, and light travels in both directions. A single-
serpentine layout can reduce waveguide length but requires additional transmitters
to send the light for a single bus in both directions. For example, input I2 uses λ2 to
send packets clockwise and λ3 to send packets counter-clockwise.

Fig. 3.11 Physical design of nanophotonic point-to-point channels. An example with four point-
to-point channels each with four wavelengths can be implemented with either: (a) all wavelengths
mapped to one waveguide; (b) wavelength slicing with two wavelengths from each channel mapped
to one waveguide; (c) partial channel slicing with all wavelengths from two channels mapped to
one waveguide and a serpentine layout; (d) partial channel slicing with a ring-filter matrix layout
to passively shuffle wavelengths between waveguides; (e) full channel slicing with each channel
mapped to its own waveguide and a point-to-point layout (adapted from [8], courtesy of IEEE)

A variety of physical designs for nanophotonic crossbars are proposed in the literature that use a
combination of the basic approaches described above. Examples include fully
wavelength-sliced SWBR crossbars with no bus slicing and a serpentine layout [39,
44, 64], partially wavelength-sliced and bus-sliced MWSR/SWMR crossbars with a
double-serpentine layout [63, 76], fully reader-sliced SWMR crossbars with multi-
ple redundant transmitters and a serpentine layout [56], and a variant of a reader-
sliced SWMR crossbar with a serpentine layout which distributes readers across
waveguides and also across different wavelengths on the same waveguide [40].
Nanophotonic crossbars with optical switching distributed throughout the network
have a significantly different microarchitecture and correspondingly a significantly
different physical-level design [65].
Figure 3.11 illustrates general approaches for the physical design of point-to-
point nanophotonic channels that can be used in butterfly and torus topologies. This
particular example includes four point-to-point channels with four wavelengths per
channel, and the input and output terminals are connected in such a way that they
could be used to implement the 2-ary 2-stage butterfly microarchitecture shown in

Fig. 3.7a. Figure 3.11a illustrates the most basic design where all sixteen wave-
lengths are mapped to a single waveguide with a serpentine layout. As with nano-
photonic buses, wavelength slicing reduces the number of wavelengths per
waveguide and total through losses by mapping a subset of each channels wave-
lengths to different waveguides. In the example shown in Fig. 3.11b, two wave-
lengths from each channel are mapped to a single waveguide resulting in eight total
wavelengths per waveguide. Figure 3.11c–e illustrate channel slicing, where all
wavelengths from a subset of the channels are mapped to a single waveguide.
Channel slicing reduces the number of wavelengths per waveguide, the through
losses, and can potentially enable shorter waveguides. The example shown in
Fig. 3.11c maps two channels to each waveguide but still uses a serpentine layout.
The example in Fig. 3.11d has the same organization on the transmitter side, but
uses a passive ring filter matrix layout to shuffle wavelengths between waveguides.
These passive ring filter matrices can be useful when a set of channels is mapped to
one waveguide, but the physical layout requires a subset of those channels to also be
passively mapped to a second waveguide elsewhere in the system. Ring filter matri-
ces can shorten waveguides at the cost of increased waveguide crossings and one or
more additional drop losses. Figure 3.11e illustrates a fully channel-sliced design
with one channel per waveguide. This enables a point-to-point layout with wave-
guides directly connecting input and output terminals. Although point-to-point lay-
outs enable the shortest waveguide lengths they usually also lead to the greatest
number of waveguide crossings and layout complexity. One of the challenges with
ring-filter matrix and point-to-point layouts is efficiently distributing the unmodu-
lated laser light to all of the transmitters while minimizing the number of laser
couplers and optical power waveguide complexity. Optimally allocating channels to
waveguides can be difficult, so researchers have investigated using machine learn-
ing [39] or an iterative algorithm [11] for specific topologies. There has been some
exploratory work on a fully channel-sliced physical design with a point-to-point
layout for implementing a quasi-butterfly topology [41], and some experimental
work on passive ring filter network components similar in spirit to the ring-filter
matrix [82]. Point-to-point channels are an integral part of the case studies in sec-
tions Case Study #1: On-Chip Tile-to-Tile Network and Case Study #2: Manycore
Processor-to-DRAM Network.
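Allocating channels to waveguides can be framed as a small packing problem; the toy first-fit sketch below illustrates the idea under a per-waveguide wavelength budget and is only a stand-in for the machine-learning [39] and iterative [11] approaches cited above.

```python
# Toy first-fit packing of channels onto waveguides under a wavelength budget.
# This is an illustrative stand-in, not the allocation algorithms of [11, 39].

def pack_channels(channel_widths, max_lambda_per_wg):
    remaining = []                      # free wavelength budget per waveguide
    assignment = []                     # channel index -> waveguide index
    for width in channel_widths:
        for i, free in enumerate(remaining):
            if width <= free:
                remaining[i] -= width
                assignment.append(i)
                break
        else:
            remaining.append(max_lambda_per_wg - width)
            assignment.append(len(remaining) - 1)
    return assignment, len(remaining)

# 64 point-to-point channels of 16 wavelengths each, at most 128 lambda per waveguide
_, n_waveguides = pack_channels([16] * 64, 128)
print(n_waveguides)                     # -> 8 waveguides, 8 channels on each
```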
Much of the above discussion about physical-level design is applicable to micro-
architectures that implement multiple-stages of nanophotonic buses, channels, and
routers. However, the physical layout in these designs is often driven more by the
logical topology, leading to inherently channel-sliced designs with point-to-point
layouts. For example, nanophotonic torus and mesh topologies are often imple-
mented with regular grid-like layouts. It is certainly possible to map such topologies
onto serpentine layouts or to use a ring filter matrix to pack multiple logical chan-
nels onto the same waveguide, but such designs would probably be expensive in
terms of area and optical power. Wavelength slicing is often used to increase the
bandwidth per channel. The examples in the literature for fully optical fat-tree net-
works [26], torus networks [69], and mesh networks [18, 48] all use channel slicing
and regular layouts that match the logical topology. Since unmodulated light will

need to be distributed across the chip to each injection port, these examples will
most likely require more complicated optical power distribution, laser couplers
located across the chip, or some form of hybrid laser integration.
Figures 3.12 and 3.13 illustrate several abstract layout diagrams for an on-chip
nanophotonic 64×64 global crossbar network and an 8-ary 2-stage butterfly net-
work. These layouts assume a 22-nm technology, 5-GHz clock frequency, and 400-
mm² chip with 64 tiles. Each tile is approximately 2.5×2.5 mm and includes a
co-located network input and output terminal. The network bus and channel band-
widths are sized according to Table 3.1. The 64×64 crossbar topology in Fig. 3.12
uses a SWMR microarchitecture with bus slicing and a single-serpentine layout.
Both layouts map a single bus to each waveguide with half the wavelengths directed
from left to right and the other half directed from right to left. Both layouts are able
to co-locate the laser couplers in two locations along one edge of the chip to sim-
plify packaging. Figure 3.12a uses a longer serpentine layout, while Fig. 3.12b uses
a shorter serpentine layout which reduces waveguide lengths at the cost of increased
electrical energy to communicate between the more distant tiles and the nanophoto-
nic devices. The 8-ary 2-stage butterfly topology in Fig. 3.13 is implemented with
16 electrical routers (eight per stage) and 64 point-to-point nanophotonic channels
connecting every router in the first stage to every router in the second stage.
Figure 3.13a uses channel slicing with no wavelength slicing and a point-to-point
layout to minimize waveguide length. Note that although two channels are mapped
to the same waveguide, those two channels connect routers in the same physical
locations meaning that there is no need for any form of ring-filter matrix. Clever
waveguide layout results in 16 waveguide crossings located in the middle of the
chip. If we were to reduce the wavelengths per channel but maintain the total wave-
lengths per waveguide, then a ring-filter matrix might be necessary to shuffle chan-
nels between waveguides. Figure 3.13b uses a single-serpentine layout. The
serpentine layout increases waveguide lengths but eliminates waveguide crossings
in the middle of the chip. Notice that the serpentine layout requires co-located laser
couplers in two locations along one edge of the chip, while the point-to-point layout
requires laser couplers on both sides of the chip. The point-to-point layout could
position all laser couplers together, but this would increase the length of the optical
power distribution waveguides. Note that in all four layouts eight waveguides share
the same post-processing air gap, and that some waveguide crossings may be neces-
sary at the receivers to avoid positioning electrical circuitry over the air gap.
Figure 3.14 illustrates the kind of quantitative analysis that can be performed at the
physical level of design. Detailed layouts corresponding to the abstract layouts in
Figs. 3.12b and 3.13b are used to calculate the total optical power and area overhead
as a function of optical device quality and the technology assumptions in the earlier
section on nanophotonic technology. Higher optical losses increase the power per
waveguide which eventually necessitates distributing fewer wavelengths over more
waveguides to stay within the waveguide's total optical power limit. Thus higher opti-
cal losses can increase both the optical power and the area overhead. It is clear that for
these layouts, the crossbar network requires more optical power and area for the same
quality of devices compared to the butterfly network. This is simply a result of the
cost of providing O(N²bλ) receivers in the SWMR crossbar network versus the sim-
pler point-to-point nanophotonic channels used in the butterfly network.

Fig. 3.12 Abstract physical layouts for a 64×64 SWMR crossbar. In a SWMR crossbar each tile
modulates a set of wavelengths which then must reach every other tile. Two waveguide layouts are
shown: (a) uses a long single-serpentine layout where all waveguides pass directly next to each
tile; (b) uses a shorter single-serpentine layout to reduce waveguide loss at the cost of greater
electrical energy for more distant tiles to reach their respective nanophotonic transmitter and
receiver block. The nanophotonic transmitter and receiver block shown in (c) illustrates how bus
slicing is used to map wavelengths to waveguides. One logical channel (128 b/cycle or 64 λ per
channel) is mapped to each waveguide, but as required by a single-serpentine layout, the channel
is split into 64 λ directed left to right and 64 λ directed right to left. Each ring actually represents
64 rings each tuned to a different wavelength; a = λ1–λ64; b = λ64–λ128; couplers indicate where laser
light enters chip (from [8], courtesy of IEEE)

We can also perform rough terminal tuning estimates based on the total number of rings in each
layout. Given the technology assumptions in the earlier section on nanophotonic
3 Designing Chip-Level Nanophotonic Interconnection Networks 109

Fig. 3.13 Abstract physical layouts for 8-ary 2-stage butterfly with nanophotonic channels. In a
butterfly with nanophotonic channels each logical channel is implemented with a set of wave-
lengths that interconnect two stages of electrical routers. Two waveguide layouts are shown:
(a) uses a point-to-point layout; (b) uses a serpentine layout that results in longer waveguides but
avoids waveguide crossings. The nanophotonic transmitter and receiver block shown in (c) illus-
trates how channel slicing is used to map wavelengths to waveguides. Two logical channels (128 b/
cycle or 64 λ per channel) are mapped to each waveguide, and by mapping channels connecting the
same routers but in opposite directions we avoid the need for a ring-filter matrix. Each ring actually
represents 64 rings each tuned to a different wavelength; a = λ1–λ64; b = λ64–λ128; k is seven for
point-to-point layout and 21 for serpentine layout; couplers indicate where laser light enters chip
(from [8], courtesy of IEEE)

technology the crossbar network requires 500,000 rings and a fixed thermal tuning
power of over 10 W. The butterfly network requires only 14,000 rings and a fixed
thermal tuning power of 0.28 W. Although the crossbar is more expensive to implement,

it should also have significantly higher performance since it is a single-stage non-blocking topology. Since nanophotonics is still an emerging technology, evaluating a layout as a function of optical device quality is critical for a fair comparison.
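To make the loss argument above concrete, the following minimal sketch (an illustration only, assuming a 10 µW receiver sensitivity and a 10 mW per-waveguide power limit rather than the exact values from [8]) shows how path loss translates into laser power per wavelength, how that power bounds the number of wavelengths a single waveguide can carry, and how the fixed thermal tuning power scales with the total ring count; a per-ring tuning power of roughly 20 µW is consistent with the 14,000-ring/0.28 W and 500,000-ring/10 W figures quoted above.

def db_to_factor(loss_db):
    # Convert an optical loss in dB into a linear power ratio.
    return 10.0 ** (loss_db / 10.0)

def laser_power_per_wavelength_uW(path_loss_db, rx_sensitivity_uW=10.0):
    # The laser must overcome the whole path loss so the receiver still
    # sees its required input power (assumed 10 uW here).
    return rx_sensitivity_uW * db_to_factor(path_loss_db)

def wavelengths_per_waveguide(path_loss_db, wg_power_limit_uW=10_000):
    # Nonlinearities cap the total optical power a waveguide can carry
    # (assumed 10 mW here), so higher losses mean fewer wavelengths fit
    # per waveguide and more waveguides are needed for the same bandwidth.
    return int(wg_power_limit_uW // laser_power_per_wavelength_uW(path_loss_db))

for loss_db in (10, 15, 20):
    print(loss_db, "dB ->", wavelengths_per_waveguide(loss_db), "wavelengths per waveguide")

# Fixed thermal tuning power scales linearly with the ring count; ~20 uW per
# ring reproduces the figures quoted in the text.
print(500_000 * 20e-6, "W", 14_000 * 20e-6, "W")   # -> 10.0 W and 0.28 W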

Case Study #1: On-Chip Tile-to-Tile Network

In this case study, we present a nanophotonic interconnection network suitable for global on-chip communication between 64 tiles. The tiles might be homogeneous
with each tile including both some number of cores and a slice of the on-chip mem-
ory, or the tiles might be heterogeneous with a mix of compute and memory tiles.
The global on-chip network might be used to implement shared memory, message
passing, or both. Our basic network design will be similar regardless of these
specifics. We assume that software running on the tiles adheres to a dynamically par-
titioned application model; tiles within a partition communicate extensively, while
tiles across partitions communicate rarely. This case study assumes a 22-nm technol-
ogy, 5-GHz clock frequency, 512-bit packets, and 400-mm² chip. We examine net-
works sized for low (LTBw), medium (MTBw), and high (HTBw) target bandwidths
which correspond to ideal throughputs of 64, 128, and 256 b/cycle per tile under
uniform random traffic. More details on this case study can be found in [32].
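As a quick sanity check on these targets, the MTBw aggregate ideal throughput is $64 \text{ tiles} \times 128\ \mathrm{b/cycle} = 8192\ \mathrm{b/cycle} \approx 8\ \mathrm{kb/cycle}$, or roughly $8192\ \mathrm{b/cycle} \times 5\ \mathrm{GHz} \approx 41\ \mathrm{Tb/s}$; the LTBw and HTBw targets scale this by 0.5× and 2×.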

Network Design

Table 3.1 shows configurations for various topologies that meet the MTBw target.
Nanophotonic implementations of the 64×64 crossbar and 8-ary 2-stage butterfly
networks were discussed in section Designing Nanophotonic Interconnection
Networks. Our preliminary analysis suggested that the crossbar network could
achieve good performance but with significant optical power and area overhead,
while the butterfly network could achieve lower optical power and area overhead
but might perform poorly on adversarial traffic patterns. This analysis motivated our
interest in high-radix, low-diameter Clos networks. A classic three-stage (m,n,r)
Clos topology is characterized by the number of routers in the middle stage (m), the
radix of the routers in the first and last stages (n), and the number of input and output
switches (r). For this case study we explore an (8,8,8) Clos topology which is similar
to the 8-ary 2-stage butterfly topology shown in Fig. 3.3c except with three stages of
routers. The associated configuration for the MTBw target is shown in Table 3.1.
This topology is non-blocking which can enable significantly higher performance
than a blocking butterfly, but the Clos topology also requires twice as many bisec-
tion channels which requires careful design at the microarchitectural and physical
level. We use an oblivious non-deterministic routing algorithm that efficiently bal-
ances load by always randomly picking a middle-stage router.
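A minimal sketch of this routing decision (an illustration, not the simulator used in [32]): the only per-packet choice is a uniformly random middle-stage router, so every source-destination pair spreads its load evenly over all m middle routers.

import random

def clos_route(src_terminal, dst_terminal, m=8, n=8, r=8):
    """Return the (first, middle, last) stage routers traversed in an (m,n,r) Clos."""
    first_stage = src_terminal // n        # ingress router holding the source
    middle_stage = random.randrange(m)     # oblivious, non-deterministic choice
    last_stage = dst_terminal // n         # egress router holding the destination
    assert first_stage < r and last_stage < r
    return first_stage, middle_stage, last_stage

# Example: terminal 10 to terminal 57 in the 64-terminal (8,8,8) Clos.
print(clos_route(10, 57))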

[Fig. 3.14 contour plots: (a) crossbar optical power (W); (b) crossbar area overhead (%); (c) butterfly optical power (W); (d) butterfly area overhead (%). Axes: waveguide loss (dB/cm) versus ring through loss (dB/ring).]

Fig. 3.14 Comparison of 64×64 crossbar and 8-ary 2-stage butterfly networks. Contour plots
show optical laser power in Watts and area overhead as a percentage of the total chip area for the
layouts in Figs. 3.12b and 3.13b. These metrics are plotted as a function of optical device quality
(i.e., ring through loss and waveguide loss) (from [8], courtesy of IEEE)

The 8-ary 2-stage butterfly in Fig. 3.13b has low optical power and area overhead
due to its use of nanophotonics solely for point-to-point channels and not for optical
switching. For the Clos network we considered the two microarchitectures illus-
trated in Fig. 3.15. For simplicity, these microarchitectural schematics are for a
smaller (2,2,2) Clos topology. The microarchitecture in Fig. 3.15a uses two sets of
nanophotonic point-to-point channels to connect three stages of electrical routers.
All buffering, arbitration, and flow-control is done electrically. As an example, if
input I2 wants to communicate with output O3 then it can use either middle router. If
the routing algorithm chooses R2,2, then the network will use wavelength λ2 on the
first waveguide to send the message to R2,2 and wavelength λ4 on the second wave-
guide to send the message to O3. The microarchitecture in Fig. 3.15b implements
both the point-to-point channels and the middle stage of routers with nanophoton-
ics. We chose to pursue the first microarchitecture, since preliminary analysis sug-
gested that the energy advantage of using nanophotonic middle-stage routers was
outweighed by the increased optical laser power. We will revisit this assumption
later in this case study. Note how the topology choice impacted our microarchitec-
tural-level design; if we had chosen to explore a low-radix, high-diameter Clos

[Fig. 3.15 panels: (a) Clos with nanophotonic channels; (b) Clos with nanophotonic channels and middle-stage routers.]

Fig. 3.15 Microarchitectural schematic for nanophotonic (2,2,2) Clos. Both networks have four
inputs (I1–I4), four outputs (O1–O4), and six 2×2 routers (R1–3,1–2) with each network component imple-
mented with either electrical or nanophotonic technology: (a) electrical routers with four nano-
photonic point-to-point channels; (b) electrical first- and third-stage routers with a unified stage
of nanophotonic point-to-point channels and middle-stage routers (from [32], courtesy of
IEEE)

topology then optical switching would probably be required to avoid many opto-
electrical conversions. Here we opt for a high-radix, low-diameter topology to mini-
mize the complexity of the nanophotonic network.
We use a physical layout similar to that shown for the 8-ary 2-stage butterfly in
Fig. 3.13b except that we require twice as many point-to-point channels and thus
twice as many waveguides. For the Clos network, each of the eight groups of routers
includes three instead of two radix-8 routers. The Clos network will have twice the
optical power and area overhead as shown for the butterfly in Fig. 3.14c and 3.14d.
Note that even with twice the number of bisection channels, the Clos network still
uses less than 10 % of the chip area for a wide range of optical device parameters.
This is due to the impressive bandwidth density provided by nanophotonic techno-
logy. The Clos network requires an order of magnitude fewer rings than the crossbar
network resulting in a significant reduction in optical power and area overhead.

Evaluation

Our evaluation uses a detailed cycle-level microarchitectural simulator to study the performance and power of various electrical and nanophotonic networks. For
power calculations, important events (e.g., channel utilization, queue accesses, and
arbitration) were counted during simulation and then multiplied by energy values
derived from first-order gate-level models assuming a 22-nm technology. Our
baseline includes three electrical networks: an 8-ary 2-dimensional mesh (emesh),
a 4-ary 2-dimensional concentrated mesh with two independent physical networks
(ecmeshx2), and an (8,8,8) Clos (eclos). We use aggressive projections for the on-
chip electrical interconnect. We also study a nanophotonic implementation of the
Clos network as described in the previous section (pclos) with both aggressive and

conservative nanophotonic technology projections. We use synthetic traffic patterns based on a partitioned application model. Each traffic pattern has some num-
ber of logical partitions, and tiles randomly communicate only with other tiles that
are in the same partition. Although we studied various partition sizes and map-
pings, we focus on the following four representative patterns. A single global parti-
tion is identical to the standard uniform random traffic pattern (UR). The P8C
pattern has eight partitions each with eight tiles optimally co-located together. The
P8D pattern stripes these partitions across the chip. The P2D pattern has 32
partitions each with two tiles, and these two tiles are mapped to diagonally oppo-
site quadrants of the chip.
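The traffic model itself is easy to sketch; the mappings below are assumptions chosen only to illustrate the idea (the exact tile-to-partition assignments and the P2D pairing used in [32] may differ): a source tile only ever picks destinations inside its own partition.

import random

NUM_TILES = 64

def partition_of(tile, pattern):
    if pattern == "UR":    # one global partition, i.e., uniform random traffic
        return 0
    if pattern == "P8C":   # eight partitions of eight co-located tiles
        return tile // 8
    if pattern == "P8D":   # eight partitions striped across the chip
        return tile % 8
    raise ValueError(f"unknown pattern: {pattern}")

def pick_destination(src, pattern):
    # Destinations are drawn uniformly at random from the other tiles in
    # the source tile's partition.
    part = partition_of(src, pattern)
    peers = [t for t in range(NUM_TILES)
             if t != src and partition_of(t, pattern) == part]
    return random.choice(peers)

print(pick_destination(5, "P8C"))   # always lands in tiles 0-7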
Figure 3.16 shows the latency as a function of offered bandwidth for a subset of
the configurations. First note that the pclos network has similar zero-load latency
and saturation throughput regardless of the traffic patterns, since packets are always
randomly distributed across the middle-stage routers. Since to first order the nano-
photonic channel latencies are constant, this routing algorithm does not increase the
zero-load latency over a minimal routing algorithm. This is in contrast to eclos,
which has higher zero-load latency owing to the non-uniform channel latencies. Our
simulations show that on average, ecmeshx2 has higher performance than emesh
due to the path diversity provided by the two mesh networks and the reduced net-
work diameter. Figure 3.16 illustrates that pclos performs better than ecmeshx2 on
global patterns (e.g., P2D) and worse on local patterns (e.g., P8C). The hope is that
a higher-capacity pclos configuration (e.g., Fig. 3.16d) will have similar power con-
sumption as a lower-capacity ecmeshx2 configuration (e.g., Fig. 3.16a). This could
enable a nanophotonic Clos network to have similar or better performance than an
electrical network within a similar power constraint.
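Using the standard decomposition of network latency (see, e.g., [20]), the zero-load latency is approximately $T_0 \approx H\,t_r + \sum_{\text{hops}} t_c + L/b$, where $H$ is the hop count, $t_r$ the per-router delay, $t_c$ the individual channel latencies, $L$ the packet length, and $b$ the channel bandwidth; because the nanophotonic channel latencies $t_c$ are essentially identical for every middle-stage router, randomizing the middle stage leaves $T_0$ unchanged in pclos, whereas in eclos $t_c$ depends on the physical distance to the chosen middle router.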
Figure 3.17 shows the power breakdowns for various topologies and traffic pat-
terns. Figure 3.17a includes the least expensive configurations that can sustain an
aggregate throughput of 2 kb/cycle, while Fig. 3.17b includes the least expensive
configurations that can sustain an aggregate throughput of 8 kb/cycle. Compared to
emesh and ecmeshx2 at 8 kb/cycle, the pclos network with aggressive technology pro-
jections provides comparable performance and low power dissipation for global traffic
patterns, and comparable performance and power dissipation for local traffic patterns.
The benefit is less clear at lower target bandwidths, since the non-trivial fixed power
overhead of nanophotonics cannot be as effectively amortized. Notice the significant
amount of electrical laser power; our analysis assumes a 33 % efficiency laser mean-
ing that every Watt of optical laser power requires three Watts of electrical power to
generate. Although this electrical laser power is off-chip, it can impact system-level
design and the corresponding optical laser power is converted into heat on-chip.
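The factor of three follows directly from the assumed laser wall-plug efficiency: $P_{\mathrm{electrical}} = P_{\mathrm{optical}}/\eta_{\mathrm{laser}} = P_{\mathrm{optical}}/0.33 \approx 3\,P_{\mathrm{optical}}$.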

Design Themes

This case study illustrates several important design themes. First, it can be challenging
to show a compelling advantage for purely on-chip nanophotonic interconnection

[Fig. 3.16 plots: (a) ecmeshx2 in LTBw configuration; (b) ecmeshx2 in HTBw configuration; (c) pclos in LTBw configuration; (d) pclos in HTBw configuration. Axes: average latency (cycles) versus offered bandwidth (kb/cycle); curves for the UR, P2D, P8C, and P8D traffic patterns.]

Fig. 3.16 Latency versus offered bandwidth for on-chip tile-to-tile networks. LTBw systems have
a theoretical throughput of 64 b/cycle per tile, while HTBw systems have a theoretical throughput
of 256 b/cycle per tile, both for the uniform random traffic pattern (adapted from [32], courtesy of IEEE)

networks if we include fixed power overheads, use a more aggressive electrical base-
line, and consider local as well as global traffic patterns. Second, point-to-point nano-
photonic channels (or at least a limited amount of optical switching) seem to be a
more practical approach compared to global nanophotonic crossbars. This is espe-
cially true when we are considering networks that might be feasible in the near future.
Third, it is important to use an iterative design process that considers all levels of the
design. For example, Fig. 3.17 shows that the router power begins to consume a
significant portion of the total power at higher bandwidths in the nanophotonic Clos
network, and in fact, follow-up work by Kao et al. began exploring the possibility of using
both nanophotonic channels and one stage of low-radix nanophotonic routers [34].

Case Study #2: Manycore Processor-to-DRAM Network

Off-chip main-memory bandwidth is likely to be a key bottleneck in future manycore systems. In this case study, we present a nanophotonic processor-to-DRAM
network suitable for single-socket systems with 256 on-chip tiles and 16 DRAM
modules. Each on-chip tile could contain one or more processor cores possibly with

[Fig. 3.17 bar charts: (a) dynamic power at 2 kb/cycle; (b) dynamic power at 8 kb/cycle, with a "Laser = 33 W" annotation. Y-axis: electrical power (W); bars for the emesh, ecmeshx2, eclos, pclos-c, and pclos-a configurations under the ur, p2d, p8c, and p8d traffic patterns.]

Fig. 3.17 Dynamic power breakdown for on-chip tile-to-tile networks. Power of eclos and pclos
did not vary significantly across traffic patterns. (a) LTBw systems at 2 kb/cycle offered bandwidth
(except for emesh/p2d and ecmeshx2/p2d which saturated before 2 kb/cycle, HTBw system shown
instead); (b) HTBw systems at 8 kb/cycle offered bandwidth (except for emesh/p2d and ecmeshx2/
p2d which are not able to achieve 8 kb/cycle). pclos-c (pclos-a) corresponds to conservative
(aggressive) nanophotonic technology projections (from [32], courtesy of IEEE)

shared cache, and each DRAM module includes multiple memory controllers and
DRAM chips to provide large bandwidth with high capacity. We assume that the
address space is interleaved across DRAM modules at a fine granularity to maxi-
mize performance, and any structure in the address stream from a single core is
effectively lost when we consider hundreds of tiles arbitrating for tens of DRAM
modules. This case study assumes a 22-nm technology, 2.5-GHz clock frequency,
512-bit packets for transferring cache lines, and 400-mm2 chip. We also assume that
the total power of the processor chip is one of the key design constraints limiting
achievable performance. More details on this case study can be found in [6, 7].

Network Design

We focus on high-radix, low-diameter topologies so that we can make use of simple point-to-point nanophotonic channels. Our hope is that this approach will provide a
significant performance and energy-efficiency advantage while reducing risk by
relying on simple devices. The lack of path diversity in the butterfly topology is less
of an issue in this application, since we can expect address streams across cores to

Fig. 3.18 Logical topology for processor-to-DRAM network. Two (3,9,2,2) LMGS networks are
shown: one for the memory request network and one for the memory response network. Each LMGS
network includes three groups of nine tiles arranged in a small 3-ary 2-dimensional mesh cluster and
two global 3×2 routers that interconnect the clusters and DRAM memory controllers (MC). Lines in
cluster mesh network represent two unidirectional channels in opposite directions; other lines repre-
sent one unidirectional channel heading from left to right (from [8], courtesy of IEEE)

be less structured than in message passing networks. A two-stage symmetric butterfly topology for 256 tiles would require radix-16 routers which can be expen-
sive to implement electrically. We could implement these routers with nanophoton-
ics, but this increases the complexity and risk associated with adopting nanophotonics.
We could also increase the number of stages to reduce the radix, but this increases
the amount of opto-electrical conversions or requires optical switching. We choose
instead to use the local-meshes to global-switches (LMGS) topology shown in
Fig. 3.18 where each high-radix router is implemented with an electrical mesh sub-
network also called a cluster. A generic (c,n,m,r) LMGS topology is characterized
by the number of clusters (c), the number of tiles per cluster (n), the number of
global switches (m), and the radix of the global switches (c×r). For simplicity,
Fig. 3.18 illustrates a smaller (3,9,2,2) LMGS topology supporting a total of 27 tiles.
We assume dimension-ordered routing for the cluster mesh networks, although of
course other routing algorithms are possible. Notice that some of the mesh routers
in each cluster are access points, meaning they directly connect to the global rout-
ers. Each global router is associated with a set of memory controllers that manage
an independent set of DRAM chips, and together this forms a DRAM module. To
avoid protocol deadlock, we use one LMGS network for memory requests from a
tile to a specific DRAM module, and a separate LMGS network for memory
responses from the DRAM module back to the original tile. In this study, we assume
the request and response LMGS networks are separate physical networks, but they
could also be two logical networks implemented with distinct virtual channels. The
LMGS topology is particularly useful for preliminary design space exploration

since it decouples the number of tiles, clusters, and memory controllers. In this case
study, we explore LMGS topologies supporting 256 tiles and 16 DRAM modules
with one, four, and 16 clusters. Since the DRAM memory controller design is not
the focus of this case study, we ensure that the memory controller bandwidth is not
a bottleneck by providing four electrical DRAM memory controllers per DRAM
module. Note that high-bandwidth nanophotonic DRAM described as part of the
case study in section Case Study #3: DRAM Memory Channel could potentially
provide an equivalent amount of memory bandwidth with fewer memory controllers
and lower power consumption.
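The bookkeeping behind these (c,n,m,r) LMGS configurations is simple enough to capture in a few lines; the helper below is only an illustration, with the values taken from the text.

def lmgs_stats(c, n, m, r):
    # c = clusters, n = tiles per cluster, m = global switches (one per
    # DRAM module), r = memory controllers behind each global switch.
    return {
        "tiles": c * n,
        "dram_modules": m,
        "global_router_radix": (c, r),        # c cluster inputs, r MC outputs
        "memory_controllers": m * r,
        "cluster_to_memory_channels": c * m,  # one dedicated channel per pair
    }

print(lmgs_stats(3, 9, 2, 2))      # the small example of Fig. 3.18 (27 tiles)
print(lmgs_stats(16, 16, 16, 4))   # the target system (256 tiles, 16 modules)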
As mentioned above, our design uses a hybrid opto-electrical microarchitecture
that targets the advantages of each medium: nanophotonic interconnect for energy-
efficient global communication, and electrical interconnect for fast switching,
efficient buffering, and local communication. We use first-order analysis to size the
nanophotonic point-to-point channels such that the memory system power con-
sumption on uniform random traffic is less than a 20 W power constraint. Initially,
we balance the bisection bandwidth of the cluster mesh networks and the global
channel bandwidth, but we also consider overprovisioning the channel bandwidths
in the cluster mesh networks to compensate for intra-mesh contention. Configurations
with more clusters will require more nanophotonic channels, and thus each channel
will have lower bandwidth to still remain within this power constraint.
Figure 3.19 shows the abstract layout for our target system with 16 clusters. Since
each cluster requires one dedicated global channel to each DRAM module, there are a
total of 256 cluster-to-memory channels with one nanophotonic access point per chan-
nel. Our first-order analysis determined that 16 λ (160 Gb/s) per channel should enable
the configuration to still meet the 20 W power constraint. A ring-filter matrix layout is
used to passively shuffle the 16-λ channels on different horizontal waveguides destined
for the same DRAM module onto the same set of four vertical waveguides. We assume
that each DRAM module includes a custom switch chip containing the global router for
both the request and response networks. The switch chip on the memory side arbitrates
between the multiple requests coming in from the different clusters on the processor
chip. This reduces the power density of the processor chip and could enable multi-
socket configurations to easily share the same DRAM modules. A key feature of this
layout is that the nanophotonic devices are not only used for inter-chip communication,
but can also provide cross-chip transport to off-load intra-chip global electrical wiring.
Figure 3.20 shows the laser power as a function of optical device quality for two differ-
ent power constraints and thus two different channel bandwidths. Systems with greater
aggregate bandwidth have quadratically more waveguide crossings, making them more
sensitive to crossing losses. Additionally, certain combinations of waveguide and cross-
ing losses result in large cumulative losses and require multiple waveguides to stay
within the waveguide power limit. These additional waveguides further increase the total
number of crossings, which in turn continues to increase the power per wavelength,
meaning that for some device parameters it is infeasible to achieve a desired aggregate
bandwidth with a ring-filter matrix layout.
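The feedback loop just described can be sketched as a small fixed-point iteration; the loss and power numbers below are placeholders rather than the device values behind Fig. 3.20, and the point is only the shape of the behavior: each added waveguide adds crossings, the extra crossings raise the per-wavelength laser power, and for sufficiently lossy devices the iteration never converges.

def waveguides_needed(crossings_per_wg, wg_loss_db, xing_loss_db,
                      wavelengths=64, rx_sensitivity_uW=10.0,
                      wg_limit_mW=2.0, max_waveguides=50):
    wgs = 1
    while wgs <= max_waveguides:
        crossings = crossings_per_wg * wgs            # more waveguides, more crossings
        path_loss_db = wg_loss_db + crossings * xing_loss_db
        per_lambda_uW = rx_sensitivity_uW * 10 ** (path_loss_db / 10)
        per_wg_mW = (wavelengths / wgs) * per_lambda_uW / 1000.0
        if per_wg_mW <= wg_limit_mW:
            return wgs                                # feasible with this many waveguides
        wgs += 1
    return None                                       # infeasible for these parameters

print(waveguides_needed(20, 4.0, 0.05))   # modest crossing loss: converges (2 waveguides)
print(waveguides_needed(20, 4.0, 0.30))   # high crossing loss: never converges (None)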

Fig. 3.19 Abstract physical layout for nanophotonic processor-to-DRAM network. Target
(16,16,16,4) LMGS network with 256 tiles, 16 DRAM modules, and 16 clusters each with a 4-ary
2-dimensional electrical mesh. Each tile is labeled with a hexadecimal number indicating its clus-
ter. For simplicity the electrical mesh channels are only shown in the inset, the switch chip includes
a single memory controller, each ring in the main figure actually represents 16 rings modulating or
filtering 16 different wavelengths, and each optical power waveguide actually represents 16 wave-
guides (one per horizontal waveguide). NAP = nanophotonic access point; nanophotonic request
channel from group 3 to DRAM module 0 is highlighted (adapted from [7], courtesy of IEEE)

[Fig. 3.20 contour plots: (a) LMGS optical power with 16 clusters and 32 b/cycle/channel (W); (b) LMGS optical power with 16 clusters and 128 b/cycle/channel (W). Axes: waveguide loss (dB/cm) versus crossing loss (dB/crossing), with the high-loss corner marked as an infeasible region.]

Fig. 3.20 Optical power for nanophotonic processor-to-DRAM networks. Results are for a
(16,16,16,4) LMGS topology with a ring-filter matrix layout and two different power constraints:
(a) low power constraint and thus low aggregate bandwidth and (b) high power constraint and thus
high aggregate bandwidth (from [7], courtesy of IEEE)

Evaluation

Our evaluation uses a detailed cycle-level microarchitectural simulator to study the performance and power of various electrical and nanophotonic networks. We aug-
ment our simulator to count important events (e.g., channel utilization, queue
accesses, and arbitration) which are then multiplied by energy values derived from
our analytical models. The modeled system includes two-cycle mesh routers, one-
cycle mesh channels, four-cycle global point-to-point channels, and 100-cycle
DRAM array access latency. For this study, we use a synthetic uniform random
traffic pattern at a configurable injection rate.
Figure 3.21 shows the latency as a function of offered bandwidth for 15
configurations. The name of each configuration indicates the technology used to
implement the global channels (E = electrical, P = nanophotonics), the number of
clusters (1/4/16), and the over-provisioning factor (x1/x2/x4). Overprovisioning
improves the performance of the configurations with one and four clusters. E1x4
and E4x2 increase the throughput by 3–4× over the balanced configurations.
Overprovisioning had minimal impact on the 16 cluster configurations since the
local meshes are already quite small. Overall E4x2 is the best electrical configuration
and it consumes approximately 20 W near saturation. Just implementing the global
channels with nanophotonics in a simple mesh topology results in a 2× improvement
in throughput (e.g., P1x4 versus E1x4). However, the full benefit of photonic inter-
connect only becomes apparent when we partition the on-chip mesh network into
clusters and offload more traffic onto the energy-efficient nanophotonic channels.
The P16x1 configuration with aggressive projections can achieve a throughput of
9 kb/cycle (22 Tb/s), which is a 9× improvement over the best electrical
configuration (E4x2) at comparable latency. The best optical configurations con-
sume 16 W near saturation.
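The quoted aggregate follows directly from the 2.5-GHz clock: $9\ \mathrm{kb/cycle} \times 2.5\ \mathrm{GHz} \approx 22.5\ \mathrm{Tb/s}$.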

[Fig. 3.21 plots: (a) baseline electrical technology; (b) conservative nanophotonic technology; (c) aggressive nanophotonic technology. Axes: average latency (cycles) versus offered bandwidth (kb/cycle).]

Fig. 3.21 Latency versus offered bandwidth for processor-to-DRAM networks. E electrical,
P nanophotonics, 1/4/16 number of clusters, x1/x2/x4 over-provisioning factor (adapted from [7],
courtesy of IEEE)

Table 3.2 Power Breakdown for Processor-to-DRAM Networks

                                      Component power (W)
Configuration          Throughput   Mesh      Mesh       Global     Thermal   Total
                       (kb/cycle)   routers   channels   channels   tuning    power (W)
E4x2                   0.8          2.4       1.2        16.9       n/a       20.5
P16x1 (conservative)   6.0          5.9       3.2        3.1        3.9       16.2
P16x1 (aggressive)     9.0          8.0       4.5        1.5        2.6       16.7

These represent the best electrical and nanophotonic configurations. E4x2 is the electrical baseline
with four clusters and an overprovisioning factor of two, while P16x1 uses nanophotonic global
channels, 16 clusters, and no overprovisioning

Table 3.2 shows the power breakdown for the E4x2 and P16x1 configurations
near saturation. As expected, the majority of the power in the electrical configuration
is spent on the global channels that connect the access points to the DRAM mod-
ules. By implementing these channels with energy-efficient photonic links we
have a larger portion of our energy budget for higher-bandwidth on-chip mesh
networks even after including the overhead for thermal tuning. Note that the laser
power is not included here as it is highly dependent on the physical layout and
photonic device design as shown in Fig. 3.20. The photonic configurations con-
sume close to 15 W leaving 5 W for on-chip optical power dissipation as heat.
Ultimately, photonics enables an 8–10× improvement in throughput at similar
power consumption.

Design Themes

This case study suggests it is much easier to show a compelling advantage for
implementing an inter-chip network with nanophotonic devices, as compared to a
purely intra-chip nanophotonic network. Additionally, our results show that once
we have made the decision to use nanophotonics for chip-to-chip communication, it
makes sense to push nanophotonics as deep into each chip as possible (e.g., by using
more clusters). This approach for using seamless intra-chip/inter-chip nanophotonic
links is a general design theme that can help direct future directions for nanophoto-
nic network research. Also notice that our nanophotonic LMGS network was able
to achieve an order-of-magnitude improvement in throughput at a similar power
constraint without resorting to more sophisticated nanophotonic devices, such as
active optical switching. Again, we believe that using point-to-point nanophotonic
channels offers the most promising approach for short term adoption of this technol-
ogy. The choice of the ring-filter matrix layout was motivated by its regularity, short
waveguides, and the need to aggregate all of the nanophotonic couplers in one place
for simplified packaging. However, as shown in Fig. 3.20, this layout puts significant
constraints on the maximum tolerable losses in waveguides and crossings. We are
currently considering alternate serpentine layouts that can reduce the losses in cross-
ings and waveguides. However, the serpentine layout needs couplers at multiple
locations on the chip, which could increase packaging costs. An alternative would
be to leverage the multiple nanophotonic device layers available in a monolithic
BEOL integration approach. Work by Biberman et al. has shown how multilayer
deposited devices can significantly impact the feasibility of various network archi-
tectures [13], and this illustrates the need for a design process that iterates across the
architecture, microarchitecture, and physical design levels.

Case Study #3: DRAM Memory Channel

Both of the previous case studies assume a high-bandwidth and energy-efficient interface to off-chip DRAM. In this case study, we present photonically integrated
DRAM (PIDRAM) which involves re-architecting the DRAM channel, chip, and
bank to make best use of the nanophotonic technology for improved performance
and energy efficiency. As in the previous case study, we assume the address space is
interleaved across DRAM channels at a fine granularity, and that this effectively
results in approximately uniform random address streams. This case study assumes
a 32-nm DRAM technology, 512-bit access width, and timing constraints similar to
those in contemporary Micron DDR3 SDRAM. More details on this case study can
be found in [9].

b
a

PIDRAM Architecture PIDRAM with Shared Bus

c d

PIDRAM with Split Bus PIDRAM with Guided Bus

Fig. 3.22 PIDRAM designs. Subfigures illustrate a single DRAM memory channel (MC) with
four DRAM banks (B) at various levels of design: (a) logical topology for DRAM memory chan-
nel; (b) shared nanophotonic buses where optical power is broadcast to all banks along a shared
physical medium; (c) split nanophotonic buses where optical power is split between multiple direct
connections to each bank; (d) guided nanophotonic buses where optical power is actively guided
to a single bank. For clarity, command bus is not shown in (c) and (d), but it can be implemented
in a similar fashion as the corresponding write-data bus or as a SWBR bus (adapted from [9],
courtesy of IEEE)

Network Design

Figure 3.22a illustrates the logical topology for a DRAM memory channel. A mem-
ory controller is used to manage a set of DRAM banks that are distributed across
one or more DRAM chips. The memory system includes three logical buses: a com-
mand bus, a write-data bus, and a read-data bus. Figure 3.22b illustrates a straight-
forward nanophotonic microarchitecture for a DRAM memory channel with a
combination of SWBR, SWMR, and MWSR buses.
The microarchitecture in Fig. 3.22b can also map to a similar layout that we call
a shared nanophotonic bus. In this layout, the memory controller first broadcasts a
command to all of the banks and each bank determines if it is the target bank for the
command. For a PIDRAM write command, just the target bank will then tune-in its

nanophotonic receiver on the write-data bus. The memory controller places the write
data on this bus; the target bank will receive the data and then perform the corre-
sponding write operation. For a PIDRAM read command, just the target bank will
perform the read operation and then use its modulator on the read-data bus to send
the data back to the memory controller. Unfortunately, the losses multiply together
in this layout making the optical laser power an exponential function of the number
of banks. If all of the banks are on the same PIDRAM chip, then the losses can be
manageable. However, to scale to larger capacities, we will need to daisy-chain
the shared nanophotonic bus through multiple PIDRAM chips. Large coupler losses
and the exponential scaling of laser power combine to make the shared nanophoto-
nic bus feasible only for connecting banks within a PIDRAM chip as opposed to
connecting banks across PIDRAM chips.
Figure 3.22c shows the alternative reader-/writer-sliced split nanophotonic bus
layout, which divides the long shared bus into multiple branches. In the command
and write-data bus, modulated laser power is still sent to all receivers, and in the
read-data bus, laser power is still sent to all modulators. The split nature of the bus,
however, means that the total laser power is roughly a linear function of the number
of banks. If each bank was on its own PIDRAM chip, then we would use a couple
of fibers per chip (one for modulated data and one for laser power) to connect the
memory controller to each of the PIDRAM chips. Each optical path in the write-
data bus would only traverse one optical coupler to leave the processor chip and one
optical coupler to enter the PIDRAM chip regardless of the total number of banks.
This implementation reduces the extra optical laser power as compared to a shared
nanophotonic bus at the cost of additional splitter and combiner losses in the mem-
ory controller. It also reduces the effective bandwidth density of the nanophotonic
bus, by increasing the number of fibers for the same effective bandwidth.
To further reduce the required optical power, we can use a reader-/writer-sliced
guided nanophotonic bus layout, shown in Fig. 3.22d. Each nanophotonic demulti-
plexer uses an array of either active ring or comb filters. For the command and
write-data bus, the nanophotonic demultiplexer is placed after the modulator to
direct the modulated light to the target bank. For the read-data bus, the nanophoto-
nic demultiplexer is placed before the modulators to allow the memory controller to
manage when to guide the light to the target bank for modulation. Since the optical
power is always guided down a single branch, the total laser power is roughly con-
stant and independent of the number of banks. The optical loss overhead due to the
nanophotonic demultiplexers and the reduced bandwidth density due to the branch-
ing make a guided nanophotonic bus most attractive when working with relatively
large per-bank optical losses.
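A first-order sketch of the laser-power scaling just described, with placeholder per-bank and per-branch losses rather than the device values from [9]: along a shared daisy chain the per-bank losses add in dB, so the required power grows exponentially with the bank count, while the split bus grows roughly linearly and the guided bus stays roughly flat.

def db_to_factor(loss_db):
    return 10 ** (loss_db / 10)

def shared_bus_power_uW(n_banks, per_bank_loss_db=1.0, base_uW=10.0):
    # Worst-case path traverses every bank on the daisy chain.
    return base_uW * db_to_factor(n_banks * per_bank_loss_db)

def split_bus_power_uW(n_banks, branch_loss_db=3.0, base_uW=10.0):
    # Laser power is split over one branch per bank.
    return n_banks * base_uW * db_to_factor(branch_loss_db)

def guided_bus_power_uW(n_banks, branch_loss_db=3.0, demux_loss_db=1.0, base_uW=10.0):
    # Power is actively guided down a single branch, with a small demux overhead.
    return base_uW * db_to_factor(branch_loss_db + demux_loss_db)

for n in (4, 8, 16):
    print(n, shared_bus_power_uW(n), split_bus_power_uW(n), guided_bus_power_uW(n))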
Figure 3.23 illustrates in more detail our proposed PIDRAM memory system.
The figure shows a processor chip with multiple independent PIDRAM memory
channels; each memory channel includes a memory controller and a PIDRAM
DIMM, which in turn includes a set of PIDRAM chips. Each PIDRAM chip con-
tains a set of banks, and each bank is completely contained within a single PIDRAM
chip. We use a hybrid approach to implement each of the three logical buses. The
memory scheduler within the memory controller orchestrates access to each bus to

Fig. 3.23 PIDRAM memory system organization. Each PIDRAM memory channel connects to a
PIDRAM DIMM via a fiber ribbon. The memory controller manages the command bus (CB),
write-data bus (WDB), and read-data bus (RDB), which are wavelength division multiplexed onto
the same fiber. Nanophotonic demuxes guide power to only the active PIDRAM chip. B = PIDRAM
bank; each ring represents multiple rings for multi-wavelength buses (from [9], cour-
tesy of IEEE)

avoid conflicts. The command bus is implemented with a single wavelength on a guided nanophotonic bus. The command wavelength is actively guided to the
PIDRAM chip containing the target bank. Once on the PIDRAM chip, a single
receiver converts the command into the electrical domain and then electrically
broadcasts the command to all banks in the chip. Both the write-data and read-data
buses are implemented with a guided nanophotonic bus to actively guide optical
power to a single PIDRAM chip within a PIDRAM DIMM, and then they are
implemented with a shared nanophotonic bus to distribute the data within the
PIDRAM chip.
Figure 3.24 illustrates two abstract layouts for a PIDRAM chip. In the P1 layout
shown in Fig. 3.24a, the standard electrical I/O strip in the middle of the chip is
replaced with a horizontal waveguide and multiple nanophotonic access points. The
on-chip electrical H-tree command bus and vertical electrical data buses remain as in
traditional electrical DRAM. In the P2 layout shown in Fig. 3.24b, more of the on-
chip portion of the data buses is implemented with nanophotonics to improve cross-
chip energy-efficiency. The horizontal waveguides contain all of the wavelengths,

[Fig. 3.24 panels: (a) P1 layout; (b) P2 layout.]

Fig. 3.24 Abstract physical layout for PIDRAM chip. Two layouts are shown for an example
PIDRAM chip with eight banks and eight array blocks per bank. For both layouts, the nanophoto-
nic command bus ends at the command access point (CAP), and an electrical H-tree implementa-
tion efficiently broadcasts control bits from the command access point to all array blocks. For
clarity, the on-chip electrical command bus is not shown. The difference between the two layouts
is how far nanophotonics is extended into the PIDRAM chip: (a) P1 uses nanophotonic chip I/O
for the data buses but fully electrical on-chip data bus implementations, and (b) P2 uses seamless
on-chip/off-chip nanophotonics to distribute the data bus to a group of four banks. CAP = com-
mand access point; DAP = data access point (adapted from [9], courtesy of IEEE)

and the optically passive ring filter banks at the bottom and top of the waterfall
ensure that each of these vertical waveguides only contains a subset of the channel's
wavelengths. Each of these vertical waveguides is analogous to the electrical vertical
buses in P1, so a bank can still be striped across the chip horizontally to allow easy
access to the on-chip nanophotonic interconnect. Various layouts are possible that
correspond to more or less nanophotonic access points. For a Pn layout, n indicates
the number of partitions along each vertical electrical data bus. All of the nanopho-
tonic circuits have to be replicated at each data access point for each bus partition.
This increases the fixed link power due to link transceiver circuits and ring heaters.
It can also potentially lead to higher optical losses, due to the increased number of
rings on the optical path. Our nanophotonic layouts all use the same on-chip com-
mand bus implementation as traditional electrical DRAM: a command access point
is positioned in the middle of the chip and an electrical H-tree command bus broad-
casts the control and address information to all array blocks.

Evaluation

To evaluate the energy efficiency and area trade-offs of the proposed DRAM chan-
nels, we use a heavily modified version of the CACTI-D DRAM modeling tool.
Since nanophotonics is an emerging technology, we explore the space of possible

results with both aggressive and conservative projections for nanophotonic devices.
To quantify the performance of each DRAM design, we use a detailed cycle-level
microarchitectural simulator. We use synthetic traffic patterns to issue loads and
stores at a rate capped by the number of in-flight messages. We simulate a range of
different designs with each configuration name indicating the layout (Pn), the num-
ber of banks (b8/b64), and the number of I/Os per array core (io4/io32). We use the
events and statistics from the simulator to animate our DRAM and nanophotonic
device models to compute the energy per bit.
Figure 3.25 shows the energy-efficiency breakdown for various layouts imple-
menting three representative PIDRAM configurations. Each design is subjected to a
random traffic pattern at peak utilization and the results are shown for the aggressive
and conservative photonic technology projections. Across all designs it is clear that
replacing the off-chip links with photonics is advantageous, as E1 towers above the
rest of the designs. How far photonics is taken on chip, however, is a much richer
design space. To achieve the optimal energy efficiency requires balancing both the
data-dependent and data-independent components of the overall energy. The
data-independent energy includes: electrical laser power for the write bus, electrical
laser power for the read bus, fixed circuit energy including clock and leakage, and
thermal tuning energy. As shown in Fig. 3.25a, P1 spends the majority of the energy
on intra-chip communication (write and read energy) because the data must traverse
long global wires to get to each bank. Taking photonics all the way to each array
block with P64 minimizes the cross-chip energy, but results in a large number of
photonic access points (since the photonic access points in P1 are replicated 64
times in the case of P64), contributing to the large data-independent component of
the total energy. This is due to the fixed energy cost of photonic transceiver circuits
and the energy spent on ring thermal tuning. By sharing the photonic access points
across eight banks, the optimal design is P8. This design balances the data-dependent
savings of using intra-chip photonics with the data-independent overheads due to
electrical laser power, fixed circuit power, and thermal tuning power.
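This trade-off can be captured with a toy model; every constant below is a placeholder chosen only to illustrate the shape of the curve, not a CACTI-D output, and with these particular numbers the minimum happens to fall at P8, echoing the result above.

def energy_per_bit_pJ(n_partitions,
                      activate_pJ=2.0,        # bank activate component
                      cross_chip_pJ=4.0,      # full-chip electrical traversal (P1)
                      fixed_per_ap_pJ=0.05):  # fixed cost per data access point, per bit
    cross_chip = cross_chip_pJ / n_partitions   # shorter electrical data buses
    fixed = fixed_per_ap_pJ * n_partitions      # replicated transceivers and ring heaters
    return activate_pJ + cross_chip + fixed

for n in (1, 2, 4, 8, 16, 64):
    print(f"P{n}: {energy_per_bit_pJ(n):.2f} pJ/bit")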
Once the off-chip and cross-chip energies have been reduced (as in the P8 layout
for the b64-io4 configuration), the activation energy becomes dominant. Figure 3.25b
shows the results for the b64-io32 configuration which increases the number of bits
we read or write from each array core to 32. This further reduces the activate energy
cost, and overall this optimized design is 10× more energy efficient than the base-
line electrical design. Figure 3.25c shows similar trade-offs for the low-bandwidth
b8-io32 configuration.
In addition to these results, we also examined the energy as a function of utiliza-
tion and the area overhead. Figure 3.26 illustrates this trade-off for configurations
with 64 banks and four I/Os per array core. As expected, the energy per bit increases
as utilization goes down due to the data-independent power components. The large
fixed power in electrical DRAM interfaces helps mitigate the fixed power overhead
in a nanophotonic DRAM interface at low utilization; these results suggest the
potential for PIDRAM to be an energy efficient alternative regardless of utilization.
Although not shown, the area overhead for a PIDRAM chip is actually quite mini-
mal since any extra active area for the nanophotonic devices is compensated for by the
more area-efficient, higher-bandwidth array blocks.
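To first order the utilization trend follows $E_{\mathrm{bit}}(u) \approx E_{\mathrm{dynamic}} + P_{\mathrm{fixed}}/(u \cdot BW_{\mathrm{peak}})$, where $u$ is the channel utilization: as $u$ falls, the data-independent power is amortized over fewer bits, which is why the comparatively large fixed power of the electrical baseline interfaces matters here.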

[Fig. 3.25 bar charts of energy (pJ/bit) for the E1 and P1–P64 layouts: (a) b64-io4, (b) b64-io32, and (c) b8-io32 with conservative projections; (d) b64-io4, (e) b64-io32, and (f) b8-io32 with aggressive projections.]

Fig. 3.25 Energy breakdown for DRAM memory channels. Energy results are for uniform ran-
dom traffic with enough in-flight requests to saturate the DRAM memory channel. (a–c) Assume
conservative nanophotonic device projections, while (d–f) assume more aggressive nanophotonic
projections. Results for (a), (b), (d), and (e) are at a peak bandwidth of 500 Gb/s and (c) and (f)
are at a peak bandwidth of 60 Gb/s with random traffic. Fixed circuit energy includes clock and
leakage. Read energy includes chip I/O read, cross-chip read, and bank read energy. Write energy
includes chip I/O write, cross-chip write, and bank write energy. Activate energy includes chip I/O
command, cross-chip row address energy, and bank activate energy (from [9], courtesy of IEEE)

Design Themes

Point-to-point nanophotonic channels were a general theme in the first two case
studies, but in this case study point-to-point channels were less applicable. DRAM
memory channels usually use bus-based topologies to decouple bandwidth from
capacity, so we use a limited form of active optical switching in reader-sliced
SWMR and MWSR nanophotonic buses to reduce the required optical power. We
see this as a gradual approach to nanophotonic network complexity: a designer can
start with point-to-point nanophotonic channels, move to reader-sliced buses if there
is a need to scale terminals but not the network bandwidth, and finally move to fully

Fig. 3.26 Energy versus utilization. Energy results are for uniform random traffic with varying
numbers of in-flight messages. To reduce clutter, we only plot the three most energy efficient
waterfall floorplans (P4, P8, P16) (adapted from [9], courtesy of IEEE)

optical switching only if it is absolutely required to meet the desired application requirements. As in the previous case study, focusing on inter-chip nanophotonic
networks and using a broad range of nanophotonic device parameters helps make a
more compelling case for adopting this new technology compared to purely on-chip
nanophotonic networks. Once we move to using nanophotonic inter-chip interfaces,
there is a rich design space in how far into the chip we extend these nanophotonic
links to help off-load global on-chip interconnect. In this specific application the
fixed power overhead of nanophotonic interconnect is less of an issue owing to the
significant amount of fixed power in the electrical baseline interfaces.

Conclusions

Based on our experiences designing multiple nanophotonic networks and reviewing the literature, we have identified several common design guidelines that can aid in
the design of new nanophotonic interconnection networks.
Clearly Specify the Logical Topology. A crisp specification of the logical net-
work topology uses a simple high-level diagram to abstract away the details of the
nanophotonic devices. Low-level microarchitectural schematics and physical lay-
outs usually do a poor job of conveying the logical topology. For example, Figs. 3.12b
and 3.13b have very similar physical layouts but drastically different logical topolo-
gies. In addition, it is easy to confuse passively WDM-routed wavelengths with true
network routing; the former is analogous to routing wires at design time while the
latter involves dynamically routing packets at run time. A well-specified logical
topology removes this ambiguity, helps others understand the design, enables more
direct comparison to related proposals, and allows the application of well-known
interconnection network techniques for standard topologies.
Iterate Through the Three-Levels of Design. There are many ways to map a logical
bus or channel to nanophotonic devices and to integrate multiple stages of nanophoto-
nic interconnect. Overly coupling the three design levels artificially limits the design

space, and since this is still an emerging technology there is less intuition on which
parts of the design space are the most promising. Only exploring a single topology,
microarchitecture, or layout ignores some of the trade-offs involved in alternative
approaches. For example, restricting a design to only use optical switching eliminates
some high-radix topologies. These high-radix topologies can, however, be imple-
mented with electrical routers and point-to-point nanophotonic channels. As another
example, only considering wavelength slicing or only considering bus/channel slicing
artificially constrains bus and channel bandwidths as opposed to using a combination
of wavelength and bus/channel slicing. Iterating through the three levels of design can
enable a much richer exploration of the design space. For example, as discussed in
section Case Study #2: Manycore Processor-to-DRAM Network, an honest evalua-
tion of our final results suggests that it may be necessary to revisit some of our earlier
design decisions about the importance of waveguide crossings.
Use an Aggressive Electrical Baseline. There are many techniques to improve
the performance and energy-efficiency of electrical chip-level networks, and most
of these techniques are far more practical than adopting an emerging technology.
Designers should assume fairly aggressive electrical projections in order to make a
compelling case for chip-level nanophotonic interconnection networks. For exam-
ple, with an aggressive electrical baseline technology in section Case Study #1:
On-Chip Tile-to-Tile Network, it becomes more difficult to make a strong case for
purely on-chip nanophotonic networks. However, even with aggressive electrical
assumptions it was still possible to show significant potential in using seamless
intra-chip/inter-chip nanophotonic links in sections Case Study #2: Manycore
Processor-to-DRAM Network and Case Study #3: DRAM Memory Channel.
Assume a Broad Range of Nanophotonic Device Parameters. Nanophotonics is
an emerging technology, and any specific instance of device parameters is cur-
rently meaningless for realistic network design. This is especially true when param-
eters are mixed from different device references that assume drastically different
fabrication technologies (e.g., hybrid integration versus monolithic integration). It
is far more useful for network designers to evaluate a specific proposal over a range
of device parameters. In fact, one of the primary goals of nanophotonic interconnec-
tion network research should be to provide feedback to device experts on the most
important directions for improvement. In other words, are there certain device
parameter ranges that are critical for achieving significant system-level benefits?
For example, the optical power contours in section Case Study #2: Manycore
Processor-to-DRAM Network helped not only motivate alternative layouts but also
an interest in very low-loss waveguide crossings.
Carefully Consider Nanophotonic Fixed-Power Overheads. One of the primary
disadvantages of nanophotonic devices is the many forms of fixed power including
fixed transceiver circuit power, static thermal tuning power, and optical laser power.
These overheads can impact the energy efficiency, on-chip power density, and sys-
tem-level power. Generating a specific amount of optical laser power can require
significant off-chip electrical power, and this optical laser power ultimately ends up
as heat dissipation in various nanophotonic devices. Ignoring these overheads or
only evaluating designs at high utilization rates can lead to overly optimistic results.

For example, section Case Study #1: On-Chip Tile-to-Tile Network suggested
that static power overhead could completely mitigate any advantage for purely on-
chip nanophotonic networks, unless we assume relatively aggressive nanophotonic
devices. This is in contrast to the study in section Case Study #3: DRAM Memory
Channel, which suggests that even at low utilization, PIDRAM can achieve similar
performance at lower power compared to projected electrical DRAM interfaces.
Motivate Nanophotonic Network Complexity. There will be significant practical
risk in adopting nanophotonic technology. Our goal as designers should be to
achieve the highest benefit with the absolute lowest amount of risk. Complex nano-
photonic interconnection networks can require many types of devices and many
instances of each type. These complicated designs significantly increase risk in
terms of reliability, fabrication cost, and packaging issues. If we can achieve the same
benefits with a much simpler network design, then ultimately this increases the
potential for realistic adoption of this emerging technology. Two of our case studies
make use of just nanophotonic point-to-point channels, and our hope is that this
simplicity can reduce risk. Once we decide to use nanophotonic point-to-point
channels, then high-radix, low-diameter topologies seem like a promising direction
for future research.

Acknowledgements This work was supported in part by DARPA awards W911NF-06-1-0449, W911NF-08-1-0134, W911NF-08-1-0139, and W911NF-09-1-0342. Research also supported in
part by Microsoft (Award #024263) and Intel (Award #024894) funding and by matching funding
from U.C. Discovery (Award #DIG07-10227). The authors acknowledge chip fabrication support
from Texas Instruments.

We would like to thank our co-authors on the various publications that served as the basis for the
three case studies, including Y.-J. Kwon, S. Beamer, I. Shamim, and C. Sun. We would like to
acknowledge the MIT nanophotonic device and circuits team, including J. S. Orcutt, A. Khilo,
M. A. Popovi, C. W. Holzwarth, B. Moss, H. Li, M. Georgas, J. Leu, J. Sun, C. Sorace,
F. X. Krtner, J. L. Hoyt, R. J. Ram, and H. I. Smith.

References

1. Abousamra A, Melhem R, Jones A (2011) Two-hop free-space based optical interconnects for
chip multiprocessors. In: International symposium on networks-on-chip (NOCS), May 2011.
http://dx.doi.org/10.1145/1999946.1999961 Pittsburgh, PA
2. Alduino A, Liao L, Jones R, Morse M, Kim B, Lo W, Basak J, Koch B, Liu H, Rong H, Sysak
M, Krause C, Saba R, Lazar D, Horwitz L, Bar R, Litski S, Liu A, Sullivan K, Dosunmu O, Na
N, Yin T, Haubensack F, Hsieh I, Heck J, Beatty R, Park H, Bovington J, Lee S, Nguyen H, Au
H, Nguyen K, Merani P, Hakami M, Paniccia MJ (2010) Demonstration of a high-speed
4-channel integrated silicon photonics WDM link with silicon lasers. In: Integrated photonics
research, silicon, and nanophotonics (IPRSN), July 2010. http://www.opticsinfobase.org/
abstract.cfm?URI=iprsn-2010-pdiwi5 Monterey, CA
3. Amatya R, Holzwarth CW, Popovi MA, Gan F, Smith HI, Krtner F, Ram RJ (2007) Low-
power thermal tuning of second-order microring resonators. In: Conference on lasers and
electro-Optics (CLEO), May 2007. http://www.opticsinfobase.org/abstract.cfm?URI=CLEO-
2007-CFQ5 Baltimore, MD

4. Balfour J, Dally W (2006) Design tradeoffs for tiled CMP on-chip networks. In: International
symposium on supercomputing (ICS), June 2006. http://dx.doi.org/10.1145/1183401.1183430
Queensland, Australia
5. Barwicz T, Byun H, Gan F, Holzwarth CW, Popovi MA, Rakich PT, Watts MR, Ippen EP,
Krtner F, Smith HI, Orcutt JS, Ram RJ, Stojanovic V, Olubuyide OO, Hoyt JL, Spector S,
Geis M, Grein M, Lyszcarz T, Yoon JU (2007) Silicon photonics for compact, energy-efficient
interconnects. J Opt Networks 6(1):63–73
6. Batten C, Joshi A, Orcutt JS, Khilo A, Moss B, Holzwarth CW, Popovi MA, Li H, Smith HI, Hoyt
JL, Krtner FX, Ram RJ, Stojanovi V, Asanovi K (2008) Building manycore processor-to-
DRAM networks with monolithic silicon photonics. In: Symposium on high-performance inter-
connects (hot interconnects), August 2008 http://dx.doi.org/10.1109/HOTI.2008.11 Stanford, CA
7. Batten C, Joshi A, Orcutt JS, Khilo A, Moss B, Holzwarth CW, Popovi MA, Li H, Smith HI,
Hoyt JL, Krtner FX, Ram RJ, Stojanovi V, Asanovi K (2009) Building manycore proces-
sor-to-DRAM networks with monolithic CMOS silicon photonics. IEEE Micro 29(4):8-21
8. Batten C, Joshi A, Stojanovi V, Asanovi K (2012) Designing chip-level nanophotonic inter-
connection networks. IEEE J Emerg Sel Top Circuits Syst. http://dx.doi.org/10.1109/
JETCAS.2012.2193932
9. Beamer S, Sun C, Kwon Y-J, Joshi A, Batten C, Stojanovi V, Asanovi K (2010) Rearchitecting
DRAM memory systems with monolithically integrated silicon photonics. In: International
symposium on computer architecture (ISCA), June 2010. http://dx.doi.
org/10.1145/1815961.1815978 Saint-Malo, France
10. Beausoleil RG (2011) Large-scale integrated photonics for high-performance interconnects.
ACM J Emerg Technol Comput Syst 7(2):6
11. Beux SL, Trajkovic J, O'Connor I, Nicolescu G, Bois G, Paulin P (2011) Optical ring network-
on-chip (ORNoC): architecture and design methodology. In: Design, automation, and test in
Europe (DATE), March 2011. http://ieeexplore.ieee.org/xpl/articleDetails.
jsp?arnumber=5763134 Grenoble, France
12. Bhuyan LN, Agrawal DP (1984) Generalized hypercube and hyperbus structures for a com-
puter network. IEEE Trans Comput 33(4):323–333
13. Biberman A, Preston K, Hendry G, Sherwood-Droz N, Chan J, Levy JS, Lipson M, Bergman
K (2011) Photonic network-on-chip architectures using multilayer deposited silicon materials
for high-performance chip multiprocessors. ACM J Emerg Technol Comput Syst 7(2):7
14. Binkert N, Davis A, Lipasti M, Schreiber R, Vantrease D (2009) Nanophotonic barriers. In:
Workshop on photonic interconnects and computer architecture, December 2009. Atlanta, GA
15. Block BA, Younkin TR, Davids PS, Reshotko MR, Chang BMPP, Huang S, Luo J, Jen AKY
(2008) Electro-optic polymer cladding ring resonator modulators. Opt Express 16(22):
1832618333
16. Christiaens I, Thourhout DV, Baets R (2004) Low-power thermo-optic tuning of vertically
coupled microring resonators. Electron Lett 40(9):560561
17. Cianchetti MJ, Albonesi DH (2011) A low-latency, high-throughput on-chip optical router
architecture for future chip multiprocessors. ACM J Emerg Technol Comput Syst 7(2):9
18. Cianchetti MJ, Kerekes JC, Albonesi DH (2009) Phastlane: a rapid transit optical routing net-
work. In: International symposium on computer architecture (ISCA), June 2009. http://dx.doi.
org/10.1145/1555754.1555809 Austin, TX
19. Clos C (1953) A study of non-blocking switching networks. Bell Syst Techn J 32:406424
20. Dally WJ, Towles B (2004) Principles and practices of interconnection networks. Morgan
Kaufmann. http://www.amazon.com/dp/0122007514
21. DeRose CT, Watts MR, Trotter DC, Luck DL, Nielson GN, Young RW (2010) Silicon micror-
ing modulator with integrated heater and temperature sensor for thermal control. In: Conference
on lasers and electro-optics (CLEO), May 2010. http://www.opticsinfobase.org/abstract.
cfm?URI=CLEO-2010-CThJ3 San Jose, CA
22. Dokania RK, Apsel A (2009) Analysis of challenges for on-chip optical interconnects. In:
Great Lakes symposium on VLSI, May 2009. http://dx.doi.org/10.1145/1531542.1531607
Paris, France
132 C. Batten et al.

23. Dumon P, Bogaerts W, Baets R, Fedeli J-M, Fulbert L (2009) Towards foundry approach for
silicon photonics: silicon photonics platform ePIXfab. Electron Lett 45(12):581582
24. Georgas M, Leu JC, Moss B, Sun C, Stojanovi V (2011) Addressing link-level design
tradeoffs for integrated photonic interconnects. In: Custom integrated circuits conference
(CICC), September 2011 http://dx.doi.org/10.1109/CICC.2011.6055363 San Jose, CA
25. Georgas M, Orcutt J, Ram RJ, Stojanovi V (2011) A monolithically-integrated optical receiver
in standard 45 nm SOI. In: European solid-state circuits conference (ESSCC), September 2011.
http://dx.doi.org/10.1109/ESSCIRC.2011.6044993 Helsinki, Finland
26. Gu H, Xu J, Zhang W (2009) A low-power fat-tree-based optical network-on-chip for multi-
processor system-on-chip. In: Design, automation, and test in Europe (DATE), May 2009.
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=5090624 Nice, France
27. Guha B, Kyotoku BBC, Lipson M (2010) CMOS-compatible athermal silicon microring reso-
nators. Opt Express 18(4):34873493
28. Gunn C (2006) CMOS photonics for high-speed interconnects. IEEE Micro 26(2):5866
29. Hadke A, Benavides T, Yoo SJB, Amirtharajah R, Akella V (2008) OCDIMM: scaling the
DRAM memory wall using WDM-based optical interconnects. In: Symposium on high-
performance interconnects (hot interconnects), August 2008. http://dx.doi.org/10.1109/
HOTI.2008.25 Stanford, CA
30. Holzwarth CW, Orcutt JS, Li H, Popovi MA, Stojanovi V, Hoyt JL, Ram RJ, Smith HI
(2008) Localized substrate removal technique enabling strong-confinement microphotonics in
bulk-Si-CMOS processes. In: Conference on lasers and electro-optics (CLEO), May 2008.
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=4571716 San Jose, CA
31. Hwang E, Bhave SA (2010) Nanophotonic devices on thin buriod oxide silicon-on insulator
substrates. Opt Express 18(4):38503857
32. Joshi A, Batten C, Kwon Y-J, Beamer S, Shamim I, Asanovi K, Stojanovi V (2009) Silicon-
photonic Clos networks for global on-chip communication. In: International symposium on
networks-on-chip (NOCS), May 2009 http://dx.doi.org/10.1109/NOCS.2009.5071460 San
Diego, CA
33. Kalluri S, Ziari M, Chen A, Chuyanov V, Steier WH, Chen D, Jalali B, Fetterman H, Dalton
LR (1996) Monolithic integration of waveguide polymer electro-optic modulators on VLSI
circuitry. Photon Technol Lett 8(5):644646
34. Kao Y-H, Chao JJ (2011) BLOCON: a bufferless photonic Clos network-on-chip architecture.
In: International symposium on networks-on-chip (NOCS), May 2011. http://dx.doi.
org/10.1145/1999946.1999960 Pittsburgh, PA
35. Kash JA (2008) Leveraging optical interconnects in future supercomputers and servers. In:
Symposium on high-performance interconnects (hot interconnects), August 2008. http://dx.
doi.org/10.1109/HOTI.2008.29 Stanford, CA
36. Kim B, Stojanovi V (2008) Characterization of equalized and repeated interconnects for NoC
applications. IEEE Design Test Comput 25(5):430439
37. Kim J, Balfour J, Dally WJ (2007) Flattened butterfly topology for on-chip networks. In:
International symposium on microarchitecture (MICRO), December 2007 http://dx.doi.
org/10.1109/MICRO.2007.15 Chicago, IL
38. Kimerling LC, Ahn D, Apsel AB, Beals M, Carothers D, Chen Y-K, Conway T, Gill DM,
Grove M, Hong C-Y, Lipson M, Liu J, Michel J, Pan D, Patel SS, Pomerene AT, Rasras M,
Sparacin DK, Tu K-Y, White AE, Wong CW (2006) Electronic-photonic integrated circuits on
the CMOS platform. In: Silicon Photonics, March 2006. http://dx.doi.org/10.1117/12.654455
San Jose, CA
39. Krman N, Krman M, Dokania RK, Martnez JF, Apsel AB, Watkins MA, Albonesi DH
(2006) Leveraging optical technology in future bus-based chip multiprocessors. In: International
symposium on microarchitecture (MICRO), December 2006 http://dx.doi.org/10.1109/
MICRO.2006.28 Orlando, FL
40. Krman N, Martnez JF (2010) A power-efficient all-optical on-chip interconnect using
wavelength-based oblivious routing. In: International conference on architectural support for
3 Designing Chip-Level Nanophotonic Interconnection Networks 133

programming languages and operating systems (ASPLOS), March 2010 http://dx.doi.


org/10.1145/1736020.1736024 Pittsburgh, PA
41. Koka P, McCracken MO, Schwetman H, Zheng X, Ho R, Krishnamoorthy AV (2010) Silicon-
photonic network architectures for scalable, power-efficient multi-chip systems. In:
International symposium on computer architecture (ISCA), June 2010 http://dx.doi.
org/10.1145/1815961.1815977 Saint-Malo, France
42. Koohi S, Abdollahi M, Hessabi S (2011) All-optical wavelength-routed NoC based on a novel
hierarchical topology. In: International symposium on networks-on-chip (NOCS), May 2011
http://dx.doi.org/10.1145/1999946.1999962 Pittsburgh, PA
43. Kumar P, Pan Y, Kim J, Memik G, Choudhary A (2009) Exploring concentration and channel
slicing in on-chip network router. In: International symposium on networks-on-chip (NOCS),
May 2009 http://dx.doi.org/10.1109/NOCS.2009.5071477 San Diego, CA
44. Kurian G, Miller J, Psota J, Eastep J, Liu J, Michel J, Kimerling L, Agarwal A (2010) ATAC:
a 1000-core cache-coherent processor with on-chip optical network. In: International confer-
ence on parallel architectures and compilation techniques (PACT), September 2010. http://
dx.doi.org/10.1145/1854273.1854332 Minneapolis, MN
45. Leiserson CE (1985) Fat-trees: universal networks for hardware-efficient supercomputing.
IEEE Trans Comput C-34(10):892901
46. Leu JC, Stojanovi V (2011) Injection-locked clock receiver for monolithic optical link in 45
nm. In: Asian solid-state circuits conference (ASSCC), November 2011. http://dx.doi.
org/10.1109/ASSCC.2011.6123624 Jeju, Korea
47. Li Z, Mohamed M, Chen X, Dudley E, Meng K, Shang L, Mickelson AR, Joseph R,
Vachharajani M, Schwartz B, Sun Y (2010) Reliability modeling and management of nano-
photonic on-chip networks. In: IEEE Transactions on very large-scale integration systems
(TVLSI), PP(99), December 2010
48. Li Z, Mohamed M, Chen X, Zhou H, Michelson A, Shang L, Vachharajani M (2011) Iris: a
hybrid nanophotonic network design for high-performance and low-power onchip communi-
cation. ACM J Emerg Technol Comput Syst 7(2):8
49. Liow T-Y, Ang K-W, Fang Q, Song J-F, Xiong Y-Z, Yu M-B, Lo G-Q, Kwong D-L (2010)
Silicon modulators and germanium photodetectors on SOI: monolithic integration, compati-
bility, and performance optimization. J Sel Top Quantum Electron 16(1):307315
50. Lipson M (2006) Compact electro-optic modulators on a silicon chip. J Sel Top Quantum
Electron 12(6):15201526
51. Manipatruni S, Dokania RK, Schmidt B, Sherwood-Droz N, Poitras CB, Apsel AB, Lipson M
(2008) Wide temperature range operation of micrometer-scale silicon electro-optic modula-
tors. Opt Lett 33(19):21852187
52. Masini G, Colace L, Assanto G (2003) 2.5 Gbit/s polycrystalline germanium-on-silicon pho-
todetector operating from 1.3 to 1.55 ?m. Appl Phys Lett 82(15):5118 5124
53. Mejia PV, Amirtharajah R, Farrens MK, Akella V (2011) Performance evaluation of a multicore
system with optically connected memory modules. In: International symposium on networks
on-chip (NOCS), May 2011 http://dx.doi.org/10.1109/NOCS.2010.31 Grenoble, France
54. Mesa-Martinez FJ, Nayfach-Battilana J, Renau J (2007) Power model validation through ther-
mal measurements. In: International symposium on computer architecture (ISCA), June 2007
http://dx.doi.org/10.1145/1273440.1250700 San Diego, CA
55. Miller DA (2009) Device requirements for optical interconnects to silicon chips. Proc. IEEE
97(7):11661185
56. Morris R, Kodi A (2010) Exploring the design of 64 & 256 core power efficient nanophotonic
interconnect. J Sel Top Quantum Electron 16(5):13861393
57. Nitta C, Farrens M, Akella V (2011) Addressing system-level trimming issues in onchip nano-
photonic networks. In: International symposium on high-performance computer architecture
(HPCA), February 2011. http://dx.doi.org/10.1109/HPCA.2011.5749722 San Antonio, TX
58. Orcutt JS, Khilo A, Holzwarth CW, Popovi MA, Li H, Sun J, Bonifield T, Hollingsworth R,
Kartner FX, Smith HI, Stojanovi V, Ram RJ (2011) Nanophotonic integration in state-of-
the-art CMOS foundaries. Opt Express 19(3):23352346
134 C. Batten et al.

59. Orcutt JS, Khilo A, Popovic MA, Holzwarth CW, Li H, Sun J, Moss B, Dahlem MS, Ippen EP,
Hoyt JL, Stojanovi V, Kartner FX, Smith HI, Ram RJ (2009) Photonic integration in a com-
mercial scaled bulk-CMOS process. In: International conference on photonics in switching,
September 2009. http://dx.doi.org/10.1109/PS.2009.5307769 Pisa, Italy
60. Orcutt JS, Khilo A, Popovic MA, Holzwarth CW, Moss B, Li H, Dahlem MS, Bonifield TD,
Kartner FX, Ippen EP, Hoyt JL, Ram RJ, Stojanovi V (2008) Demonstration of an electronic
photonic integrated circuit in a commercial scaled bulk-CMOS process. In: Conference on
lasers and electro-optics (CLEO), May 2008. http://ieeexplore.ieee.org/xpl/articleDetails.
jsp?arnumber=4571838 San Jose, CA
61. Orcutt JS, Tang SD, Kramer S, Li H, Stojanovi V, Ram RJ (2011) Low-loss polysilicon wave-
guides suitable for integration within a high-volume polysilicon process. In: Conference on
lasers and electro-optics (CLEO), May 2011. http://ieeexplore.ieee.org/xpl/articleDetails.
jsp?arnumber=5950452 Baltimore, MD
62. Pan Y, Kim J, Memik G (2010) FlexiShare: energy-efficient nanophotonic crossbar architec-
ture through channel sharing. In: International symposium on high-performance computer
architecture (HPCA), January 2010. http://dx.doi.org/10.1109/HPCA.2010.5416626
Bangalore, India
63. Pan Y, Kumar P, Kim J, Memik G, Zhang Y, Choudhary A (2009) Firefly: illuminating on-chip
networks with nanophotonics. In: International symposium on computer architecture (ISCA),
June 2009. http://dx.doi.org/10.1145/1555754.1555808 Austin, TX
64. Pasricha S, Dutt N (2008) ORB: an on-chip optical ring bus communication architecture for
multi-processor systems-on-chip. In: Asia and South Pacific design automation conference
(ASP-DAC), January 2008 http://dx.doi.org/10.1109/ASPDAC.2008.4484059 Seoul, Korea
65. Petracca M, Lee BG, Bergman K, Carloni LP (2009) Photonic NoCs: system-level design
exploration. IEEE Micro 29(4):7477
66. Poon AW, Luo X, Xu F, Chen H (2009) Cascaded microresonator-based matrix switch for sili-
con on-chip optical interconnection. Proc IEEE 97(7):12161238
67. Preston K, Manipatruni S, Gondarenko A, Poitras CB, Lipson M (2009) Deposited silicon
high-speed integrated electro-optic modulator. Opt Express 17(7):51185124
68. Reed GT (2008) Silicon photonics: the state of the art. Wiley-Interscience. http://www.ama-
zon.com/dp/0470025794
69. Shacham A, Bergman K, Carloni LP (2008) Photonic networks-on-chip for future generations
of chip multiprocessors. IEEE Trans Comput 57(9):12461260
70. Sherwood-Droz N, Preston K, Levy JS, Lipson M (2010) Device guidelines for WDM inter-
connects using silicon microring resonators. In: Workshop on the interaction between nano-
photonic devices and systems (WINDS), December 2010. Atlanta, GA
71. Sherwood-Droz N, Wang H, Chen L, Lee BG, Biberman A, Bergman K, Lipson M (2008) Optical
4x4 hitless silicon router for optical networks-on-chip. Opt Express 16(20):1591515922
72. Skandron K, Stan MR, Huang W, Velusamy S, Sankarananarayanan K, Tarjan D (2003)
Temperature-aware microarchitecture. In: International symposium on computer architecture
(ISCA), June 2003. http://dx.doi.org/10.1145/871656.859620 San Diego, CA
73. Thourhout DV, Campenhout JV, Rojo-Romeo P, Regreny P, Seassal C, Binetti P, Leijtens
XJM, Notzel R, Smit MK, Cioccio LD, Lagahe C, Fedeli J-M, Baets R (2007) A photonic
interconnect layer on CMOS. In: European conference on optical communication (ECOC),
September 2007. http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=5758445 Berlin,
Germany
74. Udipi AN, Muralimanohar N, Balasubramonian R, Davis A, Jouppi N (2011) Combining
memory and a controller with photonics through 3D-stacking to enable scalable and energy-
efficient systems. In: International symposium on computer architecture (ISCA), June 2011.
http://dx.doi.org/10.1145/2000064.2000115 San Jose, CA
75. Vantrease D, Binkert N, Schreiber R, Lipasti MH (2009) Light speed arbitration and flow
control for nanophotonic interconnects. In: International symposium on microarchitecture
(MICRO), December 2009. http://dx.doi.org/10.1145/1669112.1669152 New York, NY
3 Designing Chip-Level Nanophotonic Interconnection Networks 135

76. Vantrease D, Schreiber R, Monchiero M, McLaren M, Jouppi NP, Fiorentino M, Davis A,


Binkert N, Beausoleil RG, Ahn JH (2008) Corona: system implications of emerging nanopho-
tonic technology. In: International symposium on computer architecture (ISCA), June 2008.
http://dx.doi.org/10.1109/ISCA.2008.35 Beijing, China
77. Watts MR, Zortman WA, Trotter DC, Nielson GN, Luck DL, Young RW (2009) Adiabatic
resonant microrings with directly integrated thermal microphotonics. In: Conference on lasers
and electro-optics (CLEO), May 2009. http://www.opticsinfobase.org/abstract.
cfm?URI=CLEO-2009-CPDB10 Baltimore, MD
78. Xue J, Garg A, iftio lu B, Hu J, Wang S, Savidis I, Jain M, Berman R, Liu P, Huang M, Wu
H, Friedman E, Wicks G, Moore D (2010) An intra-chip free-space optical interconnect. In:
International symposium on computer architecture (ISCA), June 2010 http://dx.doi.
org/10.1145/1815961.1815975 Saint-Malo, France
79. Young IA, Mohammed E, Liao JTS, Kern AM, Palermo S, Block BA, Reshotko MR, Chang
PLD (2010) Optical I/O technology for tera-scale computing. IEEE J Solid-State Circuits
45(1):235248
80. Zhao W, Cao Y (2006) New generation of predictive technology model for sub-45 nm early
design exploration. IEEE Trans Electron Dev 53(11):28162823
81. Zheng X, Lexau J, Luo Y, Thacker H, Pinguet T, Mekis A, Li G, Shi J, Amberg P, Pinckney N,
Raj K, Ho R, Cunningham JE, Krishamoorthy AV (2010) Ultra-low energy all-CMOS modula-
tor integrated with driver. Opt Express 18(3):30593070
82. Zhou L, Djordjevic SS, Proietti R, Ding D, Yoo SJB, Amirtharajah R, Akella V (2009) Design
and evaluation of an arbitration-free passive optical crossbar for onchip interconnection net-
works. Appl Phys A Mater Sci Process 95(4):11111118
83. Zhou L, Okamoto K, Yoo SJB (2009) Athermalizing and trimming of slotted silicon microring
resonators with UV-sensitive PMMA upper-cladding. Photon Technol Lett 21(17):11751177
84. Zortman WA, Trotter DC, Watts MR (2010) Silicon photonics manufacturing. Opt Express
18(23):2359823607
Chapter 4
FONoC: A Fat Tree Based Optical Network-on-Chip for Multiprocessor System-on-Chip

Jiang Xu, Huaxi Gu, Wei Zhang, and Weichen Liu

Abstract Multiprocessor systems-on-chip (MPSoCs) make an attractive platform
for high-performance applications. Networks-on-chip (NoCs) can improve the on-
chip communication bandwidth of MPSoCs. However, traditional metallic intercon-
nects consume a significant amount of power to deliver even higher communication
bandwidth required in the near future. Optical NoCs are based on CMOS-compatible
optical waveguides and microresonators, and promise significant bandwidth and
power advantages. This work proposes a fat tree-based optical NoC (FONoC)
including its topology, floorplan, protocols, and a low-power and low-cost optical
router, optical turnaround router (OTAR). Different from other optical NoCs,
FONoC does not require building a separate electronic NoC for network control.
It carries both payload data and network control data on the same optical network,
while using circuit switching for the former and packet switching for the latter. The
FONoC protocols are designed to minimize network control data and the related
power consumption. An optimized turnaround routing algorithm is designed to uti-
lize the low-power feature of OTAR, which can passively route packets without
powering on any microresonator in 40% of all cases. Compared with other optical
routers, OTAR has the lowest optical power loss and uses the lowest number of
microresonators. An analytical model is developed to characterize the power
consumption of FONoC. We compare the power consumption of FONoC with a
matched electronic NoC in 45 nm, and show that FONoC can save 87% power
compared with the electronic NoC on a 64-core MPSoC. We simulate the FONoC
for the 64-core MPSoC and show the end-to-end delay and network throughput
under different offered loads and packet sizes.

Keywords Optical network-on-chip • Multiprocessor system-on-chip • Fat tree • Router

J. Xu (✉) • H. Gu • W. Liu
Mobile Computing System Laboratory, Department of Electronic
and Computer Engineering, The Hong Kong University of Science and Technology,
Clear Water Bay, Kowloon, Hong Kong
e-mail: jiang.xu@ust.hk

W. Zhang
Nanyang Technological University, Singapore, Singapore

Introduction

As the number of transistors available on a single chip increases to billions or even
larger numbers, the multiprocessor system-on-chip (MPSoC) is becoming an attrac-
tive choice for high-performance and low-power applications [1]. Traditional on-chip
communication architectures for MPSoC face several issues, such as poor scalability,
limited bandwidth, and high power consumption [2, 3]. Networks-on-chip (NoCs)
relieve MPSoC of these issues by using modern communication and networking theo-
ries. Many NoCs have been studied, and most of them are based on metallic intercon-
nects and electronic routers [4–9]. As new applications continually push back the
limits of MPSoC, the conventional metallic interconnects and electronic routers grad-
ually become the bottlenecks of NoC performance due to the limited bandwidth, long
delay, large area, high power consumption, and crosstalk noise [10].
Optical NoCs use silicon-based optical interconnects and routers, which are com-
patible with CMOS technologies [11]. Studies show that the optical NoC is a promis-
ing candidate to achieve significantly higher bandwidth, lower power, lower interference,
and lower delay compared with electronic NoCs [12]. Optical interconnects have dem-
onstrated their strengths in multicomputer systems, on-board inter-chip interconnect,
and the switching fabrics of Internet routers. Silicon-based optical waveguides can be
used to build on-chip optical interconnects [13]. The progress in photonic technologies,
especially the development of microresonators, makes optical on-chip routers possible
[14]. Microresonators can be fabricated on silicon-on-insulator (SOI) substrates, which
have been used for CMOS-based high-performance low-leakage SoCs. Microresonators,
as small as 3 μm in diameter, have been demonstrated [15].
Several optical NoCs and optical routers propose to use microresonators.
Shacham et al. proposed an optical NoC. The optical NoC uses an augmented torus
network to transmit payload data, while network control data are transmitted through
a separate electronic network. It is built from 4 × 4 optical routers, injection switches,
and ejection switches. The injection and ejection switches are used for local injec-
tion and ejection of packets. Briere et al. proposed a multistage optical router called
λ-router [10]. The λ-router uses a passive switching fabric and wavelength-division
multiplexing (WDM) technology. An N × N λ-router needs N wavelengths and mul-
tiple basic 2 × 2 switching elements to achieve non-blocking switching. Poon et al.
proposed a non-blocking optical router based on an optimized crossbar for 2D mesh
optical NoC [16]. Each port of the router is aligned to its corresponding direction
to reduce the waveguide crossings around the switching fabric. We proposed an

optical router, which significantly reduces the cost and optical power loss of 2D
mesh/torus optical NoCs [17]. Previous optical NoC and router studies concentrate
on 2D topologies, such as mesh and torus.
In this work, we propose a new optical NoC, FONoC (fat tree-based optical NoC),
including its topology, protocols, as well as a low-power and low-cost optical router,
OTAR (optical turnaround router). In contrast to previous optical NoCs, FONoC
does not require the building of a separate electronic NoC. It transmits both payload
data and network control data over the same optical network. FONoC is based on a
fat tree topology, which is a hierarchical multistage network, and has been used by
multi-computer systems [18]. It also attracts the attention of electronic NoC studies
[19–21]. While electronic fat tree-based NoCs use packet switching for both payload
data and network control data, FONoC uses circuit switching for payload data and
packet switching for network control data. The protocols of FONoC minimize the
network control data and the related power consumed by optical-electronic conver-
sions. An optimized turnaround routing algorithm is designed to utilize the mini-
mized network control data and a low-power feature of OTAR, which can passively
route packets without powering on any microresonator in 40% of cases. An analyti-
cal model is developed to assess the power consumption of FONoC. Based on the
analytical model and SPICE simulations, we compare FONoC with a matched elec-
tronic NoC in 45 nm. We simulate the FONoC for the 64-core MPSoC and show its
performance under various offered loads and packet sizes.
The rest of the chapter is organized as follows. Section Optical Turnaround
Router for FONoC describes the optical router proposed for FONoC. Section Fat
Tree-Based Optical NoC details FONoC, including the topology, floorplan, and
protocols. Section Comparison and Analysis evaluates and analyzes the power
consumption, optical power loss, and network performance of FONoC. Conclusions
are drawn in section Conclusions.

Optical Turnaround Router for FONoC

OTAR (optical turnaround router) is the key component of FONoC. It implements
the routing function. OTAR switches packets from an input port to an output port
using a switching fabric, which is composed of basic switching elements. OTAR
uses two types of basic switching elements which are based on microresonators. We
will introduce the working principles of the microresonator and switching elements
before detailing the router.

Microresonator and Switching Elements

The two switching elements used by OTAR are crossing and parallel elements,
which implement the basic 1 × 2 switching function (Fig. 4.1). Both switching

Fig. 4.1 Switching elements

elements consist of a microresonator and two waveguides. The parallel element
does not have any waveguide crossing, and hence no crossing insertion loss. The
resonant wavelength of the microresonator can be controlled by voltage. While
powered off, the microresonator has an off-state resonant wavelength λoff, which is
determined by the materials used and the internal structure of the microresonator.
When the microresonator is powered on, the resonant wavelength changes to the
on-state resonant wavelength λon. If the wavelength of an optical signal is different
from the resonant wavelength, it will be directed to the through port. Otherwise, the
signal will be routed to the drop port. Hence, by powering the microresonator on or
off, the basic switching elements can be controlled to switch a packet to either the
drop port or the through port. The switch time of the microresonator is small, and a
30 ps switching time has been demonstrated [14].
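
As an illustration of this on/off routing rule, the short sketch below models a single 1 × 2 switching element; the function name and the wavelength values are our own placeholders, and only the drop/through behavior is taken from the description above.

# Illustrative model of a 1 x 2 microresonator switching element (the wavelength
# values are placeholders; only the on/off routing rule follows the text).
def route_switching_element(signal_wl, powered_on, wl_on=1550.0, wl_off=1560.0):
    """Return 'drop' if the signal matches the current resonant wavelength,
    otherwise 'through'."""
    resonant_wl = wl_on if powered_on else wl_off
    return "drop" if signal_wl == resonant_wl else "through"

# A packet on the on-state wavelength is dropped only when the element is powered on.
assert route_switching_element(1550.0, powered_on=True) == "drop"
assert route_switching_element(1550.0, powered_on=False) == "through"
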

Traditional Switching Fabrics

The switching fabric of an optical router can be implemented by the traditional fully-
connected crossbar. An n × n optical router requires an n × n crossbar, which is com-
posed of n² microresonators and 2n crossing waveguides. Figure 4.2a shows a 4 × 4
fully-connected crossbar, which has four input ports and four output ports. The fully-
connected crossbar can be optimized based on the routing algorithm used by an opti-
cal router. The turnaround routing algorithm (also called the least common ancestor
routing algorithm) has been favored by many fat tree-based networks [22, 23]. In this
algorithm, a packet is first routed upstream until it reaches the common ancestor node
of the source and destination of the packet; then, the packet is routed downstream to

Fig. 4.2 4 × 4 crossbar-based switching fabrics

the destination. Turnaround routing is a minimal path routing algorithm and is free of
deadlock and livelock. In addition, it is a low-complexity adaptive algorithm which
does not use any global information. These features make the turnaround routing
algorithm particularly suitable for optical NoCs, which require both low latency and
low cost at the same time. Some microresonators can be removed from the fully-
connected crossbar based on the turnaround routing algorithm (Fig. 4.2b). Compared
with the fully-connected crossbar, the optimized crossbar saves six microresonators,
but still has the same number of waveguide crossings, and hence does not improve the
optical power loss or, by extension, the power consumption.

Optical Turnaround Router

We propose a new router, OTAR, for FONoC (Fig. 4.3). OTAR is a 4 × 4 optical
router using the turnaround routing algorithm. It consists of an optical switching
fabric, a control unit, and four control interfaces (CI). The switching fabric uses
only six microresonators and four waveguides. The control unit uses electrical sig-
nals to configure the switching fabric according to the routing requirement of each
packet. The control interfaces inject and eject control packets to and from optical
waveguides.
The OTAR router has four bidirectional ports, called UP right, UP left, DOWN
right, and DOWN left. OTAR has a low-power feature, and can passively route
packets which travel on the same side. Packets, travelling between UP left and
DOWN left as well as between UP right and DOWN right, do not require any
microresonator to be powered on. There are a total of ten possible input–output
port combinations. The passive cases account for four out of the ten possible com-
binations, so that if traffic arrives at each port with equal probability, 40% of traffic
will be routed passively without activating any microresonator. The four ports are
aligned to their intended directions, and the input and output of each port is also
properly aligned. The microresonators in OTAR are identical, and have the same

Fig. 4.3 Optical turnaround router

on-state and off-state resonant wavelengths, λon and λoff. OTAR uses the wavelength
λon to transmit the payload packets which carry payload data, and λoff to transmit
control packets which carry network control data.
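
The 40% figure quoted above can be checked by enumerating the allowed input–output combinations. The sketch below is our own back-of-the-envelope check; it simply encodes the two rules stated earlier (no U-turns, and, following the turnaround routing argument above, no transfers between the two UP ports).

# Enumerate the allowed input-output port pairs of OTAR and count the passive ones.
ports = ["UP_left", "UP_right", "DOWN_left", "DOWN_right"]
passive_pairs = {("UP_left", "DOWN_left"), ("DOWN_left", "UP_left"),
                 ("UP_right", "DOWN_right"), ("DOWN_right", "UP_right")}

allowed = []
for src in ports:
    for dst in ports:
        if src == dst:
            continue                                   # U-turns are not implemented
        if {src, dst} == {"UP_left", "UP_right"}:
            continue                                   # transfers between the two UP ports are unused
        allowed.append((src, dst))

print(len(allowed))                                              # 10 combinations
print(sum(pair in passive_pairs for pair in allowed) / len(allowed))  # 0.4, i.e. 40% passive
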
The switching fabric implements a 4 × 4 switching function for the four bidirec-
tional ports. It is designed to minimize waveguide crossings. The U-turn function is
not implemented because the routing algorithm does not use it. Two unnecessary
turns are also eliminated since payload packets will not make turns when they flow
down the fat tree in turnaround routing. The OTAR router is strictly non-blocking
while using the turnaround routing algorithm. This can be proved by exhaustively
examining all the possible cases. The non-blocking property can help to increase the
network throughput.
The control unit processes the control packets and configures the optical switch-
ing fabric. Control packets are used to setup and maintain optical paths for payload
packets, and are processed in the electronic domain. The control unit is built from
CMOS transistors and uses electrical signals to power each microresonator on and
off according to the routing requirement of each packet. It uses an optimized routing
algorithm, which we will describe in the next section. Each port of the OTAR has a
control interface. The control interface includes two parallel switching elements (to
minimize the optical loss), an optical-electronic (OE) converter to convert optical
control packets into electronic signals, and an electronic-optical (EO) converter to
carry out the reverse conversion. The microresonators in the control interface are
always in the off-state and identical to those in the optical switching fabric. Their
off-state resonant wavelength λoff is used to transmit control packets.

Fig. 4.4 FONoC topology for a 64-core MPSoC

Fat Tree-Based Optical NoC

We propose a new optical NoC, FONoC (fat tree-based optical NoC), for MPSoCs
including its topology, floorplan, and protocols. In contrast to other optical NoCs,
FONoC transmits both payload packets and control packets over the same optical
network. This obviates the need for building a separate electronic NoC for control
packets. The hierarchical network topology of FONoC makes it possible to connect
the FONoCs of multiple MPSoCs and other chips, such as off-chip memories, into an
inter-chip optical network and thus form a more powerful multiprocessor system.

Topology and Floorplan

FONoC is based on a fat tree topology to connect OTARs and processor cores
(Fig. 4.4). It is a non-blocking network, and provides path diversity to improve per-
formance. Processors are connected to OTARs by optical-electronic and electronic-
optical interfaces (OEEO), which convert signals between optical and electronic
domains. The notation FONoC(m,k) describes a FONoC connecting k processors
using an m-level fat tree. There are k processors at level 0 and k/2 OTARs at other
levels. Based on the fat tree topology, to connect k processors, the number of net-
work levels required is m = log₂ k + 1, and all the processors are in the first network
level. While connecting with other MPSoCs and off-chip memories, OTARs at the
topmost level route the packets from FONoC to an inter-chip optical network. In
this case, the number of OTARs required is (k/2) log₂ k. If an inter-chip optical net-
work is not used, OTARs at the topmost level can be omitted. In this case, only

Fig. 4.5 FONoC floorplan for the 64-core MPSoC

(k/2)(log₂ k - 1) OTARs are required. In Fig. 4.4, each optical interconnect is bidirec-
tional, and includes two optical waveguides.
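
As a quick check of these expressions, the sketch below (ours, with an invented function name) computes the level and router counts for a given number of processors, with and without the topmost level:

import math

# Number of levels and OTARs in FONoC(m, k), following the expressions above.
def fonoc_size(k, with_top_level=True):
    """k is assumed to be a power of two; returns (levels m, number of OTARs)."""
    m = int(math.log2(k)) + 1                          # m = log2(k) + 1 (processors at level 0)
    if with_top_level:
        routers = (k // 2) * int(math.log2(k))         # (k/2) * log2(k)
    else:
        routers = (k // 2) * (int(math.log2(k)) - 1)   # (k/2) * (log2(k) - 1)
    return m, routers

print(fonoc_size(64, with_top_level=True))    # (7, 192) for the 64-core MPSoC
print(fonoc_size(64, with_top_level=False))   # (7, 160) when the topmost level is omitted
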
The corresponding floorplan of FONoC for a 64-core MPSoC is shown in
Fig. 4.5. Starting from level 2, multiple OTARs are grouped into a router cluster for
floorplanning purposes. The router clusters are connected by optical interconnects.
FONoC can be built on the same device layer as the processors, but to reduce chip
area, 3D chip technology can also be used to fabricate FONoC on a separate device layer and
stack it above a device layer for processor cores [24].

FONoC Protocols

FONoC uses both connection-oriented circuit switching and packet switching to
transfer payload packets and control packets, respectively. In the absence of effec-
tive optical buffers, optical NoCs using packet switching convert signals from the
optical domain to the electronic domain for buffering, and then convert them back
to the optical domain for transmission. These domain conversions consume a lot of
power. However, FONoC uses packet switching only for control packets, because
network control data are critical for network performance and are usually processed
and shared by the routers along its path.

Before payload packets can be transmitted in FONoC, an optical path is first
reserved from a source processor to a destination processor. The path consists of a
series of OTARs and interconnects, and is managed by three control packets:
SETUP, ACK, and RELEASE. A SETUP packet is issued by the source and requests
OTARs to reserve a path. OTAR finds and reserves a path based on an optimized
turnaround routing algorithm, which will be described shortly. It has l_setup bits and
only contains the destination address. For a FONoC with k processors, l_setup = log₂ k.
When the SETUP reaches the destination, an ACK packet is sent back to the source
and requests OTAR to power on the resonators along the path. Upon receiving the
ACK packet, the source sends the payload packets. Following the last payload
packet, the source sends a RELEASE packet to free the reserved path. There is no
buffer required for payload packets. Once the connection is established, the latency
and bandwidth are guaranteed.
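
The handshake can be pictured as the ordered sequence of packets seen on one reserved path. The sketch below is a schematic rendering of that sequence; the function name and packet labels are ours.

# Schematic sequence of control and payload packets for one FONoC transfer.
def transfer_sequence(num_payload_packets):
    """Return the ordered list of packets exchanged for one circuit-switched transfer."""
    seq = ["SETUP"]                                  # source -> destination: reserve a path
    seq.append("ACK")                                # destination -> source: power on the path
    seq += ["PAYLOAD"] * num_payload_packets         # optical payload packets, no buffering
    seq.append("RELEASE")                            # source -> destination: free the path
    return seq

print(transfer_sequence(3))
# ['SETUP', 'ACK', 'PAYLOAD', 'PAYLOAD', 'PAYLOAD', 'RELEASE']
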
We optimize the traditional turnaround routing algorithm for FONoC, and call
it EETAR (energy-efficient turnaround routing). EETAR utilizes the special fea-
tures of OTAR. It is an adaptive and distributed routing algorithm. In EETAR, a
packet first climbs the tree. Each router chooses an available port to move the
packet upward until it arrives at a router which is the common ancestor of the
source and destination. Then, the packet will move downward along a determinis-
tic path. EETAR takes account of the power consumption of microresonators. It
chooses to passively route packets whenever possible. For example, EETAR tries
to route packets coming from the DOWN left port of OTAR to the UP left port, and
avoid powering on any microresonator. This not only reduces power consumption
but also avoids the high insertion loss of microresonators. Moreover, EETAR
makes routing decisions without using source addresses. This reduces the length of
SETUP packets to half, and hence reduces the power consumption at the control
interfaces of OTAR. In the best case, EETAR can save half of the power consumed
by a packet as compared with traditional turnaround routing. The pseudo-code of
EETAR is as follows.
Let us define a node in FONoC(m,k) as either a processor or a router. Node (x, y) is
the x-th node at the y-th level (Fig. 4.4). Apart from the nodes at the 0-th level, each
node connects two parent nodes and two child nodes through the UP left and UP
right ports, which are labeled as p_up^0 and p_up^1, and the DOWN left and DOWN
right ports, which are labeled as p_down^0 and p_down^1.

/* EETAR Algorithm */
INPUT destination (x_d, 0), current node (x_c, y_c), input port p_in
IF U <= x_d <= U + 2^(y_c) - 1, where U = 2^(y_c) * (x_c DIV 2^(y_c - 1))
    /* make the turn (if any) and move downward */
    p_out = p_down^i, where i = (x_d SHIFTRIGHT (y_c - 1) bits) MOD 2
ELSE /* move upward */
    IF port p_up^(p_in) is available, p_out = p_up^(p_in)
    ELSE p_out = p_up^(1 - p_in)
RETURN output port p_out
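
For readers who prefer executable code, the following is our transcription of the pseudo-code into Python. The node numbering follows Fig. 4.4, ports are encoded here as ('up' or 'down', 0 or 1) with 0 standing for the left port, and the port-availability check is abstracted into a caller-supplied function; these encodings are assumptions of this sketch.

# Executable rendering of EETAR (our transcription of the pseudo-code above).
def eetar(x_d, x_c, y_c, p_in, up_port_free=lambda side: True):
    """Return the output port for a packet destined to processor (x_d, 0),
    at router (x_c, y_c), arriving on input port p_in = (direction, side)."""
    low = (2 ** y_c) * (x_c // (2 ** (y_c - 1)))     # first processor reachable downward
    if low <= x_d <= low + 2 ** y_c - 1:
        # destination lies below this router: turn (if needed) and move downward
        side = (x_d >> (y_c - 1)) % 2
        return ("down", side)
    # otherwise move upward, preferring the UP port on the same side as the input
    in_side = p_in[1]
    if up_port_free(in_side):
        return ("up", in_side)
    return ("up", 1 - in_side)

# Example: router (0, 1) connects processors 0 and 1; a packet from processor 0
# destined to processor 1 turns around and leaves through the DOWN right port.
print(eetar(x_d=1, x_c=0, y_c=1, p_in=("down", 0)))   # ('down', 1)
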

Comparison and Analysis

We analyze the power consumption, optical power loss, and network performance
for FONoC. The power consumption of FONoC is compared to a matched elec-
tronic NoC. The optical power loss of OTAR is compared to three other optical
routers under different conditions. We simulate and compare the network perfor-
mance of the FONoC for the 64-core MPSoC under various offered loads and
packet sizes.

Power Consumption

Power consumption is a critical aspect of NoC design. For high-performance com-
puting, low power consumption can reduce the cost related to packaging and cool-
ing solutions, and system integration. FONoC consumes power in several ways.
OEEO interfaces consume power to generate, modulate, and detect optical signals,
optical routers consume power to route packets, while control units need power to
make decisions for control packets. We develop an analytical model to characterize
the power consumption of FONoC.
E^o_PK is defined as the energy consumed to transmit a payload packet. It has two
portions as shown in Eq. (4.1), where E^o_payload is the energy consumed by a payload
packet directly, and E_ctrl is the control overhead.

    E^o_PK = E^o_payload + E_ctrl                                            (4.1)

E^o_payload can be calculated by Eq. (4.2), where m is the number of microresonators
in the on-state while transferring the payload packet, P^o_mr is the average power con-
sumed by a microresonator when it is in the on-state, L^o_payload is the payload packet
size, R is the data rate of the OEEO interfaces, d is the distance traveled by the payload
packet, c is the speed of light in a vacuum, n is the refractive index of the silicon optical
waveguide, and E^o_oeeo is the energy consumed for 1-bit OE and EO conversions.

    E^o_payload = m P^o_mr (L^o_payload / R + d n / c) + E^o_oeeo L^o_payload    (4.2)

E_ctrl can be calculated by Eq. (4.3). Additional variables are defined as follows:
L^o_ctrl is the total size of the control packets used, h is the number of hops to transfer
the payload packet, and E_cue is the average energy required by the control unit to make
decisions for the payload packet.

    E_ctrl = E^o_oeeo L^o_ctrl h + E_cue (h + 1)                              (4.3)
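
A direct implementation of Eqs. (4.1)-(4.3) is given below. The function and argument names are ours; the example values reuse parameters quoted later in this section, while the hop count, path length, number of on-state microresonators, control-packet size and waveguide index are assumptions made only for the sake of the example.

# Energy consumed per payload packet, following Eqs. (4.1)-(4.3).
def packet_energy(m, P_mr, L_payload, R, d, n, c, E_oeeo, L_ctrl, h, E_cue):
    """Energies in joules, power in watts, sizes in bits, R in bit/s, d in metres."""
    E_payload = m * P_mr * (L_payload / R + d * n / c) + E_oeeo * L_payload  # Eq. (4.2)
    E_ctrl = E_oeeo * L_ctrl * h + E_cue * (h + 1)                           # Eq. (4.3)
    return E_payload + E_ctrl                                                # Eq. (4.1)

# Rough illustration: 512-bit payload, 12.5 Gb/s interfaces, 1 pJ/bit OE/EO conversion,
# 20 uW per on-state microresonator and 1.5 pJ per control-unit decision (values quoted
# below); m, d, n, L_ctrl and h are assumed here and do not reproduce the chapter's result.
E = packet_energy(m=6, P_mr=20e-6, L_payload=512, R=12.5e9, d=0.02, n=3.5,
                  c=3e8, E_oeeo=1e-12, L_ctrl=18, h=6, E_cue=1.5e-12)
print(round(E * 1e9, 2), "nJ per payload packet")
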

Fig. 4.6 Power consumption of FONoC and the electronic NoC

The power consumption of a matched electronic fat tree-based NoC is analyzed
in a similar way. The electronic NoC has the same topology as FONoC and uses
the turnaround routing algorithm. We designed and simulated a 4 × 4 input-buffered
pipelined electronic router for the electronic NoC based on the 45 nm Nangate
open cell library and Predictive Technology Model (www.si2.org). Each port of the
electronic router is 32-bits wide, and the switching fabric of the electronic router is
a crossbar. We assume the size of each processor core to be 1 mm by 1 mm. The
metal wires in the electronic NoC are modeled as fine-grained lumped RLC net-
works, and the coupling capacitances between adjacent wires (values extracted
from layout) are taken into account. Since mutual inductance has a significant
effect in deep submicron process technologies, it is considered up to the third
neighboring wires. The electronic router and metal wires are simulated in Cadence
Spectre. Simulation results show that on average the crossbar inside the electronic
router consumes 0.06 pJ/bit, the input buffer consumes 0.003 pJ/bit, and the con-
trol unit consumes 1.5 pJ to make decisions for each packet. We assume the data
rates at the interfaces of FONoC and the electronic NoC are both 12.5 Gbps, which
has been demonstrated [25]. The average size of payload data is 512 bits. While
interfacing with 45 nm CMOS circuits, the energy consumed by OE and EO con-
versions is estimated to be 1 pJ/bit, which is linearly scaled down from the experi-
mental measurement of an 80 nm design [26]. OTAR uses the same control unit as
the electronic router. In the on-state, a microresonator needs a DC current and
consumes less than 20 μW [16].
We compare the power consumed by FONoC and the electronic NoC (ENoC)
while varying the number of connected processors and using different packet sizes
(Fig. 4.6). The results show that FONoC consumes significantly less power than the
electronic NoC. For example, for a 64-core MPSoC and 64-byte packets, FONoC
consumes only 0.71 nJ/packet, while the electronic NoC consumes 5.5 nJ/packet,

Fig. 4.7 Comparison of optical power loss

which represents an 87% power saving. The results show that the power saving
could increase to 93% while using 128-byte packets in a 1024-core MPSoC.

Optical Power Loss

We analyze and compare the optical power loss of OTAR with three other optical
routers: the fully-connected crossbar, the optimized crossbar, and a previously proposed
4 × 4 optical router, which is referred to as COR for clarity. In our compari-
son, we considered two major sources of optical power losses, the waveguide cross-
ing insertion loss and microresonator insertion loss. The waveguide crossing
insertion loss is 0.12 dB per crossing [16], and the microresonator insertion loss is
0.5 dB [27]. In an optical router, packets transferring between different input and
output ports may encounter different losses. We analyze the maximum loss, mini-
mum loss, and average loss of all possible cases (Fig. 4.7). The results show that
OTAR is the best in all comparisons. OTAR has 4% less minimum loss, 23% less
average loss, and 19% less maximum loss than the optimized crossbar. COR has the
same maximum loss as OTAR, but has higher average and minimum losses.
The number of microresonators used by an optical router is an indicator for the
area cost. While the optimized crossbar uses fewer microresonators than the fully-
connected crossbar, they have the same losses. OTAR uses 6 microresonators; the
fully-connected crossbar uses 16; the optimized crossbar uses 10; and COR uses 8.
OTAR uses the lowest number of microresonators, at 40% fewer than the optimized
crossbar.
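
The loss figures above follow from counting waveguide crossings and on-resonance microresonator traversals along a path. The helper below is ours; it only applies the two per-element values quoted in the text, and the element counts in the example are illustrative rather than taken from the OTAR layout.

# Insertion-loss estimate from element counts (0.12 dB per waveguide crossing and
# 0.5 dB per microresonator traversed, as quoted in the text).
def path_loss_db(num_crossings, num_resonator_drops,
                 crossing_loss_db=0.12, resonator_loss_db=0.5):
    return num_crossings * crossing_loss_db + num_resonator_drops * resonator_loss_db

print(path_loss_db(4, 1))   # e.g. 4 crossings and 1 resonator drop -> 0.98 dB
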

Network Performance

We simulate the FONoC for the 64-core MPSoC and study the network perfor-
mance in terms of end-to-end (ETE) delay and network throughput. The ETE delay

is the average time between the generation of a packet and its arrival at its destination.
It is the sum of the connection-oriented path-setup time and the time used to trans-
mit optical packets. We simulated a range of packet sizes used by typical MPSoC
applications. We assumed a moderate bandwidth of 12.5 Gbps for each intercon-
nect. In the simulations, we assume that processors generate packets independently
and the packet generation time intervals follow a negative exponential distribution.
We used the uniform traffic pattern, i.e. each processor sends packets to all other
processors with the same probability. FONoC is simulated in a network simulator,
OPNET (www.opnet.com).
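
For reference, the traffic model used in these simulations (independent sources, exponentially distributed inter-arrival times, uniformly random destinations) can be sketched as follows; this is our own illustration, not the OPNET model itself, and the parameter values are arbitrary.

import random

# Packet injection times and destinations for one source under uniform traffic
# with negative-exponentially distributed inter-arrival gaps.
def generate_traffic(src, num_processors, mean_gap_ns, num_packets, seed=0):
    rng = random.Random(seed)
    t = 0.0
    packets = []
    for _ in range(num_packets):
        t += rng.expovariate(1.0 / mean_gap_ns)        # exponential inter-arrival gap
        dst = rng.choice([p for p in range(num_processors) if p != src])  # uniform destination
        packets.append((t, src, dst))
    return packets

print(generate_traffic(src=0, num_processors=64, mean_gap_ns=200.0, num_packets=3))
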
The ETE delay under different offered loads and packet sizes is shown in Fig. 4.8.
It shows that FONoC saturates at different loads with different packet sizes. The
ETE delay is very low before the saturation load, and increases dramatically after it.
For 32-byte packets, ETE delay is 0.06 μs before the saturation load 0.2, and goes
up to 110 μs after it. Packets larger than 32 bytes have a higher saturation load. This is
due to the lower number of control packets when using larger packets under the
same offered load. In addition, larger packets also have longer transmission times
and cause longer inter-packet arrival gaps compared with smaller packets under the
same offered load. These both help to reduce network contention during path setup,
and lead to higher saturation loads. Figure 4.8 also shows the network throughput
under various offered load and packet sizes. Ideally, throughput should increase
with the offered load. However, when the network becomes saturated, it will not be
able to accept a higher offered load beyond its capacity. The results show that the
throughput remains at a certain level after the saturation point.

Conclusions

This work proposes FONoC including its protocols, topology, floorplan, and a low-
power and low-cost optical router, OTAR. FONoC carries payload data as well as
network control data on the same optical network, while using circuit switching for
the former and packet switching for the latter. We analyze the power consumption,
optical power loss, and network performance of FONoC. An analytical model is
developed to assess the power consumption of FONoC. Based on the analytical
model and SPICE simulations, we compare FONoC with a matched electronic
NoC in 45 nm. The results show that FONoC can save 87% power to achieve the
same performance for a 64-core MPSoC. OTAR can passively route packets with-
out powering on any microresonator in 40% of all cases. Compared with three
other optical routers, OTAR has the lowest optical power loss and uses the lowest
number of microresonators. We simulate the FONoC for a 64-core MPSoC and
show the end-to-end delay and network throughput under various offered loads and
packet sizes.

Acknowledgments This work is partially supported by HKUST PDF and RGC of the Hong
Kong Special Administrative Region, China.

Fig. 4.8 End-to-end delay and network throughput of FONoC

References

1. Benini L, De Micheli G (2002) Networks on chip: a new paradigm for systems on chip design.
In: Design, automation and test in Europe conference and exhibition, Paris, France
2. Sgroi M, Sheets M, Mihal A, Keutzer K, Malik S, Rabaey J, Sangiovanni-Vincentelli A (2001)
Addressing the system-on-a-chip interconnect woes through communication-based design. In:
Design automation conference, Las Vegas, NV, USA

3. Reyes V, Bautista T, Marrero G, Núñez A, Kruijtzer W (2005) A multicast inter-task commu-
nication protocol for embedded multiprocessor systems. In: Conference on hardware-software
codesign and system synthesis, New York, USA, pp 267–272
4. Dally W, Towles B (2001) Route packets, not wires: on-chip interconnection networks. In:
Design automation conference, Las Vegas, NV, USA
5. Kumar S, Jantsch A, Soininen JP, Forsell M, Millberg M, Öberg J, Tiensyrjä K, Hemani A
(2002) A network on chip architecture and design methodology. In: IEEE Computer Society
annual symposium on VLSI, Pittsburgh, PA, USA
6. Goossens K, Dielissen J, Radulescu A (2005) Æthereal network on chip: concepts, architec-
tures and implementations. IEEE Design Test Comput 22(5):414–421
7. Kumar A, Peh LS, Kundu P, Jha NK (2008) Toward ideal on-chip communication using
express virtual channels. IEEE Micro 28(1):80–90
8. Amde M, Felicijan T, Efthymiou A, Edwards D, Lavagno L (2005) Asynchronous on-chip
networks. In: IEE proceedings: computers and digital techniques, pp 273–283
9. Xu J, Wolf W, Henkel J, Chakradhar S (2006) A design methodology for application-specific
networks-on-chip. In: ACM transactions on embedded computing systems
10. Briere M, Girodias B et al (2007) System level assessment of an optical NoC in an MPSoC
platform. In: Design, automation & test in Europe conference & exhibition, Nice, France
11. Chen G, Chen H, Haurylau M, Nelson NA, Albonesi DH, Fauchet PM, Friedman EG (2007)
Predictions of CMOS compatible on-chip optical interconnect. Integr VLSI J 40(4):434–446
12. Shacham A, Bergman K, Carloni LP (2007) The case for low-power photonic networks on
chip. In: Design automation conference, pp 132–135
13. Xia F, Sekaric L, Vlasov Y (2007) Ultracompact optical buffers on a silicon chip. Nat Photon
1:65–71
14. Xu Q, Schmidt B, Pradhan S, Lipson M (2005) Micrometre-scale silicon electro-optic modula-
tor. Nature 435(7040):325–327
15. Little BE, Foresi JS, Steinmeyer G et al (1998) Ultra-compact Si-SiO2 microring resonator
optical channel dropping filters. IEEE Photon Technol Lett 10(4):549–551
16. Poon AW, Xu F, Luo X (2008) Cascaded active silicon microresonator array cross-connect
circuits for WDM networks-on-chip. In: Proceedings of SPIE, vol 6898
17. Gu H, Xu J, Wang Z (2008) ODOR: a microresonator-based high-performance low-cost router
for optical networks-on-chip. In: Proceedings of international conference on hardware-soft-
ware codesign and system synthesis, Atlanta, Georgia, USA
18. Leiserson CE, Abuhamdeh ZS, Douglas DC, Feynman CR, Ganmukhi MN et al (1992) The
network architecture of the connection machine CM-5. In: Proceedings of the fourth annual
ACM symposium on parallel algorithms and architectures, San Diego, CA, USA, pp 272–285
19. Hossain H, Akbar M, Islam M (2005) Extended-butterfly fat tree interconnection (EFTI) archi-
tecture for network on chip. In: IEEE Pacific Rim conference on communications, computers
and signal processing, Victoria, BC, Canada, pp 613–616
20. Jeang YL, Huang WH, Fang WF (2004) A binary tree architecture for application specific
network on chip (ASNOC) design. In: IEEE Asia-Pacific conference on circuits and systems,
Tainan, Taiwan, pp 877–880
21. Adriahantenaina A, Charlery H, Greiner A, Mortiez L, Zeferino CA (2003) SPIN: a scalable,
packet switched, on-chip micro-network. In: Design, automation and test in Europe confer-
ence and exhibition (DATE), Munich, Germany, pp 70–73
22. Pande PP, Grecu C, Jones M, Ivanov A, Saleh R (2005) Performance evaluation and design
trade-offs for network-on-chip interconnect architectures. IEEE Trans Comput 54:1025–1040
23. Strumpen V, Krishnamurthy A (2005) A collision model for randomized routing in fat-tree
networks. J Parallel Distrib Comput 65:1007–1021
24. Kim J, Nicopoulos C, Park D, Das R, Xie Y, Vijaykrishnan N, Das C (2007) A novel dimen-
sionally-decomposed router for on-chip communication in 3D architectures. In: Proceedings
of the annual international symposium on computer architecture (ISCA), San Diego, CA,
USA, pp 138–149

25. Xu Q, Manipatruni S, Schmidt B, Shakya J, Lipson M (2007) 12.5 Gbit/s carrier injection-
based silicon microring silicon modulators. Opt Express 15(2):430–436
26. Kromer C, Sialm G, Berger C, Morf T, Schmatz ML, Ellinger F et al (2005) A 100-mW
4 × 10 Gb/s transceiver in 80-nm CMOS for high-density optical interconnects. IEEE J Solid-
State Circuits 40(12):2667–2679
27. Xiao S, Khan MH, Shen H, Qi M (2007) Multiple-channel silicon micro-resonator based filters
for WDM applications. Opt Express 15:7489–7498
Chapter 5
On-Chip Optical Ring Bus Communication
Architecture for Heterogeneous MPSoC

Sudeep Pasricha and Nikil D. Dutt

Abstract With increasing application complexity and improvements in process
technology, multi-processor systems-on-chip (MPSoC) with tens to hundreds of
cores on a chip are being realized today. While computational cores have become
faster with each successive technology generation, communication between them
has not scaled well, and has become a bottleneck that limits overall chip perfor-
mance. On-chip optical interconnects are a promising development to overcome
this bottleneck by replacing electrical wires with optical waveguides. In this chapter
we describe an optical ring bus (ORB) based hybrid opto-electric on-chip commu-
nication architecture for the next generation of heterogeneous MPSoCs. ORB uses
an optical ring waveguide to replace global pipelined electrical interconnects while
preserving the interface with today's bus protocol standards such as AMBA AXI3.
The proposed ORB architecture supports serialization of uplinks/downlinks to opti-
mize communication power dissipation. We present experiments to show how ORB
has the potential to reduce transfer latency (up to 4.7×) and lower power consump-
tion (up to 12×) compared to traditionally used pipelined, all-electrical, bus-based
communication architectures, for the 22 nm technology node.

Keywords Optical interconnects • Multi-processor systems-on-chip • On-chip communication architecture • AMBA • Low power design

S. Pasricha (✉)
Electrical and Computer Engineering Department, Colorado State University,
1373 Campus Delivery, Fort Collins, CO 80523-1373, USA
e-mail: sudeep@colostate.edu
N.D. Dutt
University of California, Irvine, Irvine, CA 92617, USA


Introduction

Driven by increasing application complexity and improvements in fabrication
technology into the ultra deep submicron (UDSM) domain, multiprocessor systems
on chip (MPSoCs) with tens to hundreds of processing cores on a chip are becoming
increasingly prevalent today [1–3]. In order to satisfy increasingly stringent com-
munication bandwidth and latency constraints, an efficient on-chip communication
fabric is a critical component of these MPSoC designs. Unfortunately, deep submi-
cron effects such as capacitive and inductive crosstalk coupling noise [4] are becom-
ing highly dominant in new technologies, leading to an increase in propagation
delay of signals on traditional copper-based electrical interconnects. Lower supply
voltages in successive UDSM technology nodes render signals more vulnerable to
this noise, and also to voltage droop. One way of reducing the influence of delay and
noise constraints is to increase wire spacing or use wire shielding techniques, both
of which cause interconnect resources to be used less efficiently and consequently
result in routing congestion or even non-routability. Additionally, in synchronous
digital design where a signal must propagate from source to destination within a
single clock cycle to ensure predictable operation, global interconnects that span the
chip (and can be several mm in length) have to be clocked at very low frequencies.
Such low clock frequencies on global interconnects, coupled with increasing propa-
gation delay, put serious limits on the achievable bandwidth and overall system
performance. According to the International Technology Roadmap for Semiconductors (ITRS),
global interconnect delay has already become a major source of performance bottle-
necks and is one of the semiconductor industry's topmost challenges [5].
To reduce global interconnect delay, designers today make use of repeater insertion
on interconnects [6] to transform the quadratic dependency of propagation delay on
interconnect length into a linear one. Another technique that is frequently used in addi-
tion to repeater insertion is to pipeline global interconnects by inserting flip-flops,
latches or register slices [7, 8]. Pipelining allows signals to travel shorter distances (i.e.,
the segment length from one stage to the next) in a single clock cycle. This enables the
global interconnect to be clocked at higher clock frequencies and potentially support
larger bandwidths. Figure 5.1 shows an MPSoC design with four computation clusters
that communicate with other clusters using pipelined global interconnects. Each cluster
has bus-based local interconnects that can handle high data bandwidths of the cores in
the cluster locally. The global interconnects in such systems can be shared or be point-
to-point, and are operated at higher frequencies. Even though the number of global
interconnects in a design is typically much less than the number of smaller local inter-
connects, they are the primary source of bottlenecks [4] and hence critically affect
overall performance. Pipelined global electrical interconnects, such as the ones seen in
Fig. 5.1 have two serious drawbacks. Firstly, a large number of pipeline stages are
inevitably required for MPSoCs with high bandwidth requirements, resulting in high
data transfer latencies. Secondly, the large number of latches (and repeaters) required
to support multi-GHz frequencies for high performance MPSoCs leads to very high
leakage and dynamic power dissipation. These drawbacks are due to the fundamental
limitations of using copper as a global interconnect.
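
To make the latency cost of pipelining concrete, the toy calculation below (ours, with an assumed per-millimetre delay for a repeated global wire) shows how the number of register stages, and hence the transfer latency in cycles, grows with the target clock frequency:

import math

# Toy estimate of pipeline depth: each wire segment must fit in one clock period,
# so the stage count (and latency in cycles) grows with frequency and wire length.
def pipelined_latency_cycles(length_mm, freq_ghz, delay_per_mm_ps=80.0):
    period_ps = 1000.0 / freq_ghz
    return math.ceil(length_mm * delay_per_mm_ps / period_ps)

print(pipelined_latency_cycles(length_mm=10, freq_ghz=1.0))   # 1 cycle at 1 GHz
print(pipelined_latency_cycles(length_mm=10, freq_ghz=4.0))   # 4 cycles at 4 GHz
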

Fig. 5.1 Traditional multi-cycle (pipelined) on-chip global communication in MPSoCs

Recently, it has been shown that it may be beneficial to replace global on-chip
electrical interconnects with optical interconnects [9]. Optical interconnects can
theoretically offer ultra-high communications bandwidths in the terabits per sec-
ond range, in addition to lower access latency and susceptibility to electromag-
netic interference [10]. Optical signaling also has low power consumption, as the
power consumption of optically transmitted signals at the chip level is indepen-
dent of the distance covered by the light signal [11]. While optical interconnects
at the chip-to-chip level are already being actively developed [12], on-chip optical
interconnects have only lately begun to receive attention. This is due to the rela-
tively recent advances in the field of nanoscale silicon (Si) optics that have led to
the development of CMOS compatible silicon-based optical components such as
light sources [13], waveguides [14], modulators [15, 16], and detectors [17, 18].
As a result, while on-chip optical interconnects were virtually inconceivable with
previous generations of photonic technologies a few years ago, these recent
advances have enabled the possibility of creating highly integrated CMOS com-
patible optical interconnection fabrics that can send and receive optical signals
with superior power efficiencies. In order to practically implement an on-chip
optical interconnect based fabric, it is highly likely that future CMOS ICs will
utilize 3D integration [19] as shown conceptually in Fig. 5.2. 3D integration will
allow logic and Si photonics planes to be separately optimized [20, 21]. In the
figure, the bottom plane consists of a CMOS IC with several microprocessor and
memory cores, while the top plane consists of an optical waveguide that transmits
optical signals at the chip level. It is also possible for all memory cores to be
implemented on a dedicated layer, separate from the microprocessor layer. Vertical
through silicon via (TSV) interconnects provide interconnections between cores
in different layers.

Fig. 5.2 3D IC implementation of a hybrid opto-electric communication architecture with processors and memory on the bottom layer and the optical waveguide on the top layer

As optical memories and optical buffered transfers cannot be easily implemented in silicon, an electrical portion of the on-chip communication architecture is still required for interfacing with the optical path for transfers to and from processor and memory cores. The electrical communication fabric is also useful for low-overhead local communication, where optical transfers may be prohibitively expensive.
In this chapter, we describe a novel hybrid opto-electric on-chip communica-
tion architecture that uses an optical ring bus as a global interconnect between
computation clusters in MPSoC designs. Figure 5.3 shows how an optical ring bus
can replace the global, pipelined electrical interconnects in the MPSoC depicted
in Fig. 5.1. Our proposed optical ring bus (ORB) communication architecture
makes use of a laser light source, opto-electric converters, an optical waveguide,
and wave division multiplexing (WDM) to transfer data between clusters on a
chip, while preserving the standard bus protocol interface (e.g., AMBA AXI3 [8])
for inter- and intra-cluster on-chip communication. The ORB architecture sup-
ports serialization of uplinks/downlinks to optimize on-chip communication
power dissipation. Our experimental results indicate that compared to a tradi-
tional pipelined, all-electrical global interconnect architecture, the ORB architec-
ture dissipates significantly lower power (up to a 12× reduction) and also reduces
communication latency (up to a 4.7× reduction) for MPSoC designs. In the very
likely scenario that bus-based on-chip opto-electric interconnects become a real-
ity in the future, this work takes the first step in developing an optical interconnect
based on-chip communication architecture that is compatible with today's stan-
dards, and quantifying its benefits over traditionally used all-electrical, pipelined
and bus-based communication architectures.

Fig. 5.3 Proposed optical ring bus (ORB) on-chip communication architecture for MPSoCs

Related Work

The concept of optical interconnects for on-chip communication was first introduced
by Goodman et al. [22]. Several works in recent years have explored chip-to-chip
photonic interconnects [2329]. With advances in the fabrication and integration of
optical elements on a CMOS chip in recent years, several works have presented a
comparison of the physical and circuit-level properties of non-pipelined on-chip
electrical (copper-based) and optical interconnects [9, 30–35]. In particular, Collet
et al. [30] compared simple optical and electrical point-to-point links using a SPICE-like simu-
lator. Tosik et al. [31] studied more complex interconnects, comparing optical and
electrical clock distribution networks, using physical simulations, synthesis tech-
niques and predictive transistor models. Both works studied power consumption
and bandwidth, and highlighted the benefits of on-chip optical interconnect technol-
ogy. Intel's Technology and Manufacturing Group also performed a preliminary
study evaluating the benefits of optical intra-chip interconnects [32]. They con-
cluded that while optical clock distribution networks are not especially beneficial,
wave division multiplexing (WDM) based on-chip optical interconnects offer inter-
esting advantages for intra-chip communication over copper in UDSM process
technologies. While all of these studies have shown the promise of on-chip optical
interconnects, they have primarily focused on clock networks and non-pipelined
point-to-point links. One of the contributions of this chapter is to contrast on-chip
optical interconnects with all-electrical pipelined global bus-based communication
architectures that are used by designers to support high bandwidth on-chip data
transfers today.

Network-on-chip (NoC) architectures [36, 37] have received much attention of
late as an alternative to bus-based architectures for future MPSoCs. Similar to pipe-
lined interconnects (shared or point-to-point), NoCs split larger interconnects into
smaller segments (links) separated by routers to enable multi-GHz frequencies and
high bandwidths. However, electrical NoCs suffer from the same drawbacks as
pipelined copper interconnects: high latencies and much higher power dissipation
[38] due to the overhead of buffering and routing in the switches and network inter-
faces. Some recent work has proposed hybrids of optical interconnects and torus/
mesh/crossbar NoC fabrics [3944]. These architectures are based on non-blocking
micro-resonator based photonic switches as fundamental building blocks for rout-
ing photonic messages. However, the high power overhead of electrical routers and
opto-electric/electro-optic conversion at the interface of each component, as well as
fabrication challenges associated with wideband photonic switching elements
makes realizing such architectures a difficult proposition in the near future.
In contrast to these opto-electric NoC architectures, we propose a novel low
cost optical ring bus-based hybrid opto-electric communication architecture (ORB)
that does not require the complexity of network interfaces and packet routers. The
optical ring is used primarily to facilitate global on-chip communication between
distant processor and memory cores on the chip. Our proposed architecture over-
comes many of the limitations of other approaches. For example, waveguide cross-
ings in photonic torus and some crossbar architectures can lead to significant losses
that increase dissipated power, which is avoided when utilizing a ring topology.
Another important issue is the latency for setting up the transfers and sending
acknowledgements via the electrical NoC in some of these architectures, which
can dramatically reduce performance and increase energy consumption according
to our studies. ORB utilizes the much faster on-chip optical infrastructure for path
setup and flow control. The proposed architecture is also much simpler than com-
plex crossbar and torus architectures resulting in a reduced optical path complexity
while still providing significant opportunity for improvements over traditional, all-
electrical communication architectures. We also employ more efficient polymer-
based optical waveguides in our architecture instead of SOI based optical
waveguides employed in previous work. Finally, our architecture has the significant
advantage of seamlessly interfacing with existing bus-based protocols and stan-
dards, while providing significant improvements in on-chip power consumption
and performance.

Optical Ring Bus Architecture: Building Blocks

Optical interconnects offer many advantages over traditional electrical (copper-
based) interconnects: (1) they can support enormous intrinsic data bandwidths in
the order of several Gbps using only simple on-off modulation schemes, (2) they are
relatively immune to electrical interference due to crosstalk and parasitic capaci-
tances and inductances, (3) their power dissipation is completely independent of
transmission distance at the chip level, and (4) routing and placement is simplified since it is possible to physically intersect light beams with minimal crosstalk. Once a path is acquired, the transmission latency of the optical data is very short, depending only on the group velocity of light in a silicon waveguide: approximately 6.6 × 10⁷ m/s, or 300 ps for a 2-cm path crossing a chip [45]. After an optical path is established, data can be transmitted end to end without the need for repeating or buffering, which can lead to significant power savings.

Fig. 5.4 ORB optical interconnect components
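As a quick sanity check of the 300 ps figure quoted above, the snippet below recomputes the fly time of a 2-cm on-chip optical path from the stated group velocity.

# Recomputing the optical fly-time quoted above: path length / group velocity.
group_velocity = 6.6e7          # m/s, group velocity of light in a silicon waveguide
path_length = 0.02              # m, a 2-cm path crossing the chip
delay_ps = path_length / group_velocity * 1e12
print(f"{delay_ps:.0f} ps")     # ~300 ps, matching the value given in the text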
Realizing on-chip optical interconnects as part of our proposed ORB communi-
cation architecture requires several CMOS compatible optical devices. Although
there are various candidate devices that exist for these optical elements, we select
specific devices that satisfy on-chip requirements. Figure 5.4 shows a high level
overview of the various components that make up our ORB optical interconnect
architecture. There are four primary optical components: a multi-wavelength laser
(light source), an opto-electric modulator/transmitter, an optical ring waveguide and
an optical receiver. The modulator converts electrical signals into optical light (E/O),
which is propagated through the optical waveguide, and then detected and converted
back into an electrical signal at the receiver (O/E). Integrating such an optical sys-
tem on a chip requires CMOS compatibility which puts constraints on the types of
materials and choices of components to be used. Recent technological advances
indicate that it is possible to effectively fabricate various types of optical compo-
nents on a chip. However, there are still significant challenges in efficiently integrat-
ing a silicon based laser on a chip. Using an off-chip laser can actually be beneficial
because it leads to lower on-chip area and power consumption. Consequently, in our
optical interconnect system we use an off-chip laser from which light is coupled
onto the chip using optical fibers, much like what is done in chip-to-chip optical
interconnects today [12, 46].

The transmission part in Fig. 5.4 consists of a modulator and a driver circuit. The
electro-optic modulator converts an input electrical signal into a modulated optical
wave signal for transmission through the optical waveguide. The modulators are
responsible for altering the refractive index or absorption coefficient of the optical
path when an electrical signal arrives at the input. Two types of electrical structures
have been proposed for opto-electric modulation: p-i-n diodes [47] and MOS capacitors [15]. Micro-ring resonator based p-i-n diode type modulators [16, 47] are compact in size (10–30 μm) and have low power consumption, but possess low modulation speeds (several MHz). Such micro-ring resonators couple light when the relation λ·m = Neff,ring · 2πR is satisfied, where R is the radius of the microring resonator, Neff,ring is the effective refractive index, m is an integer value, and λ is the resonant wavelength [48]. As the resonance wavelength is a function of R and Neff,ring, by changing R and Neff,ring the resonant wavelength of the microring can be altered, thus enabling it to function as an optical modulator (wavelength on-off switch). In general, the resonance wavelength shift Δλc obtained from a change in refractive index ΔNeff is given by Δλc = λ·ΔNeff/Neff,ring. In contrast to microring resonators, MOS capacitor structures such as the Mach–Zehnder interferometer based silicon modulators [15, 46] have higher modulation speeds (several GHz) but a larger power consumption and greater silicon footprint (around 10 mm). While these electro-optical modulators
today are by themselves not very attractive for on-chip implementation, there is a lot
of ongoing research which is attempting to combine the advantages of both these
modulator types [16]. Consequently, we make use of a predictive modulator model
which combines the advantages of both structures. We assume a modulator capaci-
tance that scales linearly with modulator length at the rate of 1.7 pF/mm [33].
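The resonance condition and wavelength-shift relation above can be evaluated directly; in the sketch below the ring radius, effective index, and index change are assumed values chosen only for illustration.

import math

# Resonance condition of a microring: lambda * m = Neff_ring * 2 * pi * R.
# R and Neff_ring below are assumed values, not figures from this chapter.
R = 5e-6                     # ring radius (5 um, assumed)
neff_ring = 2.5              # effective refractive index (assumed)
optical_length = neff_ring * 2 * math.pi * R

m = 50                       # integer mode number chosen so lambda lands near 1.55 um
lam = optical_length / m
print(f"resonant wavelength: {lam * 1e9:.0f} nm")

# Resonance shift for an index change: delta_lambda_c = lambda * delta_Neff / Neff_ring
delta_neff = 1e-3
delta_lambda = lam * delta_neff / neff_ring
print(f"resonance shift for dNeff = 1e-3: {delta_lambda * 1e9:.2f} nm")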
The modulator is driven by a series of tapered inverters (i.e., driver). The first stage
consists of a minimum sized inverter. The total number of stages N is given as

    N = log(Cm / Cg) / log 3.6

where Cm is the modulator capacitance and Cg is the capacitance of a minimum-sized inverter. These drivers receive their input signal from a transmission
bridge (Tx Bridge) belonging to a cluster. The Tx Bridge component is similar to
a bridge in a traditional hierarchical shared bus architecture, and logically treats
the optical ring bus waveguide as any other shared bus. Any communication
request meant for a core in another cluster is sent to the optical ring bus through
the transmission bridge.
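A minimal sketch of the driver-sizing relation above; the modulator and inverter capacitances are assumptions chosen for illustration.

import math

# Tapered-driver sizing: N = log(Cm / Cg) / log 3.6, rounded up to an integer
# number of inverter stages. Both capacitance values below are assumptions.
def driver_stages(c_modulator, c_min_inverter, taper=3.6):
    return math.ceil(math.log(c_modulator / c_min_inverter) / math.log(taper))

cm = 0.1 * 1.7e-12    # modulator capacitance for an assumed 0.1 mm device at 1.7 pF/mm
cg = 1e-15            # input capacitance of a minimum-sized inverter (assumed, 1 fF)
print(driver_stages(cm, cg))   # -> 5 stages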
The optical waveguide is responsible for transporting data via light signals
from the source modulator to the destination receiver. The choice of the optical
material and wavelength of utilized light are the two main factors affecting wave-
guide performance. For on-chip optical interconnects, there are two popular can-
didates for waveguide material: high refractive index silicon on insulator (SOI)
and low refractive index polymer waveguides. SOI waveguides have lower pitch
(i.e., width) and lower area footprint compared to polymer waveguides. This leads
to better bandwidth density (i.e., transmitted bits per unit area). However polymer
waveguides have lower propagation delay than SOI waveguides. The area over-
head for polymer waveguides is mitigated if they are fabricated on a separate,
dedicated layer. Additionally, if wavelength division multiplexing (WDM) is
used, polymer waveguides provide superior performance and bandwidth density
compared to SOI waveguides [49]. Consequently, in our optical ring bus, we make
use of a low refractive index polymer waveguide with an effective index of 1.4.
We chose a ring shaped optical waveguide to avoid sharp turns in the waveguide
which can lead to significant signal loss. The optical ring is implemented on a
dedicated layer and covers a large portion of the chip so that it can effectively
replace global electrical pipelined interconnects.
The receiver part in Fig. 5.4 consists of a photo-detector to convert the light sig-
nal into an electrical signal, and a circuit to amplify the resulting analog electrical
signal to a digital voltage level. In order to support WDM, where transmission
occurs on multiple wavelengths, the receiver includes a wave-selective microring
resonator filter for each wavelength that is received. An important consideration in
the selection of a photo-detector is the trade-off between detection speed and sensi-
tivity (quantum efficiency) of the detector. Interdigitated metal–semiconductor–
metal (MSM) Ge and SiGe photo-detectors have been proposed [17, 50] that have
fast response, excellent quantum efficiency and low power consumption. These
attributes make the MSM detector a suitable candidate as a photo-detector in our
optical interconnect architecture. We assume a detector capacitance of 100 fF based
on a realistic detector realization [34].
A trans-impedance amplifier (TIA) is used to amplify the current from the photo-
detector [33]. The TIA consists of an inverter and a feedback resistor, implemented
as a PMOS transistor. Additional minimum sized inverters are used to amplify the
signal to a digital level. The size of the inverter and feedback transistor in the TIA
is determined by bandwidth and noise constraints. To achieve high-gain and high-
speed detection, a higher analog supply voltage than the digital supply voltage may
be required, which may consume higher power. We assume a TIA supply voltage
that is 20% higher than the nominal supply for our power calculations. The amplified
digital signal is subsequently sent to the receiving bridge (Rx Bridge) component,
which decodes the destination address, and passes the received data to a specific
core in the cluster.
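As a rough, first-order illustration of the receiver chain just described, the snippet below estimates the analog swing a TIA produces from a given photocurrent (both values are assumptions) before the following inverters restore a digital level.

# First-order transimpedance stage: output swing ~ photocurrent * feedback resistance.
i_photo = 20e-6          # A, photodetector current (assumed)
r_feedback = 10e3        # ohm, TIA feedback resistance (assumed)
v_swing = i_photo * r_feedback
print(f"TIA output swing: {v_swing * 1e3:.0f} mV")   # then amplified to a digital level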

ORB On-Chip Communication Architecture

The previous section gave an overview of the various components that are part
of our on-chip optical interconnect architecture. In this section we elaborate on
the operation of our optical ring bus based hybrid opto-electric communication
architecture.

In the ORB hybrid opto-electric communication architecture, the various cores
within each cluster are locally interconnected using high speed and low complexity,
low power, and low area footprint electrical bus-based communication architectures
(such as hierarchical buses or crossbar buses). When a core in a cluster must com-
municate with a core in another cluster, the transfer occurs using the optical ring
waveguide. The inter-cluster communication is first sent to the closest ORB inter-
face in the transmitting cluster, which interfaces with the optical ring waveguide.
The interface consists of transmitting and receiving bridges, which are similar to
standard bridges used in hierarchical bus-based architectures except that they have
separate buffers for each associated wavelength (this is described in more detail in
the next paragraph). The transmitting bridge sends the data transfer to a local modu-
lator which converts it into light and transmits it through the optical waveguide from
where it reaches the receiver interface. At the receiver interface, wave selective
microring receivers drop the corresponding wavelength from the waveguide into
a photo-detector device that converts the light signal into an electrical signal. TIAs
and inverters convert the resulting analog signal into a digital voltage signal and
send it to the receiving bridge from where the data is forwarded to the appropriate
core in the cluster.
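The end-to-end transfer path just described can be summarized as a small behavioral sketch; every name and value below is illustrative and not part of an actual ORB implementation.

# Behavioral sketch of the inter-cluster transfer path described above.
def send(payload, src_cluster, wavelength_of):
    wl = wavelength_of[src_cluster]                   # Tx bridge uses the cluster's wavelength
    return {"wavelength": wl, "payload": payload}     # E/O conversion at the modulator

def receive(ring_signals, drop_wavelength):
    # The wavelength-selective microring at the destination drops only the matching
    # signal; photodetector, TIA and inverters then recover a digital word for the Rx bridge.
    for s in ring_signals:
        if s["wavelength"] == drop_wavelength:
            return s["payload"]
    return None

wavelength_of = {"cluster0": 1550.0, "cluster1": 1551.6}   # assumed WDM channels (nm)
ring = [send(0xCAFE, "cluster0", wavelength_of)]           # light travelling on the ring waveguide
print(hex(receive(ring, wavelength_of["cluster0"])))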
A cluster can have more than one transmitting and receiving interface, depending
on its communication needs. For a global interconnect with an address bus width a,
a data bus width d, and c control bits, there are a + d + c concentric optical ring wave-
guides. These optical waveguides must be spaced 0.5–3 μm apart to avoid significant crosstalk. It was shown in [9] that a single wavelength optical link is inferior to a delay-optimized electrical interconnect in terms of bandwidth density. To improve the bandwidth density of the optical interconnect, we make use of wavelength division multiplexing (WDM) [33, 51]. This involves using multiple wavelength channels to transmit data on the same waveguide. WDM can significantly improve optical interconnect bandwidth density over electrical interconnects. We assume that each of the waveguides has λ available wavelengths for WDM. This creates a λ-way bus, and necessitates a mechanism for determining how the λ wavelengths are distributed among various data streams. The value of λ has significant implications for performance, cost and power, as using a larger number of wavelengths improves bandwidth but requires more processing, area, and power overhead at the transmitters and receivers. Based on predictions in [9] which indicate that the number of wavelengths will increase with every technology node, we limit the maximum number of wavelengths λ in our optical ring waveguides, and consequently the number of allowed transmitter/receiver interface pairs, to 32.
There are two ways to allocate the wavelengths (i.e., multiplex the optical bus):
by address space and by cluster. In the address space based scheme, wavelengths
are allotted to different address spaces, whereas in the cluster based scheme each
cluster has exclusive use of a subset of the λ wavelengths. It was shown in [40] that
even though the cluster based allocation scheme allows only λ cluster interfaces to
the optical bus, it is more beneficial in terms of power consumption compared to
the address space allocation approach. Consequently, we use a cluster based wave-
length allocation approach in ORB.

Fig. 5.5 SWMR reservation channels and MWMR data channels

If simplicity in design is a key concern, each of the N clusters in an MPSoC application can be allocated an equal number of wavelengths: λ/N. However, this does not take into account the specific perfor-
mance requirements of the application. It is very possible that certain clusters have
greater communication bandwidth needs than others. Consequently, the fraction of
total wavelengths λi allocated to a cluster i is calculated as

    λi = λ · BWi / (BW1 + BW2 + ... + BWN)

where BWi is the bandwidth requirement of cluster i, and the number of allocated wavelengths λi is rounded to the nearest integer. The total number of transmitters for cluster i on the optical ring bus is

    Ti_total = λi · (ai + di + ci)

and the total number of receivers for cluster i is

    Ri_total = (λ - λi) · (ai + di + ci)
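A minimal sketch of the bandwidth-proportional wavelength allocation and the transmitter/receiver counts given above; the per-cluster bandwidth figures below are assumptions chosen for illustration.

def allocate_wavelengths(bandwidths, total_wavelengths):
    # lambda_i = lambda * BW_i / sum of all BW_j, rounded to the nearest integer
    total_bw = sum(bandwidths)
    return [round(total_wavelengths * bw / total_bw) for bw in bandwidths]

bw = [4.0, 2.0, 1.0, 1.0]                    # assumed per-cluster bandwidth needs (GB/s)
lam_i = allocate_wavelengths(bw, 32)
print(lam_i)                                 # -> [16, 8, 4, 4]

width = 228                                  # a + d + c waveguides (the value used later in the chapter)
tx = [l * width for l in lam_i]              # Ti_total = lambda_i * (a + d + c)
rx = [(32 - l) * width for l in lam_i]       # Ri_total = (lambda - lambda_i) * (a + d + c)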

The photonic waveguides in ORB are logically partitioned into four channels:
reservation, reservation acknowledge, data (a combination of address, data, and
control signals), and data acknowledge, as shown in Fig. 5.5. In order to reserve an
optical path for a data transfer, ORB utilizes a single write multiple read (SWMR)
configuration on dedicated reservation channel waveguides. The source cluster uses
one of its available wavelengths (λt) to multicast the destination ID via the reserva-
tion channel to other gateway interfaces. This request is detected by all of the other
interfaces, with the destination interface accepting the request, while the other inter-
faces ignore it. As each gateway interface has a dedicated set of wavelengths allo-
cated to it, the destination can determine the source of the request, without the
sender needing to send its ID with the multicast.
If the request can be serviced by the available wavelength and buffer resources at
the destination, a reservation acknowledgement is sent back via the reservation
ACK channel on an available wavelength. The reservation ACK channel also has a
SWMR configuration, but a single waveguide per gateway interface is sufficient to
indicate the success or failure of the request. Once the photonic path has been
reserved in this manner, data transfer proceeds on the data channel, which has a low
cost multiple writer multiple reader (MWMR) configuration. In ORB, the number
of data channel waveguides is equal to the total number of address bus, data bus, and
control lines. The same wavelength (λt) used for the reservation phase is used by the
source to send data on. The destination gateway interface tunes one of its available
microring resonators to receive data from the sender on that wavelength after the
reservation phase. Once data transmission has completed, an acknowledgement is
sent back from the destination to the source interface via a data ACK channel that
also has a SWMR configuration, with a single waveguide per interface to indicate
whether the data transfer completed successfully or failed.
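The four-phase path setup and flow control just described can be summarized as a short behavioral sketch; the function and channel names below are placeholders, not part of any ORB implementation.

# Illustrative walk-through of the four transaction phases described above.
def orb_transaction(dst_has_free_resources, send):
    send("reservation", "multicast destination ID on wavelength lt")   # SWMR reservation channel
    if not dst_has_free_resources:
        send("reservation_ack", "NACK")                                # request refused
        return False
    send("reservation_ack", "ACK")                                     # photonic path reserved
    send("data", "address/data/control bits on wavelength lt")         # MWMR data channel
    send("data_ack", "transfer completed")                             # per-interface SWMR ACK waveguide
    return True

log = []
ok = orb_transaction(True, lambda channel, msg: log.append((channel, msg)))
print(ok, log)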
The advantage of having a fully optical path setup and acknowledgement based
flow control in ORB is that it avoids using the electrical interconnects for path setup,
as is proposed with some other approaches [39, 43], which our analysis shows can
be a major latency and power bottleneck to the point of mitigating the advantage of
having fast and low power photonic paths.
One final important design consideration is to ensure that light does not circulate
around the optical ring for more than one cycle, because that could lead to undesir-
able interference from older data. This is resolved by using attenuators with each
modulator, to act as a sink for the transmitted wavelength(s), once the signal has
completely traversed the optical ring.

Communication Serialization

Serialization of electrical communication links has been widely used in the past to
reduce wiring congestion, lower power consumption (by reducing link switching and
buffer resources), and improve performance (by reducing crosstalk) [5254]. As
reducing power consumption is a critical design goal in future MPSoCs, we propose
using serialization at the transmitting/receiving interfaces, to reduce the number of
optical components (waveguides, transmitters/receivers) and consequently reduce
area and complexity on the photonic layer as well as lower the power consumption.

Fig. 5.6 Serialization scheme for interface (a) serializer, (b) de-serializer

In our architecture, we make use of a shift register based serialization scheme,
similar to [55–57]. A single serial line is used to communicate both data and control
signals between the source and destination nodes. A frame of data transmitted on the
serial line using this scheme consists of n + 2 bits, which includes a start bit (1), n
bits of data, and a stop bit (0). Figure 5.6a shows the block diagram of the transmit-
ter (or serializer) at the source. When a word is to be transferred, the ring oscillator
is enabled and it generates a local clock signal that can oscillate above 2 GHz to
provide high transmission bandwidth. At the first positive edge of this clock, an n + 2
bit data frame is loaded in the shift register. In the next n + 1 cycles, the shift register
shifts out the data frame bit by bit. The stop bit is eventually transferred on the serial
line after n + 2 cycles, and r0 becomes 1. At this time, if the transmission buffer is
empty, the ring oscillator and shift registers are disabled, and the serial line goes into
its idle state. Otherwise, the next data word is loaded into the shift register and data
transmission continues without interruption.
Figure 5.6b shows the block diagram of the receiver (or de-serializer) at the des-
tination. An RS flip-flop is activated when a low-to-high transition is detected on
the input serial line (the low corresponds to the stop bit of the previous frame,
while the high corresponds to the start bit of the current frame). After being acti-
vated, the flip-flop enables the receiver ring oscillator (which has a circuit similar to
the transmitter ring oscillator) and the ring counter. The n-bit data word is read bit
by bit from the serial line into a shift register, in the next n clock cycles. Thus, after
n clock cycles, the n bit data will be available on the parallel output lines, while the
least significant bit output of the ring counter (r0) becomes 1 to indicate data
word availability at the output. With the assertion of r0, the RS flip-flop is also
reset, disabling the ring oscillator. At this point the receiver is ready to start receiv-
ing the next data frame. In case of a slight mismatch between the transmitter and
receiver ring oscillator frequencies, correct operation can be ensured by adding a
small delay in the clock path of the receiver shift register.
The preceding discussion assumed n:1 serialization, where n data bits are trans-
mitted on one serial line (i.e., a serialization degree of n). If wider links are used,
this scheme can be easily extended. For instance, consider the scenario where 4n
data bits need to be transmitted on four serial lines. In such a case, the number of
shift registers in the transmitter must be increased from 1 to 4. However the control
circuitry (flip-flop, ring oscillator, ring counter) can be reused among the multiple
shift registers and remains unchanged. At the destination, every serial line must
have a separate receiver to eliminate jitter and mismatch between parallel lines.
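A behavioral sketch of the n:1 framing described above (start bit, n data bits, stop bit). The bit ordering is an assumption made for illustration, and the ring-oscillator clocking is not modeled here.

# (n + 2)-bit frame: start bit '1', n data bits (assumed LSB first), stop bit '0'.
def serialize(word, n):
    return [1] + [(word >> i) & 1 for i in range(n)] + [0]

def deserialize(frame, n):
    assert frame[0] == 1 and frame[n + 1] == 0       # start and stop bits frame the word
    return sum(bit << i for i, bit in enumerate(frame[1:n + 1]))

frame = serialize(0xA5, 8)
assert deserialize(frame, 8) == 0xA5
print(frame)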

Experiments

In this section we present comparison studies between ORB and traditional all-
electrical on-chip communication architectures. The ORB communication architec-
ture uses an optical ring bus as a global interconnect, whereas the traditional
all-electrical communication architecture uses electrical pipelined global intercon-
nects. Both configurations use electrical buses as local interconnects within clusters.

Experimental Setup

We select several MPSoC applications for the comparison between our hybrid opto-
electric ORB communication architecture and the traditional pipelined electrical
communication architectures. These applications are selected from the well known
SPLASH-2 benchmark suite [58] (Barnes, Ocean, FFT, Radix, Cholesky, Raytrace,
Water-NSq). We also select applications from the networking domains (proprietary
benchmarks Netfilter and Datahub [59]). These applications are parallelized and
implemented on multiple cores. Table 5.1 shows the characteristics of the imple-
mentations of these applications, such as the number of cores (e.g., memories,
peripherals, processors), programmable processors and clusters on the MPSoC chip.
The die size is assumed to be 2 × 2 cm.

Table 5.1 MPSoC applications and their characteristics

MPSoC application  Description  # of cores  # of proc.  # of clusters
Radix  Integer radix sort  18  4  3
Barnes  Evolution of galaxies  26  6  4
FFT  FFT kernel  28  6  4
Ocean  Ocean movements  35  10  5
Cholesky  Cholesky factorization kernel  43  18  6
Netfilter  Packet processing and forwarding  49  22  6
Datahub  68  26  8
Raytrace  3-D ray tracing  84  32  9
Water-NSq  Forces/potentials of H2O molecules  112  64  12

The applications described above are modeled in SystemC [60] and simulated at
the transaction level bus cycle accurate abstraction [61] to quickly and accurately
estimate performance and power consumption of the applications. The various cores
are interconnected using the AMBA AXI3 [8] standard bus protocol. A high level
simulated annealing floorplanner based on sequence pair representation PARQUET
[62] is used to create an early layout of the MPSoC on the die, and Manhattan dis-
tance based wire routing estimates are used to determine wire lengths for accurate
delay and power dissipation estimation.
The global optical ring bus length is calculated using simple geometric calcula-
tions and found to be approximately 43 mm. Based on this estimate, as well as opti-
cal component delay values (see section Performance Estimation Models), we
determine the maximum operating frequencies for ORB as 1.4 GHz (65 nm), 2 GHz
(45 nm), 2.6 GHz (32 nm) and 3.1 GHz (22 nm). To ensure a fair comparison, we
clock the traditional all-electrical global pipelined interconnect architecture at the
same frequencies as the optical ring bus architecture in our experiments. The cores
in the clusters are assumed to operate at twice the interconnect frequencies. We set
the width of the address bus as 32 bits and that of the separate read and write data
buses as 64 bits. The bus also uses 68 control bits, based on the AMBA AXI3 pro-
tocol. These translate into a total of 228 (address + read data + write data + control)
data optical waveguides, as discussed in section ORB On-Chip Communication
Architecture. Finally, WDM is used, with a maximum of λ = 32 wavelengths allo-
cated based on cluster bandwidth requirements, on a per-application basis.

Performance Estimation Models

For the global electrical interconnect, wire delay and optimal delay repeater insertion
points are calculated using an RLC transmission line wire model described in [63].
Latches are inserted based on wire length (obtained from routing estimates), wire
delay, and clock frequency of the bus, to pipeline the bus and ensure correct opera-
tion [7]. For instance, a corner to corner wire of length 4 cm for a 2 cm × 2 cm die size
has a projected delay of 1.6 ns in 65 nm technology, for a minimum width wire size
Wmin [63]. To support a frequency of 2.5 GHz (corresponding to a clock period of
0.4 ns), 4 latches need to be inserted to ensure correct (multi-cycle) operation. It has
been shown that increasing wire width can reduce propagation delay at the cost of
area. For our global interconnects, we therefore consider wider interconnects with a width of 3Wmin, which results in a near-optimal power-delay product at the cost of only a slight area overhead. The delay of such a wide, repeater-inserted wire is found to be approximately 26 ps/mm, varying only slightly (by about 1 ps/mm) between the 65 and 22 nm nodes. Delay due to bridges, arbitration, and serialization/de-serialization at the interfaces was accounted for by annotating the SystemC models with the results of detailed gate-level analysis of these circuits.

Table 5.2 Delay (in ps) of optical components for a 1 cm optical data path

Tech node  65 nm  45 nm  32 nm  22 nm
Modulator driver  45.8  25.8  16.3  9.5
Modulator  52.1  30.4  20.0  14.3
Polymer waveguide  46.7  46.7  46.7  46.7
Photo detector  0.5  0.3  0.3  0.2
Receiver amplifier  16.9  10.4  6.9  4.0
Total optical delay  162.0  113.6  90.2  74.7

For the optical ring bus (ORB) architecture, we model all the components
described in section Optical Ring Bus Architecture: Building Blocks, annotated
with appropriate delays. Table 5.2 shows delays of the various optical interconnect
components used in ORB, for a 1 cm optical data path, calculated based on estimates
from [5, 33]. It can be seen that the optical interconnect delay remains constant for
the waveguide, while the delay due to other components reduces with each technol-
ogy generation. This is in contrast to the minimum electrical interconnect delay
which is expected to remain almost constant (or even increase slightly) despite opti-
mal wire sizing (i.e., increasing wire width) and repeater insertion to reduce delay.
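The latch-insertion rule used above reduces to a one-line calculation; this sketch simply reproduces the 4-cm, 2.5-GHz example given earlier in this section.

import math

# One latch per clock period of wire delay, so the signal never has to cover more
# than one pipeline segment per cycle. Values reproduce the example from the text.
def latches_needed(wire_delay_ns, clock_ghz):
    return math.ceil(wire_delay_ns * clock_ghz)

print(latches_needed(1.6, 2.5))   # -> 4 latches for the corner-to-corner wire above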

Power Estimation Models

To estimate the power consumption for the electrical interconnects, we must account
for power consumed in the wires, repeaters, serialization/de-serialization circuits,
and bus logic components (latches, bridges, arbiters and decoders). For bus wire
power estimation, we determine wire lengths using our high level floorplan and rout-
ing estimates as described earlier. We then make use of bus wire power consumption
estimates from [64], and extend them to account for repeaters [65]. Static repeater
power and capacitive load is obtained from data sheets. Capacitive loads for compo-
nents connected to the bus are obtained after logic synthesis. Other capacitances (e.g.
ground, coupling) are obtained from the Berkeley Predictive Technology Model
(PTM) [66], and ITRS estimates [5]. The power consumed in the serialization cir-
cuitry and bus logic components is calculated by first creating power models for the
components, based on our previous work on high-level power estimation of com-
munication architecture components using regression based analysis of gate level
simulation data [65]. These power models are then plugged into the SystemC simula-
tion models. Power numbers are obtained for the components after simulation and
are within 5% accuracy of gate-level estimates [65]. Simulation is also used to obtain accurate values for switching activity, which is used for bus wire power estimation.

Table 5.3 Power consumption of the optical data path (in mW)

Tech node  65 nm  45 nm  32 nm  22 nm
Transmitter  18.4  8.6  6.0  5.0
Receiver  0.3  0.2  0.3  0.3
Total optical power  18.7  8.8  6.3  5.3

For the optical interconnect, power consumption estimates for a transmitter and
receiver in a single optical data path are derived from [33] and shown in Table 5.3. It
can be seen that the power consumed by the transmitter dominates power consumed
by the receiver. The size as well as the capacitance of the modulator is large, requir-
ing a large driving circuit. To maintain their resonance under on-die temperature
variations, microring resonators need to be thermally tuned. We assume a dedicated
thermal tuner for every microring resonator in the proposed communication fabric,
dissipating approximately 20 μW/K, with a mean temperature deviation of about
20 K. In addition, we also consider the laser power driving the optical interconnect. As
an optical message propagates through the waveguide, it is attenuated through wave-
guide scattering and ring resonator insertion losses, which translates into optical
insertion loss. This loss sets the required optical laser power and correspondingly the
electrical laser power as it must be sufficient to overcome losses due to electrical
optical conversion efficiencies as well as transmission losses in the waveguide. We
conservatively set an electrical laser power of 3.3 W (with 30% laser efficiency) in
our power calculations based on per component optical losses for the coupler/splitter,
non-linearity, waveguide, ring modulator, receiver filter, and photodetector.
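To illustrate how an electrical laser power figure follows from an optical loss budget, the sketch below works backwards from an assumed per-path detector power through assumed per-component losses and the 30% laser efficiency mentioned above. Every loss and power value below is invented for illustration; with these particular numbers the result happens to land in the same few-watt range as the conservative 3.3 W figure used in the text.

# Illustrative laser power budget (all values are assumptions, not chapter figures).
losses_db = {
    "coupler/splitter": 1.5,
    "non-linearity":    1.0,
    "waveguide":        3.0,
    "ring modulator":   1.5,
    "receiver filter":  2.0,
    "photodetector":    1.0,
}
insertion_loss_db = sum(losses_db.values())           # 10 dB in this made-up budget

detector_power_mw = 0.5                               # optical power required per path (assumed)
paths = 228                                           # driven optical paths (assumed)
optical_power_w = paths * detector_power_mw * 10 ** (insertion_loss_db / 10) / 1000
electrical_power_w = optical_power_w / 0.30           # 30% laser efficiency, as in the text
print(f"electrical laser power: {electrical_power_w:.1f} W")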

Performance Comparison

Optical waveguides provide faster signal propagation compared to electrical inter-
connects because they do not suffer from RLC impedances. But in order to exploit
the propagation speed advantage of optical interconnects, electrical signals must be
converted into light and then back into an electrical signal. This conversion entails
a performance and power overhead that must be taken into account when comparing
optical interconnects with electrical interconnects.
To compare the performance of the ORB and traditional pipelined global inter-
connect based on-chip communication architectures, we simulate the MPSoC appli-
cations implemented at the 65, 45, 32 and 22 nm technology nodes. We also
incorporate the impact of uplink/downlink serialization at the electro-optic inter-
faces on the performance. Figure 5.7a–c shows the average latency improvements for
the ORB communication architecture for three degrees of serialization: 1 (no
serialization), 2, and 4.

Fig. 5.7 Latency reduction for ORB over traditional all-electrical bus-based communication architectures: (a) no serialization, (b) serialization degree of 2, (c) serialization degree of 4. Each panel plots the latency reduction factor achieved by ORB for every benchmark application at the 65, 45, 32, and 22 nm technology nodes.

It can be seen that the ORB architecture provides a latency reduction over tradi-
tional all-electrical bus-based communication architectures for UDSM technology
nodes. The speedup is small for 65 nm because of the relatively lower global clock
frequency (1.4 GHz) which does not require as much pipelining. However, from
the 45 nm down to the 22 nm nodes, the speedup increases steadily because of ris-
ing clock frequencies which introduce more pipeline stages in the electrical global
interconnect, increasing its latency, compared to the ORB architecture. The speedup
for radix is lower than other applications due to the smaller length of global inter-
connect wires, which reduces the advantage of having an optical link for global
data transfers. On the other hand, the lower speedup for ocean is due to the smaller
number of global inter-cluster data transfers, despite having long global intercon-
nects. With the increasing degree of serialization, a notable reduction in improve-
ment is observed. This is primarily because of the latency overhead of the
serialization/de-serialization process. The applications are impacted differently
depending upon the amount of inter cluster communication each must support. For
instance, the smaller number of inter-cluster transfers in ocean results in smaller
latency degradation because of serialization than for an application with a higher
proportion of inter-cluster transfers, such as datahub. While increase in latency is
an undesirable side effect of serialization, it nonetheless brings other benefits, as
discussed in the next section. Overall, the ORB architecture speeds up global data
transfers due to the faster optical waveguide. Despite the costs associated with
converting the electrical signal into an optical signal and back, it can be seen that
at 22 nm, ORB can provide up to a 4.7× speedup without serialization, up to a 4.1×
speedup with a serialization degree of 2, and up to a 3.5× speedup with a serialization
degree of 4. With improvements in optical component fabrication over the next few
years, the opto-electrical conversion delay is expected to decrease leading to even
greater performance benefits.

Power Comparison

With increasing core counts on a chip aimed at satisfying ever increasing bandwidth
requirements of emerging applications, the on-chip power dissipation has been ris-
ing steadily. High power dissipation on a chip significantly increases cooling costs.
It also increases chip temperature which in turn increases the probability of timing
errors and overall system failure. On-chip communication architectures have been
shown to dissipate an increasing proportion of overall chip power in multicore chips
(e.g., ~40% in the MIT RAW chip [67] and ~30% in the Intel 80-core Teraflop chip
[68]) due to the large number of network interface (NI), router, link, and buffer
components in these architectures. Thus it is vital for designers to focus on reducing
power consumption in the on-chip interconnection architecture.
Figure 5.8a–c shows the power savings that can be obtained when using the
ORB architecture instead of an all-electrical pipelined interconnect architecture,
for three degrees of serialization: 1 (no serialization), 2, and 4. It can be
seen that the ORB architecture consumes more power for the 65 nm node, com-
pared to the all-electrical pipelined interconnect architecture. However, for tech-
nology nodes from 45 nm onwards, there is a significant reduction in ORB power
consumption, due to expected improvements in opto-electrical modulator structure fabrication as well as an increase in electrical power consumption due to higher operating frequencies and greater leakage.

Fig. 5.8 Power consumption reduction for ORB over traditional all-electrical bus-based communication architectures: (a) no serialization, (b) serialization degree of 2, (c) serialization degree of 4. Each panel plots the power reduction factor achieved by ORB for every benchmark application at the 65, 45, 32, and 22 nm technology nodes.

For the case of no serialization
depicted in Fig. 5.8a, ORB can provide up to a 10.3× power reduction at 22 nm,
compared to all-electrical pipelined global interconnect architectures, which is a
strong motivation for adopting it in the near future. When serialization of degree 2
is utilized, there is actually a slight increase in power consumption for the 65 nm
node due to the overhead of the serialization/de-serialization circuitry (Fig. 5.8b).
However, from the 45 nm node and below, the reduction in power dissipation due
to fewer active microring resonators and associated heaters overshadows the
serialization overhead, and leads to a slight reduction in power consumption. At
the 22 nm node, up to a 7% reduction in power consumption is observed, com-
pared to the base case without any serialization. A similar trend is observed when
a serialization degree of 4 is utilized, shown in Fig. 5.8c. At the 22 nm node, up
to a 13% reduction in power consumption is observed compared to the base case
with no serialization. These results indicate the usefulness of serialization as a
mechanism to reduce on-chip power dissipation. In addition serialization also
reduces the complexity of the photonic layer by reducing the number of resona-
tors, waveguides, and photodetectors, which can tremendously boost yield and
lower costs for fabricating hybrid opto-electric communication architectures
such as ORB.

Conclusion and Future Work

In this chapter, we presented a novel on-chip opto-electrical bus-based communi-
cation architecture. Our optical ring bus (ORB) communication architecture
replaces global pipelined electrical (copper) interconnects with an optical ring
waveguide and opto-electric modulators and receivers. While there is a definite
performance and power overhead associated with converting electrical signals
into optical signals and back today, we showed that ORB can be beneficial for
ultra-deep submicron (UDSM) technology nodes below 65 nm. Our experimental
results, based on emerging technology trends and recently published studies, have
shown that the ORB architecture can provide as much as a 4.7× average latency
reduction, along with a 12× power reduction, compared to the traditional all-elec-
trical interconnect architecture at the 22 nm technology node. It is clear that ORB
can provide a performance-per-watt that is far superior to electrical alternatives.
Furthermore, ORB is scalable to accommodate an increasing number of computa-
tional clusters and cores on a chip in the future, and provides a clean separation of
concerns as the optical waveguide and components are fabricated in a separate,
dedicated layer. Our ongoing work is looking at characterizing bandwidth density
and analyzing the implications of emerging optical components for the ORB
architecture. Future challenges in this area include the need for active or passive
control methods to reduce optical interconnect susceptibility to temperature varia-
tions, and better opto-electric modulator designs to reduce delay and power
consumption.

References

1. Pham D et al (2005) The design and implementation of a first-generation CELL processor. In: Proceedings of the IEEE ISSCC, San Francisco, CA, pp 184–185
2. Vangal S et al (2007) An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS. In: Proceedings of the IEEE international solid state circuits conference, San Francisco, CA, paper 5.2
3. Tilera Corporation (2007) TILE64 Processor. Product Brief
4. Ho R, Mai W, Horowitz MA (2001) The future of wires. Proc IEEE 89(4):490–504
5. International Technology Roadmap for Semiconductors (2006) http://www.itrs.net/ Accessed on Oct 2011
6. Adler V, Friedman E (1998) Repeater design to reduce delay and power in resistive interconnect. In: IEEE TCAS
7. Nookala V, Sapatnekar SS (2005) Designing optimized pipelined global interconnects: algorithms and methodology impact. IEEE ISCAS 1:608–611
8. AMBA AXI Specification. www.arm.com/armtech/AXI Accessed on Oct 2011
9. Haurylau M et al (2006) On-chip optical interconnect roadmap: challenges and critical directions. IEEE J Sel Top Quantum Electron 12(6):1699–1705
10. Miller DA (2000) Rationale and challenges for optical interconnects to electronic chips. Proc IEEE 88:728–749
11. Ramaswami R, Sivarajan KN (2002) Optical networks: a practical perspective, 2nd edn. Morgan Kaufmann, San Francisco
12. Young I (2004) Intel introduces chip-to-chip optical I/O interconnect prototype. Technology@Intel Magazine
13. Rong H et al (2005) A continuous-wave Raman silicon laser. Nature 433:725–728
14. McNab SJ, Moll N, Vlasov YA (2003) Ultra-low loss photonic integrated circuit with membrane-type photonic crystal waveguides. Opt Express 11(22):2927–2939
15. Liu A et al (2004) A high-speed silicon optical modulator based on a metal-oxide-semiconductor capacitor. Nature 427:615–618
16. Xu Q et al (2007) 12.5 Gbit/s carrier-injection-based silicon microring silicon modulators. Opt Express 15(2):430–436
17. Reshotko MR, Kencke DL, Block B (2004) High-speed CMOS compatible photodetectors for optical interconnects. Proc SPIE 5564:146–155
18. Koester SJ et al (2004) High-efficiency, Ge-on-SOI lateral PIN photodiodes with 29 GHz bandwidth. In: Proceedings of the Device Research Conference, Notre Dame, pp 175–176
19. Haensch W (2007) Is 3D the next big thing in microprocessors? In: Proceedings of international solid state circuits conference (ISSCC), San Francisco
20. Pasricha S, Dutt N (2008) Trends in emerging on-chip interconnect technologies. IPSJ Trans Syst LSI Design Methodology 1:2–17
21. Pasricha S (2009) Exploring serial vertical interconnects for 3D ICs. In: IEEE/ACM design automation conference (DAC), San Diego, CA, pp 581–586
22. Goodman JW et al (1984) Optical interconnects for VLSI systems. Proc IEEE 72(7):850–866
23. Tan M et al (2008) A high-speed optical multi-drop bus for computer interconnections. In: Proceedings of the 16th IEEE Symposium on high performance interconnects, pp 3–10
24. Chiarulli D et al (1994) Optoelectronic buses for high performance computing. Proc IEEE 82(11):1701
25. Kodi AK, Louri A (2004) Rapid: reconfigurable and scalable all-photonic interconnect for distributed shared memory multiprocessors. J Lightwave Technol 22:2101–2110
26. Kochar C et al (2007) Nd-rapid: a multidimensional scalable fault-tolerant optoelectronic interconnection for high performance computing systems. J Opt Networking 6(5):465–481
27. Ha J, Pinkston T (1997) Speed demon: cache coherence on an optical multichannel interconnect architecture. J Parallel Distrib Comput 41(1):78–91
28. Carrera EV, Bianchini R (1998) OPNET: a cost-effective optical network for multiprocessors. In: Proceedings of the international conference on supercomputing '98, pp 401–408
29. Batten C et al (2008) Building many core processor-to-dram networks with monolithic silicon photonics. In: Proceedings of the 16th annual symposium on high-performance interconnects, Stanford, CA, August 27–28, pp 21–30
30. Collet JH, Caignet F, Sellaye F, Litaize D (2003) Performance constraints for onchip optical interconnects. IEEE J Sel Top Quantum Electron 9(2):425–432
31. Tosik G et al (2004) Power dissipation in optical and metallic clock distribution networks in new VLSI technologies. IEE Electron Lett 40(3):198–200
32. Kobrinsky MJ et al (2004) On-chip optical interconnects. Intel Technol J 8(2):129–142
33. Chen G, Chen H, Haurylau M, Nelson N, Albonesi D, Fauchet PM, Friedman EG (2005) Predictions of CMOS compatible on-chip optical interconnect. In: Proceedings of the SLIP, San Francisco, CA, pp 13–20
34. O'Connor I (2004) Optical solutions for system-level interconnect. In: Proceedings of the SLIP, Paris, France
35. Pappu AM, Apsel AB (2005) Analysis of intrachip electrical and optical fanout. Appl Opt 44(30):6361–6372
36. Benini L, Micheli GD (2002) Networks on chip: a new SoC paradigm. IEEE Comput 49(2/3):70–71
37. Dally WJ, Towles B (2001) Route packets, not wires: on-chip interconnection networks. In: Design automation conference, Las Vegas, NV, pp 684–689
38. Vangal S et al (2007) An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS. In: Proceedings of the ISSCC, San Francisco, CA
39. Shacham A, Bergman K, Carloni L (2007) The case for low-power photonic networks on chip. In: Proceedings of the DAC 2007, San Diego, CA
40. Kirman N et al (2006) Leveraging optical technology in future bus-based chip multiprocessors. In: Proceedings of the MICRO, Orlando, FL
41. Vantrease D et al (2008) Corona: system implications of emerging nanophotonic technology. In: Proceedings of the ISCA, Beijing, China
42. Poon AW, Xu F, Luo X (2008) Cascaded active silicon microresonator array cross-connect circuits for WDM networks-on-chip. Proc SPIE Int Soc Opt Eng 6898:689812
43. Kodi A, Morris R, Louri A, Zhang X (2009) On-chip photonic interconnects for scalable multi-core architectures. In: Proceedings of the 3rd ACM/IEEE international symposium on network-on-chip (NoCS'09), San Diego, 10–13 May 2009, p 90
44. Pan Y et al (2009) Firefly: illuminating future network-on-chip with nanophotonics. In: Proceedings of the ISCA, pp 429–440
45. Hsieh I-W et al (2006) Ultrafast-pulse self-phase modulation and third-order dispersion in Si photonic wire-waveguides. Opt Express 14(25):12380–12387
46. Gunn C (2006) CMOS photonics for high-speed interconnects. IEEE Micro 26(2):58–66
47. Barrios CA et al (2003) Low-power-consumption short-length and high-modulation-depth silicon electro-optic modulator. J Lightwave Technol 21(4):1089–1098
48. Woo S, Ohara M, Torrie E, Singh J, Gupta A (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the international symposium on computer architecture (ISCA), Santa Margherita Ligure, June 1995, pp 24–36
49. Eldada L, Shacklette LW (2000) Advances in polymer integrated optics. IEEE JQE 6(1):54–68
50. Gupta A et al (2004) High-speed optoelectronics receivers in SiGe. In: Proceedings of the VLSI design, pp 957–960
51. Lee BG et al (2007) Demonstrated 4 × 4 Gbps silicon photonic integrated parallel electronic to WDM interface. OFC
52. Dobkin R et al (2008) Parallel vs. serial on-chip communication. In: Proceedings of the SLIP, Newcastle, United Kingdom
53. Morgenshtein A et al (2004) Comparative analysis of serial vs parallel links in NoC. In: Proceedings of the SSOC
54. Ghoneima M et al (2005) Serial-link bus: a low-power on-chip bus architecture. In: Proceedings of the ICCAD, San Jose, CA
55. Kimura S et al (2003) An on-chip high speed serial communication method based on independent ring oscillators. In: Proceedings of the ISSCC
56. Wey I-C et al (2005) A 2 Gb/s high-speed scalable shift-register based on-chip serial communication design for SoC applications. In: Proceedings of the ISCAS
57. Saneei M, Afzali-Kusha A, Pedram M (2008) Two high performance and low power serial communication interfaces for on-chip interconnects. In: Proceedings of the CJECE
58. Woo SC et al (1995) The SPLASH-2 programs: characterization and methodological considerations. In: Proceedings of the ISCA, S. Margherita Ligure, Italy
59. Pasricha S, Dutt N (2008) The optical ring bus (ORB) on-chip communication architecture. CECS technical report, February 2008
60. SystemC initiative. www.systemc.org Accessed on Oct 2011
61. Müller W, Ruf J, Rosenstiel W (2003) SystemC methodologies and applications. Kluwer, Norwell
62. Adya SN, Markov IL (2003) Fixed-outline floorplanning: enabling hierarchical design. In: IEEE Transactions on TVLSI
63. Ismail YI, Friedman EG (2000) Effects of inductance on the propagation delay and repeater insertion in VLSI circuits. IEEE TVLSI 8(2):195–206
64. Kretzschmar C et al (2004) Why transition coding for power minimization of on-chip buses does not work. In: DATE
65. Pasricha S, Park Y, Kurdahi F, Dutt N (2006) System-level power-performance trade-offs in bus matrix communication architecture synthesis. In: CODES+ISSS
66. Berkeley Predictive Technology Model, U.C. Berkeley. http://www-devices.eecs.berkeley.edu/~ptm/ Accessed on Oct 2011
67. Taylor M et al (2002) The Raw microprocessor. IEEE Micro
68. Vangal S et al (2007) An 80-tile 1.28 TFLOPS network-on-chip in 65 nm CMOS. In: Proceedings of the IEEE ISSCC, San Francisco, CA
Part III
System Integration and Optical-Enhanced
MPSoC Performance
Chapter 6
A Protocol Stack Architecture for Optical
Network-on-Chip
Organization and Performance Evaluation

Atef Allam and Ian O'Connor

Abstract Optical networks-on-chip (ONoCs) represent an emerging technology
for use as a communication platform for systems-on-chip (SoC). It is a novel on-
chip communication system where information is transmitted in the form of light,
as opposed to conventional electrical networks-on-chip (ENoC). As the ONoC
becomes a candidate solution for the communication infrastructure of the SoC, the
development of proper hierarchical models and tools for its design and analysis,
specific to its heterogeneous nature, becomes a necessity. This chapter studies a
class of ONoCs that employ a single central passive-type optical router using wave-
length division multiplexing (WDM) as a routing mechanism. A novel protocol
stack architecture for the ONoC is presented. The proposed protocol stack is a
4-layered hardware stack consisting of the physical layer, the physical-adapter layer,
the data link layer, and the network layer. It allows the modular design of each
ONoC building block, thus boosting the interoperability and design reuse of the
ONoC. Adapting this protocol stack architecture, this chapter introduces the micro-
architecture of a new router called electrical distributed router (EDR) as a wrapper
for the ONoC. Then, the performance of the ONoC layered architecture has been
evaluated both at system-level (network latency and throughput) and at the physical
(optical) level. Experimental results prove the scalability of the ONoC and demon-
strate that the ONoC is able to deliver a comparable bandwidth or even better (in
large network sizes) to the ENoC. The proposed protocol stack has been modeled
and integrated inside an industrial simulation environment (ST OCCS GenKit)
using an industrial standard (VSTNoC) protocol.

A. Allam · I. O'Connor (*)
Ecole Centrale de Lyon, Lyon Institute of Nanotechnology, University of Lyon,
36 avenue Guy de Collongue, Ecully 69134, France
e-mail: ian.oconnor@ec-lyon.fr


Keywords Electrical distributed router (EDR) · Optical network-on-chip (ONoC) · Optical router · Protocol stack · Performance evaluation

Introduction

Optical networks-on-chip (ONoCs) are increasingly considered to be viable candidate solutions for replacing electrical interconnect (ENoC) as a communication infrastructure in the multi-processor system-on-chip (MPSoC) domain [1]. This is due to their intrinsic characteristics of contention-free routing, absence of crosstalk and impedance-matching issues (which leads to design simplification), very high bandwidth in the passive optical network, low power dissipation, and high interconnect density [2].
The basic switching element of optical NoCs is the micro-resonator device. Some ONoCs utilize passive-type micro-resonators [1-3], where the resonance wavelength is a property of the device material and structure. Other ONoCs employ active-type micro-resonators [4-6], where the resonance wavelength is controlled (to a limited extent) by a voltage or current source. The optical NoC considered in this chapter is of the former type, where the routing mechanism is realized mainly within one contention-free passive optical router based on wavelength division multiplexing (WDM).
As the MPSoC complexity scales, the design of its communication infrastructure
becomes an increasingly difficult task [7]. The optical NoC has its own special
topology [1] that differentiates it from known electrical NoC topologies. In this
topology, SoC IPs (processor, ASIC, etc.) are connected to the central optical router
through a heterogeneous communication channel (digital, analog, and optical) in
which data is transformed between these domains in addition to the transformation
between parallel and serial format. Modeling abstraction, hierarchical models, and
modular design of the ONoC are essential to enable design space exploration and
validation. In this context, it is necessary to consider the layered stack architecture
similar to the OSI reference model [8], which has been adopted by most NoC proposals [7, 9-12].
The core objective of this chapter is twofold: the first is to define a clear protocol
stack architecture for the optical NoC and to introduce the concept of the electrical
distributed router (EDR) as a wrapper for the optical NoC, and to define its micro-
architecture. The second objective is to characterize and analyze the performance of
this layered ONoC architecture both at the system-level and at the physical-level.
In the proposed ONoC protocol stack, the micro-architecture of the hardware component building blocks is defined for each layer. An extra layer, the physical-
adapter layer (L1.5), is introduced between the physical layer (L1) and the data link
layer (L2) similar to that of the industrial UniPro protocol stack [13]. The layered
ONoC protocol architecture is thus composed of four hardware layers (see Fig. 6.1),
namely, the physical layer (L1), the physical-adapter layer (L1.5), the data link layer
(L2), and the network layer (L3).

Fig. 6.1 Optical NoC protocol stack architecture

ONoC Protocol Stack Architecture

The proposed ONoC protocol stack follows the architecture of the classical OSI
reference model. This ONoC layered protocol architecture allows the modular design
of each ONoC building block that boosts the interoperability and design reuse. In
addition, it allows scalability and manages the design complexity of the ONoC.

ONoC Physical Layer (L1)

The physical layer is concerned with the physical characteristics of the communica-
tion medium [14]. In optical NoCs, the physical layer is realized with devices from
heterogeneous domains. Some components depend on the optical technology, while
others depend on the CMOS technology. The optical physical layer defines the
specifications of the photonic and optoelectronic devices in the communication path.
It specifies the free spectral range (FSR) and the number of working wavelengths,
and the photonic power levels of the optical beam. In addition, the physical layer
specifies the width of wires as well as the levels and timing of the signals. There are
three classes of physical links in the ONoC: (1) the heterogeneous optoelectronic
multi-wavelength transmitter (MWL-Tx) link, (2) the heterogeneous optoelectronic
multi-wavelength receiver (MWL-Rx) link, and (3) the purely optical link composed
of the waveguides through the optical router. Concerning the IP connectivity, each
pair of MWL-Tx and MWL-Rx links are dedicated to a single SoC IP.

Multi-Wavelength Transmitter (MWL-Tx)

The multi-wavelength transmitter link converts serial digital signals into a form
suitable to be transmitted in the form of light over the optical router. It consists of

Fig. 6.2 Multi-wavelength transmitter

Fig. 6.3 Multi-wavelength receiver

the laser drivers and laser source modules, in addition to the on/off demultiplexer
(see Fig. 6.2). Each laser source generates a laser beam with a wavelength corre-
sponding to the packet destination, and with instantaneous photonic power propor-
tional to the level of each input serial data bit. Note that this architecture considers
an array of fixed-wavelength laser sources. While tunable laser sources also exist,
overall size and inter-wavelength switching speed prohibits their practical use.
However, from the point of view of the model, there is no fundamental reason why
the architecture could not include this type of device. The laser source in our ONoC
is an on-chip, directly modulated, compact III-V type.
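The following is a minimal numeric sketch of how such a directly modulated laser source maps a serial bit onto instantaneous photonic power; it is not the authors' device model. The bias current, modulation current, and slope efficiency follow Table 6.1 later in this chapter, while the idealized L-I curve and the threshold current value are assumptions made only for illustration.

```cpp
// Minimal sketch (not the chapter's device model): instantaneous optical power
// of a directly modulated laser for one serial bit, using an idealized L-I curve
// P = eta * (I - Ith) above threshold. Bias/modulation currents and efficiency
// follow Table 6.1; the threshold current Ith is an assumed illustrative value.
#include <algorithm>
#include <cstdio>

struct LaserTx {
    double i_bias_mA   = 0.5;    // laser-driver bias current (Table 6.1)
    double i_mod_mA    = 1.45;   // laser-driver modulation current (Table 6.1)
    double eta_W_per_A = 0.145;  // laser-source slope efficiency (Table 6.1)
    double i_th_mA     = 0.4;    // assumed threshold current of the III-V microdisk laser

    // Photonic power (in mW) emitted while transmitting one serial bit.
    double power_mW(int bit) const {
        double drive_mA = i_bias_mA + (bit ? i_mod_mA : 0.0);
        return eta_W_per_A * std::max(0.0, drive_mA - i_th_mA);  // mA * W/A = mW
    }
};

int main() {
    LaserTx tx;
    std::printf("P(0) = %.3f mW, P(1) = %.3f mW\n", tx.power_mW(0), tx.power_mW(1));
}
```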

Multi-Wavelength Receiver (MWL-Rx)

The multi-wavelength receiver link converts the router optical signals into electrical
digital format. It consists of the photodiode (PD), transimpedance-amplifier (TIA),
and the comparator modules as shown in Fig. 6.3. Demultiplexing is carried out in
the optical domain, where the incoming photonic beam (composed of several mul-
tiplexed wavelengths) is exposed to the set of photodiodes each sensitive to a single
wavelength. When a certain photodiode is stimulated by the photonic beam, it pro-
duces an electrical current proportional to the photonic power in that beam. The
photodetector considered in our simulation is a broadband Ge-detector integrated
with silicon nanophotonic waveguides [15], which uses a set of filters before it for
wavelength selection.
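As a complement, the sketch below illustrates the receive-side decision chain for a single wavelength channel (photodiode, TIA, comparator). Responsivity and dark current follow Table 6.1; the transimpedance gain and comparator threshold are assumed values, and the optical demultiplexing that precedes the photodiode is not modeled.

```cpp
// Minimal sketch of the MWL-Rx decision chain for one wavelength channel:
// photodiode current from incident photonic power (responsivity and dark
// current from Table 6.1), a TIA modeled as an ideal transimpedance gain
// (assumed value), and a comparator slicing against an assumed threshold.
#include <cstdio>

struct MwlRx {
    double responsivity_A_per_W = 1.0;   // photodetector responsivity (Table 6.1)
    double dark_current_nA      = 18.0;  // photodetector dark current (Table 6.1)
    double tia_gain_kohm        = 10.0;  // assumed transimpedance gain
    double v_threshold_mV       = 1.0;   // assumed comparator threshold

    // Recover a digital bit from the received optical power (in uW).
    int decide(double p_opt_uW) const {
        double i_pd_uA  = responsivity_A_per_W * p_opt_uW      // uW * A/W = uA
                        + dark_current_nA * 1e-3;              // nA -> uA
        double v_tia_mV = i_pd_uA * tia_gain_kohm;             // uA * kOhm = mV
        return v_tia_mV > v_threshold_mV ? 1 : 0;
    }
};

int main() {
    MwlRx rx;
    std::printf("bit(5 uW) = %d, bit(0.05 uW) = %d\n", rx.decide(5.0), rx.decide(0.05));
}
```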

Fig. 6.4 Transmitter physical adapter (OBI, TxCtrl, and SER blocks)

ONoC Physical-Adapter Layer (L1.5)

The physical-adapter layer is a sublayer of the physical (L1) layer of the OSI
reference model. Its main objective is to hide and wrap the heterogeneous (electri-
cal analog and optical) physical layer L1 of the ONoC protocol stack. Two units
define the architecture of the physical-adapter layer: the transmitter physical
adapter (Tx-PhyAdapter) and the receiver physical adapter (Rx-PhyAdapter)
units. Bit encoding is a vital service to be implemented in the physical-adapter
layer, the objective being to reduce the average power consumption. Our optical
bus inverter (OBI) module implements a source encoding technique allowing the
number of 1s within a flit to be reduced, so as to keep the laser switched off as
much as possible during the transmission of information. With this approach, the
signal being serialized contains the smallest number of logic ones, in such a way
that the laser is kept turned on for as short a time as possible. Encoding and decod-
ing are implemented with the OBI unit in the Tx-PhyAdapter and Rx-PhyAdapter,
respectively.
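A short sketch of the OBI idea described above follows: when a flit contains more ones than zeros, its bitwise complement is transmitted and the 1-bit OBI flag is set so the receiving OBI unit can undo the inversion. The 64-bit flit width matches the simulation setup used later in this chapter; the code is illustrative only, not the chapter's RTL.

```cpp
// Sketch of OBI source encoding: invert a flit when it carries more ones than
// zeros, so the laser is kept switched off as much as possible; the OBI flag
// tells the receiver whether to re-invert. A 64-bit flit is assumed.
#include <bitset>
#include <cstdint>
#include <cstdio>

struct ObiFrame {
    bool          inverted;  // OBI field of the data link frame
    std::uint64_t flit;      // encoded flit payload
};

ObiFrame obi_encode(std::uint64_t flit) {
    std::size_t ones = std::bitset<64>(flit).count();
    if (ones > 32) return {true, ~flit};   // inverting yields fewer ones on the serial line
    return {false, flit};
}

std::uint64_t obi_decode(const ObiFrame& f) {
    return f.inverted ? ~f.flit : f.flit;  // receiver-side OBI unit restores the original flit
}

int main() {
    std::uint64_t flit = 0xFFFF0FFFFFFF0FFFULL;   // mostly ones: will be inverted
    ObiFrame tx = obi_encode(flit);
    std::printf("inverted=%d, decode ok=%d\n", tx.inverted, obi_decode(tx) == flit);
}
```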

Transmitter Physical Adapter (Tx-PhyAdapter)

The transmitter physical adapter is constructed from the transmitter controller (txCtrl) and the serializer (SER) building blocks. In addition, it contains the
transmitting part of the optical bus inverter (OBI) unit (see Fig. 6.4). Its main
function is to manage and control the operation of the multi-wavelength trans-
mitter link, MWL-Tx. It drives and activates one laser driver module at a time
and controls its operation with the laser control signals LCS (on/off and sel
signals).

Receiver Physical Adapter (Rx-PhyAdapter)

The receiver physical adapter is constructed from the deserializer (DESER) and the
synchronizer (Sync) modules, in addition to the receiving part of the optical bus
inverter (OBI) unit (see Fig. 6.5). Its main function is the synchronization of the
serial communication and data conversion from serial to parallel.

Fig. 6.5 Receiver physical adapter (DESER, OBI, and Sync blocks)

Fig. 6.6 Flow control mechanism (FCU and rxFIFO buffers between TxIU and RxIU)

ONoC Data Link Layer (L2)

The objective of the data link layer is to provide the reliability and the synchroniza-
tion functionality to the packet flow. Its main task in the ONoC protocol stack is to
ensure reliable communication of data packets along the two complementary rout-
ers used in the ONoC (see Network Layer section). Unlike macro computer net-
works, NoCs have to deliver messages between the IPs with guaranteed zero loss.
Thus, the ONoC has to adopt a rigorous flow control scheme.
The flow control is a key aspect of the data link layer. It is the mechanism that deter-
mines the packet movement along the network path and it is concerned with the alloca-
tion of shared resources (buffers and physical channels) as well as contention resolution
[16]. Since the electrical distributed router in ONoC employs the wormhole packet
switching technique (see Network Layer section), the proposed ONoC protocol stack
uses flit-buffer flow control, which allocates buffers and channel bandwidth in units of
flits. The flow-control is implemented in the electrical domain (see Fig. 6.6) so as to
reduce the complexity of the data conversion modules between different domains. It
can employ any suitable flow-control scheme (such as the credit-based or the backpres-
sure handshake on-off scheme). Our model adopts the handshake on-off scheme.
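The sketch below illustrates the on-off (backpressure) flit-buffer flow control adopted by the model: the receiving side de-asserts its flow-control signal when the rxFIFO occupancy crosses a high-water mark and re-asserts it at a low-water mark, and the transmitting side forwards flits only while the signal is "on". The FIFO depth and watermark values are illustrative assumptions, not figures from the chapter.

```cpp
// Minimal sketch of on-off (backpressure) flit-buffer flow control: FC_sig goes
// OFF when the rxFIFO fills past a high-water mark and ON again at a low-water
// mark; the sender only pushes flits while FC_sig is ON. Depth and watermarks
// are assumed values for illustration.
#include <cstdio>
#include <queue>

class OnOffRxFifo {
    std::queue<unsigned> fifo_;
    std::size_t depth_ = 8, high_ = 6, low_ = 2;  // assumed depth and watermarks
    bool on_ = true;                              // FC_sig sent back to the TxIU
public:
    bool flow_on() const { return on_; }

    void push(unsigned flit) {                    // flit arriving from the serial path
        if (fifo_.size() < depth_) fifo_.push(flit);
        if (fifo_.size() >= high_) on_ = false;   // assert backpressure (OFF)
    }
    bool pop(unsigned& flit) {                    // output NI accepts a flit
        if (fifo_.empty()) return false;
        flit = fifo_.front(); fifo_.pop();
        if (fifo_.size() <= low_) on_ = true;     // release backpressure (ON)
        return true;
    }
};

int main() {
    OnOffRxFifo rx;
    for (unsigned f = 0; f < 10; ++f)
        if (rx.flow_on()) rx.push(f);             // TxIU side: send only while FC_sig = ON
    unsigned flit;
    while (rx.pop(flit)) std::printf("delivered flit %u\n", flit);
}
```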
Figure 6.7 shows the frame structure of the data link layer. It is composed of
three fields: (1) the OBI field that is 1-bit wide used to indicate if the encoded
data is the OBI-inverted or the original bit-stream; (2) the PSS (protocol specific
signals) that is a variable-width field used to carry the protocol specific signal
communication (e.g. flit-id and aux signals in the VSTNoC protocol [17]); and

Fig. 6.7 Frame structure of the data link layer (fields: OBI, 1-bit; PSS and Flit, protocol-dependent widths)

Fig. 6.8 Electrical distributed-router and optical router connection in a 3-2 ONoC (TxIU/MWL-Tx and RxIU/MWL-Rx links around the central optical router)

(3) the Flit field that is used to carry the flit bits; its width is protocol-dependent
(e.g. 36-, 72-, or 144-bits in the VSTNoC protocol).
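As a concrete illustration of the three-field frame of Fig. 6.7, the following type sketches one possible in-memory layout. The 8-bit PSS field and the 64-bit flit width (matching the simulation setup later in this chapter) are assumptions; real widths depend on the NI protocol, as noted above.

```cpp
// Sketch of the Fig. 6.7 data link frame as a plain C++ type: a 1-bit OBI flag,
// a protocol-specific-signals (PSS) field whose width is protocol-dependent
// (8 bits assumed here purely for illustration), and the flit payload (64 bits
// assumed, matching the chapter's simulation setup).
#include <cstdint>
#include <cstdio>

struct DataLinkFrame {
    std::uint8_t  obi : 1;  // 1-bit OBI flag: encoded flit is inverted or not
    std::uint8_t  pss;      // protocol-specific signals, e.g. flit-id/aux (assumed width)
    std::uint64_t flit;     // flit payload (36/72/144 bits in VSTNoC; 64 bits assumed here)
};

int main() {
    DataLinkFrame frame{1, 0x03, 0xDEADBEEFCAFEF00DULL};
    std::printf("OBI=%u PSS=0x%02X flit=%016llX\n",
                (unsigned)frame.obi, (unsigned)frame.pss,
                static_cast<unsigned long long>(frame.flit));
}
```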

ONoC Network Layer (L3)

The network layer is responsible for transferring the data packets from the source IP
to the intended destination IP. It provides the routing functionality and the buffering
service to the data packets. The network layer in optical NoC is realized with two
complementary routers, the electrical distributed router (EDR) and the optical cen-
tralized router (OCR) (see Fig. 6.8); and it uses a two-level routing mechanism:
1. The optical routing level, which is implemented using the optical centralized
router. At this optical level, the routing mechanism is contention free and it is
based on wavelength division multiplexing (WDM).
2. The electrical routing level, which is implemented inside the electrical distrib-
uted router. Here, the routing mechanism is distributed among the transmitting-
and receiving-path interface units (TxIU and RxIU) of the electrical distributed
router. Inside the TxIU, the routing information extracted from the header flit is
used to feed the serial data to the corresponding laser driver and to activate its
corresponding laser source. On the other hand, at the RxIU, a packet from one
buffer among the group of buffers associated with different sources is released to

Fig. 6.9 N × N λ-router architecture (a); 4-port optical switch example (b)

the destination IP according to an adopted arbitration mechanism. Thus, the possibility of flit contention does exist at this level.
The optical NoC employs a hybrid switching technique (circuit switching and
packet switching). The proposed electrical distributed router employs the wormhole
packet switching technique [18]; while the optical router exercises the circuit
switching mechanism [19].

Optical Centralized Router (OCR)

The optical router (λ-router) is responsible for the actual propagation of optical information streams from the sources to the destinations. It is a passive optical network composed of several 4-port optical switches (based on add-drop filters) and designed to route data between SoC components. Figure 6.9a presents an example of an N × N λ-router architecture (each grey square represents an add-drop filter, a physical architecture example of which is shown in Fig. 6.9b) [2]. Optical beams propagate inside the optical router in one direction, from input ports to output ports, according to the wavelength division multiplexing (WDM) routing scheme. Because of this, and since the optical router exercises the circuit switching technique, the flits are routed in a deadlock-free way.
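The chapter states that the optical routing level is contention-free and determined purely by wavelength, but does not spell out the source/destination-to-wavelength mapping. The sketch below uses a cyclic assignment, which is a common choice for wavelength-routed crossbars; it is an assumption, not the exact mapping of the router in Fig. 6.9.

```cpp
// Hedged sketch of WDM routing in an N x N passive lambda-router: each
// (source, destination) pair is mapped to one of N wavelength channels so that
// no two streams arriving at the same output share a wavelength. The cyclic
// assignment is an assumption made for illustration.
#include <cstdio>

// Wavelength channel index (0 .. N-1) used by 'src' to reach 'dst'.
int channel(int src, int dst, int n) { return ((dst - src) % n + n) % n; }

int main() {
    const int n = 4;
    for (int dst = 0; dst < n; ++dst)
        for (int src = 0; src < n; ++src)
            std::printf("src %d -> dst %d : lambda_%d\n", src, dst, channel(src, dst, n));
    // At any given destination, the n incoming streams use n distinct
    // wavelengths, which is what makes the optical routing level contention-free.
}
```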

Electrical Distributed Router (EDR)

In an MPSoC that employs an optical NoC as its communication infrastructure, IP traffic and protocols have to be adapted to the data format and signals of the ONoC

Fig. 6.10 Micro-architecture of the transmitting-path interface unit (controller, FCU, H_DEC, buffer, DMUX, and TxPA blocks)

transmitter. On the other hand, the data accumulated by the ONoC receiver from
various source IPs needs to be delivered to a single target, complying with a stan-
dard communication protocol. Thus, the main objective of the Electrical Distributed
Router is to adapt the SoC traffic to and from the ONoC data format (according to a
standard network interface protocol) with the signaling and timing required by the
optical NoC transmitter and receiver modules. It consists of two building blocks: (1)
the transmitting-path interface unit, TxIU, which is analogous to the input unit of
the conventional NoC router, and (2) the receiving-path interface unit, RxIU, which
is analogous to the output unit of the conventional NoC router.

Transmitting-Path Interface Unit (TxIU)

The transmitting-path interface unit (TxIU) works as an interfacing and adapter unit. It manages and adapts the network interface (NI) protocol signals, PSS (e.g. of the VCI or VSTNoC protocol), to the signaling and timing required by the multi-
wavelength transmitter module. It mainly consists of the controller unit, the trans-
mitter physical adapter (TxPA), the header-decoder, and the flow-control unit (see
Fig. 6.10). The header decoder, H_DEC, sets the destination channel buffer, CH,
with the destination address; while the flow-control unit, FCU, realizes the adopted
flow-control scheme at the TxIU.

Receiving-Path Interface Unit (RxIU)

The receiving-path interface unit, RxIU, operates as an adapter between the ONoC
receiver physical adapter module and the destination network interface, NI. It includes

Fig. 6.11 Micro-architecture of the receiving-path interface unit (FCU, arbiter, controller, PSSE, rxFIFO buffers, and MUX)

FIFO buffers, rxFIFO, to store the receiver data; and an Arbiter module to arbitrate
between buffered packets so as to be delivered to the output NI (see Fig. 6.11). Its
Controller module manages and adapts the flow of released flits with the NI protocol
signals, PSS, which have been extracted by the PSSE module. The Flow-Control unit,
FCU, generates the flow-control signals as part of the flow-control mechanism.

Performance Evaluation

This section investigates the performance evaluation of the proposed layered pro-
tocol architecture of the ONoC built with the novel EDR. The ONoC performance
analysis has been carried out both at the system-level (network latency and through-
put) and at the physical level. In physical-level (optical) performance analysis of
the ONoC, we study the communication reliability of the ONoC formulated by the
signal-to-noise ratio (SNR) and the bit error rate (BER). Optical performance of
the ONoC is carried out based on the system parameters, component characteris-
tics and technology. The system-level analysis is carried out through simulation
using a flit-level-accurate SystemC model.

System-Level Performance Analysis

Communication channels in the optical NoC architecture defined above can be cat-
egorized in equivalence classes. An equivalence class, as introduced by Draper and
Ghosh [20], is defined as a set of channels with similar stochastic properties with
respect to the arrival and service rate. There are five main equivalence classes to
which a channel in ONoC can be assigned:

Fig. 6.12 ONoC datapath diagram

• Input channel (ICH), i.e. the input queue interfacing the ONoC to the NI of the connecting source IP.
• Transmitting-path channel (TxCH), which consists of the buffer of the transmitting-path interface unit, TxIU, in addition to the serializer of the transmitter module.
• Serial channel (serCH), constructed from the whole serial datapath starting from the laser-driver module up to the comparator module and passing through the optical router.
• Receiving-path channel (RxCH), which consists of the FIFO buffers of the receiving-path interface unit, RxIU, in addition to the deserializer of the receiver module.
• Output channel (OCH), which is the output connection interfacing the ONoC to the NI of the connecting destination IP.
The ONoC datapath is constructed from the series connection of these channels
as depicted in Fig. 6.12.
All datapaths through ONoC between any pair of source and destination IP nodes
are symmetric. Using this property in addition to the equivalence classes introduced
earlier, the characterization of ONoC performance metrics can be achieved by ana-
lyzing one ONoC datapath as depicted in Fig. 6.12.

Preliminary Definitions

In the Optical NoC, each input channel, as well as each transmitting-path channel,
is dedicated to a single input port of ONoC, while each output channel can accept
traffic from all associated RxCH channels.
Each SoC IP interacts with ONoC through the NI, according to a predefined
communication protocol. The number of clock cycles required to transfer one unit
of data (flit) between the NI and ONoC varies and is defined in the communica-
tion protocol used. In the following, we denote this number of protocol clock cycles per flit as PCC.
The system-on-chip runs with a system clock frequency denoted fsys. Some compo-
nents of ONoC run at this system frequency (such as TxIU and RxIU) while the serial
datapath runs with serialization frequency fser. The ONoC is expected to be clocked
with a frequency higher than the nominal clock frequency, f0, which is the system
clock frequency corresponding to the minimum clock-period T0 (the time required by
a flit to be completely serialized through the serial datapath). As such, we define the
ratio of the system frequency fsys to the nominal frequency f0 as the speed factor, denoted spf, which is given as:

spf = T0 / Tsys = (fsys / fser) · (FS / PCC)    (6.1)

where FS is the flit size in bits.
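As a small numeric check of (6.1), the snippet below uses the parameters given later for the simulations (fsys = 1 GHz, serialization rate 12.5 Gbps, 64-bit flits). PCC depends on the NI protocol and is not fixed in the text, so the value used here is an assumption.

```cpp
// Numeric check of the speed factor in Eq. (6.1) with the chapter's simulation
// parameters (fsys = 1 GHz, 12.5 Gbps serial rate, 64-bit flits). PCC, the
// protocol clock cycles needed per flit across the NI, is protocol-dependent;
// PCC = 2 is assumed purely for illustration.
#include <cstdio>

int main() {
    double f_sys = 1e9;      // system clock frequency (Hz)
    double f_ser = 12.5e9;   // serialization frequency (bit/s)
    double fs    = 64.0;     // flit size FS in bits
    double pcc   = 2.0;      // assumed protocol clock cycles per flit

    double spf = (f_sys / f_ser) * (fs / pcc);   // Eq. (6.1)
    std::printf("speed factor spf = %.2f\n", spf);
    // Per Eq. (6.7), linear (unsaturated) operation at maximum offered traffic
    // (iR = 1) requires spf to stay below the sum of p_ij over sources i.
}
```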

Saturation Throughput (Throughput Upper Bound)

The optical NoC works in linear operation (no saturation) as long as the serial chan-
nel bandwidth is able to accommodate the flow of input traffic, assuming infinite
FIFO buffers. This assumption is only used to characterize the interaction of ONoC
to the traffic flow in order to obtain an upper bound for network throughput.
An output channel, under ideal conditions, can release one flit every PCC clock
cycles. Thus, the output channel bandwidth, OCHBW, can be given (in pkts/cycles)
as in (6.2). In addition, considering the fact that OCHBW is shared among traffic
from all associated RxCH channels and defining pij as the probability of sending pack-
ets from node i to node j, the ideal capacity, Cap, can be given (in pkts/cycles) as in
(6.3), given that Nf is the number of flits per packet.

OCHBW = 1 / (PCC · Nf).    (6.2)

Cap = 1 / (PCC · Nf · Σi pij).    (6.3)

Maximum throughput occurs when some channel in the network becomes satu-
rated. The throughput upper bound is obtained by considering the role of the speed
factor and the traffic injection rate, assuming infinite RxIU FIFO-buffers.
Running the ONoC with a serialization frequency that is not high enough com-
pared to the operating system frequency will result in high spf, which leads to
flooding the ONoC with the injected traffic. As a result, saturation due to limited
serial channel bandwidth can occur for these injected traffic levels.
Let us define NP0 as the number of injected packets during one nominal clock-
period T0 at input channel and the injection ratio iR as the ratio of injected traffic to
the capacity; and recall that T0 = spf · Tsys.

NP0 = spf · iR / (PCC · Nf · Σi pij)    (6.4)

The maximum number of flits that can pass through the serial channel, serCH,
during T0 is 1/PCC [see (6.2)]. Saturation due to the serial channel occurs when
NP0 > 1/(PCC Nf). Thus, saturation occurs at the injection ratio, iRsat, given by:

iRsat = Σi pij / spf.    (6.5)

So, the throughput upper bound, TUB, is given by

TUB = Cap · iRsat = 1 / (PCC · Nf · spf).    (6.6)

Thus, the ONoC can work in linear operation, while accommodating maximum
offered traffic (iR = 1), with speed factor

spf ≤ Σi pij.    (6.7)
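The short sketch below evaluates Eqs. (6.2)-(6.6) end to end for a uniform random traffic pattern (pij = 1/(N-1) for all i ≠ j), which is an illustrative assumption; the PCC and spf values carry over the assumptions of the previous sketch.

```cpp
// Sketch evaluating Eqs. (6.2)-(6.6) under an assumed uniform traffic pattern
// (p_ij = 1/(N-1) for i != j), with the flit/packet parameters used in the
// chapter's simulations (4 flits per packet) and the spf from the prior sketch.
#include <cstdio>

int main() {
    int    n_nodes = 16;                 // ONoC size
    double pcc     = 2.0;                // assumed protocol clock cycles per flit
    double nf      = 4.0;                // flits per packet (chapter setup)
    double spf     = 2.56;               // speed factor from the previous sketch

    // Sum over sources i of the probability of targeting one given destination j.
    double sum_pij = (n_nodes - 1) * (1.0 / (n_nodes - 1));   // = 1 under uniform traffic

    double ochbw  = 1.0 / (pcc * nf);                // Eq. (6.2): pkts/cycle per output channel
    double cap    = 1.0 / (pcc * nf * sum_pij);      // Eq. (6.3): ideal capacity
    double ir_sat = sum_pij / spf;                   // Eq. (6.5): saturating injection ratio
    double tub    = cap * ir_sat;                    // Eq. (6.6): = 1/(pcc*nf*spf)

    std::printf("OCHBW=%.3f  Cap=%.3f  iR_sat=%.3f  TUB=%.4f pkts/cycle\n",
                ochbw, cap, ir_sat, tub);
}
```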

The implementation technology used for ONoC optoelectronic devices determines the maximum operating frequency for these devices, i.e. the serialization
frequency. Thus, for a given SoC operating frequency, Eq. (6.5) reveals that the
saturation point of the ONoC can be pushed to the right (allowing it to accommo-
date more traffic before saturation) through careful design of the optoelectronic
devices with more advanced technologies.

System-Level Simulation

To analyze the ONoC behavior and to evaluate its performance metrics (latency and
throughput), a BCA SystemC model for the ONoC has been developed. The model
implements all the micro-architectural details of the ONoC in addition to the struc-
tural details of its components. This model simulates the network at flit-level so as
to produce very accurate performance information.
The performance evaluation of the ONoC is carried out under two traffic test
sets: (1) a synthetic workload simulating real world traffic characteristics and (2) the
SPLASH-2 benchmark [21] traffic. In addition, it is compared to the performance
of the ENoC with mesh topology.
MPSoCs with 64-128 processors are common today in high-end servers, and this number is increasing with time. The modern microprocessor executes over 10^9 instructions per second with an average bandwidth of about 400 MB/s. However,
to avoid increasing memory latency, most processors still need larger peak band-
width [22].
The simulation experiment is carried out for various numbers of IPs (8, 16, 32, and
64). The synthetic workload traffic is used to evaluate the ONoC performance under
various bandwidth requirements of 8, 16, 24, and 32 Gbps from each IP. In the
conducted simulation tests, the MPSoC is clocked with a frequency of 1 GHz, while
the ONoC is allowed to deliver serial data with a rate of 12.5 Gbps using the current
state of the art photonic component parameters [15, 23] shown in Table 6.1. The flit-
size is set to be 64-bits with a packet length of 4 flits.

Table 6.1 Opto-electronic and photonic device parameters

Device           Parameter            Value
Laser-driver     Bias current         0.5 mA
                 Modulation current   1.45 mA
Laser-source     Efficiency           0.145 W/A
Microdisk        FSR                  32 nm
Waveguide        Losses               5%
Photo-detector   Responsivity         1.0 A/W
                 Dark current         18 nA
TIA              Noise density        1 pA/sqrt(Hz) (90 nm technology)

Fig. 6.13 ONoC performance metrics

The system level analysis shows that the ONoC under study is a stable network,
as is clearly revealed from the simulated throughput in Fig. 6.13. The network is
stable since it continues to deliver the peak throughput (and does not decrease) after
the saturation point.
In passive-type ONoCs such as that under study, there is a single central optical
router and the buffering queues are located at the end of the communication datap-
ath (compared to the ENoC which has buffers and routing switches at each routing
node). Thus, the resource contentions are far less in the case of the ONoC as com-
pared to the ENoC, and hence the ONoC deliverable bandwidth is expected to be
higher as the size of the MPSoC becomes larger. Simulation results in Fig. 6.14 bear
out this hypothesis.
Figure 6.14 shows that the ONoC can handle the necessary bandwidth success-
fully as long as it does not exceed its physical bandwidth (12.5 Gbps in our setup)
and achieve a bandwidth equal to that of the ENoC for relatively low bandwidth
demands. It also proves the scalability of the ONoC, where it illustrates that the
ONoC achievable bandwidth is almost constant regardless of the network size com-
pared to the ENoC.
A similar observation and conclusion can be drawn from the results of simulating
the SPLASH-2 benchmark, as shown in Fig. 6.15. The ONoC delivers a comparable
bandwidth to the ENoC.

Fig. 6.14 ONoC and ENoC performance for various MPSoC bandwidth demands

Fig. 6.15 ONoC performance under the SPLASH-2 benchmark

In NoCs, the typical packet size is 1,024 bits, where it is divided into several flits
for efficient resource utilization with a typical size of 64-bits. Because of the limited
width of the physical channel in the ENoC, the flit is subdivided into one or more
physical transfer digits or phits; typically with a size between 1- and 64-bits. Each
phit is transferred across the channel in a single clock cycle. Each input channel of
the ENoC router accepts and deserializes the incoming phits. Once a complete flit is
constructed, it is allocated to the input buffer and can arbitrate for an output channel.
On the other end, the output channel plays the complementary role where it serial-
izes the buffered flit again to phits for physical channel bandwidth allocation.

Fig. 6.16 ONoC performance against ENoC with various phit lengths (average bandwidth in Gbps)

Figure 6.16 compares the performance of the ONoC, with serial core communication (i.e. with a 1-bit physical channel width), against that of the ENoC built with physical channel (phit) widths of different sizes (8, 16, and 32 bits), for a flit size of 64 bits. The results demonstrate that the ONoC
achieves better performance over the ENoC with small phit size (8- and 16-bits)
regardless of the network size. Even for large phit sizes (32-bits or more), the
ONoC can still deliver better performance than the ENoC for large network sizes
(64-nodes or higher).

Optical Performance Analysis

The previous section examined the ONoC performance from the system-level per-
spective. In our physical-level performance analysis of the ONoC, we study the
communication reliability of the ONoC formulated by the signal-to-noise ratio SNR
(the relative level of the signal to the noise) and the bit error rate BER (the rate of
occurrence of erroneous bits relative to the total number of bits received in a trans-
mission). This has been achieved through analyzing the heterogeneous communica-
tion path of the ONoC based on:
• System parameters, such as the ONoC size (passive optical-router structure and its number of routing elements) and the data rate.
• Technology characteristics (micro-resonator roundtrip and coupling losses, waveguide sidewall roughness and reflection losses, and manufacturing variability).
• Component characteristics (detector responsivity, source threshold current and efficiency, TIA input-referred noise).

Fig. 6.17 Micro-resonator filter drop and through response

Preliminary Definitions

The path of data through the heterogeneous domains is as follows: first, the laser-driver
generates two electrical current values corresponding to digital data bits 1 and 0. This
current drives the laser-source module to generate an optical beam with photonic
power proportional to this input current. This optical beam is synthesized at a specific
wavelength according to the physical characteristics of the laser-source. Optical
beams are routed inside the passive optical-router (using the wavelength division
multiplexingWDMrouting mechanism). Then, the photodetector produces an
electrical current proportional to the incident photonic power, which is fed to the TIA
that generates the equivalent voltage. After the received signal (and associated noise)
has been amplified by the TIA, a decision to convert the received signal to a logic 1
or 0 will be carried out by the comparator and will be subject to errors based on the
relative level of the signal to the noise (SNR).
Each micro-resonator switch of the optical router has a nominal resonant wavelength λres (see Fig. 6.2); and each λres is i·Δλ distant from the system's base wavelength. Here, i is the optical channel index (i = 0 ... N-1, N being the number of channels, a function of the ONoC size and structure), and Δλ is the channel spacing (equal to FSR/N, FSR being the free spectral range of the micro-resonator switches). In practice, due to manufacturing variations and heating, the actual resonant wavelength will lie in a range of δλ around the nominal resonant wavelength λres, with δλ being the maximum error or detuning that can occur in the system.
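The following sketch computes the nominal wavelength plan implied by these definitions: channel spacing Δλ = FSR/N and nominal resonances λbase + i·Δλ. The FSR of 32 nm comes from Table 6.1; the base wavelength of 1,550 nm is an assumed illustrative value, not a figure from the chapter.

```cpp
// Sketch of the wavelength plan described above: channel spacing is
// d_lambda = FSR / N and the nominal resonance of channel i lies at
// lambda_base + i * d_lambda. FSR = 32 nm follows Table 6.1; the 1550 nm base
// wavelength is an assumed illustrative value.
#include <cstdio>

int main() {
    double fsr_nm     = 32.0;    // free spectral range of the microdisk (Table 6.1)
    double lambda0_nm = 1550.0;  // assumed base wavelength of the system
    int    n_channels = 16;      // one wavelength channel per destination

    double spacing_nm = fsr_nm / n_channels;  // channel spacing d_lambda
    for (int i = 0; i < n_channels; ++i)
        std::printf("channel %2d: %.1f nm\n", i, lambda0_nm + i * spacing_nm);
    // Any detuning of a resonance (manufacturing or thermal) must stay well
    // below the 2.0 nm spacing computed here to keep crosstalk, and hence the
    // BER, acceptable.
}
```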

Communication Reliability Investigation

The injected laser signal to the optical-router, on its normal path to the destination,
passes through a number of micro-resonator switches. In each switch, the signal
encounters attenuation in the drop and through channels depending on its wave-
length (see Fig. 6.17). One of the switches will be resonant to the signal's own
nominal wavelength, which directs it to follow the drop path while it is being
manipulated by its drop transfer function. In all other switches (with different wave-
lengths), the signal follows the through path and is manipulated by the switch's through transfer function.
Since the micro-resonator filter cannot achieve perfect wavelength selectivity, crosstalk and interference coming from signals on other wavelengths are added to the data signal in the drop path. Similarly, a small fraction of the data signal extracted by the filter's through transfer function is added to the signals on other wavelengths in the through path, depending on the wavelength, which is considered to be one source of the optical-router losses. The other sources of optical-router losses are the micro-resonators' drop and through attenuation (which depend on device parameters such as the ring's roundtrip loss coefficient, ρ, and the coupling coefficients between the straight waveguide and the ring, κ1, and between the two rings, κ2, in the double-ring micro-resonator filter), in addition to the losses caused by the passive waveguides (due to sidewall roughness and reflection losses).
To obtain the SNR figure of the ONoC, the N digital sources are allowed to trans-
mit 1s and 0s to the N destinations randomly. The laser signal is represented as a
Gaussian shape around the transmitting wavelength so that the whole wavelength
spectrum of the signal is accurately manipulated by each micro-disk throughout the
path in the router. At the receiver, the wavelength selection at the photodetector is
carried out with the same type of filter switch as is used inside the optical-router;
and the input referred noise of the TIA, coupled with the photodetector dark current,
gives the total noise at the input of the TIA circuit. This noise and the received opti-
cal power at the photodetector for logic 1 and 0 are used to calculate the SNR
and the BER using the methodology in [23].
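The chapter relies on the methodology of [23] for this step; the sketch below instead uses the standard Gaussian Q-factor approximation (Q = (I1 - I0)/(σ1 + σ0), BER = 0.5·erfc(Q/√2)), which is a common textbook formulation and may differ in detail from [23]. The received power levels and receiver bandwidth are illustrative assumptions; responsivity and TIA noise density follow Table 6.1.

```cpp
// Hedged sketch of turning received optical powers and input-referred noise
// into a Q-factor and BER via the Gaussian approximation. Received powers and
// bandwidth are assumed; responsivity and noise density follow Table 6.1.
#include <cmath>
#include <cstdio>

int main() {
    double resp   = 1.0;     // photodetector responsivity, A/W (Table 6.1)
    double p1_uW  = 1.5;     // assumed received power for a logic '1' (uW)
    double p0_uW  = 0.15;    // assumed residual (crosstalk) power for a logic '0' (uW)
    double in_pA_per_sqrtHz = 1.0;   // TIA input-referred noise density (Table 6.1)
    double bw_Hz  = 12.5e9;          // receiver bandwidth ~ data rate, an assumption

    double i1_uA    = resp * p1_uW;                              // photocurrent for '1' (uA)
    double i0_uA    = resp * p0_uW;                              // photocurrent for '0' (uA)
    double sigma_uA = in_pA_per_sqrtHz * std::sqrt(bw_Hz) * 1e-6; // pA -> uA

    double q   = (i1_uA - i0_uA) / (2.0 * sigma_uA);  // same noise assumed on both levels
    double ber = 0.5 * std::erfc(q / std::sqrt(2.0));
    std::printf("Q = %.2f, BER = %.2e bit^-1\n", q, ber);
}
```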

Parametric Exploration

In this section, we explore and analyze the SNR and the BER of the ONoC against the maximum detuning δλ (upwards from the ideal case of δλ = 0 nm, i.e. no manufacturing or thermal variations) for various system specifications and technology parameters. The reference point for the photonic device parameters is the current state-of-the-art component parameters [15, 23] shown in Table 6.1. We will contrast our ONoC BER against the typical BER figures required by the Synchronous Optical NETwork (SONET), Gigabit Ethernet, and Fibre Channel specifications, which are 10^-10 to 10^-12 or better [23].
Figures 6.18 and 6.19 show the SNR and the BER for various values of the ring resonator's roundtrip loss coefficient, ρ, for a 16-node ONoC working with a data rate of 12.5 Gbps. When no mistuning exists, the SNR is between 21 and 26 dB, resulting in a BER in the range of 10^-24 to 10^-9 bit^-1 for a roundtrip loss coefficient, ρ, between 0.03 and 0.01, respectively. As the detuning increases, the SNR decreases and the BER increases. As the detuning value increases beyond 0.4 nm, the BER becomes unacceptable, resulting in unreliable data communication irrespective of the roundtrip loss coefficient.

Fig. 6.18 SNR for a 16-node ONoC with a data rate of 12.5 Gbps

Fig. 6.19 BER for a 16-node ONoC with a data rate of 12.5 Gbps

Figure 6.20 shows the BER for a 16-node ONoC operating with various data rates, corresponding to a roundtrip loss coefficient of 0.02 and coupling coefficients κ1 and κ2 of 0.38 and 0.08, respectively. With calibration and careful design resulting in a maximum detuning value of 0.2 nm, the 16-node ONoC with these parameters, working at a 4 Gbps data rate, can achieve communication with a BER of 10^-19 bit^-1, which is considered highly reliable compared to the Synchronous Optical NETwork (SONET) requirement. For the same ONoC configuration, the BER is found to worsen as the data rate becomes higher, which imposes more constraints on the calibration for achieving an acceptable BER. On the other hand, implementing the micro-resonator filter with a photonic technology that can realize a roundtrip loss coefficient of 0.01 can achieve very low BER and tolerate larger detuning values even in the case of high data rates, as Fig. 6.21 illustrates.

Fig. 6.20 BER versus detuning (nm) for a 16-node ONoC at various data rates (12.5, 8, and 4 Gbps), with κ1 = 0.38, κ2 = 0.08, ρ = 0.02

Fig. 6.21 BER versus detuning (nm) for a 16-node ONoC at various data rates (12.5, 8, and 4 Gbps), with κ1 = 0.38, κ2 = 0.08, ρ = 0.01

Fig. 6.22 BER versus detuning (nm) for various ONoC sizes (16, 32, and 64 IPs) with a data rate of 12.5 Gbps, with κ1 = 0.38, κ2 = 0.08, ρ = 0.01

As the ONoC size increases (i.e. the number of micro-resonator switches and the
number of required resonant wavelengths increases), the photonic channel spacing
becomes smaller for the same FSR, and the photonic signal encounters a through
attenuation in a large number of micro-disks. This will increase the interference and
the router losses, which decreases the SNR and increases the BER. Figure 6.22
shows the BER for different ONoC sizes working with 12.5 Gbps data rate, and
corresponding to a roundtrip loss coefficient of 0.01 and coupling coefficients κ1 and κ2 of 0.38 and 0.08, respectively. Achieving an acceptable BER in a large-size
ONoC requires a larger FSR, which would impose more stringent constraints on the
design of the micro-resonator filters (both in terms of choice of parameters as well
as in the development of improved filter structures).

Conclusion

In this chapter, we have introduced the concept and the micro-architecture of a new
router called Electrical Distributed Router as a wrapper for the ONoC. We have also
presented a novel layered protocol architecture for the ONoC. The Network Layer
in the proposed protocol stack is flexible enough to accommodate various router
architectures realizing the same function. The performance of the ONoC layered
architecture has been investigated both at system level and at the physical level. In
our optical performance analysis, we explored and analyzed the SNR and the BER
of the ONoC against maximum detuning under various system specifications and
technology parameters. In passive-type ONoCs such as that under analysis, there is
a single central optical router and the buffering queues are located at the end of the
communication path (compared to the electrical NoC). Thus, resource contentions
are low in the case of the ONoC, and hence the performance is expected to be high.
The models and analyses described in this work bear out this conclusion. In particu-
lar, the performance analysis showed that the ONoC is capable of absorbing a high
level of traffic before saturation. Moreover, experimental results proved the scal-
ability of the ONoC and demonstrated that the ONoC is able to deliver bandwidth comparable to, or in large network sizes even better than, that of the ENoC.

References

1. Scandurra A, O'Connor I (2008) Scalable CMOS-compatible photonic routing topologies for versatile networks on chip. In: Proceedings of the 1st international workshop on network on chip architectures, Lake Como, Italy, pp 44-51
2. Brière M et al (2005) Heterogeneous modelling of an optical network-on-chip with SystemC. In: Proceedings of the 16th IEEE international workshop on rapid system prototyping (RSP), Montreal, Canada, pp 10-16
3. Brière M et al (2007) System level assessment of an optical NoC in an MPSoC platform. In: Proceedings of the IEEE design automation and test in Europe (DATE), Nice, France, pp 1084-1089
4. Gu H, Xu J, Wang Z (2008) ODOR: a microresonator-based high-performance low-cost router for optical networks-on-chip. In: Proceedings of the international conference on hardware-software codesign and system synthesis, Atlanta, GA, USA, pp 203-208
5. Gu H, Zhang W, Xu J (2009) A low-power fat tree-based optical network-on-chip for multiprocessor system-on-chip. In: Proceedings of the IEEE design automation and test in Europe (DATE), Nice, France, pp 3-8
6. Shacham A, Lee BG, Biberman A, Bergman K, Carloni LP (2007) Photonic NoC for DMA communications in chip multiprocessors. In: Proceedings of the IEEE symposium on high-performance interconnects, Stanford, CA, USA, pp 29-38
7. Benini L, Micheli GD (2002) Networks on chips: a new SoC paradigm. IEEE Comput 35(1):70-78
8. Zimmermann H (1980) OSI reference model: the ISO model of architecture for open systems interconnection. IEEE Trans Commun 28(4):425-432
9. Carara E, Moraes F, Calazans N (2007) Router architecture for high-performance NoCs. In: Proceedings of the 20th annual conference on integrated circuits and systems design, Rio de Janeiro, Brazil, pp 111-116
10. Dehyadgari M, Nickray M, Afzali-kusha A, Navabi Z (2006) A new protocol stack model for network on chip. In: Proceedings of the IEEE Computer Society annual symposium on emerging VLSI technologies and architectures, Karlsruhe, Germany, pp 440-441
11. Millberg M, Nilsson E, Thid R, Kumar S, Jantsch A (2004) The Nostrum backbone: a communication protocol stack for networks on chip. In: Proceedings of the 17th international conference on VLSI design, Mumbai, India, pp 693-696
12. Sgroi M et al (2001) Addressing the system-on-a-chip interconnect woes through communication based design. In: Proceedings of the 38th annual design automation conference, Las Vegas, NV, USA, pp 667-672
13. MIPI Alliance (2010) MIPI Alliance standard for unified protocol (UniPro). http://www.mipi.org. Accessed 30 Aug 2012
14. Jantsch A, Tenhunen H (2003) Networks on chip. Kluwer Academic, Dordrecht, pp 85-106
15. Vivien L, Osmond J, Fedeli JM, Marris-Morini D, Crozat P, Damlencourt JF, Cassan E, Lecunff Y, Laval S (2009) 42 GHz p.i.n Germanium photodetector integrated in a silicon-on-insulator waveguide. Opt Express 17:6252-6257
16. Lu Z (2007) Design and analysis of on-chip communication for network-on-chip platforms. Ph.D. dissertation, Department of Electronic, Computer and Software Systems, School of Information and Communication Technology, Royal Institute of Technology (KTH), Stockholm
17. Coppola M, Pistritto C, Locatelli R, Scandurra A (2006) STNoC: an evolution towards MPSoC era. In: Proceedings of the design, automation and test in Europe (DATE), Munich, Germany
18. Rijpkema E et al (2003) Trade-offs in the design of a router with both guaranteed and best-effort services for networks on chip. In: Proceedings of the IEEE design automation and test in Europe (DATE), Munich, Germany, pp 10350-10355
19. Wiklund D, Liu D (2003) SoCBUS: switched network on chip for hard real time embedded systems. In: Proceedings of the IEEE international symposium on parallel and distributed processing, Nice, France, pp 1-8
20. Draper JT, Ghosh J (1994) A comprehensive analytical model for wormhole routing in multicomputer systems. J Parallel Distrib Comput 23(2):202-214
21. SPLASH-2 benchmark (2010) http://www.capsl.udel.edu/splash/. Accessed 30 Aug 2012
22. Dally W, Towles B (2004) Principles and practices of interconnection networks. Morgan Kaufmann, San Francisco
23. Spuesens T, Liu L, Vries Td, Romeo PR, Regreny P, Thourhout DV (2009) Improved design of an InP based microdisk laser heterogeneously integrated with SOI. In: Group IV photonics 2009, San Francisco
Chapter 7
Reconfigurable Networks-on-Chip

Wim Heirman, Iñigo Artundo, and Christof Debaes

Abstract There is little doubt that the most important limiting factors of the
performance of next-generation chip multiprocessors (CMPs) will be the power
efficiency and the available communication speed between cores. Photonic net-
works-on-chip (NoCs) have been suggested as a viable route to relieve the off- and
on-chip interconnection bottleneck. Low-loss integrated optical waveguides can
transport very high-speed data signals over longer distances as compared to on-
chip electrical signaling. In addition, novel components such as silicon microrings,
photonic switches and other reconfigurable elements can be integrated to route
signals in a data-transparent way.
In this chapter, we look at the behavior of on-chip network traffic and show how
the locality in space and time which it exhibits can be advantageously exploited by
what we will define as slowly reconfiguring networks. We will review existing
work on photonic reconfigurable NoCs, and provide implementation details and a
performance and power characterization of our own reconfigurable photonic NoC
proposal in which the topology is adapted automatically (on a microsecond scale) to
the evolving traffic situation by use of silicon microrings.

Keywords Network-on-chip · Optical interconnects · Reconfigurable networks

W. Heirman (*)
Computer Systems Laboratory, Ghent University,
Sint-Pietersnieuwstraat 41, Gent, 9000, Belgium
e-mail: wim.heirman@ugent.be
I. Artundo
iTEAM, Universidad Politécnica de Valencia, Valencia, Spain
e-mail: iiarmar@iteam.upv.es
C. Debaes
Department of Applied Physics and Photonics, Vrije Universiteit Brussel, Brussel, Belgium
e-mail: christof.debaes@vub.ac.be


Introduction

Power efficiency has become one of the prime design considerations within today's
ICT landscape. As a result, power density limitations at the chip level have placed
constraints on further clock speed improvements and pushed the field into increased
parallelism. This has led to the development of multicore architectures or chip mul-
tiprocessors (CMPs) [20]. In the embedded domain, a similar evolution resulted in
the emergence of multi-processor systems-on-chip (MPSoCs), which combine sev-
eral general and special purpose processors with memory banks and input/output
(I/O) devices on a single chip [44].
As such, both CMPs and MPSoCs have begun to resemble highly parallel com-
puting systems integrated on a single chip. One of the most promising paradigm
shifts that has emerged in this domain are packet-switched networks-on-chips
(NoCs) [15]. Since interconnect resources in these networks are shared between
different data flows, they can operate at significantly higher power efficiencies than
fixed interconnect topologies. However, due to the relentless increase in required
throughput and number of cores, the links of those networks are starting to stretch
beyond the capabilities of electrical wires. In fact, some recent CMP prototypes
with eighty cores show that the power dissipated by the NoC accounts for up to 25%
of the overall power [40].
Meanwhile, recent developments in integrating photonic devices within CMOS
technology have demonstrated photonic interconnects as a viable alternative for
high performance off-chip and global on-chip communication [67]. This has sparked
interest among several research groups to propose architectures with photonic
NoCs [10, 12, 66]. Nevertheless, using optical links as mere drop-in replacements
for the connections of electronic packet-switched networks is not yet a reality.
Conversion at each routing point from the optical to the electrical domain and back
can be power inefficient and increase latency. But novel components, such as silicon
microring resonators [89], which can now be integrated on-chip, are opening new
possibilities to build optical, switched interconnection networks [49, 77].

Opportunities for Reconfiguration

In a first step, we will take a look at exactly how reconfiguration helps to improve
network latency and power requirements. As an initial approximation, energy usage
and packet latency increase mainly with the number of hops a network packet has to
travel. Once we've fixed the network's topology and the mapping of computational
threads to the processors at each network node, the characteristics of the resulting
network traffic have been mostly defined. In reconfigurable networks, one will
exploit certain properties of this network traffic to minimize the number of hops
packets have to travel. The following section will analyze these network traffic
properties in detail and describe how they can be used to trigger network optimiza-
tion through reconfiguration.

To do this we will look at network traffic at different time scales. At each of these
scales, a different mechanism is at play providing structure to the network traffic,
and, if understood by a network designer, providing insight into how traffic and
network interact. This in turn can lead to opportunities for improving network performance, lowering power usage, or increasing reliability.

A Note on On-Chip Versus Off-Chip Network Traffic

While a large body of existing work on network traffic locality is set in multi-chip
multi-processor systems such as servers or supercomputers, only more recent work
considers the same effect in on-chip settings. Indeed, parallel (super-)computing
has been in existence since the 1980s, and has had much time to mature as a research
field. Yet the growing number of cores per chip [85] will make the conclusions
drawn for off-chip networks valid for on-chip networks as well.
In fact, when compared to off-chip networks, the on-chip variants are usually
situated at an architectural level that is closer to the processor. The bandwidth and
latency requirements imposed on them are therefore much more stringent. Figure 7.1
shows the system-level architectural difference: on-chip networks mostly connect
between the L1 and L2 caches (Fig. 7.1, top), while off-chip networks are connected
after the L2 cache or even after main memory (Fig. 7.1, bottom). In multi-chip sys-
tems, a larger fraction of memory references can therefore be serviced without
requiring use of the interconnection network, yielding lower network bandwidth
and latency requirements.1 On-chip networks on the other hand will be used much
more often: each memory access that doesn't hit in the first-level cache, typically once every few thousand memory references for each processor, results in a remote memory operation, versus once every few million memory accesses for a typical
off-chip network. Additionally, the network latency that can be tolerated from an
on-chip network will be much lower, in the order of a few tens of nanoseconds,
versus multiple hundreds of nanoseconds for a typical off-chip network.

Network Traffic Locality

It is known that memory references exhibit locality in space and time, in a fractal or
self-similar way [24, 60]. This locality is commonly exploited by caches to improve
performance. Due to the self-similar nature of locality, this effect is present at all
time scales, from the very fast nanosecond scales exploited by first-level caches,
down to micro- and millisecond scales which are visible on the interconnection

1
Or, in a message-passing system, processors can work on local data for a longer time before mes-
sages need to be sent with new data.

Fig. 7.1 Architecture of a shared-memory multiprocessor, as a chip-multiprocessor with on-chip network (a) or in the traditional multi-chip implementation (b). L1$ and L2$ denote the first- and second-level caches; NI is the network interface. Dashed lines denote the chip boundaries. Note that the on-chip interconnection network sits at an architectural level that is much closer to the processors; it will therefore have much more stringent requirements on bandwidth and latency

network of a shared-memory (on-chip or multi-chip) multiprocessor. This behavior can be modeled as traffic bursts: these are periods of high-intensity communication
between specific processor pairs. These bursts were observed to be active for up to
several milliseconds, on a background of more uniform traffic with a much lower
intensity. These bursts can be caused both by context switches between different
applications [4], and by the applications themselves [36].
In [34], a study was made of the locality of communication, and its variance
through time. This was done by computing the Rent exponent of the network
traffic.2 Figure 7.2 shows the variation of the Rent exponent and the (relative)
per-node bandwidth through time, for water.sp, one of the SPLASH-2 bench-
marks, when run on a 64-node network. One can clearly see different phases

2
See [51] for the original description of Rent's law relating the number of devices in a subset of an electronic circuit to its number of terminals, [14] for a theoretical derivation of the same law, and [23] for an extension of Rent's rule which replaces the number of terminals with network bandwidth. In essence, a low Rent exponent (near zero) signifies very localized communication, such as nearest-neighbor only, while a very high Rent exponent (near one) denotes global, all-to-all communication.

Fig. 7.2 Estimated Rent exponent (left) and relative per-node bandwidth (right) through time for the water.sp benchmark run on 64 nodes (x-axis: simulation time in M cycles)

during the program's execution: periods with a high amount of long-distance (high Rent exponent) communication are alternated with phases of less intense, more localized communication.
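To make the notion of a bandwidth-based Rent exponent concrete, the sketch below estimates one from a traffic matrix: it groups nodes into contiguous blocks of growing size, measures the average traffic leaving a block, and fits log(external bandwidth) against log(block size). The contiguous grouping and the least-squares fit are simplifying assumptions, not necessarily the exact procedure used in the study cited above.

```cpp
// Hedged sketch of estimating a bandwidth-based Rent exponent from a traffic
// matrix: for growing block sizes, average the traffic crossing each block
// boundary and fit log(external bandwidth) versus log(block size); the slope
// is the exponent. Contiguous blocks and a least-squares fit are assumptions.
#include <cmath>
#include <cstdio>
#include <vector>

double rent_exponent(const std::vector<std::vector<double>>& traffic) {
    const int p = static_cast<int>(traffic.size());
    std::vector<double> xs, ys;                    // log(block size), log(ext. bandwidth)
    for (int block = 1; block < p; block *= 2) {
        double ext = 0.0;
        for (int start = 0; start < p; start += block)        // every block of 'block' nodes
            for (int i = start; i < start + block; ++i)
                for (int j = 0; j < p; ++j)
                    if (j < start || j >= start + block) ext += traffic[i][j];
        ext /= (p / block);                        // average external bandwidth per block
        xs.push_back(std::log(block)); ys.push_back(std::log(ext));
    }
    // Least-squares slope of ys versus xs.
    double n = xs.size(), sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (std::size_t k = 0; k < xs.size(); ++k) {
        sx += xs[k]; sy += ys[k]; sxx += xs[k] * xs[k]; sxy += xs[k] * ys[k];
    }
    return (n * sxy - sx * sy) / (n * sxx - sx * sx);
}

int main() {
    const int p = 64;
    std::vector<std::vector<double>> uniform(p, std::vector<double>(p, 1.0));
    for (int i = 0; i < p; ++i) uniform[i][i] = 0.0;   // no self-traffic
    std::printf("Rent exponent (uniform all-to-all traffic) ~ %.2f\n",
                rent_exponent(uniform));               // high exponent: global communication
}
```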

Context Switching

In systems where the number of threads is greater than the number of processors,
multiple threads are time-shared on a single processor. This is usually the case in,
for instance, web and database servers where context switches happen when a thread
needs to wait while an I/O operation is completed. Each time a processor switches
to a different thread, this new thread will proceed to load its data set into the cache.
This causes a large burst of cache misses. Sometimes all of the thread's data can be found in the local memory of the processor's node, but often remote memory
accesses are required. In this case, the thread switch causes a communication burst.
One such example is the case where a thread has just woken up because its I/O request was completed; the thread will now read or write new data on another node's memory or I/O interface.
A study of these context switch induced bursts was done in [4]. One experiment
time-shared multiple SPLASH-2 benchmarks [88] on the same machine, another
used the Apache web server loaded with the SURGE request generator [7] to study
an I/O-intensive workload. A clear correlation was found between context switches
and bursts. This is illustrated in Fig. 7.3, which shows the traffic generated by a
single node through time and the points where context switches occurred. Here, four
instances of the cholesky benchmark, with 16 threads each, were run on a single
16-node machine. Solid lines denote a context switch on this node, at this point a
burst of outgoing memory requests is generated to fill the local cache with the new
thread's working set. Dashed lines show context switches on other nodes. In some

Fig. 7.3 Traffic flow (MB/s) in and out of a single node through time (s), while running four 16-thread cholesky applications on a single 16-node machine. Solid arrows are shown when a context switch occurs on this node; dashed lines denote context switches on other nodes [4]

of these instances, the neighboring node generates a burst of accesses to memory on the local node, again resulting in a communication burst. Other bursts are due to
structure in the application as previously described.
Traffic bursts caused by context switches typically involve intense communica-
tion, and can be several milliseconds long. The opportunities for reconfiguration
are therefore similar to those for traffic bursts inherent to the application, as
described before. One added advantage of context switches is that they are more
predictable: the operating system's scheduler often knows in advance when a context switch will occur (at the end of the current thread's time quantum); at that moment a communication burst will most likely start at the node where the context
switch occurs.3 Also, if the new thread is known, the destination of the traffic burst
can be predicted. The burst is mostly caused by the threads working set being
moved into the processors cache. Mostly this working set is the same as, or only
slightly different from, the previous time the thread was running. The destination
of the bursts will therefore be the same as the last time the same thread was sched-
uled. This information can be used by the reconfiguration controller, to reconfigure
the network pro-actively, rather than reacting to measured network traffic.
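As a rough illustration of such pro-active reconfiguration, the sketch below shows a hypothetical scheduler hook that remembers where a thread's traffic went the last time it ran and requests elinks towards those nodes when it is scheduled again. The function and data-structure names are invented for this example and are not part of the architecture described later in this chapter.

```python
# Hypothetical scheduler hook; names and data structures are illustrative.
last_destinations = {}   # thread id -> nodes it communicated with last time

def on_context_switch(node, old_thread, new_thread, traffic_counters, request_elink):
    # Remember where the outgoing thread's traffic went during this quantum.
    last_destinations[old_thread] = {
        dst for dst, nbytes in traffic_counters.items() if nbytes > 0
    }
    # The burst that is about to start will mostly refill the new thread's
    # working set from the nodes it used before, so request elinks pro-actively.
    for dst in last_destinations.get(new_thread, set()):
        request_elink(src=node, dst=dst)
```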

3 Often, the operating system tries to avoid context switches at the same time on all nodes, as this would initiate communication bursts on all nodes simultaneously, which can easily saturate the whole network.

Scenarios

A lot of research currently being done in the context of MPSoC design revolves around system scenarios. This concept groups system behaviors that are similar from a multidimensional cost perspective (such as resource requirements, delay, and energy consumption) in such a way that the system can be configured to
exploit this cost similarity [21]. Often, scenarios can be traced back to a certain
usage pattern of the system. Modern cellular phones, for instance, can be used to
watch video, play games, browse the internet, and even make phone calls. Each of
these usage scenarios imposes its own specific requirements on the device, in terms
of required processing power, use of the various subcomponents (graphics, radios,
on-chip network), etc. At design-time, these scenarios can be individually opti-
mized. Mechanisms for predicting the current scenario at runtime, and for switching
between scenarios, are also being investigated.
The system configuration, which is the result of the system being operated in a
specific scenario, consists of the setting of a number of system knobs which allow
trade-offs to be made between performance and power, among other cost metrics.
One well-known technique used in this case is dynamic voltage and frequency scaling (DVFS), which changes the processor's clock speed and core voltage [45]. This allows a designer to choose between high-performance, high-power operation when needed to meet a real-time deadline, and low-power operation when possible. [21] describes an example system with an H.264 video decoder, which has a fixed per-frame deadline (at 30 frames per second) but a variable computational complexity per frame (depending on the video frame type, complexity, level of movement, etc.). By choosing the correct DVFS setting for each frame, the energy requirement
for the decoding of lower-complexity frames could be reduced by up to 75%, while
keeping the hardware fixed.
In this design pattern, network reconfiguration can easily be integrated as another
system knob. Communication requirements can be profiled at design-time [37],
while runtime scheduling and mapping can be done to optimize communication
flows and configure the network accordingly [62]. Changes to network parameters
(link speed and width) or topology (adding extra links) can thus be done in response
to system scenario changes.
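A minimal sketch of how such a scenario table might look, with network reconfiguration sitting next to DVFS as just another system knob. The scenario names, operating points, and link settings below are invented for illustration; they are not taken from [21], [37], or [62].

```python
# Illustrative scenario table: all values are made up for the example.
SCENARIOS = {
    "video_decode": {"freq_mhz": 800, "vdd": 1.0, "extra_links": [(0, 5)], "link_width": 64},
    "web_browsing": {"freq_mhz": 400, "vdd": 0.8, "extra_links": [],       "link_width": 32},
    "voice_call":   {"freq_mhz": 200, "vdd": 0.7, "extra_links": [],       "link_width": 16},
}

def apply_scenario(name, set_dvfs, set_network):
    """Apply both the processor knob (DVFS) and the network knob when the
    runtime predicts a scenario change."""
    cfg = SCENARIOS[name]
    set_dvfs(cfg["freq_mhz"], cfg["vdd"])
    set_network(cfg["extra_links"], cfg["link_width"])
```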

Algorithmic Communication Patterns

The application running on the multiprocessor machine executes a certain algorithm, which is split up among several processors. Each of the processors usually
works on a subset of the data. One example is the simulation of oceanic currents,
where each processor is responsible for part of the simulated ocean. Neighboring
parts of the ocean influence each other because water flows from one part to the

other. In the same way, information (current velocities and direction, water
temperature) flows between the processors responsible for these parts. Clearly, if the
processors themselves are neighbors on the communication network (i.e. connected
directly), this makes for very efficient communication because a large fraction of
network traffic does not need intermediate nodes. There is a similar communication
pattern in several other physical simulations, where data is distributed by dividing space into 1-D, 2-D or 3-D grids and communication mainly happens between
neighboring grid points. Other physical mechanisms, such as gravity, work over
long distances. Cosmic simulations therefore require communication among all
processors (although the traffic intensity is not uniform).
An important property here is how many communication partners each processor
has. In some cases, the number of communication partners is higher than the net-
work fan-out; or the topology, created by connecting all communication partners,
cannot be mapped to the network topology using single-hop connections only. Then,
some packets will have to be forwarded by intermediate nodes, making communica-
tion less efficient. For instance, when communication is structured as a tree, which
is the case for several sorting algorithms, it is not obvious how threads and data
should be placed on a ring network. In a client-server architecture, where one thread
is the server which answers questions from all other threads, the fan-out of the
server thread is extremely high. The node that runs this thread will never have an
equally high physical fan-out. In those cases, a large fraction of network traffic will
require forwarding.
Moreover, for some applications each node's communication partners change
through time. This happens for instance in algorithms where the work on each data set
is not equal, and redistribution of work or data takes place to balance the workload of
all processors. Another situation is in scatter-gather algorithms, in which data is dis-
tributed to or collected from a large number of nodes by a single thread, which will
thus communicate in turn with different nodes. And sometimes the data set of one
processor just does not fit in its local memory, and has to be distributed over several
nodes. In this case, for part of the data, external memory accesses are required.
Regularity in the application is again visible on the network as communication
bursts. Highly regular applications like the ocean simulation will have bursts,
between the nodes simulating neighboring parts of the ocean, that span the entire
length of the program. For other applications, communication is less regular, but
even there, bursts of significant lengths (several milliseconds) can be detected. They
can also be exploited by the same techniques that exploit bursts caused by other
mechanisms explored here.
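The hop-count argument above can be made concrete with a small sketch: under a row-major mapping of a 2-D grid decomposition onto a 2-D mesh, grid neighbors remain one hop apart, whereas the same placement on a ring forces forwarding through intermediate nodes. The mapping and numbers are illustrative only.

```python
def hops_2d_mesh(src, dst, cols):
    """Hop count (Manhattan distance) between two nodes of a cols-wide
    2-D mesh, assuming row-major placement of node indices."""
    sx, sy = src % cols, src // cols
    dx, dy = dst % cols, dst // cols
    return abs(sx - dx) + abs(sy - dy)

# A 4x4 ocean-style decomposition mapped row-major onto a 4x4 mesh:
assert hops_2d_mesh(5, 6, cols=4) == 1   # east/west neighbours: 1 hop
assert hops_2d_mesh(5, 9, cols=4) == 1   # north/south neighbours: 1 hop
# On a 16-node ring, the same north/south neighbours (nodes 5 and 9) would
# be 4 hops apart, so their traffic would need intermediate nodes.
```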

Application-Driven Reconfiguration

Another method of exploiting regular communication patterns in the application is to have the program(mer) specify these patterns, and reconfigure the network accordingly. Since this can be done at a high abstraction level (source code or

algorithmic level), by someone with a view of the complete program and algorithm
(programmer or compiler), it can be expected that this method allows for very
accurate prediction of the communication pattern, and would therefore result in the
largest gains. It does, however, require a large effort to analyze the application in
this way. Moreover, due to dependencies on the input data, it is not always possible
to predict, at compile time, a fraction of total communication that is large enough
to be of benefit.
A very early example of the approach of application-driven reconfiguration can
be found in 1982, when Snyder introduced the configurable, highly parallel (CHiP)
computer [79]. Its processing elements are connected to a reconfigurable switching
lattice, in which several (virtualized) topologies can be embedded, such as a mesh
for dynamic programming or a binary tree used for sorting.
Another example is the interconnection cached network (ICN) [26]. This
architecture combines many small, fast crossbars with a large, slow-switching
crossbar. By choosing the right configuration of the large crossbar, a large class
of communication patterns can be supported efficiently: meshes, tori, trees, etc.,
can be embedded in the architecture. The large crossbar thus acts (under control
of the application) as a cachehence the name ICNfor the most commonly
used connections. This approach is also used, to some extent, in the Earth Simulator [28]. Its architecture centers around a 640 × 640 crossbar, on which communi-
cation between 640 processing nodes occurs through application-defined circuits.
Inside each processing node, eight vector processors are connected through a
smaller but higher data-rate crossbar.
[8] built on the ICN concept, and describe a dual network approach. Long-lived
burst transfers use a fast optical circuit switching (OCS) network, which is reconfigured
using MEMS mirrors (with a switching time of a few milliseconds) under control of
the application. The other, irregular traffic, which is usually of a much lower volume, uses a secondary network, the Electronic Packet Switching (EPS) network,
which is implemented as a classic electrical network with lower bandwidth but higher
fan-outs, to obtain low routing latencies on uniform traffic patterns.

Previous Work on Multiprocessor Reconfigurable Optical Interconnects

Clearly, descriptions of network traffic locality, and the idea that networks can be
reconfigured to exploit this fact, have been around since the days of the first multi-
processors. Demonstration systems using optical reconfiguration were starting to be
built not much later.
In the free-space interconnects paradigm, [76] demonstrated, with the COSINE-1
system dating back to 1991, a manually reconfigurable free-space optical switch
operating with LEDs. This is to our knowledge the first demonstrator that showed
the possibilities of reconfigurable optical systems. Since then, technology has
advanced drastically and new and improved reconfiguration schemes have appeared

Fig. 7.4 Optical highway architecture [75]

in the research scene. In the following paragraphs, we will try to give an overview
of the state of the art of reconfigurable technology, showing that optical reconfiguration is becoming feasible in the near future in light of new studies and practical implementations.
There have been lots of proposals for reconfigurable optical architectures in the
past, but only a few of them have been realized in the form of demonstrators. In recent years, some have achieved remarkable results by implementing the reconfiguration in very different ways. One example is the
OCULAR-II system [58, 59], developed by the University of Tokyo and Hamamatsu
Photonics, which is a two-layer pipelined prototype in which the processing ele-
ments, with VCSEL outputs and photodetector input arrays, are connected via mod-
ular, compactly stacked boards. Between each of the layers there is a free-space
optical interconnection system, and by changing the phase pattern displayed on a
phase-modulating parallel-aligned spatial light modulator (SLM), the light paths
between the nodes can be dynamically altered with a speed of 100 ms.
The latest proposed OCULAR-III architecture [13] is a multistage interconnec-
tion network relying on fixed fiber-based block interconnects between stages. These
interconnections are based on modular, reusable and easy to align fiber-based
blocks. Network reconfiguration is based on electronics though, by setting the states
of local crossbars on the processing plane.
Another reconfigurable architecture constructed was the Optical Highway [75],
a free space design which interconnects multiple nodes through a series of relays
used to add/drop thousands of channels at a time (see Fig. 7.4). The architecture
considered here was a network-based distributed memory system (cluster style),
with a 670 nm laser diode as a transmitter and a diffractive optical element to pro-
duce a fan-out simulating a laser array. Polarizing optics defined a fixed network
topology, and a polarizing beam splitter deflected channels of a specific polarization
to the corresponding node, with each channel's polarization state determined by patterned half-wave plates. It can also be made reconfigurable using an SLM, allowing the beam path of a single channel to be switched by an electronic control signal and routed to only one of three detectors.
An alternative modular system was presented in 2002, with a powerful
optical interconnection network [1]. The solution is based on a generic optical

Fig. 7.5 MEMS pop-up mirrors [1]

Fig. 7.6 Reconfigurable switch concept architecture based on pop-up mirrors [1]

communication interface with a simple electronic router implemented in PCB technology. Together with optical switching using micro-electromechanical sys-
tem (MEMS) pop-up mirrors, it is possible to switch packets over reconfigurable
topologies at speeds of 700 ms per switch (see Figs. 7.5 and 7.6).
Also on the board-to-board level, the SELMOS system [90] was designed to be a
reconfigurable optical interconnect (ROI), whose core was built from a 3-D micro-
optical switching system and a self-organized lightwave network. Here, the
reconfiguration process was done with 2 × 2 waveguide prism deflector switches in a 1,024 × 1,024 Banyan network. Switching speed is estimated to be in the order of 450 ns, depending on the type of switches used. Self-organizing network formation worked by first arranging the optoelectronic devices with waveguides in a designed
configuration, stacking them to create a 3-D structure, and then introducing some
excitation to this structure, creating a self-aligned wiring coupling several wave-
guides (see Fig. 7.7). However, only simulations and partial experiments have been
realized, and a full working demonstrator is still to be constructed.
One of the most representative free-space interconnects is an adaptive optical
system built using an off-the-shelf commercial ferroelectric display panel at the
University of Cambridge [39]. Here, an 850 nm optically modulated channel from
a VCSEL at 1.25 Gb/s is steered using reconfigurable binary phase gratings dis-
played on a ferroelectric LC on silicon SLM (see Fig. 7.8). The reconfiguration
timescales here are in the order of milliseconds, as a single line of the LC is refreshed in 192 μs, for a total of 25 ms. The measured optical losses total 13.6 dB, sufficient to give a bit error rate (BER) of 10⁻¹² with current optical transmitter and
receiver technology.

Fig. 7.7 SELMOS system: photoresistive materials are put where vertical coupling paths will be formed,
and write beams through the waveguides construct a self-organized micro-optical network [90]

Fig. 7.8 Free-space reconfigurable optical system [39]

Another approach using liquid crystals (LC) is the work of [9], where reconfiguration is implemented using 8 × 4 multilevel phase-only holograms written in a nematic LC panel. The splitting diffraction efficiency achieved is rather low (15%), as is the switching time of 100 ms at an operational wavelength of 620 nm.

In the OSMOSIS demonstrator [38], a low-latency, high-throughput, scalable optical interconnect switch for HPC systems was introduced that features a
broadcast-and-select architecture based on wavelength- and space-division mul-
tiplexing. It makes use of semiconductor optical amplifiers (SOAs) combining 8
wavelengths on 8 fibers with two receivers per output, supporting 64 nodes with
a line rate of 40 Gb/s per node and operating on fixed-length packets with a
duration of 51.2 ns.
But one of the closest approaches to our own proposed architecture, which will be
described in section A Self-adapting On-Chip Photonic Interconnect Architecture, is
the λ-connect system [70], developed at the Lawrence Livermore National Laboratory. The idea behind the λ-connect is to interconnect multiple nodes in a network using a
simple broadcast-and-select architecture in combination with wavelength selective
routing. The nodes in the network represent either a board within a rack or a complete
rack system in itself. Each node consists of one or more CPUs, associated local mem-
ory and a cache controller. Additionally each node has an O/E interface, a multi-wave-
length transmitter that can transmit on one of the two system wavelengths and a receiver
preceded by a fixed wavelength optical filter. This filter selects a single information
channel from the incoming multi-wavelength signal. The nodes are interconnected via
a passive optical star coupler, which is physically implemented as a multi-mode optical
fiber ring that connects all the nodes. Communication between the nodes is accom-
plished optically with 12 parallel independent information channels on different wave-
lengths being simultaneously broadcast to many nodes (in the order of 100) through the
fiber ring network. In this arrangement, wavelength division multiplexing (WDM) cre-
ates multiple concurrent logical bus channels over the common physical medium.
Messages are routed at the source simply by selecting the transmission wavelength. If
the number of system wavelengths is equal to the number of nodes, only a
single hop exists between the nodes and this architecture then functions as a fully non-
blocking optical cross-connect where contention only arises when two nodes need to
transmit to the same receiving node.
At the emitter of every node, two VCSEL arrays emitting at 814 and 846 nm are
mounted in close proximity so that every couple of VCSELs shine directly into the
same fiber. At any time, only one of the wavelengths is selected to broadcast the data
by electrically driving the appropriate VCSEL. Each transmitter emits 2 mW of
optical power and is modulated at 1.25 Gb/s. The receiver side of every node has a
WDM filter module based on Distributed Bragg reflectors (DBR) and anti-reflection
coatings. The optical signal of every fiber is split in the filter module to four differ-
ent detectors with a 1.6 dB insertion loss and 23 dB crosstalk. The wavelength of
every channel in the optical signal is spectrally spaced 10 nm apart.
The adoption of an optical broadcast scheme has the inherent disadvantage that the optical power of the emitter is split among the N nodes of the system, so that each receiver
only receives 1/N of the total optical power, not yet including the excess insertion
losses. The use of an array waveguide grating (AWG) router would improve the
optical power budget because all of the optical power in each transmitted signal
would be sent to its intended recipient. However, the large wavelength channel spac-
ing needed in coarse WDM (CWDM) prohibits the use of AWGs.

Fig. 7.9 Architectural overview of RAPID. Every node is connected to two scalable intercon-
nects: an optical intraboard interconnect and a scalable remote superhighway [48]

The mounting of the optical components is a technological challenge because the optical components in the system require a lateral alignment accuracy of 2 μm, and
therefore active alignment is necessary to implement the system.

Reconfigurable Optical Interconnect (ROI) Architectures

Apart from these implementations, there have been very interesting architecture models proposed in recent years that, to our knowledge, have not been implemented yet. One example is RAPID, a high-bandwidth, low-power, scalable and reconfigurable optical interconnect [48]. It is an all-photonic passive network, composed of tunable VCSELs,
photodetectors, couplers, multiplexers, and demultiplexers (see Fig. 7.9). It provides
large bandwidth by using WDM and space division multiplexing (SDM) techniques,
and combining them into a multiple WDM technique that needs fast switching times,
in the order of nanoseconds, over 2-D torus, hypercube, and fat-tree topologies.
There has also been some work on reconfigurable buses, the linear array with a reconfigurable pipelined bus system (LARPBS) [74] being the best example of a complete architecture, although again a complete implementation has not been realized yet. It is a fiber-based optical parallel bus model that uses three folded waveguides, one for message passing and the other two for addressing via the coincident pulse technique. The reconfigurability in this model is provided by pairs of 2 × 2
bus-partition optical switches, located between each processor, that can partition the
system into two subsystems with the same characteristics at any of these pairs of
switches by introducing some conditional delay.
The HFAST [46] architecture is an MPI HPC interconnect, not targeting
shared-memory applications, but still interesting enough to comment here from
an architectural point of view. HFAST attempts to minimize the number of

optical transceivers and switch ports required for a large scale system design,
since the transceivers are both expensive and power hungry, and uses circuit
switches for wiring the packet switches together. It tries to minimize pathways
that have bandwidth contention measuring explicit MPI communication pat-
terns rather than shared-memory cache-line transfers. HFAST is based on the
observation that short messages are strictly latency bound and benefit from a
completely different low-power network layer since they rarely hit the band-
width limits of the network. Thus, the problem for the big messages is reduced to a pure bandwidth-contention minimization problem.
Other architectures with shared or switched topologies include the simultane-
ous optical multiprocessor exchange bus (SOME-bus) from Drexel University
[47], the optical centralized shared bus from the University of Texas at Austin
[29], and Columbia University's data vortex optical packet switching interconnec-
tion network [30].
Finally, it has also been suggested at Washington University to dynamically reconfigure a router switch fabric using optical chip-to-chip communication but CMOS technology for decision, control, and signal switching functions [50]. The obtained speedup in packet latency of 1.71 for a 400 ms reconfiguration period illustrates the clear potential of slow reconfiguration techniques.
Moreover, since an optical channel can offer very high aggregate bandwidth, one can also use techniques such as fixed time-division multiplexing, as proposed in
[73] with a technique called reconfiguration with time division multiplexing
(RTDM). With RTDM, only a subset of all possible connections, as required by the
running applications, needs to be optically multiplexed in the network, letting the
network go through a set of personalized configurations.
As a summary, we include in Table 7.1 a brief comparison of the different
reconfigurable optical interconnects presented in this section, according to several
key parameters.

Previous Works on Reconfigurable NoC

A system-on-chip (SoC) platform can contain many different intellectual property (IP) blocks, including RAMs, CPUs, DSPs, IOs, FPGAs and other coarse and fine
grained programmable IP-blocks. Therefore, an optimal NoC architecture that
adapts to all the blocks and the running applications is desirable from the perfor-
mance and the power consumption points of view.
Reconfiguration in a NoC can be done in very different ways, but up to now,
there are three main techniques that have been proposed in the literature. First, by
modifying the assignment of the processing cores to the network nodes, most usu-
ally in a mesh type topology. Secondly, by adapting the network devices, such as
buffers, links, or routers, according to the specific application running on the system.
And third, by establishing adaptable virtual channels (VCs) over a fixed physical
topology to route traffic streams or packets in an optimal way.

Table 7.1 Summary of reconfigurable optical interconnect demonstrators


System                              Technology used                                  Reconfiguration time
OCULAR-II [9]                       SLM                                              100 ms
Optical highway [75]                Diffractive and polarizing optics with LC+SLM    (ms scale)
Free-space adaptable system [39]    LC + SLM                                         25 ms
Modular system on PCB [1]           MEMS mirror switches                             700 ms
SELMOS [90]                         Prism deflector switches                         5 ms
OSMOSIS [38]                        Broadcast and select WDM/SDM                     Packet switching (50 ns)
λ-Connect [70]                      Broadcast and select WDM                         Packet switching (ns scale)

Fig. 7.10 Mapping and routing of different processes into a tiled-mesh topology [41]

The first approach considers the three-step design flow in systems-on-chip, where
each application is divided into a graph of concurrent tasks and, using a set of avail-
able IPs, they are assigned and scheduled. Here, a mapping algorithm decides to
which tile each selected IP should be mapped such that the metrics of interest are
optimized. For this approach, mesh topologies are mostly used, due to their regular
two dimensional structure that results in IP re-use, easier layout, and predictable
electrical properties. [41] uses a branch-and-bound mapping algorithm to construct
a deadlock-free deterministic routing function such that the total communication
energy is minimized (Fig. 7.10). Others, like the NMAP heuristic algorithm [63],
optimize bandwidth by splitting the traffic between the cores across multiple paths,
and [5] use a genetic-algorithm-based technique. In [81], the communication energy is minimized subject not only to bandwidth constraints but also to latency constraints. Here, the energy consumption of the input and output ports at each router node varies linearly with the injection and acceptance rates.

Fig. 7.11 Network-on-chip for an MPEG4 decoder, as an example of application-specific optimization of a mesh network. From left to right: a network with regular mesh topology (a), a specialized version with a large switch (s8) to the SDRAM, which is used by all other cores (b), and an alternate implementation which is an optimized mesh with superfluous switches and switch I/Os removed (c). From [43]

The second approach considers the individual nodes of an SoC to be heterogeneous in nature, with widely varying functionality and communication requirements.
Therefore, the communication infrastructure should optimally match the communica-
tion patterns among these components accounting for the individual component needs.
By modifying the NoC nodes and their interactions, communication performance can
be maximized to the running application. An example of this is presented in [11],
where a customization of the NoC is done by first mapping the processing nodes so as
to minimize spatial traffic density, then removing unnecessary mesh links and switch-
ing nodes, and finally allocating bandwidth to the remaining links and switches accord-
ing to their relative load, so that link utilization is balanced. However, this customization is done at design time, and cannot be modified later on. Another example of trying
to match the network to communication patterns is done with the PipesCompiler [43],
which instantiates a network of building blocks from a library of composable soft mac-
ros (switches, network interfaces and links). The network components are optimized
for that particular network and support reliable, latency-insensitive operation, obtain-
ing large savings in area, power and latency (Fig. 7.11). A way to efficiently utilize the
full bandwidth of a NoC is by the use of flow control algorithms, but they commonly
rely on local information, or suffer from large communication overhead and unpredict-
able delays, unacceptable for NoC applications. [64] proposes a NoC-adapted scheme
that controls the packet injection rate in order to regulate the number of packets in the
network. Another alternative to maximize communication performance by modifying
the network is to increase or decrease channel buffer depth at each node router by ana-
lyzing the traffic characteristics of a target application [42].

Fig. 7.12 The network nodes of the ReNoC system consist of a router that is wrapped by a topol-
ogy switch, allowing for different logical topologies [82]

Fig. 7.13 Mesh network with additional long-range links inserted according to hotspot traffic
measurements [65]

Considering fully reconfigurable NoC topologies, [82] introduce ReNoC, a logical topology built on top of the real physical architecture, with the reconfigurability inserted as a layer between routers and links (Fig. 7.12). The logical topology is
configured in a circuit-switched fashion by the running application in an initializa-
tion phase, just before it starts. This allows the use of an optimal and energy-efficient
topology switch configuration by combining packet-switching and physical circuit-
switching within the same NoC. [83] makes a full analysis on top of ReNoC of such
reconfigurable NoC by synthesizing application specific topologies, mapping them
onto the physical architecture, and creating deadlock free, application specific rout-
ing algorithms. [19] also extends ReNoC by using packet switching along with
optical circuit switching (OCS). A similar technique of introducing long links in the
topology has been explored in [65] too, allowing connections that span many rout-
ers to bypass them and hence decrease the amount of traffic in the intermediate
routers (Fig. 7.13). [55] reports that delays are decreased by 85% and the energy by
70% by bypassing FIFO buffers and synchronization logic in a similar architecture.
Another example of a physical circuit-switched NoC is [87], where connections can
be set up directly between IP blocks. The connections are configured using a sepa-
rate packet-switched network, which is also used for best-effort traffic, although they cannot be shared, thus creating two separate networks.

Fig. 7.14 The Nostrum backbone with the application resource network interface (RNI), that
maps processes to resources in a mesh architecture [61]

Finally, a third approach tries to adapt the communications among the nodes to
the network infrastructure by the creation or reservation of virtual resources over the
physical topology, to maximize and guarantee application performance in terms of
bandwidth and latency delivered. Most of the time, virtual channels (VCs) are cre-
ated as a response to quality-of-service (QoS) demands from applications, corre-
sponding to a loose classification of their communication patterns into four classes:
signaling (for inter-module control signals), real time (representing delay-con-
strained bit streams), read/writes (modeling short data access) and block transfers
(handling large data bursts). For example, the technique of spatial division multi-
plexing (SDM), used in [56], consists of allocating only a subset of the link wires to
a given virtual circuit. Messages are digit-serialized on a portion of the link (i.e.
serialized on a group of wires). The switch configuration is set once and for all at
the connection setup. No configuration memory inside the router is therefore needed
and the constraints on the reservation of the circuits are relaxed. [27] introduces a
simple static timing analysis model that captures virtual channeled wormhole net-
works with different link capacities and eliminates the reliance on simulations for
timing estimations. It proposes an allocation algorithm that greedily assigns link
capacities using the analytical model so that packets of each flow arrive within the
required time. The temporally disjoint networks (TDNs) of [61] are used in order to
achieve several privileged VCs in the network, alongside the ordinary best-effort traffic. The TDNs are a consequence of the deflective routing policy used, and give rise to
an explicit time-division-multiplexing within the network (Fig. 7.14). The NoC
described in [17] provides tight time-related guarantees by a dynamic link arbitra-
tion process that depends on the current traffic and maximizes link utilization.

Photonic Reconfigurable NoCs

A photonic implementation of the previously discussed ReNoC architecture has been proposed in [19], where the photonic architecture is actually a logical
topology built upon the real physical 2-D mesh, according to different communi-
cation patterns of the running applications. It makes use of packet switching
combined with optical circuit switching (OCS) to avoid the delays introduced by
pure packet queuing. Long photonic links can be set between routers, bypassing

Fig. 7.15 Implementation of the physical architecture of RePNoC and logical topology of an
application pattern [19]

this way intermediate nodes and optimizing application demands, and latency
performance simulations in this case show a 50% decrease compared to a static
photonic NoC (Fig. 7.15).
The basic building blocks to introduce dynamism on a network are switches and
routers. On a photonic NoC though, these elements must have very limited space
and power requirements, and must have a good integration with the processing and
memory elements. That is why silicon photonics poses itself as an ideal candidate
for integration here. [52] gives an overview of state-of-the-art silicon modulators
and switches, with modulation speeds of 4 Gb/s and switching speeds of down to 1 ns in compact (10 μm) configurations of 1 × 2, 2 × 2, and 4 × 4, based on microring resonators pumped at 1.5 μm [53, 72] (Fig. 7.16). Active devices have been realized
in InP as well though, like the 1 × 16 phased-array switch presented in [80], with a
more modest response time of 11 ns (Fig. 7.17). On a slower timescale, [84] intro-
duce the use of electro-optic Bragg grating couplers to demonstrate a reconfigurable
waveguide interconnect with switching times of 75 ms operating at 850 nm.

A Self-adapting On-Chip Photonic Interconnect Architecture

Lacking a cheap and effective way of optically controlling the routing (and doing
possible buffering), most of the approaches described above necessarily work in a
circuit-switched way. And while the actual switching of the optical components can
nowadays be done in mere nanoseconds or less [18], the set-up of an optical circuit

Fig. 7.16 Conceptual art of a proposed photonic NoC stack [52], with dedicated computation, storage, and optical communication planes

Fig. 7.17 Micrograph of the 1 × 16 optical switch. The total device size is 4.1 mm × 2.6 mm,
including the input/output bends and lateral tapers [80]

still requires at least one network round-trip time, which accounts for several tens of
nanoseconds. This means that such proposals only reach their full potential at large
packet sizes, or in settings where software-controlled circuit switching can be used
with relatively long circuit lifetimes. Indeed, in [77], packets of several kilobytes are
needed to reach a point where the overhead of setting up and tearing down the opti-
cal circuits (which is done with control packets sent over an electrical network), can
be amortized by the faster optical transmission.
In SoC architectures, and to a lesser extent in CMPs, large direct memory access
(DMA) transfers can reach packet sizes of multiple KB. However, most packets are
coherency control messages and cache line transfers. These are usually latency
bound and very short. In practice, this would mean that most of the traffic would not
be able to use the optical network, as it does not reach the necessary size to com-
pensate for the latency overhead introduced, and that the promised power savings
could not be realized!4

4 One might consider using a larger cache line size to counter this, but an increase to multiple kilobytes would in most cases only result in excessive amounts of false sharing, negating any obtained performance increase.

We propose to use the combination of the electrical control network and the
optical circuit-switched links as a packet-switched network with slow
reconfiguration. This idea is based on existing work such as the Interconnection
Cached Network [26], or see [8] for a modern application of the same idea. But
rather than relying on application control of the network reconfiguration, which
requires explicit software intervention and does not agree with the implicit com-
munication paradigm of the shared memory programming model, our approach
provides for an automatic reconfiguration based on the current network traffic.
This concept has been described in [2], and was proven to provide significant per-
formance benefits in (off-chip) multiprocessor settings. Here, we will apply the
same approach to on-chip networks, and model the physical implementation on the
architecture introduced by [71, 77].

Physical Architecture

The photonic NoC proposed by [71] introduces a non-blocking torus topology, con-
necting the different cores of the system, based on a hybrid approach: a high-bandwidth
circuit-switched photonic network combined with a low-bandwidth packet-switched
electronic network. This way, large data packets are routed through a time and wave-
length multiplexed network, for a combined bandwidth of 960 Gb/s, while delay-criti-
cal control packets and data messages with sizes below a certain threshold are routed
through the low-latency electrical layer. As the basic switching element, a 4 × 4 hitless
silicon router is presented by [78], based on eight silicon microring resonators with a
bandwidth per port of 38.5 GHz on a single wavelength configuration.
An example 16-node architecture is depicted in Fig. 7.18. Each square repre-
sents a 4 × 4 router containing eight microring resonators. In this architecture, each node has a dedicated 3 × 3 router to inject and eject packets from the network, rep-
resented by the smaller squares. The network nodes themselves are represented by
discs. By means of the electronic control layer, each node first sends a control
packet to make the reservation of a photonic circuit from source to destination.
Once this is done, transmission is done uninterrupted for all data packets. To end
the transmission phase, a control packet is sent back from the destination to free
the allocated resources.
For our architecture, we combine a standard electrical network-on-chip with a
dedicated reconfigurable photonic layer, formed by the architecture proposed by [71]. The photonic layer will establish a set of extra links in a circuit-switched
fashion for certain intervals of time, depending on automated load measurements
over the base topology. The reconfiguration will follow slowly-changing dynamics
of the traffic, while the base electronic network layer will still be there to route con-
trol and data messages.
Other architectures, similar to [71], have been proposed and can be interchanged
as the physical layer on which to apply our slow reconfiguration architecture. For

Fig. 7.18 16-node non-blocking torus [71]. Squares represent optical routers based on microring resonators, and network nodes are represented by discs. The electrical control (or base) network, which is a 2-D torus overlaid on the optical network, is not shown here for clarity

instance, [25] avoids the need for an electrical control layer by sending all packets
through an all-optical network using different wavelengths. Still, the separation
between control and data layers, even when they are sent through the same physical
channels, is maintained. Our approach applies to any network architecture where
this distinction is kept, as the reconfigurable layer can be virtually established irre-
spective of the underlying physical implementation.

Using Traffic Locality to Trigger Reconfiguration

As described in section Opportunities for Reconfiguration, network traffic contains a large amount of intrinsic, yet poorly exploited locality. From this observation
came the idea to use slowly reconfigurable but high (data) speed optical components
to establish temporary extra links, providing direct connections between pairs of
processor cores that are involved in a communication burst. Other communication,
which is not part of a burst (or is a lower-intensity burst, when the hardware supports fewer extra links than there are bursts at a given time), will be routed through a standard packet-switched (optical or electrical) network (the base network, see
Fig. 7.19). The positions of the extra links are re-evaluated over time as old bursts
stop and new ones appear.
We have previously evaluated this concept in the context of shared-memory serv-
ers and supercomputers, and proposed an implementation using low-cost optical
components [2]. Since then, multicore technology has enabled the integration of a
complete shared-memory multiprocessor on a single chip. At the same time, on-
chip reconfigurable optical interconnects became a reality, using the integration
possibilities allowed by the emerging field of silicon photonics [6, 86].

Fig. 7.19 Reconfigurable network topology. The network consists of a base network (a 2-D mesh in this example), augmented with a limited number of direct, reconfigurable links (which are made up of the reconfigurable optical layer from Fig. 7.18)

Proposed Reconfigurable Network Architecture

Our network architecture, originally proposed in [32], starts from a base network
with fixed topology. In addition, we provide a second network that can realize a
limited number of connections between arbitrary node pairs: the extra links, or elinks. A schematic overview is given in Fig. 7.19.
The elinks are placed such that most of the traffic has a short path (a low number of
intermediate nodes) between source and destination. This way, a large percentage of
packets has a correspondingly low (uncongested) latency. In addition, congestion is
lowered because heavy traffic is no longer spread out over a large number of intermedi-
ate links. For the allocation of the elinks, a heuristic is used that tries to minimize the
aggregate hop distance traveled multiplied by the size of each packet sent over the net-
work, under a set of implementation-specific conditions: these can be the maximum
number of elinks n, the number of elinks that can terminate at one node (the fanout, f),
etc. After each interval of length Dt (the reconfiguration interval), a new optimum topol-
ogy is computed using the traffic pattern measured in the previous interval. A more
detailed description of the underlying algorithms can be found in [31].
Although the actual reconfiguration, done by switching the microrings, happens
in mere picoseconds, the execution time of the optimization algorithm, which
includes collecting traffic patterns from all nodes and distributing new configuration
and routing data, cannot be assumed negligible. The time this exchange and calcula-
tion takes will be denoted by the selection time (t_Se). The actual switching of optical reconfigurable components will then take place during a certain switching time (t_Sw),

after which the new set of elinks will be operational. Traffic cannot be flowing
through the elinks while they are being reconfigured. Therefore, the reconfiguration
process starts by draining all elinks before switching any of the microrings. This
takes at most 20 ns (the time to send our largest packet, which is 80 bytes, over a
40 Gbps link). During the whole reconfiguration phase, network packets can still
use the base network, making our technique much less costly than some other more
intrusive reconfiguration schemes, where all network traffic needs to be stopped and
drained from the complete network during reconfiguration.
The reconfiguration interval, denoted by Δt, must be chosen as short as possible to be
able to follow the dynamics of the evolving traffic and get a close-to-optimal topology.
On the other hand, it must be significantly larger than the switching time of the chosen
implementation technology to amortize the fraction of time that the elinks are off-line.
Gathering traffic information for each of the nodes to compute the optimal net-
work configuration is straightforward if each node can count the number of bytes
sent to each destination. Collecting this data at a centralized arbiter over our high-
performance interconnect only takes one network round-trip time. Finally, compu-
tation needs to be done on this data at the centralized unit. This computation is
largely based on heuristics and pre-computed tables, and can therefore quickly
determine a near-optimal elink configuration and its corresponding routing tables.
We assume that this selection algorithm can be executed on one of the system's
processors, and even for a 64-node network we expect this to take only a few micro-
seconds. Of course, this will only hold for slowly-reconfiguring networks, where
the reconfiguration interval is long enough to amortize this delay.
If we want to reduce the reconfiguration interval even further, we will have to
move to a decentralized scheme, where traffic information is spread locally to
neighboring nodes only, and the selection mechanism is done at each processor with
just local information.

Mapping the Reconfigurable Architecture onto the Photonic Network

Applying this architecture to the specifics of a NoC, we can consider the network
presented in [71] as being an instantiation of our general reconfigurable network
model, where the number of elinks n equals the number of processing nodes p, and
with a maximum fan-out per node of one (n = p, f = 1). This way, each extra link
would be considered as a dedicated circuit of the non-blocking mesh. The
reconfiguration interval, Δt, was fixed to 1 μs.
With optical components that can switch in the 30 ps range, the switching time (t_Sw) will only take a negligible fraction of the reconfiguration interval Δt. However, the selection time (t_Se) will remain significant as it requires exchange of data over the network.
We propose a scheduling where we allow the selection to take up to a full
reconfiguration interval. The three phases (shown in Fig. 7.20) of collecting traffic
information (measure), making a new elink selection (select), and adjusting the

Fig. 7.20 Sequence of events in the on-chip reconfigurable network. During every reconfiguration
interval of 1 μs, traffic patterns are measured. In the next interval, the optimal network configuration
is computed for such patterns. One interval later, this configuration is enabled. The reconfiguration
itself takes place at the start of each configure box, but the switching time is very short (just 2% of
the reconfiguration interval in this architecture) and is therefore not shown here

network with this selection (configure) are performed in a pipelined fashion, where each phase uses the results (traffic pattern or elink selection) of the previous interval. This adaptation of Shacham's NoC architecture using microrings has been further developed in [3, 16].
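A minimal sketch of this pipelined schedule, with placeholder measure/select/configure routines; it only illustrates the two-interval delay between measuring a traffic pattern and enabling the elinks selected from it (each interval being Δt, 1 μs in this instantiation).

```python
def reconfiguration_loop(measure, select, configure, n_intervals):
    """Pipelined measure/select/configure schedule (cf. Fig. 7.20).

    Traffic measured during interval k is used to compute an elink selection
    during interval k+1, which is switched in at the start of interval k+2.
    measure/select/configure are placeholders for the mechanisms in the text.
    """
    selected = None    # elink set computed during the previous interval
    measured = None    # traffic pattern measured during the previous interval
    for _ in range(n_intervals):
        if selected is not None:
            configure(selected)            # enable elinks chosen last interval
        # select() and measure() conceptually run concurrently during this
        # interval; calling them sequentially is only a modelling convenience.
        next_selected = select(measured) if measured is not None else None
        measured = measure()               # traffic observed this interval
        selected = next_selected
```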

Extra Link Selection

For every reconfiguration interval, a decision has to be made on which elinks to activate, within the constraints imposed by the architecture, and based on the expected traffic during that interval. In our current implementation, the traffic is expected to be equal to the traffic that was measured two intervals ago; this avoids the need for a complicated and time-consuming prediction algorithm. As
explained in section Proposed Reconfigurable Network Architecture, we want
to minimize the number of hops on the (electronic) base network for most of the
traffic. We do this by minimizing a cost function that expresses the total number
of network hops traversed by all bytes being transferred. This cost function can be
written as:

C = Σ_{i, j} d(i, j) T(i, j)    (7.1)

with d(i, j) the distance between nodes i and j, which is a function of the elinks that
are selected to be active, and T(i, j) the number of bytes sent from node i to node j in
the time interval of interest.
Since the time available to perform this optimization is equal to the
reconfiguration time (1 μs here), we use a greedy heuristic that can quickly find a
set of active elinks that satisfies the constraints imposed by the architecture, and
has an associated cost close to the global optimum. More details on this algorithm
can be found in [33].
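The following sketch illustrates a greedy heuristic of this kind; it is not the algorithm of [33]. It only credits the direct traffic between an elink's two endpoints and assumes the elink reduces their distance to a single hop, which simplifies the cost function of Eq. (7.1).

```python
import itertools

def greedy_elink_selection(traffic, base_dist, n_links, fanout):
    """Greedily pick up to n_links elinks that remove the most byte-hops.

    traffic[i][j]  : bytes sent from i to j in the measured interval
    base_dist(i,j) : hop count between i and j on the base network
    fanout         : maximum number of elinks terminating at one node
    Illustrative only; the real heuristic and constraints are in [33].
    """
    nodes = range(len(traffic))
    elinks = []
    fan = {i: 0 for i in nodes}
    for _ in range(n_links):
        best, best_gain = None, 0
        for i, j in itertools.combinations(nodes, 2):
            if fan[i] >= fanout or fan[j] >= fanout:
                continue
            # byte-hops saved if i<->j traffic travels one hop instead
            saved = (traffic[i][j] + traffic[j][i]) * (base_dist(i, j) - 1)
            if saved > best_gain:
                best, best_gain = (i, j), saved
        if best is None:
            break
        elinks.append(best)
        fan[best[0]] += 1
        fan[best[1]] += 1
    return elinks
```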

Network Delivery Order, Deadlock Avoidance and Routing

In some cases, our reconfigurable network will deliver messages out of order. During normal operation this is not possible, as routing happens deterministically, and (even when an elink is used) only a single route is used between each
node pair. For network packets that are in flight during a reconfiguration, this
guarantee cannot be made, however. Consider a network packet that is sent by
node A to node B, for which no shortcut elink exists at the time. Just after the
packet leaves A, an elink does come online between A and B (or some part of the
way, but branching at a point where the first packet has already passed through).
A second packet from A to B can now use the elink, and possibly arrive at B
before the first packet arrives.
Rather than including a complete (and very expensive) reordering mechanism
in our network routers, it proved sufficient to include a small hardening patch in the
cache coherency protocol. When considering network packets related to a single
memory address, this address's home node and all caches that may make requests to
this home node operate in lockstep for most of the time: the cache makes a request
and the home responds, or vice versa. No two packets between any given node pair
(corresponding to the same address) will ever be in flight on the network. The only
exception to this is when an exclusive grant is on its way from the directory on the
home node to the cache, and a writeback request follows closely behind it (due to
the fact that some other node now wants to write to the same cache line). When
these two messages are reordered, the cache will first receive a writeback request,
while it is still waiting for its exclusive access to be granted. This situation is easily
detectable by the cache controller, however; the solution is simply that the writeback request must be recorded (this takes one bit in the cache's miss status holding regis-
ter) and that the data should be written back immediately after the exclusive grant
arrives and the write operation has been performed. Reordering of messages relat-
ing to different cache lines can be tolerated in any case, since these operations
already happen in parallel (note that we operate under release consistency, if there
is any synchronization to be made between operations on different cache lines this
is to be taken care of by using memory barrier instructions; these occur at the pro-
cessor level and have no effect here).
To avoid deadlocks on our network, we use two main mechanisms. Dimension
order routing (DOR) can be used on the base network since it guarantees dead-
lock-freedom on all regular mesh and torus networks. This leaves only the possi-
bility of deadlocks between packets using the elinks. Each packet can go through
just one elink on its path. After that, it switches to another virtual channel (VC).5
We assign a higher priority to the VC used after traversing the elink; this guarantees
forward progress.

5 Actually another set of VCs is used since separate request and reply VCs are already employed to avoid fetch deadlocks at the protocol level.

For routing packets through the elinks we use a static routing table: when
reconfiguring the network, the routing table in each node is updated such that, for
each destination, it tells the node to route packets either through an elink starting at
that node, to the start of an elink on another node, or straight to its destination, the
latter two using normal dimension order routing.
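A sketch of what the per-node lookup could look like, covering the three cases above together with the virtual-channel switch used for deadlock avoidance. The table encoding and names are invented for the example.

```python
# Illustrative per-node routing table; the encoding is not from the text.
# route_table[dest] is one of:
#   ("elink", j)  -- take the elink that starts at this node and ends at j
#   ("via", k)    -- dimension-order route towards node k, where a useful elink starts
#   ("dor", None) -- plain dimension-order routing all the way to the destination
def next_action(route_table, dest):
    kind, target = route_table.get(dest, ("dor", None))
    if kind == "elink":
        # After traversing the elink the packet switches to the higher-priority
        # VC, which together with DOR on the base network prevents deadlock.
        return ("take_elink", target, "vc_high")
    if kind == "via":
        return ("dor_towards", target, "vc_low")
    return ("dor_towards", dest, "vc_low")
```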

Network Evaluation Methodology

To characterize the performance of our proposal, we employed highly detailed timing simulations and power estimation. This section details the methodology used,
while the following section describes our simulation results.

Simulation Platform

We have based our simulation platform on the commercially available Simics simu-
lator [57]. It was configured to simulate a multicore processor inspired by the
UltraSPARC T1/T2, which runs multiple threads per core (four in our experiments).
This way, the traffic of 64 threads is concentrated on a 16-node network, stressing
the interconnection network with aggregated traffic. The processor core is modelled
as a single issue scalar core, running at 1 GHz. Stall times for caches and main
memory are set to conservative values for CMP settings (2 cycles access time for
L1 caches, 19 cycles for L2 and 100 cycles for main memory). Cache coherency is
maintained by a directory-based coherency controller at each node, which uses a
full bit-vector directory protocol. The interconnection network models a packet-
switched 44 network with contention and cut-through routing. The time required
for a packet to traverse a router is three cycles. The directory controller and the
interconnection network are custom extensions to Simics. Both the coherency traffic
(read requests, invalidation messages etc.) and data traffic are sent over the base
network. The resulting remote memory access times are around 100 ns, depending
on network size and congestion.
The proposed reconfigurable NoC has been configured with a link throughput
of 10 Gb/s in the base network. To model the elinks, a number of extra point-to-
point links can be added to the base torus topology at the start of each
reconfiguration interval. The speed of these reconfigurable optical elinks was assumed to be four times that of the base network links (40 Gb/s). For
evaluation, we have compared the proposed solution with three standard, non-
reconfigurable NoCs: a 10 Gb/s electrical NoC, a 40 Gb/s electrical NoC and a
40 Gb/s photonic NoC.
The network traffic is the result of both coherency misses and cold, capacity and
conflict misses. To make sure that private data transfer does not become excessive,
a first-touch memory allocation was used that places data pages of 8 KB on the node

of the processor core that first references them. Also, each thread is pinned down to
one processor (using the Solaris processor_bind() system call), so the thread stays
on the same network node as its private data for the duration of the program.
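A minimal sketch of the first-touch placement described above; the 8 KB page size comes from the text, while the class and method names are illustrative.

```python
PAGE_SIZE = 8 * 1024   # bytes, as in the simulated first-touch policy

class FirstTouchAllocator:
    """Assign each page to the home node of the core that touches it first."""
    def __init__(self):
        self.page_home = {}

    def home_node(self, address, requesting_node):
        page = address // PAGE_SIZE
        # setdefault records the first toucher and returns it on later accesses
        return self.page_home.setdefault(page, requesting_node)
```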

Power Modeling

To estimate the power consumption of our optical circuit-switched routing, we will need to know the state of each switch in the mesh, that is, which microrings are powered on during each reconfiguration interval. We can know this by looking at the routing table of each router (Table 1b in [78]) and assigning a power value
for each active ring. In this reference the power consumed per ring in the ON
state is assumed to be 6.5 mW, while in the OFF state the required power is con-
sidered negligible. This is for rings that switch in only 30 ps, though. Using a
reconfiguration interval of one microsecond, our architecture does not need such
an exorbitantly fast (and power hungry) device. Instead, it can tolerate several
nanoseconds of switching time, and we will assume that such a device can be
powered with just 0.5 mW.
Also, [78] consider nine possible states of the router, determined by all possible
simultaneous connections between its in- and output ports. Each of these states has
a specific number of microrings powered on. However, when a router is only used
by a single traversing elink, fewer active microrings are required. If we do not use
just the nine predefined states, but only account for the minimal number of rings
needed for establishing the optical elink path, we can obtain a significantly lower
power consumption.
Therefore, we will assume the use of a more power-efficient scheme that only
powers the rings needed on each reconfiguration interval, instead of putting the
switch in a state where several rings will be powered whether they are used or
not. Of course, the electronic control of such a switch would be more complicated; this is why the nine predefined states were originally proposed, even if this is not the most power-efficient scheme. But whereas localized control and the aim for independence between the different circuits validate such an approach, our architecture performs a global and simultaneous assignment of all elinks and microrings, and should therefore be able to operate in the optimized case.
For the parameters used to estimate the power consumption of the links and the rout-
ing of the packets, we have used the same values as cited in [77] and shown in
Table 7.2. One notable difference is that we include an extra static power of 500 μW
for each optical link, as it is likely that the analog optical transceiver circuits will
consume power even while the links are not sending data. As for the dynamic power
dissipated by the electrical-to-optical (E/O) and optical-to-electrical (O/E)
conversions, a reasonable estimate for a modulator and its corresponding detector at
10 Gb/s is 2 pJ/bit. Future predictions push this value down to 0.2 pJ/bit [22]. In our
simulation we have used a less stringent 0.5 pJ/bit.

Table 7.2 Power consumption figures

Technology node                           32 nm
Core dimension                            1.67 × 1.67 mm²
Electrical link power                     0.34 pJ/bit/mm
Optical link power                        0.5 pJ/bit
Buffering energy                          0.12 pJ/bit
Routing energy                            0.35 pJ/bit
Crossbar transfer energy                  0.36 pJ/bit
Static electrical link power @ 10 Gbps    500 μW
Static electrical link power @ 40 Gbps    2 mW
Static optical link power                 500 μW
Microring ON power                        500 μW
Microring OFF power                       0 mW
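As an illustration of how these parameters are combined (the helper names and example throughputs below are ours; the actual evaluation sums such terms over every link, hop and microring reported in sections Network Performance and Network Usage):

```python
# Sketch of how the Table 7.2 figures combine into per-link power estimates.
STATIC_OPTICAL_LINK_UW = 500        # static optical link power (Table 7.2)
OPTICAL_PJ_PER_BIT = 0.5            # E/O + O/E conversion energy (Table 7.2)
ELECTRICAL_PJ_PER_BIT_MM = 0.34     # electrical link energy per bit and per mm
BUFFER_PJ, ROUTE_PJ, XBAR_PJ = 0.12, 0.35, 0.36  # per bit, per router hop

def optical_link_power_mw(throughput_gbps):
    # pJ/bit * Gbit/s = mW, so dynamic power follows directly from the carried load
    return STATIC_OPTICAL_LINK_UW / 1000.0 + OPTICAL_PJ_PER_BIT * throughput_gbps

def electrical_hop_power_mw(throughput_gbps, link_length_mm=1.67):
    per_bit_pj = (ELECTRICAL_PJ_PER_BIT_MM * link_length_mm
                  + BUFFER_PJ + ROUTE_PJ + XBAR_PJ)
    return per_bit_pj * throughput_gbps

print(optical_link_power_mw(5.0))       # elink carrying 5 Gb/s: ~3.0 mW
print(electrical_hop_power_mw(5.0))     # one electrical hop at 5 Gb/s: ~7.0 mW
```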

Workload Generation

While most network performance studies employ simple synthetic network traffic
patterns (such as hotspot, random uniform, butterfly, etc.), and are able to obtain
reasonable accuracies with them, this is not possible for reconfigurable networks.
Indeed, by their very nature, reconfigurable networks exploit long-lived dynamics
that are present in the network traffic generated by real applications but absent
from most simple synthetic patterns.
The SPLASH-2 benchmark suite [88] was used as the workload. It consists
of a number of scientific and technical algorithms using a multi-threaded,
shared-memory programming model. Still, the detailed execution-driven simu-
lation of a single SPLASH-2 benchmark program takes a significant amount of
computation time. We therefore developed a new method of generating syn-
thetic traffic which does take the required long-lived traffic dynamics into
account. The traffic model and methodology for constructing the traces is
described in [35]. This way, we could quickly yet accurately simulate the per-
formance and power consumption of our network under realistic traffic
conditions.
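The construction of [35] is not reproduced here, but the key idea, namely that sources keep exchanging data with the same partners over long periods, can be illustrated with a toy generator; everything below (parameter values, structure, names) is purely illustrative.

```python
import random

def bursty_trace(n_nodes, n_packets, burst_len=200, p_switch=0.02, seed=1):
    """Toy bursty traffic: each source keeps the same destination for a long burst
    before picking a new one, unlike uniform random traffic. Illustrative only;
    the traces used in this chapter follow the model described in [35]."""
    rng = random.Random(seed)
    partner = {s: rng.randrange(n_nodes) for s in range(n_nodes)}
    remaining = {s: rng.randint(1, burst_len) for s in range(n_nodes)}
    trace = []
    for _ in range(n_packets):
        src = rng.randrange(n_nodes)
        if remaining[src] == 0 or rng.random() < p_switch:
            partner[src] = rng.randrange(n_nodes)
            remaining[src] = rng.randint(1, burst_len)
        remaining[src] -= 1
        trace.append((src, partner[src]))
    return trace

print(bursty_trace(16, 5))
```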

Simulation Results

A direct comparison with our reference architecture [78] is difficult, since in the
original case, only large DMA transfers (of which there are usually very few in
realistic CMP systems) would use the optical network, while most of the traffic,
both by aggregate size and by latency sensitivity, necessarily sticks to the electrical
control network. Yet, just comparing the performance of our solution with a
base-network-only architecture is not very insightful either. Therefore, we have
made a performance and power comparison of our proposed architecture versus a
non-reconfigurable 2-D torus topology.

Fig. 7.21 Average remote memory access latency

Network Performance

In this section we first aim to quantify the performance improvement obtained by
introducing reconfiguration in the system, compared with a standard topology. For
this, we compare four approaches: the reconfigurable architecture introduced above,
and three 2-D torus-only networks without reconfiguration capabilities, namely a
10 Gb/s electrical NoC (low speed), a 40 Gb/s electrical NoC and a 40 Gb/s optical
NoC (both high speed). In the case of an all-optical network, every node
needs an optical transceiver in all four directions. Also, a conversion from the
optical to the electrical domain is needed at each hop, since the routing is still
performed electronically. In contrast, our proposed reconfigurable NoC will
require only one transceiver per node, which is an advantage in cost and power
consumption. Moreover, the data can now travel over much longer distances
until O/E and E/O conversions are needed, which again reduces power and
latency.
In Fig. 7.21 and Table 7.3, average remote memory access latencies are presented
for all network configurations. We can observe that the reconfigurable approach
performs significantly better than the low-speed non-reconfigurable network (35%),
but still far from a high-speed (either electrical or optical) implementation due to the
huge amount of bandwidth available in these cases.

Table 7.3 Comparison of the link activity and average remote memory access latency for the
different types of networks-on-chip

                               BWmax (Gbps)  BWavg (Gbps)  Tmem (#cycles)  dhop (#hops)  Ptot (mW)
Electrical NoC                 10            5.70          308.9           2.13          315
Reconfigurable NoC                                         202.1           1.66          378
  Base electrical NoC          10            5.21
  Reconfigurable photonic NoC  40            5.08
High-speed NoC                 40            17.28         87.2            2.13
  Electrical NoC                                                                         985
  Photonic NoC                                                                           814

Fig. 7.22 Average number of hops per byte sent

In Fig. 7.22 and Table 7.3, we show the average number of hops per byte sent.
Compared with the non-reconfigurable topology, in which the network consists of
just a 2-D torus, we obtain a clear 22% reduction in hop distance. Similar
simulations on larger-scale CMPs with up to 64 cores show a 34.7% reduction in hop
distance. This will increase further as the network scales [3].
There is only small variability between the different applications measured
because, at any time, exactly the same number of elinks is present. The only
difference is that slightly longer routes are sometimes created, but since the elink
selection always tries to maximize the data × hop-distance covered, the average
hop distance will not differ much either. Note that the number of active
microrings depends on the shape of the traffic pattern (the source-destination pair
distribution), albeit not by a great amount, but it does not depend on the traffic
magnitude.

Fig. 7.23 Total power consumption per interval under different network architectures

Network Usage

A key factor in understanding the power consumption is the usage of the switches
and links in the network. For a normal r × p/r torus topology, the diameter (maxi-
mum number of hops between any node pair) is [69]:

    D = r/2 + p/(2r)                                                        (7.2)

where p is the number of processors and r is the size of the torus. In regular tori this
makes D = 4 hops when p = 16. For our benchmark applications, the average hop
distance is 2.13 for p = 16.
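A quick numerical check of (7.2), written as a small helper for convenience (names are ours):

```python
def torus_diameter(p, r):
    """Diameter of an r x (p/r) torus, Eq. (7.2): D = r/2 + p/(2r)."""
    return r // 2 + (p // r) // 2

print(torus_diameter(p=16, r=4))  # 4 hops, as stated above for p = 16
```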
In our simulations, we use a folded torus topology as shown in Fig. 7.18. The
complete topology contains 4p hitless switches (4 × 4 optical routing elements)
and p gateway switches. We found that the mean number of (non-gateway) switches
used per elink during each reconfiguration interval is 3.28. This results in a total of
37.5 active optical routing elements (out of the 64 available ones), of which 13 rout-
ers are traversed by more than one elink. Across all routers, an average of 73.7
microrings are in the active state.
Table 7.3 furthermore details the average data volume over the different NoC
architectures. For the proposed reconfigurable NoC we can see that the total volume
is almost evenly distributed between the electrical base links and the high-speed
reconfigurable elinks. This clearly indicates that the heuristic to allocate the
reconfigurable links is able to capture a significant part of data packets in bursts.
This figure could nevertheless be further improved when the number of cores and
the traffic demands are scaled up.
The folded torus topology used in our study has twice the wire demand and
bisection bandwidth of a mesh network, trading a longer average flit transmission
distance for fewer routing hops. While wider flits and a folded topology can increase
link bandwidth utilization efficiency, this utilization still remains low in our simulations, as
shown in Table 7.3. [68] investigated various metrics of a folded torus NoC, includ-
ing energy dissipation, for different traffic loads. The comparative analysis was
done with respect to average dynamic energy dissipated per full packet transfer
from source to destination node. It was found that energy dissipation increases lin-
early with the number of virtual channels (VCs) used. Furthermore, a small number
of VCs will keep energy dissipation low without giving up throughput. Energy dis-
sipation reaches an upper limit when throughput is maximized, meaning that energy
dissipation does not increase beyond the link saturation point. In general, architec-
tures with more elaborate topologies, and therefore higher degrees of connectivity,
have a higher energy dissipation on average at this saturation point than do others.
If power dissipation is critical, which is usually the case in on-chip multiprocessor
networks, a simpler mesh topology may be preferable to a folded torus, as detailed
in the work of [15].

Power Consumption

In this section, we evaluate the power consumed by the NoCs and include the
powering of the microring resonators when establishing the elinks on the
reconfigurable layer.
An estimate of the power consumed by the NoC can be calculated by combining the
parameters given in Table 7.2 with the activity of the links and optical switches
reported in sections Network Performance and Network Usage. In comparison to
the low-speed NoC with fixed topology, the reconfigurable NoC consumes modestly
more power (20%) while improving average network performance significantly.
Moreover, in comparison to the high-speed fixed NoCs, the proposed solution
consumes significantly less power, corresponding to reductions of 54% and 62%
relative to the fixed photonic and electrical NoCs, respectively (Fig. 7.23).
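These percentages follow directly from the Ptot column of Table 7.3:

```python
# Relative power of the reconfigurable NoC, using the Ptot values of Table 7.3 (mW)
p_reconf, p_low_el, p_hs_phot, p_hs_el = 378, 315, 814, 985
print((p_reconf / p_low_el - 1) * 100)    # ~20% more than the low-speed electrical NoC
print((1 - p_reconf / p_hs_phot) * 100)   # ~54% less than the high-speed photonic NoC
print((1 - p_reconf / p_hs_el) * 100)     # ~62% less than the high-speed electrical NoC
```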
It is important to note at this stage that we have adopted rather conservative
memory stall times (see section Simulation Platform). Future CMPs, equipped
with improved cache hierarchies, will impose significantly higher throughput
demands on the intercore network and further increase the power consumption of
the NoC. In addition, the proposed solution based on the reconfigurable NoC will
benefit from this scaling as it will decrease the network traffic contention between
the most active communicating pairs.
The estimated power consumption is of course highly dependent on the param-
eters chosen in Table 7.2 which was taken from [77]. Nevertheless, the conclusions

that we draw from the results are generic. The proposed reconfigurable NoCs will
always perform better than the fixed NoC consisting solely of a base network. The
reason is that in our proposal, links with more bandwidth and lower latency are
added only where and when relevant. When compared to high-speed NoCs our pro-
posal consumes less power since it requires fewer high-speed links and transceivers.
The proposed photonic NoC thus allows for a very efficient resource utilization of
the high-speed transceivers.
In our study, we assumed that the silicon microrings do not consume energy in
their off-state. This justifies our choice to adopt the proposal by [77] for the photo-
nic links, where a network of p nodes requires 8p² microring switches (excluding
the gateway switches). Temperature detuning of the microrings might require extra
power dissipation to stabilize the temperature locally at each ring. In recent
work [54], however, silicon microrings were demonstrated with a temperature
dependence as low as 0.006 nm/°C.

Conclusions

In this chapter, we first described the different forms of network traffic locality, and
acknowledged the possibility of exploiting this locality, through network
reconfiguration, to optimize network performance in terms of several important
characteristics such as bandwidth, latency, power usage and reliability. We also sur-
veyed existing works for optical reconfigurable on-chip networks, both demonstra-
tors and architectural proposals. Finally, we presented our own proposal for a
self-adapting, traffic-driven reconfigurable optical on-chip network. We believe that
optical, reconfigurable on-chip networks offer a viable and attractive road towards
future, high-performance and high core-count CMP and MPSoC systems.

Acknowledgements This work was supported by the European Commission's 6th FP Network of
Excellence on Micro-Optics (NEMO), the BELSPO IAP P6/10 photonics@be network sponsored
by the Belgian Science Policy Office, the GOA, the FWO, the OZR, and the Methusalem and Hercules
foundations. The work of C. Debaes is supported by the FWO (Fund for Scientific Research
Flanders) under a research fellowship.

References

1. Agelis S, Jacobsson S, Jonsson M, Alping A, Ligander P (2002) Modular interconnection
system for optical PCB and backplane communication. In: IEEE International parallel & dis-
tributed processing symposium, pp 245250
2. Artundo I, Desmet L, Heirman W, Debaes C, Dambre J, Van Campenhout J, Thienpont H
(2006) Selective optical broadcast component for reconfigurable multiprocessor interconnects.
IEEE J Sel Top Quantum Electron Special Issue Opt Communication 12(4):828837.
DOI 101109/JSTQE2006876158
3. Artundo I, Heirman W, Debaes C, Loperena M, Van Campenhout J, Thienpont H (2009) Low-
power reconfigurable network architecture for on-chip photonic interconnects. In: 17th IEEE
symposium on high performance interconnects, New York, pp 163169. DOI 101109/HOTI200927
4. Artundo I, Manjarres D, Heirman W, Debaes C, Dambre J, Van Campenhout J, Thienpont H
(2006) Reconfigurable interconnects in DSM systems: a focus on context switch behavior. In:
Frontiers of high performance computing and networkingISPA 2006 workshops, vol 4331.
Springer, Berlin, pp 311321
5. Ascia G, Catania V, Palesi M (2004) Multi-objective mapping for mesh-based NoC architec-
tures. In: Proceedings of ISSS-CODES, Stockholm, Sweden
6. Assefa S, Xia F, Vlasov YA (2010) Reinventing germanium avalanche photodetector for nano-
photonic on-chip optical interconnects. Nature 464:8084. DOI 101038/ nature08813
7. Barford P, Crovella M (1998) Generating representative web workloads for network and server
performance evaluation. In: Proceedings of the 1998 ACM SIGMETRICS joint international
conference on measurement and modeling of computer systems, Madison, pp 151160.
DOI 101145/277851277897
8. Barker KJ, Benner A, Hoare R, Hoisie A, Jones AK, Kerbyson DK, Li D, Melhem R, Rajamony
R, Schenfeld E, Shao S, Stunkel C, Walker P (2005) On the feasibility of optical circuit switching
for high performance computing systems. In: SC 05: proceedings of the 2005 ACM/IEEE confer-
ence on supercomputing, IEEE Computer Society, Washington, p 16. DOI 101109/SC200548
9. Barnes TH, Eiju T, Matsuda K, Ichikawa H, Taghizadeh MR, Turunen J (1992) Reconfigurable
free-space optical interconnections with a phase-only liquid-crystal spatial light modulator.
Appl Opt 31:55275535
10. Beausoleil RG, Ahn J, Binkert N, Davis A, Fattal D, Fiorentino M, Jouppi NP, McLaren M,
Santori CM, Schreiber RS, Spillane SM, Vantrease D, Xu Q (2008) A nanophotonic intercon-
nect for high-performance many-core computation. IEEE LEOS Newslett 22(3):1522
11. Bolotin E, Cidon I, Ginosar R, Kolodny A (2004) QNoC: QoS architecture and design process
for network on chip. In: J Syst Arch 50:105128
12. Brière M, Girodias B, Bouchebaba Y, Nicolescu G, Mieyeville F, Gaffiot F, O'Connor I (2007)
System level assessment of an optical NoC in an MPSoC platform. In: Proceedings of the
conference on design, automation and test in Europe, pp 10841089
13. Cassinelli A, Takashi K (2002) Presentation of OCULAR-III architecture (using guide-wave
interconnection modules). In: OSAKA research meeting
14. Christie P, Stroobandt D (2000) The interpretation and application of Rent's rule. IEEE Trans
Very Large Scale Integr Syst 8(6):639648. DOI 101109/92902258
15. Dally WJ, Towles B (2002) Route packets, not wires: on-chip interconnection networks. In:
Design automation conference, pp 684689
16. Debaes C, Artundo I, Heirman W, Van Campenhout J, Thienpont H (2010) Cycle-accurate
evaluation of reconfigurable photonic networks-on-chip. In: Righini GC (ed) Proceedings of
SPIE photonics Europe, vol 7719. SPIE, p 771916. DOI 101117/ 12854744
17. Faruque M, Weiss G, Henkel J (2006) Bounded arbitration algorithm for QoS-supported on-
chip communication. In: Proceedings of the 4th international conf hardware/software codesign
and system synthesis, pp 142147
18. Fidaner O, Demir HV, Sabnis VA, Zheng JF, Harris JSJ, Miller DAB (2006) Integrated photo-
nic switches for nanosecond packet-switched optical wavelength conversion. Opt Express
14(1):361 (2006)
19. Gao Y, Jin Y, Chang Z, Hu W (2009) Ultra-low latency reconfigurable photonic network on
chip architecture based on application pattern. In: Proceedings of NFOEC
20. Geer D (2005) Chip makers turn to multicore processors. IEEE Comput 38(5):1113 (2005).
DOI 101109/MC2005160
21. Gheorghita SV, Palkovic M, Hamers J, Vandecappelle A, Mamagkakis S, Basten T, Eeckhout
L, Corporaal H, Catthoor F, Vandeputte F, Bosschere KD (2009) System-scenario-based design
of dynamic embedded systems. ACM Trans Des Autom Electron Syst 14(1):145 (2009).
DOI 101145/14552291455232
22. Green WMJ, Rooks MJ, Sekaric L, Vlasov YA (2007) Ultra-compact, low RF power, 10 Gb/s
silicon MachZehnder modulator. Opt Express 15(25):1710617113
23. Greenfield D, Banerjee A, Lee JG, Moore S (2007) Implications of Rent's rule for NoC design
and its fault-tolerance. In: Proceedings of the first international symposium on networks-on-
chips (NOCS07), Princeton, pp 283294
24. Greenfield D, Moore S (2008) Fractal communication in software data dependency graphs. In:
Proceedings of the 20th ACM symposium on parallelism in algorithms and architectures
(SPAA08), Munich, pp 116118. DOI 101145/13785331378555
25. Gu H, Xu J, Zhang W (2009) A low-power fat tree-based optical network-on-chip for multi-
processor system-on-chip. In: Proceedings of the conference on design automation and test in
Europe, Nice, pp 38
26. Gupta V, Schenfeld E (1994) Performance analysis of a synchronous, circuit-switched inter-
connection cached network. In: ICS 94: proceedings of the 8th international conference on
supercomputing, ACM, Manchester, pp 246255. DOI 101145/ 181181181540
27. Guz Z, Walter I, Bolotin E, Cidon I, Ginosar R, Kolodny A (2006) Efficient link capacity and
QOS design for network-on-chip. In: Proceedings of the conference on design, automation and
test in Europe, pp 914
28. Habata S, Umezawa K, Yokokawa M, Kitawaki S (2004) Hardware system of the earth simula-
tor. Parallel Comput 30(12):12871313. DOI 101016/jparco200409004
29. Han X, Chen RT (2004) Improvement of multiprocessing performance by using optical cen-
tralized shared bus. Proc SPIE 5358:8089
30. Hawkins C, Small BA, Wills DS, Bergman K (2007) The data vortex, an all optical path mul-
ticomputer interconnection network. IEEE Trans Parallel Distr Syst 18(3):409420.
DOI 101109/TPDS200748
31. Heirman W (2008) Reconfigurable optical interconnection networks for shared-memory mul-
tiprocessor architectures. PhD Thesis, Ghent University
32. Heirman W, Artundo I, Carvajal D, Desmet L, Dambre J, Debaes C, Thienpont H, Van Campenhout
J (2005) Wavelength tuneable reconfigurable optical interconnection network for shared-mem-
ory machines. In: Proceedings of the 31st European conference on optical communication
(ECOC 2005), vol 3. The Institution of Electrical Engineers, Glasgow, pp 527528
33. Heirman W, Dambre J, Artundo I, Debaes C, Thienpont H, Stroobandt D, Van Campenhout J
(2008) Predicting the performance of reconfigurable optical interconnects in distributed
shared-memory systems. Photon Netw Commun 15(1):2540. DOI 101007/s11107-007-
0084-z
34. Heirman W, Dambre J, Stroobandt D, Van Campenhout, J (2008) Rent's rule and parallel pro-
grams: characterizing network traffic behavior. In: Proceedings of the 2008 international work-
shop on system level interconnect prediction (SLIP08), ACM, Newcastle, pp 8794
35. Heirman W, Dambre J, Van Campenhout J (2007) Synthetic traffic generation as a tool for
dynamic interconnect evaluation. In: Proceedings of the 2007 international workshop on sys-
tem level interconnect prediction (SLIP07), ACM, Austin, pp 6572
36. Heirman W, Dambre J, Van Campenhout J, Debaes C, Thienpont H (2005) Traffic temporal
analysis for reconfigurable interconnects in shared-memory systems. In: Proceedings of the
19th IEEE international parallel & distributed processing symposium, IEEE Computer Society,
Denver, p 150
37. Heirman W, Stroobandt D, Miniskar NR, Wuyts R, Catthoor F (2010) PinComm: character-
izing intra-application communication for the many-core era. In: Proceedings of the 16th IEEE
international conference on parallel and distributed systems (ICPADS), Shanghai, pp 500507.
DOI 101109/ICPADS201056
38. Hemenway R, Grzybowski R, Minkenberg C, Luijten R (2004) Optical-packet-switched inter-
connect for supercomputer applications. J Opt Netw Special Issue Supercomput Interconnects
3(12):900913. DOI 101364/JON3000900
39. Henderson CJ, Leyva DG, Wilkinson TD (2006) Free space adaptive optical interconnect at
1.25 Gb/s, with beam steering using a ferroelectric liquid-crystal SLM. IEEE/OSA J Lightwave
Technol 24(5):19891997. DOI 101109/JLT2006871015
40. Hoskote Y, Vangal S, Singh A, Borkar N, Borkar S (2007) A 5-GHz mesh interconnect for a
teraflops processor. IEEE Micro 27(5):5161. DOI 101109/MM200777
41. Hu J, Marculescu R (2003) Exploiting the routing flexibility for energy/performance aware
mapping of regular NoC architectures. In: Proceedings of the conference on design, automa-
tion and test in Europe, pp 688693. DOI 101109/DATE20031253687
42. Hu J, Marculescu R (2004) Application-specific buffer space allocation for networks-on-chip
router design. In: Proceedings of the IEEE/ACM international conference on computer-aided
design, San Jose, pp 354361. DOI 101109/ ICCAD20041382601
43. Jalabert A, Murali S, Benini L, Micheli GD (2004) xPipesCompiler: a tool for instantiating
application-specific NoCs. In: Proceedings of the conference on design, automation and test in
Europe, vol 2, Paris, pp 884889. DOI 101109/ DATE20041268999
44. Jerraya A, Wolf W (eds) (2005) Multiprocessor systems-on-chips. Elsevier/Morgan Kaufmann,
San Francisco
45. Jha NK (2001) Low power system scheduling and synthesis. In: ICCAD 01: proceedings of
the 2001 IEEE/ACM international conference on computer-aided design, IEEE, Piscataway,
pp 259263 (2001)
46. Kamil S, Pinar A, Gunter D, Lijewski M, Oliker L, Shalf J (2007) Reconfigurable hybrid intercon-
nection for static and dynamic scientific applications. In: Proceedings of the 4th international con-
ference on computing frontiers, ACM, Ischia, pp 183194. DOI 101145/12425311242559
47. Katsinis C (2001) Performance analysis of the simultaneous optical multi-processor exchange
bus. Parallel Comput 27(8):10791115
48. Kodi A, Louri A (2006) RAPID for high-performance computing systems: architecture and
performance evaluation. Appl Opt 45:63266334
49. Koohi S, Hessabi S (2009) Contention-free on-chip routing of optical packets. In: Proceedings
of the 3rd ACM/IEEE international symposium on networks-on-chip, pp 134143
50. Krishnamurthy P, Chamberlain R, Franklin M (2003) Dynamic reconfiguration of an optical
interconnect. In: Proceedings of the 36th annual simulation symposium, pp 8997
51. Landman BS, Russo RL (1971) On a pin versus block relationship for partitions of logic
graphs. IEEE Trans Comput C-20(12):14691479
52. Lee BG, Biberman A, Chan J, Bergman K (2010) High-performance modulators and switches
for silicon photonic networks-on-chip. IEEE J Sel Top Quantum Electron 16(1):622.
DOI 101109/JSTQE20092028437
53. Lee BG, Biberman A, Sherwood-Droz N, Poitras CB, Lipson M, Bergman K (2009) High-
speed 2 × 2 switch for multiwavelength silicon-photonic networks-on-chip. J Lightwave
Technol 27(14):29002907
54. Lee J, Kim D, Ahn H, Park S, Pyo J, Kim G (2007) Temperature-insensitive silicon nano-wire
ring resonator. In: Optical fiber communication conference and exposition and the national
fiber optic engineers conference, OSA technical digest series (CD), Anaheim, p OWG4
55. Lee SJ, Lee K, Yoo HJ (2005) Analysis and implementation of practical, cost-effective net-
works on chips. IEEE Design Test Comput 22(5):422433
56. Leroy A, Marchal A, Shickova A, Catthoor F, Robert F, Verkest D (2005) Spatial division
multiplexing: a novel approach for guaranteed throughput on NoCs. In: Proceedings of the
third IEEE/ACM/IFIP International conference on hardware/software codesign and system
synthesis, pp 8186
57. Magnusson PS, Christensson M, Eskilson J, Forsgren D, Hallberg G, Hogberg J, Larsson F, Moestedt
A, Werner B (2002) Simics: a full system simulation platform. IEEE Comput 35(2):5058
58. McArdle N, Fancey SJ, Dines JAB, Snowdon JF, Ishikawa M, Walker AC (1998) Design of paral-
lel optical highways for interconnecting electronics. Proc SPIE Opt Comput 3490:143146
59. McArdle, N, Naruse M, Ishikawa M, Toyoda H, Kobayashi Y (1999) Implementation of a pipe-
lined optoelectronic processor: OCULAR-II. In: Optics in computing, OSA technical digest
60. McNutt B (2000) The fractal structure of data reference: applications to the memory hierarchy.
Kluwer Academic, Norwell, MA, USA
61. Millberg M, Nilsson E, Thid R, Jantsch A (2004) Guaranteed bandwidth using looped contain-
ers in temporally disjoint networks within the nostrum network on chip. In: Proceedings of the
conference on design, automation and test in Europe, pp 890895
62. Miniskar NR, Wuyts R, Heirman W, Stroobandt D (2009) Energy efficient resource manage-
ment for scalable 3D graphics game engine. Tech report, IMEC
63. Murali S, De Micheli G (2004) Bandwidth-constrained mapping of cores onto NoC architec-
tures. In: Proceedings of the conference on design, automation and test in Europe, IEEE
Computer Society, Washington, p 20896
64. Ogras U, Marculescu R (2006) Prediction-based flow control for network-on-chip traffic. In:
Proceedings of the 43rd design automation conference, pp 839844
65. Ogras UY, Marculescu R (2006) It's a small world after all: NoC performance optimization via
long-range link insertion. IEEE Trans Very Large Scale Integr Syst Special Sect Hardware/
Software Codesign Syst Synth 14(7):693706. DOI 101109/TVLSI2006878263
66. Ohashi K, Nishi K, Shimizu T, Nakada M, Fujikata J, Ushida J, Torii S, Nose K, Mizuno M,
Yukawa H, Kinoshita M, Suzuki N, Gomyo A, Ishi T, Okamoto D, Furue K, Ueno T,
Tsuchizawa T, Watanabe T, Yamada K, Itabashi S, Akedo J (2009) On-chip optical intercon-
nect. Proc IEEE 97(7):11861198. DOI 101109/JPROC20092020331
67. Owens JD, Dally WJ, Ho R, Jayasimha D, Keckler SW, Peh LS (2007) Research challenges
for on-chip interconnection networks. IEEE Micro 27(5):96108
68. Pande PP, Grecu C, Jones M, Ivanov A, Saleh R (2005) Performance evaluation and design trade-
offs for network-on-chip interconnect architectures. IEEE Trans Comput 54(8): 10251040
69. Parhami B (1999) Introduction to parallel processing: algorithms and architectures. Kluwer
Academic
70. Patel R, Bond S, Pocha M, Larson M, Garrett H, Drayton R, Petersen H, Krol D, Deri R,
Lowry M (2003) Multiwavelength parallel optical interconnects for massively parallel pro-
cessing. IEEE J Sel Top Quantum Electron 9:657666
71. Petracca M, Lee BG, Bergman K, Carloni LP (2008) Design exploration of optical intercon-
nection networks for chip multiprocessors. In: Proceedings of the 16th IEEE symposium on
high performance interconnects, Stanford, pp 3140. DOI 101109/ HOTI200820
72. Poon AW, Xu F, Luo X (2008) Cascaded active silicon microresonator array cross-connect
circuits for WDM networks-on-chip. In: Proceedings of SPIE photonics west, pp 1924
73. Qiao C, Melhem R, Chiarulli D, Levitan S (1994) Dynamic reconfiguration of optically inter-
connected networks with time-division multiplexing. J Parallel Distr Comput 22(2):268278
74. Roldan R, dAuroil B (2003) A preliminary feasibility study of the LARPBS optical bus paral-
lel model. In: Proceedings of the 17th annual international symposium on high performance
computing systems and applications, pp 181188
75. Russell G (2004) Analysis and modelling of optically interconnected computing systems. PhD
Thesis, Heriot-Watt University
76. Sakano T, Matusumoto T, Noguchi K, Sawabe T (1991) Design and performance of a multi-
processor system employing board-to-board free-space interconnections: COSINE-1. Appl
Opt 30:23342343
77. Shacham A, Bergman K, Carloni L (2008) Photonic networks-on-chip for future generations
of chip multi-processors. IEEE Trans Comput 57(9):12461260. DOI 101109/TC200878
78. Sherwood-Droz N, Wang H, Chen L, Lee BG, Biberman A, Bergman K, Lipson M (2008)
Optical 4 × 4 hitless silicon router for optical networks-on-chip (NoC). Opt Express
16(20):1591515922. DOI 101364/OE16015915
79. Snyder L (1982) Introduction to the configurable, highly parallel computer. Computer 15(1)
:4756
80. Soganci IM, Tanemura T, Williams KA, Calabretta N, de Vries T, Smalbrugge E, Smit MK,
Dorren HJS, Nakano Y (2010) Monolithically integrated InP 1 × 16 optical switch with wave-
length-insensitive operation. IEEE Photon Technol Lett 22(3):143145
81. Srinivasan K, Chatha K (2005) A technique for low energy mapping and routing in net-
work-on-chip architectures. In: Proceedings of the international symposium on low power
electronics and design, pp 387392
82. Stensgaard MB, Sparsø J (2008) ReNoC: a network-on-chip architecture with reconfigurable
topology. In: 2nd ACM/IEEE international symposium on networks-on-chip, Newcastle, pp
5564. DOI 101109/NOCS20084492725
83. Stuart MB, Stensgaard MB, Sparsø J (2009) Synthesis of topology configurations and deadlock
free routing algorithms for renoc-based systems-on-chip. In: Proceedings of the 7th IEEE/
ACM international conference on hardware/software codesign and system synthesis, pp 481
490. DOI 101145/16294351629500
84. Tang S, Tang Y, Colegrove J, Craig DM (2004) Electro-optic Bragg grating couplers for fast
reconfigurable optical waveguide interconnects. In: Proceedings of the conference on lasers
and electro-optics (CLEO), p 2
85. Vangal S, Howard J, Ruhl G, Dighe S, Wilson H, Tschanz J, Finan D, Singh A, Jacob T, Jain
S, Erraguntla V, Roberts C, Hoskote Y, Borkar N, Borkar S (2008) An 80-tile sub-100-W
TeraFLOPS processor in 65-nm CMOS. IEEE J Solid-State Circuits 43(1):2941 (2008).
DOI 101109/JSSC2007910957
86. Vlasov Y, Green WMJ, Xia F (2008) High-throughput silicon nanophotonic wavelength-insen-
sitive switch for on-chip optical networks. Nat Photon 2:242246
87. Wolkotte PT, Smit GJM, Rauwerda GK, Smit LT (2005) An energy-efficient reconfigurable
circuit-switched network-on-chip. In: Proceedings of the 19th IEEE international parallel and
distributed processing symposium (IPDPS), Denver, p 155a
88. Woo SC, Ohara M, Torrie E, Singh JP, Gupta A (1995) The SPLASH-2 programs: charac-
terization and methodological considerations. In: Proceedings of the 22nd international
symposium on computer architecture, Santa Margherita Ligure, pp 2436
89. Xu Q, Fattal D, Beausoleil RG (2008) Silicon microring resonators with 1.5-μm radius. Opt
Express 16(6):4309
90. Yoshimura T, Ojima M, Arai Y, Asama K (2003) Three-dimensional self-organized
microoptoelectronic systems for board-level reconfigurable optical interconnects-perfor-
mance modeling and simulation. IEEE J Sel Top Quantum Electron 9(2):492511.
DOI 101109/JSTQE2003812503
Chapter 8
System Level Exploration for the Integration
of Optical Networks on Chip in 3D MPSoC
Architectures

Sébastien Le Beux, Jelena Trajkovic, Ian O'Connor, Gabriela Nicolescu,
Guy Bois, and Pierre Paulin

Abstract Design trends for next-generation multi-processor systems on chip
(MPSoC) point to the integration of a large number of processing elements onto a
single chip, requiring high-performance interconnect structures for high-throughput
communication. On-chip optical interconnect and 3D die stacking are currently con-
sidered to be the two most promising paradigms in this design context. New architec-
tures based on these paradigms are currently emerging and new system-level
approaches are required for their efficient design. We investigate design tradeoffs for
3D MPSoC integrating optical networks-on-chip (ONoC) and highlight current and
short-term design trends. We also propose a system-level design space exploration
flow that takes routing capabilities of optical interconnect into account. The resulting
application-to-architecture mappings demonstrate the benefits of the presented 3D
MPSoC architectures and the efficiency of our system-level exploration flow.

Keywords Optical network-on-chip (ONoC) · Multi-processor systems on chip
(MPSoC) · 3D die stacking · Design space exploration

S. Le Beux (*)
École Polytechnique de Montréal, Montreal, QC, Canada
Ecole Centrale de Lyon Lyon Institute of Nanotechnology, University of Lyon,
36 avenue Guy de Collongue, Ecully Cedex 69134, France
e-mail: sebastien.le-beux@ec-lyon.fr
J. Trajkovic · G. Nicolescu · G. Bois
École Polytechnique de Montréal, Montreal, QC, Canada
e-mail: Jelena.Trajkovic@polymtl.ca; Gabriela.Nicolescu@polymtl.ca;
Guy.Bois@polymtl.ca
I. O'Connor · S. Le Beux
Ecole Centrale de Lyon Lyon Institute of Nanotechnology, University of Lyon,
36 avenue Guy de Collongue, Ecully Cedex 69134, France
e-mail: sebastien.le-beux@ec-lyon.fr
P. Paulin
STMicroelectronics (Canada) Inc., Ottawa, ON, Canada
e-mail: Pierre.Paulin@st.com


Introduction

The latest edition of the ITRS (International Technology Roadmap for Semiconductors)
[1] emphasizes the More Than Moore's Law trend. This trend focuses on system
integration rather than transistor density, allowing for both functional and techno-
logical diversification in integrated systems. Functional diversification allows for
non-digital functionalities, such as RF communication, power control, passive com-
ponents, sensors, actuators, and optical interconnect, to migrate from the board level
into chip-level (SoC) 3D architectures. Technological diversification allows for the
integration of new technologies that enable high performance, low power, high reli-
ability, low cost, and high design productivity. Some of the examples of these new
technologies are: design-for-variability, low power design, and homogeneous and
heterogeneous multi-processor system-on-chip (MPSoC) architectures.
These heterogeneous systems enable the efficient execution of new applications
and open new markets. They will be found in key domains such as transport, mobil-
ity, security, health, energy, communication, education and entertainment. Some
examples of applications of these systems are: car surround sensors, pre-crash
detection, car-to-car communication, navigation, smart phones, and mobile health
monitoring systems.
Moreover, technology scaling down to the ultra deep submicron (UDSM) domain
provides for billions of transistors which enable putting hundreds of cores on a
single chip. These cores, running at a higher clock frequency, create a need for
higher data bandwidth and increased parallelism. Therefore, the role of interconnect
becomes a dominant factor in performance. Designing such systems using tradi-
tional electrical interconnect poses a significant challenge. Deep submicron effects,
such as capacitive and inductive coupling [2], become dominant, leading to increases
in interconnect noise and propagation delay of global interconnect. Decreasing the
supply voltage in the presence of increasing interconnect noise makes the signal
even more vulnerable to noise. Increases in propagation delay require global inter-
connect to be clocked at a very low rate, which puts limits on the achievable band-
width and consequently on the overall system performance. This problem has been
solved in the past by adapting the interconnect architectures, such as inserting
repeaters in interconnect lines [3] or using multi-cycle (pipelined) global intercon-
nect [4]. However, the use of pipelining leads to higher data transfer delays, due to
the large number of pipeline registers required, and also to an increase in power
consumption, both due to the number of additional registers and due to the increased
operating frequency. Therefore, a new interconnect technology that can overcome
the problems of electrical interconnect, and that can be integrated in the system is
highly desirable.
Optical interconnect has been successfully used in off-chip, long range communi-
cations. Use of the on-chip optical interconnect, especially optical networks-on-chip
(ONoC) promises to deliver significantly increased bandwidth, increased immunity to
electromagnetic noise, decreased latency, and decreased power. Apart from these
physical properties, the use of wavelength routing and wavelength division multiplexing
(WDM) [5] contributes to the advantageous properties of optical interconnect. For
traditional routing, a part of the message contains the destination address, while for
wavelength routing the chosen wavelength determines the destination address, there-
fore enabling low contention or even contention-free routing. WDM allows for mul-
tiple signals to be transmitted simultaneously, therefore facilitating higher throughput.
From the technology point of view, the integration of optical interconnect requires
process technology compatibility with traditional silicon technology. The current
technology is mature enough to allow this integration, thanks to CMOS-compatible
optical components, such as light sources [6], waveguides [7], modulators [8, 9], and
detectors [10, 11]. Also using 3D architectures and adding optical interconnect as a
separate layer may simplify place-and-route for complex circuits.
The design of systems that incorporate optical interconnect poses a significant
challenge. Therefore it is important to focus considerable research efforts onto tech-
nological, architectural and system-level development of these systems. Traditional
electronic design automation (EDA) tools and methodologies need to be augmented
with novel system-level design approaches that will incorporate these diverse ele-
ments into efficient systems. Particularly beneficial will be tools that can perform
analyses at higher abstraction levels, early in the design flow. In this context, we
focus on the system-level design of 3D architectures integrating ONoC.
Our contributions are twofold:
A design tradeoff investigation for an MPSoC architecture relying on both 3D die
stacking and optical interconnect integration technologies. The featured hetero-
geneous architecture is composed of electrical layers and an optical layer, which
implements an ONoC.
A definition of a system-level design flow optimizing application mapping onto
this architecture. By taking advantages of optical properties, this flow improves
the system execution performance levels compared to an architecture including
only electrical interconnects.
The chapter is organized as follows. Section Related Work discusses related
work; section 3D MPSoC Architecture Integrating ONoC presents the 3D archi-
tecture, its components and its model. In section Design Tradeoffs, we vary the
parameters of the proposed architecture, in order to evaluate its complexity and
performances tradeoffs. Section Design Space Exploration presents the system-
level exploration flow and section Case Study presents the experimental results.
Section Conclusions concludes and identifies open problems and opportunities
for future work.

Related Work

Many contributions address ONoC design. ONoCs have been considered as full
replacement solutions for electrical NoCs in [12, 13]. Fat tree [12] and 2D Mesh
[13] topologies are implemented using optical interconnects in the context of planar
architectures. Contrary to this approach, we believe that more realistic interconnect
solutions for next-generation MPSoC architectures should combine both electrical
and optical technologies. In [14, 15], electrical interconnect manages local com-
munication while an optical interconnect is responsible for global communications.
However, such a single-write-multiple-read (SWMR) implementation implies that
each wavelength flowing through the ONoC must be assigned to a given optical
network interface. As a result, no parallel communications with the same wave-
length are possible. This point drastically affects the ONoC scalability, since only
one communication may occur at any given time.
An approach using electrical interconnects for control flow and optical intercon-
nects for data flow was proposed in [16]. The electrical signal precedes the optical
one in order to reserve the optical path. Therefore, optical communications may be
delayed until an optical path becomes free, resulting in contention delay. Hence, this
approach does not result in a contention-free network.
The Corona architecture [17, 18] follows a multi-write-single-read (MWSR)
implementation. The MWSR approach requires arbitration to manage write conflicts.
In the architecture proposed in this chapter, no arbitration is required, which results
in more efficient communication. To overcome this drawback of the Corona archi-
tecture, the Firefly architecture [19] extends the prior work by proposing the imple-
mentation of reservation-assisted SWMR buses. The main objective is to reduce the
power consumption of optical communications by using an initialization packet that
turns on data receiver resources. As a drawback, extra latency is required compared
to the SWMR technique and the network throughput rapidly decreases with the
token round-trip latency [20]. The more recent FlexiShare architecture [20] imple-
ments token stream arbitration that reduces this drawback by allowing the injection
of a new token each cycle. Only architectures proposed in [21–23] consider conten-
tion-free ONoC, but the implementation complexity rapidly increases with the sys-
tem size. One possible solution for reducing the complexity of such multi-stage
implementation [21] is the use of a reduction method presented in [5]. Basically,
optical switches are removed and the number of required wavelengths is reduced
when total connectivity between interconnected nodes (e.g. processors, memories,
etc.) is not required. This is particularly suitable in the context of 3D architecture,
where only a subset of nodes needs to communicate with each other. For this reason,
all the experiments presented in this chapter will utilize this network architecture to
realize optical interconnect in 3D MPSoC.
Basically, 3D architectures consist of stacked 2D layers that are interconnected
with through silicon vias (TSV [24]). The main advantages of TSVs are their low
latency and low power, while their main drawbacks are area size and design cost.
TSVs thus need to be used carefully in order to find the best efficiency/area cost
trade-off solutions. Several methodologies allow the number of TSVs to be mini-
mized [25, 26] and optimize their location on a die in order to maximize their
benefits for a given application. While such a methodology results in efficient archi-
tectures, the resulting layers are application-specific and may be difficult to reuse in
other contexts (e.g. to execute other applications) and to scale. We believe that
architecture genericity and scalability can be achieved with regularity: the more an
identical pattern is regularly repeated on a die (e.g. in Mesh and Torus net-
works), the more an architecture is generic and scalable. The same principle can
also be applied to 3D architectures: the more an identical layer is regularly repeated,
the more the architecture is generic and scalable. The architecture proposed in this
chapter follows this trend: it is composed of identical electrical layers. Our approach
thus has the potential for scaling to complex systems.
Design tradeoffs for various 3D electrical interconnects have been studied in
[27–31]. Methodologies are proposed to design application-specific 3D SoCs in [25,
32]. System-level methodologies for 3D chips allow the maximum clock frequency
to be evaluated [33], as well as power consumption [34] and even chip cost [35].
Thermal-aware application mapping for 3D chips is investigated in [36]. Our work
is complementary to this related work, since we address design tradeoffs for a 3D
architecture integrating optical interconnects.

3D MPSoC Architecture Integrating ONoC

This section presents the 3D architecture model used in this work. The architecture
is defined by the extension of two planar approaches: (1) the electrical NoC and (2)
the optical network-on-chip (ONoC). Figure 8.1a illustrates the 3D architecture
integrating an ONoC. It is composed of a set of stacked electrical layers and one
optical layer. The electrical layer is composed of a set of computing nodes intercon-
nected through a NoC while the optical layer integrates only an ONoC (computing
nodes are not a part of the optical layer). Two types of communications are distin-
guished in this architecture:
Intra-layer communications are used for data transfers between nodes situated
within the same electrical layer.
Inter-layer communications are used for data transfers between nodes situated
on different electrical layers.
The ONoC is obviously dedicated to the inter-layer type of communications. All
electrical layers are connected to the optical layer using electrical, point-to-point,
vertical TSVs. Inter-layer communications require routing composed of three main
steps: (1) electrical routing from the source node to a TSV, (2) optical routing within
the ONoC and (3) electrical routing from a TSV to the destination node (the details
will be presented in the remainder of the section). Given that the ONoC is used for
inter-layer communications, the optical layer is located in the middle of the 3D
architecture to minimize the length of TSVs and eliminate the need for high aspect
ratio TSVs (a notorious difficulty in the 3DIC domain).
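As a sketch of this three-step routing (function and variable names are ours, and the gateway search is reduced to a nearest-ONI lookup with XY hop counting):

```python
def route_inter_layer(src, dst, onis_per_layer):
    """Inter-layer route: (1) electrical XY hops from src to an ONI/TSV on its own
    layer, (2) one optical traversal of the ONoC, (3) electrical XY hops from the
    ONI/TSV on the destination layer to dst. Nodes are (x, y, layer) triples."""
    def xy_hops(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    src_oni = min(onis_per_layer[src[2]], key=lambda o: xy_hops(src, o))
    dst_oni = min(onis_per_layer[dst[2]], key=lambda o: xy_hops(dst, o))
    return xy_hops(src, src_oni), ('ONoC', src_oni, dst_oni), xy_hops(dst_oni, dst)

# Example: one ONI at (0, 0) on each layer; source on layer 0, destination on layer 2
onis = {0: [(0, 0, 0)], 2: [(0, 0, 2)]}
print(route_inter_layer((2, 1, 0), (1, 3, 2), onis))
```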
This architecture relies on a communication hierarchy similar to that proposed
in the Firefly 2D architecture [20]. Due to 3D integration, additional communica-
tion resources could also be considered, especially to provide point-to-point con-
nections between nodes located on different (but adjacent) layers (e.g. to provide
direct connection between node(i,j,k) and node(i,j,k+1), where k and k + 1 denote adjacent
layers). However, in addition to increased routing complexity (i.e. with these addi-
tional resources, inter-layer communication can be performed with the ONoC or with
a point-to-point connection), analysis demonstrated that the performance gain was
negligible [20]. Thus we do not consider such point-to-point connections for the
remainder of the chapter.

Fig. 8.1 3D architecture: (a) overview, (b) focus on electrical resources and (c) focus on optical
resources

Intra-Layer Communication

Intra-layer communications are used for data transfers between the nodes situated
on the same electrical layer. Each electrical layer is composed of a set of homoge-
neous nodes interconnected by a 2D Mesh NoC. We define a node as a computing
subsystem including a processor and a local memory. The node accesses the net-
work via an electrical Network Interface (NI) (see Fig. 8.1b). The NoC is composed
of links and switches that are used to route data from a source NI to a destination NI.
The XY routing policy is used in this work.

Fig. 8.2 Optical network interface: (a) optical layer and (b) electrical layer sides

Inter-Layer Communication

Inter-layer communications are used for data transfers between nodes situated on
different electrical layers. The inter-layer communications are enabled by the opti-
cal network interfaces (ONI). By providing opto-electrical and electro-optical con-
versions, ONIs allow the sending/receiving of data. Their location in the architecture
is illustrated in Fig. 8.1.
The main components of an ONI are shown in Fig. 8.2. The ONI is composed of
an electrical and an optical part. Thus, the transmitter and receiver chains of each
ONI are implemented in both electrical and optical layers. The components of the
transmitter chain in the electrical layer are a serializer (SER) and CMOS driver
circuits. An uploading TSV links the electrical layer to the optical layer. For the
transmitter functionality, the optical layer includes microsource lasers [5]. The
receiver chain includes a photodetector (on the optical layer), a downloading TSV
(connecting an optical to an electrical layer), and a CMOS receiver circuit and a
deserializer (DES) (on an electrical layer). The CMOS receiver circuit consists of a
transimpedance amplifier (TIA) and a comparator. The TIA takes in electrical cur-
rent, generated by the photodetector, and transforms it into an electrical voltage,
while the comparator decides a value of each bit based on the provided electrical
voltage. For an ONoC interconnecting N ONIs, transmitter and receiver chains are
replicated N times.
The inter-layer communication process starts on an electrical layer when an
ONI receives data and a destination ID. The data is serialized and the appropriate
CMOS driver circuit then modulates the current flowing through the microsource
laser. In the case of wavelength routing, the wavelength used is determined by both
the source ID and the destination ID. The intensity of light emission is modulated
according to the data bit values, achieving the electro-optical conversion. The sig-
nal enters the ONoC, is routed inside it, and is finally received by the receiver
part of an ONI.

Fig. 8.3 Diagonal and straight states: (a) logical view and (b) layout
In the receiver ONI, the photodetector starts the opto-electronic conversion by con-
verting a flow of photons into a photocurrent. A downloading TSV transmits the ana-
log signal to the CMOS receiver circuit (on the destination layer). The latter converts
the analog signal to a digital one, which is then deserialized (DES). Data is finally
injected into the electrical NoC where it is transmitted to the destination node.

Optical Layer

The optical layer is composed of the ONoC and the optical part of each ONI. The
ONoC used in this work is composed of waveguides and contention-free optical
switches. The waveguides transmit optical signals and the optical switches manage
the routing of signals into these waveguides.
From a functional point of view, an optical switch operates in a similar way to a
classical electronic switch. From any input port, switching is obtained to one of the
two output ports depending on the wavelength value of the optical signal injected
(Fig. 8.3). An optical switch is characterized by its resonant wavelength λn. As illus-
trated in Fig. 8.3, there are two possible switch states:
The diagonal state, which occurs when a signal characterized by a wavelength λ
different from λn (λ ≠ λn) is injected. In this case, the optical switch does not reso-
nate, and the signal is propagated along the same waveguide.
The straight state, which occurs when a signal characterized by a wavelength λ = λn
is injected. In this case, the optical switch resonates and the signal is coupled to
the opposite waveguide.
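A minimal functional model of these two states (names are ours, and the routing decision is reduced to a wavelength comparison):

```python
def switch_state(signal_wavelength, resonant_wavelength):
    """Microring-based switch: 'straight' (signal coupled to the opposite waveguide)
    when the wavelength matches the resonance, 'diagonal' (signal stays on its own
    waveguide) otherwise."""
    return 'straight' if signal_wavelength == resonant_wavelength else 'diagonal'

# With WDM, every wavelength sharing the waveguide is switched independently
for wl in ('lambda1', 'lambda2'):
    print(wl, switch_state(wl, resonant_wavelength='lambda1'))
```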
We utilize wavelength division multiplexing (WDM) where multiple signals of
different wavelengths flow through a waveguide. When these multiple signals
encounter an optical switch, each of them is routed through the switch according to
its individual wavelength, as if it were the only signal flowing through the wave-
guide. Thanks to these optical properties, multiple signals can be transmitted
simultaneously, which facilitates the design of high-bandwidth and potentially
contention-free ONoCs.

Fig. 8.4 ONoC interconnecting 4 ONIs located on (a) 4 layers and (b) 2 layers
The main constraint of an optical interconnect is the number of optical switches
crossed (nOSC) by one optical signal (note that this is different from the total num-
ber of optical switches in the network) [5]. nOSC for an ONoC is defined by the
path crossing the maximal number of optical switches. Recent work [37] reports
values of the output power of an integrated laser to be around 2.5 mW/mm². To
achieve an acceptable communications bit error rate (below 10⁻¹⁸) with an input-
referred TIA noise density of 10⁻²⁴ A²/Hz, a total loss of no more than 13 dB in the
passive optical structure may be tolerated. For current technology, 2 cm die sizes,
typical values for loss in passive waveguides (2 dB/cm) and for loss per optical
switch (0.3 dB), the limit for nOSC is reached for 48 optical switches crossed.
Further technology improvements are expected to reduce switch losses to 0.2 dB
which may lead to reliable structures with 64 optical switches crossed. Given these
observations, we consider that a design with nOSC equal to 48 represents a cur-
rently feasible solution and a design with nOSC between 48 and 64 represents a
feasible solution in the near future. The design feasibility step (section Study of
Optical Layer Complexity) evaluates all the optical paths in the ONoC.
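The feasibility check amounts to a link-budget test per optical path. The sketch below uses the loss figures quoted above, while the path lengths passed to it are arbitrary examples; the 48 and 64 crossing limits quoted in the text follow from the authors' complete budget, not from these two calls.

```python
def path_is_feasible(n_switches_crossed, waveguide_length_cm,
                     loss_budget_db=13.0, waveguide_loss_db_per_cm=2.0,
                     switch_loss_db=0.3):
    """Link-budget test for one optical path: the total passive loss must stay
    within the budget set by laser output power and receiver sensitivity."""
    total_loss_db = (waveguide_loss_db_per_cm * waveguide_length_cm
                     + switch_loss_db * n_switches_crossed)
    return total_loss_db <= loss_budget_db

print(path_is_feasible(27, waveguide_length_cm=2.0))   # True: 27 crossings fit
print(path_is_feasible(64, waveguide_length_cm=2.0))   # False with 0.3 dB per switch
```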
Since the ONoC aims only to manage inter-layer communications, full connec-
tivity between ONIs is not necessary (ONIs located on the same layer communicate
through the electrical NoCs). As a consequence, the number of optical switches
crossed by an optical signal can be reduced. Figure 8.4 illustrates an ONoC connect-
ing ONIs A, B, C and D. The initiator parts of ONI are shown on the left hand side
and the target parts are on the right. In Fig. 8.4a, the four ONIs are located on dif-
ferent layers while in Fig. 8.4b A and B are located on one layer, and C and D are
located on the other layer.

Fig. 8.5 100%, 50% and 25% interconnect ratio

In Fig. 8.4a three targets are reachable from each source,
and the wavelengths used for these communications are illustrated in the corre-
sponding truth table. For instance, ONI A communicates with ONI C using λ1. In
this case, two optical switches are crossed (nOSC = 2), as illustrated by the dashed
line. In the example illustrated in Fig. 8.4b, there is no need to connect ONIs on the
same layer (e.g. ONI A and ONI B) using the ONoC, and therefore half the com-
munication scheme is deleted, as illustrated by the corresponding truth table and the
resulting ONoC. A total of just two optical switches is necessary for the entire net-
work, and a single switch is crossed to perform the communication from A to C
(nOSC = 1). Hence, by using this method, the number of optical switches is reduced
without impacting ONoC performance, which remains contention free. Only [22,
23] consider contention-free ONoC, but they do not consider any method to reduce
the implementation complexity when total connectivity is not required.
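The reduction can be pictured as building the wavelength truth table only for ONI pairs that actually need the ONoC, i.e. pairs located on different layers. The naive assignment below is for illustration only and is not the wavelength-router synthesis of [5].

```python
def wavelength_table(oni_layers):
    """Keep a (source, destination) -> wavelength entry only for ONI pairs on
    different layers; same-layer pairs use the electrical NoC and are dropped,
    which is what allows optical switches to be removed."""
    table, next_wl = {}, {}
    for src, src_layer in oni_layers.items():
        for dst, dst_layer in oni_layers.items():
            if src != dst and src_layer != dst_layer:
                next_wl[src] = next_wl.get(src, 0) + 1
                table[(src, dst)] = 'lambda%d' % next_wl[src]
    return table

# Fig. 8.4b situation: A and B on one layer, C and D on another
print(wavelength_table({'A': 0, 'B': 0, 'C': 1, 'D': 1}))
```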
In order to respect the nOSC constraint, we will explore a scenario where only a
subset of the nodes in electrical layer is connected to TSVs through ONIs. For the
nodes that are not connected to TSVs, a routing path to the closest node connected
to TSVs is required. We thus introduce the concept of interconnect ratio (IR). An IR
is defined as the number of ONIs divided by the total number of electrical nodes (or
switches), in percent. In this study, we consider 100%, 50% and 25% IR, as
illustrated in Fig. 8.5.

Architecture and Communication Models

For the purposes of system-level exploration, we use an abstract communication-
oriented model of the architecture. In order to focus on communication, nodes are
abstracted into atomic resources able to perform computation and to send/receive
requests to/from other nodes. Routing resources (both electrical and optical) are
characterized by their maximum bandwidth. In this way we model a scenario
where multiple communications share resources using a fraction of the total
resource bandwidth. Therefore, instead of using the clock cycle as the unit of
time, in our model the unit of time is given by the latency of a communication.
This is used to make fast estimations of communication performance. In addition
to their bandwidth, electrical routing resources (i.e. NI, electrical part of ONI,
electrical links and electrical switches) are also characterized by their latency.
ONIs are abstracted at the transmitter and receiver level.
Waveguides and optical switches are considered latency free and they are
characterized by a set of wavelengths potentially flowing through them. Optical
switches are also characterized by their resonant wavelength. The clock speed of
the architecture is limited by the speed of opto-electrical interfaces which require
serialization of data. The maximum conversion frequency currently supported
[5] is 100 MHz. Therefore, the system frequency is also 100 MHz. Note that while
electrical layer components operate at 100 MHz, the optical layer components
operate at 3.2 GHz [i.e. the frequency of the optical components equals the system
frequency (100 MHz) multiplied by the data bit width (32 bits)].
The model is configurable in terms of the number of electrical layers, the number
of nodes per layer and the value of IR.
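The following sketch shows one possible encoding of this abstract model (class and
attribute names are ours, not those of the internal simulator), including the frequency
relation described above:

    from dataclasses import dataclass, field

    @dataclass
    class ElectricalResource:
        """NI, electrical part of an ONI, electrical link or electrical switch."""
        bandwidth: float   # maximum bandwidth, shared by concurrent communications
        latency: float     # latency contribution of the resource

    @dataclass
    class OpticalResource:
        """Waveguide or optical switch: considered latency free in this model."""
        wavelengths: set = field(default_factory=set)  # wavelengths flowing through
        resonant_wavelength: str = ""                  # meaningful for switches only

    SYSTEM_FREQ_MHZ = 100                    # limited by opto-electrical conversion [5]
    DATA_BIT_WIDTH = 32
    OPTICAL_FREQ_MHZ = SYSTEM_FREQ_MHZ * DATA_BIT_WIDTH   # 3200 MHz, i.e. 3.2 GHz

    @dataclass
    class Architecture:
        n_electrical_layers: int
        nodes_per_layer: int
        ir_percent: int    # interconnect ratio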

Design Tradeoffs

In this section, we evaluate complexity and performance metrics for various archi-
tectural configurations. This evaluation shows some of the design tradeoffs for 3D
architectures including ONoC.

Study of Optical Layer Complexity

As explained in section Optical Layer, the main factor in the implementation
complexity of the optical layer is the maximum number of optical switches crossed
(nOSC) by an optical signal. Figure 8.6 illustrates the evolution of this number for
an architecture including four electrical layers and an optical layer. This evolution
depends on the number of nodes per layer and the interconnect ratio (IR). For
instance, the point P highlighted in Fig. 8.6 represents the 27 optical switches crossed
required for an architecture that integrates four stacked 4 × 4 electrical layers (like
the one in Fig. 8.1a) with 50% IR. This value corresponds to a currently feasible
design implementation solution. Considering that it will be possible to cross 64
optical switches in the near future, feasible architectures that integrate 4 stacked
electrical layers with an IR of 100%, 50% and 25% will allow 64, 144 and 256 nodes
to be interconnected, in 4 × 4, 6 × 6 and 8 × 8 configurations, respectively.

Fig. 8.6 Implementation complexity of the optical layer for four electrical layer architectures:
optical switches crossed as a function of the electrical layer configuration (number of nodes),
for IR = 100%, 50% and 25%

Communication Performance Evaluation

We carried out performance evaluation using an event-based simulator that was
developed internally. This simulator is based on the model presented in section
Architecture and Communication Models. We simulated different configurations
of the presented architecture model with a synthetic throughput benchmark: each
node sends a considerable amount of data (order of MB) to a randomly selected
node. We present here the average values that are obtained by performing hundreds
of simulations.
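The synthetic workload can be pictured with the following stand-in generator (node
numbering, payload size and the number of runs are assumptions for illustration, not
the exact settings of the internal simulator):

    import random

    def synthetic_workload(num_nodes, payload_mb=4, seed=None):
        """Each node sends a large block of data to one randomly selected other node."""
        rng = random.Random(seed)
        transfers = []
        for src in range(num_nodes):
            dst = rng.choice([n for n in range(num_nodes) if n != src])
            transfers.append((src, dst, payload_mb * 2**20))  # payload in bytes
        return transfers

    # Reported figures are averages over many randomly generated workloads.
    workloads = [synthetic_workload(64, seed=i) for i in range(200)]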
We conducted two sets of experiments in order to evaluate system performance.
The first set of experiments analyzes the impact of the injection rate on the through-
put of the 3D architecture. We present here a comparison between 3D architectures
integrating only electrical layers with those including ONoC. The second set of
experiments analyzes the average transfer time depending on the architecture size.

Throughput as a Function of Injection Rate

Experiments were made for six configurations of 3D architectures with a total of 64
nodes: three architectures integrate electrical-only layers (annotated as 3D Mesh)
and three include electrical layers and an optical layer (annotated as ONoC). These
configurations are defined by the size of Mesh NoC integrated in each electrical
layer and the total number of electrical layers, as described in Table 8.1. The ONoC
architectures implement connectivity between all ONIs, which is characterized by
IR = 100%.

Table 8.1 3D architecture configurations

Configuration notation   Mesh size   No. of electrical layers
8 × 4 × 2                8 × 4       2
4 × 4 × 4                4 × 4       4
4 × 2 × 8                4 × 2       8

Fig. 8.7 Throughput (flits/node/cycle) as a function of injection rate (flits/node/cycle) for
64-node architectures (IR = 100%): 8 × 4 × 2, 4 × 4 × 4 and 4 × 2 × 8 3D Mesh and ONoC
configurations

As illustrated in Fig. 8.7, 3D architectures integrating ONoC outperform 3D
architectures integrating only electrical layers. This is possible thanks to the
contention-free property of optical switches. In fact, the bottleneck comes from
contentions occurring in the electrical NoCs and in the ONIs. This bottleneck
explains the almost constant throughput values at injection rates greater than 0.4, across all
configurations. Furthermore, for ONoC-based architectures, the throughput
increases with the number of layers. This is not the case for 3D Mesh configurations:
the optimal 3D Mesh configuration is 4 × 4 × 4. According to our experiments (not
presented here), when IR is set to 50% the 3D architecture integrating ONoC still
outperforms the electrical 3D architectures. When IR = 25%, the two architectures
provide similar performance. From these results, we conclude that ONoC-based
architectures scale better for next-generation MPSoCs, where hundreds of nodes
located on different layers are expected to communicate with each other.

Average Transfer Time as a Function of Architecture Size

For this analysis, in order to observe communication performance trends we ana-
lyze average transfer time as a function of the number of nodes on the electrical
layer with various values of IR for ONoC architectures. We present here the case
where interconnects are saturated, i.e. the case with the maximum injection rate
(100%).

Fig. 8.8 Average transfer times (in million cycles) for 2-layer architectures as a function of the
electrical layer configuration (number of nodes): (a) all communications (3D Mesh and ONoC
with IR = 100%, 50% and 25%); (b) intra-layer and inter-layer communications (IR = 100%,
50% and 25%)

Figure 8.8a illustrates the average transfer time for various architectural
configurations with two electrical layers. We observe that the average transfer time
depends on both the number of nodes and the IR. With an increase in the number
of nodes, the average transfer time increases for both 3D Mesh and all ONoC
architectures. The increase for the 3D Mesh is linear. As for the ONoC configurations,
the average transfer time also increases as the IR decreases. For all values of IR, it may be
observed that the increase in average transfer time is less rapid than for the 3D
Mesh. These results allow system designers to rapidly evaluate the benefits of opti-
cal interconnect (compared to electrical ones), and thus aid in designing the most
efficient interconnect architecture.
Figure 8.8b illustrates the average transfer times for intra-layer and inter-layer
communications:
• The average transfer time for intra-layer communication increases with the num-
  ber of nodes. This behavior is due to the increasing number of electrical switches
  that need to be crossed. The average transfer time also depends on the IR, since
  additional contentions occur in the electrical NoC for a reduced number of ONIs.
• The average transfer time for inter-layer communications strongly depends on the
  IR. Indeed, when the IR value is reduced, contentions occur for electro-optical
  and opto-electronic conversions. One can observe that the inter-layer communica-
  tion time increases slightly with the number of nodes. The main reason for this
  dependency is the necessity of electrical routing to and from the TSVs.
Figure 8.8b shows that, for a small number of nodes, intra-layer communications
perform faster than inter-layer communications. This trend is reversed for larger
configurations (e.g. 5 × 5 when IR is set to 50%). The number of nodes for which inter-layer
communications perform faster depends on the IR. These results help system design-
ers take advantage of both electrical and optical interconnect technologies for
short-range and long-range communications, respectively. Experiments with 4-layer
and 8-layer architectures (not presented here) validate our observations for this type
of architecture.
In this section we analyzed the complexity of the optical layer and highlighted
current and possible short-term trends. We also illustrated the potential for using
ONoC in order to achieve high throughput communications in large scale architec-
tures. Finally, we analyzed the impact of the number of layers, the number of nodes
and the IR on the communication performance of various architectures. This analy-
sis helps system designers to rapidly define the most appropriate interconnect archi-
tecture for a given architecture size. In order to fully benefit from the selected
architecture, tools are required to aid in the application mapping. Therefore, in the
following section we present one such tool.

Design Space Exploration

Fig. 8.9 System level exploration flow

In order to provide automatic mapping of an application onto 3D architectures inte-
grating ONoC, we implemented the design space exploration flow illustrated in
Fig. 8.9. The flow is an extension of our prior work presented in [38]. This explora-
tion flow automatically evaluates each mapping in the design space of possible
application mappings, searching for the best score, as detailed below. The inputs of
the flow are the architecture model defined in section Architecture and
Communication Models and an application model.
The application model is defined as a Directed Acyclic Graph G = (T, E) where T
is a set of tasks ti and E is a set of edges ei. A task ti denotes a function of a given
application. The task is annotated with the execution time (in clock cycles, cc, or
kilo clock cycles, kcc) necessary to execute it on a processor that is part
of a node in an electrical layer. An edge ei defines a directed data dependency from
a source task to a target task. Each edge is annotated with the amount of data
(expressed in bytes, b, or kilobytes, kb) transferred between these tasks.
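A possible encoding of this application model (naming is ours, shown only to fix
ideas; the values below are placeholders, not those of the Demosaic graph) is:

    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        exec_time_cc: int   # execution time in clock cycles on an electrical node

    @dataclass
    class Edge:
        src: str            # producing task
        dst: str            # consuming task
        data_bytes: int     # amount of data transferred from src to dst

    # Tiny illustrative fragment of a directed acyclic graph G = (T, E).
    tasks = [Task("t1", exec_time_cc=5_000), Task("t2", exec_time_cc=12_000)]
    edges = [Edge("t1", "t2", data_bytes=8 * 1024)]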

Fig. 8.10 Impact of mapping on system throughput: (a) low utilization rate of a given routing
resource, (b) high utilization rate of the same resource

A mapping solution assigns each task of the application model to a node in
charge of its execution. Communications occurring in the architecture depend on
this mapping and directly impact the system throughput, i.e. the system execution
performance. Figure 8.10 illustrates the utilization rate of a given routing resource
according to two possible mappings. In Fig. 8.10a, the resource utilization rate is
low, resulting in long utilization time, i.e. a low throughput. This typically happens
when contentions occur. In Fig. 8.10b, the same routing resource is intensively used,
resulting in a shorter utilization time, and, therefore, higher throughput.
For each optimization, we explore a set of possible mappings through the
NSGA-II evolutionary algorithm [39], using crossover and mutation operators.
Each exploration is set to iterate 50 times with a population size of 200 individuals,
i.e. 10,000 mappings are evaluated. The overall flow works at system level, which
allows fast exploration (of the order of a couple of minutes). For each mapping
solution, we simulate execution for a single application iteration and we measure
the utilization of all resources (switch, ONI, etc.). We define the score of the evalu-
ated mapping as the longest utilization time over all used resources. This longest utili-
zation time (and, therefore, the score) corresponds to the worst-case minimum
delay between successive iterations that avoids inter-iteration contention. As the
exploration searches for a mapping minimizing the score, the optimization result is
the mapping maximizing the system throughput. In this context, throughput
improvements are expected because contention-free properties of optical switches
result in an increased utilization rate of routing resources. Note that the throughput
improvements correspond directly to the execution time speedup. Therefore,
improvements in throughput guarantee speedup, and vice versa.
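The scoring step can be summarised by the sketch below (the per-resource utilization
times are assumed to be produced by the simulation of one application iteration; the
actual flow couples this score with the NSGA-II search [39]):

    def mapping_score(utilization_time):
        """Score of a mapping: longest utilization time over all used resources.

        This is the worst-case minimum delay between successive application
        iterations that avoids inter-iteration contention; minimizing it therefore
        maximizes the sustainable system throughput.
        """
        return max(utilization_time.values())

    # utilization_time: busy time (in cycles) of each resource for one iteration.
    example = {"switch_0": 1200, "ONI_3": 2100, "link_7": 800}
    assert mapping_score(example) == 2100   # throughput limited by ONI_3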

Case Study

This section presents the results obtained by using the presented design space explo-
ration technique to optimize the mapping of the image processing application
Demosaic onto various network architecture configurations. The application is an
industrial reference application, provided by STMicroelectronics.

Fig. 8.11 Annotated task graph representing the Demosaic application

The Demosaic image processing application performs color filter array interpolation, also called
demosaicing, as a part of the digital image processing chain. Demosaicing is neces-
sary since the standard camera sensor detects only one color value per pixel (green,
blue or red). In order to reconstruct the output image, the Demosaic application
performs three interpolations on an input image: (1) the green interpolation, (2) the
blue interpolation, and (3) the red interpolation. Figure 8.11a represents the corre-
sponding application model (using the annotations for task ti and edge ei as explained
in section Design Space Exploration). In order to stress communications, a
task set identified as the Demosaic kernel (in Fig. 8.11a) is replicated 8 times, allow-
ing the application to manage larger image blocks (see Fig. 8.11b).
We use our design space exploration flow to optimize the mapping of 8
Demosaic kernels onto 64-node architectures where nodes are distributed across
2, 4, and 8 stacked electrical layers. For each of the layer configurations, we con-
sider several values of interconnect ratio (IR): 25%, 50% and 100%. We
present results for the Pareto-optimal mapping obtained by our design space
exploration tool. Figure 8.12 compares the speedup of different configurations of the
ONoC-based architectures to the (reference) speedup of the architectures integrat-
ing only electrical layers (3D Mesh). The speedup is shown relative to the corre-
sponding 3D architecture, e.g. the speedup of the 8 × 4 × 2 ONoC is relative to the
8 × 4 × 2 3D Mesh, while that of the 4 × 4 × 4 ONoC is relative to the 4 × 4 × 4 3D
Mesh. For 2-layer configurations, the ONoC-based architecture and the electrical
architecture provide almost equivalent performance for IR values of 50% and
100%. The configuration where IR is 25% slightly underperforms (by 0.8%) com-
pared to the 3D Mesh architecture. This is due to the relatively large time required
for intra-layer communication (larger than for architectures with 4 × 4 or 4 × 2 mesh
size) in addition to the time required for electro-optical and opto-electrical con-
version. A similar, but less pronounced, effect may be seen for the 4-layer configuration
with 4 × 4 mesh size. For the remaining 4-layer configurations, ONoC-based
architectures provide significant speedup: 9% for IR = 50% and 17% for IR = 100%
compared to the 4 × 4 × 4 3D Mesh.

Fig. 8.12 Execution performance (speedup relative to the corresponding 3D Mesh) for different
64-node architectures executing the Demosaic kernel (replicated eight times), for ONoC with
IR = 25%, 50% and 100%

Note that the 4 × 4 × 4 3D Mesh architecture is
an optimal electrical-only architecture, as shown in Fig. 8.7. As for 8-layer
configurations, ONoC with 25%, 50% and 100% IR uniformly provides better
performance than the corresponding 3D Mesh, showing 8%, 18% and 35%
speedup, respectively. The corresponding speedup values are directly proportional
to the increase in throughput.
Optical interconnects enable novel communication possibilities (e.g. WDM
offers a new dimension for data or address coding) and provide high performance
levels (e.g. near zero latency for long range communications). However, to maxi-
mize the benefits from these features it is necessary to carry out careful design space
exploration at different levels:
• At the architectural level, reducing the design complexity of the ONoC by taking
  into account its context (e.g. the number of electrical computing resources, com-
  munication hierarchy and the resulting communication scenarios)
• At the application level, optimizing the mapping of complex applications while
  matching to ONoC communication performance levels
In this work, we consider exploration at both architecture and application levels
in order to (1) reduce the number of optical switches crossed by optical signals (thus
reducing communication losses and power consumption) and (2) maximize the
application execution throughput by using WDM. The obtained results demonstrate
that our exploration flow effectively exploits the routing capabilities of the ONoC to
maximize the system speedup factor. We believe that such a methodology allows
energy-efficient 3D MPSoCs to be designed, which further efficiently execute
data-intensive applications. The proposed methodology could be extended by con-
sidering further design challenges at the architectural level (e.g. layout) and addi-
tional metrics at the application level (e.g. power consumption).

Conclusion

This work addresses system-level design for 3D MPSoCs integrating an Optical
Network-on-Chip (ONoC). We presented a heterogeneous 3D MPSoC architecture
that consists of several electrical layers and an optical layer, which is used to per-
form high-bandwidth, contention-free routing. We showed various design tradeoffs
through the analysis of the optical layer complexity and highlighted a current and a
possible short-term design solution. We also illustrated the benefit of using ONoC
for high-throughput communications. We proposed a system-level exploration flow
optimizing the application mapping, while taking into account the routing capabilities
and contention-free properties of optical interconnect. The experimental results for
the image processing application Demosaic validate that our approach enables
efficient use of optical interconnects.
There are several areas of interest for our future work. For example, we currently
use the latency of communication as a unit of time. In the future we may use finer
granularity time intervals and investigate the trade-off between the accuracy and speed
of estimation. Furthermore, we will evaluate our approach on other industrial applica-
tions. We are particularly interested in data-intensive, communication-oriented appli-
cations, for which we strongly believe that this approach is beneficial. Finally, we will
investigate error modeling and reliability issues in optical interconnects. For this pur-
pose, we will evaluate the impact of power consumption and temperature on the data
transmission quality in optical interconnects. The exploration flow will then be
extended so that optical interconnect reliability is maximized.

References

1. International Technology Roadmap for Semiconductors (ITRS) [Online]. Available: http://public.itrs.net/. Accessed 30 Aug 2012
2. Ho R, Mai W, Horowitz MA (2001) The future of wires. Proc IEEE 89(4):490–504
3. Adler V, Friedman E (1998) Repeater design to reduce delay and power in resistive interconnect. IEEE Trans Circuits Syst II Analog Digital Signal Process 45(5):607–616
4. Nookala V, Sapatnekar SS (2005) Designing optimized pipelined global interconnects: algorithms and methodology impact. In: Proceedings of the IEEE international symposium on circuits and systems (ISCAS), Kobe, Japan, pp 608–611
5. O'Connor I, Mieyeville F, Gaffiot F, Scandurra A, Nicolescu G (2008) Reduction methods for adapting optical network on chip topologies to specific routing applications. In: Proceedings of design of circuits and integrated systems, Grenoble, 12–14 November 2008
6. Kobrinsky MJ, Block BA, Zheng J-F, Barnett BC, Mohammed E, Reshotko M, Robertson F, List S, Young I, Cadien K (2004) On-chip optical interconnects. Intel Technol J 8(2):129–141
7. Koester SJ, Dehlinger G, Schaub JD, Chu JO, Ouyang QC, Grill A (2005) Germanium-on-insulator photodetectors. In: IEEE international conference on group IV photonics, Antwerpen, Belgium, pp 171–173
8. Massoud Y et al (2008) Subwavelength nanophotonics for future interconnects and architectures. Invited talk, NRI SWAN Center, Rice University (presentation given at the University of Texas at Austin; see http://www.src.org/library/publication/p024870/)
9. Miller D (2009) Device requirements for optical interconnects to silicon chips. Proc IEEE Special Issue Silicon Photon 97(7):1166–1185
10. Minz JR, Thyagara S, Lim SK (2007) Optical routing for 3-D system-on-package. IEEE Trans Components Packaging Technol 30(4):805–812
11. O'Connor I, Gaffiot F (2004) On-chip optical interconnect for low-power. In: Macii E (ed) Ultra-low power electronics and design. Kluwer, Dordrecht
12. Gu H, Zhang W, Xu J (2009) A low-power fat tree-based optical network-on-chip for multiprocessor system-on-chip. In: Proceedings of design, automation, and test in Europe (DATE), Nice, France, pp 3–8
13. Gu H, Xu J, Wang Z (2008) A novel optical mesh network-on-chip for gigascale systems-on-chip. In: Proceedings of APCCAS, Macao, pp 1728–1731
14. Pasricha S, Dutt N (2008) ORB: an on-chip optical ring bus communication architecture for multi-processor systems-on-chip. In: Proceedings of ASP-DAC, Seoul, Korea, pp 789–794
15. Kirman N, Kirman M, Dokania RK, Martinez JF, Apsel AB, Watkins MA, Albonesi DH (2006) Leveraging optical technology in future bus-based chip multiprocessors. In: Proceedings of the 39th annual IEEE/ACM international symposium on microarchitecture, Orlando, Florida, USA
16. Shacham A, Bergman K, Carloni L (2008) Photonic networks-on-chip for future generations of chip multiprocessors. IEEE Trans Comput 57(9):1246–1260
17. Vantrease D, Schreiber R, Monchiero M, McLaren M, Jouppi NP, Fiorentino M, Davis A, Binkert NL, Beausoleil RG, Ahn JH (2008) Corona: system implications of emerging nanophotonic technology. In: Proceedings of the international symposium on computer architecture (ISCA), Beijing, pp 153–164
18. Beausoleil RG, Ahn J, Binkert N, Davis A, Fattal D, Fiorentino M, Jouppi NP, McLaren M, Santori CM, Schreiber RS, Spillane SM, Vantrease D, Xu Q (2008) A nanophotonic interconnect for high-performance many-core computation. In: Proceedings of the 16th IEEE symposium on high performance interconnects, Cardiff, UK, pp 182–189
19. Pan Y, Kumar P, Kim J, Memik G, Zhang Y, Choudhary A (2009) Firefly: illuminating future network-on-chip with nanophotonics. In: Proceedings of the international symposium on computer architecture (ISCA), Austin, Texas, pp 429–440
20. Pan Y, Kim J, Memik G (2010) FlexiShare: channel sharing for an energy-efficient nanophotonic crossbar. In: Proceedings of the IEEE international symposium on high-performance computer architecture (HPCA), Bangalore, pp 1–12
21. Briere M, Girodias B, Bouchebaba Y, Nicolescu G, Mieyeville F, Gaffiot F, O'Connor I (2007) System level assessment of an optical NoC in an MPSoC platform. In: Proceedings of design, automation and test in Europe (DATE), Nice, 16–20 April 2007, pp 1084–1089
22. Joshi A, Batten C, Kwon Y, Beamer S, Shamim I, Asanovic K, Stojanovic V (2009) Silicon-photonic Clos networks for global on-chip communication. In: Proceedings of the 3rd ACM/IEEE international symposium on networks-on-chip (NOCS), Catania, Italy, pp 124–133
23. Cianchetti MJ, Kerekes JC, Albonesi DH (2009) Phastlane: a rapid transit optical routing network. In: Proceedings of the international symposium on computer architecture (ISCA), Austin, Texas, pp 441–450
24. Loi I, Angiolini F, Benini L (2007) Supporting vertical links for 3D networks-on-chip: toward an automated design and analysis flow. In: Proceedings of the international conference on nano-networks, Catania, Italy, pp 15:1–15:5
25. Seiculescu C, Murali S, Benini L, De Micheli G (2010) SunFloor 3D: a tool for networks on chip topology synthesis for 3-D systems on chips. IEEE Trans Comput Aided Des Integr Circuits Syst 29:1987–2000
26. Zhou P, Yuh P-H, Sapatnekar SS (2010) Application-specific 3D network-on-chip design using simulated allocation. In: Proceedings of the Asia and South Pacific design automation conference (ASP-DAC), Taipei, pp 517–522
27. Weerasekera R, Zheng LR, Pamunuwa D, Tenhunen H (2007) Extending systems-on-chip to the third dimension: performance, cost and technological tradeoffs. In: Proceedings of the international conference on computer-aided design, San Jose, pp 212–219
28. Pavlidis VF, Friedman EG (2007) 3-D topologies for networks-on-chip. IEEE Trans VLSI Syst 15(10):285–288
29. Feero BS, Pande PP (2009) Networks-on-chip in a three-dimensional environment: a performance evaluation. IEEE Trans Comput 58(1):32–45
30. Feihui L, Nicopoulos C, Richardson T, Xie Y, Narayanan V, Kandemir M (2006) Design and management of 3D chip multiprocessors using network-in-memory. ACM SIGARCH Comput Archit News 34(2):130–141
31. Bartzas A, Skalis N, Siozios K, Soudris D (2007) Exploration of alternative topologies for application-specific 3D networks-on-chip. In: Workshop on application specific processors (WASP), Salzburg, Austria
32. Shan Y, Lin B (2008) Design of application-specific 3D networks-on-chip architectures. In: Proceedings of the IEEE international conference on computer design (ICCD), Lake Tahoe, CA, pp 142–149
33. Rahman A, Reif R (2000) System-level performance evaluation of three-dimensional integrated circuits. IEEE Trans VLSI Syst 8(6):671–678
34. Facchini M, Carlson T, Vignon A, Palkovic M, Catthoor F, Dehaene W, Benini L, Marchal P (2009) System-level power/performance evaluation of 3D stacked DRAMs for mobile applications. In: Proceedings of design, automation, and test in Europe (DATE), Nice, France, pp 923–928
35. Dong X, Xie Y (2009) System-level cost analysis and design exploration for three-dimensional integrated circuits (3D ICs). In: Proceedings of ASP-DAC, Yokohama, pp 234–241
36. Addo-Quaye C (2005) Thermal-aware mapping and placement for 3-D NoC designs. In: Proceedings of the IEEE international systems-on-chip conference, Herndon, VA, pp 25–28
37. Spuesens T, Liu L, de Vries T, Romeo PR, Regreny P, Van Thourhout D (2009) Improved design of an InP-based microdisk laser heterogeneously integrated with SOI. In: 6th IEEE international conference on group IV photonics (GFP), San Francisco, pp 202–204
38. Le Beux S, Bois G, Nicolescu G, Bouchebaba Y, Langevin M, Paulin P (2010) Combining mapping and partitioning exploration for NoC-based embedded systems. J Syst Archit 56(7):223–232
39. Srinivas N, Deb K (1994) Multiobjective optimization using nondominated sorting in genetic algorithms. Evol Comput 2(3):221–248