
Timing Optimization Through Clock Skew Scheduling

Ivan S. Kourtev • Baris Taskin • Eby G. Friedman

Ivan S. Kourtev
University of Pittsburgh
Pittsburgh, PA
USA

Baris Taskin
Drexel University
Philadelphia, PA
USA

Eby G. Friedman
University of Rochester
Rochester, NY
USA

ISBN: 978-0-387-71055-6 e-ISBN: 978-0-387-71056-3


DOI: 10.1007/978-0-387-71056-3

Library of Congress Control Number: 2008937987

© Springer Science+Business Media, LLC 2009


All rights reserved. This work may not be translated or copied in whole or in part without the written
permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY
10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection
with any form of information storage and retrieval, electronic adaptation, computer software, or by similar
or dissimilar methodology now known or hereafter developed is forbidden. The use in this publication of
trade names, trademarks, service marks and similar terms, even if they are not identified as such, is not to
be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
While the advice and information in this book are believed to be true and accurate at the date of going
to press, neither the authors nor the editors nor the publisher can accept any legal responsibility for any
errors or omissions that may be made. The publisher makes no warranty, express or implied, with respect
to the material contained herein.

Printed on acid-free paper

springer.com
Preface

History of the Book


The last three decades have witnessed an explosive development in in-
tegrated circuit fabrication technologies. The complexity of current CMOS
circuits is reaching beyond the 65 nanometer feature size and several hundred
million transistors per integrated circuit. To fully exploit this technological
potential, circuit designers use sophisticated Computer-Aided Design (CAD)
tools. While supporting the talents of innumerable microelectronics engineers,
these CAD tools have become the enabling factor responsible for the success-
ful design and implementation of thousands of high performance, large scale
integrated circuits.
This book (a research monograph) originated from a body of doctoral dis-
sertation research completed by the first author at the University of Rochester
from 1994 to 1999 while under the supervision of Prof. Eby G. Friedman. This
research focuses on issues in the design of the clock distribution network in
large scale, high performance digital synchronous circuits and particularly, on
algorithms for non-zero clock skew scheduling. During the development of this
research, it became clear that incorporating timing issues into the successful
integrated circuit design process is of fundamental importance, particularly in
that advanced theoretical developments in this area have been slow to reach
the designers’ desktops. The second edition of the book is enhanced by the
body of doctoral dissertation research completed by the second author at the
University of Pittsburgh from 2000 to 2005 under the supervision of Prof.
Ivan S. Kourtev. This dissertation focuses on advanced timing, synchroniza-
tion and design methodologies based on non-zero clock skew scheduling. In-
cluded in this book are methods on the applicability of clock skew scheduling
to circuits with level-sensitive latches, a timing-driven circuit design method-
ology to attain the maximum performance from clock skew scheduling, and
a solution to the non-zero clock skew scheduling problem in a parallel comput-
ing environment, specifically derived for integration into the physical design
process of an emerging non-zero clock skew clocking technology.


It is the authors’ belief that the successful application of non-zero clock
skew scheduling techniques to the integrated circuit design process can only
follow a detailed understanding of the operation of integrated circuits at many
different levels—from device physics through system architecture to packag-
ing. While a detailed coverage of all of these topics in a single text is im-
practical, an honest effort has been made to provide an in-depth treatment of
all of those areas closely related to the clock skew scheduling techniques pre-
sented in this book. Tutorial chapters on the structure and design of modern
integrated circuits, as well as on the fundamental principles of signal delay
are included in this text since these topics are crucial to understanding clock
skew scheduling in general. The information presented in these tutorial chap-
ters can also quickly familiarize the reader with the problems, definitions, and
terminology used throughout the book.
Automated methodologies for synchronous circuit performance optimiza-
tion through clock skew scheduling are the primary topic presented in this
book. The objectives of these methodologies are to improve the performance
(specifically, the operating frequency or speed) while increasing the reliability
of fully synchronous digital integrated circuits. Traditionally, design wisdom
has dictated the use of global zero clock skew. In the research presented here,
however, non-zero clock skew scheduling is exploited. A set of algorithms to
accomplish this objective are considered in more detail. Specifically, this book
deals in depth with the following issues:
• A methodology for simultaneous non-zero clock skew scheduling and design
of the topology of the clock distribution network. This methodology is
based on the pioneering works of Friedman [1] and Fishburn [2], and builds
on Linear Programming (LP) solution techniques. The non-zero clock skew
scheduling of circuits with level-sensitive latches and for multi-phase clock
signals is formulated as an LP problem. The simultaneous clock scheduling
and clock tree topology synthesis problem is formulated as a mixed-integer
linear programming problem that can be solved efficiently. The proposed
algorithms have been evaluated on a variety of benchmark and industrial
circuits and synchronous performance improvements of well above 60%
have been demonstrated.
• For those cases where reliable circuit operation and production yield are
the highest level priorities, an alternative problem formulation is devel-
oped. This formulation is based on a quadratic (hence the QP—quadratic
programming) measure, or cost function, of the tolerance of a clock sched-
ule to parameter variations. A mathematical framework is presented for
solving the constrained and bounded QP problem. A constrained ver-
sion of the problem is iteratively solved using the Lagrange multipliers
method. As these research issues are topics of great practical importance
for input/output (I/O) interfacing and Intellectual Property (IP) blocks,
explicit clock delay and skew requirements are fully integrated into the
mathematical model described here.

• The theoretical derivation of the limits on the improvements in the clock
period available through clock skew scheduling. The theoretical derivation
is performed by identifying the limits for three local data path topologies.
A methodology to mitigate the limitation of clock skew scheduling for a
reconvergent path system is presented. The methodology involves delay
insertion on some data paths of the reconvergent system and is formulated
as an LP problem for an automated application.
• A practical (and necessary) implementation of clock skew scheduling for an
emerging clock generation and distribution technology, resonant rotary
clocking technology. Preliminary efforts in modeling and implementation
are demonstrated. Details are included on the integration of clock skew
scheduling into a complete physical design flow for the automated design
of rotary clock synchronized synchronous circuits.
As with any project of this magnitude, mistakes are likely. To the best
knowledge of the authors, proper credit has been given to everyone whose
work has been mentioned here, but the authors take full responsibility for any
errors or omissions.

Acknowledgments
The authors would like to thank all of those who have helped in writing
and correcting early manuscript versions of this monograph—fellow colleagues
and students, as well as the anonymous reviewers who provided important
comments on improving the overall quality of this book. The authors would
also like to thank Dr. Bob Grafton from the National Science Foundation for
supporting the early research projects that have culminated in the writing
and production of this book. We would also like to warmly acknowledge the
assistance and support of Alex Greene and Katelyn Stanne from Springer—
Alex and Katie’s patience and encouragement have been crucial to the success
of this project.
The research work described in this research monograph was made pos-
sible in part by support from the National Science Foundation under Grant
No. MIP-9423886 and Grant No. MIP-9610108, by a grant from the New
York State Science and Technology Foundation to the Center for Advanced
Technology-Electronic Imaging Systems, and by grants from the Xerox Cor-
poration, IBM Corporation, Intel Corporation and Multigig Inc.

Pittsburgh, PA, Ivan S. Kourtev


Philadelphia, PA, Baris Taskin
Rochester, NY, Eby G. Friedman
July, 2008
Contents

1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1

2 VLSI Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Signal Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Synchronous VLSI Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 The VLSI Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16

3 Signal Delay in VLSI Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19


3.1 Delay Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.2 Devices and Interconnections . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.2.1 Analytical Delay Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.2 Controlling the Delay . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.3 Waveform Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 31
3.2.4 Short-Channel Effects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33
3.2.5 The Importance of Interconnections . . . . . . . . . . . . . . . . . 35
3.2.6 Delay Mitigation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37

4 Timing Properties of Synchronous Systems . . . . . . . . . . . . . . . . 41


4.1 Storage Elements . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.2 Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.3 Parameters of Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
4.3.1 Width of the Clock Pulse . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.2 Latch Clock-to-Output Delay . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.3 Latch Data-to-Output Delay . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.4 Latch Setup Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.3.5 Latch Hold Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.4 Flip-Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.5 Parameters of Flip-Flops . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5.1 Width of the Clock Pulse . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
4.5.2 Flip-Flop Clock-to-Output Delay . . . . . . . . . . . . . . . . . . . . 49


4.5.3 Flip-Flop Setup Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49


4.5.4 Flip-Flop Hold Time . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
4.6 The Clock Signal . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
4.6.1 Clock Skew . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
4.6.2 Multi-Phase Clock Synchronization . . . . . . . . . . . . . . . . . . 53
4.7 Single-Phase Path with Flip-Flops . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.7.1 Preventing the Late Arrival of the Data Signal . . . . . . . . 55
4.7.2 Preventing the Early Arrival of the Data Signal . . . . . . . 58
4.8 Single-Phase Path with Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . 61
4.8.1 Preventing the Late Arrival of the Data Signal . . . . . . . . 61
4.8.2 Preventing the Early Arrival of the Data Signal . . . . . . . 63
4.9 Multi-Phase Path with Latches . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
4.9.1 Preventing the Late Arrival of the Data Signal . . . . . . . . 66
4.9.2 Preventing the Early Arrival of the Data Signal . . . . . . . 68
4.10 A Final Note . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

5 Clock Skew Scheduling and Clock Tree Synthesis . . . . . . . . . . 71


5.1 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72
5.2 Definitions and Graphical Model . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2.1 Permissible Range of Clock Skew . . . . . . . . . . . . . . . . . . . . 74
5.2.2 Graphical Model of a Synchronous System . . . . . . . . . . . . 76
5.3 Clock Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80
5.4 Timing Constraints and Design Automation . . . . . . . . . . . . . . . . 85
5.5 Structure of the Clock Distribution Network . . . . . . . . . . . . . . . . 86
5.6 Solution of the Clock Tree Synthesis Problem . . . . . . . . . . . . . . . 87
5.7 Software Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.7.1 Simultaneous Clock Scheduling and Clock
Tree Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.7.2 Clock Skew Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90

6 Clock Skew Scheduling of Level-Sensitive Circuits . . . . . . . . . 97


6.1 Clock Scheduling for Level-Sensitive Circuits . . . . . . . . . . . . . . . . 97
6.1.1 Latching Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
6.1.2 Synchronization Constraints . . . . . . . . . . . . . . . . . . . . . . . . 98
6.1.3 Propagation Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.1.4 Validity Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
6.1.5 Initialization Constraints . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2 Iterative Approach to Clock Skew Scheduling . . . . . . . . . . . . . . . 103
6.3 Linearization of the Timing Analysis . . . . . . . . . . . . . . . . . . . . . . . 104
6.3.1 Modified Big M (MBM) Method . . . . . . . . . . . . . . . . . . . . 105
6.3.2 Linear Programming (LP) Model . . . . . . . . . . . . . . . . . . . . 106
6.4 An Example and Experimental Results . . . . . . . . . . . . . . . . . . . . . 108
6.4.1 Level-Sensitive Synchronous Circuit State
of Operation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 110
6.5 Optimality of the LP Formulation . . . . . . . . . . . . . . . . . . . . . . . . . 113

6.6 Multi-Phase Level-Sensitive Circuits . . . . . . . . . . . . . . . . . . . . . . . 117


6.6.1 Multi-Phase Synchronization Overview . . . . . . . . . . . . . . . 117
6.6.2 Multi-Phase Level-Sensitive Circuit Timing . . . . . . . . . . . 118
6.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120

7 Clock Skew Scheduling for Improved Reliability . . . . . . . . . . . 121


7.1 Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
7.1.1 Clock Scheduling for Maximum Performance . . . . . . . . . . 123
7.1.2 Maximizing Safety . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
7.1.3 Further Improvement . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
7.1.4 Clock Scheduling as a Quadratic Programming
Problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
7.2 Derivation of the QP Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2.1 The Circuit Graph . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 129
7.2.2 Linear Dependence of Clock Skews . . . . . . . . . . . . . . . . . . 130
7.2.3 Optimization Problem and Solution . . . . . . . . . . . . . . . . . . 137

8 Delay Insertion and Clock Skew Scheduling . . . . . . . . . . . . . . . . 145


8.1 Limitations on Minimum Clock Period . . . . . . . . . . . . . . . . . . . . . 146
8.1.1 Uncertainty of Data Propagation Times . . . . . . . . . . . . . . 147
8.1.2 Data Path Cycles . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
8.1.3 Reconvergent Paths . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
8.2 Delay Insertion Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 152
8.2.1 Motivational Example with a Reconvergent Path . . . . . . 153
8.2.2 Reconvergence in an Edge-Triggered Circuit . . . . . . . . . . 153
8.2.3 Reconvergence in a Level-Sensitive Circuit . . . . . . . . . . . . 159
8.2.4 General Reconvergent Data Path Systems . . . . . . . . . . . . 160
8.3 Linear Problem Formulation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.4 Practical Concerns in Modeling and Application . . . . . . . . . . . . . 163
8.5 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 165

9 Practical Considerations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167


9.1 Computational Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
9.1.1 Algorithm LMCS-1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
9.1.2 Algorithm LMCS-2 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 170
9.1.3 Algorithm CSD . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 172
9.1.4 Summary of the Proposed Algorithms . . . . . . . . . . . . . . . . 175
9.2 Unconstrained Basis Skews . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
9.3 I/O Registers and Target Delays . . . . . . . . . . . . . . . . . . . . . . . . . . 178
9.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 181

10 Clock Skew Scheduling in Rotary Clocking Technology . . . . 183


10.1 Resonant Clocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 183
10.1.1 Rotary Traveling Wave Oscillators . . . . . . . . . . . . . . . . . . . 185
10.1.2 Timing Requirements of Rotary Circuits . . . . . . . . . . . . . 189

10.2 Physical Design Flow . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 191


10.2.1 Timing-Driven Partitioning . . . . . . . . . . . . . . . . . . . . . . . . . 193
10.2.2 Partitioning with chaco . . . . . . . . . . . . . . . . . . . . . . . . . . . . 195
10.2.3 Register Insertion for Partitioning . . . . . . . . . . . . . . . . . . . 196
10.2.4 Clock Skew Scheduling of Partitions . . . . . . . . . . . . . . . . . 197
10.2.5 Timing-Driven Register Placement . . . . . . . . . . . . . . . . . . 200
10.3 Parallelization of Clock Skew Scheduling . . . . . . . . . . . . . . . . . . . 202
10.3.1 Speedup of Computation . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
10.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

11 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205


11.1 Clock Skew Scheduling of Level-Sensitive Circuits . . . . . . . . . . . 205
11.1.1 Experimental Results on ISCAS’89 Benchmark
Circuits . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206
11.1.2 Verification and Interpretation of Results . . . . . . . . . . . . . 208
11.1.3 Parameter Data Distributions . . . . . . . . . . . . . . . . . . . . . . . 209
11.1.4 Skew Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 211
11.2 Multi-Phase Level-Sensitive Circuits . . . . . . . . . . . . . . . . . . . . . . . 213
11.2.1 Multi-Phase Clocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
11.2.2 Multi-Phase Clocking Effects on Time Borrowing . . . . . . 219
11.2.3 Multi-Phase Clocking and Clock Skew Scheduling . . . . . 220
11.2.4 Simultaneous Time Borrowing and Clock Skew
Scheduling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 221
11.3 Quadratic Programming (QP) for Maximizing Safety . . . . . . . . 223
11.3.1 Description of Computer Implementation . . . . . . . . . . . . . 223
11.3.2 Graphical Illustrations of Results . . . . . . . . . . . . . . . . . . . . 225
11.4 Delay Insertion in Clock Skew Scheduling . . . . . . . . . . . . . . . . . . . 225
11.5 Physical Design of Rotary Clock Synchronized Circuits . . . . . . . 233
11.5.1 Clock Skew Scheduling of Partitions Results . . . . . . . . . . 234
11.5.2 Overall CAD Tool Results . . . . . . . . . . . . . . . . . . . . . . . . . . 237

12 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243

References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247

Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
List of Figures

1.1 Moore’s law—an exponential increase in circuit density. . . . . . . . . 2


1.2 Moore’s law—an exponential increase in circuit performance. . . . . 3
1.3 Example of applying localized negative clock skew. . . . . . . . . . . . . . 4

2.1 Logic schematic view of a full adder circuit. . . . . . . . . . . . . . . . . . . . 9


2.2 Circuit view of a two-input NAND gate. . . . . . . . . . . . . . . . . . . . . . . 10
2.3 Signal delay with linear ramp input and a linear ramp output. . . . 10
2.4 Signal delay with linear ramp input and an exponential output. . . 11
2.5 A finite-state machine (FSM) model of a synchronous system. . . . 13
2.6 A local data path. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.7 A typical integrated circuit design flow. . . . . . . . . . . . . . . . . . . . . . . . 16

3.1 A simple electronic circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20


3.2 Signal waveforms for the circuit shown in Figure 3.1(b). . . . . . . . . . 21
3.3 Signal waveforms for the inverter shown in Figure 3.1(b). . . . . . . . 22
3.4 An N-channel enhancement mode MOS transistor. . . . . . . . . . . . . . 24
3.5 A basic CMOS inverter logic gate. . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
3.6 Operating mode of a CMOS inverter. . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.7 High-to-low output transition for a step input signal. . . . . . . . . . . . 28
3.8 Operating point trajectory of a CMOS inverter for different. . . . . . 28
3.9 Low-to-high output transition for a step input signal. . . . . . . . . . . . 30
3.10 Graphical illustration of the RC signal delay expressions. . . . . . . . 37

4.1 A general view of a register. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 42


4.2 Schematic representation of a level-sensitive register or latch. . . . . 43
4.3 Idealized operation of a level-sensitive register or latch. . . . . . . . . . 44
4.4 Parameters of a level-sensitive register. . . . . . . . . . . . . . . . . . . . . . . . 46
4.5 An edge-triggered register or flip-flop. . . . . . . . . . . . . . . . . . . . . . . . . 47
4.6 Idealized operation of an edge-triggered register or flip-flop. . . . . . 48
4.7 Parameters of an edge-triggered register. . . . . . . . . . . . . . . . . . . . . . . 50
4.8 A typical clock signal. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51


4.9 Lead/lag relationships causing clock skew. . . . . . . . . . . . . . . . . . . . . 52


4.10 A sample multi-phase synchronization clock. . . . . . . . . . . . . . . . . . . . 53
4.11 Multi-phase clock skew. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
4.12 A single-phase local data path. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.13 Timing diagram—violation of the setup constraint. . . . . . . . . . . . . . 56
4.14 Timing diagram—violation of the hold constraint. . . . . . . . . . . . . . . 59
4.15 A single-phase local data path with latches. . . . . . . . . . . . . . . . . . . . 61
4.16 A multi-phase local data path with latches. . . . . . . . . . . . . . . . . . . . 65

5.1 A simple synchronous digital circuit with four registers and four
logic gates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 The permissible range of the clock skew of a local data path. A
timing violation exists if sk ∉ [lk , uk ]. . . . . . . . . . . . . . . . . . . . . . . 75
5.3 A directed multi-graph representation of the synchronous
system shown in Figure 5.1. The graph vertices correspond to
the registers, R1 , R2 , R3 and R4 , respectively. . . . . . . . . . . . . . . . . . . 77
5.4 A graph representation of the synchronous system shown in
Figure 5.1 according to Definition 5.3. The graph vertices
v1 , v2 , v3 , and v4 correspond to the registers, R1 , R2 , R3 and R4 ,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Transformation rules for the circuit graph. . . . . . . . . . . . . . . . . . . . . 79
5.6 Application of non-zero clock skew to improve circuit
performance (a lower clock period) or circuit reliability
(increased safety margins within the permissible range). . . . . . . . . . 83
5.7 Tree structure of a clock distribution network. . . . . . . . . . . . . . . . . . 86
5.8 Buffered clock tree for the benchmark circuit s1423. The circuit
s1423 has a total of N = 74 registers and the clock tree consists
of 45 buffers with a branching factor of f = 3. . . . . . . . . . . . . . . . . 91
5.9 Buffered clock tree for the benchmark circuit s400. The circuit
s400 has a total of N = 21 registers and the clock tree consists
of 14 buffers with a branching factor of f = 3. . . . . . . . . . . . . . . . . . 92
5.10 Sample input for the clock scheduling program described in
Section 5.7.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.11 Sample output for the clock scheduling program described in
Section 5.7.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.12 The application of clock skew scheduling to a commercial
integrated circuit with 6,890 registers [note that the time scale
is in femtoseconds, 1 fs = 10−15 sec = 10−6 ns]. . . . . . . . . . . . . . . . . . . 96

6.1 Possible cases for the arrival and departure times of data at the
initial latch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Propagation of the data signal in a simple circuit. . . . . . . . . . . . . . 101
6.3 The iterative algorithm for static timing analysis of
level-sensitive circuits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4 A simple synchronous circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108

6.5 A single-phase synchronization clock with a 50% duty cycle. . . . . . 109


6.6 Zero and non-zero clock skew timing schedules for the
level-sensitive circuit in Figure 6.4. . . . . . . . . . . . . . . . . . . . . . . . . . . 109
6.7 The optimized timing schedule for s27 operable with TCP = 4.1. . 112
6.8 Run times under 1250 seconds for the LP and MIP formulations. 115
6.9 Propagation of the data signal in a simple multi-phase circuit. . . 119

7.1 Circuit graph of the simple example circuit C1 from Section 7.1.1. 129
7.2 Two spanning trees and the corresponding minimal sets of
linearly independent clock skews and linearly independent cycles
for the circuit example C1 . Edges from the spanning tree are
indicated with thicker lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133

8.1 Limitation on the minimum clock period TCP caused by the


delay uncertainty of a local data path. . . . . . . . . . . . . . . . . . . . . . . . . 147
8.2 Limitation on the minimum clock period TCP caused by data
path cycles. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
8.3 Limitation on the minimum clock period TCP caused by
reconvergent paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 151
8.4 A simple reconvergent data path system. . . . . . . . . . . . . . . . . . . . . . 153
8.5 Timing of the edge-sensitive reconvergent system in Figure 8.4
after CSS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
8.6 The simple reconvergent system in Figure 8.4 after delay
insertion. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
8.7 Two reconvergent data path systems satisfying (P1) and (P2),
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
8.8 Timing of the simple level-sensitive reconvergent system in
Figure 8.4 after CSS. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 160
8.9 A generalized reconvergent data path system. . . . . . . . . . . . . . . . . . . 161
8.10 Timing of the edge-triggered reconvergent system with m=3
and n=2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
8.11 Timing of the level-sensitive reconvergent system with m=3 and
n=2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162

9.1 Computation of the clock schedule basis sb by computing only


the last nb rows of the matrix −Z + I. . . . . . . . . . . . . . . . . . . . . . . . . 173
9.2 The numerical constants (as functions of k = p/r) of the term
r³ in the runtime complexity expressions for the algorithms
LMCS-1, LMCS-2 and CSD, respectively. . . . . . . . . . . . . . . . . . . . . . 176
9.3 The numerical constants (as functions of k = p/r) of the term
r² in the memory complexity expressions for the algorithms
LMCS-1, LMCS-2 and CSD, respectively. . . . . . . . . . . . . . . . . . . . . . 176
9.4 Modified example circuit C1 to include an additional edge e6 .
C1 is originally introduced in Section 7.1.1 and illustrated in
Figure 7.1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 177

9.5 I/O registers in a VLSI integrated circuit. Note that the I/O
registers form part of the local data paths between the inside of
the circuit and the outside of the circuit. . . . . . . . . . . . . . . . . . . . . . . 179

10.1 Basic rotary clock architecture. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 186


10.2 The RTWO theory. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.3 The cross-section of the transmission line with shunt connected
inverters. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 187
10.4 Line voltage and line current for the 3.4 GHz clock example. . . . . 189
10.5 The clock phase relationships on an ROA ring. . . . . . . . . . . . . . . . . 190
10.6 The physical design flow of VLSI circuits with RTWO clock
synchronization. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 192
10.7 Partitioning a circuit for timing analysis. . . . . . . . . . . . . . . . . . . . . . 198
10.8 An ROA ring in a chip layout illustrated in 0.13 μm technology. . 201
10.9 Xgrid computing cluster. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203

11.1 Data propagation times for s938 with 32 registers and 496 data
paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
11.2 Maximum effective path delays in data paths of s938 for zero
clock skew. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
11.3 Maximum effective path delays for s938 for non-zero clock skew. 211
11.4 Distribution of the clock skew values of the non-zero clock skew
case for s938. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.5 Distribution of the clock delay values of the non-zero clock skew
case for s938. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.6 Generation of an n-phase data path with latches. . . . . . . . . . . . . . . 214
11.7 Non-overlapping multi-phase synchronization clock. . . . . . . . . . . . . 215
11.8 Effects of multi-phase clocking on time borrowing. . . . . . . . . . . . . . 219
11.9 Effects of multi-phase clocking on clock skew scheduling. . . . . . . . . 221
11.10 Effects of multi-phase clocking on time borrowing and clock
skew scheduling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
11.11 Circuit s3271 with r = 116 registers and p = 789 local data
paths. The target clock period is TCP = 40.4 nanoseconds. . . . . . . 227
11.12 Circuit s1512 with r = 57 registers and p = 405 local data
paths. The target clock period is TCP = 39.6 nanoseconds. . . . . . . 228
11.13 Percentage improvements through delay insertion in Table 11.6. . . 232
11.14 Percentage improvements on edge-triggered circuits in Table 11.6. 232
11.15 Percentage improvements on level-sensitive circuits in Table 11.6. 233
11.16 CAD tool flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
11.17 The run times of hpictiming with Xgrid on large circuits. . . . . . . 239
11.18 Run time breakdown of hpictiming program steps for s38584. . . 240
11.19 Run time breakdown of hpictiming program steps for s38417. . . 240
11.20 Run time breakdown of hpictiming program steps for
industrial1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
1 Introduction

The concept of data or information processing arises in a variety of fields.
Understanding the principles behind this concept is fundamental to computer
design, communications, manufacturing process control, biomedical engineer-
ing, and an increasingly large number of other areas in technology and science.
It is impossible to imagine modern life without computers for generating, ana-
lyzing and retrieving large amounts of information, as well as communicating
information regardless of location.
Technologies for designing and building microelectronics-based computa-
tional equipment have been steadily advancing ever since the first commercial
discrete integrated circuits (ICs) were introduced in the late 1950’s [3].1 As
predicted by Moore’s Law in the 1960’s [4], integrated circuit density has been
doubling approximately every 18 months. This scaling of circuit size has been
accompanied by a similar exponential increase in circuit speed (or more pre-
cisely, clock frequency). These trends of steadily increasing circuit size and
clock frequency are illustrated in Figures 1.1 and 1.2, respectively. As a result
of this amazing revolution in semiconductor technology, it is not unusual for
modern integrated circuits to contain over ten million switching elements (i.e.,
transistors) packed into a chip area as large as 500 mm² (e.g., [5, 6, 7]). This
truly exceptional technological capability is due to advances in both design
methodologies and physical manufacturing technologies. Research and experi-
ence demonstrate that this trend of exponentially increasing integrated circuit
computational power will continue into the foreseeable future.
Integrated circuit performance is typically characterized [8] by the speed
of operation, the available circuit functionality, and the power consumption,
and there are multiple factors which directly affect these performance charac-
teristics. While each of these factors is significant, on the technological side,

1 Monolithic integrated circuits were first introduced in the early 1960’s.

Fig. 1.1. Moore’s law—an exponential increase in circuit density, or number of
transistors, per integrated circuit. [The figure plots the transistor count per
integrated circuit, from the i4004 through the dual-core Itanium, on a logarithmic
scale against the year of introduction (1975–2005).]

increased circuit performance has been largely achieved by the following ap-
proaches:
• reduction in feature size (technology scaling), that is, the capability of
manufacturing physically smaller and faster circuit structures,
• increase in chip area, permitting a larger number of circuits and therefore
greater on-chip functionality,
• advances in packaging technology, permitting the increasing volume of
data traffic between an integrated circuit and its environment as well as
the efficient removal of heat created during circuit operation.
The most complex integrated circuits are referred to as VLSI circuits,
where the term VLSI stands for Very Large Scale Integration. This term
describes the complexity of modern integrated circuits consisting of hundreds
of thousands to many millions of active transistor elements. Presently, the

Fig. 1.2. Moore’s law—an exponential increase in circuit performance, or clock
frequency. [The figure plots the clock frequency in MHz, from the i4004 through
the Pentium IV and Itanium II, on a logarithmic scale against the year of
introduction (1975–2005).]

leading integrated circuit manufacturers have a technological capability for


the mass production of VLSI circuits with feature sizes as small as 65 nm [5,
6]. These sub-100 nanometer technologies are identified with the term deep
submicrometer (DSM) since the minimum feature size is well below the one
micrometer mark.
As these dramatic advances in fabricating technologies take place, inte-
grated circuit performance is often limited by effects closely related to the
very reasons behind these advances such as small geometry interconnect struc-
tures. Circuit performance has become strongly dependent and limited by
electrical issues that are particularly significant in deep submicrometer inte-
grated circuits. Signal delay and related waveform effects are among those
phenomena that have a great impact on high performance integrated circuit
design methodologies and the resulting system implementation. In the case

of fully synchronous VLSI systems, these effects have the potential to create
catastrophic failures due to the limited time available for signal propagation
among the gates.
The material presented in this monograph is associated with these afore-
mentioned delay effects from the perspective of a synchronous digital VLSI
system. The research results described here can be used to improve the per-
formance and reliability of a synchronous VLSI circuit through the design of
the clock distribution network common to any synchronous digital system.
Specifically, new algorithms for scheduling the arrival time of the clock sig-
nals at the individual registers (or synchronous macro blocks) of a circuit
and synthesizing the overall clock tree are discussed. Operational character-
istics, performance improvements and limitations to suggested improvements
are presented in a cohesive manner.
To provide an intuitive perspective into the topics discussed here, consider
the simple synchronous circuit shown in Figure 1.3 [9]. Two consecutively con-
nected local data paths, consisting of the registers, R1 and R2 , and R2 and R3 ,
respectively, are depicted in this figure. Consider that, by design, clock delays
to R1 and R3 must be identical. That is, the clock signal C1 to the register
R1 is synchronized2 with the clock signal C3 to R3 . The signal delays through
the registers are considered identical in this example, numerically assigned
to 2 ns. Under this identical register delay assumption, the path from R2 to
R3 is the worst case path (since it has a larger logic signal delay). By delaying
the clock signal C3 to the register R3 with respect to the clock signal to the
register R2 , a leading (or negative) clock skew is added to this local data path
from R2 to R3 . As the clock delays to R1 and R3 are designed to be identical, a
certain amount of lagging (or positive) clock skew is applied to the local data
path from R1 to R2 . Thus, the clock signal C2 should be designed to lead the
clock signal C3 by 1.5 ns, thereby forcing both paths R1 to R2 and R2 to R3
to have the same total effective local data path delay (consisting of propaga-

R1 R2 R3
Logic Logic
Data Signal Data Data Data
Delay = 4 ns Delay = 7 ns
Clock Clock Clock
C1 C2 C3

TC1 = 3 ns TC2 = 1.5 ns TC1 = 3 ns

Clock Signal

Fig. 1.3. Example of applying localized negative clock skew to a synchronous circuit.

2
The signals C1 and C3 arrive at the same time with no delay or advance with
respect to each other.
1 Introduction 5

tion delay TP D and local data path skew TSkew ) TP D + TSkew = 7.5 ns. The
delay of the critical path (R2 to R3 ) of the synchronous circuit is temporally
refined to the precision of the clock distribution network, and the entire sys-
tem (for this simple example) could operate at a maximum clock frequency of
133.3 MHz. Note that, if no localized clock skew were applied, the maximum
possible frequency would be 111.1 MHz. The performance characteristics of
the system, both with and without the application of localized clock skew, are
summarized in Table 1.1.

Table 1.1. Performance characteristics of the circuit shown in Figure 1.3 without
and with localized clock skew.

Local Data Path   TPD(min) with zero skew   TCi   TCf   TSkew   TPD(min) with non-zero skew
R1 → R2           4 + 2 + 0 = 6             3     1.5    1.5    4 + 2 + 1.5 = 7.5
R2 → R3           7 + 2 + 0 = 9             1.5   3     −1.5    7 + 2 − 1.5 = 7.5
fmax              111.1 MHz                                     133.3 MHz

Note that |TSkew| < TPD (since |−1.5 ns| < 9 ns) for the local data
path from R2 to R3. Therefore, it is ensured that the correct data signal is
successfully latched into R3 and no local data path/clock skew constraint rela-
tionship is violated. This design technique of applying localized clock skew is
particularly effective in sequentially-adjacent, temporally irregular local data
paths; however, it is applicable to any type of synchronous sequential sys-
tem. For certain architectures, a significant improvement in performance and
reliability is both possible and likely.
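The arithmetic behind Table 1.1 can be restated compactly in code. The following short Python sketch is illustrative only (the function names and data structures are not taken from any tool described in this book); it assumes the simple timing model of this example, in which the effective delay of a local data path is the logic delay plus the 2 ns register delay plus the path skew, and the skew of a path is the clock delay to the initial register minus the clock delay to the final register.

def effective_path_delay(logic_delay, register_delay, skew):
    # Total effective local data path delay: TPD + TSkew (all values in ns).
    return logic_delay + register_delay + skew

def minimum_clock_period(paths, register_delay, clock_delays):
    # The minimum feasible clock period is set by the slowest local data path;
    # the skew of a path is the clock delay to the initial register minus the
    # clock delay to the final register.
    return max(
        effective_path_delay(logic, register_delay,
                             clock_delays[src] - clock_delays[dst])
        for (src, dst), logic in paths.items()
    )

paths = {("R1", "R2"): 4.0, ("R2", "R3"): 7.0}   # logic delays in ns (Figure 1.3)
zero_skew = {"R1": 3.0, "R2": 3.0, "R3": 3.0}    # identical clock delays
scheduled = {"R1": 3.0, "R2": 1.5, "R3": 3.0}    # C2 leads C3 by 1.5 ns

for name, delays in (("zero skew", zero_skew), ("scheduled skew", scheduled)):
    period = minimum_clock_period(paths, 2.0, delays)
    print(f"{name}: T = {period} ns, f_max = {1000.0 / period:.1f} MHz")

Running the sketch reproduces the 111.1 MHz (zero skew, 9 ns critical path) and 133.3 MHz (scheduled skew, two balanced 7.5 ns paths) figures listed in Table 1.1.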
One of the objectives of this research monograph is to provide detailed
insight into the systematic application of the technique exemplified in Fig-
ure 1.3 and Table 1.1 and described above to synchronous sequential digital
circuits of arbitrary structure and size. To this end, the basic properties of
CMOS-based digital integrated circuits as well as the fundamental principles
of synchronous VLSI system operation are reviewed in Chapter 2.
In Chapters 3 and 4, the timing issues related to the implementation of
synchronous VLSI circuits are discussed. A summary of the definitions and
notations used in this monograph is presented. Signal delay in CMOS digital
integrated circuits is presented in Chapter 3 where the sources of both device
and interconnect delays are discussed.
In Chapter 4, the fundamental timing relationships of synchronous digital
systems are summarized as these relationships are key to understanding the
algorithms presented in Chapters 5 and 7. More specifically, Chapter 4 de-
scribes in considerable detail the properties of both the various types of timed
storage elements and of the data paths built with these elements.

In Chapter 5, clock skew scheduling is formally introduced. Specifically,
the relationships between clock skew and the clock distribution network are
analyzed in detail and a methodology for circuit performance optimization is
presented. The presentation in Chapter 5 focuses on the appropriate use of
timing constraints and an optimization objective to formulate a mathematical
clock skew scheduling problem for a given circuit. Circuits with both edge-
triggered (flip-flops) and level-sensitive (latches) registers as storage elements
are analyzed. It is shown how Linear Programming (LP) formulations can be
used in clock skew scheduling with the objective of minimizing the clock period
of a circuit. In practice, there may be a variety of situations where a different
design objective is more appropriate. For example, it may be appropriate to
try to maximize the timing reliability of the circuit under various process
and operating variations or to decrease the total circuit area by downsizing
circuits without compromising the timing reliability of the circuit. Such design
objectives can be addressed successfully via clock skew scheduling and two
important applications are detailed in Chapters 7 and 8.
In Chapter 6, the application of clock skew scheduling to circuits
with level-sensitive registers is formulated as an LP problem and the perfor-
mance results are presented.
In Chapter 7, a different class of clock skew scheduling algorithms are
described in detail. Based on a Quadratic Programming (QP) formulation,
these algorithms can be used when it is important to maximize the timing
reliability of a circuit in the presence of process and operating parameter
variations.
In Chapter 8, clock skew scheduling is discussed in a different perspective.
It is shown that by taking advantage of clock skew scheduling, the logic delay
of parts of the circuit can be increased without compromising the circuit
reliability and correct operation. Longer permissible logic delays are directly
translated into reduced circuit sizes, thereby leading to savings in both circuit
area and power.
In Chapter 9, an efficient solution to the QP problem formulation in
Chapter 7 is developed and analyzed. Also demonstrated in Chapter 9 is a
process for integrating certain issues of practical importance into the mathe-
matical model presented in Chapter 7.
In Chapter 10, the application of clock skew scheduling to an emerging
type of clock distribution network based on resonant oscillation and adiabatic
switching is described in detail. It is shown how clock skew scheduling al-
gorithms can be modified in order to address the particular physical design
challenges of the resonant rotary clocking technology.
In Chapter 11, the application of the QP based algorithms to benchmark
and industrial circuits is presented. Finally, some conclusions are offered in
Chapter 12.
2 VLSI Systems

High performance VLSI digital systems are composed of millions of electronic
devices that exhibit switching properties. The analysis and design of these
systems can be approached at different levels of abstraction, with the ad-
vantages and limitations corresponding to each such level [10, 11]. Abstract
representations are used to hide the details and highlight the essential fea-
tures of a system in a specific context. For example, the system architects
of a VLSI integrated circuit may choose Boolean or switching algebra as the
formal mathematical framework to describe a complex computational pro-
cedure [12, 13]. Circuit designers, on the other hand, may be interested in
active and passive circuit elements such as transistors and interconnect, as
well as in the underlying physical laws that govern the operation of these ele-
ments [10, 14, 15]. Aspects of the VLSI design issues covered in this monograph
overlap several levels of abstraction and require familiarity with the terminol-
ogy and phenomena at each of these levels. The information described in this
chapter provides a fundamental background for motivating the use of clock
distribution networks in VLSI-based synchronous digital systems.
The essential characteristics of a digital VLSI system are reviewed in this
chapter. First, the basic signal properties related to digital circuits are pre-
sented in Section 2.1. Following this description, the principles of operation
of a synchronous digital system are discussed in Section 2.2. The VLSI cir-
cuit design process is summarized in Section 2.3 followed by some concluding
remarks in Section 2.4.

2.1 Signal Representation


Data processing in the most widely available types of digital integrated circuits
(e.g. CMOS, Bipolar, BiCMOS and GaAs) is based on the transport of electri-
cal energy from one physical location to another physical location. Typically,
the information being processed is encoded as a physical variable that can
be stored and transmitted to other locations while functionally manipulated

along the way. Such a physical variable—also called a signal—is, for example,
the electrical voltage provided by a power supply (with respect to a ground
potential) and developed in circuit elements in the presence of an electromag-
netic field. The voltage signal or bit of information (in a digital circuit) is tem-
porarily stored in a circuit structure capable of accumulating electric charge.
This accumulating or storage property is called a capacitance—denoted by
the symbol C—and, depending on the materials and the physical properties,
is created by a variety of different forms of conductor-insulator-conductor
structures commonly found in integrated circuits.
Furthermore, modern digital circuits utilize Boolean (binary) logic, in
which information is encoded by two values of a signal. These two signal
values are typically called false and true (or low and high or logic zero and
logic one) and correspond to the minimum and maximum1 allowable values
of the signal voltage for a specific integrated circuit implementation.2 Since
the voltage V is proportional to the stored electric charge q (q = CV, where
C is the storage capacitance), the logic low value corresponds to a fully dis-
charged capacitance (q = CV = 0) while the logic high value corresponds
to a capacitance storing the maximum possible charge (fully charged to a
voltage V ).
The largest and most complicated digital integrated circuits today contain
many millions of circuit elements each processing hundreds and thousands of
binary signals [8, 10, 16, 17]. Every circuit element has a number of input
terminals through which data is received from other elements. In addition, a
circuit element has a number of output terminals through which the results
of the processing are made available to other elements. For a circuit to imple-
ment a particular function, the inputs and outputs of all of the elements must
be properly connected among each other. These connections are accomplished
with wires, which are collectively referred to as an interconnect network, while
the set of circuit elements processing the binary signals is often simply called
the logic gates. During normal circuit operation, signals are received at the in-
puts of the logic gates, the gates process the signals to generate new data, and
then transmit the resulting data signals to the corresponding logic elements
through a network of interconnections. This process involves the transport of
a voltage signal from one physical location to another physical location. In
each case, this process takes a small yet finite amount of time to be completed
and is often called the propagation delay of the signal.
Usually, a small number of logic gates are combined to yield modules (or
standard cells) that perform frequently encountered operations—these mod-
ules can then be reused at many different places in a circuit. An example of
such a module is the full adder circuit shown in Figure 2.1. This specific circuit

1 Or the maximum and minimum (that is, vice versa) voltage levels.
2 In practice, ranges of values close to the minimum and maximum signal voltages
are interpreted as logic zero and one, respectively. By doing so, the noise immunity
of the circuit is significantly improved.

adds two one-bit numbers x0 and y0 and a carry-in bit c0 to produce a two-bit
result z1 z0 , where z1 = x0 y0 + x0 c0 + y0 c0 and z0 = x0 ⊕ y0 ⊕ c0 . A typical
CMOS transistor configuration for one of the two-input NAND gates is shown
in Figure 2.2 [corresponding to the gates na 1 through na 3 in Figure 2.1].

Fig. 2.1. Logic schematic view of a full adder circuit. [The schematic shows gates
xo 1, xo 2 and na 1 through na 4, with inputs x0, y0, c0 and outputs z1, z0.]
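As a quick check of these output equations (an illustrative sketch added here; it is not part of the original text), the two result bits can be verified exhaustively in Python against the arithmetic sum of the three input bits:

# Full adder equations of Figure 2.1: z1 is the carry-out, z0 is the sum bit.
for x0 in (0, 1):
    for y0 in (0, 1):
        for c0 in (0, 1):
            z1 = (x0 & y0) | (x0 & c0) | (y0 & c0)   # z1 = x0 y0 + x0 c0 + y0 c0
            z0 = x0 ^ y0 ^ c0                        # z0 = x0 XOR y0 XOR c0
            assert 2 * z1 + z0 == x0 + y0 + c0       # two-bit result z1 z0 equals the sum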

The rate of data processing in a digital integrated circuit is directly related
to two factors—how fast the circuit can switch between the two logic values,
and how precisely a circuit element can interpret a specific signal value as the
intended binary logic state. Switching the state of a circuit between two logic
values requires either charging a fully discharged capacitance or discharging a
fully charged capacitance, depending upon the type of state transition—low-
to-high or high-to-low. This charging/discharging process is controlled by the
active switching elements in the logic gates and is strongly affected by the
physical properties of both the gates and the interconnections. Specifically,
the signal waveform shapes change, either enhancing or degrading the signals,
affecting both the ability and the time required for the logic gates to properly
interpret these signals.
The concept of signal propagation delay between two different points A
and B of a circuit is illustrated in Figures 2.3 and 2.4, respectively. The signals
at points A and B—denoted by sA and sB , respectively—are plotted versus
time for two different cases in Figures 2.3 and 2.4, respectively. Without con-
sidering the specific electronic devices and circuits required to create these
waveforms shapes, it is assumed that signal sA makes a high-to-low transi-
tion and triggers a computation that causes signal sB to make an opposite
low-to-high transition. Several important observations can be made from the
waveforms depicted in Figures 2.3 and 2.4:

Fig. 2.2. Circuit view of a two-input NAND gate. [The transistor-level schematic
shows the two inputs x0 and x1 and the VDD supply rail.]

Fig. 2.3. Signal propagation delay from point A to point B with a linear ramp
input and a linear ramp output. [The waveforms sA and sB are plotted versus time
with the 10%, 50%, and 90% reference levels; the fall time tfA of sA, the rise time
trB of sB, and the propagation delay tPDAB = tPLHAB between the 50% points are
indicated.]

• although sA is the same in each case, sB may have different shapes,


• a temporal relationship (or causality relationship) between sA and sB ex-
ists in the sense that sA ‘causes’ sB , thereby preceding the switching event
by an amount of time required for the physical switching process to prop-
agate through the circuit structure,

• regardless of shape, sB has the same logical meaning, that is, that the
state of the circuit at point B changes from low to high; this low-to-high
transition and the reverse high-to-low state transition of signal sA require
a positive amount of time to complete.
The temporal relationship between sA and sB as shown in Figures 2.3 and 2.4
must be evaluated quantitatively. This information permits the speed of the
signals at different points in the same circuit or in different circuits built
in different semiconductor technologies to be temporally characterized. By
quantifying the physical speed of the logical operations, circuit designers are
provided with the necessary timing information to design correctly functioning
integrated circuits.
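To illustrate how such a quantitative evaluation can be carried out, the following minimal Python sketch extracts the 50% propagation delay and the 10%–90% transition time from sampled waveforms. The waveforms below are hypothetical closed-form ramps chosen only to resemble Figure 2.4; in an actual design flow these samples would come from circuit simulation or measurement, and this sketch is not a method prescribed by this book.

import numpy as np

def rising_crossing(t, v, level):
    # First time a rising waveform reaches 'level' (linear interpolation).
    i = int(np.argmax(np.asarray(v) >= level))
    frac = (level - v[i - 1]) / (v[i] - v[i - 1])
    return t[i - 1] + frac * (t[i] - t[i - 1])

def falling_crossing(t, v, level):
    # First time a falling waveform drops to 'level' (linear interpolation).
    i = int(np.argmax(np.asarray(v) <= level))
    frac = (v[i - 1] - level) / (v[i - 1] - v[i])
    return t[i - 1] + frac * (t[i] - t[i - 1])

vdd = 1.0
t = np.linspace(0.0, 10.0, 2001)                              # time in ns
s_a = np.clip(vdd * (1.0 - (t - 1.0) / 2.0), 0.0, vdd)        # linear fall from 1 ns to 3 ns
s_b = vdd * (1.0 - np.exp(-np.maximum(t - 2.0, 0.0) / 1.5))   # exponential rise starting at 2 ns

t_pd = rising_crossing(t, s_b, 0.5 * vdd) - falling_crossing(t, s_a, 0.5 * vdd)
t_rise = rising_crossing(t, s_b, 0.9 * vdd) - rising_crossing(t, s_b, 0.1 * vdd)
print(f"tPD(A to B) = {t_pd:.2f} ns, 10%-90% rise time of sB = {t_rise:.2f} ns")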

2.2 Synchronous VLSI Systems


Typically, a digital VLSI system performs a complex computational algorithm,
such as a Fast Fourier Transform or a RISC3 architecture microprocessor.
Although modern VLSI systems contain a large number of components, these
systems normally employ only a limited number of different kinds of logic

Fig. 2.4. Signal propagation delay from point A to point B with a linear ramp
input and an exponential output.

3
RISC = Reduced Instruction Set Computer.

elements or logic gates. Each logic element accepts certain input signals and
computes an output signal used by other logic elements. At the logic level of
abstraction, a VLSI system is a network of tens of thousands or more logic
gates whose terminals are interconnected by wires in order to implement
the target algorithm.
As mentioned earlier in Section 2.1, the switching variables acting as in-
puts and outputs of a logic gate in a VLSI system are represented by tangible
physical quantities,4 while a number of these devices are interconnected to
yield the desired function of each logic gate. The specific physical characteris-
tics are collectively summarized with the term technology, that encompasses
such detail as the type and behavior of the devices that can be built, the
number and sequence of the manufacturing steps and the impedance of the
different interconnect materials. Today, several technologies are used in the im-
plementation of high performance VLSI systems—these are best exemplified
by CMOS, Bipolar, BiCMOS, and Gallium Arsenide [10, 16]. CMOS technol-
ogy, in particular, exhibits many desirable performance characteristics, such
as low power consumption, high density, ease of design and moderate to high
speed. Due to these excellent performance characteristics, CMOS technology
has become the dominant VLSI technology used today.
The design of a digital VLSI system requires a great deal of effort when
considering a broad range of architectural and logic issues, such as choosing
the appropriate gates and interconnections among these gates to achieve the
required circuit function. No design is complete, however, without considering
the dynamic (or transient) characteristics of the signal propagation or, alter-
natively, the changing behavior of the signals with time. Every computation
performed by a switching circuit involves multiple signal transitions between
the logic states, each transition requiring a finite amount of time to com-
plete. The voltage at every circuit node must reach a specific value for the
computation to be completed. Therefore, state-of-the-art integrated circuit
design is largely centered around the difficult task of predicting and properly
interpreting signal waveform shapes at various points within a circuit.
In a typical VLSI system, millions of signal transitions occur, such as
those shown in Figures 2.3 and 2.4, which determine the individual gate de-
lays and the overall speed of the system. Some of these signal transitions can
be executed concurrently while others must be executed in a strict sequential
order [17]. The sequential occurrence of the latter operations—or signal tran-
sition events—must be carefully coordinated in time so that logically correct
system operation is guaranteed and the results are reliable (in the sense that
these results can be repeated). This coordination is known as synchronization
and is critical to ensuring that any pair of logical operations in a circuit with
a precedence relationship proceed in the proper order. In modern digital inte-
grated circuits, synchronization is achieved at all stages of the system design
process and system operation by a variety of techniques, known as a timing

4
Such quantities as the electrical voltages and currents in electronic devices.

discipline or timing scheme [10, 18, 19, 20]. With few exceptions, these circuits
are based on a fully synchronous timing scheme, specifically developed to cope
with the finite speed at which physical signals propagate throughout
a system.
A fully synchronous system is most frequently modeled as a finite-state
machine as shown in Figure 2.5. As illustrated in Figure 2.5, there are three

Fig. 2.5. A finite-state machine (FSM) model of a synchronous system.

recognizable components in this system. The first component—the logic gates,


collectively referred to as the combinational logic—provides the range of op-
erations that a system executes. The second component—the clocked storage
elements or simply the registers—are elements that store the results of the
logical operations. Together, the combinational logic and registers constitute
the computational portion of a synchronous system and are interconnected in
a way that implements the required system function. The third component
of the synchronous system—known as the clock distribution network—is a
highly specialized circuit structure which does not perform a computational
process but rather provides an important control capability. The clock gen-
eration and distribution network controls the overall synchronization of the
circuit by generating a time reference and properly distributing this time ref-
erence to every register.
The normal operation of a system, such as the example shown in Figure 2.5,
consists of the iterative execution of computations in the combinational logic
followed by the storage of the processed results in the registers. The actual
process of storage is temporally controlled by the clock signal and occurs once
the signal transients in the logic gate outputs are completed and the outputs
have settled to a valid state. At the beginning of each computational cycle,
the inputs of the system together with the data stored in the registers initiate

a new switching process. As time proceeds, the signals propagate through the
logic, generating results at the logic output. By the end of the clock period,
these results are stored in the registers and are operated upon during the
following clock cycle.

Fig. 2.6. A local data path.

Therefore, the operation of a digital system can be thought of as the


sequential execution of a large set of simple computations that occur concur-
rently in the combinational logic portion of the system. The concept of a local
data path is a useful abstraction for each of these simple operations and is
shown in Figure 2.6. The magnitude of the delay of the combinational logic is
bound by the requirement of storing data in the registers within a clock pe-
riod. The initial register Ri is the storage element at the beginning of the local
data path and provides some or all of the input signals for the combinational
logic at the beginning of the computational cycle (defined by the beginning
of the clock period). The combinational path ends with the data successfully
latching within the final register Rf where the results are stored at the end of
the computational cycle. Each register acts as a source or sink for the data
depending upon which phase the system is currently operating in.

2.3 The VLSI Design Process


As previously mentioned, VLSI systems are composed of millions of active
electronic devices (transistors) with switching properties. Groups of these de-
vices are interconnected together to yield functional parts from which the
VLSI system is built. Typical functional parts include, for example, logic gates
such as the two-input NAND gate shown in Figure 2.2. In this monograph,

the design process refers to the activity in which a concept and a set of spec-
ifications are converted into an actual integrated circuit.
A view of the VLSI design process—also known as a design flow—is illus-
trated in Figure 2.7 magnifying the clock distribution network design process.
This flow is typical in the design of high-volume, Application-Specific Inte-
grated Circuits (ASICs). The sequence of steps in this design flow is from
top to bottom and follows the direction of the arrows as shown in Figure 2.7.
As previously mentioned, the design process often starts with loosely defined
behavioral and architectural specifications, as well as with design constraints
such as physical dimensions, cost, power supply voltage, operational temper-
ature and so on. Architectural specifications are refined and coded into a
Hardware Description Language (HDL) which forms the basis for the actual
synthesis process. The HDL descriptions are also useful in performing simu-
lations to verify the desired circuit function.
The synthesis process is performed by software-based synthesis tools which
compile the HDL descriptions into an equivalent logic schematic of a circuit—
each logic gate in this schematic has been predesigned and is available to
the synthesis tool as a library element. After the circuit synthesis process is
completed, the resulting logic and register circuit structures are symbolically
placed to form the integrated circuit. Wire routing among the circuit struc-
tures is performed next to connect the inputs and outputs of the logic gates
as well as to deliver the clock signal to each of the clocked registers within the
circuit. A variety of verification and simulation procedures are also performed
to ensure the correct functionality and timing of the integrated circuit. Among
these procedures is a timing verification step which includes the analysis of
the data and clock signal delays to ensure correct temporal operation.
The body of research presented in this monograph deals with certain as-
pects of the timing of VLSI-based digital circuits, particularly those topics
related to the clock distribution network. The timing optimization algorithms
presented in Chapters 5, 6 and 7 are integrated into the design flow at the
step called Clock Planning, shown shaded in Figure 2.7. As indicated in Fig-
ure 2.7, Clock Planning includes clock scheduling and the design of both the
topology and the circuit structure of the clock tree5 . The timing information
describing the signal delays obtained from the Placement of Logic and Regis-
ters step is used in the clock planning process. Specifically, both the maximum
and minimum data path delays are used in the clock skew scheduling process.
The entire chip verification process is not considered complete until the tim-
ing verification is satisfied after the detailed chip routing has been completed
and all physical impedance characteristics have been back annotated and an-
alyzed with accurate timing analysis tools [8, 9, 21]. Several iterations of the
Clock Planning may be required in order to satisfy the entire chip verification
process.

5
A clock tree is another term for describing the clock distribution network.

In this monograph, an algorithm to perform the simultaneous non-zero


clock skew scheduling and the topological design of the clock tree is presented
in Chapter 5 and enhanced algorithms for clock skew scheduling are presented
in Chapters 6 and 7.

Fig. 2.7. A typical integrated circuit design flow magnifying the clock distribution
network design process. The flow proceeds from Behavioral and Architectural
Specifications, Logic Synthesis, and Timing Specifications, through Placement of
Logic and Registers (providing delay information), Clock Planning (clock scheduling
and clock tree topology), Clock Tuning (pre-route), and Clock Verification, to
Detailed Chip Routing, Parasitic Extraction, Circuit Simulation and Verification.

2.4 Summary
The behavior of a fully synchronous system is well defined and controllable
as long as the time window provided by the clock period is sufficiently long

to allow every signal in the circuit to propagate through the required logic
gates and interconnect wires and successfully latch into the final register of
each local data path. In designing the system and choosing the proper clock
period, however, two contradictory requirements must be satisfied. First, the
smaller the clock period, the more computational cycles can be performed by
the circuit in a given amount of time. Alternatively, the time window defined
by the clock period must be sufficiently long so that the slowest signals reach
the destination registers before the current clock cycle is concluded and the
following clock cycle is initiated.
This strategy for organizing the computational process has certain clear
advantages that have made a fully synchronous timing scheme the primary
choice for digital VLSI systems:
• The properties and variations are well understood.
• The nondeterministic behavior of the propagation delay of the combina-
tional logic (due to environmental and process fluctuations and the un-
known input signal pattern) is eliminated such that the system as a whole
has a completely deterministic behavior corresponding to the implemented
algorithm. As long as the data signal is successfully captured inside the
register before the arrival of the next clock signal, the timing characteris-
tics of the system are completely known.
• The circuit design process does not need to be concerned with glitches
in the combinational logic outputs. Therefore, the only relevant dynamic
timing characteristic of the logic is the propagation delay.
• The state of the system is completely defined within the storage elements—
this characteristic greatly simplifies certain aspects of the design, debug
and test phases when developing a large synchronous digital system.
However, the synchronous paradigm also has certain limitations that make
the design of a synchronous VLSI system increasingly challenging:
• This synchronous approach has a serious drawback in that it
requires the overall circuit to operate as slowly as the slowest register-to-
register path. Thus, the global speed of a fully synchronous system de-
pends upon those data paths with the largest delays—these paths are also
known as the worst case or critical paths. In a typical VLSI system, the
propagation delays in the combinational paths are distributed unevenly so
there may be many paths with delays much smaller than the clock period.
Although these paths could operate at a lower clock period—or higher
clock frequency—it is these critical paths that bound the minimum clock
period, thereby imposing a limit on the overall system speed (or clock fre-
quency). This imbalance in propagation delays is sometimes so dramatic
that the system speed is dictated by only a handful of very slow paths.
• The clock signal has to be distributed to tens of thousands of storage
registers scattered throughout the system. Therefore, a significant portion
of the system area and dissipated power is devoted to the clock distribution

network—a circuit structure that does not perform any computational


function.
• The reliable operation of a synchronous digital system depends upon cer-
tain assumptions concerning the propagation delays which, if not satisfied,
can lead to catastrophic timing violations which would render the system
unusable.
3
Signal Delay in VLSI Systems

In order to understand the timing characteristics of a synchronous digital


system—specifically, the delays within the data paths and clock distribution
network—a more complete understanding of the properties of signal delay in
VLSI systems is necessary. The topic of signal delay in VLSI-based systems
is examined in detail in this chapter. Delay metrics are first analyzed and
certain definitions are introduced in Section 3.1. A more thorough analytical
treatment of the subject of computing delay in CMOS integrated circuits is
presented in Section 3.2.

3.1 Delay Metrics

The delay of a signal propagating from one point within a circuit to another
point is caused by both the active electronic devices (transistors) in the logic
elements and the various passive interconnect structures connecting the logic
gates. While the physical principles behind the operation of transistors and
interconnect are well understood at the current-voltage (I-V ) level, it is often
computationally difficult to directly apply this detailed information to the
densely packed multi-million transistor deep submicrometer (DSM) integrated circuits of today.
A general form of a circuit with N input and M output terminals (labeled
x1 , . . . , xN and y1 , . . . , yM , respectively) is shown in Figure 3.1(a). The box
labeled ‘CIRCUIT’ may represent a simple wire, a transistor, a logic gate con-
sisting of several transistors, or an arbitrarily complex combination of these
elements. The logic schematic outlined in Figure 3.1(b), for example, may
correspond to a portion of the circuit between points X and Y shown in
Figure 3.1(a). With the choice of logic circuit illustrated in Figure 3.1(b), a
logically possible signal activity at the circuit points X, Y, and Z is shown
in Figure 3.2. The dynamic characteristics and temporal relationships of the
signal transitions are described and formalized in Definitions 3.1, 3.2, and 3.3.


Fig. 3.1. A simple electronic circuit: (a) abstract representation of a circuit;
(b) logic schematic of part of the circuit in Figure 3.1(a).

Definition 3.1. If X and Y are two points in a circuit and sX and sY are
the signals at X and Y, respectively, the signal propagation delay tP DXY from
X to Y is defined 1 as the time interval from the 50% point of the signal
transition of sX to the 50% point of the signal transition of sY .
This formal definition of the propagation delay is related to the concept
that ideally, the switching point of a logic gate is at the 50% level of the output
waveform. Thus, 50% of the maximum output signal level is assumed to be
the boundary point where the state of the gate switches from one binary logic
state to the other binary logic state. Practically, a more physically correct def-
inition of propagation delay is the time from the switching point of the driving
circuit to the switching point of the driven circuit. Currently, however, this
switching point-based reference for signal delay is not widely used in practical
computer-aided design applications because of the computational complexity
of the algorithms and the increased amount of data required to estimate the
delay of a path based on information describing the signal waveform shape.
Therefore, choosing the switching point at 50% has become a generally ac-
ceptable practice for referencing the propagation delay of a switching element.
Also note that the propagation delay tP D as defined in Definition 3.1
is mathematically additive, thereby permitting the delay between any two
points X and Y to be determined by summing the delays through consecu-

1
Although the delay can be defined from any point X to any other point Y, the
points X and Y typically correspond to an input and an output of a logic gate,
respectively. In such a case, the signal delay from X to Y is the propagation delay
of the gate.
Fig. 3.2. Signal waveforms for the circuit shown in Figure 3.1(b).

structures between X and Y. From Figures 3.1(b) and 3.2, for example,
tP DXY = tP DXZ + tP DZY . However, this additivity property must be applied
with caution since neither of the switching points of consecutively connected
gates may occur at the 50% level. In addition, passive interconnect struc-
tures along signal paths do not exhibit switching properties although physical
signals propagate through these structures with finite speed (more precisely,
through signal dispersion). Therefore, if the properties of a signal propagat-
ing through a series connection of logic gates and interconnections are being
evaluated, an analysis of the entire signal path composed of gates and wires—
rather than adding 50%-to-50% delays—is necessary to avoid accumulating
significant error in the path delay.
In high performance CMOS VLSI circuits, logic gates often switch before
the input signal completes a transition.2 This difference in switching speed
may be sufficiently large such that an output signal of a gate will reach the 50%
point before the input signal reaches the 50% point. If this is the case, tP D as
defined by Definition 3.1 may have a negative value. Consider, for example, the
inverter connected between nodes X (inverter input) and Z (inverter output)
shown in Figure 3.1(b). The specific input and output waveforms for this

2
Also, a gate may have asymmetric signal paths, whereby a gate would switch
faster in one direction than in the other direction.
Fig. 3.3. Signal waveforms for the inverter in the circuit shown in Figure 3.1(b).

inverter are shown in detail in Figure 3.3. When the input signal sX makes
a high-to-low transition, the output signal sZ makes a low-to-high transition
(and vice versa). In this specific example, the low-to-high transition of the
signal sZ crosses the 50% signal level after the high-to-low transition of the
signal sX . Therefore, the signal delay tP LH (the signal name index is omitted
for clarity) is positive as shown by the direction of the arrow in Figure 3.3—
coinciding with the positive direction of the x-axis. However, when the input
signal sX makes a low-to-high transition, the output signal sZ makes a faster
high-to-low transition and crosses the 50% signal level before the input signal
sX crosses the 50% signal level. The signal delay tP HL in this case is negative
as shown by the direction of the arrow in Figure 3.3—coinciding with the
negative direction of the x-axis. This phenomenon can occur in circuits with
slow input signal transitions and fast output signal transitions, demonstrating
a weakness in the 50% delay definition commonly used today throughout
industry.
The possible asymmetry of the switching characteristics of a logic gate—
as illustrated by the waveforms shown in Figure 3.3—requires the ability to
discriminate between the values of the propagation delay in the two differ-
ent switching situations (a low-to-high or a high-to-low transition). One sin-
gle value of the propagation delay tP D —as defined in Definition 3.1—does
not provide sufficient information about possible asymmetry in the switch-

ing characteristics of a logic gate. Therefore, the concept of delay is extended


to include this missing information. Specifically, the direction of the output
waveform (since the output of a gate is typically the evaluation node) is in-
cluded in the definition of delay, thereby permitting the evaluation of the gate
switching speed to account for the effects of the output signal transition:
Definition 3.2. The signal propagation delays tP LHXY and tP HLXY , respec-
tively, denote the signal delay from input X to output Y (as defined in Defini-
tion 3.1) where the output signal (at point Y ) transitions from low to high and
from high to low, respectively (the low-to-high and high-to-low transitions).
It is important to consider both tP LH and tP HL during circuit analysis and
design. However, if only a single value of tP D is specified, tP D usually refers
to the arithmetic average, (tP LH + tP HL )/2.
While Definition 3.2 specifies the time between switching events, it does
not convey any information about the transition time of the events themselves.
This transition time is finite and is characterized by the two parameters de-
scribed in the following definition:
Definition 3.3. For a signal making a transition between two different logic
states, the transition time is defined as the time interval between the 10% point
and the 90% point of the signal. For a low-to-high transition, the rise transition
time tr = t90% − t10% . For a high-to-low transition, the fall transition time
tf = t10% − t90% .
The parameters defined in Definition 3.3 are illustrated in Figures 2.3 and 2.4
where the fall time tfA and the rise time trB for the signals sA and sB ,
respectively, are indicated.
As tr and tf are related to the slope of the signal transitions, the transition
times also affect the values of tP LH and tP HL , respectively. In Figure 3.2, for
example, note that if the signal sY had been slower—a longer fall time tfY —
sY would have crossed the 50% level at a later time, effectively increasing the
propagation delay tP LHXY . However—as illustrated in Figures 2.3 and 2.4—it
is possible for the 50%-to-50% delay to remain nearly the same, although the
signal slope may change significantly [note the rise time trB in Figures 2.3
and 2.4].
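The delay and transition time definitions above translate directly into simple measurements on sampled waveforms. The following Python sketch is illustrative only—the function names and the assumption of a single, monotonic transition per waveform are ours, not from the text. It locates the 10%, 50%, and 90% crossings by linear interpolation and evaluates tr, tf, and the 50%-to-50% propagation delay of Definitions 3.1 and 3.3.

```python
import numpy as np

def crossing_time(t, v, level):
    """Time at which waveform v(t) crosses 'level', assuming one monotonic
    transition; linear interpolation between adjacent samples."""
    t = np.asarray(t, dtype=float)
    v = np.asarray(v, dtype=float)
    above = v >= level
    idx = int(np.argmax(above != above[0]))      # first sample past the crossing
    t0, t1, v0, v1 = t[idx - 1], t[idx], v[idx - 1], v[idx]
    return t0 + (level - v0) * (t1 - t0) / (v1 - v0)

def transition_time(t, v, vdd, rising=True):
    """Rise (10% -> 90%) or fall (90% -> 10%) transition time, per Definition 3.3."""
    t10 = crossing_time(t, v, 0.1 * vdd)
    t90 = crossing_time(t, v, 0.9 * vdd)
    return (t90 - t10) if rising else (t10 - t90)

def propagation_delay(t, v_in, v_out, vdd):
    """50%-to-50% propagation delay, per Definition 3.1."""
    return crossing_time(t, v_out, 0.5 * vdd) - crossing_time(t, v_in, 0.5 * vdd)
```

Consistent with the discussion later in this section, the propagation delay returned by such a measurement can be negative when the output crosses its 50% point before the input does.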

3.2 Devices and Interconnections


The technology of choice for most modern high performance digital integrated
circuits is based on the MOSFET3 transistor structure. The primary reasons
for the wide application of MOSFETs are, among other things, high packing
density and, in its complementary form, low power dissipation. In this section,

3
MOSFET ≡ Metal-Oxide-Semiconductor Field Effect Transistor

the properties of both active devices and interconnections are discussed from
the perspective of circuit performance.
An N-channel enhancement mode MOSFET transistor (NMOS) is de-
picted in Figure 3.4. Note that in most digital applications, the substrate

Fig. 3.4. An N-channel enhancement mode MOS transistor.

is usually connected to the source, i.e., Vs = Vb and Vsb = 0. Therefore, the


four-terminal transistor depicted in Figure 3.4 can be considered as a three-
terminal device with the voltages Vs , Vg , and Vd controlling the operation of
the transistor. Assuming no substrate current, Idd = Iss —both currents Idd
and Iss are usually referred to as Ids only. In the following discussion, the
additional indices n and p are used to indicate which type of transistor is
being considered, N-channel or P-channel, respectively.
To first order, the drain current Idsn through a long-channel NMOS tran-
sistor4 can be modeled by the classical Shichman-Hodges set of equations [22]:
\[
I_{dsn} =
\begin{cases}
\beta_n \left[ (V_{gsn} - V_{tn})\,V_{dsn} - \dfrac{1}{2} V_{dsn}^2 \right], & V_{gsn} \ge V_{tn} \text{ and } V_{gdn} \ge V_{tn} \quad \text{(triode or linear region)} \\[6pt]
\dfrac{1}{2}\,\beta_n (V_{gsn} - V_{tn})^2, & V_{gsn} \ge V_{tn} \text{ and } V_{gdn} \le V_{tn} \quad \text{(pentode or saturation region)} \\[6pt]
0, & V_{gsn} \le V_{tn} \quad \text{(cutoff region).}
\end{cases}
\tag{3.1}
\]
4
Derivation of the PMOS I-V equations is straightforward by accounting for the
changes in voltage and current directions.

In (3.1), the parameter βn is a device parameter commonly called the gain


factor or the current gain of the transistor—the dimension of βn is [A/V²].
The current gain βn is
\[
\beta_n = K_n \,\frac{W_n}{L_n},
\tag{3.2}
\]
where Kn is the process transconductance parameter and Wn and Ln are the
width and length of the transistor channel, respectively. The process transconductance Kn is
\[
K_n = \mu_n C_{ox} = \mu_n \,\frac{\varepsilon_{ox}}{t_{ox}},
\tag{3.3}
\]
where μn is the carrier mobility, Cox is the gate capacitance per unit area,
εox is the permittivity of the gate oxide material (relative dielectric constant of 3.9 for SiO2 ),
and tox is the gate oxide thickness. By substituting the index p for the index
n in (3.1), (3.2), and (3.3), analogous expressions for βp and Kp of a P-
channel enhancement mode MOSFET transistor can be developed [16, 8, 10,
15]. Also note that the threshold voltage Vtn of an enhancement-mode N-
channel transistor is positive (Vtn > 0), while the threshold voltage Vtp of an
enhancement-mode P-channel transistor is negative (Vtp < 0).
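As a concrete illustration of (3.1) and (3.2), the long-channel model is easy to evaluate numerically. The sketch below is a minimal Python rendering of these equations; the parameter values in the example call (Kn, W/L, Vtn, and the bias voltages) are illustrative assumptions rather than values taken from the text.

```python
def beta_n(k_n, w, l):
    """Current gain beta_n = K_n * (W_n / L_n), eq. (3.2)."""
    return k_n * w / l

def ids_nmos(vgs, vds, vtn, beta):
    """Long-channel NMOS drain current, Shichman-Hodges model, eq. (3.1)."""
    if vgs <= vtn:
        return 0.0                                   # cutoff region
    vgd = vgs - vds
    if vgd >= vtn:                                   # triode (linear) region
        return beta * ((vgs - vtn) * vds - 0.5 * vds ** 2)
    return 0.5 * beta * (vgs - vtn) ** 2             # saturation region

# Illustrative numbers (not from the text): Kn = 100 uA/V^2, W/L = 10, Vtn = 0.5 V
b = beta_n(100e-6, 10.0, 1.0)
print(ids_nmos(vgs=1.2, vds=1.2, vtn=0.5, beta=b))   # device in saturation
```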
Equation (3.1) and the counterpart for a P-channel MOS device are fun-
damental to both static and dynamic circuit analysis. Static or DC analysis
refers to evaluating the circuit bias conditions in which the control voltages,
Vg , Vd , and Vs , remain constant. Dynamic analysis is attractive from a signal
delay perspective since it deals with voltage and current waveforms changing
with time. An important goal of dynamic analysis is to determine the timing
relationships among the transistor terminals. Specifically, the voltages at these
terminals are the signal representations of the data being processed. By per-
forming a dynamic analysis, the signal delay from an input waveform to the
corresponding output waveform can be evaluated at high levels of accuracy.
Complementary MOS logic or CMOS logic is the most popular circuit
style for most modern high performance digital integrated circuits. An ana-
lytical analysis of a simple CMOS logic gate is presented in Section 3.2.1 for
one of the simplest CMOS gates—the CMOS inverter shown in Figure 3.5.
Performing such a simple analysis illustrates the process for estimating cir-
cuit performance, as well as provides insight into what factors and how these
factors affect the timing characteristics of a logic gate.

3.2.1 Analytical Delay Analysis

Consider the CMOS inverter circuit consisting of a PMOS device Q1 and an


NMOS device Q2 as shown in Figure 3.5. For this analysis, assume that the
capacitive load of the inverter—consisting of the device capacitances, inter-
connect capacitances and the load capacitance of the following stage—can
be lumped into a single capacitor CL . The output voltage Vo = VCL is the
voltage across the capacitive load and the terminal voltages of the transistors
are listed in Table 3.1. The regions of operation for the devices, Q1 and Q2 ,

Fig. 3.5. A basic CMOS inverter logic gate.

Table 3.1. Terminal voltages for the P-channel and N-channel transistors in a CMOS
inverter circuit.

          Q1 (PMOS)             Q2 (NMOS)
Vgs       Vgsp = Vi − VDD       Vgsn = Vi
Vgd       Vgdp = Vi − Vo        Vgdn = Vi − Vo
Vds       Vdsp = Vo − VDD       Vdsn = Vo

are illustrated in Figure 3.6 depending upon the values of Vi and Vo . Re-
ferring to Figure 3.6 may be helpful in understanding the switching process
of a CMOS inverter. Methods for determining the values of the fall time
tf and the propagation delay tP HL are described in this section. Similarly,
closed form expressions are derived for the rise time tr and the propagation
delay tP LH .

Derivation of the Fall Time

The transition process used to derive tf and tP HL is illustrated in Figure 3.7.


Assume that the input signal Vi has been held at logic low (Vi = 0) for a
sufficiently long time such that the capacitor CL is fully charged to the value
of Vdd —the operating point of the inverter is point A depicted in Figures 3.6
and 3.8. At time t0 = 0, the input signal abruptly switches to a logic high. The
capacitor CL cannot discharge instantaneously, thereby forcing the operating
point of the circuit to point B, (Vi , Vo ) = (Vdd , Vdd ). At B, the device Q1 is
cut off while Q2 is conducting, thereby permitting CL to begin discharging
through Q2 . As this discharge process develops, the operating point moves
down the line BD, approaching point D when CL is fully discharged, i.e.,
Vo (D) = 0. Observe that during the interval 0 ≤ t < t2 , the operating point

Fig. 3.6. Operating mode of a CMOS inverter depending upon the input and output
voltages. (Note that the abbreviation ‘sat’ stands for the saturation region.)

is between B and C and the device Q2 operates in the saturation region. At


time t2 , the capacitor is discharged to Vdd −Vtn and Q2 begins to operate in the
linear region. For t ≥ t2 , the device Q2 is in the linear region. If 0.1Vdd < Vtn <
0.5Vdd (as is typical), then t1 < t2 < t3 as shown in Figure 3.7. Therefore,
the fall time is tf = t4 − t1 and the propagation time tP HL = t3 − 0 = t3 .
To determine the values of tf and tP HL , the output waveform Vo (t) must be
evaluated for each of the intervals [t0 , t2 ) and [t2 , ∞).
For t0 ≤ t < t2 , the current discharging the capacitor Idsn , shown in
Figure 3.5, is
\[
I_{dsn} = \frac{1}{2}\,\beta_n (V_{dd} - V_{tn})^2 = -C_L \frac{dV_o}{dt}.
\tag{3.4}
\]
Substituting
\[
\eta = \frac{V_{tn}}{V_{dd}} \quad \text{and} \quad \gamma_n = \frac{\beta_n V_{dd}(1 - \eta)}{C_L},
\tag{3.5}
\]
and solving (3.4) for Vo with the initial condition Vo (0) = Vdd , yields Vo (t) for
t0 ≤ t < t2 ,
\[
V_o(t) = V_{dd} - \frac{\beta_n}{2 C_L}\,(V_{dd} - V_{tn})^2\, t
       = V_{dd}\left[ 1 - \frac{\gamma_n}{2}\,(1 - \eta)\, t \right].
\tag{3.6}
\]

Fig. 3.7. High-to-low output transition for a step input signal.

Fig. 3.8. Operating point trajectory of a CMOS inverter for different input wave-
forms (only the rising input signal is shown); one diagram corresponds to an ideal
step input and the other to a non-ideal (non-step) input.

From (3.6) it can be further shown that
\[
V_o(t_2) = V_{dd} - V_{tn} \quad \text{for} \quad
t_2 = \frac{2 C_L\, V_{tn}}{\beta_n (V_{dd} - V_{tn})^2} = \frac{2\eta}{\gamma_n (1 - \eta)}.
\tag{3.7}
\]

The interval t ≥ t2 is considered next. The device Q2 operates in the linear
region, where Idsn is
\[
I_{dsn} = \beta_n \left[ (V_{dd} - V_{tn})\,V_o - \frac{1}{2} V_o^2 \right] = -C_L \frac{dV_o}{dt}.
\tag{3.8}
\]
A closed form expression for the output voltage Vo (t) for time t ≥ t2 is
obtained by solving (3.8), a Bernoulli equation, with the initial condition
Vo (t2 ) = Vdd − Vtn :
\[
V_o(t) = V_{dd}\,\frac{2(1 - \eta)}{1 + e^{\gamma_n (t - t_2)}} \quad \text{for } t \ge t_2.
\tag{3.9}
\]
The values of t1 from (3.6) and t3 and t4 from (3.9) are ([10, 15, 23])
\[
t_1 = \frac{1}{\gamma_n}\,\frac{0.2}{1 - \eta}, \qquad
t_3 = \frac{1}{\gamma_n}\left[ \frac{2\eta}{1 - \eta} + \ln(3 - 4\eta) \right], \qquad
t_4 = \frac{1}{\gamma_n}\left[ \frac{2\eta}{1 - \eta} + \ln(19 - 20\eta) \right].
\tag{3.10}
\]

The fall time tf is ([10, 15, 23])
\[
t_f = t_4 - t_1 = \frac{C_L}{\beta_n V_{dd}(1 - \eta)}
      \left[ 2\,\frac{\eta - 0.1}{1 - \eta} + \ln(19 - 20\eta) \right],
\tag{3.11}
\]
and the propagation delay tP HL is ([10, 15, 23])
\[
t_{PHL} = t_3 - 0 = t_3 = \frac{C_L}{\beta_n V_{dd}(1 - \eta)}
          \left[ \frac{2\eta}{1 - \eta} + \ln(3 - 4\eta) \right].
\tag{3.12}
\]

Derivation of the Rise Time

The rise time tr and propagation delay tP LH are determined from the switch-
ing process illustrated in Figure 3.9 (similarly to tf and tP HL derived earlier
in this section). Assume that the input signal Vi has been held at logic high
(Vi = Vdd ) for a sufficiently long time such that the capacitor CL is fully
discharged to Vo = 0. The operating point of the inverter is point D shown
in Figures 3.6 and 3.8. At time t0 = 0, the input signal abruptly switches
to a logic low. Since the voltage on CL cannot change instantaneously, the
operating point is forced at point E. At E, the device Q2 is cut off while Q1
is conducting, thereby permitting CL to begin charging through Q1 . As this
charging process develops, the operating point moves up the line EA towards
point A at which point CL is fully charged, i.e., Vo (A) = Vdd . Note that during
the interval 0 ≤ t < t2 , the operating point is between E and F and the device
Fig. 3.9. Low-to-high output transition for a step input signal.

Q1 operates in the saturation region. At time t2 , the capacitor is charged to


−Vtp (recall that Vtp < 0) and Q1 begins to operate in the linear region. For
t ≥ t2 , the device Q1 is in the linear region. If 0.1Vdd < |Vtp | < 0.5Vdd (as is
typical), then t1 < t2 < t3 as shown in Figure 3.9. Therefore, the rise time is
tr = t4 − t1 and the propagation delay is tP LH = t3 − 0 = t3 . To determine
the values of tr and tP LH , the output waveform Vo (t) must be evaluated for
each of the intervals [t0 , t2 ) and [t2 , ∞).
An analysis similar to that described previously in this section for the high-
to-low output transition can be performed to derive closed form expressions
for t1 , t3 , and t4 as shown in Figure 3.9. Substituting
\[
\pi = -\frac{V_{tp}}{V_{dd}} \quad \text{and} \quad \gamma_p = \frac{\beta_p V_{dd}(1 - \pi)}{C_L},
\tag{3.13}
\]
t1 , t3 , and t4 are
\[
t_1 = \frac{1}{\gamma_p}\,\frac{0.2}{1 - \pi}, \qquad
t_3 = \frac{1}{\gamma_p}\left[ \frac{2\pi}{1 - \pi} + \ln(3 - 4\pi) \right], \qquad
t_4 = \frac{1}{\gamma_p}\left[ \frac{2\pi}{1 - \pi} + \ln(19 - 20\pi) \right].
\tag{3.14}
\]
Therefore, the rise time tr is ([10, 15, 23])
\[
t_r = t_4 - t_1 = \frac{C_L}{\beta_p V_{dd}(1 - \pi)}
      \left[ 2\,\frac{\pi - 0.1}{1 - \pi} + \ln(19 - 20\pi) \right],
\tag{3.15}
\]
and the propagation delay tP LH is ([10, 15, 23])
\[
t_{PLH} = t_3 - 0 = t_3 = \frac{C_L}{\beta_p V_{dd}(1 - \pi)}
          \left[ \frac{2\pi}{1 - \pi} + \ln(3 - 4\pi) \right].
\tag{3.16}
\]

Several observations can be made by analyzing the expressions derived


in this section for tr , tf , tP HL , and tP LH . These observations are provided in
the following subsections. First, the factors which affect the inverter delays
are analyzed in Section 3.2.2. Following this analysis, the related waveform
effects are considered in Section 3.2.3 and the effects of short-channel devices
in submicrometer technologies are described in Section 3.2.4.

3.2.2 Controlling the Delay

Note that in (3.11) and (3.15), the fall and rise times, respectively, are the
product of the term CL /β, and another process dependent term (a function
composed solely of Vdd and Vt ). These relationships imply that for a given
manufacturing process, improvements in the individual gate delays are possi-
ble by reducing the load impedance CL or by increasing the current gain of
the transistors. Increasing the current gain (higher β) is possible either by uti-
lizing a more advanced technology or by controlling certain physical qualities
of the transistor (the specific physical layout). In the latter case, increasing β
of the devices (recall that β ∝ W/L) is typically accomplished by controlling
the value of W —a process known as transistor or gate sizing5 [24, 25, 26].
Transistor sizing, however, has limits—area requirements may limit the max-
imum channel width W, and increasing W will also increase the input load
capacitance of the previous gates.
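Because the delay expressions scale with CL /β, a first-order sizing estimate follows from inverting (3.12) for the required current gain and then applying (3.2). The sketch below illustrates this; the target delay, load capacitance, and process transconductance are assumed values for illustration only.

```python
from math import log

def beta_for_target_tphl(t_phl_target, c_l, vdd, vtn):
    """Invert eq. (3.12): the beta_n needed so the step-input tPHL meets a target."""
    eta = vtn / vdd
    bracket = 2.0 * eta / (1.0 - eta) + log(3.0 - 4.0 * eta)
    return c_l * bracket / (t_phl_target * vdd * (1.0 - eta))

# Example: 50 fF load, Vdd = 2.5 V, Vtn = 0.5 V, target tPHL = 100 ps
beta_req = beta_for_target_tphl(100e-12, 50e-15, 2.5, 0.5)
k_n = 100e-6                      # illustrative process transconductance [A/V^2]
w_over_l = beta_req / k_n         # required W/L from eq. (3.2)
print(beta_req, w_over_l)
```

To first order, doubling W halves the delay for a fixed load, but—as noted above—the larger device also presents a larger input capacitance to the preceding stage, which partially offsets the improvement at the path level.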

3.2.3 Waveform Effects

The ideal step input waveform used in the derivation of the delay expressions
presented in Section 3.2.1 is a physical abstraction. Such an ideal waveform
does not practically exist, although it can be used to simplify the analysis
presented in Section 3.2.1. Note that despite ideally fast input waveforms, the
output signal of a CMOS logic gate has a finite slope, thereby contributing
to the gate delay. In a practical VLSI integrated circuit, both the input and
output signals have a non-zero rise and fall time caused by the impedances
along any signal path. Fast input waveforms can be effectively considered as
5
Typically, device channel length is chosen to be the minimum geometry permitted
by the technology and therefore cannot be decreased to further increase β.

step inputs. The delay expressions derived in (3.11) and (3.15) model the de-
lays for such cases with reasonable accuracy. Slow input waveforms, however,
contribute significantly to the overall delay of the charge/discharge path in a
gate [8, 10, 15, 23], making the delay expressions presented in Section 3.2.1
less accurate.
Furthermore, it is considerably more difficult to derive closed form delay
expressions for non-step input waveforms. Consider, for example, the deriva-
tion of the fall time of the inverter shown in Figure 3.5 assuming a non-ideal
input, such as the linear ramp signal sA depicted in Figure 2.3. Referring
to Figure 3.8, the trajectory of the operating point relating Vi and Vo for a
non-ideal (non-step) input is as shown in the diagram on the right. This tra-
jectory is a curve passing through regions I, II, III, and IV,6 and down the
line C → C → D, rather than the two straight-line segments A → B and
B → C → D (as shown in the diagram on the left). Therefore, calculating an
exact expression for tf in this case requires separately evaluating the delay
for all five portions of the output Vo —one for each region.
An analysis of the CMOS inverter shown in Figure 3.5 with a non-step
input signal, as well as the respective delay expressions, can be found in [23].
Consider, for example, a linear ramp input described by
\[
V_i(t) =
\begin{cases}
0, & t < 0 \\[4pt]
V_{dd}\,\dfrac{t}{t_{ri}}, & 0 \le t < t_{ri} \\[4pt]
V_{dd}, & t \ge t_{ri},
\end{cases}
\tag{3.17}
\]

where tri is the rise time of the input voltage signal Vi (t). For the case de-
picted in the upper diagram shown in Figure 3.8, the total propagation delay
tP HLramp at the 50 % level [23] is given by

\[
t_{PHL_{ramp}} = \frac{1}{6}\,(1 + 2\eta)\,t_{ri} + t_{PHL_{step}},
\tag{3.18}
\]
where tP HLstep is the propagation delay time for a step input given by (3.12).
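Evaluating (3.18) requires only the input ramp time, η, and the step-input delay from (3.12); a minimal sketch with illustrative values follows.

```python
def tphl_ramp(t_ri, tphl_step, eta):
    """Eq. (3.18): step-input delay plus a correction for a linear ramp input."""
    return (1.0 + 2.0 * eta) * t_ri / 6.0 + tphl_step

# Example (illustrative): 200 ps input ramp, 80 ps step-input delay, eta = Vtn/Vdd = 0.2
print(tphl_ramp(200e-12, 80e-12, 0.2))
```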
Note that the ramp input described by (3.17) is also an idealization in-
tended to simplify analysis. In a practical integrated circuit, the input wave-
form to the inverter is not a linear ramp but rather the output waveform
of another gate within the circuit. For such an input signal—also known as
a characteristic input [23]—it is preferable to regard the propagation delay
through the inverter gate shown in Figure 3.5 as a function of the CL /β ra-
tio of the preceding gate or, equivalently, as a function of the step response
delay of the preceding stage [23]. This type of direct analytical solution—by
breaking the output waveform into regions depending upon the trajectory of
the operating point—is further complicated for those gates with more than
one input arriving at an arbitrary time and with arbitrary waveforms. Due to
6
I, II, III, IV, and V for slower input signals.

the growing complexity of such an analytical solution, it is imperative that


alternative methods for delay calculation be developed for practical use.
Non-ideal input waveforms also have implication on the power dissipation
of individual logic gates, and therefore on the entire circuit. Observe that
in regions II, IV, and VI, shown in Figure 3.6, both devices simultaneously
conduct, thereby creating a temporary direct path for the current to flow
from Vdd to ground. The short-circuit current in this direct current path is
only mildly related to the output voltage of the gate and adds to the total
power dissipation. This added power component is known as short-circuit
power [26, 27, 28]. The short-circuit power can be a substantial fraction of the
total transient power dissipation of a circuit and has become a severe obstacle
to satisfying a maximum power budget. Faster waveforms throughout the
circuit generally mean less time is spent switching within regions II, IV, and
VI, and therefore decreased short-circuit current and power.

3.2.4 Short-Channel Effects


The active device model, (3.1), used in the analyses described in Section 3.2.1,
is accurate for long-channel devices. As technology is scaled down into the deep
submicrometer range, a variety of physical phenomena develop that require
improved device models in order to preserve accuracy. In this section, certain
important effects, known as short-channel effects, are described in terms of
their effect on propagation delay.

Channel-Length Modulation
A MOSFET device modeled by (3.1) has an infinite output resistance in sat-
uration and acts as a voltage-controlled current source. Recall the linear por-
tion of the falling/rising output waveforms from the analyses described in
Section 3.2.1. The device acts as a current source since the drain current Idsn is
completely independent of the voltage Vdsn in the saturation region [see (3.1)].
This independence, however, is an idealization that does not consider the ef-
fect of the voltage Vdsn on the shape of the channel. In practice, as Vdsn
increases beyond Vgsn − Vtn (such that Vgdn < Vtn or Vgdp > Vtp for a PMOS
device), the channel pinch-off point moves towards the source. Therefore, due
to an effect known as channel-length modulation, the effective channel length
is reduced [10, 15, 22, 29].
To analytically account for channel-length modulation, an expression for
the current of a MOS transistor operating in the saturation region is modified
as follows:
\[
I_{dsn} = \frac{1}{2}\,\beta_n (V_{gsn} - V_{tn})^2\,(1 + \lambda_n V_{dsn}).
\tag{3.19}
\]
The additional factor (1 + λn Vdsn ) in (3.19) describes the finite device output
resistance ∂Vdsn /∂Idsn = 2(Vgsn − Vtn )^{−2} /(λn βn ) when the transistor oper-
ates in the saturation region. The output waveform deteriorates due to the
degradation of the transfer characteristic of the inverter.
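The influence of the (1 + λn Vdsn ) factor is easy to quantify. The sketch below, with illustrative values for λn and the bias point, compares the saturation current with and without channel-length modulation and evaluates the corresponding finite output resistance.

```python
def ids_sat(beta, vgs, vtn, vds, lam=0.0):
    """Saturation drain current, eq. (3.19); lam = 0 recovers eq. (3.1)."""
    return 0.5 * beta * (vgs - vtn) ** 2 * (1.0 + lam * vds)

def r_out(beta, vgs, vtn, lam):
    """Small-signal output resistance dVds/dIds = 2 / (lam * beta * (Vgs - Vt)^2)."""
    return 2.0 / (lam * beta * (vgs - vtn) ** 2)

# Illustrative bias point and device values (not from the text)
beta, vgs, vtn, vds, lam = 200e-6, 1.5, 0.5, 2.0, 0.05
print(ids_sat(beta, vgs, vtn, vds), ids_sat(beta, vgs, vtn, vds, lam))
print(r_out(beta, vgs, vtn, lam))
```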

Velocity Saturation

In a long-channel transistor, the drift velocity of the carriers in the channel is


proportional to both the carrier mobility and the lateral electric field in the
channel (parallel to the source-drain path). In short-channel devices, however,
the velocity of the carriers eventually saturates to some value vsat for a specific
value of the voltage Vds within the operating range of the circuit. This velocity
saturation phenomenon is due to the power supply voltage not being scaled
down as quickly as the device dimensions, creating high electric field within
the device.
The saturation in the carrier velocity for high electric field strengths—
caused by the high voltage Vds applied over a short channel—causes a reduc-
tion in both the device transconductance [see (3.3)] and the current gain of a
saturated device. This reduction in the current gain β has a direct effect on
the ability of the devices to drive a specific load, resulting in increased delay
times. Recall that the propagation delays described in (3.11), (3.12), (3.15)
and (3.16) are inversely proportional to β.
A more realistic device model for DSM devices—known as the α-power law
model—has been developed by [30] to include the carrier velocity saturation
effect in the submicrometer device I-V model.7 If $I'_{D0_n}$ and $V'_{D0_n}$ are given by
\[
I'_{D0_n} = I_{D0_n} \left( \frac{V_{gsn} - V_{tn}}{V_{dd} - V_{tn}} \right)^{\alpha}, \qquad
V'_{D0_n} = V_{D0_n} \left( \frac{V_{gsn} - V_{tn}}{V_{dd} - V_{tn}} \right)^{\alpha/2},
\tag{3.20}
\]
then the drain current Idsn of the MOS transistor is
\[
I_{dsn} =
\begin{cases}
I'_{D0_n}, & V_{gsn} \ge V_{tn} \text{ and } V_{dsn} \ge V'_{D0_n} \quad \text{(pentode or saturation region)} \\[6pt]
I'_{D0_n}\,\dfrac{V_{dsn}}{V'_{D0_n}}, & V_{gsn} \ge V_{tn} \text{ and } V_{dsn} < V'_{D0_n} \quad \text{(triode or linear region)} \\[6pt]
0, & V_{gsn} \le V_{tn} \quad \text{(cutoff region).}
\end{cases}
\tag{3.21}
\]

In (3.20) and (3.21), α is the velocity saturation index, VD0 is the drain
saturation voltage for Vgsn = Vdd , and ID0 is the drain saturation current
for Vgsn = Vdsn = Vdd . A typical value of the velocity saturation index for a
short-channel device is 1 ≤ α ≤ 2, where (3.21) is the same as (3.1) for α = 2.
Analytical solutions for the output voltage of a CMOS inverter with a
purely capacitive load CL for a step, linear ramp and exponential input wave-
forms can be found in [31]. Closed form expressions for the delay of a CMOS

7
Short-channel MOS devices in general.

inverter as shown in Figure 3.5 under the α-power law model are given in [30]
and are repeated below:

\[
\eta = \frac{V_{tn}}{V_{dd}} \quad \text{and} \quad
t_{PHL} = t_{PLH} = \left( \frac{1}{2} - \frac{1 - \eta}{1 + \alpha} \right) t_T + \frac{C_L V_{dd}}{2 I_{D0}}.
\tag{3.22}
\]

The propagation delay described by (3.22) can be applied to non-ideal input


waveforms and consists of two terms. The first term reflects the effect on the
gate delay of the input waveform shape and is proportional to the input wave-
form transition time tT . The second term reflects the dependence of the delay
on the gate load, similarly to the CL /β term included in (3.12) and (3.16).
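The two terms of (3.22) can be evaluated separately to see whether a particular stage is slope-limited or load-limited; the sketch below does so with illustrative parameter values (none of which are taken from the text).

```python
def alpha_power_delay(c_l, vdd, vtn, alpha, i_d0, t_t):
    """Propagation delay under the alpha-power law model, eq. (3.22).
    Returns (total delay, input-slope term, load-drive term)."""
    eta = vtn / vdd
    slope_term = (0.5 - (1.0 - eta) / (1.0 + alpha)) * t_t
    load_term = c_l * vdd / (2.0 * i_d0)
    return slope_term + load_term, slope_term, load_term

# Example: CL = 50 fF, Vdd = 1.8 V, Vtn = 0.4 V, alpha = 1.3, ID0 = 300 uA, tT = 150 ps
print(alpha_power_delay(50e-15, 1.8, 0.4, 1.3, 300e-6, 150e-12))
```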

3.2.5 The Importance of Interconnections

The analysis of the CMOS gate delay as described in Section 3.2.1 is based on
the assumption that the load of the inverter shown in Figure 3.5 is a purely
capacitive load (C). This assumption is generally true for logic gates placed
physically close to each other. In a multi-million transistor VLSI system, how-
ever, certain connected logic gates may be relatively far from each other. In
this situation, the impedance of the interconnect wires cannot be considered
as being purely capacitive but rather as being resistive-capacitive (RC). An
important type of global circuit interconnect structure where the gates can
be very far apart is the clock distribution network [9, 32].
On-chip interconnect has become a major concern due to the high resis-
tance of the interconnect which can limit overall circuit performance. These
interconnect impedances have become significant as the minimum line dimen-
sions have been scaled down into the deep submicrometer regime while the
overall chip dimensions have increased. Perhaps the most important conse-
quence of these trends of scaling transistor and interconnect dimensions and
increasing chip sizes is that the primary source of signal propagation delay
has shifted from the active transistors to the passive interconnect lines. There-
fore, the nature of the load impedance has shifted from a lumped capacitance
to a distributed resistance-capacitance, thereby requiring new qualitative and
quantitative interpretations of the signal switching processes.
To illustrate the effects of scaling, consider ideal scaling [8] where devices
are scaled down by a factor of S (S > 1) and chip sizes are scaled up by a
factor of Sc (Sc > 1). The delay of the logic gates decreases by 1/S while
the delay due to the interconnect increases by S²Sc² [8, 33]. Therefore, the
ratio of interconnect delay to gate delay after ideal scaling increases by a
factor of S³Sc². For example, if S = 4 (corresponding to scaling down from
a 2 μm CMOS technology to a 0.5 μm CMOS technology) and Sc = 1.225
(corresponding to the chip area increasing by 50%), the ratio of interconnect
delay to gate delay increases by a factor of 4³ × 1.225² ≈ 96 times.
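This scaling argument reduces to a one-line calculation, reproduced below for the example values S = 4 and Sc = 1.225.

```python
def delay_ratio_increase(s, sc):
    """Increase in the interconnect-to-gate delay ratio under ideal scaling:
    interconnect delay grows by S^2 * Sc^2 while gate delay shrinks by 1/S."""
    return (s ** 2) * (sc ** 2) * s     # = S^3 * Sc^2

print(delay_ratio_increase(4.0, 1.225))  # roughly a 96x increase for this example
```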

Delay estimation in RC interconnect

Interconnect delay can be analyzed by considering the CMOS inverter shown


in Figure 3.5 with a capacitive load CL representing the accumulated capac-
itance of the fanout of the inverter. The interconnect connecting the drains
of the devices, Q1 and Q2 , to the upper terminal of the load is replaced by a
distributed RC line with a resistance and capacitance, Rint and Cint , respec-
tively [33].
Closed form expressions for the signal delay of a CMOS inverter with
an RC load have been developed by Wilnai [34]. The delay values for both
distributed and lumped nature of the RC load are summarized in Table 3.2.
These delay values are obtained assuming a step input driving the CMOS
inverter.

Table 3.2. Closed form expressions for the signal delay of the CMOS inverter shown
in Figure 3.5 driving an RC load. An ideal step input signal (Vi (t) transitioning from
high to low) is assumed.

Output Voltage Range   Signal Delay (Distributed RC)   Signal Delay (Lumped RC)
0 to 90%               1.0RC                           2.3RC
10% to 90%             0.9RC                           2.2RC    ← rise time tr
0 to 63%               0.5RC                           1.0RC
0 to 50%               0.4RC                           0.7RC    ← delay tPLH
0 to 10%               0.1RC                           0.1RC

The delay values listed in Table 3.2 are graphically illustrated in Fig-
ure 3.10 [34]. Two waveforms describing the output of a CMOS inverter (shown
in Figure 3.5) for an input signal making a high-to-low transition are shown
in Figure 3.10. These two waveforms are based on the assumption that the
RC load of the CMOS inverter is distributed and lumped, respectively.
Furthermore, assuming an on-resistance Rtr of the driving transistor [33],
the interconnect delay Tintc can be characterized by the following expres-
sion [34],

\[
T_{intc} = R_{int} C_{int} + 2.3\,(R_{tr} C_{int} + R_{tr} C_L + R_{int} C_L)
\tag{3.23}
\]
\[
T_{intc} \approx (2.3\,R_{tr} + R_{int})\,C_{int}.
\tag{3.24}
\]

The on-resistance of the driving transistor Rtr in (3.23) and (3.24) can be
approximated [33] by
\[
R_{tr} \approx \frac{1}{\beta\, V_{DD}},
\tag{3.25}
\]
where the term β in (3.25) is the current gain of the driving transistor oper-
ating in the saturation region [see (3.2)].
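Equations (3.23)–(3.25) combine into a quick estimate of the interconnect delay from the driver current gain and the extracted line parasitics. The sketch below implements this estimate; the resistance, capacitance, and driver values in the example are illustrative assumptions.

```python
def t_intc(r_int, c_int, c_l, beta, vdd):
    """Interconnect delay per eqs. (3.23)-(3.25), with Rtr ~ 1/(beta * Vdd)."""
    r_tr = 1.0 / (beta * vdd)                                   # eq. (3.25)
    full = r_int * c_int + 2.3 * (r_tr * c_int + r_tr * c_l + r_int * c_l)
    approx = (2.3 * r_tr + r_int) * c_int                       # eq. (3.24)
    return full, approx

# Example: line with Rint = 200 ohm, Cint = 200 fF, CL = 30 fF,
# driver beta = 300 uA/V^2, Vdd = 2.5 V (all values illustrative)
print(t_intc(200.0, 200e-15, 30e-15, 300e-6, 2.5))
```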
Fig. 3.10. Graphical illustration of the RC signal delay expressions listed in Ta-
ble 3.2 (from [34]). The output waveforms for a CMOS inverter are shown for both
a distributed and a lumped RC load.

Approximating a distributed RC line by a combination of lumped resis-


tances (R) and capacitances (C) is a common strategy when using circuit
simulation programs (such as SPICE). A lumped Π and T ladder circuit
model better approximates a distributed RC model than a lumped L lad-
der circuit [35] by up to 30%. As described in [35], a strategy to model a
distributed RC line depends upon two circuit parameters:
CL
1. the ratio CT = of the load capacitance CL of the fanout to the
C
capacitance C of the interconnect line,
Rtr
2. the ratio RT = of the output resistance of the driving MOSFET
R
device Rtr to the resistance R of the interconnect line.
The appropriate ladder circuit (from [35]) to properly model a distributed
RC interconnect line within 3% error as a function of RT and CT is listed
in Table 3.3. By using the proper ladder circuit recommended in [35], the
computational time of the simulation can be greatly reduced while preserving
the accuracy of the overall circuit simulation [21].
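For simulation, the distributed line is replaced by one of the lumped ladders referenced in Table 3.3. As a simple illustration, the sketch below constructs the element values of a single Π segment (Π1) and a single T segment (T1) from the total line resistance and capacitance. Whether a one-segment model is sufficiently accurate depends on the RT and CT ratios as tabulated; the element values here follow the standard Π/T lumping conventions rather than any formula given in the text.

```python
def pi1_segment(r_line, c_line):
    """Single Pi segment: half the line capacitance at each end, all of R between."""
    return {"C_near": c_line / 2.0, "R": r_line, "C_far": c_line / 2.0}

def t1_segment(r_line, c_line):
    """Single T segment: half the line resistance on each side, all of C in the middle."""
    return {"R_near": r_line / 2.0, "C": c_line, "R_far": r_line / 2.0}

# Example: Rint = 200 ohm, Cint = 200 fF (illustrative values)
print(pi1_segment(200.0, 200e-15))
print(t1_segment(200.0, 200e-15))
```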

Table 3.3. Circuit network to model a distributed RC line with a maximum error of 3%
(from [35]). The notations Π, T and L correspond to a Π, T and L impedance model,
respectively. The notations R and C correspond to a single lumped resistance and
capacitance, respectively. The notation N means that the interconnect impedance
can be ignored.

                                      RT
CT      0    0.01  0.1   0.2   0.5   1     2     5     10    20    50    100
0       Π3   Π3    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.01    Π3   Π3    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.1     T2   T2    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.2     T2   T2    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.5     T1   T1    T1    T1    Π1    Π1    Π1    Π1    Π1    C     C     C
1       T1   T1    T1    T1    Π1    Π1    Π1    Π1    Π1    C     C     C
2       T1   T1    T1    T1    Π1    Π1    Π1    Π1    L1    L1    C     C
5       Π1   Π1    Π1    Π1    Π1    Π1    Π1    L1    L1    L1    C     C
10      Π1   Π1    Π1    Π1    Π1    Π1    L1    L1    L1    L1    C     C
20      R    R     R     R     R     R     L1    L1    L1    L1    C     C
50      R    R     R     R     R     R     R     R     R     R     C     N
100     R    R     R     R     R     R     R     R     R     R     N     N

3.2.6 Delay Mitigation

As discussed in this chapter, signal delay in VLSI circuits is caused by the
inherent switching properties and impedances of the transistors and intercon-
nections along each signal path. Accurate methods for estimating the signal
delay are required in order to guarantee that the circuit will operate correctly.
Furthermore, certain signal delays within a circuit may need to be decreased
so as to meet specific performance goals.
A variety of different techniques have been developed to improve the sig-
nal delay characteristics depending upon the type of load and other circuit
parameters. Among the most important techniques are:
• Gate sizing to increase the output current drive capability of the transistors
along a logic chain [24, 25, 26]. Gate sizing must be applied with caution,
however, because of the resulting increase in area and power dissipation,
and, if incorrectly applied, increase in delay.
• Tapered buffer circuit structures are often used to drive large capacitive
loads (such as at the output pad of a chip) [17, 36, 37, 38, 39, 40, 41]. A
series of CMOS inverters such as the circuit shown in Figure 3.5 can be
cascaded where the output drive of each buffer is increased by a constant
(or variable) tapering factor.
• The use of repeater circuit structures to drive resistive-capacitive (RC)
loads. Unlike tapered buffers, repeaters are typically CMOS inverters of
uniform size (drive capability) that are inserted at uniform intervals along
an interconnect line [8, 42, 43, 44, 45, 46, 47].
• A different timing discipline such as asynchronous timing [17, 48, 49].
Unlike fully synchronous circuits, the order of execution of logic opera-
tions in an asynchronous circuit is not controlled by a global clock signal.
Therefore, the temporal operation of asynchronous circuits is essentially

independent of the signal delays. The logical order of the operations in


an asynchronous circuit is enforced by requiring the generation of special
handshaking signals which communicate the status of the computation.
Among other useful techniques to improve the signal delay characteristics
are the use of dynamic CMOS logic circuits such as Domino logic [50, 51,
52, 53] and differential circuit logic styles, such as cascade voltage switch
logic (CVSL) [54, 55, 56, 57].
4
Timing Properties of Synchronous Systems

The general structure and principles for operating a fully synchronous dig-
ital VLSI system are described in Chapter 2. The combinational logic and
the storage elements make up the computational circuitry used to implement
a specific synchronous system. The clock distribution network provides the
time reference for the storage elements—or registers—thereby enforcing the
required logical order of operations. This time reference consists of one or
more clock signals that are delivered to each and every register within the in-
tegrated circuit. These clock signals control the order of computational events
by controlling the exact times the register data input signals are sampled.
As shown in Chapter 3, the data signals are inevitably delayed as these sig-
nals propagate through the logic gates and along interconnections within the
local data paths. These propagation delays can be evaluated within a certain
accuracy and used to derive timing relationships among the signals within a
circuit. In this chapter, the properties of commonly used types of registers
and their local timing relationships for different types of local data paths are
described. After discussing registers in general in Section 4.1, the properties of
level-sensitive registers (latches) and the significant timing parameters char-
acterizing these registers are reviewed in Sections 4.2 and 4.3, respectively.
Edge-sensitive registers (flip-flops) and the timing parameters are analyzed
in Sections 4.4 and 4.5, respectively. Properties and definitions related to
the clock distribution network are reviewed in Section 4.6. The mathemat-
ical foundation for analyzing timing violations in flip-flops and latches for
single-phase operation, and latches for multi-phase operation are discussed in
Sections 4.7, 4.8 and 4.9, respectively, followed by some final comments in
Section 4.10.

4.1 Storage Elements


The storage elements (registers) used in VLSI systems vary in their func-
tion and temporal relationships. Independent of these differences, however,

all storage elements share a common feature—the existence of two groups


of signals with largely different purposes. A generalized view of a register is
depicted in Figure 4.1. The I/O signals of a register can be divided into two


Fig. 4.1. A general view of a register.

groups as shown in Figure 4.1. One group of signals—called the data signals—
consists of input and output signals of the storage element. These input and
output signals are typically connected to the terminals of ordinary logic gates
and may be connected to the data signal terminals of other storage elements.
Another group of signals—identified by the name control signals—are those
signals that control the storage of the data signals in the registers but do not
participate in the logical computation process.
Certain control signals enable the storage of a data signal in a register
independently of the values of any data signals. These control signals are
typically used to initialize the data in a register to a specific well known
value. Other control signals—such as a clock signal—control the process of
storing a data signal within a register. In a synchronous circuit, each register
has at least one clock (or control) signal input.
The two major groups of storage elements (registers) are considered in
the following sections based on the type of relationship that exists among the
data and clock signals of these elements. In latches, it is the specific value or
level of a control signal1 that determines the data storage process. Therefore,
latches are also called level-sensitive registers. In contrast to latches, a data
signal is stored in flip-flops enabled by an edge of a control signal. For that
reason, flip-flops are also called edge-triggered registers. The timing properties
of latches and flip-flops are described in detail in the following two sections.

1
This signal is most frequently the clock signal.

4.2 Latches

A latch is a register whose behavior depends upon the value or level of the
clock signal [10, 12, 14, 15, 29, 58, 59, 60]. Therefore, a latch is often referred
to as a transparent latch, a level-sensitive register or a polarity hold latch. A
simple type of latch with a clock signal C and an input signal D is depicted in
Figure 4.2—the output of the latch is typically labeled Q. This type of latch
is also known as a D latch and its operation is illustrated in Figure 4.3.


Fig. 4.2. Schematic representation of a level-sensitive register or latch.

The type of register illustrated in Figures 4.2 and 4.3 is a positive-polarity2


latch since it is transparent during that portion of the clock period during
which C is high. The operation of this positive latch is summarized in Ta-
ble 4.1.

Table 4.1. Operation of the positive-polarity D latch.


Clock Output State
high passes input transparent
low maintains output opaque

As described in Table 4.1 and illustrated in Figure 4.3, the output signal
of the latch follows the data input signal while the clock signal remains high,
i.e., C = 1 ⇒ Q = D. Thus, the latch is said to be in a transparent state
during the interval t0 < t < t1 as shown in Figure 4.3. When the clock signal
C changes from 1 to 0, the current value of D is stored in the register and the
output Q remains fixed to that value regardless of whether the data signal
D changes. The latch does not pass the input data signal to the output but
rather holds onto the final value of the data signal when the clock signal made
the high-to-low transition. By analogy with the term transparent introduced
above, this state of the latch is called opaque and corresponds to the interval
t1 < t < t2 shown in Figure 4.3 where the input data signal is isolated from
the output port. As shown in Figure 4.3, the clock period is TCP = t2 − t0 .

2
Or simply a positive latch.
Fig. 4.3. Idealized operation of a level-sensitive register or latch.

The edge of the clock signal that causes the latch to switch to its transpar-
ent state is identified as the leading edge of the clock pulse. In the case of the
positive latch shown in Figure 4.2, the leading edge of the clock signal occurs
at time t0 . The opposite edge direction of the clock signal is identified as the
trailing edge—the falling edge at time t1 shown in Figure 4.3. Note that for
a negative latch, the leading edge is a high-to-low transition and the trailing
edge is a low-to-high transition.
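The level-sensitive behavior summarized in Table 4.1 and Figure 4.3 can be expressed as a short behavioral model. The following sketch (in Python) is a zero-delay abstraction that ignores the setup and hold parameters introduced in Section 4.3; the class and method names are illustrative only.

class DLatch:
    # Zero-delay behavioral model of a positive-polarity D latch.
    def __init__(self):
        self.q = 0

    def evaluate(self, d, c):
        if c == 1:
            self.q = d      # transparent state: the output follows the input
        return self.q       # opaque state (c == 0): the stored value is held

latch = DLatch()
print(latch.evaluate(d=1, c=1))   # transparent, Q = 1
print(latch.evaluate(d=0, c=0))   # opaque, Q remains 1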

4.3 Parameters of Latches


Registers such as the D latch illustrated in Figures 4.2 and 4.3 and the flip-
flops described in Sections 4.4 and 4.5 are built of discrete components, such
as the NMOS transistor shown in Figure 3.4. The exact relationships among
signals on the terminals of a register can be presented and evaluated in an-
alytical form [61, 62, 63]. In this research monograph, however, registers are
considered at a higher level of abstraction in order to hide the unnecessary
details of the specific electrical implementation. The latch delay parameters
described in the following sections are therefore considered from the perspec-
tive of the earlier discussion of delay in Chapter 3. These parameters are
briefly introduced next.

Note: The remaining portion of this chapter and the rest of this monograph
use an extensive notation for various parameters describing the signals and
storage elements.

4.3.1 Width of the Clock Pulse


The width of the clock pulse C_W^L is the permissible width of this portion of the
clock signal during the time when the latch is transparent. In other words, C_W^L
is the length of the time interval between the leading and the trailing edge of
the clock signal such that the latch will operate properly. The superscript L is
used optionally to represent the type of registers—latch in this case—that are
synchronized by this clock signal. The subscript W is used to represent the
width, which is included to distinguish between a clock signal C and the clock
width C_W^L . Increasing the value of C_W^L any further will not affect the values
of D_DQ^L , δ_S^L and δ_H^L (defined in Sections 4.3.3, 4.3.4, and 4.3.5, respectively).
The width of the clock pulse, C_W^L = t_6 − t_1 , is illustrated in Figure 4.4. The
clock period is T_CP = t_8 − t_1 .

4.3.2 Latch Clock-to-Output Delay


The clock-to-output delay D_CQ^L (typically called the clock-to-Q delay) is the
propagation delay of the latch from the clock signal terminal to the output
terminal. The value of D_CQ^L = t_2 − t_1 is depicted in Figure 4.4 and is defined
assuming that the data input signal has settled to a stable value sufficiently
early, i.e., setting the data input signal earlier with respect to the leading
clock edge will not affect the value of D_CQ^L .

4.3.3 Latch Data-to-Output Delay


The data-to-output delay D_DQ^L (typically called the data-to-Q delay) is the
propagation delay of the latch from the data signal terminal to the output
terminal. The value of D_DQ^L is defined assuming that the clock signal has set
the latch to its transparent state sufficiently early, i.e., making the leading
edge of the clock signal occur earlier will not change the value of D_DQ^L . The
data-to-output delay D_DQ^L = t_4 − t_3 is illustrated in Figure 4.4.

4.3.4 Latch Setup Time

The latch setup time δ_S^L = t_6 − t_5 , shown in Figure 4.4, is the minimum time
between a change in the data signal and the trailing edge of the clock signal
such that the new value of D would successfully propagate to the output Q
of the latch and be stored within the latch during the opaque state.
Fig. 4.4. Parameters of a level-sensitive register.

4.3.5 Latch Hold Time


The latch hold time δ_H^L is the minimum time after the trailing clock edge that
the data signal must remain constant such that this value of D is successfully
stored in the latch during the opaque state. This definition of δ_H^L assumes that
the last change of the value of D has occurred no later than δ_S^L before the
trailing edge of the clock signal. The term δ_H^L = t_7 − t_6 is shown in Figure 4.4.
Note: The latch parameters introduced in Sections 4.3.1 through 4.3.5 are
used to refer to any latch in general or to a specific instance of a latch when
this instance can be unambiguously identified. To refer to a specific instance i
of a latch explicitly, the parameters are additionally shown with a superscript.
For example, D_CQ^Li refers to the clock-to-output delay of latch i. Also, adding
m and M to the subscript of any parameter is used to refer to the minimum
and maximum values of that parameter, respectively.

4.4 Flip-Flops
An edge-triggered register or flip-flop is a type of register which, unlike the
latches described in Sections 4.2 and 4.3, is never transparent with respect to
the input data signal [10, 12, 14, 15, 29, 58, 59, 60]. The output of a flip-flop
normally does not follow the input data signal at any time during the register
operation but rather holds onto a previously stored data value until a new
data signal is stored in the flip-flop. A simple type of flip-flop with a clock
signal C and an input signal D is shown in Figure 4.5—similarly to latches,

Data Input D Q Data Output

Clock Input C

Fig. 4.5. An edge-triggered register or flip-flop.

the output of a flip-flop is usually labeled Q. This specific type of register,


shown in Figure 4.5, is called a D flip-flop and its operation is illustrated in
Figure 4.6.
In typical flip-flops, data is stored either on the rising edge (the low-to-
high transition) or on the falling edge (the high-to-low transition) of the clock
signal. The flip-flops are known as positive-edge-triggered and negative-edge-
triggered flip-flops, respectively. The term latching edge (or storing edge)
is used to identify the edge of the clock signal on which storage in the flip-
flop occurs. For the sake of clarity, the latching edge of the clock signal for
flip-flops will also be called the leading edge (compare to the discussion of
latches in Sections 4.2 and 4.3). Also, note that certain flip-flops—known as
double-edge-triggered (DET) flip-flops [64, 65, 66, 67, 68]—can store data at
either edge of the clock signal. The complexity of these flip-flops, however, is
significantly higher and these registers are therefore rarely used.

As shown in the timing diagram in Figure 4.6, the output of the flip-flop
remains unchanged most of the time regardless of the transitions in the data
signal. Only values of the data signal in the vicinity of the storing edge of
the clock signal can affect the output of the flip-flop. Therefore, changes in
the output will only be observed when the currently stored data has a logic
value x and the storing edge of the clock signal occurs while the input data
signal has a logic value of x̄.

Fig. 4.6. Idealized operation of an edge-triggered register or flip-flop.
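For comparison with the latch model at the end of Section 4.2, a minimal behavioral sketch of a positive-edge-triggered D flip-flop is given below (again in Python, zero-delay, with setup and hold times ignored); the names are illustrative only.

class DFlipFlop:
    # Zero-delay behavioral model of a positive-edge-triggered D flip-flop.
    def __init__(self):
        self.q = 0
        self.prev_c = 0

    def evaluate(self, d, c):
        if self.prev_c == 0 and c == 1:   # latching (rising) edge of the clock
            self.q = d                    # the data input is stored only here
        self.prev_c = c
        return self.q                     # the output is held at all other times

ff = DFlipFlop()
print(ff.evaluate(d=1, c=1))   # rising edge, Q = 1
print(ff.evaluate(d=0, c=1))   # clock remains high, Q still 1 (never transparent)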

4.5 Parameters of Flip-Flops


The significant timing parameters of edge-triggered registers are similar to
those of latches (recall Section 4.3) and are presented next. These parameters are
illustrated in Figure 4.7.

4.5.1 Width of the Clock Pulse


The width of the clock pulse C_W^F is the permissible width of the time interval
between the latching edge and non-latching edge of the clock signal. The
superscript F is used optionally to represent the type of registers—flip-flops
in this case—that are synchronized by this clock signal. The subscript W is
used to represent the width, which is included to distinguish between a clock
signal C and the clock width C_W^F . The width of the clock pulse C_W^F = t_6 − t_3
is shown in Figure 4.7 and is defined as the interval between the latching
and non-latching edges of the clock pulse such that the flip-flop will operate
correctly. Further increasing C_W^F will not affect the values of the setup time δ_S^F
and hold time δ_H^F (defined in Sections 4.5.3 and 4.5.4, respectively). The clock
period T_CP = t_6 − t_1 is also shown in Figure 4.7.

4.5.2 Flip-Flop Clock-to-Output Delay


As shown in Figure 4.7, the clock-to-output delay D_CQ^F of the flip-flop is
D_CQ^F = t_5 − t_3 . This propagation delay parameter—typically called the clock-
to-Q delay—is the propagation delay from the clock signal terminal to the
output terminal. The value of D_CQ^F is defined assuming that the data input
signal has settled to a stable value sufficiently early, i.e., setting the data input
signal any earlier with respect to the latching clock edge will not affect the
value of D_CQ^F .

4.5.3 Flip-Flop Setup Time

The flip-flop setup time δ_S^F is shown in Figure 4.7—δ_S^F = t_3 − t_2 . The pa-
rameter δ_S^F is defined as the minimum time between a change in the data
signal and the latching edge of the clock signal such that the new value of D
propagates to the output Q of the flip-flop and is successfully latched within
the flip-flop.

4.5.4 Flip-Flop Hold Time


The flip-flop hold time δ_H^F is the minimum time after the arrival of the latching
clock edge during which the data signal must remain constant in order to
successfully store the D signal within the flip-flop. The hold time δ_H^F = t_4 − t_3
is illustrated in Figure 4.7. This definition of the hold time assumes that the
last change of D has occurred no later than δ_S^F before the arrival of the latching
edge of the clock signal.
Note: Similar to latches, the parameters of these edge-triggered registers refer
to any flip-flop in general or to a specific instance of a flip-flop when this
instance is uniquely identified. To explicitly refer to a specific instance i of a
flip-flop, the flip-flop parameters are additionally shown with a superscript.
For example, δ_S^Fi refers to the setup time parameter of flip-flop i. Also, adding
m and M to the subscript of D_CQ^F is used to refer to the minimum and
maximum values of D_CQ^F , respectively.
Fig. 4.7. Parameters of an edge-triggered register.

4.6 The Clock Signal


The clock signal is typically delivered to each storage element within a circuit.
This signal is crucial to the correct operation of a fully synchronous digital
system. As described in Section 2.2, the storage elements serve to establish the relative
sequence of events within a system so that those operations that cannot be
executed concurrently operate on the proper data signals.
A typical clock signal c(t) in a synchronous digital system is shown in
Figure 4.8. The clock period TCP of c(t) is also indicated in Figure 4.8. In


Fig. 4.8. A typical clock signal.

order to provide the highest possible clock frequency, the objective is for TCP
to be the smallest number such that

∀t : c(t) = c(t + nTCP ), (4.1)

where n is an integer. The width of the clock pulse CW is shown in Figure 4.8
where the meaning of CW is explained in Sections 4.3.1 (for a latch) and 4.5.1
(for a flip-flop), respectively.
Typically, the period of the clock signal TCP is a constant, that is,
∂TCP /∂t = 0. If the clock signal c(t) has a delay τ from some reference
point, the leading edges of c(t) occur at times

τ + mTCP for m ∈ {. . . , −2, −1, 0, 1, 2, . . . }, (4.2)

and the trailing edges of c(t) occur at times

τ + CW + mTCP for m ∈ {. . . , −2, −1, 0, 1, 2, . . . }. (4.3)

In practice, however, it is possible for the edges of a clock signal to fluctuate


in time, that is, for a clock signal not to occur precisely at the times described
by (4.2) and (4.3) for the leading and trailing edges, respectively. This phe-
nomenon is known as clock jitter and may be due to various causes such as
variations in the manufacturing process, ambient temperature, power supply
noise and oscillator variations.

To account for this clock jitter, the following parameters are introduced:
• the maximum deviation ΔL of the leading edge of the clock signal, i.e., the
leading edge is guaranteed to occur anywhere in an interval (τ + kTCP −
ΔL , τ + kTCP + ΔL ),
• the maximum deviation ΔT of the trailing edge of the clock signal, i.e.,
the trailing edge is guaranteed to occur anywhere in the interval (τ +CW +
kTCP − ΔT , τ + CW + kTCP + ΔT ).
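The effect of these jitter bounds on the edge times of (4.2) and (4.3) can be illustrated with a small sketch (Python). The function names and the numerical values are hypothetical; the units are arbitrary time units.

def leading_edge_window(tau, t_cp, k, delta_l):
    # Interval in which the leading edge of cycle k may occur, per (4.2) and the jitter bound.
    nominal = tau + k * t_cp
    return (nominal - delta_l, nominal + delta_l)

def trailing_edge_window(tau, c_w, t_cp, k, delta_t):
    # Interval in which the trailing edge of cycle k may occur, per (4.3) and the jitter bound.
    nominal = tau + c_w + k * t_cp
    return (nominal - delta_t, nominal + delta_t)

# Example: tau = 0.1, T_CP = 1.0, C_W = 0.5, 0.02 of jitter on either edge.
print(leading_edge_window(0.1, 1.0, 3, 0.02))        # (3.08, 3.12)
print(trailing_edge_window(0.1, 0.5, 1.0, 3, 0.02))  # (3.58, 3.62)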

4.6.1 Clock Skew

Consider a local data path such as the path shown in Figure 2.6 on page 14.
Without loss of generality, assume that the registers shown in Figure 2.6 are
flip-flops. The clock signal with period TCP is delivered to each of the registers
Ri and Rf . Let the clock signal driving the register Ri be denoted as Ci and the
clock signal driving the register Rf be denoted by Cf . Also, let ticd and tfcd be
the delays of Ci and Cf to the registers Ri and Rf , respectively.3 As described
by (4.2), the latching or leading edges of Ci occur at times

. . . , τ + ticd − TCP , τ + ticd , τ + ticd + TCP , . . . .

Similarly, the latching or leading edges of Cf occur at times

. . . , τ + tfcd − TCP , τ + tfcd , τ + tfcd + TCP , . . .

as described by (4.2).
The clock skew TSkew (i, f ) = ticd −tfcd between Ci and Cf is introduced next
as the difference of the arrival times of Ci and Cf [9] (a more formal definition
is provided in Chapter 5). This concept is illustrated by Figure 4.9. Note that
depending on the values of ticd and tfcd , the clock skew can be zero, negative or

Zero skew: Delay i = Delay f. Negative skew: Delay i < Delay f. Positive skew: Delay i > Delay f.


Fig. 4.9. Lead/lag relationships causing clock skew to be zero, negative or positive.

3
Note that ticd and tfcd are measured with respect to the same reference point.

positive, depending upon whether ticd is equal to, less than or greater than tfcd ,
respectively. Furthermore, note that the clock skew as defined above is only
defined for sequentially-adjacent registers, that is, a local data path [such as
the path shown in Figure 2.6].

4.6.2 Multi-Phase Clock Synchronization

Multi-phase (clock) synchronization is observed when different phases of the


clock signal are distributed to the synchronous components of a circuit.
Figure 4.10 presents a representation of a multi-phase clock signal. In Fig-


Fig. 4.10. A sample multi-phase synchronization clock.

ure 4.10, the multi-phase synchronization scheme is generated with overlap-


ping clock signal phases. In practical implementations, non-overlapping clock
phases are used more frequently due to their simplicity of synchronization im-
plementation and analysis. The duty cycles of the clock phases are considered
identical, with on-times of C_W . It is common for duty cycles to be similar as
multiple phases of the clock are typically generated from a single oscillation
source with phase shifters.
In Figure 4.10, the set of clock signals C^global = {C^1 , . . . , C^n } consti-
tutes the n-phase clocking scheme, where the superscripts denote the partic-
ular clock phase. The subscripts denote the location of the clock signals on
Fig. 4.11. Multi-phase clock skew.

the circuit. For instance, C_source^1 denotes the clock signal at the clock source
“source” of the clock phase C^1 . When this clock signal is delivered to an ar-
bitrary register Rk , it is denoted by C_k^1 . The start time φ_pi of clock signal
phase C^pi is defined with respect to a common reference clock cycle. The
phase shift operator φ_{pi,pf} [69] is used to transform variables between differ-
ent clock phases. The phase shift operator φ_{pi,pf} is defined as the algebraic
difference φ_{pi,pf} = φ_pi − φ_pf + kT_CP , where k is the number of clock cycles
occurring between phases. Note that for a single-phase clocking scheme, the
phase shift operator evaluates to φ_{if} = T_CP .
A multi-phase synchronization approach can be advantageous in terms of
increasing the reachability of circuit registers, creating less skew within phys-
ically neighboring local clock domains and potentially saving power. Despite
these advantages, the design and analysis of such synchronization schemes are
more complex.
The multi-phase clock skew is defined as T_Skew^{pi,pf}(i, f) = t_i^pi − t_f^pf , where
t_i^pi and t_f^pf are the delays of the clock signals C_i^pi and C_f^pf from the clock
sources to the registers Ri and Rf , respectively. The multi-phase clock skew
is illustrated in Figure 4.11. The common clock period for all clock phases is
denoted by T_CP for consistency with the original formulation of the single-
phase synchronized circuits.
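A short numerical sketch (Python) of the phase shift operator defined above is given below; the phase start times, the clock period and the cycle count k are hypothetical values used only to illustrate the arithmetic.

def phase_shift(phi_i, phi_f, t_cp, k=0):
    # Phase shift operator: phi_{pi,pf} = phi_pi - phi_pf + k * T_CP.
    return phi_i - phi_f + k * t_cp

T_CP = 5.0                      # common clock period
phi = {1: 0.0, 2: 2.5}          # start times of a two-phase scheme
# Data launched on phase 1 and captured on phase 2 one clock cycle later (k = 1).
print(phase_shift(phi[1], phi[2], T_CP, k=1))   # 2.5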

4.7 Single-Phase Path with Flip-Flops


A local data path composed of two flip-flops and combinational logic between
the flip-flops is shown in Figure 4.12. The initial flip-flop Ri is the origin of

Fig. 4.12. A single-phase local data path.

the data signal and the final flip-flop Rf is the destination of the data signal.
The combinational logic block Lif between Ri and Rf accepts the input data
signals supplied by Ri and other registers and logic gates and transmits the
operated upon data signals to Rf . The period of the clock signal is denoted by
TCP and the delays of the clock signals Ci and Cf to the flip-flops Ri and Rf
are denoted by ticd and tfcd , respectively. The input and output data signals to
Ri and Rf are denoted by Di , Qi , Df , and Qf , respectively.
An analysis of the timing properties of the local data path shown in Fig-
ure 4.12 is offered in the following sections. First, the timing relationships to
prevent the late arrival of data signals to Rf are examined in Section 4.7.1. The
timing relationships to prevent the early arrival of signals to the register Rf are
described in Section 4.7.2. The analyses presented in Sections 4.7.1 and 4.7.2
borrow some of the notation from [19] and [20]. Similar analyses of synchro-
nous circuits from the timing perspective can be found in [69, 70, 71, 72, 73].

4.7.1 Preventing the Late Arrival of the Data Signal

The operation of the local data path Ri ;Rf shown in Figure 4.12 requires that
any data signal that is being stored in Rf arrives at the data input Df of Rf no
later than δSF f 4 before the latching edge of the clock signal Cf . It is possible
for the opposite event to occur, that is, for the data signal Df not to arrive at
the register Rf sufficiently early in order to be stored successfully within Rf . If
this situation occurs, the local data path shown in Figure 4.12 fails to perform
as expected and a timing failure or violation is created. This form of timing
violation is typically called a setup (or long path) violation. A setup violation
is depicted in Figure 4.13 and is used in the following discussion.
4
As a reminder of the definitions in Section 4.5, in the δ_S^Ff notation, the subscript S
denotes the setup time, the superscript F denotes a flip-flop parameter and the
superscript f denotes that the parameter is defined at the final register Rf .
Fig. 4.13. Timing diagram of a local data path with flip-flops illustrating a violation
of the setup (or long path) constraint.

The coincidental cycles (k-th) of the clock signals Ci and Cf are shaded
for identification in Figure 4.13. Also shaded in Figure 4.13 are those portions
of the data signals Di , Qi , and Df that are relevant to the operation of
the local data path shown in Figure 4.12. Specifically, the shaded portion
of Di corresponds to the data to be stored in Ri at the beginning of the k-
th clock cycle. This data signal propagates to the output of the register Ri
and is illustrated by the shaded portion of Qi shown in Figure 4.13. The
combinational logic operates on Qi during the k-th clock cycle. The result
of this operation is illustrated by the shaded portion of the signal Df which
must be stored in Rf during the next (k + 1)-st clock cycle.
Observe that as illustrated in Figure 4.13, the leading edge of Ci that
initiates the k-th clock cycle occurs at time ticd + kTCP with respect to a
global time reference of zero. Similarly, the leading edge of Cf that initiates
the (k + 1)-st clock cycle occurs at time t_cd^f + (k + 1)T_CP . Therefore, the latest
arrival time Af of the data signal Df at the flip-flop Rf must satisfy

A_f ≤ t_cd^f + (k + 1)T_CP − Δ_L^F − δ_S^Ff .   (4.4)
The term t_cd^f + (k + 1)T_CP − Δ_L^F on the right hand side of (4.4) corresponds
to the critical situation of the leading edge of Cf arriving earlier by the maxi-
mum possible deviation Δ_L^F . The −δ_S^Ff term on the right hand side of (4.4) ac-
counts for the setup time of Rf (recall the definition of δ_S^F from Section 4.5.3).
Note that the value of Af in (4.4) consists of two components:
1. The latest arrival time Di that a valid data signal Qi appears at the output
of Ri , i.e., the sum D_i = t_cd^i + kT_CP + Δ_L^F + D_CQM^Fi of the latest possible
arrival time of the leading edge of Ci and the maximum clock-to-Q delay
of Ri ,
2. The maximum propagation delay D_PM^i,f of the data signals through the
combinational logic block Lif and interconnect along the path Ri ;Rf .
Therefore, Af can be described as

A_f = D_i + D_PM^i,f = t_cd^i + kT_CP + Δ_L^F + D_CQM^Fi + D_PM^i,f .   (4.5)

By substituting (4.5) into (4.4), the timing condition guaranteeing correct
signal arrival at the data input D of Rf is

t_cd^i + kT_CP + Δ_L^F + D_CQM^Fi + D_PM^i,f ≤ t_cd^f + (k + 1)T_CP − Δ_L^F − δ_S^Ff .   (4.6)

The above inequality can be transformed by subtracting the kTCP terms from
both sides of (4.6). Furthermore, certain terms in (4.6) can be grouped to-
gether. Also, by noting that t_cd^i − t_cd^f = T_Skew(i, f) is the clock skew between
the registers Ri and Rf ,

T_Skew(i, f) + 2Δ_L^F ≤ T_CP − ( D_CQM^Fi + D_PM^i,f + δ_S^Ff ).   (4.7)

Note that a violation of (4.7) is illustrated in Figure 4.13.


The timing relationship (4.7) represents three important results describing
the late arrival of the signal Df at the data input of the final register Rf in a
local data path Ri ;Rf :
1. Given any values of T_Skew(i, f), Δ_L^F , D_PM^i,f , δ_S^Ff and D_CQM^Fi , the late arrival
of the data signal at Rf can be prevented by controlling the value of the
clock period T_CP . A sufficiently large value of T_CP can always be chosen
to relax (4.7) by increasing the upper bound described by the right hand
side of (4.7).
2. For correct operation, the clock period T_CP does not necessarily have
to be larger than the term D_CQM^Fi + D_PM^i,f + δ_S^Ff . If the clock skew
T_Skew(i, f) is properly controlled, choosing a particular negative value for
the clock skew will relax the left side of (4.7), thereby permitting (4.7) to
be satisfied despite T_CP − ( D_CQM^Fi + D_PM^i,f + δ_S^Ff ) < 0.
3. Both the term 2Δ_L^F and the term ( D_CQM^Fi + D_PM^i,f + δ_S^Ff ) are harmful
in the sense that these terms impose a lower bound on the clock period
T_CP (as expected). Although negative skew can be used to relax the in-
equality (4.7), these two terms work against relaxing the values of T_CP and
T_Skew(i, f). Note that equivalently, the inequality (4.7) can be interpreted
as imposing an upper bound on the clock skew T_Skew(i, f).
Finally, the relationship (4.7) may be rewritten in a form that clarifies the
upper bound imposed on the clock skew TSkew (i, f ):
 
T_Skew(i, f) ≤ T_CP − ( D_CQM^Fi + D_PM^i,f + δ_S^Ff ) − 2Δ_L^F .   (4.8)
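As a numerical illustration of the setup constraint, the following sketch (Python) evaluates the upper bound of (4.8) for a single local data path and tests a candidate skew value. All delay values are hypothetical and expressed in picoseconds.

def max_skew_setup(t_cp, d_cq_max, d_p_max, setup, jitter_l):
    # Upper bound on T_Skew(i, f) imposed by the setup constraint (4.8).
    return t_cp - (d_cq_max + d_p_max + setup) - 2 * jitter_l

bound = max_skew_setup(t_cp=1000, d_cq_max=120, d_p_max=700, setup=80, jitter_l=20)
print(bound)          # 60: T_Skew(i, f) may not exceed 60 ps on this path
print(-150 <= bound)  # True: a skew of -150 ps satisfies (4.8)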

4.7.2 Preventing the Early Arrival of the Data Signal

Late arrival of the signal Df at the data input of Rf (see Figure 4.12) is ana-
lyzed in Section 4.7.1. In this section, an analysis of the timing relationships
of the local data path Ri ;Rf to prevent early data arrival of Df is presented.
To this end, recall from the discussion in Section 4.5.4 that any data signal Df
being stored in Rf must lag the arrival of the leading edge of Cf by at least δ_H^Ff .
It is possible for the opposite event to occur, i.e., for a new data signal Dfnew to
overwrite the value of Df and be stored within the register Rf . If this situation
occurs, the local data path shown in Figure 4.12 will not perform as desired
because of the timing violation known as a hold time (or short path) violation.
In this section, these hold time violations caused by race conditions are
analyzed. It is shown that a hold violation is more dangerous than a setup
violation since a hold violation cannot be removed by simply adjusting the
clock period TCP [unlike the case of a data signal arriving late where TCP can
be increased to satisfy (4.7)]. A hold violation is depicted in Figure 4.14 and
is used in the following discussion.
The situation depicted in Figure 4.14 is different from the situation de-
picted in Figure 4.13 in the following sense. In Figure 4.13, a data signal stored
in Ri during the k-th clock cycle arrives too late to be stored in Rf during the
(k + 1)-st clock cycle. In Figure 4.14, however, the data stored in Ri during the
k-th clock cycle arrives at Rf too early and overwrites the data that had to be
stored in Rf during the same k-th clock cycle. To clarify this concept, certain
portions of the data signals are shaded for easy identification in Figure 4.14.
The data Di being stored in Ri at the beginning of the k-th clock cycle is
shaded. This data signal propagates to the output of the register Ri and is
illustrated by the shaded portion of Qi shown in Figure 4.14. The output of
the logic (left unshaded in Figure 4.14) is being stored within the register Rf
at the beginning of the (k +1)-st clock cycle. Finally, the shaded portion of Df
corresponds to the data signal that is to be stored in Rf at the beginning of
the k-th clock cycle.
Note that, as illustrated in Figure 4.14, the leading (or latching) edge of Ci
that initiates the k-th clock cycle occurs at time ticd + kTCP . Similarly, the
Fig. 4.14. Timing diagram of a local data path with flip-flops with a violation of
the hold constraint.

leading (or latching) edge of Cf that initiates the k-th clock cycle occurs at
time tfcd + kTCP . Therefore, the earliest arrival time af of the data signal Df
at the register Rf must satisfy the following condition:
 
a_f ≥ t_cd^f + kT_CP + Δ_L^F + δ_H^Ff .   (4.9)

The term t_cd^f + kT_CP + Δ_L^F on the right hand side of (4.9) corresponds to
the critical situation of the leading edge of the k-th clock cycle of Cf arriving
late by the maximum possible deviation Δ_L^F . Note that the value of af in (4.9)
has two components:
1. The earliest arrival time di that a valid data signal Qi appears at the
output of Ri , i.e., the sum d_i = t_cd^i + kT_CP − Δ_L^F + D_CQm^Fi of the earliest
arrival time of the leading edge of Ci and the minimum clock-to-Q delay
of Ri ,
2. The minimum propagation delay D_Pm^i,f of the signals through the combi-
national logic block Lif and interconnect wires along the path Ri ;Rf .
Therefore, af can be described as

a_f = d_i + D_Pm^i,f = t_cd^i + kT_CP − Δ_L^F + D_CQm^Fi + D_Pm^i,f .   (4.10)

By substituting (4.10) into (4.9), the timing condition that guarantees that
Df does not arrive too early at Rf is

t_cd^i + kT_CP − Δ_L^F + D_CQm^Fi + D_Pm^i,f ≥ t_cd^f + kT_CP + Δ_L^F + δ_H^Ff .   (4.11)

The inequality (4.11) can be further simplified by regrouping terms and
noting that t_cd^i − t_cd^f = T_Skew(i, f) is the clock skew between the registers Ri
and Rf :

T_Skew(i, f) − 2Δ_L^F ≥ − ( D_CQm^Fi + D_Pm^i,f ) + δ_H^Ff .   (4.12)

Recall that a violation of (4.12) is illustrated in Figure 4.14.


The timing relationship described by (4.12) provides certain important
facts describing the early arrival of the signal Df at the data input of the
final register Rf of a local data path:
1. Unlike (4.7), the inequality (4.12) does not depend on the clock pe-
riod TCP . Therefore, a violation of (4.12) cannot be corrected by simply
increasing the clock period TCP . A synchronous digital system with hold
violations is non-functional, while a system with setup violations will still
operate correctly at a reduced speed.5
2. Both for (4.12) and for zero-skew systems, the hold violation can be
avoided through delay padding [74] into the logic. Inserting delays into
the logic increases the D_Pm^i,f value on the right hand side of the inequality,
making it easier to satisfy the constraint for given values of T_Skew(i, f). A
more sophisticated use of delay insertion for eliminating timing violations
in non-zero clock skew circuits is presented in Chapter 8.
3. The relationship (4.12) can be satisfied with a sufficiently large value of
the clock skew T_Skew(i, f). However, both the term 2Δ_L^F and the term δ_H^Ff
are harmful in the sense that these terms impose a lower bound on the
clock skew T_Skew(i, f) between the registers Ri and Rf . Although positive
skew may be used to relax (4.12), these two terms work against relaxing
the values of T_Skew(i, f) and ( D_CQm^Fi + D_Pm^i,f ).

Finally, the relationship (4.12) can be rewritten to stress the lower bound
imposed on the clock skew TSkew (i, f ):
 
T_Skew(i, f) ≥ − ( D_Pm^i,f + D_CQm^Fi ) + δ_H^Ff + 2Δ_L^F .   (4.13)

5
Increasing the clock period TCP in order to satisfy (4.7) is equivalent to reducing
the frequency of the clock signal.
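Combining (4.13) with (4.8) gives the permissible range of clock skew for a flip-flop based local data path, which is the quantity exploited by clock skew scheduling in the following chapters. A minimal sketch (Python) with hypothetical picosecond values is shown below.

def min_skew_hold(d_p_min, d_cq_min, hold, jitter_l):
    # Lower bound on T_Skew(i, f) imposed by the hold constraint (4.13).
    return -(d_p_min + d_cq_min) + hold + 2 * jitter_l

def max_skew_setup(t_cp, d_cq_max, d_p_max, setup, jitter_l):
    # Upper bound on T_Skew(i, f) imposed by the setup constraint (4.8).
    return t_cp - (d_cq_max + d_p_max + setup) - 2 * jitter_l

lower = min_skew_hold(d_p_min=150, d_cq_min=90, hold=30, jitter_l=20)
upper = max_skew_setup(t_cp=1000, d_cq_max=120, d_p_max=700, setup=80, jitter_l=20)
print((lower, upper))   # (-170, 60): the permissible skew range of this path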

4.8 Single-Phase Path with Latches


A local data path consisting of two level-sensitive registers (or latches)
and combinational logic between these registers (or latches) is shown in
Figure 4.15. Note the initial latch Ri which is the origin of the data signal and
the final latch Rf which is the destination of the data signal. The combinational

Fig. 4.15. A single-phase local data path with latches.

logic block Lif between Ri and Rf accepts the input data signals sourced by Ri
and other registers and logic gates and transmits the data signals that have
been operated on to Rf . The period of the clock signal is denoted by TCP and
the delays of the clock signals Ci and Cf to the latches Ri and Rf are denoted
by ticd and tfcd , respectively. The input and output data signals to Ri and Rf
are denoted by Di , Qi , Df , and Qf , respectively.
An analysis of the timing properties of the local data path shown in Fig-
ure 4.15 is offered in the following sections. The timing relationships to prevent
the late arrival of the data signal at the latch Rf are examined in Section 4.8.1.
The timing relationships to prevent the early arrival of the data signal at the
latch Rf are examined in Section 4.8.2.
The analyses presented in this section are built on the timing relationships
among the signals of a latch that are similar to those used in Section 4.7.
Specifically, it is guaranteed that every data signal arrives at the data input
of a latch no later than δ_S^L time before the trailing clock edge. Also, this data
signal must remain stable at least δ_H^L time after the trailing edge, i.e., no
new data signal should arrive at a latch within δ_H^L time after the latch has become
opaque.
Observe the differences between a latch and a flip-flop [70, 75]. In flip-
flops, the setup and hold requirements described in the previous paragraph are
relative to the leading—not to the trailing—edge of the clock signal. Similarly
to flip-flops, the late and early arrival of the data signal at a latch gives rise
to timing violations known as setup and hold violations, respectively.

4.8.1 Preventing the Late Arrival of the Data Signal


A system of signals similar to the example illustrated in Figure 4.13 is as-
sumed in the following discussion. A data signal Di is stored in the latch Ri

during the k-th clock cycle. The data Qi stored in Ri propagates through the
combinational logic Lif and the interconnect along the path Ri ;Rf . In the
(k + 1)-st clock cycle, the result Df of the computation in Lif is stored within
the latch Rf . The signal Df must arrive at least δSL time before the trailing
edge of Cf in the (k + 1)-st clock cycle.
Similar to the discussion presented in Section 4.7.1, the latest arrival time
Af of Df at the D input of Rf must satisfy

A_f ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf .   (4.14)

Note the difference between (4.14) and (4.4). In (4.4), the first term on the
right hand side is [ t_cd^f + (k + 1)T_CP − Δ_L^F ], while in (4.14), the first term on the
right hand side has an additional term C_W^L . The addition of C_W^L corresponds to
the concept that unlike flip-flops, a data signal is stored in the latches, shown
in Figure 4.15, at the trailing edge of the clock signal (the C_W^L term). Similar to
the case of flip-flops in Section 4.7.1, the term t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L
in the right hand side of (4.14) corresponds to the critical situation of the
trailing edge of the clock signal Cf arriving earlier by the maximum possible
deviation Δ_T^L .
Observe that the value of Af in (4.14) consists of two components:
1. The latest arrival time Di when a valid data signal Qi appears at the
output of the latch Ri ,
2. The maximum signal propagation delay through the combinational logic
block Lif and the interconnect along the path Ri ;Rf .
Therefore, Af can be described as

A_f = D_PM^i,f + D_i .   (4.15)

However, unlike the situation of flip-flops as discussed in Section 4.7.1, the


term Di on the right hand side of (4.15) is not the sum of the delays through
the register Ri . The reason is that the value of Di depends upon whether the
signal Di arrived before or during the transparent state of Ri in the k-th clock
cycle. Therefore, the value of Di in (4.15) is the greater of the following two
quantities:
D_i = max( A_i + D_DQM^Li , t_cd^i + kT_CP + Δ_L^L + D_CQM^Li ).   (4.16)

There are two terms in the right hand side of (4.16):


1. The term ( A_i + D_DQM^Li ) corresponds to the situation in which Di arrives
at Ri after the leading edge of the k-th clock period,
2. The term ( t_cd^i + kT_CP + Δ_L^L + D_CQM^Li ) corresponds to the situation in
which Di arrives at Ri before the arrival of the leading edge of the k-th
clock pulse.
By substituting (4.16) into (4.15), the latest time of arrival Af is

A_f = D_PM^i,f + max( A_i + D_DQM^Li , t_cd^i + kT_CP + Δ_L^L + D_CQM^Li ),   (4.17)

which is in turn substituted into (4.14) to obtain

D_PM^i,f + max( A_i + D_DQM^Li , t_cd^i + kT_CP + Δ_L^L + D_CQM^Li )
    ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf .   (4.18)

Equation (4.18) is an expression of the inequality that must be satisfied in


order to prevent the late arrival of a data signal at the data input D of
the latch Rf . By satisfying (4.18), any setup violation in a local data path
with latches as shown in Figure 4.15 is avoided. For a circuit to operate cor-
rectly, (4.18) must be enforced for every local data path Ri ;Rf consisting of
the latches, Ri and Rf .
The max operator in (4.18) creates a mathematically difficult situation
since it is unknown which of the quantities under the max operation is greater.
To overcome this obstacle, this max operation may be split into two conditions:
D_PM^i,f + A_i + D_DQM^Li ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf ,   (4.19)

D_PM^i,f + t_cd^i + kT_CP + Δ_L^L + D_CQM^Li ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf .   (4.20)

Taking into account that the clock skew T_Skew(i, f) = t_cd^i − t_cd^f , (4.19)
and (4.20) can be rewritten, respectively, as

D_PM^i,f + A_i + D_DQM^Li ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf ,   (4.21)

T_Skew(i, f) + ( Δ_L^L + Δ_T^L ) ≤ T_CP + C_W^L − ( D_CQM^Li + D_PM^i,f + δ_S^Lf ).   (4.22)

Similar to Sections 4.7.1 and 4.7.2, (4.22) can be rewritten to emphasize the
upper bound on the clock skew TSkew (i, f ) imposed by (4.22):
D_PM^i,f + A_i + D_DQM^Li ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf ,   (4.23)

T_Skew(i, f) ≤ T_CP + C_W^L − Δ_L^L − Δ_T^L − ( D_CQM^Li + D_PM^i,f + δ_S^Lf ).   (4.24)
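The max operator of (4.16) through (4.18) is straightforward to evaluate once the arrival time at the initial latch is known. The sketch below (Python) computes the latest arrival time at the final latch and checks the setup condition (4.18) for a single-phase latch path; all parameter values are hypothetical and the clock cycle index k is taken as zero.

def latest_arrival_latch(a_i, d_dq_max, t_cd_i, t_cp, k, jit_lead, d_cq_max, d_p_max):
    # Latest arrival time A_f at the final latch, per (4.16) and (4.17).
    d_i = max(a_i + d_dq_max,                           # D_i arrives while R_i is transparent
              t_cd_i + k * t_cp + jit_lead + d_cq_max)  # D_i arrives before the leading edge
    return d_i + d_p_max

def setup_satisfied(a_f, t_cd_f, t_cp, k, c_w, jit_trail, setup):
    # Setup condition (4.18), checked against the trailing edge of cycle k + 1.
    return a_f <= t_cd_f + (k + 1) * t_cp + c_w - jit_trail - setup

a_f = latest_arrival_latch(a_i=400, d_dq_max=110, t_cd_i=50, t_cp=1000, k=0,
                           jit_lead=20, d_cq_max=130, d_p_max=600)
print(a_f)                                               # 1110
print(setup_satisfied(a_f, t_cd_f=30, t_cp=1000, k=0,
                      c_w=500, jit_trail=20, setup=80))  # True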

4.8.2 Preventing the Early Arrival of the Data Signal

A system of signals similar to the example illustrated in Figure 4.14 is assumed


in the discussion presented in this section. Recall the difference between the
late arrival of a data signal at Rf and the early arrival of a data signal at Rf
(see Section 4.7.2). In the former case, the data signal stored in the latch Ri
during the k-th clock cycle arrives too late to be stored in the latch Rf during

the (k + 1)-st clock cycle. In the latter case, the data signal stored in the
latch Ri during the k-th clock cycle propagates to the latch Rf too early and
overwrites the data signal that is already stored in the latch Rf during the
same k-th clock cycle.
In order for the proper data signal to be successfully latched within Rf
during the k-th clock cycle, there should not be any changes in the signal Df
until at least the hold time after the arrival of the storing (trailing) edge of
the clock signal Cf . Therefore, the earliest arrival time af of the data signal
Df at the register Rf must satisfy the following condition,
 
a_f ≥ t_cd^f + kT_CP + C_W^L + Δ_T^L + δ_H^Lf .   (4.25)
The term t_cd^f + kT_CP + C_W^L + Δ_T^L on the right hand side of (4.25) corre-
sponds to the critical situation of the trailing edge of the k-th clock cycle of
the clock signal Cf arriving late by the maximum possible deviation Δ_T^L . Note
that the value of af in (4.25) consists of two components:
1. The earliest arrival time di that a valid data signal Qi appears at the
output of the latch Ri , i.e., the sum d_i = t_cd^i + kT_CP − Δ_L^L + D_CQm^Li of
the earliest arrival time of the leading edge of the clock signal Ci and the
minimum clock-to-Q delay D_CQm^Li of Ri ,
2. The minimum propagation delay D_Pm^i,f of the signal through the combi-
national logic Lif and the interconnect along the path Ri ;Rf .
Therefore, af can be described as

a_f = d_i + D_Pm^i,f = t_cd^i + kT_CP − Δ_L^L + D_CQm^Li + D_Pm^i,f .   (4.26)
By substituting (4.26) into (4.25), the timing condition guaranteeing that Df
does not arrive too early at the latch Rf is
t_cd^i + kT_CP − Δ_L^L + D_CQm^Li + D_Pm^i,f ≥ t_cd^f + kT_CP + C_W^L + Δ_T^L + δ_H^Lf .   (4.27)

The inequality (4.27) can be further simplified by reorganizing the terms
and noting that t_cd^i − t_cd^f = T_Skew(i, f) is the clock skew between the registers
Ri and Rf :

T_Skew(i, f) − ( Δ_L^L + Δ_T^L ) ≥ − ( D_CQm^Li + D_Pm^i,f ) + δ_H^Lf .   (4.28)

The timing relationship described by (4.28) represents three important


results describing the early arrival of the signal Df at the data input of the
final latch Rf of a local data path:
1. The relationship (4.28) does not depend on the value of the clock pe-
riod TCP . Therefore, if a hold time violation in a synchronous system
has occurred,6 this timing violation cannot be fixed through clock period
manipulation.
6
As described by the inequality (4.28) not being satisfied.

2. Similar to flip-flop-based paths, the hold violation can be avoided through
delay padding [74] into the logic. Inserting delays into the logic increases
the D_Pm^i,f value on the right hand side of the inequality, making it easier to
satisfy the constraint for given values of T_Skew(i, f).
3. The relationship (4.28) can be satisfied with a sufficiently large value of the
clock skew T_Skew(i, f). Furthermore, both the term ( Δ_L^L + Δ_T^L ) and the
term δ_H^Lf are harmful in the sense that these terms impose a lower bound
on the clock skew T_Skew(i, f) between the latches Ri and Rf . Although
positive skew (T_Skew(i, f) > 0) can be used to relax (4.28), these two
terms make it difficult to satisfy the inequality (4.28) for specific values
of T_Skew(i, f) and ( D_CQm^Li + D_Pm^i,f ).

Finally, the relationship (4.28) can be rewritten to emphasize the lower bound
on the clock skew TSkew (i, f ):
T_Skew(i, f) ≥ ( Δ_L^L + Δ_T^L ) − ( D_CQm^Li + D_Pm^i,f ) + δ_H^Lf .   (4.29)
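Analogous to the flip-flop case, (4.29) and (4.24) bound the permissible skew of a single-phase latch path from below and above. A minimal numerical sketch (Python), using hypothetical picosecond values, is shown below.

def latch_skew_bounds(t_cp, c_w, jit_lead, jit_trail,
                      d_cq_max, d_p_max, setup,
                      d_cq_min, d_p_min, hold):
    # (lower, upper) bounds on T_Skew(i, f) from (4.29) and (4.24).
    lower = (jit_lead + jit_trail) - (d_cq_min + d_p_min) + hold
    upper = t_cp + c_w - jit_lead - jit_trail - (d_cq_max + d_p_max + setup)
    return lower, upper

print(latch_skew_bounds(t_cp=1000, c_w=500, jit_lead=20, jit_trail=20,
                        d_cq_max=130, d_p_max=900, setup=80,
                        d_cq_min=90, d_p_min=200, hold=30))   # (-220, 350)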

4.9 Multi-Phase Path with Latches

Multi-phase clock synchronization is often used for level-sensitive synchro-


nous circuits. A multi-phase local data path consisting of two latches and
combinational logic between these latches is shown in Figure 4.16. Similar to

Fig. 4.16. A multi-phase local data path with latches.

the single-phase counterpart in Figure 4.15, the initial latch Ri is the origin of


the data signal and the final latch Rf is the destination of the data signal.
The combinational logic block Lif between Ri and Rf accepts the input data
signals sourced by Ri and other registers and logic gates and transmits the
data signals that have been operated on to Rf . The period of the multi-phase
clock signals is denoted by T_CP and the latches Ri and Rf of a local data path
shown in Figure 4.16 are synchronized by the clock signals C_i^pi and C_f^pf , re-
spectively. As defined in Section 4.6.2, the superscripts pi and pf describe the
clock phases that synchronize Ri and Rf , respectively. The subscripts i and f
denote the clock signals of phase C^pi at Ri and phase C^pf at Rf , respectively.
The delays of the clock signals C_i^pi and C_f^pf to the latches Ri and Rf are de-
noted by t_i^pi and t_f^pf , respectively. The input and output data signals to Ri
and Rf are denoted by Di , Qi , Df , and Qf , respectively.
An analysis of the timing properties of the local data path shown in Fig-
ure 4.16 is offered in the following sections. The timing relationships to prevent
the late arrival of the data signal at the latch Rf are examined in Section 4.9.1.
The timing relationships to prevent the early arrival of the data signal at the
latch Rf are examined in Section 4.9.2.
The analyses presented in this section are built on the timing relationships
among the signals of a latch similar to those used in Sections 4.7 and 4.8.
Specifically, it is guaranteed that every data signal arrives at the data input
of a latch no later than δ_S^L time before the trailing clock edge. Also, this data
signal must remain stable at least δ_H^L time after the trailing edge, i.e., no
new data signal should arrive at a latch within δ_H^L time after the latch has become
opaque.

4.9.1 Preventing the Late Arrival of the Data Signal

Analogous to the single-phase discussion, a system of signals similar to the


example illustrated in Figure 4.10 is assumed in the following discussion. A
data signal Di is stored in the latch Ri during the k-th clock cycle. The data Qi
stored in Ri propagates through the combinational logic Lif and the intercon-
nect along the path Ri ;Rf . During the (k + 1)-st clock cycle, the result Df
of the computation in Lif is stored within the latch Rf . The signal Df must
arrive at least δSL time before the trailing edge of Cf in the (k + 1)-st clock
cycle.
Similar to the discussions presented in Sections 4.7.1 and 4.8.1, the latest
arrival time Af of Df at the D input of Rf must satisfy

A_f ≤ φ_pf + t_f^pf + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf .   (4.30)

Note the difference between (4.30) and (4.14). In (4.30), the term on the
right hand side has an additional term φpf to account for the clock phase
information.
Observe that the value of Af in (4.30) consists of two components:
1. The latest arrival time Di when a valid data signal Qi appears at the
output of the latch Ri ,
2. The maximum signal propagation delay through the combinational logic
block Lif and the interconnect along the path Ri ;Rf .
Therefore, Af can be described as

A_f = D_PM^i,f + D_i .   (4.31)

Similar to Section 4.8.1, the value of Di in (4.31) is the greater of the following
two quantities:
D_i = max( A_i + D_DQM^Li , φ_pi + t_i^pi + kT_CP + Δ_L^L + D_CQM^Li ).   (4.32)

There are two terms in the right hand side of (4.32):
1. The term ( A_i + D_DQM^Li ) corresponds to the situation in which Di arrives
at Ri after the leading edge of the k-th clock cycle,
2. The term ( φ_pi + t_i^pi + kT_CP + Δ_L^L + D_CQM^Li ) corresponds to the situation
in which Di arrives at Ri before the arrival of the leading edge of the k-th
clock pulse.
By substituting (4.32) into (4.31), the latest time of arrival Af is
A_f = D_PM^i,f + max( A_i + D_DQM^Li , φ_pi + t_i^pi + kT_CP + Δ_L^L + D_CQM^Li ),   (4.33)

which is in turn substituted into (4.30) to obtain

D_PM^i,f + max( A_i + D_DQM^Li , φ_pi + t_i^pi + kT_CP + Δ_L^L + D_CQM^Li )
    ≤ φ_pf + t_f^pf + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf .   (4.34)
Equation (4.34) is an expression of the inequality that must be satisfied in
order to prevent the late arrival of a data signal at the data input D of
the latch Rf . By satisfying (4.34), any setup violation in a local data path
with latches as shown in Figure 4.16 is avoided. For a circuit to operate cor-
rectly, (4.34) must be enforced for every local data path Ri ;Rf consisting of
the latches, Ri and Rf .
Similar to single-phase operation, the max operator in (4.34) may be split
into two conditions:
D_PM^i,f + A_i + D_DQM^Li ≤ φ_pf + t_f^pf + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf ,   (4.35)

D_PM^i,f + φ_pi + t_i^pi + kT_CP + Δ_L^L + D_CQM^Li ≤ φ_pf + t_f^pf + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf .   (4.36)

Taking into account that the multi-phase clock skew is T_Skew^{pi,pf}(i, f) = t_i^pi − t_f^pf ,
(4.35) and (4.36) can be rewritten, respectively, as

D_PM^i,f + A_i + D_DQM^Li ≤ φ_pf + t_f^pf + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf ,   (4.37)

φ_{pi,pf} + T_Skew^{pi,pf}(i, f) + ( Δ_L^L + Δ_T^L )
    ≤ T_CP + C_W^L − ( D_CQM^Li + D_PM^i,f + δ_S^Lf ).   (4.38)

Similar to Sections 4.8.1 and 4.8.2, (4.38) can be rewritten to emphasize the
upper bound on the clock skew T_Skew^{pi,pf}(i, f):

D_PM^i,f + A_i + D_DQM^Li ≤ φ_pf + t_f^pf + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^Lf ,   (4.39)

T_Skew^{pi,pf}(i, f) ≤ −φ_{pi,pf} + T_CP + C_W^L − Δ_L^L − Δ_T^L − ( D_CQM^Li + D_PM^i,f + δ_S^Lf ).   (4.40)

4.9.2 Preventing the Early Arrival of the Data Signal

In order for the proper data signal to be successfully latched within Rf during
the k-th clock cycle, there should not be any changes in the signal Df until at
least the hold time after the arrival of the storing (trailing) edge of the clock
signal C_f^pf . Therefore, the earliest arrival time af of the data signal Df at the
register Rf must satisfy the following condition,

a_f ≥ φ_pf + t_f^pf + kT_CP + C_W^L + Δ_T^L + δ_H^Lf .   (4.41)

The term φ_pf + t_f^pf + kT_CP + C_W^L + Δ_T^L on the right hand side of (4.41)
corresponds to the critical situation of the trailing edge of the k-th clock cycle
of the clock signal C_f^pf arriving late by the maximum possible deviation Δ_T^L .
Note that the value of af in (4.41) consists of two components:
1. The earliest arrival time di that a valid data signal Qi appears at the
output of the latch Ri , i.e., the sum d_i = φ_pi + t_i^pi + kT_CP − Δ_L^L + D_CQm^Li
of the earliest arrival time of the leading edge of the clock signal C_i^pi and
the minimum clock-to-Q delay D_CQm^Li of Ri ,
2. The minimum propagation delay D_Pm^i,f of the signal through the combi-
national logic Lif and the interconnect along the path Ri ;Rf .
Therefore, af can be described as

a_f = d_i + D_Pm^i,f = φ_pi + t_i^pi + kT_CP − Δ_L^L + D_CQm^Li + D_Pm^i,f .   (4.42)

By substituting (4.42) into (4.41), the timing condition guaranteeing that Df


does not arrive too early at the latch Rf is
φ_pi + t_i^pi + kT_CP − Δ_L^L + D_CQm^Li + D_Pm^i,f
    ≥ φ_pf + t_f^pf + kT_CP + C_W^L + Δ_T^L + δ_H^Lf .   (4.43)

The inequality (4.43) can be further simplified by reorganizing the terms
and noting that t_i^pi − t_f^pf = T_Skew^{pi,pf}(i, f) is the multi-phase clock skew between
the registers Ri and Rf :

φ_{pi,pf} + T_Skew^{pi,pf}(i, f) − ( Δ_L^L + Δ_T^L ) ≥ − ( D_CQm^Li + D_Pm^i,f ) + δ_H^Lf .   (4.44)

The timing relationship described by (4.44) represents three important


results describing the early arrival of the signal Df at the data input of the
final latch Rf of a local data path:
1. The relationship (4.44) does not depend on the value of the clock pe-
riod TCP . Therefore, if a hold time violation in a synchronous system has
occurred,7 this timing violation cannot be fixed by manipulating the clock
period.
2. Similar to flip-flop-based paths, the hold violation can be avoided through
delay padding [74] into the logic. Inserting delays into the logic increases
the D_Pm^i,f value on the left hand side of the inequality, making it easier to
satisfy the constraint for given values of T_Skew^{pi,pf}(i, f).
3. The relationship (4.44) can be satisfied with a sufficiently large value
of the clock skew T_Skew^{pi,pf}(i, f). Furthermore, both the term ( Δ_L^L + Δ_T^L )
and the term δ_H^Lf are harmful in the sense that these terms impose a
lower bound on the clock skew T_Skew^{pi,pf}(i, f) between the latches Ri and Rf .
Although positive skew can be used to relax (4.44), these two terms make
it difficult to satisfy the inequality (4.44) for specific values of T_Skew^{pi,pf}(i, f)
and ( D_CQm^Li + D_Pm^i,f ).

Finally, the relationship (4.44) can be rewritten to emphasize the lower bound
on the clock skew T_Skew^{pi,pf}(i, f):

T_Skew^{pi,pf}(i, f) ≥ −φ_{pi,pf} + ( Δ_L^L + Δ_T^L ) − ( D_CQm^Li + D_Pm^i,f ) + δ_H^Lf .   (4.45)
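The multi-phase bounds (4.45) and (4.40) differ from the single-phase latch bounds only by the phase shift operator. A minimal sketch (Python) for a hypothetical two-phase example is shown below; as before, all numerical values are illustrative only.

def multiphase_skew_bounds(phi_shift, t_cp, c_w, jit_lead, jit_trail,
                           d_cq_max, d_p_max, setup,
                           d_cq_min, d_p_min, hold):
    # (lower, upper) bounds on the multi-phase skew from (4.45) and (4.40).
    lower = -phi_shift + (jit_lead + jit_trail) - (d_cq_min + d_p_min) + hold
    upper = (-phi_shift + t_cp + c_w - jit_lead - jit_trail
             - (d_cq_max + d_p_max + setup))
    return lower, upper

# Launch on phase 1 (start 0 ps), capture on phase 2 (start 250 ps) in the same
# cycle, so phi_{pi,pf} = 0 - 250 + 0 = -250 ps.
print(multiphase_skew_bounds(phi_shift=-250, t_cp=1000, c_w=400,
                             jit_lead=20, jit_trail=20,
                             d_cq_max=130, d_p_max=900, setup=80,
                             d_cq_min=90, d_p_min=200, hold=30))   # (30, 500)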

4.10 A Final Note

The properties of registers and local data paths are described in this chapter.
Specifically, the timing relationships to prevent setup and hold timing viola-
tions in a local data path consisting of two positive edge-triggered flip-flops are
analyzed in Sections 4.7.1 and 4.7.2, respectively. The timing relationships to
prevent setup and hold timing violations in a local data path consisting of two
positive-polarity latches have also been analyzed in Sections 4.8.1 and 4.8.2,
respectively. Timing relationships to prevent setup and hold timing violations
in a local data path consisting of two positive-polarity latches, synchronized by
a multi-phase clocking scheme, have been analyzed in Sections 4.9.1 and 4.9.2,
respectively.
In a fully synchronous digital VLSI system, however, it is possible to en-
counter certain local data paths different from those circuits analyzed in this
chapter. For example, a local data path may begin with a positive-polarity,
7
As described by the inequality (4.44) not being satisfied.

edge-sensitive register Ri and end with a negative-polarity, edge-sensitive reg-


ister Rf . It is also possible that different types of registers are used, e.g., a
register with more than one data input. In each individual case, the analyses
provided in this chapter illustrate a general methodology for determining the
proper timing relationships specific to that system. Furthermore, note that for
a given system, the timing relationships that must be satisfied for a system to
operate correctly—such as (4.8), (4.13), (4.23), (4.24), (4.29), (4.39), (4.40)
and (4.45)—are collectively referred to as the overall timing constraints of the
synchronous digital system [9].
5
Clock Skew Scheduling and Clock Tree Synthesis

The basic principles of operation of a synchronous digital VLSI system are


described in Chapter 2. As demonstrated in Chapter 3, the propagation of
signals through logic gates and interconnections requires a certain amount of
time to complete. Therefore, a timing discipline is necessary to ensure that log-
ical computations—whether executing concurrently or in sequence—operate
on the proper data signals. As described in Chapter 4, this timing discipline
is implemented by inserting storage elements, or registers, throughout the cir-
cuit. Also analyzed in Chapter 4 are the timing relationships among signals
in local data paths based on the type of clock signal and storage element.
Recall from Chapter 4 the relationships that must be satisfied in order for a
local data path to operate properly [inequalities (4.8), (4.13), (4.23), (4.24),
(4.29), (4.39), (4.40) and (4.45)]. These relationships are written in the form
of bounds on the clock skew TSkew in order to emphasize that bounds are im-
posed on TSkew by various parameters of the data paths and the clock signal.
If any of the inequalities (4.8), (4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and
(4.45) is not satisfied, a timing violation occurs.
A methodology and software system for determining (or scheduling) the
values of the clock skew TSkew based on the timing constraints of a fully
synchronous digital VLSI system and for synthesizing the clock distribution
network so as to implement these target clock skew values is described in this
chapter. The relation of synchronization to the design of the clock distribu-
tion network is presented in Section 5.1. Some useful definitions and notations
are introduced in Section 5.2. The clock skew scheduling problem for more
popular register type of edge-triggered flip-flops is described in Section 5.3.
Various formulations of timing problem with the presented timing constraints
are briefed in Section 5.4. The structure of the clock distribution network is
examined from the perspective of clock skew scheduling in Section 5.5. The
proposed algorithms are described in Section 5.6. Finally, the software pro-
grams developed to implement the algorithm and the demonstration of these
programs on benchmark and industrial circuits are described in Section 5.7.


5.1 Background

As described in Chapter 2, most high performance digital integrated circuits


implement data processing algorithms based on the iterative execution of basic
operations. Typically, these algorithms are highly parallelized and pipelined
by inserting clocked registers at specific locations throughout the circuit. The
synchronization strategy for these clocked registers in the vast majority of
VLSI/ULSI-based digital systems is a fully synchronous approach. It is not
uncommon for the computational process in these systems to be spread over
hundreds of thousands of functional logic elements and tens of thousands of
registers.
For such synchronous digital systems to function properly, the many thou-
sands of switching events require a strict temporal ordering. This strict order-
ing is enforced by a global synchronization signal known as the clock signal.
For a fully synchronous system to operate correctly, the clock signal must be
delivered to every register at a precise relative time. The delivery function is
accomplished by a circuit and interconnect structure commonly known as a
clock distribution network [9, 32].
As described in Chapter 3, multiple factors affect the propagation delay of
the data signals through the combinational logic gates and interconnect. Since
the clock distribution network is composed of logic gates and interconnection
wires, the signals in the clock distribution network are delayed. Moreover,
the dependence of the correct operation of a system on the signal delay in
the clock distribution network is far greater than on the delay of the logic
gates. Recall that by delivering the clock signal to registers at precise times,
the clock distribution network essentially quantizes the operational time of a
synchronous system into clock periods, thereby permitting the simultaneous
execution of operations.
The nature of the on-chip clock signal has become a primary factor limit-
ing circuit performance, causing the clock distribution network to become a
performance bottleneck in high speed VLSI systems. As described in Chap-
ter 3, the primary source of the load for the clock signals has shifted from the
logic gates to the interconnect, thereby changing the physical nature of the
load from a lumped capacitance (C) to a distributed resistive-capacitive (RC)
load [8, 76]. These interconnect impedances degrade the on-chip signal wave-
form shapes and increase the path delay. Furthermore, statistical variations
of the parameters characterizing the circuit elements along the clock and data
signal paths, caused by the imperfect control of the manufacturing process and
the environment, introduce ambiguity into the signal timing that cannot be ne-
glected. All of these changes have a profound impact on both the choice of syn-
chronous design methodology and on the overall circuit performance. Among
the most important consequences are increased power dissipated by the clock
distribution network as well as increasingly challenging timing constraints
that must be satisfied in order to avoid timing violations [9, 77, 78, 79, 80].
Therefore, the majority of the approaches used to design a clock distribution

network focus on simplifying the performance goals by targeting minimal or


zero global clock skew [81, 82, 83], which can be achieved by different rout-
ing strategies [84, 85, 86, 87], buffered clock tree synthesis, symmetric n-ary
trees [77] (most notably H-trees) or a distributed series of buffers connected
as a mesh [9, 32, 80].

5.2 Definitions and Graphical Model


A synchronous digital system is a network of combinational logic and storage
registers whose input and output terminals are interconnected by wires. An
example of a synchronous system is shown in Figure 5.1. The sets of registers
and logic gates of this specific system are outlined in Figure 5.1. The system
consists of four registers, R1 through R4 , and four logic gates, G1 through G4 .
For clarity, the clock distribution network and clock signals to the registers
are not shown in Figure 5.1 and the details of the registers and logic gates are
also omitted.

[The set of registers R = {R1 , R2 , R3 , R4 }; the set of logic gates G = {G1 , G2 , G3 , G4 }.
Data inputs drive R1 and R2 ; the data output is produced after R4 .]

Fig. 5.1. A simple synchronous digital circuit with four registers and four logic
gates.

A sequence of connected logic gates (no registers) is called a signal path.


For example, in Figure 5.1, one signal path begins at the register R1 and
propagates through the logic gates G1 and G2 before reaching the register R3 .
Other signal paths can also be identified within the system shown in Fig-
ure 5.1. Every signal path in a synchronous system is delimited by a pair of
registers—one register each for the start and the end of the path. Such a pair
of registers is called a sequentially-adjacent pair and is defined next:
Definition 5.1. Sequentially-adjacent pair of registers. For an arbitrary or-
dered pair of registers Ri , Rf in a synchronous circuit, one of the following
two situations can be observed. Either there exists at least one signal path
that connects some output of Ri to some input of Rf or inputs of Rf cannot be

reached from outputs of Ri through a signal path.1 In the former case—denoted


by Ri ;Rf —the pair of registers Ri , Rf is called a sequentially-adjacent pair
of registers and switching events at the output of Ri can possibly affect the in-
put of Rf during the same clock period. A sequentially-adjacent pair of registers
is also referred to as a local data path [9].
Generalized examples of local data paths with flip-flops and latches are shown
in Figures 4.12 and 4.15, respectively. The clock signal Ci driving the initial
register Ri of the local data path and the clock signal Cf driving the final
register Rf are shown in Figures 4.12 and 4.15, respectively. Returning to
Figure 5.1, for example, R1 , R3 is a sequentially-adjacent pair of registers
connected by a signal path consisting of the combinational logic gates, G1
and G2 . In Figure 5.1, however, R3 , R1 is not a sequentially-adjacent pair of
registers.

5.2.1 Permissible Range of Clock Skew

The timing constraints of a local data path have been derived in Sections 4.7.1
through 4.8.2 for paths consisting of flip-flops and latches. The concept of clock
skew used in these timing constraints is formally defined next:
Definition 5.2. Clock skew. In a given digital synchronous circuit, the clock
skew TSkew (i, j) between the registers Ri and Rj is defined as the algebraic
difference,
TSkew (i, j) = t_cd^i − t_cd^j ,    (5.1)
where Ci and Cj are the clock signals driving the registers Ri and Rj , respec-
tively, and t_cd^i and t_cd^j are the delays of the clock signals Ci and Cj , respec-
tively.
In Definition 5.2, the clock delays, t_cd^i and t_cd^j , are with respect to an
arbitrary—but necessarily the same—reference point. A commonly used ref-
erence point is the source of the clock distribution network on the integrated
circuit. Note that the clock skew TSkew (i, j) as defined in Definition 5.2 obeys
the antisymmetric property,

TSkew (i, j) = −TSkew (j, i). (5.2)
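For illustration (with assumed clock delays): if the clock signal reaches Ri with delay
t_cd^i = 1.2 ns and Rj with delay t_cd^j = 0.8 ns, then TSkew (i, j) = +0.4 ns and, by (5.2),
TSkew (j, i) = −0.4 ns.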

Recall that the clock skew TSkew (i, j) as defined in Definition 5.2 is a com-
ponent in the timing constraints of a local data path [see inequalities (4.8),
(4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and (4.45)]. Therefore, the clock
skew TSkew (i, j) is defined and is of primary practical use for sequentially-
adjacent pairs of registers Ri ;Rj , that is, for local data paths.2
1
Propagating through a sequence of logic elements only.
2
Note that technically, TSkew (i, j) can be calculated for any ordered pair of regis-
ters Ri , Rj . However, the skew between a non-sequential pair of registers has no
practical value.

For notational convenience, clock skews within a circuit are frequently


denoted throughout this monograph with the small letter s with a single sub-
script. In such cases, the clock skew sk corresponds to a uniquely identified
local data path k within the circuit, where the local data paths have been
numbered 1 through a certain number p. In other words, the skew s1 corre-
sponds to the local data path one, the skew s2 corresponds to the local data
path two and so on.
Previous research [83, 88] has indicated that tight control over the clock
skews rather than the clock delays is necessary for the circuit to operate
reliably. Timing relationships similar to (4.8), (4.13), (4.23), (4.24), (4.29),
(4.39), (4.40) and (4.45) are used in [88] to determine a permissible range of
allowable clock skew for each signal path. The concept of a permissible range
for the clock skew sk of a data path Ri ;Rf is illustrated in Figure 5.2.

[The permissible range [lk , uk ] lies between the race condition region (negative skew below lk )
and the clock period limitation region (positive skew above uk ).]

Fig. 5.2. The permissible range of the clock skew of a local data path. A timing
violation exists if sk ∉ [lk , uk ].

Each signal data path has a unique permissible range associated with it.3
The permissible range is a continuous interval of valid skews for a specific path.
As suggested by the inequalities, (4.8), (4.13), (4.23), (4.24), (4.29), (4.39),
(4.40) and (4.45) and illustrated in Figure 5.2, every permissible range is de-
limited by a lower and upper bound of the clock skew. These bounds—denoted
by lk and uk , respectively—are determined based on the timing parameters
of the individual local data paths and the constraints to prevent timing vio-
lations discussed in Chapter 4. Note that the bounds lk and uk also depend
on the operational clock period for the specific circuit. When sk ∈ [lk , uk ]—
as shown in Figure 5.2—the timing constraints of this specific k-th local data
path are satisfied. The clock skew sk is not permitted to be in either the inter-
val (−∞, lk ) because a race condition will be created or the interval (uk , +∞)
because the minimum clock period will be limited.
3
Later in Section 5.2.2 it is shown that it is more appropriate to refer to the permis-
sible range of a sequentially-adjacent pair of registers. There may be more than
one local data path between the same pair of registers but circuit performance is
ultimately determined by the permissible ranges of the clock skew between pairs
of registers.

Furthermore, note that the reliability of a circuit is related to the prob-
ability of a timing violation occurring for any local data path Ri ;Rf . This
observation suggests that the reliability of any local data path Ri ;Rf of a
circuit (and therefore of the entire circuit) is increased in two ways:
1. by choosing the clock skew sk for the k-th local data path as far as possible
from the borders of the interval [lk , uk ], that is, by (ideally) positioning
the clock skew sk in the middle of the permissible range as sk = (lk + uk )/2,
2. by increasing the width (uk − lk ) of the permissible range of the local data
path Ri ;Rf .
Even if the clock signals can be delivered to the registers within a given circuit
with arbitrary delays, it is generally not possible to have all clock skews in
the middle of the permissible range as suggested above. The reason behind
this characteristic is that inherent structural limitations of the circuit create
linear dependencies among the clock skews within the circuit. These linear
dependencies and the effect of these dependencies on a number of circuit
optimization techniques are examined in detail in Chapter 7.
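To make the permissible range concrete, the short C++ sketch below (illustrative only—the
structure and names are assumed here and are not taken from the software of Section 5.7)
computes [lk , uk ] for a single local data path using the flip-flop bounds that appear later in
Table 5.1, lk = −D̂_Pm^{i,f} and uk = TCP − D̂_PM^{i,f} , and checks a candidate skew against the range:

#include <iostream>

// Illustrative sketch only: permissible range [l_k, u_k] of the clock skew of
// one local data path, following the flip-flop bounds of Table 5.1,
//   l_k = -D^_Pm^{i,f}   and   u_k = T_CP - D^_PM^{i,f}.
struct PathDelays {
    double longDelay;   // D^_PM^{i,f} (latest data arrival contribution)
    double shortDelay;  // D^_Pm^{i,f} (earliest data arrival contribution)
};

struct PermissibleRange {
    double lower, upper;  // [l_k, u_k]
};

PermissibleRange permissibleRange(const PathDelays& p, double clockPeriod) {
    return { -p.shortDelay, clockPeriod - p.longDelay };
}

bool skewIsPermissible(double skew, const PermissibleRange& r) {
    return r.lower <= skew && skew <= r.upper;
}

int main() {
    // Delay values taken from the example of Figure 5.6, with setup/hold times
    // and clock-to-output delays neglected purely for illustration.
    const double clockPeriod = 8.5;                 // ns
    PathDelays r2r3{8.0, 5.0};                      // logic between R2 and R3
    PermissibleRange range = permissibleRange(r2r3, clockPeriod);
    std::cout << "permissible range: [" << range.lower << ", " << range.upper << "] ns\n";
    std::cout << "zero skew valid:   " << std::boolalpha << skewIsPermissible(0.0, range) << '\n';
    std::cout << "midpoint skew:     " << 0.5 * (range.lower + range.upper) << " ns\n";
    return 0;
}

With the values of the R2 ;R3 path of Figure 5.6, the sketch reports the range [−5 ns, 0.5 ns]
and the midpoint −2.25 ns.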

5.2.2 Graphical Model of a Synchronous System

Many different fully synchronous digital systems exist. It is virtually impos-


sible to describe the variety of all past, current or future such systems de-
pending on the circuit manufacturing technology, design style, performance
requirements and multiple other factors. A system model of these fully syn-
chronous digital systems is required so that the system properties can be fully
understood and analyzed from the perspective of clock skew scheduling and
clock tree synthesis while permitting unnecessary details to be abstracted.4
In this section, a graphical model used to represent fully synchronous dig-
ital systems is introduced. The purpose of this model is twofold. First, the
model provides a common abstract framework for the automated analysis of
circuits by computers. Second, it permits a significant reduction of the size of
the data that needs to be stored in the computer memory when performing
analysis and optimization procedures on a circuit. This graph-based model can
be arrived at in a natural way by observing what constitutes relevant system
information (in terms of the clock skew scheduling problem). For example, it
is sufficient to know that a pair of registers Ri , Rj are sequentially-adjacent
whereas the specific functional information characterizing the individual logic
gates along the signal paths between Ri and Rj is not necessary.
Consider, for instance, the system shown in Figure 5.1. This system is
completely described (for the purpose of clock skew scheduling) by the timing
information describing the four registers, four logic gates, ten wires (nets) and
the connectivity of these wires to the registers and logic gates. Consider next
the abstract representation of this system shown in Figure 5.3. Note that the

4
As a matter of fact, the graph model described here is quite universal and can be
successfully applied for a variety of other different circuit analysis and optimiza-
tion purposes.

[Arcs connect the vertices R1 , R2 , R3 and R4 ; each arc is labeled with the logic gates along the
corresponding signal path.]

Fig. 5.3. A directed multi-graph representation of the synchronous system shown


in Figure 5.1. The graph vertices correspond to the registers, R1 , R2 , R3 and R4 ,
respectively.

Note that the registers, R1 through R4 , are represented by the vertices of the graph shown in


Figure 5.3. However, the logic gates and wires have been replaced in Figure 5.3
by arrows or arcs, representing the signal paths among the registers. The four
logic gates and ten nets in the original system have been reduced to only six
local data paths represented by the arcs in Figure 5.3. For clarity, each arc or
edge is labeled with the logic gates5 along the signal path represented by this
specific arc.
The type of data structure shown in Figure 5.3 is known as a multi-
graph [89] since there may be more than one edge between a pair of vertices
in the graph. In order to simplify data storage and the relevant analysis and
optimization procedures, this multi-graph is reduced to a simple graph [89]
model by imposing the following restrictions:6
• either one or zero edges can exist between any two different vertices of the
graph,
• there cannot be self-loops, that is, edges that start and end at the same
vertex of the graph,
• additional labels (or markings) of the edges are introduced in order to
represent the timing constraints of the circuit.
With the above restrictions, a formal definition of the circuit graph model is
as follows:
Definition 5.3. Circuit graph. A fully synchronous digital circuit C is rep-
resented as the connected undirected simple graph GC . The graph GC is the
ordered six-tuple GC = (V^(C) , E^(C) , A^(C) , h_l^(C) , h_u^(C) , h_d^(C) ), where
• V^(C) = {v1 , . . . , vr } is the set of vertices of the graph GC ,
• E^(C) = {e1 , . . . , ep } is the set of edges of the graph GC ,
• A^(C) = [a_{ij}^(C) ]_{r×r} is the symmetric adjacency matrix of GC .
Each vertex from V^(C) represents a register of the circuit C. There is exactly
one edge in E^(C) for every sequentially-adjacent pair of registers in C. The
mappings h_l^(C) : E^(C) → R and h_u^(C) : E^(C) → R to the set of real numbers R
assign the lower and upper permissible range bounds, lk , uk ∈ R, respectively,
for the sequentially-adjacent pair of registers indicated by the edge ek ∈ E^(C) .
The edge labeling h_d^(C) defines a direction of signal propagation for each edge
(vx , ez , vy ).
5
In the order in which the traveling signals pass through the gates.
6
Restrictions on the model itself and not on the ability of the model to represent
features of the circuits.
Note that in a fully synchronous digital circuit there are no purely combina-
tional signal cycles, that is, it is impossible to reach the input of any logic
gate Gk by starting at the output of Gk and going through a sequence of
combinational logic gates only [9, 90].
Naturally, all registers from the circuit C are preserved when constructing
the circuit graph GC as described in Definition 5.3—these registers are enu-
merated 1 through r and a vertex vi is created in the graph for each register Ri .
Alternatively, an edge between two vertices is added in the graph if there are
one or more local data paths between these two vertices. The self-loops are
discarded because the clock skew of these local data paths is always zero and
cannot be manipulated in any way.
The graph GC for any circuit C can be determined by either direct inspec-
tion of C or by first building the circuit multi-graph and then modifying the
multi-graph to satisfy Definition 5.3. Consider, for example, the circuit multi-
graph shown in Figure 5.3—the corresponding circuit graph is illustrated in
Figure 5.4. Observe the labels of the graph edges in Figure 5.4. Each edge

[The graph has vertices v1 , v2 , v3 and v4 ; the edges e1 (between v1 and v3 ), e2 (between v2
and v3 ) and e3 (between v3 and v4 ) are labeled with the permissible ranges [l1 , u1 ], [l2 , u2 ]
and [l3 , u3 ], respectively, and with direction arrows.]
Fig. 5.4. A graph representation of the synchronous system shown in Figure 5.1
according to Definition 5.3. The graph vertices v1 , v2 , v3 , and v4 correspond to the
registers, R1 , R2 , R3 and R4 , respectively.

is labeled with the corresponding permissible range of the clock skew for the
given pair of registers. An arrow is drawn next to each edge to indicate the
order of the registers in this specific sequentially-adjacent pair—recall that
the clock skew as defined in Definition 5.2 is an algebraic difference. As shown
in the rest of this section, either direction of an edge can be selected as long
as the proper choices of lower and upper clock skew bounds are made.
In most practical cases, a unique signal path (a local data path) exists
between a given sequentially-adjacent pair of registers Ri , Rj . In these cases,
the labeling of the corresponding edge is straightforward. The permissible
range bounds lk and uk are computed using (4.8), (4.13), (4.23), (4.24), (4.29),
(4.39), (4.40) and (4.45) and the direction of the arrow is chosen so as to
coincide with the direction of the signal propagation from Ri to Rj . With
these choices, the clock skew is computed as s = t_cd^i − t_cd^j . In Figure 5.4, for
example, the direction labels of both e1 and e2 can be chosen from v1 to v3
and from v2 to v3 , respectively.
Multiple signal paths between a pair of registers, Rx and Ry , require a
more complicated treatment. As specified before, there can be only one edge
between the vertices, vx and vy , in the circuit graph. Therefore, a methodology
is presented for choosing the correct permissible range bounds and direction
labeling for this single edge. This methodology is illustrated in Figure 5.5 and
is a two-step process. First, multiple signal paths in the same direction from

(a) Elimination of multiple edges: the parallel edges from vx to vy , with permissible
ranges [lz(1) , uz(1) ] through [lz(n) , uz(n) ], are replaced by a single edge whose range is the
intersection of these ranges.
(b) Elimination of a two-edge cycle: the opposing edges with ranges [lz , uz ] and [l′z , u′z ]
between vx and vy are replaced by a single edge with range [lz , uz ] ∩ [−u′z , −l′z ].
Fig. 5.5. Transformation rules for the circuit graph.

the register Rx to the register Ry are replaced by a single edge in the circuit
graph according to the transformation illustrated in Figure 5.5(a). Next, two-
edge cycles between Rx and Ry are replaced by a single edge in the circuit
graph according to the transformation illustrated in Figure 5.5(b).

In the former case [Figure 5.5(a)], the edge direction labeling is preserved
while the permissible range for the new single edge is chosen such that the
permissible ranges of the multiple paths from Rx to Ry are simultaneously
satisfied. As shown in Figure 5.5(a), the new permissible range [lz , uz ] is the
intersection of the multiple permissible ranges [lz , uz ] through [lz(n) , uz(n) ]
between Rx and Ry . In other words, the new lower bound is lz = max{lz(i) }
i
and the new upper bound is uz = min{uz(i) }.
i
In the latter case [Figure 5.5(b)], an arbitrary choice for the edge direc-
tion can be made—the convention adopted here is to choose the direction
towards the vertex with the higher index. For the vertex vy , the new per-
missible range has a lower bound lz = min(lz , −uz ) and an upper bound
uz = max(uz , −lz ). It is straightforward to verify that any clock skew
s ∈ [lz , uz ] satisfies both permissible ranges [lz , uz ] and [lz , uz ] as shown in
Figure 5.5(b). The process for computing the permissible ranges of a circuit
graph [using (4.8), (4.13), (4.23), (4.24) and (4.29)] and the transformations
illustrated in Figure 5.5 have linear complexity in the number of signal paths
since each signal path is examined only once.
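The two transformation rules can be captured in a few lines of code. The C++ sketch below
(assumed data structures, not the authors' implementation) merges parallel local data paths
and two-edge cycles into the single edge per register pair required by Definition 5.3,
intersecting the permissible ranges as illustrated in Figure 5.5:

#include <algorithm>
#include <map>
#include <utility>
#include <vector>

// Illustrative sketch of the transformation rules of Figure 5.5: every set of
// parallel local data paths and every two-edge cycle is merged into a single
// edge of the circuit graph, and the permissible ranges are intersected.
struct PathEdge {
    int src, dst;    // register (vertex) indices of the local data path src -> dst
    double l, u;     // permissible range [l, u] of the clock skew t_src - t_dst
};

std::vector<PathEdge> buildCircuitGraph(const std::vector<PathEdge>& paths) {
    std::map<std::pair<int, int>, PathEdge> merged;
    for (const PathEdge& p : paths) {
        if (p.src == p.dst) continue;                     // self-loops are discarded
        // Convention from the text: orient every edge towards the higher index.
        PathEdge e = (p.src < p.dst)
                         ? p
                         : PathEdge{p.dst, p.src, -p.u, -p.l};  // reversing maps [l,u] to [-u,-l]
        auto key = std::make_pair(e.src, e.dst);
        auto it = merged.find(key);
        if (it == merged.end()) {
            merged.emplace(key, e);
        } else {                                          // intersect the permissible ranges
            it->second.l = std::max(it->second.l, e.l);
            it->second.u = std::min(it->second.u, e.u);
        }
    }
    std::vector<PathEdge> edges;
    for (const auto& kv : merged) edges.push_back(kv.second);
    return edges;  // an empty range (l > u) marks an infeasible register pair
}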
Note that the terms, circuit and graph, are used throughout the rest of
this research monograph interchangeably to denote the same fully synchronous
digital circuit. Also, note that for brevity, the superscript (C) when referring
to the circuit graph GC of a circuit C is omitted for the rest of the monograph
unless a circuit is explicitly indicated. The terms, register and vertex, are used
interchangeably as are edge, local data path, arc and a sequentially-adjacent
pair of registers. On a final note, it is assumed that the graph of any circuit
considered in this work is connected. If this is not the case, each of the disjoint
connected portions of the graph (circuit) can be individually analyzed.

5.3 Clock Scheduling


The process of non-zero clock skew scheduling is discussed in this section. The
following substitutions are introduced for notational convenience:
Definition 5.4. Let C be a fully synchronous digital circuit and let Ri and Rf
be a sequentially-adjacent pair of registers, i.e., Ri ;Rf . The long path delay
D̂_PM^{i,f} of a local data path Ri ;Rf is defined as

D̂_PM^{i,f} = D_{CQM}^{F_i} + D_{PM}^{i,f} + δ_S^{F_f} + 2Δ_L^F ,                 if Ri , Rf are flip-flops,
D̂_PM^{i,f} = D_{CQM}^{L_i} + D_{PM}^{i,f} + δ_S^{L_f} + Δ_L^{L_i} + Δ_T^{L_f} ,   if Ri , Rf are latches.    (5.3)

Similarly, the short delay D̂_Pm^{i,f} of a local data path Ri ;Rf is defined as

D̂_Pm^{i,f} = D_{Pm}^{i,f} + D_{CQm}^{F_i} − δ_H^{F_f} − 2Δ_L^F ,                 if Ri , Rf are flip-flops,
D̂_Pm^{i,f} = D_{CQm}^{L_i} + D_{Pm}^{i,f} − δ_H^{L_f} − Δ_L^{L_i} − Δ_T^{L_f} ,   if Ri , Rf are latches.    (5.4)

Table 5.1. LP-model for clock skew scheduling of edge-sensitive circuits.


LP Model
min   TCP
s.t.  TSkew (i, f ) ≤ TCP − D̂_PM^{i,f}
      TSkew (i, f ) ≥ −D̂_Pm^{i,f}

Based on Definition 5.4, the timing constraints of a local data path Ri ;Rf
with flip-flops [(4.8) and (4.13)] are used to construct the linear programming
(LP) model for clock skew scheduling [2] shown in Table 5.1. The constraints
in Table 5.1 are the operating conditions for an edge-sensitive circuit:

TSkew (i, f ) ≤ TCP − D̂_PM^{i,f}    (5.5)

−D̂_Pm^{i,f} ≤ TSkew (i, f ).    (5.6)
+ δH . (5.6)

For a local data path Ri ;Rf consisting of the flip-flops, Ri and Rf , the setup
and hold time violations are avoided if (5.5) and (5.6), respectively, are sat-
isfied.
The clock skew TSkew (i, f ) of a local data path Ri ;Rf can be either pos-
itive or negative, as illustrated in Figures 4.13 and 4.14, respectively. Note
that negative clock skew may be used to effectively speed-up a local data
path Ri ;Rf by allowing an additional TSkew (i, f ) amount of time for the
signal to propagate from the register Ri to the register Rf . However, exces-
sive negative skew may create a hold time violation, thereby creating a lower
bound on TSkew (i, f ) as described by (5.6) and illustrated by l in Figure 5.2.
A hold time violation, as described in Chapter 4, is a clock hazard or a race
condition, also known as double clocking [2, 9]. Similarly, positive clock skew
effectively decreases the clock period TCP by TSkew (i, f ), thereby limiting the
maximum clock frequency and imposing an upper bound on the clock skew as
illustrated by u in Figure 5.2.7 In this case, a clocking hazard known as zero
clocking may be created [2, 9].
Examination of the constraints, (5.5) and (5.6), reveals a procedure for pre-
venting clock hazards. Assuming (5.5) is not satisfied, a suitably large value of
TCP can be chosen to satisfy constraint (5.5) and prevent zero clocking. Also
note that unlike (5.5), (5.6) is independent of the clock period TCP (or the
clock frequency). Therefore, TCP cannot be changed to correct a double clock-
ing hazard, but rather a redesign of the entire clock distribution network [83]
or a delay padding procedure onto the logic network [74] may be required.
7
Positive clock skew may also be thought of as increasing the path delay. In either
case, positive clock skew (TSkew > 0) increases the difficulty of satisfying (5.5).

Both double and zero clocking hazards can be eliminated if two simple
choices characterizing a fully synchronous digital circuit are made. Specifically,
if equal values are chosen for all clock delays, then the clock skew TSkew (i, f ) =
0 for each local data path Ri ;Rf ,

∀ Ri , Rf : t_cd^i = t_cd^f ⇒ TSkew (i, f ) = 0.    (5.7)

Therefore, (5.5) and (5.6) become

TSkew (i, f ) = t_cd^i − t_cd^f = 0 ≤ TCP − D̂_PM^{i,f}    (5.8)

−D̂_Pm^{i,f} ≤ 0 = TSkew (i, f ) = t_cd^i − t_cd^f .    (5.9)

Note that (5.8) can be satisfied for each local data path Ri ;Rf in a circuit
if a sufficiently large value—larger than the greatest value D̂Pi,fM in a circuit—
is chosen for TCP . Furthermore, (5.9) can be satisfied across an entire circuit
if it can be ensured that D̂Pi,fm ≥ 0 for each local data path Ri ;Rf in the
circuit. The timing constraints, (5.8) and (5.9), can be satisfied since choosing
a sufficiently large clock period TCP is always possible and D̂Pi,fm is positive
for a properly designed local data path Ri ;Rf . The application of this zero
clock skew methodology [(5.7), (5.8), and (5.9)] has been central to the design
of fully synchronous digital circuits for decades [9, 32, 91]. By requiring the
clock signal to arrive at each register Rj with approximately the same delay tjcd ,
these design methods have become known as zero clock skew methods.8
As shown by previous research [9, 81, 82, 83, 88, 92, 93], both double
and zero clocking hazards may be removed from a synchronous digital cir-
cuit even when the clock skew is non-zero, that is, TSkew (i, f ) = 0 for some
(or all) local data paths Ri ;Rf . As long as (5.5) and (5.6) are satisfied, a
synchronous digital system can operate reliably with non-zero clock skews,
permitting the system to operate at higher clock frequencies while removing
all race conditions.
The column vector of clock delays TCD = [t_cd^1 , t_cd^2 , . . . ]^T is called a clock
schedule [2, 9]. If TCD is chosen such that (5.5) and (5.6) are satisfied for
every local data path Ri ;Rf , TCD is called a consistent clock schedule. A
clock schedule that satisfies (5.7) is called a trivial clock schedule. Note that a
trivial clock schedule TCD implies global zero clock skew since for any i and
f , ticd = tfcd , thus, TSkew (i, f ) = 0.
An intuitive example of non-zero clock skew being used to improve the per-
formance and reliability of a fully synchronous digital circuit is shown in Fig-
ure 5.6. Two pairs of sequentially-adjacent flip-flops, R1 ;R2 and R2 ;R3 , are
shown in Figure 5.6, where both zero skew and non-zero skew situations are
illustrated in Figures 5.6(a) and 5.6(b), respectively. Note that the local data
paths made up of the registers, R1 and R2 and of R2 and R3 , respectively, are
connected in series (R2 being common to both R1 ;R2 and R2 ;R3 ). In each
of the Figures 5.6(a) and 5.6(b), the permissible ranges of the clock skew for

8
Equivalently, it is required that the clock signal arrive at each register at approx-
imately the same time.

[Figure 5.6 shows a three-register pipeline R1 → Logic (1 ns–2.5 ns) → R2 → Logic (5 ns–8 ns) → R3
operating with a clock period of 8.5 ns; the permissible ranges are [−1 ns, 6 ns] for R1 ;R2 and
[−5 ns, 0.5 ns] for R2 ;R3 .]
(a) The circuit operating with zero clock skew: every register receives the clock signal with the
same delay t, so Skew = 0 for both local data paths.
(b) The circuit operating with non-zero clock skew: R2 receives the clock signal with delay τ < t,
so Skew = t − τ > 0 for R1 ;R2 and Skew = τ − t < 0 for R2 ;R3 .
Fig. 5.6. Application of non-zero clock skew to improve circuit performance (a lower
clock period) or circuit reliability (increased safety margins within the permissible
range).

both local data paths, R1 ;R2 and R2 ;R3 , are lightly shaded under each cir-
cuit diagram. As shown in Figure 5.6, the target clock period for this circuit
is TCP = 8.5 ns.
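The permissible ranges shown in Figure 5.6 can be reproduced from the bounds of Table 5.1 if,
purely for illustration, the setup and hold times and the clock-to-output delays are neglected:
for R1 ;R2 , uk = TCP − D̂_PM ≈ 8.5 − 2.5 = 6 ns and lk = −D̂_Pm ≈ −1 ns; for R2 ;R3 ,
uk ≈ 8.5 − 8 = 0.5 ns and lk ≈ −5 ns.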
The zero clock skew points (Skew = 0) are indicated in Figure 5.6(a)—
zero skew is achieved by delivering the clock signal to each of the registers,
R1 , R2 and R3 , with the same delay t (symbolically illustrated by the buffers
connected to the clock terminals of the registers). Observe that while the
zero clock skew points fall within the respective permissible ranges, these zero

clock skew points are dangerously close to the lower and upper bounds of the
permissible range for R1 ;R2 and R2 ;R3 , respectively. A situation could be
foreseen where, for example, the local data path R2 ;R3 has a larger than
expected long delay (larger than 8 ns), thereby causing the upper bound of
the permissible range for R2 ;R3 to decrease below the zero clock skew point.
In this scenario, a setup violation will occur on the local data path R2 ;R3 .
Consider next the same circuit with non-zero clock skew applied to the
data paths, R1 ;R2 and R2 ;R3 , as shown in Figure 5.6(b). Non-zero skew is
achieved by delivering the clock signal to the register R2 with a delay τ < t,
where t is the delay of the clock signal to both R1 and R3 . By applying this
delay τ < t, positive (t − τ > 0) and negative (τ − t < 0) clock skews are
applied to R1 ;R2 and R2 ;R3 , respectively. The corresponding clock skew
points are illustrated in the respective permissible ranges in Figure 5.6(b).
Comparing Figure 5.6(a) to Figure 5.6(b), observe that a timing violation is
less likely to occur in the latter case. In order for the previously described
setup timing violation to occur in Figure 5.6(b), the deviations in the delay
parameters of R2 ;R3 would have to be much greater in the non-zero clock
skew case than in the zero clock skew case. Even if the precise target value of the
non-zero clock skew τ − t < 0 is not met exactly during the circuit design process,
the safety margin from the skew point to the upper bound of the permissible
range remains much greater than in the zero clock skew case.
Therefore, there are two identifiable benefits of applying non-zero clock
skew. First, the safety margins of the clock skew (that is, the distances be-
tween the clock skew point and the bounds of the permissible range) within the
permissible ranges of a data path can be improved. The likelihood of correct
circuit operation in the presence of process parameter variations and opera-
tional conditions is improved with these increased margins. In other words,
the circuit reliability is improved. Second, without changing the logic and cir-
cuit structure, the performance of the circuit can be increased by permitting
a higher maximum clock frequency (or lower minimum clock period). The
formulation of circuit timing constraints for different timing problems and
formulation of clock skew scheduling for different objectives are presented
in Section 5.4.
Friedman in 1989 first presented in [1] the concept of negative non-zero
clock skew as a technique to increase the clock frequency and circuit perfor-
mance across sequentially-adjacent pairs of registers. Soon afterwards in 1990,
Fishburn suggested an algorithm in [2] for computing a consistent clock sched-
ule that is nontrivial. It is shown in [1, 2] that by exploiting negative and pos-
itive clock skew within a local data path Ri ;Rf , a circuit can operate with
a clock period TCP less than the clock period achievable by a trivial (or zero
skew) clock skew schedule while satisfying the conditions specified by (5.5)
and (5.6). In fact, [2] determined an optimal clock schedule by applying lin-
ear programming techniques to solve for TCD so as to satisfy (5.5) and (5.6)
while minimizing the objective function Fobjective = min TCP .9
The process of determining a consistent clock schedule TCD can be consid-
ered as the mathematical problem of minimizing the clock period TCP under
the constraints, (5.5) and (5.6). However, there are important practical issues
to consider before a clock schedule can be properly implemented. A clock dis-
tribution network must be synthesized such that the clock signal is delivered
to each register with the proper delay so as to satisfy the clock skew sched-
ule TCD . Furthermore, this clock distribution network must be constructed so
as to minimize the deleterious effects of interconnect impedances and process
parameter variations on the implemented clock schedule. Synthesizing the
clock distribution network typically consists of determining a topology for the
network, together with the circuit design and physical layout of the buffers
and interconnect that make up a clock distribution network [9, 32].

5.4 Timing Constraints and Design Automation

Digital VLSI synchronous circuits are subject to different types of timing


analyses with regard to computing or analyzing their clock schedules. Tradi-
tional among these analyses are three different problems: clock period mini-
mization [2, 69, 72, 73, 94, 95, 96, 97, 98], clock period verification [69, 99, 100]
and circuit retiming [101, 102, 103, 104]. Clock period minimization is the
analysis of a synchronous circuit in order to solve for the minimum clock
period—the maximum operating frequency—of a synchronous circuit. Clock
period verification is the analysis to ensure that a synchronous circuit is fully-
operational for a given clock period. Clock period verification can also be used
to formulate the clock skew scheduling problem with the objective of improved
tolerance to process parameter variations for operation at a predetermined
clock period. Circuit retiming is the analysis of a synchronous circuit aiming
to achieve higher operating frequencies by modifying the circuit network.
Even though there are different types of timing analysis problems, the op-
eration of the synchronous circuit under scrutiny is identical in all cases (pos-
sibly except for retiming problems). Thus, in the formulation of the timing
analysis problem, a framework of constraints identifying synchronous circuit
operation is essential. The categorized set of constraints is verified at each
local data path of a circuit subject to a specific objective function, consti-
tuting the static timing analysis. The timing relationship constraints discussed
in Section 5.3 and Section 6.1 for flip-flop-based and latch-based circuits, re-
spectively, are the building blocks for the automation for these timing analysis
processes.

9
This LP problem model is presented in Table 5.1.

5.5 Structure of the Clock Distribution Network


A clock distribution network is typically organized as a rooted tree struc-
ture [9, 81, 105], as illustrated in Figure 5.7, and is often called a clock tree [9].
A circuit schematic of a clock distribution network is shown in Figure 5.7(a).
An abstract graphical representation of the tree structure in Figure 5.7(a) is
shown in Figure 5.7(b). The unique source of the clock signal is at the root of
the tree. This signal is distributed from the source to every register in the cir-
cuit through a sequence of buffers and interconnect. Typically, a buffer in the
network drives a combination of other buffers and registers in a VLSI circuit.
A network of wires connects the output of the driving buffer to the inputs of
these driven buffers and registers. An internal node of the tree corresponds
to a buffer and a leaf node of the tree corresponds to a register. There are N
leaves10 in the clock tree labeled F1 through FN , where leaf Fj corresponds to
register Rj . A clock tree topology that implements a given clock schedule TCD

(a) Circuit structure of the clock distribution network: the clock signal source drives a tree of
buffers, and each buffer drives a combination of other buffers and registers.
(b) Clock tree structure that corresponds to the circuit shown in (a): the root is the clock source,
the internal nodes are buffers and the leaves are registers.

Fig. 5.7. Tree structure of a clock distribution network.

must enforce a clock skew TSkew (i, f ) for each local data path Ri ;Rf of the
circuit in order to ensure that both (5.5) and (5.6) are satisfied.
10
The number of registers N in the circuit.

5.6 Solution of the Clock Tree Synthesis Problem


In this section, a solution to the topological synthesis problem [93, 106, 107]
is presented. The solution is based on the following assumption: the signal
propagation delay through a node and all of its descendant nodes is a con-
stant, denoted by Δb . Therefore, the propagation delay δj of the clock signal
from the clock source to the register Rj at depth bj is t_cd^j = δj = bj × Δb .
Note that Δb includes the delay through both a buffer and the interconnect
branches connected to the buffer output. There can be considerable difficulty
in practically achieving a constant Δb throughout all levels of the clock tree.
Therefore, new research should focus on removing this constraint by providing
variable branch delays.
After substituting δj = bj × Δb into (5.5) and (5.6), the necessary condi-
tions to avoid either clock hazard can be rewritten as follows:

−TSkew (i, f ) = (bf − bi )Δb > D̂_PM^{i,f} − TCP    (5.10)

TSkew (i, f ) = (bi − bf )Δb > −D̂_Pm^{i,f} .    (5.11)

Therefore, the problem of designing the topology of a clock distribution net-


work can be formulated as the optimization problem of minimizing the clock
period TCP subject to the constraints (5.10) and (5.11).
The quantities bi and bf are integers, since these terms denote the num-
ber of branches (buffers) from the root of the clock tree to a particular leaf
(register). In the general case, this optimization problem can be described as
a mixed-integer linear programming problem (since TCP can be any real pos-
itive number) and is difficult to solve. However, previous research has demon-
strated [108] that if a fixed value for the clock period TCP is chosen, the
problem changes as follows. Given a value for TCP , find a set of integers
{b1 , b2 , . . . , bi , . . .} such that

(bf − bi )Δb > D̂_PM^{i,f} − TCP

(bi − bf )Δb > −D̂_Pm^{i,f}    (5.12)

for every sequentially-adjacent pair of registers Ri ;Rf or determine that no


such set of integers exist. Once (5.12) has been solved for a particular cir-
cuit, a clock tree topology such as the network shown in Figure 5.7 can be
implemented.
Each register Ri of a circuit receives a clock signal from a leaf Fi of the
clock tree at a branching depth b = bi , where bi is the integer obtained from
solving (5.12). In addition, Leiserson and Saxe describe in [90] an algorithm for
efficiently solving similar optimization problems such as represented by (5.12).
The run time of this algorithm is O(V E), where V and E denote the number
of registers and the number of sequentially-adjacent pairs of registers, respec-
tively. This algorithm is applied in this synthesis methodology for constructing
the topology of the clock tree.
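As a side note (a standard reformulation, stated here for clarity): since the bi are integers, the
strict inequalities of (5.12) are equivalent to the difference constraints
bi − bf ≤ ⌈(TCP − D̂_PM^{i,f})/Δb ⌉ − 1 and bf − bi ≤ ⌈D̂_Pm^{i,f}/Δb ⌉ − 1, which is exactly the
form that a single-source shortest path (Bellman–Ford) computation over the constraint graph
can either solve or prove infeasible.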

The sequence of operations is as follows. A feasible range for the clock


period [Tmin , Tmax ] to be searched is determined initially—the bounds Tmin
and Tmax are determined as described in [88]. A binary search for the opti-
mal clock period Topt is then performed over the feasible range of the clock
period. This sequence of operations is presented in Algorithm 1. The feasible
range for the clock period [Tmin , Tmax ] to be searched is determined in lines 1
and 2. A binary search of the feasible clock period range is performed next
in lines 3 through 9. For each value of the clock period, (5.12) is solved in
line 5 to determine the feasibility of this current target value of the clock
period TCP . The binary search ends when the condition stated in line 4 is no
longer satisfied.

Algorithm 1 Compute clock schedule.


1: min ← Tmin
2: max ← Tmax
3: test ← (min + max)/2
4: while max − min > δ do
5: if (∃ feasible solution for TCP = test) then
6: max ← test
7: else
8: min ← test
9: end if
10: test ← (min + max)/2
11: end while
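The following self-contained C++ sketch combines Algorithm 1 with such a feasibility test. All
structures and names are assumed here for illustration—the programs described in Section 5.7
are considerably more elaborate—and the feasibility test solves the integer difference constraints
derived from (5.12) with a Bellman–Ford relaxation:

#include <cmath>
#include <vector>

// Sketch of Algorithm 1: a binary search over the clock period in which every
// feasibility test solves the difference constraints derived from (5.12).
struct LocalDataPath {
    int i, f;         // initial and final register indices, 0 .. N-1
    double dPM, dPm;  // long and short path delays D^_PM^{i,f} and D^_Pm^{i,f}
};

struct DiffConstraint { int u, v; long long w; };  // b_u - b_v <= w

// Feasible if and only if the constraint graph has no negative cycle.
bool feasible(int numRegisters, const std::vector<DiffConstraint>& cs) {
    std::vector<long long> dist(numRegisters, 0);  // implicit source at distance 0
    for (int pass = 0; pass < numRegisters; ++pass) {
        bool changed = false;
        for (const DiffConstraint& c : cs) {
            if (dist[c.v] + c.w < dist[c.u]) { dist[c.u] = dist[c.v] + c.w; changed = true; }
        }
        if (!changed) return true;                 // converged: a solution b = dist exists
    }
    for (const DiffConstraint& c : cs)             // still relaxing: negative cycle
        if (dist[c.v] + c.w < dist[c.u]) return false;
    return true;
}

// Integer difference constraints of (5.12) for a candidate clock period tCP:
//   b_i - b_f <= ceil((tCP - D^_PM)/Delta_b) - 1,  b_f - b_i <= ceil(D^_Pm/Delta_b) - 1.
std::vector<DiffConstraint> constraintsFor(const std::vector<LocalDataPath>& paths,
                                           double tCP, double deltaB) {
    std::vector<DiffConstraint> cs;
    for (const LocalDataPath& p : paths) {
        cs.push_back({p.i, p.f, static_cast<long long>(std::ceil((tCP - p.dPM) / deltaB)) - 1});
        cs.push_back({p.f, p.i, static_cast<long long>(std::ceil(p.dPm / deltaB)) - 1});
    }
    return cs;
}

// Algorithm 1: binary search for the smallest feasible clock period in [tMin, tMax].
double minimumClockPeriod(int numRegisters, const std::vector<LocalDataPath>& paths,
                          double tMin, double tMax, double deltaB, double delta = 0.01) {
    double lo = tMin, hi = tMax;
    while (hi - lo > delta) {
        double test = 0.5 * (lo + hi);
        if (feasible(numRegisters, constraintsFor(paths, test, deltaB))) hi = test;
        else lo = test;
    }
    return hi;
}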

After computing a clock schedule, a mapping M : tcd → B is produced


such that each clock delay tcd (i) is mapped to a non-negative integer num-
ber b(i) ∈ B = {1, 2, . . . , bmax }. The integer b(i) is the required depth of
the leaf in the clock tree driving the register Ri . Typically, bmax < NR ,
since there may be more than one register with the same value of the
required depth b. In addition, note that the set B can be redefined as
{1 + k, 2 + k, . . . , bmax + k} without affecting the validity of the solution (k
is any integer). For example, if the solution for a circuit with 10 registers is
b(1), . . . , b(10) = {3, 5, 8, 10, −2, 0, 0, 5, 5, 4}, this solution can be changed to
{5, 7, 10, 12, 0, 2, 2, 7, 7, 6} by adding two branches (or buffers) to each of the
numbers b(1) through b(10).
The clock distribution network is implemented recursively in the following
manner. An integer value called the branching factor f is initially chosen.
The branching factor determines the number of outgoing branches from each
node of the clock tree. By maintaining f constant throughout the clock tree,
the requirement for a constant Δb can be satisfied. A specific number of reg-
isters nj is driven at a specific depth b(j) of the clock tree. Therefore, at
least nj /f  buffers at depth b(j − 1) of the clock tree are required to drive
these nj registers at depth b(j). The number of buffers and branches in the

clock tree is determined by beginning at the bottom of the tree (those leaves
with the greatest depth) and recursively computing the number of buffers at
each preceding level.
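The level-by-level buffer count described above can be sketched as follows (illustrative only; the
buffer totals reported in Table 5.2 are produced by the authors' tool and may account for
additional details such as dummy loads):

#include <algorithm>
#include <map>
#include <vector>

// Given the required clock tree depth b(i) of every register and the branching
// factor f, the nodes at depth j are driven by ceil(n_j / f) buffers at depth j - 1.
int countClockTreeBuffers(const std::vector<int>& requiredDepths, int branchingFactor) {
    if (requiredDepths.empty() || branchingFactor < 2) return 0;
    // Shift all depths so that the shallowest leaf sits at depth 1.
    int minDepth = *std::min_element(requiredDepths.begin(), requiredDepths.end());
    std::map<int, int> nodesAtDepth;                    // registers plus buffers per level
    for (int d : requiredDepths) nodesAtDepth[d - minDepth + 1] += 1;
    int deepest = nodesAtDepth.rbegin()->first;
    int totalBuffers = 0;
    for (int depth = deepest; depth >= 1; --depth) {    // walk from the leaves to the root
        int nodes = nodesAtDepth[depth];
        int buffers = (nodes + branchingFactor - 1) / branchingFactor;  // ceil(nodes / f)
        totalBuffers += buffers;
        if (depth > 1) nodesAtDepth[depth - 1] += buffers;  // parents are nodes one level up
    }
    return totalBuffers;  // includes the buffer(s) driven directly by the clock source
}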

5.7 Software Implementation

The techniques for clock skew scheduling and clock distribution network syn-
thesis discussed in this chapter have been implemented as two separate com-
puter programs. The first program implements the problem of simultaneous
clock skew scheduling and clock tree synthesis as described by (5.12). This
program is described and results are presented in Section 5.7.1. A second
more exhaustive software implementation for clock skew scheduling only is
described in Section 5.7.2.

5.7.1 Simultaneous Clock Scheduling and Clock Tree Synthesis

The algorithm has been implemented in a 3,300-line program written in the
C++ high-level programming language. This program has been executed on
the ISCAS’89 suite of benchmark circuits. A simple delay model based on the
load of a gate is used to extrapolate the gate delays since these benchmark
circuits do not contain delay information. A summary of the results for the
benchmark circuits is shown in Table 5.2. These results demonstrate that by
applying the proposed algorithm to schedule the clock delays to each register,
up to a 64% decrease11 in the minimum clock period can be achieved for these
benchmark circuits while removing all race conditions. Note that due to the
relatively large number of buffers required in the clock tree, this approach is
only practical for circuits with a large number of registers.
Two example implementations of a clock tree topology with non-zero skew
are shown in Figures 5.8 and 5.9 for the benchmark circuits s1423 and s400,
respectively:
1. The clock tree topology shown in Figure 5.8 corresponds to the circuit
s1423 which contains N = 74 registers. The improvement of the mini-
mum achievable clock period TCP is 14% by applying the methodology
described in Section 5.6.
2. The clock tree topology shown in Figure 5.9 corresponds to the circuit
s400 which contains N = 21 registers. The improvement of the minimum
achievable clock period for this circuit when non-zero clock skew is applied
is 37%.

11
Compared to the minimum possible clock period if zero skew is used throughout
a circuit.

Table 5.2. ISCAS’89 suite of circuits. The name, number of registers, bounds of the
searchable clock period, optimal clock period (Topt ) and performance improvement
(in per cent) are shown for each circuit. Also shown in the last two columns labeled
B2 and B3 , respectively, are the number of buffers in the clock tree for f = 2 and
f = 3, respectively.

Circuit Regs Tmin Tmax Topt % Imp. B2 B3


s1196 18 7.80 20.80 13.00 17% 21 14
s13207 669 60.40 85.60 60.45 29% 681 348
s1423 74 75.80 92.20 79.00 14% 80 45
s1488 6 31.00 32.20 31.00 4% 5 4
s15850 597 83.60 116.00 83.98 28% 614 320
s208.1 8 5.20 12.40 5.48 56% 10 9
s27 3 5.40 6.60 5.40 18% 3 3
s298 14 9.40 13.00 10.48 19% 13 8
s344 15 18.40 27.00 18.65 31% 16 11
s349 15 18.40 27.00 18.65 31% 15 10
s35932 1728 34.20 34.20 34.20 0% 3457 2595
s382 21 8.00 14.20 8.88 37% 25 14
s38417 1636 42.20 69.00 42.82 38% 1647 832
s38584 1452 67.60 94.20 67.65 28% 1465 743
s386 6 17.00 17.80 17.80 0% 12 10
s400 21 8.40 14.20 8.88 37% 25 14
s420.1 16 5.20 16.40 7.45 55% 21 15
s444 21 8.40 16.80 10.17 39% 23 15
s510 6 14.80 16.80 15.20 10% 7 5
s526 21 9.40 13.00 10.48 19% 21 10
s526n 21 9.40 13.00 10.48 19% 21 10
s5378 179 20.40 28.40 22.29 22% 182 93
s641 19 71.00 88.00 71.03 19% 30 22
s713 19 79.20 89.20 72.23 19% 31 23
s820 5 19.20 19.20 19.20 0% 11 9
s832 5 19.80 19.80 19.80 0% 11 9
s838.1 32 5.20 24.40 8.76 64% 40 24
s9234.1 211 54.20 75.80 54.24 28% 220 113
s9234 228 54.20 75.80 54.24 28% 237 123
s953 29 16.40 23.20 18.96 18% 31 18

5.7.2 Clock Skew Scheduling

In this program implementation, only clock skew scheduling is implemented as


described in Sections 5.3 and 5.6. This implementation is targeted at commer-
cial integrated circuits for which accurate timing information can be obtained.
The program is written in the C++ high-level programming language and con-
sists of approximately 17,300 lines of code. This program has been demon-
strated on a commercial integrated circuit with 6,890 registers (a video-game

[Legend: dummy load; internal node (buffer); leaf (register).]

Fig. 5.8. Buffered clock tree for the benchmark circuit s1423. The circuit s1423 has
a total of N = 74 registers and the clock tree consists of 45 buffers with a branching
factor of f = 3.

controller) and some characterizing data is shown in Figure 5.12. The mini-
mum achievable clock period without clock skew scheduling is TCP = 14.8 ns
(= 67.5 MHz). After non-zero clock skew is applied to this circuit, the min-
imum achievable clock period with clock skew scheduling is TCP = 11.4 ns
(= 87.7 MHz) corresponding to a performance improvement of 23%.

Input File Format

The input to this program is a standard text file containing the timing in-
formation necessary to apply the clock scheduling algorithm to a fully syn-
chronous digital integrated circuit. This timing information characterizes the
minimum and maximum signal delay of each local data path and can be ob-
tained from the application of simulation tools known as static timing analyz-
ers. More accurate simulation methods—such as dynamic circuit simulation
(e.g., SPICE)—can be used to obtain highly accurate timing information for
relatively small circuits. A sample input file for the clock skew scheduling
program is shown in Figure 5.10. As shown in Figure 5.10, the input con-
sists of groups of information (lines 1-11 and 13-18 in Figure 5.10) enclosed

[Legend: dummy leaves (load); internal node (buffer); leaf (register).]

Fig. 5.9. Buffered clock tree for the benchmark circuit s400. The circuit s400 has
a total of N = 21 registers and the clock tree consists of 14 buffers with a branching
factor of f = 3.

in curly braces (the ‘{’ and ‘}’ symbols). Each line in a group describes
an instance of a register. The first line in a group describes a register Ri at
the beginning of a local data path Ri ;Rf . Each of the remaining lines of
a group describes a register Rf at the end of a local data path Ri ;Rf . In
the example shown in Figure 5.10, the registers Top/Block1/RegA[8]:sc and
TopA/Block1/RegA[7]:sc each describe the first register of a local data path
(lines 1 and 13, respectively).
Each register listed in the input file of the program consists of a sequence
of strings separated with slashes (the ’/’ character). These strings represent
the hierarchical name of the register in the design hierarchy. The register
on line 1, for example, is named RegA and is part of a design block named

1: {Top/Block1/RegA[8]:d1 2.781105e-04 5.243128e-01


2: _ _
3: 3.000000e-02 3.000000e-02
4: _ _
5: {Top/Block2/RegB[7]:d1 4.596487e-01 5.079964e-01
6: 4.596487e-01 5.079964e-01}
7: {Top/Block2/RegB[6]:d1 4.116543e-01 4.677776e-01}
8: {Top/Block2/RegB[8]:d1 4.224569e-01 4.813909e-01}
9: {Top/Block2/RegB[7]:d1 4.596487e-01 5.079964e-01
10: 4.596487e-01 5.079964e-01}
11: }
12:
13: {TopA/Block1/RegA[7]:D 5.195378e-01 5.195681e-01
14: _ _
15: 3.000000e-02 3.000000e-02
16: _ _
17: {Top/Block1/RegC[6]:da 4.116543e-01 4.677776e-01}
18: }

Fig. 5.10. Sample input for the clock scheduling program described in Section 5.7.2.

Block1, whereas the design block Block1 is part of the module called Top.
Finally, a register bit index may be appended at the end of a register name
for multi-bit registers12 and the data pin name is appended after the bit index
and separated with a colon ‘:’.
The description of the initial register of a local data path is followed by
eight (8) numbers which specify the timing information characterizing this
register. These numbers specify the minimum and maximum values of the
setup and hold times for the register for the rising and falling edges of the
clock signal. If a number is not available, an underscore ‘ ’ is substituted for
this missing data. The program determines the type of register by examining
both the missing and specified numbers describing the setup and hold times.
Returning to line 1 in Figure 5.10, the minimum and maximum setup times
for the rising edge of the clock signal are included while the minimum and
maximum setup times for the falling edge of the clock signal are absent (note
the underscores in line 2). A positive flip-flop has the setup and hold times
defined for the rising edge of the clock signal and, similarly, a negative latch
has the setup and hold times defined for the rising edge of the clock signal.
Since the register instance described by line 1 in Figure 5.10 has setup and
hold times defined only for the rising edge of the clock signal, this register
instance is either a positive-edge triggered flip-flop or a negative latch.

12
If the register is not a multi-bit register, this index is omitted.

As mentioned previously, each register instance in an input file describes


an initial register at the beginning of a local data path and is followed by
one or more register instances describing a final register at the end of a local
data path. For the example shown in Figure 5.10, there are four (4) local data
paths (lines 5 through 10) with an initial register described on line 1. Each
final register of a local data path (lines 5 through 10) consists of a register
name and is followed by the timing information describing the local data
path terminated by this specific register instance. This timing information
may contain two or four delay numbers depending upon whether the starting
register of the local data path is a flip-flop or a latch. The minimum (D_CQm^L
or D_CQm^F ) and maximum (D_CQM^L or D_CQM^F ) clock-to-output delays are the
first two numbers listed on line 5 and are present regardless of the type of
register (recall the description of latches and flip-flops in Sections 4.2 and 4.4,
respectively). An additional pair of delay numbers specifies the minimum and
maximum delays (D_DQm^L and D_DQM^L ) if the initial storage element of the local
data path is a latch (line 6 in Figure 5.10).
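A possible in-memory representation of one such group is sketched below (hypothetical type and
field names—the actual program uses its own internal data structures):

#include <optional>
#include <string>
#include <vector>

// A possible in-memory model of one group of the input file of Figure 5.10.
struct InitialRegister {
    std::string name;                                  // e.g. "Top/Block1/RegA[8]:d1"
    // Minimum/maximum setup and hold times for the rising and falling clock
    // edges; a field left unset corresponds to an underscore in the file.
    std::optional<double> setupRise[2], setupFall[2];
    std::optional<double> holdRise[2], holdFall[2];
};

struct LocalDataPathRecord {
    std::string finalRegisterName;                     // e.g. "Top/Block2/RegB[7]:d1"
    double clockToOutputMin = 0.0, clockToOutputMax = 0.0;     // D_CQm, D_CQM
    std::optional<double> dataToOutputMin, dataToOutputMax;    // D_DQm, D_DQM (latches only)
};

struct InputGroup {
    InitialRegister initial;                           // first line of the group
    std::vector<LocalDataPathRecord> paths;            // remaining lines of the group
};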

Output File Format

The output of the clock skew scheduling program is a standard text file. A
sample output is shown in Figure 5.11. Each line in the output consists of the
full hierarchical name of a register Rj and the value of the delay tjcd of the clock
signal to the register Rj . Recall that it is not the clock delays to the individual

1: Top/Block1/Reg1[7] 3.479695
2: Top/Block1/Reg143 2.814349
3: Top/Block1/Reg26[0] 2.159099
4: Top/Block1/Reg33A 3.479695
5: Top/Block1/Reg33B 3.479695
6: Top/Block1/reg_2a 3.479695
7: Top/Block1/reg_2 3.052987
8: Top/Block1/Reg271 2.541613
9: Top/Block1/Reg12 1.871610

Fig. 5.11. Sample output for the clock scheduling program described in Sec-
tion 5.7.2.

registers that are important but rather the difference be-
tween the clock delays—the clock skew TSkew —to each sequentially-adjacent
pair of registers that matters.
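For example, if the two registers listed on lines 1 and 2 of Figure 5.11 happened to form a
sequentially-adjacent pair (an assumption made here only for illustration), the implemented
clock skew between them would be 3.479695 − 2.814349 = 0.665346 in the time units of the
output file.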

Experimental Results

Two histograms are shown in Figure 5.12 which illustrate the effects of non-
zero clock skew on the circuit path delays. The distribution of the path delay $\hat{D}_{PM}^{i,f}$ is shown in Figure 5.12(a). With clock scheduling (non-zero clock skew) applied, the effective path delay of each path $R_i \rightsquigarrow R_f$ is increased or decreased$^{13}$ by the amount of clock skew scheduled for that path. This effective
path delay distribution is shown in Figure 5.12(b). Note that the net effect
of clock skew scheduling is a ‘shift’ of the path delay distribution away from
the maximum path delay [from right to left in Figure 5.12(b)]. There are two
beneficial effects of that shift of delay in that either the circuit can be run
at a lower clock period (or higher clock frequency) or the circuit can operate
at the target clock period with a reduced probability of setup and hold time
violations (improving the overall system reliability).

13 As described previously in this chapter, clock skew can be thought of as adding to (or subtracting from) the path delay.

[Figure 5.12: two histograms of Number of Paths (#) versus Maximum Path Delay (fs), over the range 0 fs to 14,832,240 fs, with peak bin counts of 615 paths in (a) and 851 paths in (b).]
(a) Path delay distribution with zero skew (before clock skew scheduling is applied)
(b) Path delay distribution after non-zero clock skew is applied

Fig. 5.12. The application of clock skew scheduling to a commercial integrated circuit with 6,890 registers [note that the time scale is in femtoseconds, 1 fs = $10^{-15}$ sec = $10^{-6}$ ns].
6 Clock Skew Scheduling of Level-Sensitive Circuits

Level-sensitive circuits are gaining popularity in state-of-the-art high-performance synchronous circuit design due to their smaller size, lower power
consumption and faster operation speeds [109, 110, 111]. The timing analysis
of level-sensitive circuits, however, is more difficult due to the non-linearity
of the timing constraints caused by the transparent latch operation discussed
in Section 4.2. Traditionally, the non-linearity of the constraints has been resolved with one of two approaches. On one hand, analyses that accurately model the effects of time borrowing have been considered overly optimistic, and this property is disregarded entirely in the analysis [69, 72]. More recently, the non-linear constraints of operation have been relaxed using iterative solution techniques [73, 94, 99, 100]. The iterative solution techniques are practical for timing analysis where the clock skew values (zero or non-zero) are known. However, these techniques are not applicable to clock skew scheduling computation. In
this chapter, a linear programming (LP) formulation applicable to the timing
analysis of large-scale level-sensitive synchronous circuits is presented. The
presented LP formulation accurately models the effects of time borrowing.
This LP formulation is computationally efficient due to the linearization of
non-linear constraints, and the formulation and solution processes are fully-
automated.

6.1 Clock Scheduling for Level-Sensitive Circuits


The process of clock skew scheduling for level-sensitive circuits is governed by
the timing relationships defined for local data paths of latches. As discussed
in Section 5.3, the long and short data path delays defined for local data paths
composed of latches are different than those defined for the more traditional
local data paths that are composed of flip-flops. Consequently, the timing
relationships of local data paths composed of (level-sensitive) latches need to
be integrated into the design framework to define the clock skew scheduling
process of level-sensitive circuits.

The timing relationships for local data paths with latches are categorized into two sets: operational constraints and constructional constraints. The operational constraints model the operation of a level-sensitive synchronous circuit, while the constructional constraints are defined to ensure the correctness and completeness of the formulation of the proposed timing analysis problem. The definitions of the operational constraints—called the latching, synchronization and propagation constraints, respectively—are derived from the zero clock skew definitions in [69]. The latching, synchronization and propagation constraints for a single-phase synchronization system are described in Sections 6.1.1, 6.1.2 and 6.1.3, respectively. The constructional constraints, called the validity and initialization constraints, are presented in Sections 6.1.4 and 6.1.5, respectively.

6.1.1 Latching Constraints

Latching constraints bound the arrival time of the data signal Df (recall the
local data path in Figure 4.15 on page 61) in order to ensure that Df is latched
during the intended clock cycle.
The interval for the data arrival time is characterized by the hold time
and the setup time requirements of Rf as follows:
$\delta_H^{L_f} \le a_f$   (6.1)
$A_f \le T_{CP} - \delta_S^{L_f}$.   (6.2)

Eq. (6.1) constrains the earliest arrival of Df at Rf . The earliest data arrival
time must be no earlier than hold time after the trailing edge of the previous
clock cycle. Suppose the (k + 1)-th clock cycle at latch Rf is illustrated in Fig-
ure 4.4 on page 46, where $t_1 = t_{cd}^{f} + kT_{CP}$ [zero in the frame of reference of the (k + 1)-th cycle]. The hold time is defined by the difference $t_7 - t_6$. If data
arrives at Rf earlier than the hold time, a double-clocking hazard occurs.
Similarly, (6.2) represents the setup constraint on Rf . As shown in Fig-
ure 4.4, the data must arrive at the final latch at least setup time prior
to the trailing edge of the clock cycle. Assuming the (k + 1)-th clock cycle is illustrated in Figure 4.4, the trailing edge of the clock cycle occurs at $t_6 = t_{cd}^{f} + kT_{CP} + C_W^{L}$. Thus, data cannot be latched into Rf during the (k + 1)-th cycle if the data arrives later than $t_5 = t_{cd}^{f} + kT_{CP} + C_W^{L} - \delta_S^{L_f}$. Late arrival of the data signal results in a zero clocking hazard.
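A direct transcription of these two bounds is sketched below (Python, a minimal illustration; the variable names mirror the notation of (6.1) and (6.2) and the numerical values are arbitrary).

def latching_ok(a_f, A_f, T_CP, setup_f, hold_f):
    # (6.1): the earliest arrival must not precede the hold time (no double clocking).
    no_double_clocking = a_f >= hold_f
    # (6.2): the latest arrival must precede the trailing edge by at least the setup time (no zero clocking).
    no_zero_clocking = A_f <= T_CP - setup_f
    return no_double_clocking and no_zero_clocking

print(latching_ok(a_f=0.5, A_f=3.8, T_CP=4.1, setup_f=0.2, hold_f=0.1))  # True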

6.1.2 Synchronization Constraints

Synchronization constraints define the departure time of the data signal Qi


from the initial latch of a local data path. The departure time from a latch

[Figure 6.1: timing diagrams of Cases I through VIII, each spanning the k-th clock cycle from $t_{cd}^{i} + (k-1)T_{CP}$ to $t_{cd}^{i} + kT_{CP}$.]

Fig. 6.1. Possible cases for the arrival and departure times of data at the initial latch.

depends on the state of the latch—transparent or opaque. Implementation-specific register internal delays, $D_{DQ}$ and $D_{CQ}$, affect the departure times in the transparent and opaque states of operation, respectively. The earliest departure time $d_i$ of Qi from Ri is defined in (6.3). The latest departure time $D_i$ is defined by (6.4):

$d_i = \max\left( a_i + D_{DQm}^{L_i},\; T_{CP} - C_W^{L} + D_{CQm}^{L_i} \right)$,   (6.3)
$D_i = \max\left( A_i + D_{DQM}^{L_i},\; T_{CP} - C_W^{L} + D_{CQM}^{L_i} \right)$.   (6.4)

An exhaustive inspection of all possible cases of earliest and latest de-


parture times during the k-th clock cycle is shown in Figure 6.1. The time
intervals for the arrival and departure times are illustrated by the upper and
lower parallel dotted lines, respectively. The left and right ends of these dot-
ted lines in the figure correspond to earliest and latest times, respectively.
The lengths of the white and black rectangular boxes correspond to the clock-
to-output and data-to-output latch delays, respectively. Note that cases V
through VIII may exhibit timing hazards.
Consider (6.3), which describes the earliest departure time of the data signal Qi from latch Ri. The first term of the max function, $a_i + D_{DQm}^{L_i}$, describes the case in which the earliest input data arrival occurs during the active phase of the clock signal Ci. The data signal immediately propagates through the latch (as illustrated in cases I and VIII of Figure 6.1). In these cases, the earliest departure time di from Ri depends on the earliest arrival time ai of the data signal and the time $D_{DQ}^{L_i}$ it takes for the data to appear at the output terminal of Ri.
The second term of the max function, $T_{CP} - C_W^{L} + D_{CQm}^{L_i}$, refers to the case in which the earliest data arrival occurs during the opaque phase of Ri. In the opaque phase of operation, the departure of the data signal from the initial latch occurs a clock-to-output delay $D_{CQ}^{L_i}$ later than the leading edge of the clock signal. Such data propagation is illustrated in cases II–VII of Figure 6.1. The max function combines these cases to define the earliest departure time di from the initial latch Ri. Similar reasoning applies to the derivation of the latest departure time Di defined by (6.4).
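The two cases combined by the max function can be evaluated directly, as in the short sketch below (Python; zero internal delays and a 50% duty cycle are chosen only to match the s27 example discussed later in Section 6.4.1).

def departure_times(a_i, A_i, T_CP, C_WL, D_DQm=0.0, D_DQM=0.0, D_CQm=0.0, D_CQM=0.0):
    # Earliest departure (6.3): data-limited (transparent phase) or clock-limited (opaque phase).
    d_i = max(a_i + D_DQm, T_CP - C_WL + D_CQm)
    # Latest departure (6.4): same structure with the latest arrival and maximum delays.
    D_i = max(A_i + D_DQM, T_CP - C_WL + D_CQM)
    return d_i, D_i

# With T_CP = 4.1, a 50% duty cycle (C_WL = 2.05) and zero internal delays,
# the departure times of latch R1 in the s27 example are reproduced:
print(departure_times(a_i=0.75, A_i=2.05, T_CP=4.1, C_WL=2.05))  # (2.05, 2.05)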

6.1.3 Propagation Constraints

Propagation constraints define the arrival time of the data signal Df at the
final latch Rf of a local data path. These constraints are as follows:
 
$a_f = \min_i \left( d_i + \hat{D}_{Pm}^{i,f} + T_{Skew}(i, f) - T_{CP} \right)$   (6.5)
$A_f = \max_i \left( D_i + \hat{D}_{PM}^{i,f} + T_{Skew}(i, f) - T_{CP} \right)$.   (6.6)

For each incoming path to latch Rf, the lower bound for af is individually calculated using the expression $d_i + \hat{D}_{Pm}^{i,f} + T_{Skew}(i, f) - T_{CP}$. The minimum of the arrival times among the incoming data paths is assigned as the earliest arrival time at Rf. The latest arrival time Af for the data signal is defined similarly. In the case of multiple data paths fanning into Rf, the maximum of the arrival times among the incoming data paths is the latest arrival time of the data signal at Rf. These two facts are captured in the formulation by the min and max functions in (6.5) and (6.6), respectively.
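For a latch with several fan-in paths, these definitions can be evaluated as in the following sketch (Python; the path data are hypothetical and serve only to illustrate the min/max selection).

def arrival_times(fanin_paths, T_CP):
    # Each fan-in path R_i ~> R_f is described by (d_i, D_i, D_Pm, D_PM, skew_if).
    a_f = min(d_i + D_Pm + skew - T_CP for (d_i, _, D_Pm, _, skew) in fanin_paths)   # (6.5)
    A_f = max(D_i + D_PM + skew - T_CP for (_, D_i, _, D_PM, skew) in fanin_paths)   # (6.6)
    return a_f, A_f

# Two hypothetical incoming paths:
paths = [(2.05, 2.05, 5.0, 7.0, 0.0),
         (2.05, 2.05, 3.0, 4.0, 0.5)]
print(arrival_times(paths, T_CP=4.1))  # (1.45, 4.95)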
The propagation constraints are illustrated on a sample synchronous cir-
cuit in Figure 6.2. Note that in Figure 6.2, two local data paths starting at the
latches Ri1 and Ri2 and ending at Rf are considered. The time intervals for the
arrival and departure times of the data signal are illustrated by the upper and
lower parallel dotted lines, respectively. The lengths of the white and black
rectangular boxes correspond to the clock-to-output and data-to-output latch
delays, respectively. The earliest arrival time is illustrated on the data path $R_{i_1} \rightsquigarrow R_f$. The data signal departs from $R_{i_1}$ at time $d_{i_1}$ and propagates on the data path $R_{i_1} \rightsquigarrow R_f$ for a time period of $\hat{D}_{Pm}^{i_1,f}$. The earliest data arrival time $d_{i_1} + \hat{D}_{Pm}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP}$ observed on this data path is earlier than the arrival time $d_{i_2} + \hat{D}_{Pm}^{i_2,f} + T_{Skew}(i_2, f) - T_{CP}$ observed on the only other incoming path to Rf, $R_{i_2} \rightsquigarrow R_f$. Hence, the earliest data arrival time af at Rf is
[Figure 6.2: timing diagram of the clock signals $C_{i_1}$, $C_f$ and $C_{i_2}$ over the k-th and (k + 1)-th clock cycles, showing the arrival and departure intervals, the path delays $\hat{D}_{Pm}^{i_1,f}$, $\hat{D}_{PM}^{i_1,f}$, $\hat{D}_{Pm}^{i_2,f}$, $\hat{D}_{PM}^{i_2,f}$, and the skews $T_{Skew}(i_1, i_2) < 0$, $T_{Skew}(i_1, f) > 0$, $T_{Skew}(i_2, f) > 0$.]

Fig. 6.2. Propagation of the data signal in a simple circuit.

defined by the propagation on the $R_{i_1} \rightsquigarrow R_f$ data path. Similarly, on the data path $R_{i_2} \rightsquigarrow R_f$, a maximum data propagation time of $\hat{D}_{PM}^{i_2,f}$ elapses, yielding the latest data arrival time at Rf, $A_f = D_{i_2} + \hat{D}_{PM}^{i_2,f} + T_{Skew}(i_2, f) - T_{CP}$.
The departure of Qi and the arrival of Df must occur during two consecutive clock cycles for proper circuit operation. In order to switch between the frames of reference of these two cycles, the phase shift operator $\phi_{if}$ is used. The phase shift operator evaluates to $\phi_{if} = T_{CP}$ for single-phase synchronization, as discussed in Section 4.9. Thus, the clock period TCP is subtracted from the calculated arrival time in order to shift the point of reference of the data arrival time at Rf to the beginning of the previous clock cycle.

6.1.4 Validity Constraints

The definitions of the parameters af , Af , df and Df require the value of af (df )


to be smaller than or equal to the value of Af (Df ):

Af ≥ af (6.7)
Df ≥ df . (6.8)

While the operational constraints introduced in the preceding sections model


the timing properties of the circuit, the required ordering in time of the variables involved is not explicitly enforced. Consistency in the definitions of af, Af, df and Df must be maintained through post-solution checks or by
including additional constraints. A solution leading to a result where af > Af ,
for instance, is incorrect and must be discarded.
Introducing the validity constraints [(6.7) and (6.8)] in the problem for-
mulation is preferred over performing post-solution checks for two primary
reasons. The first reason is to gain the ability to easily detect the feasibility of
the problem. The second reason is to preserve the automation of the solution
procedure.

6.1.5 Initialization Constraints


The LP model clock skew scheduling problem is formulated in order to min-
imize the clock period of a synchronous circuit. Besides the minimum clock
period, it may also prove essential to accurately calculate the nominal data
arrival and departure times for each register. The initialization constraints
are introduced in order to fulfill this purpose, by leading to a consistent tim-
ing schedule for the data signal propagation in a level-sensitive synchronous
circuit.
After clock skew scheduling, the feasible (or optimal) solution set for one
or more variables can be a range of values rather than a specific value. For
instance, suppose that the earliest arrival time of a data signal at an arbitrary
latch Rk can get any value in the interval 1.8 ≤ ak ≤ 2.3 without changing
the minimum clock period of the circuit. For consistency, it is preferable to
assign the smallest value to the earliest arrival time (ak = 1.8). In general,
it is better to assign the smallest possible values to the earliest arrival and
departure time variables and the largest possible values to the latest arrival
and departure time variables (where applicable). Such assignment provides
a more comprehensive representation of data propagation (and sensitivity
information [112]) in the system. Identification of the sensitivity information
is useful to check for the consistency of the timing schedule generated by the
LP problem (if necessary) as will be briefly discussed in Section 11.1.2.
Note that the earliest and latest data arrival times at all registers, except
for the input registers, are set to their lowest and highest possible values,
respectively. These assignments are enforced by the propagation constraints
[(6.5) and (6.6)]. The values assigned to the earliest and latest data arrival
times (a, A) at the input registers do not affect the minimum clock period
unless the assigned values cause the departure times to change. It may even
be considered redundant to define earliest and latest arrival time variables
(a, A) at the input registers as the non-local data paths do not affect the
circuit timing directly. For consistency and completeness of the generated
timing schedule, the data arrival times at the input registers are defined and
the following constraints are included in the LP formulation for each input
register Rl:

$A_l = d_l - (D_{CQ}^{L_l} \text{ or } D_{DQ}^{L_l}), \quad \forall R_l : |\mathit{Fan\text{-}in}(R_l)| = 0$.   (6.9)

6.2 Iterative Approach to Clock Skew Scheduling

The operational constraints provide a system of equations defining the timing


operation of a level-sensitive synchronous circuit. Different versions of the
constraints presented in Sections 6.1.1, 6.1.2 and 6.1.3 have been used by
designers in order to develop timing analysis models for zero clock skew, level-
sensitive circuits. The set of constraints initially defined for the clock period
minimization problem of a conventional zero clock skew problem in [69] is
known as the SMO formulation [75].
A popular timing analysis approach for level-sensitive circuits is presented
in [72, 73, 75, 94] based on the SMO formulation. This timing analysis ap-
proach involves several algorithms targeting clock period verification and min-
imization problems, all based on the analytical framework described in Sec-
tions 6.1.1, 6.1.2 and 6.1.3. The proposed algorithms are iterative algorithms.
In particular, very small values are assigned to the timing variables of a circuit
and the circuit is investigated for timing violations by iteratively increment-
ing the values of the timing variables. It is important to note that the clock
delay values $t_{cd}^{i}$ and consequently the clock skew values TSkew(i, f) are pre-
determined numerical values in these algorithms. Thus, these iteration-based
algorithms do not support clock skew scheduling.
The iterative algorithm proposed in [73] for the clock period minimization
problem of level-sensitive circuits is presented in Figure 6.3. In the algorithm,
r is the number of registers in the synchronous circuit. The a, d, A and D
vectors are the earliest arrival/departure and latest arrival/departure times,
respectively, where the superscript prev identifies the value of a variable in the
previous clock cycle. The variables SetupVio and HoldVio hold the timing
violation information for each register. In this algorithm, the arrival times
are initialized to ai = Ai = −∞, where the algorithm simulates the start-up
timing of the circuit. At each iteration step, the execution of the circuit at a
clock cycle is simulated. Finally, once the arrival and departure times of the
latches are determined, the algorithm checks for potential setup and hold time
violations.
The algorithm presented in Figure 6.3 has been shown to converge to solu-
tions relatively quickly [73]. The algorithm complexity is reported as O(|r||p|),
where |r| is the number of latches in a circuit and |p| is the number of edges of
a circuit graph (recall from Section 5.2.2 that the number of edges of a circuit graph
is the number of local data paths). However, it has been proved in [72] that
in case of data-path loops (sequential feedback) in the synchronous circuit,
the arrival and departure times might increase without bound. This leads to
a setup violation and the described algorithm fails to provide reasonable run
times. In [94], a correction is offered to the algorithm. This correction is based
on the assumption that a data path loop in the circuit can be detected in |r| iterations. Thus, the algorithm is modified to artificially limit the number of iteration steps to |r|. The complexity of the modified algorithm is cubic in the number of registers r, as each iteration in-

// Initialize the latch arrival times
for i = 1 to |r| {
    A_i^prev = a_i^prev = -∞;
};
// iterate the evaluation of the departure and arrival time
// equations until convergence or a maximum of |r| iterations
iter = 0;
repeat {
    iter = iter + 1;
    // update the latch departure times based on the latch
    // arrival times computed in the previous iteration
    for i = 1 to |r| {
        D_i = max( A_i^prev , φ_i + D_i );
        d_i = max( a_i^prev , φ_i + d_i );
    };
    // update the latch arrival times based on the just-computed
    // latch departure times
    for i = 1 to |r| {
        A_i = max_j ( D_j + D_PM );
        a_i = min_j ( d_j + D_Pm );
    };
} until ( ( ( A_i = A_i^prev ) && ( a_i = a_i^prev ) ) || ( iter + 1 > |r| ) );
// check and record setup and hold violations
for i = 1 to |r| {
    SetupVio[i] = A_i > T_CP - δ_S^{L_i} + d_i ;
    HoldVio[i] = a_i < δ_H^{L_i} + D_i ;
};

Fig. 6.3. The iterative algorithm for static timing analysis of level-sensitive circuits.

volves examining up to |p| edges, and |p| is at most $|r|^2$. The iterative algorithm
presented in Figure 6.3 is later modified to account for more advanced tim-
ing features or data models, such as for crosstalk [100] and statistical timing
analysis [113].
Although the iterative algorithm provides an initial and useful formula-
tion for the timing analysis of level-sensitive circuits, it does not constitute a
framework amenable to general timing analysis problems or clock skew schedul-
ing.
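For reference, a compact Python transcription of this type of iteration is sketched below. It follows the structure of Figure 6.3 but evaluates the constraint equations (6.3)–(6.6) directly, assumes zero setup, hold and internal latch delays, and omits the violation checks; it is an illustration of the fixed-point iteration, not the published implementation.

import math

def iterate_timing(registers, fanin, T_CP, C_WL):
    # registers: list of register names; fanin: dict mapping a final register f to a
    # list of tuples (i, D_Pm, D_PM, skew_if) describing its incoming local data paths.
    a = {r: -math.inf for r in registers}   # earliest arrival times
    A = {r: -math.inf for r in registers}   # latest arrival times
    for _ in range(len(registers)):         # at most |r| iterations, as in [94]
        d = {r: max(a[r], T_CP - C_WL) for r in registers}   # (6.3) with zero internal delays
        D = {r: max(A[r], T_CP - C_WL) for r in registers}   # (6.4) with zero internal delays
        a_new, A_new = dict(a), dict(A)
        for f, paths in fanin.items():
            a_new[f] = min(d[i] + Dm + s - T_CP for (i, Dm, _, s) in paths)   # (6.5)
            A_new[f] = max(D[i] + DM + s - T_CP for (i, _, DM, s) in paths)   # (6.6)
        if a_new == a and A_new == A:       # convergence of the arrival times
            break
        a, A = a_new, A_new
    return a, A

# Example (hypothetical data): a single loop R1 -> R2 -> R1 with unit path delays and zero skew.
# print(iterate_timing(["R1", "R2"],
#                      {"R2": [("R1", 1.0, 1.0, 0.0)], "R1": [("R2", 1.0, 1.0, 0.0)]},
#                      T_CP=4.0, C_WL=2.0))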

6.3 Linearization of the Timing Analysis


The non-linear max and min functions in the constraints shown in (6.3), (6.4),
(6.5) and (6.6) present a major challenge in solving the clock skew scheduling

problem. A method is introduced in this chapter to replace the non-linear constraints with linear constraints. Although the two formulations are not theoretically equivalent, it is demonstrated through experimentation with the ISCAS'89 benchmark circuits that the same results are obtained with the original non-linear programming (NLP) model and the novel linear programming (LP) model.
The proposed linearization method is described in Section 6.3.1. The LP
model for the clock period minimization problem of non-zero clock skew, level-
sensitive circuits is offered in Section 6.3.2.

6.3.1 Modified Big M (MBM) Method

The linearization of the constraints which exhibit non-linear behavior is a


commonly applied procedure in operations research [112]. When possible, non-
linear constraints are manipulated to derive linear constraints, which are in-
herently easier to solve. In this work, a collection of linearization procedures
is applied to the non-linear constraints of the timing analysis problem. The
collection of these procedures is called the Modified big M (MBM) method. It
is considered reasonable to denominate the collection of linearization proce-
dures the MBM method, as the research is developed by an inspiration from
the “big M method” [112]. The big M method is a special case of the simplex
algorithm [112] which has applications in a completely distinct set of prob-
lems with respect to the MBM method. The only similarity between the big M
method and the MBM method is the use of the constant M in both methods.
The constant M symbolically represents a sufficiently large positive number
used to assign an overwhelmingly large penalty to a variable in the objective
function in order to increase the priority of the variable in the optimization
process.
The collection of linearization procedures composing the MBM method
is presented in Table 6.1. For a minimization type LP problem—subject to
constraints that have min and max functions—the transformations listed in
Table 6.1 are applied to replace non-linear constraints with linear constraints.
Note that only relevant constraints and relevant terms of the objective func-
tion are included in Table 6.1.
Define a finite set N , consisting of the variables N = {a, b, c, . . . , n}. Con-
sider all variables in the finite set N to be elements of the real numbers set

Table 6.1. Modified Big M transformations.


  min Z           →  min (Z + M a)
  a = max(b, c)   →  a ≥ b,  a ≥ c

  min Z           →  min (Z − M a)
  a = min(b, c)   →  a ≤ b,  a ≤ c

$N = \{a, b, c, \ldots, n\} \subset \mathbb{R}$. The objective function Z is a linear function of the variables $\{a, b, c, \ldots, n\}$ and is defined as $Z : \mathbb{R}^{|N|} \to \mathbb{R}$. There are no limitations
on variables being inter-dependent, provided the linearity of the constraints
is preserved.
Two different linearization scenarios are presented in Table 6.1. In the
first scenario [linearization of a = max(b, c) expression], the variable a is con-
strained to be the greater of the variables b and c. The constraint is replaced
with two new constraints, explicitly requiring the variable a to be greater
than or equal to the variables b and c. The initial constraint and the relaxed
constraints are equivalent if either of the following conditions holds:
1. Equality condition is observed for at least one of the inequalities, while
the other inequality operation returns true,
2. Equality condition is observed for both inequalities.
The cost function denoted by the product M a is added to the objective
function. The product M a is overwhelmingly large with respect to other cost
functions in the objective function as a result of the highly weighted cost term
(recall the very large coefficient M ). Thus, M a is given the highest priority
in the minimization process. As a result, the greater of the variables b and c
is assigned to variable a.
The relaxation method in the second scenario [linearization of a =
min(b, c) expression] is also presented in Table 6.1. In this case, the cost func-
tion M a is subtracted from the objective function in order to drive the variable a to its maximum possible value, which is min(b, c).
Similar to its implementation in the big M method, the constant M is
defined sufficiently large, but as small as possible. The selection of a value for
the constant M depends on the solution space of a specific problem (problem
constraints) and the objective function Z. Typically, the number M must be
chosen significantly larger than the values of any parameter in the problem.
However, the selection of an extremely large M may cause the LP solver to fail drastically [114]. Thus, a sufficiently large number is desired to provide the described minimization characteristic without degrading the performance of the solution mechanism.
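The effect of the first transformation in Table 6.1 can be reproduced with any LP solver. The following sketch (Python with scipy, both of which are assumptions about tooling) linearizes a = max(b, c) for a toy problem with b ≥ 3 and c ≥ 5; the penalty term M·a drives a to the larger of the two operands.

from scipy.optimize import linprog

M = 1e4                       # sufficiently large, but as small as possible
c_obj = [M, 1.0, 1.0]         # minimize  M*a + b + c   over the variables (a, b, c)
A_ub = [[-1, 1, 0],           # b - a <= 0,  i.e., a >= b
        [-1, 0, 1],           # c - a <= 0,  i.e., a >= c
        [0, -1, 0],           # -b <= -3,    i.e., b >= 3
        [0, 0, -1]]           # -c <= -5,    i.e., c >= 5
b_ub = [0, 0, -3, -5]
res = linprog(c_obj, A_ub=A_ub, b_ub=b_ub)
print(res.x)                  # approximately [5, 3, 5]: a settles at max(b, c)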

6.3.2 Linear Programming (LP) Model

An LP model of the clock period minimization problem is generated through


the application of the MBM method. There are five sets of constraints in
the LP model. These sets are the latching [(6.1) and (6.2)], synchronization
[(6.3) and (6.4)], propagation [(6.5) and (6.6)], validity [(6.7) and (6.8)] and
initialization [(6.9)] constraints. The finalized LP model for the clock period
minimization problem is shown in Table 6.2.
The latching, validity and initialization constraints exhibit linear behav-
ior. Therefore, these constraints remain unchanged in both the LP and NLP

Table 6.2. LP model clock skew scheduling problem of level-sensitive circuits.


LP Model

min  $T_{CP} + M \left[ \sum_{\forall R_j} (d_j + D_j) + \sum_{\forall R_k : |Fan\text{-}in(R_k)| \ge 1} (A_k - a_k) \right]$

subject to
(i)    $a_f \ge \delta_H^{L_f}$   [Latching-Hold time]
(ii)   $A_f \le T_{CP} - \delta_S^{L_f}$   [Latching-Setup time]
(iii)  $d_i \ge a_i + D_{DQm}^{L_i}$
       $d_i \ge T_{CP} - C_W^{L} + D_{CQm}^{L_i}$   [Synchronization-Earliest time]
(iv)   $D_i \ge A_i + D_{DQM}^{L_i}$
       $D_i \ge T_{CP} - C_W^{L} + D_{CQM}^{L_i}$   [Synchronization-Latest time]
(v)    $a_f \le d_{i_1} + D_{Pm}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP}$
       ...
       $a_f \le d_{i_n} + D_{Pm}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP}$   [Propagation-Earliest time]
(vi)   $A_f \ge D_{i_1} + D_{PM}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP}$
       ...
       $A_f \ge D_{i_n} + D_{PM}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP}$   [Propagation-Latest time]
(vii)  $A_f \ge a_f$   [Validity-Arrival time]
(viii) $D_f \ge d_f$   [Validity-Departure time]
(ix)   $A_l = d_l - (D_{CQm}^{L_l} \text{ or } D_{DQm}^{L_l})$,  $\forall R_l : |Fan\text{-}in(R_l)| = 0$   [Initialization]

models as shown in constraints (i-ii , vii-ix ) of the formulation. The synchro-


nization constraints, however, are formed by the max function and exhibit
non-linear behavior. The MBM method is used on the synchronization con-
straints in order to generate linear constraints for the LP model problem
(constraints iii and iv). For instance, (iii) depicts the replacement of the non-linear constraint presented in (6.3) with two linear constraints, where $d_i$ is greater than or equal to both operands of the max function, $a_i + D_{DQm}^{L_i}$ and $T_{CP} - C_W^{L} + D_{CQm}^{L_i}$. Note that the cost function $M d_i$ is added to the objective function. The propagation constraint on the latest data arrival time, (6.6), exhibits a similar non-linearity to the synchronization constraints in that the max function is used. The linearized propagation constraints in the LP model are shown in (vi). In the LP model, the variable Af is greater than or equal to the expressions $D_i + D_{PM}^{i,f} + T_{Skew}(i, f) - T_{CP}$, evaluated for

each fan-in path of register Rf . In the formulation, fan-in paths of Rf are


indexed by the parameter n.
Unlike other non-linear constraints in the formulation, the propagation
constraint on the earliest arrival time af is modeled by the min function. In this
type of linearization, af is set to be less than or equal to each operand of the min
function. As shown in (v), the expressions $d_i + D_{Pm}^{i,f} + T_{Skew}(i, f) - T_{CP}$,
evaluated for each fan-in path of register Rf are included in the finalized LP
model.

6.4 An Example and Experimental Results

The circuit network shown in Figure 6.4 is analyzed in order to illustrate


the application of the proposed linearization procedure. Without affecting the
generality of the solution, zero setup and hold times and zero internal delays
are considered ($\delta_S^{L_i} = \delta_H^{L_i} = D_{CQ} = D_{DQ} = 0$). A single-phase synchronization
scheme with 50% duty cycle is selected as shown in Figure 6.5.
Given single-phase synchronization under zero and non-zero clock skew op-
eration, the clock period minimization problems of three different synchronous
circuits with same circuit topology are formulated. These circuits are:
1. Zero clock skew, edge-sensitive circuit,
2. Zero clock skew, level-sensitive circuit,
3. Non-zero clock skew, level-sensitive circuit.
The simpler (in terms of timing analysis) circuit is the zero clock skew,
edge-sensitive circuit. This circuit is used as the basis of comparison for other
circuits. The minimum clock period of a zero clock skew, edge-sensitive cir-
cuit is defined by the maximum data propagation time in the circuit [96].
Thus, the synchronous circuit network presented in Figure 6.4 has a minimum
clock period of $T_{CP} = D_{PM}^{3,2} = 7$ (time units) when used with edge-triggered

[Figure 6.4: a simple synchronous circuit with registers R1, R2, R3 and R4; the local data paths are annotated with minimum and maximum delay ranges, including [2.9, 3], [3, 4], [5, 7] and a path with a minimum delay of 2.5.]

Fig. 6.4. A simple synchronous circuit.


[Figure 6.5: the source clock waveform Csource with period $T_{CP}$, pulse width $C_W^{L} = T_{CP}/2$ and phase shift $\phi = T_{CP}/2$.]

Fig. 6.5. A single-phase synchronization clock with a 50% duty cycle.

[Figure 6.6: clock waveforms C1–C4 for the zero clock skew schedule (TCP = 4.66, left) and the non-zero clock skew schedule (TCP = 4.05, right), with the following data arrival times on the critical paths:]

  Zero Skew                      Non-Zero Skew                                    Critical Path
  A3 = 1.66 = D1 + 4 − 4.66      A3 = 2.025 = D1 + 4 + (0.05 − 0) − 4.05          R1 → R3
  A2 = 4.66 = D3 + 7 − 4.66      A2 = 4.05 = D3 + 7 + (0 − 0.925) − 4.05          R3 → R2

Fig. 6.6. Zero and non-zero clock skew timing schedules for the level-sensitive circuit in Figure 6.4.

flip-flops. The second synchronous circuit of interest is the zero clock skew,
level-sensitive circuit. In order to design a level-sensitive synchronous circuit,
each flip-flop in the given circuit topology is replaced with a level-sensitive
latch. Zero clock skew, level-sensitive circuits exhibit improved circuit perfor-
mance due to time borrowing. Clock skew scheduling is applied to the zero
clock skew, level-sensitive circuit to generate the non-zero clock skew, level-
sensitive circuit. This circuit exhibits performance improvement due to the
simultaneous consideration of time borrowing and clock skew scheduling.
The clocking schedules and the data propagation on the critical paths of
the circuit in Figure 6.4 are shown in Figure 6.6. In Figure 6.6, the clocking
schedule for the zero clock skew circuit is shown on the left, with a min-
imum clock period of TCP = 4.66. The non-zero clock skew schedule, with a minimum clock period of TCP = 4.05, is shown on the right. For non-zero clock skew scheduling, the optimal clock signal delays at the registers are $t_{cd}^{1} = 0.05$, $t_{cd}^{2} = 0.925$, $t_{cd}^{3} = 0$ and $t_{cd}^{4} = 0.475$. The arrows represent data
signal propagation on the respective critical paths. Note that unlike the case

presented in Figure 6.6, the critical paths for zero and non-zero clock skew
scheduling need not be identical.
In the analysis, the minimum clock period for the zero clock skew, level-
sensitive circuit is calculated as 4.66 (time units), which is a 33% improvement
over the zero clock skew, edge-sensitive synchronous circuit. Note that the per-
centage improvement is calculated by the expression 100(Told − Tnew )/Told .
As stated earlier, clock skew scheduling is applied to the level-sensitive cir-
cuit in order to generate the non-zero clock skew, level-sensitive circuit. The
calculated minimum clock period of 4.05 for the non-zero clock skew, level-
sensitive circuit is a 13% improvement over the zero clock skew, level-sensitive
circuit and a 42% improvement over the zero clock skew, edge-sensitive cir-
cuit. Note that 13% improvement is only due to clock skew scheduling, while
42% improvement is due to time borrowing and clock skew scheduling.
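Substituting the clock periods above into this expression verifies the three figures: 100(7 − 4.66)/7 ≈ 33%, 100(4.66 − 4.05)/4.66 ≈ 13%, and 100(7 − 4.05)/7 ≈ 42%.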
Further analysis of the time borrowing and clock skew scheduling effects
on circuit timing is presented in Section 11.1.

6.4.1 Level-Sensitive Synchronous Circuit State of Operation

The presence of data path loops (cycles) and transient state errors are two ma-
jor issues that need to be identified in the timing analysis of level-sensitive
circuits. As discussed in Section 6.2, the iterative algorithm offered in [73]
suffers from excessive run times and produces false negative outputs in pres-
ence of data path loops [99]. In [99], modifications are offered for the iterative
algorithm in order to detect and handle the effects of data path loops in the
circuit. Also in [99], it has been shown that synchronous circuits are prone to
transient state errors. The transient state errors occur due to the non-unique
solution sets of the problem parameters, discussed (within a different context)
in Section 6.1.5. In circuits under transient state errors, setup violations occur
in certain registers after the system is initiated from a reset state. The arrival
and departure times may not be stable at start-up, in which case these times
change during initial clock cycles, constituting the transient state. As circuit
operation progresses in time, the arrival and departure times converge to their
steady-state values.
There are two major conventions in evaluating the transient errors and de-
termining the steady-state behavior. The first convention overlooks the tran-
sient errors and presumes that the departure times converge to the opening
edge of the driving clock, which is the expected schedule for the steady-state
of operation. The second convention is more strict in that transient state er-
rors are not permitted. The first convention is more common and leads to a
generally acceptable solution unless the transient state operation of the level-
sensitive circuit is decisive to overall circuit operation. If the second
convention is adopted, the reset state is preferably extended until the steady
state of operation is reached [99].
The LP model in Table 6.2 assumes the transient-state operation of a
level-sensitive circuit to be negligible. The aim of the generated model is to

solve for the steady-state timing scheduling problem. The simplex algorithm-
based LP solver directs the gradual advancement of parameter values as they
are enforced by the LP model. Previously offered algorithms are vulnerable
to potential fallacies caused by data path loops due to their iterative nature.
In the LP model, complications posed by the presence of data path loops are
resolved within the mechanics of the LP solver without significantly affecting
the run time or quality of the solution. If the problem remains feasible, the
timing parameters for the steady state operation of the circuit are calculated.

In order to illustrate the described phenomenon, the steady-state optimal


timing schedule for the ISCAS’89 benchmark circuit s27 is presented in Fig-
ure 6.7. Simplifications of $D_{Pm}^{i,f} = D_{PM}^{i,f}$, $\forall R_i \rightsquigarrow R_f$, and $\delta_S^{L_i} = \delta_H^{L_i} = D_{CQ}^{L_i} = D_{DQ}^{L_i} = 0$ are considered. The circuit s27 has one input register and a data
path loop consisting of two other registers. The data signal departs from input
register R3 and perpetually propagates on the loop between R1 and R2 . The
minimum clock period is calculated to be 4.1, where the pre-computed data
propagation times are indicated on the circuit graph.
In Figure 6.7, the data propagations occurring on all data paths of the s27
benchmark circuit are analyzed. As defined in Section 4.6.2, the subscripts to
the clock signal indicate the register being synchronized by the clock signal.
The clock signals most likely are not aligned in time due to the non-identical
clock delays to their respective destination registers. The clock signal C3 at
the input register R3 has no delay in time with respect to the clock signal at the clock source ($t_{cd}^{3} = 0$). Hence, the origin of the clock signal at the source is aligned with the origin of C3. The clock signals C1 and C2, however, are shifted in time by $t_{cd}^{1} = 3.8$ and $t_{cd}^{2} = 1.3$ relative to the origin of the clock signal at
the source. The horizontal axis of Figure 6.7 represents the time, where the
beginning (k − 1)TCP of the k-th clock cycle of C3 , is defined as the local time
reference, with an assigned value of zero. In Figure 6.7, the numbers associated
with the leading (enabling) and trailing (latching) edges of the clock signals
label the times with respect to the local time reference. The arrows illustrate
the propagation between the registers and are drawn to scale. Illustration of the data propagation over three consecutive clock cycles is sufficient to analyze
the behavior of the data path loop of the benchmark circuit s27. Arbitrary
cycles labeled the k-th, (k + 1)-th and (k + 2)-th clock cycles are selected. The
solid arrows represent the data propagation during the selected clock cycles.
For instance, the propagation between R3 and R1 is represented by the arrows
initiating from the C3 row at times 2.05 and 6.15, and concluding at the C1 row at times 8.65 and 12.75, respectively. Data propagation on the data path
loop between the registers R1 and R2 is visible by the cross-structured arrows
initiating and concluding in the corresponding clock signal rows. Note that the
calculated nominal arrival and departure times are illustrated on the circuit
graph, inside the boxes associated with each node.
In steady-state of operation, the departure times of the registers that con-
stitute a data path loop converge to the beginning of their respective clock
[Figure 6.7: timing diagram of the clock signals C1, C2 and C3 over the k-th, (k + 1)-th and (k + 2)-th clock cycles, with $t_{cd}^{1} = 3.8$, $t_{cd}^{2} = 1.3$, $t_{cd}^{3} = 0$, $T_{Skew}(3, 1) = -3.8$, $T_{Skew}(3, 2) = -1.3$, $T_{Skew}(1, 2) = 2.5$, and the circuit graph of s27 annotated with the path delays [6.6], [5.4] and [1.6] and the computed times a1 = 0.75, A1 = d1 = D1 = 2.05, a2 = A2 = d2 = D2 = 2.05, a3 = A3 = 0, d3 = D3 = 2.05.]

Fig. 6.7. The optimized timing schedule for s27 operable with TCP = 4.1.

cycles. The circuit s27 in Figure 6.7 is analyzed in order to provide better insight into how the latest departure times converge to a certain value in the steady state. Define a variable $\epsilon$, where $\epsilon$ is a very small period of time. Suppose that a deviation of $\epsilon$ occurs in the departure time of the data signal from R3. The signal departure from R3 occurs at time $2.05 + \epsilon$, delaying the arrival times at R1 and R2 by $\epsilon$. The departure from R2 is gradually delayed by $\epsilon$ on every turn, which in turn delays the arrival time at R1. The arrival and departure times cumulatively increase with each turn of the data signal around the loop. Eventually, the signal arrivals at the latches occur during the non-transparent state of the latches. At this point, the signal departure times return to their starting values, which are the leading (enabling) edges of their respective clock cycles. It is evident that the arrival times will finally be restored to their initial values when the source of the deviation vanishes. Thus, the assignment of the otherwise time-varying departure times to the leading edges of the synchronizing clock signals is referred to as the steady state of operation for the synchronous circuit.

6.5 Optimality of the LP Formulation


The operational constraints (latching [(6.1) and (6.2)], synchronization [(6.3)
and (6.4)] and propagation [(6.5) and (6.6)] constraints) accurately model the
timing of level-sensitive synchronous circuits. However, the synchronization
and propagation constraints are non-linear, leading to a non-linear program-
ming (NLP) problem formulation.
Typical NLP problems, especially for large-scale systems, are very hard to
solve efficiently. Consequently, alternative modeling and solution procedures
to solve for the timing constraints of level-sensitive circuits are of interest for
researchers. As discussed in Section 6.3, a linearization procedure that gener-
ates an LP formulation is presented. Neither the iterative solution methods
proposed in [94, 72] nor the LP model problem presented in this monograph
are equivalent to the original non-linear problem. These alternative solution
methods are proposed in order to generate results that are as close as possible
to the optimal solution in relatively shorter run times.
In this section, a Mixed-Integer (Linear) Programming (MIP) [112, 114]
formulation that is equivalent to the NLP formulation of the clock skew
scheduling problem for level-sensitive circuits is described. A MIP problem
is a linear programming problem in which some or all of the problem vari-
ables are constrained to be integers [112, 114]. If the integer variables are
further constrained to take only 0 or 1 values, these variables are called bi-
nary variables.
In general, a MIP problem can be solved optimally (granted enough time)
or within a close proximity of the optimal solution [114]. A typical MIP prob-
lem, although generally harder to solve than an LP problem of similar size, is

Table 6.3. MIP modeling of a constraint with a max or a min function.


  $y_i = \max(x_i, x_j, \ldots, x_k)$                       $y_i = \min(x_i, x_j, \ldots, x_k)$
  $y_i \ge x_i$, $y_i \ge x_j$, ..., $y_i \ge x_k$           $y_i \le x_i$, $y_i \le x_j$, ..., $y_i \le x_k$
  $y_i + (B_{x_i} - 1)M \le x_i$                             $y_i + (1 - B_{x_i})M \ge x_i$
  $y_i + (B_{x_j} - 1)M \le x_j$                             $y_i + (1 - B_{x_j})M \ge x_j$
  ...                                                        ...
  $y_i + (B_{x_k} - 1)M \le x_k$                             $y_i + (1 - B_{x_k})M \ge x_k$
  $B_{x_i} + B_{x_j} + \cdots + B_{x_k} \ge 1$               $B_{x_i} + B_{x_j} + \cdots + B_{x_k} \ge 1$
  $B_{x_i}, B_{x_j}, \ldots, B_{x_k}$ binary                 $B_{x_i}, B_{x_j}, \ldots, B_{x_k}$ binary

generally easier to solve than an NLP problem of similar size [112]. In experi-
mentation, the MIP problems generated for the clock skew scheduling problem
of level-sensitive ISCAS’89 benchmark circuits are solved optimally.
In order to generate the MIP formulation for the clock skew scheduling
problem of level-sensitive circuits, the non-linear synchronization and propa-
gation constraints in Table 6.2 (page 107) are remodeled using binary vari-
ables. Remember from Section 6.3.1 that the non-linearity of the synchro-
nization and propagation constraints are due to the max and min functions.
The transformations in Table 6.3 can be used to model a constraint with a
max function or a min function using a binary variable. In Table 6.3, yi , xi ,
xj and xk are continuous variables. A binary variable Bxa is defined for each
operand xa (xa ∈ {xi , xj , . . . , xk }) of the max or min function. For operand xi
of the max function shown on the left hand side of Table 6.3, for instance,
the binary variable Bxi is defined. The parameter M is a sufficiently large
constant, similar to its definition in Section 6.3.1.
For a non-linear constraint with the max function in the form given as
[yi = max(xi , xj , . . . , xk )], yi is constrained to be greater than or equal to
each one of the operands. For the max function to hold, equality condition
must be true for at least one of these inequalities (multiple equalities occur
when two or more identical operands are the maximal value). Binary variables
are used in order to enforce the equality of at least one of these inequalities.
The assignment of 0 or 1 to the binary variables Bxa either constrain yi to
be less than or equal to xa or constrain yi to be strictly greater than xa . In
particular for operand xi , when Bxi = 1, the relevant constraints become:

yi ≥ xi (6.10)
yi ≤ xi (6.11)

which simplifies to the equality yi = xi through xi being the largest of the


operands xi , xj , . . . , xk . On the other hand, if Bxi = 0, the relevant constraints

[Figure 6.8: bar chart of run times (in seconds, 0 to 1400) of the LP and MIP formulations for the ISCAS'89 benchmark circuits from s27 through s13207.]

Fig. 6.8. Run times under 1250 seconds for the LP and MIP formulations.

become:

yi ≥ xi (6.12)
yi − M ≤ xi (6.13)

which leaves only yi ≥ xi as a binding constraint (i.e., yi is permitted to exceed xi). The transformation for a non-linear constraint with


the min function in the form [yi = min(xi , xj , . . . , xk )] is similar, as shown on
the right hand side of Table 6.3.
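The left-hand column of Table 6.3 can be reproduced with any MIP solver; the sketch below (Python with the PuLP package, both of which are tooling assumptions) encodes y = max(x1, x2) for fixed operands x1 = 3 and x2 = 5. Maximizing y shows that the binary variables, rather than the objective, are what cap y at the larger operand.

import pulp

M = 1e4                                   # sufficiently large constant, as in Section 6.3.1
x1, x2 = 3.0, 5.0                         # fixed operands chosen for illustration
prob = pulp.LpProblem("max_encoding", pulp.LpMaximize)
y = pulp.LpVariable("y")
B1 = pulp.LpVariable("B1", cat="Binary")  # one binary variable per operand
B2 = pulp.LpVariable("B2", cat="Binary")

prob += y                                 # objective: push y as high as the constraints allow
prob += y >= x1                           # y is at least each operand ...
prob += y >= x2
prob += y + (B1 - 1) * M <= x1            # ... and equals the operand whose binary variable is 1
prob += y + (B2 - 1) * M <= x2
prob += B1 + B2 >= 1                      # at least one equality is enforced
prob.solve()
print(pulp.value(y))                      # 5.0, i.e., max(x1, x2)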
Using the transformation procedures defined in Table 6.3 on the non-linear
synchronization and propagation constraints, the MIP problem is constructed
for the clock skew scheduling problem of level-sensitive circuits. The MIP
formulation is shown in Table 6.4. The MIP formulations of the clock skew
scheduling problem are performed for the ISCAS’89 benchmark circuits. These
MIP problems are solved in order to observe the potential deviations from optimality caused by modeling the NLP problem as an LP problem, as described in Section 6.3.1. It is observed that all of the ISCAS'89 suite of benchmark
circuits are solved optimally with the LP model problem.
For small-sized circuits, the MIP formulation can be preferred due to its
guarantee of optimality. However, as the number of registers and paths grows, the solution of the MIP problems can suffer from very long run times (and can become practically unsolvable). In order to compare the run times of the MIP
problems with the run times of the LP problems, experiments are performed
on the ISCAS’89 benchmark circuits.
In Figure 6.8, the ISCAS'89 benchmark circuits whose run times are below 1250 seconds using the CPLEX (v7.5) [115] simplex solver on a 440 MHz Sun Ultra-10 workstation are shown. For smaller circuits, both the LP and MIP run times are below a few seconds and thus cannot be visualized with the scale used in Figure 6.8. For s1423 and larger benchmark circuits, whose number of paths

Table 6.4. MIP model clock skew scheduling problem of level-sensitive circuits.
MIP Model

min  $T_{CP}$

subject to
(i)    $a_f \ge \delta_H^{L_f}$   [Latching-Hold time]
(ii)   $A_f \le T_{CP} - \delta_S^{L_f}$   [Latching-Setup time]
(iii)  $d_i \ge a_i + D_{DQm}^{L_i}$
       $d_i \ge T_{CP} - C_W^{L} + D_{CQm}^{L_i}$
       $d_i + (B_{a_i} - 1)M \le a_i + D_{DQm}^{L_i}$
       $d_i + (B_{Ta_i} - 1)M \le T_{CP} - C_W^{L} + D_{CQm}^{L_i}$   [Synchronization-Earliest time]
(iv)   $D_i \ge A_i + D_{DQM}^{L_i}$
       $D_i \ge T_{CP} - C_W^{L} + D_{CQM}^{L_i}$
       $D_i + (B_{A_i} - 1)M \le A_i + D_{DQM}^{L_i}$
       $D_i + (B_{TA_i} - 1)M \le T_{CP} - C_W^{L} + D_{CQM}^{L_i}$   [Synchronization-Latest time]
(v)    $a_f \le d_{i_1} + D_{Pm}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP}$
       ...
       $a_f \le d_{i_n} + D_{Pm}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP}$
       $a_f + (1 - B_{d_{i_1}f})M \ge d_{i_1} + D_{Pm}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP}$
       ...
       $a_f + (1 - B_{d_{i_n}f})M \ge d_{i_n} + D_{Pm}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP}$   [Propagation-Earliest time]
(vi)   $A_f \ge D_{i_1} + D_{PM}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP}$
       ...
       $A_f \ge D_{i_n} + D_{PM}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP}$
       $A_f + (B_{D_{i_1}f} - 1)M \le D_{i_1} + D_{PM}^{i_1,f} + T_{Skew}(i_1, f) - T_{CP}$
       ...
       $A_f + (B_{D_{i_n}f} - 1)M \le D_{i_n} + D_{PM}^{i_n,f} + T_{Skew}(i_n, f) - T_{CP}$   [Propagation-Latest time]
(vii)  $A_f \ge a_f$   [Validity-Arrival time]
(viii) $D_f \ge d_f$   [Validity-Departure time]
(ix)   $A_l = d_l - (D_{CQm}^{L_l} \text{ or } D_{DQm}^{L_l})$,  $\forall R_l : |Fan\text{-}in(R_l)| = 0$   [Initialization]

exceed a thousand, a significant gap between the run times of the LP and MIP
problems is observed. For larger circuits, the MIP run times can become dramatically worse than the LP run times. For instance, the MIP problem run

time for s38417 is 286496 seconds, while the LP problem run time is only 603
seconds.
The run time experiment results shown in Figure 6.8 demonstrate the
advantages of using the LP formulation versus the MIP formulation. It is
demonstrated that the LP formulation offers a scalable alternative to the accurate MIP model. It is expected that the run times for industry-size integrated circuits will benefit even more from the simplifications of the LP formulation. The results of the LP formulation for the ISCAS'89 benchmark circuits are empirically shown to be equal to the optimal results$^1$. These em-
pirical results do not guarantee the optimality of results for all circuits using
the LP formulation. However, these results suggest the general accuracy of
the LP formulation for the clock skew scheduling problem of level-sensitive
circuits in leading to optimal or close to optimal results.

6.6 Multi-Phase Level-Sensitive Circuits


Single-phase synchronization has traditionally been used in the design and
analysis of systems, mostly due to its simplicity. Recently, however, multi-
phase clock synchronization has become a necessity for larger and relatively
complex integrated circuits. Also, when a single-phase, edge-triggered sys-
tem is converted to a level-sensitive system, multi-phase synchronization is
applied to conserve functionality without logic redesign. Besides these con-
ventional clock distribution networks modified for multi-phase synchroniza-
tion, emerging clocking technologies such as resonant rotary clocking tech-
nology (discussed in Chapter 10) also encompass multi-phase synchronization
schemes [116]. Such necessity to design and analyze for multi-phase synchro-
nization schemes requires dedicated design and analysis frameworks.
An extension to clock skew scheduling algorithms for edge-triggered cir-
cuits (Chapter 5) in order to account for multi-phase synchronization is rel-
atively straightforward. In this section, an enhancement to the Linear Pro-
gramming (LP) framework presented in Section 6.1 for non-zero clock skew,
level-sensitive circuits is described. The enhanced framework is used to profile
the performance improvement of level-sensitive circuits subject to clock skew
scheduling under multi-phase synchronization. Time borrowing and clock skew
scheduling are analyzed in single, two, three and four phase synchronization
schemes. The effects of multi-phase synchronization schemes—independent of
the clocking technology—on non-zero clock skew, level-sensitive circuit per-
formance are analyzed.

6.6.1 Multi-Phase Synchronization Overview


The advantages of level-sensitive design with multi-phase synchronization
have previously been investigated in different contexts. One line of research has
1 The results of these experiments will be presented in detail in Chapter 11.

concentrated on circuit retiming, most notably in [117] and [118]. In [117], the
advantages of two-phase, level sensitive circuits (as opposed to edge-sensitive
circuits) are explored. It is concluded in [117] that the level of improvement
in circuit performance is insignificant for such a circuit transformation, when
circuit retiming is performed. In [118], the results of [117] are examined from
a wider perspective, considering the depth of pipelining within a circuit—
average improvements up to 30% are shown to be possible by two-phase,
level-sensitive clocking with circuit retiming.
The multi-phase, level-sensitive clock skew scheduling methodology presented here
differs from [117] and [118] by expanding the multi-phase synchronization
concept to three, four and potentially higher number of phases (the studies
presented in [117] and [118] are performed only for two-phase, level-sensitive
circuits). Furthermore, unlike extensive emphasis on circuit retiming in [117]
and [118], the application of clock skew scheduling is presented in this section.
In [119], the authors advocate the use of a multi-phase clocking scheme for
both edge-triggered and level-sensitive synchronous circuits for increased cir-
cuit performance. In [120], the number of clock phases constituting the multi-
phase synchronization scheme and the skew values are restricted to reflect the
practical limitations of conventional clock distribution networks. These stud-
ies in [119] or [120] do not explore the effects of multi-phase synchronization
on the level of improvement in circuit performance for non-zero clock skew,
level-sensitive circuits.

6.6.2 Multi-Phase Level-Sensitive Circuit Timing

In Figure 6.9, two local data paths starting at the latches Ri1 and Ri2 , re-
spectively, and ending at Rf are considered. This figure is the multi-phase
synchronization counterpart of Figure 6.2 shown on page 101. The clock sig-
nals driving the initial latches Ri1 and Ri2 are shown at the top and bottom,
respectively. The middle clock signal corresponds to the final latch Rf . The
time intervals for the arrival and departure times of latch data are illustrated
by the upper and lower parallel dotted lines, respectively. Data delays are
represented by the lengths of white or black rectangular boxes. Similar to
the analysis in Section 6.1, the operational and constructional timing con-
straints of multi-phase, level-sensitive circuits are formulated based on these
data propagation rules.
The timing constraints governing the operation of a multi-phase, level-
sensitive synchronous system are summarized in Table 6.5. The multi-phase
clock skew definition from Section 4.6.2 is incorporated into the constraints.
These constraints are valid for all varieties of overlapping and non-overlapping
clocking schemes, and for any feasible selection of duty cycles per clock phase.
Note the max and min functions in the synchronization and propagation
constraints in Section 4.9. The non-linearities of these constraints are similar
to those reported in Section 6.3 for single-phase circuits. Consequently, the
[Figure 6.9: timing diagram of the clock signals $C_{i_1}$, $C_f$ and $C_{i_2}$, each with its own phase, over the k-th and (k + 1)-th clock cycles, showing the arrival and departure intervals, the local data path delays, and the multi-phase clock skews between $R_{i_1}$, $R_{i_2}$ and $R_f$.]

Fig. 6.9. Propagation of the data signal in a simple multi-phase circuit.

Table 6.5. LP model clock skew scheduling problem of multi-phase level-sensitive circuits.

LP Model

min  $T_{CP} + M \left[ \sum_{\forall R_j} (d_j + D_j) + \sum_{\forall R_k : |Fan\text{-}in(R_k)| \ge 1} (A_k - a_k) \right]$

subject to
(i)    $a_f \ge \delta_H^{L_f}$   [Latching-Hold time]
(ii)   $A_f \le T_{CP} - \delta_S^{L_f}$   [Latching-Setup time]
(iii)  $d_i \ge a_i + D_{DQm}^{L_i}$
       $d_i \ge T_{CP} - C_W^{L} + D_{CQm}^{L_i}$   [Synchronization-Earliest time]
(iv)   $D_i \ge A_i + D_{DQM}^{L_i}$
       $D_i \ge T_{CP} - C_W^{L} + D_{CQM}^{L_i}$   [Synchronization-Latest time]
(v)    $a_f \le d_{i_1} + D_{Pm}^{i_1,f} + T_{Skew}^{p_{i_1}p_f}(i_1, f) + \phi^{p_{i_1}p_f}$
       ...
       $a_f \le d_{i_n} + D_{Pm}^{i_n,f} + T_{Skew}^{p_{i_n}p_f}(i_n, f) + \phi^{p_{i_n}p_f}$   [Propagation-Earliest time]
(vi)   $A_f \ge D_{i_1} + D_{PM}^{i_1,f} + T_{Skew}^{p_{i_1}p_f}(i_1, f) + \phi^{p_{i_1}p_f}$
       ...
       $A_f \ge D_{i_n} + D_{PM}^{i_n,f} + T_{Skew}^{p_{i_n}p_f}(i_n, f) + \phi^{p_{i_n}p_f}$   [Propagation-Latest time]
(vii)  $A_f \ge a_f$   [Validity-Arrival time]
(viii) $D_f \ge d_f$   [Validity-Departure time]
(ix)   $A_l = d_l - (D_{CQm}^{L_l} \text{ or } D_{DQm}^{L_l})$,  $\forall R_l : |Fan\text{-}in(R_l)| = 0$   [Initialization]

multi-phase problem is solved by linearizing the non-linear constraints with


the Modified big M method (Section 6.3.1).

6.7 Summary
The timing analysis and optimization of synchronous circuits are subject to
non-zero clock skew (intentional or not) and other effects of process para-
meter variations. In this chapter, design and timing analysis procedures are
presented for clock skew scheduling of level-sensitive circuits. The formulation
is performed to improve the performance of level-sensitive synchronous circuits by permitting shorter clock periods. The described procedure integrates
non-zero clock skew scheduling in an automated fashion into the design and
analysis of level-sensitive circuits. The procedure is based on a stand-alone LP
model formulation (to be solved by any standard LP solver) which constitutes
a generic automated framework for the design and analysis of level-sensitive
synchronous circuits. The optimality of the results generated by the LP model
is empirically confirmed against the optimal results of a precise MIP model.
Using the clock skew definition that is enhanced for the increasingly popular
multi-phase clock systems, the LP model clock skew scheduling formulation
for level-sensitive circuits is presented.
7
Clock Skew Scheduling for Improved Reliability

The operation of a fully synchronous digital system has been discussed in


detail in Chapters 1 through 5. Briefly, in order for such systems to func-
tion properly, a strict temporal ordering of the many thousands of switching
events within the circuit is required. This strict ordering is enforced by a global
synchronizing clock signal delivered to every register in a circuit by a clock
distribution network. Algorithms for determining a non-zero clock skew sched-
ule that satisfy the tighter timing constraints of high speed, VLSI complexity
systems have been presented in detail in Chapter 5.
In this chapter, the problem of determining an optimal clock skew sched-
ule for a fully synchronous VLSI system is considered from
the perspective of improving system reliability. An original formulation of the
clock skew scheduling problem by Kourtev and Friedman is introduced as a
constrained quadratic programming (QP) problem [121, 122]. In this formu-
lation, the primary objective is to improve circuit reliability by maximizing
the tolerance to process parameter variations. As the initial step of the com-
putation process, an objective value is computed for the clock skew of
each local data path. Then, a consistent clock schedule is found by applying
the proposed optimization algorithm. Unlike the approach discussed in Chap-
ter 5, the algorithm presented in this chapter minimizes the least square error
between the computed and objective clock skew schedules.1 It should also be
mentioned that a secondary objective of the clock skew scheduling algorithm
presented in this chapter is to increase the system-wide clock frequency.
This chapter begins with the alternative formulation of the clock skew
scheduling problem as a quadratic programming problem—discussed in detail
in Section 7.1. The mathematical procedures used to determine the clock skew
schedule are developed and analyzed in Section 7.2.

¹ Recall that in Chapter 5, the starting point of the clock scheduling algorithms
is the set of timing constraints and the objective is to determine a feasible clock
schedule and a clock distribution network given these constraints.

7.1 Problem Formulation


Recall the short delay D̂Pi,jm and long delay D̂Pi,jM of a local data path Ri ;Rj
introduced in Definition 5.4. Using the substitutions, (5.3) and (5.4), the tim-
ing constraints of a local data path Ri ;Rf are rewritten in (5.5) and (5.6).
A pair of constraints such as (5.5) and (5.6) must be satisfied for each lo-
cal data path within a circuit in order for this circuit to operate correctly.
Furthermore, the local data path timing constraints lead to the concept of
a permissible range introduced in Section 5.2.1 and illustrated in Figure 5.2.
Formally, the lower and upper bounds of the permissible range of a local data
path Ri ;Rj are

    l_{i,j} = −D̂_{Pm}^{i,j}                                         (7.1)
    u_{i,j} = T_CP − D̂_{PM}^{i,j} .                                  (7.2)

Also defined here for notational convenience are the width wi,j and middle mi,j
of the permissible range. Specifically,
 
    w_{i,j} = u_{i,j} − l_{i,j} = T_CP − ( D̂_{PM}^{i,j} − D̂_{Pm}^{i,j} )                    (7.3)
    m_{i,j} = (1/2)( l_{i,j} + u_{i,j} ) = (1/2)[ T_CP − ( D̂_{PM}^{i,j} + D̂_{Pm}^{i,j} ) ].  (7.4)
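For illustration only (not part of the original text), these four quantities can be
computed directly from (7.1) through (7.4); the function and argument names below
are hypothetical and the delay values arbitrary:

def permissible_range(t_cp, d_hat_min, d_hat_max):
    """Return (l, u, w, m) of (7.1)-(7.4) for one local data path."""
    l = -d_hat_min                 # lower bound (7.1)
    u = t_cp - d_hat_max           # upper bound (7.2)
    w = u - l                      # width (7.3)
    m = 0.5 * (l + u)              # middle (7.4)
    return l, u, w, m

# Hypothetical example: D^_Pm = 5, D^_PM = 7 at T_CP = 6.5
print(permissible_range(6.5, 5.0, 7.0))    # (-5.0, -0.5, 4.5, -2.75)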
Recall from Section 5.3 that it is frequently possible to make two simple
choices (5.7) characterizing the clock skews and clock delays within a circuit,
such that both zero and double clocking violations are avoided. Specifically,
if equal values are chosen for all clock delays and a sufficiently large value—
larger than the longest delay D̂Pi,fM —is chosen for TCP , neither of these two
clocking hazards will occur. Formally,

    ∀ Ri , Rf :  t_cd^i = t_cd^f = Const                              (7.5)
    Ri ;Rf  ⇒  T_CP > D̂_{PM}^{i,f} ,                                 (7.6)

and, with (7.5) and (7.6), the timing constraints, (5.5) and (5.6), for a hazard-
free local data path Ri ;Rf become

    D̂_{PM}^{i,f} < T_CP                                              (7.7)
    D̂_{Pm}^{i,f} > 0.                                                 (7.8)

Next, recall that each clock skew TSkew (i, f ) is the difference of the delays
of the clock signals, ticd and tfcd . These delays are the tangible physical quan-
tities which are implemented by the clock distribution network. The set of all
clock delays within a circuit can be denoted as the column vector
    t_cd = [ t_cd^1  t_cd^2  … ]^t ,

and is called a clock skew schedule or simply a clock schedule [2, 9, 106].
If tcd is chosen such that (5.5) and (5.6) are satisfied for every local data
path Ri ;Rj , tcd is called a feasible clock schedule. A clock schedule that
satisfies (5.7) [respectively, (7.5) and (7.6)] is called a trivial clock schedule.
Again, a trivial tcd implies global zero clock skew since for any i and f ,

t_cd^i = t_cd^f , thus, T_Skew (i, f ) = 0. Also, observe that if [ t_cd^1  t_cd^2  … ]^t is a feasible
clock schedule (trivial or not), [ c + t_cd^1  c + t_cd^2  … ]^t is also a feasible clock
schedule where c ∈ R^1 is any real constant.

An alternative way to refer to a clock skew schedule is to specify the


vector of all clock skews within a circuit corresponding to a set of clock delays
t_cd as specified above. Denoted by s, the column vector of clock skews is
s = [ s_1  s_2  … ]^t , where the skews s_1 , s_2 , . . . of all local data paths within the
circuit are enumerated. Typically, the dimension of s is different from the
dimension of t_cd for the same circuit. If a circuit consists of r registers and
p local data paths, for example, then s = [ s_1 … s_p ]^t and t_cd = [ t_cd^1 … t_cd^r ]^t
for this circuit. Therefore, the clock skew schedule refers to either tcd or s,
where the precise reference is usually apparent from the context.
Note that tcd must be known to determine each clock skew within s. The
inverse situation, however, is not true, that is, the set of all clock skews within
a circuit need not be known in order to determine the corresponding clock
schedule tcd . As is shown in Sections 7.1 and 7.2, a small subset of clock
skews (compared to the total number of local data paths, that is, clock skews)
uniquely determines all the skews within a circuit as well as the different
feasible clock schedules tcd . Finally, note that a given feasible clock schedule s

allows for many possible implementations t_cd = [ c + t_cd^1  c + t_cd^2  … ]^t , where
any specific constant c implies a different t_cd but the same s. Thus, the term
clock schedule is used to refer to t_cd where the choice of the real constant
c ∈ R^1 is arbitrary.
The classical linear programming approach for minimizing only the clock
period TCP of a circuit is first described in Section 7.1.1. The new problem
formulation approach for maximizing the safety of the non-zero clock skew
circuit towards variations in clock delays is described in Section 7.1.2. A new
quantitative measure to compare different clock schedules for the formulation
of maximum safety against variations is introduced in Section 7.1.3. This
section is concluded by sketching the clock skew scheduling problem as an
efficiently solvable quadratic programming problem in Section 7.1.4.

7.1.1 Clock Scheduling for Maximum Performance

The linear programming (LP) problem of computing a feasible clock skew


schedule while minimizing the clock period TCP of a circuit is discussed in
Chapter 5. With TCP as the value of the objective function being minimized,
this problem is formally defined as problem LCSS:

Problem LCSS (LP Clock Skew Scheduling)

    min   T_CP
    subject to:   t_cd^i − t_cd^j ≤ T_CP − D̂_{PM}^{i,j}              (7.9)
                  t_cd^i − t_cd^j ≥ −D̂_{Pm}^{i,j} .

To develop additional insight into problem LCSS, consider a circuit C1


consisting of the four registers, R1 , R2 , R3 , and R4 , and the five local data
paths, R1 ;R2 , R1 ;R3 , R3 ;R2 , R3 ;R4 , and R4 ;R2 . Let the long and short
delays for this circuit be² D̂_{Pm}^{1,2} = 1, D̂_{PM}^{1,2} = 3, D̂_{Pm}^{1,3} = 2, D̂_{PM}^{1,3} = 4, D̂_{Pm}^{3,2} = 5,
D̂_{PM}^{3,2} = 7, D̂_{Pm}^{3,4} = 2.5, D̂_{PM}^{3,4} = 5, D̂_{Pm}^{4,2} = 2, and D̂_{PM}^{4,2} = 4. (Footnote 2: the
times used in this section are all assumed to be in the same time unit; the actual time
unit—e.g., picoseconds, nanoseconds, microseconds, milliseconds, seconds—is irrelevant
and is therefore omitted.) Solving problem
LCSS yields a feasible clock schedule t1cd for the minimum achievable clock
period TCP = 5,
    min T_CP = 5   →   t_cd¹ = [ t_cd^1  t_cd^2  t_cd^3  t_cd^4 ]^t = [ 1  2  0  2.5 ]^t .

These results are summarized in Table 7.1 along with the actual permissible
range for each local data path for the minimum value of the clock period
TCP = 5 (recall that the permissible range depends upon the value of the
clock period TCP ).
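For illustration only, problem LCSS for the example circuit C1 can be passed to any
standard LP solver. The sketch below assumes the scipy.optimize.linprog solver and
the delay values listed above; it is a minimal illustration rather than the clock
scheduling algorithm of Chapter 5.

from scipy.optimize import linprog

# (i, j, D^_Pm, D^_PM) for each local data path Ri ;Rj of circuit C1
paths = [(1, 2, 1.0, 3.0), (1, 3, 2.0, 4.0), (3, 2, 5.0, 7.0),
         (3, 4, 2.5, 5.0), (4, 2, 2.0, 4.0)]

# Variables x = [t1, t2, t3, t4, T_CP]; objective: minimize T_CP
A_ub, b_ub = [], []
for i, j, d_min, d_max in paths:
    setup = [0.0] * 5                 # t_i - t_j - T_CP <= -D^_PM  (no zero clocking)
    setup[i - 1], setup[j - 1], setup[4] = 1.0, -1.0, -1.0
    A_ub.append(setup); b_ub.append(-d_max)
    hold = [0.0] * 5                  # t_j - t_i <= D^_Pm          (no double clocking)
    hold[i - 1], hold[j - 1] = -1.0, 1.0
    A_ub.append(hold); b_ub.append(d_min)

res = linprog([0, 0, 0, 0, 1], A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 5)
print(res.x[-1])                      # minimum clock period: 5 for circuit C1

The solver returns one of many feasible delay assignments achieving T_CP = 5; the
schedule t_cd¹ above is one such assignment.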

Table 7.1. Clock schedule t1cd —clock skews and permissible ranges for the example
circuit C1 (for the minimum clock period TCP = 5).
    Local Data Path    Permissible Range    Clock Skew
    R1 ;R3             [−2, 1]              t_cd^1 − t_cd^3 = 1 − 0 = 1
    R3 ;R4             [−2.5, 0]            t_cd^3 − t_cd^4 = 0 − 2.5 = −2.5
    R1 ;R2             [−1, 2]              t_cd^1 − t_cd^2 = 1 − 2 = −1
    R3 ;R2             [−5, −2]             t_cd^3 − t_cd^2 = 0 − 2 = −2
    R4 ;R2             [−2, 1]              t_cd^4 − t_cd^2 = 2.5 − 2 = 0.5

Note that most of the clock skews (specifically, the first four) listed in Ta-
ble 7.1 are at one end of the corresponding permissible range. This situation
is due to the inherent feature of linear programming which seeks the objective
function extrema at the vertices of the solution space. In practice, however,
this situation can be dangerous since correct circuit operation is strongly de-
pendent on the accurate implementation of a large number of clock delays—
effectively, the clock skews—across the circuit. It is quite possible that the
actual values of some of these clock delays may fluctuate from the target
values—due to manufacturing tolerances as well as variations in temperature
and supply voltage—thereby causing a catastrophic timing failure of the cir-
cuit. Observe that while zero clocking failures can be corrected by operating
the circuit at a slower speed (higher clock period TCP ), double clocking vi-
olations are race conditions that render the circuit nonfunctional unless
delay padding is performed.

7.1.2 Maximizing Safety

Frequently in practice, a target clock period TCP is established for a specific


circuit implementation. Making the target clock period smaller may not be a
primary design objective. If this is the case, alternative optimization strate-
gies may be sought such that the resulting circuit is more tolerant to inac-
curacies in the timing parameters. Two different classes of timing parameters
are considered—the local data path delays and the clock delays (respectively,
the clock skews). Note first that the clock skew scheduling process depends
on accurate knowledge of the short and long path delays (D̂Pi,jm and D̂Pi,jM ) for
every local data path Ri ;Rj . Second, provided the path delay information is
predictable, correct circuit operation is contingent upon the accurate imple-
mentation of the computed clock schedule tcd . Both of these factors must be
considered if reliable circuit operation under various operating conditions is
to be attained.
One way to achieve the specified goal of higher circuit reliability is to artifi-
cially shrink the permissible range of each local data path by an equal amount
from either side of the interval and determine a feasible clock skew schedule
based on these new timing constraints. This idea has been addressed in [2]
as the problem of maximizing the minimum slack [over all inequalities (5.5)
and (5.6)] or the amount by which an inequality exceeds the limit. Formally,
the problem can be expressed as the LP problem LCSS-SAFE:

Problem LCSS-SAFE (LP Clock Skew Scheduling for Safety)

    max   M
    subject to:   t_cd^i − t_cd^j + M ≤ T_CP − D̂_{PM}^{i,j}
                  t_cd^i − t_cd^j − M ≥ −D̂_{Pm}^{i,j}                (7.10)
                  M ≥ 0
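Continuing the purely illustrative linprog sketch shown in Section 7.1.1, problem
LCSS-SAFE only requires the additional margin variable M. The fragment below
fixes T_CP = 6.5 for circuit C1 and maximizes M:

from scipy.optimize import linprog

T_CP = 6.5
paths = [(1, 2, 1.0, 3.0), (1, 3, 2.0, 4.0), (3, 2, 5.0, 7.0),
         (3, 4, 2.5, 5.0), (4, 2, 2.0, 4.0)]   # (i, j, D^_Pm, D^_PM) of circuit C1

# Variables x = [t1, t2, t3, t4, M]; maximize M == minimize -M
A_ub, b_ub = [], []
for i, j, d_min, d_max in paths:
    setup = [0.0] * 5                 # t_i - t_j + M <= T_CP - D^_PM
    setup[i - 1], setup[j - 1], setup[4] = 1.0, -1.0, 1.0
    A_ub.append(setup); b_ub.append(T_CP - d_max)
    hold = [0.0] * 5                  # t_j - t_i + M <= D^_Pm
    hold[i - 1], hold[j - 1], hold[4] = -1.0, 1.0, 1.0
    A_ub.append(hold); b_ub.append(d_min)

res = linprog([0, 0, 0, 0, -1], A_ub=A_ub, b_ub=b_ub, bounds=[(0, None)] * 5)
print(res.x[-1])                      # M = 1 at T_CP = 6.5, as listed in Table 7.2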

To gain additional insight into problem LCSS-SAFE, consider again the


circuit example used in Section 7.1.1. Two solutions of problem LCSS-SAFE
are listed in Table 7.2 for two different values of the clock period, TCP = 6.5
and TCP = 6, respectively. The results are summarized in Table 7.2—denoted
by t2cd and t3cd , respectively—in columns two through five and six through
nine for TCP = 6.5 (clock schedule t2cd ) and TCP = 6 (clock schedule t3cd ),
respectively. For the specific value of TCP , the permissible range is listed in

Table 7.2. Solution of problem LCSS-SAFE for the example circuit C1 for clock
periods TCP = 6.5 and TCP = 6, respectively.
                  t_cd² → T_CP = 6.5, M = 1                      t_cd³ → T_CP = 6, M = 2/3
                  t_cd² = [ 3/2  3/2  0  1/2 ]^t                 t_cd³ = [ 4/3  5/3  0  1/3 ]^t
      (1)         (2)           (3)     (4)     (5)       (6)          (7)     (8)     (9)
    R1 ;R3      [−2, 2.5]       1.5     0.25    1.25     [−2, 2]       4/3     0       4/3
    R3 ;R4      [−2.5, 1.5]    −0.5    −0.5     0        [−2.5, 1]    −1/3    −3/4     5/12
    R1 ;R2      [−1, 3.5]       0       1.25    1.25     [−1, 3]      −1/3     1       4/3
    R3 ;R2      [−5, −0.5]     −1.5    −2.75    1.25     [−5, −1]     −5/3    −3       4/3
    R4 ;R2      [−2, 2.5]      −1       0.25    1.25     [−2, 2]      −4/3     0       4/3

    1: local data path; 2, 6: permissible range; 3, 7: clock skew solution for this local
    data path; 4, 8: ideal clock skew value for this path (middle of permissible range);
    5, 9: distance (absolute value) of the clock skew solution from the ideal clock skew

columns two and six, respectively, and the clock skew solution is listed in
columns three and seven, respectively.
Note that there are two additional columns of data for either value of TCP
in Table 7.2. First, an ‘ideal’ objective value of the clock skew is specified for
each local data path in columns four and eight, respectively. This objective
value of the clock skew is chosen in this example to be the value corresponding
to the middle mi,j [note (7.4)] of the permissible range of a local data path
Ri ;Rj in a circuit with a clock period TCP . The middle point of the permissi-
ble range is equally distant from either end of the permissible range, thereby
providing the maximum tolerance to process parameter variations. Second,
the absolute value of the distance |T_Skew (i, j) − m_{i,j}| between the ideal and
actual values of the clock skew for a local data path is listed in columns five
and nine, respectively. This distance is a measure of the difference between
the ideal clock skew and the scheduled clock skew. Note that in the general
case, it is virtually impossible to compute a clock schedule tcd such that the
clock skew TSkew (i, j) for each local data path Ri ;Rj is exactly equal to the

middle mi,j of the permissible range of this path. The reasons for this charac-
teristic are due to structural limitations of the circuits as will be highlighted
in Section 7.2.

7.1.3 Further Improvement

Problem LCSS-SAFE [see (7.10)] provides a solution to the clock skew schedul-
ing problem for the case where circuit reliability is of primary importance and
clock period minimization is not the focus of the optimization process. As
shown in Section 7.1.2, a certain degree of safety may be achieved by comput-
ing a feasible clock schedule subject to artificially smaller permissible ranges
[as defined in (7.10)]. However, Problem LCSS-SAFE is a brute force ap-
proach since it requires that the same absolute margins of safety are observed
for each permissible range regardless of the width of this range. Therefore,
this approach does not consider the individual characteristics of a permissi-
ble range and does not differentiate among local data paths with wider and
narrower permissible ranges.
It is possible to provide an alternative approach to clock skew scheduling
that considers all permissible ranges and also provides a natural quantitative
measure of the quality of a particular clock schedule. Consider, for instance,
a circuit with a target clock period TCP . Furthermore, denote an objective
clock skew value for a local data path Ri ;Rj by gi,j , where it is required that
li,j ≤ gi,j ≤ ui,j [recall the lower (7.1) and upper (7.2) bounds of the permissible
range]. For most practical circuits, it is unlikely that a feasible clock schedule
can be computed that is exactly equal to the objective clock schedule for each
local data path. Multiple linear dependencies among clock skews within each
circuit exist—those linear dependencies define a solution space such that the

clock schedule s = [ g_{i1,j1}  g_{i2,j2}  … ]^t most likely is not within this solution space
(unless the circuit is constructed of only non-recursive feed-forward paths). If
tcd is a feasible clock schedule, however, it is possible to evaluate how close a
realizable clock schedule is to the objective clock schedule by computing the
sum,
    ε = Σ_{Ri ;Rj} [ T_Skew (i, j) − g_{i,j} ]² ,                     (7.11)

over all local data paths in the circuit.


Note that ε, as defined in (7.11), is the total least squares error of the
actual clock skew as compared to the objective clock skew. This error per-
mits any two different clock skew schedules to be compared. Moreover, the
clock skew scheduling problem can be considered as a problem of minimiz-
ing ε of a clock schedule tcd given the clock period TCP and an ‘ideal’ clock

schedule [ g_{i1,j1}  g_{i2,j2}  … ]^t subject to any specific circuit design criteria. The
flexibility permitted by such a formulation is far greater since the ideal sched-
ule [ g_{i1,j1}  g_{i2,j2}  … ]^t can be any clock schedule that satisfies a specific target
circuit.

Consider, for instance, the solution of LCSS-SAFE listed in Table 7.2 for
TCP = 6.5 and TCP = 6. Computing the total error [as defined by (7.11)] for
both solutions gives ε_6.5 = 6.25 and ε_6 = 1049/144 = 7.2847. Next, consider an
alternative clock schedule t_cd² for T_CP = 6.5 as follows:

    T_CP = 6.5   →   t_cd² = [ t_cd^1  t_cd^2  t_cd^3  t_cd^4 ]^t = [ 43/32  38/32  0  31/32 ]^t .        (7.12)

It can be verified that with t_cd² as specified in (7.12), ε_6.5 improves to 675/128 = 5.2734
from 6.25 for the LCSS-SAFE solution t_cd² [columns two (2) through five (5) in Table 7.2]. Similarly,

an alternative clock schedule t_cd³ for the clock period T_CP = 6 is

    T_CP = 6   →   t_cd³ = [ t_cd^1  t_cd^2  t_cd^3  t_cd^4 ]^t = [ 35/32  54/32  0  39/32 ]^t .          (7.13)

Again, using t_cd³ as specified in (7.13) leads to an improvement of ε_6 to 6.1484 as compared
to 7.2847 for the LCSS-SAFE solution t_cd³ (see Table 7.2, columns six through
nine).

7.1.4 Clock Scheduling as a Quadratic Programming Problem

As discussed in Sections 7.1.1, 7.1.2, and 7.1.3, a common design objective is


ensuring reliable system operation under a target clock period. As hinted in
Section 7.1.3, it is possible to redefine the problem of clock skew scheduling
for this case. The input data for this redefined problem consists of:
• The clock period of the circuit TCP ,
• The circuit connectivity and delay information, i.e., all local data paths
Ri ;Rj and the short and long delays D̂Pi,jm and D̂Pi,jM , respectively,
• An objective clock schedule g = [ g_{i1,j1}  g_{i2,j2}  … ]^t .
Given this information, the optimization goal is to compute a feasible
clock schedule s∗ (respectively t∗cd ) so as to minimize the least square error
between the computed clock schedule s∗ and the objective clock schedule g.
Recall that the least square error ε [described by (7.11)] is defined as the sum
of the squares of the distances (algebraic differences) between the actual and
objective clock skews over all local data paths in the circuit. This problem
is described within a formal framework in the following section. Also in the
following section, the mathematical algorithm to solve this revised problem is
explained in greater detail.

7.2 Derivation of the QP Algorithm

The formulation of clock skew scheduling as a quadratic programming problem


is described in detail in this section. First, the graph model introduced in
Chapter 5 is further analyzed in Section 7.2.1. The linear dependencies among
the clock skews and the fundamental set of cycles are introduced and analyzed
in Section 7.2.2. Finally, the quadratic programming problem is formulated
and solved in Section 7.2.3.

7.2.1 The Circuit Graph

As discussed in Section 5.2.2, a circuit C is represented as the simple


undirected graph G_C = ( V^(C) , E^(C) , A^(C) , h_l^(C) , h_u^(C) , h_d^(C) ), where V_C =
{v1 , . . . , vr } is the set of vertices of the graph, EC = {e1 , . . . , ep } is the set
of edges of the graph, and the symmetric r × r matrix AC —called the ad-
jacency matrix—contains the graph connectivity [89]. Vertices from GC cor-
respond to the registers of the circuit C and the edges reflect the fact that
pairs of registers are sequentially-adjacent. Note the cardinalities |VC | = r
and |EC | = p—the circuit C has r registers and p local data paths. The adja-
cency matrix AC = [aij ]r×r is a square matrix of order r × r where both the
rows and columns of A correspond to the vertices of GC . As previously men-
tioned, for notational convenience sj denotes the clock skew corresponding to
the edge ej ∈ EC . Specifically, if the vertices vi1 and vi2 correspond to the
sequentially-adjacent pair of registers Ri1 ;Ri2 connected by the j-th edge ej ,
def
sj = TSkew (i1 , i2 ).

To illustrate these concepts, the graph GC1 of the small circuit example C1
introduced in Section 7.1.1 is illustrated in Figure 7.1 (note the enumeration
and labeling of the edges as specified in Definition 5.3). For this example,

[Figure 7.1: the circuit graph G_C1 with vertices v1 , v2 , v3 , v4 and directed, labeled
edges e1 : v1 → v3 , e2 : v3 → v4 , e3 : v1 → v2 , e4 : v3 → v2 , e5 : v4 → v2 ; each edge
ek is annotated with its permissible range [lk , uk ].]

Fig. 7.1. Circuit graph of the simple example circuit C1 from Section 7.1.1.

r = 4, p = 5, and the adjacency matrix, which will be used in the solution
procedure, is

                v1  v2  v3  v4
           v1 [  0   1   1   0 ]
    A_C1 =  v2 [  1   0   1   1 ] .
           v3 [  1   1   0   1 ]
           v4 [  0   1   1   0 ]
Observe that in general, the elements of AC are defined as

    a_ij = { 1   if there is an edge e_k connecting the vertices v_i and v_j        (7.14)
            { 0   otherwise.

In addition, note that the adjacency matrix as defined in (7.14) is always


symmetric. The edges of GC have no direction so each edge between vertices
vi and vj is shown in both of the rows corresponding to i and j. Also, all
diagonal elements of the adjacency matrix are zeroes since self-loop edges are
excluded by the required circuit graph properties described in Section 5.2.2. As a final
reminder and without any loss of generality, it is assumed that a circuit has a
connected graph [89]. In other words, a circuit does not have isolated groups
of registers. If a specific circuit has a disconnected graph, then each connected
subgraph (subcircuit) can be considered separately.

7.2.2 Linear Dependence of Clock Skews

Consider the circuit graph of C1 illustrated in Figure 7.1. The clock skews
for the local data paths R3 ;R2 , R3 ;R4 , and R4 ;R2 are s4 = TSkew (3, 2) =
t3cd − t2cd , s2 = TSkew (3, 4) = t3cd − t4cd , and s5 = TSkew (4, 2) = t4cd − t2cd ,
respectively. Note that s4 = s2 + s5 , i.e., the clock skews s2 , s4 , and s5 are
linearly dependent. In addition, note that other sets of linearly dependent
clock skews can be identified within C1 , such as, for example, s1 , s3 , and s4 .
Generally, large circuits contain many feedback and feed-forward signal
paths. Thus, many possible linear dependencies among clock skews—such as
those described in the previous paragraph—are typically present in such cir-
cuits. A natural question arises as to whether there exists a minimal set3 of
linearly independent clock skews which uniquely determines all clock skews
within a circuit. (The existence of any such set could lead to substantial
improvements in the run time of the clock scheduling algorithms as well as
permit significant savings in storage requirements when implementing these
algorithms on a digital computer.) It is generally possible to identify multiple
minimal sets within any circuit. Consider C1 , for example—it can be verified
that {s3 , s4 , s5 }, {s1 , s3 , s5 }, and {s1 , s4 , s5 } are each sets with the property
that (a) the clock skews within the set are linearly independent, and (b) every

³ Such that the removal of any element from the set destroys the property.

clock skew within C1 can be expressed as a linear combination of the clock


skews that exist in the set.
Let C be a circuit with graph GC and let vi0 , ej0 , vi1 , . . . , ejz−1 , viz ≡ vi0 be
an arbitrary sequence of vertices and edges. Formally, the condition for linear
dependence of the clock skews, sj0 , sj1 , . . . , sjz−1 , is


    ∏_{k=0}^{z−1} a_{i_k i_{k+1}} ≠ 0
                                                 ⇒    Σ_{k=0}^{z−1} ± T_Skew (i_k , i_{k+1}) = 0 ,        (7.15)
    (i_z = i_0 ) ≠ i_1 ≠ . . . ≠ i_{z−1}

where the proof of (7.15) is trivial by substitution. The product on the left
side of (7.15) requires that there exists an edge between every pair of vertices
vik and vik+1 (k = 0, . . . , z − 1). The sum in (7.15) can be interpreted4 as
traversing the vertices of the cycle C = vi0 , ej0 , vi1 , . . . , ejz−1 , viz ≡ vi0 in
the order of appearance in C and adding the skews along C with a positive or
negative sign depending on whether the direction labeled on the edge coincides
with the direction of traversal.
Typically, multiple cycles can be identified in a circuit graph and an
equation—such as (7.15)—can be written for each of these cycles. Referring
to Figure 7.1, three such cycles,

C1 = v1 , e1 , v3 , e2 , v4 , e5 , v2 , e3 , v1

C2 = v2 , e4 , v3 , e2 , v4 , e5 , v2

C3 = v1 , e1 , v3 , e4 , v2 , e3 , v1 ,

can be identified and the corresponding linear dependencies written:

cycle C1 → s1 + s2 − s3 + s5 = 0 (7.16)
cycle C2 → s2 − s4 + s5 = 0 (7.17)
cycle C3 → s1 − s3 + s4 = 0. (7.18)

Note that the order of the summations in (7.16), (7.17), and (7.18) has been
intentionally modified from the order of cycle traversal so as to highlight an
important characteristic. Specifically, observe that (7.16) is the sum of (7.17)
and (7.18), that is, there exists a linear dependence not only among the skews
within the circuit C, but also among the cycles (or, sets of linearly dependent
skews).
Note that any minimal set of linearly independent clock skews must not
contain a cycle [as defined by (7.15)] for if the set contains a cycle, the skews
⁴ Note the similarity with Kirchhoff's Voltage Law (KVL or loop equations) for an
electrical network [123].

within the set would not be linearly independent. Furthermore, any such set
must span all vertices (registers) of the circuit or it is not possible to express
the clock skews of any paths in and out of the vertices not spanned by the set.
Given a circuit C with r registers and p local data paths, these conclusions are
formally summarized in the following two results from graph theory [89, 124]:
1. Minimal Set of Linearly Independent Clock Skews. A minimal set of clock
skews can be identified such that (a) the skews within the set are linearly
independent, and (b) every skew in C is a linear combination of the skews
from the set. Such a minimal set is any spanning tree of GC and consists
of exactly r − 1 elements (recall that a spanning tree is a subset of edges
such that all vertices are spanned by the edges in the set). These r − 1
skews (respectively, edges) in the spanning tree are referred to as the skew
basis, while the remaining p − (r − 1) = p − r + 1 skews (edges) of the
circuit are referred to as chords. Note that there is a unique path between
any two vertices such that all edges of the path belong to the spanning
tree.
2. Minimal Set of Independent Cycles. A minimal set of cycles [where a
cycle is as defined by (7.15)] can be identified such that (a) the cycles are
linearly independent, and (b) every cycle in C is a linear combination of
the cycles from the set. Each choice of a spanning tree of GC determines
a unique minimal set of cycles, where each cycle consists of exactly one
chord vi1 , ej , vi2 plus the unique path that exists within the spanning tree
between the vertices vi1 and vi2 . Since there are p − (r − 1) = p − r + 1
chords, a minimal set of independent cycles consists of p−r+1 cycles. The
minimal set of independent cycles of a graph is also called a fundamental
set of cycles [89, 123, 124].
To illustrate the aforementioned properties, observe the two different
spanning trees of the example circuit C1 outlined with the thicker edges in
Figure 7.2 (the permissible ranges and direction labelings have been omitted
from Figure 7.2 for simplicity). The first tree is shown in Figure 7.2(a) and
consists of the edges {e3 , e4 , e5 } and the independent cycles C2 [see (7.17)] and
C3 [see (7.18)]. As previously explained, both C2 and C3 contain precisely one
of the skews not included in the spanning tree—s2 for C2 and s1 for C3 . Simi-
larly, the second spanning tree {e1 , e3 , e5 } is illustrated in Figure 7.2(b). The
independent cycles for the second tree are C1 [see (7.16)] and C3 [see (7.18)]—
generated by s2 and s4 , respectively.
Let a circuit C with r registers and p local data paths be described by
a graph G and let a skew basis (spanning tree) for this circuit (graph) be
identified. For the remainder of this discussion, it is assumed that the skews
have been enumerated such that those skews from the skew basis have the
highest indices.5 Introducing the notation sb for the basis and sc for the chords,
the clock schedule s can be expressed as
⁵ Such enumeration is always possible since the choice of indices for any enumera-
tion (including this example) is arbitrary.

[Figure 7.2: the circuit graph G_C1 drawn twice, with edges e1 (s1 ), e2 (s2 ), e3 (s3 ),
e4 (s4 ), e5 (s5 ): (a) spanning tree {e3 , e4 , e5 }; (b) spanning tree {e1 , e3 , e5 }.]

Fig. 7.2. Two spanning trees and the corresponding minimal sets of linearly in-
dependent clock skews and linearly independent cycles for the circuit example C1 .
Edges from the spanning tree are indicated with thicker lines.

    s = [ s^c ; s^b ] = [ s_1 . . . s_{p−r+1} | s_{p−r+2} . . . s_p ]^t ,                    (7.19)

(here [ x ; y ] denotes stacking x above y; the first p − r + 1 entries are the chords s^c
and the last r − 1 entries the basis s^b ), where

    s^c = [ s_1 , . . . , s_{p−r+1} ]^t    and    s^b = [ s_{p−r+2} , . . . , s_p ]^t .      (7.20)
Note that the case illustrated in Figure 7.2(a) is precisely the type of enumer-
ation just described by (7.19) and (7.20)—e1 , e2 (s1 , s2 ) are the chords and
e3 , e4 , e5 (s3 , s4 , s5 ) are the basis.

With the notation and enumeration as specified above, let nb = r − 1 be


the number of skews (edges) in the basis and nc = p − r + 1 = p − nb be
the number of chords (equal to the number of cycles). The set of linearly
independent cycles is C1 , . . . , Cnc and the clock skew dependencies for these
cycles are
    cycle C_1 = v_{i_0^1} , e_{j_0^1} , v_{i_1^1} , . . . , e_{j_{z−1}^1} , v_{i_0^1}      →    0 = Σ_k ± s_{j_k^1}
        ⋮                                                                                       (7.21)
    cycle C_{n_c} = v_{i_0^{n_c}} , e_{j_0^{n_c}} , v_{i_1^{n_c}} , . . . , e_{j_{z−1}^{n_c}} , v_{i_0^{n_c}}      →    0 = Σ_k ± s_{j_k^{n_c}} .

Note that the sums in (7.21) can be written in matrix form,

Bs = 0, (7.22)

where B = [bij ]nc ×p is a matrix of order nc × p. The matrix B is called the


circuit connectivity matrix and each row of B corresponds to a cycle of the
circuit graph and contains elements from the incidence matrix A combined
with zeroes depending on whether a skew (an edge) belongs to the cycle or not.
Note that since each cycle contains exactly one chord, the cycles can always
be permuted such that the cycles appear in the order of the chords, i.e., C1
corresponds to e1 , C2 corresponds to e2 and so on. If this correspondence is
applied, the matrix B can be represented as

    B = [ I_{n_c}  C_{n_c × n_b} ] ,                                  (7.23)

where the submatrix Inc is an identity6 matrix of dimension nc × nc , thereby


permitting (7.22) to be rewritten as
 

    Bs = [ I  C ] [ s^c ; s^b ] = s^c + C s^b = 0.                    (7.24)

Consider, for instance, the choice of spanning tree illustrated in Figure 7.2(a).
There are two independent cycles denoted by C1 [corresponding to C2 in (7.17)]
and C2 [corresponding to C3 in (7.18)]. The matrix relationship (7.22) for this
case is

s1 − s3 + s4 = 0 ← cycle C1 = v1 , e1 , v3 , e4 , v2 , e3 , v1
s2 − s4 + s5 = 0 ← cycle C2 = v3 , e2 , v4 , e5 , v2 , e4 , v3

⁶ Recall that an identity matrix In is a square n × n matrix such that the only
nonzero elements are on the main diagonal and are all equal to one.

and the matrices B and C, respectively, are


 

    B = [ I_2  C_{2×3} ] = [ 1  0  −1   1  0 ]
                           [ 0  1   0  −1  1 ] ,
                                                                       (7.25)
    C = [ −1   1  0 ]
        [  0  −1  1 ] .
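As a quick numerical check (an illustrative sketch only, assuming the edge ordering
e1 , . . . , e5 of Figure 7.1), the kernel equation (7.22) can be verified for the schedule of
Table 7.1 using the matrices in (7.25):

import numpy as np

C = np.array([[-1.0,  1.0, 0.0],
              [ 0.0, -1.0, 1.0]])
B = np.hstack([np.eye(2), C])              # B = [I  C], as in (7.23) and (7.25)

t_cd = {1: 1.0, 2: 2.0, 3: 0.0, 4: 2.5}    # clock delays for T_CP = 5 (Table 7.1)
edges = [(1, 3), (3, 4), (1, 2), (3, 2), (4, 2)]          # e1 .. e5
s = np.array([t_cd[i] - t_cd[j] for i, j in edges])       # s_k = T_Skew over e_k

print(B @ s)                               # [0. 0.]: every feasible s lies in ker(B)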

From an algebraic standpoint [125], (7.22) requires that any clock schedule
s must necessarily be in the kernel ker(B) of the linear transformation B :
Rp → Rnc , i.e., s ∈ ker(B). The inverse situation, however, is not true, that is,
an arbitrary element of the kernel is not necessarily a feasible clock schedule.
Furthermore, note that B is already in reduced row echelon form [125] so the
rank of B is rank(B) = nc . Thus, the dimension of ker(B) is [125]

dim [ker(B)] = columns of B − rank(B)


= p − rank(B) (7.26)
= p − nc = nb .

Therefore, (7.22) is referred to here as the circuit kernel equation.


This last result expressed by (7.26) demonstrates that there are nb = r − 1
linearly independent skews in a circuit. Furthermore, considering that the
matrix C is written column-wise as
    C = [ c_1  c_2  . . .  c_{n_b} ] ,
one possible basis for ker(B) can be written from inspection:
    basis for ker(B) = { [ −c_1 ; 1 ; 0 ; … ; 0 ] , [ −c_2 ; 0 ; 1 ; … ; 0 ] , . . . , [ −c_{n_b} ; 0 ; … ; 0 ; 1 ] }       (7.27)
(n_b vectors in total).

Any feasible clock schedule s ∈ ker(B) can be expressed as a linear combina-


tion of the vectors from the basis of the kernel,
    s = [ s^c ; s^b ] = s_1^b [ −c_1 ; 1 ; 0 ; … ; 0 ] + s_2^b [ −c_2 ; 0 ; 1 ; … ; 0 ] + . . . + s_{n_b}^b [ −c_{n_b} ; 0 ; … ; 0 ; 1 ] = [ −C s^b ; s^b ] ,      (7.28)

where the scalars s_1^b , s_2^b , . . . , s_{n_b}^b in (7.28) are the elements of the vector s^b [as
defined by (7.19)]:
    s^b = [ s_1^b , s_2^b , . . . , s_{n_b}^b ]^t = [ s_{n_c+1} , s_{n_c+2} , . . . , s_p ]^t .        (7.29)
Observe that either knowing or deliberately choosing sb not only provides
sufficient information to determine the corresponding sc (respectively, the
entire s), but also permits computation of the clock delays tcd to implement
the desired clock schedule s. Specifically, the dependencies among the clock
skews in the branches (the local data paths) and the clock delays to the
vertices (the registers) can be described in matrix form as follows:
    s^b = T_{n_b × r} t_cd .                                           (7.30)
Note that each skew is the difference of two clock delays so that each row
of the matrix T in (7.30) contains exactly two nonzero elements. These two
nonzero elements are 1 and −1, respectively, depending upon which two clock
delays determine the clock skew corresponding to this equation (or row in the
matrix). Also note that (7.30) is a consistent linear system (the rows corre-
spond to linearly independent skews within the circuit) with fewer equations
than the r unknown clock delays tcd . Therefore, (7.30) has an infinite number
of solutions all corresponding to the same clock schedule s.
Finding a solution tcd of (7.30) is now a straightforward matter. For ex-
ample, setting trcd = 0 and rewriting (7.30) to account for this substitution,
    t_cd^r = 0   ⇒   s^b = T*_{n_b × n_b} [ t_cd^1 . . . t_cd^{n_b} ]^t ,        (7.31)
yields a consistent linear system with the same number of variables as equa-
tions where the matrix T∗nb ×nb is the matrix Tnb ×r with the rightmost column
deleted. The most efficient way to solve the system characterized by (7.31)
with the highest accuracy is by back substitution (only addition/subtraction
operations are necessary). In the software implementation of this algorithm
discussed in this work, tcd is computed in an efficient way by traversing the
edges of the spanning tree.
This section concludes by illustrating the concepts discussed above
on a small circuit example C1 [the circuit graph GC1 is shown in Figure 7.1
and the respective spanning tree is shown in Figure 7.2(a)]. For this circuit,
r = 4, the number of local data paths is p = 5 and nb = 4 − 1 = 3. The clock
schedule is
    s = [ s^c ; s^b ] ,  where  s^c = [ s_1 , s_2 ]^t ,  s^b = [ s_3 , s_4 , s_5 ]^t .        (7.32)
The independent cycles are C2 [from (7.17)] and C3 [from (7.18)] and the
matrices B and C are as defined in (7.25). A basis for the kernel of B has a
dimension nb = 3 and consists of the vectors,
    [ 1  0  1  0  0 ]^t ,   [ −1  1  0  1  0 ]^t ,   and   [ 0  −1  0  0  1 ]^t .        (7.33)
Any clock schedule is in ker(B) and can be expressed as a linear combination
of the vectors from the kernel basis,
    s = s_3^b [ 1  0  1  0  0 ]^t + s_4^b [ −1  1  0  1  0 ]^t + s_5^b [ 0  −1  0  0  1 ]^t .        (7.34)
Consider, for instance, the clock skew schedule for TCP = 6.5 shown in Ta-
ble 7.2. Substituting s3 = 0, s4 = −1.5 and s5 = −1 into (7.34) yields the
clock schedule,
    s = 0 [ 1  0  1  0  0 ]^t − 1.5 [ −1  1  0  1  0 ]^t − 1 [ 0  −1  0  0  1 ]^t = [ 1.5  −0.5  0  −1.5  −1 ]^t .        (7.35)
Finally, the clock delays tcd are derived from the underdetermined linear
system [as described by (7.30)],
    s^b = [ 0  −1.5  −1 ]^t = [ 1 −1 0 0 ; 0 −1 1 0 ; 0 −1 0 1 ] [ t_cd^1  t_cd^2  t_cd^3  t_cd^4 ]^t ,        (7.36)

where setting t4cd = 0 yields


    s^b = [ 0  −1.5  −1 ]^t = [ 1 −1 0 ; 0 −1 1 ; 0 −1 0 ] [ t_cd^1  t_cd^2  t_cd^3 ]^t    ⇒    t_cd^1 = 1 ,  t_cd^2 = 1 ,  t_cd^3 = −0.5 .        (7.37)

Interestingly, the clock schedule [ 1  1  −1/2  0 ]^t differs from the solution shown
in Table 7.2 by only a constant of c = −1/2. Namely,
    [ 1  1  −1/2  0 ]^t = [ c + 3/2   c + 3/2   c + 0   c + 1/2 ]^t .        (7.38)
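The computation in (7.36) and (7.37) can be reproduced mechanically. The sketch
below is illustrative only; it substitutes a dense linear solve for the spanning-tree
traversal described above:

import numpy as np

T = np.array([[1.0, -1.0, 0.0, 0.0],       # s3 = t1 - t2
              [0.0, -1.0, 1.0, 0.0],       # s4 = t3 - t2
              [0.0, -1.0, 0.0, 1.0]])      # s5 = t4 - t2
s_b = np.array([0.0, -1.5, -1.0])          # basis skews for T_CP = 6.5 (Table 7.2)

T_star = T[:, :3]                          # drop the rightmost column, i.e., t4 := 0
t_cd = np.append(np.linalg.solve(T_star, s_b), 0.0)
print(t_cd)                                # [ 1.   1.  -0.5  0. ], matching (7.37)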

7.2.3 Optimization Problem and Solution

Recall the intuitive definition of clock skew scheduling as a Quadratic Pro-


gramming (QP) problem first introduced in Section 7.1.4. In this section, the

QP formulation is formalized and the solution of the problem is explained in


detail.

Problem QP-1 (QP Clock Skew Scheduling)

Let C be a circuit with r registers, p local data paths and a target clock
period TCP , and let the local data paths be enumerated as

    p local data paths:   path_1 → Ri1 ;Rj1 ,   . . . ,   path_p → Rip ;Rjp .        (7.39)

For each local data path pathk (Rik ;Rjk ) within C, let the lower bound lik ,jk ,
upper bound uik ,jk , width wik ,jk , and middle mik ,jk of the permissible range
of this path be defined as in (7.1), (7.2), (7.3), and (7.4), respectively. For
simplicity, these parameters of the permissible range are denoted with a single
subscript corresponding to the number of the respective local data path, that
is, for the pathk ≡ Rik ;Rjk , lik ,jk = lk , uik ,jk = uk , wik ,jk = wk , and mik ,jk = mk .
Furthermore, let the circuit graph of C be GC , let the skew basis sb and
chords sc be identified in GC [according to (7.19)], and let the corresponding

independent set of cycles be described by the matrix B = I C [as defined



t
t
in (7.23)]. Let an objective clock schedule be g = g1 . . . gp = m1 . . . mp ,

t
t
and let l = l1 . . . lp and u = u1 . . . up be the vectors of the lower and
upper bounds, respectively, of the permissible ranges. Find a feasible clock
schedule s that minimizes the least square error ε between s and g. Formally,


    min   ε = Σ_{k=1}^{p} ( s_k − g_k )²
    subject to:   Bs = 0                                               (7.40)
                  l ≤ s
                  s ≤ u ,

where the inequalities in (7.40) are treated componentwise, i.e., l1 ≤ s1 ≤ u1 ,


l2 ≤ s2 ≤ u2 , and so on.

Problem QP-1 is a constrained QP problem with bounded variables—


methods such as active constraints exist for solving such problems [126, 127,
128, 129, 130]. These methods are both analytically and numerically challeng-
ing. A two-phase solution process is suggested here that includes the solution
of a constrained version of Problem QP-1 as the first phase. If the result is
infeasible, a rapidly converging iterative refinement of the objective g is per-
formed until the feasibility of s is satisfied. This two-phase process is defined
formally as


    Phase 1  →   min   ε = Σ_{k=1}^{p} ( s_k − g_k )²                 (7.41)
                 subject to   Bs = 0
    Phase 2  →   Iterative refinement of s ,

where Phase 1 is an equality-constrained quadratic optimization problem ex-


pressed as the following problem QP-2:

Problem QP-2 (QP Clock Skew Scheduling)

    min   ε = (s − g)² = Σ_{k=1}^{p} ( s_k − g_k )²
    subject to:   Bs = 0.                                              (7.42)

Problem QP-2 is representative of a broader class of optimization problems


where the function that is minimized is a distance in the Euclidean space Rn .
One typical problem that arises in a variety of situations, for instance, is the
linear least squares problem. The objective of the linear least squares problem
is to find x∗ ∈ Rn such that the Euclidean distance between Dx∗ ∈ Rm and
b ∈ Rm is as small as possible. The matrix D is an m × n matrix and the
system Dx = b is typically inconsistent. The function being minimized in the
linear least squares problem is
    Σ_{i=1}^{m} ( d_i^t x − b_i )² ,     where   D = [ d_1  . . .  d_m ]^t .

It is well known [125, 130] that if the kernel of D is ker(D) = {0}, then x∗ is
the solution of the consistent system Dt Dx = Dt b.
The quadratic programming problem QP-2 is solved by applying the clas-
sical method of Lagrange multipliers for constrained optimization [131, 129,
130]. To start, note that minimizing the objective function ε in (7.42) is equiv-
alent to minimizing the function,

ε∗ = st s − 2gt s.

For a quick proof of this equivalence, consider expanding the value of ε,


    ε = (s − g)²
      = (s)² − 2 g^t s + (g)²                                          (7.43)
      = s^t s − 2 g^t s + g^t g ,

where the inner product gt g in (7.43) is a numeric constant. Therefore, if a


value s = s∗ exists which minimizes ε∗ in (7.43), s∗ also minimizes ε. Note
that since ε∗ = ε − gt g, the two minimums are related by

min(ε∗ ) = min(ε) − gt g. (7.44)

Thus, problem QP-2 is transformed into the following problem QP-3:

Problem QP-3 (QP Clock Skew Scheduling)

min ε∗ = st s − 2gt s
subject to: Bs = 0. (7.45)

To apply the method of Lagrange multipliers to problem QP-3, the vector



λ = [ λ_1 . . . λ_{n_c} ]^t is introduced, where each multiplier λi in λ corresponds to
the i-th equality constraint from Bs = 0. The Lagrangian function L(s, λ) is
introduced next,

    L(s, λ) = ε* + λ^t Bs
             = s^t s − 2 g^t s + λ^t Bs ,                              (7.46)

where the term λt Bs in (7.46) is the sum over all equality constraints of the
product of the i-th constraint times the multiplier λi .
Any extremum of ε∗ must be a stationary point of the Lagrangian
L(s, λ) [125], that is, the first derivatives of L(s, λ) with respect to si where
i ∈ {1, . . . , p} and λj where j ∈ {1, . . . , nc } must be zero. Formally, if the
differential operator is denoted as ∇, then any stationary point (s∗ , λ∗ ) of
L(s, λ) is a solution of the system of equations,
    ∇L(s, λ) = 0   ⇒   ∇_s L(s, λ) = 0   and   ∇_λ L(s, λ) = 0.        (7.47)

In the general case of a QP problem with any type of constraints, systems


such as (7.47) can be non-linear and difficult to solve. In the case of linear

constraints, however, a solution can be derived in a straightforward manner.


To this end, consider the derivatives, ∇_s L(s, λ) and ∇_λ L(s, λ), of the La-
grangian,
    ∇_s L(s, λ) = ∇_s ( s^t s − 2 g^t s + λ^t Bs )
                = 2s − 2g + (λ^t B)^t                                  (7.48)
                = 2s − 2g + B^t λ ,
and
    ∇_λ L(s, λ) = ∇_λ ( s^t s − 2 g^t s + λ^t Bs ) = Bs.               (7.49)
Note that (7.48) and (7.49) contain p and nc equations, respectively (recall
that s and λ have p and nc variables, respectively). Therefore, the solution
of (7.47) requires finding exactly p + nc = 2p − nb = 2p − r + 1 variables.
Substituting (7.48) and (7.49) back into (7.47) yields the linear system,
    2s + B^t λ = 2g
    Bs = 0 ,                                                           (7.50)

which can be conveniently written in matrix form,
    [ 2I_p  B^t ; B  0 ] [ s ; λ ] = 2 [ g ; 0 ] .                     (7.51)

Solving (7.51) by Gauss-Jordan elimination is straightforward by premultiply-
ing with (1/2)B the first row of the system described by (7.51) and subtracting
the result from the second row, thereby yielding
    [ 2I_p  B^t ; B  0 ] [ s ; λ ] = 2 [ g ; 0 ]    ⇒    [ 2I_p  B^t ; 0  BB^t ] [ s ; λ ] = 2 [ g ; Bg ] .        (7.52)

A natural way to solve the linear system described by (7.52) is by back substi-
tution,7 such that λ is initially computed, followed by the computation of s.
The Lagrange multipliers λ are determined from the equation (BBt )λ = 2Bg
in the second row of (7.52), where the right-hand side 2Bg is a non-zero vec-
tor, that is, Bg ≠ {0}. The opposite situation, Bg = {0}, is highly unlikely
to occur since Bg = {0} means that g ∈ ker(B), which in turn means [re-
call (7.26) through (7.29)] that the objective clock schedule g is feasible and
no optimization needs to be performed.8
Therefore, the equation (BBt )λ = 2Bg in (7.52) can have either no so-
lutions or exactly one solution depending upon whether the matrix BBt is
singular or not. In other words, the non-singularity of BBt is a necessary and

sufficient condition for the existence of a unique solution [ ŝ^t  λ̂^t ]^t of (7.51). If
the product BBt is denoted by M, note that the symmetric nc × nc matrix,
⁷ Since the coefficient matrix is an upper triangular matrix.
⁸ The chances of g being feasible for a large real circuit are infinitesimally small.
    M = BB^t = [ I  C ] [ I ; C^t ] = I + CC^t ,                       (7.53)

is strictly positive-definite and thus nonsingular. Therefore, the system (7.51)


is absolutely guaranteed to have a unique solution,

    λ̂ = 2 M⁻¹ B g                                                     (7.54)
    ŝ = −(1/2) B^t λ̂ + g = −( B^t M⁻¹ B ) g + g ,                     (7.55)
where the matrix M is as introduced in (7.53).
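For illustration only, the closed-form solution (7.54) and (7.55) can be evaluated
numerically for circuit C1. The sketch below assumes the edge ordering of Figure 7.1,
the matrices B and C of (7.25), and the objective g chosen as the permissible-range
middles for T_CP = 6.5 (columns four and eight of Table 7.2); the recovered skews
correspond to the clock schedule of (7.12).

import numpy as np

C = np.array([[-1.0,  1.0, 0.0],
              [ 0.0, -1.0, 1.0]])
B = np.hstack([np.eye(2), C])                    # B = [I  C]
M = np.eye(2) + C @ C.T                          # M = I + C C^t, eq. (7.53)

g = np.array([0.25, -0.5, 1.25, -2.75, 0.25])    # objective skews (middles, T_CP = 6.5)

lam = 2.0 * np.linalg.solve(M, B @ g)            # Lagrange multipliers, eq. (7.54)
s_hat = g - B.T @ np.linalg.solve(M, B @ g)      # optimal skews, eq. (7.55)

print(s_hat)                      # [ 1.34375 -0.96875  0.15625 -1.1875  -0.21875]
print(B @ s_hat)                  # [0. 0.]: the schedule is consistent
print(np.sum((s_hat - g) ** 2))   # 5.2734375 = 675/128, as reported in Section 7.1.3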
To gain further insight into the solution described by (7.51) through (7.55),
consider substituting (7.23) for B into (7.51), and representing the column
vector g of the objective clock skew schedule as
    g = [ g^c ; g^b ] ,                                                (7.56)

where gc and gb correspond to sc and sb , that is, g1 is the objective value of


the clock skew s1 , g2 is the objective value of the clock skew s2 and so on.
With these substitutions, the system represented by (7.51) can be written as
    [ 2I_{n_c}  0  I_{n_c} ; 0  2I_{n_b}  C^t ; I_{n_c}  C  0 ] [ s^c ; s^b ; λ ]  =  K [ s^c ; s^b ; λ ]  =  2 [ g^c ; g^b ; 0 ] ,        (7.57)

where the coefficient matrix K on the left is symmetric. In (7.57), the Gaussian
elimination step described by (7.52) is equivalent to multiplying by 1/2 the first
row of K, premultiplying by (1/2)C the second row of K and subtracting both
of these rows from the third row:
    [ 2I  0  I ; 0  2I  C^t ; I  C  0 ] [ s^c ; s^b ; λ ] = 2 [ g^c ; g^b ; 0 ]
        ⇒    [ 2I  0  I ; 0  2I  C^t ; 0  0  I + CC^t ] [ s^c ; s^b ; λ ] = 2 [ g^c ; g^b ; g^c + Cg^b ] .        (7.58)

Observe that the linear system of (7.58) is simply a more detailed technique
for rendering the linear system described by (7.52) where the first row of (7.52)
has been expanded into the first two rows of (7.58):
    BB^t = [ I  C ] [ I ; C^t ] = I + CC^t                             (7.59)
    Bg   = [ I  C ] [ g^c ; g^b ] = g^c + C g^b .                      (7.60)

With the matrix M as defined in (7.53), the solution of (7.58) is

    λ̂   = 2 M⁻¹ B g ,                                                 (7.61)
    ŝ^b = −(1/2) C^t λ̂ + g^b ,                                        (7.62)
    ŝ^c = −(1/2) λ̂ + g^c .                                            (7.63)
As a final note, observe that the solution described by (7.54) and (7.55) is
not only a stationary point of the Lagrangian function L(s, λ) (i.e., a potential
local minimizer) but also a global minimizer of ε∗ in (7.45) [130]. As a matter
of fact, problem QP-3 belongs to a broader class of optimization problems
where the function being minimized is of the form f (x) = xt Zx + yt x (note
that in the case of problem QP-3 the matrix Z is the positive-definite identity
matrix Ip ). A proof can be found in [130] that if Z is positive-definite, a
solution process similar to the process represented by (7.46) through (7.55)
can be applied to obtain a unique global minimizer of f (x) = xt Zx + yt x.
Reference [130] provides a most thorough treatment of this subject as well as
proofs of the existence and uniqueness of the solution.
8
Delay Insertion and Clock Skew Scheduling

As briefly mentioned in Chapter 4, delay insertion into the logic network1


can be used as a post-processing step in mainstream digital integrated circuit
design flow in order to solve the short-path (hold time) timing violations
of synchronous circuits. The drawbacks of delay insertion, such as increased
circuit area and power dissipation, are usually disregarded in favor of achieving
a feasible timing schedule.
In this chapter, a delay insertion algorithm into the logic network that
improves the efficiency and results of clock skew scheduling is presented. By
systematic delay insertion, a higher operating speed or improved reliability
is achieved through clock skew scheduling. It is known that the minimum
clock period of a synchronous circuit achievable through clock skew scheduling
is limited by the uncertainties of the data propagation times on local data
paths [2] and the total data propagation times on data path loops [117]2 .
It has been shown recently by Taskin [132] that the reconvergent local data
paths also introduce an additional theoretical limit on the minimum clock
period of a synchronous circuit achievable through clock skew scheduling. This
limitation caused by reconvergent paths is theoretically derived and a delay
insertion method is defined in order to mitigate this limitation. Overall, these
limitations can be used to quickly and efficiently calculate the improvements
achievable through clock skew scheduling, without having to apply clock skew
scheduling. Based on the improvements achievable for a particular circuit, the
design team can decide whether or not to allocate resources in the design
budget to perform clock skew scheduling and non-zero clock skew clock tree
synthesis.

¹ Note that clock skew scheduling also entails delay insertion, however, into the
clock distribution network.
² Dependence of the minimum clock period TCP on the uncertainty of data prop-
agation times (between D̂_{Pm}^{i,f} and D̂_{PM}^{i,f} for Ri ;Rf ) is visible in the Problem LCSS
definition in Section 7.1.1. The linear dependency of clock skew values on data
path cycles is explained in Section 7.2.2.

In this section, the limitations on the minimum clock period caused by all
three factors are derived as applied to edge-triggered circuits. The limitations
for level-sensitive circuit implementations can be derived similarly. It is shown
that through systematic delay insertion, the limitation on the minimum clock
period achievable through clock skew scheduling can be mitigated. In other
words, the improvements achieved through clock skew scheduling can further
be increased by additional delay insertion into the logic network, simulta-
neously with the application of clock skew scheduling. For a fully-automated
application, the proposed delay insertion method is implemented as a Linear
Programming (LP) problem in the tradition of the clock skew scheduling applications
presented in Chapters 5 and 7. The application of the delay insertion method
is demonstrated both for edge-triggered and level-sensitive circuits.

8.1 Limitations on Minimum Clock Period


Both zero clock skew and non-zero clock skew circuits are subject to limi-
tations in the minimum clock period at which these circuits are fully opera-
tional. Remember from Section 5.3 that the limit for a zero clock skew circuit
is the slowest local data path of the circuit (the path with the largest delay
D̂Pi,fM ). Consequently, a timing analysis of zero clock skew circuits is centered
around identifying the N slowest local data paths of a circuit and ensuring
that there are no timing hazards on any of the local data paths for a given
clock period. Typically, this type of timing analysis is performed with the
goal of satisfying all setup time constraints on the N selected paths. As men-
tioned in Sections 4.7 and 4.8, this objective can be achieved by lowering the
clock frequency until all setup time constraints of the form of (4.8) [where
TSkew (i, f )=0] are satisfied. Any remaining hold time violations can then be
removed by inserting delay elements—a procedure called delay padding [74].
The limitations on non-zero clock skew circuits are more complicated.
These limitations are caused by various circuit topologies and, unlike zero
clock skew circuits, both setup and hold time violations are hard to remove.
The limitations on the minimum clock period of non-zero clock skew circuits
are caused by the following three factors:
1. Uncertainty of the data propagation time along the local data paths [2],
2. The total data propagation time of data path cycles [117],
3. The difference between the total data propagation time on reconvergent
paths [132].
The first of these three limitations occurs on every single local data path
of a synchronous circuit while the second and third limitations only occur
on those circuits where the topology of the circuit graph includes cycles and
reconvergent paths, respectively. A circuit with all three limitations will ulti-
mately be affected by the most dominant limitation. In this section, these
limitations are described for edge-triggered circuits—equivalent limitations on

level-sensitive circuits can be similarly derived. In the rest of this chapter, it


is assumed that reconvergent paths are the dominant limiting factor on the
minimum clock period of a synchronous circuit achievable through clock skew
scheduling over other limiting factors of delay uncertainty and data path cy-
cles. This assumption does not invalidate the generality of the discussion; it
is adopted in order to simplify the presentation of the delay insertion process
(which is effective only for circuits where reconvergent paths are the most
limiting factor).

8.1.1 Uncertainty of Data Propagation Times

The uncertainty of the data propagation times is modeled by the corner-based


(best-case/worst-case or min/max) timing delay models in timing analysis.
The algebraic difference between the maximum data propagation time DPi,fM
and the minimum data propagation time DPi,fm on a local data path Ri ;Rf
constitutes the delay uncertainty. For a critical local data path, the trailing
edge of the previous clock cycle is the hold time before the earliest arrival of
the data signal Df at register Rf . The trailing edge of the current clock cycle
is the setup time after the latest arrival of the data signal Df at register Rf .
This situation is depicted on an example edge-triggered local data path in
Figure 8.1. Note that in Figure 8.1, the tolerances of the clock signals are ignored
for the sake of simplicity.

[Figure 8.1: (a) a sample local data path Ri ;Rf with delays D_{Pm}^{i,f} , D_{PM}^{i,f} ;
(b) the corresponding timing diagram showing the clock signals Ci and Cf , the
clock-to-output delays D_{CQm}^{Fi} and D_{CQM}^{Fi} , the earliest and latest data arrivals
af and Af , the hold time δ_H^{Ff} , the setup time δ_S^{Ff} , and the clock period T_CP .]

Fig. 8.1. Limitation on the minimum clock period TCP caused by the delay uncer-
tainty of a local data path.

For such a critical timing path, the setup and
hold time constraints (that are modeled with inequalities) satisfy the equality
conditions3 . Due to this limitation, the clock period cannot be minimized any
further than:
    min T_CP = max_{∀Ri ;Rf} [ ( Δ_L^{Fi} + D_{CQM}^{Fi} + D_{PM}^{i,f} + δ_S^{Ff} )
                              − ( Δ_L^{Fi} + D_{CQm}^{Fi} + D_{Pm}^{i,f} + δ_H^{Ff} ) ]          (8.1)
              = max_{∀Ri ;Rf} ( D̂_{PM}^{i,f} − D̂_{Pm}^{i,f} ).

The shaded region in Figure 8.1 illustrates the timing criticality, causing the
limitation on TCP .
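As a toy illustration of (8.1) (hypothetical delay values, not from any circuit in this
book), the uncertainty bound is simply the largest per-path difference between the
long and short path delays:

# (D^_Pm, D^_PM) per local data path; values are hypothetical
paths = {("R1", "R2"): (1.0, 3.0),
         ("R3", "R2"): (5.0, 7.0),
         ("R3", "R4"): (2.5, 5.0)}

t_cp_uncertainty = max(d_max - d_min for d_min, d_max in paths.values())
print(t_cp_uncertainty)        # 2.0 time units for these values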

8.1.2 Data Path Cycles

Limitations due to data path cycles arise from the linear dependency of the
clock skews of the local data paths along a cycle, as explained in Section 7.2.2.
In a zero clock skew circuit, the circuit topology is irrelevant in the timing
analysis because each local data path is analyzed independent of any neigh-
boring paths. The timing of neighboring local data paths in a non-zero clock
skew circuit, however, is interdependent. For a cycle of local data paths, this
interdependency regains the form described in Section 7.2.2. In this linear de-
pendency form, the minimum clock period is limited by the (timing) criticality
of the local data paths along the cycle (in addition to the limitations caused
by the delay uncertainties of each local data path along the cycle, which are
the limitations explained in Section 8.1.1). The data path cycle limitation is
illustrated for a sample local data path cycle in Figure 8.2.
The cyclic traveling path for the data signal over a data path cycle, such
as the example circuit shown in Figure 8.2, leads to stringent operating condi-
tions under non-zero clock skew. The local data paths along the cycle operate
without any slack time, because any existing slack on these local data paths
is distributed over the paths through the mechanics of the clock skew schedul-
ing process. In such circuits where a data path cycle is (timing) critical, the
minimum clock period depends on two factors. The first factor is the number
of registers n along the cycle. For n registers on the cycle, n clock cycles must
have passed after each completion of the cycle through a register. The second
factor is the total delay of the data signal over the local data paths along
the cycle. This total delay time includes the setup time δSF f and maximum
Fi
clock-to-output time DCQM of each register along the cycle, the maximum
i,f
data propagation time DP M of each local data path along the cycle, and the
tolerances of the clock signal (which are ignored below for simplicity). The

3
These constraints have no available slack for improvement.
8.1 Limitations on Minimum Clock Period 149

Data path cycle: n local data paths

Rk


R(n−1) R2


R1


(a) A sample local data path cycle.

nTCP

F1
C1
DCQM ... δsF 1

DP12M
(n−1)1
DP M
δsF 2 F2
DCQM ...
C2

DP23M
.. (n−2)(n−1)
. DP M
F (n−1) F (n−1)
C(n−1) δ.s . . DCQM

(b) Data path cycle timing.

Fig. 8.2. Limitation on the minimum clock period TCP caused by data path cycles.

limitation on the minimum clock period by the data path cycles is given by
the following formula:
  
i,f
ΔFL
i
+ D Fi
CQM + DPM + δ Ff
S
∀Ri ; Rf on cycle
min TCP =
  n (8.2)
i,f
D̂P M
∀Ri ; Rf on cycle
=
n
150 8 Delay Insertion and Clock Skew Scheduling

The shaded region in Figure 8.2 illustrates the timing criticality, causing the
limitation on TCP .

8.1.3 Reconvergent Paths

A reconvergent path is composed of a series of two or more local data paths


with a common source register (divergent register) and a common sink register
(convergent register). A reconvergent system is composed of at least two paral-
lel reconvergent paths. The interdependency of the timing of local data paths
in a non-zero clock skew system occurs explicitly in a reconvergent system
because of the reconvergent fanout. In a reconvergent system, a data signal
that is initially stored in the divergent register starts propagating simultane-
ously through all of the reconvergent paths. The signals that are processed on
the reconvergent paths arrive at the convergent register at (possibly) different
times. In the case of nonidentical numbers of registers in two reconvergent
paths, the data signals arrive at the convergent register during different clock
cycles. The timing of all reconvergent paths is satisfied by collectively ana-
lyzing the arrival time of the data signals at the convergent register over a
duration of (possibly) multiple clock cycles. In Figure 8.3, the limitation such
a reconvergent system imposed on the minimum clock period of a non-zero
clock skew circuit is illustrated.
In Figure 8.3, two reconvergent paths with m and n registers, respectively,
are considered. The total propagation time of the data signal on the two re-
convergent paths are shown. Let the propagation time on the reconvergent
paths with m and n registers be the longest and shortest total propagation
times, respectively. After propagating along these two paths, m and n clock
cycles must have elapsed, respectively, by the time the data signals arrive at
the convergent register. When critical, the reconvergent path with n registers
is matched with the trailing edge of the (n−1)-th clock cycle, while the recon-
vergent path with m registers is matched with the trailing edge of the (m)-th
clock cycle. Thus, the algebraic difference between the two total data propa-
gation times along the reconvergent paths limits the minimum clock period.
Mathematically, the limitation of the reconvergent paths on the minimum
clock period of non-zero clock skew circuits is given by
path1
P DM − P Dm
path2
+ δSF convergent + δH
F convergent
min TCP = , (8.3)
|m − n + 1|
pathp pathp
where P DM and P Dm represent the maximum and minimum total data
propagation times between the divergent and convergent registers over paths
path 1 and path 2, respectively.
Unlike the limitations caused by the delay uncertainty of the local data
paths and the total data propagation times along the data path cycles,
the limitations caused by reconvergent paths can be mitigated. The miti-
gation procedure offered in [132] involves systematic delay insertion on one
8.1 Limitations on Minimum Clock Period 151
path1 path1

P Dm , P DM

path1: m registers

Ri1 Rim →

Rd Rc


Rj1 Rjn →

path2: n registers
path2 path2

P Dm , P DM
(a) A sample reconvergent path system.

Fd
DCQm
Cd ... ...
Fd
DCQM path2
P Dm path2
P DM
path1
P Dm path1
P DM

... Fc ac ... Ac
Cc δH δsF c

|m − n + 1|TCP
(b) Reconvergent path system timing diagram.

Fig. 8.3. Limitation on the minimum clock period TCP caused by reconvergent
paths.

or more
 of the reconvergent
 paths in order to decrease the algebraic differ-
path1
ence P DM − P Dm path2
of (8.3), which consequently improves the mini-
path2
mum clock period TCP . Note that it is possible to increase path delay P Dm
path1
without increasing P DM because both paths are determined by two differ-
ent series of local data paths.4

4
The minimum and maximum total data propagation times along a reconvergent
system may be observed on the same reconvergent path. In such a case, delay
insertion is not beneficial.
152 8 Delay Insertion and Clock Skew Scheduling

8.2 Delay Insertion Method

When clock skew scheduling is applied to a synchronous circuit, a set of op-


timal values that satisfy the objective function (e.g. consider clock period
minimization) are assigned to the clock delays at each register. Certain data
paths become critical timing paths because of the distribution of these optimal
clock delays. In this section, the consequences of criticality to the short and
long path constraints of a reconvergent path are analyzed. It is demonstrated
that when the short and long path constraints of a reconvergent path are
critical, the minimum clock period can be improved via delay insertion. Note
that criticality of the constraints of a reconvergent path adheres to the prelim-
inary assumption that the limitation caused by this reconvergent path system
is dominant over other limitations. For circuits where limitations caused by
other factors are dominant, improvement through delay insertion is not possi-
ble. In experimentation, such circuits are reported to be one of the two cases
where the delay insertion method is inapplicable (e.g., delay insertion method
is not beneficial).
Let the source and sink registers in a reconvergent path system be called
the divergent register Rd and the convergent register Rc , respectively. Let
pd{i1 ...in }c define a reconvergent path starting from register Rd , continuing
through the intermediate registers Ri1 , . . ., Rin and ending at register Rc .
The number of intermediate registers rd{i1 ...in }c = n is a non-negative integer
number (n ∈ Z + ∪ {0}) and the path is acyclic [∀in , im : Rd = Rin , Rin = Rim ,
Rd = Rc and Rin = Rc ]. In the sample circuit modeled in Figure 7.1 (page 129),
for instance, there are three reconvergent paths between v1 and v2 , p12 , p132
and p1342 , where the numbers of intermediate registers for the three reconver-
gent paths of this circuit are r12 = 0, r132 = 1 and r1342 = 2, respectively. The
path delay P Dd{i1 ...in }c of a reconvergent path pd{i1 ...in }c is defined as the to-
tal data propagation time between the divergent and convergent registers Rd
and Rc , respectively, over the intermediate registers {Ri1 , . . . , Rin }. The mini-
mum and maximum path delays of this reconvergent data path are given by
d{i ...i }c d{i ...i }c
P Dm 1 n and P DM 1 n , respectively. The system delay SDdc of a recon-
vergent data path system between divergent and convergent registers Rd and
Rc is defined by the conjuncture of all the (reconvergent) path delays between
dc
registers Rd and Rc . The maximum system delay SDM of this reconvergent
data path system is defined by the largest of the maximum path delays be-
dc
tween Rd and Rc . Similarly, the minimum system delay SDm is defined by
the smallest of the minimum path delays between Rd and Rc . If there are k
number of reconvergent paths between Rd and Rc , labeled pA , pB , . . . , pK , then:
dc pA pB pK
SDm = min (P Dm , P Dm , . . . , P Dm ), (8.4)
dc pA pB pK
SDM = max (P DM , P DM , . . . , P DM ). (8.5)
8.2 Delay Insertion Method 153

[D12 a 12a 12a 12a


Pm , DPM ] = [PDm , PDM ] = [1.0, 1.2]

→ p12a

R1 R2

→ p12b
12 12 12 12
[DPmb , DPMb ] = [PDm b , PDM b ] = [0.6, 0.7]
Fig. 8.4. A simple reconvergent data path system.

8.2.1 Motivational Example with a Reconvergent Path

A simple reconvergent data path system formed by two reconvergent local


data paths sharing the divergent and convergent registers R1 and R2 , re-
spectively, is shown in Figure 8.4. Note that as a special case, subscripts
a and b are used to identify the two reconvergent local data paths p12a
and p12b . Registers R1 and R2 are the divergent and convergent registers,
respectively. The two reconvergent paths p12a and p12b form the reconver-
gent data path system. For this simple reconvergent data path system, the
path delay of each reconvergent
 path is the data propagation delay of  the
respective local data paths, P Dm 12a
= DP12ma = 1.0 , P DM12a
= DP12M
a
= 1.2 and
 
P Dm12b
= DP12mb = 0.6 , P DM
12b
= DP12Mb
= 0.7 . The minimum and maximum
system delays are driven by the reconvergent data paths p12b and p12a , respec-
tively:
12
 12a 12b
 12b
SDm = min P Dm , P Dm = P Dm = 0.6, (8.6)
12
 12 12
 12
SDM = max P DMa , P DMb = P DMa = 1.2. (8.7)

Two circuits with the topology presented in Figure 8.4 are analyzed in Sec-
tions 8.2.2 and 8.2.3–the edge-triggered circuit SF F and the level-sensitive
circuit SL , respectively.

8.2.2 Reconvergence in an Edge-Triggered Circuit

For edge triggered circuits, the data signals depart the registers clock-to-
output delay (DCQ ) after the latching edge of the clock signal. Consequently
in SF F , the signal Q1 (recall Figure 4.12 on page 55) departs R1 clock-to-
output delay DCQ time after the positive clock edge and propagates along
the reconvergent paths. In order to satisfy the short path constraints, the ar-
F2
rival of data signals X2a and X2b at R2 must occur δH later than the positive
edge of the previous clock cycle at R2 . Similarly, in order to satisfy the long
154 8 Delay Insertion and Clock Skew Scheduling

1
DCQ
C1

 
PD12
m
b
= SD12
m
PD12
M
b

PD12
m
a
 
PD12 a
M = SDM
12

a2 A2
C2 δF2 δF2
H S

Tmin

Fig. 8.5. Timing of the edge-sensitive reconvergent system in Figure 8.4 after CSS.

path constraints, the arrivals must occur δSF 2 earlier that the positive edge of
the current clock cycle at R2 :
F2
δH ≤ a2 ≤ A2 ≤ TCP − δSF 2 . (8.8)

Next, suppose clock skew scheduling for clock period minimization is ap-
plied to an arbitrary edge-triggered circuit which involves a reconvergent data
path system. After clock skew scheduling, if at least one of the reconvergent
paths becomes a critical timing path, the earliest and latest arrival times of the
data signal at the critical convergent node are at marginal values. Accordingly
for SF F , the arrival times a2 and A2 satisfy
F2
δH = a2 ≤ A2 = Tmin − δSF 2 , (8.9)

where Tmin is the minimum clock period achievable by clock skew scheduling.
The constraints in (8.9) are illustrated in Figure 8.5. C1 and C2 are the clock
signals synchronizing registers R1 and R2 , respectively. Also illustrated on
Figure 8.5 is the separation between A2 + δSF 2 and a2 − δH F2
defining the
minimum clock period:

Tmin = A2 + δSF 2 − (a2 − δH


F2
). (8.10)

Note that the data arrival times at R2 are given by the constraints similar to
the discussion in Section 4.7:
 
a2 = min d1 + DP12m a
− Tmin , d1 + DP12m b
− Tmin
  (8.11)
= d1 + min DP12m a
, DP12m
b
− Tmin ,
 
A2 = max D1 + DP12M a
− Tmin , D1 + DP12M b
− Tmin
  (8.12)
= D1 + max DP12M a
, DP12M
b
− Tmin .
8.2 Delay Insertion Method 155

Replacing (8.11) and (8.12) in (8.10) yields


 
Tmin = D1 + max DP12M a
, DP12M
b
− Tmin + δSF 2
 12a 
(8.13)
− d1 + min DP m , DP12m b
− Tmin + δH
F2
.
Eq. (8.13) is simplified to
   
Tmin = max P DM 12a 12b
, P DM − min P Dm
12a 12b
, P Dm + δSF 2 + δH
F2
. (8.14)
Following from (8.4) and (8.5), (8.14) is identical to
12
Tmin = SDM − SDm
12
+ δSF 2 + δH
F2
. (8.15)
Substituting the numerical values and assuming zero internal register delays
DCQ = DDQ = δSF = δH F
= 0, the minimum clock period Tmin of SF F after
clock skew scheduling is computed Tmin = 0.6 time units.
Consider (8.15), showing the dependence of Tmin on the algebraic dif-
ference between the maximum system delay and the minimum system delay
between Rd and Rc (summed with the internal register delays δSF f and δH Ff
).
The delay insertion method is proposed to modify these maximum and min-
imum system delays between Rd and Rc . The modification, when applicable,
decreases the algebraic difference in (8.15). In SF F , for instance, the mini-
12b
mum system delay between Rd and Rc is determined by P Dm of path p12b .
By inserting a delay element of 0.1 time units on p12b , the minimum and max-
imum path delays of this path are changed to DP12m b
= 0.7 and DP12M b
= 0.8,
respectively. More importantly, the minimum system delay between Rd and
12b
Rc is still determined by P Dm of path p12b , which is now 0.7 instead of the
original 0.6 time units. Both before and after delay insertion, the maximum
12a
system delay between Rd and Rc is determined by P DM of path p12a , which
is a constant 1.2 time units. Therefore, the algebraic difference between the
maximum and minimum system delays between Rd and Rc is improved from
(1.2−0.6 = 0.6) to (1.2−0.7 = 0.5) time units. This delay insertion procedure
for the circuit shown in Figure 8.4 is illustrated in Figure 8.6. The black circle
in Figure 8.6 represents a delay element of [0.1,0.2] that is inserted on the
reconvergent path p12b .
Note that for SF F , inserting a delay element with a value in range [0.4,0.5]
on p12b gives the minimum possible algebraic difference in (8.15), leading to

the minimum clock period obtainable through delay insertion Tmin . For SF F ,
∗ ∗
Tmin evaluates to Tmin = 1.2 − 1.0 = 0.2. It is shown that this minimum
clock period obtainable through delay insertion depends on the maximum of
the algebraic differences between the maximum and minimum path delays of
each reconvergent path (after delay insertion).
Proposition: Let there be k number of reconvergent paths between Rd and
Rc , labeled pA , pB , . . . , pK . The minimum possible algebraic difference between
the maximum and minimum path delays of each reconvergent path between

Rd and Rc after delay insertion is the minimum clock period Tmin obtainable
through delay insertion.
156 8 Delay Insertion and Clock Skew Scheduling

Let the minimum and maximum system delays define the real numbers
interval Λ, such that:
dc dc
Λ = [SDm , SDM ] (8.16)
By definition, the minimum possible algebraic difference between the maxi-
mum and minimum path delays of each reconvergent path after delay inser-
tion (defining the minimum possible clock period) is the minimum length of
interval Λ (after delay insertion).
In order to compute the minimum length |Λ| of interval Λ achievable
through delay insertion, the difference [max(Λ) − min(Λ)] is computed. Re-
calling (8.4) and (8.5), the following is derived:
dc pA pB pK
min(Λ) = SDm = min (P Dm , P Dm , . . . , P Dm ), (8.17)
dc pA pB pK
max(Λ) = SDM = max (P DM , P DM , . . . , P DM ). (8.18)

Let the real number delay intervals formed by the minimum and maxi-
mum delay values of the paths pA , pB , . . . , pK be represented by A, B, . . . , K,
respectively. In other words, a delay interval L, associated with the path
pL
pL ∈ {pA , pB , . . . , pK } is formed by L = [P Dm
pL
, P DM ]. One of the following
possibilities defining the expression [|Λ| = max(Λ) − min(Λ)] must hold:
P1. A delay interval M ∈ {A, . . . , K} determines both the minimum min(Λ)
and maximum max(Λ) values of the interval Λ. Then, Λ = M and |Λ| =
|M | = max(Λ) − min(Λ) = max(M ) − min(M ),
P2. Otherwise, two non-identical delay intervals determine the minimum
and maximum values of the interval Λ. Then, ∀L ∈ {A, . . . , K}: |Λ| =
max(Λ) − min(Λ) > max(L) − min(L).
For systems satisfying (P1), the minimum length for Λ is already given by
|Λ| = |M |. The minimum interval length, thus the minimum clock period,
cannot be changed by delay insertion. For systems satisfying (P2), delay in-
sertion method is used to modify one or more of the delay intervals in Λ

12a
[PD12
m , PDM ] = [1.0, 1.2]
a

→ p12a

R1 R2

→ p12b
12 12
[PDm b , PDM b ] = [0.6, 0.7] + [0.1, 0.2] = [0.7, 0.9]
Fig. 8.6. The simple reconvergent system in Figure 8.4 after delay insertion.
8.2 Delay Insertion Method 157

Delay Intervals Delay Intervals


Λ Λ
A A
B B
C C
D D

K Delay K Delay
0 3 12 0 2 14
(i) |Λ| = |D| = max(D) − min(D) = 9 (ii) |Λ| = max(B) − min(C) = 12

Delay Intervals Delay Intervals


Λ Λ
A A UA
B B UB
C C UC
D D UD

K Delay K Delay
0 7 16 0 7 18
(iii) |Λ| = |B| = max(B) − min(B) = 9 < 12 (iv) |Λ| = |B| = max(B) − min(B) = 11 < 12

Fig. 8.7. Two reconvergent data path systems satisfying (P1) and (P2), respectively.

in order to promote one of the delay intervals to become the interval M . In


other words, systems satisfying (P2) are converted to systems satisfying (P1)
through delay insertion. Delay insertion is performed into the logic network,
thus, the systems delays and the interval Λ are modified with delay insertion.
Note that both the minimum and maximum system delays can be modified
with delay insertion. Therefore, it is not possible to predetermine which re-
convergent path will be the determining path for the interval Λ after delay
insertion.
In case (i) of Figure 8.7, a sample system satisfying (P1) is illustrated,
where the delay interval D (associated with path pD ) determines the minimum
length for Λ. No modification is necessary for such systems, as the minimum
possible length for Λ is already observed.
In cases (ii) and (iii) of Figure 8.7, the application of the delay inser-
tion method to a sample system satisfying (P2) is illustrated. Note that in
case (ii), the minimum value in the Λ interval is determined identically by de-
pC pD
lay intervals C and D [min(Λ) = P Dm = P Dm ], while the maximum value
pB
is determined by delay interval B [max(Λ) = P DM ]. Delay insertion on a re-
convergent path is similar to adding an offset to the interval, while preserving
158 8 Delay Insertion and Clock Skew Scheduling

the interval length. If the optimal values of delay elements are inserted on each
path, the minimum possible |Λ| is achieved by asserting that the biggest delay
interval M ∈ {A, . . . , K} becomes the interval Λ. In the modification of the
sample system shown in cases (ii) and (iii) of Figure 8.7, the delay interval B
is promoted to become this biggest delay interval M such that both min(Λ)
and max(Λ) are determined by delay interval B (i.e. delay interval B becomes
Λ). The intervals before and after delay insertion on the sample system are
demonstrated in cases (ii) and (iii) of Figure 8.7, respectively.
There are two important points to note here. First, the solution set of
the inserted delay values is not unique (remember similar discussions in Sec-
tions 6.1.5 and 6.4.1). For instance, the delay inserted on the path defining
delay interval C in case (iii) of Figure 8.7 can be any value between 6 and
12 time units (|C| = 3) to satisfy the computed minimum interval. Similarly,
the delay values inserted on all paths can simultaneously be increased by any
identical amount (e.g. x time units) to generate an alternative solution. This
non-unique solution set property provides a certain range of safety against any
inherent uncertainty or unavailability of exact values of the delay elements.
The second important point to note is that after delay insertion, the in-
terval lengths are preserved only if the inserted delay elements have no delay
uncertainty. In demonstrating case (ii) of Figure 8.7, delay values with no
uncertainties are considered in order to simplify the presentation of the delay
insertion method. In reality, delay elements have delay uncertainties just like
any other circuit component. These delay uncertainties of the delay elements
are accrued over the associated delay intervals. Let the delay uncertainty of
the delay element inserted on path L be represented by U L . The application of
delay insertion to the sample system presented in case (ii) of Figure 8.7, where
the delay uncertainties of the delay elements are accounted for, is presented in
case (iv) of Figure 8.7. Note that due to the differences in the accrued delay
uncertainties for each delay interval, the interval determining the minimum
possible length for interval Λ can be different compared to the ideal case pre-
sented in case (iii). Incidentally, for cases (iii) and (iv) of Figure 8.7, the delay
intervals determining the minimum possible length for Λ are B and A, re-
spectively. Also, in a worst case scenario, the accrued delay intervals can end
up being larger compared to the minimum length for Λ presented in case (ii).
In the problem formulation presented later in Section 8.3, delay elements are
realistically modeled with uncertainties.
Reflecting the proposition on a general reconvergent circuit, there are two
possibilities in computing the minimum algebraic difference of (8.15):
P1*. The minimum and maximum system delays of the reconvergent data
path system between Rd and Rc are determined by the same reconvergent
path,
P2*. The minimum and maximum system delays of the reconvergent data
path system between Rd and Rc are determined by two non-identical
reconvergent paths.
8.2 Delay Insertion Method 159

For systems satisfying P1*, the minimum algebraic difference is already


achieved. For systems satisfying P2*, delay insertion is used. By inserting
delays in one or more of the reconvergent paths, the path with the largest dif-
ference between its maximum and minimum path delays after delay insertion

becomes the determinant path for the minimum clock period Tmin obtainable
through delay insertion. Therefore, the minimum clock period of SF F with
clock skew scheduling and delay insertion is

 
Tmin = max P DM 12α
− P Dm
12α
+ U 12α + δSF 2 + δH
F2
. (8.19)
∀α∈{a,b}

Assuming zero delay uncertainty and substituting the numerical values, the

minimum clock period Tmin of SF F after clock skew scheduling with delay

insertion method is Tmin = 1.2−1.0 = 0.2. The improvement achieved through
delay insertion over circuits with clock skew scheduling is computed with the

formula [(Tmin − Tmin )/Tmin ]100. Substituting the values, the improvement
is computed as [(0.6 − 0.2)/0.6]100 = 66.7%.
The computation of the amount of delays to be inserted on each path is
integrated into the clock skew scheduling algorithm. For simplicity, continu-
ous delay models are considered in here. The revised clock skew scheduling
algorithm and initial insight for a general analysis using discrete delay models
are presented in Sections 8.3 and 8.4.

8.2.3 Reconvergence in a Level-Sensitive Circuit

For level-sensitive circuits, results similar to an edge-triggered circuit are ob-


tained despite the significant changes in circuit operation. The timing con-
straints are similar to the constraints for the edge-triggered circuit:
F2
δH ≤ a2 ≤ A2 ≤ Tmin − δSF 2 . (8.20)

When clock skew scheduling is applied to SL , the earliest and latest arrival
times at R2 satisfy
L2
δH = a2 ≤ A2 = Tmin − δSL2 , (8.21)
as illustrated in
 Figure 8.8. Using the
 same derivation as (8.10) and (8.13)
L1 L1
and assuming DCQ = DDQ , d1 = D1 for practical reasons:
   
12a
Tmin = max P DM 12b
, P DM − min P Dm
12a 12b
, P Dm + δSL2 + δH
L2
. (8.22)

Substituting the numerical values into the equation and assuming zero inter-
nal register delays, the minimum clock period Tmin of SL after clock skew
scheduling is Tmin = 0.6.
The delay insertion method can also be used on level-sensitive circuits in
order to improve the minimum clock period. The minimum clock period of
160 8 Delay Insertion and Clock Skew Scheduling

1
DCQ
C1

 
PD12
m
b
= SD12
m
PD12
M
b

PD12
m
a
 
PD12 a
M = SDM
12

a2 A2
C2 δL2 δL2
H S

Tmin

Fig. 8.8. Timing of the simple level-sensitive reconvergent system in Figure 8.4
after CSS.

SL with clock skew scheduling and delay insertion is given by the following
formula:

 
Tmin = max P DM 12α
− P Dm
12α
+ U 12α + δSL2 + δH
L2
. (8.23)
∀α∈{a,b}


The minimum clock period Tmin of SL after clock skew scheduling and delay

insertion is computed as Tmin = 1.2 − 1.0 = 0.2, leading to an improve-
ment of 66.7% over circuit with clock skew scheduling. The revised clock skew
scheduling algorithm for level-sensitive circuits is presented in Section 8.3.
Note that the earliest and latest data departure times d1 and D1 , re-
spectively, from a register R1 can be non-identical in a level-sensitive circuit.
Figure 8.8 illustrates one such case, where d1 and D1 occur at the leading
and trailing edges of the clock signal, respectively. In such cases, the formulae
in (8.22) and (8.23) do not hold true, however the minimum clock period re-
mains directly proportional to the algebraic difference between the maximum
and minimum path delays between R1 and R2 . The delay insertion algorithm
is fully applicable to all level-sensitive circuits, as the referred algebraic differ-
ence can ultimately be modified with delay insertion leading to improvements
in the minimum clock period.

8.2.4 General Reconvergent Data Path Systems

The generalized case for a reconvergent data path system is presented in


Figure 8.9. The edge-triggered and level-sensitive circuits are analyzed on the
same circuit graph. Let there be k number of reconvergent paths between Rd
and Rc , labeled pA , pB , . . . , pK . The generalized system contains rd{i1 ...im }c = m
and rd{j1 ...jn }c = n intermediate registers on two of its reconvergent paths,
pI and pJ , respectively (pI , pJ ∈ {pA , pB , . . . , pK }). Assume that the minimum
8.2 Delay Insertion Method 161

d{i1 ...im }c d{i1 ...im }c
PDm , PDM

pd{i1 ...im }c
→ →
Ri1 Rim

Rd Rc

Rj1 Rjn
→ →
pd{j1 ...jn }c

d{j1 ...jn }c d{j1 ...jn }c
PDm , PDM

Fig. 8.9. A generalized reconvergent data path system.

and maximum system delays between Rd and Rc are determined by paths


pd{j1 ...jn }c = pJ and pd{i1 ...im }c = pI , respectively. Note that, if m = n, the
number of clock cycles for data propagation along the paths are different.
After clock skew scheduling is applied, the earliest and latest data arrival
times at the convergent node with respect to the global zero time reference
are

acglobal = tccd + nTmin + δH


Fc
(8.24)
Acglobal = tccd + (m + 1)Tmin − δSF c . (8.25)

Following from (8.10), the minimum clock period after clock skew scheduling
is bounded by

|m − n + 1|Tmin = Acglobal + δSF c − (acglobal − δH


Fc
), (8.26)

which leads to
d{i ...im }c d{j ...j }c
P DM 1 − P Dm 1 n + δSF c + δHFc
SDM dc
− SDmdc
+ δSF c + δH
Fc
Tmin = = .
|m − n + 1| |m − n + 1|
(8.27)
The identical lower bounds of the minimum clock period stated in (8.27)
for both the edge-triggered and level-sensitive circuits are demonstrated in
Figure 8.10 and Figure 8.11, respectively.
Similar to the simple reconvergence case analyzed in Section 8.2.1, if the
minimum and maximum path delays are determined by the same reconvergent
162 8 Delay Insertion and Clock Skew Scheduling

d
DCQ
Cd

d{j1 ...jn }c  
PDm = SDdc
m d{j1 ...jn }c
PDM
d{i1 ...im }c
PDm d{i1 ...im }c  
PDM = SDdc
M

ac Ac
Cc δFc δFc
H S

|m − n + 1|Tmin

Fig. 8.10. Timing of the edge-triggered reconvergent system with m=3 and n=2.

d
DCQ
Cd

d{j1 ...jn }c  
PDm = SDdc
m d{j1 ...jn }c
PDM
d{i1 ...im }c
PDm
d{i1 ...im }c  
PDM = SDdc
M

ac Ac
Cc δLc δLc
H S

|m − n + 1|Tmin

Fig. 8.11. Timing of the level-sensitive reconvergent system with m=3 and n=2.

path, the delay insertion method is not beneficial. If these delays are deter-
mined by different reconvergent paths, the delay insertion method is used
to improve the minimum clock period. The minimum clock period achieved
through clock skew scheduling and delay insertion is
pR 
∗ P DM − P Dm
pS
+ U pR − U pS δ F c + δH
Fc
Tmin = max + S .
∀pR ,pS ∈{pA ,pB ,...,pK } |m − n + 1| |m − n + 1|
(8.28)
The minimum (and maximum) path delay of the reconvergent paths can
be modified by inserting delays on the local data paths of the reconvergent
path. The amount of delay to be inserted is determined at run time by the
clock skew scheduling algorithm.

8.3 Linear Problem Formulation


A valid approach to computing the theoretical limitation caused by reconver-
gent paths in a synchronous circuit is to identify the reconvergent systems on
a circuit graph and evaluate (8.28). Such an approach might not be ideal for
8.4 Practical Concerns in Modeling and Application 163

Table 8.1. CSS method for edge-sensitive circuits with the delay insertion method.
LP Model
min TCP
s.t. TSkew (i, f ) ≤ TCP − DPi,fM − DCQM
Fi

TSkew (i, f ) ≥ −DPi,fm − DCQm


Fi Ff
+ δH
LP Model modified
min TCP
s.t. TSkew (i, f ) ≤ TCP − DPi,fM − DCQM
Fi
− IM
if

TSkew (i, f ) ≥ −DPi,fm − DCQm


Fi Ff
+ δH − Imif

IM ≥ Im
if if

trivial circuit topologies. As a more practical approach, two generalized LP


problems are defined in order to model the delay insertion method for level-
sensitive and edge-sensitive synchronous circuits. These LP problems not only
model and solve the clock period minimization problems also compute the op-
timal delay values to be inserted on each local data path in order to achieve
the minimum possible clock period.
Two clock skew scheduling algorithms presented in Table 5.1 and Ta-
ble 6.2 for level-sensitive and edge-triggered circuits, respectively, are mod-
ified in order to integrate the delay insertion method. Both LP models for
clock skew scheduling are highly amenable to accommodating additional de-
sign constraints.
The modified clock skew scheduling algorithms using the delay insertion
method, assuming continuous delay models with uncertainty, are presented in
Tables 8.1 and 8.2. The amount of delay to be inserted is formulated as the
if if
minimum-amount and maximum-amount variables Im and IM , respectively.
Obviously, the uncertainty U if of this delay element, defined in Section 8.2.2, is
U if = IM
if
−Imif
. The delay variables are included in the propagation constraints
on each local data path, however, pruning of the paths such that only the
propagation constraints of the reconvergent paths are modified is also possible.
For the former case, the clock skew scheduling algorithm simply returns zero
for the delay values on the non-reconvergent paths.

8.4 Practical Concerns in Modeling and Application

In the problem formulation, continuous delay models have been used. Practi-
cally, however, delay elements are available only in discrete values. There are
two possible approaches to solving the discrete valued delay insertion problem.
The naive approach is to solve the clock skew scheduling problem assuming
continuous delays and approximating the optimal values with the given set of
discrete components. Although likely to produce reasonable results for simple
164 8 Delay Insertion and Clock Skew Scheduling

Table 8.2. CSS method for level-sensitive circuits with the delay insertion method.
LP Model
 
min TCP + M [ (dj + Dj ) + (Aj − aj )]
∀j ∀j:|F I(j)|≥1
s.t. af ≥ Lf
δH
Af ≤ TCP − δSLf
di ≥ ai + DDQM
Li

di ≥ TCP − CW L Li
+ DCQm
Di ≥ Ai + DDQM
Li

Di ≥ TCP − CW L Li
+ DCQM
af ≤ din + DP m + TSkew (in , f ) − TCP , ∀n
in ,f

Af ≥ Din + DPin ,fM + TSkew (in , f ) − TCP , ∀n


Af ≥ af
Df ≥ df
LP Model modified
 
min TCP + M [ (dj + Dj ) + (Aj − aj ]
∀j ∀j:|F I(j)|≥1
s.t. af ≥ δH
Lf

Af ≤ TCP − δSLf
di ≥ ai + DDQm
Li

di ≥ TCP − CW L Li
+ DCQm
Di ≥ Ai + DDQM
Li

Di ≥ TCP − CW L
+ DCQM Li

af ≤ din + DP m + Im + TSkew (in , f ) − TCP , ∀n


in ,f in f

Af ≥ Din + DPin ,fM + IM in f


+ TSkew (in , f ) − TCP , ∀n
Af ≥ af
Df ≥ df
if
IM ≥ Imif

cases, such linear approximations to integer problems do not always guaran-


tee optimality [112]. As a more robust and ubiquitously valid approach, the
problem can be formulated as a mixed integer programming (MIP) problem.
Evidently, the expected run times for MIP problems are typically longer than
LP problems of similar size (see Section 6.5).
Modeling and solving the problem with continuous delay models serve best
to demonstrate the two main purposes of this work; Identifying the limitation
caused by the reconvergent paths and demonstrating how to mitigate these
limitations through the delay insertion method. By adapting continuous delay
models, the theoretical limitations of reconvergent paths and the level of im-
provement through mitigation of this limitation are analyzed independent of
any cell library. For practical implementation, MIP-based solution approaches
discussed above, or similar methods, must be used.
Another practical concern for the delay insertion method is the area-aware
delay insertion method proposed in [74]. In order to reduce the total area
increase due to inserted delays, a delay buffer tree structure is proposed. In the
8.5 Summary 165

buffer tree structure, a shared delay element is placed between the fanouts—or
fanins—of a register, if multiple fanouts of the same register must be padded.
Note that the delay buffer-tree construction is a post-timing analysis process
and is not integrated into the clock skew scheduling algorithms.
Throughout this research monograph, the local data paths are modeled
abstractly at a higher hierarchy level than gate-level hierarchy. Such simplifi-
cation is followed in this chapter in order to improve the demonstration of the
theoretical limitation of reconvergent paths and the mitigation of this limita-
tion by the delay insertion method. In practical implementation, the location
of the delay elements to be inserted into the logic must be identified at a lower
level of abstraction—most suitably at the gate-level of hierarchy. The model-
ing of local data paths at a higher abstraction level as suggested in this work
might lead to an ambiguous assignment of delays to reconvergent paths. In
an extreme case, it is plausible that three or more reconvergent paths might
share all of the logic paths that constitute a reconvergent system. For the
simplest case of four reconvergent paths, any two reconvergent paths might
differ by one logic path only, and all logic paths might be covered by the four
reconvergent paths. For such a reconvergent system, including delay elements
anywhere on a reconvergent path (on any logic path) would affect the path
delay of more than one reconvergent path. Thus, the optimal delay insertion
values computed by the presented LP problem must be post-processed for
practical implementation.
The described concerns in the practical implementation of the delay in-
sertion method are not considered in the experimentation stage of this work.
Simplicity is preserved in the models used in formulation in order to improve
the presentation of the limitation caused by the reconvergent paths and the
mitigation of this limitation by the delay insertion method. Designers, how-
ever, must be wary of these practical requirements. Some researchers have
already started analyzing these practical concerns [133]. In [133], the LP pro-
gramming model shown in Table 8.2 is redefined at the gate-level netlist to
pinpoint the placement of inserted delays on the gate-level netlist.

8.5 Summary

In this chapter, the limitations of delay uncertainty of the min-max timing


models, data path cycles and reconvergent paths on the improvements achiev-
able through clock skew scheduling are shown. The mitigation of these limi-
tations with a delay insertion method is possible. The delay insertion method
is formulated as an LP problem, proposing a highly-automated, versatile and
efficient implementation. Practical concerns in modeling and implementation,
such as the continuous versus discrete models of delay elements, delay element
sharing between neighboring branches and underlying area costs for the delay
insertion process are referenced.
9
Practical Considerations

The formulation of clock skew scheduling as a QP problem is introduced in


Chapter 7. Recall that in this formulation, a feasible and consistent clock
schedule is found that is close1 to a previously chosen ‘ideal’ objective clock
schedule. In this chapter, a computer methodology is presented for the solu-
tion of this QP clock scheduling problem. Different computer implementations
are analyzed and compared in detail in Section 9.1. It is shown that the QP
problem can be efficiently solved and three computer algorithmic procedures
for this solution are discussed. These three algorithms are demonstrated to
have O(r3 ) run time complexity and O(r2 ) storage complexity, where r is
the number of registers in the circuit. The numerical constants of the leading
terms in these complexity expressions are derived as a function of the ratio
of the number of local data paths to the number of registers in the circuit,
thereby permitting a suitable algorithm to be chosen for a specific circuit.
Furthermore, the methodology presented in Chapter 7 is extended in order
to account for two important details of practical interest. The circuit graph
model is first discussed in 9.2. It is shown that certain clock skews from the
basis are unconstrained2 and this information is integrated into the mathe-
matical framework described in Chapter 7. In Section 9.3, it is demonstrated
how to efficiently handle the timing constraints of the I/O registers of a cir-
cuit, including the necessary modifications to the mathematical optimization
procedure.

9.1 Computational Analysis


The solution to problem QP-3 is described in Section 7.2.3 in purely mathe-
matical terms and without consideration of any computational aspects. Natu-
rally, the solution described by (7.54) and (7.55) is determined from a program
1
Close in a Euclidean sense.
2
These skews are independent from other skews within the circuit. Nevertheless,
these skews must satisfy the permissible range requirement.
I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, 167
DOI: 10.1007/978-0-387-71056-3 9,
c Springer Science+Business Media LLC 2009
168 9 Practical Considerations

running on a computer. In this section, the time and memory requirements of


three different computer implementations are analyzed in greater detail. The
run time complexity N of these algorithms is considered to be dependent upon
the number of multiplicative (multiple and divide) floating point operations.
Similarly, the memory complexity M is considered to be the largest number
of floating point storage units that must be stored in memory at any time
during the execution of the specific algorithm.3
It is shown here that the run time complexity of all three algorithms de-
scribed in this section is O(r3 ) where r is the number of registers in the circuit.
Furthermore, it is shown that the numerical constant of the leading r3 term
in these complexity expressions is a function of the ratio
p
k= (9.1)
r
of the number of local data paths p to the number of registers r in a circuit.
Similarly, the memory complexity of all three algorithms is O(r2 ) where the
numerical constant of the r2 term is a function of k introduced in (9.1). This
relationship is exploited to determine the most efficient algorithm for a specific
circuit.
Note that formally the Lagrange multipliers λ are not required for the
solution of problem QP-3 since the objective of the procedures described here
is to determine a feasible clock schedule s. Since the existence and unique-
ness of a clock schedule ŝ satisfying problem QP-3 have been established in
Section 7.2.3, this clock schedule can be directly computed by evaluating the
rightmost expression in (7.55),
 
ŝ = − Bt M−1 B g + g. (9.2)

As an alternative, a sequential approach can be adopted such that the La-


grange multipliers λ̂ are computed first, followed by computing ŝ (consisting
of ŝb and ŝc ) using λ̂.
In the former case (a straightforward computation of ŝ), the complexity
of evaluating the expression described by (9.2) determines the complexity of
the overall solution. In the latter case (computing λ̂ first), both ŝb and ŝc can
be computed quickly since these computations involve only the addition and
subtraction operations (recall that all non-zero elements of the matrix C are
either 1 or −1). Therefore, in the case of computing λ̂ and ŝ in this order,
the complexity of the overall solution of problem QP-3 is dominated by the
computation of the Lagrange multipliers λ̂.
Three computational algorithms for solving problem QP-3 are described
in the following three sections. The first two algorithms—called LMCS-1 and

3
Memory transfers between main and secondary storage are, of course, always an
option. For the quickest execution, however, all data should reside in the main
storage.
9.1 Computational Analysis 169

LMCS-2, respectively—compute λ̂ and ŝ in this order according to the de-


pendence relationship ŝ = 12 Bt λ̂ described by (7.55). The third algorithm—
called CSD—computes the clock schedule ŝ directly as described by (9.2).
The algorithms LMCS-1 and LMCS-2 are described in Sections 9.1.1 and
9.1.2, respectively. Algorithm CSD is described in Section 9.1.3 and is shown
to be superior to both of the other algorithms. A comparative summary of
the results is offered in Section 9.1.4.

9.1.1 Algorithm LMCS-1

As mentioned previously, this algorithm for solving problem QP-3 consists


of eliminating λ̂ from Mλ = 2Bg [see (7.54)], then computing ŝ according
to (7.55). To determine the value of the Lagrange multipliers λ̂ corresponding
to the minimization of ε∗τ in problem QP-3, consider the linear system,

Mλ = (BBt )λ = (I + CCt )λ = 2Bg = 2(gc + Cgb ), (9.3)

which corresponds to the last row of (7.52) and (7.58), respectively. As men-
tioned previously in Section 7.2.3, the symmetric matrix M is always positive-
definite4 and nonsingular, thereby permitting exactly one solution λ̂ of the
linear system described by (9.3).
The system described by (9.3) is a large square linear system of the type
Ax = b, where b ∈ Rn is a column vector and the coefficient matrix
A ∈ Rn×n is dense. Typically, the most effective approach to computing the
solution x̂ ∈ Rn of such systems consists of performing a triangular decompo-
sition5 of the coefficient matrix A followed by the successive solution of two
relatively ‘easy’ to solve square linear systems of order n × n. The triangular
decomposition of A is of the form A = LU, where L and U are a lower tri-
angular and an upper triangular matrix, respectively [134, 135]. The solution
of Ax = LUx = b is obtained next by first computing the intermediate solu-
tion ŷ of the system Ly = b. Finally, x̂ is the solution of the system Ux = ŷ.
Because of the triangularity of the matrices L and U, the vectors ŷ and x̂ can
be computed with relatively little effort. The components of the intermediate
solution ŷ are obtained by solving the system Ly = b—referred to as for-
ward elimination [134, 135]—since the first equation of Ly = b involves only
y1 , the second only y1 and y2 , and so on. Similarly, the components of x̂ are
obtained from the system Ux = ŷ in the reverse order xn , xn−1 , . . . , x1 . The
process of solving Ux = ŷ for x̂ is also called back substitution [134, 135].
Furthermore, the symmetry and positive-definiteness of M can be ex-
ploited to obtain a special form of the LU triangular decomposition of M
such that the lower and upper triangular matrices in the decomposition are
4

The positive-definiteness of M follows from M = BBt where B = I C has


linearly independent rows. Therefore, the kernel of Bt is ker(Bt ) = {0} and the
value of the quadratic form xt Mx = xt BBt x is positive for any value of x = 0.
5
The non-singularity of A, L and U is assumed in this discussion.
170 9 Practical Considerations

the transpose of each other. This alternative decomposition is known as the


Cholesky decomposition of M and permits M to be uniquely represented [134]
as the product,
BBt = M = L1 Lt1 , (9.4)
where L1 is a lower triangular matrix. The Cholesky decomposition is compu-
tationally more efficient than a general LU decomposition in that the Cholesky
decomposition requires about half of the computation time of a general LU
decomposition. Finally, the Cholesky decomposition has useful properties re-
lated to issues of numerical stability and accuracy. (An in-depth treatment of
this subject can be found in [134, 135].)
As mentioned previously, the complexity of algorithm LMCS-1 is domi-
nated by the complexity of computing the Lagrange multipliers λ̂. This com-
putation of λ̂ consists of a total of
1
N1 (r, k) = (k − 1)3 r3 + (k − 1)2 r2 (9.5)
6
multiplications distributed among tasks as follows:
task ← number multiplications
a. computing the Cholesky decomposition L1 of ← 16 n3c = 16 (k − 1)3 r3
M
b. forward elimination of ξ from L1 ξ = 2Bg ← 12 n2c = 12 (k − 1)2 r2
c. back substitution of λ̂ from Lt1 λ = ξ ← 12 n2c = 12 (k − 1)2 r2 .
The maximum memory usage of the algorithm LMCS-1 is
1
M1 (r, k) = (k − 1)2 r2 (9.6)
2
floating point elements. This memory is used during different tasks in LMCS-1
as follows:
a. matrix M ← 12 (p − r)2 = 12 (k − 1)2 r2
b. Cholesky decomposition L1 of M ← L1 overwrites M as is computed.

9.1.2 Algorithm LMCS-2

The algorithm LMCS-2 described in this section is similar to algorithm


LMCS-1 described in Section 9.1.1 in that both algorithms follow the same
general course of computation. Specifically, algorithm LMCS-2 also first elimi-
nates λ̂ from Mλ = 2Bg [see (7.54)], and next computes ŝ according to (7.55).
To determine the value of the Lagrange multipliers λ̂, (7.54) is solved by find-
ing the matrix inverse M−1 and then multiplying the right-hand side (2Bg)
by M−1 :
λ̂ = M−1 (2Bg). (9.7)
9.1 Computational Analysis 171

Note that the matrix inverse M−1 = (I + CCt )−1 in (9.7) can be expressed
using the Sherman-Morrison-Woodburry formula [134],
 −1  −1 t −1
D + EFt = D−1 − D−1 E I + Ft D−1 E FD , (9.8)

where D ∈ Rn×n , E ∈ Rn×k , F ∈ Rn×k , and both D and (I + Ft D−1 E) are


nonsingular. When applied to the matrix M−1 = (I + CCt )−1 , the Sherman-
Morrison-Woodburry formula described by (9.8) yields6

D=I ⎪ M−1 = (I + CCt )−1
⎬  −1 t
E=F=C ⇒ = I − C I + Ct C C (9.9)
t


N =I+C C = I − CN−1 Ct .

Note that in (9.9), not only can the matrix inverse N−1 = (I + Ct C)−1
be computed more quickly than M−1 (the dimension of N is nb × nb vs.
nc × nc = (k − 1)r × (k − 1)r for M) but the computation of this inverse
N−1 matrix does not have to be explicitly performed in order to evaluate the
product CN−1 Ct in (9.9). Let the Cholesky decomposition of N = I + Ct C
be
N = L2 Lt2 (9.10)
and substitute (9.10) into the product C(I + Ct C)−1 Ct = CN−1 Ct in (9.9),
then
M−1 = I − CN−1 Ct
 −1 t
= I − C L2 Lt2 C
 t −1
  −1 t  (9.11)
= I − C(L2 ) L2 C
= I − Xt X,

where X is used to denote the product (L−1 t


2 C ). The matrix X can be com-
puted by forward elimination according to the matrix equation L2 X = Ct ,
while the product CN−1 Ct is equal to the product Xt X. Also, observe that
the matrix M−1 can be computed one row at a time, thereby drastically re-
ducing the storage requirements of the algorithm. The j-th row of M−1 is
computed and used to calculate the Lagrange multiplier λ̂j as the inner prod-
uct of this j-th row of M−1 and the vector 2Bg. The memory used to store
the elements of the j-th row of M−1 is then overwritten with the elements of
the (j + 1)-th row of M−1 and so on. The rows of the matrix M−1 can be
stored in disk in order to permit the rows to be retrieved for future execution.
Just as in algorithm LMCS-1, the complexity of algorithm LMCS-2 is
dominated by the complexity of computing the Lagrange multipliers λ̂. This
computation of λ̂ consists of a total of

6
Note that I + Ct C is positive-definite, thus nonsingular.
172 9 Practical Considerations
 
1 1 1
N2 (r, k) = + (k − 1) + (k − 1)2 r3 + (k − 1)r2 (9.12)
6 2 2

multiplications distributed among the following tasks:


task ← number multiplications
a. computing the Cholesky decomposition L2 ← 16 r3
of N
b. forward elimination of X from L2 X = Ct ← 12 r2 (p − r) = 12 (k − 1)r3
c. evaluate M−1 = I − Xt X ← 12 r(p − r)2 = 12 (k − 1)2 r3
d. evaluate λ̂ = M−1 (2Bg) ← (p − r)2 = (k − 1)2 r2 .
The maximum memory usage of algorithm LMCS-2 is
1
M2 (r, k) = (k − )r2 + (k − 1)r (9.13)
2
floating point elements. This memory usage is distributed among different
tasks in LMCS-2 as follows:
a. matrix N ← requires 12 r2 storage units
b. Cholesky decomposition L2 of ← L2 overwrites N as is computed
N
c. matrix X from L2 X = Ct ← requires r(p−r) = (k −1)r2 storage
units
d. matrix M−1 = I − Xt X ← requires (p − r) = (k − 1)r storage
units for one row of M only.

9.1.3 Algorithm CSD

Unlike algorithms LMCS-1 and LMCS-2, the clock schedule ŝ is computed


directly in algorithm CSD, i.e., without first computing the Lagrange mul-
tipliers λ̂. With this strategy, the clock schedule ŝ is determined according
to (9.2),
Z = Bt M−1 B ⇒ ŝ = −Zg + g = (−Z + I) g, (9.14)
where the matrix Z is introduced in (9.14) in order to simplify the notation. To
evaluate Z, the expression described by (9.9) is substituted for M−1 into (9.14)
and the product Z = Bt M−1 B is evaluated using the same technique as
in (9.10) and (9.11):
 
Z = Bt M−1 B = Bt I − CN−1 Ct B
= Bt B − Bt CN−1 Ct B
(9.15)
= Bt B − Bt C(Lt2 )−1 L−1 t
2 C B
= Bt B − Yt Y.
9.1 Computational Analysis 173

The notation
Y = L−1 t
2 C B (9.16)
is introduced in (9.15) for simplicity, where similarly to the previously de-
scribed algorithm LMCS-2, the matrix Y can be eliminated according to the
equation L2 Y = Ct B.
The clock schedule ŝ can be computed if the operations described by (9.14),
(9.15), and (9.16) are carried on literally. These expressions, however, can be
manipulated to significantly reduce both the run time and memory require-
ments for algorithm CSD. Initially, note that computing each clock skew si
requires evaluating the inner product of two dense p-element-long vectors—
the i-th row of the matrix (−Z+I) and g. The evaluation of this inner product
requires p multiplications, where p is the number of local data paths in the
circuit. Recall, however, that the values of the clock skews from the basis sb
provide sufficient information to reconstruct all clock skews s in a quick fash-
ion. Specifically, once the skews from the basis sb are known, the skews sc
in the chords of the circuit may be derived through the operation described
by (7.24),  

sc
IC = sc + Csb = 0 ⇒ sc = −Csb . (9.17)
sb
Since only the basis sb is evaluated, only the last nb rows of the matrix (−Z+I)
are computed, thereby yielding significant savings of computation time. (Note
that computing one row of Z requires the evaluation of p row elements, each
row requiring r multiplications in the product Yt Y.) These concepts are
illustrated graphically in Figure 9.1.

1 p

sc
p s = p (−Z + I) g =

last nb rows sb nb

1
b
Fig. 9.1. Computation of the clock schedule basis s by computing only the last nb
rows of the matrix −Z + I.

The complexity of the evaluation of (−Z + I) = (−Bt B + Yt Y + I)


can be reduced further by examining the computation of Y. Typically, the
174 9 Practical Considerations

direct evaluation of Y—by forward elimination from L2 Y = Ct B—requires


1 2 1 3
2 pr = 2 kr multiplications. This number can be reduced by noting that

Ct B = Ct I C = Ct CCt

= Ct N − I = Ct L2 Lt2 − I (9.18)

and Y = Y1 Y2 − Y3 , (9.19)

where the matrices Y1 , Y2 , and Y3 can be eliminated from the following


dependencies, respectively:
1
L2 Y1 = Ct ← compute Y1 → requires (k − 1)r3
2
multiplications (9.20)
L2 Y2 = N ← Y2 = Lt2 → already computed (9.21)
1 1
L2 Y3 = I ← compute Y3 → requires r3 + (3r2 + 2r)
6 6
multiplications. (9.22)

Finally, the following transformations (9.23) through (9.25) are used to


evaluate the matrix (−Z + I):
     
t I
I C I C
BB= IC = =
Ct Ct Ct C Ct N − I
  (9.23)
O C
=I+ ,
Ct N − 2I
 
Y1t

V=Y Y= t
Y1 Y 2 − Y 3
Y2t − Y3t
 
V11 V12
=
(L2 L−1 −t −1 t −t −1 −1
2 C − L2 L2 C ) (L2 L2 + L2 L2 − L2 L2 − L2 L2 )
t t −t t
 
V11 V12
= −1 ,
Ct − (L−t
2 L−1
2 )C t
N − 2I + L−t2 L2
(9.24)
and
−Z + I = −Bt B + Yt Y + I
 
O C
= −I −
Ct N − 2I
 
V11 V12 (9.25)
+ −t −1 + I
Ct − (L−t −1
2 L2 )C N − 2I + L2 L2
t
 
... ...
= t −t −1 .
−(L−t −1
2 L2 )C L2 L2
9.1 Computational Analysis 175

Note that only the last r rows of (−Z + I) are shown in (9.25) since
only these r rows are required to compute sb . Also, note that the matrix
Y1 = L−1 t −1
2 C does not require evaluation. Only Y3 = L2 must be determined
−t −1 t
(from L2 Y3 = I) since L2 = (L2 ) .
The computation of the clock schedule ŝ in algorithm CSD consists of a
total of
1 1 1 1
N3 (r, k) = r3 + (3k + 4)r2 + r − (9.26)
2 3 2 6
multiplications distributed among the following tasks:
task ← number multiplications
a. computing the Cholesky decomposition L2 of ← 16 r3
N
b. forward elimination of Y3 = L−1 2 from ← 16 r3 + 12 r2 + 13 r
L2 Y3 = I
c. evaluate the product L−t −1
2 L2 ← 61 r3 + 16 (5r2 + r − 1)
d. evaluate sb ← rp = kr2 .
The maximum memory usage of algorithm CSD is

M3 (r, k) = r2 (9.27)

floating point elements. This memory usage is distributed among different


tasks in CSD as follows:

a. matrix N ← requires 12 r2 storage units


b. Cholesky decomposition L2 of N ← L2 overwrites N as is computed
c. matrix L−1
2 = Y3 ← L−1
2 overwrites L2 as is computed
−t −1
d. product L2 L2 ← requires 12 r2 storage units

9.1.4 Summary of the Proposed Algorithms

This section concludes with a brief synopsis of the run time and memory
requirements of the three algorithms for solving problem QP-3 described in
Sections 9.1.1, 9.1.2, and 9.1.3, respectively. To summarize the results, each
of the three algorithms, LMCS-1, LMCS-2, and CSD, requires O(r3 ) floating
point multiplicative operations and O(r2 ) floating-point storage units. The
numerical constant of the leading terms in the polynomial expressions for
both the run time and memory complexity is a function of the ratio k = p/r
which is the ratio of the number of local data paths to the number of registers
in a circuit.
To gain further insight into the proposed algorithms, the numerical con-
stants of the leading terms in the polynomial runtime complexity expressions
are plotted versus k in Figure 9.2. Similarly, the numerical constants of the
leading terms in the polynomial memory complexity expressions are plotted
176 9 Practical Considerations

40

LMCS-1
Runtime Complexity

30
LMCS-2

20 CSD

10

2 4 6 8 10 k
Fig. 9.2. The numerical constants (as functions of k = p/r) of the term r3 in
the runtime complexity expressions for the algorithms LMCS-1, LMCS-2 and CSD,
respectively.

40

LMCS-1
Memory Complexity

30
LMCS-2

20 CSD

10

2 4 6 8 10 k
Fig. 9.3. The numerical constants (as functions of k = p/r) of the term r2 in
the memory complexity expressions for the algorithms LMCS-1, LMCS-2 and CSD,
respectively.

versus k in Figure 9.3. Note that algorithm CSD outperforms both of the other
two LMCS algorithms where the superiority of algorithm CSD is particularly
evident with respect to the speed of execution. Thus, algorithm CSD is the
algorithm of choice for solving problem QP-3 as introduced in Section 7.2.3.

9.2 Unconstrained Basis Skews

Consider again the example circuit C1 introduced in 7.1.1 (the graph of C1 is


shown in Figure 7.1). A modified version of C1 with one additional edge—the
edge e6 —is shown in Figure 9.4. Also shown with thicker edges in Figure 9.4
9.2 Unconstrained Basis Skews 177

is a spanning tree for the modified circuit C1 . Note that the basis edge e6
does not belong to any of the fundamental cycles of the circuit depicted in
Figure 9.4. In fact, the edge e6 does not belong to any cycle of the circuit
in Figure 9.4 at all. Such basis edges which do not belong to any cycles are
called isolated, while the rest of the basis edges are called main. Note that
any isolated edge must necessarily by definition be a basis edge.7

[l1 , u1 ]
e1 →

[l3 , u3 ] [l4 , u4 ] [l6 , u6 ]


v1 v2 v3 v5
e3 → [l5 e4 ← e6 →
, ]
u5 u2
e5
← ] l[ 2, ←
e2
v4

Fig. 9.4. Modified example circuit C1 to include an additional edge e6 . C1 is origi-


nally introduced in Section 7.1.1 and illustrated in Figure 7.1.

Theoretically, a circuit with r registers (the vertices in the circuit graph)


may have any number ni of isolated basis edges where ni ranges from zero to
r − 1 = nb . A circuit with ni = nb = r − 1 isolated basis edges does not have
any cycles whatsoever—all edges of such circuits are basis edges and there are
no chord edges to complete a cycle. A simple example of such a circuit is a
shift register.
Note that since isolated edges do not belong to a cycle, the clock skews on
these edges are linearly independent of any other clock skews in the circuit.
Intuitively, the clock skew of an isolated edge can be assigned to be any value
without contradicting the linear dependencies among the skews in a circuit.
Observe, for example, (7.22) written for the modified circuit C1 shown in
Figure 9.4:

Bs = \begin{bmatrix} 1 & 0 & -1 & 1 & 0 & 0 \\ 0 & 1 & 0 & -1 & 1 & 0 \end{bmatrix}
     \begin{bmatrix} s_1 \\ s_2 \\ s_3 \\ s_4 \\ s_5 \\ s_6 \end{bmatrix} = 0.          (9.28)
All of the elements in the sixth column of B are zeroes. Therefore, if s1 through
s5 are such that (9.28) is satisfied, the choice of s6 does not invalidate (9.28).
This fact can be exploited in the mathematical solution of problem QP-1 to
decrease the number of variables, thereby decreasing the runtime and memory
requirements. The only requirement is that the basis skews (the edges) must
be enumerated such that the isolated skews are last. In other words, the clock
skew vector (7.19) becomes
s = \begin{bmatrix} s^c \\ s^b \\ s^i \end{bmatrix}
  = [\, \underbrace{s_1 \ldots s_{n_c}}_{\text{Chords}} \;\;
     \overbrace{ \underbrace{s_{n_c+1} \ldots s_{p-n_i}}_{\text{Main Basis}} \;\;
     \underbrace{s_{p-n_i+1} \ldots s_p}_{\text{Isolated Basis}} }^{\text{Basis with } n_b \text{ elements}} \,]^t ,          (9.29)

where sb stands for the main basis and the isolated basis is denoted by si .
With this specific choice of clock skew enumeration, the B matrix in (7.22)
becomes

B = [\, B_1 \;\; 0 \,],                                                   (9.30)

where 0 in (9.30) is a zero matrix of dimension n_c × n_i.
With this notation, it is straightforward to show that the matrix M
in (7.53) becomes

M = BB^t = B_1 B_1^t                                                      (9.31)
and the solution to problem QP-1 (7.54) and (7.55) is

\hat{\lambda} = 2M^{-1}Bg = 2M^{-1}[\, B_1 \;\; 0 \,]
\begin{bmatrix} g^c \\ g^b \\ g^i \end{bmatrix}
= 2M^{-1}B_1 \begin{bmatrix} g^c \\ g^b \end{bmatrix},                     (9.32)

\hat{s} = g - B^t M^{-1} B\, g =
\begin{bmatrix} \left( I - B_1^t M^{-1} B_1 \right) \begin{bmatrix} g^c \\ g^b \end{bmatrix} \\ g^i \end{bmatrix}.          (9.33)

As can be observed in (9.32) and (9.33),


1. the choice of the objective isolated basis skews gi has no effect on either
the Lagrange multipliers (9.32) or the chord and main basis skew solution (9.33),
and,
2. the final solution for the clock skews si in the isolated basis edges corre-
sponds precisely to the objective skew values gi for these edges.
Therefore, the isolated basis edges can be completely excluded from consider-
ation when solving problem QP-1. Equations (9.32) and (9.33) demonstrate
that the final clock skew values of these edges can be chosen arbitrarily pro-
vided these values satisfy the permissible range requirements.
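The effect of (9.31) through (9.33) can be reproduced numerically. The following Python sketch (using numpy) is a minimal example: the matrix B1 corresponds to the circuit of Figure 9.4 with the zero column of the isolated edge e6 removed, while the objective skew values are hypothetical and chosen only for illustration:

import numpy as np

# Kernel matrix over the chords and main basis (zero column of e6 dropped)
B1 = np.array([[1.0, 0.0, -1.0, 1.0, 0.0],
               [0.0, 1.0,  0.0, -1.0, 1.0]])
g_cb = np.array([0.1, -0.2, 0.05, 0.0, 0.15])   # hypothetical objective skews, chords + main basis
g_i  = np.array([0.3])                          # hypothetical objective skew of the isolated edge e6

M = B1 @ B1.T                                   # equation (9.31)
lam = 2.0 * np.linalg.solve(M, B1 @ g_cb)       # equation (9.32)
s_cb = g_cb - B1.T @ np.linalg.solve(M, B1 @ g_cb)   # equation (9.33), chords + main basis
s_i = g_i                                       # isolated skews keep their objective values

print("lambda =", lam)
print("skews  =", np.concatenate([s_cb, s_i]))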

9.3 I/O Registers and Target Delays


The clock skew scheduling methodology discussed in Chapter 7 is based on
the assumption that complete connectivity and timing information is available
for all local data paths within a circuit. This condition may, however, not be
realistic. Consider, for example, the input and output registers (also called
the I/O registers) in a VLSI system. Some I/O registers are illustrated in
Figure 9.5 where the registers R1 and R5 are an input and an output register,
respectively, of the circuit C. The register R3 shown in Figure 9.5 is an internal
register since all of the other registers to which R3 is connected (via local data
paths) are inside the circuit C.
The timing of the I/O registers is less flexible than the timing of the
internal registers. Consider, for example, the local data path R6⇝R1 shown
in Figure 9.5. The register R6 is outside the circuit C which contains the
registers R1 through R5. It is possible to apply a clock schedule S that
specifies a clock delay t_cd^1 to the register R1. However, the timing information
for the local data path R6⇝R1 is not considered when scheduling the clock
signal delays to the registers within C (including t_cd^1). Therefore, a timing
violation may occur on the local data path R6⇝R1 illustrated in Figure 9.5.
One strategy to overcome this difficulty is to include in the clock schedul-
ing process the timing information of those local data paths which cross the
boundaries of the circuit C. This approach does not change the nature of the
clock scheduling algorithm but rather only the number of timing constraints.
However, such an optimization scenario is difficult to conceive due to the many
instances where C may be used. Therefore, a preferable approach is to set the
clock signal delay to the I/O registers (such as t_cd^1 to R1) to a specific value
with respect to the clock source (shown as the clock pin in Figure 9.5). If
this value is specified, all of the necessary timing information is available to
avoid any timing violations of the local data paths such as the path R6⇝R1
shown in Figure 9.5.

Fig. 9.5. I/O registers in a VLSI integrated circuit. Note that the I/O registers
form part of the local data paths between the inside of the circuit and the outside
of the circuit.

Equivalently, a group of registers (the I/O registers, for
example) may be defined which require that the clock signal be delivered to all
of the registers within such a group with the same delay. Application-specific
integrated circuits (ASICs) and Intellectual Property (IP) blocks are good
examples of circuits where the aforementioned strategy may be useful.
Given the difficulty in knowing a priori all timing contexts of an integrated
circuit, a preferred solution may be to require that all I/O registers are clocked
at the same time (zero skew). More specifically, all possible explicit clock
delay requirements for registers within the circuit fall into one of the following
categories:
1. zero skew island, that is, a group of registers with equal delay,
2. target delays, that is, t_cd^{k_1} = δ_{k_1}, . . . , t_cd^{k_α} = δ_{k_α}, where k_α ≤ r and δ_{k_1} . . . δ_{k_α}
   are explicitly specified clock signal delay constants,
3. target skews, that is, s_{j_1} = σ_{j_1}, . . . , s_{j_β} = σ_{j_β}, where j_β < n_b and
   σ_{j_1} . . . σ_{j_β} are explicitly specified clock skew constants.
Zero skew islands can be satisfied by collapsing the corresponding graph ver-
tices into a single vertex while eliminating all edges among vertices within the
island. Note that in this case, it must be verified that zero skew is within the
permissible range of each in-island path.8 Alternatively, the target delays are
converted to target skews (category 3 above) for sequentially-adjacent pairs
or by adding a ‘fake’ edge. Thus, an algorithm to handle only target skews is
necessary.
8 Normally, this would be the case. However, [recall (4.8), (4.13), (4.23), (4.24),
and (4.29)], in an aggressive circuit design with a short clock period it may so
happen that zero skew is designed to be out of the permissible range, most likely
creating a setup time violation. In these circuits, negative skew is used to increase
the overall system-wide clock frequency, thereby removing the setup violation.
Note first that target values for only nf ≤ nb skews can be independently
specified. As nf approaches nb , the freedom to vary all skews decreases and
it may become impossible to determine any feasible s. Given nf ≤ nb , (a) the
basis can always be chosen to contain all target skews by using a spanning
tree algorithm with edge swapping, and (b) the edge enumeration can be
accomplished such that the target skews appear last in the basis. The problem
is now similar to (7.42) except for the change of the circuit kernel equation,

C = [\, C_1 \;\; C_2 \,] \;\Rightarrow\; Bs = [\, I \;\; C_1 \;\; C_2 \,]
\begin{bmatrix} \hat{s}^c \\ \hat{s}^b \\ \sigma \end{bmatrix}
\;\Rightarrow\; \hat{B}\hat{s} + C_2\sigma = 0,                            (9.34)

where B̂ = [I C1], ŝ = [ŝ^c ŝ^b]^t, ŝ^c = s^c, and ŝ^b is s^b with the last n_f elements
removed. The matrix C2 in (9.34) consists of the last n_f columns of C, while
the target skew vector σ is an nf -element vector of target skews whose ele-
ments are ordered in the order of the target edges. The linear system (7.51)

becomes

\left. \begin{aligned} 2\hat{s} + \hat{B}^t\hat{m} &= 2\hat{g} \\ \hat{B}\hat{s} + C_2\sigma &= 0 \end{aligned} \right\}
\;\Rightarrow\;
\begin{bmatrix} 2I & \hat{B}^t \\ \hat{B} & 0 \end{bmatrix}
\begin{bmatrix} \hat{s} \\ \hat{m} \end{bmatrix}
= \begin{bmatrix} 2\hat{g} \\ -C_2\sigma \end{bmatrix},                    (9.35)

with solution

\hat{m}^* = 2\hat{M}^{-1}\left( \hat{B}\hat{g} + C_2\sigma \right),
\hat{s}^* = \left( I - \hat{B}^t\hat{M}^{-1}\hat{B} \right)\hat{g} - \hat{B}^t\hat{M}^{-1}C_2\sigma.          (9.36)

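A small numerical sketch of (9.36) is given below. The matrices B̂ and C2, the objective vector ĝ and the target skew vector σ are hypothetical toy values (chosen to be consistent with the structure B̂ = [I C1]), and M̂ = B̂B̂^t is assumed by analogy with the unconstrained case M = BB^t:

import numpy as np

B_hat = np.array([[1.0, 0.0, -1.0, 1.0],     # [I C1] structure: identity block first
                  [0.0, 1.0,  0.0, -1.0]])
C2    = np.array([[0.0],
                  [1.0]])                    # last n_f columns of C (here n_f = 1)
g_hat = np.array([0.1, -0.2, 0.05, 0.0])     # hypothetical objective skews of the free variables
sigma = np.array([0.25])                     # explicitly specified target skew

M_hat = B_hat @ B_hat.T                      # assumed: M_hat = B_hat B_hat^t
rhs   = B_hat @ g_hat + C2 @ sigma
m_opt = 2.0 * np.linalg.solve(M_hat, rhs)                # equation (9.36), multipliers
s_opt = g_hat - B_hat.T @ np.linalg.solve(M_hat, rhs)    # equation (9.36), skews

print("m* =", m_opt)
print("s* =", s_opt)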
9.4 Summary

This chapter describes practical concerns in the implementation of the QP-based
formulation from Chapter 7 and of general clock skew scheduling on a
system-on-chip (SoC). First, the details of a computer implementation of the
QP-based clock skew scheduling program described in Chapter 7 are presented.
Three alternative implementations are discussed, analyzing the memory
requirements and computational complexities based on the number of variables
and operations in each implementation. The efficacy and accuracy of each
algorithm are examined through theoretical discussion. A comparative analysis
is also presented, which demonstrates the superiority of the CSD algorithm (out
of the three proposed computer implementation alternatives) over the LMCS-1
and LMCS-2 algorithms. Later in the chapter, the timing isolation of intellectual
property blocks in a system-on-chip implementation is presented, which enables
the application of clock skew scheduling to individual IP blocks.
10
Clock Skew Scheduling in Rotary Clocking
Technology

Development of a low-jitter, low-skew (or controllable skew for clock skew


scheduling) clocking technology that has low power dissipation is one of
the major research topics in the development of next-generation synchro-
nous integrated systems. Among the proposed clocking technologies are wire-
less [136, 137, 138] and transmission line-based [116, 139, 140, 141, 142, 143,
144, 145, 146, 147, 148] approaches. These technologies must be supported by
specific design flows and CAD suites in order to be viable in semiconductor
implementation. In this chapter, the adaptability of the majority of the non-zero
clock skew circuit design and analysis methods presented earlier in this
monograph to the physical design flow of circuits synchronized by a transmis-
sion line-based clocking technology is described. The particular transmission
line-based technology of interest, the rotary clocking technology, is described
in detail.
Three main types of resonant clocking technologies are described in Sec-
tion 10.1. In particular, the operation of resonant rotary clocking technology
is summarized in Section 10.1.1 and the timing of circuits synchronized with
the resonant rotary clocking technology is discussed in Section 10.1.2. The
physical design flow proposed for integrated circuits synchronized with the
rotary clocking technology, which requires non-zero clock skew operation
and scheduling, is presented in Section 10.2. A heuristic methodology for the
parallelization of clock skew scheduling is presented in Section 10.3. The chap-
ter is summarized in Section 10.4.

10.1 Resonant Clocking


In the last decade, clock frequencies of digital integrated circuits have sur-
passed the GHz milestone [149]. Historically, systems that operate at clock
frequencies in low MHz ranges have utilized off-chip quartz crystal oscilla-
tors [150, 151]. The oscillatory signal generated off-chip with the quartz crys-
tal is input to the on-chip PLL, where it is multiplied to the desired frequency
on chip. The generated signal is distributed to the synchronous components


throughout the chip, typically using a tree topology, called a clock tree net-
work [152]. Especially in nano-scale CMOS, where signal integrity has become
a dominating problem, the distribution of the clock signal from a single clock
source over a clock tree network has become quite error-prone. The discrepan-
cies in the arrival time of the clock signal at the destination registers increase
with scaling technology. The prevailing methodology to generate such high-
frequency clock signals is to use on-chip frequency multiplication by using
phase-locked-loop (PLL) components [153]. The on-chip PLL components oc-
cupy chip area and lead to problems with signal reflections, capacitive loading
and power dissipation that effectively limit the maximum operating frequency.
The resonant clocking technologies [116, 139, 140, 141, 142, 143, 144, 145,
146, 147, 148, 154, 155, 156] present an alternative to generating the syn-
chronizing clock signal. Based on energy recovering adiabatic switching prin-
ciples [142], resonant clocking technologies permit significant power savings.
As such, the resonant clocking technologies eliminate the necessity to use a
complicated on-chip PLL component. Currently, there are three major types
of resonant clocking technologies. These resonant clocking technologies are
categorized with respect to their oscillator types and the phase and voltage
characteristics of the generated clock signals:
1. Coupled LC oscillator based resonant clocking technology [140, 141, 147],
2. Standing wave oscillator based resonant clocking technology [139, 142,
145, 146],
3. Traveling wave oscillator based resonant clocking technology [116, 148,
157, 158].
Coupled LC oscillator based resonant clocking technology provides a con-
stant magnitude clock signal with constant phase. A clock signal with con-
stant magnitude and constant phase is similar to the conventional clock signals
that are delivered using conventional clock tree networks. The main advan-
tage of coupled LC oscillator based resonant clocking technology over other
resonant clocking technologies is that coupled LC oscillator based clocking
provides the desired clock signal without any change to the conventional
design flows. Higher circuit performances are achievable solely by replacing
the clock distribution network with the coupled LC oscillator based resonant
clocking technology distribution network. H-tree network based implemen-
tations are introduced in [141] and extensively analyzed, including tests on
silicon [147, 154, 155, 156, 159, 160, 161, 162, 163].
Standing wave oscillator based resonant clocking technology provides a
varying amplitude clock signal with a constant phase [146, 164]. Similar to
coupled LC oscillator based resonant clocking technology, clock phase is con-
stant. Thus, this technology does not require drastic modifications to the con-
ventional design flows. The varying clock signal magnitude, however, makes
standing wave oscillators unattractive in integrated circuit design.
Table 10.1. Categorization of the resonant clocking technologies.

Oscillator Type   | Phase    | Voltage
Coupled LC        | Constant | Constant
Standing Wave     | Constant | Variable
Traveling Wave    | Variable | Constant

Traveling wave oscillator based resonant clocking technology is the reso-


nant clocking technology of interest in this discussion. Traveling wave oscilla-
tor based resonant clocking technology, also called rotary clocking technology,
provides a clock signal which has a constant magnitude and varying phase.
Varying phase (delay) of clock signal provides permits easy implementation of
non-zero clock skew systems. The design and analysis methods proposed for
non-zero clock skew systems in earlier chapters can be used to design circuits
synchronized with the rotary clocking technology.
Table 10.1 [165] summarizes the categorization of the presented resonant
clocking technologies, based on the magnitude and phase properties of the
generated clock signals.

10.1.1 Rotary Traveling Wave Oscillators

Rotary traveling-wave oscillators (RTWO’s) comprise a novel clock network


implementation technology providing controllable-skew, low-jitter, gigahertz
range clocking with fast transition times and low power consumption [116].
RTWO’s are generated on cross-connected transmission lines, constructing
a differential LC transmission line oscillator. These oscillators generate multi-
phase (360 degrees) square waves with low jitter, which switch adiabatically to
limit power dissipation. Multiple RTWO’s can be connected together forming
the rotary oscillator arrays (ROA) which distribute the synchronized square
wave over the whole chip. The basic ROA structure [116] is shown in Fig-
ure 10.1. This 7x7 ROA grid topology yields 25 interconnected RTWO rings.
Around each RTWO ring, a clock signal is produced that travels around the
ring in a frequency dependent on the physical parameters of the ring. Pulses
on each ring are phase-locked via the shared transmission line wires between
the rings.
When the transmission line is excited from one or more points, the travel-
ing wave is established on the cross-connected line. Figure 10.2(a) shows the
open loop that conceptually occurs when the circuit is being excited for the
first time. Figure 10.2(b) shows the closed loop in steady state of operation
where overlap of the traveling waves causes signal negation. The traveling
wave is inverted on the crossover points, generating different phases of the
square wave. Any number of crossovers are allowed on the transmission line.
The relative phase and skew of any point on the ring is well known due to the
homogeneity of the traveling pattern around the ring.

Fig. 10.1. Basic rotary clock architecture.

Note that anti-parallel (shunt connected) inverter pairs are used between the cross-connected lines
to save power, initiate and maintain the traveling wave. After excitation, the
anti-parallel inverters feed the traveling wave in the stronger direction, up to
a stable oscillation frequency. The transmission line with anti-parallel con-
nected inverters is shown in Figure 10.3 [116]. In Figure 10.3, the traveling
wave is traveling from left to right.
Each pair of anti-parallel inverters on the path of the traveling signal
turns on after some time, stimulating the same process at the neighboring
pair of anti-parallel inverters in the direction of the wave. The transmission
line impedance is on the order of 10Ω and the differential on-resistance of
the anti-parallel connected inverters is in the 100Ω-1kΩ range for a 0.25μm
technology [116].
Fig. 10.2. The RTWO theory.

Fig. 10.3. The cross-section of the transmission line with shunt connected inverters.

Once a wave is established, it takes little power to sustain it. The dissipated
power on the ring is given by the I^2R dissipation instead of the conventional
CV^2f expression. Such consideration of power is possible because
the energy that goes into charging and discharging MOS gate capacitance (of
the inverters) becomes transmission line energy, which in turn is circulated
in the closed electromagnetic path. Such conservation of energy is enabled by
adiabatic switching [166, 167], by terminating the current path into the trans-
mission line instead of ground. The coherent switching occurs only in the
direction of the traveling path. An equal amount of energy is launched in the
reverse direction; however, the latches in this direction are already switched,
so this energy simply serves to reinforce the previous switching events on
these registers.
The frequency of the clock signal generated by the rotary clocking tech-
nology depends on total capacitance and inductance in the system, which
are defined by the physical implementation of the rotary wires and the
anti-parallel inverters [116]. On a typical RTWO loop, the oscillation fre-
quency of the signal is given by the equation:

f_{osc} ≈ \frac{1}{2\sqrt{L_{total} C_{total}}}                                        (10.1)

        ≈ \frac{1}{2\sqrt{\dfrac{P\mu_0}{\pi}\left[\log\left(\dfrac{\pi s}{w+t}\right) + 1\right]
          \left( C_{inv} + C_{reg} + C_{wire} \right)}} .                               (10.2)

Ltotal is the total loop inductance and depends on the ring perimeter P , inter-
connect separation s, wire width w, thickness of the strip t and permeability
in vacuum μ0 . Ctotal is the total capacitance that is driven by the RTWO
ring. The total capacitance is defined by gate-oxide capacitances of inverter
pairs Cinv and registers Creg , and the tapping wire (from register to ring) ca-
pacitance Cwire . These introduced factors affecting the total inductance Ltotal
and the total capacitance Ctotal are the design parameters for an RTWO ring
that provide a design flexibility to generate the desired frequency. Inductance
variation on a typical silicon implementation is expected to be small because
of the high quality of lithographic reproduction. Overall, the projected post-
production variation in the targeted operating frequency is 5%, accounting
for the sources of variation and the dependence of the operating frequency on
√C and √L [116].
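Equation (10.1) is simple enough to evaluate directly. The short Python sketch below uses hypothetical inductance and capacitance values, chosen only to illustrate the order of magnitude and not taken from [116]:

import math

def rtwo_frequency(l_total, c_total):
    # Equation (10.1): oscillation frequency of an RTWO loop from its total
    # loop inductance (H) and total driven capacitance (F)
    return 1.0 / (2.0 * math.sqrt(l_total * c_total))

# Hypothetical values for illustration only
l_total = 2.0e-9     # 2 nH total loop inductance
c_total = 5.0e-12    # 5 pF total capacitance (C_inv + C_reg + C_wire)
print("f_osc = %.2f GHz" % (rtwo_frequency(l_total, c_total) / 1e9))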
The operation of the ROA structure in providing a gigahertz frequency,
low jitter, low power clock signal with fast transition times is confirmed by
simulating the ring shown in Figure 10.1 at 965 MHz and 3.4 GHz. The structure,
designed in a 2.5 V, 0.25 μm CMOS technology, has 25 interconnected RTWO
rings on a 7x7 array grid. The simulation result presented in [116] for the
3.4 GHz ring is shown in Figure 10.4. Promising results of a clock jitter of 5.5 ps
and a 34-dB power supply rejection ratio (PSRR) are measured at 3.4 GHz [116],
and 117-dB noise is reported for an 18 GHz implementation [168].
Two other important metrics for an oscillator are the sensitivity to changes
in temperature and supply voltage. It has been shown that the frequency
deviation with temperature change between −50°C and 150°C is only 1%
while the change with VDD deviation between 1.5 V and 3.5 V is around 2% [116].
The immunity of the RTWO signals to process variations while allowing full
skew control over 360 degree phases on the ring proves very valuable for deep
submicrometer applications.
A detailed analysis of the rotary clocking technology and the RTWO loops
can be found in [116, 148, 157, 158]. Research on rotary clocking can be
categorized into characterization and physical design. In characterization re-
search presented in [168, 169, 170, 171], test chips and spice models are used
to analyze the power and frequency characteristics of homogeneous rotary
rings. Power savings around 60-80% are reported for a single, square rotary
ring [169, 170, 171].

Fig. 10.4. Line voltage and line current for the 3.4 GHz clock example.

In physical design research in [172, 173] and recently in [174], skew computation
and logic placement for a given rotary clock ring are discussed. Both methods
adopt iterative principles for integrated skew compu-
tation and logic placement. The point of interest for the clock skew scheduling
discussion presented in this monograph is primarily the timing requirements
of the rotary clocking technology, which are presented in Section 10.1.2.

10.1.2 Timing Requirements of Rotary Circuits

As described in Section 10.1.1, rotary clocking technology provides a constant
magnitude clock signal with varying phase (clock skew and clock phase). The
constant magnitude of the clock signal is similar to customary clock signals;
varying clock phase, however, is not common in the mainstream circuit design flow.
It is more common to use a zero clock skew, single-phase clock signal in synchro-
nous circuit design due to its simplicity in design and analysis. The majority of
the design automation tools for clock tree synthesis produce better results in
generating a clock distribution network that provides a zero clock skew, single
phase clock distribution as opposed to a non-zero clock skew tree. Non-zero
clock skew and multi-phase synchronization, although shown in Section 6.6 to
be consistently superior to traditional zero clock skew, single-phase design,
are not very popular due to the lack of automation.
For traditional PLL-based clock sources and clock tree networks, an excessive
amount of buffering can be necessary in order to deliver the clock signals to the
synchronous components with the desired delays. Recall from Section 8.4
that buffer elements are available only in discrete values.

Fig. 10.5. The clock phase relationships on an ROA ring.

In traditional clock tree networks, clock delays are generated with buffering; thus, clock delays
are available only in discrete values for such systems. For rotary clocking
technology, however, buffer elements are not necessary, as clock delays are
provided with the propagation of the clock signal on RTWO rings. The clock
phase driving a synchronous component is determined by the location of the
connection point of the clock signal wire on the RTWO ring as shown in
Figure 10.1(b) (page 186). Figure 10.5 also presents the different phases of the
clock signal available for a sample rotary implementation with one crossover
point. Note that with this implementation, two corresponding points on the
differential line provide clock signals which are shifted by 180 degrees.
Unlike traditional PLL-based clock sources, the generation of a multi-phase
clock signal is highly practical with rotary clocking. The number of phases in
the clock signal generated by the rotary clocking technology is determined
by the number and placement of crossovers on the RTWO rings. The common
multi-phase synchronization scheme of two phases, as well as any other arbi-
trary number of phases, can be implemented with rotary clocking technology,
without loss of quality. Two (or more) crossovers can be used to generate any
desired number of overlapping or non-overlapping clock phases for multi-phase
synchronization. The length and respective placement of the duty cycles of
the multiple clock phases are determined by the location of the crossovers on
the ring.
Rotary clocking technology readily supplies a fine granularity of clock delays,
and potentially, phases. From a CAD perspective, continuous delay models
can be used to model clock delays available in the network. From a circuit
design perspective, the assignment of different clock delays to the synchro-
nous components of a rotary-clock synchronized circuit is essential for the
proper operation of the circuit. Towards this end, the most common prob-
lem is the unbalanced capacitive loading of the rotary network. The lack of
a relatively uniform load distribution (within one ring or between multiple
rings) may affect the rotation of the oscillatory signal on the ring(s), thereby
causing degradations in the quality of synchronization. In the optimal schedul-
ing scenario, the clock delays at the synchronous components are distributed
relatively evenly in time, leading to a relatively balanced distribution of the
latching points on the rotary ring. The required balanced loading of the ROA
rings can be provided by clock skew scheduling (see the distribution of clock
delays for a sample circuit in Figure 11.5 on page 213).
The advanced timing methodology of using non-zero clock skew circuits
with multi-phase synchronization can easily be realized in circuits synchro-
nized with rotary clocking technology. Advantageously, implementation of
circuits synchronized with the rotary clocking technology mutually requires
the automated design and analysis methodologies for multi-phase, non-zero
clock skew synchronization schemes. Such integration of the design and analy-
sis methodologies into the physical design flow leads to circuits which benefit
both from the presented advanced timing methodologies and the rotary clock-
ing technology.

10.2 Physical Design Flow


The physical design flow for integrated circuits synchronized with the rotary
clocking technology necessitates a clock network stage for the implementation
of the ROA grid topology and multiple-phase, non-zero-clock-skew clocks. In
this chapter, the physical design flow is examined from the perspective of the
requirements for a non-zero clock skew circuit implementation.
The design flow is illustrated with the flow chart shown in Figure 10.6.
The flow includes processing the design entry to investigate the complexity
and requirements of the circuit, partitioning the netlist, performing clock skew
scheduling and performing register and logic placement. The three major steps
of the presented physical design methodology are the partitioning, clock skew
scheduling and placement steps. The partitioning step is proposed in order
to generate logic partitions that are implementable within the ROA ring re-
gions of a rotary clocking network. The clock skew scheduling step is required
to improve the scalability of conventional clock skew scheduling techniques
through partitioning. The placement step is proposed in order to provide a
practical implementation alternative for the mapping of the circuit logic and
registers to the ROA rings. These steps are required to increase the feasibil-
ity of the rotary clocking technology as the infrastructure of choice for the
advanced timing methodologies discussed in previous chapters (such as clock
skew scheduling and multi-phase synchronization).
The design entry is provided in industry standard file formats, such as
DEF, LEF and SDF file formats. Initial timing information of the circuit is
necessary for the application of clock skew scheduling. This information must
be obtained prior to tape-out, preferably from a preliminary placement and
routing of the circuit.
[Flow chart: DESIGN ENTRY → PARTITIONING (ROA size, partitioning, register
insertion, ROA feasible?) → CLOCK SKEW SCHEDULING (CSS on partitions 1..N,
CSS on top block, CSS feasible?) → PLACEMENT (register mapping, logic placement)]

Fig. 10.6. The physical design flow of VLSI circuits with RTWO clock synchro-
nization.

The implementation of the ROA rings and netlist partitioning are depen-
dent on each other as illustrated in the Partitioning step in the flow chart.
The size and number of rings in the ROA structures depend on several fac-
tors such as the complexity of the design, the availability of clock network
design resources, the computational resources for timing analysis and the sil-
icon area. Despite these dependencies, the number and physical dimensions
of ROA rings in a circuit are quite flexible. The number of ROA rings is usu-
ally held sufficiently high in order to limit the total wirelength. The shapes
of ROA rings are not necessarily regular (e.g., rectangles) as implied by the
mesh structure presented in Section 10.1.1. Such flexibility in the physical
implementation of the ROA rings enables reconciliation of the non-routable


blocks of the chip area.
Partitioning is performed on a gate-level or a register-transfer level netlist.
For the former case, it is often necessary to insert extra registers in the logic
network as part of the timing-driven partitioning process. This process is rep-
resented by the “Register Insertion” block in the flow chart. These inserted
registers are level-sensitive latches operating in the transparent phases of oper-
ation (Section 4.2) in order to preserve the functionality of the original circuit.
The feasibility of the partitioning result is checked at the next validation step.
If the current result is not feasible, the partitioning step of the design flow is
repeated until feasibility is satisfied.
In the clock skew scheduling step, the rotary clock network is constructed.
Data paths that are local to each partition are identified and the corresponding
timing constraints are included in the clock skew scheduling problem for that
partition. Similarly, the timing constraints of local data paths which span
different partitions are included in the clock skew scheduling problem of the
top block. A heuristic method is proposed to solve the partition and top
block LP problems. The clock skew scheduling problems of each partition are
independent of each other, so these analyses can be parallelized.
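The grouping of timing constraints described above can be sketched in a few lines. The Python fragment below is a simplified illustration with hypothetical register names and partition indices; it only classifies the local data paths, while the actual constraint generation is performed as in Chapter 5:

from collections import defaultdict

def split_constraints(paths, partition_of):
    # Classify local data paths (Ri, Rf): a path whose two registers are in
    # the same partition belongs to that partition's LP; a path spanning two
    # partitions belongs to the top block LP.
    partition_lp = defaultdict(list)
    top_block_lp = []
    for (ri, rf) in paths:
        if partition_of[ri] == partition_of[rf]:
            partition_lp[partition_of[ri]].append((ri, rf))
        else:
            top_block_lp.append((ri, rf))
    return partition_lp, top_block_lp

# Hypothetical four-register example with two partitions
paths = [("R1", "R2"), ("R2", "R3"), ("R3", "R4"), ("R4", "R1")]
partition_of = {"R1": 0, "R2": 0, "R3": 1, "R4": 1}
per_partition, top_block = split_constraints(paths, partition_of)
print(per_partition, top_block)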
After the clock skew scheduling block is completed, the optimal clock signal
delays required at each synchronous component are known. Depending on the
number of clock phases and the number of registers for a given clock phase, the
mapping of synchronous components to the registers within an ROA ring is
performed. This is an automated design step called “Register Mapping” in the
flow chart. Following register mapping, the rest of the logic within a partition
is placed in the area available within the ROA rings for this partition. The
placement is performed using conventional logic placement techniques.
The partitioning and clock skew scheduling steps of the physical design
flow are presented in detail in Sections 10.2.1 through 10.2.4. The placement
step is discussed in Section 10.2.5.

10.2.1 Timing-Driven Partitioning

The objective of the conventional timing-driven partitioning process is to gen-


erate circuit placements that are more likely to meet a particular timing bud-
get. Path-based and net-based partitioners [175] are the two most widely used
kinds of partitioners in current state-of-the-art physical design. Both path-
based and net-based partitioners are used to limit the lengths of selected
critical paths in a circuit. Such limitation in the number of analyzed paths
significantly reduces the processing time for partitioning (and static timing
analysis) while generally preserving the accuracy of the analysis.
In clock skew scheduling, the local data paths in an entire circuit (or cir-
cuit partition) are equally important and analyzed together. Thus, traditional
path-based and net-based timing-driven partitioning methods do not provide
ideal cuts for the application of clock skew scheduling. Therefore, an alternative
partitioning approach is proposed in this work using selection criteria that lead
to partitions which are amenable to clock skew scheduling. Towards this end,
a hypergraph partitioning tool is used with fine-tuned partitioning criteria
to generate partitions that are easily implementable with the rotary clocking
technology. Principally, timing-driven partitioning is performed within the
proposed design methodology subject to the following considerations:
1. To construct the logic network partitions that will be synchronized by
individual ROA rings of the rotary clocking technology,
2. To enable the completion of path enumeration on large scale circuits,
3. To enable the completion of clock skew scheduling algorithms on large
scale circuits.
The first of the three factors listed above is directly related to the im-
plementation of the rotary clocking technology. If clock tree synthesis is per-
formed completely independent from logic synthesis, the assignment of syn-
chronous components to individual ROA rings can be inefficient for physical
implementation. As discussed in Section 10.1.2, a relatively balanced distri-
bution of clock phases is necessary for the quality of synchronization with a
rotary clock signal. An unbalanced loading of synchronous components on the
ROA rings may also cause hot spots in the circuit or significantly increase the
clock load on one side of the chip compared to another (thereby causing per-
formance degradation). To prevent such undesired operation, logic and clock
tree synthesis need to be performed interdependently. The partitioning proce-
dure presented here achieves this goal by generating balanced logic partitions
to be synchronized by each ROA ring. Advantageously, the clock phases at the
synchronous components within each partition are well distributed after the
application of clock skew scheduling (see Figure 11.5 on page 213) to the logic
partitions. Thus, the synchronization by non-zero clock skew requirement is
satisfied as well as the capacitive load balancing requirement for robust rotary
oscillation.
The second and third factors that drive the timing-driven partitioning
process are related to the design and analysis methodologies of large-scale
circuits. Although discussed here within the context of rotary clock synchro-
nization, the partitioning procedures presented in this chapter can also be
applied to circuits synchronized with traditional clocking technologies. From
a CAD perspective, the generality of the partitioning procedure to improving
the scalability of clock skew scheduling (independent of the particular clocking
technology) is discussed next.
As reported earlier in Chapter 5, scalability of clock skew scheduling is
an important drawback for its widespread acceptance in mainstream design.
Most industrial-strength timing tools or circuit designers that implement vari-
ations of clock skew scheduling perform these tasks only on certain portions of
the circuit, without analyzing the circuit in its entirety. Analysis of the entire
circuit in order to implement a full-scale application of clock skew scheduling
can be computationally intensive for very large-scale circuits. The main ob-
stacle for the application of clock skew scheduling to the entire circuit is the
run times of LP model problems.
The LP problem for the application of clock skew scheduling is formulated
as described in Chapter 5. The LP problems generated for an integrated cir-
cuit with millions of paths and hundreds of thousands or more synchronous
components can be very large. The run times of such large LP problems are
usually reasonable within the typically long IC design cycles (up to a few days
with industrial strength LP solvers and common computing resources). How-
ever, very large models might not be solvable at all within the memory limits
of common computing resources. In several industry applications, for instance,
LP model problems for the clock skew scheduling of large-scale circuits are
observed to exceed the practical limits of desktop computing resources (e.g.
4 gigabytes of memory for 32-bit systems) [176]. Partitioning, as discussed
here, remedies this shortcoming. Through partitioning the circuit into small
partitions, small linear programming models can be developed and solved for
each partition. In practice, the LP formulations can be applied in parallel,
achieving further improved run times.

10.2.2 Partitioning with chaco

In the development of the partitioning step of the physical design flow, the par-
titioning tool Chaco [177] from Sandia National Laboratories is used. Chaco is
a hypergraph partitioning tool that is primarily developed for the paralleliza-
tion of tasks on special architectures. Nevertheless, chaco has proved to
be applicable to a wide range of areas. Chaco offers various methods (spec-
tral bisectioning [178], the inertial method [179], the Kernighan-Lin [180],
Fiduccia-Mattheyses [181] algorithms and multilevel partitioners [182]) for
partitioning, each fine tuned for a specific purpose.
Among the multiple criteria for partitioning a synchronous circuit for clock
skew scheduling are the weight, number and location of the cuts amongst
partitions, the weight of each partition, the relative mapping of sequentially-
adjacent registers to partitions and the number of internal vertices per parti-
tion. Chaco tracks the quality of these partitioning performance metrics with
user-defined priorities. In order to generate partitions amenable to clock skew
scheduling, the number of cuts between partitions must be minimal and the
number of internal vertices (vertices that do not have edges between par-
titions) must be maximal. Depending on particular design budgets and the
priority of the performance metrics, the weights of particular nets or vertices
can be fine tuned.
In the computer-aided design (CAD) tool implementation, the application
of partitioning to two types of netlists is supported. These netlists, catego-
rized by the hierarchical level of input data, are:
1. Gate level netlists,
2. Register-transfer level netlists.

If the input to the CAD tool is a register-transfer level netlist, identifying


local data paths (register-to-register timing paths) is inherently simple. The
local data paths in the register-transfer level netlist already form a circuit
graph such as the one shown in Figure 7.1 (page 129), where each vertex is a
register or a synchronous component and each edge is a local data path. If the
input to the CAD tool is a gate-level netlist, some paths can be too long (high
logic depth), which practically limits the quality of partitions. For such long
paths, the partitioner is tuned such that registered-input, registered-output
partitions are generated. To encourage the generation of such partitions, the
following rules are applied in weight assignment to edges:
1. If the edge is between two registers, assign low edge weight.
2. If the edge is a fanout from the data output terminal of a synchronous
component to a combinational component, assign high edge weight.
3. If the edge is a fanout from a combinational component to the data input
terminal of a synchronous component, assign low edge weight.
4. If the edge is between two combinational components, assign high edge
weight.
Through such weighted assignments, the chaco partitioning tool minimizes
the total weight of the cuts, leading the cuts to pass through the data input
terminals of synchronous components. In the case of single input synchronous
components, like flip-flops, a data input net is a single connection, while a data
output net can have multiple fanouts. Hence, the cuts are directed to occur at
the data input terminal of a synchronous component as opposed to a data output
terminal. A synchronous component on the boundary of two partitions is shared
between the two partitions, structuring the registered-input and registered-output
partitions. The enforcement of the edge weights only on data I/O terminals (as
opposed to all terminals) is to avoid forcing artificial constraints on irrelevant
I/O terminals, such as synchronization and scan-path I/O terminals.
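The four weight assignment rules can be captured compactly. The Python sketch below is a hypothetical helper; the weight constants LOW and HIGH and the data-input test are illustrative assumptions and are not part of chaco itself:

LOW, HIGH = 1, 100     # hypothetical weights; only their relative order matters

def edge_weight(src_is_register, dst_is_register, dst_is_data_input=True):
    # Rules 1-4 of Section 10.2.2: discourage cuts everywhere except at the
    # data input terminals of synchronous components.
    if src_is_register and dst_is_register:
        return LOW                      # rule 1: register-to-register edge
    if src_is_register and not dst_is_register:
        return HIGH                     # rule 2: fanout from a register data output
    if not src_is_register and dst_is_register:
        return LOW if dst_is_data_input else HIGH   # rule 3: into a data input only
    return HIGH                         # rule 4: between combinational components

# Example: an edge from a combinational gate into a flip-flop data input
print(edge_weight(False, True))         # -> 1 (cheap to cut)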
The chaco partitioning tool is operated with different priorities assigned
to the partitioning objectives. Experimentally, a balanced priority assignment
between minimizing the total cut weight and maximizing the number of in-
ternal vertices is found to be sufficiently effective.

10.2.3 Register Insertion for Partitioning

As discussed in Section 10.2.2, partitioning can be performed on netlists at


two different hierarchy levels. The application of partitioning on a register-
transfer level netlist is simpler compared to its application on a gate-level
netlist. On a partitioned register-transfer level netlist, a cut is assumed to pass
through an arbitrary location on the cut local data path. The final register of
the cut local data path is called a boundary register. Timing constraints for
the local data paths, where the boundary register is either the initial or the
final register of the path and the other register is within the same partition
as the boundary register, are grouped into a partition LP problem. Timing
constraints of the local data paths between the boundary register and registers
in other partitions are grouped into the top block LP problem. These LP
problems constitute an integral part of the physical design flow depicted in
Figure 10.6 on page 192.
When a gate-level netlist is used, the heuristic described in Section 10.2.2
is used to favor cuts at the inputs of synchronous components. Unlike its
treatment for a register-transfer netlist, a final register of the local data path
must be in the same partition with the cut local data path. This objective
suggests registered-input, registered-output partitions, simplifying the timing
analysis. The slight variation in the weight (or load) balance of the partitions
is insignificant and eventually balances out as the transfer of registers between
partitions occurs in all directions.
For instances where the partitioner validates a cut on a net that is be-
tween two combinational components, register insertion is used to satisfy the
registered-input, registered-output scheme. The number of inserted registers
depends on the quality of the partitioner and the complexity of the design. In
the performed experiments, the number of inserted registers has been observed
to be directly proportional to the number of partitions. For a higher number of
total partitions, the number of inserted registers can get even higher than the
number of original registers. This requires the partitioning step to be applied
with caution in designs where die area is a scarce resource.
The registers inserted into the logic network in the register insertion step
of the physical design flow can affect the functionality of the circuit. In order
to preserve the functionality of the circuit, level-sensitive latches are used.
The inserted registers are selected as level-sensitive latches operating in their
transparent phases of operation. The propagation of the data signals on the
inter-partition paths is not disrupted, as these signals are immediately propa-
gated through the level-sensitive latches during the transparent phases. Con-
straints similar to the linearized timing constraints presented in Chapter 5
are used in this step in order to drive the inserted registers with proper clock
delays and phases.
The general partitioning process is illustrated in Figure 10.7. In this figure,
the dots represent registers and the lines represent data paths. The paths from
partition (4,1) are demonstrated. Note that only some of the registers and
paths are shown. The data paths which are on a cut are identified and the
timing constraints of these paths are included within the top block LP.

Fig. 10.7. Partitioning a circuit for timing analysis.

10.2.4 Clock Skew Scheduling of Partitions

In this section, the application of clock skew scheduling on the partitions


generated by the timing-driven partitioner is described. A heuristic method
is presented in order to perform the referred application. It is shown that this
heuristic method, despite significantly simplifying the clock skew scheduling
process, does not guarantee an optimal solution. The heuristic method is
described explicitly for circuits synchronized by the rotary clock technology in
this chapter; however, it can be applied generally to any synchronous circuit.
The heuristic method to solve the clock skew scheduling of partitions is
as follows. Assume that there are n partitions. The partition LP problems
(LP1 , LP2 , . . . LPn ) are generated for these n circuit partitions. Each parti-
tion LPi is solved (sequentially or in parallel) in order to compute the mini-
mum clock period permitted by that partition. Note that the minimum clock
periods of each partition can be different. For proper operation of the circuit,
all partitions must operate at the same clock period. A simple resolution to
this issue is possible through the fact that each partition can freely operate
at any clock period higher than the minimum clock period computed for that
particular partition LP problem. Consequently, the maximum of the minimum
clock periods reported from each partition LP is selected as the principal clock
period at which all the partitions are operable. This maximum value corre-
sponds to the frequency at which at least one of the partitions is operating at
its maximum frequency, while the rest of the partitions are operating at fre-
quencies lower than their capacities. After solving the partition LP problems,
the maximum of the minimum clock periods computed for the partitions is
used to further constrain the top block LP. Consequently, a constraint in the
form
T ≥ max(T1 , T2 , . . . Tn ) (10.3)
is added to the top block LP, where T1 , T2 , . . . Tn denote the minimum clock
periods computed for partitions LP1 , LP2 , . . . LPn , respectively. If the top
block LP problem is less constraining on the minimum clock period compared
to the partition LP problems (smaller minimum clock period), then the max-
imum of the minimum clock periods of the partition LP problems is assigned
as the clock period of the top block. Otherwise, the top block LP problem de-
termines the actual minimum operating clock period of the circuit (partitions
and top block).
The top block LP problem is solved after the partition LP problems are
solved, because the top block has the largest number of boundary vertices im-
plied in its constraints. Actually, all boundary vertices are implied in the
constraints that make up the top block LP problem. Each partition LP prob-
lem only has a fraction of the boundary vertices implied in its constraints.
The solution of the clock delays to all boundary vertices, as computed by each
partition LP and the top block LP problems, must match in order to verify
the validity of the computed minimum clock period. In order to match these
clock delays of boundary vertices, the solutions computed for the top block
LP problem are enforced on the partition LP problems with equalities such
as:
t_cd^i = x_i,                                                             (10.4)
where the clock delay computed for register Ri in the top block LP problem is
xi time units. If the partition LP problems return feasibility, the computation
is complete.
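The overall flow of the heuristic can be outlined as follows. In the Python sketch below, solve_partition_lp, solve_top_block_lp and check_partition_feasible are hypothetical placeholders for calls into an LP solver; the fragment is an outline of the procedure described above, not a complete implementation:

def schedule_partitions(partition_lps, top_block_lp,
                        solve_partition_lp, solve_top_block_lp,
                        check_partition_feasible):
    # Step 1: solve each partition LP independently (can be parallelized)
    # to obtain the minimum clock period permitted by that partition.
    partition_results = [solve_partition_lp(lp) for lp in partition_lps]
    t_min = [res["T_min"] for res in partition_results]

    # Step 2: constraint (10.3): the common clock period can be no smaller
    # than the largest of the partition minimum clock periods.
    t_lower_bound = max(t_min)

    # Step 3: solve the top block LP with the added bound; this fixes the
    # clock delays of all boundary registers as in equation (10.4).
    top_result = solve_top_block_lp(top_block_lp, t_lower_bound)
    boundary_delays = top_result["boundary_delays"]

    # Step 4: re-check each partition LP with the boundary delays enforced.
    # If every partition remains feasible, the computed period is valid.
    feasible = all(check_partition_feasible(lp, boundary_delays)
                   for lp in partition_lps)
    return top_result["T"], boundary_delays, feasible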
There are two points to note here. First, note that the minimum clock
periods computed for partition LP problems are lower limits on the mini-
mum clock period of the complete circuit as each partition LP problem is a
subproblem of the original LP problem. The constraints that make up the
subproblems are subsets of the LP problem of the complete (original) circuit.
As the solution of one of the subproblems (the top block LP problem in this
case) is enforced on the remaining LP problems, the convex solution space
of the original problem is not violated. Intuitively, therefore, if the presented
heuristic method produces a feasible result, this result is optimal.
The second point to note is the fact that the presented heuristic method
does not guarantee a feasible solution. The percentage (65%) of ISCAS’89
benchmark circuits for which the presented heuristic method is feasible are
shown in Section 11.5. The following alternative approaches are proposed to
solve for cases where the presented heuristic method is not feasible:
• Reiteration: The infeasibility diagnostics of an LP solver can be used to
resolve the infeasibility problem by changing one or more clock delays that
appear in a contradictory constraint. Even if any infeasibility information
is not available, iterations can be performed on the infeasible subproblems
to search for a feasible answer. The clock delays whose values are changed
from the optimal solution of the top block LP are tracked such that the
feasibility of the remaining LP problems are not violated. Iterations are
performed either until a feasible solution is found or a time limit is reached.
• Constraining boundary vertices: As an alternative procedure, the
clock delays of all boundary registers can be fixed to a particular value.
200 10 Clock Skew Scheduling in Rotary Clocking Technology

Synchronous circuits are typically built to operate at zero clock skew,


thereby, constraining the clock delays of the boundary vertices to a par-
ticular value will guarantee proper circuit operation. The minimum oper-
ational clock period for the restricted circuit will be larger than or equal
to the minimum clock period of the original circuit due to the additional
constraints on the convex solution space. As discussed in Section 9.3, a
similar clock delay restriction procedure is applied to the timing of Intel-
lectual Property (IP) blocks. In the experiments performed on ISCAS’89
benchmark circuits for IP blocks, restricting the clock delays of boundary
vertices leads to the 27% improvement of conventional clock skew schedul-
ing reduced to 24%.
• Delay padding: Implementation of clock skew scheduling requires modi-
fication of the clock distribution network. If the designers can modify the
logic network as well as the clock distribution network, the infeasibility of
one or more partition LP problems can be mitigated by delay padding.
In this alternative procedure, the data propagation delays of all paths of
the infeasible LP problems are formulated as variables. To this end, the
minimum D_Pm^{i,f} and the maximum D_PM^{i,f} data propagation delays of a lo-
cal data path can be formulated with additional slack variables S_m^{i,f} and
S_M^{i,f}, respectively, specific to each local data path. The summation of all
the slack variables is added to the minimization type objective function
(min TCP ). The coefficient of the minimum clock period TCP is increased
as appropriate in the LP formulation in order to increase the priority of
minimizing TCP in optimization. In the LP problem solution, the non-zero
slack values reported on each local data path are the amounts of delay that
must be inserted to the logic paths. The clock delays of the boundary reg-
isters must be fixed prior to the solution so that the solutions of the
remaining LP problems are not violated. The practical concerns of delay
padding discussed in Chapter 8 are also valid for this alternative proce-
dure.

10.2.5 Timing-Driven Register Placement

A register placement methodology is presented for the physical design of cir-


cuits synchronized with the rotary clocking technology. In this methodology,
designated areas for register placement are reserved underneath the ROA
rings. Highly populated register banks are stacked inside these designated
regions, available for use with the full spectrum of clock phases. Upon synthe-
sis of the circuit and the computation of optimal clock phases, each register
in the synthesized netlist is physically mapped to a register underneath the
ROA ring. To complete the placement step, the synthesized blocks of combi-
national circuitry are distributed in the free space inside the region, outside
the designated areas.
Fig. 10.8. An ROA ring in a chip layout illustrated in 0.13 μm technology.

In Figure 10.8, an ROA ring of a typical circuit designed with a 0.13 μm
technology on a 2 mm x 2 mm circuit die is illustrated. Note that the figure
is not drawn to scale. The die area is evenly divided into 16 regions in a
four by four setting, each of which is synchronized with an ROA ring. The
dimensions of each ROA ring are 500 μm by 500 μm. Assuming a single row of
registers is placed underneath each ring, the maximum number of registers
that are realizable on this die can easily be obtained using the dimensions
of a typical register. In the 0.13 μm technology, the size of a register is considered
to be 4 μm by 4 μm, with a minimal spacing of 2 μm between two instances.
Therefore, there is enough space to place approximately 80 registers on each
ROA ring edge [(500+2)/(4+2) ≈ 80]. For 4 sides of an ROA ring and 16 rings,
a total of 5120 registers are available for mapping against the synthesized logic.
This number is adequate for most state-of-the-art digital circuit designs of
similar die size. The dimensions of the designated area for register placement
and the number of register bank rows are the determining factors for the
number of registers in a design, which can be altered for particular design
budget requirements. Availability of registers in the register bank enables a
good distribution and mapping of clock phases.
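The capacity estimate above can be restated as a few lines of arithmetic. The values in the Python sketch below are those of the worked example (ring edge length, register size and spacing, four sides per ring, sixteen rings) and are not general technology constants:

ring_edge_um = 500.0   # ROA ring edge length
register_um  = 4.0     # register width in the 0.13 um technology example
spacing_um   = 2.0     # minimal spacing between register instances

exact = (ring_edge_um + spacing_um) / (register_um + spacing_um)
registers_per_edge = 80                        # the text rounds this estimate to ~80
total_registers = registers_per_edge * 4 * 16  # 4 sides per ring, 16 rings on the die
print("exact estimate per edge: %.1f" % exact) # ~83.7, taken as approximately 80
print("registers per edge:", registers_per_edge)
print("total registers:", total_registers)     # 5120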
The register placement methodology is discussed to demonstrate a viable
mechanism to deliver the required clock delays to registers. The described
heuristic implementation of the register placement methodology is only pre-
sented as a proof-of-concept, and does not negatively or positively impact
the coveted synchronization principles with non-zero clock skew. Alternative
methods of placement and routing for rotary timing synchronization have been
offered in [173, 174] that can also be followed in the physical design flow.

10.3 Parallelization of Clock Skew Scheduling

The popularity of personal computers in the consumer market over the
last few decades has significantly lowered the costs of computing systems.
Consequently, the costs associated with setting up a distributed computing
system have become relatively affordable. Processes, previously incomputable
or considered costly, can be executed on a cluster of standard computing
systems.
Xgrid [183] is distributed computing software provided by Apple Com-
puter Inc. that permits a cluster of popular desktop machines to operate
as a supercomputer. The Xgrid system aggregates an ad hoc network of Mac-
intosh desktop computers into a multi-agent computing cluster, where each
agent is called a computation grid. Xgrid is typically beneficial for highly par-
allelizable problems that can be broken up into smaller pieces, with each piece
executed separately and relatively independently of the others. One of the
computers in the cluster is set up as the Xgrid client and the other computers
are used as distributed agents. The Xgrid software is installed on all comput-
ers, enabling the agents to perform grid calculations. Computations can be
submitted to an agent only when it is idle, or the agent can be configured to
process Xgrid tasks as its master task. The Xgrid software is run with a
controller, which regulates the assignment of computing processes to grids and
manages the outputs as they are returned to the server. The Xgrid software
serves as a simple distributed computing infrastructure and does not support
message passing between independent agents, as typical Message Passing
Interface (MPI) [184] systems do.
The parallelization of the application of clock skew scheduling is imple-
mented for the Xgrid distributed computing system. The LP problems for
each partition are submitted as individual tasks to the Xgrid computing clus-
ter and solved simultaneously on specific agents. The generated system not
only exhibits the previously described advantages of implementing a parallel
execution scheme for clock skew scheduling, but also exemplifies the imple-
mentation of a complex VLSI design application on the Xgrid software
architecture.
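The task-level structure of this parallelization can be sketched without the Xgrid infrastructure itself. The following C++ fragment, a minimal illustration only, uses std::async as a local stand-in for the distributed agents and submits each partition's LP as an independent task; solve_partition_lp is a hypothetical placeholder for a call into the LP solver and is not part of Xgrid or CPLEX.

#include <future>
#include <iostream>
#include <string>
#include <vector>

// Hypothetical placeholder: build and solve the clock skew scheduling LP of
// one circuit partition, returning the minimum clock period of that partition.
double solve_partition_lp(const std::string& partition_file) {
    (void)partition_file;  // ... call into the LP solver here ...
    return 0.0;
}

int main() {
    // One LP per partition (a 2x2 grid of ROA rings yields four partitions).
    const std::vector<std::string> partitions = {"p0.lp", "p1.lp", "p2.lp", "p3.lp"};

    // Submit each partition as an independent task, mirroring the
    // one-task-per-agent submission used with the Xgrid cluster.
    std::vector<std::future<double>> tasks;
    for (const std::string& p : partitions)
        tasks.push_back(std::async(std::launch::async, solve_partition_lp, p));

    // Collect the per-partition results as the tasks complete.
    for (unsigned i = 0; i < tasks.size(); ++i)
        std::cout << "Partition " << i << ": TCP = " << tasks[i].get() << "\n";
    return 0;
}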
The computing cluster is constructed with eight PowerMac computers with
dual G5 1.8GHz microprocessors and 3GB RAM running Mac OS X 10.3.8.
The cluster has one dedicated client, one dedicated controller and six distrib-
uted computing agents. The agents are configured to process Xgrid tasks as
the master task. Only one of the processors on each computer is used in ex-
perimentation. This grid computing cluster setup is illustrated in Figure 10.9.
In order to effectively harness the distributed computing potential, the
benchmark circuits are partitioned into four partitions. Note that four parti-
tions emulate a 2x2 grid clock distribution for the rotary clocking technology.
The analysis of a 3x3 or larger grid size is possible; however, perfect paral-
lelization for such grid sizes cannot be achieved with six distributed agents.

Fig. 10.9. Xgrid computing cluster. The client submits tasks to the controller,
which dispatches them to the six distributed computing agents (Agent 1 through
Agent 6).

10.3.1 Speedup of Computation


The primary advantage gained from the parallelization of the application of
clock skew scheduling is the speedup in computation time. The speedup is
gained not only from parallelization but also from partitioning the LP prob-
lems (smaller size LPs are generated).
The following simple and intuitive formula is used to compute the speedups
achieved through partitioning and parallelization:

Speedup = (Run time of physical design (PD) flow without partitioning) /
          (Run time of PD flow with partitioning and parallelization).   (10.5)
In a distributed computing environment, communication overhead is often
of concern. In the Xgrid environment, because of the simple (practically non-
existent) interface between independent computing agents, the communication
overhead is reduced to a minimum. The bulk of the required communication
occurs in distributing the tasks to agents.

10.4 Summary
Non-zero clock skew scheduling is proposed as a clock distribution network
design and improvement methodology in the conventional VLSI design flow. The
conventional design flow, optimized to generate a zero clock skew synchronous
system, is used to implement a system, which is then improved for a shorter
clock period or maximized safety as explained in earlier chapters. This chap-
ter demonstrates how a next-generation clocking technology, the resonant rotary
clocking technology, not only inherently provides but also necessitates the use
of non-zero clock skew design principles. The resonant rotary clocking tech-
nology is reviewed and timing requirements that provide and necessitate the
use of non-zero clock skew principles are described.
A cluster-based parallel clock scheduling methodology is described within
the context of rotary clocking. Possible extensions toward a general approach
for parallelizing clock skew scheduling are suggested. A physical design flow
that incorporates the proposed parallel clock skew scheduling step with place-
ment and routing steps is presented, which constitutes a proof-of-concept for
the advocated design methodologies. The physical design flow for rotary-
synchronized circuits is an ongoing research field and more recent approaches
are available in the literature [173, 174].
11
Experimental Results

The results of the various clock skew scheduling methodologies described in
this research monograph are presented in this chapter. Results of each applica-
tion are presented in dedicated sections for a thorough analysis and simplicity
in presentation. For comparison of results, identical experimental setups are
used where possible and publicly available ISCAS'89 benchmark circuits are
used as test subjects. Presented results include explanations of experimental
setups for replicability, detailed reports (presented in tabular form) of exper-
iment runs including circuit statistics, reported improvements and runtimes,
and interpretation of results for observed trends or deviations from norms.
Specifically, experimental results for the following methodologies are pre-
sented: In Section 11.1, the results for the clock skew scheduling of level-
sensitive circuits are shown. These results also include edge-triggered
circuit implementation results (presented earlier in Chapter 5) for side-by-
side comparison. In Section 11.2, the level-sensitive circuit results, expanded
for multi-phase clock synchronization, are shown. The effects of
multi-phase clocking are interpreted for best synchronization practices, which
is particularly useful for rotary clock synchronized circuits. In Section 11.3,
the performance of quadratic programming (QP) formulation proposed for
maximizing safety against variations is shown. In Section 11.4, the improve-
ment of clock skew scheduling results for edge-triggered and level-sensitive
circuits by applying the delay insertion method is shown. In Section 11.5,
preliminary results for the proof-of-concept physical design methodology for
rotary-clock-synchronized circuits are shown.

11.1 Clock Skew Scheduling of Level-Sensitive Circuits


The generic LP model shown in Table 6.2 (page 107) is used in the problem for-
mulation of the clock skew scheduling application on level-sensitive circuits. The
commercial optimization package CPLEX (v7.5) [115] is used to solve these
clock period minimization problems of the generated level-sensitive (ISCAS’89

benchmark) circuits. In experiments, the primal and dual simplex optimizers
of CPLEX are used. Worst case analysis shows that the simplex method
and its variants may require an exponential number of steps to reach an optimal
solution. However, extensive practical experience has confirmed that in most cases,
the number of iterations to reach an optimal solution is polynomial [114]. In
the presented experiments, the LP clock skew scheduling formulations of all
benchmark circuits are solved in reasonable runtimes with CPLEX.
Consider the LP formulation in Table 6.2. The number of problem con-
straints m is proportional to the number of registers r and the number of
local data paths p in the circuit. Let s denote the number of input registers
for which the initialization constraints are defined. In Table 6.2, there are
eight (8) constraints for each register, two (2) constraints for each local data
path and one (1) constraint for each input register. Thus, the number of con-
straints in the problem formulation is m = 8r + 2p + s. The minimum clock
period TCP is a problem variable. Also, there are five (5) problem variables
defined for each register leading to a total number of n = 5r + 1 variables in
the problem formulation.
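As a quick illustration of how the problem size scales with the circuit, the following C++ fragment evaluates these two counting expressions. The register and path counts of s938 (32 registers, 496 local data paths) are used only as example inputs; the number of input registers s = 2 is an assumed value for illustration.

#include <iostream>

// Size of the LP of Table 6.2 for a circuit with r registers, p local data
// paths, and s input registers: m = 8r + 2p + s constraints, n = 5r + 1 variables.
struct LPSize { long constraints; long variables; };

LPSize lp_size(long r, long p, long s) {
    return { 8 * r + 2 * p + s,   // eight per register, two per path, one per input register
             5 * r + 1 };         // five per register plus the clock period TCP
}

int main() {
    LPSize sz = lp_size(32, 496, 2);  // example values; s = 2 is assumed
    std::cout << "m = " << sz.constraints << ", n = " << sz.variables << "\n";
    return 0;
}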

11.1.1 Experimental Results on ISCAS’89 Benchmark Circuits

The original ISCAS'89 benchmark circuits are edge-sensitive synchronous cir-
cuits without any timing information. The timing information for the bench-
mark circuits is generated explicitly with an algorithm, where the type, size
and fan-out of a gate are included in the computed combinational gate de-
lay. Level-sensitive implementations of the ISCAS’89 benchmark circuits are
generated by replacing each flip-flop in the original benchmark circuit with
a level-sensitive latch. In experimentation, a single phase clock signal with a
duty cycle of 50% (Figure 6.5 on page 109) is selected. Without affecting the
generality of the solution, the setup and hold times and the internal delays
are assumed to be zero (δSLi = δHLi = DCQLi = DDQLi = 0). The consideration of
these numeric constants in an actual problem is straightforward.
In experimentation, edge-sensitive and level-sensitive synchronous circuit
implementations are analyzed for zero and non-zero clock skew scheduling
applications. The effects of time borrowing and clock skew scheduling in cir-
cuit implementation are investigated. The results of the analyses—computed
on a 440MHz Sun Ultra-10 Workstation—are presented in Table 11.1. For
each circuit, the following data are listed—the circuit name, the clock peri-
ods TFFnoskew for a zero skew circuit with flip-flops, TLnoskew for a zero skew
circuit with latches, TFFCSS for a non-zero skew circuit with flip-flops, TLCSS
for a non-zero skew circuit with latches, and TLr for a non-zero skew circuit
where the clock delays to I/O registers are restricted to be equal. The sub-
scripts FF, L represent circuit topologies for flip-flop based and latch-based
circuits, respectively. The superscripts noskew, CSS indicate zero or non-zero
clock skew scheduling. Also listed are the calculation time tLCSS of TLCSS, and
the clock period improvements ILTB, IFFCSS and ILTBCS, where the superscripts

Table 11.1. Clock skew scheduling results for level-sensitive ISCAS’89 circuits.
Circuit  TFFnoskew  TLnoskew  ILTB(%)  TFFCSS  TLCSS  IFFCSS(%)  ILTBCS(%)  ILCSS(%)  tLCSS(sec)  TLr  ILr(%)
s27 6.6 5.4 18 4.1 4.1 38 38 24 0.02 4.1 38
s208.1 12.4 8.6 31 4.9 5.2 60 58 40 0.01 7.6 39
s298 13.0 10.6 18 9.4 9.4 28 28 11 0.02 10.6 18
s344 27.0 18.4 32 18.4 18.4 32 32 0 0.03 18.4 32
s349 27.0 18.4 32 18.4 18.4 32 32 0 0.03 18.4 32
s382 14.2 10.3 27 8.5 8.5 40 40 17 0.04 8.7 39
s386 17.8 17.3 3 17.3 17.3 3 3 0 0.03 17.3 3
s400 14.2 10.4 27 8.6 8.6 39 39 17 0.05 8.8 38
s420.1 16.4 12.6 23 6.8 7.2 59 56 43 0.04 10.3 37
s444 16.8 12.4 26 9.9 9.9 41 41 20 0.07 9.9 41
s510 16.8 14.8 12 14.8 14.3 12 15 3 0.02 14.8 12
s526 13.0 10.6 18 9.4 9.4 28 28 11 0.05 10.6 18
s526n 13.0 10.6 18 9.4 9.4 28 28 11 0.05 10.6 18
s641 83.6 66.2 21 61.9 61.9 26 26 6 0.05 63.1 25
s713 89.2 71.2 20 63.8 63.8 28 28 10 0.05 65.0 27
s820 18.6 18.3 2 18.3 18.3 2 2 0 0.01 18.3 2
s832 19.0 18.8 1 18.8 18.8 1 1 0 0.01 18.8 1
s838.1 24.4 20.6 16 8.3 9.1 66 63 56 0.28 15.6 36
s938 24.4 20.6 16 8.3 9.1 66 63 56 0.31 15.6 36
s953 23.2 21.2 9 18.3 18.3 21 21 14 0.10 21.2 9
s967 20.6 17.9 13 16.2 16.6 21 19 7 0.08 17.9 13
s991 96.4 91.6 5 79.4 79.4 18 18 13 0.02 79.4 18
s1196 20.8 16.0 23 10.8 7.8 48 63 51 0.03 16.0 23
s1238 20.8 16.0 23 10.8 7.8 48 63 51 0.01 16.0 23
s1423 92.2 86.4 6 77.4 75.8 16 18 12 1.10 75.8 18
s1488 32.2 29.0 10 29.0 29.0 10 10 0 0.02 29.0 10
s1494 32.8 29.6 10 29.6 29.6 10 10 0 0.01 29.6 10
s1512 39.6 34.8 12 34.8 34.8 12 12 0 0.28 34.8 12
s3271 40.3 29.8 26 28.6 28.6 29 29 4 0.69 29.0 28
s3330 34.8 23.4 33 17.8 17.8 49 49 24 0.49 23.2 33
s3384 85.2 77.4 9 67.4 67.4 21 21 13 1.88 76.2 11
s4863 81.2 75.4 7 69.0 69.0 15 15 8 0.64 69.0 15
s5378 28.4 23.2 18 22.0 22.0 23 23 5 1.66 22.0 23
s6669 128.6 124.6 3 109.8 109.8 15 15 12 3.62 109.8 15
s9234 75.8 64.8 15 54.2 54.2 28 28 16 4.59 59.2 22
s9234.1 75.8 64.8 15 54.2 54.2 28 28 16 3.88 59.2 22
s13207 85.6 67.4 21 57.1 57.1 33 33 15 14.86 57.1 33
s15850 116.0 92.8 20 83.6 83.6 28 28 10 76.96 83.6 28
s15850.1 81.2 71.4 12 57.4 57.4 29 29 20 58.89 57.4 29
s35932 34.2 34.1 0 20.4 20.4 40 40 40 80.03 20.4 40
s38417 69.0 54.8 21 42.2 42.2 39 39 23 603.49 43.0 39
s38584 94.2 76.4 19 65.2 65.2 31 31 16 321.74 64.8 31
Average (%): ILTB = 15, IFFCSS = 30, ILTBCS = 27, ILCSS = 14, ILr = 24

TB, CSS, TBCS stand for time borrowing, clock skew scheduling and both,
respectively.
The minimum clock periods calculated for the edge-sensitive synchronous
circuits under zero and non-zero clock skew scheduling (TFFnoskew and TFFCSS,
respectively) suggest an average improvement of 30% in the minimum clock
period for the ISCAS’89 benchmark circuits. The minimum clock periods cal-
culated for the level-sensitive synchronous circuits (TLnoskew and TLCSS ) sug-
gest an average improvement of 27% in the minimum clock period. Below,

the clock period improvements for the level-sensitive latches are examined in
detail.
The experimental results shown in Table 11.1 demonstrate that utiliz-
ing latches as storage elements instead of flip-flops may result in up to 30%
improvements of the minimum clock period under zero clock skew (for single-
phase, 50% duty cycle clock synchronization). On the ISCAS’89 benchmark
circuits, an average of 15% improvement is observed when the flip-flops are re-
placed by latches (under zero clock skew). This level of improvement is solely
due to time borrowing.
Utilizing non-zero clock skew, an even higher improvement is possible.
Improvements up to 63%—over flip-flop based synchronous circuit with zero
clock skew—are observed. The average improvement in the minimum clock
period for ISCAS’89 benchmark circuits is 27%. This level of improvement is
due to simultaneous application of clock skew scheduling and consideration
of time borrowing. Out of this 27% improvement for non-zero clock skew,
level-sensitive circuits, the improvement due to time borrowing is 15% and
the improvement due to clock skew scheduling is 14%. It is interesting to
note that the improvements achieved through time borrowing and clock skew
scheduling are not additive. Time borrowing and clock skew scheduling target
the same resource in performance improvement, the slack propagation time
on local data paths. There is a limited amount of slack propagation time on
the critical paths and a circuit where time borrowing is abundantly realized,
cannot benefit as much from clock skew scheduling. It has been shown, how-
ever, that even though time borrowing and clock skew scheduling are competing
effects (competing for the same resource), dramatically shorter clock periods are
achievable through the collaboration of both effects. It is also important to
note that, although non-zero clock skew, edge-triggered circuits benefit more
from improvements (30%) on average, non-zero clock skew, level-sensitive cir-
cuits lead to superior improvements for some of the circuits. Furthermore,
the smaller size of level-sensitive latches compared to edge-triggered flip-flops
is often highly desirable. Thus, the use of level-sensitive latches as register
elements in synchronous circuits where clock skew scheduling is applied is
advantageous for area savings1 , and sometimes, superior to edge-triggered
circuits in both area and operating speed.

11.1.2 Verification and Interpretation of Results

Some edge-sensitive synchronous circuits are inoperable with level-sensitive
latches due to their design. For such circuits, the clock skew scheduling problem is
infeasible. The presented timing analysis procedure detects the infeasibility of
such a problem and provides diagnostic messages. The slack and excess values
associated with each constraint can be examined in the sensitivity analysis
output provided by an LP solver. Careful interpretation of the sensitivity
1 With a minor sacrifice in operating speed.
Fig. 11.1. Data propagation times for s938 with 32 registers and 496 data paths.

output leads to the identification of the necessary modifications on the circuit


topology to achieve the desired operating frequency.
The interpretation of the timing schedule for a synchronous circuit presents
a model to investigate the effects of zero and non-zero clock skew scheduling
on synchronous circuit operation. In the rest of this section, the timing sched-
ules generated for the synchronization of the ISCAS’89 benchmark circuit
s938 with zero and non-zero clock skew scheduling are analyzed. The analy-
ses include the data distributions for various parameters, which are presented
in Section 11.1.3. The verification of clock skew values is discussed in Sec-
tion 11.1.4. Also in Section 11.1.4, lower and upper bounds on clock skew are
derived.

11.1.3 Parameter Data Distributions

In Section 4.7.1, data propagation time DPi,f is defined as the period of time the
data is processed in the combinational logic block of a local data path Ri ;Rf .
Without loss of generality, an empirical calculation method is used to calculate
the data propagation times of each local data path of a circuit. The distrib-
ution of the calculated data propagation times for the ISCAS’89 benchmark
circuit s938 is illustrated in Figure 11.1. In this figure, the height of each bar
corresponds to the number of paths within a given delay range. For example,
there are nine (9) paths with delays between 4 and 5 time units.
Effective path delay D̂Pi,f [96] is defined as the time period between the
departure of the data signal from the initial register Ri and the arrival of
the same data signal at the final register Rf . The effective path delay of a
local data path differs from data propagation delay because of the additional
propagation time provided by clock skew and the time borrowing property
of level-sensitive synchronous circuits. Note that in level-sensitive synchro-
nous circuits, the effective path delay is defined within a permissible range
instead of a fixed value, as the arrival and departure times are indeterminate.
Fig. 11.2. Maximum effective path delays in data paths of s938 for zero clock skew.

The nominal effective path delay is determined when the arrival and depar-
ture times are realized in run-time as certain values in the permissible ranges
[af , Af ] and [di , Di ], respectively. Specifically, the shortest effective path delay
occurs when the data signal departs at its latest time Di from the initial reg-
ister Ri and arrives at its earliest arrival time af at the final register Rf . The
longest effective path delay is realized by the earliest departure di of the data
signal from Ri and latest arrival Af at Rf . Hence, the interval for the effective
path delay of level-sensitive synchronous circuits can be defined as:

af − Di − TSkew (i, f ) + TCP ≤ D̂Pi,f ≤ Af − di − TSkew (i, f ) + TCP . (11.1)
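The permissible interval of (11.1) is straightforward to evaluate. The following C++ sketch computes the interval from the departure window [di, Di], the arrival window [af, Af], the clock skew, and the clock period; the function name and the numeric values in the example are illustrative only and do not correspond to a specific benchmark path.

#include <iostream>

struct Range { double min, max; };

// Effective path delay interval of a local data path per (11.1):
//   af - Di - Tskew + TCP  <=  effective delay  <=  Af - di - Tskew + TCP
Range effective_path_delay(double d_i, double D_i,   // departure window [di, Di] at Ri
                           double a_f, double A_f,   // arrival window  [af, Af] at Rf
                           double t_skew, double t_cp) {
    return { a_f - D_i - t_skew + t_cp,
             A_f - d_i - t_skew + t_cp };
}

int main() {
    // Illustrative values only (arbitrary time units).
    Range r = effective_path_delay(0.0, 4.0, 3.0, 9.0, -2.0, 10.0);
    std::cout << "[" << r.min << ", " << r.max << "]\n";  // prints [11, 21]
    return 0;
}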

In this work, the longest effective path delay is investigated in order to


illustrate the effects of clock skew and time borrowing on data propagation.
The aim is to observe the increase in the effective path delay of a circuit,
which in turn leads to a higher operating frequency. This increase in oper-
ating frequency is obtained by the replacement of flip-flops with latches and
introducing non-zero clock skew. Observe that the distribution of the prop-
agation delays for the s938 benchmark circuit presented in Figure 11.1 is
exactly the same as the distribution of the effective path delay of the same
benchmark circuit s938, when operational with flip-flops (under zero clock
skew).
In circuits with flip-flops, the effective path delays are determinate,
DPi,f − TSkew (i, f ), as the data departures occur at the active transition of
the clock signal.
The distribution of the maximum effective path delays of the level-
sensitive s938 circuit with zero clock skew scheduling is shown in Figure 11.2.
Note that the maximum effective path delay is calculated by the expression
[Af − di − TSkew (i, f ) + TCP ]. The target clock period is TCP = 20.6. The
height of each bar corresponds to the number of paths with an effective path
delay within a given range. It is observed by comparing Figures 11.1 and 11.2
that the maximum effective path delays are increased in the level-sensitive cir-
cuit, as well as providing a smaller minimum clock period (TFFnoskew = 24.4 vs.
TLnoskew = 20.6). The increase in the effective path delays is due to time bor-
rowing. Accumulation of effective path delay values slightly below or above the
minimum operating clock period TCP = 20.6 is visible. Note that the effective
path delay having larger values than the minimum clock period is a sufficient
but not a necessary condition for time borrowing. Thus, local data paths where
the effective path delay is calculated to be smaller than TCP = 20.6 may still
benefit from time borrowing. Furthermore, it can be observed that certain
data paths in the circuit benefit more from time borrowing, realizing an ef-
fective path delay close to the theoretical limit of (TCP + CWL − TSkew (i, f )).

Fig. 11.3. Maximum effective path delays for s938 for non-zero clock skew.

11.1.4 Skew Analysis

As discussed throughout this monograph, non-zero clock skew scheduling in
synchronous circuits permits smaller clock periods. Note that in the presence of
non-zero clock skew, the effective path delay for the data signal over a data
path most likely becomes smaller compared to its value observed in zero clock
skew scheduling. This fact follows from (11.1), since TCP becomes smaller. However,
as the minimum clock period TCP gets smaller, the percentage of the data
paths, on which the effective path delay exceeds the minimum clock period,
significantly increases (see Figure 11.3). The target clock period is TCP = 9.09.
The height of each bar corresponds to the number of paths with an effective
path delay within a given range. The effect of clock skew on improving the
minimum clock period is visible by comparing the histograms presented in
Figures 11.2 and 11.3.
In order to generate an expression for the upper bound, express the fol-
lowing condition:
Di + DPi,fM − TCP + TSkew (i, f ) ≤ TCP − δSLf . (11.2)
In (11.2), the earliest possible time is assigned to Di in order to realize the
upper bound on clock skew. The earliest possible time that a data signal
departs from a latch is DCQ later than the leading edge of the clock signal,
(TCP − CWL) + DCQ. Reordering the expression gives the upper bound on
clock skew:

TSkew (i, f ) ≤ TCP + CWL − DPi,fM − DCQL − δSLf . (11.3)

The lower bound on the clock skew is derived similarly, which leads to:

af + DPi,fm ≥ TCP − TSkew (i, f ) + δHLf . (11.4)

In order to derive the lower bound, the data arrival time at Rf must be con-
sidered to occur at its latest possible time. The latest data arrival time is the
setup time δSLf earlier than the trailing edge of the clock signal, TCP − δSLf .
Thus, the lower bound on the clock skew is:

TSkew (i, f ) ≥ TCP − (TCP − δSLf ) − DPi,fm + δHLf . (11.5)

Combining (11.3) and (11.5), the theoretical limits on the clock skew are expressed
as follows:

−DPi,fm + δSLf + δHLf ≤ TSkew (i, f ) ≤ TCP + CWL − DPi,fM − DCQL − δSLf . (11.6)
Recall that in experimentation, the parameters DDQL, DCQL, δSLf , δHLf are
considered zero and 50% duty cycle is selected for the single-phase synchro-
nization clock signal. In order to evaluate the upper and lower bounds on
clock skew in this simplified case, the parameters are substituted in (11.6):

−DPi,fm ≤ TSkew (i, f ) ≤ 1.5TCP − DPi,fM . (11.7)

Specifically on the ISCAS’89 benchmark circuit s938, the clock skew bounds
are verified using the experimental values shown in Figure 11.1. For the bench-
mark circuit s938 with a minimum clock period of 9.09, the minimum and
maximum propagation delays are calculated to be 5 and 24.4, respectively.
Thus, the values of the clock skew variables on the data paths of s938 are
bounded by −24.4 ≤ TSkew (i, f ) ≤ 8.64.
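This bound evaluation is easily reproduced in code. The sketch below applies (11.7) to a local data path and then evaluates the loosest circuit-wide range for s938 using the extreme propagation delays quoted above (5 and 24.4 time units at TCP = 9.09); the function and variable names are illustrative.

#include <iostream>

// Permissible clock skew range of one local data path per (11.7), assuming
// zero setup/hold/internal delays and a 50% duty cycle single-phase clock:
//   -DP_min(i,f)  <=  Tskew(i,f)  <=  1.5 * TCP - DP_max(i,f)
struct SkewRange { double lower, upper; };

SkewRange path_skew_bounds(double dp_min, double dp_max, double t_cp) {
    return { -dp_min, 1.5 * t_cp - dp_max };
}

int main() {
    const double t_cp = 9.09;  // minimum clock period of s938

    // Loosest circuit-wide range, as evaluated in the text: the most negative
    // permissible skew follows from the largest propagation delay (24.4) and
    // the most positive from the smallest propagation delay (5.0).
    const double lower = path_skew_bounds(24.4, 24.4, t_cp).lower;  // -24.4
    const double upper = path_skew_bounds(5.0, 5.0, t_cp).upper;    //  8.635
    std::cout << lower << " <= Tskew(i,f) <= " << upper << "\n";
    return 0;
}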
The distribution of the clock skew values of s938, when operable with a
minimum clock period of 9.09, is presented in Figure 11.4. The target clock
period is TCP = 9.09. The height of each bar corresponds to the number of
paths formed by sequentially adjacent pairs of registers which have a clock
skew within the given range. The calculated clock skew values are within
the derived limits, and most of these values are negative. Negative clock skew
between registers helps improve the minimum clock period of the synchronous
circuit due to the additional time it provides for data signal propagation.
Positive skew is most likely recorded on certain data paths for two reasons.
The first reason is the presence of data path cycles and reconvergent systems
within the circuit, which have constraining timing properties as explained
in Chapter 8. The second reason is the faster paths, which provide extra
time for neighboring critical paths.
Fig. 11.4. Distribution of the clock skew values of the non-zero clock skew case for
s938.

Fig. 11.5. Distribution of the clock delay values of the non-zero clock skew case for
s938.

The distribution of the clock delays to each register is presented in Fig-
ure 11.5. The target clock period is TCP = 9.09. The height of each bar
corresponds to the number of latches being driven by a clock signal with
a time delay within the given range. The distribution is significantly wide-
spread, ranging from 0 to 19 (time units), where the minimum clock period
is TCP = 9.09. If the clock tree network of the synchronous circuit is im-
plemented to accommodate these nominal clock delays, operation at the
target minimum clock period is achieved.

11.2 Multi-Phase Level-Sensitive Circuits

Multi-phase synchronization of the ISCAS'89 benchmark circuits is performed
using the transformation shown in Figure 11.6. This transformation is similar
to the procedure used in the literature, particularly in [72, 94, 99, 185].

Fig. 11.6. Generation of an n-phase data path with latches.

In particular, to synchronize the circuit with an n-phase scheme, the combinational
logic block is divided into n equal-length (delay) blocks along the logic depth
and n latches are inserted between these blocks. Each latch is synchronized
with one phase of the n-phase synchronization scheme, where the phases are
selected in ascending order based on the location of the duty-cycle. The timing
information of the benchmark circuits is generated with a similar algorithm
to the one used in Section 11.1, where the type, size and fanout of a gate are
considered in the computed delays.
The experiments are performed using dual, three and four-phase clocking
schemes representing various degrees of multi-phase synchronization. For sim-
plicity, non-overlapping multi-phase clock signals with identical duty cycles,
shown in Figure 11.7, are used in experimentation. Due to the transformation
shown in Figure 11.6, a new level of latches is required for each additional
clock phase. The latches are modeled with inherent delays in order to capture
these effects in formulation. Latches are modeled with a delay pair of [0.9, 1.1]
time units, corresponding to the minimum and maximum delays for a latch.
For reference, a unity delay (a delay of 1 time unit) is close to the delay value
of an FO4 inverter in the proposed delay generation algorithm.
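The parameters of the non-overlapping n-phase clock of Figure 11.7 can be generated programmatically. The following sketch lists, for each phase, the phase shift and the transparency period assumed in these experiments (CWL = TCP/n and φk = TCP(k − 1)/n); the structure and names are illustrative only.

#include <iostream>
#include <vector>

struct Phase { double shift; double width; };

// Non-overlapping n-phase clock of Figure 11.7: each phase k has a
// transparency period of TCP/n and a phase shift of TCP*(k - 1)/n.
std::vector<Phase> make_phases(double t_cp, int n) {
    std::vector<Phase> phases;
    for (int k = 1; k <= n; ++k)
        phases.push_back({ t_cp * (k - 1) / n, t_cp / n });
    return phases;
}

int main() {
    // Example: a four-phase clock with TCP = 20 time units.
    for (const Phase& p : make_phases(20.0, 4))
        std::cout << "shift = " << p.shift << ", width = " << p.width << "\n";
    return 0;
}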
For multi-phase synchronization, each additional clock phase requires a
new level of latches to be inserted into data paths, effectively increasing path
delays. Thus, as the number of clock phases increases, the performance of a
zero clock skew system degrades. For non-zero clock skew systems, however,
this is not necessarily the case as shown by these experiments. The solutions
of clock period minimization problems computed with CPLEX (v7.5) barrier
optimizer [115] on a 440MHz Sun Ultra-10 workstation are presented in Ta-
bles 11.2, 11.3 and 11.4, and Figures 11.8, 11.9 and 11.10. The number of
registers r and paths p (before modification) of the ISCAS’89 benchmark
circuits are shown in Tables 11.2, 11.3 and 11.4.

Fig. 11.7. Non-overlapping multi-phase synchronization clock. Each clock phase k
has a transparency period of CWL = TCP /n and a phase shift of φk = TCP (k − 1)/n
within the clock period TCP .

Minimum clock periods,
improvements and calculation time are denoted by T , I and t, respectively.
Subscripts FF, nφ represent circuit topologies for flip-flop based and n-phase
level-sensitive circuits, respectively. Superscripts and titles TB, CSS, TBCS
stand for time borrowing, clock skew scheduling and both, respectively. Min-
imum clock periods (T ) are measured in time units.
In the rest of this section, the experimental results and factors contributing
to the improvements in these results are discussed in greater detail. In partic-
ular, the properties of multi-phase synchronization which affect level-sensitive
circuit performance are discussed in Section 11.2.1. The effects of multi-phase
synchronization on time borrowing are addressed in Section 11.2.2. The ef-
fects of multi-phase synchronization on clock skew scheduling are addressed
in Section 11.2.3. Finally, the effects of multi-phase synchronization on the
simultaneous application of time borrowing and clock skew scheduling are
addressed in Section 11.2.4.

Table 11.2. Minimum clock periods of multi-phase ISCAS’89 benchmark circuits.


                  Zero Clock Skew                           Non-Zero Clock Skew
Circuit   TFF   T1φTB   T2φTB   T3φTB   T4φTB    TFFCSS   T1φTBCS   T2φTBCS   T3φTBCS   T4φTBCS
s27 7.7 6.5 8.2 9.5 10.3 5.2 5.2 6.3 7.4 8.5
s208.1 13.5 9.0 11.8 13.9 15.4 5.8 6.3 9.8 11.1 11.4
s298 14.1 10.8 11.0 13.1 14.7 9.6 9.6 10.3 12.8 14.6
s344 28.1 19.5 22.4 24.5 26.1 19.5 19.5 22.4 24.5 26.1
s349 28.1 19.5 22.4 24.5 26.1 19.5 19.5 22.4 24.5 26.1
s382 15.3 11.0 14.4 15.0 16.7 9.2 9.2 12.2 14.8 16.5
s386 18.9 18.4 19.5 20.6 21.7 18.4 18.4 19.5 20.6 21.7
s400 15.3 11.1 14.4 15.0 16.7 9.3 9.3 12.3 14.8 16.5
s420.1 17.5 12.8 16.1 17.2 18.5 7.7 8.2 12.9 14.3 15.6
s444 17.9 12.6 16.1 17.6 19.3 10.6 10.6 14.5 17.2 19.0
s499 17.5 16.3 18.6 19.3 20.5 16.3 16.3 18.0 19.3 20.5
s510 17.9 15.9 18.2 19.2 20.1 15.9 15.9 17.4 18.8 20.0
s526 14.1 11.0 12.1 13.5 15.1 9.6 9.6 11.4 13.5 15.1
s526n 14.1 11.0 12.1 13.5 15.1 9.6 9.6 11.4 13.5 15.1
s635 165.9 157.4 113.3 127.9 135.4 4.8 4.9 78.8 97.1 102.6
s641 89.1 66.9 74.9 76.6 78.0 62.6 62.6 66.6 73.3 77.2
s713 90.3 71.9 75.8 77.4 78.7 64.5 64.5 67.5 74.2 78.1
s820 19.7 19.4 20.6 21.6 22.7 19.4 19.4 20.5 21.6 22.7
s832 20.1 19.9 21.1 22.1 23.2 19.9 19.9 21.0 22.1 23.2
s838 25.5 20.8 21.5 23.4 25.0 9.3 10.2 17.0 20.1 22.1
s938 25.5 20.8 21.5 23.4 25.0 9.3 10.2 17.0 20.1 22.1
s953 24.3 21.4 21.7 23.6 25.3 19.4 19.4 21.7 23.6 25.3
s967 21.7 18.6 19.6 21.4 22.8 17.3 17.7 19.6 21.4 22.8
s991 97.5 91.8 65.7 74.2 80.6 79.6 79.6 52.0 62.4 68.1
s1196 21.9 16.2 15.3 18.0 20.2 10.1 8.0 8.2 7.1 8.0
s1238 21.9 16.2 15.3 18.0 20.2 10.1 8.0 8.2 7.1 8.0
s1269 52.3 48.0 35.6 40.6 44.5 43.0 43.0 30.5 39.1 44.0
s1423 93.3 86.6 69.4 73.5 77.3 76.7 76.0 69.2 73.1 75.6
s1488 33.3 30.1 32.6 33.8 35.0 30.1 30.1 32.3 33.7 35.0
s1494 33.9 30.7 33.2 34.3 35.6 30.7 30.7 32.9 34.3 35.6
s1512 40.7 35.9 38.4 40.2 41.6 35.9 35.9 38.4 40.2 41.6
s3271 41.5 30.0 28.4 32.6 35.8 28.8 28.8 15.1 10.0 7.7
s3330 35.9 23.9 26.3 28.9 31.8 18.7 18.7 23.6 25.9 27.6
s3384 86.3 77.6 58.3 65.9 71.7 67.6 67.6 41.6 48.5 52.8
s4863 82.3 75.6 55.6 62.9 68.5 69.2 69.2 44.2 45.4 46.5
s5378 29.5 24.1 25.3 26.0 27.2 23.1 23.1 24.4 25.6 26.7
s6669 129.7 124.8 87.2 98.2 106.4 110.0 110.0 62.5 44.5 49.4
s9234.1 76.9 65.0 65.6 69.8 72.4 55.3 55.3 65.6 69.8 72.4
s9234 76.9 65.0 65.6 69.8 72.4 55.3 55.3 65.6 69.8 72.4
s13207.1 77.1 64.8 63.2 68.9 72.3 54.8 54.8 63.2 68.9 72.3
s13207 86.7 67.6 63.3 69.6 74.0 57.8 57.8 63.2 68.9 72.3
s15850.1 82.3 71.6 67.8 71.9 74.8 58.5 58.5 67.8 71.9 74.8
s15850 117.1 93.0 78.8 88.8 96.3 83.8 83.8 67.8 71.9 74.8
s35932 35.3 35.0 36.4 37.5 37.6 21.1 21.1 26.2 30.0 32.5

11.2.1 Multi-Phase Clocking

Multi-phase clocking is superior to single phase clocking by better accom-
modating the transparency periods of latches. Depending on the particular
synchronization scheme, however, the duration of the transparency periods
can be short for each phase, thereby reducing the advantages of multi-phase
clocking. In single-phase clocking, the transparency periods of latches have
identical positions within their respective clock cycles. In multi-phase clock-
ing, the transparency periods of different clock phases are distributed over the
clock cycle. In a multi-phase circuit synchronized with the clock signal shown

Table 11.3. Clock period improvements of multi-phase ISCAS’89 circuits.

          Improvement TB (%)        Improvement CSS (%)        Improvement TBCS (%)
Circuit   I1φ   I2φ   I3φ   I4φ      I1φ   I2φ   I3φ   I4φ      IFF   I1φ   I2φ   I3φ   I4φ
s27 16 -6 -24 -34 20 23 22 18 32 32 18 3 -10
s208.1 33 13 -3 -14 30 17 20 26 57 53 27 18 16
s298 23 22 7 -4 11 6 3 1 32 32 27 9 -3
s344 31 20 13 7 0 0 0 0 31 31 20 13 7
s349 31 20 13 7 0 0 0 0 31 31 20 13 7
s382 28 6 2 -9 16 15 2 1 40 40 20 4 -8
s386 3 -3 -9 -15 0 0 0 0 3 3 -3 -9 -15
s400 28 6 2 -9 16 15 2 1 39 39 20 3 -8
s420.1 27 8 2 -6 36 20 17 15 56 53 26 18 11
s444 30 10 2 -8 16 10 3 1 41 41 19 4 -6
s499 7 -6 -10 -17 0 3 0 0 7 7 -3 -10 -17
s510 11 -2 -7 -12 0 5 2 1 11 11 3 -5 -12
s526 22 14 4 -7 12 6 0 0 32 32 20 4 -7
s526n 22 14 4 -7 12 6 0 0 32 32 20 4 -7
s635 5 32 23 18 97 30 24 24 97 97 53 41 38
s641 25 16 14 13 6 11 4 1 30 30 25 18 13
s713 20 16 14 13 10 11 4 1 29 29 25 18 14
s820 2 -5 -10 -15 0 0 0 0 2 2 -4 -10 -15
s832 1 -5 -10 -15 0 0 0 0 1 1 -4 -10 -15
s838 18 16 8 2 51 21 14 12 64 60 33 21 14
s938 18 16 8 2 51 21 14 12 64 60 33 21 14
s953 12 11 3 -4 9 0 0 0 20 20 11 3 -4
s967 15 10 2 -5 5 0 0 0 20 18 10 2 -5
s991 6 33 24 17 13 21 16 16 18 18 47 36 30
s1196 26 30 18 8 51 47 61 60 54 63 63 68 64
s1238 26 30 18 8 51 47 61 60 54 63 63 68 64
s1269 8 32 22 15 10 14 4 1 18 18 42 25 16
s1423 7 26 21 17 12 0 0 2 18 19 26 22 19
s1488 10 2 -1 -5 0 1 0 0 10 10 3 -1 -5
s1494 9 2 -1 -5 0 1 0 0 9 9 3 -1 -5
s1512 12 6 1 -2 0 0 0 0 12 12 6 1 -2
s3271 28 32 22 14 4 47 69 79 31 31 64 76 82
s3330 33 27 19 11 22 10 10 13 48 48 34 28 23
s3384 10 32 24 17 13 29 26 26 22 22 52 44 39
s4863 8 32 24 17 8 21 28 32 16 16 46 45 43
s5378 18 14 12 8 4 3 2 2 22 22 17 13 9
s6669 4 33 24 18 12 28 55 54 15 15 52 66 62
s9234.1 15 15 9 6 15 0 0 0 28 28 15 9 6
s9234 15 15 9 6 15 0 0 0 28 28 15 9 6
s13207.1 16 18 11 6 15 0 0 0 29 29 18 11 6
s13207 22 27 20 15 15 0 1 2 33 33 27 21 17
s15850.1 13 18 13 9 18 0 0 0 29 29 18 13 9
s15850 21 33 24 18 10 14 19 22 28 28 42 39 36
s35932 1 -3 -6 -6 40 28 20 14 40 40 26 15 8
Average 16.7 15.3 8.0 1.6 16.5 12.1 11.4 11.3 30.3 30.3 24.8 17.7 12.0

in Figure 11.7, for instance, the transparency periods are located at different
times within the clock cycle (e.g., clock phases C 1 and C n are the first and last
sections, respectively). Such variety in the locations of transparency periods
provides flexibility on the permissible data propagation times of a local data
path. The assorted assignment of clock phases to registers, achieved through
clock skew scheduling or any other methods, leads to improvements in the
circuit performance.

Table 11.4. Circuit info and run times for multi-phase ISCAS’89 circuits.

          Circuit Info       Time (sec)
Circuit   r     p            t1φTB   t1φTBCS   t2φTB   t2φTBCS   t3φTB   t3φTBCS   t4φTB   t4φTBCS
s27 3 4 0 0 0 0 0 0 0 0
s208.1 8 28 0 0 0 0 0 0 0 0
s298 14 54 0 0 0 0 0 0 0 0
s344 15 68 0 0 0 0 0 0 0 0
s349 15 68 0 0 0 0 0 0 0 0
s382 21 113 0 0 0 0 0 0 0 0
s386 6 15 0 0 0 0 0 0 0 0
s400 21 113 0 0 0 0 0 0 0 0
s420.1 16 120 0 0 0 0 0 0 0 0
s444 16 113 0 0 0 0 0 0 0 0
s499 22 462 0 0 0 0 0 0 1 1
s510 6 15 0 0 0 0 0 0 0 0
s526 21 117 0 0 0 0 0 0 0 0
s526n 21 117 0 0 0 0 0 0 0 0
s635 32 496 0 0 0 1 0 1 1 1
s641 19 81 0 0 0 0 0 0 0 0
s713 19 81 0 0 0 0 0 0 0 0
s820 5 10 0 0 0 0 0 0 0 0
s832 5 10 0 0 0 0 0 0 0 0
s838 32 496 0 0 0 0 0 1 1 0
s938 32 496 0 0 0 0 0 1 1 0
s953 29 135 0 0 0 0 1 0 0 0
s967 29 135 0 0 0 0 1 0 0 0
s991 19 51 0 0 0 0 0 0 0 0
s1196 18 20 0 0 0 0 0 0 0 0
s1238 18 20 0 0 0 0 0 0 0 0
s1269 37 1260 0 0 0 0 0 0 1 1
s1423 74 1471 1 1 1 2 2 2 2 3
s1488 6 15 0 0 0 0 0 0 0 0
s1494 6 15 0 0 0 0 0 0 0 0
s1512 57 415 0 0 0 1 1 1 1 1
s3271 116 789 1 1 1 1 1 1 2 2
s3330 132 514 0 0 1 1 1 1 2 2
s3384 183 1759 1 1 2 2 3 3 4 4
s4863 104 620 0 1 1 1 1 1 1 2
s5378 179 1147 1 1 1 2 2 2 3 3
s6669 239 2138 1 2 2 3 3 4 4 6
s9234.1 228 247 2 3 2 3 4 5 5 6
s9234 211 2342 2 3 3 4 4 5 5 7
s13207.1 669 3068 3 4 6 7 9 11 13 15
s13207 669 3068 3 5 6 8 9 13 13 19
s15850.1 534 10830 10 25 13 30 19 37 25 47
s15850 597 14257 15 26 19 32 24 42 30 46
s35932 1728 4187 6 8 16 17 21 28 27 39

As illustrated in the transformation procedure shown in Fig. 11.6, an extra


level of latches is inserted onto a logic data path for each clock phase. The
delays of these inserted latches can become significant for higher number of
clock phases which degrades the minimum clock period in the absence of a
clock skew scheduling application. For non-zero clock skew circuits, however,
the negative effects of latch insertion can be compensated, potentially leading
to equivalent or improved circuit performances. Note that, the complexity of
the design process increases due to clock skew scheduling and the complexity
of the timing analysis increases due to the multiplicity of clocking phases.
Fig. 11.8. Effects of multi-phase clocking on time borrowing.

As an added note, consider that the transformation procedure in Fig. 11.6
leads to a certain bias in circuit operation, such that each sequentially adja-
cent latch pair is synchronized by two consecutive clock phases (i.e., φk and
φk+1 , k ∈ {1, 2, . . . , n − 1}).
national blocks are distributed evenly between clock phases according to the
transformation procedure. In typical circuit implementations, these regulari-
ties do not always occur, which leaves more room for improvement.

11.2.2 Multi-Phase Clocking Effects on Time Borrowing

In order to observe the effects of multi-phase clocking on time borrowing


(without clock skew scheduling), the transformation procedure of Figure 11.6
is applied under a conventional zero clock skew synchronization regime (i.e.
φ1 = φ2 = · · · = φn in Fig. 11.6). As shown in Table 11.3, an average im-
provement of 16.7% is achieved for single-phase circuits. Also, average im-
provements of 15.3%, 8.0% and 1.6% are achieved for dual, three and four
phase clocking schemes, respectively. For a visual representation of these re-
sults, the percentage improvements presented in Table 11.3 for each bench-
mark circuit are illustrated in Figure 11.8. In Figure 11.8, four data points
shown per benchmark circuit from left-to-right are the percentage improve-
ments observed for the single-phase, dual-phase, three-phase and four-phase
synchronization schemes, respectively.

It is observed that the improvement achieved through time borrowing de-
creases on average as the number of clock phases increases. This average
degradation is expected, because by definition (CW = T /n, n ≥ 2), the trans-
parency period of latches shortens for a higher number of clock phases. The
degradation is worsened by the increasing delays of the latches inserted in
accordance with the transformation procedure presented in Fig. 11.6. Nev-
ertheless, 34% of the benchmark circuits in Table 11.2 (15 out of 44 total)
benefit more from time borrowing under multi-phase clocking. These circuits
demonstrate that the average degradation in improvement through time bor-
rowing with multi-phase clocking is not observed for all circuits. For these
circuits, multi-phase distribution of latch transparency periods provides addi-
tional slack where necessary, leading to these improvements in specific cases.

11.2.3 Multi-Phase Clocking and Clock Skew Scheduling

In order to observe the effects of multi-phase clocking on clock skew scheduling


(without time borrowing), comparisons have been performed between level-
sensitive, non-zero clock skew circuits and level-sensitive, zero clock skew
implementations. The improvement attributed to clock skew scheduling for
a dual-phase, level-sensitive circuit implementation is computed using the
formula (Told − Tnew ) /Told × 100%, where Tnew is the clock period of the
dual-phase, clock-skew-scheduled, level-sensitive implementation whereas Told
is the clock period of the dual-phase, zero-clock skew, level-sensitive imple-
mentation of the same circuit. The results for edge-sensitive circuits display
improvements of 30.3% on average due to clock skew scheduling alone, con-
forming with earlier results in the literature [2, 96]. For single, dual, three and
four-phase level-sensitive implementations, clock skew scheduling results in
16.5%, 12.1%, 11.4% and 11.3% improvements on average, respectively.
The percentage improvements for each benchmark circuit are illustrated in
Fig. 11.9. In Figure 11.9, five data points shown per benchmark circuit from
left-to-right are the percentage improvements observed for the edge-sensitive,
single-phase, dual-phase, three-phase and four-phase synchronization schemes,
respectively. The average degradation in performance (compared to non-zero
clock skew, edge-sensitive circuits) is expected as the even distribution of
the transparency periods potentially negates the effectiveness of clock skew
scheduling. Nevertheless, 34% of the benchmark circuits in Table 11.2 (15
out of 44, where 5 circuits are not improved by clock skew scheduling for
any synchronization scheme) benefit more from clock skew scheduling under
multi-phase clocking. These circuits demonstrate that the average degradation
in improvements of clock skew scheduling with multi-phase clocking is not
observed for all circuits. For this important special case observed in some
circuits, the change in the delay paths (by the multi-phase transformation
procedure) is such that the resulting circuits are more suitable to the opti-
mization provided by clock skew scheduling.
Fig. 11.9. Effects of multi-phase clocking on clock skew scheduling.

As an added note, consider that clock skew scheduling is more effective


on circuits with certain characteristics. In particular, if the data propagation
delays on the local data paths of a circuit are irregular, higher improvements
in the circuit performance are achievable through clock skew scheduling. The
transformation method shown in Fig. 11.6 proposes an even distribution of
data propagation delays for adjacent local data paths, increasing the regular-
ity of the circuit for increasing number of clock phases. Therefore, the high
regularity of the multi-phase circuits—due to the bias in the transformation
procedure in Fig. 11.6—also contributes to the degradation.

11.2.4 Simultaneous Time Borrowing and Clock Skew Scheduling

In non-zero clock skew circuits, final improvements of 30.3%, 24.8%, 17.7%


and 12.0% on average are observed for single, dual, three and four phase
clocking schemes, respectively. These final improvements are due to the simul-
taneous application of clock skew scheduling and time borrowing, i.e. clock
skew scheduling applied to the level-sensitive version of the circuit. For a
dual-phase clocking regime, the improvement is computed using the formula
(Told − Tnew ) /Told × 100%, where Tnew is the clock period of the dual-phase,
clock-skew-scheduled, level-sensitive implementation whereas Told is the clock
period of the single-phase, zero-clock skew, edge-sensitive implementation (not
shown in Table 11.2) of the same circuit.

Fig. 11.10. Effects of multi-phase clocking on time borrowing and clock skew
scheduling.

The percentage improvements for each benchmark circuit are illustrated in
Fig. 11.10. In Figure 11.10, four
data points shown per benchmark circuit from left-to-right are the percent-
age improvements observed for the single-phase, dual-phase, three-phase and
four-phase synchronization schemes, respectively.
In general, the observed improvements for multi-phase synchronized cir-
cuits are superior compared to zero-skew, edge-sensitive circuits. Exemplifying
the positive trend is the analysis of the benchmark circuit s1196, for instance,
where an improvement of 68% is observed for three-phase clocking through
time borrowing and clock skew scheduling. For the same circuit, the improve-
ments are at 63%, 63% and 64% for single, dual and four-phase clocking,
respectively.
As discussed in Sections 11.2.2 and 11.2.3, the improvements achieved
through time borrowing and clock skew scheduling decrease on average as the
number of clock phases increases. The improvements through simultaneous
application of time borrowing and clock skew scheduling decrease on average
as well, as the number of clock phases increases. Some negative improvements
are also recorded; these correspond to circuits with significant delay increases due to
latch insertion. Nevertheless, 23% of the level-sensitive benchmark circuits
in Table 11.2 (10 out of 44) benefit more from clock skew scheduling under
multi-phase clocking. These circuits demonstrate that the average degradation
in improvements of simultaneous time borrowing and clock skew scheduling
with multi-phase clocking is not observed for all the circuits.

It is observed from experiments that no particular multi-phase approach
is superior to others in all cases. For some circuits, conventional one-phase
edge-triggered or dual-phase level-sensitive applications can be best; however,
unconventional (but highly feasible with rotary clocking) implementations of
three and four phase synchronization of level-sensitive circuits can be best for
others. Investigation of all schemes using the presented analysis framework
is necessary in order to identify the optimal synchronization scheme for any
rotary-clock synchronized circuit.

11.3 Quadratic Programming (QP) for Maximizing Safety

A quadratic programming formulation of the clock skew scheduling problem


is developed in Chapter 7. This QP problem can be efficiently solved by ap-
plying the mathematical procedures developed in Chapter 9. The algorithm
described in Section 7.2.3 has been implemented as a C++ program and ap-
plied to ISCAS’89 and ISCAS’93 benchmark circuits, as well as to industrial
circuits (IC1, IC2, and IC3). Results from the application of this computer
program are described in this chapter. Certain characteristics of the imple-
mentation are initially described in Section 11.3.1. Graphical illustrations of
representative results are shown in Section 11.3.2.

11.3.1 Description of Computer Implementation

The results described in this section are obtained from the execution of a
computer implementation of Algorithm CSD introduced in Section 9.1.3. This
computer implementation shares code with the computer implementation de-
scribed in Section 5.7. In particular, the input data file format and the in-
put/output routines are exactly the same. Without unnecessary details, this
computer implementation consists of the sequential execution of the following
major steps:
Step 1. Input data file format and input/output routines are shared with
the LP computer implementation described in 5.7. The circuit timing
and connectivity data is read in and compressed and stored in a binary
database. The database can be used for fast data access in subsequent
algorithmic applications of the same circuit. Furthermore, the data size of
the database permits significant space and time savings if the circuit data
is exchanged.
Step 2. The circuit data is examined and the circuit graph is built according to
the graph model described in 5.2.2. An adjacency lists data structure [105]
stored in memory is used for fast access of the circuit graph data.

Step 3. The circuit graph is transformed according to the transformation rules


described in 5.7 and illustrated in Figure 5.5. Within this step, the per-
missible range bounds are calculated and directions for the graph edges
are determined.
Step 4. The circuit graph is traversed in order to determine the edges in the
skew basis sb and in the skew chords sc . This graph traversal is accom-
plished by using a depth-first search [89, 105, 124] algorithm—the classical
traversal algorithm of choice for building a spanning tree. Three additional
important tasks are accomplished during the traversal step:
1. For circuits with more than one connected disjoint subcircuit, these
connected disjoint parts are identified and marked. This step does not
incur any computational overhead—it is an inherent feature of the
depth-first search graph traversal algorithm to separate a graph into
disjoint pieces (if any).
2. The skew basis and chords of each disjoint connected circuit subgraph
are identified and enumerated.
3. The circuit connectivity matrix B (actually, only the non-identity-
matrix C portion of B) is derived for each disjoint connected cir-
cuit subgraph. Recall that C contains only elements from the set
{−1, 0, 1}, thus permitting an efficient bit compression scheme to be
used to store C in a small amount of memory.
Step 5. Using C, the matrix N is computed as described by (9.9).
Step 6. The Cholesky factorization L2 of N is calculated as described by (9.10).
Simple, yet efficient algorithms for computing the Cholesky factorization
have long been known and can be found in multiple sources [127, 131,
134, 135]; a minimal sketch of this step is shown after this list. Recall that
the matrix N is guaranteed to be positive-definite
by construction. Therefore, the real (no complex numbers) Cholesky de-
composition is guaranteed to exist.
Step 7. The objective clock skews are chosen at the center of the permissible
range for all local data paths. The actual clock skews (a consistent clock
schedule) are calculated as described by (9.25) and as illustrated in Fig-
ure 9.1. At this point, each clock skew is verified against the respective
permissible range. If all skews are within the respective permissible range
bounds, the algorithm concludes. Otherwise, the objective clock skews are
modified and the calculation is repeated again. Only the calculation de-
scribed in this step must be repeated since all matrices have now been
computed.
Different objective clock schedule modification strategies can be used. The
most effective strategy to modify the objective clock schedule—resulting
in the fastest convergence towards a feasible schedule—is as follows. All
objective clock skews are slightly increased or decreased depending upon
whether the respective calculated clock skew is larger or smaller than the
objective one. Using this strategy, a feasible solution is typically reached
within a few iterations.

Step 8. The actual clock delays to the individual registers are calculated by
traversing the spanning tree (basis) of the circuit graph. The clock delay
of the first register is arbitrarily chosen (zero in this implementation). As
the spanning tree is traversed, additional vertices adjacent to the current
vertex are visited. The clock delay of the visited vertex is determined
trivially since both the clock delay of the current vertex and the clock
skew of the edge between the current and visited vertex are known.
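The Cholesky factorization invoked in Step 6 is a standard numerical routine. The following C++ sketch is a textbook dense implementation, included only for illustration under the assumption that N is symmetric and positive-definite; it is not the code used in the reported experiments, which may rely on more efficient library routines.

#include <cmath>
#include <iostream>
#include <vector>

// Dense Cholesky factorization of a symmetric positive-definite matrix N
// (row-major, size n x n): returns the lower-triangular L with N = L * L^T.
std::vector<double> cholesky(const std::vector<double>& N, int n) {
    std::vector<double> L(n * n, 0.0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j <= i; ++j) {
            double sum = N[i * n + j];
            for (int k = 0; k < j; ++k)
                sum -= L[i * n + k] * L[j * n + k];
            if (i == j)
                L[i * n + i] = std::sqrt(sum);   // positive since N is positive-definite
            else
                L[i * n + j] = sum / L[j * n + j];
        }
    }
    return L;
}

int main() {
    // Small positive-definite example: N = [[4, 2], [2, 3]].
    const std::vector<double> N = {4, 2, 2, 3};
    const std::vector<double> L = cholesky(N, 2);
    std::cout << L[0] << " 0\n" << L[2] << " " << L[3] << "\n";  // 2 0 / 1 1.41421
    return 0;
}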
The results of the application of the algorithm to these circuits are summa-
rized in Table 11.5. For each circuit, the following data is listed—the circuit
name in column 1, the number of disjoint subgraphs in column 2, and the
number of vertices, edges, chords (cycles), main and isolated basis, and target
clock period in nanoseconds in columns 3 through 8, respectively. The num-
ber of iterations to reach a solution is listed in column 9. The average value
of ε in (7.42), that is, ε/p, is listed in column 10. The run time in minutes
for the mathematical portion of the program is shown in column 11 for a 170
MHz Sun Ultra 1 workstation.

11.3.2 Graphical Illustrations of Results


The application of the computer implementation described in Section 11.3.1
to many of the circuits listed in Table 11.5 is graphically illustrated in this
section. Immediately following are illustrations of two circuits shown in Fig-
ures 11.11 and 11.12, respectively.
Three histograms for a circuit are shown in each graphical illustration.
These histograms are as follows:
(a) The distribution of the zero clock skews in the permissible range for the
clock period listed in Table 11.5 is illustrated in subfigure (a).
(b) The distribution of the non-zero clock skews in the permissible range after
one iteration of problem QP-2—as described in Step 7 in Section 11.3.1—is
shown in subfigure (b). Note that there are frequent lower bound and upper
bound violations of the permissible range. These violations are represented
by the dark leftmost and rightmost regions, respectively, where the number
of violations is also indicated.
(c) The final distribution of the non-zero clock skews within the permissible
range—no timing violations—is illustrated in subfigure (c). There is a
noticeable improvement since most clock skews are concentrated around
the center of the permissible range. The majority of the clock skews are
within 10% of the safest clock skew value at the center of the respective
permissible range of each local data path.

Table 11.5. Experimental results of the application of the QP based clock scheduling algorithm to both benchmark and industrial circuits.

Circuit (1)  #subcircuits (2)  r (3)  p (4)  nc (5)  nm (6)  ni (7)  TCP, ns (8)  #iterations (9)  ε/p (10)  Run time, min (11)
s1196 7 18 20 9 8 3 20.8 5 3.19 1
s1238 7 18 20 9 8 3 20.8 5 3.19 1
s13207 49 669 3068 2448 581 39 85.6 20 18.92 5
s1423 2 74 1471 1399 72 0 92.2 20 60.9 3
s1488 1 6 15 10 5 0 32.2 1 0.87 1
s1494 1 6 15 10 5 0 32.8 1 0.88 1
s15850 15 597 14257 13675 546 36 116 10 70.6 21
s15850.1 22 534 10830 10318 478 34 81.2 9 31.44 19
s208.1 1 8 28 21 7 0 12.4 1 1.22 1
s27 1 3 3 1 2 0 6.6 1 0.71 1
s298 1 14 54 41 12 1 13 1 1.16 1
s344 1 15 68 54 14 0 27 4 4.91 1
s349 1 15 68 54 14 0 27 4 4.91 1
s35932 1 1728 4187 2460 1727 0 34.2 20 60.4 27
s382 1 21 113 93 20 0 14.2 6 1.59 2
s38417 11 1636 28082 26457 1443 182 69 20 32.35 31
s38584 2 1452 15545 14095 1400 50 94.2 11 29.1 29
s386 1 6 15 10 5 0 17.8 1 0.82 1
s400 1 21 113 93 20 0 14.2 8 1.6 1
s420.1 1 16 120 105 15 0 16.4 20 1.95 1
s444 1 21 113 93 20 0 16.8 2 1.05 1
s510 1 6 15 10 5 0 16.8 1 0.85 1
s526 1 21 117 97 20 0 13 2 1.26 1
s526n 1 21 117 97 20 0 13 2 1.26 2
s5378 1 179 1147 969 158 20 28.4 20 8.79 3
s641 1 19 81 63 18 0 83.6 5 11.67 1
s713 1 19 81 63 18 0 89.2 6 12.74 1
s820 1 5 10 6 4 0 18.6 1 0.71 1
s832 1 5 10 6 4 0 19 2 0.66 1
s838.1 1 32 496 465 31 0 24.4 3 3.68 3
s9234 3 228 2476 2251 222 3 75.8 20 16.67 4
s9234.1 2 211 2342 2133 205 4 75.8 20 18.6 4
s953 4 29 135 110 25 0 23.2 3 1.93 2
s1269 1 37 251 215 36 0 51.2 20 12.73 2
s1512 1 57 405 349 56 0 39.6 4 4.43 3
s3271 1 116 789 674 107 8 40.4 3 3.64 5
s3330 1 132 514 383 61 70 34.8 4 3.4 5
s3384 25 183 1759 1601 151 7 85.2 5 15.5 7
s4863 1 104 620 517 103 0 81.2 8 39.85 3
s6669 20 239 2138 1919 218 1 128.6 3 20.67 6
s938 1 32 496 465 31 0 24.4 2 3.41 2
s967 4 29 135 110 25 0 20.6 2 1.76 2
s991 1 19 51 33 18 0 96.4 3 8.58 1
IC1 1 500 124750 124251 499 0 8.2 2 1.51 30
IC2 1 59 493 435 58 0 10.3 3 1.82 4
IC3 34 1248 4322 3108 1155 59 5.6 2 1.43 2
(Figure: three histograms of the clock skew distribution within the permissible range: (a) zero skew, (b) non-zero clock skew after iteration #1, (c) non-zero clock skew after all iterations.)
Fig. 11.11. Circuit s3271 with r = 116 registers and p = 789 local data paths. The target clock period is TCP = 40.4 nanoseconds.
(Figure: three histograms of the clock skew distribution within the permissible range: (a) zero skew, (b) non-zero clock skew after iteration #1, (c) non-zero clock skew after all iterations.)
Fig. 11.12. Circuit s1512 with r = 57 registers and p = 405 local data paths. The target clock period is TCP = 39.6 nanoseconds.
11.4 Delay Insertion in Clock Skew Scheduling

For experimentation, the clock skew scheduling algorithms with the delay
insertion method proposed for edge-triggered and level-sensitive circuits (Tables 8.1 and 8.2) are applied to the ISCAS'89 benchmark circuits. Continuous
delay models have been used in the experimentation. The experimental setup
in Section 11.1 (circuit delay information, clock signal duty cycle, internal
register delays, computing platform, LP solver) is replicated for the proposed
timing analyses. Experimental results are presented in Table 11.6. In Ta-
ble 11.6, the data shown are the number of registers r and paths p, the clock
period T_FF for the zero skew circuit with flip-flops, T_FF^CSS for the non-zero skew circuit
with flip-flops, and T_FF^DICSS for the non-zero skew circuit using delay insertion
with flip-flops. Also listed are the calculation times t_FF^CSS and t_FF^DICSS of T_FF^CSS and
T_FF^DICSS, respectively, and the percentage clock period improvements I_FF^CSS,
I_FF^DICSS and I_FF^DI for the improvements from T_FF to T_FF^CSS, from T_FF to T_FF^DICSS,
and from T_FF^CSS to T_FF^DICSS, respectively.
The clock skew scheduling algorithms used in experimentation are tar-
geting the clock period minimization problem. Therefore the improvements
achieved in the minimum clock period through the application of clock skew
scheduling and delay insertion methods are reported in Table 11.6. These
improvements are computed with the formula (T_old − T_new)/T_old × 100. The
zero clock skew, edge-sensitive synchronous circuit is selected as the common
comparison mark due to its simplicity and popularity in digital circuit de-
sign. Both for edge-triggered and level-sensitive circuits, the improvements
through conventional clock skew scheduling (I_FF^CSS and I_L^CSS, respectively)
and through clock skew scheduling with delay insertion (I_FF^DICSS and I_L^DICSS,
respectively) are computed. Also shown in Table 11.6 are the comparisons
of the non-zero clock skew circuits scheduled with conventional clock skew
scheduling methods with non-zero clock skew circuits with delay insertion.
These comparisons (I_FF^DI and I_L^DI, respectively, for edge-triggered and level-
sensitive circuits) demonstrate the effectiveness of the delay insertion method
in further improving the performance of a conventional clock skew scheduled
circuit.
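As a simple check of this formula, the following hypothetical C++ snippet (not part of the experimental code) reproduces the improvement percentages of the s208.1 row in Table 11.6.

    #include <cstdio>

    // Percentage improvement (T_old - T_new)/T_old x 100 used throughout Table 11.6.
    double improvement(double t_old, double t_new) {
      return (t_old - t_new) / t_old * 100.0;
    }

    int main() {
      // Values taken from the s208.1 row of Table 11.6:
      // T_FF = 12.4 tu, T_FF^CSS = 4.9 tu, T_FF^DICSS = 1.6 tu.
      std::printf("I_FF^CSS   = %.0f%%\n", improvement(12.4, 4.9));   // about 60%
      std::printf("I_FF^DICSS = %.0f%%\n", improvement(12.4, 1.6));   // about 87%
      std::printf("I_FF^DI    = %.0f%%\n", improvement(4.9, 1.6));    // about 67%
      return 0;
    }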
For the ISCAS’89 benchmark circuits, the delay insertion method leads
to 10% and 9% improvements on average over the conventional clock skew
scheduling algorithms for edge-triggered and level-sensitive circuits, respec-
tively. For better visualization, the performance improvements in minimum
clock period of edge-triggered and level-sensitive circuits achieved respectively
over corresponding non-zero clock skew edge-triggered and level-sensitive cir-
cuits are presented in Figure 11.13. Shown in Figure 11.13 are the percentage
improvements I_FF^DI and I_L^DI that are also presented in Table 11.6. Two data
points shown per benchmark circuit from left-to-right are the improvements
observed for edge-triggered and level-sensitive circuits, respectively. Note that
these improvements are due to delay insertion simultaneous with clock skew
scheduling.
The delay insertion method cannot be applied (i.e., is not beneficial) to some
circuits due to the two reasons discussed in Sections 8.2 and 8.2.2. The first
reason, discussed in Section 8.2, is the fact that the minimum clock period of
the circuit can be determined by a limitation other than reconvergent paths,
which cannot be mitigated by the delay insertion method.
Circuit Info | Edge-Triggered Circuits | Level-Sensitive Circuits
Columns (left to right): ISCAS'89 Circuit, r, p; edge-triggered clock periods (tu) T_FF, T_FF^CSS, T_FF^DICSS, run times (s) t_FF^CSS, t_FF^DICSS, improvements (%) I_FF^CSS, I_FF^DICSS, I_FF^DI; level-sensitive clock periods (tu) T_L, T_L^CSS, T_L^DICSS, run times (s) t_L^CSS, t_L^DICSS, improvements (%) I_L, I_L^CSS, I_L^DICSS, I_L^DI.
s27 3 4 6.6 4.1 4.1 0 0 38 38 0 5.4 4.1 4.1 0 0 18 38 38 0
s208.1 8 28 12.4 4.9 1.6 0 0 60 87 67 8.6 5.2 1.6 0 0 31 58 87 69
s298 14 54 13.0 9.4 9.4 0 0 28 28 0 10.6 9.4 9.4 0 0 18 28 28 0
s344 15 68 27.0 18.4 18.4 0 0 32 32 0 18.4 18.4 18.4 0 0 32 32 32 0
s349 15 68 27.0 18.4 18.4 0 0 32 32 0 18.4 18.4 18.4 0 0 32 32 32 0
s382 21 113 14.2 8.5 6.0 0 0 40 58 29 10.3 8.5 6.0 0 0 27 40 58 29
s386 6 15 17.8 17.3 17.3 0 0 3 3 0 17.3 17.3 17.3 0 0 3 3 3 0
s400 21 113 14.2 8.6 6.0 0 0 39 58 30 10.4 8.6 6.0 0 0 27 39 58 30
s420.1 16 120 16.4 6.8 1.6 0 0 59 90 76 12.6 7.2 1.6 0 0 23 56 90 78
s444 16 113 16.8 9.9 7.9 0 0 41 53 20 12.4 9.9 8.0 0 0 26 41 53 20


s510 6 15 16.8 14.8 14.8 0 0 12 12 0 14.8 14.3 14.3 0 0 12 15 15 0
s526n 21 117 13.0 9.4 9.4 0 0 28 28 0 10.6 9.4 9.4 0 0 18 28 28 0
s641 19 81 83.6 61.9 57.8 0 0 26 31 7 66.2 61.9 57.8 0 0 21 26 31 7
s713 19 81 89.2 63.8 59.4 0 0 28 33 7 71.2 63.8 59.4 0 0 20 28 33 7
s820 5 10 18.6 18.3 18.3 0 0 2 2 0 18.3 18.3 18.3 0 0 2 2 2 0
s832 5 10 19.0 18.8 18.8 0 0 1 1 0 21.2 18.3 18.3 0 0 9 21 21 0
s953 29 135 23.2 18.3 18.3 0 0 21 21 0 16.0 7.8 7.8 0 0 23 63 63 0
s1196 18 20 20.8 10.8 7.8 0 0 48 63 28 16.0 7.8 7.8 0 0 23 63 63 0
s1423 74 1471 92.2 77.4 75.8 0 0 16 18 2 86.4 75.8 75.8 1 2 6 18 18 0
s1488 6 15 32.2 29.0 29.0 0 0 10 10 0 29.0 29.0 29.0 0 0 10 10 10 0
s1494 6 15 32.8 29.6 29.6 0 0 10 10 0 23.2 22.0 22.0 1 2 18 23 23 0
s5378 179 1147 28.4 22.0 22.0 0 0 23 23 0 64.8 54.2 54.2 2 4 15 28 28 0
s9234 228 247 75.8 54.2 54.2 1 1 28 28 0 67.4 57.1 53.8 4 7 21 33 37 6
s13207 669 3068 85.6 57.1 53.8 1 2 33 37 6 92.8 83.6 83.6 23 44 20 28 28 0
s15850 597 14257 116.0 83.6 83.6 5 19 28 28 0 71.4 57.4 57.4 23 34 12 29 29 0
s15850.1 534 10830 81.2 57.4 57.4 5 10 29 29 0 34.1 20.4 15.7 7 16 0 40 54 23
s35932 1728 4187 34.2 20.4 15.7 1 6 40 54 23 54.8 42.2 42.2 41 101 21 39 39 0
s38417 1636 28082 69.0 42.2 42.2 15 37 39 39 0 76.4 65.2 62.8 31 51 19 31 33 4
s38584 1452 15545 94.2 65.2 62.8 5 15 31 33 4 76.4 65.2 62.8 31 51 19 31 33 4
Average – 28 34 10 – 17 29 34 9
Table 11.6. Delay insertion results for edge-sensitive ISCAS’89 benchmark circuits.

The second reason, discussed in Section 8.2.2, is the fact that due to the uncertainty of the
delay elements inserted into the logic, the delay insertion might be ineffective
in improving the minimum clock period. In the LP formulations presented
in Tables 8.1 and 8.2, the uncertainties of the delay elements are modeled
without lower (and upper) bounds (delay elements can have zero uncertainty
with I_m = I_M). Thus, the second reason for inapplicability is not observed in
the experimentation. Among the selected ISCAS’89 circuits, the delay inser-
tion method for edge-triggered circuits is applicable to 41% (12 circuits) of
the total 29 circuits. By excluding the circuits for which zero improvements
are observed (for which the method is not applicable due to the first reason
stated above), the average improvement of the delay insertion method for
edge-triggered circuits is observed to be 26% over the conventional clock skew
scheduling algorithm of [2] (Table 5.1). The delay insertion method on level-
sensitive circuits was applicable to 34% (10 circuits) of the total 29 circuits.
By excluding the circuits for which zero improvements are observed, the av-
erage improvement of the delay insertion method for level-sensitive circuits is
observed to be 27% on average over the conventional clock skew scheduling
algorithm presented in Chapter 5.
The experimental results in Figure 11.13 show that reconvergent paths—
with a significant probability (41% and 34% as observed on the ISCAS’89
circuits)—are the dominant limiting factor on the minimum clock period after
clock skew scheduling for a synchronous circuit. The delay insertion method
can effectively be used to mitigate these limitations, as shown by 26% and
27% improvements in the minimum clock period. The proposed clock skew
scheduling method with delay insertion takes about twice as much time as
the conventional application of clock skew scheduling; however, the method is
highly practical, with total run times below a few minutes on commonly available
computing resources.
The improvements in minimum clock period achieved through conventional
clock skew scheduling (I_FF^CSS and I_L^CSS), and through clock skew scheduling
with delay insertion (I_FF^DICSS and I_L^DICSS) for edge-triggered and level-
sensitive circuits are visually presented for each benchmark circuit in Fig-
ures 11.14 and 11.15, respectively.
Shown in Figure 11.14 are the percentage improvements (I_FF^CSS and I_FF^DICSS
in Table 11.6, respectively) in minimum clock period via clock skew schedul-
ing and delay insertion for edge-triggered ISCAS’89 benchmark circuits. Two
data points shown per benchmark circuit from left-to-right are the improve-
ments observed for clock skew scheduling alone and delay insertion with clock
skew scheduling, respectively. Shown in Figure 11.15 are the percentage im-
provements (I_L^CSS and I_L^DICSS in Table 11.6, respectively) in minimum clock
period via clock skew scheduling and delay insertion for level-sensitive IS-
CAS’89 benchmark circuits. Two data points shown per benchmark circuit
from left-to-right are the improvements observed for clock skew scheduling
alone and delay insertion with clock skew scheduling, respectively.

(Figure: "Improvements via Delay Insertion" bar chart of improvement (%) over the ISCAS'89 benchmark circuits, with one bar per circuit for edge-triggered circuits and one for level-sensitive circuits, plus an average bar.)
Fig. 11.13. Percentage improvements through delay insertion in Table 11.6.

(Figure: "Edge-Triggered Circuits" bar chart of improvement (%) over the ISCAS'89 benchmark circuits, with one bar per circuit for CSS and one for DICSS, plus an average bar.)
Fig. 11.14. Percentage improvements on edge-triggered circuits in Table 11.6.

The average total improvement of non-zero clock skew, edge-triggered cir-


cuits with delay insertion with respect to the zero clock skew, edge-triggered
circuits is 34%. The average total improvement of non-zero clock skew, level-
sensitive circuits with delay insertion with respect to the zero clock skew,
edge-triggered circuits is also 34%.

(Figure: "Level-Sensitive Circuits" bar chart of improvement (%) over the ISCAS'89 benchmark circuits, with one bar per circuit for CSS and one for DICSS, plus an average bar.)
Fig. 11.15. Percentage improvements on level-sensitive circuits in Table 11.6.

Note that the total improvements are due to the simultaneous effects of the
applications of delay insertion, clock skew scheduling and consideration of
time borrowing (for level-sensitive circuits only) in the timing analysis. The
improvement with delay insertion is equal
to or greater than the improvement with clock skew scheduling only, as delay
insertion is only applied when it can be used to mitigate the limitation of the
reconvergent paths.

11.5 Physical Design of Rotary Clock Synchronized Circuits

The development of a computer-aided design tool called hpictiming, following
the guidelines of the design methodology presented in Chapter 10, is performed
in an open source environment [186]. In this section, the timing portion of the
hpictiming tool is discussed. The details of the partitioning step implemen-
tation with chaco [177] and of the clock skew scheduling step with the Xgrid
parallel computing system are presented in Section 10.2. The logic flow
of the hpictiming program is presented in Figure 11.16. This flow is similar
to the physical design flow shown in Figure 10.6 (note that the specific design
decisions made in various stages of the implementation of the physical design
flow are indicated on the figure). In Figure 11.16, the grid size for partitioning
is set to 2x2 for simplicity. The focus in experimentation is on the effectiveness
of the application of clock skew scheduling with partitioning (sequentially and,
particularly, in parallel). Towards this goal, the run times for the parallelized
application of clock skew scheduling on the parallel computing cluster and the
final circuit performance are reported.

Hpictiming is mainly written in C++ using the standard template library


(STL). The total code is approximately 250,000 lines. Some of the parsers
are written in lex/yacc and the partitioning tool (chaco software used to
implement timing-aware partitioning) is written in ansi C. The program is
developed on GNU/Linux, Solaris Unix (Sun OS 9) and Mac OS X 10.3.8
operating systems using gcc 3.0 compiler.

11.5.1 Clock Skew Scheduling of Partitions Results

Clock skew scheduling is applied in parallel to the partitions of ISCAS’89


benchmark circuits and an industrial circuit called industrial1. The par-
titioning results from chaco are utilized within hpictiming in generating
the top block and the partition LP problems. The LP problem [2] shown in
Table 5.1 (page 81) is used for clock skew scheduling of edge-sensitive syn-
chronous circuits. For industrial1, where register insertion is performed,
linear constraints described in Chapter 5 for level-sensitive local data paths
are added to the LP problem constraints. The feasibility of the parallel appli-
cation of clock skew scheduling is analyzed. The speedups achievable through
parallel clock skew scheduling are computed.
The experimental setup of Section 11.1 is replicated for experimentation.
The experiments are performed on an Xgrid cluster built with eight (8) Pow-
erMac computers with dual G5 1.8GHz microprocessors (only one processor
is used on each client computer) and 3GB RAM running Mac OS X 10.3.8
(Section 10.3). The simplex optimizer of the GNU LP solver GLPK (version
4.8) [187] is used to solve the LP problems. The results are presented in Ta-
ble 11.7. In Table 11.7, the number of registers r and the number of paths p
are shown for each analyzed circuit. Run times of various clock skew schedul-
ing methods are shown. Run times of the conventional method of Table 5.1
are denoted by t_conven. The run times of the sequential solution of partitions
method are denoted by t_sequen and the run times of the parallel solution of
partitions method are denoted by t_paral. The feasibility of each circuit when
solved with the presented heuristic method is shown in the column labeled
“Feasibility”.
The minimum clock periods computed via each of the three methods (when
feasible) are identical and equal to the values reported in Tables 11.1 and 11.6
under the column T_FF^CSS. These minimum clock periods (presented in Tables 11.1
and 11.6) provide an average of 30% improvement over conventional zero clock
skew, edge-triggered circuits (28% is reported in Table 11.1 due to rounding in
the tabular presentation). In the presented methodology, the target
is to improve the run times of clock skew scheduling without degrading these
clock period improvements. Accordingly, the run times in Table 11.7 are re-
ported in order to demonstrate the speedups achievable through partitioning
and parallel application of clock skew scheduling.
The selected suite of ISCAS’89 benchmark circuits and industrial1 are
partitioned into a 2x2 partition using chaco. The partition and top block LP

(Figure: CAD tool flow. Inputs in DEF, LEF, BENCH and SDF formats are read, the circuit is partitioned into a 2x2 grid with chaco, and registers are inserted. In the clock skew scheduling stage, the partition LP problems LP1 to LP4 are solved with GLPK on the Xgrid cluster, yielding periods T1 to T4; the top block LP is then solved for the minimum T subject to T >= max(T1, T2, T3, T4), producing the optimal values ti; the partition LPs are re-solved with T = min T and the fixed ti. If the schedule is infeasible, re-iteration, constraining boundary vertices, or delay padding is applied; otherwise placement proceeds with register mapping and logic placement.)
Fig. 11.16. CAD tool flow.
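A rough C++ sketch of the scheduling portion of the flow in Figure 11.16 is given below; the wrapper functions solve_partition_lp, solve_top_block_lp and resolve_partition_with_period are hypothetical placeholders for the GLPK-based LP solves, and std::async merely stands in for the dispatch to the Xgrid cluster.

    #include <algorithm>
    #include <future>
    #include <vector>

    // Hypothetical wrappers around the partition and top block LP solves; in
    // hpictiming these would invoke the GLPK simplex optimizer on the generated
    // LP problems (declarations only, supplied elsewhere).
    double solve_partition_lp(int partition_id);
    double solve_top_block_lp(double t_lower_bound);
    bool   resolve_partition_with_period(int partition_id, double t_global);

    // Sketch of the scheduling stage for a 2x2 grid: the partition LPs are solved
    // concurrently, the top block LP is constrained by the largest partition
    // period, and each partition is re-solved with the resulting global period.
    bool schedule_partitions(int num_partitions = 4) {
      std::vector<std::future<double>> jobs;
      for (int p = 0; p < num_partitions; ++p)
        jobs.emplace_back(std::async(std::launch::async, solve_partition_lp, p));

      double t_max = 0.0;
      for (auto& job : jobs)
        t_max = std::max(t_max, job.get());          // choose max(T1, ..., T4)

      double t_global = solve_top_block_lp(t_max);   // top block LP: min T, T >= max(Ti)

      bool feasible = true;                          // re-solve partitions with T = min T
      for (int p = 0; p < num_partitions; ++p)
        feasible = feasible && resolve_partition_with_period(p, t_global);
      return feasible;  // infeasible: re-iterate, constrain boundary vertices, or pad delays
    }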

The partition and top block LP problems are generated. First, the generated LP problems are solved on a sin-
gle workstation in a sequential order. The observed run times t_sequen record
speedups over the conventional clock skew scheduling application due to partition-
ing. Second, the generated LP problems are solved on the Xgrid computing
cluster in parallel as described in Section 10.3.

Table 11.7. Clock skew scheduling results on 2x2 partitioned ISCAS’89 circuits.
Circuit Info | Run Time CSS (sec) | RTI (%) | Feasibility
Circuit  r  p  t_conven  t_sequen  t_paral  RTI_sequen  RTI_paral  Feasibility
s27 3 4 0 0 0 0 0 yes
s208.1 8 28 0 0 0 0 0 yes
s298 14 54 0 0 0 0 0 yes
s344 15 68 0 0 0 0 0 yes
s349 15 68 0 0 0 0 0 yes
s382 21 113 0 0 0 0 0 yes
s386 6 15 0 0 0 0 0 yes
s400 21 113 0 0 0 0 0 yes
s420.1 16 120 0 0 0 0 0 no
s444 16 113 0 0 0 0 0 yes
s510 6 15 0 0 0 0 0 yes
s526 21 117 0 0 0 0 0 yes
s526n 21 117 0 0 0 0 0 yes
s641 19 81 0 0 0 0 0 no
s713 19 81 0 0 0 0 0 no
s820 5 10 1 1 1 0 0 yes
s832 5 10 0 0 0 0 0 yes
s838.1 32 496 2 0 0 0 100 no
s938 32 496 1 1 1 0 0 no
s953 29 135 0 0 0 0 0 yes
s967 29 135 0 0 0 0 0 yes
s991 19 51 0 0 0 0 0 yes
s1196 18 20 0 0 0 0 0 no
s1238 18 20 0 0 0 0 0 no
s1423 74 1471 21 6 3 71 86 yes
s1488 6 15 0 0 0 0 0 yes
s1494 6 15 0 0 0 0 0 yes
s1512 57 415 1 0 0 100 100 yes
s3271 116 789 4 2 1 50 75 no
s3330 132 514 2 2 1 0 50 no
s3384 183 1759 22 4 3 82 86 yes
s4863 104 620 2 0 0 100 100 yes
s5378 179 1147 9 5 2 44 78 no
s6669 239 2138 33 10 7 30 79 no
s9234 228 247 52 15 8 71 85 no
s9234.1 211 2342 47 12 5 74 89 yes
s13207 669 3068 86 17 10 80 88 yes
s15850 597 14257 3545 735 447 79 87 no
s15850.1 534 10830 1358 156 110 89 92 yes
s35932 1728 4187 101 38 13 62 87 no
s38417 1636 28082 7707 3780 1845 51 76 yes
s38584 1452 15545 1394 749 339 46 76 yes
industrial1 14031 3692878 n/a 34680 25680 n/a n/a no
Average 25 28

The observed run times t_paral record speedups over the conventional clock skew scheduling application due
to partitioning and parallelization of the application. Note that the applica-
tion of clock skew scheduling to industrial1 using the conventional clock
skew scheduling method is not possible, thus run times are not reported.
It is observed from Table 11.7 that t_paral is consistently and significantly
(especially for large scale circuits) superior to t_sequen and t_conven. Similarly,
t_sequen is consistently superior to t_conven. The run time improvements from
t_conven to t_sequen and from t_conven to t_paral are listed under RTI_sequen and

RTI_paral, respectively. The improvements are computed with the formula
(t_old − t_new)/t_old × 100. On the ISCAS'89 benchmark circuits, the average
run time improvement via partitioning (RTI_sequen) is 25%. The average run
time improvement via partitioning and parallel application of clock skew
scheduling (RTI_paral) is 28%. The circuits for which the method is infeasible
are not considered in the computations of the average improvement. Overall,
the application of clock skew scheduling to partitions is feasible for 28 (65%)
of the total 43 circuits, whereas this method is not applicable to the remain-
ing 15 circuits (35%). For these 15 circuits, the alternative methods described
in Section 10.2.4 can be used.

11.5.2 Overall CAD Tool Results

In this section, the run times of hpictiming on the benchmark circuits are
analyzed to profile the speedups gained in overall program execution due to
partitioning and parallelization. In particular, the speedups available through
solving the partition problems sequentially and in parallel are computed using
the speedup formula presented in (10.5).
Table 11.8 presents the speedup results of the hpictiming tool on the ISCAS'89
benchmark and industrial1 circuits. In Table 11.8, the number of registers
and paths of each circuit are shown with r and p, respectively. Run times of
the hpictiming tool operated with various clock skew scheduling methods
on the ISCAS’89 benchmark circuits are shown. Run times of hpictiming
with the conventional clock skew scheduling method of Table 5.1 are denoted
by t_conven^hpictiming. The run times with the sequential solution of partitions method
are denoted by t_sequen^hpictiming and the run times with the parallel solution of par-
titions method are denoted by t_paral^hpictiming. In Table 11.8, the speedups due to
partitioning and sequential application of clock skew scheduling to 2x2 parti-
tions of the circuits are denoted by speedup_sequen. The speedup speedup_sequen
is computed with the following formula:

    speedup_sequen = t_conven^hpictiming / t_sequen^hpictiming .    (11.8)

The speedups due to partitioning and application of clock skew scheduling in


parallel are denoted by speedup_paral. The speedup speedup_paral is computed
with the following formula:

    speedup_paral = t_conven^hpictiming / t_paral^hpictiming .    (11.9)
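As a simple numerical check of (11.8) and (11.9), the following hypothetical C++ snippet (not part of hpictiming) reproduces the speedups of the s15850.1 row of Table 11.8.

    #include <cstdio>

    // Speedup of hpictiming relative to the conventional run, following (11.8) and (11.9).
    double speedup(double t_conven, double t_other) {
      return t_conven / t_other;
    }

    int main() {
      // Values taken from the s15850.1 row of Table 11.8:
      // t_conven = 1385 s, t_sequen = 185 s, t_paral = 138 s.
      std::printf("speedup_sequen = %.1fx\n", speedup(1385.0, 185.0));  // about 7.5x
      std::printf("speedup_paral  = %.1fx\n", speedup(1385.0, 138.0));  // about 10.0x
      return 0;
    }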

Remember from Section 11.5.1 that the application of clock skew schedul-
ing with partitioning is not feasible for some of the ISCAS’89 benchmark
circuits and the industrial circuit industrial1. The circuits for which the
method is not applicable are not considered in the computation of average

Table 11.8. Speedup of hpictiming on 2x2 partitioned ISCAS’89 circuits.


Circuit Info | Run Time hpictiming (sec) | Speedup (X)
Circuit  r  p  t_conven^hpictiming  t_sequen^hpictiming  t_paral^hpictiming  speedup_sequen  speedup_paral
s27 3 4 0 0 0 n/a n/a
s208.1 8 28 0 0 0 n/a n/a
s298 14 54 0 0 0 n/a n/a
s344 15 68 0 0 0 n/a n/a
s349 15 68 1 1 1 1.0x 1.0x
s382 21 113 0 0 0 n/a n/a
s386 6 15 0 0 0 n/a n/a
s400 21 113 0 0 0 n/a n/a
s420.1 16 120 0 0 0 n/a n/a
s444 16 113 1 1 1 1.0x 1.0x
s510 6 15 1 1 1 1.0x 1.0x
s526 21 117 0 0 0 n/a n/a
s526n 21 117 1 1 1 1.0x 1.0x
s641 19 81 1 1 1 1.0x 1.0x
s713 19 81 0 0 0 n/a n/a
s820 5 10 1 1 1 1.0x 1.0x
s832 5 10 0 0 0 n/a n/a
s838.1 32 496 3 1 1 3.0x 3.0x
s938 32 496 1 1 1 1.0x 1.0x
s953 29 135 1 1 1 1.0x 1.0x
s967 29 135 1 1 1 1.0x 1.0x
s991 19 51 1 1 1 1.0x 1.0x
s1196 18 20 0 0 0 n/a n/a
s1238 18 20 1 1 1 1.0x 1.0x
s1423 74 1471 22 7 4 3.1x 5.5x
s1488 6 15 1 1 1 1.0x 1.0x
s1494 6 15 1 1 1 1.0x 1.0x
s1512 57 415 2 1 1 2.0x 2.0x
s3271 116 789 6 4 3 2.5x 2.0x
s3330 132 514 2 2 1 1.0x 2.0x
s3384 183 1759 25 7 6 3.6x 4.2x
s4863 104 620 6 4 4 2.5x 2.5x
s5378 179 1147 15 11 8 1.4x 1.9x
s6669 239 2138 40 17 14 2.4x 2.9x
s9234 228 247 60 23 16 2.6x 3.8x
s9234.1 211 2342 53 18 11 2.9x 4.8x
s13207 669 3068 105 36 29 2.9x 3.6x
s15850 597 14257 3757 947 659 4.0x 5.7x
s15850.1 534 10830 1385 185 138 7.5x 10.0x
s35932 1728 4187 313 250 225 1.3x 1.4x
s38417 1636 28082 7881 3958 2021 2.0x 3.9x
s38584 1452 15545 1615 1022 611 1.6x 2.6x
industrial1 14031 3692878 n/a 36062 27046 n/a n/a
Average 2.1x 2.6x

speedups. Still, the speedup numbers are presented individually for all the
ISCAS’89 benchmark circuits and the industrial circuit industrial1 in Ta-
ble 11.8.
It is observed from Table 11.8 that, on average, a 2.1x speedup in hpictiming
run time is achieved due to partitioning. If the partitioned LP problems are
solved in parallel, the average speedup is 2.6x. It is intuitive that as the size
of a circuit increases, the clock skew scheduling step of hpictiming, which is
the fraction of the task that is enhanced with partitioning and parallelization,
increases as well. So, for larger circuits, higher values of speedup are expected
through partitioning and parallelization. Indeed, such a trend is observed in
Table 11.8.

(Figure: stacked bar chart of run time in seconds, broken into Read-In, Partitioning and Scheduling, for s15850.1, s38417, s38584 and industrial1.)
Fig. 11.17. The run times of hpictiming with Xgrid on large circuits.

Speedup (10.5) is further investigated on several of the benchmark cir-
cuits. The execution of hpictiming is divided into three main steps: Read-in,
Partitioning and Scheduling. The Read-in step consists of reading the input
data and identifying the local data paths. The Partitioning step consists of
the timing-driven partitioning procedure implemented with chaco, discussed
in Section 10.2.2. The Scheduling step consists of the application of clock skew
scheduling to the generated partitions.
Figure 11.17 illustrates the relative run time lengths of each step for sev-
eral ISCAS’89 benchmark circuits and the industrial circuit industrial1 for
the parallel application of clock skew scheduling. The ISCAS’89 benchmark
circuits whose total run times are below a certain limit are not included
in the analysis. This selectivity eliminates inaccuracies due to round-off
errors in the reported run times, which are most prominent for circuits with
run times below a few seconds. Although the solution for industrial1 is
infeasible, the reported run times are believed to be a good approximation
of what they would have been if all the subpartitions had been feasible. The
total run time of the hpictiming program (with parallel application of clock
skew scheduling) is reported in Table 11.8 under the column t_paral^hpictiming.
The breakdown of run times into the three steps of hpictiming is shown for
the three largest circuits, s38584, s38417 and industrial1. The run times are
shown in Figures 11.18, 11.19 and 11.20 for s38584, s38417 and industrial1,
respectively.
The run times for the three application methods (conventional, sequential and
parallel application of clock skew scheduling) are shown for each circuit. The
run times for each step of hpictiming are shown with color codes, listed as
read-in, partitioning and scheduling from bottom to top for each data bar.

(Figure: stacked bar chart of run time in seconds (Read-In, Partitioning, Scheduling) for the conventional, sequential and parallel application methods.)
Fig. 11.18. Run time breakdown of hpictiming program steps for s38584.

(Figure: stacked bar chart of run time in seconds (Read-In, Partitioning, Scheduling) for the conventional, sequential and parallel application methods.)
Fig. 11.19. Run time breakdown of hpictiming program steps for s38417.

(Figure: stacked bar chart of run time in seconds (Read-In, Partitioning, Scheduling) for the sequential and parallel application methods.)
Fig. 11.20. Run time breakdown of hpictiming program steps for industrial1.

The partitioning step is not required in the conventional application method
and thus is not shown on the run time bar for the conventional application
cases. Even for methods where partitioning is necessary, the partitioning stage
of the run time bar is not visible, because the run times of the partitioning
process with chaco are very small compared to the rest of the execution time.
Note that the run times of the read-in and partitioning (where applied)
steps are identical in all three application methods. Through partitioning and
application of clock skew scheduling in parallel, the run time of the clock skew
scheduling step of the hpictiming program is improved. This improvement
speeds up the hpictiming program, the results of which are presented in
Table 11.8.
12 Conclusions

The physical limitations of materials and lithography in nano-scale CMOS


design have significantly affected the IC design flow. Some previously ignored
material behaviors lead to design-altering phenomena at such small dimen-
sions. Non-zero clock skew is a prime example of such a change; previously
negligible, clock skew and jitter have combined to occupy up to 10% of useful
computation time within clock cycles. Clock design methodologies are being
improved to limit skew and jitter as technology scales.
Within such a dynamic and resourceful environment, where design tech-
niques are evolving to adapt to nanoscale silicon implementations, Computer-
Aided Design (CAD) tools are crucial to the successful development of novel
design techniques and the sustainability of techniques currently in use. In this
monograph, problem formulation and design automation of non-zero clock
skew scheduling are described from a CAD perspective. The focus is on the
algorithmic advances and automation of the application of non-zero clock skew
scheduling at the large scale. Based on the efficacy and scalability of these au-
tomation principles, the non-zero clock skew system design concept can be
adapted to mainstream physical design flow, permitting significant perfor-
mance improvements. Two major items of importance within the adaptation
are the scalability of clock skew scheduling methodologies and the practical
application of these methodologies amid increasing circuit complexities. The
parallel clock skew scheduling research described in this monograph is an
important first step to solving the scalability problem. Solving the issues in
practical application for efficient nano-scale implementation requires a com-
prehensive treatment from the logic synthesis stage (e.g. resource allocation
for non-zero clock skew systems) to the testing stage (e.g. scan clock insertion
for non-zero clock skew systems) of the IC design flow.
In this monograph, the current state of knowledge in the formulation,
automation and application of non-zero clock skew scheduling is presented.
The topics are discussed not in an attempt to address all of the challenges
listed above, but to present the existing knowledge and automation principles
in application. Equipped with the knowledge presented in this monograph,

the application of clock skew scheduling to application-specific integrated cir-


cuits (ASICs) is possible. Furthermore, formulation characteristics (inclusive
of known bottlenecks) are presented to establish a roadmap for non-zero clock
skew scheduling researchers.
In overview, the following topics are presented in this research monograph.
First, in Chapters 1 through 4, preliminary information on VLSI circuit
design and VLSI circuit timing is reviewed. In Chapter 5, the original linear
programming (LP) formulation for clock skew scheduling of edge-triggered cir-
cuits is revisited. The implications of non-zero clock skew timing constraints
on clock tree synthesis step are iterated. In Chapter 6, a linear program-
ming formulation for the static timing analysis of level-sensitive circuits is
described. This LP formulation is the first stand-alone formulation offered for
the timing analysis of non-zero clock skew, level-sensitive circuits. The major-
ity of the current static timing analyzers utilize iteration-based approaches to
analyze the timing behavior of systems with latches. These iteration-based ap-
proaches are shown to converge to solutions relatively quickly for most circuits,
however, they require algorithmic extensions for complex circuit topologies.
The LP formulation presented in this monograph is topologically indepen-
dent and shown to operate with reasonable run times. Performance improve-
ments of 27% shorter clock periods on average are obtained for non-zero clock
skew, level-sensitive circuits over traditionally used zero clock skew, edge-
sensitive circuits. Although level-sensitive circuits do not provide additional
improvements in clock period over edge-sensitive circuits (approximately 28-
30% shorter clock periods over zero clock skew, edge-sensitive circuits) for
non-zero clock skew scheduling, they provide improvements in area savings.
Also in Chapter 6, an LP automation framework is presented in order to
analyze advanced multi-phase synchronization methodologies with non-zero
clock skew. These multi-phase synchronization analyses are performed in order
to provide design and analysis methods to address synchronous circuit design
with emerging clocking technologies, some of which entail multi-phase syn-
chronization schemes. For instance, the resonant rotary clocking technology
provides an improved clock distribution network which satisfies the complex
synchronization requirements of high-performance synchronous circuits by us-
ing multi-phase, non-zero clock skew clocking. The presented timing analysis
method efficiently captures the behavior of multi-phase, non-zero clock skew
circuits in a fully-automated fashion. The experiments performed on ISCAS’89
benchmark circuits demonstrate that multi-phase synchronization can actu-
ally be advantageous in terms of circuit speed, despite the increased path
delays due to latch insertion per each clock phase. Such a fact is contrary to
common wisdom, which has over the years been suggested for zero clock skew
systems. Approximately 17.7% and 12.0% shorter clock periods are obtained
on average over zero clock skew, edge-sensitive circuits for three-phase and
four-phase synchronization schemes, respectively.
In Chapter 7, an effective quadratic programming formulation to improve
the tolerance of circuits to process parameter variations is presented. The

variations (manufacturing and environmental) are becoming increasingly dom-


inant with semiconductor technology scaling. As researchers are working on
design-for-manufacturing (DFM) techniques, regular design fabrics (semicon-
ductor and nanoarchitecture levels) and timing tools to accurately model the
statistical behavior, multi-phase operation and clock skew scheduling can also
be used effectively to circumvent the hazards caused by such variations. In
experiments, safety factors are maximized for ISCAS'89 circuits with run
times of less than 30 minutes. The scalability of the QP formulation is a concern for
increasing circuit sizes, similar to, but more so than, that of the LP formulations.
In Chapter 8, the optimal clock schedules and data propagation times of
a circuit are analyzed after clock skew scheduling. With these analyses, the
theoretical limits of improvement in the minimum clock period achievable
through clock skew scheduling are identified. Traditionally, it has been con-
sidered that the data path cycles and delay uncertainties are the only limiting
factors on the minimum clock period achievable through clock skew schedul-
ing. As shown recently, the reconvergent data paths also introduce theoretical
limits on the minimum achievable clock period through clock skew schedul-
ing. This limitation is mitigated by the delay insertion method, leading to
improvements of 10% and 9% shorter clock periods on average over conven-
tional clock skew scheduling techniques for edge-sensitive and level-sensitive
circuits, respectively. In mainstream digital circuit design flow, delay insertion
is commonly used as a post-processing step in order to solve the short-path
(hold time) violations. The drawbacks of delay insertion, such as increased
circuit area and power consumption, are mainly disregarded in favor of the
feasibility of the timing schedules. Similarly for the presented design princi-
ples, the drawbacks of delay insertion are considered tolerable in favor of the
improvement in the circuit performance.
In Chapter 9, the practical considerations in the implementation of the de-
sign automation algorithms as well as the clock skew scheduling methodologies
are discussed. Three different implementations of the QP based algorithm are
demonstrated for varying objectives, which might be selected based on prac-
tical limitations or necessities. Also shown to address a potential practical
limitation in timing is the implementation of non-zero clock skew schedul-
ing on intellectual-property (IP) blocks within an ASIC. The proposed design
strategy suggests a zero clock skew implementation at the I/O registers for
easy synchronization of the IP within the ASIC. The additional constraint on
the clock skew at the I/O registers limits the level of improvement through
clock skew scheduling; however, it simplifies the timing relationship between
communicating IPs. Alternative implementations and strategies in synchro-
nizing multiple non-zero clock skew IPs are indeed feasible.
In Chapter 10, the integration of the presented timing and synchroniza-
tion methodologies into the physical design flow of circuits synchronized with
rotary clocking technology is described. Rotary clocking technology is a type
of resonant clocking technology, which provides controllable skew, low-jitter,
giga-hertz range clocking with fast transition times and low power consump-

tion. Rotary clocking technology also permits non-zero clock skew operation
and multi-phase synchronization of systems. In the presented discussion, the
development of the physical design flow for rotary clock synchronized circuits
is described. The physical design flow consists of a novel partitioning step in
order to generate partitions of the circuit netlist on which clock skew schedul-
ing can be applied individually. The potential to parallelize the application of
clock skew scheduling is explored. Partitioning and the parallelization of the
application of clock skew scheduling are shown to provide significant speedups
in run times of the timing analysis. Over the ISCAS'89 benchmark circuits, an
average speedup of 2.6x is observed on four (4) processors. When applica-
ble, clock skew scheduling of partitions significantly improves the scalability
of clock skew scheduling.
In summary, this monograph presents valuable timing and synchroniza-
tion methodologies for non-zero clock skew scheduling and their automation
methods. The timing and synchronization methodologies are proposed par-
ticularly for the non-zero clock skew operation of high-performance digital
VLSI integrated circuits. Various algorithms and blueprints for methodology
development are presented, which include algorithms for circuits with edge-
sensitive registers vs. level-sensitive registers, circuits synchronized by a single
clock phase scheme vs. multi-phase clocking schemes, and algorithms modi-
fying the clock distribution network only vs. modifying the clock distribution
network simultaneously with the logic network. Theoretical limitations of im-
provement achievable through non-zero clock skew scheduling are presented
for the proposed algorithms and methodologies.
References

1. E. G. Friedman, Performance Limitations in Synchronous Digital Systems.


Ph.D. thesis, University of California, Irvine, California, 1989. Abstract pub-
lished in Dissertations Abstracts International, Volume 50, Number 7, p. 3067-
B, January 1990.
2. J. P. Fishburn, “Clock Skew Optimization,” IEEE Transactions on Computers,
Vol. C–39, pp. 945–951, July 1990.
3. J. S. Kilby, “Invention of the Integrated Circuit,” IEEE Transactions on Elec-
tron Devices, Vol. ED-23, pp. 648–654, July 1976.
4. G. E. Moore, “Cramming more Components onto Integrated Circuits,” Pro-
ceedings of the IEEE, Vol. 86, pp. 82–85, January 1998.
5. D. C. Pham, D. Boerstler, M. Bolliger, R. Chaudhry, D. Cox, P. Harvey,
P. Hofstee, C. Johns, J. Kahle, A. Kameyama, J. Keaty, Y. Masubuchi,
M. Pham, J. Pille, S. Posluzsny, M. Riley, D. Stasiak, M. Suzuoki, O. Taka-
hashi, J. Warnock, S. Weitzel, D. Wendel, and K. Yazawa, “Overview of the
architecture, circuit design, and physical implementation of a first-generation
cell processor,” IEEE Journal of Solid-State Circuits, Vol. 41, pp. 179–196,
January 2006.
6. S. Rusu, S. Tam, H. Muljono, D. Ayers, J. Chang, B. Cherkauer, J. Stinson,
J. Benoit, R. Varada, J. Leung, R. Limaye, and S. Vora, “A 65 nm Dual-Core
Multithreaded Xeon Processor with 16-MB L3 Cache,” IEEE Journal of Solid
State Circuits, Vol. 42, pp. 17–25, January 2007.
7. A. S. Leon, K. W. Tam, J. L. Shin, D. Weisner, and F. Schumacher, “A
Power efficient High-Throughput 32-Thread SPARC processor,” IEEE Journal
of Solid State Circuits, Vol. 42, pp. 7–16, January 2007.
8. H. B. Bakoglu, Circuits, Interconnections, and Packaging for VLSI. Addison-
Wesley Publishing Company, 1990.
9. E. G. Friedman, Clock Distribution Networks in VLSI Circuits and Systems.
IEEE Press, 1995.
10. N. W. Weste and D. Harris, Principles of CMOS VLSI Design: A Systems
Perspective. Addison-Wesley Publishing Company, Reading, MA, 3rd ed., 2004.
11. D. D. Gajski, Silicon Compilation. Addison-Wesley Publishing Company,
Reading, MA, 1988.
12. Z. Kohavi, Switching and Finite Automata Theory. McGraw-Hill Book Com-
pany, New York, NY, 2nd ed., 1978.


13. F. J. Hill and G. R. Peterson, Computer Aided Logical Design (with emphasis
on VLSI). John Wiley & Sons, Inc., 4th ed., 1993.
14. J. P. Uyemura, Introduction to VLSI Circuits and Systems. Wiley Publishing,
2001.
15. S.-M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits: Analysis and
Design. The McGraw-Hill Companies, Inc., 3rd ed., 2002.
16. J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits:
A Design Perspective. Prentice-Hall, Inc., Upper Saddle River, NJ, 2nd ed.,
2002.
17. C. Mead and L. Conway, Introduction to VLSI Systems. Addison-Wesley Pub-
lishing Company, Reading, MA, 1980.
18. F. Anceau, “A Synchronous Approach for Clocking VLSI Systems,” IEEE
Journal of Solid-State Circuits, Vol. SC-17, pp. 51–56, February 1982.
19. M. Afghani and C. Svensson, “A Unified Clocking Scheme for VLSI Systems,”
IEEE Journal of Solid State Circuits, Vol. SC-25, pp. 225–233, February 1990.
20. S. H. Unger and C.-J. Tan, “Clocking Schemes for High-Speed Digital Sys-
tems,” IEEE Transactions on Computers, Vol. C-35, pp. 880–895, October
1986.
21. G. Y. Yacoub, H. Pham, M. Ma, and E. G. Friedman, “A System for Crit-
ical Path Analysis Based on Back Annotation and Distributed Interconnect
Impedance Models,” Microelectronics Journal, Vol. 19, pp. 21–30, May/June
1988.
22. H. Shichman and D. A. Hodges, “Modeling and Simulation of Insulated-Gate
Field-Effect Transistor Switching Circuits,” IEEE Journal of Solid-State Cir-
cuits, Vol. SC-3, pp. 285–289, September 1968.
23. N. Hedenstierna and K. O. Jeppson, “CMOS Circuit Speed and Buffer Op-
timization,” IEEE Transactions on Computer-Aided Design, Vol. CAD-6,
pp. 270–281, March 1987.
24. M. R. C. M. Berkelaar and J. A. G. Jess, “Gate Sizing in MOS Digital Circuits
with Linear Programming,” Proceedings of the European Design Automation
Conference, pp. 217–221, March 1990.
25. O. Coudert, “Gate Sizing for Constrained Delay/Power/Area Optimiza-
tion,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
Vol. VLSI-5, pp. 465–472, December 1997.
26. U. Ko and P. T. Balsara, “Short-Circuit Power Driven Gate Sizing Technique
for Reducing Power Dissipation,” IEEE Transactions on Very Large Scale In-
tegration (VLSI) Systems, Vol. VLSI-3, pp. 450–455, September 1995.
27. S. R. Vemuru and N. Scheinberg, “Short-Circuit Power Dissipation Estima-
tion for CMOS Logic Gates,” IEEE Transactions on Circuits and Systems I:
Fundamental Theory and Applications, Vol. 41, pp. 762–765, November 1994.
28. H. J. Veendrick, “Short-Circuit Dissipation of Static CMOS Circuitry and its
Impact on the Design of Buffer Circuits,” IEEE Journal of Solid-State Circuits,
Vol. SC-19, pp. 468–473, August 1984.
29. A. S. Sedra and K. C. Smith, Microelectronic Circuits. Oxford University Press,
4th ed., 1997.
30. T. Sakurai and A. R. Newton, “Alpha-power Law MOSFET Model and its
Applications to CMOS Inverter Delay and Other Formulas,” IEEE Journal of
Solid-State Circuits, Vol. SC-25, pp. 584–594, April 1990.

31. A. I. Kayssi, K. A. Sakallah, and T. M. Burks, “Analytical Transient Re-


sponse of CMOS Inverters,” IEEE Transactions on Circuits and Systems—
I : Fundamental Theory and Applications, Vol. CAS I–39, pp. 42–45, January
1992.
32. E. G. Friedman, ed., High Performance Clock Distribution Networks. Kluwer
Academic Publishers, Norwell, Massachusetts, 1997.
33. H. B. Bakoglu and J. D. Meindl, “Optimal Interconnection Circuits for VLSI,”
IEEE Transactions on Electron Devices, Vol. ED-32, pp. 903–909, May 1985.
34. A. Wilnai, “Open-Ended RC Line Model Predicts MOSFET IC Response,”
Electronic Design News, pp. 53–54, December 1971.
35. T. Sakurai, “Approximation of Wiring Delay in MOSFET LSI,” IEEE Journal
of Solid-State Circuits, Vol. SC-18, pp. 418–426, August 1983.
36. S. R. Vemuru and A. R. Thorbjornsen, “Variable-Taper CMOS Buffer,” IEEE
Journal of Solid-State Circuits, Vol. SC-26, pp. 1265–1269, September 1991.
37. C. Prunty and L. Gal, “Optimum Tapered Buffer,” IEEE Journal of Solid-State
Circuits, Vol. SC-27, pp. 118–119, January 1992.
38. N. Hedenstierna and K. O. Jeppson, “Comments on the Optimum CMOS
Tapered Buffer Problem,” IEEE Journal of Solid-State Circuits, Vol. SC-29,
pp. 155–158, February 1994.
39. B. S. Cherkauer and E. G. Friedman, “Channel Width Tapering of Serially Con-
nected MOSFET’s with Emphasis on Power Dissipation,” IEEE Transactions
on Very Large Scale Integration (VLSI) Systems, Vol. VLSI-2, pp. 100–114,
March 1994.
40. B. S. Cherkauer and E. G. Friedman, “Design of Tapered Buffers with Local
Interconnect Capacitance,” IEEE Journal of Solid-State Circuits, Vol. SC–30,
pp. 151–155, February 1995.
41. B. S. Cherkauer and E. G. Friedman, “A Unified Design Methodology for
CMOS Tapered Buffers,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, Vol. VLSI-3, pp. 99–111, March 1995.
42. V. Adler and E. G. Friedman, “Repeater Insertion to Reduce Delay and Power
in RC Tree Structures,” Proceedings of the Asilomar Conference on Signals,
Systems, and Computers, pp. 749–752, November 1997.
43. V. Adler and E. G. Friedman, “Delay and Power Expressions for a CMOS
Inverter Driving a Resistive-Capacitive Load,” Proceedings of the IEEE Inter-
national Symposium on Circuits and Systems, pp. 4.101–4.104, May 1996.
44. V. Adler and E. G. Friedman, “Repeater Design to Reduce Delay and Power
in Resistive Interconnect,” Proceedings of the IEEE International Symposium
on Circuits and Systems, pp. 2148–2151, June 1997.
45. V. Adler and E. G. Friedman, “Timing and Power Models for CMOS Repeaters
Driving Resistive Interconnect,” Proceedings of the IEEE ASIC Conference,
pp. 201–204, September 1996.
46. C. J. Alpert and A. Devgan, “Wire Segmenting for Improved Buffer Insertion,”
Proceedings of the IEEE/ACM Design Automation Conference, pp. 588–593,
June 1997.
47. V. E. Adler and E. G. Friedman, “Repeater Design to Reduce Delay and Power
in Resistive Interconnect,” IEEE Transactions on Circuits and Systems II:
Analog and Digital Signal Processing, Vol. CAS II-45, pp. 607–616, May 1998.
48. I. E. Sutherland, “Micropipelines,” Communications of the ACM, Vol. 32,
pp. 720–738, June 1989.

49. J. M. Rabaey, Digital Integrated Circuits : A Design Perspective. Prentice Hall,


Inc., 1996.
50. R. H. Krambeck, C. M. Lee, and H.-F. S. Law, “High Speed Compact Circuits
with CMOS,” IEEE Journal of Solid-State Circuits, Vol. SC-17, pp. 614–619,
June 1982.
51. V. Friedman and S. Liu, “Dynamic Logic CMOS Circuits,” IEEE Journal of
Solid-State Circuits, Vol. SC-19, pp. 263–266, April 1984.
52. N. F. Gonclaves and H. J. DeMan, “NORA: A Racefree Dynamic CMOS Tech-
nique for Pipelined Logic Structures,” IEEE Journal of Solid-State Circuits,
Vol. SC-18, pp. 261–266, June 1983.
53. C. M. Lee and E. W. Szeto, “Zipper CMOS,” IEEE Circuits and Systems
Magazine, pp. 10–16, May 1986.
54. L. G. Heller, W. R. Griffin, J. W. Davis, and N. G. Thoma, “Cascade Voltage
Switch Logic: A Differential CMOS Logic Family,” Proceedings of the IEEE
International Solid State Circuits Conference, pp. 16–17, February 1984.
55. T. A. Grotjohn and B. Hoefflinger, “Sample-Set Differential Logic (SSDL) for
Complex High-Speed VLSI,” IEEE Journal of Solid-State Circuits, Vol. SC-21,
pp. 367–369, April 1986.
56. L. C. M. Pfennings, W. G. J. Mol, J. J. J. Bastiens, and J. M. F. V. Dijk, “Dif-
ferential Split-Level CMOS Logic for Subnanosecond Speed,” IEEE Journal of
Solid-State Circuits, Vol. SC-20, pp. 1050–1055, October 1985.
57. K. M. Chu and D. I. Pulfrey, “Design Procedures for Differential Cascode
Voltage Switch Circuits,” IEEE Journal of Solid-State Circuits, Vol. SC-21,
pp. 1082–1087, December 1986.
58. L. A. Glasser and D. W. Dobberpuhl, The Design and Analysis of VLSI Cir-
cuits. Addison-Wesley Publishing Company, 1985.
59. M. M. Mano and C. R. Kime, Logic and Computer Design Fundamentals.
Prentice Hall, Inc., 1997.
60. W. Wolf, Modern VLSI Design : A Systems Approach. Prentice-Hall, Inc., 1994.
61. T. Kacprzak and A. Albicki, “Analysis of Metastable Operation in RS CMOS
Flip-Flops,” IEEE Journal of Solid-State Circuits, Vol. SC-22, pp. 57–64, Feb-
ruary 1987.
62. T. A. Jackson and A. Albicki, “Analysis of Metastable Operation in D
Latches,” IEEE Transactions on Circuits and Systems—I : Fundamental The-
ory and Applications, Vol. CAS I–36, pp. 1392–1404, Nov 1989.
63. E. G. Friedman, “Latching Characteristics of a CMOS Bistable Register,”
IEEE Transactions on Circuits and Systems—I : Fundamental Theory and Ap-
plications, Vol. CAS I–40, pp. 902–908, December 1993.
64. S. H. Unger, “Double-Edge-Triggered Flip-Flops,” IEEE Transactions on Com-
puters, Vol. C-30, pp. 447–451, June 1981.
65. S.-L. Lu, “A Novel CMOS Implementation of Double-Edge-Triggered D-Flip-
Flops,” IEEE Journal of Solid State Circuits, Vol. SC-25, pp. 1008–1010, Au-
gust 1990.
66. M. Afghani and J. Yuan, “Double-Edge-Triggered D-Flip-Flops for High-Speed
CMOS Circuits,” IEEE Journal of Solid State Circuits, Vol. SC-26, pp. 1168–
1170, August 1991.
67. R. Hossain, L. Wronski, and A. Albicki, “Low Power Design Using Double
Edge Triggered Flip-Flops,” IEEE Transactions on Very Large Scale Integra-
tion (VLSI) Systems, Vol. VLSI-2, pp. 261–265, June 1994.

68. G. M. Blair, “Low-Power Double-Edge Triggered Flip-Flop,” Electronics Let-


ters, Vol. 33, pp. 845–847, May 1997.
69. M. R. Dagenais and N. C. Rumin, “On the Calculation of Optimal Clock-
ing Parameters in Synchronous Circuits with Level-Sensitive Latches,” IEEE
Transactions on Computer-Aided Design, Vol. CAD-8, pp. 268–278, March
1989.
70. I. Lin, J. A. Ludwig, and K. Eng, “Analyzing Cycle Stealing on Synchronous
Circuits with Level-Sensitive Latches,” Proceedings of the ACM/IEEE Design
Automation Conference, pp. 393–398, June 1992.
71. J. Lee, D. T. Tang, and C. K. Wong, “A Timing Analysis Algorithm for Circuits
with Level-Sensitive Latches,” IEEE Transactions on Computer-Aided Design,
Vol. CAD-15, pp. 535–543, May 1996.
72. T. G. Szymanski, “Computing Optimal Clock Schedules,” Proceedings of the
ACM/IEEE Design Automation Conference, pp. 399–404, June 1992.
73. K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, “checkTc and minTc :
Timing Verification and Optimal Clocking of Synchronous Digital Circuits,”
Proceedings of the IEEE/ACM International Conference on Computer–Aided
Design, pp. 552–555, November 1990.
74. N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Minimum
Padding to Satisfy Short Path Constaints,” Proceedings of the IEEE/ACM
International Conference on Computer–Aided Design, pp. 156 –161, November
1993.
75. K. A. Sakallah, T. N. Mudge, and O. A. Olukotun, “Analysis and Design
of Latch-Controlled Synchronous Digital Circuits,” IEEE Transactions on
Computer-Aided Design, Vol. CAD-11, pp. 322–333, March 1992.
76. S. Bothra, B. Rogers, M. Kellam, and C. M. Osburn, “Analysis of the Effects
of Scaling on Interconnect Delay in ULSI Circuits,” IEEE Transactions on
Electron Devices, Vol. ED-40, pp. 591–597, March 1993.
77. N. Gaddis and J. Lotz, “A 64-b Quad-Issue CMOS RISC Microprocessor,”
IEEE Journal of Solid-State Circuits, Vol. SC-31, pp. 1697–1702, November
1996.
78. P. E. Gronowski et al., “A 433-MHz 64-bit Quad-Issue RISC Microprocessor,”
IEEE Journal of Solid-State Circuits, Vol. SC-31, pp. 1687–1696, November
1996.
79. N. Vasseghi, K. Yeager, E. Sarto, and M. Seddighnezhad, “200-Mhz Super-
scalar RISC Microprocessor,” IEEE Journal of Solid-State Circuits, Vol. SC-
31, pp. 1675–1686, November 1996.
80. W. J. Bowhill et al., “Circuit Implementation of a 300-MHz 64-bit Second-
generation CMOS Alpha CPU,” Digital Technical Journal, Vol. 7, No. 1,
pp. 100–118, 1995.
81. J. L. Neves and E. G. Friedman, “Topological Design of Clock Distribution
Networks Based on Non-Zero Clock Skew Specification,” Proceedings of the
IEEE Midwest Symposium on Circuits and Systems, pp. 468–471, August 1993.
82. J. G. Xi and W. W.-M. Dai, “Useful-Skew Clock Routing With Gate Sizing
for Low Power Design,” Proceedings of the ACM/IEEE Design Automation
Conference, pp. 383–388, June 1996.
83. J. L. Neves and E. G. Friedman, “Design Methodology for Synthesizing Clock
Distribution Networks Exploiting Non–Zero Localized Clock Skew,” IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, Vol. VLSI-4,
pp. 286–291, June 1996.

84. M. A. B. Jackson, A. Srinivasan, and E. S. Kuh, “Clock Routing for High-


Performance ICs,” Proceedings of the ACM/IEEE Design Automation Confer-
ence, pp. 573–579, June 1990.
85. R.-S. Tsay, “An Exact Zero-Skew Clock Routing Algorithm,” IEEE Transac-
tions on Computer-Aided Design of Integrated Circuits and Systems, Vol. CAD-
12, pp. 242–249, February 1993.
86. N.-C. Chou and C.-K. Cheng, “On General Zero-Skew Clock Net Construc-
tion,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
Vol. VLSI-3, pp. 141–146, March 1995.
87. N. Ito, H. Sugiyama, and T. Konno, “ChipPRISM: Clock Routing and Tim-
ing Analysis for High-Performance CMOS VLSI Chips,” Fujitsu Scientific and
Technical Journal, Vol. 31, pp. 180–187, December 1995.
88. J. L. Neves and E. G. Friedman, “Optimal Clock Skew Scheduling Tolerant to
Process Variations,” Proceedings of the ACM/IEEE Design Automation Con-
ference, pp. 623–628, June 1996.
89. D. B. West, Introduction to Graph Theory. Prentice-Hall, 1996.
90. C. E. Leiserson and J. B. Saxe, “A Mixed-Integer Linear Programming Problem
Which is Efficiently Solvable,” Journal of Algorithms, Vol. 9, pp. 114–128,
March 1988.
91. T.-C. Lee and J. Kong, “The New Line in IC Design,” IEEE Spectrum, pp. 52–
58, March 1997.
92. E. G. Friedman, “The Application of Localized Clock Distribution Design to
Improving the Performance of Retimed Sequential Circuits,” Proceedings of the
IEEE Asia–Pacific Conference on Circuits and Systems, pp. 12–17, December
1992.
93. I. S. Kourtev and E. G. Friedman, “Simultaneous Clock Scheduling and
Buffered Clock Tree Synthesis,” Proceedings of the IEEE International Sym-
posium on Circuits and Systems, pp. 1812–1815, June 1997.
94. T. M. Burks, K. A. Sakallah, and T. N. Mudge, “Critical Paths in Circuits with
Level-Sensitive Latches,” IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, Vol. 3, pp. 273–291, June 1995.
95. I. S. Kourtev and E. G. Friedman, “A Quadratic Programming Approach to
Clock Skew Scheduling for Reduced Sensitivity to Process Parameter Varia-
tions,” Proceedings of the IEEE ASIC/SOC Conference, 1999.
96. I. S. Kourtev and E. G. Friedman, Timing Optimization Through Clock Skew
Scheduling. Kluwer Academic Publishers, 2000.
97. B. Taskin and I. S. Kourtev, “Linear Timing Analysis of SOC Synchronous
Circuits with Level-Sensitive Latches,” Proceedings of the IEEE ASIC/SOC
Conference, pp. 358–362, September 2002.
98. B. Taskin and I. S. Kourtev, “Performance Optimization of Single-Phase
Level-Sensitive Circuits Using Time Borrowing and Clock Skew Scheduling,”
ACM/IEEE International Workshop on Timing Issues in the Specification and
Synthesis of Digital Systems, pp. 111–118, 2002.
99. T. G. Szymanski and N. Shenoy, “Verifying Clock Schedules,” Proceedings of
the IEEE/ACM International Conference on Computer–Aided Design, pp. 124–
131, November 1992.
100. H. Zhou, “Clock Schedule Verification Under Crosstalk,” ACM/IEEE International
Workshop on Timing Issues in the Specification and Synthesis of Digital Sys-
tems, pp. 78–83, 2002.
101. C. Leiserson and J. Saxe, “Retiming Synchronous Circuitry,” Algorithmica,
Vol. 6, No. 1, 1991.
102. B. Lockyear and C. Ebeling, “Optimal Retiming of Level-Clocked Circuits
Using Symmetric Clock Schedules,” IEEE Transactions on Computer-Aided
Design, Vol. CAD-13, pp. 1097–1109, September 1994.
103. N. Maheshwari and S. Sapatnekar, “A Practical Algorithm for Retiming Level-
Clocked Circuits,” Proceedings of International Conference on VLSI in Com-
puters and Processors, pp. 440–445, October 1996.
104. N. Shenoy and R. Rudell, “Efficient Implementation of Retiming,” Proceedings
of IEEE/ACM International Conference on Computer-Aided Design, pp. 226–
233, 1994.
105. T. H. Cormen, C. E. Leiserson, and R. L. Rivest, Introduction to Algorithms.
MIT Press, 1989.
106. I. S. Kourtev and E. G. Friedman, “Topological Synthesis of Clock Trees
for VLSI-Based DSP Systems,” Proceedings of the IEEE Workshop on Signal
Processing Systems, pp. 151–162, November 1997.
107. I. S. Kourtev and E. G. Friedman, “Topological Synthesis of Clock Trees with
Non-Zero Clock Skew,” Proceedings of the ACM/IEEE International Workshop
on Timing Issues in the Specification and Design of Digital Systems, pp. 158–
163, December 1997.
108. R. B. Deokar and S. S. Sapatnekar, “A Graph–Theoretic Approach to Clock
Skew Optimization,” Proceedings of the IEEE International Symposium on
Circuits and Systems, pp. 407–410, May 1995.
109. S. Naffziger, G. Colon-Bonet, T. Fischer, R. Riedlinger, T. Sullivan, and
T. Grutkowski, “The Implementation of the Itanium 2 Microprocessor,” IEEE
Journal of Solid-State Circuits, Vol. 37, pp. 1448–1460, November 2002.
110. J. Warnock, “Circuit Design Issues for the POWER4 Chip,” Proceedings of
the 2003 International Symposium on VLSI Technology, Systems, and Appli-
cations, pp. 125–128, October 2003.
111. C. Webb, C. Anderson, L. Sigal, K. Shepard, J. Liptay, J. D. Warnock, B. Cur-
ran, B. Krumm, M. Mayo, P. Camporese, E. Schwarz, M. Farrell, P. Restle,
R. Averill III, T. Slegel, W. Huott, Y. Chan, B. Wile, T. Nguyen, P. Emma,
D. Beece, C.-T. Chuang, and C. Price, “A 400-MHz S/390 Microprocessor,”
IEEE Journal of Solid-State Circuits, Vol. 32, pp. 1665–1675, November 1997.
112. W. L. Winston, Operations Research: Applications and Algorithms. PWS-Kent
Publishing Company, second ed., 1991.
113. R. Chen and H. Zhou, “Clock Schedule Verification Under Process Variations,”
Proceedings of the IEEE Conference on Computer-Aided Design, pp. 619–625,
November 2004.
114. S.-C. Fang and S. Puthenpura, Linear Optimization and Extensions: Theory
and Algorithms. AT&T, Prentice Hall, 1993.
115. ILOG, France, ILOG CPLEX 7.1 User’s Manual, 2001.
116. J. Wood, T. Edwards, and S. Lipa, “Rotary Traveling-Wave Oscillator Arrays:
A New Clock Technology,” IEEE Journal of Solid-State Circuits, Vol. 36,
pp. 1654–1665, November 2001.
117. M. C. Papaefthymiou and K. Randall, “Edge-Triggering vs. Two-Phase Level-
Clocking,” Proceedings of the 1993 Symposium on Research on Integrated Systems, March
1993.
118. C. Ebeling and B. Lockyear, “On the Performance of Level-Clocked Circuits,”
Proceedings of the Sixteenth Conference on Advanced Research in VLSI,
pp. 342–356, March 1995.
119. Y. C. Hsu, S. Sun, D. Du, and X. Chu, “Enhancing Circuit Performance Under
a Multiple-Phase Clocking Scheme,” Proceedings of the 1998 IEEE Interna-
tional Symposium on Circuits and Systems, pp. 219–222, June 1998.
120. K. Ravindran, A. Kuehlmann, and E. Sentovich, “Multi-Domain Clock Skew
Scheduling,” Proceedings of the International Conference on Computer Aided
Design, pp. 801–808, November 2003.
121. I. S. Kourtev and E. G. Friedman, “A Quadratic Programming Approach to
Clock Skew Scheduling for Reduced Sensitivity to Process Parameter Varia-
tions,” Proceedings of the IEEE International ASIC/SOC Conference, pp. 210–
215, November 1999.
122. I. S. Kourtev and E. G. Friedman, “Clock Skew Scheduling for Improved Reli-
ability via Quadratic Programming,” Proceedings of the IEEE/ACM Interna-
tional Conference on Computer–Aided Design, pp. 239–243, November 1999.
123. S.-P. Chan, S.-Y. Chan, and S.-G. Chan, Analysis of Linear Networks and
Systems: A Matrix-Oriented Approach with Computer Applications. Addison-
Wesley Publishing Company, 1972.
124. E. M. Reingold, J. Nievergelt, and N. Deo, Combinatorial Algorithms: Theory
and Practice. Prentice-Hall, 1977.
125. O. Bretscher, Linear Algebra with Applications. Prentice-Hall, 1996.
126. P. G. Ciarlet and J. L. Lions, eds., Handbook of Numerical Analysis, Vol. I.
North-Holland, 1990.
127. R. W. Farebrother, Linear Least Squares Computations. Marcel Dekker, 1988.
128. M. R. Osborne, Finite Algorithms in Optimization and Data Analysis. John
Wiley & Sons, 1985.
129. R. Fletcher, Practical Methods of Optimization. John Wiley & Sons, 1987.
130. Å. Björck, Numerical Methods for Least Squares Problems. North-Holland,
1996.
131. C. L. Lawson and R. J. Hanson, Solving Least Squares Problems. Prentice-Hall,
1974.
132. B. Taskin and I. S. Kourtev, “Delay Insertion in Clock Skew Scheduling,” ACM
International Symposium on Physical Design, (San Francisco, CA), April 2005.
133. S.-H. Huang, C.-H. Cheng, C.-M. Chang, and Y.-T. Nieh, “Clock Period Min-
imization with Minimum Delay Insertion,” Proceedings of the IEEE/ACM De-
sign Automation Conference, pp. 970–975, June 2007.
134. G. H. Golub and C. F. Van Loan, Matrix Computations. Johns Hopkins Univer-
sity Press, 1996.
135. G. Forsythe and C. B. Moler, Computer Solution of Linear Algebraic Systems.
Prentice-Hall, 1967.
136. B. Floyd, X. Guo, J. Caserta, T. Dickson, C.-M. Hung, K. Kim, and
K. O, “Wireless Interconnects for Clock Distribution,” Proceedings of the 8th
ACM/IEEE International Workshop on Timing Issues in the Specification and
Synthesis of Digital Systems, December 2002.
137. B. Floyd, C. Hung, and K. K. O, “Intra-chip Wireless Interconnect for Clock
Distribution Implemented with Integrated Antennas, Receivers, and Transmit-
ters,” IEEE Journal of Solid-State Circuits, Vol. 37, pp. 522–543, May 2002.
138. R. Li, X. Guo, and K. O, “A Technique for Incorporation of a Heatsink for
a System Utilizing Integrated Circuits with Wireless Connections to an Off-
chip Antenna,” Proceedings of the IEEE International Interconnect Technology
Conference, pp. 160–162, June 2004.
139. W. Andress and D. Ham, “Standing Wave Oscillators Utilizing Wave-adaptive
Tapered Transmission Lines,” Digest of Technical Papers, 2004 Symposium on
VLSI Circuits, pp. 50–53, June 2004.
140. S. C. Chan, P. J. Restle, N. K. James, and R. L. Franch, “A 4.6 GHz Resonant
Global Clock Distribution Network,” IEEE ISSCC Digest of Technical Papers,
pp. 341–343, February 2004.
141. S. C. Chan, K. L. Shepard, and P. J. Restle, “Design of Resonant Global
Clock Distributions,” Proceedings of the International Conference on Computer
Design, pp. 238–243, 2003.
142. V. L. Chi, “Salphasic Distribution of Clock Signals for Synchronous Systems,”
IEEE Transactions on Computers, Vol. 43, pp. 597–602, May 1994.
143. A. Drake, K. Nowka, T. Nguyen, J. Burns, and R. Brown, “Resonant Clocking
Using Distributed Parasitic Capacitance,” IEEE Journal of Solid-State Cir-
cuits, Vol. 39, pp. 1520–1528, September 2004.
144. L. Hall, M. Clemens, W. Liu, and G. Bilbro, “Clock Distribution Using Coop-
erative Ring Oscillators,” Proceedings of the Conference on Advanced Research
in VLSI, pp. 15–16, September 1997.
145. F. O’Mahony, C. Yue, M. Horowitz, and S. Wong, “A 10-GHz Global Clock Dis-
tribution Using Coupled Standing-wave Oscillators,” IEEE Journal of Solid-
State Circuits, Vol. 38, pp. 1813–1820, November 2003.
146. F. O’Mahony, C. P. Yue, M. Horowitz, and S. Wong, “Design of a 10GHz Clock
Distribution Network Using Coupled Standing Wave Oscillators,” Proceedings
of IEEE/ACM Design Automation Conference, (Anaheim, CA), pp. 682–687,
June 2003.
147. P. J. Restle, T. G. McNamara, P. J. Camporese, K. F. Eng, K. A. Jenkins,
D. H. Allen, M. J. Rohn, M. P. Quaranta, D. W. Boerstler, C. J. Alpert, C. A.
Carter, R. N. Bailey, J. G. Petrovik, B. L. Krauter, and B. D. McCredie, “A
Clock Distribution Network for Microprocessors,” IEEE Journal of Solid-State
Circuits, Vol. 36, pp. 792–799, May 2001.
148. J. Wood, S. Lipa, P. Franzon, and M. Steer, “Multi-Gigahertz Low-Power Low-
Skew Rotary Clock Scheme,” Proceedings of the IEEE International Solid-State
Circuits Conference, pp. 400–401, February 2001.
149. M. Saint-Laurent, M. Swaminathan, and J. Meindl, “On the Micro-
architectural Impact of Clock Distribution Using Multiple PLLs,” Proceedings
of IEEE International Conference on Computer Design, pp. 214–220, Septem-
ber 2001.
150. S.-M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits: Analysis and
Design. The McGraw-Hill Companies, Inc., 1996.
151. J. M. Rabaey, Digital Integrated Circuits: A Design Perspective. Prentice-Hall,
Inc., Upper Saddle River, NJ, 1995.
152. E. G. Friedman, Clock Distribution Networks in VLSI Circuits and Systems.
IEEE Press, 1995.
153. G.-C. Hsieh and J. C. Hung, “Phase-Locked Loop Techniques: A Survey,” IEEE
Transactions on Industrial Electronics, Vol. 43, pp. 609–615, December 1996.
154. A. J. Drake, K. J. Nowka, T. Y. Nguyen, J. L. Burns, and R. B. Brown,
“Resonant Clocking Using Distributed Parasitic Capacitance,” IEEE Journal
of Solid-State Circuits, Vol. 39, pp. 1520–1528, September 2004.
155. J.-Y. Chueh, M. C. Papaefthymiou, and C. H. Ziesler, “Two-phase Resonant
Clock Distribution,” Proceedings of the IEEE Computer Society Annual Sym-
posium on VLSI, pp. 65–70, May 2005.
156. S. C. Chan, K. L. Shepard, and P. J. Restle, “Distributed Differential Oscilla-
tors for Global Clock Networks,” IEEE Journal of Solid-State Circuits, Vol. 41,
pp. 2083–2094, September 2006.
157. J. Wood, “Electronic circuitry.” United States Patent Application Number
20030128075, July 2003.
158. J. Wood, “Electronic circuitry.” United States Patent Number 6,816,020, No-
vember 2004.
159. J.-Y. Chueh, C. H. Ziesler, and M. C. Papaefthymiou, “Experimental Eval-
uation of Resonant Clock Distribution,” Proceedings of the IEEE Computer
Society Annual Symposium on VLSI: Emerging Trends in VLSI System Design,
pp. 135–140, February 2004.
160. J.-Y. Chueh, C. H. Ziesler, and M. C. Papaefthymiou, “Empirical Evaluation
of Timing and Power in Resonant Clock Distribution,” Proceedings of the In-
ternational Symposium on Circuits and Systems, pp. 249–252, May 2004.
161. J. Rosenfeld and E. G. Friedman, “Sensitivity Evaluation of Global Resonant
H-tree Clock Distribution Networks,” Proceedings of the ACM Great Lakes
Symposium on VLSI, pp. 192–197, April-May 2006.
162. J. Rosenfeld and E. G. Friedman, “Design Methodologies for Global Resonant
H-tree Clock Distribution Networks,” Proceedings of the IEEE International
Symposium on Circuits and Systems, pp. 2073–2076, May 2006.
163. J. Rosenfeld and E. G. Friedman, “Design Methodology for Global Resonant
H-tree Clock Distribution Networks,” IEEE Transactions on Very Large Scale
Integration (VLSI) Systems, Vol. 15, pp. 135–148, February 2007.
164. F. O’Mahony, 10 GHz Global Clock Distribution Using Coupled Standing-Wave
Oscillators. PhD thesis, Stanford University, August 2003.
165. P. Restle, “Resonant Clock Networks.” http://www.research.ibm.com/, 2005.
IBM Research, Computer Science, Innovative Matters, VLSI Design.
166. J. Denker, “A Review of Adiabatic Computing,” Proceedings of the 1994 Sym-
posium on Low Power Electronics, pp. 94–97, October 1994.
167. K. S. Kim and M. Papaefthymiou, “Single-Phase Source-Coupled Adiabatic
Logic,” Proceedings of the International Symposium on Low Power Electronics
and Design, pp. 97–99, 1999.
168. G. D. Mercey, “A 18GHz Rotary Traveling Wave VCO in CMOS with I/Q Out-
puts,” Proceedings of the European Solid-State Circuits Conference, pp. 489–
492, September 2003.
169. Z. Yu and X. Liu, “Power Analysis of Rotary Clock,” Proceedings of the IEEE
Computer Society Annual Symposium on VLSI, pp. 150–155, May 2005.
170. Z. Yu and X. Liu, “Power Minimization of Rotary Clock Design,” Proceedings
of the IEEE International SOC Conference, pp. 19–24, September 2005.
171. Z. Yu and X. Liu, “Low-Power Rotary Clock Array Design,” IEEE Trans-
actions on Very Large Scale Integration (VLSI) Systems, Vol. 15, pp. 5–12,
January 2007.
172. G. Venkataraman, J. Hu, F. Liu, and C.-N. Sze, “Integrated Placement and
Skew Optimization for Rotary Clocking,” Proceedings of the IEEE Design,
Automation and Test in Europe, pp. 1–6, March 2006.
173. G. Venkataraman, J. Hu, and F. Liu, “Integrated Placement and Skew Opti-
mization for Rotary Clocking,” IEEE Transactions on Very Large Scale Inte-
gration (VLSI) Systems, pp. 149–158, February 2007.
174. Z. Yu and X. Liu, “Design of Rotary Clock Based Circuits,” Proceedings of the
ACM/IEEE Design Automation Conference, pp. 43–48, June 2007.
175. C. Ababei, S. Navaratnasothie, K. Bazargan, and G. Karypis, “Multi-Objective
Circuit Partitioning for Cutsize and Path-based Delay Minimization,” Proceed-
ings of the IEEE/ACM International Conference on Computer Aided Design,
pp. 181–185, November 2002.
176. I. Lustig, “Private Communication,” 2004. ILOG Inc.
177. B. Hendrickson and R. Leland, “The Chaco User’s Guide: Version 2.0,” Tech.
Rep., Sandia National Laboratories, Albuquerque, NM, July 1995.
178. A. Pothen, H. Simon, and K. Liou, “Partitioning Sparse Matrices with Eigenvectors
of Graphs,” SIAM Journal of Matrix Analysis, Vol. 11, pp. 430–452, 1990.
179. R. Williams, “Performance of Dynamic Load Balancing Algorithms for Un-
structured Mesh Calculations,” Concurrency, Vol. 3, pp. 457–481, 1991.
180. B. Kernighan and S. Lin, “An Efficient Heuristic Procedure for Partitioning
Graphs,” Bell System Technical Journal, Vol. 49, pp. 291–307, 1970.
181. C. M. Fiduccia and R. Mattheyses, “A Linear-Time Heuristic for Improving Network
Partitions,” Proceedings of the IEEE/ACM Design Automation Conference,
pp. 175–181, 1982.
182. B. Hendrickson and R. W. Leland, “A Multi-Level Algorithm For Partitioning
Graphs,” Supercomputing, 1995.
183. Apple Inc., Advanced Computing Group, Xgrid Guide, 2004.
184. MPI Standard Forum, http://www-unix.mcs.anl.gov/mpi/standard.html,
Message Passing Interface Standard v 2.0, 1997.
185. N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Graph Algo-
rithms for Clock Schedule Optimization,” Proceedings of the IEEE/ACM In-
ternational Conference on Computer–Aided Design, pp. 132–136, November
1992.
186. B. Taskin, “High Performance Integrated Circuit (hpic) Timing Software Pack-
age v1.9.” http://sourceforge.net/projects/hpictiming/, 2004.
187. Free Software Foundation (FSF), http://www.gnu.org/software/glpk/glpk.
html, GLPK (GNU Linear Programming Kit), 2005. version 4.8.
Index
A
Application-specific integrated circuits Clock skew scheduling
(ASICs), 15, 180, 244, 245 applications of, 96
basis skews
B clock skew vector and enumeration,
Bernoulli equations, 29 178
thicker edges and basis edges,
C 176–177
CAD. See Computer-aided design circuit design process and safety
Cascade voltage switch logic (CVSL), margin in, 84
39 clocking technology, 183
Clock distribution network definitions and graphical model
branching factor, 88 clock delays in, 74
circuit and interconnect structure, 72 graph-based models in, 76
design process for, 16 inherent structural limitations of,
resistive-capacitive (RC), 35 76
scheduling algorithms for, 4 permissible range of, 74–76
signals, 4 synchronous digital system, 73,
tree structure of, 86 75–80
Clock signal timing parameters of, 75
clock pulse, 51 delay insertion method, 153–162
clock skew edge-triggered circuits, 232
lead/lag relationship, 52 ISCAS’89 benchmark circuits,
sequentially-adjacent registers, 53 229–230
coincidental cycles of, 56 level-sensitive circuits, 231, 233
data in, 49 QP-based clock scheduling
latching and non-latching edges of, algorithm, 226–229
48–49 double and zero clocking hazards in,
leading and trailing edge of, 45 82
multi-phase clock synchronization double clocking, 81
reference clock cycle, 54 input file format
sample of, 53 samples of, 93, 94
storage elements in, 50–51 static timing analyzers, 91
Clock skew scheduling (Continued) parallelization
I/O registers and target delays computation time speedup, 203
delay requirements for, 180 Xgrid computing cluster, 202–203
local data paths, 178 Xgrid software architecture, 202
timing information and violations, partitioning process
179–180 alternative approaches, 199–200
VLSI integrated circuit, 179 clock periods, 198–199
level-sensitive circuits delay padding, 200
circuit networks, 108–113 heuristic method, 197–199
initialization constraints, 102 performance characteristics in, 5
interpretation results, 209 problem formulation
on ISCAS’89 Benchmark circuits, LCSS-SAFE, 127–128
206–208 linear programming approach, 123
iterative approach, 103–104 local data path timing constraints,
latching constraints, 98 122
linearization, 105–108 maximum performance, 123–125
LP formulation, 113–117 quadratic programming problem,
multi-phase, 117–120 128
parameter data distributions, safety, 125–127
209–211 quadratic programming algorithm
propagation constraints, 100–101 derivation
skew analysis, 211–213 circuit graph, 129–130
synchronization constraints, 98–100
linear dependence, 130–137
timing relationships, 97–98
optimization problem and solution,
validity constraints, 101–102
137–143
verification, 208
quadratic programming formulation
linear problem formulation
computer implementation, 223–225
delay models, 163
graphical illustrations, 225
reconvergent paths, theoretical
registers, 83
limitation of, 162
ROA rings and application of, 191
linear programming (LP) formulation
for, 81, 244 rotary clock synchronized circuits
localized negative synchronous industrial1, 234
circuit, 4 minimum clock periods, 234–235
LP models, 163, 195, 197–198 scalability of, 194, 243
minimum clock period software implementation
circuits and limitations on, 146, 149 benchmark circuit s400, 92
data path cycles, 148–150 benchmark circuit s1423, 91
data propagation times for, 147–148 in clock tree synthesis, 89
dominant limiting factor in, 147 data paths in, 94
reconvergent paths, 150–151 ISCAS’89 suite of circuits, 90
register delays in, 155 timing constraints and design
min-max timing models, 165 automation, 85
modeling and applications topological design of, 16
delay buffer tree structure, 164–165 tree topology implementations, 89
delay insertion problems, 163 Clock tree synthesis, 87
post-timing analysis process, 165 Complementary metal-oxide-
output file format semiconductor (CMOS)
path delay distribution in, 95 input waveforms in, 28
sample of, 94 inverter logic gate in, 26
logic circuits and styles for, 39 propagation delay time in, 32
operating mode of, 27 short-circuit power for, 33
P-channel and N-channel transistor, Delay controlling, 31
terminal voltages for, 26 Delay insertion method
PMOS and NMOS device in, 25 divergent and convergent registers in,
transistor configuration of, 9 152–153
Computational algorithms drawbacks of, 245
CSD edge-triggered circuit reconvergent
clock schedule, 172–173 system
computation time, 173 CSS method for, 163
memory usage, 175 data signals for, 153
LMCS-1 minimum clock period, 159
memory usage of, 170 path delays, algebraic difference in,
triangular, cholesky decomposition, 156
169–170 timing of, 154, 162
LMCS-2 values, interval and elements in,
lagrange multipliers, 170–171 157–158
memory usage of, 172 level-sensitive circuit reconvergent
Sherman-Morrison-Woodburry system
formula, 171 clock skew scheduling algorithm for,
run time and memory 160
complexity expressions, 176 CSS method for, 164
requirements, 175 timing of, 160, 162
Computer-aided design (CAD), 195 zero internal register delays, 159
tools, 196, 243 minimum clock period obtainable in,
CVSL. See Cascade voltage switch logic 155
reconvergent data path systems
D delays in, 153
Data path cycles edge-triggered and level-sensitive
clock skews circuit of, 148 circuits in, 160
minimum clock period, limitation on, Design-for-manufacturing (DFM)
149 techniques, 245
reconvergent system and paths, Digital integrated circuits, 8
timing diagrams of, 150–151 Double-edge-triggered (DET) flip-flops,
Data propagation times 47
setup and hold time constraints for, DSM. See Deep submicrometer
148
timing delay models in, 147 F
Deep submicrometer (DSM), 3 Finite-state machine (FSM) model, 13
Delay analytical analysis Flip-flops
fall time derivation positive and negative-edge-triggered,
input waveforms for, 28 47
transition process, 26 single-phase path
rise time derivation, 30 data signal, early arrival of, 58–60
short-channel effects data signal, late arrival of, 55–58
channel-length modulation in, 33 delay padding, 60
velocity saturation, 34–35 logic gates and, 55
waveform effects timing parameters
delay expressions in, 31–32 clock pulse width of, 48–49
clock-to-output delay, 49 schematic representation of, 43
hold time and setup time, 49 single-phase path
violation setup, timing diagram of, 56 data signal, early arrival of, 63–65
Full adder circuit, 8–9 data signal, late arrival of, 61–63
max operation, 63
H registers and logic gates in, 61
Hardware description language (HDL), Level-sensitive circuits
15 circuit networks
hpictiming tool, 233–234 edge-sensitive circuit, 108–109
level-sensitive circuit, 109
I
non-zero clock, 110
industrial1, 234
Integrated circuits (ICs) synchronous circuit state, 110–113
characteristics and factors of, 1 topology, 108
circuit structures and chip area in, 2 clock skew scheduling
data traffic in, 2 interpretation results, 209
performance of, 3 on ISCAS’89 Benchmark circuits,
Intellectual property (IP) blocks, 180 206–208
benchmark circuits for, 200 parameter data distributions,
ISCAS’89 benchmark circuits 209–211
average speedup of, 245 skew analysis, 211–213
CAD tool verification, 208
parallel speedup, 237–238 initialization constraints, 102
sequential speedup, 237–238 iterative approach
tool flow, 235 algorithm, 103–104
clock skew scheduling framework lenient, 104
industrial1, 234 SMO formulation, 103
minimum clock periods, 234–235 latching constraints, 98
hpictiming tool, 233 limitations, 146–147
multi-phase synchronization, 244 linearization
run time breakdown process, 239–241 linear programming (LP) model,
106–108
L modified big M (MBM) method,
Latches 105–106
clock signal levels in, 43 timing analysis, 104–105
idealized operation of, 44 LP formulation
multi-phase path benchmark circuits, 115–116
combinational logic blocks in, 65 MIP problems and model, 113–114,
data signal, early arrival of, 68–69 116–117
data signal, late arrival of, 66–68 NLP problems, 113
timing properties of, 66 non-linear constraint, 114–115
parameters optimized timing schedule, 112
clock pulse width of, 45 multi-phase clocking
clock-to-output delay, 45 clock skew scheduling, 220–221
data-to-output delay and setup simultaneous time borrowing and
time, 45–46 clock skew scheduling, 221–223
hold time, 46–47 time borrowing, 219–220
minimum and maximum values of, multi-phase synchronization
47 non-overlapping process, 215–216
signal relationships on, 44 transformation process, 213–214
multi-phase system P
edge-triggered system, 117 Phase-locked-loop (PLL), 183
synchronization overview, 117–118 clock sources, 190
timing circuits, 118–120 components, 184
propagation constraints, 100–101 reflections, capacitive loading, 184
synchronization constraints, 98–100 Power supply rejection ratio (PSRR),
timing relationships, 97–98 188
validity constraints, 101–102 Q
Linear programming (LP) problems, QP algorithm derivation
146 circuit graph, 129–130
models, 164–165 linear dependence
naive approach, 163 circuit connectivity matrix, 134
Logic gates circuit graph cycles, 131
and registers clock scheduling algorithms, 130
sequentially-adjacent pair of, 74 graph theory, 132
in synchronous digital circuit, 73 independent cycle matrix, 136–137
switching characteristics of, 22–23 kernel equation, 135–136
values and properties of, 9 local data paths, 132–133
matrix relationship, 134–135
spanning trees, 133
M optimization problem and solution
Message passing interface (MPI), 202 active constraints, 138–139
Metal-oxide-semiconductor field effect clock skew definition, 137–138
transistor (MOSFETs), 23 Gauss-Jordan elimination, 141
Mixed-integer linear programming global minimizer, 143
(MIP), 87, 113, 164 Lagrange multipliers, 139–140
Modified big M method (MBM), linear system technique, 142–143
105–106 local data paths, 138
Moore’s law, 3 non-linear equation, 140
Multi-phase synchronization approach, objective clock skew schedule, 142
54 QP-based clock skew scheduling,
computational analysis
CSD, 172–175
N LMCS-1, 169–170
NAND gate, 9–10 LMCS-2, 170–172
N-channel enhancement mode MOSFET run time and memory requirements,
transistor (NMOS), 24 168
NMOS transistor, 44 Quadratic programming (QP)
Non-linear programming (NLP), 113 formulation, 244–245
Non-zero clock skew scheduling computer implementation, 223–225
applications of, 243 graphical illustrations, 225
automation and application of, 243 R
circuit operating and applications of, Resistive-capacitive (RC) loads, 72
83 circuit network for, 38
clock signals and benefits in, 84 signal delay expressions in, 37
researchers, 244 Resonant clocking technology
synchronization methodologies for, clock tree network, 184
245 digital integrated circuits, 183
oscillators S
coupled LC and standing wave, 185 Shichman-Hodges equations, 24
traveling wave, 186 Spanning tree algorithm, edge swapping
partitioning process and enumeration, 180
balanced priority assignment, 196 Standard template library (STL), 234
with chaco, 195–196 Synchronous digital system
clock skew scheduling, 197–200 logic gates and storage registers, 73
path-based and net-based, 193 signal cycles and graph representation
registered-input and registered- of, 78
output, 197 Synchronous systems
register insertion, 196 clock signals, 50–54
register placement, 200–201 finite-state machine (FSM) model of,
timing constraints and data path, 13
196–197 flip-flops, 47–50
timing-driven, 193–195 single-phase path with, 55–60
tools and factors for, 194 latches, 43–47
VLSI circuits, 192 multi-phase path with, 65–69
rotary circuits, timing requirements single-phase path with, 61–65
clock skew and signals, 189 storage elements, 42
oscillatory signals, 191 timing properties of, 41
synchronization schemes, 190–191 System-on-chip (SoC), 181
rotary traveling-wave oscillators
(RTWO’s), 185–189 V
ROA. See Rotary oscillator arrays Very large scale integration (VLSI)
Rotary clock synchronized circuits systems
CAD tool buffers and registers, 86
run time breakdown process, circuit design and timing, 244
239–241 circuits production in, 3
speedup process, 237–238 delay metrics
clock skew scheduling circuit analysis and design for, 23
industrial1, 234 computer-aided design applications,
minimum clock periods, 234–235 20
Rotary oscillator arrays (ROA), 185 logic gates and elements in, 19, 21
clock architecture, 186 signal propagation and making in,
grid topology, 185, 191 23
ring, clock phase relationships of, signal transitions in, 22
190, 191 signal waveforms circuit in, 21–22
structures, 192 delay mitigation, 37
Rotary traveling-wave oscillators design process
(RTWO’s) electronic devices, switching
anti-parallel inverters, 186 properties in, 14
integrated skew computation and synthesis tools in, 15
logic placement, 189 devices and interconnections
loop inductance, 188 analytical delay analysis, 26–31
novel clock network, 185 delay controlling, 31
rings, 185, 187, 190 delay mitigation, 37–39
shunt connected inverters, transmis- gain factor in, 25
sion line, 187 importance of, 35–37
theory, 187 RC estimation in, 36
short-channel effects, 33–35 computational algorithm in, 11
signal delay in, 19 data path in, 14
static and dynamic circuit analysis finite-state machine (FSM) model,
for, 25 13
waveform effects, 31–33 signal propagation delay, 11
digital systems in, 7 systems and circuits of, 4
integrated circuits, 245 technologies and systems in, 12
networks and logic gates in, 12 transistor elements, 2
signal representation and ULSI-based digital systems, 72
data processing in, 7
electronic devices and circuits, 8 X
storage elements, 42 Xgrid
synchronous circuits, 85 computing cluster, 202–203
synchronous systems