Timing Optimization Through Clock Skew Scheduling

Ivan S. Kourtev
University of Pittsburgh, Pittsburgh, PA, USA

Baris Taskin
Drexel University, Philadelphia, PA, USA

Eby G. Friedman
University of Rochester, Rochester, NY, USA

Springer
Preface
Acknowledgments
The authors would like to thank all of those who helped write and correct
early manuscript versions of this monograph—fellow colleagues
and students, as well as the anonymous reviewers who provided important
comments on improving the overall quality of this book. The authors would
also like to thank Dr. Bob Grafton from the National Science Foundation for
supporting the early research projects that have culminated in the writing
and production of this book. We would also like to warmly acknowledge the
assistance and support of Alex Greene and Katelyn Stanne from Springer—
Alex and Katie’s patience and encouragement have been crucial to the success
of this project.
The research work described in this research monograph was made possible
in part by support from the National Science Foundation under Grant
No. MIP-9423886 and Grant No. MIP-9610108, by a grant from the New
York State Science and Technology Foundation to the Center for Advanced
Technology-Electronic Imaging Systems, and by grants from the Xerox
Corporation, IBM Corporation, Intel Corporation and Multigig Inc.
1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
2 VLSI Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.1 Signal Representation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2 Synchronous VLSI Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
2.3 The VLSI Design Process . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
12 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 243
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 247
Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 259
List of Figures
5.1 A simple synchronous digital circuit with four registers and four
logic gates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
5.2 The permissible range of the clock skew of a local data path. A
timing violation exists if sk ∉ [lk, uk]. . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3 A directed multi-graph representation of the synchronous
system shown in Figure 5.1. The graph vertices correspond to
the registers, R1 , R2 , R3 and R4 , respectively. . . . . . . . . . . . . . . . . . . 77
5.4 A graph representation of the synchronous system shown in
Figure 5.1 according to Definition 5.3. The graph vertices
v1 , v2 , v3 , and v4 correspond to the registers, R1 , R2 , R3 and R4 ,
respectively. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78
5.5 Transformation rules for the circuit graph. . . . . . . . . . . . . . . . . . . . . 79
5.6 Application of non-zero clock skew to improve circuit
performance (a lower clock period) or circuit reliability
(increased safety margins within the permissible range). . . . . . . . . . 83
5.7 Tree structure of a clock distribution network. . . . . . . . . . . . . . . . . . 86
5.8 Buffered clock tree for the benchmark circuit s1423. The circuit
s1423 has a total of N = 74 registers and the clock tree consists
of 45 buffers with a branching factor of f = 3. . . . . . . . . . . . . . . . . 91
5.9 Buffered clock tree for the benchmark circuit s400. The circuit
s400 has a total of N = 21 registers and the clock tree consists
of 14 buffers with a branching factor of f = 3. . . . . . . . . . . . . . . . . . 92
5.10 Sample input for the clock scheduling program described in
Section 5.7.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 93
5.11 Sample output for the clock scheduling program described in
Section 5.7.2. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
5.12 The application of clock skew scheduling to a commercial
integrated circuit with 6,890 registers [note that the time scale
is in femtoseconds, 1 fs = 10⁻¹⁵ s = 10⁻⁶ ns]. . . . . . . . . . . . . . . . . . 96
6.1 Possible cases for the arrival and departure times of data at the
initial latch. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Propagation of the data signal in a simple circuit. . . . . . . . . . . . . . 101
6.3 The iterative algorithm for static timing analysis of
level-sensitive circuits. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 104
6.4 A simple synchronous circuit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
List of Figures XV
7.1 Circuit graph of the simple example circuit C1 from Section 7.1.1. 129
7.2 Two spanning trees and the corresponding minimal sets of
linearly independent clock skews and linearly independent cycles
for the circuit example C1 . Edges from the spanning tree are
indicated with thicker lines. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 133
9.5 I/O registers in a VLSI integrated circuit. Note that the I/O
registers form part of the local data paths between the inside of
the circuit and the outside of the circuit. . . . . . . . . . . . . . . . . . . . . . . 179
11.1 Data propagation times for s938 with 32 registers and 496 data
paths. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 209
11.2 Maximum effective path delays in data paths of s938 for zero
clock skew. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
11.3 Maximum effective path delays for s938 for non-zero clock skew. 211
11.4 Distribution of the clock skew values of the non-zero clock skew
case for s938. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.5 Distribution of the clock delay values of the non-zero clock skew
case for s938. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 213
11.6 Generation of an n-phase data path with latches. . . . . . . . . . . . . . . 214
11.7 Non-overlapping multi-phase synchronization clock. . . . . . . . . . . . . 215
11.8 Effects of multi-phase clocking on time borrowing. . . . . . . . . . . . . . 219
11.9 Effects of multi-phase clocking on clock skew scheduling. . . . . . . . . 221
11.10 Effects of multi-phase clocking on time borrowing and clock
skew scheduling. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 222
11.11 Circuit s3271 with r = 116 registers and p = 789 local data
paths. The target clock period is TCP = 40.4 nanoseconds. . . . . . . 227
11.12 Circuit s1512 with r = 57 registers and p = 405 local data
paths. The target clock period is TCP = 39.6 nanoseconds. . . . . . . 228
11.13 Percentage improvements through delay insertion in Table 11.6. . . 232
11.14 Percentage improvements on edge-triggered circuits in Table 11.6. 232
11.15 Percentage improvements on level-sensitive circuits in Table 11.6. 233
11.16 CAD tool flow. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 235
11.17 The run times of hpictiming with Xgrid on large circuits. . . . . . . 239
11.18 Run time breakdown of hpictiming program steps for s38584. . . 240
11.19 Run time breakdown of hpictiming program steps for s38417. . . 240
11.20 Run time breakdown of hpictiming program steps for
industrial1. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 240
1 Introduction
Monolithic integrated circuits were first introduced in the early 1960s.
I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling,
DOI: 10.1007/978-0-387-71056-3_1,
© Springer Science+Business Media LLC 2009
[Figure: transistor count per integrated circuit versus year, spanning from
the i4004 (on the order of 10⁴ transistors) through the i8086, i80286, i486,
Pentium, Pentium II, and Pentium IV generations to the Itanium II and the
Dual-core Itanium (approaching 10¹⁰ transistors).]
increased circuit performance has been largely achieved by the following ap-
proaches:
• reduction in feature size (technology scaling), that is, the capability of
manufacturing physically smaller and faster circuit structures,
• increase in chip area, permitting a larger number of circuits and therefore
greater on-chip functionality,
• advances in packaging technology, permitting the increasing volume of
data traffic between an integrated circuit and its environment as well as
the efficient removal of heat created during circuit operation.
The most complex integrated circuits are referred to as VLSI circuits,
where the term VLSI stands for Very Large Scale Integration. This term
describes the complexity of modern integrated circuits consisting of hundreds
of thousands to many millions of active transistor elements. Presently, the
[Figure: processor clock frequency (in MHz, on a logarithmic scale from 1 to
10⁴) versus year, 1975-2005, from the i4004 and i8086 through the i80286,
V70, Pentium, and DEC Alpha to the Itanium, Pentium IV, and Itanium II.]
of fully synchronous VLSI systems, these effects have the potential to create
catastrophic failures due to the limited time available for signal propagation
among the gates.
The material presented in this monograph is associated with these afore-
mentioned delay effects from the perspective of a synchronous digital VLSI
system. The research results described here can be used to improve the per-
formance and reliability of a synchronous VLSI circuit through the design of
the clock distribution network common to any synchronous digital system.
Specifically, new algorithms for scheduling the arrival time of the clock sig-
nals at the individual registers (or synchronous macro blocks) of a circuit
and synthesizing the overall clock tree are discussed. Operational character-
istics, performance improvements and limitations to suggested improvements
are presented in a cohesive manner.
To provide an intuitive perspective into the topics discussed here, consider
the simple synchronous circuit shown in Figure 1.3 [9]. Two consecutively con-
nected local data paths, consisting of the registers, R1 and R2 , and R2 and R3 ,
respectively, are depicted in this figure. Consider that, by design, clock delays
to R1 and R3 must be identical. That is, the clock signal C1 to the register
R1 is synchronized2 with the clock signal C3 to R3 . The signal delays through
the registers are considered identical in this example, numerically assigned
to 2 ns. Under this identical register delay assumption, the path from R2 to
R3 is the worst case path (since it has a larger logic signal delay). By delaying
the clock signal C3 to the register R3 with respect to the clock signal to the
register R2 , a leading (or negative) clock skew is added to this local data path
from R2 to R3 . As the clock delays to R1 and R3 are designed to be identical, a
certain amount of lagging (or positive) clock skew is applied to the local data
path from R1 to R2 . Thus, the clock signal C2 should be designed to lead the
clock signal C3 by 1.5 ns, thereby forcing both paths R1 to R2 and R2 to R3
to have the same total effective local data path delay (consisting of
propagation delay TPD and local data path skew TSkew), TPD + TSkew = 7.5 ns.

[Figure: registers R1, R2, and R3 connected in series, with logic of data
signal delay 4 ns between R1 and R2 and logic of data signal delay 7 ns
between R2 and R3; the clock signals C1, C2, and C3 drive the respective
registers.]
Fig. 1.3. Example of applying localized negative clock skew to a synchronous
circuit.

2 The signals C1 and C3 arrive at the same time with no delay or advance with
respect to each other.

The
delay of the critical path (R2 to R3 ) of the synchronous circuit is temporally
refined to the precision of the clock distribution network, and the entire sys-
tem (for this simple example) could operate at a maximum clock frequency of
133.3 MHz. Note that, if no localized clock skew were applied, the maximum
possible frequency would be 111.1 MHz. The performance characteristics of
the system, both with and without the application of localized clock skew, are
summarized in Table 1.1.
Table 1.1. Performance characteristics of the circuit shown in Figure 1.3
without and with localized clock skew (all times in ns).

Local Data Path  TPD(min), zero skew  TCi  TCf  TSkew  TPD(min), non-zero skew
R1 -> R2         4 + 2 + 0 = 6        3    1.5   1.5   4 + 2 + 1.5 = 7.5
R2 -> R3         7 + 2 + 0 = 9        1.5  3    -1.5   7 + 2 - 1.5 = 7.5
fmax             111.1 MHz                             133.3 MHz
Note that |TSkew| < TPD (since |-1.5 ns| < 9 ns) for the local data path
from R2 to R3. Therefore, it is ensured that the correct data signal is
successfully latched into R3 and no local data path/clock skew constraint
relationship is violated. This design technique of applying localized clock skew is
particularly effective in sequentially-adjacent, temporally irregular local data
paths; however, it is applicable to any type of synchronous sequential sys-
tem. For certain architectures, a significant improvement in performance and
reliability is both possible and likely.
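The arithmetic behind Figure 1.3 and Table 1.1 can be sketched in a few
lines of code. This is a minimal illustration, not code from the monograph;
the register delay, logic delays, and skew values are those of the example
above.

```python
# Sketch (not from the book) of the Table 1.1 arithmetic: balancing two
# cascaded local data paths by applying opposite clock skews.

REG_DELAY = 2.0  # ns, the identical register delay assumed in the example

def effective_delay(logic_delay, skew):
    """Effective local data path delay: logic delay + register delay + skew."""
    return logic_delay + REG_DELAY + skew

def fmax_mhz(path_delays):
    """Maximum clock frequency, set by the slowest local data path (ns -> MHz)."""
    return 1000.0 / max(path_delays)

# Zero skew: paths R1->R2 (4 ns logic) and R2->R3 (7 ns logic).
zero = [effective_delay(4.0, 0.0), effective_delay(7.0, 0.0)]
print(fmax_mhz(zero))    # ~111.1 MHz, limited by the 9 ns critical path

# Skew of +1.5 ns on R1->R2 and -1.5 ns on R2->R3 balances both paths at 7.5 ns.
skewed = [effective_delay(4.0, 1.5), effective_delay(7.0, -1.5)]
print(fmax_mhz(skewed))  # ~133.3 MHz
```

The skew shifts slack from the fast path to the slow path without changing
the total delay around the two paths, which is why both settle at 7.5 ns.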
One of the objectives of this research monograph is to provide detailed
insight into the systematic application of the technique exemplified in Fig-
ure 1.3 and Table 1.1 and described above to synchronous sequential digital
circuits of arbitrary structure and size. To this end, the basic properties of
CMOS-based digital integrated circuits as well as the fundamental principles
of synchronous VLSI system operation are reviewed in Chapter 2.
In Chapters 3 and 4, the timing issues related to the implementation of
synchronous VLSI circuits are discussed. A summary of the definitions and
notations used in this monograph is presented. Signal delay in CMOS digital
integrated circuits is presented in Chapter 3 where the sources of both device
and interconnect delays are discussed.
In Chapter 4, the fundamental timing relationships of synchronous digital
systems are summarized as these relationships are key to understanding the
algorithms presented in Chapters 5 and 7. More specifically, Chapter 4 de-
scribes in considerable detail the properties of both the various types of timed
storage elements and of the data paths built with these elements.
along the way. Such a physical variable—also called a signal—is, for example,
the electrical voltage provided by a power supply (with respect to a ground
potential) and developed in circuit elements in the presence of an electromag-
netic field. The voltage signal or bit of information (in a digital circuit) is tem-
porarily stored in a circuit structure capable of accumulating electric charge.
This accumulating or storage property is called a capacitance—denoted by
the symbol C—and, depending on the materials and the physical properties,
is created by a variety of different forms of conductor-insulator-conductor
structures commonly found in integrated circuits.
Furthermore, modern digital circuits utilize Boolean (binary) logic, in
which information is encoded by two values of a signal. These two signal
values are typically called false and true (or low and high or logic zero and
logic one) and correspond to the minimum and maximum1 allowable values
of the signal voltage for a specific integrated circuit implementation.2 Since
the voltage V is proportional to the stored electric charge q (q = CV, where
C is the storage capacitance), the logic low value corresponds to a fully dis-
charged capacitance (q = CV = 0) while the logic high value corresponds
to a capacitance storing the maximum possible charge (fully charged to a
voltage V ).
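As a toy illustration of this encoding (a sketch under assumed threshold and
capacitance values, not taken from the text), the voltage developed by a
stored charge q = CV can be mapped to a binary logic value, with a band of
intermediate voltages treated as invalid — the practice that provides the
noise immunity of practical circuits.

```python
# Toy sketch (not from the book): interpreting a stored charge as a logic value.
# The thresholds (30% and 70% of VDD) and the capacitance are assumed values.

VDD = 1.0   # supply voltage, volts (assumed)
C = 1e-15   # storage capacitance, farads (assumed: 1 fF)

def logic_value(v, v_low=0.3 * VDD, v_high=0.7 * VDD):
    """Interpret a node voltage as logic 0, logic 1, or None (invalid band)."""
    if v <= v_low:
        return 0
    if v >= v_high:
        return 1
    return None

def node_voltage(q):
    """Voltage developed by charge q on the capacitance: q = CV, so v = q / C."""
    return q / C

print(logic_value(node_voltage(0.0)))      # 0: fully discharged capacitance
print(logic_value(node_voltage(C * VDD)))  # 1: fully charged to VDD
```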
The largest and most complicated digital integrated circuits today contain
many millions of circuit elements each processing hundreds and thousands of
binary signals [8, 10, 16, 17]. Every circuit element has a number of input
terminals through which data is received from other elements. In addition, a
circuit element has a number of output terminals through which the results
of the processing are made available to other elements. For a circuit to imple-
ment a particular function, the inputs and outputs of all of the elements must
be properly connected among each other. These connections are accomplished
with wires, which are collectively referred to as an interconnect network, while
the set of circuit elements processing the binary signals is often simply called
the logic gates. During normal circuit operation, signals are received at the in-
puts of the logic gates, the gates process the signals to generate new data, and
then transmit the resulting data signals to the corresponding logic elements
through a network of interconnections. This process involves the transport of
a voltage signal from one physical location to another physical location. In
each case, this process takes a small yet finite amount of time to be completed
and is often called the propagation delay of the signal.
Usually, a small number of logic gates are combined to yield modules (or
standard cells) that perform frequently encountered operations—these mod-
ules can then be reused at many different places in a circuit. An example of
such a module is the full adder circuit shown in Figure 2.1. This specific
circuit adds two one-bit numbers x0 and y0 and a carry-in bit c0 to produce a
two-bit result z1z0, where z1 = x0y0 + x0c0 + y0c0 and z0 = x0 ⊕ y0 ⊕ c0. A
typical CMOS transistor configuration for one of the two-input NAND gates is
shown in Figure 2.2 [corresponding to the gates na 1 through na 3 in
Figure 2.1].

1 Or the maximum and minimum (that is, vice versa) voltage levels.
2 In practice, a range of values close to the minimum and maximum signal
voltages, respectively, are interpreted as logic zero and one, respectively. By
doing so, the noise immunity of the circuit is significantly improved.

[Figure 2.1: a full adder built from XOR gates xo 1 and xo 2 and NAND gates
na 1 through na 4, with inputs x0, y0, and c0 and outputs z1 and z0.]
Fig. 2.1. Logic schematic view of a full adder circuit.
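The Boolean function of the full adder can be checked directly. The sketch
below is an illustration, not code from the original text; the variable names
follow the figure.

```python
# Sketch of the full adder's Boolean function:
# z1 = x0 y0 + x0 c0 + y0 c0 (carry-out), z0 = x0 XOR y0 XOR c0 (sum).

def full_adder(x0, y0, c0):
    z0 = x0 ^ y0 ^ c0                        # sum bit
    z1 = (x0 & y0) | (x0 & c0) | (y0 & c0)   # carry-out (majority of the inputs)
    return z1, z0

# Exhaustive check against integer addition: z1 z0 is the two-bit binary sum.
for x0 in (0, 1):
    for y0 in (0, 1):
        for c0 in (0, 1):
            z1, z0 = full_adder(x0, y0, c0)
            assert 2 * z1 + z0 == x0 + y0 + c0
```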
[Figure 2.2: CMOS transistor-level schematic of a two-input NAND gate with
inputs x0 and x1 and supply voltage VDD.]
[Figure: waveforms sA and sB with 10%, 50%, and 90% reference levels; the
fall time tfA of sA, the rise time trB of sB, and the propagation delay
tPDAB = tPLHAB measured between the 50% points.]
Fig. 2.3. Signal propagation delay from point A to point B with a linear ramp
input and a linear ramp output.
• regardless of shape, sB has the same logical meaning, that is, that the
state of the circuit at point B changes from low to high; this low-to-high
transition and the reverse high-to-low state transition of signal sA require
a positive amount of time to complete.
The temporal relationship between sA and sB as shown in Figures 2.3 and 2.4
must be evaluated quantitatively. This information permits the speed of the
signals at different points in the same circuit or in different circuits built
in different semiconductor technologies to be temporally characterized. By
quantifying the physical speed of the logical operations, circuit designers are
provided with the necessary timing information to design correctly functioning
integrated circuits.
[Figure: waveforms sA and sB with 10%, 50%, and 90% reference levels; the
fall time tfA of sA, the rise time trB of sB, and the propagation delay
tPDAB = tPLHAB measured between the 50% points.]
Fig. 2.4. Signal propagation delay from point A to point B with a linear ramp
input and an exponential output.
3 RISC = Reduced Instruction Set Computer.
elements or logic gates. Each logic element accepts certain input signals and
computes an output signal used by other logic elements. At the logic level of
abstraction, a VLSI system is a network of tens of thousands or more logic
gates whose terminals are interconnected by wires in order to implement
the target algorithm.
As mentioned earlier in Section 2.1, the switching variables acting as in-
puts and outputs of a logic gate in a VLSI system are represented by tangible
physical quantities,4 while a number of these devices are interconnected to
yield the desired function of each logic gate. The specific physical
characteristics are collectively summarized by the term technology, which
encompasses such details as the type and behavior of the devices that can be
built, the number and sequence of the manufacturing steps and the impedance
of the different interconnect materials. Today, several technologies are used
in the implementation of high performance VLSI systems—these are best exemplified
by CMOS, Bipolar, BiCMOS, and Gallium Arsenide [10, 16]. CMOS technol-
ogy, in particular, exhibits many desirable performance characteristics, such
as low power consumption, high density, ease of design and moderate to high
speed. Due to these excellent performance characteristics, CMOS technology
has become the dominant VLSI technology used today.
The design of a digital VLSI system requires a great deal of effort when
considering a broad range of architectural and logic issues, such as choosing
the appropriate gates and interconnections among these gates to achieve the
required circuit function. No design is complete, however, without considering
the dynamic (or transient) characteristics of the signal propagation or, alter-
natively, the changing behavior of the signals with time. Every computation
performed by a switching circuit involves multiple signal transitions between
the logic states, each transition requiring a finite amount of time to com-
plete. The voltage at every circuit node must reach a specific value for the
computation to be completed. Therefore, state-of-the-art integrated circuit
design is largely centered around the difficult task of predicting and properly
interpreting signal waveform shapes at various points within a circuit.
In a typical VLSI system, millions of signal transitions occur, such as
those shown in Figures 2.3 and 2.4, which determine the individual gate de-
lays and the overall speed of the system. Some of these signal transitions can
be executed concurrently while others must be executed in a strict sequential
order [17]. The sequential occurrence of the latter operations—or signal tran-
sition events—must be carefully coordinated in time so that logically correct
system operation is guaranteed and the results are reliable (in the sense that
these results can be repeated). This coordination is known as synchronization
and is critical to ensuring that any pair of logical operations in a circuit with
a precedence relationship proceed in the proper order. In modern digital inte-
grated circuits, synchronization is achieved at all stages of the system design
process and system operation by a variety of techniques, known as a timing
discipline or timing scheme [10, 18, 19, 20]. With few exceptions, these
circuits are based on a fully synchronous timing scheme, specifically
developed to cope with the finite speed required by the physical signals to
propagate throughout a system.

4 Such quantities as the electrical voltages and currents in electronic devices.
A fully synchronous system is most frequently modeled as a finite-state
machine as shown in Figure 2.5. As illustrated in Figure 2.5, there are three
[Figure 2.5: input data and output data connected through a combinational
logic block (computation), with a clock signal providing synchronization.]
Fig. 2.5. A finite-state machine (FSM) model of a synchronous system.
a new switching process. As time proceeds, the signals propagate through the
logic, generating results at the logic output. By the end of the clock period,
these results are stored in the registers and are operated upon during the
following clock cycle.
[Figure: a local data path in which data propagates from register Ri through
combinational logic to register Rf, both driven by the clock; signal activity
occurs at the end of the clock period.]
the design process refers to the activity in which a concept and a set of spec-
ifications are converted into an actual integrated circuit.
A view of the VLSI design process—also known as a design flow—is illus-
trated in Figure 2.7 magnifying the clock distribution network design process.
This flow is typical in the design of high-volume, Application-Specific Inte-
grated Circuits (ASICs). The sequence of steps in this design flow is from
top to bottom and follows the direction of the arrows as shown in Figure 2.7.
As previously mentioned, the design process often starts with loosely defined
behavioral and architectural specifications, as well as with design constraints
such as physical dimensions, cost, power supply voltage, operational temper-
ature and so on. Architectural specifications are refined and coded into a
Hardware Description Language (HDL) which forms the basis for the actual
synthesis process. The HDL descriptions are also useful in performing simu-
lations to verify the desired circuit function.
The synthesis process is performed by software-based synthesis tools which
compile the HDL descriptions into an equivalent logic schematic of a circuit—
each logic gate in this schematic has been predesigned and is available to
the synthesis tool as a library element. After the circuit synthesis process is
completed, the resulting logic and register circuit structures are symbolically
placed to form the integrated circuit. Wire routing among the circuit struc-
tures is performed next to connect the inputs and outputs of the logic gates
as well as to deliver the clock signal to each of the clocked registers within the
circuit. A variety of verification and simulation procedures are also performed
to ensure the correct functionality and timing of the integrated circuit. Among
these procedures is a timing verification step which includes the analysis of
the data and clock signal delays to ensure correct temporal operation.
The body of research presented in this monograph deals with certain as-
pects of the timing of VLSI-based digital circuits, particularly those topics
related to the clock distribution network. The timing optimization algorithms
presented in Chapters 5, 6 and 7 are integrated into the design flow at the
step called Clock Planning, shown shaded in Figure 2.7. As indicated in Fig-
ure 2.7, Clock Planning includes clock scheduling and the design of both the
topology and the circuit structure of the clock tree5 . The timing information
describing the signal delays obtained from the Placement of Logic and Regis-
ters step is used in the clock planning process. Specifically, both the maximum
and minimum data path delays are used in the clock skew scheduling process.
The entire chip verification process is not considered complete until the tim-
ing verification is satisfied after the detailed chip routing has been completed
and all physical impedance characteristics have been back annotated and an-
alyzed with accurate timing analysis tools [8, 9, 21]. Several iterations of the
Clock Planning may be required in order to satisfy the entire chip verification
process.
5 A clock tree is another term for describing the clock distribution network.
[Figure 2.7 (excerpt): delay information feeds the Clock Planning step (clock
scheduling and clock tree topology), which is followed by clock verification.]
2.4 Summary
The behavior of a fully synchronous system is well defined and controllable
as long as the time window provided by the clock period is sufficiently long
to allow every signal in the circuit to propagate through the required logic
gates and interconnect wires and successfully latch into the final register of
each local data path. In designing the system and choosing the proper clock
period, however, two contradictory requirements must be satisfied. First, the
smaller the clock period, the more computational cycles can be performed by
the circuit in a given amount of time. Alternatively, the time window defined
by the clock period must be sufficiently long so that the slowest signals reach
the destination registers before the current clock cycle is concluded and the
following clock cycle is initiated.
This strategy for organizing the computational process has certain clear
advantages that have made a fully synchronous timing scheme the primary
choice for digital VLSI systems:
• The properties and variations are well understood.
• The nondeterministic behavior of the propagation delay of the combina-
tional logic (due to environmental and process fluctuations and the un-
known input signal pattern) is eliminated such that the system as a whole
has a completely deterministic behavior corresponding to the implemented
algorithm. As long as the data signal is successfully captured inside the
register before the arrival of the next clock signal, the timing characteris-
tics of the system are completely known.
• The circuit design process does not need to be concerned with glitches
in the combinational logic outputs. Therefore, the only relevant dynamic
timing characteristic of the logic is the propagation delay.
• The state of the system is completely defined within the storage elements—
this characteristic greatly simplifies certain aspects of the design, debug
and test phases when developing a large synchronous digital system.
However, the synchronous paradigm also has certain limitations that make
the design of a synchronous VLSI system increasingly challenging:
• This synchronous approach has a serious drawback in that it requires the
overall circuit to operate as slowly as the slowest register-to-register
path. Thus, the global speed of a fully synchronous system depends upon
those data paths with the largest delays—these paths are also
known as the worst case or critical paths. In a typical VLSI system, the
propagation delays in the combinational paths are distributed unevenly so
there may be many paths with delays much smaller than the clock period.
Although these paths could operate at a lower clock period—or higher
clock frequency—it is these critical paths that bound the minimum clock
period, thereby imposing a limit on the overall system speed (or clock fre-
quency). This imbalance in propagation delays is sometimes so dramatic
that the system speed is dictated by only a handful of very slow paths.
• The clock signal has to be distributed to tens of thousands of storage
registers scattered throughout the system. Therefore, a significant portion
of the system area and dissipated power is devoted to the clock distribution
The delay of a signal propagating from one point within a circuit to another
point is caused by both the active electronic devices (transistors) in the logic
elements and the various passive interconnect structures connecting the logic
gates. While the physical principles behind the operation of transistors and
interconnect are well understood at the current-voltage (I-V ) level, it is often
computationally difficult to directly apply this detailed information to the
densely packed multi-million transistor DSM integrated circuits of today.
A general form of a circuit with N input and M output terminals (labeled
x1 , . . . , xN and y1 , . . . , yM , respectively) is shown in Figure 3.1(a). The box
labeled ‘CIRCUIT’ may represent a simple wire, a transistor, a logic gate con-
sisting of several transistors, or an arbitrarily complex combination of these
elements. The logic schematic outlined in Figure 3.1(b), for example, may
correspond to a portion of the circuit between points X and Y shown in
Figure 3.1(a). With the choice of logic circuit illustrated in Figure 3.1(b), a
logically possible signal activity at the circuit points X, Y, and Z is shown
in Figure 3.2. The dynamic characteristics and temporal relationships of the
signal transitions are described and formalized in Definitions 3.1, 3.2, and 3.3.
[Figure 3.1: (a) abstract representation of a circuit with input terminals
x1, ..., xN and output terminals y1, ..., yM, containing internal points X
and Y; (b) a logic schematic with points X, Y, and Z.]
Definition 3.1. If X and Y are two points in a circuit and sX and sY are
the signals at X and Y, respectively, the signal propagation delay tPD_XY from
X to Y is defined¹ as the time interval from the 50% point of the signal
transition of sX to the 50% point of the signal transition of sY.
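As an illustration, the 50%-to-50% delay of Definition 3.1 can be extracted mechanically from sampled waveforms. The following sketch (Python, not part of the original text; the waveform samples are invented for illustration) locates the 50% crossings by linear interpolation:

```python
# Sketch: measuring the 50%-to-50% propagation delay of Definition 3.1 from two
# sampled waveforms (linear interpolation between samples; illustrative data).
def crossing_time(times, values, level):
    """Return the first time the waveform crosses `level` (linear interpolation)."""
    for (t0, v0), (t1, v1) in zip(zip(times, values), zip(times[1:], values[1:])):
        if (v0 - level) * (v1 - level) <= 0 and v0 != v1:
            return t0 + (level - v0) * (t1 - t0) / (v1 - v0)
    raise ValueError("waveform never crosses the level")

V_dd = 1.0
t = [0.0, 1.0, 2.0, 3.0, 4.0]
s_x = [0.0, 1.0, 1.0, 1.0, 1.0]   # input: rises between t = 0 and t = 1
s_y = [1.0, 1.0, 0.5, 0.0, 0.0]   # output: falls between t = 1 and t = 3

# tPD_XY: time between the 50% points of s_x and s_y
t_pd = crossing_time(t, s_y, 0.5 * V_dd) - crossing_time(t, s_x, 0.5 * V_dd)
assert t_pd == 1.5
```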
This formal definition of the propagation delay is related to the concept
that ideally, the switching point of a logic gate is at the 50% level of the output
waveform. Thus, 50% of the maximum output signal level is assumed to be
the boundary point where the state of the gate switches from one binary logic
state to the other binary logic state. Practically, a more physically correct def-
inition of propagation delay is the time from the switching point of the driving
circuit to the switching point of the driven circuit. Currently, however, this
switching point-based reference for signal delay is not widely used in practical
computer-aided design applications because of the computational complexity
of the algorithms and the increased amount of data required to estimate the
delay of a path based on information describing the signal waveform shape.
Therefore, choosing the switching point at 50% has become a generally ac-
ceptable practice for referencing the propagation delay of a switching element.
Also note that the propagation delay tPD as defined in Definition 3.1
is mathematically additive, thereby permitting the delay between any two
points X and Y to be determined by summing the delays through consecutive
structures between X and Y.

¹ Although the delay can be defined from any point X to any other point Y, the
points X and Y typically correspond to an input and an output of a logic gate,
respectively. In such a case, the signal delay from X to Y is the propagation delay
of the gate.

Fig. 3.2. Signal waveforms for the circuit shown in Figure 3.1(b). The delays add:
tPD_XY = tPD_XZ + tPD_ZY, with tPD_XZ = tPLH_XZ and tPD_ZY = tPHL_ZY.

From Figures 3.1(b) and 3.2, for example,
tPD_XY = tPD_XZ + tPD_ZY. However, this additivity property must be applied
with caution since neither of the switching points of consecutively connected
gates may occur at the 50% level. In addition, passive interconnect struc-
tures along signal paths do not exhibit switching properties although physical
signals propagate through these structures with finite speed (more precisely,
through signal dispersion). Therefore, if the properties of a signal propagat-
ing through a series connection of logic gates and interconnections are being
evaluated, an analysis of the entire signal path composed of gates and wires—
rather than adding 50%-to-50% delays—is necessary to avoid accumulating
significant error in the path delay.
In high performance CMOS VLSI circuits, logic gates often switch before
the input signal completes a transition.² This difference in switching speed
may be sufficiently large such that an output signal of a gate will reach the 50%
point before the input signal reaches the 50% point. If this is the case, tP D as
defined by Definition 3.1 may have a negative value. Consider, for example, the
inverter connected between nodes X (inverter input) and Z (inverter output)
shown in Figure 3.1(b). The specific input and output waveforms for this
² Also, a gate may have asymmetric signal paths, whereby a gate would switch
faster in one direction than in the other direction.
Fig. 3.3. Signal waveforms for the inverter in the circuit shown in Figure 3.1(b).
inverter are shown in detail in Figure 3.3. When the input signal sX makes
a high-to-low transition, the output signal sZ makes a low-to-high transition
(and vice versa). In this specific example, the low-to-high transition of the
signal sZ crosses the 50% signal level after the high-to-low transition of the
signal sX . Therefore, the signal delay tP LH (the signal name index is omitted
for clarity) is positive as shown by the direction of the arrow in Figure 3.3—
coinciding with the positive direction of the x-axis. However, when the input
signal sX makes a low-to-high transition, the output signal sZ makes a faster
high-to-low transition and crosses the 50% signal level before the input signal
sX crosses the 50% signal level. The signal delay tP HL in this case is negative
as shown by the direction of the arrow in Figure 3.3—coinciding with the
negative direction of the x-axis. This phenomenon can occur in circuits with
slow input signal transitions and fast output signal transitions, demonstrating
a weakness in the 50% delay definition commonly used today throughout
industry.
The possible asymmetry of the switching characteristics of a logic gate—
as illustrated by the waveforms shown in Figure 3.3—requires the ability to
discriminate between the values of the propagation delay in the two differ-
ent switching situations (a low-to-high or a high-to-low transition). One sin-
gle value of the propagation delay tPD—as defined in Definition 3.1—does
not provide sufficient information about possible asymmetry in the switching
characteristics of the gate.

³ MOSFET ≡ Metal-Oxide-Semiconductor Field Effect Transistor
3.2 Devices and Interconnections
the properties of both active devices and interconnections are discussed from
the perspective of circuit performance.
An N-channel enhancement mode MOSFET transistor (NMOS) is de-
picted in Figure 3.4. Note that in most digital applications, the substrate
(labeled base in Figure 3.4) is connected to the source of the transistor.

Fig. 3.4. An N-channel enhancement mode MOS transistor.
Fig. 3.5. A CMOS inverter composed of a PMOS device Q1 and an NMOS
device Q2.
Table 3.1. Terminal voltages for the P-channel and N-channel transistors in a CMOS
inverter circuit.

        Q1 (PMOS)           Q2 (NMOS)
Vgs     Vgsp = Vi − VDD     Vgsn = Vi
Vgd     Vgdp = Vi − Vo      Vgdn = Vi − Vo
Vds     Vdsp = Vo − VDD     Vdsn = Vo
are illustrated in Figure 3.6 depending upon the values of Vi and Vo . Re-
ferring to Figure 3.6 may be helpful in understanding the switching process
of a CMOS inverter. Methods for determining the values of the fall time
tf and the propagation delay tP HL are described in this section. Similarly,
closed form expressions are derived for the rise time tr and the propagation
delay tP LH .
Fig. 3.6. Operating mode of a CMOS inverter depending upon the input and output
voltages. (Note that the abbreviation ‘sat’ stands for the saturation region.) The
(Vi, Vo) plane is divided into regions I through VII according to the operating modes
of Q1 and Q2.
Fig. 3.8. Operating point trajectory of a CMOS inverter, through the regions of
Figure 3.6, for different input waveforms (only the rising input signal is shown).
\[ V_o(t_2) = V_{dd} - V_{tn} \quad\text{for}\quad t_2 = \frac{2\,C_L\,V_{tn}}{\beta_n\,(V_{dd}-V_{tn})^2} = \frac{2\eta}{\gamma_n(1-\eta)} . \tag{3.7} \]
A closed form expression for the output voltage Vo (t) for time t ≥ t2 is
obtained by solving (3.8), a Bernoulli equation, with the initial condition
Vo (t2 ) = Vdd − Vtn :
\[ V_o(t) = V_{dd}\,\frac{2(1-\eta)}{1 + e^{\gamma_n (t-t_2)}} \qquad\text{for } t \ge t_2 . \tag{3.9} \]
The values of t1 from (3.6) and t3 and t4 from (3.9) are ([10, 15, 23])
\[ t_1 = \frac{0.2}{\gamma_n(1-\eta)}, \qquad t_3 = \frac{1}{\gamma_n}\left[\frac{2\eta}{1-\eta} + \ln(3-4\eta)\right], \qquad t_4 = \frac{1}{\gamma_n}\left[\frac{2\eta}{1-\eta} + \ln(19-20\eta)\right]. \tag{3.10} \]
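The crossing times in (3.10) follow directly from (3.9). A small numerical check (Python; the values of Vdd, Vtn, βn, and CL are illustrative assumptions, and γn = βn(Vdd − Vtn)/CL is the rate constant implied by equating the two forms of t2 in (3.7)):

```python
import math

# Illustrative device and load parameters (assumptions for this sketch)
V_dd, V_tn = 3.3, 0.6      # supply and threshold voltage [V]
beta_n = 1e-3              # transconductance parameter [A/V^2]
C_L = 100e-15              # load capacitance [F]

eta = V_tn / V_dd
gamma_n = beta_n * (V_dd - V_tn) / C_L   # implied by equating the forms in (3.7)

# t2 from (3.7): end of the constant-current (saturation) segment
t2 = 2 * eta / (gamma_n * (1 - eta))

def v_out(t):
    """Output voltage for t >= t2, from (3.9)."""
    return V_dd * 2 * (1 - eta) / (1 + math.exp(gamma_n * (t - t2)))

# Closed form crossing times from (3.10)
t3 = (2 * eta / (1 - eta) + math.log(3 - 4 * eta)) / gamma_n    # 50% point
t4 = (2 * eta / (1 - eta) + math.log(19 - 20 * eta)) / gamma_n  # 10% point

assert abs(v_out(t3) - 0.5 * V_dd) < 1e-9
assert abs(v_out(t4) - 0.1 * V_dd) < 1e-9
```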
The rise time tr and propagation delay tP LH are determined from the switch-
ing process illustrated in Figure 3.9 (similarly to tf and tP HL derived earlier
in this section). Assume that the input signal Vi has been held at logic high
(Vi = Vdd ) for a sufficiently long time such that the capacitor CL is fully
discharged to Vo = 0. The operating point of the inverter is point D shown
in Figures 3.6 and 3.8. At time t0 = 0, the input signal abruptly switches
to a logic low. Since the voltage on CL cannot change instantaneously, the
operating point is forced at point E. At E, the device Q2 is cut off while Q1
is conducting, thereby permitting CL to begin charging through Q1 . As this
charging process develops, the operating point moves up the line EA towards
point A at which point CL is fully charged, i.e., Vo (A) = Vdd . Note that during
the interval 0 ≤ t < t2 , the operating point is between E and F and the device
Fig. 3.9. Low-to-high output transition for a step input signal (Vi(t) = Vdd for
t < 0 and Vi(t) = 0 for t ≥ 0).
Note that in (3.11) and (3.15), the fall and rise times, respectively, are the
product of the term CL /β, and another process dependent term (a function
composed solely of Vdd and Vt ). These relationships imply that for a given
manufacturing process, improvements in the individual gate delays are possi-
ble by reducing the load impedance CL or by increasing the current gain of
the transistors. Increasing the current gain (higher β) is possible either by uti-
lizing a more advanced technology or by controlling certain physical qualities
of the transistor (the specific physical layout). In the latter case, increasing β
of the devices (recall that β ∝ W/L) is typically accomplished by controlling
the value of W —a process known as transistor or gate sizing⁵ [24, 25, 26].
Transistor sizing, however, has limits—area requirements may limit the max-
imum channel width W, and increasing W will also increase the input load
capacitance of the previous gates.
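The trade-off noted above can be seen in a first-order model where each stage delay is proportional to CL/β and the input capacitance grows with W. The sketch below (Python; all coefficients are invented illustrative values, not process data) sizes the second gate of a two-stage chain:

```python
# First-order sizing sketch: delay ~ C_load / beta (up to a fixed process factor),
# while widening a gate increases the load presented to the previous stage.
C_L = 200e-15        # final load capacitance [F] (assumed)
beta_per_um = 5e-4   # current gain per micron of channel width [A/V^2/um] (assumed)
c_in_per_um = 2e-15  # input capacitance per micron of width [F/um] (assumed)

def stage_delay(c_load, beta):
    return c_load / beta   # omitting the common f(Vdd, Vt) process factor

def chain_delay(w2):
    """Gate 1 (width 1 um) drives gate 2 (width w2); gate 2 drives C_L."""
    beta1 = beta_per_um * 1.0
    beta2 = beta_per_um * w2
    return stage_delay(w2 * c_in_per_um, beta1) + stage_delay(C_L, beta2)

# Widening gate 2 speeds up stage 2 but loads stage 1 more heavily
assert chain_delay(4.0) < chain_delay(1.0)
assert chain_delay(100.0) > chain_delay(10.0)   # oversizing eventually hurts
```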
The ideal step input waveform used in the derivation of the delay expressions
presented in Section 3.2.1 is a physical abstraction. Such an ideal waveform
does not practically exist, although it can be used to simplify the analysis
presented in Section 3.2.1. Note that despite ideally fast input waveforms, the
output signal of a CMOS logic gate has a finite slope, thereby contributing
to the gate delay. In a practical VLSI integrated circuit, both the input and
output signals have a non-zero rise and fall time caused by the impedances
along any signal path. Fast input waveforms can be effectively considered as
⁵ Typically, device channel length is chosen to be the minimum geometry permitted
by the technology and therefore cannot be decreased to further increase β.
step inputs. The delay expressions derived in (3.11) and (3.15) model the de-
lays for such cases with reasonable accuracy. Slow input waveforms, however,
contribute significantly to the overall delay of the charge/discharge path in a
gate [8, 10, 15, 23], making the delay expressions presented in Section 3.2.1
less accurate.
Furthermore, it is considerably more difficult to derive closed form delay
expressions for non-step input waveforms. Consider, for example, the deriva-
tion of the fall time of the inverter shown in Figure 3.5 assuming a non-ideal
input, such as the linear ramp signal sA depicted in Figure 2.3. Referring
to Figure 3.8, the trajectory of the operating point relating Vi and Vo for a
non-ideal (non-step) input is as shown in the diagram on the right. This tra-
jectory is a curve passing through regions I, II, III, and IV,⁶ and down the
line C → C → D, rather than the two straight-line segments A → B and
B → C → D (as shown in the diagram on the left). Therefore, calculating an
exact expression for tf in this case requires separately evaluating the delay
for all five portions of the output Vo —one for each region.
An analysis of the CMOS inverter shown in Figure 3.5 with a non-step
input signal, as well as the respective delay expressions, can be found in [23].
Consider, for example, a linear ramp input described by
\[ V_i(t) = \begin{cases} 0, & t < 0 \\ V_{dd}\,\dfrac{t}{t_{ri}}, & 0 \le t < t_{ri} \\ V_{dd}, & t \ge t_{ri} \end{cases} \tag{3.17} \]
where tri is the rise time of the input voltage signal Vi (t). For the case de-
picted in the upper diagram shown in Figure 3.8, the total propagation delay
tP HLramp at the 50 % level [23] is given by
\[ t_{PHL_{ramp}} = \frac{1}{6}\,(1 + 2\eta)\,t_{ri} + t_{PHL_{step}} , \tag{3.18} \]
where tP HLstep is the propagation delay time for a step input given by (3.12).
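Equation (3.18) is straightforward to apply. A minimal sketch (Python; η and the step-delay value are illustrative assumptions):

```python
# Sketch of (3.18): propagation delay for a linear-ramp input (illustrative values).
eta = 0.2             # V_tn / V_dd (assumed)
t_PHL_step = 50e-12   # step-input delay from (3.12) [s] (assumed)

def t_PHL_ramp(t_ri):
    """50% high-to-low delay for a ramp input with rise time t_ri, per (3.18)."""
    return (1 + 2 * eta) * t_ri / 6 + t_PHL_step

# A slower input edge adds (1 + 2*eta)/6 of its rise time to the step delay
assert t_PHL_ramp(0.0) == t_PHL_step
assert abs(t_PHL_ramp(60e-12) - (14e-12 + 50e-12)) < 1e-15
```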
Note that the ramp input described by (3.17) is also an idealization in-
tended to simplify analysis. In a practical integrated circuit, the input wave-
form to the inverter is not a linear ramp but rather the output waveform
of another gate within the circuit. For such an input signal—also known as
a characteristic input [23]—it is preferable to regard the propagation delay
through the inverter gate shown in Figure 3.5 as a function of the CL /β ra-
tio of the preceding gate or, equivalently, as a function of the step response
delay of the preceding stage [23]. This type of direct analytical solution—by
breaking the output waveform into regions depending upon the trajectory of
the operating point—is further complicated for those gates with more than
one input arriving at an arbitrary time and with arbitrary waveforms. Due to
⁶ I, II, III, IV, and V for slower input signals.
Channel-Length Modulation
A MOSFET device modeled by (3.1) has an infinite output resistance in sat-
uration and acts as a voltage-controlled current source. Recall the linear por-
tion of the falling/rising output waveforms from the analyses described in
Section 3.2.1. The device acts as a current source since the drain current Idsn is
completely independent of the voltage Vdsn in the saturation region [see (3.1)].
This independence, however, is an idealization that does not consider the ef-
fect of the voltage Vdsn on the shape of the channel. In practice, as Vdsn
increases beyond Vgsn − Vtn (such that Vgdn < Vtn or Vgdp > Vtp for a PMOS
device), the channel pinch-off point moves towards the source. Therefore, due
to an effect known as channel-length modulation, the effective channel length
is reduced [10, 15, 22, 29].
To analytically account for channel-length modulation, an expression for
the current of a MOS transistor operating in the saturation region is modified
as follows:
\[ I_{dsn} = \frac{1}{2}\,\beta_n\,(V_{gsn} - V_{tn})^2\,(1 + \lambda_n V_{dsn}) . \tag{3.19} \]
The additional factor (1 + λn Vdsn ) in (3.19) describes the finite device output
resistance ∂Vdsn /∂Idsn = 2(Vgsn − Vtn )−2 /(λn βn ) when the transistor oper-
ates in the saturation region. The output waveform deteriorates due to the
degradation of the transfer characteristic of the inverter.
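The finite output resistance follows from differentiating (3.19) with respect to Vdsn. A quick numerical check (Python; the parameter values are illustrative assumptions):

```python
# Sketch of (3.19): saturation current with channel-length modulation
# (all parameter values below are illustrative assumptions).
beta_n = 1e-3    # [A/V^2]
V_tn = 0.6       # [V]
lambda_n = 0.05  # channel-length modulation coefficient [1/V]

def i_dsn_sat(v_gsn, v_dsn):
    return 0.5 * beta_n * (v_gsn - V_tn) ** 2 * (1 + lambda_n * v_dsn)

# Finite output resistance: dV_dsn/dI_dsn = 2 (V_gsn - V_tn)^-2 / (lambda_n * beta_n)
v_gsn = 3.3
r_out = 2 / ((v_gsn - V_tn) ** 2 * lambda_n * beta_n)

# Compare with a finite-difference estimate of the slope around V_dsn = 2 V
dv = 1e-6
g_out = (i_dsn_sat(v_gsn, 2.0 + dv) - i_dsn_sat(v_gsn, 2.0)) / dv
assert abs(1 / g_out - r_out) / r_out < 1e-6
```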
Velocity Saturation
In (3.20) and (3.21), α is the velocity saturation index, VD0 is the drain
saturation voltage for Vgsn = Vdd , and ID0 is the drain saturation current
for Vgsn = Vdsn = Vdd . A typical value of the velocity saturation index for a
short-channel device is 1 ≤ α ≤ 2, where (3.21) is the same as (3.1) for α = 2.
Analytical solutions for the output voltage of a CMOS inverter with a
purely capacitive load CL for a step, linear ramp and exponential input wave-
forms can be found in [31]. Closed form expressions for the delay of a CMOS
⁷ Short-channel MOS devices in general.
inverter as shown in Figure 3.5 under the α-power law model are given in [30]
and are repeated below:
\[ \eta = \frac{V_{tn}}{V_{dd}} \qquad\text{and}\qquad t_{PHL} = t_{PLH} = \left(\frac{1}{2} - \frac{1-\eta}{1+\alpha}\right) t_T + \frac{C_L\,V_{dd}}{2\,I_{D0}} . \tag{3.22} \]
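A minimal evaluation of (3.22) (Python; every parameter value below is an illustrative assumption rather than data from the text):

```python
# Sketch of the alpha-power law delay in (3.22) (illustrative parameter values).
V_dd, V_tn = 2.5, 0.5   # [V] (assumed)
alpha = 1.3             # velocity saturation index, 1 <= alpha <= 2 (assumed)
C_L = 100e-15           # load capacitance [F] (assumed)
I_D0 = 2e-4             # drain saturation current at V_gsn = V_dsn = V_dd [A] (assumed)
t_T = 40e-12            # input transition time [s] (assumed)

eta = V_tn / V_dd
t_PHL = (0.5 - (1 - eta) / (1 + alpha)) * t_T + C_L * V_dd / (2 * I_D0)

# The load-dependent term C_L * V_dd / (2 * I_D0) dominates for large loads
assert t_PHL > C_L * V_dd / (2 * I_D0)
```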
The analysis of the CMOS gate delay as described in Section 3.2.1 is based on
the assumption that the load of the inverter shown in Figure 3.5 is a purely
capacitive load (C). This assumption is generally true for logic gates placed
physically close to each other. In a multi-million transistor VLSI system, how-
ever, certain connected logic gates may be relatively far from each other. In
this situation, the impedance of the interconnect wires cannot be considered
as being purely capacitive but rather as being resistive-capacitive (RC). An
important type of global circuit interconnect structure where the gates can
be very far apart is the clock distribution network [9, 32].
On-chip interconnect has become a major concern due to the high resis-
tance of the interconnect which can limit overall circuit performance. These
interconnect impedances have become significant as the minimum line dimen-
sions have been scaled down into the deep submicrometer regime while the
overall chip dimensions have increased. Perhaps the most important conse-
quence of these trends of scaling transistor and interconnect dimensions and
increasing chip sizes is that the primary source of signal propagation delay
has shifted from the active transistors to the passive interconnect lines. There-
fore, the nature of the load impedance has shifted from a lumped capacitance
to a distributed resistance-capacitance, thereby requiring new qualitative and
quantitative interpretations of the signal switching processes.
To illustrate the effects of scaling, consider ideal scaling [8] where devices
are scaled down by a factor of S (S > 1) and chip sizes are scaled up by a
factor of Sc (Sc > 1). The delay of the logic gates decreases by 1/S while
the delay due to the interconnect increases by S²Sc² [8, 33]. Therefore, the
ratio of interconnect delay to gate delay after ideal scaling increases by a
factor of S³Sc². For example, if S = 4 (corresponding to scaling down from
a 2 μm CMOS technology to a 0.5 μm CMOS technology) and Sc = 1.225
(corresponding to the chip area increasing by 50%), the ratio of interconnect
delay to gate delay increases by a factor of 4³ × 1.225² ≈ 96 times.
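The scaling arithmetic can be checked mechanically (Python; S and Sc follow the example above):

```python
# Sketch of the ideal-scaling argument: devices scale down by S, chip size up by Sc.
S, Sc = 4.0, 1.225   # example values from the text

gate_delay_factor = 1 / S                 # gate delay decreases by 1/S
interconnect_delay_factor = S**2 * Sc**2  # interconnect delay increases by S^2 * Sc^2

# Ratio of interconnect delay to gate delay grows by S^3 * Sc^2
ratio_growth = interconnect_delay_factor / gate_delay_factor
assert abs(ratio_growth - S**3 * Sc**2) < 1e-9
```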
Table 3.2. Closed form expressions for the signal delay of the CMOS inverter shown
in Figure 3.5 driving an RC load. An ideal step input signal (Vi (t) transitioning from
high to low) is assumed.
The delay values listed in Table 3.2 are graphically illustrated in Fig-
ure 3.10 [34]. Two waveforms describing the output of a CMOS inverter (shown
in Figure 3.5) for an input signal making a high-to-low transition are shown
in Figure 3.10. These two waveforms are based on the assumption that the
RC load of the CMOS inverter is distributed and lumped, respectively.
Furthermore, assuming an on-resistance Rtr of the driving transistor [33],
the interconnect delay Tintc can be characterized by the following expres-
sion [34],
The on-resistance of the driving transistor Rtr in (3.23) and (3.24) can be
approximated [33] by
\[ R_{tr} \approx \frac{1}{\beta\,V_{DD}} , \tag{3.25} \]
where the term β in (3.25) is the current gain of the driving transistor oper-
ating in the saturation region [see (3.2)].
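The bodies of (3.23) and (3.24) are not reproduced in this excerpt. As a rough stand-in of the same form, a commonly used first-order estimate combines the driver on-resistance of (3.25) with the distributed line delay; the sketch below (Python; a Sakurai-style approximation with invented values, not the exact expressions of the text) shows the structure of such an estimate:

```python
# A widely used first-order estimate for the 50% delay of a driver with
# on-resistance R_tr driving a distributed RC line (total R_int, C_int) into a
# load C_L. This is a Sakurai-style approximation, NOT equations (3.23)-(3.24).
def t_50(r_tr, r_int, c_int, c_load):
    return 0.377 * r_int * c_int + 0.693 * (r_tr * c_int + (r_tr + r_int) * c_load)

# Illustrative values (assumptions): a long on-chip line
r_tr = 1 / (1e-3 * 2.5)   # R_tr ~ 1/(beta * V_DD) from (3.25)
delay = t_50(r_tr, r_int=500.0, c_int=300e-15, c_load=50e-15)

# The distributed-line term alone is a lower bound on the total delay
assert delay > 0.377 * 500.0 * 300e-15
```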
Fig. 3.10. Graphical illustration of the RC signal delay expressions listed in Ta-
ble 3.2 (from [34]). The output waveforms of a CMOS inverter are shown for both
a distributed and a lumped RC load.
Table 3.3. Circuit network to model a distributed RC line with a maximum error of 3%
(from [35]). The notations Π, T and L correspond to a Π, T and L impedance model,
respectively. The notations R and C correspond to a single lumped resistance and
capacitance, respectively. The notation N means that the interconnect impedance
can be ignored.

CT \ RT   0    0.01  0.1   0.2   0.5   1     2     5     10    20    50    100
0         Π3   Π3    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.01      Π3   Π3    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.1       T2   T2    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.2       T2   T2    Π2    Π2    Π1    Π1    Π1    Π1    Π1    C     C     C
0.5       T1   T1    T1    T1    Π1    Π1    Π1    Π1    Π1    C     C     C
1         T1   T1    T1    T1    Π1    Π1    Π1    Π1    Π1    C     C     C
2         T1   T1    T1    T1    Π1    Π1    Π1    Π1    L1    L1    C     C
5         Π1   Π1    Π1    Π1    Π1    Π1    Π1    L1    L1    L1    C     C
10        Π1   Π1    Π1    Π1    Π1    Π1    L1    L1    L1    L1    C     C
20        R    R     R     R     R     R     L1    L1    L1    L1    C     C
50        R    R     R     R     R     R     R     R     R     R     C     N
100       R    R     R     R     R     R     R     R     R     R     N     N
delay are required in order to guarantee that the circuit will operate correctly.
Furthermore, certain signal delays within a circuit may need to be decreased
so as to meet specific performance goals.
A variety of different techniques have been developed to improve the sig-
nal delay characteristics depending upon the type of load and other circuit
parameters. Among the most important techniques are:
• Gate sizing to increase the output current drive capability of the transistors
along a logic chain [24, 25, 26]. Gate sizing must be applied with caution,
however, because of the resulting increase in area and power dissipation,
and, if incorrectly applied, increase in delay.
• Tapered buffer circuit structures are often used to drive large capacitive
loads (such as at the output pad of a chip) [17, 36, 37, 38, 39, 40, 41]. A
series of CMOS inverters such as the circuit shown in Figure 3.5 can be
cascaded where the output drive of each buffer is increased by a constant
(or variable) tapering factor.
• The use of repeater circuit structures to drive resistive-capacitive (RC)
loads. Unlike tapered buffers, repeaters are typically CMOS inverters of
uniform size (drive capability) that are inserted at uniform intervals along
an interconnect line [8, 42, 43, 44, 45, 46, 47].
• A different timing discipline such as asynchronous timing [17, 48, 49].
Unlike fully synchronous circuits, the order of execution of logic opera-
tions in an asynchronous circuit is not controlled by a global clock signal.
Therefore, the temporal operation of asynchronous circuits is essentially
The general structure and principles for operating a fully synchronous dig-
ital VLSI system are described in Chapter 2. The combinational logic and
the storage elements make up the computational circuitry used to implement
a specific synchronous system. The clock distribution network provides the
time reference for the storage elements—or registers—thereby enforcing the
required logical order of operations. This time reference consists of one or
more clock signals that are delivered to each and every register within the in-
tegrated circuit. These clock signals control the order of computational events
by controlling the exact times the register data input signals are sampled.
As shown in Chapter 3, the data signals are inevitably delayed as these sig-
nals propagate through the logic gates and along interconnections within the
local data paths. These propagation delays can be evaluated within a certain
accuracy and used to derive timing relationships among the signals within a
circuit. In this chapter, the properties of commonly used types of registers
and their local timing relationships for different types of local data paths are
described. After discussing registers in general in Section 4.1, the properties of
level-sensitive registers (latches) and the significant timing parameters char-
acterizing these registers are reviewed in Sections 4.2 and 4.3, respectively.
Edge-sensitive registers (flip-flops) and the timing parameters are analyzed
in Sections 4.4 and 4.5, respectively. Properties and definitions related to
the clock distribution network are reviewed in Section 4.6. The mathemat-
ical foundation for analyzing timing violations in flip-flops and latches for
single-phase operation, and latches for multi-phase operation are discussed in
Sections 4.7, 4.8 and 4.9, respectively, followed by some final comments in
Section 4.10.
Fig. 4.1. A register with its data input and data output signals.

The signals of a register can be divided into two
groups as shown in Figure 4.1. One group of signals—called the data signals—
consists of input and output signals of the storage element. These input and
output signals are typically connected to the terminals of ordinary logic gates
and may be connected to the data signal terminals of other storage elements.
Another group of signals—identified by the name control signals—are those
signals that control the storage of the data signals in the registers but do not
participate in the logical computation process.
Certain control signals enable the storage of a data signal in a register
independently of the values of any data signals. These control signals are
typically used to initialize the data in a register to a specific well known
value. Other control signals—such as a clock signal—control the process of
storing a data signal within a register. In a synchronous circuit, each register
has at least one clock (or control) signal input.
The two major groups of storage elements (registers) are considered in
the following sections based on the type of relationship that exists among the
data and clock signals of these elements. In latches, it is the specific value or
level of a control signal¹ that determines the data storage process. Therefore,
latches are also called level-sensitive registers. In contrast to latches, a data
signal is stored in flip-flops enabled by an edge of a control signal. For that
reason, flip-flops are also called edge-triggered registers. The timing properties
of latches and flip-flops are described in detail in the following two sections.
¹ This signal is most frequently the clock signal.
4.2 Latches
A latch is a register whose behavior depends upon the value or level of the
clock signal [10, 12, 14, 15, 29, 58, 59, 60]. Therefore, a latch is often referred
to as a transparent latch, a level-sensitive register or a polarity hold latch. A
simple type of latch with a clock signal C and an input signal D is depicted in
Figure 4.2—the output of the latch is typically labeled Q. This type of latch
is also known as a D latch and its operation is illustrated in Figure 4.3.
Fig. 4.2. A simple D latch with a clock signal C, a data input D, and an output Q.
As described in Table 4.1 and illustrated in Figure 4.3, the output signal
of the latch follows the data input signal while the clock signal remains high,
i.e., C = 1 ⇒ Q = D. Thus, the latch is said to be in a transparent state
during the interval t0 < t < t1 as shown in Figure 4.3. When the clock signal
C changes from 1 to 0, the current value of D is stored in the register and the
output Q remains fixed to that value regardless of whether the data signal
D changes. The latch does not pass the input data signal to the output but
rather holds onto the final value of the data signal when the clock signal made
the high-to-low transition. By analogy with the term transparent introduced
above, this state of the latch is called opaque and corresponds to the interval
t1 < t < t2 shown in Figure 4.3 where the input data signal is isolated from
the output port. As shown in Figure 4.3, the clock period is TCP = t2 − t0 .
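The transparent and opaque states can be captured in a short behavioral model. A sketch (Python, not from the text) of an idealized positive D latch with zero internal delays:

```python
class DLatch:
    """Idealized positive D latch: transparent while C = 1, opaque while C = 0."""
    def __init__(self):
        self.q = 0
    def evaluate(self, c, d):
        if c == 1:          # transparent state: output follows the input
            self.q = d
        return self.q       # opaque state: output holds the stored value

latch = DLatch()
assert latch.evaluate(c=1, d=1) == 1   # follows D while the clock is high
assert latch.evaluate(c=0, d=0) == 1   # holds the value stored at the trailing edge
```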
² Or simply a positive latch.
Fig. 4.3. Idealized operation of a level-sensitive register or latch.
The edge of the clock signal that causes the latch to switch to its transpar-
ent state is identified as the leading edge of the clock pulse. In the case of the
positive latch shown in Figure 4.2, the leading edge of the clock signal occurs
at time t0 . The opposite edge direction of the clock signal is identified as the
trailing edge—the falling edge at time t1 shown in Figure 4.3. Note that for
a negative latch, the leading edge is a high-to-low transition and the trailing
edge is a low-to-high transition.
Note: The remaining portion of this chapter and the rest of this monograph
use an extensive notation for various parameters describing the signals and
storage elements.
The latch setup time δS^L = t6 − t5, shown in Figure 4.4, is the minimum time
between a change in the data signal and the trailing edge of the clock signal
such that the new value of D would successfully propagate to the output Q
of the latch and be stored within the latch during the opaque state.
Fig. 4.4. Timing parameters of a latch: the setup time δS^L = t6 − t5, the hold
time δH^L = t7 − t6, the clock-to-output delay DCQ^L, the data-to-output delay
DDQ^L, the width of the clock pulse CW^L, and the clock period TCP.

The latch hold time δH^L is the minimum time after the trailing edge of the
clock signal during which the data signal must remain stable such that the
correct value of D is stored in the latch during the opaque state. This definition
of δH^L assumes that the last change of the value of D has occurred no later
than δS^L before the trailing edge of the clock signal. The term δH^L = t7 − t6
is shown in Figure 4.4.
Note: The latch parameters introduced in Sections 4.3.1 through 4.3.5 are
used to refer to any latch in general or to a specific instance of a latch when
this instance can be unambiguously identified. To refer to a specific instance i
of a latch explicitly, the parameters are additionally shown with a superscript.
For example, DCQ^Li refers to the clock-to-output delay of latch i. Also, adding
m and M to the subscript of any parameter is used to refer to the minimum
and maximum values of that parameter, respectively.
4.4 Flip-Flops
An edge-triggered register or flip-flop is a type of register which, unlike the
latches described in Sections 4.2 and 4.3, is never transparent with respect to
the input data signal [10, 12, 14, 15, 29, 58, 59, 60]. The output of a flip-flop
normally does not follow the input data signal at any time during the register
operation but rather holds onto a previously stored data value until a new
data signal is stored in the flip-flop. A simple type of flip-flop with a clock
signal C and an input signal D is shown in Figure 4.5—similarly to latches,
the output of the flip-flop is typically labeled Q.

Fig. 4.5. A simple D flip-flop with a clock signal C, a data input D, and an output Q.
As shown in the timing diagram in Figure 4.6, the output of the flip-flop
remains unchanged most of the time regardless of the transitions in the data
signal. Only values of the data signal in the vicinity of the storing edge of
the clock signal can affect the output of the flip-flop. Therefore, changes in
the output will only be observed when the currently stored data has a logic
value x and the storing edge of the clock signal occurs while the input data
signal has a logic value of x̄.
Fig. 4.6. Idealized operation of an edge-triggered register or flip-flop.
The flip-flop clock-to-output delay DCQ^F—also known as the clock-
to-Q delay—is the propagation delay from the clock signal terminal to the
output terminal. The value of DCQ^F is defined assuming that the data input
signal has settled to a stable value sufficiently early, i.e., setting the data input
signal any earlier with respect to the latching clock edge will not affect the
value of DCQ^F.
The flip-flop setup time δS^F is shown in Figure 4.7—δS^F = t3 − t2. The pa-
rameter δS^F is defined as the minimum time between a change in the data
signal and the latching edge of the clock signal such that the new value of D
propagates to the output Q of the flip-flop and is successfully latched within
the flip-flop.
Fig. 4.7. Timing parameters of a flip-flop: the setup time δS^F, the hold time
δH^F, the clock-to-output delay DCQ^F, the width of the clock pulse CW^F, and
the clock period TCP.
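The setup and hold parameters translate directly into timing checks. A sketch of such a check for a single latching edge (Python; the δS^F and δH^F values are illustrative assumptions, with all times in picoseconds):

```python
# Sketch of a setup/hold check for an edge-triggered register (times in ps;
# delta_S and delta_H values are illustrative assumptions).
def check_flip_flop(data_settle_time, data_next_change, clock_edge,
                    delta_S=30.0, delta_H=20.0):
    """Return (setup_ok, hold_ok) for one latching clock edge."""
    setup_ok = data_settle_time <= clock_edge - delta_S   # stable early enough
    hold_ok = data_next_change >= clock_edge + delta_H    # stable long enough
    return setup_ok, hold_ok

assert check_flip_flop(940.0, 1050.0, clock_edge=1000.0) == (True, True)
assert check_flip_flop(980.0, 1050.0, clock_edge=1000.0) == (False, True)  # setup violation
```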
system. As described in Section 2.2, the storage elements serve to establish the relative
sequence of events within a system so that those operations that cannot be
executed concurrently operate on the proper data signals.
A typical clock signal c(t) in a synchronous digital system is shown in
Figure 4.8. The clock period TCP of c(t) is also indicated in Figure 4.8. In
Fig. 4.8. A typical clock signal c(t) with clock period TCP, pulse width CW, and
maximum edge deviations ΔL (leading edge) and ΔT (trailing edge).
order to provide the highest possible clock frequency, the objective is for TCP
to be the smallest number such that
where n is an integer. The width of the clock pulse CW is shown in Figure 4.8
where the meaning of CW is explained in Sections 4.3.1 (for a latch) and 4.5.1
(for a flip-flop), respectively.
Typically, the period of the clock signal TCP is a constant, that is,
∂TCP/∂t = 0. If the clock signal c(t) has a delay τ from some reference
point, the leading edges of c(t) occur at times τ + kTCP, where k is an integer.
To account for this clock jitter, the following parameters are introduced:
• the maximum deviation ΔL of the leading edge of the clock signal, i.e., the
leading edge is guaranteed to occur anywhere in an interval (τ + kTCP −
ΔL , τ + kTCP + ΔL ),
• the maximum deviation ΔT of the trailing edge of the clock signal, i.e.,
the trailing edge is guaranteed to occur anywhere in the interval (τ +CW +
kTCP − ΔT , τ + CW + kTCP + ΔT ).
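These uncertainty intervals are easy to tabulate. A sketch (Python; τ, TCP, CW, ΔL, and ΔT are illustrative values, with all times in picoseconds):

```python
# Sketch: uncertainty windows for clock edges with jitter (times in ps; the
# delay tau, period T_CP, pulse width C_W, and the Deltas are illustrative).
def leading_edge_window(k, tau=100.0, T_CP=1000.0, Delta_L=25.0):
    center = tau + k * T_CP
    return (center - Delta_L, center + Delta_L)

def trailing_edge_window(k, tau=100.0, T_CP=1000.0, C_W=500.0, Delta_T=25.0):
    center = tau + C_W + k * T_CP
    return (center - Delta_T, center + Delta_T)

assert leading_edge_window(2) == (2075.0, 2125.0)
assert trailing_edge_window(2) == (2575.0, 2625.0)
```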
Consider a local data path such as the path shown in Figure 2.6 on page 14.
Without loss of generality, assume that the registers shown in Figure 2.6 are
flip-flops. The clock signal with period TCP is delivered to each of the registers
Ri and Rf . Let the clock signal driving the register Ri be denoted as Ci and the
clock signal driving the register Rf be denoted by Cf. Also, let t_cd^i and t_cd^f be
the delays of Ci and Cf to the registers Ri and Rf, respectively.³ As described
by (4.2), the latching or leading edges of Ci occur at times t_cd^i + kTCP,
and those of Cf occur at times t_cd^f + kTCP, as described by (4.3).
The clock skew TSkew (i, f ) = t_cd^i − t_cd^f between Ci and Cf is introduced next
as the difference of the arrival times of Ci and Cf [9] (a more formal definition
is provided in Chapter 5). This concept is illustrated by Figure 4.9. Note that
depending on the values of ticd and tfcd , the clock skew can be zero, negative or
3
Note that t_cd^i and t_cd^f are measured with respect to the same reference point.
positive, depending upon whether ticd is equal to, less than or greater than tfcd ,
respectively. Furthermore, note that the clock skew as defined above is only
defined for sequentially-adjacent registers, that is, a local data path [such as
the path shown in Figure 2.6].
[Figure 4.11: the clock phases C_source^1 , C_source^2 , …, C_source^(n−1) , C_source^n with start times φ^1 , φ^2 , …, φ^(n−1) , φ^(n) and pulse widths C_W ; the clock signals C_i^{p_i} and C_f^{p_f} arriving at the registers with delays t_i^{p_i} and t_f^{p_f} ; and the separation |φ^{p_i p_f} + T_Skew^{p_i p_f}(i, f)| , shown over the clock period T_CP .]
the circuit. For instance, C_source^1 denotes the clock signal of the clock
phase C^1 at the clock source. When this clock signal is delivered to an ar-
bitrary register Rk , it is denoted by C_k^1 . The start time φ^{p_i} of the clock
phase C^{p_i} is defined with respect to a common reference clock cycle. The
phase shift operator φ^{p_i p_f} [69] is used to transform variables between differ-
ent clock phases. The phase shift operator φ^{p_i p_f} is defined as the algebraic
difference φ^{p_i p_f} = φ^{p_i} − φ^{p_f} + kT_CP , where k is the number of clock cycles
occurring between the phases. Note that for a single-phase clocking scheme, the
phase shift operator evaluates to φ^{if} = T_CP .
A multi-phase synchronization approach can be advantageous in terms of
increasing the reachability of circuit registers, creating less skew within phys-
ically neighboring local clock domains and potentially saving power. Despite
these advantages, the design and analysis of such synchronization schemes are
more complex.
The multi-phase clock skew is defined as T_Skew^{p_i p_f}(i, f) = t_i^{p_i} − t_f^{p_f} , where
t_i^{p_i} and t_f^{p_f} are the delays of the clock signals C_i^{p_i} and C_f^{p_f} from the clock
sources to the registers Ri and Rf , respectively. The multi-phase clock skew
is illustrated in Figure 4.11. The common clock period for all clock phases is
denoted by T_CP for consistency with the original formulation of the single-
phase synchronized circuits.
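These two definitions can be summarized in a short sketch (the function names are illustrative, not from the monograph):

```python
def phase_shift(phi_i, phi_f, k, t_cp):
    """Phase shift operator: phi^{pi pf} = phi^{pi} - phi^{pf} + k*T_CP,
    where k is the number of clock cycles occurring between the phases."""
    return phi_i - phi_f + k * t_cp

def multi_phase_skew(t_i, t_f):
    """Multi-phase clock skew: T_Skew^{pi pf}(i, f) = t_i^{pi} - t_f^{pf},
    the difference of the clock delays from the sources to Ri and Rf."""
    return t_i - t_f

# Single-phase special case: identical phases one cycle apart give T_CP
assert phase_shift(0.0, 0.0, k=1, t_cp=10.0) == 10.0
```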
4.7 Single-Phase Path with Flip-Flops
Fig. 4.12. A single-phase local data path: flip-flops Ri and Rf (with data inputs Di , Df and outputs Qi , Qf ) connected through the combinational logic block Lif and driven by the clock signals Ci and Cf .
The initial flip-flop Ri is the source of the data signal and the final flip-flop Rf is the destination of the data signal.
The combinational logic block Lif between Ri and Rf accepts the input data
signals supplied by Ri and other registers and logic gates and transmits the
operated upon data signals to Rf . The period of the clock signal is denoted by
TCP and the delays of the clock signals Ci and Cf to the flip-flops Ri and Rf
are denoted by ticd and tfcd , respectively. The input and output data signals to
Ri and Rf are denoted by Di , Qi , Df , and Qf , respectively.
An analysis of the timing properties of the local data path shown in Fig-
ure 4.12 is offered in the following sections. First, the timing relationships to
prevent the late arrival of data signals to Rf are examined in Section 4.7.1. The
timing relationships to prevent the early arrival of signals to the register Rf are
described in Section 4.7.2. The analyses presented in Sections 4.7.1 and 4.7.2
borrow some of the notation from [19] and [20]. Similar analyses of synchro-
nous circuits from the timing perspective can be found in [69, 70, 71, 72, 73].
The operation of the local data path Ri ;Rf shown in Figure 4.12 requires that
any data signal that is being stored in Rf arrives at the data input Df of Rf no
later than δ_S^{Ff} before the latching edge of the clock signal Cf .4 It is possible
for the opposite event to occur, that is, for the data signal Df not to arrive at
the register Rf sufficiently early in order to be stored successfully within Rf . If
this situation occurs, the local data path shown in Figure 4.12 fails to perform
as expected and a timing failure or violation is created. This form of timing
violation is typically called a setup (or long path) violation. A setup violation
is depicted in Figure 4.13 and is used in the following discussion.
4
As a reminder of the definitions in Section 4.5: in the δ_S^{Ff} notation, the subscript S
denotes the setup time, the superscript F denotes a flip-flop parameter and the
superscript f denotes that the parameter is defined at the final register Rf .
56 4 Timing Properties of Synchronous Systems
Fig. 4.13. Timing diagram of a local data path with flip-flops illustrating a violation
of the setup (or long path) constraint (waveforms Ci , Di , Qi , Df and Cf over the
k-th clock period, indicating D_CQ^{Fi} , D_PM^{i,f} , δ_S^{Ff} , Δ_L and T_CP ).
The coincidental cycles (k-th) of the clock signals Ci and Cf are shaded
for identification in Figure 4.13. Also shaded in Figure 4.13 are those portions
of the data signals Di , Qi , and Df that are relevant to the operation of
the local data path shown in Figure 4.12. Specifically, the shaded portion
of Di corresponds to the data to be stored in Ri at the beginning of the k-
th clock cycle. This data signal propagates to the output of the register Ri
and is illustrated by the shaded portion of Qi shown in Figure 4.13. The
combinational logic operates on Qi during the k-th clock cycle. The result
of this operation is illustrated by the shaded portion of the signal Df which
must be stored in Rf during the next (k + 1)-st clock cycle.
Observe that as illustrated in Figure 4.13, the leading edge of Ci that
initiates the k-th clock cycle occurs at time t_cd^i + kT_CP with respect to a
global time reference of zero. Similarly, the leading edge of Cf that initiates
the (k + 1)-st clock cycle occurs at time t_cd^f + (k + 1)T_CP . Therefore, the latest
arrival time Af of the data signal Df at the flip-flop Rf must satisfy

A_f ≤ t_cd^f + (k + 1)T_CP − Δ_L^F − δ_S^{Ff} .   (4.4)

The term t_cd^f + (k + 1)T_CP − Δ_L^F on the right hand side of (4.4) corresponds
to the critical situation of the leading edge of Cf arriving earlier by the maxi-
mum possible deviation Δ_L^F . The −δ_S^{Ff} term on the right hand side of (4.4) ac-
counts for the setup time of Rf (recall the definition of δ_S^F from Section 4.5.3).
Note that the value of Af in (4.4) consists of two components:
1. The latest arrival time Di at which a valid data signal Qi appears at the output
of Ri , i.e., the sum Di = t_cd^i + kT_CP + Δ_L^F + D_CQM^{Fi} of the latest possible
arrival time of the leading edge of Ci and the maximum clock-to-Q delay
of Ri ,
2. The maximum propagation delay D_PM^{i,f} of the data signals through the
combinational logic block Lif and interconnect along the path Ri ;Rf .
Therefore, Af can be described as

A_f = Di + D_PM^{i,f} = t_cd^i + kT_CP + Δ_L^F + D_CQM^{Fi} + D_PM^{i,f} .   (4.5)
By substituting (4.5) into (4.4), the timing condition guaranteeing correct
signal arrival at the data input Df of Rf is

t_cd^i + kT_CP + Δ_L^F + D_CQM^{Fi} + D_PM^{i,f} ≤ t_cd^f + (k + 1)T_CP − Δ_L^F − δ_S^{Ff} .   (4.6)

The above inequality can be transformed by subtracting the kT_CP terms from
both sides of (4.6). Furthermore, certain terms in (4.6) can be grouped to-
gether. Also, by noting that t_cd^i − t_cd^f = TSkew (i, f ) is the clock skew between
the registers Ri and Rf ,
TSkew (i, f ) + 2Δ_L^F ≤ T_CP − (D_CQM^{Fi} + D_PM^{i,f} + δ_S^{Ff}) .   (4.7)

The inequality (4.7) can be rewritten to emphasize the upper bound imposed
on the clock skew TSkew (i, f ):

TSkew (i, f ) ≤ T_CP − (D_CQM^{Fi} + D_PM^{i,f} + δ_S^{Ff}) − 2Δ_L^F .   (4.8)
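The setup condition (4.7) lends itself to a direct numerical check. A minimal sketch with hypothetical parameter values:

```python
def setup_satisfied(t_skew, t_cp, delta_l, d_cq_max, d_p_max, t_setup):
    """Flip-flop setup constraint (4.7):
    T_Skew(i,f) + 2*Delta_L <= T_CP - (D_CQM + D_PM + delta_S)."""
    return t_skew + 2.0 * delta_l <= t_cp - (d_cq_max + d_p_max + t_setup)

def min_clock_period(t_skew, delta_l, d_cq_max, d_p_max, t_setup):
    """Smallest T_CP for which (4.7) holds with equality."""
    return t_skew + 2.0 * delta_l + d_cq_max + d_p_max + t_setup

# Zero skew, 0.1 ns jitter, 0.5 ns clock-to-Q, 4.0 ns logic, 0.2 ns setup
ok = setup_satisfied(0.0, 5.0, 0.1, 0.5, 4.0, 0.2)   # satisfied: 0.2 <= 0.3
```

Increasing the skew (or the logic delay) eventually violates the constraint, which is precisely the situation that forces a larger clock period.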
Late arrival of the signal Df at the data input of Rf (see Figure 4.12) is ana-
lyzed in Section 4.7.1. In this section, an analysis of the timing relationships
of the local data path Ri ;Rf to prevent early data arrival of Df is presented.
To this end, recall from the discussion in Section 4.5.4 that any data signal Df
being stored in Rf must lag the arrival of the leading edge of Cf by at least δ_H^{Ff} .
It is possible for the opposite event to occur, i.e., for a new data signal Dfnew to
overwrite the value of Df and be stored within the register Rf . If this situation
occurs, the local data path shown in Figure 4.12 will not perform as desired
because of the timing violation known as a hold time (or short path) violation.
In this section, these hold time violations caused by race conditions are
analyzed. It is shown that a hold violation is more dangerous than a setup
violation since a hold violation cannot be removed by simply adjusting the
clock period TCP [unlike the case of a data signal arriving late where TCP can
be increased to satisfy (4.7)]. A hold violation is depicted in Figure 4.14 and
is used in the following discussion.
The situation depicted in Figure 4.14 is different from the situation de-
picted in Figure 4.13 in the following sense. In Figure 4.13, a data signal stored
in Ri during the k-th clock cycle arrives too late to be stored in Rf during the
(k + 1)-st clock cycle. In Figure 4.14, however, the data stored in Ri during the
k-th clock cycle arrives at Rf too early and overwrites the data that had to be
stored in Rf during the same k-th clock cycle. To clarify this concept, certain
portions of the data signals are shaded for easy identification in Figure 4.14.
The data Di being stored in Ri at the beginning of the k-th clock cycle is
shaded. This data signal propagates to the output of the register Ri and is
illustrated by the shaded portion of Qi shown in Figure 4.14. The output of
the logic (left unshaded in Figure 4.14) is being stored within the register Rf
at the beginning of the (k +1)-st clock cycle. Finally, the shaded portion of Df
corresponds to the data signal that is to be stored in Rf at the beginning of
the k-th clock cycle.
Note that, as illustrated in Figure 4.14, the leading (or latching) edge of Ci
that initiates the k-th clock cycle occurs at time ticd + kTCP . Similarly, the
Fig. 4.14. Timing diagram of a local data path with flip-flops with a violation of
the hold constraint (waveforms Ci , Di , Qi , Df and Cf over the k-th clock period,
indicating D_CQ^{Fi} , D_Pm^{i,f} , δ_H^{Ff} and Δ_L ).
leading (or latching) edge of Cf that initiates the k-th clock cycle occurs at
time tfcd + kTCP . Therefore, the earliest arrival time af of the data signal Df
at the register Rf must satisfy the following condition:
a_f ≥ t_cd^f + kT_CP + Δ_L^F + δ_H^{Ff} .   (4.9)
The term t_cd^f + kT_CP + Δ_L^F on the right hand side of (4.9) corresponds to
the critical situation of the leading edge of the k-th clock cycle of Cf arriving
late by the maximum possible deviation Δ_L^F . Note that the value of af in (4.9)
has two components:
1. The earliest arrival time di at which a valid data signal Qi appears at the
output of Ri , i.e., the sum d_i = t_cd^i + kT_CP − Δ_L^F + D_CQm^{Fi} of the earliest
arrival time of the leading edge of Ci and the minimum clock-to-Q delay
of Ri ,
2. The minimum propagation delay D_Pm^{i,f} of the signals through the combi-
national logic block Lif and interconnect wires along the path Ri ;Rf .
Therefore, af can be described as

a_f = d_i + D_Pm^{i,f} = t_cd^i + kT_CP − Δ_L^F + D_CQm^{Fi} + D_Pm^{i,f} .   (4.10)
By substituting (4.10) into (4.9), the timing condition that guarantees that
Df does not arrive too early at Rf is
t_cd^i + kT_CP − Δ_L^F + D_CQm^{Fi} + D_Pm^{i,f} ≥ t_cd^f + kT_CP + Δ_L^F + δ_H^{Ff} .   (4.11)
The inequality (4.11) can be simplified by subtracting the kT_CP terms from
both sides and by noting that t_cd^i − t_cd^f = TSkew (i, f ):

TSkew (i, f ) − 2Δ_L^F ≥ −(D_CQm^{Fi} + D_Pm^{i,f}) + δ_H^{Ff} .   (4.12)

Note that the clock jitter terms 2Δ_L^F are harmful in the sense that these
terms impose a lower bound on the clock skew TSkew (i, f ) between the
registers Ri and Rf . Although positive skew may be used to relax (4.12),
these two terms work against the relaxation provided by TSkew (i, f )
and D_CQm^{Fi} + D_Pm^{i,f} .
Finally, the relationship (4.12) can be rewritten to stress the lower bound
imposed on the clock skew TSkew (i, f ):

TSkew (i, f ) ≥ −(D_Pm^{i,f} + D_CQm^{Fi}) + δ_H^{Ff} + 2Δ_L^F .   (4.13)
5
Increasing the clock period TCP in order to satisfy (4.7) is equivalent to reducing
the frequency of the clock signal.
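Together, the setup bound (4.7) and the hold bound (4.13) delimit an interval of safe clock skews for a local data path with flip-flops. A sketch of this computation (parameter values hypothetical):

```python
def ff_skew_bounds(t_cp, delta_l, d_cq_max, d_p_max, t_setup,
                   d_cq_min, d_p_min, t_hold):
    """Permissible clock skew interval [lower, upper] for a flip-flop path:
    lower from the hold bound (4.13), upper from the setup bound (4.7)."""
    lower = -(d_p_min + d_cq_min) + t_hold + 2.0 * delta_l        # (4.13)
    upper = t_cp - (d_cq_max + d_p_max + t_setup) - 2.0 * delta_l  # (4.7)
    return lower, upper

lower, upper = ff_skew_bounds(t_cp=10.0, delta_l=0.1,
                              d_cq_max=0.5, d_p_max=4.0, t_setup=0.2,
                              d_cq_min=0.2, d_p_min=1.0, t_hold=0.3)
# lower is approximately -0.7 ns, upper approximately 5.1 ns:
# any skew in between avoids both setup and hold violations
```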
4.8 Single-Phase Path with Latches
Fig. 4.15. A single-phase local data path with latches: latches Ri and Rf (with data inputs Di , Df and outputs Qi , Qf ) connected through the combinational logic block Lif and driven by the clock signals Ci and Cf .
The combinational logic block Lif between Ri and Rf accepts the input data signals sourced by Ri
and other registers and logic gates and transmits the data signals that have
been operated on to Rf . The period of the clock signal is denoted by TCP and
the delays of the clock signals Ci and Cf to the latches Ri and Rf are denoted
by ticd and tfcd , respectively. The input and output data signals to Ri and Rf
are denoted by Di , Qi , Df , and Qf , respectively.
An analysis of the timing properties of the local data path shown in Fig-
ure 4.15 is offered in the following sections. The timing relationships to prevent
the late arrival of the data signal at the latch Rf are examined in Section 4.8.1.
The timing relationships to prevent the early arrival of the data signal at the
latch Rf are examined in Section 4.8.2.
The analyses presented in this section are built on the timing relationships
among the signals of a latch that are similar to those used in Section 4.7.
Specifically, it is guaranteed that every data signal arrives at the data input
of a latch no later than δ_S^L time before the trailing clock edge. Also, this data
signal must remain stable at least δ_H^L time after the trailing edge, i.e., no
new data signal should arrive at a latch until δ_H^L time after the latch has become
opaque.
Observe the differences between a latch and a flip-flop [70, 75]. In flip-
flops, the setup and hold requirements described in the previous paragraph are
relative to the leading—not to the trailing—edge of the clock signal. Similarly
to flip-flops, the late and early arrival of the data signal at a latch gives rise
to timing violations known as a setup and a hold violation, respectively.
during the k-th clock cycle. The data Qi stored in Ri propagates through the
combinational logic Lif and the interconnect along the path Ri ;Rf . In the
(k + 1)-st clock cycle, the result Df of the computation in Lif is stored within
the latch Rf . The signal Df must arrive at least δSL time before the trailing
edge of Cf in the (k + 1)-st clock cycle.
Similar to the discussion presented in Section 4.7.1, the latest arrival time
Af of Df at the D input of Rf must satisfy

A_f ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} .   (4.14)
Note the difference between (4.14) and (4.4). In (4.4), the first term on the
right hand side is [t_cd^f + (k + 1)T_CP − Δ_L^F ], while in (4.14), the right
hand side has an additional term C_W^L . The addition of C_W^L corresponds to
the concept that unlike flip-flops, a data signal is stored in the latches, shown
in Figure 4.15, at the trailing edge of the clock signal (the C_W^L term). Similar to
(4.5), Af can be described as

A_f = D_PM^{i,f} + Di .   (4.15)

The value of Di in (4.15) is the greater of two quantities—the latest arrival
time Ai of the data signal Di at the latch Ri plus the maximum data-to-output
delay D_DQM^{Li} of Ri , or the latest possible arrival time of the leading edge
of Ci plus the maximum clock-to-Q delay D_CQM^{Li} of Ri :

Di = max ( A_i + D_DQM^{Li} , t_cd^i + kT_CP + Δ_L^L + D_CQM^{Li} ) .   (4.16)

By substituting (4.16) into (4.15), the latest arrival time is

A_f = D_PM^{i,f} + max ( A_i + D_DQM^{Li} , t_cd^i + kT_CP + Δ_L^L + D_CQM^{Li} )   (4.17)

and the setup condition becomes

A_f ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} .   (4.18)

The max operator in (4.18) may be split into two conditions:

D_PM^{i,f} + A_i + D_DQM^{Li} ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} ,   (4.19)

D_PM^{i,f} + t_cd^i + kT_CP + Δ_L^L + D_CQM^{Li} ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} .   (4.20)
Taking into account that the clock skew TSkew (i, f ) = t_cd^i − t_cd^f , (4.19)
and (4.20) can be rewritten, respectively, as

D_PM^{i,f} + A_i + D_DQM^{Li} ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} ,   (4.21)

TSkew (i, f ) + Δ_L^L + Δ_T^L ≤ T_CP + C_W^L − (D_CQM^{Li} + D_PM^{i,f} + δ_S^{Lf}) .   (4.22)
Similar to Sections 4.7.1 and 4.7.2, (4.22) can be rewritten to emphasize the
upper bound on the clock skew TSkew (i, f ) imposed by (4.22):

D_PM^{i,f} + A_i + D_DQM^{Li} ≤ t_cd^f + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} ,   (4.23)

TSkew (i, f ) ≤ T_CP + C_W^L − Δ_L^L − Δ_T^L − (D_CQM^{Li} + D_PM^{i,f} + δ_S^{Lf}) .   (4.24)
the (k + 1)-st clock cycle. In the latter case, the data signal stored in the
latch Ri during the k-th clock cycle propagates to the latch Rf too early and
overwrites the data signal that is already stored in the latch Rf during the
same k-th clock cycle.
In order for the proper data signal to be successfully latched within Rf
during the k-th clock cycle, there should not be any changes in the signal Df
until at least the hold time after the arrival of the storing (trailing) edge of
the clock signal Cf . Therefore, the earliest arrival time af of the data signal
Df at the register Rf must satisfy the following condition,
a_f ≥ t_cd^f + kT_CP + C_W^L + Δ_T^L + δ_H^{Lf} .   (4.25)
The term t_cd^f + kT_CP + C_W^L + Δ_T^L on the right hand side of (4.25) corre-
sponds to the critical situation of the trailing edge of the k-th clock cycle of
the clock signal Cf arriving late by the maximum possible deviation Δ_T^L . Note
that the value of af in (4.25) consists of two components:
1. The earliest arrival time di at which a valid data signal Qi appears at the
output of the latch Ri , i.e., the sum d_i = t_cd^i + kT_CP − Δ_L^L + D_CQm^{Li} of
the earliest arrival time of the leading edge of the clock signal Ci and the
minimum clock-to-Q delay D_CQm^{Li} of Ri ,
2. The minimum propagation delay D_Pm^{i,f} of the signal through the combi-
national logic Lif and the interconnect along the path Ri ;Rf .
Therefore, af can be described as

a_f = d_i + D_Pm^{i,f} = t_cd^i + kT_CP − Δ_L^L + D_CQm^{Li} + D_Pm^{i,f} .   (4.26)
By substituting (4.26) into (4.25), the timing condition guaranteeing that Df
does not arrive too early at the latch Rf is

t_cd^i + kT_CP − Δ_L^L + D_CQm^{Li} + D_Pm^{i,f} ≥ t_cd^f + kT_CP + C_W^L + Δ_T^L + δ_H^{Lf} .   (4.27)
The inequality (4.27) can be further simplified by reorganizing the terms
and noting that t_cd^i − t_cd^f = TSkew (i, f ) is the clock skew between the registers
Ri and Rf :

TSkew (i, f ) − (Δ_L^L + Δ_T^L) ≥ C_W^L − (D_CQm^{Li} + D_Pm^{i,f}) + δ_H^{Lf} .   (4.28)
Finally, the relationship (4.28) can be rewritten to emphasize the lower bound
on the clock skew TSkew (i, f ):

TSkew (i, f ) ≥ Δ_L^L + Δ_T^L + C_W^L − (D_CQm^{Li} + D_Pm^{i,f}) + δ_H^{Lf} .   (4.29)
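Analogously to the flip-flop case, the upper bound (4.24) and the lower bound (4.29) delimit the permissible skew of a single-phase path with latches. A sketch following the bounds above (parameter values hypothetical):

```python
def latch_skew_bounds(t_cp, c_w, delta_ll, delta_lt,
                      d_cq_max, d_p_max, t_setup,
                      d_cq_min, d_p_min, t_hold):
    """Permissible skew interval for a single-phase path with latches:
    lower bound from (4.29), upper bound from (4.24)."""
    lower = (delta_ll + delta_lt + c_w
             - (d_cq_min + d_p_min) + t_hold)            # (4.29)
    upper = (t_cp + c_w - delta_ll - delta_lt
             - (d_cq_max + d_p_max + t_setup))           # (4.24)
    return lower, upper

lower, upper = latch_skew_bounds(t_cp=10.0, c_w=4.0,
                                 delta_ll=0.1, delta_lt=0.1,
                                 d_cq_max=0.5, d_p_max=4.0, t_setup=0.2,
                                 d_cq_min=0.2, d_p_min=1.0, t_hold=0.3)
```

Note how the pulse width C_W enters both bounds: the transparency of the latch extends the setup bound beyond the flip-flop case, at the price of a stricter hold requirement.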
Fig. 4.16. A multi-phase local data path with latches: latches Ri and Rf (with data inputs Di , Df and outputs Qi , Qf ) connected through the combinational logic block Lif and driven by the clock signals C_i^{p_i} and C_f^{p_f} .
Similar to (4.14), the latest arrival time Af of Df at the D input of the latch Rf
must satisfy

A_f ≤ φ^{p_f} + t_f^{p_f} + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} .   (4.30)

Note the difference between (4.30) and (4.14). In (4.30), the right
hand side has an additional term φ^{p_f} to account for the clock phase
information.
Observe that the value of Af in (4.30) consists of two components:
1. The latest arrival time Di when a valid data signal Qi appears at the
output of the latch Ri ,
2. The maximum signal propagation delay through the combinational logic
block Lif and the interconnect along the path Ri ;Rf .
Therefore, Af can be described as
A_f = D_PM^{i,f} + Di .   (4.31)
Similar to Section 4.8.1, the value of Di in (4.31) is the greater of the following
two quantities:

Di = max ( A_i + D_DQM^{Li} , φ^{p_i} + t_i^{p_i} + kT_CP + Δ_L^L + D_CQM^{Li} ) .   (4.32)
There are two terms in the right hand side of (4.32):
1. The term A_i + D_DQM^{Li} corresponds to the situation in which Di arrives
at Ri after the leading edge of the k-th clock cycle,
2. The term φ^{p_i} + t_i^{p_i} + kT_CP + Δ_L^L + D_CQM^{Li} corresponds to the situation
in which Di arrives at Ri before the arrival of the leading edge of the k-th
clock pulse.
By substituting (4.32) into (4.31), the latest time of arrival Af must satisfy

A_f = D_PM^{i,f} + max ( A_i + D_DQM^{Li} , φ^{p_i} + t_i^{p_i} + kT_CP + Δ_L^L + D_CQM^{Li} )   (4.33)

≤ φ^{p_f} + t_f^{p_f} + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} .   (4.34)
Equation (4.34) is an expression of the inequality that must be satisfied in
order to prevent the late arrival of a data signal at the data input D of
the latch Rf . By satisfying (4.34), any setup violation in a local data path
with latches as shown in Figure 4.16 is avoided. For a circuit to operate cor-
rectly, (4.34) must be enforced for every local data path Ri ;Rf consisting of
the latches, Ri and Rf .
Similar to single-phase operation, the max operator in (4.34) may be split
into two conditions:
D_PM^{i,f} + A_i + D_DQM^{Li} ≤ φ^{p_f} + t_f^{p_f} + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} ,   (4.35)

D_PM^{i,f} + φ^{p_i} + t_i^{p_i} + kT_CP + Δ_L^L + D_CQM^{Li} ≤ φ^{p_f} + t_f^{p_f} + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} .   (4.36)
Taking into account that the multi-phase clock skew is T_Skew^{p_i p_f}(i, f) = t_i^{p_i} − t_f^{p_f} ,
(4.35) and (4.36) can be rewritten, respectively, as

D_PM^{i,f} + A_i + D_DQM^{Li} ≤ φ^{p_f} + t_f^{p_f} + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} ,   (4.37)

φ^{p_i p_f} + T_Skew^{p_i p_f}(i, f) + Δ_L^L + Δ_T^L ≤ T_CP + C_W^L − (D_CQM^{Li} + D_PM^{i,f} + δ_S^{Lf}) .   (4.38)
Similar to Sections 4.8.1 and 4.8.2, (4.38) can be rewritten to emphasize the
upper bound on the clock skew T_Skew^{p_i p_f}(i, f):

D_PM^{i,f} + A_i + D_DQM^{Li} ≤ φ^{p_f} + t_f^{p_f} + (k + 1)T_CP + C_W^L − Δ_T^L − δ_S^{Lf} ,   (4.39)

T_Skew^{p_i p_f}(i, f) ≤ −φ^{p_i p_f} + T_CP + C_W^L − Δ_L^L − Δ_T^L − (D_CQM^{Li} + D_PM^{i,f} + δ_S^{Lf}) .   (4.40)
In order for the proper data signal to be successfully latched within Rf during
the k-th clock cycle, there should not be any changes in the signal Df until at
least the hold time after the arrival of the storing (trailing) edge of the clock
signal C_f^{p_f} . Therefore, the earliest arrival time af of the data signal Df at the
register Rf must satisfy the following condition,

a_f ≥ φ^{p_f} + t_f^{p_f} + kT_CP + C_W^L + Δ_T^L + δ_H^{Lf} .   (4.41)

The term φ^{p_f} + t_f^{p_f} + kT_CP + C_W^L + Δ_T^L on the right hand side of (4.41)
corresponds to the critical situation of the trailing edge of the k-th clock cycle
of the clock signal C_f^{p_f} arriving late by the maximum possible deviation Δ_T^L .
Note that the value of af in (4.41) consists of two components:
1. The earliest arrival time di at which a valid data signal Qi appears at the
output of the latch Ri , i.e., the sum d_i = φ^{p_i} + t_i^{p_i} + kT_CP − Δ_L^L + D_CQm^{Li}
of the earliest arrival time of the leading edge of the clock signal C_i^{p_i} and
the minimum clock-to-Q delay D_CQm^{Li} of Ri ,
2. The minimum propagation delay D_Pm^{i,f} of the signal through the combi-
national logic Lif and the interconnect along the path Ri ;Rf .
Therefore, af can be described as

a_f = d_i + D_Pm^{i,f} = φ^{p_i} + t_i^{p_i} + kT_CP − Δ_L^L + D_CQm^{Li} + D_Pm^{i,f} .   (4.42)

By substituting (4.42) into (4.41), the timing condition guaranteeing that Df
does not arrive too early at the latch Rf is

φ^{p_i} + t_i^{p_i} + kT_CP − Δ_L^L + D_CQm^{Li} + D_Pm^{i,f} ≥ φ^{p_f} + t_f^{p_f} + kT_CP + C_W^L + Δ_T^L + δ_H^{Lf} .   (4.43)

The inequality (4.43) can be simplified by noting that T_Skew^{p_i p_f}(i, f) = t_i^{p_i} − t_f^{p_f}
and that both edges here belong to the same clock cycle, so φ^{p_i p_f} = φ^{p_i} − φ^{p_f} :

φ^{p_i p_f} + T_Skew^{p_i p_f}(i, f) − (Δ_L^L + Δ_T^L) ≥ C_W^L − (D_CQm^{Li} + D_Pm^{i,f}) + δ_H^{Lf} .   (4.44)

Finally, the relationship (4.44) can be rewritten to emphasize the lower bound
on the clock skew T_Skew^{p_i p_f}(i, f):

T_Skew^{p_i p_f}(i, f) ≥ −φ^{p_i p_f} + Δ_L^L + Δ_T^L + C_W^L − (D_CQm^{Li} + D_Pm^{i,f}) + δ_H^{Lf} .   (4.45)
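The multi-phase bounds (4.40) and (4.45) differ from the single-phase latch bounds only by the phase shift terms. A sketch (parameter values hypothetical):

```python
def mp_latch_skew_bounds(phi_s, phi_h, t_cp, c_w, delta_ll, delta_lt,
                         d_cq_max, d_p_max, t_setup,
                         d_cq_min, d_p_min, t_hold):
    """Permissible multi-phase clock skew interval [lower, upper].

    phi_s, phi_h -- the phase shift operators phi^{pi pf} appearing in
                    the setup bound (4.40) and the hold bound (4.45)
    """
    upper = (-phi_s + t_cp + c_w - delta_ll - delta_lt
             - (d_cq_max + d_p_max + t_setup))            # (4.40)
    lower = (-phi_h + delta_ll + delta_lt + c_w
             - (d_cq_min + d_p_min) + t_hold)             # (4.45)
    return lower, upper

# With zero phase shifts the interval reduces to the single-phase case
lower, upper = mp_latch_skew_bounds(0.0, 0.0, 10.0, 4.0, 0.1, 0.1,
                                    0.5, 4.0, 0.2, 0.2, 1.0, 0.3)
```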
The properties of registers and local data paths are described in this chapter.
Specifically, the timing relationships to prevent setup and hold timing viola-
tions in a local data path consisting of two positive edge-triggered flip-flops are
analyzed in Sections 4.7.1 and 4.7.2, respectively. The timing relationships to
prevent setup and hold timing violations in a local data path consisting of two
positive-polarity latches have also been analyzed in Sections 4.8.1 and 4.8.2,
respectively. Timing relationships to prevent setup and hold timing violations
in a local data path consisting of two positive-polarity latches, synchronized by
a multi-phase clocking scheme, have been analyzed in Sections 4.9.1 and 4.9.2,
respectively.
In a fully synchronous digital VLSI system, however, it is possible to en-
counter certain local data paths different from those circuits analyzed in this
chapter. For example, a local data path may begin with a positive-polarity,
7
That is, the inequality (4.44) is not satisfied.
5.1 Background
[Figure: registers R1 , R2 , R3 and R4 with data inputs and outputs, connected through the set of logic gates G = {G1 , G2 , G3 , G4 }.]
Fig. 5.1. A simple synchronous digital circuit with four registers and four logic
gates.
The timing constraints of a local data path have been derived in Sections 4.7.1
through 4.8.2 for paths consisting of flip-flops and latches. The concept of clock
skew used in these timing constraints is formally defined next:
Definition 5.2. Clock skew. In a given digital synchronous circuit, the clock
skew TSkew (i, j) between the registers Ri and Rj is defined as the algebraic
difference,

TSkew (i, j) = t_cd^i − t_cd^j ,   (5.1)

where Ci and Cj are the clock signals driving the registers Ri and Rj , respec-
tively, and t_cd^i and t_cd^j are the delays of the clock signals Ci and Cj , respec-
tively.
In Definition 5.2, the clock delays, t_cd^i and t_cd^j , are with respect to an
arbitrary—but necessarily the same—reference point. A commonly used ref-
erence point is the source of the clock distribution network on the integrated
circuit. Note that the clock skew TSkew (i, j) as defined in Definition 5.2 obeys
the antisymmetric property,

TSkew (i, j) = −TSkew (j, i) .   (5.2)
Recall that the clock skew TSkew (i, j) as defined in Definition 5.2 is a com-
ponent in the timing constraints of a local data path [see inequalities (4.8),
(4.13), (4.23), (4.24), (4.29), (4.39), (4.40) and (4.45)]. Therefore, the clock
skew TSkew (i, j) is defined and is of primary practical use for sequentially-
adjacent pairs of registers Ri ;Rj , that is, for local data paths.2
1
Propagating through a sequence of logic elements only.
2
Note that technically, TSkew (i, j) can be calculated for any ordered pair of regis-
ters Ri , Rj . However, the skew between a non-sequential pair of registers has no
practical value.
Each signal data path has a unique permissible range associated with it.3
The permissible range is a continuous interval of valid skews for a specific path.
As suggested by the inequalities, (4.8), (4.13), (4.23), (4.24), (4.29), (4.39),
(4.40) and (4.45) and illustrated in Figure 5.2, every permissible range is de-
limited by a lower and upper bound of the clock skew. These bounds—denoted
by lk and uk , respectively—are determined based on the timing parameters
of the individual local data paths and the constraints to prevent timing vio-
lations discussed in Chapter 4. Note that the bounds lk and uk also depend
on the operational clock period for the specific circuit. When sk ∈ [lk , uk ]—
as shown in Figure 5.2—the timing constraints of this specific k-th local data
path are satisfied. The clock skew sk is not permitted to be in either the inter-
val (−∞, lk ) because a race condition will be created or the interval (uk , +∞)
because the minimum clock period will be limited.
Furthermore, note that the reliability of a circuit is related to the prob-
ability of a timing violation occurring for any local data path Ri ;Rf . This
3
Later in Section 5.2.2 it is shown that it is more appropriate to refer to the permis-
sible range of a sequentially-adjacent pair of registers. There may be more than
one local data path between the same pair of registers but circuit performance is
ultimately determined by the permissible ranges of the clock skew between pairs
of registers.
76 5 Clock Skew Scheduling and Clock Tree Synthesis
observation suggests that the reliability of any local data path Ri ;Rf of a
circuit (and therefore of the entire circuit) is increased in two ways:
1. by choosing the clock skew sk for the k-th local data path as far as possible
from the borders of the interval [lk , uk ], that is, by (ideally) positioning
the clock skew sk in the middle of the permissible range as sk = (lk + uk )/2,
2. by increasing the width (uk − lk ) of the permissible range of the local data
path Ri ;Rf .
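These two observations can be stated computationally; a brief sketch (the helper names are illustrative):

```python
def ideal_skew(l_k, u_k):
    """Target skew at the center of the permissible range [l_k, u_k]."""
    return 0.5 * (l_k + u_k)

def safety_margin(s_k, l_k, u_k):
    """Distance of a chosen skew s_k from the nearest border of the
    permissible range; a larger margin means higher reliability."""
    return min(s_k - l_k, u_k - s_k)

# The midpoint maximizes the margin; skews near a border leave little slack
assert ideal_skew(-1.0, 5.0) == 2.0
```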
Even if the clock signals can be delivered to the registers within a given circuit
with arbitrary delays, it is generally not possible to have all clock skews in
the middle of the permissible range as suggested above. The reason behind
this characteristic is that inherent structural limitations of the circuit create
linear dependencies among the clock skews within the circuit. These linear
dependencies and the effect of these dependencies on a number of circuit
optimization techniques are examined in detail in Chapter 7.
4
As a matter of fact, the graph model described here is quite universal and can be
successfully applied for a variety of other different circuit analysis and optimiza-
tion purposes.
[Figure: the registers R1 , R2 , R3 and R4 of Figure 5.1 connected by signal paths, each path labeled with the logic gates along it (e.g., G1 ; G3 , G2 ; G4 , G1 , G2 ).]
5
In the order in which the traveling signals pass through the gates.
6
Restrictions on the model itself and not on the ability of the model to represent
features of the circuits.
Fig. 5.4. A graph representation of the synchronous system shown in Figure 5.1
according to Definition 5.3. The graph vertices v1 , v2 , v3 , and v4 correspond to the
registers, R1 , R2 , R3 and R4 , respectively, and the edges e1 , e2 and e3 are labeled
with the permissible ranges [l1 , u1 ], [l2 , u2 ] and [l3 , u3 ].
is labeled with the corresponding permissible range of the clock skew for the
given pair of registers. An arrow is drawn next to each edge to indicate the
order of the registers in this specific sequentially-adjacent pair—recall that
the clock skew as defined in Definition 5.2 is an algebraic difference. As shown
in the rest of this section, either direction of an edge can be selected as long
as the proper choices of lower and upper clock skew bounds are made.
In most practical cases, a unique signal path (a local data path) exists
between a given sequentially-adjacent pair of registers Ri , Rj . In these cases,
the labeling of the corresponding edge is straightforward. The permissible
range bounds lk and uk are computed using (4.8), (4.13), (4.23), (4.24), (4.29),
(4.39), (4.40) and (4.45) and the direction of the arrow is chosen so as to
coincide with the direction of the signal propagation from Ri to Rj . With
these choices, the clock skew is computed as s = t_cd^i − t_cd^j . In Figure 5.4, for
example, the direction labels of both e1 and e2 can be chosen from v1 to v3
and from v2 to v3 , respectively.
Multiple signal paths between a pair of registers, Rx and Ry , require a
more complicated treatment. As specified before, there can be only one edge
between the vertices, vx and vy , in the circuit graph. Therefore, a methodology
is presented for choosing the correct permissible range bounds and direction
labeling for this single edge. This methodology is illustrated in Figure 5.5 and
is a two-step process. First, multiple signal paths in the same direction from
[Figure 5.5(a): the multiple edges from vx to vy , with permissible ranges [lz(1) , uz(1) ] through [lz(n) , uz(n) ], replaced by a single edge with permissible range [lz , uz ].]
(a) Elimination of multiple edges
[Figure 5.5(b): two opposing edges between vx and vy , with permissible ranges [lz , uz ] and [lz′ , uz′ ], replaced by a single edge with permissible range [lz , uz ] ∩ [−uz′ , −lz′ ].]
(b) Elimination of a two-edge cycle
the register Rx to the register Ry are replaced by a single edge in the circuit
graph according to the transformation illustrated in Figure 5.5(a). Next, two-
edge cycles between Rx and Ry are replaced by a single edge in the circuit
graph according to the transformation illustrated in Figure 5.5(b).
In the former case [Figure 5.5(a)], the edge direction labeling is preserved
while the permissible range for the new single edge is chosen such that the
permissible ranges of the multiple paths from Rx to Ry are simultaneously
satisfied. As shown in Figure 5.5(a), the new permissible range [lz , uz ] is the
intersection of the multiple permissible ranges [lz(1) , uz(1) ] through [lz(n) , uz(n) ]
between Rx and Ry . In other words, the new lower bound is lz = max_i {lz(i) }
and the new upper bound is uz = min_i {uz(i) }.
In the latter case [Figure 5.5(b)], an arbitrary choice for the edge direc-
tion can be made—the convention adopted here is to choose the direction
towards the vertex with the higher index. If [lz , uz ] and [lz′ , uz′ ] denote the
permissible ranges of the two opposing edges, the new permissible range of the
single edge directed towards the vertex vy has a lower bound max(lz , −uz′ )
and an upper bound min(uz , −lz′ ). It is straightforward to verify that any
clock skew s in this new permissible range satisfies both of the original
permissible ranges [lz , uz ] and [lz′ , uz′ ], as shown in Figure 5.5(b).
Figure 5.5(b). The process for computing the permissible ranges of a circuit
graph [using (4.8), (4.13), (4.23), (4.24) and (4.29)] and the transformations
illustrated in Figure 5.5 have linear complexity in the number of signal paths
since each signal path is examined only once.
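Both transformations of Figure 5.5 reduce to simple interval arithmetic; a sketch:

```python
def merge_parallel(ranges):
    """Figure 5.5(a): replace multiple same-direction edges by one edge
    whose permissible range is the intersection of all ranges."""
    lo = max(l for l, u in ranges)
    hi = min(u for l, u in ranges)
    return (lo, hi) if lo <= hi else None   # None: no feasible skew exists

def merge_cycle(fwd, rev):
    """Figure 5.5(b): replace a two-edge cycle by one forward edge.
    A reverse range (l', u') seen in the forward direction is (-u', -l');
    intersect it with the forward range."""
    rl, ru = rev
    return merge_parallel([fwd, (-ru, -rl)])

assert merge_parallel([(-1.0, 3.0), (0.0, 5.0)]) == (0.0, 3.0)
assert merge_cycle((-1.0, 3.0), (-2.0, 0.5)) == (-0.5, 2.0)
```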
Note that the terms, circuit and graph, are used throughout the rest of
this research monograph interchangeably to denote the same fully synchronous
digital circuit. Also, note that for brevity, the superscript (C) when referring
to the circuit graph G^(C) of a circuit C is omitted for the rest of the monograph
unless a circuit is explicitly indicated. The terms, register and vertex, are used
interchangeably as are edge, local data path, arc and a sequentially-adjacent
pair of registers. On a final note, it is assumed that the graph of any circuit
considered in this work is connected. If this is not the case, each of the disjoint
connected portions of the graph (circuit) can be individually analyzed.
Based on Definition 5.4, the timing constraints of a local data path Ri ;Rf
with flip-flops [(4.8) and (4.13)] are used to construct the linear programming
(LP) model for clock skew scheduling [2] shown in Table 5.1. The constraints
in Table 5.1 are the operating conditions for an edge-sensitive circuit:
For a local data path Ri ;Rf consisting of the flip-flops, Ri and Rf , the setup
and hold time violations are avoided if (5.5) and (5.6), respectively, are sat-
isfied.
The clock skew TSkew (i, f ) of a local data path Ri ;Rf can be either pos-
itive or negative, as illustrated in Figures 4.13 and 4.14, respectively. Note
that negative clock skew may be used to effectively speed-up a local data
path Ri ;Rf by allowing an additional TSkew (i, f ) amount of time for the
signal to propagate from the register Ri to the register Rf . However, exces-
sive negative skew may create a hold time violation, thereby creating a lower
bound on TSkew (i, f ) as described by (5.6) and illustrated by l in Figure 5.2.
A hold time violation, as described in Chapter 4, is a clock hazard or a race
condition, also known as double clocking [2, 9]. Similarly, positive clock skew
effectively decreases the clock period TCP by TSkew (i, f ), thereby limiting the
maximum clock frequency and imposing an upper bound on the clock skew as
illustrated by u in Figure 5.2.7 In this case, a clocking hazard known as zero
clocking may be created [2, 9].
Examination of the constraints, (5.5) and (5.6), reveals a procedure for pre-
venting clock hazards. Assuming (5.5) is not satisfied, a suitably large value of
TCP can be chosen to satisfy constraint (5.5) and prevent zero clocking. Also
note that unlike (5.5), (5.6) is independent of the clock period TCP (or the
clock frequency). Therefore, TCP cannot be changed to correct a double clock-
ing hazard, but rather a redesign of the entire clock distribution network [83]
or a delay padding procedure onto the logic network [74] may be required.
Both double and zero clocking hazards can be eliminated if two simple
choices characterizing a fully synchronous digital circuit are made. Specifically,
7 Positive clock skew may also be thought of as increasing the path delay. In either case, positive clock skew (TSkew > 0) increases the difficulty of satisfying (5.5).
82 5 Clock Skew Scheduling and Clock Tree Synthesis
if equal values are chosen for all clock delays [that is, ticd = tfcd for any two registers, the trivial choice (5.7)], then the clock skew TSkew (i, f ) = 0 for each local data path Ri ;Rf .
Note that (5.8) can be satisfied for each local data path Ri ;Rf in a circuit
if a sufficiently large value—larger than the greatest value D̂Pi,fM in a circuit—
is chosen for TCP . Furthermore, (5.9) can be satisfied across an entire circuit
if it can be ensured that D̂Pi,fm ≥ 0 for each local data path Ri ;Rf in the
circuit. The timing constraints, (5.8) and (5.9), can be satisfied since choosing
a sufficiently large clock period TCP is always possible and D̂Pi,fm is positive
for a properly designed local data path Ri ;Rf . The application of this zero
clock skew methodology [(5.7), (5.8), and (5.9)] has been central to the design
of fully synchronous digital circuits for decades [9, 32, 91]. By requiring the
clock signal to arrive at each register Rj with approximately the same delay tjcd ,
these design methods have become known as zero clock skew methods.8
As shown by previous research [9, 81, 82, 83, 88, 92, 93], both double
and zero clocking hazards may be removed from a synchronous digital circuit even when the clock skew is non-zero, that is, TSkew (i, f ) ≠ 0 for some
(or all) local data paths Ri ;Rf . As long as (5.5) and (5.6) are satisfied, a
synchronous digital system can operate reliably with non-zero clock skews,
permitting the system to operate at higher clock frequencies while removing
all race conditions.
The column vector of clock delays TCD = [t1cd , t2cd , . . . ]T is called a clock
schedule [2, 9]. If TCD is chosen such that (5.5) and (5.6) are satisfied for
every local data path Ri ;Rf , TCD is called a consistent clock schedule. A
clock schedule that satisfies (5.7) is called a trivial clock schedule. Note that a
trivial clock schedule TCD implies global zero clock skew since for any i and
f , ticd = tfcd , thus, TSkew (i, f ) = 0.
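The consistency of a clock schedule can be checked directly against (5.5) and (5.6). The sketch below uses a simplified per-path timing model (each local data path carries its minimum and maximum delay plus the setup and hold times of its final register); the tuple layout and values are invented for illustration.

```python
def is_consistent(t_cd, paths, t_cp):
    """Check whether a clock schedule t_cd (clock delay per register) is
    consistent: for every local data path (i, f), the clock skew
    t_cd[i] - t_cd[f] must respect the hold (double clocking) lower bound
    of (5.6) and the setup (zero clocking) upper bound of (5.5)."""
    for (i, f, d_min, d_max, setup, hold) in paths:
        skew = t_cd[i] - t_cd[f]
        if skew < hold - d_min:            # hold bound: excessive negative skew
            return False
        if skew > t_cp - d_max - setup:    # setup bound: excessive positive skew
            return False
    return True

# One local data path R0 -> R1 with invented delays and register parameters.
paths = [(0, 1, 2.0, 6.0, 0.5, 0.3)]
print(is_consistent({0: 0.0, 1: 0.0}, paths, 8.5))  # trivial schedule: True
print(is_consistent({0: 0.0, 1: 2.5}, paths, 8.5))  # skew -2.5 violates hold bound
```

The trivial (zero skew) schedule is consistent here, while the second schedule applies more negative skew than the hold bound permits.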
An intuitive example of non-zero clock skew being used to improve the per-
formance and reliability of a fully synchronous digital circuit is shown in Fig-
ure 5.6. Two pairs of sequentially-adjacent flip-flops, R1 ;R2 and R2 ;R3 , are
shown in Figure 5.6, where both zero skew and non-zero skew situations are
illustrated in Figures 5.6(a) and 5.6(b), respectively. Note that the local data
paths made up of the registers, R1 and R2 and of R2 and R3 , respectively, are
connected in series (R2 being common to both R1 ;R2 and R2 ;R3 ). In each
of the Figures 5.6(a) and 5.6(b), the permissible ranges of the clock skew for
8
Equivalently, it is required that the clock signal arrive at each register at approx-
imately the same time.
5.3 Clock Scheduling 83
Fig. 5.6. Application of non-zero clock skew to improve circuit performance (a lower clock period) or circuit reliability (increased safety margins within the permissible range). (a) The circuit operating with zero clock skew, the clock delivered to each register with the same delay t. (b) The circuit operating with non-zero clock skew, the clock delivered to R2 with delay τ (τ < t), yielding skews t − τ and τ − t.
both local data paths, R1 ;R2 and R2 ;R3 , are lightly shaded under each cir-
cuit diagram. As shown in Figure 5.6, the target clock period for this circuit
is TCP = 8.5 ns.
The zero clock skew points (Skew = 0) are indicated in Figure 5.6(a)—
zero skew is achieved by delivering the clock signal to each of the registers,
R1 , R2 and R3 , with the same delay t (symbolically illustrated by the buffers
connected to the clock terminals of the registers). Observe that while the
zero clock skew points fall within the respective permissible ranges, these zero
clock skew points are dangerously close to the lower and upper bounds of the
permissible range for R1 ;R2 and R2 ;R3 , respectively. A situation could be
foreseen where, for example, the local data path R2 ;R3 has a longer than expected delay (larger than 8 ns), thereby causing the upper bound of
the permissible range for R2 ;R3 to decrease below the zero clock skew point.
In this scenario, a setup violation will occur on the local data path R2 ;R3 .
Consider next the same circuit with non-zero clock skew applied to the
data paths, R1 ;R2 and R2 ;R3 , as shown in Figure 5.6(b). Non-zero skew is
achieved by delivering the clock signal to the register R2 with a delay τ < t,
where t is the delay of the clock signal to both R1 and R3 . By applying this
delay τ < t, positive (t − τ > 0) and negative (τ − t < 0) clock skews are
applied to R1 ;R2 and R2 ;R3 , respectively. The corresponding clock skew
points are illustrated in the respective permissible ranges in Figure 5.6(b).
Comparing Figure 5.6(a) to Figure 5.6(b), observe that a timing violation is
less likely to occur in the latter case. In order for the previously described
setup timing violation to occur in Figure 5.6(b), the deviations in the delay
parameters of R2 ;R3 would have to be much greater in the non-zero clock
skew case than in the zero clock skew case. Even if the precise target value of the non-zero clock skew τ − t < 0 is not met during the circuit design process, the safety margin from the skew point to the upper bound of the permissible range remains much greater than in the zero clock skew case.
Therefore, there are two identifiable benefits of applying non-zero clock
skew. First, the safety margins of the clock skew (that is, the distances be-
tween the clock skew point and the bounds of the permissible range) within the
permissible ranges of a data path can be improved. The likelihood of correct
circuit operation in the presence of process parameter variations and opera-
tional conditions is improved with these increased margins. In other words,
the circuit reliability is improved. Second, without changing the logic and cir-
cuit structure, the performance of the circuit can be increased by permitting
a higher maximum clock frequency (or lower minimum clock period). The
formulation of circuit timing constraints for different timing problems and
formulation of clock skew scheduling for different objectives are presented
in Section 5.4.
Friedman first presented the concept of negative non-zero clock skew in 1989 [1] as a technique to increase the clock frequency and circuit performance across sequentially-adjacent pairs of registers. Soon afterwards, in 1990, Fishburn suggested an algorithm in [2] for computing a consistent clock schedule that is nontrivial. It is shown in [1, 2] that by exploiting negative and positive clock skew within a local data path Ri ;Rf , a circuit can operate with
a clock period TCP less than the clock period achievable by a trivial (or zero
skew) clock skew schedule while satisfying the conditions specified by (5.5)
and (5.6). In fact, [2] determined an optimal clock schedule by applying linear programming techniques to solve for TCD so as to satisfy (5.5) and (5.6) while minimizing the objective function Fobjective = min TCP .9
5.4 Timing Constraints and Design Automation 85
The process of determining a consistent clock schedule TCD can be consid-
ered as the mathematical problem of minimizing the clock period TCP under
the constraints, (5.5) and (5.6). However, there are important practical issues
to consider before a clock schedule can be properly implemented. A clock dis-
tribution network must be synthesized such that the clock signal is delivered
to each register with the proper delay so as to satisfy the clock skew sched-
ule TCD . Furthermore, this clock distribution network must be constructed so
as to minimize the deleterious effects of interconnect impedances and process
parameter variations on the implemented clock schedule. Synthesizing the
clock distribution network typically consists of determining a topology for the
network, together with the circuit design and physical layout of the buffers
and interconnect that make up a clock distribution network [9, 32].
9 This LP problem model is presented in Table 5.1.
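Fishburn solves for TCD with linear programming [2]. As an illustrative alternative that requires no LP solver, the same minimum clock period can be found by bisection over TCP combined with a difference-constraint feasibility check, since for a fixed TCP the constraints (5.5) and (5.6) only bound differences of clock delays. The path tuples and delay values below are invented for illustration.

```python
def feasible(t_cp, n, paths):
    """For a fixed clock period t_cp, (5.5) gives t_i - t_f <= t_cp - d_max - setup
    and (5.6) gives t_f - t_i <= d_min - hold.  This is a system of difference
    constraints, feasible iff its constraint graph has no negative cycle
    (checked here with Bellman-Ford relaxation)."""
    edges = []
    for i, f, d_min, d_max, setup, hold in paths:
        edges.append((f, i, t_cp - d_max - setup))  # encodes t_i <= t_f + w
        edges.append((i, f, d_min - hold))          # encodes t_f <= t_i + w
    dist = [0.0] * n
    for _ in range(n):
        for u, v, w in edges:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
    return all(dist[u] + w >= dist[v] - 1e-9 for u, v, w in edges)

def min_clock_period(n, paths, lo=0.0, hi=100.0):
    """Bisection on T_CP; feasibility is monotone because increasing T_CP
    only relaxes the setup constraints."""
    for _ in range(60):
        mid = (lo + hi) / 2
        if feasible(mid, n, paths):
            hi = mid
        else:
            lo = mid
    return hi

# A three-register cycle: (i, f, d_min, d_max, setup, hold), invented values.
paths = [(0, 1, 1.0, 5.0, 0.5, 0.2),
         (1, 2, 2.0, 6.0, 0.5, 0.2),
         (2, 0, 1.0, 4.0, 0.5, 0.2)]
print(round(min_clock_period(3, paths), 3))   # 5.5, versus 6.5 with zero skew
```

For this example the zero skew clock period is bounded by the slowest path (6.0 + 0.5 = 6.5), while skew scheduling averages the cycle down to (5.5 + 6.5 + 4.5)/3 = 5.5.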
(a) Circuit structure of the clock distribution network. (b) Clock tree structure that corresponds to the circuit shown in (a). The tree connects the clock source through levels of buffers to the registers at the leaves.
must enforce a clock skew TSkew (i, f ) for each local data path Ri ;Rf of the
circuit in order to ensure that both (5.5) and (5.6) are satisfied.
10 The number of registers N in the circuit.
5.6 Solution of the Clock Tree Synthesis Problem 87
clock tree is determined by beginning at the bottom of the tree (those leaves
with the greatest depth) and recursively computing the number of buffers at
each preceding level.
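The bottom-up buffer count can be sketched as follows. This is a simplification: the clock trees described in this chapter also insert dummy loads to balance the tree, so the buffer totals reported in Table 5.2 differ from this idealized count.

```python
import math

def buffers_per_level(n_registers, f):
    """Bottom-up count of buffers in a clock tree with branching factor f:
    the lowest level of buffers drives the registers (the leaves with the
    greatest depth), and each preceding level drives the buffers below it,
    until a single root buffer remains."""
    levels = []
    load = n_registers
    while load > 1:
        load = math.ceil(load / f)   # buffers needed to drive this load
        levels.append(load)
    return levels

print(buffers_per_level(21, 3))   # [7, 3, 1] -> 11 buffers in this idealized count
```

For s400 (N = 21, f = 3) this idealized recursion yields 11 buffers; the 14 buffers of Figure 5.9 include the additional balancing structure.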
The techniques for clock skew scheduling and clock distribution network syn-
thesis discussed in this chapter have been implemented as two separate com-
puter programs. The first program implements the problem of simultaneous
clock skew scheduling and clock tree synthesis as described by (5.12). This
program is described and results are presented in Section 5.7.1. A second, more comprehensive software implementation, for clock skew scheduling only, is described in Section 5.7.2.
The algorithm has been implemented in a 3,300-line program written in the
C++ high-level programming language. This program has been executed on
the ISCAS’89 suite of benchmark circuits. A simple delay model based on the
load of a gate is used to extrapolate the gate delays since these benchmark
circuits do not contain delay information. A summary of the results for the
benchmark circuits is shown in Table 5.2. These results demonstrate that by
applying the proposed algorithm to schedule the clock delays to each register,
up to a 64% decrease11 in the minimum clock period can be achieved for these
benchmark circuits while removing all race conditions. Note that due to the
relatively large number of buffers required in the clock tree, this approach is
only practical for circuits with a large number of registers.
Two example implementations of a clock tree topology with non-zero skew
are shown in Figures 5.8 and 5.9 for the benchmark circuits s1423 and s400,
respectively:
1. The clock tree topology shown in Figure 5.8 corresponds to the circuit
s1423 which contains N = 74 registers. The improvement of the mini-
mum achievable clock period TCP is 14% by applying the methodology
described in Section 5.6.
2. The clock tree topology shown in Figure 5.9 corresponds to the circuit
s400 which contains N = 21 registers. The improvement of the minimum
achievable clock period for this circuit when non-zero clock skew is applied
is 37%.
11 Compared to the minimum possible clock period if zero skew is used throughout a circuit.
Table 5.2. ISCAS’89 suite of circuits. The name, number of registers, bounds of the searchable clock period, optimal clock period (Topt ) and performance improvement (in percent) are shown for each circuit. Also shown in the last two columns, labeled B2 and B3 , are the number of buffers in the clock tree for f = 2 and f = 3, respectively.
Fig. 5.8. Buffered clock tree for the benchmark circuit s1423. The circuit s1423 has a total of N = 74 registers and the clock tree consists of 45 buffers with a branching factor of f = 3. (Legend: dummy loads and leaf registers.)
controller) and some characterizing data is shown in Figure 5.12. The mini-
mum achievable clock period without clock skew scheduling is TCP = 14.8 ns
(= 67.5 MHz). After non-zero clock skew is applied to this circuit, the min-
imum achievable clock period with clock skew scheduling is TCP = 11.4 ns
(= 87.7 MHz) corresponding to a performance improvement of 23%.
The input to this program is a standard text file containing the timing in-
formation necessary to apply the clock scheduling algorithm to a fully syn-
chronous digital integrated circuit. This timing information characterizes the
minimum and maximum signal delay of each local data path and can be ob-
tained from the application of simulation tools known as static timing analyz-
ers. More accurate simulation methods—such as dynamic circuit simulation
(e.g., SPICE)—can be used to obtain highly accurate timing information for
relatively small circuits. A sample input file for the clock skew scheduling
program is shown in Figure 5.10. As shown in Figure 5.10, the input con-
sists of groups of information (lines 1-11 and 13-18 in Figure 5.10) enclosed
92 5 Clock Skew Scheduling and Clock Tree Synthesis
Fig. 5.9. Buffered clock tree for the benchmark circuit s400. The circuit s400 has
a total of N = 21 registers and the clock tree consists of 14 buffers with a branching
factor of f = 3.
in curly braces (the ‘{’ and ‘}’ symbols). Each line in a group describes
an instance of a register. The first line in a group describes a register Ri at
the beginning of a local data path Ri ;Rf . Each of the remaining lines of
a group describes a register Rf at the end of a local data path Ri ;Rf . In
the example shown in Figure 5.10, the registers Top/Block1/RegA[8]:sc and
TopA/Block1/RegA[7]:sc each describe the first register of a local data path
(lines 1 and 13, respectively).
Each register listed in the input file of the program consists of a sequence
of strings separated with slashes (the ’/’ character). These strings represent
the hierarchical name of the register in the design hierarchy. The register
on line 1, for example, is named RegA and is part of a design block named Block1, whereas the design block Block1 is part of the module called Top.
5.7 Software Implementation 93
Fig. 5.10. Sample input for the clock scheduling program described in Section 5.7.2.
Finally, a register bit index may be appended at the end of a register name
for multi-bit registers12 and the data pin name is appended after the bit index
and separated with a colon ‘:’.
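A parser for this hierarchical naming convention might look as follows; the returned field names and the handling of missing parts are illustrative assumptions, not the program's actual file format specification.

```python
def parse_register(name):
    """Split a hierarchical register name such as 'Top/Block1/RegA[8]:sc'
    into its design-hierarchy path, register name, optional bit index
    (for multi-bit registers) and optional data pin name."""
    path, _, leaf = name.rpartition('/')
    base, _, pin = leaf.partition(':')        # pin follows the colon, if any
    bit = None
    if base.endswith(']'):                    # bit index in square brackets
        base, _, idx = base[:-1].rpartition('[')
        bit = int(idx)
    return {'path': path.split('/'), 'register': base,
            'bit': bit, 'pin': pin or None}

print(parse_register('Top/Block1/RegA[8]:sc'))
# {'path': ['Top', 'Block1'], 'register': 'RegA', 'bit': 8, 'pin': 'sc'}
```

A name without a bit index or pin, such as a single-bit register, parses with those fields left empty.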
The description of the initial register of a local data path is followed by
eight (8) numbers which specify the timing information characterizing this
register. These numbers specify the minimum and maximum values of the
setup and hold times for the register for the rising and falling edges of the
clock signal. If a number is not available, an underscore ‘_’ is substituted for
this missing data. The program determines the type of register by examining
both the missing and specified numbers describing the setup and hold times.
Returning to line 1 in Figure 5.10, the minimum and maximum setup times
for the rising edge of the clock signal are included while the minimum and
maximum setup times for the falling edge of the clock signal are absent (note
the underscores in line 2). Therefore, this register instance is either a positive-
edge triggered flip-flop or a negative latch. A positive flip-flop has the setup
and hold times defined for the rising edge of the clock signal. Similarly, a
negative latch has the setup and hold times defined for the rising edge of the
clock signal. Since the register instance described by line 1 in Figure 5.10 has
setup and hold times defined for the rising edge of the clock signal, the register
instance is either a positive flip-flop or a negative latch.
12
If the register is not a multi-bit register, this index is omitted.
The output of the clock skew scheduling program is a standard text file. A
sample output is shown in Figure 5.11. Each line in the output consists of the
full hierarchical name of a register Rj and the value of the delay tjcd of the clock
signal to the register Rj . Recall that it is not the clock delays to the individual
1: Top/Block1/Reg1[7] 3.479695
2: Top/Block1/Reg143 2.814349
3: Top/Block1/Reg26[0] 2.159099
4: Top/Block1/Reg33A 3.479695
5: Top/Block1/Reg33B 3.479695
6: Top/Block1/reg_2a 3.479695
7: Top/Block1/reg_2 3.052987
8: Top/Block1/Reg271 2.541613
9: Top/Block1/Reg12 1.871610
Fig. 5.11. Sample output for the clock scheduling program described in Sec-
tion 5.7.2.
registers that are important, but rather the difference between the clock delays—the clock skew TSkew —of each sequentially-adjacent pair of registers.
Experimental Results
Two histograms are shown in Figure 5.12 which illustrate the effects of non-
zero clock skew on the circuit path delays. The distribution of the path de-
lay D̂Pi,fM is shown in Figure 5.12(a). With clock scheduling (non-zero clock
skew) applied, the effective path delay of each path Ri ;Rf is increased or de-
creased13 by the amount of clock skew scheduled for that path. This effective
path delay distribution is shown in Figure 5.12(b). Note that the net effect
of clock skew scheduling is a ‘shift’ of the path delay distribution away from
the maximum path delay [from right to left in Figure 5.12(b)]. There are two beneficial effects of this shift: either the circuit can operate with a lower clock period (a higher clock frequency), or the circuit can operate at the target clock period with a reduced probability of setup and hold time violations (improving the overall system reliability).
13 As described previously in this chapter, clock skew can be thought of as adding to (or subtracting from) the path delay.
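The 'shift' of the path delay distribution can be illustrated with a toy computation; the delay and skew values below are invented:

```python
# Effective path delay after clock skew scheduling: the skew of each local
# data path is added to (or subtracted from) its raw delay, shifting the
# delay distribution away from the maximum path delay.
delays = {('R1', 'R2'): 8.0, ('R2', 'R3'): 5.0}     # raw maximum path delays
skews = {('R1', 'R2'): -1.5, ('R2', 'R3'): 1.5}     # scheduled clock skews
effective = {p: d + skews[p] for p, d in delays.items()}
print(max(delays.values()), max(effective.values()))   # 8.0 6.5
```

The negative skew on the critical path lowers the largest effective delay, which is what permits a lower clock period.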
The timing relationships for local data paths with latches are categorized
into two sets: operational constraints and constructional constraints. The op-
erational constraints are the constraints that model the operation of a level-
sensitive synchronous circuit. The constructional constraints are defined to en-
sure the correctness and completeness of the formulation of the proposed tim-
ing analysis problem. The definitions for the operational constraints—called
latching, synchronization and propagation constraints, respectively—are de-
rived from the zero clock skew definitions in [69]. The latching, synchroniza-
tion and propagation constraints for a single-phase synchronization system
are described in Section 6.1.1, Section 6.1.2 and Section 6.1.3, respectively.
The constructional constraints, called validity and initialization constraints, are presented in Section 6.1.4 and Section 6.1.5, respectively.
Latching constraints bound the arrival time of the data signal Df (recall the
local data path in Figure 4.15 on page 61) in order to ensure that Df is latched
during the intended clock cycle.
The interval for the data arrival time is characterized by the hold time
and the setup time requirements of Rf as follows:
δHLf ≤ af (6.1)
Af ≤ TCP − δSLf . (6.2)
Eq. (6.1) constrains the earliest arrival of Df at Rf . The earliest data arrival
time must be no earlier than a hold time after the trailing edge of the previous
clock cycle. Suppose the (k + 1)-th clock cycle at latch Rf is illustrated in Fig-
ure 4.4 on page 46, where t1 = tfcd + kTCP [zero in the frame of reference of
(k + 1)-th cycle]. The hold time is defined by the difference t7 − t6 . If data
arrives at Rf earlier than the hold time, a double-clocking hazard occurs.
Similarly, (6.2) represents the setup constraint on Rf . As shown in Figure 4.4, the data must arrive at the final latch at least a setup time prior to the trailing edge of the clock cycle. Assuming the (k + 1)-th clock cycle is illustrated in Figure 4.4, the trailing edge of the clock cycle occurs at t6 = tfcd + kTCP + CWL . Thus, data cannot be latched into Rf during the (k + 1)-th cycle if the data arrives later than t5 = tfcd + kTCP + CWL − δSLf . Late arrival of the data signal results in a zero clocking hazard.
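A small numeric check of the latching constraints (6.1) and (6.2), in the frame of reference of the clock cycle at Rf (parameter values are invented):

```python
def latching_ok(a_f, A_f, t_cp, hold, setup):
    """Latching constraints for a final latch R_f: (6.1) requires the
    earliest arrival a_f to be no earlier than the hold time (no double
    clocking); (6.2) requires the latest arrival A_f to precede the
    trailing edge by at least the setup time (no zero clocking)."""
    return hold <= a_f and A_f <= t_cp - setup

print(latching_ok(0.4, 7.8, 8.5, 0.3, 0.5))   # True: both bounds satisfied
print(latching_ok(0.1, 7.8, 8.5, 0.3, 0.5))   # False: hold time violated
```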
Fig. 6.1. Possible cases (Cases I–VIII) for the arrival and departure times of data at the initial latch, each shown over the clock cycle from ticd + (k − 1)TCP to ticd + kTCP .
during the active phase of the clock signal Ci . The data signal immediately
propagates through the latch (as illustrated in cases I and VIII of Figure 6.1).
In these cases, the earliest departure time di from Ri depends on the earliest arrival time ai of the data signal and the time DDQLi it takes for the data to appear at the output terminal of Ri .
The second term of the max function, TCP − CWL + DCQmLi , refers to the case when the earliest data arrival time occurs during the opaque phase of Ri . In the opaque phase of operation, the departure time of the data signal from the initial latch occurs a clock-to-output delay DCQLi later than the leading edge
of the clock signal. Such data propagation is illustrated in cases II-VII of
Figure 6.1. The max function is used to combine these cases and to define the
earliest departure time di from the initial latch Ri . Similar reasoning applies
to the derivation of the latest departure time Di defined by (6.4).
Propagation constraints define the arrival time of the data signal Df at the
final latch Rf of a local data path. These constraints are as follows:
af = mini {di + D̂Pi,fm + TSkew (i, f )} − TCP (6.5)
Af = maxi {Di + D̂Pi,fM + TSkew (i, f )} − TCP . (6.6)
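The synchronization and propagation computations can be sketched numerically. The flattened parameter names and the example values are assumptions for illustration:

```python
def earliest_departure(a_i, t_cp, c_w, d_dq_m, d_cq_m):
    """Earliest departure from the initial latch R_i: data arriving during
    the transparent phase propagates after the data-to-output delay, while
    data arriving during the opaque phase departs a clock-to-output delay
    after the leading edge (cases I-VIII of Figure 6.1 combined by the
    max function)."""
    return max(a_i + d_dq_m, t_cp - c_w + d_cq_m)

def earliest_arrival(dep_and_paths, t_cp):
    """Propagation constraint (6.5): the earliest arrival at R_f is the min
    over all fan-in paths of d_i + D_min + skew, shifted back one clock
    period by the phase shift operator."""
    return min(d_i + d_min + skew for d_i, d_min, skew in dep_and_paths) - t_cp

d1 = earliest_departure(1.0, 8.0, 4.0, 0.6, 0.4)
print(d1)                                           # max(1.6, 4.4) = 4.4
print(earliest_arrival([(d1, 2.0, 0.5)], 8.0))      # 4.4 + 2.0 + 0.5 - 8.0
```

Here the opaque-phase term dominates the departure time, as in cases II-VII of Figure 6.1.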
(Figure: timing diagram of the clock signals Ci1 , Ci2 and Cf over the k-th and (k + 1)-th clock cycles, showing the arrival and departure times at each latch, the path delays D̂Pi1,fm , D̂Pi1,fM , D̂Pi2,fm and D̂Pi2,fM , and the clock skews TSkew (i1 , i2 ) < 0, TSkew (i1 , f ) > 0 and TSkew (i2 , f ) > 0.)
defined by the propagation on the Ri1 ;Rf data path. Similarly, on the data
path Ri2 ;Rf , a maximum data propagation time of D̂Pi2,fM elapses, yielding the latest data arrival time at Rf , Af = Di2 + D̂Pi2,fM + TSkew (i2 , f ) − TCP .
The departure of Qi and the arrival of Df must occur during two consecutive clock cycles for proper circuit operation. In order to switch between the frames of reference of these two cycles, the phase shift operator φif is used. The phase shift operator evaluates to φif = TCP for single-phase synchronization as discussed in Section 4.9. Thus, the clock period TCP
is subtracted from the calculated arrival time in order to shift the point of
reference of the data arrival time at Rf to the beginning of the previous clock
cycle.
Af ≥ af (6.7)
Df ≥ df . (6.8)
Fig. 6.3. The iterative algorithm for static timing analysis of level-sensitive circuits.
volves examining up to |p| edges, and |p| is at most |r|2 . The iterative algorithm
presented in Figure 6.3 is later modified to account for more advanced tim-
ing features or data models, such as for crosstalk [100] and statistical timing
analysis [113].
Although the iterative algorithm provides an initial and useful formula-
tion for the timing analysis of level-sensitive circuits, it does not constitute a framework amenable to general timing analysis problems or clock skew scheduling.
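A minimal sketch of this iterative style of analysis (not the exact algorithm of Figure 6.3; a single maximum delay and skew per path and uniform latch parameters are simplifying assumptions):

```python
def iterate_timing(n, paths, t_cp, c_w, d_dq, d_cq, max_iter=100):
    """Iterative (fixed-point) timing analysis sketch for a level-sensitive
    circuit: the latest departure times D follow from the latest arrival
    times A (transparent and opaque phases combined by max), the arrival
    times are then propagated along every path, and the process repeats
    until nothing changes.  Hitting max_iter signals non-convergence."""
    A = [0.0] * n                  # latest arrival times (initialization)
    for _ in range(max_iter):
        D = [max(A[i] + d_dq, t_cp - c_w + d_cq) for i in range(n)]
        A_new = A[:]
        for i, f, d_max, skew in paths:
            A_new[f] = max(A_new[f], D[i] + d_max + skew - t_cp)
        if A_new == A:
            return A, D            # converged to a fixed point
        A = A_new
    return None                    # did not converge

loop_paths = [(0, 1, 3.0, 0.0), (1, 0, 2.5, 0.0)]   # two latches in a data path loop
print(iterate_timing(2, loop_paths, t_cp=8.0, c_w=4.0, d_dq=0.5, d_cq=0.4))
# ([0.0, 0.0], [4.4, 4.4])
```

Even with the data path loop, this example converges in one pass; the difficulties noted above arise when the loop keeps increasing the arrival times.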
(i) δHLf ≤ af
[Latching-Hold time]
(ii) Af ≤ TCP − δSLf
[Latching-Setup time]
(iii) di ≥ ai + DDQmLi
di ≥ TCP − CWL + DCQmLi
[Synchronization-Earliest time]
(iv) Di ≥ Ai + DDQMLi
Di ≥ TCP − CWL + DCQMLi
[Synchronization-Latest time]
(v) af ≤ di1 + DPi1,fm + TSkew (i1 , f ) − TCP
...
af ≤ din + DPin,fm + TSkew (in , f ) − TCP
[Propagation-Earliest time]
(vi) Af ≥ Di1 + DPi1,fM + TSkew (i1 , f ) − TCP
...
Af ≥ Din + DPin,fM + TSkew (in , f ) − TCP
[Propagation-Latest time]
(vii) Af ≥ af
[Validity-Arrival time]
(viii) Df ≥ df
[Validity-Departure time]
(ix) Al = dl − (DCQmLl or DDQmLl ), ∀Rl : |Fan-in(Rl )| = 0
[Initialization]
(Figures: the level-sensitive example circuit with registers R1 , R2 , R3 and R4 and path delay ranges such as [2.9, 3], [3, 4] and [5, 7]; and the clock waveforms Csource and C1 through C4 , drawn over the period TCP with φ = TCP /2, for the zero clock skew schedule on the left and the non-zero clock skew schedule on the right.)
Fig. 6.6. Zero and non-zero clock skew timing schedules for the level-sensitive circuit
in Figure 6.4.
flip-flops. The second synchronous circuit of interest is the zero clock skew,
level-sensitive circuit. In order to design a level-sensitive synchronous circuit,
each flip-flop in the given circuit topology is replaced with a level-sensitive
latch. Zero clock skew, level-sensitive circuits exhibit improved circuit perfor-
mance due to time borrowing. Clock skew scheduling is applied to the zero
clock skew, level-sensitive circuit to generate the non-zero clock skew, level-
sensitive circuit. This circuit exhibits performance improvement due to the
simultaneous consideration of time borrowing and clock skew scheduling.
The clocking schedules and the data propagation on the critical paths of
the circuit in Figure 6.4 are shown in Figure 6.6. In Figure 6.6, the clocking
schedule for the zero clock skew circuit is shown on the left, with a minimum clock period of TCP = 4.66. The non-zero clock skew schedule, resulting in a minimum clock period of TCP = 4.05, is shown on the right. For non-zero clock skew scheduling, the optimal clock signal delays at the registers are
t1cd = 0.05, t2cd = 0.925, t3cd = 0 and t4cd = 0.475. The arrows represent data
signal propagation on the respective critical paths. Note that unlike the case
110 6 Clock Skew Scheduling of Level-Sensitive Circuits
presented in Figure 6.6, the critical paths for zero and non-zero clock skew
scheduling need not be identical.
In the analysis, the minimum clock period for the zero clock skew, level-
sensitive circuit is calculated as 4.66 (time units), which is a 33% improvement
over the zero clock skew, edge-sensitive synchronous circuit. Note that the per-
centage improvement is calculated by the expression 100(Told − Tnew )/Told .
As stated earlier, clock skew scheduling is applied to the level-sensitive cir-
cuit in order to generate the non-zero clock skew, level-sensitive circuit. The
calculated minimum clock period of 4.05 for the non-zero clock skew, level-
sensitive circuit is a 13% improvement over the zero clock skew, level-sensitive
circuit and a 42% improvement over the zero clock skew, edge-sensitive circuit. Note that the 13% improvement is due only to clock skew scheduling, while the 42% improvement is due to both time borrowing and clock skew scheduling.
Further analysis of the time borrowing and clock skew scheduling effects on circuit timing is presented in Section 11.1.
Data path loops (cycles) and transient state errors are two major issues that must be identified in the timing analysis of level-sensitive circuits. As discussed in Section 6.2, the iterative algorithm offered in [73] suffers from excessive run times and produces false negative outputs in the presence of data path loops [99]. In [99], modifications are offered for the iterative
algorithm in order to detect and handle the effects of data path loops in the
circuit. Also in [99], it has been shown that synchronous circuits are prone to
transient state errors. The transient state errors occur due to the non-unique
solution sets of the problem parameters, discussed (within a different context)
in Section 6.1.5. In circuits subject to transient state errors, setup violations occur in certain registers after the system is initiated from a reset state. The arrival
and departure times may not be stable at start-up, in which case these times
change during initial clock cycles, constituting the transient state. As circuit
operation progresses in time, the arrival and departure times converge to their
steady-state values.
There are two major conventions in evaluating the transient errors and de-
termining the steady-state behavior. The first convention overlooks the tran-
sient errors and presumes that the departure times converge to the opening
edge of the driving clock, which is the expected schedule for the steady-state
of operation. The second convention is more strict in that transient state er-
rors are not permitted. The first convention is more common and leads to a
generally acceptable solution unless the transient state operation of the level-sensitive circuit is decisive to overall circuit operation. If the second convention is adopted, the reset state is preferably extended until the steady state of operation is reached [99].
The LP model in Table 6.2 assumes the transient-state operation of a
level-sensitive circuit to be negligible. The aim of the generated model is to solve for the steady-state timing scheduling problem.
6.4 An Example and Experimental Results 111
The simplex algorithm-based LP solver directs the gradual advancement of parameter values as they
are enforced by the LP model. Previously offered algorithms are vulnerable
to potential failures caused by data path loops due to their iterative nature.
In the LP model, complications posed by the presence of data path loops are
resolved within the mechanics of the LP solver without significantly affecting
the run time or quality of the solution. If the problem remains feasible, the
timing parameters for the steady state operation of the circuit are calculated.
(Figure: timing diagram of the clock signals C1 , C2 and C3 over the k-th through (k + 2)-th clock cycles, with clock skews TSkew (3, 1) = −3.8, TSkew (3, 2) = −1.3 and TSkew (1, 2) = 2.5 and, for R3 , a3 = A3 = 0, d3 = D3 = 2.05 and t3cd = 0.)
Fig. 6.7. The optimized timing schedule for s27 operable with TCP = 4.1.
6.5 Optimality of the LP Formulation 113
cycles. The circuit s27 in Figure 6.7 is analyzed in order to provide a better insight on how the latest departure times converge to a certain value in the steady-state. Define a variable ε, where ε is a very small period of time. Suppose that a deviation of ε occurs in the departure time of the data signal from R3 . The signal departure from R3 occurs at time 2.05 + ε, delaying the arrival times at R1 and R2 by ε. The departure from R2 is gradually delayed by ε every turn, which in turn delays the arrival time at R1 . The arrival and
departure times cumulatively increase in each turn of the data signal around
the loop. Eventually, the signal arrivals at the latches occur during the non-
transparent state of the latches. At this point, the signal departure times
return to their starting values, which are the trailing edges of their respective
clock cycles. It is evident that the arrival times will finally be restored to their
initial values when the source of the deviation vanishes. Thus, the assignment
of the time-varying departure times to the leading edges of the synchronizing
clock signals is referred to as the steady-state of operation for the synchronous
circuit.
An MIP problem is generally easier to solve than an NLP problem of similar size [112]. In experimentation, the MIP problems generated for the clock skew scheduling problem of the level-sensitive ISCAS'89 benchmark circuits are solved optimally.
In order to generate the MIP formulation for the clock skew scheduling
problem of level-sensitive circuits, the non-linear synchronization and propa-
gation constraints in Table 6.2 (page 107) are remodeled using binary vari-
ables. Remember from Section 6.3.1 that the non-linearity of the synchro-
nization and propagation constraints are due to the max and min functions.
The transformations in Table 6.3 can be used to model a constraint with a
max function or a min function using a binary variable. In Table 6.3, yi , xi ,
xj and xk are continuous variables. A binary variable Bxa is defined for each
operand xa (xa ∈ {xi , xj , . . . , xk }) of the max or min function. For operand xi
of the max function shown on the left hand side of Table 6.3, for instance,
the binary variable Bxi is defined. The parameter M is a sufficiently large
constant, similar to its definition in Section 6.3.1.
For a non-linear constraint with the max function in the form given as
[yi = max(xi , xj , . . . , xk )], yi is constrained to be greater than or equal to
each one of the operands. For the max function to hold, the equality condition must be true for at least one of these inequalities (multiple equalities occur
when two or more identical operands are the maximal value). Binary variables
are used in order to enforce the equality of at least one of these inequalities.
The assignment of 1 or 0 to the binary variable Bxa either constrains yi to be less than or equal to xa or renders the corresponding big-M inequality inactive. In
particular for operand xi , when Bxi = 1, the relevant constraints become:
yi ≥ xi (6.10)
yi ≤ xi (6.11)
Fig. 6.8. Run times under 1250 seconds for the LP and MIP formulations. (Bar chart of the LP and MIP run times, in seconds, for the ISCAS'89 benchmark circuits from s27 up to s13207.)
Conversely, when Bxi = 0, the relevant constraints become:
yi ≥ xi (6.12)
yi − M ≤ xi (6.13)
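The effect of the binary variable can be checked mechanically. The following sketch (not from the text; the operand values and M are arbitrary illustrations) enumerates the binary assignments and reports whether the big-M model of y = max(x1, ..., xn) is feasible for a candidate value of y:

```python
from itertools import product

def bigM_max_feasible(y, xs, M=1e6, eps=1e-9):
    """True if some binary assignment B makes the big-M model of
    y = max(xs) feasible: y >= x_a for all a, y + (B_a - 1)M <= x_a,
    and at least one B_a = 1 (one inequality forced to equality)."""
    if any(y < x - eps for x in xs):        # y must dominate every operand
        return False
    for B in product((0, 1), repeat=len(xs)):
        if sum(B) == 0:                     # at least one equality required
            continue
        if all(y + (b - 1) * M <= x + eps for x, b in zip(xs, B)):
            return True
    return False

print(bigM_max_feasible(3.0, [1.0, 3.0, 2.0]))   # True:  3 = max
print(bigM_max_feasible(4.0, [1.0, 3.0, 2.0]))   # False: y too large
print(bigM_max_feasible(2.5, [1.0, 3.0, 2.0]))   # False: y below an operand
```

The model is feasible exactly when y equals the maximum of the operands, which is what the transformation in Table 6.3 is designed to enforce.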
Table 6.4. MIP model of the clock skew scheduling problem of level-sensitive circuits.

MIP Model
min TCP
subject to
(i) af ≥ δH^Lf [Latching-Hold time]
(ii) Af ≤ TCP − δS^Lf [Latching-Setup time]
(iii) di ≥ ai + DDQm^Li
 di ≥ TCP − CWL^Li + DCQm^Li
 di + (Bai − 1)M ≤ ai + DDQm^Li
 di + (BTai − 1)M ≤ TCP − CWL^Li + DCQm^Li [Synchronization-Earliest time]
(iv) Di ≥ Ai + DDQM^Li
 Di ≥ TCP − CWL^Li + DCQM^Li
 Di + (BAi − 1)M ≤ Ai + DDQM^Li [Synchronization-Latest time]
(v) af ≤ di1 + DPi1,fm + TSkew (i1 , f ) − TCP
 ⋮
 af ≤ din + DPin,fm + TSkew (in , f ) − TCP
 af + (1 − Bdi1f )M ≥ di1 + DPi1,fm + TSkew (i1 , f ) − TCP
 ⋮
 af + (1 − Bdinf )M ≥ din + DPin,fm + TSkew (in , f ) − TCP [Propagation-Earliest time]
(vi) Af ≥ Di1 + DPi1,fM + TSkew (i1 , f ) − TCP
 ⋮
 Af ≥ Din + DPin,fM + TSkew (in , f ) − TCP
 Af + (BDi1f − 1)M ≤ Di1 + DPi1,fM + TSkew (i1 , f ) − TCP
 ⋮
 Af + (BDinf − 1)M ≤ Din + DPin,fM + TSkew (in , f ) − TCP [Propagation-Latest time]
(vii) Af ≥ af [Validity-Arrival time]
(viii) Df ≥ df [Validity-Departure time]
(ix) Al = dl − (DCQm^Ll or DDQm^Ll ), ∀Rl : |Fan-in(Rl )| = 0 [Initialization]
exceed a thousand, a significant gap between the run times of the LP and MIP problems is observed. For larger circuits, the MIP run times become dramatically worse than the LP run times. For instance, the MIP problem run time for s38417 is 286,496 seconds, while the LP problem run time is only 603 seconds.
The run time experiment results shown in Figure 6.8 demonstrate the
advantages of using the LP formulation versus the MIP formulation. It is
demonstrated that the LP formulation offers a scalable alternative to the accurate MIP model. It is expected that the run times for industry-size integrated circuits will benefit even more from the simplifications of the LP formulation. The results of the LP formulation for the ISCAS'89 benchmark circuits are empirically shown to be equal to the optimal results1 . These empirical results do not guarantee the optimality of results for all circuits using the LP formulation. However, these results suggest the general accuracy of the LP formulation for the clock skew scheduling problem of level-sensitive circuits, leading to optimal or close-to-optimal results.
Earlier work on improving the performance of level-sensitive circuits has concentrated on circuit retiming, most notably in [117] and [118]. In [117], the advantages of two-phase, level-sensitive circuits (as opposed to edge-sensitive circuits) are explored. It is concluded in [117] that the level of improvement
in circuit performance is insignificant for such a circuit transformation, when
circuit retiming is performed. In [118], the results of [117] are examined from
a wider perspective, considering the depth of pipelining within a circuit—
average improvements up to 30% are shown to be possible by two-phase,
level-sensitive clocking with circuit retiming.
The presented multi-phase, level-sensitive clock skew scheduling methodology differs from [117] and [118] by expanding the multi-phase synchronization concept to three, four, and potentially higher numbers of phases (the studies presented in [117] and [118] consider only two-phase, level-sensitive circuits). Furthermore, unlike the extensive emphasis on circuit retiming in [117] and [118], the application of clock skew scheduling is presented in this section.
In [119], the authors advocate the use of a multi-phase clocking scheme for
both edge-triggered and level-sensitive synchronous circuits for increased cir-
cuit performance. In [120], the number of clock phases constituting the multi-
phase synchronization scheme and the skew values are restricted to reflect the
practical limitations of conventional clock distribution networks. The studies in [119] and [120] do not explore the effects of multi-phase synchronization on the level of improvement in circuit performance for non-zero clock skew, level-sensitive circuits.
In Figure 6.9, two local data paths starting at the latches Ri1 and Ri2 , re-
spectively, and ending at Rf are considered. This figure is the multi-phase
synchronization counterpart of Figure 6.2 shown on page 101. The clock sig-
nals driving the initial latches Ri1 and Ri2 are shown at the top and bottom,
respectively. The middle clock signal corresponds to the final latch Rf . The
time intervals for the arrival and departure times of latch data are illustrated
by the upper and lower parallel dotted lines, respectively. Data delays are
represented by the lengths of white or black rectangular boxes. Similar to
the analysis in Section 6.1, the operational and constructional timing con-
straints of multi-phase, level-sensitive circuits are formulated based on these
data propagation rules.
The timing constraints governing the operation of a multi-phase, level-
sensitive synchronous system are summarized in Table 6.5. The multi-phase
clock skew definition from Section 4.6.2 is incorporated into the constraints.
These constraints are valid for all varieties of overlapping and non-overlapping
clocking schemes, and for any feasible selection of duty cycles per clock phase.
Note the max and min functions in the synchronization and propagation
constraints in Section 4.9. The non-linearities of these constraints are similar
to those reported in Section 6.3 for single-phase circuits. Consequently, the
[Figure 6.9: timing diagram of the clock signals Ci1 , Cf , and Ci2 over consecutive clock cycles, showing the arrival intervals (ai1 , Ai1 ), (af , Af ), (ai2 , Ai2 ), the departure intervals (di1 , Di1 ), (df , Df ), (di2 , Di2 ), the shortest and longest path delays DPi1,fm , DPi1,fM , DPi2,fm , DPi2,fM , and the phase-shifted clock skews TSkew (i1 , i2 ) + φ^pi1pi2 < 0, TSkew (i1 , f ) + φ^pi1pf > 0, and TSkew (i2 , f ) + φ^pi2pf > 0.]
Table 6.5. Timing constraints of multi-phase, level-sensitive synchronous systems.

(i) af ≥ δH^Lf [Latching-Hold time]
(ii) Af ≤ TCP − δS^Lf [Latching-Setup time]
(iii) di ≥ ai + DDQm^Li
 di ≥ TCP − CWL^Li + DCQm^Li [Synchronization-Earliest time]
(iv) Di ≥ Ai + DDQM^Li
 Di ≥ TCP − CWL^Li + DCQM^Li [Synchronization-Latest time]
(v) af ≤ di1 + DPi1,fm + TSkew (i1 , f ) + φ^pi1pf
 ⋮
 af ≤ din + DPin,fm + TSkew (in , f ) + φ^pinpf [Propagation-Earliest time]
(vi) Af ≥ Di1 + DPi1,fM + TSkew (i1 , f ) + φ^pi1pf
 ⋮
 Af ≥ Din + DPin,fM + TSkew (in , f ) + φ^pinpf [Propagation-Latest time]
(vii) Af ≥ af [Validity-Arrival time]
(viii) Df ≥ df [Validity-Departure time]
(ix) Al = dl − (DCQm^Ll or DDQm^Ll ), ∀Rl : |Fan-in(Rl )| = 0 [Initialization]
6.7 Summary
The timing analysis and optimization of synchronous circuits are subject to non-zero clock skew (intentional or not) and other effects of process parameter variations. In this chapter, design and timing analysis procedures are presented for the clock skew scheduling of level-sensitive circuits. The formulation improves the performance of level-sensitive synchronous circuits by permitting shorter clock periods. The described procedure integrates non-zero clock skew scheduling in an automated fashion into the design and analysis of level-sensitive circuits. The procedure is based on a stand-alone LP model formulation (to be solved by any standard LP solver) which constitutes a generic automated framework for the design and analysis of level-sensitive synchronous circuits. The optimality of the results generated by the LP model is empirically confirmed against the optimal results of a precise MIP model. Using the clock skew definition that is extended to the increasingly popular multi-phase clock systems, the LP model clock skew scheduling formulation for level-sensitive circuits is presented.
7
Clock Skew Scheduling for Improved
Reliability
1 Recall that in Chapter 5, the starting point of the clock scheduling algorithms
is the set of timing constraints and the objective is to determine a feasible clock
schedule and a clock distribution network given these constraints.
I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, 121
DOI: 10.1007/978-0-387-71056-3_7,
© Springer Science+Business Media LLC 2009
Also defined here for notational convenience are the width wi,j and middle mi,j of the permissible range. Specifically,

wi,j = ui,j − li,j = TCP − D̂Pi,jM + D̂Pi,jm (7.3)

mi,j = (li,j + ui,j )/2 = (TCP − D̂Pi,jM − D̂Pi,jm )/2. (7.4)
Recall from Section 5.3 that it is frequently possible to make two simple choices (5.7) characterizing the clock skews and clock delays within a circuit, such that both zero and double clocking violations are avoided. Specifically, if equal values are chosen for all clock delays and a sufficiently large value—larger than the longest delay D̂Pi,fM —is chosen for TCP , neither of these two clocking hazards will occur. Formally,

t1cd = t2cd = · · · = trcd (7.5)

TCP ≥ max(i,f ) D̂Pi,fM (7.6)

and, with (7.5) and (7.6), the timing constraints, (5.5) and (5.6), for a hazard-free local data path Ri⇝Rf become

D̂Pi,fM ≤ TCP (7.7)

D̂Pi,fm ≥ 0. (7.8)
Next, recall that each clock skew TSkew (i, f ) is the difference of the delays
of the clock signals, ticd and tfcd . These delays are the tangible physical quan-
tities which are implemented by the clock distribution network. The set of all
clock delays within a circuit can be denoted as the vector column,
tcd = [ t1cd t2cd · · · ]t ,
and is called a clock skew schedule or simply a clock schedule [2, 9, 106].
If tcd is chosen such that (5.5) and (5.6) are satisfied for every local data
path Ri ;Rj , tcd is called a feasible clock schedule. A clock schedule that
satisfies (5.7) [respectively, (7.5) and (7.6)] is called a trivial clock schedule.
Again, a trivial tcd implies global zero clock skew since for any i and f , ticd = tfcd , thus, TSkew (i, f ) = 0. Also, observe that if [ t1cd t2cd . . . ]t is a feasible clock schedule (trivial or not), [ c + t1cd c + t2cd . . . ]t is also a feasible clock schedule, where c ∈ R is any real constant.
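This shift invariance follows directly from the skew definition, since skews depend only on differences of clock delays. A one-glance numerical sketch (the delay values are borrowed from the TCP = 6.5 solution in Table 7.2, and the path list is an assumption based on the example circuit C1 ):

```python
# Clock delays (t1..t4) and a uniform shift c; skews depend only on differences.
tcd = [1.5, 1.5, 0.0, 0.5]
c = 10.0
shifted = [t + c for t in tcd]

skew = lambda t, i, j: t[i] - t[j]
pairs = [(0, 2), (2, 3), (0, 1), (2, 1), (3, 1)]  # local data paths of C1 (0-indexed)

# Every skew, and hence feasibility of the schedule, is unchanged by the shift.
assert all(skew(tcd, i, j) == skew(shifted, i, j) for i, j in pairs)
```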
min TCP
subject to: ticd − tjcd ≤ TCP − D̂Pi,jM (7.9)
ticd − tjcd ≥ −D̂Pi,jm .
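Problem (7.9) is an ordinary linear program and can be handed to any LP solver. A minimal sketch with scipy follows; the path delay values D̂PM and D̂Pm are not given explicitly in the text and are inferred here from the permissible ranges of the example circuit C1 in Tables 7.1 and 7.2, so they should be read as an illustration:

```python
from scipy.optimize import linprog

# Local data paths of C1 as (i, j, D_PM, D_Pm), indices into t = (t1..t4);
# delay values inferred from the permissible ranges in Tables 7.1 and 7.2.
paths = [(0, 2, 4.0, 2.0), (2, 3, 5.0, 2.5), (0, 1, 3.0, 1.0),
         (2, 1, 7.0, 5.0), (3, 1, 4.0, 2.0)]

# Variables x = (t1, t2, t3, t4, T_CP); minimize T_CP.
c = [0, 0, 0, 0, 1]
A, b = [], []
for i, j, d_max, d_min in paths:
    row = [0.0] * 5
    row[i], row[j], row[4] = 1.0, -1.0, -1.0   # t_i - t_j - T_CP <= -D_PM (setup)
    A.append(row); b.append(-d_max)
    row = [0.0] * 5
    row[i], row[j] = -1.0, 1.0                 # t_j - t_i <= D_Pm (hold)
    A.append(row); b.append(d_min)

res = linprog(c, A_ub=A, b_ub=b,
              bounds=[(None, None)] * 4 + [(0, None)], method="highs")
print(res.fun)   # 5.0: the minimum clock period, as used in Table 7.1
```

The reported minimum clock period of 5 matches the value used in Table 7.1.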
These results are summarized in Table 7.1 along with the actual permissible
range for each local data path for the minimum value of the clock period
TCP = 5 (recall that the permissible range depends upon the value of the
clock period TCP ).
Table 7.1. Clock schedule t1cd —clock skews and permissible ranges for the example circuit C1 (for the minimum clock period TCP = 5).

Local Data Path   Permissible Range   Clock Skew
R1⇝R3             [−2, 1]             t1cd − t3cd = 1 − 0 = 1
R3⇝R4             [−2.5, 0]           t3cd − t4cd = 0 − 2.5 = −2.5
R1⇝R2             [−1, 2]             t1cd − t2cd = 1 − 2 = −1
R3⇝R2             [−5, −2]            t3cd − t2cd = 0 − 2 = −2
R4⇝R2             [−2, 1]             t4cd − t2cd = 2.5 − 2 = 0.5
Note that most of the clock skews (specifically, the first four) listed in Ta-
ble 7.1 are at one end of the corresponding permissible range. This situation
2 The times used in this section are all assumed to be in the same time unit.
The actual time unit—e.g., picoseconds, nanoseconds, microseconds, milliseconds,
seconds—is irrelevant and is therefore omitted.
is due to the inherent feature of linear programming which seeks the objective
function extrema at the vertices of the solution space. In practice, however,
this situation can be dangerous since correct circuit operation is strongly de-
pendent on the accurate implementation of a large number of clock delays—
effectively, the clock skews—across the circuit. It is quite possible that the
actual values of some of these clock delays may fluctuate from the target
values—due to manufacturing tolerances as well as variations in temperature
and supply voltage—thereby causing a catastrophic timing failure of the cir-
cuit. Observe that while zero clocking failures can be corrected by operating the circuit at a slower speed (higher clock period TCP ), double clocking violations are race conditions that render the circuit nonfunctional unless delay padding is performed.
max M
subject to: ticd − tjcd + M ≤ TCP − D̂Pi,jM
ticd − tjcd − M ≥ −D̂Pi,jm (7.10)
M ≥ 0
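LCSS-SAFE (7.10) is likewise a pure LP once TCP is fixed. A sketch with scipy for the C1 example at TCP = 6.5, with the same inferred path delays as above (an assumption for illustration):

```python
from scipy.optimize import linprog

TCP = 6.5
# (i, j, D_PM, D_Pm): path delays inferred from Tables 7.1 and 7.2 (assumed values).
paths = [(0, 2, 4.0, 2.0), (2, 3, 5.0, 2.5), (0, 1, 3.0, 1.0),
         (2, 1, 7.0, 5.0), (3, 1, 4.0, 2.0)]

# Variables x = (t1..t4, M); maximize M, i.e., minimize -M.
c = [0, 0, 0, 0, -1]
A, b = [], []
for i, j, d_max, d_min in paths:
    row = [0.0] * 5
    row[i], row[j], row[4] = 1.0, -1.0, 1.0    # t_i - t_j + M <= T_CP - D_PM
    A.append(row); b.append(TCP - d_max)
    row = [0.0] * 5
    row[i], row[j], row[4] = -1.0, 1.0, 1.0    # t_j - t_i + M <= D_Pm
    A.append(row); b.append(d_min)

res = linprog(c, A_ub=A, b_ub=b,
              bounds=[(None, None)] * 4 + [(0, None)], method="highs")
print(-res.fun)   # 1.0: the safety margin M for T_CP = 6.5
```

The recovered margin M = 1 agrees with the TCP = 6.5 column of Table 7.2.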
Table 7.2. Solution of problem LCSS-SAFE for the example circuit C1 for clock periods TCP = 6.5 and TCP = 6, respectively.

t2cd → TCP = 6.5, M = 1, t2cd = [ 3/2 3/2 0 1/2 ]t ;   t3cd → TCP = 6, M = 2/3, t3cd = [ 4/3 5/3 0 1/3 ]t

(1)        (2)           (3)     (4)     (5)      (6)         (7)     (8)    (9)
R1⇝R3     [−2, 2.5]     1.5     0.25    1.25     [−2, 2]     4/3     0      4/3
R3⇝R4     [−2.5, 1.5]   −0.5    −0.5    0        [−2.5, 1]   −1/3    −3/4   5/12
R1⇝R2     [−1, 3.5]     0       1.25    1.25     [−1, 3]     −1/3    1      4/3
R3⇝R2     [−5, −0.5]    −1.5    −2.75   1.25     [−5, −1]    −5/3    −3     4/3
R4⇝R2     [−2, 2.5]     −1      0.25    1.25     [−2, 2]     −4/3    0      4/3

Columns: 1: local data path; 2, 6: permissible range; 3, 7: clock skew solution for this local data path; 4, 8: ideal clock skew value for this path (middle of the permissible range); 5, 9: distance (absolute value) of the clock skew solution from the ideal clock skew.

The permissible range for each local data path is listed in columns two and six, respectively, and the clock skew solution is listed in columns three and seven, respectively.
Note that there are two additional columns of data for either value of TCP
in Table 7.2. First, an ‘ideal’ objective value of the clock skew is specified for
each local data path in columns four and eight, respectively. This objective
value of the clock skew is chosen in this example to be the value corresponding
to the middle mi,j [note (7.4)] of the permissible range of a local data path
Ri ;Rj in a circuit with a clock period TCP . The middle point of the permissi-
ble range is equally distant from either end of the permissible range, thereby
providing the maximum tolerance to process parameter variations. Second,
the absolute value of the distance TSkew (i, j) − mi,j between the ideal and
actual values of the clock skew for a local data path is listed in columns five
and nine, respectively. This distance is a measure of the difference between
the ideal clock skew and the scheduled clock skew. Note that in the general
case, it is virtually impossible to compute a clock schedule tcd such that the
clock skew TSkew (i, j) for each local data path Ri ;Rj is exactly equal to the
middle mi,j of the permissible range of this path. This characteristic is due to structural limitations of the circuits, as will be highlighted in Section 7.2.
Problem LCSS-SAFE [see (7.10)] provides a solution to the clock skew schedul-
ing problem for the case where circuit reliability is of primary importance and
clock period minimization is not the focus of the optimization process. As
shown in Section 7.1.2, a certain degree of safety may be achieved by comput-
ing a feasible clock schedule subject to artificially smaller permissible ranges
[as defined in (7.10)]. However, Problem LCSS-SAFE is a brute force ap-
proach since it requires that the same absolute margins of safety are observed
for each permissible range regardless of the width of this range. Therefore,
this approach does not consider the individual characteristics of a permissi-
ble range and does not differentiate among local data paths with wider and
narrower permissible ranges.
It is possible to provide an alternative approach to clock skew scheduling
that considers all permissible ranges and also provides a natural quantitative
measure of the quality of a particular clock schedule. Consider, for instance,
a circuit with a target clock period TCP . Furthermore, denote an objective
clock skew value for a local data path Ri ;Rj by gi,j , where it is required that
li,j ≤ gi,j ≤ ui,j [recall the lower (7.1) and upper (7.2) bounds of the permissible
range]. For most practical circuits, it is unlikely that a feasible clock schedule
can be computed that is exactly equal to the objective clock schedule for each
local data path. Multiple linear dependencies among clock skews within each
circuit exist—those linear dependencies define a solution space such that the
clock schedule s = [ gi1 ,j1 gi2 ,j2 . . . ]t most likely is not within this solution space (unless the circuit is constructed of only non-recursive feed-forward paths). If
tcd is a feasible clock schedule, however, it is possible to evaluate how close a
realizable clock schedule is to the objective clock schedule by computing the
sum,

ε = Σ_{Ri⇝Rj} [TSkew (i, j) − gi,j ]^2 , (7.11)
Consider, for instance, the solution of LCSS-SAFE listed in Table 7.2 for TCP = 6.5 and TCP = 6. Computing the total error [as defined by (7.11)] for both solutions gives ε6.5 = 6.25 and ε6 = 1049/144 = 7.2847. Next, consider an alternative clock schedule t̂2cd for TCP = 6.5 as follows:

TCP = 6.5 → t̂2cd = [ 43/32 38/32 0 31/32 ]t . (7.12)

It can be verified that with t̂2cd as specified, ε6.5 improves to 675/128 = 5.2734 from 6.25 for t2cd [columns two (2) through five (5) in Table 7.2]. Similarly, an alternative clock schedule t̂3cd for the clock period TCP = 6 is

TCP = 6 → t̂3cd = [ 35/32 54/32 0 39/32 ]t . (7.13)

Again, using t̂3cd leads to an improvement of ε6 to 6.1484 as compared to 7.2847 for the solution of LCSS-SAFE t3cd (see Table 7.2, columns six through nine).
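The error computation in (7.11) is easy to reproduce numerically. A sketch using the Table 7.2 data for TCP = 6.5 (the path list and the ideal skews g are taken from the table; the alternative delays are those of (7.12)):

```python
# Local data paths of C1 as (i, j) register indices; skew is t[i] - t[j].
paths = [(1, 3), (3, 4), (1, 2), (3, 2), (4, 2)]
# Ideal skews g: middles of the permissible ranges for T_CP = 6.5 (Table 7.2).
g = [0.25, -0.5, 1.25, -2.75, 0.25]

def total_error(tcd, g):
    """Total quadratic error of (7.11) for clock delays tcd (1-indexed dict)."""
    return sum((tcd[i] - tcd[j] - gk) ** 2 for (i, j), gk in zip(paths, g))

t_safe = {1: 1.5, 2: 1.5, 3: 0.0, 4: 0.5}        # LCSS-SAFE solution (Table 7.2)
t_alt  = {1: 43/32, 2: 38/32, 3: 0.0, 4: 31/32}  # alternative schedule (7.12)

print(total_error(t_safe, g))  # 6.25
print(total_error(t_alt, g))   # 5.2734375 = 675/128
```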
To illustrate these concepts, the graph GC1 of the small circuit example C1
introduced in Section 7.1.1 is illustrated in Figure 7.1 (note the enumeration
and labeling of the edges as specified in Definition 5.3). For this example,
[Figure 7.1 depicts the graph: vertices v1 , v2 , v3 , and v4 , and edges e1 through e5 , each labeled with its permissible range [lk , uk ] and a direction marker.]

Fig. 7.1. Circuit graph of the simple example circuit C1 from Section 7.1.1.
Consider the circuit graph of C1 illustrated in Figure 7.1. The clock skews
for the local data paths R3 ;R2 , R3 ;R4 , and R4 ;R2 are s4 = TSkew (3, 2) =
t3cd − t2cd , s2 = TSkew (3, 4) = t3cd − t4cd , and s5 = TSkew (4, 2) = t4cd − t2cd ,
respectively. Note that s4 = s2 + s5 , i.e., the clock skews s2 , s4 , and s5 are
linearly dependent. In addition, note that other sets of linearly dependent
clock skews can be identified within C1 , such as, for example, s1 , s3 , and s4 .
Generally, large circuits contain many feedback and feed-forward signal
paths. Thus, many possible linear dependencies among clock skews—such as
those described in the previous paragraph—are typically present in such cir-
cuits. A natural question arises as to whether there exists a minimal set3 of
linearly independent clock skews which uniquely determines all clock skews
within a circuit. (The existence of any such set could lead to substantial
improvements in the run time of the clock scheduling algorithms as well as
permit significant savings in storage requirements when implementing these
algorithms on a digital computer.) It is generally possible to identify multiple
minimal sets within any circuit. Consider C1 , for example—it can be verified
that {s3 , s4 , s5 }, {s1 , s3 , s5 }, and {s1 , s4 , s5 } are each sets with the property
that (a) the clock skews within the set are linearly independent, and (b) every clock skew in C1 is a linear combination of the skews from the set.
3 Such that the removal of any element from the set destroys the property.
where the proof of (7.15) is trivial by substitution. The product on the left
side of (7.15) requires that there exists an edge between every pair of vertices
vik and vik+1 (k = 0, . . . , z − 1). The sum in (7.15) can be interpreted4 as
traversing the vertices of the cycle C = vi0 , ej0 , vi1 , . . . , ejz−1 , viz ≡ vi0 in
the order of appearance in C and adding the skews along C with a positive or
negative sign depending on whether the direction labeled on the edge coincides
with the direction of traversal.
Typically, multiple cycles can be identified in a circuit graph and an equation—such as (7.15)—can be written for each of these cycles. Referring to Figure 7.1, three such cycles,

C1 = v1 , e1 , v3 , e2 , v4 , e5 , v2 , e3 , v1
C2 = v2 , e4 , v3 , e2 , v4 , e5 , v2
C3 = v1 , e1 , v3 , e4 , v2 , e3 , v1 ,

can be identified, with the corresponding cycle equations

cycle C1 → s1 + s2 − s3 + s5 = 0 (7.16)
cycle C2 → s2 − s4 + s5 = 0 (7.17)
cycle C3 → s1 − s3 + s4 = 0. (7.18)
Note that the order of the summations in (7.16), (7.17), and (7.18) has been
intentionally modified from the order of cycle traversal so as to highlight an
important characteristic. Specifically, observe that (7.16) is the sum of (7.17)
and (7.18), that is, there exists a linear dependence not only among the skews
within the circuit C, but also among the cycles (or, sets of linearly dependent
skews).
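These dependencies can be confirmed numerically. A small sketch (the edge ordering s1 . . . s5 and the register indices are assumptions read off Figure 7.1):

```python
import numpy as np

# Rows of skew coefficients for cycles C1, C2, C3 in the edge order s1..s5,
# taken from (7.16)-(7.18).
c1 = np.array([1, 1, -1, 0, 1])   # s1 + s2 - s3 + s5 = 0
c2 = np.array([0, 1, 0, -1, 1])   # s2 - s4 + s5 = 0
c3 = np.array([1, 0, -1, 1, 0])   # s1 - s3 + s4 = 0

# (7.16) is the sum of (7.17) and (7.18): the three cycles are linearly dependent.
assert np.array_equal(c1, c2 + c3)

# Any skew schedule derived from clock delays satisfies every cycle equation.
rng = np.random.default_rng(0)
tcd = rng.normal(size=4)                          # delays for registers R1..R4
edges = [(0, 2), (2, 3), (0, 1), (2, 1), (3, 1)]  # e1..e5 as (i, j), skew = t_i - t_j
s = np.array([tcd[i] - tcd[j] for i, j in edges])
for row in (c1, c2, c3):
    assert abs(row @ s) < 1e-12

# Only p - r + 1 = 2 of the three cycles are independent.
assert np.linalg.matrix_rank(np.vstack([c1, c2, c3])) == 2
```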
Note that any minimal set of linearly independent clock skews must not
contain a cycle [as defined by (7.15)] for if the set contains a cycle, the skews
4 Note the similarity with Kirchhoff's Voltage Law (KVL or loop equations) for an electrical network [123].
within the set would not be linearly independent. Furthermore, any such set
must span all vertices (registers) of the circuit or it is not possible to express
the clock skews of any paths in and out of the vertices not spanned by the set.
Given a circuit C with r registers and p local data paths, these conclusions are
formally summarized in the following two results from graph theory [89, 124]:
1. Minimal Set of Linearly Independent Clock Skews. A minimal set of clock
skews can be identified such that (a) the skews within the set are linearly
independent, and (b) every skew in C is a linear combination of the skews
from the set. Such a minimal set is any spanning tree of GC and consists
of exactly r − 1 elements (recall that a spanning tree is a subset of edges
such that all vertices are spanned by the edges in the set). These r − 1
skews (respectively, edges) in the spanning tree are referred to as the skew
basis, while the remaining p − (r − 1) = p − r + 1 skews (edges) of the
circuit are referred to as chords. Note that there is a unique path between
any two vertices such that all edges of the path belong to the spanning
tree.
2. Minimal Set of Independent Cycles. A minimal set of cycles [where a
cycle is as defined by (7.15)] can be identified such that (a) the cycles are
linearly independent, and (b) every cycle in C is a linear combination of
the cycles from the set. Each choice of a spanning tree of GC determines
a unique minimal set of cycles, where each cycle consists of exactly one
chord vi1 , ej , vi2 plus the unique path that exists within the spanning tree
between the vertices vi1 and vi2 . Since there are p − (r − 1) = p − r + 1
chords, a minimal set of independent cycles consists of p−r+1 cycles. The
minimal set of independent cycles of a graph is also called a fundamental
set of cycles [89, 123, 124].
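A skew basis can be extracted with any spanning tree algorithm. The following sketch grows a tree greedily with union-find over the edge list of C1 (the edge list is an assumption read off Figure 7.1); note that it finds a different, equally valid spanning tree than the ones in Figure 7.2:

```python
# Circuit graph of C1: vertices are registers, edges are local data paths.
r, edges = 4, [(1, 3), (3, 4), (1, 2), (3, 2), (4, 2)]  # e1..e5
p = len(edges)

def spanning_tree(r, edges):
    """Grow a spanning tree greedily; returns indices of tree edges (basis)."""
    parent = list(range(r + 1))
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path compression
            v = parent[v]
        return v
    tree = []
    for k, (u, v) in enumerate(edges):
        ru, rv = find(u), find(v)
        if ru != rv:          # edge connects two components -> tree edge
            parent[ru] = rv
            tree.append(k)
    return tree

basis = spanning_tree(r, edges)
chords = [k for k in range(p) if k not in basis]
# r - 1 basis skews; p - r + 1 chords generate the fundamental cycles.
print(len(basis), len(chords))   # 3 2
```

For C1 , r − 1 = 3 basis edges and p − r + 1 = 2 chords are obtained, as required by the two results above.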
To illustrate the aforementioned properties, observe the two different
spanning trees of the example circuit C1 outlined with the thicker edges in
Figure 7.2 (the permissible ranges and direction labelings have been omitted
from Figure 7.2 for simplicity). The first tree is shown in Figure 7.2(a) and
consists of the edges {e3 , e4 , e5 } and the independent cycles C2 [see (7.17)] and
C3 [see (7.18)]. As previously explained, both C2 and C3 contain precisely one
of the skews not included in the spanning tree—s2 for C2 and s1 for C3 . Simi-
larly, the second spanning tree {e1 , e3 , e5 } is illustrated in Figure 7.2(b). The
independent cycles for the second tree are C1 [see (7.16)] and C3 [see (7.18)]—
generated by s2 and s4 , respectively.
Let a circuit C with r registers and p local data paths be described by
a graph G and let a skew basis (spanning tree) for this circuit (graph) be
identified. For the remainder of this discussion, it is assumed that the skews
have been enumerated such that those skews from the skew basis have the
highest indices.5 Introducing the notation sb for the basis and sc for the chords,
the clock schedule s can be expressed as
5 Such enumeration is always possible since the choice of indices for any enumeration (including this example) is arbitrary.
[Figure 7.2 depicts the circuit graph of C1 twice, with edges labeled e1 (s1 ) through e5 (s5 ): in (a), the spanning tree {e3 , e4 , e5 } is drawn with thicker edges; in (b), the spanning tree {e1 , e3 , e5 } is drawn with thicker edges.]

Fig. 7.2. Two spanning trees and the corresponding minimal sets of linearly independent clock skews and linearly independent cycles for the circuit example C1 . Edges from the spanning tree are indicated with thicker lines.
s = [ sc sb ] = [ s1 . . . sp−r+1 sp−r+2 . . . sp ]t , (7.19)

where the first p − r + 1 skews (the chords) form sc and the last r − 1 skews (the basis) form sb :

sc = [ s1 . . . sp−r+1 ]t and sb = [ sp−r+2 . . . sp ]t . (7.20)
Note that the case illustrated in Figure 7.2(a) is precisely the type of enumer-
ation just described by (7.19) and (7.20)—e1 , e2 (s1 , s2 ) are the chords and
e3 , e4 , e5 (s3 , s4 , s5 ) are the basis.
Each of the nc = p − r + 1 fundamental cycles of a circuit graph yields a cycle equation of the form of (7.15),

cycle Ck → ±sj0 ± sj1 ± · · · ± sjz−1 = 0, k = 1, . . . , nc , (7.21)

and these nc equations can be collected into the single matrix equation

Bs = 0, (7.22)
Consider, for instance, the choice of spanning tree illustrated in Figure 7.2(a). There are two independent cycles denoted by C1 [corresponding to C2 in (7.17)] and C2 [corresponding to C3 in (7.18)]. The matrix relationship (7.22) for this case is

s1 − s3 + s4 = 0 ← cycle C1 = v1 , e1 , v3 , e4 , v2 , e3 , v1
s2 − s4 + s5 = 0 ← cycle C2 = v3 , e2 , v4 , e5 , v2 , e4 , v3
6 Recall that an identity matrix In is a square n × n matrix such that the only nonzero elements are on the main diagonal and are all equal to one.
From an algebraic standpoint [125], (7.22) requires that any clock schedule s must necessarily be in the kernel ker(B) of the linear transformation B : Rp → Rnc , i.e., s ∈ ker(B). The inverse situation, however, is not true, that is, an arbitrary element of the kernel is not necessarily a feasible clock schedule. Furthermore, note that B is already in reduced row echelon form [125] so the rank of B is rank(B) = nc . Thus, the dimension of ker(B) is [125] dim ker(B) = p − rank(B) = p − nc = r − 1 = nb .
The scalars, sb1 , sb2 , . . . , sbnb , in (7.28) are the elements of the vector sb [as defined by (7.19)]:

sb = [ sb1 sb2 · · · sbnb ]t = [ snc+1 snc+2 · · · sp ]t . (7.29)
Observe that either knowing or deliberately choosing sb not only provides
sufficient information to determine the corresponding sc (respectively, the
entire s), but also permits computation of the clock delays tcd to implement
the desired clock schedule s. Specifically, the dependencies among the clock
skews in the branches (the local data paths) and the clock delays to the
vertices (the registers) can be described in matrix form as follows:
sb = Tnb×r tcd . (7.30)
Note that each skew is the difference of two clock delays so that each row
of the matrix T in (7.30) contains exactly two nonzero elements. These two
nonzero elements are 1 and −1, respectively, depending upon which two clock
delays determine the clock skew corresponding to this equation (or row in the
matrix). Also note that (7.30) is a consistent linear system (the rows corre-
spond to linearly independent skews within the circuit) with fewer equations
than the r unknown clock delays tcd . Therefore, (7.30) has an infinite number
of solutions all corresponding to the same clock schedule s.
Finding a solution tcd of (7.30) is now a straightforward matter. For ex-
ample, setting trcd = 0 and rewriting (7.30) to account for this substitution,
trcd = 0 ⇒ sb = T∗nb×nb t∗cd , (7.31)

yields a consistent linear system with the same number of variables as equations, where the matrix T∗nb×nb is the matrix Tnb×r with the rightmost column deleted and t∗cd is tcd with the last element (trcd = 0) deleted. The most efficient way to solve the system characterized by (7.31)
with the highest accuracy is by back substitution (only addition/subtraction
operations are necessary). In the software implementation of this algorithm
discussed in this work, tcd is computed in an efficient way by traversing the
edges of the spanning tree.
This section concludes by illustrating these concepts on the small circuit example C1 [the circuit graph GC1 is shown in Figure 7.1 and the respective spanning tree is shown in Figure 7.2(a)]. For this circuit, r = 4, the number of local data paths is p = 5 and nb = 4 − 1 = 3. The clock schedule is

s = [ sc sb ]t , where sc = [ s1 s2 ]t and sb = [ s3 s4 s5 ]t . (7.32)
The independent cycles are C2 [from (7.17)] and C3 [from (7.18)] and the
matrices B and C are as defined in (7.25). A basis for the kernel of B has a
dimension nb = 3 and consists of the vectors,
[ 1 0 1 0 0 ]t , [ −1 1 0 1 0 ]t , and [ 0 −1 0 0 1 ]t . (7.33)

Any clock schedule is in ker(B) and can be expressed as a linear combination of the vectors from the kernel basis,

s = sb3 [ 1 0 1 0 0 ]t + sb4 [ −1 1 0 1 0 ]t + sb5 [ 0 −1 0 0 1 ]t . (7.34)
Consider, for instance, the clock skew schedule for TCP = 6.5 shown in Ta-
ble 7.2. Substituting s3 = 0, s4 = −1.5 and s5 = −1 into (7.34) yields the
clock schedule,
s = 0·[1  0  1  0  0]^t − 1.5·[−1  1  0  1  0]^t − 1·[0  −1  0  0  1]^t = [1.5  −0.5  0  −1.5  −1]^t. (7.35)
Finally, the clock delays tcd are derived from the underdetermined linear
system [as described by (7.30)],
s^b = [0; −1.5; −1] = [1  −1  0  0; 0  −1  1  0; 0  −1  0  1] [t_cd^1; t_cd^2; t_cd^3; t_cd^4]. (7.36)
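The example can be verified numerically. The following sketch recomputes (7.34) through (7.36) with plain Python lists; the clock delay values t_cd = (1, 1, −0.5, 0) are one particular solution of (7.36), obtained by setting t_cd^4 = 0:

```python
# Kernel basis vectors of B for the example circuit C1 [see (7.33)]:
v1 = [1, 0, 1, 0, 0]
v2 = [-1, 1, 0, 1, 0]
v3 = [0, -1, 0, 0, 1]

# Linear combination (7.34) with s3 = 0, s4 = -1.5, s5 = -1:
s3, s4, s5 = 0.0, -1.5, -1.0
s = [s3 * a + s4 * b + s5 * c for a, b, c in zip(v1, v2, v3)]
print(s)  # [1.5, -0.5, 0.0, -1.5, -1.0], the clock schedule of (7.35)

# One solution of the underdetermined system (7.36), with t_cd^4 = 0:
t_cd = [1.0, 1.0, -0.5, 0.0]
T = [[1, -1, 0, 0],   # s3 = t_cd^1 - t_cd^2
     [0, -1, 1, 0],   # s4 = t_cd^3 - t_cd^2
     [0, -1, 0, 1]]   # s5 = t_cd^4 - t_cd^2
s_b = [sum(T[r][j] * t_cd[j] for j in range(4)) for r in range(3)]
print(s_b)  # [0.0, -1.5, -1.0], the skew basis s^b
```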
Let C be a circuit with r registers, p local data paths and a target clock
period TCP , and let the local data paths be enumerated as
p local data paths: path1 → R_i1 ⇝ R_j1, …, path_p → R_ip ⇝ R_jp. (7.39)
For each local data path path_k (R_ik ⇝ R_jk) within C, let the lower bound l_{ik,jk}, upper bound u_{ik,jk}, width w_{ik,jk}, and middle m_{ik,jk} of the permissible range of this path be defined as in (7.1), (7.2), (7.3), and (7.4), respectively. For simplicity, these parameters of the permissible range are denoted with a single subscript corresponding to the number of the respective local data path, that is, for path_k ≡ R_ik ⇝ R_jk: l_{ik,jk} = l_k, u_{ik,jk} = u_k, w_{ik,jk} = w_k, and m_{ik,jk} = m_k. Furthermore, let the circuit graph of C be G_C, let the skew basis s^b and chords s^c be identified in G_C [according to (7.19)], and let the corresponding
min ε = Σ_{k=1}^{p} (s_k − g_k)²
subject to: Bs = 0 (7.40)
           l ≤ s ≤ u,
Phase 1 → min ε = Σ_{k=1}^{p} (s_k − g_k)² subject to: Bs = 0 (7.41)
Phase 2 → Iterative refinement of s,
min ε = ‖s − g^τ‖² = Σ_{k=1}^{p} (s_k − g_k^τ)²
subject to: Bs = 0. (7.42)
It is well known [125, 130] that if the kernel of D is ker(D) = {0}, then x∗ is
the solution of the consistent system Dt Dx = Dt b.
The quadratic programming problem QP-2 is solved by applying the clas-
sical method of Lagrange multipliers for constrained optimization [131, 129,
130]. To start, note that minimizing the objective function ε in (7.42) is equiv-
alent to minimizing the function,
ε* = s^t s − 2g^t s.
140 7 Clock Skew Scheduling for Improved Reliability
min ε* = s^t s − 2g^t s
subject to: Bs = 0. (7.45)

L(s, λ) = ε* + λ^t Bs = s^t s − 2g^t s + λ^t Bs, (7.46)
where the term λt Bs in (7.46) is the sum over all equality constraints of the
product of the i-th constraint times the multiplier λi .
Any extremum of ε∗ must be a stationary point of the Lagrangian
L(s, λ) [125], that is, the first derivatives of L(s, λ) with respect to si where
i ∈ {1, . . . , p} and λj where j ∈ {1, . . . , nc } must be zero. Formally, if the
differential operator is denoted as ∇, then any stationary point (s∗ , λ∗ ) of
L(s, λ) is a solution of the system of equations,
∇L(s, λ) = 0 ⇒ { ∇_s L(s, λ) = 0, ∇_λ L(s, λ) = 0 }. (7.47)
and ∇_λ L(s, λ) = ∇_λ (s^t s − 2g^t s + λ^t Bs) = Bs. (7.49)
Note that (7.48) and (7.49) contain p and nc equations, respectively (recall
that s and λ have p and nc variables, respectively). Therefore, the solution
of (7.47) requires finding exactly p + nc = 2p − nb = 2p − r + 1 variables.
Substituting (7.48) and (7.49) back into (7.47) yields the linear system,
2s + B^t λ = 2g
Bs = 0, (7.50)
A natural way to solve the linear system described by (7.52) is by back substi-
tution,7 such that λ is initially computed, followed by the computation of s.
The Lagrange multipliers λ are determined from the equation (BB^t)λ = 2Bg in the second row of (7.52), where the right-hand side 2Bg is a non-zero vector, that is, Bg ≠ 0. The opposite situation, Bg = 0, is highly unlikely to occur since Bg = 0 means that g ∈ ker(B), which in turn means [recall (7.26) through (7.29)] that the objective clock schedule g is feasible and no optimization needs to be performed.⁸
Therefore, the equation (BBt )λ = 2Bg in (7.52) can have either no so-
lutions or exactly one solution depending upon whether the matrix BBt is
singular or not. In other words, the non-singularity of BB^t is a necessary and sufficient condition for the existence of a unique solution [ŝ^t λ̂^t]^t of (7.51). If the product BB^t is denoted by M, note that the symmetric n_c × n_c matrix,
⁷ Since the coefficient matrix is an upper triangular matrix.
⁸ The chances of g being feasible for a large real circuit are infinitesimally small.
M = BB^t = [I  C] [I; C^t] = I + CC^t, (7.53)
λ̂ = 2M^{−1} Bg (7.54)
ŝ = −(1/2) B^t λ̂ + g = −(B^t M^{−1} B) g + g, (7.55)
where the matrix M is as introduced in (7.53).
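The closed-form solution (7.54) through (7.55) can be exercised on a small instance. The sketch below uses a constraint matrix B = [I C] consistent with the running example (C chosen so that s1 = s3 − s4 and s2 = s4 − s5) and a hypothetical objective schedule g; the 2 × 2 matrix M is inverted directly here, whereas a production implementation would factorize it:

```python
def qp_solution(B, g):
    """Compute s_hat = g - B^t M^{-1} B g with M = B B^t [see (7.55)],
    via the Lagrange multipliers lambda_hat = 2 M^{-1} B g [see (7.54)].
    Sketch for n_c = 2 equality constraints; M is inverted explicitly."""
    nc, p = len(B), len(B[0])
    Bg = [sum(B[i][k] * g[k] for k in range(p)) for i in range(nc)]
    M = [[sum(B[i][k] * B[j][k] for k in range(p)) for j in range(nc)]
         for i in range(nc)]
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    M_inv = [[ M[1][1] / det, -M[0][1] / det],
             [-M[1][0] / det,  M[0][0] / det]]
    lam = [2.0 * sum(M_inv[i][j] * Bg[j] for j in range(nc))
           for i in range(nc)]
    # s_hat = g - (1/2) B^t lambda_hat
    return [g[k] - 0.5 * sum(B[i][k] * lam[i] for i in range(nc))
            for k in range(p)]

B = [[1, 0, -1, 1, 0],   # B = [I C] for the example circuit
     [0, 1, 0, -1, 1]]
g = [1.0, 0.5, 0.2, -1.0, -0.3]   # hypothetical objective clock schedule
s_hat = qp_solution(B, g)
# The projected schedule is feasible: B s_hat = 0.
print(all(abs(sum(B[i][k] * s_hat[k] for k in range(5))) < 1e-12
          for i in range(2)))  # True
```

Geometrically, ŝ is the orthogonal projection of g onto ker(B), which is why Bŝ vanishes regardless of the chosen g.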
To gain further insight into the solution described by (7.51) through (7.55),
consider substituting (7.23) for B into (7.51), and representing the vector
column g of the objective clock skew schedule as
g = [g^c; g^b], (7.56)
where the coefficient matrix K on the left is symmetric. In (7.57), the Gaussian elimination step described by (7.52) is equivalent to multiplying the first row of K by 1/2, premultiplying the second row of K by (1/2)C, and subtracting both of these rows from the third row:
[2I  0  I; 0  2I  C^t; I  C  0] [s^c; s^b; λ] = 2 [g^c; g^b; 0]
⇒ [2I  0  I; 0  2I  C^t; 0  0  I + CC^t] [s^c; s^b; λ] = 2 [g^c; g^b; g^c + Cg^b]. (7.58)
Observe that the linear system of (7.58) is simply a more detailed representation of the linear system described by (7.52), where the first row of (7.52) has been expanded into the first two rows of (7.58):
BB^t = [I  C] [I; C^t] = I + CC^t (7.59)

Bg = [I  C] [g^c; g^b] = g^c + Cg^b. (7.60)
¹ Note that clock skew scheduling also entails delay insertion, however, into the clock distribution network.
² Dependence of the minimum clock period T_CP on the uncertainty of data propagation times (between D̂_Pm^{i,f} and D̂_PM^{i,f} for R_i ⇝ R_f) is visible in the Problem LCSS definition in Section 7.1.1. The linear dependency of clock skew values on data path cycles is explained in Section 7.2.2.
I.S. Kourtev et al., Timing Optimization Through Clock Skew Scheduling, 145
DOI: 10.1007/978-0-387-71056-3_8,
© Springer Science+Business Media LLC 2009
146 8 Delay Insertion and Clock Skew Scheduling
In this section, the limitations on the minimum clock period caused by all three factors are derived as applied to edge-triggered circuits. The limitations for level-sensitive circuit implementations can be derived similarly. It is shown that through systematic delay insertion, the limitation on the minimum clock period achievable through clock skew scheduling can be mitigated. In other words, the improvements achieved through clock skew scheduling can be further increased by inserting additional delays into the logic network, simultaneously with the application of clock skew scheduling. For a fully-automated application, the proposed delay insertion method is implemented as a Linear Programming (LP) problem in the tradition of the clock skew scheduling applications presented in Chapters 5 and 7. The application of the delay insertion method is demonstrated for both edge-triggered and level-sensitive circuits.
Fig. 8.1. Limitation on the minimum clock period T_CP caused by the delay uncertainty of a local data path: (a) local data path R_i ⇝ R_f; (b) delay uncertainty in timing diagram.
ignored for the sake of simplicity. For such a critical timing path, the setup and
hold time constraints (that are modeled with inequalities) satisfy the equality
conditions.³ Due to this limitation, the clock period cannot be minimized any further than:

min T_CP = max_{∀R_i ⇝ R_f} [(Δ_L^{Fi} + D_CQM^{Fi} + D_PM^{i,f} + δ_S^{Ff}) − (Δ_L^{Fi} + D_CQm^{Fi} + D_Pm^{i,f} − δ_H^{Ff})]
         = max_{∀R_i ⇝ R_f} (D̂_PM^{i,f} − D̂_Pm^{i,f}). (8.1)
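Assuming the effective path delays D̂ are given, the bound (8.1) amounts to taking the worst delay spread over all local data paths; a minimal sketch (the function name and input values are hypothetical):

```python
def min_period_delay_uncertainty(path_spreads):
    """Lower bound (8.1) on T_CP: the largest difference between the
    effective maximum delay D_hat_PM and the effective minimum delay
    D_hat_Pm over all local data paths.

    path_spreads: list of (d_hat_max, d_hat_min) pairs, one per path.
    """
    return max(d_max - d_min for d_max, d_min in path_spreads)

# Three hypothetical local data paths with (D_hat_PM, D_hat_Pm) values:
print(min_period_delay_uncertainty([(2.0, 1.0), (3.0, 2.25), (1.75, 0.5)]))
# 1.25
```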
The shaded region in Figure 8.1 illustrates the timing criticality, causing the
limitation on TCP .
Limitations due to data path cycles occur due to the linear dependency of
clock skews of the local data paths on a cycle, as explained in Section 7.2.2.
In a zero clock skew circuit, the circuit topology is irrelevant in the timing
analysis because each local data path is analyzed independent of any neigh-
boring paths. The timing of neighboring local data paths in a non-zero clock
skew circuit, however, is interdependent. For a cycle of local data paths, this
interdependency regains the form described in Section 7.2.2. In this linear de-
pendency form, the minimum clock period is limited by the (timing) criticality
of the local data paths along the cycle (in addition to the limitations caused
by the delay uncertainties of each local data path along the cycle, which are
the limitations explained in Section 8.1.1). The data path cycle limitation is
illustrated for a sample local data path cycle in Figure 8.2.
The cyclic traveling path for the data signal over a data path cycle, such
as the example circuit shown in Figure 8.2, leads to stringent operating condi-
tions under non-zero clock skew. The local data paths along the cycle operate
without any slack time, because any existing slack on these local data paths
is distributed over the paths through the mechanics of the clock skew schedul-
ing process. In such circuits where a data path cycle is (timing) critical, the
minimum clock period depends on two factors. The first factor is the number of registers n along the cycle: the data signal requires n clock periods to complete one traversal of the cycle. The second
factor is the total delay of the data signal over the local data paths along the cycle. This total delay time includes the setup time δ_S^{Ff} and maximum clock-to-output time D_CQM^{Fi} of each register along the cycle, the maximum data propagation time D_PM^{i,f} of each local data path along the cycle, and the tolerances of the clock signal (which are ignored below for simplicity). The
³ These constraints have no available slack for improvement.
8.1 Limitations on Minimum Clock Period 149
Fig. 8.2. Limitation on the minimum clock period T_CP caused by data path cycles: (a) a sample local data path cycle; (b) timing diagram of the data signal traversing the cycle over nT_CP.
limitation on the minimum clock period by the data path cycles is given by the following formula:

min T_CP = [ Σ_{∀R_i ⇝ R_f on cycle} (Δ_L^{Fi} + D_CQM^{Fi} + D_PM^{i,f} + δ_S^{Ff}) ] / n
         = [ Σ_{∀R_i ⇝ R_f on cycle} D̂_PM^{i,f} ] / n (8.2)
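As a sketch of (8.2), assuming the effective maximum delays D̂_PM of the paths along the cycle are given, the bound is simply their sum divided by the number of registers n (names and values are hypothetical):

```python
def min_period_cycle(effective_max_delays):
    """Lower bound (8.2) on T_CP for a data path cycle: the sum of the
    effective maximum delays D_hat_PM along the cycle, divided by the
    number n of registers (one local data path per register on the cycle)."""
    n = len(effective_max_delays)
    return sum(effective_max_delays) / n

# A hypothetical cycle of three local data paths:
print(min_period_cycle([1.5, 2.0, 2.5]))  # 2.0
```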
The shaded region in Figure 8.2 illustrates the timing criticality, causing the
limitation on TCP .
Fig. 8.3. Limitation on the minimum clock period T_CP caused by reconvergent paths: (a) a sample reconvergent path system from R_d to R_c, with path1 through m registers R_i1, …, R_im and path2 through n registers R_j1, …, R_jn, and path delays [PD_m^{path2}, PD_M^{path2}]; (b) reconvergent path system timing diagram over |m − n + 1|T_CP.
one or more of the reconvergent paths in order to decrease the algebraic difference PD_M^{path1} − PD_m^{path2} of (8.3), which consequently improves the minimum clock period T_CP. Note that it is possible to increase the path delay PD_m^{path2} without increasing PD_M^{path1} because both paths are determined by two different series of local data paths.⁴
⁴ The minimum and maximum total data propagation times along a reconvergent system may be observed on the same reconvergent path. In such a case, delay insertion is not beneficial.
Fig. 8.4. A simple reconvergent data path system: two paths p12a and p12b from R_1 to R_2, with [D_Pm^{12b}, D_PM^{12b}] = [PD_m^{12b}, PD_M^{12b}] = [0.6, 0.7].
Two circuits with the topology presented in Figure 8.4 are analyzed in Sections 8.2.2 and 8.2.3: the edge-triggered circuit S_FF and the level-sensitive circuit S_L, respectively.
For edge-triggered circuits, the data signals depart the registers a clock-to-output delay (D_CQ) after the latching edge of the clock signal. Consequently, in S_FF, the signal Q1 (recall Figure 4.12 on page 55) departs R_1 a clock-to-output delay D_CQ after the positive clock edge and propagates along the reconvergent paths. In order to satisfy the short path constraints, the arrival of the data signals X2a and X2b at R_2 must occur δ_H^{F2} later than the positive edge of the previous clock cycle at R_2. Similarly, in order to satisfy the long path constraints, the arrivals must occur δ_S^{F2} earlier than the positive edge of the current clock cycle at R_2:

δ_H^{F2} ≤ a_2 ≤ A_2 ≤ T_CP − δ_S^{F2}. (8.8)

Fig. 8.5. Timing of the edge-sensitive reconvergent system in Figure 8.4 after CSS.
Next, suppose clock skew scheduling for clock period minimization is applied to an arbitrary edge-triggered circuit which involves a reconvergent data path system. After clock skew scheduling, if at least one of the reconvergent paths becomes a critical timing path, the earliest and latest arrival times of the data signal at the critical convergent node are at marginal values. Accordingly, for S_FF, the arrival times a_2 and A_2 satisfy

δ_H^{F2} = a_2 ≤ A_2 = T_min − δ_S^{F2}, (8.9)

where T_min is the minimum clock period achievable by clock skew scheduling. The constraints in (8.9) are illustrated in Figure 8.5. C1 and C2 are the clock signals synchronizing registers R_1 and R_2, respectively. Also illustrated in Figure 8.5 is the separation between A_2 + δ_S^{F2} and a_2 − δ_H^{F2}, defining the minimum clock period:

T_min = (A_2 + δ_S^{F2}) − (a_2 − δ_H^{F2}). (8.10)
Note that the data arrival times at R2 are given by the constraints similar to
the discussion in Section 4.7:
a_2 = min(d_1 + D_Pm^{12a} − T_min, d_1 + D_Pm^{12b} − T_min)
    = d_1 + min(D_Pm^{12a}, D_Pm^{12b}) − T_min, (8.11)

A_2 = max(D_1 + D_PM^{12a} − T_min, D_1 + D_PM^{12b} − T_min)
    = D_1 + max(D_PM^{12a}, D_PM^{12b}) − T_min. (8.12)
8.2 Delay Insertion Method 155
Let the minimum and maximum system delays define the real number interval Λ, such that:

Λ = [SD_m^{dc}, SD_M^{dc}]. (8.16)
By definition, the minimum possible algebraic difference between the maxi-
mum and minimum path delays of each reconvergent path after delay inser-
tion (defining the minimum possible clock period) is the minimum length of
interval Λ (after delay insertion).
In order to compute the minimum length |Λ| of interval Λ achievable
through delay insertion, the difference [max(Λ) − min(Λ)] is computed. Re-
calling (8.4) and (8.5), the following is derived:
min(Λ) = SD_m^{dc} = min(PD_m^{pA}, PD_m^{pB}, …, PD_m^{pK}), (8.17)
max(Λ) = SD_M^{dc} = max(PD_M^{pA}, PD_M^{pB}, …, PD_M^{pK}). (8.18)
Let the real number delay intervals formed by the minimum and maximum delay values of the paths p_A, p_B, …, p_K be represented by A, B, …, K, respectively. In other words, a delay interval L associated with the path p_L ∈ {p_A, p_B, …, p_K} is formed by L = [PD_m^{pL}, PD_M^{pL}]. One of the following possibilities defining the expression |Λ| = max(Λ) − min(Λ) must hold:
P1. A delay interval M ∈ {A, . . . , K} determines both the minimum min(Λ)
and maximum max(Λ) values of the interval Λ. Then, Λ = M and |Λ| =
|M | = max(Λ) − min(Λ) = max(M ) − min(M ),
P2. Otherwise, two non-identical delay intervals determine the minimum
and maximum values of the interval Λ. Then, ∀L ∈ {A, . . . , K}: |Λ| =
max(Λ) − min(Λ) > max(L) − min(L).
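The case analysis above can be sketched as a small helper that, given the delay intervals of the reconvergent paths and assuming zero-uncertainty delay elements, reports the current |Λ| and the smallest |Λ| achievable by delay insertion (the interval values below are hypothetical, not those of Figure 8.7):

```python
def lambda_lengths(intervals):
    """Given delay intervals {name: (PD_min, PD_max)} of the reconvergent
    paths, return (current |Lambda|, minimum |Lambda| achievable through
    zero-uncertainty delay insertion). Under (P1) a single interval spans
    Lambda and nothing can be improved; under (P2) the best achievable
    length is that of the widest individual interval."""
    lo = min(v[0] for v in intervals.values())
    hi = max(v[1] for v in intervals.values())
    current = hi - lo
    p1 = any(v == (lo, hi) for v in intervals.values())
    widest = max(v[1] - v[0] for v in intervals.values())
    return current, current if p1 else widest

# A (P2) system: min(Lambda) and max(Lambda) come from different paths,
# so delay insertion can shrink |Lambda| from 12 down to 9.
print(lambda_lengths({'B': (5.0, 14.0), 'C': (2.0, 8.0)}))  # (12.0, 9.0)
```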
For systems satisfying (P1), the minimum length for Λ is already given by
|Λ| = |M |. The minimum interval length, thus the minimum clock period,
cannot be changed by delay insertion. For systems satisfying (P2), the delay insertion method is used to modify one or more of the delay intervals in Λ
Fig. 8.6. The simple reconvergent system in Figure 8.4 after delay insertion: [PD_m^{12a}, PD_M^{12a}] = [1.0, 1.2]; [PD_m^{12b}, PD_M^{12b}] = [0.6, 0.7] + [0.1, 0.2] = [0.7, 0.9].
Fig. 8.7. Two reconvergent data path systems satisfying (P1) and (P2), respectively: (i) |Λ| = |D| = max(D) − min(D) = 9; (ii) |Λ| = max(B) − min(C) = 12; (iii) |Λ| = |B| = max(B) − min(B) = 9 < 12; (iv) |Λ| = |B| = max(B) − min(B) = 11 < 12.
the interval length. If the optimal values of delay elements are inserted on each path, the minimum possible |Λ| is achieved by ensuring that the largest delay interval M ∈ {A, …, K} becomes the interval Λ. In the modification of the
sample system shown in cases (ii) and (iii) of Figure 8.7, the delay interval B
is promoted to become this biggest delay interval M such that both min(Λ)
and max(Λ) are determined by delay interval B (i.e. delay interval B becomes
Λ). The intervals before and after delay insertion on the sample system are
demonstrated in cases (ii) and (iii) of Figure 8.7, respectively.
There are two important points to note here. First, the solution set of
the inserted delay values is not unique (remember similar discussions in Sec-
tions 6.1.5 and 6.4.1). For instance, the delay inserted on the path defining
delay interval C in case (iii) of Figure 8.7 can be any value between 6 and
12 time units (|C| = 3) to satisfy the computed minimum interval. Similarly,
the delay values inserted on all paths can simultaneously be increased by any
identical amount (e.g. x time units) to generate an alternative solution. This
non-unique solution set property provides a certain range of safety against any
inherent uncertainty or unavailability of exact values of the delay elements.
The second important point to note is that after delay insertion, the in-
terval lengths are preserved only if the inserted delay elements have no delay
uncertainty. In demonstrating case (ii) of Figure 8.7, delay values with no
uncertainties are considered in order to simplify the presentation of the delay
insertion method. In reality, delay elements have delay uncertainties just like
any other circuit component. These delay uncertainties of the delay elements
are accrued over the associated delay intervals. Let the delay uncertainty of
the delay element inserted on path L be represented by U L . The application of
delay insertion to the sample system presented in case (ii) of Figure 8.7, where
the delay uncertainties of the delay elements are accounted for, is presented in
case (iv) of Figure 8.7. Note that due to the differences in the accrued delay
uncertainties for each delay interval, the interval determining the minimum
possible length for interval Λ can be different compared to the ideal case pre-
sented in case (iii). Incidentally, for cases (iii) and (iv) of Figure 8.7, the delay
intervals determining the minimum possible length for Λ are B and A, re-
spectively. Also, in a worst case scenario, the accrued delay intervals can end
up being larger compared to the minimum length for Λ presented in case (ii).
In the problem formulation presented later in Section 8.3, delay elements are
realistically modeled with uncertainties.
Reflecting the proposition on a general reconvergent circuit, there are two
possibilities in computing the minimum algebraic difference of (8.15):
P1*. The minimum and maximum system delays of the reconvergent data
path system between Rd and Rc are determined by the same reconvergent
path,
P2*. The minimum and maximum system delays of the reconvergent data
path system between Rd and Rc are determined by two non-identical
reconvergent paths.
Assuming zero delay uncertainty and substituting the numerical values, the minimum clock period T*_min of S_FF after clock skew scheduling with the delay insertion method is T*_min = 1.2 − 1.0 = 0.2. The improvement achieved through delay insertion over circuits with clock skew scheduling is computed with the formula [(T_min − T*_min)/T_min] × 100. Substituting the values, the improvement is computed as [(0.6 − 0.2)/0.6] × 100 = 66.7%.
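The improvement figure can be reproduced with one line of arithmetic:

```python
def improvement_percent(t_min_css, t_min_di):
    """Improvement of CSS plus delay insertion over CSS alone,
    [(T_min - T_min*) / T_min] * 100, as used above."""
    return (t_min_css - t_min_di) / t_min_css * 100.0

# S_FF example: T_min = 0.6 with CSS only, T_min* = 0.2 with delay insertion.
print(round(improvement_percent(0.6, 0.2), 1))  # 66.7
```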
The computation of the amount of delays to be inserted on each path is
integrated into the clock skew scheduling algorithm. For simplicity, continuous delay models are considered here. The revised clock skew scheduling
algorithm and initial insight for a general analysis using discrete delay models
are presented in Sections 8.3 and 8.4.
When clock skew scheduling is applied to S_L, the earliest and latest arrival times at R_2 satisfy

δ_H^{L2} = a_2 ≤ A_2 = T_min − δ_S^{L2}, (8.21)

as illustrated in Figure 8.8. Using the same derivation as (8.10) and (8.13), and assuming D_CQ^{L1} = D_DQ^{L1} and d_1 = D_1 for practical reasons:

T_min = max(PD_M^{12a}, PD_M^{12b}) − min(PD_m^{12a}, PD_m^{12b}) + δ_S^{L2} + δ_H^{L2}. (8.22)

Substituting the numerical values into the equation and assuming zero internal register delays, the minimum clock period T_min of S_L after clock skew scheduling is T_min = 0.6.
The delay insertion method can also be used on level-sensitive circuits in
order to improve the minimum clock period. The minimum clock period of
Fig. 8.8. Timing of the simple level-sensitive reconvergent system in Figure 8.4 after CSS.
S_L with clock skew scheduling and delay insertion is given by the following formula:

T*_min = max_{∀α∈{a,b}} (PD_M^{12α} − PD_m^{12α} + U^{12α}) + δ_S^{L2} + δ_H^{L2}. (8.23)

The minimum clock period T*_min of S_L after clock skew scheduling and delay insertion is computed as T*_min = 1.2 − 1.0 = 0.2, leading to an improvement of 66.7% over the circuit with clock skew scheduling. The revised clock skew scheduling algorithm for level-sensitive circuits is presented in Section 8.3.
Note that the earliest and latest data departure times d1 and D1 , re-
spectively, from a register R1 can be non-identical in a level-sensitive circuit.
Figure 8.8 illustrates one such case, where d1 and D1 occur at the leading
and trailing edges of the clock signal, respectively. In such cases, the formulae
in (8.22) and (8.23) do not hold true; the minimum clock period, however, remains directly proportional to the algebraic difference between the maximum
and minimum path delays between R1 and R2 . The delay insertion algorithm
is fully applicable to all level-sensitive circuits, as the referred algebraic differ-
ence can ultimately be modified with delay insertion leading to improvements
in the minimum clock period.
[Figure: a general reconvergent path system from R_d to R_c, with path p^{d{i1…im}c} through registers R_i1, …, R_im and path p^{d{j1…jn}c} through registers R_j1, …, R_jn, with delays [PD_m^{d{j1…jn}c}, PD_M^{d{j1…jn}c}].]
Following from (8.10), the minimum clock period after clock skew scheduling is bounded by

T_min = (PD_M^{d{i1…im}c} − PD_m^{d{j1…jn}c} + δ_S^{Fc} + δ_H^{Fc}) / |m − n + 1|
      = (SD_M^{dc} − SD_m^{dc} + δ_S^{Fc} + δ_H^{Fc}) / |m − n + 1|. (8.27)
The identical lower bounds of the minimum clock period stated in (8.27)
for both the edge-triggered and level-sensitive circuits are demonstrated in
Figure 8.10 and Figure 8.11, respectively.
Fig. 8.10. Timing of the edge-triggered reconvergent system with m = 3 and n = 2.

Fig. 8.11. Timing of the level-sensitive reconvergent system with m = 3 and n = 2.

Similar to the simple reconvergence case analyzed in Section 8.2.1, if the minimum and maximum path delays are determined by the same reconvergent path, the delay insertion method is not beneficial. If these delays are determined by different reconvergent paths, the delay insertion method is used to improve the minimum clock period. The minimum clock period achieved through clock skew scheduling and delay insertion is
T*_min = max_{∀p_R, p_S ∈ {p_A, p_B, …, p_K}} [(PD_M^{pR} − PD_m^{pS} + U^{pR} − U^{pS}) / |m − n + 1|] + (δ_S^{Fc} + δ_H^{Fc}) / |m − n + 1|. (8.28)
The minimum (and maximum) path delay of the reconvergent paths can
be modified by inserting delays on the local data paths of the reconvergent
path. The amount of delay to be inserted is determined at run time by the
clock skew scheduling algorithm.
Table 8.1. CSS method for edge-sensitive circuits with the delay insertion method.

LP Model:
  min T_CP
  s.t. T_Skew(i, f) ≤ T_CP − D_PM^{i,f} − D_CQM^{Fi}
       I_M^{if} ≥ I_m^{if}
In the problem formulation, continuous delay models have been used. Practi-
cally, however, delay elements are available only in discrete values. There are
two possible approaches to solving the discrete valued delay insertion problem.
The naive approach is to solve the clock skew scheduling problem assuming
continuous delays and approximating the optimal values with the given set of
discrete components. Although likely to produce reasonable results for simple
Table 8.2. CSS method for level-sensitive circuits with the delay insertion method.

LP Model:
  min T_CP + M[Σ_{∀j} (d_j + D_j) + Σ_{∀j:|FI(j)|≥1} (A_j − a_j)]
  s.t. a_f ≥ δ_H^{Lf}
       A_f ≤ T_CP − δ_S^{Lf}
       a_f ≤ d_{in} + D_Pm^{in,f} + T_Skew(i_n, f) − T_CP, ∀n
       d_i ≥ a_i + D_DQm^{Li}
       d_i ≥ T_CP − C_W^{Li} + D_CQm^{Li}
       D_i ≥ A_i + D_DQM^{Li}
       D_i ≥ T_CP − C_W^{Li} + D_CQM^{Li}
buffer tree structure, a shared delay element is placed between the fanouts—or
fanins—of a register, if multiple fanouts of the same register must be padded.
Note that the delay buffer-tree construction is a post-timing analysis process
and is not integrated into the clock skew scheduling algorithms.
Throughout this research monograph, the local data paths are modeled
abstractly at a higher hierarchy level than gate-level hierarchy. Such simplifi-
cation is followed in this chapter in order to improve the demonstration of the
theoretical limitation of reconvergent paths and the mitigation of this limita-
tion by the delay insertion method. In practical implementation, the location
of the delay elements to be inserted into the logic must be identified at a lower
level of abstraction—most suitably at the gate-level of hierarchy. The model-
ing of local data paths at a higher abstraction level as suggested in this work
might lead to an ambiguous assignment of delays to reconvergent paths. In
an extreme case, it is plausible that three or more reconvergent paths might
share all of the logic paths that constitute a reconvergent system. For the
simplest case of four reconvergent paths, any two reconvergent paths might
differ by one logic path only, and all logic paths might be covered by the four
reconvergent paths. For such a reconvergent system, including delay elements
anywhere on a reconvergent path (on any logic path) would affect the path
delay of more than one reconvergent path. Thus, the optimal delay insertion
values computed by the presented LP problem must be post-processed for
practical implementation.
The described concerns in the practical implementation of the delay in-
sertion method are not considered in the experimentation stage of this work.
Simplicity is preserved in the models used in formulation in order to improve
the presentation of the limitation caused by the reconvergent paths and the
mitigation of this limitation by the delay insertion method. Designers, how-
ever, must be wary of these practical requirements. Some researchers have
already started analyzing these practical concerns [133]. In [133], the LP model shown in Table 8.2 is redefined at the gate level to pinpoint the placement of the inserted delays on the gate-level netlist.
8.5 Summary
³ Memory transfers between main and secondary storage are, of course, always an option. For the quickest execution, however, all data should reside in the main storage.
9.1 Computational Analysis 169
which corresponds to the last row of (7.52) and (7.58), respectively. As men-
tioned previously in Section 7.2.3, the symmetric matrix M is always positive-
definite4 and nonsingular, thereby permitting exactly one solution λ̂ of the
linear system described by (9.3).
The system described by (9.3) is a large square linear system of the type
Ax = b, where b ∈ Rn is a column vector and the coefficient matrix
A ∈ Rn×n is dense. Typically, the most effective approach to computing the
solution x̂ ∈ Rn of such systems consists of performing a triangular decompo-
sition5 of the coefficient matrix A followed by the successive solution of two
relatively ‘easy’ to solve square linear systems of order n × n. The triangular
decomposition of A is of the form A = LU, where L and U are a lower tri-
angular and an upper triangular matrix, respectively [134, 135]. The solution
of Ax = LUx = b is obtained next by first computing the intermediate solu-
tion ŷ of the system Ly = b. Finally, x̂ is the solution of the system Ux = ŷ.
Because of the triangularity of the matrices L and U, the vectors ŷ and x̂ can
be computed with relatively little effort. The components of the intermediate
solution ŷ are obtained by solving the system Ly = b—referred to as for-
ward elimination [134, 135]—since the first equation of Ly = b involves only
y1 , the second only y1 and y2 , and so on. Similarly, the components of x̂ are
obtained from the system Ux = ŷ in the reverse order xn , xn−1 , . . . , x1 . The
process of solving Ux = ŷ for x̂ is also called back substitution [134, 135].
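The two triangular sweeps can be sketched directly (a generic illustration, not the monograph's implementation):

```python
def forward_elimination(L, b):
    """Solve L y = b for a lower-triangular L, top-down: the first
    equation involves only y[0], the second y[0] and y[1], and so on."""
    n = len(b)
    y = [0.0] * n
    for i in range(n):
        y[i] = (b[i] - sum(L[i][j] * y[j] for j in range(i))) / L[i][i]
    return y

def back_substitution(U, y):
    """Solve U x = y for an upper-triangular U, bottom-up:
    x[n-1] first, then x[n-2], ..., x[0]."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(U[i][j] * x[j] for j in range(i + 1, n))) / U[i][i]
    return x

# A = L U; solve A x = b with the two 'easy' triangular systems.
L = [[1.0, 0.0],
     [0.5, 1.0]]
U = [[4.0, 2.0],
     [0.0, 3.0]]
b = [6.0, 9.0]
x = back_substitution(U, forward_elimination(L, b))
print(x)  # [0.5, 2.0]
```

Each sweep costs roughly n²/2 multiplications, which is why the factorization step dominates the total solution cost.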
Furthermore, the symmetry and positive-definiteness of M can be ex-
ploited to obtain a special form of the LU triangular decomposition of M
such that the lower and upper triangular matrices in the decomposition are
⁴ Note that the matrix inverse M^{−1} = (I + CC^t)^{−1} in (9.7) can be expressed using the Sherman-Morrison-Woodbury formula [134],

(D + EF^t)^{−1} = D^{−1} − D^{−1}E(I + F^tD^{−1}E)^{−1}F^tD^{−1}, (9.8)
Note that in (9.9), not only can the matrix inverse N−1 = (I + Ct C)−1
be computed more quickly than M−1 (the dimension of N is nb × nb vs.
nc × nc = (k − 1)r × (k − 1)r for M) but the computation of this inverse
N−1 matrix does not have to be explicitly performed in order to evaluate the
product CN−1 Ct in (9.9). Let the Cholesky decomposition of N = I + Ct C
be
N = L2 Lt2 (9.10)
and substitute (9.10) into the product C(I + C^tC)^{−1}C^t = CN^{−1}C^t in (9.9); then

M^{−1} = I − CN^{−1}C^t
       = I − C(L_2L_2^t)^{−1}C^t
       = I − (CL_2^{−t})(L_2^{−1}C^t) (9.11)
       = I − X^tX,
⁶ Note that I + C^tC is positive-definite, thus nonsingular.
172 9 Practical Considerations
N_2(r, k) = [1/6 + (1/2)(k − 1) + (1/2)(k − 1)²] r³ + (k − 1)r² (9.12)
The notation

Y = L_2^{−1}C^tB (9.16)

is introduced in (9.15) for simplicity, where, similarly to the previously described algorithm LMCS-2, the matrix Y can be obtained by forward elimination from the equation L_2Y = C^tB.
The clock schedule ŝ can be computed if the operations described by (9.14), (9.15), and (9.16) are carried out literally. These expressions, however, can be manipulated to significantly reduce both the run time and memory requirements of algorithm CSD. Initially, note that computing each clock skew s_i
requires evaluating the inner product of two dense p-element-long vectors—
the i-th row of the matrix (−Z+I) and g. The evaluation of this inner product
requires p multiplications, where p is the number of local data paths in the
circuit. Recall, however, that the values of the clock skews from the basis sb
provide sufficient information to reconstruct all clock skews s in a quick fash-
ion. Specifically, once the skews from the basis sb are known, the skews sc
in the chords of the circuit may be derived through the operation described
by (7.24),
[I  C] [s^c; s^b] = s^c + Cs^b = 0 ⇒ s^c = −Cs^b. (9.17)
Since only the basis sb is evaluated, only the last nb rows of the matrix (−Z+I)
are computed, thereby yielding significant savings of computation time. (Note
that computing one row of Z requires the evaluation of p row elements, each
row requiring r multiplications in the product Yt Y.) These concepts are
illustrated graphically in Figure 9.1.
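For the example circuit, this reconstruction is immediate (C is taken consistent with the kernel basis of Section 7.2, i.e. s1 = s3 − s4 and s2 = s4 − s5):

```python
# Recover the chord skews s^c = -C s^b from the basis skews [see (9.17)].
C = [[-1, 1, 0],
     [0, -1, 1]]
s_b = [0.0, -1.5, -1.0]   # basis skews s3, s4, s5
s_c = [-sum(C[i][j] * s_b[j] for j in range(3)) for i in range(2)]
print(s_c)  # [1.5, -0.5], the chord skews s1 and s2
```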
Fig. 9.1. Computation of the clock schedule basis s^b by computing only the last n_b rows of the matrix −Z + I.
C^tB = C^t [I  C] = [C^t  C^tC] = [C^t  N − I] = [C^t  L_2L_2^t − I] (9.18)

and Y = [Y_1  Y_2 − Y_3], (9.19)
V=Y Y= t
Y1 Y 2 − Y 3
Y2t − Y3t
V11 V12
=
(L2 L−1 −t −1 t −t −1 −1
2 C − L2 L2 C ) (L2 L2 + L2 L2 − L2 L2 − L2 L2 )
t t −t t
V11 V12
= −1 ,
Ct − (L−t
2 L−1
2 )C t
N − 2I + L−t2 L2
(9.24)
and
−Z + I = −B^t B + Y^t Y + I
       = −I − [ O   C ; C^t   N − 2I ] + [ V11   V12 ; C^t − (L2^-t L2^-1) C^t   N − 2I + L2^-t L2^-1 ] + I
       = [ ...   ... ; −(L2^-t L2^-1) C^t   L2^-t L2^-1 ].   (9.25)
Note that only the last r rows of (−Z + I) are shown in (9.25) since only these r rows are required to compute s^b. Also, note that the matrix Y1 = L2^-1 C^t does not require evaluation. Only Y3 = L2^-1 must be determined (from L2 Y3 = I) since L2^-t = (L2^-1)^t.
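The row-block identity in (9.25) can be confirmed numerically. The sketch below, assuming numpy and random stand-in dimensions (not the production implementation), computes the schedule both via the direct formula s = (−Z + I) g and via the last-rows shortcut:

```python
import numpy as np

rng = np.random.default_rng(3)
nc, nb = 5, 4                      # hypothetical dimensions: chords x basis
C = rng.standard_normal((nc, nb))
B = np.hstack([np.eye(nc), C])     # B = [I C]
g = rng.standard_normal(nc + nb)

# direct route: Z = B^t B - Y^t Y with Y = L2^{-1} C^t B, then s = (-Z + I) g
N = np.eye(nb) + C.T @ C
L2 = np.linalg.cholesky(N)         # N = L2 L2^t
Y = np.linalg.solve(L2, C.T @ B)   # forward elimination of L2 Y = C^t B
Z = B.T @ B - Y.T @ Y
s_full = (-Z + np.eye(nc + nb)) @ g

# CSD shortcut: only the last nb rows of (-Z + I) are needed, and they equal
# [ -(L2^{-t} L2^{-1}) C^t   L2^{-t} L2^{-1} ] = [ -N^{-1} C^t   N^{-1} ]
Ninv = np.linalg.inv(N)            # in practice obtained from the triangular factors
last_rows = np.hstack([-Ninv @ C.T, Ninv])
s_b = last_rows @ g

assert np.allclose(s_b, s_full[nc:])       # basis skews agree with the direct route
assert np.allclose(-C @ s_b, s_full[:nc])  # chords then follow from (9.17)
```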
The computation of the clock schedule ŝ in algorithm CSD consists of a
total of
N3 (r, k) = (1/2) r³ + (1/3)(3k + 4) r² + (1/2) r − 1/6   (9.26)
multiplications distributed among the following tasks:
a. computing the Cholesky decomposition L2 of N ← (1/6) r³
b. forward elimination of Y3 = L2^-1 from L2 Y3 = I ← (1/6) r³ + (1/2) r² + (1/3) r
c. evaluating the product L2^-t L2^-1 ← (1/6) r³ + (1/6)(5r² + r − 1)
d. evaluating s^b ← rp = kr².
The maximum memory usage of algorithm CSD is
M3 (r, k) = r².   (9.27)
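The closed form (9.26) can be checked against the task breakdown above with exact rational arithmetic; a small sketch (the function names are illustrative):

```python
from fractions import Fraction as F

def N3(r, k):
    """Total multiplications of algorithm CSD, per (9.26)."""
    return F(1, 2) * r**3 + F(3 * k + 4, 3) * r**2 + F(1, 2) * r - F(1, 6)

def task_counts(r, k):
    """Sum of the per-task multiplication counts listed before (9.26)."""
    a = F(r**3, 6)                           # Cholesky decomposition of N
    b = F(r**3, 6) + F(r**2, 2) + F(r, 3)    # forward elimination of Y3
    c = F(r**3, 6) + F(5 * r**2 + r - 1, 6)  # product L2^-t L2^-1
    d = k * r**2                             # evaluating s^b, using p = k r
    return a + b + c + d

# the task breakdown sums exactly to the closed form
for r in (10, 50, 100):
    for k in (2, 5, 10):
        assert task_counts(r, k) == N3(r, k)
```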
This section concludes with a brief synopsis of the run time and memory
requirements of the three algorithms for solving problem QP-3 described in
Sections 9.1.1, 9.1.2, and 9.1.3, respectively. To summarize the results, each
of the three algorithms, LMCS-1, LMCS-2, and CSD, requires O(r³) floating-point multiplicative operations and O(r²) floating-point storage units. The
numerical constant of the leading terms in the polynomial expressions for
both the run time and memory complexity is a function of the ratio k = p/r
which is the ratio of the number of local data paths to the number of registers
in a circuit.
To gain further insight into the proposed algorithms, the numerical con-
stants of the leading terms in the polynomial runtime complexity expressions
are plotted versus k in Figure 9.2. Similarly, the numerical constants of the
leading terms in the polynomial memory complexity expressions are plotted
Fig. 9.2. The numerical constants (as functions of k = p/r) of the term r³ in the runtime complexity expressions for the algorithms LMCS-1, LMCS-2 and CSD, respectively.
Fig. 9.3. The numerical constants (as functions of k = p/r) of the term r² in the memory complexity expressions for the algorithms LMCS-1, LMCS-2 and CSD, respectively.
versus k in Figure 9.3. Note that algorithm CSD outperforms both of the LMCS algorithms; the superiority of algorithm CSD is particularly evident with respect to the speed of execution. Thus, algorithm CSD is the
algorithm of choice for solving problem QP-3 as introduced in Section 7.2.3.
is a spanning tree for the modified circuit C1 . Note that the basis edge e6
does not belong to any of the fundamental cycles of the circuit depicted in
Figure 9.4. In fact, the edge e6 does not belong to any cycle of the circuit
in Figure 9.4 at all. Such basis edges which do not belong to any cycles are
called isolated, while the rest of the basis edges are called main. Note that any isolated edge must, by definition, be a basis edge.7
requirements. The only requirement is that the basis skews (the edges) must
be enumerated such that the isolated skews are last. In other words, the clock
skew vector (7.19) becomes
s = [s^c  s^b  s^i]^t = [ s_1 ... s_nc | s_nc+1 ... s_p−ni | s_p−ni+1 ... s_p ]^t,   (9.29)
with the nc chords first, followed by the nb − ni main basis skews and the ni isolated basis skews (the basis contains nb elements in total),
where sb stands for the main basis and the isolated basis is denoted by si .
With this specific choice of clock skew enumeration, the B matrix in (7.22)
becomes
B = [B1   0],   (9.30)
where 0 in (9.30) is a zero matrix of dimension nc × ni .
With this notation, it is straightforward to show that the matrix M
in (7.53) becomes
M = BB^t = B1 B1^t   (9.31)
and the solution to problem QP-1 (7.54) and (7.55) is
λ̂ = 2M^-1 B g = 2M^-1 [B1   0] [g^c ; g^b ; g^i] = 2M^-1 B1 [g^c ; g^b],   (9.32)

ŝ = g − B^t M^-1 B g = [ (I − B1^t M^-1 B1) [g^c ; g^b] ; g^i ].   (9.33)
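A quick numerical check of (9.30)-(9.33) confirms that the zero columns leave the isolated skews untouched; the dimensions below are arbitrary stand-ins, not data from the text:

```python
import numpy as np

rng = np.random.default_rng(11)
nc, nmain, ni = 3, 4, 2                    # hypothetical: chords, main basis, isolated basis
B1 = np.hstack([np.eye(nc), rng.standard_normal((nc, nmain))])  # full-row-rank, [I C]-style
B = np.hstack([B1, np.zeros((nc, ni))])    # (9.30): zero columns for isolated edges
g = rng.standard_normal(nc + nmain + ni)

M = B @ B.T                                # (9.31): equals B1 B1^t
assert np.allclose(M, B1 @ B1.T)

# (9.33): the projection does not disturb the isolated skews
s_hat = g - B.T @ np.linalg.solve(M, B @ g)
assert np.allclose(s_hat[-ni:], g[-ni:])   # isolated basis skews equal their objectives
assert np.allclose(B @ s_hat, 0.0)         # the kernel equation is still satisfied
```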
realistic. Consider, for example, the input and output registers (also called
the I/O registers) in a VLSI system. Some I/O registers are illustrated in
Figure 9.5 where the registers R1 and R5 are an input and an output register,
respectively, of the circuit C. The register R3 shown in Figure 9.5 is an internal
register since all of the other registers to which R3 is connected (via local data
paths) are inside the circuit C.
The timing of the I/O registers is less flexible than the timing of the internal registers. Consider, for example, the local data path R6⇝R1 shown in Figure 9.5. The register R6 is outside the circuit C which contains the registers R1 through R5. It is possible to apply a clock schedule that specifies a clock delay t_cd^1 to the register R1. However, the timing information for the local data path R6⇝R1 is not considered when scheduling the clock signal delays to the registers within C (including t_cd^1). Therefore, a timing violation may occur on the local data path R6⇝R1 illustrated in Figure 9.5.
One strategy to overcome this difficulty is to include in the clock schedul-
ing process the timing information of those local data paths which cross the
boundaries of the circuit C. This approach does not change the nature of the
clock scheduling algorithm but rather only the number of timing constraints.
However, such an optimization scenario is difficult to conceive due to the many
instances where C may be used. Therefore, a preferable approach is to set the
clock signal delay to the I/O registers (such as t_cd^1 to R1) to a specific value
with respect to the clock source (shown as the clock pin in Figure 9.5). If
this value is specified, all of the necessary timing information is available to
Fig. 9.5. I/O registers in a VLSI integrated circuit. Note that the I/O registers form part of the local data paths between the inside of the circuit and the outside of the circuit.
avoid any timing violations of the local data paths such as the path R6⇝R1
shown in Figure 9.5. Equivalently, a group of registers (the I/O registers, for
example) may be defined which require that the clock signal be delivered to all
of the registers within such a group with the same delay. Application-specific
integrated circuits (ASICs) and Intellectual Property (IP) blocks are good
examples of circuits where the aforementioned strategy may be useful.
Given the difficulty in knowing a priori all timing contexts of an integrated
circuit, a preferred solution may be to require that all I/O registers are clocked
at the same time (zero skew). More specifically, all possible explicit clock
delay requirements for registers within the circuit fall into one of the following
categories:
1. zero skew island, that is, a group of registers with equal delay,
2. target delays, that is, t_cd^k1 = δ_k1, ..., t_cd^kα = δ_kα, where kα ≤ r and δ_k1, ..., δ_kα are explicitly specified clock signal delay constants,
3. target skews, that is, s_j1 = σ_j1, ..., s_jβ = σ_jβ, where jβ < nb and σ_j1, ..., σ_jβ are explicitly specified clock skew constants.
Zero skew islands can be satisfied by collapsing the corresponding graph ver-
tices into a single vertex while eliminating all edges among vertices within the
island. Note that in this case, it must be verified that zero skew is within the
permissible range of each in-island path.8 Alternatively, the target delays are
converted to target skews (category 3 above) for sequentially-adjacent pairs
or by adding a ‘fake’ edge. Thus, an algorithm to handle only target skews is
necessary.
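The island-collapsing step described above amounts to a simple graph contraction; a minimal sketch (the helper name and edge format are illustrative, not from the text):

```python
def collapse_island(edges, island, merged="island"):
    """Collapse a zero skew island: all island vertices become one vertex,
    and edges internal to the island are eliminated.
    (A hypothetical helper; vertex names and edge format are illustrative.)"""
    def m(v):
        return merged if v in island else v
    out = []
    for u, v in edges:
        u2, v2 = m(u), m(v)
        if u2 != v2:               # drop edges now internal to the island
            out.append((u2, v2))
    return out

edges = [("R1", "R2"), ("R2", "R3"), ("R3", "R1"), ("R3", "R4")]
collapsed = collapse_island(edges, {"R1", "R2", "R3"})
assert collapsed == [("island", "R4")]
```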
Note first that target values for only nf ≤ nb skews can be independently
specified. As nf approaches nb , the freedom to vary all skews decreases and
it may become impossible to determine any feasible s. Given nf ≤ nb , (a) the
basis can always be chosen to contain all target skews by using a spanning
tree algorithm with edge swapping, and (b) the edge enumeration can be
accomplished such that the target skews appear last in the basis. The problem
is now similar to (7.42) except for the change of the circuit kernel equation,
C = [C1   C2]  ⇒  Bs = [I   C1   C2] [ŝ^c ; ŝ^b ; σ] = 0  ⇒  B̂ŝ + C2 σ = 0,   (9.34)
where B̂ = [I   C1], ŝ = [ŝ^c ; ŝ^b], ŝ^c = s^c, and ŝ^b is s^b with the last nf elements
removed. The matrix C2 in (9.34) consists of the last nf columns of C, while
the target skew vector σ is an nf -element vector of target skews whose ele-
ments are ordered in the order of the target edges. The linear system (7.51)
8 Normally, this would be the case. However, [recall (4.8), (4.13), (4.23), (4.24),
and (4.29)], in an aggressive circuit design with a short clock period it may so
happen that zero skew is designed to be out of the permissible range, most likely
creating a setup time violation. In these circuits, negative skew is used to increase
the overall system-wide clock frequency, thereby removing the setup violation.
becomes
2ŝ + B̂^t m̂ = 2ĝ,   B̂ŝ + C2 σ = 0   ⇒   [ 2I   B̂^t ; B̂   0 ] [ ŝ ; m̂ ] = [ 2ĝ ; −C2 σ ],   (9.35)
with solution ŝ = ĝ − B̂^t (B̂B̂^t)^-1 (B̂ĝ + C2 σ).
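The linear system (9.35) can be assembled and solved directly; the following numpy sketch with random stand-in data verifies that the computed schedule honors the modified kernel equation:

```python
import numpy as np

rng = np.random.default_rng(5)
nc, nb_hat, nf = 3, 2, 2               # hypothetical sizes; nf target skews
B_hat = np.hstack([np.eye(nc), rng.standard_normal((nc, nb_hat))])  # [I C1]
C2 = rng.standard_normal((nc, nf))
g_hat = rng.standard_normal(nc + nb_hat)
sigma = rng.standard_normal(nf)        # target skew values

# assemble and solve the saddle-point system (9.35)
n = nc + nb_hat
K = np.block([[2 * np.eye(n), B_hat.T],
              [B_hat, np.zeros((nc, nc))]])
rhs = np.concatenate([2 * g_hat, -C2 @ sigma])
sol = np.linalg.solve(K, rhs)
s_hat, m_hat = sol[:n], sol[n:]

# the schedule honors the modified kernel equation with the target skews
assert np.allclose(B_hat @ s_hat + C2 @ sigma, 0.0)
```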
9.4 Summary
This chapter describes practical concerns in the implementation of the QP-based formulation from Chapter 7, as well as a general clock skew scheduling implementation on a system-on-chip (SoC). First, the details of a computer implementation of the QP-based clock skew scheduling procedure described in Chapter 7 are presented. Three alternative implementations are discussed; the memory requirements and computational complexity of each implementation are analyzed based on its number of variables and operations. The efficacy and accuracy of each algorithm are examined through theoretical discussion. A comparative analysis is also presented, demonstrating the superiority of the CSD algorithm (out of the three proposed computer implementation alternatives) over the LMCS-1 and LMCS-2 algorithms. Later in the chapter, the timing isolation of intellectual property blocks in a system-on-chip implementation is presented, which enables the application of clock skew scheduling to individual IP blocks.
10 Clock Skew Scheduling in Rotary Clocking Technology
Fig. 10.1. Basic rotary clock architecture.
Anti-parallel (shunt connected) inverter pairs are used between the cross-connected lines to save power and to initiate and maintain the traveling wave. After excitation, the anti-parallel inverters feed the traveling wave in the stronger direction, up to
a stable oscillation frequency. The transmission line with anti-parallel con-
nected inverters is shown in Figure 10.3 [116]. In Figure 10.3, the traveling
wave is traveling from left to right.
Each pair of anti-parallel inverters on the path of the traveling signal
turns on after some time, stimulating the same process at the neighboring
pair of anti-parallel inverters in the direction of the wave. The transmission
line impedance is on the order of 10Ω and the differential on-resistance of
the anti-parallel connected inverters is in the 100Ω-1kΩ range for a 0.25μm technology [116].
Once a wave is established, it takes little power to sustain it. The dissi-
pated power on the ring is given by the I 2 R dissipation instead of the con-
ventional CV 2 f expression. Such consideration of power is possible because
Fig. 10.2. The RTWO theory.
Fig. 10.3. The cross-section of the transmission line with shunt connected inverters.
the energy that goes into charging and discharging MOS gate capacitance (of
the inverters) becomes transmission line energy, which in turn is circulated
in the closed electromagnetic path. Such conservation of energy is enabled by adiabatic switching [166, 167], which terminates the current path into the transmission line instead of into ground. The coherent switching occurs only in the direction of the traveling path. An equal amount of energy is launched in the reverse direction; however, the latches in this direction have already switched, so this energy simply serves to reinforce the previous switching events on these registers.
The frequency of the clock signal generated by the rotary clocking technology depends on the total capacitance and inductance in the system, which are defined by the physical implementation of the rotary wires and the attached loads. Ltotal is the total loop inductance and depends on the ring perimeter P, interconnect separation s, wire width w, thickness of the strip t and permeability in vacuum μ0. Ctotal is the total capacitance that is driven by the RTWO
ring. The total capacitance is defined by gate-oxide capacitances of inverter
pairs Cinv and registers Creg , and the tapping wire (from register to ring) ca-
pacitance Cwire . These introduced factors affecting the total inductance Ltotal
and the total capacitance Ctotal are the design parameters for an RTWO ring
that provide a design flexibility to generate the desired frequency. Inductance
variation on a typical silicon implementation is expected to be small because
of the high quality of lithographic reproduction. Overall, the projected post-production variation in the targeted operating frequency is 5%, accounting for the sources of variation and the dependence of the operating frequency on √C and √L [116].
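The square-root dependence noted above can be made concrete with the standard RTWO relation f = 1/(2·√(Ltotal·Ctotal)); the component values in this sketch are hypothetical, chosen only to land near a 3.4 GHz design point:

```python
from math import sqrt

def f_osc(L_total, C_total):
    # the traveling wave needs two laps of the ring per period,
    # giving f = 1 / (2 * sqrt(L_total * C_total))
    return 1.0 / (2.0 * sqrt(L_total * C_total))

# hypothetical component values, chosen to land near a 3.4 GHz design point
L = 2.0e-9     # total loop inductance, H
C = 10.8e-12   # total driven capacitance, F
f0 = f_osc(L, C)
assert 3.3e9 < f0 < 3.5e9

# a 10% increase in total capacitance shifts the frequency by only about -4.7%,
# reflecting the square-root dependence on C
shift = f_osc(L, 1.10 * C) / f0 - 1.0
assert abs(shift + (1 - 1 / sqrt(1.10))) < 1e-12
```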
The operation of the ROA structure in providing a gigahertz frequency,
low jitter, low power clock signal with fast transition times is confirmed by
simulating the ring shown in Figure 10.1 at 965MHz and 3.4GHz. The ring, designed in a 2.5V, 0.25μm CMOS technology, has 25 interconnected RTWO rings on a 7x7 array grid. The simulation result presented in [116] for the 3.4GHz ring is shown in Figure 10.4. Promising results of a clock jitter of 5.5ps and a 34-dB power supply rejection ratio (PSRR) are measured at 3.4GHz [116], and 117-dB noise performance is reported for an 18 GHz implementation [168].
Two other important metrics for an oscillator are the sensitivity to changes
in temperature and supply voltage. It has been shown that the frequency
deviation with temperature change between −50°C and 150°C is only 1% while the change with VDD deviation between 1.5V and 3.5V is around 2% [116]. The immunity of the RTWO signals to process variations, while allowing full skew control over the 360 degrees of phase on the ring, proves very valuable for deep submicrometer applications.
A detailed analysis of the rotary clocking technology and the RTWO loops
can be found in [116, 148, 157, 158]. Research on rotary clocking can be
categorized into characterization and physical design. In characterization re-
search presented in [168, 169, 170, 171], test chips and spice models are used
to analyze the power and frequency characteristics of homogeneous rotary
rings. Power savings around 60-80% are reported for a single, square rotary
ring [169, 170, 171]. In physical design research in [172, 173] and recently
in [174], skew computation and logic placement for a given rotary clock ring are
Fig. 10.4. Line voltage and line current for the 3.4GHz clock example.
discussed. Both methods adopt iterative principles for integrated skew compu-
tation and logic placement. The point of interest for the clock skew scheduling
discussion presented in this monograph is primarily the timing requirements
of the rotary clocking technology, which are presented in Section 10.1.2.
Fig. 10.5. Clock signal phases (45°, 90°, 135°, 180°, 225°, 270°, 315°) available along a rotary ring with one crossover point.
In tree networks, clock delays are generated with buffering; thus, clock delays are available only in discrete values for such systems. For rotary clocking
technology, however, buffer elements are not necessary, as clock delays are
provided with the propagation of the clock signal on RTWO rings. The clock
phase driving a synchronous component is determined by the location of the
connection point of the clock signal wire on the RTWO ring as shown in
Figure 10.1(b) (page 186). Figure 10.5 also presents the different phases of the
clock signal available for a sample rotary implementation with one crossover
point. Note that with this implementation, two corresponding points on the differential line provide clock signals which are shifted by 180 degrees.
Unlike traditional PLL-based clock sources, the generation of a multi-phase
clock signal is highly practical with rotary clocking. The number of phases in
the clock signal generated by the rotary clocking technology is determined
by the number and placement of crossovers onto the RTWO rings. The common multi-phase synchronization scheme of two phases, as well as any other arbitrary number of phases, can be implemented with rotary clocking technology without loss of quality. Two (or more) crossovers can be used to generate any
desired number of overlapping or non-overlapping clock phases for multi-phase
synchronization. The length and respective placement of the duty cycles of
the multiple clock phases are determined by the location of the crossovers on
the ring.
Rotary clocking technology readily supplies a fine granularity of clock delays and, potentially, phases. From a CAD perspective, continuous delay models can be used to model the clock delays available in the network. From a circuit design perspective, the assignment of different clock delays to the synchronous components of a rotary-clock synchronized circuit is essential for the proper operation of the circuit. Towards this end, the most common problem is the unbalanced capacitive loading of the rotary network. The lack of
a relatively uniform load distribution (within one ring or between multiple
rings) may affect the rotation of the oscillatory signal on the ring(s), thereby
causing degradations in the quality of synchronization. In the optimal schedul-
ing scenario, the clock delays at the synchronous components are distributed
relatively evenly in time, leading to a relatively balanced distribution of the
latching points on the rotary ring. The required balanced loading of the ROA
rings can be provided by clock skew scheduling (see the distribution of clock
delays for a sample circuit in Figure 11.5 on page 213).
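The delay-to-position mapping implied by this discussion can be sketched as follows; the linear mapping, the helper name and all numbers are illustrative assumptions, not taken from the text:

```python
def tap_point(t_cd, period, perimeter):
    """Map a scheduled clock delay to (distance along the ring, line polarity).
    One lap of the ring takes period/2; the crossover inverts the wave, so the
    complementary line at the same point supplies the phase shifted by 180 degrees.
    (Illustrative model; the linear delay-to-distance mapping is an assumption.)"""
    lap = period / 2.0
    t = t_cd % period
    polarity = 0 if t < lap else 1          # pick the line of the differential pair
    return (t % lap) / lap * perimeter, polarity

period, perimeter = 2.0, 2000.0             # ns, um (hypothetical)
delays = [0.0, 0.5, 1.0, 1.5]               # an evenly spread schedule
taps = [tap_point(t, period, perimeter) for t in delays]
# evenly spread delays produce evenly spread latching points on the ring
assert taps == [(0.0, 0), (1000.0, 0), (0.0, 1), (1000.0, 1)]
```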
The advanced timing methodology of using non-zero clock skew circuits
with multi-phase synchronization can easily be realized in circuits synchronized with rotary clocking technology. Advantageously, the implementation of circuits synchronized with the rotary clocking technology in turn requires the automated design and analysis methodologies for multi-phase, non-zero clock skew synchronization schemes. Such integration of the design and analysis methodologies into the physical design flow leads to circuits which benefit
both from the presented advanced timing methodologies and the rotary clock-
ing technology.
[Flow chart: Design Entry → Partitioning (ROA size partitioning; register insertion; iterate until ROA feasible and CSS feasible) → Placement (register mapping; logic placement)]
Fig. 10.6. The physical design flow of VLSI circuits with RTWO clock synchronization.
The implementation of the ROA rings and netlist partitioning are depen-
dent on each other as illustrated in the Partitioning step in the flow chart.
The size and number of rings in the ROA structures depend on several fac-
tors such as the complexity of the design, the availability of clock network
design resources, the computational resources for timing analysis and the sil-
icon area. Despite these dependencies, the number and physical dimensions
of ROA rings in a circuit are quite flexible. The number of ROA rings is usu-
ally held sufficiently high in order to limit the total wirelength. The shapes
of ROA rings are not necessarily regular (e.g., rectangles) as implied by the
mesh structure presented in Section 10.1.1. Such flexibility in the physical
partitioning approach is proposed in this work using selection criteria that lead
to partitions which are amenable to clock skew scheduling. Towards this end,
a hypergraph partitioning tool is used with fine-tuned partitioning criteria
to generate partitions that are easily implementable with the rotary clocking
technology. Principally, timing-driven partitioning is performed within the
proposed design methodology subject to the following considerations:
1. To construct the logic network partitions that will be synchronized by
individual ROA rings of the rotary clocking technology,
2. To enable the completion of path enumeration on large scale circuits,
3. To enable the completion of clock skew scheduling algorithms on large
scale circuits.
The first of the three factors listed above is directly related to the implementation of the rotary clocking technology. If clock tree synthesis is performed completely independently of logic synthesis, the assignment of synchronous components to individual ROA rings can be inefficient for physical
implementation. As discussed in Section 10.1.2, a relatively balanced distri-
bution of clock phases is necessary for the quality of synchronization with a
rotary clock signal. An unbalanced loading of synchronous components on the
ROA rings may also cause hot spots in the circuit or significantly increase the
clock load on one side of the chip compared to another (thereby causing per-
formance degradation). To prevent such undesired operation, logic and clock
tree synthesis need to be performed interdependently. The partitioning proce-
dure presented here achieves this goal by generating balanced logic partitions
to be synchronized by each ROA ring. Advantageously, the clock phases at the
synchronous components within each partition are well distributed after the
application of clock skew scheduling (see Figure 11.5 on page 213) to the logic
partitions. Thus, the synchronization by non-zero clock skew requirement is
satisfied as well as the capacitive load balancing requirement for robust rotary
oscillation.
The second and third factors that drive the timing-driven partitioning
process are related to the design and analysis methodologies of large-scale
circuits. Although discussed here within the context of rotary clock synchro-
nization, the partitioning procedures presented in this chapter can also be
applied to circuits synchronized with traditional clocking technologies. From
a CAD perspective, the generality of the partitioning procedure to improving
the scalability of clock skew scheduling (independent of the particular clocking
technology) is discussed next.
As reported earlier in Chapter 5, the limited scalability of clock skew scheduling is an important drawback for its widespread acceptance in mainstream design.
Most industrial-strength timing tools or circuit designers that implement vari-
ations of clock skew scheduling perform these tasks only on certain portions of
the circuit, without analyzing the circuit in its entirety. Analysis of the entire
circuit in order to implement a full-scale application of clock skew scheduling
can be computationally intensive for very large-scale circuits. The main obstacle for the application of clock skew scheduling to the entire circuit is the
run times of LP model problems.
The LP problem for the application of clock skew scheduling is formulated
as described in Chapter 5. The LP problems generated for an integrated cir-
cuit with millions of paths and hundreds of thousands or more synchronous
components can be very large. The run times of such large LP problems are
usually reasonable within the typically long IC design cycles (up to a few days
with industrial strength LP solvers and common computing resources). How-
ever, very large models might not be solvable at all within the memory limits
of common computing resources. In several industry applications, for instance,
LP model problems for the clock skew scheduling of large-scale circuits are
observed to exceed the practical limits of desktop computing resources (e.g.
4 gigabytes of memory for 32-bit systems) [176]. Partitioning, as discussed
here, remedies this shortcoming. Through partitioning the circuit into small
partitions, small linear programming models can be developed and solved for
each partition. In practice, the LP formulations can be applied in parallel,
achieving further improved run times.
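Since the constraints of Chapter 5 are difference constraints, feasibility for a fixed clock period reduces to negative-cycle detection in a constraint graph, and the minimum period can then be bracketed by bisection. The self-contained sketch below illustrates this (the path delays are hypothetical, and this is a simplification of the full LP, not the tool described in the text):

```python
def feasible(n, paths, T):
    """Difference constraints for a fixed period T:
    setup on path i->j:  t_i - t_j <= T - Dmax_ij
    hold  on path i->j:  t_j - t_i <= dmin_ij
    Feasible iff the constraint graph has no negative cycle (Bellman-Ford)."""
    cons = []
    for i, j, dmin, dmax in paths:
        cons.append((j, i, T - dmax))   # edge j->i, weight T - Dmax, bounds t_i - t_j
        cons.append((i, j, dmin))       # edge i->j, weight dmin, bounds t_j - t_i
    dist = [0.0] * n                    # implicit zero-weight super-source
    changed = True
    for _ in range(n):
        changed = False
        for u, v, w in cons:
            if dist[u] + w < dist[v] - 1e-12:
                dist[v] = dist[u] + w
                changed = True
        if not changed:
            return True                 # settled: no negative cycle
    return not changed                  # still relaxing after n passes: negative cycle

def min_period(n, paths, lo=0.0, hi=100.0, eps=1e-6):
    """Bisect on T for the smallest feasible clock period."""
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        if feasible(n, paths, mid):
            hi = mid
        else:
            lo = mid
    return hi

# two registers in a loop: path 0->1 (min 1, max 6) and path 1->0 (min 1, max 2);
# the hold bound on 0->1 limits the usable skew, so the minimum period is 5, not 4
paths = [(0, 1, 1.0, 6.0), (1, 0, 1.0, 2.0)]
assert abs(min_period(2, paths) - 5.0) < 1e-3
```

In practice the partition LP problems of this chapter play the role of `paths` restricted to each partition, which is what makes the parallel, per-partition solution possible.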
In the development of the partitioning step of the physical design flow, the partitioning tool Chaco [177] from Sandia National Laboratories is used. Chaco is a hypergraph partitioning tool that was primarily developed for the parallelization of tasks on special architectures. Nevertheless, Chaco has proved to be applicable to a wide range of areas. Chaco offers various methods for partitioning (spectral bisectioning [178], the inertial method [179], the Kernighan-Lin [180] and Fiduccia-Mattheyses [181] algorithms, and multilevel partitioners [182]), each fine-tuned for a specific purpose.
Among the multiple criteria for partitioning a synchronous circuit for clock
skew scheduling are the weight, number and location of the cuts amongst
partitions, the weight of each partition, the relative mapping of sequentially-
adjacent registers to partitions and the number of internal vertices per parti-
tion. Chaco tracks the quality of these partitioning performance metrics with
user-defined priorities. In order to generate partitions amenable to clock skew
scheduling, the number of cuts between partitions must be minimal and the
number of internal vertices (vertices that do not have edges between par-
titions) must be maximal. Depending on particular design budgets and the
priority of the performance metrics, the weights of particular nets or vertices can be fine-tuned.
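The two partitioning metrics emphasized above (few cut edges, many internal vertices) are easy to evaluate for a candidate partition; a minimal sketch with an illustrative six-vertex example (the function name and data format are assumptions):

```python
def partition_metrics(edges, part):
    """Count cut edges and internal vertices for a partition assignment.
    `part` maps each vertex to its partition id. (Hypothetical data format.)"""
    cut = [(u, v) for u, v in edges if part[u] != part[v]]
    boundary = {v for e in cut for v in e}
    internal = set(part) - boundary
    return len(cut), len(internal)

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")]
part = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 1}
cuts, internal = partition_metrics(edges, part)
assert cuts == 1           # only c-d crosses the partition boundary
assert internal == 4       # a, b, e, f have no inter-partition edges
```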
In the computer-aided design (CAD) tool implementation, the application of partitioning to two types of netlists is supported. These netlists, categorized by the hierarchical level of input data, are:
1. Gate level netlists,
2. Register-transfer level netlists.
constraints of the local data paths between the boundary register and registers
in other partitions are grouped into the top block LP problem. These LP
problems constitute an integral part of the physical design flow depicted in
Figure 10.6 on page 192.
When a gate-level netlist is used, the heuristic described in Section 10.2.2
is used to bolster cuts on the input of synchronous components. Unlike its
treatment for a register-transfer netlist, the final register of a cut local data path must be in the same partition as the path itself. This objective
suggests registered-input, registered-output partitions, simplifying the timing
analysis. The slight variation in the weight (or load) balance of the partitions
is insignificant and eventually balances out as the transfer of registers between
partitions occurs in all directions.
For instances where the partitioner validates a cut on a net that is between two combinational components, register insertion is used to satisfy the registered-input, registered-output scheme. The number of inserted registers depends on the quality of the partitioner and the complexity of the design. In the performed experiments, the number of inserted registers has been observed to be directly proportional to the number of partitions. For higher numbers of partitions, the number of inserted registers can even exceed the number of original registers, so the partitioning step must be applied with caution in designs where die area is a scarce resource.
The registers inserted into the logic network in the register insertion step
of the physical design flow can affect the functionality of the circuit. In order
to preserve the functionality of the circuit, level-sensitive latches are used.
The inserted registers are selected as level-sensitive latches operating in their
transparent phases of operation. The propagation of the data signals on the
inter-partition paths is not disrupted, as these signals are immediately propa-
gated through the level-sensitive latches during the transparent phases. Con-
straints similar to the linearized timing constraints presented in Chapter 5
are used in this step in order to drive the inserted registers with proper clock
delays and phases.
The general partitioning process is illustrated in Figure 10.7. In this figure,
the dots represent registers and the lines represent data paths. The paths from partition (4,1) are shown. Note that only some of the registers and
paths are shown. The data paths which are on a cut are identified and the
timing constraints of these paths are included within the top block LP.
If the top block LP problem computes a superior solution compared to the partition LP problems (smaller minimum clock period), then the maximum of the minimum clock periods of the partition LP problems is assigned
as the clock period of the top block. Otherwise, the top block LP problem de-
termines the actual minimum operating clock period of the circuit (partitions
and top block).
The top block LP problem is solved after the partition LP problems are solved because the top block has the largest number of boundary vertices implied in its constraints. In fact, all boundary vertices are implied in the constraints that make up the top block LP problem. Each partition LP problem only has a fraction of the boundary vertices implied in its constraints.
The solution of the clock delays to all boundary vertices, as computed by each
partition LP and the top block LP problems, must match in order to verify
the validity of the computed minimum clock period. In order to match these
clock delays of boundary vertices, the solutions computed for the top block
LP problem are enforced on the partition LP problems with equalities such
as:
t_cd^i = x_i,   (10.4)
where the clock delay computed for register Ri in the top block LP problem is x_i time units. If the partition LP problems return feasibility, the computation
is complete.
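Enforcing (10.4) amounts to substituting the top block solution into each partition's constraints; a minimal sketch (the data layout, helper name and register names are illustrative):

```python
def partition_feasible(cons, fixed, free_assign):
    """Check a partition's difference constraints t_a - t_b <= w after the top
    block solution pins the boundary delays via (10.4): t_i = x_i.
    `fixed` holds the enforced boundary delays, `free_assign` a candidate
    assignment for the remaining registers. (Illustrative data layout.)"""
    t = {**free_assign, **fixed}    # boundary values override, per (10.4)
    return all(t[a] - t[b] <= w + 1e-12 for a, b, w in cons)

cons = [("R2", "R1", 1.0), ("R1", "R2", 0.5)]  # hypothetical partition constraints
fixed = {"R1": 2.0}                            # boundary delay from the top block LP
assert partition_feasible(cons, fixed, {"R2": 2.3})       # within both bounds
assert not partition_feasible(cons, fixed, {"R2": 3.5})   # violates t_R2 - t_R1 <= 1
```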
There are two points to note here. First, note that the minimum clock
periods computed for partition LP problems are lower limits on the mini-
mum clock period of the complete circuit as each partition LP problem is a
subproblem of the original LP problem. The constraints that make up the
subproblems are subsets of the LP problem of the complete (original) circuit.
As the solution of one of the subproblems (the top block LP problem in this
case) is enforced on the remaining LP problems, the convex solution space
of the original problem is not violated. Intuitively, therefore, if the presented
heuristic method produces a feasible result, this result is optimal.
The second point to note is the fact that the presented heuristic method
does not guarantee a feasible solution. The percentage (65%) of ISCAS’89
benchmark circuits for which the presented heuristic method is feasible are
shown in Section 11.5. The following alternative approaches are proposed to
solve for cases where the presented heuristic method is not feasible:
• Reiteration: The infeasibility diagnostics of an LP solver can be used to
resolve the infeasibility problem by changing one or more clock delays that
appear in a contradictory constraint. Even if any infeasibility information
is not available, iterations can be performed on the infeasible subproblems
to search for a feasible answer. The clock delays whose values are changed
from the optimal solution of the top block LP are tracked such that the
feasibility of the remaining LP problems are not violated. Iterations are
performed either until a feasible solution is found or a time limit is reached.
• Constraining boundary vertices: As an alternative procedure, the
clock delays of all boundary registers can be fixed to a particular value.
is not drawn to scale. The die area is evenly divided into 16 regions in a
four by four setting, each of which is synchronized with an ROA ring. The
dimensions of each ROA ring are 500μm by 500μm. Assuming a single row of
registers is placed underneath each ring, the maximum number of registers
that are realizable on this die can easily be obtained using the dimensions
of a typical register. In a 0.13μm technology, the size of a register is considered
to be 4μm by 4μm, with a minimal spacing of 2μm between two instances.
Therefore, there is enough space to place approximately 80 registers on each
ROA ring edge [(500+2)/(4+2) ≈ 80]. For 4 sides of an ROA ring and 16 rings,
a total of 5120 registers are available for mapping against the synthesized logic.
This number is adequate for most state-of-the-art digital circuit designs of
similar die size. The dimensions of the designated area for register placement
and the number of register bank rows are the determining factors for the
number of registers in a design, which can be altered for particular design
budget requirements. Availability of registers in the register bank enables a
good distribution and mapping of clock phases.
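The register-capacity estimate above is simple arithmetic; a short sketch, with the dimensions taken from the 0.13μm example in the text:

```python
# Dimensions from the text: 500um ring edges, 4um registers, 2um spacing.
ring_edge_um = 500
reg_width_um = 4
spacing_um = 2

# One register plus its spacing occupies 6um of ring edge; the text rounds
# (500 + 2) / (4 + 2) = 83 down to a conservative 80 registers per edge.
regs_per_edge = (ring_edge_um + spacing_um) // (reg_width_um + spacing_um)
regs_per_edge_conservative = 80

# 4 edges per ring, 16 rings on the die.
total_registers = regs_per_edge_conservative * 4 * 16
print(regs_per_edge, total_registers)  # 83 5120
```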
The register placement methodology is discussed to demonstrate a viable
mechanism to deliver the required clock delays to registers. The described
heuristic implementation of the register placement methodology is presented
only as a proof-of-concept, and does not affect the underlying principles of
synchronization with non-zero clock skew. Alternative methods of placement
and routing for rotary timing synchronization have been offered in [173, 174]
that can also be followed in the physical design flow.
The popularity of personal computers in the consumer market over the
last few decades has significantly lowered the costs of computing systems.
Consequently, the costs associated with setting up a distributed computing
system have become relatively affordable. Processes that were previously
incomputable or considered too costly can now be executed on a cluster of
standard computing systems.
Xgrid [183] is distributed computing software provided by Apple Computer
Inc. that permits the operation of a cluster of popular desktop machines
as a supercomputer. The Xgrid system aggregates an ad hoc network of Mac-
intosh desktop computers into a multi-agent computing cluster, where each
agent is called a computation grid. Xgrid is typically beneficial for highly par-
allelized problems that can be broken into smaller pieces, each of which is
executed separately and relatively independently of the others. One of the
computers in the cluster is set up as the client for Xgrid and the other
computers are used as distributed agents. The Xgrid software is installed on
all computers, enabling the agents to perform grid calculations. Computations
can be submitted to the agents when they are idle, or the agents can be
configured to process tasks as the master task. The Xgrid software is run with
a controller, which regulates the assignment of computing processes to grids
and manages the outputs as they are returned to the server. Xgrid serves as
a simple distributed computing infrastructure and does not support message
passing between independent agents, as is the case for typical Message Passing
Interface (MPI) [184] systems.
The parallelization of the application of clock skew scheduling is imple-
mented for the Xgrid distributed computing system. The LP problems for
each partition are submitted as individual tasks to the Xgrid computing clus-
ter and solved simultaneously on specific agents. The generated system not
only exhibits the previously described advantages of implementing a parallel
execution scheme for clock skew scheduling, it also exemplifies the implemen-
tation of a complex VLSI design application on the Xgrid software architecture.
The computing cluster is constructed with eight PowerMac computers with
dual G5 1.8GHz microprocessors and 3GB RAM operating Mac OS X 10.3.8.
The cluster has one dedicated client, one dedicated controller and six distrib-
uted computing agents. The agents are configured to process Xgrid tasks as
the master task. Only one of the processors on each computer is used in ex-
perimentation. This grid computing cluster setup is illustrated in Figure 10.9.
In order to effectively harness the distributed computing potential, the
benchmark circuits are partitioned into four partitions. Note that four parti-
tions emulate a 2x2 grid clock distribution for the rotary clocking technology.
The analysis of a 3x3 or a larger grid size is possible; however, perfect
parallelization for such grid sizes cannot be achieved with six distributed agents.
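The Xgrid submission mechanics are platform-specific, but the structure of the parallel run can be sketched with a generic worker pool; the solver call below is a placeholder standing in for the per-partition LP solve, not the actual solver used in the experiments:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_partition_lp(partition):
    """Placeholder for the LP solve of one circuit partition; here it simply
    returns the longest path delay of the partition as its clock period."""
    return max(partition["path_delays"])

def parallel_min_clock_period(partitions, agents=6):
    """Submit each partition as an independent task, as done on the Xgrid
    cluster, and combine the results. The minimum clock period of the full
    circuit is bounded below by the largest partition result."""
    with ThreadPoolExecutor(max_workers=agents) as pool:
        results = list(pool.map(solve_partition_lp, partitions))
    return max(results)

# Four partitions, emulating the 2x2 grid described in the text.
parts = [{"path_delays": d} for d in ([3.1, 4.0], [2.5], [5.2, 1.1], [4.4])]
print(parallel_min_clock_period(parts))  # 5.2
```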
[Figure 10.9: the Xgrid computing cluster, with one client, one controller and six distributed agents.]
10.4 Summary
Non-zero clock skew scheduling is proposed as a clock distribution network
design and improvement methodology in the conventional VLSI design flow.
Table 11.1. Clock skew scheduling results for level-sensitive ISCAS’89 circuits.
Circuit T_FF^noskew T_L^noskew I_L^TB(%) T_FF^CSS T_L^CSS I_FF^CSS(%) I_L^TBCS(%) I_L^CSS(%) t_L^CSS(sec) T_L^r I_L^r(%)
s27 6.6 5.4 18 4.1 4.1 38 38 24 0.02 4.1 38
s208.1 12.4 8.6 31 4.9 5.2 60 58 40 0.01 7.6 39
s298 13.0 10.6 18 9.4 9.4 28 28 11 0.02 10.6 18
s344 27.0 18.4 32 18.4 18.4 32 32 0 0.03 18.4 32
s349 27.0 18.4 32 18.4 18.4 32 32 0 0.03 18.4 32
s382 14.2 10.3 27 8.5 8.5 40 40 17 0.04 8.7 39
s386 17.8 17.3 3 17.3 17.3 3 3 0 0.03 17.3 3
s400 14.2 10.4 27 8.6 8.6 39 39 17 0.05 8.8 38
s420.1 16.4 12.6 23 6.8 7.2 59 56 43 0.04 10.3 37
s444 16.8 12.4 26 9.9 9.9 41 41 20 0.07 9.9 41
s510 16.8 14.8 12 14.8 14.3 12 15 3 0.02 14.8 12
s526 13.0 10.6 18 9.4 9.4 28 28 11 0.05 10.6 18
s526n 13.0 10.6 18 9.4 9.4 28 28 11 0.05 10.6 18
s641 83.6 66.2 21 61.9 61.9 26 26 6 0.05 63.1 25
s713 89.2 71.2 20 63.8 63.8 28 28 10 0.05 65.0 27
s820 18.6 18.3 2 18.3 18.3 2 2 0 0.01 18.3 2
s832 19.0 18.8 1 18.8 18.8 1 1 0 0.01 18.8 1
s838.1 24.4 20.6 16 8.3 9.1 66 63 56 0.28 15.6 36
s938 24.4 20.6 16 8.3 9.1 66 63 56 0.31 15.6 36
s953 23.2 21.2 9 18.3 18.3 21 21 14 0.10 21.2 9
s967 20.6 17.9 13 16.2 16.6 21 19 7 0.08 17.9 13
s991 96.4 91.6 5 79.4 79.4 18 18 13 0.02 79.4 18
s1196 20.8 16.0 23 10.8 7.8 48 63 51 0.03 16.0 23
s1238 20.8 16.0 23 10.8 7.8 48 63 51 0.01 16.0 23
s1423 92.2 86.4 6 77.4 75.8 16 18 12 1.10 75.8 18
s1488 32.2 29.0 10 29.0 29.0 10 10 0 0.02 29.0 10
s1494 32.8 29.6 10 29.6 29.6 10 10 0 0.01 29.6 10
s1512 39.6 34.8 12 34.8 34.8 12 12 0 0.28 34.8 12
s3271 40.3 29.8 26 28.6 28.6 29 29 4 0.69 29.0 28
s3330 34.8 23.4 33 17.8 17.8 49 49 24 0.49 23.2 33
s3384 85.2 77.4 9 67.4 67.4 21 21 13 1.88 76.2 11
s4863 81.2 75.4 7 69.0 69.0 15 15 8 0.64 69.0 15
s5378 28.4 23.2 18 22.0 22.0 23 23 5 1.66 22.0 23
s6669 128.6 124.6 3 109.8 109.8 15 15 12 3.62 109.8 15
s9234 75.8 64.8 15 54.2 54.2 28 28 16 4.59 59.2 22
s9234.1 75.8 64.8 15 54.2 54.2 28 28 16 3.88 59.2 22
s13207 85.6 67.4 21 57.1 57.1 33 33 15 14.86 57.1 33
s15850 116.0 92.8 20 83.6 83.6 28 28 10 76.96 83.6 28
s15850.1 81.2 71.4 12 57.4 57.4 29 29 20 58.89 57.4 29
s35932 34.2 34.1 0 20.4 20.4 40 40 40 80.03 20.4 40
s38417 69.0 54.8 21 42.2 42.2 39 39 23 603.49 43.0 39
s38584 94.2 76.4 19 65.2 65.2 31 31 16 321.74 64.8 31
Average - - 15 - - 30 27 14 - - 24
TB, CSS and TBCS stand for time borrowing, clock skew scheduling and both,
respectively.
The minimum clock periods calculated for the edge-sensitive synchronous
circuits under zero and non-zero clock skew scheduling (T_FF^noskew and T_FF^CSS,
respectively) suggest an average improvement of 30% in the minimum clock
period for the ISCAS’89 benchmark circuits. The minimum clock periods cal-
culated for the level-sensitive synchronous circuits (T_L^noskew and T_L^CSS) sug-
gest an average improvement of 27% in the minimum clock period. Below,
208 11 Experimental Results
the clock period improvements for the level-sensitive latches are examined in
detail.
The experimental results shown in Table 11.1 demonstrate that utiliz-
ing latches as storage elements instead of flip-flops may result in up to a 30%
improvement in the minimum clock period under zero clock skew (for single-
phase, 50% duty cycle clock synchronization). On the ISCAS'89 benchmark
circuits, an average of 15% improvement is observed when the flip-flops are re-
placed by latches (under zero clock skew). This level of improvement is solely
due to time borrowing.
Utilizing non-zero clock skew, an even higher improvement is possible.
Improvements up to 63%—over flip-flop based synchronous circuit with zero
clock skew—are observed. The average improvement in the minimum clock
period for ISCAS’89 benchmark circuits is 27%. This level of improvement is
due to the simultaneous application of clock skew scheduling and time
borrowing. Out of this 27% improvement for non-zero clock skew level-sensitive
circuits, the improvement due to time borrowing is 15% and the improvement
due to clock skew scheduling is 14%. It is interesting to note that the
improvements achieved through time borrowing and clock skew scheduling are
not additive. Time borrowing and clock skew scheduling target the same
resource for performance improvement: the slack propagation time on local
data paths. There is a limited amount of slack propagation time on the critical
paths, and a circuit where time borrowing is abundantly realized cannot
benefit as much from clock skew scheduling. It has been shown, however, that
even though time borrowing and clock skew scheduling are battling effects
(battling for the same resource), dramatically shorter clock periods are
achievable through the collaboration of both effects. It is also important to
note that, although non-zero clock skew edge-triggered circuits show larger
improvements (30%) on average, non-zero clock skew level-sensitive circuits
lead to superior improvements for some of the circuits. Furthermore, the
smaller size of level-sensitive latches compared to edge-triggered flip-flops
is often highly desirable. Thus, the use of level-sensitive latches as register
elements in synchronous circuits where clock skew scheduling is applied is
advantageous for area savings, and sometimes superior to edge-triggered
circuits in both area and operating speed.
[Histogram: number of paths versus propagation delay D_P in time units.]
Fig. 11.1. Data propagation times for s938 with 32 registers and 496 data paths.
In Section 4.7.1, data propagation time D_P^{i,f} is defined as the period of time the
data is processed in the combinational logic block of a local data path R_i ⇝ R_f.
Without loss of generality, an empirical calculation method is used to calculate
the data propagation times of each local data path of a circuit. The distrib-
ution of the calculated data propagation times for the ISCAS’89 benchmark
circuit s938 is illustrated in Figure 11.1. In this figure, the height of each bar
corresponds to the number of paths within a given delay range. For example,
there are nine (9) paths with delays between 4 and 5 time units.
Effective path delay D̂_P^{i,f} [96] is defined as the time period between the
departure of the data signal from the initial register Ri and the arrival of
the same data signal at the final register Rf . The effective path delay of a
local data path differs from data propagation delay because of the additional
propagation time provided by clock skew and the time borrowing property
of level-sensitive synchronous circuits. Note that in level-sensitive synchro-
nous circuits, the effective path delay is defined within a permissible range
instead of a fixed value, as the arrival and departure times are indeterminate.
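Assuming the departure and arrival times are measured from a common time reference, the permissible interval of the effective path delay, bounded by the latest-departure/earliest-arrival and earliest-departure/latest-arrival combinations described in the text, can be sketched as:

```python
def effective_path_delay_interval(d_i, D_i, a_f, A_f):
    """Interval for the effective path delay of a level-sensitive local data
    path R_i -> R_f, given the permissible departure times [d_i, D_i] at R_i
    and arrival times [a_f, A_f] at R_f (common time reference assumed).
    Shortest: latest departure D_i with earliest arrival a_f.
    Longest:  earliest departure d_i with latest arrival A_f."""
    return (a_f - D_i, A_f - d_i)

print(effective_path_delay_interval(1.0, 3.0, 8.0, 12.0))  # (5.0, 11.0)
```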
[Histogram: number of paths versus maximum effective path delay in time units.]
Fig. 11.2. Maximum effective path delays in data paths of s938 for zero clock skew.
The nominal effective path delay is determined when the arrival and depar-
ture times are realized in run-time as certain values in the permissible ranges
[af , Af ] and [di , Di ], respectively. Specifically, the shortest effective path delay
occurs when the data signal departs at its latest time Di from the initial reg-
ister Ri and arrives at its earliest arrival time af at the final register Rf . The
longest effective path delay is realized by the earliest departure di of the data
signal from Ri and latest arrival Af at Rf . Hence, the interval for the effective
path delay of level-sensitive synchronous circuits can be defined as:
[Histogram: number of paths versus maximum effective path delay in time units.]
Fig. 11.3. Maximum effective path delays for s938 for non-zero clock skew.
T_L^noskew = 20.6. The increase in the effective path delays is due to time bor-
rowing. Accumulation of effective path delay values slightly below or above the
minimum operating clock period T_CP = 20.6 is visible. Note that the effective
path delay having larger values than the minimum clock period is a sufficient
but not a necessary condition for time borrowing. Thus, local data paths where
the effective path delay is calculated to be smaller than TCP = 20.6 may still
benefit from time borrowing. Furthermore, it can be observed that certain
data paths in the circuit benefit more from time borrowing, realizing an ef-
departs from a latch is D_CQ^L later than the leading edge of the clock signal,
at T_CP − C_W^L + D_CQ^L. Reordering the expression gives the upper bound on
clock skew:
The lower bound on the clock skew is derived similarly. In order to derive
the lower bound, the data arrival time at R_f must be considered to occur at
its latest possible time. The latest data arrival time is the setup time δ_S^{Lf}
earlier than the trailing edge of the clock signal, at T_CP − δ_S^{Lf}.
Thus, the lower bound on the clock skew is:
Combining (11.3) and (11.5), the theoretical limits on clock skew are expressed
as follows:

−D_Pm^{i,f} + δ_S^{Lf} + δ_H^{Lf} ≤ T_Skew(i, f) ≤ T_CP + C_W^L − D_PM^{i,f} − D_CQ^L − δ_S^{Lf}.  (11.6)
Recall that in experimentation, the parameters D_DQ^L, D_CQ^L, δ_S^{Lf} and δ_H^{Lf} are
considered zero and a 50% duty cycle is selected for the single-phase synchro-
nization clock signal. In order to evaluate the upper and lower bounds on
clock skew in this simplified case, the parameters are substituted in (11.6):

−D_Pm^{i,f} ≤ T_Skew(i, f) ≤ T_CP + C_W^L − D_PM^{i,f} = (3/2)T_CP − D_PM^{i,f}.
Specifically, on the ISCAS'89 benchmark circuit s938, the clock skew bounds
are verified using the experimental values shown in Figure 11.1. For the bench-
mark circuit s938 with a minimum clock period of 9.09, the minimum and
maximum propagation delays are calculated to be 5 and 24.4, respectively.
Thus, the values for the clock skew variable on the data paths of s938 are
bounded by −24.4 ≤ T_Skew(i, f) ≤ 8.64.
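With the simplifications used in the experiments (zero internal register delays and setup/hold times, and a 50% duty cycle so that C_W^L = T_CP/2), the circuit-wide s938 bounds can be reproduced numerically; a small sketch:

```python
def skew_bounds(t_cp, dp_min, dp_max):
    """Circuit-wide clock skew range in the simplified case where internal
    register delays and setup/hold times are zero and the clock has a 50%
    duty cycle (C_W = T_CP / 2):
      lower bound: -dp_max (the slowest path permits the most negative skew)
      upper bound: T_CP + T_CP/2 - dp_min (the fastest path sets the largest skew)
    """
    return (-dp_max, 1.5 * t_cp - dp_min)

# s938 values from the text: T_CP = 9.09, delays between 5 and 24.4.
lo, hi = skew_bounds(9.09, 5.0, 24.4)  # lower = -24.4, upper ~ 8.64
```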
The distribution of the clock skew values of s938, when operable with a
minimum clock period of 9.09, is presented in Figure 11.4. The target clock
period is T_CP = 9.09. The height of each bar corresponds to the number of
paths, formed by sequentially adjacent pairs of registers, which have a clock
skew within the given range. The calculated clock skew values are within
the derived limits, and most of them are negative. Negative clock skew between
registers helps improve the minimum clock period of the synchronous circuit
due to the additional time it provides for data signal propagation. Positive
skew is recorded on some data paths, most likely for two reasons. The first
reason is the presence of data path cycles and reconvergent systems within
the circuit, which have constraining timing properties as explained in
Chapter 8. The second reason is the presence of faster paths which provide
extra time for neighboring critical paths.
11.2 Multi-Phase Level-Sensitive Circuits 213
[Histogram: number of paths versus clock skew T_Skew(i, f) in time units, ranging from −20 to 2.]
Fig. 11.4. Distribution of the clock skew values of the non-zero clock skew case for
s938.
[Histogram: number of latches versus clock delay t_i in time units, ranging from 0 to 20.]
Fig. 11.5. Distribution of the clock delay values of the non-zero clock skew case for
s938.
[Figure: multi-phase level-sensitive synchronization. A local data path with propagation delay D_P is clocked by an n-phase clock generated from a common source C_source; each phase has pulse width C_W^L = T_CP/n and phase shift φ_k = T_CP(k − 1)/n, with φ_1 = 0, φ_2 = T_CP/n, ..., φ_n = T_CP(n − 1)/n.]
circuits are shown in Tables 11.2, 11.3 and 11.4. Minimum clock periods,
improvements and calculation times are denoted by T, I and t, respectively.
Subscripts FF and nφ represent circuit topologies for flip-flop based and
n-phase level-sensitive circuits, respectively. Superscripts and titles TB, CSS
and TBCS stand for time borrowing, clock skew scheduling and both, respec-
tively. Minimum clock periods (T) are measured in time units.
In the rest of this section, the experimental results and factors contributing
to the improvements in these results are discussed in greater detail. In partic-
ular, the properties of multi-phase synchronization which affect level-sensitive
circuit performance are discussed in Section 11.2.1. The effects of multi-phase
synchronization on time borrowing are addressed in Section 11.2.2. The ef-
fects of multi-phase synchronization on clock skew scheduling are addressed
in Section 11.2.3. Finally, the effects of multi-phase synchronization on the
simultaneous application of time borrowing and clock skew scheduling are
addressed in Section 11.2.4.
in Figure 11.7, for instance, the transparency periods are located at different
times within the clock cycle (e.g., clock phases C_1 and C_n are the first and last
sections, respectively). Such variety in the locations of the transparency periods
provides flexibility in the permissible data propagation times of a local data
path. The assorted assignment of clock phases to registers, achieved through
clock skew scheduling or other methods, leads to improvements in circuit
performance.
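Following the phase shifts of the multi-phase scheme (pulse width C_W^L = T_CP/n and phase shift φ_k = T_CP(k − 1)/n), the transparency windows of an n-phase clock can be sketched as:

```python
def phase_schedule(t_cp, n):
    """Transparency windows of an n-phase clock: phase k (k = 1..n) has
    shift phi_k = t_cp * (k - 1) / n and pulse width C_W = t_cp / n, so the
    n windows tile the clock cycle at distinct, non-overlapping positions."""
    width = t_cp / n
    return [(t_cp * (k - 1) / n, t_cp * (k - 1) / n + width)
            for k in range(1, n + 1)]

print(phase_schedule(10.0, 4))
# [(0.0, 2.5), (2.5, 5.0), (5.0, 7.5), (7.5, 10.0)]
```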
Table 11.4. Circuit info and run times for multi-phase ISCAS’89 circuits.
s27 3 4 0 0 0 0 0 0 0 0
s208.1 8 28 0 0 0 0 0 0 0 0
s298 14 54 0 0 0 0 0 0 0 0
s344 15 68 0 0 0 0 0 0 0 0
s349 15 68 0 0 0 0 0 0 0 0
s382 21 113 0 0 0 0 0 0 0 0
s386 6 15 0 0 0 0 0 0 0 0
s400 21 113 0 0 0 0 0 0 0 0
s420.1 16 120 0 0 0 0 0 0 0 0
s444 16 113 0 0 0 0 0 0 0 0
s499 22 462 0 0 0 0 0 0 1 1
s510 6 15 0 0 0 0 0 0 0 0
s526 21 117 0 0 0 0 0 0 0 0
s526n 21 117 0 0 0 0 0 0 0 0
s635 32 496 0 0 0 1 0 1 1 1
s641 19 81 0 0 0 0 0 0 0 0
s713 19 81 0 0 0 0 0 0 0 0
s820 5 10 0 0 0 0 0 0 0 0
s832 5 10 0 0 0 0 0 0 0 0
s838 32 496 0 0 0 0 0 1 1 0
s938 32 496 0 0 0 0 0 1 1 0
s953 29 135 0 0 0 0 1 0 0 0
s967 29 135 0 0 0 0 1 0 0 0
s991 19 51 0 0 0 0 0 0 0 0
s1196 18 20 0 0 0 0 0 0 0 0
s1238 18 20 0 0 0 0 0 0 0 0
s1269 37 1260 0 0 0 0 0 0 1 1
s1423 74 1471 1 1 1 2 2 2 2 3
s1488 6 15 0 0 0 0 0 0 0 0
s1494 6 15 0 0 0 0 0 0 0 0
s1512 57 415 0 0 0 1 1 1 1 1
s3271 116 789 1 1 1 1 1 1 2 2
s3330 132 514 0 0 1 1 1 1 2 2
s3384 183 1759 1 1 2 2 3 3 4 4
s4863 104 620 0 1 1 1 1 1 1 2
s5378 179 1147 1 1 1 2 2 2 3 3
s6669 239 2138 1 2 2 3 3 4 4 6
s9234.1 228 247 2 3 2 3 4 5 5 6
s9234 211 2342 2 3 3 4 4 5 5 7
s13207.1 669 3068 3 4 6 7 9 11 13 15
s13207 669 3068 3 5 6 8 9 13 13 19
s15850.1 534 10830 10 25 13 30 19 37 25 47
s15850 597 14257 15 26 19 32 24 42 30 46
s35932 1728 4187 6 8 16 17 21 28 27 39
[Three bar charts: percentage improvements for each ISCAS'89 modified circuit and on average; y-axis: Improvement (%); legend: 1-Phase, 2-Phase, 3-Phase, 4-Phase.]
Fig. 11.10. Effects of multi-phase clocking on time borrowing and clock skew
scheduling.
each benchmark circuit are illustrated in Fig. 11.10. In Figure 11.10, four
data points shown per benchmark circuit from left-to-right are the percent-
age improvements observed for the single-phase, dual-phase, three-phase and
four-phase synchronization schemes, respectively.
In general, the observed improvements for multi-phase synchronized cir-
cuits are superior to those of zero-skew, edge-sensitive circuits. The benchmark
circuit s1196 exemplifies this positive trend: an improvement of 68% is
observed for three-phase clocking through time borrowing and clock skew
scheduling. For the same circuit, the improvements are 63%, 63% and 64%
for single-, dual- and four-phase clocking, respectively.
As discussed in Sections 11.2.2 and 11.2.3, the improvements achieved
through time borrowing and clock skew scheduling decrease on average as the
number of clock phases increases. The improvements through simultaneous
application of time borrowing and clock skew scheduling decrease on average
as well, as the number of clock phases increases. Some negative improvements
are also recorded, corresponding to circuits with a significant delay increase
due to latch insertion. Nevertheless, 23% of the level-sensitive benchmark
circuits in Table 11.2 (10 out of 44) benefit more from clock skew scheduling
under multi-phase clocking. These circuits demonstrate that the average
degradation in the improvements of simultaneous time borrowing and clock
skew scheduling under multi-phase clocking is not observed for all circuits.
11.3 Quadratic Programming (QP) for Maximizing Safety 223
The results described in this section are obtained from the execution of a
computer implementation of Algorithm CSD introduced in Section 9.1.3. This
computer implementation shares code with the computer implementation de-
scribed in Section 5.7. In particular, the input data file format and the in-
put/output routines are exactly the same. Without unnecessary details, this
computer implementation consists of the sequential execution of the following
major steps:
Step 1. The input data file format and input/output routines are shared with
the LP computer implementation described in Section 5.7. The circuit timing
and connectivity data is read in, compressed and stored in a binary
database. The database can be used for fast data access in subsequent
algorithmic applications to the same circuit. Furthermore, the compact size
of the database permits significant space and time savings if the circuit data
is exchanged.
Step 2. The circuit data is examined and the circuit graph is built according to
the graph model described in Section 5.2.2. An adjacency-lists data structure [105]
stored in memory is used for fast access to the circuit graph data.
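A minimal sketch of the adjacency-lists construction in Step 2 (the input format here is illustrative, not the binary database format of Step 1):

```python
from collections import defaultdict

def build_circuit_graph(local_data_paths):
    """Build an adjacency-lists representation of the circuit graph: one
    vertex per register, one directed edge per local data path R_i -> R_f."""
    adj = defaultdict(list)
    for r_i, r_f in local_data_paths:
        adj[r_i].append(r_f)
    return adj

g = build_circuit_graph([("R1", "R2"), ("R1", "R3"), ("R2", "R3")])
print(sorted(g["R1"]))  # ['R2', 'R3']
```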
Step 8. The actual clock delays to the individual registers are calculated by
traversing the spanning tree (basis) of the circuit graph. The clock delay
of the first register is arbitrarily chosen (zero in this implementation). As
the spanning tree is traversed, additional vertices adjacent to the current
vertex are visited. The clock delay of the visited vertex is determined
trivially since both the clock delay of the current vertex and the clock
skew of the edge between the current and visited vertex are known.
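Step 8 can be sketched as a breadth-first traversal of the spanning tree; the function names and the sign convention for applying the scheduled skew along a tree edge are illustrative:

```python
from collections import deque

def clock_delays_from_skews(tree_adj, skew, root):
    """Assign clock delays by traversing a spanning tree of the circuit
    graph: the root delay is arbitrarily fixed to zero, and each visited
    vertex gets delay[v] = delay[u] + skew[(u, v)], where skew[(u, v)] is
    the scheduled clock skew on the tree edge between u and v (the sign
    convention here is illustrative)."""
    delay = {root: 0.0}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in tree_adj.get(u, []):
            if v not in delay:  # each tree vertex is visited once
                delay[v] = delay[u] + skew[(u, v)]
                queue.append(v)
    return delay

tree = {"R1": ["R2", "R3"], "R2": ["R4"]}
skews = {("R1", "R2"): 1.5, ("R1", "R3"): -0.5, ("R2", "R4"): 2.0}
print(clock_delays_from_skews(tree, skews, "R1"))
# {'R1': 0.0, 'R2': 1.5, 'R3': -0.5, 'R4': 3.5}
```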
The results of the application of the algorithm to these circuits are summa-
rized in Table 11.5. For each circuit, the following data is listed—the circuit
name in column 1, the number of disjoint subgraphs in column 2, and the
number of vertices, edges, chords (cycles), main and isolated basis, and target
clock period in nanoseconds in columns 3 through 8, respectively. The number
of iterations to reach a solution is listed in column 9. The average value
of ε in (7.42), that is, ∑ε/p, is listed in column 10. The run time in minutes
for the mathematical portion of the program is shown in column 11 for a 170
MHz Sun Ultra 1 workstation.
The table is printed in two side-by-side halves; each data row below lists two
circuits. Columns (1 through 11): Circuit, # subcircuits, r, p, nc, nm, ni,
TCP (nanoseconds), # iterations, ∑ε/p, Run time (min).
s1196 7 18 20 9 8 3 20.8 5 3.19 1 s526n 1 21 117 97 20 0 13 2 1.26 2
s1238 7 18 20 9 8 3 20.8 5 3.19 1 s5378 1 179 1147 969 158 20 28.4 20 8.79 3
s13207 49 669 3068 2448 581 39 85.6 20 18.92 5 s641 1 19 81 63 18 0 83.6 5 11.67 1
s1423 2 74 1471 1399 72 0 92.2 20 60.9 3 s713 1 19 81 63 18 0 89.2 6 12.74 1
s35932 1 1728 4187 2460 1727 0 34.2 20 60.4 27 s3330 1 132 514 383 61 70 34.8 4 3.4 5
s382 1 21 113 93 20 0 14.2 6 1.59 2 s3384 25 183 1759 1601 151 7 85.2 5 15.5 7
s38417 11 1636 28082 26457 1443 182 69 20 32.35 31 s4863 1 104 620 517 103 0 81.2 8 39.85 3
s38584 2 1452 15545 14095 1400 50 94.2 11 29.1 29 s6669 20 239 2138 1919 218 1 128.6 3 20.67 6
s386 1 6 15 10 5 0 17.8 1 0.82 1 s938 1 32 496 465 31 0 24.4 2 3.41 2
s400 1 21 113 93 20 0 14.2 8 1.6 1 s967 4 29 135 110 25 0 20.6 2 1.76 2
s420.1 1 16 120 105 15 0 16.4 20 1.95 1 s991 1 19 51 33 18 0 96.4 3 8.58 1
s444 1 21 113 93 20 0 16.8 2 1.05 1 IC1 1 500 124750 124251 499 0 8.2 2 1.51 30
s510 1 6 15 10 5 0 16.8 1 0.85 1 IC2 1 59 493 435 58 0 10.3 3 1.82 4
s526 1 21 117 97 20 0 13 2 1.26 1 IC3 34 1248 4322 3108 1155 59 5.6 2 1.43 2
Table 11.5. Experimental results of the application of the QP based clock scheduling.
11.4 Delay Insertion in Clock Skew Scheduling 227
[Permissible range diagrams: (a) zero skew in permissible range; (c) non-zero clock skew in permissible range after all iterations.]
Fig. 11.11. Circuit s3271 with r = 116 registers and p = 789 local data paths. The
target clock period is TCP = 40.4 nanoseconds.
228 11 Experimental Results
[Permissible range diagrams: (a) zero skew in permissible range; (c) non-zero clock skew in permissible range after all iterations.]
Fig. 11.12. Circuit s1512 with r = 57 registers and p = 405 local data paths. The
target clock period is TCP = 39.6 nanoseconds.
bles 8.1 and 8.2) are applied to the ISCAS’89 benchmark circuits. Continuous
delay models have been used in the experimentation. The experimental setup
in Section 11.1 (circuit delay information, clock signal duty cycle, internal
register delays, computing platform, LP solver) is replicated for the proposed
timing analyses. Experimental results are presented in Table 11.6. In
Table 11.6, the data shown are the number of registers r and paths p, the clock
period T_FF for the zero skew circuit with flip-flops, T_FF^CSS for the non-zero
skew circuit with flip-flops, and T_FF^DICSS for the non-zero skew circuit using
delay insertion with flip-flops. Also listed are the calculation times t_FF^CSS
and t_FF^DICSS of T_FF^CSS and T_FF^DICSS, respectively, and the percentage
clock period improvements I_FF^CSS, I_FF^DICSS and I_FF^DI for the
improvements from T_FF to T_FF^CSS, from T_FF to T_FF^DICSS, and from
T_FF^CSS to T_FF^DICSS, respectively.
The clock skew scheduling algorithms used in experimentation are tar-
geting the clock period minimization problem. Therefore the improvements
achieved in the minimum clock period through the application of clock skew
scheduling and delay insertion methods are reported in Table 11.6. These
improvements are computed with the formula (Told − Tnew ) /Told × 100. The
zero clock skew, edge-sensitive synchronous circuit is selected as the common
comparison mark due to its simplicity and popularity in digital circuit de-
sign. Both for edge-triggered and level-sensitive circuits, the improvements
through conventional clock skew scheduling (I_FF^CSS and I_L^CSS, respectively)
and through clock skew scheduling with delay insertion (I_FF^DICSS and I_L^DICSS,
respectively) are computed. Also shown in Table 11.6 are the comparisons
of the non-zero clock skew circuits scheduled with conventional clock skew
scheduling methods with non-zero clock skew circuits with delay insertion.
These comparisons (I_FF^DI and I_L^DI, respectively, for edge-triggered and level-
sensitive circuits) demonstrate the effectiveness of the delay insertion method
in further improving the performance of a conventional clock skew scheduled
circuit.
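The improvement figures reported in Table 11.6 follow the formula above; as a sketch:

```python
def improvement_pct(t_old, t_new):
    """Percentage clock period improvement, (T_old - T_new) / T_old * 100,
    as used for the comparisons reported in the experimental tables."""
    return (t_old - t_new) / t_old * 100.0

# Example with the s27 values from Table 11.1: T_FF = 6.6 -> T_CSS = 4.1.
print(round(improvement_pct(6.6, 4.1)))  # 38
```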
For the ISCAS’89 benchmark circuits, the delay insertion method leads
to 10% and 9% improvements on average over the conventional clock skew
scheduling algorithms for edge-triggered and level-sensitive circuits, respec-
tively. For better visualization, the performance improvements in minimum
clock period of edge-triggered and level-sensitive circuits achieved respectively
over corresponding non-zero clock skew edge-triggered and level-sensitive cir-
cuits are presented in Figure 11.13. Shown in Figure 11.13 are the percentage
improvements I_FF^DI and I_L^DI that are also presented in Table 11.6. Two data
points shown per benchmark circuit from left-to-right are the improvements
observed for edge-triggered and level-sensitive circuits, respectively. Note that
these improvements are due to delay insertion simultaneous with clock skew
scheduling.
The delay insertion method cannot be applied (is not beneficial) to some
circuits due to the two reasons discussed in Sections 8.2 and 8.2.2. The first
reason, discussed in Section 8.2, is the fact that the minimum clock period of
the circuit can be determined by a limitation other than reconvergent paths,
which cannot be mitigated by the delay insertion method. The second rea-
son, discussed in Section 8.2.2, is the fact that due to the uncertainty of the
delay elements inserted into the logic, the delay insertion might be ineffective
in improving the minimum clock period. In the LP formulations presented
in Tables 8.1 and 8.2, the uncertainties of the delay elements are modeled
without lower (and upper) bounds (delay elements can have zero uncertainty
with Im = IM ). Thus, the second reason for inapplicability is not observed in
the experimentation. Among the selected ISCAS’89 circuits, the delay inser-
tion method for edge-triggered circuits is applicable to 41% (12 circuits) of
the total 29 circuits. By excluding the circuits for which zero improvements
are observed (for which the method is not applicable due to the first reason
stated above), the average improvement of the delay insertion method for
edge-triggered circuits is observed to be 26% over the conventional clock skew
scheduling algorithm of [2] (Table 5.1). The delay insertion method on level-
sensitive circuits was applicable to 34% (10 circuits) of the total 29 circuits.
By excluding the circuits for which zero improvements are observed, the av-
erage improvement of the delay insertion method for level-sensitive circuits is
observed to be 27% on average over the conventional clock skew scheduling
algorithm presented in Chapter 5.
The experimental results in Figure 11.13 show that reconvergent paths—
with a significant probability (41% and 34% as observed on the ISCAS’89
circuits)—are the dominant limiting factor on the minimum clock period after
clock skew scheduling for a synchronous circuit. The delay insertion method
can effectively be used to mitigate these limitations, as shown by the 26% and
27% improvements in the minimum clock period. The proposed clock skew
scheduling method with delay insertion takes about twice as much time as
the conventional application of clock skew scheduling; however, the method is
highly practical, with total run times below a few minutes on common
computing resources.
The improvements in minimum clock period achieved through conventional
clock skew scheduling (I_FF^CSS and I_L^CSS) and through clock skew scheduling
with delay insertion (I_FF^DICSS and I_L^DICSS) for edge-triggered and
level-sensitive circuits are visually presented for each benchmark circuit in
Figures 11.14 and 11.15, respectively.
Shown in Figure 11.14 are the percentage improvements (I_FF^CSS and
I_FF^DICSS in Table 11.6, respectively) in minimum clock period via clock skew
scheduling and delay insertion for the edge-triggered ISCAS'89 benchmark
circuits. The two data points shown per benchmark circuit are, from left to
right, the improvements observed for clock skew scheduling alone and for delay
insertion with clock skew scheduling, respectively. Shown in Figure 11.15 are
the percentage improvements (I_L^CSS and I_L^DICSS in Table 11.6,
respectively) in minimum clock period via clock skew scheduling and delay
insertion for the level-sensitive ISCAS'89 benchmark circuits. The two data
points shown per benchmark circuit are, from left to right, the improvements
observed for clock skew scheduling alone and for delay insertion with clock
skew scheduling, respectively.
232 11 Experimental Results
[Figure 11.14: Percentage improvements in the minimum clock period (CSS and DICSS bars per circuit) for the edge-triggered ISCAS'89 benchmark circuits, including the average over all circuits.]

[Figure 11.15: Percentage improvements in the minimum clock period (CSS and DICSS bars per circuit) for the level-sensitive ISCAS'89 benchmark circuits, including the average over all circuits.]
[Figure: The physical design flow for rotary clock synchronized circuits. The BENCH, DEF, LEF, and SDF input files are read in; the circuit is partitioned into a 2x2 grid with chaco; registers are inserted; the per-partition LPs (min T, with T >= max(T1, T2, T3, T4) at the top block) are solved with GLPK for the optimal t_i; if CSS is infeasible, (1) re-iteration, (2) constraining boundary vertices, or (3) delay padding is applied; the flow concludes with placement (register mapping and logic placement).]
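For intuition, the per-partition clock skew scheduling LP (min T, solved with GLPK in the flow) can be mimicked by a small pure-Python sketch: for a fixed period T, the setup constraint t_i - t_f <= T - D_max and hold constraint t_f - t_i <= d_min of each local data path form a system of difference constraints, so feasibility reduces to Bellman-Ford negative-cycle detection, and the minimum period can be located by binary search. The function names, the toy data, and the binary-search bounds are illustrative assumptions, not the hpictiming implementation:

```python
def feasible(T, regs, paths):
    """Check whether a consistent clock schedule exists for period T.

    A difference constraint x_u - x_v <= w becomes an edge v -> u with
    weight w; the system is consistent iff the graph has no negative cycle.
    """
    edges = []
    for i, f, d_min, d_max in paths:
        edges.append((f, i, T - d_max))  # setup: t_i - t_f <= T - d_max
        edges.append((i, f, d_min))      # hold:  t_f - t_i <= d_min
    dist = {r: 0.0 for r in regs}        # implicit zero-weight super-source
    for _ in range(len(regs) + 1):       # settles within |V| rounds if feasible
        changed = False
        for v, u, w in edges:
            if dist[v] + w < dist[u] - 1e-9:
                dist[u] = dist[v] + w
                changed = True
        if not changed:
            return True                  # constraints consistent
    return False                         # still relaxing: negative cycle

def min_period(regs, paths, lo=0.0, hi=100.0, tol=1e-4):
    """Binary search for the minimum feasible clock period.

    Assumes hi is large enough to be feasible for the given paths.
    """
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if feasible(mid, regs, paths):
            hi = mid
        else:
            lo = mid
    return hi

# Toy two-register loop:
# (initial register, final register, minimum path delay, maximum path delay)
regs = [1, 2]
paths = [(1, 2, 1.0, 5.0), (2, 1, 1.0, 3.0)]
T_min = min_period(regs, paths)  # -> ~4.0
```

For this toy loop, zero clock skew would require T >= 5.0 (the slowest path), while clock skew scheduling reaches the loop-delay bound (5.0 + 3.0)/2 = 4.0, illustrating why the LP-based schedules in this chapter improve the minimum clock period.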
problems are generated. First, the generated LP problems are solved on a
single workstation in sequential order. The observed run times t_sequen record
the speedups over the conventional clock skew scheduling application due to
partitioning. Second, the generated LP problems are solved on the Xgrid
computing cluster in parallel, as described in Section 10.3. The observed run
times t_paral
Table 11.7. Clock skew scheduling results on 2x2 partitioned ISCAS’89 circuits.
Circuit   r     p     t_conven  t_sequen  t_paral  RTI_sequen (%)  RTI_paral (%)  Feasibility
s27 3 4 0 0 0 0 0 yes
s208.1 8 28 0 0 0 0 0 yes
s298 14 54 0 0 0 0 0 yes
s344 15 68 0 0 0 0 0 yes
s349 15 68 0 0 0 0 0 yes
s382 21 113 0 0 0 0 0 yes
s386 6 15 0 0 0 0 0 yes
s400 21 113 0 0 0 0 0 yes
s420.1 16 120 0 0 0 0 0 no
s444 16 113 0 0 0 0 0 yes
s510 6 15 0 0 0 0 0 yes
s526 21 117 0 0 0 0 0 yes
s526n 21 117 0 0 0 0 0 yes
s641 19 81 0 0 0 0 0 no
s713 19 81 0 0 0 0 0 no
s820 5 10 1 1 1 0 0 yes
s832 5 10 0 0 0 0 0 yes
s838.1 32 496 2 0 0 0 100 no
s938 32 496 1 1 1 0 0 no
s953 29 135 0 0 0 0 0 yes
s967 29 135 0 0 0 0 0 yes
s991 19 51 0 0 0 0 0 yes
s1196 18 20 0 0 0 0 0 no
s1238 18 20 0 0 0 0 0 no
s1423 74 1471 21 6 3 71 86 yes
s1488 6 15 0 0 0 0 0 yes
s1494 6 15 0 0 0 0 0 yes
s1512 57 415 1 0 0 100 100 yes
s3271 116 789 4 2 1 50 75 no
s3330 132 514 2 2 1 0 50 no
s3384 183 1759 22 4 3 82 86 yes
s4863 104 620 2 0 0 100 100 yes
s5378 179 1147 9 5 2 44 78 no
s6669 239 2138 33 10 7 30 79 no
s9234 228 247 52 15 8 71 85 no
s9234.1 211 2342 47 12 5 74 89 yes
s13207 669 3068 86 17 10 80 88 yes
s15850 597 14257 3545 735 447 79 87 no
s15850.1 534 10830 1358 156 110 89 92 yes
s35932 1728 4187 101 38 13 62 87 no
s38417 1636 28082 7707 3780 1845 51 76 yes
s38584 1452 15545 1394 749 339 46 76 yes
industrial1 14031 3692878 n/a 34680 25680 n/a n/a no
Average 25 28
record the speedups over the conventional clock skew scheduling application
due to partitioning and the parallelization of the application. Note that the
application of clock skew scheduling to industrial1 using the conventional
clock skew scheduling method is not possible; thus, run times are not reported.
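A minimal sketch of the parallel solution of the per-partition LP problems described in Section 10.3. A thread pool stands in for the Xgrid cluster, and `solve_partition` is a hypothetical placeholder (with invented toy periods) for the GLPK call issued for each of the four 2x2-grid partitions:

```python
from concurrent.futures import ThreadPoolExecutor

def solve_partition(pid):
    # Placeholder: would build and solve partition pid's clock skew
    # scheduling LP and return its minimum clock period. Toy values here.
    toy_periods = [4.2, 5.0, 3.7, 4.8]
    return pid, toy_periods[pid]

# Four workers, one per partition of the 2x2 grid; map preserves order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(solve_partition, range(4)))

# Top-level constraint of the flow: T >= max(T1, T2, T3, T4).
T = max(t for _, t in results)  # -> 5.0 for the toy values
```

In the actual flow the partitions are dispatched to separate cluster nodes rather than threads; the point of the sketch is that the partition LPs are independent, so the top-level period is simply the maximum of the per-partition optima.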
It is observed from Table 11.7 that t_paral is consistently and significantly
(especially for large-scale circuits) superior to t_sequen and t_conven.
Similarly, t_sequen is consistently superior to t_conven. The run time
improvements from t_conven to t_sequen and from t_conven to t_paral are listed
under RTI_sequen and RTI_paral, respectively.
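The RTI columns of Table 11.7 are plain percentage reductions, rounded to whole percent; the helper name below is illustrative, but the formula reproduces the tabulated rows (e.g., s1423):

```python
# Run time improvement (RTI) as reported in Table 11.7.
def rti(t_conven, t_new):
    if t_conven == 0:
        return 0  # sub-second runs round to zero improvement
    return round(100.0 * (t_conven - t_new) / t_conven)

# s1423 row of Table 11.7: t_conven = 21 s, t_sequen = 6 s, t_paral = 3 s
rti_sequen = rti(21, 6)  # -> 71, as tabulated
rti_paral = rti(21, 3)   # -> 86, as tabulated
```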
11.5 Physical Design of Rotary Clock Synchronized Circuits
In this section, the run times of hpictiming on the benchmark circuits are
analyzed to profile the speedups gained in overall program execution due to
partitioning and parallelization. In particular, the speedups available through
solving the partition problems sequentially and in parallel are computed using
the speedup formula presented in (10.5).
Table 11.8 presents the speedup results of the hpictiming tool on the ISCAS'89
benchmark and industrial1 circuits. In Table 11.8, the number of registers
and paths of each circuit are shown with r and p, respectively. Run times of
the hpictiming tool operated with various clock skew scheduling methods
on the ISCAS'89 benchmark circuits are shown. Run times of hpictiming with
the conventional clock skew scheduling method of Table 5.1 are denoted by
t_conven^hpictiming. The run times with the sequential solution of partitions
method are denoted by t_sequen^hpictiming, and the run times with the parallel
solution of partitions method are denoted by t_paral^hpictiming. In Table 11.8,
the speedups due to partitioning and the sequential application of clock skew
scheduling to the 2x2 partitions of the circuits are denoted by speedup_sequen,
computed with the following formula:

    speedup_sequen = t_conven^hpictiming / t_sequen^hpictiming.          (11.8)

The speedup due to partitioning and the parallel application of clock skew
scheduling, speedup_paral, is computed similarly:

    speedup_paral = t_conven^hpictiming / t_paral^hpictiming.            (11.9)
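The speedups in (11.8) and (11.9) are plain run time ratios. As an illustration of the formulas only, the sample values below are the s38584 CSS run times from Table 11.7 (not the hpictiming totals of Table 11.8):

```python
# Speedup ratios per (11.8) and (11.9); sample values from the s38584 row
# of Table 11.7, used purely to illustrate the formulas.
def speedup(t_conven, t_other):
    return t_conven / t_other

speedup_sequen = speedup(1394.0, 749.0)  # ~1.86x from partitioning alone
speedup_paral = speedup(1394.0, 339.0)   # ~4.11x with the parallel solution
```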
Recall from Section 11.5.1 that the application of clock skew scheduling
with partitioning is not feasible for some of the ISCAS'89 benchmark circuits
and for the industrial circuit industrial1. The circuits for which the method
is not applicable are not considered in the computation of the average
speedups. Still, the speedup numbers are presented individually for all of the
ISCAS'89 benchmark circuits and the industrial circuit industrial1 in
Table 11.8.
Table 11.8 shows that partitioning alone yields an average speedup of 2.1x in
the hpictiming run time. If the partitioned LP problems are solved in
parallel, the average speedup is 2.6x. Intuitively, as the size of a circuit
increases, the clock skew scheduling step of hpictiming, which is the fraction
of the task that is enhanced by partitioning and parallelization,
increases as well. So, for larger circuits, higher values of speedup are
expected through partitioning and parallelization. Indeed, such a trend is
observed in Table 11.8.

[Figure 11.17: The run times of hpictiming with Xgrid on the large circuits s15850.1, s38417, s38584, and industrial1 (in seconds), broken down into the Read-In, Partitioning, and Scheduling steps.]
The speedup in (10.5) is further investigated on several of the benchmark
circuits. The execution of hpictiming is divided into three main steps:
Read-in, Partitioning, and Scheduling. The Read-in step consists of reading
the input data and identifying the local data paths. The Partitioning step
consists of the timing-driven partitioning procedure implemented with chaco,
discussed in Section 10.2.2. The Scheduling step consists of the application
of clock skew scheduling to the generated partitions.
Figure 11.17 illustrates the relative run time lengths of each step for sev-
eral ISCAS’89 benchmark circuits and the industrial circuit industrial1 for
the parallel application of clock skew scheduling. The ISCAS'89 benchmark
circuits whose total run times are below a certain limit are not included in
the analysis. This selectivity eliminates inaccuracies due to round-off errors
in the reported run times, which are most prominent for circuits with run
times below a few seconds. Although the
solution for industrial1 is infeasible, the reported run times are believed to
be a good approximation of what they would have been, had all the
subpartitions been feasible. The total run time of the hpictiming program
(with the parallel application of clock skew scheduling) is reported in
Table 11.8 under the column t_paral^hpictiming.
The breakdown of run times to the three steps of hpictiming is shown for
the three largest circuits, s38584, s38417 and industrial1. The run times are
shown in Figure 11.18, 11.19 and 11.20 for s38584, s38417 and Industrial1,
respectively.
The run times for the three application methods (conventional, sequential, and
parallel application of clock skew scheduling) are shown for each circuit. The
run times for each step of hpictiming are shown with color codes: the read-in,
partitioning, and scheduling steps from bottom to top in each data bar.

[Figure 11.18: Run time breakdown of hpictiming program steps for s38584 (conventional, sequential, and parallel application; seconds).]

[Figure 11.19: Run time breakdown of hpictiming program steps for s38417 (conventional, sequential, and parallel application; seconds).]

[Figure 11.20: Run time breakdown of hpictiming program steps for industrial1 (sequential and parallel application; seconds).]
The partitioning step is not required in the conventional application method
and thus is not shown on the run time bars for the conventional application
cases. Even for the methods where partitioning is necessary, the partitioning
stage of the run time bar is not visible, because the run times of the
partitioning process with chaco are very small compared to the rest of the
execution time.
Note that the run times of the read-in and partitioning (where applied) steps
are identical across all three application methods. Through partitioning and
the parallel application of clock skew scheduling, the run time of the clock
skew scheduling step of the hpictiming program is improved. This improvement
speeds up the overall hpictiming program, the results of which are presented
in Table 11.8.
12
Conclusions
tion. Rotary clocking technology also permits non-zero clock skew operation
and multi-phase synchronization of systems. In the presented discussion, the
development of the physical design flow for rotary clock synchronized circuits
is described. The physical design flow consists of a novel partitioning step in
order to generate partitions of the circuit netlist on which clock skew schedul-
ing can be applied individually. The potential to parallelize the application of
clock skew scheduling is explored. Partitioning and the parallelization of the
application of clock skew scheduling are shown to provide significant speedups
in the run times of the timing analysis. Over the ISCAS'89 benchmark circuits,
an average speedup of 2.6x is observed on four processors. When applicable,
clock skew scheduling of partitions significantly improves the scalability of
clock skew scheduling.
In summary, this monograph presents valuable timing and synchroniza-
tion methodologies for non-zero clock skew scheduling and their automation
methods. The timing and synchronization methodologies are proposed par-
ticularly for the non-zero clock skew operation of high-performance digital
VLSI integrated circuits. Various algorithms and blueprints for methodology
development are presented, including algorithms for circuits with
edge-sensitive versus level-sensitive registers, for circuits synchronized by
a single clock phase versus multi-phase clocking schemes, and for modifying
the clock distribution network alone versus modifying the clock distribution
network simultaneously with the logic network. Theoretical limitations on the
improvement achievable through non-zero clock skew scheduling are presented
for the proposed algorithms and methodologies.
References
13. F. J. Hill and G. R. Peterson, Computer Aided Logical Design (with emphasis
on VLSI). John Wiley & Sons, Inc., 4th ed., 1993.
14. J. P. Uyemura, Introduction to VLSI Circuits and Systems. Wiley Publishing,
2001.
15. S.-M. Kang and Y. Leblebici, CMOS Digital Integrated Circuits: Analysis and
Design. The McGraw-Hill Companies, Inc., 3rd ed., 2002.
16. J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits:
A Design Perspective. Prentice-Hall, Inc., Upper Saddle River, NJ, 2nd ed.,
2002.
17. C. Mead and L. Conway, Introduction to VLSI Systems. Addison-Wesley Pub-
lishing Company, Reading, MA, 1980.
18. F. Anceau, “A Synchronous Approach for Clocking VLSI Systems,” IEEE
Journal of Solid-State Circuits, Vol. SC-17, pp. 51–56, February 1982.
19. M. Afghani and C. Svensson, “A Unified Clocking Scheme for VLSI Systems,”
IEEE Journal of Solid State Circuits, Vol. SC-25, pp. 225–233, February 1990.
20. S. H. Unger and C.-J. Tan, “Clocking Schemes for High-Speed Digital Sys-
tems,” IEEE Transactions on Computers, Vol. C-35, pp. 880–895, October
1986.
21. G. Y. Yacoub, H. Pham, M. Ma, and E. G. Friedman, “A System for Crit-
ical Path Analysis Based on Back Annotation and Distributed Interconnect
Impedance Models,” Microelectronics Journal, Vol. 19, pp. 21–30, May/June
1988.
22. H. Shichman and D. A. Hodges, “Modeling and Simulation of Insulated-Gate
Field-Effect Transistor Switching Circuits,” IEEE Journal of Solid-State Cir-
cuits, Vol. SC-3, pp. 285–289, September 1968.
23. N. Hedenstierna and K. O. Jeppson, “CMOS Circuit Speed and Buffer Op-
timization,” IEEE Transactions on Computer-Aided Design, Vol. CAD-6,
pp. 270–281, March 1987.
24. M. R. C. M. Berkelaar and J. A. G. Jess, “Gate Sizing in MOS Digital Circuits
with Linear Programming,” Proceedings of the European Design Automation
Conference, pp. 217–221, March 1990.
25. O. Coudert, “Gate Sizing for Constrained Delay/Power/Area Optimiza-
tion,” IEEE Transactions on Very Large Scale Integration (VLSI) Systems,
Vol. VLSI-5, pp. 465–472, December 1997.
26. U. Ko and P. T. Balsara, “Short-Circuit Power Driven Gate Sizing Technique
for Reducing Power Dissipation,” IEEE Transactions on Very Large Scale In-
tegration (VLSI) Systems, Vol. VLSI-3, pp. 450–455, September 1995.
27. S. R. Vemuru and N. Scheinberg, “Short-Circuit Power Dissipation Estima-
tion for CMOS Logic Gates,” IEEE Transactions on Circuits and Systems I:
Fundamental Theory and Applications, Vol. 41, pp. 762–765, November 1994.
28. H. J. Veendrick, “Short-Circuit Dissipation of Static CMOS Circuitry and its
Impact on the Design of Buffer Circuits,” IEEE Journal of Solid-State Circuits,
Vol. SC-19, pp. 468–473, August 1984.
29. A. S. Sedra and K. C. Smith, Microelectronic Circuits. Oxford University Press,
4th ed., 1997.
30. T. Sakurai and A. R. Newton, “Alpha-power Law MOSFET Model and its
Applications to CMOS Inverter Delay and Other Formulas,” IEEE Journal of
Solid-State Circuits, Vol. SC-25, pp. 584–594, April 1990.
172. G. Venkataraman, J. Hu, F. Liu, and C.-N. Sze, “Integrated Placement and
Skew Optimization for Rotary Clocking,” Proceedings of the IEEE Design,
Automation and Test in Europe, pp. 1–6, March 2006.
173. G. Venkataraman, J. Hu, and F. Liu, “Integrated Placement and Skew Opti-
mization for Rotary Clocking,” IEEE Transactions on Very Large Scale Inte-
gration (VLSI) Systems, pp. 149–158, February 2007.
174. Z. Yu and X. Liu, “Design of Rotary Clock Based Circuits,” Proceedings of the
ACM/IEEE Design Automation Conference, pp. 43–48, June 2007.
175. C. Ababei, S. Navaratnasothie, K. Bazargan, and G. Karypis, “Multi-Objective
Circuit Partitioning for Cutsize and Path-based Delay Minimization,” Proceed-
ings of the IEEE/ACM International Conference on Computer Aided Design,
pp. 181–185, November 2002.
176. I. Lustig, “Private Communication,” 2004. ILOG Inc.
177. B. Hendrickson and R. Leland, “The Chaco User’s Guide: Version 2.0,” Tech.
Rep., Sandia National Laboratories, Albuquerque, NM, Jul 1995.
178. A. Pothen, H. Simon, and K. Liou, “Partitioning Sparse Matrices with
Eigenvectors of Graphs,” SIAM Journal on Matrix Analysis and Applications,
Vol. 11, pp. 430–452, 1990.
179. R. Williams, “Performance of Dynamic Load Balancing Algorithms for Un-
structured Mesh Calculations,” Concurrency, Vol. 3, pp. 457–481, 1991.
180. B. Kernighan and S. Lin, “An Efficient Heuristic Procedure for Partitioning
Graphs,” Bell System Technical Journal, Vol. 49, pp. 291–307, 1970.
181. C. M. Fiduccia and R. Mattheyses, “A Linear Heuristic for Improving Network
Partitions,” Proceedings of the IEEE/ACM Design Automation Conference,
pp. 175–181, 1982.
182. B. Hendrickson and R. W. Leland, “A Multi-Level Algorithm For Partitioning
Graphs,” Supercomputing, 1995.
183. Apple Inc., Advanced Computing Group, Xgrid Guide, 2004.
184. MPI Standard Forum, http://www-unix.mcs.anl.gov/mpi/standard.html,
Message Passing Interface Standard v 2.0, 1997.
185. N. Shenoy, R. K. Brayton, and A. L. Sangiovanni-Vincentelli, “Graph Algo-
rithms for Clock Schedule Optimization,” Proceedings of the IEEE/ACM In-
ternational Conference on Computer–Aided Design, pp. 132–136, November
1992.
186. B. Taskin, “High Performance Integrated Circuit (hpic) Timing Software Pack-
age v1.9.” http://sourceforge.net/projects/hpictiming/, 2004.
187. Free Software Foundation (FSF), http://www.gnu.org/software/glpk/glpk.
html, GLPK (GNU Linear Programming Kit), 2005. version 4.8.
Index

A
Application-specific integrated circuits (ASICs), 15, 180, 244, 245

B
Bernoulli equations, 29

C
CAD. See Computer-aided design
Cascade voltage switch logic (CVSL), 39
Clock distribution network
  branching factor, 88
  circuit and interconnect structure, 72
  design process for, 16
  resistive-capacitive (RC), 35
  scheduling algorithms for, 4
  signals, 4
  tree structure of, 86
Clock signal
  clock pulse, 51
  clock skew
    lead/lag relationship, 52
    sequentially-adjacent registers, 53
  coincidental cycles of, 56
  data in, 49
  latching and non-latching edges of, 48–49
  leading and trailing edge of, 45
  multi-phase clock synchronization
    reference clock cycle, 54
    sample of, 53
  storage elements in, 50–51
Clock skew scheduling
  applications of, 96
  basis skews
    clock skew vector and enumeration, 178
    thicker edges and basis edges, 176–177
  circuit design process and safety margin in, 84
  clocking technology, 183
  definitions and graphical model
    clock delays in, 74
    graph-based models in, 76
    inherent structural limitations of, 76
    permissible range of, 74–76
    synchronous digital system, 73, 75–80
    timing parameters of, 75
  delay insertion method, 153–162
    edge-triggered circuits, 232
    ISCAS'89 benchmark circuits, 229–230
    level-sensitive circuits, 231, 233
    QP-based clock scheduling algorithm, 226–229
  double and zero clocking hazards in, 82
  double clocking, 81
  input file format
    samples of, 93, 94
    static timing analyzers, 91
  multi-phase system
    edge-triggered system, 117
    synchronization overview, 117–118
    timing circuits, 118–120
  propagation constraints, 100–101
  synchronization constraints, 98–100
  timing relationships, 97–98
  validity constraints, 101–102

L
Linear programming (LP) problems, 146
  models, 164–165
  naive approach, 163
Logic gates
  and registers
    sequentially-adjacent pair of, 74
    in synchronous digital circuit, 73
  switching characteristics of, 22–23
  values and properties of, 9

M
Message passing interface (MPI), 202
Metal-oxide-semiconductor field effect transistor (MOSFETs), 23
Mixed-integer linear programming (MIP), 87, 113, 164
Modified big M method (MBM), 105–106
Moore's law, 3
Multi-phase synchronization approach, 54

N
NAND gate, 9–10
N-channel enhancement mode MOSFET transistor (NMOS), 24
NMOS transistor, 44
Non-linear programming (NLP), 113
Non-zero clock skew scheduling
  applications of, 243
  automation and application of, 243
  circuit operating and applications of, 83
  clock signals and benefits in, 84
  researchers, 244
  synchronization methodologies for, 245

P
Phase-locked-loop (PLL), 183
  clock sources, 190
  components, 184
  reflections, capacitive loading, 184
Power supply rejection ratio (PSRR), 188

Q
QP algorithm derivation
  circuit graph, 129–130
  linear dependence
    circuit connectivity matrix, 134
    circuit graph cycles, 131
    clock scheduling algorithms, 130
    graph theory, 132
    independent cycle matrix, 136–137
    kernel equation, 135–136
    local data paths, 132–133
    matrix relationship, 134–135
    spanning trees, 133
  optimization problem and solution
    active constraints, 138–139
    clock skew definition, 137–138
    Gauss-Jordan elimination, 141
    global minimizer, 143
    Lagrange multipliers, 139–140
    linear system technique, 142–143
    local data paths, 138
    non-linear equation, 140
    objective clock skew schedule, 142
QP-based clock skew scheduling, computational analysis
  CSD, 172–175
  LMCS-1, 169–170
  LMCS-2, 170–172
  run time and memory requirements, 168
Quadratic programming (QP) formulation, 244–245
  computer implementation, 223–225
  graphical illustrations, 225

R
Resistive-capacitive (RC) loads, 72
  circuit network for, 38
  signal delay expressions in, 37
Resonant clocking technology
  clock tree network, 184
  digital integrated circuits, 183
  oscillators
    coupled LC and standing wave, 185
    traveling wave, 186
  partitioning process
    balanced priority assignment, 196
    with chaco, 195–196
    clock skew scheduling, 197–200
    path-based and net-based, 193
    registered-input and registered-output, 197
    register insertion, 196
    register placement, 200–201
    timing constraints and data path, 196–197
    timing-driven, 193–195
    tools and factors for, 194
    VLSI circuits, 192
  rotary circuits, timing requirements
    clock skew and signals, 189
    oscillatory signals, 191
    synchronization schemes, 190–191
  rotary traveling-wave oscillators (RTWO's), 185–189
ROA. See Rotary oscillator arrays
Rotary clock synchronized circuits
  CAD tool
    run time breakdown process, 239–241
    speedup process, 237–238
  clock skew scheduling
    industrial1, 234
    minimum clock periods, 234–235
Rotary oscillator arrays (ROA), 185
  clock architecture, 186
  grid topology, 185, 191
  ring, clock phase relationships of, 190, 191
  structures, 192
Rotary traveling-wave oscillators (RTWO's)
  anti-parallel inverters, 186
  integrated skew computation and logic placement, 189
  loop inductance, 188
  novel clock network, 185
  rings, 185, 187, 190
  shunt connected inverters, transmission line, 187
  theory, 187

S
Shichman-Hodges equations, 24
Spanning tree algorithm, edge swapping and enumeration, 180
Standard template library (STL), 234
Synchronous digital system
  logic gates and storage registers, 73
  signal cycles and graph representation of, 78
Synchronous systems
  clock signals, 50–54
  finite-state machine (FSM) model of, 13
  flip-flops, 47–50
    single-phase path with, 55–60
  latches, 43–47
    multi-phase path with, 65–69
    single-phase path with, 61–65
  storage elements, 42
  timing properties of, 41
System-on-chip (SoC), 181

V
Very large scale integration (VLSI) systems
  buffers and registers, 86
  circuit design and timing, 244
  circuits production in, 3
  delay metrics
    circuit analysis and design for, 23
    computer-aided design applications, 20
    logic gates and elements in, 19, 21
    signal propagation and making in, 23
    signal transitions in, 22
    signal waveforms circuit in, 21–22
  delay mitigation, 37
  design process
    electronic devices, switching properties in, 14
    synthesis tools in, 15
  devices and interconnections
    analytical delay analysis, 26–31
    delay controlling, 31
    delay mitigation, 37–39
    gain factor in, 25
    importance of, 35–37
    RC estimation in, 36