

An Efficient Clock Tree Synthesis Method in Physical Design
Conference Paper, December 2009
DOI: 10.1109/EDSSC.2009.5394159


An Efficient Clock Tree Synthesis Method in Physical Design
Guirong Wu, Song Jia, Yuan Wang, Ganggang Zhang
Abstract: This paper proposes a method for achieving low clock
skew that is applicable to the mainstream industry clock
tree synthesis (CTS) design flow. The original clock root
is partitioned into several pseudo clock sources at the gate
level. The automatic place and route (APR) tool can then
synthesize the clock tree with lower clock skew because
each pseudo clock source drives a smaller fan-out. The
proposed method is applied to a chip-level clock tree
network and achieves good results.
Keywords: Physical Design, Clock Tree Synthesis, Low
Clock Skew

I. INTRODUCTION
Clock network design has been a key aspect of the
design process, which directly impacts the performance of
the chip. The following equation [1] summarizes the
relationship among the clock period $P$, the clock skew $s$,
the worst-case data path delay $d_{\max}$, and an offset
constant $P_o$ for proper timing:

$$P = s + d_{\max} + P_o \qquad (1)$$

The clock skew is the maximum difference among the
clock latencies from the clock source to the flip-flops. Skew
can be calculated at the edge of the clock root in three
fashions: rise skew, fall skew, and trigger-edge skew [2].
In this paper, we calculate the skew in the trigger-edge
fashion. $P_o$ is a constant that includes data setup time, hold
time, latch active time, and other possible offset factors
such as safety margins. It is clear from the equation that, to
reduce the cycle time $P$, it is necessary to minimize the
skew $s$ in addition to minimizing the worst-case data
delay $d_{\max}$ in the combinational logic.
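As a small worked illustration of Equation (1), the sketch below evaluates the clock period for a few skew values. The data path delay and offset used here are hypothetical numbers chosen only to show the arithmetic; they are not taken from the paper.

```python
def min_clock_period_ps(skew_ps: float, d_max_ps: float, offset_ps: float) -> float:
    """Equation (1): P = s + d_max + P_o, with all values in picoseconds."""
    return skew_ps + d_max_ps + offset_ps


# Hypothetical values for illustration only (not reported in the paper).
D_MAX_PS = 12000.0  # worst-case data path delay
OFFSET_PS = 500.0   # setup, hold, and safety margin lumped into P_o

for skew_ps in (100.0, 50.0, 20.0):
    period_ps = min_clock_period_ps(skew_ps, D_MAX_PS, OFFSET_PS)
    # A period in ps converts to a frequency in MHz as 1e6 / period_ps.
    print(f"skew = {skew_ps:5.1f} ps -> P = {period_ps:8.1f} ps "
          f"(f_max = {1e6 / period_ps:.2f} MHz)")
```

With the other two terms held fixed, every picosecond removed from the skew is removed directly from the achievable cycle time.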
As interconnect delay becomes increasingly dominant
in deep submicron (DSM) silicon technologies, the
clock skew becomes more significant in terms of circuit
performance. Therefore, the minimization of skew is
always a very important topic in the design of
synchronous sequential circuits [3].
With the growing complexity of system designs, clock
networks are becoming increasingly complex. Clock nets
have larger fan-out and have to be distributed over more
function modules. It has been observed that the
performance of existing APR tools deteriorates because of
their limited computation capability. To address this issue, the
logical solution is to break the clock nets into smaller parts.
In [4], this is accomplished by partitioning the chip
into several pseudo-partitions at the layout level, based on
the cell placement. However, a new Visual Basic based
routing tool built on the Exact Zero Skew (EZS) routing
algorithm [1] has to be developed to support that
methodology, which is time-consuming.
Furthermore, other issues discussed in the conclusion
section of [4] also restrict the use of this method.
Partitioning at the RTL level, in turn, involves changes to
the chip architecture and would increase the complexity
of the solution. Apart from these two methods,
partitioning at the gate level has not been reported so far,
and it is the object of our study.

Guirong Wu, Song Jia*, Yuan Wang and Ganggang Zhang are
with the Key Laboratory of Microelectronics Devices and
Circuits, Institute of Microelectronics, Peking University,
Beijing, P. R. China.
E-mail: rong@pku.edu.cn, jias@pku.edu.cn,
wangyuan@pku.edu.cn, zhgang@ime.pku.edu.cn
978-1-4244-4298-0/09/$25.00 ©2009 IEEE

In this paper, we partition the original clock root into
several pseudo clock sources at the gate level, which
requires no changes to the chip architecture and no extra
routing tool to be developed. The method is applicable to the
mainstream industry CTS design flow, which ensures
quality and efficiency. The outline of this article is as
follows. We first review the common CTS modes
supported by the mainstream industry APR tools, with
Cadence First Encounter (v03.30) used as the
platform. Next, we conduct a series of experiments and
derive clock tree partition guidance from the results of
these experiments. Then we apply the method to the
chip-level clock tree synthesis of an embedded processor and
compare the experimental results of the proposed
method and the conventional method. Finally, we discuss
the results and draw the conclusion.

II. REVIEW FOR COMMON CTS MODES


There are two modes for running CTS in the Cadence First
Encounter APR tool: manual and automatic [2].
Manual CTS mode allows the user to control the number of
levels and the number of buffers, and to specify the types of
buffers at each level. The following is an example of
clock tree specification file syntax; a graphic
representation of that syntax is shown in Fig. 1:
    ClockNetName  MCLK_GE
    LevelNumber   2
    LevelSpec     1 2 CLKBUFX20
    LevelSpec     2 10 CLKBUFX16
    PostOpt       YES
    End
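
To make the structure of the manual-mode example concrete, the following sketch parses that text into a small dictionary. It is not part of the paper's flow. The field order assumed for LevelSpec (level index, number of buffers, buffer cell) is an interpretation of the example and should be checked against the tool documentation; the function name is hypothetical.

```python
MANUAL_SPEC = """\
ClockNetName MCLK_GE
LevelNumber 2
LevelSpec 1 2 CLKBUFX20
LevelSpec 2 10 CLKBUFX16
PostOpt YES
End
"""


def parse_manual_spec(text):
    """Parse a manual-mode block into a dict (hypothetical helper)."""
    spec = {"levels": []}
    for line in text.splitlines():
        fields = line.split()
        if not fields or fields[0] == "End":
            continue
        key, values = fields[0], fields[1:]
        if key == "LevelSpec":
            # Assumed field order: level index, buffer count, buffer cell.
            level, count, cell = values
            spec["levels"].append(
                {"level": int(level), "buffers": int(count), "cell": cell}
            )
        else:
            spec[key] = " ".join(values)
    return spec


spec = parse_manual_spec(MANUAL_SPEC)
total_buffers = sum(entry["buffers"] for entry in spec["levels"])
print(spec["ClockNetName"], spec["LevelNumber"], total_buffers)
```

Under the assumed field order, the example describes a two-level tree on net MCLK_GE with 2 buffers at the first level and 10 at the second.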


Fig. 1. Graphic representation of manual CTS mode

For automatic CTS on a net, CTS builds the clock
buffer tree according to the constraints in the clock tree
specification file, such as the maximum delay, maximum
transition, and maximum skew; generates the clock tree
topology; and balances the clock phase delay with
appropriately sized, inserted clock buffers. The following is
an example of clock tree specification file syntax for
automatic CTS on a net; a graphic representation of the
syntax is shown in Fig. 2:

    MacroModel pin  alu_core/clk 20ps 18ps 20ps 18ps 30ff
    AutoCTSRootPin  clk_div/U3/Q
    NoGating        rising
    Buffer          CLKINV CLKBUF DLY
    MaxDelay        5ns
    MinDelay        0ns
    SinkMaxTran     80ps
    BufMaxTran      80ps
    MaxSkew         50ps
    End

Fig. 2. Graphic representation of automatic CTS mode

Note that the skew among nodes A, B, and C may meet
the maximum skew specified in the clock tree specification
file.
At the end of this section, we introduce the Clock
Grouping technology, which is available in automatic
CTS mode. All clock root pins entered into a clock
group will have their sinks meet the maximum skew
specified in the clock tree specification file. Clock
grouping balances the clocks and attempts to meet the clock
skew for all clocks as if they were one tree. The following
is an example of clock group syntax; its graphical
representation is shown in Fig. 3:

    ClkGroup
    + SH1/I3/Z1
    + SH2/I4/Z2

Fig. 3. Graphic representation of clock grouping syntax
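
When several root pins need the same automatic CTS constraints plus a clock group, the specification text can be produced by a short script. The sketch below is only an illustration: the keywords and constraint values are copied from the examples in this section, while the function names, the ordering of the blocks in the file, and the choice to emit one block per root pin are assumptions to verify against the Encounter documentation.

```python
def auto_cts_block(root_pin, max_skew="50ps"):
    """Return one automatic-mode block for a root pin (hypothetical helper)."""
    lines = [
        f"AutoCTSRootPin {root_pin}",
        "NoGating rising",
        "Buffer CLKINV CLKBUF DLY",
        "MaxDelay 5ns",
        "MinDelay 0ns",
        "SinkMaxTran 80ps",
        "BufMaxTran 80ps",
        f"MaxSkew {max_skew}",
        "End",
    ]
    return "\n".join(lines)


def clk_group(root_pins):
    """Return a ClkGroup statement listing every root pin on its own line."""
    return "\n".join(["ClkGroup"] + [f"+ {pin}" for pin in root_pins])


def build_spec(root_pins):
    """Concatenate one block per root pin, followed by the clock group."""
    blocks = [auto_cts_block(pin) for pin in root_pins]
    blocks.append(clk_group(root_pins))
    return "\n\n".join(blocks) + "\n"


print(build_spec(["SH1/I3/Z1", "SH2/I4/Z2"]))
```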

III. OUR PROPOSED CTS METHOD


Before introducing our proposed method, we
conduct a series of CTS experiments on one function
module targeted at the SMIC 0.18 µm process
technology, using Cadence First Encounter (v03.30). The
module consists of 10906 gates, each gate being equivalent
to one two-input NAND gate with minimum driving
capability in the targeted process technology, in addition to
1190 D flip-flops (DFFs), which are triggered at the falling
edge and are all synchronized by a single clock root named
MCLK. The original clock root is partitioned into several
pseudo clock sources at the gate level. The clock sources
are called "pseudo" because they are not part of the real
design intent. After this partition stage, each
pseudo clock source drives a smaller fan-out.
In this experiment, we break the initial clock root into 19
new pseudo clock sources, named MCLK_0, MCLK_1,
MCLK_2, ..., MCLK_17, MCLK_18. Each of MCLK_0 to
MCLK_17 drives 64 DFFs, and the last clock source,
MCLK_18, drives the remaining 38 DFFs.
The graphic representation is shown in Fig. 4.
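The partition arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that simply chunks a list of sinks in order; the paper does not state how individual DFFs are assigned to pseudo sources, and the instance names and function name below are hypothetical.

```python
def partition_sinks(sinks, max_fanout, prefix="MCLK"):
    """Split sinks into consecutive groups of at most max_fanout elements."""
    groups = {}
    for start in range(0, len(sinks), max_fanout):
        name = f"{prefix}_{start // max_fanout}"
        groups[name] = sinks[start:start + max_fanout]
    return groups


# Hypothetical instance names standing in for the 1190 DFFs of the module.
dffs = [f"dff_{i}" for i in range(1190)]
groups = partition_sinks(dffs, 64)
print(len(groups), len(groups["MCLK_0"]), len(groups["MCLK_18"]))
```

With a fan-out limit of 64, the 1190 DFFs yield 18 full groups and one final group of 38, matching the scheme described above.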
In the first CTS experiment, we specify the original
clock root as the only AutoCTSRootPin in the clock
tree specification file. We use First Encounter to
synthesize the clock tree in the automatic CTS mode
discussed in Section II. The trigger-edge skew is
measured as 39.2 ps.
In the second CTS experiment, we specify the 19 new
pseudo clock sources, obtained through the partition scheme
described in the first paragraph of this section, as root pins
in the clock tree specification file. We again use First
Encounter to synthesize the 19 clock trees in automatic CTS
mode. The trigger-edge skews in each pseudo clock
domain are listed in Table I, in which the skew unit is
picoseconds.

Fig. 4. Graphic representation of clock root partition

TABLE I
STATISTICS OF SKEW

    Source    Skew (ps)    Source     Skew (ps)
    MCLK_0    13.1         MCLK_10    11.0
    MCLK_1    15.3         MCLK_11    7.1
    MCLK_2    11.3         MCLK_12    11.2
    MCLK_3    11.5         MCLK_13    7.4
    MCLK_4    6.6          MCLK_14    12.2
    MCLK_5    11.5         MCLK_15    9.9
    MCLK_6    14.7         MCLK_16    13.0
    MCLK_7    14.1         MCLK_17    15.0
    MCLK_8    15.4         MCLK_18    7.6
    MCLK_9    14.3

From the statistics in Table I, the skews are much
lower than the 39.2 ps measured in the first experiment.
This is because the APR tool performs better on a
small clock net. The tool's capability
limitation slows the design process and degrades
the performance as the clock net fan-out grows [5].
Based on this observation,
we make the following assumption: breaking up the
clock root into several new pseudo clock sources at the
gate level and then synthesizing the clock tree for each
may achieve low clock skew. To balance the skew
among the new pseudo clock trees, we can use the Clock
Grouping technology introduced in Section II.
We conduct the third CTS experiment to validate our
assumption. Five clock root partition schemes on the
same function module are listed below.
A: The clock tree is synthesized under one single clock root,
which drives 1190 DFFs;
B: The original clock root MCLK is partitioned into 2
clock sources, MCLK_0 driving 573 DFFs and
MCLK_1 driving 617 DFFs;
C: The original clock root is partitioned into 5 sources, in
which 4 sources, MCLK_0, MCLK_1, MCLK_2, and
MCLK_3, drive 256 DFFs each and the last
one, MCLK_4, drives 166 DFFs;
D: The original clock root is partitioned into 9 sources, in
which 8 sources, MCLK_0 ~ MCLK_7, drive 128
DFFs each and the last one, MCLK_8, drives
166 DFFs;
E: The original clock root is partitioned into 19 sources,
in which 18 sources, MCLK_0 ~ MCLK_17, drive
64 DFFs each and the last one, MCLK_18,
drives 38 DFFs.
We use First Encounter to synthesize the clock trees
in automatic CTS mode for each partition scheme, with the
same configuration in the clock tree specification file. The
trigger-edge skew of each case is shown in Table II.

TABLE II
STATISTICS OF THE CTS RESULT

    Case    Skew (ps)    Clock Tree Area (µm²)    Total Area (µm²)    Time (normalized)
    A       39.20        7048                     154739              1
    B       34.40        8522                     157569              1
    C       26.40        10003                    159442              1
    D       23.6         11113                    161344              1
    E       17.3         15062                    161220              1.8


The results confirm our assumption and clearly show
that the proposed CTS method is effective in lowering clock
skew. The smaller the fan-out driven by each pseudo
clock source, the lower the skew that can be achieved. However,
lower skew comes at the cost of larger chip area and
longer run time. For instance, when the skew improves by
56% in case E compared with case A, the chip area
increases by 4.2% and the run time increases by 80%. Therefore,
the appropriate partition scheme should be determined
from the trade-off among skew, area, and time cost. From
the statistics in Table II, case C or D appears
appropriate because of its obvious improvement in skew with
only a small increase in chip area and almost no growth in run time.
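
The percentages quoted above follow directly from Table II. The sketch below recomputes them for every case relative to case A; the table values are copied from Table II, while the dictionary layout and function name are illustrative choices.

```python
# Table II rows: (skew in ps, clock tree area, total area, normalized time).
TABLE_II = {
    "A": (39.20, 7048, 154739, 1.0),
    "B": (34.40, 8522, 157569, 1.0),
    "C": (26.40, 10003, 159442, 1.0),
    "D": (23.6, 11113, 161344, 1.0),
    "E": (17.3, 15062, 161220, 1.8),
}


def relative_to_baseline(table, baseline="A"):
    """Return skew gain and area/time cost of each case in percent."""
    base_skew, _, base_area, base_time = table[baseline]
    result = {}
    for case, (skew, _tree_area, area, time) in table.items():
        result[case] = {
            "skew_gain_pct": 100.0 * (base_skew - skew) / base_skew,
            "area_cost_pct": 100.0 * (area - base_area) / base_area,
            "time_cost_pct": 100.0 * (time - base_time) / base_time,
        }
    return result


tradeoff = relative_to_baseline(TABLE_II)
for case, r in tradeoff.items():
    print(f"{case}: skew -{r['skew_gain_pct']:.1f}%  "
          f"area +{r['area_cost_pct']:.1f}%  time +{r['time_cost_pct']:.0f}%")
```

Listing the three figures side by side for each case makes it easier to see why cases C and D are attractive: most of the skew reduction is obtained before the run time starts to grow.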

IV. METHOD APPLICATION
In this section, we apply the proposed method to the
chip-level clock tree synthesis of a 32-bit RISC-based
embedded processor (27690 gates, 66 MHz, SMIC 0.18 µm
process technology), which was designed by the R&D team
of the Key Laboratory of Microelectronics Devices and
Circuits, Institute of Microelectronics, Peking University.
Firstly, we analyze the clock tree structure to determine
the partition scheme of the pseudo clock sources. In the
targeted chip, there is only one original clock root,
named MCLK, which synchronizes a total of 1673
DFFs. A brief graphic representation of the clock tree
structure is shown in Fig. 5. The number of DFFs in each
function module is listed in brackets.

Fig. 5. Graphic representation of the clock tree structure

Secondly, we determine the partition scheme according
to the function and fan-out of each module. This process may
require repeated trials in order to obtain an appropriate
scheme. One partition scheme is shown in Fig. 6. We
break up the original clock root into 10 new pseudo clock
sources. The partition scheme in module register_bank is
the same as case C described in Section III. A few Perl
scripts were developed to partition the original clock root
and generate the new gate-level netlist containing the pseudo
clock sources.

Fig. 6. Graphic representation of the partition scheme

Finally, we use First Encounter to synthesize the
pseudo clock trees in automatic CTS mode; this case is
labeled "new" in Table III. For comparison between the
proposed method and the conventional method, we also
conduct a CTS experiment on the original clock root,
labeled "original". The summary of the
comparison is shown in Table III. It shows that the
method proposed in this article improves the
trigger-edge skew by 66.3%, while increasing the chip area by
5.88% and the run time by 40% compared with the
conventional method.

TABLE III
SUMMARY OF THE COMPARISON

    Case        Skew (ps)    Clock Tree Area (µm²)    Total Area (µm²)    Time (normalized)
    original    88.4         16695                    452783              1
    new         29.8         22473                    479367              1.4

V. CONCLUSION

The method proposed in this article improves the clock
skew significantly at the cost of area and throughput
time. By contrast, in the results shown in [4] the
maximum skews show some improvement, though not a
significant one, while there is a significant
improvement in time. No area is reported in [4], so
we cannot make the corresponding comparison. As for the
significant difference in throughput time, the
improvement in [4] is attributed to the ability to run clock tree
generation (CTG) on multiple smaller pseudo clock
sources on multiple workstations in parallel, whereas our
tool is run on one CPU core. This explains
our result in throughput time. In theory, there should
be a significant improvement in time if we run the tool on
multiple workstations concurrently.
Furthermore, because our method is applicable to the
mainstream industry CTS design flow, it overcomes many
of the drawbacks of [4], such as too much manual analysis,
the need for accurate delay estimations, and other realistic
layout concerns discussed in its conclusion section. From this
point of view, our approach is superior to the one in [4].

ACKNOWLEDGMENT
I owe many thanks to all of the people who contributed
to this paper. First and foremost, I would like
to thank my adviser, Dr. Song Jia, from the bottom of my
heart for his guidance. He has been a great source of
ideas and has provided me with invaluable feedback. Second,
I would like to thank Dr. Yuan Wang and Ganggang
Zhang, who gave me their constant support and
suggestions during the project and paper writing.

REFERENCES


[1] Tsay, R.-S., Exact zero skew, Computer-Aided
Design, 1991. ICCAD-91. Digest of Technical
Papers, 1991 IEEE International Conference on, pp.
336-339, 1991.
[2] Encounter User Guide, Product Version 5.2.1,
February 2006.
[3] Chia-Ming Chang, Shih-Hsu Huang, Yuan-Kai Ho,
Jia-Zong Lin, Hsin-Po Wang, Yu-Sheng Lu,
Type-matching clock tree for zero skew clock gating,
Design Automation Conference, 2008. DAC 2008.
45th ACM/IEEE, pp. 714-719, 2008.
[4] Reaz, M.B.I., Amin, N., Ibrahimy, M.I., Mohd-Yasin,
F., Mohammad, A., Zero skew clock routing for fast
clock tree generation, Electrical
and Computer Engineering, 2008. CCECE 2008.
Canadian Conference on, pp. 4-7, May 2008.
[5] Y. P. Chen, D. F. Wong, An Algorithm for Zero
Skew Clock Tree Routing with B, In: Proceedings
of the 42nd annual conference on Design automation,
USA, pp. 783-788, 2005.
