
Multi-Million Gate design

From RTL to GDS


Using Synopsys flow within less than 10 weeks

Yaron Lavi

Intel Corporation

yaron.lavi@intel.com

ABSTRACT

Synopsys offers a set of tools that enables a smooth flow from RTL to GDS (Tape-Out, TO) within a
relatively short time and with only two major layout iterations.
Although the RTL-to-GDS schedule depends heavily on design complexity, layout utilization, computing
resources, head count and many other factors, we found a flow that does the job with a high
confidence level and in approximately constant time for multi-million gate projects.
This paper presents a proven flow that we used in several projects, in which we took advantage of:
• Design Compiler for synthesis
• Physical Compiler + DFT Compiler for placement and scan insertion
• Astro for clock tree, HFN and routing
• PrimeTime for static timing analysis
The result is working silicon in all target design corners, which is being manufactured in high
volume.
It is fair to mention that tools from other vendors also support the design effort and validation,
but the tools above are the backbone of the back-end flow.
Table of Contents

1.0 Introduction
2.0 Design flow
2.1 RTL Verification
2.2 Professional Synthesis
2.3 Floor Planning (Update)
2.4 Physical Synthesis (G2PG)
2.5 Layout and Timing closure
2.5.1 Astro Physical stage
2.5.2 Full Timing model for Static Timing Analysis
2.5.3 Astro ECO mode
2.6 Final tuning
3.0 Conclusions and Recommendations
4.0 Acknowledgements

Table of Tables
Table 1 - Typical Synthesis results
Table 2 - Physical congestion typical results
Table 3 - Astro clock tree skews

Table of Figures
Figure 1 - General design flow
Figure 2 - Physical SCAN Chains
Figure 3 - Layout Typical Congestion
Figure 4 - Bonus and FIB cell scattering

SNUG Israel 2004 2 From RTL to GDS


1.0 Introduction

The growing market demand for new communication products, together with tough competition from
established as well as new competitors, makes Time To Market a key element of success.
Product lifetimes are becoming shorter, and a fast ramp-up to High Volume Manufacturing is
needed.
As a result, design cycles are shortening and a steady flow is needed. The flow should provide a
high level of confidence that the schedule constraints will be met. We have to reduce the
dependence of the schedule on gate count and design complexity, so that execution time is
approximately constant for multi-million gate projects.
In this paper, I present our flow, and its cost, for meeting these requirements. It is based
on the Synopsys tool set, with a flow that was proven in a couple of projects.

Using an advanced process for communication applications has the advantage that extra cells can be
added with no impact on timing. There is always room for additional margin to guarantee fast
execution.

An important point to emphasize is that every requirement or constraint has a cost. The
confidence level in quality and schedule that we have developed costs:
a. More computing resources
b. Many more licenses
c. Extra die area

These targets are based on two major assumptions, which must hold:

a. High-quality RTL code
b. Only two major timing closure loops

The flow is based on 0.18um and 0.13um processes with designs of about 2M gates. The design also
includes hard macros such as memories. A design of this size is beyond the tools' limits for
handling as one chunk, so it is divided into several clusters that run in parallel.

Most of the examples presented here are shown for one cluster of the design, in order to
prevent IP disclosure.
The paper covers the following design steps:
a. RTL to Gates
b. Gates to Placed Gates
c. Astro Physical Stage
• Clock Tree & HFN
• Routing
d. Timing closure
e. ECO flow
f. Full chip verification



2.0 Design flow

The design flow includes several steps and milestones, which are shown in general form in
Figure 1.

[Flow diagram. Blocks: Product Definition; RTL Coding; Logic ENV; Manual Design; Floor-Plan;
Re-use TB; Analog; Behavioral Model; Logic Model; Full Chip Testing; Synthesis; Analog Testing;
Verification (Regression, ATG, STA, FV, GLS, RV); LAYOUT (Manual, APR); LVS/DRC; TO]

Figure 1 - General design flow

This paper focuses on the physical design flow (the colored blocks in Figure 1) and describes how to
reach Tape-Out within 10 weeks with one full synthesis loop and two major timing closure loops.
Of course, the physical design team should know the design and the methodology very well before
starting this flow, because there is no room for rework. It is highly recommended to pass through
the process twice on an incomplete, not-yet-ready design to make sure the flow and tools work as
expected. This is the time to raise issues such as the floor plan (cluster partitioning), MAX
delay paths, and random logic that is too complex for the formal verification tools.

The physical design schedule can be summarized in the following points:


• RTL verification toward back-end readiness – a pre-stage that must be met before the schedule starts
• Synthesis optimization (including clock gate insertion) – 2 weeks
• Physical synthesis (including scan insertion and DRC fixing) – 1 week
• Routing (including HFN and CT insertion) – 1 week
• Static Timing Analysis (full annotation) x2 – 2 weeks
• Logic + timing ECOs x2 – 2 weeks
• Layout ECOs – 2 weeks
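
As a quick sanity check, the stage durations listed above can be added up. The short sketch below is only an illustration: it assumes the stages run strictly back to back with no overlap, which the paper does not state explicitly, and the names `STAGES`, `total_weeks` and `cumulative` are made up for this example.

```python
# Stage durations in weeks, taken from the list above.
# RTL verification is a pre-stage and is not counted in the schedule.
STAGES = [
    ("Synthesis optimization", 2),
    ("Physical synthesis", 1),
    ("Routing", 1),
    ("Static Timing Analysis x2", 2),
    ("Logic + timing ECOs x2", 2),
    ("Layout ECOs", 2),
]


def total_weeks(stages):
    """Sum of all stage durations."""
    return sum(weeks for _, weeks in stages)


def cumulative(stages):
    """Return (stage, end_week) pairs, assuming stages run back to back."""
    result, end = [], 0
    for name, weeks in stages:
        end += weeks
        result.append((name, end))
    return result


if __name__ == "__main__":
    for name, end in cumulative(STAGES):
        print(f"end of week {end:2d}: {name}")
    print("total:", total_weeks(STAGES), "weeks")
```

Under that assumption the stages account for the full 10-week budget, which leaves no slack for an unplanned iteration.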



2.1 RTL Verification

One of the critical milestones is the First Sign-Off, in which the entire RTL database is
delivered to the physical design flow. At this point the database must always be validated as
ready.
What does ready mean in the eyes of the physical designer?
Our definition of a ready database is very tight and includes the following major points:
1. Synthesizable code
2. Synchronous design
3. No Latches
4. No MAX delay violations (see footnote 1)
5. Design for Testability verified (Scan, Memory BIST etc.)
6. Asynchronous paths defined and verified
7. All design exceptions approved
8. All constraint files ready
9. Correct use of pre-created special cells

All these checkpoints can be verified with several tools on the market. Some are Synopsys
tools from the flow (such as Design Compiler and PrimeTime), while others address specific aspects
of design rule checking (such as SpyGlass, a logic equivalence checker, ATPG etc.).

Another important verification done at this point is matching the floor-plan area against the gate
count. This is the last opportunity to change the floor plan, because a change made later in the
flow has a major schedule impact.
No physical design cluster should exceed 70% utilization. This utilization is considered
low enough to absorb all the design “buffer” added in the flow and to guarantee that no floor-plan
changes are needed.
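
The 70% rule can be expressed as a simple check per cluster. The sketch below is illustrative only: it assumes utilization is standard-cell area divided by the placeable area of the cluster (how macros and blockages are counted is not specified in the paper), and the cluster names and areas are hypothetical.

```python
MAX_UTILIZATION = 0.70  # upper limit at RTL hand-off, from the text


def utilization(cell_area, placeable_area):
    """Fraction of the placeable area occupied by standard cells (assumed definition)."""
    if placeable_area <= 0:
        raise ValueError("placeable_area must be positive")
    return cell_area / placeable_area


def over_limit(clusters, limit=MAX_UTILIZATION):
    """Return the names of clusters whose utilization exceeds the limit."""
    return sorted(
        name
        for name, (cell_area, placeable_area) in clusters.items()
        if utilization(cell_area, placeable_area) > limit
    )


# Hypothetical clusters: (cell area, placeable area) in mm^2.
clusters = {"cluster_a": (6.3, 10.0), "cluster_b": (7.6, 10.0)}

if __name__ == "__main__":
    print("clusters needing a floor-plan change:", over_limit(clusters))
```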

The MAX delay margin is also an important parameter that should be taken into account at the early
stages of design. Communication product frequencies are low relative to the advanced process being
used. For example, typical frequencies are in the range of 40-160 MHz, with a few
designs at 250 MHz. This is far below CPUs, which use a similar process but run at GHz
frequencies. Therefore, in the stage before Physical Compiler/Astro (basic synthesis of the RTL),
we define the clock uncertainty as 25-30% of the clock cycle. This very high
margin forces all MAX delay violations to be solved by logic changes (such as pipelining) at the
early stages of design. As a result, the physical designer hardly ever has to deal with the
time-consuming problem of MAX delay.
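
To give a feel for what a 25-30% uncertainty means in absolute terms, the following sketch converts the frequencies quoted above into clock periods and uncertainty values. The function names are made up for this illustration; only the frequencies and percentages come from the text.

```python
def clock_period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def clock_uncertainty_ns(freq_mhz, fraction):
    """Clock uncertainty in nanoseconds, as a fraction of the clock period."""
    return clock_period_ns(freq_mhz) * fraction


if __name__ == "__main__":
    for freq in (40, 160, 250):
        low = clock_uncertainty_ns(freq, 0.25)
        high = clock_uncertainty_ns(freq, 0.30)
        print(
            f"{freq} MHz: period {clock_period_ns(freq):.2f} ns, "
            f"uncertainty {low:.2f}-{high:.2f} ns"
        )
```

The resulting values are the kind of number that would be placed in the synthesis constraints as the clock uncertainty for each clock.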

Summary:
Keep verification simple and use conservative design rules
Start with low utilization to remove floor-plan risk (pay with die area)
Prevent MAX delay violations with a high clock uncertainty to keep the flow moving (pay with area)

Footnote 1: The definition of a MAX delay violation is closely tied to the clock definition. The
margin defined in the clock uncertainty may cause unreal MAX delay violations. Our concept is
explained in the next chapter.



2.2 Professional Synthesis

In this part we actually start the physical design flow, and the schedule clock starts counting.
At this stage we use all our computing resources and all available licenses to get the best
gate-level results out of the RTL.
The constraint files are ready from the early/basic stage, and no surprises should occur at this stage.
This stage includes various types of synthesis, such as:
• Top down
• Bottom up
• Bottom up using the characterize method
• Advanced flow with DC_Ultra and DW_foundation when needed

This stage has some characteristics of a “trial and error” method, but the effort
yields results. Each design has its own right approach to synthesis, and you cannot know it just
from looking at the code. Design exploration and elaboration is an effort that must be taken into
account. I would expect any physical designer to know all the synthesis methods and to try them on
his block prior to the final run, but it seems the final drop of RTL always has its own secrets
(especially when dealing with arithmetic data path blocks).
As a result, we can see netlists with up to 10% gate-count reduction, including scan FFs and clock
gates (Power Compiler).

The table below shows typical block synthesis results. The maximum benefit is gained by
professional synthesis using additional licenses such as Ultra and DW foundation. It is design
dependent, but our experience shows a 5-10% reduction in gate count.
Moreover, when using a 0.13um process (as in the table), the extra synthesis MAX delay margin
has only a minor effect on gate count and area. The extra cells, which are the cost of the extra
margin, reduce the time spent later in the flow fixing MAX delay violations.
Power Compiler with the standard clock gate cell reduces gate count by an additional ~7%.

Clock uncertainty   Power      NAND gate count   Number of
margin [%]          Compiler   [Kgate]           instances
10                  ON         171.3             42,412
10                  OFF        186.4             46,840
25                  ON         175.6             44,631
25                  OFF        190.0             48,502
25*                 OFF        211.5             50,212

* - Basic synthesis
Table 1 - Typical Synthesis results
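
The relative effects in Table 1 can be recomputed directly from the gate counts. The sketch below only reuses the numbers in the table; the helper names are invented for this illustration.

```python
# NAND-equivalent gate counts [Kgate] from Table 1,
# keyed by (clock uncertainty margin in %, Power Compiler setting).
KGATES = {
    (10, "ON"): 171.3,
    (10, "OFF"): 186.4,
    (25, "ON"): 175.6,
    (25, "OFF"): 190.0,
}
BASIC_KGATES = 211.5  # 25% margin, Power Compiler OFF, basic synthesis


def reduction_pct(before, after):
    """Relative reduction from 'before' to 'after', in percent."""
    return 100.0 * (before - after) / before


def power_compiler_saving(margin):
    """Gate-count saving from Power Compiler at a given margin, in percent."""
    return reduction_pct(KGATES[(margin, "OFF")], KGATES[(margin, "ON")])


def margin_cost(power_compiler):
    """Gate-count growth from raising the margin from 10% to 25%, in percent."""
    return -reduction_pct(KGATES[(10, power_compiler)], KGATES[(25, power_compiler)])


if __name__ == "__main__":
    print(f"professional vs basic synthesis: "
          f"{reduction_pct(BASIC_KGATES, KGATES[(25, 'OFF')]):.1f}% fewer gates")
    for margin in (10, 25):
        print(f"Power Compiler at {margin}% margin: "
              f"{power_compiler_saving(margin):.1f}% fewer gates")
    for setting in ("ON", "OFF"):
        print(f"10% -> 25% margin, Power Compiler {setting}: "
              f"{margin_cost(setting):.1f}% more gates")
```

For this block, the cost of the extra margin is a few percent of gate count, which is small compared with the savings from professional synthesis and clock gating.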

Summary:
Keep control over gate-count with all netlist changes (DFT and Power)

2.3 Floor Planning (Update)

In general, the floor-plan stage comes very early in the design flow. I mention it at this point
because it is the last point at which a modification can be made with only minor schedule impact.

Blocks, macros and pads are verified once again to confirm that they can easily be placed within
the die size. This is based on the final synthesis results.

Also, the pins of each cluster should be placed in logical order. The Jupiter tool, which can find
the logical connections between the units, places them accordingly.
As a reminder, the main considerations for chip floor planning are:
• Die size
• Connections between macros/blocks and the pads
• Block interconnect
• Supply grid
• Number of pins
• Complexity of connections in the gate area, which leads to group/region definitions

The results of the above are fed back into Physical Compiler for the placed synthesis.

2.4 Physical Synthesis (G2PG)

In this stage, the number of iterations is reduced dramatically. Clusters are now at the level of
850K gates (NAND-equivalent), and the run time is about 30 hours including all the reports.
The design passes through several optimization iterations, and SCAN chains are built
according to the placement of the FFs. The minimum is one PhyOpt run followed by two incremental
runs.
The design MAX delay margin is reduced to 20% of the clock cycle.
An additional iteration of fixing MAX transition and MAX capacitance violations is done in order to
keep those violations out of the layout stage. Usually resizing solves these violations, but we
also add extra buffers to split long or high fan-out nets. MIN delay violations are handled too
(strictly speaking prevented, since there is no clock tree yet), based on statistical results.
Our approach to MIN delay prevention is to add an extra buffer in any path that has
the potential to become a MIN delay path. “Back to back” FFs, for example, are in this category.
All of this yields an “ECO” that inserts about 7,000 buffers into a block with 25K FFs
(180K instances).
The utilization now rises to 75%, and in extreme cases up to 85%, which is our upper
limit. Beyond this limit (in utilization, gate count and instance count) the results become poor
(MAX delay, congestion) and the run time is much higher. The Physical Compiler tool can also crash.
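
The MIN delay prevention idea can be sketched as a simple filter over flop-to-flop paths. This is a hypothetical illustration, not the actual selection used in the flow: the paper says only that the choice is based on statistical results and names back-to-back FFs as one example, so the criterion used here (number of combinational cells between the two flops) and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Path:
    launch_ff: str
    capture_ff: str
    logic_cells: int  # combinational cells between the two flops


def needs_hold_buffer(path, min_logic_cells=1):
    """Flag paths with too little logic to be safe against MIN delay.

    Back-to-back FFs have logic_cells == 0. The threshold is an assumed,
    tunable parameter.
    """
    return path.logic_cells < min_logic_cells


def plan_hold_buffers(paths, min_logic_cells=1):
    """Return (launch, capture) pairs that should get a preventive buffer."""
    return [
        (p.launch_ff, p.capture_ff)
        for p in paths
        if needs_hold_buffer(p, min_logic_cells)
    ]


# Hypothetical paths for illustration.
paths = [
    Path("ff_a", "ff_b", 0),   # back to back: candidate for a buffer
    Path("ff_b", "ff_c", 12),  # enough logic: left alone
]

if __name__ == "__main__":
    print(plan_hold_buffers(paths))
```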



Figure 2 - Physical SCAN Chains

The physical synthesis stage is also a good place to add logic ECOs.
The ECO mode is very easy to use at the netlist level. Physical Compiler adds and places the gates
with no major effect on timing when the design is ready.

As the report below shows, the congestion results for a typical block are quite good.

                                  X              Y
Congestion threshold:             0.7000         0.7000
Violations (usage > threshold)
  Number of edges:                9323/95392     17046/95392
  Maximum violation:              0.6802         0.5357
  Average violation:              0.0800         0.1797

Histo graph for congestion on X

     < 0.80: **************************************** (92278)
0.80 - 1.00: ** (3058)
1.00 - 1.20: * (45)
1.20 - 1.40: * (11)

Histo graph for congestion on Y

     < 0.80: **************************************** (80945)
0.80 - 1.00: ******* (13724)
1.00 - 1.20: * (717)
1.20 - 1.40: * (6)

Table 2 - Physical congestion typical results
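
The numbers in Table 2 are easier to judge as percentages. The sketch below copies the edge counts from the report and computes the share of edges above the 0.70 threshold and the share with usage of 1.00 or more. Treating the last two histogram bins as true overflow (demand above capacity) is an interpretation made for this example, not a statement from the tool.

```python
TOTAL_EDGES = 95392

# Edge counts per usage bin, copied from the report.
HIST_X = {"<0.80": 92278, "0.80-1.00": 3058, "1.00-1.20": 45, "1.20-1.40": 11}
HIST_Y = {"<0.80": 80945, "0.80-1.00": 13724, "1.00-1.20": 717, "1.20-1.40": 6}

# Edges with usage above the 0.70 threshold, copied from the report.
OVER_THRESHOLD = {"X": 9323, "Y": 17046}


def pct(part, total=TOTAL_EDGES):
    """Share of all edges, in percent."""
    return 100.0 * part / total


def overflow_edges(hist):
    """Edges with usage of 1.00 or more (assumed to mean real overflow)."""
    return hist["1.00-1.20"] + hist["1.20-1.40"]


if __name__ == "__main__":
    for axis, hist in (("X", HIST_X), ("Y", HIST_Y)):
        assert sum(hist.values()) == TOTAL_EDGES  # histogram covers every edge
        print(
            f"{axis}: {pct(OVER_THRESHOLD[axis]):.2f}% of edges above threshold, "
            f"{pct(overflow_edges(hist)):.2f}% at usage >= 1.00"
        )
```

Both histograms add up to the total edge count, and in both directions well under 1% of the edges are at or above full usage.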

The database is now defined as a Verilog netlist, a PDEF file and SDC files.
Two SDC files are defined: one for the clock tree and a relaxed one for the routing flow.
The SDC file for the clock tree is the same one Physical Compiler works with, with a 20% clock
uncertainty definition. For routing, we use a relaxed uncertainty of 15%. In several cases, the
results and run time are better that way.



2.5 Layout and Timing closure

2.5.1 Astro Physical stage

This is the first time we meet the layout tool (Astro). The placement, netlist and timing
constraints are loaded into the tool and all cells are fixed in place. Our methodology uses the
Physical Compiler placement as much as possible; only minor placement changes are allowed. At no
stage does the layout tool make DRC fixes (resizing, adding buffers) automatically. The reason is
that Astro's timing engine still differs from that of PrimeTime (the sign-off timing
tool). Also, changing the design through the ECO flow makes the changes more controllable and
accurate. This is possible because Physical Compiler projects the layout timing very close to the
PrimeTime results. We also protect ourselves by reducing the design margins as the design progresses.

The first stage is done with the tighter timing constraints; in it the clock trees, reset trees and
High Fan-Out nets are built.

The clock tree stage starts with loading the SDC. This lets Astro understand the design
constraints and the clock definitions. In order to verify that all the SDC constraints were read
correctly, we dump them out for review.
Next we have to check that the design can meet its timing requirements without the impact of the
nets, so we apply the “without interconnect” option and check the timing report.
The design should meet timing at this stage.
An important stage is the clock tree optimization. At the cost of insertion delay, the clock tree
skew is reduced to a level at which there are almost no MIN delay violations. This also depends on
the level of prevention used earlier, but those values are taken into consideration.
For an example, see Table 3.

              CLOCK1                          CLOCK2
              No             After            No             After
              Optimization   Optimization     Optimization   Optimization
Short path    0.4            2.4              1.3            1.43
Long path     244.7          13.1             2.9            1.45
Skew          244.3          10.7             1.6            0.02
Table 3 - Astro clock tree skews
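
The skew rows in Table 3 are simply the difference between the longest and shortest clock path. The sketch below recomputes them from the table values (units are as in the table, which does not state them) and also shows the insertion delay paid on the short path; the data structure and function names are invented for this illustration.

```python
# (short path, long path, reported skew) from Table 3.
TABLE3 = {
    ("CLOCK1", "no optimization"): (0.4, 244.7, 244.3),
    ("CLOCK1", "after optimization"): (2.4, 13.1, 10.7),
    ("CLOCK2", "no optimization"): (1.3, 2.9, 1.6),
    ("CLOCK2", "after optimization"): (1.43, 1.45, 0.02),
}


def skew(short_path, long_path):
    """Clock skew as the difference between the longest and shortest path."""
    return long_path - short_path


def added_insertion_delay(clock):
    """Increase of the short-path delay caused by the optimization."""
    before = TABLE3[(clock, "no optimization")][0]
    after = TABLE3[(clock, "after optimization")][0]
    return after - before


if __name__ == "__main__":
    for (clock, state), (short, long_, reported) in TABLE3.items():
        print(f"{clock} {state}: skew {skew(short, long_):.2f} (table: {reported})")
    for clock in ("CLOCK1", "CLOCK2"):
        print(f"{clock}: short path grows by {added_insertion_delay(clock):.2f}")
```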

The clock tree stage builds tree structures for all the FFs that belong to the same clock
net. The clock tree creates levels of buffers so that the clock signal reaches all those flops
with the right drive. Astro gives us the option to intervene in the building process, to define
the cells to be used and to optimize the tree.
Of course, a clock tree can also be built through gates that are not flops, and it is possible to
use a clock generated from another clock. We found that the clock tree optimization has to be
changed manually only for paths with a special gated clock and controlled FFs. This manual
optimization is easier and faster than fixing the resulting MIN delay violations later (see
footnote 2).
Now we are ready to build the high fan-out (HFO) nets. Such a net is defined as a widely connected
net, i.e. a net without a clock attribute to which many gates are connected.
The result of the HFO net command is a net driven through buffers, which form a structure similar
to a clock tree.
It is not recommended to treat HFN nets (such as reset) as clocks, which is why they need their
own treatment.
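
The definition of a high fan-out net given above can be written as a small filter. This is only a sketch: the fan-out threshold of 100 and the net names are hypothetical, since the paper does not say which threshold the flow uses.

```python
def high_fanout_nets(fanout_by_net, clock_nets, threshold=100):
    """Return non-clock nets whose fan-out exceeds the threshold.

    The threshold value is an assumption for illustration only.
    """
    return sorted(
        net
        for net, fanout in fanout_by_net.items()
        if fanout > threshold and net not in clock_nets
    )


# Hypothetical fan-out data for one block.
fanout_by_net = {
    "clk_core": 25000,   # clock: handled by the clock tree stage
    "rst_n": 25000,      # reset: handled as HFN
    "scan_en": 25000,    # scan enable: handled as HFN
    "data_bus[3]": 4,    # ordinary signal net
}

if __name__ == "__main__":
    print(high_fanout_nets(fanout_by_net, clock_nets={"clk_core"}))
```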

Routing
Astro is used as a routing tool only. The constraints file now includes a clock uncertainty of
15-20% of the clock cycle and a hold time margin of 100-350 ps. These are very fast runs, and we
can complete a full cluster within 2 days.

As in the previous stage, we start the routing stage by loading the SDC with the routing constraint
definitions and dumping it out in order to verify that all the SDC constraints were read correctly.

Before we start to route, we check that there are blockages in the areas we do not want to route
over, such as memory blocks or areas reserved for other purposes. A blockage can apply to a
specific metal layer (for example, metal 1) or to all layers.
We also load route guides for special nets, such as clocks, that we prefer to route in higher
metal layers.

Routing starts with special and/or sensitive nets such as clock nets, and proceeds with all the
other nets in three steps:

• Global route – maps the general pathway through the design for each unrouted net (with
no physical layer assigned)

• Track assignment – assigns nets to wire tracks, then places wires and vias to show the initial
routing configuration

• Detail route – performs detailed routing on the design and then writes the violations to the
routed cell

Lastly, we fix the violations with the search & repair command, which finds the violations and
fixes them automatically. The tool makes almost all the layout DRC fixes.

As you can see below, our typical block passes the routing stage with only one small area of high
congestion.

Footnote 2: This is done manually after verification in PrimeTime at the first timing loop.


Figure 3 - Layout Typical Congestion

The database is now a Verilog netlist and a SPEF file produced by STAR-RC.



2.5.2 Full Timing model for Static Timing Analysis

The database, which includes the new netlist and SPEF files from Astro, is read into
PrimeTime and verified in both the fast and slow corners.
It is very important to verify that all the un-annotated net warnings are due to net branches that
do not appear in the layout (and are therefore not a real problem). If such warnings do point
to real problems, the issue should be checked and fixed.
All net name mismatches between the SPEF files and the netlist file should also be fixed.
All working modes have to be tested, such as System mode, Production mode, Debug mode and
SCAN mode.
This is the time to verify timing violations and fix them. The four main violation types handled
are MAX delay, MIN delay, capacitance and transition violations. All violations should be handled,
down to an agreed percentage of the violation (for example, we may decide not to fix a cap
violation that is below 10% of the driving cell's capability).
Our experience shows that only 300-500 paths need to be fixed, for example by adding buffers,
up/down sizing or changing the location of a cell.
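
The waiver rule for small violations can be sketched as follows. This is one possible reading of the 10% example, assumed for illustration: a violation is waived when the overshoot above the limit is less than 10% of that limit. The function names and the sample values are hypothetical.

```python
def overshoot_fraction(actual, limit):
    """How far 'actual' exceeds 'limit', as a fraction of the limit."""
    return (actual - limit) / limit


def must_fix(actual, limit, waive_below=0.10):
    """True if the violation is large enough to require a fix.

    Values at or below the limit are not violations at all. The 10%
    waiver level follows the example in the text; its exact definition
    here is an assumption.
    """
    return overshoot_fraction(actual, limit) >= waive_below


if __name__ == "__main__":
    # Hypothetical capacitance values against a limit of 1.0 (arbitrary units).
    for actual in (0.9, 1.05, 1.2):
        print(f"load {actual}: fix needed = {must_fix(actual, 1.0)}")
```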
Each working mode has a different constraints file, according to the risk and importance of
the mode. For example, a different MIN delay margin is defined for SCAN in shift mode and in
capture mode.
When all violations are fixed and verified in the PrimeTime environment, we load the netlist into
Design Compiler and use DC commands to apply all the changes to the netlist.
This is the netlist for the 1st timing closure loop.
The conclusion from this step is that Physical Compiler gives a very good projection of timing, and
that the combination of Physical Compiler as a placer and Astro as a router yields high efficiency.
Since this stage may require a lot of checks, it can be very time consuming.
We need to reload all the constraints for every checked mode (clocks and external definitions,
relevant RC files, different case analysis, etc.).
Therefore, we need to update the design for every checked mode. When an update is
done, we reload the DB (netlist and RC files) for safety reasons.
Even though we use high-powered machines, a full chip update may take more than 30 minutes.
Since we may check up to 12 different modes (times two, for both corners), it would be a waste of
time not to run all of them in parallel. In the last stages of the project, we use about 5-6 Linux
machines, each with enough CPUs to run up to 3 PrimeTime licenses.
To handle all this parallel running, we built an environment that automatically creates detailed
reports for all relevant checked modes.
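
The benefit of running the modes in parallel can be estimated with simple arithmetic. The sketch below assumes 12 modes each run in both corners (24 runs), uses the 30-minute update time as a lower bound for one run, and assumes every run takes the same time; these simplifications and the function names are not from the paper.

```python
import math


def sta_runs(modes, corners=2):
    """Total number of timing runs: every mode in every corner."""
    return modes * corners


def parallel_slots(machines, licenses_per_machine=3):
    """Number of runs that can execute at the same time."""
    return machines * licenses_per_machine


def waves(runs, slots):
    """Number of batches needed to get through all runs."""
    return math.ceil(runs / slots)


def wall_clock_minutes(runs, slots, minutes_per_run=30):
    """Lower bound on elapsed time, assuming equal-length runs."""
    return waves(runs, slots) * minutes_per_run


if __name__ == "__main__":
    runs = sta_runs(12)
    print("serial:", wall_clock_minutes(runs, slots=1), "minutes at least")
    for machines in (5, 6):
        slots = parallel_slots(machines)
        print(f"{machines} machines ({slots} slots):",
              wall_clock_minutes(runs, slots), "minutes at least")
```

With these numbers, 24 runs fit into two batches on either 5 or 6 machines, compared with 24 sequential runs on a single license.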



2.5.3 Astro ECO mode

The updated netlist is loaded into Astro and a compare file is produced.
New cells are placed manually, in order to prevent new violations, such as MAX transition, that
the automatic placer can cause.
At this stage we also add our FIB cells (see footnote 3) and bonus cells (see footnote 4). Astro
knows how to spread them homogeneously over the design. The number of such cells is driven mainly
by the cluster utilization and the risk of the block.
We repeat steps 2.5.2 and 2.5.3 for one additional loop; the number of violations drops by a
factor of 10 each time.
In some cases an additional loop is needed to fix 1-10 violations, but in general two iterations
are enough to close all issues.
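
The convergence described above can be illustrated with the numbers from the text: 300-500 paths to fix at the first timing analysis and a tenfold reduction per loop. The sketch assumes the factor of 10 holds exactly in every loop, which is an idealization; the function name is made up.

```python
def remaining_violations(initial, loops, factor=10):
    """Violations left after a number of ECO loops, assuming a fixed reduction factor."""
    return initial / factor ** loops


if __name__ == "__main__":
    for initial in (300, 500):
        counts = [remaining_violations(initial, n) for n in range(3)]
        print(f"start {initial}: " + " -> ".join(f"{c:g}" for c in counts))
```

After two loops the estimate lands in the single digits, which matches the 1-10 leftover violations mentioned for the occasional third loop.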

Figure 4 - Bonus and FIB cell scattering

Footnote 3: FIB cells – extra standard library cells in which all the interface connections go to
the upper metal, for easy design changes in the lab.
Footnote 4: Bonus cells – extra standard library cells that will be used for bug fixes by metal
changes only.


2.6 Final tuning

In many cases there is still a need to balance clocks or to add/remove delay because of external AC
timing constraints. Those tasks are done only in one top-level cluster, which is kept open until
TO. In some other cases, interconnect between clusters causes violations, and another small ECO is
needed to solve issues such as MAX capacitance. The bottom line is that the concept
presented in this paper keeps us within the time frame and enables us to produce TOs.

3.0 Conclusions and Recommendations


This paper presents a concept of taking advantage of advanced processes and paying with some extra
area in order to get a stable and predictable flow from RTL to GDS.
It can be used in communication products, most of which have low frequency demands
relative to the process.
Using a set of tools from Synopsys seems to add another level of stability and predictability to
the Tape-Out flow. This combination has proved itself in our product line in the past, and we
continue to work that way.

4.0 Acknowledgements
I want to thank my colleagues who work with this methodology and helped me collect the data for
this paper:
Sagy Eick, Oren Mamet, Oded Pilowsky, Shai Michaeli
