
Emulation White Paper


May 2009
Chip and verification complexities continue to grow. Despite these growing complexities, time-to-market
pressures require that chip verification be completed on schedule. Hardware-assisted verification is used
primarily to reduce risk by running more verification in a given time. Successfully completing this type of
verification depends on three main parameters: performance of the verification engine; quickly adopting
the changes in RTL, IP, or peripheral interfaces; and emulating the behavior of the target environment.
Ease-of-use, synergy with existing verification environments, and interoperability with software simulators
are also contributing factors to successful hardware-assisted verification.
This paper discusses the latest developments in Veloce's highly optimized emulation SoC technology.
It explores how Veloce's unique architecture gives users the flexibility to build highly productive
verification environments through the implementation of a hardware stimulus, a software stimulus, or a
combination of the two.

Vijay Chobisa
Product Marketing Manager, Mentor Graphics Corporation
Charley Selvidge
R&D Director, Mentor Graphics Corporation


Faster runtime performance without compromising fidelity
Faster design compilations
Debug productivity and faster turn-around time
Single platform for ICE and acceleration
(stimulus flexibility)
Scalable architecture (scale the capacity as design
size grows)

Emulation vendors face the challenges of delivering
faster runtime performance, addressing capacity needs,
supporting different stimulus sources/methodologies,
and offering simulation-like ease-of-use and debug.
Test stimulus can come from in-circuit
target systems (hardware testbenches) or, alternatively,
from abstract software-based testbenches using
C/C++/SystemC/SystemVerilog. Giving users the flexibility
to use multiple stimulus methodologies in the
same project, or on the same model, using a single
emulator platform is a real differentiator. As verification
moves to a higher level of abstraction, hardware-assisted verification tools are required to handle
verification environments designed using these
advanced methodologies.

The Veloce SoC, while FPGA-like in its computation
resources, has a unique network for interconnecting
these computation resources specifically for fast
compile times. The challenge in designing a reprogrammable logic core is not building the most optimized
commercial FPGA, but optimizing the logic for emulation
applications using a distinctive interconnect network.
Veloce SoC supports a capability to independently
compile the logic part of the design from the
interconnect part; these are unique steps that use
distinct system-level resources for predictable

The key factors that drive a user's decision to work
within an emulation environment include:
Verification productivity
Stimulus flexibility
Accurately model the synchronous and
asynchronous behavior of a device
Capacity in a smaller footprint

Veloce SoC has built-in functions that are needed to
design an efficient emulator, such as a chip-level communication network that can be programmed rapidly.
There are specific resources built into the programmable logic core to support a variety of runtime and
debug features. These functions allow setting and getting
values, forcing and releasing signals, triggering on all state
elements, and access to the debug data for visibility.

Taking the above criteria into consideration, Mentor
Graphics has designed the highly successful Veloce
emulator. One of the key attributes of Veloce is its
custom and highly optimized emulation SoC chip
technology, which is a key reason why potential users
should consider Veloce rather than using systems built
around a commercial FPGA or a custom processor.

From the trigger perspective, there are tens of thousands
of trigger signals generated within each chip, and these
signals are globally combined together for trigger
expressions. It is not possible to take tens of thousands
of signals out of each chip and then combine several
tens of millions of signals at system-level for triggering.
Veloce has trigger reduction logic within the chip that
takes these tens of thousands of potential trigger values
and reduces them to a smaller set of partially computed


Instead of using a commercial FPGA, or building a
custom processor, Mentor Graphics has designed
a custom SoC optimized exclusively for emulation.
Employing the Veloce SoC technology, users gain the
following advantages:


Storing visibility data is one of the most challenging
problems in emulation. To enable full visibility, there is
an integrated memory controller built into each Veloce
SoC for a set of on-board DRAMs used for storing
visibility data.

From the system density standpoint, the most effective
way is to put these functionalities inside the computational chip with dedicated fixed resources. Veloce's
SoC uses this approach. A smaller system footprint is
achieved by using a high-speed multiplex communication
network that can be easily programmed for controlling
clock domains associated with specific communication
network resources, and controlling the switching
functionality that causes signals to go from one logic
block to another.

Today's designs have a very large number of memories.
If the user models those memories outside the computational chip, it will require too much interconnection
bandwidth, and as a result, access to memories will be
slow. In order to provide fast access to memory modeling
resources without crossing chip boundaries, Veloce
SoC has memory modeling resources that allow the
user to model a large number of distinct, small- to medium-sized user memories.


Veloce SoC is a dense chip with a dedicated interconnect network that delivers higher capacity
utilization and a smaller system footprint. Another
factor affecting the footprint is system-level packaging.
The Veloce system has a larger number of logic boards
packed inside the physical chassis with switching boards
to flexibly interconnect these logic boards. Clocks are
system-level functionalities and come from their own
dedicated resources. This allows Veloce to support line
break pointing, stopping the clocks, and triggering

Veloce SoC is designed to accept multiple asynchronous clocks to support environments that have
interfaces using asynchronous clocks. If the user wants
to have an independent evaluation capability for distinct
clock domains, then it's required to give each clock
domain its own communication resources, which are
synchronized to that clock domain. The communication
network inside the Veloce SoC is extremely flexible
and can be programmed to a fine level of granularity
in order to move signals in and out from the chip for a
distinct clock domain. The flexible communication
network enables Veloce to accurately model asynchronous
environments and behaviors.

Other vendors typically supply smaller, lower capacity
system units and then connect them together for larger
capacity SoC emulation. This approach consumes
much more real estate than Veloce and also creates
reliability issues due to hundreds of cables.

Compilation speed, runtime performance, and debug
features are the three key factors that contribute to the
overall verification productivity. The compilation speed
and runtime performance are important; however, a
rich debug infrastructure that can quickly identify the
cause of the failure is the key to the turnaround time.
Veloce's simulation-like interactive debug and
advanced graphical user interface are designed to
accelerate the ability to analyze failures.

A large number of system functions have to be
implemented to build an emulation system.
The following are three potential options:
Integrate them into the chip, which translates into
a smaller system footprint
Implement them by consuming user programmable logic, which reduces the capacity of the
Implement them outside the computational logic
device, which increases system footprint


the emulator, and how the storage elements are
clocked. It is possible to have setup and hold issues
if computed signals can clock storage elements.

It's useful to understand that designers do not compile
the design many times a day. They compile the design,
run various tests, report bugs to the RTL designer, and
then recompile the design when new RTL is ready.
Generally, designers spend more time running and
debugging designs than actually compiling the design.
A typical scenario would be the verification team finds
a bug, reports the bug to the designer(s) for the RTL
fix, and then runs the compile once the RTL is fixed.
Usually the time needed to fix the RTL is longer than
the compile, run, and debug. It is not uncommon for
new RTL to be released to the verification team once
every week or two. This highlights that compiling is
infrequent, and having very fast compiles does not
necessarily influence productivity to the level that it
gets advertised.

There is no exposure to design-dependent setup and
hold issues if the user has a hardware infrastructure
and uses a compilation strategy that does not use
design-computed signals as clocks, so that all
storage elements are clocked by a distinct,
dedicated clock.
Veloce uses a timing re-synthesis technique that
eliminates the need for clocking storage elements using
computed signals. The Veloce compiler takes an arbitrary
design and produces a functionally equivalent model
where storage elements are clocked by a single system-level clock. All the storage elements in Veloce are
clocked by a special system clock. As a result, there is
no way to clock storage elements in Veloce by computed
signals and, therefore, Veloce cannot have setup and
hold issues.
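
The idea of replacing a computed clock with a single system clock can be illustrated with a small behavioral model. The sketch below assumes one well-known approach, converting a gated clock into a data-path enable; the paper does not describe Veloce's actual re-synthesis algorithm, and the function names are hypothetical.

```python
def gated_clock_flop(d_values, gate_values):
    """Reference model: a flop clocked by a computed clock (sys_clk AND gate).

    Each list entry is one system-clock cycle. The flop captures d on a
    rising edge of the computed clock.
    """
    q, prev_gclk, trace = 0, 0, []
    for d, gate in zip(d_values, gate_values):
        for sys_clk in (1, 0):                 # high phase, then low phase
            gclk = sys_clk & gate              # computed (gated) clock
            if prev_gclk == 0 and gclk == 1:   # rising edge of computed clock
                q = d
            prev_gclk = gclk
        trace.append(q)
    return trace


def enable_flop(d_values, gate_values):
    """Re-synthesized model: the flop is clocked by the system clock every cycle.

    The former clock gate becomes a mux in front of D: next_q = d if en else q.
    """
    q, trace = 0, []
    for d, en in zip(d_values, gate_values):
        q = d if en else q   # always updated on the single system clock
        trace.append(q)
    return trace


d = [1, 0, 0, 1, 0]
gate = [1, 0, 1, 0, 1]
print(gated_clock_flop(d, gate) == enable_flop(d, gate))
```

Both models produce the same sequence of register values, but in the second one no storage element is ever clocked by a signal computed from the design.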

Depending upon the organization and verification
strategy/flow, compile frequency can vary from
customer to customer. In the scenario where designers
do verification, they find their own bugs, so compiles
can be frequent. This is often the case for small
companies and smaller designs. It is very rare in big
companies, as designers usually do not perform system
verification. Fast compiles are important, and Veloce's
fast compile can support three to four turns a day for a
30MG design.

The second important attribute of compile predictability
is the interconnect capacity of the network. A modeling
system has compute resources that are in the chips and
the network that connects the chips together. The data
carrying capacity of that network in comparison to the
logic contents of the chips strongly influences whether
the user will have a predictable compile process with
respect to capacity utilization. Consistent capacity and
predictable compile are influenced by the amount of
connectivity provided by the interconnect network.

Compile predictability means multiple compiles on
the same database deliver the exact same results and
behavior; compile speed and predictability are equally
important. One attribute of compile predictability is
setup and hold issues. Any emulator, whether it's
FPGA-based or processor-based, must contain compute
elements that perform logic functions and storage
elements, which store state data. Setup and hold issues
are not a function of whether it's an FPGA-based solution
or a processor-based solution. They are a function of how the
user chooses to use compute and storage resources in

Veloce SoC is carefully designed to allocate the right
proportion of silicon to network resources and compute
logic. Users can design an ultra-dense system by using
most of the silicon for compute logic and get good
utilization at chip-level, but end up with poor
utilization at system-level, due to limited interconnect
resources and compile failures because of routability

Veloce has both hard virtual wires and soft virtual wires
for network connectivity. Hard virtual wires are the
fixed baseline interconnect network designed directly
in the silicon. The soft virtual wire technique is an ability
to flexibly extend the network in order to optimize
overall capacity of the chips by using some of the
compute resources. For interconnect dominant designs,
hard virtual wire interconnect resources are not sufficient;
therefore, Veloce uses soft virtual wires to augment
connectivity. This doubles the interconnect network and
delivers improvements in overall capacity utilization.

In summary, because of the above properties, the Veloce
compile process is very predictable and fast.

Overall test execution consists of test setup, test run,
and results collection. The length of a test run is determined by how fast the emulator runs; how long it takes
to load and set up the test on the emulator (download
design, initialize states, manipulate memory contents,
etc.); and how long it takes to upload the test results. Some users have
methodologies where they run very long tests
(or tests that never stop), while others frequently start new
tests. So runtime clock speed and the speed of these
auxiliary services are equally important.
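
A short calculation shows why the auxiliary services matter as much as clock speed. This is a minimal sketch: the function names and all numbers (cycle counts, clock rate, setup and upload times) are hypothetical and chosen only for illustration.

```python
def total_test_time(cycles, clock_hz, setup_s, upload_s):
    """Wall-clock seconds for one test: setup + run + results upload."""
    return setup_s + cycles / clock_hz + upload_s


def run_fraction(cycles, clock_hz, setup_s, upload_s):
    """Fraction of the total time actually spent running the design."""
    return (cycles / clock_hz) / total_test_time(cycles, clock_hz, setup_s, upload_s)


# Hypothetical short test: 1M cycles at 1 MHz, 30 s setup, 20 s upload.
short = total_test_time(1_000_000, 1_000_000, setup_s=30, upload_s=20)
# Hypothetical long test: one hour of emulated activity at the same speed.
long_ = total_test_time(3_600_000_000, 1_000_000, setup_s=30, upload_s=20)
print(short, long_)
```

With these assumed numbers the short test spends about 2% of its time running and the rest in setup and upload, while the long test is dominated by the run itself. Users who start new tests frequently therefore benefit mostly from fast setup and upload.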

Commercial FPGAs are designed with small
interconnect network resources and large compute
resources because users pay for compute resources.
They take a compute structure that is small, but makes
the compile time longer, and there is a potential of
some compile failures due to limited interconnect
resources. Veloce SoC is optimized for faster compile
times and predictability, which is done by designing
more network resources in the chip. For Veloce, the
probability of chip-level failure is extremely low,
perhaps one in a million.

Veloce SoC is designed to deliver fast runtime clock
speed by implementing dedicated built-in high-speed
interconnect and state-of-the-art fabrication technology.
Veloce has built-in rapid access to all of the memory
modeling resources, all the state elements for set/get
and force/release operations, and efficient the chip

Another aspect of runtime performance is how a user
responds to the clock edges. If the computation and
communication systems need to respond the same way
to rising and falling edges, the user ends up doing a fair
amount of redundant activity. If only the computations
associated with the rising edge (when the rising edge
comes) and computation associated with the falling
edge (when the falling edge comes) are done, the user
can tune the actions specifically to a clock edge. This
allows the computation to have a duration of more than
just the period between two half cycles of the clock.
In the case of 1X and 2X clocks, users can recognize
the 1X activity and give full duration of clock for

Another important aspect of predictable compile is to
have a compilation flow where the network part of
compilation is done independently of the logic part of
the compilation. This allows the network compilation
timing to be adapted to the logic compilation timing.
Veloce has a completely independent network
scheduling process called the global scheduler. It is
impossible to have network timings that do not satisfy
logic timing because the network scheduling process
happens after logic is fully compiled.
Each Veloce SoC takes less than five minutes to
compile, whereas a comparable commercial FPGA
takes approximately two hours. The chip compilation is
distributed and well-suited to evolution of PCs, as PCs
have multiple processors and processor chips have
multiple cores.
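
Because each chip compiles independently, the wall-clock time of the chip-compile step scales with the number of available cores. The sketch below uses the per-chip figures quoted above (five minutes as an upper bound per Veloce SoC, about two hours per commercial FPGA); the chip count, the worker count, and the assumption that both systems need the same number of chips are hypothetical, and other compile stages are ignored.

```python
import math


def compile_wall_minutes(num_chips, minutes_per_chip, workers):
    """Wall-clock minutes when independent chip compiles run in parallel batches."""
    batches = math.ceil(num_chips / workers)
    return batches * minutes_per_chip


# Hypothetical system: 64 chips compiled on 16 cores.
print(compile_wall_minutes(64, 5, 16))     # per-SoC figure from the text
print(compile_wall_minutes(64, 120, 16))   # per-FPGA figure from the text
```

Under these assumptions the chip-compile step takes 20 minutes versus 480 minutes, and adding cores shortens both until there is one core per chip.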


Users need hardware flexible enough to initiate computations based on various edges, and software sophisticated
enough to recognize what actions are associated with
what clock edges and complex domains with complex
paths. The Veloce hardware and software distinctly
recognize edges of the user's clocks and are able to
perform a distinct computation and a distinct set of
communications associated with activities caused by
rising clock edges vs. activities caused by falling
clock edges.

visibility and allows users to go as deep as they want
by virtue of continuous uploads. This addresses the
ease-of-use issues around accessibility of the data.
Another ease-of-use feature is allowing users to control
the clocks. Sometimes it's easier to manage the debug
process by not controlling the data capture and execution
in a data-dependent fashion. Debugging involves getting
data at the time the problem occurs. The data can be
collected at the time the problem occurs in two different
ways: 1) capture the data at the time a problem is
occurring without stopping the clocks (using triggering
system), and 2) stop the clocks at the time the problem
is occurring and then start data capture. Because the
clock can be stopped, users can inspect the register
and memory contents at the time of problem. Users
can view data at the time the problem is occurring
either by controlling the capture of the data in the
region of the problem or controlling the clock
advance to stop at the problem time.

When working with custom emulation technology
(with reconfigurable fabric), the basic computation is
faster than processor technology because Veloce has
statically configured signal flow. The dominant factor
in performance is large-scale data movement inside
the chip and within the system. For large capacity
systems, Veloce is faster because, in big systems,
movement of data is limited by the physical size of the
system. Propagating information along printed circuit
boards or cables takes time. By virtue of having a
more densely packed system in one physical chassis,
the distance of data movement for Veloce is minimized.
This makes Veloce compact and faster for 8MG to
512MG systems.

Debug productivity is enhanced by having flexibility
in the way a user either controls the storage capture or
controls the clocks. Finally, if the user is in data-dependent
control (triggers or break points), then it's
useful to have a large set of automatically provided
triggers and break points on RTL lines that can be used
to control the capture.

Other emulation vendors can deliver MHz-class
performance on small designs, such as 10 to 20MG.
However, performance for competitive systems drops
when design sizes grow to 50 to 200MG as multiple
emulation systems/units are cabled together. Veloce's
architecture does not impose this restriction. Its performance remains approximately the same when the
user moves from 10MG to 100MG designs and
beyond. Additionally, other emulator vendors'
performance drops when more logic is packed into
emulation chips; Veloce does not exhibit this behavior.

Veloce has an improved triggering system that allows
users to write triggers on the state elements that are
available by default; therefore, the design never needs to be
recompiled for trigger purposes. Veloce's
enhanced interactive debug enables the user to start-and-stop the test just like a software simulator, set line
break points on RTL source, and perform RTL level

One aspect of debug productivity is how easily a user
can use the debugging infrastructure provided. Another
aspect is how long it takes to get the data to start the
debug. In applications where users can control the
clocks, the Veloce standard visibility model gives full


Veloce supports multiple GUIs for ease-of-use for the
customer during debug. Veloce supports the Questa
GUI and also supports Debussy for post-process debug.

Veloce is architected to deliver a simulation-like
interactive debug environment. The emulator has
dedicated built-in hardware infrastructure to provide
100 percent visibility, a trigger system, and line break
points. These resources are outside the user logic,
which means there is no impact on gate capacity and
performance by enabling these debug features. Unlike
Veloce, for custom processor-based solutions, debug
and visibility are driven from processor instructions,
meaning they come from user gates and impact
performance and capacity.

on the user selected signals within user specified time

In addition, the store ON and OFF feature allows users
to filter the data that goes into the hardware trace buffers,
which makes efficient use of these trace buffers. This
feature becomes very important for environments with
dynamic targets, where clocks cannot be stopped and
the user is interested in an activity that is highly
infrequent. For example, if a user is interested in
memory read cycles and memory read cycles occur
only once every 50K cycles, then this feature is very
useful to filter out unwanted data.
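
The effect of filtering on trace buffer usage can be shown with a toy capture model. This is a minimal sketch: the `capture` and `traffic` functions and the four-entry buffer depth are hypothetical, and only the one-read-per-50K-cycles rate comes from the example above.

```python
def traffic(n_cycles, read_period=50_000):
    """Yield one bus sample per cycle; a memory read occurs once per read_period."""
    for cycle in range(n_cycles):
        yield "read" if cycle % read_period == 0 else "idle"


def capture(samples, depth, store_on=lambda sample: True):
    """Fill a trace buffer of `depth` entries, storing a sample only when
    store_on(sample) is true. Returns the buffer and the cycles elapsed."""
    buffer, cycles_seen = [], 0
    for cycle, sample in enumerate(samples):
        cycles_seen = cycle + 1
        if store_on(sample):
            buffer.append((cycle, sample))
            if len(buffer) == depth:
                break
    return buffer, cycles_seen


# Store always ON: the buffer fills after 4 cycles and holds only one read.
unfiltered, span_all = capture(traffic(200_000), depth=4)
# Store ON only for reads: the same buffer spans more than 150K cycles.
filtered, span_reads = capture(traffic(200_000), depth=4,
                               store_on=lambda s: s == "read")
print(span_all, span_reads)
```

With the clocks free-running, the filtered buffer holds four reads spread over 150,001 cycles, while the unfiltered buffer is exhausted after four cycles.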

Time to visibility is another aspect of debug productivity.
If the user wants full visibility of arbitrary
signals, it is extremely challenging to do this in a
resource efficient way. Veloce delivers a software
technique called Software State Replay (SSR), which
re-computes all data for a chip from a limited amount
of per chip storage data. Typically, in any debugging
process, the user does not need to look at the whole
design and the entire simulation time. One looks at
subsets of time and space (signals), so having a
mechanism to be able to flexibly run soft computation
on a subset of time and space is a way to potentially
give access to a large amount of time for all of the
design. This feature reduces the time required to access
the suspected problem segment that needs debugging.
Veloce's selective upload in time and space feature
accelerates time to visibility. The user selects a set of
signals and the time window of a test, and during
runtime, debug data is exported from the emulator.
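
The principle behind recomputing signals from limited stored data can be sketched with a toy design. This is only a conceptual illustration under stated assumptions, not Mentor's SSR implementation: the 4-bit accumulator, the signal names, and the functions `run_hardware` and `replay` are all hypothetical.

```python
def run_hardware(inputs):
    """Stand-in for the emulator running a 4-bit accumulator.

    Only the register value and the primary input are stored each cycle.
    """
    acc, captured = 0, []
    for x in inputs:
        captured.append({"acc": acc, "x": x})   # limited per-cycle storage
        acc = (acc + x) & 0xF                   # next-state logic
    return captured


# Combinational signals are not stored; each is a function of state and inputs.
COMBINATIONAL = {
    "sum": lambda s: (s["acc"] + s["x"]) & 0xF,
    "overflow": lambda s: int(s["acc"] + s["x"] > 0xF),
}


def replay(captured, signals, start, stop):
    """Recompute only the requested signals for cycles start..stop-1."""
    return [
        {name: COMBINATIONAL[name](captured[t]) for name in signals}
        for t in range(start, stop)
    ]


captured = run_hardware([5, 7, 9, 3])
print(replay(captured, ["overflow"], start=2, stop=4))
```

Here the `overflow` signal was never stored, yet it can be reconstructed for any chosen window. Because each window and each signal subset is computed independently, the work grows with what the user asks to see rather than with the size of the whole design.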


One of the more difficult problems of high-performance
verification is establishing the testbench environment
and test strategy which have the ability to accept a
variety of high performance stimulus. Whether the
stimulus is from hardware protocol generators, hardware connecting to a live application using speed
adaptors, or high-level software environments via
transactors, the ability to combine these stimulus sources
into one unified verification platform is a challenge.
Another challenge is availability of testbench components to stimulate each interface of the design in a way
that is consistent with verification needs. Users want a
verification platform that can accommodate in-circuit
stimulus for interfaces using speed adaptors and
software-based stimulus. Hardware-based stimulus has
the advantage of traffic fidelity and model availability;
however, the disadvantage is the lack of traffic
controllability and sometimes users do not know how
to slow down a hardware interface to the emulation
speed. In-circuit interfaces can be used when the user
does not know how to build a suitable soft model.
Software-based stimulus has significant controllability
(directed test, randomization, constraints, etc.).
However, fidelity of the stimulus is a challenge. If it's
important to control the traffic to obtain the desired

Veloce's visibility system is architected for flexible time-to-visibility data capture to accelerate the debugging
process. Distributed SSR visibility computation per
chip is well-suited for the evolution of the computer
industry. The SSR has a set of independent computations for each chip which can be distributed across
many CPUs or cores. The On-Demand visibility further
accelerates time-to-visibility by running reconstruction


verification coverage, then software-based solutions are
the best way to go. When the verification plan is first
assembled, it is recommended to decide which interfaces
need to be stimulated and what options are
available. For some interfaces, hardware-based
stimulus is the best and for other interfaces, software-based stimulus is best. But sometimes the preferred
stimulus might not be available.

environment between simulation and emulation. TBX
is based upon the SCE-MI 2.0 standard (and supports SCE-MI
1.0 for backward compatibility).
The advantages of SCE-MI 2.0 are as follows:
Testbench abstraction-level is much higher
involving communication by procedural calls
High-level implicit management of
Environments are interoperable with industry
standard simulators; one verification environment
for simulation and acceleration
Fully deterministic environments

It is not uncommon for users to use a combination of
hardware stimulus for some interfaces and software
stimulus for other interfaces of the same design in
order to optimize or accelerate verification.
Veloce is a single platform architected to optimally
support both hardware acceleration and in-circuit
emulation (ICE). This enables users to perform the
complete verification on one platform from block-level,
to chip-level, to system-level verification. Veloce's
flexible infrastructure for interfacing allows users to
connect targets with multiple asynchronous clocks,
accept clocks coming in, and source the clocks to the
targets. If targets are static, then Veloce has the ability
to stop the clocks. The clock stopping ability gives
access to certain kinds of powerful debug capabilities
(for example, line break point and inspect memory

Figure 1 shows a verification environment designed
with some guidelines for acceleration that can seamlessly
work in simulation and acceleration.
Veloce TBX, in addition to being a stimulus supplying
channel, can also be a real-time monitoring channel for
an auxiliary debug mechanism. A distinctive use model is
to use ICE as a stimulus and TBX as a monitor channel.
Many verification methodologies have specific infrastructure built-in to extract data from the models
without extracting full visibility data.
Assertion-based verification is a methodology where
one inserts checker logic in the design files to monitor
the activities in the design during verification runs.
Veloce natively supports assertion-based verification
in acceleration and in-circuit environments; users can
write assertions using PSL, SVA, OVL, and QVL.
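
The kind of checker logic described here can be modeled in a few lines. The sketch below is a software stand-in for a temporal check that would normally be written in PSL or SVA; the request/acknowledge protocol, the latency bound, and the function name are hypothetical.

```python
def check_req_ack(trace, max_latency):
    """Return the cycles of requests that were not acknowledged in time.

    trace is a list of (req, ack) pairs, one per clock cycle. An ack must
    arrive 1 to max_latency cycles after the request. A request raised while
    another is pending is ignored, and a request still pending at the end of
    the trace is not reported.
    """
    failures = []
    pending = None                      # cycle of the outstanding request
    for cycle, (req, ack) in enumerate(trace):
        if pending is not None and ack:
            pending = None              # acknowledged in time
        if pending is not None and cycle - pending >= max_latency:
            failures.append(pending)    # deadline passed without an ack
            pending = None
        if req and pending is None:
            pending = cycle
    return failures


trace = [(1, 0), (0, 0), (0, 1),          # request at 0, ack at 2: passes
         (1, 0), (0, 0), (0, 0), (0, 0)]  # request at 3, never acked: fails
print(check_req_ack(trace, max_latency=2))
```

The checker only observes the design and reports violations, so it extracts pass/fail information without requiring full visibility data.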

Sometimes stimulus models need to be software for a
variety of reasons; these models can be the same
models used in the simulation environment. Mentor
Graphics has a high-speed soft testbench interfacing
technology called TBX with an abstract way of
implicitly managing communication between a
testbench and DUT to accelerate transaction-based
environments on Veloce. TBX allows building a
testbench environment that is equally applicable to
acceleration, using Veloce TBX, and to a software
simulator that is SystemVerilog compliant, such as
Mentor's Questa product. This fosters interoperability
with software simulators and a common verification


Figure 1: Unified Verification Environment.

Veloce provides flexibility on the type of stimulus
users would like to utilize as well: hardware-based,
soft testbench-based, or a combination of both. Veloce
is the best-in-class transaction-based accelerator. It
enables users to build a verification environment at a
higher level of abstraction, develop the tests at the
application-level, and communicate to the DUT at
transaction-level using protocol-based transactors.
This allows users to do the debugging at the transaction
level instead of pin-level.
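
The division of labor between a testbench and a transactor can be sketched as follows. This is an illustrative model only and does not use the SCE-MI or TBX APIs: the two-phase write protocol, the class names, and the three-cycle transaction are hypothetical.

```python
class PinLevelMemory:
    """Toy DUT with a hypothetical two-phase write protocol on its pins."""

    def __init__(self):
        self.mem = {}
        self._addr = None
        self.pin_cycles = 0

    def clock(self, valid, phase, value):
        """Advance one clock cycle with the given pin values."""
        self.pin_cycles += 1
        if not valid:
            return
        if phase == "addr":
            self._addr = value
        elif phase == "data":
            self.mem[self._addr] = value


class WriteTransactor:
    """Turns one procedural call into the pin activity the DUT expects."""

    def __init__(self, dut):
        self.dut = dut
        self.log = []   # transaction-level debug record

    def write(self, addr, data):
        self.dut.clock(valid=1, phase="addr", value=addr)
        self.dut.clock(valid=1, phase="data", value=data)
        self.dut.clock(valid=0, phase=None, value=None)   # idle cycle
        self.log.append(("write", addr, data))


dut = PinLevelMemory()
bus = WriteTransactor(dut)
bus.write(0x10, 0xAB)   # the test is written as procedure calls
bus.write(0x14, 0xCD)
print(bus.log, dut.pin_cycles)
```

The test issues two calls while the transactor generates six pin-level cycles, and the log records what happened in terms of transactions rather than individual pin values.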

verify the chip using a live environment, where various
interfaces are operating using independent asynchronous
clocks. All emulation vendors can model single clock
domain designs to accurately match the behavior of the
final device.

Finally, Veloce is designed to handle advanced
verification methodologies where a user can run the
verification environment in a simulator such as Questa
from Mentor Graphics, and seamlessly migrate to
Veloce TBX for acceleration. Figure 2 illustrates a
typical verification environment.

However, in a custom processor-based emulator, all the
domains run simultaneously every time there is activity
in any one domain, which makes the evaluations
synchronized across all the domains, leaving some
verification uncovered for asynchronous clock domain
designs. It is not possible to catch race conditions,
re-convergence, FIFO overrun, and glitches because of
the synchronized evaluation of all the domains. There
is no evaluation independence between two domains.

As system-level integration increases on an individual
chip, the adoption of multiple interfaces and standards
on a single SoC becomes more customary. There is a
significant requirement for verification teams to have
off-the-shelf solutions that provide complex stimulus
and analysis tools for a variety of applications, such as
multimedia, wireless, networking, bus protocols, and
embedded processors. Mentor Graphics has a full
range of in-circuit speed adaptors for vertical market
applications. Figure 3 illustrates the portfolio of the
vertical solutions available from Mentor.

Veloce models single-clock and multi-clock domain
designs accurately in the emulator. Veloce supports
in-circuit environments where target systems drive
DUT interfaces, each using asynchronous clocks.
It allows a distinct, independent cycle-based computation
for each distinct clock domain, which is important for
maintaining the fidelity to see cross-domain effects.
Veloce is a sophisticated extension of a cycle-based
algorithm that gives an accurate and high-fidelity
model even in the presence of multi-clock domains.
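
The difference between independent and synchronized domain evaluation can be sketched as an event schedule. This is a simplified illustration, not a description of any emulator's scheduler: the function names, the integer clock periods, and the fixed phase relationship are hypothetical, and real asynchronous clocks have no fixed phase relationship.

```python
def edge_times(period, horizon):
    """Rising-edge times of a free-running clock with the given period."""
    return list(range(period, horizon + 1, period))


def independent_schedule(periods, horizon):
    """Evaluate each domain only at its own clock edges."""
    events = []
    for name, period in periods.items():
        events += [(t, name) for t in edge_times(period, horizon)]
    return sorted(events)


def lockstep_schedule(periods, horizon):
    """Evaluate every domain whenever any domain has an edge."""
    times = sorted({t for p in periods.values() for t in edge_times(p, horizon)})
    return [(t, name) for t in times for name in sorted(periods)]


clocks = {"a": 10, "b": 7}
print(independent_schedule(clocks, horizon=30))
print(len(lockstep_schedule(clocks, horizon=30)))
```

In the independent schedule each domain's state changes only at its own edges, so the relative order of events in the two domains is preserved. In the lockstep schedule both domains are evaluated at all seven edge times (14 evaluations instead of 7), which ties their evaluations together.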


SoCs have a large number of asynchronous clocks
requiring verification tools to provide capabilities to


Figure 2: Advanced Verification Environment.

Veloce has a large amount of network resources, and
they can be partitioned at a fine granularity for communication. These network resources can be either one
large single domain network, or a set of independent
single domain networks for different clock domains.


For very large emulation systems, space, power
consumption, and cooling are significant costs to the
users. Mentor Graphics' emulation chip technology and
the Veloce system-level packaging design provide high
capacity systems with small footprints that conserve
significant power and minimize cooling requirements.
Veloce takes five times less power than a
comparable-capacity processor-based solution and has
a 4X smaller footprint.

For a multi-clock domain compile, specific network
resources are dedicated to a specific domain so that
network resources for different domains are independent
of one another. This gives independent computation
resources and independent network resources that
allow various kinds of modeling phenomena, which
users do not get with a processor-based emulator.


For a smaller footprint, you can pack as much as you
can in a chip, put as many chips as you can on a board,
and put as many boards as you can in a single physical
chassis. Veloce supports up to 512 MG capacity in a
really small footprint because of its architecture and
system packaging.


Figure 3: Veloce's Vertical Market Solutions.

Veloce is a high-performance, high-capacity platform
with a small footprint. Its unique architecture gives
users flexibility to build verification environments using
either a hardware stimulus, a software stimulus, or a
combination of the two.

Because Veloce is standards compliant, it fosters interoperability with software simulators. Veloce accurately
models the user's design in the emulator, irrespective
of whether it's a single clock domain or a multi-clock
domain design.
Finally, Veloce's dedicated debug/visibility infrastructure
and simulation-like debug make it an easy-to-use
verification platform.

For more information, call us or visit: www.mentor.com/emulation

Copyright 2009 Mentor Graphics Corporation. This document contains information that is proprietary to Mentor Graphics Corporation and may
be duplicated in whole or in part by the original recipient for internal business purposes only, provided that this entire notice appears in all copies.
In accepting this document, the recipient agrees to make every reasonable effort to prevent unauthorized use of this information.

MGC 05-09

5819: 080926