ON
MULTIPROCESSOR REAL-TIME ANALYSIS FOR
SCAN-BASED EMULATION: A METHODOLOGY
OF DSP APPLICATION
FOR
QUEST-2K8
AT
BY
GUDLAVALLERU ENGINEERING COLLEGE
GUDLAVALLERU
AUTHORS:
Introduction
In embedded systems the ability to perform Real-Time Analysis (RTA) can involve a
dedicated hardware and software capability with an end-to-end methodology that
supports the transferring of data between the host and the target in a lossless and reliable
manner. Specifically, the Real-Time Analysis encompassed by this methodology consists
of capturing data from a target application using dedicated hardware, transferring it
through various layers of software dedicated to creating a real-time path and making it
available to a host application for the purpose of analysis. Analysis includes
determining whether applications meet both timing and logical correctness
requirements.
With a traditional emulator, the CPU to be emulated is usually removed from its socket
and replaced with an emulator pod. The emulator pod typically has a replacement CPU,
plus various amounts of random logic and memory to monitor what is happening on
the CPU pins. With modern CPUs such as DSPs, the traditional approach has several
problems. The first problem is the speed of newer DSP chips. Bus cycle times can be
25 nanoseconds or shorter, and all instructions typically execute in a single cycle. This
makes it difficult for a traditional emulator to allow emulation at full speed. The
number of pins to monitor can be staggering, with chips having multiple 32-bit address
and data buses, making a traditional emulator expensive. The second, and more serious,
problem is that DSPs often have on-chip caches, pipelines, memory and peripherals.
Sometimes a whole algorithm can execute without any activity on the CPU pins.
The Joint Test Action Group (JTAG) defines an interface called the JTAG interface for
testing individual devices on printed circuit boards, without the need to remove the
devices from the board. This is accomplished by a method called boundary scan, whereby
the state of each pin of each device (with some special logic on the device) is serially
scanned out from the device.
Multiple devices can be daisy chained, and an entire PC board can therefore be scanned
in a single scan chain. The same method can scan out not only the state of a device's
pins but also internal information from the device, such as register values and memory
locations; as a consequence, scan-based emulation was born. The JTAG specification does
not include the pin out for the JTAG connector. The extension to JTAG defines a 14-pin,
2-row, 0.1" spacing JTAG connector header, with pin out and physical dimensions common
to all DSPs that support JTAG involved in this methodology [2]. During JTAG emulation, the
emulator supplies the clock that scans the device. This means that the target clock speed
is completely independent of the emulation clock, and the emulator can support targets
running at any clock speed.
The JTAG device architecture is based on the IEEE 1149.1 architecture. In this
specification, there are four dedicated pins collectively known as the Test Access Port
(TAP). They are Test Data In (TDI), Test Data Out (TDO), Test Mode Select (TMS) and
Test Clock (TCK).
A boundary scan cell is connected to each boundary scan register on each device that is
being scanned. The architecture further specifies a finite state machine TAP controller with
inputs TMS and TCK. There is an Instruction Register (IR) holding the current
instruction, a bypass register, and an optional 32-bit identification register for permanent
identification.
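The TAP controller described above can be sketched as a small state table. The sixteen states and their TMS-driven transitions follow IEEE 1149.1; the Python names `TAP_NEXT` and `walk` are illustrative only and are not taken from any emulator driver.

```python
# Next-state table for the IEEE 1149.1 TAP controller.
# Each entry maps state -> (next state if TMS=0, next state if TMS=1);
# TMS is sampled on the rising edge of TCK.
TAP_NEXT = {
    "Test-Logic-Reset": ("Run-Test/Idle", "Test-Logic-Reset"),
    "Run-Test/Idle":    ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-DR-Scan":   ("Capture-DR",    "Select-IR-Scan"),
    "Capture-DR":       ("Shift-DR",      "Exit1-DR"),
    "Shift-DR":         ("Shift-DR",      "Exit1-DR"),
    "Exit1-DR":         ("Pause-DR",      "Update-DR"),
    "Pause-DR":         ("Pause-DR",      "Exit2-DR"),
    "Exit2-DR":         ("Shift-DR",      "Update-DR"),
    "Update-DR":        ("Run-Test/Idle", "Select-DR-Scan"),
    "Select-IR-Scan":   ("Capture-IR",    "Test-Logic-Reset"),
    "Capture-IR":       ("Shift-IR",      "Exit1-IR"),
    "Shift-IR":         ("Shift-IR",      "Exit1-IR"),
    "Exit1-IR":         ("Pause-IR",      "Update-IR"),
    "Pause-IR":         ("Pause-IR",      "Exit2-IR"),
    "Exit2-IR":         ("Shift-IR",      "Update-IR"),
    "Update-IR":        ("Run-Test/Idle", "Select-DR-Scan"),
}


def walk(state, tms_bits):
    """Return the state reached after clocking the given TMS values on TCK."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state
```

Holding TMS high for five TCK cycles returns the controller to Test-Logic-Reset from any state. From reset, the TMS sequence 0, 1, 0, 0 reaches Shift-DR, the state in which data register bits move from TDI towards TDO.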
Boundary scan cells are configured into a parallel-in, parallel-out shift register. Parallel
load operations cause signals from the core logic to be loaded into the output cells. Parallel
unload operations cause the signals to be loaded from the input cells to the core logic. Data
is shifted in serial mode by daisy chaining devices. The figure below shows the TDO of
each device connected to the TDI of the next device in the scan chain. It is possible to
avoid scanning any device by placing it in bypass mode. Typically, the system architect
is responsible for determining the type of arrangement of devices (homogeneous or
heterogeneous), their order in the scan chain and whether they will be placed in
bypass.
Fig 1: Boundary Scan
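A minimal sketch of how a daisy chain behaves as one long shift register, and how bypass shortens it. The device names and register lengths are hypothetical; the only fact taken from the standard is that a device in bypass contributes a single bit to the chain. Clock-edge timing is ignored.

```python
from dataclasses import dataclass


@dataclass
class Device:
    name: str
    dr_bits: int          # hypothetical length of the selected data register
    bypass: bool = False  # True when the BYPASS instruction is loaded

    @property
    def scan_bits(self):
        # A bypassed device places its 1-bit bypass register in the chain.
        return 1 if self.bypass else self.dr_bits


def chain_length(chain):
    """Total number of bits between the first TDI and the last TDO."""
    return sum(d.scan_bits for d in chain)


def shift_through(chain, bits_in):
    """Clock bits into the first TDI; return what appears at the last TDO."""
    regs = [[0] * d.scan_bits for d in chain]
    out = []
    for bit in bits_in:
        for reg in regs:
            reg.insert(0, bit)  # bit enters on the TDI side
            bit = reg.pop()     # oldest bit leaves on TDO and feeds the next TDI
        out.append(bit)
    return out


chain = [Device("dsp0", 8), Device("gpp", 38, bypass=True), Device("dsp1", 16)]
```

With the middle device in bypass the chain is 8 + 1 + 16 = 25 bits long instead of 62, so a bit shifted in at the first TDI appears at the last TDO 25 clocks later.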
The following application in the domain of high energy physics illustrates the necessity
for RTA in a heterogeneous multiprocessor environment. The Fermilab Tevatron
Collider generates 15 million particle collisions per second. These particle collisions
result in the creation of subatomic particles that travel through a spectrometer. The data
output from the spectrometer is on the order of terabytes per second and must be analyzed
in real time. The analysis engine comprises a massively parallel arrangement of
heterogeneous DSPs and GPPs (general purpose processors). Analysis consists of
applying algorithms that reconstruct and filter the collision data. The result is a select set
of interesting collisions from which physicists can study some of the remaining mysteries
of matter and antimatter in the universe.
Analysis of real-time embedded applications is necessary at several points during the
software life cycle: during development, as a means to debug; towards the end of
development, for tuning performance; and after the application is deployed, for
failure analysis. Logic analyzers have been used for many years to clamp onto the
data busses of the target and monitor the data flow of the application in order to
analyze application behavior. Aside from the fact that logic analyzers are expensive
($15K to $60K for a DSP), the increase in system-level integration over the years has
resulted in fewer exposed data paths for the logic analyzer to monitor. Most modern
microprocessors are architected with specialized hardware counters that can be
programmed for the purpose of tracing applications. Traditionally these registers
have been used to guide the design of microarchitectural features such as caches and
TLBs. Whereas these registers can be used to trace the behavior of applications
at a very fine level of granularity, they cannot easily be used as an RTA mechanism.
An ancillary yet significant issue is that analysis requires the user to have
advanced knowledge of the target microarchitecture in order to interpret the data.
Finally, tracing supports data transfer only from target to host and not from host to
target.
Fig 2: Debugger with real-time data exchange
An alternative real-time analysis solution based upon JTAG emulation is presented here.
This hardware and software architecture for a single processor is shown in Fig 3. The JTAG
interface that connects the on-chip emulation logic to the host-based emulator provides
the physical mechanism on which to transport data from the target to the host and vice
versa. The target application is the subject matter to be analyzed; it is the source of data
to be sent to the host and the sink for data received from the host. Therefore, an RTA
target software library exists to bridge the gap from the target application to the on-chip
emulation hardware. Good software engineering practice dictates that an API exist for
this software library. On the host, the data is to be analyzed by a host application. This
host application may also input data to the target application. We must therefore bridge
the gap from the emulator to the host application.
An emulation software driver controls the scanning of data to/from the target via the
emulator. It is the first piece of host software to receive data from the target and the last
piece of host software to handle data heading to the target. An RTA host software library
funnels the data between the emulation software driver and the host application. Again,
an API exists for the RTA host software library. It should be noted that multiple host
applications may be run concurrently.
Data flow in this architecture is bi-directional: data flows from the target application to
the host application for analysis, and data may flow from the host application to the target
application for supplying input parameters. Such input parameters may be used for fine
tuning performance, for supplying test data, etc.
Fig 3: Single processor RTA based upon JTAG emulation
For target-to-host data transfer, there are two distinct parts of the data flow path. The
first part extends from the target application to the RTA host software library. This is
the real-time transportation leg. Since our target application has real-time constraints,
data must be off-loaded from the target to the host at a certain rate. The RTA host
software library is the first piece of software on the host that realizes it has received
real-time data for analysis. (The emulation software is agnostic to what type of data it
is scanning.) The RTA host software can record the data to disk and be done with it, or
buffer it internally.
The second part of the target-to-host data flow path extends from the RTA host software
library to the host application. The data is analyzed by the host application. If the data
has been recorded in persistent storage, then the data can be played back at any time. If
the data is not in persistent storage, then it must be analyzed by the host application as
it is produced; that is, the data must be drained from the RTA host software buffers as
they fill.
This architecture for multiprocessor real-time analysis via scan-based emulation provides
the basis for the methodology.
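The two host-side options described above, recording to disk or buffering internally, can be sketched as follows. The class name `RtaHostBuffer`, the length-prefixed file layout and the overrun counter are assumptions made for illustration; the paper does not say how the RTA host software library behaves when a buffer fills before it is drained.

```python
from collections import deque


class RtaHostBuffer:
    """Hypothetical stand-in for the RTA host library's data sink."""

    def __init__(self, capacity, record_path=None):
        self.capacity = capacity        # maximum frames held in memory
        self.record_path = record_path  # if set, frames go to disk instead
        self.frames = deque()
        self.overruns = 0               # frames refused because the buffer was full

    def on_data(self, frame: bytes) -> bool:
        """Called for each frame handed up by the emulation driver."""
        if self.record_path is not None:
            # Record and be done with it: 4-byte length prefix, then the frame.
            with open(self.record_path, "ab") as f:
                f.write(len(frame).to_bytes(4, "little") + frame)
            return True
        if len(self.frames) >= self.capacity:
            self.overruns += 1          # assumption: count the loss, keep old data
            return False
        self.frames.append(frame)
        return True

    def drain(self):
        """Host application takes everything buffered so far."""
        out = list(self.frames)
        self.frames.clear()
        return out


def playback(path):
    """Replay recorded frames at any later time."""
    with open(path, "rb") as f:
        while header := f.read(4):
            yield f.read(int.from_bytes(header, "little"))
```

In the buffered mode the host application has to call `drain` at least as fast as frames arrive, which is the "drained as they fill" requirement. In the recorded mode that coupling disappears and `playback` can run whenever convenient.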
Methodology
In this section we present an end-to-end methodology that is predicated on support in
hardware and software across several families of DSPs. There is special emulation hardware
architected into the DSP core and emulation drivers as well as RTA target and host side
software that permit the user to perform RTA. Fundamentally, this methodology involves
using a development environment to develop and download a target application to a DSP.
The application running on the DSP interfaces with the RTA software to send and receive
data. The data is scanned out using JTAG boundary scan. The data is received on the host
by the emulation driver that interfaces to the host-side RTA software. The data is then
presented to the host client application for analysis.
The figures of merit used to determine the success of this methodology are performance,
scalability, ease of use and reliability. We discuss these criteria within the scope of both
hardware capabilities and the RTA software architecture in a multiprocessor
environment.
Performance
Dedicated Emulation Hardware
Data is transferred between target and host using dedicated emulation hardware to
improve performance. In a heterogeneous multiprocessor arrangement of DSPs, one
complication that arises is that of varying scan lengths. Each DSP design has its
own emulation hardware. This results in scan lengths that vary within a family of
DSPs and between Instruction Set Architectures (ISAs). Longer scan chains require
more time to disassemble the scanned data, which lowers throughput and therefore
performance. In a multiprocessor JTAG scan, a special JTAG boundary scan bypass
instruction obviates the need to scan any device set to bypass mode. This reduces
the time needed to disassemble data being transferred between host and target.
An RTA solution for a multiprocessor environment must be able to identify the processor from
which data originates. This introduces the need to mark the data with a processor identifier.
The decision then becomes where in the system to do this. If we examine the host, we see
that there is a one-to-one correspondence between the emulation software drivers and the
processors in the system. Since there is an emulation software driver for each target in the
system, these drivers can stamp the data with a processor identifier.
Note that from a performance perspective, it is better to mark the data on the host side
than on the target side. If a unique processor identifier were sent down to the target and
the data were tagged there, then more data would be sent from the target to the host and
would consume precious bandwidth. Virtual data paths that extend from the target
application to the host application are used to segregate data.
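A minimal sketch of host-side stamping, assuming one driver object per target and a named channel per virtual data path. `EmulationDriver`, `demux` and `target_tag_overhead` are hypothetical names, and the message sizes in the usage note are invented for illustration.

```python
from collections import defaultdict


class EmulationDriver:
    """One driver instance per target, so it already knows which processor it serves."""

    def __init__(self, processor_id):
        self.processor_id = processor_id

    def receive(self, channel, payload):
        # The identifier is attached on the host; it never crosses the scan chain.
        return (self.processor_id, channel, payload)


def demux(records):
    """Group stamped records into virtual data paths keyed by (processor, channel)."""
    paths = defaultdict(list)
    for processor_id, channel, payload in records:
        paths[(processor_id, channel)].append(payload)
    return dict(paths)


def target_tag_overhead(payload_bytes, id_bytes):
    """Extra scan traffic, as a fraction of payload, if the target attached the ID."""
    return id_bytes / payload_bytes
```

For example, if each message carried 32 bytes of payload and the identifier were 4 bytes, tagging on the target would add 4 / 32 = 12.5% to the traffic scanned out over JTAG. Stamping in the driver adds nothing to that traffic.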
Scalability
Hardware Scalability
Software Scalability
Data Selection
Ease of use is an important but often difficult figure of merit to sustain. A software
debugging environment is provided that permits the user to easily configure the
hardware in the system.
Hardware Support
Software Support
At setup, the user selects the type of target and loads the system with an emulation
driver for that target. The user also specifies the number of targets of each type and
their position in the scan chain. Without this capability users would have to add code
in their host applications that performed the same function, resulting in messy and
unnecessarily complex code. Host side support is provided in the way of object-oriented
interfaces based on the Component Object Model (COM) [4], which is a de facto industry
standard. This permits the host application developer to write client programs that are
not tightly coupled to a specific DSP.
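The decoupling described above can be illustrated with a plain abstract interface. This is not COM and not the actual RTA host API; `RtaChannel`, `LoopbackChannel` and `send_and_check` are hypothetical names, and the sketch only shows a client written against an interface rather than against a particular DSP.

```python
from abc import ABC, abstractmethod


class RtaChannel(ABC):
    """Device-independent view of one data path between host and target."""

    @abstractmethod
    def read(self) -> bytes:
        """Return the next block of target data, or b'' if none is pending."""

    @abstractmethod
    def write(self, data: bytes) -> None:
        """Send a block of data towards the target application."""


class LoopbackChannel(RtaChannel):
    """Stand-in for a real target: whatever is written can be read back."""

    def __init__(self):
        self._pending = []

    def read(self) -> bytes:
        return self._pending.pop(0) if self._pending else b""

    def write(self, data: bytes) -> None:
        self._pending.append(data)


def send_and_check(channel: RtaChannel, value: bytes) -> bool:
    """Client code: depends only on the interface, not on the DSP behind it."""
    channel.write(value)
    return channel.read() == value
```

A channel backed by a different DSP family would implement the same two methods, and `send_and_check` would not change.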
Hardware Reliability
The JTAG specification has long been established as a reliable standard. It has been
adopted and extended. An extensive set of target libraries has been developed for various
flavors of DSPs based on boundary scan. Reliability is achieved through reuse of the same
register set in different versions of emulation hardware, both across ISAs and within
an ISA.
Challenges
Conclusion
References: