MULTIPROCESSOR REAL-TIME ANALYSIS FOR SCAN-BASED EMULATION: A METHODOLOGY OF DSP APPLICATION
For TECHFEST’08 at Gudlavalleru Engineering College, Gudlavalleru
Introduction
In embedded systems, the ability to perform Real-Time Analysis (RTA) can require
dedicated hardware and software together with an end-to-end methodology that
supports transferring data between the host and the target in a lossless and reliable
manner. Specifically, the Real-Time Analysis encompassed by this methodology consists
of capturing data from a target application using dedicated hardware, transferring it
through various layers of software dedicated to creating a real-time path, and making it
available to a host application for the purpose of analysis. Analysis includes
determining whether applications meet both timing and logical correctness
requirements.
With a traditional emulator, the CPU to be emulated is usually removed from its socket
and replaced with an emulator pod. The emulator pod typically has a replacement CPU,
plus various amounts of random logic and memory to monitor what is happening on
the CPU pins. With modern CPUs such as DSPs, the traditional approach has several
problems. The first problem is the speed of newer DSP chips. Bus cycle times can be
25 nanoseconds or shorter, and all instructions typically execute in a single cycle. This
makes it difficult for a traditional emulator to allow emulation at full speed. The
number of pins to monitor can be staggering, with chips having multiple 32-bit address
and data buses, making a traditional emulator expensive. The second, and more serious,
problem is that DSPs often have on-chip caches, pipelines, memory and peripherals.
Sometimes a whole algorithm can execute without any activity on the CPU pins.
The Joint Test Action Group (JTAG) defines an interface, called the JTAG interface, for
testing individual devices on printed circuit boards without the need to remove the
devices from the board. This is accomplished by a method called boundary scan, whereby
the state of each pin of each device (with some special logic on the device) is serially
scanned out from the device.
Multiple devices can be daisy chained, and an entire PC board can therefore be scanned
in a single scan chain. The same method can be used to scan out not only the state
of a device's pins but also any internal information from the device, such as
register values and memory locations; as a consequence, scan-based emulation was born. The
JTAG specification does not include the pin out for the JTAG connector. The extension to
JTAG defines a 14 pin, 2 row, 0.1" spacing JTAG connector header, with pin out and
physical dimensions common to all DSPs that support JTAG involved in this
methodology [2]. During JTAG emulation, the emulator supplies the clock that scans the
device. This means that the target clock speed is completely independent of the emulation
clock, and the emulator can support targets running at any clock speed.
The JTAG device architecture is based on the IEEE 1149.1 architecture. In this
specification, there are four dedicated pins collectively known as the Test Access Port
(TAP). They are:
TDI (Test Data In)
TDO (Test Data Out)
TMS (Test Mode Select)
TCK (Test Clock)
Boundary scan cells are connected to form the boundary scan register on each device that is
being scanned. The architecture further specifies a finite-state machine, the TAP controller, with
inputs TMS and TCK. There is an Instruction Register (IR) holding the current
instruction, a bypass register, and an optional 32-bit identification register for permanent
identification.
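The TAP controller described above can be sketched as a transition table indexed by the value of TMS, with one step taken per rising edge of TCK. The states and transitions follow the 16-state diagram of IEEE 1149.1; the Python names (`TAP_NEXT`, `walk`) are illustrative only and are not part of any emulator API.

```python
# Next-state table for the IEEE 1149.1 TAP controller.
# Each entry maps state -> (next state if TMS=0, next state if TMS=1).
TAP_NEXT = {
    "TEST_LOGIC_RESET": ("RUN_TEST_IDLE", "TEST_LOGIC_RESET"),
    "RUN_TEST_IDLE":    ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_DR_SCAN":   ("CAPTURE_DR", "SELECT_IR_SCAN"),
    "CAPTURE_DR":       ("SHIFT_DR", "EXIT1_DR"),
    "SHIFT_DR":         ("SHIFT_DR", "EXIT1_DR"),
    "EXIT1_DR":         ("PAUSE_DR", "UPDATE_DR"),
    "PAUSE_DR":         ("PAUSE_DR", "EXIT2_DR"),
    "EXIT2_DR":         ("SHIFT_DR", "UPDATE_DR"),
    "UPDATE_DR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
    "SELECT_IR_SCAN":   ("CAPTURE_IR", "TEST_LOGIC_RESET"),
    "CAPTURE_IR":       ("SHIFT_IR", "EXIT1_IR"),
    "SHIFT_IR":         ("SHIFT_IR", "EXIT1_IR"),
    "EXIT1_IR":         ("PAUSE_IR", "UPDATE_IR"),
    "PAUSE_IR":         ("PAUSE_IR", "EXIT2_IR"),
    "EXIT2_IR":         ("SHIFT_IR", "UPDATE_IR"),
    "UPDATE_IR":        ("RUN_TEST_IDLE", "SELECT_DR_SCAN"),
}


def walk(state, tms_bits):
    """Advance the controller one state per TCK rising edge, driven by TMS."""
    for tms in tms_bits:
        state = TAP_NEXT[state][tms]
    return state
```

For instance, `walk("TEST_LOGIC_RESET", [0, 1, 0, 0])` steps through Run-Test/Idle, Select-DR-Scan and Capture-DR to reach the data register shift state, and holding TMS high for five clocks returns the controller to Test-Logic-Reset from any state.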
Boundary scan cells are configured into a parallel-in, parallel-out shift register. Parallel
load operations cause signals from the core logic to be loaded into the output cells. Parallel
unload operations cause the signals to be loaded from the input cells to the core logic. Data
is shifted in serial mode by daisy chaining devices. The figure below shows the TDO of
each device connected to the TDI of the next device in the scan chain. It is possible to
avoid scanning any device by placing it in bypass mode. Typically, the system architect
is responsible for determining the type of arrangement of devices (homogeneous or
heterogeneous), their order in the scan chain, and whether they will be placed in
bypass.
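The effect of daisy chaining and bypass mode on scan length can be illustrated with a small simulation. This is a simplified model under stated assumptions: each device is reduced to a single shift register, a bypassed device contributes the one-bit bypass register defined by IEEE 1149.1, and the class and function names (`ScanDevice`, `scan`, `chain_length`) are hypothetical.

```python
from collections import deque


class ScanDevice:
    """One device on the scan chain, modeled as a serial shift register."""

    def __init__(self, name, dr_bits, bypass=False):
        self.name = name
        # A bypassed device places a single-bit register between TDI and TDO.
        length = 1 if bypass else dr_bits
        self.cells = deque([0] * length)

    def clock(self, tdi):
        """Shift one bit in at TDI and return the bit that leaves at TDO."""
        self.cells.appendleft(tdi)
        return self.cells.pop()


def chain_length(devices):
    """Total number of bits between the chain's TDI and TDO."""
    return sum(len(d.cells) for d in devices)


def scan(devices, bits_in):
    """Clock bits through the whole chain; TDO of each device feeds the next TDI."""
    bits_out = []
    for bit in bits_in:
        for dev in devices:
            bit = dev.clock(bit)
        bits_out.append(bit)
    return bits_out
```

In this model, placing an 8-bit device in bypass shortens the chain by seven bits, so every scan of the remaining devices takes correspondingly fewer emulation clocks.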
[Figure: Loosely coupled DSP arrangement (multiple boards). Target boards, each with a 14-pin JTAG header, are daisy chained through TDI and TDO and connected to the host computer.]
The following application in the domain of high energy physics illustrates the necessity
for RTA in a heterogeneous multiprocessor environment. The Fermilab Tevatron
Collider generates 15 million particle collisions per second. These particle collisions
result in the creation of subatomic particles that travel through a spectrometer. The data
output from the spectrometer is on the order of terabytes per second and must be analyzed
in real time. The analysis engine comprises a massively parallel arrangement of
heterogeneous DSPs and GPPs (general purpose processors). Analysis consists of
applying algorithms that reconstruct and filter the collision data. The result is a select set
of interesting collisions from which physicists can study some of the remaining mysteries
of matter and antimatter in the universe [3].
Analysis of real-time embedded applications is necessary at several points during the
software life cycle: during development, as a means to debug; towards the end of
development, for tuning performance; and after the application is deployed, for
failure analysis. Logic analyzers have been used for many years to clamp onto the
data busses of the target and monitor the data flow of the application in order to
analyze application behavior. Aside from the fact that logic analyzers are expensive
($15K to $60K for a DSP), the increase in system-level integration over the years has
resulted in fewer exposed data paths for the logic analyzer to monitor. Most modern
microprocessors are architected with specialized hardware counters that can be
programmed for the purpose of tracing applications. Traditionally these registers
have been used to evaluate the design of microarchitecture features such as caches and
TLBs. Whereas these registers can be used to trace the behavior of applications
at a very fine level of granularity, they cannot easily be used as an RTA mechanism.
An ancillary yet significant issue is that analysis requires the user to have
advanced knowledge of the target microarchitecture in order to interpret the data.
Finally, tracing supports data transfer only from target to host and not from host to
target.
An alternative real-time analysis solution based upon JTAG emulation is presented here.
The hardware and software architecture for a single processor is shown in the accompanying
figure. The JTAG interface that connects the on-chip emulation logic to the host-based
emulator provides the physical mechanism by which data is transported from the target to
the host and vice versa. The target application is the subject of the analysis; it is the
source of data to be sent to the host and the sink for data received from the host.
Therefore, an RTA target software library exists to bridge the gap from the target
application to the on-chip emulation hardware. Good software engineering practice dictates
that an API exist for this software library. On the host, the data is analyzed by a host
application. This host application may also input data to the target application. We must
therefore bridge the gap from the emulator to the host application.
An emulation software driver controls the scanning of data to and from the target via the
emulator. It is the first piece of host software to receive data from the target and the last
piece of host software to handle data heading to the target. An RTA host software library
funnels the data between the emulation software driver and the host application. Again,
an API exists for the RTA host software library. Note that multiple host
applications may run concurrently.
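[Figure: Single-processor RTA architecture. Host: host applications App1 … AppM, host API, RTA host software library, emulation software driver. Emulator and JTAG link. Target: emulation hardware, RTA target software library, target API, real-time target application.]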
Data flow in this architecture is bi-directional: data flows from the target application to
the host application for analysis, and data may flow from the host application to the target
application for supplying input parameters. Such input parameters may be used for fine
tuning performance, for supplying test data, etc.
For target-to-host data transfer, there are two distinct parts of the data flow path. The first
part extends from the target application to the RTA host software library. This is the real-
time transportation leg. Since our target application has real-time constraints, data must be
off-loaded from the target to the host at a certain rate. The RTA host software library is
the first piece of software on the host that realizes it has received real-time data for
analysis. (The emulation software is agnostic to what type of data it is scanning.) The
RTA host software can record the data to disk and be done with it, or buffer it internally.
The second part of the target-to-host data flow path extends from the RTA host software
library to the host application. The data is analyzed by the host application. If the data
has been recorded in persistent storage, then the data can be played back at any time. If
the data is not in persistent storage, then it must be analyzed by the host application as it
is produced; that is, the data must be drained from the RTA host software buffers as they
fill.
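The two options open to the RTA host software library, recording to persistent storage for later playback or buffering for immediate draining, can be sketched as follows. This is a minimal illustration, not the library's actual interface: the `HostRtaSink` class, the `playback` function, and the length-prefixed record format are all assumptions made for the example.

```python
import io
from collections import deque


class HostRtaSink:
    """Hypothetical host-side sink for data arriving from the emulation driver."""

    def __init__(self, record_to=None):
        self.record_to = record_to  # file-like object, or None to buffer in memory
        self.pending = deque()

    def on_data(self, frame):
        if self.record_to is not None:
            # Assumed format: 4-byte little-endian length, then the frame bytes.
            self.record_to.write(len(frame).to_bytes(4, "little") + frame)
        else:
            self.pending.append(frame)

    def drain(self):
        """Hand buffered frames to the host application in arrival order."""
        while self.pending:
            yield self.pending.popleft()


def playback(stream):
    """Replay recorded frames at any later time."""
    while True:
        header = stream.read(4)
        if len(header) < 4:
            return
        yield stream.read(int.from_bytes(header, "little"))
```

A host application that analyzes data live would call `drain()` repeatedly as buffers fill, whereas one that works from persistent storage would open the recording and iterate over `playback()`.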
[Figure residue: host software, JTAG hardware, target software, input]
Fig 4: Data flow in single processor RTA based upon JTAG emulation
The above RTA architecture for a single embedded processor is easily extended to a
multiprocessor environment, as shown in the accompanying figure. An RTA target software
library must exist on each embedded target to connect the target application to the
emulation logic on that target. For multi-core architectures, an RTA target software
library will exist for each core. The data from each processor is scanned up to the host
via the JTAG interface ring, as described in the discussion of scan-based emulation above.
On the host, there exists an emulation software driver corresponding to each target in the
system. Each emulation software driver receives the data from its corresponding target and
delivers the data to the single RTA host software library.
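[Figure: Multiprocessor RTA architecture. Host applications up to AppM, one emulation software driver per target, and processors up to Pn connected through TDI on the scan chain.]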
This architecture for multiprocessor real-time analysis via scan-based emulation provides
the basis for the methodology.
Methodology
Performance
Dedicated Emulation Hardware
[Figure: Virtual paths 1 through N connecting the host application and the target application.]
Data Identification
An RTA solution for a multiprocessor environment must be able to identify the processor
from which data originates. This introduces the need to mark the data with a processor
identifier. The decision then becomes where in the system to do this. On the
host, there is a one-to-one correspondence between the emulation software
drivers and the processors in the system. Since there is an emulation software driver for
each target in the system, these drivers can stamp the data with a processor identifier. Note
that from a performance perspective, it is better to mark the data on the host side than on
the target side. If a unique processor identifier were sent down to the target and the data
were tagged there, then more data would be sent from the target to the host and would
consume precious bandwidth. At the processor level, finer-grained identification of data
is possible: virtual data paths that extend from the target application to the
host application are used to segregate data.
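Host-side stamping, and the bandwidth argument behind it, can be sketched as below. The names (`EmulationDriver`, `TaggedData`, `link_bytes`) and the one-byte identifier size are assumptions for illustration; the virtual path identifier is simply assumed to be known to the driver when data arrives.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedData:
    processor_id: int
    path_id: int
    payload: bytes


class EmulationDriver:
    """Hypothetical per-target driver: one instance per processor on the scan chain."""

    def __init__(self, processor_id):
        self.processor_id = processor_id

    def receive(self, path_id, payload):
        # The tag is applied on the host, so it never crosses the JTAG link.
        return TaggedData(self.processor_id, path_id, payload)


def link_bytes(payloads, tag_on_target, id_size=1):
    """Bytes carried over the scan link for a batch of payloads."""
    overhead = id_size if tag_on_target else 0
    return sum(len(p) + overhead for p in payloads)
```

Under these assumptions, ten 4-byte payloads cost 40 bytes on the link when tagged on the host and 50 bytes when tagged on the target; the relative overhead grows as payloads get smaller.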
Scalability
A key aspect of this methodology is scalability. This issue is addressed in both hardware
and software.
Hardware Scalability
The JTAG specification permits the daisy chaining of hardware. The limit on the
number of devices that can be daisy chained is set by signal strength limitations
rather than by the JTAG specification.
Software Scalability
In software, data from each target is tagged with a unique identifier so that data being
transferred between host and target can be attributed to the processor to which it belongs.
Further, the RTA architecture is software scalable: the target application does not
depend on the number of processors and does not need to be altered if processors
are added to or removed from the system configuration. There is no requirement
that the target application have any knowledge of the type or number of processors in a
scan chain at the time of development.
Data Selection
The host application should be able to select the processor to which it sends data or
from which it receives data. This is accomplished by incorporating this functionality into
the host API, which proves to be very favorable with respect to scalability. For example,
assume that we have a target application that performs a series of transformations on a
given vector and then transfers the resulting vector to the host for analysis. Further
assume that there are many vectors to be transformed and that we choose to deploy the same
target application on as many processors as there are vectors to achieve maximum
parallelism. We can design a host application that sends a different
vector to each processor and then collects the resulting vector from each processor
for analysis.
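The vector example can be sketched with a toy host API. Everything here is hypothetical: `HostApi`, its `select`, `send` and `get` methods, and the `SimulatedTarget` stand-in for a DSP are invented for illustration and do not reproduce the actual host API.

```python
class SimulatedTarget:
    """Stand-in for a DSP running the target application (hypothetical)."""

    def __init__(self, transform):
        self.transform = transform
        self.result = None

    def write(self, vector):
        # The target application transforms the vector it receives.
        self.result = [self.transform(x) for x in vector]

    def read(self):
        return self.result


class HostApi:
    """Toy host API with processor selection built in."""

    def __init__(self, targets):
        self.targets = targets
        self.current = None

    def select(self, index):
        self.current = self.targets[index]

    def send(self, vector):
        self.current.write(vector)

    def get(self):
        return self.current.read()


def run(vectors, transform=lambda x: 2 * x):
    """Send one vector per processor, then collect each resultant vector."""
    host = HostApi([SimulatedTarget(transform) for _ in vectors])
    for i, vector in enumerate(vectors):
        host.select(i)
        host.send(vector)
    results = []
    for i in range(len(vectors)):
        host.select(i)
        results.append(host.get())
    return results
```

Note that the host loop is written once and is driven only by the number of vectors, and the target code is identical on every processor, which is the scalability property the text describes.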
Ease of Use
Ease of use is an important figure of merit, but one that is often difficult to sustain.
A software debugging environment is provided that permits the user to easily configure the
hardware in the system.
[Figure: Data selection by the host application. The host application selects Processor1, Processor2 and Processor3 in turn and sends vector1, vector2 and vector3; each processor applies the transform; the host application then selects each processor again and gets its resultant vector.]
Hardware Support
A trend in DSP emulation hardware is to support device registers that are mapped at
fixed addresses. This permits the source code porting of applications. Further, a trend in
more contemporary DSP emulation logic is to replicate the logic on all DSPs. This
further simplifies the deployment of RTA tools.
Software Support
At setup, the user selects the type of target and loads the system with an emulation
driver for that target. The user also specifies the number of targets of each type and
their position in the scan chain. Without this capability, users would have to add code to
their host applications to perform the same function, resulting in messy and
unnecessarily complex code. Host-side support is provided in the form of object-oriented
interfaces based on the Component Object Model (COM) [4], which is a de facto industry
standard. This permits the host application developer to write client programs that are not
tightly coupled to a specific DSP.
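The setup information the user supplies (target type, count per type, and scan chain position) can be pictured as a small validated configuration. This is only a sketch: the function names and the placeholder type names `dsp_a` and `mcu_b` are invented, and the real debugging environment may represent this differently.

```python
from collections import Counter


def build_chain(entries):
    """entries: iterable of (position, target_type); returns types in scan order."""
    ordered = sorted(entries)
    positions = [pos for pos, _ in ordered]
    # Positions must be exactly 0..N-1 so that every slot in the chain is described once.
    if positions != list(range(len(ordered))):
        raise ValueError("scan positions must be 0..N-1 with no gaps or duplicates")
    return [target_type for _, target_type in ordered]


def targets_per_type(chain):
    """Number of targets of each type in the chain."""
    return Counter(chain)
```

Keeping this description outside the host application means the application itself needs no knowledge of chain layout.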
Hardware Reliability
The JTAG specification has long been established as a reliable standard. It has been
adopted and extended. An extensive set of target libraries has been developed for
various flavors of DSPs based on boundary scan. Reliability is achieved through reuse
of the same register set in different versions of emulation hardware, both across ISAs and
within ISAs.
Software Reliability
The use of unidirectional virtual paths for both target-to-host and host-to-target data
transfers assists in ensuring that there is no data corruption. Further, host applications
synchronize on data buffers connected to virtual paths and so there is no data loss. Buffer
management is precise and is architected to ensure no data loss on both target and
host sides. Another feature of the RTA architecture is congestion control. With this
capability buffers are guaranteed not to overflow. During host-to-target data transfer,
the RTA architecture signals the end of data transfer through a virtual path using
callbacks.
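One way to picture congestion control on a virtual path is a bounded buffer that refuses writes when full, combined with a completion callback. This is a sketch under assumptions, not the architecture's actual mechanism: `VirtualPathBuffer`, `transfer`, and the retry-after-drain policy are hypothetical.

```python
from collections import deque


class VirtualPathBuffer:
    """Bounded buffer for one unidirectional virtual path (hypothetical)."""

    def __init__(self, capacity, on_complete=None):
        self.capacity = capacity
        self.frames = deque()
        self.on_complete = on_complete

    def try_write(self, frame):
        """Refuse the write when full, so the buffer can never overflow."""
        if len(self.frames) >= self.capacity:
            return False
        self.frames.append(frame)
        return True

    def read(self):
        return self.frames.popleft() if self.frames else None

    def close(self):
        # Signal the end of the transfer through a callback.
        if self.on_complete is not None:
            self.on_complete(self)


def transfer(frames, buf, consume):
    """Push frames through buf; when it is full, drain one frame before retrying."""
    for frame in frames:
        while not buf.try_write(frame):
            consume(buf.read())
    while (frame := buf.read()) is not None:
        consume(frame)
    buf.close()
```

Because the writer waits rather than overwriting, frames arrive complete and in order, and the callback fires only after the last frame has been consumed.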
Challenges
Conclusion
The RTA methodology presented in this paper is extensively used and widely accepted. The
methodology is not limited to the high energy physics application presented here.
Our experience to date has shown that other domains, such as wireless and mobile
computing, require the processing of RTA data where both DSPs and microcontrollers are
on the same scan chain. The development of this RTA capability has been predicated on
the JTAG specification and on adherence to this standard in the emulation hardware that
has been designed into the DSP core. The software that has been developed is able to
differentiate between the various DSPs. The virtual paths in the RTA architecture
guarantee data integrity.
References:
Gottschalk, E.E., et al., "The BTeV DAQ and Trigger System – Some Throughput, Usability,
and Fault Tolerance Aspects," Proceedings of the Computing in High Energy and Nuclear
Physics Conference (CHEP 2001), p. 628, Beijing,