
Using FPGAs to Improve the Performance of Radar, Navigation and Guidance Systems

Radar, navigation and guidance systems process data that is acquired using arrays of sensors.
The energy delta from sensor to sensor over time holds the key to information such as
targets, position or course. This two-dimensional array of data, often referred to as an
“observation matrix”, must be solved as a set of linear equations to extract the desired
information. Solution methods include matrix inversion, factorization, adaptive filtering and
singular value decomposition. They are typically performed using floating-point arithmetic to
allow for sufficient dynamic range and precision of the input data. Doing so, however, limits
the performance of a system.

AccelChip, Inc.
1900 McCarthy Blvd.
Suite 204
Milpitas, CA 95035
(408) 943-0700

www.accelchip.com

Today’s DSP-oriented FPGAs, such as the Xilinx Virtex 4 and Altera Stratix II, provide far
greater performance than a floating-point DSP processor for this class of applications. They
also offer the flexibility to extend the dynamic range of a fixed-point implementation
significantly beyond the limitations of a fixed-point DSP processor. Singular value
decomposition (SVD) of an 8x8 matrix can run over 50 times faster in fixed-point arithmetic on
an FPGA than a floating-point implementation running on a TI TMS320C67x DSP processor.
Achieving this performance requires a hardware architecture that uses 261 of the Virtex 4
DSP48 multipliers running in parallel at 200 MHz.

These are challenging applications to design on any hardware platform. Determining an
FPGA architecture that effectively utilizes the DSP blocks to achieve a worthwhile
performance advantage adds significantly to this design complexity. The type of
architectural tradeoff analysis necessary to determine an optimal solution, however, is well
suited to a high-level DSP design methodology.

The Fixed-Point Dynamic Range “Issue”

As stated earlier, this class of applications makes extensive use of matrix inversion, matrix
factorization, singular value decomposition and division. These operations apply a cascade of
serial multiplies to the input data, and that cascade limits the dynamic range of a system.
The example shown below uses QR decomposition (QRD), a popular method of determining the
inverse of a matrix.

MATLAB Example of QRD Matrix Inverse:

% do QR factorization (qr_factor returns orthogonal Q and upper-triangular R)
[Q,R] = qr_factor(Xtmp);
% find inverse of R by back substitution, last row first
Rinv = zeros(M,N);
for row2 = M:-1:1
    % the reciprocal of the diagonal element is the only divide per row
    RDiagInv = 1/R(row2,row2);
    Rinv(row2,row2) = RDiagInv;
    for col2 = row2+1:N
        accum = 0;
        for t = row2+1:N
            accum = accum - (R(row2,t) * Rinv(t,col2));
        end
        Rinv(row2,col2) = accum * RDiagInv;
    end
end
% inverse of input: X = Q*R, so inv(X) = inv(R)*Q'
Xi = Rinv * Q';
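
As a quick sanity check of the back-substitution idea, the following self-contained sketch runs the same kind of loop on a small test matrix and measures how far the result is from a true inverse. It assumes MATLAB's built-in qr in place of the paper's qr_factor, whose source is not shown, and the 3x3 matrix X is a made-up example.

```matlab
% Sketch only: built-in qr stands in for qr_factor; X is a hypothetical input.
X = [4 1 2; 1 3 0; 2 0 5];          % small invertible test matrix
[Q, R] = qr(X);                     % X = Q*R, R upper triangular
N = size(R, 1);

Rinv = zeros(N);
for row = N:-1:1
    d = 1 / R(row, row);            % one divide per row
    Rinv(row, row) = d;
    for col = row+1:N
        accum = 0;
        for t = row+1:col           % Rinv(t,col) is zero for t > col
            accum = accum - R(row, t) * Rinv(t, col);
        end
        Rinv(row, col) = accum * d;
    end
end

Xi  = Rinv * Q';                    % inverse of the input
err = norm(Xi * X - eye(N));        % distance from the identity
fprintf('residual norm = %g\n', err);
```

In double precision the residual should be near machine epsilon; the interesting question for a fixed-point design is how much of that accuracy survives quantization.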

The QRD inverse, when implemented with Givens rotations, requires five cascaded multiplies and
one divide to be performed on the input data.

Fixed-point arithmetic dictates that the number of output bits of a multiply be equal to the
sum of the bit widths of its two input operands if all precision is to be maintained. If left
untruncated, bit widths can grow quickly, as shown below in Figure 1.

Figure 1 – Fixed-Point Bit Growth of Multiplies

For 16-bit inputs, untruncated bit growth of the QRD inverse can exceed 200 bits. Casting
these internal variables to the 16-bit internal busses of the TI TMS320C64x fixed-point
processor places severe limits on the usable dynamic range of the inputs. For this reason, TI
recommends, and rightly so, that these applications be implemented exclusively on their
floating-point DSP processors.
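
The growth rule can be made concrete with a little arithmetic. The sketch below is not a reconstruction of Figure 1; it assumes two hypothetical ways of cascading five multiplies, one where each stage multiplies the running result by a fresh 16-bit operand and one where each stage multiplies two results from the previous stage.

```matlab
% Hypothetical bit-growth models; the stage structure is an assumption.
inBits  = 16;                        % input word length
nStages = 5;                         % number of cascaded multiplies

% Case A: running result times a fresh 16-bit operand at every stage.
chainBits = inBits + inBits * (1:nStages);

% Case B: each stage multiplies two full-width results of the previous stage.
treeBits = inBits * 2.^(1:nStages);

disp(chainBits);                     % 32 48 64 80 96
disp(treeBits);                      % 32 64 128 256 512
```

Under the second assumption the width passes 200 bits at the fourth stage, which is the order of growth the text describes.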

The Flexibility of the FPGA Fabric

FPGA logic is not limited to specified bit widths for internal busses and may grow as needed
to meet the demands of the application. This bit growth comes at the expense of added
hardware which, if left unbounded, can be significant. Reasonable internal bit growth
beyond 16 bits, however, can improve the dynamic range of a fixed-point implementation
enough to provide a viable hardware solution for systems with inputs of up to 16 bits.
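
To give a feel for what extra internal bits buy, the sketch below applies the common rule of thumb that each bit adds about 6.02 dB of dynamic range. The rule and the word lengths chosen are illustrative assumptions, not figures from this paper.

```matlab
% Rule-of-thumb dynamic range of an n-bit word: 20*log10(2^n) dB.
bits  = [16 24 32];                  % example word lengths (assumed)
dr_dB = 20 * log10(2 .^ bits);       % about 6.02 dB per bit

for k = 1:numel(bits)
    fprintf('%2d bits -> %.1f dB\n', bits(k), dr_dB(k));
end
```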

Exploring the bit growth requirements of the QRD matrix inverse shows that quantizing the
inputs to 16 bits signed offers an integer dynamic range of -32,768 to +32,767. Figure
2 shows the “Fixed-Point Report” from the AccelChip DSP Synthesis tool, which lists the
quantizations used for the QRD matrix inverse.

In Figure 2 the “Quantizer” column nomenclature is as follows: “fixed” means signed
twos-complement, “ufixed” means unsigned binary, “floor” is the rounding mode applied at the
LSB and “wrap” is the overflow mode applied at the MSB. The numbers in square brackets
represent the word length and the binary point location (fraction length) respectively. For
more information on this nomenclature refer to the MATLAB help for the command “quantizer”.
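
The effect of a signed quantizer with floor rounding and wrap overflow can be emulated with base MATLAB functions. The sketch below is an approximation written for illustration; the helper name quantFloorWrap and the [16 15] format are assumptions, and it is not the quantizer object itself.

```matlab
% Emulates signed fixed-point quantization: floor rounding, wrap overflow.
% wl = word length in bits, fl = fraction length in bits (names assumed).
quantFloorWrap = @(x, wl, fl) ...
    (mod(floor(x * 2^fl) + 2^(wl-1), 2^wl) - 2^(wl-1)) / 2^fl;

a = quantFloorWrap( 0.3, 16, 15);    % rounds down to a multiple of 2^-15
b = quantFloorWrap( 1.0, 16, 15);    % +1.0 is out of range and wraps to -1.0
c = quantFloorWrap(-0.3, 16, 15);    % floor rounds toward minus infinity

fprintf('a = %.8f, b = %.8f, c = %.8f\n', a, b, c);
```

The wrap in the second case is why the number of integer bits matters: an overflow does not clip, it changes sign.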

Figure 2 – AccelChip Fixed-Point Report

The variable “RDiagInv”, which is the result of a divide operation, is quantized using 32 total
bits with 17 integer bits. Maintaining an adequate number of integer bits here is critical to
maintaining an acceptable response of the inverse function. The flexibility offered by the
Virtex 4 FPGA allows for the necessary bit growth of the integer bits to occur while some
reasonable trimming of the fractional bits may take place.

Multiplying operands greater than 16 bits

The Xilinx Virtex 4 device includes dedicated hardware multipliers in the DSP48 blocks that
support up to 18 input bits with up to 48 bits of accumulation. Generous as this is, it
does not place a hard limit of 18 bits on the internal busses. Multiplication operations
requiring more than 18 bits and accumulations requiring more than 48 bits can be
constructed using additional DSP48 blocks while maintaining exceptional performance in
excess of 300 MHz.

Figure 3 – 32 bit multiply implemented in Virtex 4 DSP48 Blocks
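
The principle behind a wide multiply built from narrow multipliers can be sketched as a sum of shifted partial products. The example below is a simplification: it splits unsigned 32-bit operands into 16-bit halves, whereas the actual DSP48 mapping and its signed 18x18 operand split are not shown in this paper. The operand values are arbitrary.

```matlab
% Sketch: 32x32 unsigned multiply from four 16x16 partial products.
a = uint64(4000000000);              % arbitrary 32-bit operands
b = uint64(3000000000);
mask = uint64(65535);                % low 16 bits

al = bitand(a, mask);  ah = bitshift(a, -16);
bl = bitand(b, mask);  bh = bitshift(b, -16);

% Each product below fits a narrow multiplier; shifts and adds recombine them.
p = bitshift(ah * bh, 32) + bitshift(ah * bl + al * bh, 16) + al * bl;

fprintf('partial-product result matches direct multiply: %d\n', p == a * b);
```

Four narrow multipliers and an adder tree replace one wide multiplier, which is the trade the additional DSP48 blocks make.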

The FPGA Performance Advantage

The real advantage of an FPGA implementation is realized when the hardware is architected
to support multiple DSP operations running concurrently on a single device. Figure 4 shows
the block diagram for a sensor array processing application that includes pre-filtering,
beamforming, adaptive filtering and post processing.

[Block diagram. Blocks: Radar Data, FIR Filtering, Doppler Filtering, Beam Forming,
Post-Nulling Processing, Target Reports and Adaptive Weights. Data rate annotations:
O(BW*N*T), O(BW*N*Log2(CPI*PRF)), O(BW*DOF*BT), O(BW*BT); Adaptive Weights: O(DOF^3/CPI).]

Figure 4 – Sensor Array Processing Block Diagram

Implementing this application in a floating-point processor requires either multiple chips or a
significant compromise in performance to allow the limited multiplier resources to be shared
between the multiple DSP operations. A single FPGA, however, supports the entire operation,
providing a performance advantage.

The 500 Multiplier Advantage!

The XC4VSX55 device offers a peak performance capacity that is 512 times greater than the
TI TMS320C67x floating-point processor for multiplier-dominated designs. The C67x offers
2 floating-point data paths, each containing a single multiplier that can operate at up to 250
MHz, to provide a peak performance of 500 MFLOPS. The XC4VSX55 includes 512
dedicated DSP48 blocks, each containing one signed 18x18 multiplier capable of running at 500
MHz, providing the fixed-point equivalent of a peak performance of 256 GFLOPS. Granted,
this is a simplistic method for comparing the performance capacity of a floating-point and a
fixed-point device, but the comparison should provide a sense of the possibilities the FPGA
architecture has to offer.
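
The comparison reduces to multiplier count times clock rate. The short calculation below reproduces the figures quoted above; the variable names are illustrative.

```matlab
% Peak multiply rate = number of multipliers x clock frequency.
c67Mults  = 2;    c67Clk  = 250e6;   % C67x: 2 multipliers at 250 MHz
fpgaMults = 512;  fpgaClk = 500e6;   % XC4VSX55: 512 DSP48 blocks at 500 MHz

c67Peak  = c67Mults  * c67Clk;       % multiplies per second
fpgaPeak = fpgaMults * fpgaClk;
ratio    = fpgaPeak / c67Peak;

fprintf('C67x %g M/s, XC4VSX55 %g G/s, ratio %g\n', ...
        c67Peak / 1e6, fpgaPeak / 1e9, ratio);
```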

Design Challenges

Maximizing the performance of this system requires that partial parallelism be implemented
in the areas of the design that have the greatest impact on overall performance. The
added hardware that results from this parallelism must not exceed the available
resources of the target FPGA. The number of architectural possibilities a designer must
evaluate is considerable and grows exponentially with the size of the system, which makes
determining an optimal hardware architecture a tedious and time-consuming design task.

AccelChip® provides a high-level design methodology that greatly simplifies this process.
Radar, navigation and guidance systems can be described in MATLAB using loops and
vector and matrix multiplies. These operations can be automatically “unrolled” during the
algorithmic synthesis process, giving designers a rapid way to explore the impact of
parallelism on different blocks of the system without modifying their golden source. By
using an automated flow, the final solution can be easily tailored to make maximum use of the
available resources of the target FPGA. Table 1 provides an example of how design exploration
can be used to tailor the performance of a QRD-RLS adaptive filter.

# Multipliers    Performance (MSPS)
1                9.5
41               100

Table 1 – QRD-RLS Adaptive Filter Performance vs. Multipliers
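
Unrolling, as described above, can be illustrated by hand on a simple multiply-accumulate loop. The sketch below is not output from AccelChip DSP Synthesis; it assumes an 8-tap dot product with made-up data and an unroll factor of 4, and shows that the rolled and unrolled forms compute the same result with a different number of multiplies per iteration.

```matlab
% Hand-unrolled multiply-accumulate; data and unroll factor are assumptions.
x = (1:8)';                          % example input samples
w = (8:-1:1)';                       % example weights

% Rolled: one multiply per iteration, 8 iterations (one multiplier reused).
acc = 0;
for k = 1:8
    acc = acc + w(k) * x(k);
end

% Unrolled by 4: four multiplies per iteration, 2 iterations.
acc4 = 0;
for k = 1:4:8
    acc4 = acc4 + w(k)*x(k) + w(k+1)*x(k+1) + w(k+2)*x(k+2) + w(k+3)*x(k+3);
end

fprintf('rolled = %d, unrolled = %d\n', acc, acc4);
```

In hardware the four multiplies in the unrolled body can map to four multipliers working in the same cycle, which is the kind of trade between multiplier count and throughput that Table 1 reports.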

AccelChip Solutions

AccelChip offers both standalone IP cores and an algorithmic synthesis environment based
on MATLAB for designing fixed-point implementations of radar, navigation and guidance
systems. The quickest path to hardware is to use an AccelCore® or AccelWare® IP core.
AccelCore IP provides users with a synthesizable RTL model along with documentation and
a testbench that can be incorporated into a larger design through RTL instantiation. The
AccelCore library includes matrix inversion, matrix factorization and singular value
decomposition functions. The AccelWare IP library includes over 50 synthesizable
MATLAB models that can be combined at the MATLAB level with user-defined
functionality and synthesized into VHDL or Verilog with AccelChip DSP Synthesis. This
form of IP is easy to integrate into larger system-level models defined in MATLAB.

Figure 5 – AccelWare IP Generation Form for QR Inverse

AccelChip® DSP Synthesis provides complete flexibility to define and implement custom
architectures for radar, navigation and guidance systems using floating-point MATLAB.
AccelChip provides automated floating- to fixed-point conversion to assist in solving the
complex quantization issues resulting from the cascaded multiply and divide operations used
in matrix inversion and factorization. Once an acceptable fixed-point model is determined,
users can rapidly explore performance versus hardware tradeoffs using algorithmic synthesis.
Here the number of dedicated hardware multipliers used in the design can be quickly
increased to improve performance and take full advantage of the flexibility of the FPGA
architecture.

Summary

The performance advantages of a Xilinx Virtex 4 or an Altera Stratix II FPGA are now
available to radar, navigation and guidance system designers requiring up to 16 bits of input
dynamic range. Realizing this performance advantage requires a high-level design
methodology, such as the one offered by AccelChip, to craft a hardware architecture that
fully utilizes the available FPGA DSP resources in a timely manner. Technical white papers
and application notes are available for download at www.accelchip.com or by e-mail request to
info@accelchip.com.
