
EURASIP Journal on Embedded Systems

FPGA Supercomputing Platforms, Architectures, and Techniques for Accelerating Computationally Complex Algorithms
Guest Editors: Vinay Sriram and Miriam Leeser
Copyright © 2009 Hindawi Publishing Corporation. All rights reserved.
This is a special issue published in volume 2009 of EURASIP Journal on Embedded Systems. All articles are open access articles
distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any
medium, provided the original work is properly cited.
Editor-in-Chief
Zoran Salcic, University of Auckland, New Zealand
Associate Editors
Sandro Bartolini, Italy
Neil Bergmann, Australia
Shuvra Bhattacharyya, USA
Ed Brinksma, The Netherlands
Paul Caspi, France
Liang-Gee Chen, Taiwan
Dietmar Dietrich, Austria
Stephen A. Edwards, USA
Alain Girault, France
Rajesh K. Gupta, USA
Thomas Kaiser, Germany
Bart Kienhuis, The Netherlands
Chong-Min Kyung, Korea
Miriam Leeser, USA
John McAllister, UK
Koji Nakano, Japan
Antonio Nunez, Spain
Sri Parameswaran, Australia
Zebo Peng, Sweden
Marco Platzner, Germany
Marc Pouzet, France
S. Ramesh, India
Partha S. Roop, New Zealand
Markus Rupp, Austria
Asim Smailagic, USA
Leonel Sousa, Portugal
Jarmo Henrik Takala, Finland
Jean-Pierre Talpin, France
Jürgen Teich, Germany
Dongsheng Wang, China
Contents
FPGA Supercomputing Platforms, Architectures, and Techniques for Accelerating Computationally
Complex Algorithms, Vinay Sriram and Miriam Leeser
Volume 2009, Article ID 218456, 2 pages
Prototyping Advanced Control Systems on FPGA, Stéphane Simard, Jean-Gabriel Mailloux,
and Rachid Beguenane
Volume 2009, Article ID 897023, 12 pages
Parallel Backprojection: A Case Study in High-Performance Reconfigurable Computing, Ben Cordes and
Miriam Leeser
Volume 2009, Article ID 727965, 14 pages
Performance Analysis of Bit-Width Reduced Floating-Point Arithmetic Units in FPGAs: A Case Study of
Neural Network-Based Face Detector, Y. Lee, Y. Choi, M. Lee, and S. Ko
Volume 2009, Article ID 258921, 11 pages
Accelerating Seismic Computations Using Customized Number Representations on FPGAs, Haohuan Fu,
William Osborne, Robert G. Clapp, Oskar Mencer, and Wayne Luk
Volume 2009, Article ID 382983, 13 pages
An FPGA Implementation of a Parallelized MT19937 Uniform Random Number Generator,
Vinay Sriram and David Kearney
Volume 2009, Article ID 507426, 6 pages
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 218456, 2 pages
doi:10.1155/2009/218456
Editorial
FPGA Supercomputing Platforms, Architectures, and Techniques
for Accelerating Computationally Complex Algorithms
Vinay Sriram¹ and Miriam Leeser²
¹ Defence and Systems Institute, University of South Australia, Adelaide, South Australia 5001, Australia
² Department of Electrical and Computer Engineering, College of Engineering, Northeastern University, Boston, MA 02115, USA
Correspondence should be addressed to Miriam Leeser, mel@coe.neu.edu
Received 6 May 2009; Accepted 6 May 2009
Copyright © 2009 V. Sriram and M. Leeser. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
This is a special issue on FPGA supercomputing platforms,
architectures, and techniques for accelerating computation-
ally complex algorithms. This issue covers a broad range
of applications in which field programmable gate arrays
(FPGAs) are successfully used to accelerate processing. It
also provides researchers insights on the challenges in
successfully using FPGAs. The applications discussed include
motor control, radar processing, face recognition, processing
seismic data, and accelerating random number generation.
Techniques discussed by the authors include partitioning
between a CPU and FPGA hardware, reducing bitwidth
to improve performance, interfacing to analog signals, and
using high level tools to develop applications.
Two challenges that face many users of reconfigurable hardware are interfacing to the analog domain and easing the job of developing applications. In the paper entitled "Prototyping Advanced Control Systems on FPGA," the authors present a rapid prototyping platform and design flow suitable for the design of on-chip motion controllers and other SoCs with a need for analog interfacing. The target hardware platform consists of a customized FPGA design for the Amirix AP1000 PCI FPGA board coupled with a multichannel analog I/O daughter card. The design flow uses Xilinx System Generator in MATLAB/Simulink for system design and test, and Xilinx Platform Studio for SoC integration. This approach has been applied to the analysis, design, and hardware implementation of a vector controller for 3-phase AC induction motors.
Image processing is an application area that exhibits a great deal of parallelism. In the work entitled "Parallel Backprojection: A Case Study in High-Performance Reconfigurable Computing," the authors investigate the use of a high-performance reconfigurable supercomputer built from both general-purpose processors and FPGAs. These architectures allow a designer to exploit both fine-grained and coarse-grained parallelism, achieving high degrees of speedup. The authors describe how backprojection, used to reconstruct Synthetic Aperture Radar (SAR) images, is implemented on a high-performance reconfigurable computer system. The results show an overall application speedup of 50 times.
Neural networks have successfully been used to detect faces in video images. In the paper entitled "Performance Analysis of Bit-Width Reduced Floating-Point Arithmetic Units in FPGAs: A Case Study of Neural Network-Based Face Detector," the authors describe the implementation of an FPGA-based face detector using a neural network and bit-width reduced floating-point arithmetic units (FPUs). The FPUs and neural network are designed using MATLAB and VHDL, and the two implementations are compared. The authors demonstrate that reductions in the number of bits used in arithmetic computation can produce significant cost reductions including area, speed, and power with a small sacrifice in accuracy.
The oil and gas industry has a huge demand for high-performance computing on extremely large volumes of data. FPGAs are exceedingly well matched for this task. Reduced-precision arithmetic operations can greatly decrease the area cost and I/O bandwidth of the FPGA-based design, supporting increased parallelism and achieving high performance. In the work entitled "Accelerating Seismic Computations Using Customized Number Representations on FPGAs," the authors present a tool to determine the minimum precision that still provides acceptable accuracy for seismic applications. By using the minimized number format, the authors are able to demonstrate speedups ranging from 5 to 7 times, including overhead costs such as the transfer time to and from the general-purpose processors. With improved bandwidth between CPU and FPGA, the authors show that a 48-times speedup is possible.
A large number of applications require large quantities of uncorrelated random numbers. In the paper entitled "An FPGA Implementation of a Parallelized MT19937 Uniform Random Number Generator," Vinay Sriram and David Kearney present a fast uniform random number generator implemented in reconfigurable hardware that is both higher throughput and more area efficient than previous implementations. The design presented, which generates up to 624 random numbers in parallel, has a throughput that is more than 15 times better than previously published results.
This collection of papers represents an overview of active research in the field of reconfigurable hardware applications and techniques.
Vinay Sriram
Miriam Leeser
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 897023, 12 pages
doi:10.1155/2009/897023
Research Article
Prototyping Advanced Control Systems on FPGA
Stéphane Simard, Jean-Gabriel Mailloux, and Rachid Beguenane
Department of Applied Sciences, University of Quebec at Chicoutimi, 555 boul. de l'Université, Chicoutimi, QC, Canada G7H 2B1
Correspondence should be addressed to Rachid Beguenane, rbeguena@uqac.ca
Received 19 June 2008; Accepted 3 March 2009
Recommended by Miriam Leeser
In advanced digital control and mechatronics, FPGA-based systems on a chip (SoCs) promise to supplant older technologies, such as microcontrollers and DSPs. However, the adoption of FPGA technology by control specialists is complicated by the need for skilled hardware/software partitioning and design in order to match the performance requirements of more and more complex algorithms while minimizing cost. Currently, without adequate software support to provide a straightforward design flow, the amount of time and effort required is prohibitive. In this paper, we discuss our choice, adaptation, and use of a rapid prototyping platform and design flow suitable for the design of on-chip motion controllers and other SoCs with a need for analog interfacing. The platform consists of a customized FPGA design for the Amirix AP1000 PCI FPGA board coupled with a multichannel analog I/O daughter card. The design flow uses Xilinx System Generator in Matlab/Simulink for system design and test, and Xilinx Platform Studio for SoC integration. This approach has been applied to the analysis, design, and hardware implementation of a vector controller for 3-phase AC induction motors. It also has contributed to the development of CMC's MEMS prototyping platform, now used by several Canadian laboratories.
Copyright © 2009 Stéphane Simard et al. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
The use of advanced control algorithms depends upon being
able to perform complex calculations within demanding
timing constraints, where system dynamics can require
feedback response in as short as a couple of tens of microseconds. Developing and implementing such capable feedback controllers is currently a hard goal to achieve, and there is much technological challenge in making it more affordable. Thanks to major technological breakthroughs in recent years, and to sustained rapid progress in the fields of very large
scale integration (VLSI) and electronic design automation
(EDA), electronic systems are increasingly powerful [1,
2]. In the latter paper, it is rightly stated that FPGA
devices have reached a level of development that puts
them on the edge of microelectronics fabrication technology
advancements. They provide many advantages with respect
to their nonreconfigurable counterparts such as general-purpose microprocessors and DSP processors. In fact, FPGA-based digital processing systems achieve a better performance-cost compromise, and with a moderate design effort they can afford the implementation of powerful and flexible embedded SoCs. Exploiting the benefits of FPGA technology for industrial electrical control systems has been the source of intensive research investigations during the last decade, in order to boost their performance at lower cost [3, 4]. There is
still, however, much work to be done to bring such power into the hands of control specialists. In [5], it is stated that the
potential of implementing one FPGA chip-based controller
has not been fully exploited in the complicated motor
control or complex converter control applications. Until
now, most related research works using FPGA devices are
focusing on designing specific parts mainly to control power
electronic devices such as space vector pulse width modula-
tion (SVPWM) and power factor correction [6, 7]. Usually
these are implemented on small FPGAs while the main
control tasks are realised sequentially by the supervising
processor system, basically the DSP. Important and constant
improvement in FPGA devices, synthesis, place-and-route
tools, and debug capabilities has made FPGA prototyping
more available and practical to ASIC/SoC designers than
ever before. The validation of their hardware and software
on a common platform can be accomplished using FPGA-
based prototypes. Thanks to the existing and mature tools
that provide automation while maintaining flexibility, FPGA prototypes now make it possible for ASIC/SoC designs to be delivered on time at minimal budget. Consequently, FPGA-based prototypes could be efficiently exploited for motion control applications to permit easy modification of advanced control algorithms through short design cycles, simple simulation, and rapid verification. Still, the implementation of FPGA-based SoCs for motion control results in very complex tasks involving skilled SW and HW developers. Efficient IP integration constitutes the main difficulty from the hardware perspective, while on the software side the issue is the complexity of debugging software that runs under a real-time operating system (RTOS) on real hard-
ware. This paper discusses the choice, adaptation, and use of
a rapid prototyping platform and design flow suitable for the
design of on-chip motion controllers and other SoCs with a
need for analog interfacing. Section 2 describes the chosen
prototyping platform and the methodology that supports
embedded application software coupled with custom FPGA
logic and analog interfacing. Section 3 presents the strategy
for simulating and prototyping any control algorithm using
Xilinx System Generator (XSG) along with Matlab/Simulink.
A vector controller for an induction motor is taken as a running
example to explain some features related to the cosimulation.
Section 4 describes the process of integrating the designed
controller, once completely debugged, within an SoC archi-
tecture using Xilinx Platform Studio (XPS) and targeting
the chosen FPGA-based platform. Section 5 discusses the
complex task of PCI initialization of the analog I/O card
and controller setup by software under embedded Linux
operating system. Section 6 takes the induction motor vector
control algorithm as an application basis to demonstrate
the usefulness of the chosen FPGA-based SoC platform to
design/verify on-chip motion controllers. The last section
concludes the paper.
2. The FPGA-Based Prototyping Platform for
On-Chip Motion Controllers
With the advent of a new generation of high-performance
and high-density FPGAs offering speeds in the hundreds of MHz and complexities of up to 2 megagates, FPGA-based prototyping becomes appropriate for verification of
SoC and ASIC designs. Consequently the increasing design
complexities and the availability of high-capacity FPGAs
in high-pin-count packages are motivating the need for
sophisticated boards. Board development has become a task
that demands unique expertise. That is one reason why
commercial off-the-shelf (COTS) boards are quickly becom-
ing the solution of choice because they are closely related
to the implementation and debugging tools. For many years, and under its System-on-Chip Research Network (SOCRN) program, CMC Microsystems provided Canadian
universities with development tools, various DSP/Embedded
Systems/multimedia boards, and SoC prototyping boards
such as Amirix AP1000 PCI FPGA development platform.
In order to support our research on on-chip motion controllers, we have set up the aforementioned platform, a host PC (3.4 GHz Xeon CPU with 2.75 GB of RAM) equipped with the Amirix AP1000 PCI FPGA development board, to support a multichannel analog I/O PMC daughter card (Figure 1) for communication with the exterior world.

Figure 1: Rapid prototyping station equipped with the FPGA board and a multichannel analog I/O daughter card.
The AP1000 has lots of features to support complex
system prototyping, including test access and expansion
capabilities. The PCB is a 64-bit PCI card that can be
inserted in a standard expansion slot on a PC motherboard
or PCI backplane. Use of the PMC site requires a second
chassis slot on the backside of the board and an optional
extender card to provide access to the board I/O. The AP1000
platform includes a Xilinx Virtex-II Pro XC2VP100 FPGA
and is connected to dual banks of DDR SDRAM (64 MB)
and SRAM (2 MB), Flash Memory (16 MB), Ethernet and
other interfaces. It is configured as a single-board computer
based on two embedded IBM PowerPC processors, and it is
providing an advanced design starting point for the designer
to improve time-to-market and reduce development costs.
The analog electronics are considered modular, and can
either be external or included on the same chip (e.g., when
fabricated into an ASIC). On the prototyping platform,
of course, they are supplied by the PMC daughter card.
It is a General Standards PMC66-16AISS8A04 analog I/O
board featuring twelve 16-bit channels: eight simultaneously
sampled analog inputs, and four analog outputs, with input
sampling rates up to 2.0 MSPS per channel. It acts as a two-
way analog interface between the FPGA and lab equipment,
connected through an 80-pin ribbon cable and a breakout
board to the appropriate ports of the power module.
The application software is compiled with the free
Embedded Linux Development Kit (ELDK) from DENX
Software Engineering. Since it runs under such a complete
operating system as Linux, it can perform elaborated func-
tions, including user interface management (via a serial
link or through networking), and real-time supervision and
adaptation of a process such as adaptive control.
The overall platform is very well suited to FPGA-in-
the-loop control and SoC controller prototyping (Figure 2).
The controller can either be implemented completely in
digital hardware, or executed on an application-specific
instruction set processor (ASIP). The hardware approach has
a familiar design flow, using the Xilinx System Generator (XSG) blockset and hardware/software cosimulation features in Matlab/Simulink. An ASIP specially devised for advanced control applications is currently under development within our laboratory.

Figure 2: Architecture of the embedded platform driving a power system (schematic not to scale).
3. Matlab/Simulink/XSG Controller Design
It is well known that simulation of large systems within
system analysis and modelling software environments takes
a prohibitive amount of time. The main advantage of a rapid
prototyping design ow with hardware/software cosimula-
tion is that it provides the best of a system analysis and
modelling environment while oering adequate hardware
acceleration.
Hardware/software cosimulation has been introduced
by major EDA vendors around year 2000, combining
Matlab/Simulink, the computing, and Model-Based Design
software, with synthesizable blocksets and automated hard-
ware synthesis software such as DSP Builder from Altera,
and System Generator from Xilinx (XSG). Such a design
flow reduces the learning time and development risk for
DSP developers, shortens the path from design concept to
working hardware, and enables engineers to rapidly create
and implement innovative, high-performance DSP designs.
The XSG cosimulation feature allows the user to run a
design on the FPGA found on a certain platform. An impor-
tant advantage of XSG is that it allows for quick evaluation
of system response when making changes (e.g., changing
coefficient and data widths). As the AP1000 is not supported
by XSG among the preprogrammed cosimulation targets, we
use the Virtex-4 ML402 SX XtremeDSP Evaluation Platform
instead (Figure 3). The AP1000 is only targeted at the SoC
integration step (see Section 4).
Figure 3: Virtex-4 ML402 SX XtremeDSP evaluation platform.
We begin with a conventional, floating-point, simulated control system model; a corresponding fixed-point hardware representation is then constructed using the XSG blockset, leading to a bit-accurate FPGA hardware model (Figure 4), and XSG generates synthesizable HDL targeting Xilinx FPGAs. The XSG design, simulation, and test procedure is briefly outlined below. Power systems including
motor drives can be simulated using the SimPowerSystems
(SPS) blockset in Simulink.
(1) Start by coding each system module individually with
the XSG blockset.
(2) Import any user-designed HDL cores.
(3) Adjust the fixed-point bit precisions (including bit widths and binary point position) for each XSG block of the system.
(4) Use the Xilinx Gateway blocks to interface a floating-point Simulink model with a fixed-point XSG design.
The Gateway-in and Gateway-out blocks, respec-
tively, convert inputs from Simulink to XSG and
outputs from XSG to Simulink.
(5) Test system response using the same input stimuli for an equivalent XSG design and Simulink model, with automatic comparison of their respective outputs.
Commonly, software simulation of a complete drive
model, for a few seconds of results, could take a couple of
days of computer time. Hardware/software cosimulation can
be used to accelerate the process of controller simulation,
thus reducing the computing time to about a couple of hours.
It also ensures that the design will respond correctly once
implemented in hardware.
4. System-on-Chip Integration in Xilinx
Platform Studio
FPGA design and the SoC architecture are managed with Xilinx Platform Studio (XPS), targeting the AP1000.

Figure 4: Controller-on-chip design flow.

We have customized the CMC-modified Amirix baseline design
to support analog interfacing, user logic on the Processor
Local Bus (PLB), and communication with application
software under embedded Linux. XPS generates the corresponding .bin file, which is then transferred to the Flash configuration memory on the AP1000. The contents of this memory are used to reconfigure the FPGA. We have found an undocumented fact that, on the AP1000, this approach is the only practicable way to program the FPGA. JTAG programming proved inconvenient, because it suppresses
the embedded Linux, which is essential to us for PCI
initialization. Once programmed, user logic awaits a start
signal from our application software following analog I/O
card initialization.
To accelerate the logic synthesis process, the mapper and
place and route options are set to STD (standard) in the
implementation options file (etc/fast_runtime.opt), found in the Project Files menu. If the user wants a more aggressive effort, these options should be changed to HIGH, which
requires much more time. Our experiments have shown that
it typically amounts to several hours.
4.1. Bus Interfacing. The busses implemented in FPGA logic follow the IBM CoreConnect standard. It provides master and slave operation modes for any instantiated hardware module. The most important system busses are the Processor Local Bus (PLB) and the On-chip Peripheral Bus (OPB).
The implementation of the vector control scheme requires much less generality, and deletes some communication stages that might be used in other applications. It is easier to start from such a generic design, dropping unneeded features, than to start from scratch. This way, one can quickly progress from the SoC architecture in XPS down to a working controller on the AP1000.
4.1.1. Slave Model Register Read Mux. The baseline XPS design provides the developer with a slave model register read multiplexer. This allows the developer to decide which data is provided when a read request is sent to the user logic peripheral by another peripheral in the system. While a greater number may be used, our pilot application, the vector control, only uses four slave registers. The user logic peripheral has a specific base address (C_BASEADDR), and the four 32-bit registers are accessed through C_BASEADDR + register offset. In this example, C_BASEADDR + 0x0 corresponds to the control and status register, which is composed of the following bits:

0–7: the DIP switches on the AP1000, for debugging purposes,
8: used by user software to reset, start, or stop the controller,
9–31: reserved.

As for the other three registers, they correspond to:

C_BASEADDR + 0x4: output to analog channel 1,
C_BASEADDR + 0x8: output to analog channel 2,
C_BASEADDR + 0xC: reserved (often used for debugging purposes).
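For illustration only, the register map above could be mirrored on the software side by a few C constants and accessors. The following sketch is not part of the original design; the symbolic names are hypothetical, and in a real project the base address is the one assigned in the XPS System Assembly View (e.g., 0x2a001000, as in Section 4.2).

#include <stdint.h>

/* Hypothetical mirror of the slave register map described above.
 * C_BASEADDR is whatever base address was assigned in XPS
 * (0x2a001000 in the example of Section 4.2). */
#define REG_CTRL_STATUS  0x0  /* bits 0-7: DIP switches; bit 8: reset/start/stop */
#define REG_ANALOG_CH1   0x4  /* output to analog channel 1 */
#define REG_ANALOG_CH2   0x8  /* output to analog channel 2 */
#define REG_RESERVED     0xC  /* reserved, often used for debugging */

/* 'base' is assumed to be a virtual mapping of C_BASEADDR
 * (e.g., obtained with ioremap() in a driver). */
static inline void slave_reg_write(volatile uint32_t *base,
                                   uint32_t offset, uint32_t value)
{
    base[offset / sizeof(uint32_t)] = value;
}

static inline uint32_t slave_reg_read(volatile uint32_t *base, uint32_t offset)
{
    return base[offset / sizeof(uint32_t)];
}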
4.1.2. Master Model Control. The master model control state machine is used to control the requests and responses between the user logic peripheral and the analog I/O card. The latter is used to read the input currents and voltages for vector control operation. The start signal previously mentioned in slave register 0 is what gets the state machine out of IDLE mode, and thus starts the data acquisition process. In this specific example, the I/O card is previously initialized by the embedded application software, relieving the state machine of any initialization code. Analog I/O initialization sets a lot of parameters, including how many active channels are to be read.
The state machine operates in the following way (Figure 5); a behavioural sketch in C follows the list.
(1) The user logic waits for a start signal from the user through slave register 0.
(2) The different addresses to access the right AIO card fields are set up, namely, the BCR and read buffer.
(3) A trigger is sent to the AIO card to buffer the values of all desired analog channels.
(4) A read cycle is repeated for the number of active channels previously defined.
(5) Once all channels have been read, the state machine falls back to the trigger state, unless the user chooses to stop the process using slave register 0.
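The fragment below is only a behavioural C model of the state machine in Figure 5, written to summarize the transitions; the actual implementation is FPGA logic inside the user peripheral, and the helper functions declared here are placeholders, not real bus primitives.

/* Placeholder hooks standing in for hardware signals and bus operations. */
extern int  start_requested(void);      /* start bit in slave register 0 */
extern int  stop_requested(void);       /* stop bit in slave register 0 */
extern void setup_aio_addresses(void);  /* set up BCR and read-buffer addresses */
extern void aio_trigger(void);          /* trigger the AIO card to buffer samples */
extern void aio_read_channel(int ch);   /* read one buffered channel */

enum master_state { IDLE, ADDR_SETUP, AIO_TRIGGER, READ_CYCLE };

void master_model(int num_active_channels)
{
    enum master_state state = IDLE;

    for (;;) {
        switch (state) {
        case IDLE:         /* (1) wait for the start signal */
            if (start_requested()) state = ADDR_SETUP;
            break;
        case ADDR_SETUP:   /* (2) set up the AIO card addresses */
            setup_aio_addresses();
            state = AIO_TRIGGER;
            break;
        case AIO_TRIGGER:  /* (3) buffer all desired analog channels */
            aio_trigger();
            state = READ_CYCLE;
            break;
        case READ_CYCLE:   /* (4) one read per active channel */
            for (int ch = 0; ch < num_active_channels; ch++)
                aio_read_channel(ch);
            /* (5) re-trigger, unless the user requested a stop */
            state = stop_requested() ? IDLE : AIO_TRIGGER;
            break;
        }
    }
}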
4.2. Creating or Importing User Cores. User-designed logic
and other IPs can be created or imported into the XPS design
following this procedure.
(1) Select Create or Import Peripheral from the Hard-
ware menu, and follow the wizard (unless otherwise
stated below, the default options should be accepted).
(2) Choose the preferred bus. In the case of our vector
controller, it is connected to the PLB.
(3) For PLB interfacing, select the following IPIF ser-
vices:
(a) burst and cacheline transaction support,
(b) master support,
(c) S/W register support.
(4) The User S/W Register data width should be 32.
(5) Accept the other wizard options as default, then click
Finish.
(6) You should find your newly created/imported core in the Project Repository of the IP Catalog; right-click
on it, and select Add IP.
(7) Finally go to the Assembly tab in the main System
Assembly View, and set the base address (e.g.,
0x2a001000), the memory size (e.g., 512), and the bus
connection (e.g., plb bus).
4.3. Instantiating a Netlist Core. Using HDL generated by
System Generator may be inconvenient for large control
systems described with the XSG blockset, as it can require
a couple of days of synthesis time. System Generator
can be asked to produce a corresponding NGC binary netlist file instead, which is then treated as a black box to be imported and integrated into an XPS project. This considerably reduces the synthesis time needed. The process of instantiating a netlist core in a custom peripheral (e.g., user_logic.vhd) is performed following the steps documented in the XPS user guide.
Figure 5: Master model state machine.
Table 1: The two Intel StrataFlash memory devices.

    Bank   Address       Size                Mode   Description
    1      0x20000000    0x1000000 (16 MB)   16     Program Flash
    2      0x24000000    0x1000000 (16 MB)   8      Config. Flash

Table 2: AP1000 flash configurations.

    Region   Bank   Sectors   Description
    0        2      0-39      Configuration 0
    1        2      40-79     Configuration 1
    2        2      80-127    Configuration 2 (default config.)
4.4. BIN File Generation and FPGA Configuration. To configure the FPGA, a BIN file must be generated from the XPS project. Since JTAG programming disables the embedded Linux, the BIN file must be downloaded directly to onboard Flash memory. There are two Intel StrataFlash memory devices on the AP1000, one for the configuration, and one for the U-boot bootstrap code (which must not be overwritten) (Table 1).
The configuration memory (Table 2) is divided into three regions. Region 2 holds the default Amirix configuration, and should not be overwritten. Downloading the BIN file to memory is done through a network cable using the TFTP protocol. For this purpose, a TFTP server must be set up on the host PC. The remote side of the protocol is managed by U-boot on the AP1000. Commands to U-boot to initiate the transfer and to trigger FPGA reconfiguration from a designated region are entered by the user through a serial-link terminal program. Here is the complete U-boot command sequence:
setenv serverip 132.212.202.166
setenv ipaddr 132.212.201.223
erase 2:0-39
tftp 00100000 download.bin
cp.b 00100000 24000000 00500000
swrecon
5. Application Software and Drivers
One of the main advantages of using an embedded Linux
system is the ability to perform the complex task of PCI
initialization. In addition, it allows for application software
to provide elaborated interfacing and user monitoring
through appropriate software drivers. Initialization of the
analog I/O card on the PMC site and controller setup are
among such tasks that are best performed by software.
5.1. Linux Device Drivers Essentials. Appropriate device
drivers have to be written in order to use daughter cards
(such as an analog I/O board) or custom hardware com-
ponents on a bus internal to the SoC, and be able to
communicate with them from the embedded Linux. Drivers
and application software for the AP1000 can be developed
with the free Embedded Linux Development Kit (ELDK)
from DENX Software Engineering, Germany. The ELDK
includes the GNU cross development tools, along with
prebuilt target tools and libraries to support the target
system. It comes with full source code, including all patches,
extensions, programs, and scripts used to build the tools.
A complete discussion on writing Linux device drivers is
beyond the scope of this paper, and this information may be
found elsewhere, such as in [8]. Here, we only mention a few
important issues relevant to the pilot application.
To support all the required functions when creating a
Linux device driver, the following includes are needed:
#include <linux/config.h>
#include <linux/module.h>
#include <linux/pci.h>
#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/slab.h>
#include <linux/fs.h>
#include <linux/ioport.h>
#include <linux/ioctl.h>
#include <linux/byteorder/big_endian.h>
#include <asm/io.h>
#include <asm/system.h>
#include <asm/uaccess.h>
5.2. PCI Access to the Analog I/O Board. The pci_find_device() function begins or continues searching for a PCI device by vendor/device ID. It iterates through the list of known PCI devices, and if a PCI device is found with a matching vendor and device, a pointer to its device structure is returned. Otherwise, NULL is returned. For the PMC66-16AISS8A04, the vendor ID is 0x10e3, and the device ID is 0x8260. The device must then be enabled with pci_enable_device() before it can be used by the driver. The start addresses of the base address registers (BARs) can be obtained using pci_resource_start(). In the example, we get BAR 2, which gives access to the main control registers of the PMC66-16AISS8A04.
volatile u32 *base_addr;
struct pci_dev *dev;
struct resource *ctrl_res;

dev = pci_find_device(VENDORID, DEVICEID, NULL);
...
pci_enable_device(dev);
get_revision(dev);
base_addr = (volatile u32 *) pci_resource_start(dev, 2);
ctrl_res = request_mem_region((unsigned long) base_addr,
                              0x80L, "control");
bcr = (u32 *) ioremap_nocache((unsigned long) base_addr,
                              0x80L);
The readl() and writel() functions are defined to access PCI memory space in units of 32 bits. Since the PowerPC is big-endian while the PCI bus is by definition little-endian, a byte swap occurs when reading and writing PCI data. To ensure correct byte order, the le32_to_cpu() and cpu_to_le32() functions are used on incoming and outgoing data. The following code example defines some macros to read and write the Board Control Register, to read data from the analog input buffer, and to write to one of the four analog output channels.
volatile u32 *bcr;

#define GET_BCR()        (le32_to_cpu(readl(bcr)))
#define SET_BCR(x)       writel(cpu_to_le32(x), bcr)
#define ANALOG_IN()      le32_to_cpu(readl(&bcr[ANALOG_INPUT_BUF]))
#define ANALOG_OUT(x, c) writel(cpu_to_le32(x), &bcr[ANALOG_OUTPUT_CHAN_00 + (c)])
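As a rough illustration of how these macros might be combined inside the driver, the fragment below drives one analog output channel and polls for an input sample. The BCR bit mask and the polling scheme are placeholders invented for this sketch; they are not the actual PMC66-16AISS8A04 register definitions, which must be taken from the board manual.

/* Illustrative only: BCR_INPUT_READY and the busy-wait below are
 * hypothetical placeholders, not the real board register definitions. */
#define BCR_INPUT_READY  0x00000001u   /* assumed "sample available" flag */

static u32 read_one_sample(void)
{
    /* Busy-wait until the (assumed) ready flag appears in the BCR. */
    while (!(GET_BCR() & BCR_INPUT_READY))
        ;                    /* a real driver would sleep or time out here */

    /* Pop one sample (a 16-bit value in a 32-bit word) from the input buffer. */
    return ANALOG_IN();
}

static void write_dac_channel(unsigned int channel, u16 value)
{
    /* Drive one of the four analog output channels (0-3). */
    ANALOG_OUT(value, channel);
}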
5.3. Cross-Compilation with the ELDK. To properly compile with the ELDK, a makefile is required. Kernel source code should be available in KERNELDIR to provide for essential includes. The version of the preinstalled kernel on the AP1000 is Linux 1.4. Example of a minimal makefile:
TARGET   = thetarget
OBJS     = myobj.o

# EDIT THE FOLLOWING TO POINT TO
# THE TOP OF THE KERNEL SOURCE TREE
KERNELDIR = ~/kernel-sw-003996-01

CC = ppc_4xx-gcc
LD = ppc_4xx-ld

DEFINES  = -D__KERNEL__ -DMODULE \
           -DEXPORT_SYMTAB
INCLUDES = -I$(KERNELDIR)/include \
           -I$(KERNELDIR)/include/linux \
           -I$(KERNELDIR)/include/asm
FLAGS    = -fno-strict-aliasing \
           -fno-common \
           -fomit-frame-pointer \
           -fsigned-char
CFLAGS   = $(DEFINES) $(WARNINGS) \
           $(INCLUDES) $(SWITCHES) \
           $(FLAGS)

all: $(TARGET).o Makefile

$(TARGET).o: $(OBJS)
	$(LD) -r -o $@ $(OBJS)
5.4. Software and Driver Installation on the AP1000. For ease of manipulation, user software and drivers are best carried on a CompactFlash card, which is then inserted in the back slot of the AP1000 and mounted into the Linux file system. The drivers are then installed, and the application software started, as follows:

mount /dev/discs/disc0/part1 /mnt
insmod /mnt/logic2/hwlogic.o
insmod /mnt/aio.o
cd /dev
mknod hwlogic c 254 0
mknod aio c 253 0
/mnt/cvecapp
6. Application: AC Induction Motor Control
Given their well-known qualities of being cheap, highly robust, efficient, and reliable, AC induction motors currently constitute the bulk of the installed base in the motion industry. From the control point of view, however, these motors have highly nonlinear behavior.
6.1. FPGA-Based Induction Motor Vector Control. The selected control algorithm for our pilot application is the rotor-flux-oriented vector control (RFOC) of a three-phase AC induction motor of the squirrel-cage type. It is the first method which makes it possible to artificially give some linearity to the torque control of induction motors [9]. The RFOC algorithm consists in a partial linearization of the physical model of the induction motor by breaking up the stator current i_s into its components in a suitable reference frame (d, q). This frame revolves synchronously along with the rotor flux space vector in order to get a separate control of the torque and the rotor flux. The overall strategy then consists in regulating the speed while maintaining the rotor flux constant (e.g., 1 Wb). The RFOC algorithm is directly derived from the electromechanical model of a three-phase, Y-connected, squirrel-cage induction motor. This is described by equations in the synchronously rotating reference frame (d, q) as
u_{sd} = R_s i_{sd} + \sigma L_s \frac{d}{dt} i_{sd} \underbrace{-\, \omega \sigma L_s i_{sq} + \frac{M}{L_r} \frac{d}{dt} \psi_r}_{D_d},

u_{sq} = R_s i_{sq} + \sigma L_s \frac{d}{dt} i_{sq} \underbrace{+\, \omega \sigma L_s i_{sd} + \frac{M}{L_r} \omega \psi_r}_{D_q},

\frac{d}{dt} \psi_r = \frac{R_r}{L_r} \left( M i_{sd} - \psi_r \right),

\omega = P_p \omega_r + \frac{M R_r}{\psi_r L_r} i_{sq},

\frac{d\omega_r}{dt} = \frac{3}{2} \frac{P_p M}{J L_r} \psi_r i_{sq} - \frac{D}{J} \omega_r - \frac{T_l}{J},    (1)

where u_sd and u_sq are the d and q components of the stator voltage u_s; i_sd and i_sq are the d and q components of the stator current i_s; ψ_r is the modulus of the rotor flux and ρ is the angular position of the rotor flux; ω is the synchronous angular speed of the (d, q) reference frame (ω = dρ/dt); L_s, L_r, and M are the stator, rotor, and mutual inductances; R_s and R_r are the stator and rotor resistances; σ is the leakage coefficient of the motor; P_p is the number of pole pairs; ω_r is the mechanical rotor speed; D is the damping coefficient; J is the moment of inertia; and T_l is the load torque.
6.2. RFOC Algorithm. The derived expressions for each block composing the induction motor RFOC scheme, as shown in Figure 6, are given as follows.

Speed PI Controller:

i^{*}_{sq} = k_{pv} \varepsilon_v + k_{iv} \int \varepsilon_v \, dt; \qquad \varepsilon_v = \omega^{*}_r - \omega_r.    (2)

Rotor Flux PI Controller:

i^{*}_{sd} = k_{pf} \varepsilon_f + k_{if} \int \varepsilon_f \, dt; \qquad \varepsilon_f = \psi^{*}_r - \psi_r.    (3)

Rotor Flux Estimator:

\psi_r = \sqrt{\psi_{r\alpha}^{2} + \psi_{r\beta}^{2}},    (4)

\cos\rho = \frac{\psi_{r\alpha}}{\psi_r}, \qquad \sin\rho = \frac{\psi_{r\beta}}{\psi_r},    (5)

with

\psi_{r\alpha} = \frac{L_r}{M} \left( \psi_{s\alpha} - \sigma L_s i_{s\alpha} \right), \qquad \psi_{r\beta} = \frac{L_r}{M} \left( \psi_{s\beta} - \sigma L_s i_{s\beta} \right),    (6)

\psi_{s\alpha} = \int \left( u_{s\alpha} - R_s i_{s\alpha} \right) dt, \qquad \psi_{s\beta} = \int \left( u_{s\beta} - R_s i_{s\beta} \right) dt,    (7)
Figure 6: Conceptual block diagram of the system.
and using the Clarke transformation

i_{s\alpha} = i_{sa}, \qquad i_{s\beta} = \frac{1}{\sqrt{3}} i_{sa} + \frac{2}{\sqrt{3}} i_{sb},    (8)

u_{s\alpha} = u_{sa}, \qquad u_{s\beta} = \frac{1}{\sqrt{3}} u_{sa} + \frac{2}{\sqrt{3}} u_{sb}.    (9)
Note that the sine and cosine of (5) amount to a division, and therefore do not have to be calculated directly.
Current PI Controller:

v_{sd} = k_{pi} \varepsilon_{isd} + k_{ii} \int \varepsilon_{isd} \, dt; \qquad \varepsilon_{isd} = i^{*}_{sd} - i_{sd},    (10)

v_{sq} = k_{pi} \varepsilon_{isq} + k_{ii} \int \varepsilon_{isq} \, dt; \qquad \varepsilon_{isq} = i^{*}_{sq} - i_{sq}.    (11)
Decoupling:

u_{sd} = \sigma L_s v_{sd} + D_d; \qquad u_{sq} = \sigma L_s v_{sq} + D_q,    (12)

with

D_d = -\omega \sigma L_s i_{sq} + \frac{M}{L_r} \frac{d}{dt} \psi_r, \qquad D_q = +\omega \sigma L_s i_{sd} + \frac{M}{L_r} \omega \psi_r.    (13)
Omega (ω) Estimator:

\omega = P_p \omega_r + \frac{M R_r}{\psi_r L_r} i_{sq}.    (14)
Park Transformation:

\begin{pmatrix} i_{sd} \\ i_{sq} \end{pmatrix} = \begin{pmatrix} \cos\rho & \sin\rho \\ -\sin\rho & \cos\rho \end{pmatrix} \begin{pmatrix} i_{s\alpha} \\ i_{s\beta} \end{pmatrix}.    (15)
Inverse Park Transformation:

\begin{pmatrix} u^{*}_{s\alpha} \\ u^{*}_{s\beta} \end{pmatrix} = \begin{pmatrix} \cos\rho & -\sin\rho \\ \sin\rho & \cos\rho \end{pmatrix} \begin{pmatrix} u_{sd} \\ u_{sq} \end{pmatrix}.    (16)
In the above equations, for x standing for any variable such as the voltage u_s, the current i_s, or the rotor flux ψ_r, we have the following.

(x*): input reference corresponding to x.
(ε_x): error signal corresponding to x.
(k_px, k_ix): proportional and integral parameters of the PI controller of x.
(x_a, x_b, x_c): a, b, and c three-phase components of x in the stationary reference frame.
(x_α, x_β): α and β two-phase components of x in the stationary reference frame.
(x_d, x_q): d and q components of x in the synchronously rotating frame.
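For readers who prefer code to formulas, the following floating-point C fragment mirrors the Clarke transform (8), the Park transform (15), and a discrete PI step of the form used in (2), (3), (10), and (11). It is only a minimal behavioural reference sketch added here for illustration; the actual controller is a fixed-point XSG/FPGA design, and the sampling period Ts is left as a parameter.

#include <math.h>

/* Clarke transform: phase currents ia, ib to stationary (alpha, beta), as in (8). */
static void clarke(double ia, double ib, double *i_alpha, double *i_beta)
{
    *i_alpha = ia;
    *i_beta  = ia / sqrt(3.0) + 2.0 * ib / sqrt(3.0);
}

/* Park transform: stationary (alpha, beta) to rotating (d, q), as in (15),
 * given cos(rho) and sin(rho) from the rotor-flux estimator of (5). */
static void park(double i_alpha, double i_beta, double c, double s,
                 double *id, double *iq)
{
    *id =  c * i_alpha + s * i_beta;
    *iq = -s * i_alpha + c * i_beta;
}

/* One discrete PI step: out = kp*err + ki*integral(err),
 * using simple rectangular integration with period Ts. */
typedef struct { double kp, ki, integ; } pi_ctrl;

static double pi_step(pi_ctrl *pi, double err, double Ts)
{
    pi->integ += err * Ts;
    return pi->kp * err + pi->ki * pi->integ;
}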
The RFOC scheme features vector transformations (Clarke and Park), four PI regulators, and a space-vector PWM generator (SVPWM). This algorithm is of interest for its good performance, and because it has a fair level of complexity which benefits from a very-high-performance FPGA implementation. In fact, FPGAs make it possible to execute the loop of a complicated control algorithm in a matter of a few microseconds. The first prototype of such a controller has been developed using the method and platform described here, and has been implemented entirely in FPGA logic [10].
Commonly used technologies prior to the advent of today's large FPGAs, including the use of DSPs alone and/or specialized microcontrollers, led to a total cycle time of more than
100 μs for vector control. This led to switching frequencies in the range of 1–5 kHz, which produced disturbing noise in the audible band. With today's FPGAs, it becomes possible to fit a very large control system on a single chip, and to support very high switching frequencies.

Figure 7: Experimental setup with power electronics, induction motor, and loads.
6.3. Validation of RFOC Using Cosimulation with XSG. A strong hardware/software cosimulation environment and methodology is necessary to allow validation of the hardware design against a theoretical control system model.
As mentioned in Section 3, the design flow which has been adopted in this research uses the XSG blockset in Matlab/Simulink. The XSG model of the RFOC block is built up from (2) to (16), and the global system architecture is shown in Figure 8, where Gateway-in and Gateway-out blocks provide the necessary interface between the fixed-point FPGA hardware, which includes the RFOC and Space Vector Pulse Width Modulation (SVPWM) algorithms, and the floating-point Simulink blocksets, mainly the SimPowerSystems (SPS) models. In fact, to make the simulations more realistic, the three-phase AC induction motor and the corresponding voltage source inverter were modelled in Simulink using the SPS blockset, which is robust and well proven. Note that SVPWM is a widely used technique for three-phase voltage-source inverters (VSI), and is well suited for AC induction motors.
At runtime, the hardware design (RFOC and SVPWM) is automatically downloaded into the actual FPGA device, and its response can then be verified in real time against that of the theoretical model simulation done with floating-point Simulink blocksets. An arbitrary load is induced by varying the load torque variable T_l as a function of time. SPS receives a reference voltage from the control through the inverse Park transformation module. This voltage consists of two quadrature voltages (u*_sα, u*_sβ), plus the angle (sine/cosine) of the voltage phasor u_sd corresponding to the rotor flux orientation (Figure 6).
6.4. Reducing Cosimulation Times. In a closed-loop setting, such as RFOC, hardware acceleration is only possible as long as the replaced block does not require a lot of steps for completion. If the XSG design requires more steps to process the data which is sent than what is necessary for the next data to be ready for processing, a costly (time-wise) adjustment has to be made. The Simulink period for a given simulated FPGA clock (one XSG design step) must be reduced, while the rest of the Simulink system runs at the same speed as before. In a fixed-step Simulink simulation environment, this means that the fixed step size must be reduced enough so that the XSG system has plenty of time to complete between two data acquisitions. Obviously, such lengthy simulations should only be launched once the debugging process is finished and the controller is ready to be thoroughly tested.
Once the control algorithm is designed with XSG, the HW/SW cosimulation procedure consists of the following.
(1) Building the interface between Simulink and the FPGA-based cosimulation board.
(2) Making a hardware cosimulation design.
(3) Executing hardware cosimulation.
When using the Simulink environment for cosimulation, one should distinguish between the single-step and free-running modes in order, for debugging purposes, to get much shorter simulation times.
Single-step cosimulation can improve simulation time when replacing one part of a bigger system. This is especially true when replacing blocks that cannot be natively accelerated by Simulink, like embedded Matlab functions. Replacing a block with an XSG cosimulated design shifts the burden from Matlab to the FPGA, and the block no longer remains the simulation's bottleneck.
Free-running cosimulation means that the FPGA will always be running at full speed. Simulink will no longer be dictating the speed of an XSG step as was the case in single-step cosimulation. With the Virtex-4 ML402 SX XtremeDSP Evaluation Platform, that step will now be a fixed 10 nanoseconds. Therefore, even a very complicated system requiring many steps for completion should have ample time to process its data before the rest of the Simulink system does its work. Nevertheless, a synchronization mechanism should always be used for linking the free-running cosimulation block with the rest of the design to ensure an exterior start signal will not be mistakenly interpreted as more than one start pulse. Table 3 shows the decrease of simulation time afforded by the free-running mode for the induction motor vector control. This has been implemented using XSG, with the motor and its SVPWM-based drive being modelled using the SPS blockset from Simulink. For the same precision and the same amount of data to be simulated (speed variations over a period of 7 seconds), a single-step approach would require 100.7 times longer to complete, thus being an ineffective approach. A more complete discussion of our methodology for rapid testing of an XSG-based controller using free-running cosimulation and SPS has been given in [11].
Figure 8: Induction motor RFOC drive, as modelled with the XSG and SPS blocksets.

Table 3: Simulation times and methods.

    Type of simulation          Simulation time
    Free-running cosimulation   1734 s
    Single-step cosimulation    174610 s (about 48 hours)

6.5. Timing Analysis. Before actually generating a BIT file to reconfigure the FPGA, and whether the cosimulation is done through JTAG or Ethernet, the design must be able to run at 100 MHz (a 10-nanosecond step time). As long as the
design is running inside Simulink, there are never any issues with meeting timing requirements for the XSG model. Once completed, the design will be synthesized and simulated on the FPGA. If the user launches the cosimulation block generation process, timing errors will only be reported quite far into the operation. This means that, after waiting for a relatively long delay (sometimes 20-30 minutes depending on the complexity of a design and the speed of the host computer), the user notices the failure to meet timing requirements with no extra information to quickly identify the problem. This is why the timing analysis tool must always be run prior to cosimulation. While it might seem a bit time-consuming, this tool will not simply tell you that your design does not meet requirements, but it will give you the insight required to fix the timing problems. Once the control algorithm has been fully designed, analysed (timing wise), and debugged through the aforementioned FPGA-in-the-loop simulation platform, the corresponding NGC binary netlist file or VHDL/Verilog code is automatically generated. These can then be integrated within the SoC architecture using Xilinx Platform Studio (XPS), targeting the AP1000 platform, as described in Section 4.
6.6. Experimental Setup. Figure 7 shows the experimental setup with power electronics, induction motor, and loads. The power supply is taken from a 220 V outlet. The high-voltage power module, from Microchip, is connected to the analog I/O card through the rainbow flex cable, and to the expansion digital I/Os of the AP1000 through another parallel cable. Signals from a 1000-line optical speed encoder are among the digital signals fed to the FPGA. As for the loads, there is both a manually controlled resistive load box, and a dynamo coupled to the motor shaft.
From the three motor phases, three currents and three voltages (all prefiltered and prescaled) are fed to the analog I/O board to be sampled. Samples are stored in an internal input buffer until fetched by the controller on the FPGA. Data
exchange between the FPGA and the I/O board proceeds
through the PLB and the Dual Processor PCI Bus Bridge to
and from the PMC site.
The process of generating SVPWM signals continuously
runs in parallel with controller logic, but the speed at which
these signals are generated is greater than the speed required
for the vector control processing. As a consequence, these
two processes are designed and tested separately before being
assembled and tested together.
Power gating and motor speed decoding are continuous
processes that have critical clocking constraints beyond the
capabilities of bus operation to and from the I/O board.
Therefore, even though the PMC66-16AISS8A04 board also
provides digital I/O, both the PWM gating signals and the
input pulses from the optical speed encoder are directly
passed through FPGA pins to be processed by dedicated
hardware logic. This is done by plugging a custom-made
adapter card with Samtec CON 0.8 mm connectors into the
expansion site on the AP1000. While the vector control uses
data acquired from the AIO card through a state machine,
the PWM signals are constantly fed to the power module
(Figure 6). Those signals are sent directly through the general
purpose digital outputs on the AP1000 itself instead of going
through the AIO card. This ensures complete control over
the speed at which these signals are generated and sent
while targeting a specific operating frequency (16 kHz in
our example). This way, the speed calculations required for
the vector control algorithm are done using precise clocking
without adding to the burden of the state machine which
dictates the communications between FPGA and the AIO
card. The number of transitions found on the signal lines between the FPGA and the speed encoder is used to evaluate the speed at which the motor is operating.
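As a simple illustration of this speed measurement, the C sketch below converts a transition count into a speed estimate for the 1000-line encoder mentioned above. The x4 quadrature decoding factor and the measurement window are assumptions made for the example, not values taken from the actual design.

/* Illustrative speed estimate from encoder transitions.
 * Assumes x4 quadrature decoding of a 1000-line encoder (4000
 * transitions per revolution); window_s is the counting window. */
static double motor_speed_rpm(unsigned int transitions, double window_s)
{
    const double transitions_per_rev = 4.0 * 1000.0;
    double rev_per_s = (double)transitions / transitions_per_rev / window_s;
    return rev_per_s * 60.0;
}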
6.7. Timing Issues. Completion of one loop cycle of our vector control design takes 122 steps, leading to a computation time of less than 1.5 μs. Note that for a sampling rate of 32 kHz, the SVPWM signal has 100 divisions (two zones divided by 50), which has been chosen as a good compromise between precision and simulation time. The simulation fixed-step size is then 625 nanoseconds, which is already small enough to hinder the performance of simulating the SPS model. Since PWM signal generation is divided into two zones, for every 50 steps of Simulink operations (PWM signal generation and SPS model simulation), the 122 vector control steps must complete. The period of the XSG model in the Simulink system must be adjusted in order for the XSG model to run 2.44 times faster than the other Simulink components. The simulation fixed-step size becomes 2.56 nanoseconds, thus prolonging simulation time. In other words, since the SPS model and PWM signal generation take little time (in terms of steps) to complete whereas the vector control scheme requires numerous steps, the coupling of the two forces the use of a very small simulation fixed-step size.
7. Conclusion
In this paper, we have discussed our choice, adaptation, and use of a rapid prototyping platform and design flow suitable for the design of on-chip motion controllers and other SoCs with a need for analog interfacing. It supports embedded application software coupled with custom FPGA logic and analog interfacing, and is very well suited to FPGA-in-the-loop control and SoC controller prototyping. Such a platform is suitable for academia and the research community that cannot afford the expensive commercial solutions for FPGA-in-the-loop simulation [12, 13].
A convenient FPGA design, simulation, and test procedure, suitable for advanced feedback controllers, has been outlined. It uses the Xilinx System Generator blockset in Matlab/Simulink and a simulated motor drive described with the SPS blockset. SoC integration of the resulting controller is done in Xilinx Platform Studio. Our custom SoC design has been described, with highlights on the state machine for bus interfacing, NGC file integration, BIN file generation, and FPGA configuration.
Application software and driver development for embedded Linux are often needed to provide for PCI and analog I/O card initialization, interfacing, and monitoring. We have provided here some pointers along with essential information not easily found elsewhere. The proposed design flow and prototyping platform have been applied to the analysis, design, and hardware implementation of a vector controller for three-phase AC induction motors, with very good performance results. The resulting computation times, of about 1.5 μs, can in fact be considered record-breaking for such a controller.
Acknowledgments
This research is funded by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC). CMC Microsystems provided development tools and support through the System-on-Chip Research Network (SOCRN) program.
References
[1] "Accelerating Canadian competitiveness through microsystems: strategic plan 2005–2010," Tech. Rep., CMC Microsystems, Kingston, Canada, 2004.
[2] J. J. Rodriguez-Andina, M. J. Moure, and M. D. Valdes, "Features, design tools, and application domains of FPGAs," IEEE Transactions on Industrial Electronics, vol. 54, no. 4, pp. 1810–1823, 2007.
[3] R. Dubey, P. Agarwal, and M. K. Vasantha, "Programmable logic devices for motion control – a review," IEEE Transactions on Industrial Electronics, vol. 54, no. 1, pp. 559–566, 2007.
[4] E. Monmasson and M. N. Cirstea, "FPGA design methodology for industrial control systems – a review," IEEE Transactions on Industrial Electronics, vol. 54, no. 4, pp. 1824–1842, 2007.
[5] D. Zhang, A stochastic approach to digital control design and implementation in power electronics, Ph.D. thesis, Florida State University College of Engineering, Tallahassee, Fla, USA, 2006.
[6] Y.-Y. Tzou and H.-J. Hsu, "FPGA realization of space-vector PWM control IC for three-phase PWM inverters," IEEE Transactions on Power Electronics, vol. 12, no. 6, pp. 953–963, 1997.
[7] A. de Castro, P. Zumel, O. García, T. Riesgo, and J. Uceda, "Concurrent and simple digital controller of an AC/DC converter with power factor correction based on an FPGA," IEEE Transactions on Power Electronics, vol. 18, no. 1, part 2, pp. 334–343, 2003.
[8] "Developing device drivers for Linux Kernel 1.4," Tech. Rep., CMC Microsystems, Kingston, Canada, 2006.
[9] B. K. Bose, Power Electronics and Variable-Frequency Drives: Technology and Applications, IEEE Press, New York, NY, USA, 1996.
[10] J.-G. Mailloux, Prototypage rapide de la commande vectorielle sur FPGA à l'aide des outils Simulink/System Generator, M.S. thesis, Université du Québec à Chicoutimi, Québec, Canada, January 2008.
[11] J.-G. Mailloux, S. Simard, and R. Beguenane, "Rapid testing of XSG-based induction motor vector controller using free-running hardware co-simulation and SimPowerSystems," in Proceedings of the 5th International Conference on Computing, Communications and Control Technologies (CCCT '07), Orlando, Fla, USA, July 2007.
[12] C. Dufour, S. Abourida, J. Bélanger, and V. Lapointe, "Real-time simulation of permanent magnet motor drive on FPGA chip for high-bandwidth controller tests and validation," in Proceedings of the 32nd Annual Conference on IEEE Industrial Electronics (IECON '06), pp. 4581–4586, Paris, France, November 2006.
[13] National Instruments, "Creating Custom Motion Control and Drive Electronics with an FPGA-based COTS System," 2006.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 727965, 14 pages
doi:10.1155/2009/727965
Research Article
Parallel Backprojection: A Case Study in High-Performance
Reconfigurable Computing
Ben Cordes and Miriam Leeser
Department of Electrical and Computer Engineering, Northeastern University, Boston, MA 02115, USA
Correspondence should be addressed to Miriam Leeser, mel@coe.neu.edu
Received 22 June 2008; Accepted 18 December 2008
Recommended by Vinay Sriram
High-performance reconfigurable computing (HPRC) is a novel approach to provide large-scale computing power to modern scientific applications. Using both general-purpose processors and FPGAs allows application designers to exploit fine-grained and coarse-grained parallelism, achieving high degrees of speedup. One scientific application that benefits from this technique is backprojection, an image formation algorithm that can be used as part of a synthetic aperture radar (SAR) processing system. We present an implementation of backprojection for SAR on an HPRC system. Using simulated data taken at a variety of ranges, our implementation runs over 200 times faster than a similar software program, with an overall application speedup better than 50x. The backprojection application is easily parallelizable, achieving near-linear speedup when run on multiple nodes of a clustered HPRC system. The results presented can be applied to other systems and other algorithms with similar characteristics.
Copyright © 2009 B. Cordes and M. Leeser. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
In the continuing quest for computing architectures that are
capable of solving more computationally complex problems,
a new direction of study is high-performance reconfigurable
computing (HPRC). HPRC can be defined as a marriage of
traditional high-performance computing (HPC) techniques
and reconfigurable computing (RC) devices.
HPC is a well-known set of architectural solutions
for speeding up the computation of problems that can
be divided neatly into pieces. Multiple general-purpose
processors (GPPs) are linked together with high-speed
networks and storage devices such that they can share data.
Pieces of the problem are then distributed to the individual
processors and computed, and the answer is assembled
from the pieces. Commonly available HPC systems include
Beowulf clusters and other supercomputers. Reconfigurable
computing uses many of the same concepts as HPC, but at a
finer grain. A special-purpose processor (SPP), often a field-
programmable gate array (FPGA), is attached to a GPP and
programmed to execute a useful function. Special-purpose
hardware computes the answer to the problem quickly by
exploiting hardware design techniques such as pipelining, the
replication of small computation units, and high-bandwidth
local memories.
Both of these computing architectures reduce compu-
tation time by exploiting the parallelism inherent in the
application. They rely on the fact that multiple parts of the
overall problem can be computed relatively independently
of each other. Though HPC and RC act on different levels
of parallelism, in general, applications with a high degree of
parallelism are well-suited to these architectures.
The idea behind HPRC is to provide a computing
architecture that takes advantage of both the coarse-grained
parallelism exploited by clustered HPC systems and the fine-
grained parallelism exploited by RC systems. In theory, more
exploited parallelism means more speedup and faster com-
putation times. In reality, factors such as communications
bandwidth may prevent performance from improving as
much as is desired.
In this paper, we examine one application that contains
a very high degree of parallelism. The backprojection image
formation algorithm for synthetic aperture radar (SAR)
systems is "embarrassingly parallel," meaning that it can
be broken down and parallelized on many different levels.
For this reason, we chose to implement backprojection on
2 EURASIP Journal on Embedded Systems
an HPRC machine at the Air Force Research Laboratory
in Rome, NY, USA, as part of an SAR processing system.
We present an analysis of the algorithm and its inherent
parallelism, and we describe the implementation process
along with the design decisions that went into the solution.
Our contributions are as follows.
(i) We implement the backprojection algorithm for
SAR on an FPGA. Though backprojection has been
implemented many times in the past (see Section 3),
FPGA implementations of backprojection for SAR
are not well represented in the literature.
(ii) We further parallelize this implementation by devel-
oping an HPC application that produces large SAR
images on a multinode HPRC system.
The rest of this paper is organized as follows. Section 2
provides some background information on the backpro-
jection algorithm and the HPRC system on which we
implemented it. In Section 3, we discuss related research.
Section 4 describes the backprojection implementation and
how it fits into the overall backprojection application. The
performance data and results of our design experiments are
analyzed in Section 5. Finally, Section 6 draws conclusions
and suggests future directions for research.
Readers who are interested in more detail about this work
are directed to the master's thesis on which it is based [1].
2. Background
This section provides supporting information that is useful
to understanding the application presented in Section 4.
Section 2.1 describes backprojection and SAR, highlighting
the mathematical function that we implemented in hard-
ware. Section 2.2 presents details about the HPRC system
that hosts our application.
2.1. Backprojection Algorithm. We briefly describe the back-
projection algorithm in this section. Further details on the
radar processing and signal processing aspects of this process
can be found in [2, 3].
Backprojection is an image reconstruction algorithm that
is used in a number of applications, including medical imag-
ing (computed axial tomography, or CAT) and synthetic
aperture radar (SAR). The implementation we describe is
used in an SAR application. For both radar processing
and medical imaging applications, backprojection provides
a method for reconstructing an image from the data that are
collected by the transceiver.
SAR data are essentially a series of time-indexed radar
reflections observed by a single transceiver. At each step
along the synthetic aperture, or flight path, a pulse is emitted
from the source. This pulse reflects off elements in the scene
to varying degrees, and is received by the transceiver. The
observed response to a radar pulse is known as a trace.
SAR data can be collected in one of two modes,
strip-map or spotlight. These modes describe the motion
of the radar relative to the area being imaged. In the
spotlight mode of SAR, the radar circles around the scene.
Our application implements the strip-map mode of SAR, in
which the radar travels along a straight and level path.
Regardless of mode, given a known speed at which the
radar pulse travels, the information from the series of time-
indexed reflections can be used to identify points of high
reflectivity in the target area. By processing multiple traces
instead of just one, a larger radar aperture is synthesized and
thus a higher-resolution image can be formed.
The backprojection image formation algorithm has two
parts. First, the radar traces are filtered according to a linear
time-invariant system. This filter accounts for the fact that
the airplane on which the radar dish is situated does not
fly in a perfectly level and perfectly straight path. Second,
after filtering, the traces are smeared across an image plane
along contours that are defined by the SAR mode; in our case,
the flight path of the plane carrying the radar. Coherently
summing each of the projected images provides the final
reconstructed version of the scene.
Backprojection is a highly effective method of processing
SAR images. It is computationally complex, much like tradi-
tional Fourier-based image formation techniques. However,
backprojection contains a high degree of parallelism, which
makes it suitable for implementation on reconfigurable
devices.
The operation of backprojection takes the form of a
mapping from projection data p(t, u) to an image f (x, y). A
single pixel of the image f corresponds to an area of ground
containing some number of objects which reflect radar to a
certain degree. Mathematically, this relationship is written as
f(x, y) = ∫ p(i(x, y, u), u) du, (1)
where i(x, y, u) is an indexing function indicating, at a given
u, those t that play a role in the value of the image at location
(x, y). For the case of SAR imaging, the projection data
p(t, u) take the form of the filtered radar traces described
above. Thus, the variable u corresponds to the slow-time
location of the radar, and t is the fast-time index into that
projection. Fast-time variables are related to the speed of
radar propagation (i.e., the speed of light), while slow-time
variables are related to the speed of the airplane carrying the
radar. The indexing function i takes the following form for
SAR:
i(x, y, u) = ω(y − x tan θ, y + x tan θ) · √(x² + (y − u)²) / c, (2)
where c is the speed of light, θ the beamwidth of the radar,
and ω(a, b) is equal to 1 for a ≤ u ≤ b and 0 otherwise. x and
y describe the two-dimensional offset between the radar and
the physical spot on the ground corresponding to a pixel, and
can thus be used in a simple distance calculation as seen in
the right-hand side of (2).
In terms of implementation, we work with a discretized
form of (1) in which the integral is approximated as a
Riemann sum over a finite collection of projections u_k,
k ∈ {1, ..., K}, and is evaluated at the centers of image pixels
(x_i, y_j), i ∈ {1, ..., N}, j ∈ {1, ..., K}. Because the evaluation
of the index function at these discrete points will generally
EURASIP Journal on Embedded Systems 3
not result in a value of t which is exactly at a sample location,
interpolation could be performed to increase accuracy.
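To make the discretization concrete, the per-pixel computation can be sketched in C. This is a minimal software model under our own simplifying assumptions (normalized units with Δt = 1 so that the computed distance is itself a fast-time index, nearest-sample rounding in place of interpolation, and illustrative array sizes); it is not the authors' implementation.

#include <math.h>

#define K 512                      /* number of projections (slow-time)  */
#define T 2048                     /* fast-time samples per projection   */

/* Accumulate one pixel f(x, y) from all projections, following (1)-(2). */
void backproject_pixel(double x, double y,
                       const double p_re[K][T], const double p_im[K][T],
                       double du,          /* slow-time sample spacing    */
                       double tan_theta,   /* tangent of the beamwidth    */
                       double *f_re, double *f_im)
{
    *f_re = 0.0;
    *f_im = 0.0;
    for (int k = 0; k < K; k++) {
        double u = k * du;                         /* radar position      */
        /* omega(a, b): this projection contributes only if the pixel
           falls inside the radar beam, |y - u| <= x * tan(theta)         */
        if (fabs(y - u) > x * tan_theta)
            continue;
        /* index function i(x, y, u): distance from radar to pixel,
           already a fast-time index because dt is normalized to 1        */
        long t = lround(sqrt(x * x + (y - u) * (y - u)));
        if (t >= 0 && t < T) {                     /* nearest sample      */
            *f_re += p_re[k][(int)t];
            *f_im += p_im[k][(int)t];
        }
    }
}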
2.2. HPRC Architecture. This project aimed at exploiting the
full range of resources available on the heterogeneous high-
performance cluster (HHPC) at the Air Force Research Lab-
oratory in Rome, NY, USA [4]. Built by HPTi [5], the HHPC
features a Beowulf cluster of 48 heterogeneous computing
nodes, where each node consists of a dual 2.2 GHz Xeon PC
running Linux and an Annapolis Microsystems WildStar II
FPGA board.
The WildStar II features two VirtexII FPGAs and con-
nects to the host Xeon general-purpose processors (GPPs) via
the PCI bus. Each FPGA has access to 6 MB of SRAM, divided
into 6 banks of 1 MB each, and a single 64 MB SDRAM bank.
The Annapolis API supports a master-slave paradigm for
control and data transfer between the GPPs and the FPGAs.
Applications for the FPGA can be designed either through
traditional HDL-based design and synthesis tools, as we have
done here, or by using Annapolis's CoreFire [6] module-
based design suite.
The nodes of the HHPC are linked together in three
ways. The PCs are directly connected via gigabit Ethernet
as well as through Myrinet MPI cards. The WildStar II
boards are also directly connected to each other through
a low-voltage differential signaling (LVDS) I/O daughter
card, which provides a systolic interface over which each
FPGA board may talk to its nearest neighbor in a ring.
Communication over Ethernet is supplied by the standard
C library under Linux. Communication over Myrinet is
achieved with an installation of the MPI message-passing
standard, though MPI can also be directed to use Ethernet
instead. Communicating through the LVDS interconnect
involves writing communication modules for the FPGA
manually. In this project, we relied on Myrinet to move
data between nodes. This architecture represents perhaps the
most direct method for adding recongurable resources to
a supercomputing cluster. Each node architecture is similar
to that of a single-node recongurable computing solution.
Networking hardware which interfaces well to the Linux
PCs is included to create the cluster network. The ability
to communicate between FPGAs is included but remains
difficult for the developer to employ. Other HPRC platforms,
such as those developed by Cray and SRC, may employ
different interconnection methods, programming methods,
and communication paradigms.
3. Related Work
Backprojection itself is a well-studied algorithm. Most
researchers have focused on implementing backprojection
for computed tomography (CT) medical imaging applica-
tions; backprojection for synthetic aperture radar (SAR) on
FPGAs is not well-represented in the literature.
The precursor to this work is that of Coric et al. [7].
Backprojection for CT uses the spotlight mode of imaging,
in which the sensing array is rotated around the target
area. (Contrast this with the strip-map mode described in
Section 2.1.) Other implementations of backprojection for
CT on FPGAs have been published [8].
CT backprojection has also been implemented on several
other computing devices, including GPUs [9] and the Cell
Broadband Engine [10]. Results are generally similar (within
a factor of 2) to those achieved on FPGAs.
Of the implementations of backprojection for SAR,
almost none has been designed for FPGAs. Soumekh et al.
have published on implementations of SAR in general and
backprojection in particular [11], as well as the Soumekh
reference book on the subject [2], but they do not examine
the use of FPGAs for computation. Some recent work on
backprojection for SAR on parallel hardware has come from
Halmstad University in Sweden [12, 13]; their publications
lay important groundwork but have not been implemented
except in software and/or simulation.
Backprojection is not the only application that has been
mapped to HPRC platforms, though signal processing is
traditionally a strength of RC, and so large and complex signal
processing applications like backprojection are common.
With the emergence of HPRC, scientific applications are
also seeing significant research effort. Among these are
such applications as hyperspectral dimensionality reduction
[14], molecular dynamics [15, 16], and cellular automata
simulations [17].
Another direction of HPRC research has been the devel-
opment of libraries of small kernels that are useful as building
blocks for larger applications. The Vforce framework [18]
allows for portable programming of RC systems using a
library of kernels. Other developments include libraries of
floating-point arithmetic units [19], work on FFTs [20], and
linear algebra kernels such as BLAS [21, 22].
Several survey papers [23, 24] address the trends that
can be found among the reported results. The transfer
of data between GPP and FPGA can significantly impact
performance. The ability to determine and control the
memory access patterns of the FPGA and the on-board
memories is critical. Finally, sacrificing the accuracy of the
results in favor of using lighter-weight operations that can be
more easily implemented on an FPGA can be an effective way
of increasing performance.
4. Experimental Design
In this section, we describe an implementation of the
backprojection image formation algorithm on a high-
performance reconfigurable computer. Our implementation
has been designed to provide high-speed image forma-
tion services and support output data distribution via a
publish/subscribe [25] methodology. Section 4.1 describes
the system on which our implementation runs. Section 4.2
explores the inherent parallelism in backprojection and
describes the high-level design decisions that steered the
implementation. Section 4.3 describes the portion of the
implementation that runs in software, and Section 4.4
describes the hardware.
Figure 1: Block diagram of Swathbuckler system. (Adapted from [26].)
4.1. System Background. In Section 2.2, we described the
HHPC system. In this section, we will explore more deeply
the aspects of that system that are relevant to our experimen-
tal design.
4.1.1. HHPC Features. Several features of the Annapolis
WildStar II FPGA boards are directly relevant to the design of
our backprojection implementation. In particular, the host-
to-FPGA interface, the on-board memory bandwidth, and
the available features of the FPGA itself guided our design
decisions.
Communication between the host GPP and the WildStar
II board is over a PCI bus. The HHPC provides a PCI bus
that runs at 66 MHz with 64-bit datawords. The WildStar II
on-board PCI interface translates this into a 32-bit interface
running at 133 MHz. By implementing the DMA data
transfer mode to communicate between the GPP and the
FPGA, the on-board PCI interface performs this translation
invisibly and without significant loss of performance. A
133 MHz clock is also a good and achievable clock rate for
FPGA hardware, so most of the hardware design can be run
directly off the PCI interface clock. This simplifies the design
since there are fewer clock domains (see Section 4.4.1).
The WildStar II board has six on-board SRAM memories
(1 MB each) and one SDRAM memory (64 MB). It is
beneficial to be able to read one datum and write one datum
in the same clock cycle, so we prefer to use multiple SRAMs
instead of the single larger SDRAM. The SRAMs run at
50 MHz and feature a 32-bit dataword (plus four parity bits),
but they use a DDR interface. The Annapolis controller for
the SRAM translates this into a 50 MHz 72-bit interface. Both
features are separately important: we will need to cross from
the 50 MHz memory clock domain to the 133 MHz PCI clock
domain, and we will need to choose the size of our data
such that they can be packed into a 72-bit memory word (see
Section 4.2.4).
Finally, the Virtex2 6000 FPGA on the Wildstar II has
some useful features that we use to our advantage. A large
amount of on-chip memory is available in the form of
BlockRAMs, which are configurable in width and depth but
can hold at most 2 KB of data each. One hundred forty-four
of these dual-ported memories are available, each of which
can be accessed independently. This makes BlockRAMs a
good candidate for storing and accessing input projection
data (see Sections 4.2.4 and 4.4.3). BlockRAMs can also be
configured as FIFOs, and due to their dual-ported nature,
can be used to cross clock domains.
4.1.2. Swathbuckler Project. This project was designed to
fit in as part of the Swathbuckler project [26–28], an
implementation of synthetic aperture radar created by a
joint program between the American, British, Canadian, and
Australian defense research project agencies. It encompasses
the entire SAR process including the aircraft and radar dish,
signal capture and analog-to-digital conversion, filtering,
and image formation hardware and software.
Our problem as posed was to increase the processing
capabilities of the HHPC by increasing the performance of
the portions of the application seen on the right-hand side of
Figure 1. Given that a significant amount of work had gone
into tuning the performance of the software implementation
of the filtering process [26], it remained for us to improve
the speed at which images could be formed. According to
the project specification, the input data are streamed into the
microprocessor main memory. In order to perform image
formation on the FPGA, it is then necessary to copy data
from the host to the FPGA. Likewise, the output image must
be copied from the FPGA memory to the host memory
so that it can be made accessible to the publish/subscribe
software. These data transfer times are included in our
performance measurements (see Section 5).
4.2. Algorithm Analysis. In this section, we dissect the
backprojection algorithm with an eye toward implementing
it on an HPRC machine. There are many factors that
need to be taken into account when designing an HPRC
application. First and foremost, an application that does
not have a high degree of parallelism is generally not
a good candidate. Given a suitable application, we then
decide how to divide the problem along the available levels
of parallelism in order to determine what part of the
application will be executed on each available processor.
This includes GPP/FPGA assignment as well as dividing
the problem across the multiple nodes of the cluster. For
the portions of the application run on the FPGAs, data
arrays must be distributed among the accessible memories.
Next, we look at some factors to improve the performance
of the hardware implementation, namely, data formats and
computation strength reduction. We conclude by examining
the parameters of the data collection process that affect the
computation.
4.2.1. Parallelism Analysis. In any reconfigurable applica-
tion design, performance gains due to implementation in
hardware inevitably come from the ability of reconfigurable
hardware (and, indeed, hardware in general) to perform
multiple operations at once. Extracting the parallelism
in an application is thus critical to a high-performance
implementation.
Equation (1) shows the backprojection operation in
terms of projection data p(t, u) and an output image f (x, y).
That equation may be interpreted to say that for a particular
pixel f(x̂, ŷ), the final value can be found from a summation
of contributions from the set of all projections p(t, u) whose
corresponding radar pulse covered that ground location.
The value of t for a given u is determined by the mapping
function i(x, y, u) according to (2). There is a large degree of
parallelism inherent in this interpretation.
(1) The contribution from projection p(û) to pixel
f(x̂, ŷ) is not dependent on the contributions from
all other projections p(u), u ≠ û, to that same pixel
f(x̂, ŷ).
(2) The contribution from projection p(û) to pixel
f(x̂, ŷ) is not dependent on the contribution from
p(û) to all other pixels f(x, y), x ≠ x̂, y ≠ ŷ.
(3) The final value of a pixel is not dependent on the
value of any other pixel in the target image.
It can be said, therefore, that backprojection is an
"embarrassingly parallel" application, which is to say that
it lacks any data dependencies. Without data dependencies,
the opportunity for parallelism is vast and it is simply a
matter of choosing the dimensions along which to divide
the computation that best matches the system on which the
algorithm will be implemented.
4.2.2. Dividing the Problem. There are two ways in which
parallel applications are generally divided across the nodes
of a cluster.
(1) Split the data. In this case, each node performs the
same computation as every other node, but on a
subset of data. There may be several different ways
that the data can be divided.
(2) Split the computation. In this case, each node per-
forms a portion of the computation on the entire
dataset. Intermediate sets of data ow from one node
to the next. This method is also known as task-
parallel or systolic computing.
While certain supercomputer networks may make the
task-parallel model attractive, our work with the HHPC indi-
cates that its architecture is more suited to the data-parallel
mode. Since internode communication is accomplished over
a many-to-many network (Ethernet or Myrinet), passing
data from one node to the next as implied by the task-
parallel model will potentially hurt performance. A task-
parallel design also implies that a new FPGA design must be
created for each FPGA node in the system, greatly increasing
design and verification time. Finally, the number of tasks
available in this application is relatively small and would not
occupy the number of nodes that are available to us.

Figure 2: Division of target image across multiple nodes.
Given that we will create a data-parallel design, there
are several axes along which we considered splitting the
data. One method involves dividing the input projection
data p(t, u) among the nodes along the u dimension. Each
node would hold a portion of the projections p(t, [u
i
, u
j
])
and calculate that portion contribution to the nal image.
However, this implies that each node must hold a copy of
the entire target image in memory, and furthermore, that all
of the partial target images would need to be added together
after processing before the nal image could be created. This
extra processing step would also require a large amount of
data to pass between nodes. In addition, the size of the nal
image would be limited to that which would t on a single
FPGA board.
Rather than dividing the input data, the preferred
method divides the output image f (x, y) into pieces along
the range (x) axis (see Figure 2). In theory, this requires that
every projection be sent to each node; however, since only
a portion of each projection will affect the slice of the final
image being computed on a single node, only that portion
must be sent to that node. Thus, the amount of input data
being sent to each node can be reduced to p([t_i, t_j], u). We
refer to the portion of the final target image being computed
on a single node, f([x_i, x_j], y), as a subimage.
Figure 2 shows that t_j is slightly beyond the time index
that corresponds to x_j. This is due to the width of the cone-
shaped radar beam. The dotted line in the figure shows a
single radar pulse taken at slow-time index y = u. The
minimum distance to any part of the subimage is at the
point (x_i, u), which corresponds to fast-time index t_i in the
projection data. The maximum distance to any part of the
subimage, however, is along the outer edge of the cone to
the point (x_j, u + w), where w is a factor calculated from
the beamwidth angle of the radar and x_j. Thus, the fast-time
index t_j is calculated relative to x_j and w instead of simply
x_j. This also implies that the [t_i, t_j] range for two adjacent
nodes will overlap somewhat, or (equivalently) that some
projection data will be sent to more than one node.
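The fast-time window that a node must receive can be computed as sketched below. This is an illustrative C fragment in normalized units (Δt = 1); the exact form of the beamwidth overhang w is our assumption (w = x_j · tan θ), since the text only states that w depends on the beamwidth and x_j.

#include <math.h>

/* Fast-time range [t_i, t_j] of projection samples needed for a subimage
 * whose range extent is [x_i, x_j] (normalized so that dt = 1). */
void node_time_window(double x_i, double x_j, double tan_theta,
                      int *t_i, int *t_j)
{
    double w = x_j * tan_theta;                 /* assumed beam overhang   */
    *t_i = (int)floor(x_i);                     /* nearest point: (x_i, u) */
    *t_j = (int)ceil(sqrt(x_j * x_j + w * w));  /* farthest: (x_j, u + w)  */
}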
Since the final value of a pixel does not depend on the
values of the pixels surrounding it, each FPGA needs to hold
only the subimage that it is responsible for computing. That
portion is not affected by the results on any other FPGA,
which means that the postprocessing accumulation stage can
be avoided. If a larger target image is desired, subimages can
be stitched together simply by concatenation.
In contrast to the method where input data are divided
along the u dimension, the size of the final target image is
not restricted by the amount of memory on a single node,
and furthermore, larger images can be processed by adding
nodes to the cluster. This is commonly referred to as coarse-
grained parallelism, since the problem has been divided into
large-scale independent units. Coarse-grained parallelism is
directly related to the performance gains that are achieved by
adapting the application from a single-node computer to a
multinode cluster.
4.2.3. Memory Allocation. The memory devices used to
store the input and output data on the FPGA board may
now be determined. We need to store two large arrays
of information: the target image f (x, y) and the input
projection data p(t, u). On the Wildstar II board, there are
three options: an on-board DRAM, six on-board SRAMs,
and a variable number of BlockRAMs which reside inside
the FPGA and can be instantiated as needed. The on-board
DRAM has the highest capacity (64 MB) but is the most
difficult to use and only has one read/write port. BlockRAMs
are the most flexible (two read/write ports and a flexible
geometry) and simple to use, but have a small (2 KB)
capacity.
For the target image, we would like to be able to both read
and write one target pixel per cycle. It is also important that
the size of the target image stored on one node be as large as
possible, so memories with larger capacity are better. Thus,
we will use multiple on-board SRAMs to store the target
image. By implementing a two-memory storage system, we
can provide two logical ports into the target image array.
During any given processing step, one SRAM acts as the
source for target pixels, and the other acts as the destination
for the newly computed pixel values. When the next set of
projections is sent to the FPGA, the roles of the two SRAMs
are reversed.
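In software terms, the two-SRAM scheme behaves like a ping-pong buffer. The following C sketch (with illustrative names and a stubbed processing step) shows the role swap; the real design reads and writes the two 1 MB SRAM banks in hardware.

#include <stddef.h>

#define PIXELS (1024 * 512)

typedef struct { int re, im; } cpx;

/* Stub: in hardware, this step adds the contributions of one batch of
 * projections to every pixel read from src and writes the result to dst. */
static void process_step(const cpx *src, cpx *dst, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

void run_all_steps(cpx *bank_a, cpx *bank_b, int num_steps)
{
    cpx *src = bank_a, *dst = bank_b;
    for (int s = 0; s < num_steps; s++) {
        process_step(src, dst, PIXELS);
        /* swap roles: this step's destination is the next step's source */
        cpx *tmp = src; src = dst; dst = tmp;
    }
}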
Owing to the 1 MB size of the SRAMs in which we store
the target image data, we are able to save 2^19 pixels. We choose
to arrange this into a target image that is 1024 pixels in the
azimuth dimension and 512 in the range dimension. Using
power-of-two dimensions allows us to maximize our use of
the SRAM, and keeping the range dimension small allows
us to reduce the amount of projection data that must be
transferred.
For the projection data, we would like to have many small
memories that can each feed one of the projection adder
units. BlockRAMs allow us to instantiate multiple small
memories in which to hold the projection data; each memory
has two available ports, meaning that two adders can be
supported in parallel. Each adder reads from one SRAM and
writes to another; since we can support two adders, we could
potentially use four SRAMs.
4.2.4. Data Formats. Backprojection is generally accom-
plished in software using a complex (i.e., real and imaginary
parts) floating-point format. However, since the result of this
application is an image which requires only values from 0
to 255 (i.e., 8-bit integers), the loss of precision inherent
in transforming the data to a fixed-point/integer format is
negligible. In addition, using an integer data format allows
for much simpler functional units.
Given an integer data format, it remains to determine
how wide the various datawords should be. We base our
decision on the word width of the memories. The SRAM
interface provides 72 bits of data per cycle, comprised of
two physical 32-bit datawords plus four bits of parity each.
The BlockRAMs are configurable, but generally can provide
power-of-two sized datawords.
Since backprojection is in essence an accumulation oper-
ation, it makes sense for the output data (target image pixels)
to be wider than the input data (projection samples). This
reduces the likelihood of overflow error in the accumulation.
We therefore use 36-bit complex integers (18-bit real and
18-bit imaginary) for the target image, and 32-bit complex
integers for the projection data.
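A pixel in this format can be modeled in C as two 18-bit two's-complement fields packed into a 36-bit word; two such pixels fill one 72-bit SRAM word. The field layout below is our own assumption for illustration.

#include <stdint.h>

#define MASK18 ((1u << 18) - 1)

/* Pack an 18-bit real and 18-bit imaginary part into a 36-bit word. */
uint64_t pack_pixel(int32_t re, int32_t im)
{
    return ((uint64_t)((uint32_t)im & MASK18) << 18) |
            (uint64_t)((uint32_t)re & MASK18);
}

/* Sign-extend an 18-bit field back to a 32-bit integer. */
static int32_t sext18(uint32_t v)
{
    return (int32_t)v - ((v & 0x20000u) ? 0x40000 : 0);
}

void unpack_pixel(uint64_t w, int32_t *re, int32_t *im)
{
    *re = sext18((uint32_t)(w & MASK18));
    *im = sext18((uint32_t)((w >> 18) & MASK18));
}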
After backprojection, a complex magnitude operator is
needed to reduce the 36-bit complex integers to a single 18-
bit real integer. This operator is implemented in hardware,
but the process of scaling data from 18-bit integer to 8-bit
image is left to the software running on the GPP.
4.2.5. Computation Analysis. The computation to be per-
formed on each node consists of three parts. The summation
from (1) and the distance calculation from (2) represent the
backprojection work to be done. The complex magnitude
operation is similar to the distance calculation.
While adders are simple to replicate in large numbers,
the hardware required to perform multiplication and square
root is more costly. If we were using floating-point data
formats, the number of functional units that could be
instantiated would be very small, reducing the parallelism
that we can exploit. With integer data types, however, these
units are relatively small, fast, and easily pipelined. This
allows us to maintain a high clock rate and one-result-per-
cycle throughput.
4.2.6. Data Collection Parameters. The conditions under
which the projection data are collected affect certain aspects
of the backprojection computation. In particular, the spacing
between samples in the p(t, u) array and the spacing between
pixels in the f (x, y) array imply constant factors that must be
accounted for during the distance-to-time index calculation
(see Section 4.4.3).
For the input data, Δu indicates the distance (in meters)
between samples in the azimuth dimension. This is equiv-
alent to the distance that the plane travels between each
outgoing pulse of the radar. Often, due to imperfect flight
paths, this value is not regular. The data filtering that occurs
prior to backprojection image formation is responsible for
correcting for inaccuracies due to the actual flight path, so
that a regular spacing can be assumed.
As the reflected radar data are observed by the radar
receiver, they are sampled at a particular frequency ω. That
frequency translates to a range distance Δt between samples
equal to c/2ω, where c is the speed of light. The additional
factor of 1/2 accounts for the fact that the radar pulse travels
the intervening distance, is reflected, and travels the same
distance back. Owing to the fact that the airplane is not flying
at ground level, there is an additional angle of elevation that
is included to determine a more accurate value for Δt.
For the target image (output data), Δx and Δy simply
correspond to the real distance between pixels in the range
and azimuth dimensions, accordingly. In general, Δx and
Δy are not necessarily related to Δu and Δt, and can be
chosen at will. In practice, setting Δy = Δu makes the
algorithm computation more regular (and thus more easily
parallelizable). Likewise, setting Δx = Δt reduces the need
for interpolation between samples in the t dimension since
most samples will line up with pixels in the range dimension.
Finally, setting Δx = Δy provides for square pixels and an
easier-to-read aspect ratio in the output image.
The final important parameter is the minimum range
from the radar to the target image, known as R_min. This is
related to the t_i parameter, and is used by the software to
determine what portion of the projection data is applicable
to a particular node.
4.3. Software Design. We now describe the HPRC imple-
mentation of backprojection. As with most FPGA-based
applications, the work that makes up the application is
divided between the host GPP and the FPGA. In this section,
we will discuss the work done on the GPP; in Section 4.4, we
continue with the hardware implemented on the FPGA.
The main executable running on the GPP begins by
using the MPI library to spawn processes on several of the
HHPC nodes. Once all MPI jobs have started, the host code
configures the FPGA with the current values of the flight
parameters from Section 4.2.6. In particular, the values of
Δx, Δy, and R_min (the minimum range) are sent to the
FPGA. However, in order to avoid the use of fractional
numbers, all of these parameters are normalized such that
Δt = 1. This allows the hardware computation to be in
terms of fast-time indices in the t domain instead of ground
distances.
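A sketch of this normalization in C is shown below, using the relation Δt = c/(2ω) from Section 4.2.6 (the elevation-angle correction mentioned there is omitted). The struct and variable names are illustrative assumptions, not the authors' code.

typedef struct {
    double dx, dy;      /* pixel spacing in range and azimuth (meters)   */
    double r_min;       /* minimum range from radar to subimage (meters) */
} flight_params;

/* Divide ground distances by dt = c / (2 * omega) so that the hardware
 * works directly in integer fast-time indices (i.e., dt = 1). */
flight_params normalize_params(flight_params p, double omega /* sample rate, Hz */)
{
    const double c = 299792458.0;      /* speed of light, m/s                 */
    double dt = c / (2.0 * omega);     /* range distance per fast-time sample */
    p.dx    /= dt;
    p.dy    /= dt;
    p.r_min /= dt;
    return p;
}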
Next, the radar data is read. In the Swathbuckler system,
this input data would be streamed directly into memory and
no separate read step would be required. Since we are not
able to integrate directly with Swathbuckler, our host code
reads the data from a file on the shared disk. These data are
translated from complex floating-point format to integers.
The host code also determines the appropriate range of t that
is relevant to the subimage being calculated by this node (see
Section 4.2.2).
The host code then loops over the u domain of the
projection data. A chunk of the data is sent to the FPGA
and processed. The host code waits until the FPGA signals
processing is complete, and then transmits the next chunk
of data. When all projection data have been processed, the
host code requests that the final target image be sent from the
FPGA. The pixels of the target image are scaled, rearranged
into an image buffer, and an image file is optionally produced
using a library call to the GTK+ library [29].

Figure 3: Block diagram of backprojection hardware unit.
After processing, the target subimages are simply held in
the GPP memory. In the Swathbuckler system, subimages
are distributed to consumers via a publish/subscribe mech-
anism, so there is no need to assemble all the subimages into
a larger image.
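The host-side control flow just described can be summarized by the following C skeleton. The fpga_* functions are hypothetical wrappers standing in for the vendor's PIO and DMA calls (they are left as prototypes here); chunk sizing and parameter names are illustrative, not the actual Annapolis API.

#include <stddef.h>
#include <stdint.h>

/* Hypothetical wrappers around the board API (not actual Annapolis calls). */
void fpga_write_params(double dx, double dy, double r_min, double tan_theta);
void fpga_dma_send(const int32_t *chunk, size_t words);
void fpga_wait_done(void);
void fpga_dma_receive(int32_t *image, size_t words);

void form_subimage(const int32_t *proj, size_t words_per_chunk, int num_chunks,
                   int32_t *image, size_t image_words,
                   double dx, double dy, double r_min, double tan_theta)
{
    fpga_write_params(dx, dy, r_min, tan_theta);   /* PIO control registers   */
    for (int i = 0; i < num_chunks; i++) {
        fpga_dma_send(proj + (size_t)i * words_per_chunk, words_per_chunk);
        fpga_wait_done();                          /* FPGA signals completion */
    }
    fpga_dma_receive(image, image_words);          /* retrieve final subimage */
}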
4.3.1. Configuration Parameters. Our backprojection imple-
mentation can be configured using several compile-time
parameters in both the host code and the VHDL code that
describes the hardware. In software, the values of Δx and Δy
are set in the header file and compiled in. The value of R_min
is specific to a dataset, so it is read from the file that contains
the projection data.
It is also possible to set the dimensions of the subimage
(1024 × 512 by default), though the hardware would require
significant changes to support this.
The hardware VHDL code allows two parameters to be
set at compile time (see Section 4.4.3). N is the number of
projection adders in the design, and R is the size of the
projection memories (R × 1024 words). Once compiled, the
value of these parameters can be read from the FPGA by the
host code.
4.4. FPGA Design. The hardware that is instantiated on
the FPGA boards runs the backprojection algorithm and
computes the values of the pixels in the output image. A
block diagram of the design is shown in Figure 3. References
to blocks in this figure are printed in monospace.
4.4.1. Clock Domains. In general, using multiple clock
domains in a design adds complexity and makes verification
significantly more difficult. However, the design of the
Annapolis Wildstar II board provides for one fixed-rate clock
on the PCI interface, and a separate fixed-rate clock on the
SRAM memories. This is a common attribute of FPGA-based
systems.
To simplify the problem, we run the bulk of our design
at the PCI clock rate (133 MHz). Since Annapolis VHDL
modules refer to the PCI interface as the LAD bus, we
call this the L-clock domain. Every block in Figure 3, with
the exception of the SRAMs themselves and their associated
Address Generators, is run from the L-clock.
The SRAMs are run from the memory clock, or M-
clock, which is constrained to run at 50 MHz. Between the
Target Memories and the Projection Adders, there is
some interface logic and an FIFO. This is not shown in
Figure 3, but exists to cross the M-clock/L-clock domain
boundary.
BlockRAM-based FIFOs, available as modules in the
Xilinx CORE Generator [30] library, are used to cross
clock domains. Since each of the ports on the dual-ported
BlockRAMs is individually clocked, the read and write can
happen in different clock domains. Control signals are
automatically synchronized to the appropriate clock, that is,
the full signal is synchronized to the write clock and the
empty signal to the read clock. Using FIFOs whenever clock
domains must be crossed provides a simple and effective
solution.
4.4.2. Control Registers and DMA Input. The Annapolis API,
like many FPGA control APIs, allows for communication
between the host PC and the FPGA with two methods:
programmed or memory-mapped I/O (PIO), which is best
for reading and writing one or two words of data at a time;
and direct memory access (DMA), which is best for transferring
large blocks of data.
The host software uses PIO to set control registers
on the FPGA. Projection data is placed in a specially
allocated memory buffer, and then transmitted to the FPGA
via DMA. On the FPGA, the DMA Receive Controller
receives the data and arranges it in the BlockRAMs inside the
Projection Adders.
4.4.3. Datapath. The main datapath of the backprojec-
tion hardware is shown in Figure 4. It consists of five
parts: the Target Memory SRAMs that hold the target
image, the Distance-To-Time Index Calculator (DIC),
the Projection Data BlockRAMs, the Adders to perform
the accumulation operation, and the Address Generators
that drive all of the memories. These devices all operate in
a synchronized fashion, though there are FIFOs in several
places to temporally decouple the producers and consumers
of data, as indicated in Figure 4 with segmented rectangles.
Address Generators. There are three data arrays that must be
managed in this design: the input target data, the out-
put target data, and the projection data. The pixel
indices for the two target data arrays (Target Memory
A and Target Memory B in Figure 4) are managed directly
by separate address generators. The address generator for the
Projection Data BlockRAMs also produces pixel indices;
the DIC converts the pixel index into a fast-time index that is
used to address the BlockRAMs.

Figure 4: Block diagram of hardware datapath.
Because a single read/write operation to the SRAMs
produces/consumes two pixel values, the address generators
for the SRAMs run for half as many cycles as the address
generator for the BlockRAMs. However, address generators
run in the clock domain relevant to the memory that they
are addressing, so n/2 SRAM addresses take slightly longer to
generate at 50 MHz than n BlockRAM addresses at 133 MHz.
Because of the use of FIFOs between the memories and
the adders, the address generators for Target Memory A
and the Projection Data BlockRAMs can run freely. FIFO
control signals ensure that an address generator is paused in
time to prevent it from overflowing the FIFO. The address
generator for Target Memory B is incremented whenever
data are available from the output FIFO.
Distance-To-Time Index Calculator. The Distance-To-
Time Index Calculator (DIC) implements (2), which is
comprised of two parts. At first glance, each of these parts
involves computation that requires a large amount of hardware
and/or time to calculate. However, a few simplifying assum-
ptions make this problem easier and reduce the amount of
needed hardware.
Rather than implementing a tangent function in hard-
ware, we rely on the fact that the beamwidth of the radar
is a constant. The host code performs the tan θ function and
sends the result to the FPGA, which is then used to calculate
ω(a, b). This value is used both on a coarse-grained level
to narrow the range of pixels which are examined for each
processing step, and on a fine-grained level to determine
whether or not a particular pixel is affected by the current
projection (see Figure 2).
EURASIP Journal on Embedded Systems 9
The right-hand side of (2) is a distance function
(√(x² + y²)) and a division. The square root function is
executed using an iterative shift-and-subtract algorithm. In
hardware, this algorithm is implemented with a pipeline
of subtractors. Two multiplication units handle the x² and
y² functions. Some additional adders and subtractors are
necessary to properly align the input data to the output
data according to the data collection parameters discussed
in Section 4.2.6. We used pipelined multipliers and division
units from the Xilinx CORE Generator library; adders and
subtractors are described with VHDL arithmetic operators,
allowing the synthesis tools to generate the appropriate
hardware.
The distance function and computation of ω() occur
in parallel. If the ω() function determines that the pixel is
outside the affected range, the adder input is forced to zero.
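For reference, the shift-and-subtract square root can be modeled in C as below; the hardware version unrolls this loop into a pipeline of subtractors. This is a generic software model under our own assumptions, not the authors' VHDL. The same operation, applied to re² + im², gives the complex magnitude used in Section 4.4.4.

#include <stdint.h>

/* Bit-serial (shift-and-subtract) integer square root: returns
 * floor(sqrt(x)).  One loop iteration corresponds to one subtractor
 * stage in the hardware pipeline. */
uint32_t isqrt64(uint64_t x)
{
    uint64_t res = 0;
    uint64_t bit = 1ull << 62;        /* highest power of four <= 2^63 */
    while (bit > x)
        bit >>= 2;
    while (bit != 0) {
        if (x >= res + bit) {
            x -= res + bit;
            res = (res >> 1) + bit;
        } else {
            res >>= 1;
        }
        bit >>= 2;
    }
    return (uint32_t)res;
}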
Projection Data BlockRAMs. The output of the DIC is a fast-
time index into the p(t, u) array. Each Projection Data
BlockRAM holds the data for a particular value of u. The
fast-time index t is applied to retrieve a single value of p(t, u)
that corresponds to the pixel that was input by the address
generator. This value is stored in an FIFO, to be synchronized
with the output of the Target Memory A FIFO.
Projection Data memories are configured to hold 2K
datawords by default, which should be sufficient for a 1K
range pixel image. This number is a compile-time parameter
in the VHDL source and can be changed. The resource
constraint is the number of available BlockRAMs.
Projection Adder. As the FIFOs from the Projection
Data memories and the Target Memory are filled, the
Projection Adder reads datawords from both FIFOs, adds
them together, and passes them to the next stage in the
pipeline (see Figure 4).
The design is configured with eight adder stages, meaning
eight projections can be processed in one step. This number
is a compile-time parameter in the VHDL source and
can be changed. The resource constraint is a combina-
tion of the number of available BlockRAMs (because the
Projection Data BlockRAMs and FIFO are duplicated)
and the amount of available logic (to implement the DIC).
The number of adder stages implemented directly
impacts the performance of our application. By computing
the contribution of multiple projections in parallel, we
exploit the fine-grained parallelism inherent in the backpro-
jection algorithm. Fine-grained parallelism is directly related
to the performance gains achieved by implementing the
application in hardware, where many small execution units
can be implemented that all run at the same time on different
pieces of data.
4.4.4. Complex Magnitude and DMA Output. When all
projections have been processed, the final target image data
reside in one of the Target Memory SRAMs. The host code
then requests that the image data be transferred via DMA to
the host memory. This process occurs in three steps.
First, an Address Generator reads the data out of
the SRAM in the correct order. Second, the data are
converted from complex to real. The Complex Magnitude
operator performs this function with a distance calculation
(√(re² + im²)). We instantiate another series of multipliers,
adders, and subtractors (for the integer square root) to
perform this operation. Third, the real-valued pixels are
passed to the DMA Transmit Controller, where they are
sent from the FPGA to the host memory.
5. Experimental Results
After completing the design, we conducted a series of
experiments to determine the performance and accuracy
of the hardware. When run on a single node, a detailed
prole of the execution time of both the software and
hardware programs can be determined, and the eects of
recongurable hardware design techniques can be studied.
Running the same program on multiple nodes shows how
well the application scales take advantage of the processing
power available on HPC clusters. In this section, we describe
the experiments and analyze the collected results.
5.1. Experimental Setup. Our experiments consist of running
programs on the HHPC and measuring the run time of
individual components as well as the overall execution time.
There are two programs: one which forms images by running
backprojection on the GPP (the software program), and
one which runs it in hardware on an FPGA (the hardware
program).
We are concerned with two factors: speed of execution
and accuracy of solution. We will consider not only the
execution time of the backprojection operation by itself,
but also the execution time of the entire program including
peripheral operations such as memory management and data
scaling. In addition, we examine the full application run
time.
In terms of accuracy, the software program computes
its results in floating point while the hardware uses integer
arithmetic. We will examine the differences between the
images produced by these two programs in order to establish
the error introduced by the data format conversion.
The ability of our programs to scale across multiple
nodes is an additional performance metric. We measure
the effectiveness of exploiting coarse-grained parallelism by
comparing the relative performance of both the software
program and the hardware implementation when run on one
node and when run on many nodes.
5.1.1. Software Design. All experiments were conducted on
the HHPC system as described in Section 2.2. Nodes on the
HHPC run the Linux operating system, RedHat release 7.3,
using kernel version 2.4.20.
Both the software program and the calling framework for
the hardware implementation are written in C and produce
an executable that is started from the Linux shell. The
software program executes entirely on the GPP: memory
buffers are established, projection data are read from disk,
the backprojection algorithm is run, and output data are
transformed from complex to real. The final step involves
rearranging the output data and scaling it, then writing a
PNG image to disk.
The hardware program begins by establishing memory
buffers and initializing the FPGA. Projection data are read
from disk into the GPP memory. Those data are then
transferred to the FPGA, where the backprojection algorithm
is run in hardware. Output data are transformed from
complex to real on the FPGA, then transferred to the GPP
memory. The GPP then executes the same rearrangement
and scaling step as the software program.
To control the FPGA, the hardware program uses an API
that is provided by Annapolis to interface to the WildStar
II boards. The FPGA hardware was written in VHDL and
synthesized using version 8.9 of Synplify Pro. Hardware place
and route used the Xilinx ISE 9.1i suite of CAD tools.
The final step of both programs, where the target data are
written as a PNG image to disk by the GPP, uses the GTK+
[29] library version 2.0.2. For the multinode experiments,
version 1.2.1.7b of the MPICH [31] implementation of MPI
is used to handle internode startup, communication, and
synchronization.
The timing data presented in this section were collected
using timing routines that were inserted into the C code.
These routines use Linux system calls to display timing
information. The performance of the timing routines was
determined by running them several times in succession
with no code in between. The overhead of the performance
routines was shown to be less than 100 microseconds, so
timing data are presented as accurate to the millisecond.
Unless noted otherwise, applications were run five times
and an average (arithmetic mean) was taken to arrive at the
presented data.
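The instrumentation can be sketched as follows; this is an illustrative use of gettimeofday() rather than the exact routines used in the experiments.

#include <stdio.h>
#include <sys/time.h>

/* Wall-clock time in milliseconds. */
static double now_ms(void)
{
    struct timeval tv;
    gettimeofday(&tv, NULL);
    return tv.tv_sec * 1000.0 + tv.tv_usec / 1000.0;
}

/* Example usage around a timed region:
 *     double t0 = now_ms();
 *     run_backprojection();            // region being measured
 *     printf("backprojection: %.0f ms\n", now_ms() - t0);
 */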
We determine accuracy both qualitatively (i.e., by exam-
ining the images with the human eye) and quantitatively by
computing the difference in pixel values between the two
images.
5.1.2. Test Data. Four sets of data were used to test our
programs. The datasets were produced using a MATLAB
simulation of SAR taken from the Soumekh book [2]. This
MATLAB script allows the parameters of the data collection
process to be configured (see Section 4.2.6). When run, it
generates the projection data that would be captured by an
SAR system imaging that area. A separate C program takes
the MATLAB output and converts it to an optimized file that
can be read by the backprojection programs.
Each dataset contains four point source (i.e., 1 × 1 pixel
in size) targets that are distributed randomly through the
imaged area. The imaged area for each set is of a similar size,
but situated at a different distance from the radar. Targets are
also assigned a random reflectivity value that indicates how
strongly they reflect radar signals.
5.2. Results and Analysis. In general, owing to the high degree
of parallelism inherent in the backprojection algorithm, we
expect a considerable performance benefit from implemen-
tation in hardware even on a single node. For the multinode
program, the lack of need to transfer data between the nodes
implies that the performance should scale in a linear relation
to the number of nodes.

Table 1: Single-node experimental performance.
Component              Software   Hardware   Ratio
Backprojection         76.4 s     351 ms     217:1
Complex magnitude      73 ms      15 ms      4.9:1
Form image (software)  39 ms      340 ms     1:8.7
Total                  76.5 s     706 ms     108:1

Table 2: Single-node backprojection performance by dataset.
Dataset   Software   Hardware   BP speedup   App speedup
1         24.5 s     146 ms     167.4        49.8
2         30.4 s     169 ms     179.5        61.9
3         47.7 s     268 ms     177.5        75.5
4         76.5 s     351 ms     217.6        108.4
5.2.1. Single Node Performance. The first experiment involves
running backprojection on a single node. This allows us to
examine the performance improvement due to fine-grained
parallelism, that is, the speedup that can be gained by imple-
menting the algorithm in hardware. For this experiment, we
ran the hardware and software programs on all four of the
datasets. Table 1 shows the timing breakdown of dataset no.
4; Table 2 shows the overall results for all four datasets. Note
that dataset no. 1 is closest to the radar, and dataset no. 4 is
furthest away.
In Table 1, Software and Hardware refer to the run time
of a particular component; Ratio is the ratio of software
time to hardware time, showing the speedup or slowdown
of the hardware program. Backprojection is the running of
the core algorithm. Complex Magnitude transforms the data
from complex to real integers. Finally, Form Image scales the
data to the range [0:255] and creates the memory buffer that
is used to create the PNG image.
There are a number of significant observations that can
be made from the data in Table 1. Most importantly, the
process of running the backprojection algorithm is greatly
accelerated in hardware, running over 200x faster than our
software implementation. It is important to emphasize that
this includes the time required to transfer projection data
from the host to the FPGA, which is not required by the
software program. Many of the applications discussed in
Section 3 exhibit only modest performance gains due to the
considerable amount of time spent transferring data. Here,
the vast degree of fine-grained parallelism in the backpro-
jection algorithm that can be exploited by FPGA hardware
allows us to achieve excellent performance compared to a
serial software implementation.
The complex magnitude operator also runs about 5x
faster in hardware. In this case, the transfer of the output
data from the FPGA to the host is overlapped with the
computation of the complex magnitude. This commonly
used technique allows the data transfer time to be hidden,
preventing it from affecting overall performance.
However, the process of converting the backprojection
output into an image buffer that can be converted to a PNG
image (Form Image) runs faster when executed as part of the
software program. This step is performed in software regard-
less of where the backprojection algorithm was executed.
The difference in run time can be attributed to memory
caching. When backprojection occurs in software, the result
data lie in the processor cache. When backprojection occurs
in hardware, the result data are copied via DMA into the
processor main memory, and must be loaded into the cache
before the Form Image step can begin.
We do not report the time required to initialize either
the hardware or software program, since in the Swathbuckler
system it is expected that initialization can be completed
before the input data become available.
Table 2 shows the single-node performance of both
programs on all four datasets. Note that the reported run
times are only the times required by the backprojection
operation. Thus, column four, BP Speedup, shows the
factor of speedup (software:hardware ratio) for only the
backprojection operation. Column five, App Speedup, shows
the factor of speedup for the complete application including
all of the steps shown in Table 1.
These results show that the computation time of the
backprojection algorithm is data dependent. This is directly
related to the minimum range of the projection data.
According to Figure 2, as the subimage gets further away
from the radar, the width of the radar beam is larger. This
is reflected in the increased limits of the ω(a, b) term of (2),
which are a function of the tangent of the beamwidth and
the range. A larger range implies more pixels are impacted by
each projection, resulting in an increase in processing time.
The hardware and software programs scale at approximately
the same rate, which is expected since they are processing the
same amount of additional data at longer ranges.
More notable is the increase in application speedup;
this can be explained by considering that the remainder of
the application is not data dependent and stays relatively
constant as the minimum range varies. Therefore, as the
range increases and the amount of data to process increases,
the backprojection operation takes up a larger percentage
of the run time of the entire application. For software, this
increase in proportion is negligible (99.5% to 99.8%), but
for the hardware, it is quite large (12.6% to 25.0%). As the
backprojection operator takes up more of the overall run
time, the relative gains from implementing it in hardware
become larger, resulting in the increasing speedup numbers
seen in the table.
5.2.2. Single Node Accuracy. Qualitatively, the hardware and
software images look very similar, with the hardware images perhaps
slightly darker near the point source target. This is due to
the quantization imposed by using integer data types. As
discussed in Section 5.2.1, a longer range implies a wider
radar beam. The ω(a, b) function from (2) determines how
wide the beam is at a given range. When computed in fixed
point for the hardware program, y − x tan θ returns slightly
different values than when the software program computes it
in floating point. Thus, there is a slight smearing or blurring
of the point source. Recall that dataset no. 4 has a longer
range than the other datasets; appropriately, the smearing is
most notable in that dataset.
Quantitatively, the two images can be compared pixel-by-
pixel to determine the differences. For each dataset, Table 3
presents error in terms of the differences between the image
produced by the software program and the image produced
by the hardware program.

Table 3: Image accuracy by dataset.
Dataset   Errors   % of pixels   Max error   Mean error
1         4916     0.9%          18 (7.0%)   1.55
2         13036    2.5%          19 (7.4%)   1.73
3         18706    3.6%          12 (4.6%)   1.56
4         29093    5.6%          16 (6.2%)   1.64
The second column shows the number of pixels that are
different between the two images. There are 1024 × 512 pixels
in the image, so the third column shows the percent of overall
image pixels that are different. The maximum and arithmetic
mean error are shown in the last two columns. Recall that our
output images are 256-level grayscale PNG files; the magnitude
of error is given by err(x, y) = |hw(x, y) − sw(x, y)|.
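The per-dataset statistics in Table 3 follow directly from this error definition. The sketch below, written for illustration over flattened lists of 8-bit pixel values, counts differing pixels and reports the maximum and mean absolute error.

```python
# Count differing pixels and compute max/mean absolute error between the
# software (sw) and hardware (hw) output images, as in Table 3.
def image_error_stats(hw, sw):
    diffs = [abs(h - s) for h, s in zip(hw, sw) if h != s]
    total = len(hw)
    return {
        "errors": len(diffs),
        "percent": 100.0 * len(diffs) / total,
        "max_error": max(diffs) if diffs else 0,
        "mean_error": sum(diffs) / len(diffs) if diffs else 0.0,
    }

# Example with two tiny "images" flattened to lists of pixel values.
print(image_error_stats([10, 20, 30, 40], [10, 22, 30, 39]))
```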
Again, errors can be attributed to the difference in the
computed width of the radar beam between the software
and hardware programs. For comparison, a version of each
program was written that does not include the (a, b) function
and instead assumes that every projection contributes to
every pixel (i.e., an infinite beamwidth). In this case, the
images are almost identical; the number of errors drops to
0.1%, and the maximum error is 1. Thus, the error is not
due to quantization of processed data; the computation of
the radar beamwidth is responsible.
5.2.3. Multinode Performance. The second experiment in-
volves running backprojection on multiple nodes simul-
taneously, using the MPI library to coordinate. These results
show how well the application scales due to coarse-grained
parallelism, that is, the speedup that can be gained by
dividing a large problem into smaller pieces and running
each piece separately. For this experiment, we create an
output target image that is 64 times the size of the image
created by a single node. Thus, when run on one node,
64 iterations are required; for two nodes, 32 iterations are
required, and so on. Table 4 shows the results for a single
dataset.
For both the software and hardware programs, five
trials were run. For each trial, the time required to run
backprojection and form the resulting image on each node
was measured, and the maximum time reported. Thus, the
overall run time is equal to the run time of the slowest node.
The arithmetic mean of the times (in seconds) from the five
trials is presented, with the standard deviation. The mean run
time is compared to the mean run time for one node in order
to show the speedup factor.

Table 4: Multinode experimental performance.
                 Software                        Hardware
Nodes   Mean     Std. dev.   Speedup    Mean    Std. dev.   Speedup
1       1943.2   6.15        1.0        25.0    0.01        1.0
2       983.2    10.46       2.0        13.4    0.02        1.9
4       496.0    4.60        3.9        7.8     0.02        3.9
8       256.5    5.85        7.6        4.0     0.06        6.0
16      128.4    1.28        15.1
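The per-node timing and the "slowest node" reduction just described can be expressed compactly with MPI. The sketch below assumes the mpi4py bindings; run_backprojection is a placeholder for each node's share of the 64 subimages, not the real application code.

```python
# Per-node timing with a max-reduction to find the slowest node.
from mpi4py import MPI
import time

def run_backprojection(iterations):
    time.sleep(0.001 * iterations)   # placeholder for the real computation

comm = MPI.COMM_WORLD
iterations = 64 // comm.Get_size()   # 64 subimages split evenly across nodes

start = time.time()
run_backprojection(iterations)
local_time = time.time() - start

# The overall run time is that of the slowest node.
slowest = comm.reduce(local_time, op=MPI.MAX, root=0)
if comm.Get_rank() == 0:
    print(f"overall run time: {slowest:.3f} s")
```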
Results are not presented for a 16-node trial of the
hardware program. During our testing, it was not possible
to find 16 nodes of the HHPC that were all capable of
running the hardware program at once. This was due to
hardware errors on some nodes, and inconsistent system
software installations on others.
The mostly linear distribution of the data in Table 4
shows that for the backprojection application, we have
achieved a nearly ideal parallelization. This can be attributed
to the lack of data passing between nodes, combined with
an insignificant amount of overhead involved in running
the application in parallel with MPI. The hardware program
shows a similar curve, except for N = 8 nodes, where
the speedup drops off slightly. At run times under five
seconds, the MPI overhead involved in synchronizing the
nodes between each processing iteration becomes significant,
resulting in a slight slowdown (6x speedup compared to the
ideal 8x).
The speedup provided by the hardware program is
further described in Table 5. Compared to one node running
the hardware program, we have already seen the nearly linear
speedup. Compared to an equal number of nodes running
the software program, the hardware consistently performs
around 75x faster. Again, for N = 8, there is a slight drop
off in speedup owing to the MPI overhead for short run
times. Finally, we show that when compared to a single
node running the software program, the combination of
fine- and coarse-grained parallelism results in a very large
performance gain.
6. Discussion
The results from Section 5.2.3 show that excellent speedup
can be achieved by implementing the backprojection algo-
rithm on an HPRC machine. As HPRC architectures improve
and more applications are developed for them, designers will
continue to search for ways to carve out more and more
performance. Based on the lessons learned in Section 3 and
our work on backprojection, in this section we suggest some
directions for future research.
Table 5: Speedup factors for hardware program.
                  Ratio compared to
Nodes   1 hardware    N software    1 software
1       1.0           77.8          77.8
2       1.9           75.8          149.8
4       3.9           76.5          299.8
8       6.0           61.1          463.0

6.1. Future Backprojection Work. This project was developed
with an eye toward implementation as a part of
the Swathbuckler SAR system (see Section 4.1.2). Owing
to the classified nature of that project, additional work
beyond the scope of this project is required to integrate
our backprojection implementation into that project. To
determine the success of this aspect of the project, we would
need to compare backprojection to the current Swathbuckler
image formation algorithm, both in terms of run time as well
as image quality.
Despite excellent speedup results, there are further
avenues for improvement of our hardware. The Wildstar II
boards feature two identical FPGAs, so it may be possible to
process two images at once. If the data transfer to one FPGA
can be overlapped with computation on the other, significant
speedup is possible. It may also be possible for each FPGA
to create a larger target image using more of the on-board
SRAMs.
An interesting study could be performed by porting
backprojection to several other HPRC systems, some of
which can be targeted by high-level design languages. This
would be the first step toward developing a benchmark suite
for testing HPRC systems; however, without significant tool
support for application portability between HPRC
platforms, this process would be daunting.
6.2. HPRC Systems and Applications. One common theme
among the FPGA applications mentioned in Section 3 is data
transfer. Applications that require a large amount of data to
be moved between the host and the FPGA can eliminate most
of the gains provided by increased parallelism. Backprojec-
tion does not suffer from this problem because the amount
of parallelism exploited is so high that the data transfer is a
relatively small portion of the run time, and some of the data
transfers can be overlapped with computation. These are
common and well-known techniques in HPRC application
design.
This leads us to two conclusions. First, when considering
porting an application to an HPRC system, it is important
to consider whether the amount of available parallelism is
sufficient to provide good speedup. Tools that can analyze an
application to aid designers in making this decision are not
generally available.
Second, it is crucial for the speed of the data transfers
to be as high as possible. Early HPRC systems such as the
HHPC use common bus architectures like PCI, which do not
provide very high bandwidth. This limits the effectiveness of
many applications. More recent systems such as the SRC-7
have included significantly higher bandwidth interconnect,
leading to improved data transfer performance and increas-
ing the number of applications that can be successfully
ported. Designers of future HPRC systems must continue to
focus on ways to improve the speed of these data transfers.
It is also noteworthy that the backprojection application
presented here was developed using hand-coded VHDL, with
some functional units from the Xilinx CoreGen library [30].
Writing applications in an HDL provides the highest amount
of flexibility and customization, which generally implies the
highest amount of exploited parallelism. However, HDL
development time tends to be prohibitively high. Recent
research has focused on creating programming languages
and tools that can be used to increase programmer produc-
tivity, but applications developed with these tools have not
provided speedups comparable to those of hand-coded HDL
applications. The HPRC community would benefit from the
continued improvement of development tools such as these.
Finally, each HPRC system has its own programming
method that is generally incompatible with other systems.
The standardization of programming interfaces would make
the development of design tools easier, and would also
increase application developer productivity when moving
from one machine to the next. Alternately, tools to support
portability of HPRC applications such as the VForce project
[18] would also help HPRC developers.
7. Conclusions
In conclusion, we have shown that backprojection is an
excellent choice for porting to an HPRC system. Through
the design and implementation of this application, we have
explored the benefits and difficulties of HPRC systems
in general, and identied several important features of
both these systems and applications that are candidates
for porting. Backprojection is an example of the class of
problems that demand larger amounts of computational
resources than can be provided by desktop or single-node
computers. As HPRC systems and tools mature, they will
continue to help in meeting this demand and making new
categories of problems tractable.
Acknowledgments
This work was supported in part by the Center for Subsurface
Sensing and Imaging Systems (CenSSIS) under the Engi-
neering Research Centers Program of the National Science
Foundation (Award no. EEC-9986821) and by the DOD
High Performance Computer Modernization Program. The
authors would also like to thank Dr. Richard Linderman and
Prof. Eric Miller for their input to this project as well as Xilinx
and Synplicity Corporations for their generous donations.
References
[1] B. Cordes, Parallel backprojection: a case study in high-
performance reconfigurable computing, M.S. thesis, Depart-
ment of Electrical and Computer Engineering, Northeastern
University, Boston, Mass, USA, 2008.
[2] M. Soumekh, Synthetic Aperture Radar Signal Processing with
MATLAB Algorithms, John Wiley & Sons, New York, NY, USA,
1999.
[3] A. C. Kak and M. Slaney, Principles of Computerized Tomo-
graphic Imaging, IEEE Press, New York, NY, USA, 1988.
[4] V. W. Ross, Heterogeneous high performance computer, in
Proceedings of the High Performance Computing Modernization
Program Users Group Conference (HPCMP 05), pp. 304307,
Nashville, Tenn, USA, June 2005.
[5] High Performance Technologies Inc., Cluster Computing,
January 2008, http://www.hpti.com/.
[6] Annapolis Microsystems, Inc., CoreFire FPGA Design Suite,
January 2008, http://www.annapmicro.com/corere.html.
[7] S. Coric, M. Leeser, E. Miller, and M. Trepanier, Parallel-
beam backprojection: an FPGA implementation optimized
for medical imaging, in Proceedings of the 10th ACM/SIGDA
International Symposium on Field-Programmable Gate Arrays
(FPGA 02), pp. 217226, Monterey, Calif, USA, February
2002.
[8] N. Gac, S. Mancini, and M. Desvignes, Hardware/software
2D-3D backprojection on a SoPC platform, in Proceedings
of the ACM Symposium on Applied Computing (SAC 06), pp.
222228, Dijon, France, April 2006.
[9] X. Xue, A. Cheryauka, and D. Tubbs, Acceleration of fluoro-
CT reconstruction for a mobile C-arm on GPU and FPGA
hardware: a simulation study, in Medical Imaging 2006:
Physics of Medical Imaging, M. J. Flynn and J. Hsieh, Eds., vol.
6142 of Proceedings of SPIE, pp. 14941501, San Diego, Calif,
USA, February 2006.
[10] O. Bockenbach, M. Knaup, and M. Kachelrie, Implemen-
tation of a cone-beam backprojection algorithm on the cell
broadband engine processor, in Medical Imaging 2007: Physics
of Medical Imaging, vol. 6510 of Proceedings of SPIE, pp. 110,
San Diego, Calif, USA, February 2007.
[11] L. Nguyen, M. Ressler, D. Wong, and M. Soumekh, Enhance-
ment of backprojection SAR imagery using digital spotlighting
preprocessing, in Proceedings of the IEEE Radar Conference,
pp. 5358, Philadelphia, Pa, USA, April 2004.
[12] A. Hast and L. Johansson, Fast factorized back-projection in an
FPGA, M.S. thesis, Halmstad University, Halmstad, Sweden,
2006, http://hdl.handle.net/2082/576.
[13] A. Ahlander, H. Hellsten, K. Lind, J. Lindgren, and B.
Svensson, Architectural challenges in memory-intensive, real-
time image forming, in Proceedings of the 36th International
Conference on Parallel Processing (ICPP 07), p. 35, Xian,
China, September 2007.
[14] E. El-Ghazawi, E. El-Araby, A. Agarwal, J. LeMoigne, and
K. Gaj, Wavelet spectral dimension reduction of hyperspec-
tral imagery on a reconfigurable computer, in Proceedings
of the International Conference on Military and Aerospace
Programmable Logic Devices (MAPLD 04), Washington, DC,
USA, September 2004.
[15] S. R. Alam, P. K. Agarwal, M. C. Smith, J. S. Vetter, and
D. Caliga, Using FPGA devices to accelerate biomolecular
simulations, Computer, vol. 40, no. 3, pp. 6673, 2007.
[16] J. S. Meredith, S. R. Alam, and J. S. Vetter, Analysis of a com-
putational biology simulation technique on emerging pro-
cessing architectures, in Proceedings of the 21st International
Parallel and Distributed Processing Symposium(IPDPS 07), pp.
18, Long Beach, Calif, USA, March 2007.
[17] J. L. Tripp, M. B. Gokhale, and A. A. Hansson, A case study
of hardware/software partitioning of traffic simulation on the
Cray XD1, IEEE Transactions on Very Large Scale Integration
(VLSI) Systems, vol. 16, no. 1, pp. 6674, 2008.
[18] N. Moore, A. Conti, M. Leeser, and L. S. King, Vforce:
an extensible framework for reconfigurable supercomputing,
Computer, vol. 40, no. 3, pp. 3949, 2007.
[19] X. Wang, S. Braganza, and M. Leeser, Advanced components
in the variable precision floating-point library, in Proceedings
of the 14th Annual IEEE Symposium on Field-Programmable
Custom Computing Machines (FCCM 06), pp. 249258, Napa,
Calif, USA, April 2006.
[20] K. D. Underwood, K. S. Hemmert, and C. Ulmer, Architec-
tures and APIs: assessing requirements for delivering FPGA
performance to applications, in Proceedings of the ACM/IEEE
Conference on Supercomputing (SC 06), Tampa, Fla, USA,
November 2006.
[21] M. C. Smith, J. S. Vetter, and S. R. Alam, Scientific
computing beyond CPUs: FPGA implementations of common
scientific kernels, in Proceedings of the 8th International
Conference on Military and Aerospace Programmable Logic
Devices (MAPLD 05), Washington, DC, USA, September
2005.
[22] L. Zhuo and V. K. Prasanna, High performance linear algebra
operations on reconfigurable systems, in Proceedings of the
ACM/IEEE Conference on Supercomputing (SC 05), p. 2, IEEE
Computer Society, Seattle, Wash, USA, November 2005.
[23] M. Gokhale, C. Rickett, J. L. Tripp, C. Hsu, and R. Scrofano,
Promises and pitfalls of reconfigurable supercomputing, in
Proceedings of the International Conference on Engineering of
Reconfigurable Systems and Algorithms (ERSA 06), pp. 11–20,
Las Vegas, Nev, USA, June 2006.
[24] M. C. Herbordt, T. VanCourt, Y. Gu, et al., Achieving high
performance with FPGA-based computing, Computer, vol.
40, no. 3, pp. 5057, 2007.
[25] P. Th. Eugster, P. A. Felber, R. Guerraoui, and A.-M. Kermar-
rec, The many faces of publish/subscribe, ACM Computing
Surveys, vol. 35, no. 2, pp. 114131, 2003.
[26] S. Rouse, D. Bosworth, and A. Jackson, Swathbuckler wide
area SAR processing front end, in Proceedings of the IEEE
Radar Conference, pp. 16, New York, NY, USA, April 2006.
[27] R. W. Linderman, Swathbuckler: wide swath SAR system
architecture, in Proceedings of the IEEE Radar Conference, pp.
465470, Verona, NY, USA, April 2006.
[28] S. Tucker, R. Vienneau, J. Corner, and R. W. Linderman,
Swathbuckler: HPC processing and information exploita-
tion, in Proceedings of the IEEE Radar Conference, pp. 710
717, New York, NY, USA, April 2006.
[29] GTK+ Project, March 2008, http://www.gtk.org/.
[30] Xilinx, Inc., CORE Generator, March 2008, http://www.xilinx
.com/products/design tools/logic design/design entry/core-
generator.htm.
[31] Argonne National Laboratories, MPICH, March 2008,
http://www.mcs.anl.gov/research/projects/mpich2/.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 258921, 11 pages
doi:10.1155/2009/258921
Research Article
Performance Analysis of Bit-Width Reduced
Floating-Point Arithmetic Units in FPGAs:
A Case Study of Neural Network-Based Face Detector
Yongsoon Lee,1 Younhee Choi,1 Seok-Bum Ko,1 and Moon Ho Lee2
1 Electrical and Computer Engineering Department, University of Saskatchewan, Saskatoon, SK, Canada S7N 5A9
2 Institute of Information and Communication, Chonbuk National University, Jeonju, South Korea
Correspondence should be addressed to Seok-Bum Ko, seokbum.ko@usask.ca
Received 4 July 2008; Revised 16 February 2009; Accepted 31 March 2009
Recommended by Miriam Leeser
This paper implements a field programmable gate array- (FPGA-) based face detector using a neural network (NN) and the bit-
width reduced floating-point arithmetic unit (FPU). The analytical error model, using the maximum relative representation error
(MRRE) and the average relative representation error (ARRE), is developed to obtain the maximum and average output errors
for the bit-width reduced FPUs. After the development of the analytical error model, the bit-width reduced FPUs and an NN
are designed using MATLAB and VHDL. Finally, the analytical (MATLAB) results, along with the experimental (VHDL) results,
are compared. The analytical results and the experimental results show conformity of shape. We demonstrate that incremental
reductions in the number of bits used can produce significant cost reductions including area, speed, and power.
Copyright 2009 Yongsoon Lee et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Neural networks have been studied and applied in various
fields requiring learning, classification, fault tolerance, and
associative memory since the 1950s. Neural networks are
frequently used to model complicated problems for which it is
difficult to derive equations analytically. Applica-
tions include pattern recognition and function approxima-
tion [1]. The most popular neural network is the multilayer
perceptron (MLP) trained using the error back propagation
(BP) algorithm [2]. Because of the slow training in MLP-
BP, however, it is necessary to speed up the training time.
A very attractive solution is to implement it on field-
programmable gate arrays (FPGAs).
For implementing MLP-BP, each processing element
must perform multiplication and addition. Another impor-
tant calculation is an activation function, which is used
to calculate the output of the neural network. One of the
most important considerations for implementing a neural
network on FPGAs is the arithmetic representation format.
It is known that floating-point (FP) formats are more
area efficient than fixed-point ones for implementing artificial
neural networks with the combination of addition and
multiplication on FPGAs [3].
The main advantage of the FP format is its wide range.
The feature of the wide range is good for neural network
systems because the system requires the big range when the
learning weight is calculated or changed [4]. Another advan-
tage of the FP format is the ease of use. A personal computer
uses the oating-point format for its arithmetic calculation.
If the target application uses the FP format, the effort of
converting to another arithmetic format is not necessary.
FP hardware offers a wide dynamic range and high
computation precision, but it occupies large fractions of total
chip area and energy consumption. Therefore, its usage is
very limited. Many embedded microprocessors do not even
include a floating-point unit (FPU) due to its unacceptable
hardware cost.
A bit-width reduced FPU solves this complexity problem
[5, 6]. An FP bit-width reduction can provide a significant
saving of hardware resources such as area and power. It is
useful to understand the loss in accuracy and the reduction in
costs as the number of bits in an implementation of floating-
point representation is reduced. Incremental reductions in
the number of bits used can produce useful cost reductions.
In order to determine the required number of bits in the bit-
width reduced FPU, analysis of the error caused by a reduced-
precision is essential. Precision reduced error analysis for
neural network implementations was introduced in [7]. A
formula that estimates the standard deviation of the output
differences of fixed-point and floating-point networks was
developed in [8]. Previous error analyses are useful to
estimate possible errors. However, it is necessary to know the
maximum and average possible errors caused by a reduced-
precision FPU for a practical implementation.
Therefore, in this paper, the error model is developed
using the maximum relative representation error (MRRE)
and average relative representation error (ARRE) which are
representative indices to examine the FPU accuracy.
After the error model for the reduced precision FPU
is developed, the bit-width reduced FPUs and the neural
network for face detection are designed using MATLAB and
Very high speed integrated circuit Hardware Description
Language (VHDL). Finally the analytical (MATLAB) results
are compared with the experimental (VHDL) results.
Detecting a face in an image means finding its position
in the image plane and its size. There has been extensive
research in the eld, ranging mostly in the software domain
[9, 10]. There have been a few research efforts on hardware
face detector implementations on FPGAs [11, 12], but
most of the proposed solutions are not very compact and
the implementations are not purely in hardware. In our
previous work, the FPGA-based stand-alone face detector to
support a face recognition system was suggested and showed
that an embedded system could be made [13].
Our central contribution here is to examine how a neural
network-based face detector can employ the minimal num-
ber of bits in an FPU to reduce hardware resources, yet
maintain a face detector's overall accuracy.
This paper is outlined as follows. In Section 2, the FPGA
implementation of the neural network face detector using the
bit-width reduced FPUs is described. Section 3 explains how
representation errors theoretically aect a detection rate in
order to determine the required number of bits for the bit-
width reduced FPUs. In Section 4, the experimental results
are presented, and then they are compared to the analytical
results to verify if both results match closely. Section 5 draws
conclusions.
2. A Neural Network-Based Face Detector Using
a Bit-Width Reduced FPU in an FPGA
2.1. General Review on MLP. A neural network model can
be categorized into two types: single layer perceptron and
multilayer perceptron (MLP). A single layer perceptron has
only two layers: the input layer and the output layer. Each
layer contains a certain number of neurons. The MLP is a
neural network model that contains multiple layers, typically
three or more layers including one or more hidden layers.
The MLP is a representative method of supervised learning.
Each neuron in one layer receives its input from the
neurons in the previous layer and broadcasts its output to
Figure 1: A two-layer MLP architecture (input nodes: 400; hidden nodes: 300; weights 0-1 and 1-2; activation function; output y1).
the neurons in the next layer. Every processing node in one
particular layer is usually connected to every node in the
previous layer and the next layer. The connections carry
weights, and the weights are adjusted during training. The
operation of the network consists of two stages: forward pass
and backward pass or back-propagation. In the forward pass,
an input pattern vector is presented to the network and the
output of the input layer nodes is precisely the components
of the input pattern. For successive layers, the input to each
node is then the sum of the products of the incoming vector
components with their respective weights.
The input to a node j is given simply by

    input_j = Σ_i w_{ji} · out_i,    (1)

where w_{ji} is the weight connecting node i to node j and out_i
is the output from node i.
The output of a node j is simply

    out_j = f(input_j),    (2)
which is then sent to all nodes in the following layer. This
continues through all the layers of the network until the
output layer is reached and the output vector is computed.
The input layer nodes do not perform any of the above
calculations. They simply take the corresponding value
from the input pattern vector. The function f denotes the
activation function of each node, and it will be discussed in
the following section.
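As a concrete illustration of (1) and (2), the following minimal Python sketch evaluates one layer of the forward pass using plain lists; the weight layout and the activation function are assumptions chosen for the example, not the implemented VHDL design.

```python
# Forward pass of one MLP layer: input_j = sum_i w_ji * x_i, out_j = f(input_j).
import math

def forward_layer(inputs, weights, f):
    # weights[j][i] connects input node i to node j, as in w_ji of (1)
    outputs = []
    for w_j in weights:
        net = sum(w * x for w, x in zip(w_j, inputs))   # input_j of (1)
        outputs.append(f(net))                          # out_j of (2)
    return outputs

tanh_like = lambda x: 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0
print(forward_layer([0.5, -0.2], [[0.1, 0.4], [-0.3, 0.2]], tanh_like))
```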
It is known that 3 layers having 2 hidden layers are
better than 2 layers to approximate any given function [14].
However, a 2-layer MLP is used in this paper, as shown in
Figure 1. The output error equation of the first layer (15)
and the error equation of the second layer (21) are different.
However, the error equation of the second layer (21) and the
error equation of the other layers (22) have the same form.
Figure 2: Estimation (5) of an activation function (3): f(x) = 2/(1 + e^(−2x)) − 1 compared with f(x) = 0.75x.
Figure 3: Block diagram of the neural network in an FPGA: neural network top module, multiplication and accumulation (MAC), FPU multiplication (using an FPGA hard multiplier IP), and FPU addition (modified from the LEON processor FPU).
Therefore, a 2-layer MLP is enough to be examined in this
paper. Numbers of neurons of 400 and 300 were used for the
input and first layers, respectively, in this experiment.
After the face data enter the input node, they are calculated
by multiplication-and-accumulation (MAC) with the
weights. Face or non-face data is determined by comparing
the output results with the threshold. For example, if the
output is larger than the threshold, it is considered face data.
Here, on the FPGA, this decision is easily made by checking the
sign bit after subtracting the threshold from the output result.
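The threshold test can be written as a one-line comparison; the toy function below merely illustrates the subtract-and-check-sign decision and is not part of the hardware design.

```python
# Face/non-face decision: subtract the threshold and test the sign,
# mirroring the sign-bit check performed on the FPGA.
def is_face(nn_output, threshold):
    return (nn_output - threshold) > 0.0   # non-positive difference -> non-face

print(is_face(0.8, 0.5), is_face(-0.3, 0.5))
```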
2.2. Estimation of Activation Function. An activation func-
tion is used to calculate the output of the neural network.
The learning procedure of the neural network requires
the differentiation of the activation function to update the
weight values. Therefore, the activation function has to be
differentiable. A sigmoid function, having an S shape, is
used for the activation function, and a logistic or a hyperbolic
tangent function is commonly used as the sigmoid function.
The hyperbolic tangent function, with its antisymmetric
feature, was better than the logistic function for learning
ability in our experiment. Therefore, the hyperbolic tangent
sigmoid transfer function was used, as shown in (3). The
first-order derivative of the hyperbolic tangent sigmoid
Figure 4: Block diagram of the 5-stage pipelined FPU: data fetch, pre-norm, adder, post-norm, and round/norm stages, with the FPU multiplication alongside.
Table 1: MRRE and ARRE of five different FPUs.
Bit-width   Unit (β, e, m)   Range                   MRRE (ulp)   ARRE
FPU32       2, 8, 23         2^(2^8 − 1) = 2^255     2^−23        0.3607 × 2^−23
FPU24       2, 6, 17         2^(2^6 − 1) = 2^63      2^−17        0.3607 × 2^−17
FPU20       2, 6, 13         2^(2^6 − 1) = 2^63      2^−13        0.3607 × 2^−13
FPU16       2, 6, 9          2^(2^6 − 1) = 2^63      2^−9         0.3607 × 2^−9
FPU12       2, 6, 5          2^(2^6 − 1) = 2^63      2^−5         0.3607 × 2^−5
β: radix, e: exponent, m: mantissa.
Table 2: Timing results of the neural network-based FPGA face
detector by different FPUs.
Bit-width   Max. clock (MHz)   1/f (ns)   Time*/1 frame (ms)   Frame rate**
FPU64***    8.5                117        50                   20
FPU32       48                 21.7       8.7                  114.4
FPU24       58 (+21%)          17.4       7.4                  135.9
FPU20       77 (+60%)          13         5.5                  182.1
FPU16       80 (+67%)          12.5       5.3                  189.8
FPU12       85 (+77%)          11.7       5                    201.8
* Operating time = (1 / Max. clock) × 423,163 (total cycles).
** Frame rate = 1000 / operating time.
*** General PCs use the 64-bit FPU.
transfer function can be easily obtained as (4). MATLAB
provides the corresponding commands, tansig and dtansig:

    f(x) = 2 / (1 + e^(−2x)) − 1 = (1 − e^(−2x)) / (1 + e^(−2x)),    (3)

    f′(x) = 1 − f(x)^2,    (4)

where x = input_j in (2).
The activation function can be estimated using different
methods. The Taylor and polynomial methods are effective,
and guarantee the highest speed and accuracy among these
methods. The polynomial method is used to estimate the
activation function in this paper, as seen in (5) and (6),
because it is simpler than the Taylor approximation.
A first-degree polynomial estimation of the activation
function is

    f(x) = 0.75x.    (5)
Figure 5: Block diagram of the design environment: a MATLAB learning path (preprocessing of face and non-face data, NN learning program, save weights) and a MATLAB detection path (preprocessing, NN detector, performance test and verification), with weight and input-data memories feeding the NN detector designed and synthesized in Xilinx ISE and simulated in MODELSIM.
Its first-order derivative is

    f′(x) = 0.75.    (6)

Figure 2 shows the estimation (5) of the activation function (3).
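A quick numerical comparison of (3) and (5) over a few sample points makes the quality of the approximation near the origin visible; the sketch below is purely illustrative.

```python
# Compare the hyperbolic tangent sigmoid (3) with the polynomial estimate (5).
import math

def tansig(x):
    return 2.0 / (1.0 + math.exp(-2.0 * x)) - 1.0   # equation (3)

def poly(x):
    return 0.75 * x                                  # equation (5)

for x in (-1.0, -0.5, 0.0, 0.5, 1.0):
    print(f"x={x:+.1f}  tansig={tansig(x):+.3f}  poly={poly(x):+.3f}")
```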
2.3. FPU Implementation. Figure 3 shows the simplified
block diagram of the implemented neural network in an
FPGA. The module consists of control logic and an FPU.
The implemented FPU is IEEE 754 Standard [15]
compliant. The FPU in this system has two modules: mul-
tiplication and addition. Figure 4 shows the block diagram
of the 5 stage-pipelined FP addition and multiplication unit
implemented in this paper. A commercial IP core, an FP
adder of the LEON processor [16], is used and modified to
make the bit size adjustable. A bit-width reduced floating-
point multiplication unit is designed using a multiplier and
a hard intellectual property (IP) core in an FPGA to improve
speed. Consequently, the multiplication was performed
within 2 cycles of total stages as shown in Figure 4.
2.4. Implementation of the Neural Network-Based FPGA Face
Detector Using MATLAB and VHDL. Figure 5 shows the total
design flow using MATLAB and VHDL. The MATLAB program
consists of two parts: learning and detection. After the
learning procedure, weights data are fixed and saved to a file.
The weights file is saved to a memory model file for FPGA
and VHDL simulation. MATLAB also provides input test
data to the VHDL program and analyzes the result from the
result file of the MODELSIM simulation program. Preprocessing
includes mask, resizing, and normalization.
The Olivetti face database [17] is chosen for this study.
The Olivetti face database consists of mono-color face and
Figure 6: Data classification result of the neural network: neural network output (threshold) versus input data number, with face data (60: #1–#60) and non-face data (160: #61–#220).
Figure 7: Error model for the general neural network. Nodes: i, j, k, l; data: X_i, O_j, O_k, O_l; weights: W_ij, W_jk, W_kl; errors: ε_j, ε_k, ε_l.
Figure 8: First derivative graph of the activation function, f′(x) = 1 − (2/(1 + e^(−2x)) − 1)^2.
Table 3: Area results of the neural network-based FPGA face
detector by different FPUs.
Bit-width No. of Slices No. of FFs No. of LUTs
FPU32 1077 771 1952
FPU24 878 (18.5%) 637 1577
FPU20 750 (30.4%) 569 1356
FPU16 650 (39.7%) 501 1167
FPU12 556 (48.4%) 433 998
Table 4: Area results of 32/24/20/16/12-bit FP adders.
FP adder bit-width   Memory (Kbits)   NN area (Slices)   FP adder area (Slices)
32 3760 1077 486
24 2820 (25%) 878 403 (17%)
20 2350 (37%) 750 300 (38%)
16 1880 (50%) 650 250 (49%)
12 1410 (63%) 556 173 (64%)
non-face images, so it is easy to use. Some other databases,
which are large, in color, or mixed with other pictures, are
difficult to use for this error analysis because of the need for
more preprocessing such as cropping, data classification, and
color model conversion.
Figure 6 shows the classification result of 220 face and
non-face data. X-axis shows the data number of face data
Table 5: Power consumption of the neural network-based FPGA
face detector by the different FPUs (unit: mW).
Bit-width   CLBs   RAM (width)   Multiplier (block)   I/O   Power*
FPU32       2      17 (36)       9 (5)                67    306
FPU24       2      17 (36)       7 (4)                49    286 (6.5%)
FPU20       2      17 (36)       4 (2)                45    279 (8.8%)
FPU16       2      8 (18)        4 (2)                36    261 (14.7%)
FPU12       1      8 (18)        4 (2)                29    253 (17.3%)
* Total power = subtotal + 211 mW (basic power consumption of the FPGA).
Table 6: Comparison of different FP adder architectures (5 pipeline
stages).
Adder type Slices FFs LUTs Max. freq. (MHz)
LEON IP 486 269 905 71.5
LOP 570 (+17%) 294 1052 102 (+42.7%)
2-path 1026 (+111%) 128 1988 200 (+180%)
Table 7: Specifications of the neural network-based FPGA face detector.
Feature                   Specification
FPU bit-width             32, 24, 20, 16, 12
Frequency                 48/58/77/80/85 MHz
Slices (Xilinx Spartan)   1077/878/750/650/556 (FPU32–FPU12)
Arithmetic unit           IEEE 754 single precision with bit-width reduced FPU
Networks                  2 layers (400/300/1 nodes)
Input data size           20 × 20 (400-pixel image)
Operating time            8.7/7.4/5.5/5.3/5 ms/frame
Frame rate                114/136/182/190/201 frames/s
from 1 to 60, and non-face data from 61 to 220. Y-axis shows
the output value of the neural network. The neural network
is trained to pursue the desired value 1 for face and −1
for non-face.
3. Error Analysis of the Neural Network Caused
by Reduced-Precision FPU
3.1. MRRE and ARRE. The number of bits in the FPU is
important for the area, operating speed, and power [13].
Therefore, it is important to decide the minimal number
of bits in floating-point hardware to reduce hardware costs,
yet maintain an application's overall accuracy. A maximum
relative representation error (MRRE) [18] is used as one of
the indices of floating-point arithmetic accuracy, as shown
in Table 1. Note that e and m represent exponent and
mantissa, respectively. The MRRE can be obtained as follows:

    MRRE = (1/2) · ulp · β,    (7)

where ulp is a unit in the last position and β is the exponent
base.
Table 8: Difference between (3) and (5) in face detection rate (MATLAB).
Threshold    0.1    0.2    0.3    0.4    0.5    0.6    0.7    0.8    0.9    1
Tansig (3)   34.09  34.55  37.27  45.91  53.64  61.36  73.09  77.73  75     72.73
Poly (5)     35     39.09  45.91  53.64  62.73  70     72.27  77.27  78.18  77.27
Abs diff     0.91   4.54   8.64   7.73   9.09   8.64   0.82   0.46   3.18   4.54
Avg. error   4.9
Table 9: Detection rate of PC software face detector.
Threshold 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1
Face 60 60 60 53 50 43 29 21 17 10
Rate 100 100 100 88.33 83.33 71.67 48.33 35 28.33 16.67
Non-face 17 26 41 65 88 111 130 149 155 160
Rate 10.625 16.25 25.625 40.625 55 69.375 81.25 93.125 96.875 100
Total 35 39.09 45.91 53.64 62.73 70 72.27 77.27 78.18 77.27
An average relative representation error (ARRE) can be
considered for practical use:

    ARRE = (1 / ln β) · (1/4) · ulp.    (8)
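Evaluating (7) and (8) for the mantissa widths of Table 1 reproduces the tabulated values (for β = 2, ARRE = 0.3607 × ulp). The short sketch below is an illustrative calculation only.

```python
# MRRE and ARRE from (7) and (8) for the FPU formats of Table 1,
# with radix beta = 2 and ulp = 2**(-m) for an m-bit mantissa.
import math

def mrre(mantissa_bits, beta=2):
    ulp = beta ** (-mantissa_bits)
    return 0.5 * ulp * beta              # equation (7)

def arre(mantissa_bits, beta=2):
    ulp = beta ** (-mantissa_bits)
    return ulp / (4.0 * math.log(beta))  # equation (8): (1/ln beta) * (1/4) * ulp

for name, m in [("FPU32", 23), ("FPU24", 17), ("FPU20", 13), ("FPU16", 9), ("FPU12", 5)]:
    print(f"{name}: MRRE={mrre(m):.3e}  ARRE={arre(m):.3e}")
```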
3.2. Output Error Estimation of the Neural Network. The FPU
representation error increases with repetitive multiplication
and addition in the neural network. The difference in output
can be expressed using the following equations with the
notation depicted in Figure 7.
The error of the 1st layer is the difference between the
output obtained with finite-precision arithmetic (O^f_j) and the
ideal output (O_j), and it can be described as

    ε_j = O^f_j − O_j = f( Σ_{i=1} W^f_{ij} X^f_i ) − f( Σ_{i=1} W_{ij} X_i ) + ε_Σ,    (9)

where ε_j represents the hidden layer error (ε_k represents the
total error generated between the hidden layer and the output
layer in Figure 7), W represents the weights, X represents the
input data, and O represents the output of the hidden layer.
ε_Σ is the summation of the other possible errors, defined by

    ε_Σ = ε_f + multiplication error (ε_×) + summation error (ε_+)
          + other calculation errors.    (10)

ε_f is the nonlinear function error of the Taylor estimation; ε_f
is very small and negligible. Therefore, ε_f becomes 0.
Other calculation errors occur when the differential of the
activation is calculated (i.e., f′(x) = 0.75 × sum) and when the
final face determination is calculated (i.e., f(x) = f′(x) + (−0.5)).
The multiplication error, ε_×, is not considered in this
paper. The multiplication unit assigns twice the number of
bits to store the result data. For example, a 16-bit × 16-bit
multiplication needs 32 bits. This large result register reduces
the error, so the ε_× error is negligible.
However, the summation error, ε_+, is not negligible and is
added to the error term, ε_Σ. The multiplication error (ε_×)
and the addition error (ε_+) are bounded by the MRRE
(assuming rounding mode = truncation) as given by (11) and
(12):

    |ε_×| < |multiplication result × (MRRE)|,    (11)

where the negative sign (−) describes the direction of the error.
For example, the ε_× of 4 × 5 = 20 is ε_× = −20 × (MRRE);

    |ε_+| < |addition result × (MRRE)|.    (12)

For example, the ε_+ of 4 + 5 = 9 is ε_+ = −9 × (MRRE).
Note that the maximum error caused by the truncation
rounding scheme is bounded as

    |ε_t| < |x × (MRRE)|,    (13)

while the error caused by the round-to-the-nearest scheme is
bounded as

    |ε_n| < |x × (1/2)(MRRE)|.    (14)

The truncation rounding scheme creates a negative error and
the round-to-nearest scheme creates a positive error. The total
error can be reduced by almost 50% by the round-to-nearest
scheme [18].
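The behaviour of the two rounding schemes can be seen on a toy fixed fraction width; the snippet below is only a demonstration of the general effect, not of the FPU hardware itself.

```python
# Truncation rounds the magnitude down (negative error); round-to-nearest
# roughly halves the worst-case error.
def quantize(x, frac_bits, mode):
    scale = 2 ** frac_bits
    if mode == "truncate":
        return int(x * scale) / scale
    return round(x * scale) / scale     # round-to-nearest

x = 0.7123
for mode in ("truncate", "nearest"):
    q = quantize(x, 5, mode)
    print(f"{mode:9s} q={q:.5f}  error={q - x:+.5f}")
```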
From (9), the terms W^f and X^f are the weight data and the
input data, respectively, including the reduced-precision error.
They are described by W^f = W + ε_W and X^f = X + ε_X.
Therefore, the multiplication of weights and input data is
denoted by W^f X^f = (W + ε_W)(X + ε_X).
Equations (16) and (18) are obtained by applying the
first-order Taylor series approximation as given by [7, 8]:

    f(x + h) − f(x) ≈ h f′(x).    (15)

From (9), the error of the first layer, ε_j, is given by

    ε_j = f( Σ_{i=1} W_{ij} X_i + h_1 Σ_{i=1} W_{ij} X_i )
          − f( Σ_{i=1} W_{ij} X_i ) + ε_+
        ≈ h_1 ( Σ_{i=1} W_{ij} X_i ) f′( Σ_{i=1} W_{ij} X_i ) + ε_+,    (16)
Table 10: Detection rate of reduced-precision FPUs (VHDL).
Threshold 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 Avg. detection rate error
FPU64 (PC) 35 39.09 45.91 53.64 62.73 70 72.27 77.27 78.18 77.27
FPU32 NN 35 39.09 45.91 53.64 62.73 70 72.27 76.82 78.18 77.27 0
FPU24 NN 35 39.09 45.91 53.64 62.73 70 72.27 76.82 78.18 77.27 0
FPU20 NN 35 39.09 46.36 53.64 63.18 70 73.64 76.82 77.73 76.82 0.36
FPU18 NN 35 41.36 47.73 56.82 65.46 69.55 74.55 77.73 77.27 74.09 1.73
FPU16 NN 35.91 44.55 53.18 66.36 70.46 76.36 78.18 74.55 72.73 72.73 5.91
|FPU64-FPU16| 0.91 5.45 7.27 12.73 7.73 6.36 5.91 2.73 5.46 4.55 5.91
where

    h_1 = Σ_{i=1}^{n} ( ε_{X_i}/X_i + ε_{W_ij}/W_{ij}
          + (ε_{W_ij} ε_{X_i})/(W_{ij} X_i) ).    (17)
The error of the second layer can also be found as

    ε_k = O^f_k − O_k = f( Σ_{j=1} O^f_j W^f_{jk} ) − f( Σ_{j=1} O_j W_{jk} ) + ε_+.    (18)

By replacing O^f_j and W^f_{jk} with (O_j + ε_j) and (W_{jk} + ε_{W_jk}),
(18) becomes

    ε_k = f( Σ_{j=1} (O_j + ε_j)(W_{jk} + ε_{W_jk}) ) − f( Σ_{j=1} O_j W_{jk} ) + ε_+.    (19)

Simply,

    ε_k = f( Σ_{j=1} O_j W_{jk} + h_2 Σ_{j=1} O_j W_{jk} )
          − f( Σ_{j=1} O_j W_{jk} ) + ε_+
        ≈ h_2 ( Σ_{j=1} O_j W_{jk} ) f′( Σ_{j=1} O_j W_{jk} ) + ε_+,    (20)

where

    h_2 = Σ_{j=1}^{m} ( ε_{W_jk} O_j + ε_j W_{jk} + ε_{W_jk} ε_j ) / ( O_j W_{jk} ).    (21)

Keeping only the first-order error terms,

    ε_k ≈ ( Σ_{j=1} ( ε_{W_jk} O_j + ε_j W_{jk} ) ) f′( Σ_{j=1} O_j W_{jk} ) + ε_+.    (22)

The error (22) can be generalized for the lth layer, ε_l, in a
similar way:

    ε_l ≈ ( Σ_{k=1} ( ε_{W_kl} O_k + ε_k W_{kl} ) ) f′( Σ_{k=1} O_k W_{kl} ) + ε_+.    (23)
3.3. Output Error Estimation by MRRE and ARRE. The error
equation can be rewritten using the MRRE in the error term
to find the maximum output error caused by reduced
precision. The average error can be estimated in a practical
application by replacing the MRRE with ARRE = (0.3607 ×
MRRE).
From (16), the output error of the first layer is described
as

    ε_j ≈ ( Σ_{i=1} ( ε_W X + ε_X W ) ) f′( Σ_{i=1} W_{ij} X_i ) + ε_+.    (24)

The ε_W(max) and ε_X(max) terms can be defined by
ε_W(max) = W × MRRE and ε_X(max) = X × MRRE. Thus,
from (24), the error ε_j is bounded so that

    ε_j < | Σ_{i=1} ( (W_{ij} × MRRE) X_i + (X_i × MRRE) W_{ij} )
            f′( Σ_{i=1} W_{ij} X_i ) + ε_+ |;    (25)

    ε_j < | 2 MRRE ( Σ_{i=1} W_{ij} X_i ) f′( Σ_{i=1} W_{ij} X_i ) + ε_+ |,    (26)

where

    ε_+ ≤ Σ_{i=1}^{n} | X_i W_{ij} | × MRRE.    (27)
Finally, the output error of the second layer, ε_k, is also
described from (22) as shown in (28), where the maximum
error of the weights can also be written as ε_{W_jk}(max) =
W_{jk} × MRRE:

    ε_k < | Σ_{j=1} ( (W_{jk} × MRRE) O_j + ε_j W_{jk} )
            f′( Σ_{j=1} O_j W_{jk} ) + ε_+ |,    (28)

where

    ε_+ ≤ Σ_{j=1}^{m} | O_j W_{jk} | × MRRE.    (29)
Table 11: Results of output error on a neural network-based FPGA
face detector.
Bit-width   Calculation (MRRE)   Calculation (ARRE)   Experiment (max)
FPU32 4E-05 2.89E-05 1.93E-05
FPU24 0.0026 0.0018 0.0012
FPU20 0.0410 0.0296 0.0192
FPU18 0.1641 0.1184 0.0766
FPU16 0.6560 0.4733 0.2816
FPU14 2.62 1.891 0.9872
FPU12 10.4 7.5256 1.0741
3.4. Relationship between MRRE and Output Error. In order
to observe the relationship between the MRRE and the output
error, (28) is written again as (30). By using (26),

    ε_k < | MRRE × A × f′( Σ_{j=1} O_j W_{jk} ) + ε_+ |,    (30)

where

    A = Σ_{j=1}^{m} ( W_{jk} O_j
        + f′( Σ_{i=1} W_{ij} X_i ) ( Σ_{i=1} W_{ij} X_i ) W_{jk} ).    (31)

Some properties can be derived from (26) and (30) for the
output error. The differential of the summations affects the
output error proportionally:

    ε_j ∝ f′( Σ_{i=1} W_{ij} X_i ), from (24),
    ε_k ∝ f′( Σ_{j=1} O_j W_{jk} ), from (26).    (32)

The output of a well-learned neural network system
approaches the desired value, as shown in Figure 2. In that
case, the differential value goes to 0, as shown in Figure 8. This
means that a well-learned neural network system has less
output error. One more finding is that the output error is also
proportional to the MRRE.
From (30),

    ε_k ∝ MRRE,    (33)

where MRRE = 2^(−ulp) (assuming rounding mode = truncation).
Therefore, (33) can be described as

    ε_k ∝ 2^(−ulp).    (34)

Finally, it is concluded that an n-bit reduction in the FPU
creates 2^n times the error. If one bit is reduced, for example,
the output error is doubled (e.g., 2^1 = 2).

Figure 9: Comparison between analytical output errors and
experimental output errors versus FPU bits: calculation (MRRE),
calculation (ARRE), experiment (max), and experiment (mean).

Figure 10: Comparison between analytical output errors and
experimental output errors (log2 scale).

After putting
the MRRE between FPU32 and other reduced precision FPU
bits into error terms in (26) and (28) using MATLAB and real
face data, finally, the total accumulated error of the neural
network is obtained as shown in Table 11.
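The 2^n behaviour can also be observed with a rough, software-only emulation of a bit-width reduced FPU; the helper below (truncate_mantissa) is a hypothetical illustration, not the VHDL design, and the random weights and inputs are only placeholders for real face data.

```python
# Emulate a reduced-precision FPU by keeping roughly f fraction bits of the
# mantissa, and watch the accumulated MAC error grow as bits are removed.
import math, random

def truncate_mantissa(x, frac_bits):
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    scale = 2.0 ** (frac_bits - e)
    return math.floor(x * scale) / scale   # keep about frac_bits mantissa bits

random.seed(0)
w = [random.uniform(-1, 1) for _ in range(400)]
x = [random.uniform(-1, 1) for _ in range(400)]
exact = sum(wi * xi for wi, xi in zip(w, x))

for f in (23, 17, 13, 9, 5):
    approx = sum(truncate_mantissa(truncate_mantissa(wi, f) * truncate_mantissa(xi, f), f)
                 for wi, xi in zip(w, x))
    print(f"f={f:2d} bits: |error| = {abs(approx - exact):.3e}")
```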
4. Result and Discussion
4.1. FPGA Synthesis Results. The FPGA-based face detector
using the neural network and the reduced-precision, FPU, is
implemented in this paper. The logic circuits of the neural
network-based FPGA face detector are synthesized using the
FPGA design tool, Xilinx ISE on a Spartan-3 XC3S4000 [19].
To verify the error model, first of all, the neural network on
a PC is designed using MATLAB. Next, the weights and test-
bench data are saved as a le to verify the VHDL code.
After simulation, area and operating speed are obtained
by synthesizing the logic circuits. The FPU uses the same
calculation method, floating-point arithmetic, as the PC, so
it is easy to verify and easy to change the neural network's
structure.
4.1.1. Timing. The implemented neural network-based
FPGA face detector (FPU16) took only 5.3 milliseconds to
process 1 frame at 80 MHz, which is 9 times faster than the
50 milliseconds (i.e., 40 milliseconds for loading time +
10 milliseconds for calculation time) required by a PC
(Pentium 4, 1.4 GHz), as shown in Table 2. As the total FPU
representation bits decrease, the maximum clock frequency
increases considerably, from 21% (FPU24) to 67% (FPU16)
compared to FPU32.
The remaining question is to examine whether a bit-width
reduced FPU can still maintain the face detector's overall
accuracy. For this purpose, the detection rate error for bit-width
reduced FPUs will be discussed in Section 4.2.2.
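The per-frame times and frame rates in Table 2 follow from the arithmetic given in the table footnote; the short calculation below simply reproduces it for the five bit-widths.

```python
# Operating time = total cycles / clock; frame rate = 1000 / time_ms (Table 2).
TOTAL_CYCLES = 423_163   # total cycles per frame, from the Table 2 footnote

for name, clock_mhz in [("FPU32", 48), ("FPU24", 58), ("FPU20", 77),
                        ("FPU16", 80), ("FPU12", 85)]:
    time_ms = TOTAL_CYCLES / (clock_mhz * 1e6) * 1e3
    print(f"{name}: {time_ms:.1f} ms/frame, {1000.0 / time_ms:.0f} frames/s")
```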
4.1.2. Area. As shown in Table 3, only 2% (650/27648)
and 4% (1077/27648) of the total available slices (3S4000)
are used for FPU16 and FPU32, respectively. Therefore,
the stand-alone embedded face detection system including
preprocessing, FPU, and other extra functions can be easily
implemented on a small inexpensive FPGA.
As the bit-width decreases, the number of slices is
decreased from 18.5% (FPU24) to 39.7% (FPU16) compared
to FPU32.
Bit reduction of the FP adder leads to an area reduction
and a faster operating clock speed. For example, a 50% bit
reduction from FP adder 32 to FP adder 16 results in a 49%
reduction of the adder area (250/486) and a 50% reduction
of the memory (1880/3760) as shown in Table 4. It is possible
to use the XILINX FPGA 3S4000 which provides the size
of 2160 Kbits memory (block RAM: 1728 Kb, distributed
memory: 432 Kb) when the FPU16 is necessary.
The number of slices of the floating-point adder varies
from 31% (FP12: 173/556) to 45% (FP32: 486/1077) of the
total size of the neural network as shown in Table 4.
4.1.3. Power. The results of power consumption are shown in
Table 5. The power consumptions are obtained using Xilinx
Web Power Tool [20].
As the bit-width decreases, the power consumption
decreases. For example, bit reduction from the FPU32 to the
FPU16 reduces the total power by 14.7% (FPU32: 306 mW,
FPU16: 261 mW) through RAM, multiplier, and I/O as
shown in Table 5.
Changes in the logic cells do not affect the power as much as
the hardwired IP, such as the memory and multiplier, which
consume most of the power. See the number of configurable
logic blocks (CLBs) in Table 5.
4.1.4. Architectures of FP Adder. The neural network system
and the FPU hardware performance are greatly affected by
the FP addition [21]. The bit-width reduced FP addition
is modied for this study from the commercial IP, LEON
processor. LEON FPU uses standard adder architecture [16].
The system performance and the clock speed can be further
improved by leading-one-prediction (LOP) algorithm and 2-
path (close-and-far path) algorithm, respectively [18].
In our experiment, FP addition based on the LOP
algorithm increases the maximum clock frequency by 42.7%
compared to the performance of the commercial IP, LEON.
The FP addition based on the 2-path algorithm [18, 22]
increases the area by 111%, but improves the maximum
clock frequency by 180% compared to the performance of
the commercial IP, LEON as shown in Table 6.
4.1.5. Specification. The specification of the implemented
neural network-based FPGA face detector is summarized in
Table 7.
4.2. Detection Rate Error. Two factors affect the detection
rate error. One is the polynomial estimation error, shown in
Figure 2, which occurs when the activation function is
estimated through the polynomial equation. The other is the
error caused by the bit-width reduced FPU.
4.2.1. Detection Rate Error by Polynomial Estimation. To
reduce the error caused by polynomial estimation, the
polynomial equation (35) can be more elaborately modified
as shown in (36). The problem with (36) is that it is not
differentiable at ±1, and also the error (30) will be identically
0 (i.e., f′(sum) = (±1)′ = 0) for |sum| > 1, which makes the
error analysis difficult:

    f(sum) = 0.75 × sum,    (35)

    f(sum) = 0.75 × sum,   |sum| < 1,
           = 1,            sum ≥ 1,
           = −1,           sum ≤ −1.    (36)
Therefore, the simplified polynomial (35) is used in this
paper. It is confirmed through MATLAB simulation that this
polynomial approximation results in an average 4.9% error
in the detection rate compared with the result of (3) in our
experiment as shown in Table 8.
4.2.2. Detection Rate Error by Reduced-Precision FPU. Table 9
is obtained after the threshold value is changed from 0.1 to 1.
When the threshold is 0.5, the face detection rate is 83% and
the non-face detection rate is 55%. When the threshold is 0.6,
the face and non-face detection rates are almost the same, at
71.67% and 69.4%, respectively. As the threshold value goes to
1 (i.e., as the horizontal red line in Figure 6 moves up), the face
detection rate decreases. This means an input image is more
difficult to pass, which is good for security. Therefore, the
threshold needs to be chosen according to the application. The
result of Table 9 is used in the second column (FPU64 (PC))
of Table 10.
Table 10 shows the detection rate error (i.e., |detection
rate of FPU64 (PC software) − detection rate of the reduced-
precision FPU|) caused by reduced-precision FPUs. The
detection rate is changed from FPU64 (PC) to FPU16 by only
5.91% (i.e., |72.27 − 78.18|).
Table 11 and Figure 9 show the output error (|neural
network output on the PC − output of the VHDL design|).
Figure 10 is the log (base 2) graph of Figure 9.
Analytical results are found to be in agreement with
simulation results as shown in Figure 10. The analytical
MRRE results and the maximum experimental results show
conformity of shape. The analytical ARRE results and the
minimum experimental results also show conformity of
shape.
As the n bits in the FPU are reduced within the range
from 32 bits to 14 bits, the output error is incremented by 2^n
times. For example, a 2-bit reduction from FPU16 to FPU14
makes 4 times (2^(16−14) = 4) the error.
Due to the small number of fraction bits (e.g., 5 bits
in FPU12), no meaningful results are obtained under 14
bits. Therefore, at least 14 bits should be employed to
achieve an acceptable face detection rate. See Figures 9
and 10.
5. Conclusion
In this paper, the analytical error model was developed
using the maximum relative representation error (MRRE)
and average relative representation error (ARRE) to obtain
the maximum and average output errors for the bit-width
reduced FPUs.
After the development of the analytical error model,
the bit-width reduced FPUs, and the neural network were
designed using MATLAB and VHDL. Finally, the analytical
(MATLAB) results with the experimental (VHDL) results
were compared.
The analytical results and the experimental results
showed conformity of shape. According to both results, as
the n bits in FPU are reduced within the ranges from 32 bits
to 14 bits, the output error is incremented by 2
n
times.
The operating speed was significantly improved in the
FPGA-based face detector implementation using a reduced-
precision FPU. For example, it took only 5.3 milliseconds
in the FPU16 to process one frame which is 9 times faster
than 50 milliseconds (40 milliseconds for loading time +10
milliseconds for calculation time) of the PC (Pentium 4,
1.4 GHz). It was found that bit reduction from FPU 32 bits
to FPU16 bits reduced the size of memory and arithmetic
units by 50% and the total power consumption by 14.7%,
while still maintaining 94.1% face detection accuracy. The
developed error analysis for bit-width reduced FPUs will
be helpful to determine the specication for an embedded
neural network hardware system.
Acknowledgments
The authors would like to acknowledge the Natural Science
and Engineering Research Council of Canada (NSERC) / the
University of Saskatchewan's Publications Fund, the Korea
Research Foundation, and a Korean Federation of Science
and Technology Societies grant funded by the South Korean
government (MOEHRD, Basic Research Promotion Fund)
for supporting this research and to thank the reviewers for
their valuable suggestions.
References
[1] M. Skrbek, Fast neural network implementation, Neural
Network World, vol. 9, no. 5, pp. 375391, 1999.
[2] D. E. Rumelhart and J. L. McClelland, Parallel Distributed
Processing: Explorations in the Microstructure of Cognition, vol.
1, MIT Press, Cambridge, Mass, USA, 1986.
[3] X. Li, M. Moussa, and S. Areibi, Arithmetic formats for
implementing artificial neural networks on FPGAs, Canadian
Journal of Electrical and Computer Engineering, vol. 31, no. 1,
pp. 3040, 2006.
[4] H. K. Brown, D. D. Cross, and A. G. Whittaker, Neural
network number systems, in Proceedings of International Joint
Conference on Neural Networks (IJCNN 90), vol. 3, pp. 903
908, San Diego, Calif, USA, June 1990.
[5] J. Kontro, K. Kalliojarvi, and Y. Neuvo, Use of short floating-
point formats in audio applications, IEEE Transactions on
Consumer Electronics, vol. 38, no. 3, pp. 200207, 1992.
[6] J. Tong, D. Nagle, and R. Rutenbar, Reducing power by
optimizing the necessary precision/range of floating-point
arithmetic, IEEE Transactions on VLSI Systems, vol. 8, no. 3,
pp. 273286, 2000.
[7] J. L. Holt and J.-N. Hwang, Finite precision error analysis
of neural network hardware implementations, IEEE Transac-
tions on Computers, vol. 42, no. 3, pp. 281290, 1993.
[8] S. Sen, W. Robertson, and W. J. Phillips, The effects of
reduced precision bit lengths on feed forward neural networks
for speech recognition, in Proceedings of IEEE International
Conference on Neural Networks, vol. 4, pp. 19861991, Wash-
ington, DC, USA, June 1996.
[9] R. Feraud, O. J. Bernier, J.-E. Viallet, and M. Collobert, A fast
and accurate face detector based on neural networks, IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol.
23, no. 1, pp. 4253, 2001.
[10] H. A. Rowley, S. Baluja, and T. Kanade, Neural network-
based face detection, IEEE Transactions on Pattern Analysis
and Machine Intelligence, vol. 20, no. 1, pp. 2338, 1998.
[11] T. Theocharides, G. Link, N. Vijaykrishnan, M. J. Irwin, and
W. Wolf, Embedded hardware face detection, in Proceedings
of the 17th IEEE International Conference on VLSI Design, pp.
133138, Mumbai, India, January 2004.
[12] M. Sadri, N. Shams, M. Rahmaty, et al., An FPGA based fast
face detector, in Global Signal Processing Expo and Conference
(GSPX 04), Santa Clara, Calif, USA, September 2004.
[13] Y. Lee and S.-B. Ko, FPGA implementation of a face detector
using neural networks, in Canadian Conference on Electrical
and Computer Engineering (CCECE 07), pp. 19141917,
Ottawa, Canada, May 2006.
[14] D. Chester, Why two hidden layers are better than one,
in Proceedings of International Joint Conference on Neural
Networks (IJCNN 90), vol. 1, pp. 265268, Washington, DC,
USA, January 1990.
[15] IEEE Std 754-1985, IEEE standard for binary floating-point
arithmetic, Standards Committee of the IEEE Computer
Society, New York, NY, USA, August 1985.
[16] LEON Processor, http://www.gaisler.com.
[17] Olivetti & Oracle Research Laboratory, The Olivetti & Oracle
Research Laboratory Face Database of Faces,
http://www.cam-orl.co.uk/facedatabase.html.
[18] I. Koren, Computer Arithmetic Algorithms, A K Peters, Natick,
Mass, USA, 2nd edition, 2001.
[19] XILINX, Spartan-3 FPGA Family Complete Data Sheet,
Product Specication, April 2008.
[20] XILINX Spartan-3 Web Power Tool Version 8.1.01, http://
www.xilinx.com/cgi-bin/power tool/power Spartan3.
[21] G. Govindu, L. Zhuo, S. Choi, and V. Prasanna, Analysis
of high-performance floating-point arithmetic on FPGAs, in
Proceedings of the 18th International Parallel and Distributed
Processing Symposium (IPDPS 04), pp. 149–156, Santa Fe,
NM, USA, April 2004.
[22] A. Malik, Design trade-off analysis of floating-point adder in
FPGAs, M.S. thesis, Department of Electrical and Computer
Engineering, University of Saskatchewan, Saskatoon, Canada,
2005.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 382983, 13 pages
doi:10.1155/2009/382983
Research Article
Accelerating Seismic Computations Using Customized Number
Representations on FPGAs
Haohuan Fu,1 William Osborne,1 Robert G. Clapp,2 Oskar Mencer,1 and Wayne Luk1
1 Department of Computing, Imperial College London, London SW7 2AZ, UK
2 Department of Geophysics, Stanford University, CA 94305, USA
Correspondence should be addressed to Haohuan Fu, haohuan@gmail.com
Received 31 July 2008; Accepted 13 November 2008
Recommended by Vinay Sriram
The oil and gas industry has an increasingly large demand for high-performance computation over huge volumes of data.
Compared to common processors, field-programmable gate arrays (FPGAs) can boost the computation performance with a
streaming computation architecture and the support for application-specific number representation. With hardware support
for reconfigurable number format and bit width, reduced precision can greatly decrease the area cost and I/O bandwidth of
the design, thus multiplying the performance with concurrent processing cores on an FPGA. In this paper, we present a tool to
determine the minimum number precision that still provides acceptable accuracy for seismic applications. By using the minimized
number format, we implement core algorithms in seismic applications (the FK step in forward continued-based migration and 3D
convolution in reverse time migration) on FPGA and show speedups ranging from 5 to 7 by including the transfer time to and from
the processors. Provided sufficient bandwidth between CPU and FPGA, we show that a further increase to 48X speedup is possible.
Copyright 2009 Haohuan Fu et al. This is an open access article distributed under the Creative Commons Attribution License,
which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Seismic imaging applications in the oil and gas industry involve terabytes of data collected from fields. For each data sample, the imaging algorithm usually tries to improve the image quality by performing more costly computations. Thus, there is an increasingly large demand for high-performance computation over huge volumes of data. Among all the different kinds of imaging algorithms, downward continued-based migration [1] is the most prevalent high-end imaging technique today, and reverse time migration appears to be one of the dominant imaging techniques of the future.
Compared to conventional microprocessors, FPGAs apply a different streaming computation architecture. Computations we want to perform are mapped into circuit units on the FPGA board. Previous work has already achieved 20X acceleration for prestack Kirchhoff time migration [2] and 40X acceleration for subsurface offset gathers [3].
Besides the capability of performing computations in a parallel way, FPGAs also support application-specific number representations. Since all the processing units and connections on the FPGA are reconfigurable, we can use different number representations, such as fixed-point, floating-point, logarithmic number system (LNS), residue number system (RNS), and so forth, with different bit-width settings. Different number representations lead to different complexity of the arithmetic units, thus different costs and performance of the resulting circuit design [4]. Switching to a number representation that fits a given application better can sometimes greatly improve the performance or reduce the cost.
A simple case of switching number representations is to trade off precision of the number representation against the speed of the computation. For example, by reducing the precision from 32-bit floating-point to 16-bit fixed-point, the number of arithmetic units that fit into the same area can be increased by scores of times. The performance of the application is also improved significantly. Meanwhile, we also need to watch for the possible degradation of accuracy in the computation results. We need to check whether the accelerated computation using reduced precision is still generating meaningful results.
To solve the above problem in the seismic application domain, we develop a tool that performs an automated precision exploration of different number formats, and figures out the minimum precision that can still generate good enough
Table 1
Integer part: m bits            Fractional part: f bits
x_{m-1} x_{m-2} ... x_0         x_{-1} ... x_{-f+1} x_{-f}

Table 2
Sign: 1 bit    Exponent: m bits    Mantissa: f bits
S              M                   F
seismic results. By using the minimized number format, we implement core algorithms in seismic applications (complex exponential step in forward continued-based migration and 3D convolution in reverse time migration) on FPGA and show speedups ranging from 5 to 7 by including the transfer time to and from the processors. Provided sufficient bandwidth between CPU and FPGA, we show that a further increase to 48X speedup is possible.
2. Background
2.1. Number Representation. As mentioned in Section 1, precision and range are key resources to be traded off against the performance of a computation. In this work, we look at two different types of number representation: fixed-point and floating-point.

Fixed-Point Numbers. The fixed-point number has two parts, the integer part and the fractional part. It is in a format as shown in Table 1. When it uses a sign-magnitude format (the first bit defines the sign of the number), its value is given by

value = (-1)^{x_{m-1}} \sum_{i=-f}^{m-2} x_i \cdot 2^i.

It may also use a two's-complement format to indicate the sign.

Floating-Point Numbers. According to the IEEE-754 standard, floating-point numbers can be divided into three parts: the sign bit, the exponent, and the mantissa, shown as in Table 2. Their values are given by

value = (-1)^S \times 1.F \times 2^{M - bias}.

The sign bit defines the sign of the number. The exponent part uses a biased format: its value equals the sum of the original value and the bias, which is defined as 2^{m-1} - 1. The extreme values of the exponent (0 and 2^m - 1) are used for special cases, such as values of zero and infinity. The mantissa is an unsigned fractional number, with an implied 1 to the left of the radix point.
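As an illustration of the two formats above, the following C++ fragment (a minimal sketch of our own; the helper names and parameter choices are not part of any tool described in this paper) computes the value of a sign-magnitude fixed-point pattern and of a custom floating-point pattern with an m-bit biased exponent and f-bit mantissa.

#include <cstdint>
#include <cmath>

// Value of a sign-magnitude fixed-point number with m integer bits
// (including the sign bit) and f fractional bits, stored in 'bits'.
double fixed_value(uint64_t bits, int m, int f) {
    uint64_t mag = bits & ((1ULL << (m + f - 1)) - 1);   // drop the sign bit
    int sign = (bits >> (m + f - 1)) & 1;
    double v = static_cast<double>(mag) * std::pow(2.0, -f);
    return sign ? -v : v;
}

// Value of a custom floating-point number: 1 sign bit, m exponent bits
// (biased by 2^(m-1) - 1), f mantissa bits with an implied leading 1.
// The special exponent codes (0 and all ones) are not handled here.
double float_value(int sign, uint64_t exponent, uint64_t mantissa, int m, int f) {
    int bias = (1 << (m - 1)) - 1;
    double frac = 1.0 + static_cast<double>(mantissa) * std::pow(2.0, -f);
    double v = frac * std::pow(2.0, static_cast<double>(exponent) - bias);
    return sign ? -v : v;
}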
2.2. Hardware Compilation Tool. We use a stream compiler (ASC) [5] as our hardware compilation tool to develop a range of different solutions for seismic applications. ASC was developed following research at Stanford University and Bell Labs, and is now commercialized by Maxeler Technologies. ASC enables the use of FPGAs as highly parallel stream processors. ASC is a C-like programming environment for FPGAs. ASC code makes use of C++ syntax and ASC semantics which allow the user to program on the architecture level, the arithmetic level, and the gate level. ASC provides the productivity of high-level hardware design
// ASC code starts here
STREAM_START;
// Hardware Variable Declarations
HWint in (IN);
HWint out (OUT);
HWint tmp (TMP);
STREAM_LOOP (16);
tmp = (in << 1) + 55;
out = tmp;
// ASC code ends here
STREAM_END;

Algorithm 1: A simple ASC example.
tools and the performance of low-level optimized hardware design. On the arithmetic level, PAM-Blox II provides an interface for custom arithmetic optimization. On the higher level, ASC provides types and operators to enable research on custom data representation and arithmetic. ASC hardware types are HWint, HWfix, and HWfloat. Utilizing these data types, we build libraries such as a function evaluation library or develop special circuits to solve particular computational problems such as graph algorithms. Algorithm 1 shows a simple example of an ASC description for a stream architecture that doubles the input and adds 55.
The ASC code segment shows HWint variables and the familiar C syntax for equations and assignments. Compiling this program with gcc and running it creates a netlist which can be transformed into a configuration bitstream for an FPGA.
2.3. Precision Analysis. There exist a number of research projects that focus on precision analysis, most of which are static methods that operate on the computational flow of the design and use techniques based on range and error propagation to perform the analysis.
Lee et al. [6] present a static precision analysis technique which uses affine arithmetic to derive an error model of the design and applies simulated annealing to find out minimum bit widths to satisfy the given error requirement. A similar approach is shown in a bit-width optimization tool called Précis [7].
These techniques are able to perform an automated precision analysis of the design and provide optimized bit widths for the variables. However, they are not quite suitable for seismic imaging algorithms. The first reason is that seismic imaging algorithms usually involve numerous iterations, which can lead to overestimation of the error bounds and derive a meaningless error function. Secondly, the computation in the seismic algorithms does not have a clear error requirement. We can only judge the accuracy of the computation from the generated seismic image. Therefore, we choose to use a dynamic simulation method to explore different precisions, detailed in Sections 3.3 and 3.4.
2.4. Computation Bottlenecks in Seismic Applications. Downward-continued migration comes in various flavors including common azimuth migration [8], shot profile migration, source-receiver migration, plane-wave or delayed shot migration, and narrow azimuth migration. Depending on the flavor of the downward continuation algorithm, there are four potential computation bottlenecks.
(i) In many cases, the dominant cost is the FFT step. The dimensionality of the FFT varies from 1D (tilted plane-wave migration [9]) to 4D (narrow azimuth migration [10]). The FFT cost is often dominant due to its n log(n) cost ratio, n being the number of points in the transform, and the noncache-friendly nature of multidimensional FFTs.
(ii) The FK step, which involves evaluating (or looking up) a square root function and performing a complex exponential, is a second potential bottleneck. The high operational count per sample can eat up significant cycles.
(iii) The FX step, which involves a complex exponential, or sine/cosine multiplication, has a similar, but computationally less demanding, profile. Subsurface offset gathers for shot profile or plane-wave migration, particularly 3D subsurface offset gathers, can be an overwhelming cost. The large op-count per sample and the noncache-friendly nature of the data usage pattern can be problematic.
(iv) For finite difference-based schemes, a significant convolution cost can be involved.
The primary bottleneck of reverse time migration is applying the finite-difference stencil. In addition to the large operation count (5 to 31 samples per cell), the access pattern has poor cache behavior for real size problems. Beyond applying the 3D stencil, the next most dominant cost is implementing damping boundary conditions. Methods such as perfectly matched layers (PMLs) can be costly [11]. Finally, if you want to use reverse time migration for velocity analysis, subsurface offset gathers need to be generated. The same cost profile that exists in downward continued-based migration exists for reverse time migration.
In this paper, we focus on two of the above computation bottlenecks: one is the FK step in forward continued-based migration, which includes a square root function and a complex exponential operation; the other one is the 3D convolution in reverse time migration. We perform automated precision exploration of these two computation cores, so as to figure out the minimum precision that can still generate accurate enough seismic images.
3. A Tool for Number Representation
Exploration
FPGA-based implementations have the advantage over current software-based implementations of being able to use customizable number representations in their circuit designs. On a software platform, users are usually constrained to a few fixed number representations, such as 32/64-bit integers and single/double-precision floating-point, while the reconfigurable logic and connections on an FPGA enable the users to explore various kinds of number formats with arbitrary bit widths. Furthermore, users are also able to design the arithmetic operations for these customized number representations, and thus can provide a highly customized solution for a given problem.
In general, to provide a customized number representation for an application, we need to determine the following three things.
(i) Format of the Number Representation. There are existing FPGA applications using fixed-point, floating-point, and logarithmic number system (LNS) [12]. Each of the three number representations has its own advantages and disadvantages over the others. For instance, fixed-point has simple arithmetic implementations, while floating-point and LNS provide a wide representation range. It is usually not possible to figure out the optimal format directly. Exploration is needed to guide the selection.
(ii) Bit Widths of Variables. This problem is generally referred to as bit-width or word-length optimization [6, 13]. We can further divide this into two different parts: range analysis considers the problem of ensuring that a given variable inside a design has a sufficient number of bits to represent the range of the numbers, while in precision analysis, the objective is to find the minimum number of precision bits for the variables in the design such that the output precision requirements of the design are met.
(iii) Design of the Arithmetic Units. The arithmetic operations of each number system are quite different. For instance, in LNS, multiplication, division, and exponential operations become as simple as addition or shift operations, while addition and subtraction become nonlinear functions to approximate. The arithmetic operations of regular data formats, such as fixed-point and floating-point, also have different algorithms with different design characteristics. On the other hand, evaluation of elementary functions plays a large part in seismic applications (trigonometric and exponential functions). Different evaluation methods and configurations can be used to produce evaluation units with different accuracies and performance.
This section presents our tool that tries to figure out the above three design options by exploring all the possible number representations. The tool is partly based on our previous work on bit-width optimization [6] and comparison between different number representations [14, 15].
Figure 1 shows our basic workflow to explore different number representations for a seismic application. We manually partition the Fortran program into two parts: one part
Figure 1: Basic steps to achieve a hardware design with customized number representations. The workflow: manually partition the Fortran program for seismic processing into Fortran code executing on processors and Fortran code targeting an FPGA; profile the FPGA-targeted code to obtain range information (max/min values) and distribution information; map it to a circuit design (arithmetic operations and function evaluation); translate the circuit design description into bit-accurate simulation code, giving a value simulator with reconfigurable settings; explore different configurations (number representations, bit-width values, etc.); and produce the final design with a customized number representation.
runs on CPUs and we try to accelerate the other part (target code) on FPGAs. The partition is based on two metrics: (1) the target code shall consume a large portion of processing time in the entire program, otherwise the acceleration does not bring enough performance improvement to the entire application; (2) the target code shall be suitable for a streaming implementation on FPGA, and thus highly probable to accelerate. After partition, the first step is to profile the target code to acquire information about the range of values and their distribution that each variable can take. In the second step, based on the range information, we map the Fortran code into a hardware design described in ASC format, which includes implementation of arithmetic operations and function evaluation. In the third step, the ASC description is translated into bit-accurate simulation code, and merged into the original Fortran program to provide a value simulator for the original application. Using this value simulator, explorations can be performed with configurable settings such as different number representations, different bit widths, and different arithmetic algorithms. Based on the exploration results, we can determine the optimal number format for this application with regard to certain metrics such as circuit area and performance.

3.1. Range Profiling. In the profiling stage, the major objective is to collect range and distribution information for the variables. The idea of our approach is to instrument every target variable in the code, adding function calls to
Figure 2: Four points to record in the profiling of range information (a < b < 0 < c < d).
initialize data structures for recording range information and to modify the recorded information when the variable value changes.
For the range information of the target variables (variables to map into the circuit design), we keep a record of four specific points on the axis, shown in Figure 2. The points a and d represent the values far away from zero, that is, the maximum absolute values that need to be represented. Based on their values, the integer bit width of fixed-point numbers can be determined. Points b and c represent the values close to zero, that is, the minimum absolute values that need to be represented. Using both the minimum and maximum values, the exponent bit width of floating-point numbers can be determined.
For the distribution information of each target variable, we keep a number of buckets to store the frequency of values at different intervals. Figure 3 shows the distribution information recorded for the real part of variable wd (a complex variable). In each interval, the frequency of positive and negative values is recorded separately. The results show that, for the real part of variable wd, in each interval, the frequencies of positive and negative values are quite similar, and the major distribution of the values falls into the range 10^1 to 10^4.
The distribution information provides a rough metric for the users to make an initial guess about which number representations to use. If the values of the variables cover a wide range, floating-point and LNS number formats are usually more suitable. Otherwise, fixed-point numbers shall be enough to handle the range.
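The instrumentation described above can be pictured with the following sketch (an illustrative helper of our own, not the actual profiling code of the tool): each recorded update keeps the extreme absolute values and a set of logarithmic buckets, split by sign.

#include <cmath>
#include <cfloat>
#include <map>

// Per-variable range/distribution record, updated on every assignment.
struct RangeRecord {
    double max_abs = 0.0;        // values farthest from zero (points a/d)
    double min_abs = DBL_MAX;    // nonzero values closest to zero (points b/c)
    long   zeros = 0;
    std::map<int, long> pos_buckets, neg_buckets;  // frequency per decade

    void update(double x) {
        double a = std::fabs(x);
        if (a == 0.0) { ++zeros; return; }
        if (a > max_abs) max_abs = a;
        if (a < min_abs) min_abs = a;
        // bucket b holds values in (10^(b-1), 10^b]
        int bucket = static_cast<int>(std::ceil(std::log10(a)));
        (x > 0 ? pos_buckets : neg_buckets)[bucket]++;
    }
};

From max_abs the integer bit width of a fixed-point format follows directly, and from max_abs and min_abs together the exponent width of a floating-point format can be estimated.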
3.2. Circuit Design: Basic Arithmetic and Elementary Function Evaluation. After profiling range information for the variables in the target code, the second step is to map the code into a circuit design described in ASC. As a high-level FPGA programming language, ASC provides hardware data types, such as HWint, HWfix, and HWfloat. Users can specify the bit-width values for hardware variables, and ASC automatically generates corresponding arithmetic units for the specified bit widths. It also provides configurable options to specify different optimization modes, such as AREA, LATENCY, and THROUGHPUT. In the THROUGHPUT optimization mode, ASC automatically generates a fully pipelined circuit. These features make ASC an ideal hardware compilation tool for retargeting a piece of software code onto the FPGA hardware platform.
With support for fixed-point and floating-point arithmetic operations, the target Fortran code can be transformed into ASC C++ code in a straightforward manner. We also have interfaces provided by ASC to modify the internal settings of these arithmetic units.
Besides basic arithmetic operations, evaluation of elementary functions takes a large part in seismic applications.
Figure 3: Range distribution of the real part of variable wd, showing the frequency of positive and negative values per bucket. The leftmost bucket with index = -11 is reserved for zero values. The other buckets with index = x store the values in the range (10^{x-1}, 10^x].
For instance, in the first piece of target code we try to accelerate, the FK step, a large portion of the computation is to evaluate the square root and sine/cosine functions. To map these functions into efficient units on the FPGA board, we use a table-based uniform polynomial approximation approach, based on Dong-U Lee's work on optimizing hardware function evaluation [16]. The evaluation of the two functions can be divided into three different phases [17].
(i) Range reduction: reduce the range of the input variable x into a small interval that is convenient for the evaluation procedure. The reduction can be multiplicative (e.g., x' = x/2^{2n} for the square root function) or additive (e.g., x' = x - 2πn for the sine/cosine functions).
(ii) Function evaluation: approximate the value of the function using a polynomial within the small interval.
(iii) Range reconstruction: map the value of the function in the small interval back into the full range of the input variable x.
To keep the whole unit small and efficient, we use a degree-one polynomial so that only one multiplication and one addition are needed to produce the evaluation result. Meanwhile, to keep the approximation error at a small scale, the reduced evaluation range is divided into uniform segments. Each segment is approximated with a degree-one polynomial, using the minimax algorithm. In the FK step, the square root function is approximated with 384 segments in the range of [0.25, 1] with a maximum approximation error of 4.74 x 10^{-7}, while the sine and cosine functions are approximated with 512 segments in the range of [0, 2π] with a maximum approximation error of 9.54 x 10^{-7}.
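The segment-based evaluation can be sketched as follows (an illustrative C++ model of our own; the coefficient tables stand for the minimax coefficients and the segment counts quoted above, and range reduction/reconstruction happen outside this routine):

#include <cstddef>

// Degree-one polynomial approximation over uniform segments.
// For a reduced argument x in [lo, hi), pick the segment and return
// c1[s] * x + c0[s]; one multiply and one add per evaluation.
struct PolyTable {
    double lo, hi;
    std::size_t segments;
    const double* c0;   // per-segment constant coefficients
    const double* c1;   // per-segment linear coefficients

    double eval(double x) const {
        double t = (x - lo) / (hi - lo) * segments;
        std::size_t s = static_cast<std::size_t>(t);
        if (s >= segments) s = segments - 1;   // guard the upper boundary
        return c1[s] * x + c0[s];
    }
};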
3.3. Bit-Accurate Value Simulator. As discussed in Section 3.1, based on the range information, we are able to determine the integer bit width of fixed-point numbers, and partly determine the exponent bit width of floating-point numbers (as the exponent bit width does not only relate to the range but also to the accuracy). The remaining bit widths, such as the fractional bit width of fixed-point, and the mantissa bit width of floating-point numbers, are predominantly related to the precision of the calculation. In order to find out the minimum acceptable values for these precision bit widths, we need a mechanism to determine whether a given set of bit-width values produces satisfactory results for the application.
In our previous work on function evaluation and other arithmetic designs, we set a requirement on the absolute error of the whole calculation, and use a conservative error model to determine whether the current bit-width values meet the requirement or not [6]. However, a specified requirement for absolute error does not work for seismic processing. To find out whether the current configuration of precision bit widths is accurate enough, we need to run the whole program to produce the seismic image, and find out whether the image contains the correct pattern information. Thus, to enable exploration of different bit-width values, a value simulator for different number representations is needed to provide bit-accurate simulation results for the hardware designs.
With the requirement to produce bit-accurate results matching the corresponding hardware design, the simulator also needs to be efficiently implemented, as we need to run the whole application (which takes days using the whole input dataset) to produce the image.
In our approach, the simulator works with ASC-format C++ code. It reimplements the ASC hardware data types, such as HWfix and HWfloat, and overloads their arithmetic operators with the corresponding simulation code. For HWfix variables, the value is stored in a 64-bit signed integer, while another integer is used to record the fractional point. The basic arithmetic operations are mapped into shifts and arithmetic operations of the 64-bit integers. For HWfloat variables, the value is stored in an 80-bit extended-precision floating-point number, with two other integers used to record the exponent and mantissa bit widths. To keep the simulation simple and fast, the arithmetic operations are processed using floating-point values. However, to keep the result bit accurate, during each assignment, by performing corresponding bit operations, we decompose the floating-point value into mantissa and exponent, truncate according to the exponent and mantissa bit widths, and combine them back into the floating-point value.
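The per-assignment truncation can be modeled in standard C++ roughly as below (a simplified sketch of our own using double rather than the 80-bit extended type, and ignoring rounding modes and denormals):

#include <cmath>

// Truncate 'x' so that it carries only 'mant_bits' fraction bits of the
// significand and an exponent representable with 'exp_bits' bits
// (bias 2^(exp_bits-1) - 1).
double truncate_to_format(double x, int exp_bits, int mant_bits) {
    if (x == 0.0) return 0.0;
    int e;
    double m = std::frexp(x, &e);      // x = m * 2^e, 0.5 <= |m| < 1
    int E = e - 1;                     // exponent of the normalized form 1.F * 2^E
    double sig = m * 2.0;              // significand 1.F in [1, 2)
    sig = std::trunc(std::ldexp(sig, mant_bits));   // keep mant_bits fraction bits
    sig = std::ldexp(sig, -mant_bits);
    int bias = (1 << (exp_bits - 1)) - 1;
    if (E > bias) return x > 0 ? HUGE_VAL : -HUGE_VAL;   // overflow
    if (E < 1 - bias) return 0.0;                         // underflow to zero here
    return std::ldexp(sig, E);
}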
3.4. Accuracy Evaluation of Generated Seismic Images. As mentioned above, the accuracy of a generated seismic image depends on the pattern contained inside, which estimates the geophysical status of the investigated area. To judge whether the image is accurate enough, we compare it to a target image, which is processed using single-precision floating-point and assumed to contain the correct pattern.
To perform this pattern comparison automatically, we use techniques based on prediction error filters (PEFs) [18] to highlight differences between two images. The basic workflow of comparing image a to image b (assuming image a is the target image) is as follows.
(i) Divide image a into overlapping small regions of 40 x 40 pixels, and estimate PEFs for these small regions.
(ii) Apply these PEFs to both image a and image b to get the results a' and b'.
(iii) Apply algebraic combinations of the images a' and b' to acquire a value indicating the image differences.
By the end of the above workflow, we achieve a single value which describes the difference of the generated image from the target image. For convenience of discussion afterwards, we call this value the difference indicator (DI).
Figure 4 shows a set of different seismic images calculated from the same dataset, and their DI values compared to the image with the correct pattern. The image showing the correct pattern is calculated using single-precision floating-point, while the other images are calculated using fixed-point designs with different bit-width settings. All these images are results of the bit-accurate value simulator mentioned above.
If the generated image contains no information at all (as shown in Figure 4(a)), the comparison does not return a finite value. This is mostly because a very low precision is used for the calculation. The information is lost during numerous iterations and the result only contains zeros or infinities. If the comparison result is in the range of 10^4 to 10^5 (Figures 4(b) and 4(c)), the image contains a random pattern which is far different from the correct one. With a comparison result in the range of 10^3 (Figure 4(d)), the image contains a similar pattern to the correct one, but information in some parts is lost. With a comparison result in the range of 10^2 or smaller, the generated image contains almost the same pattern as the correct one.
Note that the DI value is calculated from algebraic operations on the two images being compared. The magnitude of the DI value is only a relative indication of the difference between the two images. The actual usage of the DI value is to figure out the boundary between the images that contain mostly noise and the images that provide useful patterns of the earth model. From the samples shown in Figure 7, in this specific case, the DI value of 10^2 is a good guidance value for acceptable accuracy of the design. From the bit-width exploration results shown in Section 4, we can see that the DI value of 10^2 also happens to be a precision threshold, where the image turns from noise into an accurate pattern with the increase of bit width.
3.5. Number Representation Exploration. Based on all the above modules, we can now perform exploration of different number representations for the FPGA implementation of a specific piece of Fortran code.
The current tools support two different number representations, fixed-point and floating-point numbers (the value simulator for LNS is still in progress). For all the different number formats, the users can also specify arbitrary bit widths for each different variable.
There are usually a large number of different variables involved in one circuit design. In our previous work, we usually apply heuristic algorithms, such as ASA [19], to find out a close-to-optimal set of bit-width values for the different variables. The heuristic algorithms may require millions of test runs to check whether a specific set of values meets the constraints or not. This is acceptable when the test run is only a simple error function and can be processed in nanoseconds. In our seismic processing application, depending on the problem size, it takes half an hour to several days to run one test set and obtain the result image. Thus, heuristic algorithms become impractical.
A simple and straightforward method to solve the problem is to use a uniform bit width over all the different variables, and either iterate over a set of possible values or use a binary search algorithm to jump to an appropriate bit-width value. Based on the range information and the internal behavior of the program, we can also try to divide the variables in the target Fortran code into several different groups, and assign a different uniform bit width to each group. For instance, in the FK step, there is a clear boundary: the first half performs square, square root, and division operations to calculate an integer value, and the second half uses the integer value as a table index, and performs sine, cosine, and complex multiplications to get the final result. Thus, in the hardware circuit design, we divide the variables into two groups based on which half they belong to. Furthermore, in the second half of the function, some of the variables are trigonometric values in the range of [-1, 1], while the other variables represent the seismic image data and scale up to 10^6. Thus, they can be further divided into two parts and assigned bit widths separately.
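A simple driver for the uniform bit-width search could look like this (hypothetical function names of our own; run_and_compare stands for a complete run of the application through the value simulator followed by the PEF-based comparison of Section 3.4):

// Hypothetical hook: run the whole application with a uniform bit width
// and return the DI of the generated image.
double run_and_compare(int bit_width);

const double ACCEPTABLE_DI = 1.0e2;   // threshold observed in Section 3.4

// Binary search for the smallest uniform bit width whose image is acceptable.
// This assumes DI decreases monotonically with bit width, which the
// experiments in Section 4 support around the threshold.
int explore_uniform_bitwidth(int lo, int hi) {
    int best = hi;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (run_and_compare(mid) <= ACCEPTABLE_DI) {   // accurate enough: try fewer bits
            best = mid;
            hi = mid - 1;
        } else {                                       // too coarse: need more bits
            lo = mid + 1;
        }
    }
    return best;
}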
4. Case Study I: The FK Step in Downward Continued-Based Migration
4.1. Brief Introduction. The code shown in Algorithm 2 is the computationally intensive portion of the FK step in a downward continued-based migration. The governing equation for the FK step is the double square root equation (DSR) [20]. The DSR equation describes how to downward continue a wavefield U one depth z step. The equation is valid for a constant velocity medium v and is based on the wave numbers of the source k_s and receiver k_g. The DSR equation can be written as (1), where ω is the frequency. The code takes the approach of building a priori a relatively small table of the possible values of vk/ω. The code then performs a table lookup that converts a given vk/ω value to an approximate value of the square root.
In practical applications, wd contains millions of data items. The computation pattern of this function makes it an ideal target to map to a streaming hardware circuit on an FPGA.
4.2. Circuit Design. The mapping from the software code to a hardware circuit design is straightforward for most parts. Figure 5 shows the general structure of the circuit design. Compared with the software Fortran code shown in Algorithm 2, one big difference is the handling of the sine and cosine functions. In the software code, the trigonometric functions are calculated outside the five-level loop, and stored as a lookup table. In the hardware design, to take advantage of the parallel calculation capability provided by the numerous logic units on the FPGA, the calculation
Figure 4: Examples of seismic images with different Difference Indicator (DI) values: (a) DI = Inf, (b) DI = 10^5, (c) DI = 10^4, (d) DI = 10^3, (e) DI = 10^2, (f) DI = 10, (g) correct pattern (full-precision seismic image). Inf means that the approach does not return a finite difference value. 10^x means that the difference value is in the range of [1 x 10^x, 1 x 10^{x+1}).
! generation of table step%ctable
do i = 1, size (step%ctable)
   k = ko * step%dstep * dsr%phase (i)
   step%ctable (i) = dsr%amp (i) * cmplx (cos(k), sin(k))
end do
! the core part of function wei_wem
do i4 = 1, size (wd, 4)
   do i3 = 1, size (wd, 3)
      do i2 = 1, size (wd, 2)
         do i1 = 1, size (wd, 1)
            k = sqrt (step%kx (i1, i3)**2 + step%ky (i2, i4)**2)
            itable = max (1, min (int (1 + k/ko/dsr%d), dsr%n))
            wd (i1, i2, i3, i4, i5) = wd (i1, i2, i3, i4, i5) * step%ctable (itable)
         end do
      end do
   end do
end do
Algorithm 2: The code for the major computations of the FK step.
Table 3: Profiling results for the ranges of typical variables in function wei_wem. "wd real" and "wd img" refer to the real and imaginary parts of the wd data. Max and Min refer to the maximum and minimum absolute values of the variables.
Variable    step%x    ko          wd real      wd img
Max         0.377     0.147       3.918e6      3.752e6
Min         0         7.658e-3    4.168e-14    5.885e-14
of the sine/cosine functions is merged into the processing core of the inner loop. Three function evaluation units are included in this design to produce values for the square root, cosine, and sine functions separately. As mentioned in Section 3.2, all three functions are evaluated using degree-one polynomial approximation with 386 or 512 uniform segments:
U\left(\omega, k_s, k_g, z + \Delta z\right) = \exp\left[\frac{i\omega}{v}\left(\sqrt{1 - \left(\frac{v k_g}{\omega}\right)^2} + \sqrt{1 - \left(\frac{v k_s}{\omega}\right)^2}\,\right)\Delta z\right] U\left(\omega, k_s, k_g, z\right).   (1)
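The per-sample computation mapped onto the streaming core (see Figure 5 and Algorithm 2) can be summarized in plain C++ as below. This is only a software model of the data path, not ASC code: eval_sqrt, eval_cos, and eval_sin stand for the table-based function evaluation units, and dstep_phase is assumed to hold the precomputed product step%dstep * dsr%phase.

#include <complex>

// Placeholders for the three table-based function evaluation units.
double eval_sqrt(double x);
double eval_cos(double x);
double eval_sin(double x);

// One sample of the FK step: square root, table index, phase rotation.
std::complex<float> fk_sample(double kx, double ky, double ko,
                              double d, int n,
                              const double* dstep_phase,
                              const float*  amp,
                              std::complex<float> wd) {
    double k = eval_sqrt(kx * kx + ky * ky);
    int itable = static_cast<int>(1 + k / ko / d);
    if (itable < 1) itable = 1;                 // clamp like max(1, min(..., n))
    if (itable > n) itable = n;
    double phase = ko * dstep_phase[itable - 1];
    std::complex<float> rot(static_cast<float>(eval_cos(phase)),
                            static_cast<float>(eval_sin(phase)));
    return wd * rot * amp[itable - 1];
}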
The other task in the hardware circuit design is to map the calculation into arithmetic operations of certain number representations. Table 3 shows the value ranges of some typical variables in the FK step. Some of the variables (in the part of square root and sine/cosine function evaluations) have a small range within [0, 1], while other values (especially the wd data) have a wide range from 10^{-14} to 10^6. If we use floating-point or LNS number representations, their wide representation ranges are enough to handle these variables. However, if we use fixed-point number representations in the design, special handling is needed to achieve acceptable accuracy over wide ranges.
The first issue to consider in fixed-point designs is the enlarged error caused by the division after the evaluation
Figure 5: General structure of the circuit design for the FK step. The inputs step_kx and step_ky are squared and summed (sqrt_sum = step_kx^2 + step_ky^2); a function evaluation unit computes sqrt_res = sqrt(sqrt_sum); the table index is derived as itable = max(1, min(sqrt_res/ko/dsr%d, dsr%n)) and the phase as k = ko * step%dstep * dsr%phase(itable); two further function evaluation units compute a = cos(k) and b = sin(k); finally the wavefield is updated as wd = wd * cmplx(a, b) * dsr%amp(itable).
of the square root (sqrt(step%x^2 + step%y^2)/ko). The values of step%x, step%y, and ko come from the software program as input values to the hardware circuit, and contain errors propagated from previous calculations or caused by the truncation/rounding into the specified bit width on hardware. Suppose the error in the square root result sqrt_res is E_sqrt, and the error in variable ko is E_ko; assuming that the division unit itself does not bring extra error, the error in the division result is given by E_sqrt * (sqrt_res/ko) + E_ko * (sqrt_res/ko^2). According to the profiling results, ko
holds a dynamic range from 0.007658 to 0.147, and sqrt_res has a maximum value of 0.533 (variables step%x and step%y have similar ranges). In the worst case, the error from sqrt_res can be magnified by 70 times, and the error from ko magnified by approximately 9000 times.
To solve the problem of enlarged errors, we perform shifts at the input side to keep the three values step%x, step%y, and ko in a similar range. The variable ko is shifted by a distance d_1 so that the value after shifting falls in the range of [0.5, 1). The variables step%x and step%y are shifted by another distance d_2 so that the larger value of the two also falls in the range of [0.5, 1). The difference between d_1 and d_2 is recorded so that after the division, the result can be shifted back into the correct scale. In this way, sqrt_res has a range of [0.5, 1.414] and ko has a range of [0.5, 1]. Thus, the division only magnifies the errors by a factor of 3 to 6. Meanwhile, as the three variables step%x, step%y, and ko are originally in single-precision floating-point representation in software, when we pass their values after shifts, a large part of the information stored in the mantissa part can be preserved. Thus, a better accuracy is achieved through the shifting mechanism for fixed-point designs.
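The shifting mechanism can be modeled in software as follows (an illustrative sketch of our own; on the FPGA the shifts are applied directly to the fixed-point inputs and the shift distances are simple wire re-routings):

#include <cmath>

// Normalize ko and max(|kx|, |ky|) into [0.5, 1) before the division,
// and undo the scaling on the result afterwards.
double shifted_ratio(double kx, double ky, double ko) {
    int d1, d2;
    std::frexp(ko, &d1);                          // ko = m1 * 2^d1, m1 in [0.5, 1)
    std::frexp(std::fmax(std::fabs(kx), std::fabs(ky)), &d2);
    double ko_n = std::ldexp(ko, -d1);            // ko shifted into [0.5, 1)
    double kx_n = std::ldexp(kx, -d2);            // kx, ky shifted by the same distance
    double ky_n = std::ldexp(ky, -d2);
    double sqrt_res = std::sqrt(kx_n * kx_n + ky_n * ky_n);   // in [0.5, 1.414]
    double ratio = sqrt_res / ko_n;               // errors magnified by only ~3-6x
    return std::ldexp(ratio, d2 - d1);            // shift back into the correct scale
}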
Figure 6 shows experimental results on the accuracy of the table index calculation when using shifting compared to not using shifting, with different uniform bit widths. The possible range of the table index result is from 1 to 2001. As it is the index for tables of smooth sequential values, an error within five indices is generally acceptable. We use the table index results calculated with single-precision floating-point as the true values for error calculation. When the uniform bit width of the design changes from 10 to 20, designs using the shifting mechanism show a stable maximum error of 3, and an average error around 0.11. On the other hand, the maximum errors of designs without shifting vary from 2000 to 75, and the average errors vary from approximately 148 to 0.5. These results show that the shifting mechanism provides much better accuracy for the table index calculation in fixed-point designs.
The other issue to consider is the representation of the wd data variables. As shown in Table 3, both the real and imaginary parts of the wd data have a wide range from 10^{-14} to 10^6. Generally, fixed-point numbers are not suitable to represent such wide ranges. However, in this seismic application, the wd data is used to store the processed image information. It is more important to preserve the pattern information shown in the data values than the data values themselves. Thus, by omitting the small values and using the limited bit width to store the information contained in large values, fixed-point representations still have a good chance of achieving an accurate image in the final step. In our design, for convenience of bit-width exploration, we scale down all the wd data values by a ratio of 2^{22} so that they fall into the range of [0, 1).
4.3. Bit-Width Exploration Results. The original software Fortran code of the FK step performs the whole computation using single-precision floating-point. We firstly replace the original Fortran code of the FK step with a piece of C++
Figure 6: Maximum and average errors for the calculation of the table index when using and not using the shifting mechanism in fixed-point designs, with different uniform bit-width values from 10 to 20.
code using double-precision floating-point to generate a full-precision image to compare with. After that, to investigate the effect of different number representations for variables in the FK step on the accuracy of the whole application, we replace the code of the FK step with our simulation code that can be configured with different number representations and different bit widths, and generate results for different settings. The approach for accuracy evaluation, introduced in Section 3.4, is used to provide DI values that indicate the differences between the patterns of the resulting seismic images and the pattern in the full-precision image.
4.3.1. Fixed-Point Designs. In the first step, we apply a uniform bit width over all the variables in the design. We change the uniform bit width from 10 to 20. With a uniform bit width of 16, the design provides a DI value around 100, which means that the image contains a pattern almost the same as the correct one.
In the second step, as mentioned in Section 3.5, according to their characteristics in range and operational behavior, we can divide the variables in the design into different groups and apply a uniform bit width in each group. In the hardware design for the FK step, the variables are divided into three groups: SQRT, the part from the beginning to the table index calculation, which includes an evaluation of the square root; SINE, the part from the end of SQRT to the evaluation of the sine and cosine functions; WFLD, the part that multiplies the complex values of the wd data with a complex value consisting of the sine and cosine values (for phase modification), and a real value (for amplitude modification). To perform the accuracy investigation, we keep two of the bit-width values constant, and change the other one gradually to see its effect on the accuracy of the entire application.
Figure 7(a) shows the DI values of the generated images when we change the bit width of the SQRT part from 6 to
20. The bit widths of the SINE and WFLD parts are set to 20 and 30, respectively. Large bit widths are used for the other two parts so that they do not contribute much to the errors and the effect of the variables' bit width in SQRT can be extracted out. The case of SQRT bit widths shows a clear precision threshold at the bit-width value of 10. When the SQRT bit width increases from 8 bits to 10 bits, the DI value falls from the scale of 10^5 to the scale of 10^2. The significant improvement in accuracy is also demonstrated in the generated seismic images. The image on the left of Figure 7(a) is generated with the 8-bit design. Compared to the true image calculated with single-precision floating-point, the upper part of the image is mainly noise signals, while the lower part starts to show a similar pattern to the correct one. The difference between the qualities of the lower and upper parts is because of the imaging algorithm, which calculates the image from a summation of a number of points at the corresponding depth. In acoustic models, there are generally more sample points when we go deeper into the earth. Therefore, using the same precision, the lower part shows a better quality than the upper part. The image on the right of Figure 7(a) is generated with the 10-bit design, and already contains almost the same pattern as the true image.
In a similar way, we perform the exploration for the other two parts, and acquire the precision thresholds 10, 12, and 16 for the SQRT, SINE, and WFLD parts, respectively. However, as the above results are acquired with two out of the three bit widths set to very large values, the practical solution shall be slightly larger than these values. Meanwhile, constrained by the current I/O bandwidth of 64 bits per cycle, the sum of the bit widths for the SQRT and WFLD parts shall be less than 30. We perform further experiments for bit-width values around the initial guess point, and find out that bit widths of 12, 16, and 16 for the three parts provide a DI value of 131.5 and also meet the bandwidth requirement.
4.3.2. Floating-Point Designs. In the floating-point design of the FK step, we perform an exploration of different exponent and mantissa bit widths. Similar to the fixed-point designs, we use a uniform bit width for all the variables. When we investigate one of them, we keep the other one at a constant high value.
Figure 7(b) shows the case where we change the exponent bit width from 3 to 10, while we keep the mantissa bit width as 24. There is again a clear cut at the bit width of 6. When the exponent bit width is smaller than 6, the DI value of the generated image is at the level of 10^5. When the exponent bit width increases to 6, the DI value decreases to around 1.
With a similar exploration of the mantissa bit width, we figure out that an exponent bit width of 6 and a mantissa bit width of 16 provide the minimum bit widths needed to achieve a DI value around 10^2. Experiment confirms that this combination produces an image with a DI value of 43.96.
4.4. Hardware Acceleration Results. The hardware acceleration tool used in this project is the FPGA computing platform MAX-1, provided by Maxeler Technologies [21]. It contains a high-performance Xilinx Virtex IV FX100

Table 4: Speedups achieved on FPGA compared to software solutions. Xilinx Virtex IV FX100 FPGA compared with an Intel Xeon CPU of 1.86 GHz.
Size of dataset    Software time    FPGA time    Speedup
43056              5.32 ms          0.84 ms      6.3x
216504             26.1 ms          3.77 ms      6.9x

Table 5: Resource cost of the FPGA design for the FK step in downward continued-based migration.
Type of resource         Used units    Percentage
Slices                   12032         28%
BRAMs                    59            15%
Embedded multipliers     16            10%
FPGA, which consists of 42176 slices, 376 BRAMs, and 192 embedded multipliers. Meanwhile, it provides a high-bandwidth interface of PCI Express X8 (2 GB per second) to the software side residing in CPUs.
Based on the exploration results of different number representations, the fixed-point design with bit widths of 12, 16, and 16 for the three different parts is selected in our hardware implementation. The design produces images containing the same pattern as the double-precision floating-point implementation, and has the smallest bit-width values, that is, the lowest resource cost among all the different number representations.
Table 4 shows the speedups we can achieve on FPGA compared to software solutions running on an Intel Xeon CPU of 1.86 GHz. We experiment with two different sizes of datasets. For each of the datasets, we record the processing time over 10 000 runs and calculate the average as the result. Speedups of 6.3 and 6.9 times are achieved for the two different datasets, respectively.
Table 5 shows the resource cost to implement the FK step on the FPGA card. It utilizes 28% of the logic units, 15% of the BRAMs (memory units), and 10% of the arithmetic units. Considering that a large part (around 20%) of the used logic units are circuits handling PCI-Express I/O, there is still much potential to put more processing cores onto the FPGA card and to gain even higher speedups.
5. Case Study II: 3D Convolution in Reverse Time Migration
3D convolution is one of the major computation bottlenecks in reverse time migration algorithms. In this paper, we implemented a 6th-order acoustic modeling kernel to investigate the potential speedups on FPGAs. The 3D convolution uses a kernel with 19 elements. Once each line of the kernel has been processed, it is scaled by a constant factor.
One of the key challenges in implementing 3D convolution is how to keep fast access to all the data elements needed for a 19-point operation. As the data items are generally stored in one direction, when you want to access the data items in a 3D pattern, you need to either buffer a large amount of data items or access them in a very slow nonlinear pattern. In our
Figure 7: Exploration of fixed-point and floating-point designs with different bit widths. (a) DI values for different SQRT bit-widths in a fixed-point design, with reduced-precision seismic images for the 8-bit and 10-bit fixed-point designs shown against the true image (single-precision floating-point). (b) DI values for different exponent bit-widths in a floating-point design, with reduced-precision seismic images for 5-bit and 6-bit exponents shown against the true image (single-precision floating-point).
FPGA design, we solve this problem by buffering the current block we process into the BRAM FIFOs. ASC provides a convenient interface to automatically buffer the input values into BRAMs, and the users can access them by specifying the cycle number at which the value was read in. Thus, we can easily index into the stream to obtain values already sent to the FPGA and perform the 3D operator.
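The stream-offset access just described can be pictured with the following generic circular-buffer model (our own illustration, not ASC syntax): values are retained for as many cycles as the deepest tap of the 3D stencil requires, and each tap simply indexes a fixed number of cycles into the past.

#include <vector>
#include <cstddef>

// Software model of a streaming window: keeps the last 'depth' input
// samples so that a stencil tap can read the value received 'ago' cycles ago.
class StreamWindow {
public:
    explicit StreamWindow(std::size_t depth) : buf_(depth, 0.0f), head_(0) {}

    void push(float v) {                       // one new sample per cycle
        head_ = (head_ + 1) % buf_.size();
        buf_[head_] = v;
    }
    float tap(std::size_t ago) const {         // sample from 'ago' cycles earlier
        return buf_[(head_ + buf_.size() - ago) % buf_.size()];
    }

private:
    std::vector<float> buf_;
    std::size_t head_;
};

// For a block of nx * ny samples per plane, the taps of a 6th-order stencil
// along the streaming direction lie k * nx * ny cycles in the past (k = 1..3).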
Compared to the 3D convolution processed on CPUs, the FPGA has two major advantages. One is the capability of performing computations in parallel. We exploit the parallelism of the FPGA to calculate one result per cycle. When ASC assigns the elements to BRAMs, it does so in such a way as to maximize the number of elements that can be obtained from the BRAMs every cycle. This means that consecutive elements of the kernel must not in general be placed in the same BRAM. The other advantage is the support for application-specific number representations. By using fixed-point of 20 bits (the minimum bit-width setting that provides acceptable accuracy), we can reduce the area cost greatly and thus put more processing units into the FPGA.
We test the convolution design on a data size of 700 x 700 x 700. To compute the entire computation all at the same time (as is the case when a high-performance processor is used) requires a large local memory (in the case of the processor, a large cache). The FPGA has limited resources on-chip (376 BRAMs which can each hold 512 32-bit values). To solve this problem, we break the large dataset into cubes and process them separately. To utilize all of our input and output bandwidth, we assign 3 processing cores to the FPGA, resulting in 3 inputs and 3 outputs per cycle at 125 MHz (constrained by the throughput of the PCI-Express bus). This gives us a theoretical maximum throughput of 375 M results per second.
The disadvantage of breaking the problem into smaller blocks is that the boundaries of each block are essentially wasted (although a minimal amount of reuse can occur) because they must be resent when the adjacent block is calculated. We do not consider this a problem since the blocks we use are at least 100 x 100 x 700, which means only a small proportion of the data is resent.
In software, the convolution executes in 11.2 seconds on average. The experiment was carried out using a dual-processor machine (each a quad-core Intel Xeon 1.86 GHz) with 8 GB of memory.
In hardware, using the MAX-1 platform, we perform the same computation in 2.2 seconds and obtain a 5 times speedup. The design uses 48 DSP blocks (30%), 369 (98%) RAMB16 blocks, and 30,571 (72%) of the slices on the Virtex IV chip. This means that there is room on the chip to substantially increase the kernel size. For a larger sized kernel (31 points), the speedup should be virtually linear, resulting in an 8 times speedup compared to the CPU implementation.
6. Further Potential Speedups
One of the major constraints on achieving higher speedups on FPGAs is the limited bandwidth between the FPGA card and the CPU. With the current PCI-Express interface provided by the MAX-1 platform, in each cycle, we can only read 8 bytes into the FPGA card and write back 8 bytes to the system.
An example is the implementation of the FK step, described in Section 4. As shown in Algorithm 2, in our current designs, we take step%kx, step%ky, and both the real and imaginary parts of wd as inputs to the circuit on the FPGA, and take the modified real and imaginary parts of wd as outputs. Therefore, although there is much space on the FPGA card to support multiple cores, the interface bandwidth can only support one single core, which gets a speedup of around 7 times.
However, in the specific case of the FK step, there are further techniques we can utilize to gain some more speedup. From the code in Algorithm 2, we can find out that wd varies with all four loop indices, while step%kx and step%ky only vary with two of the four loop indices. To take advantage of this characteristic, we can divide the processing of the loop into two parts: in the first part, we use the bandwidth to read in the step%kx and step%ky values, without doing any calculation; in the second part, we can devote the bandwidth to reading in wd data only, and start the processing as well. In this pattern, supposing we are processing a 100 x 100 x 100 x 100 four-level loop, the bandwidth can support two cores processing concurrently while spending 1 out of 100 cycles to read in the step%kx and step%ky values in advance. In this way, we are able to achieve a speedup of 6.9 x 2 x 100/101 ≈ 13.7 times. Furthermore, assuming that there is unlimited communication bandwidth, the cost of BRAMs (15%) becomes the major constraint. We can then put 6 concurrent cores on the FPGA card and achieve a speedup of 6.9 x 7 ≈ 48 times.
Another possibility is to put as much computation as possible onto the FPGA card, and reduce the communication cost between FPGA and CPU. If multiple portions of the algorithm are performed on the FPGA without returning to the CPU, the additional speedup can be considerable. For instance, as mentioned in Section 2, the major computation cost in downward continued-based migration lies in the multidimensional FFTs and the FK step. If the FFT and the FK step can reside simultaneously on the FPGA card, the communication cost between the FFT and the FK step can be eliminated completely. In the case of 3D convolution in reverse time migration, multiple time steps can be applied simultaneously.
7. Conclusions
This paper describes our work on accelerating seismic applications by using customized number representations on FPGAs. The focus is to improve the performance of the FK step in downward continued-based migration and the acoustic 3D convolution kernel in reverse time migration. To investigate the tradeoff between precision and speed, we develop a tool that performs an automated precision exploration of different number formats, and figures out the minimum precision that can still generate good enough seismic results. By using the minimized number format, we implement the FK step in forward continued-based migration and 3D convolution in reverse time migration on FPGA and show speedups ranging from 5 to 7 by including the transfer time to and from the processors. We also show that there is further potential to accelerate these applications by above 10 or even 48 times.
Acknowledgments
The support from the Center for Computational Earth and
Environmental Science, Stanford Exploration Project, Com-
puter Architecture Research Group at Imperial College Lon-
don, and Maxeler Technologies is gratefully acknowledged.
The authors also would like to thank Professor Martin Morf
and Professor Michael Flynn for their support and advice.
References
[1] J. Gazdag and P. Sguazzero, "Migration of seismic data by phase shift plus interpolation," in Migration of Seismic Data, G. H. F. Gardner, Ed., Society of Exploration Geophysicists, Tulsa, Okla, USA, 1985.
[2] C. He, M. Lu, and C. Sun, "Accelerating seismic migration using FPGA-based coprocessor platform," in Proceedings of the 12th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '04), pp. 207-216, Napa, Calif, USA, April 2004.
[3] O. Pell and R. G. Clapp, "Accelerating subsurface offset gathers for 3D seismic applications using FPGAs," SEG Technical Program Expanded Abstracts, vol. 26, no. 1, pp. 2383-2387, 2007.
[4] J. Deschamps, G. Bioul, and G. Sutter, Synthesis of Arithmetic Circuits: FPGA, ASIC and Embedded Systems, Wiley-Interscience, New York, NY, USA, 2006.
[5] O. Mencer, "ASC: a stream compiler for computing with FPGAs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 9, pp. 1603-1617, 2006.
[6] D.-U. Lee, A. A. Gaffar, R. C. C. Cheung, O. Mencer, W. Luk, and G. A. Constantinides, "Accuracy-guaranteed bit-width optimization," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 25, no. 10, pp. 1990-2000, 2006.
[7] M. L. Chang and S. Hauck, "Précis: a usercentric word-length optimization tool," IEEE Design & Test of Computers, vol. 22, no. 4, pp. 349-361, 2005.
[8] B. Biondi and G. Palacharla, "3-D prestack migration of common-azimuth data," Geophysics, vol. 61, no. 6, pp. 1822-1832, 1996.
[9] G. Shan and B. Biondi, "Imaging steep salt flank with plane-wave migration in tilted coordinates," SEG Technical Program Expanded Abstracts, vol. 25, no. 1, pp. 2372-2376, 2006.
[10] B. Biondi, "Narrow-azimuth migration of marine streamer data," SEG Technical Program Expanded Abstracts, vol. 22, no. 1, pp. 897-900, 2003.
[11] L. Zhao and A. C. Cangellaris, "GT-PML: generalized theory of perfectly matched layers and its application to the reflectionless truncation of finite-difference time-domain grids," IEEE Transactions on Microwave Theory and Techniques, vol. 44, no. 12, part 2, pp. 2555-2563, 1996.
[12] R. Matousek, M. Tichy, Z. Pohl, J. Kadlec, C. Softley, and N. Coleman, "Logarithmic number system and floating-point arithmetic on FPGA," in Proceedings of the 12th International Conference on Field-Programmable Logic and Applications (FPL '02), pp. 627-636, Madrid, Spain, August 2002.
[13] G. A. Constantinides, P. Y. K. Cheung, and W. Luk, "Heuristic datapath allocation for multiple wordlength systems," in Proceedings of the Conference on Design, Automation and Test in Europe (DATE '01), pp. 791-796, Munich, Germany, March 2001.
[14] H. Fu, O. Mencer, and W. Luk, "Comparing floating-point and logarithmic number representations for reconfigurable acceleration," in Proceedings of the IEEE International Conference on Field Programmable Technology (FPT '06), pp. 337-340, Bangkok, Thailand, December 2006.
[15] H. Fu, O. Mencer, and W. Luk, "Optimizing logarithmic arithmetic on FPGAs," in Proceedings of the 15th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM '07), pp. 163-172, Napa, Calif, USA, April 2007.
[16] D.-U. Lee, A. A. Gaffar, O. Mencer, and W. Luk, "Optimizing hardware function evaluation," IEEE Transactions on Computers, vol. 54, no. 12, pp. 1520-1531, 2005.
[17] J. Muller, Elementary Functions: Algorithms and Implementation, Birkhäuser, Secaucus, NJ, USA, 1997.
[18] J. Claerbout, "Geophysical estimation by example: Environmental soundings image enhancement," Stanford Exploration Project, 1999, http://sepwww.stanford.edu/sep/prof/.
[19] L. Ingber, "Adaptive Simulated Annealing (ASA) 25.15," 2004, http://www.ingber.com/.
[20] J. Claerbout, Basic Earth Imaging (BEI), 2000, http://sepwww.stanford.edu/sep/prof/.
[21] Maxeler Technologies, http://www.maxeler.com/.
Hindawi Publishing Corporation
EURASIP Journal on Embedded Systems
Volume 2009, Article ID 507426, 6 pages
doi:10.1155/2009/507426
Research Article
An FPGA Implementation of a Parallelized MT19937 Uniform
Random Number Generator
Vinay Sriram and David Kearney
University of South Australia, Reconfigurable Computing Laboratory, School of Computer and Information Science,
Mawson Lakes Campus, Adelaide, SA 5085, Australia
Correspondence should be addressed to Vinay Sriram, srivb001@students.unisa.edu.au
Received 20 August 2008; Revised 16 February 2009; Accepted 21 April 2009
Recommended by Miriam Leeser
Recent times have witnessed an increase in the use of high-performance
reconfigurable computing for accelerating large-scale simulations. A
characteristic of such simulations, like infrared (IR) scene simulation,
is the use of large quantities of uncorrelated random numbers. It is
therefore of interest to have a fast uniform random number generator
implemented in reconfigurable hardware. While there have been previous
attempts to accelerate the MT19937 pseudouniform random number generator
using FPGAs, we believe that we can substantially improve on the previous
implementations to develop a higher throughput and more area-time
efficient design. Due to the potential for parallel implementation of
random number generators, designs that have both a small area footprint
and high throughput are to be preferred to ones that have high throughput
but with significant extra area requirements. In this paper, we first
present a single port design and then present an enhanced 624 port
hardware implementation of the MT19937 algorithm. The 624 port hardware
implementation, when implemented on a Xilinx XC2VP70-6 FPGA chip, has a
throughput of 119.6 × 10^9 32-bit random numbers per second, which is
more than 17× that of the previously best published uniform random
number generator. Furthermore, it has the lowest area-time metric of all
the currently published FPGA-based pseudouniform random number generators.
Copyright 2009 V. Sriram and D. Kearney. This is an open access article distributed under the Creative Commons Attribution
License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly
cited.
1. Introduction
Reconfigurable computing is increasingly being seen as an attractive
solution for accelerating simulations that require fast generation of
large quantities of random numbers. Although random numbers are often
only a very small part of these algorithms, the inability to generate
them fast enough can cause a bottleneck in the reconfigurable computing
implementation of the simulation. For example, in the simulated
generation of infrared scenes that take into account the effects of a
turbulent atmosphere and the effects of CCD camera sensor electronic
noise, each 352 × 352 scene generated by the simulation requires more
than 1.87 × 10^6 Gaussian random numbers. A real-time simulation
sequence at 15 scenes/s thus needs more than 28.1 × 10^6 random samples
generated per second. Since a typical software uniform generator [1] can
only manage 10 × 10^6 per second, you would need 2.9 PCs to keep up with
this rate.
A key requirement of infrared (IR) scene simulation is the necessity to
generate large sequences of random numbers on the provision of a single
seed. Not all random number generators are capable of doing this (e.g.,
see those presented in [2]). Moreover, in order to achieve the high
throughput required, it is important to make use of the algorithm's
internal parallelism (i.e., by splitting the algorithm into independent
subsequences) as well as external parallelism (i.e., through parallel
implementations of the algorithm). It has been recommended in [3] and
reinforced in [4] that, in order to prevent possible correlations
between output sequences in parallel implementations of the same
algorithm using different initial seeds, it is necessary to use a random
number generator that has a period greater than 2^200. In summary, the
requirements of an FPGA optimized uniform random number generator for IR
scene simulation are as follows:
(1) should be a seeded random number generator (so that
the same sequence may be regenerated);
(2) have the ability to generate a large quantity of random
numbers from one seed;
(3) can be split into many independent subsequences;
(4) have a very large period;
(5) generate random numbers quickly;
(6) satisfy statistical tests for randomness;
(7) be able to generate parallel streams of uncorrelated
random numbers.

Figure 1: Single port version (stage 1: seed generator; stage 2: seed
value modulator; stage 3: output generator; a dual port RAM holds the
624 32-bit seeds and a FIFO buffer holds 397 seeds).
2. Survey of FPGA-Based Uniform Random
Number Generators
As discussed in the previous section, IR scene simulation [5] requires
fast generation of large sequences of random numbers on the provision
of a single seed. From the extensive literature in the field of software
pseudouniform random number generators, some algorithms that achieve
this are the generalized feedback shift register and the MT19937. They
both have the ability to generate large sequences of random numbers on
the provision of a single seed and have the ability to be split into
independent subsequences to allow for a more parallelized implementation
in hardware. An additional benefit of these algorithms is that they have
large periods. It has been recommended in [1] and reinforced in [4]
that, in order to prevent possible problems with correlations when
implementing the same algorithm with different initial seeds in
parallel, the algorithm needs to have a period in excess of 2^200. The
MT19937 algorithm has a period of 2^19937, which therefore allows for
parallel implementation of the MT19937 algorithm with different initial
seeds.
There are currently two FPGA optimized implementations of the MT19937:
a single port design, see [6], and a multiport design presented in [7].
The well-known generalized feedback shift register has been modified for
FPGA implementation in [8] to achieve the smallest area-time design to
date. Thus it is of interest to see if a hardware implementation of a
624 port MT19937 algorithm can be made more competitive. This is the
subject of investigation of this paper. This paper is organized as
follows: in Section 3 the MT19937 algorithm is briefly described. In
Sections 4 and 5 we present single port and 624 port hardware
implementations of the MT19937 algorithm. In Section 6, diehard test
results of the two hardware implementations, along with performance
comparisons of these implementations with other published FPGA-based
pseudouniform random number generators, are presented.
3. MT19937
The origins of the MT19937 algorithm are in the Tausworthe generator, a
linear feedback shift register that produces long sequences of binary
bits; see [9]. The period of this generator depends on its
characteristic polynomial, which is irreducible. The period is the
smallest integer n for which x^n + 1 is divisible by the characteristic
polynomial. The polynomial has the following form:

    x_{n+1} = (A_1 x_n + A_2 x_{n-1} + ... + A_k x_{n-k+1}) mod 2,    (1)

where x_i, A_i ∈ {0, 1} for all i. Although this algorithm produces
uniform random bits, it is slow. This algorithm was later modified by
Lewis and Payne in [10], who created a variant of it known as the
generalized feedback shift register:
    x_i = x_{i-p} ⊕ x_{i-q},    (2)

where each x_i is a vector of size w with components 0 or 1. The
maximum possible period of 2^p - 1 of this generator is achieved when
the primitive trinomial x^p + x^q + 1 divides x^n - 1 for n = 2^p - 1,
for the smallest value of n. The maximum period can be achieved by
selecting n as a Mersenne prime.
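To make recurrence (2) concrete, the following minimal C sketch steps a
GFSR over 32-bit words using a circular buffer. The lags p = 98 and
q = 27 are illustrative values only (in practice they must come from a
primitive trinomial x^p + x^q + 1) and are not parameters used in this
paper.

#include <stdint.h>

/* Illustrative GFSR step: x_i = x_{i-p} xor x_{i-q} on 32-bit words.
 * P and Q are example lags; a real generator takes them from a
 * primitive trinomial x^P + x^Q + 1. */
#define P 98
#define Q 27

static uint32_t state[P]; /* must first be filled with P suitable seed words */
static int pos = 0;       /* index of the oldest word, x_{i-P}               */

uint32_t gfsr_next(void)
{
    uint32_t x = state[pos] ^ state[(pos + P - Q) % P];
    state[pos] = x;            /* the new word overwrites the oldest one */
    pos = (pos + 1) % P;
    return x;
}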
It was later identified that the effectiveness, that is, the randomness
of the numbers produced by this algorithm, was dependent on the
selection of the initial seeds. Furthermore, the algorithm required n
words of working area, which was memory consuming. This discovery led
Matsumoto and Kurita in 1994 to further revise this algorithm and
develop the twisted generalized feedback shift register II in [11].
This generator used linear combinations of only relatively few bits of
the preceding numbers and was thus considerably faster; it was named
TT800. Later, Matsumoto and Nishimura in 1998 further revised the TT800
to admit a Mersenne-prime period, and this new algorithm was called the
MT19937; see [12].
The MT19937 algorithm generates sequences of uniformly distributed
pseudorandom integers (32 or 54 bit numbers) in the range [0, 2^w - 1).
The MT19937 algorithm is based on the following linear recurrence, where
x and a denote word vectors and A is a w × w matrix. The proof of the
algorithm is provided in [12]:

    x_{k+n} = x_{k+m} ⊕ (x_k^u | x_{k+1}^l) A,    (3)

where k = 0, 1, . . . .
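As a point of reference for the hardware stages described in the
following sections, the sketch below gives a minimal C rendering of
recurrence (3) together with the tempering step, following the widely
published reference formulation of MT19937 (n = 624, m = 397). It is
intended only to make the data dependencies visible, not to describe
the FPGA design itself; the hexadecimal constants are the same ones
that appear in Figure 2.

#include <stdint.h>

#define N 624                   /* degree of recurrence (size of the seed pool) */
#define M 397                   /* offset of the "middle" word                  */
#define MATRIX_A   0x9908b0dfUL /* the constant vector a                        */
#define UPPER_MASK 0x80000000UL /* most significant bit of x_k                  */
#define LOWER_MASK 0x7fffffffUL /* lower 31 bits of x_{k+1}                     */

static uint32_t mt[N];  /* the pool of 624 32-bit seeds; must be seeded first */
static int mti = 0;     /* index of the word refreshed and output next        */

uint32_t mt19937_next(void)
{
    /* Recurrence (3): refresh mt[mti] from mt[mti], mt[mti+1], mt[mti+M]. */
    uint32_t y = (mt[mti] & UPPER_MASK) | (mt[(mti + 1) % N] & LOWER_MASK);
    mt[mti] = mt[(mti + M) % N] ^ (y >> 1) ^ ((y & 1UL) ? MATRIX_A : 0UL);

    /* Tempering (what the output generator stage computes). */
    y = mt[mti];
    y ^= (y >> 11);
    y ^= (y << 7)  & 0x9d2c5680UL;
    y ^= (y << 15) & 0xefc60000UL;
    y ^= (y >> 18);

    mti = (mti + 1) % N;
    return y;
}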
4. Single Port Version
This section describes our first hardware implementation of
MT19937, which we call the single port version.
Figure 2: Internal logic of stage 2 and stage 3 (the seed value
modulator and the output generator), showing the multiplexers, the
inputs seed[i], seed[i + 1], and seed[i + 397], the constants mag1,
mag2, 0x7FFFFFFF, 0x80000000, 0x9D2C5680, and 0xEFC60000, and the
shift amounts 11 and 15.
Generation of random numbers is carried out in 3 stages, namely, the
seed generator, seed value modulator, and output generator.
This is illustrated in Figure 1.
Typically the user provides one number as a seed; however, the MT19937
algorithm works with a pool of 624 seeds, so the seed generator stage
generates 624 seeds from the single input from the user. In stage two
(the seed value modulator), which is the core of the algorithm, three
values, seed[i], seed[i + 1], and seed[i + 396], are read from the pool
and, based on the computation defined in the algorithm, seed[i] is
updated. In the final stage, the output generator reads one of the pool
values and generates the output uniform random number from this value.
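The seed generator stage corresponds to the initialisation routine of
MT19937, which expands the user's single seed into the pool of 624
words. A minimal C sketch is shown below; it uses the multiplier
1812433253 of the widely used 2002 reference initialiser and is
illustrative only, since the exact seeding recurrence of the hardware
is not detailed here.

#include <stdint.h>

#define N 624

/* Stage 1 (seed generator): derive the pool of 624 seeds from one
 * user-supplied seed. */
void mt19937_seed(uint32_t mt[N], uint32_t seed)
{
    mt[0] = seed;
    for (uint32_t i = 1; i < N; i++) {
        mt[i] = 1812433253UL * (mt[i - 1] ^ (mt[i - 1] >> 30)) + i;
    }
}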
The logic used to generate values out of stages 2 and 3 is shown in
Figure 2. The simplest form of parallelism for MT19937 is to perform
stages 2 and 3 in parallel, and this is illustrated in Figure 2. Note
that it is not possible to pipeline the output generator more finely
because its processing rate is tied to the seed value modulator, which
can only be pipelined into 3 stages. In other words, the seed value
modulator is a bottleneck in the design. It needs to be pointed out that
if the data comes from a dual port BRAM, only one value can be read and
one written in the same clock cycle. Since we need three values to be
read, we use 3 dual port BRAMs. We then need logic to decide which BRAM
to write into. The write-back selection logic forms another stage in the
seed value modulator, which now has 4 stages. Not shown in Figure 1 is
the logic that generates the BRAM addresses to be read from and written
to. The single port version generates one new random number per clock
cycle. In Figure 2, mag1, mag2, and the hex numbers are constants given
in the algorithm definition.
The single port version provided is similar to the software
implementation of the MT19937 algorithm as it does not provide any
significant parallelization in the generation of the seeds.
Table 1: Diehard test results.

Test                      Single-port       624 port
                          implementation    implementation
Birthday                  0.348414          0.467321
OPERM5                    0.634231          0.892018
Binary Rank (31 × 31)     0.523537          0.678026
Binary Rank (32 × 32)     0.521654          0.578317
Binary Rank (6 × 8)       0.235435          0.457192
Bitstream                 0.235891          0.280648
OPSO                      0.624894          0.987569
OQSO                      0.235526          0.678913
DNA                       0.623498          0.446857
Stream Count-the-1s       0.352235          0.789671
Byte Count-the-1s         0.652971          0.865671
Parking Lot               0.662841          0.567193
Minimum Distance          0.623121          0.467819
3D Spheres                0.622152          0.678991
Squeeze                   0.951623          0.456719
Overlapping Sums          0.542882          0.345671
Runs Up                   0.626844          0.456191
Runs Down                 0.954532          0.898761
Craps                     0.347221          0.689187
The only parallelism that is exploited is in the concurrent execution of
the seed value modulator (stage 2) and the output generator (stage 3).
It was also found that it was not possible to pipeline the output
generator to more than 3 stages as it was tied to the seed value
modulator. Significant improvements in throughput could be achieved by
the parallelization of stages 2 and 3 in addition to executing them in
parallel as shown above.
Figure 3: 624 port version. (a) Storage of the seed pools (a seed pool
and a new seed pool, each holding seeds 0 to 623). (b) An example of one
of the parallel instances of the 624 port design: a seed value modulator
and an output generator read seed[i], seed[i + 1], and
seed[i + 396]/seed[i + 397] and produce new seed[i] and a uniform random
number.
However, the problem with parallelizing stages 2 and 3 is that currently
the seeds are all stored in a single dual port BRAM. It is not possible
to carry out multiple reads and multiple writes to a single BRAM in one
clock cycle. Previously, in [7], parallelization of both these stages
was achieved by dividing the seeds into multiple BRAMs. This, however,
significantly increased the area requirements of the design. In the next
section we study this problem in more detail and present our new design
that has a high throughput and is area efficient.
5. 624 Port Version
There has previously been an attempt to parallelize the MT19937
algorithm by dividing the seeds into various pools and then replicating
stages 2 and 3 to generate multiple outputs; see [7]. However, this was
found not to be area-time efficient. A close examination of the design
reveals that, in order to parallelize the generation of random numbers,
the authors divide the seeds into multiple BRAMs. Although this did
increase the throughput, it greatly increased the area of the design as
well. The reason for this is that the logic required to generate the
necessary BRAM addresses increased in complexity with the dividing of
seeds across multiple BRAMs.
It is important to note here that the problem is not the
parallelization of the generation of the uniform random
numbers but the storing of seeds in multiple BRAMs.
Thus if the seeds were to be stored in registers rather than
BRAMs the logic used to generate the BRAM address could
be saved. The excessive use of BRAMs to store seeds was
always considered problematic. For example, in [13] it was
found that the TT800 algorithm suffered in a similar manner
when the seeds were distributed across multiple BRAMs. In
this paper it was reported that the single port version used
81 Xilinx slices while the 3 port one used 132 slices. Of the
132 slices used, 60 slices were used for the complex BRAM
address generation logic. We believe that we can parallelize
the MT19937 algorithm to the maximum possible 624 port
by storing seeds in registers rather than BRAMs. In this
section, we study a more simplified design for a 624 port
MT19937 random number generator that uses registers to
store seeds.
A careful examination of the addressing scheme shows
that the seeds can be divided into groups in such a way that
there is no need for the logic in one group to access the seeds
in another group. We call these groups seed pools and these
are shown in Figure 3.
We also present a generic model which makes use of each of these seed
pools to modify the seed values and generate new random numbers in
every clock cycle. On each seed pool, the two stages of the MT19937
presented in Figure 3 work together to modify each seed value and
generate a new one. This is illustrated in Figure 3(b). From the point
of view of circuit speed and complexity, no register is shared by more
than two reading channels and one writing channel. The consequence is
that the register access logic is simpler, smaller, and faster.
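As a software model of what the 624 port version computes in one clock
cycle, the C sketch below refreshes and tempers all 624 seed words in a
single pass, producing 624 uniform random numbers per call. It only
models the arithmetic of the standard MT19937 formulation; the
register-based seed pools, the pool partitioning, and the pipelining of
the hardware are not represented.

#include <stdint.h>

#define N 624
#define M 397
#define MATRIX_A   0x9908b0dfUL
#define UPPER_MASK 0x80000000UL
#define LOWER_MASK 0x7fffffffUL

/* One "generation" of the 624 port model: every seed word is updated
 * by the recurrence and tempered, giving 624 outputs out[0..623]. */
void mt19937_generate_block(uint32_t mt[N], uint32_t out[N])
{
    for (int i = 0; i < N; i++) {
        uint32_t y = (mt[i] & UPPER_MASK) | (mt[(i + 1) % N] & LOWER_MASK);
        mt[i] = mt[(i + M) % N] ^ (y >> 1) ^ ((y & 1UL) ? MATRIX_A : 0UL);

        uint32_t z = mt[i];        /* temper the freshly updated word */
        z ^= (z >> 11);
        z ^= (z << 7)  & 0x9d2c5680UL;
        z ^= (z << 15) & 0xefc60000UL;
        z ^= (z >> 18);
        out[i] = z;
    }
}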
6. Results
6.1. Test for Randomness. As a preliminary test, the output
of the hardware implementations was successfully verified
against the output of the software implementation. For a
more complete test, the hardware implementations have
Table 2: Period, area, time, and throughput comparisons.

                        Period    Xilinx XC2VP70   Clock rate   Area × time (slice-sec per   Throughput (32-bit
                                  slices/LUTs      (MHz)        32-bit number, ×10^-6)       numbers/sec, ×10^9)
MT19937 [this work]
  Single port           2^19937   87               319          0.34                         0.24
  624 port              2^19937   1253             190          0.009                        119.6
Software* [12]          2^19937   -                2800         -                            1.017
MT19937 [6]             2^19937   420              76           5.5                          0.076
MT19937 [7]
  SMT***                2^19937   149              128.02       1.16                         0.12
  PMT52***              2^19937   2850             71.63        0.76                         3.7
  FMT52***              2^19937   11463            157.60       1.45                         8.2
  PMT52in***            2^19937   2914             62.24        0.9                          3.2
  FMT52in***            2^19937   5925             74.16        0.77                         3.8
LUT [8]
  4-tap, k = 32         2^32      33               309          0.06**                       0.3
  4-tap, k = 64         2^64      65               310          0.05**                       0.6
  4-tap, k = 96         2^98      97               298          0.05**                       1.1
  4-tap, k = 128        2^128     127              287          0.05**                       1.8
  4-tap, k = 256        2^258     257              246          0.06**                       1.8
  4-tap, k = 1248       2^1248    1249             168          0.09**                       6.7
  3-tap, k = 32         2^32      33               302          0.06**                       0.31
  3-tap, k = 64         2^64      65               319          0.05**                       0.64
  3-tap, k = 96         2^98      97               308          0.06**                       1.2
  3-tap, k = 128        2^128     127              287          0.06**                       1.7
  3-tap, k = 256        2^258     257              243          0.07**                       1.9
  3-tap, k = 1248       2^1248    1249             173          0.09**                       6.7

* Software implementation was on a Pentium 4 (2.8 GHz) single core processor.
** Each slice consists of 2 LUTs; therefore the area × time rating of these designs equals LUTs/2 × time.
*** This design has been implemented on an Altera Stratix. Each Xilinx slice is equivalent to two Altera logic elements.
been tested using the diehard test. In Table 1 the test results
of the diehard tests are presented. The diehard test produces
P-values in the range [0, 1). The P-values must be above 0.025 and
below 0.975 for the test to pass. Both implementations pass this test.
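For completeness, the pass criterion used above can be written as a
one-line check; diehard_pass is a hypothetical helper name, not part of
the diehard suite itself.

#include <stdbool.h>

/* A P-value passes the criterion used in this section if it lies
 * strictly between 0.025 and 0.975. */
static bool diehard_pass(double p)
{
    return p > 0.025 && p < 0.975;
}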
6.2. Comparison with Existing FPGA-Based Uniform Random
Number Generators. In this section we compare our designs
with those that are currently published. We compare our
designs on the basis of area-time rating and throughput. In contrasting
these solutions we take into account the amount of total resources used,
including slices, LUTs, and flip-flops.
From Table 2 it should be noted that our 624 port hardware
implementation of the MT19937 algorithm, when implemented on a Xilinx
XC2VP70-6 FPGA chip, achieves more than 115× the throughput of the same
algorithm's implementation in software on a Pentium 4 (2.8 GHz) single
core processor. It can also be seen that there are no other published
random number generators in the current literature that are able to
achieve a throughput of greater than 119 × 10^9 32-bit random numbers
per second. The closest competitors are the FMT52, 4-tap k = 1248, and
3-tap k = 1248 random number generators, which are still significantly
behind. The design presented herein has an area-time rating of only
0.009 × 10^-6 slice-seconds per 32-bit number for a throughput of
119 × 10^9 random numbers per second. A further criticism of [8] is that the specialized
feedback register matrix used in the implementation was not
completely published.
Our best implementation, which is the 624 port MT19937, uses only 1253
Xilinx slices. This is significantly less than all of the other
multiport designs currently published in the literature, as we use
registers to store seeds and have arranged our seed value modulator and
output generator pipelines in an area efficient manner. Thus we do not
require any complex BRAM address generation logic or access to BRAMs. As
a result we save on area, and since our design is 624 port we generate
624 uniform random numbers per clock cycle. In a reconfigurable
computing implementation, where only the random number generation is
accelerated in hardware, like all of the other FPGA-based random number
generators, the 624 port implementation is limited by the I/O bandwidth
of the FPGA.
7. Conclusion
In this paper we have presented a unique 624 port MT19937 hardware
implementation. Whilst there are currently published hardware
implementations of uniform random number generators, none seem able to
offer high throughput as well as area-time efficiency. It was
demonstrated that the 624 port design presented in this paper is a high
throughput, area-time efficient, FPGA optimized pseudouniform random
number generator with a large period and with the ability to generate
large quantities of uniform random numbers from a single seed. This
makes it suitable for use in a reconfigurable computing implementation
of real-time IR scene simulation.
Acknowledgment
Research undertaken for this report has been assisted with an
international scholarship from the Maurice de Rohan fund.
This support is acknowledged and greatly appreciated.
References
[1] P. L'Ecuyer, "Random number generation," in Handbook of Simulation,
J. Banks, Ed., chapter 4, pp. 93-137, John Wiley & Sons, New York, NY,
USA, 1998.
[2] W. H. Press, B. P. Flannery, et al., Numerical Recipes: The Art of
Scientific Computing, Cambridge University Press, Cambridge, UK, 1986.
[3] A. Srinivasan, M. Mascagni, and D. Ceperley, "Testing parallel
random number generators," Parallel Computing, vol. 29, no. 1,
pp. 69-94, 2003.
[4] P. L'Ecuyer and R. Panneton, "Fast random number generators based on
linear recurrences modulo 2: overview and comparison," in Proceedings of
the Winter Simulation Conference, pp. 110-119, IEEE Press, 2005.
[5] V. Sriram and D. Kearney, "High speed high fidelity infrared scene
simulation using reconfigurable computing," in Proceedings of the IEEE
International Conference on Field Programmable Logic and Applications,
IEEE Press, Madrid, Spain, August 2006.
[6] V. Sriram and D. A. Kearney, "An area time efficient field
programmable Mersenne twister uniform random number generator," in
Proceedings of the International Conference on Engineering of
Reconfigurable Systems and Algorithms, CSREA Press, June 2006.
[7] S. Konuma and S. Ichikawa, "Design and evaluation of hardware
pseudo-random number generator MT19937," IEICE Transactions on
Information and Systems, vol. E88-D, no. 12, pp. 2876-2879, 2005.
[8] D. B. Thomas and W. Luk, "High quality uniform random number
generation through LUT optimised linear recurrences," in Proceedings of
the IEEE International Conference on Field Programmable Technology
(FPT '05), pp. 61-68, Singapore, December 2005.
[9] R. Tausworthe, "Random numbers generated by linear recurrence modulo
two," Mathematics of Computation, vol. 19, pp. 201-209, 1965.
[10] T. Lewis and W. Payne, "Generalized feedback shift register
pseudorandom number algorithm," Journal of the ACM, vol. 20, no. 3,
pp. 456-468, 1973.
[11] M. Matsumoto and Y. Kurita, "Twisted GFSR generators II," ACM
Transactions on Modeling and Computer Simulation, vol. 4, no. 3,
pp. 254-266, 1994.
[12] M. Matsumoto and T. Nishimura, "Mersenne twister: a
623-dimensionally equidistributed uniform pseudo-random number
generator," ACM Transactions on Modeling and Computer Simulation,
vol. 8, no. 1, pp. 3-30, 1998.
[13] V. Sriram and D. Kearney, "Towards a multi-FPGA infrared
simulator," The Journal of Defense Modeling and Simulation:
Applications, Methodology, Technology, vol. 4, no. 4, pp. 50-63, 2007.