Вы находитесь на странице: 1из 40

Reconfigurable Computing: A Survey of Systems and Software

KATHERINE COMPTON
Northwestern University

AND

SCOTT HAUCK
University of Washington

Due to its potential to greatly accelerate a wide variety of applications, reconfigurable


computing has become a subject of a great deal of research. Its key feature is the ability
to perform computations in hardware to increase performance, while retaining much of
the flexibility of a software solution. In this survey, we explore the hardware aspects of
reconfigurable computing machines, from single chip architectures to multi-chip
systems, including internal structures and external coupling. We also focus on the
software that targets these machines, such as compilation tools that map high-level
algorithms directly to the reconfigurable substrate. Finally, we consider the issues
involved in run-time reconfigurable systems, which reuse the configurable hardware
during program execution.

Categories and Subject Descriptors: A.1 [Introductory and Survey]; B.6.1 [Logic
Design]: Design Style—logic arrays; B.6.3 [Logic Design]: Design Aids; B.7.1
[Integrated Circuits]: Types and Design Styles—gate arrays
General Terms: Design, Performance
Additional Key Words and Phrases: Automatic design, field-programmable, FPGA,
manual design, reconfigurable architectures, reconfigurable computing, reconfigurable
systems

1. INTRODUCTION of algorithms. The first is to use hard-


wired technology, either an Application
There are two primary methods in con- Specific Integrated Circuit (ASIC) or a
ventional computing for the execution group of individual components forming a

This research was supported in part by Motorola, Inc., DARPA, and NSF.
K. Compton was supported by an NSF fellowship.
S. Hauck was supported in part by an NSF CAREER award and a Sloan Research Fellowship.
Authors’ addresses: K. Compton, Department of Electrical and Computer Engineering, Northwestern Uni-
versity, 2145 Sheridan Road, Evanston, IL 60208-3118; e-mail: kati@ece.northwestern.edu; S. Hauck, De-
partment of Electrical Engineering, The University of Washington, Box 352500, Seattle, WA 98195; e-mail:
hauck@ee.washington.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted
without fee provided that copies are not made or distributed for profit or direct commercial advantage and
that copies show this notice on the first page or initial screen of a display along with the full citation.
Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit
is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any compo-
nent of this work in other works requires prior specific permission and/or a fee. Permissions may be requested
from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or
permissions@acm.org.
2002
c ACM 0360-0300/02/0600-0171 $5.00

ACM Computing Surveys, Vol. 34, No. 2, June 2002, pp. 171–210.
172 K. Compton and S. Hauck

board-level solution, to perform the oper- figurable hardware by computing the logic
ations in hardware. ASICs are designed functions of the circuit within the logic
specifically to perform a given computa- blocks, and using the configurable routing
tion, and thus they are very fast and to connect the blocks together to form the
efficient when executing the exact com- necessary circuit.
putation for which they were designed. FPGAs and reconfigurable computing
However, the circuit cannot be altered af- have been shown to accelerate a variety of
ter fabrication. This forces a redesign and applications. Data encryption, for exam-
refabrication of the chip if any part of its ple, is able to leverage both parallelism
circuit requires modification. This is an ex- and fine-grained data manipulation. An
pensive process, especially when one con- implementation of the Serpent Block
siders the difficulties in replacing ASICs Cipher in the Xilinx Virtex XCV1000
in a large number of deployed systems. shows a throughput increase by a factor
Board-level circuits are also somewhat in- of over 18 compared to a Pentium Pro
flexible, frequently requiring a board re- PC running at 200 MHz [Elbirt and Paar
design and replacement in the event of 2000]. Additionally, a reconfigurable com-
changes to the application. puting implementation of sieving for fac-
The second method is to use soft- toring large numbers (useful in breaking
ware-programmed microprocessors—a far encryption schemes) was accelerated by a
more flexible solution. Processors execute factor of 28 over a 200-MHz UltraSparc
a set of instructions to perform a compu- workstation [Kim and Mangione-Smith
tation. By changing the software instruc- 2000]. The Garp architecture shows a
tions, the functionality of the system is comparable speed-up for DES [Hauser
altered without changing the hardware. and Wawrzynek 1997], as does an
However, the downside of this flexibility FPGA implementation of an elliptic curve
is that the performance can suffer, if not cryptography application [Leung et al.
in clock speed then in work rate, and is 2000].
far below that of an ASIC. The processor Other recent applications that have
must read each instruction from memory, been shown to exhibit significant speed-
decode its meaning, and only then exe- ups using reconfigurable hardware
cute it. This results in a high execution include: automatic target recognition
overhead for each individual operation. [Rencher and Hutchings 1997], string pat-
Additionally, the set of instructions that tern matching [Weinhardt and Luk 1999],
may be used by a program is determined Golomb Ruler Derivation [Dollas et al.
at the fabrication time of the processor. 1998; Sotiriades et al. 2000], transitive
Any other operations that are to be im- closure of dynamic graphs [Huelsbergen
plemented must be built out of existing 2000], Boolean satisfiability [Zhong et al.
instructions. 1998], data compression [Huang et al.
Reconfigurable computing is intended to 2000], and genetic algorithms for the tra-
fill the gap between hardware and soft- velling salesman problem [Graham and
ware, achieving potentially much higher Nelson 1996].
performance than software, while main- In order to achieve these performance
taining a higher level of flexibility than benefits, yet support a wide range of appli-
hardware. Reconfigurable devices, in- cations, reconfigurable systems are usu-
cluding field-programmable gate arrays ally formed with a combination of re-
(FPGAs), contain an array of computa- configurable logic and a general-purpose
tional elements whose functionality is de- microprocessor. The processor performs
termined through multiple programmable the operations that cannot be done effi-
configuration bits. These elements, some- ciently in the reconfigurable logic, such
times known as logic blocks, are connected as data-dependent control and possibly
using a set of routing resources that are memory accesses, while the computational
also programmable. In this way, custom cores are mapped to the reconfigurable
digital circuits can be mapped to the recon- hardware. This reconfigurable logic can be

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 173

composed of either commercial FPGAs or uration compression and the partial reuse
custom configurable hardware. of already programmed configurations can
Compilation environments for reconfig- be used to reduce this overhead.
urable hardware range from tools to assist This article presents a survey of cur-
a programmer in performing a hand map- rent research in hardware and software
ping of a circuit to the hardware, to com- systems for reconfigurable computing, as
plete automated systems that take a cir- well as techniques that specifically target
cuit description in a high-level language run-time reconfigurability. We lead off this
to a configuration for a reconfigurable sys- discussion by examining the technology
tem. The design process involves first par- required for reconfigurable computing, fol-
titioning a program into sections to be im- lowed by a more in-depth examination of
plemented on hardware, and those which the various hardware structures used in
are to be implemented in software on the reconfigurable systems. Next, we look at
host processor. The computations destined the software required for compilation of
for the reconfigurable hardware are syn- algorithms to configurable computers, and
thesized into a gate level or register trans- the trade-offs between hand-mapping and
fer level circuit description. This circuit is automatic compilation. Finally, we discuss
mapped onto the logic blocks within the re- run-time reconfigurable systems, which
configurable hardware during the technol- further utilize the intrinsic flexibility of
ogy mapping phase. These mapped blocks configurable computing platforms by opti-
are then placed into the specific physi- mizing the hardware not only for different
cal blocks within the hardware, and the applications, but for different operations
pieces of the circuit are connected using within a single application as well.
the reconfigurable routing. After compi- This survey does not seek to cover ev-
lation, the circuit is ready for configura- ery technique and research project in the
tion onto the hardware at run-time. These area of reconfigurable computing. Instead,
steps, when performed using an automatic it hopes to serve as an introduction to
compilation system, require very little ef- this rapidly evolving field, bringing in-
fort on the part of the programmer to terested readers quickly up to speed on
utilize the reconfigurable hardware. How- developments from the last half-decade.
ever, performing some or all of these oper- Those interested in further background
ations by hand can result in a more highly can find coverage of older techniques
optimized circuit for performance-critical and systems elsewhere [Rose et al. 1993;
applications. Hauck and Agarwal 1996; Vuillemin et al.
Since FPGAs must pay an area penalty 1996; Mangione-Smith et al. 1997; Hauck
because of their reconfigurability, device 1998b].
capacity can sometimes be a concern. Sys-
tems that are configured only at power-
2. TECHNOLOGY
up are able to accelerate only as much
of the program as will fit within the pro- Reconfigurable computing as a concept
grammable structures. Additional areas of has been in existence for quite some time
a program might be accelerated by reusing [Estrin et al. 1963]. Even general-purpose
the reconfigurable hardware during pro- processors use some of the same basic
gram execution. This process is known ideas, such as reusing computational com-
as run-time reconfiguration (RTR). While ponents for independent computations,
this style of computing has the benefit of and using multiplexers to control the
allowing for the acceleration of a greater routing between these components. How-
portion of an application, it also introduces ever, the term reconfigurable comput-
the overhead of configuration, which lim- ing, as it is used in current research
its the amount of acceleration possible. Be- (and within this survey), refers to sys-
cause configuration can take milliseconds tems incorporating some form of hard-
or longer, rapid and efficient configuration ware programmability—customizing how
is a critical issue. Methods such as config- the hardware is used using a number

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


174 K. Compton and S. Hauck

Fig. 1. A programming bit for SRAM-based FPGAs [Xilinx 1994] (left) and a pro-
grammable routing connection (right).

of physical control points. These control Thus, these chips can be programmed and
points can then be changed periodically in reprogrammed about as easily as a stan-
order to execute different applications us- dard static RAM. In fact, one research
ing the same hardware. project, the PAM project [Vuillemin et al.
The recent advances in reconfigurable 1996], considers a group of one or more
computing are for the most part de- FPGAs to be a RAM unit that performs
rived from the technologies developed computation between the memory write
for FPGAs in the mid-1980s. FPGAs (sending the configuration information
were originally created to serve as a hy- and input data) and memory read (read-
brid device between PALs and Mask- ing the results of the computation). This
Programmable Gate Arrays (MPGAs). leads some to use the term Programmable
Like PALs, FPGAs are fully electrically Active Memory or PAM.
programmable, meaning that the physical One example of how the SRAM configu-
design costs are amortized over multiple ration points can be used is to control rout-
application circuit implementations, and ing within a reconfigurable device [Chow
the hardware can be customized nearly in- et al. 1999a]. To configure the routing on
stantaneously. Like MPGAs, they can im- an FPGA, typically a passgate structure
plement very complex computations on a is employed (see Figure 1 right). Here the
single chip, with devices currently in pro- programming bit will turn on a routing
duction containing the equivalent of over connection when it is configured with a
a million gates. Because of these features, true value, allowing a signal to flow from
FPGAs had been primarily viewed as glue- one wire to another, and will disconnect
logic replacement and rapid-prototyping these resources when the bit is set to false.
vehicles. However, as we show through- With a proper interconnection of these ele-
out this article, the flexibility, capacity, ments, which may include millions of rout-
and performance of these devices has ing choice points within a single device, a
opened up completely new avenues in rich routing fabric can be created.
high-performance computation, forming Another example of how these configu-
the basis of reconfigurable computing. ration bits may be used is to control mul-
Most current FPGAs and reconfig- tiplexers, which will choose between the
urable devices are SRAM-programmable output of different logic resources within
(Figure 1 left), meaning that SRAM1 the array. For example, to provide optional
bits are connected to the configuration stateholding elements a D flip-flop (DFF)
points in the FPGA, and programming may be included with a multiplexer se-
the SRAM bits configures the FPGA. lecting whether to forward the latched
or unlatched signal value (see Figure 2
1 The term “SRAM” is technically incorrect for many left). Thus, for systems that require state-
FPGA architectures, given that the configuration holding the programming bits controlling
memory may or may not support random access. In the multiplexer would be configured to se-
fact, the configuration memory tends to be continu- lect the DFF output, while systems that
ally read in order to perform its function. However,
this is the generally accepted term in the field and
do not need this function would choose
correctly conveys the concept of static volatile mem- the bypass route that sends the input di-
ory using an easily understandable label. rectly to the output. Similar structures

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 175

Fig. 2. D flip-flop with optional bypass (left) and a 3-input LUT (right).

can choose between other on-chip func- very closely coupled systems, the recon-
tionalities, such as fixed-logic computation figurability lies within customizable func-
elements, memories, carry chains, or other tional units on the regular datapath of
functions. the microprocessor. On the other hand, a
Finally, the configuration bits may be reconfigurable computing system can be
used as control signals for a computational as loosely coupled as a networked stand-
unit or as the basis for computation it- alone unit. Most reconfigurable systems
self. As a control signal, a configuration are categorized somewhere between these
bit may determine whether an ALU per- two extremes, frequently with the recon-
forms an addition, subtraction, or other figurable hardware acting as a coproces-
logic computations. On the other hand, sor to a host microprocessor. The pro-
with a structure such as a lookup table grammable array itself can be comprised
(LUT), the configuration bits themselves of one or more commercially available
form the result of the computation (see FPGAs, or can be a custom device designed
Figure 2 right). These elements are essen- specifically for reconfigurable computing.
tially small memories provided for com- The design of the actual computation
puting arbitrary logic functions. LUTs can blocks within the reconfigurable hardware
compute any function of N inputs (where varies from system to system. Each unit of
N is the number of control signals for the computation, or logic block, can be as sim-
LUT’s multiplexer) by programming the ple as a 3-input lookup table (LUT), or as
2N programming bits with the truth ta- complex as a 4-bit ALU. This difference
ble of the desired function. Thus, if all in block size is commonly referred to as
programming bits except the one corre- the granularity of the logic block, where
sponding to the input pattern 111 were the 3-bit LUT is an example of a very
set to zero a 3-input LUT would act as a fine-grained computational element, and a
3-input AND gate, while programming it 4-bit ALU is an example of a quite coarse-
with all ones except in 000 would compute grained unit. The finer-grained blocks are
a NAND. useful for bit-level manipulations, while
the coarse-grained blocks are better opti-
mized for standard datapath applications.
3. HARDWARE
Some architectures employ different sizes
Reconfigurable computing systems use or types of blocks within a single recon-
FPGAs or other programmable hardware figurable array in order to efficiently sup-
to accelerate algorithm execution by map- port different types of computation. For
ping compute-intensive calculations to the example, memory is frequently embedded
reconfigurable substrate. These hardware within the reconfigurable hardware to pro-
resources are frequently coupled with a vide temporary data storage, forming a
general-purpose microprocessor that is heterogeneous structure composed of both
responsible for controlling the reconfig- logic blocks and memory blocks [Ebeling
urable logic and executing program code et al. 1996; Altera 1998; Lucent 1998;
that cannot be efficiently accelerated. In Marshall et al. 1999; Xilinx 1999].

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


176 K. Compton and S. Hauck

The routing between the logic blocks cessor’s registers. A call to the Chimaera
within the reconfigurable hardware is also unit is in actuality only a fetch of the re-
of great importance. Routing contributes sult value. This value is stable and valid
significantly to the overall area of the re- after the correct input values have been
configurable hardware. Yet, when the per- written to the registers and have filtered
centage of logic blocks used in an FPGA be- through the computation.
comes very high, automatic routing tools In the next sections, we consider in
frequently have difficulty achieving the greater depth the hardware issues in re-
necessary connections between the blocks. configurable computing, including both
Good routing structures are therefore es- logic and routing. To support the compu-
sential to ensure that a design can be suc- tation demands of reconfigurable comput-
cessfully placed and routed onto the recon- ing, we consider the logic block architec-
figurable hardware. tures of these devices, including possibly
Once a circuit has been programmed the integration of heterogeneous logic re-
onto the reconfigurable hardware, it is sources within a device. Heterogeneity
ready to be used by the host processor dur- also extends between chips, where one of
ing program execution. The run-time op- the most important concerns is the cou-
eration of a reconfigurable system occurs pling of the reconfigurable logic with stan-
in two distinct phases: configuration and dard, general-purpose processors. How-
execution. The programming of the recon- ever, reconfigurable devices are more than
figurable hardware is under the control of just logic devices; the routing resources
the host processor. This host processor di- are at least as important as logic re-
rects a stream of configuration data to the sources, and thus we consider intercon-
reconfigurable hardware, and this config- nect structures, including 1D-oriented de-
uration data is used to define the actual vices that are beginning to appear.
operation of the hardware. Configurations
can be loaded solely at start-up of a pro-
3.1. Coupling
gram, or periodically during runtime, de-
pending on the design of the system. More Frequently, reconfigurable hardware is
concepts involved in run-time reconfigu- coupled with a traditional microprocessor.
ration (the dynamic reconfiguration of de- Programmable logic tends to be inefficient
vices during computation execution) are at implementing certain types of opera-
discussed in a later section. tions, such as variable-length loops and
The actual execution model of the re- branch control. In order to run an applica-
configurable hardware varies from sys- tion in a reconfigurable computing system
tem to system. For example, the NAPA most efficiently, the areas of the program
system [Rupp et al. 1998] by default that cannot be easily mapped to the recon-
suspends the execution of the host pro- figurable logic are executed on a host mi-
cessor during execution on the recon- croprocessor. Meanwhile, the areas with a
figurable hardware. However, simulta- high density of computation that can ben-
neous computation can occur with the efit from implementation in hardware are
use of fork-and-join primitives, similar to mapped to the reconfigurable logic. For the
multiprocessor programming. REMARC systems that use a microprocessor in con-
[Miyamori and Olukotun 1998] is a re- junction with reconfigurable logic, there
configurable system that uses a pipelined are several ways in which these two com-
set of execution phases within the recon- putation structures may be coupled, as
figurable hardware. These pipeline stages Figure 3 shows.
overlap with the pipeline stages of the host First, reconfigurable hardware can be
processor, allowing for simultaneous ex- used solely to provide reconfigurable
ecution. In the Chimaera system [Hauck functional units within a host proces-
et al. 1997], the reconfigurable hardware sor [Razdan and Smith 1994; Hauck
is constantly executing based upon the in- et al. 1997]. This allows for a tradi-
put values held in a subset of the host pro- tional programming environment with the

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 177

Fig. 3. Different levels of coupling in a reconfigurable system. Reconfigurable logic


is shaded.

addition of custom instructions that may logic is embedded into the data cache.
change over time. Here, the reconfigurable This cache can then be used as either a
units execute as functional units on the regular cache or as an additional com-
main microprocessor datapath, with reg- puting resource depending on the target
isters used to hold the input and output application.
operands. Third, an attached reconfigurable
Second, a reconfigurable unit may processing unit [Vuillemin et al. 1996;
be used as a coprocessor [Wittig and Annapolis 1998; Laufer et al. 1999] be-
Chow 1996; Hauser and Wawrzynek 1997; haves as if it is an additional processor in
Miyamori and Olukotun 1998; Rupp et al. a multiprocessor system or an additional
1998; Chameleon 2000]. A coprocessor is, compute engine accessed semifrequently
in general, larger than a functional unit, through external I/O. The host processor’s
and is able to perform computations with- data cache is not visible to the attached
out the constant supervision of the host reconfigurable processing unit. There is,
processor. Instead, the processor initial- therefore, a higher delay in communica-
izes the reconfigurable hardware and ei- tion between the host processor and the re-
ther sends the necessary data to the logic, configurable hardware, such as when com-
or provides information on where this data municating configuration information,
might be found in memory. The reconfig- input data, and results. This communi-
urable unit performs the actual computa- cation is performed though specialized
tions independently of the main processor, primitives similar to multiprocessor sys-
and returns the results after completion. tems. However, this type of reconfigurable
This type of coupling allows the reconfig- hardware does allow for a great deal of
urable logic to operate for a large num- computation independence, by shifting
ber of cycles without intervention from large chunks of a computation over to the
the host processor, and generally permits reconfigurable hardware.
the host processor and the reconfigurable Finally, the most loosely coupled form
logic to execute simultaneously. This re- of reconfigurable hardware is that of
duces the overhead incurred by the use an external stand-alone processing unit
of the reconfigurable logic, compared to a [Quickturn 1999a, 1999b]. This type of
reconfigurable functional unit that must reconfigurable hardware communicates
communicate with the host processor each infrequently with a host processor (if
time a reconfigurable “instruction” is used. present). This model is similar to that
One idea that is somewhat of a hybrid be- of networked workstations, where pro-
tween the first and second coupling meth- cessing may occur for very long periods
ods, is the use of programmable hardware of time without a great deal of commu-
within a configurable cache [Kim et al. nication. In the case of the Quickturn
2000]. In this situation, the reconfigurable systems, however, this hardware is geared

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


178 K. Compton and S. Hauck

more towards emulation than reconfig-


urable computing.
Each of these styles has distinct ben-
efits and drawbacks. The tighter the in-
tegration of the reconfigurable hardware,
the more frequently it can be used within
an application or set of applications due
to a lower communication overhead. How-
ever, the hardware is unable to operate
for significant portions of time without in-
tervention from a host processor, and the
amount of reconfigurable logic available is
often quite limited. The more loosely cou-
pled styles allow for greater parallelism in Fig. 4. A basic logic block, with a 4-input
program execution, but suffer from higher LUT, carry chain, and a D-type flip-flop with
communications overhead. In applications bypass.
that require a great deal of communica-
tion, this can reduce or remove any accel- established that the best function block
eration benefits gained through this type for a standard FPGA, a device whose pri-
of reconfigurable hardware. mary role is the implementation of ran-
dom digital logic, is the one found in the
3.2. Traditional FPGAs
first devices deployed—the lookup table
(Figure 2 right). As described in the pre-
Before discussing the detailed architec- vious section, an N-input LUT is basically
ture design of reconfigurable devices in a memory that, when programmed appro-
general, we will first describe the logic priately, can compute any function of up to
and routing of FPGAs. These concepts N inputs. This flexibility, with relatively
apply directly to reconfigurable systems simple routing requirements (each input
using commercial FPGAs, such as PAM need only be routed to a single multiplexer
[Vuillemin et al. 1996] and Splash 2 control input) turns out to be very power-
[Arnold et al. 1992; Buell et al. 1996], ful for logic implementation. Although it is
and many also extend to architectures less area-efficient than fixed logic blocks,
designed specifically for reconfigurable such as a standard NAND gate, the truth
computing. Hardware concepts applying is that most current FPGAs use less than
specifically to architectures designed for 10% of their chip area for logic, devoting
reconfigurable computing, as well as vari- the majority of the silicon real estate for
ations on the generic FPGA description routing resources.
provided here, are discussed following this The typical FPGA has a logic block
section. More detailed surveys of FPGA ar- with one or more 4-input LUT(s), op-
chitectures themselves can be found else- tional D flip-flops (DFF), and some form
where [Brown et al. 1992a; Rose et al. of fast carry logic (Figure 4). The LUTs
1993]. allow any function to be implemented, pro-
Since the introduction of FPGAs in the viding generic logic. The flip-flop can be
mid-1980s, there have been many differ- used for pipelining, registers, statehold-
ent investigations into what computation ing functions for finite state machines, or
element(s) should be built into the ar- any other situation where clocking is re-
ray [Rose et al. 1993]. One could consider quired. Note that the flip-flops will typi-
FPGAs that were created with PAL-like cally include programmable set/reset lines
product term arrays, or multiplexer-based and clock signals, which may come from
functionality, or even basic fixed functions global signals routed on special resources,
such as simple NAND and XOR gates. In or could be routed via the standard in-
fact, many such architectures have been terconnect structures from some other
built. However, it seems to be fairly well input or logic block. The fast carry logic

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 179

minals. These blocks also connect shorter


local wires to longer-distance routing re-
sources. Signals flow from the logic block
into the connection block, and then along
longer wires within the routing channels.
At the switchboxes, there are connections
between the horizontal and vertical rout-
ing resources to allow signals to change
their routing direction. Once the signal
has traversed through routing resources
and intervening switchboxes, it arrives at
the destination logic block through one of
its local connection blocks. In this man-
ner, relatively arbitrary interconnections
can be achieved between the logic blocks
in the system.
Fig. 5. A generic island-style FPGA routing archi- Within a given routing channel, there
tecture.
may be a number of different lengths of
routing resources. Some local interconnec-
is a special resource provided in the cell tions may only move between adjacent
to speed up carry-based computations, logic blocks (carry chains are a good ex-
such as addition, parity, wide AND op- ample of this), providing high-speed lo-
erations, and other functions. These re- cal interconnect. Medium length lines may
sources will bypass the general routing run the width of several logic blocks, pro-
structure, connecting instead directly be- viding for some longer distance intercon-
tween neighbors in the same column. nect. Finally, longlines that run the entire
Since there are very few routing choices chip width or height may provide for more
in the carry chain, and thus less delay on global signals. Also, many architectures
the computation, the inclusion of these re- contain special “global lines” that provide
sources can significantly speed up carry- high-speed, and often low-skew, connec-
based computations. tions to all of the logic blocks in the array.
Just as there has been a great deal These are primarily used for clocks, resets,
of experimentation in FPGA logic block and other truly global signals.
architectures, there has been equally While the routing architecture of an
as much investigation into interconnect FPGA is typically quite complex—the con-
structures. As logic blocks have basically nection blocks and switchboxes surround-
standardized on LUT-based structures, ing a single logic block typically have thou-
routing resources have become primarily sands of programming points—they are
island-style, with logic surrounded by gen- designed to be able to support fairly arbi-
eral routing channels. trary interconnection patterns. Most users
Most FPGA architectures organize their ignore the exact details of these architec-
routing structures as a relatively smooth tures and allow the automatic physical de-
sea of routing resources, allowing fast and sign tools to choose appropriate resources
efficient communication along the rows to use in order to achieve a given intercon-
and columns of logic blocks. As shown nect pattern.
in Figure 5, the logic blocks are em-
bedded in a general routing structure, 3.3. Logic Block Granularity
with input and output signals attaching
to the routing fabric through connection Most reconfigurable hardware is based
blocks. The connection blocks provide pro- upon a set of computation structures that
grammable multiplexers, selecting which are repeated to form an array. These
of the signals in the given routing channel structures, commonly called logic blocks
will be connected to the logic block’s ter- or cells, vary in complexity from a very

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


180 K. Compton and S. Hauck

flip-flop. Additionally, there is specialized


carry-chain circuitry that helps to acceler-
ate addition, parity, and other operations
that use a carry chain. These types of logic
blocks are useful for fine-grained bit-level
manipulation of data, as can frequently be
found in encryption and image processing
applications. Also, because the cells are
fine-grained, computation structures of
arbitrary bit widths can be created. This
can be useful for implementing datapath
circuits that are based on data widths not
Fig. 6. The functional unit from a Xilinx 6200 cell
[Xilinx 1996].
implemented on the host processor (5 bit
multiply, 18 bit addition, etc). Reconfig-
urable hardware can not only take advan-
small and simple block that can calculate tage of small bit widths, but also large data
a function of only three inputs, to a struc- widths. When a program uses bit widths
ture that is essentially a 4-bit ALU. Some in excess of what is normally available in
of these block types are configurable, in a host processor, the processor must per-
that the actual operation is determined by form the computations using a number of
a set of loaded configuration data. Other extra steps in order to handle the full data
blocks are fixed structures, and the config- width. A fine-grained architecture would
urability lies in the connections between be able to implement the full bit width in a
them. The size and complexity of the ba- single step, without the fetching, decoding,
sic computing blocks is referred to as the and execution of additional instructions,
block’s granularity. as long as enough logic cells are available.
An example of a very fine-grained logic A number of reconfigurable systems use
block can be found in the Xilinx 6200 series a granularity of logic block that we cat-
of FPGAs [Xilinx 1996]. The functional egorize as medium-grained [Xilinx 1994;
unit from one of these cells, as shown in Hauser and Wawrzynek 1997; Haynes and
Figure 6, can implement any two-input Cheung 1998; Lucent 1998; Marshall et al.
function and some three-input functions. 1999]. For example, Garp [Hauser and
However, although this type of architec- Wawrzynek 1997] is designed to perform
ture is useful for very fine-grained bit ma- a number of different operations on up
nipulation, it can be too fine-grained to ef- to four 2-bit inputs. Another medium-
ficiently implement many types of circuits, grained structure was designed specifi-
such as multipliers. Similarly, finite state cally to be embedded inside of a general-
machines are frequently too complex to purpose FPGA to implement multipliers
easily map to a reasonable number of of a configurable bit width [Haynes and
very fine-grained logic blocks. However, fi- Cheung 1998]. The logic block used in the
nite state machines are also too dependent multiplier FPGA is capable of implement-
upon single bit values to be efficiently im- ing a 4 × 4 multiplication, or cascaded into
plemented in a very coarse-grained archi- larger structures. The CHESS architec-
tecture. This type of circuit is more suited ture [Marshall et al. 1999] also operates
to an architecture that provides more on 4-bit values, with each of its cells act-
connections and computational power per ing as a 4-bit ALU. Medium-grained logic
logic block, yet still provides sufficient ca- blocks may be used to implement datapath
pability for bit-level manipulation. circuits of varying bit widths, similar to
The logic cell in the Altera FLEX 10K ar- the fine-grained structures. However, with
chitecture [Altera 1998] is a fine-grained the ability to perform more complex oper-
structure that is somewhat coarser than ations of a greater number of inputs, this
the 6200. This architecture mainly con- type of structure can be used efficiently to
sists of a single 4-input LUT with a implement a wider variety of operations.

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 181

Fig. 7. One cell in the RaPiD-I reconfigurable architecture [Ebeling et al.


1996]. The registers, RAM, ALUs, and multiplier all operate on 16-bit values.
The multiplier outputs a 32-bit result, split into the high 16 bits and the low
16 bits. All routing lines shown are 16-bit wide busses. The short parallel
lines on the busses represent configurable bus connectors.

Very coarse-grained architectures are is composed of an 8 × 8 array of 16-bit


primarily intended for the implementa- processors. Each of these processors uses
tion of word-width datapath circuits. Be- its own instruction memory in conjunction
cause the logic blocks used are optimized with a global program counter. This style
for large computations, they will perform of architecture closely resembles a single-
these operations much more quickly (and chip multiprocessor, although with much
consume less chip area) than a set of simpler component processors because the
smaller cells connected to form the same system is intended to be coupled with a
type of structure. However, because their host processor. The RAW project [Moritz
composition is static, they are unable et al. 1998] is a further example of a re-
to leverage optimizations in the size of configurable architecture based on a mul-
operands. For example, the RaPiD archi- tiprocessor design.
tecture [Ebeling et al. 1996], shown in The granularity of the FPGA also has
Figure 7, as well as the Chameleon ar- a potential effect on the reconfiguration
chitecture [Chameleon 2000], are exam- time of the device. This is an important
ples of this very coarse-grained type of issue for run-time reconfiguration, which
design. Each of these architectures is com- is discussed in further depth in a later sec-
posed of word-sized adders, multipliers, tion. A fine-grained array has many config-
and registers. If only three 1-bit values uration points to perform very small com-
are required, then the use of these archi- putations, and thus requires more data
tectures suffers an unnecessary area and bits during configuration.
speed overhead, as all of the bits in the full
word size are computed. However, these
3.4. Heterogeneous Arrays
coarse-grained architectures can be much
more efficient than fine-grained architec- In order to provide greater performance
tures for implementing functions closer to or flexibility in computation, some recon-
their basic word size. figurable systems provide a heterogeneous
An alternate form of a coarse-grained structure, where the capabilities of the
system is one in which the logic blocks logic cells are not the same throughout
are actually very small processors, poten- the system. One use of heterogeneity in
tially each with its own instruction mem- reconfigurable systems is to provide mul-
ory and/or data values. The REMARC ar- tiplier function blocks embedded within
chitecture [Miyamori and Olukotun 1998] the reconfigurable hardware [Haynes and

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


182 K. Compton and S. Hauck

Cheung 1998; Chameleon 2000; Xilinx can be emulated [Altera 1998; Cong and
2001]. Because multiplication is one of the Xu 1998; Wilton 1998; Heile and Leaver
more difficult computations to implement 1999]. In fact, because there may be more
efficiently in a traditional FPGA struc- than one value output from the memory
ture, the custom multiplication hardware on a read operation, the memory struc-
embedded within a reconfigurable array ture may be able to perform multiple dif-
allows a system to perform even that func- ferent computations (one for each bit of
tion well. data output), provided that all necessary
Another use of heterogeneous struc- inputs appear on the address lines. In this
tures is to provide embedded memory manner, the embedded RAM behaves the
blocks scattered throughout the reconfig- same as a very large LUT. Therefore, em-
urable hardware. This allows storage of bedded memory allows a programmer or
frequently used data and variables, and a synthesis tool to perform a trade-off be-
allows for quick access to these values tween logic and memory usage in order to
due to the proximity of the memory to achieve higher area efficiency.
the logic blocks that access it. Memory Furthermore, a few of the commercial
structures embedded into the reconfig- FPGA companies have announced plans to
urable fabric come in two forms. The first include entire microprocessors as embed-
is simply the use of available LUTs as ded structures within their FPGAs. Altera
RAM structures, as can be done in the has demonstrated a preliminary ARM9-
Xilinx 4000 series [Xilinx 1994] and Virtex based Excalibur device, which combines
[Xilinx 1999] FPGAs. Although making reconfigurable hardware with an embed-
these very small blocks into a larger ded ARM9 processor core [Altera 2001].
RAM structure introduces overhead to the Meanwhile, Xilinx is working with IBM to
memory system, it does provide local, vari- include a PowerPC processor core within
able width memory structures. the Virtex-II FPGA [Xilinx 2000]. By con-
Some architectures include dedicated trast, Adaptive Silicon’s focus is to provide
memory blocks within their array, such reconfigurable logic cores to customers for
as the Xilinx Virtex series [Xilinx 1999, embedding in their own system-on-a-chip
2001] and Altera [Altera 1998] FPGAs, as (SoC) devices [Adaptive 2001].
well as the CS2000 RCP (reconfigurable
communications processor) device from
3.5. Routing Resources
Chameleon Systems, Inc. [Chameleon
2000]. These memory blocks have greater Interconnect resources are provided in a
performance in large sizes than similar- reconfigurable architecture to connect to-
sized structures built from many small gether the device’s programmable logic el-
LUTs. While these structures are some- ements. These resources are usually con-
what less flexible than the LUT-based figurable, where the path of a signal is
memories, they can also provide some cus- determined at compile or run-time rather
tomization. For example, the Altera FLEX than fabrication time. This flexible inter-
10K FPGA [Altera 1998] provides embed- connect between logic blocks or computa-
ded memories that have a limited total tional elements allows for a wide variety
number of wires, but allow a trade-off be- of circuit structures, each with their own
tween the number of address lines and the interconnect requirements, to be mapped
data bit width. to the reconfigurable hardware. For ex-
When embedded memories are not used ample, the routing for FPGAs is gener-
for data storage by a particular config- ally island-style, with logic surrounded
uration, the area that they occupy does by routing channels, which contain sev-
not necessarily have to be wasted. By us- eral wires, potentially of varying lengths.
ing the address lines of the memory as Within this type of routing architecture,
function inputs and the values stored in however, there are still variations. Some of
the memory as function outputs, logical these differences include the ratio of wires
expressions of a large number of inputs to logic in the system, how long each of the

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 183

Fig. 8. Segmented (left) and hierarchical (right) routing structures. The white
boxes are logic blocks, while the dark boxes are connection switches.

wires should be, and whether they should local communications traffic. These short
be connected in a segmented or hierarchi- wires can be connected together using
cal manner. switchboxes to emulate longer wires. Fre-
A step in the design of efficient rout- quently, segmented routing structures
ing structures for FPGAs and reconfig- also contain longer wires to allow sig-
urable systems therefore involves exam- nals to travel efficiently over long dis-
ining the logic vs. routing area trade-off tances without passing through a great
within reconfigurable architectures. One number of switches. Hierarchical routing
group has argued that the interconnect [Aggarwal and Lewis 1994; Lai and Wang
should constitute a much higher propor- 1997; Tsu et al. 1999] is the second method
tion of area in order to allow for successful to provide both local and global commu-
routing under high-logic utilization condi- nication. Routing within a group (or clus-
tions [Takahara et al. 1998]. However, for ter) of logic blocks is at the local level,
FPGAs, high-LUT utilization may not nec- only connecting within that cluster. At
essarily be the most desirable situation, the boundaries of these clusters, however,
but rather efficient routing usage may be longer wires connect the different clusters
of more importance [DeHon 1999]. This together. This is potentially repeated at a
is because the routing resources occupy a number of levels. The idea behind the use
much larger part of the area of an FPGA of hierarchical structures is that, provided
than the logic resources, and therefore the a good placement has been made onto the
most area efficient designs will be those hardware, most communication should be
that optimize their use of the routing re- local and only a limited amount of com-
sources rather than the logic resources. munication will traverse long distances.
The amount of required routing does not Therefore, the wiring is designed to fit this
grow linearly with the amount of logic model, with a greater number of local rout-
present; therefore, larger devices require ing wires in a cluster than distance routing
even greater amounts of routing per logic wires between clusters.
block than small ones [Trimberger et al. Because routing can occupy a large part
1997b]. of the area of a reconfigurable device, the
There are two primary methods to pro- type of routing used must be carefully con-
vide both local and global routing re- sidered. If the wires available are much
sources, as shown in Figure 8. The first longer than what is required to route a sig-
is the use of segmented routing [Betz and nal, the excess wire length is wasted. On
Rose 1999; Chow et al. 1999a]. In seg- the other hand, if the wires available are
mented routing, short wires accommodate much shorter than necessary, the signal

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


184 K. Compton and S. Hauck

Fig. 9. A traditional two-dimensional island-style routing structure (left) and a one-


dimensional routing structure (right). The white boxes represent logic elements.

must pass through switchboxes that con- routing is that if there are not enough
nect the short wires together into a longer routing resources in a particular area of
wire, or through levels of the routing hier- a mapped circuit, routing that circuit be-
archy. This induces additional delay and comes actually more difficult than on a
slows the overall operation of the circuit. two-dimensional array that provides more
Furthermore, the switchbox circuitry oc- alternatives. A number of different re-
cupies area that might be better used for configurable systems have been designed
additional logic or wires. in this manner. Both Garp [Hauser and
There are a few alternatives to the Wawrzynek 1997] and Chimaera [Hauck
island-style of routing resources. Systems et al. 1997] are structures that provide
such as RaPiD [Ebeling et al. 1996] use cells that compute a small number of bit
segmented bus-based routing, where sig- positions, and a row of these cells to-
nals are full word-sized in width. This is gether computes the full data word. A
most common in the one-dimensional type row can only be used by a single config-
of architecture, as discussed in the next uration, making these designs one dimen-
section. sional. In this manner, each configuration
occupies some number of complete rows.
Although multiple narrow-width compu-
3.6. One-Dimensional Structures
tations can fit within a single row, these
Most current FPGAs are of the two- structures are optimized for word-based
dimensional variety, as shown in Figure 9. computations that occupy the entire row.
This allows for a great deal of flexibility, The NAPA architecture [Rupp et al. 1998]
as any signal can be routed on a nearly is similar, with a full column of cells act-
arbitrary path. However, providing this ing as the atomic unit for a configura-
level of routing flexibility requires a great tion, as is PipeRench [Cadambi et al. 1998;
deal of routing area. It also complicates Goldstein et al. 2000].
the placement and routing software, as the In some systems, the computation
software must consider a very large num- blocks in a one-dimensional structure op-
ber of possibilities. erate on word-width values instead of
One solution is to use a more one- single bits. Therefore, busses are routed
dimensional style of architecture, also de- instead of individual values. This also
picted in Figure 9. Here, placement is decreases the time required for routing,
restricted along one axis. With a more as the bits of a bus can be considered
limited set of choices, the placement can together rather than as separate routes.
be performed much more quickly. Routing As shown previously in Figure 7, RaPiD
is also simplified, because it is generally [Ebeling et al. 1996] is basically a one-
along a single dimension as well, with the dimensional design that only includes
other dimension generally only used for word-width processing elements. The dif-
calculations requiring a shift operation. ferent computation units are organized in
One drawback of the one-dimensional a single dimension along the horizontal

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 185

Fig. 10. Mesh (left) and partial crossbar (right) interconnect topologies for multi-FPGA
systems.

axis. The general flow of information fol- level column and row busses is the P1
lows this layout, with the major routing system developed within the PAM project
busses also laid out in a horizontal man- [Vuillemin et al. 1996]. This architecture
ner. Additionally, all routing is of word- uses a central array of 16 commercial
sized values, and therefore all routing is FPGAs with connections to nearest-
of busses, not individual wires. A few ver- neighbors. However, four 16-bit row busses
tical resources are included in the archi- and four 16-bit column busses run the
tecture to allow signals to transfer be- length of the array and facilitate commu-
tween busses, or to travel from a bus to nication between non-neighbor FPGAs.
a computation node. However, the major- A crossbar attempts to remove this prob-
ity of the routing in this architecture is lem by using special routing-only chips
one-dimensional. to connect each FPGA potentially to any
other FPGA. The inter-chip delays are
more uniform, given that a signal trav-
3.7. Multi-FPGA Systems
els the exact same “distance” to get from
Reconfigurable systems that are composed one FPGA to another, regardless of where
of multiple FPGA chips interconnected those FPGAs are located. However, a
on a single processing board have addi- crossbar interconnect does not scale eas-
tional hardware concerns over single-chip ily with an increase in the number of
systems. In particular, there is a need for FPGAs. The crossbar pattern of the chips
an efficient connection scheme between is fixed at fabrication of the multi-FPGA
the chips, as well as to external memory board. Variants on these two basic topolo-
and the system bus. This is to provide for gies attempt to remove some of the prob-
circuits that are too large to fit within a lems encountered in mesh and crossbar
single FPGA, but may be partitioned over topologies [Arnold et al. 1992; Varghese
the multiple FPGAs available. A number et al. 1993; Buell et al. 1996; Vuillemin
of different interconnection schemes have et al. 1996; Lewis et al. 1997; Khalid and
been explored [Butts and Batcheller 1991; Rose 1998]. One of these variants can be
Hauck et al. 1998a; Hauck 1998; Khalid found in the Splash 2 system [Arnold et al.
1999] including meshes and crossbars, as 1992; Buell et al. 1996]. The predecessor,
shown in Figure 10. A mesh connects the Splash 1, used a linear systolic commu-
nearest-neighbors in the array of FPGA nication method. This type of connection
chips. This allows for efficient communi- was found to work quite well for a vari-
cation between the neighbors, but may ety of applications. However, this highly
require that some signals pass through constrained communication model made
an FPGA simply to create a connection some types of computations difficult or
between non-neighbors. Although this can even impossible. Therefore, Splash 2 was
be done, and is quite possible, it uses valu- designed to include not only the linear con-
able I/O resources on the FPGA that forms nections of Splash 1 that were found to
the routing bridge. One system that uses be useful for many applications, but also
a mesh topology with additional board- a crossbar network to allow any FPGA

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


186 K. Compton and S. Hauck

to communicate with any other FPGA on


the same board. For multi-FPGA systems,
because of the need for efficient commu-
nication between the FPGAs, determin-
ing the inter-chip routing topology is a
very important step in the design process.
More details on multi-FPGA system archi-
tectures can be found elsewhere [Hauck
1998b; Khalid 1999].

3.8. Hardware Summary


The design of reconfigurable hardware Fig. 11. Three possible design flows for algorithm
varies wildly from system to system. The implementation on a reconfigurable system. Grey
reconfigurable logic may be used as a stages indicate manual effort on the part of the de-
configurable functional unit, or may be signer, while white stages are done automatically.
The dotted lines represent paths to improve the re-
a multi-FPGA stand-alone unit. Within sulting circuit. It should be noted that the middle
the reconfigurable logic itself, the com- design cycle is only one of the possible compromises
plexity of the core computational units, between automatic and manual design.
or logic blocks, vary from very simple to
extremely complex, some implementing
a 4-bit ALU or even a 16 × 16 multi- reconfigurable system employed, as well
plication. These blocks are not required as a significant amount of design time. On
to be uniform throughout the array, as the other end of the spectrum, an auto-
the use of different types of blocks can matic compilation system provides a quick
add high-performance functionality in the and easy way to program for reconfig-
case of specialized computation circuitry, urable systems. It therefore makes the use
or expanded storage in the case of em- of reconfigurable hardware more accessi-
bedded memory blocks. Routing resources ble to general application programmers,
also offer a variety of choices, primarily in but quality may suffer.
amount, length, and organization of the Both for manual and automatic cir-
wires. Systems have been developed that cuit creation, the design process proceeds
fit into many different points within this through a number of distinct phases, as
design space, and no true “best” system indicated in Figure 11. Circuit specifica-
has yet been agreed upon. tion is the process of describing the func-
tions that are to be placed on the recon-
4. SOFTWARE
figurable hardware. This can be done as
simply as by writing a program in C that
Although reconfigurable hardware has represents the functionality of the algo-
been shown to have significant perfor- rithm to be implemented in hardware. On
mance benefits for some applications, it the other hand, this can also be as complex
may be ignored by application program- as specifying the inputs, outputs, and op-
mers unless they are able to easily in- eration of each basic building block in the
corporate its use into their systems. This reconfigurable system. Between these two
requires a software design environment methods is the specification of the circuit
that aids in the creation of configurations using generic complex components, such
for the reconfigurable hardware. This soft- as adders and multipliers, which will be
ware can range from a software assist mapped to the actual hardware later in
in manual circuit creation to a complete the design process. For descriptions in a
automated circuit design system. Manual high-level language (HLL), such as C/C++
circuit description is a powerful method or Java, or ones using complex building
for the creation of high-quality circuit de- blocks, this code must be compiled into
signs. However, it requires a great deal of a netlist of gate-level components. For
background knowledge of the particular the HLL implementations, this involves

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 187

ping stage may also consider using these


memories as logic units when they are not
being used for data storage. The memories
act as very large LUTs, where the number
of inputs is equal to the number of address
lines. In order to use these memories as
Fig. 12. A wide function implemented with multiple logic, the mapping software must analyze
LUTs.
how much of the memory blocks are actu-
ally used as storage in a given mapping. It
generating computational components to must then determine which are available
perform the arithmetic and logic opera- in order to implement logic, and what part
tions within the program, and separate or parts of the circuit are best mapped to
structures to handle the program control, the memory [Cong and Xu 1998; Wilton
such as loop iterations and branching op- 1998].
erations. Given a structural description, After the circuit has been mapped, the
either generated from a HLL or specified resulting blocks must be placed onto the
by the user, each complex structure is re- reconfigurable hardware. Each of these
placed with a network of the basic gates blocks is assigned to a specific location
that perform that function. within the hardware, hopefully close to
Once a detailed gate- or element-level the other logic blocks with which it com-
description of the circuit has been created, municates. As FPGA capacities increase,
these structures must be translated to the the placement phase of circuit mapping
actual logic elements of the reconfigurable becomes more and more time consuming.
hardware. This stage is known as tech- Floorplanning is a technique that can
nology mapping, and is dependent upon be used to alleviate some of this cost.
the exact target architecture. For a LUT- A floorplanning algorithm first partitions
based architecture, this stage partitions the logic cells into clusters, where cells
the circuit into a number of small subfunc- with a large amount of communication
tions, each of which can be mapped to a are grouped together. These clusters are
single LUT [Brown et al. 1992a; Abouzeid then placed as units onto regions of the
et al. 1993; Sangiovanni-Vincentelli et al. reconfigurable hardware. Once this global
1993; Hwang et al. 1994; Chang et al. placement is complete, the actual place-
1996; Hauck and Agarwal 1996; Yi and ment algorithm performs detailed place-
Jhon 1996; Chowdhary and Hayes 1997; ment of the individual logic blocks within
Lin et al. 1997; Cong and Wu 1998; Pan the boundaries assigned to the cluster
and Lin 1998; Togawa et al. 1998; Cong [Sankar and Rose 1999].
et al. 1999]. Some architectures, such as The use of a floorplanning tool is par-
the Xilinx 4000 series [Xilinx 1994], con- ticularly helpful for situations where the
tain multiple LUTs per logic cell. These circuit structure being mapped is of a dat-
LUTs can be used either separately to gen- apath type. Large computational compo-
erate small functions, or together to gen- nents or macros that are found in datapath
erate some wider-input functions [Inuani circuits are frequently composed of highly
and Saul 1997; Cong and Hwang 1998]. regular logic. These structures are placed
By taking advantage of multiple LUTs and as entire units, and their component cells
the internal routing within a single logic are restricted to the floorplanned location
cell, functions with more inputs than can [Shi and Bhatia 1997; Emmert and Bhatia
be implemented using a single LUT can 1999]. This encourages the placer to find a
efficiently be mapped into the FPGA ar- very regular placement of these logic cells,
chitecture. Figure 12 shows one example resulting in a higher performance layout
of a wide function mapped to a multi-LUT of the circuit. Another technique for the
FPGA logic cell. mapping and placement of datapath ele-
For reconfigurable structures that in- ments is to perform both of these steps
clude embedded memory blocks, the map- simultaneously [Callahan et al. 1998].

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


188 K. Compton and S. Hauck

This method also exploits the regular- many connected components to be placed
ity of the datapath elements to gener- far from one another, as the signals that
ate mappings and placements quickly and travel long distances use more routing
efficiently. resources than those that travel shorter
Floorplanning is also important when ones. A good placement is therefore es-
dealing with hierarchically structured re- sential to the routing process. One of
configurable designs. In these architec- the challenges in routing for FPGAs and
tures, the available resources have been reconfigurable systems is that the avail-
grouped by the logic or routing hierarchy able routing resources are limited. In gen-
of the hardware. Because performance is eral hardware design, the goal is to min-
best when routing lengths are minimized, imize the number of routing tracks used
the cells to be placed should be grouped in a channel between rows of computation
such that cells that require a great deal units, but the channels can be made as
of communication or which are on a criti- wide as necessary. In reconfigurable sys-
cal path are placed together within a logic tems, however, the number of available
cluster on the hardware [Krupnova et al. routing tracks is determined at fabrication
1997; Senouci et al. 1998]. time, and therefore the routing software
After floorplanning, the individual logic must perform within these boundaries.
blocks are placed into specific logic cells. Thus, FPGA routing concentrates on min-
One algorithm that is commonly used imizing congestion within the available
is the simulated annealing technique tracks [Brown et al. 1992b; McMurchie
[Shahookar and Mazumder 1991; Betz and Ebeling 1995; Alexander and Robins
and Rose 1997; Sankar and Rose 1999]. 1996; Chan and Schlag 1997; Lee and Wu
This method takes an initial placement 1997; Thakur et al. 1997; Wu and Marek-
of the system, which can be generated Sadowska 1997; Swartz et al. 1998; Nam
(pseudo-) randomly, and performs a series et al. 1999]. Because routing is one of
of “moves” on that layout. A move is sim- the more time-intensive portions of the
ply the changing of the location of a sin- design cycle, it can be helpful to deter-
gle logic cell, or the exchanging of loca- mine if a placed circuit can be routed
tions of two logic cells. These moves are before actually performing the routing
attempted one at a time using random step. This quickly informs the designer
target locations. If a move improves the if changes need to be made to the layout
layout, then the layout is changed to re- or a larger reconfigurable structure is re-
flect that move. If a move is considered to quired [Wood and Rutenbar 1997; Swartz
be undesirable, then it is only accepted a et al. 1998].
small percentage of the time. Accepting a Each of the design phases mentioned
few “bad” moves helps to avoid any local above may be implemented either manu-
minima in the placement space. Other al- ally or automatically using compiler tools.
gorithms exist that are not so based on The operation of some of these individual
random movements [Gehring and Ludwig steps are described in greater depth in the
1996], although this searches a smaller following sections.
area of the placement space for a solution,
and therefore may be unable to find a so-
4.1. Hardware-Software Partitioning
lution which meets performance require-
ments if a design uses a high percentage For systems that include both reconfig-
of the reconfigurable resources. urable hardware and a traditional micro-
Finally, the different reconfigurable processor, the program must first be par-
components comprising the application titioned into sections to be executed on
circuit are connected during the routing the reconfigurable hardware and sections
stage. Particular signals are assigned to to be executed in software on the micro-
specific portions of the routing resources processor. In general, complex control se-
of the reconfigurable hardware. This can quences such as variable-length loops are
become difficult if the placement causes more efficiently implemented in software,

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 189

while fixed datapath operations may be celeration gained through the execution
more efficiently executed in hardware. of a code fragment in hardware to de-
Most compilers presented for reconfig- termine whether the cost of configuration
urable systems generate only the hard- is overcome by the benefits of hardware
ware configuration for the system, rather execution.
than both hardware and software. In some
cases, this is because the reconfigurable
4.2. Circuit Specification
hardware may not be coupled with a host
processor, so only a hardware configura- In order to use the reconfigurable hard-
tion is necessary. For cases where recon- ware, designers must somehow be able to
figurable hardware does operate alongside specify the operation of their custom cir-
a host microprocessor, some systems cur- cuits. Before high-level compilation tools
rently require that the hardware compila- are developed for a specific reconfigurable
tion be performed separately from the soft- system, this is done through hand map-
ware compilation, and special functions ping of the circuit, where the designer
are called from within the software in specifies the operation of the components
order to configure and control the reconfig- in the configurable system directly. Here,
urable hardware. However, this requires the designers utilize the basic building
effort on the part of the designer to iden- blocks of the reconfigurable system to cre-
tify the sections that should be mapped ate the desired circuit. This style of cir-
to hardware, and to translate these into cuit specification is primarily useful only
special hardware functions. In order to when a software front-end for circuit de-
make the use of the reconfigurable hard- sign is unavailable, or for the design of
ware transparent to the designer, the par- small circuits or circuits with very high
titioning and programming of the hard- performance requirements. This is due
ware should occur simultaneously in a to the great amount of time involved in
single programming environment. manual circuit creation. However, for cir-
For compilers that manage both the cuits that can be reasonably hand mapped,
hardware and software aspects of applica- this provides potentially the smallest and
tion design, the hardware/software parti- fastest implementation.
tioning can be performed either manually, Because not all designers can be inti-
or automatically by the compiler itself. mately familiar with every reconfigurable
When the partitioning is performed by architecture, some design tools abstract
the programmer, compiler directives are the specifics of the target architecture.
used to mark sections of program code for Creating a circuit using a structural de-
hardware compilation. The NAPA C lan- sign language involves describing a cir-
guage [Gokhale and Stone 1998] provides cuit using building blocks such as gates,
pragma statements to allow a program- flip-flops and latches [Bellows and Hutch-
mer to specify whether a section of code is ings 1998; Gehring and Ludwig 1998;
to be executed in software on the Fixed In- Hutchings et al. 1999]. The compiler then
struction Processor (FIP), or in hardware maps these modules to one or more ba-
on the Adaptive Logic Processor (ALP). sic components of the architecture of the
Cardoso and Neto [1999] present another reconfigurable system. Structural VHDL
compiler that requires the user to specify is one example of this type of program-
(using information gained through the use ming, and commercial tools are avail-
of profiling tools) which areas of code to able for compiling from this language
map to the reconfigurable hardware. into vendor-specific FPGAs [Synplicity
Alternately, the hardware/software par- 1999].
titioning can be done automatically However, these two methods require
[Chichkov and Almeida 1997; Kress et al. that the designer possess either an in-
1997; Callahan et al. 2000; Li et al. 2000a]. timate knowledge of the targeted recon-
In this case, the compiler will use cost figurable hardware, or at least a work-
functions based upon the amount of ac- ing knowledge of the concepts involved

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


190 K. Compton and S. Hauck

in hardware design. In order to allow suffer from the drawback that it tends to
a greater number of software developers produce larger and slower designs than
to take advantage of reconfigurable com- those generated by a structural descrip-
puting, tools that allow for behavioral tion or hand-mapping. Behavioral descrip-
circuit descriptions are being developed. tions can leave many aspects of the cir-
These systems trade some area and per- cuit unspecified. For example, a compiler
formance quality for greater flexibility and that encounters a while loop must gener-
ease of use. ate complicated control structures in or-
Behavioral circuit design is similar to der to allow for an unspecified number
software design because the designer in- of iterations. Also, in many HLL imple-
dicates the steps a hardware subsys- mentations, optimizations based upon the
tem must go through in order to per- bit width of operands cannot be performed.
form the desired computation rather than The compiler is generally unaware of
the actual composition of the circuit. any application-specific limitations on the
These behavioral descriptions can be ei- operand size; it only sees the program-
ther in a generic hardware description mer’s choice of data format in the program.
language such as VHDL or Verilog, or a Problems such as these might be solved
general-purpose high-level language such through additional programmer effort to
as C/C++ or Java. The eventual goal of replace while loops whenever possible
this type of compilation is to allow users with for loops, and to use compiler direc-
to write programs in commonly used lan- tives to indicate exact sizes of operands
guages that compile equally well, with- [Galloway 1995; Gokhale and Stone 1998].
out modification, to both a traditional This method of hardware design falls be-
software executable and to an executable tween structural description and behav-
which leverages reconfigurable hardware. ioral description in complexity, because
Working towards this direction, although the programmers do not need
Transmogrifier C [Galloway 1995] al- to know a great deal about hardware de-
lows a subset of the C language to be sign, they are required to follow addi-
used to describe hardware circuits. While tional guidelines that are not required for
multiplication, division, pointers, arrays, software-only implementations.
and a few other C language specifics are
not supported, this system provides a
behavioral method of circuit description
4.3. Circuit Libraries
using a primitive form of the C language.
Similarly, the C++ programming environ- The use of circuit or macro libraries
ment used for the P1 system [Vuillemin can greatly simplify and speed the de-
et al. 1996] provides a hybrid method of sign process. By predesigning commonly
description, using a combination of be- used structures such as adders, mul-
havioral and structural design. Synopsys’ tipliers, and counters, circuit creation
CoCentric compiler [Synopsys 2000], for configurable systems becomes largely
which can be targeted to the Xilinx Virtex the assembly of high-level components,
series of FPGA, uses SystemC to provide and only application-specific structures
for behavioral compilation of C/C++ require detailed design. The actual ar-
with the assistance of a set of additional chitecture of the reconfigurable device
hardware-defining classes. Other compil- can be abstracted, provided only library
ers, such as Nimble [Li et al. 2000a] and components are used, as these low-level
the Garp compiler [Callahan et al. 2000], details will already have been encapsu-
are fully behavioral C compilers, handling lated within the library structures. Al-
the full set of the ANSI C language. though the users of the circuit library
Although behavioral description, and may not know the intricacies of the des-
HLL description in particular, provides tination architecture, they are still able
a convenient method for the program- to make use of architecture-specific op-
ming of reconfigurable systems, it does timizations, such as specialized carry

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 191

chains. This is because designers very ever, circuit generators create semicus-
familiar with the details of the target ar- tomized high-level structures automati-
chitecture create the components within a cally at compile time, as opposed to circuit
circuit library. They can take advantage libraries that only provide static struc-
of architecture specifics when creating the tures. For example, a circuit generator can
modules to make these components faster create an adder structure of the exact bit
and smaller than a designer unfamiliar width required by the designer, whereas a
with the architecture likely would. An circuit library is likely to contain a limited
added benefit of the architecture abstrac- number of adder structures, none of which
tion is that the use of library components may be of the correct size. Circuit gener-
can also facilitate design migration from ators are therefore more flexible than cir-
one architecture to another, because de- cuit libraries because of the customization
signers are not required to learn a new allowed.
architecture, but only to indicate the new Some circuit generators, such as
target for the library components. How- MacGen [Yasar et al. 1996], are executed
ever, this does require that a circuit li- at the command line using custom de-
brary contain implementations for more scription files to generate physical design
than one architecture. layout data files. Newer circuit genera-
One method for using library com- tors, however, are functions or methods
ponents is to simply instantiate them called from high-level language programs.
within an HDL design [Xilinx 1997; Altera PAM-Blox [Mencer et al. 1998], for exam-
1999]. However, circuit libraries can also ple, is a set of circuit generators executed
be used in general language compil- in C++ that generate structures for use
ers by comparing the dataflow graph of with the PCI Pamette reconfigurable
the application to the dataflow graphs processing board. The circuit generator
of the library macros [Cadambi and presented by Chu et al. [1998] contains
Goldstein 1999]. If a dataflow represen- a number of Java classes to allow a
tation of a macro matches a portion of programmer to generate arbitrarily sized
the application graph, the correspond- arithmetic and logical components for a
ing macro is used for that part of the circuit. Although the examples presented
configuration. in that paper were mapped to a Xilinx
Another benefit of circuit design with 4000 series FPGA, the generator uses
library macros is that of fast compila- architecture specific libraries for module
tion. Because the library structures may generation. The target architecture can
have been premapped, preplaced, and pre- therefore be changed through the use
routed (at least within the macro bound- of a different design library. The Carry
aries), the actual compile time is reduced Look-Ahead circuit generator described
to the time required to place the library by Stohmann and Barke [1996] is also
components and route between them. For retargetable, because it maps to an
example, fast configuration was one of FPGA logic cell architecture defined by
the main motivations for the creation of the user.
libraries for circuit design in the DISC One drawback of the circuit generators
reconfigurable image processing system is that they depend on a regular logic
[Hutchings 1997]. and routing structure. Hierarchical rout-
ing structures (such as those present in
the Xilinx 6200 series [Xilinx 1996]) and
4.4. Circuit Generators
specialized heterogeneous logic blocks are
Circuit generators fulfill a role similar to frequently not accounted for. Therefore,
circuit libraries, in that they provide opti- some optimized features of a particular ar-
mized high-level structures for use within chitecture may be unused. For these cases,
larger applications. Again, designers are a circuit macro from a library may pro-
not required to understand the low-level vide a more highly optimized structure
details of particular architectures. How- than one created with a circuit generator,

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


192 K. Compton and S. Hauck

provided that the library macro fits the puting to allocate memories to hold vari-
needs of the application. ables and other data. Off-chip memories
may be added to the reconfigurable sys-
tem. Alternately, if a reconfigurable sys-
4.5. Partial Evaluation
tem includes memory blocks embedded
Functions that are to be implemented on into the reconfigurable logic, these may be
the reconfigurable array should occupy used, provided that the storage require-
as little area as possible, so as to maxi- ments do not surpass the available embed-
mize the number of functions that can be ded memory. If multiple off-chip memories
mapped to the hardware. This, combined are available to a reconfigurable system,
with the minimization of the delay in- variables used in parallel should be placed
curred by each circuit, increases the over- into different memory structures, such
all acceleration of the application. Partial that they can be accessed simultaneously
evaluation is the process of reducing hard- [Gokhale and Stone 1999]. When smaller
ware requirements for a circuit structure embedded memory units are used, larger
through optimization based upon known memories can be created from the smaller
static inputs. Specifically, if an input is ones. However, in this case, it is desir-
known to be constant, that value can po- able to ensure that each smaller mem-
tentially be propagated through one or ory is close to the computation that most
more gates in the structure at compile requires its contents [Babb et al. 1999].
time, and only the portions of a circuit that As mentioned earlier, the small embed-
depend on time-varying inputs need to be ded memories that are not allocated for
mapped to the reconfigurable structure. data storage may be used to perform logic
One example of the usefulness of this functions.
operation is that of constant coefficient
multipliers. If one input to a multiplier
4.7. Parallelization
is constant, a multiplier object can be re-
duced from a general-purpose multiplier One of the benefits of reconfigurable com-
to a set of additions with static-length puting is the ability to execute multi-
shifts between them corresponding to the ple operations in parallel. In cases where
locations of 1s in the binary constant. circuits are specified using a structural
This type of reduction leads to a lower hardware description language, the user
area requirement for the circuit, and po- specifies all structures and timing, and
tentially higher performance due to fewer therefore either implicitly or explicitly
gate delays encountered on the critical specifies any parallel operation. However,
path. Partial evaluation can also be per- for behavioral and HLL descriptions, there
formed in conjunction with circuit gener- are two methods to incorporate paral-
ation, where the constants passed to the lelism: manual parallelization through
generator function are used to simplify special instructions or compiler direc-
the created hardware circuit [Wang and tives, and automatic parallelization by the
Lewis 1997; Chu et al. 1998]. Other exam- compiler.
ples of this type of optimization for specific To manually incorporate parallelism
algorithms include the partial evaluation within an application, the programmer
of DES encryption circuits [Leonard and can specifically mark sections of code
Mangione-Smith 1997], and the partial that should run as parallel threads, and
evaluation of constant multipliers and use similar operations to those used in
fixed polynomial division circuits [Payne traditional parallel compilers [Cronquist
1997]. et al. 1998; Gokhale and Stone 1998].
For example, a signal/wait technique can
4.6. Memory Allocation
be used to perform synchronization of
the different threads of the computation.
As with traditional software programs, it The RaPiD-B language [Cronquist et al.
may be necessary in reconfigurable com- 1998] is one that uses this methodology.

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 193

Although the NAPA C compiler [Gokhale required between the FPGAs, the num-
and Stone 1998] requires programmers ber of paths with a high (inter-chip) de-
to mark the areas of code for executing lay is reduced, and the circuit may have
the host processor and the reconfigurable an overall higher performance. Similarly,
hardware in parallel, it also detects and those sections of the circuit that require a
exploits fine-grained parallelism within short delay time must be placed upon the
computations destined for the reconfig- same chip. Global placement then deter-
urable hardware. mines which of the actual FPGAs in the
Automatic parallelization of inner loops multi-FPGA system will contain each of
is another common technique in recon- the partitions.
figurable hardware compilers to attempt After the circuit has been partitioned
to maximize the use of the reconfig- into the different FPGA chips, the con-
urable hardware. The compiler will se- nections between the chips must be
lect the innermost loop level to be com- routed [Mak and Wong 1997; Ejnioui and
pletely unrolled for parallel execution in Ranganathan 1999]. A global routing al-
hardware, potentially creating a heav- gorithm determines at a high level the
ily pipelined structure [Cronquist et al. connections between the FPGA chips. It
1998; Weinhardt and Luk 1999]. For these first selects a region of output pins on the
cases, outer loops may not have multi- source FPGA for a given signal, and de-
ple iterations executing simultaneously. termines which (if any) routing switches
Any loop reordering to improve the par- or additional FPGAs the signal must
allelism of the circuit must be done by the pass through to get to the destination
programmer. However, some compiler sys- FPGA. Detailed routing and pin assign-
tems have taken this procedure a step fur- ment [Slimane-Kade et al. 1994; Hauck
ther and focus on the parallelization of all and Borriello 1997; Mak and Wong 1997;
loops within the program, not just the in- Ejnioui and Ranganathan 1999] are then
ner loops [Wang and Lewis 1997; Budiu used to assign signals to traces on an exist-
and Goldstein 1999]. This type of compiler ing multi-FPGA board, or to create traces
generates a control flow graph based upon for a multi-FPGA board that is to be cre-
the entire program source code. Loop un- ated specifically to implement the given
rolling is used in order to increase the circuit.
available parallelism, and the graph is Because multi-FPGA systems use inter-
then used to schedule parallel operations chip connections to allow the circuit parti-
in the hardware. tions to communicate, they frequently re-
quire a higher proportion of I/O resources
vs. logic in each chip than is normally re-
4.8. Multi-FPGA System Software
quired in single-FPGA use. For this rea-
When reconfigurable systems use more son, some research has focused on meth-
than one FPGA to form the complete ods to allow pins of the FPGAs to be reused
reconfigurable hardware, there are ad- for multiple signals. This procedure is re-
ditional compilation issues to deal with ferred to as Virtual Wires [Babb et al.
[Hauck and Agarwal 1996]. The design 1993; Agarwal 1995; Selvidge et al. 1995],
must first be partitioned into the differ- and allows for a flexible trade-off between
ent FPGA chips [Hauck 1995; Acock and logic and I/O within a given multi-FPGA
Dimond 1997; Vahid 1997; Brasen and system. Signals are multiplexed onto a
Saucier 1998; Khalid 1999]. This is gen- single wire by using multiple virtual clock
erally done by placing each highly con- cycles, one per multiplexed signal, within
nected portions of a circuit into a single a user clock cycle, thus pipelining the com-
chip. Multi-FPGA systems have a limited munication. In this manner, the I/O re-
number of I/O pins that connect the chips quirements of a circuit can be reduced,
together, and therefore their use must be while the logic requirements (because of
minimized in the overall circuit mapping. the added circuitry used for the multiplex-
Also, by minimizing the amount of routing ing) are increased.

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


194 K. Compton and S. Hauck

4.9. Design Testing architecture-specific optimizations avail-


able to generate a high-performance ap-
After compilation, an application needs
plication. However, this requires a great
to be tested for correct operation be-
deal of time and effort on the part of the de-
fore deployment. For hardware configu-
signer. At the opposite end of the spectrum
rations that have been generated from
is fully automatic compilation of a high-
behavioral descriptions, this is similar
level language. Using the automatic tools,
to the debugging of a software applica-
a software programmer can transparently
tion. However, structurally and manu-
utilize the reconfigurable hardware with-
ally created circuits must be simulated
out the need for direct intervention. The
and debugged with techniques based upon
circuits created using this method, while
those from the design of general hard-
quickly and easily created, are generally
ware circuits. For these structures, simu-
larger and slower than manually created
lation and debugging are critical not only
versions. The actual tools available for
to ensure proper circuit operation, but
compilation onto reconfigurable systems
also to prevent possible incorrect connec-
fall at various points within this range,
tions from causing a short within the cir-
where many are partially automated but
cuit, which can damage the reconfigurable
require some amount of manual aid. Cir-
hardware.
cuit designers for reconfigurable systems
There are several different methods of
therefore face a trade-off between the ease
observing the behavior of a configuration
of design and the quality of the final
during simulation. The contents of mem-
layout.
ory structures within the design can be
viewed, modified, or saved. This allows on-
the-fly customization of the simulated ex- 5. RUN-TIME RECONFIGURATION
ecution environment of the reconfigurable
hardware, as well as a method for exam- Frequently, the areas of a program that
ining the computation results. The input can be accelerated through the use of
and output values of circuit structures and reconfigurable hardware are too numer-
substructures can also be viewed either on ous or complex to be loaded simultane-
a generated schematic drawing or with a ously onto the available hardware. For
traditional waveform output. By examin- these cases, it is beneficial to be able
ing these values, the operation of the cir- to swap different configurations in and
cuit can be verified for correctness, and out of the reconfigurable hardware as
conflicts on individual wires can be seen. they are needed during program execution
A number of simulation and debugging (Figure 13). This concept is known as run-
software systems have been developed time reconfiguration (RTR).
that use some or all of these techniques Run-time reconfiguration is based upon
[Arnold et al. 1992; Buell et al. 1996; the concept of virtual hardware, which is
Gehring and Ludwig 1996; Lysaght and similar to virtual memory. Here, the phys-
Stockwood 1996; Bellows and Hutchings ical hardware is much smaller than the
1998; Hutchings et al. 1999; McKay and sum of the resources required by each
Singh 1999; Vasilko and Cabanis 1999]. of the configurations. Therefore, instead
of reducing the number of configurations
that are mapped, we instead swap them
4.10. Software Summary
in and out of the actual hardware as they
are needed. Because run-time reconfigu-
Reconfigurable hardware systems require ration allows more sections of an appli-
software compilation tools to allow pro- cation to be mapped into hardware than
grammers to harness the benefits of can be fit in a non-run-time reconfig-
reconfigurable computing. On one end urable system, a greater portion of the
of the spectrum, circuits for reconfig- program can be accelerated. This provides
urable systems can be designed manu- potential for an overall improvement in
ally, leveraging all application-specific and performance.

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 195

Fig. 13. Applications which are too large to entirely fit on the reconfigurable
hardware can be partitioned into two or more smaller configurations that
can occupy the hardware at different times.

During a single program’s execution, single context architecture is that it al-


configurations are swapped in and out lows for an extremely fast context switch
of the reconfigurable hardware. Some of (on the order of nanoseconds), whereas the
these configurations will likely require ac- single context may take milliseconds or
cess to the results of other configurations. more to reprogram. The partially reconfig-
Configurations that are active at differ- urable architecture is also more suited to
ent periods in time therefore must be pro- run-time reconfiguration than the single
vided with a method to communicate with context, because small areas of the array
one another. Primarily, this can be done can be modified without requiring that the
through the use of registers [Ebeling et al. entire logic array be reprogrammed.
1996; Cadambi et al. 1998; Rupp et al. For all of these run-time reconfigurable
1998; Scalera and Vazquez 1998], the con- architectures, there are also a number of
tents of which can remain intact between compilation issues that are not encoun-
reconfigurations. This allows one configu- tered in systems that only configure at
ration to store a value, and a later config- the beginning of an application. For ex-
uration to read back that value for use in ample, run-time reconfigurable systems
further computations. An alternative for are able to optimize based on values that
reconfigurable systems that do not include are known only at run-time. Furthermore,
state-holding devices is to write the result compilers must consider the run-time re-
back to registers or memory external to the configurability when generating the dif-
reconfigurable array, which is then read ferent circuit mappings, not only to be
back by successive configurations [Hauck aware of the increase in time-multiplexed
et al. 1997]. capacity, but also to schedule reconfigura-
There are a few different configuration tions so as to minimize the overhead that
memory styles that can be used with re- they incur. These software issues, as well
configurable systems. A single context de- as an overview of methods to perform fast
vice is a serially programmed chip that configuration, will be explored in the sec-
requires a complete reconfiguration in or- tions that follow.
der to change any of the programming bits.
A multicontext device has multiple layers
5.1. Reconfigurable Models
of programming bits, each of which can
be active at a different point in time. De- Traditional FPGA structures have been
vices that can be selectively programmed single context, only allowing one full-chip
without a complete reconfiguration are configuration to be loaded at a time. How-
called partially reconfigurable. These dif- ever, designers of reconfigurable systems
ferent types of configuration memory are have found this style of configuration
described in more detail later. An advan- to be too limiting or slow to efficiently
tage of the multicontext FPGA over a implement run-time reconfiguration. The

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


196 K. Compton and S. Hauck

Fig. 14. The different basic models of reconfigurable computing: single context, multicon-
text, and partially reconfigurable. Each of these designs is shown performing a reconfigu-
ration.

following discussion defines the single con- total reconfiguration delay. If all the con-
text device, and further considers newer figurations used within a certain time pe-
FPGA designs (multicontext and partially riod are present in the same context, no
reconfigurable), along with their impact reconfiguration will be necessary. How-
on run-time reconfiguration. ever, if a number of successive configura-
tions are each partitioned into different
5.1.1. Single Context. Current single contexts, several reconfigurations will be
context FPGAs are programmed using needed, slowing the operation of the run-
a serial stream of configuration infor- time reconfigurable system.
mation. Because only sequential access
is supported, any change to a configu- 5.1.2. Multicontext. A multicontext FPGA
ration on this type of FPGA requires a includes multiple memory bits for each
complete reprogramming of the entire programming bit location [DeHon 1996;
chip. Although this does simplify the Trimberger et al. 1997a; Scalera and
reconfiguration hardware, it does incur Vazquez 1998; Chameleon 2000]. These
a high overhead when only a small part memory bits can be thought of as mul-
of the configuration memory needs to be tiple planes of configuration information,
changed. Many commercial FPGAs are of as shown in Figure 14. One plane of con-
this style, including the Xilinx 4000 se- figuration information can be active at a
ries [Xilinx 1994], the Altera Flex10K given moment, but the device can quickly
series [Altera 1998], and Lucent’s Orca switch between different planes, or con-
series [Lucent 1998]. This type of FPGA texts, of already-programmed configura-
is therefore more suited for applications tions. In this manner, the multicontext de-
that can benefit from reconfigurable com- vice can be considered a multiplexed set of
puting without run-time reconfiguration. single context devices, which requires that
A single context FPGA is depicted in a context be fully reprogrammed to per-
Figure 14. form any modification. This system does
In order to implement run-time recon- allow for the background loading of a con-
figuration onto a single context FPGA, the text, where one plane is active and in ex-
configurations must be grouped into con- ecution while an inactive place is in the
texts, and each full context is swapped in process of being programmed. Figure 15
and out of the FPGA as needed. Because shows a multicontext memory bit, as used
each of these swap operations involve re- in [Trimberger et al. 1997a]. A commer-
configuring the entire FPGA, a good parti- cial product that uses this technique is the
tioning of the configurations between con- CS2000 RCP series from Chameleon, Inc
texts is essential in order to minimize the [Chameleon 2000]. This device provides

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 197

portions of the array may continue execu-


tion, allowing the overlap of computation
with reconfiguration. This has the benefit
of potentially hiding some of the reconfig-
uration latency.
When configurations do not require the
entire area available within the array, a
Fig. 15. A four-bit multicontexted programming bit
[Trimberger et al. 1997a]. P0-P3 are the stored
number of different configurations may
programming bits, while C0-C3 are the chip-wide be loaded into unused areas of the hard-
control lines that select the context to program or ware at different times. Since only part
activate. of the array is reconfigured at a given
point in time, the entire array does not re-
two separate planes of programming in- quire reprogramming. Additionally, some
formation. At any given time, one of these applications require the updating of only
planes is controlling current execution on a portion of a mapped circuit, while the
the reconfigurable fabric, and the other rest should remain intact, as shown in
plane is available for background loading Figure 14. For example, in a filtering op-
of the next needed configuration. eration in signal processing, a set of con-
Fast switching between contexts makes stant values that change slowly over time
the grouping of the configurations into may be reinitialized to a new value, yet the
contexts slightly less critical, because if overall computation in the circuit remains
a configuration is on a different context static. Using this selective reconfiguration
than the one that is currently active, it can can greatly reduce the amount of configu-
be activated within an order of nanosec- ration data that must be transferred to the
onds, as opposed to milliseconds or longer. FPGA. Several run-time reconfigurable
However, it is likely that the number of systems are based upon a partially re-
contexts within a given program is larger configurable design, including Chimaera
than the number of contexts available in [Hauck et al. 1997], PipeRench [Cadambi
the hardware. In this case, the partition- et al. 1998; Goldstein et al. 2000], NAPA
ing again becomes important to ensure [Rupp et al. 1998], and the Xilinx 6200 and
that configurations occurring in close tem- Virtex FPGAs [Xilinx 1996, 1999].
poral proximity are in a set of contexts Unfortunately, since address informa-
that are loaded into the multicontext de- tion must be supplied with configura-
vice at the same time. More aspects involv- tion data, the total amount of information
ing temporal partitioning for single- and transferred to the reconfigurable hard-
multicontext devices will be discussed in ware may be greater than what is required
the section on compilers for run-time re- with a single context design. This makes
configurable systems. a full reconfiguration of the entire array
slower than the single context version.
5.1.3. Partially Reconfigurable. In some However, a partially reconfigurable design
cases, configurations do not occupy the full is intended for applications in which the
reconfigurable hardware, or only a part of size of the configurations is small enough
a configuration requires modification. In that more than one can fit on the available
both of these situations, a partial recon- hardware simultaneously. Plus, as we dis-
figuration of the array is required, rather cuss in subsequent sections, a number of
than the full reconfiguration required by fast configuration methods have been ex-
a single- or multicontext device. In a par- plored for partially reconfigurable systems
tially reconfigurable FPGA, the underly- in order to help reduce the configuration
ing programming bit layer operates like data traffic requirements.
a RAM device. Using addresses to spec-
ify the target location of the configuration 5.1.4. Pipeline Reconfigurable. A modifi-
data allows for selective reconfiguration cation of the partially reconfigurable
of the array. Frequently, the undisturbed FPGA design is one in which the partial

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


198 K. Compton and S. Hauck

Fig. 16. A timeline of the configuration and reconfiguration of pipeline stages on a pipeline
reconfigurable FPGA. This example shows three physical pipeline stages implementing five
virtual pipeline stages [Cadambi et al. 1998].

reconfiguration occurs in increments of the first pipeline location in the hard-


pipeline stages. This style of reconfig- ware (step 4), overwriting the first virtual
urable hardware is called pipeline recon- pipeline stage. The reconfiguration of the
figurable, or sometimes a striped FPGA hardware pipeline stages continues until
[Luk et al. 1997b; Cadambi et al. 1998; the last virtual pipeline stage has been
Deshpande and Somani 1999; Goldstein programmed (step 7), at which point the
et al. 2000]. Each stage is configured as a first stage of the virtual pipeline is again
whole. This is primarily used in datapath- configured onto the hardware for the next
style computations, where more pipeline data set. These structures also allow for
stages are used than can fit simultane- the overlap of configuration and execution,
ously on available hardware. Figure 16 as one pipeline stage is configured while
shows an example of a pipeline reconfig- the others are executing. Therefore, N-1
urable array implementing more pipeline data values are processed each time the
stages than can fit on the available hard- virtual pipeline is fully traversed on an
ware. In a pipeline-reconfigurable FPGA, N-stage hardware system.
there are two primary execution possi-
bilities. Either the number of hardware
5.2. Run-Time Partial Evaluation
pipeline stages available is greater than
or equal to the number of pipeline stages One of the advantages that a run-time re-
of the designed circuit (virtual pipeline configurable device has over a system that
stages), or the number of virtual pipeline is only programmed at the beginning of
stages will exceed the number of hardware an application is the ability to perform
pipeline stages. The first case is straight- hardware optimizations based upon val-
forward: the circuit is simply mapped to ues determined at run-time. Partial evalu-
the array, and some hardware stages may ation was already discussed in this article
go unused. The second case is more com- in reference to compilation optimizations
plex and is the one that requires run- for general reconfigurable systems. Run-
time reconfiguration. The pipeline stages time partial evaluation allows for the fur-
are configured one by one, from the start ther exploitation of “constants” because
of the pipeline, through the end of the the configurations can be modified based
available hardware stages (steps 1, 2, not only on completely static values, but
and 3 in Figure 16). After each stage is also those that change slowly over time
programmed, it begins computation. In [Burns et al. 1997; Luk et al. 1997a; Payne
this manner, the configuration of a stage 1997; Wirthlin and Hutchings 1997; Chu
is exactly one step ahead of the flow of et al. 1998; McKay and Singh 1999]. This
data. Once the hardware pipeline has gives reconfigurable circuits the potential
been completely filled, reuse of the hard- to achieve an even higher performance
ware pipeline stages begins. Configura- than an ASIC, which must retain gener-
tion of the next virtual stage begins at ality in these situations. The circuit in the

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


Reconfigurable Computing 199

reconfigurable system can be customized run-time reconfigurable system, the cir-


to the application at a given time, rather cuits loaded on the hardware change over
than to the application as a category. For time. If the user must specify by hand
example, where an ASIC may have to the loading and execution of the circuits
include a generic multiplier, a reconfig- in the reconfigurable hardware, then the
urable system could instantiate a constant compilers must include methods to indi-
coefficient multiplier that changes over cate these operations. JHDL [Bellows and
time. Additionally, partial evaluation can Hutchings 1998; Hutchings et al. 1999] is
be used in encryption systems [Leonard one such compiler. It provides for the in-
and Mangione-Smith 1997]. A key-specific stantiation of configurations through the
reconfigurable encrypter or decrypter is use of Java constructors, and the removal
optimized for the particular key being of the circuits from the hardware by using
used, but retains the ability to use more a destructor on the circuit objects. This al-
than one key over the lifetime of the hard- lows the programmer to indicate exactly
ware (unlike a key-specialized ASIC) or the loading pattern of the configurations.
during actual run-time. Alternately, the compiler can automate
Although partial evaluation can be used the use of the run-time reconfigurable
to reduce the overall area requirements hardware. For a single context or multi-
of a circuit by removing potentially ex- context device, configurations must be
traneous hardware within the implemen- temporally partitioned into a number of
tation, occasionally it is preferable to re- different full contexts of configuration
serve sufficient area for the largest case, information. This involves determining
and have all mappings occupy that area. which configurations are likely to be used
This allows the partially evaluated por- near in time to one another, and which
tion of a given configuration to be reconfig- configurations are able to fit together onto
ured, while leaving the remainder of the the reconfigurable hardware. Ideally, the
circuit intact. For example, if a constant number of reconfigurations that are to be
coefficient multiplier within a larger con- performed is minimized. By reducing the
figuration requires that the constant be number of reconfigurations, the propor-
changed, only the area occupied by the tion of time spent in reconfiguration (com-
multiplier requires reconfiguration. This pared to the time spent in useful compu-
is true even if the new constant coefficient tation) is reduced.
multiplier is a larger structure than the The problem of forming and schedul-
previous one, because the reserved area ing single- and multiconfiguration con-
for it is based upon the largest possibility texts for use in single context or multicon-
[McKay and Singh 1999]. Although par- text FPGA designs has been discussed by
tial evaluation does not minimize the area a number of groups [Chang and Marek-
occupied by the circuit in this case, the Sadowska 1998; Trimberger 1998; Liu and
speed of configuration is improved by mak- Wong 1999; Purna and Bhatia 1999; Li
ing the multiplier a modular replaceable et al. 2000a]. In particular, a single cir-
component. Additionally, this method re- cuit that is too large to fit within the re-
tains the speed benefits of partial recon- configurable hardware may be partitioned
figuration because it still minimizes the over time to form a sequential set of con-
logic and routing actually used to imple- figurations. This involves examining the
ment the structure. control flow graph of the circuit and divid-
ing the circuit into distinct computation
nodes. The nodes can then be grouped to-
5.3. Compilation and Configuration
gether within contexts, based upon their
Scheduling
proximity to one another within the flow
For some reconfigurable systems, a con- control graph. If possible, those config-
figuration requires programming the re- urations that are used in quick succes-
configurable hardware only at the start sion will be placed within the same group.
of its execution. On the other hand, in a These groups are finally mapped into full

ACM Computing Surveys, Vol. 34, No. 2, June 2002.


200 K. Compton and S. Hauck

contexts, to be loaded into the reconfig- There are a number of different tactics
urable hardware at run-time. Nimble [Li for reducing the configuration overhead.
et al. 2000a] is one of the compilers that First, loading of the configurations can be
perform this type of operation. This com- timed such that the configuration over-
piler focuses on mapping core loops within laps as much as possible with the execu-
C code to reconfigurable hardware. Hard- tion of instructions by the host processor.
ware models for the candidate loops that Second, compression techniques can be in-
will fit within the reconfigurable hardware troduced to decrease the amount of config-
are first extracted from the C application. uration data that must be transferred to
Then these loops are grouped into indi- the system. Third, specialized hardware
vidual configurations using a partitioning can be used to adjust the physical loca-
method in order to encourage the hard- tion of configurations at run-time based on
ware loops that are used in close temporal where the free area on the hardware is lo-
proximity to be mapped to the same config- cated at any given time. Finally, the actual
uration, reducing configuration overhead. process of transferring the data from the
For partially reconfigurable designs, the host processor to the reconfigurable hard-
compiler must determine a good place- ware can be modified to include a configu-
ment in order to prevent configurations ration cache, which would provide a faster
that are used together in close temporal reconfiguration.
proximity from occupying the same re-
5.4.1. Configuration Prefetching. Perfor-
sources. Again, through minimizing the
mance is improved when the actual con-
number of reconfigurations, the overall
figuration of the hardware is overlapped
performance of the system is increased, as
with computations performed by the
configuration is a slow process [Li et al.
host processor, because programming the
2000b]. An alternative approach, which
reconfigurable hardware requires from
allows the final placement of a configura-
milliseconds to seconds to accomplish.
tion to be determined at run-time, is also
Overlapping configuration and execution
discussed within the Fast Configuration
prevents the host processor from stalling
section of this article.
while it is waiting for the configuration to
finish, and hides the configuration time
5.4. Fast Configuration from the program execution. Configura-
tion prefetching [Hauck 1998a] attempts
Because run-time reconfigurable systems
to leverage this overlap by determining
involve reconfiguration during program
when to initiate reconfiguration of the
execution, the reconfiguration must be
hardware in order to maximize overlap
done as efficiently and as quickly as pos-
with useful computation on the host
sible. This is in order to ensure that the
processor. It also seeks to minimize the
overhead of the reconfiguration does not
chance that a configuration will be pre-
eclipse the benefit gained by hardware ac-
fetched falsely, overwriting the configura-
celeration. Stalling execution of either the
tion that is actually used next.
host processor or the reconfigurable hard-
5.4.2. Configuration Compression. Unfortunately, there will always be cases in which the configuration overheads cannot be successfully hidden using a prefetching technique. This can occur when a conditional branch occurs immediately before the use of a configuration, potentially making a 100% correct prefetch prediction impossible, or when multiple configurations or contexts must be loaded in quick succession. In these cases, the delay incurred is minimized when the amount of data transferred from the host processor to the reconfigurable array is minimized. Configuration compression can be used to compact this configuration information [Hauck et al. 1998b; Hauck and Wilson 1999; Li and Hauck 1999; Dandalis and Prasanna 2001].

One form of configuration compression has already been implemented in a commercial system. The Xilinx 6200 series of FPGA [Xilinx 1996] contains wildcarding hardware, which provides a method to program multiple logic cells with a single address and data value. This is accomplished by setting a special register to indicate which of the address bits should behave as “don’t-care” values, resolving to multiple addresses for configuration. For example, suppose two configuration addresses, 00010 and 00110, are both to be programmed with the same value. By setting the wildcard register to 00100, the address value sent is interpreted as 00X10, and both these locations are programmed using either of the two addresses above in a single operation. Hauck et al. [1998b] discuss the benefits of this hardware, while Li and Hauck [1999] cover a potential extension to the concept, where “don’t care” values in the configuration stream can be used to allow areas with similar but not identical configuration data values to also be programmed simultaneously.
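The address matching implied by the wildcard register can be modeled in a few lines of C. This is only a behavioral sketch of the example above (5-bit addresses, wildcard value 00100), not a description of the actual XC6200 programming circuitry.

#include <stdio.h>

/* A cell is written when its address equals the sent address in every bit
   position that is NOT marked as a don't-care in the wildcard register.   */
static int wildcard_match(unsigned cell, unsigned sent, unsigned wildcard)
{
    unsigned care = ~wildcard & 0x1Fu;      /* bits that must match (5-bit addresses) */
    return (cell & care) == (sent & care);
}

int main(void)
{
    unsigned wildcard = 0x04u;   /* 00100: address bit 2 is a don't-care */
    unsigned sent     = 0x02u;   /* 00010, interpreted as 00X10          */

    for (unsigned cell = 0; cell < 32; cell++)
        if (wildcard_match(cell, sent, wildcard))
            printf("cell 0x%02X written\n", cell);   /* prints 0x02 and 0x06 */
    return 0;
}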
Within partially reconfigurable systems, there is an added potential to compress effectively the amount of data sent to the reconfigurable hardware. A configuration can possibly reuse configuration information already present on the array, such that only the areas differing in configuration values must be reprogrammed. Therefore, configuration time can be reduced through the identification of these common components and the calculation of the incremental configurations that must be loaded [Luk et al. 1997a; Shirazi et al. 1998].
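The incremental-configuration idea can be sketched as a frame-by-frame comparison between the configuration already resident on the array and the one about to be loaded; the frame granularity and the write_frame routine below are assumptions made for illustration only.

#include <string.h>

#define FRAME_WORDS 32

typedef struct { unsigned words[FRAME_WORDS]; } frame_t;

extern void write_frame(int index, const frame_t *f);   /* configuration port access */

/* Send only the frames of 'next' that differ from what the array already holds
   ('resident'); return the number of frames actually transferred.               */
int load_incremental(const frame_t *resident, const frame_t *next, int nframes)
{
    int sent = 0;
    for (int i = 0; i < nframes; i++) {
        if (memcmp(&resident[i], &next[i], sizeof(frame_t)) != 0) {
            write_frame(i, &next[i]);    /* only the differing frame is moved */
            sent++;
        }
    }
    return sent;    /* ideally far smaller than nframes */
}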
Alternately, similar operations can be grouped together to form a single configuration that contains extra control circuitry in order to implement the various functions within the group [Kastrup et al. 1999]. By creating larger configurations out of groups of smaller configurations, the configuration overhead of partial reconfiguration is reduced because more operations can be present on chip simultaneously. However, there are some area and execution penalties imposed by this method, creating a trade-off between reduced reconfiguration overhead and faster execution with a smaller area.

5.4.3. Relocation and Defragmentation in Partially Reconfigurable Systems. Partially reconfigurable systems have the advantage over single context systems in that they allow a new configuration to be written to the programmable logic while the configurations not occupying that same area remain intact and available for future use. Because these configurations will not have to be reconfigured onto the array, and because the programming of a single configuration can require the transfer of far less configuration data than the programming of an entire context, a partially reconfigurable system can incur less configuration overhead than a single context FPGA.

However, inefficiencies can arise if two partial configurations are supposed to be located at overlapping physical locations on the FPGA. If these configurations are repeatedly used one after another, they must be swapped in and out of the array each time. This type of conflict could negate much of the benefit achieved by partially reconfigurable systems. A better solution to this problem is to allow the final placement of the configurations to occur at run-time, allowing for run-time relocation of those configurations [Li et al. 2000b; Compton et al. 2002]. Using relocation, a new configuration may be placed onto the reconfigurable array where it will cause minimum conflict with other needed configurations already present on the hardware. A number of different systems support run-time relocation, including Chimaera [Hauck et al. 1997], Garp [Hauser and Wawrzynek 1997], and PipeRench [Cadambi et al. 1998; Goldstein et al. 2000].
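Viewing the array one-dimensionally as a stack of rows, a run-time system could choose the offset for an incoming relocatable configuration by looking for the span of rows that overwrites the fewest configurations still expected to be reused. The occupancy array and cost measure below are a simplified illustration, not the placement logic of Chimaera, Garp, or PipeRench.

#define ARRAY_ROWS 64

/* occupied[r] != 0 means row r currently holds configuration data that is
   still expected to be reused.                                             */
int choose_offset(const int occupied[ARRAY_ROWS], int cfg_rows)
{
    int best_offset = 0;
    int best_conflicts = ARRAY_ROWS + 1;

    for (int off = 0; off + cfg_rows <= ARRAY_ROWS; off++) {
        int conflicts = 0;
        for (int r = off; r < off + cfg_rows; r++)
            conflicts += occupied[r];        /* rows this placement would overwrite */
        if (conflicts < best_conflicts) {
            best_conflicts = conflicts;
            best_offset = off;
            if (conflicts == 0)
                break;                       /* found a conflict-free spot           */
        }
    }
    return best_offset;
}

The chosen offset would then be added to the row addresses in the configuration stream as it is written to the array.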

Even with relocation, partially reconfigurable hardware can still suffer from some placement conflicts that could be avoided by using an additional hardware optimization. Over time, as a partially reconfigurable device loads and unloads configurations, the location of the unoccupied area on the array is likely to become fragmented, similar to what occurs in memory systems when RAM is allocated and deallocated. There may be enough empty area on the device to hold an incoming configuration, but it may be distributed throughout the array. A configuration normally requires a contiguous region of the chip, so it would have to overwrite a portion of a valid configuration in order to be placed onto the reconfigurable hardware. A system that incorporates the ability to perform defragmentation of the reconfigurable array, however, would be able to consolidate the unused area by moving valid configurations to new locations [Diessel and El Gindy 1997; Compton et al. 2002]. This area can then be used by incoming configurations, potentially without overwriting any of the moved configurations.
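The bookkeeping side of defragmentation looks much like memory compaction: still-valid configurations are slid toward one end of the array, in order of their current position, so that the free rows coalesce into a single region. In the sketch below, move_rows stands in for the hardware's ability to read configuration data back and rewrite it at a new offset; the structure is illustrative rather than the mechanism of any cited system.

#define ARRAY_ROWS 64

typedef struct {
    int offset;   /* first row occupied by this configuration */
    int rows;     /* number of rows it spans                   */
    int valid;    /* still resident and potentially reused     */
} cfg_t;

extern void move_rows(int old_offset, int new_offset, int rows);

/* Compact toward row 0. 'cfgs' must be sorted by increasing offset so that a
   move never lands on a configuration that has not yet been relocated.       */
int defragment(cfg_t *cfgs, int ncfgs)
{
    int next_free = 0;
    for (int i = 0; i < ncfgs; i++) {
        if (!cfgs[i].valid)
            continue;
        if (cfgs[i].offset != next_free) {
            move_rows(cfgs[i].offset, next_free, cfgs[i].rows);
            cfgs[i].offset = next_free;
        }
        next_free += cfgs[i].rows;
    }
    return ARRAY_ROWS - next_free;   /* size of the contiguous free area */
}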
5.4.4. Configuration Caching. Because a great deal of the delay caused by configuration is due to the distance between the host processor and the reconfigurable hardware, as well as the reading of the configuration data from a file or main memory, a configuration cache can potentially reduce the costs of reconfiguration [Deshpande et al. 1999; Li et al. 2000b]. By storing the configurations in fast memory near to the reconfigurable array, the data transfer during reconfiguration is accelerated, and the overall time required is reduced. Additionally, a special configuration cache can allow for specialized direct output to the reconfigurable hardware [Compton et al. 2000]. This output can leverage the close proximity of the cache by providing high-bandwidth communications that would facilitate wide parallel loading of the configuration data, further reducing configuration times.
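In software terms, the effect of a configuration cache can be approximated as a small, fully associative store of recently used configurations with LRU replacement: a hit reloads the array from nearby fast memory, while a miss falls back to the slow transfer from the host or main memory. The code below is a generic sketch of that policy, not the cache organization evaluated in the works cited above.

#define CACHE_SLOTS 4

typedef struct {
    int cfg_id;         /* configuration held in this slot (-1 = empty) */
    unsigned last_use;  /* timestamp for LRU replacement                */
} slot_t;

static slot_t cache[CACHE_SLOTS] = { {-1, 0}, {-1, 0}, {-1, 0}, {-1, 0} };
static unsigned now = 0;

extern void copy_from_cache(int slot);                /* fast, wide transfer near array */
extern void copy_from_memory(int cfg_id, int slot);   /* slow path from host memory     */

/* Load cfg_id onto the array, going back to main memory only on a miss. */
void load_config(int cfg_id)
{
    int victim = 0;
    now++;
    for (int i = 0; i < CACHE_SLOTS; i++) {
        if (cache[i].cfg_id == cfg_id) {              /* hit: reload from nearby cache  */
            cache[i].last_use = now;
            copy_from_cache(i);
            return;
        }
        if (cache[i].last_use < cache[victim].last_use)
            victim = i;                               /* track least recently used slot */
    }
    copy_from_memory(cfg_id, victim);                 /* miss: slow transfer, fill slot */
    cache[victim].cfg_id = cfg_id;
    cache[victim].last_use = now;
}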

5.5. Potential Problems with RTR

Partial reconfiguration involves selectively programming portions of the reconfigurable array. However, in many architectures there are some routing resources that traverse long distances, and may traverse areas allocated to different configurations. Care must be taken such that different configurations do not attempt to drive these wires simultaneously, as multiple drivers on a wire can potentially damage the hardware. Therefore, systems such as the Xilinx 6200 [Xilinx 1996] and Chimaera [Hauck et al. 1997] have specially designed routing resources that prevent multiple drivers. LEGO [Chow et al. 1999b] includes an additional control signal preventing conflicts during the span of time between startup and actual programming of the hardware.

An additional difficulty in using run-time reconfigurable systems occurs when the host processor runs multiple threads or processes. These threads or processes may each have their own sets of configurations that are to be mapped to the reconfigurable hardware. Issues such as the correct use of memory protection and virtual memory must be considered during memory accesses by the reconfigurable hardware [Chien and Byun 1999; Jacob and Chow 1999; Jean et al. 1999]. Another problem can occur when one thread or process configures the hardware, which is then reconfigured by a different thread or process. Threads and processes must be prevented from incorrectly calling hardware functions that no longer appear on the reconfigurable hardware. This requires that the state of the reconfigurable hardware be set to “dirty” on a main processor context switch, or re-loaded with the correct configuration context.

Partially reconfigurable systems must also protect against inter-process or inter-thread conflicts within the array. Even if each application has ensured that its own configurations can safely coexist, a combination of configurations from different applications re-introduces the possibility of inadvertently causing an electrical short within the reconfigurable hardware. This particular issue can be solved through the use of an architecture that does not have “bad” configurations, such as the 6200 series [Xilinx 1996] and Chimaera [Hauck et al. 1997]. The potential for this type of conflict also introduces the possibility of extremely destructive configurations that can destroy the system's underlying hardware.

5.6. Run-Time Reconfiguration Summary

We have discussed how run-time reconfiguration can increase the benefits gained through reconfigurable computing. Different configurations may be used at different phases of a program's execution, customizing the hardware not only for the application, but also for the different stages of the application. Run-time reconfiguration also allows configurations larger than the available reconfigurable hardware to be implemented, as these circuits can be split into several smaller ones that are used in succession. Because of the delays associated with configuration, this style of computing requires that reconfiguration be performed in a very efficient manner. Multicontext and partially reconfigurable FPGAs are both designed to improve the time required for reconfiguration. Hardware optimizations, such as wildcarding, run-time relocation, and defragmentation, further decrease configuration overhead in a partially reconfigurable design. Software techniques to enable fast configuration, including prefetching and incremental configuration calculation, were also discussed.

6. CONCLUSION

Reconfigurable computing is becoming an important part of research in computer architectures and software systems. By placing the computationally intense portions of an application onto the reconfigurable hardware, that application can be greatly accelerated. This is because reconfigurable computing combines many of the benefits of both software and ASIC implementations. Like software, the mapped circuit is flexible, and can be changed over the lifetime of the system or even the lifetime of the application. Similar to an ASIC, reconfigurable systems provide a method to map circuits into hardware. Reconfigurable systems therefore have the potential to achieve far greater performance than software as a result of bypassing the fetch-decode-execute cycle of traditional microprocessors as well as possibly exploiting a greater degree of parallelism.

Reconfigurable hardware systems come in many forms, from a configurable functional unit integrated directly into a CPU, to a reconfigurable coprocessor coupled with a host microprocessor, to a multi-FPGA stand-alone unit. The level of coupling, granularity of computation structures, and form of routing resources are all key points in the design of reconfigurable systems. The use of heterogeneous structures can also greatly add to the overall performance of the final design.

Compilation tools for reconfigurable systems range from simple tools that aid in the manual design and placement of circuits, to fully automatic design suites that use program code written in a high-level language to generate circuits and the controlling software. The variety of tools available allows designers to choose between manual and automatic circuit creation for any or all of the design steps. Although automatic tools greatly simplify the design process, manual creation is still important for performance-driven applications. Circuit libraries and circuit generators are additional software tools that enable designers to quickly create efficient designs. These tools attempt to aid the designer in gaining the benefits of manual design without entirely sacrificing the ease of automatic circuit creation.

Finally, run-time reconfiguration provides a method to accelerate a greater portion of a given application by allowing the configuration of the hardware to change over time. Apart from the benefits of added capacity through the use of virtual hardware, run-time reconfiguration also allows for circuits to be optimized based on run-time conditions. In this manner, performance of a reconfigurable system can approach or even surpass that of an ASIC.

Reconfigurable computing systems have shown the ability to accelerate program

execution greatly, providing a high-performance alternative to software-only implementations. However, no one hardware design has emerged as the clear pinnacle of reconfigurable design. Although general-purpose FPGA structures have standardized into LUT-based architectures, groups designing hardware for reconfigurable computing are currently also exploring the use of heterogeneous structures and word-width computational elements. Those designing compiler systems face the task of improving automatic design tools to the point where they may achieve mappings comparable to manual design for even high-performance applications. Within both of these research categories lies the additional topic of run-time reconfiguration. While some work has been done in this field as well, research must continue in order to be able to perform faster and more efficient reconfiguration. Further study into each of these topics is necessary in order to harness the full potential of reconfigurable computing.

REFERENCES

ABOUZEID, P., BABBA, P., DE PAULET, M. C., AND SAUCIER, G. 1993. Input-driven partitioning methods and application to synthesis on table-lookup-based FPGA's. IEEE Trans. Comput. Aid. Des. Integ. Circ. Syst. 12, 7, 913–925.
ACOCK, S. J. B. AND DIMOND, K. R. 1997. Automatic mapping of algorithms onto multiple FPGA-SRAM modules. Field-Programmable Logic and Applications, W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Lecture Notes in Computer Science, vol. 1304, Springer-Verlag, Berlin, Germany, 255–264.
ADAPTIVE SILICON, INC. 2001. MSA 2500 Programmable Logic Cores. Adaptive Silicon, Inc., Los Gatos, CA.
AGARWAL, A. 1995. VirtualWires: A Technology for Massive Multi-FPGA Systems. Available online at http://www.ikos.com/products/virtualwires.ps.
AGGARWAL, A. AND LEWIS, D. 1994. Routing architectures for hierarchical field programmable gate arrays. In Proceedings of the IEEE International Conference on Computer Design, 475–478.
ALEXANDER, M. J. AND ROBINS, G. 1996. New performance-driven FPGA routing algorithms. IEEE Trans. CAD Integ. Circ. Syst. 15, 12, 1505–1517.
ALTERA CORPORATION. 1998. Data Book. Altera Corporation, San Jose, CA.
ALTERA CORPORATION. 1999. Altera MegaCore Functions. Available online at http://www.altera.com/html/tools/megacore.html. Altera Corporation, San Jose, CA.
ALTERA CORPORATION. 2001. Press Release: Altera Unveils First Complete System-on-a-Programmable-Chip Solution at Embedded Systems Conference. Altera Corporation, San Jose, CA.
ANNAPOLIS MICROSYSTEMS, INC. 1998. Wildfire Reference Manual. Annapolis Microsystems, Inc., Annapolis, MD.
ARNOLD, J. M., BUELL, D. A., AND DAVIS, E. G. 1992. Splash 2. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, 316–324.
BABB, J., RINARD, M., MORITZ, C. A., LEE, W., FRANK, M., BARUA, R., AND AMARASINGHE, S. 1999. Parallelizing applications into silicon. IEEE Symposium on Field-Programmable Custom Computing Machines, 70–80.
BABB, J., TESSIER, R., AND AGARWAL, A. 1993. Virtual wires: Overcoming pin limitations in FPGA-based logic emulators. In IEEE Workshop on FPGAs for Custom Computing Machines, 142–151.
BELLOWS, P. AND HUTCHINGS, B. 1998. JHDL—An HDL for reconfigurable systems. IEEE Symposium on Field-Programmable Custom Computing Machines, 175–184.
BETZ, V. AND ROSE, J. 1997. VPR: A new packing, placement and routing tool for FPGA research. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 213–222.
BETZ, V. AND ROSE, J. 1999. FPGA routing architecture: Segmentation and buffering to optimize speed and density. ACM/SIGDA International Symposium on FPGAs, 59–68.
BRASEN, D. R. AND SAUCIER, G. 1998. Using cone structures for circuit partitioning into FPGA packages. IEEE Trans. CAD Integ. Circ. Syst. 17, 7, 592–600.
BROWN, S. D., FRANCIS, R. J., ROSE, J., AND VRANESIC, Z. G. 1992a. Field-Programmable Gate Arrays. Kluwer Academic Publishers, Boston, MA.
BROWN, S., ROSE, J., AND VRANESIC, Z. G. 1992b. A detailed router for field-programmable gate arrays. IEEE Trans. Comput. Aid. Des. 11, 5, 620–628.
BUDIU, M. AND GOLDSTEIN, S. C. 1999. Fast compilation for pipelined reconfigurable fabrics. ACM/SIGDA International Symposium on FPGAs, 195–205.
BUELL, D., ARNOLD, S. M., AND KLEINFELDER, W. J. 1996. SPLASH 2: FPGAs in a Custom Computing Machine. IEEE Computer Society Press, Los Alamitos, CA.
BURNS, J., DONLIN, A., HOGG, J., SINGH, S., AND DE WIT, M. 1997. A dynamic reconfiguration run-time system. IEEE Symposium on Field-Programmable Custom Computing Machines, 66–75.
BUTTS, M. AND BATCHELLER, J. 1991. Method of using electronically reconfigurable logic circuits. US Patent 5,036,473.
CADAMBI, S. AND GOLDSTEIN, S. C. 1999. CPR: A configuration profiling tool. IEEE Symposium on Field-Programmable Custom Computing Machines, 104–113.
CADAMBI, S., WEENER, J., GOLDSTEIN, S. C., SCHMIT, H., AND THOMAS, D. E. 1998. Managing pipeline-reconfigurable FPGAs. ACM/SIGDA International Symposium on FPGAs, 55–64.
CALLAHAN, T. J., CHONG, P., DEHON, A., AND WAWRZYNEK, J. 1998. Fast module mapping and placement for datapaths in FPGAs. ACM/SIGDA International Symposium on FPGAs, 123–132.
CALLAHAN, T. J., HAUSER, J. R., AND WAWRZYNEK, J. 2000. The Garp architecture and C compiler. IEEE Comput. 33, 4, 62–69.
CARDOSO, J. M. P. AND NETO, H. C. 1999. Macro-based hardware compilation of Java™ bytecodes into a dynamic reconfigurable computing system. IEEE Symposium on Field-Programmable Custom Computing Machines, 2–11.
CHAMELEON SYSTEMS, INC. 2000. CS2000 Advance Product Specification. Chameleon Systems, Inc., San Jose, CA.
CHAN, P. K. AND SCHLAG, M. D. F. 1997. Acceleration of an FPGA router. IEEE Symposium on Field-Programmable Custom Computing Machines, 175–181.
CHANG, D. AND MAREK-SADOWSKA, M. 1998. Partitioning sequential circuits on dynamically reconfigurable FPGAs. ACM/SIGDA International Symposium on FPGAs, 161–167.
CHANG, S. C., MAREK-SADOWSKA, M., AND HWANG, T. T. 1996. Technology mapping for TLU FPGA's based on decomposition of binary decision diagrams. IEEE Trans. CAD Integ. Circ. Syst. 15, 10, 1226–1248.
CHICHKOV, A. V. AND ALMEIDA, C. B. 1997. An hardware/software partitioning algorithm for custom computing machines. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 274–283.
CHIEN, A. A. AND BYUN, J. H. 1999. Safe and protected execution for the Morph/AMRM reconfigurable processor. IEEE Symposium on Field-Programmable Custom Computing Machines, 209–221.
CHOW, P., SEO, S. O., ROSE, J., CHUNG, K., PÁEZ-MONZÓN, G., AND RAHARDJA, I. 1999a. The design of an SRAM-based field-programmable gate array—Part I: Architecture. IEEE Trans. VLSI Syst. 7, 2, 191–197.
CHOW, P., SEO, S. O., ROSE, J., CHUNG, K., PÁEZ-MONZÓN, G., AND RAHARDJA, I. 1999b. The design of an SRAM-based field-programmable gate array—Part II: Circuit design and layout. IEEE Trans. VLSI Syst. 7, 3, 321–330.
CHOWDHARY, A. AND HAYES, J. P. 1997. General modeling and technology-mapping technique for LUT-based FPGAs. ACM/SIGDA International Symposium on FPGAs, 43–49.
CHU, M., WEAVER, N., SULIMMA, K., DEHON, A., AND WAWRZYNEK, J. 1998. Object oriented circuit-generators in Java. IEEE Symposium on Field-Programmable Custom Computing Machines, 158–166.
COMPTON, K., COOLEY, J., KNOL, S., AND HAUCK, S. 2000. Configuration relocation and defragmentation for FPGAs. Northwestern University Technical Report. Available online at http://www.ece.nwu.edu/~kati/publications.html.
COMPTON, K., LI, Z., COOLEY, J., KNOL, S., AND HAUCK, S. 2002. Configuration relocation and defragmentation for run-time reconfigurable computing. IEEE Trans. VLSI Syst., to appear.
CONG, J. AND HWANG, Y. Y. 1998. Boolean matching for complex PLBs in LUT-based FPGAs with application to architecture evaluation. ACM/SIGDA International Symposium on FPGAs, 27–34.
CONG, J. AND WU, C. 1998. An efficient algorithm for performance-optimal FPGA technology mapping with retiming. IEEE Trans. CAD Integ. Circ. Syst. 17, 9, 738–748.
CONG, J., WU, C., AND DING, Y. 1999. Cut ranking and pruning enabling a general and efficient FPGA mapping solution. ACM/SIGDA International Symposium on FPGAs, 29–35.
CONG, J. AND XU, S. 1998. Technology mapping for FPGAs with embedded memory blocks. ACM/SIGDA International Symposium on FPGAs, 179–188.
CRONQUIST, D. C., FRANKLIN, P., BERG, S. G., AND EBELING, C. 1998. Specifying and compiling applications for RaPiD. IEEE Symposium on Field-Programmable Custom Computing Machines, 116–125.
DANDALIS, A. AND PRASANNA, V. K. 2001. Configuration compression for FPGA-based embedded systems. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 173–182.
DEHON, A. 1996. DPGA utilization and application. ACM/SIGDA International Symposium on FPGAs, 115–121.
DEHON, A. 1999. Balancing interconnect and computation in a reconfigurable computing array (or, why you don't really want 100% LUT utilization). ACM/SIGDA International Symposium on FPGAs, 69–78.
DESHPANDE, D., SOMANI, A. K., AND TYAGI, A. 1999. Configuration caching vs data caching for striped FPGAs. ACM/SIGDA International Symposium on FPGAs, 206–214.
DIESSEL, O. AND EL GINDY, H. 1997. Run-time compaction of FPGA designs. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 131–140.
DOLLAS, A., SOTIRIADES, E., AND EMMANOUELIDES, A. 1998. Architecture and design of GE1, a FCCM for Golomb ruler derivation. IEEE Symposium on Field-Programmable Custom Computing Machines, 48–56.
EBELING, C., CRONQUIST, D. C., AND FRANKLIN, P. 1996. RaPiD—Reconfigurable pipelined datapath. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 126–135.
EJNIOUI, A. AND RANGANATHAN, N. 1999. Multi-terminal net routing for partial crossbar-based multi-FPGA systems. ACM/SIGDA International Symposium on FPGAs, 176–184.
ELBIRT, A. J. AND PAAR, C. 2000. An FPGA implementation and performance evaluation of the Serpent block cipher. ACM/SIGDA International Symposium on FPGAs, 33–40.
EMMERT, J. M. AND BHATIA, D. 1999. A methodology for fast FPGA floorplanning. ACM/SIGDA International Symposium on FPGAs, 47–56.
ESTRIN, G., BUSSEL, B., TURN, R., AND BIBB, J. 1963. Parallel processing in a restructurable computer system. IEEE Trans. Elect. Comput. 747–755.
GALLOWAY, D. 1995. The Transmogrifier C hardware description language and compiler for FPGAs. IEEE Symposium on FPGAs for Custom Computing Machines, 136–144.
GEHRING, S. AND LUDWIG, S. 1996. The Trianus system and its application to custom computing. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 176–184.
GEHRING, S. W. AND LUDWIG, S. H. M. 1998. Fast integrated tools for circuit design with FPGAs. ACM/SIGDA International Symposium on FPGAs, 133–139.
GOKHALE, M. B. AND STONE, J. M. 1998. NAPA C: Compiling for a hybrid RISC/FPGA architecture. IEEE Symposium on Field-Programmable Custom Computing Machines, 126–135.
GOKHALE, M. B. AND STONE, J. M. 1999. Automatic allocation of arrays to memories in FPGA processors with multiple memory banks. IEEE Symposium on Field-Programmable Custom Computing Machines, 63–69.
GOLDSTEIN, S. C., SCHMIT, H., BUDIU, M., CADAMBI, S., MOE, M., AND TAYLOR, R. 2000. PipeRench: A reconfigurable architecture and compiler. IEEE Comput. 33, 4.
GRAHAM, P. AND NELSON, B. 1996. Genetic algorithms in software and in hardware—A performance analysis of workstations and custom computing machine implementations. IEEE Symposium on FPGAs for Custom Computing Machines, 216–225.
HAUCK, S. 1995. Multi-FPGA systems. Ph.D. dissertation, Univ. Washington, Dept. of C.S.&E.
HAUCK, S. 1998a. Configuration prefetch for single context reconfigurable coprocessors. ACM/SIGDA International Symposium on FPGAs, 65–74.
HAUCK, S. 1998b. The roles of FPGAs in reprogrammable systems. Proc. IEEE 86, 4, 615–638.
HAUCK, S. AND AGARWAL, A. 1996. Software technologies for reconfigurable systems. Dept. of ECE Technical Report, Northwestern Univ. Available online at http://www.ee.washington.edu/faculty/hauck/publications.html.
HAUCK, S. AND BORRIELLO, G. 1997. Pin assignment for multi-FPGA systems. IEEE Trans. Comput. Aid. Des. Integ. Circ. Syst. 16, 9, 956–964.
HAUCK, S., BORRIELLO, G., AND EBELING, C. 1998a. Mesh routing topologies for multi-FPGA systems. IEEE Trans. VLSI Syst. 6, 3, 400–408.
HAUCK, S., FRY, T. W., HOSLER, M. M., AND KAO, J. P. 1997. The Chimaera reconfigurable functional unit. IEEE Symposium on Field-Programmable Custom Computing Machines, 87–96.
HAUCK, S., LI, Z., AND SCHWABE, E. 1998b. Configuration compression for the Xilinx XC6200 FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, 138–146.
HAUCK, S. AND WILSON, W. D. 1999. Runlength compression techniques for FPGA configurations. Dept. of ECE Technical Report, Northwestern Univ. Available online at http://www.ee.washington.edu/faculty/hauck/publications.html.
HAUSER, J. R. AND WAWRZYNEK, J. 1997. Garp: A MIPS processor with a reconfigurable coprocessor. IEEE Symposium on Field-Programmable Custom Computing Machines, 12–21.
HAYNES, S. D. AND CHEUNG, P. Y. K. 1998. A reconfigurable multiplier array for video image processing tasks, suitable for embedding in an FPGA structure. IEEE Symposium on Field-Programmable Custom Computing Machines, 226–234.
HEILE, F. AND LEAVER, A. 1999. Hybrid product term and LUT based architectures using embedded memory blocks. ACM/SIGDA International Symposium on FPGAs, 13–16.
HUANG, W. J., SAXENA, N., AND MCCLUSKEY, E. J. 2000. A reliable LZ data compressor on reconfigurable coprocessors. IEEE Symposium on Field-Programmable Custom Computing Machines, 249–258.
HUELSBERGEN, L. 2000. A representation for dynamic graphs in reconfigurable hardware and its application to fundamental graph algorithms. ACM/SIGDA International Symposium on FPGAs, 105–115.
HUTCHINGS, B. L. 1997. Exploiting reconfigurability through domain-specific systems. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 193–202.
HUTCHINGS, B., BELLOWS, P., HAWKINS, J., HEMMERT, S., NELSON, B., AND RYTTING, M. 1999. A CAD suite for high-performance FPGA design. IEEE Symposium on Field-Programmable Custom Computing Machines, 12–24.
HWANG, T. T., OWENS, R. M., IRWIN, M. J., AND WANG, K. H. 1994. Logic synthesis for field-programmable gate arrays. IEEE Trans. Comput. Aid. Des. Integ. Circ. Syst. 13, 10, 1280–1287.
INUANI, M. K. AND SAUL, J. 1997. Technology mapping of heterogeneous LUT-based FPGAs. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 223–234.
JACOB, J. A. AND CHOW, P. 1999. Memory interfacing and instruction specification for reconfigurable processors. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 145–154.
JEAN, J. S. N., TOMKO, K., YAVAGAL, V., SHAH, J., AND COOK, R. 1999. Dynamic reconfiguration to support concurrent applications. IEEE Trans. Comput. 48, 6, 591–602.
KASTRUP, B., BINK, A., AND HOOGERBRUGGE, J. 1999. ConCISe: A compiler-driven CPLD-based instruction set accelerator. IEEE Symposium on Field-Programmable Custom Computing Machines, 92–101.
KHALID, M. A. S. 1999. Routing architecture and layout synthesis for multi-FPGA systems. Ph.D. dissertation, Dept. of ECE, Univ. Toronto.
KHALID, M. A. S. AND ROSE, J. 1998. A hybrid complete-graph partial-crossbar routing architecture for multi-FPGA systems. ACM/SIGDA International Symposium on FPGAs, 45–54.
KIM, H. J. AND MANGIONE-SMITH, W. H. 2000. Factoring large numbers with programmable hardware. ACM/SIGDA International Symposium on FPGAs, 41–48.
KIM, H. S., SOMANI, A. K., AND TYAGI, A. 2000. A reconfigurable multi-function computing cache architecture. ACM/SIGDA International Symposium on FPGAs, 85–94.
KRESS, R., HARTENSTEIN, R. W., AND NAGELDINGER, U. 1997. An operating system for custom computing machines based on the Xputer paradigm. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 304–313.
KRUPNOVA, H., RABEDAORO, C., AND SAUCIER, G. 1997. Synthesis and floorplanning for large hierarchical FPGAs. ACM/SIGDA International Symposium on FPGAs, 105–111.
LAI, Y. T. AND WANG, P. T. 1997. Hierarchical interconnection structures for field programmable gate arrays. IEEE Trans. VLSI Syst. 5, 2, 186–196.
LAUFER, R., TAYLOR, R. R., AND SCHMIT, H. 1999. PCI-PipeRench and the SwordAPI: A system for stream-based reconfigurable computing. IEEE Symposium on Field-Programmable Custom Computing Machines, 200–208.
LEE, Y. S. AND WU, A. C. H. 1997. A performance and routability-driven router for FPGA's considering path delays. IEEE Trans. CAD Integ. Circ. Syst. 16, 2, 179–185.
LEONARD, J. AND MANGIONE-SMITH, W. H. 1997. A case study of partially evaluated hardware circuits: Key-specific DES. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 151–160.
LEUNG, K. H., MA, K. W., WONG, W. K., AND LEONG, P. H. W. 2000. FPGA implementation of a microcoded elliptic curve cryptographic processor. IEEE Symposium on Field-Programmable Custom Computing Machines, 68–76.
LEWIS, D. M., GALLOWAY, D. R., VAN IERSSEL, M., ROSE, J., AND CHOW, P. 1997. The Transmogrifier-2: A 1 million gate rapid prototyping system. ACM/SIGDA International Symposium on FPGAs, 53–61.
LI, Y., CALLAHAN, T., DARNELL, E., HARR, R., KURKURE, U., AND STOCKWOOD, J. 2000a. Hardware-software co-design of embedded reconfigurable architectures. Design Automation Conference, 507–512.
LI, Z., COMPTON, K., AND HAUCK, S. 2000b. Configuration caching for FPGAs. IEEE Symposium on Field-Programmable Custom Computing Machines, 22–36.
LI, Z. AND HAUCK, S. 1999. Don't care discovery for FPGA configuration compression. ACM/SIGDA International Symposium on FPGAs, 91–98.
LIN, X., DAGLESS, E., AND LU, A. 1997. Technology mapping of LUT based FPGAs for delay optimisation. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 245–254.
LIU, H. AND WONG, D. F. 1999. Circuit partitioning for dynamically reconfigurable FPGAs. ACM/SIGDA International Symposium on FPGAs, 187–194.
LUCENT TECHNOLOGIES, INC. 1998. FPGA Data Book. Lucent Technologies, Inc., Allentown, PA.
LUK, W., SHIRAZI, N., AND CHEUNG, P. Y. K. 1997a. Compilation tools for run-time reconfigurable designs. IEEE Symposium on Field-Programmable Custom Computing Machines, 56–65.
LUK, W., SHIRAZI, N., GUO, S. R., AND CHEUNG, P. Y. K. 1997b. Pipeline morphing and virtual pipelines. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 111–120.
LYSAGHT, P. AND STOCKWOOD, J. 1996. A simulation tool for dynamically reconfigurable field programmable gate arrays. IEEE Trans. VLSI Syst. 4, 3, 381–390.
MAK, W. K. AND WONG, D. F. 1997. Board-level multi net routing for FPGA-based logic emulation. ACM Trans. Des. Automat. Elect. Syst. 2, 2, 151–167.
MANGIONE-SMITH, W. H. 1999. ATR from UCLA. Personal communication.
MANGIONE-SMITH, W. H., HUTCHINGS, B., ANDREWS, D., DEHON, A., EBELING, C., HARTENSTEIN, R., MENCER, O., MORRIS, J., PALEM, K., PRASANNA, V. K., AND SPAANENBURG, H. A. E. 1997. Seeking solutions in configurable computing. IEEE Comput. 30, 12, 38–43.
MARSHALL, A., STANSFIELD, T., KOSTARNOV, I., VUILLEMIN, J., AND HUTCHINGS, B. 1999. A reconfigurable arithmetic array for multimedia applications. ACM/SIGDA International Symposium on FPGAs, 135–143.
MCKAY, N. AND SINGH, S. 1999. Debugging techniques for dynamically reconfigurable hardware. IEEE Symposium on Field-Programmable Custom Computing Machines, 114–122.
MCMURCHIE, L. AND EBELING, C. 1995. Pathfinder: A negotiation-based performance-driven router for FPGAs. ACM/SIGDA International Symposium on FPGAs, 111–117.
MENCER, O., MORF, M., AND FLYNN, M. J. 1998. PAM-Blox: High performance FPGA design for adaptive computing. IEEE Symposium on Field-Programmable Custom Computing Machines, 167–174.
MIYAMORI, T. AND OLUKOTUN, K. 1998. A quantitative analysis of reconfigurable coprocessors for multimedia applications. IEEE Symposium on Field-Programmable Custom Computing Machines, 2–11.
MORITZ, C. A., YEUNG, D., AND AGARWAL, A. 1998. Exploring optimal cost performance designs for Raw microprocessors. IEEE Symposium on Field-Programmable Custom Computing Machines, 12–27.
NAM, G. J., SAKALLAH, K. A., AND RUTENBAR, R. A. 1999. Satisfiability-based layout revisited: Detailed routing of complex FPGAs via search-based Boolean SAT. ACM/SIGDA International Symposium on FPGAs, 167–175.
PAN, P. AND LIN, C. C. 1998. A new retiming-based technology mapping algorithm for LUT-based FPGAs. ACM/SIGDA International Symposium on FPGAs, 35–42.
PAYNE, R. 1997. Run-time parameterised circuits for the Xilinx XC6200. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 161–172.
PURNA, K. M. G. AND BHATIA, D. 1999. Temporal partitioning and scheduling data flow graphs for reconfigurable computers. IEEE Trans. Comput. 48, 6, 579–590.
QUICKTURN, A CADENCE COMPANY. 1999a. System Realizer™. Available online at http://www.quickturn.com/products/systemrealizer.htm. Quickturn, A Cadence Company, San Jose, CA.
QUICKTURN, A CADENCE COMPANY. 1999b. Mercury™ Design Verification System Technology Backgrounder. Available online at http://www.quickturn.com/products/mercury backgrounder.htm. Quickturn, A Cadence Company, San Jose, CA.
RAZDAN, R. AND SMITH, M. D. 1994. A high-performance microarchitecture with hardware-programmable functional units. International Symposium on Microarchitecture, 172–180.
RENCHER, M. AND HUTCHINGS, B. L. 1997. Automated target recognition on SPLASH2. IEEE Symposium on Field-Programmable Custom Computing Machines, 192–200.
ROSE, J., EL GAMAL, A., AND SANGIOVANNI-VINCENTELLI, A. 1993. Architecture of field-programmable gate arrays. Proc. IEEE 81, 7, 1013–1029.
RUPP, C. R., LANDGUTH, M., GARVERICK, T., GOMERSALL, E., HOLT, H., ARNOLD, J. M., AND GOKHALE, M. 1998. The NAPA adaptive processing architecture. IEEE Symposium on Field-Programmable Custom Computing Machines, 28–37.
SANGIOVANNI-VINCENTELLI, A., EL GAMAL, A., AND ROSE, J. 1993. Synthesis methods for field programmable gate arrays. Proc. IEEE 81, 7, 1057–1083.
SANKAR, Y. AND ROSE, J. 1999. Trading quality for compile time: Ultra-fast placement for FPGAs. ACM/SIGDA International Symposium on FPGAs, 157–166.
SCALERA, S. M. AND VAZQUEZ, J. R. 1998. The design and implementation of a context switching FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, 78–85.
SELVIDGE, C., AGARWAL, A., DAHL, M., AND BABB, J. 1995. TIERS: Topology IndependEnt Pipelined Routing and Scheduling for VirtualWire™ compilation. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 25–31.
SENOUCI, S. A., AMOURA, A., KRUPNOVA, H., AND SAUCIER, G. 1998. Timing driven floorplanning on programmable hierarchical targets. ACM/SIGDA International Symposium on FPGAs, 85–92.
SHAHOOKAR, K. AND MAZUMDER, P. 1991. VLSI cell placement techniques. ACM Comput. Surv. 23, 2, 145–220.
SHI, J. AND BHATIA, D. 1997. Performance driven floorplanning for FPGA based designs. ACM/SIGDA International Symposium on FPGAs, 112–118.
SHIRAZI, N., LUK, W., AND CHEUNG, P. Y. K. 1998. Automating production of run-time reconfigurable designs. IEEE Symposium on Field-Programmable Custom Computing Machines, 147–156.
SLIMANE-KADI, M., BRASEN, D., AND SAUCIER, G. 1994. A fast-FPGA prototyping system that uses inexpensive high-performance FPIC. ACM/SIGDA Workshop on Field-Programmable Gate Arrays.
SOTIRIADES, E., DOLLAS, A., AND ATHANAS, P. 2000. Hardware-software codesign and parallel implementation of a Golomb ruler derivation engine. IEEE Symposium on Field-Programmable Custom Computing Machines, 227–235.
STOHMANN, J. AND BARKE, E. 1996. An universal CLA adder generator for SRAM-based FPGAs. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 44–54.
SWARTZ, J. S., BETZ, V., AND ROSE, J. 1998. A fast routability-driven router for FPGAs. ACM/SIGDA International Symposium on FPGAs, 140–149.
SYNOPSYS, INC. 2000. CoCentric System C Compiler. Synopsys, Inc., Mountain View, CA.
SYNPLICITY, INC. 1999. Synplify User Guide Release 5.1. Synplicity, Inc., Sunnyvale, CA.
TAKAHARA, A., MIYAZAKI, T., MUROOKA, T., KATAYAMA, M., HAYASHI, K., TSUTSUI, A., ICHIMORI, T., AND FUKAMI, K. 1998. More wires and fewer LUTs: A design methodology for FPGAs. ACM/SIGDA International Symposium on FPGAs, 12–19.
THAKUR, S., CHANG, Y. W., WONG, D. F., AND MUTHUKRISHNAN, S. 1997. Algorithms for an FPGA switch module routing problem with application to global routing. IEEE Trans. CAD Integ. Circ. Syst. 16, 1, 32–46.
TOGAWA, N., YANAGISAWA, M., AND OHTSUKI, T. 1998. Maple-OPT: A performance-oriented simultaneous technology mapping, placement, and global routing algorithm for FPGA's. IEEE Trans. CAD Integ. Circ. Syst. 17, 9, 803–818.
TRIMBERGER, S. 1998. Scheduling designs into a time-multiplexed FPGA. ACM/SIGDA International Symposium on FPGAs, 153–160.
TRIMBERGER, S., CARBERRY, D., JOHNSON, A., AND WONG, J. 1997a. A time-multiplexed FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, 22–28.
TRIMBERGER, S., DUONG, K., AND CONN, B. 1997b. Architecture issues and solutions for a high-capacity FPGA. ACM/SIGDA International Symposium on FPGAs, 3–9.
TSU, W., MACY, K., JOSHI, A., HUANG, R., WALKER, N., TUNG, T., ROWHANI, O., GEORGE, V., WAWRZYNEK, J., AND DEHON, A. 1999. HSRA: High-speed, hierarchical synchronous reconfigurable array. ACM/SIGDA International Symposium on FPGAs, 125–134.
VAHID, F. 1997. I/O and performance tradeoffs with the FunctionBus during multi-FPGA partitioning. ACM/SIGDA International Symposium on FPGAs, 27–34.
VARGHESE, J., BUTTS, M., AND BATCHELLER, J. 1993. An efficient logic emulation system. IEEE Trans. VLSI Syst. 1, 2, 171–174.
VASILKO, M. AND CABANIS, D. 1999. Improving simulation accuracy in design methodologies for dynamically reconfigurable logic systems. IEEE Symposium on Field-Programmable Custom Computing Machines, 123–133.
VUILLEMIN, J., BERTIN, P., RONCIN, D., SHAND, M., TOUATI, H., AND BOUCARD, P. 1996. Programmable active memories: Reconfigurable systems come of age. IEEE Trans. VLSI Syst. 4, 1, 56–69.
WANG, Q. AND LEWIS, D. M. 1997. Automated field-programmable compute accelerator design using partial evaluation. IEEE Symposium on Field-Programmable Custom Computing Machines, 145–154.
WEINHARDT, M. AND LUK, W. 1999. Pipeline vectorization for reconfigurable systems. IEEE Symposium on Field-Programmable Custom Computing Machines, 52–62.
WILTON, S. J. E. 1998. SMAP: Heterogeneous technology mapping for area reduction in FPGAs with embedded memory arrays. ACM/SIGDA International Symposium on FPGAs, 171–178.
WIRTHLIN, M. J. AND HUTCHINGS, B. L. 1995. A dynamic instruction set computer. IEEE Symposium on FPGAs for Custom Computing Machines, 99–107.
WIRTHLIN, M. J. AND HUTCHINGS, B. L. 1996. Sequencing run-time reconfigured hardware with software. ACM/SIGDA International Symposium on FPGAs, 122–128.
WIRTHLIN, M. J. AND HUTCHINGS, B. L. 1997. Improving functional density through run-time constant propagation. ACM/SIGDA International Symposium on FPGAs, 86–92.
WITTIG, R. D. AND CHOW, P. 1996. OneChip: An FPGA processor with reconfigurable logic. IEEE Symposium on FPGAs for Custom Computing Machines, 126–135.
WOOD, R. G. AND RUTENBAR, R. A. 1997. FPGA routing and routability estimation via Boolean satisfiability. ACM/SIGDA International Symposium on FPGAs, 119–125.
WU, Y. L. AND MAREK-SADOWSKA, M. 1997. Routing for array-type FPGA's. IEEE Trans. CAD Integ. Circ. Syst. 16, 5, 506–518.
XILINX, INC. 1994. The Programmable Logic Data Book. Xilinx, Inc., San Jose, CA.
XILINX, INC. 1996. XC6200: Advance Product Specification. Xilinx, Inc., San Jose, CA.
XILINX, INC. 1997. LogiBLOX: Product Specification. Xilinx, Inc., San Jose, CA.
XILINX, INC. 1999. Virtex™ 2.5 V Field Programmable Gate Arrays: Advance Product Specification. Xilinx, Inc., San Jose, CA.
XILINX, INC. 2000. Press Release: IBM and Xilinx Team to Create New Generation of Integrated Circuits. Xilinx, Inc., San Jose, CA.
XILINX, INC. 2001. Virtex-II 1.5V Field Programmable Gate Arrays: Advance Product Specification. Xilinx, Inc., San Jose, CA.
YASAR, G., DEVINS, J., TSYRKINA, Y., STADTLANDER, G., AND MILLHAM, E. 1996. Growable FPGA macro generator. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 307–326.
YI, K. AND JHON, C. S. 1996. A new FPGA technology mapping approach by cluster merging. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 366–370.
ZHONG, P., MARTINOSI, M., ASHAR, P., AND MALIK, S. 1998. Accelerating Boolean satisfiability with configurable hardware. IEEE Symposium on Field-Programmable Custom Computing Machines, 186–195.

Received May 2000; revised October 2001 and January 2002; accepted February 2002
