KATHERINE COMPTON
Northwestern University
AND
SCOTT HAUCK
University of Washington
Categories and Subject Descriptors: A.1 [Introductory and Survey]; B.6.1 [Logic
Design]: Design Style—logic arrays; B.6.3 [Logic Design]: Design Aids; B.7.1
[Integrated Circuits]: Types and Design Styles—gate arrays
General Terms: Design, Performance
Additional Key Words and Phrases: Automatic design, field-programmable, FPGA,
manual design, reconfigurable architectures, reconfigurable computing, reconfigurable
systems
This research was supported in part by Motorola, Inc., DARPA, and NSF.
K. Compton was supported by an NSF fellowship.
S. Hauck was supported in part by an NSF CAREER award and a Sloan Research Fellowship.
Authors’ addresses: K. Compton, Department of Electrical and Computer Engineering, Northwestern University, 2145 Sheridan Road, Evanston, IL 60208-3118; e-mail: kati@ece.northwestern.edu; S. Hauck, Department of Electrical Engineering, The University of Washington, Box 352500, Seattle, WA 98195; e-mail: hauck@ee.washington.edu.
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or direct commercial advantage and that copies show this notice on the first page or initial screen of a display along with the full citation. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, to republish, to post on servers, to redistribute to lists, or to use any component of this work in other works requires prior specific permission and/or a fee. Permissions may be requested from Publications Dept., ACM, Inc., 1515 Broadway, New York, NY 10036 USA, fax +1 (212) 869-0481, or permissions@acm.org.
© 2002 ACM 0360-0300/02/0600-0171 $5.00
ACM Computing Surveys, Vol. 34, No. 2, June 2002, pp. 171–210.
172 K. Compton and S. Hauck
board-level solution, to perform the operations in hardware. ASICs are designed specifically to perform a given computation, and thus they are very fast and efficient when executing the exact computation for which they were designed. However, the circuit cannot be altered after fabrication. This forces a redesign and refabrication of the chip if any part of its circuit requires modification. This is an expensive process, especially when one considers the difficulties in replacing ASICs in a large number of deployed systems. Board-level circuits are also somewhat inflexible, frequently requiring a board redesign and replacement in the event of changes to the application.

The second method is to use software-programmed microprocessors, a far more flexible solution. Processors execute a set of instructions to perform a computation. By changing the software instructions, the functionality of the system is altered without changing the hardware. However, the downside of this flexibility is that the performance can suffer, if not in clock speed then in work rate, and is far below that of an ASIC. The processor must read each instruction from memory, decode its meaning, and only then execute it. This results in a high execution overhead for each individual operation. Additionally, the set of instructions that may be used by a program is determined at the fabrication time of the processor. Any other operations that are to be implemented must be built out of existing instructions.

Reconfigurable computing is intended to fill the gap between hardware and software, achieving potentially much higher performance than software, while maintaining a higher level of flexibility than hardware. Reconfigurable devices, including field-programmable gate arrays (FPGAs), contain an array of computational elements whose functionality is determined through multiple programmable configuration bits. These elements, sometimes known as logic blocks, are connected using a set of routing resources that are also programmable. In this way, custom digital circuits can be mapped to the reconfigurable hardware by computing the logic functions of the circuit within the logic blocks, and using the configurable routing to connect the blocks together to form the necessary circuit.

FPGAs and reconfigurable computing have been shown to accelerate a variety of applications. Data encryption, for example, is able to leverage both parallelism and fine-grained data manipulation. An implementation of the Serpent Block Cipher in the Xilinx Virtex XCV1000 shows a throughput increase by a factor of over 18 compared to a Pentium Pro PC running at 200 MHz [Elbirt and Paar 2000]. Additionally, a reconfigurable computing implementation of sieving for factoring large numbers (useful in breaking encryption schemes) was accelerated by a factor of 28 over a 200-MHz UltraSparc workstation [Kim and Mangione-Smith 2000]. The Garp architecture shows a comparable speed-up for DES [Hauser and Wawrzynek 1997], as does an FPGA implementation of an elliptic curve cryptography application [Leung et al. 2000].

Other recent applications that have been shown to exhibit significant speed-ups using reconfigurable hardware include: automatic target recognition [Rencher and Hutchings 1997], string pattern matching [Weinhardt and Luk 1999], Golomb Ruler Derivation [Dollas et al. 1998; Sotiriades et al. 2000], transitive closure of dynamic graphs [Huelsbergen 2000], Boolean satisfiability [Zhong et al. 1998], data compression [Huang et al. 2000], and genetic algorithms for the travelling salesman problem [Graham and Nelson 1996].

In order to achieve these performance benefits, yet support a wide range of applications, reconfigurable systems are usually formed with a combination of reconfigurable logic and a general-purpose microprocessor. The processor performs the operations that cannot be done efficiently in the reconfigurable logic, such as data-dependent control and possibly memory accesses, while the computational cores are mapped to the reconfigurable hardware. This reconfigurable logic can be
composed of either commercial FPGAs or custom configurable hardware.

Compilation environments for reconfigurable hardware range from tools to assist a programmer in performing a hand mapping of a circuit to the hardware, to complete automated systems that take a circuit description in a high-level language to a configuration for a reconfigurable system. The design process involves first partitioning a program into sections to be implemented on hardware, and those which are to be implemented in software on the host processor. The computations destined for the reconfigurable hardware are synthesized into a gate level or register transfer level circuit description. This circuit is mapped onto the logic blocks within the reconfigurable hardware during the technology mapping phase. These mapped blocks are then placed into the specific physical blocks within the hardware, and the pieces of the circuit are connected using the reconfigurable routing. After compilation, the circuit is ready for configuration onto the hardware at run-time. These steps, when performed using an automatic compilation system, require very little effort on the part of the programmer to utilize the reconfigurable hardware. However, performing some or all of these operations by hand can result in a more highly optimized circuit for performance-critical applications.

Since FPGAs must pay an area penalty because of their reconfigurability, device capacity can sometimes be a concern. Systems that are configured only at power-up are able to accelerate only as much of the program as will fit within the programmable structures. Additional areas of a program might be accelerated by reusing the reconfigurable hardware during program execution. This process is known as run-time reconfiguration (RTR). While this style of computing has the benefit of allowing for the acceleration of a greater portion of an application, it also introduces the overhead of configuration, which limits the amount of acceleration possible. Because configuration can take milliseconds or longer, rapid and efficient configuration is a critical issue. Methods such as configuration compression and the partial reuse of already programmed configurations can be used to reduce this overhead.

This article presents a survey of current research in hardware and software systems for reconfigurable computing, as well as techniques that specifically target run-time reconfigurability. We lead off this discussion by examining the technology required for reconfigurable computing, followed by a more in-depth examination of the various hardware structures used in reconfigurable systems. Next, we look at the software required for compilation of algorithms to configurable computers, and the trade-offs between hand-mapping and automatic compilation. Finally, we discuss run-time reconfigurable systems, which further utilize the intrinsic flexibility of configurable computing platforms by optimizing the hardware not only for different applications, but for different operations within a single application as well.

This survey does not seek to cover every technique and research project in the area of reconfigurable computing. Instead, it hopes to serve as an introduction to this rapidly evolving field, bringing interested readers quickly up to speed on developments from the last half-decade. Those interested in further background can find coverage of older techniques and systems elsewhere [Rose et al. 1993; Hauck and Agarwal 1996; Vuillemin et al. 1996; Mangione-Smith et al. 1997; Hauck 1998b].

2. TECHNOLOGY

Reconfigurable computing as a concept has been in existence for quite some time [Estrin et al. 1963]. Even general-purpose processors use some of the same basic ideas, such as reusing computational components for independent computations, and using multiplexers to control the routing between these components. However, the term reconfigurable computing, as it is used in current research (and within this survey), refers to systems incorporating some form of hardware programmability: customizing how the hardware is used using a number
of physical control points. These control points can then be changed periodically in order to execute different applications using the same hardware.

Fig. 1. A programming bit for SRAM-based FPGAs [Xilinx 1994] (left) and a programmable routing connection (right).

The recent advances in reconfigurable computing are for the most part derived from the technologies developed for FPGAs in the mid-1980s. FPGAs were originally created to serve as a hybrid device between PALs and Mask-Programmable Gate Arrays (MPGAs). Like PALs, FPGAs are fully electrically programmable, meaning that the physical design costs are amortized over multiple application circuit implementations, and the hardware can be customized nearly instantaneously. Like MPGAs, they can implement very complex computations on a single chip, with devices currently in production containing the equivalent of over a million gates. Because of these features, FPGAs had been primarily viewed as glue-logic replacement and rapid-prototyping vehicles. However, as we show throughout this article, the flexibility, capacity, and performance of these devices has opened up completely new avenues in high-performance computation, forming the basis of reconfigurable computing.

Most current FPGAs and reconfigurable devices are SRAM-programmable (Figure 1 left), meaning that SRAM¹ bits are connected to the configuration points in the FPGA, and programming the SRAM bits configures the FPGA. Thus, these chips can be programmed and reprogrammed about as easily as a standard static RAM. In fact, one research project, the PAM project [Vuillemin et al. 1996], considers a group of one or more FPGAs to be a RAM unit that performs computation between the memory write (sending the configuration information and input data) and memory read (reading the results of the computation). This leads some to use the term Programmable Active Memory or PAM.

¹ The term “SRAM” is technically incorrect for many FPGA architectures, given that the configuration memory may or may not support random access. In fact, the configuration memory tends to be continually read in order to perform its function. However, this is the generally accepted term in the field and correctly conveys the concept of static volatile memory using an easily understandable label.

One example of how the SRAM configuration points can be used is to control routing within a reconfigurable device [Chow et al. 1999a]. To configure the routing on an FPGA, typically a passgate structure is employed (see Figure 1 right). Here the programming bit will turn on a routing connection when it is configured with a true value, allowing a signal to flow from one wire to another, and will disconnect these resources when the bit is set to false. With a proper interconnection of these elements, which may include millions of routing choice points within a single device, a rich routing fabric can be created.

Another example of how these configuration bits may be used is to control multiplexers, which will choose between the output of different logic resources within the array. For example, to provide optional stateholding elements a D flip-flop (DFF) may be included with a multiplexer selecting whether to forward the latched or unlatched signal value (see Figure 2 left). Thus, for systems that require stateholding the programming bits controlling the multiplexer would be configured to select the DFF output, while systems that do not need this function would choose the bypass route that sends the input directly to the output. Similar structures
can choose between other on-chip functionalities, such as fixed-logic computation elements, memories, carry chains, or other functions.

Fig. 2. D flip-flop with optional bypass (left) and a 3-input LUT (right).

Finally, the configuration bits may be used as control signals for a computational unit or as the basis for computation itself. As a control signal, a configuration bit may determine whether an ALU performs an addition, subtraction, or other logic computations. On the other hand, with a structure such as a lookup table (LUT), the configuration bits themselves form the result of the computation (see Figure 2 right). These elements are essentially small memories provided for computing arbitrary logic functions. LUTs can compute any function of N inputs (where N is the number of control signals for the LUT’s multiplexer) by programming the 2^N programming bits with the truth table of the desired function. Thus, if all programming bits except the one corresponding to the input pattern 111 were set to zero, a 3-input LUT would act as a 3-input AND gate, while programming it with all ones except in 111 would compute a NAND.

3. HARDWARE

Reconfigurable computing systems use FPGAs or other programmable hardware to accelerate algorithm execution by mapping compute-intensive calculations to the reconfigurable substrate. These hardware resources are frequently coupled with a general-purpose microprocessor that is responsible for controlling the reconfigurable logic and executing program code that cannot be efficiently accelerated. In very closely coupled systems, the reconfigurability lies within customizable functional units on the regular datapath of the microprocessor. On the other hand, a reconfigurable computing system can be as loosely coupled as a networked stand-alone unit. Most reconfigurable systems are categorized somewhere between these two extremes, frequently with the reconfigurable hardware acting as a coprocessor to a host microprocessor. The programmable array itself can be comprised of one or more commercially available FPGAs, or can be a custom device designed specifically for reconfigurable computing.

The design of the actual computation blocks within the reconfigurable hardware varies from system to system. Each unit of computation, or logic block, can be as simple as a 3-input lookup table (LUT), or as complex as a 4-bit ALU. This difference in block size is commonly referred to as the granularity of the logic block, where the 3-input LUT is an example of a very fine-grained computational element, and a 4-bit ALU is an example of a quite coarse-grained unit. The finer-grained blocks are useful for bit-level manipulations, while the coarse-grained blocks are better optimized for standard datapath applications. Some architectures employ different sizes or types of blocks within a single reconfigurable array in order to efficiently support different types of computation. For example, memory is frequently embedded within the reconfigurable hardware to provide temporary data storage, forming a heterogeneous structure composed of both logic blocks and memory blocks [Ebeling et al. 1996; Altera 1998; Lucent 1998; Marshall et al. 1999; Xilinx 1999].
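
As an illustration of the fine-grained end of this spectrum, the following minimal sketch models a lookup table in software. It is not taken from any FPGA vendor tool: the names `make_lut` and `truth_table` are hypothetical, and the bit ordering (first input as the most significant address bit) is an assumption made only for this example. The list of configuration bits plays the role of the small memory, and the inputs simply select one stored bit.

```python
from itertools import product


def make_lut(config_bits):
    """Build an N-input LUT from 2**N configuration bits (hypothetical helper).

    config_bits[i] is the output for the input pattern whose binary value is i,
    with the first input treated as the most significant bit (an assumption).
    """
    size = len(config_bits)
    n = size.bit_length() - 1
    if size == 0 or size != 1 << n:
        raise ValueError("a LUT needs exactly 2**N configuration bits")

    def lut(*inputs):
        if len(inputs) != n:
            raise ValueError(f"expected {n} inputs, got {len(inputs)}")
        # The inputs act as the select lines of a multiplexer over the bits.
        index = 0
        for bit in inputs:
            index = (index << 1) | (1 if bit else 0)
        return config_bits[index]

    return lut


def truth_table(lut, n):
    """Evaluate the LUT on all 2**n input patterns, in ascending order."""
    return [lut(*bits) for bits in product((0, 1), repeat=n)]


# 3-input AND: only the bit for input pattern 111 is set.
and3 = make_lut([0, 0, 0, 0, 0, 0, 0, 1])
# 3-input NAND: every bit is set except the one for input pattern 111.
nand3 = make_lut([1, 1, 1, 1, 1, 1, 1, 0])
```

Changing the function requires only a different list of configuration bits; the selection logic in `lut` stays the same, which is the sense in which the configuration bits themselves form the result of the computation.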
The routing between the logic blocks within the reconfigurable hardware is also of great importance. Routing contributes significantly to the overall area of the reconfigurable hardware. Yet, when the percentage of logic blocks used in an FPGA becomes very high, automatic routing tools frequently have difficulty achieving the necessary connections between the blocks. Good routing structures are therefore essential to ensure that a design can be successfully placed and routed onto the reconfigurable hardware.

Once a circuit has been programmed onto the reconfigurable hardware, it is ready to be used by the host processor during program execution. The run-time operation of a reconfigurable system occurs in two distinct phases: configuration and execution. The programming of the reconfigurable hardware is under the control of the host processor. This host processor directs a stream of configuration data to the reconfigurable hardware, and this configuration data is used to define the actual operation of the hardware. Configurations can be loaded solely at start-up of a program, or periodically during runtime, depending on the design of the system. More concepts involved in run-time reconfiguration (the dynamic reconfiguration of devices during computation execution) are discussed in a later section.

The actual execution model of the reconfigurable hardware varies from system to system. For example, the NAPA system [Rupp et al. 1998] by default suspends the execution of the host processor during execution on the reconfigurable hardware. However, simultaneous computation can occur with the use of fork-and-join primitives, similar to multiprocessor programming. REMARC [Miyamori and Olukotun 1998] is a reconfigurable system that uses a pipelined set of execution phases within the reconfigurable hardware. These pipeline stages overlap with the pipeline stages of the host processor, allowing for simultaneous execution. In the Chimaera system [Hauck et al. 1997], the reconfigurable hardware is constantly executing based upon the input values held in a subset of the host processor’s registers. A call to the Chimaera unit is in actuality only a fetch of the result value. This value is stable and valid after the correct input values have been written to the registers and have filtered through the computation.

In the next sections, we consider in greater depth the hardware issues in reconfigurable computing, including both logic and routing. To support the computation demands of reconfigurable computing, we consider the logic block architectures of these devices, including possibly the integration of heterogeneous logic resources within a device. Heterogeneity also extends between chips, where one of the most important concerns is the coupling of the reconfigurable logic with standard, general-purpose processors. However, reconfigurable devices are more than just logic devices; the routing resources are at least as important as logic resources, and thus we consider interconnect structures, including 1D-oriented devices that are beginning to appear.

3.1. Coupling

Frequently, reconfigurable hardware is coupled with a traditional microprocessor. Programmable logic tends to be inefficient at implementing certain types of operations, such as variable-length loops and branch control. In order to run an application in a reconfigurable computing system most efficiently, the areas of the program that cannot be easily mapped to the reconfigurable logic are executed on a host microprocessor. Meanwhile, the areas with a high density of computation that can benefit from implementation in hardware are mapped to the reconfigurable logic. For the systems that use a microprocessor in conjunction with reconfigurable logic, there are several ways in which these two computation structures may be coupled, as Figure 3 shows.

First, reconfigurable hardware can be used solely to provide reconfigurable functional units within a host processor [Razdan and Smith 1994; Hauck et al. 1997]. This allows for a traditional programming environment with the
addition of custom instructions that may change over time. Here, the reconfigurable units execute as functional units on the main microprocessor datapath, with registers used to hold the input and output operands.

Second, a reconfigurable unit may be used as a coprocessor [Wittig and Chow 1996; Hauser and Wawrzynek 1997; Miyamori and Olukotun 1998; Rupp et al. 1998; Chameleon 2000]. A coprocessor is, in general, larger than a functional unit, and is able to perform computations without the constant supervision of the host processor. Instead, the processor initializes the reconfigurable hardware and either sends the necessary data to the logic, or provides information on where this data might be found in memory. The reconfigurable unit performs the actual computations independently of the main processor, and returns the results after completion. This type of coupling allows the reconfigurable logic to operate for a large number of cycles without intervention from the host processor, and generally permits the host processor and the reconfigurable logic to execute simultaneously. This reduces the overhead incurred by the use of the reconfigurable logic, compared to a reconfigurable functional unit that must communicate with the host processor each time a reconfigurable “instruction” is used. One idea that is somewhat of a hybrid between the first and second coupling methods is the use of programmable hardware within a configurable cache [Kim et al. 2000]. In this situation, the reconfigurable logic is embedded into the data cache. This cache can then be used as either a regular cache or as an additional computing resource depending on the target application.

Third, an attached reconfigurable processing unit [Vuillemin et al. 1996; Annapolis 1998; Laufer et al. 1999] behaves as if it is an additional processor in a multiprocessor system or an additional compute engine accessed semifrequently through external I/O. The host processor’s data cache is not visible to the attached reconfigurable processing unit. There is, therefore, a higher delay in communication between the host processor and the reconfigurable hardware, such as when communicating configuration information, input data, and results. This communication is performed through specialized primitives similar to those used in multiprocessor systems. However, this type of reconfigurable hardware does allow for a great deal of computation independence, by shifting large chunks of a computation over to the reconfigurable hardware.

Finally, the most loosely coupled form of reconfigurable hardware is that of an external stand-alone processing unit [Quickturn 1999a, 1999b]. This type of reconfigurable hardware communicates infrequently with a host processor (if present). This model is similar to that of networked workstations, where processing may occur for very long periods of time without a great deal of communication. In the case of the Quickturn systems, however, this hardware is geared
Cheung 1998; Chameleon 2000; Xilinx 2001]. Because multiplication is one of the more difficult computations to implement efficiently in a traditional FPGA structure, the custom multiplication hardware embedded within a reconfigurable array allows a system to perform even that function well.

Another use of heterogeneous structures is to provide embedded memory blocks scattered throughout the reconfigurable hardware. This allows storage of frequently used data and variables, and allows for quick access to these values due to the proximity of the memory to the logic blocks that access it. Memory structures embedded into the reconfigurable fabric come in two forms. The first is simply the use of available LUTs as RAM structures, as can be done in the Xilinx 4000 series [Xilinx 1994] and Virtex [Xilinx 1999] FPGAs. Although making these very small blocks into a larger RAM structure introduces overhead to the memory system, it does provide local, variable width memory structures.

Some architectures include dedicated memory blocks within their array, such as the Xilinx Virtex series [Xilinx 1999, 2001] and Altera [Altera 1998] FPGAs, as well as the CS2000 RCP (reconfigurable communications processor) device from Chameleon Systems, Inc. [Chameleon 2000]. These memory blocks have greater performance in large sizes than similar-sized structures built from many small LUTs. While these structures are somewhat less flexible than the LUT-based memories, they can also provide some customization. For example, the Altera FLEX 10K FPGA [Altera 1998] provides embedded memories that have a limited total number of wires, but allow a trade-off between the number of address lines and the data bit width.

When embedded memories are not used for data storage by a particular configuration, the area that they occupy does not necessarily have to be wasted. By using the address lines of the memory as function inputs and the values stored in the memory as function outputs, logical expressions of a large number of inputs can be emulated [Altera 1998; Cong and Xu 1998; Wilton 1998; Heile and Leaver 1999]. In fact, because there may be more than one value output from the memory on a read operation, the memory structure may be able to perform multiple different computations (one for each bit of data output), provided that all necessary inputs appear on the address lines. In this manner, the embedded RAM behaves the same as a very large LUT. Therefore, embedded memory allows a programmer or a synthesis tool to perform a trade-off between logic and memory usage in order to achieve higher area efficiency.

Furthermore, a few of the commercial FPGA companies have announced plans to include entire microprocessors as embedded structures within their FPGAs. Altera has demonstrated a preliminary ARM9-based Excalibur device, which combines reconfigurable hardware with an embedded ARM9 processor core [Altera 2001]. Meanwhile, Xilinx is working with IBM to include a PowerPC processor core within the Virtex-II FPGA [Xilinx 2000]. By contrast, Adaptive Silicon’s focus is to provide reconfigurable logic cores to customers for embedding in their own system-on-a-chip (SoC) devices [Adaptive 2001].

3.5. Routing Resources

Interconnect resources are provided in a reconfigurable architecture to connect together the device’s programmable logic elements. These resources are usually configurable, where the path of a signal is determined at compile or run-time rather than fabrication time. This flexible interconnect between logic blocks or computational elements allows for a wide variety of circuit structures, each with their own interconnect requirements, to be mapped to the reconfigurable hardware. For example, the routing for FPGAs is generally island-style, with logic surrounded by routing channels, which contain several wires, potentially of varying lengths. Within this type of routing architecture, however, there are still variations. Some of these differences include the ratio of wires to logic in the system, how long each of the
wires should be, and whether they should be connected in a segmented or hierarchical manner.

Fig. 8. Segmented (left) and hierarchical (right) routing structures. The white boxes are logic blocks, while the dark boxes are connection switches.

A step in the design of efficient routing structures for FPGAs and reconfigurable systems therefore involves examining the logic vs. routing area trade-off within reconfigurable architectures. One group has argued that the interconnect should constitute a much higher proportion of area in order to allow for successful routing under high-logic utilization conditions [Takahara et al. 1998]. However, for FPGAs, high-LUT utilization may not necessarily be the most desirable situation, but rather efficient routing usage may be of more importance [DeHon 1999]. This is because the routing resources occupy a much larger part of the area of an FPGA than the logic resources, and therefore the most area efficient designs will be those that optimize their use of the routing resources rather than the logic resources. The amount of required routing does not grow linearly with the amount of logic present; therefore, larger devices require even greater amounts of routing per logic block than small ones [Trimberger et al. 1997b].

There are two primary methods to provide both local and global routing resources, as shown in Figure 8. The first is the use of segmented routing [Betz and Rose 1999; Chow et al. 1999a]. In segmented routing, short wires accommodate local communications traffic. These short wires can be connected together using switchboxes to emulate longer wires. Frequently, segmented routing structures also contain longer wires to allow signals to travel efficiently over long distances without passing through a great number of switches. Hierarchical routing [Aggarwal and Lewis 1994; Lai and Wang 1997; Tsu et al. 1999] is the second method to provide both local and global communication. Routing within a group (or cluster) of logic blocks is at the local level, only connecting within that cluster. At the boundaries of these clusters, however, longer wires connect the different clusters together. This is potentially repeated at a number of levels. The idea behind the use of hierarchical structures is that, provided a good placement has been made onto the hardware, most communication should be local and only a limited amount of communication will traverse long distances. Therefore, the wiring is designed to fit this model, with a greater number of local routing wires in a cluster than distance routing wires between clusters.

Because routing can occupy a large part of the area of a reconfigurable device, the type of routing used must be carefully considered. If the wires available are much longer than what is required to route a signal, the excess wire length is wasted. On the other hand, if the wires available are much shorter than necessary, the signal
must pass through switchboxes that connect the short wires together into a longer wire, or through levels of the routing hierarchy. This induces additional delay and slows the overall operation of the circuit. Furthermore, the switchbox circuitry occupies area that might be better used for additional logic or wires.

There are a few alternatives to the island-style of routing resources. Systems such as RaPiD [Ebeling et al. 1996] use segmented bus-based routing, where signals are full word-sized in width. This is most common in the one-dimensional type of architecture, as discussed in the next section.

3.6. One-Dimensional Structures

Most current FPGAs are of the two-dimensional variety, as shown in Figure 9. This allows for a great deal of flexibility, as any signal can be routed on a nearly arbitrary path. However, providing this level of routing flexibility requires a great deal of routing area. It also complicates the placement and routing software, as the software must consider a very large number of possibilities.

One solution is to use a more one-dimensional style of architecture, also depicted in Figure 9. Here, placement is restricted along one axis. With a more limited set of choices, the placement can be performed much more quickly. Routing is also simplified, because it is generally along a single dimension as well, with the other dimension generally only used for calculations requiring a shift operation. One drawback of the one-dimensional routing is that if there are not enough routing resources in a particular area of a mapped circuit, routing that circuit actually becomes more difficult than on a two-dimensional array that provides more alternatives. A number of different reconfigurable systems have been designed in this manner. Both Garp [Hauser and Wawrzynek 1997] and Chimaera [Hauck et al. 1997] are structures that provide cells that compute a small number of bit positions, and a row of these cells together computes the full data word. A row can only be used by a single configuration, making these designs one-dimensional. In this manner, each configuration occupies some number of complete rows. Although multiple narrow-width computations can fit within a single row, these structures are optimized for word-based computations that occupy the entire row. The NAPA architecture [Rupp et al. 1998] is similar, with a full column of cells acting as the atomic unit for a configuration, as is PipeRench [Cadambi et al. 1998; Goldstein et al. 2000].

In some systems, the computation blocks in a one-dimensional structure operate on word-width values instead of single bits. Therefore, busses are routed instead of individual values. This also decreases the time required for routing, as the bits of a bus can be considered together rather than as separate routes. As shown previously in Figure 7, RaPiD [Ebeling et al. 1996] is basically a one-dimensional design that only includes word-width processing elements. The different computation units are organized in a single dimension along the horizontal
axis. The general flow of information follows this layout, with the major routing busses also laid out in a horizontal manner. Additionally, all routing is of word-sized values, and therefore all routing is of busses, not individual wires. A few vertical resources are included in the architecture to allow signals to transfer between busses, or to travel from a bus to a computation node. However, the majority of the routing in this architecture is one-dimensional.

3.7. Multi-FPGA Systems

Reconfigurable systems that are composed of multiple FPGA chips interconnected on a single processing board have additional hardware concerns over single-chip systems. In particular, there is a need for an efficient connection scheme between the chips, as well as to external memory and the system bus. This is to provide for circuits that are too large to fit within a single FPGA, but may be partitioned over the multiple FPGAs available. A number of different interconnection schemes have been explored [Butts and Batcheller 1991; Hauck et al. 1998a; Hauck 1998; Khalid 1999] including meshes and crossbars, as shown in Figure 10. A mesh connects the nearest-neighbors in the array of FPGA chips. This allows for efficient communication between the neighbors, but may require that some signals pass through an FPGA simply to create a connection between non-neighbors. Although this is possible, it uses valuable I/O resources on the FPGA that forms the routing bridge. One system that uses a mesh topology with additional board-level column and row busses is the P1 system developed within the PAM project [Vuillemin et al. 1996]. This architecture uses a central array of 16 commercial FPGAs with connections to nearest-neighbors. However, four 16-bit row busses and four 16-bit column busses run the length of the array and facilitate communication between non-neighbor FPGAs.

A crossbar attempts to remove this problem by using special routing-only chips to connect each FPGA potentially to any other FPGA. The inter-chip delays are more uniform, given that a signal travels the exact same “distance” to get from one FPGA to another, regardless of where those FPGAs are located. However, a crossbar interconnect does not scale easily with an increase in the number of FPGAs. The crossbar pattern of the chips is fixed at fabrication of the multi-FPGA board. Variants on these two basic topologies attempt to remove some of the problems encountered in mesh and crossbar topologies [Arnold et al. 1992; Varghese et al. 1993; Buell et al. 1996; Vuillemin et al. 1996; Lewis et al. 1997; Khalid and Rose 1998]. One of these variants can be found in the Splash 2 system [Arnold et al. 1992; Buell et al. 1996]. The predecessor, Splash 1, used a linear systolic communication method. This type of connection was found to work quite well for a variety of applications. However, this highly constrained communication model made some types of computations difficult or even impossible. Therefore, Splash 2 was designed to include not only the linear connections of Splash 1 that were found to be useful for many applications, but also a crossbar network to allow any FPGA

Fig. 10. Mesh (left) and partial crossbar (right) interconnect topologies for multi-FPGA systems.
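
As a rough illustration of the mesh versus crossbar trade-off discussed in Section 3.7, the sketch below counts the inter-chip hops a signal needs in each topology. The 4x4 grid corresponds to the 16-FPGA array mentioned for the P1 system, but the function names and the simple hop-count model (including the assumption that a crossbar path is always FPGA, routing chip, FPGA) are illustrative assumptions and are not taken from any of the cited systems.

```python
def mesh_hops(src, dst):
    """Inter-chip hops between two FPGAs in a nearest-neighbor mesh.

    src and dst are (row, col) grid coordinates. With only
    nearest-neighbor links, the shortest path is the Manhattan distance.
    """
    return abs(src[0] - dst[0]) + abs(src[1] - dst[1])


def mesh_bridge_fpgas(src, dst):
    """Intermediate FPGAs that act purely as routing bridges on that path."""
    return max(mesh_hops(src, dst) - 1, 0)


def crossbar_hops(src, dst):
    """Hops in an idealized crossbar (assumed model).

    Every pair of distinct FPGAs is connected through one routing-only
    chip, so the distance is the same regardless of chip location.
    """
    return 0 if src == dst else 2


if __name__ == "__main__":
    corner_a, corner_b = (0, 0), (3, 3)   # opposite corners of a 4x4 array
    neighbor = (0, 1)
    print("mesh, neighbors:", mesh_hops(corner_a, neighbor),
          "hops,", mesh_bridge_fpgas(corner_a, neighbor), "bridge FPGAs")
    print("mesh, corners:  ", mesh_hops(corner_a, corner_b),
          "hops,", mesh_bridge_fpgas(corner_a, corner_b), "bridge FPGAs")
    print("crossbar:       ", crossbar_hops(corner_a, neighbor), "and",
          crossbar_hops(corner_a, corner_b), "hops")
```

In this model the mesh is cheapest for neighbors but consumes I/O on every bridging FPGA for distant pairs, whereas the crossbar gives a uniform distance for all pairs.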
This method also exploits the regularity of the datapath elements to generate mappings and placements quickly and efficiently.

Floorplanning is also important when dealing with hierarchically structured reconfigurable designs. In these architectures, the available resources have been grouped by the logic or routing hierarchy of the hardware. Because performance is best when routing lengths are minimized, the cells to be placed should be grouped such that cells that require a great deal of communication or which are on a critical path are placed together within a logic cluster on the hardware [Krupnova et al. 1997; Senouci et al. 1998].

After floorplanning, the individual logic blocks are placed into specific logic cells. One algorithm that is commonly used is the simulated annealing technique [Shahookar and Mazumder 1991; Betz and Rose 1997; Sankar and Rose 1999]. This method takes an initial placement of the system, which can be generated (pseudo-) randomly, and performs a series of “moves” on that layout. A move is simply the changing of the location of a single logic cell, or the exchanging of locations of two logic cells. These moves are attempted one at a time using random target locations. If a move improves the layout, then the layout is changed to reflect that move. If a move is considered to be undesirable, then it is only accepted a small percentage of the time. Accepting a few “bad” moves helps to avoid any local minima in the placement space. Other algorithms exist that rely less on random movements [Gehring and Ludwig 1996], although this approach searches a smaller area of the placement space for a solution, and therefore may be unable to find a solution that meets performance requirements if a design uses a high percentage of the reconfigurable resources.

Finally, the different reconfigurable components comprising the application circuit are connected during the routing stage. Particular signals are assigned to specific portions of the routing resources of the reconfigurable hardware. This can become difficult if the placement causes many connected components to be placed far from one another, as the signals that travel long distances use more routing resources than those that travel shorter ones. A good placement is therefore essential to the routing process. One of the challenges in routing for FPGAs and reconfigurable systems is that the available routing resources are limited. In general hardware design, the goal is to minimize the number of routing tracks used in a channel between rows of computation units, but the channels can be made as wide as necessary. In reconfigurable systems, however, the number of available routing tracks is determined at fabrication time, and therefore the routing software must perform within these boundaries. Thus, FPGA routing concentrates on minimizing congestion within the available tracks [Brown et al. 1992b; McMurchie and Ebeling 1995; Alexander and Robins 1996; Chan and Schlag 1997; Lee and Wu 1997; Thakur et al. 1997; Wu and Marek-Sadowska 1997; Swartz et al. 1998; Nam et al. 1999]. Because routing is one of the more time-intensive portions of the design cycle, it can be helpful to determine if a placed circuit can be routed before actually performing the routing step. This quickly informs the designer if changes need to be made to the layout or a larger reconfigurable structure is required [Wood and Rutenbar 1997; Swartz et al. 1998].

Each of the design phases mentioned above may be implemented either manually or automatically using compiler tools. The operation of some of these individual steps is described in greater depth in the following sections.

4.1. Hardware-Software Partitioning

For systems that include both reconfigurable hardware and a traditional microprocessor, the program must first be partitioned into sections to be executed on the reconfigurable hardware and sections to be executed in software on the microprocessor. In general, complex control sequences such as variable-length loops are more efficiently implemented in software,
while fixed datapath operations may be more efficiently executed in hardware.

Most compilers presented for reconfigurable systems generate only the hardware configuration for the system, rather than both hardware and software. In some cases, this is because the reconfigurable hardware may not be coupled with a host processor, so only a hardware configuration is necessary. For cases where reconfigurable hardware does operate alongside a host microprocessor, some systems currently require that the hardware compilation be performed separately from the software compilation, and special functions are called from within the software in order to configure and control the reconfigurable hardware. However, this requires effort on the part of the designer to identify the sections that should be mapped to hardware, and to translate these into special hardware functions. In order to make the use of the reconfigurable hardware transparent to the designer, the partitioning and programming of the hardware should occur simultaneously in a single programming environment.

For compilers that manage both the hardware and software aspects of application design, the hardware/software partitioning can be performed either manually, or automatically by the compiler itself. When the partitioning is performed by the programmer, compiler directives are used to mark sections of program code for hardware compilation. The NAPA C language [Gokhale and Stone 1998] provides pragma statements to allow a programmer to specify whether a section of code is to be executed in software on the Fixed Instruction Processor (FIP), or in hardware on the Adaptive Logic Processor (ALP). Cardoso and Neto [1999] present another compiler that requires the user to specify (using information gained through the use of profiling tools) which areas of code to map to the reconfigurable hardware.

Alternatively, the hardware/software partitioning can be done automatically [Chichkov and Almeida 1997; Kress et al. 1997; Callahan et al. 2000; Li et al. 2000a]. In this case, the compiler will use cost functions based upon the amount of acceleration gained through the execution of a code fragment in hardware to determine whether the cost of configuration is overcome by the benefits of hardware execution.

4.2. Circuit Specification

In order to use the reconfigurable hardware, designers must somehow be able to specify the operation of their custom circuits. Before high-level compilation tools are developed for a specific reconfigurable system, this is done through hand mapping of the circuit, where the designer specifies the operation of the components in the configurable system directly. Here, the designers utilize the basic building blocks of the reconfigurable system to create the desired circuit. This style of circuit specification is primarily useful only when a software front-end for circuit design is unavailable, or for the design of small circuits or circuits with very high performance requirements. This is due to the great amount of time involved in manual circuit creation. However, for circuits that can be reasonably hand mapped, this provides potentially the smallest and fastest implementation.

Because not all designers can be intimately familiar with every reconfigurable architecture, some design tools abstract the specifics of the target architecture. Creating a circuit using a structural design language involves describing a circuit using building blocks such as gates, flip-flops and latches [Bellows and Hutchings 1998; Gehring and Ludwig 1998; Hutchings et al. 1999]. The compiler then maps these modules to one or more basic components of the architecture of the reconfigurable system. Structural VHDL is one example of this type of programming, and commercial tools are available for compiling from this language into vendor-specific FPGAs [Synplicity 1999].

However, these two methods require that the designer possess either an intimate knowledge of the targeted reconfigurable hardware, or at least a working knowledge of the concepts involved
in hardware design. In order to allow a greater number of software developers to take advantage of reconfigurable computing, tools that allow for behavioral circuit descriptions are being developed. These systems trade some area and performance quality for greater flexibility and ease of use.

Behavioral circuit design is similar to software design because the designer indicates the steps a hardware subsystem must go through in order to perform the desired computation rather than the actual composition of the circuit. These behavioral descriptions can be either in a generic hardware description language such as VHDL or Verilog, or a general-purpose high-level language such as C/C++ or Java. The eventual goal of this type of compilation is to allow users to write programs in commonly used languages that compile equally well, without modification, to both a traditional software executable and to an executable which leverages reconfigurable hardware.

Working towards this direction, Transmogrifier C [Galloway 1995] allows a subset of the C language to be used to describe hardware circuits. While multiplication, division, pointers, arrays, and a few other C language specifics are not supported, this system provides a behavioral method of circuit description using a primitive form of the C language. Similarly, the C++ programming environment used for the P1 system [Vuillemin et al. 1996] provides a hybrid method of description, using a combination of behavioral and structural design. Synopsys’ CoCentric compiler [Synopsys 2000], which can be targeted to the Xilinx Virtex series of FPGAs, uses SystemC to provide for behavioral compilation of C/C++ with the assistance of a set of additional hardware-defining classes. Other compilers, such as Nimble [Li et al. 2000a] and the Garp compiler [Callahan et al. 2000], are fully behavioral C compilers, handling the full set of the ANSI C language.

Although behavioral description, and HLL description in particular, provides a convenient method for the programming of reconfigurable systems, it does suffer from the drawback that it tends to produce larger and slower designs than those generated by a structural description or hand-mapping. Behavioral descriptions can leave many aspects of the circuit unspecified. For example, a compiler that encounters a while loop must generate complicated control structures in order to allow for an unspecified number of iterations. Also, in many HLL implementations, optimizations based upon the bit width of operands cannot be performed. The compiler is generally unaware of any application-specific limitations on the operand size; it only sees the programmer’s choice of data format in the program. Problems such as these might be solved through additional programmer effort to replace while loops whenever possible with for loops, and to use compiler directives to indicate exact sizes of operands [Galloway 1995; Gokhale and Stone 1998]. This method of hardware design falls between structural description and behavioral description in complexity, because although the programmers do not need to know a great deal about hardware design, they are required to follow additional guidelines that are not required for software-only implementations.

4.3. Circuit Libraries

The use of circuit or macro libraries can greatly simplify and speed the design process. By predesigning commonly used structures such as adders, multipliers, and counters, circuit creation for configurable systems becomes largely the assembly of high-level components, and only application-specific structures require detailed design. The actual architecture of the reconfigurable device can be abstracted, provided only library components are used, as these low-level details will already have been encapsulated within the library structures. Although the users of the circuit library may not know the intricacies of the destination architecture, they are still able to make use of architecture-specific optimizations, such as specialized carry
chains. This is because designers very familiar with the details of the target architecture create the components within a circuit library. They can take advantage of architecture specifics when creating the modules to make these components faster and smaller than a designer unfamiliar with the architecture likely would. An added benefit of the architecture abstraction is that the use of library components can also facilitate design migration from one architecture to another, because designers are not required to learn a new architecture, but only to indicate the new target for the library components. However, this does require that a circuit library contain implementations for more than one architecture.

One method for using library components is to simply instantiate them within an HDL design [Xilinx 1997; Altera 1999]. However, circuit libraries can also be used in general language compilers by comparing the dataflow graph of the application to the dataflow graphs of the library macros [Cadambi and Goldstein 1999]. If a dataflow representation of a macro matches a portion of the application graph, the corresponding macro is used for that part of the configuration.

Another benefit of circuit design with library macros is that of fast compilation. Because the library structures may have been premapped, preplaced, and prerouted (at least within the macro boundaries), the actual compile time is reduced to the time required to place the library components and route between them. For example, fast configuration was one of the main motivations for the creation of libraries for circuit design in the DISC reconfigurable image processing system [Hutchings 1997].

4.4. Circuit Generators

Circuit generators fulfill a role similar to circuit libraries, in that they provide optimized high-level structures for use within larger applications. Again, designers are not required to understand the low-level details of particular architectures. However, circuit generators create semicustomized high-level structures automatically at compile time, as opposed to circuit libraries that only provide static structures. For example, a circuit generator can create an adder structure of the exact bit width required by the designer, whereas a circuit library is likely to contain a limited number of adder structures, none of which may be of the correct size. Circuit generators are therefore more flexible than circuit libraries because of the customization allowed.

Some circuit generators, such as MacGen [Yasar et al. 1996], are executed at the command line using custom description files to generate physical design layout data files. Newer circuit generators, however, are functions or methods called from high-level language programs. PAM-Blox [Mencer et al. 1998], for example, is a set of circuit generators executed in C++ that generate structures for use with the PCI Pamette reconfigurable processing board. The circuit generator presented by Chu et al. [1998] contains a number of Java classes to allow a programmer to generate arbitrarily sized arithmetic and logical components for a circuit. Although the examples presented in that paper were mapped to a Xilinx 4000 series FPGA, the generator uses architecture-specific libraries for module generation. The target architecture can therefore be changed through the use of a different design library. The Carry Look-Ahead circuit generator described by Stohmann and Barke [1996] is also retargetable, because it maps to an FPGA logic cell architecture defined by the user.

One drawback of the circuit generators is that they depend on a regular logic and routing structure. Hierarchical routing structures (such as those present in the Xilinx 6200 series [Xilinx 1996]) and specialized heterogeneous logic blocks are frequently not accounted for. Therefore, some optimized features of a particular architecture may be unused. For these cases, a circuit macro from a library may provide a more highly optimized structure than one created with a circuit generator,
provided that the library macro fits the needs of the application.

4.5. Partial Evaluation

Functions that are to be implemented on the reconfigurable array should occupy as little area as possible, so as to maximize the number of functions that can be mapped to the hardware. This, combined with the minimization of the delay incurred by each circuit, increases the overall acceleration of the application. Partial evaluation is the process of reducing hardware requirements for a circuit structure through optimization based upon known static inputs. Specifically, if an input is known to be constant, that value can potentially be propagated through one or more gates in the structure at compile time, and only the portions of a circuit that depend on time-varying inputs need to be mapped to the reconfigurable structure.

One example of the usefulness of this operation is that of constant coefficient multipliers. If one input to a multiplier is constant, a multiplier object can be reduced from a general-purpose multiplier to a set of additions with static-length shifts between them corresponding to the locations of 1s in the binary constant. This type of reduction leads to a lower area requirement for the circuit, and potentially higher performance due to fewer gate delays encountered on the critical path. Partial evaluation can also be performed in conjunction with circuit generation, where the constants passed to the generator function are used to simplify the created hardware circuit [Wang and Lewis 1997; Chu et al. 1998]. Other examples of this type of optimization for specific algorithms include the partial evaluation of DES encryption circuits [Leonard and Mangione-Smith 1997], and the partial evaluation of constant multipliers and fixed polynomial division circuits [Payne 1997].

4.6. Memory Allocation

As with traditional software programs, it may be necessary in reconfigurable computing to allocate memories to hold variables and other data. Off-chip memories may be added to the reconfigurable system. Alternatively, if a reconfigurable system includes memory blocks embedded into the reconfigurable logic, these may be used, provided that the storage requirements do not surpass the available embedded memory. If multiple off-chip memories are available to a reconfigurable system, variables used in parallel should be placed into different memory structures, such that they can be accessed simultaneously [Gokhale and Stone 1999]. When smaller embedded memory units are used, larger memories can be created from the smaller ones. However, in this case, it is desirable to ensure that each smaller memory is close to the computation that most requires its contents [Babb et al. 1999]. As mentioned earlier, the small embedded memories that are not allocated for data storage may be used to perform logic functions.

4.7. Parallelization

One of the benefits of reconfigurable computing is the ability to execute multiple operations in parallel. In cases where circuits are specified using a structural hardware description language, the user specifies all structures and timing, and therefore either implicitly or explicitly specifies any parallel operation. However, for behavioral and HLL descriptions, there are two methods to incorporate parallelism: manual parallelization through special instructions or compiler directives, and automatic parallelization by the compiler.

To manually incorporate parallelism within an application, the programmer can specifically mark sections of code that should run as parallel threads, and use similar operations to those used in traditional parallel compilers [Cronquist et al. 1998; Gokhale and Stone 1998]. For example, a signal/wait technique can be used to perform synchronization of the different threads of the computation. The RaPiD-B language [Cronquist et al. 1998] is one that uses this methodology.
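
The signal/wait style of synchronization mentioned above can be sketched in ordinary software terms. The example below uses Python's standard `threading.Event` as the signal; it is only an analogy for the idea and does not reproduce RaPiD-B or NAPA C syntax, and the function name `run_signal_wait_demo` and the toy computation are assumptions made for illustration.

```python
import threading


def run_signal_wait_demo():
    """Synchronize two threads with a signal/wait pair.

    The producer writes a value and then signals; the consumer waits
    for the signal before reading, so it never reads the value early.
    """
    shared = {"value": None}
    ready = threading.Event()      # the "signal" the consumer waits on
    results = []

    def producer():
        shared["value"] = sum(range(10))   # stand-in for one computation stage
        ready.set()                         # signal: the data is now valid

    def consumer():
        ready.wait()                        # wait: block until signaled
        results.append(shared["value"] * 2)

    # Start the consumer first to show that it really blocks on the signal.
    threads = [threading.Thread(target=consumer),
               threading.Thread(target=producer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_signal_wait_demo())
```

The consumer's result depends only on the order enforced by the signal, not on which thread happens to be scheduled first.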
Although the NAPA C compiler [Gokhale and Stone 1998] requires programmers to mark the areas of code for execution on the host processor and the reconfigurable hardware in parallel, it also detects and exploits fine-grained parallelism within computations destined for the reconfigurable hardware.

Automatic parallelization of inner loops is another common technique in reconfigurable hardware compilers to attempt to maximize the use of the reconfigurable hardware. The compiler will select the innermost loop level to be completely unrolled for parallel execution in hardware, potentially creating a heavily pipelined structure [Cronquist et al. 1998; Weinhardt and Luk 1999]. For these cases, outer loops may not have multiple iterations executing simultaneously. Any loop reordering to improve the parallelism of the circuit must be done by the programmer. However, some compiler systems have taken this procedure a step further and focus on the parallelization of all loops within the program, not just the inner loops [Wang and Lewis 1997; Budiu and Goldstein 1999]. This type of compiler generates a control flow graph based upon the entire program source code. Loop unrolling is used in order to increase the available parallelism, and the graph is then used to schedule parallel operations in the hardware.

4.8. Multi-FPGA System Software

When reconfigurable systems use more than one FPGA to form the complete reconfigurable hardware, there are additional compilation issues to deal with [Hauck and Agarwal 1996]. The design must first be partitioned into the different FPGA chips [Hauck 1995; Acock and Dimond 1997; Vahid 1997; Brasen and Saucier 1998; Khalid 1999]. This is generally done by placing each highly connected portion of a circuit into a single chip. Multi-FPGA systems have a limited number of I/O pins that connect the chips together, and therefore their use must be minimized in the overall circuit mapping. Also, by minimizing the amount of routing required between the FPGAs, the number of paths with a high (inter-chip) delay is reduced, and the circuit may have an overall higher performance. Similarly, those sections of the circuit that require a short delay time must be placed upon the same chip. Global placement then determines which of the actual FPGAs in the multi-FPGA system will contain each of the partitions.

After the circuit has been partitioned into the different FPGA chips, the connections between the chips must be routed [Mak and Wong 1997; Ejnioui and Ranganathan 1999]. A global routing algorithm determines at a high level the connections between the FPGA chips. It first selects a region of output pins on the source FPGA for a given signal, and determines which (if any) routing switches or additional FPGAs the signal must pass through to get to the destination FPGA. Detailed routing and pin assignment [Slimane-Kade et al. 1994; Hauck and Borriello 1997; Mak and Wong 1997; Ejnioui and Ranganathan 1999] are then used to assign signals to traces on an existing multi-FPGA board, or to create traces for a multi-FPGA board that is to be created specifically to implement the given circuit.

Because multi-FPGA systems use inter-chip connections to allow the circuit partitions to communicate, they frequently require a higher proportion of I/O resources vs. logic in each chip than is normally required in single-FPGA use. For this reason, some research has focused on methods to allow pins of the FPGAs to be reused for multiple signals. This procedure is referred to as Virtual Wires [Babb et al. 1993; Agarwal 1995; Selvidge et al. 1995], and allows for a flexible trade-off between logic and I/O within a given multi-FPGA system. Signals are multiplexed onto a single wire by using multiple virtual clock cycles, one per multiplexed signal, within a user clock cycle, thus pipelining the communication. In this manner, the I/O requirements of a circuit can be reduced, while the logic requirements (because of the added circuitry used for the multiplexing) are increased.
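
The pin-sharing idea behind Virtual Wires can be illustrated with a small software model. The sketch below is a simplified assumption of how logical signals might be scheduled onto physical pins, one signal per pin per virtual clock cycle; the function names (`pins_needed`, `multiplex`, `demultiplex`) are hypothetical and the model ignores the timing and logic overhead that a real implementation would incur.

```python
import math


def pins_needed(num_signals, virtual_cycles):
    """Physical pins required when each pin carries one signal per virtual cycle."""
    return math.ceil(num_signals / virtual_cycles)


def multiplex(signals, virtual_cycles):
    """Schedule logical signals onto pins across the virtual clock cycles.

    Returns a list of frames, one per virtual cycle; frames[c][p] is the
    bit driven on pin p during virtual cycle c (None if the slot is unused).
    """
    pins = pins_needed(len(signals), virtual_cycles)
    frames = [[None] * pins for _ in range(virtual_cycles)]
    for index, bit in enumerate(signals):
        cycle, pin = divmod(index, pins)
        frames[cycle][pin] = bit
    return frames


def demultiplex(frames, num_signals):
    """Recover the logical signals, in order, on the receiving FPGA."""
    flat = [bit for frame in frames for bit in frame]
    return flat[:num_signals]


if __name__ == "__main__":
    logical = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1]      # ten inter-chip signals
    frames = multiplex(logical, virtual_cycles=4)
    print("pins used:", len(frames[0]))
    print("round trip ok:", demultiplex(frames, len(logical)) == logical)
```

With four virtual cycles per user cycle, the ten signals in this example share three pins instead of ten, at the price of the extra multiplexing logic on both chips.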
Fig. 13. Applications which are too large to entirely fit on the reconfigurable
hardware can be partitioned into two or more smaller configurations that
can occupy the hardware at different times.
Fig. 14. The different basic models of reconfigurable computing: single context, multicon-
text, and partially reconfigurable. Each of these designs is shown performing a reconfigu-
ration.
following discussion defines the single context device, and further considers newer FPGA designs (multicontext and partially reconfigurable), along with their impact on run-time reconfiguration.

5.1.1. Single Context. Current single context FPGAs are programmed using a serial stream of configuration information. Because only sequential access is supported, any change to a configuration on this type of FPGA requires a complete reprogramming of the entire chip. Although this does simplify the reconfiguration hardware, it does incur a high overhead when only a small part of the configuration memory needs to be changed. Many commercial FPGAs are of this style, including the Xilinx 4000 series [Xilinx 1994], the Altera Flex10K series [Altera 1998], and Lucent’s Orca series [Lucent 1998]. This type of FPGA is therefore more suited for applications that can benefit from reconfigurable computing without run-time reconfiguration. A single context FPGA is depicted in Figure 14.

In order to implement run-time reconfiguration onto a single context FPGA, the configurations must be grouped into contexts, and each full context is swapped in and out of the FPGA as needed. Because each of these swap operations involves reconfiguring the entire FPGA, a good partitioning of the configurations between contexts is essential in order to minimize the total reconfiguration delay. If all the configurations used within a certain time period are present in the same context, no reconfiguration will be necessary. However, if a number of successive configurations are each partitioned into different contexts, several reconfigurations will be needed, slowing the operation of the run-time reconfigurable system.

5.1.2. Multicontext. A multicontext FPGA includes multiple memory bits for each programming bit location [DeHon 1996; Trimberger et al. 1997a; Scalera and Vazquez 1998; Chameleon 2000]. These memory bits can be thought of as multiple planes of configuration information, as shown in Figure 14. One plane of configuration information can be active at a given moment, but the device can quickly switch between different planes, or contexts, of already-programmed configurations. In this manner, the multicontext device can be considered a multiplexed set of single context devices, which requires that a context be fully reprogrammed to perform any modification. This system does allow for the background loading of a context, where one plane is active and in execution while an inactive plane is in the process of being programmed. Figure 15 shows a multicontext memory bit, as used in [Trimberger et al. 1997a]. A commercial product that uses this technique is the CS2000 RCP series from Chameleon, Inc. [Chameleon 2000]. This device provides
Fig. 16. A timeline of the configuration and reconfiguration of pipeline stages on a pipeline
reconfigurable FPGA. This example shows three physical pipeline stages implementing five
virtual pipeline stages [Cadambi et al. 1998].
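As a rough illustration of why the grouping of configurations into contexts matters on a single context device, the following Python sketch counts how many full-chip reconfigurations a sequence of configuration uses would trigger under two different partitionings. The function name `count_context_swaps`, the configuration labels, and the example trace are all hypothetical and are not taken from any of the cited systems. The model assumes only that a swap is needed whenever the next configuration lives in a context other than the one currently loaded, and it counts the initial load as a swap.

```python
def count_context_swaps(trace, context_of):
    """Count full-chip reconfigurations on a single context device.

    trace      -- sequence of configuration names in the order they are used
    context_of -- mapping from configuration name to its context id
    (Both are illustrative assumptions, not part of any real tool.)
    """
    swaps = 0
    loaded = None  # no context is on the chip before the first use
    for cfg in trace:
        ctx = context_of[cfg]
        if ctx != loaded:
            # The whole chip must be reprogrammed to bring in this context.
            swaps += 1
            loaded = ctx
    return swaps


# Hypothetical usage pattern: A and B alternate first, then C and D.
trace = ["A", "B", "A", "B", "C", "D", "C", "D"]

# Partitioning 1: configurations used close together share a context.
grouped_by_use = {"A": 0, "B": 0, "C": 1, "D": 1}

# Partitioning 2: configurations used close together are split apart.
split_apart = {"A": 0, "B": 1, "C": 0, "D": 1}

good = count_context_swaps(trace, grouped_by_use)
bad = count_context_swaps(trace, split_apart)
```

Under these assumptions, the partitioning that keeps temporally close configurations together needs 2 context loads for this trace, while the one that splits them needs 8, one per use.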
contexts, to be loaded into the reconfigurable hardware at run-time. Nimble [Li et al. 2000a] is one of the compilers that perform this type of operation. This compiler focuses on mapping core loops within C code to reconfigurable hardware. Hardware models for the candidate loops that will fit within the reconfigurable hardware are first extracted from the C application. Then these loops are grouped into individual configurations using a partitioning method in order to encourage the hardware loops that are used in close temporal proximity to be mapped to the same configuration, reducing configuration overhead. For partially reconfigurable designs, the compiler must determine a good placement in order to prevent configurations that are used together in close temporal proximity from occupying the same resources. Again, through minimizing the number of reconfigurations, the overall performance of the system is increased, as configuration is a slow process [Li et al. 2000b]. An alternative approach, which allows the final placement of a configuration to be determined at run-time, is also discussed within the Fast Configuration section of this article.

5.4. Fast Configuration

Because run-time reconfigurable systems involve reconfiguration during program execution, the reconfiguration must be done as efficiently and as quickly as possible. This is in order to ensure that the overhead of the reconfiguration does not eclipse the benefit gained by hardware acceleration. Stalling execution of either the host processor or the reconfigurable hardware because of configuration is clearly undesirable. In the DISC II system, from 25% [Wirthlin and Hutchings 1996] to 71% [Wirthlin and Hutchings 1995] of execution time is spent in reconfiguration, while in the UCLA ATR work this figure can rise to over 98.5% [Mangione-Smith 1999]. If the delays caused by reconfiguration are reduced, performance can be greatly increased. Therefore, fast configuration is an important area of research for run-time reconfigurable systems.

There are a number of different tactics for reducing the configuration overhead. First, loading of the configurations can be timed such that the configuration overlaps as much as possible with the execution of instructions by the host processor. Second, compression techniques can be introduced to decrease the amount of configuration data that must be transferred to the system. Third, specialized hardware can be used to adjust the physical location of configurations at run-time based on where the free area on the hardware is located at any given time. Finally, the actual process of transferring the data from the host processor to the reconfigurable hardware can be modified to include a configuration cache, which would provide a faster reconfiguration.

5.4.1. Configuration Prefetching. Performance is improved when the actual configuration of the hardware is overlapped with computations performed by the host processor, because programming the reconfigurable hardware requires from milliseconds to seconds to accomplish. Overlapping configuration and execution prevents the host processor from stalling while it is waiting for the configuration to finish, and hides the configuration time from the program execution. Configuration prefetching [Hauck 1998a] attempts to leverage this overlap by determining when to initiate reconfiguration of the hardware in order to maximize overlap with useful computation on the host processor. It also seeks to minimize the chance that a configuration will be prefetched falsely, overwriting the configuration that is actually used next.

5.4.2. Configuration Compression. Unfortunately, there will always be cases in which the configuration overheads cannot be successfully hidden using a prefetching technique. This can occur when a conditional branch occurs immediately before the use of a configuration, potentially making a 100% correct prefetch prediction impossible, or when multiple configurations or contexts must be loaded in quick succession. In these cases, the delay incurred is minimized when the amount
of data transferred from the host processor to the reconfigurable array is minimized. Configuration compression can be used to compact this configuration information [Hauck et al. 1998b; Hauck and Wilson 1999; Li and Hauck 1999; Dandalis and Prasanna 2001].

One form of configuration compression has already been implemented in a commercial system. The Xilinx 6200 series of FPGA [Xilinx 1996] contains wildcarding hardware, which provides a method to program multiple logic cells with a single address and data value. This is accomplished by setting a special register to indicate which of the address bits should behave as “don’t-care” values, resolving to multiple addresses for configuration. For example, suppose two configuration addresses, 00010 and 00110, are both to be programmed with the same value. By setting the wildcard register to 00100, the address value sent is interpreted as 00X10 and both these locations are programmed using either of the two addresses above in a single operation. Hauck et al. [1998b] discuss the benefits of this hardware, while Li and Hauck [1999] cover a potential extension to the concept, where “don’t care” values in the configuration stream can be used to allow areas with similar but not identical configuration data values to also be programmed simultaneously.

Within partially reconfigurable systems, there is an added potential to compress effectively the amount of data sent to the reconfigurable hardware. A configuration can possibly reuse configuration information already present on the array, such that only the areas differing in configuration values must be reprogrammed. Therefore, configuration time can be reduced through the identification of these common components and the calculation of the incremental configurations that must be loaded [Luk et al. 1997a; Shirazi et al. 1998].

Alternately, similar operations can be grouped together to form a single configuration that contains extra control circuitry in order to implement the various functions within the group [Kastrup et al. 1999]. By creating larger configurations out of groups of smaller configurations, the configuration overhead of partial reconfiguration is reduced because more operations can be present on chip simultaneously. However, there are some area and execution penalties imposed by this method, creating a trade-off between reduced reconfiguration overhead and faster execution with a smaller area.

5.4.3. Relocation and Defragmentation in Partially Reconfigurable Systems. Partially reconfigurable systems have the advantage over single context systems in that they allow a new configuration to be written to the programmable logic while the configurations not occupying that same area remain intact and available for future use. Because these configurations will not have to be reconfigured onto the array, and because the programming of a single configuration can require the transfer of far less configuration data than the programming of an entire context, a partially reconfigurable system can incur less configuration overhead than a single context FPGA.

However, inefficiencies can arise if two partial configurations are supposed to be located at overlapping physical locations on the FPGA. If these configurations are repeatedly used one after another, they must be swapped in and out of the array each time. This type of conflict could negate much of the benefit achieved by partially reconfigurable systems. A better solution to this problem is to allow the final placement of the configurations to occur at run-time, allowing for run-time relocation of those configurations [Li et al. 2000b; Compton et al. 2002]. Using relocation, a new configuration may be placed onto the reconfigurable array where it will cause minimum conflict with other needed configurations already present on the hardware. A number of different systems support run-time relocation, including Chimaera [Hauck et al. 1997], Garp [Hauser and Wawrzynek 1997], and PipeRench [Cadambi et al. 1998; Goldstein et al. 2000].

Even with relocation, partially reconfigurable hardware can still suffer from some
placement conflicts that could be avoided by using an additional hardware optimization. Over time, as a partially reconfigurable device loads and unloads configurations, the location of the unoccupied area on the array is likely to become fragmented, similar to what occurs in memory systems when RAM is allocated and deallocated. There may be enough empty area on the device to hold an incoming configuration, but it may be distributed throughout the array. A configuration normally requires a contiguous region of the chip, so it would have to overwrite a portion of a valid configuration in order to be placed onto the reconfigurable hardware. A system that incorporates the ability to perform defragmentation of the reconfigurable array, however, would be able to consolidate the unused area by moving valid configurations to new locations [Diessel and El Gindy 1997; Compton et al. 2002]. This area can then be used by incoming configurations, potentially without overwriting any of the moved configurations.

5.4.4. Configuration Caching. Because a great deal of the delay caused by configuration is due to the distance between the host processor and the reconfigurable hardware, as well as the reading of the configuration data from a file or main memory, a configuration cache can potentially reduce the costs of reconfiguration [Deshpande et al. 1999; Li et al. 2000b]. By storing the configurations in fast memory near to the reconfigurable array, the data transfer during reconfiguration is accelerated, and the overall time required is reduced. Additionally, a special configuration cache can allow for specialized direct output to the reconfigurable hardware [Compton et al. 2000]. This output can leverage the close proximity of the cache by providing high-bandwidth communications that would facilitate wide parallel loading of the configuration data, further reducing configuration times.

5.5. Potential Problems with RTR

Partial reconfiguration involves selectively programming portions of the reconfigurable array. However, in many architectures, there are some routing resources that traverse long distances, and may traverse areas allocated to different configurations. Care must be taken such that different configurations do not attempt to drive to these wires simultaneously, as multiple drivers to a wire can potentially damage the hardware. Therefore, systems such as the Xilinx 6200 [Xilinx 1996] and Chimaera [Hauck et al. 1997] have specially designed routing resources that prevent multiple drivers. LEGO [Chow et al. 1999b] includes an additional control signal preventing conflicts during the span of time between startup and actual programming of the hardware.

An additional difficulty in using run-time reconfigurable systems occurs when the host processor runs multiple threads or processes. These threads or processes may each have their own sets of configurations that are to be mapped to the reconfigurable hardware. Issues such as the correct use of memory protection and virtual memory must be considered during memory accesses by the reconfigurable hardware [Chien and Byun 1999; Jacob and Chow 1999; Jean et al. 1999]. Another problem can occur when one thread or process configures the hardware, which is then reconfigured by a different thread or process. Threads and processes must be prevented from incorrectly calling hardware functions that no longer appear on the reconfigurable hardware. This requires that the state of the reconfigurable hardware be set to “dirty” on a main processor context switch, or re-loaded with the correct configuration context.

Partially reconfigurable systems must also protect against inter-process or inter-thread conflicts within the array. Even if each application has ensured that its own configurations can safely coexist, a combination of configurations from different applications re-introduces the possibility of inadvertently causing an electrical short within the reconfigurable hardware. This particular issue can be solved through the use of an architecture that does not have “bad” configurations, such as the 6200 series [Xilinx 1996] and
Chimaera [Hauck et al. 1997]. The potential for this type of conflict also introduces the possibility of extremely destructive configurations that can destroy the system’s underlying hardware.

5.6. Run-Time Reconfiguration Summary

We have discussed the benefits of using run-time reconfiguration to increase the benefits gained through reconfigurable computing. Different configurations may be used at different phases of a program’s execution, customizing the hardware not only for the application, but also for the different stages of the application. Run-time reconfiguration also allows configurations larger than the available reconfigurable hardware to be implemented, as these circuits can be split into several smaller ones that are used in succession. Because of the delays associated with configuration, this style of computing requires that reconfiguration be performed in a very efficient manner. Multicontext and partially reconfigurable FPGAs are both designed to improve the time required for reconfiguration. Hardware optimizations, such as wildcarding, run-time relocation, and defragmentation, further decrease configuration overhead in a partially reconfigurable design. Software techniques to enable fast configuration, including prefetching and incremental configuration calculation, were also discussed.

6. CONCLUSION

Reconfigurable computing is becoming an important part of research in computer architectures and software systems. By placing the computationally intense portions of an application onto the reconfigurable hardware, that application can be greatly accelerated. This is because reconfigurable computing combines many of the benefits of both software and ASIC implementations. Like software, the mapped circuit is flexible, and can be changed over the lifetime of the system or even the lifetime of the application. Similar to an ASIC, reconfigurable systems provide a method to map circuits into hardware. Reconfigurable systems therefore have the potential to achieve far greater performance than software as a result of bypassing the fetch-decode-execute cycle of traditional microprocessors as well as possibly exploiting a greater degree of parallelism.

Reconfigurable hardware systems come in many forms, from a configurable functional unit integrated directly into a CPU, to a reconfigurable coprocessor coupled with a host microprocessor, to a multi-FPGA stand-alone unit. The level of coupling, granularity of computation structures, and form of routing resources are all key points in the design of reconfigurable systems. The use of heterogeneous structures can also greatly add to the overall performance of the final design.

Compilation tools for reconfigurable systems range from simple tools that aid in the manual design and placement of circuits, to fully automatic design suites that use program code written in a high-level language to generate circuits and the controlling software. The variety of tools available allows designers to choose between manual and automatic circuit creation for any or all of the design steps. Although automatic tools greatly simplify the design process, manual creation is still important for performance-driven applications. Circuit libraries and circuit generators are additional software tools that enable designers to quickly create efficient designs. These tools attempt to aid the designer in gaining the benefits of manual design without entirely sacrificing the ease of automatic circuit creation.

Finally, run-time reconfiguration provides a method to accelerate a greater portion of a given application by allowing the configuration of the hardware to change over time. Apart from the benefits of added capacity through the use of virtual hardware, run-time reconfiguration also allows for circuits to be optimized based on run-time conditions. In this manner, performance of a reconfigurable system can approach or even surpass that of an ASIC.

Reconfigurable computing systems have shown the ability to accelerate program
execution greatly, providing a high-performance alternative to software-only implementations. However, no one hardware design has emerged as the clear pinnacle of reconfigurable design. Although general-purpose FPGA structures have standardized into LUT-based architectures, groups designing hardware for reconfigurable computing are currently also exploring the use of heterogeneous structures and word-width computational elements. Those designing compiler systems face the task of improving automatic design tools to the point where they may achieve mappings comparable to manual design for even high-performance applications. Within both of these research categories lies the additional topic of run-time reconfiguration. While some work has been done in this field as well, research must continue in order to be able to perform faster and more efficient reconfiguration. Further study into each of these topics is necessary in order to harness the full potential of reconfigurable computing.

REFERENCES

ABOUZEID, P., BABBA, P., DE PAULET, M. C., AND SAUCIER, G. 1993. Input-driven partitioning methods and application to synthesis on table-lookup-based FPGA’s. IEEE Trans. Comput. Aid. Des. Integ. Circ. Syst. 12, 7, 913–925.
ACOCK, S. J. B. AND DIMOND, K. R. 1997. Automatic mapping of algorithms onto multiple FPGA-SRAM Modules. Field-Programmable Logic and Applications, W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Lecture Notes in Computer Science, vol. 1304, Springer-Verlag, Berlin, Germany, 255–264.
ADAPTIVE SILICON, INC. 2001. MSA 2500 Programmable Logic Cores. Adaptive Silicon, Inc., Los Gatos, CA.
AGARWAL, A. 1995. VirtualWires: A Technology for Massive Multi-FPGA Systems. Available online at http://www.ikos.com/products/virtual-wires.ps.
AGGARWAL, A. AND LEWIS, D. 1994. Routing architectures for hierarchical field programmable gate arrays. In Proceedings of the IEEE International Conference on Computer Design, 475–478.
ALEXANDER, M. J. AND ROBINS, G. 1996. New performance-driven FPGA routing algorithms. IEEE Trans. CAD Integ. Circ. Syst. 15, 12, 1505–1517.
ALTERA CORPORATION. 1998. Data Book. Altera Corporation, San Jose, CA.
ALTERA CORPORATION. 1999. Altera MegaCore Functions. Available online at http://www.altera.com/html/tools/megacore.html. Altera Corporation, San Jose, CA.
ALTERA CORPORATION. 2001. Press Release: Altera Unveils First Complete System-on-a-Programmable-Chip Solution at Embedded Systems Conference. Altera Corporation, San Jose, CA.
ANNAPOLIS MICROSYSTEMS, INC. 1998. Wildfire Reference Manual. Annapolis Microsystems, Inc, Annapolis, MD.
ARNOLD, J. M., BUELL, D. A., AND DAVIS, E. G. 1992. Splash 2. In Proceedings of the ACM Symposium on Parallel Algorithms and Architectures, 316–324.
BABB, J., RINARD, M., MORITZ, C. A., LEE, W., FRANK, M., BARUA, R., AND AMARASINGHE, S. 1999. Parallelizing applications into silicon. IEEE Symposium on Field-Programmable Custom Computing Machines, 70–80.
BABB, J., TESSIER, R., AND AGARWAL, A. 1993. Virtual wires: Overcoming pin limitations in FPGA-based logic emulators. In IEEE Workshop on FPGAs for Custom Computing Machines, 142–151.
BELLOWS, P. AND HUTCHINGS, B. 1998. JHDL—An HDL for reconfigurable systems. IEEE Symposium on Field-Programmable Custom Computing Machines, 175–184.
BETZ, V. AND ROSE, J. 1997. VPR: A new packing, placement and routing tool for FPGA research. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 213–222.
BETZ, V. AND ROSE, J. 1999. FPGA routing architecture: Segmentation and buffering to optimize speed and density. ACM/SIGDA International Symposium on FPGAs, 59–68.
BRASEN, D. R., AND SAUCIER, G. 1998. Using cone structures for circuit partitioning into FPGA packages. IEEE Trans. CAD Integ. Circ. Syst. 17, 7, 592–600.
BROWN, S. D., FRANCIS, R. J., ROSE, J., AND VRANESIC, Z. G. 1992a. Field-Programmable Gate Arrays, Kluwer Academic Publishers, Boston, MA.
BROWN, S., ROSE, J., AND VRANESIC, Z. G. 1992b. A detailed router for field-programmable gate arrays. IEEE Trans. Comput. Aid. Desi. 11, 5, 620–628.
BUDIU, M. AND GOLDSTEIN, S. C. 1999. Fast compilation for pipelined reconfigurable fabrics. ACM/SIGDA International Symposium on FPGAs, 195–205.
BUELL, D., ARNOLD, S. M., AND KLEINFELDER, W. J. 1996. SPLASH 2: FPGAs in a Custom Computing Machine, IEEE Computer Society Press, Los Alamitos, CA.
BURNS, J., DONLIN, A., HOGG, J., SINGH, S., AND DE WIT, M. 1997. A dynamic reconfiguration run-time system. IEEE Symposium on Field-Programmable Custom Computing Machines, 66–75.
BUTTS, M. AND BATCHELLER, J. 1991. Method of using electronically reconfigurable logic circuits. US Patent 5,036,473.
CADAMBI, S. AND GOLDSTEIN, S. C. 1999. CPR: A configuration profiling tool. IEEE Symposium on Field-Programmable Custom Computing Machines, 104–113.
CADAMBI, S., WEENER, J., GOLDSTEIN, S. C., SCHMIT, H., AND THOMAS, D. E. 1998. Managing pipeline-reconfigurable FPGAs. ACM/SIGDA International Symposium on FPGAs, 55–64.
CALLAHAN, T. J., CHONG, P., DEHON, A., AND WAWRZYNEK, J. 1998. Fast Module Mapping and Placement for Datapaths in FPGAs. ACM/SIGDA International Symposium on FPGAs, 123–132.
CALLAHAN, T. J., HAUSER, J. R., AND WAWRZYNEK, J. 2000. The Garp architecture and C compiler. IEEE Comput. 3, 4, 62–69.
CARDOSO, J. M. P. AND NETO, H. C. 1999. Macro-based hardware compilation of Java™ bytecodes into a dynamic reconfigurable computing system. IEEE Symposium on Field-Programmable Custom Computing Machines, 2–11.
CHAMELEON SYSTEMS, INC. 2000. CS2000 Advance Product Specification. Chameleon Systems, Inc., San Jose, CA.
CHAN, P. K. AND SCHLAG, M. D. F. 1997. Acceleration of an FPGA router. IEEE Symposium on Field-Programmable Custom Computing Machines, 175–181.
CHANG, D. AND MAREK-SADOWSKA, M. 1998. Partitioning sequential circuits on dynamically reconfigurable FPGAs. ACM/SIGDA International Symposium on FPGAs, 161–167.
CHANG, S. C., MAREK-SADOWSKA, M., AND HWANG, T. T. 1996. Technology mapping for TLU FPGA’s based on decomposition of binary decision diagrams. IEEE Trans. CAD Integ. Circ. Syst. 15, 10, 1226–1248.
CHICHKOV, A. V. AND ALMEIDA, C. B. 1997. An hardware/software partitioning algorithm for custom computing machines. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 274–283.
CHIEN, A. A. AND BYUN, J. H. 1999. Safe and protected execution for the morph/AMRM reconfigurable processor. IEEE Symposium on Field-Programmable Custom Computing Machines, 209–221.
CHOW, P., SEO, S. O., ROSE, J., CHUNG, K., PÁEZ-MONZÓN, G., AND RAHARDJA, I. 1999a. The design of an SRAM-based field-programmable Gate Array—Part I: Architecture. IEEE Trans. VLSI Syst. 7, 2, 191–197.
CHOW, P., SEO, S. O., ROSE, J., CHUNG, K., PÁEZ-MONZÓN, G., AND RAHARDJA, I. 1999b. The design of an SRAM-based field-programmable Gate Array—Part II: Circuit Design and Layout. IEEE Trans. VLSI Syst. 7, 3, 321–330.
CHOWDHARY, A. AND HAYES, J. P. 1997. General modeling and technology-mapping technique for LUT-based FPGAs. ACM/SIGDA International Symposium on FPGAs, 43–49.
CHU, M., WEAVER, N., SULIMMA, K., DEHON, A., AND WAWRZYNEK, J. 1998. Object oriented circuit-generators in Java. IEEE Symposium on Field-Programmable Custom Computing Machines, 158–166.
COMPTON, K., COOLEY, J., KNOL, S., AND HAUCK, S. 2000. Configuration relocation and defragmentation for FPGAs, Northwestern University Technical Report, Available online at http://www.ece.nwu.edu/~kati/publications.html.
COMPTON, K., LI, Z., COOLEY, J., KNOL, S., AND HAUCK, S. 2002. Configuration relocation and defragmentation for run-time reconfigurable computing. IEEE Trans. VLSI Syst., to appear.
CONG, J. AND HWANG, Y. Y. 1998. Boolean matching for complex PLBs in LUT-based FPGAs with application to architecture evaluation. ACM/SIGDA International Symposium on FPGAs, 27–34.
CONG, J. AND WU, C. 1998. An efficient algorithm for performance-optimal FPGA technology mapping with retiming. IEEE Trans. CAD Integr. Circ. Syst. 17, 9, 738–748.
CONG, J., WU, C., AND DING, Y. 1999. Cut ranking and pruning enabling a general and efficient FPGA mapping solution. ACM/SIGDA International Symposium on FPGAs, 29–35.
CONG, J. AND XU, S. 1998. Technology mapping for FPGAs with embedded memory blocks. ACM/SIGDA International Symposium on FPGAs, 179–188.
CRONQUIST, D. C., FRANKLIN, P., BERG, S. G., AND EBELING, C. 1998. Specifying and compiling applications for RaPiD. IEEE Symposium on Field-Programmable Custom Computing Machines, 116–125.
DANDALIS, A. AND PRASANNA, V. K. 2001. Configuration compression for FPGA-based embedded systems. ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 173–182.
DEHON, A. 1996. DPGA Utilization and Application. ACM/SIGDA International Symposium on FPGAs, 115–121.
DEHON, A. 1999. Balancing interconnect and computation in a reconfigurable computing array (or, why you don’t really want 100% LUT utilization). ACM/SIGDA International Symposium on FPGAs, 69–78.
DESHPANDE, D., SOMANI, A. K., AND TYAGI, A. 1999. Configuration caching vs data caching for striped FPGAs. ACM/SIGDA International Symposium on FPGAs, 206–214.
DIESSEL, O. AND EL GINDY, H. 1997. Run-time compaction of FPGA designs. Lecture Notes in Computer Science 1304—Field-Programmable Logic and Applications. W. Luk, P. Y. K. Cheung, M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 131–140.
DOLLAS, A., SOTIRIADES, E., AND EMMANOUELIDES, A. 1998. Architecture and design of GE1, A FCCM for golomb ruler derivation. IEEE Symposium on Field-Programmable Custom Computing Machines, 48–56.
EBELING, C., CRONQUIST, D. C., AND FRANKLIN, P. 1996. RaPiD—Reconfigurable pipelined datapath. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein, M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 126–135.
EJNIOUI, A. AND RANGANATHAN, N. 1999. Multi-terminal net routing for partial crossbar-based multi-FPGA systems. ACM/SIGDA International Symposium on FPGAs, 176–184.
ELBIRT, A. J. AND PAAR, C. 2000. An FPGA implementation and performance evaluation of the serpent block cipher. ACM/SIGDA International Symposium on FPGAs, 33–40.
EMMERT, J. M. AND BHATIA, D. 1999. A methodology for fast FPGA floorplanning. ACM/SIGDA International Symposium on FPGAs, 47–56.
ESTRIN, G., BUSSEL, B., TURN, R., AND BIBB, J. 1963. Parallel processing in a restructurable computer system. IEEE Trans. Elect. Comput. 747–755.
GALLOWAY, D. 1995. The transmogrifier C hardware description language and compiler for FPGAs. IEEE Symposium on FPGAs for Custom Computing Machines, 136–144.
GEHRING, S. AND LUDWIG, S. 1996. The trianus system and its application to custom computing. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 176–184.
GEHRING, S. W. AND LUDWIG, S. H. M. 1998. Fast integrated tools for circuit design with FPGAs. ACM/SIGDA International Symposium on FPGAs, 133–139.
GOKHALE, M. B. AND STONE, J. M. 1998. NAPA C: Compiling for a hybrid RISC/FPGA architecture. IEEE Symposium on Field-Programmable Custom Computing Machines, 126–135.
GOKHALE, M. B. AND STONE, J. M. 1999. Automatic allocation of arrays to memories in FPGA processors with multiple memory banks. IEEE Symposium on Field-Programmable Custom Computing Machines, 63–69.
GOLDSTEIN, S. C., SCHMIT, H., BUDIU, M., CADAMBI, S., MOE, M., AND TAYLOR, R. 2000. PipeRench: A Reconfigurable Architecture and Compiler, IEEE Computer, vol. 33, No. 4.
GRAHAM, P. AND NELSON, B. 1996. Genetic algorithms in software and in hardware—A performance analysis of workstations and custom computing machine implementations. IEEE Symposium on FPGAs for Custom Computing Machines, 216–225.
HAUCK, S. 1995. Multi-FPGA systems. Ph.D. dissertation, Univ. Washington, Dept. of C.S.&E.
HAUCK, S. 1998a. Configuration prefetch for single context reconfigurable coprocessors. ACM/SIGDA International Symposium on FPGAs, 65–74.
HAUCK, S. 1998b. The roles of FPGAs in reprogrammable systems. Proc. IEEE 86, 4, 615–638.
HAUCK, S. AND AGARWAL A. 1996. Software technologies for reconfigurable systems. Dept. of ECE Technical Report, Northwestern Univ. Available online at http://www.ee.washington.edu/faculty/hauck/publications.html.
HAUCK, S. AND BORRIELLO, G. 1997. Pin assignment for multi-FPGA systems. IEEE Trans. Comput. Aid. Desi. Integ. Circ. Syst. 16, 9, 956–964.
HAUCK, S., BORRIELLO, G., AND EBELING, C. 1998a. Mesh routing topologies for multi-FPGA systems. IEEE Trans. VLSI Syst. 6, 3, 400–408.
HAUCK, S., FRY, T. W., HOSLER, M. M., AND KAO, J. P. 1997. The Chimaera reconfigurable functional unit. IEEE Symposium on Field-Programmable Custom Computing Machines, 87–96.
HAUCK, S., LI, Z., AND SCHWABE, E. 1998b. Configuration compression for the Xilinx XC6200 FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, 138–146.
HAUCK, S. AND WILSON, W. D. 1999. Runlength compression techniques for FPGA configurations. Dept. of ECE Technical Report, Northwestern Univ. Available online at http://www.ee.washington.edu/faculty/hauck/publications.html.
HAUSER, J. R. AND WAWRZYNEK, J. 1997. Garp: A MIPS processor with a reconfigurable coprocessor. IEEE Symposium on Field-Programmable Custom Computing Machines, 12–21.
HAYNES, S. D. AND CHEUNG, P. Y. K. 1998. A reconfigurable multiplier array for video image processing tasks, suitable for embedding in an FPGA structure. IEEE Symposium on Field-Programmable Custom Computing Machines, 226–234.
HEILE, F. AND LEAVER, A. 1999. Hybrid product term and LUT based architectures using embedded memory blocks. ACM/SIGDA International Symposium on FPGAs, 13–16.
HUANG, W. J., SAXENA, N., AND MCCLUSKEY, E. J. 2000. A reliable LZ data compressor on reconfigurable coprocessors. IEEE Symposium on Field-Programmable Custom Computing Machines, 249–258.
HUELSBERGEN, L. 2000. A representation for dynamic graphs in reconfigurable hardware and its application to fundamental graph
algorithms. ACM/SIGDA International Sympo- KRUPNOVA, H., RABEDAORO, C., AND SAUCIER, G. 1997.
sium on FPGAs, 105–115. Synthesis and floorplanning for large hierarchi-
HUTCHINGS, B. L. 1997. Exploiting reconfig- cal FPGAs. ACM/SIGDA International Sympo-
urability through domain-specific systems. sium on FPGAs, 105–111.
Lecture Notes in Computer Science 1304— LAI, Y. T. AND WANG, P. T. 1997. Hierarchical in-
Field-Programmable Logic and Applications. terconnection structures for field programmable
W. Luk, P. Y. K. Cheung, and M. Glesner, gate arrays. IEEE Trans. VLSI Syst. 5, 2, 186–
Eds. Springer-Verlag, Berlin, Germany, 193– 196.
202. LAUFER, R., TAYLOR, R. R., AND SCHMIT, H. 1999.
HUTCHINGS, B., BELLOWS, P., HAWKINS, J., HEMMERT, PCI-PipeRench and the SwordAPI: A system for
S., NELSON, B., AND RYTTING, M. 1999. A CAD stream-based reconfigurable computing. IEEE
suite for high-performance FPGA design. IEEE Symposium on Field-Programmable Custom
Symposium on Field-Programmable Custom Computing Machines, 200–208.
Computing Machines, 12–24. LEE, Y. S. AND WU, A. C. H. 1997. A performance
HWANG, T. T., OWENS, R. M., IRWIN, M. J., AND and routability-driven router for FPGA’s consid-
WANG, K. H. 1994. Logic synthesis for field- ering path delays. IEEE Trans. CAD Integ. Circ.
programmable gate arrays. IEEE Trans. Com- Syst. 16, 2, 179–185.
put. Aid. Des. Integ. Circ. Syst. 13, 10, 1280– LEONARD, J. AND MANGIONE-SMITH, W. H. 1997. A
1287. case study of partially evaluated hardware cir-
INUANI, M. K. AND SAUL, J. 1997. Technology map- cuits: Key-specific DES. Lecture Notes in Com-
ping of heterogeneous LUT-based FPGAs. Lec- puter Science 1304—Field-Programmable Logic
ture Notes in Computer Science 1304—Field- and Applications. W. Luk, P. Y. K. Cheung,
Programmable Logic and Applications. W. Luk, and M. Glesner, Eds. Springer-Verlag, Berlin,
P. Y. K. Cheung, and M. Glesner, Eds. Springer- Germany, 151–160.
Verlag, Berlin, Germany, 223–234. LEUNG, K. H., MA, K. W., WONG, W. K., AND LEONG,
JACOB, J. A. AND CHOW, P. 1999. Memory interfacing P. H. W. 2000. FPGA Implementation of a mi-
and instruction specification for reconfigurable crocoded elliptic curve cryptographic processor.
processors. ACM/SIGDA International Sympo- IEEE Symposium on Field-Programmable Cus-
sium on Field-Programmable Gate Arrays, 145– tom Computing Machines, 68–76.
154. LEWIS, D. M., GALLOWAY, D. R., VAN IERSSEL, M., ROSE,
JEAN, J. S. N., TOMKO, K., YAVAGAL, V., SHAH, J., J., AND CHOW, P. 1997. The Transmogrifier-2:
AND COOK R. 1999. Dynamic reconfiguration A 1 million gate rapid prototyping system.
to support concurrent applications. IEEE Trans. ACM/SIGDA International Symposium on
Comput. 48, 6, 591–602. FPGAs, 53–61.
KASTRUP, B., BINK, A., AND HOOGERBRUGGE, J. 1999. LI, Y., CALLAHAN, T., DARNELL, E., HARR, R., KURKURE,
ConCISe: A compiler-driven CPLD-based in- U., AND STOCKWOOD, J. 2000a. Hardware-
struction set accelerator. IEEE Symposium software co-design of embedded reconfigurable
on Field-Programmable Custom Computing architectures. Design Automation Conference,
Machines, 92–101. 507–512.
KHALID, M. A. S. 1999. Routing architecture and LI, Z., COMPTON, K., AND HAUCK, S. 2000b. Config-
layout synthesis for multi-FPGA systems. Ph.D. uration caching for FPGAs. IEEE Symposium
dissertation, Dept. of ECE, Univ. Toronto. on Field-Programmable Custom Computing
KHALID, M. A. S. AND ROSE, J. 1998. A hybrid Machines, 22–36.
complete-graph partial-crossbar routing archi- LI, Z. AND HAUCK, S. 1999. Don’t care discovery for
tecture for multi-FPGA systems. ACM/SIGDA FPGA configuration compression. ACM/SIGDA
International Symposium on FPGAs, 45–54. International Symposium on FPGAs, 91–98.
KIM, H. J. AND MANGIONE-SMITH, W. H. 2000. Fac- LIN, X., DAGLESS, E., AND LU, A. 1997. Technol-
toring large numbers with programmable hard- ogy mapping of LUT based FPGAs for delay
ware. ACM/SIGDA International Symposium optimisation. Lecture Notes in Computer Sci-
on FPGAs, 41–48. ence 1304—Field-Programmable Logic and Ap-
plications. W. Luk, P. Y. K. Cheung, and M.
KIM, H. S., SOMANI, A. K., AND TYAGI, A. 2000. A
Glesner, Eds. Springer-Verlag, Berlin, Germany,
reconfigurable multi-function computing cache
245–254.
architecture. ACM/SIGDA International Sym-
posium on FPGAs, 85–94. LIU, H. AND WONG, D. F. 1999. Circuit partitioning
for dynamically reconfigurable FPGAs. ACM/
KRESS, R., HARTENSTEIN, R. W., AND NAGELDINGER, U.
SIGDA International Symposium on FPGAs,
1997. An operating system for custom comput-
187–194.
ing machines based on the Xputer paradigm.
Lecture Notes in Computer Science 1304—Field- LUCENT TECHNOLOGIES, INC. 1998. FPGA Data
Programmable Logic and Applications. W. Luk, Book. Lucent Technologies, Inc., Allentown, PA.
P. Y. K. Cheung, and M. Glesner, Eds. Springer- LUK, W., SHIRAZI, N., AND CHEUNG, P. Y. K. 1997a.
Verlag, Berlin, Germany, 304–313. Compilation tools for run-time reconfigurable
SHAHOOKAR, K. AND MAZUMDER, P. 1991. VLSI cell placement techniques. ACM Comput. Surv. 23, 2, 145–220.
SHI, J. AND BHATIA, D. 1997. Performance driven floorplanning for FPGA based designs. ACM/SIGDA International Symposium on FPGAs, 112–118.
SHIRAZI, N., LUK, W., AND CHEUNG, P. Y. K. 1998. Automating production of run-time reconfigurable designs. IEEE Symposium on Field-Programmable Custom Computing Machines, 147–156.
SLIMANE-KADI, M., BRASEN, D., AND SAUCIER, G. 1994. A fast-FPGA prototyping system that uses inexpensive high-performance FPIC. ACM/SIGDA Workshop on Field-Programmable Gate Arrays.
SOTIRIADES, E., DOLLAS, A., AND ATHANAS, P. 2000. Hardware-software codesign and parallel implementation of a Golomb ruler derivation engine. IEEE Symposium on Field-Programmable Custom Computing Machines, 227–235.
STOHMANN, J. AND BARKE, E. 1996. An universal CLA adder generator for SRAM-based FPGAs. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 44–54.
SWARTZ, J. S., BETZ, V., AND ROSE, J. 1998. A fast routability-driven router for FPGAs. ACM/SIGDA International Symposium on FPGAs, 140–149.
SYNOPSYS, INC. 2000. CoCentric System C Compiler. Synopsys, Inc., Mountain View, CA.
SYNPLICITY, INC. 1999. Synplify User Guide Release 5.1. Synplicity, Inc., Sunnyvale, CA.
TAKAHARA, A., MIYAZAKI, T., MUROOKA, T., KATAYAMA, M., HAYASHI, K., TSUTSUI, A., ICHIMORI, T., AND FUKAMI, K. 1998. More wires and fewer LUTs: A design methodology for FPGAs. ACM/SIGDA International Symposium on FPGAs, 12–19.
THAKUR, S., CHANG, Y. W., WONG, D. F., AND MUTHUKRISHNAN, S. 1997. Algorithms for an FPGA switch module routing problem with application to global routing. IEEE Trans. CAD Integ. Circ. Syst. 16, 1, 32–46.
TOGAWA, N., YANAGISAWA, M., AND OHTSUKI, T. 1998. Maple-OPT: A performance-oriented simultaneous technology mapping, placement, and global routing algorithm for FPGA’s. IEEE Trans. CAD Integ. Circ. Syst. 17, 9, 803–818.
TRIMBERGER, S. 1998. Scheduling designs into a time-multiplexed FPGA. ACM/SIGDA International Symposium on FPGAs, 153–160.
TRIMBERGER, S., CARBERRY, D., JOHNSON, A., AND WONG, J. 1997a. A time-multiplexed FPGA. IEEE Symposium on Field-Programmable Custom Computing Machines, 22–28.
TRIMBERGER, S., DUONG, K., AND CONN, B. 1997b. Architecture issues and solutions for a high-capacity FPGA. ACM/SIGDA International Symposium on FPGAs, 3–9.
TSU, W., MACY, K., JOSHI, A., HUANG, R., WALKER, N., TUNG, T., ROWHANI, O., GEORGE, V., WAWRZYNEK, J., AND DEHON, A. 1999. HSRA: High-speed, hierarchical synchronous reconfigurable array. ACM/SIGDA International Symposium on FPGAs, 125–134.
VAHID, F. 1997. I/O and performance tradeoffs with the FunctionBus during multi-FPGA partitioning. ACM/SIGDA International Symposium on FPGAs, 27–34.
VARGHESE, J., BUTTS, M., AND BATCHELLER, J. 1993. An efficient logic emulation system. IEEE Trans. VLSI Syst. 1, 2, 171–174.
VASILKO, M. AND CABANIS, D. 1999. Improving simulation accuracy in design methodologies for dynamically reconfigurable logic systems. IEEE Sympos. Field-Prog. Cust. Comput. Mach. 123–133.
VUILLEMIN, J., BERTIN, P., RONCIN, D., SHAND, M., TOUATI, H., AND BOUCARD, P. 1996. Programmable active memories: Reconfigurable systems come of age. IEEE Trans. VLSI Syst. 4, 1, 56–69.
WANG, Q. AND LEWIS, D. M. 1997. Automated field-programmable compute accelerator design using partial evaluation. IEEE Symposium on Field-Programmable Custom Computing Machines, 145–154.
WEINHARDT, M. AND LUK, W. 1999. Pipeline vectorization for reconfigurable systems. IEEE Symposium on Field-Programmable Custom Computing Machines, 52–62.
WILTON, S. J. E. 1998. SMAP: Heterogeneous technology mapping for area reduction in FPGAs with embedded memory arrays. ACM/SIGDA International Symposium on FPGAs, 171–178.
WIRTHLIN, M. J. AND HUTCHINGS, B. L. 1995. A dynamic instruction set computer. IEEE Symposium on FPGAs for Custom Computing Machines, 99–107.
WIRTHLIN, M. J. AND HUTCHINGS, B. L. 1996. Sequencing run-time reconfigured hardware with software. ACM/SIGDA International Symposium on FPGAs, 122–128.
WIRTHLIN, M. J. AND HUTCHINGS, B. L. 1997. Improving functional density through run-time constant propagation. ACM/SIGDA International Symposium on FPGAs, 86–92.
WITTIG, R. D. AND CHOW, P. 1996. OneChip: An FPGA processor with reconfigurable logic. IEEE Symposium on FPGAs for Custom Computing Machines, 126–135.
WOOD, R. G. AND RUTENBAR, R. A. 1997. FPGA routing and routability estimation via Boolean satisfiability. ACM/SIGDA International Symposium on FPGAs, 119–125.
WU, Y. L. AND MAREK-SADOWSKA, M. 1997. Routing for array-type FPGA’s. IEEE Trans. CAD Integ. Circ. Syst. 16, 5, 506–518.
XILINX, INC. 1994. The Programmable Logic Data Book. Xilinx, Inc., San Jose, CA.
XILINX, INC. 1996. XC6200: Advance Product Specification. Xilinx, Inc., San Jose, CA.
XILINX, INC. 1997. LogiBLOX: Product Specification. Xilinx, Inc., San Jose, CA.
XILINX, INC. 1999. Virtex™ 2.5 V Field Programmable Gate Arrays: Advance Product Specification. Xilinx, Inc., San Jose, CA.
XILINX, INC. 2000. Press Release: IBM and Xilinx Team to Create New Generation of Integrated Circuits. Xilinx, Inc., San Jose, CA.
XILINX, INC. 2001. Virtex-II 1.5V Field Programmable Gate Arrays: Advance Product Specification. Xilinx, Inc., San Jose, CA.
YASAR, G., DEVINS, J., TSYRKINA, Y., STADTLANDER, G., AND MILLHAM, E. 1996. Growable FPGA macro generator. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 307–326.
YI, K. AND JHON, C. S. 1996. A new FPGA technology mapping approach by cluster merging. Lecture Notes in Computer Science 1142—Field-Programmable Logic: Smart Applications, New Paradigms and Compilers. R. W. Hartenstein and M. Glesner, Eds. Springer-Verlag, Berlin, Germany, 366–370.
ZHONG, P., MARTINOSI, M., ASHAR, P., AND MALIK, S. 1998. Accelerating Boolean satisfiability with configurable hardware. IEEE Symposium on Field-Programmable Custom Computing Machines, 186–195.
Received May 2000; revised October 2001 and January 2002; accepted February 2002