
PRELIMINARY VERSION|LIMIT DISTRIBUTION|CONTACT AUTHOR FOR FINAL (DEHON 1998)

Why Configurable Computing? The Computational Density Advantage of Configurable Architectures
Andre DeHon <andre@acm.org>
Soda Hall #1776, University of California at Berkeley, Berkeley, CA 94720-1776
Voice: (510) 643-2818; FAX: (510) 642-5775

Abstract

A large and growing community of researchers has used Field-Programmable Gate Arrays (FPGAs) to accelerate computing applications, with many point successes. While the absolute performance achieved by these machines has been impressive, it is most interesting to understand (1) what advantages, if any, these FPGA architectures have over more conventional, programmable computing alternatives, (2) where these advantages can be profitably employed, and (3) what we can learn from this that is relevant to the design of new computer architectures. In this paper, we highlight the results of data collected from CMOS implementations of processors, FPGAs, and custom silicon. These observations are backed with simple models of device area and utilization. Together, we see that there is a significant potential performance/area advantage for these FPGAs and offer some intuition on where this potential can be exploited.

Keywords

reconfigurable components, configurable, configurable computing, VLSI, FPGA, instructions, area efficiency, Special-issue-configurable-computing99
I. Introduction

Configurable computers reign, or have reigned, as the fastest or most economical way to solve a number of problems:

- RSA: The "Programmable Active Memories" (PAM) machines built at INRIA and DEC PRL [1] held the distinction of achieving the fastest RSA decryption rate (600 Kb/s with 512b keys, and 185 Kb/s with 970b keys) on any machine [2], [3].
- DNA Sequence Matching: The Supercomputer Research Center's SPLASH [4] and SPLASH-2 [5] machines run DNA sequence matching over two orders of magnitude faster than contemporary MPPs and supercomputers (CM-2, Cray-2) and three orders of magnitude faster than the attached workstation (Sparcstation I).
- Signal Processing: Filters implemented on Xilinx and Altera components outperform DSPs and processors by an order of magnitude [6], [7].
- Emulation: Modern microprocessors are verified using FPGA-based emulation systems [8], [9].
- Cryptographic Attacks: Collections of FPGAs offer the highest-performance, most cost-effective programmable approach to breaking difficult encryption algorithms [10], [11].

While these achievements are impressive, by themselves they don't tell us why these machines were so successful compared to their RISC and DSP processor counterparts. Are there inherent advantages in these FPGA architectures, or are these flukes of technology and market pricing? Can we expect these effects to increase, decrease, or remain the same as technology advances? Where can these devices be profitably employed? What does this suggest for future, post-fabrication programmable devices?
Author is currently with the University of California at Berkeley. Portions of this work were done while the author was at MIT.



A careful review of CMOS implementations of processor, DSP, and FPGA architectures suggests that the allocation of silicon resources in these architectures does, in fact, offer a potential computational density advantage to the FPGA architectures. That is, we can get more raw computations per unit area-time out of the FPGA devices. A simple area model confirms that the empirical observations about FPGA and processor implementations are not simply an artifact, but rather a direct effect of architectural resource allocation. Coupled with a simple usage model, we can build intuition on how the architectural differences impact the efficiency of each architecture under general application characteristics. What this data and modeling tell us is that these applications do achieve their remarkable performance for good reasons. It also points the way to a rich architectural space which can be mined to fuel our continual search for greater device performance and economy. In the next section, we briefly review FPGA architecture, define configurable computing, and review the CMOS area-time metrics we will be using as a base for the empirical comparisons. Section III makes the empirical comparison between FPGAs and RISC processors. Section IV builds an area model and uses it to compare the areas of these architectures and understand their realms of efficiency. Section V extends the empirical and modeling comparison to include specialized functional units, focusing on multiplication hardware. Section VI looks at what these observations may suggest for future computer architectures.
II. Background

A. Field-Programmable Gate Arrays

An FPGA is an array of post-fabrication programmable bit processing elements. Most traditional FPGAs use small lookup tables (LUTs) to serve as the programmable computational elements. These LUTs are wired together using programmable interconnect, which actually accounts for most of the area in each FPGA cell (see Figure 1). Four-input lookup tables (4-LUTs) are used for the programmable processing elements in many commercial devices due to their area-efficiency [12].
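For concreteness, a 4-LUT is simply a 16-entry truth table indexed by its four inputs; any function of four bits can be programmed by loading the right 16 configuration bits. The sketch below is an illustrative Python model, not any vendor's configuration format:

    # A 4-LUT is a 16-entry truth table indexed by its four input bits.
    # Illustration: configure a LUT as the XOR of four inputs, then
    # evaluate it the way the hardware does, by a table lookup.

    def make_lut(func):
        """Build the 16 configuration bits for a 4-input boolean function."""
        return [func(a, b, c, d) & 1
                for a in (0, 1) for b in (0, 1)
                for c in (0, 1) for d in (0, 1)]

    def eval_lut(config, a, b, c, d):
        """Index the truth table with the four inputs."""
        return config[(a << 3) | (b << 2) | (c << 1) | d]

    xor4 = make_lut(lambda a, b, c, d: a ^ b ^ c ^ d)
    assert eval_lut(xor4, 1, 0, 1, 1) == 1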
Fig. 1. LUT-based FPGA Caricature. (Labeled components: LUT and flip-flop as the active logic, configuration memory, and interconnect.)

Most of the examples given in the introduction use Xilinx XC4000 [13] or Altera A8000 [14] components as their primary computational workhorse. These commercial architectures have several special-purpose features beyond the general model above (e.g. carry chains for adders, memory modes, shared bus lines) but are basically 4-LUT based.
B. Configurable Computing

Computing with FPGAs has been termed configurable computing, but what is it that really distinguishes these devices and their application from more conventional processors or custom silicon?


Fig. 2. Spatial versus Temporal Computation for the expression y = Ax^2 + Bx + C. (Spatial: a wired tree of multipliers and adders computing directly on A, B, C, and x; temporal: a single ALU sequenced over temporaries t1 and t2.)
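As a concrete reading of Figure 2 (an illustrative Python sketch; the paragraph below develops the architectural point): evaluated temporally, one shared ALU executes five instruction slots in sequence; evaluated spatially, every operator is its own dedicated, always-active unit, so a new x can enter the pipeline each cycle.

    # Temporal: one ALU, reused across cycles under instruction control.
    def temporal_eval(A, B, C, x):
        t1 = x * x       # cycle 1: multiply
        t1 = A * t1      # cycle 2: multiply
        t2 = B * x       # cycle 3: multiply
        t1 = t1 + t2     # cycle 4: add
        return t1 + C    # cycle 5: add -- five issue slots on one shared unit

    # Spatial: each operator is dedicated hardware; all stages run every cycle.
    def spatial_eval(A, B, C, xs):
        stage1 = [(x * x, B * x) for x in xs]            # two multipliers in parallel
        stage2 = [(A * xx, bx) for (xx, bx) in stage1]   # third multiplier
        return [axx + bx + C for (axx, bx) in stage2]    # adders

    assert temporal_eval(1, 2, 3, 4) == 27
    assert spatial_eval(1, 2, 3, [4]) == [27]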

Like processors, FPGAs can be configured after fabrication to solve virtually any computational task.¹ This non-permanent, post-fabrication customizability is what distinguishes processors and FPGAs from custom silicon, whose operation is set at fabrication time and which can implement only a single or very small range of functions. Unlike processors, the primitive computing and interconnect elements in FPGAs hold only a single instruction within the computing device.² Without undergoing a lengthy reconfiguration process, the FPGA resources can be reused only to perform the same operation from cycle to cycle. This limitation to a single instruction provides an area advantage at the cost of restricting the domain of efficient operation, as we will see in Section IV. In these configurable devices, tasks are implemented by spatially composing primitive operations and operators rather than temporally composing them as in traditional processors (see Figure 2).

Configurability is a property which a device can have in varying degrees, from dedicated floating-point adders that have mode bits to "configure" the rounding mode, all the way up to these FPGAs, which can be wired to compute any function which can be implemented with a limited number of gates and state elements. Similarly, there is a large range of instruction management schemes between traditional FPGA devices, where computations are wired up completely spatially, and traditional uniprocessors, where computations are issued entirely sequentially. Nonetheless, we tend to identify the end of the post-fabrication programmable device spectrum where computations are realized by configuring interconnect between programmable function units to wire up computations spatially as configurable computing (see Figure 3).

C. CMOS Area and Time Metrics

In order to assess the costs of architectures, we need to understand their area and timing requirements. Since CMOS is widely used for FPGAs, processors, and custom silicon, we will take advantage of the common medium in order to make reasonable, cross-architecture area comparisons.

C.1. Area. Lambda (λ) is the typical unit used to characterize a MOS VLSI process. One lambda is defined as half the minimum drawn feature size on a process. Typically, processes are named by the minimum transistor channel width, and lambda is half this value.
¹ Both processors and FPGAs have the problem that their finite memory and instruction resources prevent them from really solving any problem.
² We use the term instruction broadly here to refer to the set of bits which control the operation of the post-fabrication programmable device.


Fig. 3. Custom vs. Configurable vs. Programmable Architecture Points. (Architectures plotted by operation binding time, from fabrication through one-time programming, boot time, and every thousands or millions of cycles, to every cycle, and by spatial versus temporal computation: custom silicon, FPGA/configurable architectures, and VLIW/superscalar/RISC programmable architectures.)

So a 1.0 μm CMOS process would have λ = 0.5 μm. As long as all features shrink uniformly [15], a die or macro will occupy the same λ² area as feature geometry shrinks. For example, in [16], Intel describes 0.8 μm and 0.6 μm implementations of the Pentium. The 0.8 μm die is 284 mm², or 284.4×10⁶ μm² / (0.4 μm/λ)² = 1.78 Gλ², while the 0.6 μm die is 163 mm², or 163×10⁶ μm² / (0.3 μm/λ)² = 1.81 Gλ².

Processes, of course, have many different parameters, and not all processes with the same feature size are equivalent. Different numbers of metal layers at the same feature size will, for example, yield different feature densities. Nonetheless, processes are generally optimized in order to make densities track feature size (e.g. [17], [18]). As such, lambda-normalized area generally gives a good estimate of the area required to implement a die or macro, usually within 20%.
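The normalization is easy to mechanize. A small helper (a sketch; the two calls reproduce the Pentium figures quoted above) converts a die size in mm² on a given process to λ²:

    # Normalize a die area to lambda^2; lambda is half the drawn feature size.
    def die_area_in_lambda2(area_mm2, feature_um):
        lam_um = feature_um / 2.0    # e.g. a 0.8 um process has lambda = 0.4 um
        return (area_mm2 * 1e6) / (lam_um ** 2)    # 1 mm^2 = 10^6 um^2

    # The two Pentium dies land within about 2% of each other:
    print(die_area_in_lambda2(284.4, 0.8) / 1e9)   # ~1.78 G-lambda^2
    print(die_area_in_lambda2(163.0, 0.6) / 1e9)   # ~1.81 G-lambda^2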
C.2. Time. As feature sizes shrink, intrinsic delays also shrink. If voltage is scaled along with λ, gate delays will also scale with λ. However, wire transit times, which are becoming an increasingly significant fraction of cycle time, are not scaling down at this rate. Consequently, the cycle time reduction associated with feature size scaling is not as clean as the area scaling. This effect is exacerbated by the fact that device voltages remained at 5V for many technology generations and then made a discrete jump to 3.3V, rather than scaling along with feature size. For example, consider the aforementioned Pentium shrink. The 0.8 μm version ran at 66 MHz on a 5V supply, while the 0.6 μm version ran at 100 MHz on a 3.3V supply; the 25% reduction in feature size, accompanied by a 33% reduction in supply voltage, gave a 33% reduction in cycle time. When comparing technologies with close feature sizes (10-20%), the speed difference is usually in the noise for this level of comparison, and absolute times can be used. When comparing large differences in technology (e.g. a factor of two or more in feature size), it is worthwhile to be aware of potential differences in device speeds.


III. Empirical Computational Density

To begin to address the question of whether or not there is an advantage for configurable architectures (e.g. FPGAs) versus programmable architectures (e.g. RISC processors), let us look at the peak computation which each of them can deliver on a cycle and normalize that to the implementation area and the cycle time. That is, we look at the computation that the device can produce per unit area-time. We know that this never tells us the full story about a device. Resource balances and resource and data dependencies may cause the device to yield below this peak on particular applications. Nonetheless, it does give us some idea of the capabilities of the devices. To make a head-to-head comparison, we define the basic unit of computation as one bit-op: the amount of computation which one bit of the ALU performs on one cycle (e.g. add, subtract, and, or). Further, we equate one bit-op to two 4-LUT operations on the FPGA. For add/subtract operations, the FPGA will require one 4-LUT to compute the sum output and one for the carry, so this is a fair comparison.³ For logical operations, each FPGA 4-LUT can often perform more than one ALU bit operation, so this underestimates the FPGA computational power. Note also that the processor ALU performs many operations which are effectively interconnect operations (e.g. shift, move) and would be implemented in the configurable interconnect on the FPGA. We omit, for the moment, any hardwired computational units on the processor which can only be used for very specialized purposes (e.g. multipliers, floating-point units); we will return to these in Section V. Rolling this together, we compute the computational density of a device as:


Processor Computational Density = (N_ALU × w) / (A × t_cycle)    (1)

FPGA Computational Density = (2 × N_4LUTs) / (A × t_cycle)    (2)

where A is the die area, w is the width of the processor ALU (e.g. 32, 64), and N_ALU is the number of concurrently executing ALUs in the processor. For the processor, t_cycle is simply the clock cycle time between instructions. For the FPGA, it is the minimum cycle time for a logic block, flip-flop, and minimal interconnect. This metric is useful in the following manner:

- If we have an area limit (e.g. number of components in a board or system, λ² of die area), the device or architecture with the higher computational density will pack more performance into the fixed area: Processing Capacity = Area × Computational Density.
- If we have an application with a high computational throughput requirement, we will need fewer components (less die area, smaller time-slices on the compute resources) from the device or architecture with the higher computational density: Area Required = Target Throughput / Computational Density.
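In code, using the 0.8 μm Pentium figures from Section II and a hypothetical FPGA of the same era (the FPGA's LUT count, die area, and cycle time below are illustrative assumptions chosen to be order-of-magnitude plausible, not measured values):

    # Peak computational density per equations (1) and (2),
    # in ALU bit-ops per lambda^2-second.
    def processor_density(n_alu, w, area_l2, t_cycle_s):
        return n_alu * w / (area_l2 * t_cycle_s)

    def fpga_density(n_4luts, area_l2, t_cycle_s):
        return 2 * n_4luts / (area_l2 * t_cycle_s)  # one bit-op = two 4-LUT ops

    # 0.8 um Pentium: two ALUs, 32b wide, 1.78 G-lambda^2 die, 66 MHz.
    p = processor_density(n_alu=2, w=32, area_l2=1.78e9, t_cycle_s=1 / 66e6)
    # Hypothetical FPGA: 1152 4-LUTs, 2.5 G-lambda^2 die, 30 ns cycle.
    f = fpga_density(n_4luts=1152, area_l2=2.5e9, t_cycle_s=30e-9)
    print(p, f, f / p)  # ~2.4 vs ~31 bit-ops/lambda^2-s: about an order of magnitude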
³ Actually, two 3-LUTs would be adequate, but when 4-LUTs are the basic primitive there is no general way to perform the operations with fewer than two 4-LUTs per bit. Further, most traditional FPGAs have sufficient carry support to handle one adder/subtractor bit per 4-LUT output, so a 1-to-1 mapping between ALU bits and FPGA 4-LUTs might be a reasonable equivalence.


Fig. 4. Computational Density Comparison of Processors and FPGAs. (Scatter plot of peak computational density, in ALU bit-ops/λ²s, versus technology λ, for SRAM-based FPGAs and RISC processors.) Data shown is taken from published clock rates, device organization, and published and measured die sizes [22]. ALU bit-ops/λ²s is the density of operations per unit area-time. Area is normalized by the technology feature size (λ is half the minimum feature size). Time is given in seconds, an unnormalized unit, since several small-feature effects prevent delay scaling from being a simple function of feature size.

The results of computing this computational density metric for the last 10-20 years of CMOS processor and FPGA implementations are shown in Figure 4. We see that FPGAs can perform roughly 10× the number of raw bit-ops per unit area-time as their microprocessor counterparts. The use of shallow (single) instruction memories saves the FPGA area versus the processor and allows the FPGA to economically control its computational resources at the level of a single bit operation, whereas processors control word-wide datapaths, typically in units of 16, 32, or 64 bits at a time. The granularity difference provides a potential second order of magnitude processing density advantage for the FPGAs when computing with narrow data values. Note that while vector extensions to modern processors (e.g. VIS [19], MMX [20]) allow segmented operation (i.e. SIMD multigauge [21]) on wide datapaths, the processor is limited to performing the same operation across the entire wide datapath, while the FPGA can perform different operations on each bit. For regular computations, those which may employ the same set of operators applied to a large amount of data with little data-dependent flow, the FPGAs can yield a high fraction of this peak computational density. From this data we see that FPGAs have an order of magnitude more raw computational power per unit area than conventional processors, along with the potential for a second order of magnitude when operating on narrow data items. This trend has remained true for several process generations. As we get more silicon on the processing die, both the FPGAs and the processors have been able to turn the larger dies into commensurately greater raw computational power, but the densities, and the gap between them, remain.
IV. Architectural Space and Modeling

Configurable architectures have a very different balance between active resources and memory than traditional processor architectures. This section builds a model of computing device area in order to provide insight into the tradeoffs among computational density, instruction density, and fine-grained control. This unified model highlights the relations between configurable and programmable architectures and their respective strengths and weaknesses. In this section, we focus on device area, assuming that clock cycles in the same technology will be comparable between the processor and the FPGA, an assumption which is quite manageable when both are pipelined similarly.



A. Model

Figure 5 shows the composition of a generic computational block. The block represents the fraction of a computing device independently controlled by a single collection of instruction bits on one cycle. For the sake of clarity, we use the term pinst (primitive instruction) to refer to the set of bits which control one compute block and its associated interconnect and state management on one cycle of operation. One can think of a typical RISC instruction as a pinst. Equivalently, the field of a VLIW instruction which controls one functional unit may be considered a pinst. In general, a collection of these compute blocks will be tiled to form a complete device. Each computational block contains:

- Bit processing units (e.g. ALU slice, lookup table (LUT))
- Interconnect
  - Spatial data routing (e.g. switching network, bus)
  - Temporal data routing (e.g. registers, FIFO, D-cache)
- Instruction memory (e.g. I-cache, configuration memory)

This computational block is parameterized by the number of pinsts stored in local memory (also known as the context depth), c, and the number of bit processing elements controlled by a single pinst in SIMD fashion, w. Additionally, the block may be characterized by the interconnect richness, p, the Rent parameter [23]. We will assume p = 0.5 uniformly in this article in order to focus on instruction effects, but variations in p also have a large effect on area and hence computational density. A traditional FPGA, for instance, is characterized by a pinst depth of one (c = 1) and a datapath width of one (w = 1). Setting c = 1024 and w = 64, the model provides a caricature of a traditional microprogrammed processor. Figure 6 shows how this computational block is organized and tiled for an 80Mλ² processor and FPGA die, highlighting the differences in area distribution between the architectures. We can also measure the number of unique instructions per unit area and so determine the instruction density of the architecture. When we are trying to place a large computation in a small amount of area, we often do so by reusing computation and interconnect resources to perform multiple functions. We see in Figure 6 that while the processor packs fewer compute bits into the available area, it packs many more instructions into the same area than the FPGA does. Using the values from Figure 5, we can compute:

- A_bit-elm(c, w), the normalized area per bit-op under various instruction organizations (computational density)
- A_inst(c, w), the normalized area per pinst under each organization (instruction density)

Figure 7 plots the relative pinst and computational densities as a function of c and w. As we expect, wide-word architectures with one pinst per compute block are the most computationally dense. Computational density drops off as we go to finer granularity and greater pinst depths. What is interesting to see here is the magnitude of the effect which instruction and data contexts have on computational density. While an instruction is moderately small (64 × 1200λ² ≈ 77Kλ²) compared to active computation and interconnect (500K to 1Mλ²), it is not completely trivial. When c/w > 100 (there are more than 100 instructions per datapath bit), the area is completely dominated by the instruction area. If, as is typical for balance, the data memory depth (d) is increased along with the instruction memory, this point comes much earlier. That is, the two orders of magnitude difference in computational density we see in Figure 7 comes largely from the ratio of memory to active area.
Table I compares model predictions directly against select implementations, showing that the model does predict device areas reasonably well. Complementary to computational density, pinst density increases as pinst depth is increased and as datapath width (w) is increased. It is this effect which has traditionally been used to fit a large computation onto a limited amount of silicon area. Note here that the instruction density advantage will flatten out and saturate once instruction and data memory area dominate active compute and interconnect area (c/w > 100, as noted above). Because of the low bandwidth to off-chip instruction


(Block diagram: an instruction memory of pinst depth (contexts) c feeding bit processing units of datapath width w through interconnect of Rent parameter p.)

A_block = w × A_compute-bit + w × N_SW(N_p, w, p) × A_SW + c × n_ibits × A_mem-cell + w × d × A_mem-cell    (3)

where:

Parameter       Function                             Assumed Value
A_compute-bit   ALU/LUT compute area                 20Kλ²
N_p             Number of bit elements on device     16K
p               Rent exponent                        0.5
N_SW            Number of switches                   30 log2(N_LUT/w) - 50 (simplification only valid for p = 0.5 and N_LUT/w > 32)
A_SW            Area per switch                      2500λ²
n_ibits         Number of bits per pinst             64
A_mem-cell      Memory cell area                     1200λ² (static)
d               Data bits stored per compute bit     c (one per instruction)

See [22] for further details on the assumptions used here.

Fig. 5. Composition and Area for a Generic Computational Block
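Equation (3) transcribes directly into code with the assumed values from Figure 5. This is a sketch: the published model [22] includes refinements (such as the dependence of compute area on LUT size k) that are omitted here, so the results land near, but not exactly on, the model rows of Table I.

    import math

    # Assumed constants from Figure 5 (areas in lambda^2).
    A_COMPUTE_BIT = 20e3      # ALU/LUT compute area per bit element
    A_SW = 2500               # area per interconnect switch
    A_MEM = 1200              # static memory cell area
    N_IBITS = 64              # bits per pinst
    N_P = 16 * 1024           # bit processing elements on the device

    def n_sw(w):
        # Switches per bit element; valid only for p = 0.5 and N_LUT/w > 32.
        return 30 * math.log2(N_P / w) - 50

    def a_block(c, w, d=None):
        d = c if d is None else d     # default: one data bit per instruction
        return (w * A_COMPUTE_BIT         # compute
                + w * n_sw(w) * A_SW      # spatial interconnect
                + c * N_IBITS * A_MEM     # instruction (pinst) memory
                + w * d * A_MEM)          # data memory

    # Area per bit element, to compare against Table I's model rows:
    print(a_block(c=1, w=1))                 # ~1.0M lambda^2 (Table I model: 880K)
    print(a_block(c=1024, w=32, d=32) / 32)  # ~3.1M lambda^2/bit (Table I model: 2.6M)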

TABLE I
Spot Check: Area Model versus Empirical Data

Architecture   Datapoint                                 Area
FPGA           model (w = 1, d = c = 1, k = 4)           880Kλ²
               Xilinx 4K                                 630Kλ²
               Altera 8K                                 930Kλ²
SIMD           model (w = 1000, c = 0, d = 64, k = 3)    170Kλ²
               Abacus [24]                               190Kλ²
Processor      model (w = 32, d = 32, c = 1024, k = 2)   2.6Mλ²
               MIPS-X [25]                               2.1Mλ²

Fig. 6. Area Proportions on Processor and FPGA. (Left: a 32b processor, showing control (PC, branch), configurable interconnect (register file), processing elements (ALU/EU), and context memory (I-store). Right: a 14×10 FPGA array, showing processing elements, interconnect, and context memory.)

Fig. 7. Raw Densities as a Function of Instruction Depth (c) and Datapath Width (w). (Two surface plots over context depth c from 1 to 1024 and word width w from 1 to 128: relative computational density and relative pinst (instruction) density.)



resources and the limited size of early VLSI processors, increasing pinst depth has been useful to fit an entire task, or at least the computationally intensive portion of a task, on the available silicon die area. Instructions and data contexts are not free. They are smaller than active compute and interconnect resources, and this creates the interesting design space we see here, with two orders of magnitude variance in computational density.

B. Efficiency

These peak densities tell us what the architecture can provide when the task requirements match the architectural assumptions. If the task requires the native manipulation of small data words on a large-word machine, we will yield only a fraction of that peak. For example, a 32-bit architecture processing 8-bit data items would realize only one fourth of its peak processing power. Similarly, a task dominated by a loop with a critical path of 8 sequential instructions running on an architecture with a pinst depth of 2 might require 4× more processing elements than are actually useful to run the task; here, too, only a fraction of the peak performance can be extracted. These two effects combine, along with others such as limited interconnect, control, and I/O bandwidth, to determine the fraction of the peak density which is actually yielded to an application. We can use the area model to take an idealized look at the efficiency of an architecture running tasks with mismatched requirements. If w_task is the native task data width and l is the task sequential path length (or 1/l is the task throughput requirement), then the efficiency is the ratio of the area required to support the task in a matched architecture to the area required in a particular mismatched architecture with parameters c and w:

Efficiency = A_block(c_m = l, w_m = w_task) / (⌈l/c⌉ × ⌈w_task/w⌉ × A_block(c, w))    (4)

Figure 8 shows this two-dimensional slice⁴ of efficiency across these two variables for two architectural points. On the left we have the traditional FPGA design point (c = w = 1); on the right we have the processor caricature (c = 1024, w = 64). Note that, just across the application space marked out for this plot, both architectures vary over two orders of magnitude in efficiency. The FPGA is most efficient on bit-level operations with no critical path (allowing full pipelining), while the processor is most efficient for word-wide operations with very long path lengths (where it must execute a sequence of a thousand instructions, touching comparable data, before the computation repeats). These architectures represent opposite design points within this region of architecture space: where one is efficient, the other is less than 1% efficient. Even this small window onto two aspects of applications and architectures suggests that the space of computational organizations is large. Within this space, efficiency can easily vary by orders of magnitude, and we see that neither of these architectures can robustly handle the entire space.

C. Model Limitations

Of course, this comparison is very idealized. We have held interconnect richness fixed, which could account for another order of magnitude in density differences across plausible architectural and application space, and we have also ignored control granularity. However, this model gives us an intuitive picture of the architectural space, its magnitude, and the tradeoffs involved. The raw densities agree with the empirical data we saw in Figure 4 and Table I.
⁴ Additional dimensions come from other architectural features, such as interconnect richness (p), which we hold constant in this comparison.

Fig. 8. Modeled Efficiency of Configurable and Programmable Design Points. (Two surface plots of efficiency, 0 to 1.0, over design width w from 1 to 128 and path length l from 1 to 1024: left, the c = 1, w = 1 "FPGA" point; right, the c = 1024, w = 64 "Processor" caricature.)
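Equation (4) can be evaluated at the two design points to check Figure 8's cross-point claim. This idealized calculation reuses a_block() from the sketch after Figure 5, with the same caveats:

    import math

    def efficiency(c, w, l, w_task):
        """Eq. (4): matched-architecture area over area actually deployed."""
        matched = a_block(c=l, w=w_task)   # one block sized exactly to the task
        blocks = math.ceil(l / c) * math.ceil(w_task / w)
        return matched / (blocks * a_block(c, w))

    # Each architecture evaluated at the other's sweet spot:
    print(efficiency(c=1, w=1, l=1024, w_task=64))   # FPGA point: under 1%
    print(efficiency(c=1024, w=64, l=1, w_task=1))   # processor point: under 1%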

V. Specialized Functional Units

In the previous section, we focused on the use of generic processing elements such as ALUs and LUTs. In practice, modern microprocessors regularly include specialized, hardwired functional units such as multipliers, floating-point units, and graphics coprocessors. These specialized processing units provide a greater effective computational density when called upon to perform their respective tasks and provide little or no computational density when different operations are needed. Since the area per bit operation in these specialized units can often be 100× smaller than the amortized area of a bit operation in a generic datapath, it is worth including such specialized functions when they will be used sufficiently often.

A. Example: Hardware Multiplier

A hardwired multiplier is often one of the first specialized units to be added to a processor architecture. The inclusion of a hardwired multiplier is one of the primary architectural features of a digital signal processor (DSP). Given their regularity and importance, multipliers are one of the most heavily optimized computational building blocks, and therefore they may serve as an extreme example showing the computational density of a hardwired unit compared to its configurable and programmable counterparts. Table II compares several 16×16 multiply implementations. The hardwired multiply is two orders of magnitude denser than the configurable (FPGA) implementation and 3-4 orders of magnitude denser than the programmed processor implementation. Notice that the DSP achieves about the same multiply density as the FPGA, despite the fact that it includes a hardwired 16×16 multiplier. This density dilution is easy to understand when we realize the hardwired multiplier added to the DSP is small compared to the rest of the programmed processor in the DSP. The processor area, which is 100× that of the multiplier, is responsible for the two orders of magnitude difference in overall performance density between the multiplier alone and the DSP. Notice that the dilution seen above in the DSP is a general issue whenever we couple a hardwired function in a flexible way into an otherwise general-purpose computational element. A reasonable piece of interconnect allowing the multiply block to be flexibly allocated within a large computation is easily twice the area of the 3Mλ² multiply block itself, dropping the density to one-third the density



TABLE II
Comparison of 16×16 Multiplier Implementations

Style      Design                                               λ (μm)  Area (λ²)  Time     Area-Time (λ²s)  Ratio
Custom     [26]†                                                0.63    2.6M       40 ns    0.104            1.0
FPGA       XC4K [13], [27]: 88 CLBs at 1.25Mλ²/CLB,
           16 cycles at 7.5 ns/cycle                            0.6     110M       120 ns   13.2             130
DSP        [28]                                                 0.65    350M       50 ns    17.5             170
Processor  PA-RISC [29], [30]: 44 cycles at 66 ns/cycle         0.75    125M       2904 ns  363              3500

† [22] surveys a large number of multiplier implementations. This one was chosen as representative because it is the most dense 16×16 multiplier and it is implemented in a feature size most comparable to the other data points.
of the multiplier in isolation. If we also add 1024 configuration contexts (instructions to control data routing to the multiplier and to hold intermediate data), the instruction and data memory dominate even the multiplier and switching area, causing density to drop by an additional factor of five, to about 6% of the density of the hardwired multiplier in isolation:

A_cmem = 16 bits/pinst × 1024 contexts × 1200λ²/bit ≈ 20Mλ²    (5)

A_dmem = 16 bits/word × 1024 data words × 1200λ²/bit ≈ 20Mλ²    (6)

A = A_cmem + A_dmem + A_mpy + A_interconnect ≈ 49Mλ²    (7)
Nonetheless, if the hardwired unit is 1000× as dense as the programmed version and can be used frequently, the addition can still represent a substantial net increase in computational density, as we saw comparing the DSP to the processor.

B. Mismatches

The greater performance density provided by a hardwired unit may, however, be undermined in two ways:

1. Lack of use: when a hardwired unit is not needed, it takes up space without providing any capacity; in the extreme case of no use, its inclusion diminishes computational density.
2. Overly general: when a hardwired unit solves a more general problem than needed at a particular time, the density benefit is diminished.

There is a tension between these effects. To make sure a hardwired unit can be used as much as possible, we tend to generalize it. However, the more it is generalized, the less optimized it is for solving a particular problem, and the lower the advantage compared to a configurable solution. Consider adding the 3Mλ² 16×16 multiplier to the 125Mλ² processor from Table II. If every operation is a 16×16 multiply, the computational density is increased by roughly 43× (44 × 125Mλ²/128Mλ²). If no operation is a multiply, computational density is decreased by about 2% ((128M - 125M)λ²/128Mλ²). Breakeven occurs when roughly 1300 non-multiply operations occur for each multiply operation (a calculation sketched below). The 16×16 multiply could also be too general in a number of ways. For instance, an application could:

1. require a smaller multiply (e.g. 8×12),
2. be multiplying by a constant value, or
3. require only a limited-precision result.
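The breakeven figure can be checked in a few lines using the Table II numbers. With these inputs the crossover lands in the same neighborhood as the quoted 1300; the exact value depends on the cycle count assumed for the software multiply, so treat this as an estimate:

    # Breakeven for adding a 3M-lambda^2 multiplier to a 125M-lambda^2 processor.
    # Per Table II, a software multiply costs 44 cycles; the hardwired unit, 1.
    A_BASE, A_WITH = 125e6, 128e6
    MUL_SW, MUL_HW = 44, 1

    def density(n_nonmul, area, mul_cycles):
        """Operations per (area x cycles) for n non-multiplies plus one multiply."""
        return (n_nonmul + 1) / (area * (n_nonmul + mul_cycles))

    # Scan for the mix where the larger die stops paying for itself.
    n = 1
    while density(n, A_WITH, MUL_HW) > density(n, A_BASE, MUL_SW):
        n += 1
    print(n)   # ~1800 non-multiplies per multiply with these inputs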


TABLE III
Multiply Comparisons

Area-Time (λ²s)
Style      16×16   16×16b-constant   8×8    8×8b-constant
Custom     0.104   0.104             0.104  0.104
FPGA       13.2    4.2               3.3    0.69
DSP        17.5    17.5              17.5   17.5
Processor  363     57.8              198    33

Ratio to Custom
Style      16×16   16×16b-constant   8×8    8×8b-constant
Custom     1       1                 1      1
FPGA       130     41                32     6.6
DSP        170     170               170    170
Processor  3500    560               1900   320

For example, [30] describes how to perform specialized multiplies on the PA-RISC processor, and [31] describes them on the Xilinx 4K FPGA. Table III highlights how limited data sizes and constant values reduce the computational density benefit of the hardwired multiplier. Just looking at these examples, we see the density benefit drop by an order of magnitude. While specialized units can be used to boost computational density on specific tasks, the raw density of the specialized unit can be quickly diluted by (1) the overhead of coupling it into a general-purpose flow and (2) mismatches with application requirements. In many cases, employing configurable computing architectures will provide similar performance density boosts over programmable architectures without requiring a priori decisions on which specialized units to include.

C. Large Specialized Unit Example: Finite Impulse Response

As we saw in the multiply example above, a small 16×16 multiplier block is only half the size of the programmable interconnect required to use it. With deep instruction memory, the block area can be completely dominated by instruction and data storage. To avoid diluting the high density of special-purpose blocks, we could look at integrating larger specialized blocks. However, as we go to larger specialized blocks, the aforementioned mismatch effects can play an even larger role in diluting their benefit. As an example, Table IV compares several finite impulse response (FIR) filter implementations. A k-tap FIR is a convolution of an input stream of samples {x_1, x_2, ...} against a set of k fixed coefficients {w_1, w_2, ..., w_k} to produce a series of filtered outputs {y_1, y_2, ...}:

y_i = w_1 x_i + w_2 x_{i+1} + ... + w_k x_{i+k-1}

While the "full custom" implementations with programmable coefficients are 50-200× denser than the programmable processor implementations, they are only 1-2× denser than the configurable designs. The configurable designs can be specialized around the filter coefficients, as shown in the previous section, narrowing the 100× native hardwired gap which they might pay if required to implement exactly the same architecture as the custom silicon, rather than simply the same computation.
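The coefficient specialization at work here is easy to illustrate: a multiply by a known constant collapses to a few shifts and adds, exactly the kind of structure a configurable array can instantiate directly. The sketch below is the textbook shift-and-add decomposition, not the specific mapping of [31]:

    # Specializing a multiply around a known coefficient: y = 10*x becomes
    # (x << 3) + (x << 1), one add instead of a general multiplier.
    def constant_mul_plan(coeff):
        """Shift amounts whose summed powers of two equal coeff."""
        return [i for i in range(coeff.bit_length()) if (coeff >> i) & 1]

    def constant_mul(x, coeff):
        return sum(x << s for s in constant_mul_plan(coeff))

    print(constant_mul_plan(10))     # [1, 3]: two shift taps, one add
    assert constant_mul(7, 10) == 70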



TABLE IV
FIR Survey: 8b Sample, 8b Coefficient

Architecture                 Reference    λ (μm)  Area and Time                              Area-Time/TAP (λ²s)
32b RISC                     [29], [30]   0.75    125Mλ², 66 ns/cycle, 6+ cycles/TAP         50
16b DSP                      [28]         0.65    350Mλ², 50 ns/TAP                          17.5
32b RISC/DSP                 [32]         0.25    1.2Gλ², 40 ns/TAP                          46
64b RISC                     [33]         0.18    6.8Gλ², 2.3 ns/TAP                         16
XC4K                         [34]         0.6     240 CLBs at 1.25Mλ²/CLB, 14.3 ns/8 TAPs    0.54
Altera 8K                    [7]          0.3     30 LEs at 0.92Mλ²/LE, 10 ns/TAP            0.28
Full Custom                  [35]         0.75    400Mλ², 45 ns/64 TAPs                      0.28
Full Custom                  [36]         0.6     140Mλ², 33 ns/16 TAPs                      0.28
Full Custom                  [37]         0.75    82Mλ², 50 ns/10 TAPs                       0.41
Full Custom (fixed coeff.)   [38]         0.6     114Mλ², 6.7 ns/43 TAPs‡                    0.018

‡ 16b samples

Notice that the fixed-coefficient custom filter does exhibit a 15-30× advantage over the configurable implementations. This further demonstrates that it is this coefficient specialization which allows the FPGA implementations to narrow the performance density gap. This example emphasizes that it is hard to achieve robust, widely applicable density improvements with a larger specialized block. The FIR itself is a rather specialized block even when the coefficients are programmable, but the programmability leaves it without a clear advantage over configurable solutions which can be specialized around the programmed coefficients.

VI. Future Programmable Architectures

In the past, in order to handle reasonably sized computational problems, it was necessary to use deep context memories to fit entire computations onto a single VLSI die. Today, as continuing advances in silicon technology give us more area on each silicon die, this premium is reduced. It is no longer necessary to so heavily reuse active processing in order to fit interesting computing tasks onto the available real estate. Rather, as we have seen, once the recurring portion of a task fits onto the available silicon, there is a significant computational density advantage to be gained by directly configuring the operations and dataflow spatially on a configurable substrate. As the available silicon continues to grow, more computational problems can be fit using spatial dataflow, so configurable architecture techniques will become increasingly important. The major effects favoring configurables which we have seen here are (1) density and (2) operation granularity. The density advantage came primarily from a lower ratio of instructions and data to active processing and interconnect. As application instructions and data fit onto the die, processor designs may also vary their memory-to-compute ratio, opting for more computation. As this happens, the "processor" will gain numerous computational units, beyond the 4-8 we see with today's superscalar and VLIW architectures. If this happens, processors will have hundreds of processing units and require significant spatial dataflow; that is, they will start looking more like today's configurable architectures. How processors address the granularity issue is less clear. The vector/multigauge architecture helps on some operations, but it is unclear if the low-precision SIMD model is rich enough to fully close the



gap. At the configurable computing extreme of one pinst per bit operation, the cost of bit granularity is moderately small (see the c = 1 curve in Figure 7): at this extreme, the memory area for a single instruction is still small compared to the active compute and interconnect resources. This is very different from the programmable end of the space, where instruction and data memory area dominates active compute and interconnect area. There is a large, intermediate space of architectures between the extremes of modern programmable processors and modern configurable architectures (FPGAs) which is worth exploring. Adding a few contexts to FPGAs (e.g. MIT's DPGA [39], [40] or Xilinx's time-multiplexed FPGA [41]) can make them more robust against irregular operations. Using wider datapaths and moderately shallow instruction and data contexts provides much of the configurable density advantage over DSPs for datapath operations without paying the full costs of bit-level granularity (e.g. UCB's PADDI architectures [42], [43] and UW's RaPiD [44]). Flexible instruction organization, along with flexible allocation of on-chip memory and bandwidth to instructions and data, may allow a single architecture to better achieve reasonable efficiency across the wide range of application characteristics we saw in Section IV (e.g. MIT's MATRIX [45]). Since many tasks have a mix of irregular and regular computing subtasks, a hybrid architecture which tightly couples arrays of mixed datapath sizes and instruction depths along with flexible control is another promising approach to achieving robust performance across an entire application. In the simplest case, such an architecture might couple an FPGA array into a conventional processor, allocating the regular, fine-grained tasks to the array and the irregular, coarse-grained tasks to the conventional processor. We saw in Figure 8 that the processor and FPGA are less than 1% efficient at their cross points. A hybrid device can split computations between the two, achieving the best characteristics of both architectures. Such coupled architectures are now being studied (e.g. Harvard's PRISC [46], UCB's GARP [47], Northwestern's Chimaera [48], National Semiconductor's NAPA [49]).

Configurable architectures offer a real advantage in terms of raw computational density and usable computational density on narrow data operations. The advantage comes at the cost of instruction density, making configurable architectures good for regular computations, where the same operations can be performed repeatedly and the major dataflow fits spatially on the available resources, but inefficient on tasks with long sequential dependencies which prevent tight pipelining. As feature sizes diminish, the gap between configurables and processors has remained. As long as processors maintain their current balance of computational resources, this trend will continue. As technology offers us more silicon on each die, more problems will fit directly into a spatial computational flow and can exploit the increased computational density of configurable architectures. Thus, the range of application is growing along with silicon capacity. Configurable architectures can still be significantly less dense (e.g. two orders of magnitude) than full-custom, dedicated computing units. However, once these custom units are coupled in a general way into a programmable or configurable component and generalized for broad use, much, if not all, of that density advantage can be lost.
VII. Summary

The space between FPGAs and traditional processors is large, as is the range of architectural efficiencies within this space. Growing die capacity opens up this larger space of architectures to the computer architect. Consequently, the modern architect needs to understand this landscape to build efficient computing devices for both domain-specific and general-purpose computing tasks.

Acknowledgments

Portions of this work were performed as part of the Reinventing Computing Project at MIT under the direction of Dr. Thomas F. Knight, Jr. The research was supported by the Defense Advanced Research Projects Agency of the Department of Defense under Rome Labs contract number F30602-94-C-0252. At Berkeley, the work is part of the Berkeley Reconfigurable Architectures, Software, and Systems (BRASS) effort under the direction of Prof. John Wawrzynek, supported by the Defense Advanced Research Projects Agency under contract number DABT63-96-C-0048.
References

[1] Jean E. Vuillemin, Patrice Bertin, Didier Roncin, Mark Shand, Hervé Touati, and Philippe Boucard, "Programmable active memories: Reconfigurable systems come of age," IEEE Transactions on VLSI Systems, vol. 4, no. 1, pp. 56-69, March 1996.
[2] Mark Shand and Jean Vuillemin, "Fast implementations of RSA cryptography," in Proceedings of the 11th Symposium on Computer Arithmetic, Earl Swartzlander Jr., Mary Jane Irwin, and Graham Julien, Eds., Los Alamitos, California, June 1993, pp. 252-259, IEEE Computer Society Press.
[3] Ernest F. Brickell, "A survey of hardware implementations of RSA," in Proceedings of Advances in Cryptology (CRYPTO '89), 1989, number 435 in Lecture Notes in Computer Science, pp. 368-370, Springer-Verlag.
[4] Maya Gokhale, William Holmes, Andrew Kopser, Sara Lucas, Ronald Minnich, Douglas Sweely, and Daniel Lopresti, "Building and using a highly programmable logic array," IEEE Computer, vol. 24, no. 1, pp. 81-89, January 1991.
[5] Duncan Buell, Jeffrey Arnold, and Walter Kleinfelder, Splash 2: FPGAs in a Custom Computing Machine, IEEE Computer Society Press, 10662 Los Vaqueros Circle, PO Box 3014, Los Alamitos, CA 90720-1264, 1996.
[6] Steve Knapp, Using Programmable Logic to Accelerate DSP Functions, Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124, March 1998, <http://www.xilinx.com/appnotes/dspintro.pdf>.
[7] Altera Corporation, 2610 Orchard Parkway, San Jose, CA 95134-2020, AN 73: Implementing FIR Filters in FLEX Devices, January 1996, <http://www.altera.com/document/an/an073_01.ps>.
[8] Joseph Varghese, Michael Butts, and Jon Batcheller, "An efficient logic emulation system," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 1, no. 2, pp. 171-174, June 1993.
[9] Michael Butts, "Future directions of dynamically reprogrammable systems," in Proceedings of the IEEE 1995 Custom Integrated Circuits Conference, May 1995, pp. 487-494.
[10] Ian Goldberg and David Wagner, "Architectural considerations for cryptanalytic hardware," CS252 Report, <http://www.cs.berkeley.edu/~iang/isaac/hardware/>, May 1996.
[11] Matt Blaze, Whitfield Diffie, Ronald L. Rivest, Bruce Schneier, Tsutomu Shimomura, Eric Thompson, and Michael Wiener, "Minimal key lengths for symmetric ciphers to provide adequate commercial security," online <http://www.bsa.org/policy/encryption/cryptographers_c.html>, January 1996.
[12] Jonathan Rose, Robert Francis, David Lewis, and Paul Chow, "Architecture of field-programmable gate arrays: The effect of logic block functionality on area efficiency," IEEE Journal of Solid-State Circuits, vol. 25, no. 5, pp. 1217-1225, October 1990.
[13] Xilinx, Inc., 2100 Logic Drive, San Jose, CA 95124, The Programmable Logic Data Book, 1994.
[14] Altera Corporation, 2610 Orchard Parkway, San Jose, CA 95134-2020, Data Book, March 1995.
[15] Robert H. Dennard, Fritz H. Gaensslen, Hwa-Nien Yu, V. Leo Rideout, Ernest Bassous, and Andre R. LeBlanc, "Design of ion-implanted MOSFETs with very small physical dimensions," IEEE Journal of Solid-State Circuits, vol. 9, no. 5, pp. 256-268, October 1974.
[16] Joseph Schutz, "A 3.3V 0.6μm BiCMOS superscalar microprocessor," in 1994 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, February 1994, pp. 202-203.
[17] Mark Bohr, "MOS transistors: Scaling and performance trends," Semiconductor International, pp. 75-79, June 1995.
[18] Mark Bohr, "Interconnect scaling: the real limiter to high performance ULSI," in International Electron Devices Meeting 1995 Technical Digest, December 1995, pp. 241-244.
[19] L. Kohn, G. Maturana, Mark Tremblay, A. Prabhu, and G. Zyner, "The visual instruction set (VIS) in UltraSPARC," in COMPCON '95 Digest of Papers, March 1995, pp. 462-469.
[20] A. Peleg, S. Wilkie, and U. Weiser, "Intel MMX for multimedia PCs," Communications of the ACM, vol. 40, no. 1, pp. 24-38, January 1997.
[21] Lawrence Snyder, "An inquiry into the benefits of multigauge parallel computation," in Proceedings of the 1985 International Conference on Parallel Processing, August 1985, pp. 488-492.
[22] Andre DeHon, "Reconfigurable architectures for general-purpose computing," AI Technical Report 1586, MIT Artificial Intelligence Laboratory, 545 Technology Sq., Cambridge, MA 02139, October 1996.
[23] B. S. Landman and R. L. Russo, "On a pin versus block relationship for partitions of logic graphs," IEEE Transactions on Computers, vol. C-20, pp. 1469-1479, 1971.
[24] Michael Bolotski, Thomas Simon, Carlin Vieri, Rajeevan Amirtharajah, and Thomas F. Knight Jr., "Abacus: A 1024 processor 8 ns SIMD array," in Advanced Research in VLSI 1995, 1995.
[25] Mark Horowitz, John Hennessy, Paul Chow, Glenn Gulak, John Acken, Anant Agarwal, Chorng-Yeung Chu, Scott McFarling, Steven Przybylski, Steven Richardson, Arturo Salz, Richard Simoni, Don Stark, Peter Steenkiste, Steven Tjiang, and Malcom Wing, "A 32b microprocessor with on-chip 2K-byte instruction cache," in 1987 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, February 1987, pp. 30-31.
[26] Jalil Fadavi-Ardekani, "M×N Booth encoded multiplier generator using optimized Wallace trees," IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 1, no. 2, pp. 120-125, June 1993.
[27] Tsuyoshi Isshiki and Wayne Wei-Ming Dai, "High-level bit-serial datapath synthesis for multi-FPGA systems," in Proceedings of the ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, February 1995, pp. 167-173.
[28] Kenji Kaneko, Tetsuya Nakagawa, Atsushi Kiuchi, Yoshimune Hagiwara, Hirotada Ueda, and Hitoshi Matsushima, "A 50 ns DSP with parallel processing architecture," in 1987 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, February 1987, pp. 158-159.
[29] Jeff Yetter, Mark Forsyth, William Jaffe, Darius Tanksalvala, and John Wheeler, "A 15 MIPS 32b microprocessor," in 1987 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, February 1987, pp. 26-27.
[30] Daniel J. Magenheimer, Liz Peters, Karl Pettis, and Dan Zuras, "Integer multiplication and division on the HP Precision architecture," in Proceedings of the Second International Conference on Architectural Support for Programming Languages and Operating Systems, 1987, pp. 90-99.
[31] Kenneth David Chapman, "Fast integer multipliers fit in FPGAs," EDN, vol. 39, no. 10, p. 80, May 12, 1994.
[32] Kouhei Nadehara, Miwako Hayashida, and Ichiro Kuroda, "A low-power, 32-bit RISC processor with signal processing capability and its multiply-adder," in VLSI Signal Processing VIII, pp. 51-60, IEEE, 1995.
[33] Paul Gronowski, Peter Bannon, Michael Bertone, Randel Blake-Campos, Gregory Bouchard, William Bowhill, David Carlson, Ruben Castelino, Dale Donchin, Richard Fromm, Mary Gowan, Anil Jain, Bruce Loughlin, Shekhar Mehta, Jeanne Meyer, Robert Mueller, Andy Olesin, Tung Pham, Ronald Preston, and Paul Rubinfeld, "A 433 MHz 64b quad-issue RISC microprocessor," in 1996 IEEE International Solid-State Circuits Conference, Digest of Technical Papers, February 1996, pp. 222-223.
[34] Bruce Newgard, "Signal processing with Xilinx FPGAs," <http://www.xilinx.com/apps/appnotes/sd_xdsp.pdf>, June 1996.
[35] Peter Ruetz, "The architectures and design of a 20-MHz real-time DSP chip set," IEEE Journal of Solid-State Circuits, vol. 24, no. 2, pp. 338-348, April 1989.
[36] Carla Golla, Fulvio Nava, Franco Cavallotti, Alessandro Cremonesi, and Giulio Casagrande, "30-Msamples/s programmable filter processor," IEEE Journal of Solid-State Circuits, vol. 25, no. 6, pp. 1502-1509, December 1990.
[37] Dirk Reuver and Heinrich Klar, "A configurable convolution chip with programmable coefficients," IEEE Journal of Solid-State Circuits, vol. 27, no. 7, pp. 1121-1123, July 1992.
[38] Joe Laskowski and Henry Samueli, "A 150-MHz 43-tap half-band FIR digital filter in 1.2-μm CMOS generated by silicon compiler," in Proceedings of the IEEE 1992 Custom Integrated Circuits Conference, May 1992, pp. 11.4.1-11.4.4.
[39] Edward Tau, Ian Eslick, Derrick Chen, Jeremy Brown, and Andre DeHon, "A first generation DPGA implementation," in Proceedings of the Third Canadian Workshop on Field-Programmable Devices, May 1995, pp. 138-143.
[40] Andre DeHon, "DPGA utilization and application," in Proceedings of the 1996 International Symposium on Field-Programmable Gate Arrays, ACM/SIGDA, February 1996. Extended version available as Transit Note #129, <http://www.ai.mit.edu/projects/transit/transit-notes/tn129.ps.Z>.
[41] Steve Trimberger, Dean Carberry, Anders Johnson, and Jennifer Wong, "A time-multiplexed FPGA," in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1997, pp. 22-28.
[42] Dev C. Chen and Jan M. Rabaey, "A reconfigurable multiprocessor IC for rapid prototyping of algorithmic-specific high-speed DSP data paths," IEEE Journal of Solid-State Circuits, vol. 27, no. 12, pp. 1895-1904, December 1992.
[43] Alfred K. Yeung and Jan M. Rabaey, "A 2.4 GOPS data-driven reconfigurable multiprocessor IC for DSP," in Proceedings of the 1995 IEEE International Solid-State Circuits Conference, February 1995, pp. 108-109.
[44] Carl Ebeling, Darren Cronquist, and Paul Franklin, "RaPiD: Reconfigurable pipelined datapath," in Proceedings of the 6th International Workshop on Field-Programmable Logic and Applications (FPL '96), September 1996, number 1142 in Lecture Notes in Computer Science, pp. 126-135, Springer.
[45] Ethan Mirsky and Andre DeHon, "MATRIX: A reconfigurable computing architecture with configurable instruction distribution and deployable resources," in Proceedings of the IEEE Workshop on FPGAs for Custom Computing Machines, April 1996.
[46] Rahul Razdan and Michael D. Smith, "A high-performance microarchitecture with hardware-programmable functional units," in Proceedings of the 27th Annual International Symposium on Microarchitecture, November 1994, pp. 172-180.
[47] John R. Hauser and John Wawrzynek, "Garp: A MIPS processor with a reconfigurable coprocessor," in Proceedings of the IEEE Symposium on Field-Programmable Gate Arrays for Custom Computing Machines, April 1997, pp. 12-21.
[48] Scott Hauck, Thomas Fry, Matthew Hosler, and Jeffery Kao, "The Chimaera reconfigurable functional unit," in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1997, pp. 87-96.
[49] Charle Rupp, Mark Landguth, Tim Garverick, Edson Gomersall, Harry Holt, Jeffrey Arnold, and Maya Gokhale, "The NAPA adaptive processing architecture," in Proceedings of the IEEE Symposium on FPGAs for Custom Computing Machines, April 1998.
