TABLE OF CONTENTS…………………………………………………………...……i
LIST OF FIGURES………………………………………………………………….….vi
LIST OF TABLES…………………………………………………………….…......….xi
LIST OF ABBREVIATIONS…………………………..……………………..……….xii
ABSTRACT………………………………………………………………….…….….….1
CHAPTER 1: INTRODUCTION………………………………………...…...……….3
3.3.2.1 Architecture…………………………………………..….21
3.3.2.2 Interconnects Technology……………………………….22
4.2.5 Exceptions…………………………………..…..….…….……....37
5.2.1 Introduction……………………………………..……..…………51
5.3.1 Introduction…………………………………….…….…..………76
5.3.6.1 Pre-Add/Subtract Sub-Module……….….……………….87
6.5.2 FP Multiplier Unit Timing Simulation…………….…..……….114
7.1 Conclusions…….……………..……………….………………….…….116
REFERENCES………………………………………………………………………….118
List of Figures
Unit……………………………………………………………………....56
Figure 5.5 Symbol of the Add Exponent Module in the FP Multiplier Unit………..58
Multiplier Unit………...…………………………………………...…….58
Figure 5.7 The Three Parallel Multiplications Performed in the Block
Multiplication Algorithm……………………………………………..….59
Figure 5.10 Shift Operations Performed to Align the Partial Fractions for Addition...61
Figure 5.11 Dividing the Partial Fractions to Prepare Them for Addition……………..62
Figure 5.17 Symbol of the Post Normalize Sub-Module in the FP Multiplier Unit….67
FP Multiplier Unit………………………………..………………………68
Figure 5.24 Symbol of the Overflow Recheck Sub-Module in the FP Multiplier……72
Figure 5.27 Behavioral Simulation of the Final Module in the FP Multiplier Unit…..74
FP Adder/Subtractor Unit………………………………..…...………… 80
FP Adder/Subtractor Unit……………………….……...………...….…..82
Figure 5.36 Symbol of the Zero Detect Module in the FP Adder/Subtractor Unit…...83
Figure 5.41 Symbol of the Pre-Add/Subtract Sub-Module in the
FP Adder/Subtractor Unit……………………….……………………….87
FP Adder/Subtractor Unit………………………………………………..89
FP Adder/Subtractor Unit……………………………………….……….89
FP Adder/Subtractor Unit……………………...…………..…………….91
FP Adder/Subtractor Unit………………………….………….…………93
FP Adder/Subtractor Unit……………………………………..…………94
FP Adder/Subtractor Unit…………………..………….……...…………94
FP Adder/Subtractor Unit..………………………………………………95
FP Adder/Subtractor Unit………………………………….…..……..….96
FP Adder/Subtractor Unit…………………..………..………….……….97
FP Adder/Subtractor Unit……………………………………….…….…98
List of Tables
Table 5.2 Rounding Action Based on Guard, Round and Sticky Bits….………….69
Table 5.3 REN Sub-Module Behavioral Simulation Results for Even Mantissa…..71
Table 5.4 REN Sub-Module Behavioral Simulation Results for Odd Mantissa…....71
Table 5.8 Test Bench of the FP Adder/Subtractor Unit Behavioral Simulation…. 100
Table 6.1 Synthesis Results for the FP Multiplier Unit using Proposed
List of Abbreviations
FP Floating Point
IC Integrated Circuit
IFFT Inverse Fast Fourier Transform
MC Macro Cell
MUX Multiplexer
RM Round towards Minus-infinity
ABSTRACT
Nowadays, every CPU has one or more Floating Point Units (FPUs) integrated
within it. FPUs are commonly used in math-intensive applications, such as digital
military fields as well as in other fields requiring audio, image or video manipulation.
With the advancement in FPGAs, high performance FPGAs are now built with
millions of gates along with sophisticated features. Accordingly, FPGAs are becoming
more suitable for implementation of high performance FPUs especially when short
The objective of this thesis is to design and implement high speed generic FP
standard Leading One Detector (LOD) algorithm. The novel multiplication algorithm
the FP Multiplier and Adder/Subtractor Units are deeply pipelined, which also leads to
maximum throughput.
The FP Multiplier Unit using the novel Block Multiplication algorithm and the
FP Adder Unit using the LOD algorithm were both completely described using
VHDL code to allow their implementation to any FPGA platform. In our research,
both units were implemented on Virtex2Pro, Virtex4 and Virtex5 FPGAs and were
able to operate at speeds higher than 320 MHz on Virtex2Pro whilst occupying
around 20% of the FPGA and at speeds higher than 400 MHz on Virtex4 and Virtex5
FPGA whilst occupying around 3% of the FPGA. Post route simulation of both units
was performed to verify design operation post implementation (routing) and power
Chapter 1
Introduction
Ever since the invention of digital computers, arithmetic logic units (ALUs)
have always been a fundamental building block of the computer’s CPU. The ALU
usually refers to the circuit that deals with binary numbers in the integer format (like
2's complement and binary coded decimal). An FPU, on the other hand, refers to the
arithmetic unit that deals with floating point numbers (i.e. real numbers). FPUs are
Some of the greatest achievements of the 20th century would not have been
possible without the floating point capabilities of digital computers and systems.
FPUs are especially important for implementation of engineering and math-intensive
applications used in digital signal processing and other scientific computations that
require wide dynamic ranges and high precision. As a result, high performance FPUs
are essential in several fields such as communications, military and medicine as well
diagnostic equipment and others. Then for portable applications, the need for low
In conventional FPUs, the most frequently used floating point operations are
multiplication and addition/subtraction accounting for more than 94% of all floating
point instructions [1]. Hence the employment of high performance FP Multiplier and
Adder/Subtractor Units is of great importance and has been the focus of many
researchers, few of whom gave any attention to power consumption issues.
In this thesis, we aim to design, implement and test an IEEE compliant single
precision, generic, low power, high speed FPU (Multiplier and Adder/Subtractor). In
parallel. This new algorithm is referred to as “Block Multiplication” and can be used
in the implementation of any large multiplier that is to be optimized for speed. The FP
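The decomposition behind a block-based multiplier can be illustrated with a short sketch. This is not the thesis's exact Block Multiplication algorithm (Chapter 5 defines it, and Figure 5.7 shows three parallel partial multiplications); it is a generic Karatsuba-style split, with the split point and recombination chosen here purely for illustration:

```python
# Illustrative sketch only: splitting one large unsigned multiplication into
# three smaller, independent (parallelizable) multiplications. The split
# point k and the recombination below are assumptions for illustration.

def block_multiply(a: int, b: int, width: int = 24) -> int:
    """Multiply two `width`-bit unsigned integers using three partial products."""
    k = width // 2                      # split each operand into high/low halves
    mask = (1 << k) - 1
    a_hi, a_lo = a >> k, a & mask
    b_hi, b_lo = b >> k, b & mask
    # Three multiplications that could run in parallel in hardware:
    p_hh = a_hi * b_hi
    p_ll = a_lo * b_lo
    p_mm = (a_hi + a_lo) * (b_hi + b_lo)
    middle = p_mm - p_hh - p_ll         # = a_hi*b_lo + a_lo*b_hi
    # Shift the partial products to align them, then add:
    return (p_hh << (2 * k)) + (middle << k) + p_ll
```

Because the three partial products are independent, a hardware implementation can compute them concurrently and pipeline the aligning shifts and the final addition, which is what makes such a decomposition attractive for speed.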
1.2 Organization
previous work of other researchers in the field of designing and implementing FPUs
to FPGA. Chapter 3 gives an overview on FPGAs specifically the Virtex family along
with the FPGA design flow as given by Xilinx ISE. Chapter 4 provides an overview
on the IEEE 754-2008 standard for binary floating point numbers along with the
floating point multiplication and addition operations. Chapter 5
discusses the proposed designs for both the FP Multiplier using the new proposed
Block Multiplication algorithm and the FP Adder/Subtractor using the LOD algorithm
by describing the design specifications, then explaining the modules of both units and
showing the behavioral simulations of each module and of the complete designs.
Chapter 6 gives the synthesis, implementation, post route simulation and power
results of the proposed designs. Finally, Chapter 7 wraps up with the conclusion and
future work.
Chapter 2
software emulators which, although they save the added hardware cost, are significantly
slow. Later, floating point operations were performed using external coprocessors that
were used when needed to allow for the execution of math-intensive operations.
Nowadays, every computer has a high speed floating point unit integrated within its
CPU.
computations at higher clock frequencies. This makes modern FPGAs quite suitable
for implementation of high speed floating point arithmetic which can be particularly
useful when flexibility and fast time to market, some of FPGA’s strongest assets, are
of concern.
There have been research papers on the design and implementation of FPUs on
FPGAs ever since the late 1990s. One of the first IEEE
were introduced in 1996 by L. Louca et al. [2]. They implemented their designs on
FLEX8000 Altera FPGA. Their main objective was to minimize the area of both the
resources of the FPGA while achieving reasonable speed and maintaining IEEE
accuracy. Despite their efforts, only one of their proposed units could be implemented
to the FPGA at a time while achieving a peak performance of 7 MFlops and 2.3
One of the early FPU designs was given by Mamun Bin Ibne Reaz et al. in
2002 [3]. Their work included the design and simulation of a pipelined FPU,
presenting detailed block diagrams of the adder/subtractor, multiplier and divider units.
They did not synthesize or implement their design, thus no indication of speed or area
was given for their work. This paper provided us with a good basic overview on the
design of FPUs.
Another research paper by Jian Liang et al. in 2003 [4] presented a FPU
Units. The paper was the first, according to the authors’ knowledge, to discuss and
compare the different algorithms used to implement the floating point adders. The
Adder/Subtractor Unit that trades off latency and throughput for area. The results they
got were from implementing their designs on Spartan-3 FPGA. One of their optimized
designs showed latency just above 250 ns and a throughput of around 75 MHz.
In 2004, Suhaili and Sidek [5] proposed a reconfigurable 32 bit ALU that can
perform both integer and floating point addition. They described their module in
Verilog and implemented it on Spartan 2e FPGA. The synthesis report indicated that
Also in 2004, Gokul Govindu et al. [6] showed that FPGA based FPU can
paper, they analyzed the maximum achievable speed, area, latency, power and
pipelining as a parameter. They used both VHDL and Xilinx Library Cores to
describe their design. They implemented it on Virtex2Pro and were able to achieve
In 2005, Ali Malik [7] discussed and compared in detail the design of FP
Virtex2p FPGA. Malik’s work is related to that of Liang [4] in that they both analyzed
implementations for some of the main modules. He then discussed the design
delay and area (number of occupied FPGA slices). Finally, he used the optimized
sub-modules to build several FP Adders, each using a different algorithm, and compared
them for overall latency, area (number of occupied slices), and speed. His fastest
In 2006, Brunelli and Nurmi [8] introduced the design of what they called the
Milk Co-Processor which is a 32 bit FPU. The main objective of their design was
reusability. They did not give much detail about their used design algorithms. They
did mention though that they described their design using VHDL and that a special
VHDL file was written containing a set of generics that were used to give
customizability to the design. For example, these parameters allow the choice of
Also in 2006, K. Scott Hemmert and Keith D. Underwood [9] from Sandia
National Laboratories published their work which involved the design of a high speed
HDL) as their design entry language and implemented the FPUs to both Virtex2 and
Virtex4 FPGAs. Their proposed designs were able to operate at frequencies of up to
320 and 350 MHz on Virtex4 for the Multiplier and Adder respectively.
Another contribution in 2006 was given by Per Karlstrom et al. [10] who
They used Virtex4 DSP48 blocks to build the multiplier module within the FP
Multiplier Unit. As a result their FP Multiplier Unit was able to operate at nearly 450
MHz for Virtex4 FPGA. As for the FP Adder Unit, they worked on increasing the
operating speed by optimizing the bottleneck block of the design. This was
performed by breaking up the binary number to be dealt with in that block and
considering each four bits separately in a parallel manner. Despite the fact that this
operating speed of 361 MHz on Virtex4 FPGA. Later in 2008, Karlstrom et al.
multiplier in the FP Multiplier Unit such as Booth and Canonic Signed Digit (CSD)
multiplications. They used VHDL to describe their design. They implemented their
topic of FP Adder/Subtractor and Multiplier Units design especially for double (64
bits) and quadruple (128 bits) precision FP numbers. They have published several
papers within the topic. In 2009, Florent de Dinechin and Banescu [13] studied several
FP multipliers, on FPGAs. The objective of his work was to build large multipliers
operating at high frequencies while reducing their DSP block usage. Each of his
studied techniques was found to be more suitable to a certain FPGA depending on the
architecture of its DSP blocks. Florent was able to build multipliers that operated at
frequencies around 440 MHz for both Virtex4 and Virtex5 FPGAs. Later in 2010,
Banescu along with Florent et al. [14] published a paper studying the same
presented double and quadruple precision FP multipliers that operated at 400 MHz for
both Virtex4 and Virtex5 FPGAs. Florent et al. [15] also published a paper in 2010
that discusses a FP adder generation tool, in a project they referred to as the FloPoCo
project (Floating Point Cores). Their work explores the tradeoffs between size,
latency and frequency for pipelined large precision adders on FPGA in several
architectures. For each of these architectures, resource estimation models are defined
and used in an adder generator that selects the best architecture considering the target
FPGA, target operating frequency and the addition bit width. They were able to
construct double and quadruple precision FP adders whose synthesis results indicated
relevant work was introduced by Govindu [6], Hemmert [9], Karlstrom [10,11] and
the work introduced by Lyon University[13-15]. They were all able to design FP
Units operating at sufficiently high speeds, well above 200 MHz. There were
two main approaches used to achieve such high speeds of operation which are:
blocks on the specific FPGA or by using Xilinx Cores that are specifically
optimized for the target FPGA. This approach was used by Govindu [6] and
unsigned multiplier was built by using optimized Xilinx cores and DSP
2. Using Fast Algorithms: Many fast algorithms are present for both the FP
which are by far the most common fast multiplication algorithms used with
The drawback of the first approach is that the designs were not generic; they were
actually optimized for specific FPGAs. That is why the second approach was more
appealing for our work, since our objective is to design a high speed, low power and
Chapter 3
Digital Design
Based on the design specifications, a digital designer has various options when
selecting a hardware platform for their design, ranging from Application Specific
Integrated Circuits (ASICs) to all sorts of Programmable Logic Devices (PLDs) and FPGAs.
This chapter gives an overview of these various hardware design options starting
with the ASIC, going through PLDs, FPGAs and then discussing thoroughly Xilinx
Virtex FPGAs. Finally, the digital design flow for FPGAs is explained.
ASICs started appearing in the early 1980s. ASIC chips are customized for a
particular use rather than general purpose use. With the improvement in design tools
and reduction in feature sizes, the number of gates in an ASIC has grown from a few
thousand to over 100 million gates. Modern ASICs are also known as SoCs (System-on-Chip).
There are many types of ASICs based on the number of mask layers on which
the designer has control over. The most well-known types of ASICs are [16, 17]:
Full Custom:
In a full custom ASIC, the designer has full control over every mask layer
used to fabricate the silicon chip. Accordingly, the designer has full control over the
sizes of all transistors in his design which allows fine tuning of the transistors’ sizes
for optimum performance. Full custom design can be used to design some or all of the
circuits for a specific ASIC. Fewer full custom ICs are being designed due to the long
time to market and high non-recurring engineering (NRE) cost involved for its design
and fabrication. Also, with the improvement in standard cell and gate array ASICs,
these now provide the required performance for more applications with their high
speeds and low costs, which steers designers away from full custom design. Full
custom design is usually used when there are no suitable existing cell libraries,
typically because the existing cell libraries are either not fast enough, not
small enough, consume too much power or don't provide a certain required function.
Full custom ASICs are commonly used for microprocessors which must operate as
Standard Cell:
Standard cell ASICs are based on predesigned logic cells (such as logic gates,
multiplexers, flip-flops, etc…) known as standard cells that are used to build the
design along with larger predesigned cells known as megacells (such as memory
blocks) that can be embedded into the design. During the design, each and every transistor in
every standard cell can be chosen to optimize a certain design parameter and tools can
cell ASICs, all mask layers are customized (transistors and interconnects) thus a
custom photo-mask is created for every layer for the device's fabrication. The
advantage of standard cell ASICs is that designers save time and money by using
Gate Arrays:
Gate arrays ASICs are partially fabricated chips with repetitive similar blocks,
and resistors depending on the vendor. Basic cells are replicated to form arrays of
basic cells. The designer chooses from a gate-array library of predesigned and
pre-characterized logic cells (including gates, registers, etc.) and uses them along
with more complicated blocks to build the circuit by controlling only the top few
layers of metal used for interconnects. The disadvantage of gate array ASICs is the
3.2.1 Introduction
PLDs were first introduced in the mid 1970s. A PLD is a programmable chip
that is mass produced at the factory and then customized by the end-user to perform
different logic functions. Unlike ASICs, PLDs are intended for general use not for
programmable depending on the technology used to implement the cross points within
the device.
One time programmable PLDs implement the cross points using fuse or anti-fuse
technology to create permanent open or short circuits respectively, based
on the data to be programmed into the PLD. For multiple time programmable PLDs, the
cross points are implemented using a single bit memory cell used to store binary data
at the cross points to implement an open or short circuit. Such PLDs can be volatile or
nonvolatile depending on whether volatile or nonvolatile memories are used at the
cross points.
PLDs can be used to implement anything from simple combinational circuits to fairly
complex sequential state machines, depending on the type of PLD. There are three types of
PLDs:
user.
matrix of AND gates (referred to as the AND plane) and a matrix of OR gates
(referred to as the OR plane). These planes are used to implement any circuit as Sum-
of-Product through programming the horizontal to vertical cross points as either open
AND plane and a programmable OR plane. The cross points in both the AND plane
and the OR plane can be programmed to form any sum-of-product expression. PLAs
are particularly useful for large designs that require many common product terms that
can be used by several outputs. The downsides of the PLA device are its manufacturing
cost and speed. This device has two levels of programmable links, and signals
take a relatively long time to pass through programmable links as opposed to pre-
Figure 3.1 Basic Architecture of PLA
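To make the sum-of-products idea concrete, the following sketch models a single PLA output in software. The plane encoding ('1' = true literal connected, '0' = complemented literal, '-' = no connection) is a hypothetical convention chosen for illustration, not any vendor's programming format:

```python
# Hypothetical model of a PLA's two programmable planes realizing a
# sum-of-products function. Each AND-plane row describes one product term.

def pla_eval(and_plane, or_plane, inputs):
    """Evaluate one PLA output from programmed AND/OR planes."""
    products = []
    for row in and_plane:                    # each row is one product term
        term = True
        for bit, spec in zip(inputs, row):
            if spec == '1':
                term = term and bit          # true literal connected
            elif spec == '0':
                term = term and (not bit)    # complemented literal connected
        products.append(term)
    # OR plane: OR together the product terms selected for this output
    return any(p for p, sel in zip(products, or_plane) if sel)

# f(a, b, c) = a·b + a̅·c  programmed as two product terms:
AND_PLANE = ['11-',   # a AND b
             '0-1']   # NOT a AND c
OR_PLANE = [True, True]
```

Programming the PLA then amounts to choosing which cross points are open or closed in each plane, exactly as the text describes.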
The speed problems associated with the PLA were addressed with the
development of the PROM and the PAL. A PROM is a special type of PLA with a
programmable OR plane and a fixed AND plane that produces all possible product
terms for the given inputs. A PAL is a special type of PLA with a programmable
AND plane and a fixed OR plane. The advantage of a PROM and a PAL is that they
are faster due to having only a single programmable plane, at the cost of less
Figure 3.3 Basic Architecture of PAL
A GAL is a PLA that has additional logic circuitry at each output, referred to as
a macro cell. A macro cell is a programmable output cell containing logic gates, a
flip-flop, and multiplexers with internal programmability that allows several modes of
operation. GALs also differ from PLAs in that they have a feedback signal from the
macro cell back to the programmable array, which increases the GAL's flexibility, and
in that their cross points are implemented using EEPROM instead of fuse/anti-fuse
or PROM/EPROM. GAL devices can have a maximum frequency around 250 MHz
interconnecting matrix. CPLDs have I/O drivers and a clock/control unit. Modern
CPLDs also include JTAG support (port for circuit access/test defined by Joint Test
Action Group and standardized in the IEEE 1149.1 standard), a large number of I/O
Figure 3.5.
CPLDs feature predictable timing characteristics that make them ideal for
more predictable delay than FPGAs and other programmable logic devices. CPLDs
are inexpensive and require small amounts of power [19], thus they are commonly
Observing the most popular Altera and Xilinx CPLDs, it can be summarized that
CPLDs are fabricated using 0.18 µm or 0.35 µm CMOS technology, have a number of
user pins ranging from 27 to 272, and have maximum operating speeds
Around the beginning of the 1980s, the gap in the digital IC continuum
became apparent. At one end there were programmable devices such as SPLDs and
CPLDs, which were highly configurable and had fast design and modification times,
but could not support large or complex functions. At the other end, there were ASICs
which could support extremely large and complex functions, but they were very
expensive and time consuming to design. Furthermore, once a design has been
To fill that gap, FPGAs were introduced by Xilinx in the mid 1980s. They are
considered the most complex PLDs since they contain thousands of configurable logic
blocks and configurable interconnects that can both be programmed (only a single
time or many times depending on the type of FPGA) to perform a variety of complex
At first, FPGAs were used to implement simple logic circuits at relatively low
speeds. Later, in the 1990s, the size and sophistication of FPGAs started to increase
and they found market in telecommunications, networks and many other industrial
applications. FPGAs were then also commonly used to prototype ASIC designs or to
However, FPGAs' ease of design, flexibility, low development cost and short time to
market soon made FPGAs find their way into final products.
Although the first FPGAs contained a few thousand gates, nowadays FPGAs
contain over a billion gates [20]. Today’s FPGAs have several sophisticated features
such as high speed input/output interfaces, internal clocking and consist of millions of
gates along with embedded elements such as microprocessor cores, RAMs, DSP
blocks, multipliers and dedicated arithmetic carry chains. Such high performance
FPGAs can be used to implement almost any design and at very high speeds that
Generally, FPGAs are more cost effective for limited productions while
ASICs are more suitable for larger productions. An advantage of using FPGAs instead
of ASIC is that the FPGA design flow eliminates the complex and time-consuming
floor planning, place and route, timing analysis, and mask / re-spins stages of the
project since the design logic is already synthesized to be placed onto an already
verified, characterized FPGA device. Moreover, FPGAs have the advantage that they
can be easily and rapidly reprogrammed. A good way to shorten the development time
complexity, or volume designs in the past, modern FPGAs easily push the 500 MHz
performance barrier due to their very high densities and their sophisticated features.
designs that could previously have been realized only on ASICs and custom silicon.
FPGAs can be used to implement almost any type of design such as communication
devices, software defined radios, radar, image processing, digital signal processing all
the way to system-on-chip (SoC) components that contain both hardware and
software elements.
FPGAs have a much more sophisticated structure than CPLDs. This gives
FPGAs the advantage over CPLDs that they can be used to implement complex
now considered the largest PLD supplier owning more than 50% of the market [21].
3.3.2.1 Architecture
switch matrices [19]. The internal architecture of CLBs might differ from one FPGA
family to another. Generally, a CLB consists of a number of slices, each slice in turn
consisting of a number of logic cells. A logic cell is the core building block in a
A simplified illustration of a Xilinx logic cell is shown in Figure 3.7. It
includes a 4 input Look Up Table (LUT), a Delay Flip Flop (DFF) along with a
multiplexer to allow a registered or unregistered output. Other than the LUT, MUX
and register, a logic slice can also contain other elements such as fast look-ahead carry
chains, arithmetic logic and dedicated internal routing. Advanced FPGAs may also
The reason for FPGAs having a hierarchy of CLBs consisting of slices that in
interconnects. Thus there is a fast interconnect between logic cells within the same
slice, then slightly slower interconnects between slices in a CLB followed by the
between making it easy to connect things together without incurring excessive inter-
implementation. The most famous of these technologies are listed in Table 3.1.
Table 3.1 Technologies Used to Implement FPGA Interconnects

Technology    | Programmability                                  | Predominantly Associated With
Fuse          | OTP                                              | SPLDs
Anti-fuse     | OTP                                              | FPGAs
EPROM         | Can be erased using ultraviolet light (takes     | SPLDs & CPLDs
              | at least 20 minutes) then reprogrammed.          |
EEPROM/Flash  | Can be electrically erased then reprogrammed.    | SPLDs, CPLDs & some FPGAs
SRAM          | Reprogrammable.                                  | FPGAs & some CPLDs
Most Xilinx FPGAs are based on SRAM technology, where the programmable elements of
the FPGA are controlled by SRAM cells. SRAM based FPGAs are volatile, that is, the
device's configuration data is lost once the power is removed from the system. Thus
SRAM based FPGAs require external boot ROM to reprogram the FPGA every time
it is powered on. This is not much of a problem since these devices have the
The Xilinx Virtex series was first introduced in 1998 as a low power high
performance solution. It was the first line of FPGAs to offer one million system gates.
The Virtex product line consistently offers the industry's leading combination of
In addition to FPGA logic, the Virtex series includes embedded fixed function
although still available, their functionality is largely superseded by the Virtex-4
and -5 FPGA families. The Virtex2 series is manufactured on a 1.5V, 0.15μm 8-Layer
Metal Process with 0.12μm High-Speed Transistors whilst the Virtex-2 Pro series is
Transistors [20].
Virtex4 is very similar to that used in all previous Virtex and Spartan (up to Spartan
3A) FPGAs. The simplified schematic of a Virtex4 slice is shown in Figure 3.8. It
Two Multiplexers.
Dedicated arithmetic logic including two 1-bit adders, carry chain and two
storage elements that can be configured as flip-flops or as latches.
Figure 3.8 Simplified Schematic of Virtex-4 Slice
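The reason a slice built around LUTs is universal can be sketched briefly: a 4-input LUT is nothing more than a 16-entry truth table addressed by its inputs, so loading the right init vector realizes any 4-input Boolean function. The init-vector ordering below is an assumption for illustration, not the Xilinx LUT INIT format:

```python
# Illustrative model of a 4-input LUT: a 16-entry truth table addressed
# by the four inputs. The bit ordering is an assumption for illustration.

def make_lut4(init_bits):
    """Build a 4-input LUT from a 16-bit truth table (index 0 = all inputs low)."""
    assert len(init_bits) == 16
    def lut(i3, i2, i1, i0):
        index = (i3 << 3) | (i2 << 2) | (i1 << 1) | i0
        return init_bits[index]
    return lut

# Example: a 4-input XOR, whose truth table is 1 wherever the index has odd parity.
xor4 = make_lut4([bin(i).count("1") % 2 for i in range(16)])
```

Reprogramming the FPGA simply reloads these init vectors (and the routing configuration), which is why the same fabric can implement arbitrarily different logic.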
Virtex-5 FPGAs
In the Virtex5 series, Xilinx moved from its traditional four-input LUT design
technology. Virtex5 series offers a lower power solution delivered by the 65nm
The simplified schematic of a Virtex5 slice is shown in Figure 3.9. The main
Figure 3.9 Simplified Schematic of Virtex-5 Slice
There is no consistent definition of a speed grade for all devices. Even for
FPGA or a CPLD. Originally speed grades for Xilinx FPGAs represented the time
through a look-up table, but now the speed grade doesn't actually represent a timing
path. Instead, the speed grade is a relative metric of performance within a specific
FPGA family. Different speed grades within a family result merely from process
For modern Xilinx FPGAs, such as those of the Virtex family, higher numbers
represent faster devices. For example, Virtex4 speed grades are -10, -11, and -12 with
-10 being the slowest and -12 being the fastest. Virtex5 speed grades are -1, -2, and -3
3.4 FPGA Design Flow
provided by the Xilinx foundation [20] followed by a brief summary of all its steps
[16,23].
Design Entry
Generally, there are different techniques for design entry which are schematic
based, HDL, a combination of both and finite state machine (FSM). If the designer
wants to deal more with hardware, schematic based design entry is the better choice.
HDL and FSM on the other hand, represent a level of abstraction that can isolate the
designer from the details of the hardware implementation. FSM is used when the
design can be thought of as a series of states. HDL is considered the most popular
design entry methodology for FPGA design as it is the best choice for describing
complex designs.
level. There are two industry IEEE standard HDLs namely, VHDL and Verilog.
While their syntax and semantics are quite different, they are used for similar
purposes. VHDL contains more constructs for high level modeling, model
parameterization, design reuse and management of large designs than Verilog does.
Many EDA tools are designed to work with either language or both languages
together.
Not all VHDL constructs are synthesizable. Generally, if the VHDL code is
physically meaningless or too far removed from the hardware it attempts to describe, it
may not be synthesizable [19]. Thus it is best to describe the design in a simple manner
in order to ensure the synthesizer is able to correctly interpret the design to the
Behavioral Simulation
The design is first simulated to verify its logical correctness. Behavioral simulation does not consider propagation
delays.
Synthesis
Synthesis is the process which translates the HDL code into a netlist form (i.e. a
complete circuit with logical elements such as gates, flip-flops, etc) targeted at a
specific FPGA platform. At this stage, detailed timing analysis can be carried out and
an estimate of the occupied area can be obtained. The resulting netlist is stored in an NGC
implemented in the target FPGA. This process consists of a sequence of three steps:
file saved as an NGD (Native Generic Database) file for Xilinx tools.
ii. Map: The map process fits the logic defined by the NGD file into the
targeted FPGA elements (i.e. CLBs, IOBs) and generates an NCD (Native
iii. Place and Route (PAR): The PAR process places the blocks from the
map process onto the FPGA according to the defined user constraints
and then routes the connections between them. The PAR tool takes the mapped NCD file
design meets the power budget thus attaining system performance and cost
goals. Low power enables higher clock frequency, higher reliability, better
path of the design is determined here and hence the fastest design speed. The main
advantage of static timing analysis is that it is relatively fast, doesn’t need a test bench
In Timing Simulation, the VHDL timing model generated by the place and route
tool, which includes the block and routing delay information from the routed design,
is simulated to give a more accurate assessment of the behavior of the circuit under worst-case
conditions. Timing simulation is a highly recommended part of the HDL design flow
for Xilinx devices to verify that the design implemented on the target FPGA meets timing
constraints. Since timing simulation uses the detailed timing and design layout
information that is available after place and route, this simulation of the design closely
matches the actual device operation. Timing simulation simulates the VHDL timing
model, generated by the place and route tool, to verify the synthesized logic as
The routed NCD file is converted to a bit stream file to be used to configure the
Chapter 4
Since only binary information can be stored and processed in digital computers,
the most natural system to use when representing decimal numbers is the binary
system. In order to represent decimal numbers in binary notations, there are two
methods depending on the position of the binary point. The fixed point method
assumes the binary point is always in a fixed position while the floating point
representation assumes the binary point can float anywhere within the number's
significant bits.
This chapter gives a brief presentation of the IEEE-754 [24] standard for single precision floating point numbers and explains the floating point multiplication and addition/subtraction operations.
In fixed point representation, numbers are stored assuming the radix point, also referred to as the binary point for binary systems, is fixed in a certain position such that there are a fixed number of digits before and after the radix point. The two most widely used radix point positions when fixed point representation is used are:
1. Placing the radix point in the extreme left of the number such that it can only
represent fractions.
2. Placing the radix point in the extreme right of the number such that it can only represent integers.
In either case, the radix point is not actually present, but its presence is assumed
from the fact that the number stored is treated as a fraction or as an integer [25].
Floating point representation divides a number into two parts. The first part represents a signed, fixed point number called the mantissa. The
second part designates the position of the radix point and is called the exponent.
Usually, the radix point is shifted such that there is one non-zero digit to its left. So
basically, floating point represents real numbers in the scientific notation where the
radix point can float anywhere in the number and the exponent is adjusted
accordingly. The general format of any floating point number (F) can be written as:

F = M * r^exp

where M is the mantissa, r is the radix of the used numbering system (i.e. r = 10 for decimal, r = 2 for binary) and exp is the exponent that represents the original location of the radix point. For example, the binary floating point number (1101.01)2 is represented as 1.10101 * 2^3.
Fixed point representation is simpler and cheaper to implement than floating point representation. On the other hand, floating point representation has the advantage of being able to represent very large or very small numbers, and thus arithmetic operations are less likely to overflow or underflow than in fixed point representation. As a result, floating point representation is more suitable for applications that require a wide dynamic range.
The IEEE 754 standard for floating point representation was defined to achieve two main goals:
1. Precisely specify floating point number encoding such that all computers would interpret floating point numbers in the same way. This made it possible to exchange floating point data between different computers.
2. Precisely specify the arithmetic operations and the exceptional conditions that could result from them such that all computers will give the same result for a given operation with the same input data.
The IEEE-754 standard was first created in 1985. Several revisions were made until the final IEEE-754 2008 standard was published in August 2008. The IEEE 754 standard specifies how floating point numbers are represented and how to carry out arithmetic operations on them. Nowadays, the IEEE 754 standard is the most common floating point representation. The single precision format stores a floating point number in 32 bits. The 32 bits are divided into three fields, a 1-bit sign, an 8-bit exponent and a 23-bit mantissa:

Sign       Exponent         Mantissa
B31        B30 ... B23      B22 ... B0
The sign bit stores '0' for positive numbers and '1' for negative numbers. The
8-bit exponent stores the exponent in excess-127 code. That is a bias of 127 is added
to any exponent before it is stored. So for example, to represent the binary exponent 2, the value E' = 2 + 127 = 129 is stored. Using the excess-127 code gives a range for E' of 0 <= E' <= 255. IEEE-754 reserves exponent field values of 0 and 255 (all 0s and all 1s) to denote special values, so the range of E' for normal numbers becomes 1 <= E' <= 254, equivalent to a range of -126 <= E <= 127 for the true exponent E. The excess-127 code also allows exponents to be compared as if they were unsigned integers.
The 23-bit mantissa is effectively a 24-bit mantissa whose most significant bit is not stored and is thus known as the implicit bit. While this provides efficient storage, the implied bit is necessary to carry out arithmetic operations on the number and must be explicitly restored before any operation involving the number is performed.
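The field layout and the implied bit described above can be illustrated with a short Python sketch. This is illustrative only, not part of the thesis design; the function names are assumptions, and Python's struct module is used merely to obtain a reference bit pattern:

```python
import struct

def unpack_single(word: int):
    """Split a 32-bit IEEE-754 single into sign, biased exponent, mantissa."""
    sign = (word >> 31) & 0x1
    exp_biased = (word >> 23) & 0xFF        # E' = E + 127
    mant = word & 0x7FFFFF                  # 23 stored bits, implied '1' omitted
    return sign, exp_biased, mant

def value_of(word: int) -> float:
    """Recover the value of a normalized single-precision number."""
    sign, e, m = unpack_single(word)
    frac = 1 + m / 2**23                    # restore the implied leading '1'
    return (-1) ** sign * frac * 2.0 ** (e - 127)

# cross-check against the host's own IEEE-754 encoding of 3.5
word = struct.unpack(">I", struct.pack(">f", 3.5))[0]
print(hex(word), value_of(word))            # prints: 0x40600000 3.5
```

Note how 3.5 = 1.75 * 2^1 is stored with E' = 128 and a mantissa field of 0.75 * 2^23, the leading '1' of 1.75 being implicit.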
Floating point numbers are stored in the normalized form, where the mantissa is shifted such that the implicit bit is '1'. Normalization has the advantage of allowing a wide range of numbers to be represented with great precision. The value of a normalized floating point number (FN) is then given by:

FN = (-1)^S * 1.M * 2^(E' - 127)

Denormalized numbers are floating point numbers that do not assume the implied bit. They are used to represent numbers that are smaller than the smallest normalized number. They are identified by an all zero exponent and a non-zero mantissa.
4.2.3 Special Values
The IEEE-754 standard reserves the exponent field of all '0's and all '1's for
special values. The special values defined by the IEEE-754 standard are:
1. Zero : Due to the assumption of a '1' implied bit, a zero cannot be directly represented, so zero is denoted by an all '0's exponent and mantissa. The sign bit differentiates between +0 and -0, which compare as equal.
2. Infinity: Infinity is denoted by an all '1's exponent and an all '0's mantissa.
3. NaN : The value Not a Number (NaN) is used to represent a value that is not a real number. NaNs are represented by an all '1's exponent and a non-zero mantissa.
The range of single precision floating point numbers is shown in Table 4.2. Since the sign of floating point numbers is given by a separate leading bit, the range of negative numbers is given simply by the negation of the range of positive numbers.
4.2.5 Exceptions
1. Overflow : Set when the result has a value that is too large to be
represented.
2. Underflow : Set when the result has a value that is too small to be
represented.
3. Inexact : Set when the result of an operation needs to be rounded, and thus a rounding error is introduced.
4. Invalid : Set when an operation cannot return a real value, such as taking the square root of a negative number, which has no real solution.
Arithmetic operations usually result in floating point numbers with more bits than can actually be stored. In such cases, it is necessary to normalize the number and round it in order to fit in the storage format. The IEEE 754 standard defines five rounding modes. The first two modes round to a nearest value, while the other three are directed rounding modes:
1. Round to nearest, ties to even (REN): Rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit. This is the default rounding mode for binary floating-point.
2. Round to nearest, ties away from zero: Rounds to the nearest value; if the number falls midway it is rounded to the nearest value above (for positive numbers) or below (for negative numbers).
3. Round towards positive infinity (RP): The result is rounded up towards positive infinity.
4. Round towards negative infinity (RM): The result is rounded down towards negative infinity.
5. Round towards 0 (RZ) : Also called truncation, as the extra bits are simply truncated.
The rounding mode affects the results of most arithmetic operations and the thresholds for overflow and underflow exceptions. The default rounding mode is REN and it is the mode most widely used in software and hardware arithmetic implementations. To increase the precision of the result and to enable the REN rounding mode, three extra bits, namely the guard, round and sticky bits, are appended to the right of the mantissa.
4.3 Floating Point Arithmetic
In this section, the basic algorithms for performing floating point addition/subtraction and multiplication are explained.
Floating point multiplication is considered the simplest of the floating point operations since it is performed by directly adding the exponents and multiplying the mantissas. The basic floating point multiplication operation consists of adding the two exponents and subtracting the 127 bias to compensate for the bias being added in both exponents, multiplying the two mantissas, and determining the resultant sign by XORing the input signs. It is useful at this point to check the exponent for possible overflow or underflow. The bottleneck of floating point multiplication lies in the unsigned binary multiplication of the mantissas due to the long carry chain involved in the multiplication operation.
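The basic steps above can be sketched in Python on already-unpacked fields. This is a simplified model, not the thesis's VHDL implementation: rounding is reduced to truncation and exception handling is omitted, and the function name is an assumption:

```python
def fp_mul_fields(sign_a, ea, ma, sign_b, eb, mb):
    """Sketch of the basic FP multiply steps on unpacked fields.
    ea, eb are excess-127 exponents; ma, mb are 24-bit mantissas with
    the implied '1' already restored. Rounding is reduced to truncation."""
    sign = sign_a ^ sign_b           # resultant sign
    exp = ea + eb - 127              # add exponents, remove the doubled bias
    prod = ma * mb                   # unsigned 24x24 product, up to 48 bits
    if prod & (1 << 47):             # carry into the 48th bit:
        prod >>= 1                   #   one-bit right shift to renormalize
        exp += 1
    mant = (prod >> 23) & 0xFFFFFF   # keep a 24-bit normalized mantissa
    return sign, exp, mant
```

For instance, multiplying the fields of 2.0 and 3.0 yields the fields of 6.0, while 3.0 * 3.0 exercises the one-bit renormalization shift.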
Generally, the multiplication of two n bit binary numbers consists of the successive addition of partial products to give the final result, which for the multiplication of two n bit numbers would not exceed 2n bits. Then for multiplication of signed numbers, the resultant sign has to be determined. There are two main concerns when implementing binary multiplication in hardware:
1. The Multiplication of Signed Numbers: The simple long multiplication process handles the signs of the numbers to be multiplied separately in order to find the resultant sign. Yet, modern computers usually deal with the 1's complement, 2's complement or sign-magnitude number representations, which all embed the sign in the number itself. A simple long multiplication process won't be sufficient in such cases but must be modified to handle the embedded sign correctly.
2. The Long Carry Chain Involved: The multiplication result from the n by n binary multiplication appears only after all the intermediate n partial fractions are calculated and then added. Such an operation is very time consuming due to the long carry chain involved in the successive additions.
When considering the above issues for the case of the binary multiplier involved in the proposed FP Multiplier Unit, it can be noted that:
1. The multiplied mantissas are 24 bit unsigned numbers, so the first issue is not of concern.
2. On the other hand, the large delay resulting from the long carry chain involved is a very critical issue. That is the reason why the 24 by 24 unsigned multiplication algorithms concentrate on decreasing the number of partial fractions, hence reducing the number of intermediate additions and breaking the long carry chain. Generally, floating point
multipliers perform the same main operations, which were explained above, but differ in the algorithm used to perform the mantissa multiplication.
Booth Multiplication
Booth's multiplication algorithm speeds up the binary multiplication when one of the numbers to be multiplied includes a string of consecutive '1's. Booth's algorithm is based on the fact that any binary number containing a string of consecutive '1's can be represented as the difference between two numbers: a string of '1's extending from bit position m up to bit position n is equal to 2^(n+1) - 2^m. Thus the multiplication (20 * 14) can be performed in binary as shown in Table 4.4. As shown in the table, the use of Booth multiplication reduces the number of partial fractions to be added, which is the key factor in speeding up the binary multiplication.
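The recoding idea can be sketched as a radix-2 Booth multiplier in Python. This is an illustrative model of the general algorithm, not code from the thesis:

```python
def booth_multiply(multiplicand: int, multiplier: int, n: int = 8) -> int:
    """Radix-2 Booth multiplication of an n-bit multiplier. A string of
    '1's in the multiplier contributes only one subtraction (at its
    start) and one addition (just past its end)."""
    product = 0
    prev = 0                                # implicit bit to the right of bit 0
    for i in range(n):
        bit = (multiplier >> i) & 1
        if (bit, prev) == (1, 0):           # start of a run of '1's: subtract
            product -= multiplicand << i
        elif (bit, prev) == (0, 1):         # end of a run of '1's: add
            product += multiplicand << i
        prev = bit
    return product

print(booth_multiply(20, 14))   # 14 = (01110)2 = 2**4 - 2**1, so 20*16 - 20*2 = 280
```

For 20 * 14 only two partial terms are generated instead of the three that the '1' bits of 14 would produce in a plain long multiplication.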
Table 4.4 Example on Booth Multiplication
A binary number is said to be encoded in the canonical signed digit (CSD) form if it contains no adjacent non-zero digits. CSDs are generated by encoding a binary number such that it contains the fewest number of non-zero bits.
A CSD n*n bit multiplier contains (n+1) cascaded CSD encoder units to generate the CSD representation of the multiplier. Each encoder unit receives three inputs and provides two outputs. The inputs are the multiplier bits to be encoded and a carry bit, while the outputs are the next carry bit and a canonic signed digit. The generated CSD vector is then given to the CSD logic and shift control unit along with the multiplicand and its 2's complement. The CSD logic and shift control unit provides n/2 n-bit partial products that are given to an adder block, whose output is the final product. The basic idea behind CSD multiplication is again to reduce the number of partial fractions involved in the multiplication operation: in CSD multiplication, the product of two n-bit numbers is calculated in (n/2 - 1) steps.
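The CSD encoding itself can be sketched in Python; this is an illustrative model of the standard recoding rule (digit = 2 - (x mod 4) at each odd position), not the thesis's encoder hardware:

```python
def to_csd(x: int):
    """Encode a non-negative integer in canonical signed digit (CSD) form.
    Returns a list of digits in {-1, 0, 1}, least significant first,
    with no two adjacent non-zero digits."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)        # 2 - (x mod 4): gives +1 or -1
            x -= d
            digits.append(d)
        else:
            digits.append(0)
        x >>= 1
    return digits

def csd_value(digits):
    """Evaluate a CSD digit list back to an integer."""
    return sum(d << i for i, d in enumerate(digits))
```

For example 14 = (01110)2 encodes to the digits of 2^4 - 2^1, i.e. only two non-zero digits, which is exactly what reduces the number of partial products.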
The mantissas of two floating point numbers must be aligned before it is possible to add or subtract them when performing floating point addition or subtraction. That is equivalent to both numbers having equal exponents. If the exponents of the two floating point numbers are not equal, one of the numbers has to be pre-normalized. Pre-normalization is the process where the exponent of the smaller floating point number is increased to equal that of the larger number by shifting its mantissa right a number of times equal to the difference between the two exponents. Pre-normalization is performed on the smaller floating point number such that if data bits are lost due to the shifting operation, the effect is minimal.
Generally, floating point addition/subtraction is known to be more complicated
than floating point multiplication due to the extensive processing required in adjusting
the operands before and after the execution of the addition or subtraction operation
when pre-normalizing the smaller mantissa and when post normalizing the resultant
mantissa respectively.
The bottleneck of the floating point addition/subtraction is the post normalization operation. This is due to the fact that the post normalization of the resultant mantissa requires:
i. Finding the leading '1' bit in the resultant mantissa to be set as the implied bit.
The leading one bit here can be located anywhere within the mantissa.
ii. Shifting the resultant mantissa in order to set the detected '1' bit as the most
significant bit.
iii. Adjusting the resultant exponent to compensate for the shift in the resultant
mantissa.
Many algorithms have been proposed to implement the post normalization process at a higher speed. Generally, floating point adders perform the same main operations, which were explained above. They only differ in the algorithm used to implement the post normalization operation. The most famous of these algorithms are [4, 26]:
Leading One Detector (LOD): After the addition/subtraction operation, a leading one detector is used to determine the location of the leading one. Based on the leading one location, the resultant mantissa is shifted left by a number of places that is subsequently subtracted from the exponent. The LOD algorithm is an area efficient, simple algorithm that is also known as the standard algorithm.
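The three LOD post-normalization steps listed above can be sketched in a few lines of Python. This is a behavioral model only (the hardware LOD is combinational logic, and a zero result is assumed to be handled by separate zero-detect logic):

```python
def post_normalize(mantissa: int, exponent: int, width: int = 24):
    """LOD-style post-normalization: locate the leading '1' in the
    resultant mantissa, shift it up to the implied-bit position and
    compensate in the exponent. Assumes mantissa != 0."""
    lead = mantissa.bit_length() - 1        # step i: find the leading '1'
    shift = (width - 1) - lead              # step ii: left shift to normalize
    return mantissa << shift, exponent - shift   # step iii: adjust exponent
```

For example, a subtraction leaving only the low bits 0b1101 of a 24-bit result needs a 20-place left shift, and the exponent is reduced by 20 to compensate.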
The main difference between LOD and LOP algorithms is that LOD method
detects the leading one after the addition/subtraction operation has taken place while
the LOP method predicts the leading one in parallel with the addition/subtraction
computation. This is illustrated in Figure 4.2. Usually, the LOP method requires a
correction circuit.
The LOP method has the advantage of reduced latency at the expense of the added area of the prediction and correction logic. A massive post-normalization shift can occur only when the effective operation is subtraction and the two exponents have a difference of 0 or 1. Making use of this idea, the two path algorithm implements two
parallel data-paths one for when the exponent difference is equal to 0 or 1 and the
effective operation is subtraction called the Close Path and another for all other cases
called the Far Path. In the two path algorithm, the latency is reduced by removing
the pre-normalization from the close path and removing the LOD or LOP from the far
path. This comes at the expense of increased area due to the dual path
implementation.
The two path algorithm is faster than the LOD algorithm and experiences less
latency, but takes more area and consumes more power [26].
The third algorithm, the three path algorithm, has three datapaths, of which only one is operational at a time.
Two of these paths have the exact same functionality as the far and close data paths in
the two path algorithm. The third datapath deals with NaN and infinity values along
with the case when the exponent difference is greater than the width of the mantissa.
Chapter 5
In this chapter, the architecture of the proposed FPU is explained. First, the design specifications of the proposed FPU are discussed, starting from the implemented IEEE standard and the power considerations, then going through the design methodology and finally thoroughly discussing the proposed design. For both the FP Multiplier and Adder/Subtractor Units, the block diagrams are explained block by block, and behavioral simulations for each block and the complete units are given.
The FPU deals with single precision floating point numbers that are
represented as specified by the IEEE 754 standard. The exact specifications of the
floating point numbers dealt with in the proposed FPU are explained in this section.
Denormalized numbers are generally rare and require complicated hardware to handle, so in the proposed FPU a denormalized input is treated as a zero and dealt with accordingly. The reason for excluding denormalized numbers is the large overhead in taking care of these numbers, especially for the multiplier [9, 10]. They are commonly excluded from high-performance systems; for example, the Cell Broadband Engine does not support denormalized numbers.
Special Values
Zero: Since a zero cannot be directly represented with a '1' implied bit, one must consider the zero when implementing a floating point unit. In the proposed design, zero operands are detected at an early stage and handled explicitly.
NaN: NaN values are considered rare and require a lot of hardware to deal with.
Exceptions
The exceptions most likely to occur in floating point addition and multiplication are overflow and underflow. In the proposed design, they are honored only in the FP Multiplier Unit where they are more likely to occur. A detected overflow or underflow sets an appropriate flag that propagates to the final result.
Rounding
REN is the most common mode used in software and hardware implementations of floating point arithmetic and is the rounding mode adopted in the proposed design.
The guard, round and sticky bits were added to the mantissas in both the
multiplier and adder/subtractor units to increase the accuracy of the stored result.
In multiplication, a zero input means that the result will also be a zero. In such a case there is no need for any calculations, and performing them would only mean unnecessary power consumption. To avoid wasting this power, both input operands are checked for zeros at a very early stage in the design. If one or both operands are found to be zero, a zero flag is set. Blocks in the design are written to operate only if the zero flag is not set.
Two cases have to be considered for zero detection in addition/subtraction, which are a zero input and an expected zero output.
Input Zero Detection: If one or both of the input operands is a zero, the result can be directly determined to be the other operand or a zero respectively. The only operation necessary then would be determining the output sign, making no further calculations necessary. So in the proposed design, the input operands are checked for zeros at an early stage and the appropriate zeros flag is set accordingly to identify whether one or both inputs are zeros. Again, blocks are designed to operate only if these flags indicate that the two input operands are not zeros.
Output Zero Detection: If both input operands are non-zeros, a zero output can still result if both inputs are equal and the effective operation turns out to be subtraction. In the proposed design, the two input operands are compared at an
early stage and an aequalb flag is set when they are equal. The zeros flag is set if
the effective operation is subtraction and the aequalb flag is set in a manner to
indicate a zero result. Then again in such a case, unnecessary calculations are
avoided.
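The flag logic described above can be summarized in a small Python sketch. This is a behavioral model under the thesis's own assumptions (an all-zero exponent marks a zero, since denormals are treated as zeros); the function and flag names are illustrative:

```python
def zero_flags(exp_a: int, exp_b: int, op_a: int, op_b: int, subtract: bool):
    """Early zero detection for the adder/subtractor: one flag per zero
    input, plus a predicted zero output when the effective operation is
    a subtraction of equal operands (x - x = 0)."""
    a_zero = exp_a == 0                    # all-zero exponent -> treated as zero
    b_zero = exp_b == 0
    aequalb = op_a == op_b                 # full-operand comparison
    zero_out = subtract and aequalb        # skip the datapath entirely
    return a_zero, b_zero, zero_out
```

Downstream blocks would be gated on these flags so that no switching activity (and hence no dynamic power) is spent computing a result that is already known.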
Floating point operations can easily be broken down into several sub-operations, which makes pipelining suitable for the implementation of FPUs. Actually, all
high speed computers have pipelined arithmetic units since pipelining allows for a
faster clock cycle and increased data throughput at small expense to latency from the
extra latching overhead [25]. Because FPGAs are register-rich, this is usually an
advantageous structure for FPGA design since the pipeline is created at no cost in
terms of device resources. The flip flops introduced by pipelining typically occupy the
unused flip flops within the logic cells that are already used for implementing the
design.
Thus in order to increase the design speed, both proposed units were pipelined by breaking them down into simple modules with registers placed in between. The number of registers placed between the different modules depended on each module's delay time. The overall design has both an input and an output register to synchronize the input and output data. In designing the proposed FPU, a top-down approach was used. At first an overview of the complete system was made
to gain a firm understanding of its operation and required specifications. Then the
design was divided into modules which were further broken down into smaller sub-
modules that performed simple specific operations. Each of the sub-modules was
optimized and tested separately before optimizing and testing the complete design.
5.2.1 Introduction
The simplified block diagram of the FP Multiplier Unit is shown in Figure 5.1, where operands A and B are the single precision inputs and result is the single precision output.
The 32 bit input operands are initially unpacked to sign, exponent and mantissa,
where each will be manipulated differently throughout the design. The exponents of
both operands are inspected to check for an input zero in which case the zeroflag is
set. Otherwise, if both inputs are non-zero operands, the zeroflag is unset, the exponents are added and the mantissas are multiplied. Multiplication here is an unsigned 24 by 24 bit multiplication whose 48 bit resultant can be either directly in the normalized form or, if a carry bit occurs, require a one bit shift to the right. So the resultant mantissa is post normalized if necessary and the
exponent is adjusted accordingly and tested for possible overflow or underflow. Now,
the 48 bit mantissa is rounded to fit it in the specified number of bits and if necessary
the resultant exponent is adjusted and re-checked for possible overflow or underflow.
Finally, the 32 bit result is given in an IEEE compliant format along with the overflow and underflow flags.
The bottleneck of the mantissa multiplication is the time consuming process of adding the 24 partial products that are calculated by the successive multiplication operations. As in the multiplication algorithms that were explained in Chapter 4, the main concept adopted to increase the multiplication speed is to reduce the number of partial fractions to be added in order to break down the long carry chain involved. The proposed algorithm divides the multiplication operation into several smaller multiplications performed in parallel whose results are appropriately manipulated to give the final 48 bit resultant. This in turn is performed by slicing up the 24 bit input mantissas of operands A and B into smaller blocks and performing the multiplication on these blocks, hence came the term Block Multiplication.
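The Block Multiplication idea can be modeled in a few lines of Python. This is an integer-level sketch only (the thesis implements it as parallel pipelined VHDL hardware), with the slicing into 8-bit blocks B2, B1, B0 as described in the text:

```python
def block_multiply_24x24(a: int, b: int) -> int:
    """24x24 unsigned multiply via three 24x8 block products, each of
    which is itself built from three 8x8 products, mirroring the slicing
    of both mantissas into 8-bit blocks."""
    b_blocks = [(b >> (8 * i)) & 0xFF for i in range(3)]   # B0, B1, B2
    partial_fractions = []
    for bi in b_blocks:
        # 24x8 product: three 8x8 products of the A slices, recombined
        pf = sum((((a >> (8 * j)) & 0xFF) * bi) << (8 * j) for j in range(3))
        partial_fractions.append(pf)
    pf0, pf1, pf2 = partial_fractions
    return pf0 + (pf1 << 8) + (pf2 << 16)   # align and add -> 48-bit result
```

In hardware, the nine 8x8 products run in parallel, so the serial carry chain of a full 24x24 array multiplier is broken into much shorter ones.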
Figures 5.2(a) and 5.2(b) illustrate the detailed block diagram of the proposed FP Multiplier Unit. In the following sub-sections, each module shown in these figures is discussed in detail. Finally, the behavioral simulation of the entire FP Multiplier Unit is given. All modules were written in VHDL using the FPGAdv 8.1 Mentor tool and were behaviorally simulated to verify their correct operation using ModelSim 6.3a.
5.2.2 Zero Detect Module:
The zero detect unit is responsible for three main tasks, which are:
1. Unpacking: Separating the sign, exponent and mantissa of the input operands, since each is dealt with differently in the FP Multiplier Unit.
a. Sign Bits: The input signs are XORed to determine the sign of the result.
2. Input zero detection: The exponents of both input operands are checked. If one or both of the exponents is found to be all zeros, the zero flag is set. Since a zero floating point number has an all zeros exponent and mantissa, and denormalized numbers (all zeros exponent with a non-zero mantissa) are treated as zeros in the proposed design, checking only the exponent of the input operands serves to detect zero inputs, and the zero flag is set accordingly.
3. Setting the implied bit: The implied bit that was omitted from each stored mantissa is restored so that the arithmetic operations can be performed correctly. The implied bit is always restored as a '1' since, in the proposed design, all non-zero numbers are normalized.
Figures 5.3 and 5.4 show the symbol and behavioral simulation of the Unpack Module. The outputs appear after two clock cycles due to the presence of an input and an output register within the module. The simulation illustrates how each 32 bit operand is broken into sign, exponent and mantissa while the implied bit is set in the mantissa. For non-zero operands the zeroflag is unset. On the other hand, for the case of a zero or a denormalized number, like the cases at 300ns and 400ns where B = (00000000)H and B = (00000777)H respectively, the zeroflag is set.
Figure 5.4 Behavioral Simulation of the Unpack Module in the FP Multiplier Unit
5.2.3 Add Exponent Module:
This module is mainly responsible for finding the resultant exponent. So in this module, the two exponents are added and a 127 bias is subtracted to compensate for the bias being added in both exponents. The resultant exponent is then stored in 10 bits; that is, two carry bits are added to the left to allow for the detection of an exponent that is either too small or too large. Since the 8 bit exponent of a single precision floating point number is an unsigned number, the 10th bit being '1' indicates an underflow. An overflow is detected when the 10th bit is '0' and the 9th bit is '1'. The actual checking for overflow or underflow is performed later in the Exception Detection Sub-Module.
In this module, the resultant sign is also calculated using a simple XOR operation.
The mantissa of operand B is also sliced into three 8-bit blocks here to prepare it for the block multiplication performed in the following module.
Figures 5.5 and 5.6 show the symbol and behavioral simulation of the Add Exponent Module. After the arst signal (asynchronous reset) becomes inactive at 100ns, the input exponents expa = 238 = (EE)H and expb = 51 = (33)H are added and the 127 bias is subtracted, with the resultant exponent appearing one clock cycle later at 200ns as (0A2)H. The resultant sign is also determined, where signres = signa XOR signb = '0' XOR '1' = '1', as appearing at 200ns. The input mantissa B is broken up into the three 8 bit slices, namely mantb2, mantb1 and mantb0.
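The exponent arithmetic of this module, including the two extra check bits, can be sketched in Python. This is an illustrative model (the function name is an assumption), reproducing the (EE)H + (33)H example from the simulation:

```python
def add_exponents(expa: int, expb: int) -> int:
    """Add two excess-127 exponents, keeping a 10-bit result so that two
    extra upper bits remain available for overflow/underflow checking."""
    return (expa + expb - 127) & 0x3FF     # 10-bit two's-complement result

print(hex(add_exponents(0xEE, 0x33)))      # prints: 0xa2
```

With the masking to 10 bits, a negative (underflowed) result sets the 10th bit and a result above 8 bits sets the 9th bit, which is exactly what the later exception check inspects.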
Figure 5.5 Symbol of the Add Exponent Module in the FP Multiplier Unit
This module is responsible for performing the unsigned multiplication of the two 24 bit mantissas. The multiplication was not performed as a single operation using a simple multiplication statement written in VHDL; this was mainly in order to apply the proposed Block Multiplication algorithm and break the long carry chain involved.
In order to implement the Block Multiplication algorithm, the two 24 bit input mantissas were sliced into three 8 bit blocks each, so that Mantissa A = A2A1A0 and Mantissa B = B2B1B0, with A2, A1, A0, B2, B1 and B0 each being an 8 bit block. In order to perform the 24 by 24 bit multiplication, three 24 by 8 bit multiplications are performed in parallel, as shown in Figure 5.7. Within each of the three multiplication operations, three 8 by 8 bit multiplications are actually performed, where the specific B block is multiplied by one of the three blocks of mantissa A. Each of the 8 by 8 bit multiplications gives a 16 bit
result. These three 16 bit results are appropriately manipulated, as illustrated in Figure
5.8, to give the 32 bit resultant expected from the 24 * 8 multiplication. The resulting
32 bit numbers from each of the three main operations of Figure 5.7 are calculated in parallel. The 32 bit resultants from the multiplication of Mantissa A with B1 and B2 are shifted left by 8 and 16 places respectively, then added to the 32 bit resultant from the multiplication of Mantissa A with B0 to give the final 48 bit result of the 24 by 24 bit multiplication. The detailed block diagram of the implemented Block Multiplier is shown in Figure 5.9.
5.2.4.1 Multiplier (8 by 8) Sub-Module
This sub-module multiplies the three 8 bit slices of the mantissa of operand A with one of the three different slices of the mantissa of operand B. This is performed as three 8 by 8 bit multiplications giving three 16 bit outputs. As shown in Figure 5.9, three instances are used from this sub-module, one for each slice of the mantissa of operand B.
5.2.4.2 Partial Fraction Adjust Sub-Module
This sub-module accepts the three 16 bit outputs from the Multiplier (8 by 8)
Sub-Module and appropriately manipulates them, as discussed shortly before this and
as illustrated in Figure 5.8, to give the 32 bit partial fraction resulting from the corresponding 24 by 8 bit multiplication. As shown in Figure 5.9, there are three instances of this sub-module, one following each instance of the Multiplier (8 by 8) Sub-Module. The outputs from these instances are PF2, PF1 and PF0 respectively.
5.2.4.3 Add Partial Fractions Sub-Module
This sub-module accepts the three 32 bit partial fractions, appropriately shifts
them then adds them together. Since the Multiplier Module is dealing with 8 bit slices
to perform the multiplication operation, thus to appropriately align the partial fractions for their addition, PF1 and PF2 are shifted left by 8 and 16 bits respectively as shown
in Figure 5.10.
Figure 5.10 Shift Operations Performed to Align the Partial Fractions for Addition
In order to add the three partial fractions PF2, PF1 and PF0 together, each of them
was first divided into two parts most significant (MS) and least significant (LS) as
shown in Figure 5.11. Then all the least significant parts were added together and the
result stored in 26 bits (24 bits and two carry bits where maximum possible carry is
“10”), and all the most significant parts were added together and the result stored in
24 bits. Performing the addition in such a manner leads to the increase of the overall
speed of the design by breaking up the long carry chain involved in the addition of the
32, 40 and 48 bits long partial fractions PF0, PF1 and PF2 respectively.
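The split addition described above can be modeled in Python. This is an arithmetic sketch only, checking that splitting each aligned partial fraction at bit 24 into least significant (LS) and most significant (MS) parts still yields the correct sum (the function name is an assumption):

```python
def add_partial_fractions(pf0: int, pf1: int, pf2: int) -> int:
    """Add the three partial fractions by splitting each aligned fraction
    at bit 24 into LS and MS parts, so the two narrower additions break
    the long carry chain of a direct 48-bit addition."""
    aligned = [pf0, pf1 << 8, pf2 << 16]           # align for addition
    ls_sum = sum(p & 0xFFFFFF for p in aligned)    # fits in 26 bits (24 + 2 carry)
    ms_sum = sum(p >> 24 for p in aligned)         # fits in 24 bits
    return (ms_sum << 24) + ls_sum                 # recombine into the 48-bit result
```

Because the LS and MS sums are independent until the final recombination, the two additions can proceed in parallel in hardware.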
Figure 5.11 Dividing the Partial Fraction to prepare them for Addition.
The final sub-module combines the two resultants from the addition of the least and most significant parts of the partial fractions, that were given by the previous sub-module, in order to give the final 48 bit resultant of the multiplication.
Figure 5.12 shows the block diagram of the Multiplier Module. Figures 5.13 to 5.16 show the behavioral simulations of its sub-modules. Each of the figures shows the inputs and outputs of a particular sub-module.
Figure 5.13 Behavioral Simulation of the Multiplier (8 by 8) Sub-Module
Figure 5.15 Behavioral Simulation of the Add Partial Fractions Sub-Module
This module is responsible for making sure the output mantissa is in the normalized form and for detecting exponent overflow or underflow.
5.2.5.1 Post Normalize Sub-Module
The multiplication of the two 24 bit mantissas gives a 48 bit resultant mantissa that has a radix point to the right of the two most significant bits, i.e. the 48th and the 47th bits. Since both multiplied mantissas are normalized, that is their most significant bit is a '1', either the 48th or the 47th bit of the resultant must be a '1'. Accordingly, to make sure the resultant mantissa is in the normalized form, these two bits are checked to locate the most significant '1', which is to be defined as the implied bit. If the 48th bit is detected to be '1', the mantissa is normalized by being shifted right by one place and the exponent is incremented by 1 to compensate for such a shift. Otherwise, if the 47th bit is detected to be '1', the resultant mantissa is already in the normalized form and no shifting is required. In both cases, the detected '1' that is identified as the implied bit is dropped. Finally, the resultant mantissa is truncated to 26 bits with the least significant bit being the sticky bit, which is set to '1' if any of the dropped bits is a '1'.
Figures 5.17 and 5.18 show the symbol and behavioral simulation of the Post Normalize Sub-Module. The mantissa appearing as (000001)H at 200ns was shifted right by one bit, the implied bit has been dropped and the appropriate sticky bit has been added, '1' in this case since there is a dropped '1' bit. The exponent has been incremented from (33)H to (34)H to account for the one bit shift to the right in the resultant mantissa. At 300ns, the input is (400000000001)H, so the carry bit is zero and no shifting of the mantissa or increment of the exponent is required. The input at 400ns is the same as that at 300ns except that the least significant bit changed from '1' to '0'. So the output mantissa at 500ns has a sticky bit with a value '0' since all the dropped bits are '0's.
Figure 5.17 Symbol of the Post Normalize Sub-Module in the FP Multiplier Unit
In the Exception Detection Sub-Module, the resultant exponent is checked and any overflow or underflow is detected. Checking the exponent was performed after the Post Normalize Sub-Module in order to take into account the case in which the normalization operation increments the exponent. The checking is performed by inspecting the 10th and 9th bits of the resultant 10 bit exponent calculated earlier in the Add Exponent Module. If the 10th bit is found to be a '1', the underflow flag is set; if not and the 9th bit is found to be a '1', then the overflow flag is set. Otherwise, the exponent is adjusted to 8 bits by dropping these two bits and the overflow and underflow flags are left unset.
Figures 5.19 and 5.20 show the symbol and behavioral simulation of the Exception Detection Sub-Module. An exponent of (022)H = (00 0010 0010)B at 200ns leads to both the overflow and underflow flags being unset at 300ns. At 300ns an exponent (222)H = (10 0010 0010)B sets the underflow flag at 400ns, where the 10th bit in the exponent is a '1'. A (122)H = (01 0010 0010)B exponent at 400ns sets the overflow flag at 500ns, where the 10th bit is a '0' while the 9th bit is a '1'.
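The bit-level check above can be written out as a short Python sketch, using the same three exponent values as the simulation. The function name is illustrative; the bit positions follow the module description:

```python
def check_exponent(exp10: int):
    """Inspect the 10-bit resultant exponent from the Add Exponent Module:
    the 10th bit set means the biased exponent went negative (underflow);
    otherwise the 9th bit set means it exceeded 8 bits (overflow)."""
    underflow = bool(exp10 & 0x200)                      # 10th bit
    overflow = not underflow and bool(exp10 & 0x100)     # 9th bit
    return overflow, underflow, exp10 & 0xFF             # drop the two check bits
```

For (022)H neither flag is set, (222)H sets underflow and (122)H sets overflow, matching the simulation described above.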
Figure 5.19 Symbol of the Exception Detection Sub-Module in the FP Multiplier Unit
5.2.6 Rounding Module
This module is responsible for rounding the 26 bit resultant mantissa to 23 bits using the REN technique, then rechecking for possible overflow due to rounding and finally giving the 32 bit resultant in IEEE format. This is performed through three sub-modules.
5.2.6.1 REN Sub-Module
Within this sub-module, the 26 bit resultant mantissa is rounded to 23 bits using the REN technique, which inspects the guard, round and sticky bits to determine whether the mantissa will be incremented (Increment), rounded to the nearest even number (Tie to Nearest Even) or left unchanged, equivalent to simply truncating the guard, round and sticky bits (Truncate). All the possible cases are summarized in Table 5.2. In this sub-module also, a recheck flag is set for the case of increment rounding and passed to the following sub-module, the Overflow Recheck Sub-Module. The reason behind this flag is explained in the following sub-section.
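The rounding decision can be sketched in Python. This is a behavioral model of round-to-nearest, ties-to-even on a 26-bit mantissa (23 kept bits plus guard, round and sticky); the function name is an assumption:

```python
def ren_round(mant26: int):
    """Round a 26-bit mantissa (23 result bits + guard, round, sticky)
    to 23 bits using round-to-nearest, ties-to-even (REN)."""
    kept = mant26 >> 3                      # the 23 bits to keep
    guard = (mant26 >> 2) & 1
    rnd = (mant26 >> 1) & 1
    sticky = mant26 & 1
    # increment when strictly above the halfway point, or on a tie
    # (guard set, round and sticky clear) when the kept LSB is odd
    increment = bool(guard and (rnd or sticky or (kept & 1)))
    if increment:
        kept += 1                           # may wrap past 23 bits: recheck flag
    return kept & 0x7FFFFF, increment
```

The second return value plays the role of the recheck flag: when an all-'1's mantissa is incremented it wraps to all zeros, which is exactly the case the Overflow Recheck Sub-Module must handle.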
Table 5.2 Rounding Action Based on Guard, Round and Sticky Bits
Figure 5.21 shows the symbol of the REN Sub-Module. Figures 5.22 and 5.23
show the behavioral simulation of the REN Sub-Module for an even and odd resultant
mantissa respectively. If the guard (GB), round (RB) and sticky (SB) bits are "100", the mantissa is expected to remain unchanged in the simulation of Figure 5.22 (case of even mantissa) and to be incremented to tie it to even in the simulation of Figure 5.23 (case of odd mantissa). For ease of illustration, the results of Figures 5.22 and 5.23 are summarized in Tables 5.3 and 5.4 respectively.
Table 5.3 REN Sub-Module Behavioral Simulation Results for Even Mantissa
Table 5.4 REN Sub-Module Behavioral Simulation Results for Odd Mantissa
5.2.6.2 Overflow Recheck Sub-Module
This sub-module sets the overflow flag in two cases:
1. If the resultant exponent is all '1's, since an all '1's exponent is reserved by the standard for special values.
2. If the recheck flag, given by the previous sub-module, is set and the mantissa is all zeros. That is because an all zeros mantissa results from the case of a mantissa of all '1's that has been incremented in the rounding operation. Such a mantissa has a '1' carry bit, so it needs normalizing, which is accounted for by incrementing the exponent and rechecking it for overflow.
Figures 5.24 and 5.25 show the symbol and behavioral simulation of the Overflow Recheck Sub-Module respectively. Case 1 is shown when an all ones exponent (FF)H is introduced at 200ns, thus setting the overflow flag at 300ns. Case 2 is shown at 400ns where the mantissa is all zeros and the recheck flag is set; the exponent (33)H is accordingly incremented, with the result appearing at 500ns.
Figure 5.25 Behavioral Simulation of the Overflow Recheck Sub-Module
in the FP Multiplier Unit
5.2.6.3 Final Sub-Module
This sub-module is responsible for appending the sign bit, the 8 bit exponent and the 23 bit mantissa together to give the single precision floating point multiplication result in the IEEE format.
Figures 5.26 and 5.27 show the symbol and behavioral simulation of the Final Module respectively. A '1' sign, a (78)H exponent and a (2AAAAA)H mantissa at 100ns are appended to give the 32 bit final result (BC2AAAAA)H at 200ns. A set zeroflag at 400ns results in an all zeros output at 500ns. Also, a set overflow or underflow flag at 300ns and 500ns results in an all zeros output at 400ns and 600ns respectively.
Figure 5.27 Behavioral Simulation of the Final Module in the FP Multiplier Unit
After designing each of the FP Multiplier Unit modules and simulating them to assure their correct functionality, they were all put together, and the complete FP Multiplier Unit behavioral simulation is shown in Figure 5.29 for the test bench in Table 5.5. For example, the output of the operation 25.8 * -7.4 is (C33EEB85)H.
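This test-bench value can be cross-checked against IEEE single precision rounding with a short Python sketch (`fp_mul_hex` is an illustrative helper, not part of the design):

```python
import struct

def fp_mul_hex(a: float, b: float) -> str:
    """Multiply two numbers, round the product to IEEE single precision
    and return the 32-bit result as an upper-case hex string."""
    return struct.pack('>f', a * b).hex().upper()

# fp_mul_hex(25.8, -7.4) reproduces the simulated result 'C33EEB85'
```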
Figure 5.29 Behavioral Simulation of the FP Multiplier Unit
5.3 Proposed FP Adder/Subtractor Unit
5.3.1 Introduction
The block diagram of the FP Adder/Subtractor Unit is shown in Figure 5.30, where operands A and B are the single precision inputs. The unit starts by unpacking the sign, exponent and mantissa of both input operands so that each can be dealt with separately. The inputs are then swapped if necessary to ensure that operand B carries the smaller operand, which is pre-normalized before the addition or subtraction operation is performed, while operand A carries the larger operand whose exponent is set as the resultant exponent. Zero detection is performed next, where the zeros flag is set appropriately to indicate whether none, one or both of the operands are zero. If both operands are non-zero, the operation proceeds by pre-normalizing the smaller operand, operand B, then finding the resultant sign and effective operation, performing that effective operation, then post normalizing (using the LOD algorithm) and rounding (using the REN technique) the resultant mantissa. Finally, the resultant sign, exponent and mantissa are appended together to give the final result.
As mentioned earlier, in the FP Adder/Subtractor Unit the post normalization process is the bottleneck of the design. In Chapter Three, the commonly used architectures of the FP Adder/Subtractor Unit were introduced, and it was shown that they mainly differ in when the leading '1' bit is detected. It was either detected after the resultant was
integrating both paths and using the optimum to perform the operation as in the two-
that was deeply pipelined to achieve a high maximum operating frequency. The LOD
of the inputs in order to predict the position of the leading one and then error
ii. Area efficient due to its simple one-path data flow, as opposed to the two-path and three-path implementations, which both consume a very large area and require the
Figure 5.31 illustrates the detailed block diagram of the FP Adder/Subtractor Unit. All modules were written in VHDL using the FPGAdv 8.1 Mentor Tool and were behaviorally simulated to verify their correct operation using ModelSim 6.3a.
5.3.2 Unpack Module
This first module is responsible for two main tasks, which are:
1. Unpacking the input operands: The module breaks each 32 bit input into its sign, exponent and mantissa fields so that the upcoming modules deal with the signs, exponents and mantissas of these two numbers separately.
2. Checking if the input operands are equal: The two operands are said to be equal if both their exponents and mantissas compare as equal. An aequalb flag would then be set to be used in upcoming modules for output zero detection if the effective operation is found to be subtraction.
Figures 5.32 and 5.33 show the symbol and behavioral simulation of the Unpack Module respectively. Like in the FP Multiplier Unit, this is the first block in the FP Adder/Subtractor Unit, thus it has both an input and an output register, causing the output to appear after two clock cycles. The simulation illustrates how each 32 bit input is broken into sign, exponent and mantissa. The inputs A=(AAAA AAAA)H and B=(BBBB BBBB)H at 100ns were broken down to signa='1', expa=(55)H and manta=(2AAAAA)H and signb='1', expb=(77)H and mantb=(3BBBBB)H at 200ns. At 200ns, both inputs were equal, so the aeqb flag was set at 400ns.
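The field extraction can be sketched behaviorally in Python (the module itself is VHDL; `unpack_fp` is an illustrative name):

```python
def unpack_fp(word: int):
    """Split a 32-bit IEEE single precision word into (sign, exponent, mantissa)."""
    sign = (word >> 31) & 1          # bit 31
    exponent = (word >> 23) & 0xFF   # bits 30..23
    mantissa = word & 0x7FFFFF       # bits 22..0
    return sign, exponent, mantissa
```

Applied to the simulation inputs, (AAAAAAAA)H yields ('1', (55)H, (2AAAAA)H) and (BBBBBBBB)H yields ('1', (77)H, (3BBBBB)H).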
Floating point addition or subtraction can only be performed if both the input operands are aligned, i.e. have equal exponents. If the two exponents are not equal, the smaller operand has to be pre-normalized first before the addition or subtraction takes place. To prepare the operands for the addition/subtraction operation, the following steps must be performed:
1. Determining the smaller input operand.
4. Shifting right the mantissa of the smaller input a number of times equal to the difference between the two input exponents.
The Swap Module is the module responsible for determining which input is the smaller operand and placing it as operand B to be pre-normalized in the Pre-normalize Module. So the exponents of the two input operands are compared to determine the smaller one; if needed, the contents of both operands are swapped and a swap flag is set to be taken into consideration when determining the effective operation later in the Adder Module.
Figures 5.34 and 5.35 show the symbol and behavioral simulation of the Swap Module respectively. At 100ns, the exponent of operand A=(AA)H is smaller than that of operand B=(BB)H, causing a swap operation to take place at 200ns and setting the swap flag. At 200ns, when the exponents of operands A and B are equal, no swap takes place.
Figure 5.35 Behavioral Simulation of the Swap Module in the FP
Adder/Subtractor Unit
Zero Detect Module is responsible for three main tasks, which are:
1. Determining if one or both of the input operands are zero and setting the zeros flag accordingly.
2. In the case where both operands are non-zero, calculating the difference between their exponents to be used as the shift amount in the Pre-normalize Module.
The zero detection is performed here, as opposed to being in the Unpack Module like in the FP Multiplier Unit, to simplify the zero detection operation by making use of the aequalb flag set in the Unpack Module if both inputs were equal, and the fact that the Swap Module ensured that the smaller input is stored as operand B. Thus, by checking only the exponent of operand B and the aequalb flag, the appropriate zeros flag can be set as summarized in Table 5.6. Again like in the FP Multiplier Unit, checking only the exponent of the input operands serves in detecting a denormalized number as a zero.
Figures 5.36 and 5.37 show the symbol and behavioral simulation of the Zero Detect Module respectively. At 100ns, both mantissas are non-zero, so at 200ns the zeros flag is unset and the difference between the exponents is calculated as diff = expa - expb = (AA)H - (33)H = (77)H. At 300ns, both operands are equal to zero, so at 400ns the zeros flag is set and the difference is zero. At 400ns, only operand B is zero, so at 500ns the zeros flag is updated accordingly, but the difference is still zero.
Figure 5.36 Symbol of the Zero Detect Module in the FP Adder/Subtractor Unit
Figure 5.37 Behavioral Simulation of the Zero Detect Module in the FP
Adder/Subtractor Unit
The Pre-normalize Module is responsible for adjusting the two mantissas for the addition/subtraction operation through the following tasks:
1. Fitting the mantissas into 28 bits, where a carry bit (C) and the implied bit (I) are added to the left of the mantissas, and guard (G), round (R) and sticky (S) bits are added to their right to increase the precision of the addition/subtraction operation and to be used in the Rounding Module. The format of the 28 bit mantissa is:
C | I | 23 bit Mantissa | G | R | S
2. Pre-normalizing the mantissa of operand B, the smaller operand, by shifting it right a number of times equal to the difference between the two input operand exponents. The shifting is performed by a barrel shifter implemented totally in VHDL. Barrel shifting has the advantage of shifting the data by any number of bits in one operation, whereas if a simple shifter were used, shifting by n bit positions would require n clock cycles. This makes barrel shifting the most suitable for the shifting operations required in floating point operations, especially addition/subtraction.
When considering the possible shift that may be needed in this module, we found that theoretically the shift can be any number between 1 and 253, depending on the difference between the exponents of the input operands. Practically speaking though, any shift greater than 25 would just lead to the dropping of all the mantissa bits. In such a case, the shifting operation leads to an all-'0's mantissa with the exception of the sticky bit, which carries the logical ORing of all the dropped bits and thus would always store '1'. So in the implementation, the shift operation is performed only if the difference between the two exponents is less than or equal to 25. Otherwise, the mantissa is set to all '0's with a sticky bit of value '1'.
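The shift-with-sticky behavior can be sketched in Python (the design uses a VHDL barrel shifter; `prenormalize` and its argument names are illustrative):

```python
def prenormalize(mant28: int, diff: int) -> int:
    """Shift the 28-bit mantissa right by the exponent difference `diff`.

    All dropped bits are ORed into the sticky bit (bit 0). For a shift
    greater than 25 every significant bit would be dropped, so the result
    is all '0's except for a sticky bit of '1' (assuming a non-zero operand).
    """
    if diff > 25:
        return 1                      # all '0's with sticky = '1'
    dropped = mant28 & ((1 << diff) - 1)   # bits that fall off the right
    shifted = mant28 >> diff
    if dropped:
        shifted |= 1                  # sticky = OR of the dropped bits
    return shifted
```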
The Pre-normalize Module is designed to operate only if the zeros flag indicates that both input operands are non-zero. Otherwise, no shifting is performed and the inputs are just adjusted into 28 bits with the carry, guard, round and sticky bits appended.
Figures 5.39 and 5.40 show the symbol and behavioral simulation of the Pre-Normalize Module respectively. At 100ns, the signal diff_i indicates that the exponent difference (and required shift) is (3)H = 3. Thus at 300ns, mantissa B comes out shifted right by three places. At 200ns, when the exponent difference is (1E)H = 30, mantissa B comes out at 400ns as all zeros except for the sticky bit.
The Add/Subtract Module was broken up into three sub-modules to allow for
deeper pipelining thus increasing the overall speed of the design. These three sub-
modules are:
5.3.6.1 Pre-Add/Subtract Sub-Module
This sub-module is responsible for determining the resultant sign and the effective operation to be performed based on the signs of both input operands and the required operation. The resultant sign is determined from the signs of the input operands, the required operation, the swap flag and an agrtb flag (a greater than b flag), as shown in Equation 5.1. The agrtb flag is set in this module if the mantissa of operand A is greater than that of operand B.
Resultant_Sign = ((sw_f AND rop) OR ((NOT sgna) AND (NOT agrtb_f) AND rop)) XOR ((sgna AND agrtb_f) OR (sgnb AND (NOT agrtb_f) AND (NOT rop)))    (5.1)
Figures 5.42 and 5.43 show the behavioral simulation of the Pre-Add/Subtract Sub-Module for the cases of mantissa B greater than mantissa A and vice versa respectively.
Figure 5.42 Behavioral Simulation of the Pre-Add/Subtract Sub-Module in the FP Adder/Subtractor Unit
5.3.6.2 Zeros Flag Update Sub-Module
This sub-module updates the zeros flag to indicate a zero final result if the aequalb flag is set and the effective operation was found to be subtraction.
Figures 5.44 and 5.45 show the symbol and behavioral simulation of the Zeros Flag Update Sub-Module respectively. When the aequalb flag was set and the effective operation was subtraction (eff_oper_i = '1') at 300ns, the zeros flag was accordingly updated.
5.3.6.3 Adder Sub-Module
This sub-module performs the effective addition or subtraction operation to calculate the 28 bit resultant mantissa. The effective operation is performed only if the zeros flag indicates that both input operands are non-zero. Otherwise, the output is directly given depending on the zeros flag, as summed up in Table 5.7.
Figures 5.46 and 5.47 show the symbol and behavioral simulation of the Adder Sub-Module respectively. Calculations were performed only when the zeros flag was "00"; for example, mantissa A=(2222222)H is added to mantissa B. When the zeros flag became "01" at 500ns and "11" at 600ns, no calculations were made and the resultant mantissa was directly given as mantissa A at 600ns and zero at 700ns respectively.
Figure 5.47 Behavioral Simulation of the Adder Sub-Module in the FP
Adder/Subtractor Unit
5.3.7 Post Normalize Module
The Post Normalize Module is responsible for normalizing the resultant mantissa by detecting the most significant '1' bit in the resultant mantissa, shifting the mantissa to set this detected '1' as the implied bit, and finally adjusting the exponent in a manner that accounts for this shift.
The Post Normalize Module is implemented using the LOD algorithm. Although floating point adders implemented using the LOD algorithm are not generally the fastest [7], this was overcome by two means. The first was using barrel shifting to implement the shift operations in this module. The second was breaking up the Post Normalize Module into several sub-modules to allow for its deep pipelining, which proved to significantly improve the design speed. All these sub-modules were designed to operate only if the zeros flag indicated that both input operands were non-zero.
5.3.7.1 Zeros Count Sub-Module
This sub-module is the first step in the LOD algorithm. It is responsible for detecting the most significant '1' in the resultant mantissa. Several techniques were introduced by designers to detect the most significant '1' bit [7]. In the proposed design, the position of the most significant '1' bit is found by simply counting the number of '0' bits, starting from the left of the mantissa, before the first '1' is detected. The number of zeros is then used to normalize the result by shifting the mantissa left by that amount and adjusting the exponent accordingly.
Figures 5.48 and 5.49 show the symbol and behavioral simulation of the Zeros Count Sub-Module respectively. At 100ns, where the zeros flag was unset, the input was (8888888)H, indicating that the carry bit was '1', so the number of zeros came out at 200ns as (00)H. Also, at 200ns the input was (4888888)H, so the carry bit is a '0' followed by a '1' in the position of the implied bit. Technically, the number is already in the normalized form, so again this case outputs the number of zeros at 300ns as (00)H. At 300ns, the input is (2888888)H, so the number of zeros comes out at 400ns as (01)H. At 500ns and 600ns, the zeros flag indicates that one and both inputs are zeros respectively. No shift is necessary in either of these cases, thus the number of zeros is given as (00)H.
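The zeros-count logic can be sketched in Python (a behavioral sketch of the VHDL sub-module; `zeros_count` is an illustrative name):

```python
def zeros_count(mant28: int) -> int:
    """Zeros before the most significant '1' of a 28-bit working mantissa.

    Bit 27 is the carry bit and bit 26 the implied bit; if either is '1'
    the mantissa is already normalized, so the count is zero. The count
    is measured relative to the implied-bit position, i.e. it equals the
    left shift needed to normalize the mantissa.
    """
    if mant28 >> 26:          # carry or implied bit set: already normalized
        return 0
    if mant28 == 0:           # zero result: no shift is performed
        return 0
    return 27 - mant28.bit_length()
```

This reproduces the simulated cases: (8888888)H and (4888888)H give (00)H, while (2888888)H gives (01)H.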
Figure 5.49 Behavioral Simulation of Zeros Count Sub-Module in the FP
Adder/Subtractor Unit
This sub-module is responsible for adjusting the resultant exponent in coordination with the expected shift of the resultant mantissa. This situation has three cases:
1. The resultant mantissa carry bit is '1'. It is considered to be the implied bit, thus it is required to shift the mantissa one place to the right, and the exponent is incremented by one.
2. The resultant carry bit is '0' followed directly by a '1'. The mantissa is already in the normalized form, so the exponent is left unchanged.
3. If none of the above scenarios is satisfied, then the number of zeros to the left of the most significant '1' is subtracted from the resultant exponent to account for shifting the resultant mantissa left.
Figures 5.50 and 5.51 show the symbol and behavioral simulation of the Exponent Adjust Sub-Module respectively. At 100ns, the mantissa has a carry '1', so the exponent is incremented from (29)H to (2A)H, as shown at 200ns. At 200ns, the carry bit is a '0' and the bit in the implied bit position is a '1', so the exponent passes unchanged. At 300ns and 400ns, the number of zeros is 1 and 7, so the exponent comes out decremented accordingly.
Figure 5.51 Behavioral Simulation of the Exponent Adjust Sub-Module in the FP Adder/Subtractor Unit
5.3.7.3 Mantissa Normalize Sub-Module
This sub-module is responsible for performing the final step of the normalization process, which is the shifting of the resultant mantissa and the dropping of the implied bit. The amount of shift depends on the position of the detected most significant '1', that is:
1. If the resultant mantissa carry bit is '1', the mantissa is shifted to the right by one place to set the carry bit as the implied bit. The dropped bit, which is the least significant bit, is passed to the Round Module to be used to update the sticky bit.
2. If the resultant carry bit is '0' followed directly by a '1', no shifting of the mantissa is required.
3. If none of the above scenarios is satisfied, then the resultant mantissa is shifted left a number of times equal to the number of zeros before the most significant '1'.
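The three cases above can be sketched together in Python (a behavioral sketch of the VHDL sub-module; `normalize` and the returned tuple layout are illustrative):

```python
def normalize(mant28: int, zeros: int):
    """Normalize the 28-bit resultant mantissa.

    Returns (normalized mantissa, exponent adjustment, dropped bit):
      case 1: carry '1'   -> shift right one place, exponent +1, LSB dropped
      case 2: implied '1' -> already normalized, nothing changes
      case 3: otherwise   -> shift left by `zeros`, exponent -zeros
    """
    if mant28 & (1 << 27):                                  # case 1: carry set
        return mant28 >> 1, +1, mant28 & 1
    if mant28 & (1 << 26):                                  # case 2: normalized
        return mant28, 0, 0
    return (mant28 << zeros) & ((1 << 28) - 1), -zeros, 0   # case 3
```

Combining the shift with the matching exponent adjustment in one place makes it easy to check that the two sub-modules stay consistent.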
Figures 5.52 and 5.53 show the symbol and behavioral simulation of the Mantissa Normalize Sub-Module respectively. At 100ns and 200ns, the carry bit of the resultant mantissa is '1', so the normalized mantissa appears at 200ns and 300ns shifted one bit to the right, with the drop bit being set to '1' only at 300ns to account for the '1' dropped from the mantissa (4444445)H at 200ns. A shift left by the number of zeros is performed on the resultant mantissa inputs at 300ns, 400ns and 500ns.
Figure 5.53 Behavioral Simulation of Mantissa Normalize Sub-Module in the FP
Adder/Subtractor Unit
Like in the FP Multiplier Unit, the Round Module here is responsible for rounding the 26 bit resultant mantissa to 23 bits using the REN technique and then giving the final result of the addition or subtraction operation in the IEEE single precision format.
5.3.8.1 REN Sub-Module
This sub-module is responsible for rounding the resultant mantissa using the REN mode. The REN mode inspects the guard, round and sticky bits to determine whether the mantissa will be incremented (Increment), rounded to the nearest even number (Tie to Even), or left unchanged by truncating the guard, round and sticky bits (Truncate).
Figures 5.54 and 5.55 show the symbol and behavioral simulation of the REN Sub-Module respectively. In the shown simulation, the mantissa is even until 700ns, and then it becomes odd. The test bench used is similar to that used in the REN Sub-Module of the FP Multiplier Unit, which was summarized in Tables 5.3 and 5.4.
5.3.8.2 Final Sub-Module
This sub-module is responsible for appending the sign bit, the 8 bit exponent and the 23 bit mantissa together to give the final result in the IEEE single precision format.
Figures 5.56 and 5.57 show the symbol and behavioral simulation of the Final Sub-Module respectively. A '1' sign, a (22)H exponent and a (222222)H mantissa at 100ns are appended to give the 32 bit final result (91222222)H at 200ns.
5.3.9 The FP Adder/Subtractor Unit Behavioral Simulation
After designing each of the FP Adder/Subtractor Unit modules and simulating them to assure their correct functionality, they were all put together, and the complete FP Adder/Subtractor Unit behavioral simulation is shown in Figure 5.59 for the test bench in Table 5.8. For example, the output of the operation 12+12 =
Table 5.8 Test Bench of the FP Adder/Subtractor Unit Behavioral Simulation
Chapter 6
The FP Adder/Subtractor and Multiplier Units proposed in the previous chapter were able to operate at high operating speeds, hence surpassing most of the existing designs, as will be illustrated in this chapter. Such designs were a result of continuous optimization for speed.
In this chapter, the synthesis and implementation results are illustrated, discussed and compared with results from the previous work of others. Finally, the power analysis and timing simulations are illustrated for both designs. All synthesis and implementation results were generated with the tool options set for speed.
Initially, the maximum operating speed of each module was found from the synthesis reports for both the FP Adder/Subtractor and Multiplier Units. Although the synthesis reports are not as accurate as the static timing reports given post routing, they still give a good indication of the module's performance, with the advantage of being generated much faster than static timing reports, and thus are sufficient to be used during the initial optimization.
Next, using the generated synthesis reports, critical modules were continuously identified and optimized. Basically, modules were optimized by either breaking them down into smaller pipelined modules or editing the VHDL code for better synthesis. Synthesis options provided by the Xilinx ISE Tool were also efficiently set to achieve the best possible performance of the entire design. The FP Adder/Subtractor and Multiplier Units were then implemented, and the speeds of the separate modules were generated from the static timing report. Critical modules were again identified and optimized whenever possible, in a manner very similar to that just explained. Like the synthesis tool, the Xilinx implementation tool has several options that can dramatically affect the design performance.
The Xilinx implementation tool also allows user defined timing constraints, which help designers reach timing closure in high performance applications [29]. The use of user defined timing constraints also ensures that no timing violation, such as a period, setup or hold violation, occurs in the timing simulation. The fundamental timing constraints used in this work are the period and offset in constraints. The offset in constraint sets the setup time, which specifies the maximum time allowed for data to enter the chip, travel through the logic and arrive at a synchronous element (such as a flip-flop or block RAM) where that pin has a setup requirement before a clocking signal. Thus the offset in constraint is used to make sure that the input data arrives at an appropriate time; hold time can also be set using the offset in constraint. While timing constraints help designs meet their required timing specifications, very tight timing constraints could give results far from desired. In this work, the maximum speed reported by the synthesis tool was used as a guide to set realistic values for both the period and offset in constraints.
In the next section, a brief account of the optimization steps performed for the FP Adder/Subtractor Unit is given. The first implementation of the unit consisted of three pipeline stages, each performing one of the basic floating point addition/subtraction steps: pre-normalizing, adding/subtracting and post normalizing. The speed of this FP Adder/Subtractor Unit was 50 MHz when synthesized to Virtex5. In order to increase the operating speed, further pipelining and optimization of the design were necessary, which was performed by breaking down each of the three modules into smaller, simpler sub-modules with registers inserted between them. For example, the module responsible for addition or subtraction was sub-divided into three sub-modules, which are the Pre-Add Sub-Module, the Zeros Flag Update Sub-Module and the Adder/Subtractor Sub-Module. Each of these new sub-modules was synthesized to determine its maximum operating speed and hence identify the new critical modules.
After applying this technique to all the FP Adder/Subtractor Unit modules and sub-modules, until further pipelining had no positive effect on the overall speed, the operating frequency was just above 255 MHz when synthesized to Virtex5. Further optimization within each module was made wherever possible, mostly by trying to simplify the written code, which sometimes led to an increase in the module's speed.
Next, the design was implemented, and the implementation results, given by the static timing analysis report, were used to locate the remaining critical modules. Where needed, a register at the module's output was added, which raised the speed of the FP Adder/Subtractor Unit further. Finally, Xilinx options along with user defined timing constraints were iteratively adjusted until optimal results were attained. The implementation results of this final design over several FPGA platforms are summarized in the implementation results section below.
6.1.3 FP Multiplier Optimization
For the FP Multiplier Unit, the optimization experience gained with the FP Adder/Subtractor Unit came in handy. The first implementation of the FP Multiplier Unit indicated that there were two critical modules, which were the Add Exponent and the Multiplier Modules. So initially, in order to increase the overall speed, a register was added after the Add Exponent Module. The more critical issue then at hand was the optimization of the Multiplier Module, which in turn led to the evolution of the novel proposed Block Multiplication algorithm. At first, the 24 by 24 multiplier was broken down into three parallel 24 by 8 multipliers. When integrated into the FP Multiplier Unit, the design's speed increased. Multiplications were then further broken down into nine 8 by 8 multiplications.
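The first decomposition step can be checked with a short Python sketch (illustrative only; the actual design is three parallel VHDL multipliers):

```python
def block_multiply(a24: int, b24: int) -> int:
    """24x24 multiplication via three 24x8 partial products.

    b24 is split into three 8-bit blocks; each 24x8 partial product is
    shifted to its block position and the three results are summed.
    """
    assert 0 <= a24 < (1 << 24) and 0 <= b24 < (1 << 24)
    total = 0
    for i in range(3):                      # three parallel 24x8 multiplies
        block = (b24 >> (8 * i)) & 0xFF     # i-th 8-bit block of b24
        total += (a24 * block) << (8 * i)   # align the partial product
    return total
```

The same splitting applied to the 24-bit operand of each 24x8 multiplier yields the nine 8x8 multiplications mentioned above.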
Table 6.1 compares the performance of the FP Multiplier Unit when using the proposed Block Multiplier Module as opposed to using the Simple Multiplier Module [30]. The table summarizes the synthesis results for speed and area when the designs are synthesized to several Virtex FPGA platforms. The comparison shows that when using the Block Multiplier, the FP Multiplier Unit is able to operate at speeds almost double those attained when using the Simple Multiplier. This comes at the price of occupied area, where the opposite occurs: the design using the Block Multiplier occupies almost double the area occupied by the design using the Simple Multiplier.
Table 6.1 Synthesis Results for the FP Multiplier Unit using Proposed Block Multiplier vs. using Simple Multiplier

Platform                    Proposed Block Multiplier      Simple Multiplier
                            Speed (MHz)   Area (Slices)    Speed (MHz)   Area (Slices)
Virtex2p Xc2vp7ff896 -7     296           1038             181           412
Virtex4  Xc4vfx100 -12      461           945              106           343
Virtex5  Xc5vlx110 -3       450           592              217           578
Finally, the design was implemented, and the implementation results, given by the static timing analysis report, showed that the FP Multiplier Unit operated at around 360MHz. Then, like in the FP Adder/Subtractor Unit design, Xilinx options along with user defined timing constraints were iteratively adjusted until optimal results were attained. The implementation results of this final design over several FPGA platforms are given next.
The Xilinx implementation tool was used along with user defined period, setup and hold constraints, iteratively varying the timing constraints until timing closure could no longer be achieved. The final implementation results generated from the static timing report are summarized in Tables 6.2 and 6.3 for several FPGA platforms. It is noticed that Virtex5 uses almost half the number of
slices used by the Virtex2 and Virtex4 FPGAs. This owes to the fact that the Virtex5 slice contains four LUTs and four flip-flops, double the resources of the slices of the earlier families.

Table 6.2 Implementation Results of the FP Adder/Subtractor Unit

Platform                    Speed (MHz)   Area (Slices)   Utilization
Virtex2p Xc2vp7ff896 -7     325           1331            21%
Virtex4  Xc4vfx100 -12      401           1533            3%
Virtex5  Xc5vlx110 -3       442           625             3%
Table 6.3 Implementation Results of the FP Multiplier Unit

Platform                    Speed (MHz)   Area (Slices)   Utilization
Virtex2p Xc2vp7ff896 -7     339           1029            20%
Virtex4  Xc4vfx100 -12      465           1467            3%
Virtex5  Xc5vlx110 -3       472           702             4%
As presented in Chapter 2, the most relevant work in the topic of the design of high speed FPUs is the work of Govindu, Hemmert and Karlstrom [6, 9, 10 and 11]. The basic specifications of their FPUs and their comparison to the specifications of the proposed FPU can be summarized as follows:
Denormalized numbers, infinity or NaN:
Denormalized numbers, infinity and NaN are often excluded from high performance systems due to their rare occurrence and excessive hardware requirements [27]. The FPUs implemented by Govindu [6] and Karlstrom [10, 11] did not support them. As for the FPU of Hemmert [9], it was fully IEEE compliant considering both NaN and infinity, but not denormalized numbers. The proposed FPU deals with denormalized numbers as zero values and signals infinity values as overflow.
Rounding Modes:
Generally, the REN mode is the most common when implementing hardware and software arithmetic operations, although it is the mode requiring the most hardware to implement. On the other hand, truncation is the simplest rounding mode to implement since it only involves truncating the extra bits in the resultant mantissa in order to store it in the assigned storage. Both Govindu [6] and Karlstrom [10, 11] implemented their FPUs using the REN and truncation rounding modes. Hemmert [9] did not give details about the rounding modes used.
Parallelism:
Parallelism in a design serves in improving its overall latency. Both Govindu [6] and Karlstrom [10, 11] made use of parallelism when implementing their FPUs. They both had at least two parallel paths, one for dealing with the exponents and the other for the mantissas. On the other hand, since we were more concerned with speed than latency, each of the proposed units uses a single data path.
Considering the basic specifications of the FPUs in [6, 9, 10 and 11] as compared to the design specifications of the proposed design, it can be concluded that their design specifications are very similar to those of the proposed design. Hence, comparison of the proposed design to the work of Govindu, Hemmert and Karlstrom is valid. When comparing the results from this work to those of Govindu, Hemmert and Karlstrom [6, 9, 10 and 11], a comparison of the architectures used and the optimization techniques is relevant.
Like in this work, Govindu [6] implemented the post normalization operation using the LOD algorithm. He described his FP Adder/Subtractor Unit using VHDL, but also made use of Xilinx Library Cores to describe the adders in both the mantissa addition module and the rounding module within the FP Adder/Subtractor Unit. Govindu [6] used the Xilinx Tool for the implementation of his design to Virtex2 and Virtex4 FPGAs. In order to optimize his design, he used deep pipelining and the synthesis options given by the Xilinx tool. His 19 stage pipelined design operated at 250MHz on Virtex2pro.
Karlstrom [10, 11] described his FP Adder/Subtractor Unit using Verilog. In order to optimize the post normalize module, he used a hardware extensive approach that inspected four bits of the mantissa at a time. He reasoned that inspecting four bits at a time allows for better mapping to the four input LUTs of the Virtex4 FPGA. In order to further optimize his design, he made sure the adder/subtractor was implemented using only one LUT per bit. The use of such techniques led to a design that operated at 288 and 361MHz, with a latency of only 10 clock cycles, for Virtex2pro and Virtex4 respectively. In his later work, he was able to push the performance of his design to 377MHz for Virtex4.
Strangely enough, Hemmert [9] described their design using JHDL. They did not give any detail about their design architecture or how they optimized it. They did compare their work to the floating point units distributed by Xilinx and were superior to them in speed, area and latency alike. Their FP Adder/Subtractor Unit operated at 298MHz and 356MHz on Virtex2pro and Virtex4 respectively.
FP Adder Designs: Maximum Operating Speed (MHz)

Platform       Proposed [30]   Karlstrom [11]   Karlstrom [10]   Hemmert [9]   Govindu [6]
Virtex2p -7    325             278              288              298           250
Virtex5 -12    442             419              NA               NA            NA

*NA: Not Available
6.3.2 Comparison of FP Multiplier Unit Implementation Results
Govindu [6] constructed the 24 by 24 unsigned multiplier using Xilinx cores and made use of the synthesis options to further optimize his design, while Karlstrom [10, 11] used four of the Virtex4's DSP48 blocks to construct a 35 by 35 multiplier. On the other hand, Hemmert [9] did not give any details whatsoever about how they implemented their multiplier.
FP Multiplier Designs: Maximum Operating Speed (MHz)

Platform       Proposed [30]   Karlstrom [11]   Karlstrom [10]   Hemmert [9]   Govindu [6]
Virtex2p -7    339             NA               NA               290           250
Virtex5 -12    469             500              NA               NA            NA

*NA: Not Available
Post route power analysis gives an accurate view of the power breakdown based on the exact resource utilization of the routed design.
The XPower Analyzer integrated within the Xilinx 9.2i version is an early access version and is reported by Xilinx to give incorrect results [20]. So for accurate power analysis, a newer version of XPower Analyzer, which was integrated in Xilinx 11.1, was used. This version of Xilinx only supports Virtex4 and newer FPGAs.
Figures 6.6 and 6.7 illustrate the power consumption with respect to frequency for both units at the default settings, which assume an input/output toggle rate of 100%, i.e. both inputs and outputs toggle every clock cycle.
[Figure: FP Adder/Subtractor Unit power consumption (mW) versus frequency (MHz), 0 to 500 MHz]
[Figure: FP Multiplier Unit power consumption (mW) versus frequency (MHz), 0 to 500 MHz]
Post route simulation is performed using the Xilinx ISE Simulator. The simulator uses the post place and route simulation model along with the Standard Delay Format (SDF) file, containing the true delay information of the design, to simulate real operation of the design. The timing simulation of the FP Adder/Subtractor Unit at 312.5MHz (3.2ns) is shown in Figures 6.4 and 6.5 for the same test bench given in Chapter 5. For example, the output of the operation 1000.05 - 88.01 is given as (447A0333)H.
Figure 6.4 Input Test Bench of the FP Adder/Subtractor at 3.2ns
The timing simulation of the FP Multiplier Unit at 400MHz (2.5ns) is shown in Figures 6.6 and 6.7 for the same test bench given in Chapter 5.
Figure 6.6 Input Testbench for FP Multiplier at 2.5ns
Chapter 7
7.1 Conclusions
Digital signal processors are designed to handle digital signal processing functions. Although in the past the usage of DSPs has been more common, with the development of FPGA technology and DSP blocks in recent years there are more and more applications of their combination in digital signal processing systems, especially when high performance, flexibility, fast time to market, reliability and maintainability are required [31-33]. Since FP addition and multiplication are very common in digital signal processing, the design of high speed FP Multiplier and Adder/Subtractor Units has been presented. Both units were written in VHDL and operate at speeds above 320MHz for Virtex2Pro and 400MHz for Virtex4 and Virtex5 FPGAs, whilst giving an output with every clock cycle. Specifically, the proposed FP Adder/Subtractor operates at 442 MHz while the FP Multiplier operates at 469 MHz when implemented to Virtex5. The power consumption of both the FP Adder/Subtractor and Multiplier Units was analyzed using XPower Analyzer. Finally, post route simulation was performed to verify the operation of both the FP Adder/Subtractor and Multiplier Units.
In the future, the latency of the designs could be reduced by making use of parallelism.
References
[3] M. Reaz, S. Islam and M. Suliman, "Pipeline floating point ALU design using
[4] J. Liang, R. Tessier and O. Mencer, "Floating point unit generation and
[6] G. Govindu, L. Zhuo, S. Choi and V. Prasanna, "Analysis of high performance
[7] Ali Malik, "Design Tradeoff Analysis of Floating-Point Adder in FPGAs", M.Sc.
[8] Claudio Brunelli and Jari Nurmi, "Design and Verification of VHDL Model of a
pp. 349-350.
[10] P. Karlstrom, A. Ehliar and D. Liu, "High performance, low latency FPGA based
Floating Point Adder and Multiplier Units in a Virtex 4," in the 24th Norchip
[11] P. Karlstrom, A. Ehliar and D. Liu, "High performance, low latency FPGA based
Floating Point Adder and Multiplier Units in a Virtex 4," in Computers and
[12] S. V. Siddamal, R. M. Banakar and B. C. Jinaga, "Design of high speed floating
[13] Florent de Dinechin and Bogdan Pasca, "Large multipliers with fewer DSP
[14] Sebastian Banescu, Florent de Dinechin, Bogdan Pasca and Radu Tudoran,
2010.
[15] Florent de Dinechin, Hong Diep Nguyen and Bogdan Pasca, "Pipelined FPGA
[16] Clive Maxfield, The Design Warrior’s Guide to FPGAs: Devices, Tools and
[18] Sunggu Lee, Advanced Digital Logic Design using VHDL, State Machines and
[19] Volnei A. Pedroni, Digital Electronics and Design with VHDL, USA: Morgan
Kaufmann, 2008.
http://www.xilinx.com/company/about.htm
core.com/library/digital/fpga-logic-cells/
[23] Peter Wilson, Design Recipes for FPGAs, Great Britain: Newness, 2007.
[24] IEEE Standards Board, IEEE Standard for Floating-Point Arithmetic, 2008.
[26] Sheetal A. Jain, Low Power Single Precision IEEE Floating Point Unit, Master of
[27] Oh H.-J., Mueller S.M., Jacobi C., Tran K.D., Cottier S.R., Michael B.W., ET
Processor Element of a Cell Processor’, Solid-State Circuits IEEE J., 2006, 41,
http://www.xilinx.com
[30] Lamiaa Sayed Abdel Hamid, Khaled Shehata, Hassan El-Ghitani, Mohamed
[32] “FPGA versus DSP design Reliability and Maintenance.” [Online.] Available :
http://www.dsp-fpga.com
[33] Zhi-Jian Sun and Xue-Mei Liu, “Application of Floating Point and DSP in