TABLE OF CONTENTS…………………………………………………………...……i
LIST OF FIGURES………………………………………………………………….….vi
LIST OF TABLES…………………………………………………………….…......….xi
LIST OF ABBREVIATIONS…………………………..……………………..……….xii
ABSTRACT………………………………………………………………….…….….….1
CHAPTER 1: INTRODUCTION………………………………………...…...……….3
3.3.2.1 Architecture…………………………………………..….21
3.3.2.2 Interconnects Technology……………………………….22
4.2.5 Exceptions…………………………………..…..….…….……....37
5.2.1 Introduction……………………………………..……..…………51
5.3.1 Introduction…………………………………….…….…..………76
5.3.6.1 Pre-Add/Subtract Sub-Module……….….……………….87
6.5.2 FP Multiplier Unit Timing Simulation…………….…..……….114
7.1 Conclusions…….……………..……………….………………….…….116
REFERENCES………………………………………………………………………….118
List of Figures
Unit……………………………………………………………………....56
Figure 5.5 Symbol of the Add Exponent Module in the FP Multiplier Unit………..58
Multiplier Unit………...…………………………………………...…….58
Figure 5.7 The Three Parallel Multiplications Performed in the Block
Multiplication Algorithm……………………………………………..….59
Figure 5.10 Shift Operations Performed to Align the Partial Fractions for Addition...61
Figure 5.11 Dividing the Partial Fractions to Prepare Them for Addition……………..62
Figure 5.17 Symbol of the Post Normalize Sub-Module in the FP Multiplier Unit….67
FP Multiplier Unit………………………………..………………………68
Figure 5.24 Symbol of the Overflow Recheck Sub-Module in the FP Multiplier……72
Figure 5.27 Behavioral Simulation of the Final Module in the FP Multiplier Unit…..74
FP Adder/Subtractor Unit………………………………..…...………… 80
FP Adder/Subtractor Unit……………………….……...………...….…..82
Figure 5.36 Symbol of the Zero Detect Module in the FP Adder/Subtractor Unit…...83
Figure 5.41 Symbol of the Pre-Add/Subtract Sub-Module in the
FP Adder/Subtractor Unit……………………….……………………….87
FP Adder/Subtractor Unit………………………………………………..89
FP Adder/Subtractor Unit……………………………………….……….89
FP Adder/Subtractor Unit……………………...…………..…………….91
FP Adder/Subtractor Unit………………………….………….…………93
FP Adder/Subtractor Unit……………………………………..…………94
FP Adder/Subtractor Unit…………………..………….……...…………94
FP Adder/Subtractor Unit..………………………………………………95
FP Adder/Subtractor Unit………………………………….…..……..….96
FP Adder/Subtractor Unit…………………..………..………….……….97
FP Adder/Subtractor Unit……………………………………….…….…98
List of Tables
Table 5.2 Rounding Action Based on Guard, Round and Sticky Bits….………….69
Table 5.3 REN Sub-Module Behavioral Simulation Results for Even Mantissa…..71
Table 5.4 REN Sub-Module Behavioral Simulation Results for Odd Mantissa…....71
Table 5.8 Test Bench of the FP Adder/Subtractor Unit Behavioral Simulation…. 100
Table 6.1 Synthesis Results for the FP Multiplier Unit using Proposed
List of Abbreviations
FP Floating Point
IC Integrated Circuit
IFFT Inverse Fast Fourier Transform
MC Macro Cell
MUX Multiplexer
RM Round towards Minus-infinity
ABSTRACT
Nowadays, every CPU has one or more Floating Point Units (FPUs) integrated
within it. FPUs are commonly used in math-intensive applications, such as digital
military fields as well as in other fields requiring audio, image or video manipulation.
With the advancement in FPGAs, high performance FPGAs are now built with
millions of gates along with sophisticated features. Accordingly, FPGAs are becoming
more suitable for implementation of high performance FPUs especially when short
The objective of this thesis is to design and implement high speed generic FP
standard Leading One Detector (LOD) algorithm. The novel multiplication algorithm
the FP Multiplier and Adder/Subtractor Units are deeply pipelined, which also leads to
maximum throughput.
The FP Multiplier Unit using the novel Block Multiplication algorithm and the
FP Adder Unit using the LOD algorithm were both completely described using
VHDL code to allow their implementation to any FPGA platform. In our research,
both units were implemented on Virtex2Pro, Virtex4 and Virtex5 FPGAs and were
able to operate at speeds higher than 320 MHz on Virtex2Pro whilst occupying
around 20% of the FPGA and at speeds higher than 400 MHz on Virtex4 and Virtex5
FPGA whilst occupying around 3% of the FPGA. Post route simulation of both units
was performed to verify design operation post implementation (routing) and power
Chapter 1
Introduction
Ever since the invention of digital computers, arithmetic logic units (ALUs)
have always been a fundamental building block of the computer’s CPU. The ALU
usually refers to the circuit that deals with binary numbers in the integer format (like
2's complement and binary coded decimal). An FPU, on the other hand, refers to the
arithmetic unit that deals with floating point numbers (i.e. real numbers). FPUs are
Some of the greatest achievements of the 20th century would not have been
possible without the floating point capabilities of digital computers and systems.
FPUs are especially important for implementation of engineering and math-intensive
applications used in digital signal processing and other scientific computations that
require wide dynamic ranges and high precision. As a result, high performance FPUs
are essential in several fields such as communications, military and medicine as well
diagnostic equipment and others. Then for portable applications, the need for low
In conventional FPUs, the most frequently used floating point operations are
multiplication and addition/subtraction accounting for more than 94% of all floating
point instructions [1]. Hence the employment of high performance FP Multiplier and
Adder/Subtractor Units is of great importance and has been the focus of many
researchers, few of whom gave any attention to power consumption issues.
In this thesis, we aim to design, implement and test an IEEE compliant single
precision, generic, low power, high speed FPU (Multiplier and Adder/Subtractor). In
parallel. This new algorithm is referred to as “Block Multiplication” and can be used
in the implementation of any large multiplier that is to be optimized for speed. The FP
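The decomposition behind a block-based multiplier can be illustrated with a short sketch. This is not the thesis's exact Block Multiplication algorithm (Chapter 5 defines it, and Figure 5.7 shows three parallel partial multiplications); it is a generic Karatsuba-style split, with the split point and recombination chosen here purely for illustration:

```python
# Illustrative sketch only: splitting one large unsigned multiplication into
# three smaller, independent (parallelizable) multiplications. The split
# point k and the recombination below are assumptions for illustration.

def block_multiply(a: int, b: int, width: int = 24) -> int:
    """Multiply two `width`-bit unsigned integers using three partial products."""
    k = width // 2                      # split each operand into high/low halves
    mask = (1 << k) - 1
    a_hi, a_lo = a >> k, a & mask
    b_hi, b_lo = b >> k, b & mask
    # Three multiplications that could run in parallel in hardware:
    p_hh = a_hi * b_hi
    p_ll = a_lo * b_lo
    p_mm = (a_hi + a_lo) * (b_hi + b_lo)
    middle = p_mm - p_hh - p_ll         # = a_hi*b_lo + a_lo*b_hi
    # Shift the partial products to align them, then add:
    return (p_hh << (2 * k)) + (middle << k) + p_ll
```

Because the three partial products are independent, a hardware implementation can compute them concurrently and pipeline the aligning shifts and the final addition, which is what makes such a decomposition attractive for speed.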
1.2 Organization
previous work of other researchers in the field of designing and implementing FPUs
to FPGA. Chapter 3 gives an overview on FPGAs specifically the Virtex family along
with the FPGA design flow as given by Xilinx ISE. Chapter 4 provides an overview
on the IEEE 754-2008 standard for binary floating point numbers along with the
floating point multiplication and addition operations. Chapter 5
discusses the proposed designs for both the FP Multiplier using the new proposed
Block Multiplication algorithm and the FP Adder/Subtractor using the LOD algorithm
by describing the design specifications, then explaining the modules of both units and
showing the behavioral simulations of each module and of the complete designs.
Chapter 6 gives the synthesis, implementation, post route simulation and power
results of the proposed designs. Finally, Chapter 7 wraps up with the conclusion and
future work.
Chapter 2
software emulators which, although they save the added hardware cost, are significantly
slow. Later, floating point operations were performed using external coprocessors that
were used when needed to allow for the execution of math-intensive operations.
Nowadays, every computer has a high speed floating point unit integrated within its
CPU.
computations at higher clock frequencies. This makes modern FPGAs quite suitable
for implementation of high speed floating point arithmetic which can be particularly
useful when flexibility and fast time to market, some of FPGA’s strongest assets, are
of concern.
There have been research papers on the design and implementation of FPUs on
FPGAs ever since the late 1990s. One of the first IEEE
were introduced in 1996 by L. Louca et al. [2]. They implemented their designs on
FLEX8000 Altera FPGA. Their main objective was to minimize the area of both the
resources of the FPGA while achieving reasonable speed and maintaining IEEE
accuracy. Despite their efforts, only one of their proposed units could be implemented
to the FPGA at a time while achieving a peak performance of 7 MFlops and 2.3
One of the early FPU designs was given by Mamun Bin Ibne Reaz et al. in
2002 [3]. Their work included the design and simulation of a pipelined FPU,
presenting detailed block diagrams of the adder/subtractor, multiplier and divider units.
They did not synthesize or implement their design, thus no indication of speed or area
was given for their work. This paper provided us with a good basic overview on the
design of FPUs.
Another research paper by Jian Liang et al. in 2003 [4] presented a FPU
Units. The paper was the first, according to the authors’ knowledge, to discuss and
compare the different algorithms used to implement the floating point adders. The
Adder/Subtractor Unit that trades off latency and throughput for area. The results they
got were from implementing their designs on Spartan-3 FPGA. One of their optimized
designs showed latency just above 250 ns and a throughput of around 75 MHz.
In 2004, Suhaili and Sidek [5] proposed a reconfigurable 32 bit ALU that can
perform both integer and floating point addition. They described their module in
Verilog and implemented it on Spartan 2e FPGA. The synthesis report indicated that
Also in 2004, Gokul Govindu et al. [6] showed that FPGA based FPU can
paper, they analyzed the maximum achievable speed, area, latency, power and
pipelining as a parameter. They used both VHDL and Xilinx Library Cores to
describe their design. They implemented it on Virtex2Pro and were able to achieve
In 2005, Ali Malik [7] discussed and compared in detail the design of FP
Virtex2p FPGA. Malik’s work is related to that of Liang [4] in that they both analyzed
implementations for some of the main modules. He then discussed the design
delay and area (number of occupied FPGA slices). Finally, he used the optimized
sub-modules to build several FP Adders, each using a different algorithm, and compared
them for overall latency, area (number of occupied slices), and speed. His fastest
In 2006, Brunelli and Nurmi [8] introduced the design of what they called the
Milk Co-Processor which is a 32 bit FPU. The main objective of their design was
reusability. They did not give much detail about their used design algorithms. They
did mention though that they described their design using VHDL and that a special
VHDL file was written containing a set of generics that were used to give
customizability to the design. For example, these parameters allow the choice of
Also in 2006, K. Scott Hemmert and Keith D. Underwood [9] from Sandia
National Laboratories published their work which involved the design of a high speed
HDL) as their design entry language and implemented the FPUs to both Virtex2 and
Virtex4 FPGAs. Their proposed designs were able to operate at frequencies of up to
320 and 350 MHz on Virtex4 for the Multiplier and Adder respectively.
Another contribution in 2006 was given by Per Karlstrom et al. [10] who
They used Virtex4 DSP48 blocks to build the multiplier module within the FP
Multiplier Unit. As a result their FP Multiplier Unit was able to operate at nearly 450
MHz for Virtex4 FPGA. As for the FP Adder Unit, they worked on increasing the
operating speed by optimizing the bottleneck block of the design. This was
performed by breaking up the binary number to be dealt with in that block and
considering each four bits separately in a parallel manner. Despite the fact that this
operating speed of 361 MHz on Virtex4 FPGA. Later in 2008, Karlstrom et al.
multiplier in the FP Multiplier Unit such as Booth and Canonic Signed Digit (CSD)
multiplications. They used VHDL to describe their design. They implemented their
topic of FP Adder/Subtractor and Multiplier Units design especially for double (64
bits) and quadruple (128 bits) precision FP numbers. They have published several
papers within the topic. In 2009, Florent de Dinechin and Banescu [13] studied several
FP multipliers, on FPGAs. The objective of his work was to build large multipliers
operating at high frequencies while reducing their DSP block usage. Each of his
studied techniques was found to be more suitable to a certain FPGA depending on the
architecture of its DSP blocks. Florent was able to build multipliers that operated at
frequencies around 440 MHz for both Virtex4 and Virtex5 FPGAs. Later in 2010,
Banescu along with Florent et al. [14] published a paper studying the same
presented double and quadruple precision FP multipliers that operated at 400 MHz for
both Virtex4 and Virtex5 FPGAs. Florent et al. [15] also published a paper in 2010
that discusses a FP adder generation tool, in a project they referred to as the FloPoCo
project (Floating Point Cores). Their work explores the tradeoffs between size,
latency and frequency for pipelined large precision adders on FPGA in several
architectures. For each of these architectures, resource estimation models are defined
and used in an adder generator that selects the best architecture considering the target
FPGA, target operating frequency and the addition bit width. They were able to
construct double and quadruple precision FP adders whose synthesis results indicated
relevant work was introduced by Govindu [6], Hemmert [9], Karlstrom [10,11] and
the work introduced by Lyon University[13-15]. They were all able to design FP
Units operating at sufficiently high speeds, well above 200 MHz. There were
two main approaches used to achieve such high speeds of operation which are:
blocks on the specific FPGA or by using Xilinx Cores that are specifically
optimized for the target FPGA. This approach was used by Govindu [6] and
unsigned multiplier was built by using optimized Xilinx cores and DSP
2. Using Fast Algorithms: Many fast algorithms are present for both the FP
which are by far the most common fast multiplication algorithms used with
The drawback of the first approach is that the designs were not generic; they were
actually optimized for specific FPGAs. That is why the second approach was more
appealing for our work, since our objective is to design a high speed, low power and
Chapter 3
Digital Design
Based on the design specifications, a digital designer has various options when
selecting a hardware platform for their design, ranging from Application Specific
Integrated Circuits (ASICs) to all sorts of Programmable Logic Devices (PLDs) and FPGAs.
This chapter gives an overview of these various hardware design options starting
with the ASIC, going through PLDs, FPGAs and then discussing thoroughly Xilinx
Virtex FPGAs. Finally, the digital design flow for FPGAs is explained.
ASICs started appearing in the early 1980s. ASIC chips are customized for a
particular use rather than general purpose use. With the improvement in design tools
and reduction in feature sizes, the number of gates in an ASIC has grown from a few
thousand to over 100 million gates. Modern ASICs are also known as SoCs (System-on-Chip).
There are many types of ASICs based on the number of mask layers on which
the designer has control over. The most well-known types of ASICs are [16, 17]:
Full Custom:
In a full custom ASIC, the designer has full control over every mask layer
used to fabricate the silicon chip. Accordingly, the designer has full control over the
sizes of all transistors in his design which allows fine tuning of the transistors’ sizes
for optimum performance. Full custom design can be used to design some or all of the
circuits for a specific ASIC. Fewer full custom ICs are being designed due to the long
time to market and high non-recurring engineering (NRE) cost involved for its design
and fabrication. Also, with the improvement in standard cell and gate array ASICs,
these now provide the required performance for more applications with their high
speeds and low costs, which steers designers away from full custom design. Full
custom design is usually used when there are no suitable existing cell libraries,
typically because the existing cell libraries are either not fast enough, not
small enough, consume too much power or don't provide a certain required function.
Full custom ASICs are commonly used for microprocessors which must operate as
Standard Cell:
Standard cell ASICs are based on predesigned logic cells (such as logic gates,
multiplexers, flip-flops, etc…) known as standard cells that are used to build the
design along with larger predesigned cells known as megacells (such as memory
blocks) that can be embedded into the design. During the design, each and every transistor in
every standard cell can be chosen to optimize a certain design parameter and tools can
cell ASICs, all mask layers are customized (transistors and interconnects) thus a
custom photo-mask is created for every layer for the device's fabrication. The
advantage of standard cell ASICs is that designers save time and money by using
Gate Arrays:
Gate arrays ASICs are partially fabricated chips with repetitive similar blocks,
and resistors depending on the vendor. Basic cells are replicated to form arrays of
basic cells. The designer chooses from a gate-array library of predesigned and
pre-characterized logic cells (including gates, registers, etc.) and uses them along
with more complicated blocks to build the circuit by controlling only the top few
layers of metal used for interconnects. The disadvantage of gate array ASICs is the
3.2.1 Introduction
PLDs were first introduced in the mid 1970s. A PLD is a programmable chip
that is mass produced at the factory and then customized by the end-user to perform
different logic functions. Unlike ASICs, PLDs are intended for general use not for
programmable depending on the technology used to implement the cross points within
the device.
One time programmable PLDs implement the cross points using fuse or anti-fuse
technology to create permanent open or short circuits respectively, based
on the data to be programmed into the PLD. For multiple time programmable PLDs, the
cross points are implemented using a single bit memory cell used to store binary data
at the cross points to implement an open or short circuit. Such PLDs can be volatile or
nonvolatile depending on whether volatile or nonvolatile memories are used at the
cross points.
PLDs can be used to implement anything from simple combinational circuits to fairly
complex sequential state machines, depending on the type of PLD. There are three types of
PLDs:
user.
matrix of AND gates (referred to as the AND plane) and a matrix of OR gates
(referred to as the OR plane). These planes are used to implement any circuit as Sum-
of-Product through programming the horizontal to vertical cross points as either open
AND plane and a programmable OR plane. The cross points in both the AND plane
and the OR plane can be programmed to form any sum-of-product expression. PLAs
are particularly useful for large designs that require many common product terms that
can be used by several outputs. The downsides of the PLA device are its manufacturing
cost and speed. This device has two levels of programmable links, and signals
take a relatively long time to pass through programmable links as opposed to pre-
Figure 3.1 Basic Architecture of PLA
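To make the sum-of-products idea concrete, the following sketch models a single PLA output in software. The plane encoding ('1' = true literal connected, '0' = complemented literal, '-' = no connection) is a hypothetical convention chosen for illustration, not any vendor's programming format:

```python
# Hypothetical model of a PLA's two programmable planes realizing a
# sum-of-products function. Each AND-plane row describes one product term.

def pla_eval(and_plane, or_plane, inputs):
    """Evaluate one PLA output from programmed AND/OR planes."""
    products = []
    for row in and_plane:                    # each row is one product term
        term = True
        for bit, spec in zip(inputs, row):
            if spec == '1':
                term = term and bit          # true literal connected
            elif spec == '0':
                term = term and (not bit)    # complemented literal connected
        products.append(term)
    # OR plane: OR together the product terms selected for this output
    return any(p for p, sel in zip(products, or_plane) if sel)

# f(a, b, c) = a·b + a̅·c  programmed as two product terms:
AND_PLANE = ['11-',   # a AND b
             '0-1']   # NOT a AND c
OR_PLANE = [True, True]
```

Programming the PLA then amounts to choosing which cross points are open or closed in each plane, exactly as the text describes.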
The speed problems associated with the PLA were addressed with the
development of the PROM and the PAL. A PROM is a special type of PLA with a
programmable OR plane and a fixed AND plane that produces all possible product
terms for the given inputs. A PAL is a special type of PLA with a programmable
AND plane and a fixed OR plane. The advantage of a PROM and a PAL is that they
are faster due to having only a single programmable plane, at the cost of less
Figure 3.3 Basic Architecture of PAL
A GAL is a PLA that has additional logic circuitry at each output, referred to as
a macro cell. A macro cell is a programmable output cell containing logic gates, a
flip-flop, and multiplexers with internal programmability that allows several modes of
operation. GALs also differ from PLAs in that they have a feedback signal from the
macro cell back to the programmable array, which increases the GAL's flexibility, and
in that their cross points are implemented using EEPROM instead of fuse/anti-fuse
or PROM/EPROM. GAL devices can have a maximum frequency around 250 MHz
interconnecting matrix. CPLDs have I/O drivers and a clock/control unit. Modern
CPLDs also include JTAG support (port for circuit access/test defined by Joint Test
Action Group and standardized in the IEEE 1149.1 standard), a large number of I/O
Figure 3.5.
CPLDs feature predictable timing characteristics that make them ideal for
more predictable delay than FPGAs and other programmable logic devices. CPLDs
are inexpensive and require small amounts of power [19], thus they are commonly
Observing the most popular Altera and Xilinx CPLDs, it can be summarized that
CPLDs are fabricated using 0.18 µm or 0.35 µm CMOS technology, have a number of
user pins ranging from 27 to 272, and have maximum operating speeds
Around the beginning of the 1980s, the gap in the digital IC continuum
became apparent. At one end there were programmable devices such as SPLDs and
CPLDs, which were highly configurable and had fast design and modification times,
but could not support large or complex functions. At the other end, there were ASICs
which could support extremely large and complex functions, but they were very
expensive and time consuming to design. Furthermore, once a design has been
To fill that gap, FPGAs were introduced by Xilinx in the mid 1980s. They are
considered the most complex PLDs since they contain thousands of configurable logic
blocks and configurable interconnects that can both be programmed (only a single
time or many times depending on the type of FPGA) to perform a variety of complex
At first, FPGAs were used to implement simple logic circuits at relatively low
speeds. Later, in the 1990s, the size and sophistication of FPGAs started to increase
and they found market in telecommunications, networks and many other industrial
applications. FPGAs were then also commonly used to prototype ASIC designs or to
However, FPGAs' ease of design, flexibility, low development cost and short time to
market soon made FPGAs find their way into final products.
Although the first FPGAs contained a few thousand gates, nowadays FPGAs
contain over a billion gates [20]. Today’s FPGAs have several sophisticated features
such as high speed input/output interfaces, internal clocking and consist of millions of
gates along with embedded elements such as microprocessor cores, RAMs, DSP
blocks, multipliers and dedicated arithmetic carry chains. Such high performance
FPGAs can be used to implement almost any design and at very high speeds that
Generally, FPGAs are more cost effective for limited productions while
ASICs are more suitable for larger productions. An advantage of using FPGAs instead
of ASIC is that the FPGA design flow eliminates the complex and time-consuming
floor planning, place and route, timing analysis, and mask / re-spins stages of the
project since the design logic is already synthesized to be placed onto an already
verified, characterized FPGA device. Moreover, FPGAs have the advantage that they
can be easily and rapidly reprogrammed. A good way to shorten the development time
complexity, or volume designs in the past, modern FPGAs easily push the 500 MHz
performance barrier due to their very high densities and their sophisticated features.
designs that could previously have been realized only on ASICs and custom silicon.
FPGAs can be used to implement almost any type of design such as communication
devices, software defined radios, radar, image processing, digital signal processing all
the way to system-on-chip (SoC) components that contain both hardware and
software elements.
FPGAs have a much more sophisticated structure than CPLDs. This gives
FPGAs the advantage over CPLDs that they can be used to implement complex
now considered the largest PLD supplier owning more than 50% of the market [21].
3.3.2.1 Architecture
switch matrices [19]. The internal architecture of CLBs might differ from one FPGA
family to another. Generally, a CLB consists of a number of slices, each slice in turn
consisting of a number of logic cells. A logic cell is the core building block in a
A simplified illustration of a Xilinx logic cell is shown in Figure 3.7. It
includes a 4 input Look Up Table (LUT), a Delay Flip Flop (DFF) along with a
multiplexer to allow a registered or unregistered output. Other than the LUT, MUX
and register, a logic slice can also contain other elements such as fast look-ahead carry
chains, arithmetic logic and dedicated internal routing. Advanced FPGAs may also
The reason for FPGAs having a hierarchy of CLBs consisting of slices that in
interconnects. Thus there is a fast interconnect between logic cells within the same
slice, then slightly slower interconnects between slices in a CLB followed by the
between making it easy to connect things together without incurring excessive inter-
implementation. The most famous of these technologies are listed in Table 3.1.
Table 3.1 Technologies Used to Implement FPGA Interconnects

Technology    | Programmability                                  | Predominantly Associated With
Fuse          | OTP                                              | SPLDs
Anti-fuse     | OTP                                              | FPGAs
EPROM         | Can be erased using ultraviolet light (takes     | SPLDs & CPLDs
              | at least 20 minutes) then reprogrammed.          |
EEPROM/Flash  | Can be electrically erased then reprogrammed.    | SPLDs, CPLDs & some FPGAs
SRAM          | Reprogrammable.                                  | FPGAs & some CPLDs
Most Xilinx FPGAs are based on SRAM technology, where the programmable elements of
the FPGA are controlled by SRAM cells. SRAM based FPGAs are volatile, that is, the
device's configuration data is lost once the power is removed from the system. Thus
SRAM based FPGAs require external boot ROM to reprogram the FPGA every time
it is powered on. This is not much of a problem since these devices have the
The Xilinx Virtex series was first introduced in 1998 as a low power high
performance solution. It was the first line of FPGAs to offer one million system gates.
The Virtex product line consistently offers the industry's leading combination of
In addition to FPGA logic, the Virtex series includes embedded fixed function
although still available, their functionality is largely superseded by the Virtex-4
and -5 FPGA families. The Virtex2 series is manufactured on a 1.5V, 0.15μm 8-Layer
Metal Process with 0.12μm High-Speed Transistors whilst the Virtex-2 Pro series is
Transistors [20].
Virtex4 is very similar to that used in all previous Virtex and Spartan (up to Spartan
3A) FPGAs. The simplified schematic of a Virtex4 slice is shown in Figure 3.8. It
Two Multiplexers.
Dedicated arithmetic logic including two 1-bit adders, carry chain and two
storage elements that can be configured as flip-flops or as latches.
Figure 3.8 Simplified Schematic of Virtex-4 Slice
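The reason a slice built around LUTs is universal can be sketched briefly: a 4-input LUT is nothing more than a 16-entry truth table addressed by its inputs, so loading the right init vector realizes any 4-input Boolean function. The init-vector ordering below is an assumption for illustration, not the Xilinx LUT INIT format:

```python
# Illustrative model of a 4-input LUT: a 16-entry truth table addressed
# by the four inputs. The bit ordering is an assumption for illustration.

def make_lut4(init_bits):
    """Build a 4-input LUT from a 16-bit truth table (index 0 = all inputs low)."""
    assert len(init_bits) == 16
    def lut(i3, i2, i1, i0):
        index = (i3 << 3) | (i2 << 2) | (i1 << 1) | i0
        return init_bits[index]
    return lut

# Example: a 4-input XOR, whose truth table is 1 wherever the index has odd parity.
xor4 = make_lut4([bin(i).count("1") % 2 for i in range(16)])
```

Reprogramming the FPGA simply reloads these init vectors (and the routing configuration), which is why the same fabric can implement arbitrarily different logic.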
Virtex-5 FPGAs
In the Virtex5 series, Xilinx moved from its traditional four-input LUT design
technology. Virtex5 series offers a lower power solution delivered by the 65nm
The simplified schematic of a Virtex5 slice is shown in Figure 3.9. The main
Figure 3.9 Simplified Schematic of Virtex-5 Slice
There is no consistent definition of a speed grade for all devices. Even for
FPGA or a CPLD. Originally speed grades for Xilinx FPGAs represented the time
through a look-up table, but now the speed grade doesn't actually represent a timing
path. Instead, the speed grade is a relative metric of performance within a specific
FPGA family. Different speed grades within a family result merely from process
For modern Xilinx FPGAs, such as those of the Virtex family, higher numbers
represent faster devices. For example, Virtex4 speed grades are -10, -11, and -12 with
-10 being the slowest and -12 being the fastest. Virtex5 speed grades are -1, -2, and -3
3.4 FPGA Design Flow
provided by the Xilinx foundation [20] followed by a brief summary of all its steps
[16,23].
Design Entry
Generally, there are different techniques for design entry which are schematic
based, HDL, a combination of both and finite state machine (FSM). If the designer
wants to deal more with hardware, schematic based design entry is the better choice.
HDL and FSM on the other hand, represent a level of abstraction that can isolate the
designer from the details of the hardware implementation. FSM is used when the
design can be thought of as a series of states. HDL is considered the most popular
design entry methodology for FPGA design as it is the best choice for describing
complex designs.
level. There are two industry IEEE standard HDLs namely, VHDL and Verilog.
While their syntax and semantics are quite different, they are used for similar
purposes. VHDL contains more constructs for high level modeling, model
parameterization, design reuse and management of large designs than Verilog does.
Many EDA tools are designed to work with either language or both languages
together.
Not all VHDL constructs are synthesizable. Generally, if the VHDL code is
physically meaningless or too far removed from the hardware it attempts to describe, it
may not be synthesizable [19]. Thus it is best to describe the design in a simple manner
in order to ensure the synthesizer is able to correctly interpret the design to the
Behavioral Simulation
The design is first simulated to verify its logical correctness. Behavioral simulation does not consider propagation
delays.
Synthesis
Synthesis is the process which translates the HDL code into a netlist form (i.e. a
complete circuit with logical elements such as gates, flip-flops, etc) targeted at a
specific FPGA platform. At this stage, detailed timing analysis can be carried out and
an estimate of the occupied area can be obtained. The resulting netlist is stored in an NGC
implemented in the target FPGA. This process consists of a sequence of three steps:
file saved as an NGD (Native Generic Database) file for Xilinx tools.
ii. Map: The map process fits the logic defined by the NGD file into the
targeted FPGA elements (i.e. CLBs, IOBs) and generates an NCD (Native
iii. Place and Route (PAR): The PAR process places the blocks from the
map process onto the FPGA according to the defined user constraints
and then routes the connections between them. The PAR tool takes the mapped NCD file
design meets the power budget thus attaining system performance and cost
goals. Low power enables higher clock frequency, higher reliability, better
path of the design is determined here and hence the fastest design speed. The main
advantage of static timing analysis is that it is relatively fast, doesn’t need a test bench
In Timing Simulation, the VHDL timing model generated by the place and route
tool, which includes the block and routing delay information from the routed design,
is simulated to give a more accurate assessment of the behavior of the circuit under worst-case
conditions. Timing simulation is a highly recommended part of the HDL design flow
for Xilinx devices to verify that the design implemented on the target FPGA meets timing
constraints. Since timing simulation uses the detailed timing and design layout
information that is available after place and route, this simulation of the design closely
matches the actual device operation. Timing simulation simulates the VHDL timing
model, generated by the place and route tool, to verify the synthesized logic as
The routed NCD file is converted to a bit stream file to be used to configure the
Chapter 4
Since only binary information can be stored and processed in digital computers,
the most natural system to use when representing decimal numbers is the binary
system. In order to represent decimal numbers in binary notations, there are two
methods depending on the position of the binary point. The fixed point method
assumes the binary point is always in a fixed position while the floating point
representation assumes the binary point can float anywhere within the number's
significant bits.
This chapter gives a brief presentation of the IEEE-754 [24] standard for single precision floating point numbers and explains the floating point multiplication and addition/subtraction operations.
In fixed point representation, numbers are stored assuming the radix point, also referred to as the binary point for binary systems, is fixed in a certain position such that there are a fixed number of digits before and after the radix point. The two most widely used radix point positions when fixed point representation is used are:
1. Placing the radix point in the extreme left of the number such that it can only
represent fractions.
2. Placing the radix point in the extreme right of the number such that it can only represent integers.
In either case, the radix point is not actually present, but its presence is assumed
from the fact that the number stored is treated as a fraction or as an integer [25].
Floating point representation divides a number into two parts. The first part represents a signed, fixed point number called the mantissa. The
second part designates the position of the radix point and is called the exponent.
Usually, the radix point is shifted such that there is one non-zero digit to its left. So
basically, floating point represents real numbers in the scientific notation where the
radix point can float anywhere in the number and the exponent is adjusted
accordingly. The general format of any floating point number (F) can be written as:

F = M * r^exp

where M is the mantissa, r is the radix of the used numbering system (i.e. r = 10 for decimal, r = 2 for binary) and exp is the exponent that represents the original location of the radix point. For example, the binary floating point number (1101.01)2 is represented as 1.10101 * 2^3.
Fixed point representation is simpler and cheaper to implement than floating point representation. On the other hand, floating point representation has the advantage of being able to represent very large or very small numbers, and thus arithmetic operations are less likely to overflow or underflow than in fixed point representation. As a result, floating point representation is more suitable for applications that require a wide dynamic range.
The IEEE 754 standard for floating point representation was defined to achieve two main goals:
1. Precisely specify floating point number encoding such that all computers would interpret floating point numbers in the same way. This made it possible to exchange floating point data between different computers.
2. Precisely specify the arithmetic operations and the exceptional conditions that could result from them such that all computers will give the same result for a given operation with the same input data.
The IEEE-754 standard was first created in 1985. Several revisions were made until the final IEEE-754 2008 standard was published in August 2008. The IEEE 754 standard specifies how floating point numbers are represented and how to carry out arithmetic operations on them. Nowadays, the IEEE 754 standard is the most common floating point representation. The single precision format stores a floating point number in 32 bits. The 32 bits are divided into three fields, a 1-bit sign, an 8-bit exponent and a 23-bit mantissa:

Sign       Exponent         Mantissa
B31        B30 ... B23      B22 ... B0
The sign bit stores '0' for positive numbers and '1' for negative numbers. The
8-bit exponent stores the exponent in excess-127 code. That is a bias of 127 is added
to any exponent before it is stored. So for example, to represent the binary exponent 2, the value E' = 2 + 127 = 129 is stored. Using the excess-127 code gives a range for E' of 0 <= E' <= 255. IEEE-754 reserves exponent field values of 0 and 255 (all 0s and all 1s) to denote special values, so the range of E' for normal numbers becomes 1 <= E' <= 254, equivalent to a range of -126 <= E <= 127 for the true exponent E. The excess-127 code also allows exponents to be compared as if they were unsigned integers.
The 23-bit mantissa is effectively a 24-bit mantissa whose most significant bit is not stored and is thus known as the implicit bit. While this provides efficient storage, the implied bit is necessary to carry out arithmetic operations on the number and must be explicitly restored before any operation involving the number is performed.
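The field layout and the implied bit described above can be illustrated with a short Python sketch. This is illustrative only, not part of the thesis design; the function names are assumptions, and Python's struct module is used merely to obtain a reference bit pattern:

```python
import struct

def unpack_single(word: int):
    """Split a 32-bit IEEE-754 single into sign, biased exponent, mantissa."""
    sign = (word >> 31) & 0x1
    exp_biased = (word >> 23) & 0xFF        # E' = E + 127
    mant = word & 0x7FFFFF                  # 23 stored bits, implied '1' omitted
    return sign, exp_biased, mant

def value_of(word: int) -> float:
    """Recover the value of a normalized single-precision number."""
    sign, e, m = unpack_single(word)
    frac = 1 + m / 2**23                    # restore the implied leading '1'
    return (-1) ** sign * frac * 2.0 ** (e - 127)

# cross-check against the host's own IEEE-754 encoding of 3.5
word = struct.unpack(">I", struct.pack(">f", 3.5))[0]
print(hex(word), value_of(word))            # prints: 0x40600000 3.5
```

Note how 3.5 = 1.75 * 2^1 is stored with E' = 128 and a mantissa field of 0.75 * 2^23, the leading '1' of 1.75 being implicit.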
Floating point numbers are stored in the normalized form, where the mantissa is shifted such that the implicit bit is '1'. Normalization has the advantage of allowing a wide range of numbers to be represented with great precision. The value of a normalized floating point number (FN) is then given by:

FN = (-1)^S * 1.M * 2^(E' - 127)

Denormalized numbers are floating point numbers that do not assume the implied bit. They are used to represent numbers that are smaller than the smallest normalized number. They are identified by an all zero exponent and a non-zero mantissa.
4.2.3 Special Values
The IEEE-754 standard reserves the exponent field of all '0's and all '1's for
special values. The special values defined by the IEEE-754 standard are:
1. Zero : Due to the assumption of a '1' implied bit, a zero cannot be directly represented, so zero is denoted by an all '0's exponent and mantissa. The sign bit differentiates between +0 and -0, which compare as equal.
2. Infinity: Infinity is denoted by an all '1's exponent and an all '0's mantissa.
3. NaN : The value Not a Number (NaN) is used to represent a value that is not a real number. NaNs are represented by an all '1's exponent and a non-zero mantissa.
The range of single precision floating point numbers is shown in Table 4.2. Since the sign of floating point numbers is given by a separate leading bit, the range of negative numbers is given simply by the negation of the range of positive numbers.
4.2.5 Exceptions
1. Overflow : Set when the result has a value that is too large to be
represented.
2. Underflow : Set when the result has a value that is too small to be
represented.
3. Inexact : Set when the result of an operation needs to be rounded, and thus a rounding error is introduced.
4. Invalid : Set when an operation cannot return a real value, such as taking the square root of a negative number, which has no real solution.
Arithmetic operations usually result in floating point numbers with more bits than can actually be stored. In such cases, it is necessary to normalize the number and round it in order to fit in the storage format. The IEEE 754 standard defines five rounding modes. The first two modes round to a nearest value, while the other three are directed rounding modes:
1. Round to nearest, ties to even (REN): Rounds to the nearest value; if the number falls midway it is rounded to the nearest value with an even (zero) least significant bit. This is the default rounding mode for binary floating-point.
2. Round to nearest, ties away from zero: Rounds to the nearest value; if the number falls midway it is rounded to the nearest value above (for positive numbers) or below (for negative numbers).
3. Round towards positive infinity (RP): The result is rounded up towards positive infinity.
4. Round towards negative infinity (RM): The result is rounded down towards negative infinity.
5. Round towards 0 (RZ) : Also called truncation, as the extra bits are simply truncated.
The rounding mode affects the results of most arithmetic operations and the thresholds for overflow and underflow exceptions. The default rounding mode is REN and it is the mode most widely used in software and hardware arithmetic implementations. To increase the precision of the result and to enable the REN rounding mode, three extra bits, namely the guard, round and sticky bits, are appended to the right of the mantissa.
4.3 Floating Point Arithmetic
In this section, the basic algorithms for performing floating point addition/subtraction and multiplication are explained.
Floating point multiplication is considered the simplest of the floating point operations since it is performed by directly adding the exponents and multiplying the mantissas. The basic floating point multiplication operation consists of adding the two exponents and subtracting the 127 bias to compensate for the bias being added in both exponents, multiplying the two mantissas, and determining the resultant sign by XORing the input signs. It is useful at this point to check the exponent for possible overflow or underflow. The bottleneck of floating point multiplication lies in the unsigned binary multiplication of the mantissas due to the long carry chain involved in the multiplication operation.
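The basic steps above can be sketched in Python on already-unpacked fields. This is a simplified model, not the thesis's VHDL implementation: rounding is reduced to truncation and exception handling is omitted, and the function name is an assumption:

```python
def fp_mul_fields(sign_a, ea, ma, sign_b, eb, mb):
    """Sketch of the basic FP multiply steps on unpacked fields.
    ea, eb are excess-127 exponents; ma, mb are 24-bit mantissas with
    the implied '1' already restored. Rounding is reduced to truncation."""
    sign = sign_a ^ sign_b           # resultant sign
    exp = ea + eb - 127              # add exponents, remove the doubled bias
    prod = ma * mb                   # unsigned 24x24 product, up to 48 bits
    if prod & (1 << 47):             # carry into the 48th bit:
        prod >>= 1                   #   one-bit right shift to renormalize
        exp += 1
    mant = (prod >> 23) & 0xFFFFFF   # keep a 24-bit normalized mantissa
    return sign, exp, mant
```

For instance, multiplying the fields of 2.0 and 3.0 yields the fields of 6.0, while 3.0 * 3.0 exercises the one-bit renormalization shift.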
Generally, the multiplication of two n bit binary numbers consists of the successive addition of partial products to give the final result, which for the multiplication of two n bit numbers would not exceed 2n bits. Then for multiplication of signed numbers, the resultant sign has to be determined. There are two main concerns when implementing binary multiplication in hardware:
1. The Multiplication of Signed Numbers: The simple long multiplication process handles the signs of the numbers to be multiplied separately in order to find the resultant sign. Yet, modern computers usually deal with the 1's complement, 2's complement or sign-magnitude number representations, which all embed the sign in the number itself. A simple long multiplication process won't be sufficient in such cases but must be modified to handle the embedded sign correctly.
2. The Long Carry Chain Involved: The multiplication result from the n by n binary multiplication appears only after all the intermediate n partial fractions are calculated and then added. Such an operation is very time consuming due to the long carry chain involved in the successive additions.
When considering the above issues for the case of the binary multiplier involved in the proposed FP Multiplier Unit, it can be noted that:
1. The multiplied mantissas are 24 bit unsigned numbers, so the first issue is not of concern.
2. On the other hand, the large delay resulting from the long carry chain involved is a very critical issue. That is the reason why the 24 by 24 unsigned multiplication algorithms concentrate on decreasing the number of partial fractions, hence reducing the number of intermediate additions and breaking the long carry chain. Generally, floating point
multipliers perform the same main operations, which were explained above, but differ in the algorithm used to perform the mantissa multiplication.
Booth Multiplication
Booth's multiplication algorithm speeds up the binary multiplication when one of the numbers to be multiplied includes a string of consecutive '1's. Booth's algorithm is based on the fact that any binary number containing a string of consecutive '1's can be represented as the difference between two numbers: a string of '1's extending from bit position m up to bit position n is equal to 2^(n+1) - 2^m. Thus the multiplication (20 * 14) can be performed in binary as shown in Table 4.4. As shown in the table, the use of Booth multiplication reduces the number of partial fractions to be added, which is the key factor in speeding up the binary multiplication.
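The recoding idea can be sketched as a radix-2 Booth multiplier in Python. This is an illustrative model of the general algorithm, not code from the thesis:

```python
def booth_multiply(multiplicand: int, multiplier: int, n: int = 8) -> int:
    """Radix-2 Booth multiplication of an n-bit multiplier. A string of
    '1's in the multiplier contributes only one subtraction (at its
    start) and one addition (just past its end)."""
    product = 0
    prev = 0                                # implicit bit to the right of bit 0
    for i in range(n):
        bit = (multiplier >> i) & 1
        if (bit, prev) == (1, 0):           # start of a run of '1's: subtract
            product -= multiplicand << i
        elif (bit, prev) == (0, 1):         # end of a run of '1's: add
            product += multiplicand << i
        prev = bit
    return product

print(booth_multiply(20, 14))   # 14 = (01110)2 = 2**4 - 2**1, so 20*16 - 20*2 = 280
```

For 20 * 14 only two partial terms are generated instead of the three that the '1' bits of 14 would produce in a plain long multiplication.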
Table 4.4 Example on Booth Multiplication
A binary number is said to be encoded in the canonical signed digit (CSD) form if it contains no adjacent non-zero digits. CSDs are generated by encoding a binary number such that it contains the fewest number of non-zero bits.
A CSD n*n bit multiplier contains (n+1) cascaded CSD encoder units to generate the CSD representation of the multiplier. Each encoder unit receives three inputs and provides two outputs. The inputs are the multiplier bits to be encoded and a carry bit, while the outputs are the next carry bit and a canonic signed digit. The generated CSD vector is then given to the CSD logic and shift control unit along with the multiplicand and its 2's complement. The CSD logic and shift control unit provides n/2 n-bit partial products that are given to an adder block, whose output is the final product. The basic idea behind CSD multiplication is again to reduce the number of partial fractions involved in the multiplication operation: in CSD multiplication, the product of two n-bit numbers is calculated in (n/2 - 1) steps.
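The CSD encoding itself can be sketched in Python; this is an illustrative model of the standard recoding rule (digit = 2 - (x mod 4) at each odd position), not the thesis's encoder hardware:

```python
def to_csd(x: int):
    """Encode a non-negative integer in canonical signed digit (CSD) form.
    Returns a list of digits in {-1, 0, 1}, least significant first,
    with no two adjacent non-zero digits."""
    digits = []
    while x != 0:
        if x & 1:
            d = 2 - (x & 3)        # 2 - (x mod 4): gives +1 or -1
            x -= d
            digits.append(d)
        else:
            digits.append(0)
        x >>= 1
    return digits

def csd_value(digits):
    """Evaluate a CSD digit list back to an integer."""
    return sum(d << i for i, d in enumerate(digits))
```

For example 14 = (01110)2 encodes to the digits of 2^4 - 2^1, i.e. only two non-zero digits, which is exactly what reduces the number of partial products.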
The mantissas of two floating point numbers must be aligned before it is possible to add or subtract them when performing floating point addition or subtraction. That is equivalent to both numbers having equal exponents. If the exponents of the two floating point numbers are not equal, one of the numbers has to be pre-normalized. Pre-normalization is the process where the exponent of the smaller floating point number is increased to equal that of the larger number by shifting its mantissa right a number of times equal to the difference between the two exponents. Pre-normalization is performed on the smaller floating point number such that if data bits are lost due to the shifting operation, the effect is minimal.
Generally, floating point addition/subtraction is known to be more complicated
than floating point multiplication due to the extensive processing required in adjusting
the operands before and after the execution of the addition or subtraction operation
when pre-normalizing the smaller mantissa and when post normalizing the resultant
mantissa respectively.
The bottleneck of the floating point addition/subtraction is the post normalization operation. This is due to the fact that the post normalization of the resultant mantissa requires:
i. Finding the leading '1' bit in the resultant mantissa to be set as the implied bit.
The leading one bit here can be located anywhere within the mantissa.
ii. Shifting the resultant mantissa in order to set the detected '1' bit as the most
significant bit.
iii. Adjusting the resultant exponent to compensate for the shift in the resultant
mantissa.
Many algorithms have been proposed to implement the post normalization process at a higher speed. Generally, floating point adders perform the same main operations, which were explained above. They only differ in the algorithm used to implement the post normalization operation. The most famous of these algorithms are [4, 26]:
Leading One Detector (LOD): After the addition/subtraction operation, a leading one detector is used to determine the location of the leading one. Based on the leading one location, the resultant mantissa is shifted left by a number of places that is subsequently subtracted from the exponent. The LOD algorithm is an area efficient, simple algorithm that is also known as the standard algorithm.
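The three LOD post-normalization steps listed above can be sketched in a few lines of Python. This is a behavioral model only (the hardware LOD is combinational logic, and a zero result is assumed to be handled by separate zero-detect logic):

```python
def post_normalize(mantissa: int, exponent: int, width: int = 24):
    """LOD-style post-normalization: locate the leading '1' in the
    resultant mantissa, shift it up to the implied-bit position and
    compensate in the exponent. Assumes mantissa != 0."""
    lead = mantissa.bit_length() - 1        # step i: find the leading '1'
    shift = (width - 1) - lead              # step ii: left shift to normalize
    return mantissa << shift, exponent - shift   # step iii: adjust exponent
```

For example, a subtraction leaving only the low bits 0b1101 of a 24-bit result needs a 20-place left shift, and the exponent is reduced by 20 to compensate.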
The main difference between LOD and LOP algorithms is that LOD method
detects the leading one after the addition/subtraction operation has taken place while
the LOP method predicts the leading one in parallel with the addition/subtraction
computation. This is illustrated in Figure 4.2. Usually, the LOP method requires a
correction circuit.
The LOP method has the advantage of reduced latency at the expense of the added area of the prediction and correction logic. A massive post-normalization shift can occur only when the effective operation is subtraction and the two exponents have a difference of 0 or 1. Making use of this idea, the two path algorithm implements two
parallel data-paths one for when the exponent difference is equal to 0 or 1 and the
effective operation is subtraction called the Close Path and another for all other cases
called the Far Path. In the two path algorithm, the latency is reduced by removing
the pre-normalization from the close path and removing the LOD or LOP from the far
path. This comes at the expense of increased area due to the dual path
implementation.
The two path algorithm is faster than the LOD algorithm and experiences less
latency, but takes more area and consumes more power [26].
The third algorithm, the three path algorithm, has three datapaths, of which only one is operational at a time.
Two of these paths have the exact same functionality as the far and close data paths in
the two path algorithm. The third datapath deals with NaN and infinity values along
with the case when the exponent difference is greater than the width of the mantissa.
Chapter 5
In this chapter, the architecture of the proposed FPU is explained. First, the design specifications of the proposed FPU are discussed, starting from the implemented IEEE standard and the power considerations, then going through the design methodology and finally thoroughly discussing the proposed design. For both the FP Multiplier and Adder/Subtractor Units, the block diagrams are explained block by block, and behavioral simulations for each block and the complete units are given.
The FPU deals with single precision floating point numbers that are
represented as specified by the IEEE 754 standard. The exact specifications of the
floating point numbers dealt with in the proposed FPU are explained in this section.
Denormalized numbers are generally rare and require complicated hardware to handle, so in the proposed FPU a denormalized input is treated as a zero and dealt with accordingly. The reason for excluding denormalized numbers is the large overhead in taking care of these numbers, especially for the multiplier [9, 10]. They are commonly excluded from high-performance systems; for example, the Cell Broadband Engine does not support denormalized numbers.
Special Values
Zero: Since a zero cannot be directly represented with a '1' implied bit, one must consider the zero when implementing a floating point unit. In the proposed design, zero operands are detected at an early stage and handled explicitly.
NaN: NaN values are considered rare and require a lot of hardware to deal with.
Exceptions
The exceptions most likely to occur in floating point addition and multiplication are overflow and underflow. In the proposed design, they are honored only in the FP Multiplier Unit where they are more likely to occur. A detected overflow or underflow sets an appropriate flag that propagates to the final result.
Rounding
REN is the most common mode used in software and hardware implementations of floating point arithmetic and is the rounding mode adopted in the proposed design.
The guard, round and sticky bits were added to the mantissas in both the
multiplier and adder/subtractor units to increase the accuracy of the stored result.
In multiplication, a zero input means that the result will also be a zero. In such a case there is no need for any calculations, and performing them would only mean unnecessary power consumption. To avoid wasting this power, both input operands are checked for zeros at a very early stage in the design. If one or both operands are found to be zero, a zero flag is set. Blocks in the design are written to operate only if the zero flag is not set.
Two cases have to be considered for zero detection in addition/subtraction, which are a zero input and an expected zero output.
Input Zero Detection: If one or both of the input operands is a zero, the result can be directly determined to be the other operand or a zero respectively. The only operation necessary then would be determining the output sign, making no further calculations necessary. So in the proposed design, the input operands are checked for zeros at an early stage and the appropriate zeros flag is set accordingly to identify whether one or both inputs are zeros. Again, blocks are designed to operate only if these flags indicate that the two input operands are not zeros.
Output Zero Detection: If both input operands are non-zeros, a zero output can still result if both inputs are equal and the effective operation turns out to be subtraction. In the proposed design, the two input operands are compared at an
early stage and an aequalb flag is set when they are equal. The zeros flag is set if
the effective operation is subtraction and the aequalb flag is set in a manner to
indicate a zero result. Then again in such a case, unnecessary calculations are
avoided.
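The flag logic described above can be summarized in a small Python sketch. This is a behavioral model under the thesis's own assumptions (an all-zero exponent marks a zero, since denormals are treated as zeros); the function and flag names are illustrative:

```python
def zero_flags(exp_a: int, exp_b: int, op_a: int, op_b: int, subtract: bool):
    """Early zero detection for the adder/subtractor: one flag per zero
    input, plus a predicted zero output when the effective operation is
    a subtraction of equal operands (x - x = 0)."""
    a_zero = exp_a == 0                    # all-zero exponent -> treated as zero
    b_zero = exp_b == 0
    aequalb = op_a == op_b                 # full-operand comparison
    zero_out = subtract and aequalb        # skip the datapath entirely
    return a_zero, b_zero, zero_out
```

Downstream blocks would be gated on these flags so that no switching activity (and hence no dynamic power) is spent computing a result that is already known.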
Floating point operations can easily be broken down into several sub-operations, which makes pipelining suitable for the implementation of FPUs. Actually, all
high speed computers have pipelined arithmetic units since pipelining allows for a
faster clock cycle and increased data throughput at small expense to latency from the
extra latching overhead [25]. Because FPGAs are register-rich, this is usually an
advantageous structure for FPGA design since the pipeline is created at no cost in
terms of device resources. The flip flops introduced by pipelining typically occupy the
unused flip flops within the logic cells that are already used for implementing the
design.
Thus in order to increase the design speed, both proposed units were pipelined by breaking them down into simple modules with registers placed in between. The number of registers placed between the different modules depended on each module's delay time. The overall design has both an input and an output register to synchronize the input and output data. In designing the proposed FPU, a top-down approach was used. At first an overview of the complete system was made
to gain a firm understanding of its operation and required specifications. Then the
design was divided into modules which were further broken down into smaller sub-
modules that performed simple specific operations. Each of the sub-modules was
optimized and tested separately before optimizing and testing the complete design.
5.2.1 Introduction
The simplified block diagram of the FP Multiplier Unit is shown in Figure 5.1, where operands A and B are the single precision inputs and result is the single precision output.
The 32 bit input operands are initially unpacked to sign, exponent and mantissa,
where each will be manipulated differently throughout the design. The exponents of
both operands are inspected to check for an input zero in which case the zeroflag is
set. Otherwise, if both inputs are non-zero operands, the zeroflag is unset, the exponents are added and the mantissas are multiplied. Multiplication here is an unsigned 24 by 24 bit multiplication whose 48 bit resultant can be either directly in the normalized form or, if a carry bit occurs, require a one bit shift to the right. So the resultant mantissa is post normalized if necessary and the
exponent is adjusted accordingly and tested for possible overflow or underflow. Now,
the 48 bit mantissa is rounded to fit it in the specified number of bits and if necessary
the resultant exponent is adjusted and re-checked for possible overflow or underflow.
Finally, the 32 bit result is given in an IEEE compliant format along with the overflow and underflow flags.
The bottleneck of the mantissa multiplication is the time consuming process of adding the 24 partial products that are calculated by the successive multiplication operations. As in the multiplication algorithms that were explained in Chapter 4, the main concept adopted to increase the multiplication speed is to reduce the number of partial fractions to be added in order to break down the long carry chain involved. The proposed algorithm divides the multiplication operation into several smaller multiplications performed in parallel whose results are appropriately manipulated to give the final 48 bit resultant. This in turn is performed by slicing up the 24 bit input mantissas of operands A and B into smaller blocks and performing the multiplication on these blocks, hence came the term Block Multiplication.
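The Block Multiplication idea can be modeled in a few lines of Python. This is an integer-level sketch only (the thesis implements it as parallel pipelined VHDL hardware), with the slicing into 8-bit blocks B2, B1, B0 as described in the text:

```python
def block_multiply_24x24(a: int, b: int) -> int:
    """24x24 unsigned multiply via three 24x8 block products, each of
    which is itself built from three 8x8 products, mirroring the slicing
    of both mantissas into 8-bit blocks."""
    b_blocks = [(b >> (8 * i)) & 0xFF for i in range(3)]   # B0, B1, B2
    partial_fractions = []
    for bi in b_blocks:
        # 24x8 product: three 8x8 products of the A slices, recombined
        pf = sum((((a >> (8 * j)) & 0xFF) * bi) << (8 * j) for j in range(3))
        partial_fractions.append(pf)
    pf0, pf1, pf2 = partial_fractions
    return pf0 + (pf1 << 8) + (pf2 << 16)   # align and add -> 48-bit result
```

In hardware, the nine 8x8 products run in parallel, so the serial carry chain of a full 24x24 array multiplier is broken into much shorter ones.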
Figures 5.2(a) and 5.2(b) illustrate the detailed block diagram of the proposed FP Multiplier Unit. In the following sub-sections, each module shown in these figures is discussed in detail. Finally, the behavioral simulation of the entire FP Multiplier Unit is given. All modules were written in VHDL using the FPGAdv 8.1 Mentor tool and were behaviorally simulated to verify their correct operation using ModelSim 6.3a.
5.2.2 Zero Detect Module:
The zero detect unit is responsible for three main tasks, which are:
1. Unpacking: Separating the sign, exponent and mantissa of the input operands, since each is dealt with differently in the FP Multiplier Unit.
a. Sign Bits: The input signs are XORed to determine the sign of the result.
2. Input zero detection: The exponents of both input operands are checked. If one or both of the exponents is found to be all zeros, the zero flag is set. Since a zero floating point number has an all zeros exponent and mantissa, and denormalized numbers (all zeros exponent with a non-zero mantissa) are treated as zeros in the proposed design, checking only the exponent of the input operands serves to detect zero inputs, and the zero flag is set accordingly.
3. Setting the implied bit: The implied bit that was omitted from each stored mantissa is restored so that the arithmetic operations can be performed correctly. The implied bit is always restored as a '1' since, in the proposed design, all non-zero numbers are normalized.
Figures 5.3 and 5.4 show the symbol and behavioral simulation of the Unpack Module. The outputs appear after two clock cycles due to the presence of an input and an output register within the module. The simulation illustrates how each 32 bit operand is broken into sign, exponent and mantissa while the implied bit is set in the mantissa. For non-zero operands the zeroflag is unset. On the other hand, for the case of a zero or a denormalized number, like the cases at 300ns and 400ns where B = (00000000)H and B = (00000777)H respectively, the zeroflag is set.
Figure 5.4 Behavioral Simulation of the Unpack Module in the FP Multiplier Unit
5.2.3 Add Exponent Module:
This module is mainly responsible for finding the resultant exponent. So in this module, the two exponents are added and a 127 bias is subtracted to compensate for the bias being added in both exponents. The resultant exponent is then stored in 10 bits; that is, two carry bits are added to the left to allow for the detection of an exponent that is either too small or too large. Since the 8 bit exponent of a single precision floating point number is an unsigned number, the 10th bit being '1' indicates an underflow. An overflow is detected when the 10th bit is '0' and the 9th bit is '1'. The actual checking for overflow or underflow is performed later in the Exception Detection Sub-Module.
In this module, the resultant sign is also calculated using a simple XOR operation.
The mantissa of operand B is also sliced into three 8-bit blocks here to prepare it for the block multiplication performed in the following module.
Figures 5.5 and 5.6 show the symbol and behavioral simulation of the Add Exponent Module. After the arst signal (asynchronous reset) becomes inactive at 100ns, the input exponents expa = 238 = (EE)H and expb = 51 = (33)H are added and the 127 bias is subtracted, with the resultant exponent appearing one clock cycle later at 200ns as (0A2)H. The resultant sign is also determined, where signres = signa XOR signb = '0' XOR '1' = '1', as appearing at 200ns. The input mantissa B is broken up into the three 8 bit slices, namely mantb2, mantb1 and mantb0.
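The exponent arithmetic of this module, including the two extra check bits, can be sketched in Python. This is an illustrative model (the function name is an assumption), reproducing the (EE)H + (33)H example from the simulation:

```python
def add_exponents(expa: int, expb: int) -> int:
    """Add two excess-127 exponents, keeping a 10-bit result so that two
    extra upper bits remain available for overflow/underflow checking."""
    return (expa + expb - 127) & 0x3FF     # 10-bit two's-complement result

print(hex(add_exponents(0xEE, 0x33)))      # prints: 0xa2
```

With the masking to 10 bits, a negative (underflowed) result sets the 10th bit and a result above 8 bits sets the 9th bit, which is exactly what the later exception check inspects.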
Figure 5.5 Symbol of the Add Exponent Module in the FP Multiplier Unit
This module is responsible for performing the unsigned multiplication of the two 24 bit mantissas. The multiplication was not performed as a single operation using a simple multiplication statement written in VHDL; this was mainly in order to apply the proposed Block Multiplication algorithm and break the long carry chain involved.
In order to implement the Block Multiplication algorithm, the two 24 bit input mantissas were sliced into three 8 bit blocks each, so that Mantissa A = A2A1A0 and Mantissa B = B2B1B0, with A2, A1, A0, B2, B1 and B0 each being an 8 bit block. In order to perform the 24 by 24 bit multiplication, three 24 by 8 bit multiplications are performed in parallel, as shown in Figure 5.7. Within each of the three multiplication operations, three 8 by 8 bit multiplications are actually performed, where the specific B block is multiplied by one of the three blocks of mantissa A. Each of the 8 by 8 bit multiplications gives a 16 bit
result. These three 16 bit results are appropriately manipulated, as illustrated in Figure
5.8, to give the 32 bit resultant expected from the 24 * 8 multiplication. The resulting
32 bit numbers from each of the three main operations of Figure 5.7 are calculated in parallel. The 32 bit resultants from the multiplication of Mantissa A with B1 and B2 are shifted left by 8 and 16 places respectively, then added to the 32 bit resultant from the multiplication of Mantissa A with B0 to give the final 48 bit result of the 24 by 24 bit multiplication. The detailed block diagram of the implemented Block Multiplier is shown in Figure 5.9.
5.2.4.1 Multiplier (8 by 8) Sub-Module
This sub-module multiplies the three 8 bit slices of the mantissa of operand A with one of the three different slices of the mantissa of operand B. This is performed as three 8 by 8 bit multiplications giving three 16 bit outputs. As shown in Figure 5.9, three instances are used from this sub-module, one for each slice of the mantissa of operand B.
5.2.4.2 Partial Fraction Adjust Sub-Module
This sub-module accepts the three 16 bit outputs from the Multiplier (8 by 8)
Sub-Module and appropriately manipulates them, as discussed shortly before this and
as illustrated in Figure 5.8, to give the 32 bit partial fraction resulting from the corresponding 24 by 8 bit multiplication. As shown in Figure 5.9, there are three instances of this sub-module, one following each instance of the Multiplier (8 by 8) Sub-Module. The outputs from these instances are PF2, PF1 and PF0 respectively.
5.2.4.3 Add Partial Fractions Sub-Module
This sub-module accepts the three 32 bit partial fractions, appropriately shifts
them then adds them together. Since the Multiplier Module is dealing with 8 bit slices
to perform the multiplication operation, thus to appropriately align the partial fractions for their addition, PF1 and PF2 are shifted left by 8 and 16 bits respectively as shown
in Figure 5.10.
Figure 5.10 Shift Operations Performed to Align the Partial Fractions for Addition
In order to add the three partial fractions PF2, PF1 and PF0 together, each of them
was first divided into two parts most significant (MS) and least significant (LS) as
shown in Figure 5.11. Then all the least significant parts were added together and the
result stored in 26 bits (24 bits and two carry bits where maximum possible carry is
“10”), and all the most significant parts were added together and the result stored in
24 bits. Performing the addition in such a manner leads to the increase of the overall
speed of the design by breaking up the long carry chain involved in the addition of the
32, 40 and 48 bits long partial fractions PF0, PF1 and PF2 respectively.
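The split addition described above can be modeled in Python. This is an arithmetic sketch only, checking that splitting each aligned partial fraction at bit 24 into least significant (LS) and most significant (MS) parts still yields the correct sum (the function name is an assumption):

```python
def add_partial_fractions(pf0: int, pf1: int, pf2: int) -> int:
    """Add the three partial fractions by splitting each aligned fraction
    at bit 24 into LS and MS parts, so the two narrower additions break
    the long carry chain of a direct 48-bit addition."""
    aligned = [pf0, pf1 << 8, pf2 << 16]           # align for addition
    ls_sum = sum(p & 0xFFFFFF for p in aligned)    # fits in 26 bits (24 + 2 carry)
    ms_sum = sum(p >> 24 for p in aligned)         # fits in 24 bits
    return (ms_sum << 24) + ls_sum                 # recombine into the 48-bit result
```

Because the LS and MS sums are independent until the final recombination, the two additions can proceed in parallel in hardware.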
Figure 5.11 Dividing the Partial Fraction to prepare them for Addition.
The final sub-module combines the two resultants from the addition of the least and most significant parts of the partial fractions, that were given by the previous sub-module, in order to give the final 48 bit resultant of the multiplication.
Figure 5.12 shows the block diagram of the Multiplier Module. Figures 5.13 to 5.16 show the behavioral simulations of its sub-modules. Each of the figures shows the inputs and outputs of a particular sub-module.
Figure 5.13 Behavioral Simulation of the Multiplier (8 by 8) Sub-Module
Figure 5.15 Behavioral Simulation of the Add Partial Fractions Sub-Module
This module is responsible for making sure the output mantissa is in the normalized form and for detecting exponent overflow or underflow.
5.2.5.1 Post Normalize Sub-Module
The multiplication of the two 24 bit mantissas gives a 48 bit resultant mantissa that has a radix point to the right of the two most significant bits, i.e. the 48th and the 47th bits. Since both multiplied mantissas are normalized, that is their most significant bit is a '1', either the 48th or the 47th bit of the resultant must be a '1'. Accordingly, to make sure the resultant mantissa is in the normalized form, these two bits are checked to locate the most significant '1', which is to be defined as the implied bit. If the 48th bit is detected to be '1', the mantissa is normalized by being shifted right by one place and the exponent is incremented by 1 to compensate for such a shift. Otherwise, if the 47th bit is detected to be '1', the resultant mantissa is already in the normalized form and no shifting is required. In both cases, the detected '1' that is identified as the implied bit is dropped. Finally, the resultant mantissa is truncated to 26 bits with the least significant bit being the sticky bit, which is set to '1' if any of the dropped bits is a '1'.
Figures 5.17 and 5.18 show the symbol and behavioral simulation of the Post Normalize Sub-Module. The mantissa appearing as (000001)H at 200ns was shifted right by one bit, the implied bit has been dropped and the appropriate sticky bit has been added, '1' in this case since there is a dropped '1' bit. The exponent has been incremented from (33)H to (34)H to account for the one bit shift to the right in the resultant mantissa. At 300ns, the input is (400000000001)H, so the carry bit is zero and no shifting of the mantissa or increment of the exponent is required. The input at 400ns is the same as that at 300ns except that the least significant bit changed from '1' to '0'. So the output mantissa at 500ns has a sticky bit with a value '0' since all the dropped bits are '0's.
Figure 5.17 Symbol of the Post Normalize Sub-Module in the FP Multiplier Unit
In the Exception Detection Sub-Module, the resultant exponent is checked and any overflow or underflow is detected. Checking the exponent was performed after the Post Normalize Sub-Module in order to take into account the case in which the normalization operation increments the exponent. The checking is performed by inspecting the 10th and 9th bits of the resultant 10 bit exponent calculated earlier in the Add Exponent Module. If the 10th bit is found to be a '1', the underflow flag is set; if not and the 9th bit is found to be a '1', then the overflow flag is set. Otherwise, the exponent is adjusted to 8 bits by dropping these two bits and the overflow and underflow flags are left unset.
Figures 5.19 and 5.20 show the symbol and behavioral simulation of the Exception Detection Sub-Module. An exponent of (022)H = (00 0010 0010)B at 200ns leads to both the overflow and underflow flags being unset at 300ns. At 300ns an exponent (222)H = (10 0010 0010)B sets the underflow flag at 400ns, where the 10th bit in the exponent is a '1'. A (122)H = (01 0010 0010)B exponent at 400ns sets the overflow flag at 500ns, where the 10th bit is a '0' while the 9th bit is a '1'.
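The bit-level check above can be written out as a short Python sketch, using the same three exponent values as the simulation. The function name is illustrative; the bit positions follow the module description:

```python
def check_exponent(exp10: int):
    """Inspect the 10-bit resultant exponent from the Add Exponent Module:
    the 10th bit set means the biased exponent went negative (underflow);
    otherwise the 9th bit set means it exceeded 8 bits (overflow)."""
    underflow = bool(exp10 & 0x200)                      # 10th bit
    overflow = not underflow and bool(exp10 & 0x100)     # 9th bit
    return overflow, underflow, exp10 & 0xFF             # drop the two check bits
```

For (022)H neither flag is set, (222)H sets underflow and (122)H sets overflow, matching the simulation described above.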
Figure 5.19 Symbol of the Exception Detection Sub-Module in the FP Multiplier Unit
5.2.6 Rounding Module
This module is responsible for rounding the 26 bit resultant mantissa to 23 bits using the REN technique, then rechecking for possible overflow due to rounding and finally giving the 32 bit resultant in IEEE format. This is performed through three sub-modules.
5.2.6.1 REN Sub-Module
Within this sub-module, the 26 bit resultant mantissa is rounded to 23 bits using the REN technique, which inspects the guard, round and sticky bits to determine whether the mantissa will be incremented (Increment), rounded to the nearest even number (Tie to Nearest Even) or left unchanged, equivalent to simply truncating the guard, round and sticky bits (Truncate). All the possible cases are summarized in Table 5.2. In this sub-module also, a recheck flag is set for the case of increment rounding and passed to the following sub-module, the Overflow Recheck Sub-Module. The reason behind this flag is explained in the following sub-section.
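The rounding decision can be sketched in Python. This is a behavioral model of round-to-nearest, ties-to-even on a 26-bit mantissa (23 kept bits plus guard, round and sticky); the function name is an assumption:

```python
def ren_round(mant26: int):
    """Round a 26-bit mantissa (23 result bits + guard, round, sticky)
    to 23 bits using round-to-nearest, ties-to-even (REN)."""
    kept = mant26 >> 3                      # the 23 bits to keep
    guard = (mant26 >> 2) & 1
    rnd = (mant26 >> 1) & 1
    sticky = mant26 & 1
    # increment when strictly above the halfway point, or on a tie
    # (guard set, round and sticky clear) when the kept LSB is odd
    increment = bool(guard and (rnd or sticky or (kept & 1)))
    if increment:
        kept += 1                           # may wrap past 23 bits: recheck flag
    return kept & 0x7FFFFF, increment
```

The second return value plays the role of the recheck flag: when an all-'1's mantissa is incremented it wraps to all zeros, which is exactly the case the Overflow Recheck Sub-Module must handle.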
Table 5.2 Rounding Action Based on Guard, Round and Sticky Bits
Figure 5.21 shows the symbol of the REN Sub-Module. Figures 5.22 and 5.23
show the behavioral simulation of the REN Sub-Module for an even and odd resultant
mantissa respectively. If the guard (GB), round (RB) and sticky (SB) bits are "100", the mantissa is expected to remain unchanged in the simulation of Figure 5.22 (case of even mantissa) and to be incremented to tie it to even in the simulation of Figure 5.23 (case of odd mantissa). For ease of illustration, the results of Figures 5.22 and 5.23 are summarized in Tables 5.3 and 5.4 respectively.
Table 5.3 REN Sub-Module Behavioral Simulation Results for Even Mantissa
Table 5.4 REN Sub-Module Behavioral Simulation Results for Odd Mantissa
5.2.6.2 Overflow Recheck Sub-Module
This sub-module sets the overflow flag in two cases:
1. If the resultant exponent is all '1's, since an all '1's exponent is reserved by the standard for special values.
2. If the recheck flag, given by the previous sub-module, is set and the mantissa is all zeros. That is because an all zeros mantissa results from the case of a mantissa of all '1's that has been incremented in the rounding operation. Such a mantissa has a '1' carry bit, so it needs normalizing, which is accounted for by incrementing the exponent and rechecking it for overflow.
Figures 5.24 and 5.25 show the symbol and behavioral simulation of the Overflow Recheck Sub-Module respectively. Case 1 is shown when an all ones exponent (FF)H is introduced at 200ns, thus setting the overflow flag at 300ns. Case 2 is shown at 400ns where the mantissa is all zeros and the recheck flag is set; the exponent (33)H is accordingly incremented, with the result appearing at 500ns.
Figure 5.25 Behavioral Simulation of the Overflow Recheck Sub-Module
in the FP Multiplier Unit
5.2.6.3 Final Sub-Module
This sub-module is responsible for appending the sign bit, the 8 bit exponent and the 23 bit mantissa together to give the single precision floating point multiplication result in the IEEE format.
Figures 5.26 and 5.27 show the symbol and behavioral simulation of the Final Module respectively. A '1' sign, a (78)H exponent and a (2AAAAA)H mantissa at 100ns are appended to give the 32 bit final result (BC2AAAAA)H at 200ns. A set zeroflag at 400ns results in an all zeros output at 500ns. Also, a set overflow or underflow flag at 300ns and 500ns results in an all zeros output at 400ns and 600ns respectively.
Figure 5.27 Behavioral Simulation of the Final Module in the FP Multiplier Unit
After designing each of the FP Multiplier Unit modules and simulating them to assure their correct functionality, they were all put together, and the complete FP Multiplier Unit behavioral simulation is shown in Figure 5.29 for the test bench in Table 5.5. For example, the output of the operation 25.8 * -7.4 is (C33EEB85)H.
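This test-bench value can be cross-checked against IEEE single precision rounding with a short Python sketch (`fp_mul_hex` is an illustrative helper, not part of the design):

```python
import struct

def fp_mul_hex(a: float, b: float) -> str:
    """Multiply two numbers, round the product to IEEE single precision
    and return the 32-bit result as an upper-case hex string."""
    return struct.pack('>f', a * b).hex().upper()

# fp_mul_hex(25.8, -7.4) reproduces the simulated result 'C33EEB85'
```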
Figure 5.29 Behavioral Simulation of the FP Multiplier Unit
5.3 Proposed FP Adder/Subtractor Unit
5.3.1 Introduction
The block diagram of the FP Adder/Subtractor Unit is shown in Figure 5.30, where operands A and B are the single precision inputs. The unit starts by unpacking the sign, exponent and mantissa of both input operands so that each can be dealt with separately. The inputs are then swapped if necessary to ensure that operand B carries the smaller operand, which is pre-normalized before the addition or subtraction operation is performed, while operand A carries the larger operand whose exponent is set as the resultant exponent. Zero detection is performed next, where the zeros flag is set appropriately to indicate whether none, one or both of the operands are zero. If both operands are non-zero, the operation proceeds by pre-normalizing the smaller operand, operand B, then finding the resultant sign and effective operation, performing that effective operation, then post normalizing (using the LOD algorithm) and rounding (using the REN technique) the resultant mantissa. Finally, the resultant sign, exponent and mantissa are appended together to give the final result.
As mentioned earlier, in the FP Adder/Subtractor Unit the post normalization process is the bottleneck of the design. In Chapter Three, the commonly used architectures of the FP Adder/Subtractor Unit were introduced, and it was shown that they mainly differ in when the leading '1' bit is detected. It was either detected after the resultant was
integrating both paths and using the optimum to perform the operation as in the two-
that was deeply pipelined to achieve a high maximum operating frequency. The LOD
of the inputs in order to predict the position of the leading one and then error
ii. Area efficient due to its simple one-path data flow, as opposed to the two-path and three-path implementations, which both consume a very large area and require the
Figure 5.31 illustrates the detailed block diagram of the FP Adder/Subtractor Unit. All modules were written in VHDL using the FPGAdv 8.1 Mentor Tool and were behaviorally simulated to verify their correct operation using ModelSim 6.3a.
5.3.2 Unpack Module
This first module is responsible for two main tasks, which are:
1. Unpacking the input operands: The module breaks each 32 bit input into its sign, exponent and mantissa fields so that the upcoming modules deal with the signs, exponents and mantissas of these two numbers separately.
2. Checking if the input operands are equal: The two operands are said to be equal if both their exponents and mantissas compare as equal. An aequalb flag would then be set to be used in upcoming modules for output zero detection if the effective operation is found to be subtraction.
Figures 5.32 and 5.33 show the symbol and behavioral simulation of the Unpack Module respectively. Like in the FP Multiplier Unit, this is the first block in the FP Adder/Subtractor Unit, thus it has both an input and an output register, causing the output to appear after two clock cycles. The simulation illustrates how each 32 bit input is broken into sign, exponent and mantissa. The inputs A=(AAAA AAAA)H and B=(BBBB BBBB)H at 100ns were broken down to signa='1', expa=(55)H and manta=(2AAAAA)H and signb='1', expb=(77)H and mantb=(3BBBBB)H at 200ns. At 200ns, both inputs were equal, so the aeqb flag was set at 400ns.
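The field extraction can be sketched behaviorally in Python (the module itself is VHDL; `unpack_fp` is an illustrative name):

```python
def unpack_fp(word: int):
    """Split a 32-bit IEEE single precision word into (sign, exponent, mantissa)."""
    sign = (word >> 31) & 1          # bit 31
    exponent = (word >> 23) & 0xFF   # bits 30..23
    mantissa = word & 0x7FFFFF       # bits 22..0
    return sign, exponent, mantissa
```

Applied to the simulation inputs, (AAAAAAAA)H yields ('1', (55)H, (2AAAAA)H) and (BBBBBBBB)H yields ('1', (77)H, (3BBBBB)H).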
Floating point addition or subtraction can only be performed if both the input operands are aligned, i.e. have equal exponents. If the two exponents are not equal, the smaller operand has to be pre-normalized first before the addition or subtraction takes place. To prepare the operands for the addition/subtraction operation, the following steps must be performed:
1. Determining the smaller input operand.
4. Shifting right the mantissa of the smaller input a number of times equal to the difference between the two input exponents.
The Swap Module is the module responsible for determining which input is the smaller operand and placing it as operand B to be pre-normalized in the Pre-normalize Module. So the exponents of the two input operands are compared to determine the smaller one; if needed, the contents of both operands are swapped and a swap flag is set to be taken into consideration when determining the effective operation later in the Adder Module.
Figures 5.34 and 5.35 show the symbol and behavioral simulation of the Swap Module respectively. At 100ns, the exponent of operand A=(AA)H is smaller than that of operand B=(BB)H, causing a swap operation to take place at 200ns and setting the swap flag. At 200ns, when the exponents of operands A and B are equal, no swap takes place.
Figure 5.35 Behavioral Simulation of the Swap Module in the FP
Adder/Subtractor Unit
Zero Detect Module is responsible for three main tasks, which are:
1. Determining if one or both of the input operands are zero and setting the zeros flag accordingly.
2. In the case where both operands are non-zero, calculating the difference between their exponents to be used as the shift amount in the Pre-normalize Module.
The zero detection is performed here, as opposed to being in the Unpack Module like in the FP Multiplier Unit, to simplify the zero detection operation by making use of the aequalb flag set in the Unpack Module if both inputs were equal, and the fact that the Swap Module ensured that the smaller input is stored as operand B. Thus, by checking only the exponent of operand B and the aequalb flag, the appropriate zeros flag can be set as summarized in Table 5.6. Again like in the FP Multiplier Unit, checking only the exponent of the input operands serves in detecting a denormalized number as a zero.
Figures 5.36 and 5.37 show the symbol and behavioral simulation of the Zero Detect Module respectively. At 100ns, both mantissas are non-zero, so at 200ns the zeros flag is unset and the difference between the exponents is calculated as diff = expa - expb = (AA)H - (33)H = (77)H. At 300ns, both operands are equal to zero, so at 400ns the zeros flag is set and the difference is zero. At 400ns, only operand B is zero, so at 500ns the zeros flag is updated accordingly, but the difference is still zero.
Figure 5.36 Symbol of the Zero Detect Module in the FP Adder/Subtractor Unit
Figure 5.37 Behavioral Simulation of the Zero Detect Module in the FP
Adder/Subtractor Unit
The Pre-normalize Module is responsible for adjusting the two mantissas for the addition/subtraction operation through the following tasks:
1. Fitting the mantissas into 28 bits, where a carry bit (C) and the implied bit (I) are added to the left of the mantissas, and guard (G), round (R) and sticky (S) bits are added to their right to increase the precision of the addition/subtraction operation and to be used in the Rounding Module. The format of the 28 bit mantissa is:
C | I | 23 bit Mantissa | G | R | S
2. Pre-normalizing the mantissa of operand B, the smaller operand, by shifting it right a number of times equal to the difference between the two input operand exponents. The shifting is performed by a barrel shifter implemented totally in VHDL. Barrel shifting has the advantage of shifting the data by any number of bits in one operation, whereas if a simple shifter were used, shifting by n bit positions would require n clock cycles. This makes barrel shifting the most suitable for the shifting operations required in floating point operations, especially addition/subtraction.
When considering the possible shift that may be needed in this module, we found that theoretically the shift can be any number between 1 and 253, depending on the difference between the exponents of the input operands. Practically speaking though, any shift greater than 25 would just lead to the dropping of all the mantissa bits. In such a case, the shifting operation leads to an all-'0's mantissa with the exception of the sticky bit, which carries the logical ORing of all the dropped bits and thus would always store '1'. So in the implementation, the shift operation is performed only if the difference between the two exponents is less than or equal to 25. Otherwise, the mantissa is set to all '0's with a sticky bit of value '1'.
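The shift-with-sticky behavior can be sketched in Python (the design uses a VHDL barrel shifter; `prenormalize` and its argument names are illustrative):

```python
def prenormalize(mant28: int, diff: int) -> int:
    """Shift the 28-bit mantissa right by the exponent difference `diff`.

    All dropped bits are ORed into the sticky bit (bit 0). For a shift
    greater than 25 every significant bit would be dropped, so the result
    is all '0's except for a sticky bit of '1' (assuming a non-zero operand).
    """
    if diff > 25:
        return 1                      # all '0's with sticky = '1'
    dropped = mant28 & ((1 << diff) - 1)   # bits that fall off the right
    shifted = mant28 >> diff
    if dropped:
        shifted |= 1                  # sticky = OR of the dropped bits
    return shifted
```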
The Pre-normalize Module is designed to operate only if the zeros flag indicates that both input operands are non-zero. Otherwise, no shifting is performed and the inputs are just adjusted into 28 bits with the carry, guard, round and sticky bits appended.
Figures 5.39 and 5.40 show the symbol and behavioral simulation of the Pre-Normalize Module respectively. At 100ns, the signal diff_i indicates that the exponent difference (and required shift) is (3)H = 3. Thus at 300ns, mantissa B comes out shifted right by three places. At 200ns, when the exponent difference is (1E)H = 30, mantissa B comes out at 400ns as all zeros except for the sticky bit.
The Add/Subtract Module was broken up into three sub-modules to allow for
deeper pipelining thus increasing the overall speed of the design. These three sub-
modules are:
5.3.6.1 Pre-Add/Subtract Sub-Module
This sub-module is responsible for determining the resultant sign and the effective operation to be performed based on the signs of both input operands and the required operation. The resultant sign is determined from the signs of the input operands, the required operation, the swap flag and an agrtb flag (a greater than b flag), as shown in Equation 5.1. The agrtb flag is set in this module if the mantissa of operand A is greater than that of operand B.
Resultant_Sign = ((sw_f AND rop) OR ((NOT sgna) AND (NOT agrtb_f) AND rop)) XOR ((sgna AND agrtb_f) OR (sgnb AND (NOT agrtb_f) AND (NOT rop)))    (5.1)
Figures 5.42 and 5.43 show the behavioral simulation of the Pre-Add/Subtract Sub-Module for the cases of mantissa B greater than mantissa A and vice versa respectively.
Figure 5.42 Behavioral Simulation of the Pre-Add/Subtract Sub-Module in the FP Adder/Subtractor Unit
5.3.6.2 Zeros Flag Update Sub-Module
This sub-module updates the zeros flag to indicate a zero final result if the aequalb flag is set and the effective operation was found to be subtraction.
Figures 5.44 and 5.45 show the symbol and behavioral simulation of the Zeros Flag Update Sub-Module respectively. When the aequalb flag was set and the effective operation was subtraction (eff_oper_i = '1') at 300ns, the zeros flag was accordingly updated.
5.3.6.3 Adder Sub-Module
This sub-module performs the effective addition or subtraction operation to calculate the 28 bit resultant mantissa. The effective operation is performed only if the zeros flag indicates that both input operands are non-zero. Otherwise, the output is directly given depending on the zeros flag, as summed up in Table 5.7.
Figures 5.46 and 5.47 show the symbol and behavioral simulation of the Adder Sub-Module respectively. Calculations were performed only when the zeros flag was "00"; for example, mantissa A=(2222222)H is added to mantissa B. When the zeros flag became "01" at 500ns and "11" at 600ns, no calculations were made and the resultant mantissa was directly given as mantissa A at 600ns and zero at 700ns respectively.
Figure 5.47 Behavioral Simulation of the Adder Sub-Module in the FP
Adder/Subtractor Unit
5.3.7 Post Normalize Module
The Post Normalize Module is responsible for normalizing the resultant mantissa by detecting the most significant '1' bit in the resultant mantissa, shifting the mantissa to set this detected '1' as the implied bit, and finally adjusting the exponent in a manner that accounts for this shift.
The Post Normalize Module is implemented using the LOD algorithm. Although floating point adders implemented using the LOD algorithm are not generally the fastest [7], this was overcome by two means. The first was using barrel shifting to implement the shift operations in this module. The second was breaking up the Post Normalize Module into several sub-modules to allow for its deep pipelining, which proved to significantly improve the design speed. All these sub-modules were designed to operate only if the zeros flag indicated that both input operands were non-zero.
5.3.7.1 Zeros Count Sub-Module
This sub-module is the first step in the LOD algorithm. It is responsible for detecting the most significant '1' in the resultant mantissa. Several techniques were introduced by designers to detect the most significant '1' bit [7]. In the proposed design, the position of the most significant '1' bit is found by simply counting the number of '0' bits, starting from the left of the mantissa, before the first '1' is detected. The number of zeros is then used to normalize the result by shifting the mantissa left by that amount and adjusting the exponent accordingly.
Figures 5.48 and 5.49 show the symbol and behavioral simulation of the Zeros Count Sub-Module respectively. At 100ns, where the zeros flag was unset, the input was (8888888)H, indicating that the carry bit was '1', so the number of zeros came out at 200ns as (00)H. Also, at 200ns the input was (4888888)H, so the carry bit is a '0' followed by a '1' in the position of the implied bit. Technically, the number is already in the normalized form, so again this case outputs the number of zeros at 300ns as (00)H. At 300ns, the input is (2888888)H, so the number of zeros comes out at 400ns as (01)H. At 500ns and 600ns, the zeros flag indicates that one and both inputs are zeros respectively. No shift is necessary in either of these cases, thus the number of zeros is given as (00)H.
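The zeros-count logic can be sketched in Python (a behavioral sketch of the VHDL sub-module; `zeros_count` is an illustrative name):

```python
def zeros_count(mant28: int) -> int:
    """Zeros before the most significant '1' of a 28-bit working mantissa.

    Bit 27 is the carry bit and bit 26 the implied bit; if either is '1'
    the mantissa is already normalized, so the count is zero. The count
    is measured relative to the implied-bit position, i.e. it equals the
    left shift needed to normalize the mantissa.
    """
    if mant28 >> 26:          # carry or implied bit set: already normalized
        return 0
    if mant28 == 0:           # zero result: no shift is performed
        return 0
    return 27 - mant28.bit_length()
```

This reproduces the simulated cases: (8888888)H and (4888888)H give (00)H, while (2888888)H gives (01)H.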
Figure 5.49 Behavioral Simulation of Zeros Count Sub-Module in the FP
Adder/Subtractor Unit
This sub-module is responsible for adjusting the resultant exponent in coordination with the expected shift of the resultant mantissa. This situation has three cases:
1. The resultant mantissa carry bit is '1'. It is considered to be the implied bit, thus it is required to shift the mantissa one place to the right, and the exponent is incremented by one.
2. The resultant carry bit is '0' followed directly by a '1'. The mantissa is already in the normalized form, so the exponent is left unchanged.
3. If none of the above scenarios is satisfied, then the number of zeros to the left of the most significant '1' is subtracted from the resultant exponent to account for shifting the resultant mantissa left.
Figures 5.50 and 5.51 show the symbol and behavioral simulation of the Exponent Adjust Sub-Module respectively. At 100ns, the mantissa has a carry '1', so the exponent is incremented from (29)H to (2A)H, as shown at 200ns. At 200ns, the carry bit is a '0' and the bit in the implied bit position is a '1', so the exponent passes unchanged. At 300ns and 400ns, the number of zeros is 1 and 7, so the exponent comes out decremented accordingly.
Figure 5.51 Behavioral Simulation of the Exponent Adjust Sub-Module in the FP Adder/Subtractor Unit
5.3.7.3 Mantissa Normalize Sub-Module
This sub-module is responsible for performing the final step of the normalization process, which is the shifting of the resultant mantissa and the dropping of the implied bit. The amount of shift depends on the position of the detected most significant '1', that is:
1. If the resultant mantissa carry bit is '1', the mantissa is shifted to the right by one place to set the carry bit as the implied bit. The dropped bit, which is the least significant bit, is passed to the Round Module to be used to update the sticky bit.
2. If the resultant carry bit is '0' followed directly by a '1', no shifting of the mantissa is required.
3. If none of the above scenarios is satisfied, then the resultant mantissa is shifted left a number of times equal to the number of zeros before the most significant '1'.
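The three cases above can be sketched together in Python (a behavioral sketch of the VHDL sub-module; `normalize` and the returned tuple layout are illustrative):

```python
def normalize(mant28: int, zeros: int):
    """Normalize the 28-bit resultant mantissa.

    Returns (normalized mantissa, exponent adjustment, dropped bit):
      case 1: carry '1'   -> shift right one place, exponent +1, LSB dropped
      case 2: implied '1' -> already normalized, nothing changes
      case 3: otherwise   -> shift left by `zeros`, exponent -zeros
    """
    if mant28 & (1 << 27):                                  # case 1: carry set
        return mant28 >> 1, +1, mant28 & 1
    if mant28 & (1 << 26):                                  # case 2: normalized
        return mant28, 0, 0
    return (mant28 << zeros) & ((1 << 28) - 1), -zeros, 0   # case 3
```

Combining the shift with the matching exponent adjustment in one place makes it easy to check that the two sub-modules stay consistent.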
Figures 5.52 and 5.53 show the symbol and behavioral simulation of the Mantissa Normalize Sub-Module respectively. At 100ns and 200ns, the carry bit of the resultant mantissa is '1', so the normalized mantissa appears at 200ns and 300ns shifted one bit to the right, with the drop bit being set to '1' only at 300ns to account for the '1' dropped from the mantissa (4444445)H at 200ns. A shift left by the number of zeros is performed on the resultant mantissa inputs at 300ns, 400ns and 500ns.
Figure 5.53 Behavioral Simulation of Mantissa Normalize Sub-Module in the FP
Adder/Subtractor Unit
Like in the FP Multiplier Unit, the Round Module here is responsible for rounding the 26 bit resultant mantissa to 23 bits using the REN technique and then giving the final result of the addition or subtraction operation in the IEEE single precision format.
5.3.8.1 REN Sub-Module
This sub-module is responsible for rounding the resultant mantissa using the REN mode. The REN mode inspects the guard, round and sticky bits to determine whether the mantissa will be incremented (Increment), rounded to the nearest even number (Tie to Even), or left unchanged by truncating the guard, round and sticky bits (Truncate).
Figures 5.54 and 5.55 show the symbol and behavioral simulation of the REN Sub-Module respectively. In the shown simulation, the mantissa is even until 700ns, and then it becomes odd. The test bench used is similar to that used in the REN Sub-Module of the FP Multiplier Unit, which was summarized in Tables 5.3 and 5.4.
5.3.8.2 Final Sub-Module
This sub-module is responsible for appending the sign bit, the 8 bit exponent and the 23 bit mantissa together to give the final result in the IEEE single precision format.
Figures 5.56 and 5.57 show the symbol and behavioral simulation of the Final Sub-Module respectively. A '1' sign, a (22)H exponent and a (222222)H mantissa at 100ns are appended to give the 32 bit final result (91222222)H at 200ns.
5.3.9 The FP Adder/Subtractor Unit Behavioral Simulation
After designing each of the FP Adder/Subtractor Unit modules and simulating them to assure their correct functionality, they were all put together, and the complete FP Adder/Subtractor Unit behavioral simulation is shown in Figure 5.59 for the test bench in Table 5.8. For example, the output of the operation 12+12 =
Table 5.8 Test Bench of the FP Adder/Subtractor Unit Behavioral Simulation
Chapter 6
The FP Adder/Subtractor and Multiplier Units proposed in the previous chapter were able to operate at high operating speeds, hence surpassing most of the existing designs, as will be illustrated in this chapter. Such designs were a result of continuous optimization for speed.
In this chapter, the synthesis and implementation results are illustrated, discussed and compared with results from the previous work of others. Finally, the power analysis and timing simulations are illustrated for both designs. All synthesis and implementation results were generated with the tool options set for speed.
Initially, the maximum operating speed of each module was found from the synthesis reports for both the FP Adder/Subtractor and Multiplier Units. Although the synthesis reports are not as accurate as the static timing reports given post routing, they still give a good indication of the module's performance, with the advantage of being generated much faster than static timing reports, and thus are sufficient to be used during the initial optimization.
Next, using the generated synthesis reports, critical modules were continuously identified and optimized. Basically, modules were optimized by either breaking them down into smaller pipelined modules or editing the VHDL code for better synthesis. Synthesis options provided by the Xilinx ISE Tool were also efficiently set to achieve the best possible performance of the entire design. The FP Adder/Subtractor and Multiplier Units were then implemented, and the speeds of the separate modules were generated from the static timing report. Critical modules were again identified and optimized whenever possible, in a manner very similar to that just explained. Like the synthesis tool, the Xilinx implementation tool has several options that can dramatically affect the design performance.
The Xilinx implementation tool also allows user defined timing constraints, which help designers reach timing closure in high performance applications [29]. The use of user defined timing constraints also ensures that no timing violation, such as a period, setup or hold violation, occurs in the timing simulation. The fundamental timing constraints used in this work are the period and offset in constraints. The offset in constraint sets the setup time, which specifies the maximum time allowed for data to enter the chip, travel through the logic and arrive at a synchronous element (such as a flip-flop or block RAM) where that pin has a setup requirement before a clocking signal. Thus the offset in constraint is used to make sure that the input data arrives at an appropriate time; hold time can also be set using the offset in constraint. While timing constraints help designs meet their required timing specifications, very tight timing constraints could give results far from desired. In this work, the maximum speed reported by the synthesis tool was used as a guide to set realistic values for both the period and offset in constraints.
In the next section, a brief account of the optimization steps performed for the FP Adder/Subtractor Unit is given. The first implementation of the unit consisted of three pipeline stages, each performing one of the basic floating point addition/subtraction steps: pre-normalizing, adding/subtracting and post normalizing. The speed of this FP Adder/Subtractor Unit was 50 MHz when synthesized to Virtex5. In order to increase the operating speed, further pipelining and optimization of the design were necessary, which was performed by breaking down each of the three modules into smaller, simpler sub-modules with registers inserted between them. For example, the module responsible for addition or subtraction was sub-divided into three sub-modules, which are the Pre-Add Sub-Module, the Zeros Flag Update Sub-Module and the Adder/Subtractor Sub-Module. Each of these new sub-modules was synthesized to determine its maximum operating speed and hence identify the new critical modules.
After applying this technique to all the FP Adder/Subtractor Unit modules and sub-modules, until further pipelining had no positive effect on the overall speed, the operating frequency was just above 255 MHz when synthesized to Virtex5. Further optimization within each module was made wherever possible, mostly by trying to simplify the written code, which sometimes led to an increase in the module's speed.
Next, the design was implemented, and the implementation results, given by the static timing analysis report, were used to locate the remaining critical modules. Where needed, a register at the module's output was added, which raised the speed of the FP Adder/Subtractor Unit further. Finally, Xilinx options along with user defined timing constraints were iteratively adjusted until optimal results were attained. The implementation results of this final design over several FPGA platforms are summarized in the implementation results section below.
6.1.3 FP Multiplier Optimization
For the FP Multiplier Unit, the optimization experience gained with the FP Adder/Subtractor Unit came in handy. The first implementation of the FP Multiplier Unit indicated that there were two critical modules, which were the Add Exponent and the Multiplier Modules. So initially, in order to increase the overall speed, a register was added after the Add Exponent Module. The more critical issue then at hand was the optimization of the Multiplier Module, which in turn led to the evolution of the novel proposed Block Multiplication algorithm. At first, the 24 by 24 multiplier was broken down into three parallel 24 by 8 multipliers. When integrated into the FP Multiplier Unit, the design's speed increased. Multiplications were then further broken down into nine 8 by 8 multiplications.
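The first decomposition step can be checked with a short Python sketch (illustrative only; the actual design is three parallel VHDL multipliers):

```python
def block_multiply(a24: int, b24: int) -> int:
    """24x24 multiplication via three 24x8 partial products.

    b24 is split into three 8-bit blocks; each 24x8 partial product is
    shifted to its block position and the three results are summed.
    """
    assert 0 <= a24 < (1 << 24) and 0 <= b24 < (1 << 24)
    total = 0
    for i in range(3):                      # three parallel 24x8 multiplies
        block = (b24 >> (8 * i)) & 0xFF     # i-th 8-bit block of b24
        total += (a24 * block) << (8 * i)   # align the partial product
    return total
```

The same splitting applied to the 24-bit operand of each 24x8 multiplier yields the nine 8x8 multiplications mentioned above.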
Table 6.1 compares the performance of the FP Multiplier Unit when using the proposed Block Multiplier Module as opposed to using the Simple Multiplier Module [30]. The table summarizes the synthesis results for speed and area when the designs are synthesized to several Virtex FPGA platforms. The comparison shows that when using the Block Multiplier, the FP Multiplier Unit is able to operate at speeds almost double those attained when using the Simple Multiplier. This comes at the price of occupied area, where the opposite occurs: the design using the Block Multiplier occupies almost double the area occupied by the design using the Simple Multiplier.
Table 6.1 Synthesis Results for the FP Multiplier Unit using Proposed Block Multiplier vs. using Simple Multiplier

Platform                    Proposed Block Multiplier      Simple Multiplier
                            Speed (MHz)   Area (Slices)    Speed (MHz)   Area (Slices)
Virtex2p Xc2vp7ff896 -7     296           1038             181           412
Virtex4  Xc4vfx100 -12      461           945              106           343
Virtex5  Xc5vlx110 -3       450           592              217           578
Finally, the design was implemented, and the implementation results, given by the static timing analysis report, showed that the FP Multiplier Unit operated at around 360MHz. Then, like in the FP Adder/Subtractor Unit design, Xilinx options along with user defined timing constraints were iteratively adjusted until optimal results were attained. The implementation results of this final design over several FPGA platforms are given next.
The Xilinx implementation tool was used along with user defined period, setup and hold constraints, iteratively varying the timing constraints until timing closure could no longer be achieved. The final implementation results generated from the static timing report are summarized in Tables 6.2 and 6.3 for several FPGA platforms. It is noticed that Virtex5 uses almost half the number of
slices used by the Virtex2 and Virtex4 FPGAs. This owes to the fact that the Virtex5 slice contains four LUTs and four flip-flops, double the resources of the slices of the earlier families.

Table 6.2 Implementation Results of the FP Adder/Subtractor Unit

Platform                    Speed (MHz)   Area (Slices)   Utilization
Virtex2p Xc2vp7ff896 -7     325           1331            21%
Virtex4  Xc4vfx100 -12      401           1533            3%
Virtex5  Xc5vlx110 -3       442           625             3%
Table 6.3 Implementation Results of the FP Multiplier Unit

Platform                    Speed (MHz)   Area (Slices)   Utilization
Virtex2p Xc2vp7ff896 -7     339           1029            20%
Virtex4  Xc4vfx100 -12      465           1467            3%
Virtex5  Xc5vlx110 -3       472           702             4%
As presented in Chapter 2, the most relevant work in the topic of the design of high speed FPUs is the work of Govindu, Hemmert and Karlstrom [6, 9, 10 and 11]. The basic specifications of their FPUs and their comparison to the specifications of the proposed FPU can be summarized as follows:
Denormalized numbers, infinity or NaN:
Denormalized numbers, infinity and NaN are often excluded from high performance systems due to their rare occurrence and excessive hardware requirements [27]. The FPUs implemented by Govindu [6] and Karlstrom [10, 11] did not support them. As for the FPU of Hemmert [9], it was fully IEEE compliant considering both NaN and infinity, but not denormalized numbers. The proposed FPU deals with denormalized numbers as zero values and signals infinity values as overflow.
Rounding Modes:
Generally, the REN mode is the most common when implementing hardware and software arithmetic operations, although it is the mode requiring the most hardware to implement. On the other hand, truncation is the simplest rounding mode to implement since it only involves truncating the extra bits in the resultant mantissa in order to store it in the assigned storage. Both Govindu [6] and Karlstrom [10, 11] implemented their FPUs using the REN and truncation rounding modes. Hemmert [9] did not give details about the rounding modes used.
Parallelism:
Parallelism in a design serves in improving its overall latency. Both Govindu [6] and Karlstrom [10, 11] made use of parallelism when implementing their FPUs. They both had at least two parallel paths, one for dealing with the exponents and the other for the mantissas. On the other hand, since we were more concerned with speed than latency, each of the proposed units uses a single data path.
Considering the basic specifications of the FPUs in [6, 9, 10 and 11] as compared to the design specifications of the proposed design, it can be concluded that their design specifications are very similar to those of the proposed design. Hence, comparison of the proposed design to the work of Govindu, Hemmert and Karlstrom is valid. When comparing the results from this work to those of Govindu, Hemmert and Karlstrom [6, 9, 10 and 11], a comparison of the architectures used and the optimization techniques is relevant.
Like in this work, Govindu [6] implemented the post normalization operation using the LOD algorithm. He described his FP Adder/Subtractor Unit using VHDL, but also made use of Xilinx Library Cores to describe the adders in both the mantissa addition module and the rounding module within the FP Adder/Subtractor Unit. Govindu [6] used the Xilinx Tool for the implementation of his design to Virtex2 and Virtex4 FPGAs. In order to optimize his design, he used deep pipelining and the synthesis options given by the Xilinx tool. His 19 stage pipelined design operated at 250MHz on Virtex2pro.
Karlstrom [10, 11] described his FP Adder/Subtractor Unit using Verilog. In order to optimize the post normalize module, he used a hardware extensive approach that inspected four bits of the mantissa at a time. He reasoned that inspecting four bits at a time allows for better mapping to the four input LUTs of the Virtex4 FPGA. In order to further optimize his design, he made sure the adder/subtractor was implemented using only one LUT per bit. The use of such techniques led to a design that operated at 288 and 361MHz, with a latency of only 10 clock cycles, for Virtex2pro and Virtex4 respectively. In his later work, he was able to push the performance of his design to 377MHz for Virtex4.
Strangely enough, Hemmert [9] described their design using JHDL. They did not give any detail about their design architecture or how they optimized it. They did compare their work to the floating point units distributed by Xilinx and were superior to them in speed, area and latency alike. Their FP Adder/Subtractor Unit operated at 298MHz and 356MHz on Virtex2pro and Virtex4 respectively.
FP Adder Designs: Maximum Operating Speed (MHz)

Platform       Proposed [30]   Karlstrom [11]   Karlstrom [10]   Hemmert [9]   Govindu [6]
Virtex2p -7    325             278              288              298           250
Virtex5 -12    442             419              NA               NA            NA

*NA: Not Available
6.3.2 Comparison of FP Multiplier Unit Implementation Results
Govindu [6] constructed the 24 by 24 unsigned multiplier using Xilinx cores and made use of the synthesis options to further optimize his design, while Karlstrom [10, 11] used four of the Virtex4's DSP48 blocks to construct a 35 by 35 multiplier. On the other hand, Hemmert [9] did not give any details whatsoever about how they implemented their multiplier.
FP Multiplier Designs: Maximum Operating Speed (MHz)

Platform       Proposed [30]   Karlstrom [11]   Karlstrom [10]   Hemmert [9]   Govindu [6]
Virtex2p -7    339             NA               NA               290           250
Virtex5 -12    469             500              NA               NA            NA

*NA: Not Available
Post route power analysis gives an accurate view of the power breakdown based on the exact resource utilization of the routed design.
The XPower Analyzer integrated within the Xilinx 9.2i version is an early access version and is reported by Xilinx to give incorrect results [20]. So for accurate power analysis, a newer version of XPower Analyzer, which was integrated in Xilinx 11.1, was used. This version of Xilinx only supports Virtex4 and newer FPGAs.
Figures 6.6 and 6.7 illustrate the power consumption with respect to frequency for both units at the default settings, which assume an input/output toggle rate of 100%, i.e. both inputs and outputs toggle every clock cycle.
[Figure: FP Adder/Subtractor Unit power consumption (mW) versus frequency (MHz), 0 to 500 MHz]
[Figure: FP Multiplier Unit power consumption (mW) versus frequency (MHz), 0 to 500 MHz]
Post route simulation is performed using the Xilinx ISE Simulator. The simulator uses the post place and route simulation model along with the Standard Delay Format (SDF) file, containing the true delay information of the design, to simulate real operation of the design. The timing simulation of the FP Adder/Subtractor Unit at 312.5MHz (3.2ns) is shown in Figures 6.4 and 6.5 for the same test bench given in Chapter 5. For example, the output of the operation 1000.05 - 88.01 is given as (447A0333)H.
Figure 6.4 Input Test Bench of the FP Adder/Subtractor at 3.2ns
The timing simulation of the FP Multiplier Unit at 400MHz (2.5ns) is shown in Figures 6.6 and 6.7 for the same test bench given in Chapter 5.
Figure 6.6 Input Testbench for FP Multiplier at 2.5ns
Chapter 7
7.1 Conclusions
Digital signal processors are designed to handle digital signal processing functions. Although in the past the usage of DSPs has been more common, with the development of FPGA technology and DSP blocks in recent years there are more and more applications of their combination in digital signal processing systems, especially when high performance, flexibility, fast time to market, reliability and maintainability are required [31-33]. Since FP addition and multiplication are very common in digital signal processing, the design of high speed FP Multiplier and Adder/Subtractor Units has been presented. Both units were written in VHDL and operate at speeds above 320MHz for Virtex2Pro and 400MHz for Virtex4 and Virtex5 FPGAs, whilst giving an output with every clock cycle. Specifically, the proposed FP Adder/Subtractor operates at 442 MHz while the FP Multiplier operates at 469 MHz when implemented to Virtex5. The power consumption of both the FP Adder/Subtractor and Multiplier Units was analyzed using XPower Analyzer. Finally, post route simulation was performed to verify the operation of both the FP Adder/Subtractor and Multiplier Units.
In the future, the latency of the designs could be reduced by making use of parallelism.
References
[3] M. Reaz, S. Islam and M. Suliman, "Pipeline floating point ALU design using
[4] J. Liang, R. Tessier and O. Mencer, "Floating point unit generation and
[6] G. Govindu, L. Zhuo, S. Choi and V. Prasanna, "Analysis of high performance
[7] Ali Malik, "Design Tradeoff Analysis of Floating-Point Adder in FPGAs", M.Sc.
[8] Claudio Brunelli and Jari Nurmi, "Design and Verification of VHDL Model of a
pp. 349-350.
[10] P. Karlstrom, A. Ehliar and D. Liu, "High performance, low latency FPGA based
Floating Point Adder and Multiplier Units in a Virtex 4," in the 24th Norchip
[11] P. Karlstrom, A. Ehliar and D. Liu, "High performance, low latency FPGA based
Floating Point Adder and Multiplier Units in a Virtex 4," in Computers and
[12] S. V. Siddamal, R. M. Banakar and B. C. Jinaga, "Design of high speed floating
[13] Florent de Dinechin and Bogdan Pasca, "Large multipliers with fewer DSP
[14] Sebastian Banescu, Florent de Dinechin, Bogdan Pasca and Radu Tudoran,
2010.
[15] Florent de Dinechin, Hong Diep Nguyen and Bogdan Pasca, "Pipelined FPGA
[16] Clive Maxfield, The Design Warrior’s Guide to FPGAs: Devices, Tools and
[18] Sunggu Lee, Advanced Digital Logic Design using VHDL, State Machines and
[19] Volnei A. Pedroni, Digital Electronics and Design with VHDL, USA: Morgan
Kaufmann, 2008.
http://www.xilinx.com/company/about.htm
core.com/library/digital/fpga-logic-cells/
[23] Peter Wilson, Design Recipes for FPGAs, Great Britain: Newness, 2007.
[24] IEEE Standards Board, IEEE Standard for Floating-Point Arithmetic, 2008.
[26] Sheetal A. Jain, Low Power Single Precision IEEE Floating Point Unit, Master of
[27] Oh H.-J., Mueller S.M., Jacobi C., Tran K.D., Cottier S.R., Michael B.W., ET
Processor Element of a Cell Processor’, Solid-State Circuits IEEE J., 2006, 41,
http://www.xilinx.com
[30] Lamiaa Sayed Abdel Hamid, Khaled Shehata, Hassan El-Ghitani, Mohamed
[32] “FPGA versus DSP design Reliability and Maintenance.” [Online.] Available :
http://www.dsp-fpga.com
[33] Zhi-Jian Sun and Xue-Mei Liu, “Application of Floating Point and DSP in