
Table of Contents

TABLE OF CONTENTS…………………………………………………………...……i

LIST OF FIGURES………………………………………………………………….….vi

LIST OF TABLES…………………………………………………………….…......….xi

LIST OF ABBREVIATIONS…………………………..……………………..……….xii

ABSTRACT………………………………………………………………….…….….….1

CHAPTER 1: INTRODUCTION………………………………………...…...……….3

1.1 Problem Statement and Contribution………………………..………….…3

1.2 Organization ……………………………………...……...……...….……..4

CHAPTER 2: HISTORY OF FLOATING POINT UNITS ON FPGAs..…....….….6

CHAPTER 3: DIGITAL DESIGN……………………………………………...…….12

3.1 Application Specific Integrated Circuits (ASICs)…………………….....12

3.2 Programmable Logic Devices (PLDs)…………………….………...…...14

3.2.1 Introduction ………………………………………….…………… 14

3.2.2 Simple Programmable Logic Devices (SPLDs)…………………...15

3.2.3 Complex Programmable Logic Devices (CPLDs)…….…....……...18

3.3 Field Programmable Gate Arrays (FPGAs)…………..................…...….19

3.3.1 Comparing FPGAs to ASICs and CPLDs…………..…....….…. 20

3.3.2 Xilinx FPGAs.………………………………………….…….….21

3.3.2.1 Architecture…………………………………………..….21

3.3.2.2 Interconnect Technology……………………………….22

3.3.2.3 Virtex FPGA Family……………………..…...…...…….23

3.3.2.4 Speed Grades…………………………………………….26

3.4 FPGA Design Flow………………………………………………….….27

CHAPTER 4: FLOATING POINT ARITHMETIC ...…………………..………….31

4.1 Fixed and Floating Point Representation………………...………………31

4.2 IEEE-754 Standard for Floating Point Representation……….………….33

4.2.1 Numerical Encoding…………………………….……………… 33

4.2.2 Normalized and Denormalized Numbers……….……..………... 35

4.2.3 Special Values………………………………………….…...…....36

4.2.4 Range of Floating Point Numbers………….………….………....36

4.2.5 Exceptions…………………………………..…..….…….……....37

4.2.6 Rounding Modes………………………….……..……….……....37

4.3 Floating Point Arithmetic……………………………….……….…...….39

4.3.1 Floating Point Multiplication Algorithms.…….………………....39

4.3.2 Floating Point Addition/Subtraction Algorithms ……...………..43

CHAPTER 5: PROPOSED FLOATING POINT UNIT ARCHITECTURE……...47

5.1 Design Specifications of Proposed FPU………………...……..………...47

5.1.1 Floating Point Representation……………...……..….………..…47

5.1.2 Power Considerations…………………….………….……..……49

5.1.3 Pipelined Architecture…………………………………………...50

5.2 Proposed FP Multiplier Unit …………….…………………...….………51

5.2.1 Introduction……………………………………..……..…………51

5.2.2 Zero Detect Module…………………………….…….……………55

5.2.3 Add Exponent Module …………………….…………....…………57

5.2.4 Multiplier Module………………………………….………………58

5.2.4.1 Multiplier (8 by 8) Sub-Module………………….………60

5.2.4.2 Partial Fraction Adjust Sub-Module...……………...……61

5.2.4.3 Add Partial Fractions Sub-Module……………...….……61

5.2.4.4 Final Result Sub-Module………….…………….…….…62

5.2.5 Post Normalize Module……………………………………...…… 65

5.2.5.1 Post Normalize Sub-Module……….……………………66

5.2.5.2 Exception Detection Sub-Module ….………………....…67

5.2.6 Rounding Module…………………………….…………...….……69

5.2.6.1 REN Sub-Module……………………………….…….…69

5.2.6.2 Overflow Recheck Sub-Module……....…………………72

5.2.6.3 Output Sub-Module………………………….……..……73

5.2.7 The FP Multiplier Unit Behavioral Simulation………….....……74

5.3 Proposed FP Adder/Subtractor Unit………….…………...….….………76

5.3.1 Introduction…………………………………….…….…..………76

5.3.2 Unpack Module……………………………….…………..…...…79

5.3.3 Swap Module ………………………………..………….….……80

5.3.4 Zero Detect Module……………………….………….…….……82

5.3.5 Pre-Normalize Module…………………….………….….………84

5.3.6 Add/Subtract Module………………….…………….……..…….86

5.3.6.1 Pre-Add/Subtract Sub-Module……….….……………….87

5.3.6.2 Zeros Flag Update Sub-Module……………….…...…….89

5.3.6.3 Adder Sub-Module………..………………………….….90

5.3.7 Post Normalize Module………………………...………....……….91

5.3.7.1 Zeros Count Sub-Module…………….…………….…….92

5.3.7.2 Exponent Adjust Sub-Module………………….….…….93

5.3.7.3 Mantissa Normalize Sub-Module……………….……….95

5.3.8 Rounding Module……………………………...………….……….96

5.3.8.1 REN Sub-Module………………………………..……….96

5.3.8.2 Final Sub-Module……………………………….……….98

5.3.9 The FP Adder/Subtractor Behavioral Simulation…..……………..99

CHAPTER 6: DESIGN IMPLEMENTATION AND RESULTS……..……...…..101

6.1 Design Optimization…………………………………………....………101

6.1.1 Optimization Techniques…………………………….…………101

6.1.2 FP Adder/Subtractor Optimization..……………………………103

6.1.3 FP Multiplier Unit Optimization……………….……….………105

6.2 Implementation Results………………………………………...………106

6.3 Comparison with Previous Implementations …………...…..…………107

6.3.1 Comparison of FP Adder Unit Implementation Results…….…109

6.3.2 Comparison of FP Multiplier Unit Implementation Results……111

6.4 Power Analysis ……………………………………………….……..…111

6.5 Post Route Simulations ……………….………..……….………...……113

6.5.1 FP Adder/Subtractor Unit Timing Simulation……….…...…….113

6.5.2 FP Multiplier Unit Timing Simulation…………….…..……….114

CHAPTER 7: CONCLUSIONS and FUTURE WORK………………..….…….…116

7.1 Conclusions…….……………..……………….………………….…….116

7.2 Future Work……………………...……………………………………..117

REFERENCES…………………………………………………………………….118

List of Figures

Figure 3.1 Basic Architecture of PLA………………………………….....……...….16

Figure 3.2 Basic Architecture of PROM………………………………..…..……….16

Figure 3.3 Basic Architecture of PAL……………………………………………….17

Figure 3.4 Basic Architecture of GAL…………………………………...………….17

Figure 3.5 Simple Schematic of a CPLD……………………………..….………….18

Figure 3.6 Basic Architecture of a Xilinx FPGA………………………………...….21

Figure 3.7 Simplified Xilinx Logic Cell………………………….……………...….22

Figure 3.8 Simplified Schematic of Virtex-4 Slice………………………………….25

Figure 3.9 Simplified Schematic of Virtex-5 Slice…………………..….….……….26

Figure 3.10 Digital Design Flow……………………………………………….……..27

Figure 4.1 IEEE Format of Single Precision FP Number……….……………….… 34

Figure 4.2 LOD vs. LOP Algorithms………………………………….…………….45

Figure 5.1 Simplified Block Diagram of FP Multiplier Unit ……………………….51

Figure 5.2(a) Detailed Block Diagram of FP Multiplier Unit………………………….53

Figure 5.2(b) Detailed Block Diagram of FP Multiplier Unit………………………….54

Figure 5.3 Symbol of Unpack Module in the FP Multiplier Unit…………...……...56

Figure 5.4 Behavioral Simulation of the Unpack Module in the FP Multiplier

Unit……………………………………………………………………....56

Figure 5.5 Symbol of the Add Exponent Module in the FP Multiplier Unit………..58

Figure 5.6 Behavioral Simulation of the Add Exponent Module in the FP

Multiplier Unit………...…………………………………………...…….58

Figure 5.7 The Three Parallel Multiplications Performed in the Block

Multiplication Algorithm……………………………………………..….59

Figure 5.8 Details of Mantissa A * B0 Multiplication…………………….…...……59

Figure 5.9 Block Diagram of the Block Multiplier……………………..….……..…60

Figure 5.10 Shift Operations Performed to Align the Partial Fractions for Addition...61

Figure 5.11 Dividing the Partial Fraction to prepare them for Addition……………..62

Figure 5.12 Block Diagram of Multiplier Module in the FP Multiplier Unit…….…..63

Figure 5.13 Behavioral Simulation of the Multiplier (8 by 8) Sub-Module…….…….64

Figure 5.14 Behavioral Simulation of the Partial Fraction Adjust Sub-Module……..64

Figure 5.15 Behavioral Simulation of the Add Partial Fractions Sub-Module……….65

Figure 5.16 Behavioral Simulation of the Final Result Sub-Module………….……...65

Figure 5.17 Symbol of the Post Normalize Sub-Module in the FP Multiplier Unit….67

Figure 5.18 Behavioral Simulation of the Post Normalize Sub-Module

in the FP Multiplier Unit……………………………………….…..…….67

Figure 5.19 Symbol of the Exception Detection Sub-Module in the

FP Multiplier Unit………………………………..………………………68

Figure 5.20 Behavioral Simulation of the Exception Detection Sub-Module

in the FP Multiplier Unit…………………………………..…………….68

Figure 5.21 Symbol of the REN Sub-Module in the FP Multiplier Unit……………..70

Figure 5.22 Behavioral Simulation of the REN Sub-Module in the FP Multiplier

Unit when the Resultant Mantissa to be rounded is Even……………….70

Figure 5.23 Behavioral Simulation of the REN Sub-Module in the FP Multiplier

Unit when the Resultant Mantissa to be rounded is Odd………………..71

Figure 5.24 Symbol of the Overflow Recheck Sub-Module in the FP Multiplier……72

Figure 5.25 Behavioral Simulation of the Overflow Recheck Sub-Module

in the FP Multiplier Unit………………………………….……….....…..73

Figure 5.26 Symbol of the Final Module in the FP Multiplier Unit………………….73

Figure 5.27 Behavioral Simulation of the Final Module in the FP Multiplier Unit…..74

Figure 5.28 Symbol of the FP Multiplier Unit………………………………………..74

Figure 5.29 Behavioral Simulation of the FP Multiplier Unit………………..………75

Figure 5.30 Simplified Block Diagram of FP Adder/Subtractor Unit…….…….……76

Figure 5.31a Detailed Block Diagram of FP Adder/Subtractor Unit………..…………78

Figure 5.31b Detailed Block Diagram of FP Adder/Subtractor Unit………..…………78

Figure 5.32 Symbol of the Unpack Module in the FP Adder/Subtractor Unit…..……80

Figure 5.33 Behavioral Simulation of the Unpack Module in the

FP Adder/Subtractor Unit………………………………..…...………… 80

Figure 5.34 Symbol of the Swap Module in the FP Adder/Subtractor Unit…….……81

Figure 5.35 Behavioral Simulation of the Swap Module in the

FP Adder/Subtractor Unit……………………….……...………...….…..82

Figure 5.36 Symbol of the Zero Detect Module in the FP Adder/Subtractor Unit…...83

Figure 5.37 Behavioral Simulation of the Zero Detect Module in

the FP Adder/Subtractor Unit…………………………..………………..84

Figure 5.38 Format of 28 bit Mantissa………………………………………………..84

Figure 5.39 Symbol of the Pre-normalize Module in the FP Adder/Subtractor Unit...86

Figure 5.40 Behavioral Simulation of the Pre-normalize Module

in the FP Adder/Subtractor Unit…………………………..……..………86

Figure 5.41 Symbol of the Pre-Add/Subtract Sub-Module in the

FP Adder/Subtractor Unit……………………….……………………….87

Figure 5.42 Behavioral Simulation of the Pre-Add/Subtract Sub-Module in the

FP Adder/Subtractor Unit for Mantissa B greater than Mantissa A..……88

Figure 5.43 Behavioral Simulation of the Pre-Add/Subtract Sub-Module in the

FP Adder/Subtractor Unit for Mantissa A greater than Mantissa B..……88

Figure 5.44 Symbol of the Zero-Update Sub-Module in the

FP Adder/Subtractor Unit………………………………………………..89

Figure 5.45 Behavioral Simulation of the Zero Update Sub-Module in the

FP Adder/Subtractor Unit……………………………………….……….89

Figure 5.46 Symbol of the Adder Sub-Module in the FP Adder/Subtractor Unit.…...90

Figure 5.47 Behavioral Simulation of the Adder Sub-Module in the

FP Adder/Subtractor Unit……………………...…………..…………….91

Figure 5.48 Symbol of Zeros Count Sub-Module in the FP Adder/Subtractor Unit….92

Figure 5.49 Behavioral Simulation of Zeros Count Sub-Module in the

FP Adder/Subtractor Unit………………………….………….…………93

Figure 5.50 Symbol of the Exponent Adjust Sub-Module in the

FP Adder/Subtractor Unit……………………………………..…………94

Figure 5.51 Behavioral Simulation of the Exponent Adjust Sub-Module in the

FP Adder/Subtractor Unit…………………..………….……...…………94

Figure 5.52 Symbol of Mantissa Normalize Sub-Module in the

FP Adder/Subtractor Unit..………………………………………………95

Figure 5.53 Behavioral Simulation of Mantissa Normalize Sub-Module in the

FP Adder/Subtractor Unit………………………………….…..……..….96

Figure 5.54 Symbol of the REN Sub-Module in the FP Adder/Subtractor Unit….…97

Figure 5.55 Behavioral Simulation of REN Sub-Module in the

FP Adder/Subtractor Unit…………………..………..………….……….97

Figure 5.56 Symbol of the Final Sub-Module in the FP Adder/Subtractor Unit…….98

Figure 5.57 Behavioral Simulation of Final Sub-Module in the

FP Adder/Subtractor Unit……………………………………….…….…98

Figure 5.58 Symbol of the FP Adder/Subtractor Unit………………………………..99

Figure 5.59 Behavioral Simulation of the FP Adder/Subtractor Unit………………..99

Figure 6.1 Offset in Time Constraint…...…………………….……………………103

Figure 6.2 FP Adder/Subtractor Unit Power Consumption vs. Frequency.………..112

Figure 6.3 FP Multiplier Unit Power Consumption vs. Frequency……………..….113

Figure 6.4 Input Test Bench of the FP Adder/Subtractor at 3.2ns…………..……..114

Figure 6.5 Output of the FP Adder/Subtractor Test Bench.………………..…..…..114

Figure 6.6 Input Testbench for FP Multiplier at 2.5ns …………………………….115

Figure 6.7 Output of the FP Multiplier Test Bench ……………………………….115

List of Tables

Table 3.1 Technologies Used to Implement FPGA Interconnects………………….23

Table 4.1 Summary of Floating Point Number Values………………….………….35

Table 4.2 FP Ranges for Normalized and Denormalized Numbers…….………….36

Table 4.3 Examples on the IEEE 754-2008 Rounding Modes……………………..38

Table 4.4 Example on Booth Multiplication………………………………………..42

Table 5.1 Test Bench for the Block Multiplier Module…………………………….62

Table 5.2 Rounding Action Based on Guard, Round and Sticky Bits….………….69

Table 5.3 REN Sub-Module Behavioral Simulation Results for Even Mantissa…..71

Table 5.4 REN Sub-Module Behavioral Simulation Results for Odd Mantissa…....71

Table 5.5 Test bench of the FP Multiplier Unit Behavioral Simulation……………75

Table 5.6 Summary of Zero Detection Methodology………………………..……..83

Table 5.7 Resultant Mantissa based on Zeros Flag………………………….……...90

Table 5.8 Test Bench of the FP Adder/Subtractor Unit Behavioral Simulation…. 100

Table 6.1 Synthesis Results for the FP Multiplier Unit using Proposed

Block Multiplier vs. using Simple Multiplier……...……………….…..106

Table 6.2 Summary of Proposed FP Adder/Subtractor Implementation Results…107

Table 6.3 Summary of Proposed FP Multiplier Implementation Results…………107

Table 6.4 Speed Comparison between Proposed and other FP Adders……..…….110

Table 6.5 Speed Comparison between Proposed and other FP Multipliers………111

List of Abbreviations

ASIC Application Specific Integrated Circuit

ALU Arithmetic Logic Unit

CLB Configurable Logic Block

CMOS Complementary Metal Oxide Semiconductor

CPLD Complex Programmable Logic Device

CPU Central Processing Unit

CSD Canonic Signed Digit

DFF Delay Flip Flop

DSP Digital Signal Processing

EDA Electronic Design Automation

EPROM Erasable Programmable Read-Only Memory

EEPROM Electrically Erasable Programmable Read-Only Memory

FFT Fast Fourier Transform

FP Floating Point

FPGA Field Programmable Gate Array

FPU Floating Point Unit

FSM Finite State Machine

GAL Generic Array Logic

HDL Hardware Description Language

IC Integrated Circuit

IEEE Institute of Electrical and Electronics Engineers

IFFT Inverse Fast Fourier Transform

IOB Input Output Buffer

JHDL Just-Another Hardware Description Language

JTAG Joint Test Action Group

LOD Leading One Detector

LOP Leading One Predictor

LUT Look Up Table

MC Macro Cell

MFLOPS Mega Floating Point Operations per Second

MUX Multiplexer

NaN Not a Number

NCD Native Circuit Description

NGC Native Generic Circuit

NGD Native Generic Database

NRE Non-Recurring Engineering

OTP One Time Programmable

PAL Programmable AND array Logic

PAR Place and Route

PLA Programmable Logic Array

PLD Programmable Logic Device

PROM Programmable Read-Only Memory

RAM Random Access Memory

REN Round to Nearest Even (ties to even)

RM Round towards Minus-infinity

ROM Read-Only Memory

RP Round towards Plus-infinity

RZ Round towards Zero

SDF Standard Delay Format

SoC System on Chip

SPLD Simple Programmable Logic Device

SRAM Static Random Access Memory

VHDL Very high speed integrated circuit Hardware Description Language

XST Xilinx Synthesis Technology

ABSTRACT

Nowadays, every CPU has one or more Floating Point Units (FPUs) integrated within it. FPUs are commonly used in math-intensive applications such as digital signal processing. Consequently, FPUs find a place in engineering, medical and military fields as well as in other fields requiring audio, image or video manipulation.

The main operations of a conventional FPU are multiplication and addition/subtraction, which together account for 94% of its operations [1]. With the advancement in FPGA technology, high performance FPGAs are now built with millions of gates along with sophisticated features. Accordingly, FPGAs are becoming more suitable for the implementation of high performance FPUs, especially when short time to market, low development cost and flexibility are required.

The objective of this thesis is to design and implement high speed, generic FP Multiplier and Adder/Subtractor Units that compete with existing designs.

A novel multiplication algorithm is proposed and used in the implementation

of a FP Multiplier Unit whilst a FP Adder/Subtractor Unit is implemented using the

standard Leading One Detector (LOD) algorithm. The novel multiplication algorithm

is referred to as “Block Multiplication” and is used to optimize the large

multiplication operation in the FP Multiplier by dividing it into several smaller

multiplications performed in parallel. In order to achieve high operating speeds both

the FP Multiplier and Adder/Subtractor Units are deeply pipelined, which also leads to maximum throughput.

The FP Multiplier Unit using the novel Block Multiplication algorithm and the

FP Adder Unit using the LOD algorithm were both completely described using

VHDL code to allow their implementation to any FPGA platform. In our research,

1
both units were implemented on Virtex2Pro, Virtex4 and Virtex5 FPGAs and were

able to operate at speeds higher than 320 MHz on Virtex2pro whilst occupying

around 20% of the FPGA and at speeds higher than 400 MHz on Virtex4 and Virtex5

FPGA whilst occupying around 3% of the FPGA. Post route simulation of both units

was performed to verify design operation post implementation (routing) and power

consumption was calculated post routing for most accurate analysis.

Chapter 1

Introduction

Ever since the invention of digital computers, arithmetic logic units (ALUs)

have always been a fundamental building block of the computer’s CPU. The ALU

usually refers to the circuit that deals with binary numbers in integer formats (like 2's complement and binary coded decimal). An FPU, on the other hand, refers to the arithmetic unit that deals with floating point numbers (i.e. real numbers). FPUs are considered superior to traditional ALUs in sophisticated applications which require wide dynamic range and high precision.

1.1 Problem Statement and Contribution

Some of the greatest achievements of the 20th century would not have been

possible without the floating point capabilities of digital computers and systems.

FPUs are especially important for the implementation of engineering and math-intensive applications used in digital signal processing and other scientific computations that require wide dynamic ranges and high precision. As a result, high performance FPUs are essential in several fields such as communications, military and medicine, as well as in many applications including image processing, robotics, radar and medical diagnostic equipment. Moreover, for portable applications the need for low power FPUs is inevitable.

In conventional FPUs, the most frequently used floating point operations are

multiplication and addition/subtraction accounting for more than 94% of all floating

point instructions [1]. Hence the employment of high-performance FP Multiplier and Adder/Subtractor Units is of great importance and has been the core interest of many researchers, few of whom gave any attention to power consumption issues.

In this thesis, we aim to design, implement and test an IEEE compliant single

precision, generic, low power, high speed FPU (Multiplier and Adder/Subtractor). In

order to minimize power consumption, the FPU is designed in a manner to reduce

unnecessary switching activity. As for achieving maximal speed, a new algorithm is

proposed to implement the FP Multiplier Units that optimizes time consuming

multiplication operation by breaking it into several multiplications performed in

parallel. This new algorithm is referred to as “Block Multiplication” and can be used

in the implementation of any large multiplier that is to be optimized for speed. The FP

Adder/Subtractor Unit is implemented using the standard LOD algorithm which is

deeply pipelined to allow for high operating speeds.

1.2 Organization

The rest of the thesis is structured as follows. Chapter 2 summarizes the

previous work of other researchers in the field of designing and implementing FPUs on FPGAs. Chapter 3 gives an overview of FPGAs, specifically the Virtex family, along with the FPGA design flow as given by Xilinx ISE. Chapter 4 provides an overview of the IEEE 754-2008 standard for binary floating point numbers, introduces floating point arithmetic (multiplication and addition/subtraction) and presents the most famous algorithms used for their implementation. Chapter 5 thoroughly

discusses the proposed designs for both the FP Multiplier using the new proposed

Block Multiplication algorithm and the FP Adder/Subtractor using the LOD algorithm

by describing the design specifications, then explaining the modules of both units and

showing the behavioral simulations of each module and of the complete designs.

Chapter 6 gives the synthesis, implementation, post route simulation and power

results of the proposed designs. Finally, Chapter 7 wraps up with the conclusion and

future work.

Chapter 2

History of Floating Point Units on FPGAs

Early on, floating point arithmetic was performed in computers using software emulation which, although it saves the added hardware cost, is significantly slow. Later, floating point operations were performed using external coprocessors that were invoked when needed to allow for the execution of math-intensive operations.

Nowadays, every computer has a high speed floating point unit integrated within its

CPU.

Modern FPGAs have sophisticated features such as dedicated carry chains,

memories, multipliers and DSP blocks which make it possible to perform

computations at higher clock frequencies. This makes modern FPGAs quite suitable

for implementation of high speed floating point arithmetic which can be particularly

useful when flexibility and fast time to market, some of FPGAs' strongest assets, are

of concern.

Research papers on the design and implementation of FPUs on FPGAs have been published since the mid 1990s. One of the first IEEE compliant 32 bit FP Adder/Subtractor and Multiplier Units implemented on an FPGA was introduced in 1996 by L. Louca et al. [2]. They implemented their designs on the Altera FLEX8000 FPGA. Their main objective was to minimize the area of both the FP Adder/Subtractor and Multiplier Units to fit the limited resources of the FPGA while achieving reasonable speed and maintaining IEEE accuracy. Despite their efforts, only one of the proposed units could be implemented on the FPGA at a time, achieving a peak performance of 7 MFLOPS and 2.3 MFLOPS for the FP Adder/Subtractor and FP Multiplier Units respectively.

One of the early FPU designs was given by Mamun Bin Ibne Reaz et al. in 2002 [3]. Their work included the design and simulation of a pipelined FPU, including an adder/subtractor, multiplier and divider, described entirely in VHDL. They discussed in detail the block diagrams of the adder/subtractor, multiplier and divider units. They did not synthesize or implement their design, thus no indication of speed or area was given for their work. This paper provided us with a good basic overview of the design of FPUs.

Another research paper, by Jian Liang et al. in 2003 [4], presented an FPU

generation tool for FP Adder/Subtractor Units on FPGAs. It is based on throughput,

latency and area requirements and is able to create a range of FP Adder/Subtractor

Units. The paper was the first, according to the authors’ knowledge, to discuss and

compare the different algorithms used to implement the floating point adders. The

given generation tool selects from those different algorithms to give a FP

Adder/Subtractor Unit that trades off latency and throughput for area. Their results were obtained by implementing their designs on a Spartan-3 FPGA. One of their optimized designs showed a latency just above 250 ns and a throughput of around 75 MHz.

In 2004, Suhaili and Sidek [5] proposed a reconfigurable 32 bit ALU that can

perform both integer and floating point addition. They described their module in

Verilog and implemented it on a Spartan-2E FPGA. The synthesis report indicated that their design could operate at speeds of up to 20 MHz.

Also in 2004, Gokul Govindu et al. [6] showed that FPGA based FPUs can achieve a significant improvement in performance over that of processors. In their

paper, they analyzed the maximum achievable speed, area, latency, power and

throughput of FP Multiplier and Adder/Subtractor Units by considering their

pipelining as a parameter. They used both VHDL and Xilinx Library Cores to

describe their design. They implemented it on Virtex2Pro and were able to achieve

speeds of up to 250 MHz.

In 2005, Ali Malik [7] discussed and compared in detail the design of FP Adders using several algorithms, which he described in VHDL and implemented on a Virtex2Pro FPGA. Malik's work is related to that of Liang [4] in that they both analyzed several implementations of FP Adders. Malik, though, thoroughly discussed several implementations for some of the main modules. He then discussed the design tradeoffs for each of these implementations, usually by comparing their combinational delay and area (number of occupied FPGA slices). Finally, he used the optimized sub-modules to build several FP Adders, each using a different algorithm, and compared them for overall latency, area (number of occupied slices) and speed. His fastest implementation was able to operate at a speed of 152 MHz on the Virtex2Pro FPGA.

In 2006, Brunelli and Nurmi [8] introduced the design of what they called the

Milk Co-Processor which is a 32 bit FPU. The main objective of their design was

reusability. They did not give much detail about their used design algorithms. They

did mention though that they described their design using VHDL and that a special

VHDL file was written containing a set of generics that were used to give

customizability to the design. For example, these parameters allow the choice of

whether or not to handle denormalized numbers. Their FP Adder/Subtractor and Multiplier Units were able to operate at 77 and 75 MHz respectively when implemented on a Stratix FPGA.

Also in 2006, K.Scott Hemmert and Keith D. Underwood [9] from Sandia

National Laboratories published their work which involved the design of a high speed

FPU (Adder/Subtractor, Multiplier and Divider). They used JHDL (Just-Another

HDL) as their design entry language and implemented the FPUs on both Virtex2 and Virtex4 FPGAs. Their proposed designs were able to operate at frequencies of up to 320 and 350 MHz on Virtex4 for the Multiplier and Adder respectively.

Another contribution in 2006 was given by Per Karlstrom et al. [10], who introduced high speed FP Adder and Multiplier Units implemented on Virtex4. They used Virtex4 DSP48 blocks to build the multiplier module within the FP Multiplier Unit. As a result, their FP Multiplier Unit was able to operate at nearly 450 MHz on the Virtex4 FPGA. As for the FP Adder Unit, they worked on increasing the operating speed by optimizing the bottleneck block of the design. This was performed by breaking up the binary number handled in that block and processing each group of four bits separately in a parallel manner. Despite the fact that this technique was rather hardware-intensive, it succeeded in achieving a maximum operating speed of 361 MHz on the Virtex4 FPGA. Later in 2008, Karlstrom et al. published a revised version of their work in which the FP Multiplier and Adder/Subtractor Units operated at speeds of up to 440 and 377 MHz respectively when implemented on Virtex4 [11].

In 2008, Saroja V. Siddmal et al. [12] thoroughly discussed and compared

several high speed algorithms for implementing the 24 by 24 unsigned integer

multiplier in the FP Multiplier Unit, such as Booth and Canonic Signed Digit (CSD) multiplication. They used VHDL to describe their design, implemented it on a Virtex-E FPGA and were able to achieve speeds of up to 333 MHz.

Recently, the University of Lyon in France has shown strong interest in the topic of FP Adder/Subtractor and Multiplier Unit design, especially for double (64 bit) and quadruple (128 bit) precision FP numbers, and has published several papers on the topic. In 2009, Florent de Dinechin and Banescu [13] studied several non-standard implementation techniques for large multipliers on FPGAs, which can be used in FP multipliers. The objective of this work was to build large multipliers operating at high frequencies while reducing their DSP block usage. Each of the studied techniques was found to be more suitable for a certain FPGA depending on the architecture of its DSP blocks. They were able to build multipliers that operated at frequencies around 440 MHz for both Virtex4 and Virtex5 FPGAs. Later in 2010, Banescu, de Dinechin et al. [14] published a paper studying the same multiplication techniques and integrating them in high radix FP Multipliers. They presented double and quadruple precision FP multipliers that operated at 400 MHz for both Virtex4 and Virtex5 FPGAs. De Dinechin et al. [15] also published a paper in 2010 that discusses an FP adder generation tool, in a project they referred to as the FloPoCo (Floating Point Cores) project. Their work explores the tradeoffs between size, latency and frequency for pipelined large precision adders on FPGAs in several architectures. For each of these architectures, resource estimation models are defined and used in an adder generator that selects the best architecture considering the target FPGA, target operating frequency and the addition bit width. They were able to construct double and quadruple precision FP adders whose synthesis results indicated they can operate at 450 MHz on the Virtex4 FPGA.

Considering the above summary of previous work on implementing FP Multiplier and Adder/Subtractor Units, we find that the most relevant work was introduced by Govindu [6], Hemmert [9], Karlstrom [10,11] and Lyon University [13-15]. They were all able to design FP Units operating at speeds well above 200 MHz. Two main approaches were used to achieve such high operating speeds:

1. Optimization for a specific FPGA: This is performed either by using existing blocks on the specific FPGA or by using Xilinx Cores that are specifically optimized for the target FPGA. This approach was used by Govindu [6] and Karlstrom [10,11] when designing the FP Multiplier Unit, where the unsigned multiplier was built using optimized Xilinx cores and the DSP blocks of the Virtex2 and Virtex4 FPGAs respectively.

2. Using fast algorithms: Many fast algorithms exist for both the FP Multiplier and Adder/Subtractor Units. This approach was used by Siddmal [12] to implement the unsigned multiplier using the Booth and CSD Multiplication algorithms, to be explained in Chapter 4, which are by far the most common fast multiplication algorithms used for floating point multiplication.

The drawback of the first approach is that the resulting designs were not generic; they were optimized for specific FPGAs. That is why the second approach was more appealing for our work, since our objective is to design a high speed, low power and generic FPU (Adder/Subtractor and Multiplier).

Chapter 3

Digital Design

Based on the design specifications, a digital designer has various options when

selecting a hardware platform for their design, ranging from Application Specific

Integrated Circuits (ASICs) to all sorts of Programmable Logic Devices (PLDs) and

Field Programmable Gate Arrays (FPGAs).

This chapter gives an overview of these various hardware design options, starting with ASICs, going through PLDs and FPGAs, and then discussing Xilinx Virtex FPGAs thoroughly. Finally, the digital design flow for FPGAs is explained.

3.1 Application Specific Integrated Circuits (ASICs)

ASICs started appearing in the early 1980s. ASIC chips are customized for a

particular use rather than general purpose use. With the improvement in design tools

and reduction in feature sizes, the number of gates in an ASIC has grown from a few thousand to over 100 million. Modern ASICs, also known as SoC (System-

on-Chip), include embedded processors and memory blocks.

There are many types of ASICs, classified by the number of mask layers over which the designer has control. The most well-known types of ASICs are [16, 17]:

Full Custom:

In a full custom ASIC, the designer has full control over every mask layer used to fabricate the silicon chip. Accordingly, the designer controls the sizes of all transistors in the design, which allows fine tuning of the transistors' sizes for optimum performance. Full custom design can be used to design some or all of the circuits for a specific ASIC. Fewer full custom ICs are being designed nowadays due to the long time to market and high non-recurring engineering (NRE) cost involved in their design and fabrication. Also, with the improvement in standard cell and gate array ASICs, these now provide the required performance for more applications at high speed and low cost, which steers designers away from full custom design. Full custom design is usually used when no suitable existing cell libraries can be used, usually because the existing cell libraries are not fast enough, not small enough, consume too much power or do not provide a certain required function. Full custom ASICs are commonly used for microprocessors, which must operate as fast as possible and will be produced in great quantities.

Standard Cell:

Standard cell ASICs are based on predesigned logic cells (such as logic gates,

multiplexers, flip-flops, etc…) known as standard cells that are used to build the

design along with larger predesigned cells known as megacells (such as memory

blocks, microprocessors or microcontrollers). Also, in standard cell ASICs, custom blocks can be embedded in the design. During the design, each and every transistor in every standard cell can be sized to optimize a certain design parameter, and tools can be used to optimize the placement of standard cells and their interconnections. So for standard cell ASICs, all mask layers are customized (transistors and interconnects), thus a custom photo-mask is created for every layer of the device's fabrication. The

advantage of standard cell ASICs is that designers save time and money by using

optimized predefined and pretested standard cells.

Gate Arrays:

Gate array ASICs are partially fabricated chips with repetitive similar blocks, known as basic cells, each consisting of a collection of predefined unconnected transistors and resistors depending on the vendor. Basic cells are replicated to form arrays of basic cells. The designer chooses from a gate-array library of predesigned and pre-characterized logic cells (including gates, registers, etc.), which are used along with more complicated blocks to build the circuit by controlling only the top few layers of metal used for interconnects. The disadvantage of gate array ASICs is the unoptimized routing, which negatively impacts the performance and power consumption of the design.

3.2 Programmable Logic Devices (PLDs)

3.2.1 Introduction

PLDs were first introduced in the mid 1970s. A PLD is a programmable chip

that is mass produced at the factory and then customized by the end-user to perform

different logic functions. Unlike ASICs, PLDs are intended for general use, not for specific applications, and can be fabricated to be one-time or multiple-time programmable depending on the technology used to implement the cross points within the device.

One-time programmable PLDs implement the cross points using fuse or anti-fuse technology to create permanent open or short circuits respectively, based on the data to be programmed into the PLD. For multiple-time programmable PLDs, each cross point is implemented using a single-bit memory cell that stores the binary data implementing an open or short circuit. Such PLDs can be volatile or non-volatile depending on whether volatile or non-volatile memories are used at the cross points.

Depending on the type of PLD, PLDs can be used to implement anything from simple combinational circuits to fairly complex sequential state machines. There are three types of PLDs:

1. SPLDs: Simple PLDs, which include PLAs (Programmable Logic Arrays), PALs (Programmable AND array Logic), PROMs (Programmable Read-Only Memories) and GALs (Generic Array Logic).

2. CPLDs: Complex PLDs, which were originally constructed by associating several SPLDs on the same chip.

3. FPGAs: Considered PLDs since they are also devices programmed by the end user.

3.2.2 Simple Programmable Logic Devices (SPLDs)

SPLDs have numerous horizontal and vertical connection wires forming a

matrix of AND gates (referred to as the AND plane) and a matrix of OR gates

(referred to as the OR plane). These planes are used to implement any circuit as a sum-of-products expression by programming the horizontal-to-vertical cross points as either open or short circuits [18].
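As a concrete illustration (a hypothetical example, not taken from the text), a function such as F = A·B + A'·C maps directly onto this structure: the AND plane forms the product terms and the OR plane sums them. A behavioral VHDL equivalent of such a sum-of-products circuit would be:

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sum-of-products function, F = A.B + A'.C, as an SPLD realizes it.
entity sop_example is
    port (a, b, c : in  std_logic;
          f       : out std_logic);
end entity sop_example;

architecture rtl of sop_example is
begin
    -- AND plane: product terms A.B and A'.C; OR plane: their sum.
    f <= (a and b) or ((not a) and c);
end architecture rtl;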

PLAs are the most configurable of SPLDs. They consist of a programmable

AND plane and a programmable OR plane. The cross points in both the AND plane

and the OR plane can be programmed to form any sum-of-product expression. PLAs

are particularly useful for large designs that require many common product terms that

can be shared by several outputs. The downsides of the PLA device are its manufacturing cost and its speed: the device has two levels of programmable links, and signals take a relatively long time to pass through programmable links as opposed to predefined ones. A simplified architecture of a PLA is shown in Figure 3.1.

Figure 3.1 Basic Architecture of PLA

The speed problems associated with the PLA were addressed with the

development of the PROM and the PAL. A PROM is a special type of PLA with a

programmable OR plane and a fixed AND plane that produces all possible product

terms for the given inputs. A PAL is a special type of PLA with a programmable

AND plane and a fixed OR plane. The advantage of PROMs and PALs is that they are faster, having only a single programmable array, at the cost of less design flexibility due to the presence of a fixed plane. Simplified architectures of a PROM and a PAL are shown in Figures 3.2 and 3.3.

Figure 3.2 Basic Architecture of PROM

Figure 3.3 Basic Architecture of PAL

A GAL is a PAL that has additional logic circuitry at each output, referred to as a macro cell. A macro cell is a programmable output cell containing logic gates, a flip-flop and multiplexers, with internal programmability that allows several modes of operation. GALs also differ from PALs in that they have a feedback signal from the macro cell back to the programmable array, which increases the GAL's flexibility, and in that their cross points are implemented using EEPROM instead of fuse/anti-fuse or PROM/EPROM. GAL devices can have a maximum frequency of around 250 MHz [19]. The simplified architecture of a GAL is shown in Figure 3.4.

Figure 3.4 Basic Architecture of GAL


3.2.3 Complex Programmable Logic Devices (CPLDs)

CPLDs are constructed from several SPLDs (generally GALs or PLAs) fabricated on the same chip, which communicate through a complex, programmable

interconnecting matrix. CPLDs have I/O drivers and a clock/control unit. Modern

CPLDs also include JTAG support (port for circuit access/test defined by Joint Test

Action Group and standardized in the IEEE 1149.1 standard), a large number of I/O

user pins and low-power consumption. A simple schematic of a CPLD is shown in

Figure 3.5.

Figure 3.5 Simple Schematic of a CPLD

CPLDs feature predictable timing characteristics that make them ideal for

critical, high-performance control applications. Typically, CPLDs have a shorter and

more predictable delay than FPGAs and other programmable logic devices. CPLDs

are inexpensive and require small amounts of power [19], thus they are commonly

used in cost-sensitive, battery-operated portable applications. CPLDs are also used in

simple applications such as address decoding.

Observing the most popular Altera and Xilinx CPLDs, it can be summarized that CPLDs are fabricated using 0.18 um or 0.35 um CMOS technology, have user pin counts ranging from 27 to 272, and have maximum operating speeds ranging from 56 MHz to 323 MHz [19].

3.3 Field Programmable Gate Arrays (FPGAs)

Around the beginning of the 1980s, the gap in the digital IC continuum

became apparent. At one end there were programmable devices such as SPLDs and

CPLDs, which were highly configurable and had fast design and modification times,

but could not support large or complex functions. At the other end, there were ASICs

which could support extremely large and complex functions, but they were very

expensive and time consuming to design. Furthermore, once a design had been implemented on an ASIC, it was effectively frozen in silicon [16].

To fill that gap, FPGAs were introduced by Xilinx in the mid 1980s. They are

considered the most complex PLDs since they contain thousands of configurable logic

blocks and configurable interconnects that can both be programmed (only a single

time or many times depending on the type of FPGA) to perform a variety of complex

digital functions [19].

At first, FPGAs were used to implement simple logic circuits at relatively low

speeds. Later, in the 1990s, the size and sophistication of FPGAs started to increase and they found markets in telecommunications, networking and many other industrial applications. FPGAs were then also commonly used to prototype ASIC designs or to provide a hardware platform to verify the physical implementation of new algorithms. However, FPGAs' ease of design, flexibility, low development cost and short time to market soon made them find their way into final products.

Although the first FPGAs contained a few thousand gates, nowadays FPGAs

contain over a billion gates [20]. Today's FPGAs have sophisticated features such as high speed input/output interfaces and internal clock management, and consist of millions of gates along with embedded elements such as microprocessor cores, RAMs, DSP blocks, multipliers and dedicated arithmetic carry chains. Such high performance

FPGAs can be used to implement almost any design and at very high speeds that

could easily reach 600MHz [20].

3.3.1 Comparing FPGAs to ASICs and CPLDs

Generally, FPGAs are more cost effective for limited productions while

ASICs are more suitable for larger productions. An advantage of using FPGAs instead

of ASIC is that the FPGA design flow eliminates the complex and time-consuming

floor planning, place and route, timing analysis, and mask / re-spins stages of the

project since the design logic is already synthesized to be placed onto an already

verified, characterized FPGA device. Moreover, FPGAs have the advantage that they

can be easily and rapidly reprogrammed. A good way to shorten the development time

of a product is to make prototypes using FPGAs and then switch to an ASIC.

So although FPGAs used to be selected over ASICs for lower speed,

complexity, or volume designs in the past, modern FPGAs easily push the 500 MHz

performance barrier due to their very high densities and their sophisticated features.

As a result today's FPGAs are increasingly being used to implement a variety of

designs that could previously have been realized only on ASICs and custom silicon.

FPGAs can be used to implement almost any type of design such as communication

devices, software defined radios, radar, image processing, digital signal processing all

the way to system-on-chip (SoC) components that contain both hardware and

software elements.

FPGAs have a much more sophisticated structure than CPLDs. This gives

FPGAs the advantage over CPLDs that they can be used to implement complex

digital designs that cannot be implemented on CPLDs.

3.3.2 Xilinx FPGAs

Xilinx was founded in 1984 and is known as the inventor of the FPGA. It is now considered the largest PLD supplier, owning more than 50% of the market [21].

3.3.2.1 Architecture

The basic architecture of a Xilinx FPGA is illustrated in Figure 3.6. It consists

of a matrix of Configurable Logic Blocks (CLBs) interconnected by an array of

switch matrices [19]. The internal architecture of CLBs might differ from one FPGA

family to another. Generally, a CLB consists of a number of slices, each slice in turn

consisting of a number of logic cells. A logic cell is the core building block in a

modern Xilinx FPGA.

Figure 3.6 Basic Architecture of a Xilinx FPGA

A simplified illustration of a Xilinx logic cell is shown in Figure 3.7. It includes a 4-input Look Up Table (LUT) and a Delay Flip Flop (DFF), along with a multiplexer to allow a registered or unregistered output. Other than the LUT, MUX and register, a logic slice can also contain other elements such as fast look-ahead carry chains, arithmetic logic and dedicated internal routing. Advanced FPGAs may also

have multipliers and memory blocks.

Figure 3.7 Simplified Xilinx Logic Cell
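To make this structure concrete, the behavior of such a logic cell can be modeled in VHDL as sketched below (a hypothetical model with assumed port names, not vendor code): a 16-bit vector holds the LUT contents and is addressed by the four inputs, a DFF registers the LUT output, and a multiplexer selects the registered or combinational path.

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical model of a simplified logic cell: 4-input LUT + DFF + output mux.
entity logic_cell is
    port (clk      : in  std_logic;
          lut_init : in  std_logic_vector(15 downto 0); -- LUT contents, one bit per input combination
          sel_reg  : in  std_logic;                     -- '1' selects the registered output
          a        : in  std_logic_vector(3 downto 0);  -- the four LUT inputs
          y        : out std_logic);
end entity logic_cell;

architecture behavioral of logic_cell is
    signal lut_out, ff_out : std_logic;
begin
    -- The LUT is simply a 16x1 memory addressed by the four inputs.
    lut_out <= lut_init(to_integer(unsigned(a)));

    -- Delay flip-flop registering the LUT output.
    process (clk)
    begin
        if rising_edge(clk) then
            ff_out <= lut_out;
        end if;
    end process;

    -- Output multiplexer: registered or unregistered result.
    y <= ff_out when sel_reg = '1' else lut_out;
end architecture behavioral;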

The reason FPGAs have a hierarchy of CLBs consisting of slices, which in turn consist of logic cells, is that it is complemented by an equivalent hierarchy of interconnects. Thus there are fast interconnects between logic cells within the same slice, slightly slower interconnects between slices in a CLB, followed by the interconnects between CLBs. This enables an optimum trade-off between making it easy to connect things together and avoiding excessive interconnect-related delay [16].

3.3.2.2 Interconnect Technology

There are various programming technologies used to implement an FPGA's configurable interconnects. Like SPLDs, an FPGA can be one-time programmable

(OTP) or multi-time programmable according to the technology used in its

implementation. The most famous of these technologies are listed in Table 3.1.

Table 3.1 Technologies Used to Implement FPGA Interconnects

Technology      Programmability                                     Predominantly Associated With
Fuse            OTP                                                 SPLDs
Anti-fuse       OTP                                                 FPGAs
EPROM           Can be erased using ultraviolet light (takes        SPLDs & CPLDs
                at least 20 minutes), then reprogrammed.
EEPROM/Flash    Can be electrically erased, then reprogrammed.      SPLDs, CPLDs & some FPGAs
SRAM            Reprogrammable.                                     FPGAs & some CPLDs

Most Xilinx FPGAs are based on SRAM technology where the programmable

interconnections, made using pass-transistors, transmission gates or multiplexers, in

the FPGA are controlled by SRAM cells. SRAM based FPGAs are volatile; that is, the device's configuration data is lost once power is removed from the system. Thus SRAM based FPGAs require an external boot ROM to reprogram the FPGA every time it is powered on. This is not much of a problem since these devices have the

advantage that they can be quickly and repeatedly reprogrammed as required.
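A minimal sketch of this idea (assumed signal names, for illustration only): each programmable interconnect point behaves like a multiplexer whose select line is an SRAM configuration cell written while the bitstream is loaded.

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical model of one SRAM-controlled interconnect point.
entity routing_point is
    port (cfg_clk  : in  std_logic;  -- configuration (bitstream load) clock
          cfg_we   : in  std_logic;  -- configuration write enable
          cfg_bit  : in  std_logic;  -- configuration data bit
          wire_a   : in  std_logic;  -- first routing wire
          wire_b   : in  std_logic;  -- second routing wire
          wire_out : out std_logic);
end entity routing_point;

architecture behavioral of routing_point is
    signal sram_cell : std_logic := '0'; -- volatile: contents lost on power-down
begin
    -- The SRAM cell is written during configuration.
    process (cfg_clk)
    begin
        if rising_edge(cfg_clk) then
            if cfg_we = '1' then
                sram_cell <= cfg_bit;
            end if;
        end if;
    end process;

    -- The stored bit steers the interconnect multiplexer.
    wire_out <= wire_a when sram_cell = '0' else wire_b;
end architecture behavioral;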

3.3.2.3 Virtex FPGA Family

The Xilinx Virtex series was first introduced in 1998 as a low power high

performance solution. It was the first line of FPGAs to offer one million system gates.

The Virtex product line consistently offers the industry's leading combination of

performance, capability, and integration at the lowest system cost [20].

In addition to FPGA logic, the Virtex series includes embedded fixed function

hardware for commonly used functions such as multipliers, memories, serial

transceivers and microprocessor cores.

Virtex4 & Earlier FPGAs

Older-generation devices such as the Virtex, Virtex2 and Virtex2Pro are still available, but their functionality is largely superseded by the Virtex-4

and -5 FPGA families. The Virtex2 series is manufactured on a 1.5V, 0.15μm 8-Layer

Metal Process with 0.12μm High-Speed Transistors whilst the Virtex-2 Pro series is

manufactured on a 1.5V, 0.12μm 8-Layer Metal Process with 90 nm High-Speed

Transistors [20].

The Virtex4 series is manufactured on a 1.2V, 90 nm process. The architecture used in Virtex4 is very similar to that used in all previous Virtex and Spartan (up to Spartan-3A) FPGAs. The simplified schematic of a Virtex4 slice is shown in Figure 3.8. It

includes [20, 22]:

 Two 4-input LUTs.

 Two Multiplexers.

 Dedicated arithmetic logic including two 1-bit adders, carry chain and two

dedicated AND gates for fast and efficient multiplication.

 Two 1-bit registers that can be configured to operate either as flip-flops or

as latches.

Figure 3.8 Simplified Schematic of Virtex-4 Slice

Virtex-5 FPGAs

In the Virtex5 series, Xilinx moved from its traditional four-input LUT design

to six-input LUTs. It is a 65nm design fabricated in 1.0V, triple-oxide process

technology. The Virtex5 series offers a lower power solution delivered by the 65 nm

technology and power-saving IP blocks [20].

The simplified schematic of a Virtex5 slice is shown in Figure 3.9. The main

differences between a Virtex4 and a Virtex5 slice are [20, 22]:

 Four configurable 6-to-1 (or 5-to-2) LUTs instead of 4-to-1 LUTs.

 Four LUTs and four register bits per slice.

 The dedicated arithmetic logic circuitry does not include the dedicated AND gates.

Figure 3.9 Simplified Schematic of Virtex-5 Slice

3.3.2.4 Speed Grades

There is no consistent definition of a speed grade across all devices. Even for Xilinx, speed grades mean different things depending on whether we are referring to an FPGA or a CPLD. Originally, speed grades for Xilinx FPGAs represented the delay through a look-up table, but now the speed grade does not actually represent a timing path. Instead, the speed grade is a relative metric of performance within a specific FPGA family. Different speed grades within a family result merely from process variations, with all masks and parts being identical.

For modern Xilinx FPGAs, such as those of the Virtex family, higher numbers

represent faster devices. For example, Virtex4 speed grades are -10, -11, and -12 with

-10 being the slowest and -12 being the fastest. Virtex5 speed grades are -1, -2, and -3

with -1 being the slowest and -3 being the fastest.

3.4 FPGA Design Flow

The design flow for FPGA implementation is illustrated in Figure 3.10, as provided by Xilinx [20], followed by a brief summary of all its steps [16, 23].

Figure 3.10 Digital Design Flow

Design Entry

Generally, there are different techniques for design entry: schematic based, HDL, a combination of both, and finite state machine (FSM) entry. If the designer

wants to deal more with hardware, schematic based design entry is the better choice.

HDL and FSM on the other hand, represent a level of abstraction that can isolate the

designer from the details of the hardware implementation. FSM is used when the

design can be thought of as a series of states. HDL is considered the most popular

design entry methodology for FPGA design as it is the best choice for describing

complex designs.

HDL is a high level textual programming language that includes specialized

constructs to describe or model the behavior or structure of a digital system. An HDL

allows a system's behavior to be described at an abstract and technology independent

level. There are two industry-standard IEEE HDLs, namely VHDL and Verilog.

While their syntax and semantics are quite different, they are used for similar

purposes. VHDL contains more constructs for high level modeling, model

parameterization, design reuse and management of large designs than Verilog does.

Many EDA tools are designed to work with either language or both languages

together.

Not all VHDL constructs are synthesizable. Generally, if the VHDL code is physically meaningless or too far removed from the hardware it attempts to describe, it may not be synthesizable [19]. Thus it is best to describe the design in a simple manner in order to ensure the synthesizer is able to correctly translate the design into the equivalent logical elements.
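As an illustration of such a simple, synthesizable description (a hypothetical sketch, not a module from this thesis), the registered 8-bit adder below uses only constructs with a direct hardware equivalent and maps cleanly onto an adder followed by a register:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical, plainly synthesizable VHDL: a registered 8-bit adder.
entity reg_adder is
    port (clk  : in  std_logic;
          a, b : in  unsigned(7 downto 0);
          sum  : out unsigned(7 downto 0));
end entity reg_adder;

architecture rtl of reg_adder is
begin
    process (clk)
    begin
        if rising_edge(clk) then
            sum <= a + b;  -- an adder feeding a register
        end if;
    end process;
end architecture rtl;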

Behavioral Simulation

Behavioral simulation is the step where the design description is simulated to

verify its logical correctness. Behavioral simulation does not consider propagation

delays.
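For illustration (assuming the hypothetical reg_adder sketched in the previous section), a minimal behavioral test bench applies stimuli and checks the result purely at the logical level, with no delays modeled:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

-- Hypothetical test bench for the reg_adder sketch (behavioral simulation only).
entity tb_reg_adder is
end entity tb_reg_adder;

architecture sim of tb_reg_adder is
    signal clk  : std_logic := '0';
    signal a, b : unsigned(7 downto 0) := (others => '0');
    signal sum  : unsigned(7 downto 0);
begin
    dut : entity work.reg_adder port map (clk => clk, a => a, b => b, sum => sum);

    clk <= not clk after 5 ns;  -- free-running simulation clock

    stimulus : process
    begin
        a <= to_unsigned(100, 8);
        b <= to_unsigned(55, 8);
        wait until rising_edge(clk);
        wait for 1 ns;  -- let the registered output settle
        assert sum = to_unsigned(155, 8)
            report "Unexpected sum" severity error;
        wait;  -- end of stimuli
    end process stimulus;
end architecture sim;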

Synthesis

Synthesis is the process which translates the HDL code into a netlist form (i.e. a

complete circuit with logical elements such as gates, flip-flops, etc.) targeted at a specific FPGA platform. At this stage, a detailed timing analysis can be carried out and an estimate of the occupied area can be obtained. The resulting netlist is stored in an NGC

(Native Generic Circuit) file for Xilinx Synthesis Technology (XST).

Implementation (Commonly referred to as Place and Route)

Implementation is basically the process in which the synthesized netlist will be

implemented in the target FPGA. This process consists of a sequence of three steps:

i. Translate: Combines all input netlists and constraints to a logical design

file saved as an NGD (Native Generic Database) file for Xilinx tools.

ii. Map: The map process fits the logic defined by the NGD file into the

targeted FPGA elements (i.e. CLBs, IOBs) and generates an NCD (Native

Circuit Description) file which physically represents the design to be

mapped to the components of the target FPGA.

iii. Place and Route (PAR): The PAR process places the blocks from the map process onto the target FPGA according to the defined user constraints and then routes the connections between them. The PAR tool takes the mapped NCD file and produces a completely routed NCD file. Power calculation and analysis can be performed, if required, after PAR. It is essential to ensure the design meets the power budget, thus attaining system performance and cost

goals. Low power enables higher clock frequency, higher reliability, better

noise margins, and reduced capital and operational costs.

Static Timing Analysis

Static timing analysis is performed post PAR. It incorporates timing delay

information to provide a comprehensive timing summary of the design. The critical path of the design, and hence the fastest achievable design speed, is determined here. The main

advantage of static timing analysis is that it is relatively fast, doesn’t need a test bench

and exhaustively tests every possible path in the design.

Timing Simulation (also known as Post Route simulation)

In timing simulation, the VHDL timing model generated by the place and route tool, which includes the block and routing delay information from the routed design, is simulated to give a more accurate assessment of the behavior of the circuit under worst-case conditions. Timing simulation is a highly recommended part of the HDL design flow for Xilinx devices, verifying that the design as implemented on the target FPGA meets its timing constraints. Since timing simulation uses the detailed timing and design layout information that is available after place and route, this simulation closely matches the actual device operation.

Performing a timing simulation in addition to a static timing analysis will help

to uncover issues that cannot be found in a static timing analysis alone.

Download & Circuit Verification

The routed NCD file is converted to a bit stream file to be used to configure the

target FPGA device.

Chapter 4

Floating Point Arithmetic

Since only binary information can be stored and processed in digital computers, the most natural system to use when representing decimal numbers is the binary system. There are two methods of representing decimal numbers in binary notation, depending on the position of the binary point. The fixed point method assumes the binary point is always in a fixed position, while the floating point representation assumes the binary point can float anywhere within the number's significant bits.

This chapter gives a brief presentation of the IEEE-754 [24] standard for single

precision floating point numbers and explains the floating point multiplication and

addition/subtraction arithmetic and algorithms.

4.1 Fixed and Floating Point Representation

As mentioned earlier, fixed point representation represents real numbers while

assuming the radix point, also referred to as binary point for binary systems, is fixed

in a certain position such that there are a fixed number of digits after and before the

radix point. The two most widely used radix point positions when fixed point

representation is used to store numbers in digital computers are:

1. Placing the radix point in the extreme left of the number such that it can only

represent fractions.

2. Placing the radix point in the extreme right of the number such that it only

represents integer numbers.

In either case, the radix point is not actually present, but its presence is assumed

from the fact that the number stored is treated as a fraction or as an integer [25].

Floating point representation is a method for representing any number in two

parts. The first part represents a signed, fixed point number called the mantissa. The

second part designates the position of the radix point and is called the exponent.

Usually, the radix point is shifted such that there is one non-zero digit to its left. So

basically, floating point represents real numbers in the scientific notation where the

radix point can float anywhere in the number and the exponent is adjusted

accordingly. The general format of any floating point number (F) can be written as:

F = (-1)^sign * n.nnn * r^exp (4.1)

where sign is 0 or 1 for a positive or negative number respectively, n.nnn is the mantissa, r is the radix of the numbering system used (i.e. r = 10 for decimal, r = 2 for binary) and exp is the exponent that records the original location of the radix point. For example, the binary floating point number (1101.01)2 is represented in floating point with a mantissa and an exponent as follows:

(1101.01)2 = 1.110101 x 2^3 (4.2)

Fixed point representation is considered easier and more area efficient to implement than floating point representation. On the other hand, floating point representation has the advantage of being able to represent very large or very small numbers, and thus arithmetic operations are less likely to overflow or underflow when using floating point representation than when using fixed point representation. As a result, floating point representation is more suitable for applications that require high precision and a wide dynamic range.

IEEE 754 standard for floating point representation was defined to achieve

several innovations which are:

 Precisely specify floating point number encoding such that all computers

would interpret floating point numbers in the same way. This made it possible

to transfer floating point numbers from one computer to another.

 Precisely specify how to perform arithmetic operations and deal with

exceptional conditions that could result from them such that all computers will

give the same result for a given operation with the same input data.

4.2 IEEE-754 Standard for Floating Point Representation

The IEEE-754 standard was first created in 1985. Several revisions were made

until the final IEEE-754 2008 standard was published in August 2008. The IEEE 754

standard specifies how floating point numbers are represented and how to carry out arithmetic operations on them. Nowadays, the IEEE 754 standard is the most common representation for real numbers on computers, including Intel-based PCs, Macintoshes, and most UNIX platforms.

4.2.1 Numerical Encoding

Single precision IEEE-754 format represents binary floating point numbers in

32 bits. The 32 bits are divided into three fields, a 1-bit sign, an 8-bit exponent and a

23-bit mantissa as shown in Figure 4.1.

Sign: B31 | Exponent: B30 - B23 | Mantissa: B22 - B0

Figure 4.1 IEEE Format of Single Precision FP Number

The sign bit stores '0' for positive numbers and '1' for negative numbers. The

8-bit exponent stores the exponent in excess-127 code. That is a bias of 127 is added

to any exponent before it is stored. So for example, to represent the binary exponent 2,

we encode 127 + 2 = 129 in binary (10000001). To represent the binary exponent -2

we encode 127 - 2 = 125 in binary (01111101). This can be expressed as:

E' = E + 127 (4.3)

where E' is the stored exponent and E is the actual exponent

So using the excess-127 code gives a range for E' of 0 <= E' <= 255. IEEE-754 reserves exponent field values of 0 and 255 (all 0's and all 1's) to denote special values in the floating point scheme, so the operating range becomes 1 <= E' <= 254, equivalent to a range of -126 <= E <= 127 for E. Excess-127 code is used as opposed to a signed representation to allow expressing a wider range of exponents. Excess-127 code is also used as opposed to a 2's complement representation to allow easier exponent comparison (needed in arithmetic operations).

The 23-bit mantissa is actually a 24-bit mantissa where the most significant bit

is not stored and thus is known as the implicit bit. While this provides efficient

storage, the implied bit is necessary to carry out arithmetic operations on the number and must be explicitly restored before any operation involving the number is performed.
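As a small illustration, the field extraction and implicit bit restoration described above can be sketched in VHDL, the language used for the designs in this work. The entity and port names below are hypothetical, and the sketch is valid only for normalized inputs:

library ieee;
use ieee.std_logic_1164.all;

-- Hypothetical sketch: slice a single precision word into its three
-- fields and restore the implicit leading '1' (normalized inputs only).
entity fp_unpack is
  port (
    fp_word  : in  std_logic_vector(31 downto 0);
    sign     : out std_logic;
    exponent : out std_logic_vector(7 downto 0);   -- excess-127 coded
    mantissa : out std_logic_vector(23 downto 0)); -- implied bit restored
end entity fp_unpack;

architecture rtl of fp_unpack is
begin
  sign     <= fp_word(31);
  exponent <= fp_word(30 downto 23);
  mantissa <= '1' & fp_word(22 downto 0);  -- make the implicit bit explicit
end architecture rtl;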

The set of possible values of a floating point number is explained in the following sections and is summarized in Table 4.1.

Table 4.1 Summary of Floating Point Number Values


Value | Sign | Exponent | Mantissa
Normalized Numbers: (-1)^S x 1.f x 2^(E'-127) | X | 0 < E' < 255 | XX……..XX
Denormalized Numbers (Subnorms): (-1)^S x 0.f x 2^-126 | X | All '0's | Non-zero value
Zero: +0 | '0' | All '0's | All '0's
Zero: -0 | '1' | All '0's | All '0's
Infinity: +∞ | '0' | All '1's | All '0's
Infinity: -∞ | '1' | All '1's | All '0's
Not a Number: NaN | X | All '1's | Non-zero value

4.2.2 Normalized and Denormalized Numbers

A floating point number is said to be normalized if it is adjusted such that its

implicit bit is ‘1’. Normalization has the advantage of allowing a wide range of

numbers to be represented with great precision. The floating point number (FN) then

has the following format

FN = (-1)^Sign x 1.Mantissa x 2^Exponent (4.4)

Denormalized numbers, also known as subnorms, on the contrary have a '0' as

the implied bit. They are used to represent numbers that are smaller than the smallest

normalized number. They are identified by an all zero exponent and a non-zero

mantissa.

4.2.3 Special Values

The IEEE-754 standard reserves the exponent field of all '0's and all '1's for

special values. The special values defined by the IEEE-754 standard are:

1. Zero : Due to assumption of '1' implied bit, a zero number cannot be directly

represented. Zero is thus considered a special value denoted by an all '0'

exponent and mantissa. The sign bit differentiates between +0 and -0 which

are distinct values that compare as equal.

2. Infinity: Infinity is denoted by an all '1's exponent and an all '0's mantissa.

The sign bit separates between positive and negative infinity.

3. NaN: The value Not a Number (NaN) is used to represent a value that is not a real number. NaN values are represented by an all '1's exponent and a non-zero mantissa.

4.2.4 Range of Floating Point Numbers

The range of positive/negative single precision floating point numbers is

shown in Table 4.2. Since the sign of floating point numbers is given by a separate leading bit, the range of negative numbers is given simply by negating the range of positive numbers.

Table 4.2 FP Ranges for Normalized and Denormalized Numbers

Category | Binary Representation | Decimal Representation
Normalized Numbers | ± [ 2^-126 to (2 - 2^-23) * 2^127 ] | ± [ ~1.175 * 10^-38 to ~3.4 * 10^38 ]
Denormalized Numbers | ± [ 2^-149 to (1 - 2^-23) * 2^-126 ] | ± [ ~1.4 * 10^-45 to ~1.175 * 10^-38 ]

4.2.5 Exceptions

The IEEE-754 standard defines five exceptions, each of which corresponds to a

flag being set. These exceptions are:

1. Overflow : Set when the result has a value that is too large to be

represented.

2. Underflow : Set when the result has a value that is too small to be

represented.

3. Inexact : Set when the result can't be exactly represented so a rounding

error is introduced.

4. Invalid : Set when an operation cannot return a real value such as in the

cases of infinity-infinity, zero/zero or the square root of a negative number.

5. Divide by zero: Set when an operation on finite operands gives an infinite

solution.

4.2.6 Rounding Modes

Arithmetic operations usually result in floating point numbers with more bits than can actually be stored. In such cases, it is necessary to normalize the number and round it in order to fit the storage format. The IEEE 754 standard defines five

rounding modes. The first two modes round to a nearest value, while the other three

are called direct rounding. These five rounding modes are:

1. Round to nearest, ties to even (REN): Rounds to the nearest value; if the

number falls midway it is rounded to the nearest value with an even (zero) least

significant bit, which occurs 50% of the time; this is the default algorithm for binary

floating-point.

2. Round to nearest, ties away from zero: Rounds to the nearest value; if the

number falls midway it is rounded to the nearest value above (for positive numbers)

or below (for negative numbers).

3. Round towards +∞ (RP) : Rounds the number towards +∞.

4. Round towards -∞ (RM) : Rounds the number towards -∞.

5. Round towards 0 (RZ) : Also called truncation as the extra bits are simply

truncated.

Table 4.3 Examples on the IEEE 754-2008 Rounding Modes

Number | Round to Nearest: Ties to Even | Round to Nearest: Ties Away from Zero | RP | RM | RZ
10.01 | 10 | 10 | 11 | 10 | 10
10.10 | 10 | 11 | 11 | 10 | 10
10.11 | 11 | 11 | 11 | 10 | 10
-10.01 | -10 | -10 | -10 | -11 | -10
-10.10 | -10 | -11 | -10 | -11 | -10
-10.11 | -11 | -11 | -10 | -11 | -10

The rounding mode affects the results of most arithmetic operations and the

thresholds for overflow and underflow exceptions. The default rounding mode is REN

and is the mode most commonly used in arithmetic implementations in software and hardware. To increase the precision of the result and to enable the REN rounding mode, three bits are appended to the right of the mantissa. These bits are:

1. The Guard Bit: Simply an extension of mantissa for extra precision.

2. The Round Bit: Also an extension of mantissa for extra precision.

3. The Sticky Bit: It is the logical "OR"ing of all dropped bits.

4.3 Floating Point Arithmetic

Floating point arithmetic is considered more complex than integer arithmetic. In

this section, the basic algorithm for performing floating point addition/subtraction and

multiplication is explained.

4.3.1 Floating Point Multiplication Algorithms

Floating point multiplication is considered one of the simplest floating point

operations since it is performed by directly adding the exponents and multiplying the

mantissas. The basic operations of floating point multiplication can be divided into the following steps [25]:

1. Check for Zero: Checking if any of the input operands is a zero.

2. Determine Resultant Sign: By considering the operand signs.

3. Determine Resultant Exponent: Adding the two exponents and subtracting a

127 bias to compensate for the bias being added in both exponents.

4. Multiply the Mantissas: This involves a 24 by 24 unsigned integer

multiplication that gives a 48 bit result.

5. Post normalization: The resultant mantissa is normalized and the resultant

exponent is adjusted accordingly. Since when multiplying two very large or

very small numbers it is quite possible for overflow or underflow to occur, it is

useful at this point to check the exponent for possible overflow or underflow.

6. Rounding: Rounding the resultant mantissa to fit it in the assigned number of

bits in order to give the output in the IEEE standard format.
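As a short worked example (operand values chosen purely for illustration), consider multiplying 12 = 1.5 x 2^3 by 5 = 1.25 x 2^2. The stored exponents are E'a = 3 + 127 = 130 and E'b = 2 + 127 = 129, so the resultant exponent is 130 + 129 - 127 = 132, i.e. E = 5. The mantissa product is 1.5 x 1.25 = 1.875, which is already normalized, and the resultant sign is 0 XOR 0 = 0, giving 1.875 x 2^5 = 60 as expected.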

The bottleneck of the floating point multiplier is the 24 by 24 unsigned binary multiplication, due to the long carry chain involved in the multiplication operation.

Generally, the multiplication of two n bit binary numbers consists of successive multiplications that calculate n partial products, followed by the addition of the properly shifted partial products to give the final result, which would not exceed 2n bits. Then, for multiplication of signed numbers, the

resultant sign has to be determined. There are two main concerns when implementing

multiplication of two binary numbers which are:

1. Determining the Resultant Sign: Usually, multiplication handles the sign of

the numbers to be multiplied separately in order to find the resultant sign. Yet,

modern computers usually deal with the 1's complement, 2's complement or

signed number representation which all embed the sign in the number itself. A

simple long multiplication process won’t be sufficient in such cases but must

be adjusted for correct operation.

2. The Long Carry Chain Involved: The result of the n by n binary multiplication appears only after all n intermediate partial products have been calculated and added. Such an operation is very time consuming due to the long carry chain involved in the addition.

When considering the above issues for the case of the binary multiplier involved

in the floating point multiplication it is found that:

1. Multiplying floating point numbers involves multiplying the two 24 bit unsigned mantissas, so the first issue is not of concern when dealing with floating point numbers.

2. On the other hand, the large delay resulting from the long carry chain involved in the 24 by 24 multiplication, due to the addition of 24 partial products, is a very critical issue; it is the reason why the 24 by 24 unsigned multiplication operation is the bottleneck of the floating point multiplier.

Several algorithms have been introduced to speed up binary multiplication by decreasing the number of partial products, hence reducing the number of intermediate additions and breaking the long carry chain. Generally, floating point

multipliers perform the same main operations, which were explained above, but differ

in the algorithm by which the 24 by 24 bit multiplication is performed. The most

famous of these algorithms are [12]:

Booth Multiplication

Booth multiplication is a fast multiplication algorithm that is especially useful

when one of the numbers to be multiplied includes a string of consecutive '1's. Booth's

algorithm is based on the fact that any binary number containing a string of

consecutive '1's can be represented as the difference between two numbers using the

following general rule:

(ddd0111…111ddd)2 = (ddd1000…000ddd)2 – (ddd0000…001ddd)2 (4.5)

where d: don't care.

Thus the multiplication 20 * 14 in binary can be performed as shown in Table 4.4. As shown in the table, Booth multiplication reduces the number of partial products to be added, which is the key factor in speeding up binary multiplication.

Table 4.4 Example on Booth Multiplication (20 x 14 = 280)

Normal Multiplication:
      010100        (20)
    x 001110        (14)
    -----------
      0101000       (20 x 2)
     01010000       (20 x 4)
  + 010100000       (20 x 8)
  -----------
    100011000       (280)

Booth's Multiplication (001110 = 010000 - 000010):
      010100        (20)
    x 001110        (14)
    -----------
    101000000       (20 x 16)
  + 111011000       (2's complement of 000101000 = 20 x 2)
  -----------
    100011000       (280, carry out discarded)

Canonic Signed Digit (CSD) Multiplication

CSD multiplication algorithm reduces the number of partial products in the

multiplication operation by encoding the multiplier as CSDs. A number is said to be

encoded in the canonical form if it contains no adjacent non-zero digits. CSDs are

generated by encoding a binary number such that it contains the fewest number of

non-zero bits.

A CSD n by n bit multiplier contains (n+1) cascaded CSD encoder units that generate the CSD representation of the multiplier. Each unit receives the multiplier bits to be encoded and a carry bit as inputs, and provides the next carry bit and a canonic signed digit as outputs. The generated CSD vector is then given to a CSD logic and shift control unit along with the multiplicand and its 2's complement. The CSD logic and shift control unit provides n/2 n-bit partial products that are given to an adder block. The output of this adder block is the final product of the multiplier and multiplicand.

The basic idea behind the CSD multiplication is again to reduce the number of

partial products calculated in order to increase the speed of the multiplication

operation where in CSD multiplication, the product of two n-bit numbers is calculated

in (n/2 – 1) steps.

4.3.2 Floating Point Addition/Subtraction

Two floating point numbers should be aligned to be able to directly add or

subtract their mantissas when performing floating point addition or subtraction. That

is equivalent to both numbers having equal exponents. If the exponents of the two

floating point numbers are not equal, one of the numbers has to be pre-normalized.

Pre-normalization is the process where the exponent of the smaller floating point number to be added/subtracted is adjusted to be equal to the exponent of the larger number by shifting its mantissa right a number of places equal to the difference between the two exponents. Pre-normalization is performed on the smaller floating point number so that, if data bits are lost due to the shifting operation, the effect is not significant. The main operations of floating point addition/subtraction can be

divided into the following steps [25]:

1. Check for Zeros: Checking if one or both of the operands is a zero.

2. Calculate Exponent Difference: To be used in the pre-normalization step.

3. Pre-normalization: Adjusting the smaller FP operand in order to align the two

FP numbers that are to be added or subtracted.

4. Addition/Subtraction: Performing the actual addition/subtraction operation

of the two mantissas.

5. Post Normalization: Normalizing the final result by shifting the resultant

mantissa and adjusting the exponent accordingly.

6. Rounding: Rounding the resultant mantissa to fit it in the assigned number of

bits in order to give the output in the IEEE standard format.
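As a short worked example (values chosen purely for illustration), consider adding 13 = 1.101 x 2^3 and 2.25 = 1.001 x 2^1 in binary. The exponent difference is 3 - 1 = 2, so the smaller operand is pre-normalized to 0.01001 x 2^3. Adding the aligned mantissas gives 1.101 + 0.01001 = 1.11101, so the result is 1.11101 x 2^3 = 15.25, which is already normalized and needs no post normalization or exponent adjustment.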

Generally, floating point addition/subtraction is known to be more complicated

than floating point multiplication due to the extensive processing required in adjusting

the operands before and after the execution of the addition or subtraction operation

when pre-normalizing the smaller mantissa and when post normalizing the resultant

mantissa respectively.

The bottleneck of floating point addition/subtraction is the post normalization operation. This is due to the fact that the post normalization of the resultant mantissa involves several time consuming dependent operations, which are:

i. Finding the leading '1' bit in the resultant mantissa to be set as the implied bit.

The leading one bit here can be located anywhere within the mantissa.

ii. Shifting the resultant mantissa in order to set the detected '1' bit as the most

significant bit.

iii. Adjusting the resultant exponent to compensate for the shift in the resultant

mantissa.

Many algorithms have been proposed to implement the post normalization process

each aiming to optimize a certain performance parameter such as area, latency or

speed. Generally, floating point adders perform the same main operations, which were

explained above. They only differ in the algorithm used to implement the post

normalization operation. The most famous of these algorithms are [4, 26]:

Leading One Detector (LOD) Adder/Subtractor

The resulting mantissa from the addition/subtraction operation is first inspected to determine the location of the leading one. Based on the leading one location, the resultant mantissa is shifted left by a number of places which is subsequently subtracted from the exponent.

The LOD algorithm is a simple, area efficient algorithm that is also known as the standard algorithm.
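A minimal behavioral sketch of an LOD, written as a priority scan in VHDL, is given below; the mantissa width and all names are hypothetical, and the detected leading zero count is the left shift amount that is subsequently subtracted from the exponent:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity lod is
  port (
    mant  : in  std_logic_vector(23 downto 0);
    shift : out unsigned(4 downto 0));  -- leading zero count, 0 to 24
end entity lod;

architecture behav of lod is
begin
  process (mant)
  begin
    shift <= to_unsigned(24, 5);          -- default: mantissa is all zeros
    for i in 23 downto 0 loop             -- priority scan from the MSB
      if mant(i) = '1' then
        shift <= to_unsigned(23 - i, 5);  -- position of the leading one
        exit;
      end if;
    end loop;
  end process;
end architecture behav;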

Leading One Predictor (LOP) Adder/Subtractor

The main difference between LOD and LOP algorithms is that LOD method

detects the leading one after the addition/subtraction operation has taken place while

the LOP method predicts the leading one in parallel with the addition/subtraction

computation. This is illustrated in Figure 4.2. Usually, the LOP method requires a

correction circuit.

Figure 4.2 LOD vs. LOP Algorithms

The LOP method has the advantage of reduced latency at the expense of added area and design complexity [7].

Two Path Adder/Subtractor (Far and Close Datapath)

According to studies, 43% of floating point instructions have an exponent difference of 0 or 1. Making use of this observation, the two path algorithm implements two parallel datapaths: one for when the exponent difference is equal to 0 or 1 and the effective operation is subtraction, called the Close Path, and another for all other cases, called the Far Path. In the two path algorithm, the latency is reduced by removing the pre-normalization from the close path and removing the LOD or LOP from the far path. This comes at the expense of increased area due to the dual path implementation.

The two path algorithm is faster than the LOD algorithm and experiences less

latency, but takes more area and consumes more power [26].

Three Path Adder/Subtractor

This architecture has three datapaths, of which only one is operational at a time.

Two of these paths have the exact same functionality as the far and close data paths in

the two path algorithm. The third datapath deals with NaN and infinity values along

with the case when the exponent difference is greater than the width of the mantissa.

Chapter 5

Proposed Floating Point Unit Architecture

In this chapter, the architecture of the proposed FPU is explained. First, the design specifications of the proposed FPU are discussed, starting from the implemented IEEE standard and the power considerations, then going through the design methodology and finally thoroughly discussing the proposed design. For both the FP Multiplier and

Adder/Subtractor Units, the block diagrams are explained block by block and

behavioral simulations for each block and the complete units are given.

5.1 Design Specifications of the Proposed FPU

5.1.1 Floating Point Representation

The FPU deals with single precision floating point numbers that are

represented as specified by the IEEE 754 standard. The exact specifications of the

floating point numbers dealt with in the proposed FPU are explained in this section.

Normalized and Denormalized Numbers

Denormalized numbers are generally rare and require complicated hardware for

their handling. In our design we do not honor denormalized numbers, but if a

denormalized number is by any chance introduced into the design, it is considered as

a zero and dealt with accordingly. The reason for excluding denormalized numbers is

because of the large overhead in taking care of these numbers, especially for the

multiplier [9, 10]. Denormalized numbers are commonly excluded from high-performance systems; for example, the Cell Broadband Engine does not use them for the single-precision format in its synergistic processing units (SPU) [27].

Special Values

Special values are dealt with as follows:

 Zero: Zero is by far the most common special value in numerical

representations and their arithmetic operations. Thus, it is only reasonable that

one must consider the zero when implementing a floating point unit. In the

proposed design, a zero input is detected at early stages to facilitate result

calculations hence avoiding unnecessary calculations. This will be thoroughly

explained later in this chapter.

 Infinity: The infinity is more likely to occur during multiplication. In the

proposed FP Multiplier Unit, a resulting positive or negative infinity would set

the overflow or underflow flags respectively.

 NaN: NaN values are considered rare and require a lot of hardware to deal

with [9] and hence are not honored in our design.

Exceptions

The most common exceptions that occur in floating point addition/subtraction

and multiplication are overflow and underflow. In the proposed design, they are

honored only in the FP Multiplier Unit where they are more likely to occur. A

detected overflow or underflow would set an appropriate flag that would propagate

smoothly to be given as an output by the final block in the FP Multiplier Unit.

Rounding

When considering the rounding modes introduced in Chapter 4, we know that

REN is the most common mode used in software and hardware implementations of

arithmetic, mathematical and engineering applications. Thus in our design, we chose

to implement all rounding using the REN mode.

The guard, round and sticky bits were added to the mantissas in both the

multiplier and adder/subtractor units to increase the accuracy of the stored result.

5.1.2 Power Considerations

In order to reduce power consumption, both the FP Multiplier and

Adder/Subtractor Units were designed to reduce switching activity by avoiding

unnecessary calculations as follows:

Power Consideration for FP Multiplier Unit

In multiplication, a zero input means that the result will be also a zero. In such a

case, no multiplication, normalization or rounding is needed at all, and performing them would only waste power. To avoid this unnecessary power consumption, both input operands are checked for zeros at a very early stage in

the design. If one or both operands are found to be zero, a zero flag is set. Blocks in

the design are written to operate only if the zero flag is not set.

Power Consideration for FP Adder/Subtractor Units

Unlike multiplication, there are two possible cases involving a zero in

addition/subtraction which are a zero input and an expected zero output. Now each

case is detected and dealt with a little differently as follows:

 Input Zero Detection: In addition or subtraction, if one or both input operands is

a zero, the result can be directly determined to be the other operand or a zero

respectively. The only operation necessary then would be the output sign

determination, no pre-normalization, post normalization or rounding would be

necessary. So in the proposed design, the input operands are checked for zeros at

an early stage and the appropriate zeros flag is set accordingly to identify whether

one or both inputs are zeros. Again, blocks are designed to operate only if these

flags indicate that the two input operands are not zeros.

 Output Zero Detection: If both input operands are non-zeros, a zero output can still result if both inputs are equal and the effective operation turns out to be subtraction. In the proposed design, the two input operands are compared at an early stage and an aequalb flag is set when they are equal. If the effective operation is subtraction and the aequalb flag is set, the zeros flag is set in a manner that indicates a zero result. Again, in such a case, unnecessary calculations are avoided.

5.1.3 Pipelined Architecture

Floating point operations can easily be broken down into several sub-

operations which make pipelining suitable for implementation of FPUs. Actually, all

high speed computers have pipelined arithmetic units since pipelining allows for a

faster clock cycle and increased data throughput at small expense to latency from the

extra latching overhead [25]. Because FPGAs are register-rich, this is usually an

advantageous structure for FPGA design since the pipeline is created at no cost in

terms of device resources. The flip flops introduced by pipelining typically occupy the

unused flip flops within the logic cells that are already used for implementing the

design.

Thus in order to increase the design speed of our proposed designs, both the

FP Multiplier and Adder/Subtractor Units were deeply pipelined by breaking each of

them down into simple modules with registers placed in between. The number of

registers that were placed between the different modules depended on each module's

delay time. The overall design has both an input and output register to synchronize

data entrance and exit.
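A pipeline stage between two modules is, in VHDL, nothing more than a clocked register on the inter-module signals. A generic sketch (names hypothetical; the asynchronous reset mirrors the arst signal used in the simulations of this chapter) is:

library ieee;
use ieee.std_logic_1164.all;

entity pipe_reg is
  generic (WIDTH : positive := 32);
  port (
    clk, arst : in  std_logic;
    d         : in  std_logic_vector(WIDTH - 1 downto 0);
    q         : out std_logic_vector(WIDTH - 1 downto 0));
end entity pipe_reg;

architecture rtl of pipe_reg is
begin
  process (clk, arst)
  begin
    if arst = '1' then            -- asynchronous reset
      q <= (others => '0');
    elsif rising_edge(clk) then   -- capture the inter-module data
      q <= d;
    end if;
  end process;
end architecture rtl;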

When implementing the pipelined FP Multiplier and Adder/Subtractor Units, a

top-down approach was used. At first an overview of the complete system was made

to gain a firm understanding of its operation and required specifications. Then the

design was divided into modules which were further broken down into smaller sub-

modules that performed simple specific operations. Each of the sub-modules was

optimized and tested separately before optimizing and testing the complete design.

5.2 Proposed FP Multiplier Unit

5.2.1 Introduction

The simplified block diagram of the FP Multiplier Unit is shown in Figure 5.1

where operands A and B are the single precision inputs and the result is the output, also given in the single precision format.

Figure 5.1 Simplified Block Diagram of FP Multiplier Unit

The 32 bit input operands are initially unpacked into sign, exponent and mantissa, each of which is manipulated differently throughout the design. The exponents of both operands are inspected to check for an input zero, in which case the zeroflag is set. Otherwise, if both inputs are non-zero operands, the zeroflag is unset, the exponents are added and the mantissas are multiplied. The multiplication here is an unsigned 24 by 24 operation that gives a 48 bit resultant mantissa. Since both mantissas to be multiplied are normalized, the resultant mantissa is expected either to be directly in the normalized form or, if a carry bit occurs, to require a one bit shift to the right. So the resultant mantissa is post normalized if necessary and the

exponent is adjusted accordingly and tested for possible overflow or underflow. Now,

the 48 bit mantissa is rounded to fit it in the specified number of bits and if necessary

the resultant exponent is adjusted and re-checked for possible overflow or underflow.

Finally, the 32 bit result is given in an IEEE compliant format along with the overflow

and underflow flags.

As referred to earlier in Chapter 4, the main design concern in the FP Multiplier

Unit is the implementation of the 24 by 24 unsigned multiplier due to the time

consuming process of adding the 24 partial products that were calculated by the

multiplier and the long carry chain involved.

In the proposed FP Multiplier Unit, a new fast multiplication algorithm referred to

as the “Block Multiplication” algorithm is proposed. As in the other fast multiplication algorithms explained in Chapter 4, the main concept adopted to increase the multiplication speed in the Block Multiplication algorithm is to reduce the number of partial products to be added, in order to break down the long carry chain involved. Block Multiplication does this by converting the 24 by 24 unsigned multiplication operation into several smaller multiplications performed in parallel, whose results are appropriately manipulated to give the final 48 bit result. This in turn is performed by slicing the 24 bit input mantissas of operands A and B into smaller blocks and performing the multiplication on these blocks, hence the term Block Multiplication.

Figures 5.2(a) and 5.2(b) illustrate the detailed block diagram of the proposed FP

Multiplier Unit. In the following sub-sections, each module shown in these figures is

thoroughly discussed followed by its behavioral simulation. The Block Multiplication

algorithm and implementation are thoroughly explained when discussing the

Multiplier Module. Finally the behavioral simulation of the entire FP Multiplier Unit

is given. All modules were written in VHDL using the FPGAdv 8.1 Mentor Tool and

were behaviorally simulated to verify their correct operation using ModelSim 6.3a.

5.2.2 Zero Detect Module

The zero detect unit is responsible for three main tasks which are:

1. Unpacking the input operands: It is necessary to separate the sign, exponent

and mantissa of the input operands since each is dealt with differently in the FP

Multiplier Unit as follows:

a. Sign Bits: The input signs are XORed to determine the resultant sign.

b. Exponents: The exponents are added to calculate the resultant

exponent then checked for possible overflow or underflow.

c. Mantissas: The mantissas are multiplied, then the result is post normalized and rounded, with a recheck for possible overflow, to produce the final result.

2. Input zero detection: The exponents of both input operands are checked. If

one or both of the exponents was found to be all zeros, the zero flag is set.

Since a zero floating point number has an all zeros exponent and mantissa

while a denormalized number has an all zero exponent and a non-zero

mantissa, then checking only the exponent of the input operands serves in

detecting a denormalized input number as a zero and dealing with it

accordingly.

3. Setting the implied bit: The implied bit that was omitted from each mantissa

before storage is now retrieved for the multiplication operation to be

performed correctly. The implied bit is always restored as a '1' since in the

proposed design only normalized numbers are considered.
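The zero test itself requires very little logic. A hypothetical sketch (signal names assumed, not taken from the actual design files) shows that only the exponents are inspected, which is why a denormalized input is flagged exactly like a true zero:

library ieee;
use ieee.std_logic_1164.all;

entity zero_detect is
  port (
    expa, expb : in  std_logic_vector(7 downto 0);
    zeroflag   : out std_logic);
end entity zero_detect;

architecture rtl of zero_detect is
begin
  -- Either exponent being all '0's marks that operand as a zero;
  -- a denormalized input is therefore treated as a zero as well.
  zeroflag <= '1' when (expa = "00000000") or (expb = "00000000") else '0';
end architecture rtl;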

Figures 5.3 and 5.4 show the symbol and behavioral simulation of the Unpack Module. The outputs appear after two clock cycles due to the presence of an input and an output register within the module. The simulation illustrates how each 32 bit operand is broken into sign, exponent and mantissa while the implied bit is set in the mantissa. The zeroflag is either set or unset depending on the input operands.

For example, for operands A = (BBBBBBBB)H and B = (33333333)H at 200ns, the zeroflag is unset at 400ns and operands A and B are broken down into signa = '1', expa = (77)H, manta = (BBBBBB)H and signb = '0', expb = (66)H, mantb = (B33333)H respectively. On the other hand, for the case of a zero or a denormalized number, like the cases at 300ns and 400ns where B = (00000000)H and B = (00000777)H respectively, the output zeroflag is set at 500ns and 600ns.

Figure 5.3 Symbol of the Unpack Module in the FP Multiplier Unit

Figure 5.4 Behavioral Simulation of the Unpack Module in the FP Multiplier Unit

5.2.3 Add Exponent Module

This module is mainly responsible for finding the resultant exponent. So in this

module, the two exponents are added and a 127 bias is subtracted to substitute for the

bias being added in both exponents. The resultant exponent is then stored in 10 bits;

that is two carry bits are added to the left to allow for overflow and underflow

detection that commonly result in multiplication if both operands to be multiplied are

either too small or too large. Since the 8 bit exponent of a single precision floating

point number is an unsigned number, the 10th bit being '1' indicates an underflow. An

overflow is detected when the 10th bit is '0' and the 9th bit is '1'. The checking of an

overflow or underflow is performed later in the Exception Detection Module by

inspecting these two extra added bits.

In this module, the resultant sign is also calculated using a simple XOR operation. The mantissa of operand B is also sliced here into three 8 bit blocks to prepare it for the multiplication operation in the Block Multiplier Module that comes next.
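The exponent and sign datapath of this module can be sketched as follows (a behavioral sketch with hypothetical names, without the module's pipeline registers). The operands are zero-extended to 10 bits before the addition so that the two extra bits capture a wrap below the bias (read later as underflow) or a result above the representable range (read later as overflow):

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity add_exponent is
  port (
    signa, signb : in  std_logic;
    expa, expb   : in  std_logic_vector(7 downto 0);
    signres      : out std_logic;
    expres10     : out std_logic_vector(9 downto 0));
end entity add_exponent;

architecture rtl of add_exponent is
begin
  -- (E'a + E'b) - 127 computed in 10 bits: a sum below the bias wraps and
  -- sets bit 9 (underflow); a sum above 254 + 127 leaves bit 9 at '0' and
  -- sets bit 8 (overflow).
  expres10 <= std_logic_vector(resize(unsigned(expa), 10)
                               + resize(unsigned(expb), 10) - 127);
  signres  <= signa xor signb;
end architecture rtl;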

Figures 5.5 and 5.6 show the symbol and behavioral simulation of the Add Exponent Module. After the arst signal (asynchronous reset) becomes inactive at 100ns, the input exponents expa = 238 = (EE)H and expb = 51 = (33)H are added and the 127 bias is subtracted, resulting in expres10 = 162 = (A2)H stored in 10 bits and appearing one clock cycle later at 200ns as (0A2)H. The resultant sign is also determined, where signres = signa XOR signb = '0' XOR '1' = '1', as appears at 200ns. The input mantissa B is broken up into the 8 bit slices mantb2, mantb1 and mantb0 to prepare it for the upcoming Multiplier Module.

Figure 5.5 Symbol of the Add Exponent Module in the FP Multiplier Unit

Figure 5.6 Behavioral Simulation of the Add Exponent Module in the FP


Multiplier Unit

5.2.4 Multiplier Module

This module is responsible for performing the unsigned multiplication of the two

24 bit mantissas. At first, the multiplication operation in this module was implemented using a Simple Multiplier, i.e. a plain multiplication statement written in VHDL. This was done mainly to verify the correct functionality of the entire FP Multiplier Unit before introducing the proposed Block Multiplier into it.

In order to implement the “Block Multiplication” algorithm, each of the two 24 bit input mantissas is sliced into three 8 bit blocks, so that Mantissa A = A2A1A0 and Mantissa B = B2B1B0 with A2, A1, A0, B2, B1 and B0 each being an 8 bit block. To perform the unsigned 24 by 24 bit multiplication, three operations are performed in parallel as shown in Figure 5.7. Within each of the three multiplication operations, mantissa A is multiplied by one of the blocks of mantissa B (i.e. a 24 by 8 multiplication is performed, which gives a 32 bit result).

Figure 5.7 The Three Parallel Multiplications Performed in the Block


Multiplication Algorithm

In order to perform each of these 24 by 8 multiplications, three 8 by 8 multiplications are actually performed, where the specific B block is multiplied by each of the three blocks of mantissa A. Each of the 8 by 8 multiplications gives a 16 bit result. These three 16 bit results are appropriately combined, as illustrated in Figure 5.8, to give the 32 bit result expected from the 24 by 8 multiplication. The resulting 32 bit numbers from each of the three main operations of Figure 5.7 are calculated in the same way.

Figure 5.8 Details of Mantissa A * B0 Multiplication


Finally, the 32 bit results from the multiplication of Mantissa A with B1 and B2 are shifted left by 8 and 16 places respectively, then added to the 32 bit result from the multiplication of Mantissa A with B0 to give the final 48 bit result of the 24 by 24 bit multiplication. The detailed block diagram of the implemented Block Multiplier is

shown in Figure 5.9.

Figure 5.9 Block Diagram of the Block Multiplier
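A purely behavioral sketch of the slicing scheme is given below (hypothetical names, combinational code with none of the pipeline registers or the MS/LS split addition of the actual design); it shows that accumulating the nine shifted 8 by 8 products reproduces the full 24 by 24 product:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity block_mult_24x24 is
  port (
    manta : in  std_logic_vector(23 downto 0);
    mantb : in  std_logic_vector(23 downto 0);
    prod  : out std_logic_vector(47 downto 0));
end entity block_mult_24x24;

architecture behav of block_mult_24x24 is
begin
  process (manta, mantb)
    type slice_t is array (0 to 2) of unsigned(7 downto 0);
    variable a, b : slice_t;
    variable pf   : unsigned(47 downto 0);
    variable acc  : unsigned(47 downto 0);
  begin
    -- Slice both mantissas into three 8 bit blocks.
    for i in 0 to 2 loop
      a(i) := unsigned(manta(8*i + 7 downto 8*i));
      b(i) := unsigned(mantb(8*i + 7 downto 8*i));
    end loop;
    -- Each partial fraction PFj = Mantissa A * B(j) is built from three
    -- 8 by 8 products, then shifted by 8*j bits and accumulated.
    acc := (others => '0');
    for j in 0 to 2 loop
      pf := (others => '0');
      for i in 0 to 2 loop
        pf := pf + shift_left(resize(a(i) * b(j), 48), 8*i);
      end loop;
      acc := acc + shift_left(pf, 8*j);
    end loop;
    prod <= std_logic_vector(acc);
  end process;
end architecture behav;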

5.2.4.1 Multiplier (8 by 8) Sub-Module

This sub-module is responsible for multiplying the 24 bit mantissa of operand A with the three different slices of the mantissa of operand B. As referred to earlier, this is performed through three 8 by 8 multiplications, giving three 16 bit outputs. As shown in Figure 5.9, three instances of this sub-module are used, one for each slice of mantissa B.

5.2.4.2 Partial Fraction Adjust Sub-Module

This sub-module accepts the three 16 bit outputs from the Multiplier (8 by 8) Sub-Module and appropriately combines them, as discussed above and as illustrated in Figure 5.8, to give the 32 bit partial fraction resulting from multiplying mantissa A by one of the slices of mantissa B. As shown in Figure 5.9, there are three instances of this module, one following each instance of the Multiplier (8 by 8) Sub-Module. The outputs of these instances are PF2, PF1 and PF0, resulting from multiplying mantissa A by mantissa B slices B2, B1 and B0 respectively.

5.2.4.3 Add Partial Fractions Sub-Module

This sub-module accepts the three 32 bit partial fractions, appropriately shifts them and then adds them together. Since the Multiplier Module deals with 8 bit slices, to appropriately align the partial fractions for their addition, PF1 and PF2 are shifted left by 8 and 16 bits respectively as shown in Figure 5.10.

Bits:  47..40 | 39..32 | 31..24 | 23..16 | 15..8 | 7..0
PF0:                     [------------ PF0 ------------]
PF1:            [------------ PF1 ------------]  (00)H
PF2:   [------------ PF2 ------------]  (00)H    (00)H

Figure 5.10 Shift Operations Performed to Align the Partial Fractions for Addition

In order to add the three partial fractions PF2, PF1 and PF0 together, each of them was first divided into a most significant (MS) and a least significant (LS) part as shown in Figure 5.11. All the least significant parts were then added together, with the result stored in 26 bits (24 bits plus two carry bits, since the maximum possible carry is "10"), and all the most significant parts were added together, with the result stored in 24 bits. Performing the addition in this manner increases the overall speed of the design by breaking up the long carry chain involved in the addition of the 32, 40 and 48 bit long partial fractions PF0, PF1 and PF2 respectively.

Bits:  47..40 | 39..32 | 31..24 | 23..16 | 15..8 | 7..0
PF0:                     [PF0MS][------- PF0LS -------]
PF1:           [--- PF1MS ---][----- PF1LS -----]
PF2:  [------ PF2MS ------][-- PF2LS --]
(each aligned partial fraction is split at bit 24, so that all LS parts fit in 24 bits)

Figure 5.11 Dividing the Partial Fraction to prepare them for Addition.

5.2.4.4 Final Result Sub-Module

This module is responsible for appropriately combining the 26 and 24 bit results of the addition of the least and most significant parts of the partial fractions, given by the previous module, in order to produce the final 48 bit result of the unsigned 24 by 24 bit mantissa multiplication.

Figure 5.12 shows the block diagram of the Multiplier Module. Figures 5.13 to

5.16 go through the behavioral simulations of each sub-module of the Multiplier

Module. Each of the figures shows the input and output of a particular sub-module for

the sample test bench given in Table 5.1.

Table 5.1 Test Bench for the Block Multiplier Module


Mantissa A Mantissa B Resultant Mantissa
B0 0000 B0 0000 7900 0000 0000
B0 0000 80 0000 5800 0000 0000
B1 999A 80 0000 58CC CD00 0000
B1 999A 85 3333 5C68 51FD 47AE
B1 999A 80 0000 0000 0000 0000

Figure 5.13 Behavioral Simulation of the Multiplier (8 by8) Sub-Module

Figure 5.14 Behavioral Simulation of the Partial Fraction Adjust Sub-Module

Figure 5.15 Behavioral Simulation of the Add Partial Fractions Sub-Module

Figure 5.16 Behavioral Simulation of the Final Result Sub-Module

5.2.5 Post Normalize Module

This module is responsible for making sure the output mantissa is in the

normalized form. This is done by using two sub-modules:

5.2.5.1 Post Normalize Sub-Module

The multiplication of the two normalized 24 bit mantissas results in a 48 bit mantissa that has a radix point to the right of the two most significant bits, i.e. the 48th and the 47th bits. Since both multiplied mantissas are normalized, that is their most significant bit is a '1', either the 48th or the 47th bit of the result must be a '1'. Accordingly, to make sure the resultant mantissa is in the normalized form, these two bits are checked to locate the most significant '1', which is defined as the implied bit. If the 48th bit is detected to be '1', the mantissa is normalized by being shifted right by one place and the exponent is incremented by 1 to compensate for the shift. Otherwise, if the 47th bit is detected to be '1', the resultant mantissa is already in the normalized form and no shifting is required. In both cases, the detected '1' that is identified as the implied bit is dropped. Finally, the resultant mantissa is truncated to 26 bits, with the least significant bit being the sticky bit, which is the logical "OR"ing of all the truncated bits.

Figures 5.17 and 5.18 show the symbol and behavioral simulation of the Post Normalize Sub-Module respectively. At 100ns, the 48 bit resultant mantissa is (800000000001)H, having a '1' carry bit. Accordingly, the output mantissa = (000001)H at 200ns was shifted right by one bit, the implied bit has been dropped and the appropriate sticky bit has been added, '1' in this case since there is a dropped '1' bit. The exponent has been incremented from (33)H to (34)H to account for the one bit shift to the right in the resultant mantissa. At 300ns, the input is (400000000001)H, so the carry bit is zero and no shifting of the mantissa or incrementing of the exponent is performed, as shown at 400ns. The input mantissa at 400ns, (400000000000)H, is the same as that at 300ns except that the least significant bit changed from '1' to '0'. So the output mantissa at 500ns has a sticky bit with value '0', since all the dropped bits of the 48 bit mantissa were '0's.

Figure 5.17 Symbol of the Post Normalize Sub-Module in the FP Multiplier Unit

Figure 5.18 Behavioral Simulation of the Post Normalize Sub-Module in the FP


Multiplier Unit

5.2.5.2 Exception Detection Sub-Module

This sub-module is responsible for setting the appropriate flag if an overflow or

underflow is detected. Checking the exponent was performed after the Post Normalize

Module in order to take into account the case in which the normalization operation

causes an increment in the exponent. The checking for an overflow or underflow is

performed by inspecting the 10th and 9th bits of the resultant 10 bit exponent

calculated earlier in the Add Exponent Module. If the 10th bit is found to be '1', the underflow flag is set; if not, and the 9th bit is found to be '1', then the overflow flag is set. Otherwise, the exponent is adjusted to 8 bits by dropping these two check bits and the overflow and underflow flags are left unset.
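Since the checks reduce to inspecting two bits, the whole sub-module can be sketched in a few concurrent statements (hypothetical names; expres10(9) and expres10(8) are the 10th and 9th bits in the numbering used above):

library ieee;
use ieee.std_logic_1164.all;

entity exception_detect is
  port (
    expres10  : in  std_logic_vector(9 downto 0);
    expres8   : out std_logic_vector(7 downto 0);
    overflow  : out std_logic;
    underflow : out std_logic);
end entity exception_detect;

architecture rtl of exception_detect is
begin
  underflow <= expres10(9);                        -- 10th bit set
  overflow  <= (not expres10(9)) and expres10(8);  -- 9th bit set alone
  expres8   <= expres10(7 downto 0);               -- drop the two check bits
end architecture rtl;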

Figures 5.19 and 5.20 show the symbol and behavioral simulation of the

Exception Detection Sub-Module respectively. At 200ns an exponent (022)H=(00

0010 0010)B leads to both the overflow and underflow flags being unset at 300ns. At

300ns an exponent (222)H =(10 0010 0010)B sets the underflow flag at 400ns where

the 10th bit in the exponent is a '1'. A (122)H=(01 0010 0010)B exponent at 400ns sets

the overflow flag at 500ns where the 10th bit is a '0' while the 9th bit is a '1'.

Figure 5.19 Symbol of the Exception Detection Sub-Module in the FP Multiplier Unit

Figure 5.20 Behavioral Simulation of the Exception Detection Sub-Module in the FP


Multiplier Unit

5.2.6 Rounding Module

This module is responsible for rounding the 26 bit resultant mantissa to 23 bits

using the REN technique, then rechecking for possible overflow due to rounding and

finally giving the 32 bit resultant in IEEE format. This is performed through three

sub-modules which are:

5.2.6.1 REN Sub-Module

Within this sub-module, the 26 bit resultant mantissa is rounded to 23 bits using the REN technique, which inspects the guard, round and sticky bits to determine whether the original 23 bit resultant mantissa is to be incremented by one (Increment), rounded to the nearest even number (Tie to Nearest Even) or left unchanged, which is equivalent to simply truncating the guard, round and sticky bits (Truncate). All the possible cases are summarized in Table 5.2. In this module, a recheck flag is also set in the case of increment rounding, to be passed to the following sub-module, the Overflow Recheck Sub-Module. The reason for this flag is explained with the Overflow Recheck Sub-Module right after this one.

Table 5.2 Rounding Action Based on Guard, Round and Sticky Bits

Guard Bit | Round Bit | Sticky Bit | Rounding
0 0 X Truncate
0 1 X Truncate
1 0 0 Tie to Nearest Even
1 0 1 Increment
1 1 X Increment
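A behavioral sketch of the table (hypothetical names; mant26 holds the 23 bit mantissa followed by the guard, round and sticky bits) could look as follows. Note that an all '1's mantissa wraps to all '0's when incremented, which is exactly the case the recheck flag reports to the next sub-module:

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

entity ren_round is
  port (
    mant26  : in  std_logic_vector(25 downto 0);  -- 23 bits & G & R & S
    mant23  : out std_logic_vector(22 downto 0);
    recheck : out std_logic);
end entity ren_round;

architecture behav of ren_round is
begin
  process (mant26)
    variable m       : unsigned(22 downto 0);
    variable g, r, s : std_logic;
  begin
    m := unsigned(mant26(25 downto 3));
    g := mant26(2);  r := mant26(1);  s := mant26(0);
    recheck <= '0';
    if (g = '1' and (r = '1' or s = '1'))                       -- increment
       or (g = '1' and r = '0' and s = '0' and m(0) = '1') then -- tie, odd
      m := m + 1;
      recheck <= '1';
    end if;                      -- all other cases simply truncate G, R, S
    mant23 <= std_logic_vector(m);
  end process;
end architecture behav;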

Figure 5.21 shows the symbol of the REN Sub-Module. Figures 5.22 and 5.23 show the behavioral simulation of the REN Sub-Module for an even and an odd resultant mantissa respectively. If the guard (GB), round (RB) and sticky (SB) bits are "100", the mantissa is expected to remain unchanged in the simulation of Figure 5.22 (even mantissa) and to be incremented, tying it to even, in the simulation of Figure 5.23 (odd mantissa). For ease of illustration, the results of Figures 5.22 and 5.23 are summarized in Tables 5.3 and 5.4 respectively.

Figure 5.21 Symbol of the REN Sub-Module in the FP Multiplier Unit

Figure 5.22 Behavioral Simulation of the REN Sub-Module in the FP Multiplier


Unit when the Resultant Mantissa to be rounded is Even

Table 5.3 REN Sub-Module Behavioral Simulation Results for Even Mantissa

Time (ns) | Resultant Mantissa | GB RB SB | Expected Rounding Case | Output Mantissa
100 2000000 0 0 0 Truncate 400000
200 2000003 0 1 1 Truncate 400000
300 2000004 1 0 0 Tie to Nearest Even 400000
400 2000005 1 0 1 Increment 400001
500 2000006 1 1 0 Increment 400001

Figure 5.23 Behavioral Simulation of the REN Sub-Module in the FP Multiplier


Unit when the Resultant Mantissa to be rounded is Odd

Table 5.4 REN Sub-Module Behavioral Simulation Results for Odd Mantissa

Time (ns) | Resultant Mantissa | GB RB SB | Expected Rounding Case | Output Mantissa
900 2000008 0 0 0 Truncate 400001
1000 200000C 1 0 0 Tie to Nearest Even 400002
1100 200000D 1 0 1 Increment 400002
1200 200000E 1 1 0 Increment 400002

5.2.6.2 Overflow Recheck Sub-Module

In this sub-module, the exponent is rechecked for a possible overflow. The

overflow flag is set here if one of the following cases is satisfied:

1. If the exponent is all ones.

2. If the recheck flag, given by the previous sub-module, is set and the mantissa is all zeros. An all zeros mantissa results when a mantissa of all '1's is incremented in the rounding operation. Such a mantissa has a '1' carry bit, so normalizing it amounts to incrementing the exponent by one. The incremented exponent is then rechecked for possible overflow.

Figures 5.24 and 5.25 show the symbol and behavioral simulation of the Overflow Recheck Sub-Module respectively. Case 1 is shown when an all ones exponent (FF)H is introduced at 200ns, thus setting the overflow flag at 300ns. Case 2 is shown at 400ns, where the mantissa is all zeros and the recheck flag is set. The exponent is (FE)H, so an increment leads to an overflow, as seen at 500ns. An exponent of (33)H at 500ns with the recheck flag set leads to an increment of the exponent to (34)H without setting the overflow flag, as shown at 600ns.

Figure 5.24 Symbol of the Overflow Recheck Sub-Module in the FP Multiplier


Unit

Figure 5.25 Behavioral Simulation of the Overflow Recheck Sub-Module
in the FP Multiplier Unit

5.2.6.3 Output Sub-Module

This sub-module is responsible for appending the sign bit, the 8 bit exponent and the 23 bit mantissa together to give the single precision floating point multiplication result along with the resultant overflow and underflow flags.

Figures 5.26 and 5.27 show the symbol and behavioral simulation of the Final Module respectively. A '1' sign, a (78)H exponent and a (2AAAAA)H mantissa at 100ns are appended to give the 32 bit final result (BC2AAAAA)H at 200ns. A zeroflag set at 400ns results in an all zeros output at 500ns. Also, a set overflow or underflow flag at 300ns and 500ns results in an all zeros output at 400ns and 600ns respectively.

Figure 5.26 Symbol of the Final Module in the FP Multiplier Unit

Figure 5.27 Behavioral Simulation of the Final Module in the FP Multiplier Unit

5.2.7 The FP Multiplier Unit Behavioral Simulation

After designing each of the FP Multiplier Unit modules and simulating them

to assure their correct functionality, they were all put together and the complete

design was behaviorally simulated.

The symbol of the FP Multiplier Unit is shown in Figure 5.28. A snapshot of

the FP Multiplier Unit behavioral simulation is shown in Figure 5.29 for the test

bench in Table 5.5. For example, the output of the operation 25.8 * -7.4 =

(41CE6666)H * (C0ECCCCD)H at 400ns appears at 2300ns to be -190.92 =

(C33EEB85)H.

Figure 5.28 Symbol of the FP Multiplier Unit

Figure 5.29 Behavioral Simulation of the FP Multiplier Unit

Table 5.5 Test bench of the FP Multiplier Unit Behavioral Simulation

Clk (ns) | Operand A: Hex, Decimal | Operand B: Hex, Decimal | Clk (ns) | Result: Hex, Decimal
100 C1B00000 -22 C1B00000 -22 2000 43F20000 484

200 C1B00000 -22 40800000 4 2100 C2B00000 -88

300 C1B00000 -22 3F800000 1 2200 C1B00000 -22

400 41CE6666 25.8 C0ECCCCD -7.4 2300 C33EEB85 -190.92

500 469C4000 20,000 3B03126F 0.002 2400 42200000 40

600 00000000 0 400CCCCD 2.2 2500 00000000 0

700 7EE1B1E6 1.5E38 43FA0000 500 2600 00000000 OL

800 00A355E6 1.5E-38 3B03126F 0.002 2700 00000000 UL

5.3 Proposed FP Adder/Subtractor Unit

5.3.1 Introduction

The simplified block diagram of the proposed FP Adder/Subtractor Unit is shown in Figure 5.30, where operands A and B are the single precision inputs and the result is the output, also given in the single precision format.

Figure 5.30 Simplified Block Diagram of FP Adder/Subtractor Unit

Like in the FP Multiplier Unit, the FP Adder/Subtractor Unit starts with

unpacking the sign, exponent and mantissa of both input operands for each to be dealt

with separately. The inputs are then swapped if necessary to assure that operand B

carries the smaller operand which is pre-normalized before the addition or subtraction

operation is performed, while operand A carries the larger operand whose exponent is

set as the resultant exponent. Zero detection is performed next where the zeros flag is

set appropriately to indicate if none, one or both operands is a zero. If both operands

were found to be non-zeros the addition or subtraction operation is performed by first

pre-normalizing the smaller operand, operand B, then finding the resultant sign and

effective operation, performing that effective operation, then post normalizing (using

LOD algorithm) and rounding (using REN technique) the resultant mantissa. Finally,

the resultant sign, exponent and mantissa are appended together to give the final result

in IEEE single precision format.

As mentioned earlier, in the FP Adder/Subtractor Unit the post normalization process is the bottleneck of the design. In Chapter 4, the different commonly used algorithms for implementing the Post Normalize Module within the Adder/Subtractor Unit were introduced, and it was shown that they mainly differ in when the leading '1' bit is detected. It is either detected after the result is calculated, as in the LOD algorithm, or predicted in parallel with the addition or subtraction operation, as in the LOP algorithm, or both ideas are used by integrating both paths and selecting the optimum one to perform the operation, as in the two-path and three-path algorithms.

The proposed FP Adder/Subtractor Unit is implemented using the LOD algorithm

that was deeply pipelined to achieve high maximum operating frequency. The LOD

algorithm was chosen for the following reasons:

i. Simple to design as opposed to the LOP algorithm which requires pre-encoding

of the inputs in order to predict the position of the leading one and then error

correction for possible error in the detected one location.

ii. Area efficient due to simple one path data flow as opposed to the two-path and

three-path implementations that both consume a very large area and require the

implementation of both the LOD and the LOP algorithms.

Figure 5.31 illustrates the detailed block diagram of the FP Adder/Subtractor Unit. In the following sub-sections, the architecture and operation of each FP Adder/Subtractor Unit module is explained thoroughly, followed by its behavioral simulation. Finally, the behavioral simulation of the entire FP Adder/Subtractor Unit is shown. All modules were written in VHDL using the FPGAdv 8.1 Mentor Tool and were behaviorally simulated to verify their correct operation using ModelSim 6.3a.

5.3.2 Unpack Module

This first module is responsible for two main tasks which are:

1. Unpacking the input operands: Like in the FP Multiplier Unit, the

Adder/Subtractor Unit accepts two normalized single precision floating point

numbers and deals with the signs, exponents and mantissas of these two

numbers separately as follows:

a. Sign Bits: Used along with required operation to determine the

resultant sign and the effective operation to be performed.

b. Exponents: The difference between the two exponents is calculated to

be used to pre-normalize the smaller operand in preparation for the

addition or subtraction operation.

c. Mantissas: The smaller mantissa is pre-normalized then the mantissas

are added or subtracted to give the final result.

2. Checking if the input operands are equal: The two operands are said to be

equal if both their exponents and mantissas compare as equal. An aequalb flag is then set, to be used in later modules for output zero detection when the inputs compare as equal and the effective operation to be performed is subtraction, as explained earlier in this chapter.

Figures 5.32 and 5.33 show the symbol and behavioral simulation of the Unpack Module respectively. As in the FP Multiplier Unit, this is the first block in the FP Adder/Subtractor Unit, thus it has both an input and an output register, causing the output to appear after two clock cycles. The simulation illustrates how each 32 bit operand is broken into sign, exponent and mantissa. The inputs A = (AAAAAAAA)H and B = (BBBBBBBB)H at 100ns were broken down into signa = '1', expa = (55)H, manta = (2AAAAA)H and signb = '1', expb = (77)H, mantb = (3BBBBB)H at 200ns. At 200ns, both inputs were equal, so the aequalb flag was set at 400ns.

Figure 5.32 Symbol of the Unpack Module in the FP Adder/Subtractor Unit

Figure 5.33 Behavioral Simulation of the Unpack Module in the FP Adder/Subtractor Unit

5.3.3 Swap Module

Addition or subtraction of two floating point numbers cannot be performed unless

both the input operands are aligned, i.e. have equal exponents. If the two exponents

are not equal, the smaller operand has to be pre-normalized first before the addition or

subtraction operation is performed. To align both input operands in order to prepare

them for the addition/subtraction operation, the following steps must be performed:

1. Determining the smaller input operand.

2. Setting the exponent of the larger operand as the resultant exponent.

3. Calculating the difference between the input exponents.

4. Shifting right the mantissa of the smaller input a number of times equal to the

difference between the two exponents.

The Swap Module is responsible for determining which input is smaller and storing it as operand B, to be pre-normalized later in the Pre-normalize Module. The exponents of the two input operands are therefore compared to determine the smaller operand. If the exponent of operand A is smaller than that of operand B, the contents of the two operands are swapped and a swap flag is set, to be taken into consideration when determining the effective operation later in the Adder Module. A sketch of this decision is shown below.
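A minimal VHDL sketch of the swap decision, assuming illustrative entity and port names:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Illustrative sketch of the swap decision; names are assumptions.
    entity swap_sketch is
      port (
        clk          : in  std_logic;
        a_in, b_in   : in  std_logic_vector(31 downto 0);
        a_out, b_out : out std_logic_vector(31 downto 0);
        swapflag     : out std_logic);
    end entity swap_sketch;

    architecture rtl of swap_sketch is
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          -- Compare the exponent fields; if A's exponent is the smaller one,
          -- exchange the operands so that the smaller operand always sits in B.
          if unsigned(a_in(30 downto 23)) < unsigned(b_in(30 downto 23)) then
            a_out <= b_in;  b_out <= a_in;  swapflag <= '1';
          else
            a_out <= a_in;  b_out <= b_in;  swapflag <= '0';
          end if;
        end if;
      end process;
    end architecture rtl;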

Figures 5.34 and 5.35 show the symbol and behavioral simulation of the Swap Module respectively. At 100ns, the exponent of operand A=(AA)H is smaller than that of operand B=(BB)H, causing a swap operation to take place and the swap flag to be set at 200ns. At 200ns, when the exponents of operands A and B are equal, no swap operation is performed and the swap flag is left unset.

Figure 5.34 Symbol of the Swap Module in the FP Adder/Subtractor Unit

Figure 5.35 Behavioral Simulation of the Swap Module in the FP Adder/Subtractor Unit

5.3.4 Zero Detect Module

The Zero Detect Module is responsible for three main tasks, which are:

1. Determining if one or both of the input operands is a zero and setting the zeros

flag accordingly.

2. In the case where both operands are non-zeros, the difference between their

exponents is calculated to be used in the pre-normalization of the mantissa of

operand B in the Pre-Normalize Module.

3. Setting the exponent of operand A as the resultant exponent.

The zero detection is performed here, as opposed to in the Unpack Module as in the FP Multiplier Unit, to simplify the zero detection operation. This makes use of the aequalb flag set in the Unpack Module when both inputs are equal, and of the fact that the Swap Module ensures that the smaller input is stored as operand B. Thus, by checking only the exponent of operand B and the aequalb flag, the appropriate zeros flag can be set, as summarized in Table 5.6. Again, as in the FP Multiplier Unit, checking only the exponent of the input operands serves to detect a denormalized input number as a zero and deal with it accordingly.

Table 5.6 Summary of Zero Detection Methodology

Operand B   Aequalb Flag   Zeros Flag   Comments
Non-zero    X              00           Both operands are non-zeros.
Zero        0              01           Only operand B is a zero.
Zero        1              11           Both operands are zeros.
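A minimal VHDL sketch of this zeros-flag logic follows; entity and signal names are assumptions, and a denormalized or zero operand B is recognized by its all-zero exponent field.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Illustrative sketch of the zeros-flag logic of Table 5.6; names are assumptions.
    entity zero_detect_sketch is
      port (
        clk        : in  std_logic;
        expa, expb : in  std_logic_vector(7 downto 0);
        aequalb    : in  std_logic;
        zeros      : out std_logic_vector(1 downto 0);
        diff       : out std_logic_vector(7 downto 0);
        expres     : out std_logic_vector(7 downto 0));
    end entity zero_detect_sketch;

    architecture rtl of zero_detect_sketch is
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          expres <= expa;                 -- exponent of A becomes the resultant exponent
          if unsigned(expb) = 0 then      -- B (the smaller operand) is zero or denormal
            diff <= (others => '0');
            if aequalb = '1' then
              zeros <= "11";              -- both operands are zeros
            else
              zeros <= "01";              -- only operand B is a zero
            end if;
          else
            zeros <= "00";                -- both operands are non-zeros
            diff  <= std_logic_vector(unsigned(expa) - unsigned(expb));
          end if;
        end if;
      end process;
    end architecture rtl;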

Figures 5.36 and 5.37 show the symbol and behavioral simulation of the Zero Detect Module respectively. At 100ns, both operands are non-zeros, so at 200ns the zeros flag is unset and the difference between the exponents is calculated: diff = expa - expb = (AA)H - (33)H = (77)H. At 300ns, both operands are zeros, so at 400ns the zeros flag is set and the difference is zero. At 400ns, only operand B is a zero, so at 500ns the zeros flag is updated accordingly but the difference is still zero. In all cases, the exponent of operand A is stored as the resultant exponent.

Figure 5.36 Symbol of the Zero Detect Module in the FP Adder/Subtractor Unit

Figure 5.37 Behavioral Simulation of the Zero Detect Module in the FP Adder/Subtractor Unit

5.3.5 Pre-normalize Module

The Pre-normalize Module is responsible for adjusting the two mantissas for the addition/subtraction operation by doing the following:

1. Fitting the mantissas into 28 bits, where a carry bit (C) and the implied bit (I) are added to the left of the mantissas, and guard (G), round (R) and sticky (S) bits are added to their right, to increase the precision of the addition/subtraction operation and to be used in the Round Module. The format of the 28 bit mantissa is shown in Figure 5.38.

C I 23 bit Mantissa G R S

Figure 5.38 Format of 28 bit Mantissa

2. Pre-normalizing the mantissa of operand B, the smaller operand, by shifting it

right a number of times equal to the difference between the two input operand

exponents that was calculated in the Zero Detect Module.

The shifting operation performed here is implemented using barrel shifting, written entirely in VHDL. Barrel shifting has the advantage of shifting the data by any number of bits in one operation, whereas if a simple shifter were used, shifting by n bit positions would require n clock cycles. This makes barrel shifting the most suitable choice for the shifting operations required in floating point arithmetic, especially the pre-normalization and post normalization operations in floating point addition/subtraction.

When considering the possible shift that may be needed in this module, theoretically the shift can be any number between 1 and 253, depending on the difference between the exponents of the input operands. Practically speaking though, any shift greater than 25 would lead to the dropping of all the mantissa bits. In such a case, the shifting operation produces an all '0's mantissa, with the exception of the sticky bit, which carries the logical OR of all the dropped bits and would thus always store '1'. So in the implementation, the shift operation is performed only if the difference between the two exponents is less than or equal to 25; otherwise, the mantissa is set to all '0's with a sticky bit of value '1'.
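A minimal VHDL sketch of this shift-with-sticky behavior, assuming illustrative names; the for loop describes the mux levels of a barrel shifter, so it carries no clock-cycle cost:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Illustrative sketch of the pre-normalize right shift with sticky-bit
    -- collection over the 28 bit mantissa (C I mantissa G R S); names are assumptions.
    entity prenorm_sketch is
      port (
        mant_in  : in  std_logic_vector(27 downto 0);
        diff     : in  std_logic_vector(7 downto 0);
        mant_out : out std_logic_vector(27 downto 0));
    end entity prenorm_sketch;

    architecture rtl of prenorm_sketch is
    begin
      process (mant_in, diff)
        variable sh     : integer range 0 to 255;
        variable m      : std_logic_vector(27 downto 0);
        variable sticky : std_logic;
      begin
        sh := to_integer(unsigned(diff));
        if sh > 25 then
          -- All mantissa bits would be dropped: result is all '0's except the
          -- sticky bit, which ORs together every dropped bit and is hence '1'.
          m := (others => '0');
          m(0) := '1';
        else
          m := mant_in;
          sticky := '0';
          for i in 1 to 25 loop            -- unrolled, mux-based (barrel) shift
            if i <= sh then
              sticky := sticky or m(0);    -- accumulate the dropped bits
              m := '0' & m(27 downto 1);
            end if;
          end loop;
          m(0) := m(0) or sticky;
        end if;
        mant_out <= m;
      end process;
    end architecture rtl;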

The Pre-normalize Module is designed to operate only if the zeros flag indicates

that both input operands are non-zeros. Otherwise no pre-normalization is necessary

and the inputs are just adjusted into 28 bits with the carry, guard, round and sticky bits

all being '0' and the implied bit set to '1'.

Figures 5.39 and 5.40 show the symbol and behavioral simulation of the Pre-

Normalize Module respectively. At 100ns the signal diff_i indicates that the exponent

difference (and required shift) is (3)H = 3. Thus at 300ns, mantissa B comes out

shifted right by three places. At 200ns when the exponent difference is (1E)H = 30,

mantissa B comes out at 400ns as all zeros except for the sticky bit.

Figure 5.39 Symbol of the Pre-normalize Module in the FP Adder/Subtractor Unit

Figure 5.40 Behavioral Simulation of the Pre-normalize Module in the FP Adder/Subtractor Unit

5.3.6 Add/Subtract Module

The Add/Subtract Module was broken up into three sub-modules to allow for deeper pipelining, thus increasing the overall speed of the design. These three sub-modules are described below.

5.3.6.1 Pre-Add/Subtract Sub-Module

This sub-module is responsible for determining the resultant sign and the effective operation to be performed. The effective operation is simply the XOR of the signs of both input operands and the required operation. The resultant sign is determined to be positive or negative by considering the operand signs, the required operation, the swap flag and an agrtb flag (a greater than b flag), as shown in Equation 5.1. The agrtb flag is set in this module if the mantissa of operand A is greater than the mantissa of operand B.

Resultant_Sign = ((sw_f AND rop) OR ((NOT sgna) AND (NOT agb_f) AND rop))
                 XOR ((sgna AND agb_f) OR (sgnb AND (NOT agb_f) AND (NOT rop)))     (5.1)

where sw_f, agb_f : the swap and agrtb flags respectively,
      rop         : the required operation,
      sgna, sgnb  : the signs of operands A and B respectively.
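Transcribed into VHDL, assuming the reconstruction of Equation 5.1 given above and illustrative entity and signal names, the core logic of this sub-module might look as follows:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Illustrative transcription of Equation 5.1; names are assumptions.
    entity sign_logic_sketch is
      port (
        sgna, sgnb, rop, sw_f, agb_f : in  std_logic;
        eff_op, res_sign             : out std_logic);
    end entity sign_logic_sketch;

    architecture rtl of sign_logic_sketch is
    begin
      -- Effective operation: XOR of both operand signs and the required operation.
      eff_op <= sgna xor sgnb xor rop;
      -- Resultant sign, following Equation 5.1 as reconstructed above.
      res_sign <= ((sw_f and rop) or ((not sgna) and (not agb_f) and rop))
                  xor ((sgna and agb_f) or (sgnb and (not agb_f) and (not rop)));
    end architecture rtl;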

Figure 5.41 shows the symbol of the Pre-Add/Subtract Sub-Module. Figures

5.42 and 5.43 show the behavioral simulation for the Pre-Add/Subtract Sub-Module

for the cases of mantissa B greater than mantissa A and vice versa respectively.

Figure 5.41 Symbol of the Pre-Add/Subtract Sub-Module in the FP Adder/Subtractor Unit

Figure 5.42 Behavioral Simulation of the Pre-Add/Subtract Sub-Module in the FP Adder/Subtractor Unit for Mantissa B greater than Mantissa A

Figure 5.43 Behavioral Simulation of the Pre-Add/Subtract Sub-Module in the FP Adder/Subtractor Unit for Mantissa A greater than Mantissa B

5.3.6.2 Zeros Flag Update Sub-Module

This sub-module updates the zeros flag to indicate a zero final result if the

aequalb flag is set and the effective operation was found to be subtraction.

Figures 5.44 and 5.45 show the symbol and behavioral simulation of the Zeros Flag Update Sub-Module respectively. When the aequalb flag was set and the effective operation was subtraction (eff_oper_i = '1') at 300ns, the zeros flag was accordingly updated to (3)H = (11)2 to indicate a final resultant of zero.

Figure 5.44 Symbol of the Zero-Update Sub-Module in the FP Adder/Subtractor Unit

Figure 5.45 Behavioral Simulation of the Zero-Update Sub-Module in the FP Adder/Subtractor Unit

5.3.6.3 Adder Sub-Module

This sub-module is responsible for performing the effective addition or subtraction operation to calculate the 28 bit resultant mantissa. The effective operation is performed only if the zeros flag indicates that both input operands are non-zeros. Otherwise, the output is given directly depending on the zeros flag, as summarized in Table 5.7.

Table 5.7 Resultant Mantissa based on Zeros Flag

Zeros Flag   Output Mantissa
00           Result of addition/subtraction
01           Mantissa A
11           Zero
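A minimal VHDL sketch of this gating, assuming illustrative names; for brevity, the ordering of the operands by the agrtb flag before subtraction is omitted:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Illustrative sketch of the effective operation gated by the zeros flag
    -- (Table 5.7); names are assumptions.
    entity adder_sketch is
      port (
        clk          : in  std_logic;
        eff_op       : in  std_logic;                    -- '0' add, '1' subtract
        zeros        : in  std_logic_vector(1 downto 0);
        manta, mantb : in  std_logic_vector(27 downto 0);
        mant_res     : out std_logic_vector(27 downto 0));
    end entity adder_sketch;

    architecture rtl of adder_sketch is
    begin
      process (clk)
      begin
        if rising_edge(clk) then
          case zeros is
            when "00" =>                                 -- both operands non-zero
              if eff_op = '0' then
                mant_res <= std_logic_vector(unsigned(manta) + unsigned(mantb));
              else
                mant_res <= std_logic_vector(unsigned(manta) - unsigned(mantb));
              end if;
            when "01"   => mant_res <= manta;            -- operand B is a zero
            when others => mant_res <= (others => '0');  -- both operands are zeros
          end case;
        end if;
      end process;
    end architecture rtl;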

Figures 5.46 and 5.47 show the symbol and behavioral simulation of the Adder Sub-Module respectively. The effective operation is carried out on mantissas A and B only when the zeros flag is "00". For example, mantissa A=(2222222)H is added to mantissa B=(4444444)H at 400ns to give (6666666)H at 500ns. On the other hand, when the zeros flag became "01" at 500ns and "11" at 600ns, no calculations were made and the resultant mantissa was given directly as mantissa A at 600ns and as zero at 700ns respectively.

Figure 5.46 Symbol of the Adder Sub-Module in the FP Adder/Subtractor Unit

Figure 5.47 Behavioral Simulation of the Adder Sub-Module in the FP Adder/Subtractor Unit

5.3.7 Post Normalize Module

The Post Normalize Module is responsible for normalizing the resultant mantissa

by detecting the most significant '1' bit in the resultant mantissa, shifting the mantissa

to set this detected '1' as the implied bit and finally adjusting the exponent in a manner

that compensates for this shift in the mantissa.

The Post Normalize Module is implemented using the LOD algorithm. Although floating point adders implemented using the LOD algorithm are not generally the fastest [7], this was overcome by two means. The first was using barrel shifting to implement the shift operations in this module. The second was breaking up the Post Normalize Module into several sub-modules to allow for its deep pipelining, which proved to significantly improve the design speed. All these sub-modules were designed to operate only if the zeros flag indicated that both input operands were non-zeros, as otherwise no post normalization would be necessary. The LOD Post Normalize Sub-Modules are described below.

5.3.7.1 Zeros Count Sub-Module

This sub-module is the first step in the LOD algorithm. It is responsible for detecting the most significant '1' in the resultant mantissa. Several techniques have been introduced by designers to detect the most significant '1' bit [7]. In the proposed design, the position of the most significant '1' bit is found by simply counting the number of '0' bits, starting from the most significant end of the mantissa, before the first '1' is detected. The number of counted '0's is used in the following sub-modules to post-normalize the result, by shifting the mantissa left by that amount and subtracting it from the exponent to compensate for this shift.
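A minimal VHDL sketch of the count, assuming illustrative names; per the simulation below, the count starts at the implied-bit position, so a '1' in either the carry or the implied bit yields a count of zero, and the gating by the zeros flag is omitted for brevity:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Illustrative leading-zero count over the 28 bit resultant mantissa.
    entity zeros_count_sketch is
      port (
        mant      : in  std_logic_vector(27 downto 0);
        num_zeros : out std_logic_vector(4 downto 0));
    end entity zeros_count_sketch;

    architecture rtl of zeros_count_sketch is
    begin
      process (mant)
        variable nz : integer range 0 to 27;
      begin
        nz := 0;
        if mant(27) = '0' then          -- no carry out
          for i in 26 downto 0 loop     -- scan from the implied-bit position
            exit when mant(i) = '1';
            nz := nz + 1;
          end loop;
        end if;
        num_zeros <= std_logic_vector(to_unsigned(nz, 5));
      end process;
    end architecture rtl;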

Figures 5.48 and 5.49 show the symbol and behavioral simulation of the Zeros Count Sub-Module respectively. At 100ns, where the zeros flag was unset, the input was (8888888)H, indicating that the carry bit was '1', so the number of zeros came out at 200ns as (00)H. At 200ns, the input was (4888888)H, so the carry bit is a '0' followed by a '1' in the position of the implied bit. Technically, the number is already in normalized form, so this case also outputs the number of zeros at 300ns as (00)H. At 300ns, the input is (2888888)H, so the number of zeros comes out at 400ns as (01)H. At 500ns and 600ns, the zeros flag indicates that one and both inputs are zeros respectively. No shift is necessary in either of these cases, thus the number of zeros is zero, as given at 600ns and 700ns.

Figure 5.48 Symbol of Zeros Count Sub-Module in the FP Adder/Subtractor Unit

Figure 5.49 Behavioral Simulation of Zeros Count Sub-Module in the FP Adder/Subtractor Unit

5.3.7.2 Exponent Adjust Sub-Module

This sub-module is responsible for adjusting the resultant exponent in coordination with the expected shift of the resultant mantissa. There are three possible scenarios:

1. The resultant mantissa carry bit is '1'. It is considered to be the implied bit, thus the mantissa must be shifted one place to the right and the exponent is incremented by one.

2. The resultant carry bit is '0' followed directly by a '1'. The mantissa is already in normalized form, thus the exponent is left unchanged.

3. If neither of the above scenarios is satisfied, then the number of zeros to the left of the resultant mantissa (nz), calculated in the Zeros Count Sub-Module, is subtracted from the resultant exponent to account for shifting the resultant mantissa to the left "nz" times.

A sketch of these three cases follows.
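A minimal VHDL sketch of the three scenarios, assuming illustrative names; exponent underflow handling is omitted for brevity:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Illustrative sketch of the exponent-adjust scenarios; names are assumptions.
    entity exp_adjust_sketch is
      port (
        mant    : in  std_logic_vector(27 downto 0);  -- resultant mantissa (C I f G R S)
        nz      : in  std_logic_vector(4 downto 0);   -- from the Zeros Count Sub-Module
        exp_in  : in  std_logic_vector(7 downto 0);
        exp_out : out std_logic_vector(7 downto 0));
    end entity exp_adjust_sketch;

    architecture rtl of exp_adjust_sketch is
    begin
      process (mant, nz, exp_in)
      begin
        if mant(27) = '1' then                             -- carry out: shift right once
          exp_out <= std_logic_vector(unsigned(exp_in) + 1);
        elsif mant(26) = '1' then                          -- already normalized
          exp_out <= exp_in;
        else                                               -- left shift by nz places
          exp_out <= std_logic_vector(unsigned(exp_in) - resize(unsigned(nz), 8));
        end if;
      end process;
    end architecture rtl;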

Figures 5.50 and 5.51 show the symbol and behavioral simulation of the Exponent Adjust Sub-Module respectively. At 100ns, the mantissa has a carry of '1', so the exponent is incremented from (29)H to (2A)H, as shown at 200ns. At 200ns, the carry bit is a '0' and the bit in the implied bit position is a '1', so the mantissa is already post normalized and the exponent comes out at 300ns unchanged. At 300ns and 400ns, the number of zeros is 1 and 7, so the exponent comes out at 400ns and 500ns decremented to (28)H and (22)H respectively.

Figure 5.50 Symbol of the Exponent Adjust Sub-Module in the FP Adder/Subtractor Unit

Figure 5.51 Behavioral Simulation of the Exponent Adjust Sub-Module in the FP Adder/Subtractor Unit

5.3.7.3 Mantissa Normalize Sub-Module

This sub-module is responsible for performing the final step of the normalization process, which is the shifting of the resultant mantissa and the dropping of the implied bit. The amount of shift depends on the position of the detected most significant '1', determined in the Zeros Count Sub-Module, as follows:

1. If the resultant mantissa carry bit is '1', the mantissa is shifted to the right by one place to set the carry bit as the implied bit. The dropped bit, which is the least significant bit, is passed to the Round Module to be used to update the sticky bit.

2. If the resultant carry bit is '0' followed directly by a '1', no shifting of the mantissa is required, as it is already in normalized form.

3. If neither of the above scenarios is satisfied, the resultant mantissa is shifted left a number of times equal to the number of zeros before the most significant '1'.

A sketch of this logic is given below.
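A minimal VHDL sketch of the final normalization shift, assuming illustrative names:

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Illustrative sketch of the mantissa normalization; names are assumptions.
    entity mant_norm_sketch is
      port (
        mant     : in  std_logic_vector(27 downto 0);
        nz       : in  std_logic_vector(4 downto 0);
        mant_out : out std_logic_vector(27 downto 0);
        dropbit  : out std_logic);                 -- passed on to update the sticky bit
    end entity mant_norm_sketch;

    architecture rtl of mant_norm_sketch is
    begin
      process (mant, nz)
      begin
        dropbit <= '0';
        if mant(27) = '1' then                     -- carry out: shift right one place
          mant_out <= '0' & mant(27 downto 1);
          dropbit  <= mant(0);                     -- dropped LSB, for the Round Module
        elsif mant(26) = '1' then                  -- already normalized
          mant_out <= mant;
        else                                       -- shift left by the zero count
          mant_out <= std_logic_vector(
            shift_left(unsigned(mant), to_integer(unsigned(nz))));
        end if;
      end process;
    end architecture rtl;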

Figures 5.52 and 5.53 show the symbol and behavioral simulation of the Mantissa Normalize Sub-Module respectively. At 100ns and 200ns, the carry bit of the resultant mantissa is '1', so the normalized mantissa appears at 200ns and 300ns shifted one bit to the right, with the drop bit being set to '1' only at 300ns to account for the '1' dropped from the mantissa (4444445)H at 200ns. A shift left by the number of zeros is performed on the resultant mantissa input at 300ns, 400ns and 500ns.

Figure 5.52 Symbol of Mantissa Normalize Sub-Module in the FP Adder/Subtractor Unit

Figure 5.53 Behavioral Simulation of Mantissa Normalize Sub-Module in the FP Adder/Subtractor Unit

5.3.8 Round Module

Like in the FP Multiplier Unit, the Round Module here is responsible for rounding

the 26 bit resultant mantissa to 23 bits using the REN technique and then giving the

final result of the addition or subtraction operation in the IEEE single precision

format. This module consists of two sub-modules:

5.3.8.1 REN Sub-Module

This sub-module is responsible for rounding the resultant mantissa using the REN mode. The REN mode inspects the guard, round and sticky bits to determine whether the original 23 bit resultant mantissa is to be incremented by one (Increment), rounded to the nearest even number (Tie to Nearest Even), or left unchanged, which is equivalent to simply truncating the guard, round and sticky bits (Truncate). All these possible cases were summarized earlier in Table 5.2. A compact sketch of this decision is given below.
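The REN decision can be expressed compactly from the guard, round and sticky bits together with the mantissa LSB, which settles ties toward even. The sketch below assumes this common formulation and illustrative names; it is intended to cover the Increment, Tie to Nearest Even and Truncate cases of Table 5.2, which is not reproduced here.

    library ieee;
    use ieee.std_logic_1164.all;
    use ieee.numeric_std.all;

    -- Illustrative REN rounding sketch; a common formulation of the Table 5.2 cases.
    entity ren_sketch is
      port (
        mant23   : in  std_logic_vector(22 downto 0);   -- unrounded 23 bit mantissa
        g, r, s  : in  std_logic;                       -- guard, round and sticky bits
        mant_rnd : out std_logic_vector(23 downto 0));  -- extra MSB catches the rounding carry
    end entity ren_sketch;

    architecture rtl of ren_sketch is
      signal round_up : std_logic;
    begin
      -- Increment when G = 1 and either R or S is set; on an exact tie
      -- (G = 1, R = S = 0) round up only if the LSB is '1' (tie to nearest even).
      round_up <= g and (r or s or mant23(0));
      mant_rnd <= std_logic_vector(unsigned('0' & mant23) + 1) when round_up = '1'
                  else '0' & mant23;
    end architecture rtl;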

Figures 5.54 and 5.55 show the symbol and behavioral simulation of the REN Sub-Module respectively. In the shown simulation, the mantissa is even until 700ns, after which it becomes odd. The test bench used is similar to that used for the REN Sub-Module of the FP Multiplier Unit, which was summarized in Tables 5.3 and 5.4 for even and odd mantissas respectively.

Figure 5.54 Symbol of the REN Sub-Module in the FP Adder/Subtractor Unit

Figure 5.55 Behavioral Simulation of the REN Sub-Module in the FP Adder/Subtractor Unit

5.3.8.2 Final Sub-Module

This is the final block in the FP Adder/Subtractor Unit. It is responsible for appending the sign bit, the 8 bit exponent and the 23 bit mantissa together to give the single precision floating point result, as sketched below.
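A minimal VHDL sketch of this packing step; names are assumptions:

    library ieee;
    use ieee.std_logic_1164.all;

    -- Illustrative sketch of the final packing step; names are assumptions.
    entity final_sketch is
      port (
        sign     : in  std_logic;
        exponent : in  std_logic_vector(7 downto 0);
        mantissa : in  std_logic_vector(22 downto 0);
        result   : out std_logic_vector(31 downto 0));
    end entity final_sketch;

    architecture rtl of final_sketch is
    begin
      result <= sign & exponent & mantissa;   -- 1 + 8 + 23 = 32 bits
    end architecture rtl;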

Figures 5.56 and 5.57 show the symbol and behavioral simulation of the Final Sub-Module respectively. A '1' sign bit, a (22)H exponent and a (222222)H mantissa at 100ns are appended to give the 32 bit final result (91222222)H at 200ns.

Figure 5.56 Symbol of the Final Sub-Module in the FP Adder/Subtractor Unit.

Figure 5.57 Behavioral Simulation of the Final Sub-Module in the FP Adder/Subtractor Unit.

5.3.9 The FP Adder/Subtractor Unit Behavioral Simulation

After designing each of the FP Adder/Subtractor Unit modules and simulating

them to assure their correct functionality, they were all put together and the complete

design was behaviorally simulated.

Figure 5.58 shows the symbol of the FP Adder/Subtractor Unit. A snapshot of

the FP Adder/Subtractor Unit behavioral simulation is shown in Figure 5.59 for the

test bench in Table 5.8. For example, the output of the operation 12+12 =

(41400000)H + (41400000)H at 300ns appears at 2000ns to be 24=(41C00000)H.

Figure 5.58 Symbol of the FP Adder/Subtractor Unit

Figure 5.59 Behavioral Simulation of the FP Adder/Subtractor Unit

Table 5.8 Test Bench of the FP Adder/Subtractor Unit Behavioral Simulation

Clk (ns)  Operand A (Hex)  Decimal  +/-  Operand B (Hex)  Decimal  Clk (ns)  Result (Hex)  Decimal
100       41400000         12       -    3F800000         1        1800      41300000      11
200       41400000         12       +    3F800000         1        1900      41500000      13
300       41400000         12       +    41400000         12       2000      41C00000      24
400       447A0333         1000.05  -    42B0051F         88.01    2100      4464028F      912.04
500       42B0051F         88.01    -    42B0051F         88.01    2200      00000000      0
600       C20551EC         -33.33   +    BF2B851F         -0.67    2300      C2080000      -34
700       00000000         0        +    BF99999A         -1.2     2400      BF99999A      -1.2
800       00000000         0        -    BF99999A         -1.2     2500      3F99999A      1.2

Chapter 6

Design Implementation & Results

The proposed FP Adder/Subtractor and Multiplier Units discussed in the previous chapter are able to operate at high speeds, surpassing most existing designs, as will be illustrated in this chapter. These designs are the result of several optimizations performed in order to achieve the maximum possible operating speed.

In this chapter, a brief summary of the optimizations performed to reach the final proposed FP Adder/Subtractor and Multiplier Units is given. The implementation results are then illustrated, discussed and compared with results from the previous work of others. Finally, the power analysis and timing simulations are illustrated for both designs. All synthesis and implementation results were generated using the Xilinx ISE 9.2i Tools.

6.1 Design Optimization

Design optimizations are essentially used by designers in order to reach their

required system specifications. For both the FP Adder/Subtractor and Multiplier

Units, several optimizations were performed in order to reach maximum operating

speed.

6.1.1 Optimization Techniques

Initially, the maximum operating speed of each module was found from the

synthesis reports for both the FP Adder/Subtractor and Multiplier Units. Although the

synthesis reports are not as accurate as the static timing reports given post routing,

they still give a good indication of the module’s performance with the advantage of

being generated much faster than static timing reports and thus are sufficient to be

used at the initial phase of optimizations.

Next, using the generated synthesis reports, critical modules were iteratively identified and optimized. Basically, modules were optimized by either breaking them down into smaller pipelined modules or editing the VHDL code for better synthesis. The synthesis options provided by the Xilinx ISE Tool were also set carefully to achieve the best possible performance of the entire design for the target optimization goal of this work, which is speed [28].

Finally, when no further optimizations were possible, the FP Adder/Subtractor and Multiplier Units were implemented, and the speeds of the separate modules were generated from the static timing report. Critical modules were then identified and optimized wherever possible, in a manner very similar to that just explained. Like the synthesis tool, the Xilinx implementation tool has several options that can dramatically change the design performance; thus the implementation options also have to be set carefully to achieve the best possible performance [20].

The Xilinx implementation tool also allows user defined timing constraints, which help designers reach timing closure in high performance applications [29]. The use of user defined timing constraints also ensures that no timing violations, such as period, setup or hold violations, occur in the timing simulation. The fundamental timing constraints in the Xilinx Tools are the following:

• Period Constraint: Sets the minimum clock period of the design.

• Offset In Constraint: The offset in constraint is illustrated in Figure 6.1. It sets the setup time, referred to in the Xilinx tools as pad to setup, which specifies the maximum time allowed for data to enter the chip, travel through logic and routing, and arrive at a synchronous element (flip-flop, latch, or RAM) whose pin has a setup requirement before a clocking signal. Thus, the offset in constraint is used to make sure that the input data arrives at an appropriate time. The hold time can also be set using the offset in constraint by specifying the time during which the data is valid.

Figure 6.1 Offset In Time Constraint

While appropriately setting the Xilinx timing constraints helps designers reach their required timing specifications, very tight timing constraints can yield results far from those desired. In this work, the maximum speed reported by the synthesis tool was used as a guide to set realistic period and offset in constraints.

In this work, implementation options along with user defined timing

constraints were iteratively used to reach best possible performance.

In the next sections, the optimization steps performed for the FP Adder/Subtractor and Multiplier Units are summarized.

6.1.2 FP Adder/Subtractor Optimization

Initially, the FP Adder/Subtractor Unit was implemented using three pipeline stages. Each stage performed one of the basic floating point addition/subtraction operations, namely pre-normalizing, adding or subtracting, and finally post normalizing. The speed of this FP Adder/Subtractor Unit was 50 MHz when synthesized to Virtex5. In order to increase the operating speed, further pipelining and optimization of the design were necessary. This was performed by breaking down each of the three modules into smaller, simpler sub-modules with registers inserted between them. For example, the module responsible for addition or subtraction was sub-divided into three sub-modules: the Pre-Add/Subtract Sub-Module, the Zeros Flag Update Sub-Module and the Adder Sub-Module. Each of these new sub-modules was synthesized to determine its maximum operating speed and hence whether further pipelining was necessary.

After applying this technique to all the FP Adder/Subtractor Unit modules and sub-modules, until further pipelining had no positive effect on the overall speed, the FP Adder/Subtractor Unit consisted of twelve modules and its maximum operating frequency was just above 255 MHz when synthesized to Virtex5. Further optimization within each module was made wherever possible, mostly by simplifying the written code, which sometimes led to an increase in the module's speed.

Next, the design was implemented, and the implementation results, given by the static timing analysis report, showed that the FP Adder/Subtractor operated at around 340MHz. Implementation of all modules identified the Pre-Normalize Module as the critical module. In order to optimize the Pre-Normalize Module, a pipeline stage (a register at the module's output) was added, which raised the speed of the FP Adder/Subtractor Unit to 373MHz.

Finally, Xilinx options along with user defined timing constraints were

iteratively adjusted until optimal results were attained. The implementation results of

this final design over several FPGA platforms are summarized in the Implementation

Results section later in this chapter.

6.1.3 FP Multiplier Optimization

When designing the FP Multiplier Unit, the experience gained from optimizing the FP Adder/Subtractor Unit was put to use. The first implementation of the FP Multiplier used the Simple Multiplier to implement the 24 by 24 unsigned multiplication operation. It consisted of eight pipelined modules and had a maximum operating speed of 217MHz when synthesized to Virtex5. Synthesizing all modules indicated that there were two critical modules, the Add Exponent and the Multiplier Modules, operating at 350 and 207MHz respectively.

So initially, in order to increase the overall speed, a register was added after the Add Exponent Module. The more critical issue then at hand was the optimization of the Multiplier Module, which in turn led to the evolution of the proposed novel Block Multiplication algorithm. At first, the 24 by 24 multiplier was broken down into three parallel 24 by 8 multipliers. When integrated into the FP Multiplier Unit, the design's maximum operating speed was found to be 316MHz when synthesized to Virtex5. In order to further improve the FP Multiplier Unit's performance, the block multiplication algorithm was used: the multiplications were further broken down into nine 8 by 8 multiplications performed in parallel, which led to an increase of the FP Multiplier Unit's maximum speed to around 450MHz when synthesized to Virtex5.

Table 6.1 compares the performance of the FP Multiplier Unit when using the proposed Block Multiplier Module as opposed to using the Simple Multiplier Module [30]. The table summarizes the synthesis results for speed and area when the designs are synthesized to several Virtex FPGA platforms. The comparison shows that when using the Block Multiplier, the FP Multiplier Unit is able to operate at speeds almost double those attained when using the Simple Multiplier. This comes at the price of occupied area: the design using the Block Multiplier occupies almost double the area occupied by the design using the Simple Multiplier Module.

Table 6.1 Synthesis Results for the FP Multiplier Unit using Proposed Block Multiplier vs. using Simple Multiplier

                          Proposed Multiplier          Simple Multiplier
                          Speed (MHz)  Area (Slices)   Speed (MHz)  Area (Slices)
Virtex2p Xc2vp7ff896 -7   296          1038            181          412
Virtex4 Xc4vfx100 -12     461          945             106          343
Virtex5 Xc5vlx110 -3      450          592             217          578

Finally the design was implemented and implementation results, given by the

static timing analysis report, showed that the FP Multiplier Unit operated at around

360MHz. Then like in the FP Adder/Subtractor Unit design, Xilinx options along with

user defined timing constraints were iteratively adjusted until optimal results were

attained. The implementation results of this final design over several FPGA platforms

are summarized in the Implementation Results section next.

6.2 Implementation Results

The Xilinx implementation tool was used along with user defined period, setup and hold timing constraints to determine the maximum clock frequency, by iteratively tightening the timing constraints until timing closure could no longer be achieved.

The FP Adder/Subtractor and Multiplier Units implementation results, as generated from the static timing report, are summarized in Tables 6.2 and 6.3 for several FPGA platforms. It is noticed that Virtex5 uses almost half the number of slices used by the Virtex2 and Virtex4 FPGAs. This is due to the fact that a Virtex5 slice has almost double the resources of a Virtex2 or Virtex4 slice [20].

Table 6.2 Summary of Proposed FP Adder/Subtractor Implementation Results

                          FP Adder
                          Speed (MHz)  Slices  Utilization
Virtex2p Xc2vp7ff896 -7   325          1331    21%
Virtex4 Xc4vfx100 -12     401          1533    3%
Virtex5 Xc5vlx110 -3      442          625     3%

Table 6.3 Summary of Proposed FP Multiplier Implementation Results

                          FP Multiplier
                          Speed (MHz)  Slices  Utilization
Virtex2p Xc2vp7ff896 -7   339          1029    20%
Virtex4 Xc4vfx100 -12     465          1467    3%
Virtex5 Xc5vlx110 -3      472          702     4%

6.3 Comparison with Previous Work

The implementation results of the proposed FP Adder/Subtractor and Multiplier Units are compared to the results of other designers. As mentioned earlier in Chapter 2, the most relevant work on the design of high speed FPUs is that of Govindu, Hemmert and Karlstrom [6, 9, 10 and 11]. The basic specifications of their FPUs, and their comparison to the specifications of the proposed FPU, can be summarized as follows:
Denormalized numbers, infinity or NaN:

Generally, denormalized numbers, infinity and NaN are commonly excluded from high performance systems due to their rare occurrence and the excessive hardware they require [27]. The FPUs implemented by Govindu [6] and Karlstrom [10, 11] did not support denormalized numbers, infinity or NaN. As for the FPU implemented by Hemmert [9], it was fully IEEE compliant with respect to both NaN and infinity, but not denormalized numbers.

As mentioned earlier in Chapter 5, the FPUs implemented in this thesis deal with denormalized numbers as zero values and signal infinity values as overflow, yet do not consider NaN values.

Rounding Modes:

Generally, the REN mode is the most common when implementing hardware and software arithmetic operations, although it is the mode requiring the most hardware to implement. On the other hand, truncation is the simplest rounding mode to implement, since it only involves truncating the extra bits in the resultant mantissa in order to store it in the assigned storage. Both Govindu [6] and Karlstrom [10, 11] implemented their FPUs using the REN and truncation rounding modes. Hemmert [9] did not give information about the rounding modes used in their design.

As referred to earlier in Chapter 5, the FPUs implemented in this thesis used

the REN rounding mode.

Parallelism:

Parallelism in a design serves to improve its overall latency. Both Govindu [6] and Karlstrom [10, 11] made use of parallelism when implementing their FPUs. They both had at least two parallel paths, one for dealing with the exponents and the other for dealing with the mantissas.

On the other hand, since this work was more concerned with speed than latency, each of the FP Adder/Subtractor and Multiplier Units has a single pipelined path.

Considering the design specifications of Govindu, Hemmert and Karlstrom [6, 9, 10 and 11] as compared to those of the proposed design, it can be concluded that their design specifications are very similar to those of the proposed design. Hence, comparison of the proposed design to the work of Govindu, Hemmert and Karlstrom [6, 9, 10 and 11] is considered fair.

6.3.1 Comparison of FP Adder/Subtractor Unit Implementation Results

When comparing the results from this work to those of Govindu, Hemmert

and Karlstrom [6, 9, 10 and 11], comparison of architectures used and optimization

techniques is relevant.

Like in this work, Govindu [6] implemented the post normalization operation using the standard LOD algorithm. He described most of his FP Adder/Subtractor Unit in VHDL, but also made use of Xilinx Library Cores to describe the adders in both the mantissa addition module and the rounding module within the FP Adder/Subtractor Unit. Govindu [6] used the Xilinx Tools for the implementation of his design to Virtex2 and Virtex4 FPGAs. In order to optimize his design, he used deep pipelining and the synthesis options given by the Xilinx tool. His 19 stage pipelined design was able to operate at 250 MHz when implemented on Virtex2pro.

Karlstrom [10, 11] also implemented the post normalization operation using the standard LOD algorithm. He described his proposed FP Adder/Subtractor Unit in Verilog. In order to optimize the post normalize module, he used a hardware intensive approach that inspected four bits of the mantissa at a time, reasoning that inspecting four bits at a time allows for better mapping to the four-input LUTs of the Virtex4 FPGA. In order to further optimize his design, he made sure the adder/subtractor was implemented using only one LUT per bit. The use of this technique led to a design that operated at 288 and 361MHz, with a latency of only 10 clock cycles, for Virtex2pro and Virtex4 respectively. In his later work, he was able to push the performance of his design to 377MHz on Virtex4. Strangely enough, his later implementation's speed fell to 278MHz on Virtex2pro.

Hemmert [9] described their design using JHDL. They did not give any details about their design architecture or how they optimized it. They did, however, compare their work to the floating point units distributed by Xilinx, and were superior to them in speed, area and latency. Their FP Adder/Subtractor Unit operated at 298MHz and 356MHz for Virtex2pro and Virtex4 respectively, with a latency of 10 clock cycles.

The comparison between the proposed FP Adder/Subtractor Unit and previous

ones is summarized in Table 6.4 and is arranged chronologically.

Table 6.4 Speed Comparison between Proposed and other FP Adders

FP Adder Designs (MHz)
              Proposed [30]  Karlstrom [11]  Karlstrom [10]  Hemmert [9]  Govindu [6]
Virtex2p -7   325            278             288             298          250
Virtex4 -12   401            377             361             356          NA
Virtex5 -12   442            419             NA              NA           NA

*NA: Not Available

6.3.2 Comparison of FP Multiplier Unit Implementation Results

In order to optimize the multiplier module within the FP Multiplier Unit, Govindu [6] constructed the 24 by 24 unsigned multiplier using Xilinx cores and made use of the synthesis options to further optimize his design, while Karlstrom [10, 11] used four of Virtex4's DSP48 blocks to construct a 35 by 35 multiplier. On the other hand, Hemmert [9] did not give any details whatsoever about how they implemented or optimized their FP Multiplier Unit.

The comparison between the proposed FP Multiplier Unit and previous ones is summarized in Table 6.5.

Table 6.5 Speed Comparison between Proposed and other FP Multipliers

FP Multiplier Designs (MHz)
              Proposed [30]  Karlstrom [11]  Karlstrom [10]  Hemmert [9]  Govindu [6]
Virtex2p -7   339            NA              NA              290          250
Virtex4 -12   465            440             450             317          NA
Virtex5 -12   469            500             NA              NA           NA

*NA: Not Available

6.4 Power Analysis

Xilinx provides an accurate post implementation power analysis and estimation tool, XPower Analyzer, which gives designers an accurate view of the power breakdown based on the exact resource utilization information extracted from the FPGA design implementation [20].

The XPower Analyzer integrated in the Xilinx 9.2i version is an early access version and is reported by Xilinx to give incorrect results [20]. So, for accurate power analysis, a newer version of XPower Analyzer, integrated in Xilinx 11.1, was used. This version of Xilinx only supports Virtex4 and newer FPGAs.

Figures 6.2 and 6.3 illustrate the power consumption with respect to frequency of the proposed FP Adder/Subtractor and Multiplier Units respectively, as calculated by the XPower Analyzer Tool. Power consumption is analyzed assuming the default settings, which assume an input/output toggle rate of 100%, i.e. both inputs and outputs are assumed to toggle with every clock cycle.

[Plot: FP Adder Unit Power Consumption, Power (mW) vs. Frequency (MHz), for Virtex5 (Vcc = 1V, 65nm) and Virtex4 (Vcc = 1.2V, 90nm)]

Figure 6.2 FP Adder/Subtractor Unit Power Consumption vs. Frequency

[Plot: FP Multiplier Unit Power Consumption, Power (mW) vs. Frequency (MHz), for Virtex5 (Vcc = 1V, 65nm) and Virtex4 (Vcc = 1.2V, 90nm)]

Figure 6.3 FP Multiplier Unit Power Consumption vs. Frequency

6.5 Post Route Simulation

Post route simulation is performed using the Xilinx ISE Simulator. The

simulator uses the post place and route simulation model along with the Standard

Delay Format (SDF) file containing true delay information of the design to simulate a

user’s test bench.

6.5.1 FP Adder/Subtractor Unit Post Route Simulation

The post route timing simulation of the FP Adder/Subtractor operating at

312.5MHz (3.2ns) is shown in Figures 6.4 and 6.5 for the same test bench given in

Chapter 5. For example, the output of the operation 1000.05 - 88.01 = (447A0333)H

- (42B0051F)H introduced at 172ns appears at 226.4ns to be 912.04= (4464028F)H.

This is equivalent to a latency of 17 clock cycles.

Figure 6.4 Input Test Bench of the FP Adder/Subtractor at 3.2ns

Figure 6.5 Output of the FP Adder/Subtractor Test Bench

6.5.2 FP Multiplier Unit Post Route Simulation

The post route timing simulation of the FP Multiplier operating at 400MHz

(2.5ns) is shown in Figures 6.6 and 6.7 for the same test bench given in Chapter 5. For

example, the output of the operation 25.8 * -7.4 = (41CE6666)H *(C0ECCCCD)H

introduced at 431.2ns appears at 498.7ns to be -190.92 = (C33EEB85)H. This is

equivalent to a latency of 27 clock cycles.

Figure 6.6 Input Test Bench of the FP Multiplier at 2.5ns

Figure 6.7 Output of the FP Multiplier Test Bench

Chapter 7

Conclusions and Future Work

7.1 Conclusions

Digital signal processing functions are commonly implemented on two types of programmable platforms: FPGAs and DSPs, the latter being specialized microprocessors designed to handle digital signal processing functions. Although in the past the use of DSPs was more common, with the development of FPGA and DSP technology in recent years there are more and more applications combining the two in digital signal processing systems, especially when high performance, flexibility, fast time to market, reliability and maintainability are required [31-33]. And since FP addition and multiplication are very common in digital signal processing, such as in the Fast Fourier Transform (FFT), the importance of designing high speed FP adder/subtractors and multipliers for FPGAs is evident.

In this work, the design and implementation of pipelined IEEE compliant FP

Multiplier and Adder/Subtractor Units has been presented. Both units were written

entirely in VHDL to allow their implementation on any FPGA.

A novel algorithm, referred to as "Block Multiplication", is proposed to optimize the multiplication operation in the FP Multiplier. Block Multiplication speeds up the 24 by 24 integer multiplication involved in the FP Multiplier by dividing it into several smaller multiplications performed in parallel. The performance of the FP Adder/Subtractor is optimized by deeply pipelining the post normalize module, which is implemented using the LOD architecture.

Both designs are able to operate at high operating frequencies, exceeding 320MHz for Virtex2Pro and 400MHz for Virtex4 and Virtex5 FPGAs, whilst giving an output with every clock cycle. Specifically, the proposed FP Adder/Subtractor operates at 442 MHz while the FP Multiplier operates at 469 MHz when implemented on the Virtex-5 FPGA. The power consumption of both the FP Adder/Subtractor and Multiplier Units was analyzed using XPower Analyzer. Finally, post route simulation was performed to verify the operation of both the FP Adder/Subtractor and Multiplier Units after routing.

7.2 Future Work

In the future, the latency of the designs could be reduced by making use of parallel processing of the exponent and mantissa. It would be interesting to explore whether slicing the multiplication process into 6 by 6 multiplications would improve speed, especially since 6 by 6 multiplication is expected to map well to the advanced 6-LUT based fabric of Virtex-5 and newer FPGAs. Accordingly, it would also be interesting to implement the designs on the newest Virtex-6 FPGA.

References

[1] A. Amarica, M. Vladutiu , L. Prodan, M. Udrescu and O. Boncalo, “Design of

addition and multiplication units for high performance interval arithmetic

processor,” in Proceedings of the International Conference on Computer Design,

2007, pp. 1-4.

[2] L. Louca, T. A. Cook and W. H. Johnson, “Implementation of IEEE Single

Precision Floating Point Addition and Multiplication on FPGAs,” FPGAs for

Custom Computing, 1996.

[3] M. Reaz, S. Islam and M. Suliman, "Pipeline floating point ALU design using VHDL," in Proceedings of Semiconductor Electronics, 2002, pp. 204-208.

[4] J. Liang, R. Tessier and O. Mencer, " Floating point unit generation and

evaluation for FPGAs," in Field-Programmable Custom Computing Machines,

2003, pp. 185–194.

[5] Shamsiah Suhaili and Othman Sidek, “Design and Implementation of

Reconfigurable ALU on FPGA”, in the 3rd International Conference on Electrical

and Computer Engineering, 2004.

[6] G. Govindu, L. Zhuo, S. Choi and V. Prasanna, "Analysis of high performance floating point arithmetic on FPGAs," in Proceedings of the 18th International Parallel and Distributed Processing Symposium, April 2004, pp. 149-156.

[7] Ali Malik, “Design Tradeoff Analysis of Floating-Point Adder in FPGAs”, M.Sc.

Thesis, University of Saskatchewan, Canada, 2005.

[8] Claudio Brunelli and Jari Nurmi, “ Design and Verification of VHDL Model of a

Floating Point Unit for RISC Microprocessor,” in International Symposium on

System-on-chip, November 2006, pp. 1-4.

[9] K. Scott Hemmert and Keith D. Underwood, "Open Source High Performance Floating-Point Modules," in Proceedings of the 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines (FCCM'06), April 2006, pp. 349-350.

[10] P. Karlstrom, A. Ehliar and D. Liu, "High performance, low latency FPGA based Floating Point Adder and Multiplier Units in a Virtex 4," in the 24th Norchip Conference, November 2006, pp. 31-34.

[11] P. Karlstrom, A. Ehliar and D. Liu, " High performance, low latency FPGA based

Floating Point Adder and Multiplier Units in a Virtex 4," in Computers and

Digital Techniques, volume 2, issue 4, 2008, pp. 305-313.

[12] S. V. Siddamal, R. M. Banakar and B. C. Jinaga, "Design of high speed floating point multiplier," in the 4th IEEE International Symposium on Electronic Design, Test and Applications, 2008, pp. 285-289.

[13] Florent de Dinechin and Bogdan Pasca, “Large multipliers with fewer DSP

blocks,” in Proceedings of the International Conference on Field Programmable

Logic and Applications IEEE, August 2009, pp. 250-255.

[14] Sebastian Banescu, Florent de Dinechin, Bogdan Pasca and Radu Tudoran,

"Multipliers for Floating-Point Double Precision and Beyond on FPGAs," in

Publications de la Recherche Universitaire de l'ENS de Lyon (PRUNEL), April

2010.

[15] Florent de Dinechin, Hong Diep Nguyen and Bogdan Pasca, “ Pipelined FPGA

Adders,” in Publications de la Recherche Universitaire de l'ENS de Lyon

(PRUNEL), April 2010.

[16] Clive Maxfield, The Design Warrior’s Guide to FPGAs: Devices, Tools and

Flows, Great Britain: Newness, 2004.

[17] Michael John, Sebastian Smith, Application-Specific Integrated Circuits,

Addison-Wesley Professional, 1997.

[18] Sunggu Lee, Advanced Digital Logic Design using VHDL, State Machines and

Synthesis for FPGAs, Canada: Thomson 2006.

[19] Volnei A. Pedroni, Digital Electronics and Design with VHDL, USA: Morgan

Kaufmann, 2008.

[20] Xilinx, http://www.xilinx.com.

[21] “Semiconductor Industry Leader.” [Online]. Available:

http://www.xilinx.com/company/about.htm

[22] “FPGA Logic Cells Comparison.” [Online]. Available: http://www.1-

core.com/library/digital/fpga-logic-cells/

[23] Peter Wilson, Design Recipes for FPGAs, Great Britain: Newness, 2007.

[24] IEEE standards board, IEEE standard for floating-point arithmetic, 2008.

[25] M. Morris Mano, Computer System Architecture, USA:Prentice Hall, 1992.

[26] Sheetal A. Jain, Low Power Single Precision IEEE Floating Point Unit, Master of

Engineering Thesis, Massachusetts Institute of Technology (MIT), USA, 2003.

[27] H.-J. Oh, S. M. Mueller, C. Jacobi, K. D. Tran, S. R. Cottier, B. W. Michael, et al., "A Fully Pipelined Single-Precision Floating-Point Unit in the Synergistic Processor Element of a Cell Processor," IEEE Journal of Solid-State Circuits, vol. 41, no. 4, 2006, pp. 759-771.

[28] “Speed Strategies.” [Online]. Available: http://www.xilinx.com

[29] “Xilinx Timing Constraints User Guide.” [Online]. Available:

http://www.xilinx.com

[30] Lamiaa Sayed Abdel Hamid, Khaled Shehata, Hassan El-Ghitani, Mohamed

ElSaid, "Design of Generic Floating Point Multiplier and Adder/Subtractor

Units," in Proceedings of the 12th International Conference on Computer

Modeling and Simulation, IEEE/UKSIM, March 2010, pp.615-618.

[31] K. Underwood, “FPGAs vs. CPUs: Trends in Peak Floating-Point Performance,”

in proceedings of the ACM/SIGDA 12th International Symposium on Field

Programmable Gate Arrays, February 2004, pp. 171-180.

[32] "FPGA versus DSP design Reliability and Maintenance." [Online]. Available: http://www.dsp-fpga.com

[33] Zhi-Jian Sun and Xue-Mei Liu, “Application of Floating Point and DSP in

Integration Navigation System,” in the proceeding of International Conference on

Computer Science and Software Engineering, December 2008, pp.58-61.

