
An Efficient Implementation of Floating Point Multiplier

ABSTRACT
This paper describes an efficient implementation of an IEEE 754 single precision floating point multiplier targeted for the Xilinx Virtex-5 FPGA. VHDL is used to implement a technology-independent pipelined design. The multiplier implementation handles the overflow and underflow cases. Rounding is not implemented, to give more precision when using the multiplier in a Multiply and Accumulate (MAC) unit. With a latency of three clock cycles, the design achieves 301 MFLOPs. The multiplier was verified against the Xilinx floating point multiplier core.

1. INTRODUCTION
Floating point numbers are one possible way of representing real numbers in binary format; the IEEE 754 standard presents two different floating point formats, the binary interchange format and the decimal interchange format. Multiplying floating point numbers is a critical requirement for DSP applications involving large dynamic range. This paper focuses only on the single precision normalized binary interchange format. Fig. 1 shows the IEEE 754 single precision binary format representation; it consists of a one bit sign (S), an eight bit exponent (E), and a twenty-three bit fraction (M, or mantissa). An extra bit is added to the fraction to form what is called the significand. If the exponent is greater than 0 and smaller than 255, and there is a 1 in the MSB of the significand, then the number is said to be a normalized number.

The value of such a normalized number is Z = (-1)^S * 2^(E - Bias) * (1.M), where M = m22*2^-1 + m21*2^-2 + ... + m0*2^-23 and Bias = 127.
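The field layout can be illustrated with a short Python sketch (an illustrative model only; the paper's design itself is written in HDL):

```python
import struct

def decode_single(x):
    """Unpack an IEEE 754 single precision value into its three
    fields: sign S, biased exponent E, and 23-bit fraction M
    (hidden bit excluded)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

# A normalized number has the value (-1)^S * 2^(E-127) * (1.M):
s, e, m = decode_single(40.0)
assert (-1) ** s * 2.0 ** (e - 127) * (1 + m / 2 ** 23) == 40.0
```

For example, 40.0 = 1.25 x 2^5 decodes to sign 0, exponent field 132 (127 + 5), and fraction bits 0100...0.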


SIGNIFICAND: The significand is the mantissa with an extra MSB (hidden) bit.

OVERVIEW OF INDUSTRY:
Very-large-scale integration (VLSI) is the process of creating integrated circuits by combining thousands of transistors into a single chip.

1.1 Developments:
The first semiconductor chips held two transistors each. Subsequent advances added more and more transistors, and, as a consequence, more individual functions or systems were integrated over time. The first integrated circuits held only a few devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it possible to fabricate one or more logic gates on a single device, a level now known as small-scale integration (SSI). Improvements in technique led to devices with hundreds of logic gates, known as medium-scale integration (MSI). Further improvements led to large-scale integration (LSI), i.e. systems with at least a thousand logic gates. Current technology has moved far past this mark and today's microprocessors have many millions of gates and billions of individual transistors.

At one time, there was an effort to name and calibrate various levels of large-scale integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But the huge number of gates and transistors available on common devices has rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of integration are no longer in widespread use.

As of early 2008, billion-transistor processors are commercially available. This became more commonplace as semiconductor fabrication advanced from the then-current generation of 65 nm processes. A notable example is Nvidia's 280 series GPU. This GPU is unique in that almost all of its 1.4 billion transistors are used for logic, in contrast to the Itanium, whose large transistor count is largely due to its 24 MB L3 cache. Current designs, unlike the earliest devices, use extensive design automation and automated logic synthesis to lay out the transistors, enabling higher levels of complexity in the resulting logic functionality. Certain high performance logic blocks, like the SRAM cell, however, are still designed by hand to ensure the highest efficiency. VLSI technology may be moving toward further radical miniaturization with the introduction of NEMS technology.

Structured VLSI design is a modular methodology originated by Carver Mead and Lynn Conway for saving microchip area by minimizing the interconnect fabric area. This is obtained by repetitive arrangement of rectangular macro blocks which can be interconnected using wiring by abutment; an example is partitioning the layout of an adder into a row of equal bit-slice cells. In complex designs this structuring may be achieved by hierarchical nesting. Structured VLSI design had been popular in the early 1980s, but lost its popularity later because of the advent of placement and routing tools, which waste a lot of area on routing; this is tolerated because of the progress of Moore's Law. When introducing the hardware description language KARL in the mid-1970s, Reiner Hartenstein coined the term "structured VLSI design" (originally as "structured LSI design"), echoing Edsger Dijkstra's structured programming approach of procedure nesting to avoid chaotic spaghetti-structured programs.

1.2 Challenges:
As microprocessors become more complex due to technology scaling, microprocessor designers have encountered several challenges which force them to think beyond the design plane, and look ahead to post-silicon:

Power usage / heat dissipation - As threshold voltages have ceased to scale with advancing process technology, dynamic power dissipation has not scaled proportionally. Maintaining logic complexity when scaling the design down only means that the power dissipation per area will go up. This has given rise to techniques such as dynamic voltage and frequency scaling (DVFS) to minimize overall power.

Process variation - As photolithography techniques come closer to the fundamental laws of optics, achieving high accuracy in doping concentrations and etched wires is becoming more difficult and prone to errors due to variation. Designers now must simulate across multiple fabrication process corners before a chip is certified ready for production.

Stricter design rules - Due to lithography and etch issues with scaling, design rules for layout have become increasingly stringent. Designers must keep ever more of these rules in mind while laying out custom circuits. The overhead for custom design is now reaching a tipping point, with many design houses opting to switch to electronic design automation (EDA) tools to automate their design process.

Timing/design closure - As clock frequencies tend to scale up, designers are finding it more difficult to distribute and maintain low clock skew between these high frequency clocks across the entire chip. This has led to a rising interest in multicore and multiprocessor architectures, since an overall speedup can be obtained by lowering the clock frequency and distributing processing.

First-pass success - As die sizes shrink (due to scaling), and wafer sizes go up (to lower manufacturing costs), the number of dies per wafer increases, and the complexity of making suitable photomasks goes up rapidly. A mask set for a modern technology can cost several million dollars. This non-recurring expense deters the old iterative

1.3 Applications of VLSI:


Electronic systems now perform a wide variety of tasks in daily life. Electronic systems in some cases have replaced mechanisms that operated mechanically, hydraulically, or by other means; electronics are usually smaller, more flexible, and easier to service. In other cases electronic systems have created totally new applications. Electronic systems perform a variety of tasks, some of them visible, some more hidden: Personal entertainment systems such as portable MP3 players and DVD players perform sophisticated algorithms with remarkably little energy. Electronic systems in cars operate stereo systems and displays; they also control fuel injection systems, adjust suspensions to varying terrain, and perform the control functions required for anti-lock braking (ABS) systems. Digital electronics compress and decompress video, even at high-definition data rates, on-the-fly in consumer electronics. Low-cost terminals for Web browsing still require sophisticated electronics, despite their dedicated function. Personal computers and workstations provide word-processing, financial analysis, and games. Computers include both central processing units (CPUs) and special-purpose hardware for disk access, faster screen display, etc. Medical electronic systems measure bodily functions and perform complex processing algorithms to warn about unusual conditions. The availability of these complex systems, far from overwhelming consumers, only creates demand for even more complex systems. The growing sophistication of applications continually pushes the design and manufacturing of integrated circuits and electronic systems to new levels of complexity. And perhaps the most

amazing characteristic of this collection of systems is its variety: as systems become more complex, we build not a few general-purpose computers but an ever wider range of special-purpose systems. Our ability to do so is a testament to our growing mastery of both integrated circuit manufacturing and design, but the increasing demands of customers continue to test the limits of design and manufacturing.

2. VERILOG HDL
Verilog HDL is a hardware description language that can be used to model a digital system at many levels of abstraction, ranging from the algorithmic level to the gate level to the switch level. The complexity of the digital system being modeled can vary from that of a simple gate to a complete electronic digital system, or anything in between. The digital system can be described hierarchically, and timing can be explicitly modeled within the same description. The Verilog HDL language includes capabilities to describe the behavioral nature of a design, the dataflow nature of a design, a design's structural composition, delays, and a waveform generation mechanism, including aspects of response monitoring and verification, all modeled using one single language. In addition, the language provides a programming language interface through which the internals of a design can be accessed during simulation, including the control of a simulation run. The language not only defines the syntax but also defines very clear simulation semantics for each language construct. Therefore, models written in this language can be verified using a Verilog simulator. The language inherits many of its operator symbols and constructs from the C programming language. Verilog HDL provides an extensive range of modeling capabilities, some of which are quite difficult to comprehend initially. However, a core subset of the language is quite easy to learn and use. This is sufficient to model most applications.

2.1 History:
The Verilog HDL language was first developed by Gateway Design Automation in 1983 as a hardware modeling language for their simulator product. At that time it was a proprietary language. Because of the popularity of the simulator product, Verilog HDL gained acceptance as a usable and practical language by a number of designers. In an effort to increase the popularity of the language, it was placed in the public domain in 1990, and Open Verilog International (OVI) was formed to promote Verilog. In 1992 OVI decided to pursue standardization of Verilog HDL as an IEEE standard. This effort was successful and the language became an IEEE standard in 1995. The complete standard is described in the Verilog Hardware Description Language Reference Manual. The standard is called IEEE Std 1364-1995.

2.2 Major Capabilities:


Listed below are the major capabilities of the Verilog hardware description language:

1. Primitive logic gates, such as and, or and nand, are built into the language.
2. Flexibility of creating a user-defined primitive (UDP). Such a primitive can be either a combinational logic primitive or a sequential logic primitive.
3. Switch-level modeling primitives, such as pmos and nmos, are also built into the language.
4. Explicit language constructs are provided for specifying pin-to-pin delays, path delays and timing checks of a design.
5. A design can be modeled in three different styles or in a mixed style. These styles are: behavioral style, modeled using procedural constructs; dataflow style, modeled using continuous assignments; and structural style, modeled using gate and module instantiations.
6. There are two data types in Verilog HDL: the net data type and the register data type. The net type represents a physical connection between structural elements, while a register type represents an abstract data storage element.
7. Verilog HDL also has built-in logic functions such as & (bitwise-and) and | (bitwise-or).
8. High-level programming language constructs such as conditionals, case statements, and loops are available in the language.
9. The notions of concurrency and time can be explicitly modeled.
10. Powerful file read and write capabilities are provided.
11. The language is non-deterministic under certain situations; that is, a model may produce different results on different simulators. For example, the ordering of events on an event queue is not defined by the standard.

2.3 Synthesis:
Synthesis is the process of constructing a gate level netlist from a register-transfer level model of a circuit described in Verilog HDL. Figure 2-2 shows such a process. A synthesis system may, as an intermediate step, generate a netlist that is comprised of register-transfer level blocks such as flip-flops, arithmetic-logic units, and multiplexers, interconnected by wires. In such a case, a second program called the RTL module builder is necessary. The purpose of this builder is to build, or acquire from a library of predefined components, each of the required RTL blocks in the user-specified target technology.

Figure 2-2: Synthesis process

Having produced a gate level netlist, a logic optimizer reads in the netlist and optimizes the circuit for the user-specified area and timing constraints. These area and timing constraints may also be used by the module builder for appropriate selection or generation of RTL blocks. Here we assume that the target netlist is at the gate level; the module building and logic optimization phases are not described further. The figure above shows the basic elements of Verilog HDL and the elements used in hardware. A mapping or construction mechanism has to be provided that translates the Verilog HDL elements into their corresponding hardware elements, as shown in Figure 2-3.

2.4 Advantages of Verilog HDL:


1. It is not possible to describe the functionality of digital circuits using higher-level languages such as FORTRAN or C, because these are sequential in nature; hence Hardware Description Languages (HDLs) came into existence.
2. A design can be described and implemented at a very abstract (high) level.
3. The implementation is technology independent, and functional verification of the design can be done early in the design cycle.
4. One can optimize and modify the design description until it meets the desired functionality as well as the required specifications. Most design bugs are eliminated before going to implementation (chip).
5. Designing with HDLs is analogous to computer programming; a textual description is an easier way to develop and debug circuits.
6. Design reusability, short development time and easy modification of the design.
7. Verilog HDL is non-proprietary and is an IEEE standard.
8. Switch-level modeling primitives, such as pmos and nmos, are also built into the language.
9. It is human and machine readable; thus it can be used as an exchange language between tools and designers.
10. Verilog HDL can be used to perform response monitoring of the design under test; that is, the values of a design under test can be monitored and displayed. These values can be compared with expected values, and in case of a mismatch, a report message can be printed.


3. FPGA DESIGN FLOW


This part of the chapter deals with the implementation flow, specifying the significance of various properties, the reports obtained, and the simulation waveforms of the architectures developed.

3.1 FPGA Design flow:


The various steps involved in the design flow are as follows:
1) Design entry.
2) Functional simulation.
3) Synthesizing and optimizing (translation) the design.
4) Placing and routing the design.
5) Timing simulation of the design after post-PAR.
6) Static timing analysis.
7) Configuring the device by bit generation.

3.1.1 Design entry:
The first step in implementing the design is to create the HDL code based on the design criteria. To support these instantiations we need to include the UNISIM library and compile all design libraries before performing the functional simulation. The constraints (timing and area constraints) can also be included during design entry. Xilinx accepts the constraints in the form of a user constraints file (UCF).

3.1.2 Functional Simulation:
This step deals with the verification of the functionality of the written source code. ISE provides its own ISE simulator and also allows integration with other tools such as ModelSim. This project uses ModelSim for functional verification, selected as an option during project creation. Functional simulation determines whether the logic in the design is correct before implementing it in a device. Functional simulation can take place at the earliest stages of the design flow. Because timing information for the implemented design is not available at this stage, the simulator tests the logic in the design using unit delays.

3.1.3 Synthesizing and Optimizing:
In this stage the behavioral information in the HDL file is translated into a structural netlist, and the design is optimized for a Xilinx device. To perform synthesis this project uses the Xilinx XST tool [17]. From the original design, a netlist is created, then synthesized and translated into a native generic object (NGO) file. This file is fed into a Xilinx program called NGDBuild, which produces a logical native generic database (NGD) file.

3.1.4 Design implementation:
In this stage, the MAP program maps the logical design to a Xilinx FPGA. The input to MAP is an NGD file, which is generated using the NGDBuild program. The NGD file contains a logical description of the design that includes both the hierarchical components used to develop the design and the lower-level Xilinx primitives. The NGD file also contains any number of NMC (macro library) files, each of which contains the definition of a physical macro. MAP first performs a logical DRC (Design Rule Check) on the design in the NGD file. MAP then maps the design logic to the components (logic cells, I/O cells, and other components) in the target Xilinx FPGA. The output from MAP is an NCD (Native Circuit Description) file and a PCF (Physical Constraints File).

NCD (Native Circuit Description) file: a physical description of the design in terms of the components in the target Xilinx device.


PCF (Physical Constraints File): an ASCII text file that contains constraints specified during design entry, expressed in terms of physical elements. The physical constraints in the PCF are expressed in Xilinx's constraint language.

After the creation of the Native Circuit Description (NCD) file with the MAP program, the design file is placed and routed using PAR. PAR accepts a mapped NCD file as input, places and routes the design, and outputs an NCD file to be used by the bit stream generator (bit generation). The PAR placer executes multiple phases, and PAR writes the NCD after all the placer phases are complete. During placement, PAR places components into sites based on factors such as constraints specified in the PCF file, the length of connections, and the available routing resources. After placing the design, PAR executes multiple phases of the router. The router performs a converging procedure for a solution that routes the design to completion and meets timing constraints. Once the design is fully routed, PAR writes an NCD file, which can be analyzed against timing. PAR writes a new NCD as the routing improves throughout the router phases.

3.1.5 Timing simulation after post-PAR:
Timing simulation at this stage verifies that the design runs at the desired speed for the device under worst-case conditions. This process is performed after the design is mapped, placed, and routed for FPGAs. At this time, all design delays are known. Timing simulation is valuable because it can verify timing relationships and determine the critical paths of the design under worst-case conditions. It can also determine whether or not the design contains setup or hold violations. In most designs the same test bench can be used to simulate at this stage.

3.1.6 Static timing analysis:


Static timing analysis is best for quick timing checks of a design after it is placed and routed. It also allows you to determine path delays in your design. Following are the two major goals of static timing analysis:

Timing verification: verifying that the design meets your timing constraints.
Reporting: enumerating input constraint violations and placing them into an accessible file.

ISE provides the Timing Reporter and Circuit Evaluator (TRACE) tool to perform STA. The input files to TRACE are the .ncd file and the .pcf file from PAR, and the output is a .twr file.

3.2 Processes and properties:


Processes and properties enable the interaction of our design with the functionality available in the ISE suite of tools.

3.2.1 Processes:
Processes are the functions listed hierarchically in the Processes window. They perform functions from the start to the end of the design flow.

3.2.2 Properties:
Process properties are accessible from the right-click menu for selected processes. They enable us to customize the parameters used by the process. Process properties are set at the synthesis and implementation phases.


3.3 Synthesize options:


The following properties apply to the Synthesize process using the Xilinx Synthesis Technology (XST) synthesis tool.

Optimization Goal: Specifies the global optimization goal, for area or speed. Select an option from the drop-down list. Speed optimizes the design for speed by reducing the levels of logic; Area optimizes the design for area by reducing the total amount of logic used for design implementation. By default, this property is set to Speed.

3.3.1 Optimization Effort:
Specifies the synthesis optimization effort level. Select an option from the drop-down list. Normal optimizes the design using minimization and algebraic factoring algorithms. High performs additional optimizations that are tuned to the selected device architecture; "High" takes more CPU time than "Normal" because multiple optimization algorithms are tried to get the best result for the target architecture. By default, this property is set to Normal. Since this project aims at timing performance, the High effort level was selected.


3.3.2 Power Reduction:
When set to Yes (checkbox is checked), XST optimizes the design to consume as little power as possible. By default, this property is set to No (checkbox is blank).

3.3.3 Use Synthesis Constraints File:
Specifies whether or not to use the constraints file entered in the previous property. By default, this constraints file is used (property checkbox is checked).

3.3.4 Keep Hierarchy:
Specifies whether the corresponding design unit should be preserved or merged with the rest of the design. You can specify Yes, No or Soft. Soft is used when you wish to maintain the hierarchy through synthesis, but do not wish to pass the keep_hierarchy attributes to place and route. By default, this property is set to No. Changing this property from No to Yes almost doubled the speed of this design.


4. FLOATING POINT MULTIPLICATION ALGORITHM


4.1 FLOATING POINT MULTIPLICATION:
Multiplying two numbers in floating point format is done by:
1. Adding the exponents of the two numbers and then subtracting the bias from the result.
2. Multiplying the significands of the two numbers.
3. Calculating the sign by XORing the signs of the two numbers.
In order to represent the multiplication result as a normalized number, there should be a 1 in the MSB of the result.

4.2 FLOATING POINT MULTIPLICATION ALGORITHM:


As stated in the introduction, normalized floating point numbers have the form

Z = (-1)^S * 2^(E - Bias) * (1.M).

To multiply two floating point numbers the following is done:
1. Multiplying the significands; i.e. (1.M1 * 1.M2).
2. Placing the decimal point in the result.
3. Adding the exponents; i.e. (E1 + E2 - Bias).
4. Obtaining the sign; i.e. S1 xor S2.
5. Normalizing the result; i.e. obtaining a 1 at the MSB of the result's significand.
6. Rounding the result to fit in the available bits.
7. Checking for underflow and overflow occurrence.

Consider a floating point representation similar to the IEEE 754 single precision format, but with a reduced number of mantissa bits (only 4), while still retaining the hidden 1 bit for normalized numbers:

A = 0 10000100 0100 = 40
B = 1 10000001 1110 = -7.5

To multiply A and B:

1. Multiply the significands:

        1.0100
      x 1.1110
      --------
         00000
        10100
       10100
      10100
     10100
    ----------
    1001011000

2. Place the decimal point: 10.01011000

3. Add the exponents:

      10000100
    + 10000001
    ----------
     100000101

The exponent representing each number is already shifted/biased by the bias value (127) and is not the true exponent; i.e. EA = EA-true + bias and EB = EB-true + bias, so

    EA + EB = EA-true + EB-true + 2 bias.

We should therefore subtract the bias from the resultant exponent, otherwise the bias would be added twice:

     100000101
    - 01111111
    ----------
      10000110

4. Obtain the sign bit and put the result together:

    1 10000110 10.01011000

5. Normalize the result so that there is a 1 just before the radix point (decimal point). Moving the radix point one place to the left increments the exponent by 1; moving it one place to the right decrements the exponent by 1:

    1 10000110 10.01011000    (before normalizing)
    1 10000111 1.001011000    (normalized)

The result (without the hidden bit) is:

    1 10000111 001011000

6. The mantissa bits are more than 4 bits (the available mantissa bits), so rounding is needed. If we apply the truncation rounding mode, then the stored value is:

    1 10000111 0010

In this paper we present a floating point multiplier in which rounding support isn't implemented. Rounding support can be added as a separate unit that can be accessed by the multiplier or by a floating point adder, thus accommodating more precision if the multiplier is connected directly to an adder in a MAC unit. Fig. 2 shows the multiplier structure; exponent addition, significand multiplication, and the result's sign calculation are independent and are done in parallel. The significand multiplication is done on two 24 bit numbers and results in a 48 bit product, which we will call the intermediate product (IP). The IP is represented as (47 down to 0) and the


decimal point is located between bits 46 and 45 in the IP. The following sections detail each block of the floating point multiplier.
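The worked example can be reproduced with a short Python model of the reduced (4-bit-mantissa) format. This is an illustrative sketch following the seven steps in the text with truncation rounding, not the paper's hardware description:

```python
def fp_mul_reduced(a, b, mant_bits=4, bias=127):
    """Multiply two values in a reduced IEEE-754-like format
    (1 sign bit, 8 exponent bits, mant_bits mantissa bits).
    Each operand is a (sign, biased_exponent, mantissa) triple;
    rounding mode is truncation, as in the worked example."""
    (sa, ea, ma), (sb, eb, mb) = a, b
    # 1. Multiply the significands with the hidden 1 restored.
    sig = ((1 << mant_bits) | ma) * ((1 << mant_bits) | mb)
    # 3. Add the exponents and subtract the bias (it would
    #    otherwise be counted twice).
    exp = ea + eb - bias
    # 4. The sign is the XOR of the operand signs.
    sign = sa ^ sb
    # 5. Normalize: the leading 1 of the product sits at bit
    #    2*mant_bits or 2*mant_bits+1; in the latter case shift
    #    right once and increment the exponent.
    if sig >> (2 * mant_bits + 1):
        sig >>= 1
        exp += 1
    # 6. Truncate to mant_bits fraction bits, dropping the hidden bit.
    mant = (sig >> mant_bits) & ((1 << mant_bits) - 1)
    return sign, exp, mant
```

For A = 40 and B = -7.5 the model returns sign 1, exponent 10000111 and mantissa 0010, i.e. -1.0010b x 2^8 = -288; the gap from the exact product -300 is the precision lost to truncation.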

The figure below clearly shows each block of the floating point multiplier.

FLOATING POINT MULTIPLIER BLOCK DIAGRAM:


5. HARDWARE OF FLOATING POINT MULTIPLIER


The practical implementation of this multiplier is divided into four hardware modules:
MODULE 1: sign bit calculation and exponent addition.
MODULE 2: mantissa multiplication using a carry save multiplier.
MODULE 3: normaliser.
MODULE 4: overflow and underflow detection.

MODULE 1: Includes sign bit calculation and exponent addition.


Concepts used:
1. Operation of the XOR gate.
2. Unsigned ripple carry adder.
3. Zero subtractor and one subtractor.

Sign bit calculation:


Multiplying two numbers results in a negative number if exactly one of the multiplied numbers is negative; with the aid of a truth table we can see that this sign can be obtained by XORing the signs of the two inputs.

Table 1: XOR truth table.

    Sa  Sb | Sa xor Sb
     0   0 |     0
     0   1 |     1
     1   0 |     1
     1   1 |     0
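The sign rule can be checked exhaustively over the four sign-bit combinations (an illustrative Python check, not part of the paper's design):

```python
def product_sign(sa, sb):
    """Sign bit of a product, computed from the operand sign bits."""
    return sa ^ sb

# The product is negative exactly when the two sign bits differ.
for sa in (0, 1):
    for sb in (0, 1):
        negative = ((-1) ** sa) * ((-1) ** sb) < 0
        assert product_sign(sa, sb) == int(negative)
```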

Exponent addition:

This unsigned adder is responsible for adding the exponent of the first input to the exponent of the second input and subtracting the Bias (127) from the result (i.e. A_exponent + B_exponent - Bias). The result of this stage is called the intermediate exponent. The add operation is done on 8 bits, and there is no need for a quick result because most of the calculation time is spent in the significand multiplication process (multiplying 24 bits by 24 bits); thus we need only a moderate exponent adder and a fast significand multiplier. An 8-bit ripple carry adder is used to add the two input exponents. As shown in Fig. 3, a ripple carry adder is a chain of cascaded full adders and one half adder; each full adder has three inputs (A, B, Ci) and two outputs (S, Co). The carry out (Co) of each adder is fed to the next full adder (i.e. each carry bit "ripples" to the next full adder).

The addition process produces an 8-bit sum (S7 to S0) and a carry bit (Co). These bits are concatenated to form a 9-bit addition result (S8 to S0) from which the bias is subtracted.
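The ripple carry exponent adder can be modeled bit by bit in Python (an illustrative model, not the paper's HDL):

```python
def full_adder(a, b, cin):
    """One full-adder cell: returns (sum, carry_out)."""
    s = a ^ b ^ cin
    cout = (a & b) | (a & cin) | (b & cin)
    return s, cout

def ripple_carry_add(x, y, width=8):
    """Add two width-bit exponents cell by cell, the carry out of
    each cell feeding the next; the final carry is concatenated as
    the MSB, giving a (width+1)-bit result."""
    result, carry = 0, 0
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result | (carry << width)
```

For the worked example, ripple_carry_add(0b10000100, 0b10000001) gives the 9-bit sum 100000101 (261 = 132 + 129).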

Bias subtraction:
The Bias is subtracted using an array of ripple borrow subtractors. A normal subtractor has three inputs (minuend (S), subtrahend (T), borrow in (Bi)) and two outputs (difference (R), borrow out (Bo)). The subtractor logic can be optimized if one of its inputs is a constant value, which is our case, where the Bias is constant (127 in decimal = 001111111 in binary). Table I shows the truth table for a 1-bit subtractor with the input T equal to 1, which we will call a one subtractor (OS).

One subtractor:

Here one input (the subtrahend bit T) is always 1. The Boolean equations that represent the subtractor are:

    R = not (S xor Bi)
    Bo = (not S) or Bi

Truth table:

    S  Bi | R  Bo
    0  0  | 1  1
    0  1  | 0  1
    1  0  | 0  0
    1  1  | 1  1


Zero subtractor:

Here one input (the subtrahend bit T) is always 0. The Boolean equations that represent this subtractor are:

    R = S xor Bi
    Bo = (not S) and Bi

Truth table:

    S  Bi | R  Bo
    0  0  | 0  0
    0  1  | 1  1
    1  0  | 1  0
    1  1  | 0  0

The figure below shows the Bias subtractor, which is a chain of 7 one subtractors (OS) followed by 2 zero subtractors (ZS); the borrow output of each subtractor is fed to the next subtractor. If an underflow occurs then E_result < 0 and the number is out of the IEEE 754 single precision normalized number range; in this case the output is signaled to 0 and an underflow flag is asserted.
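The OS/ZS chain can be modeled in Python (illustrative only; the cell equations follow from a 1-bit subtractor with a constant subtrahend bit):

```python
def one_subtractor(s, b_in):
    """OS cell (subtrahend bit fixed at 1):
    R = XNOR(S, Bin), Bo = (NOT S) OR Bin."""
    return 1 ^ s ^ b_in, (s ^ 1) | b_in

def zero_subtractor(s, b_in):
    """ZS cell (subtrahend bit fixed at 0):
    R = XOR(S, Bin), Bo = (NOT S) AND Bin."""
    return s ^ b_in, (s ^ 1) & b_in

def subtract_bias(exp_sum):
    """Subtract the constant bias 001111111 (127) from a 9-bit
    exponent sum: 7 one-subtractors on the low bits followed by
    2 zero-subtractors, borrows rippling upward. The final borrow
    out signals E_result < 0, i.e. underflow."""
    result, borrow = 0, 0
    for i in range(9):
        cell = one_subtractor if i < 7 else zero_subtractor
        r, borrow = cell((exp_sum >> i) & 1, borrow)
        result |= r << i
    return result, borrow  # (difference, underflow flag)
```

For the worked example, subtract_bias(0b100000101) yields 10000110 (134) with no borrow; an exponent sum below 127 produces a final borrow, flagging underflow.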


Ripple borrow subtractor:

MODULE 2: Includes mantissa multiplication using a carry save multiplier.


Concepts used:

1. Carry save multiplication.
2. Half adders and full adders.

Unsigned multiplier (for significand multiplication):


This unit is responsible for multiplying the unsigned significands and placing the decimal point in the multiplication product. The result of the significand multiplication will be called the intermediate product (IP). The unsigned significand multiplication is done on 24 bits. Multiplier performance should be taken into consideration so as not to limit the whole multiplier's performance. A 24x24 bit carry save multiplier architecture is used, as it has moderate speed with a simple architecture. In the carry save multiplier, the carry bits are passed diagonally downwards (i.e. the carry bit is propagated to the next stage). Partial products are made by ANDing the inputs together and passing them to the appropriate adder.

The carry save multiplier has three main stages:

1. The first stage is an array of half adders.
2. The middle stages are arrays of full adders; the number of middle stages is equal to the significand size minus two.
3. The last stage is an array of ripple carry adders. This stage is called the vector merging stage.

The number of adders (half adders and full adders) in each stage is equal to the significand size minus one. For example, a 4x4 carry save multiplier is shown in the figure below, and it has the following stages:
1. The first stage consists of three half adders.
2. Two middle stages, each consisting of three full adders.
3. The vector merging stage, consisting of one half adder and two full adders.
The decimal point is between bits 45 and 46 in the significand multiplier result. The multiplication time taken by the carry save multiplier is determined by its critical path. The critical path starts at the AND gate of the first partial products (i.e. a1b0 and a0b1), passes through the carry logic of the first half adder and the carry logic of the first full adder of the middle stages, then passes through all the vector merging adders.
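The carry-save idea can be sketched in Python (an illustrative model: for brevity every compression stage is modeled with full adders, whereas the hardware described above uses half adders where one input is known to be zero):

```python
def carry_save_multiply(a, b, n=4):
    """Sketch of an n x n unsigned carry-save multiplier.

    Partial products are formed by ANDing the multiplicand with each
    multiplier bit; each stage compresses the running sum vector,
    carry vector and one partial-product row with full adders
    (3:2 compression), carries shifted diagonally; the final
    vector-merging stage is an ordinary carry-propagate addition."""
    # Partial product row j is (a AND b_j), shifted left by j.
    rows = [(a << j) if (b >> j) & 1 else 0 for j in range(n)]
    sum_vec, carry_vec = rows[0], 0
    for row in rows[1:]:
        s = sum_vec ^ carry_vec ^ row                    # FA sum bits
        c = ((sum_vec & carry_vec) | (sum_vec & row)
             | (carry_vec & row)) << 1                   # FA carry bits
        sum_vec, carry_vec = s, c
    return sum_vec + carry_vec  # vector-merging (ripple carry) stage
```

The key property is that each stage preserves sum_vec + carry_vec without propagating carries, so only the final merge needs a carry chain; this matches the significand multiplication in the worked example (10100 x 11110 = 1001011000).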

Figure: 4x4 carry save multiplier. Partial products AiBi = Ai AND Bi; HA: half adder, FA: full adder.

MODULE 3: Includes normaliser.


Concepts used:

1. Normalisation.

Normaliser:
Normaliser:
The result of the significand multiplication (intermediate product) must be normalized to have a leading 1 just to the left of the decimal point (i.e. in bit 46 of the intermediate product). Since the inputs are normalized numbers, the intermediate product has the leading one at bit 46 or 47. If the leading one is at bit 46 (i.e. to the left of the decimal point) then the intermediate product is already a normalized number and no shift is needed. If the leading one is at bit 47 then the intermediate product is shifted to the right and the exponent is incremented by 1. The shift operation is done using combinational shift logic made of multiplexers. Fig. 8 shows the simplified logic of a normaliser that has an 8 bit intermediate product input and a 6 bit intermediate exponent input.
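Behaviorally, the normalisation step reduces to a single conditional, sketched below in Python (the hardware uses multiplexer-based shift logic, as described above):

```python
# Sketch of the normaliser: inspect bit 47 of the 48-bit intermediate
# product; shift right by one and bump the exponent if it is set.

def normalise(ip: int, exponent: int):
    """ip: 48-bit intermediate product; exponent: intermediate exponent.
    Returns the normalized (product, exponent) pair."""
    if ip & (1 << 47):               # leading one left of the decimal point
        return ip >> 1, exponent + 1
    return ip, exponent              # leading one already at bit 46

# 1.5 x 1.5 = 2.25: intermediate product has its leading one at bit 47
ip = 0xC00000 * 0xC00000             # 24-bit significands of 1.5
norm_ip, e = normalise(ip, 10)       # exponent value 10 is illustrative
assert norm_ip & (1 << 46) and e == 11
```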


MODULE 4: Includes overflow and underflow detection


Overflow/underflow means that the result's exponent is too large or too small to be represented in the exponent field. The exponent of the result must be 8 bits in size and must be between 1 and 254, otherwise the value is not a normalized value. An overflow may occur while adding the two exponents or during normalization. Overflow due to exponent addition may be compensated during subtraction of the bias, resulting in a normal output value (normal operation). An underflow may occur while subtracting the bias to form the intermediate exponent. If the intermediate exponent is less than zero then it is an underflow that can never be compensated; if the intermediate exponent equals zero then it is an underflow that may be compensated during normalization by adding 1 to it. When an overflow occurs, an overflow flag signal goes high and the result turns to infinity (sign determined according to the sign of the floating point multiplier inputs). When an underflow occurs, an underflow flag signal goes high and the result turns to zero (sign determined according to the sign of the floating point multiplier inputs). Denormalized numbers are flushed to zero with the appropriate sign calculated from the inputs, and an underflow flag is raised. Assume that E1 and E2 are the exponents of the two numbers A and B respectively; the result's exponent is calculated by (6):

Eresult = E1 + E2 - 127 (6)

E1 and E2 can have values from 1 to 254, resulting in Eresult having values from -125 (2 - 127) to 381 (508 - 127); but for normalized numbers, Eresult can only have values from 1 to 254. Table III summarizes the different values of Eresult and the effect of normalization on them.
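The exponent classification of Eq. (6) and Table III can be summarized in a few lines. This is a hedged sketch; the function name and flag strings are illustrative, not taken from the design:

```python
# Classify the intermediate exponent Eresult = E1 + E2 - 127 against
# the normalized range 1..254 (flag names are illustrative).

def exponent_result(e1: int, e2: int):
    """e1, e2: biased exponents (1..254 for normalized inputs).
    Returns (Eresult, flag)."""
    e = e1 + e2 - 127
    if e > 254:
        return e, "overflow"        # result forced to infinity, flag raised
    if e < 0:
        return e, "underflow"       # can never be compensated; result is zero
    if e == 0:
        return e, "may_compensate"  # fixable if normalisation adds 1
    return e, "ok"                  # normalized result

assert exponent_result(200, 200) == (273, "overflow")
assert exponent_result(1, 1) == (-125, "underflow")
assert exponent_result(127, 127) == (127, "ok")
```

The `may_compensate` branch corresponds to the case described above where an intermediate exponent of zero becomes 1 when the normaliser increments it.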

Table III: overflow and underflow.


6. PIPELINING THE MULTIPLIER


Pipelining increases the CPU instruction throughput (the number of instructions completed per unit of time), but it does not reduce the execution time of an individual instruction. In fact, it usually slightly increases the execution time of each instruction due to overhead in the pipeline control. The increase in instruction throughput means that a program runs faster and has a lower total execution time. In order to enhance the performance of the multiplier, three pipelining stages are used to divide the critical path, thus increasing the maximum operating frequency of the multiplier. The pipelining stages are embedded at the following locations:

1. In the middle of the significand multiplier and in the middle of the exponent adder (before the bias subtraction).
2. After the significand multiplier and after the exponent adder.
3. At the floating point multiplier outputs (sign, exponent and mantissa bits).

Fig. 9 shows the pipelining stages as dotted lines.


Three pipelining stages mean that the output has a latency of three clock cycles. The synthesis tool's retiming option was used so that the synthesizer applies its optimization logic to better place the pipelining registers across the critical path.


7. SIMULATION RESULTS FOR INDIVIDUAL MODULES


8. IMPLEMENTATION AND TESTING


The whole multiplier (top unit) was tested against the Xilinx floating point multiplier core generated by Xilinx coregen. The Xilinx core was customized to have two flags to indicate overflow and underflow, and to have a maximum latency of three cycles. The Xilinx core implements the round to nearest rounding mode. A testbench is used to generate the stimulus, apply it to both the implemented floating point multiplier and the Xilinx core, and compare the results. The floating point multiplier code was also checked using DesignChecker, a linting tool which helps in filtering design issues like clocks, unused/undriven logic, and combinational loops. The design was synthesized using the Precision synthesis tool targeting Xilinx Virtex-5 with a timing constraint of 300 MHz. Post-synthesis and place-and-route simulations were made to ensure the design's functionality after synthesis and place and route. The area of the Xilinx core is less than that of the implemented floating point multiplier because the latter doesn't truncate/round the 48 bit result of the mantissa multiplier, which is reflected in the amount of function generators and registers used to perform operations on the extra bits; the speed of the Xilinx core is also affected by the fact that it implements the round to nearest rounding mode.
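The comparison idea behind the testbench can be sketched in software by assembling the modules described above (sign, exponent adder, significand multiplier, normaliser) into one bit-level model and checking it against native arithmetic as a stand-in golden reference. This is a hypothetical illustration, not the actual VHDL testbench; since the design truncates rather than rounds, only inputs whose product is exactly representable are compared.

```python
# Bit-level single precision multiply assembled from the modules
# described in this paper, checked against Python's own arithmetic
# (playing the role of the golden reference core). Illustrative only.
import struct

def f32_bits(x: float) -> int:
    return struct.unpack(">I", struct.pack(">f", x))[0]

def bits_f32(b: int) -> float:
    return struct.unpack(">f", struct.pack(">I", b))[0]

def fp_mult(a_bits: int, b_bits: int) -> int:
    sign = ((a_bits >> 31) ^ (b_bits >> 31)) & 1          # sign: XOR
    e = ((a_bits >> 23) & 0xFF) + ((b_bits >> 23) & 0xFF) - 127
    ip = (((a_bits & 0x7FFFFF) | 0x800000) *
          ((b_bits & 0x7FFFFF) | 0x800000))               # 48-bit IP
    if ip & (1 << 47):                                    # normalise
        ip, e = ip >> 1, e + 1
    mant = (ip >> 23) & 0x7FFFFF   # drop hidden bit, truncate low bits
    return (sign << 31) | ((e & 0xFF) << 23) | mant

# stimulus with exactly representable products, as a testbench would apply
for a, b in [(1.5, 2.5), (0.75, -4.0), (3.0, 7.0)]:
    assert bits_f32(fp_mult(f32_bits(a), f32_bits(b))) == a * b
```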


9. CONCLUSION AND FUTURE WORK


This paper presents an implementation of a floating point multiplier that supports the IEEE 754-2008 binary interchange format. The multiplier doesn't implement rounding and just presents the significand multiplication result as is (48 bits); this gives better precision if the whole 48 bits are utilized in another unit, e.g. a floating point adder to form a MAC unit. The design has three pipelining stages and, after implementation on a Xilinx Virtex-5 FPGA, achieves 301 MFLOPs.

