ABSTRACT
This paper describes an efficient implementation of an IEEE 754 single precision floating point multiplier targeted for Xilinx Virtex-5 FPGA. VHDL is used to implement a technology-independent pipelined design. The multiplier implementation handles the overflow and underflow cases. Rounding is not implemented to give more precision when using the multiplier in a Multiply and Accumulate (MAC) unit. With latency of three clock cycles the design achieves 301 MFLOPs. The multiplier was verified against Xilinx floating point multiplier core.
1. INTRODUCTION
Floating point numbers are one possible way of representing real numbers in binary format; the IEEE 754 standard presents two different floating point formats, Binary interchange format and Decimal interchange format. Multiplying floating point numbers is a critical requirement for DSP applications involving large dynamic range. This paper focuses only on single precision normalized binary interchange format. Fig. 1 shows the IEEE 754 single precision binary format representation; it consists of a one bit sign (S), an eight bit exponent (E), and a twenty three bit fraction (M or Mantissa). An extra bit is added to the fraction to form what is called the significand. If the exponent is greater than 0 and smaller than 255, and there is a 1 in the MSB of the significand, then the number is said to be a normalized number.
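The field layout described above can be illustrated with a short Python sketch. The function name `decode_single` is a hypothetical helper introduced here for illustration; it is not part of the multiplier design. It only unpacks the sign, exponent and fraction fields and forms the 24-bit significand for normalized numbers.

```python
import struct

def decode_single(x: float):
    """Split a value, stored as IEEE 754 single precision, into its fields.

    Returns (S, E, M, significand, is_normalized). The name and the return
    layout are assumptions made for this illustration.
    """
    # Reinterpret the 4-byte single precision encoding as an unsigned integer.
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign = bits >> 31                  # 1-bit sign (S)
    exponent = (bits >> 23) & 0xFF     # 8-bit biased exponent (E)
    fraction = bits & 0x7FFFFF         # 23-bit fraction (M)
    is_normalized = 0 < exponent < 255
    # The extra (hidden) bit is 1 for normalized numbers.
    significand = ((1 << 23) | fraction) if is_normalized else fraction
    return sign, exponent, fraction, significand, is_normalized
```

For example, `decode_single(-7.5)` yields sign 1 and biased exponent 129, since -7.5 = -1.875 * 2^2 and 2 + 127 = 129.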
1.1 Developments:
The first semiconductor chips held two transistors each. Subsequent advances added more and more transistors, and, as a consequence, more individual functions or systems were integrated over time. The first integrated circuits held only a few devices, perhaps as many as ten diodes, transistors, resistors and capacitors, making it possible to fabricate one or more logic gates on a single device. This level is now known retrospectively as small-scale integration (SSI). Improvements in technique led to devices with hundreds of logic gates, known as medium-scale integration (MSI). Further improvements led to large-scale integration (LSI), i.e. systems with at least a thousand logic gates. Current technology has moved far past this mark and today's microprocessors have many millions of gates and billions of individual transistors.
At one time, there was an effort to name and calibrate various levels of large-scale integration above VLSI. Terms like ultra-large-scale integration (ULSI) were used. But the huge number of gates and transistors available on common devices has rendered such fine distinctions moot. Terms suggesting greater than VLSI levels of integration are no longer in widespread use.
As of early 2008, billion-transistor processors are commercially available. This became more commonplace as semiconductor fabrication advanced from the then-current generation of 65 nm processes. A notable example is Nvidia's 280 series GPU. This GPU is unique in the fact that almost all of its 1.4 billion transistors are used for logic, in contrast to the Itanium, whose large transistor count is largely due to its 24 MB L3 cache. Current designs, unlike the earliest devices, use extensive design automation and automated logic synthesis to lay out the transistors, enabling higher levels of complexity in the resulting logic functionality. Certain high performance logic blocks like the SRAM cell, however, are still designed by hand to ensure the highest efficiency. VLSI technology may be moving toward further radical miniaturization with the introduction of NEMS technology. Structured VLSI design is a modular methodology originated by Carver Mead and Lynn Conway for saving microchip area by minimizing the area of the interconnect fabric. This is obtained by a repetitive arrangement of rectangular macro blocks which can be interconnected using wiring by abutment. An example is partitioning the layout of an adder into a row of equal bit slice cells. In complex designs this structuring may be achieved by hierarchical nesting. Structured VLSI design had been popular in the early 1980s, but lost its popularity later because of the advent of placement and routing tools wasting a lot of area by routing, which is tolerated because of the progress of Moore's Law. When introducing the hardware description language KARL in the mid-1970s, Reiner Hartenstein coined the term "structured VLSI design" (originally as "structured LSI design"), echoing Edsger Dijkstra's structured programming.
1.2 Challenges:
As microprocessors become more complex due to technology scaling, microprocessor designers have encountered several challenges which force them to think beyond the design plane, and look ahead to post-silicon:
Power usage/Heat dissipation - As threshold voltages have ceased to scale with advancing process technology, dynamic power dissipation has not scaled proportionally. Maintaining logic complexity when scaling the design down only means that the power dissipation per area will go up. This has given rise to techniques such as dynamic voltage and frequency scaling (DVFS) to minimize overall power.
Process variation - As photolithography techniques tend closer to the fundamental laws of optics, achieving high accuracy in doping concentrations and etched wires is becoming more difficult and prone to errors due to variation. Designers now must simulate across multiple fabrication process corners before a chip is certified ready for production.
Stricter design rules - Due to lithography and etch issues with scaling, design rules for layout have become increasingly stringent. Designers must keep ever more of these rules in mind while laying out custom circuits. The overhead for custom design is now reaching a tipping point, with many design houses opting to switch to electronic design automation (EDA) tools to automate their design process.
Timing/design closure - As clock frequencies tend to scale up, designers are finding it more difficult to distribute and maintain low clock skew between these high frequency clocks across the entire chip. This has led to a rising interest in multicore and multiprocessor architectures, since an overall speedup can be obtained by lowering the clock frequency and distributing processing.
First-pass success - As die sizes shrink (due to scaling), and wafer sizes go up (to lower manufacturing costs), the number of dies per wafer increases, and the complexity of making suitable photomasks goes up rapidly. A mask set for a modern technology can cost several million dollars. This non-recurring expense deters the old iterative philosophy involving several "spin-cycles" to find errors in silicon, and encourages first-pass silicon success. Several design philosophies have been developed to aid this new design flow, including design for manufacturing (DFM), design for test (DFT), and Design for X.
An amazing characteristic of this collection of systems is its variety: as systems become more complex, we build not a few general-purpose computers but an ever wider range of special-purpose systems. Our ability to do so is a testament to our growing mastery of both integrated circuit manufacturing and design, but the increasing demands of customers continue to test the limits of design and manufacturing.
2. VERILOG HDL
Verilog HDL is a hardware description language that can be used to model a digital system at many levels of abstraction ranging from the algorithmic level to the gate level to the switch level. The complexity of the digital system being modeled could vary from that of a simple gate to a complete electronic digital system, or anything in between. The digital system can be described hierarchically and timing can be explicitly modeled within the same description. The Verilog HDL language includes capabilities to describe the behavioral nature of a design, the dataflow nature of a design, a design's structural composition, delays and a waveform generation mechanism including aspects of response monitoring and verification, all modeled using one single language. In addition, the language provides a programming language interface through which the internals of a design can be accessed during simulation, including the control of a simulation run. The language not only defines the syntax but also defines very clear simulation semantics for each language construct. Therefore, models written in this language can be verified using a Verilog simulator. The language inherits constructs from the C programming language. Verilog HDL provides an extensive range of modeling capabilities, some of which are quite difficult to comprehend initially. However, a core subset of the language is quite easy to learn and use. This is sufficient to model most applications.
2.1 History:
The Verilog HDL language was first developed by Gateway Design Automation in 1983 as a hardware modeling language for their simulator product. At that time, it was a proprietary language. Because of the popularity of the simulator product, Verilog HDL gained acceptance as a usable and practical language by a number of designers. In an effort to increase the popularity of the language, the language was placed in the public domain in 1990. Open Verilog International (OVI) was formed to promote Verilog. In 1992 OVI decided to pursue standardization of Verilog HDL as an IEEE standard. This effort was successful and the language became an IEEE standard in 1995. The complete standard is described in the Verilog Hardware Description Language Reference Manual. The standard is called IEEE Std 1364-1995.
2.3 Synthesis:
Synthesis is the process of constructing a gate level netlist from a register-transfer level model of a circuit described in Verilog HDL. Figure 2-2 shows such a process. A synthesis system may, as an intermediate step, generate a netlist that is comprised of register-transfer level blocks such as flip-flops, arithmetic-logic-units, and multiplexers, interconnected by wires. In such a case, a second program called the RTL module builder is necessary. The purpose of this builder is to build, or acquire from a library of predefined components, each of the required RTL blocks in the user-specified target technology.
Having produced a gate level netlist, a logic optimizer reads in the netlist and optimizes the circuit for the user-specified area and timing constraints. These area and timing constraints may also be used by the module builder for appropriate selection or generation of RTL blocks. In this book, we assume that the target netlist is at the gate level. The logic gates used in the synthesized netlists are described in Appendix B. The module building and logic optimization phases are not described in this book. The above figure shows the basic elements of Verilog HDL and the elements used in hardware. A mapping mechanism or a construction mechanism has to be provided that translates the Verilog HDL elements into their corresponding hardware elements, as shown in Figure 2-3.
4. One can optimize and modify the design description until it meets the desired functionality as well as the required specifications. Most design bugs are eliminated before going to implementation (chip).
5. Designing with HDLs is analogous to computer programming; a textual description is an easier way to develop and debug circuits.
6. Design reusability, short development time and easy modification of the design.
7. Verilog HDL is non-proprietary and is an IEEE standard.
8. Switch level modeling primitive gates, such as pmos and nmos, are also built into the language.
9. It is human and machine readable. Thus it can be used as an exchange language between tools and designers.
10. Verilog HDL can be used to perform response monitoring of the design under test; that is, the values of a design under test can be monitored and displayed. These values can be compared with expected values, and in case of a mismatch, a report message can be printed.
of various properties, reports obtained and simulation waveforms of architectures developed to implement.
This step deals with the verification of the functionality of the written source code. ISE provides its own ISE simulator and also allows for integration with other tools such as Modelsim. This project uses Modelsim for the functional verification by selecting the option during project creation. Functional simulation determines if the logic in the design is correct before implementing it in a device. Functional simulation can take place at the earliest stages of the design flow. Because timing information for the implemented design is not available at this stage, the simulator tests the logic in the design using unit delays.
3.1.3 Synthesizing and Optimizing:
In this stage behavioral information in the HDL file is translated into a structural netlist, and the design is optimized for a Xilinx device. To perform synthesis this project uses the Xilinx XST tool [17]. From the original design, a netlist is created, then synthesized and translated into a native generic object (NGO) file. This file is fed into the Xilinx software program called NGDBuild, which produces a logical native generic database (NGD) file.
3.1.4 Design implementation:
In this stage, the MAP program maps a logical design to a Xilinx FPGA. The input to MAP is an NGD file, which is generated using the NGDBuild program. The NGD file contains a logical description of the design that includes both the hierarchical components used to develop the design and the lower level Xilinx primitives. The NGD file also contains any number of NMC (macro library) files, each of which contains the definition of a physical macro. MAP first performs a logical DRC (Design Rule Check) on the design in the NGD file. MAP then maps the design logic to the components (logic cells, I/O cells, and other components) in the target Xilinx FPGA. The output from MAP is an NCD (Native Circuit Description) file and a PCF (Physical Constraints File).
NCD (Native Circuit Description) file: a physical description of the design in terms of the components in the target Xilinx device.
PCF (Physical Constraints File): an ASCII text file that contains constraints specified during design entry, expressed in terms of physical elements. The physical constraints in the PCF are expressed in Xilinx's constraint language.
After the creation of the Native Circuit Description (NCD) file with the MAP program, place and route that design file using PAR. PAR accepts a mapped NCD file as input, places and routes the design, and outputs an NCD file to be used by the bit stream generator (Bit Generation). The PAR placer executes multiple phases of the placer. PAR writes the NCD after all the placer phases are complete. During placement, PAR places components into sites based on factors such as constraints specified in the PCF file, the length of connections, and the available routing resources. After placing the design, PAR executes multiple phases of the router. The router performs a converging procedure for a solution that routes the design to completion and meets timing constraints. Once the design is fully routed, PAR writes an NCD file, which can be analyzed against timing. PAR writes a new NCD as the routing improves throughout the router phases.
3.1.5 Timing simulation after post PAR:
Timing simulation at this stage verifies that the design runs at the desired speed for the device under worst-case conditions. This process is performed after the design is mapped, placed, and routed for FPGAs. At this time, all design delays are known. Timing simulation is valuable because it can verify timing relationships and determine the critical paths for the design under worst-case conditions. It can also determine whether or not the design contains set-up or hold violations. In most designs the same test bench can be used to simulate at this stage.
3.1.6 Static timing analysis:
Static timing analysis is best for quick timing checks of a design after it is placed and routed. It also allows you to determine path delays in your design. The following are the two major goals of static timing analysis:
Timing verification - verifying that the design meets your timing constraints.
Reporting - enumerating input constraint violations and placing them into an accessible file.
ISE provides the Timing Reporter and Circuit Evaluator (TRACE) tool to perform STA. The input files to TRACE are the .ncd and .pcf files from PAR, and the output file is a .twr file.
3.3.2 Power Reduction:
When set to Yes (checkbox is checked), XST optimizes the design to consume as little power as possible. By default, this property is set to No (checkbox is blank).
3.3.3 Use Synthesis Constraints File:
Specifies whether or not to use the constraints file entered in the previous property. By default, this constraints file is used (property checkbox is checked).
3.3.4 Keep Hierarchy:
Specifies whether the corresponding design unit should be preserved and not merged with the rest of the design. You can specify Yes, No and Soft. Soft is used when you wish to maintain the hierarchy through synthesis, but you do not wish to pass the keep_hierarchy attributes to place and route. By default, this property is set to No. Changing this property from No to Yes almost doubled the speed of the design.
$Z = (-1)^{S} \times 2^{(E - \mathrm{Bias})} \times (1.M)$. To multiply two floating point numbers the following is done:
1. Multiplying the significands; i.e. (1.M1 * 1.M2).
2. Placing the decimal point in the result.
3. Adding the exponents; i.e. (E1 + E2 - Bias).
4. Obtaining the sign; i.e. S1 xor S2.
5. Normalizing the result; i.e. obtaining a 1 at the MSB of the result's significand.
6. Rounding the result to fit in the available bits.
7. Checking for underflow and overflow occurrence.
Consider a floating point representation similar to the IEEE 754 single precision floating point format, but with a reduced number of mantissa bits (only 4) while still retaining the hidden bit for normalized numbers:
A = 0 10000100 0100 = 40, B = 1 10000001 1110 = -7.5.
To multiply A and B:
1. Multiply the significands: 1.0100 * 1.1110 = 1001011000
2. Place the decimal point: 10.01011000
3. Add exponents:
  10000100
+ 10000001
= 100000101
The exponent representing the two numbers is already shifted/biased by the bias value (127) and is not the true exponent; i.e. EA = EA-true + bias and EB = EB-true + bias, and EA + EB = EA-true + EB-true + 2 bias. So we should subtract the bias from the resultant exponent, otherwise the bias will be added twice.
  100000101
-  01111111
=  10000110
4. Obtain the sign bit and put the result together: 1 10000110 10.01011000
5. Normalize the result so that there is a 1 just before the radix point (decimal point). Moving the radix point one place to the left increments the exponent by 1; moving one place to the right decrements the exponent by 1.
1 10000110 10.01011000 (before normalizing)
1 10000111 1.001011000 (normalized)
The result is (without the hidden bit): 1 10000111 00101100
6. The mantissa has more than 4 bits (the available mantissa bits); rounding is needed. If we applied the truncation rounding mode then the stored value is: 1 10000111 0010
In this paper we present a floating point multiplier in which rounding support isn't implemented. Rounding support can be added as a separate unit that can be accessed by the multiplier or by a floating point adder, thus accommodating more precision if the multiplier is connected directly to an adder in a MAC unit. Fig. 2 shows the multiplier structure; exponent addition, significand multiplication, and the result's sign calculation are independent and are done in parallel. The significand multiplication is done on two 24 bit numbers and results in a 48 bit product, which we will call the intermediate product (IP). The IP is represented as (47 downto 0) and the
decimal point is located between bits 46 and 45 in the IP. The following sections detail each block of the floating point multiplier.
The figure below shows each block of the floating point multiplier.
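The reduced-format worked example (4 mantissa bits, hidden bit retained, truncation instead of rounding) can be reproduced with the minimal Python sketch below. The function name `multiply_reduced` and the tuple layout are assumptions made for illustration. The operand A = 0 10000100 0100 (40) is taken to be the value implied by the exponent and product shown in the example; B = 1 10000001 1110 (-7.5) is as given.

```python
BIAS = 127
MANT_BITS = 4  # reduced mantissa width of the worked example (an illustration, not the 23-bit format)

def multiply_reduced(a, b):
    """Multiply two normalized numbers given as (sign, biased_exponent, mantissa).

    Follows the listed steps: multiply significands, add exponents and
    subtract the bias, XOR the signs, normalize, then truncate.
    """
    (sa, ea, ma), (sb, eb, mb) = a, b
    sig_a = (1 << MANT_BITS) | ma       # restore the hidden bit: 1.M1
    sig_b = (1 << MANT_BITS) | mb       # restore the hidden bit: 1.M2
    product = sig_a * sig_b             # 2 integer bits + 8 fraction bits
    exponent = ea + eb - BIAS           # remove the doubled bias
    sign = sa ^ sb
    mask = (1 << MANT_BITS) - 1
    if (product >> (2 * MANT_BITS + 1)) & 1:
        # Product is 1x.xxxxxxxx: move the radix point left, increment exponent.
        exponent += 1
        mantissa = (product >> (MANT_BITS + 1)) & mask
    else:
        # Product is 01.xxxxxxxx: already normalized.
        mantissa = (product >> MANT_BITS) & mask
    return sign, exponent, mantissa     # low product bits are truncated
```

With A and B as above, the sketch returns sign 1, exponent 10000111 and mantissa 0010, matching the truncated value in step 6.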
Exponent addition:
This unsigned adder is responsible for adding the exponent of the first input to the exponent of the second input and subtracting the Bias (127) from the addition result (i.e. A_exponent + B_exponent - Bias). The result of this stage is called the intermediate exponent. The add operation is done on 8 bits, and there is no need for a quick result because most of the calculation time is spent in the significand multiplication process (multiplying 24 bits by 24 bits); thus we need a moderate exponent adder and a fast significand multiplier. An 8-bit ripple carry adder is used to add the two input exponents. As shown in Fig. 3, a ripple carry adder is a chain of cascaded full adders and one half adder; each full adder has three inputs (A, B, Ci) and two outputs (S, Co). The carry out (Co) of each adder is fed to the next full adder (i.e. each carry bit "ripples" to the next full adder).
The addition process produces an 8-bit sum (S7 to S0) and a carry bit (Co,7). These bits are concatenated to form a 9-bit addition result (S8 to S0) from which the bias is subtracted.
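A bit-level Python model of the ripple carry adder may make the carry chain concrete. This is a behavioral sketch, not the VHDL used in the design; the function names are hypothetical.

```python
def half_adder(a: int, b: int):
    """Return (sum, carry_out) for two 1-bit inputs."""
    return a ^ b, a & b

def full_adder(a: int, b: int, ci: int):
    """Return (S, Co) for inputs A, B and carry in Ci."""
    s = a ^ b ^ ci
    co = (a & b) | (ci & (a ^ b))
    return s, co

def ripple_carry_add(a: int, b: int, width: int = 8) -> int:
    """Add two width-bit unsigned values; the result has width + 1 bits."""
    # Bit 0 has no carry in, so a half adder is enough.
    s, carry = half_adder(a & 1, b & 1)
    result = s
    # Each later bit uses a full adder fed by the previous carry out.
    for i in range(1, width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        result |= s << i
    # Concatenate the final carry as the most significant bit.
    return result | (carry << width)
```

For the exponents 10000100 and 10000001 this produces the 9-bit sum 100000101.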
Bias subtraction:
The Bias is subtracted using an array of ripple borrow subtractors. A normal subtractor has three inputs (minuend (S), subtrahend (T), borrow in (Bi)) and two outputs (difference (R), borrow out (Bo)). The subtractor logic can be optimized if one of its inputs is a constant value, which is our case, where the Bias is constant ($127_{10} = 001111111_{2}$). Table I shows the truth table for a 1-bit subtractor with the input T equal to 1, which we will call a one subtractor (OS).
One subtractor:
Here one input is always 1. The Boolean equations that represent the subtractor are:
Truth table:
Zero subtractor:
Here one input is always zero. The Boolean equations that represent this subtractor are:
Truth table:
The figure below shows the Bias subtractor, which is a chain of 7 one subtractors (OS) followed by 2 zero subtractors (ZS); the borrow output of each subtractor is fed to the next subtractor. If an underflow occurs then Eresult < 0 and the number is out of the IEEE 754 single precision normalized numbers range; in this case the output is set to 0 and an underflow flag is asserted.
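The OS/ZS chain can be sketched in Python as follows. The Boolean expressions used here are derived by fixing T in the standard full subtractor equations (R = S xor T xor Bi, Bo = (not S and T) or (not S and Bi) or (T and Bi)); they are a reconstruction under that assumption rather than a copy of the original equations, and the function names are hypothetical.

```python
def one_subtractor(s: int, bi: int):
    """1-bit subtractor with the subtrahend T fixed at 1 (assumed equations)."""
    r = 1 ^ s ^ bi            # R = NOT (S xor Bi)
    bo = (1 ^ s) | bi         # Bo = (NOT S) or Bi
    return r, bo

def zero_subtractor(s: int, bi: int):
    """1-bit subtractor with the subtrahend T fixed at 0 (assumed equations)."""
    r = s ^ bi                # R = S xor Bi
    bo = (1 ^ s) & bi         # Bo = (NOT S) and Bi
    return r, bo

def subtract_bias(e_sum: int):
    """Subtract 127 (001111111) from a 9-bit exponent sum.

    Returns (9-bit difference, final borrow out). A final borrow of 1 means
    the true result is negative.
    """
    result, borrow = 0, 0
    for i in range(9):
        # Bias bits 0..6 are 1 (seven OS cells); bits 7..8 are 0 (two ZS cells).
        cell = one_subtractor if i < 7 else zero_subtractor
        r, borrow = cell((e_sum >> i) & 1, borrow)
        result |= r << i
    return result, borrow
```

Applied to the 9-bit sum 100000101 from the worked example, the chain yields 010000110 with no final borrow.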
1. The first stage is an array of half adders.
2. The middle stages are arrays of full adders.
3. The number of middle stages is equal to the significand size minus two.
4. The last stage is an array of ripple carry adders. This stage is called the vector merging stage.
The number of adders (half adders and full adders) in each stage is equal to the significand size minus one. For example, a 4x4 carry save multiplier is shown in the figure below and it has the following stages:
1. The first stage consists of three half adders.
2. Two middle stages; each consists of three full adders.
3. The vector merging stage consists of one half adder and two full adders.
The decimal point is between bits 45 and 46 in the significand multiplier result. The multiplication time taken by the carry save multiplier is determined by its critical path. The critical path starts at the AND gate of the first partial products (i.e. a1b0 and a0b1), passes through the carry logic of the first half adder and the carry logic of the first full adder of the middle stages, then passes through all the vector merging adders.
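The carry save idea can be modeled at word level in Python: partial products are accumulated as a separate sum vector and carry vector, so no carry propagates between bit positions until the final vector merging addition. This is a behavioral sketch of the technique under that simplification; it does not reproduce the exact adder count per stage of the array described above, and the function name is hypothetical.

```python
def carry_save_multiply(a: int, b: int, width: int = 24) -> int:
    """Multiply two width-bit unsigned significands using carry save accumulation."""
    sum_vec, carry_vec = 0, 0
    for i in range(width):
        # Partial product row i: AND of every bit of a with bit i of b.
        pp = (a << i) if (b >> i) & 1 else 0
        # Each bit position acts as an independent full adder (3:2 compression):
        # the sum bit stays in place, the carry moves one position left.
        new_sum = sum_vec ^ carry_vec ^ pp
        new_carry = ((sum_vec & carry_vec) | (sum_vec & pp) | (carry_vec & pp)) << 1
        sum_vec, carry_vec = new_sum, new_carry
    # Vector merging stage: the only place where carries propagate
    # (a ripple carry adder in the hardware description).
    return sum_vec + carry_vec
```

For two 24-bit significands the returned value is the 48-bit intermediate product.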
Partial product
Normalisation
Normaliser:
The result of the significand multiplication (intermediate product) must be normalized to have a leading 1 just to the left of the decimal point (i.e. in bit 46 of the intermediate product). Since the inputs are normalized numbers, the intermediate product has the leading one at bit 46 or 47. If the leading one is at bit 46 (i.e. to the left of the decimal point) then the intermediate product is already a normalized number and no shift is needed. If the leading one is at bit 47 then the intermediate product is shifted to the right and the exponent is incremented by 1. The shift operation is done using combinational shift logic made of multiplexers. Fig. 8 shows a simplified logic of a normaliser that has an 8 bit intermediate product input and a 6 bit intermediate exponent input.
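The normaliser's behavior on the full 48-bit intermediate product can be sketched in Python as below. The function name `normalise` is hypothetical, and the sketch models only the one-position shift and exponent increment, not the multiplexer structure.

```python
def normalise(ip: int, exponent: int):
    """Normalise a 48-bit intermediate product whose leading one is at bit 46 or 47.

    Returns (normalised_ip, adjusted_exponent) with the leading one at bit 46.
    """
    if (ip >> 47) & 1:
        # Leading one at bit 47: shift right by one and increment the exponent.
        # The bit shifted out at position 0 is dropped (no rounding).
        return ip >> 1, exponent + 1
    # Leading one already at bit 46: nothing to do.
    return ip, exponent
```

For the significands 1.111 and 1.01 (binary), whose product is 10.01011, the sketch shifts once and increments the exponent.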
are set to zero with the appropriate sign calculated from the inputs, and an underflow flag is raised. Assume that E1 and E2 are the exponents of the two numbers A and B respectively; the result's exponent is calculated by (6):
Eresult = E1 + E2 - 127 (6)
E1 and E2 can have values from 1 to 254, resulting in Eresult having values from -125 (2 - 127) to 381 (508 - 127); but for normalized numbers, Eresult can only have values from 1 to 254. Table III summarizes the different values of Eresult and the effect of normalization on it.
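A minimal Python sketch of the exponent range check follows. It assumes the flags are evaluated on the exponent after the normaliser's optional increment, which is a simplification chosen for illustration; the function name and the string labels are hypothetical.

```python
def classify_exponent(e1: int, e2: int, normaliser_increment: int = 0):
    """Compute Eresult = E1 + E2 - 127 (plus the normaliser increment) and classify it.

    Returns (exponent, status), where status is "underflow", "overflow"
    or "normalized".
    """
    if not (1 <= e1 <= 254 and 1 <= e2 <= 254):
        raise ValueError("inputs must be normalized: exponents in 1..254")
    e_result = e1 + e2 - 127 + normaliser_increment
    if e_result < 1:
        return e_result, "underflow"    # below the normalized range
    if e_result > 254:
        return e_result, "overflow"     # above the normalized range
    return e_result, "normalized"
```

The two extremes of the range quoted above, -125 and 381, are obtained with `classify_exponent(1, 1)` and `classify_exponent(254, 254)`.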
Three pipelining stages mean that the output has a latency of three clock cycles. The synthesis tool's retiming option was used so that the synthesizer uses its optimization logic to better place the pipelining registers across the critical path.
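The effect of three register stages on latency can be illustrated with a small Python model. It captures only the timing behavior (one new input accepted per clock, each result visible after the third clock edge), not how the logic is split between stages or how retiming moves the registers; the class name is hypothetical.

```python
class LatencyPipeline:
    """Behavioral model of a pipeline with a fixed number of register stages."""

    def __init__(self, func, stages: int = 3):
        self.func = func
        self.regs = [None] * stages   # register contents, stage 0 first

    def clock(self, operands):
        """Apply one clock edge with new operands; return the output register."""
        # Shift every register one stage forward and load the new result
        # into the first stage.
        self.regs = [self.func(operands)] + self.regs[:-1]
        return self.regs[-1]
```

Feeding one operand pair per clock, the first result appears at the output on the third clock edge, and one result follows on every edge after that.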