

1.1 Introduction

Power dissipation is recognized as a critical parameter in modern VLSI design. To keep pace with Moore's law and to produce consumer electronics with longer battery backup and less weight, low-power VLSI design is necessary. Fast multipliers and accumulators are essential parts of digital signal processing systems. The MAC is the unit that combines a multiplier and an adder, and it produces the sum of multiplication products. Of these two operations, multiplication takes more cycles than accumulation. The speed of the multiply operation is therefore of great importance in digital signal processing as well as in today's general-purpose processors, especially since media processing took off.

In the past, multiplication was generally implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be considered as a series of repeated additions: the number to be added is the multiplicand, the number of times that it is added is the multiplier, and the result is the product. Each step of addition generates a partial product. In most computers, the operands usually contain the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of the operands in order to preserve the information content. The repeated-addition method suggested by the arithmetic definition is so slow that it is almost always replaced by an algorithm that makes use of positional representation.

It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them. The basic multiplication principle is thus twofold: evaluation of partial products and accumulation of the shifted partial products. The accumulation is performed by successive additions of the columns of the shifted partial-product matrix. The multiplier is successively shifted and gates the appropriate bit of the multiplicand.
The delayed, gated instances of the multiplicand must all be in the same column of the shifted partial-product matrix. They are then added to form the product bit for that particular column. Multiplication is therefore a multi-operand operation. To extend multiplication to both signed and unsigned numbers, a convenient number system is the two's complement format.

The MAC (Multiplier and Accumulator) unit is used for image processing and digital signal processing (DSP) in a DSP processor. The MAC in this work uses Booth's radix-4 algorithm (the Modified Booth multiplier), and a 17-bit SPST adder improves speed and reduces power.

1.2 Background of Multiplier and Accumulator (MAC) unit

In computing, especially digital signal processing, multiply-accumulate is a common operation that computes the product of two numbers and adds that product to an accumulator. When done with floating-point numbers, it might be performed with two roundings (typical in many DSPs) or with a single rounding. When performed with a single rounding, it is called a fused multiply-add (FMA) or fused multiply-accumulate (FMAC).

Modern computers may contain a dedicated multiply-accumulate unit, or "MAC unit", consisting of a multiplier implemented in combinational logic followed by an adder and an accumulator register that stores the result when clocked. The output of the register is fed back to one input of the adder, so that on each clock the output of the multiplier is added to the register. Combinational multipliers require a large amount of logic, but can compute a product much more quickly than the shift-and-add method typical of earlier computers. The first processors to be equipped with MAC units were digital signal processors, but the technique is now common in general-purpose processors too.

A conventional MAC unit consists of a fast multiplier (here a Booth multiplier) and an accumulator that contains the signed or unsigned extended sum of the previous consecutive products. As shown in Figure 1.1, the multiplier and the adder are combinational circuits. The multiplier shown is the Booth multiplier, and the adder serves as the accumulator. The MAC output follows Equation 1.1:

$$F = \sum_{i=1}^{N} X_i Y_i \qquad (1.1)$$

Here $X_i$ and $Y_i$ are either unsigned or signed data inputs, and N is the number of products accumulated by the MAC unit. Two values X and Y of the same bit length are multiplied by the Booth multiplier, and the multiplication result is fed into the accumulator. The accumulator adds the Booth multiplier output to the output of the accumulator register. During the first clock cycle, the register is reset so that its output is zero.
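The register feedback described above can be illustrated with a minimal behavioral sketch. The class name `MacUnit` and its methods are hypothetical and chosen only for illustration; the model ignores bit widths and timing and simply shows the accumulate-per-clock behavior of Equation 1.1.

```python
class MacUnit:
    """Behavioral model of a multiply-accumulate unit with a feedback register.

    Hypothetical illustration only: no bit widths, no timing, no Booth encoding.
    """

    def __init__(self):
        self.acc = 0  # accumulator register, zero after reset

    def reset(self):
        """Clear the accumulator register."""
        self.acc = 0

    def clock(self, x, y):
        """One clock cycle: multiply, add to the register output, store the sum."""
        product = x * y
        self.acc = self.acc + product
        return self.acc


mac = MacUnit()
xs = [3, -2, 5]
ys = [4, 7, -1]
# Running value of F after each accumulated product
trace = [mac.clock(x, y) for x, y in zip(xs, ys)]
```

After the three cycles, `mac.acc` holds the sum of the three products, which is the value F of Equation 1.1 for N = 3.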

Figure 1.1 Block diagram of MAC

After the first clock cycle, the output of the accumulator is loaded into the register. In the third clock cycle, the accumulator adds the output of the Booth multiplier to the output of the register. In line with the objective of this project, the MAC unit design emphasizes improving speed and lowering power consumption. Two stages receive particular attention: the first is the partial-product reduction network and the second is the accumulator, since both of these stages require addition of large operands that involves long paths for carry propagation.

1.3 Objective

The main objective of this thesis is to design and implement a Multiplier and Accumulator. The multiplier combines a Modified Booth encoder with an SPST (Spurious Power Suppression Technique) adder. The design takes into account the smaller area of the Booth algorithm, which results from the reduced number of partial products, the faster accumulation of partial products, and the lower power consumption of partial-product addition with the SPST adder approach. In short, the objectives are:

To increase the MAC unit design speed by using the Booth multiplier.
To decrease the power consumption by using the SPST adder.

1.4 Methodology

This work proposes a new architecture of multiplier-and-accumulator (MAC) for high speed and low power by adopting the new SPST implementation approach. The multiplier is designed by equipping a modified Booth encoder with the Spurious Power Suppression Technique (SPST), controlled by a detection unit using an AND gate. The modified Booth encoder reduces the number of partial products generated by a factor of 2. The SPST adder avoids unwanted additions and thus minimizes the switching power dissipation. By combining multiplication with accumulation and devising a low-power SPST-equipped design, the performance was improved. In this project we used the Xilinx ISE Simulator for logical verification, then synthesized the design with the Xilinx XST tool for the target technology and performed placement and routing for system verification on the targeted FPGA.

1.5 Significance of this work

Digital signal processing (DSP) is used in a wide range of applications such as speech and audio coding, image and video processing, pattern recognition, and so on. In real-time Very Large Scale Integration (VLSI) implementations of DSP instructions, the system requires a hardware architecture that can process input signal samples as they are received. Most DSP computation involves multiply and multiply-accumulate operations, and therefore the Multiplier-Accumulator (MAC) unit is very important in DSP applications. Cryptography implementations also need a faster MAC unit. Because the designed MAC unit is optimized for both power and speed, it is well suited to both kinds of application. The multiplication circuit represents the core of the MAC unit.
Most of the effort on improving the performance of digital multiplication has focused on increasing the speed of operation and decreasing the power consumption.

In this project, a modified Booth encoder is used to increase the speed of operation and the SPST approach is used to decrease the power consumed by the MAC unit.

1.6 Applications

Multimedia and communication systems
Real-time signal processing such as audio signal processing, video/image processing, or large-capacity data processing
Cryptography algorithm implementations, such as computations in security algorithms

1.7 Outline of this report

This report is divided into 6 chapters. The following chapter (Chapter 2) presents the literature review of the MAC unit design. Chapter 3 explains the MAC unit design methodologies, whereas Chapter 4 introduces the FPGA, Xilinx, and Verilog. Chapter 5 contains the results and a discussion of all the waveforms and of the power consumption, design speed, and area analysis. Finally, Chapter 6 summarizes the results and draws the conclusions of this final-year project work. Some suggestions on possible future improvements are also discussed in that chapter.

1.8 Conclusion

This chapter has outlined the overview, the status of the problems, the objectives, and the methodology of this project. It has also introduced the design background of the multiplier-accumulator and the Booth multiplier. The functionality and how it works have been explained in this chapter. Lastly, the chapter also gives an overview of the contents of the following chapters.

1.9 Software used

Language used: VHDL
Xilinx 10.1: simulation
Xilinx 10.1: synthesis and implementation

1.10 Hardware used

Spartan-3 FPGA kit for implementation


2.1 Background of MAC

In the majority of digital signal processing (DSP) applications, the critical operations are multiplication and accumulation. Multiply-accumulate is a common operation that computes the product of two numbers and adds that product to an accumulator. When done with floating-point numbers, it might be performed with two roundings (typical in many DSPs) or with a single rounding. When performed with a single rounding, it is called a fused multiply-add (FMA) or fused multiply-accumulate (FMAC).

Figure 2.1 MAC Architecture with Radix 4 Booth Multiplier

Real-time signal processing requires a high-speed, high-throughput Multiplier-Accumulator (MAC) unit that consumes low power, which is always a key to achieving a high-performance digital signal processing system. The purpose of this work is to design and implement a low-power MAC unit with a block enabling technique to save power.

2.2 Basics of Multiplier


Multiplication is a mathematical operation that, at its simplest, is an abbreviated process of adding an integer to itself a specified number of times. A number (the multiplicand) is added to itself a number of times specified by another number (the multiplier) to form a result (the product). In elementary school, students learn to multiply by placing the multiplicand on top of the multiplier. The multiplicand is then multiplied by each digit of the multiplier, beginning with the rightmost, Least Significant Digit (LSD). Intermediate results (partial products) are placed one atop the other, offset by one digit to align digits of the same weight. The final product is determined by summing all the partial products. Although most people think of multiplication only in base 10, this technique applies equally to any base, including binary. Figure 2.2 shows the data flow for the basic multiplication technique just described. Each black dot represents a single digit.

Figure 2.2 Basic Multiplication

Here, we assume that the MSB represents the sign of the number. The operation of multiplication is rather simple in digital electronics. It has its origin in the classical algorithm for the product of two binary numbers. This algorithm uses addition and shift-left operations to calculate the product of two numbers. Based upon the above procedure, we can deduce an algorithm for any kind of multiplication, which is shown in Figure 2.3. The sign of the product can be determined at the initial stage from the signs of the operands, or after the whole result is obtained, when the MSB of the result gives the sign of the product.

Figure 2.3 Signed Multiplication Algorithm

2.3 Binary Multiplication

In the binary number system the digits, called bits, are limited to the set {0, 1}. The result of multiplying any binary number by a single bit is either 0 or the original number. This makes forming the intermediate partial products simple and efficient. Summing these partial products is the time-consuming task for binary multipliers. One logical approach is to form the partial products one at a time and sum them as they are generated. Often implemented in software on processors that do not have a hardware multiplier, this technique works, but it is slow because at least one machine cycle is required to sum each additional partial product. For applications where this approach does not provide enough performance, multipliers can be implemented directly in hardware.

The two main categories of binary multiplication are signed and unsigned. Digit multiplication is a series of bit shifts and bit additions, in which the two numbers, the multiplicand and the multiplier, are combined into the result. Considering the bit representation of the multiplicand x = x_{n-1} ... x_1 x_0 and the multiplier y = y_{n-1} ... y_1 y_0, up to n shifted copies of the multiplicand are added to form the product in unsigned multiplication. The entire process consists of three steps: partial product generation, partial product reduction, and final addition.

2.4 Multiplication process

The simplest multiplication operation is to directly calculate the product of two numbers by hand. This procedure can be divided into three steps: partial product generation, partial product reduction, and the final addition. To further specify the operation process, let us calculate by hand the product of two two's complement numbers, for example 1101 in binary (-3 in decimal) and 0101 in binary (5 in decimal), as described in Figure 2.4.

Figure 2.4 Multiplication calculations by hand

The bold italic digits are the sign-extension bits of the partial products. The first operand is called the multiplicand and the second the multiplier. The intermediate products are called partial products and the final result is called the product. When this method is directly mapped to hardware, the multiplication process is as shown in Figure 2.5. As can be seen in the figures, the multiplication operation in hardware consists of partial product (PP) generation, PP reduction, and final addition steps. The two rows before the product are called the sum and carry bits. The method takes one multiplier bit at a time from right to left, multiplies the multiplicand by that single bit, and shifts the intermediate product one position to the left of the earlier intermediate products. All the bits of the partial products in each column are added to obtain two bits: sum and carry. Finally, the sum and carry bits in each column have to be summed. Similarly, for the multiplication of an n-bit multiplicand and an m-bit multiplier, a product that is n + m bits long is generated from m partial products. The method shown in Figure 2.5 is also called a non-Booth encoding scheme.
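The hand calculation above can be sketched in a few lines. The helper names `to_signed`, `partial_products`, and `multiply` are hypothetical, and the sketch assumes the usual two's complement convention that the partial product for the multiplier's sign bit is subtracted rather than added. It is an illustration of the non-Booth scheme, not the circuit used in this work.

```python
def to_signed(value, bits):
    """Interpret the low `bits` bits of value as a two's complement number."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value >> (bits - 1) else value


def partial_products(multiplicand, multiplier, n, m):
    """Return the m shifted, sign-extended partial products as (n+m)-bit patterns."""
    width = n + m
    mask = (1 << width) - 1
    a = to_signed(multiplicand, n)
    rows = []
    for i in range(m):
        bit = (multiplier >> i) & 1
        term = (a * bit) << i          # multiplicand gated by one multiplier bit
        if i == m - 1:
            term = -term               # the sign bit carries weight -2^(m-1)
        rows.append(term & mask)       # masking performs the sign extension
    return rows


def multiply(multiplicand, multiplier, n, m):
    """Sum the partial products and read the (n+m)-bit result as signed."""
    total = sum(partial_products(multiplicand, multiplier, n, m))
    return to_signed(total, n + m)


# The operands from the text: 1101 (-3) times 0101 (5)
rows = [format(r, "08b") for r in partial_products(0b1101, 0b0101, 4, 4)]
product = multiply(0b1101, 0b0101, 4, 4)
```

Each entry of `rows` is one line of the hand calculation, with the leading 1s playing the role of the sign-extension bits, and there are m = 4 partial products for the 4-bit multiplier.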

Figure 2.5 Multiplication Operation in hardware

2.5 Modified Booth Encoder

2.5.1 Booth's Multiplication Algorithm

Booth's algorithm is a multiplication algorithm that multiplies two signed binary numbers in two's complement notation. It was published by Andrew Donald Booth in 1951. It can be faster than the normal multiplication algorithm because runs of identical multiplier bits require only shift operations instead of additions. Booth's algorithm is widely used in hardware and software multipliers because its application makes it possible to reduce the number of partial products. It handles both positive and negative operands in 2's complement representation.


Procedure

Booth's algorithm involves repeatedly adding one of two predetermined values A and S to a product P, where A and S stand for addition and subtraction, and then performing a rightward arithmetic shift on P. Let m and r be the multiplicand and multiplier, respectively, and let x and y represent the number of bits in m and r.

I. Determine the values of A and S, and the initial value of P. All of these numbers should have a length equal to (x + y + 1).
1. A: Fill the most significant (leftmost) bits with the value of m. Fill the remaining (y + 1) bits with zeros.
2. S: Fill the most significant bits with the value of (-m) in two's complement notation. Fill the remaining (y + 1) bits with zeros.
3. P: Fill the most significant x bits with zeros. To the right of this, append the value of r. Fill the least significant (rightmost) bit with a zero.

II. Determine the two least significant (rightmost) bits of P.
If they are 01, find the value of P + A. Ignore any overflow.
If they are 10, find the value of P + S. Ignore any overflow.
If they are 00, do nothing. Use P directly in the next step.
If they are 11, do nothing. Use P directly in the next step.

III. Arithmetically shift the value obtained in step II by a single place to the right. Let P now equal this new value.

IV. Repeat steps II and III until they have been done y times.

V. Drop the least significant (rightmost) bit from P. This is the product of m and r.

Example

Find 3 × (-4), with m = 3 and r = -4, and x = 4 and y = 4:


A = 0011 0000 0
S = 1101 0000 0
P = 0000 1100 0

Perform the loop four times:
P = 0000 1100 0. The last two bits are 00.
P = 0000 0110 0. Arithmetic right shift.
P = 0000 0110 0. The last two bits are 00.
P = 0000 0011 0. Arithmetic right shift.
P = 0000 0011 0. The last two bits are 10.
P = 1101 0011 0. P = P + S.
P = 1110 1001 1. Arithmetic right shift.
P = 1110 1001 1. The last two bits are 11.
P = 1111 0100 1. Arithmetic right shift.

The product is 1111 0100, which is -12.

In this algorithm the number of iterations equals the number of multiplier bits. The technique is inadequate when the multiplicand is the most negative representable number; for example, if the multiplicand has 4 bits, this value is -8. To make better use of the algorithm, the multiplicand and multiplier may be swapped before the operation starts, since X × Y and Y × X are the same. For the bit pairs 00 and 11, only an arithmetic right shift is performed, whereas for 01 an addition is also needed along with the shift. Likewise, for 10 a subtraction (that is, an addition of S) is needed alongside the shift. Because of these extra computations, it is better to choose as the multiplier the value that has a long run of 1s, or fewer 01 and 10 pairs, and to use the other value as the multiplicand. For example, to multiply 1110 and 1010, the first should be chosen as the multiplier and the second as the multiplicand.

2.5.2 Modified Booth Algorithm

A modification of the Booth algorithm was proposed by MacSorley, in which a triplet of bits is scanned instead of two bits. The Booth-MacSorley algorithm, usually called the Modified Booth algorithm or simply the Booth algorithm, can be generalized to any radix. This technique has the advantage of reducing the number of partial products by one half regardless of the inputs. The recoding is performed in two steps: encoding and selection. The purpose of the encoding is to scan the triplet of bits of the multiplier and define the operation to be performed on the multiplicand, as shown in Figure 2.6.
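Before the triplet-based recoding is examined further, the two-bit procedure of Section 2.5.1 (the A, S, and P registers) can be sketched as follows. The function name `booth_multiply` is hypothetical, and the sketch inherits the limitation noted in the text: it does not handle the most negative multiplicand.

```python
def booth_multiply(m, r, x, y):
    """Multiply m (x bits) by r (y bits) using the A/S/P procedure.

    Hypothetical sketch; m must not be the most negative x-bit value.
    """
    width = x + y + 1
    mask = (1 << width) - 1
    x_mask = (1 << x) - 1
    a = ((m & x_mask) << (y + 1)) & mask       # A: m in the top x bits
    s = ((-m & x_mask) << (y + 1)) & mask      # S: -m in the top x bits
    p = (r & ((1 << y) - 1)) << 1              # P: zeros, then r, then a 0 bit
    for _ in range(y):
        pair = p & 0b11                        # two least significant bits
        if pair == 0b01:
            p = (p + a) & mask                 # add A, ignore overflow
        elif pair == 0b10:
            p = (p + s) & mask                 # add S, ignore overflow
        sign = p >> (width - 1)
        p = (p >> 1) | (sign << (width - 1))   # arithmetic right shift by one
    p >>= 1                                    # drop the extra rightmost bit
    bits = x + y
    return p - (1 << bits) if p >> (bits - 1) else p


product = booth_multiply(3, -4, 4, 4)
```

Tracing the call above by hand reproduces the register values of the 3 × (-4) example in Section 2.5.1, ending with the bit pattern 1111 0100.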

Figure 2.6 Implementation of Modified Booth Recoding

For example, a 3-bit recoding would require the following set of digits to be multiplied by the multiplicand: -3, -2, -1, 0, +1, +2, +3. The difficulty lies in the fact that +3Y is computed by adding Y to +2Y, which means that a carry propagation occurs. Booth recoding necessitates the internal use of 2's complement representation in order to efficiently perform subtraction as well as addition of the partial products. The advantage of Booth recoding is that it generates only half of the partial products compared to a multiplier implementation that does not use Booth recoding. However, the benefit comes at the expense of increased hardware complexity. Indeed, this implementation requires hardware for the encoding and for the selection of the partial products (-2Y, -Y, 0, +Y, +2Y).
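The encoding and selection steps can be sketched for the radix-4 case. The function names are hypothetical, and the digit rule used here, -2·b2 + b1 + b0 for each overlapping triplet (b2, b1, b0), is the standard radix-4 Booth rule assumed for illustration rather than taken from Figure 2.6.

```python
def booth_radix4_digits(multiplier, n):
    """Encode an n-bit two's complement multiplier (n even) into n/2 digits.

    Each digit is in {-2, -1, 0, +1, +2}; hypothetical helper for illustration.
    """
    padded = (multiplier & ((1 << n) - 1)) << 1   # append the implicit bit y[-1] = 0
    digits = []
    for i in range(n // 2):
        triplet = (padded >> (2 * i)) & 0b111     # bits y[2i+1], y[2i], y[2i-1]
        b2 = (triplet >> 2) & 1
        b1 = (triplet >> 1) & 1
        b0 = triplet & 1
        digits.append(-2 * b2 + b1 + b0)
    return digits


def booth_radix4_multiply(multiplicand, multiplier, n):
    """Select -2Y, -Y, 0, +Y, or +2Y per digit and sum the weighted partial products."""
    total = 0
    for i, d in enumerate(booth_radix4_digits(multiplier, n)):
        total += (d * multiplicand) << (2 * i)    # each digit has weight 4^i
    return total


digits = booth_radix4_digits(0b0110, 4)
product = booth_radix4_multiply(-3, 6, 4)
```

A 4-bit multiplier yields only two digits, so two partial products are summed instead of four, which is the halving described above.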

Figure 2.7 System Architecture for Multiplier With Radix4 MBA

2.6 Overview of MAC

A multiplier can be divided into three operational steps. The first is radix-4 Booth encoding, in which the partial products are generated from the multiplier X and the multiplicand Y. The second is the adder array, or partial product compression, which adds all the partial products. The last is the final addition, in which the process of accumulating the multiplied results is included. The general hardware architecture of this MAC is shown in Fig. 1.2. It executes the multiplication operation by multiplying the input multiplier X and the multiplicand Y. This is added to the previous multiplication result Z as the accumulation step. An N-bit 2's complement binary number can be expressed as

$$X = -2^{N-1} x_{N-1} + \sum_{i=0}^{N-2} x_i 2^i, \quad x_i \in \{0, 1\} \qquad (2.1)$$

If (2.1) is expressed in base-4 redundant signed-digit form in order to apply the radix-4 Booth's algorithm, it becomes

$$X = \sum_{i=0}^{N/2-1} d_i 4^i \qquad (2.2)$$

$$d_i = -2x_{2i+1} + x_{2i} + x_{2i-1}, \quad x_{-1} = 0 \qquad (2.3)$$

If (2.2) is used, multiplication can be expressed as

$$X \times Y = \sum_{i=0}^{N/2-1} d_i 4^i Y \qquad (2.4)$$

If these equations are used, the aforementioned multiplication-accumulation result can be expressed as

$$P = X \times Y + Z = \sum_{i=0}^{N/2-1} d_i 4^i Y + Z \qquad (2.5)$$

Each of the two terms on the right-hand side of (2.5) is calculated independently, and the final result is produced by adding the two results. The MAC architecture implemented by (2.5) is called the standard design. If N-bit data are multiplied, the number of generated partial products is proportional to N. If they are added serially, the execution time is also proportional to N. The fastest multiplier architecture uses radix-4 Booth encoding to generate the partial products. If radix-4 Booth encoding is used, the number of partial products is reduced to half, which shortens the partial-product addition step. In addition, signed multiplication based on 2's complement numbers is also possible. For these reasons, most current multipliers adopt Booth encoding.
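As a small numerical check of the base-4 recoding, the following worked example assumes the usual radix-4 Booth digit definition, $d_i = -2x_{2i+1} + x_{2i} + x_{2i-1}$ with $x_{-1} = 0$. The operands N = 4, X = 0110 (6 in decimal), and Y = 3 are illustrative choices and do not come from the original text.

```latex
\begin{aligned}
d_0 &= -2x_1 + x_0 + x_{-1} = -2(1) + 0 + 0 = -2,\\
d_1 &= -2x_3 + x_2 + x_1 = -2(0) + 1 + 1 = 2,\\
X &= d_0 4^0 + d_1 4^1 = -2 + 8 = 6,\\
X \times Y &= d_0 4^0 Y + d_1 4^1 Y = -6 + 24 = 18.
\end{aligned}
```

Only N/2 = 2 partial products are needed for the 4-bit multiplier, in line with the halving of partial products described above.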

2.7 Existing MAC Architecture


A MAC is composed of an adder, a multiplier, and an accumulator. The adders implemented are usually Carry-Select or Carry-Save adders, as speed is of utmost importance in DSP (Chandrakasan, Sheng, & Brodersen, 1992 and Weste & Harris, 3rd Ed). One implementation of the multiplier could be a parallel array multiplier. The inputs for the MAC are fetched from a memory location and fed to the multiplier block of the MAC, which performs the multiplication and gives the result to the adder, which accumulates the result and then stores it into a memory location. This entire process is to be achieved in a single clock cycle (Weste & Harris, 3rd Ed).

The architecture of the MAC unit designed in this work consists of one 16-bit register, one 16-bit Modified Booth multiplier, and a 32-bit accumulator. To multiply the values of A and B, a Modified Booth multiplier is used instead of a conventional multiplier because it can increase the MAC unit design speed and reduce multiplication complexity. An SPST adder is used for the addition of partial products, and a register is used for accumulation. The operation of the designed MAC unit is as in Equation 1.1. The product A_i × B_i is always fed into the 32-bit accumulator and then added to the next product A_i × B_i. This MAC unit is capable of multiplying and adding to the previous result consecutively, one product after another.


Multiplication and accumulation are the essential operations of digital signal processing tasks such as filtering, convolution, and inner products. Most digital signal processing methods use nonlinear functions such as the discrete cosine transform (DCT) [2] or the discrete wavelet transform (DWT) [3]. Because they are basically accomplished by repetitive application of multiplication and addition, the speed of the multiplication and addition arithmetic determines the execution speed and performance of the entire calculation. Because the multiplier requires the longest delay among the basic operational blocks in a digital system, the critical path is, in general, determined by the multiplier. For high-speed multiplication, the modified radix-4 Booth's algorithm (MBA) [4] is commonly used. However, this cannot completely solve the problem due to the long critical path for multiplication [5], [6].

In general, a multiplier uses Booth's algorithm [7] and an array of full adders (FAs), or a Wallace tree [8] instead of the array of FAs; that is, the multiplier mainly consists of three parts: a Booth encoder, a tree to compress the partial products such as a Wallace tree, and a final adder [9], [10]. Because the Wallace tree adds the partial products from the encoder as much in parallel as possible, its operation time is proportional to log N, where N is the number of inputs. It uses the fact that counting the number of 1s among the inputs reduces the number of outputs to about log N. In a real implementation, many (3:2) or (7:3) counters are used to reduce the number of outputs in each pipeline step. The most effective way to increase the speed of a multiplier is to reduce the number of partial products, because multiplication proceeds as a series of additions of the partial products. To reduce the number of calculation steps for the partial products, the MBA algorithm has mostly been applied, while the Wallace tree has taken the role of increasing the speed of adding the partial products.
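The role of the (3:2) counters can be illustrated with a short sketch that repeatedly compresses groups of three operands into a sum word and a carry word until only two words remain. The function names `csa` and `wallace_reduce` are hypothetical, operands are assumed to be non-negative integers, and the grouping order is one simple choice among several possible Wallace-style schedules.

```python
def csa(a, b, c):
    """Bitwise (3:2) counter: three operands in, a sum word and a carry word out."""
    s = a ^ b ^ c                                  # per-bit sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1     # per-bit majority, shifted left
    return s, carry


def wallace_reduce(operands):
    """Reduce non-negative operands to two words; return (sum, carry, levels)."""
    rows = list(operands)
    levels = 0
    while len(rows) > 2:
        nxt = []
        for i in range(0, len(rows) - 2, 3):
            nxt.extend(csa(rows[i], rows[i + 1], rows[i + 2]))
        nxt.extend(rows[len(rows) - len(rows) % 3:])   # leftover rows pass through
        rows = nxt
        levels += 1
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1], levels


partials = [13, 7, 255, 1, 90, 42, 8, 100]
s, c, levels = wallace_reduce(partials)
total = s + c   # a single carry-propagate addition finishes the job
```

No carry propagates across bit positions inside `csa`, which is why each level has constant delay and only the last addition of `s` and `c` needs a carry-propagate adder.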
To increase the speed of the MBA algorithm, many parallel multiplication architectures have been researched [11]-[13]. Among them, the architectures based on the Baugh-Wooley algorithm (BWA) have been developed, and they have been applied to various digital filtering calculations [14]-[16]. One of the most advanced types of MAC for general-purpose digital signal processing has been proposed by Elguibaly [17]. It is an architecture in which accumulation has been combined with the carry save adder (CSA) tree that compresses partial products. In the architecture proposed in [17], the critical path was reduced by eliminating the adder for accumulation and decreasing the number of input bits in the final adder. While it has better performance than the previous MAC architectures because of the reduced critical path, there is a need to improve the output rate, because the final adder results are used for accumulation. An architecture that merges the adder block into the accumulator register in the MAC operator was proposed in [18], making it possible to use two separate N/2-bit adders instead of one N-bit adder to accumulate the N-bit MAC results. Recently, Zicari proposed an architecture that used a merging technique to fully utilize a compressor, and that also took this compressor as the basic building block for the multiplication circuit.

Figure 3.1 Basic arithmetic steps of multiplication and accumulation.

In this paper, a new architecture for a high-speed MAC is proposed. In this MAC, the computations of multiplication and accumulation are combined, and a hybrid-type CSA structure is proposed to reduce the critical path and improve the output rate. It uses the MBA algorithm based on the 1's complement number system. A modified array structure for the sign bits is used to increase the density of the operands. A carry lookahead adder (CLA) is inserted in the CSA tree to reduce the number of bits in the final adder. In addition, in order to increase the output rate by optimizing the pipeline efficiency, intermediate calculation results are accumulated in the form of sum and carry instead of the final adder outputs. This paper is organized as follows. In Section II, a simple introduction to a general MAC is given, and the architecture of the proposed MAC is described in Section III. In Section IV, the implementation result is analyzed and the characteristics of the proposed MAC are shown.

3.1 Overview of MAC

Figure 3.2 Hardware architecture of general MAC

In this section, the basic MAC operation is introduced. A multiplier can be divided into three operational steps. The first is radix-4 Booth encoding, in which the partial products are generated from the multiplier X and the multiplicand Y. The second is the adder array, or partial product compression, which adds all the partial products and converts them into the form of sum and carry. The last is the final addition, in which the final multiplication result is produced by adding the sum and the carry. If the process of accumulating the multiplied results is included, a MAC consists of four steps, as shown in Figure 3.1, which shows the operational steps explicitly. A general hardware architecture of this MAC is shown in Figure 3.2. It executes the multiplication operation by multiplying the input multiplier X and the multiplicand Y. This is added to the previous multiplication result Z as the accumulation step.


Each of the two terms on the right-hand side of (2.5) is calculated independently, and the final result is produced by adding the two results. The MAC architecture implemented by (2.5) is called the standard design [6]. If N-bit data are multiplied, the number of generated partial products is proportional to N. If they are added serially, the execution time is also proportional to N. The fastest multiplier architecture uses radix-4 Booth encoding to generate the partial products and a Wallace tree based on CSA as the adder array to add them. If radix-4 Booth encoding is used, the number of partial products, i.e., the inputs to the Wallace tree, is reduced to half, resulting in fewer CSA tree steps. In addition, signed multiplication based on 2's complement numbers is also possible. For these reasons, most current multipliers adopt Booth encoding.

3.2 Proposed MAC Architecture

In this section, the expression for the new arithmetic is derived from the equations of the standard design, and a hybrid-type CSA architecture that can satisfy the operation of the proposed MAC is proposed. If an operation that multiplies two N-bit numbers and accumulates into a 2N-bit number is considered, the critical path is determined by the 2N-bit accumulation operation.


The delay of the last accumulator must be reduced in order to improve the performance of the MAC. The overall performance of the proposed MAC is improved by eliminating the accumulator itself and combining it with the CSA function. Once the accumulator has been eliminated, the critical path is determined by the final adder in the multiplier. The basic method to improve the performance of the final adder is to decrease the number of input bits. In order to reduce this number of input bits, the multiple partial products are compressed into a sum and a carry by the CSA. The number of bits of sums and carries to be transferred to the final adder is reduced by adding the lower bits of the sums and carries in advance, within the range in which the overall performance will not be degraded. A 2-bit CLA is used to add the lower bits in the CSA. In addition, to increase the output rate when pipelining is applied, the sums and carries from the CSA are accumulated instead of the outputs from the final adder, in such a way that the sum and carry from the CSA in the previous cycle are fed back into the CSA. Due to this feedback of both sum and carry, the number of inputs to the CSA increases compared to the standard design and [17]. In order to handle the increased amount of data efficiently, the CSA architecture is modified to treat the sign bit.
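The idea of accumulating in sum-and-carry form can be sketched behaviorally. The names `csa` and `CarrySaveMac` are hypothetical, and the model is simplified: the product is treated as a single operand rather than as Booth partial products, and the 2-bit CLA on the lower bits and the sign-bit handling are omitted. It only shows that the carry-propagate addition can be deferred until a result is actually needed.

```python
def csa(a, b, c, width):
    """(3:2) compressor on width-bit words; sum + carry equals a + b + c modulo 2**width."""
    mask = (1 << width) - 1
    s = (a ^ b ^ c) & mask
    carry = (((a & b) | (a & c) | (b & c)) << 1) & mask
    return s, carry


class CarrySaveMac:
    """Hypothetical MAC model that keeps its running total as a (sum, carry) pair."""

    def __init__(self, width=32):
        self.width = width
        self.mask = (1 << width) - 1
        self.sum = 0
        self.carry = 0

    def clock(self, x, y):
        """Per cycle: compress the new product with the fed-back sum and carry."""
        product = (x * y) & self.mask
        self.sum, self.carry = csa(product, self.sum, self.carry, self.width)

    def result(self):
        """Final adder: run only when the accumulated value is read out."""
        total = (self.sum + self.carry) & self.mask
        if total >> (self.width - 1):
            total -= 1 << self.width   # interpret as two's complement
        return total


mac = CarrySaveMac(width=32)
for x, y in [(3, 4), (-2, 7), (5, -1)]:
    mac.clock(x, y)
final = mac.result()
```

In this sketch `clock` contains no carry-propagate addition at all, so the per-cycle path is a single compressor level, while `result` plays the role of the final adder.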

Figure 3.3 Hardware architecture of the proposed MAC.


Figure 3.4 Proposed arithmetic operation of multiplication and accumulation.

If the MAC process proposed in the previous section is rearranged, it becomes the process shown in Fig. 3.4, in which the MAC is organized into three steps. Compared with Fig. 1, the main difference is that the accumulation has been merged into the process of adding the partial products. Another big difference from Fig. 1 is that the final addition in step 3 is not always run, even though this does not appear explicitly in Fig. 3.4. Since accumulation is carried out using the result from step 2 instead of that from step 3, step 3 does not have to be run until the result of the final accumulation is needed. The hardware architecture of the MAC that satisfies the process in Fig. 3.4 is shown in Fig. 3.3. The n-bit MAC inputs are converted into partial products by passing through the Booth encoder. In the CSA and accumulator, accumulation is carried out along with the addition of the partial products. As a result, the sum, the carry, and the result from adding the lower bits of the sum and carry are generated. These three values are fed back and used for the next accumulation. If the final result of the MAC is needed, the upper bits are generated by adding the sum and carry in the final adder and are combined with the lower bits that were already generated.

The architecture of the hybrid-type CSA that complies with the operation of the proposed MAC is shown in Fig. 3.5, which performs an 8 × 8-bit operation. It was formed based on (12). In Fig. 3.5, one input simplifies the sign extension and another compensates the 1's complement number into a 2's complement number. The other inputs are the bits of the feedback sum and carry, the bits of the sum of the lower bits of each partial product that were added in advance together with the previous such result, and the bits of each partial product. Since the multiplier is for 8 bits, a total of four partial products are generated by the Booth encoder. In (11), and correspond to and , respectively. This CSA requires at least four rows of FAs for the four partial products. Thus, a total of five FA rows are necessary, since one more row is needed for accumulation. For an n-bit MAC operation, the number of CSA levels is n/2 + 1. The white square in Fig. 3.5 represents an FA and the gray square a half adder (HA). The rectangular symbol with five inputs is a 2-bit CLA with a carry input.

The critical path in this CSA is determined by the 2-bit CLA. It is also possible to implement the CSA with FAs and without CLAs. However, if the lower bits of the previously generated partial product are not processed in advance by the CLAs, the number of bits for the final adder increases, which degrades performance when the entire multiplier or MAC is considered. In Table I, the characteristics of the proposed CSA architecture are summarized and briefly compared with other architectures.
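The five-input block mentioned above (two 2-bit operands plus a carry input) can be sketched with the usual generate/propagate equations of a carry-lookahead adder. This is a generic 2-bit CLA model, not the exact cell used in the proposed CSA; the function name and signal names are illustrative.

```python
def cla2(a, b, cin):
    """2-bit carry-lookahead adder with carry input.

    a and b are 2-bit values (0..3), cin is 0 or 1.
    Returns (2-bit sum, carry out).
    """
    a0, a1 = a & 1, (a >> 1) & 1
    b0, b1 = b & 1, (b >> 1) & 1

    # Generate and propagate terms for each bit position.
    g0, p0 = a0 & b0, a0 ^ b0
    g1, p1 = a1 & b1, a1 ^ b1

    # Both carries are computed directly from the inputs (no ripple).
    c1 = g0 | (p0 & cin)
    c2 = g1 | (p1 & g0) | (p1 & p0 & cin)

    s0 = p0 ^ cin
    s1 = p1 ^ c1
    return (s1 << 1) | s0, c2


print(cla2(3, 2, 1))  # 3 + 2 + 1 = 6 -> sum bits 0b10, carry out 1 -> (2, 1)
```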


Figure 3.5 Proposed CSA Architecture

Table 3.1 Gate size of logic circuit element


Table 3.2 Estimation of gate size by synthesis

3.3 Implementation And Experiment

In this section, the proposed MAC is implemented and analyzed, and then compared with previous work. First, the amount of hardware resources used in the implementation is analyzed theoretically and experimentally; then the delay of the hardware is analyzed by simplifying Sakurai's alpha power law [20]. Finally, the pipeline stages are defined and the performance is analyzed based on this pipelining scheme. The implementation results in each section are compared with the standard design [6] and Elguibaly's design [17], which are the most representative parallel MBA architectures.

3.3.1 Analysis of hardware resource

The three architectures mentioned above are analyzed to compare their hardware resources, and the results are given in Table II. In calculating the amount of hardware resources, the resources for the Booth encoder are excluded, on the assumption that identical encoders are used in all the designs. The hardware resources in Table II are the result of counting all the logic elements for a general 16-bit architecture. The 90 nm CMOS HVT standard cell library from TSMC was used as the hardware library for the 16-bit designs. The gate count for each design was obtained by synthesizing the logic elements in an optimal form and multiplying the result by the estimated number of hardware resources. The gate counts for the circuit elements obtained through synthesis are shown in Table 3.1 and are expressed in units of a two-input NAND gate. Let us first examine the gate counts of several elements in Table 3.1. Since the gate count is 3.2 for an HA and 6.7 for an FA, an FA is about twice as large as an HA. The gate count for a 2-bit CLA is slightly larger than that of an FA. In other words, even if a 2-bit CLA is used to add the lower bits of the partial products in the proposed CSA architecture, the hardware resources do not increase significantly. As Table II shows, the standard design uses the most hardware resources and the proposed architecture uses the least. The proposed architecture optimizes the resources for the CSA by using both FAs and HAs. By reducing the number of input bits to the final adder, the gate count of the final adder was also reduced.

3.3.2 Gate count by synthesis

The proposed MAC and [17] were implemented at register-transfer level (RTL) using a hardware description language (HDL). The designed circuits were synthesized using the Design Compiler from Synopsys, Inc., and the gate counts of the resulting netlists were measured and summarized in Table 3.2. The circuits in Table 3.2 are 16-bit MACs. In order to examine the circuit characteristics for different CMOS processes, the four most popular process libraries for manufacturing digital semiconductors (0.25 µm, 0.18 µm, 0.13 µm, and 90 nm) were used. It can be seen that the finer the process is, the smaller the number of gates is. As shown in Table II, the gate count for our architecture is slightly smaller than that in [17]. It must be kept in mind that if a circuit is implemented as part of a larger circuit, the number of gates may change depending on the timing of the entire circuit and the electrical environment, even when identical constraints are applied in synthesis. The results in Table 3.2 are for the combinational circuits without sequential elements. The total gate count is equal to the sum of the gate counts of the Booth encoder, the CSA, and the final adder.


Figure 3.6 Pipelined hardware structure: (a) proposed structure, (b) Elguibaly's structure

3.4 High speed Booth Encoded Multiplier Design

Fast multipliers are essential parts of digital signal processing systems. The speed of the multiply operation is of great importance in digital signal processing as well as in today's general-purpose processors, especially since media processing took off. In the past, multiplication was generally implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times that it is added is the multiplier, and the result is the product. Each step of addition generates a partial product. In most computers, the operands usually contain the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of the operands in order to preserve the information content. This repeated-addition method suggested by the arithmetic definition is so slow that it is almost always replaced by an algorithm that makes use of positional representation. It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them.

The basic multiplication principle is twofold: evaluation of partial products and accumulation of the shifted partial products. It is performed by successive additions of the columns of the shifted partial product matrix. The multiplier is successively shifted and gates the appropriate bit of the multiplicand. The delayed, gated instances of the multiplicand must all be in the same column of the shifted partial product matrix. They are then added to form the product bit for the particular form. Multiplication is therefore a multi-operand operation. The multiplication is to be extended to both signed and unsigned operands.

3.5 Modified Booth Encoder Design

3.5.1 Booth Encoding

Figure 3.7 Grouping of bits from Multiplier term

In order to achieve high-speed multiplication, multiplication algorithms using parallel counters, such as the modified Booth algorithm, have been proposed, and some multipliers based on these algorithms have been implemented for practical use. This type of multiplier operates much faster than an array multiplier for longer operands because its computation time is proportional to the logarithm of the word length of the operands. Booth multiplication is a technique that allows for smaller, faster multiplication circuits by recoding the numbers that are multiplied. Using radix-4 Booth recoding, it is possible to reduce the number of partial products by half. The basic idea is that, instead of shifting and adding for every column of the multiplier term and multiplying by 1 or 0, we take only every second column and multiply by ±1, ±2, or 0 to obtain the same result. The advantage of this method is the halving of the number of partial products.

To Booth recode the multiplier term, we consider the bits in blocks of three, such that each block overlaps the previous block by one bit. Grouping starts from the LSB, and the first block uses only two bits of the multiplier. Figure 3.7 shows the grouping of bits from the multiplier term for use in modified Booth encoding. Each block is decoded to generate the correct partial product. The encoding of the multiplier Y using the modified Booth algorithm generates the five signed digits -2, -1, 0, +1, +2. Each encoded digit in the multiplier performs a certain operation on the multiplicand X, as illustrated in Table 3.3.

Table 3.3 Decode values for a group in MBE

For the partial product generation, we adopt the radix-4 modified Booth algorithm to reduce the number of partial products by roughly one half. For multiplication of 2's complement numbers, the two-bit encoding using this algorithm scans a triplet of bits. When the multiplier B is divided into groups of two bits, the algorithm is applied to each group of divided bits. Figure 3.8 shows a computing example of Booth multiplying the two numbers 2AC9 and 006A. The shadow denotes that the numbers in this part of the Booth multiplication are all zero, so that this part of the computation can be neglected. Saving those computations can significantly reduce the power consumption caused by transient signals. According to the analysis of the multiplication shown in figure 4, we propose the SPST-equipped modified Booth encoder, which is controlled by a detection unit. The detection unit has one of the two operands as its input to decide whether the Booth encoder calculates redundant computations. As shown in figure 9, the latches can freeze the inputs of MUX-4 to MUX-7, or only those of MUX-6 to MUX-7, when PP4 to PP7 or PP6 to PP7, respectively, are zero, in order to reduce the transition power dissipation. Figure 10 shows the Booth partial product generation circuit. It includes AND/OR/EX-OR logic.
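The observation that some partial products are entirely zero can be checked with a short radix-4 Booth recoding sketch. It assumes that 006A is the Booth-encoded operand of the example and that operands are 16 bits wide; the function names are illustrative, and the code models only the recoding, not the latch-based detection unit.

```python
def booth_digits(b, nbits=16):
    """Radix-4 Booth digits of an nbits-wide two's complement value, LSB group first."""
    b &= (1 << nbits) - 1
    ext = b << 1                      # append the implicit bit b[-1] = 0
    digits = []
    for i in range(0, nbits, 2):
        t = (ext >> i) & 0b111        # overlapping triplet b[i+1], b[i], b[i-1]
        low, mid, high = t & 1, (t >> 1) & 1, (t >> 2) & 1
        digits.append(low + mid - 2 * high)
    return digits


def zero_partial_products(b, nbits=16):
    """Indices of partial products that are zero because their Booth digit is 0."""
    return [i for i, d in enumerate(booth_digits(b, nbits)) if d == 0]


digits = booth_digits(0x006A)
print(digits)                         # [-2, -1, -1, 2, 0, 0, 0, 0]
print(zero_partial_products(0x006A))  # [4, 5, 6, 7]
```

For this operand the four upper digits are zero, which corresponds to the case in which PP4 to PP7 need not be computed.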

Figure 3.8 Illustration of multiplication using modified Booth Encoding

The PP generator generates five candidates for the partial products, i.e., {-2A, -A, 0, A, 2A}. These are then selected according to the Booth encoding results of the operand B. When the operand other than the Booth-encoded one has a small absolute value, there are opportunities to reduce the spurious power dissipated in the compression tree.

Figure 3.9 Modified Booth Encoder


3.5.2 Partial product generator

The first step of the multiplication generates from A and X a set of bits whose weighted sum is the product P. For unsigned multiplication, the weight of the most significant bit of P is positive, while in 2's complement it is negative.

Figure 3.10 Booth partial product selector logic

The partial products are generated by ANDing a and b, which are 4-bit vectors, as shown in Fig. 3.6. With a 4-bit multiplier and a 4-bit multiplicand we get four partial products, of which the first is stored in q. Similarly, the second, third and fourth partial products are stored in the 4-bit vectors n, x and y.
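A minimal sketch of this AND-based generation for unsigned 4-bit operands is given below. The names q, n, x and y follow the text; the function names and the example operands are assumptions chosen for illustration.

```python
def and_partial_products(a, b, nbits=4):
    """Return one nbits-wide partial product per multiplier bit (LSB first).

    Row i is the multiplicand a ANDed with multiplier bit b[i] replicated
    across all nbits positions.
    """
    mask = (1 << nbits) - 1
    rows = []
    for i in range(nbits):
        bit = (b >> i) & 1
        replicated = mask if bit else 0   # b[i] copied to every bit position
        rows.append(a & replicated)
    return rows


def combine(rows):
    """Add the partial products, shifting row i left by i positions."""
    return sum(row << i for i, row in enumerate(rows))


q, n, x, y = and_partial_products(0b1011, 0b0110)
print([q, n, x, y])         # [0, 11, 11, 0]
print(combine([q, n, x, y]))  # 11 * 6 = 66
```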

Figure 3.11 Booth partial product generation


The second step of the multiplication reduces the partial products from the preceding step to two numbers while preserving the weighted sum. The sought-after product P is the sum of those two numbers, which are added during the third step. The synthesis of the "Wallace trees" follows Dadda's algorithm, which ensures the minimum number of counters. If, on top of that, we require the reduction to happen as late as (or as soon as) possible, then the solution is unique. The two binary numbers to be added during the third step may also be seen as one number in CSA notation (2 bits per digit).
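The reduction to two numbers can be sketched with 3:2 carry-save adders applied level by level. This is a plain grouping of rows in threes, used here only to show that the weighted sum is preserved; it does not reproduce the counter placement of Dadda's algorithm, and the function names are illustrative.

```python
def csa(a, b, c):
    """3:2 compressor on whole words: returns (sum, carry) with sum + carry == a + b + c."""
    return a ^ b ^ c, ((a & b) | (b & c) | (a & c)) << 1


def reduce_to_two(rows):
    """Compress a list of non-negative, already shifted rows down to two rows."""
    rows = list(rows)
    while len(rows) > 2:
        next_level = []
        while len(rows) >= 3:                 # compress each group of three rows
            a, b, c = rows.pop(), rows.pop(), rows.pop()
            next_level.extend(csa(a, b, c))
        next_level.extend(rows)               # pass leftover rows to the next level
        rows = next_level
    while len(rows) < 2:
        rows.append(0)
    return rows[0], rows[1]


# Shifted partial products of 13 * 11 (11 = 0b1011).
rows = [13 << 0, 13 << 1, 0, 13 << 3]
s, c = reduce_to_two(rows)
print(s + c)  # 143
```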

Figure 3.11 Booth single partial product selector logic

3.5.3 Modified Booth Encoder

Multiplication consists of three steps: 1) the first step generates the partial products; 2) the second step adds the generated partial products until only two rows remain; 3) the third step computes the final multiplication result by adding the last two rows. The modified Booth algorithm reduces the number of partial products by half in the first step. We used a previously proposed modified Booth encoding (MBE) scheme, which is known as the most efficient Booth encoding and decoding scheme. Multiplying X by Y using the modified Booth algorithm starts by grouping Y into blocks of three bits and encoding each block into one of {-2, -1, 0, 1, 2}.


Multiply by zero means the multiplicand is multiplied by zero. Multiply by one means the product is the same as the multiplicand value. Multiply by -1 means that the product is the two's complement of the multiplicand value. Multiply by 2 means shifting the multiplicand value left by one bit, whereas multiply by -2 means shifting the two's complement of the multiplicand value left by one bit.
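The five operations above can be put together in a short behavioral sketch of radix-4 Booth multiplication. It assumes an even operand width (8 bits by default) and signed two's complement operands; the function names are illustrative and the code models the arithmetic only, not the encoder circuit.

```python
def booth_digits(y, nbits=8):
    """Radix-4 Booth digits of the multiplier y, least significant group first."""
    y &= (1 << nbits) - 1
    ext = y << 1                      # implicit y[-1] = 0 below the LSB
    digits = []
    for i in range(0, nbits, 2):
        t = (ext >> i) & 0b111        # overlapping triplet y[i+1], y[i], y[i-1]
        digits.append((t & 1) + ((t >> 1) & 1) - 2 * ((t >> 2) & 1))
    return digits


def booth_multiply(x, y, nbits=8):
    """Multiply signed x by signed y (both within nbits) using Booth digits."""
    product = 0
    for i, digit in enumerate(booth_digits(y, nbits)):
        if digit == 0:
            pp = 0                    # multiply by zero
        elif digit == 1:
            pp = x                    # multiplicand unchanged
        elif digit == -1:
            pp = -x                   # two's complement of the multiplicand
        elif digit == 2:
            pp = x << 1               # multiplicand shifted left by one bit
        else:
            pp = -(x << 1)            # two's complement, shifted left by one bit
        product += pp << (2 * i)      # each digit has weight 4**i
    return product


print(booth_multiply(23, -45))  # -1035
```

With 8-bit operands only four partial products are formed, half the number that bit-by-bit multiplication would need.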



4.1. INTRODUCTION TO VHDL

This chapter introduces the VHDL language and is mainly intended as a companion for the Digital Design Laboratory. It aims to give the reader a quick introduction to VHDL rather than a complete or in-depth discussion. For a more detailed treatment, please consult any of the many good books on this topic. Several of these books are listed in the reference list.

4.1.1. Introduction

VHDL stands for VHSIC (Very High Speed Integrated Circuits) Hardware Description Language. In the mid-1980s the U.S. Department of Defense and the IEEE sponsored the development of this hardware description language with the goal of developing very high-speed integrated circuits. It has now become one of the industry's standard languages for describing digital systems. The other widely used hardware description language is Verilog. Both are powerful languages that allow you to describe and simulate complex digital systems. A third HDL is ABEL (Advanced Boolean Equation Language), which was specifically designed for Programmable Logic Devices (PLDs). ABEL is less powerful than the other two languages and is less popular in industry. This tutorial deals with VHDL as described by the IEEE standard 1076-1993.

Although these languages look similar to conventional programming languages, there are some important differences. A hardware description language is inherently parallel, i.e. commands, which correspond to logic gates, are executed (computed) in parallel as soon as a new input arrives. An HDL program mimics the behavior of a physical, usually digital, system. It also allows timing specifications (gate delays) to be incorporated and a system to be described as an interconnection of different components.


4.1.2. Levels of representation and abstraction

A digital system can be represented at different levels of abstraction [1]. This keeps the description and design of complex systems manageable. Figure 4.1 shows different levels of abstraction.

Figure 4.1 Levels of abstraction: Behavioral, Structural and Physical

The highest level of abstraction is the behavioral level, which describes a system in terms of what it does (or how it behaves) rather than in terms of its components and the interconnections between them. A behavioral description specifies the relationship between the input and output signals. This could be a Boolean expression or a more abstract description such as the Register Transfer or Algorithmic level. As an example, let us consider a simple circuit that warns car passengers when the door is open or the seatbelt is not used whenever the car key is inserted in the ignition lock. At the behavioral level this could be expressed as

Warning = Ignition_on AND ( Door_open OR Seatbelt_off)

The structural level, on the other hand, describes a system as a collection of gates and components that are interconnected to perform a desired function. A structural description could be compared to a schematic of interconnected logic gates. It is a representation that is usually closer to the physical realization of a system. For the example above, the structural representation is shown in Figure 4.2 below.
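The difference between the two descriptions can be mimicked in a few lines of Python (used here instead of VHDL so that the example can be run directly). The behavioral function states the Boolean relationship, while the structural one wires together gate functions through an internal net. The gate arrangement and the net name b1 are assumptions and are not necessarily those of the buzzer schematic.

```python
from itertools import product


def warning_behavioral(ignition_on, door_open, seatbelt_off):
    """Behavioral view: what the output is, as a Boolean expression."""
    return int(bool(ignition_on and (door_open or seatbelt_off)))


def or_gate(a, b):
    return a | b


def and_gate(a, b):
    return a & b


def warning_structural(ignition_on, door_open, seatbelt_off):
    """Structural view: interconnected gates with a named internal net."""
    b1 = or_gate(door_open, seatbelt_off)   # hypothetical internal net
    return and_gate(ignition_on, b1)


# Both descriptions agree on every input combination.
for inputs in product((0, 1), repeat=3):
    assert warning_behavioral(*inputs) == warning_structural(*inputs)

print(warning_structural(1, 0, 1))  # ignition on, seatbelt off -> 1
```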


Figure 4.2 Structural representation of a buzzer circuit.

VHDL allows one to describe a digital system at the structural or the behavioral level. The behavioral level can be further divided into two styles: data flow and algorithmic. The data flow representation describes how data moves through the system. This is typically done in terms of data flow between registers (Register Transfer level). The data flow model makes use of concurrent statements that are executed in parallel as soon as data arrives at the input. Sequential statements, on the other hand, are executed in the sequence in which they are specified. VHDL allows both concurrent and sequential signal assignments, which determine the manner in which they are executed. Examples of both representations will be given later.

4.1.3. Basic Structure of a VHDL file

A digital system in VHDL consists of a design entity that can contain other entities, which are then considered components of the top-level entity. Each entity is modeled by an entity declaration and an architecture body. One can consider the entity declaration as the interface to the outside world that defines the input and output signals, while the architecture body contains the description of the entity and is composed of interconnected entities, processes and components, all operating concurrently, as schematically shown in Figure 4.3 below. In a typical design there will be many such entities connected together to perform the desired function.

Figure 4.3 A VHDL entity


VHDL uses reserved keywords that cannot be used as signal names or identifiers. Keywords and user-defined identifiers are case insensitive. Lines with comments start with two adjacent hyphens (--) and will be ignored by the compiler. VHDL also ignores line breaks and extra spaces. VHDL is a strongly typed language which implies that one has always to declare the type of every object that can have a value, such as signals, constants and variables.

Figure 4.4 Top down design approach

4.2 Abstraction Level

VHDL supports designing at many different levels of abstraction. Three of them are very important:

Behavioral level
Register-Transfer Level
Gate Level


4.2.1 Behavioral Level

This level describes a system by concurrent algorithms (behavioral). Each algorithm itself is sequential, which means it consists of a set of instructions that are executed one after the other. Functions, procedures and processes are the main elements. There is no regard to the structural realization of the design.

4.2.2 Register-Transfer Level

Designs using the Register-Transfer Level specify the characteristics of a circuit by operations and the transfer of data between the registers. An explicit clock is used. RTL design contains exact timing bounds: operations are scheduled to occur at certain times. The modern definition of RTL code is "Any code that is synthesizable is called RTL code".

4.2.3 Gate Level

Within the logic level the characteristics of a system are described by logical links and their timing properties. All signals are discrete signals. They can only have definite logical values ('0', '1', 'X', 'Z'). The usable operations are predefined logic primitives (AND, OR, NOT gates, etc.). Using gate level modeling might not be a good idea for any level of logic design. Gate level code is generated by tools such as synthesis tools, and this netlist is used for gate level simulation and for the backend.

4.3 Introduction to FPGA

FPGA stands for Field Programmable Gate Array, a device that contains an array of logic modules, I/O modules and routing tracks (programmable interconnect). An FPGA can be configured by the end user to implement specific circuitry. Early devices ran at speeds of up to 100 MHz, whereas present speeds are in the GHz range. The main applications are DSP, FPGA-based computers, logic emulation, ASIC and ASSP. FPGAs are programmed mainly using SRAM (Static Random Access Memory). SRAM is volatile, and the main advantage of the SRAM programming technology is re-configurability. Issues in FPGA technology are the complexity of the logic element, clock support, I/O support and interconnections (routing).


4.3.1 FPGA Design Flow

An FPGA contains a two-dimensional array of logic blocks and interconnections between the logic blocks. Both the logic blocks and the interconnects are programmable. Logic blocks are programmed to implement a desired function, and the interconnects are programmed using the switch boxes to connect the logic blocks. To be more clear, if we want to implement a complex design (a CPU, for instance), then the design is divided into small sub-functions and each sub-function is implemented using one logic block. To obtain the desired design (the CPU), all the sub-functions implemented in logic blocks must be connected, and this is done by programming the interconnects. The internal structure of an FPGA is depicted in the following figure.

Figure 4.5 Internal structure of FPGA

FPGAs, as an alternative to custom ICs, can be used to implement an entire System on Chip (SoC). The main advantage of an FPGA is the ability to reprogram it. The user can reprogram an FPGA to implement a design, and this is done after the FPGA is manufactured, which gives rise to the name "Field Programmable". Custom ICs are expensive and take a long time to design, so they are useful when produced in bulk. FPGAs, in contrast, are easy to implement within a short time with the help of Computer Aided Design (CAD) tools, because there is no physical layout process, no mask making, and no IC manufacturing. Some disadvantages of FPGAs are that they are slow compared to custom ICs, that they cannot handle very complex designs, and that they draw more power.

A Xilinx logic block consists of one Look Up Table (LUT) and one flip-flop. An LUT can be used to implement a number of different functions. The input lines to the logic block go into the LUT and enable it. The output of the LUT gives the result of the logic function that it implements, and the output of the logic block is the registered or unregistered output of the LUT. SRAM is used to implement an LUT. A k-input logic function is implemented using a 2^k × 1 SRAM. The number of different possible functions for a k-input LUT is 2^(2^k). The advantage of such an architecture is that it supports the implementation of a great many logic functions; the disadvantage is the unusually large number of memory cells required to implement such a logic block when the number of inputs is large.

Figure 4.6 4-input LUT based implementation of logic block

An LUT-based design provides better logic block utilization. A k-input LUT-based logic block can be implemented in a number of different ways, with a trade-off between performance and logic density. An n-LUT can be seen as a direct implementation of a function truth table. Each of the latches holds the value of the function corresponding to one input combination. For example, a 2-LUT can be used to implement 16 types of functions such as AND, OR, A + not B, etc.
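The view of an LUT as a stored truth table can be made concrete with a small sketch. The class and function names are illustrative, and the convention that input i forms address bit i is an assumption; the model ignores the flip-flop and all timing.

```python
class LUT:
    """k-input lookup table modeled as a 2**k x 1 memory holding a truth table."""

    def __init__(self, k, func):
        self.k = k
        self.cells = []
        for addr in range(2 ** k):
            inputs = [(addr >> i) & 1 for i in range(k)]   # input i is address bit i
            self.cells.append(func(*inputs) & 1)

    def read(self, *inputs):
        """Evaluate the function by addressing the memory with the inputs."""
        addr = sum(bit << i for i, bit in enumerate(inputs))
        return self.cells[addr]


def num_functions(k):
    """Number of distinct Boolean functions of k inputs: 2**(2**k)."""
    return 2 ** (2 ** k)


and_lut = LUT(2, lambda a, b: a & b)
or_not_lut = LUT(2, lambda a, b: a | (b ^ 1))   # A + not B

print(and_lut.cells)        # [0, 0, 0, 1]
print(or_not_lut.read(0, 1))  # 0
print(num_functions(2))     # 16
```

The same class configured with a different function needs no structural change, only different memory contents, which is the property that makes the logic block programmable.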

4.3.2 Interconnects

A wire segment can be described as two end points of an interconnect with no programmable switch between them. A sequence of one or more wire segments in an FPGA can be termed a track. Typically an FPGA has logic blocks, interconnects and switch blocks (Input/Output blocks). Switch blocks lie on the periphery of the logic blocks and interconnect. Wire segments are connected to logic blocks through switch blocks. Depending on the required design, one logic block is connected to another, and so on.

4.3.3 FPGA Design Flow

This part of the tutorial gives a short introduction to the FPGA design flow. A simplified version of the design flow is given in the following diagram.

Figure 4.7 FPGA Design Flow

4.3.4 Design Entry

There are different techniques for design entry: schematic based, hardware description language based, a combination of both, and so on. The selection of a method depends on the design and the designer. If the designer wants to deal more with the hardware, then schematic entry is the better choice. When the design is complex, or the designer thinks of the design in an algorithmic way, then HDL is the better choice. Language-based entry is faster but lags in performance and density. HDLs represent a level of abstraction that can isolate the designers from the details of the hardware implementation. Schematic-based entry gives designers much more visibility into the hardware and is the better choice for those who are hardware oriented. Another, rarely used, method is state-machine entry. It is the better choice for designers who think of the design as a series of states, but the tools for state-machine entry are limited. This document deals with HDL-based design entry.

4.3.5 Synthesis

Synthesis is the process that translates VHDL or Verilog code into a device netlist format, i.e. a complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design contains more than one sub-design (for example, to implement a processor we need a CPU as one design element, a RAM as another, and so on), then the synthesis process generates a netlist for each design element. The synthesis process checks the code syntax and analyzes the hierarchy of the design, which ensures that the design is optimized for the design architecture the designer has selected. The resulting netlist(s) is saved to an NGC (Native Generic Circuit) file (for Xilinx Synthesis Technology (XST)).

Figure 4.8 FPGA Synthesis

4.3.6 Implementation

This process consists of a sequence of three steps:

1. Translate
2. Map
3. Place and Route

4.3.7 Translate

The Translate process combines all the input netlists and constraints into a logic design file. This information is saved as an NGD (Native Generic Database) file. This can be done using the NGD Build program. Here, defining constraints means assigning the ports in the design to the physical elements (e.g. pins, switches, buttons) of the targeted device and specifying the timing requirements of the design. This information is stored in a file named UCF (User Constraints File). Tools used to create or modify the UCF include PACE and the Constraint Editor.

Figure 4.9 FPGA Translate

4.3.8 Map

The Map process divides the whole circuit of logical elements into sub-blocks such that they fit into the FPGA logic blocks. In other words, the Map process fits the logic defined by the NGD file into the targeted FPGA elements (Configurable Logic Blocks (CLB), Input Output Blocks (IOB)) and generates an NCD (Native Circuit Description) file, which physically represents the design mapped to the components of the FPGA. The MAP program is used for this purpose.


Figure 4.10 FPGA Map

4.3.9 Place and Route

The PAR program is used for this process. The place and route process places the sub-blocks from the map process into logic blocks according to the constraints and connects the logic blocks. For example, if a sub-block is placed in a logic block that is very near to an I/O pin, this may save time but may affect some other constraint. The trade-off between all the constraints is therefore taken into account by the place and route process. The PAR tool takes the mapped NCD file as input and produces a completely routed NCD file as output. The output NCD file contains the routing information.
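The hand-off of files between the steps described so far can be summarized in a small table-like sketch. The stage, tool and file names are taken from the text; the data layout and the helper name is_chained are illustrative assumptions, and the constraints file read by the Translate step is left out for brevity.

```python
# (stage, tool, main input, output) for each step described above.
FLOW = [
    ("Synthesis", "XST", "HDL source", "NGC netlist"),
    ("Translate", "NGD Build", "NGC netlist", "NGD file"),
    ("Map", "MAP", "NGD file", "mapped NCD file"),
    ("Place and Route", "PAR", "mapped NCD file", "routed NCD file"),
]


def is_chained(flow):
    """Check that every stage consumes what the previous stage produced."""
    return all(prev[3] == nxt[2] for prev, nxt in zip(flow, flow[1:]))


for stage, tool, source, result in FLOW:
    print(f"{stage:16s} {tool:10s} {source} -> {result}")

print(is_chained(FLOW))  # True
```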

Figure 4.11 FPGA Place and Route

4.3.10 Device Programming

Now the design must be loaded onto the FPGA, but first it must be converted to a format that the FPGA can accept. The BITGEN program deals with this conversion. The routed NCD file is given to the BITGEN program to generate a bit stream (a .BIT file), which can be used to configure the target FPGA device. This can be done using a cable. The selection of the cable depends on the design.

4.3.11 Design Verification

Verification can be done at different stages of the process.

4.3.12 Behaviour Simulation (RTL Simulation)

This is the first of all the simulation steps that are encountered throughout the hierarchy of the design flow. This simulation is performed before the synthesis process to verify the RTL (behavioral) code and to confirm that the design is functioning as intended. Behavioral simulation can be performed on either VHDL or Verilog designs. In this process, signals and variables are observed, procedures and functions are traced, and breakpoints are set. This is a very fast simulation and so allows the designer to change the HDL code within a short time if the required functionality is not met. Since the design is not yet synthesized to gate level, timing and resource usage properties are still unknown.

4.3.13 Functional Simulation (Post Translate Simulation)

Functional simulation gives information about the logic operation of the circuit. The designer can verify the functionality of the design using this process after the Translate process. If the functionality is not as expected, then the designer has to make changes in the code and follow the design flow steps again.

4.3.14 Static Timing Analysis

This can be done after the MAP or PAR processes. The post-MAP timing report lists the signal path delays of the design derived from the design logic. The post-Place-and-Route timing report incorporates timing delay information to provide a comprehensive timing summary of the design.

4.4 XILINX

Xilinx is the world's largest supplier of programmable logic devices, the inventor of the field programmable gate array (FPGA) and the first semiconductor company with a fabless manufacturing model. The Xilinx software can perform simulation and synthesis. The entire processor will be implemented using Xilinx FPGAs, so little time is spent wiring up that part of the circuit. However, the designer has to wire the switches and lights that are used to control the processor, and has to wire the Xilinx part itself to the switches and lights, but this should not be too difficult. The designer will also use the backplane bus in the lab kit so that the Triscuit will be built on two boards: one for the Xilinx chip, and one for the switches and lights.
The HDL Editor feature provides extensive edit and search capabilities with language-specific color coding of keywords, as well as integrated on-line syntax checking to scan VHDL code for errors. The Language Assistant feature speeds design entry by providing a lookup list of typical language constructs and commonly used synthesis modules like counters, accumulators, and adders.

4.4.1 XST Design Flow Overview

The following figure shows the flow of files through the XST software.

Design Entry Overview. Design entry creates source files to represent the design. The top-level design source file can be in any of the following formats:

Hardware Description Language (HDL), such as VHDL or Verilog
Schematic (SCH)
Embedded processor (XMP)

4.4.2 ISE Design Suite

The ISE Design Suite: Logic Edition allows you to go from design entry, through implementation and verification, to device programming from within the unified environment of the ISE Project Navigator or from the command line. This edition includes exclusive tools and technologies to help achieve optimal design results, including the following:

Xilinx Synthesis Technology (XST) - synthesizes VHDL, Verilog, or mixed language designs.

ISim - enables you to perform functional and timing simulations for VHDL, Verilog and mixed VHDL/Verilog designs.

PlanAhead software - enables you to do advanced FPGA floor planning. The PlanAhead software includes I/O Planner, an environment designed to help you to import or create the initial I/O Port list, group the related ports into separate folders called Interfaces and assign them to package pins. I/O Planner supports fully automatic pin placement or semi-automated interactive modes to allow controlled I/O Port assignment. With early, intelligent decisions in FPGA I/O assignments, you can more easily optimize the connectivity between the PCB and FPGA.

CORE Generator software - provides an extensive library of Xilinx LogiCORE IP from basic elements to complex, system level IP cores.

SmartGuide technology - enables you to use results from a previous implementation to guide the next implementation for faster incremental implementation.

Design Preservation - enables you to use placement and routing for unchanged blocks from a previous implementation to reduce iterations in the timing closure phase.

Partial Reconfiguration - enables dynamic design modification of a configured FPGA. The ISE software uses Partition technology to define and implement static and reconfigurable regions of the device. This feature requires an additional license code.

XPower Analyzer - enables you to analyze power consumption for Xilinx FPGA and CPLD devices.

Power Optimization - for Virtex-6 devices, minimizes logic toggling to reduce dynamic power consumption.

iMPACT - enables you to directly configure Xilinx FPGAs or program Xilinx CPLDs and PROMs with the Xilinx cables. It also enables you to create programming files, read back and verify design configuration data, debug configuration problems, and execute SVF and XSVF files.

ChipScope Pro - assists with in-circuit verification.

Note: Design Preservation and Partial Reconfiguration are supported for the command line tools and the standalone version of the PlanAhead software.

Figure 4.12 Properties of Device Parameters

4.4.3 Process Window
The Processes window lists the available processes for the source file selected in the Sources window. Typically you select a particular process that you want to perform on the selected source file, such as a simulation or an implementation. To run a process, double-click it. When a process has executed successfully, a green check mark appears next to it. When you run a high-level process, the Project Navigator automatically runs all the associated lower-level processes.

Integrated Software Environment (ISE) is the Xilinx design software suite. This overview explains the general progression of a design through ISE from start to finish. ISE enables the design to be initiated with any of a number of different source types, including HDL (VHDL, Verilog HDL, ABEL), schematic design files, EDIF, state machines, and IP cores. From the source files, ISE enables quick verification of the functionality of these sources using the integrated simulation capabilities, including ModelSim Xilinx Edition and the HDL Bencher test bench generator. HDL sources may be synthesized using the Xilinx Synthesis Technology (XST) as well as partner synthesis engines, used standalone or integrated into ISE. The Xilinx implementation tools continue the process into a placed and routed FPGA or fitted CPLD, and finally produce a bitstream for the device configuration.

4.4.4 Design Entry
1. ISE Text Editor - The ISE Text Editor is provided in ISE for entering design code and viewing reports.
2. Schematic Editor - The Engineering Capture System (ECS) is a graphical user interface (GUI) for creating, viewing, and editing schematics and symbols for the Design Entry step of the Xilinx design flow.
3. CORE Generator - The CORE Generator System is a design tool that delivers parameterized cores optimized for Xilinx FPGAs, ranging in complexity from simple arithmetic operators such as adders to system-level building blocks such as filters, transforms, FIFOs, and memories.
4. Constraints Editor - The Constraints Editor is used to create and modify the most commonly used timing constraints.
5. PACE - The Pinout and Area Constraints Editor (PACE) is used to view and edit I/O, Global logic, and Area Group constraints.

6. StateCAD State Machine Editor - StateCAD is used to specify states, transitions, and actions in a graphical editor. The state machine is then created in HDL.

4.4.5 Implementation
1. Translate - The Translate process runs NGDBuild to merge all of the input netlists as well as design constraint information into a Xilinx database file.
2. Map - The Map program maps a logical design to a Xilinx FPGA.
3. Place and Route (PAR) - The PAR program accepts the mapped design, places and routes the FPGA, and produces output for the bitstream generator.
4. Floorplanner - The Floorplanner shows a graphical representation of the FPGA and is used to view and modify the placed design.
5. FPGA Editor - The FPGA Editor is used to view and modify the physical implementation, including routing.
6. Timing Analyzer - The Timing Analyzer provides a way to perform static timing analysis on FPGA and CPLD designs. With Timing Analyzer, analysis can be performed immediately after mapping, placing, or routing an FPGA design, and after fitting and routing a CPLD design.

4.4.6 Features
Memory Editor window.
Post-Map and Post-Translate flows supported in Project Navigator.
Support for simulation of embedded designs in XPS and Project Navigator.
ISim Hardware Co-Simulation limited access feature.

4.5 Introduction to the Xilinx Spartan-3 Kit
The Spartan-3 family of Field Programmable Gate Arrays is specifically designed to meet the needs of high volume, cost sensitive consumer electronic applications. The eight member family offers densities ranging from 50,000 to five million system gates.

The Spartan-3 family builds on the success of the earlier Spartan-IIE family by increasing the amount of logic resources, the capacity of internal RAM, the total number of I/Os, and the overall level of performance, as well as by improving clock management functions. Numerous enhancements derive from the Virtex-II platform technology. These Spartan-3 FPGA enhancements, combined with advanced process technology, deliver more functionality and bandwidth per dollar than was previously possible, setting new standards in the programmable logic industry. Because of their exceptionally low cost, Spartan-3 FPGAs are ideally suited to a wide range of consumer electronics applications, including broadband access, home networking, display/projection, and digital television equipment. The Spartan-3 family is a superior alternative to mask-programmed ASICs. FPGAs avoid the high initial cost, the lengthy development cycles, and the inherent inflexibility of conventional ASICs. Also, FPGA programmability permits design upgrades in the field with no hardware replacement.


Figure 4.13 Spartan-3 FPGA Kit

4.6 Boundary Scan (JTAG) Mode
In Boundary Scan mode, dedicated pins are used for configuring the FPGA. The configuration is done entirely through the IEEE 1149.1 Test Access Port (TAP). FPGA configuration using the Boundary Scan mode is compatible with IEEE Std 1149.1-1993 and with IEEE Std 1532 for In-System Configurable (ISC) devices. Configuration through the Boundary Scan port is always available, regardless of the selected configuration mode. In some cases, however, the mode pin setting may affect proper programming of the device due to various interactions. For example, if the mode pins are set to Master Serial or Master Parallel mode, and the associated PROM is already programmed with a valid configuration image, then there is potential for configuration interference between the JTAG and PROM data. Selecting the Boundary Scan mode disables the other modes and is the most reliable mode when programming via JTAG.

4.7 Switches, Buttons, and Knob

4.7.1 Slide Switches
The Spartan-3 Starter Kit board has eight slide switches. The slide switches are located in the lower right corner of the board and are labeled SW7 through SW0. Switch SW7 is the left-most switch, and SW0 is the right-most switch.

Figure 4.14 Slide Switches

Operation: When in the UP or ON position, a switch connects the FPGA pin to 3.3V, a logic High. When DOWN or in the OFF position, the switch connects the FPGA pin to ground, a logic Low. The switches typically exhibit about 2 ms of mechanical bounce and there is no active de-bouncing circuitry, although such circuitry could easily be added to the FPGA design programmed on the board.

4.7.2 Push Button Switches
The Spartan-3 Starter Kit board has eight momentary contact push-button switches. The push buttons are located in the lower left corner of the board.

Figure 4.15 Push Button Switches


Operation: Pressing a push button connects the associated FPGA pin to 3.3V. Use an internal pull-down resistor within the FPGA pin to generate a logic Low when the button is not pressed. There is no active de-bouncing circuitry on the push button.
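The slide switches are described as bouncing for about 2 ms with no de-bouncing circuitry on the board, and the text notes that such circuitry could be added to the FPGA design. The following Python model is a minimal sketch of one common approach, a counter-based debouncer. It assumes the input is sampled on the 50 MHz on-board clock; the class and function names are hypothetical and are not taken from the project's HDL.

```python
CLOCK_HZ = 50_000_000      # on-board oscillator frequency stated in the text
BOUNCE_SECONDS = 0.002     # about 2 ms of mechanical bounce, per the text


def debounce_cycles(clock_hz=CLOCK_HZ, bounce_s=BOUNCE_SECONDS):
    """Clock cycles the raw input must stay stable before it is accepted."""
    return round(clock_hz * bounce_s)


class Debouncer:
    """Counter-based debouncer (illustrative model, not the authors' design).

    The output changes only after the raw input has held a new value for
    `threshold` consecutive clock samples.
    """

    def __init__(self, threshold, initial=0):
        self.threshold = threshold
        self.state = initial   # debounced output level
        self.count = 0         # consecutive samples that differ from `state`

    def clock(self, raw):
        """Advance one clock cycle with the raw level; return the debounced level."""
        if raw == self.state:
            self.count = 0     # input agrees with output: restart the count
        else:
            self.count += 1
            if self.count >= self.threshold:
                self.state = raw
                self.count = 0
        return self.state


if __name__ == "__main__":
    print("cycles to wait at 50 MHz:", debounce_cycles())
    d = Debouncer(threshold=3)
    print([d.clock(level) for level in [1, 0, 1, 1, 0, 1, 1, 1, 1]])
```

In hardware the same idea maps to a counter register and a comparator, with the threshold chosen to cover the bounce interval.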

4.7.3 Discrete LEDs The Spartan-3 Starter Kit board has eight individual surface mount LEDs located above the slide switches. The LEDs are labeled LED7 through LED0. LED7 is the left-most LED, LED0 the right-most LED.

Figure 4.16 Discrete LEDs

Operation: Each LED has one side connected to ground and the other side connected to a pin on the Spartan-3 device via a 390 Ω current-limiting resistor. To light an individual LED, drive the associated FPGA control signal High.

4.8 Clock Connections
Each of the clock inputs connects directly to a global buffer input in I/O Bank 0, along the top of the FPGA. Each of the clock inputs also optimally connects to an associated DCM.

4.9 Voltage Control
The voltage for all I/O pins in FPGA I/O Bank 0 is controlled by jumper JP9. Consequently, these clock resources are also controlled by jumper JP9. By default, JP9 is set for 3.3V. The on-board oscillator is a 3.3V device and might not perform as expected when jumper JP9 is set for 2.5V.

4.10 Auxiliary Clock Oscillator Socket
The provided 8-pin socket accepts clock oscillators that fit the 8-pin DIP footprint. Use this socket if the FPGA application requires a frequency other than 50 MHz. Alternatively, use the FPGA's Digital Clock Manager (DCM) to generate or synthesize other frequencies from the on-board 50 MHz oscillator.
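As a rough illustration of synthesizing another frequency from the 50 MHz oscillator, the sketch below searches for a multiply/divide pair such that f_out = f_in x M / D. The ranges M = 2..32 and D = 1..32 are an assumption about the DCM frequency synthesizer and should be checked against the device data sheet; the function name is hypothetical, and the DCM's input and output frequency limits are not modeled.

```python
from fractions import Fraction

F_IN_HZ = 50_000_000   # on-board oscillator, per the text


def best_ratio(target_hz, f_in_hz=F_IN_HZ,
               m_range=range(2, 33), d_range=range(1, 33)):
    """Return (M, D, f_out_hz) minimising |f_in * M / D - target|.

    The M and D ranges are assumed values, not taken from the source text.
    On ties the first pair found (smallest M, then smallest D) is kept.
    """
    best = None
    for m in m_range:
        for d in d_range:
            f_out = Fraction(f_in_hz * m, d)   # exact arithmetic, no rounding
            err = abs(f_out - target_hz)
            if best is None or err < best[0]:
                best = (err, m, d, f_out)
    _, m, d, f_out = best
    return m, d, float(f_out)


if __name__ == "__main__":
    for target in (75_000_000, 100_000_000, 66_000_000):
        print(target, best_ratio(target))
```

A target that cannot be hit exactly returns the closest achievable frequency, so the caller should compare the third element against the requested value.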


5.1 Simulation results
The structures are simulated using ModelSim. The outputs shown here are obtained from those simulations.

Figure 5.1-Simulation result for All Signals of MAC


Figure 5.2-Simulation result of Adder

Figure 5.3-Simulation result for Booth Encoder


Figure 5.4-Simulation result for Booth Multiplier

Figure 5.5-Simulation result for Full Adder


Figure 5.6-Simulation result for Half Adder

Figure 5.7-Simulation result for Booth Top-Level Module

5.2 SYNTHESIS REPORT

5.2.1 Synthesis Options Summary

---- Source Parameters
Input File Name                    : "imp_mactop_tb.prj"
Input Format                       : mixed
Ignore Synthesis Constraint File   : NO

---- Target Parameters
Output File Name                   : "imp_mactop_tb"
Output Format                      : NGC
Target Device                      : xc3s500e-4-fg320

---- Source Options
Top Module Name                    : imp_mactop_tb
Automatic FSM Extraction           : YES
FSM Encoding Algorithm             : Auto
Safe Implementation                : No
FSM Style                          : lut
RAM Extraction                     : Yes
RAM Style                          : Auto
ROM Extraction                     : Yes
Mux Style                          : Auto
Decoder Extraction                 : YES
Priority Encoder Extraction        : YES
Shift Register Extraction          : YES
Logical Shifter Extraction         : YES
XOR Collapsing                     : YES
ROM Style                          : Auto
Mux Extraction                     : YES
Resource Sharing                   : YES
Asynchronous To Synchronous        : NO
Multiplier Style                   : auto
Automatic Register Balancing       : No

---- Target Options
Add IO Buffers                     : YES
Global Maximum Fanout              : 500
Add Generic Clock Buffer(BUFG)     : 24
Register Duplication               : YES
Slice Packing                      : YES
Optimize Instantiated Primitives   : NO
Use Clock Enable                   : Yes
Use Synchronous Set                : Yes
Use Synchronous Reset              : Yes
Pack IO Registers into IOBs        : auto
Equivalent register Removal        : YES

---- General Options
Optimization Goal                  : Speed
Optimization Effort                : 1
Library Search Order               : imp_mactop_tb.lso
Keep Hierarchy                     : NO
Netlist Hierarchy                  : as_optimized
RTL Output                         : Yes
Global Optimization                : AllClockNets
Read Cores                         : YES
Write Timing Constraints           : NO
Cross Clock Analysis               : NO
Hierarchy Separator                : /
Bus Delimiter                      : <>
Case Specifier                     : maintain
Slice Utilization Ratio            : 100
BRAM Utilization Ratio             : 100
Verilog 2001                       : YES
Auto BRAM Packing                  : NO
Slice Utilization Ratio Delta      : 5

5.2.2 HDL Synthesis Report

Macro Statistics
# ROMs                             : 8
 8x1-bit ROM                       : 8
# Accumulators                     : 1
 32-bit up accumulator             : 1
# Registers                        : 3
 1-bit register                    : 1
 16-bit register                   : 2
# Multiplexers                     : 136
 1-bit 8-to-1 multiplexer          : 136
# Xors                             : 200
 1-bit xor2                        : 32
 1-bit xor3                        : 168

5.2.3 Advanced HDL Synthesis Report

Macro Statistics
# ROMs                             : 8
 8x1-bit ROM                       : 8
# Accumulators                     : 1
 32-bit up accumulator             : 1
# Registers                        : 33
 Flip-Flops                        : 33
# Multiplexers                     : 136
 1-bit 8-to-1 multiplexer          : 136
# Xors                             : 200
 1-bit xor2                        : 32
 1-bit xor3                        : 168

5.2.4 Final Report

Found area constraint ratio of 100 (+ 5) on block imp_mactop_tb, actual ratio is 3.
Final Macro Processing ...

Final Register Report
Macro Statistics
# Registers                        : 36
 Flip-Flops                        : 36

Final Results
RTL Top Level Output File Name     : imp_mactop_tb.ngr
Top Level Output File Name         : imp_mactop_tb
Output Format                      : NGC
Optimization Goal                  : Speed
Keep Hierarchy                     : NO

Design Statistics
# IOs                              : 5

Cell Usage :
# BELS                             : 116
#      GND                         : 1
#      INV                         : 1
#                                  : 3
#                                  : 3
#                                  : 7
#                                  : 29
#                                  : 39
#                                  : 1
#                                  : 32
# FlipFlops/Latches                : 36
#      FDC                         : 35
#      FDE                         : 1
# Clock Buffers                    : 1
#      BUFGP                       : 1
# IO Buffers                       : 4
#      IBUF                        : 3
#      OBUF                        : 1

Device utilization summary:
Selected Device : 3s500efg320-4
Number of Slices:                  23 out of 4656     0%
Number of Slice Flip Flops:        36 out of 9312     0%
Number of 4 input LUTs:            43 out of 9312     0%
Number of IOs:                     5
Number of bonded IOBs:             5 out of 232       2%
Number of GCLKs:                   1 out of 24        4%
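The percentages in the device utilization summary can be reproduced from the used and available counts. The sketch below is a small check, assuming the tool reports whole-number percentages truncated toward zero; that rounding rule is an assumption that happens to match the figures reported here, and the function name is hypothetical.

```python
def utilization_percent(used, available):
    """Whole-number utilisation in percent, truncated toward zero (assumed rule)."""
    return (100 * used) // available


# (used, available) pairs copied from the device utilization summary
RESOURCES = {
    "Slices": (23, 4656),
    "Slice Flip Flops": (36, 9312),
    "4 input LUTs": (43, 9312),
    "bonded IOBs": (5, 232),
    "GCLKs": (1, 24),
}

if __name__ == "__main__":
    for name, (used, available) in RESOURCES.items():
        pct = utilization_percent(used, available)
        print(f"{name:18s} {used:3d} out of {available:5d}  {pct}%")
```

The MAC occupies well under 1% of the slices, which is why the first three entries are reported as 0%.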

TIMING REPORT

Timing Summary:
Speed Grade: -4
Minimum period: 5.921ns (Maximum Frequency: 168.890MHz)
Minimum input arrival time before clock: 4.160ns
Maximum output required time after clock: 4.283ns
Maximum combinational path delay: No path found

Timing Detail:
All values displayed in nanoseconds (ns)

Timing constraint: Default period analysis for Clock 'clk'
Clock period: 5.921ns (frequency: 168.890MHz)
Total number of paths / destination ports: 2207 / 33
Delay:               5.921ns (Levels of Logic = 32)
Source:              breg_5 (FF)
Destination:         uut/mul_out_r_31 (FF)
Source Clock:        clk rising
Destination Clock:   clk rising
Total                5.921ns (4.582ns logic, 1.339ns route) (77.4% logic, 22.6% route)

Timing constraint: Default OFFSET IN BEFORE for Clock 'clk'
Total number of paths / destination ports: 6 / 4
Offset:              4.160ns (Levels of Logic = 2)
Source:              rst (PAD)
Destination:         macout_reg (FF)
Destination Clock:   clk rising
Data Path: rst to macout_reg
Cell:in->out     fanout   Gate Delay   Net Delay   Logical Name (Net Name)
IBUF:I->O        36       1.218        1.263       rst_IBUF (rst_IBUF)
INV:I->O         1        0.704        0.420       rst_inv1_INV_0 (rst_inv)
FDE:CE                    0.555                    macout_reg
Total                4.160ns (2.477ns logic, 1.683ns route) (59.5% logic, 40.5% route)

Timing constraint: Default OFFSET OUT AFTER for Clock 'clk'
Total number of paths / destination ports: 1 / 1
Offset:              4.283ns (Levels of Logic = 1)
Source:              macout_reg (FF)
Destination:         macout1 (PAD)
Source Clock:        clk rising
Data Path: macout_reg to macout1
Cell:in->out     fanout   Gate Delay   Net Delay   Logical Name (Net Name)
FDE:C->Q         1        0.591        0.420       macout_reg (macout_reg)
OBUF:I->O                 3.272                    macout1_OBUF (macout1)
Total                4.283ns (3.863ns logic, 0.420ns route) (90.2% logic, 9.8% route)

Total memory usage is 140792 kilobytes
Number of errors   : 0 ( 0 filtered)
Number of warnings : 40 ( 0 filtered)
Number of infos    : 5 ( 0 filtered)
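The headline figures in the timing report follow from simple arithmetic: the maximum frequency is the reciprocal of the minimum period, and each path delay splits into a logic part and a routing part. The sketch below reproduces those numbers from the values in the report; the function names are hypothetical.

```python
def max_frequency_mhz(min_period_ns):
    """Maximum clock frequency in MHz for a given minimum period in ns."""
    return 1000.0 / min_period_ns


def logic_route_split(logic_ns, route_ns):
    """Return (logic %, route %) of the total path delay, to one decimal place."""
    total = logic_ns + route_ns
    return round(100 * logic_ns / total, 1), round(100 * route_ns / total, 1)


if __name__ == "__main__":
    print(f"f_max = {max_frequency_mhz(5.921):.3f} MHz")
    # (logic ns, route ns) for the three constrained paths in the report
    for label, logic, route in [
        ("clock period", 4.582, 1.339),
        ("offset in", 2.477, 1.683),
        ("offset out", 3.863, 0.420),
    ]:
        print(label, logic_route_split(logic, route))
```

The register-to-register path passes through 32 levels of logic, so most of its 5.921 ns is logic delay rather than routing.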

5.3 Synthesis Results

Figure 5.8 RTL of Adder unit


Figure 5.9 Internal Structure of RTL Top module


Figure 5.10 RTL of Booth Multiplier

Figure 5.11 RTL of Booth Encoder


Figure 5.12 Internal RTL of Multiplier

Figure 5.13 RTL schematic of Top module


Figure 5.14 RTL Internal Structure of Adder block


A 16x16 multiplier-accumulator (MAC) is presented in this work. A radix-4 Modified Booth multiplier circuit is used for the MAC architecture. Compared to other circuits, the Booth multiplier has the highest operational speed and a lower hardware count. The basic building blocks of the MAC unit are identified, and each block is analyzed for its performance. Power and delay are calculated for the blocks. A 1-bit MAC unit is designed with an enable input to reduce the total power consumption, based on the block enable technique. Using this block, the N-bit MAC unit is constructed and the total power consumption is calculated for the MAC unit. Power reduction techniques are adopted in this work. The MAC unit designed in this work can be used in filter realizations for high-speed DSP applications.
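To make the arithmetic concrete, the following Python model is a behavioral sketch of radix-4 Modified Booth multiplication feeding a 32-bit accumulator. It is not the project's HDL. It assumes signed two's-complement 16-bit operands and an accumulator that wraps at 32 bits, and all names are hypothetical.

```python
def booth_radix4_digits(multiplier, width=16):
    """Recode a signed `width`-bit multiplier into width/2 digits in {-2..2}.

    Digit i is -2*y[2i+1] + y[2i] + y[2i-1], with y[-1] = 0.
    """
    if width % 2:
        raise ValueError("width must be even")
    bits = multiplier & ((1 << width) - 1)   # two's-complement bit pattern
    padded = bits << 1                       # append the implicit y[-1] = 0
    digits = []
    for i in range(0, width, 2):
        triplet = (padded >> i) & 0b111      # bits y[i+1], y[i], y[i-1]
        y_hi, y_mid, y_lo = triplet >> 2, (triplet >> 1) & 1, triplet & 1
        digits.append(-2 * y_hi + y_mid + y_lo)
    return digits


def booth_multiply(multiplicand, multiplier, width=16):
    """Sum the shifted partial products selected by the Booth digits."""
    product = 0
    for i, digit in enumerate(booth_radix4_digits(multiplier, width)):
        product += (digit * multiplicand) << (2 * i)   # weight 4**i
    return product


def wrap_signed(value, width):
    """Interpret the low `width` bits of `value` as a two's-complement number."""
    value &= (1 << width) - 1
    return value - (1 << width) if value >> (width - 1) else value


class Mac:
    """Multiply-accumulate model with a 2*width-bit wrapping accumulator."""

    def __init__(self, width=16):
        self.width = width
        self.acc = 0

    def step(self, a, b):
        self.acc = wrap_signed(self.acc + booth_multiply(a, b, self.width),
                               2 * self.width)
        return self.acc


if __name__ == "__main__":
    print(booth_radix4_digits(7, width=4))
    mac = Mac()
    print(mac.step(3, 4), mac.step(-5, 6))
```

Radix-4 recoding halves the number of partial products, eight instead of sixteen for a 16-bit multiplier, which is the source of the speed advantage claimed for the Booth multiplier.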
