You are on page 1of 36

Abstract

With the recent rapid advances in multimedia and communication systems, realtime signal processing like audio signal processing, video/image processing, or largecapacity data processing are increasingly being demanded. The multiplier and multiplier-and-accumulator (MAC) are the essential elements of the digital signal processing such as filtering, convolution, and Inner products. In this project, I proposed a new architecture of multiplier-and-accumulator (MAC) for high-speed arithmetic. By combining multiplication with accumulation and devising a hybrid type of carry save adder (CSA), the performance was improved. Since the accumulator that has the largest delay in MAC was merged into CSA, the overall performance was elevated. The proposed CSA tree uses 1s-complement-based radix-2 modified Booths algorithm (MBA) and has the modified array for the sign extension in order to increase the bit density of the operands. The CSA propagates the carries to the least significant bits of the partial products and generates the least significant bits in advance to decrease the number of the input bits of the final adder. Also, the proposed MAC accumulates the intermediate results in the type of sum and carry bits instead of the output of the final adder, which made it possible to optimize the pipeline scheme to improve the performance. In this project we used Modelsim for logical verification, and further synthesizing it on Xilinx-ISE tool using target technology and performing placing & routing operation for system verification on targeted FPGA.

CHAPTER 1 INTRODUCTION AND LITERATURE SURVEY

1.Introduction:
Power dissipation is recognized as a critical parameter in modern VLSI design field. To satisfy MOORES law and to produce consumer electronics goods with more backup and less weight, low power VLSI design is necessary. Fast multipliers are essential parts of digital signal processing systems. The speed of multiply operation is of great importance in digital signal processing as well as in the general purpose processors today, especially since the media processing took off. In the past multiplication was generally implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times that it is added is the multiplier, and the result is the product. Each step of addition generates a partial product. In most computers, the operand usually contains the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of operands in order to preserve the information content. This repeated addition method that is suggested by the arithmetic definition is slow that it is almost always replaced by an algorithm that makes use of positional representation. It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them. The basic multiplication principle is two fold i.e. evaluation of partial products and accumulation of the shifted partial products. It is performed by the successive additions of the columns of the shifted partial product matrix. The multiplier is successfully shifted and gates the appropriate bit of the multiplicand. The delayed, gated instance of the multiplicand must all be in the same column of the shifted partial product matrix. They are then added to form the product bit for the particular form. Multiplication is therefore a multi operand operation. To extend the multiplication to both signed and unsigned numbers, a convenient number system would be the representation of numbers in twos complement format.

The MAC(Multiplier and Accumulator Unit) is used for image processing and digital signal processing (DSP) in a DSP processor. Algorithm of MAC is Booth's radix-4 algorithm, Modified Booth Multiplier, 34-bit CSA and improves speed. MIPS was implemented as micro processors and permitted high performance pipeline implementations through the use of their simple register oriented instruction sets. Although those algorithms (radix-4 algorithm, pipelining, etc ) are widely used technique for speeding up each part, the MAC on specific processor cannot be run at 100% efficiency. Due to the reasons of lower speed of MAC, MIPS instruction "mul" (multiplication) takes longer time than any other instruction in our MIPS processor. To improve speed of MIPS, MAC needs to be fast and MIPS must have special algorithm for "mul" instruction. One of the methods we chose was to design multi-clock MAC instead of one-clock MAC which improved the speed of MIPS. In general, the instruction set of MIPS processor includes complex works like multiplication and floating point operation which has multi execution stage. Therefore, system clock of the processor was increased efficiently. We applied 2 stage pipelining to the MAC to MIPS

processor and as a result we were able to get the result of matrix multiplication which was used for image processing in our MIPS processor that supports MAC.

1.1Backgrounds
1.1.1 MAC
In the majority of digital signal processing (DSP) applications the critical operations are the multiplication and accumulation. Realtime signal processing requires high speed and high throughput Multiplier-Accumulator (MAC) unit that consumes low power, which is always a key to achieve a high performance digital signal processing system. The purpose of this work is, design and implementation of a low power MAC unit with block enabling technique to save power. Firstly, a 1-bit MAC unit is designed, with appropriate geometries that give optimized power, area and delay. The delay in the pipeline stages in the MAC unit is estimated based on which a control unit is designed to control the data flow between the MAC blocks for low power. Similarly, the N-bit MAC unit is designed and controlled for low power using a control logic that enables the pipelined stages at appropriate time. The adder cell designed has advantage of high operational speed, small Gate count and low power. In general, a multiplier uses Booths algorithm and array of full adders (FAs), or Wallace tree instead of the array of FAs., i.e., this multiplier mainly consists of the three parts: Booth encoder, a tree to compress the partial products such as Wallace tree, and final adder . Because Wallace tree is to add the partial products from encoder as parallel as possible, its operation time is proportional to , where is the number of inputs. It uses the fact that counting the number of 1s among the inputs reduces the number of outputs into . In real implementation, many (3:2) or (7:3) counters are used to reduce the number of outputs in each pipeline step. The most effective way to increase the speed of a multiplier is to reduce the number of the partial products because multiplication proceeds a series of additions for the partial products. To reduce the number of calculation steps for the partial products, MBA algorithm has been applied mostly whereWallace tree has taken the role of increasing the speed to add the partial products. To increase the speed of the MBA algorithm, many parallel multiplication architectures have been researched .Among them, the architectures based on the

BaughWooley algorithm (BWA) have been developed and they have been applied to various digital filtering calculations. 1.1.2. Overview of MAC: A multiplier can be divided into three operational steps. The first is radix-2 Booth encoding in which a partial product is generated from the multiplicand X and the multiplier Y . The second is adder array or partial product compression to add all partial products and convert them into the form of sum and carry. The last is the final addition in which the final multiplication result is produced by adding the sum and the carry. If the process to accumulate the multiplied results is included, a MAC consists of four steps, as shown in Fig. 1, which shows the operational steps explicitly. A general hardware architecture of this MAC is shown in Fig. 2. It executes the multiplication operation by multiplying the input multiplier X and the multiplicand Y . This is added to the previous multiplication result Z as the accumulation step. The N-bit 2s complement binary number can be expressed as

..(1) If (1) is expressed in base-4 type redundant sign digit form in order to apply the radix-2 Booths algorithm.

..(2) (3) If (2) is used, multiplication can be expressed as

(4) If these equations are used, the afore-mentioned multiplicationaccumulation results can be expressed as

.(5) Each of the two terms on the right-hand side of (5) is calculated independently and the final result is produced by adding the two results. The MAC architecture implemented by (5) is called the standard design [6]. If -bit data are multiplied, the number of the generated partial products is proportional to . In order to add them serially, the execution time is also proportional to . The architecture of a multiplier, which is the fastest, uses radix-2 Booth encoding that generates partial products and aWallace tree based on CSA as the adder array to add the partial products. If radix-2 Booth encoding is used, the number of partial products, i.e., the inputs to the Wallace tree, is reduced to half, resulting in the decrease in CSA tree step. In addition, the signed multiplication based on 2s complement numbers is also possible. Due to these reasons, most current used multipliers adopt the Booth encoding.

1.1.3..Proposed MAC Architecture:


In this section, the expression for the new arithmetic will be derived from equations of the standard design. From this result, VLSI architecture for the new MAC will be proposed. In addition, a hybrid-typed CSA architecture that can satisfy the operation of the proposed MAC will be proposed. 1.1.4 Basic Concept: If an operation to multiply two N bit numbers and accumulate into a 2N -bit number is considered, the critical path is determined by the 2 -bit accumulation operation. If a pipeline scheme is applied for each step in the standard design of Fig. 1, the delay of the last accumulator must be reduced in order to improve the performance of the MAC. The overall performance of the proposed MAC is improved by eliminating the accumulator itself by combining it with the CSA function. If the accumulator has been eliminated, the critical path is then determined by the final adder in the multiplier. The basic method to improve the performance of the final adder is to decrease the number of input bits. In order to reduce this number of input bits, the multiple partial products are compressed into a sum and a carry by CSA. The number of bits of sums and carries to be transferred to the final adder is reduced by adding the lower bits of

sums and carries in advance within the range in which the overall performance will not be degraded. A 2-bit CLA is used to add the lower bits in the CSA. In addition, to increase the output rate when pipelining is applied, the sums and carrys from the CSA are accumulated instead of the outputs from the final adder in the manner that the sum and carry from the CSA in the previous cycle are inputted to CSA. Due to this feedback of both sum and carry, the number of inputs to CSA increases, compared to the standard design and [17]. In order to efficiently solve the increase in the amount of data, a CSA architecture is modified to treat the sign bit.

1.2. Multiplier and Accumulator Unit


MAC is composed of an adder, multiplier and an accumulator. Usually adders implemented are Carry- Select or Carry-Save adders, as speed is of utmost importance in DSP (Chandrakasan, Sheng, & Brodersen, 1992 and Weste & Harris, 3rd Ed). One implementation of the multiplier could be as a parallel array multiplier. The inputs for the MAC are to be fetched from memory location and fed to the multiplier block of the MAC, which will perform multiplication and give the result to adder which will accumulate the result and then will store the result into a memory location. This entire process is to be

achieved in a single clock cycle (Weste & Harris, 3rd Ed). Figure 2 is the architecture of the MAC unit which had been designed in this work. The design consists of one 16 bit register, one 16-bit Modified Booth Multiplier multiplier, 33-bit accumulator using ripple carry and two16-bit accumulator registers. To multiply the values of A and B, Modified Booth multiplier is used instead of conventional multiplier because Modified Booth multiplier can increase the MAC unit design speed and reduce multiplication complexity. Carry save Adder (CSA) is used as an accumulator in this design. Apparently, together with the utilization of Wallace tree multiplier approach, carry save adder in the final stage of the Modified Booth multiplier and Carry save Adder as the accumulator, this MAC unit design is not only reducing the standby power consumption but also can enhance the MAC unit speed so as to gain better system performance. The operation of the designed MAC unit is as in Equation 2.1. The product of Ai X Bi is always fed back into the 34-bit Carry save accumulator and then added again with the next product Ai x Bi. This MAC unit is capable of multiplying and adding with previous product consecutively up to as many as eight times. Operation: Output = Ai Bi (2.1) In this paper, the design of 16x16 multiplier unit is carried out that can perform accumulation on 34 bit number. This MAC unit has 34 bit output and its operation is to add repeatedly the multiplication results. The total design area is also being inspected by observing the total count of gates [Hardwires]. Power delay product is calculated by multiplying the power consumption result with the time delay.

1.3. A radix-4 modified Booth's algorithm:


Booth's Algorithm is simple but powerful. Speed of MAC is dependent on the number of partial products and speed of accumulate partial product. Booth's Algorithm provide us to reduced partial products. We choose radix-4 algorithm because of below reasons. Original Booth's algorithm has an inefficient case. The 17 partial products are generated in 16bit x 16bit signed or unsigned multiplication.

Modified Booth's radix-4 algorithm has fatal encoding time in 16bit x 16bit multiplication. Radix-4 Algorithm has a 3x term which means that a partial product cannot be generated by shifting. Therefore, 2x + 1x are needed in encoding processing. One of the solution is handling an additional 1x term in wallace tree. However, large wallace tree has some problems too. A radix-4 modified Booth's algorithm: Booth's radix-4 algorithm is widely used to reduce the area of multiplier and to increase the speed. Grouping 3 bits of multiplier with overlapping has half partial products which improves the system speed. Radix-4 modified Booth's algorithm is shown below: X-1 = 0; Insert 0 on the right side of LSB of multiplier. Start grouping each 3bits with overlapping from x-1 If the number of multiplier bits is odd, add a extra 1 bit on left side of MSB generate partial product from truth table when new partial product is generated, each partial product is added 2 bit left shifting in regular sequence.

x: multiplicand y: multiplier 1.3.1. Sign or zero extension Our MAC supports signed or unsigned multiplication and the produced result is 64bit which are stored in 2 special 32bit register. First MAC receives a multiplicand and multiplier but just 16bit operands are signed number in Booth's radix-4 algorithm. Hence, extension bit is required to express 16bit signed number. The core idea of this is that 16bit unsigned number can be expressed by 33bit signed number. The 17 partial products are generated in 33bit x 33bit case (16 partial products in 32bit x 32bit case).

Here is an example of signed and unsigned multiplication. When x(multiplicand) is 3bit 111 and y(multiplier) is 3bit 111, the signed and unsigned multiplication is different. In signed case x y = 1 (-1 x -1 = 1) and in unsigned case x y = 49 (7 x 7 = 49).

1.3.2. Carry Save Adder When three or more operands are to be added simultaneously using two operand adders, the time consuming carry propagation must be repeated several times. If the number of operands is k, then carries have to propagate (k-1) times (Weste & Harris, 3rd Ed). In the carry save addition, we let the carry propagate only in the last step, while in all the other steps we generate the partial sum and sequence of carries separately. A CSA is capable of reducing the number of operands to be added from 3 to 2 without any carry propagation. A CSA can be implemented in different ways. In the simplest implementation, the basic element of carry save adder is the combination of two half adders or 1 bit full adder (Weste & Harris, 3rd Ed)

1.4. Circuit Design Features


One of the most advanced types of MAC for general-purpose digital signal processing has been proposed by Elguibaly . It is an architecture in which accumulation has been combined with the carry save adder (CSA) tree that compresses partial products. In the architecture proposed in , the critical path was reduced by eliminating the adder for accumulation and decreasing the number of input bits in the final adder.

While it has a better performance because of the reduced critical path compared to the previous MAC architectures, there is a need to improve the output rate due to the use of the final adder results for accumulation. An architecture to merge the adder block to the accumulator register in the MAC operator was proposed to provide the possibility of using two separate N/2-bit adders instead of one -bit adder to accumulate the bitMAC results. Recently, Zicari proposed an architecture that took a merging technique to fully utilize the 42 compressor .It also took this compressor as the basic building blocks for the multiplication circuit.

1.5. MAC 1.5.1 Block Diagram of MAC: In this paper, a new architecture for a high-speed MAC is proposed. In this MAC, the computations of multiplication and accumulation are combined and a hybrid-type CSA structure is proposed to reduce the critical path and improve the output rate. It uses MBA algorithm based on 1s complement number system. A modified array structure for the sign bits is used to increase the density of the operands. A carry look-ahead adder (CLA) is inserted in the CSA tree to reduce the number of bits in

the final adder. In addition, in order to increase the output rate by optimizing the pipeline efficiency, intermediate calculation results are accumulated in the form of sum and carry instead of the final adder outputs. A multiplier can be divided into three operational steps. The first is radix-2 Booth encoding in which a partial product is generated from the multiplicand X and the multiplier Y . The second is adder array or partial product compression to add all partial products and convert them into the form of sum and carry. The last is the final addition in which the final multiplication result is produced by adding the sum and the carry. If the process to accumulate the multiplied results is included, a MAC consists of four steps, as shown in Fig. 1, which shows the operational steps explicitly.

CHAPTER 2 Design of MAC


In the majority of digital signal processing (DSP) applications the critical operations usually involve many multiplications and/or accumulations. For real-time signal processing, a high speed and high throughput Multiplier-Accumulator (MAC) is always a key to achieve a high performance digital signal processing system. In the last few years, the main consideration of MAC design is to enhance its speed. This is because; speed and throughput rate is always the concern of digital signal processing system. But for the epoch of personal communication, low power design also becomes another main design consideration. This is because; battery energy available for these portable products limits the power consumption of the system. Therefore, the main motivation of this work is to investigate various Pipelined multiplier/accumulator architectures and circuit design techniques which are suitable for implementing high throughput signal processing algorithms and at the same time achieve low power consumption. A conventional MAC unit consists of (fast multiplier) multiplier and an accumulator that contains the sum of the previous

consecutive products. The function of the MAC unit is given by the following equation: F = A i Bi(2.1) The main goal of a DSP processor design is to enhance the speed of the MAC unit, and at the same time limit the power consumption. In a pipelined MAC circuit, the delay of pipeline stage is the delay of a 1bit full adder. Estimating this delay will assist in identifying the overall delay of the pipelined MAC. In this work, 1-bit full adder is designed. Area, power and delay are calculated for the full adder, based on which the pipelined MAC unit is designed for low power. 2.1. High-Speed Booth Encoded Parallel Multiplier Design: Fast multipliers are essential parts of digital signal processing systems. The speed of multiply operation is of great importance in digital signal processing as well as in the general purpose processors today, especially since the media processing took off. In the past multiplication was generally implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times that it is added is the multiplier, and the result is the product. Each step of addition generates a partial product. In most computers, the operand usually contains the same number of bits. When the operands are interpreted as integers, the product is generally twice the length of operands in order to preserve the information content. This repeated addition method that is suggested by the arithmetic definition is slow that it is almost always replaced by an algorithm that makes use of positional representation. It is possible to decompose multipliers into two parts. The first part is dedicated to the generation of partial products, and the second one collects and adds them. The basic multiplication principle is two fold i.e. evaluation of partial products and accumulation of the shifted partial products. It is performed by the successive additions of the columns of the shifted partial product matrix. The multiplier is successfully shifted and

gates the appropriate bit of the multiplicand. The delayed, gated instance of the multiplicand must all be in the same column of the shifted partial product matrix. They are then added to form the product bit for the particular form. Multiplication is therefore a multi operand operation. To extend the multiplication to both signed and unsigned. 2.2. Modified Booth Encoder: In order to achieve high-speed multiplication, multiplication algorithms using parallel counters, such as the modified Booth algorithm has been proposed, and some multipliers based on the algorithms have been implemented for practical use. This type of multiplier operates much faster than an array multiplier for longer operands because its computation time is proportional to the logarithm of the word length of operands.

Booth multiplication is a technique that allows for smaller, faster multiplication circuits, by recoding the numbers that are multiplied. It is possible to reduce the number of partial products by half, by using the technique of radix-4 Booth recoding. The basic idea is that, instead of shifting and adding for every column of the multiplier term and multiplying by 1 or 0, we only take every second column, and multiply by 1, 2, or 0, to obtain the same results. The advantage of this method is the halving of the number of partial

products. To Booth recode the multiplier term, we consider the bits in blocks of three, such that each block overlaps the previous block by one bit. Grouping starts from the LSB, and the first block only uses two bits of the multiplier. Figure 3 shows the grouping of bits from the multiplier term for use in modified booth encoding.

Fig.2.2 Grouping of bits from the multiplier term Each block is decoded to generate the correct partial product. The encoding of the multiplier Y, using the modified booth algorithm, generates the following five signed digits, -2, -1, 0, +1, +2. Each encoded digit in the multiplier performs a certain operation on the multiplicand, X, as illustrated in Table 1

For the partial product generation, we adopt Radix-4 Modified Booth algorithm to reduce the number of partial products for roughly one half. For multiplication of 2s complement numbers, the two-bit encoding using this algorithm scans a triplet of bits. When the multiplier B is divided into groups of two bits, the algorithm is applied to this group of divided bits. Figure 4, shows a computing example of Booth multiplying two numbers 2AC9 and 006A. The shadow denotes that the numbers in this part of Booth multiplication are all zero so that this part of the computations can be neglected. Saving those computations can

significantly reduce the power consumption caused by the transient signals. According to the analysis of the multiplication shown in figure 4, we propose the SPST-equipped modified-Booth encoder, which is controlled by a detection unit. The detection unit has one of the two operands as its input to decide whether the Booth encoder calculates redundant computations. As shown in figure 9. The latches can, respectively, freeze the inputs of MUX-4 to MUX-7 or only those of MUX-6 to MUX-7 when the PP4 to PP7 or the PP6 to PP7 are zero; to reduce the transition power dissipation. Figure 10, shows the booth partial product generation circuit. It includes AND/OR/EX-OR logic.

Fig.2.3 Illustration of multiplication using modified Booth encoding The PP generator generates five candidates of the partial products, i.e., {-2A,-A, 0, A, 2A}. These are then selected according to the Booth encoding results of the operand B. When the operand besides the Booth encoded one has a small absolute value, there are opportunities to reduce the spurious power dissipated in the compression tree.

Fig.2.4 SPST equipped modified Booth encoder 2.3. Partial product generator:

Fig.2.5 Booth partial product selector logic 2.4. Partial Product Generator: The multiplication first step generates from A and X a set of bits whose weights sum is the product P. For unsigned multiplication, P most significant bit weight is positive, while in 2's complement it is negative. The partial product is generated by doing AND between a and b which are a 4 bit vectors as shown in fig. If we take, four bit

multiplier and 4-bit multiplicand we get sixteen partial products in which the first partial product is stored in q. Similarly, the second, third and fourth partial products are stored in 4-bit vector n, x, y.

Fig.2.6 Booth partial products Generation The multiplication second step reduces the partial products from the preceding step into two numbers while preserving the weighted sum. The sough after product P is the sum of those two numbers. The two numbers will be added during the third step The "Wallace trees" synthesis follows the Dadda's algorithm, which assures of the minimum counter number. If on top of that we impose to reduce as late as (or as soon as) possible then the solution is unique. The two binary number to be added during the third step may also be seen a one number in CSA notation (2 bits per digit).

Fig.2.7 Booth single partial product selector logic

2.5. Modified Booth Encoder: Multiplication consists of three steps: 1) the first step to generate the partial products; 2) the second step to add the generated partial products until the last two rows are remained; 3) the third step to compute the final multiplication results by adding the last two rows. The modified Booth algorithm reduces the number of partial products by half in the first step. We used the modified Booth encoding (MBE) scheme proposed in. It is known as the most efficient Booth encoding and decoding scheme. To multiply X by Y using the modified Booth algorithm starts from grouping Y by three bits and encoding into one of {-2, -1, 0, 1, 2}. Table I shows the rules to generate the encoded signals by MBE scheme and Fig. 1 (a) shows the corresponding logic diagram. The Booth decoder generates the partial products using the encoded signals as shown in Fig. 1

Fig.2.8 Booth Encoder

Fig.2.9 Booth Decoder Fig. shows the generated partial products and sign extension scheme of the 8-bit modified Booth multiplier. The partial products generated by the modified Booth algorithm are added in parallel using the Wallace tree until the last two rows are remained. The final

multiplication results are generated by adding the last two rows. The carry propagation adder is usually used in this step.

Fig 2.10 Truth table for MBE Scheme 2.6. Spurious power suppression technique: The former SPST has been discussed in [7] and [8].figure 2 shows the five cases of a 16-bit addition in which the spurious switching activities occur. The 1st case illustrates a transient state in which the spurious transitions of carry signals occur in the MSP though the final result of the MSP are unchanged. The 2nd and the 3rd cases describe the situations of one negative operand adding another positive operand without and with carry from LSP, respectively. Moreover, the 4th and the 5th cases respectively demonstrate the addition of two negative operands without and with carry-in from LSP. In those cases, the results of the MSP are predictable Therefore the computations in the MSP are useless and can be neglected. The data are separated into the Most Significant Part (MSP) and the Least Significant Part (LSP).To know whether the MSP affects the computation results or not. We need a detection logic unit to detect the effective ranges of the inputs. The Boolean logical equations shown below express the behavioral principles of the detection logic unit in the MSP circuits of the SPST-based adder/subtractor:

Figure 2. Spurious transition cases in multimedia/ DSP processing


AMSP Aand Band = A[15:8]; BMSP = B[15:8] ; = A[15] A[14] A[8]; = B[15] B[14] B[8];]

where A[m] and B[n] respectively denote the mth bit of the operands A and the nth bit of the operand B, and AMSP and BMSP respectively denote the MSP parts, i.e. the 9th bit to the 16th bit, of the operands A and B. When the bits in AMSP and/or those in BMSP are all ones, the value of Aand and/or that of Band respectively become one, while the bits in AMSP and/or those in BMSP are all zeros, the value of Anor, and/or that of Bnor respectively turn into one. Being one of the three outputs of the detection logic unit, close denotes whether the MSP circuits can be neglected or not. When the two input operand can be classified into one of the five classes as shown in figure 1, the value of close becomes zero which indicates that the MSP circuits can be closed. figure 1. also shows that it is necessary to compensate the sign bit of computing results Accordingly, we derive the Karnaugh maps which lead to the Boolean equations (7)

and (8) for the Carr_ctrl and the sign signals, respectively. In equation (7) and (8), CLSP denotes the carry propagated from the LSP circuits.

Figure shows a 16-bit adder/subtractor design example based on the proposed SPST. In this example, the 16-bit adder/subtractor is divided into MSP and LSP at the place between the 8th bit and the 9th bit. Latches implemented by simple AND gates are used to control the input data of the MSP. When the MSP is necessary, the input data of MSP remain the same as usual, while the MSP is negligible, the input data of the MSP become zeros to avoid switching power consumption. From the derived Boolean equations (1) to (8), the detection logic unit of the SPST is designed as shown in figure 4. which can determine whether the input data of MSP should be latched or not. Moreover, we add three 1-bit to control the assertion of the close, sign, and Carr-ctrl signals in order to further decrease the glitch signals occurred in the cascaded circuits which are usually adopted in VLSI architectures designed for video coding.

Fig. shows a 16-bit adder/subtractor design example adopting the proposed SPST. In this example, the 16-bit adder/subtractor is divided into MSP and LSP between the eighth and the ninth bits. Latches implemented by simple AND gates are used to control the input data of the MSP. When theMSP is necessary, the input data of MSP remain unchanged. However, when the MSP is negligible, the input data of the MSP become zeros to avoid glitching power consumption. The two operands of the MSP enter the detection-logic unit, except the adder/subtractor, so that the detectionlogic unit can decide whether to turn off the MSP or not. Based on the derived Boolean equations (1) to (8), the detection-logic unit of SPST is shown in Fig. 6(a), which can determine whether the input data of MSP should be latched or not. Moreover, we propose the novel glitch-diminishing technique by adding three 1-bit registers to control the assertion of the close, sign, and carr-ctrl signals to further decrease the transient signals occurred in the cascaded circuits which are usually adopted in VLSI architectures designed for multimedia/DSP applications. The timing diagram is shown in Fig. 6(b). A certain amount of delay is used to assert the close, sign, and carr-ctrl signals after the period of data transition which is achieved by controlling the three 1bit registers at the outputs of the detection-logic unit. Hence, the transients of the detection-logic unit can be filtered out; thus, the data latches shown in Fig can prevent the glitch signals from flowing into the MSP with tiny cost. The data transient time and the earliest required time of all the inputs are also illustrated. The delay should be set in the range of, which is shown as the shadow area in Fig, to filter out the glitch signals as well as to keep the computation results correct. Based on Figs. 5 and 6, the timing issue of the SPST is analyzed as follows.

2.6.1. When the detection-logic unit turns off the MSP: At this moment, the outputs of the MSP are directly compensated by the SE unit; therefore, the time saved from skipping the computations in the MSP circuits shall cancel out the delay caused by the detection-logic unit. 2.6.2. When the detection-logic unit turns on the MSP: The MSP circuits must wait for the notification of the detection-logic unit to turn on the data latches to let the data in. Hence, the delay caused by the detection-logic unit will contribute to the delay of the whole combinational circuitry, i.e., the16-bit adder/subtractor in this design example. 2.6.3.When the detection-logic unit remains its decision: No matter whether the last decision is turning on or turning off the MSP, the delay of the detection logic is negligible because the path of the combinational circuitry (i.e., the 16-bit adder/subtractor in this design example) remains the same. From the analysis earlier, we can know that the total delay is affected only when the detection-logic unit turns on the MSP. However, the detection-logic unit should be a speed-oriented design. When the SPST is applied on combinational circuitries, we should first determine the longest transitions of the interested cross sections of each combinational circuitry, which is a timing characteristic and is also related to the adopted technology. The longest transitions can be obtained from analyzing the timing differences between the earliest arrival and the latest arrival signals of the cross sections of a combinational circuitry. Then, a delay generator similar to the delay line used in the DLL

CHAPTER 3 Results and Discussion


3.1.Simulation Results of MAC:

3.2. Introduction to FPGA


FPGA stands for Field Programmable Gate Array which has the array of logic module, I /O module and routing tracks (programmable interconnect). FPGA can be configured by end user to implement specific circuitry. Speed is up to 100 MHz but at present speed is in GHz. Main applications are DSP, FPGA based computers, logic emulation, ASIC and ASSP. FPGA can be programmed mainly on SRAM (Static Random Access Memory). It is Volatile and main advantage of using SRAM programming technology is reconfigurability. Issues in FPGA technology are complexity of logic element, clock support, IO support and interconnections (Routing). 3.2.1. FPGA Design Flow: FPGA contains a two dimensional arrays of logic blocks and interconnections between logic blocks. Both the logic blocks and interconnects are programmable. Logic blocks are programmed to implement a desired function and the interconnects are programmed using the switch boxes to connect the logic blocks. To be more clear, if we want to implement a complex design (CPU for instance), then the design is divided into small sub functions and each sub function is implemented using one logic block. Now, to get our desired design (CPU), all the sub functions implemented in logic blocks must be connected and this is done by programming the interconnects.

Internal structure of an FPGA is depicted in the following figure.

FPGAs, alternative to the custom ICs, can be used to implement an entire System On one Chip (SOC). The main advantage of FPGA is ability to reprogram. User can reprogram an FPGA to implement a design and this is done after the FPGA is manufactured. This brings the name FieldProgrammable. Custom ICs are expensive and takes long time to design so they are useful when produced in bulk amounts. But FPGAs are easy to implement with in a short time with the help of Computer Aided Designing (CAD) tools (because there is no physical layout process, no mask making, and no IC manufacturing). Some disadvantages of FPGAs are, they are slow compared to custom ICs as they cant handle vary complex designs and also they draw more power. Xilinx logic block consists of one Look Up Table (LUT) and one FlipFlop. An LUT is used to implement number of different functionality. The input lines to the logic block go into the LUT and enable it. The output of the LUT gives the result of the logic

function that it implements and the output of logic block is registered or unregistered out put from the LUT. SRAM is used to implement a LUT.A k-input logic function is implemented using 2^k * 1 size SRAM. Number of different possible functions for k input LUT is 2^2^k. Advantage of such an architecture is that it supports implementation of so many logic functions, however the disadvantage is unusually large number of memory cells required to implement such a logic block in case number of inputs is large.

Figure below shows a 4-input LUT based implementation of logic block LUT based design provides for better logic block utilization. A k-input LUT based logic block can be implemented in number of different ways with trade off between performance and logic density. An n-LUT can be shown as a direct implementation of a function truth-table. Each of the latch holds the value of the function corresponding to one input combination. For Example: 2-LUT can be used to implement 16 types of functions like AND , OR, A+not B .... etc. A 0 0 1 1 B 0 1 0 1 AND 0 0 0 1 0 1 1 1 OR ..... ...... ......

Interconnects A wire segment can be described as two end points of an interconnect with no programmable switch between them. A sequence of one or more wire segments in an FPGA can be termed as a track. Typically an FPGA has logic blocks, interconnects and switch blocks (Input/Output blocks). Switch blocks lie in the periphery of logic blocks and interconnect. Wire segments are connected to logic blocks through switch blocks. Depending on the required design, one logic block is connected to another and so on. FPGA DESIGN FLOW In this part of tutorial we are going to have a short intro on FPGA design flow. A simplified version of design flow is given in the flowing diagram.

FPGA Design Flow Design Entry There are different techniques for design entry. Schematic based, Hardware Description Language and combination of both etc. . Selection of a method depends on the design and designer. If the designer wants to deal more with Hardware, then Schematic entry is the better choice. When the design is complex or the designer thinks the design in an algorithmic way then HDL is the better choice. Language based entry is faster but lag in performance and density.

HDLs represent a level of abstraction that can isolate the designers from the details of the hardware implementation. Schematic based entry gives designers much more visibility into the hardware. It is the better choice for those who are hardware oriented. Another method but rarely used is state-machines. It is the better choice for the designers who think the design as a series of states. But the tools for state machine entry are limited. In this documentation we are going to deal with the HDL based design entry. Synthesis The process which translates VHDL or Verilog code into a device netlist formate. i.e a complete circuit with logical elements( gates, flip flops, etc) for the design.If the design contains more than one sub designs, ex. to implement a processor, we need a CPU as one design element and RAM as another and so on, then the synthesis process generates netlist for each design element Synthesis process will check code syntax and analyze the hierarchy of the design which ensures that the design is optimized for the design architecture, the designer has selected. The resulting netlist(s) is saved to an NGC( Native Generic Circuit) file (for Xilinx Synthesis Technology (XST)).

FPGA Synthesis Implementation This process consists a sequence of three steps 1. Translate 2. Map 3. Place and Route Translate: Process combines all the input netlists and constraints to a logic design file. This information is saved as a NGD (Native Generic Database) file. This can be done using

NGD Build program. Here, defining constraints is nothing but, assigning the ports in the design to the physical elements (ex. pins, switches, buttons etc) of the targeted device and specifying time requirements of the design. This information is stored in a file named UCF (User Constraints File). Tools used to create or modify the UCF are PACE, Constraint Editor etc.

FPGA Translate Map Process divides the whole circuit with logical elements into sub blocks such that they can be fit into the FPGA logic blocks. That means map process fits the logic defined by the NGD file into the targeted FPGA elements (Combinational Logic Blocks (CLB), Input Output Blocks (IOB)) and generates an NCD (Native Circuit Description) file which physically represents the design mapped to the components of FPGA. MAP program is used for this purpose.

FPGA map Place and Route : PAR program is used for this process. The place and route process places the sub blocks from the map process into logic blocks according to the constraints and

connects the logic blocks. Ex. if a sub block is placed in a logic block which is very near to IO pin, then it may save the time but it may effect some other constraint. So trade off between all the constraints is taken account by the place and route process The PAR tool takes the mapped NCD file as input and produces a completely routed NCD file as output. Output NCD file consists the routing information.

FPGA Place and route Device Programming: Now the design must be loaded on the FPGA. But the design must be converted to a format so that the FPGA can accept it. BITGEN program deals with the conversion. The routed NCD file is then given to the BITGEN program to generate a bit stream (a .BIT file) which can be used to configure the target FPGA device. This can be done using a cable. Selection of cable depends on the design. Design Verification Verification can be done at different stages of the process steps. Behavioral Simulation (RTL Simulation): This is first of all simulation steps; those are encountered throughout the hierarchy of the design flow. This simulation is performed before synthesis process to verify RTL (behavioral) code and to confirm that the design is functioning as intended. Behavioral simulation can be performed on either VHDL or Verilog designs. In this process, signals and variables are observed, procedures and functions are traced and breakpoints are set. This is a very fast simulation and so allows the designer to change the HDL code if the required functionality is not met with in a short time period. Since the design is not yet synthesized to gate level, timing and resource usage properties are still unknown.

Functional simulation (Post Translate Simulation): Functional simulation gives information about the logic operation of the circuit. Designer can verify the functionality of the design using this process after the Translate process. If the functionality is not as expected, then the designer has to made changes in the code and again follow the design flow steps. Static Timing Analysis : This can be done after MAP or PAR processes Post MAP timing report lists signal path delays of the design derived from the design logic. Post Place and Route timing report incorporates timing delay information to provide a comprehensive timing summary of the design.

3.4 Synthesis Result


The developed MAC design is simulated and verified their functionality. Once the functional verification is done, the RTL model is taken to the synthesis process using the Xilinx ISE tool. In synthesis process, the RTL model will be converted to the gate level netlist mapped to a specific technology library. This MAC design can be synthesized on the family of Spartan 3E. Here in this Spartan 3E family, many different devices were available in the Xilinx ISE tool. In order to synthesis this design the device named as XC3S500E has been chosen and the package as FG320 with the device speed such as -4. The design of MAC is synthesized and its results were analyzed as follows.

Device utilization summary:

This device utilization includes the following. Logic Utilization Logic Distribution Total Gate count for the Design

The device utilization summery is shown above in which its gives the details of number of devices used from the available devices and also represented in %. Hence as the result of the synthesis process, the device utilization in the used device and package is shown above. Timing Summary: Speed Grade: -4 Minimum period: 11.496ns (Maximum Frequency: 86.987MHz) Minimum input arrival time before clock: 6.892ns Maximum output required time after clock: 4.394ns Maximum combinational path delay: No path found In timing summery, details regarding time period and frequency is shown are approximate while synthesize. After place and routing is over, we get the exact timing summery. Hence the maximum operating frequency of this synthesized design is given as 86.987 MHz and the minimum period as 11.496 ns. Here, OFFSET IN is the minimum input arrival time before clock and OFFSET OUT is maximum output required time after clock.

RTL Schematic The RTL (Register Transfer Logic) can be viewed as black box after synthesize of design is made. It shows the inputs and outputs of the system. By double-clicking on the diagram we can see gates, flip-flops and MUX.

IN P U T

O U TP U T S
Figure 3.6 Schematic with Basic Inputs and Output

3.4 Summary
The developed MAC design is modelled and is simulated using the Modelsim tool. The simulation results are discussed by considering different cases. The RTL model is synthesized using the Xilinx tool in Spartan 3E and their synthesis results were discussed with the help of generated reports.

CHAPTER 4 CONCLUSION
A 16x16 multiplier-accumulator (MAC) is presented in this work. A RADIX 4 Modified Booth multiplier circuit is used for MAC architecture. Compared to other circuits, the Booth multiplier has the highest operational speed and less hardware count. The basic building blocks for the MAC unit are identified and each of the blocks is analyzed for its performance. Power and delay is calculated for the blocks. 1-bit MAC unit is designed with enable to reduce the total power consumption based on block enable technique. Using this block, the N-bit MAC unit is constructed and the total power consumption is calculated for the MAC unit. The power reduction techniques adopted in this work. The MAC unit designed in this work can be used in filter realizations for High speed DSP applications. Table 12 summarizes the results obtained.