
HIGH SPEED LOW POWER COMPLEX MULTIPLIER DESIGN USING PARALLEL ADDERS AND SUBTRACTORS

PRABIR KUMAR SAHA
Dept. of ECE, JIS College of Engineering, Kalyani-741235, INDIA
sahaprabir1@gmail.com

ARINDAM BANERJEE
Dept. of ECE, JIS College of Engineering, Kalyani-741235, INDIA
banerjee.arindam1@gmail.com

ANUP DANDAPAT
Dept. of ETCE, Jadavpur University, Kolkata-700032, INDIA
anup.dandapat@gmail.com

ABSTRACT
In this paper we propose a combinational scheme for implementing a complex multiplier. The multiplier is built from parallel adders, subtractors and compressors, each of which is discussed in this paper. The energy-delay product of the multiplier, using a standard 90 nm CMOS model, measures $1.86 \times 10^{-21}$ J-s. Compressors and Sklansky adders are used to enhance the speed and performance of the multipliers.

Keywords: Parallel Adders, Subtractors, Compressors, Multiplier, Complex Multiplier.

I. INTRODUCTION


The DFT (Discrete Fourier Transform) and FFT (Fast Fourier Transform) are the backbone of a digital signal processor, and the complex multiplier is the basic building block of both. The arithmetic of complex multiplication has therefore received close attention in modern ASIC design. In particular, high-speed multiplication is becoming increasingly important in DSPs, graphics accelerators and scientific computing. Scientific computation also includes convolutions, correlations, orthogonal transformations, complex filtering and many other operations. For effective manipulation of complex numbers, several authors have proposed special number representation techniques [1]-[5]. Conventionally, a complex multiplier requires four real-number multipliers and two adders, which leads to a highly complex topology [3] because of the interconnection of the four multipliers and the adders. In this brief, an efficient real-imaginary complex multiplier with an alternate scheme is proposed and used to implement complex-number multipliers with less interconnection complexity. This paper presents a complex multiplier architecture based on compressors, parallel adders, subtractors and Wallace tree multipliers. The multiplier is fully parameterized, so any configuration of input and output word lengths can be elaborated.

The design preliminaries of the complex multiplier are introduced in Section II, multiplier algorithms are described in Section III, complex multipliers in Section IV and simulation results in Section V.

II. DESIGN PRELIMINARIES

2.1 PARALLEL ADDER

We have chosen adders with a wide spectrum of timing and complexity, which makes it interesting to compare their performance in terms of power, delay and energy-delay product. The adders range from the simple but slow (linear-time) ripple carry adder to the fairly complex but extremely fast (constant-time) signed-digit adders. Between these lie the parallel-prefix and blocked carry look-ahead adders, which run in O(log N) time.

Substituting $c_i = g_{i-1} + c_{i-1} p_{i-1}$ in expression (2.3) yields

$c_{i+1} = g_i + g_{i-1} p_i + c_{i-1} p_{i-1} p_i$    (2.4)

2.1.1 RIPPLE CARRY ADDER

The fundamental unit of a ripple carry adder (RCA), shown in Fig. 1, is a full adder, which computes a sum bit and a carry bit:

$S_i = a_i \oplus b_i \oplus c_i$    (2.1)

$c_{i+1} = a_i b_i + b_i c_i + a_i c_i$    (2.2)

However, for large numbers of bit columns, the look-ahead logic requires an extremely large number of gates. The CLA may therefore be divided into groups of 4 bits, with a separate adder in each group and the carry propagating from group to group as in a ripple carry adder (Section 2.1.1). Fig. 2 shows a four-bit CLA structure. The delay and power of carry look-ahead adders of various lengths (using standard 90 nm CMOS technology) are shown in Table 2.
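The look-ahead idea can be illustrated with a small behavioural model. The sketch below is an assumption-laden illustration, not the circuit of Fig. 2: it expands every carry of a 4-bit group down to the group carry-in, as repeated substitution into (2.4) would, and then chains four such groups so that only the group carry ripples. The function names `cla4` and `grouped_cla` are hypothetical.

```python
def cla4(a, b, c0=0):
    """4-bit carry look-ahead addition. Returns (4-bit sum, carry out)."""
    g = [((a >> i) & (b >> i)) & 1 for i in range(4)]  # generate:  g_i = a_i AND b_i
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]  # propagate: p_i = a_i XOR b_i
    # Every carry is expanded down to c0, so none waits on a neighbouring carry.
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    total = sum((p[i] ^ carries[i]) << i for i in range(4))
    return total, c4


def grouped_cla(a, b, width=16):
    """Chain 4-bit CLA groups; the carry ripples from one group to the next."""
    total, carry = 0, 0
    for k in range(0, width, 4):
        s, carry = cla4((a >> k) & 0xF, (b >> k) & 0xF, carry)
        total |= s << k
    return total, carry
```

The number of product terms per carry grows with the bit index, which is why the grouping into 4-bit blocks keeps the gate count manageable.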

Fig. 1 Ripple Carry Adder Architecture

Each full adder adds two operand bits and an incoming carry to produce a sum bit and a carry bit; for example, the first stage adds the three bits a0, b0 and c0, where the a's and b's are the operand inputs and the c's are the carry inputs. In the worst case the carry propagates from the least significant bit position to the most significant bit position, so the delay grows linearly with the word length. The delay and power of ripple carry adders of various lengths (using standard 90 nm CMOS technology) are shown in Table 1.
Table-1 Delay and power comparison of Ripple Carry adder

Length    Delay (ps)    Power (µW)    EDP ($10^{-27}$ J-s)
4 bit     141.5         0.567         11.4
8 bit     170           1.33          38.43
16 bit    190           1.91          69
32 bit    323.5         4.53          474
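As a behavioural illustration of the ripple carry adder, the following sketch chains full adders built from equations (2.1) and (2.2). It models logic only, not timing or power, and the function names are hypothetical.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum bit, carry bit)."""
    s = a ^ b ^ c                         # sum bit, eq. (2.1)
    c_out = (a & b) | (b & c) | (a & c)   # carry bit, eq. (2.2)
    return s, c_out


def ripple_carry_add(a, b, width):
    """Add two width-bit integers; the carry ripples from LSB to MSB."""
    total, carry = 0, 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry
```

Adding `0b1111` and `0b0001` with `width=4` exercises the worst case described above: the carry generated at bit 0 travels through every stage.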

Fig. 2 Carry Look Ahead adder 4 bit

2.1.2 CARRY LOOK AHEAD ADDER

A CLA (carry look-ahead adder) reduces the time necessary to compute the carry out after the arrival of a valid carry in [6]. The CLA does not require additional information about the previous input digits. Let $g_i = a_i b_i$ denote the generated carry and let $p_i = a_i \oplus b_i$ denote the propagated carry. Then

$c_{i+1} = a_i b_i + b_i c_i + a_i c_i = a_i b_i + c_i (a_i \oplus b_i) = g_i + c_i p_i$    (2.3)

Table-2 Delay and power comparison of Carry Look Ahead adder

Length    Delay (ps)    Power (µW)    EDP ($10^{-27}$ J-s)
4 bit     138           0.856         16.3
8 bit     149.5         1.33          29.7
16 bit    198.5         2.75          108.4
32 bit    305.5         6.35          592.63

2.1.3 BRENT KUNG ADDER

The Brent-Kung adder works on the principle of the operator $\circ$ [7], defined as

$(g_x, p_x) \circ (g_y, p_y) = (g_x + p_x g_y,\; p_x p_y) = (G_{xy}, P_{xy})$    (2.5)

and, if $y = 0$, $(G_{x0}, P_{x0}) = (G_x, P_x)$.

Lemma 1. Let

$(G_i, P_i) = (g_0, p_0)$ if $i = 0$
$(G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1})$ if $1 \le i \le n-1$    (2.6)

Then $c_i = G_i$ for $i = 0, 1, 2, \ldots, n-1$.

Let $(G_{ji}, P_{ji}) = (g_j, p_j) \circ (g_{j-1}, p_{j-1}) \circ \cdots \circ (g_i, p_i)$    (2.7)

Lemma 2. The operator $\circ$ is associative, i.e.

$(G_{31}, P_{31}) = (g_3, p_3) \circ (G_{21}, P_{21}) = (G_{32}, P_{32}) \circ (g_1, p_1)$    (2.8)

The resulting 16-bit carry generation topology is given in Fig. 3. The roles of the black and white processors in Brent-Kung adders are shown in Fig. 4, and the delay and power of adders of various lengths (using standard 90 nm CMOS technology) are shown in Table 3.
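The prefix operator of (2.5) and the way its associativity is exploited can be sketched in software. The model below is a behavioural sketch under stated assumptions: the word length is a power of two, the carry-in is zero, and the up-sweep/down-sweep schedule is the textbook Brent-Kung arrangement rather than a cell-for-cell copy of Fig. 3. Here `G[i]` is read as the carry out of bit position i. The names `o`, `brent_kung_carries` and `bk_add` are hypothetical.

```python
def o(x, y):
    """Prefix operator of eq. (2.5): (gx, px) o (gy, py) = (gx + px.gy, px.py)."""
    gx, px = x
    gy, py = y
    return (gx | (px & gy), px & py)


def brent_kung_carries(a, b, n=16):
    """Return [G_0 .. G_{n-1}], where G_i is the carry out of bit i (carry-in 0)."""
    x = [(((a >> i) & (b >> i)) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    step = 2
    while step <= n:                      # up-sweep: build a binary tree of groups
        for i in range(step - 1, n, step):
            x[i] = o(x[i], x[i - step // 2])
        step *= 2
    step = n // 2
    while step >= 2:                      # down-sweep: fill in the remaining prefixes
        for i in range(step + step // 2 - 1, n, step):
            x[i] = o(x[i], x[i - step // 2])
        step //= 2
    return [g for g, _ in x]


def bk_add(a, b, n=16):
    """n-bit addition using the prefix carries. Returns (sum, carry out)."""
    carries = brent_kung_carries(a, b, n)
    carry_in = sum(c << (i + 1) for i, c in enumerate(carries))  # carry into bit i+1
    return ((a ^ b) ^ carry_in) & ((1 << n) - 1), carries[-1]
```

Because the operator is associative (Lemma 2), the groups can be combined in any bracketing, which is what allows the tree to finish in a logarithmic number of levels instead of the n-1 serial steps of Lemma 1.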

The Kogge-Stone tree shown in Fig. 6 achieves both $\log_2 N$ stages and a fan-out of 2 at each stage [8]. This comes at the cost of many long wires that must be routed between stages. The tree also contains more PG (propagate and generate) cells; while this may not affect the area if the adder layout is on a regular grid, it will increase the power consumption. The cell diagram notation is given in Fig. 5, and the resulting 16-bit carry generation topology is shown in Fig. 6. The delay and power of Kogge-Stone adders of various lengths (using standard 90 nm CMOS technology) are shown in Table 4.
Table-4 Delay and power comparison of Kogge-Stone adder

Length    Delay (ps)    Power (µW)    EDP ($10^{-27}$ J-s)
4 bit     170.5         1.10          32
8 bit     227           2.03          104
16 bit    254.5         4.48          290
32 bit    315.5         8.98          894
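A behavioural sketch of the Kogge-Stone scheme is given below, assuming a zero carry-in and the standard formulation in which, at each stage, every bit position combines with the position a power-of-two distance to its right. It also counts the stages so that the $\log_2 N$ depth can be checked. The function name is hypothetical and the model says nothing about wiring, delay or power.

```python
def kogge_stone_add(a, b, n=16):
    """Returns (n-bit sum, carry out, number of prefix stages)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # per-bit propagate
    G = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # running group generate
    P = p[:]                                            # running group propagate
    d, stages = 1, 0
    while d < n:
        # Every position i >= d combines with the node d places to its right.
        new_G = [G[i] | (P[i] & G[i - d]) if i >= d else G[i] for i in range(n)]
        new_P = [P[i] & P[i - d] if i >= d else P[i] for i in range(n)]
        G, P = new_G, new_P
        d, stages = 2 * d, stages + 1
    # Sum bit i is p_i XOR the carry out of bit i-1.
    total = sum((p[i] ^ (G[i - 1] if i else 0)) << i for i in range(n))
    return total, G[n - 1], stages
```

Almost every position is updated at every stage, which reflects the larger number of PG cells mentioned above.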

Fig. 3 16 bit carry generation of Brent Kung Adder

Fig. 5 The cell diagram notation of Kogge-Stone Adder

Fig. 4 (a) White Processor (b) Black Processor

Table-3 Delay and power comparison of Brent-Kung adder

Length    Delay (ps)    Power (µW)    EDP ($10^{-27}$ J-s)
4 bit     181           1.13          37
8 bit     239           2.42          138.2
16 bit    271.5         2.43          179.12
32 bit    340.5         11            1275

Fig. 6 16 bit carry generation of Kogge-Stone adder

2.1.4 KOGGE STONE ADDER

2.1.5 SKLANSKY ADDER

The Sklansky adder is not as heavily affected by wire delay as similar prefix adders such as Kogge-Stone [8]. We assume that the critical path in an adder is determined by the time required to pass the carry bit from the least significant bit to the most significant bit. Fig. 7 illustrates the critical path of the Sklansky adder.
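For comparison with the other prefix adders, a behavioural sketch of the Sklansky (divide-and-conquer) structure follows. It assumes a power-of-two word length and zero carry-in, and uses the standard textbook formulation: at each level, every position in the upper half of a block combines with the top position of the lower half. The paper gives the structure only as Fig. 7, so the details here are an assumption, and the function name is hypothetical.

```python
def sklansky_add(a, b, n=16):
    """Returns (n-bit sum, carry out, number of prefix levels)."""
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(n)]   # per-bit propagate
    G = [((a >> i) & (b >> i)) & 1 for i in range(n)]   # running group generate
    P = p[:]                                            # running group propagate
    half, levels = 1, 0
    while half < n:
        for i in range(n):
            if i & half:                                # i lies in an upper half-block
                j = (i // half) * half - 1              # top of the lower half-block
                G[i], P[i] = G[i] | (P[i] & G[j]), P[i] & P[j]
        half, levels = 2 * half, levels + 1
    total = sum((p[i] ^ (G[i - 1] if i else 0)) << i for i in range(n))
    return total, G[n - 1], levels
```

The wires stay short because each node only reaches into its own block, but the node at index `j` drives `half` consumers, so the fan-out doubles at every level.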

The implementation of [12] is better, with a delay of only three XOR gates. The problems with this kind of conventional compressor are [13]:

(i) The uneven delay profile of the outputs arriving from different input paths tends to generate glitches.

(ii) Compressors perform the simple operation of adding several bits at a time, but the conventional 4-2 compressor requires one more half adder, whose two inputs are COUT and C [11], to produce the final addition result.

Fig. 7 16 bit Sklansky adder

The delay and power of Sklansky adders of various lengths (using standard 90 nm CMOS technology) are shown in Table 5.

Example: if X1 = X2 = X3 = X4 = 1 and CIN = 0, the addition result should be four, i.e. binary 100, but the conventional architecture produces COUT = 1, C = 1 and S = 0. Only when COUT and C are fed to a half adder is the final result produced, as shown in Fig. 9.

The conventional compressor thus requires one more half adder to obtain the final result, which adds delay and power consumption. The modified design of the compressor is given in Fig. 9.
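The example can be reproduced with a behavioural model of the conventional 4-2 compressor as two full adders in series, followed by the extra half adder. This is a logic-level sketch only; which inputs feed which full adder is an assumption, and the function names are hypothetical.

```python
def full_adder(a, b, c):
    """Returns (sum bit, carry bit)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def half_adder(a, b):
    """Returns (sum bit, carry bit)."""
    return a ^ b, a & b


def compressor_4_2(x1, x2, x3, x4, cin):
    """Conventional 4-2 compressor: S has weight 1, C and COUT have weight 2."""
    s1, cout = full_adder(x1, x2, x3)
    s, c = full_adder(s1, x4, cin)
    return cout, c, s


# The case from the text: four ones and no carry-in.
cout, c, s = compressor_4_2(1, 1, 1, 1, 0)   # COUT = 1, C = 1, S = 0
hs, hc = half_adder(cout, c)                 # the extra half adder
final_bits = (hc, hs, s)                     # binary 100, i.e. four
```

The half adder is needed because COUT and C carry the same weight and must still be added to each other before a single binary number results.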

Fig. 8 Pre and Post processing unit

Table-5 Delay and power comparison of Sklansky adder

Length    Delay (ps)    Power (µW)    EDP ($10^{-27}$ J-s)
4 bit     83.75         0.32          2.24
8 bit     135.7         1.04          19.15
16 bit    181           2.36          77.32
32 bit    301.5         3.50          318.16

Fig. 9 Modified 4-2 compressor

MODIFIED 4:3 COMPRESSOR


The problems mentioned above have been considered in designing a new 4-3 compressor to improve the performance of multiplier circuits [14]. We define the 4-3 compressor as a counter of the 1s in its input bits. As there are four input bits, the maximum count is 4 (binary 100), which needs three output bits, and we have designed the 4-3 compressor on this basis. The block diagram is shown in Fig. 10.

2.2 COMPRESSORS

2.2.1 4:3 COMPRESSORS


The conventional 4-2 compressor circuit actually compresses five partial product bits into three [9]. The architecture can be implemented with two full adder (FA) stages connected in series. The outputs of the 4-2 compressor consist of one bit in position j and two bits in position (j + 1). This straightforward approach has a delay of four XOR gates [10]. An attractive implementation using XOR gates and MUXes was described by Jiangmin and Hong in 2003 [12].

Fig. 10 Block diagram of proposed 4-3 compressor

The transistor-level implementation is shown in Fig. 11. We have used transmission gates (TG) as circuit elements to make it faster. In this design all the outputs pass through the same three stages, so their delays are nearly equal. The outputs occupy three consecutive bit positions (j, j+1, j+2). In the partial product addition stage of the multiplication process, $C_{j+1}$ should therefore be fed to the (j+1)th adder of the next stage and $C_{j+2}$ to the (j+2)th adder of that same stage.
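The counter behaviour of the 4-3 compressor can be written down at the logic level. The Boolean forms below are one possible formulation chosen for illustration, not the transmission-gate circuit of Fig. 11, and the function name is hypothetical.

```python
def compressor_4_3(x1, x2, x3, x4):
    """Count the ones among four bits. Returns (C_{j+2}, C_{j+1}, S_j)."""
    s = x1 ^ x2 ^ x3 ^ x4                       # weight 1: the count is odd
    all_four = x1 & x2 & x3 & x4                # weight 4: the count is exactly 4
    any_pair = ((x1 & x2) | (x1 & x3) | (x1 & x4)
                | (x2 & x3) | (x2 & x4) | (x3 & x4))
    c1 = any_pair & (all_four ^ 1)              # weight 2: the count is 2 or 3
    return all_four, c1, s
```

With all four inputs high the outputs are (1, 0, 0), so the result is already a single binary number and no additional half adder is required.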

Fig. 12 Block diagram of 6-3 compressor

2.2.2 6:3 COMPRESSOR

The 6-3 compressor logic described here is based on the counter property of the full adder. It can be defined as a single-bit adder circuit with six inputs and three outputs. The block diagram of the 6-3 compressor is shown in Fig. 12.

2.2.2.1 6-3 COMPRESSOR ARCHITECTURE


The block diagram with its sub-units is shown in Fig. 13. It contains three circuit blocks: two adder blocks and a third block that performs the parallel addition. The six input bits are first processed by the full adder circuits, and their outputs are then added by an architecture that performs the addition in parallel. The performance comparison for the 6-3 compressor is shown in Table 6.
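A logic-level sketch of this three-block organisation is shown below. The two full adders follow the description directly; the internals of the parallel addition block are not given in the text, so the combination used here is an assumption that merely produces the correct three-bit count. The function names are hypothetical.

```python
def full_adder(a, b, c):
    """Returns (sum bit, carry bit)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def compressor_6_3(x1, x2, x3, x4, x5, x6):
    """Count the ones among six bits. Returns (out2, out1, out0), weights 4, 2, 1."""
    s_a, c_a = full_adder(x1, x2, x3)     # first adder block
    s_b, c_b = full_adder(x4, x5, x6)     # second adder block
    # Parallel addition block (assumed form): add the two 2-bit partial counts.
    out0 = s_a ^ s_b
    k = s_a & s_b                         # carry out of the weight-1 column
    out1, out2 = full_adder(c_a, c_b, k)  # weight-2 column and its carry
    return out2, out1, out0
```

The maximum count of six appears as (1, 1, 0), which fits in the three outputs.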

Fig. 13 Block diagram of 6-3 compressor

2.3 ADDER/SUBTRACTOR

The conventional adder/subtractor block, shown in Fig. 14, performs addition as well as subtraction in a single block; a control signal selects between the two operations. The performance of different adders is compared in Fig. 15.
Table-6 Performance comparisons for six bit addition

Used Module    Delay (ps) per output (x1-x3)    Avg Delay (ps)    Avg Power (nW)
HA & FA        192 / 183 / 172                  182               517
6-3 comp       119 / 85 / 91                    98                134

Fig. 11 Modified 4-3 compressor designed with TG

Fig. 14 Conventional Adder/Subtractor

The major disadvantage of this circuit (shown in Fig. 14) is that subtracting a larger number from a smaller one gives the result in 2's complement form. The modified architecture solves this problem by incorporating a second parallel adder stage; Fig. 16 shows the modified adder/subtractor. The signal named control selects the type of operation: it is low for addition and high for subtraction. For addition, the second stage passes the output of the first stage; for subtraction, it acts according to the carry of the first stage. If the first-stage carry is low, the first-stage result is in 2's complement form (negative), and taking its 2's complement in the second stage gives the exact result. If the first-stage carry is high, the first-stage result is in normal form (positive), and the second stage passes it through unchanged.
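The two-stage behaviour can be sketched as follows. This is a behavioural model under stated assumptions (unsigned n-bit operands, subtraction performed as a plus the complement of b plus one); the function name and the returned sign flag are hypothetical and are not signals named in the paper.

```python
def add_sub(a, b, control, n=8):
    """control = 0: a + b; control = 1: a - b. Returns (value, is_negative)."""
    mask = (1 << n) - 1
    # First stage: add b, or its bitwise complement plus 1 when subtracting.
    b_in = (b ^ mask) if control else b
    raw = (a & mask) + (b_in & mask) + control
    first, carry = raw & mask, raw >> n
    if control == 0:
        return raw, False            # addition: the second stage passes the result
    if carry == 1:
        return first, False          # a >= b: the result is already in normal form
    # a < b: the first stage holds a 2's complement value; complement it again.
    return ((first ^ mask) + 1) & mask, True
```

For example, `add_sub(3, 5, 1)` leaves 254 (the 2's complement of 2) after the first stage with the carry low, and the second stage converts it to the magnitude 2.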

Fig. 16 Modified architecture of adder/subtractor

III. MULTIPLIER

A 16-bit multiplier is constructed using the Wallace tree architecture [14], shown in Fig. 17. Partial products are added in five stages. Adders and different compressors are used to minimize the number of stage operations, and they are chosen carefully so that a minimum number of outputs is generated. As an example, consider column ten, where ten bits are added at the first stage. These ten bits could be added using one 6-3 compressor and one 4-3 compressor, but that would generate six outputs (three from each compressor). Instead we use one 7-3 compressor and one full adder, which generate only five outputs (three from the compressor and two from the full adder) and so decrease the number of bits for the next stage. Now consider column sixteen, where sixteen partial products are added at the first stage. They could be added using one 7-3 compressor, one 6-3 compressor and one full adder, but these three blocks would generate eight outputs. Instead we use two 7-3 compressors, which generate six outputs, and promote the other two bits directly to the second stage, so that eight bits are left for the second-stage addition. Thus the partial products are added using the minimum number of adders and compressors without increasing the number of bits generated for the next stage. The performance of different multipliers is given in Table 7.
Table 7 Delay and power calculation of the multiplier block

Multiplier Type    Delay (ps)    Power (µW)    EDP ($10^{-24}$ J-s)
16X16              221           64            3.12
32X32              1640          148           398
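The output counts quoted for columns ten and sixteen can be checked with a small bookkeeping sketch. It only counts bits: it assumes each cell consumes its full number of inputs from the column, that unused bits pass straight to the next stage, and it ignores the fact that the outputs land in different weight columns. The table `CELLS` and the function name are hypothetical.

```python
# (inputs consumed, outputs produced) for each cell type named in the text
CELLS = {"7-3": (7, 3), "6-3": (6, 3), "4-3": (4, 3), "FA": (3, 2)}


def bits_after_stage(column_bits, cells):
    """Bits produced by one column: cell outputs plus bits passed through."""
    consumed = sum(CELLS[c][0] for c in cells)
    if consumed > column_bits:
        raise ValueError("cells consume more bits than the column holds")
    passed_through = column_bits - consumed
    return sum(CELLS[c][1] for c in cells) + passed_through


column_ten_alt = bits_after_stage(10, ["6-3", "4-3"])             # 6
column_ten_used = bits_after_stage(10, ["7-3", "FA"])             # 5
column_sixteen_alt = bits_after_stage(16, ["7-3", "6-3", "FA"])   # 8
column_sixteen_used = bits_after_stage(16, ["7-3", "7-3"])        # 6 + 2 passed = 8
```

For column sixteen both choices leave eight bits, but the chosen one uses two cells instead of three.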

Fig. 15 Delay Comparison results of Adder/Subtractor

Fig. 17 16-bit multiplier architecture

IV. COMPLEX MULTIPLIER

A complex number consists of two parts, a real part and an imaginary part, distinguished by the imaginary unit $j = \sqrt{-1}$. Computing the product of two complex numbers directly requires four real multiplications, one addition unit and one subtraction unit. The complex numbers can be defined as

$A = A_r + j A_i$ and $B = B_r + j B_i$    (3.1)

where $A_r$, $B_r$ are the real parts and $A_i$, $B_i$ are the imaginary parts. The product of A and B is given by

$A B = (A_r + j A_i)(B_r + j B_i) = (A_r B_r - A_i B_i) + j (A_i B_r + A_r B_i)$    (3.2)

To reduce the arithmetic complexity of the complex multiplier, the algebraic transformation of equation (3.3) was proposed by Blahut [15]:

$A B = (A_r - A_i) B_i + A_r (B_r - B_i) + j [(A_r - A_i) B_i + A_i (B_r + B_i)]$    (3.3)

This method saves one multiplication, because the product $(A_r - A_i) B_i$ is shared by the real and imaginary parts, at the expense of two more additions and one more subtraction. Fig. 18 gives the graphical view of equation (3.3). As the figure shows, this form requires the pre-addition $B_r + B_i$ and the pre-subtractions $A_r - A_i$ and $B_r - B_i$ before the binary multiplications, which increases the critical path delay.

Fig. 19 shows the direct method of implementing the complex multiplier. It contains only two types of processors: (i) multipliers and (ii) adders or subtractors. Four multiplication blocks are required, but they all work in parallel, and the critical path delay is reduced because it contains only one adder block. Table 8 compares the performance of complex multipliers built with the different architectures.
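The two formulations can be compared numerically. The sketch below implements the direct four-multiplication form of (3.2) and the three-multiplication form of (3.3) on plain integers; it illustrates the algebra only, not the hardware, and the function names are hypothetical.

```python
def complex_mul_direct(ar, ai, br, bi):
    """Eq. (3.2): four multiplications, one subtraction, one addition."""
    return ar * br - ai * bi, ai * br + ar * bi


def complex_mul_three(ar, ai, br, bi):
    """Eq. (3.3): three multiplications, with pre-additions before them."""
    shared = (ar - ai) * bi              # used by both the real and imaginary parts
    real = shared + ar * (br - bi)
    imag = shared + ai * (br + bi)
    return real, imag


product_direct = complex_mul_direct(3, 4, 5, 6)   # (3 + 4j)(5 + 6j) = -9 + 38j
product_three = complex_mul_three(3, 4, 5, 6)
```

In the direct form every multiplier sees its operands immediately, whereas in the three-multiplication form each multiplier has to wait for a pre-addition or pre-subtraction, which is the critical-path penalty discussed above.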

Fig. 18 Graphical representation of equation 3.3

Fig. 19 Direct method implementation of Complex Multiplier.

Table 8 Performance Comparison of Complex Multiplier

Multiplier Type    Architecture Used        Delay (ns)    Power (mW)    EDP ($10^{-21}$ J-s)
16X16              Blahut                   6             12            432
16X16              Distributed Algorithm    25            15            9375
16X16              Proposed                 5             8             200
32X32              Blahut                   11            29            3509
32X32              Distributed Algorithm    48            28            64512
32X32              Proposed                 9             23            1863

V. CONCLUSIONS

A Wallace-tree-based complex multiplier has been designed and verified for functionality, speed and power consumption. The Wallace tree multiplier is built from compressors and parallel adders, and all the designs target high speed and low power. The SPICE simulation results are given graphically in Fig. 20. Compared with the Blahut architecture in Table 8, the proposed architecture reduces the energy-delay product by 53.7% for the 16-bit complex multiplier and by 46.9% for the 32-bit complex multiplier.
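The quoted reductions can be reproduced from the delay and power entries of Table 8. The sketch below assumes that the energy-delay product is computed as power times delay squared, which is the relation the tabulated EDP values satisfy, and that the baseline is the Blahut architecture. The dictionary and function names are hypothetical.

```python
# (delay in ns, power in mW) copied from Table 8
TABLE_8 = {
    ("16x16", "Blahut"): (6, 12),
    ("16x16", "Proposed"): (5, 8),
    ("32x32", "Blahut"): (11, 29),
    ("32x32", "Proposed"): (9, 23),
}


def edp(delay_ns, power_mw):
    """Energy-delay product in units of 1e-21 J-s (mW x ns^2)."""
    return power_mw * delay_ns ** 2


def edp_reduction_percent(size):
    """EDP reduction of the proposed design relative to Blahut, in percent."""
    base = edp(*TABLE_8[(size, "Blahut")])
    ours = edp(*TABLE_8[(size, "Proposed")])
    return round(100 * (base - ours) / base, 1)
```

With these inputs the 16-bit case gives EDP values of 432 and 200, and the 32-bit case gives 3509 and 1863, matching the EDP column of Table 8.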

Fig. 20 (a) Delay Comparison of Parallel Adder

Fig. 20 (b) Power Comparison of Parallel Adder

References:
[1] Y. Ohi, T. Aoki, and T. Higuchi, "Redundant complex number systems," Proc. 25th IEEE Int'l Symp. Multiple-Valued Logic, pp. 14-19, May 1995.
[2] N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome, "A 4.4ns CMOS 54×54-b multiplier using pass-transistor multiplexer," IEEE J. Solid-State Circuits, Vol. 30, No. 3, pp. 251-257, March 1995.
[3] T. Aoki, Y. Ohi, and T. Higuchi, "Redundant complex number arithmetic for high-speed signal processing," in Proc. 1995 IEEE Workshop VLSI Signal Processing, Sakai, Japan, Oct. 1995, pp. 523-531.
[4] Anders Berkeman, Viktor Öwall, and Mats Torkelson, "A Low Logic Depth Complex Multiplier Using Distributed Arithmetic," IEEE Journal of Solid State Circuits, Vol. 35, No. 4, April 2000.
[5] S. G. Smith and P. B. Denyer, "Efficient bit-serial complex multiplication and sum-of-products computation using distributed arithmetic," in Proc. IEEE Int. Conf. Acoustics Speech and Signal Processing, 1986, pp. 271-276.
[6] B. D. Lee and V. G. Oklobdzija, "Improved CLA Scheme with Optimized Delay," Journal of VLSI Signal Processing, Vol. 3, pp. 265-274, 1991.
[7] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Transactions on Computers, Vol. C-31, No. 3, pp. 260-264, March 1982.
[8] Z. Huang and M. D. Ercegovac, "Effect of wire delay on the design of prefix adders in deep-submicron technology," in Proceedings of the 34th Asilomar Conference on Signals, Systems, and Computers, Oct. 2000.
[9] Oklobdzija G V, Villeger D, Liu S S, "A Method For Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using An Algorithmic Approach," IEEE Transactions on Computers, Vol. 45, No. 3, March 1996.
[10] Hsiao F S, Jiang R M, Yeh S J, "Design of High Speed Low Power 3-2 Counter and 4-2 Compressor for Fast Multipliers," Electronic Letters, Vol. 34, No. 4, pp. 341-343, 1998.
[11] Z. Huang and M. D. Ercegovac, "Effect of wire delay on the design of prefix adders in deep-submicron technology," in Proceedings of the 34th Asilomar Conference on Signals, Systems, and Computers, Oct. 2000.
[12] Jiangmin G, Chip-Hong C, "Ultra Low Voltage Low Power 4-2 Compressor for High Speed Multiplications," in Proceedings of the International Symposium on Circuits and Systems, ISCAS 03, May 2003, Bangkok, Thailand, Vol. 5, pp. v321-v324, 2003.
[13] A. Dandapat, P. Bose and D. Mukhopadhyay, "Low-Voltage Low-Power 4-2 Compressor for High Speed Multiplication," in 2nd National Conference on Trends and Developments in VLSI and Embedded Systems, 5th-6th Mar, 2007, Hosur.
[14] A. Dandapat, P. Bose, Sayan Ghosh, Pikul Sarkar, and D. Mukhopadhyay, "Design of an Application Specific Low-Power High Performance Carry Save 4-2 Compressor," in IEEE VLSI Design and Test Symposium 2007, VDAT-07, Kolkata.
[15] Prasad K, Parhi K K, "Low Power 4-2 and 5-2 Compressors," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, CA, USA, Vol. 1, pp. 129-133, 2001.
[16] Osman Hasan and Skander Kort, "Automated formal synthesis of Wallace Tree multipliers," MWSCAS, 2007, pp. 293-296.
[17] R. E. Blahut, Fast Algorithms For Digital Signal Processing, Addison-Wesley, 1987.
[18] Weidong Li and Lars Wanhammar, "A complex multiplier using overturned-stairs adder tree," Electronics, Circuits and Systems, 1999.
