ARINDAM BANERJEE
Dept. of ECE, JIS College of Engineering, Kalyani-741235, INDIA
ABSTRACT
In this paper we propose a combinational scheme for the implementation of a complex multiplier. The multiplier is built from parallel adders, subtractors and compressors, each of which is discussed in this paper. The energy-delay product of the multiplier, using a standard CMOS (90nm) model, is $1.86 \times 10^{-21}$ J·s. To enhance the speed and performance of the multipliers, compressors and Sklansky adders have been used.
We have chosen adders with a wide spectrum of timing and complexity, which makes it interesting to compare their performance in terms of power, delay and energy-delay product. The adders range from the simple but slow (linear-time) ripple carry adder to the fairly complex but extremely fast (constant-time) signed-digit adders. The third type of adder is the parallel-prefix, O(log N)-time blocked carry look-ahead adder.
Expanding the carry term in (2.3) above gives
$c_{i+1} = g_i + g_{i-1} p_i + c_{i-1} p_{i-1} p_i$ (2.4)
The fundamental unit of a ripple carry adder (RCA), shown in Fig. 1, is a full adder, which computes a sum bit and a carry bit:
$S_i = a_i \oplus b_i \oplus c_i$ (2.1)
$C_{i+1} = a_i b_i + b_i c_i + a_i c_i$ (2.2)
However, for wide operands an extremely large number of gates is required. We may therefore divide the CLA into groups of 4 bits, with a separate look-ahead adder in each group, and let the carry propagate from group to group as in a ripple carry adder (Section I). Fig. 2 shows a four-bit CLA structure. The delay and power of carry look-ahead adders of various lengths (using standard 90nm CMOS technology) are shown in Table 2.
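The grouping idea can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the generate and propagate signals are taken as $g_i = a_i b_i$ and $p_i = a_i \oplus b_i$ (the usual definitions, not spelled out in the text), and the function names `cla_4bit` and `grouped_cla_add` are hypothetical.

```python
def cla_4bit(a, b, c0=0):
    """Behavioural 4-bit carry look-ahead adder (assumes g = a AND b, p = a XOR b)."""
    g = [(a >> i) & (b >> i) & 1 for i in range(4)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(4)]
    # Every carry is written directly in terms of g, p and c0 (no rippling).
    c1 = g[0] | (p[0] & c0)
    c2 = g[1] | (p[1] & g[0]) | (p[1] & p[0] & c0)
    c3 = g[2] | (p[2] & g[1]) | (p[2] & p[1] & g[0]) | (p[2] & p[1] & p[0] & c0)
    c4 = (g[3] | (p[3] & g[2]) | (p[3] & p[2] & g[1])
          | (p[3] & p[2] & p[1] & g[0]) | (p[3] & p[2] & p[1] & p[0] & c0))
    carries = [c0, c1, c2, c3]
    s = 0
    for i in range(4):
        s |= (p[i] ^ carries[i]) << i
    return s, c4


def grouped_cla_add(a, b, width=16):
    """Chain 4-bit CLA groups; the carry ripples only between groups."""
    carry, total = 0, 0
    for shift in range(0, width, 4):
        s, carry = cla_4bit((a >> shift) & 0xF, (b >> shift) & 0xF, carry)
        total |= s << shift
    return total, carry
```

For example, `grouped_cla_add(0xFFFF, 1)` returns a zero sum with a carry out of 1, since the carry crosses all four groups.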
The architecture shows that each stage adds two operand bits and an incoming carry to produce a sum bit and a carry bit; in the worst case the carry propagates from the least significant bit position to the most significant bit position. For example, the first single-bit full adder adds three bits (a0, b0, c0), where the a's and b's are the operand inputs of these sections and the c's are the carry inputs. The delay and power of ripple carry adders of various lengths (using 90nm standard CMOS technology) are shown in Table 1.
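A minimal behavioural sketch of the ripple carry adder follows. It assumes the standard full adder, with the sum bit as the XOR of the three inputs and the carry as their majority; the function names are illustrative only.

```python
def full_adder(a, b, c):
    """One-bit full adder: sum is the XOR of the inputs, carry is their majority."""
    s = a ^ b ^ c
    carry = (a & b) | (b & c) | (a & c)
    return s, carry


def ripple_carry_add(a, b, width=8, c0=0):
    """Add two width-bit numbers; the carry ripples from the LSB to the MSB."""
    carry, total = c0, 0
    for i in range(width):
        s, carry = full_adder((a >> i) & 1, (b >> i) & 1, carry)
        total |= s << i
    return total, carry
```

The worst case described above occurs for inputs such as `ripple_carry_add(255, 1)`, where the carry generated at bit 0 travels through every stage.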
Table-1 Delay and power comparison of Ripple Carry adder
Fig. 2 Carry Look Ahead adder, 4 bit
Table-2 Delay and power comparison of Carry Look Ahead adder
The Brent-Kung adder works on the principle of the o operator ($\circ$) [7], which is defined as
$(g_x, p_x) \circ (g_y, p_y) = (g_x + p_x g_y,\; p_x p_y) = (G_{xy}, P_{xy})$ (2.5)
and, if $y = 0$, then $(G_{x0}, P_{x0}) = (G_x, P_x)$.
Lemma 1. Let
$(G_i, P_i) = (g_0, p_0)$ if $i = 0$,
$(G_i, P_i) = (g_i, p_i) \circ (G_{i-1}, P_{i-1})$ if $1 \le i \le n-1$. (2.6)
Then $c_i = G_i$ for $i = 0, 1, 2, 3, \dots, n-1$.
Let $(G_{ji}, P_{ji}) = (g_j, p_j) \circ (g_{j-1}, p_{j-1}) \circ \dots \circ (g_i, p_i)$. (2.7)
Lemma 2. The operator $\circ$ is associative, i.e.
$(G_{31}, P_{31}) = (g_3, p_3) \circ (G_{21}, P_{21}) = (G_{32}, P_{32}) \circ (g_1, p_1)$. (2.8)
The resulting 16-bit carry generation topology is given in Fig. 3. The meaning of the black and white processors in Brent-Kung adders is given in Fig. 4, and the delay and power of Brent-Kung adders of various lengths (using standard 90nm CMOS technology) are shown in Table 3.
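The operator and the two lemmas can be checked with a short behavioural sketch. It assumes $g_i = a_i b_i$ and $p_i = a_i \oplus b_i$, and it evaluates the recurrence of Lemma 1 serially rather than with the tree of Fig. 3; the names `o` and `prefix_carries` are hypothetical.

```python
from itertools import product


def o(x, y):
    """The o operator: (gx, px) o (gy, py) = (gx + px.gy, px.py)."""
    gx, px = x
    gy, py = y
    return (gx | (px & gy), px & py)


def prefix_carries(a, b, n):
    """Apply Lemma 1 serially and return [c_0, ..., c_{n-1}] with c_i = G_i."""
    gp = [((a >> i) & (b >> i) & 1, ((a >> i) ^ (b >> i)) & 1) for i in range(n)]
    acc = [gp[0]]
    for i in range(1, n):
        acc.append(o(gp[i], acc[i - 1]))
    return [G for G, _ in acc]


def is_associative():
    """Exhaustively check Lemma 2 over all one-bit (g, p) triples."""
    pairs = list(product((0, 1), repeat=2))
    return all(o(x, o(y, z)) == o(o(x, y), z)
               for x in pairs for y in pairs for z in pairs)
```

Associativity is what allows the pairs to be combined in a tree instead of a chain, which is the basis of the parallel-prefix topologies.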
The Kogge-Stone tree shown in Fig. 6 achieves both $\log_2 N$ stages and a fan-out of 2 at each stage [8]. This comes at the cost of many long wires that must be routed between stages. The tree also contains more PG (propagate and generate) cells. While this may not affect the area if the adder layout is on a regular grid, it will increase the power consumption. The cell diagram notation is given in Fig. 5, and the resulting 16-bit carry generation topology is shown in Fig. 6. The delay and power of Kogge-Stone adders of various lengths (using standard 90nm CMOS technology) are shown in Table 4.
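The stage count can be seen in a behavioural sketch of a Kogge-Stone prefix network. It is an illustration only, assuming $g_i = a_i b_i$, $p_i = a_i \oplus b_i$ and a carry-in of zero; `kogge_stone_add` is a hypothetical name, and the model says nothing about wiring, area or power.

```python
def kogge_stone_add(a, b, width=16):
    """Return (sum, carry_out, number_of_prefix_stages) for a width-bit addition."""
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]
    G, P = g[:], p[:]
    dist, stages = 1, 0
    while dist < width:
        new_G, new_P = G[:], P[:]
        # Each position combines with the position `dist` places to its right.
        for i in range(dist, width):
            new_G[i] = G[i] | (P[i] & G[i - dist])
            new_P[i] = P[i] & P[i - dist]
        G, P = new_G, new_P
        dist *= 2
        stages += 1
    total = 0
    for i in range(width):
        carry_in = G[i - 1] if i > 0 else 0
        total |= (p[i] ^ carry_in) << i
    return total, G[width - 1], stages
```

For a 16-bit addition the loop runs with distances 1, 2, 4 and 8, i.e. four stages.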
Table-4 Delay and power comparison of Kogge-Stone adder
Fig. 3 16-bit carry generation of Brent-Kung adder
Fig. 5 The cell diagram notation of Kogge-Stone adder
The Sklansky adder is not as heavily affected by wire delay as similar prefix adders such as Kogge-Stone [8]. We assume that the critical path in an adder is determined by the time required to pass the carry bit from the least significant bit to the most significant bit. Fig. 7 illustrates the critical path of the Sklansky adder.
2003 [12]. This implementation is better, and its delay is that of only three XOR gates. The problems with this kind of conventional compressor are [13]:
(i) The uneven delay profile of the outputs arriving from different input paths tends to generate glitches.
(ii) Compressors perform the simple operation of adding several bits at a time, but the conventional 4-2 compressor requires one more half adder, whose two inputs are COUT and C [11], to produce the final addition result.
The delay and power of Sklansky adders of various lengths (using standard 90nm CMOS technology) are shown in Table 5.
Example: if X1 = X2 = X3 = X4 = 1 and CIN = 0, the addition result should be four, i.e. 100 in binary, but the conventional architecture produces COUT = 1, C = 1 and S = 0. If COUT and C are then fed to a half adder, it produces the final result, as shown in Fig. 9.
So the conventional compressor requires one more half adder to get the final result, which adds delay and power consumption. The modified design of the compressor is given in Fig. 9.
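The example can be reproduced with a behavioural sketch. It assumes the conventional 4-2 compressor is modelled as two cascaded full adders, which is a common textbook structure and not necessarily the circuit of the cited implementations; all function names are hypothetical.

```python
def full_adder(a, b, c):
    """One-bit full adder returning (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def half_adder(a, b):
    """One-bit half adder returning (sum, carry)."""
    return a ^ b, a & b


def compressor_4_2(x1, x2, x3, x4, cin):
    """Conventional 4-2 compressor modelled as two cascaded full adders."""
    s1, cout = full_adder(x1, x2, x3)
    s, c = full_adder(s1, x4, cin)
    return cout, c, s  # COUT and C both carry weight 2, S carries weight 1


def resolve(cout, c, s):
    """Extra half adder that merges the two weight-2 outputs into a binary value."""
    mid, high = half_adder(cout, c)
    return (high << 2) | (mid << 1) | s
```

With all four inputs at 1 and CIN at 0 the compressor outputs are (1, 1, 0), and the additional half adder turns them into binary 100.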
Fig. 8 Pre- and post-processing unit
Table-5 Delay and power comparison of Sklansky adder
Fig. 9
Fig. 10
The transistor-level implementation is shown in Fig. 11. We have used transmission gates (TG) as circuit elements to make it faster. In this design all the outputs have the same three-stage delay, so the delays are nearly equal. In this architecture the outputs occupy three consecutive bit positions (j, j+1, j+2). So, in the partial product addition stage of the multiplication process, $C_{j+1}$ should be fed to the (j+1)th adder of the next stage and $C_{j+2}$ to the (j+2)th adder of that same stage.
The 6-3 compressor logic described here is based on the counter property of the full adder. It can be defined as a single-bit adder circuit with six inputs and three outputs. The block diagram of the 6-3 compressor is shown in Fig. 12.
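The counter property can be sketched behaviourally: the three outputs are the binary count of ones among the six inputs. The decomposition below into three full adders and one half adder is an assumption chosen for illustration and is not taken from Fig. 12; the function names are hypothetical.

```python
def full_adder(a, b, c):
    """One-bit full adder returning (sum, carry)."""
    return a ^ b ^ c, (a & b) | (b & c) | (a & c)


def half_adder(a, b):
    """One-bit half adder returning (sum, carry)."""
    return a ^ b, a & b


def compressor_6_3(x1, x2, x3, x4, x5, x6):
    """Return (out2, out1, out0), the 3-bit count of ones among six input bits."""
    s_a, c_a = full_adder(x1, x2, x3)       # count of the first three inputs
    s_b, c_b = full_adder(x4, x5, x6)       # count of the last three inputs
    out0, c_c = half_adder(s_a, s_b)        # weight-1 bits
    out1, out2 = full_adder(c_a, c_b, c_c)  # weight-2 bits
    return out2, out1, out0
```

Because six inputs sum to at most six, three output bits at positions j, j+1 and j+2 are always sufficient.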
Fig. 13
2.3 ADDER/SUBTRACTOR
The conventional adder/subtractor block, which performs addition as well as subtraction in a single block, is shown in Fig. 14. The performance of different adders is compared in Fig. 15. Here the control signal selects between addition and subtraction.
Table-6 Performance comparison for six-bit addition
Used module    Delay (x3) (ps)    Delay (x2) (ps)    Delay (x1) (ps)
HA & FA        192                183                172
6-3 comp       119                85                 91
The major disadvantage of this circuit (shown in Fig. 15) is that subtracting a larger number from a smaller number gives the result in 2's complement form. In the modified architecture this problem is solved by incorporating a second parallel adder stage. Fig. 16 shows the modified architecture of the adder/subtractor. The signal named control selects the type of operation (add/sub): it is low for addition and high for subtraction. For addition, the second stage passes the output of the first stage; for subtraction, it acts according to the carry of the first stage. If the first-stage carry is low, the result of the first stage is in 2's complement form (negative), and taking its 2's complement in the second stage gives the exact result. If the first-stage carry is high, the result of the first stage is in normal form (positive), and the second stage passes the output of the first stage.
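The two-stage behaviour can be sketched as follows. This is a behavioural model under assumptions: operands are unsigned `width`-bit values, subtraction is performed as a + NOT(b) + 1, and the function name and the returned sign flag are hypothetical conveniences rather than signals named in the paper.

```python
def add_sub(a, b, control, width=16):
    """Two-stage adder/subtractor model.

    control = 0: return (a + b, carry, False).
    control = 1: return (|a - b|, first-stage carry, True if a < b).
    """
    mask = (1 << width) - 1
    # Stage 1: invert b when subtracting and inject the control bit as carry-in.
    b_in = (b ^ (mask if control else 0)) & mask
    total = (a & mask) + b_in + control
    stage1, carry = total & mask, total >> width
    if control == 0 or carry == 1:
        # Addition, or a non-negative difference: stage 2 passes stage 1 through.
        return stage1, carry, False
    # Subtraction with carry low: stage 1 is negative, so stage 2 takes its
    # 2's complement to recover the magnitude.
    return (~stage1 + 1) & mask, carry, True
```

For instance, `add_sub(3, 5, 1, 8)` yields a magnitude of 2 with the sign flag set, while `add_sub(5, 3, 1, 8)` yields 2 with the first-stage carry high.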
III. MULTIPLIER
A 16-bit multiplier is constructed using the Wallace tree architecture [14], shown in Fig. 17. Partial products are added in five stages. Adders and different compressors are used to minimize the stage operations, and they are chosen carefully so that a minimum number of outputs is generated. As an example, consider column number ten, where ten bits are added at the first stage. These ten bits could be added using one 6-3 compressor and one 4-3 compressor, but that would generate six outputs (three from each compressor). Instead we have used one 7-3 compressor and one full adder, which generate only five outputs (three from the compressor and two from the full adder) and so decrease the number of bits for the next stage. Now consider column number sixteen, where sixteen partial products are added at the first stage. They could be added using one 7-3 compressor, one 6-3 compressor and one full adder, but these three circuits would generate eight outputs. Instead we have used two 7-3 compressors, which generate six outputs, and the other two bits are promoted to the second stage directly, so that ultimately eight bits are left for the second-stage addition. Thus the partial products are added using a minimum number of adders/compressors without increasing the number of bits generated for the next stage. The performance of different multipliers is given in Table 7.
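The output counts quoted for columns ten and sixteen can be tallied with a small helper. This sketch only counts bits; it ignores the fact that compressor outputs land in different weight columns, and the table `UNITS` and function `column_outputs` are hypothetical names.

```python
# (inputs consumed, outputs produced) for each building block named in the text.
UNITS = {"7-3": (7, 3), "6-3": (6, 3), "4-3": (4, 3), "FA": (3, 2)}


def column_outputs(n_bits, units):
    """Bits handed to the next stage when `units` reduce a column of n_bits."""
    consumed = sum(UNITS[u][0] for u in units)
    produced = sum(UNITS[u][1] for u in units)
    if consumed > n_bits:
        raise ValueError("units consume more bits than the column holds")
    # Bits not consumed by any unit are promoted to the next stage unchanged.
    return produced + (n_bits - consumed)
```

Calling `column_outputs(10, ["6-3", "4-3"])` gives 6 and `column_outputs(10, ["7-3", "FA"])` gives 5, while both options discussed for column sixteen give 8, the second one with two units instead of three.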
Table 7 Delay and power calculation of multipliers
Power (µW): 64, 148
A complex number consists of two parts, a real part and an imaginary part, distinguished by the imaginary unit $j = \sqrt{-1}$. Computing the product of two complex numbers requires four real multiplications, one addition unit and one subtraction unit. The complex numbers can be defined as
$A = A_r + j A_i$ and $B = B_r + j B_i$, (3.1)
where $A_r$, $B_r$ are the real parts and $A_i$, $B_i$ are the imaginary parts. The multiplication of A and B is given by
$A B = (A_r + j A_i)(B_r + j B_i) = (A_r B_r - A_i B_i) + j (A_i B_r + A_r B_i)$. (3.2)
To reduce the arithmetic complexity of the complex multiplier, the algebraic transformation given in equation (3.3) was proposed by Blahut [15]:
$A B = (A_r - A_i) B_i + A_r (B_r - B_i) + j [(A_r - A_i) B_i + A_i (B_r + B_i)]$. (3.3)
This method saves one multiplication at the expense of two more additions and one more subtraction. Fig. 18 gives the graphical view of equation (3.3). As the graphical representation shows, this method requires the pre-addition $B_r + B_i$ and the pre-subtractions $A_r - A_i$ and $B_r - B_i$ prior to the binary multiplications, which increases the critical path delay.
Fig. 19 shows the direct method for implementing the complex multiplier. It contains only two types of processors: (i) multipliers and (ii) adders or subtractors. Here the critical path delay is reduced because the path contains only one adder block. Four multiplication blocks are required, but all of them work in parallel. Table 8 compares the performance of the complex multiplier for different architectures.
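The equivalence of the direct form (3.2) and the three-multiplication form (3.3) can be checked numerically. The sketch below works on plain integers and says nothing about hardware delay; the function names are hypothetical.

```python
def complex_mul_direct(ar, ai, br, bi):
    """Direct form (3.2): four multiplications, one subtraction, one addition."""
    return ar * br - ai * bi, ai * br + ar * bi


def complex_mul_three(ar, ai, br, bi):
    """Form (3.3): three multiplications, with (Ar - Ai) * Bi shared by both parts."""
    shared = (ar - ai) * bi
    real = shared + ar * (br - bi)
    imag = shared + ai * (br + bi)
    return real, imag
```

For A = 3 + 2j and B = 1 + 4j both functions return (-5, 14). The pre-subtraction and pre-additions in the second function must complete before its multiplications can start, which is the longer critical path the text refers to.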
Fig. 19 Direct method implementation of complex multiplier
Table 8 Performance comparison of complex multiplier
Width     Architecture used        Delay (ns)    Power (mW)
16 bit    Blahut                   6             12
16 bit    Distributed Algorithm    25            15
16 bit    Proposed                 5             8
32 bit    Blahut                   11            29
32 bit    Distributed Algorithm    48            28
32 bit    Proposed                 9             23
V. CONCLUSIONS
A Wallace-tree-based complex multiplier has been designed and verified for functionality, speed and power consumption. The Wallace tree multiplier has been built from compressors and parallel adders. All the designs aim at high speed and low power. The SPICE simulation results are shown graphically in Fig. 20. Our proposed architecture reduces the energy-delay product by 53.7% for the 16-bit complex multiplier and by 46.9% for the 32-bit complex multiplier.
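The quoted reductions can be reproduced from the delay and power figures of Table 8 under three assumptions that the text does not state explicitly: the energy-delay product is taken as power × delay², the first three columns of Table 8 are the 16-bit designs and the last three the 32-bit designs, and the baseline is the Blahut architecture. The names below are hypothetical.

```python
# (delay in ns, power in mW) from Table 8; the 16/32-bit grouping is an assumption.
TABLE_8 = {
    16: {"Blahut": (6, 12), "Distributed Algorithm": (25, 15), "Proposed": (5, 8)},
    32: {"Blahut": (11, 29), "Distributed Algorithm": (48, 28), "Proposed": (9, 23)},
}


def edp(delay_ns, power_mw):
    """Energy-delay product in units of 1e-21 J*s (mW * ns^2), assumed as P * D^2."""
    return power_mw * delay_ns ** 2


def edp_reduction_percent(width, baseline="Blahut", design="Proposed"):
    """Percentage by which `design` lowers the EDP relative to `baseline`."""
    base = edp(*TABLE_8[width][baseline])
    new = edp(*TABLE_8[width][design])
    return 100 * (base - new) / base
```

Under these assumptions the function gives about 53.7% for the 16-bit case and about 46.9% for the 32-bit case, matching the figures in the conclusion.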
Fig. 20 (b) Power comparison of parallel adder
References:
[1] Y. Ohi, T. Aoki, and T. Higuchi, "Redundant complex number systems," Proc. 25th IEEE Int'l Symp. Multiple-Valued Logic, pp. 14-19, May 1995.
[2] N. Ohkubo, M. Suzuki, T. Shinbo, T. Yamanaka, A. Shimizu, K. Sasaki, and Y. Nakagome, "A 4.4ns CMOS 54×54-b multiplier using pass-transistor multiplexer," IEEE J. Solid-State Circuits, Vol. 30, No. 3, pp. 251-257, March 1995.
[3] T. Aoki, Y. Ohi, and T. Higuchi, "Redundant complex number arithmetic for high-speed signal processing," in Proc. 1995 IEEE Workshop VLSI Signal Processing, Sakai, Japan, Oct. 1995, pp. 523-531.
[4] Anders Berkeman, Viktor Öwall, and Mats Torkelson, "A Low Logic Depth Complex Multiplier Using Distributed Arithmetic," IEEE Journal of Solid-State Circuits, Vol. 35, No. 4, April 2000.
[5] S. G. Smith and P. B. Denyer, "Efficient bit-serial complex multiplication and sum-of-products computation using distributed arithmetic," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, 1986, pp. 271-276.
[6] B. D. Lee and V. G. Oklobdzija, "Improved CLA Scheme with Optimized Delay," Journal of VLSI Signal Processing, Vol. 3, pp. 265-274, 1991.
[7] R. P. Brent and H. T. Kung, "A regular layout for parallel adders," IEEE Transactions on Computers, Vol. C-31, No. 3, pp. 260-264, March 1982.
[8] Z. Huang and M. D. Ercegovac, "Effect of wire delay on the design of prefix adders in deep-submicron technology," in Proceedings of the 34th Asilomar Conference on Signals, Systems, and Computers, Oct. 2000.
[9] Oklobdzija G V, Villeger D, Liu S S, "A Method For Speed Optimized Partial Product Reduction and Generation of Fast Parallel Multipliers Using An Algorithmic Approach," IEEE Transactions on Computers, Vol. 45, No. 3, March 1996.
[10] Hsiao F S, Jiang R M, Yeh S J, "Design of High Speed Low Power 3-2 Counter and 4-2 Compressor for Fast Multipliers," Electronics Letters, Vol. 34, No. 4, pp. 341-343, 1998.
[11] Z. Huang and M. D. Ercegovac, "Effect of wire delay on the design of prefix adders in deep-submicron technology," in Proceedings of the 34th Asilomar Conference on Signals, Systems, and Computers, Oct. 2000.
[12] Jiangmin G, Chip-Hong C, "Ultra Low Voltage Low Power 4-2 Compressor for High Speed Multiplications," in Proceedings of the International Symposium on Circuits and Systems, ISCAS 03, May 2003, Bangkok, Thailand, Vol. 5, pp. v321-v324, 2003.
[13] A. Dandapat, P. Bose and D. Mukhopadhyay, "Low-Voltage Low-Power 4-2 Compressor for High Speed Multiplication," in 2nd National Conference on Trends and Developments in VLSI and Embedded Systems, 5th-6th Mar, 2007, Hosur.
[14] A. Dandapat, P. Bose, Sayan Ghosh, Pikul Sarkar, and D. Mukhopadhyay, "Design of an Application Specific Low-Power High Performance Carry Save 4-2 Compressor," in IEEE VLSI Design and Test Symposium 2007, VDAT-07, Kolkata.
[15] Prasad K, Parhi K K, "Low Power 4-2 and 5-2 Compressors," in Proceedings of the 35th Asilomar Conference on Signals, Systems and Computers, CA, USA, Vol. 1, pp. 129-133, 2001.
[16] Osman Hasan and Skander Kort, "Automated formal synthesis of Wallace Tree multipliers," MWSCAS, 2007, pp. 293-296.
[17] R. E. Blahut, Fast Algorithms for Digital Signal Processing, Addison-Wesley, 1987.
[18] Weidong Li and Lars Wanhammar, "A complex multiplier using overturned-stairs adder tree," Electronics, Circuits and Systems, 1999.