Novel Architectures for High-Speed and Low-Power 3-2, 4-2 and 5-2 Compressors

Sreehari Veeramachaneni, Kirthi Krishna M, Lingamneni Avinash, Sreekanth Reddy Puppala, M.B. Srinivas

Centre for VLSI and Embedded System Technologies. International Institute of Information Technology Gachibowli, Hyderabad-500032, India. srihari@research.iiit.ac.in, {kirthikrishna, avinashl, sreekanthp}@students.iiit.ac.in, srinivas@iiit.ac.in.

Abstract

The 3-2, 4-2 and 5-2 compressors are basic components in many applications, in particular partial product summation in multipliers. In this paper, novel architectures and designs of high-speed, low-power 3-2, 4-2 and 5-2 compressors capable of operating at ultra-low voltages are presented. The power consumption, delay and area of these new compressor architectures are compared with existing and recently proposed compressor architectures and are shown to be better. The proposed architectures emphasize the use of multiplexers in arithmetic circuits, which results in high-speed and efficient designs. Also, in all existing implementations of XOR gates and multiplexers, both the output and its complement are available, but current compressor designs do not use these outputs efficiently. In the proposed architectures these outputs are efficiently utilized to improve the performance of the compressors. The combination of low power, low transistor count and low delay makes the new compressors a viable option for efficient design.

1. Introduction.

Multiplication is a basic arithmetic operation that is important in applications like digital signal processing, which rely on efficient implementation of generic arithmetic logic units (ALU) and floating point units to execute dedicated operations like convolution and filtering. In the implementation of multipliers, the main phases are generation of partial products, reduction of partial products using CSA (carry-save architecture) [7-10] and a carry propagation adder for the computation of the final result. The second phase, that is, the reduction of the partial products, contributes most to the overall delay, area and power. In most of these implementations, the compressor lies directly in the critical path, which dictates the overall circuit performance, due to which the demand for high-speed and low-power compressors is continuously increasing [7-9]. This paper presents new compressor architectures that lay emphasis on the use of multiplexers in place of XOR gates to efficiently use the outputs from the previous stages and improve the overall performance. This is because the use of multiplexers improves the speed when they are placed in the critical path [2]. The rest of the paper is organized as follows: In Section 2 the efficiency of MUX and XOR-XNOR blocks is compared and the possibility of replacing XOR-XNOR blocks with MUX blocks is discussed. In Sections 3, 4, 5 and 6 the proposed architectures of the 3-2, 4-2 and 5-2 compressors are presented and compared with the existing architectures. Implementations have been carried out in 0.18µm CMOS technology.

20th International Conference on VLSI Design (VLSID'07) 0-7695-2762-0/07 $20.00 © 2007

2. MUX Vs XOR-XNOR.

Existing CMOS designs of a 2x1 multiplexer and a 2-input XOR gate are shown in Fig.1 [2].

Fig.1. CMOS Implementations of (a) MUX (inputs A and B, select S and its complement, output O) (b) XOR-XNOR (inputs A and B, outputs xor and xnor)

In Fig.1(a), it can be seen that if both the select bit and its complement arrive before the inputs, the output is generated with very little delay because the switching of the transistors is already completed. Also, if both the select bit and its complement are generated in the previous stage, the additional inverter stage is eliminated, which reduces the overall delay in the critical path [2]. By using the output and its complement in every stage, the total number of garbage outputs is reduced. By decreasing the number of transistors, the overall power consumption and the area occupied are reduced considerably [1]. An alternative design of the multiplexer is shown in Fig.2.
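The replacement of XOR gates by multiplexers rests on a simple logical identity: a 2x1 multiplexer whose data inputs are a signal and its complement behaves as an XOR of that signal with the select bit. The following is a minimal functional sketch of this identity only; it says nothing about transistor-level timing, and the names `mux` and `xor_via_mux` are illustrative and not taken from the paper.

```python
def mux(s: int, a: int, b: int) -> int:
    """2x1 multiplexer model: returns a when s == 0 and b when s == 1."""
    return b if s else a


def xor_via_mux(x: int, y: int) -> int:
    """XOR built from a MUX: x selects between y and the complement of y."""
    return mux(x, y, 1 - y)


# Exhaustively compare against the built-in XOR on single bits.
for x in (0, 1):
    for y in (0, 1):
        assert xor_via_mux(x, y) == x ^ y
```

Because an XOR-XNOR block already supplies both a value and its complement, the two data inputs of such a multiplexer are available without an extra inverter.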

Fig.2. Transmission Gate Implementation of a multiplexer

This design of the multiplexer is faster than the CMOS design when buffers are not used at the output [10]. However, it can only be used in the intermediate stages because of its limited driving capability. This design also consumes less power than the CMOS design [2]. In the proposed architectures the blocks where this design can be used are shown as MUX*.

3. 3-2 Compressor.

A 3-2 compressor takes 3 inputs X1, X2, X3 and generates 2 outputs, the sum bit S, and the carry bit C as shown in Fig.3a. The compressor is governed by the basic equation

X1 + X2 + X3 = Sum + 2*Carry (1)

Fig.3. (a) A 3-2 Compressor (b) Conventional Implementation of the 3-2 compressor

The 3-2 compressor can also be employed as a full adder cell when the third input is considered as the carry input from the previous compressor block, that is, X3 = Cin. Existing architectures, shown in Fig.3(b), employ two XOR gates in the critical path [3-6]. The equations governing the existing 3-2 compressor outputs are shown below

$$\mathrm{Sum} = x_1 \oplus x_2 \oplus x_3 \qquad (2)$$

$$\mathrm{Carry} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \qquad (3)$$

In the proposed architecture shown in Fig. 4, the fact that both the XOR and XNOR values are computed is efficiently used to reduce the delay by replacing the second XOR with a MUX. This is due to the availability of the select bit at the MUX block before the inputs arrive. Thus the time taken for the switching of the transistors in the critical path is reduced.

Fig.4. Proposed architecture of the 3-2 Compressor

The equations governing the 3-2 compressor outputs are shown below

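$$\mathrm{Sum} = (x_1 \oplus x_2) \cdot \overline{x_3} + \overline{(x_1 \oplus x_2)} \cdot x_3 \qquad (4)$$

$$\mathrm{Carry} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \qquad (5)$$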

It can be seen that in this implementation the overall delay is Δ-XOR + Δ-MUX (where Δ refers to delay).
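As a sanity check, the MUX-based outputs of the proposed 3-2 compressor can be compared against the basic equation (1) over all eight input combinations. The sketch below is a purely functional model, assuming a multiplexer that passes its first data input when the select bit is 0; the function names are illustrative and do not come from the paper.

```python
from itertools import product


def mux(s: int, a: int, b: int) -> int:
    """2x1 multiplexer model: returns a when s == 0 and b when s == 1."""
    return b if s else a


def compressor_3_2(x1: int, x2: int, x3: int) -> tuple[int, int]:
    """Functional model of the MUX-based 3-2 compressor."""
    sel = x1 ^ x2                  # XOR-XNOR block; the XNOR output is its complement
    s = mux(sel, x3, 1 - x3)       # Sum: complement of x3 when sel == 1, x3 otherwise
    carry = mux(sel, x1, x3)       # Carry: x3 when sel == 1, x1 otherwise
    return s, carry


# Check X1 + X2 + X3 = Sum + 2*Carry for every input combination.
for x1, x2, x3 in product((0, 1), repeat=3):
    s, carry = compressor_3_2(x1, x2, x3)
    assert x1 + x2 + x3 == s + 2 * carry
```

Both multiplexers share the same select signal, which is why a single XOR-XNOR stage followed by one MUX stage is enough to produce either output.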

4. 4-2 Compressor.

The 4-2 compressor has 4 inputs X1, X2, X3 and X4 and 2 outputs Sum and Carry, along with a Carry-in (Cin) and a Carry-out (Cout), as shown in Fig.5. The input Cin is the Cout output of the preceding, less significant compressor. The Cout output is passed to the compressor in the next, more significant stage.

Fig.5. A 4-2 Compressor Block

Similar to the 3-2 compressor, the 4-2 compressor is governed by the basic equation

x1 + x2 + x3 + x4 + Cin = Sum + 2*(Carry + Cout) (6)

The standard implementation [3-6] of the 4-2 compressor is done using 2 Full Adder cells as shown in Fig 6(a).

Fig.6. (a) A 4-2 compressor implemented with full adders (b) Existing implementation of 4-2 compressor

When the individual full adders are broken into their constituent XOR blocks, it can be observed that the overall delay is equal to 4*Δ-XOR. The block diagram in Fig.6(b) shows the existing architecture for the implementation of the 4-2 compressor with a delay of 3*Δ-XOR [3-6]. The equations governing the outputs in the existing architecture are shown below

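$$\mathrm{Sum} = x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus C_{in} \qquad (7)$$

$$C_{out} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \qquad (8)$$

$$\mathrm{Carry} = (x_1 \oplus x_2 \oplus x_3 \oplus x_4) \cdot C_{in} + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)} \cdot x_4 \qquad (9)$$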

However, like in the case of 3-2 compressor, the fact that both the output and its complement are available at every stage, is neglected [2]. Thus replacing some XOR blocks with multiplexers results in a significant improvement in delay.

Fig.7. Proposed 4-2 Compressor Architecture

Also, the MUX block at the Sum output receives its select bit before the inputs arrive, so its transistors are already switched by the time the inputs arrive. This minimizes the delay to a considerable extent, as shown in Fig.7.

The equations governing the outputs in the proposed architecture are shown below
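$$\mathrm{Sum} = \left[ (x_1 \oplus x_2) \cdot \overline{(x_3 \oplus x_4)} + \overline{(x_1 \oplus x_2)} \cdot (x_3 \oplus x_4) \right] \cdot \overline{C_{in}} + \overline{\left[ (x_1 \oplus x_2) \cdot \overline{(x_3 \oplus x_4)} + \overline{(x_1 \oplus x_2)} \cdot (x_3 \oplus x_4) \right]} \cdot C_{in} \qquad (10)$$

$$C_{out} = (x_1 \oplus x_2) \cdot x_3 + \overline{(x_1 \oplus x_2)} \cdot x_1 \qquad (11)$$

$$\mathrm{Carry} = (x_1 \oplus x_2 \oplus x_3 \oplus x_4) \cdot C_{in} + \overline{(x_1 \oplus x_2 \oplus x_3 \oplus x_4)} \cdot x_4 \qquad (12)$$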
The critical path delay of the proposed implementation is Δ-XOR + 2*Δ-MUX.
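The MUX-based outputs of the 4-2 compressor can be checked against the basic equation (6) in the same exhaustive way. The sketch below is a functional model only, with one multiplexer call per MUX-style term in the output equations; it does not model delay, and the helper names are assumptions made for illustration.

```python
from itertools import product


def mux(s: int, a: int, b: int) -> int:
    """2x1 multiplexer model: returns a when s == 0 and b when s == 1."""
    return b if s else a


def compressor_4_2(x1: int, x2: int, x3: int, x4: int, cin: int) -> tuple[int, int, int]:
    """Functional model of the MUX-based 4-2 compressor; returns (Sum, Carry, Cout)."""
    a = x1 ^ x2                    # first XOR-XNOR block
    b = x3 ^ x4                    # second XOR-XNOR block
    p = mux(a, b, 1 - b)           # a XOR b, formed by a MUX from b and its complement
    s = mux(p, cin, 1 - cin)       # Sum: p XOR Cin, formed by a MUX
    cout = mux(a, x1, x3)          # Cout: x3 when a == 1, x1 otherwise
    carry = mux(p, x4, cin)        # Carry: Cin when p == 1, x4 otherwise
    return s, carry, cout


# Check x1 + x2 + x3 + x4 + Cin = Sum + 2*(Carry + Cout) for all 32 inputs.
for x1, x2, x3, x4, cin in product((0, 1), repeat=5):
    s, carry, cout = compressor_4_2(x1, x2, x3, x4, cin)
    assert x1 + x2 + x3 + x4 + cin == s + 2 * (carry + cout)
```

Note that `cout` does not depend on `cin`, so when several 4-2 compressors are placed side by side the carry does not ripple through more than one block.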

5. 5-2 Compressor.

The 5-2 compressor block has 5 inputs X1, X2, X3, X4, X5 and 2 outputs, Sum and Carry, along with 2 input carry bits (Cin1, Cin2) and 2 output carry bits (Cout1, Cout2), as shown in Fig.8(a). The input carry bits are the outputs from the previous, less significant compressor block, and the output carry bits are passed on to the next, more significant compressor block.

Fig.8. (a) A 5-2 compressor block (b) Conventional implementation of a 5-2 compressor block

The basic equation that governs the function of the 5-2 compressor block is given below

X1 + X2 + X3 + X4 + X5 + Cin1 + Cin2 = Sum + 2*(Carry + Cout1 + Cout2) (13)

The conventional implementation [3-6] of the compressor block is shown in Fig.8(b), where 3 cascaded full adder cells are used. When these full adders are replaced with their constituent blocks of XOR gates, it can be observed that the overall delay is equal to 6*Δ-XOR for the sum or carry output. Many architectures have been proposed where the delay has been reduced to 5*Δ-XOR (Fig.9a) and then further reduced to 4*Δ-XOR (Fig.9b and 9c) [3-6].

Fig.9 Existing architectures of 5-2 compressors (a), (b), (c)

Fig.10. Proposed architecture of the 5-2 compressor

In the proposed architecture, a few XOR blocks have been replaced with MUX blocks so that the outputs generated at every stage are used efficiently. Also, the select bits to the multiplexers in the critical path are made available well ahead of the inputs so that the critical path delay is minimized. For example, the Cout2 output from the previous, less significant compressor block is used as a select bit one stage after it is produced, so that the MUX block is already switched and the output is produced as soon as the inputs arrive. Also, if the output of a multiplexer is used as the select bit for another multiplexer, it can be used efficiently in a similar manner: the design also requires the negation of the select bit, as shown in Fig.1(a), and an extra stage to compute that negation can be saved. Similarly, replacing the XOR block in the second stage with a MUX block reduces the delay because the select bit X3 is already available, and the transistor switching takes place in parallel with the computation of the inputs of the block.

As mentioned before, in all the general implementations of the XOR or MUX block, in particular the CMOS implementation, both the output and its complement are generated. But in the existing architectures this advantage is not utilized at all [3-6]. In the proposed architecture these outputs are utilized efficiently by using multiplexers at select stages in the circuit. Also, additional inverter stages are eliminated. This in turn contributes to the reduction of delay, power consumption and transistor count (area). The equations governing the outputs are shown below:

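$$\mathrm{Sum} = x_1 \oplus x_2 \oplus x_3 \oplus x_4 \oplus x_5 \oplus C_{in1} \oplus C_{in2} \qquad (14)$$

$$C_{out1} = (x_1 + x_2) \cdot x_3 + x_1 \cdot x_2 \qquad (15)$$

$$C_{out2} = (x_4 \oplus x_5) \cdot C_{in1} + \overline{(x_4 \oplus x_5)} \cdot x_4 \qquad (16)$$

$$\mathrm{Carry} = \left( (x_1 \oplus x_2 \oplus x_3) \oplus (x_4 \oplus x_5 \oplus C_{in1}) \right) \cdot C_{in2} + \overline{\left( (x_1 \oplus x_2 \oplus x_3) \oplus (x_4 \oplus x_5 \oplus C_{in1}) \right)} \cdot (x_1 \oplus x_2 \oplus x_3) \qquad (17)$$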

The critical path delay of the proposed implementation is Δ-XOR + 3*Δ-MUX. For the carry generation module (CGEN1) in Fig.10, equation (15) is used to design a CMOS implementation of Cout1, as shown in Fig.11.

Fig.11. Carry Generation Module (CGEN1)
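The 5-2 compressor output equations can likewise be checked against the basic equation (13) for all 128 input combinations. The sketch below is a functional model of those equations rather than of the block-level wiring in Fig.10; the intermediate names `s1`, `s2` and `p` are introduced here for readability and are not the paper's notation.

```python
from itertools import product


def mux(s: int, a: int, b: int) -> int:
    """2x1 multiplexer model: returns a when s == 0 and b when s == 1."""
    return b if s else a


def compressor_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    """Functional model of the 5-2 compressor; returns (Sum, Carry, Cout1, Cout2)."""
    cout1 = ((x1 | x2) & x3) | (x1 & x2)   # carry generation module, eq. (15)
    cout2 = mux(x4 ^ x5, x4, cin1)         # eq. (16): Cin1 when x4 != x5, x4 otherwise
    s1 = x1 ^ x2 ^ x3                      # partial sum of the first three inputs
    s2 = x4 ^ x5 ^ cin1                    # partial sum of x4, x5 and Cin1
    p = s1 ^ s2
    carry = mux(p, s1, cin2)               # eq. (17): Cin2 when p == 1, s1 otherwise
    s = p ^ cin2                           # eq. (14)
    return s, carry, cout1, cout2


# Check the basic equation (13) for every input combination.
for bits in product((0, 1), repeat=7):
    s, carry, cout1, cout2 = compressor_5_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout1 + cout2)
```

In this model `cout1` depends only on X1, X2 and X3, and `cout2` does not depend on `cin2`, which is consistent with the select bits being available ahead of the data inputs.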

6. Simulation and results

a. Simulation environment.

All the simulations have been done using Cadence Tools. The calculation of power (including glitch power) and delay is carried out using the Virtual Analog Simulation tool already integrated into Cadence Tools. All the schematics and layouts (Fig.13, 15 and 17) are done using the CMOS 0.18-µm technology. Hence the circuits are optimized for this process technology. The simulations are performed at various supply voltages ranging from 0.9V to 3.3V. All the inputs are fed at a frequency of 100MHz.

b. Simulation results.

The proposed and the existing architectures [3-6] have been compared by implementing both of them in 0.18-µm CMOS technology.

Figure 12. (a) Power consumption (nW) (b) Delay (ns) (c) Power-delay product (nW-ns) of the existing and proposed 5-2 compressors at supply voltages of 0.9V, 1.2V, 1.8V, 2.5V and 3.3V

Fig.13 Layout of the proposed 5-2 compressor architecture

 

Figure 14. (a) Power consumption (nW) (b) Delay (ns) (c) Power-delay product of the existing and proposed 4-2 compressors at supply voltages of 0.9V, 1.2V, 1.8V, 2.5V and 3.3V

Fig.15 Layout of the proposed 4-2 compressor architecture

 
 
 

Figure 16. (a) Power consumption (nW) (b) Delay (ns) (c) Power-delay product of the existing and proposed 3-2 compressors at supply voltages of 0.9V, 1.2V, 1.8V, 2.5V and 3.3V

Fig.17 Layout of the proposed 3-2 compressor architecture

Figures 12, 14 and 16 show that the proposed architecture for the 5-2 compressor consumes 13.2% less power and is 26% faster than the existing architectures when operating at 1.8V. Because of the decrease in the number of transistors, the overall area decreases by about 11.15% in the proposed 5-2 compressor. The proposed 4-2 compressor architecture is 33.3% faster and consumes 15% less power than the existing architectures. Also, the proposed 3-2 compressor is 7% faster and consumes 10.2% less power than the existing architectures. The improvement in the power-delay product is 36.4%, 27.8% and 24% in the proposed 5-2, 4-2 and 3-2 compressors respectively. As mentioned in Section 2, the MUX* blocks in the proposed architecture can be implemented using transmission gate (CMOS+) logic. This new implementation is compared with the CMOS implementation and the results are shown below.

Figure 18. (a) Power consumption (nW) (b) Delay (ns) (c) Power-delay product for the proposed 5-2 compressor with MUX* blocks in CMOS and CMOS+ designs, at supply voltages of 0.9V, 1.2V, 1.8V, 2.5V and 3.3V


Figure 18 shows that implementing the intermediate stages of the proposed 5-2 compressor in the CMOS+ design improves delay by 14.6%, power by 5.1% and the power-delay product by 18.2% when compared to the CMOS implementation of the same design. Similar results have been obtained with the 3-2 and 4-2 compressors.

7. Conclusions.

The architectures of the 3-2, 4-2 and 5-2 compressors are analyzed using CMOS and CMOS+ implementations of the XOR and MUX blocks. New 3-2, 4-2 and 5-2 compressor architectures have been proposed and compared with the existing architectures. Simulations have been performed over a range of voltages, from 0.9V to 3.3V. The proposed architectures perform better than the existing ones in every aspect, i.e., area, power, delay and power-delay product, over the complete voltage range simulated.

8. References.

[1] A. P. Chandrakasan and R. W. Brodersen, Low Power Digital CMOS Design. Norwell, MA: Kluwer, 1995.

[2] R. Zimmermann and W. Fichtner, “Low-power logic styles: CMOS versus pass-transistor logic,” IEEE J. Solid-State Circuits, vol. 32, pp. 1079–1090, July 1997.

[3] S. F. Hsiao, M. R. Jiang, and J. S. Yeh, “Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers,” Electron. Lett., vol. 34, no. 4, pp. 341–343, 1998.

[4] K. Prasad and K. K. Parhi, “Low-power 4-2 and 5-2 compressors,” in Proc. of the 35th Asilomar Conf. on Signals, Systems and Computers, vol. 1, 2001, pp. 129–133.

[5] C. H. Chang, J. Gu, and M. Zhang, “Ultra low-voltage low-power CMOS 4-2 and 5-2 compressors for fast arithmetic circuits,” IEEE Transactions on Circuits and Systems I: Regular Papers, vol. 51, no. 10, pp. 1985–1997, Oct. 2004.

[6] S. F. Hsiao, M. R. Jiang, and J. S. Yeh, “Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers,” Electron. Lett., pp. 341–343, 1998.

[7] Z. Wang, G. A. Jullien, and W. C. Miller, “A new design technique for column compression multipliers,” IEEE Trans. Comput., vol. 44, pp. 962–970, Aug. 1995.

[8] M. Ercegovac and T. Lang, Digital Arithmetic. Morgan Kaufmann, 2004.

[9] I. Koren, Computer Arithmetic Algorithms. Englewood Cliffs, NJ: Prentice Hall, 1993.

[10] J. M. Rabaey, A. Chandrakasan, and B. Nikolic, Digital Integrated Circuits (A Design Perspective). Prentice Hall, 2003.