Power analysis of high throughput pipelined carry-propagation adders
Conference Paper · December 2004

DOI: 10.1109/NORCHP.2004.1423842 · Source: IEEE Xplore
Power Analysis of High Throughput
Pipelined Carry-Propagation Adders
Anders Åslund, Oscar Gustafsson, Henrik Ohlsson, and Lars Wanhammar
Department of Electrical Engineering, Linköping University, SE-581 83 Linköping, SWEDEN

Tel: +46-13-28 4059, Fax: +46-13-13 92 82, E-mail: {andas366, oscarg, henriko, larsw}@isy.liu.se
Abstract

In several previous papers the area, delay, and power consumption of various carry-propagation adders have been compared. However, for high-throughput applications it may be necessary to introduce pipelining into the adder. The number of stages to be inserted and the width of the pipelining registers differ between adder structures. In this work we focus on the power consumption of adder structures when pipelining is used to increase the throughput. Four adder structures with varying wordlengths and pipeline levels are implemented using standard cells, and their power consumption is compared. The results show that, for a given throughput, the Kogge-Stone parallel prefix adder gives the lowest power consumption most of the time.

1. Introduction

The operation of adding two numbers in a non-redundant representation to obtain another non-redundant number is a key operation in most DSP algorithms. As a non-redundant output is required, a carry signal must be propagated from the least significant bit of the input words to the most significant bit of the output word. This introduces a bound on the adder delay that increases with the wordlength.

Over the years, numerous adder structures have been proposed to, essentially, speed up the carry propagation. Comparisons of different adder structures in terms of area, delay, and power consumption have been presented [1]–[3].

For high-throughput applications, these adder structures may not be enough to obtain the required throughput. Instead, pipelining must be introduced. This may be the case for, e.g., high-speed FIR filters based on carry-save arithmetic, or heavily pipelined multipliers, where a carry-propagation adder is used at the output of the filter (often referred to as a vector merging adder). In Fig. 1, a transposed direct form FIR filter using carry-save arithmetic is shown.

Figure 1. High throughput FIR filter utilizing carry-save arithmetic. Bold lines correspond to data in carry-save representation.

The aim of this work is to compare the power consumption of different adder structures implemented using standard cells, when pipelining is used to meet the requirements on throughput. The number of pipelining stages will differ between adder structures for the same throughput requirements. Furthermore, the width of the pipelining registers will also be different. As the main application for this work is DSP, we consider all possible wordlengths, not only powers of two as in much of the previous work. This also has some effect, as the number of stages in most of the high-speed adder structures increases with every power of two input bits. Hence, generally, a 17-bit and a 32-bit adder will have the same logic depth for the high-speed adders.

In the next section, the considered adder structures are reviewed. In Section 3 the implementation is discussed. Then, in Section 4, the results are presented. Finally, in Section 5 some conclusions are drawn.

This work was supported by the Swedish Research Council.

2. Adder Structures

In this work four different adder structures are considered: the ripple-carry adder, the conditional sum adder, and the Brent-Kung and Kogge-Stone parallel prefix adders.

Among the adder structures not considered is, e.g., the carry lookahead adder. This structure was left out as it was not possible to introduce pipelining in a regular way inside the lookahead blocks. There are also several more parallel prefix adder structures that could be considered, but Brent-Kung and Kogge-Stone were selected. Furthermore, carry-select and carry-increment adders are also left out, due to irregularity in the pipelining.

We also restrict ourselves to introducing pipelining into a given adder structure. An alternative would be to divide the addition into several shorter additions, where all shorter additions are performed using a certain adder structure. It would also be possible to introduce pipelining within the shorter additions.

To conclude, there is a multitude of different cases that could have been included, but we selected what are, in our opinion, the most important cases.

2.1. Ripple-Carry Adder

The ripple-carry adder (RCA) is the most straightforward implementation of a carry-propagation adder. It is constructed from cascaded full adder gates, where each gate has inputs for the two bits to add and one incoming carry from the lower significance full adder. The outputs of the full adder gate are a sum bit and an outgoing carry to the next higher significance full adder. In Fig. 2 a seven-bit RCA is shown, together with possible pipelining cuts as indicated by the dashed lines.

Figure 2. Ripple-carry adder with seven-bit inputs. Dashed lines indicate possible pipeline cuts.

Figure 3. Conditional sum adder with seven-bit inputs. Dashed lines indicate possible pipeline cuts.

Figure 4. Brent-Kung parallel prefix adder with seven-bit inputs. Dashed lines indicate possible pipeline cuts.
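As an illustration of the ripple-carry structure of Section 2.1, the following bit-level Python sketch chains full adder cells from the least significant bit upwards. It is a behavioral model only, not the VHDL used in this work, and it does not model the pipeline cuts. The function names full_adder and ripple_carry_add are hypothetical and chosen for this example.

```python
def full_adder(x: int, y: int, c: int) -> tuple[int, int]:
    """One full adder cell: two input bits and an incoming carry.

    Returns (sum bit, outgoing carry).
    """
    s = x ^ y ^ c
    c_out = (x & y) | (c & (x ^ y))
    return s, c_out


def ripple_carry_add(x: int, y: int, width: int, c_in: int = 0) -> tuple[int, int]:
    """Add two unsigned width-bit words by cascading full adders, LSB first.

    Returns (sum word, carry out). The carry of each cell feeds the next
    higher significance cell, which is what bounds the delay by the wordlength.
    """
    result = 0
    carry = c_in
    for i in range(width):
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        result |= s << i
    return result, carry


if __name__ == "__main__":
    print(ripple_carry_add(100, 27, 7))  # (127, 0)
    print(ripple_carry_add(127, 1, 7))   # (0, 1): the sum overflows seven bits
```

The loop makes the serial dependency explicit: bit i cannot be resolved before the carry from bit i-1 is known, so the number of cascaded cells equals the wordlength.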
2.2. Conditional Sum Adder

The conditional sum adder (COSA) [4] is composed of conditional half adder cells and a network of 2-to-1 multiplexers. Each conditional half adder cell generates two carry and sum pairs, one for carry in equal to 0 and one for carry in equal to 1. Then, depending on the actual carry in, the correct carry-sum pair is multiplexed to the next level. The conditional sum adder can be seen as a carry select adder with a block size of one. An illustration of a seven-bit COSA is shown in Fig. 3, where the dashed lines indicate possible pipelining cuts.
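The conditional sum principle of Section 2.2 can be sketched in Python as follows. Each bit position first produces one (sum, carry) pair for each possible carry in, and adjacent blocks are then merged level by level, with the carry out of the lower block acting as the multiplexer select for the upper block. This is a behavioral sketch under the assumption that blocks are merged pairwise; the exact multiplexer network of Fig. 3 may be arranged differently, and all names are hypothetical.

```python
def conditional_half_adder(x: int, y: int):
    """Return the (sum bits, carry out) pair for carry in 0 and for carry in 1."""
    for_cin0 = ([x ^ y], x & y)
    for_cin1 = ([1 - (x ^ y)], x | y)
    return for_cin0, for_cin1


def merge(low, high):
    """Combine a lower and an upper block into one wider conditional block.

    For each assumed carry in to the lower block, its carry out selects
    (as a 2-to-1 multiplexer would) which precomputed result of the upper
    block is kept.
    """
    merged = []
    for low_sum, low_carry in low:
        high_sum, high_carry = high[low_carry]
        merged.append((low_sum + high_sum, high_carry))
    return tuple(merged)


def conditional_sum_add(x: int, y: int, width: int, c_in: int = 0):
    """Add two unsigned width-bit words; returns (sum word, carry out)."""
    blocks = [conditional_half_adder((x >> i) & 1, (y >> i) & 1)
              for i in range(width)]
    while len(blocks) > 1:
        merged = [merge(blocks[i], blocks[i + 1])
                  for i in range(0, len(blocks) - 1, 2)]
        if len(blocks) % 2:          # an unpaired top block passes through
            merged.append(blocks[-1])
        blocks = merged
    sum_bits, carry = blocks[0][c_in]  # the actual carry in selects the result
    return sum(b << i for i, b in enumerate(sum_bits)), carry


if __name__ == "__main__":
    print(conditional_sum_add(100, 27, 7))  # (127, 0)
```

Since the block size doubles at each merge, the number of multiplexer levels grows with the logarithm of the wordlength rather than linearly.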
Figure 5. Kogge-Stone parallel prefix adder with seven-bit inputs. Dashed lines indicate possible pipeline cuts.

2.3. Brent-Kung Parallel Prefix Adder

The Brent-Kung (BK) adder [5] is a parallel prefix adder. Parallel prefix adders are a class of adders that are based on the use of generate and propagate signals. The generate signal indicates that the outgoing carry is 1, independent of the incoming carry, while the propagate signal indicates that the outgoing carry is equivalent to the incoming carry.

To determine the correct carry for each significance the dot operator (•) is used. The dot operator is defined as

$$\begin{bmatrix} p_k \\ g_k \end{bmatrix} = \begin{bmatrix} p_j \\ g_j \end{bmatrix} \bullet \begin{bmatrix} p_i \\ g_i \end{bmatrix} = \begin{bmatrix} p_j p_i \\ g_j + g_i p_j \end{bmatrix} \qquad (1)$$

where p and g are the propagate and generate signals, respectively. The dot operator is associative, but not commutative, so it is assumed that the order of significance is k > j > i, where k is the most significant bit of the three.

The correct carry for significance l can then be computed as

$$\begin{bmatrix} - \\ C_l \end{bmatrix} = \begin{bmatrix} p_l \\ g_l \end{bmatrix} \bullet \begin{bmatrix} p_{l-1} \\ g_{l-1} \end{bmatrix} \bullet \dots \bullet \begin{bmatrix} p_0 \\ g_0 \end{bmatrix} \bullet \begin{bmatrix} 0 \\ c_{in} \end{bmatrix} \qquad (2)$$

where c_in is the carry in to the carry-propagation adder. As the dot operator is associative, the computation for a certain carry can be rearranged to reduce the number of cascaded dot operators. A tree of dot operators that computes all required carry signals is referred to as a parallel prefix tree.

The Brent-Kung adder uses one of several parallel prefix trees. The tree is characterized by a low number of dot operators, while at the same time there are few cascaded dot operators. A seven-bit Brent-Kung adder is shown in Fig. 4.
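To make the role of associativity in Section 2.3 concrete, the following Python sketch evaluates the carries twice: once as a fully cascaded chain of dot operators, and once regrouped into a tree with an up-sweep and a down-sweep, which is the usual textbook formulation of the Brent-Kung prefix network. It is a behavioral sketch and may differ in cell arrangement from Fig. 4 (for example in how the carry in is handled, which is here folded into bit 0). All function names are hypothetical.

```python
def dot(hi, lo):
    """Dot operator on (p, g) pairs; hi is the more significant operand."""
    p_hi, g_hi = hi
    p_lo, g_lo = lo
    return (p_hi & p_lo, g_hi | (g_lo & p_hi))


def bit_signals(x: int, y: int, width: int, c_in: int = 0):
    """Per-bit (p, g) pairs, LSB first, with c_in folded into bit 0."""
    pg = [(((x >> i) ^ (y >> i)) & 1, (x >> i) & (y >> i) & 1)
          for i in range(width)]
    pg[0] = dot(pg[0], (0, c_in))
    return pg


def serial_carries(pg):
    """Cascaded evaluation: one dot operator per bit, no regrouping."""
    acc = pg[0]
    carries = [acc[1]]
    for cell in pg[1:]:
        acc = dot(cell, acc)
        carries.append(acc[1])
    return carries


def brent_kung_carries(pg):
    """The same carries computed with an up-sweep and a down-sweep tree."""
    a, n = list(pg), len(pg)
    d = 1
    while d < n:                      # up-sweep: spans double in length
        for i in range(2 * d - 1, n, 2 * d):
            a[i] = dot(a[i], a[i - d])
        d *= 2
    d //= 2
    while d >= 1:                     # down-sweep: fill in remaining prefixes
        for i in range(3 * d - 1, n, 2 * d):
            a[i] = dot(a[i], a[i - d])
        d //= 2
    return [g for _, g in a]


if __name__ == "__main__":
    pg = bit_signals(100, 27, 7)
    print(serial_carries(pg) == brent_kung_carries(pg))  # True
```

Element i of either result is the carry out of bit position i. Both functions apply the same operator to the same operands; only the grouping differs, which is exactly the freedom that associativity provides.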
2.4. Kogge-Stone Parallel Prefix Adder
As for the Brent-Kung adder, a parallel prefix computation is used for the Kogge-Stone (KS) adder [6]. However, for the Kogge-Stone adder the fanout of a dot operator is limited to two at each level. This leads to a smaller capacitive load, resulting in higher speed, as will be seen in the results section. This comes at the cost of an increased number of dot operators. The Kogge-Stone adder also has the smallest possible number of cascaded dot operators, assuming only two-input dot operators are used. A seven-bit Kogge-Stone adder is shown in Fig. 5.
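The Kogge-Stone prefix computation can be sketched in Python as below. At each level every bit position combines its current span with the span a fixed distance below it, and the distance doubles from level to level. The sketch also counts the levels, which illustrates why a 17-bit and a 32-bit adder have the same logic depth. It is a behavioral model with hypothetical names, the carry in is folded into bit 0 as an assumption of this example, and fanout and pipeline registers are not modeled.

```python
def dot(hi, lo):
    """Dot operator on (p, g) pairs; hi is the more significant operand."""
    p_hi, g_hi = hi
    p_lo, g_lo = lo
    return (p_hi & p_lo, g_hi | (g_lo & p_hi))


def kogge_stone_add(x: int, y: int, width: int, c_in: int = 0):
    """Add two unsigned width-bit words.

    Returns (sum word, carry out, number of dot operator levels).
    """
    p = [((x >> i) ^ (y >> i)) & 1 for i in range(width)]
    g = [(x >> i) & (y >> i) & 1 for i in range(width)]
    pg = list(zip(p, g))
    pg[0] = dot(pg[0], (0, c_in))     # fold the carry in into bit 0

    levels, dist = 0, 1
    while dist < width:
        # Every position i >= dist gets a dot operator at this level.
        pg = [dot(pg[i], pg[i - dist]) if i >= dist else pg[i]
              for i in range(width)]
        dist *= 2
        levels += 1

    carries = [gi for _, gi in pg]    # carries[i] is the carry out of bit i
    total = 0
    for i in range(width):
        c_prev = c_in if i == 0 else carries[i - 1]
        total |= (p[i] ^ c_prev) << i
    return total, carries[-1], levels


if __name__ == "__main__":
    print(kogge_stone_add(100, 27, 7))       # (127, 0, 3)
    print(kogge_stone_add(0, 0, 17)[2])      # 5
    print(kogge_stone_add(0, 0, 32)[2])      # 5
```

The level count is the base-two logarithm of the wordlength rounded up, so it only changes when the wordlength passes a power of two, while the number of dot operators per level stays close to the wordlength.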
3. Implementation

A program was implemented that can generate synthesizable VHDL for any of the four adder types with an arbitrary wordlength and an arbitrary degree of pipelining.

The VHDL netlist was then mapped to a 0.35 µm standard cell library using Leonardo Spectrum. Registers are included on all inputs and outputs. A clock tree was inserted to take the different number of flip-flops more accurately into account. The resulting transistor netlist was simulated using Nanosim.

The simulation was performed without including routing as no layout was supplied for the standard cell library, only transistor netlists. This may give some advantage to the Kogge-Stone adder, where the implementation may result in long wires.

For each adder type all wordlengths between three and 64 bits were generated. Pipelining was introduced after every n levels, where the range of n varied between the adder structures. The adders were simulated using random input data for 1000 ns at the maximal data rate estimated by the synthesis tool. The power supply voltage was 3.3 V.

Figure 6. Power-delay product (PDP) in µW/MHz as a function of wordlength and number of pipeline stages for (a) ripple-carry adder, (b) conditional sum adder, (c) Brent-Kung adder, and (d) Kogge-Stone adder. Note that the z-axis has a different scale for the ripple-carry adder.

4. Results
The focus of the work was to determine which adder structure, with possible pipelining, has the lowest power consumption for a given throughput. However, the synthesis and simulations produced data from which more can be concluded.

In Fig. 6 the power-delay product (PDP) [7] is shown for all adder structures. The PDP can be seen as the energy per computation. In Fig. 6, the pipelining is expressed as the number of pipeline stages introduced. This is different from introducing pipelining after every n levels as discussed above, but was chosen to obtain more readable figures. From this it is seen that the Kogge-Stone adder generally has the smallest power-delay product. It is also clear that the PDP increases with increased pipelining. Hence, the possible savings in terms of reduced glitches are not enough to decrease the PDP. However, standard cell flip-flops are generally characterized by robustness rather than low power consumption, so using low power flip-flops may change the results.

It can also be seen from Fig. 6 that there are some discontinuities in the PDP as a function of wordlength. This is due to the fact that the possible pipeline cuts for the high-speed adders in Figs. 3–5 have different properties. Hence, as the wordlength passes a power of two, the pipeline cuts that are used may be different, possibly resulting in a smaller load at the output of the registers and, hence, an increased throughput.

The power consumption as a function of maximum throughput is shown in Fig. 7 for a number of different wordlengths. Each line in the plot corresponds to the same adder structure using different degrees of pipelining. Hence, the right-most point corresponds to the maximum degree of pipelining. Here, it is again clear that the KS adder yields the best results most of the time. The BK adder yields almost as good results, although the maximal throughput of the BK adder is lower. It can also be seen that the COSA has better relative performance as the wordlength increases.

Figure 7. Power consumption as a function of maximal clock frequency using different degrees of pipelining for ripple-carry (RCA), conditional sum (COSA), Brent-Kung (BK), and Kogge-Stone (KS) adders.

5. Conclusions

In this work the power consumption of pipelined carry-propagation adders for high throughput applications was investigated. Four different adder structures were used, namely, ripple-carry, conditional sum, Brent-Kung, and Kogge-Stone adders. Simulations showed that the Kogge-Stone adder generally has the lowest power consumption for a given throughput. The Brent-Kung adder gave similar results, but with a lower throughput.

6. References

[1] T. K. Callaway and E. E. Swartzlander Jr., “Estimating the power consumption of CMOS adders,” in Proc. IEEE Symp. Computer Arithmetic, 1993, pp. 210–216.
[2] C. Nagendra, M. J. Irwin, and R. M. Owens, “Area-time-power tradeoffs in parallel adder,” IEEE Trans. Circuits Syst.–II, vol. 43, no. 10, pp. 689–702, Oct. 1996.
[3] R. Zimmermann, Binary Adder Architectures for Cell-Based VLSI and their Synthesis, PhD Diss., Swiss Federal Institute of Technology (ETH) Zurich, Switzerland, 1998.
[4] J. Sklansky, “Conditional-sum addition logic,” IRE Trans. Electronic Computers, vol. 9, pp. 226–231, June 1960.
[5] R. P. Brent and H. T. Kung, “A regular layout for parallel adders,” IEEE Trans. Computers, vol. 31, pp. 260–264, March 1982.
[6] P. M. Kogge and H. S. Stone, “A parallel algorithm for the efficient solution of a general class of recurrence equations,” IEEE Trans. Computers, vol. 22, no. 8, pp. 260–264, Aug. 1973.
[7] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice-Hall, 1996.