This article has been accepted for inclusion in a future issue of this journal.
Content is final as presented, with the exception of pagination.
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1
Fast Binary Counters Based on Symmetric Stacking

Christopher Fritz and Adly T. Fam
Abstract— In this brief, a new binary counter design is proposed. It uses 3-bit stacking circuits, which group all of the “1” bits together, followed by a novel symmetric method to combine pairs of 3-bit stacks into 6-bit stacks. The bit stacks are then converted to binary counts, producing 6:3 counter circuits with no XOR gates on the critical path. This avoidance of XOR gates results in faster designs with efficient power and area utilization. In VLSI simulations, the proposed counters are 30% faster than existing parallel counters and also consume less power than other higher order counters. Additionally, using the proposed counters in existing counter-based Wallace tree multiplier architectures reduces latency and power consumption for 64- and 128-bit multipliers.
Index Terms— Counter, high speed, low power, multiplier, VLSI, Wallace tree.
Manuscript received November 21, 2016; revised February 27, 2017 and May 4, 2017; accepted June 12, 2017. (Corresponding author: Christopher Fritz.)
The authors are with the Department of Electrical Engineering, State University of New York at Buffalo, Buffalo, NY 14260 USA (e-mail: cvfritz@buffalo.edu; afam@buffalo.edu).
Digital Object Identifier 10.1109/TVLSI.2017.2723475
1063-8210 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION
High speed, efficient addition of multiple operands is an essential operation in any computational unit. The speed and power efficiency of multiplier circuits are of critical importance in the overall performance of microprocessors. Multiplier circuits are an essential part of an arithmetic logic unit, or a digital signal processor system for performing filtering and convolution. The binary multiplication of integers or fixed-point numbers results in partial products that must be added to produce the final product. The addition of these partial products dominates the latency and power consumption of the multiplier.
In order to combine the partial products efficiently, column compression is commonly used. Many methods have been presented to optimize the performance of the partial product summation, such as the well-known row compression techniques in the Wallace tree [1] or Dadda tree [2], or the improved architecture in [3]. These methods involve using full adders functioning as counters to reduce groups of 3 bits of the same weight to 2 bits of different weight in parallel using a carry-save adder tree. Through several layers of reduction, the number of summands is reduced to two, which are then added using a conventional adder circuit.
To achieve higher efficiency, larger numbers of bits of equal weight can be considered. The basic method when dealing with larger numbers of bits is the same: bits in one column are counted, producing fewer bits of different weights. For example, a 7:3 counter circuit accepts 7 bits of equal weight and counts the number of “1” bits. This count is then output using 3 bits of increasing weight. The 7:3 and 6:3 counter circuits can be constructed using full and half adders, as shown in Fig. 1.

Fig. 1. A 7:3 counter and a 6:3 counter built from full and half adders.

Much of the delay in these counter circuits is due to the chains of XOR gates on the critical path. Therefore, many faster parallel counter architectures have been presented. A parallel 7:3 counter was presented in [4] and used to design a high speed counter-based Wallace tree multiplier in [5]. Additionally, counter designs as in [6] and [7] use multiplexers to reduce the number of XOR gates on the critical path. Some of these muxes can be implemented with transmission gate logic to produce even faster designs.
In this brief, we present a counting method that uses bit stacking circuits followed by a novel method of combining two small stacks to form larger stacks. A 6:3 counter built using this method uses no XOR gates or multiplexers on its critical path. VLSI simulation results show that our 6:3 counter is at least 30% faster than existing counter designs while also using less power. Simulations were also run on full multiplier circuits for various sizes. The same counter-based Wallace multiplier design was used for each simulation, while the internal counter was varied. Use of the proposed counter improves multiplier efficiency for larger circuits, yielding 64- and 128-bit multipliers that are both faster and consume less power than other counter-based Wallace (CBW) designs.

II. SYMMETRIC BIT STACKING
The proposed 6:3 counter is realized by first stacking all of the input bits such that all of the “1” bits are grouped together. After stacking the input bits, this stack can be converted into a binary count to output the count of the 6 input bits. Small 3-bit stacking circuits are first used to form 3-bit stacks. These 3-bit stacks are then combined to make a 6-bit stack using a symmetric technique that adds one extra layer of logic.

A. Three-Bit Stacking Circuit
Given inputs $X_0$, $X_1$, and $X_2$, a 3-bit stacker circuit will have three outputs $Y_0$, $Y_1$, and $Y_2$ such that the number of “1” bits in the outputs is the same as the number of “1” bits in the inputs, but the “1” bits are grouped together to the left followed by the “0” bits. It is clear that the outputs are then formed by

$$Y_0 = X_0 + X_1 + X_2 \tag{1}$$
$$Y_1 = X_0 X_1 + X_0 X_2 + X_1 X_2 \tag{2}$$
$$Y_2 = X_0 X_1 X_2. \tag{3}$$

Namely, the first output will be “1” if any of the inputs is “1,” the second output will be “1” if any two of the inputs are “1,” and the last output will be “1” if all three of the inputs are “1.” The $Y_1$ output is a majority function and can be implemented using one complex CMOS gate. The 3-bit stacking circuit is shown in Fig. 2.
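The stacker equations (1)–(3) can be checked exhaustively in software. The following is a minimal Python sketch, not the authors' circuit; the function name `stack3` is a hypothetical label chosen here, and the bitwise operators `|` and `&` stand in for the OR and AND gates.

```python
from itertools import product


def stack3(x0: int, x1: int, x2: int) -> tuple:
    """Behavioral model of the 3-bit stacker (name is illustrative only)."""
    y0 = x0 | x1 | x2                       # (1): at least one input is 1
    y1 = (x0 & x1) | (x0 & x2) | (x1 & x2)  # (2): majority, at least two inputs are 1
    y2 = x0 & x1 & x2                       # (3): all three inputs are 1
    return y0, y1, y2


# Check all 8 input patterns: the output must hold the same number of
# ones as the input, grouped to the left.
for bits in product((0, 1), repeat=3):
    n = sum(bits)
    assert stack3(*bits) == tuple(int(p < n) for p in range(3))
```

For instance, `stack3(0, 1, 0)` returns `(1, 0, 0)` and `stack3(1, 0, 1)` returns `(1, 1, 0)`.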
Fig. 2. Three-bit stacker circuit.

B. Merging Stacks
We wish to form a 6-bit stacking circuit using the 3-bit stacking circuits discussed. Given six inputs $X_0, \ldots, X_5$, we first divide them into two groups of 3 bits, which are stacked using 3-bit stacking circuits. Let $X_0$, $X_1$, and $X_2$ be stacked into signals named $H_0$, $H_1$, and $H_2$, and $X_3$, $X_4$, and $X_5$ be stacked into $I_0$, $I_1$, and $I_2$. First, we reverse the outputs of the first stacker and consider the six bits $H_2$, $H_1$, $H_0$, $I_0$, $I_1$, and $I_2$. See the top of Fig. 3 for an example of this process. We notice that within these six bits, there is a train of “1” bits surrounded by “0” bits. To form a proper stack, this train of “1” bits must start from the leftmost bit.
In order to form the proper 6-bit stack, two more 3-bit vectors of bits are formed called $J_0$, $J_1$, $J_2$ and $K_0$, $K_1$, $K_2$. The idea is to fill the J vector with ones first, before filling the K vector. So we let

$$J_0 = H_2 + I_0 \tag{4}$$
$$J_1 = H_1 + I_1 \tag{5}$$
$$J_2 = H_0 + I_2. \tag{6}$$

In this way, the first three “1” bits of the train are guaranteed to fill into the J bits, although they may not be properly stacked. Now, to ensure no bits are counted twice, the K bits are formed using the same inputs but with AND gates instead

$$K_0 = H_2 I_0 \tag{7}$$
$$K_1 = H_1 I_1 \tag{8}$$
$$K_2 = H_0 I_2. \tag{9}$$

If the train of “1”s is no more than three places long, then all of the K bits will be “0” as the AND gate inputs are three positions apart. If the train is longer than three places, then some of the AND gates will have both inputs as “1”s. The number of AND gates that will have this property will be three less than the length of the train of “1”s.
We notice that now $J_0 J_1 J_2$ and $K_0 K_1 K_2$ still contain the same number of “1” bits as the input in total, but now the J bits will be filled with ones before any of the K bits. We must now stack $J_0 J_1 J_2$ and $K_0 K_1 K_2$ using two more 3-bit stacking circuits. The outputs of these two circuits can then be concatenated to form the stack outputs $Y_5, \ldots, Y_0$.
An example of this process is shown for an input vector containing four “1” bits in Fig. 3. In this example, first the H and I vectors are formed by stacking groups of three input bits. Then, the H vector is reversed, forming a continuous train of four “1” bits surrounded by zero bits. Corresponding bits are OR-ed to form the J vector, which is full of “1” bits. Corresponding bits are AND-ed to form the K vector, which finds exactly one overlap. Then, the J and K vectors are restacked to form the final 6-bit stack.

Fig. 3. Six-bit stacking example.

III. CONVERTING BIT STACK TO BINARY NUMBER
In order to implement a 6:3 counter circuit, the 6-bit stack described in Section II must be converted to a binary number. For a faster, more efficient count, we can use intermediate values H, I, and K to quickly compute each output bit without needing the bottom layer of stackers. Call the output bits $C_2$, $C_1$, and $S$, in which $C_2 C_1 S$ is the binary representation of the number of “1” input bits.
To compute $S$, we note that we can easily determine the parity of the outputs from the first layer of 3-bit stackers. Even parity occurs in the H bits if zero or two “1” bits appear in $X_0$, $X_1$, and $X_2$. Thus, $H_e$ and $I_e$, which indicate even parity in the H and I bits, are given by

$$H_e = \overline{H_0} + H_1 \overline{H_2} \tag{10}$$
$$I_e = \overline{I_0} + I_1 \overline{I_2}. \tag{11}$$

As $S$ indicates odd parity over all of the input bits, and because the sum of two numbers with different parities is odd, we can compute $S$ as

$$S = H_e \oplus I_e. \tag{12}$$

Although this does incur one XOR gate delay, it is not on the critical path. To compute $C_1$, we note $C_1 = 1$ when the count is 2, 3, or 6. Therefore, there are two cases. First, we need to check if we have at least two but no more than three total inputs set. We can use the intermediate H, I, and K vectors for this. To check for at least two inputs, we need to see stacks of length two from either top level stacker, or two stacks of length one, which yields $H_1 + I_1 + H_0 I_0$. To check that we do not have more than three inputs set, we simply need to make sure that none of the K bits are set, as the K vector is only set when more than three inputs are “1,” as discussed in Section II. This gives $\overline{(K_0 + K_1 + K_2)}$.
Second, we need to check if we have all six inputs as “1.” We can check this by checking that all three of both the H and I bits are set. As these are bit stacks, we simply check the rightmost bit in the
stack for this case, which yields $H_2 I_2$. Altogether, this yields

$$C_1 = (H_1 + I_1 + H_0 I_0)\,\overline{(K_0 + K_1 + K_2)} + H_2 I_2. \tag{13}$$

We can easily calculate $C_2$ as it should be set whenever we have at least 4 bits set

$$C_2 = K_0 + K_1 + K_2. \tag{14}$$

Using (12)–(14), the final 6:3 counter circuit can be constructed, as shown in Fig. 4.

Fig. 4. A 6:3 counter based on symmetric stacking.

Using larger CMOS gates, the critical path delay is reduced to seven basic gates. As there are no XOR gates on the critical path, this 6:3 counter outperforms existing designs as shown in Section IV. One drawback of this design is an increase in wiring complexity: we see from Figs. 3 and 4 that the symmetric approach necessitates signals crossing after the first layer of stackers, while traditional counters, as in Fig. 1, do not have as many crossing paths.

IV. 6:3 COUNTER SIMULATION
The proposed 6:3 counter design was built as a standard CMOS design and simulated using Spectre, using the ON Semiconductor C5 0.5-μm process (formerly AMI06). For comparison, a 6:3 counter design was implemented using standard CMOS full adders as in Fig. 1. The parallel counter design from [4] was converted to a 6:3 counter and simulated as well. It has a critical path delay of 3 XOR + 2 basic gates. The mux-based counter design from [6] was also simulated. It has a critical path delay of 1 XOR + 3 MUX. Two of the muxes on the critical path can be implemented with transmission gate logic, which is slightly faster. The proposed 6:3 counter has no XOR gates or muxes on its critical path. It has a critical path delay of seven basic gates.
Table I shows the results of the simulation of these four 6:3 counter implementations in terms of latency, average power consumption, and number of transistors used. Average power consumption was calculated by integrating the Spectre instantaneous power consumption output and dividing by the simulation runtime. For the proposed counter, these results are for the entire counter including the 3-bit stacker circuits and binary conversion logic. The transistor count is output in the circuit inventory in the Spectre output as a node count. The simulation was run at 50 MHz.

TABLE I
6:3 COUNTER SIMULATION RESULTS

Because the proposed 6:3 counter based on bit stacking has no XOR gates on its critical path, it operates nearly 30% faster than all other counter designs. Thus, this novel method of counting via bit stacking allows construction of a counter for a substantial performance increase without increasing power consumption.

V. 7:3 COUNTER DESIGN
The symmetric stacking method can be used to create a 7:3 counter as well. The 7:3 counters are desirable as they provide a higher compression ratio. The design of the 7:3 counter involves computing outputs for $C_1$ and $C_2$ assuming both $X_6 = 0$ (which matches the 6:3 counter) and assuming $X_6 = 1$. We compute the $S$ output by adding one additional XOR gate.

Fig. 5. A 7:3 counter based on symmetric stacking.

If $X_6 = 1$, then $C_1 = 1$ if the count of $X_0, \ldots, X_5$ is at least 1 but less than 3, or at least 5, which can be computed as

$$C_1 = (H_0 + I_0)\,\overline{J_0 J_1 J_2} + H_2 I_1 + H_1 I_2. \tag{15}$$

Also, $C_2 = 1$ if the count of $X_0, \ldots, X_5$ is at least 3

$$C_2 = J_0 J_1 J_2. \tag{16}$$

Both versions of $C_1$ and $C_2$ are computed and a mux is used to select the correct version based on $X_6$. Note that this design therefore
has muxes on the critical path. The 7:3 counter design is shown in Fig. 5.
Simulations were run on the proposed 7:3 counter against the original 7:3 counters. The results are shown in Table II.

TABLE II
7:3 COUNTER SIMULATION RESULTS

While the proposed counter is still slightly faster than the existing counters, the improvement is not as significant and the power consumption is increased. For this reason, we will use the proposed 6:3 counter to build 64-bit multipliers in Section VI even though this requires one more reduction phase.

VI. FULL MULTIPLIER SIMULATIONS
To demonstrate a use case of the proposed 6:3 counter, multiplier circuits of different sizes were constructed using different internal counters. No new multiplier design is proposed; rather, existing architectures are simulated with different internal counters. For reference, a standard Wallace tree was implemented for each size. Then, the counter-based Wallace tree from [5], which achieves the fewest reduction phases, was used. The internal 7:3 and 6:3 counters used for this CBW multiplier were varied. The 5:3 and 4:3 counters were kept the same for each multiplier, using the counter designs from [5]. Standard CMOS implementations were used for the full and half adders. Because of the efficiency of the 6-bit version of the proposed counter, for simulations using the stacker-based counter, we use the 6-bit version with no 7:3 counters, even though this results in one additional reduction phase for each size. An example of a CBW multiplier reduction tree that uses up to 6:3 counters for 16-bit inputs is shown in Fig. 6. The simulation results are shown in Table III.

Fig. 6. CBW multiplier reduction tree using up to 6:3 counters.

TABLE III
MULTIPLIER SIMULATION RESULTS

For small sizes, the counter has little impact on the performance of the multiplier. The standard Wallace tree has low overhead and performs well at these sizes, and consumes the least power at all sizes. We see that as the size of the multiplier increases, a counter-based Wallace tree built using the proposed 6:3 counters is faster than multipliers of the same size built using a standard Wallace tree design or using existing 7:3 counters. Furthermore, the power consumption is reduced by using the stacker-based counters compared to multipliers built with existing 7:3 counters. Latency and average power consumption are shown in Fig. 7 for the different multiplier sizes.

Fig. 7. Power and latency for CBW multipliers with different counters.

VII. CONCLUSION
In this brief, a new binary counter based on a novel symmetric bit stacking approach is proposed. We showed that this counting method can be used to implement 6:3 and 7:3 counters, which can be used in any binary multiplier circuit to add the partial products. We demonstrated that 6:3 counters implemented with this bit stacking technique achieve higher speed than other higher order counter designs while reducing power consumption. This is due to the lack of XOR gates and multiplexers on the critical path. The 64-bit and 128-bit counter-based Wallace tree multipliers built using the proposed 6:3 counters outperform both the standard Wallace tree implementation as well as multipliers built using existing 7:3 counters.

REFERENCES
[1] C. S. Wallace, “A suggestion for a fast multiplier,” IEEE Trans. Electron. Comput., vol. EC-13, no. 1, pp. 14–17, Feb. 1964.
[2] L. Dadda, “Some schemes for parallel multipliers,” Alta Freq., vol. 34, pp. 349–356, May 1965.
[3] Z. Wang, G. A. Jullien, and W. C. Miller, “A new design technique for column compression multipliers,” IEEE Trans. Comput., vol. 44, no. 8, pp. 962–970, Aug. 1995.
[4] M. Mehta, V. Parmar, and E. Swartzlander, “High-speed multiplier design using multi-input counter and compressor circuits,” in Proc. 10th IEEE Symp. Comput. Arithmetic, Jun. 1991, pp. 43–50.
[5] S. Asif and Y. Kong, “Design of an algorithmic Wallace multiplier using high speed counters,” in Proc. IEEE Comput. Eng. Syst. (ICCES), Dec. 2015, pp. 133–138.
[6] S. Veeramachaneni, L. Avinash, M. Krishna, and M. B. Srinivas, “Novel architectures for efficient (m, n) parallel counters,” in Proc. 17th ACM Great Lakes Symp. VLSI, 2007, pp. 188–191.
[7] S. Veeramachaneni, K. M. Krishna, L. Avinash, S. R. Puppala, and M. B. Srinivas, “Novel architectures for high-speed and low-power 3-2, 4-2 and 5-2 compressors,” in Proc. 20th Int. Conf. VLSI Design Held Jointly 6th Int. Conf. Embedded Syst. (VLSID), Jan. 2007, pp. 324–329.
[8] V. G. Oklobdzija, D. Villeger, and S. S. Liu, “A method for speed optimized partial product reduction and generation of fast parallel multipliers using an algorithmic approach,” IEEE Trans. Comput., vol. 45, no. 3, pp. 294–306, Mar. 1996.
[9] S. Asif and Y. Kong, “Analysis of different architectures of counter based Wallace multipliers,” in Proc. 10th Int. Conf. Comput. Eng. Syst. (ICCES), Dec. 2015, pp. 139–144.
[10] J. Gu and C.-H. Chang, “Low voltage, low power (5:2) compressor cell for fast arithmetic circuits,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process. (ICASSP), vol. 2. Apr. 2003, pp. 661–664.
[11] K. Prasad and K. K. Parhi, “Low-power 4-2 and 5-2 compressors,” in Proc. Conf. Rec. 35th Asilomar Conf. Signals, Syst. Comput., vol. 1. Nov. 2001, pp. 129–133.
[12] I. Koren, Computer Arithmetic Algorithms, 2nd ed. Natick, MA, USA: A. K. Peters, 2002.
[13] M. Rouholamini, O. Kavehie, A.-P. Mirbaha, S. J. Jasbi, and K. Navi, “A new design for 7:2 compressors,” in Proc. IEEE/ACS Int. Conf. Comput. Syst. Appl., May 2007, pp. 474–478.
[14] A. Dandapat, S. Ghosal, P. Sarkar, and D. Mukhopadhyay, “A 1.2-ns 16 × 16-bit binary multiplier using high speed compressors,” Int. J. Elect. Electron. Eng., vol. 4, no. 3, pp. 234–239, 2010.
[15] D. Radhakrishnan, “Low-voltage low-power CMOS full adder,” IEE Proc.-Circuits, Devices Syst., vol. 148, no. 1, pp. 19–24, Feb. 2001.
[16] S.-F. Hsiao, M.-R. Jiang, and J.-S. Yeh, “Design of high-speed low-power 3-2 counter and 4-2 compressor for fast multipliers,” Electron. Lett., vol. 34, no. 4, pp. 341–343, Feb. 1998.
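
The logic of Sections II-B, III, and V lends itself to an exhaustive software check. The sketch below is a behavioral model only, under the assumption that the overbars in (10), (11), (13), and (15) fall as the surrounding text implies (complemented $H_0$, $H_2$, $I_0$, $I_2$; complemented OR of the K bits; complemented AND of the J bits). The function names (`stack3`, `stack6`, `counter63`, `counter73`, `to_bits`) are hypothetical labels and do not come from the brief. Outputs are ordered leftmost-first, with counts returned as $(C_2, C_1, S)$.

```python
from itertools import product


def stack3(x0, x1, x2):
    """3-bit stacker: OR, majority, AND."""
    return (x0 | x1 | x2,
            (x0 & x1) | (x0 & x2) | (x1 & x2),
            x0 & x1 & x2)


def stack6(x):
    """6-bit stack by symmetric merging of two 3-bit stacks."""
    h0, h1, h2 = stack3(*x[:3])
    i0, i1, i2 = stack3(*x[3:6])
    j = (h2 | i0, h1 | i1, h0 | i2)   # OR of mirrored pairs fills J first
    k = (h2 & i0, h1 & i1, h0 & i2)   # AND of mirrored pairs catches overlaps
    return stack3(*j) + stack3(*k)    # restack J and K, then concatenate


def counter63(x):
    """6:3 counter from the intermediate H, I, and K signals."""
    h0, h1, h2 = stack3(*x[:3])
    i0, i1, i2 = stack3(*x[3:6])
    k_any = (h2 & i0) | (h1 & i1) | (h0 & i2)
    he = (1 - h0) | (h1 & (1 - h2))   # even parity in H (assumed overbars)
    ie = (1 - i0) | (i1 & (1 - i2))   # even parity in I (assumed overbars)
    s = he ^ ie                       # odd parity overall
    c1 = ((h1 | i1 | (h0 & i0)) & (1 - k_any)) | (h2 & i2)
    c2 = k_any                        # at least four inputs set
    return c2, c1, s


def counter73(x):
    """7:3 counter: both (C2, C1) versions are computed, X6 selects one."""
    h0, h1, h2 = stack3(*x[:3])
    i0, i1, i2 = stack3(*x[3:6])
    j_all = (h2 | i0) & (h1 | i1) & (h0 | i2)
    c2_if0, c1_if0, s6 = counter63(x[:6])
    c1_if1 = ((h0 | i0) & (1 - j_all)) | (h2 & i1) | (h1 & i2)
    c2_if1 = j_all
    c2, c1 = (c2_if1, c1_if1) if x[6] else (c2_if0, c1_if0)
    return c2, c1, s6 ^ x[6]          # one extra XOR for the parity bit


def to_bits(n, width=3):
    """Binary representation of n, most significant bit first."""
    return tuple((n >> b) & 1 for b in range(width - 1, -1, -1))


for x in product((0, 1), repeat=6):
    n = sum(x)
    assert stack6(x) == tuple(int(p < n) for p in range(6))
    assert counter63(x) == to_bits(n)
for x in product((0, 1), repeat=7):
    assert counter73(x) == to_bits(sum(x))
```

All 64 six-bit and 128 seven-bit input patterns pass these assertions, which supports the reading of the complemented terms stated above. The model says nothing about gate delay, power, or transistor count.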
