
On Signal-Gating Schemes for Low-Power Adders

Zhijun Huang and Miloš D. Ercegovac


Computer Science Department
University of California Los Angeles
Los Angeles, CA 90095
{zjhuang, milos}@cs.ucla.edu

Abstract

Signal gating schemes for low-power adder design are studied in this paper. Signal gating dynamically deactivates portions of an adder according to the actual precision of two operands. Based on program analysis, signal gating is developed for two different adders: symmetric adders and asymmetric adders. The effect of signal gating is investigated by incorporating several gating schemes into a RISC pipeline. Experimental results indicate more power saving compared to previous work.

1 Introduction

Adders are fundamental arithmetic units in digital systems. As power consumption is becoming an important concern, low-power adders have been studied extensively on the circuit and logic levels. In [1]-[4], power estimation and comparison of various adder structures are presented. In [5]-[8], power optimization techniques are proposed to reduce switching activity. In these studies, adders have been treated as isolated units with little consideration of application data characteristics. Because power dissipation is directly related to data switching patterns, isolated adder optimization leads to limited power saving. Program analysis has revealed that there are a large number of short-precision additions in most applications [9]-[12]. To take advantage of short-precision data, signal gating can be applied to deactivate portions of an adder to match run-time data precision.

Signal gating is a popular power reduction technique that has been used widely at all levels of abstraction. On the architecture level, an idle functional unit and its input/output registers can be powered down by disabling their clock signals [13][14]. For a busy functional unit, some portion(s) can still be gated according to operand precisions as first proposed in [10]. The notion of operand gating is further generalized to all stages of the microprocessor pipeline in [12]. In the adaptive power-aware system proposed in [15], an ensemble of functional units with different fixed widths is provided and only one of them is adaptively enabled according to input data precision. A major limitation of these architecture-level signal gating schemes is the lack of arithmetic details and reliable power estimation, which would affect the optimal points and area/power/delay tradeoffs. For example, the power distribution in adders with short-precision data is generally not uniform. The work in [16] attempted to consider adder details with signal gating logic and concluded that little power saving could be achieved because of the overhead. This conclusion is true if the adder design is considered separately. However, the overhead can be reduced or amortized if the related units are also modified to accept short-precision data.

In this paper, we develop signal gating schemes with arithmetic details for different precision patterns of addition and apply the techniques to the DLX microprocessor pipeline [17]. The rest of this paper is organized as follows. Section 2 gives basic definitions. Section 3 presents program analysis results as the motivation of this work. Section 4 discusses signal gating schemes for adders with symmetric-precision operands. Section 5 studies signal gating schemes for special adders with asymmetric-precision operands. Section 6 discusses experimental results including the power optimization of the DLX pipeline. The last section concludes this work.

2 Definitions

An n-bit two's-complement number can be partitioned into two parts: sign extension bits (leading zeros or ones), and significand bits including the sign bit. The sign extension part is denoted as $E$ and the significand part as $D$. $|E|$ and $|D|$ denote the lengths of $E$ and $D$, respectively. Therefore, $n = |E| + |D|$ and $1 \le |D| \le n$. $|D|$ is the actual precision of the number. For two's-complement addition with operands $X$ and $Y$, the operation precision is defined as $|D_{op}| = \max(|D_X|, |D_Y|)$. $P\{|D_{op}| = i\}$ is the probability of operations with precision $|D_{op}| = i$. We also define the following probability variables for each operand:

$P_i(D)$: the probability of bit i being in part $D$;
$P_i(E)$: the probability of bit i being in part $E$;
$SP_i$: the static probability of bit i being logic '1';
$TR_i$: the rate of bit i toggling per cycle.

$P_i(D)$ and $P_i(E)$ reflect data spatial correlation. If two neighboring bits have similar $P(E)$, they are highly correlated. $SP$ and $TR$ reflect the temporal correlation. The lower the values, the more temporally correlated the data are.
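To make the definitions of Section 2 concrete, the following Python sketch computes $|D|$, $|E|$ and the operation precision $|D_{op}|$ for integers interpreted as n-bit two's-complement values. The function names (`precision`, `sign_extension_length`, `operation_precision`) are illustrative assumptions and do not come from the paper.

```python
def precision(value: int, n: int = 32) -> int:
    """Return |D|: the significand bits of an n-bit two's-complement value,
    including the sign bit."""
    if not -(1 << (n - 1)) <= value < (1 << (n - 1)):
        raise ValueError(f"{value} does not fit in {n} bits")
    # A negative value needs as many magnitude bits as its bitwise complement.
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1  # +1 for the sign bit


def sign_extension_length(value: int, n: int = 32) -> int:
    """Return |E| = n - |D|, the number of redundant leading 0s or 1s."""
    return n - precision(value, n)


def operation_precision(x: int, y: int, n: int = 32) -> int:
    """Return |D_op| = max(|D_X|, |D_Y|) for an addition of x and y."""
    return max(precision(x, n), precision(y, n))
```

With n = 32, for instance, the value 5 has $|D| = 4$ (bits 0101) and $|E| = 28$, while both 0 and -1 have the minimum precision $|D| = 1$.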

0-7803-7147-X/01/$10.00 ©2001 IEEE
3 Program Analysis

With the execution tracing tool Shade [18], we have analyzed run-time features of operations and operands in 32-bit Mediabench programs [19]. We have the following observations. First, about 70% of the total executed instructions involve addition steps. This is because additions/subtractions, load/store memory address calculation and branches all require addition operations. Moreover, most instruction executions have a PC incrementing step. In program djpeg, for example, the distribution of instruction types is: additions/subtractions 36.08%, mult/div 3.26%, shift and logic 21.02%, load/store 30.49%, branches 7.44%. Second, most arithmetic operations have precisions much smaller than the datapath hardware width. In djpeg, 86% of additions/subtractions have a precision of 20 bits or less and 57% have a precision of 13 bits or less, as shown in Fig. 1. Third, the precision difference between two operands is significant. In djpeg, the average precision difference between the two operands of an addition/subtraction is 7 bits, while the difference between the two operands in data memory address calculation is 13 bits. Fourth, the $SP$ of each bit is less than 0.5 in most cases, and the $TR$ is often not equal to $2 \times SP \times (1 - SP)$ because of the correlation. These observations have motivated this work.

Figure 1: Precision distribution of Add/Sub. (Plot title: Precision Distribution of Addition/Subtraction in djpeg; horizontal axis: precision, 4 to 32 bits.)

4 Signal Gating for Symmetric Adders

We define symmetric adders as n-bit adders in which both operands have a similar data range, which is the general case. In our design, input data of an adder are stored in two registers. Upon each clock rising edge, new data are loaded into the registers and the addition works on the loaded data. Signal gating is applied to both input registers and combinational addition logic. In many cases, signal gating with one gating boundary cannot fully utilize dynamic data precisions. Adders with multiple gating boundaries may be designed for more energy saving. Here we only discuss signal gating with one boundary.

4.1 Signal Gating with One Boundary

Suppose the upper G-bit portion of the adder is identified as the candidate to be gated. The general structure of a signal-gated adder is illustrated in Fig. 2. IDL is identity detection logic and GCTL is gating control logic, as shown in Fig. 3. The behavior of signal gating is as follows. The first step is to detect $|E_X|$ and $|E_Y|$, the sign-extension widths of $X$ and $Y$. Two leading bits from the lower $(N - G)$ part are involved in testing. This is necessary because we need the result of the lower-part computation to be in a correct form without overflow. The OR signal of the upper $(G + 2)$ bits in each operand indicates if the upper G bits can be viewed as extension bits of '0'. The NAND signal of the upper $(G + 2)$ bits indicates if the upper G bits can be viewed as extension bits of '1'. To protect the clock signal from glitches and ensure correct timing, g is latched to be gl before controlling the clock and is registered to be gll before controlling the combinational circuit. If $|E_X| > G$ and $|E_Y| > G$, the G-bit portion is gated by disabling the clock of the input registers and blocking the carry signal. The adder works as a short-precision adder and the result is then restored to the full width. Otherwise, the adder works as a normal full-width adder.

Figure 2: Symmetric adder with signal gating. (Inputs X and Y and output SUM are each split into an upper G-bit part and a lower (N-G)-bit part.)

Figure 3: Identical Detection and Gating Control Logic. (a) IDL; (b) GCTL.

4.2 Overhead Analysis

To justify the signal gating technique, the energy overhead should not exceed the power reductions.
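The gating decision of Section 4.1, whose cost this subsection analyzes, can be modelled behaviourally as in the following Python sketch. It is not the gate-level design of Figs. 2 and 3: the function names are hypothetical, clocking and latching of the gating signal are ignored, and the defaults N = 32 and G = 11 are chosen only for illustration. The sketch tests the upper (G + 2) bits of each operand and, when both tests pass, performs only the lower (N - G)-bit addition and restores the result to full width by sign extension.

```python
def upper_bits_identical(x: int, n: int, g: int) -> bool:
    """True if the upper (g + 2) bits of the n-bit pattern x are all 0 or all 1.

    Mirrors the OR / NAND tests performed by the identity detection logic.
    """
    width = g + 2
    top = (x >> (n - width)) & ((1 << width) - 1)
    all_zero = top == 0                  # OR of the tested bits is 0
    all_one = top == (1 << width) - 1    # NAND of the tested bits is 0
    return all_zero or all_one


def gated_add(x: int, y: int, n: int = 32, g: int = 11):
    """Add two n-bit patterns; return (sum, gated)."""
    mask_n = (1 << n) - 1
    x &= mask_n
    y &= mask_n
    gated = upper_bits_identical(x, n, g) and upper_bits_identical(y, n, g)
    if not gated:
        return (x + y) & mask_n, False   # normal full-width addition

    low = n - g
    mask_low = (1 << low) - 1
    # Short-precision addition: the carry into the upper part is blocked.
    s = ((x & mask_low) + (y & mask_low)) & mask_low
    # Restore the full width by replicating the sign bit of the short result.
    if (s >> (low - 1)) & 1:
        s |= mask_n ^ mask_low
    return s, True
```

Because the two extra tested bits guarantee that each operand fits in (N - G - 1) bits, the short sum cannot overflow the lower (N - G) bits, which is why the sign-extended short result equals the full-width sum.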

Moreover, the area and delay overhead need to be considered. We assume tree structures of 2-input gates are used to implement IDL. To implement a w-input NAND, a tree structure will have $\log_2 w$ levels and $(w - 1)$ gates. For the structure in Fig. 2, we estimate the area overhead as

$$A_{oh} = (4G + 8) \cdot A_{nand2} + Q \cdot A_{latch} + G \cdot A_{mux21}$$

The energy overhead, $J_{oh}$, depends on $A_{oh}$ and the G-bit data switching activities. The delay increase of the IDL stage is

$$T1_{oh} = (\log_2(G + 2) + 2)\, t_{nand2} + t_{latch}$$

and the delay increase of the addition stage is

$$T2_{oh} = t_{latch} + t_{mux21}$$

To find the optimal gating positions, the power distribution in the baseline registers and adders is needed. This can be obtained with various power estimation tools. For a baseline adder/subtractor with test data from djpeg, the power distribution is given in Fig. 4. Denote the energy consumption of the i-th cell as $J_i$. The power consumption of the addition unit is $J_{add} = \sum_{i=1}^{n} J_i$. When the upper g bits are gated ($n_g = n - g$), the power consumption is reduced to

$$J_{gadd} = \sum_{i=1}^{n_g} J_i + \left(1 - P\{|D_{op}| < n_g\}\right) \sum_{i=n_g+1}^{n} J_i$$

and the power saving in the addition unit is

$$\Delta J_{add} = P\{|D_{op}| < n_g\} \sum_{i=n_g+1}^{n} J_i$$

in which $P\{|D_{op}| < n_g\}$ is the probability of operation precisions being less than $n_g$, which can be obtained from Fig. 1. Similarly, we can deduce $\Delta J_{regx}$ and $\Delta J_{regy}$. The overall power saving will be

$$\Delta J_{total} = \Delta J_{regx} + \Delta J_{regy} + \Delta J_{add} - J_{oh}$$

Figure 4: Power distribution in adder/subtractor. (Plot title: Power Distribution in 32-bit Adder/Subtractor; horizontal axis: bit position, 4 to 32.)

5 Signal Gating for Asymmetric Adders

We define asymmetric adders as n-bit adders with two operands having quite different data ranges. The range difference can exist explicitly or implicitly. Two types of explicit asymmetric additions are considered here: PC incrementing and addition with one operand being a short-width immediate. In implicit asymmetric addition, both operands have full hardware widths while the precisions are quite different at run time.

In processor pipelines, PC incrementing is an explicit asymmetric addition. Assuming the PC has word resolution, one operand of the addition is always +1 while the other operand is in full 30-bit precision. The incrementing is complicated as the branch target address would be loaded into PC when a branch is taken. In [12], the PC incrementing is performed byte-serially to reduce the activity. In the worst case, the PC incrementing has to be done in four cycles. Here we look at how to apply signal gating in a low-power PC incrementing design with little performance loss. A diagram of the design is given in Fig. 5. The upper G portion is still the gating candidate. When branch is 1, indicating the branch is taken, the gating logic is inactive and the branch target address BPC is loaded into PC in the next cycle. When branch is 0 and the lower $N_G$ bits of PC are not all 1's, the upper G portion is gated in the next cycle. If branch is 0 and the $N_G$ bits are all 1's, the $N_G$ part in the next cycle will compute 11...11 + 1, which would generate a carry into the upper G portion. In this case, the G portion must be active in order to accept the carry.

Figure 5: PC incrementor with signal gating.

Addition with one immediate operand is frequently used in addition/subtraction instructions and load/store memory address calculations. One operand has the full precision range of n. The other operand has a small range of w (w < n). To apply signal gating, the addition can be partitioned into two portions: the lower w bits and the upper (n - w) bits. The upper portion is then simplified as an incrementor/decrementor, as shown in Fig. 6. If a half adder is used to implement the incrementor/decrementor, the Boolean expressions are $s_i = c_i \oplus x_i$ and $c_{i+1} = c_i (x_i \oplus dir)$. The direction control $dir$ is $s\,\overline{c_d}$, and $dir = 1$ indicates decrementing. The carry-in of the upper portion is modified to be $s \oplus c_d$.
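The incrementor/decrementor equations for addition with an immediate can be checked with a small bit-level model. The Python sketch below rests on assumptions not spelled out in the text: `s` is taken to be the sign bit of the w-bit immediate (the value replicated in the upper portion of Y in Fig. 6), `cd` is taken to be the carry out of the lower w-bit addition, and the function names and default widths n = 32, w = 16 are hypothetical.

```python
def upper_inc_dec(x_upper: int, width: int, s: int, cd: int) -> int:
    """Upper (n - w)-bit portion built from half-adder cells.

    dir = s AND (NOT cd) selects decrementing; the carry-in is s XOR cd.
    """
    direction = s & (cd ^ 1)
    c = s ^ cd                          # carry-in of the upper portion
    result = 0
    for i in range(width):
        xi = (x_upper >> i) & 1
        result |= (c ^ xi) << i         # s_i = c_i XOR x_i
        c &= xi ^ direction             # c_{i+1} = c_i AND (x_i XOR dir)
    return result


def add_immediate(x: int, imm: int, n: int = 32, w: int = 16) -> int:
    """Add a w-bit two's-complement immediate to the n-bit pattern x."""
    mask_w = (1 << w) - 1
    low_sum = (x & mask_w) + (imm & mask_w)   # ordinary lower w-bit adder
    cd = low_sum >> w                         # carry out of the lower part
    s = (imm >> (w - 1)) & 1                  # sign of the immediate
    x_upper = (x >> w) & ((1 << (n - w)) - 1)
    upper = upper_inc_dec(x_upper, n - w, s, cd)
    return (upper << w) | (low_sum & mask_w)
```

When $s \oplus c_d = 0$ the carry-in is 0 and the upper portion simply passes its input through, which is the case in which that portion is a candidate for gating.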

Figure 6: Simplification in addition with immediates.
X: s d d d d d d d d d | d d d d d d
Y: s s s s s s s s s s | d d d d d d   (upper part of Y: sign portion)

Because the logic of the upper portion is reduced, power consumption goes down accordingly even if there is no signal gating. If we want to gate the upper portion when $s \oplus c_d = 0$, output multiplexers are necessary. More importantly, there is a timing problem to be solved. If the upper portion with input registers is gated when waiting for the carry signal $c_d$, the new X input may be lost when the upper portion starts computing. Gating-controlled latches can be inserted to hold the new X at the register's outputs. As the overhead is a latch and a multiplexer per bit, the power saving is minimal, if any. An alternative approach is to put the lower portion and the upper portion into two pipeline stages if possible. The carry signal of the lower-portion stage is used to gate the upper-portion stage.

In implicit asymmetric addition, a partition similar to Fig. 6 can still be made. However, the upper portion of Y is not always the sign portion. It is possible that some or all bits in the upper portion are significant bits. We estimate that the signal gating design for implicit asymmetric addition would be complicated and the power saving is not much.

6 Experiments

6.1 Adder with Signal Gating

We first studied the power saving effect of the general symmetric adders with signal gating. These schemes have been designed in gate-level VHDL. The Synopsys design environment is used as a common platform to compare different schemes. Test data are gathered by tracing the execution of addition/subtraction in djpeg. The breakdown of power consumption in different schemes is shown in Table 1. The baseline scheme is a 32-bit two's-complement carry propagate adder/subtractor. G11 is the scheme with signal gating on the upper 11 bits. G18 is the scheme with gating on the upper 18 bits. 2G is the scheme with dual gating on the upper 11 and 18 bits. The power consumption is divided into two parts: the useful power consumption, $P_{use}$, and the overhead power, $P_{oh}$. $P_{use}$ includes $P_{clk}$, $P_{din}$, $P_{reg}$, and $P_{add}$. $P_{clk}$ and $P_{din}$ are the power consumptions of the clock and data input ports. $P_{reg}$ and $P_{add}$ are the power consumptions of the input registers and the adder/subtractor core.

Table 1: Power comparison (nW/MHz)
[Columns: Schemes, baseline, G11, G18, 2G. Row labels and most entries are illegible in the source; surviving values, in order of appearance: 195.7, 140.7; 30.6, 46.2; 99.2, 109.2; 58.9, 57.6; 120.8, 90.51, 81.1, 76.0.]

As we have expected, the power is reduced in the following circuit blocks: the clock port, the input registers, and the addition core. Considering only $P_{use}$, the power reduction is encouraging: G11 reduces it by 19%, G18 by 27% and 2G by 29%. For the overall power consumption, however, only G11 reduces the power, by 6.2%, while the other two schemes consume even more power than the baseline. To justify the use of signal gating in adders, the power overhead must be reduced or amortized. If we consider adders in a whole design, it is possible to reduce the overhead by amortizing the precision detection cost or keeping the short-precision data for subsequent computations. For example, if signal gating is applied in a multiplier, there is no need to provide precision detection logic for the final CPA. In the following, we study how to keep short-precision data in the DLX processor pipeline.

6.2 DLX Pipeline with Signal Gating

The DLX pipeline considered here is an improved version of the basic pipeline with reduced stalls from branch hazards [17]. There are five stages: instruction fetch (IF), instruction decode (ID), execute (EX), data memory access (MEM) and write back (WB). IF, ID and EX have addition operations. The adder in stage IF is the PC incrementor. The adder in stage ID calculates the branch target address (BTA), which is the addition of PC and an immediate. The ALU implements arithmetic and logic operations, as well as data memory address calculation.

Previous work in [14] studied clock gating on the whole units and the related pipeline registers, which we call whole-unit gating. For example, the EX stage will be idle for 9.17% of the total execution time and MEM will be idle for 69.5% of the total time when running djpeg, which makes them good candidates for whole-unit gating. In addition to whole-unit gating, we extend the signal gating schemes described in this paper to the whole pipeline, which is called portion-unit gating. Instead of detecting precision at the two inputs of the ALU, the precision is detected at the outputs of the ALU and data memory. The precision information is kept in the pipeline using some tag bits. In our experiment, we use dual signal gating on the upper 11 and 18 bits of the EX, MEM and WB units as well as their input registers. PC+1 and BTA are asymmetric adders described in Section 5. There is no gating on the IF and ID units because they are always busy and instructions are not data to be computed. Instruction compression techniques can be used
to reduce the number of bits to be processed, which is out of our scope here. Based on the simulation data in Table 1, each unit with dual signal gating is assumed to consume 33% less power on average when gated. Table 2 lists the power reduction percentage in pipeline blocks of two gating schemes compared to the baseline DLX pipeline with no gating.

Table 2: Power reduction (%) in pipeline blocks relative to the baseline DLX pipeline

Schemes   PC+1    BTA     EX      MEM     WB      REG
W         0       0       9.17    69.5    9.17    26.33
W+P       44.53   25.00   39.42   79.65   39.42   37.77

Only those units with power changing are listed in the table. W is the whole-unit gating scheme. W+P is the combination of whole-unit gating and portion-unit gating. It can be seen that W+P achieves another 10-45% reduction compared to W. With respect to the gating overhead, there is little cost in the W scheme because there is no precision detection and the gating control signals are also pipeline control signals. The W+P scheme has precision detection at both the ALU and MEM outputs. The total overhead may offset the benefit in the EX stage, judging from Table 1. Considering that the ALU and MEM are gated when idle, the overhead would be much less.

7 Conclusions

Signal gating schemes for low-power adder design have been studied in this paper. The program analysis indicates that there is a large number of short-precision additions and the precision difference between two operands is large. Based on the precision features, signal gating is developed for two types of adders: symmetric adders and asymmetric adders. The effect of signal gating is studied by treating a signal-gated adder as a separate unit as well as incorporating signal gating into a RISC pipeline. Experimental results indicate 10-45% power saving in the pipeline units compared to previous work.

References

[1] Callaway, T.K.; Swartzlander, E.E., Jr. "Estimating the power consumption of CMOS adders," in Proc. IEEE 11th Symp. Computer Arithmetic, pp.210-216, 1993.
[2] Nagendra, C.; Irwin, M.J.; Owens, R.M. "Power-delay characteristics of CMOS adders," IEEE Trans. VLSI Systems, vol.2, no.3, Sept. 1994.
[3] Nagendra, C.; Irwin, M.J.; Owens, R.M. "Area-time-power tradeoffs in parallel adders," IEEE Trans. Circuits and Systems II: Analog and Digital Signal Processing, vol.43, no.10, Oct. 1996.
[4] Freking, R.A.; Parhi, K.K. "Theoretical estimation of power consumption in binary adders," in Proc. IEEE Int. Symp. Circuits and Systems (ISCAS'98), vol.2, pp.453-457, 1998.
[5] M.D. Ercegovac and T. Lang, "Reducing transition counts in arithmetic circuits," in IEEE Symp. Low Power Electronics, pp.64-65, 1994.
[6] M.D. Ercegovac and T. Lang, "Low-power accumulator (correlator)," in IEEE Symp. Low Power Electronics, pp.30-31, Oct. 1995.
[7] C.A. Fabian and M.D. Ercegovac, "Input synchronization in low power CMOS arithmetic circuit design," in Proc. 30th Asilomar Conf. Signals, Systems and Computers, pp.172-176, Nov. 1996.
[8] Y. Wang and K.K. Parhi, "New low power adders based on new representations of carry signals," in Proc. 35th Asilomar Conf. Signals, Systems and Computers, Nov. 2000.
[9] Bishop, B.; Kelliher, T.P.; Irwin, M.J. "A detailed analysis of MediaBench," in IEEE Workshop on Signal Processing Systems (SiPS'99), pp.448-455, 1999.
[10] D. Brooks and M. Martonosi, "Value-based clock gating and operation packing: dynamic strategies for improving processor power and performance," ACM Trans. Computer Systems, vol.18, no.2, pp.89-126, May 2000.
[11] Stephenson, M.; Babb, J.; Amarasinghe, S. "Bitwidth analysis with application to silicon compilation," ACM SIGPLAN Notices, vol.35, no.5, pp.108-20, May 2000.
[12] Canal, R.; Gonzalez, A.; Smith, J.E. "Very low power pipelines using significance compression," in Proc. 33rd Annual IEEE/ACM Int. Symp. on Microarchitecture, pp.181-190, 2000.
[13] Gowan, M.K.; Biro, L.L.; Jackson, D.B. "Power considerations in the design of the Alpha 21264 microprocessor," in Proc. 35th Design and Automation Conf., pp.726-731, 1998.
[14] Wu Ye; Irwin, M.J. "Power analysis of gated pipeline registers," in 12th Annual IEEE Int. ASIC/SOC Conf., pp.281-285, 1999.
[15] M. Bhardwaj, R. Min, and A. Chandrakasan, "Power-aware systems," in Proc. 35th Asilomar Conf. Signals, Systems and Computers, vol.2, pp.1695-1701, Nov. 2000.
[16] J. Choi, J. Jeon, and K. Choi, "Power minimization of functional units by partially guarded computation," in Proc. Int. Symp. Low Power Electronics and Design, pp.131-136, Jul. 2000.
[17] J.L. Hennessy and D.A. Patterson, Computer Architecture: A Quantitative Approach, 2nd Edition, Morgan Kaufmann Publishers, Inc., 1996.
[18] Sun Microsystems, Shade User's Manual, 1993.
[19] Chunho Lee; Potkonjak, M.; Mangione-Smith, W.H. "MediaBench: a tool for evaluating and synthesizing multimedia and communications systems," in Proc. 30th Annual IEEE/ACM Int. Symp. Microarchitecture, pp.330-335, Dec. 1997.
