
Variable-latency Adder (VL-Adder): New Arithmetic Circuit Design Practice to Overcome NBTI

Yiran Chen
Seagate Technology 7801 Computer Ave. Bloomington, MN 55435 +1(952)402-7481

Hai Li
Seagate Technology 7801 Computer Ave Bloomington, MN 55435 +1(952)402-7493

Jing Li
Purdue University 465 Northwestern Ave West Lafayette, IN 47906 +1(765) 494-0759

Cheng-Kok Koh
Purdue University 465 Northwestern Ave West Lafayette, IN 47906 +1(765) 496-3683

yiran.chen@seagate.com

helen.li@seagate.com

Jingli@purdue.edu

chengkok@purdue.edu

ABSTRACT
Negative bias temperature instability (NBTI) has become a dominant reliability concern for nanoscale PMOS transistors. In this paper, we propose a variable-latency adder (VL-adder) technique for NBTI tolerance. By detecting circuit failures on the fly, the proposed VL-adder automatically shifts the data-capturing clock edge to tolerate NBTI-induced delay degradation on critical timing paths. The VL-adder operates with a fixed supply voltage and clock period, avoiding the high design and manufacturing costs incurred by existing NBTI-tolerant techniques. Compared to related low-power adder designs, the VL-adder technique provides better energy efficiency throughout the chip lifetime with very limited performance degradation (4.6% or less).

transistor [5], as these techniques incur extremely unbalanced power consumption profiles or switching activity distributions. A common practice to counter the effects of NBTI-induced transistor aging is to over-design: the circuit is analyzed at a design corner that denotes the maximum performance degradation of the transistors over the lifetime of the chip. This technique is called guardbanding [6]. Guardbanding can be very pessimistic and power-inefficient because (1) the profile of the parameters affecting NBTI (temperature, supply voltage and duty cycle of the input signal) can be very unbalanced and (2) NBTI-induced aging effects have statistical components due to process variations. To avoid these pitfalls, many adaptive NBTI-tolerant methodologies have been proposed. They include clock frequency tuning, adaptive body biasing and adaptive supply voltage. These techniques, however, all require complicated control circuitry, large power and area overheads, and significant additional manufacturing cost. To improve power efficiency, some NBTI sensors have been designed to guide these adaptive NBTI-tolerant methodologies [6]. In this paper, we propose an adder design concept named variable-latency adder (VL-adder) for NBTI tolerance. The VL-adder technique builds on the idea of differentiating operation latency in the Ripple-Carry Adder of [7] and the Cascaded Carry-Select Adder of [8]. For example, in a 32-bit unsigned Ripple-Carry Adder (RCA), the longest carry propagation delay occurs only when the carry-out signal (CO) of the least-significant bit (LSB) adder propagates through the most-significant bit (MSB) adder, e.g., A<31:0> = 0xFFFFFFFF and B<31:0> = 0x00000001 [7]. The occurrence probability of operands that result in the longest carry propagation delay is very low, i.e., $2^{32}/2^{64} \approx 2.3 \times 10^{-10}$ for random inputs. In [7], the authors used the input vectors to predict the longest possible carry propagation delay of the RCA.
Operations are classified as long- or short-latency ones. When a long-latency operation occurs, VDD is raised to a higher level to satisfy the timing requirement. In [8], the proposed Cascaded Carry-Select Adder (C2SA) can detect the operation latency on the fly. When a long-latency operation comes in, the capturing clock edge is shifted to catch the output correctly. The key to the VL-adder design is an operation latency detector with an adjustable latency-detection threshold: operations that are classified as short-latency at the beginning of the chip lifetime may be classified as long-latency towards the end of the chip lifetime. Compared to the traditional adaptive NBTI-tolerant techniques, the proposed VL-adders have three main advantages: 1) The working frequency and supply voltage are fixed throughout the lifetime of a chip;

Categories and Subject Descriptors


B.8.1 [Performance and Reliability]: Reliability, Testing, and Fault-Tolerance

General Terms: Performance, Design, Reliability

Keywords: Negative Bias Temperature Instability (NBTI), Variable-Latency adder (VL-adder)

1. INTRODUCTION
The continual scaling of semiconductor process technology [1] has caused variability and reliability issues to emerge as primary concerns in modern VLSI design. NBTI occurs under negative gate voltage (e.g., Vgs = -VDD) and is measured by the shift in threshold voltage (Vth). The increase in PMOS transistor threshold voltage over time degrades device drive current, extends circuit delay [2], and significantly reduces the lifetime of a chip [3]. The extent of the NBTI-induced Vth shift of a PMOS transistor depends heavily on its operating history: temperature, supply voltage and the duty cycle of the input signal (i.e., the portion of time when the PMOS transistor is on). Therefore, transistors at different locations on the same chip may suffer varying degrees of NBTI-induced delay degradation. This situation is exacerbated by modern power management techniques, e.g., clock-gating [4], sleep

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ISLPED'07, August 27-29, 2007, Portland, Oregon, USA. Copyright 2007 ACM 978-1-59593-709-4/07/0008...$5.00.


2) The NBTI-tolerance mechanism is automatic and local: it does not have the negative global effects that clock frequency tuning or supply voltage tuning may have; 3) The power overhead and area penalty are minimal compared to other tuning techniques that target transistor speed, e.g., body biasing. The remaining sections are organized as follows: Section 2 introduces the necessary background of the VL-adder design; Section 3 presents the details of the VL-adder, including its application to a 32-bit RCA and a 64-bit Carry-Select Adder (CSA); Section 4 provides experimental results and analysis; Section 5 concludes our work.

2.2 Cascaded CSA design


The concept of operation latency differentiation in [7] was extended to Carry-Select Adder (CSA) design in [8]. In an M-bit CSA design, the input bits are divided into $\sqrt{2M}$ stages (see Fig. 2(b) and the row labeled Std. CSA in Table 1). The number of input bits in each carry-select stage (CSS) increases linearly from the least significant stage to the most significant stage (Table 1).

2. Preliminary

2.1 Carry length detect circuit


Fig. 1 shows a general 32-bit unsigned RCA design. HA and FA denote the half-adder and full-adder, respectively. As noted in Section 1, the longest carry propagation occurs when the carry signal propagates from the LSB adder through the MSB adder. The maximum possible carry propagation length (CPL), which is the number of single-bit adders that the carry signal goes through, is 32. The CPL determines the operation latency of the RCA. The propagation of the carry signal through the kth adder in an M-bit RCA can be described by the logic $P_k = A_k \oplus B_k$ [9], $k = 0, 1, \ldots, M-1$, where $A_k$ and $B_k$ are the inputs of the kth adder. When $P_k = 0$ ($A_k = B_k$), the carry-out signal of the adder of bit k is determined only by $A_k$ and $B_k$, i.e., the carry propagation is killed at the kth adder. If A<31:0> = 0xFFFF7FFF and B<31:0> = 0x00000001, for example, the carry-out signal of the 15th adder is determined only by A15 and B15. Consequently, the longest CPL corresponds to the carry propagation from the 15th adder to the 31st adder, as shown in Fig. 1.
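The propagate logic and the CPL example above can be checked with a short script. This is only an illustrative sketch: the function names are hypothetical, and the counting convention (a chain restarts with length 1 at every bit where the propagate signal is 0) is an assumption chosen so that the result matches the values quoted for Fig. 1.

```python
def propagate_bits(a: int, b: int, width: int = 32) -> int:
    """Return a word whose bit k is the propagate signal P_k = A_k XOR B_k."""
    mask = (1 << width) - 1
    return (a ^ b) & mask


def longest_cpl(a: int, b: int, width: int = 32) -> int:
    """Longest carry propagation length, counted in single-bit adders.

    Assumed convention: a bit with P_k = 0 decides its own carry-out, so it
    starts a new chain of length 1; every following bit with P_k = 1 extends
    that chain by one adder.
    """
    p = propagate_bits(a, b, width)
    longest = 0
    run = 0
    for k in range(width):
        if (p >> k) & 1:
            run += 1   # carry passes through adder k
        else:
            run = 1    # carry propagation is killed; adder k starts a new chain
        longest = max(longest, run)
    return longest


if __name__ == "__main__":
    print(longest_cpl(0xFFFF7FFF, 0x00000001))  # example of Fig. 1
    print(longest_cpl(0xFFFFFFFF, 0x00000001))  # worst case of a 32-bit RCA
```

Under this convention the first call gives 17 and the second gives 32, matching the CPL values quoted for the 32-bit RCA.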
Fig. 2 C2SA and standard CSA: (a) C2SA, with the CLDC and the longest delays of long-latency and short-latency operations marked; (b) standard CSA, with its critical delay marked

Table 1. Carry-select stages in various CSA designs (number of input bits per stage; "/" marks a stage that is not present)
Stage      1   2   3   4   5   6   7   8   9   10  11  12  13
Std. CSA   2   2   3   4   6   7   8   9   11  12  /   /   /
C2SA       2   2   3   4   6   7   7   3   4   5   6   7   8
VL-C2SA    2   2   3   4   6   7   7   5   2   5   6   7   8

Fig. 1 Carry propagation length in a 32-bit RCA (maximum CPL of the 32-bit RCA = 32; longest CPL = 17 for A<31:0> = 0xFFFF7FFF and B<31:0> = 0x00000001; longest CPL of short-latency operations = 19 in [7])

In contrast, there are $2\sqrt{M}$ CSSs in the Cascaded CSA (C2SA) [8] (see Fig. 2(a) and the row labeled C2SA in Table 1). The $2\sqrt{M}$ CSSs in a C2SA are divided into two groups, each with $\sqrt{M}$ CSSs. In each group, starting with a small number of inputs in the least significant CSS, the number of input bits in each CSS increases linearly from the least significant CSS to the most significant CSS (see Table 1). As in [7], the long- and short-latency operations are differentiated by checking the carry propagation in a few CSSs in the middle of the C2SA (see Fig. 2(a)).

A carry length detection circuit (CLDC) was proposed in [8] to detect whether the carry propagation is killed among some $N_C$ consecutive bits starting from the Lth bit. The logic of the CLDC is:

$$P\langle L+N_C-1 : L \rangle = \prod_{k=L}^{L+N_C-1} P_k \qquad (1)$$

When P<L+NC-1:L> = 1, the operation is a long-latency one with a CPL that could reach M, the maximum possible CPL for an M-bit RCA. Otherwise, the operation is categorized as a short-latency one, as the carry propagation is killed by some bit(s) covered by the CLDC. For random inputs, the probability of a long-latency operation is $Pr_L = 1/2^{N_C}$. The maximum possible CPL of a short-latency operation is max(L+NC, M-L). A larger NC reduces the probability of long-latency operations and increases the maximum possible CPL of short-latency operations. Experimental results in [7] showed that, for power efficiency, the combination of L = 13 and NC = 6 gives the optimal configuration of the CLDC for a 32-bit RCA.
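As a rough behavioral illustration of Eqn. (1) and the quantities derived from it, the following sketch evaluates the CLDC output, the long-latency probability for random inputs, and the maximum CPL of short-latency operations. The function names are hypothetical and the code models only the logic, not the circuit of [7] or [8].

```python
def cldc(a: int, b: int, low: int, n_c: int) -> int:
    """P<low+n_c-1:low>: 1 only if all n_c propagate signals in the window are 1."""
    window_mask = (1 << n_c) - 1
    window = ((a ^ b) >> low) & window_mask   # P_k = A_k XOR B_k for the covered bits
    return int(window == window_mask)


def long_latency_probability(n_c: int) -> float:
    """Pr_L = 1 / 2^N_C for uniformly random operands."""
    return 1.0 / (1 << n_c)


def max_short_cpl(width: int, low: int, n_c: int) -> int:
    """Maximum possible CPL of a short-latency operation: max(L + N_C, M - L)."""
    return max(low + n_c, width - low)


if __name__ == "__main__":
    # Configuration reported as optimal for the 32-bit RCA: L = 13, N_C = 6.
    print(cldc(0xFFFFFFFF, 0x00000001, 13, 6))   # long-latency operation
    print(cldc(0xFFFF7FFF, 0x00000001, 13, 6))   # carry killed at bit 15
    print(long_latency_probability(6), max_short_cpl(32, 13, 6))
```

With L = 13 and NC = 6 the maximum short-latency CPL evaluates to 19, which agrees with the value annotated in Fig. 1.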

Compared to the standard CSA, the critical delay of long-latency operations of the C2SA is longer. However, the critical delay of short-latency operations, which occur more frequently, is shorter. The 64-bit C2SA of [8], for example, has a CLDC of logic P<37:31>. Long-latency operations occur with a probability of $Pr_L = 1/2^7 \approx 0.78\%$. The critical delays of short-latency operations and long-latency operations are Dsetup + 11Dcarry + Dsum and Dsetup + 15Dcarry + Dsum, respectively. Here Dsetup denotes the setup time of the CSA, i.e., the delay of creating the intermediate signals G (Generation) and P (Propagation) [9]. Dcarry and Dsum denote the delays of the carry generation circuit and the sum generation circuit, respectively. We assume that the delay of carry generation Dcarry equals the delay of the multiplexer (MUX) Dmux [9]. For comparison, the critical delay of the standard CSA equals Dsetup + 13Dcarry + Dsum.

2.3 Guardband violation sensor


Guardband is defined as the reserved time period between the last possible data switching and the capturing clock edge [6], as shown in Fig. 3. A guardband violation occurs when the data switches within the guardband (i.e., Output B in Fig. 3). When guardband violations occur due to NBTI, adaptive techniques such as clock frequency tuning can be applied to fix the violation.

design overhead. The relationships between Vcrt(t) and t for the 32-bit RCA of [7] with PTM 70nm technology at Vdes = 1.0V are shown in Fig. 6. Here, Vdes is selected such that short-latency operations are still timing-correct at the end of the 7-year chip lifetime. When the duty cycle increases, the gap between Vcrt(t) and Vdes increases; for a duty cycle of 0.75, this gap can be more than 10% of Vdes.
Fig. 3 Guardband violation

Fig. 5 NBTI-induced RCA delay degradation (delay degradation in % versus operation time in s, for duty cycles of 0.25, 0.5 and 0.75; the 7-year lifetime is marked)

Fig. 4 shows a guardband violation sensor proposed in [6]. This sensor includes three components, namely, a stability checker (comparator), a delay element, and an output latch. By comparing the signal at the beginning of the guardband with the signal at the capturing clock edge, any occurrence of signal switching within the guardband can be detected. The area and power overheads incurred by this guardband violation sensor are negligible, as the delay element and the output latch can be shared by multiple sensors, and guardband violation detection is executed quite infrequently, e.g., at an interval of 15 days.

Building on the concept of operation-latency differentiation in [7][8], we propose a variable-latency adder (VL-adder) technique to overcome the NBTI effect. In the proposed VL-adder, the detection threshold of long-latency operations can be adaptively adjusted to account for NBTI-degraded delay; operations with an NBTI-degraded delay that is longer than one clock cycle are automatically categorized as long-latency operations and are executed within two clock cycles. Consequently, a VL-adder can continuously work at a low supply voltage level, eliminating the need for clock period tuning or supply voltage tuning. The technique requires a guardband violation sensor that can predict NBTI-induced timing violations and generate the signal that adjusts the detection threshold.
Fig. 4 Guardband violation sensor with flip-flop
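The stability check performed by the sensor of [6] can be pictured with a purely behavioral sketch: sample the output at the beginning of the guardband and again at the capturing clock edge, and flag a violation when the two samples differ. The waveform representation (an initial value plus a sorted list of toggle times), the function names, and the 0.1 ns guardband used in the demonstration are all assumptions made for illustration; they are not taken from [6].

```python
from bisect import bisect_right


def value_at(initial: int, toggle_times: list, t: float) -> int:
    """Logic value at time t of a signal that starts at `initial` and flips
    at every instant listed in the sorted list `toggle_times`."""
    toggles_so_far = bisect_right(toggle_times, t)
    return initial ^ (toggles_so_far & 1)


def guardband_violation(initial: int, toggle_times: list,
                        t_clk: float, guardband: float) -> bool:
    """Compare the sample taken at the start of the guardband with the sample
    taken at the capturing clock edge; a difference means the data switched
    inside the guardband."""
    at_guardband_start = value_at(initial, toggle_times, t_clk - guardband)
    at_clock_edge = value_at(initial, toggle_times, t_clk)
    return at_guardband_start != at_clock_edge


if __name__ == "__main__":
    T_CLK = 1.215      # ns
    GUARDBAND = 0.1    # ns, hypothetical value
    output_a = [0.40, 1.00]   # settles before the guardband begins
    output_b = [0.40, 1.18]   # still switching inside the guardband
    print(guardband_violation(0, output_a, T_CLK, GUARDBAND))
    print(guardband_violation(0, output_b, T_CLK, GUARDBAND))
```

A two-sample comparison of this kind cannot see an even number of toggles inside the guardband; the sketch follows the two-point comparison described in the text rather than a continuous monitor.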

3. Variable-Latency Adder (VL-adder)


Fig. 5 shows the normalized NBTI-induced delay degradation of the critical timing path in a 32-bit RCA for duty cycles of 0.25, 0.5 and 0.75 (the duty cycle is the portion of time when the PMOS transistor is on). The half and full adders are designed in complementary CMOS logic [9] with PTM 70nm technology [11]. We used the NBTI model proposed in [10] and assumed a working temperature of 125 °C. The supply voltage is set to 1.0V and the simulations are conducted with HSPICE. We use Tdelay_min to denote the delay of the critical timing path at the beginning of the chip lifetime (i.e., with no NBTI degradation). To guarantee correct timing during the whole chip lifetime, e.g., 7 years, the clock period Tclk (without considering the setup time of the flip-flop) must be at least 8.66% (duty cycle of 0.25) to 13.52% (duty cycle of 0.75) longer than Tdelay_min. At design time, a conservative supply voltage Vdes is selected to guarantee correct timing throughout the chip lifetime. The critical supply voltage Vcrt(t), which is the lowest supply voltage that ensures correct timing of the chip at time t, approaches Vdes as the chip operation time t increases. Here the clock period is selected to ensure timing correctness throughout the 7-year chip lifetime. The gap between Vcrt(t) and Vdes is considered as an over-

Fig. 6 Relation between critical supply voltage and chip operation time (critical supply voltage in V versus operation time in s)

3.1 VL-adder practice: RCA


Our proposed 32-bit RCA-based VL-adder (VL-RCA) is shown in Fig. 7. When the guardband violation sensor at the output flip-flops catches a guardband violation, the signal TH_ADJ switches to high and the value is stored in a non-volatile memory cell (MEM). The detection threshold of long-latency operations is then adjusted by changing the CLDC logic, for example, from P<18:13> to P<17:14> (see Eqn. (1)).
Fig. 7 RCA-based 32-bit VL-adder


Fig. 8 shows an example of a new adjustable CLDC circuit design for our 32-bit RCA-based VL-adder. A multiplexer is controlled by the signal TH_ADJ to choose between two logic functions, P<18:13> and P<17:14>. When a long-latency operation is detected, a clock-gating signal GATING is generated to shift the capturing clock edge to the end of the next cycle.

supply voltage Vcrt(0) at the beginning of the chip lifetime is only 0.91V for a duty cycle of 0.75. Simulation results show that at 0.94V, a 32-bit RCA design with CLDC logic P<17:14> would still be timing-correct at the end of the chip lifetime. In other words, for a 32-bit VL-RCA with PTM 70nm technology, a lower supply voltage of 0.94V, denoted $V_{VL,des}$, is sufficient to ensure timing correctness throughout the lifetime of the chip. The timing correctness of long-latency operations requires that the latency of a long-latency operation always be shorter than two clock cycles [8]. In Section 4 we show that this requirement is always met in our VL-adder designs, both before and after the detection threshold of long-latency operations changes. Here, we extend the VL-adder concept to the Cascaded CSA (C2SA) design to tolerate NBTI-induced circuit delay degradation. Fig. 10 shows the structure of our proposed 64-bit C2SA-based VL-adder (VL-C2SA). The core adder of a VL-C2SA is a C2SA with two modified carry-select stages covered by the CLDC: the bit widths of the modified carry-select stages 8 and 9 are five and two, respectively, as shown in the row labeled VL-C2SA in Table 1.

Fig. 8 Adjustable CLDC for RCA-based VL-adder
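The selection performed by the adjustable CLDC can be modeled at the logic level as below. This is a minimal sketch with hypothetical function names: it reproduces only the Boolean behavior described for the 32-bit VL-RCA (TH_ADJ low selects P<18:13>, TH_ADJ high selects P<17:14>) and ignores the flip-flop and clock-gating circuitry.

```python
def p_window(a: int, b: int, high: int, low: int) -> int:
    """P<high:low>: AND of the propagate signals P_k = A_k XOR B_k, k = low..high."""
    n_bits = high - low + 1
    mask = (1 << n_bits) - 1
    return int((((a ^ b) >> low) & mask) == mask)


def gating(a: int, b: int, th_adj: int) -> int:
    """GATING request: 1 means the operation is treated as long-latency and
    captured one cycle later. TH_ADJ acts as the multiplexer select."""
    if th_adj:
        return p_window(a, b, 17, 14)   # adjusted threshold
    return p_window(a, b, 18, 13)       # original threshold


if __name__ == "__main__":
    # Operands whose propagate signals are 1 exactly on bits 14..17.
    a, b = 0x0003C000, 0x00000000
    print(gating(a, b, th_adj=0))   # short-latency under the original threshold
    print(gating(a, b, th_adj=1))   # re-categorized as long-latency after adjustment
```

The example operands show the intended effect: an operation that is short-latency under P<18:13> becomes a long-latency one once TH_ADJ selects the narrower window P<17:14>.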

3.2 VL-adder practice: CSA

Recall that reducing the number of input bits covered by the CLDC shortens the critical delay of short-latency operations but increases the occurrence frequency of long-latency operations. The timing diagrams before and after changing the detection threshold of long-latency operations are shown in Fig. 9. The output of an operation that fails with the original detection threshold is successfully captured by the shifted clock edge with the new detection threshold. For example, after adjusting the detection threshold by changing the CLDC logic from P<18:13> to P<17:14>, the longest delay of short-latency operations reduces from 19Dcarry to 18Dcarry, allowing some tolerance for the NBTI-induced circuit delay. (As defined in Section 2.2, Dcarry is the carry propagation delay of one single-bit adder.) The probability that a long-latency operation occurs increases from $1/2^6$ to $1/2^4$.

Fig. 9 Timing diagrams of VL-adder before and after detection threshold of long-latency operation changes

Fig. 10 Proposed 64-bit C2SA-based VL-adder (carry-select stage widths of the core C2SA: 2, 2, 3, 4, 6, 7, 7, 5, 2, 5, 6, 7, 8)

The proposed C2SA-based VL-adder (or VL-C2SA) in Fig. 10 keeps the same longest delays of long- and short-latency operations as the 64-bit C2SA of [8]. Originally, the threshold of long-latency operations is determined by the logic P<37:31>. As mentioned in Section 2.2, the longest delay of short-latency operations is Dsetup + 11Dcarry + Dsum and the longest delay of long-latency operations is Dsetup + 15Dcarry + Dsum under the assumption that Dcarry = Dmux. When the guardband violation sensor detects a violation, the signal TH_ADJ is generated to adjust the detection threshold of long-latency operations by changing the CLDC logic from P<37:31> to P<35:31>. As a result, the occurrence probability of long-latency operations $Pr_L$ increases from $1/2^7 \approx 0.78\%$ to $1/2^5 \approx 3.13\%$. The new critical path of short-latency operations runs from the inputs of carry-select stage 1 to the outputs of carry-select stage 8. Hence, the longest delay of short-latency operations changes from Dsetup + 11Dcarry + Dsum to Dsetup + 10Dcarry + Dsum accordingly. The adjustable CLDC of the C2SA-based VL-adder is shown in Fig. 11. The logic functions P<37:31> and P<35:31> are selected by the control signal TH_ADJ, according to the detection result of the guardband violation sensors. We note that the multiplexer MUX is between NAND3 and OR2. As its operation overlaps with that of OR1, no performance penalty is incurred. As the critical timing path of the short-latency operations of our proposed VL-C2SA under the original detection threshold of long-latency operations is from the inputs of carry-select group 1 through

As a long-latency operation requires two clock cycles to complete [8], the increase in long-latency operations introduces a performance overhead relative to the original RCA design proposed in [7]. In our 32-bit RCA-based VL-adder design, this extra performance overhead is $(1/2^4 - 1/2^6)/(1 + 1/2^6) \approx 4.6\%$. Since the critical timing path under the original detection threshold of long-latency operations can only be from bit 0 to bit 18 or from bit 13 to bit 31, two guardband violation sensors are needed, at bit 31 and bit 18. As mentioned in [6], the sensors are activated very infrequently. Hence, the power overhead is negligible. Interestingly, VL-RCA may present an opportunity to reduce power. Recall that Vdes, the supply voltage of a circuit, has to be selected to ensure timing correctness of the circuit throughout its lifetime. Consider the 32-bit RCA design (with CLDC logic P<18:13>) presented in [7]. Suppose that Vdes = 1.0V allows the short-latency operations to be timing-correct throughout the lifetime of the chip with PTM 70nm technology. The critical


the outputs of carry-select stage 9, only two guardband violation sensors are required at the two output bits of carry-select stage 9.

Fig. 14 shows the comparison between the normalized PDP (power-delay product) of our 32-bit VL-RCA and that of the 32-bit RCA of [7] over the whole 7-year chip operation time. We define the average adder delay that accounts for variable adder latency as Tclk(1+PrL), where PrL is the occurrence probability of long-latency operations. To be conservative, we assume that all adders work with a duty cycle of 0.75. A total of $2^{14} = 16384$ random inputs are simulated. The power dissipation of the CLDC adjustment circuitry of VL-RCA is accounted for in the plot for VL-RCA.

Fig. 11 Adjustable CLDC for C2SA-based VL-adder

4. Experimental Result and Analysis


In our experiments, we created the SPICE netlists of a 32-bit VL-RCA and a 64-bit VL-C2SA with PTM 70nm technology [11]. For comparison, the SPICE netlists of the 32-bit RCA (with CLDC) of [7] and the 64-bit C2SA of [8] were also created with the same technology. The PMOS transistor threshold voltage Vth at the beginning of the chip lifetime is set to 0.21V and the gate oxide thickness Tox is set to 1.7nm. The original supply voltage of the 32-bit RCA of [7] and the 64-bit C2SA of [8] is set to 1.0V. The working temperature is set to 125 °C.

Fig. 13 Time-varying longest delays of short-latency operations of 32-bit VL-RCA (versus operation time in s; Tclk and the turning point are marked)

4.1 VL-RCA vs. RCA with CLDC


The relationship between the longest delays of long- and short-latency operations of the 32-bit RCA of [7] and the chip operation time is shown in Fig. 12. Duty cycles of 0.25, 0.5 and 0.75 are all simulated. To ensure the correct timing of the 32-bit RCA of [7] for its 7-year lifetime under a supply voltage of 1.0V and a duty cycle of 0.75, the clock cycle Tclk must be set to 1.215ns.

Adjusting the detection threshold of long-latency operations results in 4.6% degradation in clock-cycle-based performance (see Section 3.1) when compared to the RCA of [7]. However, Fig. 14 shows that by working at a lower supply voltage, VL-RCA always provides higher energy-efficiency (about 10% less PDP) than the RCA of [7] does, throughout the entire 7-year chip lifetime.
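The clock-cycle-based overhead follows directly from the average-delay definition Tclk(1+PrL): it is the relative increase of that average when PrL grows. The sketch below, with hypothetical function names, evaluates it for the threshold changes described in this paper (P<18:13> to P<17:14> for the 32-bit VL-RCA, and P<37:31> to P<35:31> for the 64-bit VL-C2SA).

```python
def average_delay(t_clk: float, pr_long: float) -> float:
    """Average adder delay with variable latency: Tclk * (1 + Pr_L)."""
    return t_clk * (1.0 + pr_long)


def performance_overhead(pr_long_old: float, pr_long_new: float) -> float:
    """Relative increase of the average delay when Pr_L changes:
    (Pr_new - Pr_old) / (1 + Pr_old). Tclk cancels out."""
    return (pr_long_new - pr_long_old) / (1.0 + pr_long_old)


if __name__ == "__main__":
    rca = performance_overhead(1 / 2**6, 1 / 2**4)    # 32-bit VL-RCA
    c2sa = performance_overhead(1 / 2**7, 1 / 2**5)   # 64-bit VL-C2SA
    print(f"VL-RCA overhead:  {rca:.2%}")
    print(f"VL-C2SA overhead: {c2sa:.2%}")
```

Because Tclk is fixed in a VL-adder, the overhead depends only on the two long-latency probabilities, which is why no delay value appears in the formula.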
Fig. 14 PDPs of 32-bit RCA-based VL-adder and 32-bit RCA of [7] (normalized PDP versus operation time in s)

Fig. 12 Time-varying longest delays of long- and short-latency operations of the 32-bit RCA of [7] (longest delays in ns versus operation time in s)

When a guardband violation is detected, our proposed 32-bit RCA-based VL-adder (VL-RCA) adjusts the detection threshold of long-latency operations by using the CLDC logic function P<17:14> (see Section 3.1). To ensure that all short-latency operations of our 32-bit VL-RCA can complete within Tclk = 1.215ns throughout the 7-year chip lifetime, the supply voltage must be at least 0.94V. Under a supply voltage of 0.94V, the relation between the longest delay of the short-latency operations of a 32-bit VL-RCA and the chip operational lifetime is shown in Fig. 13. For a duty cycle of 0.75, the detection threshold of long-latency operations of VL-RCA has to be adjusted after the chip has been working for around $2 \times 10^7$ s (or around 8 months). With a duty cycle of 0.5 or 0.25, the detection threshold has to be adjusted after the chip has been in operation for $6 \times 10^7$ s or $2 \times 10^8$ s, respectively.

Table 2 shows the transistor counts of different components of the RCA of [7] and VL-RCA. Compared to the 32-bit RCA of [7], VL-RCA introduced 7.80% area overhead (in terms of transistor counts). Because of the infrequent activation of the guardband violation sensor, the power overhead of the sensor is negligible [6].
Table 2. Transistor counts of different circuitries ("-" marks a component that is not present)
Design     Core adder   Sensor   CLDC
RCA        896          -        40
VL-RCA     896          61       52
C2SA       5304         -        42
VL-C2SA    5304         61       54
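The area overheads quoted in this section can be recomputed from the transistor counts of Table 2. The sketch below uses hypothetical names and assumes that the baseline RCA and C2SA contain no guardband violation sensor, which is how the table is read here.

```python
# Transistor counts per design: (core adder, sensor, CLDC).
# A sensor count of 0 for the baseline designs is an assumed reading of Table 2.
TRANSISTORS = {
    "RCA": (896, 0, 40),
    "VL-RCA": (896, 61, 52),
    "C2SA": (5304, 0, 42),
    "VL-C2SA": (5304, 61, 54),
}


def total_transistors(design: str) -> int:
    """Sum of core adder, sensor and CLDC transistors for one design."""
    return sum(TRANSISTORS[design])


def area_overhead(baseline: str, proposed: str) -> float:
    """Relative increase in transistor count of `proposed` over `baseline`."""
    base = total_transistors(baseline)
    return (total_transistors(proposed) - base) / base


if __name__ == "__main__":
    print(f"VL-RCA vs RCA:   {area_overhead('RCA', 'VL-RCA'):.2%}")
    print(f"VL-C2SA vs C2SA: {area_overhead('C2SA', 'VL-C2SA'):.2%}")
```

Both designs add the same 73 transistors; the percentage is smaller for the VL-C2SA only because its core adder is much larger.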

4.2 VL-C2SA vs. C2SA


Similarly, the longest delays of long- and short-latency operations of the C2SA of [8] with various duty cycles over the lifetime of the design are shown in Fig. 15. The required clock period for a 7-year lifetime and a duty cycle of 0.75 is 0.617ns. When a guardband violation happens, the detection threshold of long-latency operations of VL-C2SA is adjusted by changing the CLDC logic from P<37:31> to P<35:31>. The minimal supply voltage level to ensure a 7-year lifetime of VL-C2SA is 0.91V. At that voltage, however, for a duty cycle of 0.75 the detection threshold of long-latency operations has to be adjusted after the chip has been in operation for only around $1 \times 10^6$ s (or 12 days). Therefore, we choose a higher supply voltage of 0.94V for the proposed VL-C2SA. The degradation of the longest delay of the short-latency operations of a VL-C2SA is shown in Fig. 16. For a duty cycle of 0.75, the detection threshold of long-latency operations has to be adjusted after about 8 months ($2 \times 10^7$ s) of operation. In practice, designers can increase the supply voltage (and hence, power) to delay the adjustment of the detection threshold of long-latency operations.

performance by $(1/2^5 - 1/2^7)/(1 + 1/2^7) \approx 2.33\%$ when compared to the 64-bit C2SA of [8]. Nonetheless, the proposed VL-C2SA is still more energy-efficient over the entire 7-year chip lifetime. The transistor counts of every component of the proposed VL-C2SA design and the C2SA of [8] are also shown in Table 2. Owing to the size of the core adder, the area overhead incurred by the VL-C2SA design is only 1.37%.

5. Conclusion
In this paper, we present a new adder design concept called the Variable-Latency Adder (VL-adder) for NBTI tolerance. The operations of the adder are differentiated according to their latencies: short-latency and long-latency. When a long-latency operation occurs, the data-capturing clock edge is shifted by one more cycle to allow more computation time (and to latch the output data correctly). The detection threshold of long-latency operations can be dynamically adjusted in a VL-adder. If the delay of an originally short-latency operation exceeds one clock cycle due to NBTI degradation, this operation is re-categorized as a long-latency one by adjusting the detection threshold, eliminating the need for supply voltage tuning or clock frequency tuning. While providing better energy efficiency throughout the chip lifetime, the proposed VL-adder design incurs minimal area and performance penalties.

Fig. 15 Time-varying longest delays of long- and short-latency operations of the 64-bit C2SA of [8] (longest latencies in ns)

6. Reference
[1] International Technology Roadmap for Semiconductors, 2005.
[2] N. Kimizuka, et al., "The impact of bias temperature instability for direct-tunneling ultra-thin gate oxide on MOSFET scaling," VLSI Symp. on Tech., 1999, pp. 73-74.
[3] V. Reddy, et al., "Impact of Negative Bias Temperature Instability on Digital Circuit Reliability," International Reliability Physics Symposium, 2002, pp. 248-254.
[4] H. Li, et al., "Deterministic clock gating for microprocessor power reduction," the 9th Int'l Symp. on High Performance Computer Arch., Feb. 2003, pp. 113-124.
[5] A. Agarwal, et al., "A Single-Vt Low-Leakage Gated-Ground Cache for Deep Submicron," IEEE Jour. of Solid-State Circuits, Vol. 38-2, pp. 319-328, Feb. 2003.

The normalized PDP of our 64-bit VL-C2SA and that of the 64-bit C2SA of [8] over a 7-year chip lifetime are shown in Fig. 17. Again, a total of $2^{14} = 16384$ random inputs are simulated. The power dissipation of the CLDC adjustment circuitry of VL-C2SA has also been accounted for.
Fig. 16 Time-varying longest delay of short-latency operations of 64-bit VL-C2SA (longest delay in ns versus operation time in s, for duty cycles of 0.25, 0.5 and 0.75; Tclk and the turning point are marked)



[6] M. Agarwal, et al., "On-line Failure Prediction and Its Application to Transistor Aging," ACM/IEEE Int'l Workshop on Timing Issues in the Specification and Synthesis of Digital Systems (TAU), 2007.
[7] H. Suzuki, et al., "Low Power Adder with Adaptive Supply Voltage," the 21st Int'l Conf. on Computer Design, San Jose, Oct. 2003, pp. 103-106.
[8] Y. Chen, et al., "Cascaded Carry-Select Adder (C2SA): A New Structure for Low-Power CSA Design," Int'l Symp. on Low Power Electronics Design, 2005, pp. 115-118.
[9] J. M. Rabaey, Digital Integrated Circuits: A Design Perspective, Englewood Cliffs, NJ: Prentice Hall, 1996.

Fig. 17 PDPs of 64-bit VL-C2SA and 64-bit C2SA of [8] (normalized PDP versus operation time in s)

[10] K. Kang, et al., "Efficient Transistor-Level Sizing Technique under Temporal Performance Degradation due to NBTI," IEEE International Conference on Computer Design, 2006, pp. 216-221.
[11] Predictive Technology Model, http://www.eas.asu.edu/~ptm/

As mentioned in Section 3.2, adjusting the detection threshold of long-latency operations degrades the clock-cycle-based

