Fig 7: High VDD for critical paths and low VDD for non-critical paths.
2) Clustered Voltage Scaling (CVS)
This is a technique to reduce power without degrading circuit performance by using two supply voltages. Gates on the critical path run at the higher supply to preserve speed, while gates off the critical path run at the lower supply to reduce power, as shown in Fig. 8. To minimize the number of interfacing level converters needed, the circuits that operate at the reduced voltage are clustered together, hence the name clustered voltage scaling.
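The power benefit of CVS can be illustrated with a toy model. The sketch below assumes a simple per-gate dynamic power model P ≈ C·V²·f; the supply voltages, capacitance, frequency, and the 20%/80% critical/non-critical split are all illustrative assumptions, not figures from the text.

```python
# Toy model of Clustered Voltage Scaling (CVS): gates on the critical
# path stay at the high supply (VDDH) while off-critical gates are
# clustered at a reduced supply (VDDL). All parameter values below are
# illustrative assumptions.

VDDH, VDDL = 1.2, 0.9   # supply voltages in volts (assumed)
C, F = 1e-15, 1e9       # switched capacitance (F) and clock rate (Hz)

def gate_power(vdd):
    """Dynamic switching power of one gate at supply vdd (P ~ C*V^2*f)."""
    return C * vdd**2 * F

def single_vdd_power(n_gates):
    """Baseline: every gate runs at VDDH."""
    return n_gates * gate_power(VDDH)

def cvs_power(n_critical, n_noncritical):
    """CVS: critical cluster at VDDH, off-critical cluster at VDDL."""
    return n_critical * gate_power(VDDH) + n_noncritical * gate_power(VDDL)

baseline = single_vdd_power(1000)
cvs = cvs_power(200, 800)   # assume 20% of gates are on the critical path
print(f"savings: {100 * (1 - cvs / baseline):.1f}%")   # prints "savings: 35.0%"
```

With these assumed numbers, moving 80% of the gates to 0.9 V cuts dynamic power by 35% while the critical path, still at 1.2 V, keeps its original delay.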
Razor approach
Razor, a new approach to DVS, is based on dynamic detection and correction of speed
path failures in digital designs. Its key idea is to tune the supply voltage by monitoring
the error rate during operation. Because this error detection provides in-place monitoring
of the actual circuit delay, it accounts for both global and local delay variations and
doesn’t suffer from voltage scaling disparities. It therefore eliminates the need for voltage
margins to ensure always-correct circuit operation in traditional designs. In addition, a
key Razor feature is that operation at subcritical supply voltages doesn’t constitute a
catastrophic failure but instead represents a trade-off between the power penalty incurred
from error correction and the additional power savings obtained from operating at a lower
supply voltage. The technology rests on the improbability of the worst-case conditions that drive traditional circuit design: because worst-case operating conditions rarely occur, an opportunity exists to reduce the voltage to a level commensurate with typical operating conditions. The processor pipeline must, however, incorporate a timing-error detection and
recovery mechanism to handle the rare cases that require a higher voltage. In addition, the system
must include a voltage control system capable of responding to the operating condition changes,
such as temperature, that might require higher or lower voltages.
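The voltage control loop described above can be sketched as a simple feedback rule: lower the supply while the observed error rate stays below a target, raise it when errors become too frequent. This is a minimal sketch; the target rate, step size, and voltage bounds are hypothetical parameters, not values from the Razor design.

```python
# Sketch of a Razor-style error-rate-driven voltage controller.
# All parameters are illustrative assumptions.

TARGET_ERROR_RATE = 0.01   # acceptable fraction of erroneous cycles (assumed)
V_STEP = 0.025             # voltage adjustment step in volts (assumed)
V_MIN, V_MAX = 0.6, 1.2    # allowed supply range in volts (assumed)

def adjust_vdd(vdd, error_rate):
    """One control step: scale VDD down when errors are rare, up otherwise."""
    if error_rate < TARGET_ERROR_RATE:
        vdd -= V_STEP          # headroom available: trade it for power savings
    elif error_rate > TARGET_ERROR_RATE:
        vdd += V_STEP          # too many recovery cycles: restore margin
    return min(max(vdd, V_MIN), V_MAX)

vdd = 1.2
for rate in [0.0, 0.0, 0.005, 0.02, 0.0]:   # sampled error rates (assumed)
    vdd = adjust_vdd(vdd, rate)
print(f"{vdd:.3f}")   # prints "1.125"
```

Note how the supply creeps down during error-free operation and steps back up after the 2% error-rate sample, settling around the point where the error-correction penalty balances the voltage savings.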
Fig 9: Pipeline stage augmented with razor latches and control lines.
To guarantee that the shadow latch will always latch the input
data correctly, designers constrain the allowable operating voltage so
that under worst-case conditions the logic delay doesn’t exceed the
shadow latch’s setup time. If an error occurs in stage L1 during a
particular clock cycle, the data in L2 during the following clock cycle is
incorrect and must be flushed from the pipeline. However, because the
shadow latch contains the correct output data from stage L1, the
instruction needn’t re-execute through this failing stage. Thus, a key
Razor feature is that if an instruction fails in a particular pipeline stage,
it re-executes through the following pipeline stage while incurring a
one-cycle penalty. The proposed approach therefore guarantees a
failing instruction’s forward progress, which is essential to avoid
perpetual failure of an instruction at a particular pipeline stage.
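The shadow-latch comparison at a single stage can be sketched as follows. The function models only the logical behavior (compare the two samples, forward the shadow value on mismatch); the function name and the example values are illustrative, not from the Razor circuit.

```python
# Minimal sketch of Razor double sampling at one pipeline stage: the
# main flip-flop captures the combinational output at the aggressive
# clock edge, and the shadow latch captures it again after a delayed
# edge that is guaranteed to meet timing under worst-case conditions.
# A mismatch flags a timing error; the shadow value is the known-good
# result used for recovery.

def razor_stage(value_at_main_edge, value_at_shadow_edge):
    """Return (result, error): the shadow value is authoritative on mismatch."""
    error = value_at_main_edge != value_at_shadow_edge
    # On error the correct shadow data is forwarded to the next stage,
    # costing one recovery cycle but never requiring re-execution here.
    return value_at_shadow_edge, error

# Fast path: logic settled before the main clock edge, no error.
print(razor_stage(0x2A, 0x2A))   # prints "(42, False)"
# Slow path: the main edge sampled a not-yet-settled value.
print(razor_stage(0x13, 0x2A))   # prints "(42, True)"
```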
Fig 11: Pipeline recovery using global clock gating. (a) Pipeline organization and (b) pipeline timing for an
error occurring in the Execute (EX) stage. Asterisks denote a failing stage computation. IF = Instruction
Fetch; ID = Instruction Decode; MEM = Memory; WB = Write Back.
Fig 12: Pipeline recovery using counterflow pipelining. (a) Pipeline organization and (b) pipeline timing
for an error occurring in the Execute (EX) stage. Asterisks denote a failing stage computation. IF =
Instruction Fetch; ID = Instruction Decode; MEM = Memory; WB = Write Back.
Fig. 12 illustrates this approach, which places negligible timing constraints on the
baseline pipeline design at the expense of extending pipeline recovery over a few cycles.
When a Razor flip-flop generates an error signal, pipeline recovery logic must take two
specific actions. First, it generates a bubble signal to nullify the computation in the
following stage. This signal indicates to the next and subsequent stages that the pipeline
slot is empty. Second, recovery logic triggers the flush train by asserting the ID of the
stage generating the error signal. In the following cycle, the Razor flip-flop injects the
correct value from the shadow latch data back into the pipeline, allowing the errant
instruction to continue with its correct inputs. Additionally, the flush train begins
propagating the failing stage’s ID in the direction opposite to instruction flow. At each stage the active flush train visits, a bubble replaces the contents of the pipeline slot. When the flush ID
reaches the start of the pipeline, the flush control logic restarts the pipeline at the
instruction following the failing instruction. In the event that multiple stages generate
error signals in the same cycle, all the stages will initiate recovery, but only the failing
instruction closest to the end of the pipeline will complete. Later recovery sequences will
flush earlier ones.
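The end state of a counterflow recovery can be sketched with a simple model. This deliberately ignores the cycle-by-cycle propagation of the flush train and shows only the pipeline contents after recovery completes; the stage names follow the figure captions, while the instruction labels and function name are illustrative.

```python
# Illustrative end-state model of counterflow pipeline recovery: the
# instructions behind the failing one are flushed to bubbles, while the
# failing instruction continues with the corrected shadow-latch data.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def recover(pipeline, failing_stage):
    """Pipeline contents after the backward-moving flush train completes.

    pipeline maps stage name -> instruction occupying that stage.
    """
    idx = STAGES.index(failing_stage)
    recovered = dict(pipeline)
    # The flush train travels opposite to instruction flow, bubbling
    # every slot it visits up to (but not including) the failing stage.
    for stage in STAGES[:idx]:
        recovered[stage] = "bubble"
    return recovered

state = {"IF": "i4", "ID": "i3", "EX": "i2", "MEM": "i1", "WB": "i0"}
print(recover(state, "EX"))
# i2 restarts with correct inputs; i3 and i4 are refetched from the front.
```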
Fig 14: Simplified functional block diagram of in situ timing fault detection
and correction for circuit-level timing speculation with double sampling
Fig. 14 shows circuit-level timing speculation (i.e., razor) in a simplified functional block diagram. An aggressive clock period ϕ, which is shorter than the conservative period T of conventional synchronous designs (i.e., T ≥ critical path delay at the WC), is applied. A shadow register is implemented beside the main output register to sample the datapath again with the clock signal skewed by δ (i.e., δ = T − ϕ, for the longest delay T). In other words, the datapath output is
sampled twice (i.e., double sampling). The conservatively sampled result in the shadow register is
for confirming the aggressively sampled (i.e., timing speculated) result in the main register. If the
two results match, the timing speculation succeeds and no further action would be necessary.
Otherwise, the mis-speculated computation based on the inaccurate result (from the main register)
would be discarded, and the accurate result in the shadow register (i.e., for confirming
speculation) would be loaded back to the main register to restart the computation in the next
clock cycle. In this design, long-latency data patterns cause misspeculation for aggressive timing
and require an additional cycle. The behavior is similar to that of the variable-latency designs
described in Section I. Circuit-level timing speculation adapts the datapath latency with actual
execution, which tolerates PVT-dependent delay variations in addition to data-dependent
variations. However, it requires additional buffer insertion to prevent the succeeding input data from racing into the shadow register during the delayed sampling window δ. From another viewpoint, the allowable delay of the datapath is shifted from 0 ∼ ϕ (i.e., conventional
synchronous design) to δ ∼ (ϕ + δ) with buffer insertion and timing relaxation (either from
voltage scaling at run time or timing constraint relaxation during logic synthesis). The inserted
buffers can overwhelm the area and energy conservation from timing constraint relaxation,
particularly for the considerable PVTD variations in advanced process technologies. For example,
the silicon area of a 32-bit multiplier drops from 10 765 to 5389 μm2 in the 40-nm CMOS
technology as the timing constraints for logic synthesis are relaxed from the tightest 1.236 to ∼2
ns at the WC. However, the area of a razor-enabled multiplier stops decreasing at ∼20% relaxed
timing because of buffer insertion and even grows considerably larger (18 272 μm2) for a 50%
timing relaxation, as shown in Fig. 15. Although the overheads can be mitigated through
speculation on selective paths only, the short path problem (i.e., hold-time fixing) remains
challenging in advanced technologies exhibiting considerable variations.
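The timing windows described above can be made concrete with a small numerical sketch. The symbols follow the text (aggressive period ϕ, conservative period T, skew δ = T − ϕ); the specific nanosecond values and delay samples are illustrative assumptions.

```python
# Numerical sketch of double-sampling timing speculation: an operation
# with actual delay d completes in one cycle when d <= phi, and needs
# one extra recovery cycle when phi < d <= phi + delta (= T). Values
# below are illustrative assumptions.

T = 2.0            # conservative clock period in ns (assumed, >= WC delay)
PHI = 1.4          # aggressive clock period in ns (assumed)
DELTA = T - PHI    # shadow-clock skew delta

def cycles(d):
    """Cycles consumed by a datapath computation with actual delay d (ns)."""
    if d <= PHI:
        return 1                # speculation succeeds
    if d <= PHI + DELTA:        # i.e., d <= T: the shadow result is valid
        return 2                # misspeculation: restart from shadow data
    raise ValueError("delay exceeds the worst-case bound T")

delays = [0.9, 1.2, 1.8, 1.3]   # per-operation delays in ns (assumed)
total = sum(cycles(d) for d in delays)
print(total)   # prints "5": four operations, one of which misspeculated
```

In this model the rare long-delay pattern (1.8 ns) costs one extra cycle, while the common short patterns run at the aggressive 1.4 ns period instead of the conservative 2 ns one.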
Fig 15: Silicon areas of a 32-bit multiplier under various timing constraints
and one with razor in situ timing fault detection and correction.
Speculative Look-ahead
Speculative look-ahead (SL) speculates the datapath result by sampling its output with an aggressive clock period shorter than the critical path delay at the WC. SL also confirms the speculated result by sampling the datapath again, but in the succeeding clock cycle (i.e., no skewed clock is required).
To prevent race conditions in the second sampled result (by the succeeding data pattern), SL
holds the datapath input for two cycles. A duplicated datapath is added to process the succeeding
data pattern to sustain single-cycle throughput. In other words, consecutive operations (e.g., MPU instructions) are assigned to the two datapaths in SL alternately; for example, datapath 0 executes even operations, whereas datapath 1 executes odd operations, as shown in Fig. 16.
Conclusion
Power is the next great challenge for computer systems
designers, especially those building mobile systems with frugal energy
budgets. Technologies like Razor enable “better than worst-case
design,” opening the door to methodologies that optimize for the
common case rather than the worst. Optimizing designs to meet the
performance constraints of worst-case operating points requires
enormous circuit effort, resulting in tremendous increases in logic
complexity and device sizes. It is also power-inefficient because it
expends tremendous resources on operating scenarios that seldom
occur. Using recomputation to process rare, worst-case scenarios
leaves designers free to optimize standard cells or functional units—at
both their architectural and circuit levels—for the common case, saving
both area and power.
Circuit-level timing speculation can improve the energy efficiency of MPU cores by
exploiting PVTD-dependent delay variations with AVS and timing constraint relaxation. SL is a novel timing speculation scheme that resolves the short path problem of conventional double sampling approaches (e.g., razor), in which buffer insertion can overwhelm the energy savings in advanced technologies exhibiting severe delay variations. The SL design (consisting of duplicated datapaths but markedly simpler logic and no additional buffers) consumed only 54.89% of the area of conventional razor-based designs and conserved 59.77% energy per operation at the 1.1 V nominal voltage, and 53.49% when applying AVS, for the same performance.
References
1. T. Pering, T. Burd, and R. Brodersen, “The Simulation and Evaluation of
Dynamic Voltage Scaling Algorithms,” Proc. Int’l Symp. Low Power Electronics
and Design (ISLPED 98), ACM Press, 1998, pp. 76-81.
2. T. Mudge, “Power: A First-Class Architectural Design Constraint,” Computer,
vol. 34, no. 4, Apr. 2001, pp. 52-58.
3. S. Dhar, D. Maksimovic, and B. Kranzen, “Closed-Loop Adaptive Voltage
Scaling Controller for Standard-Cell ASICs,” Proc. Int’l Symp. Low Power
Electronics and Design
4. W. F. Ellis, E. Adler, and H. L. Kalter, “A 2.5-V 16-Mb DRAM in 0.5-mm
5. CMOS technology”, in IEEE International Symposium on Low PowerElectronics.
ACM/IEEE, Oct. 1994, vol. 1, pp. 88–9.
6. K. Itoh, K. Sasaki, and Y. Nakagome, “Trends in low-power RAM circuit
technologies”, in IEEE International Symposium on Low Power Electronics.
ACM/IEEE, Oct. 1994, vol. 1, pp. 84–7.
7. Dan Ernst, Shidhartha Das, Seokwoo Lee, David Blaauw, Todd Austin, Trevor
Mudge, University of Michigan, Nam Sung Kim,” Razor : circuit-level correction
of timing errors for low power operation”, Published by the IEEE Computer
Society.
8. “Razor: A Low-Power Pipeline Based on Circuit-Level Timing Speculation” in
the 36th Annual International Symposium on Microarchitecture (MICRO-36),
9. R. Sivakumar1, D. Jothi1, Department of ECE, RMK Engineering College,
India,” Recent Trends in Low Power VLSI Design”, published in International
Journal of Computer and Electrical Engineering.
10. Tawfik, S. A. (May 2009). Low power and high speed multi threshold voltage
interface circuits. IEEE Transactions on Very Large Scale Integration (VLSI)
Systems, 17(5).
11. Pillai, P., & Shin, K. G. (Dec. 2001). Real-Time dynamic voltage scaling for low-
power embedded operating systems. Proceedings of the Eighteenth ACM
Symposium on Operating Systems Principles.
12. J. Ziegler et al., “IBM Experiments in Soft Fails in Computer Electronics,” IBM
J. Research and Development, Jan. 1996, pp. 3-18. P. Rubinfeld, “Managing
Problems at High Speed,” Computer, Jan. 1998, pp. 47-48.
13.Tay-Jyi Lin, Member, IEEE, and Ting-Yu Shyu, Student Member, IEEE,