VLSI Architecture
An Architectural Approach to Reduce Leakage Energy in Memory
A seminar report submitted in partial fulfillment for the award of the degree of
Bachelor of Technology
in
Electronics and Communication Engineering
By
S. Raja Sekhar
07BM1A0425
Under the Guidance of
Mr. S. Satish Kumar
Vice Principal and HOD
STANLEY STEPHEN COLLEGE OF ENGINEERING AND TECHNOLOGY
Panchalingala, Kurnool – 518 004, Kurnool (Dist.), A.P.
Year 2010-2011
Certificate
This is to certify that the seminar report entitled ___________ is being submitted by
Mr./Ms. ________________ in partial fulfillment for the award of the degree of Bachelor of
Technology in ______________________ to the Jawaharlal Nehru Technological University,
Anantapur, A.P.
Date:
Signature of Guide
Signature
CONTENTS
I Introduction
II Related Work
• Dynamically Resizable Instruction Cache
• Cache Decay
• Partitioned Cache Architecture
• Selective Cache Ways
III Time Based Leakage Control in Partitioned Cache Architecture
• Overview
• Block Diagram
• Implementation
• Placement Strategies
• Prediction Strategies
• Deciding Cache Decay Interval
IV Conclusion
References
SECTION I: INTRODUCTION
Advances in semiconductor technology have steadily increased the density of transistors per chip.
The amount of information storable on a given amount of silicon has roughly doubled every year since
the technology was invented. As a result, processor performance has improved, and chip energy
dissipation has risen with each processor generation. This has created a need for low power circuit
design. Low power is important in portable devices because the weight and size of the device are
determined by the amount of battery needed, which in turn depends on the power dissipated in the
circuit. The cost of supplying power and the associated cooling, reliability issues, and expensive
packaging have made low power a concern in nonportable applications such as desktops and servers too.
Even though most power dissipation in CMOS CPUs is dynamic (a function of the operating frequency and
the switched capacitance), leakage power (a function of the number of on-chip transistors) is becoming
increasingly significant, because leakage current flows in every powered transistor irrespective of
signal transitions. Most of the leakage energy comes from memories: caches occupy much of the CPU
chip's area and contain a large share of its transistors, so reducing leakage in the cache yields a
significant reduction in the overall leakage energy of the processor. This paper suggests an
architectural approach for reducing leakage energy in caches.
Various approaches have been suggested at both the architecture and circuit levels to reduce leakage
energy. One approach counts the total number of misses in a cache and upsizes or downsizes the cache
depending on whether the miss count is greater or less than a preset value. The cache dynamically
resizes to the size the application requires, and the unused sections of the cache are shut off.
Another method, called cache decay, turns off cache lines when they hold data that is not likely to be
reused. The cache lines are shut off during their dead time, that is, the time after the last access
and before the eviction. If a specific number of cycles has elapsed and the data is still unused, that
cache line is shut off. A third approach, called selective cache ways, disables a portion of the cache
ways. This method, which is application sensitive, enables all the cache ways (a way is one of the n
sections in an n-way set associative cache) when high performance is required and enables only a subset
of the ways when cache demands are not high.
This paper is organized as follows. Section II reviews the work related to this problem, Section III
describes our approach, which uses a time based decay policy in a partitioned level 2 cache
architecture, and Section IV presents the conclusion.
SECTION II: RELATED WORK
Dynamically resizable instruction cache:

This method exploits the fact that cache utilization varies with the application's requirements. By
shutting off the portion of the cache that is unused, leakage energy can be reduced significantly. It
uses a dynamically resizable I-cache architecture, which resizes in accordance with the application
requirements, together with a circuit-level technique called gated-Vdd that turns off the unused
portions of the cache. The number of misses is counted periodically (say every 1 million
instructions), and the cache size is increased or decreased depending on whether the count is more or
less than a preset value. The cache is also prevented from thrashing by fixing a minimum size below
which the cache cannot be reduced.
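The resizing decision described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name next_cache_size, the power-of-two resizing step, and the example bounds are hypothetical and are not taken from the cited work.

```python
def next_cache_size(current_kb, miss_count, miss_bound, min_kb=1, max_kb=64):
    """Return the cache size to use for the next sampling interval.

    Assumption: sizes move in powers of two. min_kb models the minimum
    size that prevents thrashing; max_kb is the full physical cache.
    """
    if miss_count > miss_bound:
        # Too many misses in the last interval: upsize, capped at max_kb.
        return min(current_kb * 2, max_kb)
    if miss_count < miss_bound:
        # Few misses: downsize, but never below the anti-thrashing floor.
        return max(current_kb // 2, min_kb)
    return current_kb


size = 64
for misses in [100, 120, 5000, 90]:   # hypothetical per-interval miss counts
    size = next_cache_size(size, misses, miss_bound=1000, min_kb=4)
    print(size)
```

The unused portion (max_kb minus the returned size) is what the gated-Vdd circuitry would switch off.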
Merits:
• Reduces the average size of a 64 K cache by 62%, thus lowering leakage energy, while the performance
degradation stays within 4%.
• With a wide NMOS dual-Vt gated-Vdd implementation, leakage is virtually eliminated (the gated-Vdd
transistor connected in series with the SRAM cell produces the stacking effect) with only a 5% area
increase.
• By controlling the miss rate with reference to a preset value, the performance degradation and the
increase in the energy dissipation of lower cache levels (due to misses in the L1 cache) are kept low.
• The dynamic energy of the counter hardware is small, as the average number of bits switching on a
counter increment is less than two (the ith bit in a counter switches only once every 2^i
increments).
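The claim in the last merit, that fewer than two bits switch per increment on average, can be checked with a short simulation. The function name and the 16-bit width below are illustrative assumptions.

```python
def average_bit_flips(width, increments):
    """Average number of bits that toggle per increment of a binary counter."""
    mask = (1 << width) - 1
    value, flips = 0, 0
    for _ in range(increments):
        nxt = (value + 1) & mask
        # XOR marks the bits that differ between consecutive counter values.
        flips += bin(value ^ nxt).count("1")
        value = nxt
    return flips / increments


# One full cycle of a 16-bit counter, including the wrap-around to zero.
avg = average_bit_flips(width=16, increments=1 << 16)
print(avg)
```

Bit i toggles once every 2^i increments, so the average is the sum of 1/2^i over the counter bits, which stays below two for any finite width.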
Demerits:
• Resizing affects the miss rate. A miss in the L1 cache leads to dynamic energy dissipation in the
L2 cache, so the number of accesses to the L2 cache should be kept low.
• There is extra L1 dynamic energy due to the resizing bits.
• The resizing circuitry may increase energy dissipation, offsetting the gains from cache resizing, so
the resizing frequency should be low.
• A longer resizing interval will span multiple application phases, reducing the opportunity for
resizing, while a shorter resizing interval may increase the overhead.
• Resizing from one size to another modifies the set-mapping function for blocks and may result in an
incorrect lookup.
• For an application that requires a small I-cache, the dynamic component will be large due to the
large number of resizing tag bits.
• The gated-Vdd transistor must be large enough to sink the current flowing through the SRAM cells
during read/write operations, but too large a gated-Vdd transistor reduces the stacking effect and
increases the area overhead.
Cache decay:
In this technique, cache lines that hold data not likely to be reused are turned off. It exploits the
fact that a cache line is used frequently when data is first brought in, and then there is a period of
dead time before the data is evicted. By turning off cache lines during their dead time, leakage energy
can be reduced significantly without incurring additional misses, so performance remains comparable to
that of a conventional cache. The policy used here is a time-based policy that turns a cache line off
if a preset number of cycles has elapsed since its last access.
Fig A
As seen in Fig A, the access interval is the time between two hits; dead time is the time between the
last hit and the time at which the data is evicted.
Fig B
As seen in Fig B, the dead time for most of the benchmarks is high.
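The terms access interval and dead time can be made concrete with a small sketch that splits one cache-line generation into its parts. The function names and the cycle numbers are hypothetical and only illustrate the definitions.

```python
def generation_stats(access_cycles, evict_cycle):
    """Split one cache-line generation into access intervals and dead time.

    access_cycles: sorted cycle numbers of the accesses to the line, the
    first entry being the fill. evict_cycle: cycle at which it is evicted.
    """
    intervals = [b - a for a, b in zip(access_cycles, access_cycles[1:])]
    dead_time = evict_cycle - access_cycles[-1]
    return intervals, dead_time


def decays_early(intervals, decay_interval):
    """True if some access interval exceeds the decay interval, that is,
    the line would be shut off while it is still live (an extra miss)."""
    return any(gap > decay_interval for gap in intervals)


intervals, dead = generation_stats([100, 140, 155, 400], evict_cycle=9000)
print(intervals, dead)                 # prints [40, 15, 245] 8600
print(decays_early(intervals, 1000))   # prints False
```

With a decay interval of 1000 cycles this line is never shut off while live, yet most of its 8600-cycle dead time is spent powered off.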
Merits:
• A 70% reduction in L1 data cache leakage energy is achieved.
• Program performance and dynamic power dissipation are not affected much, as the cache line is
turned off only during its dead time.
• Results show that dead times are long and thus moderately easy to identify.
• Very successful if the application has poor reuse of data, as in streaming applications.
• Can be applied to outer levels of the cache hierarchy (as outer levels are likely to have longer
generations with larger dead time intervals).
Demerits:
• There may be additional L1 misses if a cache line is shut off too early, before its data is
reused.
• Shorter decay intervals (the time after which a cache line is shut off) reduce leakage energy but
may increase the miss rate, leading to dynamic energy dissipation in lower level memory.
Partitioned Cache Architecture:
This method partitions the cache into smaller units (subcaches), each of which acts as a cache, and
selectively disables unused components. Since the partition is at the architecture level, the data
placement and data probing mechanisms are more sophisticated than those at the circuit level. The
topologies of the subcaches may differ. The cache predictor tells the cache controller which subcaches
should be activated. The effectiveness of the probing strategy determines the number of subcaches
accessed per data reference, and thus the energy consumed.
Merits:
• Reduces per access energy costs
• Improves locality behavior
• Smaller components that consume less energy
• The subcaches can all be the same or can have different topologies
• Both performance and energy can be optimized
• Breaking the cache into subcaches or subbanks reduces the wiring and diffusion capacitances of bit
lines and the wiring and gate capacitances of word lines, so less dynamic energy is consumed when
accessing the cache.
Fig C: Partitioned cache architecture. Blocks shown: virtual address (frame #, page offset), TLB,
cache predictor (default predictor, reprobe logic), cache controller, cache miss and placement logic,
and subcaches 1 to N. Signals shown: cache ID, no prediction, feedback, missing physical block #.
Fig D: Architecture of a sub-cache. Blocks shown: global counter, cascaded tick pulse (T), row
decoders, and, for each cache line, a valid bit (V), a local 2-bit counter with inputs WRD and T, and
the cache line (data + tag). The legend distinguishes switched power from always powered parts. The
figure includes the state diagram for the 2-bit (S1, S0), saturating, Gray code counter with two inputs
(WRD, T); states shown are 00, 01, 11 and 10, with transition labels T and T/PowerOff.
Demerits:
• If a large number of cycles is spent servicing a memory request because of a poor probing strategy,
performance is degraded.
• Performance depends on the effectiveness of the probing policy; a poor probing policy incurs a
reprobing penalty.
• Energy depends on the number of subcaches accessed per data reference.
Selective cache ways:

This method exploits the subarray partitioning that is usually already present in a cache. It enables
all the cache ways when high performance is required, but only a subset of the cache ways when cache
demands are not high. Since only a subset of the cache ways is active, leakage energy can be reduced
significantly. This strategy exploits the fact that cache requirements vary considerably between
applications as well as within an application. A software visible register called the cache way select
register (CWSR) signals the hardware to enable or disable particular ways. Special instructions are
provided for writing and reading the cache way select register. Software also plays a role in analyzing
application cache requirements, enabling cache ways, and saving the cache way select register. This
technique is thus a combination of hardware and software elements. The degree to which ways are
disabled depends on the relative energy dissipation of the different memory hierarchy levels and how
they are affected by disabling ways.
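A minimal software model of the cache way select register may help to picture the mechanism. The class below is a hypothetical sketch: the one-bit-per-way encoding, the reset value, and the rule that at least one way stays enabled are assumptions, not details given in the cited work.

```python
class CacheWaySelectRegister:
    """Hypothetical model of a CWSR for an n-way set associative cache."""

    def __init__(self, num_ways):
        self.num_ways = num_ways
        self.bits = (1 << num_ways) - 1      # assumption: all ways on at reset

    def write(self, mask):
        """Model of the special 'write CWSR' instruction."""
        mask &= (1 << self.num_ways) - 1     # ignore bits beyond the last way
        if mask == 0:
            # Assumption: the cache must keep at least one way enabled.
            raise ValueError("at least one way must stay enabled")
        self.bits = mask

    def read(self):
        """Model of the special 'read CWSR' instruction."""
        return self.bits

    def enabled_ways(self):
        return [w for w in range(self.num_ways) if (self.bits >> w) & 1]


cwsr = CacheWaySelectRegister(num_ways=4)
cwsr.write(0b0011)            # low cache demand: keep only ways 0 and 1
print(cwsr.enabled_ways())    # prints [0, 1]
```

Saving and restoring the register on a context switch would then amount to a read followed later by a write of the saved value.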
SECTION III: TIME BASED LEAKAGE CONTROL IN PARTITIONED CACHE ARCHITECTURE
Overview:
The level 2 cache is larger than the level 1 cache, so the level 2 cache dissipates more leakage
energy than the level 1 cache. Reducing leakage energy in the L2 cache can therefore reduce overall
leakage energy to a great extent. This paper combines two existing strategies, partitioning and time
based cache decay, to reduce leakage power in the level 2 cache. Earlier methods proposed partitioning
the cache structure, shutting off parts of the cache ways during their dead time, and partitioning the
subarrays of a cache structure. In this paper we suggest partitioning the level 2 cache into small
caches called subcaches and implementing cache decay (shutting off portions of the cache ways) in each
subcache. This can reduce the leakage energy significantly. The subcache architecture enjoys the
following benefits:
• Reduces per access energy costs
• Improves locality behavior
• Smaller components that consume less energy
• Both performance and energy can be optimized
• Breaking the cache into subcaches or subbanks reduces the wiring and diffusion capacitances of bit
lines and the wiring and gate capacitances of word lines, so less dynamic energy is consumed when
accessing the cache.
This architecture selectively disables unused subcaches and activates the one holding the data, so
leakage energy can be reduced significantly. By applying the time based cache decay technique to each
subcache, only part of the cache ways is enabled within a subcache, and the power wasted during dead
times (when a cache way is idle) is avoided. This combination of partitioning and cache decay can
therefore reduce leakage energy more than either technique applied alone. Time based cache decay is an
appropriate technique to use in a subbank for the following reasons:
• Program performance is not affected much, as the cache line is turned off only during its dead
time.
• Time based cache decay works well if the reuse of data is poor. Reuse of data in the L2 cache is
lower than in the L1 cache, so it is appropriate to apply this technique to the L2 cache.
• Outer levels of the hierarchy are likely to have longer generations with larger dead time
intervals, which is what this time based cache decay technique requires.
• The fraction of time a cache way is dead increases with a higher miss rate, as the lines spend more
of their time about to be evicted.
Implementation:
The block diagram of the hardware implementation is shown in Fig C. The level 2 cache is divided into
smaller units, each acting like a cache by itself; these are called SUBBANKS or SUBCACHES. The subcache
that needs to be activated is decided by logic called the CACHE PREDICTOR. This operation is performed
concurrently with the table lookup operation (the TLB in Fig C) in order to avoid delay in the critical
path. The output of the cache predictor is either a subcache id or a no prediction. If the output is a
no prediction, logic called the DEFAULT PREDICTOR is used to select the cache for activation. Based on
the cache predictor output, the CACHE CONTROLLER activates the appropriate subcache. The check is made
only against the cache ways that are active within the subcache; not all cache ways are enabled within
the subcache. Cache ways within a subcache are disabled by means of a time based decay policy (Fig D).
The time based decay policy is implemented in each subcache. Each cache line within a subcache is
connected to a 2-bit counter (the local counter), which increments its value on receiving a tick pulse
from a global counter. The two inputs the local counter receives are the global tick signal T and the
cache line access signal WRD. When the 2-bit counter reaches its maximum value, the decay interval,
which is the time allowed before the line is shut off, has elapsed. (For L2 caches the decay interval
is found to be in the range of tens of thousands of cycles.) On every access to the cache line the
2-bit counter is reset to its initial value. Once the counter saturates at its maximum value, the cache
line is shut off using the gated-Vdd technique: the gated-Vdd transistor connected in series with the
SRAM cell is turned off, disabling that cache way. Thus the cache ways that are idle are disabled.
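The behavior of the local counter can be sketched as a small state machine. This is a hedged model, not the circuit itself: the Gray code order 00, 01, 11, 10 follows the states named in Fig D, while the class and function names, the tick period of 4096 cycles, and the rule that WRD takes priority over T are assumptions made for illustration.

```python
# Next state on a tick T; 0b10 is taken as the saturated (power off) state.
GRAY_NEXT = {0b00: 0b01, 0b01: 0b11, 0b11: 0b10, 0b10: 0b10}


class DecayCounter:
    """Per-line 2-bit saturating Gray code counter with inputs WRD and T."""

    def __init__(self):
        self.state = 0b00

    def step(self, wrd=False, tick=False):
        """Apply one clock step; return True if power off is asserted."""
        if wrd:
            self.state = 0b00                   # any access restarts the count
        elif tick:
            self.state = GRAY_NEXT[self.state]  # advance, saturating at 10
        return self.state == 0b10


def cycle_of_power_off(tick_period, last_access_cycle, horizon):
    """Simulate one line that is accessed once and then left idle."""
    counter = DecayCounter()
    for cycle in range(horizon):
        wrd = cycle == last_access_cycle
        tick = cycle > 0 and cycle % tick_period == 0   # global counter tick
        if counter.step(wrd=wrd, tick=tick):
            return cycle
    return None


print(cycle_of_power_off(tick_period=4096, last_access_cycle=100, horizon=20000))
# prints 12288
```

Because the expensive cycle counting is done once in the global counter, each line needs only two bits of state, and the effective decay interval is set by the global tick period.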
If the cache controller cannot find the data in the selected subcache, the CACHE MISS logic informs the
RE-PROBE logic, which determines the next subcache to probe. The re-probe logic remains active until
the data is found. When the data cannot be found in any of the subcaches, the cache miss and placement
logic becomes active and brings the block from main memory. The information in the cache predictor and
re-probe logic is then updated, as one of the blocks needs to be evicted. The cache predictor is also
updated whenever there is a cache hit that it did not predict.
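The probe and re-probe flow can be sketched as follows. The sketch makes several simplifying assumptions that are not in the source: subcaches are modeled as Python sets of block numbers, the predictor is a plain dictionary, the default predictor always returns subcache 0, and re-probing visits the remaining subcaches one at a time.

```python
def lookup(block, subcaches, predictor, default_id=0):
    """Return (subcache_id, probes) on a hit, or (None, probes) on a miss.

    subcaches: list of sets of resident block numbers (stand-ins for real
    subcache lookups). predictor: dict mapping block -> predicted subcache.
    """
    predicted = block in predictor
    first = predictor[block] if predicted else default_id  # default predictor
    # Re-probe order: the remaining subcaches, one after another (assumption).
    order = [first] + [i for i in range(len(subcaches)) if i != first]
    for probes, sid in enumerate(order, start=1):
        if block in subcaches[sid]:
            if not predicted or sid != first:
                predictor[block] = sid   # update on a hit that was not predicted
            return sid, probes
    # Miss in every subcache: the miss and placement logic would now fetch
    # the block from main memory and update the predictor.
    return None, len(order)


subcaches = [{1}, {2}, {3}]
predictor = {}
print(lookup(3, subcaches, predictor))   # prints (2, 3)
print(lookup(3, subcaches, predictor))   # prints (2, 1)
```

The second lookup needs a single probe because the first one taught the predictor where block 3 resides, which is the energy saving the predictor is meant to deliver.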
The global counter can be common to all subcaches. The tick signal is cascaded from one local counter
to the next with one clock cycle of latency, so that writebacks cannot take place at the same time.
Each cache line implements the state machine shown in Fig D. The output of the local counter is a power
off signal that goes to the gated-Vdd transistor, which is turned off when the signal is asserted.
The subcache architecture adopts certain policies for placing data when a miss is encountered,
predicting the subcache in which data will be present (probing), and reprobing subcaches in case the
first probe fails. These are described below.
Placement strategies:
The placement strategy determines how data from memory is placed in the subcache system. Selecting a
good placement policy has both energy and performance implications, and the choice depends on the
amount of past history maintained by the system. Once the subcache is selected, the data is placed
inside the subcache according to the subcache's own topology. The different policies are random,
least-recently-used (LRU), spatial temporal (ST) and modified-spatial-temporal (MST). In the random
policy one of the subcaches is selected at random. In LRU the subcache that was least recently used is
selected for placement. Usually spatial data and temporal data are stored in separate subcaches. In
MST, bypass data is stored in either the spatial or the temporal subcache instead of in a fixed
location; it goes to whichever of the two has fewer misses, for better load balance. MST improves
performance, as the number of misses is lower than with the other strategies.
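The selection step of these policies can be sketched as below. The function names, the per-subcache bookkeeping lists, and the tie-breaking rule are illustrative assumptions; the sketch covers only how a subcache is chosen, not how data is classified as spatial, temporal or bypass.

```python
import random


def choose_subcache(policy, last_used, rng=random):
    """Pick the subcache that receives a block fetched from memory.

    last_used[i] is the cycle at which subcache i was last accessed.
    """
    if policy == "random":
        return rng.randrange(len(last_used))
    if policy == "lru":
        return min(range(len(last_used)), key=lambda i: last_used[i])
    raise ValueError(f"unknown policy: {policy}")


def place_bypass_mst(spatial_id, temporal_id, misses):
    """MST-style rule: bypass data goes to whichever of the spatial and
    temporal subcaches has fewer misses (assumption: ties go to spatial)."""
    if misses[temporal_id] < misses[spatial_id]:
        return temporal_id
    return spatial_id


print(choose_subcache("lru", last_used=[50, 10, 30]))   # prints 1
print(place_bypass_mst(0, 1, misses=[7, 3]))            # prints 1
```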
Prediction strategies:
The prediction strategy is used to probe for the data in the subcaches. It needs to be accurate;
otherwise a miss on the first probe incurs a penalty, as reprobing has to be done. The 'All' strategy
accesses all subcaches concurrently, so it provides no energy savings during the probing stage. The
MRU/WP strategy accesses the most recently used subcache first. The most recently used information can
be maintained in a single register. The CIB (Cache Identifier Buffer) strategy holds a list of the most
recently used virtual addresses and the corresponding physical subcaches holding those blocks. Whenever
the CIB is not able to make a prediction, a default predictor predicts the subcache. The CIB entries
are updated whenever the corresponding cache line is accessed and whenever it is evicted by the cache
miss and placement logic.
In order to reduce the reprobe penalty when the first probe misses, it is good to probe all subcaches
other than the one already accessed simultaneously. CIB is an effective probing strategy. When the
program exhibits good locality, the CIB has a high probability of making a prediction.
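A CIB can be pictured as a small, recency-ordered table. The class below is a sketch under stated assumptions: the capacity, the least-recently-used replacement of entries, and the choice of the most recently used subcache as the default predictor are illustrative and are not specified in the source.

```python
from collections import OrderedDict


class CacheIdentifierBuffer:
    """Sketch of a CIB: recently used virtual addresses -> subcache ids."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # oldest entry first
        self.mru_subcache = 0          # single register for the default predictor

    def predict(self, vaddr):
        if vaddr in self.entries:
            self.entries.move_to_end(vaddr)    # refresh recency
            return self.entries[vaddr]
        return self.mru_subcache               # no prediction: use the default

    def update(self, vaddr, subcache_id):
        """Called when the line is accessed in subcache_id."""
        self.entries[vaddr] = subcache_id
        self.entries.move_to_end(vaddr)
        self.mru_subcache = subcache_id
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)   # drop the oldest entry

    def evict(self, vaddr):
        """Called when the miss and placement logic evicts the line."""
        self.entries.pop(vaddr, None)


cib = CacheIdentifierBuffer(capacity=2)
cib.update(0x10, 1)
cib.update(0x20, 2)
print(cib.predict(0x10))   # prints 1
```

With good locality the same few addresses recur, so even a small buffer resolves most lookups without falling back to the default predictor.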
The decay interval, which determines when each cache way is shut off within a subcache, has to be
chosen properly in order to avoid extra misses in the cache. Some techniques are described below.
Deciding cache decay interval:
There are two options: the cache line can be turned off at some reference point, once it is decided
that the line is worth turning off, or the line can be watched over a period of time and turned off if
no further access occurs. One policy is to turn the line off at the point in time where the extra cost
incurred by waiting is precisely equal to the extra cost that would be incurred if turning it off turns
out to be the wrong action. If the wait is longer, more leakage energy is dissipated; on the other
hand, if the decay interval is short, the number of L2 misses increases, which cannot be tolerated,
especially if the miss is serviced from off chip. A good time to turn off is therefore when the static
energy dissipated since the last access is precisely equal to the dynamic energy that would be
dissipated if turning the line off induced an extra miss. The decay interval for the L1 cache was found
to be about 10,000 cycles; since higher level caches tend to have longer generations with larger dead
time intervals, the decay interval for the L2 cache will be greater.
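The break-even rule above reduces to a one-line calculation: the line is turned off once the leakage energy spent since the last access equals the dynamic energy of one induced miss. The sketch below assumes a constant leakage energy per cycle per line; the function name and the numbers are hypothetical and carry no measured values.

```python
def break_even_decay_interval(miss_energy, leak_energy_per_cycle):
    """Cycles after which leakage since the last access equals the dynamic
    energy of one induced miss (both energies in the same unit)."""
    if leak_energy_per_cycle <= 0:
        raise ValueError("leakage energy per cycle must be positive")
    return miss_energy / leak_energy_per_cycle


# Hypothetical figures: a miss costs 10000 units, a powered line leaks
# 0.5 units per cycle.
print(break_even_decay_interval(10000.0, 0.5))   # prints 20000.0
```

A costlier miss, such as one serviced from off chip, raises the numerator and so lengthens the interval, which is consistent with the larger decay interval expected for the L2 cache.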
SECTION IV: CONCLUSION
By partitioning the cache into smaller units and shutting off a portion of the cache ways in each
subcache using a time based decay policy, leakage energy can be reduced more than when either technique
is implemented alone, without much performance degradation. Partitioning the cache into smaller units
offers many benefits, such as reduced per access energy, improved locality behavior, and lower dynamic
energy consumption. The subcache architecture uses efficient placement and prediction schemes. The
placement scheme is MST (modified spatial temporal), which reduces the miss rate by placing the data in
the appropriate subcache (spatial or temporal). The prediction scheme is CIB (cache identifier buffer),
which predicts well, especially when the program exhibits good locality. Since the time based decay
policy turns off a cache line only during its dead time, performance is not affected much, as few
additional misses are incurred. The time based decay policy exploits the generational characteristics
of cache-line usage: individual cache lines are turned off during their dead period (the time between
the last successful access and the line's eviction). A global counter provides a tick pulse for the
local counters of all cache lines. A cache line is turned off when its local counter reaches its
maximum value. Compared to standard caches of various sizes, a decay cache offers a better active size
for the same miss rate, or a better miss rate for the same active size. The decay interval can also be
varied according to the utilization of each cache line: an adaptive scheme selects among a multitude of
decay intervals per cache line, saving additional leakage energy. The time based cache decay technique
can also be applied to the level 1 cache.
REFERENCES:
1. Michael Powell, Se-Hyun Yang, Babak Falsafi, Kaushik Roy, and T. N. Vijaykumar, “Reducing
Leakage in a High-Performance Deep-Submicron Instruction Cache,” in IEEE Transactions on
Very Large Scale Integration (VLSI) Systems, Vol. 9, No. 1, February 2001.
2. Stefanos Kaxiras, Zhigang Hu, and Margaret Martonosi, “Cache Decay: Exploiting Generational
Behavior to Reduce Cache Leakage Power.”
3. David H. Albonesi, “Selective Cache Ways: On-Demand Cache Resource Allocation,” in Journal
of Instruction-Level Parallelism 2 (2000) 1-6, May 2000.
4. S. Kim, N. Vijaykrishnan, M. Kandemir, A. Sivasubramaniam, M. J. Irwin, and E. Geethanjali,
“Power-aware Partitioned Cache Architectures.”
5. John L. Hennessy and David A. Patterson, “Computer Architecture: A Quantitative Approach,”
second edition.
6. Anantha P. Chandrakasan and Robert W. Brodersen, “Minimizing Power Consumption in Digital
CMOS Circuits,” in Proceedings of the IEEE, Vol. 83, No. 4, April 1995.
