Abstract—This paper presents a highly configurable low-voltage write-ability assist implementation along with a sense-amplifier offset reduction technique to improve SRAM read performance. Write-assist implementation combines negative bit-line (BL) and VDD collapse schemes in an efficient way to maximize Vmin improvements while saving on area and energy overhead of these assists. Relative delay and pulse width of assist control signals are also designed with configurability to provide tuning of assist strengths. Sense-amplifier offset compensation scheme uses capacitors to store and negate threshold mismatch of input transistors. A test chip fabricated in 28 nm HP CMOS process demonstrates operation down to 0.5 V with write assists and more than 10% reduction in word-line pulsewidth with the offset compensated sense amplifiers.

Index Terms—CMOS, low-voltage SRAM, offset compensation, SRAM assist.

Manuscript received April 30, 2015; revised October 01, 2015; accepted October 25, 2015. This paper was approved by Associate Editor Hideto Hidaka. This research was developed, in part, with funding from the Defense Advanced Research Projects Agency (DARPA). The views, opinions, and/or findings contained in this article/presentation are those of the author(s)/presenter(s) and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government. Distribution Statement A (Approved for Public Release, Distribution Unlimited).
M. E. Sinangil was with NVIDIA, Santa Clara, CA 95050 USA. He is now with TSMC North America, San Jose, CA 95134 USA (e-mail: sinangil@alum.mit.edu).
J. W. Poulton, M. R. Fojtik, T. H. Greer III, S. G. Tell, and C. T. Gray are with NVIDIA, Durham, NC 27713 USA.
A. J. Gotterba, J. Wang, J. Golbus, B. Zimmer, and W. J. Dally are with NVIDIA, Santa Clara, CA 95050 USA.
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/JSSC.2015.2498302

I. INTRODUCTION

THE PROCESSING capabilities packed into a single die have been driven by Moore’s law and have been continuously increasing. Fig. 1 shows the 16-bit floating point precision (FP16) performance of NVIDIA Tegra SoC designs [1]. Over the course of 5 years, from Tegra 2 to Tegra X1, process technology scaled from 40 to 20 nm and FP16 performance increased by roughly two orders of magnitude. Driven by the integration of high-end graphics processor cores with multicore CPUs, today’s high-end SoC platforms can deliver more than 1 TFLOPs of FP16 performance and provide an enhanced user experience. However, the SoCs in Fig. 1 also need to be extremely energy efficient since the end product is generally battery powered.

SRAMs, the most common type of on-chip memories, are extremely important, as the contribution of SRAM area and power to total chip power has also been continuously increasing over the past years [2]. This is due to the trend of increased levels of parallelism to improve performance in the realm of multicore and multithreaded computing platforms. Refs. [3]–[5] are three examples of recently published state-of-the-art microprocessors featuring 24, 37.5, and 64 MB of total SRAM cache on die. Consequently, achieving energy efficiency for SRAMs is one of the key components for energy-efficient systems.

Dynamic voltage and frequency scaling (DVFS) is one of the most effective methods to lower energy consumption in digital circuits [6], [7]. While DVFS presents various complexities for logic in terms of power delivery, clock generation and distribution, timing validation and so forth, there is generally not a functionality problem with standard CMOS static logic operating as long as the operating voltage is not in the deep subthreshold region. In contrast, the concept of operating SRAMs over a large voltage range is very challenging. This is because conventional SRAM bit-cells are generally designed to operate robustly at the nominal supply voltage. Because of the ratioed design of the conventional 6T SRAM bit-cells, when the operating conditions move away from the nominal point, operating margins that are required for functionality start to erode quickly. Moreover, local and global transistor variation coupled with aging effects restricts the design space for SRAM functionality, and consequently it is becoming increasingly difficult to make SRAMs operational at lower supply voltages.

One method to utilize DVFS for a system consisting of logic gates and SRAM macros is decoupling the supply voltages. This dual-rail option enables separate control and scaling of logic voltage and frequency and hence provides energy savings for the logic while the SRAM supply voltage is kept at a higher voltage to satisfy SRAM functionality requirements. Although effective, this approach presents various challenges at the system level. First, introduction of a new rail requires allocation of metal resources for robust power delivery. Creation
0018-9200 © 2015 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for inclusion in a future issue of this journal. Content is final as presented, with the exception of pagination.
SINANGIL et al.: 28 NM 2 MBIT 6 T SRAM WITH HIGHLY CONFIGURABLE LOW-VOLTAGE WRITE-ABILITY ASSIST IMPLEMENTATION 3
Fig. 6. (a) Charging of capacitor with and without charge sharing and (b) corresponding negative BL voltage achieved.
Fig. 7. Digitally controlled delay lines along with a voltage divider at the gate of pull-down device N0 are used to tune the strength of assists.
Fig. 8. Simulated Vmin improvement versus energy overhead of assists scatter plot. Configurability and tuning of assists provide a larger space for Vmin versus energy overhead tradeoff.
Fig. 13. Input referred offset voltage improvement with the proposed compensation scheme across SF and TT corners with VDD = 0.5 V and temperature at 85 °C.
Fig. 14. Die photograph of the test chip fabricated in 28 nm HP CMOS process.
pin (wlg) is inserted for each macro that causes the WL pulse width to be shorter than the low phase of the clock. The on-chip sampling scope placed on the test chip is used to measure the WL pulsewidth observed on die by connecting it to one WL from the conventional and assisted SRAM macros.

During the scope operation, MBIST is placed in a mode to continuously hit the address corresponding to the observed WL. The on-chip sampling scope uses a comparator clocked with a separately controlled sampling clock, sclk. By intentionally creating a slight frequency difference between clk and sclk from an off-chip instrument, sclk starts to “slide” through the WL pulse and effectively samples the WL voltage with small time steps. The entire WL pulse is sampled across many cycles but as MBIST is placed in a continuous mode, WL is guaranteed to turn ON every cycle. At the output, this provides a time expanded view of the WL pulsewidth as shown in the oscilloscope screenshot in Fig. 16. The conversion factor between the displayed pulsewidth on the oscilloscope screen and the actual pulse on the chip is set by the frequency difference between sclk and clk. For the example in Fig. 16, this conversion factor is 1.418 ns/ms and with clk running at 30 MHz, wlg is used to shorten the WL pulsewidth from half of the clock period (16.67 ns) to 4.57 ns.

Fig. 17. Measured shmoo plots of the (a) conventional and (b) assisted SRAM macros showing improvement of write-ability errors with assists.
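The relation between the clk/sclk frequency difference and the time-expansion factor of the sampling scope can be sketched as follows. This is a minimal illustration, not the authors' measurement code: the function names are hypothetical, and the 42.54 Hz offset used in the usage note is back-computed from the quoted 1.418 ns/ms factor, since the text does not state the actual offset.

```python
def conversion_factor(f_clk_hz: float, f_sclk_hz: float) -> float:
    """On-chip seconds swept per second of real (oscilloscope) time.

    Assumes ideal clocks: in every clk cycle the sclk sampling edge slips
    by the difference between the two clock periods.
    """
    slip_per_cycle_s = abs(1.0 / f_clk_hz - 1.0 / f_sclk_hz)
    # There are f_clk cycles in one second of real time.
    return slip_per_cycle_s * f_clk_hz


def on_chip_width(displayed_width_s: float, factor: float) -> float:
    """Convert a pulsewidth read on the oscilloscope to the on-die pulsewidth."""
    return displayed_width_s * factor
```

Under these assumptions, `conversion_factor(30e6, 30e6 + 42.54)` evaluates to roughly 1.418e-6 s/s, which is the same quantity as 1.418 ns/ms, and a pulse that appears about 3.22 ms wide on the screen would correspond to about 4.57 ns on die.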
SRAMs from TT and SF process corners were tested. Fig. 17(a) and (b) show shmoo plots with and without assists. Without write assists, macros begin to see write failures as the supply voltage is scaled down. Turning ON both negative BL and VDD collapse assists results in more than 25% WL pulsewidth reduction at 0.5 V, enabling operation down to 0.5 V. The proposed write assists can provide even higher Vmin savings for larger die with higher number of bits, as these will have more bit-cell samples and a longer tail for the Vmin distribution of cells.

Fig. 18 shows the measured reduction of WL pulsewidth by using the proposed capacitor-based input threshold matching sense amplifiers. With a smaller sense-amplifier offset, the WL pulsewidth can be shorter and a smaller differential between BL and BLB can be resolved correctly. For the experiment in Fig. 18, the WL pulsewidth is set to a very narrow pulse, and then it is incremented in small steps. For each WL pulsewidth, correct and erroneous bits are recorded and bits that are sensed correctly are represented with the correct sensing probability. For the same correct sensing probability of 99%, proposed sense amplifiers provide more than 10% shorter WL pulsewidths at 0.5 V. These results agree with the simulated sense-amplifier offset improvement discussed in this paper. Because of the independent clock tree for clk and wlg signals and the skew introduced by them at each macro interface, a direct measurement of offset compensation across all macros is not possible. The plot in Fig. 18 shows the WL pulsewidth improvement of each macro independently. Finally, Table I summarizes specifications of the test chip.

Fig. 18. Measured WL pulsewidth versus correct sensing probability for the conventional and proposed sense-amplifier designs.
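The post-processing implied by the pulsewidth sweep above can be sketched as follows: compute the correct sensing probability at each step, find the narrowest pulsewidth that meets a target such as 99%, and compare two designs. This is a hedged sketch only; the data layout and all function names are assumptions made for illustration and are not taken from the paper's test flow.

```python
from typing import Iterable, Optional, Tuple

# One sweep step: (WL pulsewidth in ns, correctly sensed bits, erroneous bits).
# This tuple layout is an assumption made for the sketch.
SweepStep = Tuple[float, int, int]


def sensing_probability(correct_bits: int, erroneous_bits: int) -> float:
    """Fraction of bits sensed correctly at one WL pulsewidth setting."""
    total = correct_bits + erroneous_bits
    if total <= 0:
        raise ValueError("no bits recorded for this pulsewidth step")
    return correct_bits / total


def min_pulsewidth_for_target(
    sweep: Iterable[SweepStep], target: float = 0.99
) -> Optional[float]:
    """Smallest swept pulsewidth whose sensing probability meets the target.

    Returns None when no step in the sweep reaches the target.
    """
    for width_ns, correct, erroneous in sorted(sweep):
        if sensing_probability(correct, erroneous) >= target:
            return width_ns
    return None


def relative_reduction(conventional_ns: float, proposed_ns: float) -> float:
    """Fractional WL pulsewidth saving of the proposed design."""
    return (conventional_ns - proposed_ns) / conventional_ns
```

Running `min_pulsewidth_for_target` once on the conventional macro's sweep and once on the proposed macro's sweep, then passing both results to `relative_reduction`, would give the kind of per-macro pulsewidth improvement that Fig. 18 reports.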
TABLE I
CHIP SPECIFICATIONS

V. CONCLUSION

This work presents a highly configurable write-assist implementation along with a sense-amplifier offset compensation scheme in a 28 nm HP CMOS process. The assisted design extends SRAM Vmin down to 0.5 V with sense-amplifier offset compensation providing 10% reduction in WL pulsewidth. The configurable nature of the assists allows for easy selection of assist techniques to be used, and also enables adjustment of the strength of the assist circuits. These settings can be altered even on a cycle-by-cycle basis to provide maximum flexibility. As the process, voltage and temperature (PVT) conditions create a multidimensional design space across which SRAM functionality needs to be assured, the energy cost of assist circuits can be minimized with these configurable assist techniques. Along with global process monitors and local temperature and supply voltage monitoring circuits, a system-level hierarchical control scheme can be created to decide whether assists are necessary or not for the current PVT conditions. Moreover, a programmable lookup table can be created to decide which PVT conditions require which assists to be turned ON and with what strength. This information can be transmitted to each macro from the local monitoring and control circuits synchronously and assist schemes allowing cycle-by-cycle adjustments can respond very quickly. Alternatively, a simpler approach can involve stopping accesses and making adjustments to the assist settings when system-level PVT conditions are changed. These methods can provide maximum energy efficiency at the system level by providing SRAM Vmin improvement by only the necessary amount.

REFERENCES

[1] NVIDIA Corp. (2015, Jan.). NVIDIA Tegra X1 White Paper [Online]. Available: http://international.download.nvidia.com/pdf/tegra/Tegra-X1-whitepaper-v1.0.pdf
[2] International Solid State Circuit Conference. (2013, Nov. 1). ISSCC 2014 Tech Trends [Online]. Available: http://isscc.org/doc/2014/2014_Trends.pdf
[3] R. Kan et al., “The 10th generation 16-core SPARC64 processor for mission critical UNIX server,” IEEE J. Solid-State Circuits, vol. 49, no. 1, pp. 32–40, Jan. 2014.
[4] S. Rusu et al., “Ivytown: A 22 nm 15-core enterprise Xeon processor family,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 102–103.
[5] P. Li et al., “A 20 nm 32-core 64 MB L3 cache SPARC M7 processor,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 72–73.
[6] T. Burd and R. Brodersen, “Design issues for dynamic voltage scaling,” in Proc. Int. Symp. Low Power Electron. Des. (ISLPED), 2000, pp. 9–14.
[7] V. Gutnik and A. Chandrakasan, “Embedded power supply for low-power DSP,” IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 5, no. 4, pp. 425–435, Dec. 1997.
[8] A. P. Chandrakasan et al., “Technologies for ultradynamic voltage scaling,” Proc. IEEE, vol. 98, no. 2, pp. 191–214, Feb. 2010.
[9] K. Zhang et al., “A 3-GHz 70-Mb SRAM in 65-nm CMOS technology with integrated column-based dynamic power supply,” IEEE J. Solid-State Circuits, vol. 41, no. 1, pp. 146–151, Jan. 2006.
[10] M. Yamaoka et al., “Low-power embedded SRAM modules with expanded margins for writing,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2005, pp. 480–481.
[11] E. Karl et al., “A 0.6 V 1.5 GHz 84 Mb SRAM design in 14 nm FinFET CMOS technology,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2015, pp. 310–311.
[12] K. Nii et al., “A 45-nm single-port and dual-port SRAM family with robust read/write stabilizing circuitry under DVFS environment,” in Proc. IEEE Symp. VLSI Circuits, Jun. 2008, pp. 212–213.
[13] Y. Fujimura et al., “A configurable SRAM with constant-negative-level write buffer for low-voltage operation with 0.149 µm² cell in 32 nm high-k metal-gate CMOS,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2010, pp. 348–349.
[14] T. Song et al., “A 14 nm FinFET 128 Mb 6T SRAM with VMIN-enhancement techniques for low-power applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 232–233.
[15] Y.-H. Chen et al., “A 16 nm 128 Mb SRAM in high-K metal-gate FinFET technology with write-assist circuitry for low-VMIN applications,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 238–239.
[16] B. Wicht, T. Nirschl, and D. Schmitt-Landsiedel, “Yield and speed optimization of a latch-type voltage sense amplifier,” IEEE J. Solid-State Circuits, vol. 39, no. 7, pp. 1148–1158, Jul. 2004.
[17] Y. Sinangil and A. P. Chandrakasan, “A 128 kbit SRAM with an embedded energy monitoring circuit and sense amplifier offset compensation using body biasing,” IEEE J. Solid-State Circuits, vol. 49, no. 11, pp. 2730–2739, Nov. 2014.
[18] B. Giridhar, N. Pinckney, D. Sylvester, and D. Blaauw, “A reconfigurable sense amplifier with auto-zero calibration and pre-amplification in 28 nm CMOS,” in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2014, pp. 242–243.
[19] A. Kawasumi et al., “Energy efficiency deterioration by variability in SRAM and circuit techniques for energy saving without voltage reduction,” in Proc. IEEE Int. Conf. IC Des. Technol. (ICICDT), May 2012, pp. 1–4.
[20] S. J. Lovett, G. A. Gibbs, and A. Pancholy, “Yield and matching implications for static RAM memory array sense amplifier design,” IEEE J. Solid-State Circuits, vol. 35, no. 8, pp. 1200–1204, Aug. 2000.

Mahmut E. Sinangil (S’06–M’12) received the B.Sc. degree in electrical and electronics engineering from Bogazici University, Istanbul, Turkey, in 2006, and the S.M. and Ph.D. degrees in electrical engineering and computer science from Massachusetts Institute of Technology (MIT), Cambridge, MA, USA, in 2008 and 2012, respectively.
From 2012 to 2015, he was a Senior Research Scientist with the Circuits Research Group, NVIDIA, Durham, NC, USA. In 2015, he joined TSMC North America where he is currently a Technical Manager. His research interests include low-power and high-density memory circuit design with a focus on low-voltage operation and application specific circuit optimizations.
Dr. Sinangil was the recipient of the Ernst A. Guillemin Thesis Award from MIT for his Master’s thesis in 2008, the 2006 Bogazici University Faculty of Engineering Special Student Award, and corecipient of the 2008 A-SSCC Outstanding Design Award.

John W. Poulton (M’85–SM’90–F’12) received the B.S. degree from Virginia Polytechnic Institute and State University, Blacksburg, VA, USA, in 1967, the M.S. degree from the State University of New York, Stony Brook, NY, USA, in 1969, and the Ph.D. degree from the University of North Carolina, Chapel Hill (UNCCH), NC, USA, in 1980, all in physics.
From 1981 to 1999, he was a Researcher with the Department of Computer Science, UNCCH, where from 1995 he held the rank of Research Professor. From 1999 to 2003, he was a Chief Engineer with
Velio Communications, Milpitas, CA, USA. From 2003 to 2009, he was a Technical Director with Rambus, Inc., Chapel Hill, NC, USA. Currently, he is a Senior Scientist with NVIDIA, Inc., Durham, NC, USA. His research interests include VLSI-based architectures for graphics and imaging, design and construction of the pixel-planes and PixelFlow computer graphics systems.

Matthew R. Fojtik received the B.S., M.S., and the Ph.D. degrees in electrical engineering from the University of Michigan, Ann Arbor, MI, USA, in 2008, 2010, and 2013, respectively.
He joined NVIDIA, Durham, NC, USA, as a member of the Circuits Research Group and is currently a member of NVIDIA’s ASIC/VLSI Research Group. His research interests include timing margin reduction techniques, clocking and synchronization, low power on-chip communication, and efficient VLSI methodologies.

Thomas H. Greer III received the B.S. degree in mathematics and physics from the University of the South, Sewanee, TN, USA, in 1984, and the M.S. degree in computer science from the University of North Carolina, Chapel Hill, NC, USA, in 1988.
He is currently with NVIDIA Corporation, Durham, NC, USA. His research interests include efficient movement of data and pumpkins.

Stephen G. Tell was born in New Jersey, USA, in 1967. He received the B.S.E. degree in electrical engineering from Duke University, Durham, NC, USA, in 1989, and the M.S. degree in computer science from the University of North Carolina, Chapel Hill, NC, USA, in 1991.
He was a Senior Research Associate with UNC/Chapel Hill, Chapel Hill, NC, USA, from 1991 to 1999, worked on parallel graphics systems and high speed signaling, and in 1999 joined Chip2Chip Inc., San Jose, CA, USA. From 2003 to 2009, he worked with Rambus, Inc. In 2009, he joined NVIDIA, Durham, NC, USA, as a member of the Circuits Research Group. His research interests include custom circuit design and the surrounding logic for intra- and interchip communication.

Andreas J. Gotterba (S’02–M’05) received the B.S. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 2003, and the M.Eng.Sc. degree in photovoltaics from the University of New South Wales, Sydney, Australia, in 2004.
From 2005 to 2009, he worked with Novelics LLC, Aliso Viejo, CA, USA. In 2009, he joined NVIDIA, Santa Clara, CA, USA. His research interests include SRAMs and other custom circuits, particularly low-power and handshaking designs.

Jesse Wang (M’14) received the B.S. degree in computer engineering from the University of California, Irvine, CA, USA, in 2006, and a graduate certificate in electronic circuits from the Stanford University, Stanford, CA, USA, in 2010.
From 2005 to 2009, he worked with Novelics LLC, Aliso Viejo, CA, USA. In 2009, he joined NVIDIA Corporation, Santa Clara, CA, USA, as a member of the Custom Circuit Design Team. His research interests include driving custom implementation for CPU L2 data caches.

Jason Golbus received the B.S. degree in electrical engineering from Duke University, Durham, NC, USA, in 1996, and the M.S. degree in electrical engineering from the University of California, Berkeley, CA, USA, in 1998.
He joined NVIDIA in 2009 and has been leading the SRAM custom design team for several generations of Tegra processors.

Brian Zimmer (S’09–M’15) received the B.S. degree in electrical engineering from the University of California at Davis, Davis, CA, USA, in 2010, and the M.S. and Ph.D. degrees in electrical engineering and computer sciences from the University of California at Berkeley, Berkeley, CA, USA, in 2012 and 2015, respectively.
He is currently with the Circuits Research Group, NVIDIA Corporation, Santa Clara, CA, USA. His research interests include energy-efficient digital design, with an emphasis on low-voltage SRAM design and variation tolerance.

William J. Dally (M’80–SM’01–F’02) received the B.S. degree in electrical engineering from Virginia Tech, Blacksburg, VA, USA, in 1980, the M.S. degree in electrical engineering from Stanford University, Stanford, CA, USA, in 1981, and the Ph.D. degree in computer science from Caltech, Pasadena, CA, USA, in 1986.
He is a Chief Scientist and Senior Vice President of Research with NVIDIA Corporation, Durham, NC, USA, and a Professor (Research) and Former Chair of Department of Computer Science, Stanford University, Stanford, CA, USA. He currently leads projects on computer architecture, network architecture, circuit design, and programming systems. He has authored over 200 papers in these areas, holds over 90 issued patents, and authored textbooks Digital Design: A Systems Approach, Digital Systems Engineering, and Principles and Practices of Interconnection Networks.
Dr. Dally is a member of the National Academy of Engineering, a fellow of the ACM and American Academy of Arts and Sciences. He was the recipient of ACM Eckert-Mauchly Award, the IEEE Seymour Cray Award, and the ACM Maurice Wilkes Award.

C. Thomas Gray (M’89) received the B.S. degree in computer science and mathematics from Mississippi College, Clinton, MS, USA, in 1988, and the M.S. and Ph.D. degrees in computer engineering from North Carolina State University, Raleigh, NC, USA, in 1990 and 1993, respectively.
From 1993 to 1998, he was an Advisory Engineer with IBM, Research Triangle Park, NC, USA, working in the area of transceiver design for communication systems. From 1998 to 2004, he was a Senior Staff Design Engineer with the Analog/Mixed Signal Design Group, Cadence Design Systems, Bracknell, UK, working on SerDes system architecture. From 2004 to 2010, he was Consultant Design Engineer with Artisan/ARM, San Jose, CA, USA, and Technical Lead of SerDes architecture and design. In 2010, he joined Nethra Imaging, Cupertino, CA, USA, as a System Architect, and in 2011 he joined the Circuits Research Group, NVIDIA, Durham, NC, USA, where he currently serves as Director of Circuit Research and leads activities related to high-speed signaling, low-energy memories, variation tolerant clocking, and power delivery. His research interests include digital signal processing design and CMOS implementation of DSP blocks as well as high-speed serial link communication systems, architectures, and implementation.