

Welcome

CEA is a French government-funded technological research organization. Drawing on its
excellence in fundamental research, its activities cover three main areas: Energy,
Information and Health Technologies, and Defense and Security. As a prominent player in
the European Research Area, with an internationally acknowledged level of expertise in its
core competencies, CEA is involved in setting up collaborative projects with many partners
around the world.

Within the CEA Technological Research Division, three institutes conduct research to
increase industrial competitiveness through technological innovation and transfer:
CEA-LETI, focused on microelectronics, information and healthcare technologies;
CEA-LIST, dedicated to technologies for digital systems; and CEA-LITEN, devoted to new
energy technologies.
The CEA-LETI is focused on micro and nanotechnologies and their applications, from
wireless devices and systems, to biology and healthcare or photonics. Nanoelectronics and
Microsystems (MEMS) are at the core of its silicon activities. As a major player in the
MINATEC innovation campus, CEA-LETI operates 8,000 m² of state-of-the-art clean rooms,
in 24/7 mode, on 200 mm and 300 mm wafer platforms. With 1,700 employees, CEA-LETI
trains more than 240 Ph.D. students and hosts 200 assignees from partner companies.
Strongly committed to the creation of value for the industry, CEA-LETI puts a strong
emphasis on intellectual property and owns more than 1,880 patent families.
For more information, visit http://www.leti.fr.
The CEA-LIST is a key player in Information and Communication Technologies. Its
research activities are focused on Digital Systems with major societal and economic
stakes: Embedded Systems, Ambient Intelligence and Information Processing. With its 650
researchers, engineers and technicians, the CEA-LIST performs innovative research in
partnership with major industrial players in the fields of ICT, Energy, Transport, Security &
Defence, Medical and Industrial Process.
For more information, visit http://www-list.cea.fr.

The Design Architectures & Embedded Software research activity is shared between
CEA-LETI and CEA-LIST through a dedicated division. More than 240 people focus
on RF, digital and SoC design, imaging circuits, design environments and embedded software.
These researchers perform work for both internal clients and outside customers, including
Nokia, STMicroelectronics, Sofradir, MicroOLED, Cassidian, Trixell, Kalray, Delphi, Renault,
Airbus, Schneider Electric, Magillem, etc.

Contents

Interview: Thierry Collette, Head of Architecture & IC Design, Embedded Software Division
Key Figures
Scientific Activities
Architecture & IC Design for RF & mmW
Architecture & IC Design for Image Sensors
Architecture, IC Design & Control for Digital SoCs
Architecture & IC Design for Emerging Technologies
Embedded Software
Reliability & Test
PhD Degrees Awarded

A Wide Spectrum for

Interview with Thierry Collette,


Head of Architecture & IC Design,
Embedded Software Division
Dear reader,

CEA-Leti / L. Godart

Thirty years ago, the microelectronics revolution ushered in a new age of communication.
In parallel, another revolution began: that of computing science. We are now entering a
new one, which is ultimately the synthesis of both. With the success of the Internet and
new needs in fields such as health, transportation and security, more and more computing
devices are becoming smart and connected, opening new research fields: efficient big data
management, and the integration of hardware and software know-how inside integrated
embedded systems. Furthermore, what we are witnessing today with smartphones, tablets and
onboard computers will spread to many kinds of devices: the Internet of Things is
emerging.
Our multidisciplinary platform dedicated to Integrated Circuit Design and Embedded
Software allows us to address this new trend. By joining these two fields of know-how,
CEA is one of the first research organizations in Europe to bring such an original and
global offer to industry, providing a wide range of capabilities oriented towards
application analysis and the exploration of integrated embedded architectures.
This platform includes the tools, methods and human competencies from front-end &
back-end integrated circuit design (digital, analog & mixed) in the most advanced
technologies, complex circuit emulation, hardware/software integration, to
industrial test and reliability.
We hope reading this scientific report will convince you that this wide spectrum platform
brings specific innovations, creating new opportunities to fulfill our first mission: support
and promote the industry by innovation and technology transfer.
Thierry Collette

2012 Key Figures

3 locations:
MINATEC campus (Grenoble)
Integration Research Center (Gières)
PARIS-SACLAY Campus (Palaiseau)

160 permanent researchers,
65 PhDs and post-docs

Full suite of IC CAD tools, hardware emulators and test equipment
for analog, RF & digital circuits.

34M budget
85% funding from contracts

37 granted patents
29 papers, journals & books
136 conferences & workshops

Credits CEA-Leti / CEA-List

Scientific Activity

Publications
165 publications in 2012, including journals and Top conferences like ISSCC, VLSI Circuits
Symposium, DAC, DATE, PIERS, ESSCIRC, RTSS and ESWeek.

Prizes and Awards


IEEE SOI Conference 2012 Best Paper Award granted to Olivier Thomas et al.
HIPEAC Paper Award granted to Antoine Joubert et al. for their DAC 2012 paper
ATC 2012 Best Student Paper Award granted to Ngoc-Mai Nguyen

Experts
31 CEA experts: 2 research directors, 2 international experts
9 Researchers with habilitation qualification (to independently supervise doctoral candidates)
2 IEEE Senior Members

Scientific Committees
Editorial Boards: Journal of Low Power Electronics
19 members of Technical Program and Steering Committees in major conferences: ISSCC,
ESSCIRC, DAC, DATE, ESWEEK, RTNS, IJCNN, IWANN, EMSOFT
Standardization committee: AUTOSAR (Automotive Open System Architecture)

Conference and Workshop Organization


MPSoC 2012, DTC 2012, D43D 2012, VARI 2012, ICE 2012

International Collaborations
Collaborations with more than 20 universities and institutes worldwide
Caltech, University of California Berkeley, Columbia University, Carnegie Mellon University, EPFL, CSEM,
UCL, Politecnico di Torino, KIT, Chalmers University, Tongji, ...


Wireless Sensor Node


UWB Localization
Power Amplifiers
RF & mmW Passives
RF BIST

Architecture &
IC Design
For RF & mmW


Simulation Infrastructure for Energy Autonomous Wireless Sensor Networks
with Sense & React Capability
Research topics : Wireless Sensor Networks, Energy Harvesting, Sense & React
C. Bernier, A. Didioui, D. Morche, O. Sentieys (IRISA)
ABSTRACT: Papers [1,2] present the simulation framework developed within the GRECO (GREen Communicating
Objects) project. The aim of the GRECO project is to design an energy-efficient wireless platform that is totally
autonomous thanks to energy harvesting capabilities and adaptive power management. To reach this goal, the GRECO
partners are developing a simulation framework that will allow complete modeling of the platform in order to
evaluate different power optimization strategies leading to energy neutral operation.

The huge variety of Wireless Sensor Network (WSN) applications, ranging from
environmental monitoring to healthcare and smart homes, requires modular and
reconfigurable platforms. Additionally, these systems have to be low cost, comply
with severe size constraints, and demonstrate very high energy efficiency, since the
most important constraint in WSNs remains energy consumption. Several technologies
have been developed for harvesting energy from our surroundings, such as solar, wind
and vibration energy. Since environmental energy can be scavenged for as long as
desired, the system can theoretically reach an infinite lifetime if a Power Manager (PM)
keeps the consumed energy lower than the harvested energy over a long period, a
condition known as energy neutral operation (ENO).
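
The energy neutral operation condition can be illustrated with a minimal sketch. It assumes a simple two-state node (active or asleep) with constant power in each state. The function names and the numerical values are hypothetical and are not taken from the GRECO framework.

```python
def max_duty_cycle(p_harvest_w, p_active_w, p_sleep_w):
    """Largest active-time fraction for which the average consumed power
    does not exceed the average harvested power (two-state node model)."""
    if p_active_w <= p_sleep_w:
        raise ValueError("active power must exceed sleep power")
    if p_harvest_w <= p_sleep_w:
        return 0.0  # harvested power cannot even sustain sleep mode
    duty = (p_harvest_w - p_sleep_w) / (p_active_w - p_sleep_w)
    return min(duty, 1.0)


def is_energy_neutral(harvested_j, consumed_j):
    """ENO check over an observation window: total consumed energy must
    stay at or below total harvested energy."""
    return sum(consumed_j) <= sum(harvested_j)


# Hypothetical values: 1 mW harvested, 30 mW active, 10 uW asleep.
duty = max_duty_cycle(1e-3, 30e-3, 10e-6)
print(f"maximum sustainable duty cycle: {duty:.2%}")
```

A power manager built on this idea would lengthen the wake-up period when the harvested power drops, which is the behavior attributed to the PM in the text.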

Figure 1: Generic architecture of an energy-autonomous wireless sensor node.

As the energy used by the radio transceiver represents the major part of the energy
consumed by a WSN node, adaptive MAC layer policies are an active field of research.
For example, in Fig. 1, the power manager, considered as the core of the energy
harvesting wireless sensor node, controls the wake-up period of the microcontroller
and the radio transceiver according to the harvested energy, hence keeping the node
in energy neutral operation.
Clearly, the development of novel applications and deployment scenarios based on such
adaptive platforms must be assisted by the simultaneous development of a dedicated
simulation framework. Contrary to existing network simulators, this framework must be
able to model (1) the energy harvesting subsystem, which is highly dependent on time
and environmental factors, (2) the cross-layer adaptive power management techniques
and their associated power cost, and (3) a detailed interference model for the
radiofrequency (RF) environment. This last point is crucial in the context of a
physical (PHY) layer with sense & react capability: considerable power savings can be
obtained when an RF transceiver is able to instantaneously adapt its level of
performance to the time-varying conditions of the propagation channel.
Since interference (Fig. 2) can lead to packet data loss, missed alarms, delay, loss
of synchronization, etc., many authors have investigated its impact on WSNs. However,
none has studied the problem of interference due to intermodulation, which is caused
by the nonlinearity of the RF receiver. Unfortunately, linearity typically comes at
the cost of increased power consumption.

Figure 2: Interference between desired user and nodes of adjacent networks.

We therefore propose in [2] a new SINR model for the investigation of performance
degradation of WSNs under intermodulation interference.

This model has been implemented in the GRECO simulation platform, hence enabling the
study of different dynamic power/performance tradeoff strategies with the aim of
specifying a new sense & react transceiver for perpetually powered autonomous sensor
networks.
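
To make the role of receiver nonlinearity concrete, the sketch below computes an SINR in which a third-order intermodulation product is counted as interference. It is not the model of [2]. It only uses the textbook two-tone relation for the input-referred IM3 power, P_IM3 = 2·P1 + P2 − 2·IIP3 (all in dBm), and every function name and numerical value is an assumption chosen for illustration.

```python
import math


def dbm_to_mw(p_dbm):
    return 10 ** (p_dbm / 10)


def mw_to_dbm(p_mw):
    return 10 * math.log10(p_mw)


def im3_dbm(p1_dbm, p2_dbm, iip3_dbm):
    """Input-referred third-order intermodulation product at 2*f1 - f2
    (textbook two-tone approximation, valid well below compression)."""
    return 2 * p1_dbm + p2_dbm - 2 * iip3_dbm


def sinr_db(signal_dbm, noise_dbm, cochannel_dbm, blockers_dbm, iip3_dbm):
    """SINR where interference is the sum of co-channel interferers and the
    IM3 product of two adjacent-channel blockers falling in band."""
    interference_mw = sum(dbm_to_mw(p) for p in cochannel_dbm)
    p1, p2 = blockers_dbm
    interference_mw += dbm_to_mw(im3_dbm(p1, p2, iip3_dbm))
    return signal_dbm - mw_to_dbm(dbm_to_mw(noise_dbm) + interference_mw)


# Hypothetical scenario: -80 dBm signal, -100 dBm noise, two -30 dBm blockers.
for iip3 in (-10, 0):
    print(iip3, round(sinr_db(-80, -100, [], (-30, -30), iip3), 2))
```

In this sketch, raising the IIP3 by 10 dB lowers the IM3 product by 20 dB, which shows why a transceiver that can trade linearity against power consumption benefits from sensing the interference level.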

References:
[1] Berder O., Sentieys O., Le T. N., Fontaine R., Pegatoquet A., Belleudy C., Auguin M., Tatinian W., Jacquemod G., Broekaert F., Didioui A.,
Bernier C., Benchehida K., Bourdel S., Barthelemy H., Ciais P. & Barratt C., "GRECO: GREen communicating objects." Design and Architectures
for Signal and Image Processing (DASIP), 2012 Conference on: 1-2.
[2] Didioui A., Bernier C., Morche D. & Sentieys O., "Impact of RF front-end nonlinearity on WSN Communications.", 2012 9th International
Symposium on Wireless Communication Systems, ISWCS 2012, 28 August 2012 - 31 August 2012: 875-879.


Robust and Precise Localization with Double Quadrature Receivers
Research topics : UWB, Localization, Beamforming, Antennas
F.Bautista, D. Morche, G.Masson, F.Dehmas, S.Bories
ABSTRACT: In this work, several refinements have been added to the localization techniques in impulse radio in
order to improve the precision, the range as well as the robustness of the existing techniques. The receiver
architecture exploited in this approach is the double quadrature. This solution has shown its capability to reach
fine ranging precision in the cm range [1]. Then a multi-antenna approach has been exploited to extend the
range of the receiver and to extract the Angle-of-Arrival information. More recently, we have shown that double
quadrature architecture shows better robustness to antenna characteristics than the classical single quadrature
[4]. Lastly, a new approach has been proposed to recover the same performances with single quadrature.
The need for surveyed positions in civil safety and military applications requires a
new generation of Impulse Radio Ultra Wide Band (IR-UWB) technology with a range of
up to several km, capable of communication and precise localization with low power
consumption. Up to now, most IR-UWB localization solutions were based on non-coherent
receivers with poor performance. In [1], we presented LORELEI, to our knowledge the
first IR-UWB receiver working in the authorized 3-5 GHz frequency band and reaching a
ranging accuracy better than 10 cm. The fine localization is obtained thanks to the
double quadrature architecture. The sampled baseband architecture provides the
flexibility needed to cope with various channel conditions and to shorten the
synchronization phase. Even if a range of several hundred meters can be obtained with
this solution, it may be desirable to further extend the range as well as the
localization performance.

Figure 1: LORELEI architecture (block labels: LNA; LO1I, LO1Q; LO2I, LO2Q; ADC;
outputs Out_II, Out_IQ, Out_QI, Out_QQ)

In [2] and [3] we exploited a multi-antenna scheme to enhance the performance. By
using four antennas and LORELEI ICs, we can achieve beamforming functionality with a
simple digital algorithm. It increases the range and can be exploited to reduce the
power of unwanted blockers. It can also be exploited to extract independently the
Angle-of-Arrival of each path of the impulse channel response. The error is lower
than 2 degrees over a wide angle range. This functionality opens the door to new
localization algorithms which can combine Angle-of-Arrival and time of arrival for
all distinguishable paths. This can undoubtedly improve the performance of the
existing localization system.
LORELEI performance opens the door to a new kind of application where a small RFID
tag (with energy scavenging or remote power) can be precisely identified and
localized. The development of such applications requires very small tags and readers,
such that this equipment is not noticeable by users. The bottleneck in size reduction
is the antenna. Reducing its size far below the wavelength impacts its radiation
efficiency, and its tight integration on the device disturbs its omni-directionality
in magnitude and in phase. As can be seen in Figure 2, the radiated signal becomes
dependent on the elevation angle. This may impact the performance of the ranging
system.

Figure 2: Received signal for 40° and 145° elevations

In [4], we have shown that the performance of the classical single quadrature
receiver degrades when faced with such phenomena. On the other hand, in the LORELEI
receiver, the signal is projected on an orthogonal basis of two signals. As a
consequence, with a 0.6 cm worst-case error, the system appears to be really robust
against deviations of the antenna characteristics. This emphasizes the key impact of
dedicated and innovative architectures [5] in reaching high performance in IR-UWB
systems. Thanks to this approach, the obtained performance is among the most
competitive in the state of the art [6].
More recently, we have shown that by modifying the processing done in single
quadrature receivers, it is possible to reach the same robustness and precision, at
the cost of increased complexity. The next step will be to reach mm ranging
precision, in order to be able to consider a wider range of applications.
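
The ranging and angle figures quoted above can be related to timing with a short sketch. It assumes free-space propagation, a far-field source and a simple two-antenna geometry. It does not reproduce the LORELEI signal processing, and the function names and the antenna spacing are hypothetical.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def range_from_tof(tof_s):
    """One-way distance for a given time of flight."""
    return C0 * tof_s


def timing_resolution_for(range_error_m):
    """Timing accuracy needed to keep the ranging error below a target."""
    return range_error_m / C0


def angle_of_arrival_deg(delta_t_s, antenna_spacing_m):
    """Far-field angle of arrival from the arrival-time difference between
    two antennas (0 degrees means broadside)."""
    s = C0 * delta_t_s / antenna_spacing_m
    if abs(s) > 1.0:
        raise ValueError("time difference too large for this antenna spacing")
    return math.degrees(math.asin(s))


print(f"10 cm accuracy needs about {timing_resolution_for(0.10) * 1e12:.0f} ps")
print(f"1 mm accuracy needs about {timing_resolution_for(0.001) * 1e12:.1f} ps")
```

Under these assumptions a 10 cm ranging accuracy corresponds to roughly a third of a nanosecond of timing accuracy, and the mm precision named as the next step would require a timing accuracy about one hundred times finer.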

References :
[1] G. Masson et al., "A 1 nJ/b 3.2-to-4.7 GHz UWB 50 Mpulses/s Double Quadrature Receiver for Communication and
Localization," ESSCIRC 2010.
[2] Farid Bautista et al., "UWB Beamforming Architecture for RTLS applications using Digital Phase-Shifters," ISCAS 2011,
Rio de Janeiro.
[3] Farid Bautista, Dominique Morche, François Dehmas and Gilles Masson, "Low power beamforming RF architecture enabling
fine ranging and AOA techniques," ICUWB 2011.
[4] Farid Bautista, Dominique Morche, Serge Bories, Gilles Masson, "Antenna Characteristics and Ranging Robustness with
Double Quadrature Receiver and UWB Impulse Radio," ICUWB 2012.
[5] D. Morche, M. Pelissier, G. Masson, P. Vincent, "UWB: Innovative Architectures Enable Disruptive Low Power Wireless
Applications," DATE 2012.
[6] S. Bourdel, G. Gielen, S. Damico, D. Wisland, B. Busze, D. Neyrinck, J. Jantunen, D. Morche, "Advanced Tutorial on UWB
Circuits and Systems," Workshop at ESSCIRC 2012.


Design of a Fully Integrated CMOS Self-Testable RF Power Amplifier
Using a Thermal Sensor
Research topics : CMOS power amplifiers, RF built-in self-test, temperature sensors
J.L. González, J. Altet (UPC), N. Deltimple (IMS), Y. Luque (IMS), E. Kerhervé (IMS)
ABSTRACT: This research work presents a wideband RF power amplifier (PA) dedicated to 2GHz applications
integrating a contact-less temperature sensor that allows on-chip observation and testing of the PA. Indeed,
based on the static and dynamic local temperature changes caused by the PA operation, the thermal sensor can
sense parameters such as output power or efficiency. This principle is applied to a 65nm CMOS PA with an OCP1
of 21dBm. We demonstrate that the output voltage of the thermal sensor follows the PA efficiency under single
tone and multi-tone input signal conditions.

Testing, and mainly its cost, is becoming crucially important for the success of RF
SoC products for mass markets. Test cost is directly related to testing time and to
the cost of test equipment. One strategy to enhance yield and to ease RF test
consists in incorporating on-chip sensors that measure the operation of the circuit
under test (CUT). In most cases, the sensors require contact with electrical nodes of
the RF circuit and high-frequency signal processing, at least at the input section of
the sensor. In this work we propose an alternative sensing strategy that requires no
contact with the CUT, since it is based on the measurement of the temperature
variations in the vicinity of the circuit [1]. This technique is especially well
suited for the observation of power amplifier characteristics [2,3] such as the 1 dB
compression point or bandwidth, which can be used for testing or for implementing
self-calibration loops.

Figure 1: Layout of the CMOS PA including a differential temperature sensor.

The idea behind this technique is that any modification of the balance between the
power drawn from the supply and the power provided to the load (or to the next stage)
results in a variation of the dissipated power, which can be detected as a local
temperature increase in the vicinity of the active devices of the CUT. We have
applied this technique to compare RF measurements with the results obtained with an
integrated temperature sensor for several figures of merit of a 2.5 GHz PA fabricated
in a 65 nm CMOS process, shown in Fig. 1.
A first set of measurements is shown in Fig. 2 for a single-tone input signal of
fixed frequency and varying power. When this type of signal is applied to the input
of the PA, the variation of the local temperature of the PA active devices (shown in
black in Fig. 2, right, as measured by the sensor) tracks the variation of on-chip
power and, therefore, of the power delivered to the load. The plot shows how the DC
value of the sensor output closely follows the PA efficiency figure of merit. A
second set of measurements is shown in Fig. 3. There, the input signal consists of
two tones of fixed spacing (10 kHz) and varying frequency. As the two tones are swept
over the PA bandwidth, the thermal signal observed at the two-tone beat frequency
tracks the PA bandwidth, with accuracy comparable to the conventional RF measurement.
These experiments demonstrate the potential of non-invasive, temperature-based
observation techniques for BIST or self-healing of RF circuits.

Figure 2: Left: Output power and gain of the PA obtained by RF measurements. Right:
comparison of efficiency, PAE and temperature sensor output as a function of input
power.

Figure 3: PA frequency domain characteristics extracted using RF measurements and
using the on-chip temperature sensor
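
The power balance that the thermal sensor relies on can be written out as a small sketch. It uses the usual definitions of drain efficiency and power-added efficiency (PAE) and treats everything that is not delivered to the load as heat. The function names and the operating point are hypothetical and are not measurements from this work.

```python
def dbm_to_w(p_dbm):
    return 10 ** ((p_dbm - 30) / 10)


def pa_figures(p_dc_w, p_in_dbm, p_out_dbm):
    """Efficiency, PAE and dissipated power of a PA from its DC supply power
    and its RF input and output powers."""
    p_in = dbm_to_w(p_in_dbm)
    p_out = dbm_to_w(p_out_dbm)
    return {
        "efficiency": p_out / p_dc_w,
        "pae": (p_out - p_in) / p_dc_w,
        # Power entering the chip minus power leaving it ends up as heat.
        "dissipated_w": p_dc_w + p_in - p_out,
    }


# Hypothetical operating point: 0.5 W from the supply, 0 dBm in, 20 dBm out.
print(pa_figures(0.5, 0.0, 20.0))
```

For a fixed supply power, any increase in delivered RF power reduces the dissipated power by the same amount, which is why a local temperature reading can follow the efficiency.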

References :
[1] D. Gómez, C. Dufis, J. Altet, D. Mateo, J. L. González, "Electro-thermal coupling analysis methodology for RF circuits,"
Microelectronics Journal, Vol. 43, No. 9, September 2012, pp. 633-641.
[2] J.L. González, B. Martineau, D. Mateo, and J. Altet, "Non-invasive monitoring of CMOS power amplifier operating at RF and
mmW frequencies using an on-chip thermal sensor," 2011 IEEE Radio Frequency Integrated Circuits (RFIC) Symposium Digest of
Papers, pp. 1-4, 5-7 June 2011.
[3] Deltimple N., González J.L., Altet J., Luque Y., & Kerhervé E., "Design of a fully integrated CMOS self-testable RF power
amplifier using a thermal sensor," in Proceedings of the 38th European Solid State Circuits Conference, ESSCIRC 2012,
17 September 2012 - 21 September 2012: 398-401.


SOI CMOS RF Power Amplifier and Tunable Matching Network for
Integrated RF Front-Ends
Research topics : SOI, CMOS, Power Amplifier, Tunable Matching, RF Front-End
A. Giry, G. Tant
ABSTRACT: A high integration level and tunable RF functions in SOI CMOS technology are key enablers to make
smaller and more cost-effective RF Front-Ends. In this work, a two-stage SOI LDMOS linear PA and a SOI CMOS
Tunable Matching Network have been designed and characterized. The obtained results represent a new step
towards high efficiency integrated RF Front-Ends for future multimode multiband cellular applications.

Next-generation wireless terminals and access points will have to handle an increased
number of standards and frequency bands, which translates into great challenges and
stringent requirements for the RF front-end (RFFE) section. Multiple Power Amplifiers
(PA), RF switches and filters will be needed, which will increase the size and cost
of the RFFE section, especially if the multiple technologies (GaAs, SAW, IPD)
currently required to achieve adequate performance cannot be circumvented. A higher
integration level is the key to making smaller and more cost-effective RFFEs, and SOI
CMOS technology, which provides an attractive trade-off among performance, cost and
integration capability, appears today as a key technology. In addition, size
constraints lead to intrinsically small antennas which are very sensitive to their
environment and experience wide impedance variations, leading to large mismatch
losses and a significant degradation of RFFE energy efficiency.
The proposed research work aims at investigating SOI CMOS technology for the design
of highly integrated RFFEs with reduced power consumption. To meet the needs of
future cellular RFFEs, a watt-level SOI LDMOS PA with high linearity and efficiency
has been developed [1], together with a low-loss SOI CMOS Tunable Matching Network
(TMN) allowing improved energy efficiency under various mismatch conditions. The
proposed PA and TMN have been implemented in a 0.13 µm SOI CMOS industrial process
with a high-resistivity substrate. Fig. 1 shows a micrograph of the two-stage PA,
which occupies an area of 0.84 mm² and has been designed using a high-voltage LDMOS
power device to get high efficiency and Through Silicon Vias (TSV) [2] for efficient
ground connection. At 900 MHz under a 3.6 V supply voltage, the LDMOS power stage
delivers up to +33.2 dBm of peak power with a maximum efficiency of 60%. When tested
with a 10 MHz bandwidth 16QAM uplink LTE signal, the two-stage PA provides a high
linear output power of +27 dBm with less than 3% EVM. Fig. 2 shows a micrograph of
the SOI CMOS TMN based on integrated high-power tunable capacitors which consist of
arrays of binary-weighted switched capacitors. Each tunable capacitor exhibits 32
states and has been designed to cover the range 0.7-2.8 pF with a minimum quality
factor of 40 at 2.7 GHz and a maximum power rating of +36 dBm. Control logic is
integrated on-chip and allows the selection of appropriate capacitance values through
an integrated SPI interface. The TMN circuit occupies an area of 1.6 mm² and operates
under a 2.5 V supply voltage. As can be seen in Fig. 2, the TMN is centered on
50 Ohms and provides good impedance coverage at 1.95 GHz. When combined with a
miniature dual-band antenna [3], the TMN succeeds in reducing the reflection losses
to less than 0.5 dB and maintains a fairly constant radiated power even in cases of
strong perturbations created by a metallic plane close to the antenna.
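
The tunable capacitor described above can be sketched as a lookup from a digital code to a capacitance. The 32 states and the 0.7-2.8 pF range come from the text. The uniform step size, the 5-bit control word and the function names are assumptions made for illustration, and the real array may not be perfectly linear.

```python
C_MIN_PF = 0.7   # lowest capacitance state, from the text
C_MAX_PF = 2.8   # highest capacitance state, from the text
N_STATES = 32    # number of states, from the text (assumed here to be 5 bits)


def capacitance_pf(code):
    """Capacitance for a control code, assuming uniform steps."""
    if not 0 <= code < N_STATES:
        raise ValueError("code out of range")
    step = (C_MAX_PF - C_MIN_PF) / (N_STATES - 1)
    return C_MIN_PF + code * step


def nearest_code(target_pf):
    """Control code whose capacitance is closest to the requested value."""
    return min(range(N_STATES), key=lambda c: abs(capacitance_pf(c) - target_pf))


def switch_states(code, bits=5):
    """On/off state of each binary-weighted branch, least significant first."""
    return [(code >> b) & 1 for b in range(bits)]


code = nearest_code(1.0)
print(code, round(capacitance_pf(code), 3), switch_states(code))
```

With these assumptions the step is about 0.068 pF. A control word sent over the SPI interface would select one such code per tunable capacitor.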

Figure 1: SOI LDMOS PA micrograph

Figure 2: SOI CMOS Tunable Matching Network micrograph (left)


and measured Smith chart coverage at 1.95GHz (right)

References :
[1] A. Giry, G. Tant, Y. Lamy, C. Raynaud, P. Vincent, A Monolithic Watt-level SOI LDMOS Linear Power Amplifier with Through Silicon Via,
2013 IEEE Topical Conference on Power Amplifiers for Wireless and Radio Applications (PAWR), 20-23 Jan. 2013
[2] http://www.leti.fr/en/How-to-collaborate/Collaborating-with-Leti/Open-3D
[3] L. Dussopt, M.A.C Niamien, A. Giry, A. Chebihi, S. Contal, F. Fraysse, S. Aissa, O. Perrin, C. Delaveaud, "Enhanced-efficiency front-end
module with multi-standard impedance-tunable antenna," IEEE 17th International Workshop on Computer Aided Modeling and Design of
Communication Links and Networks (CAMAD), pp.328-332, 17-19 Sept. 2012


Slow-Wave CPW and CPW in CMOS 65 nm SOI Technology:
A Benchmark
Research topics: CMOS, 60 GHz, transmission lines, CPW, quality factor
X.L. Tang (IMEP), A.L. Franc (IMEP), E. Pistono (IMEP), A. Siligaris, P. Vincent, P. Ferrari (IMEP),
and J.M. Fournier (IMEP)
ABSTRACT: In this work, slow-wave coplanar transmission lines (S-CPW) and standard CPW are compared through
measurements up to 65 GHz. Both S-CPW and CPW lines are fabricated in an industrial 65 nm SOI CMOS process with a
high-resistivity substrate. Due to the slow-wave effect, S-CPW lines achieve a high effective permittivity that
reduces the wavelength. As a result, very high quality factors are achieved, which demonstrates the value of this
structure for millimeter-wave (mmW) circuit design.

Millimeter-wave CMOS circuits have been intensively developed in the past decade in
order to respond to a growing demand for mass-market, high-throughput wireless
applications. The most popular approach for the design of matching networks and
inductive components is the use of microstrip (MS) and CPW lines [1]. However, CPW
and MS lines suffer, at high frequency, from high losses and low quality factor
because of the thin metallic layers in CMOS technologies. The concept of S-CPW
responds to this problem by exploiting the slow-wave phenomenon, which artificially
increases the effective permittivity (ε_eff). As a result, the wavelength is
decreased and thus the corresponding physical line length for a given phase shift is
reduced. This is illustrated in equation (1):
(1)
where α is the attenuation constant, v_φ is the spatial phase velocity and c0 the
speed of light in vacuum.
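
The effect of a higher effective permittivity can be checked numerically with the standard transmission-line relations: wavelength λ = c0 / (f·√ε_eff), phase constant β = 2π/λ and quality factor Q = β / (2α). These are textbook definitions and are not claimed to be the exact form of equation (1). The function names and the permittivity and attenuation values below are hypothetical.

```python
import math

C0 = 299_792_458.0  # speed of light in vacuum, m/s


def wavelength_m(freq_hz, eps_eff):
    """Guided wavelength for a given effective relative permittivity."""
    return C0 / (freq_hz * math.sqrt(eps_eff))


def phase_constant(freq_hz, eps_eff):
    """Phase constant beta in rad/m."""
    return 2 * math.pi / wavelength_m(freq_hz, eps_eff)


def line_length_m(phase_shift_rad, freq_hz, eps_eff):
    """Physical length needed to obtain a given phase shift."""
    return phase_shift_rad / phase_constant(freq_hz, eps_eff)


def quality_factor(freq_hz, eps_eff, alpha_np_per_m):
    """Q = beta / (2 * alpha), with alpha in Np/m."""
    return phase_constant(freq_hz, eps_eff) / (2 * alpha_np_per_m)


f = 60e9
for eps in (4.0, 16.0):  # hypothetical CPW-like and slow-wave-like values
    quarter_wave_um = line_length_m(math.pi / 2, f, eps) * 1e6
    print(eps, round(quarter_wave_um, 1), round(quality_factor(f, eps, 100.0), 1))
```

In this sketch, quadrupling ε_eff halves the length of a quarter-wave line and doubles Q for the same attenuation constant, which matches the qualitative argument made in the text.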
Figure 1 shows a 3-D schematic structure of an S-CPW line integrated in a CMOS
back-end with six copper metal layers and one aluminum top layer. It consists of a
conventional CPW line with a patterned metallic shield placed between the CPW and the
silicon substrate.
Transmission lines with two characteristic impedances (28 Ω and 65 Ω, respectively)
were fabricated in S-CPW and conventional CPW. Measurements were carried out up to
65 GHz using a two-port VNA. The extracted quality factor for each measured line is
shown in Figure 2.

Figure 2: Measured performance comparison of S-CPW and CPW quality factor Q.
Experimental CPW results at 60 GHz from [1]: square for 70 Ω CPW and triangle for
38 Ω CPW.

Thanks to the enhanced effective dielectric permittivity, the quality factors of the
S-CPW lines are significantly improved compared to those of the CPW lines. The
Q-factors are increased by factors of 4 and 2.5 for the 28 Ω and 65 Ω S-CPW,
respectively.
In this work, high-performance slow-wave lines fabricated in an advanced 65 nm HR-SOI
CMOS technology were characterized and optimized. Experimental results show that the
performance improvement of S-CPW, compared to conventional CPW, is mainly due to the
increase of the effective permittivity. At 60 GHz, the attenuation constant of S-CPW
is reduced by almost 40% and the effective relative permittivity is two to six times
higher, leading to a quality factor that is almost two to four times higher.

Figure 1: (a) 3-D schematic view of S-CPW structure. (b)


Schematic cross section of the 65 nm SOI CMOS back-end.

References :
[1] A. Siligaris, C. Mounet, B. Reig, and P. Vincent, "CPW and discontinuities modeling for circuit design up to 110 GHz in SOI CMOS
technology," in IEEE Radio Frequency Integrated Circuits (RFIC) Symposium, pp. 295-298, 2007.
[2] Xiao-Lan Tang, A.-L. Franc, E. Pistono, A. Siligaris, P. Vincent, P. Ferrari, and J.M. Fournier, "Performance Improvement Versus CPW and
Loss Distribution Analysis of Slow-Wave CPW in 65 nm HR-SOI CMOS Technology," IEEE Transactions on Electron Devices, vol.59, no.5,
pp.1279-1285, May 2012.


On the Electrical Properties of Slotted Metallic Planes in CMOS Processes for
RF and Millimeter-Wave Applications
Research topics : Interconnections, RF and mmW ICs
J.L. González, B. Martineau (STMicroelectronics), D. Belot (STMicroelectronics)
ABSTRACT: This research work focuses on the effects of slotted metallic planes in passive structures built
using CMOS processes for RF and millimeter-wave (mmW) applications. The impact of holes on the reference
plane resistance and on the capacitance between the plane and any surrounding structure is investigated through
electromagnetic (EM) simulations. Two analytical expressions are derived that capture the impact of the holes on the
plane resistivity and on the dielectric constant of the materials found between the plane and its surroundings.
These expressions are used to propose a simplified EM simulation methodology for on-chip microstrip
transmission lines.
Recent realizations of integrated radios operating at millimeter-wave frequencies
(several tens of GHz) [1,2], and to a lesser extent at RF frequencies (several GHz),
require the use of distributed passives such as transmission lines. These structures
must be fabricated by respecting strict manufacturing rules imposed by the
semiconductor processing tools and procedures. For the large-area metallic surfaces
that are required to build the reference ground planes of microstrip lines, for
example, the manufacturing rules impose a maximum density of metal, so that such
planes must be pierced with holes, as indicated in Figure 1.a. Up to now, little
attention has been paid to the impact of these modified planes with respect to the
ideal, continuous metallic plane that should be used if possible. Figure 1.b shows
the basic parameters of a section of a slotted plane. A basic cell consisting of a
section of the plane with a single hole can be defined, and the plane can be
considered as a 2D repetition of this basic cell. The relative size of the hole with
respect to the size of the basic cell sets the basic parameter for the plane: the
metal density (or, equivalently, its complement, the hole density).
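
The relation between hole size, cell size and metal density can be made explicit with a short sketch. It assumes a square hole centered in a square basic cell, which is an illustrative geometry and not necessarily the one in Figure 1.b. The function names and dimensions are hypothetical.

```python
def metal_density(hole_side_um, cell_side_um):
    """Fraction of the basic cell area covered by metal, for a square hole
    in a square cell (assumed geometry)."""
    if not 0 <= hole_side_um < cell_side_um:
        raise ValueError("hole must be smaller than the basic cell")
    return 1.0 - (hole_side_um / cell_side_um) ** 2


def hole_side_for_density(density, cell_side_um):
    """Hole side length that gives a target metal density."""
    if not 0 < density <= 1:
        raise ValueError("density must be in (0, 1]")
    return cell_side_um * (1.0 - density) ** 0.5


# Hypothetical 10 um cell: hole size required for a 40% metal density.
side = hole_side_for_density(0.4, 10.0)
print(round(side, 2), round(metal_density(side, 10.0), 2))
```

In this geometry the density depends only on the ratio of hole side to cell side, so the same metal density can be obtained with many small holes or fewer large ones.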

Figure 1 : (a) 3D view of an example microstrip transmission line
structure with slotted ground plane. (b) Details of the slotted
metallic plane.

In this research work we have analyzed the modification of
the electrical properties of the plane caused by the presence
of the holes, in comparison to an ideal, continuous plane
without holes (i.e. with a 100% metal density). Figure 2 shows
how the conductivity of the plane is reduced by a factor of 5
if the metal density is reduced to 40%. This modification of
the plane conductivity is observed for different plane
thicknesses, such as those obtained by using the various
metallization levels available in CMOS processes.
The holes opened in the plane also modify the electric field
between the plane and the surrounding structures, and hence,
for example, the capacitance of a line to the plane, modifying
in this way the electrical properties of the transmission lines.
Figure 3 shows the impact of different hole and basic cell
sizes on the capacitance of a line to the plane, where relative
changes by a factor of 3 are observed.

Figure 2 : Effective conductivity values obtained by simulation
(symbols) and comparison with the analytical model (lines).

Figure 3 : Comparison between EM simulation results and
predictions for capacitance of an M6 interconnection to a slotted
plane in several metal layers from the analytical model

The significant changes observed in the plane and
transmission line properties must be taken into account for
an accurate design of this type of structure. A simulation
strategy that goes in that direction is proposed in [3].

References :
[1] J.L. Gonzalez, F. Badets, B. Martineau, D. Belot, A 56-GHz LC-tank VCO with 17% tuning range in 65-nm bulk CMOS for wireless HDMI,
IEEE Trans. Microwave Theory Tech., 58 (2010).
[2] B. Martineau, V. Knopik, A. Siligaris, F. Gianesello, D. Belot, A 53-to-68 GHz 18 dBm power amplifier with an 8-way combiner in standard
65nm CMOS, in Proceedings of the IEEE International Solid-State Circuits Conference, February 2010, pp. 428-429.
[3] José Luis González, Baudouin Martineau, Didier Belot, On the electrical properties of slotted metallic planes in CMOS processes for RF and
millimeter-wave applications, Microelectronics Journal, Volume 43, Issue 8, August 2012, Pages 582-591.


BAW Filters for Ultra-Low Power
Narrow-Band Applications
Research topics : BAW, Ultra-Low-Power, Narrow-band
C. Bernier, J.-B. David
ABSTRACT: This paper presents an original method for the design of Bulk Acoustic Wave (BAW) filters for a new
class of applications: ultra-low-power, narrow-band RF filtering. To this end, a filter co-design methodology,
based on existing BAW resonator technology and fabrication processes, is developed and the link between
decreasing filter bandwidth and decreasing power consumption of the associated integrated circuit is
demonstrated. Depending on required bandwidth, the power dissipation of the driving electronics can be
reduced by large factors (10 to 40). High-IF receivers and wake-up receivers are the main applications.

In addition to classical out-of-band spurious signals, with the
increasing number of RF standards and devices, susceptibility
to in-band blockers is becoming a serious issue in wireless
systems. This is especially true in the context of shared RF
bands such as the 2.4GHz ISM band.
Faced with strong in-band blockers, the RF system architect
can either increase the linearity specification of the receiver,
which costs power, or increase the receiver selectivity, ideally
as close as possible to the antenna. There is therefore a need
for narrow-band filters at VHF/UHF frequencies [2]. Whereas
achieving this with active designs is prohibitive in terms of
power consumption, BAW filters [3] appear to be good
candidates if they can be made sufficiently narrow-band and
if the problem of frequency agility is solved, e.g. by using a
high intermediate frequency (IF) architecture.
Most IC blocks in low power architectures are designed to
optimize the voltage transfer characteristic and, for this
reason, high input and output impedances are favored.
However, the impedance of common RF filters (50Ω) is a
difficulty in low power designs, because driving such a low
impedance costs current.
The aim of this work is therefore to solve these two problems
of narrow required bandwidth and low power dissipation
simultaneously. To do so, we used a co-design approach that
optimizes the filter response while minimizing the power
consumption of the IC design, leading to an original
high impedance BAW filter topology.

Figure 1 : Circuit for voltage gain calculation

Instead of classical S parameters, the approach is based on
the notion of equivalent load, which is convenient for both
the BAW filter and IC designers. Considering the equivalent
circuit shown in fig.1, the voltage gain Gv=Vout/Vin can be
written as:

$$G_v = \frac{G_m \, Z_{21} \, Z_o \, Z_{out}}{2\,\left(Z_{out} Z_o + Z_{out} Z_{11} + Z_{22} Z_o + Z_{11} Z_{22} - Z_{21} Z_{12}\right)}$$

In the particular case where Gm=1, this expression is
homogeneous to an impedance, an equivalent load, which is
directly related to the S21 scattering parameter of the BAW.
We converged to a filter response which has both a narrow
bandwidth and a large equivalent load by drastically
reducing the frequency spacing between the series and
parallel resonators, down to the overlap area (see fig.2), which
is forbidden in classical power transmit BAW filter design.

Figure 2 : BAW filter Equivalent Load response for classical and
new frequency offset
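
The equivalent load can be evaluated numerically from the two-port Z parameters of the filter. The sketch below is only an illustration: the function names are hypothetical, and the minus sign in front of the Z21·Z12 term is an assumption based on the usual determinant form of a terminated two-port, since the sign is not legible in the extracted text.

```python
def voltage_gain(gm, z11, z12, z21, z22, z_o, z_out):
    """Gv = Gm*Z21*Zo*Zout / (2*(Zout*Zo + Zout*Z11 + Z22*Zo + Z11*Z22 - Z21*Z12)).

    All impedances may be complex numbers (ohms); gm is in siemens.
    The sign of the Z21*Z12 term is assumed, see the text above.
    """
    denominator = 2.0 * (z_out * z_o + z_out * z11 + z22 * z_o
                         + z11 * z22 - z21 * z12)
    return gm * z21 * z_o * z_out / denominator


def equivalent_load(z11, z12, z21, z22, z_o, z_out):
    """Equivalent load: the gain expression evaluated with Gm = 1."""
    return voltage_gain(1.0, z11, z12, z21, z22, z_o, z_out)
```

As a sanity check, a two-port reduced to a very large shunt impedance (Z11 = Z12 = Z21 = Z22) gives an equivalent load equal to half of Zo in parallel with Zout, which is what the expression predicts when the filter is transparent.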

Note that these filter responses are obtained with resonator
parameters extracted from state of the art devices. The only
difference with respect to existing process flows is the
loading layer thickness, which must be modified to create
smaller frequency offsets.
The filter design methodology described in this work has
been explored in a High-IF architecture for the ISM band. It
can also be used to design an extremely selective, ultra-low
power RF gain stage for a wake-up radio where typical input
and output impedances are small capacitances (e.g. 100fF),
allowing the design of a lattice filter with an equivalent load
greater than 1kΩ and BW-3dB<4MHz.

References :
[1] C. Bernier, J.-B. David, "BAW Filters for Ultra-Low Power Narrow-Band Applications", IEEE International Conference on IC Design &
Technology (ICICDT), 2012.
[2] L. Lolis, T. Ayed, C. Bernier, M. Pelissier, D. Dallet, J. B. Bégueret, Ultra Low Power Bandpass Sampling Architectures Using Lamb Wave
Filters, IEEE NEWCAS, 2010.
[3] A. Flament, A. Frappe, B. Stefanelli, A. Kaiser, A. Cathelin, S. Giraud, M. Chatras, S. Bila, D. Cros, J.-B. David, L. Leyssenne, E. Kerherve,
"A complete UMTS transmitter using BAW filters and duplexer, a 90-nm CMOS digital RF signal generator and a 0.25-µm BiCMOS power
amplifier", International Journal of RF and Microwave Computer-Aided Engineering 21 (2011) 466-476.


A Frequency Measurement BIST
Implementation Targeting
GigaHertz Applications
Research topics : Design for test, Radiofrequency measurements
M.Dubois, E. de Foucauld, C. Mounet, S. Dia (Presto Engineering) and C.Mayor (Presto Engineering)
ABSTRACT: We propose a Built-In Self-Test (BIST) technique for measuring the natural resonance frequency of
oscillators running much faster than the working speed of current Automated Test Equipment (ATE).
Based on an asynchronous counter, the BIST response is a digital output code proportional to the
frequency of the oscillator under test. The efficiency of the proposed BIST is demonstrated on an Ultra-Wide-Band transceiver, whose communication frequency lies in the 7.25GHz to 8.5GHz band.

We present a BIST dedicated to the frequency measurement
of a high frequency transceiver based on the Super
Regenerative Oscillator (SRO) principle. The suggested BIST
architecture relies on an asynchronous counter that deduces
the frequency by counting the number of oscillation periods
within a given period of time. The digital output test
response suits any digital communication and processing
system, which can use the information for test purposes
and/or self-calibration or compensation techniques.
To comply with standardization rules, the operating frequency
is set to F=7.875GHz, the center of the 7.25GHz to 8.5GHz
UWB band. For this application, the optimal BER is reached
when the oscillator resonance frequency matches the central
frequency of the input signal.
The super-regenerative receiver with its BIST is implemented
in a 0.13µm CMOS technology. The connection between the
output of the SRO and the input of the BIST has to be as
short as possible to limit the parasitic capacitance of the net
and the crosstalk with other signals. On the other hand,
placing the BIST close to this critical path increases the risk
of degrading the system performance, even when the BIST is
switched off.

A close ground line could increase crosstalk in the reception
path. Therefore, only the first stage is placed very close to
the SRO, whereas the next stages are located as far as
possible from the output of the SRO.
Figure 1 shows the SRO resonance frequency measured with
the ATE resources (FMeas) and with the BIST technique
(FBIST). This scatter plot shows the excellent correlation
between both measurements.

Figure 1: Frequency measurement of the SRO resonance with the
ATE resources and the BIST technique.

Figure 2: Picture of the chip connected to the ATE Verigy 93K.

Figure 2 shows the chip connected in its socket. The SRO
output is connected to the mixer through the coaxial cable,
whereas the BIST uses only digital pins of the tester.
We have proposed a complete BIST technique for measuring
the high oscillation frequency of a fully integrated front-end
designed for UWB transmission systems. To achieve this, we
first derive a proportional, lower frequency clock signal from
the high frequency oscillation. This clock is then used to
increment an asynchronous counter. The final counter state
enables a direct computation of the oscillation frequency.
Experimental results are excellent and confirm the results
expected from thorough electrical simulations. The
comparison of the BIST technique with the standard test
setup shows a negligible difference in the frequency
measurement for a test time reduced by a factor of 20.
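
The computation of the oscillation frequency from the final counter state can be sketched as follows. This is a minimal model, not the circuit implemented on the chip: the divide ratio of 16 and the 1 µs counting window used in the usage note are assumed values, and all function names are hypothetical.

```python
def counter_value(f_osc_hz: float, divide_ratio: int, gate_time_s: float) -> int:
    """Ideal final counter state: whole periods of the divided clock seen in the window."""
    return int(f_osc_hz / divide_ratio * gate_time_s)


def estimated_frequency(count: int, divide_ratio: int, gate_time_s: float) -> float:
    """Oscillation frequency recovered from the counter state."""
    return count * divide_ratio / gate_time_s


def resolution_hz(divide_ratio: int, gate_time_s: float) -> float:
    """Frequency step corresponding to one counter LSB."""
    return divide_ratio / gate_time_s
```

With the assumed values, a 7.875 GHz oscillation yields a count of 492, which maps back to 7.872 GHz. The error stays below one counter LSB (16 MHz here), and a longer counting window or a smaller divide ratio would tighten it.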

Reference :
[1] Dubois, M.; De Foucauld, E.; Mounet, C.; Dia, S.; Mayor, C., A frequency measurement BIST implementation targeting gigahertz
application, 2012 IEEE International Test Conference (ITC), Page(s): 1 - 8


High Performance IR Imagers


3D Integration for Imagers
Advanced Integrated Algorithms

Architecture &
IC Design
For Image
Sensors


An 88dB SNR, 30µm Pixel Pitch
Infra-Red Image Sensor
With a 2-step 16 bit A/D Conversion
Research topics : CMOS image sensors, Infra-Red, pixel-level ADC
A. Peizerat, J-P. Rostaing, N. Zitouni, N. Baier, F. Guellec, R. Jalby, M. Tchagaspanian
ABSTRACT: A new readout IC (ROIC) with a 2 step A/D conversion for cooled infrared image sensors is
presented in this paper. The sensor operates at a 50Hz frame rate in an Integrate-While-Read snapshot mode.
The 16-bit ADC resolution preserves the excellent detector SNR at full well (~3Ge-). The ROIC, featuring a
320x256 array with 30µm pixel pitch, has been designed in a standard 0.18µm CMOS technology. The IC has
been hybridized (indium bump bonding) to a LWIR (Long Wave Infra Red) detector fabricated using our in-house HgCdTe process. The first measurement results of the detector assembly validate both the 2-step ADC
concept and its circuit implementation. This work sets a new state-of-the-art SNR of 88dB.
Used in security and defense applications, cooled (77K)
Infrared HgCdTe (Mercury Cadmium Telluride) hybrid sensors
(detector bump-bonded over the CMOS IC) are very
demanding in terms of SNR (typical state-of-the-art values
are in the 70-80dB range). The detector sensitivity can be
limited either by the number of incident photons during one
frame or by the charge handling capacity of the CMOS
readout IC (ROIC). In many thermal imaging conditions, the
latter predominates. This limited charge well capacity is
determined by two CMOS process constraints: the integration
capacitance that can fit in a given pixel area, and the
voltage range. To overcome this limitation, new ROIC
architectures must be developed. Pixel-level analog-to-digital
conversion is a very attractive solution that enables high
dynamic range imaging and SNR breakthrough performance,
while being compatible with an IR pixel size.
The overall architecture of the sensor is given in Fig. 1. At
the end of the integration time, the global shutter pixel
delivers an 11 bit digital output as well as an analog residue.
This residue is then converted by a 5 bit flash ADC, which
gives a final 16-bit digital output. In order to make the
Integrate While Read (IWR) feature possible, a whole image
memory (SRAM) is needed.
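
The way the 11 bit pixel output and the 5 bit flash conversion of the residue form one 16 bit word can be sketched with a simple behavioural model. The model assumes that the residue spans exactly one charge packet, so that the flash code fills the 5 least significant bits; the packet size in the usage note is a hypothetical value (about 3Ge- divided by 2^11) and the function name is illustrative.

```python
COARSE_BITS = 11  # pixel-level counter
FINE_BITS = 5     # column flash ADC on the residue


def two_step_convert(charge_e: int, packet_e: int) -> int:
    """Return the 16 bit code for an integrated charge, both given in electrons.

    The coarse code counts whole charge packets; the fine code quantizes the
    remaining residue over one packet (assumed full scale of the flash ADC).
    """
    full_scale = packet_e * (1 << COARSE_BITS)
    if not 0 <= charge_e < full_scale:
        raise ValueError("charge outside the full well of this model")
    coarse = charge_e // packet_e
    residue = charge_e - coarse * packet_e
    fine = residue * (1 << FINE_BITS) // packet_e
    return (coarse << FINE_BITS) | fine
```

For example, with a 1,464,844 electron packet, a charge of three and a half packets gives the code 3*32 + 16 = 112, and a charge just below the full well gives 65535.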

As illustrated in Fig. 3, the pixel uses a pixel-level ADC
technique that is described in [1]. It consists in counting
charge packets so that, at the end of the integration time,
the pixel counter contains a digital value proportional to the
total integrated charge and the residue remains on the
integration capacitance (Cint). GS (Global Shutter) is a global
signal while RS (Row Select) is a row-wise signal that allows
the pixel to write on the digital bus on the one hand and on
the analog bus on the other. For a fixed resolution of 16 bits,
the number of bits at the pixel level can be assessed on an
area criterion. For the 0.18µm process we used, Fig. 2 shows
that there is a tradeoff between the integration capacitance
and the counter depth.

Fig 3 : 2-step 16 bit ADC principle (in the pixel: photodiode on
indium bump, integration capacitance CINT, comparator, monostable
circuit and 11 bit counter; at the bottom of the column: 5 bit
flash ADC, 16 bit word sent towards the SRAM)

Fig 1 : overall block diagram (320x256 pixel array, row decoder,
column charge amplifiers, 5 bit flash ADCs, 320*256 16 bit word
SRAM, 16 bit output shift register, dataout<0:15>; dimensions
7.2mm and 9.6mm)

Fig 2 : Cint trade off (area in µm² versus number of bits in the
pixel, for the pixel area, the counter area, the CINT area and
the counter area + CINT area)

The test chip was fabricated in a 1P6M 0.18µm standard
CMOS process. This 320x256 hybrid HgCdTe sensor
demonstrates how the 3Ge- full well capacity associated with
a 16-bit ADC resolution paves the way for a breakthrough in
thermal sensitivity. In electro-optical tests, a peak SNR of
88dB has been reached with a power consumption below
72mW.
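
A rough order-of-magnitude check shows why a 16 bit resolution matches a full well of about 3Ge-. The two formulas below are textbook approximations and are not taken from the paper: the shot-noise-limited SNR of N electrons, and the ideal quantization SNR of a B bit converter for a full-scale sine input. Function names are illustrative.

```python
import math


def shot_noise_limited_snr_db(full_well_electrons: float) -> float:
    """SNR when only photon shot noise is present: 20*log10(N / sqrt(N))."""
    return 10.0 * math.log10(full_well_electrons)


def ideal_quantization_snr_db(bits: int) -> float:
    """Classical 6.02*B + 1.76 dB approximation for an ideal B bit ADC."""
    return 6.02 * bits + 1.76
```

With these approximations, a 3Ge- full well gives a shot noise limit of about 94.8dB, while an ideal 16 bit converter sits at about 98.1dB, so quantization noise stays below the shot noise of the detector at full well.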
                [this work]   [4]        [1]       [2]
CMOS process    0.18µm        0.35µm     0.18µm    0.18µm
Pixel pitch     30µm          50µm       30µm      50µm
Peak SNR        88dB          85dB       75dB      70dB
Power/pixel     0.5µW         1.7µW      10µW      9.7µW
Format          320x256       128x128    16x1      64x64

Table 1 : Summary of the sensor features vs other works

References :
[1] A. Peizerat, M. Arques and J.-L. Martin, Pixel-level A/D conversion: comparison of two charge packets counting techniques, in Proc., 2007
International Image Sensor Workshop.
[2] Peizerat, A.; Rostaing, J.; Zitouni, N.; Baier, N.; Guellec, F.; Jalby, R.; Tchagaspanian, M., An 88dB SNR, 30µm pixel pitch Infra-Red
image sensor with a 2-step 16 bit A/D conversion, 2012 Symposium on VLSI Circuits (VLSIC), pp. 128-129


Linear Photon-Counting
with HgCdTe APDs
Research topics : photon counting, image sensor, infrared
F. Guellec, G. Vojetta, J. Rothman
ABSTRACT: A custom readout IC has been developed for photon counting. It features a pixel with 115µV/e- conversion gain, 10e- noise and 13µW power consumption to comply with upcoming integration in focal plane
arrays. It has been hybridized to mid-wave infrared HgCdTe Avalanche Photodiodes (APD). The circuit
performance allowed fine characterization of the APD gain and excess noise as well as reproducing the Poisson
statistics of the laser pulse from measurements. Linear mode photon counting with low APD gain (40) at 80K
has been demonstrated. A 90% internal photon detection efficiency and an 800kHz dark count rate have been
evaluated. This dark count rate can be reduced at higher gain (8kHz at 200) or with short-wave infrared APD.
Infrared avalanche photodiodes (APD) using HgCdTe
compound semiconductor material have been developed by
CEA-Leti for several years. These photodiodes are typically
cooled to 80K and operate below breakdown, with an
avalanche mechanism initiated only by electrons, resulting in
a linear amplification (M) with a very low excess noise factor
(F).
For this work [1], we developed a new low-noise readout
electronic circuit in a standard 0.18µm CMOS technology. It
targets single-photon detection at moderate APD gain. This
circuit will enable operation in the Short Wave infrared
(SWIR) band, where the APDs exhibit a significantly reduced
gain for a given photodiode bias voltage compared to the Mid
Wave Infrared (MWIR) band. The use of SWIR APDs makes it
possible to reduce the Dark Count Rate (DCR) or to increase
the operating temperature (for a given DCR).

Figure 2 shows the Probability Density Function (PDF) from
more than 10000 samples. The probability to generate a
laser pulse with an n-photon state is a Poisson distribution:

$$P(n) = \frac{\mu^{n}}{n!}\, e^{-\mu}, \quad \text{where } \mu = \langle n \rangle .$$

The measurement points were therefore fitted with a Poisson
distribution broadened by a Gaussian distribution with a
standard deviation derived from the excess noise factor, and
convolved with a Gaussian noise with a standard deviation σ0
and an offset µ0. The latter two parameters characterize the
width and position of the zero-photon peak and give an
indication of the total noise and offset induced by the circuit
and by dark current events with low multiplication. We can see
that the Poisson statistics of the pulsed laser light is correctly
reproduced. The slight discrepancy between the two peaks is
attributed to the jitter of the laser pulse.
The PDF corresponding to the detection of one or two
photons is well distinguished from the zero-photon
distribution, implying that single photon detection can be
achieved at threshold values below the average amplitude
of a single-photon event and, as a consequence, with a high
photon detection efficiency (PDE) for a moderate APD gain.
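
The fitted distribution can be sketched as a Poisson-weighted sum of Gaussian peaks. This is one possible reading of the model described above, not the authors' fitting code: the n-photon peak is assumed to have a variance equal to the circuit noise variance plus n·(F-1) times the squared single-photon amplitude, and every name and numerical value is illustrative.

```python
import math


def poisson_pmf(n: int, mean: float) -> float:
    """Probability of an n-photon state for a given mean photon number."""
    return mean ** n * math.exp(-mean) / math.factorial(n)


def gaussian_pdf(x: float, mu: float, sigma: float) -> float:
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))


def amplitude_pdf(v, mean_photons, single_photon_amp, excess_noise_factor,
                  sigma0, offset0, n_max=30):
    """PDF of the output step amplitude v (volts).

    Each n-photon peak is centred at offset0 + n * single_photon_amp.
    Its width combines the circuit noise sigma0 with the gain fluctuation
    attributed to the excess noise factor (assumed model).
    """
    pdf = 0.0
    for n in range(n_max + 1):
        gain_variance = n * (excess_noise_factor - 1.0) * single_photon_amp ** 2
        sigma_n = math.sqrt(sigma0 ** 2 + gain_variance)
        centre = offset0 + n * single_photon_amp
        pdf += poisson_pmf(n, mean_photons) * gaussian_pdf(v, centre, sigma_n)
    return pdf
```

Evaluating `amplitude_pdf` on a grid of amplitudes for a few mean photon numbers gives curves with a zero-photon peak followed by broader one- and two-photon peaks, which is the qualitative shape discussed above.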
Figure 2 legend: n=0.14 (without laser), n=1.15, n=2.05

Figure 1 : Circuit basic principle and simulated output voltage
where each step corresponds to a laser impulse.

In the pixel, the photodiode current pulse is integrated on the
input node capacitance. This relatively small capacitance
(around 15fF) is used to perform a fast and low-noise current
to voltage conversion. The resulting voltage is then amplified
by 10 with a low-noise stage having a small input capacitance
compared to the photodiode junction capacitance, in order to
keep the total input node capacitance as low as possible. The
pixel power consumption was reduced to 13µW in order to
allow its use in focal plane arrays.
Single and proportional photon detection capability was
characterized by measuring the amplitude distribution of the
circuit output voltage step occurring with a laser impulse.
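
The order of magnitude of the conversion gain can be checked from the input node capacitance and the amplifier gain. The sketch below uses the simple relation gain = (q / C) × A, which is an assumption of this example (it ignores any other gain or loss in the chain); function names are illustrative.

```python
ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs


def conversion_gain_v_per_e(input_capacitance_f: float, voltage_gain: float) -> float:
    """Output voltage step per electron for a capacitive input node followed by an amplifier."""
    return ELEMENTARY_CHARGE_C / input_capacitance_f * voltage_gain


def capacitance_for_gain(target_gain_v_per_e: float, voltage_gain: float) -> float:
    """Input capacitance that would give a target conversion gain under the same model."""
    return ELEMENTARY_CHARGE_C * voltage_gain / target_gain_v_per_e
```

With 15fF and a gain of 10, this gives about 107µV/e-, close to the reported 115µV/e-. Under this simple model the reported value would correspond to an input capacitance of roughly 13.9fF, consistent with "around 15fF".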

Figure 2 : Measured PDFs (dots) at 80K and -8V photodiode
substrate bias voltage compared to the calculated distribution
with M=43, F=1.5, 0=2.3mV, 0=1.1mV.

The custom IC that has been developed, with its high
conversion gain (115µV/e-) and low noise (10e-), is useful to
characterize APD gain and excess noise. It allowed
demonstrating linear mode photon counting at low APD gain
(40) with an estimated 90% PDE and 800kHz DCR for a
threshold at 40% of the average single-photon amplitude for
MWIR APDs at 80K.

References :
[1] G. Vojetta, F. Guellec, L. Mathieu, et al., Linear photon-counting with HgCdTe APDs, Proc. SPIE 8375, Advanced Photon Counting
Techniques VI, 83750Y, May 2012.


A low-noise, 15µm pixel-pitch,
640x512 hybrid InGaAs image sensor
for night-vision
Research topics : image sensor, infrared, night vision
F. Guellec
ABSTRACT: This paper presents the design of a 15µm pixel-pitch, 640x512 CMOS readout IC. A careful noise
analysis of the C-TIA pixel circuit is necessary to achieve low noise performance with a high conversion gain. A
30e- read noise for a 71dB Dynamic Range (DR) has been reached with the developed hybrid InGaAs image
sensor operated in rolling shutter with Correlated Double Sampling. These state of the art results demonstrate
that this detector is well suited for night vision in the Short Wave Infrared band where it can take advantage of
the airglow. The dual gain functionality of the pixel furthermore enables both night and day use. In low
conversion gain configuration, the noise floor vs. dynamic range trade-off is different and we get a DR of 79dB.
Hybrid InGaAs infrared detectors allow easy and compact
camera integration as cooling is not needed. They are
sensitive from the Short Wave infrared (SWIR) (λ=1.7µm)
down to the visible (λ=0.4µm) when the substrate is thinned.
The SWIR band presents some key advantages for night
vision. In this band, haze offers good transmission and
an optical phenomenon occurring in the atmosphere (called
airglow or nightglow) generates a weak light.
In this context, we developed in collaboration with the III-V
Lab a low-noise, 15µm pixel-pitch, 640x512 hybrid InGaAs
image sensor for night vision [1, 2]. We were in charge of the
readout IC design in a standard 0.18µm CMOS technology.
The pixel is based on a dual gain C-TIA circuit with an anti-blooming function. The image sensor is operated in rolling
shutter with an optional correlated double-sampling mode
which is useful to reduce the noise in high-gain configuration.
Thanks to a thorough noise analysis (taking into account
power supply noise and CDS filtering) and careful circuit
optimization with respect to area and power consumption
constraints, state of the art performance has been reached.

Experimental results are in good agreement with simulated
values. In high gain configuration (17.6µV/e-) a read noise of
30e- has been reached for a dynamic range of 71dB. In low
gain configuration (1.9µV/e-) we get 108e- and 79dB
respectively. As expected, the lower noise floor in high gain
is obtained at the expense of the dynamic range. This
trade-off should be adjusted according to application needs.
The dual gain of the pixel allows use in both night and day
conditions as well as image fusion if needed. The 640x512
image sensor operates at a frame rate up to 120fps with a
total power consumption of 150mW.
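
The noise versus dynamic range trade-off between the two gain settings can be made concrete with a short calculation. The sketch assumes that the dynamic range is defined as 20·log10(maximum signal / read noise), which is the usual definition but is not stated explicitly here; function names are illustrative.

```python
def full_well_from_dr(read_noise_e: float, dynamic_range_db: float) -> float:
    """Maximum signal in electrons implied by a read noise and a dynamic range."""
    return read_noise_e * 10.0 ** (dynamic_range_db / 20.0)


def output_swing_v(full_well_e: float, conversion_gain_v_per_e: float) -> float:
    """Output voltage swing corresponding to the maximum signal."""
    return full_well_e * conversion_gain_v_per_e
```

Under this assumption, the high gain mode (30e-, 71dB, 17.6µV/e-) corresponds to about 106 ke- and the low gain mode (108e-, 79dB, 1.9µV/e-) to about 963 ke-. Both map to a similar output swing of roughly 1.8 to 1.9V, which suggests that the same voltage range is traded between sensitivity and charge capacity.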

Figure 1 : Simplified pixel architecture and modeled noise
spectral density (dashed blue: input noise, bold blue: output
noise, red: output noise after CDS)

Figure 2 : View of the packaged hybrid image sensor and picture
taken with the developed camera.

Further work is being carried out to reduce the pixel pitch to
10µm while maintaining good noise performance, with the
aim of developing a future 1280x1024 detector.

References :
[1] F. Guellec, S. Dubois, E. de Borniol et al., A low-noise, 15µm pixel-pitch, 640x512 hybrid InGaAs image sensor for night vision, Proc. SPIE
8298, Sensors, Cameras, and Systems for Industrial and Scientific Applications XIII, 82980C, February 2012.
[2] E. de Borniol, F. Guellec, P. Castelein, A. Rouvié, J.-A. Robo and J.-L. Reverchon, High-performance 640x512 pixel hybrid InGaAs image
sensor for night vision, Proc. SPIE 8353, Infrared Technology and Applications XXXVIII, 835307, May 2012.


High Dynamic Range Image Sensor
with Self Adapting Integration Time
in 3D Technology
Research topics: 3D Technology, HDR, Image Sensor, Integration Time
F. Guezzi-Messaoud, A. Dupret, A. Peizerat, Y. Blanchard (ESIEE Paris)
ABSTRACT: This paper presents a High Dynamic Range (HDR) image sensor architecture that uses capabilities of
three-dimensional integrated circuit (3D IC) to reach a dynamic range over 120 dB without modifying the classic
(3T or 4T) pixel architecture. The integration time is evaluated on subsets of pixels on the lower IC of the stack
and then sent back by vertical interconnections to the sensor array. This work evaluates the performance of an
analog Winner Take All circuit, used to detect the maximum exponent corresponding to the optimum integration
time chosen for every group of pixels.

Integrating more complex functions within the same circuit is
one of the main quests of the microelectronics industry.
Three-dimensional integration by circuit stacking (3D
stacking) is a promising way to achieve this goal. It notably
makes it possible to push back some of the limitations that
circuits have reached today. The main motivation here is to
take advantage of the 3D topology to exceed the limited
dynamic range of standard image sensors. This work presents
a new image sensor architecture that reaches a dynamic
range over 120dB without modifying the classic (3T or 4T)
pixel architecture. This architecture takes advantage of the
emergence of dense vertical interconnection technologies,
such as Through Silicon Vias (TSV), to locally adapt the
integration time of a group of pixels. Coding a high dynamic
range, high PSNR image increases the data throughput at
the IOs of the circuit. The HDR architecture is therefore
coupled to a two-level compression system [1, 2].
To cope with the limited pixel area and with TSV pitches of
about tens of microns, the circuit proposed in this work
takes advantage of the 3D stacking of 2 integrated circuits.
The circuit consists of two stacked dies vertically
interconnected by TSVs. The upper die performs image
acquisition and is based on the architecture of a classical 2D
image sensor, with 3T or 4T pixels. The processing performed
on the lower die contains two stages: it first estimates the
best suited integration time for every macro-pixel, and then
generates the command signal that adjusts the integration
time. To deduce the optimal integration time, we use the
circuit architecture presented in Fig.1.
In every macro-pixel, the maximal voltage drop ΔV,
corresponding to the minimum integration time, is
determined by means of a Winner Take All (WTA) circuit. We
have designed a WTA circuit in a 32nm double oxide CMOS
technology. Due to its analog nature, the transfer function of
the WTA has an offset and a gain error. The output voltage
as a function of different sets of input voltages has been
simulated (Fig. 2).

Figure 1: Architecture of the pixel and the integration time feedback loop

Figure 2: WTA output voltage versus pixel voltage

The characteristic equation of the resulting curve shows a
gain of about 0.973 and a 321mV offset voltage. These values
are consistent with the analytical expressions of the offset
(Eq.1) and the gain (Eq.2):

V (gi ) = sV

G=

V
V

g ns Vw
t l nN ) (

(1)

gm 1 w
gm 1 w
c

N
i ng m w

1 +wg d 1s+ g
wd 3s w
+
+
g
g
g
m 1 w
d 2 s i d 4s
i

(2)
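
The simulated transfer characteristic of the WTA (gain of about 0.973, offset of 321mV) can be captured by a simple affine model, which also shows how the maximal voltage drop of a macro-pixel could be recovered from the WTA output. This is a behavioural sketch with hypothetical function names; it does not reproduce the analytical expressions (1) and (2).

```python
WTA_GAIN = 0.973       # simulated gain reported in the text
WTA_OFFSET_V = 0.321   # simulated offset reported in the text


def wta_output(pixel_drops_v):
    """Affine model of the WTA: the largest input, scaled and shifted."""
    return WTA_GAIN * max(pixel_drops_v) + WTA_OFFSET_V


def recover_max_drop(wta_output_v: float) -> float:
    """Invert the affine model to estimate the maximal voltage drop."""
    return (wta_output_v - WTA_OFFSET_V) / WTA_GAIN
```

For a macro-pixel whose voltage drops are 0.20V, 0.50V and 0.35V, the model gives an output of 0.8075V, from which the 0.50V maximum is recovered after correcting the gain and the offset.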

References :
[1] F. Guezzi Messaoud , A. Dupret, A. Peizerat and Y. Blanchard, A novel 3D architecture for High Dynamic Range image sensor and on-chip
data compression, Proceedings of the Sensors, Cameras, and Systems for Industrial, Scientific, and Consumer Applications XII, San Francisco,
SPIE 2011.
[2] F. Guezzi Messaoud, A. Dupret, A. Peizerat and Y. Blanchard, On-chip compression for HDR image sensors, proc.DASIP, 90-96, October
2010.
[3] Guezzi Messaoud F., Dupret A., Peizerat A. & Blanchard Y. High Dynamic Range Image Sensor with Self Adapting Integration time in 3D
Technology, IEEE International Conference on Electronics, Circuits, and Systems (ICECS), December 9-12, Seville, Spain, 2012


Computational SAR ADC for
a 3D CMOS Image Sensor
Research topics: 3D integration, CMOS image sensor, image descriptor
A. Verdant, A. Dupret, M. Tchagaspanian, A. Peizerat
ABSTRACT: The architecture and simulation of a Computational SAR ADC (C-SAR) dedicated to the processing of
image descriptors for a 3D CMOS image sensor are reported here. The differential charge sharing architecture
makes it possible to A/D convert the convolution of multiple binary weighted pixel signals on multi-scale kernels. The
CMOS image sensor consists of two tiers (two 3D layers). An array of C-SARs is implemented on the bottom
layer. Each C-SAR is associated with a square of 8×8 pixels on the top layer, with a pitch of 10µm and a fill factor
of 80%. The total noise of 460µVRMS simulated at transistor level in a 65nm technology makes it possible to reach a
processing resolution of 9 signed bits on a 0.5V pixel dynamic range, with a FOM of 6.25pJ/pixel.
In automotive applications, driver drowsiness detection is
extremely constrained in terms of processing bandwidth. The
eye blinking analysis is based on high frame rate video
(200fps). Part of the method used to extract blink features
from video relies on face detection with the Viola-Jones
algorithm, which uses Haar-like descriptors. Although the
high-throughput architectures associated with standard
CMOS image sensors allow spatial weighted sums
(convolutions) to be computed, integrated processing features
are mandatory to reduce power and silicon area costs. Hence,
to overcome the limitations associated with the use of a DSP,
processing features have been successfully implemented in
CMOS image sensors [1].
The Computational-SAR (C-SAR) architecture allowing the
calculation of the Haar descriptors is presented here. This
topology benefits from the high bandwidth of the SAR ADC
together with its low power consumption. Indeed, successive
approximation converters are known to provide the best
FOM in terms of energy per conversion step. This C-SAR
processing unit is described considering its implementation
in a two-tier 3D CMOS image sensor, chosen to preserve the
fill factor of the sensor array.

The C-SAR is conceived as the building block of the readout
circuit of a 3D CMOS image sensor. The top tier (Fig. 1)
embeds a 32×32 macropixel array. Each macropixel is
composed of a square of 8×8 back-illuminated pixels with a
10µm pitch, read locally in rolling shutter mode. On the second
tier, an array of 32×32 C-SAR cells is implemented to
compute the binary weighted sum of pixels. Each C-SAR cell
is associated with a macropixel and is thus shared by its
block of 8×8 pixels. The connection between tier 1 and tier 2
is realized by direct metal bonding, in a face to face
configuration. Only one interconnection is required for each
8-pixel column of a macropixel. The readout pipeline of a
macropixel is presented in Fig. 2. Each column of the
macropixel in tier 1 is associated with a sample and hold
circuit in tier 2. A bank of 2×8 capacitors thus stores the
reference (black level) and the pixel signal of a line of 8
pixels being read in rolling shutter mode. The sampled data
are then multiplexed towards the analog to digital processing
unit to be computed.
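
The binary weighted sum computed by a C-SAR cell over its macropixel can be illustrated with a purely numerical model. The sketch below is an assumption-laden stand-in for the charge sharing circuit: the charge sharing is modelled as an average of the ±1 weighted pixel voltages, the result is quantized to 9 signed bits over the 0.5V pixel dynamic range, and the function name and weight pattern are hypothetical.

```python
def haar_descriptor_code(pixels_v, weights, v_range=0.5, bits=9):
    """Signed code of a Haar-like descriptor over one macropixel.

    pixels_v: rows of pixel voltages (signal minus black level), in volts.
    weights:  rows of +1 / -1 weights with the same shape as pixels_v.
    """
    count = sum(len(row) for row in pixels_v)
    weighted_sum = sum(w * p
                       for row_p, row_w in zip(pixels_v, weights)
                       for p, w in zip(row_p, row_w))
    average = weighted_sum / count            # assumed effect of charge sharing
    lsb = v_range / (1 << (bits - 1))         # 9 signed bits: 256 steps per polarity
    code = round(average / lsb)
    low, high = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    return max(low, min(high, code))          # clip to the signed range
```

With a vertical edge descriptor (+1 on the left four columns, -1 on the right four), a macropixel that is at 0.5V on the left and 0V on the right gives the code 128, while a uniform macropixel gives 0.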

Figure 2 : Readout pipeline of a macropixel

Figure 1 : 3D CMOS image sensor embedding C-SAR

A low power consumption architecture has been simulated for
a processing resolution of 9 signed bits, reaching a FOM of
6.25pJ/pixel. Compared to standard processing architectures,
no additional time is required, the processing being performed
together with the conversion. This C-SAR is suitable for high
frame rates, up to 2200fps.

References :
[1] L. Alacoque, L. Chotard, M. Tchagaspanian, J. Chossat, A small footprint, streaming compliant, versatile wavelet compression scheme for
cameraphone imagers, In International Image Sensor Workshop, IISW09, Bergen Norway.
[2] Verdant, A.; Dupret, A.; Tchagaspanian, M. & Peizerat, A. Computational SAR ADC for a 3D CMOS image sensor, IEEE 10th International
New Circuits and Systems Conference (NEWCAS), 2012, pp. 337-340


Design and Optimization of
Two Motion Detection Circuits
for Video Monitoring
Research topics : Smart image sensor, visual perception
A. Dupret, M. Zhang, N. Llaser and H. Mathias (IEF)
ABSTRACT: In a classical video monitoring system, most of the captured images contain no relevant
information, yet the system still spends power processing them. One technique that has proved
effective in reducing the power consumption of a video monitoring system is based on macro-pixels,
or blocks of pixels, used for region of interest (ROI) finding. The key feature
necessary for ROI finding is motion detection. To implement ROI detection, two CMOS
based motion detection circuits are proposed. Both are designed and optimized to have fewer transistors, lower
power consumption, higher sensitivity and better detection uniformity.
Security has become increasingly important in our society,
and video monitoring systems offer one of the solutions.
However, statistically speaking, most of the images captured
by a video monitoring system contain no relevant
information. Data transmission and image processing must
nevertheless be performed, which needlessly increases not
only the image processing time but also the power
consumption. To improve the performance of video
monitoring systems, the macropixel approach proposed by
our research team offers several advantages and therefore
seems quite promising.

Two original motion detection circuits are designed to be
integrated within the macro-pixel, a promising technique we
have proposed to achieve a high-resolution, low-power video
monitoring system. The design is made in a 0.35 µm CMOS
technology. Simulation results show that both circuits display
low power consumption and offer a variable motion detection
threshold, set by electrically varying the comparator window
width. However, the second circuit clearly compares favorably
with the first: it uses fewer transistors, consumes less power,
and exhibits much lower non-homogeneity within the designed
common-mode input range as well as much higher voltage gain.
The next step consists in integrating both proposed circuits on
silicon to experimentally evaluate their functionality.

Figure 1 : First motion detection circuit. Vt is the biasing voltage
that controls the motion detection threshold. Vr represents the
previous mean voltage value and Vin the current mean voltage
value. Vp2 is the biasing voltage for transistor M18, which acts as
a current source providing the active load of the logic NOR; this
load is shared among the motion detection circuits of the same
column or the same row.

Figure 2 : Second motion detection circuit. Two half hysteresis
voltage comparators are used for negative and positive voltage
comparison. A logic NOR is used as the output stage, in which
the active load (M9) is shared among the circuits of the same
column or the same row.

Figure 3 : Simulation results for the circuit of Fig. 2 with Vr = 1.65 V and a
variable Vt giving different threshold levels, which can be
interpreted as a window of symmetrical width around Vr that
achieves absolute value comparison. The symmetry of the window
comparison is simulated for a fixed central voltage and window
widths ranging from 10 mV to 500 mV.
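
The window comparison described in the captions can be illustrated with a purely behavioral model. The sketch below is not the transistor-level circuit: it assumes that the window is centered on Vr and that `window_width` is its total width, so motion is flagged when |Vin - Vr| exceeds half of it. The function names are illustrative only.

```python
def motion_detected(v_in, v_ref, window_width):
    """Behavioral window comparator: True when v_in leaves the window around v_ref.

    Assumption: the window is symmetrical around v_ref and window_width is its
    total width, so the detection threshold is window_width / 2 on each side.
    """
    return abs(v_in - v_ref) > window_width / 2.0


V_REF = 1.65  # previous mean voltage Vr used in the Fig. 3 simulation (volts)


def flagged_inputs(window_width, step=0.005, span=0.4):
    """Sweep Vin around V_REF and return the voltages flagged as motion."""
    n_steps = int(round(2 * span / step))
    sweep = [V_REF - span + i * step for i in range(n_steps + 1)]
    return [v for v in sweep if motion_detected(v, V_REF, window_width)]


# A 10 mV window flags a 10 mV change in either direction, but not a 2 mV one.
print(motion_detected(1.66, V_REF, 0.010), motion_detected(1.64, V_REF, 0.010))
print(motion_detected(1.652, V_REF, 0.010))
```

Widening the window (a larger Vt in the circuit) makes the detector less sensitive, which is the electrically adjustable threshold described in the text.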

Table 1.
Parameters               Circuit 1         Circuit 2
Number of transistors    17                14
Power consumption        1 µW              0.7 µW
Window width             10 mV / 300 mV    8 mV / 500 mV
Non-homogeneity          < 12.5 %          < 1 %
Sensitivity              Lower             High

References :
[1] Zhang Ming; Llaser Nicolas; Mathias Herve, Dupret Antoine, "Design and optimization of two motion detection circuits for video monitoring
system," Circuits and Systems (ISCAS), 2012 IEEE International Symposium on , vol., no., pp.1907-1910, 20-23 May 2012
[2] Faiza Ait-Kaci, Hervé Mathias, Ming Zhang, Antoine Dupret, "Exploration of Analog Pre-processing Architectures for Coarse Low
Power Motion Detection", IEEE NEWCAS 2011, 28th-30th June, Bordeaux, France, 2011.
[3] Verdant A.; Villard P., Dupret A.; Mathias H., "Architecture for a low power image sensor with motion detection based ROI," 14th IEEE
International Conference on Electronics, Circuits and Systems, 2007. ICECS 2007, pp.1023-1026, 11-14 Dec. 2007.
[4] Verdant A., Dupret A., Mathias H., Villard P., Lacassagne L., "Adaptive Multiresolution for Low Power CMOS Image Sensor," IEEE
International Conference on Image Processing, 2007. ICIP 2007, vol.5, pp.V-185-V-188,


Towards a Real Time Sensor for Focusing Through Scattering Media
Research topics : Image sensor, Wavefront correction
T. Laforest, A. Verdant, A. Dupret, S. Gigan (CNRS UMR 7587), F. Ramaz (CNRS UMR 7587)
ABSTRACT: Materials such as milk, paper, white paint and biological tissue scatter light. As a result, the light
intensity transmitted through these materials forms a speckle pattern, which often has a short persistence time.
Recent advances in optics have made the control of light through disordered media increasingly efficient.
This allows us to foresee a real-time sensor that achieves this task in an integrated way. In this perspective,
we propose a genetic algorithm implemented with a pyramidal approach in a CMOS image sensor, which
reconciles integrated data processing with short persistence times. Our algorithm has been simulated with a
faithful model. Results show a gain of at least a factor of 10 compared with the state of the art.
Materials such as milk, paper, white paint and biological tissue
are opaque because of multiple scattering of light: the
interaction between the medium and the light beam changes the
phase of the light. Recently, many works have reported the
control of coherent light through scattering media. The
principle consists in correcting the phase perturbations
produced by the medium, thereby achieving inverse diffusion.
The use of phase-only Spatial Light Modulators (SLM) for
wavefront correction is a promising way to focus coherent
light. Wavefront correction is achieved by finding the optimal
set of phases for the adjustable elements of the SLM. This
task constitutes an optimization problem.

Turbid media, especially biological tissue, often feature a short
persistence time of a few milliseconds. Hence, the corrected
wavefront must be computed within the persistence time. This
complicates the optimization process, which must also be
robust to high noise levels. Some works propose sequential
focusing algorithms, or the measurement of the transmission
matrix, which allows generating the correct phase set that, in
turn, focuses the light beam.
All these algorithms are time consuming or lack robustness
in noisy environments. Recently, an efficient genetic
algorithm has been presented.

For instance, considering a 256x256 pixel image array, a
persistence time of 2 ms, and assuming that the algorithm
needs 250 frames to converge, the image sensor has to
capture 125 000 frames per second (fps), corresponding to a
transfer rate of more than 8 gigapixels per second. The standard
approach, i.e. a camera and a processor, suffers from limitations:
delay due to frame transfer and centralized data processing.
Therefore, we aim at developing a dedicated smart image
sensor that brings the focusing convergence time within the
persistence time of biological media. Indeed, parallel
processing at the pixel level allows a dramatic acceleration
of processing [1]. A major challenge is to make the
implementation compatible with the pixel level. In that scope,
we present a pyramidal genetic algorithm (GA) that can be
implemented within a CMOS image sensor.

Figure 1 : Standard optical setup. M, mirror; SLM, spatial light
modulator; SIS, smart image sensor.

Figure 2: 2D and 1D images of the intensity before (a) and after
(b) optimization with our genetic algorithm.
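
The frame-rate requirement quoted in the text for a 256x256 array can be checked with a few lines of arithmetic. The sketch below only restates the figures given in the text (array size, 2 ms persistence time, 250 frames to converge); the variable names are illustrative.

```python
# Figures taken from the text.
width, height = 256, 256          # pixels
persistence_time_s = 2e-3         # speckle persistence time
frames_to_converge = 250          # frames the algorithm is assumed to need

# All frames must be captured within one persistence time.
frame_rate_fps = frames_to_converge / persistence_time_s
pixel_rate = width * height * frame_rate_fps

print(f"{frame_rate_fps:.0f} fps, {pixel_rate / 1e9:.2f} Gpixel/s")
```

The pixel rate comes out slightly above 8 gigapixels per second, which is the order of magnitude that motivates processing inside the sensor rather than transferring every frame to a separate processor.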

The standard optical setup corresponding to this model, used
for testing the algorithms and simulating their
implementation, is shown in Fig. 1. A laser source illuminates
a reflective SLM array. Each element of the SLM array shifts
the phase of its incident light by a value between 0 and 2π.
The light beam is then scattered by the medium, and finally the
transmitted intensity is recorded on an image sensor.
In order to compare our implementation with the state of the
art, we consider the previously used enhancement criterion,
defined as the transmitted intensity in the chosen target
(focus point) divided by the average transmitted intensity
before optimization. This criterion is measured as a function
of the number of frames acquired by the image sensor. An
example of transmitted light obtained with our genetic
algorithm is shown in Fig. 2.
Results show a gain of at least a factor of 10 with our
algorithm compared with the state of the art. Moreover, the
pyramidal approach yields a gain of at least a factor of 2 over
the classical one. Finally, our genetic algorithm has been
evaluated at different noise levels and compared with the
state of the art. Results show that our algorithm converges
at high noise levels, whereas the state of the art does not.
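
To make the optimization problem and the enhancement criterion concrete, here is a minimal sketch of a plain genetic algorithm that searches SLM phases to focus light on one target pixel. It is not the pyramidal, in-sensor algorithm of this work, whose details are not given here. The scattering medium is modeled, as an assumption, by one row of a random complex transmission matrix; the population size, mutation rate and number of SLM elements are arbitrary illustrative values, and noise is not modeled.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SEGMENTS = 64  # number of SLM elements (assumed)

# Assumed medium model: random complex coupling from each SLM element to the target.
t = (rng.normal(size=N_SEGMENTS) + 1j * rng.normal(size=N_SEGMENTS)) / np.sqrt(2)


def target_intensity(phases):
    """Intensity at the target pixel for a given SLM phase mask."""
    return np.abs(np.sum(t * np.exp(1j * phases))) ** 2


def genetic_focus(pop_size=30, generations=200, mutation_rate=0.05):
    """Plain GA: keep the best half, breed the rest by crossover and mutation."""
    pop = rng.uniform(0, 2 * np.pi, size=(pop_size, N_SEGMENTS))
    for _gen in range(generations):
        fitness = np.array([target_intensity(p) for p in pop])
        parents = pop[np.argsort(fitness)[::-1]][: pop_size // 2]
        children = []
        for _child in range(pop_size - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            child = np.where(rng.random(N_SEGMENTS) < 0.5, a, b)   # uniform crossover
            mutate = rng.random(N_SEGMENTS) < mutation_rate
            child = np.where(mutate, rng.uniform(0, 2 * np.pi, N_SEGMENTS), child)
            children.append(child)
        pop = np.vstack([parents, children])
    fitness = np.array([target_intensity(p) for p in pop])
    return pop[np.argmax(fitness)]


# Enhancement: optimized target intensity over the average intensity before optimization.
baseline = np.mean([target_intensity(rng.uniform(0, 2 * np.pi, N_SEGMENTS))
                    for _ in range(200)])
best = genetic_focus()
enhancement = target_intensity(best) / baseline
print(f"enhancement ~ {enhancement:.1f}")
```

Each fitness evaluation corresponds to one captured frame in a real setup, which is why the number of frames needed to converge drives the sensor frame rate.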

References :
[1] J.-M. Tualle, A. Dupret, and M. Vasiliu, "Ultra-compact sensor for diffuse correlation spectroscopy", Electronics Letters, vol. 46, no. 12,
pp. 819-820, 2010.
[2] Laforest T., Verdant A., Dupret A., Gigan S. & Ramaz F. Towards a real time sensor for focusing through scattering media, 2012 IEEE
Sensors, October 28-31, Taipei, Taiwan, 2012


Perceptual Image Quality Assessment Metric
Research topics : motion blur, digital photography
F. Gavant, A. Dupret, L. Alacoque, D. David
ABSTRACT: Image sensor stabilization is usually based on accelerometers. To reduce the number of components
external to digital image sensors, an integrated image stabilization system is envisaged. Such a system
requires a model of the blur due to hand tremor and a general sharpness metric to quantify the gain brought by
stabilization. We aim at providing an accurate model of hand tremor and of its impact as a Point Spread
Function. To define the specifications of image-based image stabilization, we have derived a perceptual
sharpness metric for camera shake blur. This metric is based on visual blur tests. It fits ground truths well,
such as a mean opinion score database and quality ruler measurements of blur.
The digital imaging market is characterized by conflicting
demands: smaller pixels, in order to attain large formats and
reduce the cost of the die, and high sensor performance
in terms of sensitivity and signal-to-noise ratio (SNR). To
keep a reasonably high SNR, longer integration times are
required. Yet longer integration times make the resulting
image sensitive to motion blur. Since hand tremor is larger
for lighter devices, these problems are even more severe for
compact cameras and cameraphones. Therefore, an image
stabilization (IS) mechanism is needed to reduce the blur due
to camera shake. To do without the classical mechanical
accelerometers used in IS, our approach is to develop
integrated image-based motion detection. The specifications
of this integrated image-based motion detection derive from
the impact of hand tremor blur on image quality.
Our work thus leads to a faithful model of hand tremor and a
metric to measure the impact of blur on image quality.
The angular variations between camera and scene caused by
hand tremor are characterized by a power spectral density
(PSD). The characteristics of the camera (focal length, pixel
pitch, etc.) determine the conversion of angular tremor into
translational motion of the image across the pixels of the
sensor. The Point Spread Function (PSF) then results from the
integration of this motion signal over the exposure. The
motion-blurred image is generated by convolving the PSF with
the reference image.
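
The chain from angular tremor to blurred image can be sketched as follows. This is an illustration under stated assumptions, not the tremor model of this work: the tremor is a synthetic random walk rather than a signal drawn from a measured PSD, the angle-to-pixel conversion uses the usual pinhole relation (displacement = focal length x tan(angle) / pixel pitch), and the focal length and pixel pitch are arbitrary example values.

```python
import numpy as np


def angles_to_pixels(theta_rad, focal_length_m, pixel_pitch_m):
    """Convert angular tremor samples into image displacement in pixels."""
    return focal_length_m * np.tan(theta_rad) / pixel_pitch_m


def psf_from_trajectory(dx_px, dy_px, size=31):
    """Integrate the motion: count the time spent at each pixel offset."""
    psf = np.zeros((size, size))
    c = size // 2
    for x, y in zip(np.rint(dx_px).astype(int), np.rint(dy_px).astype(int)):
        if abs(x) <= c and abs(y) <= c:          # ignore samples outside the support
            psf[c + y, c + x] += 1.0
    return psf / psf.sum()


def blur(image, psf):
    """Circular convolution of the reference image with the PSF, via FFT."""
    k = psf.shape[0]
    kernel = np.zeros(image.shape, dtype=float)
    kernel[:k, :k] = psf
    kernel = np.roll(kernel, (-(k // 2), -(k // 2)), axis=(0, 1))  # PSF centre at (0, 0)
    return np.real(np.fft.ifft2(np.fft.fft2(image) * np.fft.fft2(kernel)))


rng = np.random.default_rng(1)
theta_x = np.cumsum(rng.normal(0, 1e-4, 200))    # synthetic tremor (rad), assumption
theta_y = np.cumsum(rng.normal(0, 1e-4, 200))
dx = angles_to_pixels(theta_x, focal_length_m=4e-3, pixel_pitch_m=1.4e-6)
dy = angles_to_pixels(theta_y, focal_length_m=4e-3, pixel_pitch_m=1.4e-6)
psf = psf_from_trajectory(dx, dy)

reference = np.zeros((64, 64))
reference[20, 20] = 1.0                          # point source as reference scene
blurred = blur(reference, psf)
```

With a point source as the reference scene, the blurred image is simply the PSF shifted to the source position, which is a convenient sanity check for the convolution.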
The impact on human perception of the particular blur induced
by hand tremor has not been well characterized. Two particular
types of blur (Gaussian blur and straight-line motion blur)
have been studied, however. Gaussian blur occurs in defocus
conditions, while straight-line blur is generally used as a
simplified model of tremor motion. For Gaussian blur, some
publicly available databases of subjective quality data already
exist; they apply several distortions, including Gaussian blur,
and provide the Mean Opinion Score (MOS). For straight-line
blur, quality rulers based on the just noticeable difference
(JND) have been studied. Yet, because of the complexity of
camera shake, these particular results are not suitable for
complex blurs.
We therefore developed a sharpness quality metric based on the
PSF of the camera shake. The result of the metric is then
normalized against both the JND and MOS databases to provide a
direct human perceptual value that is not limited to the
particular cases of straight-line blur and Gaussian blur. Our
metric is also based on the circle of confusion, which accounts
for the final viewing conditions of the user (such as web
applications, display, printing). The metric is validated by
user tests such as image comparison, and it fits the
experimental trends of other databases both for linear motion
blur (Fig. 1) and for arbitrary motion blur (Fig. 2).
To the best of our knowledge, this is the first metric that can
measure all types of arbitrary blur. It makes it possible to
specify image-based electronic image stabilization systems
and to quantify the final subjective gain of the overall IS.

Figure 1: Comparison of quality prediction with the ground truth in
the case of linear blur

Figure 2: Comparison of quality prediction with the ground truth
in the case of arbitrary motion blur

References :
[1] Gavant, F.; Alacoque, L.; Dupret, A.; Ho-Phuoc, T. & David, D. (2012), "Perceptual image quality assessment metric that handles arbitrary
motion blur", SPIE Conference on Image Quality and System Performance IX, Burlingame, CA, Jan. 24-26, 2012.


Saliency-Based Data Compression for Image Sensors
Research topics : visual attention compression, architecture-algorithm co-design
Tien Ho-Phuoc, L. Alacoque, A. Dupret, A. Guérin-Dugué (GIPSA-LAB)
ABSTRACT: Saliency models have shown their ability to predict where observers fixate during scene exploration.
Embedding a saliency model into an image sensor for data compression allows the bit-rate budget to be allocated
according to the saliency level of each region. This paper presents an original implementation of a saliency-based
data compression algorithm and architecture. First, a compact, video-rate compliant saliency model is designed to
allow its integration within an image sensor. It predicts human fixations better than state-of-the-art models.
A simpler version of our proposed model requires 256 times less memory. Second, a
Haar wavelet based compression is applied according to the saliency of regions in each frame.
Lossy compression algorithms enable higher compression
ratios than their lossless counterparts, at the expense of
artifacts that are visually disturbing, especially in salient
regions and when high compression ratios are used. An
image sensor integrating a saliency model is able to adapt
the compression ratio according to saliency. Its
implementation within the image sensor must comply
with strong hardware constraints, i.e. limited memory and
processing elements. We first propose a very compact, yet
efficient, video saliency model that complies with the
low-complexity requirement of image sensors. The proposed
model combines, through the OR operation, motion saliency
with the central fixation bias, a human viewing tendency.
Motion saliency is computed in blocks by means of an adaptive
threshold (Fig. 1), which keeps the required memory small
(Fig. 2). The central fixation bias is constant for all frames
and is stored in a look-up table.
Second, the compression step is applied to each block. If a
block is salient, all its information is preserved. By contrast,
non-salient blocks are reconstructed from only their LL
(approximation) component of the Haar wavelet
transform. Only compact operators are used in the proposed
model.
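
The two steps described above can be sketched in a few lines. This is an illustration under assumptions, not the implementation of this work: the block size, the form of the adaptive threshold (mean plus k standard deviations of the per-block frame difference) and the single-level Haar decomposition are choices made for the example, and `center_bias` stands for the look-up table of the central fixation bias.

```python
import numpy as np

BLOCK = 8  # block size in pixels (assumed)


def block_means(img):
    """Mean of each BLOCK x BLOCK block (image sides must be multiples of BLOCK)."""
    h, w = img.shape
    return img.reshape(h // BLOCK, BLOCK, w // BLOCK, BLOCK).mean(axis=(1, 3))


def motion_saliency(prev, curr, k=1.0):
    """Flag blocks whose frame difference exceeds an adaptive threshold (assumed form)."""
    diff = block_means(np.abs(curr.astype(float) - prev.astype(float)))
    return diff > diff.mean() + k * diff.std()


def haar_ll_only(block):
    """Rebuild a block from its one-level Haar LL band: 2x2 cells take their mean."""
    ll = block.reshape(BLOCK // 2, 2, BLOCK // 2, 2).mean(axis=(1, 3))
    return np.repeat(np.repeat(ll, 2, axis=0), 2, axis=1)


def compress(prev, curr, center_bias):
    """Keep salient blocks untouched, keep only the LL band of the others."""
    salient = motion_saliency(prev, curr) | center_bias      # OR combination
    out = curr.astype(float).copy()
    for i in range(salient.shape[0]):
        for j in range(salient.shape[1]):
            if not salient[i, j]:
                sl = (slice(i * BLOCK, (i + 1) * BLOCK), slice(j * BLOCK, (j + 1) * BLOCK))
                out[sl] = haar_ll_only(out[sl])
    return out, salient


# Static textured background with one moving block at block position (1, 1).
prev = (np.arange(32 * 32).reshape(32, 32) % 7).astype(float)
curr = prev.copy()
curr[8:16, 8:16] += 100.0
bias = np.zeros((4, 4), dtype=bool)          # no central bias in this toy example
out, salient = compress(prev, curr, bias)
```

Discarding the three detail bands of a non-salient block keeps one value per 2x2 cell, i.e. a quarter of the coefficients, while salient blocks are passed through unchanged.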

Fig. 3 illustrates the saliency map of the proposed saliency
model, exploiting motion and the central fixation bias, for a
given frame. It is also compared with the saliency maps of
two other algorithms: the first one is the Sigma-Delta
algorithm, since it features a very compact physical
implementation; the second is Itti's model, which usually
serves as a reference. The best performance is obtained with
our model.
Fig. 3 represents several frames, their saliency maps and
compressed versions. This framework is particularly effective
with scenes containing locally distributed motion. Indeed, in
this case the moving regions, which are very well predicted by
our model and actually fixated by observers, keep all their
information, while large non-fixated regions are reconstructed
from low-frequency information only.

The proposed framework presents an original, compact yet
efficient, saliency-based data compression model for image
sensors. It is flexible and so might be improved by adding
filtering operators.

Figure 1 : Illustration of the motion saliency extraction by
adaptive threshold

Figure 2 : Frame memory required for motion computation

Figure 3 : Other frames (first row) and their saliency maps (third
row) provided by the model BSM1. Compressed frames (second
row)

References :
[1] Tien Ho-Phuoc, Alacoque L., Dupret A., Guerin-Dugue A., Verdant A., "A compact saliency model for video-rate implementation", 45th
Asilomar Conference on Signals, Systems and Computers (ASILOMAR), 2011, pp.244-248, 6-9 Nov. 2011.
[2] Tien Ho-Phuoc, Laurent Alacoque, Antoine Dupret, Anne Guérin-Dugué, Arnaud Verdant, "A unified method for comparison of algorithms of
saliency extraction", Proc. SPIE 8293, Image Quality and System Performance IX, 829315 (January 22, 2012).
[3] Tien Ho-Phuoc, Laurent Alacoque, Antoine Dupret, Compact saliency model and architectures for image sensors, IEEE Workshop on Signal
Processing Systems 2012.
[4] Tien Ho-Phuoc, Antoine Dupret, Laurent Alacoque, Saliency-Based Data Compression for Image Sensors. IEEE Sensors, 2012. Oct. 28-31,
Taipei, Taiwan


A New Approach of
Smart Vision Sensors
Research topics: smart imagers, adaptive processing, feedback
J. Bézine, M. Thévenin, R. Schmit, M. Duranton, M. Paindavoine (LEAD)
ABSTRACT: Today's digital image sensors are used as passive photon integrators, and image processing is
essentially performed by digital processors separated from the image sensing parts. This approach forces
the processing part to deal with finished pictures captured with possibly ill-adjusted parameters. This work
presents a self-adaptable preprocessing architecture concept with fast feedback controls at the sensing level.
These feedbacks are controlled by digital processing in order to adapt the exposure and processing parameters
to the captured scene. This innovative way of designing smart vision sensors, integrating fast
feedback control, enables new approaches for machine vision architectures and their applications.
Nowadays, in most image processing systems, the sensor is
separated from the image processing part, and pixel values
are sent serially. First, photons are integrated for a
predefined exposure time; next, a control circuit reads and
sequentially converts the pixel values from analog to digital.
Finally, pixel values are sent to an image processor for image
enhancement or computer vision applications. Thus, image
processing systems only consider pixel values after the end of
the full exposure. As a consequence, corrections such as
dynamic range enhancement or image stabilization need to be
added in order to suppress the effects of ill-adjusted image
capture parameters. This is particularly true in vision
applications such as obstacle detection or target tracking,
where the image sensor is mounted on moving vehicles, suffers
from their vibrations and often analyzes difficult scenes
(high contrast or bad weather conditions).
During the last decade, image processing systems have tended
to link sensing parts to the processing units. Near-pixel
processing was introduced in smart sensors, at the analog or
digital level, in order to refine or adapt captured images
before final processing, thus optimizing it. To further improve
silicon and energy efficiency, this work proposes to associate
image capture and image processing even more closely by
adding fast, local feedback controls to the usual image
capture process.

This adaptation of the usual image capture process is
presented in Fig. 1. It is first based on close control of
the image capture parameters (exposure time, conversion
gain and pixel reset) during the photon integration time. This
introduces the use of frame sub-exposures to construct a full
frame. These sub-exposures may be considered as a sampled
continuous readout. To provide the control needed for our
approach, we propose a hardware architecture adaptation
relying on 3D stacking technologies to process pixels quickly
enough to enable capture control by feedback during
image construction. It associates a 2D matrix of preprocessing
elements with the photo-sensitive layer, which is divided into
pixel blocks. These preprocessing elements are designed to
perform generic vision pre-computing in order to provide a
preprocessed image, or specific image features, to the
associated high-level processing unit. The innovative purpose
of this layer is to locally control the photo-sensitive layer by
processing incoming pixel values on the fly and sending back
adapted capture parameters.

Figure 1: Schematic of the feedback integration approach in the
image processing flow.

Figure 2: Multi-exposure adaptation and resulting motion-related
Region-of-Interest delivered by the sensor integrating fast
feedback adaptation.

This work was presented in [1], showing the first results of
the feedback-controlled design. Fig. 2 shows an application of
our approach to motion detection in a highly contrasted
environment. As image processing algorithms are designed
for traditional architectures that process images after their
acquisition, new algorithms must be considered in order to
benefit from this smart sensor architecture. Further work will
investigate such designs, and enhance the adaptation
capabilities and flexibility of our smart sensor.
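
The idea of building a frame from sub-exposures under local feedback can be illustrated with a toy simulation. The policy below is an assumption made for the example, not the control law of this work: each pixel block keeps integrating sub-exposures until the next one would saturate one of its pixels, and the number of sub-exposures actually used is then used to rescale the block. `FULL_WELL`, `N_SUB` and `BLOCK` are arbitrary illustrative values.

```python
import numpy as np

FULL_WELL = 1.0   # pixel saturation level (normalised, assumed)
N_SUB = 16        # sub-exposures available per frame (assumed)
BLOCK = 4         # pixels per block side (assumed)


def expand(block_map):
    """Repeat a per-block map so that it matches the pixel grid."""
    return np.repeat(np.repeat(block_map, BLOCK, axis=0), BLOCK, axis=1)


def capture_with_feedback(radiance):
    """Integrate sub-exposures; a block stops as soon as it would saturate."""
    h, w = radiance.shape
    acc = np.zeros((h, w))
    used = np.zeros((h // BLOCK, w // BLOCK), dtype=int)    # sub-exposures kept per block
    active = np.ones(used.shape, dtype=bool)
    step = radiance / N_SUB                                 # signal gathered per sub-exposure
    for _ in range(N_SUB):
        nxt = acc + step
        peak = nxt.reshape(h // BLOCK, BLOCK, w // BLOCK, BLOCK).max(axis=(1, 3))
        active &= peak <= FULL_WELL                         # feedback decision per block
        acc = np.where(expand(active), nxt, acc)
        used += active
    return acc, used


# Left half dim (never saturates), right half four times brighter than full well.
scene = np.full((8, 8), 0.5)
scene[:, 4:] = 4.0
acc, used = capture_with_feedback(scene)
estimate = acc * N_SUB / expand(np.maximum(used, 1))        # rescale by exposure used
```

In this toy case the bright blocks stop after 4 of the 16 sub-exposures, so no pixel saturates and the rescaled estimate recovers the scene values in both halves.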

References :
[1] J. Bézine, M. Thévenin, R. Schmit, M. Duranton, M. Paindavoine, "A New Approach of Smart Vision Sensors", Proc. SPIE 8436, Optics,
Photonics, and Digital Technologies for Multimedia Applications II, 84360I (June 1, 2012).


Architecture, IC Design & Control for Digital SoCs

3D Architectures & Circuits
Manycores
FDSOI Circuits & Memories
Asynchronous Design
Exploration & Estimation
Adaptive Control


Platform 2012, a 3D-ready Many-Core Computing Accelerator with Power,
Thermal and Variability Management
Research topics : Many-core architecture, low-power, System-on-Chip, 3D stacking
L. Benini (UNIBO), D. Melpignano (ST), E. Flamand (ST),
B. Jego (ST), T. Lepley (ST), G. Haugou (ST), F. Clermidy, D. Dutoit
ABSTRACT: P2012 is an area- and power-efficient many-core computing accelerator based on multiple processor
clusters implemented with independent power and clock domains, enabling aggressive fine-grained power,
reliability and variability management. Clusters are connected via a high-performance fully-asynchronous
Network-on-Chip (NoC) and feature up to 16 processors. The SoC is being implemented in STMicroelectronics'
low-power 28 nm CMOS process and is 3D-stacking ready. The target chip area is below 26 mm² for a 4-cluster
version.
The Platform 2012 (P2012 [1]) project aims at taking a
significant step forward in programmable accelerator
architectures for next-generation data-intensive embedded
applications such as multimodal sensor fusion, image
understanding and mobile augmented reality. P2012 is an
area- and power-efficient, process-aware many-core
computing fabric, and it provides an architectural harness
that eases the integration of hardwired IPs. P2012 can be
described as a Globally Asynchronous Locally Synchronous
(GALS) fabric of tiles, called clusters, connected through an
asynchronous global NoC [2] (G-ANoC). The P2012 cluster
aggregates a multi-core computing engine (ENCore) and a
cluster controller (CC). The ENCore cluster can host from
1 to 16 processors.

Power, thermal and variability management are essential
features in computing architectures targeting deep-submicron
CMOS implementation. P2012 makes use of several
hardware-assisted control loops to reduce design-time margins
and to improve energy efficiency. Each cluster has a local
clock, generated by a small and highly reactive
Frequency-Locked Loop (FLL). Clock speed can be adjusted in a
few cycles on a per-cluster basis with no inter-cluster
constraints. The fabric interconnect is fully asynchronous,
hence no global chip-wide clock distribution is required.
Static and dynamic variability are managed through a number of
distributed sensors, both direct (critical path monitors, both
embedded and replica-based) and indirect (thermal sensors,
both absolute and relative). Sensors are accessible through
memory-mapped registers grouped in the Clock Variability and
Power (CVP) module, which controls process, variability and
temperature sensors. Hence feedback-based software policies
can be implemented for operating point selection.

The first silicon embodiment of P2012 is the flexible SoC
depicted in Figure 1. One key innovation in the physical
implementation of the SoC is its flexibility in off-chip
connectivity. The die can be configured as an accelerator
chiplet for three-dimensional die stacking by appropriately
setting the static MUXes shown on the right-hand side of
Figure 1. In this 3D mode (denoted 3D ANoC) the fabric
interface to host and main memory goes through three
32-bit-wide asynchronous IO ports driven by micro-buffers
and tied to micro-pads for die stacking. In addition (not
shown in the figure), power and ground are also delivered
through a vertical plug. In this configuration the die will be
flipped and stacked on top of a host SoC with CPU,
peripherals, standard IOs and DRAM interfaces.

A second, 2D configuration is supported by the static MUXes.
In this mode a traditional board-level high-speed interface
(denoted 2D SNoC) links the fabric with the external host and
main memory. This interface is physically driven through a
smaller number of standard IO pads (two 81-pin ports). The
2D configuration allows simple interfacing with on-board
FPGA-based hosts.

The SoC is being implemented in STMicroelectronics'
low-power 28 nm CMOS process. The target chip area is below
26 mm². The power distribution grid of the SoC is designed to
handle power delivery in both 3D and 2D configurations. The
chip power consumption under heavy workload is upper-bounded
at 4 W (at 1.1 V, 125 °C), but its aggressive power
management features enable energy-proportional operation,
with an average power of a few hundred mW.
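
As an illustration of what a feedback-based software policy for operating point selection could look like, here is a minimal sketch. Everything in it is hypothetical: the `SensorReadings` fields stand in for values a policy might read from the CVP registers of a cluster, and the frequency steps and thresholds are invented for the example. None of it is the P2012 register map or an actual P2012 policy.

```python
from dataclasses import dataclass


@dataclass
class SensorReadings:
    """Hypothetical snapshot of one cluster's sensors."""
    temperature_c: float        # thermal sensor reading
    critical_path_slack: float  # timing margin from a critical path monitor (0..1)


FREQ_STEPS_MHZ = [100, 200, 300, 400, 500, 600]   # hypothetical FLL settings


def next_frequency_index(idx, readings, max_temp_c=110.0, min_slack=0.05, high_slack=0.20):
    """One iteration of a per-cluster control loop (all thresholds are assumptions).

    Step down when the cluster is too hot or timing margin is too small,
    step up when there is comfortable margin, otherwise hold.
    """
    if readings.temperature_c > max_temp_c or readings.critical_path_slack < min_slack:
        return max(idx - 1, 0)
    if readings.critical_path_slack > high_slack and readings.temperature_c < max_temp_c - 10:
        return min(idx + 1, len(FREQ_STEPS_MHZ) - 1)
    return idx


idx = 3
idx = next_frequency_index(idx, SensorReadings(temperature_c=118.0, critical_path_slack=0.30))
print(FREQ_STEPS_MHZ[idx])   # the hot cluster is slowed down by one step
```

Because each cluster has its own clock, such a loop can run independently per cluster, which is what makes fine-grained policies of this kind possible.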

Figure 1 : Block diagram of the flexible SoC.

References :
[1] Melpignano D., Benini L., Flamand E., Jego B., Lepley T., Haugou G., Clermidy F. & Dutoit D., "Platform 2012, a many-core computing
accelerator for embedded SoCs: Performance evaluation of visual analytics applications." 49th Annual Design Automation Conference, DAC '12,
3 June 2012 - 7 June 2012: 1137-1142.
[2] Y. Thonnart, P. Vivet, F. Clermidy, "A fully-asynchronous low-power framework for GALS NoC integration", DATE 2010


Enhancing Cache Coherent Architectures with Access Patterns for Embedded
Manycore Systems
Research topics : shared memory, coherence protocols, manycores, memory access patterns
J. Marandola (USP), S. Louise, L. Cudennec, J-T Acquaviva, D.A. Bader (GATech)
ABSTRACT: One of the key challenges in advanced micro-architecture is to provide high-performance hardware
components that work as application accelerators. In this paper [1], we present a Cache Coherent Architecture
that optimizes memory accesses that follow patterns, using both a hardware component and specialized
instructions. The high-performance hardware component in our context is aimed at CMP (Chip Multi-Processing)
and MPSoC (Multiprocessor System-on-Chip). We also provide a first evaluation of the proposal on a representative
embedded benchmark program, which shows that we can achieve over 50% computing speedup and reduce
memory throughput by nearly 40%.
Shared memory paradigms are gaining interest for programming
multicore systems: the main C compilers already embed
support for OpenMP. Indeed, such programming concepts
allow legacy code to be improved to obtain reasonable and
efficient multicore support. But the age of simple multicores
is coming to an end: as the number of cores grows, single
buses are replaced by Networks-on-Chip (NoCs), distributed
memory and distributed data-paths, and the bus snooping
techniques used to ensure cache coherence are no longer
applicable.
With distributed caches and NoCs, the usual MESI (Modified,
Exclusive, Shared, Invalid) protocol for cache coherence must
be modified to refer to a given (reference) core called the
Home Node (HN), which tracks the MESI state of a given cache
line for the whole chip. But this technique does not scale
well, and is not suited to embedded devices and applications.
First, it can generate a lot of message traffic, as seen in
Figure 1, and, second, it does not take advantage of regular
memory accesses.

Figure 1: A write message transaction with the baseline protocol.

Such regular accesses can be represented as memory access
patterns, and a research effort was undertaken which led to a
patent filing [2]. Improving on the baseline protocol, which
is the state of the art of shared memory mechanisms for
manycore systems, a hardware structure and a specific
protocol were designed to handle pattern-based accesses. An
example comparing the same series of data accesses for both
protocols can be seen in Figure 2.

Figure 2: Comparison between the baseline protocol and the pattern
approach (speculative-hybrid protocol).

Even for such a simple pattern with only 3 elements (the
difference grows linearly with the size of the pattern), the
number of messages is reduced, and a speculative prefetch is
performed: once the first element of the pattern is detected,
the remaining parts of the pattern are fetched and updated
without waiting for any other memory access. Hence, future
accesses are automatically prefetched and ready for use,
reducing both throughput and memory latency.
A first real-size evaluation of the expected advantage was
carried out with a simple simulation instrumented with an
in-house modified version of the pinatrace Pintool memory
analyzer, from Intel's Pin framework. On a two-pass image
filter, chosen because it stresses memory accesses, we
obtained a 37% reduction in message throughput and an
acceleration of the application by more than 50% with
regard to the baseline protocol alone.

Hence, this protocol, which takes advantage of regular memory
accesses (patterns), was validated on a program
representative of embedded applications. The results show
that such an apparatus significantly reduces message and
memory throughput and accelerates applications. Such a
breakthrough can be vital for the future of manycore
systems, their programmability and their performance.
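
The benefit of pattern-triggered prefetching can be illustrated with a toy message-count model. It is a sketch under simplifying assumptions, not the speculative-hybrid protocol itself: a pattern is reduced to a (base, stride, count) descriptor, every miss costs one request plus one data reply per delivered line, and coherence states and the Home Node are not modeled. The class and field names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pattern:
    """Hypothetical descriptor of a regular access pattern."""
    base: int
    stride: int
    count: int

    def addresses(self):
        return [self.base + i * self.stride for i in range(self.count)]


class Cache:
    """Toy cache that counts protocol messages (assumed cost model)."""

    def __init__(self, patterns=()):
        self.lines = set()
        self.patterns = {p.base: p for p in patterns}   # keyed by trigger address
        self.messages = 0

    def read(self, addr):
        if addr in self.lines:
            return "hit"
        pattern = self.patterns.get(addr)
        wanted = pattern.addresses() if pattern else [addr]
        self.messages += 1 + len(wanted)     # one request, one data reply per line
        self.lines.update(wanted)            # speculative prefetch of the whole pattern
        return "miss"


accesses = [0x100, 0x140, 0x180]             # a 3-element pattern, as in the example

baseline = Cache()
baseline_results = [baseline.read(a) for a in accesses]

pattern_cache = Cache([Pattern(base=0x100, stride=0x40, count=3)])
pattern_results = [pattern_cache.read(a) for a in accesses]

print(baseline.messages, pattern_cache.messages)
```

In this cost model a pattern of n elements costs 2n messages with the baseline cache and n + 1 with the pattern table, so the saving grows linearly with n, and every access after the first becomes a hit.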

References :
[1] Marandola, J.; Louise, S.; Cudennec, L.; Acquaviva, J.-T. & Bader, D. A., "Enhancing Cache Coherent Architectures with access patterns for
embedded manycore systems", Proc. of 2012 International Symposium on System on Chip (SoC), Tampere, Finland, pp. 1-7, 2012
[2] L. Cudennec, J. Marandola, J-Th Acquaviva and J-S Camier, Multi-core System and Method of Data Consistency, FR2970794 (A1), CEA,
January 2011.


Adaptive Stackable 3D Cache Architecture for Manycores
Research topics: 3D, cache, NUCA, manycore
E. Guthmuller, I. Miro-Panades and A. Greiner (UPMC/LIP6)
ABSTRACT: With the emergence of manycore architectures, the need for on-chip memories such as caches grows
faster than the number of cores. Moreover, the bandwidth to off-chip memories is saturating. Big memory caches
can alleviate the pressure on off-chip accesses. We have designed an adaptive 3D cache architecture that takes
advantage of dense vertical connections in stacked chips. We also propose a dynamically adaptive mechanism to
optimize the use of this 3D cache architecture according to the needs of the workload. We show that
our approach can lead to a 50% reduction of both external memory accesses and application execution time.

3D technologies allow the placement of caches on top of
processors. This greatly simplifies the circuit floorplan. A big
NUCA (Non-Uniform Cache Access) cache can then be used
without sacrificing access latency and bandwidth to the
cache, and without introducing much distance between the
processors.
With the high TSV density of recent 3D technologies, very
wide vertical interconnections can provide high vertical
bandwidth to the distributed 3D cache architecture. Such
distributed cache architectures can be built with a large
number of vertical access ports, as shown in Figure 2. We can
even imagine having one access port to the 3D cache per
processing unit.
The physical view of the 3D cache architecture that we
proposed in [1] is depicted in Fig. 1. This architecture
integrates processing units, cache tiles, cache controllers and
external memory controllers interconnected by NoCs.
Processing units (processors or dedicated hardware
accelerators) send requests to cache controllers mapped in
the global memory space. The cache controllers then
dispatch these requests to cache tiles through a 3D NoC.
Cache tiles process requests and, in case of a MISS, they
transmit those requests to external memory controllers
through a 3D ExtMem NoC. In our approach, the processing
units, the cache controllers and the external memory
controller are placed in the bottom tier, while the cache tiles
are placed in the top tiers. This organization allows us to
build a modular stackable architecture.

Figure 1: 3D architecture of the MPSoC

Figure 2: Principle of the cache allocation. First mapping: the cache access controllers (Cache Access Controller 0 and 1) are mapped in the memory space (Segment 0 and Segment 1, from @ 0x000 to @ 0xFFF). Second mapping: the cache tiles are allocated to the cache access controllers.

The key point in this proposal is adaptability. As shown in Fig. 2,
the 3D cache can be statically configured by the operating system
to allocate a larger private cache capacity to a given application.
Moreover, the operating system can also decide to share a given
cache tile between several applications running in different memory
segments, which reduces the overall MISS rate at the cost of losing
exclusive access to this cache tile. In this case, the most heavily
accessed memory segments will occupy a larger storage
capacity in the 3D cache.
By allowing the OS to control the cache resource allocation,
we expect to increase performance: we can reduce the MISS rate
either for a chosen application or for the overall system. In
the latter case, we also expect to reduce the overall bandwidth
requirements to the external memory.
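The two-level mapping of Fig. 2 (memory segments onto cache access controllers, then cache tiles onto controllers) can be illustrated with a minimal Python sketch. The class name `CacheController`, the function `route`, the segment boundaries, the tile lists and the modulo interleaving of cache lines over tiles are all assumptions made for illustration; the text does not specify how a controller selects a tile.

```python
from dataclasses import dataclass


@dataclass
class CacheController:
    """One cache access controller and the memory segment mapped onto it."""
    name: str
    base: int     # first address of the segment (first mapping)
    limit: int    # last address of the segment, inclusive
    tiles: list   # (tier, column) tiles allocated by the OS (second mapping)


def route(addr, controllers, line_bytes=16):
    """Return (controller name, tile) serving `addr`.

    Interleaving cache lines over the allocated tiles with a modulo is an
    assumption of this sketch, not the published dispatch policy.
    """
    for ctrl in controllers:
        if ctrl.base <= addr <= ctrl.limit:
            line = addr // line_bytes
            return ctrl.name, ctrl.tiles[line % len(ctrl.tiles)]
    raise ValueError(f"address {addr:#x} is not mapped to any controller")


# Hypothetical configuration: segment 0 gets six tiles, segment 1 only two.
controllers = [
    CacheController("ctrl0", 0x000, 0x7FF,
                    [(0, 0), (0, 1), (1, 0), (1, 1), (2, 0), (2, 1)]),
    CacheController("ctrl1", 0x800, 0xFFF, [(3, 0), (3, 1)]),
]

print(route(0x010, controllers))
print(route(0x810, controllers))
```

Giving a segment more tiles increases the storage capacity available to the application running in it, which is the knob the operating system is expected to turn.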

SystemC and RTL models have been written to validate the
proposed cache architecture and evaluate its performance
under intensive workloads. Our experiments show that, when
running unbalanced workloads, the tile sharing mechanism
can reduce by up to 50% both the execution time of the most
memory-intensive application and the overall traffic to the
external memory on a 64-core manycore architecture with
16 MB of 3D cache per tier. With no sharing, the less
memory-intensive application is not penalized and the cache provides
a better quality of service to this application.
We have synthesized our design in the STMicroelectronics
65nm Low Power CMOS node. Each tile is a 1 MB cache with 16
ways and 512 sets, and it includes 564 TSVs to stack up to 4
tiers of 3D cache on top of the manycore. The total area of a
1 MB cache tile using 5 µm wide TSVs and SRAM memories is
6 mm². The useful area (the data of the cache) is close to
90% of the total tile area, demonstrating the low cost of our
approach.
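The tile geometry quoted above implies a few figures that are not stated explicitly. The short calculation below derives them, assuming 1 MB = 2^20 bytes, that the whole tile capacity is data, and that the 16 MB-per-tier configuration of the experiments is combined with the 4-tier maximum of the synthesized design; the resulting line size is therefore a derived value, not a published one.

```python
MB = 1024 * 1024

TILE_BYTES = 1 * MB        # 1 MB per cache tile (from the text)
WAYS = 16                  # associativity (from the text)
SETS = 512                 # sets per tile (from the text)
TIER_BYTES = 16 * MB       # 16 MB of 3D cache per tier (from the text)
MAX_TIERS = 4              # up to 4 stacked tiers (from the text)

# capacity = ways * sets * line size, so the line size follows directly
line_bytes = TILE_BYTES // (WAYS * SETS)
tiles_per_tier = TIER_BYTES // TILE_BYTES
max_cache_mb = MAX_TIERS * TIER_BYTES // MB

print(f"cache line: {line_bytes} bytes")
print(f"tiles per tier: {tiles_per_tier}")
print(f"maximum stacked cache: {max_cache_mb} MB")
```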

References :
[1] E. Guthmuller, I. Miro-Panades and A. Greiner, Adaptive Stackable 3D Cache Architecture for Manycores, in VLSI (ISVLSI), 2012 IEEE
Computer Society Annual Symposium on, Aug. 2012, pp. 39-44.


Design-for-Test and Fault Tolerant Architecture for a 3D NoC TSV-based Infrastructure
Research topics : 3D Design, TSV, Design For Test, Fault Tolerance, NoC
P. Vivet, F. Darve, D. Dutoit, F. Clermidy
ABSTRACT: 3D stacking is seen as one of the most interesting technologies for System-on-Chip (SoC)
developments. However, 3D technologies using Through Silicon Vias (TSV) have not yet proved their viability for
deployment in a wide range of products. One of the main challenges of 3D TSV-based design concerns
testability and reliability. In this work, we propose a new Design-for-Test 3D architecture, coupled with some
fault tolerance design techniques, adapted for the test and repair of TSV connections within a 3D architecture.
The proposed scheme has been successfully applied to the design of a 3D Network on Chip architecture.
3D stacking is one of the most interesting technology
revolutions for System-on-Chip to cope with the
ever increasing system complexity. 3D stacking enables
a wide range of new SoC applications,
such as heterogeneous stacking (Digital, Memory, RF,
MEMS) and interposers for multi-chip connection, similar to a
silicon board. The first envisaged 3D applications are mainly
the WideIO DRAM 3D memory interface for high throughput
and low power memory-on-logic stacking, as well as efficient
3D Network-on-Chip communication infrastructure for
logic-on-logic stacking targeting many-core architectures [1][2].
One of the main challenges of 3D TSV-based design
concerns testability and reliability. For testability purposes,
3D stacking requires that each die be individually tested
before assembly to identify the known-good-die (KGD); the
3D circuit is then assembled and tested, including the
3D TSV connections. Standardization efforts are on-going
with the IEEE 3D-Test P1838 WG to define a 3D test
standard.
In order to test a 3D stack of 3 dies (as illustrated in Figure 1),
while guaranteeing tester and CAD tool
compatibility, we propose to use the IEEE 1149.1 JTAG test
protocol. In the proposed 3D DFT architecture, the die logic is
fully wrapped in a 1149.1 Test Wrapper, composed of
Boundary Scan Chains, which are controlled by a TAP
controller. The test wrapper can be used either in INT
mode to test the die internal logic, or in EXT mode to test the
3D TSV connections between two adjacent dies.
Figure 1: 3D DFT architecture, based on a JTAG Switch

Figure 2: 3D asynchronous NoC with 3D DFT and TSV repair

The newly introduced JTAG SWITCH [3] automatically
transports the JTAG signals and makes the
tdi/tdo scan chain circulate through the 3D stack, according to the
die position in the stack. To achieve this, adjacent dies are
detected automatically using pull-down
cells together with added die-to-die detection signals forced to
logic 1. If a die is detected, the tdi/tdo muxes are controlled
to transport the scan chain from/to the die above/below; if a
die is not detected, the tdi/tdo muxes are controlled to
route the scan chain to the local TAP controller. When
testing a single die, the JTAG SWITCH behaves as a bypass
between the external JTAG port (tester connection) and the
local TAP controller.
For reliability purposes, since 3D technology has not yet
reached a mature production level with predictable yield,
a certain level of fault tolerance is required. We
introduce a fault-tolerant architecture using spare TSVs,
whose test and repair mechanism is fully integrated in the 3D
DFT architecture. For die-to-die communication, we introduce
about 12% spare TSVs (1 additional TSV for every 8 connections),
which are controlled by simple muxes/demuxes. This repair
scheme tolerates one TSV failure per group of 8 signals [3].
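The spare-TSV repair (one spare per group of 8 signals, steered by muxes/demuxes) can be sketched as follows. The shifting scheme used here, where every signal at or after the faulty TSV moves one position towards the spare, is a common repair technique assumed for illustration; the text only states that simple muxes/demuxes are used. The function names and the stuck-at-0 fault model are hypothetical.

```python
GROUP_SIGNALS = 8                 # signals per repair group (from the text)
GROUP_TSVS = GROUP_SIGNALS + 1    # eight functional TSVs plus one spare


def tsv_assignment(faulty=None):
    """Map each of the 8 signals to one of the 9 TSVs of a group.

    `faulty` is the index (0..8) of the defective TSV found during test,
    or None when the group is fault-free and the spare stays unused.
    """
    if faulty is None:
        return list(range(GROUP_SIGNALS))
    if not 0 <= faulty < GROUP_TSVS:
        raise ValueError("faulty TSV index out of range")
    # Signals below the fault keep their TSV; the others shift by one.
    return [s if s < faulty else s + 1 for s in range(GROUP_SIGNALS)]


def transmit(bits, faulty=None):
    """Send 8 bits through the group; the faulty TSV is modelled as stuck at 0."""
    mapping = tsv_assignment(faulty)
    tsv = [0] * GROUP_TSVS
    for signal, index in enumerate(mapping):
        tsv[index] = bits[signal]          # sender-side demux
    if faulty is not None:
        tsv[faulty] = 0                    # the defective TSV carries no data
    return [tsv[index] for index in mapping]   # receiver-side mux


word = [1, 0, 1, 1, 0, 0, 1, 1]
print(transmit(word, faulty=3) == word)
print(f"spare overhead: {100 / GROUP_SIGNALS:.1f}%")
```

With one spare per 8 signals the overhead is 12.5%, consistent with the "about 12%" quoted in the text, and any single failure inside a group is bypassed.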

The proposed 3D DFT and fault tolerance scheme has been
applied to a 3D asynchronous NoC [2], which has been
encapsulated by two layers (Figure 2): first a JTAG Test
Wrapper (in yellow) to test all individual TSV
connections, then a Fault Tolerance Wrapper (in pink) to
control and repair the connections using the spare TSVs. The overall
test and repair control is achieved using the JTAG protocol and
the associated JTAG SWITCH, which circulates the tdi/tdo scan chain
within the 3D circuit stack.
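The way the JTAG SWITCH lets the scan chain follow the stack height can be pictured with a small behavioural model. It is only a sketch under stated assumptions: every die keeps its local TAP controller in the chain, the chain is extended upwards whenever a die is detected above, and the detection signal reads 0 (pull-down) unless a neighbouring die forces it to 1. The function names are hypothetical and the actual mux ordering of the published design may differ.

```python
def die_detected(neighbour_present):
    """Detection signal: pulled down to 0 by default, forced to 1 by a stacked die."""
    return 1 if neighbour_present else 0


def scan_chain(num_dies):
    """Return the TAP controllers traversed from tdi to tdo, bottom die first."""
    if num_dies < 1:
        raise ValueError("a stack needs at least one die")
    chain = []
    for die in range(num_dies):
        chain.append(f"TAP{die}")
        if not die_detected(die + 1 < num_dies):
            break   # nothing above: the switch closes the chain on the local TAP
    return chain


print(scan_chain(1))   # single die: bypass between tester port and local TAP
print(scan_chain(3))   # three stacked dies, as in Figure 1
```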

References :
[1] F. Clermidy, F. Darve, D. Dutoit, W. Lafi, P. Vivet, 3D Embedded Multi-core: Some Perspectives, DATE2011, Grenoble, March 2011.
[2] P. Vivet, D. Dutoit, Y. Thonnart and F. Clermidy, 3D NoC Using Through Silicon Via: an Asynchronous Implementation , VLSI-SOC2011,
Hong Kong, Oct2011.
[3] P. Vivet, F. Clermidy, D. Dutoit, "Design-for-Test and Fault Tolerant Architecture for a 3D NoC TSV-based Infrastructure." LPonTR
Workshop, during European Test Symposium, ETS12, Annecy, France, May 2012.


6T SRAM Design for Wide Voltage Range in 28nm FDSOI
Research topics: SRAM, VMIN, Energy efficiency, UTBB-FDSOI
O. Thomas, B. Nikolić (UC Berkeley)
ABSTRACT: The requirements for high performance and low power in modern portable devices highlight the
need for circuit operation over a wide range of supply voltages to maximize energy efficiency for given
performance requirements. Unique features of the 28nm ultra-thin body and buried oxide (UTBB) FDSOI
technology enable the operation of SRAM in a wide voltage range. This work investigates the design of a HD 6T
SRAM array.

In FDSOI technology, VT is primarily set by the metal-gate
(MG) stack work function. UTBB-FDSOI technology offers
additional flexibility by setting the back-plane (BP) doping type
underneath the BOX, either n or p (Fig. 1). Combining BP and twin-MG
process integration provides at least 3 distinct VTs
(High-VT, Regular-VT and Low-VT). In addition, the BOX
dielectric electrically isolates the well from the source and
drain of the transistors, which expands the range of possible
well bias voltages (VB) and therefore, through a high body
factor, the range of possible VT adjustments. VB is
only limited by the pn-well junctions. FDSOI achieves record-low
VT variability because of its immunity to random dopant
fluctuations (RDF), even with forward body bias (FBB).

Figure 1: UTBB-FDSOI device cross section view. BP is electrically
connected to VB through the well. Biasing VB over a wide
voltage range allows a continuous VT adjustment for process
compensation and power management [1].

The analysis methodology in this paper is based on dynamic
margins, obtained through transient simulations. Margins
against read stability (RS), read access time (RA) and
writeability (WA) failure are assessed using Monte Carlo (MC)
based bit error rate (BER) estimates that, in contrast to other
methods, do not make any assumption about the distribution of
each failure metric. The simulation conditions
assumed are: typical process corner, 27°C, 28nm high
density (HD) 6T bitcell with post-layout backend parasitic
capacitances, and a clock period scaling versus VDD extracted
on a critical path of an ARM 9 Cortex, assuming 2 GHz at 1 V.
The simulations have been performed with a silicon-calibrated
surface-potential based SPICE model.
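As an illustration of the reporting convention, the sketch below counts failures directly from Monte Carlo samples (no distribution is fitted to the failure metric) and then expresses the resulting BER as an "equivalent sigma" through the standard normal quantile. The one-sided convention, the function names and the toy Gaussian margin sampler are assumptions for demonstration; they do not reproduce the transient SPICE simulations of the paper.

```python
import random
from statistics import NormalDist


def equivalent_sigma(ber):
    """One-sided equivalent sigma z such that P(Z > z) = ber for a standard normal Z."""
    if not 0.0 < ber < 1.0:
        raise ValueError("ber must be strictly between 0 and 1")
    return -NormalDist().inv_cdf(ber)


def monte_carlo_ber(margin_sampler, n_samples, seed=0):
    """Empirical BER: fraction of samples whose margin is negative (a failure)."""
    rng = random.Random(seed)
    failures = sum(1 for _ in range(n_samples) if margin_sampler(rng) < 0.0)
    return failures / n_samples


def toy_margin(rng):
    """Toy margin with mean 0.1 and sigma 0.05, i.e. a nominal 2-sigma margin."""
    return rng.gauss(0.1, 0.05)


ber = monte_carlo_ber(toy_margin, n_samples=100_000)
print(f"BER = {ber:.4f}, equivalent sigma = {equivalent_sigma(ber):.2f}")
```

Plain counting needs on the order of 1/BER samples, so very high sigma targets are out of reach for this naive estimator; the sketch only shows how a BER maps onto the sigma axis of a plot such as Fig. 2.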
For the bulk-like baseline 6T bitcell, the minimum usable
supply voltage (VMIN) is limited by RA and WA (Fig. 2). In
contrast to static analysis (with infinite access time), the RS
margin, a dynamic metric, is high even at very low VDD thanks to the
lack of time to flip the bitcell. For a distribution that exceeds
6σ in variation, VMIN is 800mV for a 256b-tall column.
WA-limited VMIN is 760mV.
For a short clock period (CP), WA VMIN is limited by the
completion of the transition of the high logic level node
driven by the PMOS pull-up (PU) transistors. The RA limitation of
VMIN is due to the low drivability of the pass-gate (PG) transistors.


Figure 2: 6-transistor bitcell schematic (pull-up PUL/PUR, pull-down PDL/PDR and pass-gate PGL/PGR transistors) and BER equivalent sigma
vs. VDD for read stability (RS), readability (RA) and writeability (WA).
VMIN is limited by WA and RA, not by RS.

To strengthen the PU, a Single-P-Well (SPW) bitcell architecture is
introduced, depicted in Fig. 3. Both PMOS and NMOS
transistors are placed over a common P-well (PW), which lowers
the threshold of the PMOS transistors. The VT of the PD and PG
NMOS transistors does not change compared to the bulk-like
bitcell. By increasing VB, the NMOS transistors are forward biased
to improve RA and therefore VMIN. In this framework, the PW
is isolated from the p-substrate by a deep n-well
(DNW) tied to VDDS. Thanks to the single common well, VB can
be biased up to (or tied to) VDDS, biasing the NMOS transistors
in a full forward mode. The WA improvement lowers VMIN by
120mV for 64 and 128 bitcell columns, while for 256b the RA
improvement lowers VMIN by 60mV. RS is also improved by
almost one sigma due to the higher strength of the PU, which
reinforces the high voltage level on the opposite side of the
bitcell during the read stress.
The SPW architecture and VB biasing improve VMIN at the cost
of increased leakage at the same supply voltage. For power
management in a mobile processor, two back-bias modes can
be considered: in active mode, VB would be biased to VDD to
achieve the lowest VMIN, and in standby mode VB is grounded
to minimize the static power, making the SPW bitcell
compelling for high capacity cache designs.
Figure 3: SPW 6T bitcell schematic and VMIN benchmark. SPW with VB
tied to VDD leads to the lowest RA and WA VMIN [2].

This work was carried out in collaboration with the BWRC of
UC-Berkeley and STMicroelectronics.

References :
[1] C. Fenouillet-Beranger et al., "Efficient Multi-VT FDSOI technology with UTBOX for low power circuit design", SOI Conference, 2012.
[2] O. Thomas et al., "6T SRAM design for wide voltage range in 28nm FDSOI", SOI Conference, 2012.


Ultra-Low-Voltage SRAM Design in UTBB-FDSOI technology
Research topics : SRAM, Low power, Low Voltage, Stability
A. Makosiej (ISEP & Leti), O. Thomas, Andrei Vladimirescu (ISEP), Amara Amara (ISEP)
ABSTRACT: In today's systems-on-a-chip (SoC) the embedded SRAM can often take over 50% of total chip area,
which in some cases may lead to the leakage power dominating the overall power consumption. Supply voltage
(VDD) scaling is an efficient way to reduce the SRAM leakage but it is limited by the ever increasing parameter
variations, which adversely impact the cell stability. Ultra-thin-body and box FDSOI (UTBB-FDSOI) technology
emerges as an efficient solution for Low Power SRAM design, due to its strong multi-VT capabilities and low
variability. Moreover, the high body factor of this technology allows the optimization of active and standby mode
stability, further increasing its potential for Low Power SRAM design.

Figure 1: VMIN [V] as a function of AVT [mV·µm] for 2 different cell sizes (0.059 µm² and 0.072 µm²) and
cell VT ratios: one corresponding to minimum DRV (VTN-100 mV) and the other
to the nominal case with improved read stability (VTN+100 mV)

Figure 2: DRV [V] degradation vs. technology node [nm] for 2 different VT
sets (nominal, and minimum DRV with VTN = VTNNOM - 100 mV) and AVT = 1.25 mV·µm

Figure 1 depicts VMIN as a function of AVT for two different VT
cases, for the minimum and enlarged cells at the 22 nm node. It
can be noted that VMIN decreases significantly towards lower
AVTs, with a slope of over 600 mV per 1 mV·µm. Moreover, the
optimization of DRV by decreasing VTN causes a simultaneous
increase of VMIN by 100 mV, indicating that different VTs are
required for VMIN and DRV minimization. Figures 2 and 3
show the DRV and the maximum tolerable AVT for VMIN<VDD for
various technology nodes. Clearly, the DRV increase with
technology node shrinking is significant even for an AVT as
low as 1.25 mV·µm, in particular for the nominal VT case. In
Fig. 3 it can be observed that, in order to maintain VMIN<VDD
for the minimum sized cell, a 30% decrease of AVT per technology
node must be obtained, reaching almost as low as 1.25
mV·µm at the 22 nm node for the nominal case and 1 mV·µm for
the DRV-optimized case.
UTBB-FDSOI with AVT=1.25 mV·µm should therefore allow
obtaining VMIN<0.7 V and DRV<0.4 V at the 22 nm node,
provided that a 100 mV VT adjustment through body
biasing between active and standby modes is performed and
that the initial nominal VT values are carefully set using the
multi-VT capabilities.


Minimization of SRAM power consumption is a major concern
for modern chip design for a number of reasons: (i) the
increasing area taken by embedded SRAM, even over 50% of
the total area, (ii) the increasing individual transistor leakage due
to the degradation of electrostatic control of the channel in
standard bulk CMOS devices, (iii) the large random threshold
voltage (VT) variation, which limits the voltage scaling range due to
stability constraints ($\sigma_{VT} = A_{VT}/\sqrt{LW}$, where $A_{VT}$ is the
Pelgrom coefficient and L and W are the transistor gate length and width)
and (iv) contradictory requirements on
cell operation conditions for best stability in active and
standby modes [1,2].
UTBB-FDSOI is an attractive solution to reduce the impact of
these issues on SRAM power efficiency. Improved
electrostatic control of the channel leads to better device
properties and hence lower leakage. Thanks to the undoped
channel, Random Dopant Fluctuations no longer affect
variability. As a consequence, AVT is significantly reduced
and is evaluated at approximately 1.25 mV·µm at 22nm,
compared to almost 2.5 mV·µm for 32nm Low Power bulk
CMOS. A high body factor of 60-70mV/V and a very wide range of
body bias adjustment are both unique properties of
UTBB-FDSOI at sub-28 nm nodes, allowing stability optimization for
active and standby modes separately. Finally, due to the
possibility of changing the type of the backplane doping,
modifying the gate workfunction and using either
single or two well designs, strong multi-VT capabilities are
obtained. The importance of these features for sub-28 nm
power efficient SRAM is demonstrated through the analysis of
high density SRAM limitations from the point of view of
minimum applicable voltages in active (VMIN) and standby
(DRV) modes for 6σ yield [3]. Both metrics are evaluated using the
Static Noise Margin (SNM) approach, under the assumption
that read SNM is the limiting factor for active mode stability.
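To make the Pelgrom relation concrete, the sketch below evaluates the threshold voltage standard deviation, AVT divided by the square root of the gate area, for the two AVT values quoted in the text. The transistor dimensions `L_UM` and `W_UM` are hypothetical values chosen only for illustration; they are not taken from the paper.

```python
import math


def sigma_vt_mv(a_vt_mv_um, length_um, width_um):
    """Pelgrom's law: sigma(VT) = AVT / sqrt(L * W), in mV for AVT in mV.um."""
    return a_vt_mv_um / math.sqrt(length_um * width_um)


# Hypothetical gate dimensions of a small bitcell transistor (not from the paper).
L_UM, W_UM = 0.03, 0.08

for label, a_vt in (("UTBB-FDSOI, 22nm", 1.25), ("LP bulk CMOS, 32nm", 2.5)):
    print(f"{label}: sigma(VT) = {sigma_vt_mv(a_vt, L_UM, W_UM):.1f} mV")
```

For a given gate area, halving AVT halves the VT spread, which is why the lower AVT of UTBB-FDSOI translates directly into more room for voltage scaling.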

Figure 3: Minimum AVT [mV·µm] required to meet the minimum cell sizing
requirement from the VMIN perspective vs. technology node (18, 22, 28, 32, 38 and 45 nm) for 2
different VT sets; VDDs scaled by technology along the LP roadmap
(0.65 V, 0.7 V, 0.75 V, 0.8 V, 0.85 V and 0.9 V, respectively)

This work was performed in collaboration with Andrei
Vladimirescu (BWRC, UC Berkeley and ISEP, Paris) and
Amara Amara (ISEP, Paris).

References :
[1] A. Makosiej, O. Thomas, A. Vladimirescu, A. Amara, "Stability and Yield-Oriented Ultra-Low-Power Embedded 6T SRAM Cell Design
Optimization", DATE 2012
[2] A. Makosiej, O. Thomas, A. Vladimirescu, A. Amara, Low-Power Embedded 6T SRAM Cell Design For 6σ Yield, VARI Workshop 2012
[3] A. Makosiej, O. Thomas, A. Vladimirescu, A. Amara, CMOS SRAM Scaling Limits under Optimum Stability Constraints, ISCAS 2013


A Mixed LPDDR2 Impedance Calibration Technique exploiting 28nm Fully Depleted SOI Back-Biasing
Research topics: LPDDR2, FDSOI, Back-Biasing
D. Soussan (ST), A. Valentian, S. Majcherczak (ST), M. Belleville
ABSTRACT: Signal integrity is a major concern in high-speed interfaces for digital communications. In such
interfaces, the impedance of the output driver must be matched to the transmission line, to avoid reflections.
Traditionally, a thermometer code is used for adjusting the impedance as function of Process, Voltage and
Temperature: segments of the output driver are turned on or off. The drawback is that, during such calibration
phases, the interface is unavailable. The back-biasing capability of the Fully-Depleted SOI technology is
exploited to compensate for temperature and voltage drifts during circuit operation, while the process deviation
remains digitally compensated: this mixed calibration is implemented in a LPDDR2 memory interface.
With the ever increasing communication data rate, achieving
high signal quality through transmission lines is
mandatory to avoid bit errors. One of the requirements of
high-speed interfaces is that their output impedance must be
matched to the impedance of the transmission line, in order
to avoid signal reflections.
Double Data Rate (DDR) interfaces are a good example of a
high-speed digital interface, in this case between a
microprocessor and a memory. In a conventional LPDDR2
interface, widely used in mobile applications, the impedance
of the output driver is matched by using a digital calibration
scheme. The LPDDR2 transmitter is made of 7 parallel slices,
each calibrated to 240 Ω whatever the Process-Voltage-Temperature
(PVT) conditions, providing a ~34.3 Ω total
impedance on the output driver to match the transmission
line. Each slice consists of an N+1 programmable Pull-Up (PU)
array, an N+1 programmable Pull-Down (PD) array and a shared
series resistance (RLIN) for linearity concerns.
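The ~34.3 Ω figure follows from the parallel combination of the slices, as the short check below shows. The helper name `parallel_ohm` is an assumption of this sketch.

```python
def parallel_ohm(resistances):
    """Equivalent resistance of resistors connected in parallel."""
    return 1.0 / sum(1.0 / r for r in resistances)


SLICE_OHM = 240.0   # calibrated impedance of one slice (from the text)
N_SLICES = 7        # parallel slices in the transmitter (from the text)

total = parallel_ohm([SLICE_OHM] * N_SLICES)
print(f"output driver impedance: {total:.1f} ohm")
```

Because all slices are calibrated to the same value, the result reduces to 240/7, so an error on the slice impedance appears as the same relative error on the total.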
During the calibration phase, those PU/PD transistors are
turned on or off to adjust the total impedance to the
specification. It must be noted that, during this calibration
phase, the I/O interface is not functional. The calibration
phases occur not only at circuit power-on, to compensate for
Process variation, but also during circuit operation, to cope
with voltage and temperature drifts.
We conducted research to avoid DDR interfaces being
periodically rendered unavailable because of calibration, by
exploiting the Back-Biasing capability of Fully-Depleted SOI
(FDSOI) technology.
The threshold voltage of transistors fabricated in Ultra-Thin
Box and Body technology can be strongly modulated by
applying a bias voltage to the back interface. A mixed
analog/digital impedance calibration scheme was thus
developed [1]: the digital calibration is kept for the
initialization phase, to compensate for Process variation, and
the analog calibration is used for coping with temperature
and voltage drifts. The mixed calibration scheme is shown in
Fig. 1. The back-bias voltage PU_BB is common to all
transistors and is generated by a charge pump, driven by a
bang-bang controller.
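The analog loop described above (a bang-bang controller driving a charge pump that sets the back-bias voltage) can be pictured with a purely behavioural sketch. Everything numerical here is hypothetical: the linear impedance model `toy_pull_up_ohm`, the step size, the voltage limits and the sign convention (a higher control voltage lowering the impedance) are assumptions chosen for illustration and do not describe the published circuit.

```python
def bang_bang_calibrate(measure_ohm, target_ohm, v_bb=0.0, step_v=0.01,
                        v_min=-1.0, v_max=1.0, iterations=200):
    """Track `target_ohm` by nudging the back-bias voltage one step per cycle.

    The comparator only reports "too high" or "too low", so the loop never
    settles exactly: it dithers by one step around the target.
    """
    for _ in range(iterations):
        if measure_ohm(v_bb) > target_ohm:
            v_bb = min(v_bb + step_v, v_max)   # pump up: impedance too high
        else:
            v_bb = max(v_bb - step_v, v_min)   # pump down: impedance too low
    return v_bb


def toy_pull_up_ohm(v_bb):
    """Hypothetical slice impedance after a drift, decreasing with back-bias."""
    return 260.0 - 40.0 * v_bb


v_final = bang_bang_calibrate(toy_pull_up_ohm, target_ohm=240.0)
print(f"back-bias = {v_final:.2f} V, impedance = {toy_pull_up_ohm(v_final):.1f} ohm")
```

Since the loop runs in the background on the back-bias node, the driver segments are not switched and the interface can stay available, which is the point of the mixed scheme.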
Simulations of a digital calibration phase followed by an analog
calibration phase have been carried out. The results obtained
for the PU versus the PD impedance, in all PVT corners,
are depicted in Fig. 2. They show that the mixed
analog/digital impedance calibration fulfills the
±15% LPDDR2 impedance specification: the PU impedance is
within ±6%, against ±9% for the PD impedance. The lower
accuracy of the PD impedance compared to
the PU one can be explained by the fact that PU
calibration is based on an accurate external resistance,
whereas PD calibration is based on a PU replica. Therefore,
any inaccuracy coming from the PU array is
transmitted to the PD array.

Figure 1: Mixed Analog/Digital Impedance calibration circuit

Figure 2: PU impedance vs. PD impedance in all PVT corners

References:
[1] D. Soussan, A. Valentian, S. Majcherczak and M. Belleville, "A mixed LPDDR2 impedance calibration technique exploiting 28nm fully-depleted SOI back-biasing," in Proceedings of the IEEE International Conference on Integrated Circuit Design and Technology, ICICDT 2012, 30
May 2012 - 1 June 2012


Fault Tolerant Asynchronous Design, Design Flow & Application to Network-on-Chip
Research topics: Asynchronous design, SEE, Fault tolerance
J. Pontes, P. Vivet, N. Calazans (PUCRS)
ABSTRACT: In advanced CMOS technology, Single Event Effects (SEEs) due to high energy particles may cause
different types of electrical effects when crossing silicon: from small delay variations, to bit flips, up to
permanent damage. Asynchronous QDI circuits are immune to delay variations but are as sensitive to bit flips as
any synchronous circuit. Targeting fault tolerant many-core architectures, we propose a temporal redundancy
delay insensitive code for application to asynchronous Network-on-Chip. The proposed TRDIC scheme has been
validated using a new SEE digital design flow, for accurate SEE fault injection and simulation, in a 32nm CMOS
technology.
In advanced CMOS technology, Single Event Effects (SEEs)
due to high energy particles may cause different types of
electrical effects when crossing silicon: from small delay
variations, to bit flips, up to permanent damage. Due to their
un-clocked nature, Quasi Delay Insensitive (QDI) asynchronous
circuits are the most immune to delay variations thanks
to the use of Delay Insensitive (DI) codes, but can be very
sensitive to bit flips since a Single Event Effect may corrupt
the asynchronous handshake protocol. QDI asynchronous
logic has been extensively studied to design robust and
low-power Network-on-Chip interconnects for GALS-like
manycore architectures [1].
Nevertheless, although extensive research has been done on fault
tolerant synchronous design, very little research has been
carried out on fault tolerant asynchronous logic. In this
paper [2], we propose a design technique to mitigate Single
Event Effects by adding Temporal Redundancy to Delay
Insensitive Codes (TRDIC). This multiple-bit fault tolerant
design technique is adaptable to any 1-of-N DI code, and is
particularly well suited to asynchronous Networks-on-Chip.
As presented in Figure 1, the initial 1-of-m DI code is encoded
in a 2-of-m+1 DI code by adding one bit encoding the
concatenation of the current data token with the previous
data token value. The new encoded token is then sent
through the asynchronous NoC data link, and is finally
decoded, and corrected in case of error detection, by
comparing the received token with the expected previous
token. This temporal redundancy scheme is less costly than
spatial redundancy such as TMR, and requires only a slight
modification of the existing asynchronous NoC link and
routers (adding one bit per data token).
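For readers unfamiliar with delay insensitive codes, the sketch below shows the m-of-n building blocks the scheme relies on: a 1-of-n (one-hot) encoder, a validity check, and an enumeration of the codewords of an m-of-n code. It deliberately does not implement the TRDIC encoding itself, whose redundancy-bit rule is not detailed in the text; the function names are assumptions of this sketch.

```python
from itertools import combinations


def encode_1_of_n(symbol, n):
    """One-hot DI encoding: exactly one of the n wires is raised."""
    if not 0 <= symbol < n:
        raise ValueError("symbol out of range")
    return tuple(1 if wire == symbol else 0 for wire in range(n))


def is_valid_m_of_n(wires, m):
    """A received token is a valid m-of-n codeword iff exactly m wires are high."""
    return sum(wires) == m


def codewords_m_of_n(m, n):
    """All codewords of an m-of-n code, as tuples of wire values."""
    return [tuple(1 if i in high else 0 for i in range(n))
            for high in combinations(range(n), m)]


token = encode_1_of_n(2, 4)              # 1-of-4 token carrying symbol 2
flipped = (0, 1, 1, 0)                   # same token with one wire upset to 1
print(token, is_valid_m_of_n(token, 1))
print(flipped, is_valid_m_of_n(flipped, 1))
print(len(codewords_m_of_n(2, 5)))       # codewords available in a 2-of-5 code
```

A 1-of-4 code carries 4 symbols (2 bits) per token, while a 2-of-5 code offers 10 codewords on one extra wire, which gives room for redundancy in addition to the data.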
Figure 2: SEE-accurate digital design flow for fault simulation
(existing tools/formats in blue, tool/flow extensions in red)

Regarding the design flow: like local variations and signal
integrity problems, Single Event Effects (SEEs) are a new
design concern for digital system design that arises in deep
sub-micron technologies. In order to fill the gap between
accurate SPICE-level simulation of SEEs and standard
fault simulation injecting faults in digital logic at RTL level, we
have proposed and developed an accurate digital design flow,
which injects SEE faults and simulates their propagation [3].
Starting from low level SPICE-accurate simulations, SEEs are
characterized, modeled and simulated in the digital design
using commercial and well accepted standards and tools.
Existing technology libraries are modified to take SEEs into
account, and a SystemC fault simulator performs fault
injection and monitoring in accurate back-annotated gate level
simulations. This can be applied at system level for any
std-cell based design, synchronous or asynchronous.
The TRDIC fault tolerant asynchronous NoC architecture and
the associated SEE design flow have been fully exercised in a
32nm CMOS technology. The fault simulation results show
better SEE tolerance (Figure 3).

Figure 1: Temporal Redundancy Delay Insensitive Code (TRDIC)
for fault tolerant asynchronous Network-on-Chip link. The 1-of-m code of the data sender is encoded by the TRDIC encoder into a 2-of-m+1 code, sent over the QDI data link, and decoded back to 1-of-m by the TRDIC decoder for the data receiver.

The proposed Temporally Redundant Delay Insensitive codes
(TRDIC) have been evaluated using a Single Event Effect
digital fault characterization environment (Figure 2).

Figure 3: Failure rates (K failures/second) according to the SEE injection interval (100 to 1000 ns), for the
initial asynchronous NoC link and the proposed TRDIC NoC link (curves: 1-of-4, 2-of-5 and TRDIC 2-of-5).

References :
[1] Y. Thonnart, P. Vivet, F. Clermidy. "A fully-asynchronous low-power framework for GALS NoC integration." Proceedings of the 13th Design,
Automation and Test in Europe Conference and Exhibition, DATE 2010, Dresden, Germany, pp. 33-38.
[2] J. Pontes, N. Calazans, P. Vivet, "Adding temporal redundancy to delay insensitive codes to mitigate single event effects", Proceedings of
the IEEE 18th International Symposium on Asynchronous Circuits and Systems, ASYNC 2012, Copenhagen, Denmark, pp. 142-149.
[3] J. Pontes, N. Calazans, P. Vivet, "An accurate single event effect digital design flow for reliable system level design", Proceedings of the
15th Design, Automation and Test in Europe Conference and Exhibition, DATE 2012, Dresden, Germany, pp. 224-229.


Efficient Physical Implementation of Asynchronous Logic for High Performance Variability-Tolerant Circuits
Research topics : EDA flows; Asynchronous circuits; Networks-on-chip
Y. Thonnart, E. Beigné, F. Clermidy, P. Vivet
ABSTRACT: Aggressive CMOS technology nodes present increasing variability, which impedes the implementation of
high-performance large scale synchronous circuits. To overcome this, we developed a performance-oriented
implementation flow for QDI Asynchronous circuits, which is fully compatible with conventional EDA tools for
synchronous designs. Using pseudo-synchronous models of a simple standard-cell library for asynchronous
logic, a simple set of pseudo-synchronous timing constraints can be given to industrial EDA tools to benefit from
their optimization strategies, during synthesis, place & route. This flow allows achieving significantly better
performance and regularity than asynchronous modeling, for faster run times and reduced design effort.
While QDI asynchronous circuits are designed to be
insensitive to propagation delays, they cannot rely on a
global clock frequency target to define their speed, and most
often full-custom implementation is needed to achieve high
performance.
To overcome this, our method [1,2] considers each flow-control
synchronization cell (C-element) as a pseudo-flip-flop,
by splitting the timing arcs between its inputs and outputs in
two, taking a reference time instant in the middle as a
pseudo-clock triggering edge. As these C-elements are the
only ones needing a Reset input to initialize their logic state,
it is possible to define the global Reset signal as a
pseudo-clock with ideal propagation delays. The original in→out
timing arcs are converted to a Reset→out combinational arc
at the beginning of a pseudo-synchronous timing path, and
several in→Reset setup constraints at the ends of
pseudo-synchronous timing paths. By doing so, it becomes possible
to break all combinational loops without disabling any timing
arc, as shown in Fig. 1.
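The loop-breaking effect of the pseudo-flip-flop view can be checked on a toy timing graph. In the sketch below each C-element node is split into a launch point (start of a Reset→out arc) and a capture point (end of an in→Reset setup constraint), after which the graph becomes acyclic. The node names, the two-stage pipeline and the helper functions are assumptions made for illustration; `graphlib` requires Python 3.9 or later.

```python
from graphlib import CycleError, TopologicalSorter


def has_cycle(edges):
    """True if the directed graph given as (source, destination) pairs has a loop."""
    graph = {}
    for src, dst in edges:
        graph.setdefault(src, set())
        graph.setdefault(dst, set()).add(src)   # node -> set of predecessors
    try:
        tuple(TopologicalSorter(graph).static_order())
    except CycleError:
        return True
    return False


def split_c_elements(edges, c_elements):
    """Model each C-element as a pseudo-flip-flop with separate launch and capture pins."""
    result = []
    for src, dst in edges:
        new_src = f"{src}/launch" if src in c_elements else src
        new_dst = f"{dst}/capture" if dst in c_elements else dst
        result.append((new_src, new_dst))
    return result


# Two handshaking stages: forward logic C0 -> C1, backward (acknowledge) logic C1 -> C0.
edges = [("C0", "fwd"), ("fwd", "C1"), ("C1", "bwd"), ("bwd", "C0")]

print(has_cycle(edges))                                   # asynchronous view
print(has_cycle(split_c_elements(edges, {"C0", "C1"})))   # pseudo-synchronous view
```

Every original edge is still present after the split, which mirrors the claim that no timing arc has to be disabled.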
Figure 2: Implementation flow using pseudo-synchronous models (steps: synthesis, place & IPO, CTS, route & IPO, sign-off TA, DRC/LVS, final simulation, tape-out)

Using a dummy clock constraint on the pseudo-synchronous
models, it is possible to perform timing-driven synthesis,
placement and routing, with up to 60% improvement in
speed for at most a 25% increase in area, as shown in Fig. 3.
With refined pseudo-synchronous constraints for forward and
backward paths, it is even possible to achieve maximum
performance with zero slack and only a 20% area increase.

Figure 1: (a) Combinatorial loops in asynchronous circuits;
(b) pseudo-synchronous forward & backward paths

The method derives new timing models for the C-elements
using initial Liberty '.lib' models coming from standard cell
characterization. From the original timing arcs, it derives a
single dummy combinational arc Reset→out, depending on
the cell output capacitance, and several in→Reset setup
constraints depending on the input transition times.
Conventional EDA tools can therefore handle the circuit as a
fully synchronous one, and asynchronous designs can benefit
from the full power of timing-aware placement and in-place
optimization algorithms developed for synchronous circuits.
The resulting implementation flow is described in Fig. 2.

Figure 3: Pseudo-synchronous quality of results vs dummy period

Our method was successfully applied to implement SoCs
using high-speed asynchronous networks-on-chip in 65nm
[3,4] and 28nm [5], with peak performance of up to 1.28GHz.

References :
[1] Y. Thonnart, E. Beigné, and P. Vivet, "A pseudo-synchronous implementation flow for WCHB QDI asynchronous circuits", Proceedings of the
2012 IEEE 18th International Symposium on Asynchronous Circuits and Systems (ASYNC 2012), pp. 73-80, May 2012.
[2] Y. Thonnart, P. Vivet, F. Clermidy, "A fully-asynchronous low-power framework for GALS NoC integration", Proceedings - Design,
Automation and Test in Europe, DATE 2010, pp. 33-38, March 2010.
[3] P. Vivet, D. Dutoit, Y. Thonnart, F. Clermidy, "3D NoC using through silicon via: An asynchronous implementation", 2011 IEEE/IFIP 19th
International Conference on VLSI and System-on-Chip, VLSI-SoC 2011, pp. 232-237, October 2011.
[4] F. Clermidy, C. Bernard, R. Lemaire, J. Martin, I. Miro-Panades, Y. Thonnart, P. Vivet, N. Wehn, "A 477mW NoC-based digital baseband for
MIMO 4G SDR", Digest of Technical Papers - IEEE International Solid-State Circuits Conference, ISSCC 2010, pp. 278-279, February 2010.
[5] D. Melpignano et al., "Platform 2012, a many-core computing accelerator for embedded SoCs: performance evaluation of visual analytics
applications", Proceedings of the 49th Annual Design Automation Conference, DAC 2012, pp. 1137-1142, June 2012.


An Iterative Computational Technique for Performance Evaluation of Networks-on-Chip
Research topics : Networks-on-chip ; Performance Evaluation ; Formal methods
S. Foroutan (now with TIMA), Y. Thonnart, F. Petrot (TIMA)
ABSTRACT: This work introduces a novel analytical method for evaluating the performance of best effort
wormhole Networks-on-Chip (NoC), intended for use within optimization loops and design space exploration.
The method is based on a router delay model that computes the average router latency
and other delay components (such as port acquisition and link transfer delays) of a best effort router. The
router model is used iteratively to deal with the direct contentions that occur in a router (i.e. iterative
technique), and in a recursive algorithm (i.e. dependency tree) to deal with the indirect contentions and
back-pressure effects that occur in sequences of routers.
Due to the distributed and complex nature of Networks-on-Chip (NoC) in terms of topology, wire size, routing algorithm,
etc., the performance of a NoC-based infrastructure is difficult
to predict. Therefore, one of the important phases in the NoC
design flow is performance evaluation, which extracts
performance metrics in order to verify whether a specific
instance from the NoC design space satisfies the
requirements of the entire system. In this sense, reducing
the time needed to obtain the NoC performance, and consequently
speeding up the design space exploration, is one of the keys
to considerably reducing the design-flow time and cost.
Path latency is counted from the moment a tagged packet
header arrives at the input port of a source router until the
moment it leaves the destination router of the path (Fig. 1). It
is the sum of the routers' average latencies. Due to resource
sharing in Best Effort NoCs, the tagged packet may contend
with disrupting packets coming from other flows.
Direct contention (reciprocal impact on average latency that
different flows produce on each other) happens in one router
and causes a cyclic dependency in the computation of
latencies of different flows incoming to the router, which we
solve by an iterative technique.
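The cyclic dependency between flows sharing a router can be resolved by a fixed-point iteration. The sketch below only illustrates that numerical technique: the update rule (each competing flow adds a waiting term proportional to its rate and to its own latency) is a placeholder chosen for the example, not the router delay model of [1], and the function name and parameters are hypothetical.

```python
def solve_direct_contention(base_latency, rates, tol=1e-9, max_iter=1000):
    """Fixed-point iteration over the latencies of flows sharing one router.

    Placeholder model: latency_i = base + sum_{j != i} rate_j * base * latency_j
    """
    n = len(rates)
    latency = [base_latency] * n  # start from the contention-free latency
    for _ in range(max_iter):
        updated = [
            base_latency + sum(rates[j] * base_latency * latency[j]
                               for j in range(n) if j != i)
            for i in range(n)
        ]
        # Stop when no flow latency moves by more than the tolerance.
        if max(abs(new - old) for new, old in zip(updated, latency)) < tol:
            return updated
        latency = updated
    raise RuntimeError("no convergence: the router is probably saturated")


# Two symmetric flows, 4 cycles of contention-free latency each.
print(solve_direct_contention(4.0, [0.1, 0.1]))
```

With this placeholder rule, two symmetric flows converge to base / (1 - rate * base) cycles each. When the load grows until rate * base reaches 1, the iteration no longer converges, which loosely mirrors saturation.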
Indirect contention happens in a sequence of routers. In a BE
wormhole network a chain of packets with different
destinations may stay blocked one after the other over a
sequence of routers. This means that the latency of each
router is a function of contention and thus of the latency of
its following routers (downstream routers in the sequence).
To deal with this acyclic dependency, we build a dependency
tree (Fig. 2), and then recursively compute router latencies
from the leaves of the tree backward to its root.
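The leaves-to-root evaluation can be sketched as a recursion over the dependency tree. The tree encoding, the blocking probabilities and the per-router base latencies below are hypothetical values chosen for illustration, and the weighted sum is a stand-in for the actual router model.

```python
def router_latency(router, tree, base_latency):
    """Latency of a router: local delay plus blocking inherited from downstream routers.

    tree maps a router to a list of (blocking_probability, downstream_router).
    Routers absent from the tree are leaves.
    """
    downstream = tree.get(router, [])
    return base_latency[router] + sum(
        probability * router_latency(child, tree, base_latency)
        for probability, child in downstream
    )


# Hypothetical dependency tree rooted at R0.
tree = {"R0": [(0.5, "R1"), (0.25, "R2")], "R1": [(0.5, "R3")]}
base = {"R0": 2.0, "R1": 2.0, "R2": 2.0, "R3": 2.0}
print(router_latency("R0", tree, base))
```

The recursion terminates because the dependency is acyclic: each call only descends towards the leaves, whose latency is their base latency.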
Using a Poisson analytical traffic model of an application, this
two-step method, which handles indirect contention recursively and
direct contention iteratively, achieves less than 5% error on latency
for loads up to 80% of the saturation point, within seconds of runtime.

Figure 1 : Disrupting packet communication model

Figure 2 : Dependency graph for a sample path

Table 1 : Performance evaluation runtime efficiency

Figure 3 : Path latency and NoC saturation as a function of load

References :
[1] S. Foroutan, Y. Thonnart, F. Petrot, "An Iterative Computational Technique for Performance Evaluation of Networks-on-Chip." IEEE
Transactions on Computers, to appear, 2013.
[2] S. Foroutan, Y. Thonnart, R. Hersemeule, A. Jerraya, "A Markov Chain based method for NoC end-to-end latency evaluation", Proceedings
of the 2010 IEEE International Symposium on Parallel and Distributed Processing, Workshops and Phd Forum, IPDPSW, July 2010.
[3] S. Foroutan, Y. Thonnart, R. Hersemeule, A. Jerraya, "An analytical method for evaluating network-on-chip performance", Proceedings - Design, Automation and Test in Europe, DATE, pp. 1629, March 2010.
[4] S. Foroutan, Y. Thonnart, R. Hersemeule, A. Jerraya, "Analytical computation of packet latency in a 2D-mesh NoC", 2009 Joint IEEE North-East Workshop on Circuits and Systems and TAISA Conference, NEWCAS-TAISA, May 2009.


SESAM/Par4All: A Tool for Joint Exploration of MPSoC Architectures and Dynamic Dataflow Code Generation
Research topics: MPSoC, architecture exploration, parallelization, code generation
N. Ventroux, T. Sassolas, A. Guerre, B. Creusillet (HPC Project) and R. Keryell (HPC Project)
ABSTRACT: Due to the increasing complexity of new multiprocessor systems on chip, flexible and accurate
simulators have become a necessity for design space exploration. In a streaming execution model, only a well-balanced pipeline leads to an efficient implementation. However, with dynamic applications, each stage is prone
to execution time variations. Only a joint exploration of the application parallelization possibilities,
together with the possible MPSoC architectural choices, can lead to an efficient embedded system. In this paper,
we couple a semi-automatic parallelization workflow based on the Par4All retargetable compiler with the
SESAM environment in order to explore both a radio sensing application and the asymmetric MPSoC platform.
The emergence of new embedded applications for telecom,
automotive, digital television and multimedia has
fueled the demand for architectures with higher
performance, and better chip area and power efficiency.
These applications are usually computation-intensive, which
prevents them from being executed by general-purpose
processors. In addition, architectures must be able to
simultaneously manage concurrent information flows; and
they must all be efficiently dispatched and processed. This is
only feasible in a multithreaded execution environment.
Designers are thus showing interest in System-on-Chip (SoC)
paradigms composed of multiple computation resources
connected through networks that are highly efficient in terms
of latency and bandwidth. The resulting new trend in
architectural design is the MultiProcessor SoC (MPSoC).
To increase performance on such systems, applications
need to be parallelized. One possible approach to parallelize
an application is to pipeline its execution. This programming
and execution model is well suited to data-oriented
applications that process a continuous flow of data. However,
embedded computation-intensive applications are becoming highly
data-dependent: their execution time depends on their
input data. As a result, static allocation is non-optimal in such
systems, which strengthens the need for efficient online
control.
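The cost of static allocation under data-dependent execution times can be illustrated with a small scheduling comparison. This is a deliberately simplified sketch: it treats the work as independent tasks without pipeline dependencies, the task durations are invented, and the greedy dispatcher is a generic stand-in for online control, not the control mechanism of SESAM.

```python
import heapq


def static_makespan(durations, n_pe):
    """Round-robin allocation decided before execution."""
    load = [0] * n_pe
    for index, duration in enumerate(durations):
        load[index % n_pe] += duration
    return max(load)


def dynamic_makespan(durations, n_pe):
    """Online control: each task goes to the first processing element to become free."""
    free_at = [0] * n_pe
    heapq.heapify(free_at)
    for duration in durations:
        start = heapq.heappop(free_at)  # earliest available processing element
        heapq.heappush(free_at, start + duration)
    return max(free_at)


# Hypothetical data-dependent task durations (arbitrary time units).
durations = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_makespan(durations, 2), dynamic_makespan(durations, 2))
```

With these durations, the static round-robin schedule sends both long tasks to the same processing element, while the online dispatcher spreads them as execution proceeds.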

SESAM [2,3] is a tool designed to support the design of
new asymmetric MPSoC architectures. It allows the
exploration of MPSoC architectures and the evaluation of
many different features (effective performance, used
bandwidth, system overheads...). In this paper [1], we
couple the SESAM environment with a semi-automatic code
generation workflow using Par4All. For the first time, two
exploration tools, one for the architecture and one for the task
code generation of dataflow applications, are combined to
create a complete exploration environment for embedded
systems. Fig. 1 shows how the SESAM and Par4All tools interact
to provide this fully integrated exploration environment.
To validate this approach, we explored the
implementation of a radio sensing application on a complete
asymmetric MPSoC architecture. Various
parameters of both the application and the architecture were
studied to find efficient trade-offs between performance and silicon
efficiency. For instance, we analyzed the impact of the
number of pipeline tasks and the number of processing
resources on execution speed (Fig. 2), memory usage and
control overhead, with a limited application parallelization
development cost. Thanks to the association of our tools,
electronic system designers can tune both the
application and the architecture to bring higher performance
to the end-user.

Figure 1: SESAM and Par4All exploration tool

Figure 2: Impact of parallelization and processing resources on the radio sensing application execution (axis label: Number of processing elements).

References :
[1] N. Ventroux, T. Sassolas, A. Guerre, B. Creusillet and R. Keryell, "SESAM/Par4All: A Tool for Joint Exploration of MPSoC Architectures and
Dynamic Dataflow Code Generation", HIPEAC Workshop on Rapid Simulation and Performance Evaluation: Methods and Tools (RAPIDO), Paris,
France, January 2012.
[2] N. Ventroux, A. Guerre, T. Sassolas, L. Moutaoukil, C. Bechara, and R. David, "SESAM: an MPSoC Simulation Environment for Dynamic
Application Processing", IEEE International Conference on Embedded Software and Systems (ICESS), Bradford, UK, July 2010.
[3] N. Ventroux, T. Sassolas, R. David, G. Blanc, A. Guerre, and C. Bechara, "SESAM Extension For Fast MPSoC Architectural Exploration And
Dynamic Streaming Application", IEEE/IFIP International Conference on VLSI and System-on-Chip (VLSI-SoC), Madrid, Spain, September 2010.


Statistical Leakage Estimation in 32nm CMOS Considering Cells Correlations
Research topics: Process variations, leakage currents, leakage estimation, statistical simulations
S. Joshi, A. Lombardot (ST), E. Beigne, M. Belleville, S. Girard (INRIA)
ABSTRACT: A method to estimate the leakage power consumption of CMOS digital circuits taking into account
input states and process variations is proposed. The statistical leakage estimation is based on a pre-characterization of library cells considering correlations (ρ) between cell leakages. A method to create the cell
leakage correlation matrix is introduced. The maximum relative error achieved in the correlation matrix is 0.4%
with respect to the correlations obtained by Monte Carlo simulations. Next, the total circuit leakage is calculated
from this matrix and the cell leakage means and variances. The accuracy and efficiency of the approach are
demonstrated on a C3540 (8-bit ALU) ISCAS85 benchmark circuit.
As the silicon industry is moving towards smaller and smaller
critical dimensions, controlling device parameters during
fabrication is becoming a great challenge. The variations in
channel length, width, oxide thickness and channel doping
profiles result in a large variation of the threshold voltage. As
the leakage components in a device depend on the transistor
geometry and threshold voltage, statistical variation of those
parameters leads to a significant spread of the total leakage.
This increased variability in advanced CMOS technologies is
playing an increasing role in determining the total leakage of
a chip. This has accentuated the need to statistically account
for leakage variations during the design cycle. Designing for
worst case leakage may cause excessive guard-banding,
resulting in lower performance. Moreover, underestimating
leakage variations can reduce the yield, as dies violating the
product leakage requirements must be discarded.
In this work we propose a solution to estimate the leakage
current of large circuits by modeling each gate for every
input vector. This work is based on pre-characterizing library
cells, and storing data like leakage mean, variance and
correlations in tables. Monte Carlo simulations are used as a
reference for this work. In this study, leakage estimation is
performed on technology-mapped ISCAS circuits, which
consist of fixed sets of cells. In order to enable a fast and
efficient estimation, some information is precomputed for
each cell of the library and stored in look-up tables. We also
use a correlation matrix created for the leakage estimation.
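Once per-cell means, variances and pairwise correlations are tabulated, the mean and variance of the total leakage follow from the standard formula for a sum of correlated random variables. The sketch below applies that textbook formula; the function name and the three-cell numbers are hypothetical, and this is not presented as the authors' implementation.

```python
from math import sqrt


def total_leakage_stats(means, variances, correlation):
    """Mean and variance of the sum of correlated cell leakages.

    mean_total = sum(mu_i)
    var_total  = sum_i sum_j rho_ij * sigma_i * sigma_j   (rho_ii = 1)
    """
    sigmas = [sqrt(v) for v in variances]
    n = len(means)
    total_mean = sum(means)
    total_variance = sum(
        correlation[i][j] * sigmas[i] * sigmas[j]
        for i in range(n)
        for j in range(n)
    )
    return total_mean, total_variance


# Hypothetical look-up table entries for three cells (arbitrary units).
means = [1.0, 2.0, 3.0]
variances = [1.0, 4.0, 4.0]
correlation = [
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
print(total_leakage_stats(means, variances, correlation))
```

Ignoring the off-diagonal terms would give a variance of 9 instead of 11 in this example, which illustrates why the correlation matrix matters for the spread of the total leakage.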

The Look-Up Table (LUT) approach involves the characterization of the
mean (μ), variance (σ²), and correlation (ρ) of the leakage of
each cell for each input state. To build each cell table
(leakage mean and variance for each input state), we
performed 10,000 Monte-Carlo Process
Variations (MC-PV) simulations at a temperature of T = 125 °C. We explored
this approach for the C3540 ISCAS 85 benchmark circuit with
a detailed comparison between Monte Carlo and the LUT
approach, as shown in Figure 1. The variance calculated with the LUT
approach is accurate because we consider the
input state of each cell. A detailed comparison for C3540 is
shown in the Q-Q plot of Figure 2, which compares Monte Carlo
quantiles on the horizontal axis with LUT quantiles on the vertical
axis. The linearity of the points
shows that the LUT approach fits the MC results well: in the Q-Q plot,
the blue points are aligned with the red points. 99.9% of the data
are accurately mapped and only 0.1% of the data is
overestimated by the LUT approach, using the cell correlation
coefficients.
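A Q-Q comparison of this kind pairs the quantiles of the reference distribution with those of the estimated one. The minimal sketch below uses the Python standard library and assumes both distributions are available as plain lists of samples; the helper name `qq_pairs` is hypothetical.

```python
from statistics import quantiles


def qq_pairs(reference, estimate, n=100):
    """Pair the n-1 cut points of two samples: (reference quantile, estimate quantile)."""
    return list(zip(quantiles(reference, n=n), quantiles(estimate, n=n)))


# Hypothetical samples: the estimate is the reference shifted by one unit.
reference = [float(x) for x in range(1, 101)]
estimate = [x + 1.0 for x in reference]
for ref_q, est_q in qq_pairs(reference, estimate, n=4):
    print(ref_q, est_q)
```

Points lying on the diagonal indicate matching distributions. A constant offset, as in this toy example, appears as a line parallel to the diagonal.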

Figure 2: Normalized Q-Q plot for MC and LUT approach for ISCAS C3540 circuit

Figure 1: Comparison of MC and LUT approach for C3540 ISCAS 85 circuit

The major advantage of this approach is to limit the size of
the simulations. For ISCAS C3540, the Monte Carlo simulation
time was approximately one hour, whereas the leakage computation
time with the LUT approach, including cell correlations, was a few
seconds, for a maximum relative error of 0.4% on the correlation
factor for bigger circuits.

References :
[1] Joshi S., Lombardot A., Belleville M., Beigne E. & Girard S., "A gate level methodology for efficient statistical leakage estimation in complex
32nm circuits", Design Automation & Test in Europe Conference, DATE 2013, 18-22 March 2013.
[2] Joshi S., Lombardot A., Belleville M., Beigne E. & Girard S., "Statistical leakage estimation in 32nm CMOS considering cells correlations",
2012 IEEE Faible Tension Faible Consommation, FTFC 2012, Paris.
[3] Joshi S., Lombardot A., Flatresse P., D'Agostino C., Juge A., Beigne E. & Girard S., "Statistical estimation of dominant physical parameters
for leakage variability in 32 nanometer CMOS, under supply voltage variations", Journal of Low Power Electronics 8(1), 113-124, 2012.


Workload Impact on HCI-BTI Induced Timing Variations in Processor Microarchitectures
Research topics : processor microarchitecture, simulation, HCI, BTI
O. Heron, C. Bertolini, N. Ventroux, C. Sandionigi and F. Marc (IMS-Univ. Bordeaux 1)
ABSTRACT: Hot Carrier Injection (HCI) and Bias Temperature Instability (BTI) failures are becoming the major
detractors in leading-edge MOS technology. High-performance chips will suffer from internal timing shifts
and logic errors. A software framework is proposed to analyze the impact of processor workload on HCI-BTI
early in the design cycle. The paths sensitive to the degradation mechanisms, the variations of slack times and
the most aggressive instructions are extracted. This work is the first step of a global methodology that aims at
enabling design space exploration of multi-processors for reliability in 28-nm and below.

Die shrinking below 32 nm, combined with the non-ideal
scaling of voltage, increases the probability that transistors
encounter Hot Carrier Injection (HCI) and Bias Temperature
Instability (BTI). These degradation mechanisms result in a
shift of Vth, leading to a loss of performance. The damage
due to HCI accumulates during transitions between two
logic states, while the damage due to BTI accumulates when
transistor gates have a positive (PBTI) or negative (NBTI)
bias voltage. Static Timing Analysis (STA) tools help achieve
a preliminary sign-off of the chip path timings.

Recent works proposed degradation models of standard cells
that estimate the timing shifts in logic cells during the first
design cycles and hence the variation of the maximum
frequency. The analysis is performed by applying random
input vectors, whose generation is based on the hardware
implementation (e.g., automatic test pattern generator). This
approach leads to an over-estimation of the frequency shift.
Here, a simulation framework is developed to analyze the
impact of the instruction set (RISC) on the timings of a
processor microarchitecture netlist under HCI condition [1]
(BTI is under investigation). Fig. 1 shows the internal
organization of our framework, which includes proprietary and
third-party tools. The RTL design is first synthesized. Then, the
applications are simulated on the netlist and the bit toggling
activity is extracted. Finally, the degradation analysis is
performed: aged slack times and the paths sensitive to HCI are
extracted. A RISC processor named AntX (developed by CEA
LIST) and designed in 40nm TSMC technology was
investigated (freq=200MHz). Fig. 2 shows the aged slack
time of the 70 longest paths vs. various workloads formed of
7 benchmarks and Worst case (all bits always toggle)
scenarios. The results are obtained with an identical
simulation time (17s) and an accelerated degradation
condition. Path ranking is done under fresh condition.
For all applications, the path most sensitive to HCI differs from
the critical path. For all paths, the shift varies according to
the workload. The worst case scenario (blue line) leads to an
over-estimation of the biggest variation (up to x4). The most
aggressive instruction is the Shift operation. The execution of
instruction pairs, including this instruction and various
operands, can help tune the design guard band to a more
realistic value.

Figure 1: Software framework to analyze BTI-HCI effects
[Figure 1 labels: RTL description, Design compiler, Design netlist, applications, cross-compiler, binaries, Modelsim, VCD, Toggle rate, Manufacturer library, Degradation model (fresh/aged), Cell delay updater, Updated library, PrimeTime, Slack times, Sensitive paths]

[Figure 2 annotations: the most sensitive paths under Worst case; the most sensitive paths (except Worst case); up to x4]

Figure 2: AntX slack times under HCI condition vs. Workloads
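The effect of workload-dependent toggling on path ranking can be illustrated with a toy slack computation. The degradation rule below (delay shift proportional to cell delay and toggle rate) is a hypothetical placeholder, not the degradation model used in the framework, and the path names and numbers are invented.

```python
def aged_slack(fresh_slack_ps, cells, sensitivity=0.125):
    """Aged slack of one path under a placeholder HCI-like model.

    cells: list of (fresh_delay_ps, toggle_rate), toggle_rate in [0, 1].
    Assumed shift per cell: fresh_delay_ps * sensitivity * toggle_rate.
    """
    shift_ps = sum(delay_ps * sensitivity * toggle_rate
                   for delay_ps, toggle_rate in cells)
    return fresh_slack_ps - shift_ps


# Two hypothetical paths: (fresh slack in ps, cells along the path).
paths = {
    "critical_when_fresh": (50.0, [(400.0, 0.25), (400.0, 0.25)]),
    "busy_datapath": (60.0, [(320.0, 1.0), (320.0, 1.0)]),
}
aged = {name: aged_slack(slack, cells) for name, (slack, cells) in paths.items()}
print(aged, min(aged, key=aged.get))
```

In this example the path with the smallest fresh slack is not the one with the smallest aged slack, because the other path toggles far more often. Setting every toggle rate to 1 reproduces a worst-case scenario and gives larger shifts than any realistic workload.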

This work is the first step of a global methodology that aims
at enabling the design space exploration of multi-processors
for reliability in 28-nm and below. Our current work is
focused on the development of an enhanced SystemC/TLM
based processor simulator with the capability to predict
timing shifts at instruction level, hence ready to be integrated
in a multi-processor simulator such as described in [2].

References :
[1] C. Bertolini, O. Heron, N. Ventroux and F. Marc, "Relation between HCI-induced performance degradation and applications in a RISC
processor", IEEE Int. on-line Testing Symp., pp. 67-72, July 2012.
[2] N. Ventroux, A. Guerre, T. Sassolas, L. Moutaoukil, G. Blanc, C. Bechara, and R. David, "SESAM: an MPSoC Simulation Environment for
Dynamic Application Processing", IEEE International Conference on Embedded Software and Systems (ICESS), Bradford, UK, July 2010.


Voltage and Temperature Estimation Using Statistical Tests for Variability Mitigation and Power Efficiency
Research topics : Variability, Power management, MPSoC, AVFS
L. Vincent, E. Beigné, S. Lesecq, P. Maurine (LIRMM)
ABSTRACT: Power efficiency of embedded systems has become a tremendous challenge in the context of limited
power budgets and computational performance constraints. Recent evolutions of Multi-Processor System-on-Chip
architectures now make it possible to manage the local clock frequency and the supply voltage of each power domain,
and thus its performance and power consumption. This requires fine-grain monitoring of the current operating
conditions, i.e. the supply voltage and the temperature. We propose a novel approach to estimate on-line the
local voltage and temperature of a chip using goodness-of-fit hypothesis tests to process measurements
acquired from a fully digital sensor.
Power consumption imposes huge constraints on the
development of embedded systems because of the limited power
budget and the associated thermal issues. Today, embedded
applications require ever more computational performance.
The development of MultiProcessor System-on-Chip (MPSoC)
architectures provides a way to distribute the workload over
several cores, while the power consumption can be greatly
decreased by applying so-called Dynamic Voltage
and Frequency Scaling at fine grain.
The use of advanced technologies implies more variations
during the manufacturing process (P) that affect the
characteristics and performances of similar chips on the same
wafer but also between cores on the same chip. The voltage
(V) and temperature (T) variations are environmental
variabilities that also affect the performances of the chip.
PVT-variability also affects the power consumption. To reach
performance targets, the traditional worst case approach
leads to an increase in the design margins while sacrificing
the power efficiency. As VT-variability is local and dynamic,
the architecture has to be adaptive (Fig. 1), i.e. the VT
variations have to be monitored and mitigated on-chip and at
run-time.

Figure 1: Principle of an Adaptive architecture

In this context we propose an information extraction method
(green box in Fig. 1), associated with a fully digital sensor
(black squares), to estimate the current local VT state within
a power domain. Contrary to existing voltage or
temperature sensors, our sensor [1], named Multiprobe, is a
digital one, based on a set of 7 different Ring Oscillators. As
it is not possible to directly infer the V and T values from the
set of measured frequencies, our method fuses the
measurements in order to extract the Voltage V and
Temperature T estimates from the Multiprobe measurements.
The proposed method is based on the comparison of the
current set of frequencies with other ones stored in a model
database. The models are acquired during a calibration phase
when known (V,T) states are applied to the chip.
At run-time, the comparison between the current
measurements and each model is performed using a non-parametric goodness-of-fit hypothesis test. This test
measures the discrepancy between the measurements and
the model. Fig. 2 presents one result of the estimation
procedure. The green circle corresponds to the real (V,T)
conditions applied to the circuit while the black cross is the
estimated (V,T) state. Each model tested is depicted by a dot
whose color corresponds to the discrepancy between the
model tested and the measurements acquired from the
Multiprobe. Dark red (resp. dark blue) dots correspond to a
high (resp. low) similarity between the model and the
measurements. Experiments have shown a mean estimation
accuracy of (5 mV, 7.5 °C) on the estimation of V and T.
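The model-matching step can be sketched with a two-sample Kolmogorov-Smirnov statistic, the kind of non-parametric test named in reference [3]. The sketch is illustrative only: how the Multiprobe frequencies are arranged into samples, the model database contents and the function names are all assumptions made for the example.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic: max gap between empirical CDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both step functions only change at sample points, so checking those suffices.
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def estimate_vt(measurements, models):
    """Return the (V, T) state whose stored model is closest to the measurements."""
    return min(models, key=lambda state: ks_statistic(measurements, models[state]))


# Hypothetical calibration database: (V in volts, T in deg C) -> frequencies in MHz.
models = {
    (0.9, 25): [100.0, 101.0, 102.0],
    (1.0, 25): [110.0, 111.0, 112.0],
}
print(estimate_vt([110.2, 111.1, 111.9], models))
```

Selecting the minimum discrepancy is only one possible decision rule. The resolution of the estimate is bounded by the grid of (V,T) states applied during calibration.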

Figure 2: Result of an estimation (circle = simulated state, cross = estimated state)

We are currently developing strategies to adapt, at run time, the settings of the so-called voltage and frequency
actuators to reach the most energy efficient operating point.

References :
[1] L. Vincent, E. Beigné, L. Alacoque, S. Lesecq, C. Bour, and P. Maurine, "A Fully Integrated 32 nm MultiProbe for Dynamic PVT
Measurements within Complex Digital SoC", 2nd European Workshop on CMOS Variability, VARI11, 2011.
[2] L. Vincent, P. Maurine, S. Lesecq, and E. Beigné, "Embedding Statistical Tests for on-chip Dynamic Voltage and Temperature Monitoring",
Design Automation Conference (DAC), 2012 49th ACM/EDAC/IEEE, 2012.
[3] L. Vincent, S. Lesecq, P. Maurine, and E. Beigné, "Local Condition Monitoring in integrated circuits using a set of Kolmogorov-Smirnov
Tests", IEEE Conference on Control and Applications (CCA), 2012.
[4] L. Vincent, P. Maurine, E. Beigné, and S. Lesecq, "Local Environmental Variability Monitoring using Hypothesis Tests", Conf. Int. Faible
Tension Faible Consommation (FTFC), Paris, France, 2012.
[5] S. Lesecq, L. Vincent, E. Beigné, Ph. Maurine, "VT-state condition monitoring in integrated circuits using fusion of information from general
purpose sensors", 12th International Forum on Embedded MPSoC and Multicore MPSoC, July 9-13, Canada, 2012.
[6] S. Lesecq, L. Vincent, E. Beigné, Ph. Maurine, "How state estimation in integrated circuits based on statistical tests can be used to fine-tune
the control of the voltage and frequency actuators in the power management framework", VARI 2012, Nice, France.


Power Mode Selection in Embedded Systems with Performance Constraints
Research topics: System-on-Chip, Power Management
Y. Akgul, D. Puschini, S. Lesecq, I. Miro-Panades, P. Benoit (LIRMM), L. Torres (LIRMM), E. Beigné
ABSTRACT: Mobile computing platforms must provide ever increasing performance under stringent power
consumption constraints. Dynamic Voltage and Frequency Scaling (DVFS) techniques reduce power
consumption by providing just enough power to the chip to finish the task before its deadline. DVFS is usually
achieved by setting the supply voltage and the clock frequency to predefined values (so-called Power Modes)
during given durations that depend on the task to be run and on its deadline. Here, the problem of power
management is recast as a linear programming one and the time spent in each one of the N power modes is
obtained with a Simplex algorithm solution. Results for 3 power modes exemplify the proposed approach.
In embedded systems, the main issue is to strike a balance
between performance and power consumption. The well-known Dynamic Voltage and Frequency Scaling (DVFS)
technique decreases power consumption when the
applications require less performance. DVFS is usually
achieved by setting the supply voltage and the clock
frequency to predefined values (so-called Power Modes,
PMs) during given durations. Considering a given application
composed of a set of tasks, with various performance
constraints (defined e.g. by the deadline of each task), it is
possible to optimize performance vs. power consumption by
adjusting the duration of each PM. This problem of power
management is recast as a linear programming one and the
duration spent in each power mode is
obtained with a Simplex algorithm solution.
Assume a given processing element (PE) of an MPSoC or
multicore system with different PMs. The main objective is to
choose a sequence of PMs, and the duration of each PM, in
order to minimize the total power consumption of each PE
when N (≥ 2) power modes are used. A PM is defined
by the clock frequency F applied to the PE and the power
consumption P associated with F (and with its related supply
voltage Vdd), as depicted in Fig. 1. Suppose that a PE has to
run a task T with a known deadline d. Here, we consider a
deterministic application so that task T requires NI clock
cycles to be fully processed, including possible memory
accesses and specific functions. The mean frequency F
required to execute the NI clock cycles by the deadline d is
F = NI / d.
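The selection of durations can be sketched as a small linear program: minimize the sum of P_i * t_i subject to the durations filling the deadline and delivering NI cycles. The formulation with two equality constraints is an assumption made for illustration, and a brute-force enumeration of the vertices (at most two active modes each) stands in for the Simplex algorithm. The mode values are invented.

```python
from itertools import combinations


def select_power_modes(modes, cycles, deadline):
    """Return (energy, durations) minimising sum(P_i * t_i).

    modes: list of (frequency, power) pairs, one per power mode.
    Assumed constraints: sum(t_i) = deadline, sum(F_i * t_i) = cycles, t_i >= 0.
    With two equality constraints, an optimal vertex uses at most two modes.
    """
    best_energy, best_durations = None, None
    for i, j in combinations(range(len(modes)), 2):
        (f_i, p_i), (f_j, p_j) = modes[i], modes[j]
        if f_i == f_j:
            continue  # no unique solution for this pair
        # Solve the 2x2 system for the time spent in modes i and j.
        t_i = (cycles - f_j * deadline) / (f_i - f_j)
        t_j = deadline - t_i
        if t_i < 0 or t_j < 0:
            continue  # infeasible vertex
        energy = p_i * t_i + p_j * t_j
        if best_energy is None or energy < best_energy:
            durations = [0.0] * len(modes)
            durations[i], durations[j] = t_i, t_j
            best_energy, best_durations = energy, durations
    if best_energy is None:
        raise ValueError("deadline cannot be met with these power modes")
    return best_energy, best_durations


# Hypothetical modes (frequency in cycles per time unit, power in arbitrary units).
modes = [(100, 10), (50, 6), (25, 1)]
energy, durations = select_power_modes(modes, cycles=60, deadline=1)
print(energy, durations)
```

In this example the mean frequency of 60 lies between the two fastest modes, yet the cheapest schedule combines the fastest and the slowest mode and skips the middle one. This happens because the middle mode sits above the line joining its neighbours in the (F, P) plane.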

Figure 1: N power modes (F_i; P_i), i = 1:N

In [1] an analytical solution was proposed for 3 PMs. We
have proved that choosing the PMs (P_i, F_i) and (P_{i+1}, F_{i+1})
to run a given task, so that the mean frequency F satisfies
F ∈ [F_{i+1}, F_i], does not always provide an optimal solution
from a power consumption point of view. Our results show
that for a given distribution of the power modes, the
maximum power consumption gain is reached for a particular
workload, see Figure 2.
Moreover, we have shown that for a given range of
workloads, the 3rd power mode must be carefully placed in
order to maximize the power consumption gain. Figure 2
shows such a situation. If the 3rd PM is properly positioned
between PM1 and PM2, then the consumption gain for all
workloads between 63% and 93% is higher than 15%.

Figure 2: Improvement of the power consumption when a 3rd PM is added vs. normalized workload

The case with N > 3 PMs must be studied in depth in order
to express the optimum solution in an analytical form, or at
least provide a sub-optimal analytical solution not far from
the optimal one. In this way, the solution of the optimization
problem, solved in this work with the Simplex algorithm,
might be reached through simple computations, possibly
implemented in hardware. In our future work, the switching
time and the extra power consumed during the transition
between PMs will also be taken into account.

References :
[1] Akgul, Y.; Puschini, D.; Lesecq, S.; Miro-Panades, I.; Benoit, P.; Torres, L.; Beigne, E., "Power mode selection in embedded systems with
performance constraints," Faible Tension Faible Consommation (FTFC), 2012 IEEE, pp. 1-4, 6-8 June 2012.


Event-driven Power Management for Wireless Sensor Nodes
Research topics: Energy and Power Management, Energy harvesting
J.F. Christmann, E. Beigné, C. Condemine, J. Willemin, C. Piguet (CSEM)
ABSTRACT: Wireless Sensor Networks (WSN) need improved power supply architectures and energy
management to achieve a longer life span. Advanced power management techniques are presented that leverage
energy harvesting to reduce charge/discharge constraints on the battery. While power efficiency is improved
thanks to a multiple-power-path architecture, energy management benefits from energy-driven behavior to
optimize application scheduling. Event-based energy monitoring is proposed to allow the more complex
architecture to be aware of its energy state while maintaining low power consumption.

Energy harvesting is a relevant solution to address the energy
self-sufficiency issue of Wireless Sensor Networks. It
allows longer battery life to be reached and thus enhances
the network life span. Today's nodes embed various sensors
such as temperature, pressure, light or acceleration sensors,
data processing elements such as DSPs or microcontrollers,
and communication modules which enable wireless links
between the nodes. Powering those devices has to be done
with optimized energy efficiency. Usually, electrochemical
batteries are used as energy sources, but their finite
capacity implies limited energy autonomy. Improved
power management architectures harvest
energy from the environment (light, thermal gradients or
vibrations) in order to recharge the battery and enhance the
node life span [1].

We proposed to improve the power supply efficiency by
avoiding the need for energy to flow through the
battery. A direct power path is set up to supply power directly
from the energy harvesters to the power loads (Fig. 1). A
capacitor is used to temporarily store energy, sustaining
short but strong energy needs from the loads. The battery is
thus recharged in case of energy excess and drained
when harvested power is not sufficient to maintain the voltage
level on the capacitor.

Figure 1: Multiple power paths architecture example

Two power paths are available to supply the loads: the
conventional indirect power path consists in recharging the
battery with harvested energy and afterwards draining it to
fill the capacitor, while the additional direct power path only
includes the temporary energy storage in the capacitor [1].
The direct power path has much higher power efficiency due
to fewer voltage conversion stages, but is only available when
energy is harvested and required by the loads at the same time.
Nevertheless, when the applicative task period is fixed, as is usually the case,
the mean output power is constant. As the available environmental
energy changes, the harvested power may be too low to
maintain the capacitor voltage. The battery is then drained
to satisfy the application constraints. We propose energy-driven
schemes that adapt the task period to
incoming energy levels. In this case, tasks are performed
once enough energy is stored in the capacitor. Fig. 2
illustrates the battery's and capacitor's energy levels in fixed and
adaptive period schemes while the input power varies.

Figure 2: Adaptive period algorithm behavior

Although tasks are not precisely scheduled, the power supply is optimized because only the direct power path is used. Moreover, the battery is relieved and the energy autonomy is improved. Advanced low-power voltage monitoring is mandatory to control the power paths within the architecture. Event-driven voltage monitoring is proposed to avoid useless voltage sampling and to prevent a heavy monitoring energy cost [2].
Adaptive, harvesting-aware power management is thus implemented, and scheduling can be optimized according to incoming power levels. Algorithms are under development to leverage the energy-driven scheme, which allows battery-free operation while maintaining realistic applicative scenarios. A dedicated power management circuit is under fabrication to demonstrate power management improvements in a complete wireless sensor node.
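
The energy-driven (adaptive period) scheme described above can be illustrated with a minimal simulation sketch. It assumes an ideal, lossless capacitor, a fixed energy cost per task and a discrete time step; the function name, units and sample values are illustrative assumptions and do not represent the algorithms under development.

```python
def adaptive_task_times(harvested_power_w, task_energy_j, dt_s=1.0):
    """Return the time-step indices at which a task runs.

    Energy-driven scheme (illustrative assumption): harvested energy
    accumulates in an ideal capacitor, and a task is performed as soon as
    the stored energy covers its cost. The battery is not modeled.
    """
    stored_j = 0.0
    run_steps = []
    for step, power_w in enumerate(harvested_power_w):
        stored_j += power_w * dt_s      # direct power path: harvester -> capacitor
        if stored_j >= task_energy_j:   # enough energy stored: run the task now
            stored_j -= task_energy_j
            run_steps.append(step)
    return run_steps


# Hypothetical input profile: full power for four steps, then half power.
profile_w = [1.0, 1.0, 1.0, 1.0, 0.5, 0.5, 0.5, 0.5]
print(adaptive_task_times(profile_w, task_energy_j=2.0))
```

With this hypothetical profile, the task period stretches from two steps to four when the input power is halved, which is the qualitative behavior expected from an adaptive period scheme.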

References:
[1] J.F. Christmann, E. Beigné, C. Condemine, J. Willemin, C. Piguet, "Energy Harvesting and Power Management for Autonomous Sensor Nodes," IEEE/ACM Design Automation Conference (DAC), San Francisco, CA, USA, pp. 1049-1054, June 2012
[2] J.F. Christmann, E. Beigné, C. Condemine, C. Piguet, "Event-driven asynchronous voltage monitoring in energy harvesting platforms," IEEE NorthEast Workshop on Circuits and Systems (NEWCAS), Montréal, Canada, pp. 457-460, 2012


Architecture & IC Design for Emerging Technologies

Neuromorphic Circuits
Beyond CMOS RF Devices
RRAM Circuits


Towards cognitive chips: Design of Spiking Neural Network Building Blocks
Research topics: Spiking neurons, Neuromorphic system
R. Héliot, A. Joubert, B. Belhadj, M. Duranton, O. Temam (INRIA)
ABSTRACT: Stringent energy constraints and the increased variability of modern CMOS technologies call for energy-efficient and defect-tolerant hardware accelerators. Additionally, the spectrum of applications is widening, with a shift from computing applications to Recognition, Mining and Classification applications, for which conventional architectures are not effective. Work was carried out on Spiking Neural Networks, which are promising for their very efficient information encoding: (1) analog and digital implementations of spiking neurons were compared; (2) a configurable conduction delay was implemented; (3) 3D stacking was analyzed.

Neuromorphic architectures have been proposed over the past two decades, with the aim of emulating biological spiking neurons on dedicated silicon hardware. Such neuromorphic systems were initially developed to model biological systems and thus better understand how they function. Nowadays, however, high-performance applications are undergoing a dramatic shift of their own, from scientific computing to Recognition, Mining and Synthesis applications: few approaches are better positioned than neuromorphic architectures to tackle those applications while providing inherent defect tolerance and energy efficiency.
Spiking neurons are usually considered to be the neuron model with the greatest potential for applications, thanks to their very efficient coding of information. The model is described in Fig. 1: the input spikes are weighted and summed up in a leaky tank. If the resulting value reaches a given threshold, a spike is generated on the output.
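
The neuron model described above (weighted input spikes, leaky tank, threshold) can be sketched in a few lines of discrete-time code. This is only an illustration under stated assumptions: the multiplicative leak factor, the reset to zero after firing and the function name are choices made here, not details taken from the hardware designs.

```python
def lif_neuron(input_spikes, weights, threshold=1.0, leak=0.5):
    """Simulate a discrete-time leaky-integrate-and-fire neuron.

    input_spikes: list of time steps, each a list of 0/1 values (one per input).
    weights: one synaptic weight per input.
    leak: fraction of the tank level kept at each step (assumed leak model).
    Returns the output spike train as a list of 0/1 values.
    """
    level = 0.0
    output = []
    for spikes in input_spikes:
        level *= leak                                          # leaky tank
        level += sum(w * s for w, s in zip(weights, spikes))   # weighted sum of inputs
        if level >= threshold:                                 # threshold reached
            output.append(1)
            level = 0.0                                        # reset after firing (assumption)
        else:
            output.append(0)
    return output


# Two inputs with equal weights; only the coincident spikes at the last step
# push the tank level above the threshold.
print(lif_neuron([[1, 0], [1, 0], [0, 0], [1, 1]], weights=[0.5, 0.5]))
```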

Figure 1: Block diagram of a leaky-integrate-and-fire neuron

The area and power consumption of the hardware design can have a significant influence on system scalability, whether emulating biology or implementing processing tasks. This is why we compared analog and digital implementations of the Leaky-Integrate-and-Fire (LIF) neuron in 65nm bulk CMOS technology [1]. The comparison shows that the analog version is 20 times more energy efficient and 5 times smaller than the digital one. Although analog circuits do not scale as well as digital circuits at each technology generation, projections show that the analog implementation should keep its area advantage at least down to the 22nm node, and at equal area it will still be 3x more energy efficient.
To design an analog LIF neuron, we can easily exploit basic electrical properties of CMOS: temporal integration can be realized through capacitive integration, and spatial summation through Kirchhoff's law. Finally, leakage is an intrinsic behavior of microelectronic devices. One of the drawbacks of the analog LIF neuron implementation is the rather large capacitance that is necessary for integrating input spikes. This is why we considered exploiting the integration density offered by 3D stacking technology, and in particular taking advantage of the parasitic capacitance of Through-Silicon Vias (TSVs) [2]. Fig. 2 shows different partitioning strategies of a neuron in 3D: schemes (c) and (d) offer the most benefits, respectively reducing the area by a factor of 2 and increasing the connectivity by a factor of 8.

Figure 2: Partitioning strategies of a neuron in 3D

Finally, the conduction delay in neural systems has been proven to play an important role in processing neural information. In hardware spiking neural networks (SNN), emulating conduction delays consists of intercepting and buffering spikes for a certain amount of time during their transfer. The complexity of the conduction delay implementation increases with high spiking rates: it implies (1) storing a large number of spikes in memory cells and (2) preserving the required time resolution while processing the delays. The goal of our research was to find a cost-efficient design that supports high firing rates while maintaining good temporal accuracy. We show that this can be achieved using a mixed counter-register implementation, which provides a good area/accuracy tradeoff for a broad range of hardware spiking neural networks [3]. The size of the delay circuit increases with the required temporal accuracy (time granularity).
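
As a purely illustrative software analogue of "intercepting and buffering spikes for a certain amount of time", the sketch below keeps one countdown counter per buffered spike. It is not the mixed counter-register circuit of [3]; the class name and the tick-based timing are assumptions made here. The number of live counters grows with the spiking rate, and a finer tick means more ticks (wider counters) for the same delay, which is the tradeoff discussed above.

```python
class ConductionDelay:
    """Delay line that releases each input spike after a fixed number of ticks."""

    def __init__(self, delay_ticks):
        self.delay_ticks = delay_ticks
        self.counters = []  # one countdown per buffered spike

    def step(self, spike_in):
        """Advance by one tick; return the number of spikes released."""
        self.counters = [c - 1 for c in self.counters]
        released = sum(1 for c in self.counters if c == 0)
        self.counters = [c for c in self.counters if c > 0]
        if spike_in:
            self.counters.append(self.delay_ticks)  # buffer the new spike
        return released


# Spikes entering at ticks 0 and 2 leave at ticks 3 and 5 (delay of 3 ticks).
line = ConductionDelay(delay_ticks=3)
print([line.step(s) for s in [1, 0, 1, 0, 0, 0, 0]])
```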

References:
[1] A. Joubert, B. Belhadj, O. Temam and R. Héliot, "Hardware spiking neurons design: Analog or digital?", in Proceedings of the 2012 Annual International Joint Conference on Neural Networks (IJCNN), June 2012
[2] A. Joubert, M. Duranton, B. Belhadj, O. Temam and R. Héliot, "Capacitance of TSVs in 3-D stacked chips a problem? Not for neuromorphic systems!", in Proceedings of the 49th Annual Design Automation Conference (DAC), June 2012
[3] B. Belhadj, A. Joubert, O. Temam and R. Héliot, "Configurable conduction delay circuits for high spiking rates", in Proceedings of the 2012 IEEE International Symposium on Circuits and Systems (ISCAS), May 2012


Visual Pattern Extraction Using Energy-Efficient 2-PCM Synapse Neuromorphic Architecture
Research topics: neuromorphic system, phase-change materials
O. Bichler, M. Suri, D. Querlioz (IEF), D. Vuillaume (IEMN), B. DeSalvo and C. Gamrat
ABSTRACT: We introduce a novel energy-efficient methodology, "2-PCM Synapse", to use phase-change memory (PCM) as synapses in large-scale neuromorphic systems. Our spiking neural network architecture exploits the gradual crystallization behavior of PCM devices for emulating both synaptic potentiation and synaptic depression. The system, comprising about 2 million synapses, directly learns from event-based dynamic vision sensors. When tested with real-life data, it is able to extract complex and overlapping temporally correlated features such as car trajectories on a freeway. The synaptic programming power consumption of the system during learning is estimated and could be as low as 100 nW for scaled-down PCM technology.
Phase Change Memory (PCM) devices have been proposed to
emulate biologically inspired features of synaptic functionality
that are essential for realizing neuromorphic hardware.
Among the different types of emulated synaptic features,
Spike-Timing-Dependent Plasticity (STDP) has gained a lot of
significance recently. STDP is widely believed to be a
foundation of learning mechanisms inside the brain.

When the number of neurons and synapses in a neuromorphic system featuring STDP grows large, its implementation on a classical computer architecture quickly runs into the Von Neumann bottleneck. This is a major motivation for research into new neuromorphic memory architectures that could allow in-situ, instantaneous and fully parallel synaptic-weight updates. From a technological perspective, PCM is a good candidate for neuromorphic applications because of its CMOS compatibility, high scalability, strong endurance and good retention characteristics.
One of the main limitations of using a single PCM device as a synapse is the implementation of Long-Term Depression (LTD): amorphization is not progressive when invariant, identical pulses are used. To overcome this issue, we propose to implement both Long-Term Potentiation (LTP) and LTD using crystallization, with two PCM devices constituting one synapse (Fig. 1). The two devices contribute with opposite signs to the neuron integration.

Figure 1: (Left) Experimental LTP characteristics of GST PCM devices with 30 consecutive identical potentiating pulses. (Right) 2-PCM synapse principle.

This novel low-power architecture, called "2-PCM Synapse", is aimed at emulating synaptic functions in large-scale neural networks [1,2]. Using this architecture, we designed a fully connected, feed-forward spiking neural network (SNN) and implemented a simplified form of the biological STDP learning rule. We show a real-world application: extracting complex patterns from recorded video data.

Figure 2: Network topology used in simulation. It is fully connected, and each pixel of the AER dynamic vision sensor is connected to neurons of the first layer through two synapses.

We have demonstrated that Phase Change Memory devices can be used to build large-scale synapse-like arrays for neuromorphic systems. A two-layer spiking neural network with about 2 million synapses and 4 million PCM devices has been simulated (see Fig. 2), showing complex visual pattern extraction with an average detection rate of 92% and a synaptic power consumption of 112 µW during learning. The extrapolated power consumption for the most recent state-of-the-art devices, if used for the same test case, could be as low as 100 nW. The low spiking frequency in this type of neural network is remarkable considering the complex detection task involved, and it is a good indicator of the scalability and potentially high efficiency of combining dynamic vision sensors with spiking neural networks, compared to classical synchronous frame-by-frame motion analysis.
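
The 2-PCM synapse principle (two devices, both programmed by crystallization, contributing with opposite signs) can be sketched as follows. The linear conductance step, the conductance bounds and the class and method names are illustrative assumptions; real PCM crystallization is gradual but not linear, and this sketch does not model the devices used in the study.

```python
class TwoPCMSynapse:
    """Synapse built from two PCM devices with opposite contributions."""

    def __init__(self, g_min=0.0, g_max=1.0, step=0.25):
        self.g_max = g_max
        self.step = step
        self.g_ltp = g_min  # conductance of the potentiating device
        self.g_ltd = g_min  # conductance of the depressing device

    def _crystallize(self, g):
        # One identical crystallizing pulse raises the conductance (assumed linear).
        return min(g + self.step, self.g_max)

    def potentiate(self):
        self.g_ltp = self._crystallize(self.g_ltp)

    def depress(self):
        # Depression is also a crystallization, applied to the other device.
        self.g_ltd = self._crystallize(self.g_ltd)

    @property
    def weight(self):
        # Equivalent synaptic weight seen by the neuron integration.
        return self.g_ltp - self.g_ltd


synapse = TwoPCMSynapse()
for _ in range(3):
    synapse.potentiate()
synapse.depress()
print(synapse.weight)
```

Because both devices only ever crystallize in this sketch, each one eventually saturates at its maximum conductance; how saturation is handled in the actual architecture is not covered here.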

References:
[1] O. Bichler, M. Suri, D. Querlioz, D. Vuillaume, B. DeSalvo and C. Gamrat, "Visual Pattern Extraction Using Energy-Efficient 2-PCM Synapse Neuromorphic Architecture," Electron Devices, IEEE Transactions on, vol. 59, no. 8, pp. 2206-2214, 2012
[2] M. Suri, O. Bichler, D. Querlioz, O. Cueto, L. Perniola, V. Sousa, D. Vuillaume, C. Gamrat and B. DeSalvo, "Phase change memory as synapse for ultra-dense neuromorphic systems: Application to complex visual pattern extraction," Electron Devices Meeting (IEDM), 2011 IEEE International, pp. 4.4.1-4.4.4, 2012


Pavlov's Dog Associative Learning Demonstrated on Synaptic-Like Organic Transistors
Research topics: associative memory, organic memory transistor
O. Bichler, W. Zhao (IEF), F. Alibart*, S. Pleutin*, S. Lenfant*, D. Vuillaume (*IEMN), C. Gamrat
ABSTRACT: We present an original demonstration of an associative learning neural network inspired by the famous Pavlov's dog experiment. A single nanoparticle organic memory field effect transistor (NOMFET) is used to implement each synapse. We show how the physical properties of this dynamic memristive device can be used to perform low-power write operations for learning and to implement short-term association using temporal coding and spike-timing-dependent plasticity-based learning. An electronic circuit was built to validate the proposed learning scheme with packaged devices, with good reproducibility despite the complex synaptic-like dynamics of the NOMFET in the pulse regime.
We propose an original scheme using Nano-particle Organic
Memory Field Effect Transistors (NOMFETs) to implement
dynamic associative learning and demonstrate it by
interfacing NOMFETs to a CMOS discrete circuit [1]. We show
how the unique synaptic properties of the NOMFET can be
used directly at device level to implement what we call a
dynamic associative memory, where the association is only
retained as long as there is a minimal activity at its input.

In classical conditioning, associative learning involves repeatedly pairing an unconditioned stimulus, which always triggers a reflexive response, with a neutral stimulus, which normally triggers no response. After conditioning, a response can be triggered by both the unconditioned stimulus and the neutral stimulus. This concept goes back to Pavlov's experiments in the early 1900s. He showed how a neutral stimulus, like the ring of a bell, could be associated with the sight of food and trigger the salivation of his dogs (Fig. 1). Associative memory is now a key concept in the learning and adaptability processes of the brain.

Figure 1: Equivalent electronic circuit for the associative memory. There are three neurons (Input #1, Input #2, and Output) and two synapses/NOMFETs.

However, the lack of an efficient implementation of artificial synapses for associative learning neural networks has greatly impeded the use of associative memory as a general-purpose type of memory or learning tool. Synapses implemented with the most recent CMOS technology are still several orders of magnitude behind their biological counterparts in terms of area and power consumption, which can be roughly estimated at 10^12 synapses/cm^3 and 1-10 fJ/spike if one considers an average firing rate of 1-10 Hz, given a power consumption of the human brain on the order of 10 W.

Figure 2: The experimental board, with the NOMFETs in a TO (transistor outline) case at the center.

It was shown in previous work that the NOMFET [2] can be seen as a memristive device [3], in which the conductivity of the organic semiconducting channel is modulated through the charging of nanoparticles embedded in the channel. It is volatile and has a retention time of typically 10 to 1000 s. With these physical properties, the NOMFET can reproduce many behaviors of a dynamic synapse.
Associative memory is a fundamental computing block in the brain and is implemented extremely efficiently in biological neural networks. An efficient and scalable implementation of associative memory would certainly benefit many applications, especially in the area of natural data processing. We demonstrated experimentally an elementary associative memory, which uses only one NOMFET memristive nano-device per synapse (Fig. 2). It exhibits dynamical behaviors closer to biology than any other known memristive device.

References:
[1] O. Bichler, W. Zhao, F. Alibart, S. Pleutin, S. Lenfant, D. Vuillaume and C. Gamrat, "Pavlov's Dog Associative Learning Demonstrated on Synaptic-Like Organic Transistors," Neural Computation, vol. 25, no. 2, pp. 549-566, 2012.
[2] O. Bichler, W. Zhao, F. Alibart, S. Pleutin, D. Vuillaume and C. Gamrat, "Functional Model of a Nanoparticle Organic Memory Transistor for Use as a Spiking Synapse," Electron Devices, IEEE Transactions on, vol. 57, no. 11, pp. 3115-3122, 2010.
[3] F. Alibart, S. Pleutin, O. Bichler, C. Gamrat, T. Serrano-Gotarredona, B. Linares-Barranco and D. Vuillaume, "A Memristive Nanoparticle/Organic Hybrid Synapstor for Neuroinspired Computing," Advanced Functional Materials, vol. 22, no. 3, pp. 609-616, 2012.


Spin-Torque Oscillator Modelling for Network Synchronization
Research topics: spintronics, oscillator locking, synchronization, LTI model
M. Zarudniev, P. Villard, E. Colinet, U. Ebels (Spintec), M. Quinsat (Spintec), G. Scorletti (AMPERE Lab)
ABSTRACT: The spin-torque oscillator (STO) is a new device based on thin-film magnetic effects. Its high compactness and wide frequency tuning range are interesting features for multi-standard radio-frequency applications. However, the output signal provided by a single STO exhibits low power and poor phase noise. Interconnecting several STOs within a synchronized network could circumvent these two drawbacks. Since the available physical model is too complex for a straightforward study of an STO network, the first step consists in simplifying it to derive a linear time-invariant (LTI) model providing the phase as output signal. Such an LTI model is required to benefit from the powerful analysis and synthesis methods developed in control theory.
Thin-film magnetism studies in the last decades have opened a new research field, referred to as spintronics, where not only the charge but also the spin of electrons is exploited. The properties exhibited by such thin films have led to the development of new devices with outstanding performance, such as the giant magnetoresistance read head of hard disks. More recently, it has been shown that a spin-polarized DC current of sufficient density can give rise to a high-frequency oscillation of a thin-film magnetization. This property is exploited in a new device called the spin-torque oscillator (STO), depicted in Fig. 1, which provides a GHz-range oscillating output voltage under proper DC current biasing [1].

Figure 1: STO structure and biasing scheme (labels: bias T, Ibias, Hext, VAC; pinned layer, non-magnetic layer, free layer; 100 nm)

This spin-torque device has dimensions much smaller than the standard LC-tank oscillators used in RF devices. Moreover, the oscillation frequency can be controlled over a wide range (e.g. 3GHz to 20GHz) via the applied DC current and external magnetic field. For these reasons, one may wish to use these devices as voltage-controlled oscillators (VCOs) for applications needing high agility, e.g. multi-standard RF. Unfortunately, the output signal of a single STO exhibits poor phase noise and low output power. Our goal in this study is thus to use several spin-torque oscillators as building blocks for a macro-oscillator with improved output characteristics, in particular in the frequency domain (phase noise) [2].
The so-called macrospin approximation in magnetic layers allows the derivation of the differential equation that governs the behavior of the free layer magnetization vector m, known as the Landau-Lifshitz-Gilbert-Slonczewski (LLGS) equation:

$$\frac{d\mathbf{m}}{dt} = -\gamma_0\,\mathbf{m}\times\mathbf{H}_{ef} + \alpha\,\mathbf{m}\times\frac{d\mathbf{m}}{dt} - \sigma_0 I\,[\mathbf{m}\times[\mathbf{m}\times\mathbf{p}]] \qquad (1)$$

where I is the bias current, H_ef is the bias magnetic field, α, γ0 and σ0 are coefficients depending on the physical properties of the layers, and p is the spin-polarization vector. Being 3-dimensional and strongly non-linear, this equation, although accurate from a physics point of view, is too complex to be used when studying a network where the bias current of each oscillator is to be modulated by the output voltage of some of the other oscillators. So our first need was a reliable, fast, phase-oriented model able to account for the impact of bias current modulation on an STO output voltage.
The new model was derived in two steps. Thanks to a first approximation, the LLGS equation was turned into a 2-dimensional set of non-linear equations in polar coordinates r(t) and φ(t):

$$\dot{r} = -k_d\,(1 + k_q r^2)\,r + \sigma_0 I\,(1 - k_s r^2)\,r, \qquad \dot{\varphi} = \omega_0 + N r^2 \qquad (2)$$

where ω0 is the oscillation frequency in rad/s, N is the amplitude-to-phase conversion (non-linearity) parameter, and kd, kq, ks are fitting parameters.
Then, a set of linear time-invariant (LTI) models was obtained by linearizing the system around several trajectories. Each LTI model links the instantaneous phase of the STO output voltage to the bias current through a transfer function (Laplace domain):

$$\frac{\varphi(s)}{I(s)} = \frac{kN}{s\,(s + c)}$$

where k and c depend on the parameters of equation (2) and on the linearization reference trajectory.

Figure 2: Phase response of the 3 models (LLGS, (r, φ), LTI) to an input current step; phase (rad) versus time (s)

As an example, Fig. 2 shows the phase response of the three models to a bias current pulse. Even if the details of this response are not strictly reproduced by the simplified models, the main component is accounted for. This allows the next step, i.e. the study of synchronization within an STO macro-oscillator using analysis and synthesis tools from control theory.
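
To illustrate what the LTI model predicts, the sketch below evaluates the closed-form step response of a transfer function read here as φ(s)/I(s) = kN/(s(s + c)), i.e. an integrator in series with a first-order lag. For a current step ΔI, the inverse Laplace transform gives φ(t) = kNΔI [t/c - (1 - e^(-ct))/c²]. The function name and the numerical values of k, N and c are placeholders chosen for illustration; the source does not give them.

```python
import math


def lti_phase_step_response(t, k, n, c, delta_i=1.0):
    """Phase deviation (rad) at time t after a bias-current step delta_i,
    for the assumed transfer function phi(s)/I(s) = k*N / (s*(s + c))."""
    gain = k * n * delta_i
    return gain * (t / c - (1.0 - math.exp(-c * t)) / c**2)


# Placeholder parameters (not taken from the source).
k, n, c = 2.0, 3.0, 1.0
for t in (0.0, 1.0, 10.0):
    print(t, lti_phase_step_response(t, k, n, c))
```

After the transient of time constant 1/c, the phase grows linearly with slope kNΔI/c, which corresponds to a constant shift of the oscillation frequency caused by the current step.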

References:
[1] P. Villard, U. Ebels, D. Houssameddine, J. Katine, D. Mauri, B. Delaet, P. Vincent, M.-C. Cyrille, B. Viala, J.-P. Michel, J. Prouvée, and F. Badets, "A GHz-spintronic-based RF oscillator," IEEE Journal of Solid State Circuits, 45(1), January 2010, pp. 214-223
[2] M. Zarudniev, E. Colinet, P. Villard, U. Ebels, M. Quinsat, G. Scorletti, "Synchronization of a spintorque oscillator array by a radiofrequency current," Mechatronics Journal 22 (2012), pp. 552-555


Carbon Nanotube FET Process Variability and Noise Model for Radiofrequency Investigations
Research topics: CNT-FET, RF integrated circuits, process variability
J.L. González, B. Martineau (STMicroelectronics), D. Belot (STMicroelectronics)
ABSTRACT: This work focuses on process variability and noise in carbon nanotube field-effect transistors (CNFET) to obtain a compact model usable for radiofrequency (RF) design and simulations. CNFET figures of merit (FoM) are determined and compared to International Technology Roadmap for Semiconductors (ITRS) requirements on conventional analog silicon-based devices. The developed model is also used to investigate the impact of manufacturing process variability on the CNFET's RF performance and noise behavior.
The quest for a substitute for today's silicon-based CMOS technology started some years ago. One of the candidates is the carbon nanotube field-effect transistor (CNFET). This device is expected to provide high speed and low power, and has been widely investigated in the field of digital devices. Its application to RF circuits has received less attention. In this work we investigate the properties of this type of device from the point of view of RF figures of merit [1]. We do this by considering not just an ideal device composed of a parallel array of identical carbon nanotubes, but a real device in which the tube diameters within a single device may vary following statistical distributions, such as those shown in Figure 1. Moreover, the fact that some of the nanotubes of the device may be metallic is also considered, as is a non-ideal metallic tube removal process, as illustrated in Figure 2.b.

Figure 1 : Variability model for CNFETs.

The CNFET model developed also incorporates the noise sources (thermal, flicker, shot) that are found in this type of device. This realistic device model is used to extract meaningful RF figures of merit and to investigate their dependence on the CNFET manufacturing process and device biasing conditions, as illustrated in Figure 2.

Figure 2 : RF figures of merit of CNT-FET transistors

The most important results of this study have been published in [2]. We have shown that excellent performance, except for high flicker noise, is achieved for a large-diameter distribution with low standard deviation. An important outcome of the work is a compact CNFET model incorporating noise and variability modeling, suitable for RF circuit design.

References:
[1] G.M. Landauer, J.L. González, "Radiofrequency Performance of Carbon Nanotube FETs in Comparison to Classical CMOS Technology," ESSCIRC 2011, Sept. 2011, Fringe Poster Session.
[2] G.M. Landauer, J.L. González, "Carbon nanotube FET process variability and noise model for radiofrequency investigations," 2012 12th IEEE Conference on Nanotechnology (IEEE-NANO), pp. 1-5, 20-23 Aug. 2012.


Bipolar OxRRAM-based
Non-volatile SRAM (NV-SRAM)
for Information Back-up
Research topics: Resistive RAMs, NV-SRAM, Low leakage, Non-volatile, FDSOI
Hraziia (ISEP), O. Thomas, F. Clermidy, C. Anghel (ISEP), A. Amara (ISEP)
ABSTRACT: One of many techniques to reduce static power dissipation in stand-by mode is power gating. But this technique cannot be applied to the memory section, as it leads to data loss and affects data integrity. The inherent non-volatile property of Resistive RAMs (RRAMs) can be exploited to provide non-volatility when used alongside volatile Static RAMs, and at the same time their zero power consumption in stand-by mode can be exploited for low-power management.

As technology scales down, leakage power consumption becomes significant in the overall power of Systems-on-Chip (SoCs). A potential solution for reducing static power consumption is power gating, i.e. switching off the power supply to blocks which are not in use. But this technique cannot be applied to the memory section, where embedded SRAMs cover almost half the area of the chip, as it would result in loss of information. Another potential solution for reducing static leakage is scaling down the VDD of memory blocks during stand-by, but its efficiency is limited by transistor variability.
Hybrid memory circuits combining Static RAMs (SRAMs) and Resistive RAMs (RRAMs) are an effective solution for storing information during a power-down with lower leakage. Using RRAMs alongside volatile memories such as SRAMs not only provides non-volatility but also reduces the static power consumption of chips whose power consumption is dominated by SRAM leakage current. The hybrid memory cell (NV-SRAM) [1,2] that realizes these features is composed of a typical 6T SRAM cell based on 22nm FDSOI technology, with resistive OxRRAM devices embedded at the data nodes of the SRAM cell. In order not to lose information during a power-down, the logical state of the volatile SRAM memory is stored in the non-volatile resistive memories. To make the circuit resilient to information loss, the following operational sequence must be respected: RESET, STORE, POWER DOWN, RESTORE (as shown in Fig. 1).

Figure 1: Operational sequence of a power-down/power-up cycle [1]; TSTORE = 20ns, TRESET = 20ns, TRESTORE = 20ns @ VDD = 1.0V

The stability study of the NV-SRAM cell, taking the VT variability of the transistors into consideration, reveals that the key yield-limiting factor of the NV-SRAM cell is the reliable recovery of data from the OxRRAMs to the SRAM, which strongly depends on the resistance ratio (ROFF/RON) of the OxRRAM devices.
The STORE operation is always successful, as it sets the OxRRAM resistance to a low value, but RON varies with the VT variation. This changes ROFF/RON, which in turn influences the proper restoration of the SRAM logical state from the OxRRAMs. Fig. 2 shows that ROFF/RON can be increased by using a long STORE time or by increasing the WLP of the SRAM pull-up transistor. However, a long STORE time costs power, and the WLP size is limited by the cell area and the write ability of the SRAM. Fig. 3 depicts a worst-case mismatch analysis of the hybrid cell during a RESTORE operation. Referring to Fig. 3, the maximum achievable ROFF/RON (= 4) of the NV-SRAM corresponds to n = 2, which is very low from a stability viewpoint. To ensure an SRAM yield of 5σ (6σ) for reliable recovery, the focus should be on maximizing the resistance drop RDROP in the high ΔR/Δt region corresponding to the ROFF/RON (Fig. 2) for a low STORE time. The Monte Carlo simulation results for the recovery operation indicate that, even in the worst case, the resistance ratio should be at least 10 (20) for this SRAM yield. Hence, the stability analysis implies that, from the RRAM technology standpoint, the focus should be on maximizing the ROFF/RON resistance ratio.

Figure 2: RON vs. time during STORE operation for various WLP in the 80nm to 200nm range [1]

Figure 3: Variation of the n*sigma(VT) fail point (0 to 6) versus the ROFF/RON ratio (10 to 100) in the worst-case variation analysis [1].

References:
[1] Hraziia et al., "Operation and Stability analysis of Bipolar OxRRAM-based Non-Volatile 8T-2R SRAM as a solution for Information Back-Up," accepted for publication in Elsevier Solid-State Electronics, 2012.
[2] Hraziia et al., "Bipolar OxRRAMs based non-volatile 8T2R SRAM for information back-up," EUROSOI 2012.


RRAM-based FPGA for Normally Off, Instantly On Applications
Research topics: Resistive RAM, Non-volatile Memory, FPGA
O. Turkyilmaz, F. Clermidy, S. Onkaraiah, Hraziia, C. Anghel (ISEP), J.M. Portal, M. Bocquet (IM2NP)
ABSTRACT: "Normally off, instantly on" applications are becoming common in our environment, ranging from healthcare to video surveillance. In such a context, Field Programmable Gate Arrays (FPGAs) present a good trade-off between performance and flexibility. However, they consume high static power and can hardly be combined with power gating techniques because of their long context-restoring phase. In [1], we propose to integrate non-volatile resistive memories in configuration cells to instantly restore the FPGA context. We then show that the non-volatile FPGA starts saving energy when the circuit is in the ON state for less than 42% of the time. Finally, for a typical application with only 1% of the time spent in the ON state, the energy gain reaches 50%.
Many new embedded applications can be characterized as "normally off, instantly on". These applications share a similar feature: a long idle period followed by a short, highly intensive computing phase. The wake-up phase must be kept as short as possible in order to avoid missing important information. For example, a heart attack can be anticipated by analyzing electrical signals after an abnormal event, but false detections must be avoided, which requires complex signal monitoring, and processing must be finished within a few microseconds after wake-up.
In this context, General Purpose Processors (GPPs) are not efficient, as they offer reduced power efficiency and require long boot sequences when switched to the ON state. Application Specific Integrated Circuits (ASICs) or Systems-on-Chip (SoCs) are power efficient but are expensive to develop and not flexible enough to address a wide range of applications. Field Programmable Gate Arrays (FPGAs) offer a good compromise between flexibility and power efficiency. However, decreasing supply voltages and shrinking feature sizes increase the leakage current of FPGAs, which becomes the major cause of power consumption during standby mode. In order to reduce static consumption, power gating is massively employed. However, an FPGA loses all the information contained in its SRAM memories when switched off. FPGAs can communicate with external Flash memories to store their context. Restoring a context after a power-off is accomplished by serially loading a bitstream into all the configuration cells of the FPGA, which is quite long and can take up to hundreds of milliseconds. Thus, current FPGA structures cannot fulfill ON/OFF application constraints.

$$\mathrm{BET} = \frac{t_0}{t_0 + t_1} = \frac{1}{1 + \dfrac{P_{OH}}{P_L}}$$

P_OH: power overhead
P_L: leakage power
t_0: normal mode duration
t_1: idle mode duration
Figure 1 : Conceptual view of potential gain and power overhead
of SRAM and NVSRAM implementations and break-even time
(BET) definition.

In [1], we propose to integrate Resistive RAM (RRAM) in the
FPGA structure to obtain an instant power-on phase and
to save power in normally off, instantly on applications. We
focus on the ON/OFF application using novel NVSRAM
memories with bipolar OxRRAM technology [2] in a 22nm LETI-FDSOI process. All the volatile SRAM nodes are replaced with
NVSRAM, to store and restore the bitstream quickly in an ON/OFF
cycle. As a result, a total power-down state, i.e. zero leakage
power consumption, is achieved. When the FPGA is switched into
sleep mode, the leakage power that would otherwise be dissipated is saved.
However, the overhead due to RRAM integration and power
switches must be evaluated. Fig. 1 shows a conceptual view
of the expected power overhead and gain of the SRAM and
NVSRAM implementations. Considering these factors, the break-even time (BET) is defined as the duty cycle at which the energy
overhead is equal to the leakage energy, as given in Fig. 1.
Consequently, when the actual duty cycle of the application is
smaller than the BET, it is possible to save leakage power
and reduce total power consumption.
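The break-even test can be sketched in a few lines of Python. This is only an illustration of the BET definition, under the assumption that the overhead power P_OH is paid during the normal mode t0 and the leakage power P_L is saved during the idle mode t1; the function names and the numerical values are hypothetical and are not taken from [1].

```python
def break_even_duty_cycle(p_overhead: float, p_leakage: float) -> float:
    """BET = 1 / (1 + P_OH / P_L): duty cycle at which the energy
    overhead equals the leakage energy that is saved."""
    if p_leakage <= 0:
        raise ValueError("leakage power must be positive")
    return 1.0 / (1.0 + p_overhead / p_leakage)


def saves_energy(t_on: float, t_off: float,
                 p_overhead: float, p_leakage: float) -> bool:
    """True when the application duty cycle t0 / (t0 + t1) is below the BET."""
    duty_cycle = t_on / (t_on + t_off)
    return duty_cycle < break_even_duty_cycle(p_overhead, p_leakage)


# Hypothetical powers in arbitrary units, chosen so that the BET is close to 42%.
bet = break_even_duty_cycle(p_overhead=1.38, p_leakage=1.0)
print(f"BET = {bet:.2%}")
print("1% ON time saves energy:", saves_energy(1, 99, 1.38, 1.0))
print("50% ON time saves energy:", saves_energy(50, 50, 1.38, 1.0))
```

With such a ratio, an application that is ON 1% of the time sits far below the break-even point, while one that is ON half of the time does not.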

Figure 2 : Power gain depending on different duty cycle values.
Considering an application with 1% ON time, gained power can
reach up to 50% on average.

Evaluated with the VPR5 toolflow, NVSRAM integration in the FPGA
results in 7%, 18% and 2% overhead in delay, area and
power, respectively. The power gating implementation increases both the
area and the critical path by 8%.
The duty cycle of the application has a direct influence on the
amount of power saved. Fig. 2 shows that for the ON/OFF
application, if the duty cycle is much lower than the average
BET (42%), the power gain increases rapidly. For an
application where the circuit is active for 1% of the time,
the power gain reaches 50% on average.

References :
[1] Ogun Turkyilmaz, Santhosh Onkaraiah, Marina Reyboz, Fabien Clermidy, Hraziia, Costin Anghel, Jean-Michel Portal, and Marc Bocquet.
RRAM-based FPGA for Normally Off, Instantly On Applications, NanoArch 2012.
[2] Hraziia, Costin Anghel, Andrei Vladimirescu, Amara Amara, Jean-Michel Portal, Marc Bocquet, Christophe Muller, Damien Deleruyelle,
Santhosh Onkaraiah, Marina Reyboz and Olivier Thomas. Bipolar OxRRAM-based non-volatile 8T2R SRAM for information back-up, EUROSOI
2012.


Embedded Software

Real Time Software
Parallel Software
Middleware for Sensor Networks


Integrated Architecture
Exploration Workflow
Research topics: HW/SW co-design, architecture exploration workflow
D. Puschini, J. Mottin, C. Fabre, N. Palix, L. Apostol, E. Vaumorin (Magillem Design Services)
ABSTRACT: Compute-intensive applications can greatly benefit from the flexibility of NoC-based heterogeneous
multi-core platforms. However, mapping applications on such MPSoC is becoming increasingly complex and
requires integrated design flows. We conducted a case study to evaluate the benefits of an integrated design
flow for the mapping space exploration of a real telecommunication application on a NoC-based heterogeneous
platform. Thanks to the flow, we simulated several virtual platforms and several mappings of our application on
each. This approach drastically lowers the required skills and the time needed for design space exploration. An
improvement of several weeks has been observed.
Embedded systems have evolved from a single processor
architecture to Multi-Processor Systems-on-Chip (MPSoC).
Designing embedded software for these increasingly complex
and heterogeneous platforms in an efficient manner is
becoming a serious challenge. Applications are particularly
difficult to program on platforms based on Network-on-Chip
(NoC) interconnection, where developers must also define
and set up the communications between the units.
Studying and optimizing the mapping of micro-code on the
different units is particularly difficult and error-prone.
Developers usually need to define fine-grain and coherent
configuration files for every node. Thus, moving a task from
one IP core to another impacts not only the communications
of this code but also those of every unit it communicates
with. Design flows for NoC-based SoCs are an active research
area. One of the most challenging issues is to organize the
mapping on the cores while taking into account communication
issues such as deadlock, bandwidth and latency, as well as
non-functional properties like energy consumption.
Figure 1 shows a generic design flow for NoC-based
embedded systems.

Figure 1 : previous co-design workflow

The problem with such a workflow is that changes to the HW
specification mean redefining the SW specifications, most of
the time after the HW system design.
Figure 2 illustrates a novel integrated exploration workflow
more suitable for architecture and/or application mapping
exploration. In order to tackle the exploration bottleneck of
the previous workflow, it considers an XML-based model that
centralizes the system specifications for hardware design and
software mapping. This global database allows sharing the
information between the different exploration facilities: HW
design tool suite, NoC configuration tools and unit-specific
compilers. An IP-XACT compliant representation has been
chosen in order to match a standard representation for the
hardware design.

Figure 2 : Integrated architecture exploration workflow

This standard-based design flow gives a large speedup for
design space exploration, from months to weeks, for less
than 10% of overhead compared to the previous design
model. In addition, it maximizes block reuse
across projects and integrates the
hardware and software designs while ensuring the coherence of their
descriptions. Thus, it further reduces the time to explore
architecture options, either in the hardware design or in the
mapping of the application. This approach could be used to
automatically explore the design space, optimizing both
functional and physical behavior of the overall system.
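The benefit of a centralized model can be illustrated with a small Python sketch in which every per-core configuration is derived from one shared mapping, so that moving a task only requires changing a single entry. A plain dictionary stands in here for the XML/IP-XACT model of the actual flow, and all task names, core names and helper functions are hypothetical.

```python
from collections import defaultdict

# Hypothetical application: directed communication channels between tasks.
CHANNELS = [("fft", "demod"), ("demod", "decode"), ("decode", "sink")]


def node_configs(mapping, channels):
    """Derive one configuration per core from the single shared mapping."""
    configs = defaultdict(
        lambda: {"tasks": [], "sends_to": set(), "receives_from": set()}
    )
    for task, core in mapping.items():
        configs[core]["tasks"].append(task)
    for src, dst in channels:
        src_core, dst_core = mapping[src], mapping[dst]
        if src_core != dst_core:  # only inter-core traffic uses the NoC
            configs[src_core]["sends_to"].add(dst_core)
            configs[dst_core]["receives_from"].add(src_core)
    return dict(configs)


def changed_cores(old, new):
    """Cores whose generated configuration differs between two mappings."""
    cores = set(old) | set(new)
    return sorted(c for c in cores if old.get(c) != new.get(c))


mapping = {"fft": "dsp0", "demod": "dsp1", "decode": "cpu0", "sink": "cpu0"}
before = node_configs(mapping, CHANNELS)

# Moving one task is a one-line change in the central model ...
after = node_configs(dict(mapping, demod="dsp2"), CHANNELS)

# ... but it touches the configuration of every core it communicates with.
print(changed_cores(before, after))
```

Regenerating all node configurations from the central description is what keeps them coherent; editing the four affected files by hand is where errors would otherwise creep in.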

References :
[1] Puschini D., Mottin J., Palix N., Apostol L., Fabre C., "Integrated architecture exploration workflow: A NoC-based case study," 2012 23rd
IEEE International Symposium on Rapid System Prototyping (RSP), pp.135-141, 11-12 Oct. 2012


Optimization of Dynamic Memory Allocation and Associative Arrays for
Complex Embedded Software
Research topics : embedded systems, control, associative arrays, memory allocation
A. Carbon, D. Couroussé, Y. Lhuillier, H.P. Charles
ABSTRACT: Embedded systems were initially dedicated to executing very specific tasks. Since the emergence of
on-chip manycore architectures and the convergence between general purpose and embedded architectures,
embedded systems are now targeting a wider range of applications. Embedded software involves more and more
complex runtime control: dynamic scheduling, dynamic resource management, or even just-in-time compilation.
Our experiments showed that dynamic memory allocation and associative array manipulations represent a
significant part of this new control software. We propose software and hardware optimizations of associative
arrays and memory management to maximize the performance and simplicity of embedded control software.
Embedded systems were initially dedicated to execute very
specific tasks leveraging highly dedicated hardware. Since
the emergence of on-chip manycore architectures and the
convergence between general purpose and embedded
architectures, embedded systems are now targeting a wider
range of applications. Embedded software involves more and
more complex runtime control for parallelism and dynamism.
We characterized new embedded control software such as
dynamic scheduling, resource management, scripting
languages, binary translation and just-in-time compilation,
showing that dynamic memory allocation and associative
array manipulations represent a significant part of this new
control software. We first focused on memory management
optimization and then on associative arrays. We observed
that memory allocator algorithms are in fact based on
associative array manipulation.
Embedded systems often exhibit highly complex memory
organizations. Distributed and private memories and the absence of
address virtualization are frequent burdens for embedded
developers.
Moreover, with the increasing amount of
parallelism, additional stress is put on dynamic memory
allocation. In [1], we propose a flexible memory allocator to
handle the complex memory organizations of embedded systems.
The memory allocator leverages dynamic code generation so
that flexibility does not come at the expense of performance. We
show that combining dynamic code generation and machine
learning can give a 56% speedup on the memory allocator's
allocation and release operations (Figure 1).

Figure 1 : Average speedup of malloc&free operations after
successive runtime optimizations of the allocator map.

During our experiments, we showed that associative array
manipulation is a significant part of embedded control
software (from 25% for dynamic compilation, up to ~80% in
resource managers, binary translators or interpreters).
Moreover, we also showed that associative array
manipulation is at the heart of memory allocators. Thus, we
argue that the optimization of associative arrays is crucial
to deliver high-performance embedded control software.
In order to optimize associative array manipulation in
embedded control software, we propose a low-level
optimization using self-modifying code. In [2], we base our
work on Red-Black tree algorithms, which are widely used to
implement associative arrays (C++ STL, jdmalloc). In order
to accelerate Red-Black tree algorithms, we propose to
transform tree data structures into executable code. With
Red-Black trees encoded as specialized binary code rather
than data, we intend to accelerate the tree traversal by
taking advantage of the underlying hardware: program
cache, processor fetch and decode. Our experiments show
that we obtain a gain of 45% with our technique on an ARM
Cortex-A9 processor. We also show that we transfer most of
the data-cache pressure to the program-cache, motivating
future work on dedicated hardware.
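The idea of encoding a search tree as code rather than data can be illustrated at a much higher level in Python. The sketch below is not the technique of [2]: it emits Python source instead of ARM machine code, and it uses a statically balanced binary search tree as a stand-in for a Red-Black tree. The function names are hypothetical, and keys are assumed to be mutually comparable values whose `repr` is valid source (integers here).

```python
def generate_lookup_source(items, name="lookup"):
    """Emit Python source whose branches mirror a balanced search tree."""
    items = sorted(items)
    lines = [f"def {name}(key):"]

    def emit(lo, hi, indent):
        pad = "    " * indent
        if lo >= hi:                      # empty subtree: key is absent
            lines.append(f"{pad}return None")
            return
        mid = (lo + hi) // 2
        key, value = items[mid]
        lines.append(f"{pad}if key == {key!r}:")
        lines.append(f"{pad}    return {value!r}")
        lines.append(f"{pad}if key < {key!r}:")
        emit(lo, mid, indent + 1)         # left subtree, nested in the branch
        emit(mid + 1, hi, indent)         # right subtree, fall-through path

    emit(0, len(items), 1)
    return "\n".join(lines)


def compile_lookup(items):
    """Turn the generated source into a callable specialized for this map."""
    namespace = {}
    exec(generate_lookup_source(items), namespace)
    return namespace["lookup"]


table = {10: "a", 20: "b", 30: "c", 40: "d"}
lookup = compile_lookup(table.items())
print(lookup(30), lookup(35))
```

Each comparison against a key becomes an immediate operand of the generated code, so a lookup no longer loads tree nodes from memory. Any insertion or deletion requires regenerating the code, which is the price of this specialization.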

Figure 2 : Comparison for array operations in a map size of
16384 elements between original implementation and our
specialized code in number of cycles/operations.

References :
[1] Y. Lhuillier, D. Couroussé, Embedded System Memory Allocator Optimization Using Dynamic Code Generation, 2012 workshop "Dynamic
Compilation Everywhere", in conjunction with the 7th HiPEAC conference, 2012
[2] A. Carbon, Y. Lhuillier, and H-P. Charles, Code Specialization For Red-Black Tree Management Algorithms, 3rd International Workshop on
Adaptive Self-tuning Computing Systems, 2012


A Practical Approach Towards Static DVFS and DPM Scheduling
in Real-Time Systems
Research topics: DVFS, DPM, Embedded systems, Real-time constraints
K. Trabelsi, M. Jan, R. Sirdey
ABSTRACT: Dynamic Voltage and Frequency Scaling (DVFS) and Dynamic Power Management (DPM) are the
two main techniques that can be used at the software level to reduce the power consumption of embedded hard
real-time systems. In the field of embedded real-time systems, such as automotive or energy distribution,
microprocessor manufacturers have recently proposed chips with not only DVFS capability but also a large
number of low-power states that can be used to apply DPM strategies. In order to leverage the energy-saving
abilities of such microprocessors and to avoid missing deadlines, we formulate the problem using a linear
programming approach in which we model the available frequencies and low-power states [1].
In the last few years, a lot of work has been done in the
field of power-aware computing, particularly on embedded
systems with real-time constraints. At the software
level, there are two main strategies to reduce the power
consumption of embedded processors: 1) lower their
frequency and their voltage, 2) set them in low-power states.
The first approach is called Dynamic Voltage and Frequency
Scaling (DVFS), while the second is called Dynamic Power
Management (DPM). DVFS policies change the voltage and/or
the operating frequency, more generally the speed, of the
processing unit on-the-fly and reduce the dynamic energy
consumption. In CMOS circuits, dynamic power consumption is
proportional to the product of the frequency and the square
of the supply voltage. DPM policies reduce the (static) energy
consumption by essentially turning off some parts of the
system. This allows for a drastic decrease in power
consumption, but most of the time the DPM energy-saving
modes disable processing, by turning off memories, caches
or even processing units completely. Leveraging frequencies
and energy-saving modes of modern processing units at the
same time is one of the challenges of today's embedded
systems community. In the context of hard real-time
systems, energy-aware policies must be part of the
scheduling, and the challenge is to reduce energy
consumption while still meeting the deadlines of tasks.
Indeed, lowering the speed of the processor increases the
execution time of a task.
A simple model is, for instance, that the worst-case
execution time (WCET) of a task scales inversely with the
processor frequency. More realistic models have been
proposed which divide a task between computationally
intensive parts and peripheral accesses. In all cases, the
challenge when using DVFS is that this increase of the
execution time must not prevent tasks from meeting their deadlines.
When the processor is set to a low-power state, tasks can no
longer be executed. In this case, the challenge is to wake
the processor from a low-power state just in time so that
tasks can still meet their deadlines. Some of the properties of
hard real-time systems can be used to save power: at
compile-time, the tasks are thoroughly examined to get their
worst-case execution time (WCET), and their occurrence
patterns (period and arrival time). This allows the scheduling
algorithm to do a first pass of off-line optimization:
computing a minimal speed at which the system can run or
the low-power states that can be used to achieve a minimal
energy consumption while still meeting the deadlines of the
tasks. The scheduling policy can also reduce the energy
consumption on-line, by taking into account the dynamic
behavior of the tasks.
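The off-line pass that computes a minimal speed can be sketched as follows in Python, under the simple model in which the WCET is inversely proportional to the frequency and all tasks of a frame share the frame length as their deadline. The function names, the frequency list and the task set are hypothetical.

```python
def scaled_wcet(wcet_at_fmax, f_max, f):
    """WCET at frequency f under the simple inverse-scaling model."""
    return wcet_at_fmax * f_max / f


def lowest_feasible_frequency(wcets_at_fmax, frame, frequencies):
    """Lowest available frequency at which all tasks still fit in the frame.

    Returns None when even the highest frequency misses the deadline.
    """
    f_max = max(frequencies)
    for f in sorted(frequencies):
        total = sum(scaled_wcet(w, f_max, f) for w in wcets_at_fmax)
        if total <= frame:
            return f
    return None


# Hypothetical task set: WCETs in microseconds measured at 32 MHz,
# to be scheduled in a frame of 1 ms (1000 microseconds).
FREQUENCIES_MHZ = [4, 8, 16, 32]
print(lowest_feasible_frequency([100, 100], 1000, FREQUENCIES_MHZ))
print(lowest_feasible_frequency([600, 600], 1000, FREQUENCIES_MHZ))
```

In the first case the two tasks need 1600 microseconds at 4 MHz but only 800 at 8 MHz, so 8 MHz is the slowest speed that meets the deadline; in the second case no frequency is feasible.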

Figure 1 : Energy consumption of the DPM, DVFS and the hybrid
strategies for a frame of 1 ms.

To the best of our knowledge, previous works have focused
on static DVFS scheduling, while static DPM has rarely been
studied. The availability of microprocessors with advanced
DPM and DVFS functionalities motivates the first contribution
of this work: an accurate modeling of DPM and DVFS
capabilities in order to find the optimal static solution
combining both techniques. We formulate the problem
using a linear programming approach in which we model
the available frequencies and low-power states using binary
variables. Then, we study the difference between such an
accurate modeling of DPM and DVFS and a modeling in which the transition
costs of these techniques are neglected.
We performed some experiments using the STM32L microcontroller family. Figure 1 shows the energy
consumption of the DPM, DVFS and hybrid strategies
(DPM and DVFS together) for different values of workload
rates, considering a frame of 1 ms. Our results show that
the hybrid strategy, mixing DVFS and DPM, is the best
solution regardless of the workload and whether transition
costs are considered or not. We also show that the hybrid strategy reduces
the effect of transition costs on energy saving.
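The comparison between the three strategies can be mimicked on a single frame with a small Python sketch. It replaces the linear program of [1] with a brute-force enumeration of (frequency, low-power state) pairs, which plays the role of the binary variables. Every number below is a hypothetical placeholder and not an STM32L datasheet value, and the names are illustrative only.

```python
from itertools import product

# Hypothetical platform model (NOT datasheet values).
FREQS = {4: 1.2, 16: 4.0, 32: 9.0}  # frequency (MHz) -> active power (mW)
IDLE_FRACTION = 0.5                 # idle power as a fraction of active power
# name -> (power in mW, transition time in ms, transition energy in uJ)
STATES = {"sleep": (0.3, 0.05, 0.02), "stop": (0.01, 0.3, 0.2)}


def frame_energy(kcycles, frame_ms, f_mhz, state):
    """Energy (uJ) of one frame, or None if the choice is infeasible."""
    exec_ms = kcycles / f_mhz       # kilo-cycles / MHz = milliseconds
    if exec_ms > frame_ms:
        return None                 # deadline missed
    idle_ms = frame_ms - exec_ms
    energy = FREQS[f_mhz] * exec_ms
    if state == "none":             # stay in run mode while idle
        return energy + IDLE_FRACTION * FREQS[f_mhz] * idle_ms
    power, t_trans, e_trans = STATES[state]
    if idle_ms < t_trans:
        return None                 # not enough idle time for the transition
    return energy + e_trans + power * (idle_ms - t_trans)


def best(kcycles, frame_ms, freqs, states):
    """Cheapest feasible (energy, frequency, state) triple, or None."""
    options = []
    for f, s in product(freqs, states):
        e = frame_energy(kcycles, frame_ms, f, s)
        if e is not None:
            options.append((e, f, s))
    return min(options) if options else None


all_states = ["none", *STATES]
dvfs = best(8, 1.0, FREQS, ["none"])             # frequency choice only
dpm = best(8, 1.0, [max(FREQS)], all_states)     # low-power states at f_max
hybrid = best(8, 1.0, FREQS, all_states)         # both techniques combined
print("DVFS:", dvfs)
print("DPM:", dpm)
print("hybrid:", hybrid)
```

Because the hybrid search space contains both of the others, its optimum can never be worse than either pure strategy. How large the gain is depends entirely on the platform parameters, which is why an accurate model of transition costs matters.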
As future work, we would like to generalize the results we
obtained from our generic model on different microcontrollers and not only on the STM32L. The solution
described in this paper is pessimistic as it assumes the worst
case execution time (WCET). We are therefore also interested
in adding robustness to our approach by adding bounds on
the uncertainty of WCET.

References :
[1] Trabelsi, K.; Jan, M. & Sirdey, R. A practical approach towards static DVFS and DPM scheduling in real-time systems, in 'Proceedings of the 1st
IEEE Workshop on Power, Energy and Temperature Aware Real-Time Systems', 2012


Dynamic Code Generation : Large Spectrum, Many Applications
Research topics : dynamic code generation, compilation, optimization
H.P. Charles, D. Couroussé, Y. Lhuillier, V. Lomüller, A. Carbon
ABSTRACT: Dynamic Code Generation is used in many situations: to enhance portability, adapt binary code to
run-time values, use specialized instruction sets, reduce code size. We present our activity in this domain through
significant examples. We give results for one of them: the matrix multiplication dynamic library.
To generate an executable program, programmers generally
use a compiler which transforms source code to binary code
(compile time), and during the execution (run-time) the
binary program does not change. This is the classical way to
produce and execute binary programs, called static
compilation scheme.
There is an increasing number of situations where this
classical scheme is insufficient, and sometimes
counterproductive.
Due to the increasing complexity of memory hierarchy and
parallelism levels, program performance is increasingly tied
to the characteristics of the data sets. Data size, data
alignments and data values have a deep impact on program
behavior. Iterative compilation is a technology allowing a
binary code to be adapted to a running data set, but the
adaptation is only valid for one specific data set, not the
arbitrary set that a user can provide. The adaptation of a
running code to a given data set parameter is one of our
motivations.
Portability issues on embedded systems, such as cell phones
and set-top boxes, are another motivation. Many compilation
infrastructures are used, such as JIT compilation for Android
Java-based applications, for JavaScript in browsers for smart
phones and set-top boxes, for graphic rendering on Android
GPUs, etc. In these situations, binary generation has to be
fast, occupy a small memory footprint, generate efficient
code, and use power sparingly.
We have developed a tool and experiments that try to tackle
these problems. Usage examples are listed in the following
items:
ISA dynamic adaptation: we have developed a small code
generator embedded in the Scilab mathematical solver that is
able to determine at run-time if a code should run either on a
CPU or on a GPU. On the GPU side, our code generator is
able to dynamically adapt the code according to the matrix
size and based on initial benchmarking.
The results of the experimentation are shown in Figures 1
and 2. Figure 1 shows the performance of MAGMA, the
reference mathematical library for matrix multiplication on
GPUs. For clarity we have only plotted the results above
145 GFlops for matrix sizes between 64x64 and 2000x2000.
In Figure 2 we have plotted the results from the same
experiment using our library dynamic adaptor. [1]
JIT hardware acceleration: [3] JIT compilation requires a
lot of memory accesses, through hash tables and tree
balancing. This impacts performance because code
generation is done at run-time. We have shown in our article
that we could mix data and programs in order to accelerate
searches in these trees.

Figure 1: Matrix product GFlops (above 145) obtained on NVIDIA
GPUs by the MAGMA BLAS library

Figure 2: Matrix product GFlops (above 145) obtained on NVIDIA
GPUs by our dynamic library

VLIW dynamic bundling: [2] Many VLIW processors use
bundled instructions (grouped processor instructions) to
implement instruction-level parallelism. The compilation
process tries to maximize bundle usage. We have shown that
we could build bundles dynamically by using run-time
parameters, improving code performance.
Tool for dynamic code generation: [4] dynamic code
generation is a difficult task owing to the lack of general
tools. We have developed an infrastructure which helps to
build dynamic code generators.
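The run-time choice of a back end based on initial benchmarking, mentioned in the first item above, can be sketched in Python. Two pure-Python matrix products stand in for the CPU and GPU implementations, and the benchmark sizes and function names are hypothetical; this is not the Scilab code generator itself.

```python
import time


def matmul_rows(a, b):
    """Plain index-based product (stand-in for one back end)."""
    n, m, p = len(a), len(b), len(b[0])
    return [[sum(a[i][k] * b[k][j] for k in range(m)) for j in range(p)]
            for i in range(n)]


def matmul_transposed(a, b):
    """Same product using a transposed copy of b (stand-in for another)."""
    bt = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


BACKENDS = {"rows": matmul_rows, "transposed": matmul_transposed}


def benchmark(sizes):
    """Initial benchmarking: record the fastest back end for each size."""
    table = {}
    for n in sizes:
        m = [[float(i + j) for j in range(n)] for i in range(n)]
        timings = {}
        for name, fn in BACKENDS.items():
            start = time.perf_counter()
            fn(m, m)
            timings[name] = time.perf_counter() - start
        table[n] = min(timings, key=timings.get)
    return table


def dispatch(a, b, table):
    """Run the back end that won for the nearest benchmarked size."""
    nearest = min(table, key=lambda n: abs(n - len(a)))
    return BACKENDS[table[nearest]](a, b)


table = benchmark([4, 16, 32])
print(table)
print(dispatch([[1, 2], [3, 4]], [[5, 6], [7, 8]], table))
```

Which back end wins at each size depends on the machine, so the content of the table is not predictable; the product returned by `dispatch` is the same whichever back end is selected.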

References :
[1] Charles H.P., Lomüller V., Data Size and Data Type Dynamic GPU Code Generation in GPU design pattern, Editor: Magoules, SAXE-COBURG PUBLICATIONS, 2012
[2] Couroussé, D. & Charles, H.-P. Dynamic Code Generation: An Experiment on Matrix Multiplication, Proceedings Work-in-Progress Session of
LCTES 2012, 2012
[3] Carbon, A., Lhuillier, Y. & Charles, H.-P. Scaling down to embedded systems for dynamic compilation, 2nd international workshop on
"Dynamic compilation everywhere", 2013
[4] Charles, H.-P. Basic Infrastructure for Dynamic Code Generation, Workshop "Dynamic Compilation Everywhere", in conjunction with the
7th HiPEAC conference, 2012.


Adapting Just-In-Time Compilation to Embedded Systems
Research topics: JIT compilation, embedded systems
A. Carbon, Y. Lhuillier, H-P. Charles
ABSTRACT: Just-In-Time (JIT) compilation is today widely employed in many application domains and massively
transferred to embedded systems. However, JIT compilation complexity leads to a significant performance loss for
embedded processors due to their lack of mechanisms to manage JIT compilation algorithm irregularities in
terms of control and data. Managing these irregularities, associative arrays and dynamic memory allocation still
represent 25 % of the LLVM bytecode compiler execution time, despite many existing software optimizations.
To reduce their impact on execution time, our ongoing work consists in proposing dedicated hardware
resources to accelerate them, based on standard libraries and replacing these software optimizations.
Just-In-Time (JIT) compilation has become a major topic for
academic and industrial researchers in the last 15 years. JIT
compilation technologies consist in executing all, or parts of, the
compilation stages dynamically during the application
execution.
The main reasons for this growing interest are the following:
Increasing dynamism of applications and their workloads
Increasing interactions between applications
Increasing portability and security requirements
Increasing performance requirements
Based on a state-of-the-art analysis, we identify four main
technologies using JIT compilation:
Virtual machines (e.g. Java Virtual Machines)
Dynamic binary translation (e.g. Apple Rosetta)
Multistage dynamic compilation, consisting in deferring
compilation phases to runtime (e.g. LLVM framework)
JIT compilation for dynamically typed languages (e.g.
JavaScript, Python)
In all these technologies, the efficiency of Just-In-Time
compilation depends on the ability to compensate its
overhead with execution speedups obtained on the generated
code. Compilation algorithms are complex and particularly
difficult to handle for embedded processors, which introduces
a significant performance loss, as presented in Figure 1.
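The compensation condition can be made concrete with a short Python sketch: compiling pays off once the compilation cost plus n executions of the generated code is lower than n executions of the baseline. The function name and the cost values are hypothetical and only illustrate the arithmetic.

```python
import math


def break_even_calls(compile_cost, t_baseline, t_generated):
    """Smallest number of calls n such that
    compile_cost + n * t_generated < n * t_baseline,
    or None if the generated code is not faster than the baseline."""
    gain = t_baseline - t_generated
    if gain <= 0:
        return None
    return math.floor(compile_cost / gain) + 1


# Hypothetical costs in microseconds.
print(break_even_calls(compile_cost=500, t_baseline=10, t_generated=4))
print(break_even_calls(compile_cost=500, t_baseline=4, t_generated=4))
```

In the first case the per-call gain is 6 microseconds, so 84 calls are needed (500 + 84 x 4 = 836, which is below 840). A processor on which the compiler itself runs much slower sees the compilation cost, and therefore the break-even point, grow accordingly.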

Figure 1: Execution time slowdown comparison between
conventional and JIT compilation algorithms (execution time
slowdown from Intel Core2 Duo to ARM Cortex-A9; plot labels:
1x, 12x, 17x, x1.4).

We profiled different JIT compilation algorithms [1],
extracted from the highlighted technologies, ran them on an
x86 processor, and compared their behavior to regular
algorithms extracted from the MiBench benchmarks. Figure 2
shows a common control irregularity compared to regular
algorithms, with a slight increase of misprediction rates for
indirect branches and instruction cache. This highlights a
common complexity on control and a high instruction/data
ratio, contrary to regular algorithms, in which the amount of
managed data is far bigger than the amount of instructions.

Figure 2: Misprediction rates for indirect branches and instruction
cache, relative to the total number of instructions (x86).

Table 1 shows a common data irregularity with a slight
increase of indirection depths for JIT compilation algorithms
(LLVM in our case), highlighting the fact that JIT compilation
algorithms are significantly pointer-intensive.

Table 1 : Algorithm indirection depths (x86 instrumented
simulator).

These irregularities, already visible on x86, are important
issues for embedded processors due to a lack of mechanisms
to handle them (especially concerning predictions).
To highlight the parts of the code responsible for these
irregularities, we profiled the LLVM bytecode compiler. The results
show that associative arrays and dynamic memory
allocation represent on average 25 % of its execution time
[2], despite many existing software optimizations: LLVM
developers provide more than 8 specific data types re-implementing standard data types of the C++ STL library.
Our ongoing work deals with the development of dedicated
hardware resources, based on standard libraries. We aim
to accelerate associative array management and
dynamic memory allocation, replacing existing software
optimizations to reduce their impact on execution time.

References :
[1] A. Carbon, Y. Lhuillier, and H-P. Charles, Just-In-Time Compilation Characterization, Poster session of the 7th International Conference on
High-Performance and Embedded Architectures and Compilers, Paris, France, January 2012.
[2] A. Carbon, Y. Lhuillier, and H-P. Charles, Adapting Dynamic Compilation to Embedded Systems, Poster session of the 8th International
Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems, Fiuggi, Italy, July 2012.


Parallelism Reduction in Dataflow Oriented Programming Languages for
Many-Core Architectures
Research topics : Parallelism, dataflow, compilation, many-core
L. Cudennec, R. Sirdey
ABSTRACT: This work aims at transparently tuning the degree of parallelism of applications intended to be
deployed on massively parallel architectures. Dataflow programming languages offer a promising framework for
developing, in a very intuitive way, large scale applications. In this context, the compiler can tune the degree of
parallelism by modifying the application graph, in such a way that the application fits the targeted host while its
semantics are preserved. A compiler plugin is proposed that is able to match and substitute patterns in the
application graph. Experiments show that the reduction engine is accurate and efficient enough to be
shipped within an industry-grade compilation toolchain for a manycore architecture.
The growing number of processing cores on a single chip
leads to new challenges regarding the efficient
programmability of many-core architectures, while staying
appealing to regular HPC developers. In such a context, the
dataflow programming model can be used to structure the
source code, organize tasks into a logical network and
enforce some relevant properties for distributed and shared
systems like the detection of live deadlocks and buffer
overflows. The Sigma-C [1] programming language has been
introduced for this purpose. One of its leitmotifs is to let the
developer focus on the algorithm side, while taking care, at
compilation time, of all the parallelism tuning aspects.
The goal here is to adapt the number of running tasks
to the number of physical processing elements in
order to 1/ improve the application performance and 2/
minimize the memory footprint, the latter being a prevalent
concern in embedded systems.
Tuning a dataflow application can be achieved by
transparently altering the task connection graph in a way
that 1/ the resulting graph meets the given parallelism
requirements and 2/ the application semantics are preserved.
Some systems have been proposed in the literature, mostly
based on simple task fusion, like in StreamIt. In [2], the
authors propose a language to describe and modify system
tasks in charge of the data reorganization.

The application transformations can also be described using
graph patterns [3]. A pattern is defined as a parameterized
description of a sub-graph that can be matched in an
application and replaced by another parameterized sub-graph. One popular example is the split-join cascade pattern
as shown in Figure 1. Provided some good properties
regarding the consumptions and productions of the system
tasks, it is possible to replace this pattern with another one
that is built with a different number of split outputs, join
inputs and even a different number of tiers. The resulting
pattern directly modifies the number of user tasks
instantiated within the cascade pattern. Several patterns
have been proposed, designed for generic or application-specific use.

Figure 1: Split-*-Join cascade pattern.

Figure 2: Accuracy of the parallelism reduction engine.

A parallelism reduction engine has been implemented within
a plugin for the Sigma-C compilation toolchain. This engine
applies different pattern substitutions to find a solution that
fits the application onto the targeted host. Figure 2 shows the
accuracy of the reduction engine. Two applications including
some split-join and systolic matrix multiplication patterns are
initially built with respectively 1925 and 1867 instances. The
reduction engine is thereafter able to reach a number of
instances, given as a parameter, with a small error deviation.
Other experiments show that the engine is able to find
the best solution for an application made of 6 patterns
in only a few seconds when run on a low-end laptop
computer. This demonstrates that the reduction engine can
be used within a regular compilation process.
As for ongoing work, a generic reduction approach is
currently being evaluated, based on the automatic merging of
equivalent tasks. The merging process is then applied such that the
Sigma-C application throughput constraints remain satisfied,
while reducing the memory footprint as much as possible.
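A heavily simplified view of such a reduction can be written in Python. In this sketch each pattern is a single-tier split-join described only by its width, a substitution replaces it with a narrower equivalent, and a greedy loop narrows the widest pattern until a target instance count is reached. The data model, the greedy policy and the names are assumptions made for illustration; the actual engine works on Sigma-C task graphs with richer patterns.

```python
from dataclasses import dataclass


@dataclass
class SplitJoin:
    name: str
    width: int  # number of parallel user tasks between the split and the join

    def instances(self):
        return self.width + 2  # user tasks + one split + one join


def total_instances(patterns):
    return sum(p.instances() for p in patterns)


def reduce_parallelism(patterns, target):
    """Greedily narrow the widest pattern until the count fits the target."""
    patterns = [SplitJoin(p.name, p.width) for p in patterns]  # work on a copy
    while total_instances(patterns) > target:
        widest = max(patterns, key=lambda p: p.width)
        if widest.width == 1:
            break              # nothing left to reduce
        widest.width -= 1      # substitute the pattern by a narrower one
    return patterns


app = [SplitJoin("filter", 64), SplitJoin("matmul", 30)]
reduced = reduce_parallelism(app, target=50)
print(total_instances(app), "->", total_instances(reduced),
      [p.width for p in reduced])
```

Here the application goes from 98 to 50 instances, with both patterns narrowed to 23 user tasks. Whether the target can be matched exactly depends on which widths the substitution rules allow, which is one way to read the small error deviation reported above.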

References :
[1] T. Goubier, R. Sirdey, S. Louise, V. David, Sigma-C: A programming model and language for embedded manycores, in: Algorithms and
Architectures for Parallel Processing, Vol. 7016 of Lecture Notes in Computer Science, Springer Berlin / Heidelberg, 2011, pp. 385-394.
[2] P. de Oliveira Castro, S. Louise, D. Barthou, Dsl stream programming on multicore architectures, in: Programming Multi-core and Manycore Computing Systems, John Wiley and Sons, 2011.
[3] L. Cudennec, R. Sirdey, Parallelism reduction based on pattern substitution in dataflow oriented programming languages, in: Proceedings
of the 12th International Conference on Computational Science, 2012.


A Parallel Simulated Annealing Approach for the Mapping of
Large Process Networks
Research topics: simulated annealing; parallelism; process network
F. Galea, R. Sirdey
ABSTRACT: We propose a parallel simulated annealing approach to solve a dataflow process network mapping
problem, where a network of communicating tasks is mapped onto a set of processors with limited resource
capacities, while minimizing the overall communication bandwidth between processors. The speedups obtained
using this approach enable us to solve problems with more than one thousand tasks, on up to 48 processors, in
reasonable time. Results have been obtained by taking advantage of the specific architecture of a Non-Uniform
Memory Access (NUMA) computer.

In the context of compilation for manycore architectures,
dataflow programming languages such as Sigma-C [2] allow the
design of parallel applications which can naturally exploit
the massive parallelism of the target architecture.
Dataflow applications are made up of potentially thousands of
parallel tasks, interconnected with one another by
communication channels. Such applications are well suited to
being mapped onto manycore processors, which basically are
sets of processors with local memory, interconnected by a
network on chip (NoC).
Several tasks may be allocated to the same processor, as
long as the total resource usage (memory, CPU, IO) of all
tasks allocated to the processor does not exceed the
available quantities. Communication between tasks on the
same processor uses the internal memory, while routes on the
NoC must be established for communication between tasks
on different processors. The goal is then to reduce the global
NoC usage by maximizing the in-processor communication.

Figure 1: The Distributed Network Mapping Problem

Figure 1 illustrates the mapping of a grid of 18x18 tasks
onto 9 processors interconnected with a 2D torus network.
Each processor has an assignment limit of 40 tasks. The
colors correspond to the processors to which the tasks are
allocated.
This problem was first tackled in our Sigma-C compilation
toolchain using a two-level method, involving a greedy
randomized approach for creating partitions of the task
graph, and a Quadratic Assignment heuristic to assign the
task partitions to the processors [3,4]. The advantages of
this method are its quick response time and a quality that is
satisfactory for early development. However, the quality
drops when the size of the problem increases (involving more
than 10 processors), and the method may find solutions for
which it is impossible to route all the necessary
communications on the NoC, due to bandwidth limitations on
the NoC links.
In this work [1] we proposed a more general heuristic
method based on the simulated annealing metaheuristic. The
tasks are directly mapped onto the processors, and the method
proceeds by incrementally swapping tasks between
processors in order to try to reduce the global NoC usage. As
time progresses, changes that degrade the global quality
are accepted less and less often, directing the solution path
to a potentially near-optimal solution.
This method has been shown to provide solutions of much
better quality than the previous one, with the total NoC usage
reduced by 50% on some instances. However, the necessary
computing times are much higher. We dealt with this by
parallelizing the solver on a 48-core NUMA server.

Figure 2: Speedups of the parallel method (48-core NUMA server)

As shown in Figure 2, speedups of more than 30 were
obtained on instances of more than 500 tasks. We could
find very good solutions for instances of more than 1000
tasks in less than 10 minutes, making the method suitable
for late development processes, where final builds of
applications are to be packaged for the embedded platform.
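
The swap-based annealing idea described in this article can be illustrated with a minimal sequential sketch. It is not the authors' solver: the cost function simply sums the weights of channels that cross processors (NoC routing and link bandwidth are ignored), the capacity is a plain task count per processor, and the names `cut_cost`, `anneal_mapping`, `grid_edges` as well as the cooling parameters are assumptions made for illustration.

```python
import math
import random


def cut_cost(edges, mapping):
    """Sum of channel weights whose two tasks sit on different processors."""
    return sum(w for a, b, w in edges if mapping[a] != mapping[b])


def anneal_mapping(n_tasks, edges, n_procs, capacity,
                   steps=20000, t_start=2.0, t_end=0.01, seed=0):
    """Return (mapping, cost); mapping[t] is the processor of task t."""
    if n_procs * capacity < n_tasks:
        raise ValueError("not enough processor capacity for all tasks")
    rng = random.Random(seed)
    # Round-robin start: every processor holds at most `capacity` tasks.
    mapping = [t % n_procs for t in range(n_tasks)]
    cost = cut_cost(edges, mapping)
    best, best_cost = mapping[:], cost
    for step in range(steps):
        # Geometric cooling from t_start down to t_end (assumed schedule).
        temp = t_start * (t_end / t_start) ** (step / steps)
        a, b = rng.randrange(n_tasks), rng.randrange(n_tasks)
        if mapping[a] == mapping[b]:
            continue
        # Swapping two tasks keeps every processor's task count unchanged.
        mapping[a], mapping[b] = mapping[b], mapping[a]
        new_cost = cut_cost(edges, mapping)
        delta = new_cost - cost
        # Always accept improvements; accept degradations with a
        # probability that shrinks as the temperature decreases.
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            cost = new_cost
            if cost < best_cost:
                best, best_cost = mapping[:], cost
        else:
            mapping[a], mapping[b] = mapping[b], mapping[a]  # undo the swap
    return best, best_cost


def grid_edges(side):
    """Unit-weight channels between neighbouring tasks of a side x side grid."""
    edges = []
    for r in range(side):
        for c in range(side):
            t = r * side + c
            if c + 1 < side:
                edges.append((t, t + 1, 1))
            if r + 1 < side:
                edges.append((t, t + side, 1))
    return edges


if __name__ == "__main__":
    edges = grid_edges(6)
    mapping, cost = anneal_mapping(36, edges, n_procs=4, capacity=10, steps=5000)
    print("inter-processor bandwidth:", cost)
```

Recomputing the full cost after each swap keeps the sketch short; an incremental cost update and a parallel exploration strategy would be needed to approach the instance sizes reported above.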

References :
[1] F. Galea and R. Sirdey, A parallel simulated annealing approach for the mapping of large process networks, PCO'12 as part of the 26th
IEEE International Parallel & Distributed Processing Symposium (IPDPS'12), Shanghai, China, 2012.
[2] T. Goubier, R. Sirdey, S. Louise and V. David, Sigma-C: A Programming Model and Language for Embedded Manycores, Algorithms and
Architectures for Parallel Processing, Berlin / Heidelberg, Springer, 2011, pp. 385-394.
[3] R. Sirdey, Contributions à l'optimisation combinatoire pour l'embarqué : des autocommutateurs cellulaires aux microprocesseurs
massivement parallèles, HDR Thesis, 2011.
[4] O. Stan, R. Sirdey, J. Carlier and D. Nace, A heuristic algorithm for stochastic partitioning of process networks, 16th International
Conference on System Theory, Control and Computing (ICSTCC), 2012.


A Heuristic Algorithm for Stochastic Partitioning of Process Networks
Research topics: graph partitioning, chance-constrained optimization, compilation
O. Stan, R. Sirdey, J. Carlier, D. Nace (UTC)
ABSTRACT: In this work, we study the problem of partitioning networks of processes under chance constraints.
This problem arises in the field of compilation for multi-core processors. The theoretical equivalent for the case
we consider is the Node Capacitated Graph Partitioning with uncertainty affecting the weights of the vertices.
To solve this problem we propose an approximate algorithm which takes advantage of the available
experimental data through a sample-based approach combined with a randomized greedy heuristic, originally
developed for the deterministic version. Our experimental results illustrate the algorithm's ability to efficiently
obtain, within an acceptable execution time, solutions of good quality which are also robust to data variations.
The development of microprocessor architectures with 100+
cores has triggered a renewed interest in the so-called
dataflow programming models, in which one expresses
computation-intensive applications as networks of concurrent
tasks interacting through (and only through) unidirectional
FIFO channels. In [1], we present a heuristic algorithm
dedicated to the resource-constrained graph partitioning
problem which crops up when mapping networks of dataflow
processes onto a parallel architecture, assuming the resource
consumptions of the processes are uncertain.

Figure 1 shows the statistically significant approximation
model of an initial chance-constrained program, obtained
with our robust binomial approach. The variables indexed by
i are binary and their sum follows a binomial distribution,
k is a parameter determined as a function of NS and of the
required probability and confidence levels, and L is a
large constant, depending on the problem structure
but generally easy to find.

Known, in the deterministic case of single dimensional
weights, as the NP-hard problem of Node Capacitated Graph
Partitioning, the assignment of the weighted vertices of a
dataflow graph to a fixed set of partitions has, to the best of
our knowledge, received little attention from the stochastic
programming community.
In order to stay as close as possible to the real context of
our application, a qualitative analysis of the sources of
uncertainty, mainly the execution times, was performed. This
preliminary analysis showed the inherent difficulty of
obtaining an analytical description of the distributions of the
execution times. Even if it is reasonable to assume that the
probability distributions of execution times have a bounded
support (no infinite loops), we have to cope with the fact that
these distributions are intrinsically multimodal (due to the
presence of data dependent control). Also, in the case of
process networks, we cannot overlook the problem of
dependencies between these random variables.
The approach we propose is justified by the theory of
statistical hypothesis testing and takes into account the
important role of experimental data. Additionally, for solving
a chance constrained problem, no assumptions are being
made about the joint distribution of the random variables, in
particular with respect to the independence of these
variables.
Our algorithm design methodology consists in leveraging an
existing heuristic for the deterministic case without significant
restructuring (i.e. at small cost in terms of software
engineering) and with an acceptable performance hit.
Furthermore, this non-parametric method we introduce for
solving our chance-constrained partitioning of process
networks is applicable, in combination with other
approximation algorithms (e.g. metaheuristics), to other
optimization problems.

Figure 1: Approximation using robust binomial approach

For our problem, the objective f is to minimize the
inter-partition communication, and the probabilistic
constraints are on the capacities of the clusters. The heuristic
we adapted for treating the stochastic graph partitioning is an
already available greedy randomized affinity-based heuristic,
easy to modify and quite efficient for the placement of the
processes in the deterministic case. The task weights are
random variables (memory footprint or computing core
occupancy) for which we have a representative sample of NS
independent and identically distributed realizations.
By using statistical hypothesis testing within a heuristic
approach, we overcome the computational cost of taking
into account the uncertainty on the weights of the vertices.
Concerning complexity, we observe a linear increase, by
a factor of NS, in comparison with the deterministic version.
This approach can solve problems representative in size of
our application context with acceptable solution quality,
confidence level and computation time. Overall, the
solutions have a quality comparable to those of the heuristic
for the deterministic case and, moreover, they are statistically
guaranteed at the prescribed confidence level.
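
A sample-based check of one probabilistic capacity constraint can be sketched as follows. This is only an illustration of the general idea (count how many of the NS sampled realizations satisfy the constraint and compare the count with a threshold k obtained from a binomial tail), not the authors' robust binomial formulation: the function names, the parameter names `eps` (tolerated violation probability) and `alpha` (risk of wrongly accepting), and the way k is derived are all assumptions.

```python
from math import comb


def binomial_upper_tail(n, p, k):
    """P(X >= k) for X following a Binomial(n, p) distribution."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))


def acceptance_threshold(ns, eps, alpha):
    """Smallest k such that P(Binomial(ns, 1 - eps) >= k) <= alpha.

    If the true satisfaction probability is below 1 - eps, observing at
    least k satisfied samples out of ns then happens with probability at
    most alpha. Returns None when ns is too small for any k to qualify.
    """
    for k in range(ns + 1):
        if binomial_upper_tail(ns, 1 - eps, k) <= alpha:
            return k
    return None


def partition_is_accepted(samples, tasks, capacity, k):
    """Accept if at least k sampled realizations fit within the capacity.

    samples[s][t] is the weight of task t in realization s.
    """
    satisfied = sum(
        1 for weights in samples
        if sum(weights[t] for t in tasks) <= capacity
    )
    return satisfied >= k


if __name__ == "__main__":
    ns = 100
    k = acceptance_threshold(ns, eps=0.1, alpha=0.05)
    samples = [[1.0, 2.0, 3.0] for _ in range(ns)]  # toy, constant weights
    print("threshold k =", k)
    print("tasks {0, 1} fit in capacity 3:",
          partition_is_accepted(samples, [0, 1], 3.0, k))
```

Each constraint evaluation scans the NS realizations once, which is consistent with the linear factor of NS mentioned above.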

References :
[1] O. Stan, R. Sirdey, J. Carlier and D. Nace, "A heuristic algorithm for stochastic partitioning of process networks", Proceedings of the 16th
IEEE International Conference on System Theory, Control and Computing, Sinaia, Romania, 2012.


A Low-overhead Dedicated Execution Support for Stream Applications on Shared Memory CMP
Research topics: execution model, micro-kernel, manycore, stream programming
P. Dubrulle, S. Louise, R. Sirdey, V. David
ABSTRACT: The ever growing number of cores in Chip Multi-Processors (CMP) brings a renewed interest in
stream programming to solve the programmability issues raised by massively parallel architectures. Stream
programming languages are flourishing (StreamIt, Brook, Sigma-C, etc.). Nonetheless, their execution support has
not yet received enough attention, in particular regarding the new generation of many-cores.
In embedded software, a lightweight solution can be implemented as a specialized library, but a dedicated
micro-kernel offers a more flexible solution. We propose to explore the latter way with a Logical Vector Time
based execution model, for CMP architectures with on-chip memory.
Many-cores represent a challenge for programmers. Efficient
use of the parallelism of hardware with hundreds of cores
and on-chip shared memory can be achieved with a language
offering good abstraction and with an adapted compiler [1].
Stream programming languages (based on Kahn Process
Networks) are suitable as they offer a guarantee for
determinism, an abstraction of underlying hardware and good
properties for efficient placing/routing tools [1].
Stream programming is a very good approach for signal and
image processing, which are predominant in the embedded
applications. For embedded many-cores, the execution
support of stream applications is possible through an efficient
execution model implemented as a micro-kernel, with a
dynamic scheduling [2].
An offline scheduler can infer a partial order of execution of
the tasks in a stream application, so as to guarantee
deterministic data access. From this partial order of
execution, it is possible to encode the dependencies between
all the activations of all tasks in a stream application by
assigning each of them a vector clock, together with
increments that update it indefinitely using addition and
modulo operations. These vector clocks capture causality
between computation events: comparing them tells whether
one event precedes another or whether they are causally
independent. For a given pair of tasks a and b, checking
whether the current activation of a precedes the current
activation of b using the vector clocks can be done with a
scalar operation [2].
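
For readers unfamiliar with vector clocks, the causality test mentioned above can be sketched with the textbook component-wise comparison. This is the generic definition only; the modulo-based encoding and the scalar comparison of the Logical Vector Time model in [2] are not reproduced here, and the function names are illustrative.

```python
def precedes(a, b):
    """True if the event stamped `a` causally precedes the event stamped `b`.

    Textbook rule: every component of `a` is <= the matching component of
    `b`, and at least one component is strictly smaller.
    """
    if len(a) != len(b):
        raise ValueError("vector clocks must have the same length")
    return (all(x <= y for x, y in zip(a, b))
            and any(x < y for x, y in zip(a, b)))


def causally_independent(a, b):
    """True if neither event precedes the other and the stamps differ."""
    return a != b and not precedes(a, b) and not precedes(b, a)


if __name__ == "__main__":
    print(precedes((1, 0, 0), (1, 1, 0)))      # first stamp is dominated
    print(causally_independent((1, 0), (0, 1)))  # incomparable stamps
```

The generic test costs one pass over the vector components, which gives a sense of why a comparison reduced to a single scalar operation is attractive for a runtime scheduler.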

The micro-kernel is designed to run at the multi-core scale
(up to 16 cores). To scale up to many-core architectures,
several instances of this multi-core micro-kernel run on
partitions of the global set of cores (either logical partitions,
or physical clusters in a hierarchical architecture) [2].
A prototype of this execution support was implemented and
evaluated on an x86 multi-core platform. The performance
results show that the overhead of the execution support does
not depend on the number of tasks in the application, which
means the micro-kernel offers scalable dynamic scheduling,
as can be seen in Figure 1. This evaluation demonstrates that
the LVT micro-kernel is an efficient execution support for
stream applications with filters implementing complex
computations. On the other hand, applications with filters
performing short operations require the intervention of
compilation tools to merge some parallel tasks in order to
reach an appropriate minimum execution time [3].

The proposed execution model relies on the vector clocks to
update a data dependency counter at runtime. Initially, the
counters' values are known and tasks without dependencies
are executed. When a task activation ends, the vector clocks
are compared and dependencies are updated according to the
comparison results.
In the micro-kernel, these operations are performed on a
supervision core (many-cores tend to provide such additional
cores for housekeeping, like the MPPA chip). This asymmetric
approach takes advantage of the target parallelism to absorb
a part of the scheduling overhead [2].

Figure 1: Duration of system operations depending on the
number of tasks in the executed application

References :
[1] T. Goubier, R. Sirdey, S. Louise and V. David. Sigma-C: A programming model and language for embedded manycores. ICA3PP, Lecture Notes
in Computer Science, vol. 7016, 2011.
[2] P. Dubrulle, S. Louise, R. Sirdey and V. David. A low-overhead dedicated execution support for stream applications on shared memory
CMP, Proc. of 10th ACM int. conf. on Embedded Software, pp. 143-152, 2012.
[3] L. Cudennec and R. Sirdey, Parallelism reduction based on pattern substitution in dataflow oriented programming languages, Proc. of 12th
int. conf. on Computational Science, 2012.


Autonomic Pervasive Applications Driven by Abstract Specifications
Research topics: Internet of Things, Service-oriented & Autonomic Computing
O. Gunalp, L. Gurgen, V. Lestideau (LIG), P. Lalanda (LIG)
ABSTRACT: Pervasive application architectures present stringent requirements that make their development
especially hard. In particular, they need to be flexible in order to cope with dynamism in different forms (e.g.
dynamically changing service and data providers and consumers). Existing development approaches do not
provide explicit support for managing this dynamism. In this paper we describe Rondo, a tool suite for designing
pervasive applications. Rondo proposes pervasive application specification, which borrows concepts from
service-oriented component assembly, model-driven engineering (MDE) and continuous deployment, resulting
in a more flexible approach than traditional application definitions.
Pervasive computing aims at removing the barrier between
users and computing systems by blending the computers into
the users' environment. This vision is becoming achievable
thanks to the recent evolution of mobile, wireless
and sensor technologies. However, the development of
pervasive applications is a difficult challenge because the
developer needs to manage contextual changes, device and
application dynamism, as well as the business logic. As a
consequence, current pervasive applications are generally
unsatisfactory in terms of software engineering: they are
difficult to design, code, test and maintain; most existing
solutions are proprietary, limited in terms of provided
services and executed in a closed world.
To address this, we propose a tool suite, Rondo [1], which
enables the design, development and execution of dynamic
pervasive applications using runtime models and autonomic
computing principles. We adopt a service-oriented approach,
which aims to promote loose coupling between components.
The implementations of the services are decoupled from their
specifications. Thus, applications can be built upon loosely
coupled service providers and consumers based on service
level contracts. Late binding and sustainability become
possible, opening the way to dynamically adaptable software
architectures.

Rondo is a tool suite for designing, deploying and executing
pervasive applications. The Rondo framework uses a
model-driven approach and has three main goals: designing
dynamic applications, specifying the pervasive environment
and enabling application adaptations for context-awareness.
With this approach we aim to manage the life cycle of
pervasive applications, from development to runtime changes.
Rondo provides a domain-specific language based on the
notion of components to define the architecture of pervasive
applications. Then, at runtime, the application manager takes
this description and configures the service-oriented execution
environment in order to deploy and start the application,
while taking into account the current state of the environment,
represented by several runtime models (Figure 1). While
the application is running, this manager continues to monitor
and manage the created application.
We have implemented our approach and validated its utility
by demonstrating it in a smart home application in the context
of a European project, namely BUTLER. The application
enables the media streams to follow the user through the
different rooms of the home that are equipped with
audio/video devices, so that the user can continue to watch
and/or listen to the media while he/she moves through the
house. Several sensors such as presence detectors, sonar
sensors, touch sensors, etc., based on heterogeneous
technologies (ZigBee, 6LoWPAN/CoAP), are used to localize
the user.

Figure 1: Rondo Global Approach

Figure 2: Follow-me application

References :
[1] O. Gunalp, L. Gurgen, V. Lestideau, P. Lalanda. Autonomic Pervasive Applications Driven by Abstract Specifications, 2012 International
Workshop on Self-aware internet of things (Self-IoT '12). ACM, New York, NY, USA, pp. 19-24.


Resource-Based Oriented Middleware for Sensor/Actuator Networks
Research topics : coordination middleware, sensor/actuator networks
L.-F. Ducreux, C. Guyon-Gardeux, H. Iris, S. Lesecq, F. Pacull, S. R. Thior
ABSTRACT: A coordination middleware for sensor/actuator networks makes it possible to solve the difficult problems
inherent to heterogeneity, asynchronicity and distribution at the system level. Domain-dependent frameworks,
e.g. for the building automation area, can be built on top of this middleware in order to let application
designers focus on the scenario aspects. In addition, the rule-based language of this middleware enforces a
high-level coordination protocol that provides interesting properties such as better reliability and improved
energy efficiency of the sensor nodes.

The coordination middleware developed in our team aims to
deploy [3] and coordinate hardware and software
components that were not initially planned to talk together.
Building Automation (BA) is one of the fields where such
middleware can demonstrate its usefulness.
Indeed, BA Systems (BAS) encompass a wide variety of
systems, e.g. HVAC, lighting, access control, intrusion alarm,
fire detection. A building that integrates several BAS can be
seen as a System of Systems (SoS) where the integration
and interaction of the systems together with their own
individual control aims at providing a higher level of energy
efficiency, user comfort, etc. Unfortunately, most of the BAS
coexist without any cooperation or interoperability, leading
to conflicting actions. Moreover, the individual control of each
(sub-)system does not lead to a global optimum for the
whole targeted objective. In addition, well-established
industrial protocols (e.g. BACnet, LonWorks, KNX) need to
coexist with the new wireless technologies that are more and
more present in the field.
This leads to a complex heterogeneous network with a
variety of technologies, which makes the task of the people in
charge of the administration of the system a nightmare.
To overcome these difficulties, a middleware can provide an
abstract view of the underlying systems and provide some
facilities for coordinating all the individual networks.
At the abstraction level, we have developed a
dedicated framework named PUTUTU [1] (see Fig. 1) in order
to ensure the separation of concerns that allows delegating to
the appropriate people the different tasks required when BA
is considered.

Figure 1: PUTUTU framework and integrated technologies (Tellstick, Watteco, iRIO, Homes, KNX, PLUGWISE, RFXCOM, Letibee, TelosB)

The implementation of the driver between the gateway and
the devices is left to a person who is
familiar with the communication protocol and the specific
aspects of the sensors and actuators.
Once done, this driver is directly usable by the PUTUTU
framework, which provides the abstraction level. At this level
there is no difference between a wired LON sensor and a
wireless sensor operating in the 433 MHz range. Both
are seen as resources contained in bags that can be
queried through the coordination protocol [1].
In addition, the generic blocks presented in the green box
(Fig. 1) take care of the processing common to all the sensor
and actuator technologies: initialisation of the gateway,
sampling frequency, management of the basic information
(type, timestamps, logging, etc.). Then the application
designer, who can be a third party, can define the scenarios
thanks to the rule-based coordination language, with no need
to know the details of the PUTUTU framework or of the
technology used by the sensors and actuators.
Low power consumption and reliability are two important
properties of wireless sensor networks. To improve these
aspects, we go one step further and enforce the coordination
protocol on top of the communication protocols imposed by
the different wireless sensor networks. Thus, we move the
callee side of this protocol from the gateway to the
sensors/actuators in order to make them able to respond
directly to this protocol [2] (see Fig. 2).
The high-level coordination protocol brings, on the one hand,
control from the application side over the activities
(sleep/awake) of the sensors and, on the other hand,
transactional processing of operations involving a group of
sensors/actuators. This has a positive impact on power
consumption and on reliability.

Figure 2: Smart Sensor approach
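
The idea that a wired and a wireless sensor are both seen as resources contained in a bag can be illustrated with a toy, in-memory sketch. It does not reproduce the PUTUTU framework or its coordination protocol: the `Bag` class, its `put` and `rd` methods, the tuple layout and the sensor names are all hypothetical and only show pattern-based querying that is independent of the underlying technology.

```python
class Bag:
    """Hypothetical bag of resources; each resource is a tuple of fields."""

    def __init__(self):
        self._resources = []

    def put(self, *fields):
        """Insert a resource into the bag."""
        self._resources.append(tuple(fields))

    def rd(self, *pattern):
        """Return the resources matching the pattern, without removing them.

        A None field in the pattern acts as a wildcard.
        """
        return [
            r for r in self._resources
            if len(r) == len(pattern)
            and all(p is None or p == f for p, f in zip(pattern, r))
        ]


if __name__ == "__main__":
    sensors = Bag()
    # Assumed layout: (sensor id, measured quantity, value).
    sensors.put("office_wired", "temperature", 21.5)     # e.g. a LON sensor
    sensors.put("lab_wireless", "temperature", 19.0)     # e.g. a 433 MHz sensor
    sensors.put("lab_wireless", "humidity", 40.0)

    # The query does not mention the technology behind each sensor.
    for sensor_id, _, value in sensors.rd(None, "temperature", None):
        print(sensor_id, value)
```

In such a model, the technology-specific driver is the only component that needs to know how a reading was obtained before it is placed in the bag.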

References :
[1] L.-F. Ducreux, C. Guyon-Gardeux, S. Lesecq, F. Pacull, S. R. Thior, Resource-based middleware in the context of heterogeneous building
automation systems, 38th Annual Conference of the IEEE Industrial Electronics Society IECON, Montreal, Canada, October 2012.
[2] H. Iris, F. Pacull, Protocol Awareness: A Step Towards Smarter Sensors, Wish workshop, Third International Conference on Sensor Device
Technologies and Applications SENSORDEVICES 2012, Rome, Italy, August 2012.
[3] F. Pacull, "Deployment management through on-remote-site dynamic compilation.", Workshop Dynamic Compilation Everywhere in
conjunction with the 7th International Conference on High-Performance and Embedded Architectures and Compilers (HIPEAC - 2012), Paris,
France.

Reliability & Test

Memories
Wire Diagnosis

Memory Reliability Improvements Based on Maximized Error-Correcting Codes
Research topics: Reliability; Error Correcting Codes; MTTF
V. Gherman, S. Evain, Y. Bonhomme
ABSTRACT: Error-correcting codes (ECC) offer an efficient way to improve the reliability and yield of memory
subsystems. ECC codeword length is not the maximum allowed by a certain check-bit number since the number
of data-bits is constrained by the width of the memory data interface. This work investigates the additional
error correction opportunities offered by the absence of a perfect match between the numbers of data-bits and
check-bits in some of the most commonly used ECCs. A method is proposed for the selection of multi-bit errors
which can become correctable with a minimal impact on decoder latency. Reliability improvements are
evaluated for memories in which all errors affecting the same number of bits in a codeword are equally
probable.
Error-correcting codes (ECC) provide an effective way to
achieve the required level of transient fault tolerance in
storage and memory subsystems since they can be applied at
system or component levels with relatively limited design
overhead. Since the implementation of an ECC requires a
certain amount of storage overhead, any approach able to
boost the fault masking capacity is helpful. An approach with
lower performance overhead is to extend the error correction
capability of an ECC without increasing the check-bit number
per codeword. This is possible due to the fact that usually the
number of data-bits needs to be a power of 2 or a multiple of
a power of 2. For example, a (16, 8) ECC allows the
correction of all single-bit and double-bit errors, that is, 136
errors (16 single-bit and 120 double-bit), while, in principle,
up to 255 distinct errors (one per non-zero 8-bit syndrome)
could be distinguished with the available information
redundancy.
Most of these ECC extensions are devoted to the correction of
burst errors, which affect contiguous codeword bits. This
choice is not justified when burst errors are not necessarily
the most probable multi-bit errors. Examples include
CMOS memories protected against multi-bit upsets with the
help of bit interleaving, or non-volatile memories where the
corruption of information is not necessarily induced by
ionizing particles, such as magnetic RAMs (MRAM).
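
The counting argument behind the (16, 8) example can be checked with a few lines of arithmetic. The sketch below only counts error patterns and syndromes; it says nothing about which extra patterns a given H-matrix can actually correct, and the function names are illustrative.

```python
from math import comb


def corrected_error_patterns(codeword_bits, max_errors):
    """Number of error patterns affecting 1..max_errors bits of a codeword."""
    return sum(comb(codeword_bits, e) for e in range(1, max_errors + 1))


def nonzero_syndromes(check_bits):
    """Number of distinct non-zero syndromes available with check_bits bits."""
    return 2 ** check_bits - 1


if __name__ == "__main__":
    n, k = 16, 8                                  # the (16, 8) code of the text
    used = corrected_error_patterns(n, 2)         # single-bit + double-bit
    available = nonzero_syndromes(n - k)
    print("patterns already corrected:", used)
    print("distinguishable errors:", available)
    print("syndromes left for extra patterns:", available - used)
```

Since every correctable error needs its own non-zero syndrome, the difference between the two counts is an upper bound on the number of additional multi-bit errors that a maximized code can correct.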
Here, we propose the concept of maximized ECCs obtained
by a better utilization of the information redundancy available
in linear block ECCs. Starting from an N-bit ECC, the goal is
to get a maximum number of correctable (N+1)-bit errors.
Selection criteria are defined in order to reduce the impact on
the latency of the error correcting logic. For example, among
all correctable (N+1)-bit errors which generate the same
syndrome, the errors which affect a maximum number of
check-bits are preferred.
The impact on mean-time-to-failure (MTTF) of a memory was
evaluated for the particular cases of ECCs that enable the
correction of errors affecting a maximum of one bit or two
bits. Especially in the case when two-bit errors can be corrected, significant MTTF increase can be obtained. In this
paper, only reliability improvements are quantified despite
the fact that such solutions could also be used to improve
memory manufacturing yield.
Consider a memory unit protected by an N-bit ECC. Memory
failures occur if codewords with more than N corrupted bits
are accessed. It is assumed that:
- the (N+1)-bit errors are independent and identically
  distributed with a given rate $\lambda$,
- the rate of errors affecting more than N+1 bits is
  much smaller than $\lambda$.
Under these conditions, the probability P that a memory unit
operates without failure during a time interval $t$ can be
expressed as:

$P = e^{-\lambda t}$

where $1/\lambda$ gives the memory MTTF under the specified
assumptions.
If a certain fraction x ($0 \le x \le 1$) of the (N+1)-bit errors can
be masked by a certain mechanism, the (N+1)-bit error rate
responsible for memory failures becomes $(1-x)\lambda$, the
MTTF becomes $1/((1-x)\lambda)$, and the memory MTTF
improvement is given by the following expression:

$\frac{\Delta \mathrm{MTTF}}{\mathrm{MTTF}} = \frac{x}{1-x}$

If 20% of the (N+1)-bit errors become correctable, then the
MTTF is improved by 25%. This means that an MTTF of
4 years can be extended by one additional year. We
generate the H-matrices of the maximized double-bit error
correcting (DEC) codes from scratch with the help of a
SAT solver. If only data-bits need to be provided by a
maximized DEC decoder, the hardware overhead can be
reduced by selecting those triple-bit errors that involve a
maximum number of check-bit positions. Besides the
constraints specific to a DEC code, two additional goals were
imposed:

- find an H-matrix that can provide a maximum
  number of triple-bit errors that can be corrected,
- maximize the number of correctable triple-bit errors
  that involve only check-bit positions.
Once the H-matrix is found, the set of correctable triple-bit
errors is constructed by selecting first those triple-bit errors
which involve only check-bit positions. The next preferred
triple-bit errors affect two check-bit positions. The remaining
triple-bit errors are selected among those which involve one
or zero check-bit positions. All triple-bit errors that affect
contiguous bit positions can be made correctable without
affecting the total number of correctable triple-bit errors.
Table I reports the ratios of triple-bit errors masked with the
obtained maximized DEC codes with respect to the total
numbers of triple-bit errors. The achieved MTTF improvements were between 27% and 100%. The maximized DEC
decoders were synthesised with a 45nm standard cell library
and their latency overheads were between 0% and 20%.
Table I: MTTF improvements of maximized DEC codes

DEC code   Data-bits   Check-bits   Masked triple-bit errors   Ratio of triple-bit errors   MTTF improvement
(16,8)     8           8            118                        21%                          27%
(22,12)    12          10           770                        50%                          100%
(26,16)    16          10           672                        26%                          35%
(36,24)    24          12           3250                       46%                          84%
(44,32)    32          12           3100                       23%                          31%
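
The ratios and MTTF improvements reported in the table can be recomputed from the numbers of masked triple-bit errors. The sketch assumes that the improvement equals x/(1-x), where x is the masked fraction of all triple-bit errors of the codeword; this assumption reproduces the 20% to 25% example given in the text. The names `mttf_improvement` and `TABLE` are illustrative.

```python
from math import comb


def mttf_improvement(x):
    """Relative MTTF gain when a fraction x of the (N+1)-bit errors is masked."""
    if not 0.0 <= x < 1.0:
        raise ValueError("x must satisfy 0 <= x < 1")
    return x / (1.0 - x)


# (codeword bits, data bits) -> masked triple-bit errors, as listed in the table.
TABLE = {
    (16, 8): 118,
    (22, 12): 770,
    (26, 16): 672,
    (36, 24): 3250,
    (44, 32): 3100,
}

if __name__ == "__main__":
    for (n, k), masked in TABLE.items():
        ratio = masked / comb(n, 3)   # share of all triple-bit error patterns
        gain = mttf_improvement(ratio)
        print(f"({n},{k}): ratio {ratio:.1%}, MTTF improvement {gain:.1%}")
```

Rounded to the nearest percent, the computed values agree with the two rightmost columns of the table.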

References :
[1] V. Gherman, S. Evain, and Y. Bonhomme, Memory Reliability Improvements based on Maximized Error-Correcting Codes, IEEE European
Test Symposium, pp. 1-6, 2012.


A Distributed Diagnosis Strategy using Bayesian Network for Complex Wiring Networks
Research topics: wiring network, reflectometry, Bayesian Networks, uncertainty
W. Ben Hassen, F. Auzanneau, F. Peres and A. Tchangani (LGP, ENIT, INPT)
ABSTRACT: In this paper, a distributed diagnosis strategy using reflectometry is proposed. It consists in making
reflectometry measurements at different spots of a highly complex wiring network. The proposed approach
targets the optimization of the number of sensors using Bayesian Networks. It consists of three steps: (1) sensor
implementation in a deterministic case, (2) diagnosis modeling using Bayesian Networks, (3) optimization of the
number of sensors. Here, the main objective is to find a compromise between the number of sensors (system cost) and
the uncertainty of the diagnosis measure (diagnosis quality).
Reflectometry is a powerful technique for the detection,
localization and characterization of electrical faults. In
branched wiring networks, a distributed diagnosis strategy is
required to guarantee a good diagnosis quality. The main idea
is to implement several sensors at different locations on the
network. However, this multi-sensor architecture raises
serious challenges regarding signal processing, optimization
of the number and location of sensors, resource allocation, etc.
This paper focuses on optimizing the number of sensors
depending on a predefined target confidence level. The main
objective is to find a good compromise between diagnosis
quality and system cost. The proposed approach consists of
several steps, which are:
Deterministic case implementation:
One sensor is implemented at each end of the transmission
line, which maximizes diagnosis coverage. Here, a high
diagnosis quality is obtained (i.e. 100% confidence level), but
the system cost is also very high.
Diagnosis modeling using Bayesian Networks (BN):
In order to reduce BN implementation complexity, a local BN
is modeled for each sensor, as depicted in Fig. 1.

network is diagnosed by only one sensor as depicted by


Fig.2.

Figure 2: Sensors optimization (left: deterministic case, right:


non-deterministic case)

A fault is simulated on branch B3 of the network. In this


case, a low system cost is achieved (from 6 sensors down to
3). However, the obtained diagnosis quality is really bad
(confidence level is equal to 33%). In order to overcome this
problem without adding any sensors in the network,
communication between neighboring sensors is introduced to
exchange information about the detected fault. Fig.3 shows
the global Bayesian network model in the optimized case.
Here, diagnosis quality and confidence level are both
satisfactory (equal to 100%).

Figure 1: The local Bayesian Network structure.

Then local BNs are integrated into a global BN in order to


locate faults on the whole network [1].
Sensors number optimization:
In order to reduce sensors number, the network is divided
into several generic networks of Y or star topology. Each sub-

Figure 3 : The Global Bayesian


communication: fault localization.

Network

with

sensors

As future works, the cable life profile will be introduced for


better sensors optimization.
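
The benefit of communication between sensors can be illustrated with a toy Bayesian update. This is only a sketch and not the authors' Bayesian Network: it assumes, for illustration, that a lone sensor cannot tell which of three candidate branches carries the fault, and that each neighboring sensor can rule out its own branch. The branch names other than B3, the likelihood values and the function `fuse` are hypothetical.

```python
def fuse(prior, reports):
    """Posterior over candidate branches after combining independent sensor reports.

    prior   -- dict branch -> prior probability
    reports -- list of dicts branch -> likelihood of that report if the fault is on the branch
    """
    posterior = dict(prior)
    for report in reports:
        posterior = {b: posterior[b] * report[b] for b in posterior}
    total = sum(posterior.values())
    return {b: p / total for b, p in posterior.items()}


# Hypothetical candidate branches, equally likely before any measurement.
branches = ["B1", "B2", "B3"]
prior = {b: 1.0 / len(branches) for b in branches}

# Assumption: the lone sensor only measures a distance that fits all three branches.
own_report = {"B1": 1.0, "B2": 1.0, "B3": 1.0}

# Assumption: each neighbor sees no fault on its own branch and says so.
neighbor_1 = {"B1": 0.0, "B2": 1.0, "B3": 1.0}
neighbor_2 = {"B1": 1.0, "B2": 0.0, "B3": 1.0}

alone = fuse(prior, [own_report])
shared = fuse(prior, [own_report, neighbor_1, neighbor_2])
```

Under these assumptions the confidence in B3 is one third when the sensor works alone and rises to one once the neighbors' reports are combined, which mirrors the trend reported above without reproducing the actual model.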

References:
[1] W. B. HASSEN, F. AUZANNEAU, F. PERES, and A. TCHANGANI, A Distributed Diagnosis Strategy using Bayesian Network for Complex
Wiring Networks, in IFAC Workshop on Advanced Maintenance Engineering, Services and Technology (AMEST), November 2012.


Soft Faults Diagnosis in Wire Networks Using Time Reversal Reflectometry
Research topics: FDTD, Reflectometry, Time Reversal.
L. El-Sahmarany, F. Auzanneau, L. Berry¹ and P. Bonnet¹ (¹Université Blaise Pascal)
ABSTRACT: The invariance of the wave equation under time reversal (TR) in a lossless transmission line is
exploited for the detection and localization of soft faults in a wire network. A TR-based signal processing method is
presented and evaluated on numerical examples. To test its efficiency, the TR algorithm has been
developed and simulated using FDTD (Finite Difference Time Domain method). The results show that time reversal
improves the diagnosis of soft faults in a wire.

Reflectometry methods are commonly used for testing
transmission lines. Hard faults (open and short circuits) are
observable by standard reflectometry, but soft faults
(damaged insulation, etc.) generally are not (as illustrated
in Fig. 1). This study presents a new signal processing method based
on time reversal for the detection and localization of soft
faults.

Figure 1: Simulation result for a simple line with an inductive
fault (14% of the inductive value of the line under test).

Time reversal was first introduced in acoustics by M. Fink.
This technique efficiently focuses energy on a target by
exploiting the invariance of the propagation
equation with respect to time reversal.
Adopting a similar approach, any local impedance
discontinuity in a transmission line, created by a soft fault,
behaves like a secondary source generating a transmitted
wave and a reflected wave. Thus, a TR process can be
applied to locate these modifications relative to a healthy
reference line.
As illustrated in Fig. 2, the proposed signal processing
requires three steps. In a first step, a voltage pulse is injected into the cable
without fault and propagates according to the wave
propagation equations. The simulation method gives the
spatial voltage distribution along the transmission line: Vin. In
a second step, the same pulse is injected into the cable with
fault and the reflected signal Vrd is recorded. In a third step,
the recorded signal Vrd is time-reversed and re-injected into
the cable without fault. The spatial voltage distribution Vbis_rd
along the line is provided by the simulation code.

In reversed time, the convolution product Vbis_rd*Vin is then
calculated. This operation reaches its maximal value
at the position along the line whose time delay is equal to
the time delay of the fault. Fig. 3 shows the result of the
convolution, which focuses the energy on each fault without
a priori knowledge of its location. Each fault was simulated by
modifying a per-unit-length parameter: 38% of C (capacitance)
for one fault and 11% of L (inductance) for the other.
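
The three steps and the focusing property can be sketched numerically. The example below is a simplified illustration, not the FDTD code used in the study: it assumes an ideal lossless, matched line in which signals are pure delays, and it models each soft fault by a small reflection coefficient. The pulse shape, line length, propagation velocity, fault position and reflection coefficient are all hypothetical values, as are the function names.

```python
import numpy as np


def pulse(t, t0=30e-9, sigma=5e-9):
    """Gaussian pulse centred at t0 (hypothetical shape and width)."""
    return np.exp(-((t - t0) / sigma) ** 2)


def tr_image(fault_positions, reflections, length=100.0, v=2e8, dt=0.5e-9, dx=0.5):
    """Time-reversal imaging function along a lossless, matched line (pure-delay model)."""
    t = np.arange(0.0, 2 * length / v + 200e-9, dt)
    x = np.arange(0.0, length + dx, dx)

    # Step 2: signal reflected by the faults, recorded at the injection point.
    v_rd = sum(r * pulse(t - 2 * xf / v) for xf, r in zip(fault_positions, reflections))

    # Step 3: time-reverse the recording before re-injecting it into the healthy line.
    v_tr = v_rd[::-1]

    image = np.empty_like(x)
    for i, xi in enumerate(x):
        # Step 1: incident pulse at position xi in the healthy line.
        v_in = pulse(t - xi / v)
        # Step 3: re-injected signal at position xi in the healthy line.
        v_bis = np.interp(t - xi / v, t, v_tr, left=0.0, right=0.0)
        # Product evaluated in reversed time and summed over the record.
        image[i] = np.sum(v_bis[::-1] * v_in)
    return x, image


x, image = tr_image([37.0], [0.05])
x_hat = x[np.argmax(image)]  # estimated fault position in metres
```

In this idealized model the two pulses overlap best at the position whose one-way delay equals that of the fault, so `x_hat` should fall close to the simulated fault position. Passing several positions and coefficients to `tr_image` gives one peak per fault, in the spirit of Fig. 3.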

Figure 2: The three steps of the time reversal procedure.

Figure 3: Peaks detection for two simultaneous faults.

This method improves the efficiency of detecting and locating
"soft faults" in a transmission line or in a wire network.

Reference:
[1] L. EL-SAHMARANY, F. AUZANNEAU and P. BONNET, "Nouvelle méthode de diagnostic filaire basée sur le retournement temporel", Actes du
16ème Colloque International sur la Compatibilité Electromagnétique (CEM 2012), Rouen, Avril 2012.


Time Reversal Reflectometry for Cable Ageing Characterization
Research topics: Reflectometry, Time Reversal, Cable ageing.
L. El-Sahmarany, F. Auzanneau and P. Bonnet (Université Blaise Pascal, Clermont-Ferrand)
ABSTRACT: We investigate the effects of ageing (i.e. slow homogeneous degradation) on electrical cable
characteristics using a new method based on time reversal. In the case of global cable ageing,
commonly used methods such as reflectometry provide irrelevant or inaccurate information. Through
theoretical study and numerical simulations, the benefits of this new method, called Time Reversal
Reflectometry (TRR), are presented. TRR is experimentally shown to be effective for the detection and
quantification of cable ageing.

Ageing is described as a slow structural modification which
gradually decreases the ability of an object, information or
organism to perform its functions. This paper addresses
reflectometry's limitations by proposing a new
approach based on time reversal applied to reflectometry's
fundamental principle. It focuses on the detection and
estimation of electrical cable ageing.
This new method is based on the principles of time reversal
and standard reflectometry methods [1]. Instead of using a
predefined signal (Gaussian pulse) like standard
reflectometry, it uses an adapted signal that characterizes
more precisely the modifications of the cable's electrical
parameters (RLCG) due to ageing. The adapted
signal is insensitive to dispersion, which distorts signals
and reduces the ability to detect and estimate cable ageing.
The detection of cable ageing using time reversal
is summarized by the following process:
1) Inject a (symmetrical) pulse signal into a healthy cable.
2) If needed, truncate and shift in time, then normalize the
reflected signal.
3) Apply time reversal, and save the result as the adapted signal.
4) Inject the adapted signal into the aged cable.
5) Process the reflected signal, calculate the Skewness
Coefficient (SC), and estimate the cable ageing. SC is
calculated by quantifying the signal's distortion on the left side (a)
and the right side (b) of its maximum, as shown in Fig.1:
SC = b/a. A value of SC close to 1 means that the cable
under test is healthy. Otherwise, if the reflected signal is
asymmetrical (SC far from 1), the cable
is aged and the value of SC quantifies the ageing.
6) Repeat steps 4 and 5 as needed.
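
The calculation in step 5 can be sketched as follows. The exact definition of a and b is given by Fig.1, which is not reproduced here, so this example assumes that they are the widths of the main peak on each side of its maximum, measured at half of the peak value. The function name, the `level` parameter and the test signals are hypothetical.

```python
import math


def skewness_coefficient(signal, level=0.5):
    """SC = b / a, with a and b the left and right widths of the main peak.

    Assumption: widths are measured (in samples) from the maximum to the point
    where the signal falls below `level` times the maximum.
    """
    k = max(range(len(signal)), key=signal.__getitem__)  # index of the maximum
    threshold = level * signal[k]

    def width(step):
        i = k
        # Walk away from the peak while the next sample is still above the threshold.
        while 0 <= i + step < len(signal) and signal[i + step] >= threshold:
            i += step
        j = i + step
        if not 0 <= j < len(signal):
            raise ValueError("signal does not fall below the threshold")
        # Linear interpolation between sample i (above) and sample j (below).
        fraction = (signal[i] - threshold) / (signal[i] - signal[j])
        return abs(i - k) + fraction

    a = width(-1)  # left side of the maximum
    b = width(+1)  # right side of the maximum
    return b / a


# Hypothetical reflected signals: one symmetrical, one stretched on its right side.
samples = range(-200, 201)
symmetric = [math.exp(-((n / 40.0) ** 2)) for n in samples]
stretched = [math.exp(-((n / (40.0 if n < 0 else 60.0)) ** 2)) for n in samples]

sc_symmetric = skewness_coefficient(symmetric)
sc_stretched = skewness_coefficient(stretched)
```

With this definition a symmetrical pulse gives a coefficient close to 1, and a pulse whose right side is 1.5 times wider than its left side gives a coefficient close to 1.5. A different measurement level or a different definition of a and b would change the numerical values but not the principle.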

Figure 1: Example of calculation of the skewness coefficient SC

In order to investigate this method, simulations were
performed for different values of the per-unit-length capacitance,
using an RLCG frequency model of a cable in MATLAB. The reflected signal from
step 5 was calculated for the healthy cable (capacitance C0) and for three
simulated aged cables (capacitances of
0.2, 0.6 and 1.2 times C0). Table 1 presents
the values of SC and shows the effect of ageing. When the
cable is healthy, SC is equal to 1; when it is aged (0.2 *
C0), SC drops to 0.5127.
Table 1: Values of skewness coefficient SC for different simulated
aged cables

Simulations   1.2*C0   C0   0.6*C0   0.2*C0
SC            1.09     1    0.7943   0.5127

A thermal ageing experiment was performed on a 100 m
long coaxial cable [2]. Table 2 shows the effect of ageing via
the variation of the skewness coefficient. SC increased
with ageing time, reflecting the loss of symmetry of the
reflected signal (as illustrated in Fig.2).
Table 2: Values of skewness coefficient during ageing

Ageing time   new   1 month   2 months   2 months 10 d
SC            1     1.28      1.29       1.31

Figure 2: The reflected signals during ageing

The proposed method provides a simple and more accurate
technique to estimate cable ageing. It can help monitor the
health of the cables and the safety of an entire electrical
system.

References:
[1] L. EL-SAHMARANY, F. AUZANNEAU and P. BONNET, A new method for detection and characterization of electrical cable aging, Progress In
Electromagnetics Research Symposium, Kuala Lumpur, Malaysia, March 2012.
[2] L. EL-SAHMARANY, F. AUZANNEAU and P. BONNET, Novel Reflectometry Method Based on Time Reversal for Cable Aging Characterization,
Portland, OR, USA, 23 - 26 September 2012.


New Advances in Monitoring the Ageing of Electric Cables in Nuclear Power Plants
Research topics: Cable Ageing, Reflectometry, Signal Processing, Nuclear Plant
M. Franchet, N. Ravot, N. Gregis, J. Cohen, O. Picon (Université Paris-Est, ESYCOM)
ABSTRACT: Monitoring the ageing of electrical cables used in nuclear power plants is a crucial issue for the nuclear
industry, whose objective is to extend the lifetime of its plants while ensuring their safety. For cost and
efficiency reasons, reflectometry, which is a non-destructive method, is well adapted to this problem.
Unfortunately, it may not be sensitive enough to small changes in the cable. To overcome this difficulty, this
article proposes to use time-frequency tools (the Wigner Ville transform and a normalized time-frequency
cross-correlation function) in addition to time domain reflectometry. This method has been applied to two RG-59B
coaxial cables (a new one and an old one) commonly used in nuclear power plants.
Studying cable ageing involves being able to detect minor
modifications. This is the same challenge as detecting soft
faults (incipient defects), which translate into reflected
signals of very low amplitude. Indeed, ageing causes only slight
changes in the reflectograms obtained by Time
Domain Reflectometry (TDR), Fig 1. To detect them, TDR
cannot be used alone; another tool is needed. In this article,
we propose to apply a time-frequency transform, called the
Wigner Ville transform (WVt), to TDR results and compute a
normalized time-frequency cross-correlation function (TFC),
Fig 2. This is part of a method called Joint Time-Frequency
Domain Reflectometry (JTFDR), which has shown promising
results for soft faults. However, the WVt is a quadratic
transform, so unwanted cross-terms can affect the results.
To overcome this problem, the Pseudo Wigner Ville transform
(PWVt) can be used instead of the WVt [1]. JTFDR has
already been used to study local ageing of cables used in
nuclear power plants (NPPs), which is comparable to detecting soft faults.
Nevertheless, the ageing process may affect the entire length
of the cable. The efficiency of such a method therefore has to be
tested for global ageing too, and a tool to decide whether a cable has
to be replaced must be defined.
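
As an illustration of the transform mentioned above, the following sketch computes a discrete pseudo Wigner Ville distribution with NumPy. It is a generic textbook-style implementation given under assumptions, not the processing chain used in the study: the lag window (Hann), its length, the FFT size and the test signal are hypothetical choices, and the function names are illustrative.

```python
import numpy as np


def analytic_signal(x):
    """Analytic signal obtained by suppressing negative frequencies with the FFT."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    gain = np.zeros(n)
    gain[0] = 1.0
    gain[1:(n + 1) // 2] = 2.0
    if n % 2 == 0:
        gain[n // 2] = 1.0
    return np.fft.ifft(np.fft.fft(x) * gain)


def pseudo_wigner_ville(x, half_window=31, n_fft=128):
    """Discrete pseudo Wigner Ville distribution of a real signal.

    Returns an array of shape (len(x), n_fft). Row i is time sample i and
    column k corresponds to the normalized frequency k / (2 * n_fft).
    The lag window limits the lag range, which attenuates cross-terms.
    """
    z = analytic_signal(x)
    n = len(z)
    lag_window = np.hanning(2 * half_window + 1)
    w = np.zeros((n, n_fft))
    for i in range(n):
        m_max = min(i, n - 1 - i, half_window)
        m = np.arange(-m_max, m_max + 1)
        kernel = np.zeros(n_fft, dtype=complex)
        # Windowed local autocorrelation z[i+m] * conj(z[i-m]); negative lags wrap around.
        kernel[m % n_fft] = lag_window[half_window + m] * z[i + m] * np.conj(z[i - m])
        # The kernel is Hermitian in m, so its FFT is real.
        w[i] = np.fft.fft(kernel).real
    return w


# Hypothetical test signal: a pure tone at 0.125 cycles per sample.
n = np.arange(256)
tone = np.cos(2 * np.pi * 0.125 * n)
w = pseudo_wigner_ville(tone)
peak_bin = int(np.argmax(w[128]))  # frequency bin of the maximum at mid-record
```

Because the distribution is built from the product z[i+m] * conj(z[i-m]), its frequency axis is scaled by a factor of two: a tone at normalized frequency f0 peaks at bin 2 * f0 * n_fft, which is bin 32 for the test signal above.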

Figure 1: Reflectograms obtained after injecting a Gaussian pulse
of 1ns width at half-height

Thanks to the TFC, the small changes due to the ageing
process have been amplified. Nevertheless, a criterion must
be chosen in order to determine whether a cable is worn. In
practice, the correlation coefficient between two results is
commonly used to assess whether they are alike. This is the tool
chosen in this study to decide whether a cable is worn. As a
consequence, a reference to which TFC results can be
compared is required. For this study, the results obtained
with the cable considered as new are taken as the reference.
Table 1 gives the results obtained, for two kinds of injection,
when the TFC is computed with the WVt and the PWVt. Using
a narrower injected pulse makes it easier to determine whether a
cable has been used.

Figure 2: TFC obtained with the PWVt, when the injected signal is
a Gaussian pulse of 1ns width at half-height

Width of the      Correlation coefficient   Correlation coefficient
injected pulse    obtained with the WVt     obtained with the PWVt
1ns               0.57                      0.65
5ns               0.96                      0.94

Table 1: Correlation coefficients obtained depending on the kind of
injection
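
The decision criterion described above can be sketched as a comparison between the TFC of the cable under test and the TFC of the reference cable. This is a minimal illustration: the study does not state a decision threshold, so the value 0.8 used here is purely hypothetical, as are the function names and the sample curves.

```python
from math import sqrt


def correlation_coefficient(reference, measured):
    """Pearson correlation coefficient between two equally sampled curves."""
    if len(reference) != len(measured):
        raise ValueError("both curves must have the same number of samples")
    n = len(reference)
    mean_r = sum(reference) / n
    mean_m = sum(measured) / n
    covariance = sum((r - mean_r) * (m - mean_m) for r, m in zip(reference, measured))
    energy_r = sum((r - mean_r) ** 2 for r in reference)
    energy_m = sum((m - mean_m) ** 2 for m in measured)
    return covariance / sqrt(energy_r * energy_m)


def is_worn(reference, measured, threshold=0.8):
    """Flag a cable whose TFC differs too much from the reference (hypothetical threshold)."""
    return correlation_coefficient(reference, measured) < threshold


# Hypothetical TFC curves: the reference, a near copy, and a clearly different one.
reference_tfc = [0.0, 0.1, 0.8, 1.0, 0.7, 0.2, 0.0]
similar_tfc = [0.0, 0.12, 0.78, 1.0, 0.72, 0.18, 0.0]
different_tfc = [0.9, 0.4, 0.1, 0.0, 0.2, 0.6, 1.0]

similar_flag = is_worn(reference_tfc, similar_tfc)
different_flag = is_worn(reference_tfc, different_tfc)
```

With such a threshold, coefficients like the 0.57 and 0.65 reported for the 1 ns pulse would be flagged as different from the reference, whereas the 0.96 and 0.94 reported for the 5 ns pulse would not, which is consistent with the remark that a narrower pulse makes the decision easier.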

This method performs well on the studied example. However,
further study is needed to determine whether it can
discriminate between different levels of ageing. This is the subject of
current and future work carried out within the framework of
Advance, an FP7 European program.

References:
[1] Franchet, M.; Ravot, N. & Picon, O., The use of the pseudo Wigner Ville transform for detecting soft defects in electric cables, in
IEEE/ASME International Conference on Advanced Intelligent Mechatronics, Budapest, Hungary, 3-7 July 2011.
[2] Franchet, M.; Ravot, N.; Grégis, N.; Cohen, J. & Picon, O., New Advances in Monitoring the Aging of Electric Cables in Nuclear Power Plants,
Advanced Electromagnetics Symposium, AES 2012, Paris, France, 16-19 April 2012.


On a Useful Tool to Localize Jacks in Wiring Network
Research topics: wired network diagnosis, soft fault detection
M. Franchet, N. Ravot, O. Picon (Université Paris-Est, ESYCOM)
ABSTRACT: To efficiently monitor and maintain wired networks, their topology has to be known. Most of the
time a wiring network is made of several cables linked to each other with connectors, so knowing where these
connectors are is valuable information for recovering the topology. The damaged portions of the network can then be
localized relative to the jacks, which facilitates and accelerates maintenance. New data processing
techniques based on a time-frequency transform are shown to improve the detection of both soft
defects and connectors in wires.

Electrical cables are found in many fields where the
transfer of energy and information is necessary to guarantee
the performance of a system. Sooner or later, a cable
network will show signs of weakness or ageing, leading to the
appearance of defects. These anomalies can cause
malfunctions with serious consequences for the
system or the environment. This is why diagnosis methods
for wired networks have been thoroughly studied in the past
few years.
Reflectometry-based methods have proven to be the best
suited, as they provide detection and localization information
while requiring only one connection to the network. However, Time
Domain Reflectometry (TDR) and Frequency Domain
Reflectometry (FDR) methods are well suited mainly to hard defects
(i.e. defects that prevent any signal from propagating further).
Soft defects, such as localized damage to the insulation
or shielding of a wire, are much more difficult to diagnose.
This kind of defect accounts for 30% to 50% of all detected
wiring faults, and is a precursor of future hard defects.
A new method, called JTFDR (Joint Time Frequency Domain
Reflectometry), combines the advantages of both TDR
and FDR while avoiding their limitations through
innovative signal processing. It is based on the
Wigner Ville transform (WVT) coupled with a normalized Time
Frequency Cross-correlation function (TFC) applied to TDR
measurements, which greatly enhances the signatures of connectors and
soft defects.

Fig. 1 shows that a standard TDR measurement cannot
efficiently detect and locate a connector or a soft defect in a
line, as their peaks have very weak amplitudes [1].
The WVT has previously shown great ability for
time-frequency localization of chirp-like signals. For this reason,
the Pseudo WVT is combined with a normalized Time
Frequency Cross-correlation function (TFC), defined below.

In this formula, Es(t) and Er are normalization factors. The
first normalization term provides local amplification of the
weak signals isolated by the WVT. As a result, the weak
peaks of the connector and the soft defect are enhanced to a
level similar to that of the end-of-line reflection (an open circuit),
making them much easier to detect and localize (Fig. 2) [2].
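
The effect of the local normalization term can be illustrated with a simplified, purely time-domain analogue. The sketch below is not the TFC itself, which operates on Wigner Ville distributions: it assumes a sliding cross-correlation between the measured signal s and a reference pulse r, divided by the square root of the local energy of s (playing the role of Es(t)) times the energy of r (playing the role of Er). The pulse shape, the echo positions and amplitudes, and the function name are hypothetical.

```python
import numpy as np


def locally_normalized_correlation(s, r, eps=1e-12):
    """Sliding correlation of s with r, normalized by the local energy of s and the energy of r."""
    s = np.asarray(s, dtype=float)
    r = np.asarray(r, dtype=float)
    # Raw sliding correlation: value k is sum_j s[k + j] * r[j].
    raw = np.correlate(s, r, mode="valid")
    # Local energy of s over the same sliding window.
    local_energy = np.convolve(s ** 2, np.ones(len(r)), mode="valid")
    reference_energy = np.sum(r ** 2)
    # eps avoids a division by zero where the signal is empty.
    return raw / np.sqrt(local_energy * reference_energy + eps)


# Hypothetical reference pulse and measurement with a weak and a strong echo.
lags = np.arange(-32, 33)
r = np.exp(-((lags / 8.0) ** 2))
s = np.zeros(1000)
s[200:265] += 0.02 * r   # weak echo, e.g. a connector or a soft defect
s[700:765] += 1.00 * r   # strong echo, e.g. the open end of the line

raw = np.correlate(s, r, mode="valid")
c = locally_normalized_correlation(s, r)
```

In the raw correlation the weak echo is fifty times smaller than the strong one, whereas after local normalization both echoes reach a value close to 1, because the result depends on the local shape of the signal and not on its amplitude. The small `eps` term only avoids division by zero in empty regions; with noisy measurements it would need to be chosen with care.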

Figure 2: Enhancement of the jacks and the soft defects
signatures using TFC (from [2])

Knowing where these jacks are is therefore valuable information for
recovering the topology. Once this information is known,
the damaged portions of the network can be localized
relative to the connectors; this greatly facilitates and
accelerates maintenance.
Besides, in a wiring network, the jacks themselves can be
sources of damage, so it is important to be able to monitor
their condition. Knowing where they are is the first step towards
studying their health and anticipating their ageing.
Figure 1: Standard TDR measurement on a line with a jack and a
soft defect
References:
[1] Franchet, M.; Ravot, N. & Picon, O., The Use of the Pseudo Wigner Ville Transform for Detecting Soft Defects in Electric Cables, in IEEE/ASME International
Conference on Advanced Intelligent Mechatronics (AIM), Budapest, 2011
[2] Franchet, M.; Ravot, N. & Picon, O., On a Useful Tool to Localize Jacks in Wiring Network, in Proceedings of PIERS 2012 in Kuala
Lumpur, 2012


PhD Degrees Awarded
Maud Franchet
Céline Azar
Sébastien Courroux
Olivier Bichler
Fabien Gavant
Mykhailo Zarudniev


PhD degrees
awarded in 2012
Maud Franchet
University: Université Paris-Est
Reflectometry applied to soft fault detection in bundles of wires
The research presented in this thesis addresses the detection of soft faults (incipient
faults) in specific wiring structures: multiconductor transmission lines (MTL), also known as bundles
of wires. Reflectometry methods, often used for the diagnosis of wiring networks, are not yet
efficient enough to detect such defects. Besides, they have been designed for single lines only,
where electromagnetic coupling between conductors (crosstalk) is mostly irrelevant. However, this
phenomenon can provide more information about the state of the cable. Using this information
could make soft faults easier to detect. In this work we propose a new reflectometry
method, which takes advantage of crosstalk signals in order to detect incipient faults. Such a tool
also has the advantage of being well adapted to bundles of cables.
Following a preliminary study of the impact of soft faults on the characteristic parameters of
multiconductor transmission lines and on crosstalk signals, a method called Cluster Time
Frequency Domain Reflectometry has been proposed. It is a three-step process. First, time-domain
reflectometry measurements are made at the beginning of the line under test. All the available
signals, including crosstalk, are recorded. A time-frequency process is then applied to them in
order to amplify the presence of defects. Finally, a clustering algorithm, specifically
developed for wiring diagnosis, is used to benefit from all the available information.

Céline Azar
University: Université de Bretagne Sud
On the design of a distributed adaptive manycore architecture for embedded
systems
Chip design challenges have recently emerged at many levels: the increasing number of cores at the
hardware level, the complexity of parallel programming models at the software level, and the
dynamic requirements of current applications. Facing this evolution, this PhD thesis aims at
designing a distributed adaptive manycore architecture, named CEDAR (Configurable Embedded
Distributed ARchitecture), whose main assets are scalability, flexibility and simplicity. The CEDAR
platform is an array of homogeneous, small-footprint RISC processors, each connected to its four
nearest neighbors. There is no global control; control is distributed among the cores. Two versions
of the platform are designed, along with a user-familiar programming model. A software version,
CEDAR-S, is the basic implementation where adjacent cores are connected to each other via shared
buffers. A co-processor called DMC (Direct Management of Communications) is added in the CEDAR-H
version to optimize the routing protocol. The DMCs are interconnected in a mesh fashion.
Two novel concepts are proposed to enhance the adaptiveness of CEDAR. First, a distributed
dynamic routing strategy, based on a bio-inspired algorithm, handles routing in a non-supervised
fashion and is independent of the physical placement of communicating tasks. The second concept
is dynamic distributed task migration in response to several system and application
requirements. Results show that CEDAR achieves high performance with its optimized routing
strategy, compared to state-of-the-art networks. The migration cost is evaluated and adequate
protocols are presented. CEDAR is shown to be a promising design concept for future manycores.

Sébastien Courroux
University: Université de Bourgogne
Wavelet-based algorithms for embedded image processing and integration into a
smart vision system.
Data at the output of a CMOS image sensor are processed through a set of operations, either for
rendering or for image analysis. Increasing sensor resolution and shrinking pixel
size make these operations even more complex and require a large storage
capacity. It is increasingly difficult to reconcile these different constraints in a low-cost embedded
sensor, consisting of an analog part and a digital circuit having a single processor, low storage
capacity and low operating frequency. New methods are therefore investigated. One of them is to
use an alternative data representation. The wavelet representation decomposes an image into
frequency bands, orientations and scales, simplifying the subsequent operations of the processing chain.
In a first step, the thesis studies the benefits of the wavelet representation for image
processing in an embedded real-time context. To this end, a state of the art of algorithmic methods is
established, leading to the definition of two algorithmic chains: reconstruction of CFA images and facial
recognition. The quality of the processing is demonstrated for both chains.
In a second step, a wavelet-oriented vision system is proposed, consisting of an embedded processor
and a module dedicated to the wavelet transform. The wavelet transform module adopts a so-called
'semi-folded' structure and efficiently performs the wavelet decomposition at several scales using
only a few lines of internal memory. This vision system is used to speed up processing and increase
the application flexibility and effectiveness of low-cost sensors.

Olivier Bichler
University: Université Paris-Sud
Adaptive Computing Architectures Based on Nano-fabricated Components
In this thesis, we study the potential applications of emerging memory nano-devices in computing
architecture. We show that neuro-inspired architectural paradigms could provide the efficiency and
adaptability required for complex image/audio processing and classification applications with a much
lower cost in terms of power consumption and silicon area than current solutions. This work
focuses on memristive nano-devices, such as Phase-Change Memory (PCM), Conductive-Bridging
RAM (CBRAM) and resistive RAM (RRAM). We show that these devices are particularly suitable for the
implementation of natural unsupervised learning algorithms like Spike-Timing-Dependent Plasticity
(STDP), requiring very little control circuitry. The integration of memristive devices in crossbar arrays
could provide the huge density required by this type of architecture (several thousand synapses per
neuron), which is impossible to match with a CMOS-only implementation.
In this work, we propose synaptic models for memristive devices and simulation methodologies for
architectural design exploiting them. Novel neuro-inspired architectures are introduced and
simulated for natural data processing. They exploit the synaptic characteristics of memristive
nano-devices, along with the latest advances in neuroscience.
Finally, we propose hardware implementations for several device types. We assess their scalability
and power efficiency potential, and their robustness to variability and faults, which are unavoidable
at the nanometric scale of these devices.

Fabien Gavant
University: Université de Grenoble
Architectures for Image Sensors Stabilization based on Visual Perception and on the
Physiology of Hand Tremor; a Contribution
With the integration of cameras in mobile devices, their widespread adoption, and the reduction of
imager size, optical system dimensions and pixel size, pictures are becoming more
and more subject to motion blur due to hand tremor. In addition, the requirements in terms of
image quality are becoming higher and higher. Hence, in order to reduce this blur, several image
stabilization systems have been developed. Nevertheless, they cannot guarantee the sharpness
of the resulting images and, in some cases, they are difficult to integrate.
In order to overcome these limitations, the research work presented in this thesis first proposes
a physiological tremor model that aims at simulating realistic camera shake and then
presents a study on the visual perception of blur. This study enables the development of a quality
metric. Finally, stabilization algorithms and architectures exploiting these new tools are presented.
These new architectures reduce the number of external components and ensure sharp stabilized
images.

Mykhailo Zarudniev
University: Université de Lyon, Ecole Centrale de Lyon
Frequency Synthesis using Spin Torque Oscillator Coupling
Current trends in telecommunications are leading to multi-standard systems. The conventional
solution consists in using one local oscillator for each standard. The spin torque oscillator (STO) is a
new device that appears to be a potential candidate to replace the LC-tank oscillator, due to its
wide frequency tunability and its small volume. However, it exhibits poor power and phase noise
performance.
In this work, we propose to meet the technical specifications of radiofrequency applications by
coupling a large number of spin torque oscillators. An original oscillator network model that
describes qualitative properties of the oscillator synchronization is introduced. Next, the control law
architecture for an oscillator set is established in order to achieve the technical specifications.
Finally, we propose two original frequency-domain design methods for solving our
frequency synthesis problem. The first design method explicitly considers a performance
criterion corresponding to a desired frequency constraint, and yields a suitable
sub-system interconnection matrix that fits the frequency specification constraint. The second
design method finds an interconnection matrix while simultaneously taking into account
several frequency specification constraints. The interconnection matrix obtained with the proposed
method solves the problem of frequency synthesis by coupling of spin torque oscillators.


Greetings

Editorial Committee
Marc Belleville
Christian Gamrat
Fabrice Auzanneau
Ernesto Perea
Hélène Vatouyas
Jean-Baptiste David

Readers
Thierry Collette
Eric Mercier

Graphic Art
Valérie Lassablière