VLSI Abstract

Hades InfoTech Pvt. Ltd
IEEE VLSI ABSTRACT 2013-14
HARDWARE IMPLEMENTATION
HVL01
A Current Consumption Measurement Approach for FPGA-Based
Embedded Systems
Abstract:
This paper presents an approach for measuring the dynamic current consumption of
field programmable gate array (FPGA) based embedded systems. The measurement
method is based on converting current to a time interval. The time interval is then
measured using a timer implemented in the FPGA. A measurement uncertainty budget
analysis is performed. It reveals the key components and the parameters that most affect
current measurement uncertainty. A system architecture for incorporating the measurement
setup into the standard FPGA development flow is suggested.
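The abstract does not state how current is converted to a time interval. As a rough illustration only, the sketch below assumes the measured current charges a capacitor through a fixed voltage swing, so that I = C * dV / t, and that the FPGA timer counts clock cycles during t. The function names, the capacitor model, and all numeric values are assumptions, not taken from the paper.

```python
def interval_from_count(count: int, f_clk_hz: float) -> float:
    """Convert a timer count into seconds for a timer clocked at f_clk_hz."""
    return count / f_clk_hz


def current_from_interval(t_s: float, c_farads: float, delta_v: float) -> float:
    """Assumed model: a constant current I charges C by delta_v in t_s seconds."""
    return c_farads * delta_v / t_s


def relative_quantization(count: int) -> float:
    """A timer resolves time to one clock period, so one count is 1/count of the reading."""
    return 1.0 / count


# Hypothetical numbers: 50 MHz timer clock, 1 uF capacitor, 1 V swing.
t = interval_from_count(50_000, 50e6)
i_amps = current_from_interval(t, 1e-6, 1.0)
```

Under this assumed model a longer interval means a smaller current, and the timer quantization term shrinks as the count grows, which is one of the contributions an uncertainty budget would have to include.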
HVL02
Design of a Real-Time FPGA-Based Data Acquisition Architecture for
the LabPET
Abstract:
The LabPET II detector block was designed to achieve submillimeter spatial
resolution in small animal PET imaging. Each detection block consists of two arrays of
4 × 8 avalanche photodiodes (APD) individually coupled to an 8 × 8
scintillator array, to form 64 independent detectors with parallel readout channels. This
new detection block entails an eightfold increase in pixel density compared to the
LabPET I. A 64-channel mixed-signal application-specific integrated circuit (ASIC) was
designed to extract relevant PET data in real time from the LabPET II detection blocks.
In order to interface the ASICs forming the PET camera with the storage units, a real-time
FPGA-based digital data acquisition (DAQ) system was designed. The DAQ system
allows event harvesting, processing and transmission to a host computer for data storage
G2, Metha Complex, Little Mount, Saidapet, Chennai-15
Ph: 044-22200258, Mobile: 9840989556, 9952050233
Mail: projects@hades.in, contact@hades.in, www.hades.in
as well as system programming and calibration. Real-time event processing embedded in
the DAQ includes time trigger, energy computation using a time-over-threshold (TOT)
conversion scheme, timing corrections, and event sorting trees. In the standard DAQ
mode, a real-time coincidence engine analyzes events and only keeps relevant
information to minimize data throughput and post-acquisition data processing. The
architecture consists of three FPGA-based electronic layers wired through gigabit links: a
Front-End layer extracts time and energy along with the pixel address, a custom Hub
layer chronologically sorts incoming events, and a Coincidence engine matches
coincident events and computes an estimate of the random events rate. Every FPGA in
the different layers is accessible through an Ethernet link. The real-time digital
architecture sustains the required throughput of ~111 million events/s for a
~37 000-channel scanner configuration.
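The coincidence step described above can be pictured with a small software sketch. It is not the LabPET II engine: the event layout (timestamp, channel), the window value, and the rule of pairing only adjacent time-sorted events are simplifying assumptions made for illustration.

```python
from typing import List, Tuple

Event = Tuple[int, int]  # (timestamp in clock ticks, channel id), hypothetical layout


def find_coincidences(events: List[Event], window: int) -> List[Tuple[Event, Event]]:
    """Pair adjacent time-sorted events from different channels within `window` ticks."""
    ordered = sorted(events)  # stands in for the chronological sorting done by the Hub layer
    pairs = []
    i = 0
    while i + 1 < len(ordered):
        first, second = ordered[i], ordered[i + 1]
        if second[0] - first[0] <= window and first[1] != second[1]:
            pairs.append((first, second))
            i += 2  # both events are consumed by this pair
        else:
            i += 1  # the earlier event has no partner, drop it
    return pairs


events = [(400, 3), (100, 1), (250, 2), (103, 7), (402, 9)]
pairs = find_coincidences(events, window=5)
```

Only the matched pairs are kept, which mirrors the stated goal of reducing data throughput before storage.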
HVL03
Project-Based Learning in Embedded Systems Education Using an
FPGA Platform
Abstract:
With embedded systems becoming ubiquitous, there is a growing need to teach
and train engineers to be well-versed in their design and development. The
multidisciplinary nature of such systems makes it challenging to give students exposure
to and experience in all their facets. This paper proposes a generic architecture,
containing multiple processors, that allows easy integration of custom and/or predefined
peripherals. The architecture allows students to explore both the hardware and software
issues associated with real-time and embedded systems. Furthermore, the architecture can
be extended to train students in advanced concepts in embedded multiprocessor systems.
This generic architecture has been used for two courses at the National University of
Singapore: one on real-time embedded systems and the other emphasizing the hardware
aspects of embedded systems. The project in the real-time embedded systems course has
students develop a five-a-side soccer system on multiple field-programmable gate array
(FPGA) boards using embedded processors. In the embedded hardware design course
project, students use an embedded processor-based system to perform decryption of a
block encrypted image, accelerated through a custom co-processor. The use of displays
gives students a visual/interactive experience and a sense of accomplishment, while
reinforcing the theoretical concepts. Both qualitative and quantitative assessment results
are presented, showing how students perceived these projects and met the learning
objectives.
HVL04
A Reliable and Cost-Effective Sand Monitoring System on the Field
Programmable Gate Array
Abstract:
Sand monitoring helps avoid equipment erosion and production
failure in the oil industry. This paper presents the design and implementation of a reliable
and cost-effective sand monitoring system for measuring sand production in gas and oil
flows in real time. The designed monitoring system involves two acoustic emission (AE)
sensors and one Doppler sensor for gauging the sand impaction and the velocity of sand
particles in a nonintrusive manner, respectively. It is implemented on a field
programmable gate array (FPGA) as a prototype for real-time data acquisition and
processing, and evaluated using a testbed pump skid available in our laboratory. Relying
on a low-cost FPGA board to integrate all acquisition and processing functionality, our
monitoring system can measure the sand production amount on-the-fly reliably with high
accuracy, according to our experimental evaluation. It is likely to achieve better accuracy
than one without the Doppler sensor. The proposed monitoring system is affordable for
wide deployment, given its high accuracy, good resilience to surrounding noise, and low
cost (when compared with commercially available systems whose price tags can be some
tenfold higher).

HVL05
Simple Closed Loop DC Speed Control System On FPGA Platform for
VHDL Beginner
Abstract:
Students develop an interest in understanding the theory around a project that fascinates
them. The motivation of this paper is to devise a model for teaching the VHDL
language to embedded systems students. The teaching model is based on a practical example of a
simple DC motor control application conducted in the lab session. A series of lab
modules, leading gradually from basic digital logic to closed-loop DC motor speed
control, is presented as lab experiments to maintain the interest of first-time learners of
VHDL. The theory of VHDL language constructs is covered separately in
classroom lectures, but in conjunction with the lab practice sessions.
HVL06
FPGA Based Real Time Systems For Position Tracking
Abstract:
Position tracking systems supported with RISS and gyroscopes are found to be
better solutions in places where GPS is unavailable or cannot work. Systems that rely
on GPS alone for position tracking generally face many
problems in areas where line of sight is hard to achieve, i.e. GPS-denied environments
such as dense terrestrial areas, subways, tunnels and hidden places. This system provides
continuous and highly reliable position tracking by synchronising the real-time inputs
obtained from the sensors with the actual GPS values. The core processor of the system is
built on an FPGA, which is used in the system kernel. The key factors for using an FPGA in
the system are its customisable core and its flexibility to interface with the sensors. The
core employs the Hybrid Kalman filter for estimating the displacement and position. In
this system we integrate the 3D-RISS with GPS to achieve reliable and uninterrupted
position tracking. The processor estimates the position of the object
based on four inputs taken from the RISS and the odometer: velocity,
acceleration, orientation and position. The system integrates the offline inertial data
(i.e. data collected while the GPS is unavailable) with the actual GPS data. When the GPS
signal is lost, the system starts to compute the position and velocity using the initial
data provided by the GPS at the instant it was lost. This kind of position tracking system
is used in various kinds of moving objects such as aircraft, guided missiles, land rovers,
and marine navigation.
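The idea of continuing from the last GPS fix with inertial data can be sketched as plain 2D dead reckoning. This is an illustrative simplification, not the paper's algorithm: the function name, the flat 2D model, and the fixed time step are assumptions.

```python
import math


def dead_reckon(x, y, heading_rad, speed, yaw_rate, dt):
    """Advance a 2D position by one time step from odometer speed and gyro yaw rate."""
    heading_rad += yaw_rate * dt               # integrate the gyroscope reading
    x += speed * math.cos(heading_rad) * dt    # project the odometer speed onto x
    y += speed * math.sin(heading_rad) * dt    # and onto y
    return x, y, heading_rad


# Start from the last GPS fix (taken here as the origin) and drive straight for 5 s.
x, y, heading = 0.0, 0.0, 0.0
for _ in range(5):
    x, y, heading = dead_reckon(x, y, heading, speed=10.0, yaw_rate=0.0, dt=1.0)
```

Errors in speed and yaw rate accumulate with every step, which is why such a scheme is normally corrected by a filter as soon as GPS returns.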
HVL07
FPGA based Real Time 3D-RISS / GPS Integrated System for Position
Tracking
Abstract:
Navigation algorithms integrating measurements from multi-sensor systems
overcome the problems that arise from using GPS navigation systems in standalone
mode. Algorithms that integrate the data from a 2D low-cost reduced inertial sensor
system (RISS), consisting of a gyroscope and an odometer or wheel encoders, with
a GPS receiver via a Kalman filter have proved capable of providing a more consistent and
reliable navigation solution than standalone GPS receivers. They have also been
shown to be beneficial, especially in GPS-denied environments such as urban canyons
and tunnels. The main objective of this paper is to narrow the idea-to-implementation gap
that follows the algorithm development by realizing a low-cost real-time embedded
navigation system capable of computing the data-fused positioning solution. The role of
the developed system is to synchronize the measurements from the three sensors, relative
to the pulse per second signal generated from the GPS, after which the navigation
algorithm is applied to the synchronized measurements to compute the navigation
solution in real-time. Employing a customizable soft-core processor on an FPGA in the
kernel of the navigation system provided the flexibility for communicating with the
various sensors and the computation capability required by the Kalman filter integration
algorithm.
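To make the Kalman filter integration concrete, here is a minimal one-dimensional sketch. It is far simpler than the RISS/GPS filter in the paper: the scalar state, the noise values `q` and `r`, and the convention that `z` is `None` during a GPS outage are all assumptions for illustration.

```python
def kalman_step(x, p, u, z, q, r):
    """One predict/update cycle of a scalar Kalman filter.

    x, p : previous position estimate and its variance
    u    : displacement predicted from the inertial sensors (odometer, gyroscope)
    z    : GPS position measurement, or None while GPS is unavailable
    q, r : process and measurement noise variances (hypothetical values)
    """
    # Predict: move the estimate by the inertial displacement and grow the uncertainty.
    x_pred = x + u
    p_pred = p + q
    if z is None:
        return x_pred, p_pred  # no GPS: keep the prediction only
    # Update: blend prediction and GPS according to their variances.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


x, p = kalman_step(0.0, 1.0, u=1.0, z=1.2, q=0.0, r=1.0)
```

With equal prediction and measurement variances the gain is 0.5, so the estimate lands halfway between the predicted position and the GPS reading.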

HVL08
FPGA based Braille to text & speech for blind persons
Abstract:
Blind people are an integral part of society. However, their disability has
given them less access to computers, the Internet, and high-quality educational
software than people with clear vision. Consequently, they have not been able to
improve their own knowledge or to have significant influence and impact on the
economic, commercial, and educational ventures in society. One way to narrow this
widening gap and reverse this trend is to develop a system, within their
economic reach, that will empower them to communicate freely and widely using
the Internet or any other information infrastructure. Over time, the Braille system has
been used by the visually impaired for communication and contact with the outside
world. This paper presents the implementation of a Braille to Text/Speech Converter on an
FPGA Spartan3 kit. Braille input is converted into ordinary English text.
The input is given through a Braille keypad, which consists of different
combinations of cells. This input goes to the FPGA Spartan3 kit. According to the
combination given, the FPGA converts the input into the corresponding English text through
decoding logic written in VHDL. After decoding, the corresponding letter is
converted to speech through an algorithm. It is also displayed on an LCD interfaced
to the Spartan3 kit.
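The decoding logic amounts to a lookup from a six-dot pattern to a letter. The sketch below shows that idea in software for the letters a to j only, using the standard Braille dot numbering (dots 1 to 3 in the left column, 4 to 6 in the right). The bit layout (dot n stored in bit n-1) and the function names are assumptions, not the paper's VHDL design.

```python
# Standard Braille dot patterns for the first ten letters.
BRAILLE_DOTS = {
    "a": (1,), "b": (1, 2), "c": (1, 4), "d": (1, 4, 5), "e": (1, 5),
    "f": (1, 2, 4), "g": (1, 2, 4, 5), "h": (1, 2, 5), "i": (2, 4), "j": (2, 4, 5),
}


def dots_to_mask(dots):
    """Pack raised dots into a 6-bit mask, dot n in bit n-1 (assumed keypad encoding)."""
    mask = 0
    for dot in dots:
        mask |= 1 << (dot - 1)
    return mask


# Decoder table: 6-bit keypad value -> letter, like a small ROM or case statement.
DECODE = {dots_to_mask(dots): letter for letter, dots in BRAILLE_DOTS.items()}


def decode_cell(mask):
    """Return the letter for one Braille cell, or '?' if the pattern is not in the table."""
    return DECODE.get(mask, "?")


def decode_text(masks):
    return "".join(decode_cell(m) for m in masks)
```

For example, `decode_text([dots_to_mask((1, 2, 5)), dots_to_mask((2, 4))])` decodes the two cells for "h" and "i".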
HVL09
Implementation of FPGA based PID Controller for DC Motor Speed
Control System
Abstract:
This paper presents the implementation of a software module, written in VHDL, for a Xilinx
FPGA (XC3S400) based PID controller for a DC motor speed control system.
The tools used for building and testing the software modules are Xilinx ISE 9.2i and
ModelSim XE III 6.3c. Before the design is verified on the FPGA, the complete design is
simulated using the ModelSim simulation tool. A test bench is written in which the set speed
of the motor can be changed. It is observed that the motor speed gradually changes to the
set speed and locks to it.
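The behaviour reported by the test bench (speed moving gradually to the set point and locking there) can be reproduced with a small discrete-time simulation. This is only a sketch: the first-order motor model, its time constant, and the controller gains are hypothetical and are not taken from the paper.

```python
def simulate_pid(set_speed, kp, ki, kd, steps=2000, dt=0.001, tau=0.05, gain=1.0):
    """Run a discrete PID loop around an assumed first-order motor model.

    Motor model (assumption): tau * d(speed)/dt = gain * u - speed
    """
    speed = 0.0
    integral = 0.0
    prev_error = set_speed - speed
    for _ in range(steps):
        error = set_speed - speed
        integral += error * dt
        derivative = (error - prev_error) / dt
        u = kp * error + ki * integral + kd * derivative  # PID control law
        prev_error = error
        speed += dt * (gain * u - speed) / tau            # forward-Euler motor update
    return speed


# After 2 s of simulated time the speed has settled at the set point.
final_speed = simulate_pid(set_speed=100.0, kp=2.0, ki=20.0, kd=0.0)
```

The integral term is what removes the steady-state error; with `ki = 0` this model would settle below the set speed.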
HVL010
Implementation of Hamming code using VLSI
Abstract:
This paper explains the implementation of a Hamming code using VLSI.
Communication has many applications today, and in each of them
the data is encoded at the transmitter, transferred over a communication channel,
and decoded at the receiver. During transmission the data might
get corrupted by noise on the channel, so the receiver needs
some function that can detect errors in the received data. The Hamming code is one
such forward error correcting code, and it has many applications. In this paper the
algorithm for the Hamming code is discussed and then implemented in Verilog
to obtain the results. The Hamming code is an improvement over the parity check method. Here a
code is implemented in Verilog in which 4 bits of information data are transmitted with 3
redundancy bits. To find the values of these redundancy bits, Verilog code is written
and simulated in Xilinx 9.1 software. The simulation results and test
bench waveforms are also shown.
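A 4-data-bit, 3-redundancy-bit scheme of this kind is the classic Hamming(7,4) code. The sketch below shows one common bit ordering (parity bits at positions 1, 2 and 4); the paper's Verilog may order the bits differently, so the layout here is an assumption.

```python
def hamming74_encode(data):
    """Encode 4 data bits [d1, d2, d3, d4] into 7 bits: p1 p2 d1 p3 d2 d3 d4."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4  # covers positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # covers positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4  # covers positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]


def hamming74_decode(code):
    """Correct up to one flipped bit. Returns (data bits, error position or 0)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s3  # syndrome gives the 1-based position of the error
    if position:
        c[position - 1] ^= 1         # flip the faulty bit back
    return [c[2], c[4], c[5], c[6]], position


codeword = hamming74_encode([1, 0, 1, 1])
corrupted = codeword[:]
corrupted[4] ^= 1                    # simulate channel noise on position 5
data, position = hamming74_decode(corrupted)
```

A plain parity bit would only reveal that an error occurred; here the three-bit syndrome also says where, which is what allows correction.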
HVL011
FPGA Based Critical Patient Health Monitoring Using Fuzzy Neural
Network
Abstract:
We have designed an FPGA-based system and trained a fuzzy neural network for
early diagnosis of a patient. The system employs a fuzzy interface cascaded with a feed-forward
neural network in order to obtain an optimum decision regarding the future
pathophysiological state of a patient. The neurons that are considered in the
proposed network are devoid of self-connections, in contrast to the commonly used
self-connected neurons. The system has been trained and tested with renal data of patients
taken at 10-day intervals. Applying the methodology, the
critical renal condition of a patient has been predicted accurately 30 days ahead of
actually attaining the critical condition. The system has also been tested for
pathophysiological state prediction of patients at multiple time steps ahead, and the prediction
at the next instant of time proves to be the most accurate. The fuzzy interface
discussed here performs fuzzification of patient data. Data from the patient, such as
height or weight, cannot always be trusted, as they are subject to the quality and
accuracy of the measuring units and the skill of the technician. Moreover, it would be
highly uncertain to make an accurate decision about the future
physiological state of the patient based on a single measurement. So the patient data
have been fuzzified with the objective of transforming periodic measures into likelihoods
that the body mass index, blood glucose, urea, creatinine, and systolic and diastolic
blood pressure of the patient are high, low or moderate.
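Fuzzification of a single measurement can be sketched with three overlapping membership functions. The shapes (two shoulders and a triangle) and the breakpoints used below are hypothetical illustration values, not clinical thresholds and not the membership functions of the paper.

```python
def fuzzify(value, low_edge, middle, high_edge):
    """Map a measurement to memberships in 'low', 'moderate' and 'high'.

    Assumes low_edge < middle < high_edge. Below low_edge the value is fully
    'low', above high_edge fully 'high', and at middle fully 'moderate'.
    """
    if value <= low_edge:
        low = 1.0
    elif value < middle:
        low = (middle - value) / (middle - low_edge)
    else:
        low = 0.0

    if value >= high_edge:
        high = 1.0
    elif value > middle:
        high = (value - middle) / (high_edge - middle)
    else:
        high = 0.0

    # The three memberships are built to sum to 1, so 'moderate' is the remainder.
    moderate = 1.0 - low - high
    return {"low": low, "moderate": moderate, "high": high}


# Hypothetical breakpoints for some measurement, for illustration only.
memberships = fuzzify(120.0, low_edge=70.0, middle=100.0, high_edge=140.0)
```

A reading halfway between the middle and the high edge comes out as half "moderate" and half "high", so a small measurement error shifts the likelihoods slightly instead of flipping a hard label.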
HVL012
VIDEO acquisition through I2C using VHDL
Abstract:
With an I2C implementation of a video acquisition system, devices can communicate
with each other more quickly than with other communication protocols. The main
purpose of this technique is not only to let the devices communicate but also to keep track of
every operation that can be performed with the help of this protocol. I2C uses just
two lines for data communication, making it lightweight, economical, and ubiquitous.
The design is used in the acquisition system to make the overall system more efficient and
accurate; with the use of I2C, the data transmission rate is also increased. An OV7620 single-chip
CMOS VGA colour digital camera is used in the design, which is implemented using VHDL on an FPGA.
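As a rough picture of what travels over the two I2C lines, the sketch below builds the byte sequence of a single register write: a 7-bit device address followed by the write bit, then a register index and a value, each sent most significant bit first. The device address and register numbers are placeholders, not values from the OV7620 datasheet.

```python
def i2c_write_frame(dev_addr7, reg, value):
    """Return the three bytes of a register write: address+W, register, value."""
    if not 0 <= dev_addr7 < 0x80:
        raise ValueError("I2C device addresses are 7 bits wide")
    address_byte = (dev_addr7 << 1) | 0  # least significant bit 0 marks a write
    return [address_byte, reg & 0xFF, value & 0xFF]


def byte_to_bits(byte):
    """Bits as they appear on the data line, most significant bit first."""
    return [(byte >> i) & 1 for i in range(7, -1, -1)]


# Placeholder address and register values, for illustration only.
frame = i2c_write_frame(0x21, 0x12, 0x80)
first_byte_bits = byte_to_bits(frame[0])
```

On the bus, each of these bytes is followed by an acknowledge bit driven by the receiver, and the whole transfer is bracketed by start and stop conditions.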

HVL013
FPGA and GSM Implementation of Advanced Home
Security System
Abstract:
Home and industrial security today needs to make use of the latest technological
components. This paper presents the design and implementation of a remote
sensing, control and home security system based on GSM (Global System for
Mobile). The system offers a complete, low-cost, powerful and user-friendly way of providing
24-hour real-time monitoring and remote control for home and industrial security. The
system works as a remote sensor for the electrical appliances at home, checking whether
they are on or off; at the same time, the user can control the electrical appliances at home by
sending an SMS (Short Messaging Service) message to the system, for example turning on
the AC before returning home. In case of a fire or security event, the chip receives signals from
the different sensors in the monitored place and acts according to the received signal by
sending an SMS message to the user's mobile phone. It thus provides automatic and
immediate reporting to the user in case of an emergency, as well as
immediate and automatic reporting to the fire brigade or police station, according to the
activated sensor, to decrease the time required for taking action. The design has been
described using VHDL (VHSIC Hardware Description Language) and implemented in
hardware using an FPGA (Field Programmable Gate Array).
HVL014
FPGA Implementation of Picoblaze based Embedded System for
Monitoring Applications
Abstract:
PicoBlaze is an 8-bit soft core microprocessor developed by Xilinx that can be
synthesized in some FPGA families. This paper presents a set of peripherals that have
been developed to interface with PicoBlaze: VGA control, serial communication, PS/2
keyboard port and LCD control. To demonstrate its capabilities, the system has been
implemented on an FPGA board and some typical control and monitoring systems have
been developed. The design approach of the peripherals and details of the integration of
the systems are explained.
HVL015
Design and Development of FPGA Based Data Acquisition System for
Process Automation
Abstract:
This paper presents a novel approach to the design of a data acquisition system for
process applications. The core of the proposed system is a Field Programmable Gate
Array (FPGA), which is configured and programmed to acquire a maximum of 16 MB of
real-time data. For real-time validation of the designed system, a process plant with
three parameters, i.e. pressure, temperature and level, is considered. Real-time data from
the process is acquired using suitable temperature, pressure and level sensors. Signal
conditioners are designed for each sensor and are tested in real time. The designed FPGA-based
data acquisition system, along with the corresponding signal conditioners, is validated
in real time by running the process and comparing the acquired data with the corresponding
references. The data acquired in real time compares well with the references.
HVL016
Intelligent Car Parking Management System On FPGA
Abstract:
Car parking has become a major issue, especially in big cities. There are two
main reasons: first, the growth in population; second, security. Moreover, car
theft has become an evil art haunting drivers. In this paper, we provide an interface and a
software/hardware module for an Intelligent Car Park Management System (ICPMS). The
ICPMS will provide extensive management for vehicles, including parking facilities
and security. The ICPMS is validated using a test case scenario, and extensive
experimentation demonstrates the feasibility of the approach.
HVL017
FPGA Based Implementation of Flat Panel Display Controller with DVI
Interface
Abstract:
Digital displays are a fast-growing market comprising LCD, plasma, and rear
projection television technologies as well as smaller displays for mobile handsets and
automobiles, in addition to many other applications. Digital image processing enhances
the overall viewing aesthetics of the displayed image and can differentiate a product.
This paper deals with the design and development of a flat panel display controller
using the Xilinx Spartan-3AN development board, intended for display panel
applications to assist in developing products for this market. The display solution FPGA
consists of a DVI input interface, colour temperature correction, precise gamma
correction, an image dithering engine, and a Low-Voltage Differential Signaling (LVDS)
transmit (TX) output interface.
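Gamma correction in a display pipeline is commonly realised as a lookup table indexed by the incoming pixel value. The sketch below builds such a table in software; the exponent direction (1/gamma), the 8-bit width, and the gamma value 2.2 are assumptions and may differ from the paper's design.

```python
def gamma_lut(gamma, bits=8):
    """Build a lookup table mapping each input code to its gamma-corrected code."""
    max_code = (1 << bits) - 1
    return [
        round(max_code * (code / max_code) ** (1.0 / gamma))
        for code in range(max_code + 1)
    ]


def apply_gamma(pixels, lut):
    """Correct a sequence of pixel codes with one table read per pixel."""
    return [lut[p] for p in pixels]


lut = gamma_lut(2.2)
corrected = apply_gamma([0, 64, 128, 255], lut)
```

A table like this maps naturally onto block RAM in an FPGA, since each pixel needs only a single read with no arithmetic at run time.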
HVL018
Design Of FPGA Based PWM Solar Power Inverter For Livelihood
Generation In Rural
Abstract:
The paper presents the development of a utility-interface solar power converter
(inverter) in a grid / DG power supply for a solar lighting system used in rural homes in
Indian villages. The power supply system comprises a solar (PV) array, a PWM converter
incorporating a PWM control strategy, and energy storage battery devices. A model of the
system has been designed for its operation, along with a prototype solar power converter. The
system simulation of PWM pulse generation has been done on a Xilinx-based FPGA
Spartan 3E board using VHDL code. The simulation test results of the PWM generation program
after synthesis and compilation were recorded and verified on a prototype sample.
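PWM pulse generation in digital logic is usually a free-running counter compared against a duty value. The sketch below models that counter-compare scheme one clock tick at a time; the period and duty values are arbitrary examples, and the paper's VHDL may use a different scheme (for instance a sinusoidal reference for the inverter).

```python
def pwm_wave(duty_counts, period_counts, cycles=1):
    """Output level per clock tick: high while the counter is below duty_counts."""
    wave = []
    for tick in range(period_counts * cycles):
        counter = tick % period_counts  # free-running counter that wraps each period
        wave.append(1 if counter < duty_counts else 0)
    return wave


def duty_cycle(wave):
    """Fraction of ticks for which the output is high."""
    return sum(wave) / len(wave)


wave = pwm_wave(duty_counts=3, period_counts=8)
```

Changing `duty_counts` between periods changes the average output voltage seen by the load, which is the quantity a PWM control strategy adjusts.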

HVL019
Real Time Design & Implementation of Digital Speedometer On FPGA
Abstract:
In this paper, a digital speedometer is designed and implemented using an FPGA
(Field Programmable Gate Array). The FPGA used is a SmartFusion FPGA. Such devices
suit hardware and embedded designs that need a true system-on-chip (SoC)
solution: they offer more flexibility than traditional fixed-function microcontrollers,
without the excessive cost of soft processor cores on traditional FPGAs. Initially,
the speedometer is designed using the Verilog Hardware Description Language. Synthesis
software algorithmically transforms the Verilog source code into a netlist, a logically
equivalent description consisting only of elementary logic primitives (AND, OR, NOT,
flip-flops, etc.) that are available in a specific FPGA or VLSI technology. The designed
speedometer offers more features than existing speedometers. In particular,
its velocity reading is accurate not only at normal
speeds but also at exceedingly small velocities.
HVL020
Design of Control Module for Serial DAC Based on FPGA
Abstract:
In order to increase the flexibility of control of a serial DAC, a new FPGA-based control method
for the DAC is proposed in this paper. A state transition diagram can be
drawn according to the timing diagram of the DAC, which can be realized in the FPGA using
the Very High-speed Integrated Circuit Hardware Description Language (VHDL). The simulation
results show that the logic in the FPGA is consistent with the requirements. The FPGA-based
module can be modified just by modifying software, not the hardware.
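A state transition diagram of this kind can be mimicked in software as a next-state function plus a loop that plays the role of the clock. The states, the 8-bit word width, and the MSB-first shift order below are assumptions for a generic serial DAC; they are not taken from the paper or from any specific DAC timing diagram.

```python
def next_state(state, bit_index, start, width):
    """Transition function of a hypothetical serial-DAC controller."""
    if state == "IDLE":
        return ("LOAD", width - 1) if start else ("IDLE", bit_index)
    if state == "LOAD":                  # chip select asserted, word latched
        return ("SHIFT", width - 1)
    if state == "SHIFT":                 # one data bit per clock, MSB first
        return ("SHIFT", bit_index - 1) if bit_index > 0 else ("DONE", 0)
    return ("IDLE", 0)                   # DONE: release chip select


def run_dac_write(word, width=8):
    """Clock the state machine through one write and record what it does."""
    state, bit_index, start = "IDLE", 0, True
    states, serial_bits = [], []
    while True:
        states.append(state)
        if state == "SHIFT":
            serial_bits.append((word >> bit_index) & 1)
        state, bit_index = next_state(state, bit_index, start, width)
        start = False                    # start is a one-cycle pulse
        if state == "IDLE":
            break
    return states, serial_bits


states, serial_bits = run_dac_write(0xA5)
```

Supporting a DAC with a different word length or an extra command phase would only require editing the transition function, which is the kind of flexibility the abstract attributes to the FPGA approach.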

HVL021
FPGA-Based Evolvable Hardware Systems
Abstract:
Since 1992, the year in which Hugo de Garis published the first paper on Evolvable
Hardware (EHW), a period of intense creativity has followed. EHW has been actively
researched, developed and applied to various problems. Different approaches have been
proposed, creating three main classifications: extrinsic, mixtrinsic and intrinsic EHW.
Each of these solutions is of real interest. Nevertheless, although extrinsic evolution
generates some excellent results, intrinsic systems are not so advanced. This paper
suggests three possible solutions for implementing a run-time configuration intrinsic EHW
system: an FPGA-based Run-Time Configuration system, a JBits-based Run-Time
Configuration system and a Multi-board functional-level Run-Time Configuration system.
The main characteristic of the proposed architectures is that they are implemented on
Field Programmable Gate Arrays. A comparison of the proposed solutions demonstrates that
multi-board functional-level run-time configuration is superior in terms of scalability,
flexibility and ease of implementation.
HVL022
Implementation of Maximum Power Point Tracking Using Kalman
Filter for Solar Photovoltaic Array on FPGA
Abstract:
This paper proposes an FPGA implementation of a novel approach to tracking the
maximum power point of a solar photovoltaic array. The approach uses a Kalman filter
algorithm to track the maximum power point. With this approach, tracking becomes much
faster than with the generic Perturb & Observe algorithm in the case of sudden weather
changes. The output of the proposed algorithm on the FPGA is provided.
Experiments were performed under optimal conditions as well as under cloudy
conditions, i.e. falling irradiance levels. Using the proposed technique, the maximum
power point of a solar PV array is tracked with an efficiency of 97.11%. Moreover, the
maximum power point has been tracked much faster, i.e. in 4.5 ms, using the
proposed algorithm than with the existing generic Perturb and Observe approach.
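For context, the baseline the abstract compares against, Perturb and Observe, can be sketched in a few lines. This is the generic baseline only, not the proposed Kalman filter tracker; the quadratic power curve, the step size, and the starting voltage are made-up illustration values.

```python
def perturb_and_observe(power_at, v_start, step, iterations):
    """Generic Perturb and Observe: keep stepping while power rises, reverse when it falls."""
    voltage = v_start
    previous_power = power_at(voltage)
    direction = 1
    for _ in range(iterations):
        voltage += direction * step        # perturb the operating voltage
        power = power_at(voltage)          # observe the resulting power
        if power < previous_power:
            direction = -direction         # power dropped: go back the other way
        previous_power = power
    return voltage


def toy_pv_power(voltage):
    """Made-up power curve with its maximum at 17 V, for illustration only."""
    return 100.0 - (voltage - 17.0) ** 2


v_final = perturb_and_observe(toy_pv_power, v_start=10.0, step=0.5, iterations=100)
```

The fixed step means the operating point keeps oscillating around the maximum, and a change in irradiance moves the peak while the tracker is still stepping toward the old one, which is the situation where the abstract reports a speed advantage for the proposed method.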
HVL023
An FPGA Based Implementation of a Flexible Digital PID Controller
For a Motion Control System
Abstract:
Implementation of digital controllers in an embedded environment suffers from the
inherent problems associated with interfacing analog and digital signals in hard real time;
therefore, the control algorithms are invariably subjected to approximations. This paper
presents a novel technique for implementing an efficient FPGA-based digital
Proportional-Integral-Derivative (PID) controller for the motion control of a permanent
magnet DC motor. The implementation technique circumvents the problem of
interfacing analog and digital systems in real time. The controller is used in a speed
control loop. The hardware implementation has been done on a Xilinx Spartan 3 FPGA
chip. A novel technique has been adopted for generating the control input as a
PWM signal for controlling the motor driver circuit and for decoding the optical encoder
data used as speed feedback in the PID control loop. The VHDL algorithm for
the proposed implementation is also presented in this paper. A comparison of the
experimental results with the MATLAB-based simulation shows the effectiveness of the
proposed method.
HVL024
Design of an Oximeter Based on LED-LED Configuration and FPGA
Technology
Abstract:
A fully digital photoplethysmographic (PPG) sensor and actuator has been
developed. The sensing circuit uses one Light Emitting Diode (LED) for emitting light
into human tissue and one LED for detecting the light reflected from human tissue. A
Field Programmable Gate Array (FPGA) is used to control the LEDs and determine the
PPG and Blood Oxygen Saturation (SpO2). Configurations with two LEDs and four
LEDs are developed for measuring the PPG signal and SpO2. An
N-LED configuration is proposed for multichannel SpO2 measurements. The approach
resulted in better spectral sensitivity, increased and adjustable resolution, reduced noise,
small size, low cost and low power consumption.
HVL025
Smart Camera Based on FPGA Oriented to Embedded Image
Processing
Abstract:
This paper presents an image processing system based on a smart camera platform,
whose two principal elements are a Wide-VGA CMOS sensor and a Field Programmable
Gate Array (FPGA). The latter is used to control the various sensor parameter
configurations and, where desired, to receive and process the images captured by the
CMOS sensor. With the advent of today's highly integrated FPGAs,
it is possible to have a software-programmable processor and hardware
computing resources on the same chip. Apart from having sufficient logic blocks on
which the hardware is implemented, these chips also have an embedded processor with
system software to implement the application software around it. In this paper, the
Spartan-3A DSP based Xilinx VSK platform is used for developing the proposed
extensible hardware-software video streaming and processing modules. In order to
develop the required hardware and software in an integrated fashion, the Xilinx Embedded
Development Kit (EDK) design tool has been used. A number of Xilinx-provided IPs are
customized to realize the hardware modules in the FPGA fabric. Copyright 2013 Praise
Worthy Prize S.r.l. - All rights reserved.

HVL026
Real Time Vehicular Camera Vision Acquisition System Using FPGA
Abstract:
With the advent of active safety technologies in the automotive industry, a need to
record and replay the actual on-road vehicular scenario has arisen, especially in systems
involving camera-based vision. The primary objective of the paper is to propose a design
of a system for real-time video acquisition. Hence, a design for a camera hardware
simulator has been proposed in this paper. The system involves a camera that captures
visual information through its image sensor. The system is designed such that it can do
direct display; that is, it can generate vertical and horizontal synchronization signals, as
per the specification of the camera and it can buffer the pixel clock coming from the
camera and send it to another system that uses the video information being received such
as an in-vehicle display to display it. It also includes the ability to record the incoming
data stream in a computer for offline processing. Because this functionality must
be achieved at high incoming data rates, and convenient interfacing to any recording
device is also required, we use a Universal Serial Bus (USB) 2.0 (high speed) interface. Many
subsystems are to be designed on the same chip, so we propose the use of a Field
Programmable Gate Array (FPGA) based system for fast data processing and
miniaturization of the system. The system under consideration comprises a
Complementary Metal-Oxide Semiconductor (CMOS) camera at the input and a high-speed
USB interface at the output. The FPGA is programmed as a Video Graphics Array
(VGA) controller, buffer and USB controller. The FPGA consumes less power
compared to other embedded systems, making it more usable inside an
automobile. This system can be used for offline video processing and for simulating the
exact on-road scenario in the lab, either for testing vision-based active safety systems or for
improving active safety algorithms (by tweaking the
algorithm for desired results, based on observations made from the recorded data,
without the need to go on the road again and again).

HVL027
Constructivist Multi-Access Lab Approach in Teaching FPGA Systems
Design with LabVIEW
Abstract:
Embedded systems play a vital role in modern applications [1]. They can be found
in cars, washing machines, electrical appliances and even in toys. FPGAs are the most
recent computing technology used in embedded systems. There is an increasing
demand for FPGA-based embedded systems, in particular for applications that require
rapid response times. Engineering education curricula need to respond to the increasing
industrial demand for FPGAs by introducing a new syllabus for teaching and learning
this subject. This paper describes the development of new course material for teaching
FPGA-based embedded systems design using the G programming language of
LabVIEW. A general overview of the role of FPGAs in engineering education is provided. A
survey of available hardware programming languages for FPGAs is presented. A survey
of LabVIEW utilization in engineering education follows, together with a
section on the motivation for using LabVIEW graphical programming in teaching and its
capabilities. Then, the choice of a suitable kit for the course is discussed. Finally,
a constructivist closed-loop model for the FPGA course is proposed in accordance with
[2-4; 80,86,89,92]. The paper proposes a pedagogical framework for FPGA teaching;
pedagogical evaluation will be conducted in future studies. The complete study was
carried out at the Faculty of Electrical and Electronic Engineering, Aleppo University.
HVL028
FPGA-Based Educational Platform for Real-Time Image Processing
Experiments
Abstract:
In this paper, an implementation of an educational platform for real-time linear
and morphological image filtering using a Nexys II FPGA board (Xilinx Spartan-3E) is
described. The system is connected to a USB port of a personal computer, which in that
way forms a powerful and low-cost design station for educational purposes. The FPGA-based
system is accessed through a MATLAB graphical user interface, which handles the
communication setup and data transfer. The system allows the students to perform
comparisons between results obtained from MATLAB simulations and FPGA-based real-time
processing. Concluding remarks derived from course evaluations and lab reports are
presented.
HVL029
Real-Time Traffic Sign Detection and Recognition on FPGA
Abstract:
This paper presents the implementation of an embedded automotive system that
detects and recognizes traffic signs within a video stream. In addition, it discusses the
recent advances in driver assistance technologies and highlights the safety motivations
for smart in-car embedded systems. An algorithm is presented that processes RGB image
data, extracts relevant pixels, filters the image, labels prospective traffic signs and
evaluates them against template traffic sign images. A reconfigurable hardware system is
described which uses the Virtex-5 Xilinx FPGA and hardware/software co-design tools in
order to create an embedded processor and the necessary hardware IP peripherals. The
implementation is shown to have robust performance results, both in terms of timing and
accuracy.
HVL030
Design and implementation of a secure RFID system on FPGA
Abstract:
Radio Frequency Identification (RFID) systems are widely used in many
applications nowadays. Data security has therefore become an important issue in RFID
communication, in order to prevent unauthorized parties from decrypting communication data.
To address this problem, a secure RFID system is designed in this study. RFID
communication at 868 MHz is demonstrated by programming transceiver modules via
Spartan-3E FPGA kits. To increase communication security between reader and tag, a
two-stage authentication protocol is used. A Tiny Encryption Algorithm (TEA) module is
used as an IP core in the FPGA to provide data encryption, and the system design has been
completed.
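The abstract names TEA as the encryption core but gives no details of the FPGA module. As a minimal software sketch, assuming the standard TEA definition (64-bit block, 128-bit key, 32 cycles), the cipher can be written as below. The function names and the sample key are illustrative assumptions, not taken from the paper.

```python
MASK32 = 0xFFFFFFFF
DELTA = 0x9E3779B9   # TEA key-schedule constant
ROUNDS = 32          # standard number of TEA cycles


def tea_encrypt(block, key):
    """Encrypt one 64-bit block (v0, v1) with a 128-bit key (k0, k1, k2, k3)."""
    v0, v1 = block
    k0, k1, k2, k3 = key
    total = 0
    for _ in range(ROUNDS):
        total = (total + DELTA) & MASK32
        v0 = (v0 + (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        v1 = (v1 + (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
    return v0, v1


def tea_decrypt(block, key):
    """Invert tea_encrypt by running the cycles in reverse order."""
    v0, v1 = block
    k0, k1, k2, k3 = key
    total = (DELTA * ROUNDS) & MASK32
    for _ in range(ROUNDS):
        v1 = (v1 - (((v0 << 4) + k2) ^ (v0 + total) ^ ((v0 >> 5) + k3))) & MASK32
        v0 = (v0 - (((v1 << 4) + k0) ^ (v1 + total) ^ ((v1 >> 5) + k1))) & MASK32
        total = (total - DELTA) & MASK32
    return v0, v1


# Illustrative values only: a made-up key and a 64-bit tag identifier.
EXAMPLE_KEY = (0x01234567, 0x89ABCDEF, 0xFEDCBA98, 0x76543210)
EXAMPLE_TAG_ID = (0xDEADBEEF, 0x00C0FFEE)
```

Each cycle uses only shifts, additions and XORs, which is one reason TEA maps well onto a small FPGA IP core.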
SIMULATION ONLY
HVL031
A 2.63 Mbit/s VLSI Implementation of SISO Arithmetic Decoders for
High Performance Joint Source Channel Codes
Abstract:
This paper highlights the implementation challenges faced by the current high
performing error resilient joint source channel coding (JSCC) techniques based on the
concept of soft-input soft-output (SISO) decoding of arithmetic codes (AC). Further, it
proposes several efficient algorithmic and very-large-scale integration (VLSI)
architectural techniques to improve the throughput performance of SISO for JSCC. The
VLSI hardware implementation of the proposed algorithm, when implemented in a 90-nm
standard-cell technology running at 588 MHz, achieves a decoding throughput of up
to 2.63 Mbit/s, capable of decoding QCIF format for video conferencing.
HVL032
3-D Mesh-Based Optical Network-on-Chip for Multiprocessor System-on-Chip
Abstract:
Optical networks-on-chip (ONoCs) are emerging communication architectures
that can potentially offer ultrahigh communication bandwidth and low latency to
multiprocessor systems-on-chip (MPSoCs). In addition to ONoC architectures, 3-D
integrated technologies offer an opportunity to continue performance improvements with
higher integration densities. In this paper, we present a 3-D mesh-based ONoC for
MPSoCs, and new low-cost nonblocking 4×4, 5×5, 6×6, and 7×7 optical routers for
dimension-order routing in the 3-D mesh-based ONoC. Besides, we propose an
optimized floorplan for the 3-D mesh-based ONoC. The floorplan follows the regular 3-D
mesh topology but implements all optical routers in a single optical layer. The
floorplan is optimized to minimize the number of extra waveguide crossings caused when
merging the 3-D ONoC to one optical layer. Based on a set of real applications and
uniform traffic pattern, we develop a SystemC-based cycle-accurate NoC simulator and
compare the 3-D mesh-based ONoC with the matched 2-D mesh-based ONoC and 2-D
electronic NoC for performance and energy efficiency. Additionally, we quantitatively
analyze thermal effects on the 3-D 8×8×2 mesh-based ONoC.
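Dimension-order routing, mentioned above, resolves a packet's offset one axis at a time. The sketch below assumes an X, then Y, then Z order and (x, y, z) node coordinates. Both are illustrative choices, since the abstract does not state the axis order or the addressing scheme.

```python
def dimension_order_route(src, dst):
    """Return the nodes visited from src to dst, correcting X, then Y, then Z."""
    path = [tuple(src)]
    cur = list(src)
    for axis in range(3):
        step = 1 if dst[axis] > cur[axis] else -1
        while cur[axis] != dst[axis]:
            cur[axis] += step          # one hop along the current axis
            path.append(tuple(cur))
    return path


# Corner-to-corner route in an 8x8x2 mesh (coordinates are assumptions).
LONGEST_ROUTE = dimension_order_route((7, 7, 1), (0, 0, 0))
```

Because a packet never returns to an axis it has already finished, the route is deterministic and its hop count equals the sum of the per-axis distances.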
HVL033
A Built-In Repair Analyzer With Optimal Repair Rate for Word-Oriented Memories
Abstract:
This paper presents a built-in self repair analyzer with the optimal repair rate for
memory arrays with redundancy. The proposed method requires only a single test, even
in the worst case. By performing the must-repair analysis on the fly during the test, it
selectively stores fault addresses, and the final analysis to find a solution is performed on
the stored fault addresses. To enumerate all possible solutions, existing techniques use
depth first search using a stack and a finite-state machine. Instead, we propose a new
algorithm and its combinational circuit implementation. Since our formulation for the
circuit allows us to use the parallel prefix algorithm, it can be configured in various ways
to meet area and test time requirements. The total area of our infrastructure is dominated
by the number of content addressable memory entries to store the fault addresses, and it
only grows quadratically with respect to the number of repair elements. The
infrastructure is also extended to support various types of word-oriented memories.
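The must-repair analysis mentioned above follows a general rule from redundancy analysis: a row with more faulty cells than there are spare columns left can only be fixed by a spare row, and likewise for columns. The sketch below applies that rule in software to a list of fault addresses. It illustrates the rule only, not the paper's on-the-fly circuit, and the function name and data layout are assumptions.

```python
from collections import Counter


def must_repair(faults, spare_rows, spare_cols):
    """Apply the must-repair rule until no line is forced.

    faults: iterable of (row, col) addresses of faulty cells.
    Returns (forced_rows, forced_cols, remaining_faults). The sketch does not
    check whether the forced repairs exceed the available spares.
    """
    remaining = set(faults)
    rows, cols = set(), set()
    while True:
        row_count = Counter(r for r, _ in remaining)
        col_count = Counter(c for _, c in remaining)
        free_rows = spare_rows - len(rows)
        free_cols = spare_cols - len(cols)
        # A row is forced when the spare columns left cannot cover its faults.
        forced_row = next((r for r, n in row_count.items() if n > free_cols), None)
        forced_col = next((c for c, n in col_count.items() if n > free_rows), None)
        if forced_row is not None:
            rows.add(forced_row)
            remaining = {(r, c) for r, c in remaining if r != forced_row}
        elif forced_col is not None:
            cols.add(forced_col)
            remaining = {(r, c) for r, c in remaining if c != forced_col}
        else:
            return rows, cols, remaining
```

Only the faults left in `remaining` need to be stored for the final search over repair solutions, which is why filtering during the test keeps the fault-address storage small.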

HVL034
A Fast-Locking All-Digital Deskew Buffer With Duty-Cycle Correction
Abstract:
In this paper, a fast-locking all-digital deskew buffer with duty cycle correction is
proposed and implemented. A cyclic time-to-digital converter is introduced to decrease
the locking time in conventional register-controlled delay-locked loop to only two input
clock cycles in coarse tuning. With the aid of the three half delay lines technique, the
mismatch between half delay lines causing the duty cycle distortion can be alleviated by
interpolation. A balanced edge combiner to achieve a precise 50% output clock is also
presented. A test chip is fabricated in 0.18-μm technology to demonstrate the feasibility
of the proposed architecture. The circuit accepts input clock rates from 250 to 625
MHz with duty cycles between 30% and 70% and generates 50% output clocks. It
preserves the capability of closed-loop control with a small area and power consumption.
HVL035
A High-Speed Low-Complexity Modified FFT Processor for High Rate
WPAN Applications
Abstract:
This paper presents a high-speed low-complexity modified radix-2^5 512-point
fast Fourier transform (FFT) processor using an eight data-path pipelined approach for
high rate wireless personal area network applications. A novel modified radix-2^5 FFT
algorithm that reduces the hardware complexity is proposed. This method can reduce the
number of complex multiplications and the size of the twiddle factor memory. It also uses
a complex constant multiplier instead of a complex Booth multiplier. The proposed FFT
processor achieves a signal-to-quantization noise ratio of 35 dB at 12-bit internal word
length. The proposed processor has been designed and implemented using 90-nm CMOS
technology with a supply voltage of 1.2 V. The results demonstrate that the total gate
count of the proposed FFT processor is 290 K. Furthermore, the highest throughput rate
is up to 2.5 GS/s at 310 MHz while requiring much less hardware complexity.
HVL036
A Low-Complexity Turbo Decoder Architecture for Energy-Efficient
Wireless Sensor Networks
Abstract:
Turbo codes have recently been considered for energy-constrained wireless
communication applications, since they facilitate a low transmission energy consumption.
However, in order to reduce the overall energy consumption, lookup table-log-BCJR
(LUT-Log-BCJR) architectures having a low processing energy consumption are
required. In this paper, we decompose the LUT-Log-BCJR architecture into its most
fundamental add compare select (ACS) operations and perform them using a novel low-complexity
ACS unit. We demonstrate that our architecture employs an order of
magnitude fewer gates than the most recent LUT-Log-BCJR architectures, facilitating a
71% energy consumption reduction. Compared to state-of-the-art maximum logarithmic
Bahl-Cocke-Jelinek-Raviv implementations, our approach facilitates a 10% reduction in
the overall energy consumption at ranges above 58 m.
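For readers unfamiliar with the terms above: Log-BCJR decoding is built on the operation max*(a, b) = max(a, b) + ln(1 + e^-|a-b|). LUT-Log-BCJR reads the correction term from a small table, and Max-Log-BCJR drops it. The sketch below shows the three variants as one add, compare and select step. The table size and step are illustrative assumptions, not values from the paper.

```python
import math

# Hypothetical 8-entry table: ln(1 + e^-d) sampled at d = 0, 0.5, ..., 3.5.
STEP = 0.5
LUT = [math.log1p(math.exp(-i * STEP)) for i in range(8)]


def max_star_exact(a, b):
    """Exact Log-BCJR operation."""
    return max(a, b) + math.log1p(math.exp(-abs(a - b)))


def max_star_lut(a, b):
    """LUT-Log-BCJR style: compare, select, then add a tabulated correction."""
    diff = abs(a - b)                       # compare
    index = int(diff / STEP)
    correction = LUT[index] if index < len(LUT) else 0.0
    return max(a, b) + correction           # select and correct


def max_log(a, b):
    """Max-Log-BCJR style: the correction term is dropped entirely."""
    return max(a, b)
```

The table lookup is what separates the LUT variant from the Max-Log variant in hardware cost, and decomposing it into simple ACS steps is the direction the abstract describes.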
HVL037
A Meet-in-the-Middle Algorithm for Fast Synthesis of Depth-Optimal
Quantum Circuits
Abstract:
We present an algorithm for computing depth-optimal decompositions of logical
operations, leveraging a meet-in-the-middle technique to provide a significant speedup
over simple brute force algorithms. As an illustration of our method, we implemented this
algorithm and found factorizations of commonly used quantum logical operations into
elementary gates in the Clifford+T set. In particular, we report a decomposition of the
Toffoli gate over the set of Clifford and T gates. Our decomposition achieves a total
T-depth of 3, thereby providing a 40% reduction over the previously best known
decomposition for the Toffoli gate. Due to the size of the search space, the algorithm is
only practical for small parameters, such as the number of qubits, and the number of
gates in an optimal implementation.
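The meet-in-the-middle idea can be illustrated on a much simpler problem than the one in the paper. The sketch below searches for a shortest sequence of classical reversible gates, modelled as permutations of basis states, by building all sequences of up to half the target length and then looking up the matching other half. It minimizes gate count over permutations, not T-depth over Clifford+T unitaries, and all names are illustrative assumptions.

```python
def compose(p, q):
    """Permutation that applies q first, then p."""
    return tuple(p[q[i]] for i in range(len(q)))


def inverse(p):
    inv = [0] * len(p)
    for i, v in enumerate(p):
        inv[v] = i
    return tuple(inv)


def meet_in_the_middle(target, gates, max_len):
    """Return a shortest gate-name sequence (applied left to right), or None."""
    identity = tuple(range(len(target)))
    # levels[k] maps each permutation reachable with exactly k gates to one sequence.
    levels = [{identity: ()}]
    for _ in range((max_len + 1) // 2):
        nxt = {}
        for perm, seq in levels[-1].items():
            for name, gate in gates.items():
                nxt.setdefault(compose(gate, perm), seq + (name,))
        levels.append(nxt)
    for total in range(max_len + 1):
        first = (total + 1) // 2
        second = total - first
        for prefix, seq in levels[first].items():
            # Need suffix with suffix * prefix == target.
            suffix = compose(target, inverse(prefix))
            if suffix in levels[second]:
                return seq + levels[second][suffix]
    return None


# Two-bit example: states are 0..3, bit 0 is the least significant bit.
CNOT_GATES = {
    "CX01": (0, 3, 2, 1),   # control bit 0, target bit 1
    "CX10": (0, 1, 3, 2),   # control bit 1, target bit 0
}
SWAP = (0, 2, 1, 3)
SWAP_CIRCUIT = meet_in_the_middle(SWAP, CNOT_GATES, 4)
```

Only sequences of about half the final length are enumerated, which is where the speedup over brute force comes from. The table of half-length sequences still grows quickly, matching the abstract's remark that the method is practical only for small parameters.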
HVL038
AC-Plus Scan Methodology for Small Delay Testing and
Characterization
Abstract:
Small delay defects escaping traditional delay testing could cause a device to
malfunction in the field and thus detecting these defects is often necessary. To address
this issue, we propose three test modes in a new methodology called AC-plus scan, in
which versatile test clocks can be generated on the chip by embedding an all-digital
phase-locked loop (ADPLL) into the circuit under test (CUT). AC-plus scan can be
executed on an in-house wireless test platform called HOY system. The first test mode of
our AC-plus scan provides a more efficient way to measure the longest path delay
associated with each test pattern. Experimental results show that our method
reduces the test time by 81.8%. The second test mode is designed for volume
production test. It could effectively detect small delay defects and provide fast
characterization on those defective chips for further processing. This mode could be used
to help predict which chips are more likely to fall victim to operational failure in the
field. The third test mode is to extract the waveform of each flip-flop's output in a real
chip. This is made possible by taking advantage of the almost unlimited test memory our
HOY test platform provides, so that we could easily store a great volume of data and
reconstruct the waveform for post-silicon debugging. We have successfully fabricated a
Viterbi decoder chip with such an AC-plus scan methodology inside to demonstrate its
capability.

HVL039
Addressing Transient and Permanent Faults in NoC With Efficient
Fault-Tolerant Deflection Router
Abstract:
Continuing decrease in the feature size of integrated circuits leads to increases in
susceptibility to transient and permanent faults. This paper proposes a fault-tolerant
solution for a bufferless network-on-chip, including an on-line fault-diagnosis mechanism
to detect both transient and permanent faults, a hybrid automatic repeat request and
forward error correction link-level error control scheme to handle transient faults, and a
reinforcement-learning-based fault-tolerant deflection routing (FTDR) algorithm to
tolerate permanent faults without deadlock and livelock. A hierarchical-routing-table-based
algorithm (FTDR-H) is also presented to reduce the area overhead of the FTDR
router. Synthesis results show that, compared with the FTDR router, the FTDR-H
router can reduce the area by 27% in an 8×8 network. Simulation results demonstrate that
under synthetic workloads, in the presence of permanent link faults, the throughput of an
8×8 network with the FTDR and FTDR-H algorithms is 14% and 23% higher on average
than that with the fault-on-neighbor (FoN) aware deflection routing algorithm and the
cost-based deflection routing algorithm, respectively. Under real application workloads,
the FTDR-H algorithm achieves a 20% lower hop count on average than the FoN
algorithm. For transient faults, the performance of the FTDR router can achieve graceful
degradation even at a high fault rate. We also implement the fault-tolerant deflection
router which can achieve 400 MHz in TSMC 65-nm technology.
HVL040
All-Digital Fast-Locking Pulsewidth-Control Circuit With
Programmable Duty Cycle
Abstract:
This paper proposes an all-digital fast-locking pulsewidth-control circuit with
programmable duty cycle. In comparison with prior state-of-the-art methods, our use of
two delay lines and a time-to-digital detector allows the pulsewidth-control circuit to
operate over a wide frequency range with fewer delay cells, while maintaining the same
level of accuracy. This paper presents a new duty-cycle setting circuit that calculates the
desired output duty cycle without the need for a look-up table. The circuit was fabricated
under the TSMC 0.18-μm CMOS process. Results show that the
proposed circuit performs well for an input operating frequency ranging from 200 to 600
MHz, and an input duty cycle ranging from 30% to 70%. It achieves a programmable
output duty cycle ranging from 31.25% to 68.75% in increments of 6.25%.
HVL041
An Analytical Latency Model for Networks-on-Chip
Abstract:
We propose an analytical model based on queueing theory for delay analysis in a
wormhole-switched network-on-chip (NoC). The proposed model takes as input an
application communication graph, a topology graph, a mapping vector, and a routing
matrix, and estimates average packet latency and router blocking time. It works for
arbitrary network topology with deterministic routing under arbitrary traffic patterns.
This model can estimate per-flow average latency accurately and quickly, thus enabling
fast design space exploration of various design parameters in NoC designs. Experimental
results show that the proposed analytical model can predict the average packet latency
more than four orders of magnitude faster than an accurate simulation, while the
computation error is less than 10% in non-saturated networks for different system-on-chip
platforms.

HVL042
An Energy-Efficient L2 Cache Architecture Using Way Tag
Information Under Write-Through Policy
Abstract:
Many high-performance microprocessors employ a cache write-through policy to
improve performance while also achieving good tolerance to soft errors in
on-chip caches. However, write-through policy also incurs large energy overhead due to
the increased accesses to caches at the lower level (e.g., L2 caches) during write
operations. In this paper, we propose a new cache architecture referred to as way-tagged
cache to improve the energy efficiency of write-through caches. By maintaining the way
tags of L2 cache in the L1 cache during read operations, the proposed technique enables
L2 cache to work in an equivalent direct-mapping manner during write hits, which
account for the majority of L2 cache accesses. This leads to significant energy reduction
without performance degradation. Simulation results on the SPEC CPU2000 benchmarks
demonstrate that the proposed technique achieves 65.4% energy savings in L2 caches on
average with only 0.02% area overhead and no performance degradation. Similar results
are also obtained under different L1 and L2 cache configurations. Furthermore, the idea
of way tagging can be applied to existing low-power cache design techniques to further
improve energy efficiency.
HVL043
An On-Chip Network Fabric Supporting Coarse-Grained Processor
Array
Abstract:
Coarse-grained arrays (CGAs) with run-time reconfigurability play an important
role in accelerating reconfigurable computing applications. It is challenging to design on-chip
communication networks (OCNs) for such CGAs with dynamic run-time
reconfigurability whilst satisfying the tight budgets of power and area for an embedded
system. This paper presents a silicon-proven design of a 64-PE circuit-switched OCN
fabric with a dynamic path-setup scheme capable of supporting an embedded coarse-grained
processor array. A proof-of-concept test chip fabricated in a 0.13-μm CMOS
process occupies a silicon area of 23 mm² and consumes a peak power of 200 mW at
128 MHz and 1.2 Vcc, at room temperature. The OCN overhead consumes 9.4% of the
area and 18% of the power of the total chip. Experimental results and analysis show that
the proposed OCN fabric with its dynamic path-setup is suitable for use in an embedded
CGA supporting fast run-time reconfigurability.
HVL044
An Ultrasynchronization Checking Method With Trace-Driven
Simulation for Fast and Accurate MPSoC Virtual Platform Simulation
Abstract:
Efficiency and accuracy are two critical concerns in multiprocessor system on
chip (MPSoC) virtual platform simulation. Traditional simulation approaches that
achieve high efficiency usually sacrifice accuracy. On the other hand, cycle-accurate
simulation algorithms are generally slow. In this paper, we propose an
ultrasynchronization checking method with an efficient trace-driven simulation
mechanism that can not only improve simulation speed, but also maintain simulation
accuracy. We build a SystemC-based MPSoC virtual platform to evaluate the
effectiveness of our approach. The experimental results show that our proposed
simulation scheme can improve simulation speed by up to 156× over the traditional
clock-step simulation method. Furthermore, our proposed method can also guarantee the
cycle-accurate simulation result.

HVL045
Application Space Exploration of Heterogeneous Run-Time
Configurable Digital Signal Processor
Abstract:
This paper describes the application space exploration of a heterogeneous digital
signal processor with dynamic reconfiguration capabilities. The device is built around
three reconfigurable engines featuring different flavours and computation granularities
that make it suitable for a wide range of signal processing application domains such as
video coding, image processing, telecommunications, and cryptography. Performance of
signal processing applications is evaluated from measurements performed on a CMOS 90
nm prototype. In order to characterize the application space of the processor, performance
is compared with state-of-the-art devices, taking programmability, computational
capabilities, and energy efficiency as the main metrics. The device achieves significantly
higher performance and energy efficiency than general-purpose processors, while still
maintaining a user-friendly programming approach that mainly relies on software-oriented
languages. The device is able to achieve 1.2 to 15 GOPS with an energy
efficiency from 2 to 50 GOPS/W when running the selected applications.
HVL046
Architecture for Real-Time Nonparametric Probability Density
Function Estimation
Abstract:
Adaptive systems are increasing in importance across a range of application
domains. They rely on the ability to respond to environmental conditions, and hence real-time
monitoring of statistics is a key enabler for such systems. Probability density
function (PDF) estimation has been applied in numerous domains; computational
limitations, however, have meant that proxies are often used. Parametric estimators
attempt to approximate PDFs based on fitting data to an expected underlying distribution,
but this is not always ideal. The density function can be estimated by rescaling a
histogram of sampled data, but this requires many samples for a smooth curve. Kernel-based
density estimation can provide a smoother curve from fewer data samples. We
present a general architecture for nonparametric PDF estimation, using both histogram-based
and kernel-based methods, which is designed for integration into streaming
applications on field-programmable gate arrays (FPGAs). The architecture employs
heterogeneous resources available on modern FPGAs within a highly parallelized and
pipelined design, and is able to perform real-time computation on sampled data at speeds
of over 250 million samples per second, while extracting a variety of statistical
properties.
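The two estimators named above can be sketched in a few lines of software. The histogram version rescales bin counts so that the bar areas sum to 1, and the kernel version sums a smooth bump centred on every sample. The Gaussian kernel, the function names and the parameters are illustrative assumptions. The abstract does not say which kernel the FPGA architecture uses.

```python
import math


def histogram_pdf(samples, lo, hi, bins):
    """Rescale a histogram over [lo, hi) so that the bar areas sum to 1."""
    width = (hi - lo) / bins
    counts = [0] * bins
    for x in samples:
        if lo <= x < hi:
            index = min(int((x - lo) / width), bins - 1)
            counts[index] += 1
    n = len(samples)
    return [c / (n * width) for c in counts]


def kernel_pdf(samples, x, bandwidth):
    """Gaussian kernel density estimate evaluated at the point x."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    return norm * sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
```

For example, `histogram_pdf([0.1, 0.2, 0.2, 0.9], 0.0, 1.0, 2)` yields two bars of width 0.5 whose areas sum to 1. The kernel estimate costs one kernel evaluation per sample per query point, which suggests why a parallel, pipelined design helps at high sample rates.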
HVL047
Built-In Generation of Functional Broadside Tests Using a Fixed
Hardware Structure
Abstract:
Functional broadside tests are two-pattern scan-based tests that avoid overtesting
by ensuring that a circuit traverses only reachable states during the functional clock
cycles of a test. In addition, the power dissipation during the fast functional clock cycles
of functional broadside tests does not exceed that possible during functional operation.
On-chip test generation has the added advantage that it reduces test data volume and
facilitates at-speed test application. This paper shows that on-chip generation of
functional broadside tests can be done using a simple and fixed hardware structure, with a
small number of parameters that need to be tailored to a given circuit, and can achieve
high transition fault coverage for testable circuits. With the proposed on-chip test
generation method, the circuit is used for generating reachable states during test
application. This alleviates the need to compute reachable states offline.

HVL048
A Built-In Self-Repair Scheme for 3-D RAMs With Interdie
Redundancy
Abstract:
3-D integration using through-silicon vias is an emerging technology for integrated
circuit designs. Random access memory (RAM) is one good candidate for the application
of 3-D integration technology. However, yield will be a key challenge for the volume
production of 3-D RAMs. In this paper, we present yield-enhancement techniques for 3-D
RAMs. An interdie redundancy scheme is proposed to improve the yield of 3-D
RAMs. Three stacking flows with respect to different bonding technologies for 3-D
RAMs with interdie redundancy are proposed as well. Finally, a built-in self-repair
(BISR) scheme is proposed to perform the repair of 3-D RAMs with interdie
redundancies. The BISR circuits in two stacked dies can work together to allocate
interdie redundancies. Simulation results show that the proposed yield-enhancement
techniques can effectively improve the yield of 3-D RAMs.
HVL049
Checkpointing for Virtual Platforms and SystemC-TLM
Abstract:
Integrating simulation models created using different simulation systems is a
common problem when constructing virtual platforms. Different companies and different
departments can create models and virtual platforms for different purposes using
different tools. There are also existing models that need to be integrated into new tools, or
the other way around. The simulators can be quite different in details, even in the case of
transaction-level models. We present work in integrating SystemC transaction-level
models into two typical full-system simulation environments, QEMU and Simics. We
present issues in reconciling the semantics of the different platforms, and our proposed
solutions. In the Simics integration, we additionally enable checkpointing in the models,
based on the Simics checkpoint mechanism.
HVL050
Circuit-Level Timing Error Tolerance for Low-Power DSP Filters and
Transforms
Abstract:
In this paper, we present a novel circuit-level timing error mitigation technique,
which aims to increase energy-efficiency of digital signal processing datapaths without
loss of robustness. Timing errors are detected using razor flip-flops on critical-paths, and
the error-rate feedback is used to control a dynamic voltage scaling control loop. In place
of conventional razor error correction by replay, we propose a new approach to bound the
magnitude of intermittent timing errors at the circuit level. A timing guard-band is
created by shaping the path delay distribution such that the critical paths correspond to a
group of least-significant bit registers. These end-points are ensured to be critical by
modifying the topology of the final stage carry-merge adder, and by using tool-based
device sizing. Hence, timing violations lead to weakly correlated logical errors of small
magnitude in a mean-squared-error sense. We examine this approach in a finite-impulse
response (FIR) filter and a 2-D discrete cosine transform implementation, in 32-nm
CMOS. Power saving compared to a conventional design at iso-frequency is 21%-23% at
the typical corner, while retaining a voltage guard-band to protect against fast transient
changes in switching activity and supply noise. The impact on minimum clock period is
small (16%-20%), as it does not necessitate the use of ripple-carry adders and also
requires only a bare minimum of additional design effort.
HVL051
Combined Architecture/Algorithm Approach to Fast FPGA Routing
Abstract:
We propose a new field-programmable gate array (FPGA) routing approach,
which, when combined with a low-cost architecture change, results in a 40% reduction in
router runtime, at the cost of a 6% area overhead and with no increase in critical path
delay. Our approach begins with PathFinder-style routing, which we run on a coarsened
representation of the routing architecture. This leads to fast generation of a partial routing
solution where the signals are assigned to groups of wire segments rather than individual
wire segments. A Boolean satisfiability (SAT)-based stage follows, generating a legal
routing solution from the partial solution. We explore approximately 165 000 FPGA
switch block architectures, showing that the choice of the architecture has a significant
impact on the complexity of the SAT formulation, and by extension, on routing runtime.
Our approach points to a new research direction, namely, reducing FPGA computer-aided
design runtime by exploring FPGA architectures and algorithms together.
HVL052
CORDIC Designs for Fixed Angle of Rotation
Abstract:
Rotation of vectors through fixed and known angles has wide applications in
robotics, digital signal processing, graphics, games, and animation. However, we do not find
any optimized coordinate rotation digital computer (CORDIC) design for vector-rotation
through specific angles. Therefore, in this paper, we present optimization schemes and
CORDIC circuits for fixed and known rotations with different levels of accuracy. For
reducing the area- and time-complexities, we have proposed a hardwired pre-shifting
scheme in barrel-shifters of the proposed circuits. Two dedicated CORDIC cells are
proposed for the fixed-angle rotations. In one of those cells, micro-rotations and scaling
are interleaved, and in the other they are implemented in two separate stages. Pipelined
schemes are suggested further for cascading dedicated single-rotation units and bi-rotation
CORDIC units for high-throughput and reduced latency implementations. We
have obtained the optimized set of micro-rotations for fixed and known angles. The
optimized scale-factors are also derived and dedicated shift-add circuits are designed to
implement the scaling. The fixed-point mean-squared-error of the proposed CORDIC
circuit is analyzed statistically, and strategies for reducing the error are given. We have
synthesized the proposed CORDIC cells by Synopsys Design Compiler using the TSMC 90-nm
library, and shown that the proposed designs offer higher throughput, less latency and
less area-delay product than the reference CORDIC design for fixed and known angles of
rotation. We find similar results of synthesis for different Xilinx field-programmable
gate-array platforms.
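When the rotation angle is known in advance, the direction of every CORDIC micro-rotation can be computed once and fixed in hardware. The floating-point sketch below illustrates that idea with the conventional CORDIC iteration. It does not reproduce the paper's optimized micro-rotation sets, pre-shifting scheme or fixed-point scaling circuits, and the function names are assumptions.

```python
import math


def cordic_directions(angle, iterations):
    """Precompute the +1/-1 micro-rotation directions for a known angle in radians."""
    directions, residual = [], angle
    for i in range(iterations):
        d = 1 if residual >= 0 else -1
        directions.append(d)
        residual -= d * math.atan(2.0 ** -i)
    return directions


def cordic_rotate(x, y, directions):
    """Rotate (x, y) with shift-add style micro-rotations, then apply the scale factor."""
    scale = 1.0
    for i, d in enumerate(directions):
        # Multiplying by 2**-i corresponds to a right shift by i bits in hardware.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        scale /= math.sqrt(1.0 + 4.0 ** -i)
    return x * scale, y * scale


# Directions for a fixed 30-degree rotation, computed once ahead of time.
DIRECTIONS_30_DEG = cordic_directions(math.pi / 6, 16)
```

Calling `cordic_rotate(1.0, 0.0, DIRECTIONS_30_DEG)` approximates (cos 30°, sin 30°). With a fixed angle the direction list and the scale factor are constants, so they can be hardwired and the angle-accumulation path of a general CORDIC is not needed.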
HVL053
Design and Implementation of an On-Chip Permutation Network for
Multiprocessor System-On-Chip
Abstract:
This paper presents the silicon-proven design of a novel on-chip network to
support guaranteed traffic permutation in multiprocessor system-on-chip applications.
The proposed network employs a pipelined circuit-switching approach combined with a
dynamic path-setup scheme under a multistage network topology. The dynamic path-setup
scheme enables runtime path arrangement for arbitrary traffic permutations. The
circuit-switching approach offers a guarantee of permuted data and its compact overhead
enables the benefit of stacking multiple networks. A 0.13-μm CMOS test-chip validates
the feasibility and efficiency of the proposed design. Experimental results show that the
proposed on-chip network achieves 1.9× to 8.2× reduction of silicon overhead compared
to other design approaches.
HVL054
Design of Digit-Serial FIR Filters: Algorithms, Architectures, and a
CAD Tool
Abstract:
In the last two decades, many efficient algorithms and architectures have been
introduced for the design of low-complexity bit-parallel multiple constant multiplications
(MCM) operation which dominates the complexity of many digital signal processing
systems. On the other hand, little attention has been given to the digit-serial MCM design
that offers alternative low-complexity MCM operations albeit at the cost of an increased
delay. In this paper, we address the problem of optimizing the gate-level area in digit-serial
MCM designs and introduce high-level synthesis algorithms, design architectures,
and a computer-aided design tool. Experimental results show the efficiency of the
proposed optimization algorithms and of the digit-serial MCM architectures in the design
of digit-serial MCM operations and finite impulse response filters.
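Two ideas in this abstract are easy to show in software. An MCM block multiplies one input by several constants using shared shifts and additions instead of multipliers, and a digit-serial operator processes a word a few bits per clock cycle, trading delay for area. The constants 7 and 21, the word width and the digit size below are illustrative assumptions, not taken from the paper.

```python
def mcm_7_21(x):
    """Compute 7*x and 21*x with one subtractor and one adder, no multiplier."""
    x7 = (x << 3) - x          # 8x - x
    x21 = (x7 << 1) + x7       # 14x + 7x, reusing the partial product 7x
    return x7, x21


def digit_serial_add(a, b, width, digit):
    """Add two unsigned width-bit words, digit bits per clock cycle.

    Assumes width is a multiple of digit. Returns (sum modulo 2**width, carry out).
    """
    mask = (1 << digit) - 1
    carry, result = 0, 0
    for cycle in range(width // digit):
        shift = cycle * digit
        s = ((a >> shift) & mask) + ((b >> shift) & mask) + carry
        result |= (s & mask) << shift
        carry = s >> digit
    return result, carry
```

With `digit = 1` the adder is bit-serial and takes `width` cycles. With `digit = width` it is bit-parallel and takes one cycle. Intermediate digit sizes give the area and delay trade-off the abstract refers to.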
HVL055
Efficient Implementation of Reconfigurable Warped Digital Filters
With Variable Low-Pass, High-Pass, Bandpass, and Bandstop Responses
Abstract:
In this brief, an efficient implementation of a reconfigurable warped digital filter
with variable low-pass, high-pass, bandpass, and bandstop responses is presented. The
warped filters, obtained by replacing each unit delay of a digital filter with an all-pass
filter, are widely used for various audio processing applications. However, warped filters
require first-order all-pass transformation to obtain variable low-pass or high-pass
responses, and second-order all-pass transformation to obtain variable bandpass or
bandstop responses. To overcome this drawback, the proposed method combines the
warped filters with the coefficient decimation technique. The proposed architecture
provides variable low-pass or high-pass responses with fine control over cut-off
frequency and variable bandwidth bandpass or bandstop responses at an arbitrary center
frequency without updating the filter coefficients or filter structure. The design example
shows that the proposed variable digital filter is simple to design and offers substantial
savings in gate counts and power consumption over other approaches.

HVL056
Error Detection in Majority Logic Decoding of Euclidean Geometry
Low Density Parity Check (EG-LDPC) Codes
Abstract:
In a recent paper, a method was proposed to accelerate the majority logic
decoding of difference set low density parity check codes. This is useful as majority logic
decoding can be implemented serially with simple hardware but requires a large decoding
time. For memory applications, this increases the memory access time. The method
detects whether a word has errors in the first iterations of majority logic decoding, and
when there are no errors the decoding ends without completing the rest of the iterations.
Since most words in a memory will be error-free, the average decoding time is greatly
reduced. In this brief, we study the application of a similar technique to a class of
Euclidean geometry low density parity check (EG-LDPC) codes that are one step
majority logic decodable. The results obtained show that the method is also effective for
EG-LDPC codes. Extensive simulation results are given to accurately estimate the
probability of error detection for different code sizes and numbers of errors.
HVL057
Exploration and Optimization of 3-D Integrated DRAM Subsystems
Abstract:
Energy efficiency is the major optimization criterion for systems-on-chip (SoCs)
for mobile devices (smartphones and tablets). Through silicon via (TSV) technology
enables 3-D integration of dies and the heterogeneous stacking of multiple memory or
logic layers, allowing increased bandwidth and lower energy consumption of the memory
interface compared to traditional approaches. In this paper, we explore the 3-D-DRAM
architecture design space. The result is an optimized 2 Gb 3-D-DRAM, which shows
83% lower energy/bit than a 2 Gb device. Furthermore, we propose a highly energy-efficient DRAM subsystem for next-generation 3-D-integrated SoCs, consisting of an
SDR/DDR 3-D-DRAM controller and an attached 3-D-DRAM cube with fine-grained
access and a flexible (WIDE-IO) interface. We assess the energy efficiency using a
synthesizable model of the SDR/DDR 3-D-DRAM channel controller (CC) as well as
functional models of the 3-D-stacked DRAM, including an accurate power estimation
engine. We also investigate different DRAM families (WIDE IO SDR/DDR, LPDDR,
and LPDDR2) and densities from 256 Mb to 4 Gb per channel. The implementation
results of the proposed 3-D-DRAM subsystem show that energy optimized accesses to
the 3-D-DRAM enable up to 50% energy savings compared to standard accesses. To the
best of our knowledge this is the first design space exploration for 3-D-stacked DRAM
considering different technologies based on real-world physical data and the first design
of a 3-D-DRAM CC and 3-D-DRAM model featuring co-optimization of memory and
controller architecture.
HVL058
Glitch-Free NAND-Based Digitally Controlled Delay-Lines
Abstract:
The recently proposed NAND-based digitally controlled delay-lines (DCDL)
present a glitching problem which may limit their use in many applications. This
paper presents a glitch-free NAND-based DCDL which overcomes this limitation,
opening the use of NAND-based DCDLs to a wide range of applications. The
proposed NAND-based DCDL maintains the same resolution and minimum delay as the
previously proposed NAND-based DCDL. A theoretical demonstration of the glitch-free operation of the proposed DCDL is also derived in the paper. Following this analysis,
three driving circuits for the delay control bits are also proposed. The proposed DCDLs have
been designed in a 90-nm CMOS technology and compared, in this technology, to the
state of the art. Simulation results show that the novel circuits achieve the lowest resolution,
with a slight worsening of the minimum delay with respect to the previously proposed
DCDL with the lowest delay. Simulations also confirm the correctness of the developed
glitching model and sizing strategy. As an example application, the proposed DCDL is used to
realize an all-digital spread-spectrum clock generator (SSCG). The use of the proposed
DCDL in this circuit reduces the peak-to-peak absolute output jitter by more than
40% with respect to an SSCG using three-state inverter based DCDLs.
HVL059
Improved Trace Buffer Observation via Selective Data Capture Using
2-D Compaction for Post-Silicon Debug
Abstract:
This paper presents a novel technique for extending the capacity of trace buffers
when capturing debug data during post-silicon debug. It exploits the fact that it is not
necessary to capture error-free data in the trace buffer since that information can be
obtained from simulation. A selective data capture method is proposed in this paper that
only captures debug data during clock cycles in which errors are present. The proposed
debug method requires only three debug sessions. The first session estimates a rough
error rate, the second session identifies a set of suspect clock cycles where errors may be
present, and the third session captures the suspect clock cycles in the trace buffer. The
suspect clock cycles are determined through a 2-D compaction technique using multiple-input signature register signatures and cycling register signatures. Intersecting both
signatures generates a small number of suspect clock cycles that the trace buffer
needs to capture. The effective observation window of the trace buffer can be expanded
significantly, by up to orders of magnitude. Experimental results indicate that very significant
increases in the effective observation window for a trace buffer can be obtained.
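The intersection idea can be sketched with a deliberately simplified model. It assumes one "row" check flags every block of consecutive cycles that contains an error and one "column" check flags every cycle residue (cycle index modulo a fixed length) that contains an error. Real signature registers compact data and can alias, which this sketch ignores. The function name, block size and modulus are hypothetical and are not taken from the paper.

```python
def suspect_cycles(error_cycles, total_cycles, block, modulus):
    """Return cycles flagged by both a per-block and a per-residue check.

    error_cycles is the ground truth, used here only to model which
    signatures would mismatch; hardware observes just the mismatches.
    """
    bad_blocks = {c // block for c in error_cycles}      # "row" signatures that mismatch
    bad_residues = {c % modulus for c in error_cycles}   # "column" signatures that mismatch
    return [c for c in range(total_cycles)
            if c // block in bad_blocks and c % modulus in bad_residues]


errors = {37, 410}
suspects = suspect_cycles(errors, total_cycles=1024, block=32, modulus=32)
```

In this toy run two erroneous cycles out of 1024 yield four suspects, so only those four would need to be stored in the trace buffer during the final session.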
HVL060
IsoNet: Hardware-Based Job Queue Management for Many-Core
Architectures
Abstract:
Imbalanced distribution of workloads across a chip multiprocessor (CMP)
constitutes wasteful use of resources. Most existing load distribution and balancing
techniques employ very limited hardware support and rely predominantly on software for
their operation. This paper introduces IsoNet, a hardware-based conflict-free dynamic
load distribution and balancing engine. IsoNet is a lightweight job queue manager
responsible for administering the list of jobs to be executed, and maintaining load balance
among all CMP cores. By exploiting a micro-network of load-balancing modules, the
proposed mechanism is shown to effectively reinforce concurrent computation in many-core environments. Detailed evaluation using a full-system simulation framework
indicates that IsoNet significantly outperforms existing techniques and scales efficiently
to as many as 1024 cores. Furthermore, to assess its feasibility, the IsoNet design is
synthesized, placed, and routed in 45-nm VLSI technology. Analysis of the resulting low-level implementation shows that IsoNet's area and power overhead are almost negligible.
HVL061
Joint Decoding of LDPC Code and Phase Factors for OFDM Systems
With PTS PAPR Reduction
Abstract:
In this paper, we investigate a low-density parity-check (LDPC)-coded orthogonal
frequency-division multiplexing (OFDM) system with a peak-to-average power ratio
(PAPR) reduction using the partial transmit sequence (PTS), which does not transmit
PTS side information about the phase factors. We view the PTS processing as a stage of
coding and call the resulting code of LDPC coding and PTS processing a concatenated
LDPC-PTS code. Then, we derive the parity-check matrix of the concatenated LDPC-PTS code. With the parity-check matrix, the LDPC code and phase factors can be jointly
decoded using belief propagation algorithms. Neither transmission of PTS side
information (phase factors) nor phase factor estimation before decoding is required by the
proposed scheme. Simulation results show that the proposed joint decoding provides
nearly perfect phase factor recovery and LDPC decoding for a small number of PTS
partitions.
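For readers unfamiliar with PTS, the sketch below shows only the transmitter-side step: the subcarriers are split into partitions, each partition is multiplied by a phase factor, and the combination with the lowest PAPR is kept. It does not implement the joint LDPC/phase-factor decoding that the abstract proposes. The block length, the adjacent partitioning, the phase set {1, -1} and all function names are assumptions made for illustration.

```python
import cmath
import itertools


def idft(freq):
    """Naive inverse DFT, adequate for a small illustrative block."""
    n = len(freq)
    return [sum(freq[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def papr(time_signal):
    """Peak-to-average power ratio (linear scale) of a time-domain signal."""
    powers = [abs(v) ** 2 for v in time_signal]
    return max(powers) / (sum(powers) / len(powers))


def pts_search(symbols, num_parts, phases=(1, -1)):
    """Exhaustively pick one phase factor per partition to minimise PAPR.

    Assumes len(symbols) is divisible by num_parts (adjacent partitioning).
    """
    n = len(symbols)
    size = n // num_parts
    best = None
    for factors in itertools.product(phases, repeat=num_parts):
        rotated = [symbols[k] * factors[k // size] for k in range(n)]
        candidate = papr(idft(rotated))
        if best is None or candidate < best[0]:
            best = (candidate, factors)
    return best


# Hypothetical 16-subcarrier QPSK block, four adjacent partitions.
qpsk = [1 + 1j, 1 - 1j, -1 + 1j, -1 - 1j]
block = [qpsk[(3 * k * k + k) % 4] for k in range(16)]
original_papr = papr(idft(block))
best_papr, best_factors = pts_search(block, num_parts=4)
```

Because the all-ones phase vector is part of the search, the selected PAPR can never be worse than that of the unmodified block. In a conventional PTS system `best_factors` would be sent as side information; the scheme in the abstract avoids that transmission.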

HVL062
Latch-Based Performance Optimization for Field-Programmable Gate
Arrays
Abstract:
We explore using pulsed latches for timing optimization -- a first in the FPGA
community. Pulsed latches are transparent latches driven by a clock with a non-standard
(non-50%) duty cycle. We exploit existing functionality within commercial FPGA chips
to implement latch-based optimizations that do not have the power or area drawbacks
associated with other timing optimization approaches, such as clock skew and retiming.
We propose an algorithm that iteratively replaces certain flip-flops in a logic design with
latches for an improvement in circuit speed. Results show that much of the performance
improvement achieved by using multiple skewed clocks can also be achieved using a
single clock and latches. We also consider the impact of short delay paths (i.e. minimum
delays), which can cause hold-time violations. Under conservative minimum delay
assumptions, our latch-based optimization, operating on the routed design, provides a 5%
performance improvement, on average, essentially for "free" (i.e. without any rerouting/delay padding). We show that short paths greatly hinder the ability to use
latches for speed improvement, motivating further work to reduce their effects.
HVL063
MDC FFT/IFFT Processor With Variable Length for MIMO-OFDM
Systems
Abstract:
This paper presents a multipath delay commutator (MDC)-based architecture
and memory scheduling to implement fast Fourier transform (FFT) processors for
multiple input multiple output-orthogonal frequency division multiplexing (MIMO-OFDM) systems with variable length. Based on the MDC architecture, we propose to use
radix-$N_{s}$ butterflies at each stage, where $N_{s}$ is the number of data streams, so
that there is only one butterfly needed in each stage. Consequently, a 100% utilization
rate in computational elements is achieved. Moreover, thanks to the simple control
mechanism of the MDC, we propose simple memory scheduling methods for input data
and output bit/set-reversing, which again results in a full utilization rate in memory
usage. Since the memory requirements usually dominate the die area of FFT/inverse fast
Fourier transform (IFFT) processors, the proposed scheme can effectively reduce the
memory size and thus the die area as well. Furthermore, to apply the proposed scheme in
practical applications, we let $N_{s}=4$ and implement a 4-stream FFT/IFFT processor
with variable length including 2048, 1024, 512, and 128 for MIMO-OFDM systems. This
processor can be used in IEEE 802.16 WiMAX and 3GPP long term evolution
applications. The processor was implemented with a UMC 90-nm CMOS technology
with a core area of 3.1 mm$^{2}$. The power consumption at 40 MHz was
63.72/62.92/57.51/51.69 mW for 2048/1024/512/128-FFT, respectively, in the post-layout
simulation. Finally, we analyze the complexity and performance of the implemented
processor and compare it with other processors. The results show advantages of the
proposed scheme in terms of area and power consumption.
HVL064
Mining Hardware Assertions With Guidance From Static Analysis
Abstract:
We present GoldMine, a methodology for generating assertions automatically in
hardware. Our method involves a combination of data mining and static analysis of the
register transfer level (RTL) design. The RTL design is first simulated to generate data
about the design's dynamic behavior. The generated data is then mined for candidate
assertions that are likely to be invariants. The data mining algorithm is a decision-tree-based supervised learning algorithm. These candidate assertions are then passed through
a formal verification engine to filter out the spurious candidates. The assertions that are
attested as true by the formal engine are system invariants. These are then evaluated by a
process of designer ranking that is provided as feedback to the data mining engine. We
demonstrate the scalability of GoldMine by showing assertion generation of the RTL of
Sun's OpenSparc T2 many-threaded processor. Our results show that GoldMine can
generate complex, high coverage assertions for sequential as well as combinational
designs in RTL, thereby minimizing human effort in this process. GoldMine assertions
distill the random input stimulus space and can be used for calibrating directed tests.
They can be used in a regression test suite of an evolving RTL. They are also useful in
providing differing perspectives from the designer, as well as hints to designers for
manually writing assertions.
HVL065
NCTU-GR 2.0: Multithreaded Collision-Aware Global Routing with
Bounded-Length Maze Routing
Abstract:
Modern global routers employ various routing methods to improve routing speed
and quality. Maze routing is the most time-consuming process for existing global routing
algorithms. This paper presents two bounded-length maze routing (BLMR) algorithms
(optimal-BLMR and heuristic-BLMR) that perform much faster routing than traditional
maze routing algorithms. In addition, a rectilinear Steiner minimum tree aware routing
scheme is proposed to guide heuristic-BLMR and monotonic routing to build a routing
tree with shorter wirelength. This paper also proposes a parallel multithreaded collision-aware global router based on a previous sequential global router (SGR). Unlike the
partitioning-based strategy, the proposed parallel router uses a task-based concurrency
strategy. Finally, a 3-D wirelength optimization technique is proposed to further refine
the 3-D routing results. Experimental results reveal that the proposed SGR uses less
wirelength and runs faster than most other state-of-the-art global routers with a
different set of parameters. Compared to the proposed SGR, the proposed parallel
router yields almost the same routing quality with average 2.71-fold and 3.12-fold speedups on
overflow-free and hard-to-route cases, respectively, when running on a 4-core system.

HVL066
Novel MIMO Detection Algorithm for High-Order Constellations in the
Complex Domain
Abstract:
A novel detection algorithm with an efficient VLSI architecture featuring efficient
operation over infinite complex lattices is proposed. The proposed design results in the
highest throughput, the lowest latency, and the lowest energy compared to the complex-domain VLSI implementations to date. The main innovations are a novel complex-domain means of expanding/visiting the intermediate nodes of the search tree on demand,
rather than exhaustively, as well as a new distributed sorting scheme to keep track of the
best candidates at each search phase. Its support of unbounded infinite lattice decoding
distinguishes the present method from previous K-Best strategies and also allows its
complexity to scale sublinearly with the modulation order. Since the expansion and
sorting cores are data-driven, the architecture is well suited for a pipelined parallel VLSI
implementation. The proposed algorithm is used to fabricate a 4×4, 64-QAM complex
multiple-input-multiple-output detector in a 0.13-µm CMOS technology, achieving a
clock rate of 417 MHz with the core area of 340 kgates. The chip test results prove that
the fabricated design can sustain a throughput of 1 Gb/s with energy efficiency of 110
pJ/bit, the best numbers reported to date.
HVL067
On the Fixed-Point Accuracy Analysis and Optimization of Polynomial
Specifications
Abstract:
Fixed-point accuracy analysis and optimization of polynomial data-flow graphs
with respect to a reference model is a challenging task in many digital signal processing
applications. Range and precision analysis are two important steps of this process to
assign suitable integer and fractional bit-widths to the fixed-point variables and constant
coefficients in a design such that no overflow occurs and a given error bound on
maximum mismatch (MM) or mean-square-error (MSE) and signal-to-quantization-noise
ratio (SQNR) is satisfied. This paper explores efficient optimization algorithms based on
robust analyses of MM and MSE/SQNR for fixed-point polynomial data-flow graphs.
Experimental results illustrate the robustness of our analyses and the efficiency of the
optimization algorithms compared to previous work.
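To make the three error measures concrete, the sketch below evaluates a small polynomial once in double precision (the reference model) and once with every coefficient, input and intermediate result rounded to a fixed number of fractional bits, then reports MM, MSE and SQNR. It is a brute-force simulation for illustration, not the analytical method of the paper, and it models precision only (no integer bit-width or overflow check). The polynomial, the input range and all function names are assumptions.

```python
import math


def quantize(value, frac_bits):
    """Round value to the nearest multiple of 2**-frac_bits."""
    scale = 1 << frac_bits
    return round(value * scale) / scale


def poly(x, coeffs):
    """Evaluate a polynomial (highest-degree coefficient first) by Horner's rule."""
    acc = 0.0
    for c in coeffs:
        acc = acc * x + c
    return acc


def error_metrics(coeffs, frac_bits, samples):
    """Compare a fixed-point evaluation against the double-precision reference."""
    q_coeffs = [quantize(c, frac_bits) for c in coeffs]
    errors, signal_power = [], 0.0
    for x in samples:
        ref = poly(x, coeffs)
        xq = quantize(x, frac_bits)
        acc = 0.0
        for c in q_coeffs:                  # quantize after every Horner step
            acc = quantize(acc * xq + c, frac_bits)
        errors.append(acc - ref)
        signal_power += ref * ref
    mm = max(abs(e) for e in errors)                    # maximum mismatch
    mse = sum(e * e for e in errors) / len(errors)      # mean-square error
    sqnr_db = 10 * math.log10(signal_power / len(errors) / mse)
    return mm, mse, sqnr_db


coeffs = [0.3, -0.7, 0.2]                       # hypothetical: 0.3x^2 - 0.7x + 0.2
samples = [i / 100 for i in range(-100, 101)]   # inputs in [-1, 1]
mm8, mse8, sqnr8 = error_metrics(coeffs, 8, samples)
mm12, mse12, sqnr12 = error_metrics(coeffs, 12, samples)
```

Sweeping `frac_bits` in this way gives the smallest bit-width that meets a given MM or SQNR target for the sampled inputs, whereas the analyses in the paper aim for bounds that hold without exhaustive simulation.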
HVL068
Pipelined Radix-$2^{k}$ Feedforward FFT Architectures
Abstract:
The appearance of radix-$2^{2}$ was a milestone in the design of pipelined FFT
hardware architectures. Later, radix-$2^{2}$ was extended to radix-$2^{k}$. However, radix-$2^{k}$
was only proposed for single-path delay feedback (SDF) architectures, but not for
feedforward ones, also called multi-path delay commutator (MDC). This paper presents
the radix-$2^{k}$ feedforward (MDC) FFT architectures. In feedforward architectures radix-$2^{k}$
can be used for any number of parallel samples which is a power of two. Furthermore,
both decimation in frequency (DIF) and decimation in time (DIT) decompositions can be
used. In addition to this, the designs can achieve very high throughputs, which makes
them suitable for the most demanding applications. Indeed, the proposed radix-$2^{k}$
feedforward architectures require fewer hardware resources than parallel feedback ones,
also called multi-path delay feedback (MDF), when several samples in parallel must be
processed. As a result, the proposed radix-$2^{k}$ feedforward architectures not only offer an
attractive solution for current applications, but also open up a new research line on
feedforward structures.

HVL069
Pragmatic Integration of an SRAM Row Cache in Heterogeneous 3-D
DRAM Architecture Using TSV
Abstract:
As scaling DRAM cells becomes more challenging and energy-efficient DRAM
chips are in high demand, the DRAM industry has started to undertake an alternative
approach to address these looming issues, that is, to vertically stack DRAM dies with
through-silicon-vias (TSVs) using 3-D-IC technology. Furthermore, this emerging
integration technology also makes heterogeneous die stacking in one DRAM package
possible. Such a heterogeneous DRAM chip provides a unique, promising opportunity for
computer architects to contemplate a new memory hierarchy for future system design. In
this paper, we study how to design such a heterogeneous DRAM chip for improving both
performance and energy efficiency. In particular, we found that, if we want to design an
SRAM row cache in a DRAM chip, simple stacking alone cannot address the majority of
traditional SRAM row cache design issues. In this paper, to address these issues, we
propose a novel floorplan and several architectural techniques that fully exploit the
benefits of 3-D stacking technology. Our multi-core simulation results with memory-intensive applications suggest that, by tightly integrating a small row cache with its
corresponding DRAM array, we can improve performance by 30% while saving dynamic
energy by 31%.
HVL070
Reconfigurable Adaptive Singular Value Decomposition Engine Design
for High-Throughput MIMO-OFDM Systems
Abstract:
Singular value decomposition (SVD) is an optimal method to obtain spatial
multiplexing gain in multi-input multi-output (MIMO) channels. However, the high cost
of implementation and high decomposing latency of the SVD restrict its usage in current
wireless communication applications. In this paper, we present a complete adaptive SVD
algorithm and a reconfigurable architecture for high-throughput MIMO-orthogonal
frequency division multiplexing systems. There are several proposed architectural design
techniques: reconfigurable scheme, division-free adaptive step size scheme, early
termination scheme, and data interleaving scheme. The reconfigurable scheme can
support all antenna configurations in a MIMO system. The division-free adaptive step
size and early termination schemes are used to effectively reduce the decomposing
latency and improve hardware utilization. The data interleaving scheme helps to deal
with several channel matrices concurrently. Besides, we propose an orthogonal
reconstruction scheme to obtain more accurate SVD outputs, and then the system
performance will be greatly enhanced. We apply our SVD design to IEEE 802.11n
applications. This design is implemented and fabricated in UMC 90 nm 1P9M CMOS
technology. The maximum operating frequency is measured to be 101.2 MHz, and the
corresponding power dissipation is 125 mW. The core size is 2.17 mm$^{2}$
and the die size is 4.93 mm$^{2}$. The chip result shows that the average
latency is only 0.33% of the wireless local area network coherence time. Hence, the
proposed reconfigurable adaptive SVD engine design is very suitable for high-throughput
wireless communication applications.
HVL071
Scalability Analysis of Memory Consistency Models in NoC-based
Distributed Shared Memory SoCs
Abstract:
We analyze the scalability of six memory consistency models in network-on-chip
(NoC)-based distributed shared memory multicore systems: 1) protected release
consistency (PRC); 2) release consistency (RC); 3) weak consistency (WC); 4) partial
store ordering (PSO); 5) total store ordering (TSO); and 6) sequential consistency (SC).
Their realizations are based on a transaction counter and an address-stack-based
approach. The scalability analysis is based on different workloads mapped on various
sizes of networks using different problem sizes. For the experiments, we use a Nostrum
NoC-based configurable multicore platform with a 2-D mesh topology and a deflection
routing algorithm. Under the synthetic workloads, the average execution time for the
PRC, RC, WC, PSO, and TSO models in the 8×8 network (64 cores) is reduced by
32.3%, 28.3%, 20.1%, 13.8%, and 9.9% over the SC model, respectively. For the
application workloads, as the network size grows, the average execution time under these
relaxed memory models decreases with respect to the SC model depending on the
application and its match to the architecture. The performance improvement of the PRC
and RC models over the SC model tends to be higher than 50% as observed in the
experiments, when the system is further scaled up. The area cost in the network interface
for the relaxed memory models is increased by less than 4% over the SC model.
HVL072
Scaling Energy Per Operation via an Asynchronous Pipeline
Abstract:
Statistical analysis of computations per unit energy in processors over the last 30
years is given that illustrates a sharp reduction in the rate of energy efficiency
improvements over the last several years resulting in the formation of an asymptotic
wall with our dataset; we use the measure of giga multiply accumulates per Joule. We
have developed an energy model which takes into account the realities of scaling,
specifically for asynchronous systems. Studies of an energy efficient asynchronous
pipeline show fabricated results of 17 Giga Operations per Joule in 0.6 µm at
subthreshold when fully pipelined, and simulations at a more modern 65 nm process
show a further order of magnitude improvement on that.

HVL073
Secure Dual-Core Cryptoprocessor for Pairings Over Barreto-Naehrig
Curves on FPGA Platform
Abstract:
This paper is devoted to the design and the physical security of a parallel dual-core flexible cryptoprocessor for computing pairings over Barreto-Naehrig (BN) curves.
The proposed design is specifically optimized for field-programmable gate-array (FPGA)
platforms. The design explores the in-built features of an FPGA device for achieving an
efficient cryptoprocessor for computing 128-bit secure pairings. The work further
pinpoints the vulnerability of those pairing computations against side-channel attacks and
demonstrates experimentally that power consumptions of such devices can be used to
attack these ciphers. Finally, we suggest a suitable countermeasure to overcome the
respective weaknesses. The proposed secure cryptoprocessor needs 1 730 000, 1 206 000,
and 821 000 cycles for the computation of Tate, ate, and optimal-ate pairings,
respectively. The implementation results on a Virtex-6 FPGA device show that it
consumes 23 k Slices and computes the respective pairings in 11.93, 8.32, and 5.66 ms.
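The cycle counts and latencies quoted above can be cross-checked against each other. The short calculation below derives the clock frequency implied by each pair of figures. That frequency is not stated in the abstract; it is computed here only as a consistency check, and the variable names are illustrative.

```python
# Figures quoted in the abstract: (cycle count, latency in milliseconds).
pairings = {
    "Tate": (1_730_000, 11.93),
    "ate": (1_206_000, 8.32),
    "optimal-ate": (821_000, 5.66),
}

# Implied clock in MHz = cycles / (latency in seconds) / 1e6 = cycles / (ms * 1e3).
implied_mhz = {name: cycles / (ms * 1e3) for name, (cycles, ms) in pairings.items()}
```

All three pairs point to roughly the same clock, close to 145 MHz, which suggests the reported numbers are mutually consistent.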
HVL074
Selective Flexibility: Creating Domain-Specific Reconfigurable Arrays
Abstract:
Historically, hardware acceleration technologies have either been application-specific, therefore lacking in flexibility, or fully programmable, thereby suffering from
notable inefficiencies on an application-by-application basis. To address the growing
need for domain-specific acceleration technologies, this paper describes a design
methodology (i) to automatically generate a domain-specific coarse-grained array from a
set of representative applications and (ii) to introduce limited forms of architectural
generality to increase the likelihood that additional applications can be successfully
mapped onto it. In particular, coarse-grained arrays generated using our approach are
intended to be integrated into customizable processors that use application-specific
instruction set extensions to accelerate performance and reduce energy; rather than
implementing these extensions using application-specific integrated circuit (ASIC) logic,
which lacks flexibility, they can be synthesized onto our reconfigurable array instead,
allowing the processor to be used for a variety of applications in related domains. Results
show that our array is around 2× slower and 15× larger than an ultimately efficient ASIC
implementation, and thus far more efficient than field-programmable gate arrays (FPGAs),
which are known to be 3-4× slower and 20-40× larger. Additionally, we estimate that our
array is usually around 2× larger and 2× slower than an accelerator synthesized using
traditional datapath merging, which has, if any, very limited flexibility beyond the design
set of DFGs.
HVL075
Self-Repairing Digital System With Unified Recovery Process Inspired
by Endocrine Cellular Communication
Abstract:
Self-repairing digital systems have recently emerged as the most promising
alternative for fault-tolerant systems. However, such systems are still impractical in many
cases, particularly due to the complex rerouting process that follows cell replacement.
They lose efficiency when the circuit size increases, due to the extra hardware in addition
to the functional circuit and the unutilization of normal operating hardware for fault
recovery. In this paper, we propose a system inspired by endocrine cellular
communication, which simplifies the rerouting process in two ways: 1) by lowering the
hardware overhead along with the increasing size of the circuit and 2) by reducing the
hardware unutilized for fault recovery while maintaining good fault-coverage. The
proposed system is composed of a structural layer and a gene-control layer. The structural
layer consists of novel modules and their interconnections. In each module of our system,
the encoded data, called the genome, contains information about the function and the
connection. Therefore, a faulty module can be replaced and the whole system's functions
and connections are maintained by simply assigning the same encoded data to a spare
(stem) module. In existing systems, a huge amount of hardware, such as a dynamic
routing system, is required for such an operation. The gene-control layer determines the
neighboring spare module in the structural layer to replace the faulty module without
collision. We verified the proposed mechanism by implementing the system with a field-programmable gate array with the application of a digital clock whose status can be
monitored with light-emitting-diodes. In comparison with existing methods, the proposed
architecture and mechanism are efficient enough for application with real fault-tolerant
systems dealing with harsh and remote environments, such as outer space or deep sea.
HVL076
STBC-OFDM Downlink Baseband Receiver for Mobile WMAN
Abstract:
This paper proposes a space time block code-orthogonal frequency division
multiplexing downlink baseband receiver for mobile wireless metropolitan area network.
The proposed baseband receiver applied in the system with two transmit antennas and
one receive antenna aims to provide high performance in outdoor mobile environments. It
provides a simple and robust synchronizer and an accurate but hardware affordable
channel estimator to overcome the challenge of multipath fading channels. The coded bit
error rate performance for 16 quadrature amplitude modulation can achieve less than $10^{-6}$
under the vehicle speed of 120 km/hr. The proposed baseband receiver designed in 90-nm
CMOS technology can support up to 27.32 Mb/s uncoded data transmission under 10
MHz channel bandwidth. It requires a core area of 2.41 × 2.41 mm$^{2}$ and dissipates 68.48
mW at 78.4 MHz with 1 V power supply.
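The abstract does not name the space time block code it uses. For two transmit antennas and one receive antenna the classic choice is the Alamouti code, so the sketch below assumes it. It shows the encoding over two symbol slots and the linear combining at the receiver for a single subcarrier, with known channel gains and no noise. The channel values, symbols and function names are hypothetical, and the synchronizer and channel estimator described in the abstract are not modeled.

```python
def alamouti_encode(s1, s2):
    """Return the symbols sent by (antenna 1, antenna 2) in two consecutive slots."""
    return [(s1, s2), (-s2.conjugate(), s1.conjugate())]


def alamouti_combine(r1, r2, h1, h2):
    """Linear combining at a single receive antenna with known channel gains."""
    gain = abs(h1) ** 2 + abs(h2) ** 2
    s1_hat = (h1.conjugate() * r1 + h2 * r2.conjugate()) / gain
    s2_hat = (h2.conjugate() * r1 - h1 * r2.conjugate()) / gain
    return s1_hat, s2_hat


# Hypothetical flat-fading gains and 16-QAM symbols; noise is omitted.
h1, h2 = 0.8 - 0.3j, -0.2 + 0.9j
s1, s2 = 3 + 1j, -1 - 3j
slots = alamouti_encode(s1, s2)
r1 = h1 * slots[0][0] + h2 * slots[0][1]   # received sample in slot 1
r2 = h1 * slots[1][0] + h2 * slots[1][1]   # received sample in slot 2
s1_hat, s2_hat = alamouti_combine(r1, r2, h1, h2)
```

Without noise the combiner returns the transmitted symbols exactly. In practice the quality of the estimates of `h1` and `h2` limits performance, which is why the abstract stresses the channel estimator.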

HVL077
Techniques for Compensating Memory Errors in JPEG2000
Abstract:
This paper presents novel techniques to mitigate the effects of SRAM memory
failures caused by low voltage operation in JPEG2000 implementations. We investigate
error control coding schemes, specifically single error correction double error detection
code based schemes, and propose an unequal error protection scheme tailored for
JPEG2000 that reduces memory overhead with minimal effect on performance.
Furthermore, we propose algorithm-specific techniques that exploit the characteristics of
the discrete wavelet transform coefficients to identify and remove SRAM errors. These
techniques do not require any additional memory, have low circuit overhead, and more
importantly, reduce the memory power consumption significantly with only a small
reduction in image quality.
HVL078
Test Patterns of Multiple SIC Vectors: Theory and Application in BIST
Schemes
Abstract:
This paper proposes a novel test pattern generator (TPG) for built-in self-test. Our
method generates multiple single-input change (MSIC) vectors in a pattern, i.e., each
vector applied to a scan chain is an SIC vector. A reconfigurable Johnson counter and a
scalable SIC counter are developed to generate a class of minimum transition sequences.
The proposed TPG is flexible to both the test-per-clock and the test-per-scan schemes. A
theory is also developed to represent and analyze the sequences and to extract a class of
MSIC sequences. Analysis results show that the produced MSIC sequences have the
favorable features of uniform distribution and low input transition density. The
performances of the designed TPGs and the circuits under test in a 45 nm technology are evaluated.
Simulation results with ISCAS benchmarks demonstrate that MSIC can save test power
and impose no more than 7.5% overhead for a scan design. It also achieves the target
fault coverage without increasing the test length.
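The single-input-change property is easy to see on a plain Johnson (twisted-ring) counter: every state differs from the next in exactly one bit. The sketch below models such a counter in software and checks that property. It is not the reconfigurable Johnson counter or the MSIC construction of the paper; the function names and the 4-bit width are assumptions for illustration.

```python
def johnson_sequence(width):
    """Return one full period (2 * width states) of a Johnson counter."""
    state = [0] * width
    states = []
    for _ in range(2 * width):
        states.append(tuple(state))
        # Shift by one stage; feed the inverted last bit back into the first stage.
        state = [1 - state[-1]] + state[:-1]
    return states


def hamming(a, b):
    """Number of bit positions in which two equal-length vectors differ."""
    return sum(x != y for x, y in zip(a, b))


seq = johnson_sequence(4)
distances = [hamming(seq[i], seq[(i + 1) % len(seq)]) for i in range(len(seq))]
```

Every entry of `distances` is 1, including the wrap-around from the last state to the first, so applying these states as consecutive vectors toggles one input at a time.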
HVL079
The LUT-SR Family of Uniform Random Number Generators for
FPGA Architectures
Abstract:
Field-programmable gate array (FPGA) optimized random number generators
(RNGs) are more resource-efficient than software-optimized RNGs because they can take
advantage of bitwise operations and FPGA-specific features. However, it is difficult to
concisely describe FPGA-optimized RNGs, so they are not commonly used in real-world
designs. This paper describes a type of FPGA RNG called a LUT-SR RNG, which takes
advantage of bitwise xor operations and the ability to turn lookup tables (LUTs) into shift
registers of varying lengths. This provides a good resource-quality balance compared to
previous FPGA-optimized generators, between the previous high-resource high-period
LUT-FIFO RNGs and low-resource low-quality LUT-OPT RNGs, with quality
comparable to the best software generators. The LUT-SR generators can also be
expressed using a simple C++ algorithm contained within this paper, allowing 60
fully specified LUT-SR RNGs with different characteristics to be embedded in this paper,
backed up by an online set of very high speed integrated circuit hardware description
language (VHDL) generators and test benches.
HVL080
Theoretical Modeling of Elliptic Curve Scalar Multiplier on LUT-Based
FPGAs for Area and Speed
Abstract:
This paper uses a theoretical model to approximate the delay of different
characteristic-two primitives used in an elliptic curve scalar multiplier architecture
(ECSMA) implemented on k-input lookup table (LUT)-based field-programmable gate
arrays. Approximations are used to determine the delay of the critical paths in the
ECSMA. This is then used to theoretically estimate the optimal number of pipeline stages
and the ideal placement of each stage in the ECSMA. This paper illustrates suitable
scheduling for performing point addition and doubling in a pipelined data path of the
ECSMA. Finally, detailed analyses, supported with experimental results, are provided to
design the fastest scalar multiplier over generic curves. Experimental results for
GF($2^{163}$) show that, when the ECSMA is suitably pipelined, the scalar multiplication can
be performed in only 9.5 µs on a Xilinx Virtex V. Notably the design has an area which is
significantly smaller than other reported high-speed designs, which is due to the better
LUT utilization of the underlying field primitives.
HVL081
A Flexible and Customizable Architecture for the Relaxation Labeling
Algorithm
Abstract:
This brief presents a flexible and customizable architecture for the probabilistic
relaxation labeling (PRL) algorithm. The algorithm has been restructured by using a
hardware-friendly process that is executed on the proposed architecture. This enables the
design to handle different numbers of objects and labels flexibly. Moreover, in the
design, the proposed PRL unit can easily be duplicated K times according to the
available resources on the field-programmable gate array (FPGA). In this brief, K can be
scaled up to 10 by using a Virtex-6 FPGA XC6VLX240T platform. Compared with
existing architectures that are not suitable for a large number of objects, the proposed
architecture reduces the time complexity from $O(N \times M)$ to $O(N)$ with the same
$O(N \times M^2)$ space complexity, where N and M are the numbers of objects and labels,
respectively. The experimental results show that the execution time of our design is about
15 times less for five objects and about 35 times less for a $128 \times 64$ image block than the
software implementation running on a Quad-core Intel 32-nm machine.

HVL082
A Novel VLSI DHT Algorithm for a Highly Modular and Parallel
Architecture
Abstract:
A new very large scale integration (VLSI) algorithm for a $2^N$-length discrete
Hartley transform (DHT) that can be efficiently implemented on a highly modular and
parallel VLSI architecture having a regular structure is presented. The DHT algorithm
can be efficiently split into several parallel parts that can be executed concurrently.
Moreover, the proposed algorithm is well suited for the subexpression sharing technique
that can be used to significantly reduce the hardware complexity of the highly parallel
VLSI implementation. Using the advantages of the proposed algorithm and the fact that
we can efficiently share the multipliers with the same constant, the number of the
multipliers has been significantly reduced such that the number of multipliers is very
small compared with that of the existing algorithms. Moreover, the multipliers with a
constant can be efficiently implemented in VLSI.
HVL083
An Adaptive Subsystem Based Algorithm for Channel Equalization in a
SIMO System
Abstract:
The principle of multiple input/output inversion theorem (MINT) has been
employed for multi-channel equalization. In this work, we propose to partition a single-input
multiple-output system into two subsystems. The equivalence between the
deconvoluted signals of the two subsystems is termed as auto-relation and we
subsequently exploit this relation as an additional constraint to the existing adaptive
MINT algorithm. In addition, we provide analysis of the auto-relation constraint and
show that this constraint confines the solution of equalization filters within a multidimensional space. We also explain through the use of convergence analysis why our
proposed algorithm can achieve a higher rate of convergence compared to the existing
MINT-based algorithms. Simulation results, using both synthetic and recorded channel
impulse responses, show that our proposed auto-relation aided MINT algorithm can
achieve faster convergence than the existing MINT-based algorithms.
HVL084
Binary Discrete Cosine and Hartley Transforms
Abstract:
In this paper, a systematic method for developing a binary version of a given
transform by using the Walsh-Hadamard transform (WHT) is proposed. The resulting
transform approximates the underlying transform very well, while maintaining all the
advantages and properties of WHT. The method is successfully applied for developing a
binary discrete cosine transform (BDCT) and a binary discrete Hartley transform
(BDHT). It is shown that the resulting BDCT corresponds to the well-known sequency-ordered
WHT, whereas the BDHT can be considered as a new Hartley-ordered WHT.
Specifically, the properties of the proposed Hartley-ordering are discussed and a shift-copy
scheme is proposed for a simple and direct generation of the Hartley-ordering
functions. For software and hardware implementation purposes, a unified structure for the
computation of the WHT, BDCT, and BDHT is proposed by establishing an elegant
relationship between the three transform matrices. In addition, a spiral-ordering is
proposed to graphically obtain the BDHT from the BDCT and vice versa. The application
of these binary transforms in image compression, encryption and spectral analysis clearly
shows the ability of the BDCT (BDHT) in approximating the DCT (DHT) very well.
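The sequency-ordered WHT mentioned above can be sketched in a few lines of Python. This is a minimal illustration, assuming the standard Sylvester construction of the Hadamard matrix and sorting its rows by their number of sign changes; the function names are hypothetical, and neither the paper's Hartley-ordering nor its unified structure is reproduced.

```python
def hadamard(n):
    """Sylvester-construction Hadamard matrix of order n (n must be a power of two)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-x for x in row] for row in h]
    return h


def sequency(row):
    """Count sign changes along a row of +1/-1 entries."""
    return sum(a != b for a, b in zip(row, row[1:]))


def sequency_ordered_wht_matrix(n):
    """Rows of the Hadamard matrix sorted by increasing number of sign changes."""
    return sorted(hadamard(n), key=sequency)


def transform(matrix, x):
    """Apply a +1/-1 matrix to a vector using only additions and subtractions."""
    return [sum(v if s > 0 else -v for s, v in zip(row, x)) for row in matrix]


W = sequency_ordered_wht_matrix(8)
coeffs = transform(W, [1, 2, 3, 4, 5, 6, 7, 8])
```

Because every matrix entry is +1 or -1, the transform needs no multiplications, which is the property that makes binary transforms attractive for hardware.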
HVL085
Computing Two-Pattern Test Cubes for Transition Path Delay Faults
Abstract:
Considering full-scan circuits, incompletely-specified tests, or test cubes, are used
for test data compression. When considering path delay faults, certain specified input
values in a test cube are needed only for determining the lengths of the paths associated
with detected faults. Path delay faults, and therefore, small delay defects, would still be
detected if such values are unspecified. The goal of this paper is to explore the possibility
of increasing the number of unspecified input values in a test set for path delay faults by
unspecifying such values in order to make the test set more amenable to test data
compression. Experimental results indicate that significant numbers of such values exist.
The proposed procedure unspecifies them gradually to obtain a series of test sets with
increasing numbers of unspecified values and decreasing path lengths. Experimental
results also indicate that filling the unspecified values randomly (as with some test data
compression methods) recovers some or all of the path lengths associated with detected
path delay faults. The procedure uses a matching of the sets of detected faults for the
comparison of path lengths.
HVL086
Design of Hardware Function Evaluators Using Low-Overhead
Nonuniform Segmentation With Address Remapping
Abstract:
In the piecewise function evaluation with polynomial approximation, nonuniform
segmentation can effectively reduce the size of lookup tables for some arithmetic
functions compared to uniform segmentation approaches, at the cost of the extra segment
address (index) encoder that results in area and delay overhead. Also, it is observed that
the nonuniform segmentation reflects a design tradeoff between the ROM size and the
area cost of the subsequent arithmetic computation hardware. In this paper, we propose a
new nonuniform segmentation method that searches for the optimal segmentation scheme
with the goal of minimized ROM, total area, or delay. For some high-variation arithmetic
functions, the proposed segmentation method achieves significant area reduction
compared to the uniform segmentation method. We also demonstrate the design tradeoff
among uniform and nonuniform segmentation, and degree-one and degree-two
polynomial approximations, with respect to precision ranging from 12 to 32 bits for the
elementary function of reciprocal.
HVL087
DS-CDMA Implementation With Iterative Multiple Access Interference
Cancellation
Abstract:
In this paper an implementation of iterative joint detection for multiple access
interference using direct-sequence code-division multiple-access (DS-CDMA) is
presented. Results for multiple field programmable gate array (FPGA) platforms and
multiple technology nodes for synthesized application specific integrated circuits (ASIC)
are presented. The joint detection is performed using a generalized version of interleave-division
multiple-access (IDMA) known as partition spreading (PS) CDMA. Decoding is
performed using iterative methods from turbo and sum-product decoding. The
synthesized ASIC system demonstrates a maximum aggregate throughput of 197 Mb/s
for a fully loaded 50-user system, while the implemented FPGA 50-user system has a
maximum aggregate throughput of 119 Mb/s.
HVL088
FPGA-Based 40.9-Gbits/s Masked AES With Area Optimization for
Storage Area Network
Abstract:
In order to protect data-at-rest in storage area networks from the risk of
differential power analysis attacks without degrading performance, a high-throughput
masked advanced encryption standard (AES) engine is proposed. However, this engine
usually adopts the unrolling technique which requires extremely large field
programmable gate array (FPGA) resources. In this brief, we aim to optimize the area for
a masked AES with an unrolled structure. We achieve this by mapping its operations
from $GF(2^8)$ to $GF(2^4)$ as much as possible. We reduce the number of mapping [$GF(2^8)$ to $GF(2^4)$] and inverse
mapping [$GF(2^4)$ to $GF(2^8)$] operations of the masked SubBytes step from ten to one. In order to be
compatible, the masked MixColumns, masked AddRoundKey, and masked ShiftRows
including the redundant masking values are carried over $GF(2^4)$. We also use FPGA block RAM
(BRAM) to further reduce hardware resources. Compared with a state-of-the-art design,
our implementation reduces the overall area by 36.2% (20.5% is contributed by the main
method, and 15.7% is contributed by the BRAM optimization). It achieves 40.9-Gbits/s at
4.5-Mbits/s/slice on the Xilinx XC6VLX240T platform. We have attacked the iterative
version of this masked AES in hardware. Results show that none of the bytes can be
guessed from the masked AES with the collected 10 000 power traces, but 14 out of 16
bytes can be guessed from the unprotected AES with the same number of traces.
HVL089
Low-Cost FIR Filter Designs Based on Faithfully Rounded Truncated
Multiple Constant Multiplication/Accumulation
Abstract:
Low-cost finite impulse response (FIR) designs are presented using the concept of
faithfully rounded truncated multipliers. We jointly consider the optimization of bit width
and hardware resources without sacrificing the frequency response and output signal
precision. Nonuniform coefficient quantization with proper filter order is proposed to
minimize total area cost. Multiple constant multiplication/accumulation in a direct FIR
structure is implemented using an improved version of truncated multipliers.
Comparisons with previous FIR design approaches show that the proposed designs
achieve the best area and power results.
HVL090
Low-Resolution DAC-Driven Linearity Testing of Higher Resolution
ADCs Using Polynomial-Fitting Measurements
Abstract:
A low-cost linearity test methodology for high-resolution analog-to-digital
converters (ADCs) is presented in this paper. Linearity testing of ADCs requires high-precision
digital-to-analog conversion (DAC) capability, commonly 3-bit higher
resolution than the ADC under test. Further, a large number of ADC output data samples
must be collected making conventional histogram testing impractical for high-resolution
ADCs with 18-24 bit precision. In the proposed test methodology, two low-precision and
low-cost DACs are used to generate a high-resolution ADC test stimulus. Significant
reductions in test cost and test time are achieved by using low-cost instrumentation and
by making fewer measurements than required for conventional histogram test. A least-squares-based
polynomial fitting approach is used to determine the transfer function of
the ADC under test. The generated transfer function is used to compute the non-linearity
of the ADC accurately. No assumption is made regarding the linearity of the lower
precision signal generators (DACs) used in the testing procedure. Software simulations
and hardware experiments are performed to validate the proposed test methodology.
HVL091
One Analog STBC-DCSK Transmission Scheme not Requiring Channel
State Information
Abstract:
Both the inherently wideband differential-chaos-shift-keying (DCSK) modulation
and the space-time block code (STBC) are techniques that can mitigate the effect of
multipath fading. By applying STBC at the chaotic segment level, a novel analog STBC-DCSK
scheme is proposed in this paper. The proposed scheme is a simple configuration
that combines the advantages of STBC and chaotic modulation. Due to the very low
correlation between different analog chaotic signals, the proposed scheme can
remarkably suppress the inter-transmit-antenna interference so as to recover the desired
information and to achieve the full diversity gain. The theoretical bit-error-rate (BER)
performance and the highly consistent simulation results demonstrate that the STBC-DCSK
scheme outperforms the conventional single-input-single-output (SISO)-DCSK
scheme by about 5 dB at a BER of $10^{-4}$. The performance superiority of the
proposed scheme is further demonstrated in a typical UWB channel by simulations. More
importantly, the proposed scheme maintains the same low transceiver cost as the SISO-DCSK
scheme. Consequently, this proposed scheme is a low-cost alternative for wireless
local area network (WLAN) applications.
HVL092
Reconfigurable Accelerator for the Word-Matching Stage of BLASTN
Abstract:
BLAST is one of the most popular sequence analysis tools used by molecular
biologists. It is designed to efficiently find similar regions between two sequences that
have biological significance. However, because the size of genomic databases is growing
rapidly, the computation time of BLAST, when performing a complete genomic database
search, is continuously increasing. Thus, there is a clear need to accelerate this process. In
this paper, we present a new approach for genomic sequence database scanning utilizing
reconfigurable field programmable gate array (FPGA)-based hardware. In order to derive
an efficient structure for BLASTN, we propose a reconfigurable architecture to accelerate
the computation of the word-matching stage. The experimental results show that the
FPGA implementation achieves a speedup around one order of magnitude compared to
the NCBI BLASTN software running on a general purpose computer.
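The word-matching stage can be pictured in software as an exact lookup of fixed-length words. The Python sketch below is only an illustration under stated assumptions: it indexes the query with a hash table and scans the subject sequence, with hypothetical function names and a word length `w` chosen for the example. It does not model the paper's reconfigurable architecture or the later BLASTN stages.

```python
from collections import defaultdict


def build_word_index(query, w):
    """Map every length-w substring of the query to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(query) - w + 1):
        index[query[i:i + w]].append(i)
    return index


def word_matches(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a length-w word matches exactly."""
    index = build_word_index(query, w)
    hits = []
    for j in range(len(subject) - w + 1):
        # Look up the current subject word; no match yields an empty tuple.
        for i in index.get(subject[j:j + w], ()):
            hits.append((i, j))
    return hits


hits = word_matches("ACGTACGT", "TTACGTAA", 4)
```

Every subject position triggers one independent lookup, which is why this stage lends itself to parallel hardware.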
HVL093
Reduced-Complexity LCC Reed-Solomon Decoder Based on Unified
Syndrome Computation
Abstract:
Reed-Solomon (RS) codes are widely used in digital communication and storage
systems. Algebraic soft-decision decoding (ASD) of RS codes can obtain significant
coding gain over the hard-decision decoding (HDD). Compared with other ASD
algorithms, the low-complexity Chase (LCC) decoding algorithm needs less computation
complexity with similar or higher coding gain. Besides employing complicated
interpolation algorithm, the LCC decoding can also be implemented based on the HDD.
However, the previous syndrome computation for $2^{\eta}$ test vectors and the key equation
solver (KES) in the HDD requires long latency and remarkable hardware. In this brief, a
unified syndrome computation algorithm and the corresponding architecture are
proposed. Cooperating with the KES in the reduced inversion-free Berlekamp-Massey
algorithm, the reduced-complexity LCC RS decoder can speed up by 57% and the area
will be reduced to 62% compared with the original design for $\eta$ = 3.
HVL094
Scale-Free Hyperbolic CORDIC Processor and Its Application to
Waveform Generation
Abstract:
This paper presents a novel completely scaling-free CORDIC algorithm in
rotation mode for hyperbolic trajectory. We use most-significant-1 bit detection
technique for micro-rotation sequence generation to reduce the number of iterations. By
storing the sinh/cosh hyperbolic values at octant boundaries in a ROM, we can extend the
range of convergence to the entire coordinate space. Based on this, we propose a pipeline
hyperbolic CORDIC processor to implement a direct digital synthesizer (DDS). The DDS
is further used to derive an efficient arbitrary waveform generator (AWG), where a
pseudo-random number generator modulates the linear increments of phase to produce
random phase-modulated waveform. The proposed waveform generator requires only one
DDS for generating a variety of modulated waveforms, while existing designs require
separate DDS units for different types of waveforms, and multiple DDS units are required
to generate composite waveforms. Therefore, the area complexity of existing designs is
multiplied by the number of different types of waveforms they generate, while that of the
proposed design remains unchanged. The proposed AWG, when mapped onto a Xilinx
Spartan 2E device, consumes 1076 slices and 2016 4-input LUTs. The proposed AWG
involves significantly less area and lower latency, with nearly the same throughput
compared to the existing CORDIC-based designs.

HVL095
Scaling, Offset, and Balancing Techniques in FFT-Based BP Nonbinary
LDPC Decoders
Abstract:
An analysis of finite precision effects in nonbinary mixed-domain low-density
parity-check decoders is presented. It is shown how improved decoding performance can
be achieved by using an offset-based method and proper scaling techniques. In addition, a
novel fast Fourier transform (FFT)-based belief propagation (BP) decoder architecture is
proposed which balances the computational load between processing units. The results
show a 47% reduction in the number of required field-programmable gate array slices
compared to a standard FFT-based BP architecture.
HVL096
Two-Rate Based Low-Complexity Variable Fractional-Delay FIR Filter
Structures
Abstract:
This paper considers two-rate based structures for variable fractional-delay (VFD)
finite-length impulse response (FIR) filters. They are single-rate structures but derived
through a two-rate approach. The basic structure considered hitherto utilizes a regular
half-band (HB) linear-phase filter and the Farrow structure with linear-phase subfilters.
Especially for wide-band specifications, this structure is computationally efficient
because most of the overall arithmetic complexity is due to the HB filter which is
common to all Farrow-structure subfilters. This paper extends and generalizes existing
results. Firstly, frequency-response masking (FRM) HB filters are utilized which offer
further complexity reductions. Secondly, both linear-phase and low-delay subfilters are
treated and combined which offers trade-offs between the complexity, delay, and
magnitude response overshoot which is typical for low-delay filters. Thirdly, the HB
filter is replaced by a general filter which enables additional frequency-response
constraints in the upper frequency band which normally is treated as a don't-care band.
Wide-band design examples (90, 95, and 98% of the Nyquist band) reveal arithmetic
complexity savings between some 20 and 85% compared with other structures, including
infinite-length impulse response structures. Hence, the VFD filter structures proposed in
this paper exhibit the lowest arithmetic complexity among all hitherto published VFD
filter structures.
HVL097
VLSI Architectures for the 4-Tap and 6-Tap 2-D Daubechies Wavelet
Filters Using Algebraic Integers
Abstract:
This paper proposes a novel algebraic integer (AI) based multi-encoding of
Daubechies-4 and -6 2-D wavelet filters having error-free integer-based computation.
Digital VLSI architectures employing parallel channels are proposed, physically realized
and tested. The multi-encoded AI framework allows a multiplication-free and
computationally accurate architecture. It also guarantees a noise-free computation
throughout the multi-level multi-rate 2-D filtering operation. A single final reconstruction
step (FRS) furnishes filtered and down-sampled image outputs in fixed-point, resulting in
low levels of quantization noise. Comparisons are provided between Daubechies-4 and -6
designs in terms of SNR, PSNR, hardware structure, and power consumptions, for
different word lengths. SNR and PSNR improvements of approximately 30% were
observed in favour of AI-based systems, when compared to 8-bit fixed-point schemes
(six fractional bits). Further, FRS designs based on canonical signed digit representation
and on expansion factors are proposed. The Daubechies-4 and -6 4-level VLSI
architectures are prototyped on a Xilinx Virtex-6 vcx240t-1ff1156 FPGA device at 282
MHz and 146 MHz, respectively, with dynamic power consumption of 164 mW and 339
mW, respectively, and verified on FPGA chip using an ML605 platform.

HVL098
VLSI Implementation of a Low-Cost High-Quality Image Scaling
Processor
Abstract:
In this brief, a low-complexity, low-memory-requirement, and high-quality
algorithm is proposed for VLSI implementation of an image scaling processor. The
proposed image scaling algorithm consists of a sharpening spatial filter, a clamp filter,
and a bilinear interpolation. To reduce the blurring and aliasing artifacts produced by the
bilinear interpolation, the sharpening spatial and clamp filters are added as prefilters. To
minimize the memory buffers and computing resources for the proposed image processor
design, a T-model and inversed T-model convolution kernels are created for realizing the
sharpening spatial and clamp filters. Furthermore, two T-model or inversed T-model
filters are combined into a combined filter which requires only a one-line-buffer memory.
Moreover, a reconfigurable calculation unit is invented for decreasing the hardware cost
of the combined filter. Moreover, the computing resource and hardware cost of the
bilinear interpolator can be efficiently reduced by an algebraic manipulation and
hardware sharing techniques. The VLSI architecture in this work can achieve 280 MHz
with 6.08-K gate counts, and its core area is 30378 µm² synthesized by a 0.13-µm CMOS
process. Compared with previous low-complexity techniques, this work reduces gate
counts by more than 34.4% and requires only a one-line-buffer memory.
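The bilinear interpolation stage named in this abstract can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function name `bilinear_scale`, the corner-aligned coordinate mapping, and the tiny test image are all hypothetical, and the sharpening and clamp prefilters, the T-model kernels, and the hardware-sharing optimizations are not modeled.

```python
def bilinear_scale(image, out_h, out_w):
    """Resize a 2-D list of numbers to out_h x out_w with bilinear interpolation."""
    in_h, in_w = len(image), len(image[0])
    out = []
    for r in range(out_h):
        # Map the output row back to a fractional source row (corner-aligned).
        y = r * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = min(int(y), in_h - 1)
        y1 = min(y0 + 1, in_h - 1)
        fy = y - y0
        row = []
        for c in range(out_w):
            x = c * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = min(int(x), in_w - 1)
            x1 = min(x0 + 1, in_w - 1)
            fx = x - x0
            # Interpolate horizontally on the two neighboring rows, then vertically.
            top = image[y0][x0] * (1 - fx) + image[y0][x1] * fx
            bottom = image[y1][x0] * (1 - fx) + image[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out


scaled = bilinear_scale([[0, 10], [20, 30]], 3, 3)
```

Each output pixel depends on only two adjacent source rows, which is consistent with the small line-buffer requirement emphasized in the abstract.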
HVL099
VLSI Implementation of a Multi-Mode Turbo/LDPC Decoder
Architecture
Abstract:
Low-density parity-check (LDPC) codes and convolutional Turbo codes are two
of the most powerful error correcting codes that are widely used in modern
communication systems. In a multi-mode baseband receiver, both LDPC and Turbo
decoders may be required. However, the different decoding approaches for LDPC and
Turbo codes usually lead to different hardware architectures. In this paper we propose a
unified message passing algorithm for LDPC and Turbo codes and introduce a flexible
soft-input soft-output (SISO) module to handle LDPC/Turbo decoding. We employ the
trellis-based maximum a posteriori (MAP) algorithm as a bridge between LDPC and
Turbo codes decoding. We view the LDPC code as a concatenation of n super-codes
where each super-code has a simpler trellis structure so that the MAP algorithm can be
easily applied to it. We propose a flexible functional unit (FFU) for MAP processing of
LDPC and Turbo codes with a low hardware overhead (about 15% area and timing
overhead). Based on the FFU, we propose an area-efficient flexible SISO decoder
architecture to support LDPC/Turbo codes decoding. Multiple such SISO modules can be
embedded into a parallel decoder for higher decoding throughput. As a case study, a
flexible LDPC/Turbo decoder has been synthesized on a TSMC 90 nm CMOS
technology with a core area of 3.2 mm². The decoder can support IEEE 802.16e LDPC
codes, IEEE 802.11n LDPC codes, and 3GPP LTE Turbo codes. Running at 500 MHz
clock frequency, the decoder can sustain up to 600 Mbps LDPC decoding or 450 Mbps
Turbo decoding.
HVL0100
An Efficient Interpolation-Based Chase BCH Decoder
Abstract:
BCH codes are adopted in many systems, such as flash memory, optical
communications, and digital video broadcasting. By trying $2^{\eta}$ test vectors, the soft-decision
Chase decoding algorithm of BCH codes can achieve significant coding gain
over hard-decision decoding. Previous one-pass Chase schemes find the error locators
based on the Berlekamp's algorithm and need hardware-demanding selection methods to
decide which locator corresponds to the correct code word. In this brief, a novel
interpolation-based one-pass Chase decoder is proposed for BCH codes. By making use
of the binary property of BCH codes, an innovative yet low-complexity method is
developed to select the interpolation output leading to successful decoding without
bringing any performance loss. The code word recovery step is also significantly
simplified through nontrivial mathematical derivations. From architectural analysis, the
proposed decoder with $\eta$ = 4 for a (4200, 4096) BCH code has 2.3 times higher efficiency
in terms of throughput-over-area ratio than the prior one-pass Chase decoder based on the
Berlekamp's algorithm, while achieving the same error-correcting performance.
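The set of test vectors tried by a Chase-type decoder can be sketched as below. This is a hedged Python illustration, assuming the usual Chase construction in which the least reliable bit positions of the hard-decision word are flipped in every combination. The parameter name `eta`, the function name, and the sample reliabilities are assumptions made for the example; the paper's interpolation-based one-pass decoding and its selection method are not shown.

```python
from itertools import product


def chase_test_vectors(hard_bits, reliabilities, eta):
    """Return the 2**eta test vectors obtained by flipping the eta least reliable bits."""
    # Indices of the eta positions with the smallest reliability values.
    weakest = sorted(range(len(hard_bits)), key=lambda i: reliabilities[i])[:eta]
    vectors = []
    for flips in product((0, 1), repeat=eta):
        candidate = list(hard_bits)
        for pos, flip in zip(weakest, flips):
            candidate[pos] ^= flip
        vectors.append(candidate)
    return vectors


vectors = chase_test_vectors([1, 0, 1, 1, 0], [0.9, 0.1, 0.8, 0.2, 0.7], eta=2)
```

A naive decoder would run hard-decision decoding on every vector separately; one-pass schemes such as the one described above aim to avoid that repeated work.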
HVL0101
An Efficient Multi-Standard LDPC Decoder Design Using Hardware-Friendly
Shuffled Decoding
Abstract:
This paper presents an efficient multi-standard low-density parity-check (LDPC)
decoder architecture using a shuffled decoding algorithm, where variable nodes are
divided into several groups. In order to provide sufficient memory bandwidth without the
need for using registers, a FIFO-based check-mode memory, which dominates the
decoder area, is used. Since two compensation factors, rather than a single factor, are
dynamically used in the offset Min-Sum algorithm, the number of quantization bits, and,
hence, the memory size, can be reduced without degradation in error performance. In
order to further reduce the memory size, artificial minimum values, which do not need to
be stored in memory, are used. We also propose an algorithm that can be used to partition
variable nodes such that the hardware cost can be minimized. Using the proposed
techniques, a multi-standard decoder that supports the LDPC codes specified in the ITU
G.hn, IEEE 802.11n, and IEEE 802.16e standards was designed and implemented using a
90-nm CMOS process. This decoder supports 133 codes, occupies an area of 5.529 mm²,
and achieves an information throughput of 1.956 Gbps.
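The offset Min-Sum check-node update referred to in this abstract can be sketched as follows. This is a simplified Python illustration with a single fixed offset, which is an assumption made for brevity: the paper dynamically uses two compensation factors, and its shuffled scheduling, FIFO-based memory, and artificial minimum values are not modeled. The function name and sample messages are hypothetical.

```python
def offset_min_sum_check_node(incoming, offset):
    """Compute outgoing check-to-variable messages with the offset Min-Sum rule.

    incoming: LLRs arriving from the variable nodes connected to one check node.
    """
    outgoing = []
    for k in range(len(incoming)):
        others = incoming[:k] + incoming[k + 1:]
        # Sign is the product of the signs of all other incoming messages.
        sign = 1
        for v in others:
            if v < 0:
                sign = -sign
        # Magnitude is the smallest other magnitude, reduced by the offset, floored at 0.
        magnitude = max(min(abs(v) for v in others) - offset, 0)
        outgoing.append(sign * magnitude)
    return outgoing


messages = offset_min_sum_check_node([4, -2, 7, -3], offset=1)
```

Subtracting an offset compensates for the overestimation that the plain Min-Sum approximation introduces, which is the reason a compensation factor appears in the update.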
