
CHAPTER 1

INTRODUCTION
Multiplication is widely used in many real-time applications and forms an
integral part of many Digital Signal Processing, Digital Image Processing and
Multimedia algorithms. The size, speed and power dissipation of a DSP chip can be
significantly influenced by the design and implementation of its multiplication and
squaring functions. This mandates a multiplication and squaring function that is not
just fast but at the same time occupies less area on the chip. Over the years,
considerable research has gone into designing and implementing multiplication
functions that occupy less area on the chip, consume less power and have minimal
propagation time. The move towards achieving less area on chip started with the
implementation of fixed-width multipliers.
A fixed-width multiplier has a smaller silicon area than a full-width
multiplier, which takes an n-bit multiplicand and an n-bit multiplier and yields an
output that is 2n bits wide. A fixed-width multiplier can be derived by truncating a
full-width multiplier.
There are many different approaches to deriving a fixed-width multiplier, and
the most commonly used ones are:
1. Truncating the 2n-bit result of a full-width multiplier to generate a result which is
n bits wide. The least significant n bits of the 2n-bit result are truncated. This design
yields the best accuracy of all the fixed-width multiplier designs. However, the area
savings are minimal, mainly because the design retains all the columns of the partial
product array even when generating a truncated output. As an example, when
processing inputs of word size n bits, there are 2n-1 columns in the partial product
array whereas the output is only n bits wide.

2. The second approach involves truncating the least significant n columns of the
partial product matrix of a full-width multiplier to directly yield a result that is n bits
wide. This design offers massive savings in area, but the errors introduced by
truncation are high. The design of any fixed-width multiplier is always a trade-off
between area savings and error: truncating bits/partial product terms introduces
error, so an error compensation algorithm is needed to compensate for the
truncation errors.
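The trade-off between the two approaches can be illustrated with a small behavioural sketch (plain Python; the example operand values and function names are ours, chosen for illustration, and the bit loops model the partial-product columns rather than any particular circuit):

```python
def trunc_after_full(a, b, n):
    """Approach 1: form the full 2n-bit product, then keep the upper n bits."""
    return ((a * b) >> n) & ((1 << n) - 1)

def trunc_columns(a, b, n):
    """Approach 2: drop the n least significant partial-product columns
    before summing, so the carries they would have generated are lost."""
    total = 0
    for i in range(n):          # bit i of multiplier b
        for j in range(n):      # bit j of multiplicand a
            if i + j >= n:      # keep only columns n .. 2n-2
                total += (((a >> j) & 1) & ((b >> i) & 1)) << (i + j)
    return (total >> n) & ((1 << n) - 1)

n = 8
a, b = 0xB7, 0x5A
assert trunc_after_full(a, b, n) == (a * b) >> n            # approach 1 is exact truncation
assert trunc_columns(a, b, n) <= trunc_after_full(a, b, n)  # approach 2 underestimates
```

Approach 2 can never overestimate, because it only discards non-negative column contributions; this is the error that the compensation algorithms below try to offset.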

1.1 Motivation
The most executed operation in the data path is addition, which requires a
binary adder that adds two given numbers. Adders also play a vital role in more
complex computations like multiplication, division and decimal operations. Hence, an
efficient implementation of the binary adder is crucial to an efficient data path.
Significant work has been done in proposing and realizing efficient adder circuits for
binary addition, as described in the next chapter. However, as technology scales
down, new design issues like fan-out and wiring complexity are coming to the fore.
These issues are addressed to some extent by new adder architectures known as
sparse adders. As operand size increases, adders also suffer from design issues that
have a direct impact on performance. Thus, there is a pressing need to develop
alternative adder architectures which can address these design issues.
The next most important block in data-path after adder is the multiplier, which
is also very crucial in ASICs and DSPs. High speed multipliers reported in literature
use parallel multiplier architectures that employ compressors along with adders as
basic building blocks. Compressors are multi-input, multi-output combinational logic
circuits that determine the number of logic 1s in their input vectors and generate a
binary coded output vector that corresponds to this number. Compressors have carry
inputs and carry outputs in addition to the normal inputs and outputs. As these
blocks lie directly within the critical path of a given design and thus dictate the
overall circuit performance, there is a pressing need to design and validate new
high-speed/low-power compressors.

1.2 Objective of Thesis:

The following objectives are proposed to be addressed in this thesis:

- Implementation of parallel array multipliers using 3:2, 4:2 and 5:2 compressors
- Implementation of an RPR block with increased accuracy and a lower error rate
- Implementation of an Algorithmic Noise Tolerant architecture based fixed-width
multiplier with improved performance
1.3 Organization of Thesis:

This thesis is organized into seven chapters. The chapter-wise outline is as follows:
Chapter 2 deals with different methods of fixed-width multiplication and methods
implemented previously.
Chapter 3 gives information about compressors, their types, and the implementation
of different types of compressors.
Chapter 4 gives an introduction to multipliers, their different types and working, and
reduction techniques for multiplier architectures.
Chapter 5 deals with the Algorithmic Noise Tolerant architecture and the
implementation of the proposed fixed-width multipliers using compressors.
Chapter 6 gives experimental results of the proposed ANT architecture based
fixed-width multipliers and a comparative analysis of the results obtained.
Chapter 7 gives the conclusion and future scope.

CHAPTER 2
LITERATURE REVIEW
2.1 Introduction:
Multipliers can be broadly classified into two types.
Sequential multiplier: In a sequential multiplier each bit of either the multiplier or the
multiplicand is processed one at a time. The main advantage of the design is its small
area. A small piece of hardware involving a shifter and an accumulator is all that is
needed to generate the output.
Parallel multiplier: In a parallel multiplier all the partial product terms are generated
in parallel and the final result is obtained by adding the partial product terms over the
columns. The main advantage of this design is its higher speed.
The parallel multiplier is widely used in various Digital Signal Processing,
Video Processing and Multimedia applications because of its high speed of operation.
The parallel array multiplier is in turn classified into two types, namely:

- Full Width Multiplier
- Fixed Width Multiplier
The main drawbacks of a full-width multiplier compared to a fixed-width
multiplier are its chip area and power consumption. The AND gates that generate
the partial product terms and the half and full adders that perform the column-wise
summation of the partial product terms result in a large number of transistors, and
hence large power consumption. Over the years, attempts have been made to
overcome the area constraint and limit the power consumption.

2.2 Area Efficient Multipliers for Digital Signal Processing:


Sunder S Kidambi, Andreas Antoniou and Fayez El-Guibaly proposed a
design which brought a 50% reduction in the area of a parallel multiplier [3]. In many
DSP applications there is a need to maintain the word size, which motivated the trio
to come up with a design that addressed this requirement. In the proposed design the
lower N columns of a parallel multiplier are truncated and a correction is then added
to the remaining most significant columns. The authors carried out a statistical error
analysis to predict the error due to truncation and provide an appropriate correction
term. Based on the correction term, the design is altered to replace half adders with
full adders where needed.
In Figure 2.1 the partial product terms along the diagonal are
added to generate the output bits p8-p15. The carry terms from
each addition propagate along the vertical arrows. FA represents the
Full adder block and AFA represents the block containing an AND
gate and a Full adder. HA stands for Half Adder block.

Figure 2.1 Multiplier design proposed by Kidambi(N=8)


Highlights: This was one of the earliest designs of fixed-width parallel
multipliers. In addition to reducing the area of the parallel multiplier by almost 50%,
it proposed an analytical technique to generate a correction term that reduces the
error due to truncation. The main drawback of this design is the correction bias added
to offset the error due to truncation: the correction is a constant term and does not
depend on the inputs being fed to the multiplier.

2.3 Fixed Width Multiplier with Correction Constant:


Michael J Schulte and Earl E Swartzlander furthered the work
carried out by Kidambi and team [4]. The design proposed by them
is based on the following:

- Truncating the least significant n-k columns.
- Rounding the result to n columns.

In a full-width multiplier with an n-bit multiplicand and an n-bit
multiplier, the output is a 2n-bit word. There are 2n-1 columns in the
partial product array and the terms are added column-wise to
yield a 2n-bit result. In the design proposed by Schulte and
Swartzlander, n-k columns are truncated, retaining n+k columns of
partial product terms. This design is slightly different from that
proposed by Kidambi [3] in that the latter truncated n columns
whereas this design truncates n-k columns. Retaining more columns
gives better results at the expense of area. In order to offset the
error introduced due to truncation, a constant term is added to the
most significant n+k columns. This correction is the average of the
truncated portion of the partial product matrix. The result obtained
after adding the partial product terms is then rounded to yield an
n-bit result.
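A behavioural sketch of the constant-correction scheme is given below. The closed-form constant assumes uniformly random input bits, and the function signature is ours, not the paper's:

```python
def const_correction_multiply(a, b, n, k):
    """Constant-correction truncated multiply (behavioural sketch).
    The least significant n-k partial-product columns are discarded,
    a constant equal to their average value is added, and the result
    is rounded to n bits."""
    kept = 0
    for i in range(n):
        for j in range(n):
            if i + j >= n - k:                       # retained columns
                kept += (((a >> j) & 1) & ((b >> i) & 1)) << (i + j)
    # Column c (c < n) holds c+1 partial-product bits; each bit is 1
    # with probability 1/4 for random inputs, so its average is (c+1)/4.
    correction = round(sum((c + 1) / 4 * (1 << c) for c in range(n - k)))
    # Round to n bits: add half an ulp (2^(n-1)) before dropping n bits.
    return (kept + correction + (1 << (n - 1))) >> n
```

For n = 8 and k = 2 the result stays within a couple of output ulps of the rounded full-width product, illustrating the area/error trade-off controlled by k.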
Highlights: The design proposed by Schulte and Swartzlander [4] brings
about a huge saving in area when compared to a full-width multiplier. It also
introduces a degree of flexibility in the number of columns that are truncated, giving
designers a chance to choose between area savings and better error correction. The
design proposed by Kidambi [3] and team is a special case of this design [4].

2.4 Truncated Multiplication with Approximate Rounding


Earl E Swartzlander, Jr proposed a fixed-width multiplier that uses variable
correction instead of constant correction [5]. The basic design of the multiplier is the
same as that of a constant-correction fixed-width multiplier. The least significant N-2
partial product columns of a full-width multiplier are truncated. The partial product
terms in the (N-1)th column are then added to the partial product terms in the Nth
column using full adders. This is done in order to offset the error introduced due to
the truncation of the least significant N-2 columns. The correction term that is
generated is based on the following arguments:

- The biggest column in the entire partial product array of a full-width multiplier is
the Nth column (assuming that the columns are numbered from 1 to 2N, 1 being the
least significant column and 2N being the most significant).

- The (N-1)th column contributes more information to the most significant N-1
columns than the rest of the least significant columns. The information presented
can be made more accurate if the carry from the (N-1)th column is preserved and
passed on to the Nth column.

- Adding the elements of the (N-1)th column to the Nth column provides a variable
correction, as the information presented depends on the input bits. When all the
partial product terms in the (N-1)th column are zero, the correction added is zero;
when all the terms are one, a different correction value is added.

The partial product terms along the outermost diagonal form the elements of
the (N-1)th column of the partial product array of a standard multiplier. These terms
are fed to the adder blocks in the Nth column (represented by the adjacent diagonal
elements). The output obtained is N bits wide.

Highlights: The advantage of this design is that the correction term changes with
the inputs fed to the multiplier. The complexity is slightly higher than that of the
constant-correction fixed-width multiplier. This design can at most bring about an
N-2 column truncation.
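The variable-correction idea can be sketched behaviourally as follows (column indices are 0-based in the code, unlike the 1-based numbering used above; the function name is ours):

```python
def variable_correction_multiply(a, b, n):
    """Variable-correction truncated multiply (behavioural sketch).
    The least significant N-2 columns are discarded; the bits of
    column N-1 are instead added into column N (i.e. with doubled
    weight), forming an input-dependent correction."""
    total = 0
    for i in range(n):
        for j in range(n):
            col = i + j                              # 0-based column index
            bit = ((a >> j) & 1) & ((b >> i) & 1)
            if col >= n - 1:                         # retained columns
                total += bit << col
            elif col == n - 2:                       # correction column
                total += bit << (n - 1)              # promoted one place left
    return total >> n                                # N-bit result
```

When the correction column holds all zeros, nothing is added; the more ones it holds, the larger the compensation, which is exactly the input dependence described above.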

2.5 Design and Implementation of Low Error Fixed-Width Multiplier Proposed
by Jou

Jer Min Jou and Shiann Rong Kuang proposed a design [6] which is a slight
variation over the other designs documented above. The Nth column of the parallel
multiplier is the largest contributor of information to the most significant half of the
partial product array matrix. Retaining the entire column can significantly reduce the
error due to truncation. However, retaining the entire column would increase the area
of the chip, as there would be more AND gates and adders in the design to compute
the partial product terms and the sum for that column. The design proposed by the
duo aims at reducing the area by using an AO cell, which is a combination of an
AND gate and an OR gate, as shown in figure 2.2. This AO cell is used to generate
the carry information, which is then fed to the next column. This design brings about
a small improvement in area as the adders in the Nth column are replaced by the AO
cell. Figure 2.2 shows the implementation of the multiplier proposed by Jer Min Jou
and Shiann Rong Kuang.

Figure 2.2: Multiplier design using AO cells for error compensation


Highlights: This design provides a new approach to offset the
error due to truncation. It is based on the reasoning that the Nth
column, being the largest column in the entire partial product array,
is the largest contributor of information to the remaining most
significant N columns. It includes the entire column using the AO cell,
which is used instead of an AND gate and a full adder. This design is
better than previous designs as it reduces the area by using
the AO cell. However, the maximum possible truncation is N-1
columns.

2.6 Truncated Multiplier with Symmetric Correction


The truncated multiplier with symmetric correction [7] was
implemented by Hyuk Park and Earl E Swartzlander, Jr with the
intention of improving the error correction over other existing
techniques. This multiplier design is very similar to the variable
correction technique, with an additional piece of logic. The basic
design incorporates the same truncation principle as the variable
correction design, but the correction introduced is slightly different.
In addition to adding the partial product terms from the (N-K-1)th
column to the (N-K)th column, an additional piece of hardware logic
is added which is responsible for introducing symmetry between the
positive and negative maximum errors. The proposed additional
logic only slightly increases the complexity of the design.
Highlights - The main advantage of the design is that it evenly
distributes the error between its positive maximum and negative
maximum errors. Retaining the extra columns in the partial product
array and introducing the proposed logic increases the area of the
multiplier. Thus, the better error correction obtained by this design
is offset by the increase in area.

2.7 Dual Tree Error Compensation

This design was proposed by Antonio G M Strollo, Nicola Petra
and Davide De Caro [8]. It is a variant of the fixed-width multiplier that
uses a tree-based adding scheme. This design uses an error
compensation function that can be tweaked, based on the
requirements, to bring down either the maximum error or the mean
error. The design structure is shown in figure 2.3.


The dual tree approach is based on the understanding that the
partial product terms in the input correction vector carry different
weights. It is shown in the paper that the outer partial product terms
in the input correction vector (which is the N-1 column) have a lower
weight when compared to the inner partial product terms. Therefore,
choosing the right combination of partial product terms depending
on their weights yields good results.

The paper proposes that, in order to get a low mean square
error, the partial product terms in the N-1 column are separated into
two groups and passed on to adder trees. The outer partial product
terms are processed separately by passing them to a standard tree
adder structure (composed of half and full adders), whereas the
remaining partial product terms are added using a modified tree
structure (composed of AND and OR gates). In order to get a
multiplier that yields a low maximum error, the outer partial product
terms are added using the standard tree structure whereas the inner
terms are added using the modified tree structure. Each of these tree
structures generates carry terms which are fed to the adjacent
column, which has the weight 2^-n. The paper also discusses the use
of a mixing block which is used to pass on the carry to the adjacent
column with a higher weight.


Figure 2.3 Dual tree compensation proposed by Antonio


Highlights: The main advantage of this design is that it provides the user with
the flexibility to choose either a design that lowers the maximum error or one that
lowers the mean square error. The error compensation provided by this design is by
far the best compared with the other designs. It provides a variable error
compensation bias, as the correction value depends on the input bits. The error
correction function is simple to implement, as the partial product terms in the input
correction vectors are used directly to provide the correction.


CHAPTER 3
DESIGN AND IMPLEMENTATION OF COMPRESSORS
3.1 Introduction
Multiplication is a basic arithmetic operation that is crucial in applications like
digital signal processing, which in turn rely on efficient implementation of generic
arithmetic and logic units (ALUs) and floating-point units to execute dedicated
operations like convolution and filtering. In the implementation of multipliers, the
main phases include generation of partial products, reduction of partial products
using CSAs (carry-save adders), and carry propagation for the computation of the
final result, as shown in Fig 3.1. The second phase, i.e. reduction of the partial
products, contributes most to the overall delay, area and power. In order to reduce
partial products, multi-operand adders, which are different from conventional adders,
are required, and hence a different design methodology is needed for them. A special
structure known as the compressor is one strategy that can be adopted for
multi-operand addition. Wallace and Dadda were the first to explain the use of
compressors in the partial product reduction tree of multipliers. Later, different
optimized structures for compressors were reported in the literature.


Figure 3.1 Steps involved in multiplication

3.2 Compressors
A (N, 2) compressor is a logic circuit that takes N bits of the same significance
and generates a Sum bit and several Carry bits as output. Though a compressor
gives Sum and Carry, it is different from a conventional adder: a compressor adds
N bits of the same weight, whereas an adder adds two operands whose bits have
different weights. The compressor operation can be expressed logically as

I1 + I2 + ... + IN + (Cin1 + Cin2 + ... + Cink) = Sum + 2(Cout1 + Cout2 + ... + Coutk)

where I1, I2, ..., IN are the inputs and Cin1, Cin2, ..., Cink are the carry inputs.

Figure 3.2 Compressors


3.3 Existing Compressor Designs


3.3.1 3-2 Compressor
A 3-2 compressor takes 3 inputs X1, X2, X3 and generates 2 outputs, the Sum bit S
and the Carry bit C, as shown in figure 3.2(a).
The compressor is governed by the basic equation

X1 + X2 + X3 = Sum + 2·Carry    (3.1)

The 3-2 compressor can also be employed as a full adder cell when the third
input is considered as the carry-in from the previous compressor block. The existing
design shown in figure 3.2(b) employs two XOR gates in the critical path.
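As a quick sanity check, a bit-level model of the 3-2 compressor confirms equation (3.1) exhaustively (this is a behavioural sketch, not a circuit description):

```python
def compressor_3_2(x1, x2, x3):
    """Gate-level 3:2 compressor (a full-adder cell): two XORs in
    series give Sum; the majority function gives Carry."""
    s = x1 ^ x2 ^ x3
    carry = (x1 & x2) | (x3 & (x1 ^ x2))
    return s, carry

# Equation (3.1) holds for every input combination:
for v in range(8):
    x1, x2, x3 = (v >> 2) & 1, (v >> 1) & 1, v & 1
    s, c = compressor_3_2(x1, x2, x3)
    assert x1 + x2 + x3 == s + 2 * c
```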
3.3.2 4-2 Compressor
A 4-2 compressor has 4 inputs X1, X2, X3 and X4 and 2 outputs, Sum and
Carry, along with a carry-in (Cin) and a carry-out (Cout), as shown in figure 3.3. The
input Cin is the output of the previous, less significant compressor stage, and Cout is
passed on to the compressor in the next, more significant stage.


Figure 3.3 4-2 compressor block


Similar to the 3-2 compressor, a 4-2 compressor is governed by the basic equation

X1 + X2 + X3 + X4 + Cin = Sum + 2·(Carry + Cout)    (3.2)

The standard implementation of the 4-2 compressor can be done using 2 full Adder
cells as shown in Fig 3.4

Figure 3.4 Design of 4-2 compressor using full adders
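The two-full-adder construction of Fig 3.4 can be modelled directly, and equation (3.2) verified over all 32 input combinations:

```python
def full_adder(a, b, c):
    """3:2 counter: sum and carry of three equal-weight bits."""
    return a ^ b ^ c, (a & b) | (c & (a ^ b))

def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor built from two cascaded full adders (Fig 3.4):
    the first FA compresses x1..x3 and emits Cout; the second adds
    the intermediate sum, x4 and Cin to give Sum and Carry."""
    s1, cout = full_adder(x1, x2, x3)
    s, carry = full_adder(s1, x4, cin)
    return s, carry, cout

# Equation (3.2) holds for all 32 input combinations:
for v in range(32):
    bits = [(v >> k) & 1 for k in range(5)]
    s, carry, cout = compressor_4_2(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```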


When the individual full adders are broken into their constituent XOR blocks, it can
be observed that the overall delay is equal to 4 XOR-gate delays.
The block diagram in Figure 3.5 shows an existing architecture for the
implementation of the 4-2 compressor with a delay of 3 XOR-gate delays; however,
this architecture does not take into account the fact that both the output and its
complement are available at every stage.
3.3.3 5-2 Compressor
The 5-2 compressor block has 5 inputs X1, X2, X3, X4 and X5 and 2 outputs, Sum
and Carry, along with 2 carry inputs (Cin1, Cin2) and 2 carry outputs
(Cout1, Cout2), as shown in figure 3.5(a). The carry inputs are the outputs of the
previous, less significant compressor block, and the carry outputs are passed on
to the next, more significant compressor block.

Figure 3.5(a) 5:2 compressor (b) Conventional 5:2 compressor


The basic equation that governs the function of a 5-2 compressor block is
given below:

X1 + X2 + X3 + X4 + X5 + Cin1 + Cin2 = Sum + 2·(Carry + Cout1 + Cout2)    (3.3)

The conventional implementation of the compressor block is shown in figure 3.5(b),
where 3 cascaded full adder cells are used. When these full adders are replaced with
their constituent XOR blocks, it can be observed that the overall delay is equal to
6 XOR-gate delays for the Sum or Carry output.
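The three-cascaded-full-adder construction of figure 3.5(b) can likewise be modelled and checked against equation (3.3) over all 128 input combinations:

```python
def full_adder(a, b, c):
    """3:2 counter: sum and carry of three equal-weight bits."""
    return a ^ b ^ c, (a & b) | (c & (a ^ b))

def compressor_5_2(x1, x2, x3, x4, x5, cin1, cin2):
    """Conventional 5:2 compressor: three cascaded full adders;
    the stage carries become Cout1, Cout2 and Carry."""
    s1, cout1 = full_adder(x1, x2, x3)
    s2, cout2 = full_adder(s1, x4, x5)
    s, carry = full_adder(s2, cin1, cin2)
    return s, carry, cout1, cout2

# Equation (3.3) holds for all 128 input combinations:
for v in range(128):
    bits = [(v >> k) & 1 for k in range(7)]
    s, carry, c1, c2 = compressor_5_2(*bits)
    assert sum(bits) == s + 2 * (carry + c1 + c2)
```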


Figure 3.6 Existing Architecture of 5-2 Compressor

3.4 Design and Implementation of Efficient Compressors


3.4.1 3-2 Compressor
In CMOS implementation, gates like OR and AND are realized as NOR and
NAND gates followed by an inverter. Thus, from OR and AND gates, NOR and
NAND outputs can be obtained without any extra hardware. This technique is
used to design an XOR-XNOR pair gate, which is shown in figure 3.7.

Figure 3.7 CMOS Implementation of XOR/XNOR Gate



A 3-2 compressor can be implemented using the following expressions:

Sum = X1 ⊕ X2 ⊕ X3    (3.4)

Carry = (X1 ⊕ X2)·X3 + (X1 ⊕ X2)'·X1    (3.5)

A gate-level implementation of these expressions has been shown earlier in
figure 3.2(b). In the existing design, the output of the first XOR gate and X3 are
given as inputs to the second-stage XOR gate. This XOR gate can be replaced by a
multiplexer, which reduces the delay, as a multiplexer has less delay than an XOR
gate.

In the design shown in figure 3.8, the fact that both the XOR and XNOR
outputs are computed is used to reduce the delay by replacing the second XOR gate
with a MUX. This is possible because the select bit, i.e. X3, is available at the MUX
block before its inputs arrive; thus, the time taken for switching on the transistors in
the critical path is hidden.

Figure 3.8 Efficient design of 3:2 compressor


The equations governing the proposed (3, 2) compressor outputs are the same as
(3.4) and (3.5), with the second-stage XOR realized as a multiplexer.

3.4.2 4-2 Compressor


In the existing 4-2 design as well, the fact that both the output and its complement
are available at every stage is neglected. Replacing some XOR gates with
multiplexers therefore results in a significant improvement in delay.

Figure 3.9 Efficient design of 4:2 compressor


As in the previous case, the MUX block at the Sum output gets its select bit before
the inputs arrive, so the transistors are already switched on by the time the inputs
arrive. This minimizes the delay to a considerable extent, as shown in figure 3.9.
The equations governing the outputs are shown below:

Sum = X1 ⊕ X2 ⊕ X3 ⊕ X4 ⊕ Cin    (3.8)

Cout = (X1 ⊕ X2)·X3 + (X1 ⊕ X2)'·X1    (3.9)

Carry = (X1 ⊕ X2 ⊕ X3 ⊕ X4)·Cin + (X1 ⊕ X2 ⊕ X3 ⊕ X4)'·X4    (3.10)
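These output equations can be cross-checked against the defining relation (3.2); complements are written as `1 - x` in this bit-level sketch:

```python
def compressor_4_2_mux(x1, x2, x3, x4, cin):
    """4:2 compressor per equations (3.8)-(3.10): the second-stage
    XORs behave as multiplexers whose select signals arrive early."""
    s = x1 ^ x2 ^ x3 ^ x4 ^ cin                          # (3.8)
    sel1 = x1 ^ x2
    cout = (sel1 & x3) | ((1 - sel1) & x1)               # (3.9)
    sel2 = x1 ^ x2 ^ x3 ^ x4
    carry = (sel2 & cin) | ((1 - sel2) & x4)             # (3.10)
    return s, carry, cout

# The outputs satisfy equation (3.2) for all 32 input combinations:
for v in range(32):
    bits = [(v >> k) & 1 for k in range(5)]
    s, carry, cout = compressor_4_2_mux(*bits)
    assert sum(bits) == s + 2 * (carry + cout)
```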

The MUX* structure in figure 3.9 is a multiplexer implemented in
transmission-gate logic, shown in figure 3.11. This design of multiplexer is
faster and consumes less power than the CMOS design, but requires buffers to
enhance its driving capability. Therefore, this type of multiplexer is best used
where CMOS stages are present at its input and output, because CMOS gates have
good driving capability. Thus, transmission-gate multiplexers are used in the
intermediate stages, thereby increasing the performance.

3.4.3 5-2 Compressor


In the proposed design of the 5-2 compressor, the most important change is to
use the outputs generated at every stage efficiently. This is done by replacing some
XOR blocks with MUX blocks. Also, the select bits of the multiplexers in the critical
path are made available well ahead of the inputs, so that the critical path delay is
minimized. For example, the Cout2 output from the previous, less significant
compressor block is utilized as a select bit one stage after it is produced, so that the
MUX block is already switched on and the output is produced as soon as the inputs
arrive. Likewise, if the output of one multiplexer is used as the select bit of another
multiplexer, it can be used efficiently in a similar manner, because the negation of the
select bit is also available, as shown in figure 3.7; thus an extra stage to compute the
negation is saved. Similarly, replacing the XOR block in the second stage with a
MUX block reduces the delay, because the select bit X3 is already available and the
transistor switching happens in parallel with the computation of the block's inputs.

Figure 3.10 Efficient design of 5:2 compressor

As mentioned before, in all the usual implementations of the XOR or MUX block,
and in particular the CMOS implementation, both the output and its complement are
generated. In the existing design this advantage is not fully utilized. In the proposed
design these outputs are used efficiently by employing multiplexers at particular
stages in the circuit, and additional inverter stages are eliminated. This in turn
contributes to the reduction of delay, power consumption and transistor count (area).
The equations governing the outputs are shown below:

Sum = X1 ⊕ X2 ⊕ X3 ⊕ X4 ⊕ X5 ⊕ Cin1 ⊕ Cin2    (3.11)

Cout1 = (X1 + X2)·X3 + X1·X2    (3.12)

Cout2 = (X4 ⊕ X5)·Cin1 + (X4 ⊕ X5)'·X4    (3.13)

temp = X1 ⊕ X2 ⊕ X3 ⊕ X4 ⊕ X5 ⊕ Cin1
Carry = temp·Cin2 + temp'·(X1 ⊕ X2 ⊕ X3)    (3.14)
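These output equations can be cross-checked against the defining relation (3.3) with a bit-level sketch (complements are written as `1 - x`):

```python
def compressor_5_2_eqs(x1, x2, x3, x4, x5, cin1, cin2):
    """5:2 compressor outputs per equations (3.11)-(3.14)."""
    s = x1 ^ x2 ^ x3 ^ x4 ^ x5 ^ cin1 ^ cin2               # (3.11)
    cout1 = ((x1 | x2) & x3) | (x1 & x2)                   # (3.12)
    sel = x4 ^ x5
    cout2 = (sel & cin1) | ((1 - sel) & x4)                # (3.13)
    temp = x1 ^ x2 ^ x3 ^ x4 ^ x5 ^ cin1
    carry = (temp & cin2) | ((1 - temp) & (x1 ^ x2 ^ x3))  # (3.14)
    return s, carry, cout1, cout2

# The outputs satisfy equation (3.3) for all 128 input combinations:
for v in range(128):
    bits = [(v >> k) & 1 for k in range(7)]
    s, carry, c1, c2 = compressor_5_2_eqs(*bits)
    assert sum(bits) == s + 2 * (carry + c1 + c2)
```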
In the carry generation module (CGEN) shown in figure 3.10, equation (3.12)
above is used to design the CMOS implementation of Cout1, as shown in figure 3.11.

Figure 3.11 Carry Generation Module (CGEN)


CHAPTER-4
INTRODUCTION TO MULTIPLIERS
Multipliers are among the fundamental components of many digital systems
and hence their power dissipation and speed are of primary concern [19]. For portable
applications where the power consumption is the most important parameter, one
should reduce the power dissipation as much as possible. One of the best ways to
reduce the dynamic power dissipation is to minimize the total switching activity, i.e.,
the total number of signal transitions of the system.
Multiplication plays an essential role in computer arithmetic for both
general-purpose processors and digital signal processors, particularly in
computationally intensive multimedia algorithms such as Finite Impulse Response
(FIR) filters, Infinite Impulse Response (IIR) filters and the Fast Fourier Transform
(FFT).
In a popular array multiplication scheme the summation of partial products
proceeds in a more regular but slower manner: only one row of bits in the matrix is
eliminated at each stage of the summation. In a parallel multiplier the partial
products are generated using an array of AND gates. The main problem is the
summation of the partial products, and it is the time taken to perform this summation
which determines the maximum speed at which a multiplier may operate. The
Wallace scheme essentially minimizes the number of adder stages required to
perform the summation of partial products. This is achieved by using full and half
adders. A Wallace multiplier consists of three stages: the partial product matrix is
formed in the first stage by N^2 AND gates; in the second stage, the partial product
matrix is reduced to a height of two; in the final stage, the remaining two rows are
summed by a carry-propagating adder. Dadda replaced Wallace's pseudo adders with
parallel (n, m) counters. A parallel (n, m) counter is a circuit which has n inputs and
produces m outputs which give a binary count of the ones present at its inputs.

A full adder is an implementation of a (3, 2) counter, which takes 3 inputs
and produces 2 outputs. Similarly, a half adder is an implementation of a (2, 2)
counter, which takes 2 inputs and produces 2 outputs. Dadda multipliers have a less
expensive reduction phase, but the final numbers may be a few bits longer, thus
requiring a slightly bigger carry-propagating adder.
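A behavioural model of a parallel (n, m) counter follows; taking the output width as m = ceil(log2(n+1)) is our assumption, consistent with the definition above:

```python
import math

def counter(*bits):
    """A parallel (n, m) counter: m = ceil(log2(n+1)) output bits
    giving the binary count of 1s among the n inputs (MSB first)."""
    n = len(bits)
    m = math.ceil(math.log2(n + 1))
    count = sum(bits)
    return [(count >> k) & 1 for k in range(m - 1, -1, -1)]
```

With three inputs this reduces to the full adder, `counter(1, 1, 1) == [1, 1]`, and with two inputs to the half adder, `counter(1, 0) == [0, 1]`.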
In general, the product p of two n-bit unsigned binary numbers x and y may be
expressed as follows:

(p_(2n-1) p_(2n-2) ... p_1 p_0) = Σ_(i=0)^(n-1) { y_i · (x_(n-1) ... x_0) } · 2^i    (4.1)

In a parallel multiplier, the terms y_i · (x_(n-1) ... x_0) are known as the partial
products and are generated using an array of AND gates. The shifting term 2^i is
inherent in the wiring and does not require any explicit hardware. Thus the main
problem is the summation of the partial products, and it is the time taken to perform
this summation which determines the maximum speed at which a multiplier may
operate.
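Equation (4.1) can be checked behaviourally by forming each partial-product row and accumulating it with its wiring shift:

```python
def product_by_rows(x, y, n):
    """Product per equation (4.1): row i is y_i AND-ed with every bit
    of x; the 2^i weighting is just a left shift (wiring, no gates)."""
    p = 0
    for i in range(n):
        y_i = (y >> i) & 1
        row = sum((y_i & ((x >> j) & 1)) << j for j in range(n))
        p += row << i
    return p
```

For any pair of n-bit operands this reproduces the ordinary product, e.g. `product_by_rows(183, 90, 8) == 183 * 90`.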
The realization of a parallel multiplier for digital computers has been
considered by C.S. Wallace, who proposed a tree of pseudo-adders (that means adders
without carry propagation) producing two numbers, whose sum equals the product.
This sum can be obtained by applying the two numbers to a carry-propagating adder.

4.1 Multiplier Schemes:

There are two basic schemes in the multiplication process: serial
multiplication and parallel multiplication.


Serial Multiplication (Shift-Add)
This scheme computes a set of partial products one at a time and accumulates
them. The implementations are primitive, with simple architectures, and are used
when there is no dedicated hardware multiplier.

Parallel Multiplication
Partial products are generated simultaneously. Parallel implementations are
used in high-performance machines, where computation latency needs to be
minimized.

Comparing the two, parallel multiplication has the advantage over serial
multiplication: it takes fewer steps and therefore performs faster.

4.2 Baugh-Wooley multiplier:


This is the most basic form of binary multiplier construction. Its principle is
exactly that of pen-and-paper multiplication. It consists of a highly regular array of
full adders, the exact number depending on the length of the binary numbers to be
multiplied. Each row of this array generates a partial product, which is then added to
the sum and carry generated in the next row. The final result of the multiplication is
obtained directly after the last row. The AND terms are generated using logic AND
gates. The full adder (FA) implementation takes the two bits (A, B) and carry-in (Ci)
as inputs and produces sum (S) and carry-out (Co) as outputs.


Figure 4.1 Hardware Architecture of Baugh Wooley Multiplier

4.2.1 Principle

Figure 4.2 Principle in multiplier


Due to the highly regular structure, array multiplier is very easily constructed and also
can be densely implemented in VLSI, which takes less space. But compared to other
26

multiplier structures proposed later, it shows a high computational time. In fact, the
computational time is of order of log O(N), one of the highest in any multiplier
structure.
The Baugh-Wooley multiplier is used for both unsigned and signed number
multiplication. Signed operands are represented in 2's complement form. The partial
products are adjusted so that the negative signs move to the last step, which in turn
maximizes the regularity of the multiplication array. Operating on 2's complement
operands, the Baugh-Wooley scheme ensures that the signs of all partial products are
positive. To reiterate, the numerical value of an n-bit 2's complement number X is
X = −x_{n−1}·2^{n−1} + Σ_{i=0}^{n−2} x_i·2^i.
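The 2's complement value convention can be sanity-checked with a small Python helper (a sketch; the function name and LSB-first bit order are illustrative assumptions):

```python
def twos_complement_value(bits):
    """Numerical value of an n-bit 2's complement word given as a bit list
    [x0, x1, ..., x_{n-1}] from LSB to MSB:
    X = -x_{n-1} * 2^(n-1) + sum_{i < n-1} x_i * 2^i."""
    n = len(bits)
    return -bits[-1] * 2 ** (n - 1) + sum(b << i for i, b in enumerate(bits[:-1]))
```

For example, the 4-bit word 1101 (bits [1, 0, 1, 1] LSB-first) evaluates to −8 + 5 = −3, matching the usual 2's complement reading.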

Figure 4.3 Baugh-Wooley Architecture for 4*4 multiplication

4.3 Advantages

Minimum complexity.

Easily scalable.

Easily pipelined.

Regular shape, easy to place & route.

4.4 Disadvantages

High power consumption.

More digital gates resulting in large chip area.

4.5 Wallace Tree Multiplier :


A Wallace tree is an efficient hardware implementation of a digital circuit that
multiplies two integers. For an N×N-bit multiplication, the partial products are
formed by N² AND gates. The N rows of partial products are then grouped into sets
of three rows each; any rows that are not a member of these groups are transferred to
the next level without modification. For a column holding three partial-product bits,
a full adder is used, with the sum dropped down to the same column and the carry-out
brought to the next higher column. For a column with two partial products, a half
adder is used in place of a full adder, as shown in figure 4.4. At the final stage, a
carry-propagate adder adds up the remaining rows to produce the final result. The
tree can also be implemented using carry-save adders, and it is sometimes combined
with Booth encoding. Various other research has been done to reduce the number of
adders for higher-order widths such as 16 and 32 bits. Applications include DSP
blocks performing FFT, FIR filtering, etc.


Figure 4.4 Wallace Tree Architecture

4.5.1 Function
The Wallace tree has three steps:

Multiply (that is, AND) each bit of one of the arguments by each bit of the other,
yielding n² results. Depending on the position of the multiplied bits, the wires carry
different weights; for example, the wire carrying the result of a2·b3 has weight
2^(2+3) = 32.

Reduce the number of partial products to two using layers of full and half adders.

Group the wires into two numbers, and add them with a conventional adder.
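The three steps above can be modelled at the bit level in Python (a behavioral sketch of the reduction, not a gate-accurate netlist; the column-bucket representation is an illustrative choice):

```python
from collections import defaultdict

def wallace_multiply(a, b, n=4):
    """Wallace-style multiplication of two n-bit unsigned integers."""
    # Step 1: AND every bit pair; bucket the n*n product bits by weight i+j.
    cols = defaultdict(list)
    for i in range(n):
        for j in range(n):
            cols[i + j].append((a >> i & 1) & (b >> j & 1))
    # Step 2: one tree layer per pass, until every column holds <= 2 bits.
    while any(len(c) > 2 for c in cols.values()):
        nxt = defaultdict(list)
        for w, c in cols.items():
            while len(c) >= 3:                  # full adder: 3 bits -> sum, carry
                x, y, z = c.pop(), c.pop(), c.pop()
                nxt[w].append(x ^ y ^ z)
                nxt[w + 1].append((x & y) | (x & z) | (y & z))
            if len(c) == 2:                     # half adder: 2 bits -> sum, carry
                x, y = c.pop(), c.pop()
                nxt[w].append(x ^ y)
                nxt[w + 1].append(x & y)
            nxt[w].extend(c)                    # a lone bit passes through
        cols = nxt
    # Step 3: sum the two remaining rows (the final carry-propagate adder).
    return sum(bit << w for w, c in cols.items() for bit in c)
```

Each pass of the `while` loop corresponds to one reduction layer of the tree, so the number of passes grows only logarithmically with the number of rows.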

4.5.2 Structure of 4 bit Wallace:


Figure 4.5 Wallace tree structure for 4 bit multiplication

4.5.3 Example:


Figure 4.6 Wallace Example for 4 bit multiplication

4.6 Advantages:

Each layer of the tree reduces the number of vectors by a factor of 3:2.

Minimum propagation delay.

The benefit of the Wallace tree is that there are only O(log n) reduction layers,
whereas adding the partial products with regular adders would require O((log n)²) time.

4.7 Disadvantages:

Wallace trees do not provide any advantage over ripple adder trees in many
FPGAs.

Due to the irregular routing, they may actually be slower and are certainly
more difficult to route.

The adder structure grows as the multiplication width increases.


CHAPTER 5
IMPLEMENTATION OF FIXED WIDTH MULTIPLIER
USING ALGORITHMIC NOISE
TOLERANT ARCHITECTURE (ANT)
The algorithmic noise tolerant (ANT) architecture is an effective method for
error compensation in high-throughput DSP applications. An RPR-based ANT
system employs a replica block to catch the higher-magnitude errors of the main block.

5.1 ANT Architecture Design


The ANT technique includes both a main digital signal processor (MDSP) and an error
correction (EC) block, as shown in Figure 5.1.

Figure 5.1 Block Diagram of proposed work


Let x and y be two n-bit unsigned numbers:

x = Σ_{i=0}^{n−1} x_i·2^i ,   y = Σ_{j=0}^{n−1} y_j·2^j        (5.1)

Their product is given by the following equation:

P = Σ_{k=0}^{2n−1} P_k·2^k = Σ_{i=0}^{n−1} Σ_{j=0}^{n−1} x_i·y_j·2^{i+j}        (5.2)

Generally, a multiplier with n-bit inputs produces a 2n-bit output. A fixed-width
multiplier instead produces an output of n bits (or fewer), with reduced precision,
by rounding off or truncating the lower-order bits. The product of a fixed-width
multiplier may not be the exact result obtained with a full-width multiplier, but it
provides the advantage of less area and less delay compared to the full-width design.
The precision of the fixed-width result can be improved by compensating for the
truncation error, using error compensation functions. In our proposed work this is
done by feeding the lower-order truncated bits to the major part after truncation to
provide compensation.
In the ANT technique, a replica of the MDSP with reduced-precision
operands and shorter computation delay is used as the EC block. Under voltage
over-scaling (VOS), there are a number of input-dependent soft errors in the MDSP
output ya[n]; however, the RPR output yr[n] is still correct, since the critical-path
delay of the replica is smaller than Tsamp. Therefore, yr[n] is applied to detect errors
in the MDSP output ya[n]. Error detection is accomplished by comparing the
difference |ya[n] − yr[n]| against a threshold Th. Once the difference between ya[n]
and yr[n] is larger than Th, the output y[n] is yr[n] instead of ya[n]. As a result,
y[n] can be expressed as

y[n] = ya[n], if |ya[n] − yr[n]| ≤ Th
y[n] = yr[n], if |ya[n] − yr[n]| > Th        (5.3)

The threshold value is given by

Th = max over all inputs of |yo[n] − yr[n]|        (5.4)

where yo[n] is the error-free output and yr[n] is the output of the RPR block.
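The decision rule of equations (5.3) and (5.4) amounts to a compare-and-select, sketched below in Python with a hypothetical replica that truncates the 8 LSBs of the product (the truncation width and 8-bit input range are illustrative assumptions, not the thesis' exact RPR):

```python
def rpr_multiply(x, y, n=8):
    """Stand-in reduced-precision replica: the exact product with its
    n least significant bits truncated, re-aligned to full weight."""
    return (x * y >> n) << n

# Eq. (5.4): Th is the worst-case deviation of the replica from the
# error-free output over all inputs (here: all 8-bit operand pairs).
TH = max(abs(x * y - rpr_multiply(x, y)) for x in range(256) for y in range(256))

def ant_output(ya, yr, th=TH):
    """Eq. (5.3): keep the MDSP output ya while it stays within Th of the
    replica output yr; otherwise fall back to yr."""
    return ya if abs(ya - yr) <= th else yr
```

A VOS-induced soft error that flips a high-order bit of ya pushes the difference past Th, so the replica output is selected; small deviations within Th leave the (more precise) MDSP output untouched.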


5.2 Proposed ANT with fixed width RPR


In this work, we further propose a fixed-width RPR to replace the full-width
RPR block in the ANT design, as shown in Figure 5.1. It not only provides higher
computation precision, lower power consumption, and lower area overhead in the
RPR itself, but also yields higher SNR, better area efficiency, lower operating supply
voltage, and lower power consumption for the ANT architecture as a whole. We
demonstrate our fixed-width RPR-based ANT design in an ANT multiplier.
Fixed-width designs are usually applied in DSP applications to avoid
unbounded growth of bit width. Cutting off the n least significant bits (LSBs) of the
output is a popular way to construct a fixed-width DSP block with n-bit input and
n-bit output. The hardware complexity and power consumption of such a fixed-width
DSP block are usually about half those of the full-length one. However, truncating
the LSB part introduces a rounding error, which must be compensated precisely.
Much of the literature reduces the truncation error with either a constant correction
value or a variable correction value. The circuitry needed for a constant correction
value is simpler than that for a variable correction value; however, the variable-correction
approaches are usually more precise.
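The cost of plain LSB truncation, and the target of a constant correction, can be made concrete with a small Python experiment (the 8-bit operand width is an illustrative assumption; errors are measured in units of one output LSB):

```python
def fixed_width_output(x, y, n=8):
    """n-bit x n-bit -> n-bit fixed-width product: the 2n-bit product with
    its n LSBs cut off (plain truncation, no compensation)."""
    return (x * y) >> n

# Rounding error of plain truncation, in units of one output LSB (2^n),
# measured exhaustively over all 8-bit operand pairs.
errors = [x * y / 2 ** 8 - fixed_width_output(x, y)
          for x in range(256) for y in range(256)]
avg_err = sum(errors) / len(errors)   # target for a constant correction
max_err = max(errors)                 # always below one output LSB
```

A constant-correction design simply adds a value close to `avg_err` (roughly half an LSB) back into the kept part; a variable-correction design instead derives the correction from the truncated partial products themselves, as described next.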
However, in the fixed-width RPR of an ANT multiplier, the compensation
error we need to correct is the overall truncation error of the MDSP block. Our
compensation method compensates the truncation error between the full-length
MDSP multiplier and the fixed-width RPR multiplier. Nowadays, there are many
fixed-width multiplier designs applied to full-width multipliers; however, there is
still no fixed-width RPR design applied to ANT multiplier designs.
To achieve more precise error compensation, we compensate the truncation
error with a variable correction value. We construct the error compensation circuit
mainly from the partial-product terms with the largest weight in the least significant
segment. The error compensation algorithm makes use of probability, statistics, and
linear regression analysis to find the approximate compensation value. To save
hardware complexity, the compensation vector formed by the partial-product terms
with the largest weight in the least significant segment is injected directly into the
fixed-width RPR, so no extra compensation logic gates are needed.

5.3 Error Compensation Vector for Fixed width RPR Design


5.3.1 Error compensation using AO Cells
To design a low-error fixed-width multiplier, we first analyze the source of
the errors generated by the truncated multiplier, and then derive a small carry-generating
circuit Cg that feeds each carry input of the truncated array to reduce the errors
effectively. The error is the difference between the two products produced by the
standard multiplier and by the truncated circuit, and it is caused by the carries
generated from the column Pn−1 in the least significant part. The sum of the weights
of the Pn−1 column is greater than the sum of the weights of the remaining truncated
columns, so Cg can easily be derived mainly from that column.

Figure 5.2 Error Compensation considering ICV using AO cell



5.3.2 Error compensation considering ICV and MICV


The partial-product array of the (n/2)-bit unsigned fixed-width RPR of an n-bit
multiplier can be divided into four subsets: the most significant part (MSP), the
input correction vector (ICV), the minor input correction vector (MICV), and the
least significant part (LSP). In a fixed-width multiplier only the MSP part is kept;
the remaining parts are removed.
The error caused by this truncation can be compensated by feeding the ICV
and MICV into the MSP, because the ICV and MICV have higher weights than the
other terms in the truncated part. Let β denote the sum of all partial-product terms
in the ICV, as shown in Figure 5.3.

Figure 5.3: Partial product array of unsigned Baugh-Wooley multiplier

The other three parts, the ICV, the MICV, and the LSP, are together called the
truncated part. The truncated ICV and MICV are the most important parts because
of their highest weighting; therefore, they can be applied to construct the
truncation-error compensation algorithm.
To evaluate the accuracy of a fixed-width RPR, we can exploit the difference
between the (n/2)-bit fixed-width RPR output and the 2n-bit full-length MDSP output,
which is expressed as

ε = P − Pt        (5.5)

where P is the output of the complete multiplier in the MDSP and Pt is the output of
the fixed-width multiplier in the RPR.


The source of the errors generated in the fixed-width RPR is dominated by the bit
products of the ICV, since they have the largest weight. It has been reported that a
low-cost EC circuit can be designed easily if a simple relationship between f(EC)
and β is found, where β is the summation of all partial products of the ICV. By
statistically analyzing the truncated difference between the MDSP and the fixed-width
RPR under a uniform input distribution, we can find the relationship between f(EC)
and β.
The statistical results show that the average truncation error in the fixed-width
RPR multiplier is approximately distributed between β and β + 1. More precisely,
when β = 0 the average truncation error is close to β + 1, and when β > 0 it is
very close to β. If we select β as the compensation vector, the compensation vector
can be injected directly into the fixed-width RPR, requiring no extra compensation
logic gates. We can therefore apply multiple input error compensation vectors to
further enhance the compensation precision: for the β > 0 case we still select β as
the compensation vector, while for the β = 0 case we select β + 1, combined with
the MICV, as the compensation vector.
The figure below shows the architecture for error compensation.

Figure 5.4 Error compensation vector considering ICV & MICV


The compensation vector based on β is realized by directly injecting the
partial-product terms X_{n−1}Y_{n/2}, X_{n−2}Y_{(n/2)+1}, X_{n−3}Y_{(n/2)+2}, . . . ,
X_{(n/2)+2}Y_{n−2}. These directly injected compensation terms are labeled
C1, C2, C3, . . . , C_{(n/2)−1} in Figure 5.4.
The other compensation vector, used to mend the insufficient-compensation case,
is constructed from one conditionally controlled OR gate. One input of the OR gate
is fed by X_{n/2}Y_{n−1}, which realizes the function of the compensation vector.
The other input, Cm, is controlled by the judgment formula that decides whether
β = 0 while the MICV sum is nonzero. Cm is then combined with X_{n/2}Y_{n−1}
through the two-input OR gate to correct the insufficient error compensation.
Accordingly, in the case where β = 0 and the MICV sum is nonzero, one additional
carry-in signal C_{n/2} is injected into the compensation vector to modify the
compensation value to β + 1 instead of β. Moreover, the carry-in signal C_{n/2} is
injected at the bottom of the error compensation vector, which is the location
farthest from the critical path.
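The benefit of directly injecting the ICV terms can be demonstrated with a small Python model of an unsigned fixed-width multiplier (a behavioral sketch; the 8-bit width is an illustrative assumption, and the injection is modelled arithmetically rather than as the thesis' wiring):

```python
def pp(x, y, i, j):
    """One partial-product bit x_i * y_j."""
    return (x >> i & 1) & (y >> j & 1)

def msp_product(x, y, n=8, inject_icv=True):
    """Fixed-width product keeping only the MSP columns (weight >= 2^n).
    Optionally the ICV column (weight 2^(n-1), the heaviest truncated
    column) is injected directly as compensation -- no extra adder logic,
    just extra inputs to the existing array."""
    msp = sum(pp(x, y, i, j) << (i + j)
              for i in range(n) for j in range(n) if i + j >= n)
    if inject_icv:
        beta = sum(pp(x, y, i, j)
                   for i in range(n) for j in range(n) if i + j == n - 1)
        msp += beta << (n - 1)      # ICV bits enter at their true weight
    return msp >> n
```

Measured over the 8-bit input space, the ICV-injected version tracks the exact truncated product (x·y) >> n noticeably more closely than the plain MSP-only design, which is the point of the ICV compensation.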
In our proposed work, the compressors used in the main block further reduce the
critical-path delay and power consumption. Reducing the critical-path delay reduces
the error rate, and the performance of the error compensation is evaluated through
error analysis and tabulated. The main block is implemented as a Baugh-Wooley
multiplier using compressors and as a Wallace tree structure using compressors, and
the results are analyzed in the next chapter.


CHAPTER-6
EXPERIMENT RESULTS
6.1 Compressors
A compressor is one strategy that can be adopted for multi-operand addition
in an effective manner. Hence 3:2, 4:2 and 5:2 compressors are used to reduce the
critical-path delay in the design of the fixed-width multipliers. The RTL schematic
of the 3:2 compressor is shown in figure 6.1.

Figure 6.1 3:2 compressor module
The internal RTL schematic of 3:2 compressor is shown in figure 6.2.

Figure 6.2 RTL Schematic of 3:2 compressor module



Figure 6.3 shows the RTL schematic of the 4:2 compressor.

Figure 6.3 4:2 compressor module


Figure 6.4 shows the internal RTL schematic of the 4:2 compressor.

Figure 6.4 RTL Schematic of 4:2 compressor module


The RTL schematic of 5:2 compressor is shown in Figure 6.5.

Figure 6.5 5:2 compressor module

Figure 6.6 RTL Schematic of 5:2 compressor module
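The arithmetic contract of these compressors — several same-weight input bits compressed into one sum bit plus higher-weight carries — can be sketched in Python. The 4:2 compressor below is built from two chained 3:2 compressors, which is one common construction (a behavioral sketch, not the exact gate netlist synthesized in this work):

```python
def compressor_3_2(a, b, c):
    """3:2 compressor (full adder): a + b + c = sum + 2*carry."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)

def compressor_4_2(x1, x2, x3, x4, cin):
    """4:2 compressor: x1 + x2 + x3 + x4 + cin = sum + 2*(carry + cout).
    cout depends only on x1..x3, so it can ripple to the next slice
    without creating a carry chain through cin."""
    s1, cout = compressor_3_2(x1, x2, x3)
    s, carry = compressor_3_2(s1, x4, cin)
    return s, carry, cout
```

An exhaustive check of all 32 input combinations confirms the weight identity, which is what lets whole columns of partial products be compressed in a single step.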

6.2 Baugh-Wooley Multiplier Using Compressors:


The main block in our work is built with Baugh-Wooley and Wallace tree
multipliers in which the adders are replaced with compressors, so that the
critical-path delay can be reduced and multi-operand addition is possible.


Figure 6.7 shows the RTL schematic of the 16-bit Baugh-Wooley multiplier using compressors.

Figure 6.7: Baugh-Wooley using compressors module


Figure 6.8 shows the internal RTL schematic of the 16-bit Baugh-Wooley multiplier using compressors.

Figure 6.8: RTL Schematic of Baugh-Wooley using compressors


The simulation result of the 16-bit Baugh-Wooley multiplier using compressors is
shown in figure 6.9.

Figure 6.9: Simulation Result of Baugh-Wooley using compressors

6.3 Wallace Tree Multiplier using Compressors:


The 16-bit Wallace tree multiplier is implemented using 3:2, 4:2 and 5:2
compressors, which speed up the partial-product reduction. The RTL schematic and
the output waveform of the 16-bit multiplier for different combinations of inputs are
shown below.
Figure 6.10 shows the RTL schematic of the 16-bit Wallace tree multiplier using compressors (main block).

Figure 6.10: Wallace tree multiplier using compressors


Figure 6.11 shows the internal RTL schematic of the 16-bit Wallace tree multiplier
using compressors.

Figure 6.11: RTL schematic of Wallace tree multiplier using compressors


The Figure 6.12 shows the simulation result of 16 bit Wallace tree using compressors.

Figure 6.12 output waveform of Wallace tree multiplier using compressors

6.4 16-bit RPR block with compensation (ICV & MICV)


The 16-bit RPR block is implemented with compensation based on the Input
Correction Vector (ICV) and the Minor Input Correction Vector (MICV). The RTL
schematic and the waveforms for different combinations of inputs are shown in
figures 6.13, 6.14 and 6.15.

Figure 6.13 RPR of 16-bit module

Figure 6.14 RTL schematic of RPR of 16 bit size with compensation


The simulation result of the 16-bit RPR block with compensation (ICV & MICV) is
shown in figure 6.15.

Figure 6.15 Output waveform of RPR of 16 bit size with compensation


6.5 ANT architecture with 16 bit RPR using compressors:


The 16 bit Baugh-Wooley multiplier and Wallace tree multiplier using ANT
Architecture with compressors are designed and their respective results are discussed
below.
The RTL schematic of the ANT architecture implemented using the Baugh-Wooley
multiplier with compressors is shown in Figure 6.16.

Figure 6.16 ANT with RPR of 16 bit using Baugh-Wooley with compressors


The internal RTL schematic of the ANT architecture implemented using the
Baugh-Wooley multiplier with compressors is shown in Figure 6.17.


Figure 6.17 ANT with RPR 16 bit & Baugh-Wooley with compressors
The figure 6.18 shows the simulation result of ANT Architecture.

Figure 6.18 Simulation result of ANT with RPR(16 bit) & Baugh-Wooley with
compressors
The RTL schematic of ANT Architecture implemented using Wallace tree with
compressors is shown in figure 6.19.

Figure 6.19 ANT with RPR of 16 bit using Wallace tree with compressors


The internal RTL schematic of ANT Architecture implemented using Wallace


tree with compressors is shown in figure 6.20.

Figure 6.20 ANT with RPR of 16 bit & Wallace tree with compressors
The figure 6.21 shows the simulation result of ANT Architecture implemented
using Wallace tree with compressors.

Figure 6.21 Simulation result of ANT with RPR of 16 bit & Wallace tree using
compressors


6.6 Result analysis:


6.6.1 Error Analysis:
The RPR block, being a fixed-width multiplier, suffers a truncation error due to the
truncation of the lower-order bits; hence, to compensate for the truncation error, the
error compensation algorithm is implemented using the truncated partial products.
The error analysis of the different error compensation algorithms for the 8-bit and
16-bit RPR is tabulated below in Table 6.1.
Table 6.1 Error Analysis

Architecture                     Mean error of     Mean error of
                                 RPR of 8 bit      RPR of 16 bit
RPR without compensation         67.9717%          20.0187%
RPR with AO cell compensation    37.107%           10.9157%
RPR considering ICV & MICV       28.902%           2.558%

6.6.2 Area analysis:


The device utilization of the ANT architecture with the different multipliers and with
16-bit and 8-bit RPR is compared and tabulated in Table 6.2. The Virtex-6 FPGA is
used for synthesis.
Table 6.2 Gate count of ANT architecture

Device count             Baugh-Wooley          Baugh-Wooley        Wallace using
                         without compressors   using compressors   compressors
ANT with RPR of 16 bit   923                   864                 837
ANT with RPR of 8 bit    764                   740                 699

The ANT architecture of the Baugh-Wooley and Wallace tree multipliers using
compressors shows reduced device utilization compared to the existing Baugh-Wooley
design without compressors.
The ANT architecture of the Baugh-Wooley multiplier using compressors reduces
device utilization by 6.39% and 3.14% for the 16-bit and 8-bit RPR, respectively,
compared to the existing Baugh-Wooley design without compressors.
The ANT architecture of the Wallace tree multiplier using compressors shows a
further slight reduction in device utilization compared to Baugh-Wooley with
compressors.

6.6.3 Delay analysis:


The delay of the ANT architecture with RPR, using the Baugh-Wooley and Wallace
tree multipliers with compressors, is analyzed and tabulated in Table 6.3 below. The
results show a decrease in delay of 8.61% and 23.07% for the Baugh-Wooley
multiplier using compressors compared to the previous work [1], and a decrease in
delay of 10.61% and 32.56% for the Wallace tree multiplier using compressors
compared to the previous work.
Table 6.3 Delay Analysis of ANT architecture

Architecture             Existing work [1]   Baugh-Wooley using       Wallace using
                         Delay (ns)          compressors Delay (ns)   compressors Delay (ns)
ANT with RPR of 16 bit   21.26               20.09                    16.22
ANT with RPR of 8 bit    21.09               19.00                    14.22

6.6.4 Power analysis:


The power analysis of the ANT-architecture-based 16-bit fixed-width
multiplier implementations is shown in Table 6.4 below. The power consumed by the
ANT architecture using the Baugh-Wooley multiplier with compressors is 15.39%
lower than that of the ANT architecture implemented using Baugh-Wooley without
compressors.
Table 6.4 Power Analysis of ANT architecture based designs

ANT Architecture based Designs      Total power (mW)   Dynamic power (mW)   Quiescent power (mW)
Baugh-Wooley without compressors    16.69              2.97                 13.12
Baugh-Wooley with compressors       14.12              2.88                 11.24
Wallace tree with compressors       14.16              2.92                 11.24

In the case of the Wallace tree design with compressors, the power reduction is
almost the same as that of the Baugh-Wooley design. The power analysis is
performed using the Xilinx XPower Analyzer tool.


CHAPTER-7
CONCLUSION & FUTURE SCOPE
7.1 Conclusion
The multiplier is the most important block in DSP and FFT processors, and designing
it with 3:2, 4:2 and 5:2 compressors reduces critical-path delay and power. The
proposed work is simulated using the ISim simulator, and synthesis is carried out
using the Xilinx ISE synthesis tool on the Xilinx platform. The reports show better
results compared to the previous work in terms of delay, power and error analysis.
The proposed system reduces power by up to 15.39% by implementing Baugh-Wooley
using compressors, compared to the previous work.

7.2 Future Scope


Fixed-width multipliers reduce area, power and delay. The current work
demonstrated the efficiency of the Algorithmic Noise Tolerant architecture-based
fixed-width multiplier in these aspects. The same design can be used in the
implementation of FIR filters, MAC units and FFT processors.
The error rate and the power can be further reduced by using more advanced
compensation algorithms and architectures, respectively, to design fixed-width
multipliers with better performance.


REFERENCES
[1] I-Chyn Wey, Chien-Chang Peng, and Feng-Yu Liao, "Reliable Low-Power
Multiplier Design Using Fixed-Width Replica Redundancy Block," IEEE
Transactions on Very Large Scale Integration (VLSI) Systems, vol. 23, no. 1,
pp. 78-87, January 2015.
[2] Sreehari Veeramachaneni, Kirthi M. Krishna, Lingamneni Avinash, Sreekanth
Reddy Puppala, and M. B. Srinivas, "Novel Architectures for High-Speed and
Low-Power 3-2, 4-2 and 5-2 Compressors," 20th International Conference on
VLSI Design, pp. 324-329, Jan. 2007.
[3] S. K. Sunder, E. Fayez, and A. Antoniou, "Area efficient multipliers for digital
signal processing applications," IEEE Trans. on Circuits and Syst., vol. 43,
no. 2, Feb. 1996.
[4] M. J. Schulte and E. E. Swartzlander Jr., "Truncated multiplication with
correction constant," IEEE Trans. VLSI Systems, vol. I, pp. 388-396, May
1993.
[5] E. E. Swartzlander Jr., "Truncated multiplication with approximate rounding,"
in Proc. 33rd Asilomar Conference on Signals, Systems, and Computers,
pp. 1480-1483, 1999.
[6] J. M. Jou, S. R. Kuang, and R. D. Chen, "Design of low-error fixed-width
multiplier for DSP applications," IEEE Trans. Circuits and Syst., vol. 46,
pp. 836-842, June 1999.
[7] H. Park and E. E. Swartzlander, "Truncated multiplication with symmetric
correction," 40th Asilomar Conference on Signals, Systems and Computers,
pp. 931-934, Sept. 2006.
[8] G. M. Strollo, N. Petra, and D. De Caro, "Dual-tree error compensation for
high performance fixed-width multipliers," IEEE Trans. on Circuits and Syst.
II: Express Briefs, vol. 52, pp. 501-507, Aug. 2005.
[9] L. Da Van, S. S. Wang, and W. S. Feng, "Design of the lower error fixed-width
multiplier and its application," IEEE Trans. on Circuits and Syst. II: Analog
and Digital Signal Processing, vol. 47, pp. 1112-1118, Oct. 2000.

PUBLICATION
N. Sai Mani Bharath and H. Phanendra Babu, "Reliable Low Power Design of
Fixed width Multiplier using Algorithmic Noise Tolerant Architecture,"
International Journal of Science and Research (IJSR), Google Scholar
Publication (communicated).