



In recent years, digital communication technologies have developed rapidly, bringing with them a growing concern about security in computer and communication systems. Several public-key cryptosystems have been proposed to enable the encryption of messages using a public encryption key e without prior communication of a secret key. The secrecy relies on the fact that the decryption key is computationally infeasible to deduce from the public encryption key. Thus, the only person who can decrypt the ciphertext is the receiver, who knows the secret decryption key d.
Public-key cryptography plays an important role in digital communication and storage systems. Processing public-key cryptosystems requires a huge amount of computation, and there is therefore a great demand for dedicated hardware to speed up the computations. Speeding up the computation using specialized hardware enables the use of larger keys in public-key cryptosystems, which translates into an increase in the security of the system. It also speeds up a secure link between two distant points over an insecure channel, which is critical in real-time systems. Reducing the amount of hardware is another important aspect of dedicated hardware implementation, because it allows for the miniaturization of portable devices and reduces fabrication costs.
The Residue Number System (RNS) is a non-weighted number system that can map large numbers to smaller residues without any need for carry propagation. Its most important property is that additions, subtractions, and multiplications are inherently carry-free. These arithmetic operations can be performed on the residue digits concurrently and independently. Thus, using residue arithmetic would, in principle, increase the speed of computations. RNS has shown high efficiency in realizing special-purpose applications such as digital filters, image processing, RSA cryptography, and other applications in which only additions, subtractions, and multiplications are used and the dynamic range is fixed in advance. Special moduli sets have been used extensively to reduce the hardware complexity in the implementation of converters and arithmetic operations. Among these, the triple moduli set {2^n+1, 2^n, 2^n-1} has particular benefits. Since multiplication is of major importance for almost all kinds of processors, efficient implementation of multiplication modulo 2^n-1 is important for such applications.


A residue number system is characterized by a base that is not a single radix but an N-tuple of integers (m_N, m_{N-1}, ..., m_1). Each of these m_i (i = 1, 2, ..., N) is called a modulus. An integer X is represented in the residue number system by the N-tuple (x_N, x_{N-1}, ..., x_1), where x_i is a nonnegative integer satisfying

X = m_i * q_i + x_i ,    (1)

where q_i is the largest integer such that 0 <= x_i <= (m_i - 1). x_i is known as the residue of X modulo m_i, and the notations X mod m_i and |X|_{m_i} are commonly used.
The RNS divides an integer into a number of smaller integers (i.e., with a shorter binary representation) that can be processed in parallel, independently of each other. This provides a speed-up for arithmetic operations whose delay is inherently dependent on operand length, such as addition and multiplication. The disadvantages of using an RNS are the complexity involved in division, magnitude comparison, and conversion between binary and residue numbers.
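A minimal Python sketch of these definitions, assuming the {17, 16, 15} moduli set (the {2^n+1, 2^n, 2^n-1} set with n = 4) purely for illustration: forward conversion takes residues, the carry-free operations act channel-wise, and the Chinese Remainder Theorem converts back.

```python
from math import prod

# Illustrative moduli set: {2^n+1, 2^n, 2^n-1} with n = 4 (pairwise coprime).
MODULI = [17, 16, 15]            # dynamic range M = 17 * 16 * 15 = 4080

def to_rns(x, moduli=MODULI):
    """Forward conversion: X -> (x_N, ..., x_1) with x_i = X mod m_i."""
    return [x % m for m in moduli]

def from_rns(residues, moduli=MODULI):
    """Reverse conversion via the Chinese Remainder Theorem."""
    M = prod(moduli)
    return sum(r * (M // m) * pow(M // m, -1, m)
               for r, m in zip(residues, moduli)) % M

a, b = 37, 54
# Addition and multiplication act on each residue channel independently:
sum_rns  = [(x + y) % m for x, y, m in zip(to_rns(a), to_rns(b), MODULI)]
prod_rns = [(x * y) % m for x, y, m in zip(to_rns(a), to_rns(b), MODULI)]
print(from_rns(sum_rns))   # 91
print(from_rns(prod_rns))  # 1998
```

Note that the results are only valid while they stay within the dynamic range M = 4080, which is the "predetermined range" restriction discussed below.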

The RNS provides a unique form of parallelism that makes arithmetic operations such as addition, subtraction, and multiplication easy to handle and perform, increasing speed and reducing chip area. Its main advantages are:

- Carry-free arithmetic
- Parallel operation
- Low-power circuits
- Medium security
- Error detection and correction capability
- Fault tolerance
RNS has its disadvantages too. Operations such as division, sign detection, magnitude comparison, and overflow detection are complex and hard to implement. This has limited the application of RNS to fields where addition and multiplication operations are used extensively and the result is known to lie within a predetermined range. RNS works only for integer values, and it adds the extra cost of conversion from binary to RNS and vice versa.

There are two different meanings of the word cryptosystem. One is used by the cryptographic community, while the other is the meaning understood by the public. In the latter sense, the term cryptosystem is shorthand for "cryptographic system": any computer system that involves cryptography. Such systems include, for instance, a system for secure electronic mail, which might include methods for digital signatures, cryptographic hash functions, key management techniques, and so on. Cryptographic systems are made up of cryptographic primitives.

Typically, a cryptosystem consists of three algorithms: one for key generation, one for encryption, and one for decryption. The term cipher (sometimes cypher) is often used to refer to a pair of algorithms, one for encryption and one for decryption. Therefore, the term "cryptosystem" is most often used when the key generation algorithm is important. For this reason, the term "cryptosystem" is commonly used to refer to public-key techniques; however, both "cipher" and "cryptosystem" are used for symmetric-key techniques.
Public-key cryptography refers to a cryptographic system requiring two separate keys, one of which is secret and one of which is public. Although different, the two parts of the key pair are mathematically linked. One key locks or encrypts the plaintext, and the other unlocks or decrypts the ciphertext. Neither key can perform both functions by itself. Public-key cryptography is a fundamental, important, and widely used technology. It is an approach used by many cryptographic algorithms and cryptosystems, and it underpins such Internet standards as Transport Layer Security (TLS), PGP, and GPG. There are three primary kinds of public-key systems.


Fast multipliers are essential parts of digital signal processing systems. The speed of the multiply operation is of great importance in digital signal processing as well as in today's general-purpose processors, especially since media processing took off. In the past, multiplication was generally implemented via a sequence of addition, subtraction, and shift operations. Multiplication can be considered as a series of repeated additions. The number to be added is the multiplicand, the number of times it is added is the multiplier, and the result is the product. Each step of addition generates a partial product. In most computers, the operands contain the same number of bits.

The basic multiplication principle is twofold: evaluation of partial products and accumulation of the shifted partial products. It is performed by successive additions of the columns of the shifted partial-product matrix. The multiplier is successively shifted and gates the appropriate bit of the multiplicand.
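The shift-and-add principle above can be sketched in a few lines of Python (unsigned operands assumed): each multiplier bit gates a shifted copy of the multiplicand into the running sum.

```python
def shift_add_multiply(multiplicand, multiplier):
    """Unsigned shift-and-add multiplication: accumulate shifted partial products."""
    product = 0
    shift = 0
    while multiplier:
        if multiplier & 1:                        # current multiplier bit gates
            product += multiplicand << shift      # the shifted partial product
        multiplier >>= 1
        shift += 1
    return product

print(shift_add_multiply(13, 11))  # 143
```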


To achieve high-speed multiplication, algorithms using parallel counters, such as the modified Booth algorithm, have been proposed, and some multipliers based on these algorithms have been implemented for practical use. This type of multiplier operates much faster than an array multiplier for longer operands because its computation time is proportional to the logarithm of the word length of the operands.

Booth multiplication is a technique that allows for smaller, faster multiplication circuits by recoding the numbers that are multiplied. It is possible to reduce the number of partial products by half by using radix-4 Booth recoding. The basic idea is that, instead of shifting and adding for every column of the multiplier term and multiplying by 1 or 0, we take only every second column and multiply by ±1, ±2, or 0, to obtain the same results. Grouping starts from the LSB, and the first block uses only two bits of the multiplier.

Fig 2.1: Grouping of bits from the multiplier term

Each block is decoded to generate the correct partial product. The encoding of the multiplier Y using the modified Booth algorithm generates the following five signed digits: -2, -1, 0, +1, +2. Each encoded digit in the multiplier performs a certain operation on the multiplicand, X, as illustrated in Table 1.


For partial-product generation, we adopt the radix-4 modified Booth algorithm to reduce the number of partial products by roughly one half. For multiplication of 2's-complement numbers, the two-bit encoding using this algorithm scans a triplet of bits. The multiplier B is divided into groups of two bits, and the algorithm is applied to each group of divided bits.

Fig 2.2: Illustration of multiplication using modified Booth encoding

The partial-product generator generates five candidates of the partial products, i.e., {-2A, -A, 0, A, 2A}. These are then selected according to the Booth encoding results of the operand B. When the operand other than the Booth-encoded one has a small absolute value, there are opportunities to reduce the spurious power dissipated in the compression tree.


A carry-save adder is a type of digital adder used in computer microarchitecture to compute the sum of three or more n-bit numbers in binary. It differs from other digital adders in that it outputs two numbers of the same dimensions as the inputs: one is a sequence of partial sum bits and the other is a sequence of carry bits.

Consider the sum:

    12345678
  + 87654322

Using the arithmetic we learned as children, we go from right to left: "8+2=0, carry 1", "7+2+1=0, carry 1", "6+3+1=0, carry 1", and so on to the end of the sum.
Although we know the last digit of the result at once, we cannot know the first digit until
we have gone through every digit in the calculation, passing the carry from each digit to

the one on its left. Thus adding two n-digit numbers has to take a time proportional to n,
even if the machinery we are using would otherwise be capable of performing many
calculations simultaneously.
The carry-save unit consists of n full adders, each of which computes a single sum and carry bit based solely on the corresponding bits of the three input numbers. Given the three n-bit numbers a, b, and c, it produces a partial sum ps and a shift-carry sc:

ps_i = a_i XOR b_i XOR c_i
sc_i = (a_i AND b_i) OR (a_i AND c_i) OR (b_i AND c_i)

The entire sum can then be computed by:

1. Shifting the carry sequence sc left by one place.
2. Appending a 0 to the front (most significant bit) of the partial sum sequence ps.
3. Using a ripple-carry adder to add these two together and produce the resulting (n+1)-bit value.
When adding three or more numbers, using a carry-save adder followed by a ripple-carry adder is faster than using two ripple-carry adders. This is because a ripple-carry adder cannot compute a sum bit without waiting for the previous carry bit to be produced, and thus has a delay equal to that of n full adders.

Advantages:
1. Produces all of its output in parallel, with the same delay as a single full adder.
2. Very little propagation delay: a carry-save adder plus a ripple adder costs n+1 full-adder delays, whereas two ripple-carry adders cost 2n.
3. Allows for high clock speeds.

Disadvantages:
1. We do not know whether the result is positive or negative.
2. This is a drawback when performing modulo multiplication, since we do not know whether the intermediate result is greater than or less than the modulus.
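The carry-save step reduces to two bitwise equations (XOR for the partial sums, majority for the carries); a minimal Python sketch:

```python
def carry_save_add(a, b, c):
    """One carry-save step: reduce three addends to a (partial-sum, shift-carry) pair."""
    ps = a ^ b ^ c                       # bitwise sum without carries
    sc = (a & b) | (a & c) | (b & c)     # majority function gives the carry bits
    return ps, sc

a, b, c = 13, 9, 7
ps, sc = carry_save_add(a, b, c)
# Final step: one ordinary (ripple-carry) addition of ps and the left-shifted carries.
print(ps + (sc << 1))  # 29
```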

In electronics, an adder is a digital circuit that performs addition of numbers. In modern computers, adders reside in the arithmetic logic unit (ALU), where other operations are performed. Although adders can be constructed for many numerical representations, such as binary-coded decimal or excess-3, the most common adders operate on binary numbers. In cases where two's complement is being used to represent negative numbers, it is trivial to modify an adder into an adder-subtractor.

Types of adders

For single-bit adders, there are two general types.
A half adder has two inputs, generally labelled A and B, and two outputs, the sum S and carry C. S is the XOR of A and B, and C is the AND of A and B. Essentially, the output of a half adder is the two-bit sum of two one-bit numbers, with C being the more significant of the two output bits.
The second type of single bit adder is the full adder. The full adder takes into
account a carry input such that multiple adders can be used to add larger numbers. To
remove ambiguity between the input and output carry lines, the carry in is labelled Ci or
Cin while the carry out is labelled Co or Cout.

Half adder

Fig 3.1 : Half adder circuit diagram


A half adder is a logical circuit that performs an addition operation on two binary digits. The half adder produces a sum and a carry value, which are both binary digits.

Following is the logic table for a half adder:

A B | C S
0 0 | 0 0
0 1 | 0 1
1 0 | 0 1
1 1 | 1 0

Full adder

Fig 3.2: Schematic symbol for a 1-bit full adder. Inputs: {A, B, Carry In}; Outputs: {Sum, Carry Out}

A full adder is a logical circuit that performs an addition operation on three binary digits. The full adder produces a sum and a carry value, which are both binary digits. It can be combined with other full adders (see below) or work on its own.

A B Cin | Cout S
0 0 0   | 0 0
0 0 1   | 0 1
0 1 0   | 0 1
0 1 1   | 1 0
1 0 0   | 0 1
1 0 1   | 1 0
1 1 0   | 1 0
1 1 1   | 1 1

Note that the final OR gate before the carry-out output may be replaced by an XOR gate without altering the resulting logic. This is because the only discrepancy between OR and XOR gates occurs when both inputs are 1; for the adder shown here, one can check that this is never possible. Using only two types of gates is convenient if one desires to implement the adder directly using common IC chips.

A full adder can be constructed from two half adders by connecting A and B to the inputs of one half adder, connecting the sum from that to an input of the second half adder, connecting Ci to the other input, and OR-ing the two carry outputs. Equivalently, S can be made the three-input XOR of A, B, and Ci, and Co can be made the three-input majority function of A, B, and Ci. The output of the full adder is the two-bit arithmetic sum of three one-bit numbers.
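The two-half-adder construction can be checked exhaustively against the truth table; a small Python sketch:

```python
def half_adder(a, b):
    """Half adder: sum = A XOR B, carry = A AND B."""
    return a ^ b, a & b

def full_adder(a, b, cin):
    """Full adder built from two half adders; the two carries are OR-ed together."""
    s1, c1 = half_adder(a, b)
    s, c2 = half_adder(s1, cin)
    return s, c1 | c2

# Exhaustive check: the outputs are the two-bit sum of three one-bit numbers.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            s, cout = full_adder(a, b, cin)
            assert 2 * cout + s == a + b + cin
```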

A binary multiplier is an electronic hardware device used in digital electronics, in a computer, or in another electronic device to perform rapid multiplication of two numbers in binary representation. It is built using binary adders.

The rules for binary multiplication can be stated as follows:
1. If the multiplier digit is a 1, the multiplicand is simply copied down and represents the product.
2. If the multiplier digit is a 0, the product is also 0.

To design a multiplier circuit, we should have circuitry to provide or do the following:
1. It should be capable of identifying whether a bit is 0 or 1.
2. It should be capable of shifting the partial products left.
3. It should be able to add all the partial products to give the product as the sum of the partial products.
4. It should examine the sign bits. If they are alike, the sign of the product will be positive; if the sign bits are opposite, the product will be negative. The sign bit of the product determined by this criterion should be displayed along with the product.

From the above discussion, we observe that it is not necessary to wait until all the partial products have been formed before summing them. In fact, the addition of each partial product can be carried out as soon as it is formed.


Binary multiplication (e.g., n = 4):

      a_{n-1} a_{n-2} ... a_1 a_0    (multiplicand a)
    x b_{n-1} b_{n-2} ... b_1 b_0    (multiplier b)
    ---------------------------------------------
p_{2n-1} p_{2n-2}     ... p_1 p_0    (product p)







Multiplication followed by accumulation is a common operation in many digital systems, particularly highly interconnected ones such as digital filters, neural networks, and data quantisers.

A typical MAC (multiply-accumulate) architecture is illustrated in the figure. It consists of multiplying two values, then adding the result to the previously accumulated value, which must then be stored back in the register for future accumulations. Another feature of a MAC circuit is that it must check for overflow, which might happen when the number of MAC operations is large.

This design can be done using components, because we have already designed each of the units shown in the figure. However, since it is a relatively simple circuit, it can also be designed directly. In any case, the MAC circuit as a whole can be used as a component in applications like digital filters and neural networks.
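A behavioral sketch of the MAC loop described above; the 16-bit accumulator width and the flag-style overflow check are illustrative assumptions, not the figure's exact datapath:

```python
def mac(acc, a, b, width=16):
    """One multiply-accumulate step with a simple overflow check (assumed width)."""
    result = acc + a * b
    overflow = result >= 1 << width          # flag overflow instead of silently wrapping
    return result & ((1 << width) - 1), overflow

# Accumulate a small dot product, as a digital filter would.
acc = 0
for a, b in [(3, 4), (5, 6), (7, 8)]:
    acc, ovf = mac(acc, a, b)
print(acc)  # 98
```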



The architecture of a radix-2^n multiplier is given in the figure. This block diagram shows the multiplication of two numbers with four digits each. These numbers are denoted as V and U, while the digit size was chosen as four bits; the reason for this will become apparent in the following sections. Each circle in the figure corresponds to a radix cell, which is the heart of the design. Every radix cell has four digit inputs and two digit outputs. The input digits are also fed through the corresponding cells. The dots in the figure represent latches for pipelining; every dot consists of four latches. The ellipses represent adders, which are included to calculate the higher-order digits of the product.

Fig 3.3: Radix-2^n multiplier architecture


The decision to use the radix-4 modified Booth algorithm rather than the radix-2 Booth algorithm is that in radix-4 the number of partial products is reduced to n/2. Wallace-tree multipliers could be used, but in that format the multiplier array becomes very large and requires large numbers of logic gates and interconnecting wires, which makes the chip design large and slows down the operating speed.

Booth Multiplication Algorithm

Booth multiplication algorithm for radix 2

The Booth algorithm gives a procedure for multiplying binary integers in signed 2's-complement representation. We illustrate the Booth algorithm with the following example: 2 (decimal) x (-4) (decimal), i.e.,
0010 (binary) x 1100 (binary)
Step 1: Making the Booth table

I. From the two numbers, pick the number with the fewest changes between consecutive bits, and make it the multiplier.
That is, for 0010: from 0 to 0 no change, 0 to 1 one change, 1 to 0 another change, so there are two changes in this one.
For 1100: from 1 to 1 no change, 1 to 0 one change, 0 to 0 no change, so there is only one change in this one.
Therefore, in the multiplication 2 x (-4), 2 (0010 binary) is the multiplicand and (-4) (1100 binary) is the multiplier.

II. Let X = 1100 (multiplier)
Let Y = 0010 (multiplicand)
Take the 2's complement of Y and call it -Y: -Y = 1110

III. Load the X value in the table.
IV. Load 0 for the X(-1) value; it should be the previous least significant bit of X.
V. Load 0 in the U and V rows, which will hold the product of X and Y at the end of the operation.
VI. Make four rows, one for each cycle; this is because we are multiplying four-bit numbers.

Step 2: Booth Algorithm

The Booth algorithm requires examination of the multiplier bits and shifting of the partial product. Prior to the shifting, the multiplicand may be added to the partial product, subtracted from the partial product, or left unchanged, according to the following rules. Look at the least significant bit of the multiplier X, and the previous least significant bit of the multiplier, X(-1):

X(0) X(-1)  Operation
0    0      Shift only
1    1      Shift only
0    1      Add Y to U, and shift
1    0      Subtract Y from U (i.e., add -Y to U), and shift

Take U and V together and apply an arithmetic right shift, which preserves the sign bit of the 2's-complement number. Thus a positive number remains positive, and a negative number remains negative. Shift X with a circular right shift, because this prevents us from needing two registers for X.

After four cycles, the answer is shown in the last rows of U and V, which is 11111000 (binary), i.e., -8.
Note: By the fourth cycle, the two algorithms have the same values in the product register.

Booth multiplication algorithm for radix 4

One way to realize high-speed multipliers is to enhance parallelism, which helps to decrease the number of subsequent calculation stages. The original version of the Booth algorithm (radix-2) had two drawbacks: (i) the number of add/subtract operations and the number of shift operations is variable, which is inconvenient when designing parallel multipliers; (ii) the algorithm becomes inefficient when there are isolated 1s. These problems are overcome by the modified radix-4 Booth algorithm, which scans strings of three bits using the algorithm given below:
1) Extend the sign bit one position if necessary to ensure that n is even.
2) Append a 0 to the right of the LSB of the multiplier.
3) According to the value of each vector, each partial product will be 0, +y, -y, +2y, or -2y.
The negative values of y are made by taking the 2's complement, and in this paper carry-look-ahead (CLA) fast adders are used. Multiplication by 2 is done by shifting y one bit to the left. Thus, in any case, in designing an n-bit parallel multiplier, only n/2 partial products are generated.
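The digit-set conversion behind these steps can be sketched as follows (the 8-bit width is an illustrative assumption); each overlapping triplet of bits maps to one of the five digits {-2, -1, 0, +1, +2}, giving n/2 digits in total:

```python
def booth_radix4_digits(y, n=8):
    """Radix-4 Booth recoding of an n-bit 2's-complement value (n even).

    Scans overlapping triplets y[2i+1] y[2i] y[2i-1] (with y[-1] = 0) and returns
    n/2 signed digits d_i in {-2,-1,0,+1,+2}; their weighted sum reconstructs
    the n-bit 2's-complement value of y."""
    bits = [(y >> k) & 1 for k in range(n)]
    prev = 0                       # the zero appended to the right of the LSB
    digits = []
    for i in range(0, n, 2):
        digits.append(prev + bits[i] - 2 * bits[i + 1])
        prev = bits[i + 1]
    return digits

digits = booth_radix4_digits(-44 & 0xFF)
print(digits)                                       # [0, 1, 1, -1]
print(sum(d * 4**i for i, d in enumerate(digits)))  # -44
```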











Multipliers are commonly used in various electronic applications, e.g., digital signal processing, in which multipliers are used to implement algorithms such as FIR and IIR filters. Earlier, the major challenge for the VLSI designer was to reduce the area of the chip by using efficient optimization techniques, to keep pace with Moore's law. The next phase was to increase the speed of operation to achieve fast calculations; in today's microprocessors, millions of instructions are performed per second. Speed of operation is one of the major constraints in designing DSP processors and today's general-purpose processors. However, area and speed are conflicting constraints, so improving speed always results in larger area. Moreover, most of today's commercial electronic products are portable (mobile phones, laptops, etc.) and require long battery life, so a lot of research is going on to reduce power consumption. So, in this paper we try to find the best solution to achieve low power consumption, small area, and high speed for the multiplier operation.

The basic principle used for multiplication is to evaluate partial products and to accumulate the shifted partial products. This operation requires a number of successive additions, so one of the major components required to design a multiplier is the adder. Adders can be ripple-carry, carry-look-ahead, carry-select, carry-skip, and carry-save [1-3]. A lot of research work has been done to analyze the performance of different fast adders. The effect of the RCA word-length on the time complexity of each constituent component of the multiplier is analyzed qualitatively, and the multiplier delay is shown to be almost linearly dependent on the RCA word-length. Consequently, the delay of the multiplier can be directly controlled by the word-length of the RCAs. By means of modulo arithmetic properties, we show that the compensation constant that negates the effect of the bias introduced in this process can be precomputed and implemented by direct hardwiring, with no delay overhead, for all feasible combinations of n and k, and it is shown that the proposed multiplier lowers the power dissipation of the radix-4 Booth encoded multiplier.


Modular arithmetic operations (i.e., inversion, multiplication, and exponentiation) are used in several cryptographic applications, such as the decipherment operation of the RSA algorithm, the Diffie-Hellman key exchange algorithm, elliptic curve cryptography, and the Digital Signature Standard, including the Elliptic Curve Digital Signature Algorithm. Modular multiplication is the key algorithm of RSA and other public-key cryptosystems, and so provides an indication of the efficiency of an RNS implementation. The majority of the currently established public-key cryptosystems (RSA, Diffie-Hellman, the Digital Signature Algorithm (DSA), Elliptic Curves (ECC), etc.) require modular multiplication in finite fields as their core operation, which accounts for up to 99% of the time spent for encryption and decryption.

Modular Multiplication in Public Key Cryptosystems

One of the cornerstones of public-key cryptography is modular arithmetic, on which nearly all established schemes are based. An efficient software implementation of modular arithmetic is therefore desirable. While modular additions and subtractions are rather trivial cases, efficient modular multiplication remains an elusive target.


Fig 5.1.1: Modulo (2^n-1) multiplier architecture

Multiplier :

The modulo 2^n-1 multiplication of two n-bit numbers follows three steps:

1. Generation of the n^2 partial products modulo 2^n-1.
2. Reduction of these n^2 partial products modulo 2^n-1 into two n-bit numbers.
3. Addition of these two numbers modulo 2^n-1 with the preceding adder.

Partial products :

The reduction of the n^2 partial products modulo 2^n-1 is similar to fast reduction modulo 2^n-1; the multiplication reduction and the graphical conventions are preserved.
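The three steps can be modeled functionally in Python (sequential code standing in for the partial-product array and CSA tree; 8-bit operands assumed):

```python
def mul_mod_2n_minus_1(a, b, n=8):
    """Modulo 2^n - 1 multiplication via partial products and end-around carry.

    Each partial product a * 2^j mod (2^n - 1) is a circular left shift of a;
    the accumulated sum is then reduced with end-around-carry additions."""
    m = (1 << n) - 1
    acc = 0
    for j in range(n):
        if (b >> j) & 1:
            acc += ((a << j) | (a >> (n - j))) & m   # modulo-reduced partial product
    while acc > m:
        acc = (acc & m) + (acc >> n)                 # end-around-carry addition
    return 0 if acc == m else acc                    # 2^n - 1 also represents zero

print(mul_mod_2n_minus_1(200, 93))  # (200 * 93) % 255 = 240
```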



Let A = Σ_{k=0}^{n-1} a_k 2^k, or alternatively A = {a_{n-1} a_{n-2} ... a_0}, be the multiplicand. Then it is easy to show that A multiplied by a power of two, 2^j, results in a left cyclic shift of A by j bits:

A . 2^j = ( Σ_{k=0}^{n-1-j} a_k 2^{k+j} + Σ_{k=n-j}^{n-1} a_k 2^{k+j-n} ) mod (2^n - 1)

or, in bit representation,

A . 2^j = {a_{n-1-j} a_{n-2-j} ... a_0 a_{n-1} ... a_{n-j}} mod (2^n - 1)
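This cyclic-shift property is easy to check numerically; a small Python sketch with 8-bit words assumed:

```python
def cls(a, j, n=8):
    """Circular left shift of an n-bit word: computes a * 2^j mod (2^n - 1)."""
    j %= n
    if j == 0:
        return a
    return ((a << j) | (a >> (n - j))) & ((1 << n) - 1)

a, n = 0b10110011, 8
# Multiplying by 2^j modulo 2^n - 1 is exactly a left rotation by j bits.
for j in range(n):
    assert cls(a, j) == (a << j) % (2**n - 1)
print(format(cls(a, 3), "08b"))  # 10011101
```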

Now let

B = Σ_{m=0}^{n-1} b_m 2^m ,    (4)

be the multiplier. Then

A . B = ( Σ_{m=0}^{n-1} b_m A 2^m ) mod (2^n - 1)

Similar to the binary multiplier, b_m A 2^m mod (2^n-1) can be treated as a row of partial products, which is the ANDing of b_m with the bit row of the preceding equation. The multiplication is thus converted to the summation of the partial-product array to get the final result.
Modular RNS-based multipliers can be classified into three main groups:
I. The first group deals with specific moduli, i.e., 2^n-1, 2^n, or 2^n+1. Figure 2 below shows an example of a type-1 modular multiplier that makes use of an LUT.
II. The second type uses any moduli value and utilizes special ROM architectures to perform the computation.
III. The third group of multipliers handles medium to large values of moduli, but it uses mainstream arithmetic components that have been developed beforehand, thus facilitating the job of the hardware designer by reducing the overall project lifespan. These components could be regular binary multipliers, adders, subtractors, logic components, and small ROM architectures.

Fig 5.1.2: Type-1 RNS-based modular multiplier architecture


Booth multiplication is a technique that allows for smaller, faster multiplication circuits by recoding the numbers that are multiplied. It is the standard technique used in chip design, and provides significant improvements over the "long multiplication" technique.


The radix-8 Booth encoding reduces the number of partial products to about n/3, which is more aggressive than the radix-4 Booth encoding. However, in radix-8 Booth encoded modulo 2^n-1 multiplication, not all modulo-reduced partial products can be generated using a bitwise circular left shift and bitwise inversion.
Radix-4 and radix-8 multiplication

Recoding of binary numbers was first hinted at by Booth four decades ago. MacSorley proposed a modification of Booth's algorithm a decade later. The modified Booth algorithm (radix-4 recoding) starts by appending a zero to the right of x0 (the multiplier LSB).







Table 2: Radix-4 encoding

x_{2i+1} x_{2i} x_{2i-1}   Partial product
0 0 0    0
0 0 1    +1Y
0 1 0    +1Y
0 1 1    +2Y
1 0 0    -2Y
1 0 1    -1Y
1 1 0    -1Y
1 1 1    0

The recoded multiplier is then represented in signed-digit form as Σ_{i=0}^{(n/2)-1} D_i . 4^i.


Fig 5.2.1: Signed-digit representation





Table 3: Radix-8 recoding

y_{3i+2} y_{3i+1} y_{3i} y_{3i-1}   Digit d_i
0 0 0 0    0
0 0 0 1    +1
0 0 1 0    +1
0 0 1 1    +2
0 1 0 0    +2
0 1 0 1    +3
0 1 1 0    +3
0 1 1 1    +4
1 0 0 0    -4
1 0 0 1    -3
1 0 1 0    -3
1 0 1 1    -2
1 1 0 0    -2
1 1 0 1    -1
1 1 1 0    -1
1 1 1 1    0

Here we have an odd multiple of the multiplicand, 3Y, which is not immediately available. To generate it we need to perform a previous addition: 2Y+Y=3Y. But we are designing a multiplier for a specific purpose, and thereby the multiplicand belongs to a previously known set of numbers which are stored in a memory chip. We have tried to take advantage of this fact to ease the bottleneck of the radix-8 architecture, that is, the generation of 3Y. In this manner we try to attain a better overall multiplication time, or at least one comparable to the time we could obtain using a radix-4 architecture (with the additional advantage of using fewer transistors). To generate 3Y with 21-bit words we only have to add 2Y+Y, that is, to add the number to the same number shifted one position to the left, obtaining in this way a new 23-bit word, as shown in the figure.
Fig 5.2.2: 21-bit previous add

In fact, only a 21-bit adder is needed to generate the bit positions from z1 to z21. Bits z0 and z22 are directly known, because z0=y0 and z22=y20 (the sign bit of the 2's-complement number; 3Y and Y have the same sign). If, in the memory from which we take the numbers, just two additional bits are stored together with each value of the set of numbers, we can decompose the previous add into three shorter adds that can be done in parallel. In this way, the delay is that of a 7-bit adder.


Fig 5.2.3: Modified Previous Adder

The bits to be stored are the two intermediate carry signals c8 and c15. Before each word of the set of numbers is stored in the memory, the value of its intermediate carries has to be obtained and stored beside it. In this way, they are immediately available when the previous add is required to obtain the multiple 3Y of one of the numbers belonging to the set.
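The decomposition can be modeled numerically. The cut positions (8 and 15) follow the c8/c15 example above; the chunk-wise assembly is an illustrative sketch of why precomputed carries let the three short adds run in parallel:

```python
def three_y_parallel(y, cuts=(8, 15), width=23):
    """3Y = Y + 2Y assembled from three independent chunk additions.

    The carry entering each cut position is precomputed (stored alongside Y in
    memory, per the text), so no carry propagates between the chunks."""
    a, b = y, y << 1
    mask = lambda n: (1 << n) - 1
    # Precomputed carries c8 and c15: carry out of the low-k-bit addition.
    carries = {k: ((a & mask(k)) + (b & mask(k))) >> k for k in cuts}
    bounds = [0, *cuts, width]
    result = 0
    for lo, hi in zip(bounds, bounds[1:]):
        cin = carries[lo] if lo else 0
        chunk = ((a >> lo) & mask(hi - lo)) + ((b >> lo) & mask(hi - lo)) + cin
        result |= (chunk & mask(hi - lo)) << lo    # chunks computed independently
    return result

y = 0b101100111010011100101            # a 21-bit sample value
print(three_y_parallel(y) == 3 * y)    # True
```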
The digit set conversion is given by

d_i = y_{3i-1} + y_{3i} + 2y_{3i+1} - 4y_{3i+2}

where y_{-1}, y_n, y_{n+1}, and y_{n+2} are zero. For the radix-8 Booth encoded modulo 2^n-1 multiplier, the required modulo-reduced partial products are shown in Table 4. From Table 3, all the necessary modulo-reduced partial products except ±3X can be generated by a circular left shift and/or bitwise complementation of the multiplicand X. The generation of ±3X requires a large word-length adder, which increases the critical path delay of the multiplier significantly.






Let X = Σ_{i=0}^{n-1} x_i . 2^i and Y = Σ_{i=0}^{n-1} y_i . 2^i represent the multiplicand and the multiplier of the modulo 2^n-1 multiplier, respectively. The radix-8 Booth encoding algorithm can be viewed as a digit set conversion of four consecutive overlapping multiplier bits y_{3i+2} y_{3i+1} y_{3i} (y_{3i-1}) to a signed digit d_i in [-4, 4], for i = 0, 1, ..., ceil(n/3)-1. The digit set conversion is formally expressed as

d_i = y_{3i-1} + y_{3i} + 2y_{3i+1} - 4y_{3i+2}

where y_{-1} = y_n = y_{n+1} = y_{n+2} = 0.
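A small Python sketch of this digit-set conversion (a 9-bit multiplier with its MSB clear is assumed for illustration; in general the digits reconstruct the n-bit 2's-complement value of y):

```python
from math import ceil

def booth_radix8_digits(y, n=9):
    """Radix-8 Booth digit-set conversion d_i = y[3i-1] + y[3i] + 2y[3i+1] - 4y[3i+2].

    Out-of-range bits are taken as zero, yielding ceil(n/3) signed digits in
    [-4, 4] whose weighted sum reconstructs the value of y."""
    bit = lambda k: (y >> k) & 1 if 0 <= k < n else 0
    return [bit(3*i - 1) + bit(3*i) + 2 * bit(3*i + 1) - 4 * bit(3*i + 2)
            for i in range(ceil(n / 3))]

digits = booth_radix8_digits(0b010100101)
print(digits)                                       # [-3, -3, 3]
print(sum(d * 8**i for i, d in enumerate(digits)))  # 165, i.e. 0b010100101
```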














Table 5 summarizes the modulo-reduced multiples of X for all possible values of the radix-8 Booth encoded multiplier digit, d_i, where CLS(X, j) denotes a circular left shift of X by j bit positions; for example, |2X|_{2^n-1} = CLS(X, 1) and |4X|_{2^n-1} = CLS(X, 2). Three unique properties of modulo 2^n-1 arithmetic that will be used for simplifying the combinatorial logic circuit of the proposed modulo multiplier design are reviewed here.

Of all possible two-operand adder implementations, the RCA has indubitably the least area and dynamic power dissipation. The addends |X|_{2^n-1} and |2X|_{2^n-1} are added with carry propagation through full adders (FAs), and the end-around-carry addition is realized with carry propagation through half adders (HAs).

Fig 5.3.1: Generation of |+3X|_{2^n-1} using two n-bit RCAs
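A functional sketch of this hard multiple generation (sequential code standing in for the two RCAs; 8-bit words assumed):

```python
def plus_3x_mod(x, n=8):
    """|+3X| mod (2^n - 1): add X and 2X = CLS(X, 1), then fold the end-around carry."""
    m = (1 << n) - 1
    two_x = ((x << 1) | (x >> (n - 1))) & m     # circular left shift by 1
    s = x + two_x                               # first n-bit carry-propagate addition
    s = (s & m) + (s >> n)                      # end-around-carry addition
    return 0 if s == m else s

print(plus_3x_mod(200))  # (3 * 200) % 255 = 90
```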


The above technique for |+3X|_{2^n-1} computation involves two n-bit carry-propagate additions in series, so the carry propagation length is twice the operand length, n. In the worst case, the late arrival of |+3X|_{2^n-1} may considerably delay all subsequent stages of the modulo 2^n-1 multiplier. Hence, this approach to hard multiple generation can no longer categorically ensure that the multiplication in the modulo 2^n-1 channel still falls on the noncritical path of an RNS multiplier.
To ensure that the radix-8 Booth encoded modulo multiplier does not constitute the system critical path of a high-DR moduli-set-based RNS multiplier, the carry propagation length in the hard multiple generation should not exceed n bits. To this end, the carry propagation through the HAs in Fig. 5 can be eliminated by making the end-around-carry bit c7 a partial-product bit to be accumulated in the CSA tree. This technique reduces the carry propagation length to n bits by representing the hard multiple as a sum and a redundant end-around-carry bit pair.

Let |X|_{2^n-1} and |2X|_{2^n-1} be added by a group of M = n/k k-bit RCAs such that there is no carry propagation between the adders. The figure below shows this addition for n=8 and k=4.

Fig 5.3.2: Generation of partially-redundant |3X|_{2^n-1} using k-bit RCAs


Fig 5.3.3: Generation of partially-redundant |B+3X|_{2^n-1}

where the sum and carry-out bits of RCA j are denoted s_j and c_j, respectively. In Fig. 5.3.2, the carry-out of RCA 0 is not propagated to the carry input of RCA 1 but is preserved as one of the partial product bits to be accumulated in the CSA tree. The binary weight of the carry-out of RCA 1, however, exceeds the maximum range of the modulus and has to be modulo reduced before it can be accumulated by the CSA tree.
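This partially-redundant addition can be modelled in software. The following Python sketch (illustrative; the function name is ours) splits the addition of |X| and |2X| into M = n/k independent k-bit RCAs, keeps each carry-out as a separate redundant bit, and verifies that the result is still congruent to 3X modulo 2^n-1:

```python
def k_bit_rca_sum(a, b, n, k):
    """Add two n-bit words with M = n/k independent k-bit RCAs.
    Carries do not propagate between groups; each group's carry-out
    is preserved as a separate redundant bit."""
    mask = (1 << k) - 1
    s, carries = 0, []
    for j in range(n // k):
        group = ((a >> (k * j)) & mask) + ((b >> (k * j)) & mask)
        s |= (group & mask) << (k * j)
        carries.append(group >> k)       # carry-out of RCA j
    return s, carries

n, k = 8, 4
M = (1 << n) - 1
for x in range(M):
    a, b = x % M, (2 * x) % M            # |X| and |2X| = CLS(X, 1)
    s, c = k_bit_rca_sum(a, b, n, k)
    # the carry-out of RCA j re-enters at bit position k*(j+1) mod n;
    # the last carry wraps end-around since 2^n = 1 (mod 2^n - 1)
    total = s + sum(cj << ((k * (j + 1)) % n) for j, cj in enumerate(c))
    assert total % M == (3 * x) % M      # still congruent to 3X
```

The carry propagation length is now only k bits, at the cost of M redundant carry bits that must later be absorbed by the CSA tree.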
From Fig. 5.3.2, the partially-redundant form of |3X|_{2^n-1} is given by the partial-sum and partial-carry pair (S, C) shown there.


Since modulo negation is equivalent to bitwise complementation by Property 1, the negative hard multiple in a partially-redundant form is computed by complementing the bits of S and C as follows:




To avoid having many long strings of ones in the hard multiple, an appropriate bias, B, is added to the hard multiple such that both C and S remain sparse. The value of B is chosen as

B = sum_{j=0}^{M-1} 2^(kj) = 00...010...01 ... (7)

i.e., B has a logic one at bit position kj for each j and logic zeros elsewhere.

The addends for the computation of the biased hard multiple, |B+3X|_{2^n-1}, in a partially-redundant form are |X|_{2^n-1}, |2X|_{2^n-1}, and B, or equivalently S, C, and B. Since B is chosen to be a binary word that has logic ones at bit positions kj and logic zeros at all other bit positions, |B+3X|_{2^n-1} can be generated by simple XNOR and OR operations on the bits of S and C at bit positions kj. Fig. 5.3.3 illustrates how these bits in the sum and carry outputs of RCA 0 and RCA 1 are modified. In general, |B+3X|_{2^n-1} is given by the partial-sum and partial-carry pair (BS, BC), whose bits equal those of S and C except at positions kj, for j = 0, 1, ..., M-1, where the sum bit is XNORed and the carry bit is ORed with the incoming bias bit (Equations (8)-(11)).
It can be easily verified that the sum of (BS, BC) and its bitwise complement modulo 2^n-1 is |2B|_{2^n-1}. Therefore, the bitwise complement of (BS, BC) represents the partially-redundant form of |B-3X|_{2^n-1}.


The proposed technique represents the hard multiple in a biased partially-redundant form. Since the occurrences of the hard multiple cannot be predicted at design time, all multiples must be uniformly represented. Similar to the hard multiple, all other Booth encoded multiples listed in Table 5 must also be biased and generated in a partially-redundant form. Fig. 5.3.4 shows the biased simple multiples, |B+0|_{2^n-1}, |B+X|_{2^n-1}, |B+2X|_{2^n-1}, and |B+4X|_{2^n-1}, represented in partially-redundant form for n=8. From Fig. 5.3.4, it can be seen that the generation of these biased multiples involves only shift and selective complementation of the multiplicand bits, without additional hardware overhead.


Fig 5.3.4: Generation of partially-redundant simple multiples

Fig 5.3.5: Modulo-reduced partial products and CC for |X.Y|_{2^8-1}


The i-th partial product of a radix-8 Booth encoded modulo 2^n-1 multiplier is given by
PP_i = |2^(3i) . d_i . X|_{2^n-1} ... (12)
To include the bias B necessary for the partially-redundant representation of PP_i, (12) is modified to
PP_i = |2^(3i) . (B + d_i . X)|_{2^n-1} ... (13)
Using Property 3, the modulo 2^n-1 multiplication by 2^(3i) in (13) is efficiently implemented as a bitwise circular-left-shift of the biased multiple, (B + d_i . X). For n=8 and k=4, Fig. 5.3.5 illustrates the partial product matrix of |X.Y|_{2^8-1} with (n/3 + 1) partial products in partially-redundant representation. Each PP_i consists of an n-bit vector, pp_i7, ..., pp_i1, pp_i0, and a vector of n/k = 2 redundant carry bits, q_i1 and q_i0. Since q_i0 and q_i1 are the carry-out bits of the RCAs, they are displaced by k bit positions for a given PP_i. The bit q_ij is displaced circularly to the left of q_(i-1)j by 3 bits; i.e., q_20 and q_21 are displaced circularly to the left of q_10 and q_11 by 3 bits, respectively, and q_10 and q_11 are in turn displaced to the left of q_00 and q_01 by 3 bits, respectively. The last partial product in Fig. 5.3.5 is the Compensation Constant (CC) for the bias introduced in the partially-redundant representation.
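The radix-8 Booth recoding underlying (12) can be sketched in Python. This is illustrative only (the helper names are ours); the digit formula is the standard radix-8 Booth recoding with y_{-1} = 0:

```python
def bit(y, i):
    """Bit i of y, with the Booth convention y_{-1} = 0."""
    return 0 if i < 0 else (y >> i) & 1

def radix8_booth_digits(y, n):
    """Recode an n-bit unsigned multiplier Y into digits d_i in {-4,...,4}
    with Y = sum_i d_i * 8^i, each digit drawn from four overlapping bits:
    d_i = -4*y_{3i+2} + 2*y_{3i+1} + y_{3i} + y_{3i-1}."""
    ndigits = (n + 3) // 3   # enough digits so the leading one is non-negative
    return [-4 * bit(y, 3 * i + 2) + 2 * bit(y, 3 * i + 1)
            + bit(y, 3 * i) + bit(y, 3 * i - 1) for i in range(ndigits)]

n = 8
M = (1 << n) - 1
for y in range(1 << n):                     # the recoding is exact for every Y
    d = radix8_booth_digits(y, n)
    assert sum(di * 8 ** i for i, di in enumerate(d)) == y

# the modulo product is the sum of the partial products |8^i * d_i * X|, cf. (12)
x, y = 93, 201
d = radix8_booth_digits(y, n)
assert sum(di * 8 ** i * x for i, di in enumerate(d)) % M == (x * y) % M
```

Each digit selects one of the multiples 0, +/-X, +/-2X, +/-3X, +/-4X of Table 5; only +/-3X requires an actual addition, which is why the hard multiple generation is the focus of this section.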

Fig 5.3.6: Modulo-reduced partial product generation

The generation of the modulo-reduced partial products, PP_0, PP_1, and PP_2, in a partially-redundant representation using Booth Encoder (BE) and Booth Selector (BS) blocks is illustrated in Fig. 5.3.6. The BE block produces a signed one-hot encoded digit from adjacent overlapping multiplier bits, as illustrated in Fig. 5.3.7. The signed one-hot encoded digit is then used to select the correct multiple to generate PP_i. A bit-slice of the radix-8 BS for the partial product bit pp_ij is shown in Fig. 5.3.8.

Fig 5.3.7: Bit-slice of Booth Encoder (BE).

Fig 5.3.8: Bit-slice of Booth Selector (BS)

As the bit positions of pp_ij and q_ij do not overlap, as shown in Fig. 5.3.5, they can be merged into a single partial product for accumulation. The merged partial products, PP_i, and the constant CC are accumulated using a CSA tree with end-around-carry addition at each CSA level and a final two-operand modulo 2^n-1 adder, as shown in Fig. 5.3.9.

Fig 5.3.9: Modulo-reduced partial product accumulation



To humans, decimal numbers are easy to comprehend and implement for
performing arithmetic. However, in digital systems, such as a microprocessor, DSP
(Digital Signal Processor) or ASIC (Application-Specific Integrated Circuit), binary
numbers are more pragmatic for a given computation.

Fig 5.4.1: Binary Adder Example

Carry propagate adders

Binary carry-propagate adders have been extensively studied, with much of the published work attacking the carry-chain problem. Binary adders have evolved from linear adders, which have a delay approximately proportional to the width of the adder, e.g., the ripple-carry adder (RCA), to logarithmic-delay adders, such as the carry-lookahead adder (CLA).
Modulo addition, an operation that is a small variation of binary addition, can also be implemented with prefix architectures; common modulo addition can even be found in memory addressing. Modulo 2^n-1 addition is one of the most common operations implemented in hardware because of its circuit efficiency. Furthermore, modulo 2^n+1 addition is critical to advanced cryptography techniques.
Arithmetic modulo 2^n-1 (Mersenne numbers) and modulo 2^n+1 (Fermat numbers) is used in various applications, e.g., residue number systems (RNS) and cryptography. There are several ways of performing modulo 2^n-1 addition. The basic idea is to add the carry-out back into the sum, in the fashion of an end-around carry.
Modulo (2^n-1) addition or, equivalently, ones' complement addition can be formulated as

(A+B) mod (2^n-1) = { (A+B+1) mod 2^n   if A+B >= 2^n-1
                    { A+B               otherwise          ... (5.3.1)

The modulo 2^n reduction is automatically performed if an n-bit adder is used. Note that the value 11...1 never occurs and that only a single representation, 00...0, of zero exists. Equation (5.3.1) can be rewritten using the condition A+B >= 2^n:

(A+B) mod (2^n-1) = { (A+B+1) mod 2^n   if A+B >= 2^n
                    { A+B               otherwise          ... (5.3.2)

Now, zero has a double representation (00...0 and 11...1). Since the new condition A+B >= 2^n is equivalent to cout = 1, where cout is the carry-out of the addition A+B, equation (5.3.2) can be rewritten as

(A+B) mod (2^n-1) = (A+B+cout) mod 2^n


Therefore, modulo (2^n-1) addition with a double representation of zero can be realized by an n-bit end-around-carry parallel-prefix adder (Fig. 5.4.14).
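The equivalence of formulation (5.3.1) and the carry-out formulation above can be confirmed with a small Python sketch (illustrative only; the function names are ours):

```python
def mod_add_5_3_1(a, b, n):
    """Eq. (5.3.1): single representation of zero (the all-ones word
    never occurs for inputs below 2^n - 1)."""
    m = (1 << n) - 1
    return (a + b + 1) % (1 << n) if a + b >= m else a + b

def mod_add_eac(a, b, n):
    """(A + B) mod (2^n - 1) = (A + B + cout) mod 2^n,
    with a double representation of zero (00...0 and 11...1)."""
    s = (a + b) & ((1 << n) - 1)
    cout = (a + b) >> n
    return (s + cout) & ((1 << n) - 1)

n = 8
m = (1 << n) - 1
for a in range(m):
    for b in range(m):
        # both formulations agree up to the double representation of zero
        assert mod_add_5_3_1(a, b, n) % m == (a + b) % m
        assert mod_add_eac(a, b, n) % m == (a + b) % m
```

The second form is the one that maps directly onto a parallel-prefix adder: cout is simply reinjected at the carry-in position.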
To resolve the delay of carry-lookahead adders, multilevel-lookahead adders, or parallel-prefix adders, can be employed. The idea is to compute small groups of intermediate prefixes and then find larger group prefixes, until all the carry bits are computed. These adders have tree structures within the carry-computation stage similar to the carry-lookahead adder. The process steps involved in parallel-prefix addition are depicted in the figure "Parallel Prefix Addition Process Steps" below.
A parallel prefix adder can be seen as a 3-stage process:
In the pre-computation stage, each bit computes its carry-generate (g) and carry-propagate (p) signals and a temporary sum, as below. These two signals describe how the carry-out signal will be handled.
gi=ai .bi
pi=ai xor bi
ci+1=gi+pi .ci
In the prefix stage, the group carry generate/propagate signals are computed to
form the carry chain and provide the carry-in for the adder below. Various signal
graphs/architectures can be used to calculate the carry-outs for the final sum. A few of
them are as follows.


Kogge-Stone prefix tree



Han-Carlson Prefix Tree

An example of Sklansky prefix architecture is given below:


Fig 5.4.2: Sklansky Parallel-Prefix Example


In the post-computation stage, the sum and carry-out are finally produced. The
carry-out can be omitted if only a sum needs to be produced.
si=pi xor ci

Fig Parallel Prefix Addition Process Steps


Parallel-prefix structures are common in high-performance adders because their delay is logarithmically proportional to the adder width. An example of an 8-bit parallel-prefix structure is shown in the figure below.

Fig 5.4.3: Example of an 8-bit parallel-prefix structure

In the prefix tree, group generate/propagate are the only signals used. The group
generate/ propagate equations are based on single bit generate/propagate, which are
computed in the pre-computation stage.
gi=ai .bi
pi=ai xor bi (5.5.1)
where 0 <= i < n, g_-1 = cin, and p_-1 = 0. Sometimes, p_i can be computed with OR logic instead of an XOR gate.
In the prefix tree, group generate/propagate signals are computed at each bit:
G_i:k = G_i:j + P_i:j . G_j-1:k
P_i:k = P_i:j . P_j-1:k (5.5.2)
More practically, Equation (5.5.2) can be expressed using the operator "o" denoted by Brent and Kung. Its function is exactly the same as that of a black cell. That is,
(G_i:k, P_i:k) = (G_i:j, P_i:j) o (G_j-1:k, P_j-1:k) (5.5.3)
(G_i:k, P_i:k) = (g_i, p_i) o (g_i-1, p_i-1) o ... o (g_k, p_k) (5.5.4)
This operator makes it easy to state the rules for building prefix structures.

In the post-computation, the sum and carry-out are the final output.

Fig 5.4.4: Cell Definitions

Cout = G_n:-1 ... (5.5.5)
where -1 is the position of the carry-input. The generate/propagate signals can be grouped in different fashions to obtain the same correct carries; based on the different groupings, different prefix architectures can be created. Fig. 5.4.4 shows the definitions of the cells used in prefix structures, including the black cell and the gray cell. Black and gray cells implement Equation (5.5.2) or (5.5.3) and will be heavily used in the following discussion of prefix trees.
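The Sklansky tree discussed below can be modelled in software. The following Python sketch (illustrative; function names and the indexing scheme are ours) applies the black-cell operator in log2(n) levels and checks the resulting carries against ordinary addition:

```python
def black(hi, lo):
    """Black cell: (G, P) = (G_hi + P_hi*G_lo, P_hi*P_lo), cf. Eq. (5.5.3)."""
    g1, p1 = hi
    g0, p0 = lo
    return (g1 | (p1 & g0), p1 & p0)

def sklansky_carries(a_bits, b_bits):
    """Sklansky prefix tree over n bits (n a power of two, cin = 0):
    log2(n) levels; at each level the upper half of every block takes
    the combined (G, P) of the block's lower half (hence the high fan-out)."""
    n = len(a_bits)
    gp = [(a & b, a ^ b) for a, b in zip(a_bits, b_bits)]   # pre-computation
    span = 1
    while span < n:
        for i in range(n):
            if i & span:                       # bit lies in the upper half
                j = (i | (span - 1)) - span    # top bit of the lower half
                gp[i] = black(gp[i], gp[j])
        span <<= 1
    return [g for g, _ in gp]      # gp[i] = G_i:0 = carry into bit i+1

n = 8
for a in range(0, 1 << n, 3):
    for b in range(0, 1 << n, 7):
        abits = [(a >> i) & 1 for i in range(n)]
        bbits = [(b >> i) & 1 for i in range(n)]
        for i, c in enumerate(sklansky_carries(abits, bbits)):
            assert c == ((a % (1 << (i + 1))) + (b % (1 << (i + 1)))) >> (i + 1)
```

Other prefix trees (Kogge-Stone, Brent-Kung, Han-Carlson) differ only in which (i, j) pairs are combined at each level, trading fan-out against node count and depth.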

Fig 5.4.5: 8-bit Empty Prefix Tree


Step 1:

Fig 5.4.6: Build 8-bit Sklansky Prefix Tree

Step 2 :

Fig 5.4.7: Build 8-bit Sklansky Prefix Tree


Step 3:

Fig 5.4.8: Build 8-bit Sklansky Prefix Tree.

A prefix tree is built as the arrows indicate, i.e., from LSB to MSB horizontally, and then from the top logic level down to the bottom logic level. The example shown in Fig. 5.4.8 is an 8-bit Sklansky prefix tree.

Fig 5.4.9: 16-bit Sklansky Prefix Tree

The Sklansky prefix tree takes the fewest logic levels to compute the carries. It also uses fewer cells than the Knowles and Kogge-Stone structures, at the cost of higher fan-out. Fig. 5.4.9 shows a 16-bit example of the Sklansky prefix tree with the critical path in solid lines. A few other prefix trees are given below:
Kogge-Stone prefix tree
Han-Carlson Prefix Tree

Fig 5.4.10: 16-bit Kogge-Stone Prefix Tree










Logic levels          Number of cells
2.log2(n) - 1         2n - log2(n) - 2
log2(n)               n.log2(n) - n + 1
log2(n) + 1           (n/4).log2(n) + 3n/4 - 1
log2(n)               n.log2(n) - n + 1
log2(n)               (n/2).log2(n)
log2(n)               (n/2).log2(n)
log2(n) + 1           (n/2).log2(n)

Table 6: Algorithmic Analysis

Fig 5.4.11: Cell Definitions for Sklansky Parallel-Prefix Tree


In a prefix problem, n inputs x_n-1, x_n-2, ..., x_0 and an arbitrary associative operator o are used to compute n outputs y_i = x_i o x_i-1 o ... o x_0, for i = 0, 1, ..., n-1. Thus, each output y_i depends on all inputs x_j of the same or lower index (j <= i).
Carry propagation in binary addition is a prefix problem. The n-bit carry-propagate addition with input operands A and B, carry-in cin, sum output S, and carry-out cout can be expressed by the logic equations:

2^n . cout + S = A + B + cin

g_i = { a_0.b_0 + a_0.cin + b_0.cin   if i = 0
      { a_i.b_i                        otherwise
p_i = a_i xor b_i

(G^0_i:i, P^0_i:i) = (g_i, p_i)

(G^l_i:k, P^l_i:k) = (G^(l-1)_i:j+1, P^(l-1)_i:j+1) o (G^(l-1)_j:k, P^(l-1)_j:k)
                   = (G^(l-1)_i:j+1 + P^(l-1)_i:j+1 . G^(l-1)_j:k, P^(l-1)_i:j+1 . P^(l-1)_j:k)

c_i+1 = G_i:0
s_i = p_i xor c_i

The corresponding cell definitions are shown in Fig. 5.4.4.

Fig 5.4.12: Prefix adder structure with carry-in


Fig 5.4.13: Parallel prefix structure by Sklansky

Fig 5.4.14: Proposed End-around carry parallel prefix adder structure


The prefix-structure size is increased by only n black nodes and the critical path by only one black node, which results in highly area- and delay-efficient end-around-carry adders. Note that an n-bit end-around-carry parallel-prefix adder has the same delay as, but is smaller than, an ordinary 2n-bit parallel-prefix adder.

Fig 5.4.15: Sklansky adder

The Sklansky adder has:

Minimal depth
High fan-out nodes



Fig 5.4.16: Kogge-Stone adder

The Kogge-Stone adder has:

Low depth
High node count (implies more area).
Minimal fan-out of 1 at each node (implies faster performance)



Fig 5.4.17: Ladner-Fischer adder

The Ladner-Fischer adder has:

Low depth
High fan-out nodes
This adder topology appears the same as the Sklansky adder. Ladner and Fischer formulated a parallel-prefix network design space which includes this minimal-depth structure.


Fig 5.4.18: Brent-Kung adder

The Brent-Kung adder is the extreme boundary case of:

Maximum logic depth in PP adders (implies longer calculation time).
Minimum number of nodes (implies minimum area).


It requires the Xilinx ISE 10.1 version of the software, in which Verilog source code can be used for design implementation.

Introduction To Modelsim
In ModelSim, all designs are compiled into a library. You typically start a new
simulation in ModelSim by creating a working library called "work". "Work" is the
library name used by the compiler as the default destination for compiled design units.
Compiling Your Design: After creating the working library, you compile your design
units into it. The ModelSim library format is compatible across all supported
platforms. You can simulate your design on any platform without having to recompile
your design.
Loading the Simulator with Your Design and Running the Simulation: With the design compiled, you load the simulator with your design by invoking the simulator on a top-level module (Verilog) or a configuration or entity/architecture pair (VHDL). Assuming the design loads successfully, the simulation time is set to zero, and you enter a run command to begin simulation.
Debugging Your Results: If you don't get the results you expect, you can use ModelSim's robust debugging environment to track down the cause of the problem.

Introduction to Model Simulator:

Basic Simulation Flow
The following diagram shows the basic steps for simulating a design in ModelSim.
Create a working library

Compile design files

Load and run simulation

Debug results

Fig 6.1.1: Basic simulation flow

Project Design Flow
Important differences:
You do not have to create a working library in the project flow; it is done for you
Create a project

Add files to the project

Compile design files

Run simulation

Debug results

Fig 6.1.2: Project design flow

Introduction To XILINX ISE


This tool can be used to create, implement, simulate, and synthesize Verilog designs for
implementation on FPGA chips.
ISE: Integrated Software Environment

Environment for the development and test of digital system designs targeted to Xilinx devices
Integrated collection of tools accessible through a GUI
Based on a logical synthesis engine (XST: Xilinx Synthesis Technology)
XST supports different languages (Verilog, VHDL)
XST produces a netlist integrated with constraints
Supports all the steps required to complete the design:
Translate, map, place and route
Bit stream generation
Supports verification at different steps of the design

XILINX Design Process

Step 1: Design entry
HDL (Verilog or VHDL; ABEL for CPLDs), Schematic Drawings, Bubble Diagrams
Step 2: Synthesis
Translates .v, .vhd, and .sch files into a netlist file (.ngc)
Step 3: Implementation
FPGA: Translate/Map/Place & Route, CPLD: Fitter
Step 4: Configuration/Programming
Download a BIT file into the FPGA
Program JEDEC file into CPLD
Program MCS file into Flash PROM
Simulation can occur after steps 1, 2, 3



FPGA stands for Field Programmable Gate Array, a device consisting of an array of logic modules, I/O modules, and routing tracks (programmable interconnect). An FPGA can be configured by the end user to implement specific circuitry. Operating speeds used to be up to 100 MHz, but at present are in the GHz range.
An FPGA contains a two-dimensional array of logic blocks and interconnections between the logic blocks. Both the logic blocks and the interconnects are programmable: logic blocks are programmed to implement a desired function, and the interconnects are programmed using switch boxes to connect the logic blocks.

Fig 6.2.1: FPGA block diagram

FPGAs, an alternative to custom ICs, can be used to implement an entire System on Chip (SoC). The main advantage of an FPGA is its ability to be reprogrammed: the user can reprogram an FPGA to implement a design after the FPGA is manufactured, hence the name "field-programmable".
SRAM is used to implement a LUT: a k-input logic function is implemented using a 2^k x 1 SRAM, and the number of different possible functions for a k-input LUT is 2^(2^k). The advantage of such an architecture is that it supports the implementation of a great many logic functions; the disadvantage is the unusually large number of memory cells required to implement the logic block when the number of inputs is large.

The figure below shows a 4-input LUT-based implementation of a logic block.

Fig 6.2.2: Configurable Logic Blocks (CLB)

LUT-based design provides good logic block utilization. A k-input LUT-based logic block can be implemented in a number of different ways, with a trade-off between performance and logic density. An n-LUT can be seen as a direct implementation of a function truth table: each latch holds the value of the function corresponding to one input combination. For example, a 2-LUT can be used to implement any of the 16 two-input functions, such as AND, OR, A + NOT(B), etc.
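The truth-table view of a LUT can be sketched in Python (illustrative; the helper names are ours): a k-input LUT is simply a 2^k-entry memory addressed by the input bits.

```python
from itertools import product

def make_lut(func, k):
    """Model a k-input LUT as its 2^k x 1 truth table (the SRAM contents)."""
    return [func(*bits) for bits in product((0, 1), repeat=k)]

def lut_read(table, *inputs):
    """The input bits form the SRAM address; the stored bit is the output."""
    addr = 0
    for b in inputs:
        addr = (addr << 1) | b
    return table[addr]

# a 2-LUT stores 4 bits, so it can realize any of 2^(2^2) = 16 functions
and2 = make_lut(lambda a, b: a & b, 2)
a_or_not_b = make_lut(lambda a, b: a | (1 - b), 2)
assert lut_read(and2, 1, 1) == 1 and lut_read(and2, 1, 0) == 0
assert lut_read(a_or_not_b, 0, 0) == 1 and lut_read(a_or_not_b, 0, 1) == 0
assert len(make_lut(lambda a, b, c, d: a ^ b ^ c ^ d, 4)) == 16   # 2^4 cells
```

Reprogramming the FPGA amounts to rewriting these stored bits, which is why any k-input function fits in the same hardware.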
This part of the tutorial gives a short introduction to the FPGA design flow. A simplified version of the design flow is given in the following diagram.


Fig 6.2.3: FPGA Design Flow

Design Entry
There are different techniques for design entry: schematic-based, hardware description language, or a combination of both. The selection of a method depends on the design and the designer. If the designer wants to deal more directly with hardware, then schematic entry is the better choice.

Synthesis
Synthesis is the process that translates VHDL or Verilog code into a device netlist format, i.e., a complete circuit with logical elements (gates, flip-flops, etc.) for the design. If the design contains more than one sub-design, e.g., to implement a processor we need a CPU as one design element and a RAM as another, then the synthesis process is applied to each of them.

Fig 6.2.4: FPGA Synthesis


This process consists of a sequence of three steps:
1. Translate
2. Map
3. Place and Route

Translate
The Translate process combines all the input netlists and constraints into a logic design file. This information is saved as an NGD (Native Generic Database) file.

Fig 6.2.5: FPGA Translate


Map
The Map process divides the whole circuit of logical elements into sub-blocks that fit into the FPGA logic blocks. That is, the map process fits the logic defined by the NGD file onto the targeted FPGA elements (Configurable Logic Blocks (CLBs) and Input/Output Blocks (IOBs)) and generates an NCD (Native Circuit Description) file, which physically represents the design mapped to the components of the FPGA. The MAP program is used for this purpose.

Fig 6.2.6: FPGA map

Place and Route

PAR program is used for this process. The place and route process places the sub

blocks from the map process into logic blocks according to the constraints and connects
the logic blocks.

Figure 6.2.7: FPGA Place and route

Device Programming


Now the design must be loaded onto the FPGA, but first it must be converted to a format the FPGA can accept. The BITGEN program handles this conversion: the routed NCD file is given to BITGEN to generate a bitstream (a .BIT file), which can be used to configure the target FPGA device. The configuration is done using a programming cable, the selection of which depends on the design.
Behavioral Simulation: This is the first of the simulation steps encountered throughout the hierarchy of the design flow. This simulation is performed before the synthesis process to verify the RTL (behavioral) code and to confirm that the design functions as intended.


Fig 6.3.1(a): Top module Simulation Result


Fig 6.3.1(b): PPG module Simulation Result

Fig 6.3.1(c): CSA module Simulation Result


Fig 6.3.1(d): Parallel Prefix module Simulation Result


Fig 6.3.2(a): Design summary on Xilinx ISE of design


Advanced HDL Synthesis Report:

Macro Statistics
# Registers :
# Xors : 109
  1-bit xor2 : 61
  1-bit xor3 : 48

Final Register Report

Macro Statistics
# Registers :

Final Report

Final Results
RTL Top Level Output File Name : TOP_NIS.ngr
Top Level Output File Name :
Output Format :
Optimization Goal : Speed
Keep Hierarchy : NO

Design Statistics
# IOs : 26

Cell Usage : 193
  : 10
  : 56
  : 120
# Flip Flops/Latches :
# Clock Buffers :
# IO Buffers : 25
  : 17


Device Utilization Summary:


Selected Device: 3s500efg320-4

Number of Slices : 106 out of 4656
Number of 4-input LUTs : 186 out of 9312
Number of IOs : 26
Number of bonded IOBs : 26 out of
IOB Flip Flops : 8
Number of GCLKs : 1 out of






Timing Summary:
---------------
Speed Grade: -4

Minimum period: No path found
Minimum input arrival time before clock: 24.020 ns
Maximum output required time after clock: 4.283 ns
Maximum combinational path delay: No path found

Total: 24.020 ns (13.529 ns logic, 10.491 ns route)
       (56.3% logic, 43.7% route)

Timing Detail:
--------------
All values displayed in nanoseconds (ns)



Fig 6.3.2(b): RTL schematic diagram of Top Module

Fig 6.3.2(c): RTL schematic internal blocks


Fig 6.3.2(d): FPGA place and routed diagram

Fig 6.3.2(e): Power analyzer


Fig 6.3.2(f): FPGA dumping process

The residue number system has been a very attractive solution to many researchers, especially during the last decade. Extensive research has been put into improving the RNS and applying it in application areas such as digital signal processing, digital filters, the fast Fourier transform (FFT), and image processing.
The RNS is inherently parallel, modular, and fault tolerant. Operations such as addition, subtraction, and multiplication are inherently carry-free, which saves a great amount of circuit area that previously had to be spent on carry-handling circuitry.
RSA Algorithm
Digital Signal Processing

Digital Filtering
Image Processing
Error Detection and Correction
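The carry-free character of RNS arithmetic can be illustrated with a small Python sketch. This is illustrative only; the moduli set {2^n-1, 2^n, 2^n+1} with n = 4 is chosen purely as an example:

```python
# example moduli set {2^n - 1, 2^n, 2^n + 1} with n = 4 (pairwise coprime)
MODULI = (15, 16, 17)                       # dynamic range 15 * 16 * 17 = 4080

def to_rns(x):
    """Forward conversion: one residue per modulus channel."""
    return tuple(x % m for m in MODULI)

def rns_op(xs, ys, op):
    """Add/subtract/multiply digit-wise: the channels are fully
    independent, so no carries ever cross between them."""
    return tuple(op(a, b) % m for a, b, m in zip(xs, ys, MODULI))

def from_rns(res):
    """Reverse conversion by the Chinese Remainder Theorem."""
    M = MODULI[0] * MODULI[1] * MODULI[2]
    x = 0
    for r, m in zip(res, MODULI):
        Mi = M // m
        x += r * Mi * pow(Mi, -1, m)        # pow(..., -1, m): modular inverse
    return x % M

a, b = 1234, 2345
assert from_rns(rns_op(to_rns(a), to_rns(b), lambda p, q: p + q)) == (a + b) % 4080
assert from_rns(rns_op(to_rns(a), to_rns(b), lambda p, q: p * q)) == (a * b) % 4080
```

Each channel here is exactly the kind of modulo 2^n-1 / 2^n / 2^n+1 arithmetic unit whose hardware realization this work addresses.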

A new approach for modulo (2^n-1) multiplication is proposed. The design uses a Partial Product Generator (PPG), a Carry Save Adder (CSA), and a Parallel Prefix Adder. Similar to a binary multiplier, the generation of the partial products is accomplished by AND gates. Radix-8 Booth encoding in the PPG is applied to increase speed by compressing the number of partial product rows from n to n/3+1. The Carry Save Adder is used to add the PPG outputs. To fully exploit the unequal delays within a full adder, an algorithm for delay optimization of the Wallace tree is developed. The proposed parallel-prefix adding approach exhibits superior performance, in terms of either speed or hardware requirement, in comparison with a recent counterpart for the same purpose. In addition, the proposed modulo (2^n-1) multiplier has an extremely regular structure and is very suitable for VLSI implementation.

ModelSim 6.4b was used for simulation; Xilinx ISE 14.6 for synthesis, timing analysis, and power analysis; and an FPGA Spartan-3E kit for dumping and post-simulation of the design. The achieved total delay is 24 ns and the total power is 0.0076 W.

The Montgomery modular multiplication algorithm is a well-known method employed in efficient modular multiplication architectures and is therefore widely used in GF(p) elliptic curve applications.
The complexity of the Montgomery multiplier makes the testing process a big challenge, so a methodology for developing testing modules is introduced. Including a self-testing block in the multiplier's system is beneficial and reduces the time and effort needed for testing. A self-testing block performs Montgomery multiplication of hardwired numbers and compares the result with predefined values; a flag bit can be used to indicate an error.
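The self-testing idea can be sketched in Python. The algorithm shown is the textbook bit-serial radix-2 Montgomery formulation, not necessarily the architecture referred to above, and the hardwired parameters are arbitrary examples:

```python
def montgomery_product(a, b, p, n):
    """Bit-serial radix-2 Montgomery multiplication:
    returns a*b*R^(-1) mod p, with R = 2^n and p odd (a, b < p)."""
    t = 0
    for i in range(n):
        t += ((a >> i) & 1) * b         # conditionally add b
        t = (t + (t & 1) * p) >> 1      # make t even, then halve (exact)
    return t - p if t >= p else t       # single conditional subtraction

# a self-test in the spirit of the hardwired self-testing block:
p, n = 239, 8                           # hypothetical hardwired parameters
R = 1 << n
a, b = 123, 57                          # hypothetical hardwired operands
expected = (a * b * pow(R, -1, p)) % p  # predefined reference value
flag = int(montgomery_product(a, b, p, n) != expected)  # error flag bit
assert flag == 0
```

In hardware, the reference value would be stored as a constant, and the comparison would drive the error flag directly.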
A power dissipation study of the design is also needed in the context of power differential attacks. This type of attack on a cryptographic system tries to deduce parameters of the system by observing the system's power dissipation. Such a study would show the adequacy of this design approach for low-power devices, such as portable computers.
More study needs to be done on the effect of applying the re-timing technique to the radix-2 design, and on how re-timing will affect the performance of the design. Some investigation is needed to show how the radix-4 design presented in this text can be extended to cover the unified architecture presented earlier. The integration of multiplication and exponentiation can be included as part of a hardware co-processor.




R. Muralidharan and C.-H. Chang, "Radix-8 Booth encoded modulo 2^n-1 multipliers with adaptive delay for high dynamic range residue number system," IEEE Trans. Circuits Syst. I, Reg. Papers.

S. Palnitkar, Verilog HDL: A Guide to Digital Design and Synthesis (IEEE 1364-2001 compliant).

T. R. Padmanabhan and B. Bala Tripura Sundari, Design Through Verilog HDL.

V. Miller, "Use of elliptic curves in cryptography," in Proc. Advances in Cryptology (CRYPTO '85), Lecture Notes in Computer Science, vol. 218, 1986, pp. 417-426.

C. Efstathiou, H. T. Vergos, and D. Nikolos, "Modified Booth modulo 2^n-1 multiplier," IEEE Trans. Comput., vol. 53, no. 3, pp. 370-374, Mar. 2004.

B. S. Cherkauer and E. G. Friedman, "A hybrid radix-4/radix-8 low power signed multiplier architecture," IEEE Trans. Circuits Syst. II, Analog Digit. Signal Process., vol. 44, no. 8, pp. 656-659, Aug. 1997.

M. J. Flynn and S. F. Oberman, Advanced Computer Arithmetic Design. New York: Wiley, 2001.

G. Dimitrakopoulos, D. G. Nikolos, H. T. Vergos, D. Nikolos, and C. Efstathiou, "New architectures for modulo 2^n-1 adders," in Proc. 12th IEEE Int. Conf. Electronics, Circuits and Systems, Gammarth, Tunisia, Dec. 2005, pp. 1-4.

R. A. Patel, M. Benaissa, and S. Boussakta, "Fast parallel-prefix architectures for modulo 2^n-1 addition with a single representation of zero," IEEE Trans. Comput., vol. 56, no. 11, pp. 1484-1492, Nov. 2007.

R. Muralidharan and C. H. Chang, "Fast hard multiple generators for radix-8 Booth encoded modulo 2^n-1 and modulo 2^n+1 multipliers," in Proc. 2010 IEEE Int. Symp. Circuits and Systems, Paris, France, Jun. 2010, pp. 717-720.