Exploring the Potential of Threshold Logic for Cryptography-Related Operations


Alessandro Cilardo, Member, IEEE
Abstract: Motivated by the emerging interest in new VLSI processes and technologies, such as Resonant Tunneling Diodes (RTDs), Single-Electron Tunneling (SET), Quantum Cellular Automata (QCA), and Tunneling Phase Logic (TPL), this paper explores the application of the non-Boolean computational paradigms enabled by such new technologies. In particular, we consider Threshold Logic functions, directly implementable as primitive gates in the above-mentioned technologies, and study their application to the domain of cryptographic computing. From a theoretical perspective, we present a study on the computational power of linear threshold functions related to modular reduction and multiplication, the central operations in many cryptosystems such as RSA and Elliptic Curve Cryptography. We establish an optimal bound on the delay of a threshold logic circuit implementing Montgomery modular reduction and multiplication. In particular, we show that fixed-modulus Montgomery reduction can be implemented as a polynomial-size depth-2 threshold circuit, while Montgomery multiplication can be implemented as a depth-3 circuit. We also propose an architecture for Montgomery modular reduction and multiplication, which ensures feasible $O(n^2)$ area requirements, preserving the properties of constant latency and a low architectural critical path independent of the input size $n$. We compare this result with existing polynomial-size solutions based on the Boolean computational model, showing that the presented approach has intrinsically better architectural delay and latency, both $O(1)$.
Index Terms: Threshold logic, modular arithmetic, Montgomery multiplication.

1 INTRODUCTION
As standard VLSI processes approach the limits of CMOS scaling, new technologies are emerging as a viable alternative for the semiconductor industry. The International Technology Roadmap for Semiconductors (ITRS) [9] indicates several new technologies that may replace CMOS processes in the future. Many of these, such as Resonant Tunneling Diodes (RTDs) [11], Single-Electron Tunneling (SET) [26], Quantum Cellular Automata (QCA) [16], and Tunneling Phase Logic (TPL) [27], will also require a rethinking of computational models and digital design methodologies. In fact, while digital design based on CMOS technologies has been mainly oriented to a Boolean gate model, the above-mentioned emerging technologies provide direct support for the non-Boolean computational model known as Threshold Logic [1].
The work presented in this paper explores the computational power of the Threshold Logic model, and in particular, it applies this emerging paradigm to an increasingly important application domain: cryptographic computation. Many efforts have, in fact, been spent during the last few years to find efficient solutions for cryptography-related operations [3], [18], [15], and they are mostly oriented to standard CMOS technologies. From a theoretical perspective, we present here a study on the computational power of threshold functions related to modular reduction and modular multiplication, the central operation in many cryptosystems such as RSA [17] and Elliptic Curve Cryptography (ECC) [2]. We establish an optimal bound on the delay of a Threshold Logic circuit implementing Montgomery modular arithmetic. In particular, we show that fixed-modulus Montgomery reduction can be implemented as a polynomial-size depth-2 linear threshold circuit, independent of the input size, while Montgomery multiplication can be implemented as a depth-3 circuit. We also propose a constant-depth architecture for Montgomery modular reduction and multiplication, which ensures an $O(n^2)$ area complexity, while preserving constant latency and a low architectural critical path. We compare our constant-depth solutions based on Threshold Logic with different state-of-the-art schemes proposed in the technical literature, showing that our approach has intrinsically better architectural delay and latency (both $O(1)$ with respect to the input size $n$), and thus, outperforms classical systolic architectures and fully parallel solutions based on the Boolean model.
The remainder of the paper is structured as follows: Section 2 summarizes the background on Threshold Logic and introduces the cryptography-related operations addressed by this study. Section 3 presents theoretical results on optimal-depth, polynomial-size schemes for modular reduction and multiplication. Section 4 introduces a constant-depth scheme with an $O(n^2)$ gate count. Section 5 discusses the results and compares them with the state-of-the-art architectures in the technical literature oriented to CMOS technologies. Finally, Section 6 concludes the paper with some final remarks.
452 IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 4, APRIL 2011
The author is with the Dipartimento di Informatica e Sistemistica, University of Naples Federico II, via Claudio 20, 80125 Napoli (NA), Italy. E-mail: acilardo@unina.it.
Manuscript received 10 Dec. 2007; revised 4 Apr. 2009; accepted 10 June 2009; published online 26 May 2010.
Recommended for acceptance by A. Benso, Y. Makris, and P. Mazumder.
For information on obtaining reprints of this article, please send e-mail to: tc@computer.org, and reference IEEECS Log Number TCSI-2007-12-0632.
Digital Object Identifier no. 10.1109/TC.2010.116.
0018-9340/11/$26.00 © 2011 IEEE Published by the IEEE Computer Society
2 BACKGROUND
2.1 Threshold Logic Computing
Since the introduction of digital technologies, theoretical properties and computational power of elementary gates have been deeply investigated. It is well known, for example, that any function $f(X): \{0,1\}^n \to \{0,1\}$ can be obtained as a 3-level circuit made of elementary Boolean gates (NOT/AND/OR), directly implementable in silicon technologies such as CMOS. Nevertheless, for most nontrivial functions, such optimal-depth circuits often result in infeasible area requirements. In fact, for many fundamental functions, such as integer multiplication and the parity function, realizing a 3-level circuit requires a number of gates that grows exponentially in the number of inputs $n$ [8]. The theoretical limitations in computational power of Boolean gates stimulated an interest in other models of elementary gates, possibly more powerful, since the early years of digital design research. A fundamental example is Threshold Logic [14]. A linear threshold function $f(X) = f(x_0, \ldots, x_{n-1}): \{0,1\}^n \to \{0,1\}$ is defined through a real function $\tilde{f}(X)$ as follows:

$$f(X) = \mathrm{sgn}(\tilde{f}(X)) = \begin{cases} 1, & \text{if } \tilde{f} \ge 0, \\ 0, & \text{if } \tilde{f} < 0, \end{cases}$$

where $\tilde{f}$ is given by $\tilde{f} = \sum_{i=0}^{n-1} w_i x_i - t$, where $t$, the threshold, and $w_0, w_1, \ldots, w_{n-1}$, the weights, are a set of constant real coefficients. Real parameters $t$ and $w_i$ can be replaced by integers of $O(n \log n)$ bits without limiting the computational power of the threshold function [14]. Fig. 1 presents some examples of threshold gates and shows their graphical representation. Note that the study presented in this paper will only need threshold functions with integer weights, like those in Fig. 1.
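As a small illustration of the definition above, the following Python sketch evaluates a linear threshold gate with integer weights and composes three such gates into a depth-2 circuit for XOR. The helper names (`threshold_gate`, `xor_depth2`) and the particular weights are assumptions made for this example; they are not necessarily the ones drawn in Fig. 1.

```python
def threshold_gate(weights, t, x):
    """Linear threshold gate: 1 if sum(w_i * x_i) - t >= 0, else 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) - t >= 0 else 0


def xor_depth2(x0, x1):
    """Depth-2 threshold circuit for XOR (weights chosen for illustration)."""
    at_least_one = threshold_gate([1, 1], 1, [x0, x1])    # x0 + x1 >= 1
    at_most_one = threshold_gate([-1, -1], -1, [x0, x1])  # x0 + x1 <= 1
    # The output gate fires only when both first-layer gates fire.
    return threshold_gate([1, 1], 2, [at_least_one, at_most_one])


if __name__ == "__main__":
    for x0 in (0, 1):
        for x1 in (0, 1):
            print(x0, x1, xor_depth2(x0, x1))
```

The output gate uses only small integer weights, while each first-layer gate tests one side of the condition "exactly one input is set".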
Although introduced in the sixties, the study of threshold functions was stimulated by developments in artificial neural networks during the last two decades. In fact, threshold functions constitute the mathematical model of neurons used in feed-forward artificial neural networks. During the late eighties, however, several researchers also investigated arithmetic properties of threshold functions, aiming at defining depth-efficient schemes for such common functions as addition, multiplication, and division [20], [21], [22]. In particular, many works tried to obtain threshold logic circuits with weights polynomially bounded in the number of inputs $n$, i.e., $|w_i| \le n^c$ for some constant $c$. If this is the case, we only require $O(\log n)$-bit accuracy in each weight, making practical implementations more realistic. At the same time, area requirements are always guaranteed to be of polynomial order.
Although very promising, such results often remained of
theoretical nature. In fact, the technology trends of the last
decades, especially the evolution of CMOS processes, did
not provide effective support for the implementation of
threshold logic gates. However, Resonant Tunneling Diodes
[11], Single-Electron Tunneling [26], Quantum Cellular
Automata [16], and Tunneling Phase Logic [27] may deeply
change this scenario, as linear threshold functions find a
natural implementation in such technologies.
2.2 Modular Arithmetic in Cryptography
Motivated by the emerging interest in new generation
threshold gates, in this work, we explore the application of
the Threshold Logic model to an important application
domain: cryptographic computation. Ubiquitous network-
ing and the centrality of the Internet are, in fact, stimulating
an increasing demand for high-performance implementa-
tions of cryptographic algorithms and protocols. Two
widely adopted public-key cryptosystems, in particular,
are the Rivest-Shamir-Adleman (RSA) [17] and the Elliptic
Curve [2] cryptosystems. As they are rather time-consum-
ing, their implementation is often addressed in the
technical literature. The kernel of both schemes is based
on repeated modular multiplications. As shown by a large
variety of scientific works, Montgomery algorithm [13] has
proved to be the most effective implementation technique
for modular multiplication [25], [3]. It is, in fact, based on a
slightly different definition of the modular product,
summarized in the following, which enables particularly
efficient implementations.
Assume that $N$, the modulus, is an $n$-bit number, i.e., $2^{n-1} \le N < 2^n$. Call mod the modulo reduction operation, i.e., $P = A \bmod N$ represents the (unique) integer less than $N$ such that $A - kN = P$ for some $k$.¹ The Montgomery product of two numbers $A$ and $B$ is defined as $A \cdot B \cdot R^{-1} \bmod N$, where $R$ can be any number for which there exists an inverse $R^{-1}$ modulo $N$, i.e., a number such that $R^{-1} \cdot R \bmod N = 1$. In order for such a number to exist, it suffices that $\gcd(N, R) = 1$. In practical cryptographic applications, the modulus is often a prime number (e.g., in Elliptic Curve cryptography) or the product of two large primes (e.g., in RSA cryptography). In such cases, setting $R$ to a power of 2 always satisfies the condition $\gcd(N, R) = 1$. We give in the following the Montgomery modular reduction algorithm in its general form:
Algorithm 1. Montgomery Modular Reduction Algorithm
Input: $N$, $R$, and $\tilde{N}$ such that $R \cdot R^{-1} - N \cdot \tilde{N} = 1$ ($R^{-1}$ and $\tilde{N}$ always exist since $\gcd(N, R) = 1$), $T < N \cdot R$
Output: $T \cdot R^{-1} \bmod N$
1. $Q = T \cdot \tilde{N} \bmod R$
2. $P = (T + Q \cdot N) / R$
3. if $P \ge N$ return $P - N$ else return $P$
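Algorithm 1 can be sketched in a few lines of Python. This is only an illustrative rendering under stated assumptions: the function names are hypothetical, the constants $R^{-1}$ and $\tilde{N}$ are derived with Python's built-in modular inverse (Python 3.8 or later), and the sample values $N = 53$, $R = 2^8$ are chosen here for convenience rather than taken from the paper's figures.

```python
def montgomery_constants(N, R):
    """Return (R_inv, N_tilde) with R * R_inv - N * N_tilde == 1."""
    R_inv = pow(R, -1, N)              # requires gcd(N, R) == 1
    N_tilde = (R * R_inv - 1) // N     # exact division by construction
    return R_inv, N_tilde


def montgomery_reduce(T, N, R, N_tilde):
    """Algorithm 1: return T * R^{-1} mod N for 0 <= T < N * R."""
    Q = (T * N_tilde) % R              # Step 1
    P = (T + Q * N) // R               # Step 2: the numerator is a multiple of R
    return P - N if P >= N else P      # Step 3


if __name__ == "__main__":
    N, R = 53, 256                     # illustrative values only
    R_inv, N_tilde = montgomery_constants(N, R)
    # The two printed numbers should coincide.
    print(montgomery_reduce(1000, N, R, N_tilde), (1000 * R_inv) % N)
```

When $R$ is a power of two, the `% R` and `// R` operations reduce to a bit mask and a right shift, which is what makes the algorithm attractive in hardware.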
Note that the numerator in Step 2 is always a multiple of $R$, since $(T + Q \cdot N) \bmod R = (T + (T \cdot \tilde{N} \bmod R) \cdot N) \bmod R = (T - (T \bmod R)) \bmod R = 0$. On the other hand, we clearly have $T + Q \cdot N \equiv T \pmod N$, so Step 2 computes a number equal to $T \cdot R^{-1}$ modulo $N$. Moreover, $\frac{T + Q \cdot N}{R} < \frac{N \cdot R}{R} + \frac{Q}{R} N < 2N$. Step 3 in the above algorithm ensures that the result $P$ is actually less than the modulus $N$. Note that neither $R$ nor $\tilde{N}$ depends on the input $T$ to be reduced.
CILARDO: EXPLORING THE POTENTIAL OF THRESHOLD LOGIC FOR CRYPTOGRAPHY-RELATED OPERATIONS 453
Fig. 1. Graphical representation of (a) a generic threshold function and (b) a depth-2 threshold circuit realizing the XOR function.
1. The precedence rules assumed here for the mod operator are as follows: multiplication/division, modulo reduction, and addition/subtraction. For example, the expression $A + B \cdot C \bmod N$ is equal to $A + [(B \cdot C) \bmod N]$.
The above algorithm can be easily particularized to modular multiplication if $R > N$ and the two operands $A$ and $B$ are less than $N$. In this case, $T = A \cdot B < R \cdot N$ can be used as the input for the reduction algorithm and produce the Montgomery product $A \cdot B \cdot R^{-1} \bmod N$. When we choose a power of two $2^k$ for the constant $R$, Montgomery algorithm shows excellent implementation properties [25], [23], [3], [15]. In particular, the computation of $Q$ in Step 1 and the addition and division (right shift) in Step 2 can be interleaved with the accumulation of partial products in the multiplication $A \cdot B$, enabling high-speed pipelined or systolic design approaches. Fig. 2 shows an example of Montgomery multiplication. The example uses a 6-bit modulus $N$. In Fig. 2a, a full-width multiplication is performed followed by Montgomery reduction with $R = 2^k = 2^8$. The figure also shows the quantities $R^{-1}$ and $\tilde{N}$ such that $R \cdot R^{-1} - N \cdot \tilde{N} = 1$, and the calculation of $Q = T \cdot \tilde{N} \bmod R = (T \bmod R) \cdot \tilde{N} \bmod R$. In Fig. 2b, on the other hand, Montgomery reduction is interleaved with multiplication steps. Precisely, there are four steps, each performing a partial multiplication of $A$ by a two-bit subword of $B$, followed by a Montgomery reduction with $R = 2^k = 2^2$. Fig. 2c, finally, shows an example of the fine-grained pipelining enabled by Montgomery algorithm.
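The interleaved scheme described for Fig. 2b can be sketched as follows. This is a minimal software model under assumptions: the function name, the digit-serial loop structure, and the sample modulus $N = 53$ are illustrative choices and do not reproduce the figure's actual values. Each step accumulates one partial product and immediately applies a small Montgomery reduction with $R = 2^2$.

```python
def montgomery_multiply_interleaved(A, B, N, digit_bits=2, steps=4):
    """Return A * B * 2^(-digit_bits * steps) mod N for odd N and A < N.

    B is consumed digit_bits at a time, least significant digit first.
    """
    r = 1 << digit_bits
    N_tilde = (-pow(N, -1, r)) % r      # N * N_tilde = -1 (mod r)
    acc = 0
    for s in range(steps):
        digit = (B >> (s * digit_bits)) & (r - 1)
        acc += A * digit                # partial product
        q = (acc * N_tilde) % r         # per-step Step 1 of Algorithm 1
        acc = (acc + q * N) // r        # per-step Step 2 (exact right shift)
    # acc stays below A + N, so one conditional subtraction is enough.
    return acc - N if acc >= N else acc


if __name__ == "__main__":
    N = 53                              # illustrative 6-bit odd modulus
    R_inv = pow(256, -1, N)             # four 2-bit steps give R = 2^8
    print(montgomery_multiply_interleaved(40, 29, N), (40 * 29 * R_inv) % N)
```

Because each step only needs the low `digit_bits` bits of the accumulator to compute `q`, the steps lend themselves to the pipelined or systolic organization mentioned in the text.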
An interesting property enabled by Montgomery multiplication is the possibility to work on the $N$-residues of numbers, defined as $\bar{A} = A \cdot R \bmod N$. It can be easily seen that the Montgomery product of two numbers in $N$-residue form is still in $N$-residue form: $\bar{A} \cdot \bar{B} \cdot R^{-1} \bmod N = A \cdot R \cdot B \cdot R \cdot R^{-1} \bmod N = (A \cdot B) \cdot R \bmod N = \overline{A \cdot B}$. Clearly, this also holds true for modular addition: $(\bar{A} + \bar{B}) \bmod N = \overline{A + B}$. In other words, we can define a Montgomery arithmetic, where numbers are manipulated in $N$-residue form, and modular multiplication is replaced by Montgomery multiplication, while modular addition is computed as usual. This clearly applies to any operation that is computed as a composition of modular multiplications and additions. In particular, we can reduce modular exponentiation (used in RSA cryptography) to a sequence of modular multiplications. The same holds true for modular inversion, used in ECC cryptography, which can be implemented as a sequence of modular multiplications via Fermat's Little Theorem.
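To make the $N$-residue idea concrete, the sketch below performs a square-and-multiply exponentiation entirely in Montgomery form and converts back only at the end. The function names and the left-to-right exponent scan are assumptions made for illustration; the paper does not prescribe a particular exponentiation schedule.

```python
def mont_mul(a, b, N, R):
    """Montgomery product a * b * R^{-1} mod N (a, b < N, R > N, gcd(N, R) = 1)."""
    N_tilde = (-pow(N, -1, R)) % R
    T = a * b
    Q = (T * N_tilde) % R
    P = (T + Q * N) // R
    return P - N if P >= N else P


def mont_pow(base, exp, N, R):
    """Compute base^exp mod N using only Montgomery products on N-residues."""
    base_bar = (base * R) % N           # N-residue of the base
    acc_bar = R % N                     # N-residue of 1
    for bit in bin(exp)[2:]:            # left-to-right square-and-multiply
        acc_bar = mont_mul(acc_bar, acc_bar, N, R)
        if bit == "1":
            acc_bar = mont_mul(acc_bar, base_bar, N, R)
    return mont_mul(acc_bar, 1, N, R)   # leave N-residue form


if __name__ == "__main__":
    N, R = 53, 64                       # illustrative prime modulus
    a = 10
    a_inv = mont_pow(a, N - 2, N, R)    # inversion via Fermat's Little Theorem
    print(a_inv, (a * a_inv) % N)
```

The conversions into and out of residue form happen once per exponentiation, so their cost is amortized over the whole sequence of multiplications.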
Note that Algorithm 1 requires a magnitude comparison (Step 3) in order to ensure that the result is actually less than the modulus $N$. However, when many consecutive multiplications are to be performed, we can allow intermediate results to stay in the range $[0, 2N)$ with a proper choice for $R$. In fact, if we choose $R > 4N$, it can be easily seen that the reduction algorithm accepts multiplicands $A, B < 2N$ (not necessarily less than $N$): $\frac{A \cdot B + Q \cdot N}{R} < \frac{2N \cdot 2N + Q \cdot N}{R} = \frac{4N}{R} N + \frac{Q}{R} N < 2N$, so the algorithm preserves the invariant that inputs and output are less than $2N$. In the rest of the paper, we will thus refer to Algorithm 1 with no final correction (Step 3).
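The invariant can be checked exhaustively for a small modulus. The following sketch, with a hypothetical function name and illustrative values $N = 53$ and $R = 256 > 4N$, omits Step 3 and verifies that outputs stay below $2N$ whenever both operands do.

```python
def montgomery_multiply_no_correction(A, B, N, R):
    """Montgomery product without the final subtraction (Step 3 omitted)."""
    N_tilde = (-pow(N, -1, R)) % R
    T = A * B
    Q = (T * N_tilde) % R
    return (T + Q * N) // R


if __name__ == "__main__":
    N, R = 53, 256                      # R > 4N, illustrative values
    worst = max(
        montgomery_multiply_no_correction(A, B, N, R)
        for A in range(2 * N) for B in range(2 * N)
    )
    print(worst < 2 * N)
```

The result is congruent to $A \cdot B \cdot R^{-1}$ modulo $N$ but is not necessarily the canonical representative, which is acceptable for intermediate values in a chain of multiplications.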
3 THEORETICAL RESULTS
In this section, we investigate some theoretical properties of Threshold Logic related to modular arithmetic used in cryptographic applications. Previous results showed that Threshold Logic provides interesting advantages for the implementation of arithmetic functions. In particular, most works focused on threshold circuits of polynomial size (in the number of primary inputs $n$) and polynomially bounded weights. Similar to [20], we will denote this subset of threshold circuits by $\widehat{LT}$. Based on harmonic analysis of Boolean functions [4], the authors in [20] proved that there exists an $\widehat{LT}$ circuit of depth 2 implementing the sum of two $O(n)$-bit numbers. This result was used in [19] and [21] to show that the integer multiplication of two $O(n)$-bit numbers can be implemented as an $\widehat{LT}$ circuit of depth 4. This upper bound was further improved in [22], where the authors showed that the multiplication of two $O(n)$-bit numbers can be implemented as an $\widehat{LT}$ circuit of depth 3 and proved that this depth is optimal. Modular multiplication is a more complex operation since it involves modular reduction, as shown in Fig. 2. We will show here that fixed-modulus modular reduction, based on Montgomery formulation, can be implemented as an $\widehat{LT}$ circuit of depth 2, while modular multiplication requires a depth-3 $\widehat{LT}$ circuit, independent of the input width. We will also show that these upper bounds are optimal.
Fig. 2. Example of Montgomery algorithm: (a) parallel execution with $k = 8$, (b) interleaved execution with $k = 2$, and (c) a possible pipeline scheme.
We briefly recapitulate some fundamental results presented in [6], [7], [22], which are relevant to our work. Note that we will often refer to polynomial bounds. For example, when we say that there exist polynomially many functions $f^{(i)}(x_0, x_1, \ldots, x_{n-1})$, we mean that they exist in a number upper-bounded by $n^c$, for some constant $c$.
A first important property related to the power of $\widehat{LT}$ circuits is expressed by the following lemma:
Lemma 1. The class of polynomial-size threshold circuits of depth $d$ with polynomially bounded integer weights at the output gate (but no restrictions on the weights of internal gates) coincides with the class of $\widehat{LT}$ circuits of the same depth $d$.
For the proof of this lemma, see [7, Theorem 11].
The following lemma, useful for our results, establishes a sufficient condition for a Boolean function $f$ to be implemented as an $\widehat{LT}$ circuit. The lemma was first introduced in [22]:
Lemma 2. Let $X = (x_0, x_1, \ldots, x_{n-1}) \in \{0,1\}^n$, and $f: \{0,1\}^n \to \{0,1\}$ be a Boolean function of $X$, which depends only on a weighted sum of the input variables $S = \sum w_i x_i$, with unrestricted (possibly exponential) weights $w_i$. If $f = 1 \iff S \in \bigcup_{j=0}^{M-1} [\alpha_j, \beta_j]$, where $M$ is polynomially bounded in $n$, then $f$ can be implemented as an $\widehat{LT}$ circuit of depth 2.
Proof. For $j = 0, \ldots, M - 1$, define the following functions:

$$g'_j = \mathrm{sgn}\Big(\sum w_i x_i - \alpha_j\Big), \qquad g''_j = \mathrm{sgn}\Big(\beta_j - \sum w_i x_i\Big).$$

The function $f(X)$ can be written as

$$f(X) = \mathrm{sgn}\Bigg(\sum_{j=0}^{M-1} (g'_j + g''_j) - M - 1\Bigg). \qquad (1)$$

In fact, if $S = \sum w_i x_i \notin [\alpha_j, \beta_j]$ for all $j$, then $g'_j + g''_j = 1$ for all $j$ and $\sum_{j=0}^{M-1} (g'_j + g''_j) - M = 0$. If, for some $k$, $S \in [\alpha_k, \beta_k]$, then $g'_k + g''_k = 2$ and $g'_j + g''_j = 1$ for $j \ne k$, so $\sum_{j=0}^{M-1} (g'_j + g''_j) - M = (M + 1) - M = 1$. As emphasized in (1), $f(X)$ is a threshold function having polynomially bounded weights and taking as input polynomially many threshold functions $g'_j$ and $g''_j$ with unrestricted weights. It thus satisfies the hypothesis of Lemma 1 and can be realized as an $\widehat{LT}$ circuit of depth 2. □
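The construction in the proof of Lemma 2 can be exercised directly. The sketch below builds the two first-layer gates per interval and the single output gate, then evaluates the circuit. The function names and the parity example are assumptions added for illustration, and the intervals are taken to be pairwise disjoint, as the counting argument in the proof requires.

```python
from itertools import product


def sgn(v):
    """Threshold activation: 1 if v >= 0, else 0."""
    return 1 if v >= 0 else 0


def depth2_interval_circuit(weights, intervals, x):
    """Evaluate the depth-2 circuit of Lemma 2 on input bits x.

    intervals is a list of disjoint closed intervals (alpha_j, beta_j).
    """
    S = sum(w * xi for w, xi in zip(weights, x))
    first_layer = []
    for alpha, beta in intervals:
        first_layer.append(sgn(S - alpha))   # g'_j
        first_layer.append(sgn(beta - S))    # g''_j
    # Output gate: unit weights, threshold (number of intervals) + 1.
    return sgn(sum(first_layer) - len(intervals) - 1)


if __name__ == "__main__":
    n = 5
    odd_sums = [(s, s) for s in range(1, n + 1, 2)]   # parity as a union of intervals
    for x in product((0, 1), repeat=n):
        assert depth2_interval_circuit([1] * n, odd_sums, x) == sum(x) % 2
    print("parity realized at depth 2")
```

Parity is a convenient test case because it depends only on the unweighted sum of the inputs and switches value a linear number of times.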
Based on the results recapitulated above, we will now address the implementation of modular reduction and multiplication. For our results, we refer to the extended notion of Montgomery reduction, given in Section 2.2, i.e., the modulus $N$ is assumed to be odd and the reduced number $P$ can be in the range $[0, 2N)$. The following lemma provides a preliminary result useful for the subsequent discussion:
Lemma 3. Take a fixed $n$-bit modulus $N$ and $m$ numbers $X_j = \sum_{i=0}^{n-1} x_{ji} 2^i$ of $n$ bits, with $m = O(n^c)$. Let $T = \sum_{j=0}^{m-1} \sum_{i=0}^{n-1} x_{ji} 2^i$ be the sum of such numbers. The Montgomery reduced sum, $P = T \cdot 2^{-k} \bmod N$, can be computed with an $\widehat{LT}$ circuit of depth 2 for a suitable value of $k$. The circuit is optimal in depth.
Proof. The maximum value of the sum $T = \sum_{j=0}^{m-1} \sum_{i=0}^{n-1} x_{ji} 2^i$ is $m(2^n - 1)$, so $T$ has at most $(\log m + n)$ bits.² Define $k = 1 + \log m$. This value of $k$ ensures that the reduced sum $P = T \cdot 2^{-k} \bmod N$, computed as in Algorithm 1, is actually less than $2N$:

$$P = \frac{T + (T \cdot \tilde{N} \bmod 2^k) \cdot N}{2^k} < \frac{T}{2^k} + \frac{2^k}{2^k} N \le \frac{m(2^n - 1)}{2 \cdot 2^{\log m}} + N < 2^{n-1} + N \le 2N.$$

We claim that each bit $p_h$ of the reduced sum $P = T \cdot 2^{-k} \bmod N$ is a function satisfying the hypothesis of Lemma 2. We have to prove that there exists a weighted sum $S$ of the input bits and that the output bit $p_h$ is equal to 1 if and only if $S$ belongs to one of polynomially many intervals.
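We split the sum of the input bits as follows:

$$T_1 = \sum_{j=0}^{m-1} \sum_{i=k}^{n-1} x_{ji} 2^{i-k}, \qquad T_0 = \sum_{j=0}^{m-1} \sum_{i=0}^{k-1} x_{ji} 2^i,$$

so that $T = T_1 \cdot 2^k + T_0$. The reduced sum (see Algorithm 1) can be written as

$$P = \frac{T + (T \cdot \tilde{N} \bmod 2^k) \cdot N}{2^k} = \frac{T + ((T \bmod 2^k) \cdot \tilde{N} \bmod 2^k) \cdot N}{2^k} = \frac{T_1 \cdot 2^k + T_0 + ((T_0 \bmod 2^k) \cdot \tilde{N} \bmod 2^k) \cdot N}{2^k} = T_1 + \frac{T_0 + (T_0 \cdot \tilde{N} \bmod 2^k) \cdot N}{2^k} = T_1 + f(T_0).$$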
Consider the bit $p_h$ of $P$, and define $T'_1 = \sum_{j=0}^{m-1} \sum_{i=k}^{k+h} x_{ji} 2^{i-k}$, i.e., a truncated version of $T_1$. From the above formulation, it is clear that $p_h$ only depends on $T'_1$ and $T_0$, as emphasized in Fig. 3. Define

$$M = \max T'_1 + 1 = m(2^{h+1} - 1) + 1 > T'_1$$

and the sum $S = T_0 \cdot M + T'_1$ so that

$$T_0 = T_0(S) = S \;\mathrm{div}\; M, \qquad T'_1 = T'_1(S) = S \bmod M.$$

2. Throughout the text, $\log m$ denotes base 2 logarithm and is always assumed to be rounded up to the nearest integer, so, in general, $2^{\log m} \ge m$.
Fig. 3. Multiple sum Montgomery reduction.
The bit $p_h$ of $P$ is a function of $T'_1$ and $T_0$, and hence, of the sum $S$. Note that the terms in the sum $S = T_0 \cdot M + T'_1$ are exponential in $n$. However, $T_0$ only takes on a polynomial number of different values, since its maximum value is $m(2^k - 1) < 4m^2$, and $m = O(n^c)$. For each value of $T_0$ (and hence, of $f(T_0)$), the bit $p_h$ depends only on $T'_1$ (see Fig. 3). Precisely, after defining $\alpha = f(T_0) \bmod 2^{h+1}$, it turns out that the bit in position $h$, $p_h$, is 1 if and only if

$$T'_1 \in \bigcup_{j=0}^{m} \big[(2j + 1) 2^h - \alpha,\; (j + 1) 2^{h+1} - 1 - \alpha\big],$$

i.e., it remains constant in $m + 1$ subintervals. Overall, as the sum $S = T_0 \cdot M + T'_1$ spans its exponential range, the bit $p_h$ changes its value a polynomial number of times. Each of the functions $p_h$, therefore, satisfies Lemma 2 and can thus be realized as a polynomial-size depth-2 threshold logic circuit with polynomially bounded weights.
It is easy to prove that this depth is optimal. Choose an arbitrary $n$, take any two $(n - 2)$-bit numbers $A$ and $B$ and an arbitrary $n$-bit modulus $N$. Let $T = 4A + 4B$ and consider its Montgomery reduction with $k = 2$:

$$P = \frac{4(A + B) + (4(A + B) \cdot \tilde{N} \bmod 4) \cdot N}{4} = A + B.$$

If multiple sum Montgomery reduction could be implemented as a depth-1 threshold circuit, then this would be possible also for addition. It is well known, on the other hand, that addition requires at least a depth-2 threshold circuit. □
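The arithmetic behind Lemma 3 can be checked numerically. The sketch below is a software model only, with hypothetical function names and small test sizes chosen here: it computes the reduced sum through the decomposition $P = T_1 + f(T_0)$ with $k = 1 + \log m$ (logarithm rounded up), so that the bound $P < 2N$ and the congruence with $T \cdot 2^{-k}$ can be verified exhaustively. It does not model the threshold gates themselves.

```python
def ceil_log2(m):
    """Base-2 logarithm rounded up, as assumed throughout the text."""
    return (m - 1).bit_length()


def reduced_sum(xs, N):
    """Montgomery-reduce the sum of the numbers xs modulo an odd N.

    Returns (P, k) with P congruent to sum(xs) * 2^(-k) modulo N.
    """
    m = len(xs)
    k = 1 + ceil_log2(m)
    r = 1 << k
    N_tilde = (-pow(N, -1, r)) % r               # N * N_tilde = -1 (mod 2^k)
    T1 = sum(x >> k for x in xs)                 # high parts of the operands
    T0 = sum(x & (r - 1) for x in xs)            # low parts, polynomially bounded
    f_T0 = (T0 + ((T0 * N_tilde) % r) * N) >> k  # exact division by 2^k
    return T1 + f_T0, k


if __name__ == "__main__":
    P, k = reduced_sum([13, 7, 15], 11)          # three 4-bit numbers, 4-bit modulus
    print(P < 2 * 11, (P << k) % 11 == (13 + 7 + 15) % 11)
```

Only $T_0$ enters the nonlinear part $f(T_0)$, and $T_0$ takes polynomially many values, which is the property the proof exploits.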
By using the above lemma, we can reduce modulo $N$ the sum of multiple numbers, each having the same size $n$ as the modulus. In practice, we will often need to reduce a single number $T$ having a bit size larger than $n$. The following theorem, based on Lemma 3, addresses this problem:
Theorem 1. Let $N$ be an $n$-bit modulus and $T$ an $m$-bit number, with $m > n$, $m = O(n^c)$. Fixed-modulus Montgomery reduction can be implemented as an $\widehat{LT}$ circuit of depth 2. The circuit is optimal in depth.
Proof. Let $N$ be the $n$-bit modulus and let $T = \sum_{i=0}^{m-1} t_i 2^i$ be the $m$-bit number to be reduced, $m > n$, and $m = O(n^c)$. Let $k'$ be an arbitrary nonnegative integer. The essential idea is to replace each bit $t_i$ of $T$ falling outside the $n$-bit reduction range with a suitable quantity $N^{(i)}$, congruent to $2^i$ modulo $N$ and having all zero bits outside this range. The reduced $T$ will be computed as the sum of such quantities, based on Lemma 3.
Define the quantities $N^{(i)}$ as follows:

$$N^{(i)} = \begin{cases} 2^i + (2^i \cdot \tilde{N} \bmod 2^{k'}) \cdot N, & 0 \le i < k', \\ (2^{i-k'} \bmod N) \cdot 2^{k'}, & i \ge k' + n, \end{cases} \qquad (2)$$

so that $N^{(i)} \equiv 2^i \pmod N$ for any $i < k'$ and $i \ge k' + n$. As in Algorithm 1, the quantity $\tilde{N}$ is such that $2^{k'} \cdot 2^{-k'} - N \cdot \tilde{N} = 1$. Note that all $N^{(i)}$ are multiples of $2^{k'}$. In particular, for $0 \le i < k'$, the quantities $N^{(i)}$ are computed like the numerator of $P$ in Step 2 of Algorithm 1, with $T = 2^i$ and $R = 2^{k'}$, and thus, are divisible by $2^{k'}$. It can also be easily verified that for any $i$, $N^{(i)} < 2^{k'+n}$, i.e., all $N^{(i)}$ have nonzero bits only within the range $[k', k' + n - 1]$. Note also that the quantities $N^{(i)}$ only depend on the modulus $N$ and the value $k'$. An example is given in Fig. 4.
The $m$-bit number $T$ can be written as

$$T = \sum_{i=0}^{m-1} t_i 2^i = \sum_{i=0}^{k'-1} t_i 2^i + \sum_{i=k'}^{k'+n-1} t_i 2^i + \sum_{i=k'+n}^{m-1} t_i 2^i,$$

which is congruent modulo $N$ with the following quantity:

$$\sum_{i=0}^{k'-1} t_i N^{(i)} + \sum_{i=k'}^{k'+n-1} t_i 2^i + \sum_{i=k'+n}^{m-1} t_i N^{(i)} = T' \cdot 2^{k'}. \qquad (3)$$
The third summation is present only if $m > k' + n$. Based on the definition of $N^{(i)}$, the above quantity is a multiple of $2^{k'}$ and has been thus rewritten as $T' \cdot 2^{k'}$. Note that $T'$ is simply obtained by reorganizing the input bits $t_i$ since the modulus and the related quantities $N^{(i)}$ are fixed. We thus obtain $m - n + 1 = O(n^c)$ numbers to sum, all having their nonzero bits in the range $[k', k' + n - 1]$. They can be treated as $n$-bit numbers, and based on Lemma 3, they can be summed and Montgomery-reduced with a depth-2 $\widehat{LT}$ circuit. This yields a quantity $T''$ less than $2N$ and congruent to $T' \cdot 2^{-k''} \bmod N = T \cdot 2^{-(k' + k'')} \bmod N = T \cdot 2^{-k} \bmod N$, with $k = k' + k''$, i.e., the final Montgomery reduction.
Fig. 4. Generation and use of the quantities $N^{(i)}$.
It is easy to show that this depth is optimal. For example, if we take $m = n + 1$, $T = \sum_{i=0}^{n} t_i 2^i$, with $t_1 = t_2$ and $t_0 = 0$, and $N = \sum_{i=0}^{n-1} n_i 2^i$, with $n_1 = n_2 = 0$ and $n_0 = 1$, we can apply the previous construction with $k' = 1$, $k'' = 2$, and $k = k' + k'' = 3$. With such inputs, it could be shown that the least significant bit of the reduced number $P = T \cdot 2^{-k} \bmod N$ is $p_0 = t_3 \oplus t_2$. It is well known, however, that the XOR function cannot be implemented as a depth-1 threshold circuit. □
Note that, unlike $k''$ (which must be set to $1 + \log(m - n + 1)$ by Lemma 3), the constant $k'$ was initially set to an arbitrary value in the above proof. The constant $k'$ can be used to adjust the final Montgomery extra factor $2^k = 2^{k' + k''}$ to an appropriate value, e.g., $2^n$ or $2^{n+2}$, used in common applications.
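The construction of Theorem 1 can be modeled in software to check its arithmetic. In the sketch below, the function names, the choice $k' = 2$, and the sample modulus are assumptions made for illustration; the code replaces out-of-range bits by the quantities $N^{(i)}$ of (2), divides by $2^{k'}$, and applies the multiple-sum reduction of Lemma 3 with $k''$ derived from the number of summands.

```python
def ceil_log2(m):
    """Base-2 logarithm rounded up."""
    return (m - 1).bit_length()


def n_tilde(N, k):
    """Return N_tilde with N * N_tilde = -1 (mod 2^k), for odd N and k >= 1."""
    return (-pow(N, -1, 1 << k)) % (1 << k)


def replacement(i, N, k1):
    """Quantity N^(i) of (2): congruent to 2^i mod N and a multiple of 2^k1."""
    if i < k1:
        return (1 << i) + (((1 << i) * n_tilde(N, k1)) % (1 << k1)) * N
    return pow(2, i - k1, N) << k1               # used for i >= k1 + n


def reduce_wide(T, m, N, n, k1):
    """Return (P, k) with P < 2N and P congruent to T * 2^(-k) mod N."""
    bits = [(T >> i) & 1 for i in range(m)]
    low = [bits[i] * replacement(i, N, k1) for i in range(min(k1, m))]
    mid = sum(bits[i] << i for i in range(k1, min(k1 + n, m)))
    high = [bits[i] * replacement(i, N, k1) for i in range(k1 + n, m)]
    summands = [v >> k1 for v in low + [mid] + high]   # n-bit numbers
    k2 = 1 + ceil_log2(len(summands))                  # k'' as in Lemma 3
    T_prime = sum(summands)
    P = (T_prime + ((T_prime * n_tilde(N, k2)) % (1 << k2)) * N) >> k2
    return P, k1 + k2


if __name__ == "__main__":
    N, n, m, k1 = 53, 6, 12, 2                   # illustrative sizes
    P, k = reduce_wide(3000, m, N, n, k1)
    print(P < 2 * N, (P << k) % N == 3000 % N)
```

Since the modulus is fixed, every `replacement` value is a constant, so the first stage only rearranges input bits, as the proof points out.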
Theorem 2. Fixed-modulus Montgomery multiplication of two $O(n^c)$-bit numbers $A$ and $B$ can be implemented as an $\widehat{LT}$ circuit of depth 3. The circuit is optimal in depth.
Proof. The proof of this theorem is similar to Theorem 1. Let $N$ be the fixed $n$-bit modulus and $A = \sum_{i=0}^{m-1} a_i 2^i$ and $B = \sum_{j=0}^{m-1} b_j 2^j$ be the two $m$-bit numbers to be multiplied modulo $N$, $m = O(n^c)$. The unreduced product $T = A \cdot B$ can then be written as

$$T = \sum_{l=0}^{2m-2} t_l 2^l = \sum_{l=0}^{2m-2} \sum_{i+j=l} a_i b_j 2^l, \qquad 0 \le i, j \le m - 1.$$
We can choose an arbitrary integer $k'$, define the values $N^{(i)}$ as in (2), and write the following quantity, congruent to $T = \sum_{l=0}^{2m-2} t_l 2^l$ modulo $N$:

$$\sum_{l=0}^{k'-1} \Bigg[\sum_{i+j=l} a_i b_j\Bigg] N^{(l)} + \sum_{l=k'}^{k'+n-1} \Bigg[\sum_{i+j=l} a_i b_j\Bigg] 2^l + \sum_{l=k'+n}^{2m-2} \Bigg[\sum_{i+j=l} a_i b_j\Bigg] N^{(l)} = T' \cdot 2^{k'}.$$
This time, however, the first and third summations contain more terms in the form $a_i b_j N^{(l)}$, although no more than $m$ for each value of $l$. The number of $n$-bit numbers to be added and Montgomery-reduced is therefore still polynomial in $n$. The construction of the circuit proceeds in a way similar to Theorem 1. It will just require an additional layer needed to compute the bit products $a_i b_j$, so the overall depth will be three.
It is easy to show that this depth is optimal. Choose two arbitrary $h$-bit numbers $A$ and $B$ and any modulus $N$ of $n = 4h$ bits. Compute the Montgomery multiplication of $A \cdot 2^h$ and $B \cdot 2^h$ choosing $k = 2h$:

$$P = \frac{A \cdot B \cdot 2^{2h} + (A \cdot B \cdot 2^{2h} \cdot \tilde{N} \bmod 2^{2h}) \cdot N}{2^{2h}} = A \cdot B.$$

If Montgomery multiplication could be implemented as an $\widehat{LT}$ circuit of depth 2, then this would be possible also for multiplication, contradicting previously established results [22]. □
By establishing some optimal bounds, the above theo-
rems make a useful contribution to the theoretical study of
Threshold Logic computational power, extending previous
results on small depth threshold circuits for arithmetic
operations [20], [21], [22].
4 CONSTANT-DEPTH $O(n^2)$-SIZE ARCHITECTURES FOR MODULAR REDUCTION AND MULTIPLICATION
Section 3 establishes an optimal bound on the depth of an $\widehat{LT}$ circuit implementing modular multiplication. Both weights and the circuit size are guaranteed to be polynomial in the number of inputs $n$. However, a broad $O(n^c)$ bound on area complexity may still not ensure a feasible implementation even for moderately large bit sizes $n$. In this section, we present two constant-depth architectures for modular reduction and multiplication, which ensure an area complexity of $O(n^2)$. The following two lemmas, also presented in [21] with a similar formulation, are useful for the proof of the theorems introduced in this section:
Lemma 4. Let $X = (x_0, x_1, \ldots, x_{n-1}) \in \{0,1\}^n$, and $f: \{0,1\}^n \to \{0,1\}$ be a Boolean function of $X$, which depends only on a weighted sum of the input variables $x_i$, with polynomially bounded integer weights. Then, there exist an integer constant $s = O(n^c)$ and polynomially many linear threshold functions $g_j(X)$ such that the integer function $\sum g_j(X) - s$ takes on the same values $\{0, 1\}$ as the Boolean function $f(X)$.
Proof. Let $f(X)$ be a Boolean function such that $f(X) = f'(\sum w_i x_i)$, where the weights $w_i$ are polynomially bounded. Note that the value of $\sum w_i x_i$ upon which the function $f$ depends never exceeds the interval $[-M, M]$, where $M = \sum |w_i|$. For some values of $\sum w_i x_i$ within this interval, the function $f(X)$ takes on the value 1, while for others, it takes on the value 0. Let $I^{(1)} = [\alpha_0, \beta_0] \cup [\alpha_1, \beta_1] \cup \cdots \cup [\alpha_{s-1}, \beta_{s-1}]$ be the subset of $[-M, M]$ such that $f(X) = 1 \iff \sum w_i x_i \in I^{(1)}$. Due to the bound on the weights $w_i$, the size of the interval $[-M, M]$, and hence, the number $s$ of subintervals in its subset $I^{(1)}$, are polynomially large integers. We define the following $2s$ threshold functions $g_j$, each having polynomially bounded weights:

$$g'_j = \mathrm{sgn}\Big(\sum w_i x_i - \alpha_j\Big), \qquad g''_j = \mathrm{sgn}\Big(\beta_j - \sum w_i x_i\Big)$$

for $j = 0, \ldots, s - 1$. It can be easily shown that $f(X) = \sum g'_j + \sum g''_j - s$, where $f(X)$ is viewed as an integer function. In fact, if $\sum w_i x_i$ does not fall in any 1-interval $[\alpha_j, \beta_j]$, then there are exactly $s$ among the $2s$ functions $g'_j$ and $g''_j$ which are equal to 1, while if $\sum w_i x_i$ falls in some 1-interval $[\alpha_k, \beta_k]$, then both $g'_k$ and $g''_k$ are equal to 1, and exactly $s - 1$ of the remaining $g'_j$ and $g''_j$ ($j \ne k$) are equal to 1. □
Clearly, a function $f$ satisfying the hypothesis of Lemma 4 can be realized as a depth-2 $\widehat{LT}$ circuit, with the second layer defined as $f = \mathrm{sgn}(\sum g'_j + \sum g''_j - s - 1)$.
Remark 1. If some linear threshold functions $f^{(i)}$ all satisfying the hypothesis of Lemma 4 are used as the input to a linear threshold function $g$, then a depth-2 threshold circuit is sufficient to implement $g$.
This property is true because $g = \mathrm{sgn}(\sum_i w_i f^{(i)} - t) = \mathrm{sgn}(\sum_i w_i (\sum_j g^{(i)}_j - s_i) - t) = \mathrm{sgn}(\sum_{i,j} w_i g^{(i)}_j - t')$. Additionally, if there are polynomially many $f^{(i)}$ and the weights in $g$ are polynomially bounded, then this function can be implemented as an $\widehat{LT}$ circuit.
Lemma 5. Let $X = (y_0, \ldots, y_{n_1 - 1}, z_0, \ldots, z_{n_2 - 1}) \in \{0,1\}^n$, $n = n_1 + n_2$, and $f: \{0,1\}^n \to \{0,1\}$ be a Boolean function of $X$ which depends only on a weighted sum of the input variables $y_i$ and a weighted sum of the input variables $z_i$, both with polynomially bounded weights. Then, there exist a constant $s = O(n^c)$ and polynomially many linear threshold functions $g_j(X)$ such that the integer function $\sum g_j(X) - s$ takes on the same values $\{0, 1\}$ as the Boolean function $f(X)$.
Proof. It suffices to show that a function $f(X) = f'(\sum_i w_i y_i, \sum_k v_k z_k)$ satisfying the hypothesis of Lemma 5 also satisfies Lemma 4. Let $M = \sum_k |v_k| + 1$ be an integer such that $\sum_k v_k z_k \in (-M, M)$ for any input $X$. Note that $M$ is polynomially bounded in $n$ because the $v_k$'s are. Consider the base-$2M$ number $I = (\sum_i w_i y_i) \cdot 2M + (\sum_k v_k z_k)$. Clearly, distinct values of the pair $(\sum_i w_i y_i, \sum_k v_k z_k)$ give distinct values of $I$. Thus, we can write $f = f(X) = f'(\sum_i w_i y_i, \sum_k v_k z_k) = f''(I)$, which is a function satisfying Lemma 4, since $I$ is a polynomially bounded weighted sum of the input variables. □
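The packing argument in the proof of Lemma 5 can be illustrated with a short sketch. The helper names and the sample weights are assumptions introduced here: two bounded weighted sums are merged into a single weighted sum in base $2M$, and the pair is recovered from it, which shows that the mapping is injective.

```python
from itertools import product


def pack(a, b, M):
    """Merge two weighted sums into one number, with |b| < M."""
    return a * 2 * M + b


def unpack(I, M):
    """Recover (a, b) from the packed value."""
    b = (I + M) % (2 * M) - M          # b lies in [-M, M)
    a = (I - b) // (2 * M)
    return a, b


def combined_weights(w, v):
    """Weights of the single sum over (y, z) that equals pack(a, b, M)."""
    M = sum(abs(vk) for vk in v) + 1
    return [2 * M * wi for wi in w] + list(v), M


if __name__ == "__main__":
    w, v = [3, -2, 1], [2, -1]         # illustrative small weights
    weights, M = combined_weights(w, v)
    for bits in product((0, 1), repeat=len(w) + len(v)):
        y, z = bits[:len(w)], bits[len(w):]
        a = sum(wi * yi for wi, yi in zip(w, y))
        b = sum(vk * zk for vk, zk in zip(v, z))
        I = sum(wt * x for wt, x in zip(weights, bits))
        assert I == pack(a, b, M) and unpack(I, M) == (a, b)
    print("pairs are recoverable from the packed sum")
```

The combined weights grow only by the factor $2M$, so they remain polynomially bounded whenever the original weights are.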
For our results on Montgomery reduction and multiplication, we will exploit the so-called block-save technique, used in [19], to obtain small-depth threshold logic circuits for multiplication. The block-save technique addresses the problem of summing $n$ $n$-bit integers. The essential idea is to partition the $n$ numbers vertically into columns of width $\log n$. The block-save technique computes separately the sum in each column $i$ in the form $C_i 2^{\log n} + S_i$, with $S_i < 2^{\log n}$. Since in each column we sum $n$ numbers, we also have $C_i < n \le 2^{\log n}$. Therefore, the carry word $C$ of a column can only overlap with the sum word $S$ of the subsequent column. In other words, computing the block-save sum in all columns (which can be performed in parallel) results in only two binary numbers, $C$ and $S$, which can then be summed together to give the final result of the addition. This process is illustrated in Fig. 5.
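A functional model of block-save addition may help to see why only two words remain. This is a plain integer sketch, not a threshold circuit; the function name and the example operands are hypothetical, and it assumes at most $2^w$ operands for a column width of $w$ bits.

```python
def block_save(numbers, w):
    """Return (C, S) with C + S == sum(numbers), using columns of width w.

    Assumes len(numbers) <= 2**w, so each column carry fits in w bits.
    """
    assert len(numbers) <= (1 << w)
    mask = (1 << w) - 1
    n_bits = max((x.bit_length() for x in numbers), default=0)
    columns = -(-n_bits // w)  # ceiling division
    C = S = 0
    for i in range(columns):
        # Sum of the w-bit slices of all operands in column i.
        col_sum = sum((x >> (i * w)) & mask for x in numbers)
        S |= (col_sum & mask) << (i * w)        # sum word S_i stays in column i
        C |= (col_sum >> w) << ((i + 1) * w)    # carry word C_i moves one column up
    return C, S

# Eight 8-bit operands, columns of width log2(8) = 3.
operands = [255, 1, 128, 77, 200, 13, 99, 254]
C, S = block_save(operands, 3)
assert C + S == sum(operands)
```

The bitwise OR is safe here only because the carry of one column never spills past the next column, which is the overlap property stated above.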
An interesting property is that the single block-save addition, based on Lemma 4, can be computed by a depth-2 $\widehat{LT}$ circuit. In fact, $C_i$ and $S_i$ depend only on the sum $\sum_{j=0}^{n-1}\sum_{h=0}^{\log n-1} 2^h x_{jh}$, where $x_{jh}$ are the input bits present in the $i$th column (gray area in Fig. 5) and the weights $2^h$ are less than $n$. So, by Lemma 4, each bit in $C_i$ and $S_i$ can be implemented as an $\widehat{LT}$ circuit of depth 2.
Furthermore, if the subsequent Adder in Fig. 5 is implemented as a threshold logic circuit, we can merge its first layer with the second layer of the block-save circuit, as suggested in Remark 1. The Adder could be implemented as a depth-2 polynomial-size circuit, as in [20], resulting in an overall depth-3 circuit for the $n$-operand addition of Fig. 5. Note that this scheme, based on Lemma 4, could also be applied if we had, more generally, $O(n)$ numbers to sum. The approach taken in the following relies on the above block-save technique, combined with Montgomery modular reduction. For evaluating the corresponding area complexity, a useful consideration is summarized in the following lemma, which concerns the generation of the $C, S$ pair from a single column of $n$ numbers of $\log n$ bits:
Lemma 6. Given $n$ $\log n$-bit numbers $X_j = \sum_{i=0}^{\log n-1} 2^i x_{ji}$, with $j \in [0, n-1]$, each bit of $C$ and $S < 2^{\log n}$ such that $\sum_{j=0}^{n-1} X_j = C \cdot 2^{\log n} + S$ can be generated by a depth-2 $\widehat{LT}$ circuit having an area complexity of $O(n)$.
Proof. First, consider $S = \sum_{i=0}^{\log n-1} 2^i s_i$. Each bit $s_h$ of $S$ depends only on the truncated sum $\sigma_h = \sum_{j=0}^{n-1}\sum_{i=0}^{h} 2^i x_{ji}$. Precisely, since $s_h$ is the bit of $S$ and $\sigma_h$ in position $h$, it can be written as $s_h = (\sigma_h \ \mathrm{div}\ 2^h) \bmod 2$ (where div denotes the integer division). Let $M_h = (2^{h+1} - 1) \cdot n$. We clearly have $0 \le \sigma_h \le M_h$. Following the construction of Lemma 4, let $I_h^{(1)} = [c_0, u_0] \cup [c_1, u_1] \cup \cdots \cup [c_{s-1}, u_{s-1}]$ be the subset of $[0, M_h]$ such that $s_h = 1 \iff \sigma_h \in I_h^{(1)}$. We can correspondingly define $2s$ threshold functions $\alpha$ such that $s_h = \sum \alpha - s$, where $s$ is the number of subintervals in $I_h^{(1)}$. Since $s_h$ depends only on $\sigma_h \ \mathrm{div}\ 2^h$, it is constant within intervals of width $2^h$ due to the integer division by $2^h$. It follows that $s_h$ may change its value $M_h/2^h = (2^{h+1} - 1) \cdot n/2^h \approx 2n$ times at most, as $\sigma_h$ spans the range $[0, M_h]$. Hence, the maximum number of subintervals $[c_i, u_i] \subseteq I_h^{(1)}$ where $s_h = 1$ is approximately $n$, and by Lemma 4, the maximum number of functions $\alpha$ is approximately $2n$.
A very similar reasoning can be applied to the bits $c_h$ of the word $C$, although we omit the details here. □
458 IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 4, APRIL 2011
Fig. 5. Illustration of block-save addition.
The example shown in Fig. 6 will help clarify the construction used in Lemmas 4 and 6. Suppose that we have $n = 8$ numbers of $\log n = 3$ bits to be summed and consider, for instance, the computation of bit $s_1$ ($h = 1$). The bit $s_1$ clearly depends only on input bits $x_{ji}$, $j \in [0, 7]$, $i \in [0, 1]$. Define the weighted sum of the input bits as $\sigma = \sum_{j=0}^{7}\sum_{i=0}^{1} x_{ji} 2^i$. We have
$s_1 = 1 \iff \sigma \in [2, 3] \cup [6, 7] \cup [10, 11] \cup [14, 15] \cup [18, 19] \cup [22, 23]$.
We thus need 12 functions $\alpha$, constituting the first layer in the circuit. Each corresponding threshold gate is represented in the figure as a circle, with an indication of the input weighted sum followed by a comma and the value of the threshold for that gate, e.g., $(\sigma, 2)$ denotes the gate computing the threshold function $\mathrm{sgn}\left(\sum_{j=0}^{7}\sum_{i=0}^{1} x_{ji} 2^i - 2\right)$.
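The intervals in the example of Fig. 6 can be reproduced by enumerating the possible values of the weighted sum. This is an illustrative sketch; the function name is hypothetical, and the two-gates-per-interval count assumes the construction of Lemma 4 as used in the example.

```python
def ones_intervals(n, h):
    """Intervals of the truncated sum (range 0..(2**(h+1)-1)*n) where bit h is 1."""
    upper = (2 ** (h + 1) - 1) * n
    intervals, start = [], None
    for sigma in range(upper + 1):
        bit = (sigma >> h) & 1
        if bit and start is None:
            start = sigma                        # an interval of ones begins
        elif not bit and start is not None:
            intervals.append((start, sigma - 1)) # the interval just ended
            start = None
    if start is not None:
        intervals.append((start, upper))
    return intervals

# n = 8 numbers, bit s_1: six intervals, hence 12 first-layer gates.
example = ones_intervals(8, 1)
assert example == [(2, 3), (6, 7), (10, 11), (14, 15), (18, 19), (22, 23)]
assert 2 * len(example) == 12
```

For other values of `n` and `h` the same enumeration gives a direct count of first-layer gates, which can be compared against the approximate bound of $2n$ from Lemma 6.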
Remark 2. Since each output bit requires an area complexity of $O(n)$, the generation of the carry/sum pair for the block-save addition of $n$ $n$-bit numbers requires an area complexity of $O(n^2)$.
We will now consider the Montgomery reduction of a multiple sum of $n$-bit numbers, similar to Lemma 3. Unlike the construction of the Lemma, however, we will decompose the reduction process into more stages in order to derive a threshold circuit with $O(n^2)$ area complexity. As usual, call $T = \sum_j X_j$ the sum of the $n$ $n$-bit numbers. According to Algorithm 1, we first need to compute $Q = T \cdot \tilde{N} \bmod 2^k$, and hence, the quantity $Q \cdot N$ (call it $Y$). Then, we will add $Y = Q \cdot N$ along with the input numbers $X_j$, discarding the zero bits produced in the lowest positions of the sum output. The multiple sum will be based on the block-save technique, as depicted in Fig. 7.
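The reduction step just described can be modeled with ordinary integers. The sketch below assumes the standard Montgomery constant $\tilde{N} = -N^{-1} \bmod 2^k$ for an odd modulus $N$; Algorithm 1 is not reproduced in this section, so that choice is an assumption, as are the function name and the example values. The three-argument `pow` with a negative exponent requires Python 3.8 or later.

```python
def montgomery_reduce(T, N, k):
    """Return (T + Q*N) / 2**k, which is congruent to T * 2**(-k) modulo N.

    Assumes N is odd and N_tilde = -N^(-1) mod 2**k (standard Montgomery form).
    """
    R = 1 << k
    N_tilde = (-pow(N, -1, R)) % R
    Q = (T * N_tilde) % R           # depends only on the k lowest bits of T
    Y = Q * N                       # multiple of N that clears those bits
    assert (T + Y) % R == 0         # the k lowest bits of the sum are zero
    return (T + Y) >> k             # discard the zero bits

# Hypothetical example values.
N, k, T = 97, 8, 12345
reduced = montgomery_reduce(T, N, k)
assert (reduced * (1 << k) - T) % N == 0
```

Because $Q$ is taken modulo $2^k$, only the lowest $k$ bits of $T$ matter when computing it, which is the property the block-save decomposition exploits.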
As usual, partition the input numbers $X_j$ into $k$-bit vertical columns. Call the first column $T_0$, i.e., $T_0 = \sum_{j=0}^{n-1}\sum_{i=0}^{k-1} 2^i x_{ji}$, and similarly, call $T_i$ one of the subsequent columns. As emphasized in the figure, $Q = T \cdot \tilde{N} \bmod 2^k = T_0 \cdot \tilde{N} \bmod 2^k$ depends only on $T_0$. Hence, each bit of $C_i$ and $S_i$ depends only on the two weighted sums $T_i$ and $T_0$, and thus, satisfies Lemma 5 and Lemma 4. We could thus directly obtain the carry/sum pair as a depth-2 circuit. Unfortunately, by applying the procedure in the proof of the two lemmas, we would need at least $O(n^3)$ different functions $\alpha$ in order to build each of the bits of $C$ and $S$, thereby obtaining an $O(n^4)$ area complexity overall. We will show here that a suitable decomposition of the circuit can limit the area complexity to $O(n^2)$, at the price of only one additional layer. We will compute $Q$ and $Y = Q \cdot N$ explicitly. Note that $Q = T \cdot \tilde{N} \bmod 2^k$ has exactly $k$ bits, where $k = O(\log n)$. The quantity $Y = Q \cdot N$, to be summed along with the input numbers $X_j$, can thus be obtained as $k$ different multiples of the (fixed) modulus $N$ added separately in the overall sum (see Fig. 7). We thus have $n + k$ numbers to add, which will be summed by applying the block-save technique. The resulting $C, S$ pair is such that $C + S/2^k \equiv T \cdot 2^{-k} \pmod{N}$, where $T = \sum_{j=0}^{n-1} X_j$. Since $k$ must be equal to the logarithm of the number of addends plus 1, we will here set $k$ to the minimum integer such that $k = 1 + \log(n + k)$. In fact, we will always have either $k = 1 + \log n$ or $k = 2 + \log n$ for any $n \ge 2$. Based on this decomposition, the following lemma helps build an $O(n^2)$-size circuit for block-save Montgomery reduction:
Lemma 7. Given $n$ $n$-bit numbers $X_j = \sum_{h=0}^{n-1} 2^h x_{jh}$, with $j \in [0, n-1]$, there exists a depth-3 $\widehat{LT}$ circuit generating the Montgomery-reduced block-save pair $C$ and $S$, requiring an area complexity of $O(n^2)$.
Proof. We use the same notations and the decomposition emphasized in Fig. 7. Although $Q$ depends only on the weighted sum $T_0$, we cannot directly evaluate area complexity as done for the $C, S$ pair in Lemma 6, since $Q$ has a more complex relationship with input bits. A key property we can exploit, however, is that each bit $q_h$ of $Q$ depends only on
$Q \bmod 2^{h+1} = \left(T_0 \cdot \tilde{N} \bmod 2^k\right) \bmod 2^{h+1} = \left(\left(T_0 \bmod 2^{h+1}\right) \cdot \tilde{N} \bmod 2^k\right) \bmod 2^{h+1}$,
i.e., on the input bits $x_{ji}$, with $i \le h < k$. The input weighted sum $\sum_{j=0}^{n-1}\sum_{i=0}^{h} 2^i x_{ji}$ never exceeds the range $[0, n(2^{h+1} - 1)]$, so the application of Lemma 4 to $q_h$ requires no more than $n \cdot 2^{h+1}$ functions $\alpha_q$. For generating all $k$ bits of $Q$, the area complexity will thus be in the order of
$\sum_{h=0}^{k-1} n 2^{h+1} = n \cdot (2^{k+1} - 2) \le n \cdot (2^{3+\log n} - 1) = O(n^2)$.
Each bit $q_h$ can be written as the sum of some functions $\alpha_q$ that, based on Remark 1, can be driven directly to the subsequent layer of threshold gates. In column $i$, the bits $q_h$ are used to generate the block $Y_i$ (see Fig. 7). Since the modulus $N$ is fixed, $Y = Q \cdot N$ is just a reorganization of bits $q_h$, and thus, $Y_i$ does not require any actual computation. Finally, we can compute the bits of the words $C_i$ and $S_i$. These bits depend on $O(n + k) = O(n)$ rows (taking the bits of $Y_i$ as input), and thus, based on Lemma 6, each of them can be written as a sum of $O(n)$ functions $\alpha_{cs}$ depending on the input bits in $T_i$ and $Y_i$.
CILARDO: EXPLORING THE POTENTIAL OF THRESHOLD LOGIC FOR CRYPTOGRAPHY-RELATED OPERATIONS 459
Fig. 7. Scheme for depth-3 Montgomery block-save reduction. The resulting $C, S$ pair is such that $C + S/2^k \equiv T \cdot 2^{-k} \pmod{N}$, where $T = \sum_{j=0}^{n-1} X_j$.
Fig. 6. Depth-2 circuit for generation of output $s_1$, with $n = 8$.
For generating all bits of $C$ and $S$, we thus need $O(n^2)$ functions $\alpha_{cs}$. The bits of $C$ and $S$, expressed as a sum of functions $\alpha_{cs}$, can be computed by a third layer of threshold logic functions (as $\mathrm{sgn}(\sum \alpha_{cs} - s)$), or merged with the first layer of a subsequent stage. A high-level view of the circuit is depicted in Fig. 8. □
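Two facts used around Lemma 7 can be checked numerically: the choice of the column width $k$, and the property that bit $h$ of $Q$ depends only on $T_0 \bmod 2^{h+1}$. The sketch below makes two assumptions that the text does not state explicitly: $\log$ is read as the floor of the base-2 logarithm, and $\tilde{N} = -N^{-1} \bmod 2^k$. The function names and the modulus 97 are hypothetical.

```python
def choose_k(n):
    """Smallest k with k = 1 + floor(log2(n + k)); assumes log means floor(log2)."""
    k = 1
    # For m >= 1, floor(log2(m)) + 1 equals m.bit_length().
    while k != (n + k).bit_length():
        k += 1
    return k

def q_bit(T0, N_tilde, k, h):
    """Bit h of Q = T0 * N_tilde mod 2**k."""
    return (((T0 * N_tilde) % (1 << k)) >> h) & 1

def q_bit_truncated(T0, N_tilde, h):
    """The same bit, computed from T0 mod 2**(h+1) only."""
    m = 1 << (h + 1)
    return ((((T0 % m) * N_tilde) % m) >> h) & 1

# k is always 1 + floor(log2 n) or 2 + floor(log2 n).
for n in range(2, 300):
    floor_log = n.bit_length() - 1
    assert choose_k(n) in (1 + floor_log, 2 + floor_log)

# Each bit of Q is determined by the low h + 1 bits of T0.
k = choose_k(8)
N_tilde = (-pow(97, -1, 1 << k)) % (1 << k)
for T0 in range(500):
    for h in range(k):
        assert q_bit(T0, N_tilde, k, h) == q_bit_truncated(T0, N_tilde, h)
```

The second check is what keeps the gate count for $q_h$ proportional to $n \cdot 2^{h+1}$ rather than to the full range of $T_0$.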
Remark 3. Lemmas 6 and 7 still hold true if we have $O(n)$ numbers to sum.
Note that the $C, S$ pair provides the result in redundant form. To sum them together in a nonredundant form, we rely on the addition scheme proposed in [24]. For $n$-bit operands, it provides a depth-3, $O(n^2/\log n)$-size $\widehat{LT}$ circuit for integer addition. Based on Remark 1, the first of its three layers can be merged with the last layer of the block-save Montgomery reduction circuit (Fig. 8), yielding a depth-5 circuit overall. This idea is exploited by the following two theorems:
Theorem 3. Let $N$ be an $n$-bit modulus and $T$ an $m$-bit number, with $m > n$ and $m = O(n)$. Fixed-modulus Montgomery reduction can be implemented as a depth-5 $\widehat{LT}$ circuit with an $O(n^2)$ area complexity.
Proof. Similar to Theorem 1, let $N$ be the $n$-bit modulus and $T = \sum_{i=0}^{m-1} 2^i t_i$ be the $m$-bit input to reduce, with $m > n$, $m = O(n)$. Choose an arbitrary $k'$ and define $N^{(i)}$ as in (2). Similar to (3), $T$ is congruent modulo $N$ to the following quantity:
$\sum_{i=0}^{k'-1} t_i N^{(i)} + \sum_{i=k'}^{k'+n-1} t_i 2^i + \sum_{i=k'+n}^{m-1} t_i N^{(i)} = T' \cdot 2^{k'}$.
Since the modulus $N$ is fixed, $T'$ is a reorganization of input bits $t_i$ (like in Fig. 4). In the above expression, we have to sum and Montgomery-reduce $m - n + 1 = O(n)$ $n$-bit numbers. We thus choose the minimum integer $k''$ such that $k'' = 1 + \log(m - n + 1 + k'')$, and apply the block-save Montgomery reduction of Lemma 7 to generate a $C'', S''$ pair such that the following congruence modulo $N$ holds for $C'' + S''/2^{k''} = T''$:
$T'' \equiv T' \cdot 2^{-k''} \equiv T \cdot 2^{-(k'+k'')}$.
Finally, we use the adder described in [24] to sum the carry/sum pair with an $O(n^2/\log n)$ area complexity. The adder requires three layers, but the first of these can be merged with the third layer used to generate the carry/sum pair of $T''$. We have five layers overall, and each of these does not require more than $O(n^2)$ gates to compute the result. As before, the first constant $k'$ was chosen arbitrarily and can be set to adjust the Montgomery extra factor $2^{-(k'+k'')}$ to an appropriate value, e.g., $2^{-n}$ or $2^{-(n+2)}$. □
Finally, the following theorem allows us to build a constant-depth circuit for modular multiplication with $O(n^2)$ area complexity:
Theorem 4. Given an $n$-bit fixed modulus $N$, the Montgomery multiplication of two numbers $A, B < 2N$ can be implemented as a depth-7 $\widehat{LT}$ circuit with an $O(n^2)$ area complexity.
Proof. Let $N$ be the $n$-bit modulus and $A = \sum_{i=0}^{n} a_i 2^i$ and $B = \sum_{j=0}^{n} b_j 2^j$ be the two $(n+1)$-bit numbers to be multiplied modulo $N$. As observed in Section 2.2, $A$ and $B$ are allowed to be in the range $[0, 2N)$. We need a first layer computing the bit products $a_i b_j$, which produces the $n + 1$ numbers $\sum_{i=0}^{n} 2^{i+j} a_i b_j$, $j \in [0, n]$. We then use Lemma 6 to sum them together, requiring an $O(n^2)$ area complexity and producing a carry/sum pair $T^C = \sum_{i=l}^{2n+l} 2^i t_i^C$, $T^S = \sum_{i=0}^{2n} 2^i t_i^S$, where $l = \log(n + 1)$. The output values $T^C$ and $T^S$ have $4n + 2$ bits overall.
Choose an arbitrary $k'$ and define $N^{(i)}$ as in (2). Similar to (3), we obtain a reduced version $T'$ by exploiting the relationship $2^i \equiv N^{(i)} \bmod N$:
$A \cdot B \equiv \sum_{i=l}^{k'-1} t_i^C N^{(i)} + \sum_{i=0}^{k'-1} t_i^S N^{(i)} + \sum_{i=k'}^{k'+n-1} t_i^C 2^i + \sum_{i=k'}^{k'+n-1} t_i^S 2^i + \sum_{i=k'+n}^{2n+l} t_i^C N^{(i)} + \sum_{i=k'+n}^{2n} t_i^S N^{(i)}$.
Because of the definition of $N^{(i)}$, the above quantity is a multiple of $2^{k'}$. Let's call it $T' \cdot 2^{k'}$. Note that, depending on the value of $k'$, some of the above summations may not be present.
In the above expression, we have $O(n)$ $n$-bit numbers, which can be Montgomery-reduced based on Lemma 7, requiring $O(n^2)$ area complexity. Again, this yields a carry/sum pair $C'', S''$ such that $T'' \equiv T' \cdot 2^{-k''} \equiv A \cdot B \cdot 2^{-(k'+k'')}$, where $T'' = C'' + S''/2^{k''}$. The $C'', S''$ pair can then be converted into a nonredundant form by adopting the $O(n^2/\log n)$-size adder used above.
We need one layer for bit products $a_i b_j$, two layers for generating the carry/sum pair $T^C, T^S$ (the second of these layers can be merged with the subsequent one), three layers for Montgomery block-save reduction producing $T''$ (again, the third layer can be merged with the subsequent one), and three layers for the final addition. By applying merging where possible, we need seven layers altogether. Area complexity is no more than $O(n^2)$ in each of them. □
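As a functional reference for what the depth-7 circuit computes, Montgomery multiplication can be modeled with plain integers. This sketch is not the threshold circuit itself. It assumes $\tilde{N} = -N^{-1} \bmod 2^r$ and an extra factor of $2^{-r}$ with $2^r > 4N$, which is one way to keep the output below $2N$ for inputs below $2N$. The function name and the parameters 97 and 9 are hypothetical.

```python
def mont_mul(A, B, N, r):
    """Return a value congruent to A * B * 2**(-r) modulo N.

    Assumes N is odd, N_tilde = -N^(-1) mod 2**r, and 2**r > 4*N, so that
    inputs below 2N give an output below 2N.
    """
    R = 1 << r
    N_tilde = (-pow(N, -1, R)) % R
    T = A * B                      # full product (the bit-product layer plus sum)
    Q = (T * N_tilde) % R          # Montgomery quotient digit
    return (T + Q * N) >> r        # exact division by 2**r

# Exhaustive check for a small hypothetical modulus.
N, r = 97, 9
R = 1 << r
for A in range(2 * N):
    for B in range(2 * N):
        out = mont_mul(A, B, N, r)
        assert out < 2 * N
        assert (out * R - A * B) % N == 0
```

Keeping operands in $[0, 2N)$ rather than $[0, N)$ avoids a final conditional subtraction, so the result of one multiplication can be fed directly into the next.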
The $O(n^2)$-size solutions presented in this section provide a feasible scheme for implementation of Montgomery reduction and multiplication with constant-depth critical paths of 5 and 7 threshold gates, respectively, independent of the size $n$ of the operands.
Fig. 8. Structure of the depth-3 Montgomery reduction circuit.
5 COMPARISONS AND DISCUSSION
In order to discuss our results and compare them to the state of the art in the technical literature, we consider the different design approaches available for implementation of Montgomery modular multiplication. By its nature, the Montgomery algorithm is ideally suited for parallelization, and was adopted for a large number of pipelined and systolic implementations [25], [23], [3], [15], in addition to fully parallel approaches [18].
In a systolic array architecture, we typically have $O(n^2)$ units, each processing $O(1)$ bits at a time. Data and control signals are pumped through the units from one end of the architecture to the other, and held by registers within each unit. Signal propagation is always local, in the sense that each unit only communicates with its neighbors, thereby ensuring an $O(1)$ critical path, which is made of at least one full adder and some selection logic. We must wait $O(n)$ clock cycles before the first bits of the result come out. After this initial latency, we achieve a theoretical throughput of $O(1)$ (i.e., one new multiplication output each clock cycle, or each constant number of clock cycles) provided that input data are fed at the appropriate rate from the outside. As an alternative to ensure lower area requirements, several works proposed linear systolic arrays [23], [3], [15]. In this case, area complexity is $O(n)$, but the theoretical throughput is $O(1/n)$ (i.e., one multiplication completed each $c \cdot n$ clock cycles, where $c$ is a constant).
The authors of [18], on the other hand, present a class of schemes for parallel Montgomery multipliers. The essential idea is to build a parallel tree multiplier based on 3-2 compressors. These are realized as $n$-bit rows working on the $n$-bit partial products to be added and Montgomery-reduced. Among the architectures proposed in [18], the best area complexity is $O(n^2)$ while the critical path contains at least $O(\log_{3/2} n)$ stages, each made of two cascaded full adders.
Table 1 summarizes the above-discussed approaches for
modular multiplication and compares them with the
Threshold-Logic-based solutions presented here. The table
uses complexity notation and gives the comparisons in
terms of architectural critical path, latency, throughput, and
gate count. Critical paths for the architectures presented in
Section 3 and Section 4 are equal to 3 and 7 threshold logic
gates, respectively.
Note that the above classification based on time and gate count complexity is independent of specific implementation details. It is appropriate to describe the performance of all high-speed architectures proposed in the literature for Montgomery multiplication, and this provides insightful information on the asymptotic behavior of all of them. It is clear from the table that none of the standard approaches achieves constant latency execution with a constant time delay, i.e., an architectural critical path independent of the operand size $n$. The solutions based on Threshold Logic, on the other hand, ensure both constant latency and constant-depth critical path (the value of this constant can be traded off for area requirements, while remaining low and independent of $n$). In other words, solutions based on Threshold Logic enable single-cycle computation of Montgomery multiplication, and at the same time, they may ensure low delays. As shown in the table, this is an intrinsic advantage of the architectures based on Threshold Logic, and provides an insightful indication of the potential of this new paradigm for the computation of performance-critical operations like modular reduction and multiplication.
6 CONCLUSIONS
Next-generation VLSI technologies will require computational models and digital design methodologies to be revisited. In particular, the availability of new technologies makes it likely that the Threshold Logic computational paradigm will be adopted and that linear threshold functions will be implemented as elementary gates. This paper investigated the potential of the new model for an increasingly important application domain, cryptographic computation. The work established some theoretical results on the power of Threshold Logic for cryptographic operations, and introduced fixed-depth architectures suitable for implementation of modular multiplication, providing an insightful demonstration of the power of Threshold Logic for cryptography-related operations.
REFERENCES
[1] V. Beiu, J.M. Quintana, and M.J. Avedillo, "VLSI Implementations of Threshold Logic - A Comprehensive Survey," IEEE Trans. Neural Networks, vol. 14, no. 5, pp. 1217-1243, Sept. 2003.
[2] I.F. Blake, G. Seroussi, and N.P. Smart, Elliptic Curves in Cryptography. Cambridge Univ. Press, 1999.
[3] T. Blum and C. Paar, "High-Radix Montgomery Modular Exponentiation on Reconfigurable Hardware," IEEE Trans. Computers, vol. 50, no. 7, pp. 759-764, July 2001.
[4] J. Bruck, "Harmonic Analysis of Polynomial Threshold Functions," SIAM J. Discrete Math., vol. 3, no. 2, pp. 168-177, May 1990.
[5] S. Cotofana, C. Lageweg, and S. Vassiliadis, "Addition Related Arithmetic Operations via Controlled Transport of Charge," IEEE Trans. Computers, vol. 54, no. 3, pp. 243-256, Mar. 2005.
[6] M. Goldmann, J. Håstad, and A. Razborov, "Majority Gates vs. General Weighted Threshold Gates," Proc. Seventh Ann. Conf. Structure in Complexity Theory, pp. 2-13, 1992.
[7] M. Goldmann and M. Karpinski, "Simulating Threshold Circuits by Majority Circuits," Proc. 25th Ann. ACM Symp. Theory of Computing, pp. 551-560, May 1993.
[8] J. Håstad, "Almost Optimal Lower Bounds for Small Depth Circuits," Proc. 18th Ann. ACM Symp. Theory of Computing, vol. 18, pp. 6-20, 1986.
[9] Int'l Technology Roadmap for Semiconductors, 2005 ed., http://www.itrs.net, 2010.
TABLE 1. Comparison of Different Approaches to High-Speed Implementation of Montgomery Multiplication
[10] C. Lageweg, S. Cotofana, and S. Vassiliadis, "A Linear Threshold Gate Implementation in Single-Electron Technology," Proc. IEEE CS Workshop Very Large Scale Integration (VLSI), pp. 93-98, Apr. 2001.
[11] P. Mazumder, S. Kulkarni, M. Bhattacharya, J.P. Sun, and G.I. Haddad, "Digital Circuit Applications of Resonant Tunneling Devices," Proc. IEEE, vol. 86, no. 4, pp. 664-686, Apr. 1998.
[12] C. Meenderinck and S. Cotofana, "Computing Division Using Single-Electron Tunneling Technology," IEEE Trans. Nanotechnology, vol. 6, no. 4, pp. 451-459, July 2007.
[13] P.L. Montgomery, "Modular Multiplication without Trial Division," Math. Computation, vol. 44, no. 170, pp. 519-521, Apr. 1985.
[14] S. Muroga, Threshold Logic and Its Applications. Wiley, 1971.
[15] S.B. Örs, L. Batina, B. Preneel, and J. Vandewalle, "Hardware Implementation of a Montgomery Modular Multiplier in a Systolic Array," Proc. Int'l Parallel and Distributed Processing Symp. (IPDPS '03), p. 184b, 2003.
[16] W. Porod, C. Lent, G.H. Bernstein, A.O. Orlov, I. Hamlani, G.L. Snider, and J.L. Merz, "Quantum-Dot Cellular Automata: Computing with Coupled Quantum Dots," Int'l J. Electronics, vol. 86, no. 5, pp. 549-590, 1999.
[17] R.L. Rivest, A. Shamir, and L. Adleman, "A Method for Obtaining Digital Signatures and Public-Key Cryptosystems," Comm. ACM, vol. 21, pp. 120-126, 1978.
[18] M.O. Sanu, E.E. Swartzlander, and C.M. Chase, "Parallel Montgomery Multipliers," Proc. 15th IEEE Int'l Conf. Application-Specific Systems, Architectures and Processors (ASAP '04), pp. 63-72, 2004.
[19] K.-Y. Siu and J. Bruck, "Neural Computation of Arithmetic Functions," Proc. IEEE, vol. 78, no. 10, pp. 1669-1675, Oct. 1990.
[20] K.-Y. Siu and J. Bruck, "On the Power of Threshold Circuits with Small Weights," SIAM J. Discrete Math., vol. 4, no. 3, pp. 423-435, Aug. 1991.
[21] K.-Y. Siu, J. Bruck, T. Kailath, and T. Hofmeister, "Depth Efficient Neural Networks for Division and Related Problems," IEEE Trans. Information Theory, vol. 39, no. 3, pp. 946-956, May 1993.
[22] K.-Y. Siu and V.P. Roychowdhury, "On Optimal Depth Threshold Circuits for Multiplication and Related Problems," SIAM J. Discrete Math., vol. 7, no. 2, pp. 284-292, May 1994.
[23] W.-C. Tsai, C.B. Shung, and S.-J. Wang, "Two Systolic Architectures for Modular Multiplication," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 8, no. 1, pp. 103-107, Feb. 2000.
[24] S. Vassiliadis, S. Cotofana, and K. Bertels, "2-1 Addition and Related Arithmetic Operations with Threshold Logic," IEEE Trans. Computers, vol. 45, no. 9, pp. 1062-1067, Sept. 1996.
[25] C.D. Walter, "Systolic Modular Multiplication," IEEE Trans. Computers, vol. 42, no. 3, pp. 376-378, Mar. 1993.
[26] Nanoelectronics and Information Technology: Advanced Electronic Materials and Novel Devices, R. Waser, ed., first ed. Wiley-VCH, 2003.
[27] T. Yang, R.A. Kiehl, and L.O. Chua, "Tunneling Phase Logic Cellular Nonlinear Networks," Int'l J. Bifurcation and Chaos, vol. 11, no. 12, pp. 2895-2911, 2001.
Alessandro Cilardo received a five-year degree in computer engineering, magna cum laude, in January 2003, and the PhD degree in November 2006 from the University of Naples Federico II. He is currently an assistant professor at the University of Naples Federico II. His main research interests include computer arithmetic, with special emphasis on cryptography-related operations. In particular, he investigated architectural solutions for efficient implementation of modular and finite field arithmetic used in public-key cryptographic algorithms. He is also involved in several research projects related to embedded systems design, reconfigurable technologies (FPGAs), and reconfigurable computing design techniques. He is a recipient of several industrial awards for innovative security-related applications mainly oriented to embedded environments and smart cards. He is an author of around 25 peer-reviewed articles, including publications that appeared in the Proceedings of the IEEE, IEEE Transactions on Computers, IET Electronics Letters, and papers presented at DATE, ITC, and VLSI-SoC conferences. He is a member of the IEEE.