High Performance Elliptic Curve Cryptographic Processor Over $GF(2^{163})$

Hyun Min Choi, Chun Pyo Hong
Dept. of Computer and Communication, Daegu University, South Korea
hmchoi@dsp.daegu.ac.kr, cphong@daegu.ac.kr

Chang Hoon Kim
Dept. of Computer and Information, Daegu University, South Korea
kimch@daegu.ac.kr
Abstract

In this paper, we propose a high performance elliptic curve cryptographic processor over $GF(2^{163})$. The proposed architecture is based on a modified López-Dahab elliptic curve point multiplication algorithm and uses Gaussian normal basis (GNB) for $GF(2^{163})$ field arithmetic. To achieve high throughput rates, we design two new word-level arithmetic units over $GF(2^{163})$ and derive a parallelized elliptic curve point doubling and point addition algorithm. We implement our design on a Xilinx XC4VLX80 FPGA device, where it uses 24,263 slices and has a maximum frequency of 143 MHz. Our design is roughly 4.8 times faster, with 2 times increased hardware complexity, compared with the previous hardware implementation. Therefore, the proposed architecture is well suited to elliptic curve cryptosystems requiring high throughput rates, such as network processors and web servers.

1 Introduction

Elliptic curve cryptosystems (ECC) have recently gained much attention in industry and academia. The main reason is that, for a properly chosen elliptic curve, no known sub-exponential algorithm can break the encryption by solving the discrete logarithm problem. This means that significantly smaller parameters can be used in ECC than in other competitive systems such as RSA and ElGamal with equivalent levels of security. Some benefits of having smaller key sizes include faster computations and reductions in processing power, storage space, and bandwidth. Due to these advantages of ECC, a number of software and hardware implementations [1-3] have been proposed, and ECC has been included in many standards such as IEEE 1363 [4] and NIST [5] for the elliptic curve digital signature algorithm.

Computing $kP$ (a point or scalar multiplication) is the most important arithmetic operation in ECC, where $k$ is an integer and $P$ is a point on an elliptic curve. This operation can be computed by repeated point additions and doublings. If we implement ECC over $GF(2^m)$, the efficiency of computing $kP$ depends on the choice of point multiplication algorithm, coordinate system, and basis representation. From previous results [1-3], it can be noted that, for hardware implementations of ECC, the López-Dahab Montgomery $kP$ algorithm based on projective coordinates is faster than any other point multiplication method. Another advantage of this algorithm is that the same operations are performed in every iteration of the main loop, thereby potentially increasing resistance to timing attacks and power analysis attacks.

In this paper, we propose a high performance elliptic curve cryptographic processor over $GF(2^{163})$. The proposed architecture is based on a modified López-Dahab elliptic curve point multiplication algorithm and uses GNB for $GF(2^m)$ field arithmetic. Three major characteristics of the proposed architecture are: 1) it uses fast arithmetic units based on a word-level multiplier [6], 2) it adopts a parallelized point doubling and point addition unit with uniform addressing mode, and 3) it utilizes the benefits of GNB representation. Therefore, the proposed architecture leads to a considerable reduction of computational delay time compared with previous hardware implementations. Furthermore, since the proposed architecture has the features of modularity and a simple control structure, it is well suited to VLSI implementation.

2 Arithmetic unit for ECC

2.1 López-Dahab kP algorithm for $GF(2^m)$

Let $E$ be an elliptic curve over $GF(2^m)$ defined by

$$E : y^2 + xy = x^3 + ax^2 + b, \tag{1}$$

where $a, b$ are in $GF(2^m)$ and $b \neq 0$. Let $P = (x, y)$ be a point on $E$ of large prime order in $E(GF(2^m))$, and let
$k \geq 0$ be an integer. We may write $k$ as a binary expansion

$$k = (k_{l-1} k_{l-2} \cdots k_1 k_0)_2 = k_{l-1} 2^{l-1} + k_{l-2} 2^{l-2} + \cdots + k_1 2 + k_0, \tag{2}$$

where $k_i \in \{0, 1\}$ with $k_{l-1} = 1$. We can compute $kP$ using the following López-Dahab algorithm [7]. Two main advantages of the López-Dahab algorithm are: 1) a number of inversions are avoided by using projective coordinate representation, and 2) the same operations are performed in every iteration of the main loop. These two benefits mean that a faster implementation is possible, with potentially increased resistance to timing attacks.

Algorithm 1 López-Dahab version for point multiplication
Input: $P = (x, y) \in E(GF(2^m))$, an integer $k \geq 0$.
Output: $kP = (x_0, y_0)$.
1. If $k = 0$ or $x = 0$, then $kP = \infty$ or $P$.
2. $k \leftarrow (k_{l-1}, \cdots, k_1, k_0)_2$.
3. $(X_1, Z_1) \leftarrow (x, 1)$, $(X_2, Z_2) \leftarrow (x^4 + b, x^2)$.
4. For $i = l - 2$ down to 0 do
5.     $Z_3 \leftarrow (X_1 Z_2 + X_2 Z_1)^2$.
6.     If $k_i = 1$ then
7.         $X_1 \leftarrow x Z_3 + (X_1 Z_2)(X_2 Z_1)$, $Z_1 \leftarrow Z_3$, $X_2 \leftarrow X_2^4 + b Z_2^4$, $Z_2 \leftarrow X_2^2 Z_2^2$.
8.     Else
9.         $X_2 \leftarrow x Z_3 + (X_1 Z_2)(X_2 Z_1)$, $Z_2 \leftarrow Z_3$, $X_1 \leftarrow X_1^4 + b Z_1^4$, $Z_1 \leftarrow X_1^2 Z_1^2$.
10.    End if
11. End for
12. $x_0 \leftarrow X_1 / Z_1$, $y_0 \leftarrow \frac{1}{x} \left(x + \frac{X_1}{Z_1}\right) \left\{ \left(x + \frac{X_1}{Z_1}\right)\left(x + \frac{X_2}{Z_2}\right) + x^2 + y \right\} + y$.
13. Return $kP = (x_0, y_0)$.

2.2 Parallelized arithmetic unit for point doubling & addition

By observing the main loop of the López-Dahab point multiplication scheme in algorithm 1, we can see that step 7 and step 9 perform the same operations; only the input and output variables differ, depending on the value of $k_i$. This means that we can construct a uniform operation structure by appropriate modifications. Based on these properties, Ansari and Hasan [9] proposed a modified version for point doubling and addition (DA), as described in algorithm 2.

Algorithm 2 Point doubling & addition with uniform addressing
if $k_{l-2} = 1$ then
    Swap($X_1$, $X_2$), Swap($Z_1$, $Z_2$)
end if
for $i = l - 2$ down to 0 do
    $Z_3 \leftarrow (X_1 Z_2 + X_2 Z_1)^2$
    $X_2 \leftarrow x Z_3 + (X_1 Z_2)(X_2 Z_1)$, $Z_2 \leftarrow Z_3$,
    $X_1 \leftarrow X_1^4 + b Z_1^4$, $Z_1 \leftarrow X_1^2 Z_1^2$
    if ($i \neq 0$ and $k_i \neq k_{i-1}$) or ($i = 0$ and $k_i = 1$) then
        Swap($X_1$, $X_2$), Swap($Z_1$, $Z_2$)
    end if
end for

From algorithm 2, if we use one multiplier, six multiplications are required per iteration. By observing algorithm 2, however, we can find that three multiplications can be computed independently: the multiplications $X_1 Z_2$ ($T_1$), $X_2 Z_1$ ($T_2$), and $X_1 Z_1$ can be computed at the first step, and the multiplications $T_1 T_2$, $x Z_3$, and $b Z_1^4$ can be computed at the second step. From this observation, we can derive a new parallelized version of algorithm 2, which is described in algorithm 3.

Algorithm 3 Parallelized version of point doubling & addition with uniform addressing
if $k_{l-2} = 1$ then
    Swap($X_1$, $X_2$), Swap($Z_1$, $Z_2$)
end if
for $i = l - 2$ down to 0 do
    1. $T_1 \leftarrow (X_1 Z_2)$, $T_2 \leftarrow (X_2 Z_1)$, $Z_3 \leftarrow (T_1 + T_2)^2$, $T_3 \leftarrow (X_1 Z_1)^2$, $Z_2 \leftarrow Z_3$
    2. $X_2 \leftarrow T_1 T_2 + x Z_3$, $X_1 \leftarrow b Z_1^4 + X_1^4$, $Z_1 \leftarrow T_3$
    if ($i \neq 0$ and $k_i \neq k_{i-1}$) or ($i = 0$ and $k_i = 1$) then
        Swap($X_1$, $X_2$), Swap($Z_1$, $Z_2$)
    end if
end for

A data dependence graph for point DA corresponding to algorithm 3 is given in Fig. 1. Based on the data dependence graph of Fig. 1, we can derive a new arithmetic unit as shown in Fig. 2. As described in Fig. 2, this arithmetic unit is a merged version of step 1 and step 2 in the dependence graph of Fig. 1. In Fig. 2, input and output values are controlled by the 2-bit Ctrl-A and 1-bit Ctrl-B signals. This leads to a very simple control architecture. The new arithmetic unit of Fig. 2 consists of three multipliers, two adders, six 3-to-1 multiplexers, two buffer registers, and two swap logic blocks. Note that, since computing $A^{2^s}$ for an element $A$ in $GF(2^m)$ is an $s$-bit cyclic shift when GNB is used, no logic gates are required for these operations. In Fig. 2, $X_i$, $Z_i$, $X_1^4$, $Z_1^4$, $b$, and $x$ are transferred from the register file, while $X_i'$ and $Z_i'$ denote temporary computation results that are directly used as input values for the next iteration, where $i \in \{1, 2\}$. Therefore we can reduce the clock cycles needed for data fetch from the register file. The two buffer registers are used to adjust input timing. In other words, the temporary results $Z_1$ and $Z_2$ are computed at the first step; however, these results should be used at the same time as the temporary results $X_1$ and $X_2$, as inputs of the next iteration. From Fig. 2, we can notice that $(l - 1) \cdot 2 \cdot$ (clock cycles for one multiplication) clock cycles are required for the main loop of the López-Dahab point multiplication.

Figure 1. Data dependence graph for point doubling & addition

Figure 2. Arithmetic unit for point doubling & addition

2.3 Arithmetic unit for coordinate conversion

Unlike the point DA algorithm, the coordinate conversion (CC) routine in algorithm 1 includes the inversion operation. Therefore, we first derive an inverter over $GF(2^m)$.

Let $A$ be a nonzero element in $GF(2^m)$. Since the order of the multiplicative group $GF(2^m)^{\times}$ is $2^m - 1$, we get $A^{2^m - 1} = 1$. That is,

$$A^{-1} = A^{2^m - 2} = A^{2(2^{m-1} - 1)}. \tag{3}$$

Based on (3), Itoh and Tsujii [8] proposed an efficient inversion scheme. The Itoh-Tsujii algorithm is based on the following relation:

$$2^s - 1 = \begin{cases} (2^{s/2} - 1)(2^{s/2} + 1), & \text{if } s \text{ is even}, \\ 2(2^{(s-1)/2} - 1)(2^{(s-1)/2} + 1) + 1, & \text{if } s \text{ is odd}. \end{cases} \tag{4}$$

The Itoh-Tsujii inversion scheme requires $\lfloor \log_2 (m - 1) \rfloor + H(m - 1) - 1$ multiplications in $GF(2^m)$, where $H$ denotes the Hamming weight of the binary expansion of the given integer. In the case of $m = 163$, we get $m - 1 = 162 = (10100010)_2$. Thus, the number of required multiplications is $7 + 3 - 1 = 9$. Therefore, we can compute the inverse of $A$ in $GF(2^{163})$ in the following order of the exponents:

$$A^{-1} = A^{2(2^{81}+1)[2(2^{40}+1)(2^{20}+1)(2^{10}+1)(2^{5}+1)\{2(2^{2}+1)(2+1)+1\}+1]}. \tag{5}$$

Based on (5), we can derive a new VLSI architecture for CC over $GF(2^{163})$ as shown in Fig. 3. As described in Fig. 3, we add 163 (3-to-1) multiplexers, 163 (9-to-1) multiplexers, 163 (2-to-1) multiplexers, one $GF(2^{163})$ adder, and a control signal of length 8.

Figure 3. Arithmetic unit for coordinate conversion
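As a cross-check of Section 2, algorithm 1 and the Itoh-Tsujii relation (4) can be modelled in a few lines of software. The sketch below is not the authors' hardware design: it assumes a polynomial basis with the reduction polynomial $z^{163} + z^7 + z^6 + z^3 + 1$ instead of GNB (the field arithmetic is equivalent, only the bit encoding differs), and the function names and the test point are hypothetical, chosen only for illustration.

```python
# Software reference model of algorithm 1 with Itoh-Tsujii inversion.
# ASSUMPTION: polynomial basis, f(z) = z^163 + z^7 + z^6 + z^3 + 1 (not GNB).
M = 163
F = (1 << 163) | (1 << 7) | (1 << 6) | (1 << 3) | 1

def gf_mul(a, b):
    """Shift-and-add product of two reduced elements of GF(2^163)."""
    r = 0
    while b:
        if b & 1:
            r ^= a
        b >>= 1
        a <<= 1
        if a >> M:              # degree reached 163: reduce by XOR with F
            a ^= F
    return r

def gf_sqr(a):
    return gf_mul(a, a)

def gf_inv(a):
    """Itoh-Tsujii inversion: A^(2^M - 2), built recursively from (4)."""
    def beta(s):                # returns A^(2^s - 1)
        if s == 1:
            return a
        h = s // 2
        t = beta(h)
        u = t
        for _ in range(h):      # u = t^(2^h)
            u = gf_sqr(u)
        r = gf_mul(u, t)        # exponent (2^h - 1)(2^h + 1) = 2^(2h) - 1
        if s % 2:               # odd s: one extra squaring and multiplication
            r = gf_mul(gf_sqr(r), a)
        return r
    return gf_sqr(beta(M - 1))  # final squaring, as in (3)

def ladder(k, x, y, b):
    """Algorithm 1: affine kP for P = (x, y). Assumes k >= 1, x != 0 and
    that neither kP nor (k+1)P is the point at infinity."""
    X1, Z1 = x, 1
    X2, Z2 = gf_sqr(gf_sqr(x)) ^ b, gf_sqr(x)
    for i in range(k.bit_length() - 2, -1, -1):
        T1, T2 = gf_mul(X1, Z2), gf_mul(X2, Z1)
        Z3 = gf_sqr(T1 ^ T2)
        X3 = gf_mul(x, Z3) ^ gf_mul(T1, T2)               # point addition
        if (k >> i) & 1:                                  # step 7
            Xd = gf_sqr(gf_sqr(X2)) ^ gf_mul(b, gf_sqr(gf_sqr(Z2)))
            X1, Z1, X2, Z2 = X3, Z3, Xd, gf_sqr(gf_mul(X2, Z2))
        else:                                             # step 9
            Xd = gf_sqr(gf_sqr(X1)) ^ gf_mul(b, gf_sqr(gf_sqr(Z1)))
            X1, Z1, X2, Z2 = Xd, gf_sqr(gf_mul(X1, Z1)), X3, Z3
    # Step 12: coordinate conversion with three inversions
    x1 = gf_mul(X1, gf_inv(Z1))
    x2 = gf_mul(X2, gf_inv(Z2))
    t = gf_mul(x ^ x1, x ^ x2) ^ gf_sqr(x) ^ y
    y0 = gf_mul(gf_mul(x ^ x1, t), gf_inv(x)) ^ y
    return x1, y0

def on_curve(x, y, a, b):
    """Check equation (1): y^2 + xy = x^3 + a x^2 + b."""
    lhs = gf_sqr(y) ^ gf_mul(x, y)
    rhs = gf_mul(gf_sqr(x), x) ^ gf_mul(a, gf_sqr(x)) ^ b
    return lhs == rhs

# Hypothetical test point: choose x, y, a freely and solve (1) for b.
a = 1
x = 0x1234567890ABCDEF1234567890ABCDEF12345678
y = 0x0FEDCBA9876543210FEDCBA9876543210FEDCBA9
b = gf_sqr(y) ^ gf_mul(x, y) ^ gf_mul(gf_sqr(x), x) ^ gf_mul(a, gf_sqr(x))
Q = ladder(15, x, y, b)
print(on_curve(*Q, a, b))
```

The loop body computes the same field operations in both branches and only exchanges the roles of $(X_1, Z_1)$ and $(X_2, Z_2)$, which is the property that the uniform addressing of algorithm 2 exploits. In this model squarings are ordinary multiplications; with GNB they would reduce to cyclic shifts.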
3 Elliptic curve cryptographic processor over $GF(2^{163})$

3.1 Datapath for elliptic curve point multiplication over $GF(2^{163})$

A new ECC processor for $GF(2^{163})$, proposed in this paper, is shown in Fig. 4. The ECC processor consists of eight main components: host interface (HI), data memory, register file, instruction memory, Control-1, Control-2, AU-1, and AU-2. The HI communicates with the host microprocessor, i.e., the host microprocessor transmits all parameters for $kP$ to the HI with a Start signal, and receives the $kP$ results and an End signal from the HI. The data memory consists of an 8 × 163-bit dual port block RAM, and the instruction memory contains 13 microcode sequences of 11-bit words. For high performance implementation of point doubling & addition, we add a 7 × 163-bit register file, which receives data from the HI and transmits temporary computation results ($X_1$, $X_2$, $Z_1$, $Z_2$) to the data memory. The AU-1 is used for point doubling & addition and is controlled by Control-1. The AU-2 performs the coordinate conversion in algorithm 1. Control-2 receives the operation code from the instruction memory and generates control signals for AU-2, the data memory, and the HI. Table 1 gives the number of clock cycles required to perform elliptic curve point multiplication over $GF(2^{163})$, where we assumed word size $\omega = 55$, i.e., $L = \lceil m/\omega \rceil = 3$.

3.2 FPGA implementation and performance analysis

The ECC processor over $GF(2^{163})$ in Fig. 4 was coded using VHDL and synthesized by Synopsys FPGA Compiler II, in which the Xilinx XC4VLX80 was used as the target device. The placement, routing, and timing analysis of the synthesized designs were accomplished using Xilinx's Foundation software. We implemented the design using Libtron's SYS-Lab 5000 system-on-chip test board, which includes an Intel PXA272 microprocessor and a Xilinx XC4VLX80 FPGA device.

Performance comparisons with recently proposed architectures are given in Table 2. From Table 2, it is noted that the proposed architecture is the fastest design, including the ASIC implementation. As a detailed comparison, our ECC processor is 4.8 times faster than the architecture in [1], which is the best FPGA implementation to date, to the authors' knowledge. The proposed design, however, uses roughly 2 times more hardware resources than Shu's architecture, since one slice of Xilinx's XC4VLX80 device has 2 LUTs.
Figure 4. Architecture of ECC processor over $GF(2^{163})$
Table 1. Clock cycles required to compute $kP$ over $GF(2^{163})$

Phase                    # Operations                   # Clock cycles
Initialize               Add.: 1, Sqr.: 1               1
Main loop                162 (Mult.: 2)                 $162\{2(\lceil m/\omega \rceil + 1)\} = 1296$
Coordinate conversion    Inv.: 3, Mult.: 5, Add.: 5     $32\lceil m/\omega \rceil + 53 = 149$
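The entries of Table 1 can be checked, and turned into a latency estimate, with a short calculation. This is a sketch using only the parameters given in the paper ($m = 163$, $\omega = 55$, 143 MHz); the total cycle count and the resulting time are derived here rather than quoted from the paper.

```python
import math

m, w = 163, 55             # field size and word size
f_mhz = 143                # maximum clock frequency reported in the paper
L = math.ceil(m / w)       # number of words per field element

init_cycles = 1
main_cycles = (m - 1) * 2 * (L + 1)   # 162 iterations, 2 multiplications each
cc_cycles = 32 * L + 53               # coordinate conversion
total = init_cycles + main_cycles + cc_cycles
time_us = total / f_mhz               # cycles / MHz gives microseconds

print(L, main_cycles, cc_cycles, total, round(time_us, 2))
```

The derived total of 1446 cycles corresponds to about 10.1 µs at 143 MHz, in line with the 10 µs entry for this work in Table 2.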
Table 2. Performance comparison results

Design              Field                    Device          Size             f (MHz)   Time (µs)   Remarks
Shu et al. [1]      $2^m$, 163-bit, NIST     XCV2000E-7      25,763 LUTs      68.9      48          6 MSD multipliers, D = 32, 8
Satoh et al. [2]    $2^m$, 160-bit, Any      0.13 µm CMOS    117,500 gates    510.2     190         MMM, D = 64
Gura et al. [3]     $2^m$, 163-bit           XCV2000E-7      19,508 LUTs      66.5      143         LSD multiplier, D = 64
This work           $2^m$, 163-bit           XC4VLX80        24,363 slices    143       10          3 GNB multipliers, D = 55
4 Conclusions
In this paper, we proposed a high performance ECC processor. We implemented our design using a Xilinx XC4VLX80 FPGA device. The proposed architecture uses 24,263 slices and has a maximum frequency of 143 MHz. Our design is roughly 4.8 times faster, with 2 times increased hardware complexity, compared with the previous hardware implementation proposed by Shu et al. [1]. Therefore, the proposed elliptic curve cryptographic processor is well suited to elliptic curve cryptosystems requiring high throughput rates, such as network processors and web servers. Furthermore, since the proposed architecture has the features of modularity and a simple control structure, it can easily be implemented on ASICs or FPGAs by using hardware description languages.
References
[1] C. Shu, K. Gaj, and T. El-Ghazawi, “Low Latency Elliptic Curve Cryptography Accelerators for NIST Curves over Binary Fields,” FPT 2005 1965, pp. 309-310, 2005.
[2] A. Satoh and K. Takano, “A Scalable Dual-Field Elliptic Curve Cryptographic Processor,” IEEE Trans. Computers, Vol. 52, No. 4, pp. 449-460, April 2003.
[3] N. Gura, S.C. Shantz, H. Eberle, S. Gupta, V. Gupta, D. Finchelstein, E. Goupy, and D. Stebila, “An End-to-End Systems Approach to Elliptic Curve Cryptography,” CHES 2002, Lecture Notes in Computer Science Vol. 2523, pp. 349-365, 2002.
[4] IEEE 1363, Standard Specifications for Public-Key Cryptography, 2000.
[5] NIST, Recommended elliptic curves for federal government use, May 1999. http://csrc.nist.gov/encryption.
[6] C.H. Kim, Y.K. Kwon, T.H. Kim, S. Kwon, and C.P. Hong, “Word Level Multiplier for $GF(2^m)$ Using Gaussian Normal Basis,” Journal of Korea Information and Communications, Vol. 31, No. 11c, pp. 1120-1127, Nov. 2006.
[7] J. López and R. Dahab, “Fast Multiplication on Elliptic Curves over $GF(2^m)$ without Precomputation,” CHES 1999, Lecture Notes in Computer Science Vol. 1717, pp. 316-327, 1999.
[8] T. Itoh and S. Tsujii, “A fast algorithm for computing multiplicative inverses in $GF(2^m)$ using normal bases,” Information and Computation, Vol. 78, No. 3, pp. 171-177, 1988.
[9] B. Ansari and M. Anwar Hasan, “High Performance Architecture of Elliptic Curve Scalar Multiplication,” Tech. Report CACR 2006-01, 2006.
