
An FPGA Implementation of 30Gbps Security Module for GPON Systems

Truong Quang Vinh (1), Ju-Hyun Park (1), Young-Chul Kim (1), Kwang-Ok Kim (2)

(1) Department of Electronics and Computer Engineering, Chonnam National University
300 Yongbong-dong, Buk-gu, Gwangju 500-757, Korea
tqvinh@soc.chonnam.ac.kr
(2) Electronics and Telecommunication Research Institute
161 Gajeong-dong, Yuseong-gu, Daejeon, Korea
kwangok@etri.re.kr

Abstract

GPON systems require gigabit-throughput data encryption for security and privacy. This paper presents an implementation of a very high speed security module for GPON on a Virtex-4 FPGA. The security module supports payload encryption with constant delay by using the counter mode AES algorithm. Our AES design has three advanced features: composite field arithmetic SubByte, efficient MixColumn transformation, and on-the-fly key scheduling. A full-pipelined architecture is employed for the AES core in order to achieve high performance for the security module. The experiment shows that the proposed architecture can achieve a throughput of 30 Gbits/s on a Xilinx Virtex-4 VLX100-12 device. The performance of our design is well suited to encryption applications of GPON systems.

1. Introduction

Recently, GPONs (Gigabit-capable Passive Optical Networks) have become attractive for cost-effective delivery of high-bandwidth data directly to the building, curb, and home. This creates a strong requirement for the access network to be trustworthy, secure, and reliable. Therefore, an encryption module is an essential part of GPON systems for protecting broadcast data from eavesdropping, given the multicast nature of GPONs. The ITU-T G.984 document [1] recommends using the Advanced Encryption Standard (AES) for payload encryption in GPONs. The National Institute of Standards and Technology (NIST) defined five modes of operation of AES [2]. However, only AES with counter mode (CTR-AES) can be used for GPON payload encryption. In this paper, we present a GPON security module using the CTR-AES algorithm, implemented with a full-pipelined architecture for area and performance optimization.

For hardware implementation of the security module, there are two critical constraints: performance and the amount of resources needed. To achieve high throughput for gigabit links in GPONs, we apply pipelined architectures to all process blocks of the security module, especially the AES core. A pipelined architecture for AES can improve the throughput, but it uses much area because the hardware is duplicated to implement 11 rounds. Therefore, researchers have proposed several speed-area trade-offs for AES architectures. To optimize the resources of an AES implementation, researchers focus on improving individual blocks of the cipher. In [8]-[10], efficient implementations of the S-box are proposed to minimize area and delay. The proposed S-box architecture is a combination of the SubBytes and Inverse SubBytes transformations, used instead of look-up tables that require much memory. To enhance the key schedule, some authors propose on-the-fly key expansion, which generates the round keys concurrently with the encryption or decryption procedure without extra memory to store the round keys [8], [9].

This paper explores efficient schemes for designing the security module in order to achieve the target performance of GPON systems. Our design employs a composite field arithmetic architecture for the SubByte transformation. Moreover, we apply sub-pipelining to this function block to improve the throughput of the AES algorithm. Another improved part is the key expander. We propose an area-efficient key expander which can compute round keys in an on-the-fly manner. In addition, we exploit a sub-pipelined architecture for the key expansion block and use an optimized set of registers to store round keys. Our key expander is suitable for a pipelined AES architecture because it can start at the same time as data encryption.

The paper is organized as follows. Section 2 presents the architecture of the GPON security module. Section 3 describes the hardware implementation of the AES algorithm. The advanced features for AES hardware implementation are presented in Section 4. Section 5 presents the implementation results and the

978-1-4244-2358-3/08/$20.00 © 2008 IEEE 868 CIT 2008


performance comparisons with different architectures. Section 6 gives the conclusion.

2. The architecture of the GPON security module

The GPON security module is implemented to guarantee secure communication on the Tx/Rx link of GPON. The module ensures the confidentiality, integrity, and origin authenticity of each frame sent and received by the OLT (Optical Line Termination) / ONT (Optical Network Termination) [1]. The top structure of the GPON security module is shown in Fig. 1.

Fig. 1. The top structure of the GPON security module.

a. Port-ID Table: is implemented as 4K 12-bit registers to store the port identifier. Only frames with the appropriate Port-ID are encrypted by the CTR-AES core.
b. Security Decoder: generates the crypto counter with the format: (Inter Frame Count[19:0] & Intra Frame Count[15:0]) & (Inter Frame Count[29:0] & Intra Frame Count[15:0]) & (Inter Frame Count[29:0] & Intra Frame Count[15:0]). It also registers the 128-bit GTC Payload for the Payload Bypass.
c. Key Expander: restores the initial key and generates round keys for CTR-AES from the 128-bit key input. The total number of round-key bits is 1408 = 128*(10+1). The shadow key is used if the OLT requires a key exchange. The ONT responds by generating, storing, and sending a new key. When the new key has been transferred successfully to the OLT, both the OLT and ONU (Optical Network Unit) begin using the new key at precisely the same frame boundary.
d. CTR-AES Core: performs the same process as the AES algorithm, except that its input value is the crypto counter. The crypto counter increases at every 128-bit data block. 128-bit input blocks are transformed into 128-bit pseudorandom cipher blocks.
e. Payload Bypass: delivers the insecure payload without authentication or encryption. It has the same delay as the encryption time, in order to synchronize with the cipher GEM payload at the output.
f. Security Encoder: multiplexes the cipher GEM (G-PON Encapsulation Method) payloads from the Bypass GEM Payload and the Encrypted GEM Payload, depending on whether the security function is enabled. For authentic frames, the encoder XORs the 128-bit pseudorandom cipher block with the delayed GEM payload to generate the cipher GEM payload.

The AES algorithm in the GPON security module uses counter mode to encrypt data [2]. In counter mode encryption, the forward cipher function is invoked on each counter block, and the resulting output blocks are exclusive-ORed with the corresponding plaintext blocks to produce the ciphertext blocks. The forward cipher function is used in both CTR decryption and CTR encryption. Therefore, only one hardware implementation is needed for both encryption and decryption. The XOR operation is executed in the security encoder block.

3. AES core implementation

3.1 AES general architecture

The AES algorithm is a symmetric-key cipher, in which both the sender and the receiver use a single key for encryption and decryption. In AES encryption, each round except the final round consists of four steps: SubByte, ShiftRow, MixColumn, and AddRoundKey. SubByte is a nonlinear transformation which substitutes each byte of the round data according to a substitution table called the S-box. The ShiftRow step is a circular shift of the bytes in each row of the round data. The MixColumn transformation operates on the State column by column, treating each column as a four-term polynomial. AddRoundKey is performed simply by XORing the round key with the data block. The round keys are different in every round and are generated by Key Expansion.

3.2 The full-pipelined architecture for AES algorithm

In order to achieve very high throughput, we apply pipelining to both the outer rounds and the inner rounds of the AES architecture. For outer-round pipelining, the pipeline registers are placed between the data path instances of each round. For inner-round pipelining, we decompose the four processes SubByte, ShiftRow, MixColumn, and AddRoundKey into sub-pipelined stages with equivalent delay. Fig. 2 shows the full-pipelined architecture of the AES algorithm.
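The round structure of Section 3.1 and the counter-mode use described in Section 2 can be illustrated with a small software model. The following is a minimal Python sketch, not the authors' hardware design: the function names (`encrypt_block`, `crypto_counter`, `ctr_xor`) are hypothetical, the S-box is derived arithmetically rather than with a composite-field circuit, all round keys are precomputed rather than generated on the fly, and the intra-frame count is assumed to advance by one per 128-bit block, following the description of the crypto counter in Section 2. `mix_column` uses the substructure-shared form given as equation (2) in Section 4.2.

```python
"""Software model of AES-128 in counter mode (illustrative sketch only)."""

def xtime(b):
    """Multiply by {02} in GF(2^8): left shift, then conditional XOR with {1b}."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b

def gmul(a, b):
    """Shift-and-add multiplication in GF(2^8) (used only to build the S-box)."""
    p = 0
    while b:
        if b & 1:
            p ^= a
        a, b = xtime(a), b >> 1
    return p

def _sbox_entry(a):
    """Multiplicative inverse (a^254, with 0 -> 0) followed by the affine map."""
    inv = 0
    if a:
        inv = 1
        for _ in range(254):
            inv = gmul(inv, a)
    s = inv ^ 0x63
    for n in range(1, 5):
        s ^= ((inv << n) | (inv >> (8 - n))) & 0xFF
    return s

SBOX = [_sbox_entry(a) for a in range(256)]

def expand_key(key):
    """Return the 11 round keys (16 bytes each) of a 128-bit key."""
    w = [list(key[4 * i:4 * i + 4]) for i in range(4)]
    rcon = 1
    for i in range(4, 44):
        t = list(w[i - 1])
        if i % 4 == 0:
            t = [SBOX[x] for x in t[1:] + t[:1]]   # RotWord, then SubWord
            t[0] ^= rcon
            rcon = xtime(rcon)
        w.append([x ^ y for x, y in zip(w[i - 4], t)])
    return [sum(w[4 * r:4 * r + 4], []) for r in range(11)]

def mix_column(col):
    """MixColumn on one column using the shared XOR terms of equation (2)."""
    s0, s1, s2, s3 = col
    a, b = s0 ^ s1, s2 ^ s3          # each term is reused by several outputs
    return [xtime(a) ^ s1 ^ b, xtime(s1 ^ s2) ^ s0 ^ b,
            xtime(b) ^ s3 ^ a, xtime(s3 ^ s0) ^ s2 ^ a]

def encrypt_block(block, round_keys):
    """Forward cipher; the state is a flat list with index 4*column + row."""
    s = [x ^ k for x, k in zip(block, round_keys[0])]
    for rnd in range(1, 11):
        s = [SBOX[x] for x in s]                                   # SubByte
        s = [s[4 * ((c + r) % 4) + r]                              # ShiftRow
             for c in range(4) for r in range(4)]
        if rnd < 10:                                               # final round skips MixColumn
            s = sum((mix_column(s[4 * c:4 * c + 4]) for c in range(4)), [])
        s = [x ^ k for x, k in zip(s, round_keys[rnd])]            # AddRoundKey
    return bytes(s)

def crypto_counter(inter, intra):
    """128-bit counter: the 46-bit (inter[29:0] & intra[15:0]) value repeated
    three times, with the top copy truncated to inter[19:0] & intra[15:0]."""
    c46 = ((inter & 0x3FFFFFFF) << 16) | (intra & 0xFFFF)
    value = ((c46 << 92) | (c46 << 46) | c46) & ((1 << 128) - 1)
    return value.to_bytes(16, "big")

def ctr_xor(key, inter, intra, payload):
    """Encrypt or decrypt: XOR the payload with the encrypted counter blocks."""
    round_keys = expand_key(key)
    out = bytearray()
    for off in range(0, len(payload), 16):
        pad = encrypt_block(crypto_counter(inter, intra), round_keys)
        out += bytes(p ^ k for p, k in zip(payload[off:off + 16], pad))
        intra = (intra + 1) & 0xFFFF     # assumed: one step per 128-bit block
    return bytes(out)
```

Because the same forward cipher produces the pseudorandom blocks in both directions, calling `ctr_xor` on the ciphertext with the same key and counter values returns the plaintext, which is why a single hardware core can serve both encryption and decryption.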

Among the round processes of the AES algorithm, the SubByte phase has the longest delay. Therefore, this block is divided into more sub-stages than the other phases. We implemented two full-pipelined architectures, which have a 3-stage sub-pipeline and a 5-stage sub-pipeline for each round process. Thus, the SubByte block has to be decomposed into 2 stages and 3 stages, respectively. We can achieve a very high throughput when using the 5-stage sub-pipeline for the AES architecture.

Fig. 2. Full-pipelined architecture for AES algorithm.

4. Advanced features for AES hardware implementation

This section presents innovative features in the AES hardware implementation. Each sub-block in the encryption process is optimized for area and delay. Our improvements to the AES architecture focus on the SubByte, MixColumn, and Key Expander blocks. The detailed hardware implementations of these blocks are described as follows.

4.1. SubByte transformation

In the SubByte transformation (S-box), the input is considered as an element of GF(2^8). First, the multiplicative inverse in GF(2^8) is calculated. Then, an affine transformation over GF(2) is applied. An S-box can be implemented with a look-up table, but this consumes much resource. Alternatively, we can implement an S-box using Galois Field operations [10]. Field arithmetic in GF(2^4) is used instead of GF(2^8) to optimize area. In this architecture, the input value is mapped to two elements of GF(2^4). Then, the multiplicative inverse is calculated using GF(2^4) operations. Next, the two GF(2^4) elements are inverse mapped to one element in GF(2^8). Last, the affine transformation is performed. Although the composite field implementation of the S-box is very efficient in area, it suffers from a long critical path. To overcome this drawback, further pipelining can be used. By using the 2-stage pipelined architecture with three 8-bit registers (Fig. 3), the critical path is broken in half. To reduce the path delay further, the 3-stage pipelined architecture can also be applied (Fig. 4).

Fig. 3. 2-stage pipelined SBox using GF operations.

Fig. 4. 3-stage pipelined SBox using GF operations.

4.2. MixColumn

In the MixColumn transformation, the columns of the State are considered as polynomials over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial c(x) = '03' x^3 + '01' x^2 + '01' x + '02'. In direct form, the MixColumn transformation can be expressed as

\[
\begin{cases}
s'_{0,c} = (\{02\} \cdot s_{0,c}) \oplus (\{03\} \cdot s_{1,c}) \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} = s_{0,c} \oplus (\{02\} \cdot s_{1,c}) \oplus (\{03\} \cdot s_{2,c}) \oplus s_{3,c} \\
s'_{2,c} = s_{0,c} \oplus s_{1,c} \oplus (\{02\} \cdot s_{2,c}) \oplus (\{03\} \cdot s_{3,c}) \\
s'_{3,c} = (\{03\} \cdot s_{0,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus (\{02\} \cdot s_{3,c})
\end{cases}
\tag{1}
\]

Several architectures have been proposed for the implementation of the MixColumn transformation. A substructure-shared architecture is applied in [4], [6], [7], [9]. In our implementation, we also use substructure sharing techniques to implement efficient hardware for the MixColumn transformation. To apply this technique, equation (1) is rewritten as

\[
\begin{cases}
s'_{0,c} = \{02\} \cdot (s_{0,c} \oplus s_{1,c}) \oplus s_{1,c} \oplus (s_{2,c} \oplus s_{3,c}) \\
s'_{1,c} = \{02\} \cdot (s_{1,c} \oplus s_{2,c}) \oplus s_{0,c} \oplus (s_{2,c} \oplus s_{3,c}) \\
s'_{2,c} = \{02\} \cdot (s_{2,c} \oplus s_{3,c}) \oplus s_{3,c} \oplus (s_{0,c} \oplus s_{1,c}) \\
s'_{3,c} = \{02\} \cdot (s_{3,c} \oplus s_{0,c}) \oplus s_{2,c} \oplus (s_{0,c} \oplus s_{1,c})
\end{cases}
\tag{2}
\]

The equation for the MixColumn transformation is now more symmetrical, and we can apply substructure sharing to optimize area for the hardware implementation. The {02} constant multiplication is computed by the function denoted by a = xtime(b). The xtime() function can be implemented at the byte level as a left shift and

a subsequent conditional bitwise XOR with {1b} if the most significant bit of the input byte is one (b7 = 1). The xtime() block can be implemented with three 2-input XOR gates. By using this efficient architecture of xtime() and applying XOR sharing, the MixColumn transformation can be implemented as shown in Fig. 5.

Fig. 5. (a) The efficient architecture of the MixColumn. (b) The implementation of xtime() function.

The total gate count for the MixColumn transformation is 324, which comprises 108 2-input XOR gates (each XOR gate contains 3 gates).

4.3. Key-Expander

The Key Expansion routine generates a total of 11 round keys from an initial key in the 128-bit AES algorithm. For a pipelined AES architecture, all round keys must be available at the same time. Therefore, some researchers implemented a key expansion routine that computes one round key and duplicated this hardware 10 times for the 10 rounds [4], [5]. These architectures can calculate all round keys at the same time, but they consume much area. Other researchers have proposed methods to reduce the area. Xinmiao Zhang [9] proposed a key expander that operates in an on-the-fly manner, so that data encryption and key expansion can start simultaneously. Building on that architecture, we implement an area-efficient key expander which also computes round keys in an on-the-fly manner. In order to operate synchronously with the sub-pipelined round process, the key expander is divided into r sub-stages. We use 11 registers to store the 11 round keys. This differs from the key expansion architecture in [9], in which the author used r sets of registers to store all round keys and temporary values for the sub-pipelined stages. With this scheme, our design uses less area than the previous architecture. The sub-pipelined architecture of the on-the-fly key expander with 3 sub-stages (r=3) is shown in Fig. 6.

Since round keys are generated on the fly, the number of sub-pipelined stages for key expansion must be the same as the number of encryption sub-stages. After r clock cycles, a new round key is generated, so all the round keys are available after (r×Nr) + 1 clock cycles.

Fig. 6. The architecture of on-the-fly key expander.

5. Performance results and comparisons

We implemented the GPON security module with the full-pipelined architecture of 128-bit CTR-AES on a Virtex-4 VLX100-12. Xilinx ISE 8.2i was used to synthesize the design and provided post-placement timing results. For simulation, we used ModelSim 5.8c to verify the encrypt/decrypt operations. We evaluated the hardware cost in terms of BRAMs, slices, maximum frequency, and throughput.

We implemented two full-pipelined architectures of the AES core, which have 3 sub-pipelined stages (r=3) and 5 sub-pipelined stages (r=5). The 3-stage sub-pipelined design has a total of 31 stages (r×10 + 1). Thus, after 31 clock cycles, the corresponding ciphertext blocks appear every clock cycle. By using this architecture, we can achieve a throughput of 26.7 Gbits/s. The 5-stage sub-pipelined design has higher performance than the 3-stage sub-pipelined design. However, this design consumes more area for pipeline registers and takes more clock cycles for the round processes. Table 1 compares existing AES implementations with our implementation. Since previous architectures were implemented on VirtexE devices, we also chose a Xilinx VirtexE-family device besides the Virtex-4 for our design, in order to compare the results fairly. According to the experimental results, the designs in [3]-[5] have lower performance because they use only an outer-pipelined architecture. In the implementations of [7] and [9], the authors improve the throughput by applying sub-pipelined architectures. Nevertheless, these designs require more slices for the extra hardware. In terms of throughput per slice, our implementation is more efficient than the published approaches. The synthesis report shows that our design with the 5-stage sub-pipelined architecture can achieve a throughput of 31.6 Gbits/s.

Table 1. Comparison of FPGA implementations of the AES algorithm

Design                     Device         Frequency (MHz)   Throughput (Mbps)   Slices   BRAMs   Mbps/slice
Shuenn-Shyang [3]          XCV1000e-8     125.38            1604                1857     0       0.867
Jae-Gon Lee [4]            XCV3200e-8     40                5120                8009     104     0.639
Saqib, N.A. [5]            XCV812e-8      20.192            2584                2744     0       0.942
Jarvinen [7]               XCV1000e-8     129.2             16500               11719    0       1.408
Xinmiao Zhang (r=3) [9]    XCV812e-8      93.5              11965               9406     0       1.272
Xinmiao Zhang (r=7) [9]    XCV1000e-8     168.4             21556               11022    0       1.956
Our AES core design (r=3)  XCV1000e-8     91.1              11661               8914     0       1.308
Our AES core design (r=5)  XCV1000e-8     150.25            19232               9820     0       1.958
Our AES core design (r=3)  XC4VLX100-12   208.49            26686               9478     0       2.816
Our AES core design (r=5)  XC4VLX100-12   247.19            31640               9904     0       3.195
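The throughput and efficiency columns for our designs can be cross-checked from the frequency and slice counts. The sketch below assumes that a full pipeline emits one 128-bit block per clock cycle, so throughput in Mbps is 128 times the clock frequency in MHz. The row data are copied from Table 1, and the helper names are hypothetical.

```python
# (design, device, f_MHz, throughput_Mbps, slices, listed Mbps/slice)
ROWS = [
    ("Our AES core (r=3)", "XCV1000e-8", 91.1, 11661, 8914, 1.308),
    ("Our AES core (r=5)", "XCV1000e-8", 150.25, 19232, 9820, 1.958),
    ("Our AES core (r=3)", "XC4VLX100-12", 208.49, 26686, 9478, 2.816),
    ("Our AES core (r=5)", "XC4VLX100-12", 247.19, 31640, 9904, 3.195),
]

def pipelined_mbps(f_mhz, block_bits=128):
    """Throughput of a full pipeline that emits one block per clock cycle."""
    return f_mhz * block_bits

def mbps_per_slice(throughput_mbps, slices):
    """Area efficiency metric used in the last column of Table 1."""
    return throughput_mbps / slices

if __name__ == "__main__":
    for name, device, f_mhz, mbps, slices, _listed in ROWS:
        print(f"{name:20s} {device:13s} "
              f"{pipelined_mbps(f_mhz):9.1f} Mbps  "
              f"{mbps_per_slice(mbps, slices):.3f} Mbps/slice")
```

The same relation does not apply to the non-pipelined or partially pipelined designs in the table, which need several clock cycles per block.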

The whole architecture of the GPON security module, including the AES core, was synthesized on a Xilinx Virtex-4 VLX100-12. Some extra resources are needed for the security decoder, security encoder, and payload bypass. Therefore, the total areas of the security module with the AES core (r=3) and the AES core (r=5) are 11958 slices and 13384 slices, respectively.

6. Conclusions

We presented an FPGA implementation of a high speed GPON security module using the counter mode AES algorithm. Our design has three main efficient features: composite field arithmetic SubByte, area-efficient MixColumn, and an on-the-fly sub-pipelined Key-Expander. With these improvements, our design combines small area with high throughput. For the full-pipelined architecture with 51 stages, we can achieve a throughput of 30 Gbits/s on a Virtex-4 VLX100 device. Our implementation is well suited to encryption applications of GPON systems.

Acknowledgement

This research was financially supported by the Electronics and Telecommunication Research Institute (ETRI) in Korea. The CAD tools for design in this work were supported by IDEC.

References

[1] "Gigabit-capable Passive Optical Networks (G-PON): Transmission convergence layer specification", ITU-T G.984.3 Amendment 1, July 2005.
[2] Morris Dworkin, "Recommendation for Block Cipher Modes of Operation", NIST Special Publication, http://csrc.nist.gov/CryptoToolkit/modes/, 2001.
[3] Shuenn-Shyang Wang, Wan-Sheng Ni, "An efficient FPGA implementation of advanced encryption standard algorithm", Proceedings of the International Symposium on Circuits and Systems, vol. 2, pp. 597-600, May 2004.
[4] Jae-Gon Lee, Woong Hwangbo, Seonpil Kim, Chong-Min Kyung, "Top-down implementation of pipelined AES cipher and its verification with FPGA-based simulation accelerator", Proceedings of 6th International Conference on ASIC, pp. 68-72, Oct. 2005.
[5] Saqib, N.A., Rodriguez-Henriquez, F., Diaz-Perez, A., "AES algorithm implementation - an efficient approach for sequential and pipeline architectures", Proceedings of the Fourth Mexican International Conference on Computer Science, pp. 126-130, Sept. 2003.
[6] Nedjah, N., de Macedo Mourelle, L., Cardoso, M.P., "A Compact Pipelined Hardware Implementation of the AES-128 Cipher", Third International Conference on Information Technology: New Generations, pp. 216-221, April 2006.
[7] Yongzhi Fu, Lin Hao, Xuejie Zhang, Rujin Yang, "Design of an extremely high performance counter mode AES reconfigurable processor", Second International Conference on Embedded Software and Systems, Dec. 2005.
[8] Hodjat, A., Verbauwhede, I., "Area-throughput trade-offs for fully pipelined 30 to 70 Gbits/s AES processors", IEEE Transactions on Computers, vol. 55, no. 4, pp. 366-372, April 2006.
[9] Xinmiao Zhang, Parhi, K.K., "High-speed VLSI architectures for the AES algorithm", IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957-967, Sept. 2004.
[10] J. Wolkerstorfer, E. Oswald, and M. Lamberger, "An ASIC Implementation of the AES Sboxes", Proceeding of RSA Conference, pp. 29-52, Feb. 2002.
