Among the round processes of the AES algorithm, the SubByte phase has the longest delay. Therefore, this block is divided into more sub-stages than the other phases. We implemented two full-pipelined architectures, which have a 3-stage sub-pipeline and a 5-stage sub-pipeline for each round process. Thus, the SubByte block has to be decomposed into 2 stages and 3 stages, respectively. The 5-stage sub-pipelined AES architecture achieves a very high throughput.

Fig. 2. Full-pipelined architecture for AES algorithm.

4. Advanced features for AES Hardware implementation

This section presents innovative features in AES hardware implementation. Each sub-block in the encryption process is optimized for area and delay. Our improvement to the AES architecture focuses on the SubByte, MixColumn, and Key Expander blocks. The detailed hardware implementations of these blocks are described as follows.

4.1. SubByte transformation

In the SubByte transformation (SBox), the input is considered as an element of GF(2^8). First, the multiplicative inverse in GF(2^8) is calculated. Then, an affine transformation over GF(2) is applied. An SBox can be implemented as a look-up table, but this consumes many resources. Alternatively, we can implement an SBox using Galois field operations [10]. Field arithmetic in GF(2^4) is used instead of GF(2^8) to optimize area. In this architecture, the input value is mapped to two elements of GF(2^4). Then, the multiplicative inverse is calculated using GF(2^4) operations. Next, the two GF(2^4) elements are inverse mapped to one element in GF(2^8). Last, the affine transformation is performed. Although the composite field implementation of the SBox is very efficient in area, it suffers from a long critical path. To overcome this drawback, further pipelining can be used. By using the 2-stage pipelined architecture with three 8-bit registers (Fig. 3), the critical path is broken in half. To reduce the path delay further, the 3-stage pipelined architecture can also be applied (Fig. 4).

Fig. 3. 2-stage pipelined SBox using GF operations.

4.2. MixColumn

In the MixColumn transformation, the columns of the State are considered as polynomials over GF(2^8) and multiplied modulo x^4 + 1 with a fixed polynomial c(x) = ‘03’ x^3 + ‘01’ x^2 + ‘01’ x + ‘02’. In direct form, the MixColumn transformation can be expressed as

$$
\begin{cases}
s'_{0,c} = (\{02\} \cdot s_{0,c}) \oplus (\{03\} \cdot s_{1,c}) \oplus s_{2,c} \oplus s_{3,c} \\
s'_{1,c} = s_{0,c} \oplus (\{02\} \cdot s_{1,c}) \oplus (\{03\} \cdot s_{2,c}) \oplus s_{3,c} \\
s'_{2,c} = s_{0,c} \oplus s_{1,c} \oplus (\{02\} \cdot s_{2,c}) \oplus (\{03\} \cdot s_{3,c}) \\
s'_{3,c} = (\{03\} \cdot s_{0,c}) \oplus s_{1,c} \oplus s_{2,c} \oplus (\{02\} \cdot s_{3,c})
\end{cases}
\tag{1}
$$

Several architectures have been proposed for the implementation of the MixColumn transformation. A substructure-shared architecture is applied in [4], [6], [7], [9]. In our implementation, we also use substructure sharing techniques to implement efficient hardware for the MixColumn transformation. To apply this technique, equation (1) is rewritten as

$$
\begin{cases}
s'_{0,c} = \{02\} \cdot (s_{0,c} \oplus s_{1,c}) \oplus s_{1,c} \oplus (s_{2,c} \oplus s_{3,c}) \\
s'_{1,c} = \{02\} \cdot (s_{1,c} \oplus s_{2,c}) \oplus s_{0,c} \oplus (s_{2,c} \oplus s_{3,c}) \\
s'_{2,c} = \{02\} \cdot (s_{2,c} \oplus s_{3,c}) \oplus s_{3,c} \oplus (s_{0,c} \oplus s_{1,c}) \\
s'_{3,c} = \{02\} \cdot (s_{3,c} \oplus s_{0,c}) \oplus s_{2,c} \oplus (s_{0,c} \oplus s_{1,c})
\end{cases}
\tag{2}
$$

The equation for the MixColumn transformation is now more symmetrical, and we can apply substructure sharing to optimize area for hardware implementation. The {02} constant multiplication is computed by the function denoted by a = xtime(b). The xtime() function can be implemented at the byte level as a left shift and
a subsequent conditional bitwise XOR with {1b} if the most significant bit of the input byte is one (b7 = 1). The xtime() block can be implemented with three 2-input XOR gates. By using an efficient architecture for xtime() and applying XOR sharing, the MixColumn transformation can be implemented as shown in Fig. 5.

Fig. 5. (a) The efficient architecture of the MixColumn. (b) The implementation of xtime() function.

The total gate count for the MixColumn transformation is 324, which comprises 108 2-input XOR gates (each XOR gate counts as 3 gates).

4.3. Key-Expander

The Key Expansion routine generates a total of 11 round keys from an initial key in the 128-bit AES algorithm. For a pipelined AES architecture, all round keys must be available at the same time. Therefore, some researchers implemented a key expansion routine that computes one round key and duplicated this hardware 10 times for the 10 rounds [4], [5]. These architectures can calculate all round keys at the same time, but they consume much area. Other researchers have proposed methods to reduce this cost: Xinmiao Zhang [9] proposed a key expander that can operate in an on-the-fly manner, so that data encryption and key expansion can start simultaneously. Building on that architecture, we implement an area-efficient key expander which can also compute round keys on the fly. In order to operate synchronously with the sub-pipelined round process, the key expander is divided into r sub-stages. We use 11 registers to store the 11 round keys. This differs from the key expansion architecture in [9], in which the author used r sets of registers to hold all round keys and temporary values for the sub-pipelined stages. With this scheme, our design occupies less area than the previous architecture. The sub-pipelined architecture of the on-the-fly key expander with 3 sub-stages (r=3) is shown in Fig. 6.

Since round keys are generated on the fly, the number of sub-pipelined stages for key expansion must be the same as the number of encryption sub-stages. After r clock cycles, a new round key is generated, so all the round keys are available after (r×Nr) + 1 clock cycles.

Fig. 6. The architecture of on-the-fly key expander.

5. Performance results and comparisons

We implemented the GPON security module with a full-pipelined architecture of 128-bit CTR-AES on a Virtex-4 VLX100-12. Xilinx ISE 8.2i was used to synthesize the design and provide post-placement timing results. For simulation, we used ModelSim 5.8c to verify the encrypt/decrypt operations. We evaluated the hardware cost in terms of BRAMs, slices, maximum frequency and throughput.

We implemented two full-pipelined architectures of the AES core, which have 3 sub-pipelined stages (r=3) and 5 sub-pipelined stages (r=5). The 3-stage sub-pipelined design has a total of 31 stages (r×10 + 1). Thus, after 31 clock cycles, the corresponding cipher text blocks appear every clock cycle. With this architecture, we achieve a throughput of 26.7 Gbits/s. The 5-stage sub-pipelined design has higher performance than the 3-stage sub-pipelined design. However, this design consumes more area for pipeline registers and takes more clock cycles for round processes. Table 1 compares existing AES implementations with our implementation. Since previous architectures were implemented on VirtexE devices, we also chose a Xilinx VirtexE-family device besides the Virtex-4 for our design in order to compare the results fairly. According to the experimental results, the designs in [3]-[5] have lower performance because they use only an outer-pipeline architecture. In the implementations of [7] and [9], the authors improve the throughput by applying sub-pipeline architectures. Nevertheless, these designs require more slices for extra hardware. In terms of throughput/slice, our implementation is more efficient than the published approaches. The synthesis report shows that our design with the 5-stage sub-pipelined architecture can achieve a throughput of 31.6 Gbits/s.
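
The quoted stage counts and throughput figures can be reproduced from the clock frequencies listed in Table 1. The worked arithmetic below assumes that a full pipeline completes one 128-bit block per clock cycle, which is consistent with the reported numbers but is not stated as a formula in the text. The symbols N_stages, T, and f_max are introduced here only for illustration.

$$
\begin{aligned}
N_{\text{stages}} &= r \times 10 + 1 = 31 \ (r=3), \quad 51 \ (r=5) \\
T &= 128 \ \text{bits} \times f_{\max} \\
T_{r=3} &= 128 \times 208.49 \ \text{MHz} = 26686.72 \ \text{Mbps} \approx 26.7 \ \text{Gbits/s} \\
T_{r=5} &= 128 \times 247.19 \ \text{MHz} = 31640.32 \ \text{Mbps} \approx 31.6 \ \text{Gbits/s}
\end{aligned}
$$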
Table 1. Comparison of FPGA implementation of the AES algorithm

Design                       Device        Frequency (MHz)  Throughput (Mbps)  Slices  BRAMs  Mbps/slice
Shuenn-Shyang [3]            XCV1000e-8    125.38           1604               1857    0      0.867
Jae-Gon Lee [4]              XCV3200e-8    40               5120               8009    104    0.639
Saqib, N.A. [5]              XCV812e-8     20.192           2584               2744    0      0.942
Jarvinen [7]                 XCV1000e-8    129.2            16500              11719   0      1.408
Xinmiao Zhang (r=3) [9]      XCV812e-8     93.5             11965              9406    0      1.272
Xinmiao Zhang (r=7) [9]      XCV1000e-8    168.4            21556              11022   0      1.956
Our AES core design (r=3)    XCV1000e-8    91.1             11661              8914    0      1.308
Our AES core design (r=5)    XCV1000e-8    150.25           19232              9820    0      1.958
Our AES core design (r=3)    XC4VLX100-12  208.49           26686              9478    0      2.816
Our AES core design (r=5)    XC4VLX100-12  247.19           31640              9904    0      3.195
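
The Mbps/slice column of Table 1 is the throughput divided by the slice count. The following is a minimal sketch in Python that recomputes it for the four rows of our design. The `Design` container and the `mbps_per_slice` helper are hypothetical names introduced for this illustration; only the numbers are taken from the table.

```python
from typing import NamedTuple


class Design(NamedTuple):
    """One row of Table 1 (hypothetical container used only for this sketch)."""
    name: str
    device: str
    throughput_mbps: float
    slices: int


def mbps_per_slice(d: Design) -> float:
    """Area efficiency: throughput divided by the number of occupied slices."""
    return d.throughput_mbps / d.slices


# Rows for our AES core, transcribed from Table 1.
OUR_DESIGNS = [
    Design("r=3", "XCV1000e-8", 11661, 8914),
    Design("r=5", "XCV1000e-8", 19232, 9820),
    Design("r=3", "XC4VLX100-12", 26686, 9478),
    Design("r=5", "XC4VLX100-12", 31640, 9904),
]

if __name__ == "__main__":
    for d in OUR_DESIGNS:
        print(f"{d.name}  {d.device:13s} {mbps_per_slice(d):.3f} Mbps/slice")
```

The recomputed values agree with the last column of Table 1 to three decimal places.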

The whole architecture of the GPON security module, including the AES core, is synthesized on a Xilinx Virtex-4 VLX100-12. Some extra resources are needed for the security decoder, security encoder, and payload bypass. Therefore, the total areas for the security module with the AES core (r=3) and the AES core (r=5) are 11958 slices and 13384 slices, respectively.

6. Conclusions

We presented an FPGA implementation of a high-speed GPON security module using the counter mode AES algorithm. Our design has three main efficient features: a composite field arithmetic SubByte, an area-efficient MixColumn, and an on-the-fly sub-pipelined Key-Expander. By using these improvements, our design has optimal area and maximum throughput. For the full-pipelined architecture with 51 stages, we achieve a throughput of 30 Gbits/s on a Virtex-4 VLX100 device. Our implementation is well suited to encryption applications in GPON systems.

Acknowledgement

This research was financially supported by the Electronics and Telecommunication Research Institute (ETRI) in Korea. The CAD tools for design in this work were supported by IDEC.

References

[1] “Gigabit-capable Passive Optical Networks (G-PON): Transmission convergence layer specification”, ITU-T G.984.3 Amendment 1, July 2005.
[2] Morris Dworkin, “Recommendation for Block Cipher Modes of Operation”, NIST Special Publication, http://csrc.nist.gov/CryptoToolkit/modes/, 2001.
[3] Shuenn-Shyang Wang, Wan-Sheng Ni, “An efficient FPGA implementation of advanced encryption standard algorithm”, Proceedings of the International Symposium on Circuits and Systems, vol. 2, pp. 597-600, May 2004.
[4] Jae-Gon Lee, Woong Hwangbo, Seonpil Kim, Chong-Min Kyung, “Top-down implementation of pipelined AES cipher and its verification with FPGA-based simulation accelerator”, Proceedings of 6th International Conference on ASIC, pp. 68-72, Oct. 2005.
[5] Saqib, N.A., Rodriguez-Henriquez, F., Diaz-Perez, A., “AES algorithm implementation - an efficient approach for sequential and pipeline architectures”, Proceedings of the Fourth Mexican International Conference on Computer Science, pp. 126-130, Sept. 2003.
[6] Nedjah, N., de Macedo Mourelle, L., Cardoso, M.P., “A Compact Pipelined Hardware Implementation of the AES-128 Cipher”, Third International Conference on Information Technology: New Generations, pp. 216-221, April 2006.
[7] Yongzhi Fu, Lin Hao, Xuejie Zhang, Rujin Yang, “Design of an extremely high performance counter mode AES reconfigurable processor”, Second International Conference on Embedded Software and Systems, Dec. 2005.
[8] Hodjat, A., Verbauwhede, I., “Area-throughput trade-offs for fully pipelined 30 to 70 Gbits/s AES processors”, IEEE Transactions on Computers, vol. 55, no. 4, pp. 366-372, April 2006.
[9] Xinmiao Zhang, Parhi, K.K., “High-speed VLSI architectures for the AES algorithm”, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol. 12, no. 9, pp. 957-967, Sept. 2004.
[10] J. Wolkerstorfer, E. Oswald, and M. Lamberger, “An ASIC Implementation of the AES Sboxes”, Proceeding of RSA Conference, pp. 29-52, Feb. 2002.