Authorized licensed use limited to: IEEE Xplore. Downloaded on November 3, 2008 at 19:10 from IEEE Xplore. Restrictions apply.
Gbps) for 128-bit key mode was achieved in one FPGA implementation with a cost of only 58.5K gates [3]. This was done by introducing a 4-stage pipeline for the main functions and performing a basis transformation on SubBytes. The alternative S-Box design would be to use a lookup table (LUT).

3. AES algorithm specification

The AES encryption process [4] for a 128-bit plain text block is shown in Figure 1. The AES algorithm specifies 128-bit, 192-bit, and 256-bit modes. Each bit mode has a corresponding number of rounds (Nr) based on the key length (Nk words). Nb (state block size) is constant for all bit modes. This 128-bit block is termed the state. Each state is composed of 4 words, and a word is defined as 4 bytes. Table 1 shows the possible key/state block/round combinations.

[Figure 1 diagram. Flow: Plain Text; AddRoundKey with Key (0); rounds of SubBytes, ShiftRows, MixColumns, and AddRoundKey with Key (1) through Key (Nr-1); a final round of SubBytes, ShiftRows, and AddRoundKey with Key (Nr); Cipher Text. Insets show the S-Box substitution of each state byte, the ShiftRows rotations (first row: no rotation; remaining rows: left rotate by 1, 2, and 3), the multiplication of each state column by C(x) in MixColumns, and the XOR of each state column with the round key column in AddRoundKey.]
Figure 1. AES encryption algorithm

Table 1. AES bit mode specifications

Bit Mode   Key Length (Nk words)   State Size (Nb words)   Number of Rounds (Nr)
128        4                       4                       10
192        6                       4                       12
256        8                       4                       14

The encrypt/decrypt process involves a sequence of four primitive functions: SubBytes, ShiftRows, MixColumns, and AddRoundKey. AddRoundKey is the same for both encrypt and decrypt. The three other functions have inverses used in the decrypt process: Inverse SubBytes, Inverse ShiftRows, and Inverse MixColumns.

Either encryption or decryption begins with the round key expansion created by the key schedule function. Using the RCON values in combination with a series of XOR, SubBytes, and RotWord (rotate word) operations, an expanded round key is generated with a size of Nb × (Nr + 1) words, that is, 4 × Nb × (Nr + 1) bytes. For the 256-bit key expansion, the SubBytes operation is reapplied 4 words after each use of the RCON.

The primitive functions are called Nr times in a loop (each pass is called a round). Nr is initialized to 10, 12 or 14 as a function of key length. SubBytes is a nonlinear transformation in which one byte is substituted for another: the multiplicative inverse of the byte is taken in the Galois Field GF(2^8), and the result b is passed through the affine transformation seen in Equation 1.

\[
\begin{bmatrix} b'_0 \\ b'_1 \\ b'_2 \\ b'_3 \\ b'_4 \\ b'_5 \\ b'_6 \\ b'_7 \end{bmatrix}
=
\begin{bmatrix}
1 & 0 & 0 & 0 & 1 & 1 & 1 & 1 \\
1 & 1 & 0 & 0 & 0 & 1 & 1 & 1 \\
1 & 1 & 1 & 0 & 0 & 0 & 1 & 1 \\
1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \\
1 & 1 & 1 & 1 & 1 & 0 & 0 & 0 \\
0 & 1 & 1 & 1 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 1 & 1 & 1 & 0 \\
0 & 0 & 0 & 1 & 1 & 1 & 1 & 1
\end{bmatrix}
\begin{bmatrix} b_0 \\ b_1 \\ b_2 \\ b_3 \\ b_4 \\ b_5 \\ b_6 \\ b_7 \end{bmatrix}
+
\begin{bmatrix} 1 \\ 1 \\ 0 \\ 0 \\ 0 \\ 1 \\ 1 \\ 0 \end{bmatrix}
\tag{1}
\]

ShiftRows is a shift operation performed on the last three rows of the state. These rows are rotated to the left by 1, 2, and 3 bytes respectively, as shown more specifically in Figure 1. MixColumns is a finite field matrix multiplication applied every round except the last. Each column is treated as a four-term polynomial with coefficients in GF(2^8) and multiplied modulo x^4 + 1 using the array shown in Equation 2.

\[
\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix}
=
\begin{bmatrix}
\texttt{02} & \texttt{03} & \texttt{01} & \texttt{01} \\
\texttt{01} & \texttt{02} & \texttt{03} & \texttt{01} \\
\texttt{01} & \texttt{01} & \texttt{02} & \texttt{03} \\
\texttt{03} & \texttt{01} & \texttt{01} & \texttt{02}
\end{bmatrix}
\begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}
\tag{2}
\]

AddRoundKey performs a bitwise XOR operation between the current state and the round (expanded) key every round, including an initial round and the last round. The round key is read from round 0 to Nr for encrypt and vice versa for decrypt. During regeneration of the round key, the decrypt process must halt and wait for generation of the last round key, causing a significant delay in initialization.

The decryption process calls the inverse of each function. Inverse SubBytes applies the inverse affine transformation and then takes the multiplicative inverse in GF(2^8). Inverse ShiftRows rotates the last three rows to the right by 1, 2, and 3 bytes respectively. Inverse MixColumns uses the same operations as MixColumns but uses the inverse matrix shown in Equation 3.

\[
\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix}
=
\begin{bmatrix}
\texttt{0e} & \texttt{0b} & \texttt{0d} & \texttt{09} \\
\texttt{09} & \texttt{0e} & \texttt{0b} & \texttt{0d} \\
\texttt{0d} & \texttt{09} & \texttt{0e} & \texttt{0b} \\
\texttt{0b} & \texttt{0d} & \texttt{09} & \texttt{0e}
\end{bmatrix}
\begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix}
\tag{3}
\]

Before designing the architecture, a preliminary software implementation was developed in order to better understand the algorithm, develop test benches, and anticipate any obstacles in the transition to hardware. The software implementation was written in C, due to its ease of use and similarity to Verilog. C++ I/O was used for its ease of use and interoperability.

The AES algorithm specification FIPS-197 [4] was followed as closely as possible, while minimizing redundancy. Processes shared between encryption and decryption were reused as often as possible.

From this software design, it was observed that the MixColumns and SubBytes functions took up significant CPU time. These functions require GF operations that are awkward to implement in software. MixColumns specifically required sequential left shifts, each followed by a conditional XOR operation. The condition of the XOR operation depends on there being a 1 in the most significant bit of the current byte before it is shifted. If the condition is true, the shifted byte is XOR-ed with the byte {1b}, which holds the low-order terms of the irreducible GF polynomial shown in Equation 4 [4].

\[
m(x) = x^8 + x^4 + x^3 + x + 1
\tag{4}
\]

The substitution performed by SubBytes can be implemented as a 16x16 lookup table. Every time SubBytes/Inverse SubBytes is called, the table is parsed for the appropriate byte substitution.
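The shift-and-conditional-XOR step described above for MixColumns can be sketched in a few lines of Python. This is only an illustrative model, not the authors' C code: the names `xtime`, `gf_mul`, and `mul_column` are assumptions introduced here, the two matrices are copied from Equations 2 and 3, and the check values in the usage note are taken from FIPS-197 [4] and the commonly used MixColumns test column.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8).

    Shift left by one bit; if the most significant bit was 1 before the
    shift, XOR the shifted byte with {1b}.
    """
    shifted = (b << 1) & 0xFF
    return shifted ^ 0x1B if b & 0x80 else shifted


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) using repeated xtime and XOR."""
    result = 0
    while b:
        if b & 1:          # add the current multiple of a for this bit of b
            result ^= a
        a = xtime(a)       # next multiple: a * {02}
        b >>= 1
    return result


# Matrices from Equation 2 (MixColumns) and Equation 3 (Inverse MixColumns).
MIX = [[0x02, 0x03, 0x01, 0x01],
       [0x01, 0x02, 0x03, 0x01],
       [0x01, 0x01, 0x02, 0x03],
       [0x03, 0x01, 0x01, 0x02]]
INV_MIX = [[0x0E, 0x0B, 0x0D, 0x09],
           [0x09, 0x0E, 0x0B, 0x0D],
           [0x0D, 0x09, 0x0E, 0x0B],
           [0x0B, 0x0D, 0x09, 0x0E]]


def mul_column(matrix, column):
    """Multiply one 4-byte state column by a 4x4 matrix over GF(2^8)."""
    out = []
    for row in matrix:
        acc = 0
        for coeff, byte in zip(row, column):
            acc ^= gf_mul(coeff, byte)   # addition in GF(2^8) is XOR
        out.append(acc)
    return out
```

As a usage example, `mul_column(MIX, [0xDB, 0x13, 0x53, 0x45])` yields `[0x8E, 0x4D, 0xA1, 0xBC]`, and applying `mul_column(INV_MIX, ...)` to that result restores the original column. The model processes one byte at a time, which mirrors the byte-serial style of the design but says nothing about its timing.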
[Figure 2 block diagram. Labeled blocks: IR, A and D inputs, 16 Byte State RF, 32 Byte Key RF, 208 Byte RoundKey RF, RCON LUT, S-Box LUT, Inverse S-Box LUT, Working Register, MixCols Accumulator, Mod Flag, constant inputs 1b and 00, Controller, Data Out.]
Figure 2. Overall hardware design
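The key schedule summarized in Section 3 (RotWord, SubBytes, and an XOR with RCON, plus the extra SubBytes step for 256-bit keys) can be modeled as follows. This is a minimal sketch that follows the FIPS-197 [4] key expansion, not the controller logic of this design; the function names are assumptions introduced here, and the S-Box is rebuilt inside the block only to keep it self-contained.

```python
def xtime(b):
    """Multiply a byte by {02} in GF(2^8)."""
    b <<= 1
    return (b ^ 0x11B) & 0xFF if b & 0x100 else b


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = xtime(a)
        b >>= 1
    return result


def build_sbox():
    """Build the 256-entry S-Box: GF(2^8) inverse, then affine transformation."""
    sbox = []
    for a in range(256):
        # Brute-force multiplicative inverse; {00} maps to itself.
        inv = 0 if a == 0 else next(b for b in range(1, 256) if gf_mul(a, b) == 1)
        s = inv
        for k in range(1, 5):
            s ^= ((inv << k) | (inv >> (8 - k))) & 0xFF   # rotate left by k
        sbox.append(s ^ 0x63)
    return sbox


def expand_key(key):
    """Expand a 16-, 24-, or 32-byte key into Nb * (Nr + 1) four-byte words."""
    nk = len(key) // 4            # key length in words
    nr = nk + 6                   # 10, 12, or 14 rounds
    sbox = build_sbox()
    words = [list(key[4 * i:4 * i + 4]) for i in range(nk)]
    rcon = 0x01
    for i in range(nk, 4 * (nr + 1)):
        temp = list(words[i - 1])
        if i % nk == 0:
            temp = temp[1:] + temp[:1]          # RotWord
            temp = [sbox[b] for b in temp]      # SubBytes on the word
            temp[0] ^= rcon                     # XOR with RCON
            rcon = xtime(rcon)
        elif nk > 6 and i % nk == 4:
            temp = [sbox[b] for b in temp]      # 256-bit keys only
        words.append([t ^ w for t, w in zip(temp, words[i - nk])])
    return words
```

For the FIPS-197 example key `2b7e151628aed2a6abf7158809cf4f3c`, word 4 of the result is `a0fafe17`. For a 32-byte key the function returns 60 words, or 240 bytes, which appears consistent with a 32-byte Key RF followed by a 208-byte RoundKey RF.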
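To make the 16x16 lookup table concrete, the sketch below builds the S-Box from the GF(2^8) inverse and the affine transformation of Equation 1, then indexes it by the high and low nibble of the input byte. It is an illustration under stated assumptions rather than the paper's implementation: the names `affine`, `gf_inverse`, `SBOX`, `INV_SBOX`, `sub_byte`, and `inv_sub_byte` are invented here, and the row/column indexing by nibble follows the table layout in FIPS-197 [4].

```python
# Rows of the matrix in Equation 1 (columns are b0..b7) and its constant vector.
AFFINE_ROWS = [
    [1, 0, 0, 0, 1, 1, 1, 1],
    [1, 1, 0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0, 1, 1],
    [1, 1, 1, 1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1, 0, 0, 0],
    [0, 1, 1, 1, 1, 1, 0, 0],
    [0, 0, 1, 1, 1, 1, 1, 0],
    [0, 0, 0, 1, 1, 1, 1, 1],
]
AFFINE_CONST = [1, 1, 0, 0, 0, 1, 1, 0]


def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a = ((a << 1) ^ 0x1B) & 0xFF if a & 0x80 else a << 1
        b >>= 1
    return result


def gf_inverse(a):
    """Brute-force multiplicative inverse in GF(2^8); {00} maps to {00}."""
    if a == 0:
        return 0
    return next(b for b in range(1, 256) if gf_mul(a, b) == 1)


def affine(byte):
    """Apply Equation 1 bit by bit: b' = M * b + c over GF(2)."""
    bits = [(byte >> j) & 1 for j in range(8)]
    out = 0
    for i, row in enumerate(AFFINE_ROWS):
        bit = AFFINE_CONST[i]
        for j in range(8):
            bit ^= row[j] & bits[j]
        out |= bit << i
    return out


# 16x16 tables: row = high nibble of the input, column = low nibble.
SBOX = [[affine(gf_inverse(16 * hi + lo)) for lo in range(16)] for hi in range(16)]
INV_SBOX = [[0] * 16 for _ in range(16)]
for value in range(256):
    s = SBOX[value >> 4][value & 0x0F]
    INV_SBOX[s >> 4][s & 0x0F] = value


def sub_byte(b):
    """SubBytes for one byte via table lookup."""
    return SBOX[b >> 4][b & 0x0F]


def inv_sub_byte(b):
    """Inverse SubBytes for one byte via table lookup."""
    return INV_SBOX[b >> 4][b & 0x0F]
```

With these tables, `sub_byte(0x53)` returns `0xED`, the example given in FIPS-197, and `inv_sub_byte(sub_byte(x))` returns `x` for every byte. Precomputing the tables trades 512 bytes of storage for a single indexed read per substitution, which is the memory cost discussed in the surrounding text.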
There is no simple way of getting around the use of a lookup table, as research into other implementations has shown. An attempt at splitting the search process into sixteen smaller comparisons did not significantly increase efficiency. The LUT used a great deal of memory in software, and similarly a large number of gates in the hardware design.

Various methods for reducing GF circuit size exist, such as composite (or tower) field inversion, Fermat's little theorem, or extended Euclidean algorithms [2]. However, these methods introduce large propagation delays and increased power consumption. The low-cost and low-power approach minimizes the complexity of GF operations by sacrificing speed.

The basis for the hardware implementation is a result of the work done on the C++ reference code. Using the code, the design intent was to develop a loosely coupled AES co-processor which operates independently of the main processor. The resulting data-path design is shown in Figure 2.

The main components of the datapath are the State RF (Register File), Key RF, RoundKey RF, XOR gate, S-Box, Inv S-Box, Working Register, and MixColumns accumulator. Data and instructions are fed into the module by an 8-bit line and assigned to an 8-bit shift register (see Figure 3) by a 5-bit address line. The data is output serially upon completion of the state operations. The Controller generates control signals for data transportation, key expansion, encryption and decryption.

[Figure 3 diagram. Eight 1-bit registers (D/Q) with inputs I7 to I0, outputs O7 to O0, control lines R/-W and SH, and a Serial Out.]
Figure 3. 8-bit shift register

Assuming the data and key are first loaded by the main processor into the register files, the key expansion process can begin. The Key RF and RoundKey RF receive addresses from the controller, allowing individual bytes to be chosen for manipulation. Once the round key is generated, the values are held constant until the main processor assigns a new key. The State RF contains the current state values and changes after each function call.

The Key RF is designed to hold the first 16, 24 or 32 bytes of the round key as shown in Figure 4. Similarly, the RoundKey RF is simply a 208 byte extension of the Key RF design. The State RF is broken down into 4 sets of four 8-bit shift registers (a single set is shown in Figure 5). Each set constitutes a row, and using a pair of control wires, both the ShiftRows and Inverse ShiftRows operations can be performed internally. The data in the State RF is physically moved from one shift register to the next until each byte reaches its respective register during either ShiftRows operation.

[Figure 4 diagram. A 5 to 32 decoder driven by address A selects among 8b shift registers; inputs D, R/-W, SH, and CLK; outputs Serial Out and Parallel Out.]
Figure 4. Key register file

[Figure 5 diagram. Four 8b shift registers forming one row.]
Figure 5. Row shifter in state register file

The S-Box LUT receives an 8-bit input through a multiplexer from any one of the three register files and processes the byte by combinational logic. The Inverse S-Box only receives a connection from the State RF. The S-Box output is tied to the State RF, RoundKey RF, and Working Register, while the Inv S-Box ties back to the State RF.

The modules are connected serially to the XOR gate. An 8-bit temporary register (Working Register) is placed at the gate's output to assist in byte operations during the key expansion and MixColumns. The Working Register operates in tandem with the MixCols Accumulator to compute both MixColumns functions. In addition, the byte values {1b} and {00} are held constant at the input of the XOR gate to be used as needed.

The controller reads flags from the ModFlag block and instruction register (IR). The ModFlag block signals the controller as to the status of the most significant bit in the current output byte for use in the conditional XOR. The IR contains the command the co-processor will respond to.

5. Discussion and conclusions

This architecture is being developed and prototyped in an FPGA platform using the Verilog® Hardware Description Language (HDL) and the Spartan II XC2S50 FPGA from Xilinx. This FPGA platform has 50K available gates, 1,728 logic cells, 32 Kb RAM, and 64 I/O ports, which are suitable for the design.

The critical paths in the hardware implementation have been shown to be SubBytes, MixColumns, and the decrypt key schedule. The hardware testing phase will reveal the actual delays in algorithm processing. Several design modifications are to be evaluated: a modified SubBytes design and a reduced MixColumns algorithm. By focusing on minimization of redundancies within these functions, cost can be reduced while maintaining a reasonable operating speed.

The SubBytes and MixColumns algorithms can also be merged as other implementations have done. With an advanced chip process, the architecture would perform much faster, and the design can be scaled accordingly. Some aspects of the design can also be parallelized selectively, but will not necessarily reach the levels of other implementations; however, the chief objective of this research was to develop an architecture to minimize the cost of the implementation (i.e. gate count).

Table 2. Comparison of some AES architectures

                                     [5]       [6]       [3]       THIS
Technology                           0.35 µm   0.18 µm   0.35 µm   FPGA
Maximum Clock Rate                   N/A       125 MHz   200 MHz   510 MHz
Throughput (Gb/s)                    1.95      1.14      2.00      0.37
Gate Count                           612K      173K      59K       10K
Throughput/Gate Count (Kb/s/gate)    3.18      6.59      34.98     36.13

Table 2 shows comparisons of size and speed, among other attributes, of this implementation with others. As Table 2 demonstrates, this implementation uses only a small fraction of the gates used by other designs that have been reported in the literature, while maintaining an acceptable level of throughput. This is evident from the high efficiency of this design, which is given by the Throughput/Gates figure of merit.

By using a true low level bit-serial approach, a minimum cost AES co-processor architecture has been achieved. This architecture can be used in many military, industrial, and commercial applications that require compactness and low cost.

6. References

[1] S. Morioka and A. Satoh, "A 10-Gbps Full AES-Crypto Design with a Twisted BDD S-Box Architecture", IEEE Transactions on VLSI Systems, Vol. 12, No. 7, July 2004, pp. 686-691.
[2] V. Fischer and M. Drutarovsky, "Two Methods of Rijndael Implementation in Reconfigurable Hardware", Proc. CHES, Vol. 2162, 2001, pp. 81-96.
[3] C-P. Su, T-F. Lin, C-T. Huang, and C-W. Wu, "A High-Throughput Low-Cost AES Processor", Communications Magazine, IEEE, Vol. 41, Issue 12, December 2003, pp. 86-91.
[4] NIST, "ADVANCED ENCRYPTION STANDARD (AES, Rijndael)", FIPS-197, November 2001, http://csrc.nist.gov/encryption/aes/.
[5] T. Ichikawa, T. Kasuya, and M. Matsui, "Hardware Evaluation of the AES Finalists", Proc. 3rd AES Candidate Conf., 2000.
[6] I. Verbauwhede, P. Schaumont, and H. Kuo, "Design and Performance Testing of a 2.29-Gb/s Rijndael Processor", IEEE J. Solid-State Circuits, Vol. 38, No. 3, Mar. 2003, pp. 569-572.