11/16/14 10:55 PM
Abstract
Salman Ul Haq, Jawad Masood, Aamir Majeed, Usman Aziz
10/11/2011
This article covers the implementation and optimization of the Advanced Encryption
Standard (AES) on AMD GPUs using OpenCL, which is fine-tuned for bulk encryption applications. Reliable encryption schemes are needed to ensure the information
security of individuals, organizations and governments by protecting against potential
threats. One particular scheme is the AES algorithm-based bulk encryption technique,
which is based upon the Rijndael algorithm, a symmetric block cipher with 128-bit, 192-bit and 256-bit cipher keys. OpenCL also allows you to tap into the huge parallel processing power of GPUs for data parallel computing applications. This article begins by
exploring the AES algorithm, focusing on a parallel breakdown of the problem and explaining suitable indexing schemes. This is followed by GPU-specific optimization
strategies, such as using local memory, covering their relation to the memory bandwidth and computational intensity that is required. We finish the article by examining
the final benchmarks that signify the acceleration achieved using AMD GPUs.
Introduction
Information security is becoming increasingly important given the ever-increasing
number of new applications in the public and private domain. There is a continuing
trend to secure data in all of its uses, ranging from live communication to archived
data storage. Unauthorized access to intercepted transmissions can result in the
compromise of sensitive and vital information. Data managers around the world
thus face an interesting dilemma: how to store data securely while still being able to
access it quickly. Encryption is an effective solution for protecting valuable data assets
against such attacks.
Encryption
Encryption is the process of transforming information referred to as plain-text into an
unintelligible code called cipher-text, using a secret key and an algorithm generally
referred to as the cipher [1]. The cipher-text (encrypted data) can be decoded back into its
original form using the same cipher algorithm and the secret key. In this process, critical
information can be protected from hackers, competitors and others who would use the
information for malicious intent.
Common uses for encryption technology are found in the static archiving of large
amounts of sensitive data, as well as its communication over the local area network
(LAN) or across an Internet gateway in the case of Wide Area Networks (WANs) or Virtual Private Networks (VPNs). Similar applications can also be abundantly found in the
telecommunications industry and other proprietary setups dealing with data protection
issues.
Bulk Encryption
Bulk encryption provides safe and effective methods for protecting large amounts of data
from compromise and theft, both in storage and during transmission.
The amount of information that must be encrypted, however, leads to very long
response times. Currently, the processing power requirements for bulk encryption are met by hardware extensions in the form of
cryptographic accelerators [2]. There exists the potential to use the parallel processing
power of a GPU as a co-processor in a role similar to that played by existing hardware
cryptographic solutions.
of the decryption algorithm can decipher any transmission written with that particular algorithm.
2. Whether they work on blocks of symbols of a fixed size (block ciphers), or on a continuous stream of symbols (stream ciphers).
A block cipher is a symmetric key cipher operating on fixed-length groups of bits, called blocks, with an unvarying
transformation. A block cipher encryption algorithm might take (for example) a
128-bit block of plain-text as input, and output a corresponding 128-bit block of
cipher-text. The exact transformation is controlled using a second input called the
secret key. Decryption is similar; the decryption algorithm takes, in this example, a
128-bit block of cipher-text together with the secret key, and yields the original 128-bit block of plain-text.
A message longer than the block size (128 bits in the above
example) can still be encrypted with a block cipher by breaking the message into
blocks and encrypting each block individually. Since the blocks can be encrypted
independently of one another, pure block ciphers are ideal candidates for parallel implementation.
AES Algorithm
http://developer.amd.com/resources/documentation-articles/articles-whitepapers/bulk-encryption-on-gpus/
In this section, we will provide a brief overview of the AES algorithm and the working
of its major constituent computations.
The AES block cipher operates on a 4×4 array of bytes (128 bits), termed the state. For
the AES algorithm, the size of the input block, the output block and the state is 128 bits.
This is represented by Nb = 4, which reflects the number of 32-bit words (number of columns) in the state array. The permissible lengths of the Cipher Key, K, are 128, 192, and
256 bits. The key length is represented by Nk = 4, 6, or 8, which reflects, again, the number of 32-bit words (number of columns) in the Cipher Key array [6].
The state is encrypted or decrypted by applying byte-oriented transformations for a
specific number of rounds. The number of rounds to be performed is dependent on the
key size. The number of rounds is represented by Nr, where Nr = 10 when Nk = 4, Nr =
12 when Nk = 6, and Nr = 14 when Nk = 8 [6].
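The relationship between key length and round count stated above can be sketched in C. The helper names `aes_nk_from_key_bits` and `aes_rounds_from_nk` are hypothetical and do not come from the article's kernel; the sketch only assumes the values listed in the text (Nr = 10, 12, 14 for Nk = 4, 6, 8), which happen to satisfy Nr = Nk + 6.

```c
#include <assert.h>

/* Hypothetical helper: maps a key length in bits to Nk, the number of
 * 32-bit words in the cipher key. Returns 0 for an unsupported length. */
static int aes_nk_from_key_bits(int key_bits)
{
    switch (key_bits) {
    case 128: return 4;
    case 192: return 6;
    case 256: return 8;
    default:  return 0;
    }
}

/* Hypothetical helper: Nr = Nk + 6 reproduces the three cases in the text
 * (Nk = 4 -> 10 rounds, Nk = 6 -> 12 rounds, Nk = 8 -> 14 rounds). */
static int aes_rounds_from_nk(int nk)
{
    assert(nk == 4 || nk == 6 || nk == 8);
    return nk + 6;
}
```

A kernel that receives the key length as an argument could use a mapping of this kind to choose its loop bound for the middle rounds.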
The AES algorithm specifies both a cipher and its inverse for the complete
encrypt-decrypt cycle. The Forward Cipher takes plain-text as input along with the
cipher-key and its output is the encrypted data or cipher-text. The Inverse Cipher takes
this cipher-text as input and decrypts it back to plain-text using the same cipher-key
used for encryption.
The AES algorithm consists of the following phases:
1. Key Expansion. Round keys are derived from the cipher key using
Rijndael's key schedule.
2. Initial Round. AddRoundKey: each byte of the state is combined with the
round key using a bit-wise XOR operation.
3. Middle Rounds. For rounds 1 to Nr-1, repeatedly perform the following transformations:
   1. SubBytes: a non-linear substitution step where each byte is
      replaced with another according to a lookup table.
   2. ShiftRows: a transposition step where each row of the state is
      shifted cyclically a certain number of steps.
   3. MixColumns: a mixing operation which operates on the columns
      of the state, combining the four bytes in each column.
AES Transformations
For both the Forward and Inverse Cipher, the AES algorithm uses a round function that
(number of bytes) offset. For AES, the first row is left unchanged.
Each byte of the second row is shifted one byte to the left. Similarly, the third
and fourth rows are shifted by offsets of two and three bytes respectively. In
this way, each column of the output state of the ShiftRows step is composed
of bytes from each column of the input state. Specifically, the ShiftRows transformation proceeds as follows [6]:
$$s'_{r,c} = s_{r,\,(c + \mathrm{shift}(r, Nb)) \bmod Nb} \quad \text{for } 0 < r < 4 \text{ and } 0 \le c < Nb$$
where the shift value shift(r,Nb) depends on the row number, r, as follows:
$$\mathrm{shift}(1,4) = 1; \quad \mathrm{shift}(2,4) = 2; \quad \mathrm{shift}(3,4) = 3$$
and 1920 bits. During each round, a different portion of the expanded key is used in the
AddRoundKey step.
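As a rough check on the expanded key size, the sketch below assumes the FIPS-197 [6] definition that the key schedule produces Nb * (Nr + 1) words of 32 bits each, one Round Key of Nb words for the initial round and for each of the Nr rounds. The function names are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: number of 32-bit words in the expanded key,
 * Nb * (Nr + 1), with Nb = 4 and Nr = Nk + 6. */
static int aes_expanded_key_words(int nk)
{
    const int nb = 4;        /* columns in the state */
    const int nr = nk + 6;   /* 10, 12 or 14 rounds */
    assert(nk == 4 || nk == 6 || nk == 8);
    return nb * (nr + 1);
}

/* Hypothetical helper: the same size expressed in bits. */
static int aes_expanded_key_bits(int nk)
{
    return 32 * aes_expanded_key_words(nk);
}
```

Under this assumption a 256-bit key (Nk = 8) expands to 60 words, which is 1920 bits, and round r uses words 4*r to 4*r + 3 of the schedule.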
Modes of Operation
A block cipher by itself allows encryption of a single data block of size equal to the
cipher's block size. Modes of operation enable the repeated and secure use of a block
cipher, on multiple data blocks, under a single key [7]. When targeting a variable-length
message, the data must first be partitioned into separate cipher blocks. Typically, the
last block must also be extended to match the cipher's block length using a suitable
padding scheme. A mode of operation describes the process of encrypting each of these
data blocks, and generally uses randomization based on an additional input value, often
called an initialization vector.
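The article does not say which padding scheme was used. As an illustration only, the sketch below assumes PKCS#7-style padding, where the message is extended to the next multiple of 16 bytes and every padding byte holds the number of bytes added. The names `pkcs7_padded_len` and `pkcs7_pad` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define AES_BLOCK_BYTES 16

/* Padded length: always a multiple of 16 and always strictly greater than
 * len, so that the padding can be recognized and removed after decryption. */
static size_t pkcs7_padded_len(size_t len)
{
    return (len / AES_BLOCK_BYTES + 1) * AES_BLOCK_BYTES;
}

/* Copies msg into out and appends the padding bytes.
 * out must have room for pkcs7_padded_len(len) bytes.
 * Returns the padded length. */
static size_t pkcs7_pad(const unsigned char *msg, size_t len, unsigned char *out)
{
    size_t padded = pkcs7_padded_len(len);
    unsigned char pad = (unsigned char)(padded - len);   /* 1..16 */
    assert(out != NULL);
    if (len > 0)
        memcpy(out, msg, len);
    memset(out + len, pad, pad);
    return padded;
}
```

The padded length divided by 16 then gives the number of independent state blocks, which is the quantity that matters for a block-parallel implementation.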
There are different modes under which encryption can take place, where some modes
are inherently more secure and some lend themselves more to parallelism. The following table lists various modes of operation along with their inherent level of parallelism
[7]. For details on the modes of operation, see the References section [7].
Mode of Operation             Parallelism
Electronic codebook (ECB)     High
Counter (CTR)                 High
Cipher-block chaining (CBC)   Low
Cipher feedback (CFB)         Low
Output feedback (OFB)         Low
The ECB mode lends itself to the most parallel implementation. The message is
divided into blocks, each block is encrypted with an identical key, and there is no
serial dependence between the blocks. The advantage of ECB mode is the extensive parallelism, which scales well to the GPU architecture. The disadvantage of this method is
that identical plain-text blocks are encrypted into identical cipher-text blocks; thus, it
does not hide data patterns well and the large scale structures in the plain-text are preserved [7].
In the Counter (CTR) mode, the large scale structures that may have been present in the
original plain-text are diminished. Thus, the cipher-text blocks obtained by encrypting
two identical plain-text blocks using CTR mode are completely different. This provides
a better security level than the ECB mode [7]. We have implemented the ECB
mode of operation, which is not only parallel but can also be easily extended to CTR.
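The difference between the two modes can be illustrated with a small C sketch. It makes several assumptions that are not from the article: `toy_encrypt_block` is a stand-in with no cryptographic strength (it is not AES), and the counter block layout (an 8-byte nonce followed by an 8-byte big-endian block index) is just one common convention. In both modes each loop iteration is independent of the others, which is what makes them candidates for one work-item per block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_BYTES 16

/* Stand-in for the block cipher. NOT AES and NOT secure: it only provides a
 * deterministic, key-dependent mapping so the two modes can be compared. */
static void toy_encrypt_block(const uint8_t in[BLOCK_BYTES],
                              const uint8_t key[BLOCK_BYTES],
                              uint8_t out[BLOCK_BYTES])
{
    for (int i = 0; i < BLOCK_BYTES; i++)
        out[i] = (uint8_t)((unsigned)(in[i] ^ key[i]) * 7u + (unsigned)i);
}

/* ECB: every block is encrypted directly and independently. */
static void ecb_encrypt(const uint8_t *pt, size_t nblocks,
                        const uint8_t key[BLOCK_BYTES], uint8_t *ct)
{
    for (size_t b = 0; b < nblocks; b++)
        toy_encrypt_block(pt + BLOCK_BYTES * b, key, ct + BLOCK_BYTES * b);
}

/* CTR: a per-block counter is encrypted and XORed with the plain-text.
 * Blocks are still independent because the counter depends only on b. */
static void ctr_encrypt(const uint8_t *pt, size_t nblocks,
                        const uint8_t key[BLOCK_BYTES],
                        const uint8_t nonce[8], uint8_t *ct)
{
    for (size_t b = 0; b < nblocks; b++) {
        uint8_t counter[BLOCK_BYTES];
        uint8_t keystream[BLOCK_BYTES];
        memcpy(counter, nonce, 8);
        for (int i = 0; i < 8; i++)   /* big-endian block index */
            counter[8 + i] = (uint8_t)((uint64_t)b >> (8 * (7 - i)));
        toy_encrypt_block(counter, key, keystream);
        for (int i = 0; i < BLOCK_BYTES; i++)
            ct[BLOCK_BYTES * b + i] =
                (uint8_t)(pt[BLOCK_BYTES * b + i] ^ keystream[i]);
    }
}
```

With two identical plain-text blocks, `ecb_encrypt` produces two identical cipher-text blocks, while `ctr_encrypt` produces blocks that differ. The toy cipher has no diffusion, so they differ in only a few bytes here; with AES as the block cipher the keystream blocks would be unrelated.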
Design Approach
In this section we will discuss the level of parallelism we want to exploit, the portion of
code that should be ported to GPU, and the Host-Device work division.
In the current approach, we will exploit parallelism only on the block level without
changing the original algorithm. (Breaking down the algorithm further can improve the results, though that discussion is outside the scope of this article.) Each work-item will
take one state block as input and convert it to cipher-text. This implies that the Global
Work-Size is directly proportional to the exploited parallelism. Encryption of a single
128-bit state block will remain serial; however, we will use loop unrolling to optimize
the code. Another serial operation in AES is the key expansion, which provides the
round keys to be used in subsequent rounds. Keeping in view its serial nature
and the fact that it is just a one-time operation, key expansion can be safely moved to
the CPU host code for better performance. Figure 3 below explains the parallelism in
AES as well as our design approach.
In the SubBytes transformation, each state value is updated by a value from S-Box, having the same index as the value of the state. For example, if S1,1 = {53}, then the substitution value from the S-Box would be determined by the intersection of the row with index 5 and the column with index 3. This process is explained in Figure 4 below.
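The row and column lookup described above can be sketched in C. The helper names are hypothetical; the sketch assumes the S-box is stored as a flat 256-entry table in row-major order, in which case the flat index is simply the byte value itself.

```c
#include <assert.h>
#include <stdint.h>

/* Row of the 16x16 S-box table: the high nibble of the state byte. */
static unsigned sbox_row(uint8_t s) { return (unsigned)(s >> 4); }

/* Column of the 16x16 S-box table: the low nibble of the state byte. */
static unsigned sbox_col(uint8_t s) { return (unsigned)(s & 0x0F); }

/* Index into a flat, row-major 256-entry table. It equals the byte value,
 * so a kernel can index the table directly with the state byte. */
static unsigned sbox_flat_index(uint8_t s)
{
    return 16u * sbox_row(s) + sbox_col(s);
}
```

For the byte {53} this gives row 5 and column 3, for which FIPS-197 [6] lists the substituted value {ed}.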
In the ShiftRows transformation, the bytes in the last three rows of the state are cyclically shifted over different numbers of bytes. The first row, r = 0, is not shifted. Each byte
of the second row is shifted one byte to the left. Similarly, the third and fourth rows are
shifted by offsets of two and three bytes respectively. This has the effect of moving bytes
to lower positions in the row, while the lowest bytes wrap around into the top of
the row. Figure 5 below explains the ShiftRows transformation. Here S represents the
state array and S' is the n_state array.
The code for ShiftRows transformation is listed below. The shift rows transformation
uses both n_state and the state buffers as it is not an in-place transform:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    /* row x is rotated left by x positions */
    n_state[4*x + y] = state[4*x + ((y+x) & 0x03)];
}
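To check the indexing on the host, the loop above can be wrapped in a plain C function. This is a sketch under the assumption that the state is stored row by row (element 4*x + y is row x, column y), as the article's indexing suggests; the function name `shift_rows` is hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Out-of-place ShiftRows: row x of the state is rotated left by x bytes.
 * state and n_state must not overlap. */
static void shift_rows(const uint8_t state[16], uint8_t n_state[16])
{
    assert(state != n_state);
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;   /* row index */
        int y = i >> 2;     /* column index */
        n_state[4*x + y] = state[4*x + ((y + x) & 0x03)];
    }
}
```

If each input byte is set to 16*row + column, row 1 of the output reads 0x11, 0x12, 0x13, 0x10, which is row 1 of the input rotated left by one position.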
In the AddRoundKey transformation, the Round Key is combined with the state using a
bit-wise XOR operation. Each Round Key consists of Nb
words from the expanded key obtained from the Key-Expansion function, described
earlier. Figure 6 depicts the AddRoundKey Transformation.
The sample code for AddRoundKey transformation is listed next:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;   /* row index: selects the byte within the key word */
    y = i >> 2;     /* column index: selects the key word */
    state[4*x + y] = state[4*x + y] ^
        ((keysched[y] & (0xffu << (x*8))) >> (x*8));
}
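A host-side C version of the same step is sketched below. It assumes, as the listing above implies, that `keysched` points at the four 32-bit words of the current Round Key and that byte x of word y (counting from the least significant byte) belongs to row x, column y of the state. The function name `add_round_key` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* XORs one Round Key (four 32-bit words) into the state in place. */
static void add_round_key(uint8_t state[16], const uint32_t *keysched)
{
    assert(keysched != NULL);
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;   /* row: byte position inside the key word */
        int y = i >> 2;     /* column: which key word */
        state[4*x + y] ^= (uint8_t)((keysched[y] >> (8 * x)) & 0xFFu);
    }
}
```

Because the operation is a plain XOR, applying it twice with the same Round Key restores the original state, which is why the same routine can serve both encryption and decryption.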
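and is multiplied modulo $x^4 + 1$ by the coefficient polynomial a(x) [6] shown here:
$$a(x) = \{03\}x^3 + \{01\}x^2 + \{01\}x + \{02\}$$
The MixColumns transformation updates each column of the state using a matrix multiplication, as explained by the following equation [6]:
$$\begin{bmatrix} s'_{0,c} \\ s'_{1,c} \\ s'_{2,c} \\ s'_{3,c} \end{bmatrix} = \begin{bmatrix} 02 & 03 & 01 & 01 \\ 01 & 02 & 03 & 01 \\ 01 & 01 & 02 & 03 \\ 03 & 01 & 01 & 02 \end{bmatrix} \begin{bmatrix} s_{0,c} \\ s_{1,c} \\ s_{2,c} \\ s_{3,c} \end{bmatrix} \quad \text{for } 0 \le c < Nb$$
Figure 7 below explains the MixColumns transformation.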
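The column update defined in FIPS-197 [6] can be sketched in C as follows. This is not the article's kernel code: the function names are hypothetical, the state is assumed to be stored row by row (element 4*r + c), and multiplication by {02} in GF(2^8) is done with the usual shift-and-reduce step using the AES reduction constant 0x1B.

```c
#include <assert.h>
#include <stdint.h>

/* Multiplies a by {02} in GF(2^8): shift left, then reduce with 0x1B
 * if the bit shifted out was set. */
static uint8_t xtime(uint8_t a)
{
    return (uint8_t)((a << 1) ^ ((a & 0x80) ? 0x1B : 0x00));
}

/* In-place MixColumns. Each column is multiplied by the fixed matrix
 *   02 03 01 01
 *   01 02 03 01
 *   01 01 02 03
 *   03 01 01 02
 * where {03}*s is computed as xtime(s) ^ s. */
static void mix_columns(uint8_t state[16])
{
    assert(state != 0);
    for (int c = 0; c < 4; c++) {
        uint8_t s0 = state[c];
        uint8_t s1 = state[4 + c];
        uint8_t s2 = state[8 + c];
        uint8_t s3 = state[12 + c];
        state[c]      = (uint8_t)(xtime(s0) ^ (xtime(s1) ^ s1) ^ s2 ^ s3);
        state[4 + c]  = (uint8_t)(s0 ^ xtime(s1) ^ (xtime(s2) ^ s2) ^ s3);
        state[8 + c]  = (uint8_t)(s0 ^ s1 ^ xtime(s2) ^ (xtime(s3) ^ s3));
        state[12 + c] = (uint8_t)((xtime(s0) ^ s0) ^ s1 ^ s2 ^ xtime(s3));
    }
}
```

As a spot check, the column (db, 13, 53, 45) maps to (8e, 4d, a1, bc), and a column whose four bytes are all 01 is left unchanged.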
AES Ciphers.
Forward Cipher
We explain the working of our kernel by considering the simplest case where we have a
single work-item operating on a 128-bit state block. The kernel arguments are the
input and output buffers, the AES fixed table buffers and the expanded key buffer. A
key-length parameter, also passed to the kernel as an argument, adds the flexibility to use
all three allowed key sizes (128, 192 and 256 bits). The number of rounds to
be performed is calculated from the key length. The 128-bit state is copied from the global
plain-text buffer into registers for computing. Two blocks of state size are created in
the register file, as not all of the AES transformations can be performed in-place. The input
is copied to the state block in registers using a special access pattern to allow coalescing
(more on this later). The forward AES transformations are applied to the state block as
described by the AES flow graph. The resulting cipher-text block is copied back to the
global cipher-text buffer using the same indexing scheme that was followed while copying plain-text to the state.
Inverse Cipher
In this section we will discuss the major changes required to convert the Forward Cipher into the Inverse Cipher for the decryption process. The Inverse Cipher essentially runs
the Forward Cipher in reverse order. The AES transformations
used in the Inverse Cipher are the inverse versions of the previously discussed forward transforms.
The Inverse Cipher incorporates minor changes in the transformations, the order of execution and the required AES tables. For example, the InvSubBytes transform, which is
the inverse of the SubBytes transform, requires the Inverse S-box table instead of the S-box
table. The code for InvSubBytes is shown here:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;
    y = i >> 2;
    state[4*x + y] = gpu_AES_isbox[state[4*x + y]];
}
AddRoundKey remains the same, because XOR is its own inverse, while ShiftRows and
MixColumns are replaced by their inverses, InvShiftRows and InvMixColumns [6]. The order
in which these transforms are applied also differs from the Forward
Cipher. Figure 8 displays the flow-graph for the Inverse Cipher.
Indexing Schemes
Now we will examine the input and output indexing schemes for a simple AES kernel
with a single work-item in detail. We will then explain what needs to be added to run
the kernel with multiple work-groups and larger work-group sizes. The described in
where Nb = 4 for our case. This is the column major access pattern as depicted in Figure
9 below. The code for column major access pattern for input is listed here:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;   /* row index */
    y = i >> 2;     /* column index */
    state[4*x + y] = gpu_input[i];
}
for(i = 0; i < 16; i++)
{
    x = i & 0x03;
    y = i >> 2;
    gpu_output[i] = state[4*x + y];
}
Generalizing this indexing scheme to accommodate more threads requires some mechanism for identifying which thread is being executed. A new variable named idx is introduced that holds the Global Id of each thread, queried from the OpenCL runtime. Now, the
input array will be copied to the state array as follows:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;
    y = i >> 2;
    state[4*x + y] = gpu_input[i + 16*idx];
}
The net offset for each thread is 16*idx, as each thread handles 16 elements (Bytes) of the
input array. The same holds for writing the data to the output array after completion of
Encryption or Decryption Process:
for(i = 0; i < 16; i++)
{
    x = i & 0x03;
    y = i >> 2;
    gpu_output[i + 16*idx] = state[4*x + y];
}
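The per-thread offset can be exercised on the host with a small C sketch in which `idx` is an ordinary parameter standing in for the value a work-item would obtain from `get_global_id(0)`. The function names `load_state` and `store_state` are hypothetical and not taken from the article's kernel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Copies the 16-byte block owned by work-item idx into the state,
 * using the same x/y indexing as the listings above. */
static void load_state(const uint8_t *gpu_input, size_t idx, uint8_t state[16])
{
    assert(gpu_input != NULL);
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;
        int y = i >> 2;
        state[4*x + y] = gpu_input[(size_t)i + 16 * idx];
    }
}

/* Writes the state back to the block owned by work-item idx. */
static void store_state(const uint8_t state[16], size_t idx, uint8_t *gpu_output)
{
    assert(gpu_output != NULL);
    for (int i = 0; i < 16; i++) {
        int x = i & 0x03;
        int y = i >> 2;
        gpu_output[(size_t)i + 16 * idx] = state[4*x + y];
    }
}
```

Calling `load_state` followed by `store_state` with the same `idx` copies exactly bytes 16*idx to 16*idx + 15 and leaves the rest of the output buffer untouched, so work-items with different Global Ids never write to the same bytes.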
Memory Optimizations
Here we will discuss the drawbacks in the basic implementation described above and
suggest improvements to overcome these.
In the basic implementation we have used only the Global memory available on the
GPU. Remember, Global memory has the lowest bandwidth of the
memory spaces available on the GPU, and accesses to it suffer from long latency. Another drawback is the heavy resource usage per thread, as
all the calculations take place in the register file, which limits the number of parallel
threads and degrades performance. The possible memory optimizations include the
use of local and constant memory. The significant performance gains obtained
with these optimizations are included in the results; however, their discussion is outside the
scope of this article.
Performance Results
Performance tests were carried out on two different machines, both running a 64-bit version of the Windows 7 operating system and AMD APP SDK v2.3 with OpenCL 1.1 support. The kernel execution times have been measured using the AMD APP Profiler v2.1. The hardware details for both systems are described below.
Due to the inherent parallelism in the AES algorithm, it shows better performance gains
for large data sizes, which are suited for bulk encryption. In the benchmarks, we validated this through performance results taken on various input sizes. The results also
show the impact of various optimization techniques applied to the standard implementation to further increase the performance, especially by reducing global memory calls
and moving more data to constant and local caches.
Figure 11 shows a performance comparison of various AES kernels. The graph has been
plotted with input size (megabytes) on the horizontal axis and the kernel execution
time (milliseconds) for 256-bit AES on the vertical axis.
Conclusion
The results illustrated in this article demonstrate the viability of implementing the AES algorithm on AMD GPUs, which show considerable speedups compared to the current
generation Intel processor or commodity graphics cards. We have obtained a speedup
of up to 16 times with the ATI Radeon HD 5870 GPU, while the ATI Mobility
Radeon HD 5650 GPU shows a speedup of up to 3 times.
References
[1] http://en.wikipedia.org/wiki/Encryption viewed 20 March, 2011.
[2] AES Encryption Implementation and Analysis on Commodity Graphics Processing Units, Owen Harrison and John Waldron, 2007.
[3] http://en.wikipedia.org/wiki/Cipher viewed 20 March, 2011.
[4] http://en.wikipedia.org/wiki/Block_cipher viewed 20 March, 2011.
[5] http://en.wikipedia.org/wiki/Advanced_Encryption_Standard viewed 20 March, 2011.
[6] Announcing the ADVANCED ENCRYPTION STANDARD (AES), Federal Information Processing Standards Publication 197, November 26, 2001.
[7] http://en.wikipedia.org/wiki/Modes_of_operation viewed 20 March, 2011.
[8] ATI Stream Computing OpenCL Programming Guide, Ch-4 OpenCL performance and optimization, June 2010.
GLOSSARY
Forward Cipher: Series of transformations that converts plain-text to cipher-text using
the Cipher Key.
Cipher Key: Secret, cryptographic key that is used by the Key Expansion routine to generate a set of Round Keys.