
22 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 1, NO. 1, MARCH 1993

SIGMA: A VLSI Systolic Array Implementation of a Galois Field GF($2^m$) Based Multiplication and Division Algorithm
Mario Kovac, Member, IEEE, N. Ranganathan, Senior Member, IEEE, and Murali Varanasi, Senior Member, IEEE

Abstract-Finite or Galois fields are used in numerous applications like error correcting codes, digital signal processing and cryptography. The design of efficient methods for Galois field arithmetic such as multiplication and division is critical for these applications. In this paper, we present a new algorithm based on a pattern matching technique for computing multiplication and division in GF($2^m$). An efficient systolic architecture is described for implementing the algorithm which can produce a new result every clock cycle and the multiplication and division operations can be interleaved. The architecture has been implemented using 2-μm CMOS technology. The chip yields a computational rate of 33.3 million multiplications/divisions per second.

Manuscript received June 8, 1992; revised October 14, 1992. M. Kovac was supported by Department of Science, Republic of Croatia. N. Ranganathan was supported in part by the National Science Foundation under Grant MIP-9010358 and by the Florida High Technology and Industry Council. M. Varanasi was supported in part by the AT&T Foundation.
M. Kovac is with the College of Electrical Engineering, University of Zagreb, Unska 3, 41000 Zagreb, Croatia.
N. Ranganathan is with the Center for Microelectronics Research, Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620.
M. Varanasi is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620.
IEEE Log Number 9206232.
1063-8210/93$03.00 © 1993 IEEE

I. INTRODUCTION

IN RECENT YEARS, finite fields or Galois fields have been extensively applied in i) error correcting codes such as BCH codes and RS codes [1], ii) digital signal processing [2], iii) pseudorandom number generation [13], and iv) encryption and decryption protocols in cryptography [3] and space object tracking applications [16]. In many of the above applications, the finite field GF($2^m$), a number system made of $2^m$ elements, is used. The practical use of GF($2^m$) in various applications requires arithmetic operations like addition, multiplication and division. Thus efficient algorithms are required to perform these arithmetic operations on-the-fly. Addition in GF($2^m$) is relatively simple and straightforward. However, multiplication and division are more complex operations. Division is performed by multiplication of the inverse of the denominator element and inversion itself is achieved through repeated multiplications.

In this paper, we propose a new algorithm for multiplication and division in GF($2^m$). The algorithm is designed for elements of GF($2^m$) represented by the conventional basis $\{1, \alpha, \alpha^2, \alpha^3, \ldots, \alpha^{m-1}\}$. The algorithm is based on a pattern matching and recognition approach and is amenable for hardware implementation. A VLSI architecture is described for implementing the proposed algorithm. The architecture is systolic and uses the principles of pipelining and parallelism to obtain high speed and throughput. The overall architecture is a multistage linear pipeline and hence, can yield a new result every clock cycle.

A prototype CMOS VLSI chip, SIGMA, implementing the architecture for Galois field GF($2^4$) has been designed, fabricated and tested. The prototype chip is operational at 33.3 MHz yielding a throughput of 33.3 million multiplications/divisions per second. An important feature of our design is that the multiplication and division operations can be interleaved. The hardware is programmable for different primitive irreducible polynomials.

The outline of the paper is as follows. An overview of the various hardware approaches for Galois field GF($2^m$) is given in Section II. Section III develops the theoretical basis for our proposed algorithm for multiplication and division. The algorithm is described in Section IV. The SIGMA architecture is presented in Section V. Section VI describes the VLSI chip implementation and its performance. Conclusions are given in Section VII.

II. RELATED WORK

In recent years, many researchers have proposed hardware algorithms and architectures for performing arithmetic operations in Galois fields that can be implemented in VLSI [4]-[13]. Most approaches implement separate hardware for multiplication, inversion and division. In our approach, we propose a systolic hardware architecture that can perform both multiplication and division. We present here a brief overview of previous work on VLSI hardware methods for Galois field arithmetic.

A cellular array multiplier for GF($2^m$) was proposed by Laws and Rushforth [4]. In their design, the elements of the Galois field are represented using a conventional basis. The architecture consists of a two-dimensional array of $m^2$ identical cells which is a straightforward spatial iteration of the standard sequential shift register multiplier. The computation requires about $2m$ gate delays. The hardware used is programmable to implement fields based on different irreducible polynomials. Yeh, Reed, and Truong [5] proposed two systolic architectures for multiplication in Galois fields. Their design is also based on a Galois field represented by a conventional basis. The

KOVAC et al.: SIGMA VLSI SYSTOLIC ARRAY IMPLEMENTATION 23

first architecture is a one-dimensional systolic array which is a serial-in-serial-out multiplier. The hardware has a through delay of $2m$ cycles and requires a minimum average time of $m$ cycles per computation. The second architecture is a two-dimensional systolic array with an average computation time of one cycle.

Wang et al. [6] proposed a set of architectures for implementing the Massey-Omura multiplication algorithm. The Massey and Omura algorithm uses a normal basis of the form $\{\alpha, \alpha^2, \alpha^4, \ldots, \alpha^{2^{m-1}}\}$. Two different architectures are proposed in [6] for multiplication and inversion. Inversion is done by performing multiplication repeatedly $m$ times. Although the multiplication hardware is efficient for small values of $m$, like $m = 4$, the authors point out that the method is impractical for realization for large $m$. Later, Wang and Pei [13] proposed a VLSI design for computing exponentiation in GF($2^m$) based on the multiplier described in [6].

Scott, Tavares, and Peppard [7] proposed a new algorithm and hardware for multiplication of elements in Galois field represented by the conventional basis. The hardware is based on a bitslice architecture and the data is input and output in a serial manner. The circuit complexity is $O(m)$ for both space and time. Okano and Imai [8] present methods of solving algebraic equations over Galois field GF($2^m$). In this paper, they discuss how finite field arithmetic operations like multiplication and division can be performed using modulo addition and subtraction. The elements of a Galois field GF($2^m$) can be expressed as a power of $\alpha$ [1] which is called the "exponent expression". The exponents are obtained using table look-ups and multiplication/division is achieved by performing modulo addition/subtraction on these exponent values.

A bit-serial linear systolic array is proposed for multiplication over GF($2^m$) by Zhou [9]. The hardware requires $(3m - 1)$ units of time per multiplication. Since the method is recursive, it cannot be pipelined for vector processing. The multiplication algorithm proposed by Pincin [10] is also based on the Massey-Omura algorithm. The algorithm is parallelizable, but implementation issues have not been considered. Furer and Mehlhorn [11] present some theoretical results analyzing the area-time complexities for VLSI implementation of Galois field multiplication in general form GF($p^n$). A VLSI architecture for inversion in GF($2^m$) is proposed by Feng in [12]. The algorithm proposed by Feng requires $O(m \log_2 m)$ time for inversion. The inversion is achieved using a systolic serial-in parallel-out multiplier. Each multiplication itself requires $m$ clock pulses and the number of multiplications required per inversion is $O(\log_2 m)$.

Thus there exist different approaches for performing multiplication and division in hardware. Each approach has its merits and demerits. The purpose of this paper is to propose a single VLSI chip architecture that can perform both multiplication as well as division with an average computation time of one clock cycle. This is achieved by using a fully pipelined systolic array architecture. The multiplication and division operations can be interleaved. The proposed architecture and chip are targeted for applications that use Galois fields GF($2^m$), $m \le 8$.

III. MULTIPLICATION AND DIVISION IN GF($2^m$)

Finite fields with $2^m$ symbols are called Galois fields, GF($2^m$). Elements of the field can be represented as $m$-tuples over GF(2). It is conventional to represent each nonzero element as a power of a primitive element $\alpha$, where $\alpha$ is a root of $F(x)$, a primitive irreducible polynomial of degree $m$ over GF(2). The nonzero elements of GF($2^m$) can be represented as $1, \alpha, \alpha^2, \ldots, \alpha^{2^m - 2}$. Each of these elements can be expressed as a sum of the elements $\{1, \alpha, \alpha^2, \ldots, \alpha^{m-1}\}$ which is commonly known as the conventional basis. A polynomial $F(x)$ of degree $m$ is said to be irreducible over the field GF(2) if $F(x)$ is not divisible by any polynomial of degree less than $m$ and greater than zero. An irreducible polynomial of degree $m$ over GF(2) is called primitive if it has a primitive element of GF($2^m$) as a root. Every Galois field has at least one such primitive element $\alpha$, the successive powers of which generate all the nonzero elements of the Galois field. For a complete overview of finite fields, the reader is referred to [1], [17]-[19]. In this section, we develop only the necessary background required for understanding the Galois field multiplication and division algorithm proposed in the next section.

Notation: The addition of two elements from GF($2^m$) as well as the addition of two integers will be expressed with the symbol "+" while bit-wise modulo 2 addition will be shown as "$\oplus$".

Definition 1: Let

$$\beta = \beta_{m-1}\alpha^{m-1} + \beta_{m-2}\alpha^{m-2} + \cdots + \beta_1\alpha + \beta_0$$

be a nonzero element of GF($2^m$) where $\alpha$ is a root of a primitive irreducible polynomial $F(x)$, where

$$F(x) = x^m + f_{m-1}x^{m-1} + f_{m-2}x^{m-2} + \cdots + f_1 x + 1$$

with $\beta_i, f_i \in$ GF(2).

For convenience, we define a function $\rho$ as follows

$$\rho(\beta) = \alpha\beta \pmod{F(\alpha)}. \tag{1}$$

Since $\beta$ is a power of $\alpha$, $\rho(\beta)$ is the element corresponding to the next power of $\alpha$. It is easy to verify that

$$\rho^i(\beta) = \rho(\rho^{i-1}(\beta)) = \rho(\rho(\rho^{i-2}(\beta))) = \cdots = \rho(\rho(\cdots\rho(\beta)\cdots)) = \alpha^i\beta. \tag{2}$$

Example 1: Let $\beta$ and $\delta = \rho(\beta)$ be two elements of GF($2^4$) and $F(x) = x^4 + x + 1$. Then $\beta$ and $\delta$ can be expressed as 4-tuples over GF(2). Further, the coefficients $\delta_3, \delta_2, \delta_1, \delta_0$ are related to $\beta_3, \beta_2, \beta_1, \beta_0$ as follows. Let

$$\beta = \beta_3\alpha^3 + \beta_2\alpha^2 + \beta_1\alpha + \beta_0$$
$$\delta = \delta_3\alpha^3 + \delta_2\alpha^2 + \delta_1\alpha + \delta_0.$$

Then,

$$\delta = \rho(\beta) = \alpha\beta \pmod{\alpha^4 + \alpha + 1}$$
$$= \beta_3\alpha^4 + \beta_2\alpha^3 + \beta_1\alpha^2 + \beta_0\alpha \pmod{\alpha^4 + \alpha + 1}$$
$$= \beta_3(\alpha + 1) + \beta_2\alpha^3 + \beta_1\alpha^2 + \beta_0\alpha$$
$$= \beta_2\alpha^3 + \beta_1\alpha^2 + (\beta_0 \oplus \beta_3)\alpha + \beta_3.$$


Therefore,

$$\delta_3 = \beta_2, \quad \delta_2 = \beta_1, \quad \delta_1 = \beta_0 \oplus \beta_3, \quad \text{and} \quad \delta_0 = \beta_3.$$

Thus the elements of $\delta$ can be obtained by shifting the elements of $\beta$ circularly to the left and introducing $\beta_3$ corresponding to the two rightmost positions. In further discussion, we will refer to this function as circle rotation. The coefficients of $\delta$ can be computed from elements of $\beta$ by implementing a simple circuit shown in Fig. 1.

Fig. 1. Implementation of circle rotation function.

Theorem 1: Given two elements $\beta, \gamma \in$ GF($2^m$) with $\beta = \alpha^i$ and $\gamma = \alpha^j$, $0 \le i, j \le 2^m - 2$, $\beta$ can be expressed as

$$\beta = \rho^k(\gamma), \quad i = |k + j| \bmod (2^m - 1), \quad 0 \le i, j \le 2^m - 2. \tag{3}$$

This expression implies that every nonzero element of GF($2^m$) can be obtained from another nonzero element of GF($2^m$) by a specific number of circle rotations.

Proof: For $i, j, k$, $0 \le i, j, k \le 2^m - 2$, there exists a $k$ such that $|j + k| \bmod (2^m - 1) = i$. Consequently,

$$\alpha^i = \alpha^j\alpha^k \Rightarrow \alpha^i = \rho^k(\alpha^j) \Rightarrow \beta = \rho^k(\gamma).$$

Example 2: Let $m = 4$, $F(x) = x^4 + x + 1$, $\beta = \alpha^4$ and $\gamma = \alpha^{13}$. Then,

$$\beta = \alpha^4 = \rho^4(\alpha^0) = \rho^4(\alpha^2\alpha^{13}) = \rho^6(\alpha^{13}) = \rho^6(\gamma).$$

The number of circle rotations needed can always be computed by solving for $k$ in (3). Since $4 = |k + 13| \bmod 15$, $k = 6$.

Theorem 2: The element $\beta$ can also be obtained as

$$\beta = \rho^k(\alpha^{jn}), \quad n \in \{0, 1, 2, \ldots, \lfloor (2^m - 2)/j \rfloor\} \text{ and } k < j \tag{4}$$

where $\lfloor x \rfloor$ denotes the largest integer $\le x$. In this case, there exist several elements that satisfy $(\alpha^{jn})$ in (4). Consecutive elements $\alpha^{jn}$ that are used in (4) differ by a power of $\alpha$ equal to $j$. Thus the number of rotations needed to obtain $\beta$ in the worst case is equal to $(j - 1)$. All the elements $\alpha^{jn}$ will be of great importance in our discussion and, for reasons that will be explained in context, will be called patterns.

Proof: It is clear that by choosing $\gamma = \alpha^j$ all the nonzero elements of GF($2^m$) can be divided into $\lfloor (2^m - 2)/j \rfloor + 1$ subsets, each starting with element $\alpha^{jn}$, $n \in \{0, 1, \ldots, \lfloor (2^m - 2)/j \rfloor\}$, and containing $j$ elements (except for the last subset that may contain less than $j$). All the elements from any subset contain powers of $\alpha$ equal to $\{jn, jn + 1, jn + 2, \ldots, jn + j - 1\}$. Therefore, it follows that

$$\beta = \alpha^i, \quad i \in \begin{cases} \{jn, jn + 1, \ldots, jn + j - 1\} & \text{in general} \\ \{jn, jn + 1, \ldots, 2^m - 2\} & \text{for the last subset} \end{cases}$$

and $k$ must satisfy the relation $k = i - jn$, $k < j$, or equivalently $i = k + jn$. Thus

$$\alpha^i = \alpha^k\alpha^{jn}$$
$$\beta = \rho^k(\alpha^{jn}), \quad k < j.$$

Example 3: Let $m = 4$, $F(x) = x^4 + x + 1$, $\beta = \alpha^3$, and $\gamma = \alpha^4$. By choosing $n$ appropriately, four subsets of GF($2^m$) are formed. The first subset contains elements $\{\alpha^0, \alpha^1, \ldots, \alpha^3\}$, the second $\{\alpha^4, \alpha^5, \ldots, \alpha^7\}$, the third $\{\alpha^8, \alpha^9, \ldots, \alpha^{11}\}$ and the last subset $\{\alpha^{12}, \alpha^{13}, \alpha^{14}\}$. Initially, $\beta$ is compared to all the patterns $\{\alpha^0, \alpha^4, \alpha^8, \alpha^{12}\}$. If it does not match with any of the patterns, then all the patterns are rotated. From (1), circle rotation of patterns increments their power by one, so after the first rotation $\beta$ will be compared to the set of patterns $\{\alpha^1, \alpha^5, \alpha^9, \alpha^{13}\}$. After three rotations, $\beta$ will be compared to $\{\alpha^3, \alpha^7, \alpha^{11}, \alpha^{15}\}$. Now, $\beta$ will equal the rotated value of the first pattern and the procedure is complete. It is also important to notice that after a certain number of rotations, the last pattern will equal $\alpha^0$ which is the value of the first pattern in the initial step, and hence the comparisons to the last pattern will not be necessary.

Theorem 3: For two elements of GF($2^m$), $\beta_1$ and $\beta_2$ where $\beta_1 = \rho^{k_1}(\alpha^{jn_1})$ and $\beta_2 = \rho^{k_2}(\alpha^{jn_2})$, the product of the elements $\beta_1, \beta_2$ can be expressed as $\beta_1\beta_2 = \rho^{k_3}(\alpha^{jn_3})$ where

$$k_3 = \bigl|\,|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\bigr| \bmod j$$
$$n_3 = \left\lfloor \bigl(|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\bigr)/j \right\rfloor.$$

Proof: By (2), $\beta_1 = \alpha^{k_1 + jn_1}$ and $\beta_2 = \alpha^{k_2 + jn_2}$. Therefore,

$$\beta_1\beta_2 = \alpha^{k_1 + jn_1}\alpha^{k_2 + jn_2} = \alpha^{|k_1 + jn_1 + k_2 + jn_2| \bmod (2^m - 1)}.$$

The exponent of $\alpha$ corresponding to the result can be computed as

$$|k_1 + jn_1 + k_2 + jn_2| \bmod (2^m - 1)$$
$$= \bigl|\,|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\bigr| \bmod j + j\left\lfloor \bigl(|(jn_1 + k_1) + (jn_2 + k_2)| \bmod (2^m - 1)\bigr)/j \right\rfloor$$
$$= k_3 + jn_3, \quad 0 \le k_3 < j, \quad 0 \le n_3 \le \lfloor (2^m - 2)/j \rfloor.$$

Therefore,

$$\beta_1\beta_2 = \alpha^{k_3 + jn_3} = \rho^{k_3}(\alpha^{jn_3}).$$

The significance of this result will be illustrated by the next example.

Example 4: Let $m = 4$, and let all the integers $k_1, n_1, k_2, n_2, k_3, n_3$ be represented in binary form. Let, also, $j$ be a power of 2, for example, $j = 2^2 = 4$. In this case every two nonzero elements $\beta_1, \beta_2 \in$ GF($2^4$) can be represented as

$$\beta_1 = \rho^{k_1}(\alpha^{4n_1})$$
$$\beta_2 = \rho^{k_2}(\alpha^{4n_2}), \quad 0 \le k_1, k_2 < 4; \quad 0 \le n_1, n_2 \le 3.$$
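The representation of a nonzero element as a number of rotations $k$ applied to a pattern $\alpha^{4n}$ can be found by the search described in Example 3. The sketch below is a sequential software illustration under the assumptions $m = 4$, $j = 4$ and $F(x) = x^4 + x + 1$; the function names `make_patterns` and `recognize` are hypothetical and are not taken from the paper.

```python
def circle_rotate(beta: int) -> int:
    """rho(beta) for GF(2^4) with F(x) = x^4 + x + 1 (assumed polynomial)."""
    return ((beta << 1) & 0xF) ^ (0b0011 if beta & 0x8 else 0)

def make_patterns(j: int = 4) -> list:
    """Patterns alpha^(j*n) for n = 0 .. floor(14 / j), as 4-bit integers."""
    patterns, elem = [], 1
    for power in range(15):
        if power % j == 0:
            patterns.append(elem)
        elem = circle_rotate(elem)
    return patterns

def recognize(beta: int, patterns: list, j: int = 4) -> tuple:
    """Return (k, n) with beta = rho^k(alpha^(j*n)) and k < j."""
    rotated = list(patterns)
    for k in range(j):                      # at most j - 1 rotations
        for n, pat in enumerate(rotated):   # compare against every pattern
            if pat == beta:
                return k, n
        rotated = [circle_rotate(p) for p in rotated]
    raise ValueError("zero is not a power of alpha")
```

Because the outer loop runs over $k$, the element $\alpha^0$ is matched by the first pattern before the last pattern can rotate onto it, which mirrors the remark about the last subset in Example 3.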



If $k_1, n_1, k_2, n_2$ are represented as binary numbers, then the product $\beta_1\beta_2$ can be computed very easily. Because $j = 2^x$, the least significant $x$ bits in the $m$-bit representation of a product $jn$ will be zero, while the remaining $m - x$ bits will equal $n$. In this case, the patterns will be $\alpha^{0000}, \alpha^{0100}, \alpha^{1000}, \alpha^{1100}$. Further, the addition $k + jn$ can be done just by replacing the $x$ least significant zeros with $k$. After modulo $(2^m - 1)$ addition of the two binary numbers (operand powers), the reverse procedure to obtain $n_3$ and $k_3$ is performed by copying the higher and the lower portion of the result to $n_3$ and $k_3$, respectively.
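The bit-field view of Example 4 can be illustrated with a short sketch. It assumes $m = 4$ and $j = 2^2$; the helper names `pack`, `unpack`, and `multiply_powers` are illustrative and are not part of the paper.

```python
M, X = 4, 2   # m = 4 and j = 2**X = 4 (assumed example values)

def pack(k: int, n: int) -> int:
    """Form the operand power j*n + k by placing k in the X low bits."""
    return (n << X) | k

def unpack(power: int) -> tuple:
    """Split a power into (k, n): low X bits and the remaining high bits."""
    return power & ((1 << X) - 1), power >> X

def multiply_powers(k1: int, n1: int, k2: int, n2: int) -> tuple:
    """Return (k3, n3) for the product, using modulo (2^m - 1) addition."""
    total = (pack(k1, n1) + pack(k2, n2)) % (2**M - 1)
    return unpack(total)
```

For instance, $\alpha^6$ corresponds to $(k, n) = (2, 1)$ and $\alpha^{13}$ to $(1, 3)$; `multiply_powers(2, 1, 1, 3)` gives $(0, 1)$, that is $\alpha^4$, in agreement with $6 + 13 = 19 \equiv 4 \pmod{15}$.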
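The binary representation of $k$ and $n$ leads to a simple method for computing the quotient $\beta_1/\beta_2$:

Theorem 4: For elements $\beta_1, \beta_2$ in GF($2^m$), $\beta_1/\beta_2 = \rho^{k_3}(\alpha^{jn_3})$, where

$$k_3 = \bigl|\,|(jn_1 + k_1) + \overline{(jn_2 + k_2)}| \bmod (2^m - 1)\bigr| \bmod j$$
$$n_3 = \left\lfloor \bigl(|(jn_1 + k_1) + \overline{(jn_2 + k_2)}| \bmod (2^m - 1)\bigr)/j \right\rfloor$$

where $\bar{x}$ denotes one's complement of $x$.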
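Division can be sketched in the same way as multiplication, with the divisor's power replaced by its one's complement before the modulo addition. The code below is an illustration under the assumptions $m = 4$ and $j = 4$; `divide_powers` and `MASK` are hypothetical names, not identifiers from the paper.

```python
M, X = 4, 2
MASK = (1 << M) - 1   # 2^m - 1 = 15, also the all-ones 4-bit word

def divide_powers(k1: int, n1: int, k2: int, n2: int) -> tuple:
    """Return (k3, n3) for beta1 / beta2 via one's complement addition."""
    p1 = (n1 << X) | k1          # power of the dividend, j*n1 + k1
    p2 = (n2 << X) | k2          # power of the divisor,  j*n2 + k2
    p2_inv = ~p2 & MASK          # one's complement equals (2^m - 1) - p2
    total = (p1 + p2_inv) % MASK
    return total & ((1 << X) - 1), total >> X
```

As a check, $\alpha^4/\alpha^{13}$ corresponds to `divide_powers(0, 1, 1, 3)`, which yields $(2, 1)$, that is $\alpha^6$, consistent with $\alpha^6\alpha^{13} = \alpha^{19} = \alpha^4$.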


Proof: From the properties of GF($2^m$), $\alpha^{(2^m - 1)} = \alpha^0$. Therefore,

$$\alpha^x = \alpha^x\alpha^0 = \alpha^x\alpha^{(2^m - 1)} = \alpha^{x + 2^m - 1}$$
$$\alpha^x/\alpha^y = \alpha^{(x - y)} = \alpha^{(x - y + 2^m - 1)} = \alpha^{x + (-y + 2^m - 1)}.$$

If $(-y + 2^m - 1)$ is represented in binary form,

$$-y + 2^m - 1 = -(y_{m-1}2^{m-1} + y_{m-2}2^{m-2} + \cdots + y_1 2 + y_0) + 2^m - 1.$$

Since $2^m$ can be represented as

$$2^{m-1} + 2^{m-2} + \cdots + 2 + 1 + 1,$$

$$-y + 2^m - 1 = -(y_{m-1}2^{m-1} + y_{m-2}2^{m-2} + \cdots + y_1 2 + y_0) + (2^{m-1} + 2^{m-2} + \cdots + 2 + 1 + 1) - 1$$
$$= 2^{m-1}(1 - y_{m-1}) + 2^{m-2}(1 - y_{m-2}) + \cdots + 2(1 - y_1) + (1 - y_0)$$
$$= \bar{y}.$$

IV. THE GMA ALGORITHM

In this section, we present a new algorithm, called the GMA algorithm, for the computation of Galois Field based multiplication and division. The algorithm exploits the various properties described in Section III for efficient computation and leads to a simple hardware implementation. The arithmetic operations, i.e., the multiplication of two elements $\beta_1$ and $\beta_2$ from a Galois field GF($2^m$) or their division $\beta_1/\beta_2$, can be achieved using modulo $(2^m - 1)$ addition in the following manner: (i) for $\beta_1$ and $\beta_2$, find the corresponding values for the $k_1, n_1$ and $k_2, n_2$ pairs, (ii) using the $k_1, n_1$ and $k_2, n_2$ pairs compute $k_3, n_3$, which corresponds to $\beta_3$, the resultant product or quotient, and finally, (iii) transform $k_3$ and $n_3$ into the actual result $\beta_3$. The flowchart of the proposed algorithm is given in Fig. 2. The algorithm is described in the rest of this section.

Fig. 2. Flowchart of the GMA algorithm.

The elements of GF($2^m$) are divided into subsets for a particular $\gamma$ value as in Theorem 2. Each subset $n$ corresponds to a pattern $\pi_n$. For every pattern $\pi_n = \alpha^{jn}$, the product $jn$ will be denoted as the pattern power. The given input $\beta_i$ is compared with each pattern $\pi_n$, where $n = 0$ to $\lfloor (2^m - 2)/j \rfloor$. As the input $\beta_i$ is compared with the first pattern $\pi_{n=0}$, if they do not match, the pattern is circle rotated once and then compared again. This process is repeated until a match occurs. If there is no match, the loop is executed $j$ times, since it is only possible to arrive at $j$ elements from any single pattern within a subset through circle rotation. The values of $k$ and $n$ that correspond to a successful match become the desired $k_i$ and $n_i$ values. Thus for the inputs $\beta_1$ and $\beta_2$ the corresponding $k_1, n_1$ and $k_2, n_2$ values are obtained. The above steps are repeated for every pattern $\pi_n$.

The next step is to perform modulo $(2^m - 1)$ addition as explained in Theorem 4 in order to derive $k_3$ and $n_3$. It should be noted that the computation of $k_3$ and $n_3$ would have been complex if $\gamma = \alpha^j$ was chosen such that $j$ is not a power of two. Once the values $k_3$ and $n_3$ have been computed, they are used to find the resultant element $\beta_3$. The resultant element $\beta_3$ is determined using the method which is a reversal of the process described in the previous paragraph. The value $n_3$ is compared with each value of $n$ and then the pattern $\pi_n$ corresponding to the matching $n$ is selected as the intermediary result $\beta_i$. This intermediary result is circle rotated $k_3$ times in order to arrive at the final result $\beta_3$. It is important to observe that the same algorithm can be used to compute

multiplication as well as division with the only change being in the modulo addition step. For computing the quotient, in a division operation, the values $k_2$ and $n_2$ are inverted before the addition step.

V. VLSI ARCHITECTURE FOR GMA ALGORITHM

In this section, we describe the architecture of SIGMA, a Systolic Implementation of the GMA algorithm. The architecture exploits the principles of pipelining and parallelism to obtain high speed and throughput. The architecture is systolic and implemented as a multistage linear static pipeline. Thus once the pipeline is filled, we can obtain a new result every clock cycle. The architecture has been implemented as a single chip using CMOS VLSI technology. The design and implementation of the SIGMA chip will be described in the next section.

A. The SIGMA Architecture

The system architecture of SIGMA is given in Fig. 3. The basic architecture consists of three computational processors: (i) Pattern recognition processor (P1), (ii) Power computation processor (P2), and (iii) Result rotation processor (P3). The three processors are organized as a pipeline structure and each processor itself is organized internally as a linear multistage pipe. The pattern recognition processor P1 is organized as a systolic array. It is named so since the input data is compared with the pattern that is stored in each processing element within this systolic array. The power computation processor will perform modulo $(2^m - 1)$ addition or subtraction depending on multiplication or division that is being computed. The purpose of the result rotation processor is to circle rotate the intermediary result pattern that is output by P1. Although the proposed architecture can be used to implement the GMA algorithm for Galois fields GF($2^m$) of higher values of $m$, we will describe the architecture using an example Galois field GF($2^4$). The architecture is programmable in order to choose from a set of irreducible polynomials.

Fig. 3. SIGMA chip architecture. (Blocks shown: pattern recognition systolic processor (P1), power computation processor (P2), result rotation processor (P3).)

Now, we will describe the proposed architecture with an example Galois field where $m = 4$, $j = 4$ and a choice from two irreducible polynomials, $x^4 + x + 1$ and $x^4 + x^3 + 1$. For this example, the pattern recognition processor will consist of four processing elements with each processing element consisting of four stages. Each stage corresponds to one hardware clock cycle. The power computation processor consists of two stages which is the same for any choice of $m$. The result rotation processor consists of three stages since the result can be arrived at from the intermediary pattern with a maximum of three rotations.

The two input operands $x = \beta_1$ and $y = \beta_2$, along with a single bit signal op, indicating the operation (multiplication/division) that is to be performed, are provided to the first processing element in the pattern recognition processor. Each processing element stores a pattern that corresponds to a subset of the elements of GF($2^4$). The two input operands are compared with this pattern in parallel. If they match, the corresponding values of $n$ and $k$ are selected. If the match is unsuccessful, the pattern is circle rotated once and compared again. This is repeated for all the three possible rotations. If any of the four comparisons is successful, the corresponding values of $n$ and $k$ are forwarded to the next processing element during the fifth clock cycle. A single-bit signal is sent along with the $n$ and $k$ values to indicate that a match has already occurred. This control signal is interpreted by the subsequent processors in deciding whether to perform comparison or not. After 16 cycles, the values of $n$ and $k$ for both the operands emerge out of P1 and are available to P2 during the 17th cycle. The power computation processor takes two cycles to perform modulo $(2^m - 1)$ addition or subtraction of the inputs. The resultant power computed by P2 is passed on to the last processing element in P1. This data passes through the systolic array in a backward direction during the next four cycles. The resultant power is compared with the powers of the patterns stored in the PE's. The pattern corresponding to the successful match is passed on as an intermediary result to P3 at the start of the 23rd cycle. This intermediary result is circle rotated $k_3$ times in P3 which is a three stage operation. Thus the final result is output by the system after 25 cycles.

The SIGMA chip architecture shown in Fig. 3 indicates the flow of data and control signals through the various processors within the chip. The input signals px and py are always initialized to zero values which are used to forward the powers selected as a result of successful pattern matches from one processor to the next. The single bit control signals rx and ry are initialized to zero. These will be set within some processing element to indicate that the buses px and py carry valid operand powers. The power computation processor P2 outputs the resultant power rp, and a single bit rr which when set to 0, indicates that none of the inputs x and y corresponded to a zero value. As the signals rp and rr flow through the systolic array, rp is compared with the pattern stored in each processor. The input signal r is initialized to a zero value and the pattern recognition processor copies the selected intermediary result to the r bus. The r bus carries the intermediary result to the result rotation processor P3. The two least significant bits of rp are used in the result rotation processor in order to rotate

the intermediary result r to get the final result. The final result is output on the r bus by P3. The set of input signals denoted as "Initialization and control" in Fig. 3 is used to preselect the polynomial and to preload the patterns and the corresponding pattern powers into the various PE's of the pattern recognition processor. This loading is done once in the beginning of the computation.

B. Pattern Recognition Processor (P1)

The pattern recognition processor is a systolic array of processors organized as a multistage linear static pipeline. The number of processors in the array is a function of $m$ and $j$ which decide the number of subsets within the chosen finite field GF($2^m$). In this subsection, we will describe the hardware organization of a single processing element that will be replicated in space to form the pattern recognition processor. The block diagram of the pattern recognition processor is given in Fig. 4. The pattern recognition processor consists of 4 PE's arranged as a systolic array. The flow of inputs and outputs through the various processing elements is shown in the figure. Each PE consists of three logic modules, the upper module, the central module and the lower module. The PE is divided into these modules on the basis of the function being performed in each of them. The detailed block diagram of a single PE in the pattern recognition processor is shown in Fig. 5. The central module in each PE stores (i) a pattern P corresponding to one of the subsets in the chosen Galois Field GF($2^m$), (ii) a pattern power PP associated with the pattern P, and (iii) a polynomial selection bit PSB that is used to select one of the two polynomials implemented ($x^4 + x + 1$ or $x^4 + x^3 + 1$). The three values are stored in static registers P, PP, and PSB, respectively. The function of the lower module is to compare the inputs with the pattern stored in the central module (and its rotated values).

Fig. 4. Block diagram of pattern recognition systolic processor (P1). (Processing elements PE0, PE1, PE2, PE3.)

Fig. 5. Block diagram of a single PE in the pattern recognition systolic array P1.

The Lower Module (LM) consists of four stages, LM0-LM3, organized as a linear pipeline. The block diagram of a single stage is shown in Fig. 6. The major components in a stage are (i) X logic, (ii) Y logic, and (iii) circle rotation logic. The X and Y logics are identical in terms of hardware. The circuitry for each consists of a comparator, some latches and multiplexers. The input element x is compared with the pattern P stored in the P register of the central module. If they match and the recognition bit rx is not set, then the pattern power PP is copied on to the output bus px and the recognition bit rx is set. If the recognition bit input rx has already been set, it indicates that there was a match earlier in some previous stage, in which case the current match result is ignored. This logic helps to avoid the possible second match of the input element in the last PE of P1. A second match is possible if $j$ is chosen such that the last subset of GF($2^m$) has less than $j$ elements. If the match operation in the current stage is unsuccessful or the recognition bit has already been set, then the input values x, px, and rx are simply passed on to the next stage. The circle rotation logic implements the circle rotation function for the two primitive polynomials.

The Upper Module of P1 is shown in Fig. 7. The function of the upper module is to compare the input result power rp with the pattern power pp from the central module of the same PE. If the match is successful and the result recognition bit rr is set, then the pattern p from the P register in the central module is copied onto the r bus which represents the intermediary result. If the match was not successful, the input values from the previous PE are passed on to the next PE.

C. The Power Computation Processor P2

The block diagram of the Power Computation Processor P2 is given in Fig. 8. The purpose of the power computation processor P2 is to compute the power of the result element which corresponds to the product in case of multiplication and the quotient in case of division. This processor is a two
Fig. 6. Block diagram of one stage from the lower module.

Fig. 7. The upper module.

Fig. 8. The power computation processor (P2).

Fig. 9. Result recognition processor (P3).

Fig. 10. Floorplan of SIGMA chip.

stage pipe that performs modulo (2^m - 1) addition/subtraction
of the operand powers px and py. If the op bit is set low,
indicating multiplication, modulo addition is performed; if
the op bit is set high, modulo subtraction is performed.
During the phi1 phase of the first cycle, the op bit is used
to select either the py value or its 1's complement. It should be
noted that this is the only operation that differentiates between
multiplication and division in the entire algorithm. Also, the
recognition bits rx and ry are ANDed to set the value of
rr, which indicates a valid result at the end. During the phi2
phase, the px and py values are added using a 4-bit carry look-ahead
adder. During the phi1 phase of the second cycle, the
result is checked to see whether it is larger than 14 (modulo 15
addition in our example). If it is larger, this logic passes a signal which,
during the following phi2 phase, increments the result by one
in order to get the correct result in rp.

D. The Result Rotation Processor (P3)
The architecture of the result rotation processor P3 is shown
in Fig. 9. The hardware is organized as a three stage pipeline
and the computation takes a total of three cycles. The purpose
of this processor is to circle rotate the intermediary result r
that is output by P1. The number of circle rotations that will
ever be required for an intermediary result is less than or equal
to three. Each stage of the result rotation processor consists
of (i) the circle rotation logic, (ii) a comparator, and (iii) a
2:1 multiplexer. The intermediary result r and the lower two
bits of its power rp are input to the first stage of P3. The
intermediary result r is circle rotated once in the phi1 phase
of the first cycle. During the phi2 phase, the result power is
checked using the comparator. If the value of rp is greater than
zero, the rotated result is selected through the multiplexer. In
the second stage, the intermediary result is rotated again if
the value of rp[0:1] is greater than one. In the last stage,
the result is rotated again if rp is greater than two. Thus, the
result rotation processor P3 outputs the final resultant value r
that is the product/quotient depending on whether the operation
performed was multiplication or division.
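The power arithmetic of P2 and the staged rotation of P3 described above can be sketched in software. The following is a minimal behavioral model, not the chip's circuitry. The function names `p2_power`, `rotate_once` and `p3_rotate` are hypothetical. The text does not fix the rotation direction or the bit ordering of rp[0:1], so the model assumes a one-bit circular left rotation and takes the "lower two bits" as the two least significant bits. The recognition bit rr is omitted.

```python
M = 4                   # field degree of the SIGMA prototype, GF(2^4)
MASK = (1 << M) - 1     # 0b1111; numerically equal to the modulus 2^m - 1 = 15


def p2_power(px: int, py: int, op: int) -> int:
    """Behavioral model of P2.

    op = 0 (multiplication): returns (px + py) mod 15.
    op = 1 (division):       returns (px - py) mod 15.
    """
    if not (0 <= px < MASK and 0 <= py < MASK):
        raise ValueError("powers must lie in 0..14")
    # Cycle 1, phi1: op selects py or its 1's complement (15 - py).
    operand = (~py & MASK) if op else py
    # Cycle 1, phi2: add the two operands. A Python int keeps the carry
    # that a 4-bit adder would produce as a separate carry-out bit.
    total = px + operand
    # Cycle 2, phi1: check whether the sum is larger than 14.
    overflow = total > MASK - 1
    # Cycle 2, phi2: if so, increment by one and keep the low four bits.
    return (total + 1) & MASK if overflow else total


def rotate_once(r: int) -> int:
    """ASSUMPTION: 'circle rotate' is a one-bit circular left rotation."""
    return ((r << 1) | (r >> (M - 1))) & MASK


def p3_rotate(r: int, rp: int) -> int:
    """Behavioral model of the three P3 stages."""
    low = rp & 0b11     # ASSUMPTION: rp[0:1] are the two least significant bits
    for threshold in (0, 1, 2):
        rotated = rotate_once(r)                 # phi1: rotation logic always runs
        r = rotated if low > threshold else r    # phi2: comparator drives the 2:1 mux
    return r
```

Under these assumptions, `p2_power(7, 9, 0)` gives 1 and `p2_power(3, 5, 1)` gives 13, matching (7 + 9) mod 15 and (3 - 5) mod 15. The number of rotations applied by `p3_rotate` equals the value of the two low bits of rp, which is at most three, in line with the bound stated in the text.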

Fig. 11. SIGMA chip microphotograph.

VI. VLSI IMPLEMENTATION AND PERFORMANCE
A prototype VLSI chip was designed using CMOS p-well
2-μm technology and was fabricated by MOSIS. The chip
implements the GMA algorithm and architecture described in
Sections IV and V. Since the entire architecture is a multistage
linear static pipeline, the chip can produce one result every
clock cycle after the pipe is filled. It should be noted that the
architecture has a through delay of O ( a m ), which is equal to
the number of stages in the pipe, and hence is not suitable
for large m. Since many of the applications use only fields
of small m, the proposed architecture is targeted toward
such applications. The chip was designed using a 2-phase
nonoverlapping clocking scheme. The circuit was fitted on
a 6.68 × 4.48 mm² MOSIS standard frame. The floorplan of
the chip is given in Fig. 10. The placement of the various
processors and the pin assignments are shown in the floorplan.
The overall circuitry required a silicon area of 1.789 × 3.570
mm² and a total of 8799 transistors. As can be seen in
the floorplan, the pattern recognition processor P1 required
most of the silicon area. It occupies 1.789 × 2.318 mm² and
consists of 7184 transistors. Although the chip required a
total of only 26 pins, the circuitry was fitted in a @-pin
package with most of the remaining pins connected to the various
internal nodes for extra testability. The prototype chip was
tested using the HP82000 high speed IC tester and was found
to be fully operational at 33.3 MHz. The performance can
be improved by omitting the extra connections, and
additional speed-ups can be obtained by using sub-micron
technology. The critical delay of the chip depends only on
the four-bit adder, and the chip does not have any global
signals. Based on the performance of the prototype chip, the
SIGMA chip can be improved in design so as to operate
at a speed as high as 40 MHz, yielding a throughput of 40
million multiplications/divisions per second. The SIGMA chip
microphotograph is shown in Fig. 11.

VII. CONCLUSIONS
In this paper, we have presented a new algorithm for
performing multiplication as well as division of two elements
of GF(2^m). An efficient VLSI architecture for implementing
the proposed algorithm is described. The architecture is
systolic and exploits pipelining and parallelism in
order to obtain high speed and throughput. A CMOS VLSI
chip, SIGMA, for GF(2^4) was designed, fabricated and
tested. The chip can yield a computation rate of 40 million
multiplications/divisions per second. The hardware can be
programmed for choosing different irreducible polynomials.

Mario Kovac (S’WM’91) received the B.S. and
M.S. degrees in computer science and engineering
from the Faculty of Electrical Engineering,
University of Zagreb, Croatia, in 1988 and 1991,
respectively, where he is working toward the Ph.D.
degree.
He has been on the faculty of the University of
Zagreb since 1989 and is currently holding a scientific
assistant position. During 1990 and 1991, he
was a visiting research scholar at the University
of South Florida, Tampa. His research interests
include computer architecture, parallel processing, VLSI and implementation
of algorithms and architectures in hardware (on both PCB and chip level).

N. Ranganathan (S’81-M’88-SM’92) was born
in Tiivaiym, India, in 1961. He received the
B.E. (Hons) degree in electrical and electronics
engineering from Regional Engineering College,
Tiruchirapalli, University of Madras, India, in 1983,
and the Ph.D. degree in computer science from the
University of Central Florida, Orlando, in 1988.
His research interests include VLSI design and
hardware algorithms, computer architecture and parallel
processing. He is currently involved in the
design and implementation of VLSI architectures for
computer vision, image processing, databases, data compression, and signal
processing applications. He has been named the Program Co-chair for the
7th International Conference on VLSI Design to be held in Calcutta, India,
in January 1994.
Dr. Ranganathan is a member of the IEEE Computer Society, the IEEE
Computer Society Technical Committee on VLSI, the ACM, and the VLSI
Society of India.

Murali R. Varanasi (S’72-M’73-SM’89) received
the B.Sc. and D.M.I.T. degrees from Andhra University,
India, and Madras Institute of Technology,
India. He has also received the M.S. and Ph.D.
degrees in electrical engineering from the University
of Maryland, College Park, in 1972 and 1973,
respectively.
From 1973 to 1980 he was with the Department
of Electrical Engineering, Old Dominion University,
Norfolk, VA. He is currently working as a Professor
of Computer Science at the University of South
Florida, Tampa. His research interests include coding theory, computer
architecture, fault tolerant computing, and VLSI design.
Dr. Varanasi is a member of the IEEE Computer Society and is currently
involved in educational and publications activities of that society. He is a
member of ACM, Eta Kappa Nu, and Sigma Xi.
