IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 1, NO. 1, MARCH 1993
Abstract-Finite or Galois fields are used in numerous applications like error correcting codes, digital signal processing and cryptography. The design of efficient methods for Galois field arithmetic such as multiplication and division is critical for these applications. In this paper, we present a new algorithm based on a pattern matching technique for computing multiplication and division in $GF(2^m)$. An efficient systolic architecture is described for implementing the algorithm which can produce a new result every clock cycle and the multiplication and division operations can be interleaved. The architecture has been implemented using 2-µm CMOS technology. The chip yields a computational rate of 33.3 million multiplications/divisions per second.

I. INTRODUCTION

IN RECENT YEARS, finite fields or Galois fields have been extensively applied in i) error correcting codes such as BCH codes and RS codes [1], ii) digital signal processing [2], iii) pseudorandom number generation [13], and iv) encryption and decryption protocols in cryptography [3] and space object tracking applications [16]. In many of the above applications, the finite field $GF(2^m)$, a number system made of $2^m$ elements, is used. The practical use of $GF(2^m)$ in various applications requires arithmetic operations like addition, multiplication and division. Thus efficient algorithms are required to perform these arithmetic operations on-the-fly. Addition in $GF(2^m)$ is relatively simple and straightforward. However, multiplication and division are more complex operations. Division is performed by multiplication of the inverse of the denominator element and inversion itself is achieved through repeated multiplications.

In this paper, we propose a new algorithm for multiplication and division in $GF(2^m)$. The algorithm is designed for elements of $GF(2^m)$ represented by the conventional basis $\{1, \alpha, \alpha^2, \alpha^3, \ldots, \alpha^{m-1}\}$. The algorithm is based on a pattern matching and recognition approach and is amenable for hardware implementation. A VLSI architecture is described for implementing the proposed algorithm. The architecture is systolic and uses the principles of pipelining and parallelism to obtain high speed and throughput. The overall architecture is a multistage linear pipeline and hence, can yield a new result every clock cycle.

A prototype CMOS VLSI chip, SIGMA, implementing the architecture for Galois field $GF(2^4)$ has been designed, fabricated and tested. The prototype chip is operational at 33.3 MHz yielding a throughput of 33.3 million multiplications/divisions per second. An important feature of our design is that the multiplication and division operations can be interleaved. The hardware is programmable for different primitive irreducible polynomials.

The outline of the paper is as follows. An overview of the various hardware approaches for Galois field $GF(2^m)$ is given in Section II. Section III develops the theoretical basis for our proposed algorithm for multiplication and division. The algorithm is described in Section IV. The SIGMA architecture is presented in Section V. Section VI describes the VLSI chip implementation and its performance. Conclusions are given in Section VII.

II. RELATED WORK

In recent years, many researchers have proposed hardware algorithms and architectures for performing arithmetic operations in Galois fields that can be implemented in VLSI [4]-[13]. Most approaches implement separate hardware for multiplication, inversion and division. In our approach, we propose a systolic hardware architecture that can perform both multiplication and division. We present here a brief overview of previous work on VLSI hardware methods for Galois field arithmetic.

A cellular array multiplier for $GF(2^m)$ was proposed by Laws and Rushforth [4]. In their design, the elements of the Galois field are represented using a conventional basis. The architecture consists of a two-dimensional array of $m^2$ identical cells which is a straightforward spatial iteration of the standard sequential shift register multiplier. The computation requires about $2m$ gate delays. The hardware used is programmable to implement fields based on different irreducible polynomials. Yeh, Reed, and Truong [5] proposed two systolic architectures for multiplication in Galois fields. Their design is also based on a Galois field represented by a conventional basis. The

Manuscript received June 8, 1992; revised October 14, 1992. M. Kovac was supported by Department of Science, Republic of Croatia. N. Ranganathan was supported in part by the National Science Foundation under Grant MIP-9010358 and by the Florida High Technology and Industry Council. M. Varanasi was supported in part by the AT&T Foundation.
M. Kovac is with the College of Electrical Engineering, University of Zagreb, Unska 3, 41 000 Zagreb, Croatia.
N. Ranganathan is with the Center for Microelectronics Research, Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620.
M. Varanasi is with the Department of Computer Science and Engineering, University of South Florida, Tampa, FL 33620.
IEEE Log Number 9206232.
1063-8210/93$03.00 © 1993 IEEE
KOVAC et al.: SIGMA VLSI SYSTOLIC ARRAY IMPLEMENTATION
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 1, NO. 1, MARCH 1993
$\alpha^z = \alpha^j \alpha^k \Rightarrow \alpha^z = \rho^k(\alpha^j) \Rightarrow \beta = \rho^k(\gamma).$
$n_3 = \lfloor (((jn_1 + k_1) + (jn_2 + k_2)) \bmod (2^m - 1))/j \rfloor.$
Proof:
Example 2: Let $m = 4$, $F(x) = x^4 + x + 1$, $\beta_1 = \alpha^4$ and $\beta_2 = \alpha^{13}$. Then,
If $(-y + 2^m - 1)$ is represented in binary form,

$$-y + 2^m - 1 = -(y_{m-1}2^{m-1} + y_{m-2}2^{m-2} + \cdots + y_1 2 + y_0) + 2^m - 1.$$

Since $2^m$ can be represented as $2^{m-1} + 2^{m-2} + \cdots + 2 + 1 + 1$,

$$\begin{aligned}
-y + 2^m - 1 &= -(y_{m-1}2^{m-1} + y_{m-2}2^{m-2} + \cdots + y_1 2 + y_0) + (2^{m-1} + 2^{m-2} + \cdots + 2 + 1 + 1) - 1 \\
&= 2^{m-1}(1 - y_{m-1}) + 2^{m-2}(1 - y_{m-2}) + \cdots + 2(1 - y_1) + (1 - y_0) \\
&= \bar{y}.
\end{aligned}$$

Fig. 2. Flowchart of the GMA algorithm.

IV. THE GMA ALGORITHM

In this section, we present a new algorithm, called the GMA algorithm, for the computation of Galois Field based multiplication and division. The algorithm exploits the various properties described in Section III for efficient computation and leads to a simple hardware implementation. The arithmetic operations, i.e., the multiplication of two elements $\beta_1$ and $\beta_2$ from a Galois field $GF(2^m)$ or their division $\beta_1/\beta_2$, can be achieved using modulo $(2^m - 1)$ addition in the following manner: (i) for $\beta_1$ and $\beta_2$, find the corresponding values for the $k_1, n_1$ and $k_2, n_2$ pairs, (ii) using the $k_1, n_1$ and $k_2, n_2$ pairs, compute $k_3, n_3$, which correspond to $\beta_3$, the resultant product or quotient, and finally, (iii) transform $k_3$ and $n_3$ into the actual result $\beta_3$. The flowchart of the proposed algorithm is given in Fig. 2. The algorithm is described in the rest of this section.

The elements of $GF(2^m)$ are divided into subsets for a particular $\gamma$ value as in Theorem 2. Each subset $n$ corresponds to a pattern $R_n$. For every pattern $R_n = \alpha^{jn}$, the product $jn$ will be denoted as the pattern power. The given input $\beta_i$ is compared with each pattern $R_n$, where $n = 0$ to $\lfloor (2^m - 1)/j \rfloor$. As the input $\beta_i$ is compared with the first pattern $R_{n=0}$, if they do not match, the pattern is circle rotated once and then compared again. This process is repeated until a match occurs. If there is no match, the loop is executed $j$ times, since it is only possible to arrive at $j$ elements from any single pattern within a subset through circle rotation. The values of $k$ and $n$ that correspond to a successful match become the desired $k_i$ and $n_i$ values. Thus for the inputs $\beta_1$ and $\beta_2$ the corresponding $k_1, n_1$ and $k_2, n_2$ values are obtained. The above steps are repeated for every pattern $R_n$.

The next step is to perform modulo $(2^m - 1)$ addition as explained in Theorem 4 in order to derive $k_3$ and $n_3$. It should be noted that the computation of $k_3$ and $n_3$ would have been complex if $\gamma = \alpha^j$ was chosen such that $j$ is not a power of two. Once the values $k_3$ and $n_3$ have been computed, they are used to find the resultant element $\beta_3$. The resultant element $\beta_3$ is determined using a method which is a reversal of the process described in the previous paragraph. The value $n_3$ is compared with each value of $n$ and then the pattern $R_n$ corresponding to the matching $n$ is selected as the intermediary result $\beta_i$. This intermediary result is circle rotated $k_3$ times in order to arrive at the final result $\beta_3$. It is important to observe that the same algorithm can be used to compute
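The three steps of the GMA algorithm in this section can be illustrated with a small software model. The following is a minimal sketch, not the authors' implementation: it assumes $m = 4$, $F(x) = x^4 + x + 1$ and $j = 4$ (the values used in the paper's examples), stores the coefficient of $\alpha^i$ in bit $i$ of an integer, and models circle rotation as multiplication by $\alpha$. The names `circle_rotate`, `build_patterns`, `recognize` and `gma` are hypothetical.

```python
# Illustrative sketch only: a software model of the GMA steps for GF(2^4).
# Assumptions (not taken from the hardware): bit i of an int holds the
# coefficient of alpha^i; all function and variable names are hypothetical.

M = 4
POLY = 0b10011            # F(x) = x^4 + x + 1
J = 4                     # gamma = alpha^J
ORDER = (1 << M) - 1      # 2^m - 1 = 15 nonzero elements


def circle_rotate(e):
    """Multiply e by alpha: shift left, then reduce modulo F(x)."""
    e <<= 1
    if e & (1 << M):
        e ^= POLY
    return e


def build_patterns():
    """Return the patterns R_n = alpha^(J*n) for n = 0 .. floor(ORDER / J)."""
    patterns = []
    e = 1
    for _ in range(ORDER // J + 1):
        patterns.append(e)
        for _ in range(J):
            e = circle_rotate(e)
    return patterns


PATTERNS = build_patterns()


def recognize(beta):
    """Step (i): find (n, k) with beta == R_n rotated k times; first match wins."""
    for n, pattern in enumerate(PATTERNS):
        p = pattern
        for k in range(J):
            if p == beta:
                return n, k
            p = circle_rotate(p)
    raise ValueError("the zero element is not a power of alpha")


def gma(beta1, beta2, divide=False):
    """Multiply (or divide) two nonzero field elements via pattern powers."""
    n1, k1 = recognize(beta1)
    n2, k2 = recognize(beta2)
    p1 = J * n1 + k1
    p2 = J * n2 + k2
    if divide:
        p2 = ORDER - p2               # one's complement of p2, i.e. -p2 mod 15
    p3 = (p1 + p2) % ORDER            # step (ii): modulo (2^m - 1) addition
    n3, k3 = divmod(p3, J)
    result = PATTERNS[n3]             # step (iii): select the pattern R_n3 ...
    for _ in range(k3):
        result = circle_rotate(result)  # ... and rotate it k3 times
    return result


print(gma(0b0011, 0b1101))               # alpha^4 * alpha^13 = alpha^2 -> 4
print(gma(0b0011, 0b1101, divide=True))  # alpha^4 / alpha^13 = alpha^6 -> 12
```

Because $j = 4$ is a power of two, `divmod(p3, J)` amounts to splitting the 4-bit power into its upper and lower two bits, which is why the text notes that other choices of $j$ would complicate this step.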
[Figure: block diagram showing the Pattern Recognition Systolic Processor (P1), the Power Computation Processor (P2), and the Result Rotation Processor (P3).]

Now, we will describe the proposed architecture with an example Galois field where $m = 4$, $j = 4$ and a choice from
the intermediary result r to get the final result. The final result is output on the r bus by P3. The set of input signals denoted as “Initialization and control” in Fig. 3 is used to preselect the polynomial and to preload the patterns and the corresponding pattern powers into the various PE’s of the pattern recognition processor. This loading is done once in the beginning of the computation.

B. Pattern Recognition Processor (P1)

The pattern recognition processor is a systolic array of processors organized as a multistage linear static pipeline. The number of processors in the array is a function of $m$ and $j$ which decide the number of subsets within the chosen finite field $GF(2^m)$. In this subsection, we will describe the hardware organization of a single processing element that will be replicated in space to form the pattern recognition processor. The block diagram of the pattern recognition processor is given in Fig. 4. The pattern recognition processor consists of 4 PE’s arranged as a systolic array. The flow of inputs and outputs through the various processing elements is shown in the figure. Each PE consists of three logic modules, the upper module, the central module and the lower module. The PE is divided into these modules on the basis of the function being performed in each of them. The detailed block diagram of a single PE in the pattern recognition processor is shown in Fig. 5. The central module in each PE stores (i) a pattern P corresponding to one of the subsets in the chosen Galois Field $GF(2^m)$, (ii) a pattern power PP associated with the pattern P, and (iii) a polynomial selection bit PSB that is used to select one of the two polynomials implemented ($x^4 + x + 1$ or $x^4 + x^3 + 1$). The three values are stored in static registers P, PP, and PSB, respectively. The function of the lower module is to compare the inputs with the pattern stored in the central module (and its rotated values).

Fig. 5. Block diagram of a single PE in the pattern recognition systolic array P1.

The Lower Module (LM) consists of four stages, LM0-LM3, organized as a linear pipeline. The block diagram of a single stage is shown in Fig. 6. The major components in a stage are (i) X logic, (ii) Y logic, and (iii) circle rotation logic. The X and Y logics are identical in terms of hardware. The circuitry for each consists of a comparator, some latches and multiplexers. The input element x is compared with the pattern P stored in the P register of the central module. If they match and the recognition bit rx is not set, then the pattern power PP is copied on to the output bus px and the recognition bit rx is set. If the recognition bit input rx has already been set, it indicates that there was a match earlier in some previous stage, in which case, the current match result is ignored. This logic helps to avoid the possible second match of the input element in the last PE of P1. A second match is possible if $j$ is chosen such that the last subset of $GF(2^m)$ has fewer than $j$ elements. If the match operation in the current stage is unsuccessful or the recognition bit has already been set, then the input values x, px, and rx are simply passed on to the next stage. The circle rotation logic implements the circle rotation function for the two primitive polynomials.

The Upper Module of P1 is shown in Fig. 7. The function of the upper module is to compare the input result power rp with the pattern power pp from the central module of the same PE. If the match is successful and the result recognition bit rr is set, then the pattern p from the P register in the central module is copied onto the r bus which represents the intermediary result. If the match was not successful, the input values from the previous PE are passed on to the next PE.

C. The Power Computation Processor P2

The block diagram of the Power Computation Processor P2 is given in Fig. 8. The purpose of the power computation processor P2 is to compute the power of the result element which corresponds to the product in case of multiplication and the quotient in case of division. This processor is a two
Fig. 9. Result recognition processor (P3).
Fig. 10. Floorplan of SIGMA chip.

indicating multiplication, modulo addition is performed and if the op bit is set to high, then modulo subtraction is performed. During the phi1 phase of the first cycle, the op bit is used to select the py value or its 1’s complement. It should be noted that this is the only operation that differentiates between multiplication and division in the entire algorithm. Also, the recognition bits rx and ry are ‘AND’ed to set the value of rr which indicates a valid result in the end. During the phi2 phase, the px and py values are added using a 4-bit carry look-ahead adder. During the phi1 phase of the second cycle, the result is checked if it is larger than 14 (modulo 15 addition in our example). If it is larger, this logic passes a which during the following phi2 cycle, increments the result by one in order to get the correct result in rp.

D. The Result Rotation Processor (P3)

The architecture of the result rotation processor P3 is shown in Fig. 9. The hardware is organized as a three stage pipeline and the computation takes a total of three cycles. The purpose of this processor is to circle rotate the intermediary result r that is output by P1. The number of circle rotations that will ever be required for an intermediary result is less than or equal to three. Each stage of the result rotation processor consists of (i) the circle rotation logic, (ii) a comparator, and (iii) a 2:1 multiplexer. The intermediary result r and the lower two bits of its power rp are input to the first stage of P3. The intermediary result r is circle rotated once in the phi1 phase of the first cycle. During the phi2 phase, the result power is checked using the comparator. If the value of rp is greater than zero, the rotated result is selected through the multiplexer. In the second stage, the intermediary result is rotated again if the value of rp[0:1] is greater than one. In the last stage, the result is rotated again if rp is greater than two. Thus, the result rotation processor P3 outputs the final resultant value r that is the product/quotient depending on whether the operation performed was multiplication or division.
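The behaviour of P2 and P3 described above can be modelled in a few lines. This is a hedged sketch under stated assumptions, not a description of the actual circuit: it assumes $m = 4$, the polynomial $x^4 + x + 1$, and powers in the range 0 to 14; the function names `power_compute`, `circle_rotate` and `result_rotate` are hypothetical, and the clock phases are not modelled.

```python
# Illustrative behavioural model of P2 and P3 for GF(2^4); names are
# hypothetical and phi1/phi2 clock phases are not modelled.

POLY = 0b10011                    # x^4 + x + 1 (one of the two selectable polynomials)


def power_compute(px, py, op):
    """P2 model: op = 0 adds the powers (multiply), op = 1 subtracts (divide), modulo 15."""
    if op:
        py = ~py & 0xF            # 4-bit one's complement, equal to -py modulo 15
    s = px + py                   # 4-bit addition, keeping the carry out
    if s > 14:                    # correction step: increment and keep 4 bits
        s = (s + 1) & 0xF
    return s


def circle_rotate(r):
    """Multiply r by alpha: shift left, then reduce modulo the polynomial."""
    r <<= 1
    if r & 0b10000:
        r ^= POLY
    return r


def result_rotate(r, rp):
    """P3 model: three stages, each rotating once if rp[0:1] exceeds its threshold."""
    k = rp & 0b11                 # lower two bits of the result power
    for threshold in (0, 1, 2):   # stage 1, stage 2, stage 3
        if k > threshold:
            r = circle_rotate(r)
    return r


rp = power_compute(4, 13, op=1)   # powers of alpha^4 / alpha^13
print(rp)                         # 6
print(result_rotate(0b0011, rp))  # pattern alpha^4 rotated twice -> alpha^6 = 12
```

The increment in `power_compute` works because dropping the carry out of a 4-bit sum subtracts 16, so adding one afterwards leaves a net subtraction of 15, which is exactly the modulo 15 reduction.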
VI. VLSI IMPLEMENTATION AND PERFORMANCE

A prototype VLSI chip was designed using CMOS p-well 2-µm technology and was fabricated by MOSIS. The chip implements the GMA algorithm and architecture described in Sections IV and V. Since the entire architecture is a multistage linear static pipeline, the chip can produce one result every clock cycle, after the pipe is filled. It should be noted that the architecture has a through delay of $O(2^m)$ which is equal to the number of stages in the pipe and hence, is not suitable for large $m$. Since many of the applications use only fields of small $m$, the proposed architecture is targeted toward such applications. The chip was designed using a 2-phase nonoverlapping clocking scheme. The circuit was fitted on a 6.68 x 4.48 mm² MOSIS standard frame. The floorplan of the chip is given in Fig. 10. The placement of the various processors and the pin assignments are shown in the floorplan. The overall circuitry required a silicon area of 1.789 x 3.570 mm² and a total of 8799 transistors. As can be seen in the floorplan, the pattern recognition processor P1 required most of the silicon area. It occupies 1.789 x 2.318 mm² and consists of 7184 transistors. Although the chip required a total of only 26 pins, the circuitry was fitted in a @-pin package with most of the remaining pins connected to the various internal nodes for extra testability. The prototype chip was
tested using the HP82000 high speed IC tester and was found to be fully operational at 33.3 MHz. The performance can be improved by omitting the extra connections and also, additional speed-ups can be obtained by using sub-micron technology. The critical delay of the chip depends only on the four bit adder and the chip does not have any global signals. Based on the performance of the prototype chip, the SIGMA chip can be improved in design so as to operate at a speed as high as 40 MHz yielding a throughput of 40 million multiplications/divisions per second. The SIGMA chip microphotograph is shown in Fig. 11.

VII. CONCLUSIONS

In this paper, we have presented a new algorithm for performing multiplication as well as division of two elements of $GF(2^m)$. An efficient VLSI architecture for implementing the proposed algorithm is described. The architecture is systolic and exploits pipelining and parallelism in order to obtain high speed and throughput. A CMOS VLSI chip, SIGMA, for $GF(2^4)$ was designed, fabricated and tested. The chip can yield a computation rate of 40 million multiplications/divisions per second. The hardware can be programmed for choosing different irreducible polynomials.

REFERENCES

[1] S. Lin, An Introduction to Error-Correcting Codes. Englewood Cliffs: Prentice-Hall, 1970.
[2] J. H. McClellan and C. M. Rader, Number Theory in Digital Signal Processing. Englewood Cliffs: Prentice-Hall, 1979.
[3] S. Berkovits, J. Kowaltchuk, and B. Schanning, “Implementing public key scheme,” IEEE Commun. Mag., vol. 17, pp. 2-3, May 1979.
[4] B. A. Laws and C. K. Rushforth, “A cellular-array multiplier for $GF(2^m)$,” IEEE Trans. Computers, vol. C-20, pp. 1573-1578, Dec. 1971.
[5] C. S. Yeh, I. S. Reed, and T. K. Truong, “Systolic multipliers for finite fields $GF(2^m)$,” IEEE Trans. Computers, vol. C-33, pp. 357-360, Apr. 1984.
[6] C. S. Wang et al., “VLSI architecture for computing multiplications and inverses in $GF(2^m)$,” IEEE Trans. Computers, vol. C-34, pp. 709-716, Aug. 1985.
[7] P. A. Scott, S. E. Tavares, and L. E. Peppard, “A fast VLSI multiplier for $GF(2^m)$,” IEEE J. Select. Areas Commun., vol. SAC-4, pp. 62-65, Jan. 1986.
[8] H. Okano and H. Imai, “A construction method of high-speed decoders using ROM’s for BCH and RS codes,” IEEE Trans. Computers, vol. C-36, pp. 1165-1171, Oct. 1987.
[9] B. B. Zhou, “A new bit-serial systolic multiplier over $GF(2^m)$,” IEEE Trans. Computers, vol. 37, pp. 749-751, June 1988.
[10] A. Pincin, “A new algorithm for multiplication in finite fields,” IEEE Trans. Computers, vol. 38, pp. 1045-1049, July 1989.
[11] M. Furer and K. Mehlhorn, “AT² optimal Galois field multiplier for VLSI,” IEEE Trans. Computers, vol. 38, pp. 1333-1336, Sept. 1989.
[12] G. L. Feng, “A VLSI architecture for fast inversion in $GF(2^m)$,” IEEE Trans. Computers, vol. 38, pp. 1383-1386, Oct. 1989.
[13] C. C. Wang and D. Pei, “A VLSI design for computing exponentiations in $GF(2^m)$,” IEEE Trans. Computers, vol. 39, pp. 258-262, Feb. 1990.
[14] N. Weste and K. Eshraghian, Principles of CMOS VLSI Design: A Systems Perspective. Reading, MA: Addison-Wesley, 1988.
[15] K. Hwang and F. A. Briggs, Computer Architecture and Parallel Processing. McGraw Hill, 1984.
[16] T. C. Bartee and D. I. Schneider, “Computation with finite fields,” Information and Control, no. 6, pp. 79-98, 1963.
[17] S. Lin and D. Costello, Error Control Coding. Englewood Cliffs, NJ: Prentice-Hall, 1983.
[18] W. W. Peterson and E. J. Weldon, Error Correcting Codes. Cambridge, MA: MIT Press, 1981.
[19] T. R. N. Rao and E. Fujiwara, Error-Control Coding for Computer Systems. Englewood Cliffs, NJ: Prentice-Hall, 1989.

Mario Kovac (S’WM’91) received the B.S. and M.S. degrees in computer science and engineering from the Faculty of Electrical Engineering, University of Zagreb, Croatia, in 1988 and 1991, respectively, where he is working toward the Ph.D. degree.
He has been on the faculty of the University of Zagreb since 1989 and is currently holding a scientific assistant position. During 1990 and 1991, he was a visiting research scholar at the University of South Florida, Tampa. His research interests include computer architecture, parallel processing, VLSI and implementation of algorithms and architectures in hardware (on both PCB and chip level).

N. Ranganathan (S’81-M’88-SM’92) was born in Tiivaiym, India, in 1961. He received the B.E. (Hons) degree in electrical and electronics engineering from Regional Engineering College, Tiruchirapalli, University of Madras, India, in 1983, and the Ph.D. degree in computer science from the University of Central Florida, Orlando, in 1988.
His research interests include VLSI design and hardware algorithms, computer architecture and parallel processing. He is currently involved in the design and implementation of VLSI architectures for computer vision, image processing, databases, data compression, and signal processing applications. He has been named the Program Co-chair for the 7th International Conference on VLSI Design to be held in Calcutta, India, in January 1994.
Dr. Ranganathan is a member of the IEEE Computer Society, the IEEE Computer Society Technical Committee on VLSI, the ACM, and the VLSI Society of India.

Murali R. Varanasi (S’72-M’73-SM’89) received the B.Sc. and D.M.I.T. degrees from Andhra University, India, and Madras Institute of Technology, India. He has also received the M.S. and Ph.D. degrees in electrical engineering from the University of Maryland, College Park, in 1972 and 1973, respectively.
From 1973 to 1980 he was with the Department of Electrical Engineering, Old Dominion University, Norfolk, VA. He is currently working as a Professor of Computer Science at the University of South Florida, Tampa. His research interests include coding theory, computer architecture, fault tolerant computing, and VLSI design.
Dr. Varanasi is a member of the IEEE Computer Society and is currently involved in educational and publications activities of that society. He is a member of ACM, Eta Kappa Nu, and Sigma Xi.