
Microprocessing and Microprogramming 33 (1991/92) 145-159

North-Holland

A systolic array exploiting the inherent parallelisms of artificial neural networks
Jai-Hoon Chung, Hyunsoo Yoon and Seung Ryoul Maeng

Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Kusong-Dong, Yusung-Gu,
Taejon 305-701, South Korea

Received 10 January 1991


Revised 14 November 1991

Abstract
Chung, J.-H., H. Yoon and S.R. Maeng, A systolic array exploiting the inherent parallelisms of artificial neural networks,
Microprocessing and Microprogramming 33 (1991/92) 145-159.
The systolic array implementation of artificial neural networks is one of the best solutions to the communication problems generated
by the highly interconnected neurons. In this paper, a two-dimensional systolic array for backpropagation neural network is
presented. The design is based on the classical systolic algorithm of matrix-by-vector multiplication, and exploits the inherent
parallelisms of backpropagation neural networks. This design executes the forward and backward passes in parallel, and exploits
the pipelined parallelism of multiple patterns in each pass. The estimated performance of this design shows that the pipelining
of multiple patterns is an important factor in VLSI neural network implementations.

Keywords. Artificial neural network; backpropagation model; systolic array; pipelining; VLSI implementation.

1. Introduction

In recent years, artificial neural networks have become a subject of very extensive research, since they are envisioned as an alternative approach to problems such as pattern recognition, vision, and speech recognition that artificial intelligence had been unable to solve [1]. However, as simulations of large neural networks on a sequential computer frequently require days and even weeks of computation, and the long computational time has been a critical obstacle for progress in neural network research, extensive efforts are being devoted to the parallel implementation of neural networks.

A systolic array [2] is one of the best solutions to the parallel implementation of neural networks. It can overcome the communication problems generated by the highly interconnected neurons, and can exploit the massive parallelism inherent in the problem. Moreover, since the computation of neural networks can be represented by a series of matrix-by-vector multiplications, the classical systolic algorithms can be used to implement them.

There have been several research efforts on systolic algorithms and systolic array structures to implement neural networks. The approaches can be classified into three groups. One is mapping the systolic algorithms for neural networks onto parallel computers such as Warp, MasPar MP-1, and Transputer arrays [3-7], another is designing programmable systolic arrays for general neural network models [8-11], and the other is designing a VLSI systolic array dedicated to one or two specific models [12-14]. All of these approaches exploit the spatial parallelism and the training set parallelism inherent in neural networks, and suggest the systolic ring array or the two-dimensional mesh array as the resultant array structures.

Correspondence to: Jai-Hoon Chung, Department of Computer Science, Korea Advanced Institute of Science and Technology, 373-1 Kusong-Dong, Yusung-Gu, Taejon 305-701, South Korea.

The backpropagation neural network [15], in addition to the above two types of parallelism, has one more parallel aspect: the forward and backward passes can be executed in parallel with pipelining of multiple training patterns. This paper proposes a systolic design dedicated to the backpropagation neural network that can exploit this type of parallelism.

In the following sections, the basic systolic algorithms and the systolic array structures to implement the neural networks, the issues of exploiting the parallelisms inherent in neural networks, and the various systolic approaches are briefly investigated. Then a systolic array that can exploit the forward/backward pipelining of multiple training patterns is presented and its performance is analyzed. Finally the potential performance of this design is compared to the performance of existing implementations for the NETtalk text-to-speech neural network [16].

2. Background

2.1. Multilayer neural network with backpropagation learning algorithm

The neural network we are focusing on is a multilayer neural network using the backpropagation learning algorithm [15], which is the most popular neural network today [17], and has been widely used in pattern classification, speech recognition, machine vision, sonar, radar, robotics, and signal processing applications [1].

The operations of the backpropagation learning algorithm are represented by the following equations. Consider an L-layer (layer 1 to layer L) network consisting of N_k neurons at the kth layer. Layer 1 is the input layer, layer L is the output layer, and layers 2 to L-1 are the hidden layers. The neurons between adjacent layers are fully connected.

2.1.1. The forward pass
The forward pass of such a network can be described by Eqs. 1-3:

y_{pj}^{(1)} = x_{pj}^{(1)} = i_{pj},  (n = 1),   (1)

x_{pj}^{(n)} = \sum_{i=1}^{N_{n-1}} y_{pi}^{(n-1)} \cdot w_{ji}^{(n)}(t) - \theta_j,  (2 \le n \le L),   (2)

y_{pj}^{(n)} = f(x_{pj}^{(n)}) = \frac{1}{1 + e^{-x_{pj}^{(n)}}},  (2 \le n \le L),   (3)

where x_{pj}^{(n)} is the net input to neuron j in layer n for pattern p and i_{pj} is the input to neuron j in the input layer for pattern p. The w_{ji}^{(n)}(t) is the value of the weight associated to the connection from neuron i in layer n-1 to neuron j in layer n after t weight updates, and \theta_j is the threshold of neuron j. The y_{pj}^{(n)} is the state output from neuron j in layer n for pattern p.

2.1.2. The backward pass
The backward pass of the network operations can be described by Eqs. 4-6:

\beta_{pj}^{(n)} = y_{pj}^{(n)} - d_{pj},  (n = L),   (4)

\beta_{pj}^{(n)} = \sum_{k=1}^{N_{n+1}} \delta_{pk}^{(n+1)} \cdot w_{kj}^{(n+1)}(t),  (2 \le n \le L-1),   (5)

\delta_{pj}^{(n)} = \beta_{pj}^{(n)} \cdot y_{pj}^{(n)} (1 - y_{pj}^{(n)}) = g(\beta_{pj}^{(n)}),  (2 \le n \le L),   (6)

where \beta_{pj}^{(n)} is the net error input to neuron j in layer n for pattern p and d_{pj} is the desired output of neuron j in the output layer for pattern p. \delta_{pj}^{(n)} is the error output from neuron j in layer n to the neurons in layer n-1 for pattern p. Because the weights associated with the errors in the output layer after the forward pass should be adjusted, the t must be the same as that in the forward pass.

2.1.3. The weight increment update
After presentation of the training patterns, weights are updated according to Eqs. 7-8:

\Delta w_{ji}^{(n)}(u + 1) = \Delta w_{ji}^{(n)}(u) + \delta_{pj}^{(n)} \cdot y_{pi}^{(n-1)},  (2 \le n \le L),   (7)

w_{ji}^{(n)}(t + 1) = w_{ji}^{(n)}(t) + \varepsilon \Delta w_{ji}^{(n)},  (2 \le n \le L),   (8)

where u is the number of training patterns presented, t is the number of weight updates, and \varepsilon is the learning rate (typically 1.0).
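To make the per-pattern computation concrete, the following NumPy sketch transcribes Eqs. 1-8 (together with the squared-error accumulation of Eq. 9 below) for a small fully connected network. The function and variable names are ours, and the sign conventions (output error y - d, increment added in Eq. 8) follow the equations exactly as written above; many textbook presentations use d - y instead.

```python
import numpy as np

def sigmoid(x):                              # Eq. 3: f(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + np.exp(-x))

# A (5-3-2) example network.  W[n] holds the weights w_ji between layer n and
# layer n+1, theta[n] the thresholds of layer n+1, dW[n] the accumulated
# weight increments of Eq. 7 (all names are ours).
rng = np.random.default_rng(0)
sizes = [5, 3, 2]
W = [rng.uniform(-0.5, 0.5, (sizes[n + 1], sizes[n])) for n in range(len(sizes) - 1)]
theta = [np.zeros(sizes[n + 1]) for n in range(len(sizes) - 1)]
dW = [np.zeros_like(w) for w in W]

def forward(i_p):
    """Forward pass, Eqs. 1-3: returns the activations of every layer."""
    y = [np.asarray(i_p, dtype=float)]       # layer 1 simply outputs the input (Eq. 1)
    for n in range(len(W)):
        x = W[n] @ y[-1] - theta[n]          # net input (Eq. 2)
        y.append(sigmoid(x))                 # state output (Eq. 3)
    return y

def backward(y, d_p):
    """Backward pass, Eqs. 4-6, accumulating the increments of Eq. 7."""
    d = np.asarray(d_p, dtype=float)
    beta = y[-1] - d                                 # error at the output layer (Eq. 4)
    for n in reversed(range(len(W))):
        delta = beta * y[n + 1] * (1.0 - y[n + 1])   # error output (Eq. 6)
        dW[n] += np.outer(delta, y[n])               # increment accumulation (Eq. 7)
        beta = W[n].T @ delta                        # error passed to layer below (Eq. 5)
    return np.sum((y[-1] - d) ** 2)          # this pattern's contribution to E (Eq. 9)

def update(eps=1.0):
    """Weight update, Eq. 8 (batch or online depending on how often it is called)."""
    for n in range(len(W)):
        W[n] += eps * dW[n]
        dW[n][:] = 0.0

E = backward(forward([1, 0, 1, 0, 1]), [1, 0])   # one training pattern
update()                                         # one weight update (u = t = 1)
```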

Two different strategies are in common use for updating the weights in the network. In the first approach, the weights are updated after every cycle through the entire set of training patterns in order to minimize the average error over all the training patterns. This update cycle is called an epoch. With this method we always get the true gradient descent of the composite error function in weight-space. This approach is called batch or periodic updating [18]. In practice, however, some arbitrary update cycle is chosen to average over several inputs before updating the weights [16]. If the weights are updated after presentation of p patterns, the relation u = p·t is satisfied.

In the second approach, the network weights are updated continuously after each training pattern is presented. This approach is called online or continuous updating [18]. This method might become trapped by a few atypical training patterns, but its advantage is that it eliminates the need to accumulate the error over a number of presented patterns and allows a network to learn a given task faster if there is a lot of redundant information in the training patterns. A disadvantage is that it requires more weight update steps. The issue of weight updating time has been controversial, with contention between researchers.

2.1.4. Terminating conditions
The learning phase is terminated when the total error is sufficiently small [19]:

E = \sum_p E_p = \sum_p \sum_{i=1}^{N_L} (y_{pi}^{(L)} - d_{pi})^2 < \varepsilon,   (9)

where the index p ranges over the set of training input patterns, i ranges over the set of output units, and E_p represents the error on pattern p. \varepsilon is the user-defined, sufficiently small value.

2.2. Systolic algorithms for neural network implementations

The computation of the multilayer neural networks described above can be represented by a series of matrix-by-vector multiplications interleaved with the non-linear activation functions. The matrix represents the weights associated to the connections, and the vector represents the input patterns or the activation levels of neurons, as shown in Fig. 1. The resultant vector of a matrix-by-vector multiplication is applied to the non-linear activation function.

The classical systolic algorithms for matrix-by-vector multiplication can be used to implement neural networks. The systolic algorithms for the multiplication of a 4-by-4 matrix and a 4-by-1 vector are shown in Fig. 2. Figure 2(a) and Fig. 2(b) show the one-dimensional algorithms and the corresponding linear systolic arrays, and Fig. 2(c) and Fig. 2(d) show the two-dimensional systolic algorithms and the corresponding systolic arrays. In the one-dimensional arrays, the elements of the matrix are fed into the array in which each cell contains one element of the vector, as shown in Fig. 2(b), or the elements of the vector are also fed into the cells, as shown in Fig. 2(a). On the other hand, in the two-dimensional arrays, each cell contains one element of the matrix and the vectors are fed into the array.

In the systolic array in the form of Fig. 2(c), the matrix elements are set into the basic cells of the two-dimensional systolic array and the vector elements propagate vertically, step by step, from the top cell to the bottom cell of the array. Each basic cell is a computational element that performs the multiplication x_i·w_{ji}, adds it to the partial sum received from the left cell, and sends the result to the right cell.

The systolic algorithm in the form of Fig. 2(d) is presented by Kung and Hwang [8], in which the data flow can be simplified by re-ordering the weight elements so that the outputs of each layer are aligned with the inputs of the next layer. This has the advantage that the load of each processor can be balanced when the array is partitioned vertically and mapped onto a linear array. They proposed a mapping of the algorithm onto a linear array, and a similar approach has been studied in [9]. However, since the numbers of neurons in adjacent layers of many practical multilayer neural networks differ greatly, the application of this algorithm to such networks is inefficient when we are going to organize a physically two-dimensional systolic array.
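The data flow of the Fig. 2(c)-style array can be checked with a short cycle-level simulation. The helper and register names below are ours; the schedule (weights stationary in the cells, the vector skewed down the columns, partial sums moving left to right) follows the description above, and the final assertion compares the streamed result with an ordinary matrix-by-vector product.

```python
import numpy as np

def systolic_matvec(W, x):
    """Cycle-level sketch of the weight-stationary array of Fig. 2(c):
    x_i steps down column i, partial sums step left to right along row j,
    and y_j = sum_i w_ji * x_i leaves the rightmost cell of row j."""
    W = np.asarray(W, dtype=float)
    x = np.asarray(x, dtype=float)
    rows, cols = W.shape
    north = np.zeros((rows, cols))   # x value held by each cell this cycle
    west = np.zeros((rows, cols))    # partial sum held by each cell this cycle
    y = np.zeros(rows)
    for t in range(rows + cols - 1):
        # skewed input: x_i enters the top of column i at cycle i
        north[0, :] = [x[i] if t == i else 0.0 for i in range(cols)]
        east = west + W * north      # every cell multiplies and accumulates
        south = north                # x values are passed through unchanged
        for j in range(rows):        # y_j leaves row j at cycle (cols - 1) + j
            if t == cols - 1 + j:
                y[j] = east[j, -1]
        # communication: partial sums shift one cell right, x values one cell down
        west = np.hstack([np.zeros((rows, 1)), east[:, :-1]])
        north = np.vstack([np.zeros((1, cols)), south[:-1, :]])
    return y

W = np.arange(12, dtype=float).reshape(3, 4)
x = np.array([1.0, 2.0, 3.0, 4.0])
assert np.allclose(systolic_matvec(W, x), W @ x)
```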

(The figure illustrates the layer-to-layer computation y^{(n+1)} = W y^{(n)} for a three-neuron example.)

Fig. 1. Multilayer neural network: The computation is represented by a series of matrix-by-vector multiplications interleaved with the non-linear activation functions.

2.3. Exploiting the inherent parallelism of neural networks

The massive parallelism inherent in neural networks may be classified into three types and can be exploited in systolic array implementations.

2.3.1. Spatial parallelism
This type of parallelism is exploited by executing the operations required in each layer on multiple processors. In the spatial parallelism, there are three levels of parallelism exploitation:

S1 Partitioning the neurons of each layer into groups.
S2 Partitioning the operations of each neuron into sum-of-products operations and the non-linear function.
S3 Partitioning the sum-of-products operations into several groups.

The parallelism S1 is called network partitioning [3], and is exploited in most parallel implementations of neural networks [20]. The partitioned operations mentioned above can be executed on different processors in parallel. Most neural network simulators implemented on MIMD computers exploit parallelisms S1 and S2, but parallelism S3 is not exploited much due to the communication overhead.

2.3.2. Training set parallelism
This type of parallelism is also called data partitioning [3], and can be exploited by partitioning the training patterns and executing them on multiple processors independently. This derives from the fact that the backpropagation and similar algorithms provide for the linear combination of the individual contributions made by each pattern to the adjustment of the network's weights. The weights in each processor are updated periodically by all-to-all broadcasting of the weight increments for the subset of the entire training pattern set allocated to each processor. The training set parallelism has been exploited in [3, 5, 6, 21].
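As a software analogue of the training set parallelism just described, the sketch below (names ours, network simplified to a single linear layer for brevity) checks the linearity property the argument relies on: the weight increments accumulated independently by two array copies on two halves of the training set sum to exactly the increments of a single serial presentation.

```python
import numpy as np

def replica_increments(W, patterns, targets):
    """Weight increments one array copy would accumulate for its subset of the
    training set (Eq. 7 terms, reduced here to one linear layer for brevity)."""
    dW = np.zeros_like(W)
    for x, d in zip(patterns, targets):
        y = W @ x                      # forward pass (linear for this illustration)
        delta = y - d                  # output error, Eq. 4
        dW += np.outer(delta, x)       # increment accumulation, Eq. 7
    return dW

rng = np.random.default_rng(1)
W = rng.normal(size=(2, 5))
X = rng.normal(size=(8, 5))
D = rng.normal(size=(8, 2))

# Training set parallelism: each of two copies handles half of the patterns,
# and the increments are summed (the all-to-all broadcast step) before Eq. 8.
dW_parallel = replica_increments(W, X[:4], D[:4]) + replica_increments(W, X[4:], D[4:])
dW_serial = replica_increments(W, X, D)
assert np.allclose(dW_parallel, dW_serial)   # linearity of the per-pattern contributions
```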

Fig. 2. Systolic algorithms for matrix-by-vector multiplication and their corresponding systolic arrays: (a), (b) one-dimensional systolic arrays; (c), (d) two-dimensional systolic arrays.

2.3.3. Pattern pipelined parallelism
This type of parallelism can be exploited by multiple input pattern pipelining, in which processors that have completed their part of the operations required for one pattern start execution for the next pattern while the other processors are still processing the previously presented training patterns. This parallelism can be applied not only to the training patterns in the learning phase, but also to the input patterns in the recalling phase.

P1 Pipelining of input patterns in the recalling phase.
P2 Pipelining of training patterns in the learning phase.
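The effect of this pipelining can be illustrated with a toy schedule: if one new pattern enters an idealised chain of pipeline stages every cycle, p patterns drain in (number of stages) + p - 1 cycles rather than p times the per-pattern latency. The stage granularity and the helper below are ours; the cycle-accurate figures for the actual array are derived in Section 4.

```python
def pipeline_schedule(num_stages, num_patterns):
    """Which pattern each stage works on at each cycle, assuming one new
    pattern enters the pipeline per cycle (None = stage idle)."""
    total_cycles = num_stages + num_patterns - 1
    return [[t - s if 0 <= t - s < num_patterns else None
             for s in range(num_stages)]
            for t in range(total_cycles)]

for cycle, stages in enumerate(pipeline_schedule(num_stages=3, num_patterns=5)):
    print(f"cycle {cycle}: patterns in stages = {stages}")
# 5 patterns finish after 3 + 5 - 1 = 7 cycles instead of 3 * 5 = 15, which is
# the behaviour captured by Eq. 13 (C_p = C_single + (p - 1)) in Section 4.
```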

The depth of pipelining is proportional to the number of neurons in the network, and the exploitation of this parallelism is effective when the size of the training set is much larger than the number of neurons in the network.

2.3.4. Exploiting the parallelisms in systolic arrays
The exploitation of the training set parallelism is trivial. We need duplicated copies of a systolic array and a broadcast network, and the training set is partitioned so that every systolic array processes a subset of it. This is useful when the number of training patterns is very large. In [5], a toroidal mesh array is proposed to exploit the training set parallelism by connecting ring systolic arrays that each hold a duplicated copy of the weights.

The spatial parallelism and the pattern pipelined parallelism can be exploited in systolic array implementations as shown in Fig. 3. The spatial parallelism S3 is too fine-grained to be exploited in the systolic arrays shown in Fig. 2. It has been exploited on the MasPar MP-1, a massively parallel SIMD computer [7]. The exploitation of the pattern pipelined parallelism P2 is presented in Section 3.

Fig. 3. Exploitation of spatial parallelism and pattern pipelined parallelism in systolic array implementations.

Table 1
Systolic approaches to neural networks

Approach | Dimension | Researchers (system) | Models studied | Parallelisms exploited | Ref.
Mapping | 1-D systolic array | Pomerleau et al. (Warp) | Backpropagation | Spatial S1, S2, Training | [3]
Mapping | 1-D systolic array | Millán & Bofill (transputers) | Backpropagation | Spatial S1, Training | [5]
Mapping | 1-D systolic array | Mann & Haykin (Warp) | Kohonen | Training | [6]
Mapping | 2-D systolic array | Borkar et al. (iWarp) | Backpropagation | Spatial S1, S2, Training | [4]
Mapping | 2-D systolic array | Chinn et al. (MasPar MP-1) | Backpropagation | Spatial S1, S2, S3 | [7]
Programmable systolic array | 1-D systolic array | Kung & Hwang (ring systolic) | Hopfield | Spatial S1 | [8]
Programmable systolic array | 1-D systolic array | Kato et al. (Sandy/8) | Backpropagation | Spatial S1 | [9]
Programmable systolic array | 2-D systolic array | Kung & Hwang (cascaded ring) | Backpropagation | Spatial S1, Pattern P1 | [8]
Programmable systolic array | 2-D systolic array | Ramacher et al. (neuroemulator) | General model | Spatial S1, S2, Pattern P1 | [10]
Programmable systolic array | 2-D systolic array | Hiraiwa et al. (GCN) | Back-prop., Kohonen, Boltzmann | Spatial S1, Training | [11]
VLSI | 2-D systolic array | Blayo & Hurat (APLYSIE) | Hopfield | Spatial S1, S2, Pattern P1 | [12]
VLSI | 2-D systolic array | Blayo & Lehmann (GENES H8) | Hopfield, Kohonen | Spatial S1, S2, Pattern P1 | [13]
VLSI | 2-D systolic array | Kwan & Tsang (combined array) | Backpropagation | Spatial S1, S2, Pattern P1 | [14]
VLSI | 2-D systolic array | This design (pipelined array) | Backpropagation | Spatial S1, S2, Pattern P1, P2 | this paper
2.4. Systolic approaches to neural network implementations

There have been several research efforts on systolic algorithms and systolic array structures to implement neural networks. The approaches can be classified into three groups as follows:

- Mapping the systolic algorithms for neural networks onto existing SIMD and MIMD computers [3-7].
- Design of a programmable systolic array for general neural networks [8-11].
- Design of a VLSI systolic array dedicated to specific neural network models [12-14].

These approaches are summarized in Table 1. As shown in Table 1, the two-dimensional approaches can exploit more parallelisms. Our approach is distinguished from the other approaches in that it exploits the pattern pipelined parallelism P2 described above.

3. A systolic array exploiting the pattern pipelined parallelism

3.1. Exploiting the pattern pipelined parallelism

To exploit the pattern pipelined parallelism in the learning phase, the forward and backward passes should be executed in parallel, sharing the weights. In this design, we use the systolic algorithms in the form of Fig. 2(c). In Fig. 4, the systolic array organizations between two adjacent layers are shown. One layer consists of 5 neurons and the other layer consists of 3 neurons. Fig. 4(a) is for the forward pass, Fig. 4(b) is for the backward pass.

Each basic cell w_{ji} is a computational element which contains the weight value and the weight increment associated to the connection between neuron i of a layer and neuron j of the next layer. In Fig. 4(a), each basic cell in a row of the array receives the input patterns or the activations propagated from the previous array and computes the sum-of-products operations shown in Eq. 2. These sums of products are fed into the next array after the activation function is applied. This systolic algorithm can equally be adopted for the backward pass.

In Fig. 4(b), which represents an array from layer n+1 to layer n of a network, each row of the array receives the errors propagated from the previous array and computes the sum-of-products operations shown in Eq. 5. These sums of products are fed into the next array after the derivative of the activation function is applied.

The arrays in Fig. 4(a) and Fig. 4(b) can be combined into one array as shown in Fig. 5(a). The weight matrix is shared between the forward and backward passes, and the data paths of the forward-propagated activations and the back-propagated errors are disjoint. Each basic cell executes the sum-of-products operations shown in Eqs. 2 and 5 and the weight updating operations shown in Eqs. 7 and 8. The θf unit executes the thresholding and the activation function shown in Eqs. 2 and 3, and the g unit executes the derivative of the activation function shown in Eq. 6. The g unit has a FIFO queue to hold the outputs of a neuron that are fed back to the basic cells to compute the weight updates in the backward pass. The operations of the basic cell that can be executed in parallel are shown in Fig. 5(b).

Fig. 4. Systolic array exploiting the pattern pipelined parallelism: (a) for the forward pass; (b) for the backward pass.

Basic cell operations (Fig. 5(b)):  ES = NW × W + WS;  SW = NW;  SE = WN × W + NE;  EN = WN.

Fig. 5. Systolic array organization between two adjacent layers of multilayer neural networks: (a) combined design for forward and backward passes; (b) basic cell operations.
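A functional model of the recovered cell operations makes the shared-weight, disjoint-path organisation explicit. The port names follow Fig. 5(b); the Python class itself is only an illustrative sketch.

```python
from dataclasses import dataclass

@dataclass
class BasicCell:
    """Combined basic cell of Fig. 5(b): the weight is stationary, the forward
    activation enters at NW and leaves at SW, the forward partial sum enters
    at WS and leaves at ES, the back-propagated error enters at WN and leaves
    at EN, and the backward partial sum enters at NE and leaves at SE."""
    w: float

    def step(self, NW, NE, WN, WS):
        ES = NW * self.w + WS   # forward pass: accumulate weight x activation (Eq. 2)
        SW = NW                 # pass the activation on to the next cell
        SE = WN * self.w + NE   # backward pass: accumulate weight x error (Eq. 5)
        EN = WN                 # pass the error on to the next cell
        return ES, EN, SW, SE

# A row of cells accumulating one forward dot product, as in one row of Fig. 5(a):
cells = [BasicCell(w) for w in (0.1, -0.2, 0.3)]
activations = [1.0, 2.0, 3.0]
partial = 0.0
for cell, a in zip(cells, activations):
    partial, _, _, _ = cell.step(NW=a, NE=0.0, WN=0.0, WS=partial)
print(partial)   # 0.1*1 - 0.2*2 + 0.3*3 = 0.6
```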

3.2. Systolic array organization

In Fig. 6, an example of a systolic array organization for a (5-3-2) 3-layer network is shown. The elements of two weight matrices are set into the two-dimensional arrays: the 3-by-5 weight matrix represents the weights associated to the connections between the input layer and the hidden layer, and the 2-by-3 weight matrix represents the weights associated to the connections between the hidden layer and the output layer. The second matrix is transposed in the figure.

In the forward pass, the data flow is denoted by the bold lines. The elements of the input patterns are fed into the first array step by step, from the leftmost basic cell to the rightmost basic cell, and propagated vertically from the top cell to the bottommost cell. In the first array, each cell w_{ji} performs the product of w_{ji} and the input from the cell above, adds it to the subtotal received from the left cell, and then sends it to the right cell. After the total sum is calculated in the rightmost basic cells of the first array, each value is fed into the θf unit, where the thresholding and the activation function are applied, and then into the second array. In the second array, each cell w_{ji} performs the product of w_{ji} and the input from the left cell, adds it to the subtotal received from the cell above, and then sends it to the cell below. After the total sum is calculated in the bottommost basic cells of the second array, each value is fed into the θf unit.

At the output of the network, the difference between the actual output and the desired output, as shown in Eq. 4, is computed for the presented pattern. The differences are accumulated to get the total error as shown in Eq. 9. If the value is sufficiently small, the learning phase ends; otherwise the backward pass starts. The data flow of the backward pass is denoted by the thin lines. Each error value is applied to the derivative of the activation function and fed into the second array from above through the wrap-around connection.

Each cell w_{ji} in the second array performs the product of w_{ji} and the input from the cell above, adds it to the partial sum received from the left cell, and then sends it to the right cell.

Fig. 6. Example of a systolic array organization for a (5-3-2) 3-layer neural network.

After the total sum is calculated, it is fed into the g unit, the derivative of the activation function is applied, and the result is fed into the first array from the left. In the first array the error output need not be calculated; the error inputs from the second array are propagated to the cells to the right and used to update the weight values.

All of these operations are pipelined, so the continuously generated output values of a neuron must be stored to be fed back for updating the weights. In this design, these values are stored in the g units. As mentioned in Section 2.1.3, the weights are updated over several training patterns. While the weights are being updated, the feeding of training patterns is stopped and the pipeline is stalled.

A systolic array for a network with an arbitrary number of layers can be organized. In Fig. 7, an example of a systolic array organization for a (5-3-3-2) 4-layer network is shown.

3.3. Basic cell architecture

The internal data path of the basic cell is shown in Fig. 8(a). It consists of three multipliers, three adders, and two registers for the weight value and the weight increment. The basic cell operations are divided into two phases, and in each phase three multiplications and three additions are executed in parallel, as shown in Fig. 8(b).

Fig. 7. Example of a systolic array organization for a (5-3-3-2) 4-layer neural network.
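Since one basic cell holds one connection weight, the array size for an arbitrary topology follows directly from the layer sizes; the small helper below (ours) gives the cell counts for the examples of Figs. 6 and 7 and for the NETtalk network used in Section 4.

```python
def cells_required(layer_sizes):
    """Number of basic cells: one per connection weight between adjacent
    layers, i.e. the sum of N_k * N_(k+1) (threshold links not counted)."""
    return sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))

print(cells_required([5, 3, 2]))        # 21 cells for the Fig. 6 example
print(cells_required([5, 3, 3, 2]))     # 30 cells for the Fig. 7 example
print(cells_required([203, 60, 26]))    # 13740 for NETtalk, plus 86 threshold links
```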

4. Performance evaluation

Let us consider an L-layer network consisting of N_k neurons at the kth layer. The required cycles for the forward and backward passes of this design are denoted by Eqs. 10 and 11:

C_{forward} = \sum_{i=1}^{L} N_i + L - 2,   (10)

C_{backward} = \sum_{i=1}^{L} N_i + L - 1.   (11)

When the forward and backward passes are pipelined, the required cycles for a single training pattern are denoted by Eq. 12:

C_{single} = (C_{forward} + C_{backward}) - N_L + 1.   (12)

Equation 12 shows only the effects of the spatial parallelism and the forward/backward pipelining.

Basic cell operations (Fig. 8(b)):
Phase 1:  T1 = NW × W;  T2 = WN × W;  T3 = WN × NE;  SW = NW;  EN = WN.
Phase 2:  ES = WS + T1;  SE = NE + T2;  ΔW = ΔW + T3.

Fig. 8. The basic cell architecture: (a) the internal data path; (b) the basic cell operations.
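Reading Fig. 8(b) as phase 1 forming the three products T1-T3 and phase 2 performing the three accumulations gives the sketch below. This two-register split, and the pairing of the third multiplier with the weight-increment accumulation of Eq. 7, is our interpretation of the garbled panel; in the real cell the multipliers of one operand set and the adders of the previous set work in the same phase, which is how three multiplications and three additions proceed in parallel.

```python
class PipelinedCell:
    """Two-phase basic cell after Fig. 8: phase 1 uses the three multipliers,
    phase 2 the three adders, with T1-T3 buffering the products in between."""
    def __init__(self, w):
        self.w = w
        self.dw = 0.0                       # accumulated weight increment (Eq. 7)
        self.t1 = self.t2 = self.t3 = 0.0

    def phase1(self, NW, WN, NE):
        self.t1 = NW * self.w               # forward product: activation x weight
        self.t2 = WN * self.w               # backward product: error x weight
        self.t3 = WN * NE                   # error x fed-back activation (our reading)
        return NW, WN                       # SW and EN: operands passed on unchanged

    def phase2(self, WS, NE):
        ES = WS + self.t1                   # forward partial sum leaving to the east
        SE = NE + self.t2                   # backward partial sum leaving to the south
        self.dw += self.t3                  # weight-increment accumulation
        return ES, SE
```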

The effect of the forward/backward pipelining on performance is not great, even when N_L is much greater than 1. However, it makes the pipelining of multiple training patterns possible. When multiple patterns are pipelined, the required cycles for learning p training patterns are denoted by Eq. 13:

C_p = C_{single} + (p - 1).   (13)

The speedup S_p obtained by the pipelining of multiple training patterns is denoted by Eq. 14. When p \gg C_{single}, the speedup can be approximated by C_{single}, which is the depth of the pipelining and depends only on the network size.

S_p = \frac{p \cdot C_{single}}{C_p} = \frac{p \cdot C_{single}}{C_{single} + (p - 1)}.   (14)

If the weight updating phase starts after the presentation of the pth training pattern, the presentation of the (p+1)th training pattern must be stopped until all the weights are updated. It takes C_{single} cycles to update the weights, but the weight updating and the presentation of the next patterns can be overlapped for N_1 + N_2 - 2 cycles. The number of cycles during which the feeding of training patterns is stopped for updating the weights is denoted by Eq. 15:

C_{update} = C_{single} - (N_1 + N_2 - 2) = 2 \sum_{i=1}^{L} N_i + 2L - (N_1 + N_2 + N_L).   (15)

When we update the weights t times during the presentation of p patterns, the total required cycles are denoted by Eq. 16:

C_{total} = C_p + C_{update} \times t,  (p > t);   C_{total} = C_{single} \cdot p,  (p = t).   (16)

The performance of neurocomputers is measured in millions of connection updates per second (MCUPS). Let N be the number of connections of the target neural network; then the MCUPS is calculated by Eq. 17:

MCUPS = \frac{\text{total connections calculated}}{\text{total elapsed time in } \mu\text{sec}} = \frac{N \cdot p}{C_{total} \times \text{cycle time}},   (17)

where the cycle time is determined as the maximum among the time required by one multiplication and one addition, the time for the thresholding and activation function (table lookup), and the time required to feed the input patterns continuously.
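The cycle counts and the MCUPS figure can be reproduced in a few lines; the helper below (ours) implements Eqs. 10-17, taking the 100 ns computation time as the cycle time and assuming the 50 ns communication is overlapped, which appears to be the assumption behind Table 2.

```python
def performance(layer_sizes, p, t=1, cycle_time_us=0.1):
    """MCUPS of Eq. 17 from the cycle counts of Eqs. 10-16, for p presented
    patterns and t weight updates (layer_sizes = [N_1, ..., N_L])."""
    L, total_n = len(layer_sizes), sum(layer_sizes)
    n1, n2, nl = layer_sizes[0], layer_sizes[1], layer_sizes[-1]
    c_forward = total_n + L - 2                               # Eq. 10
    c_backward = total_n + L - 1                              # Eq. 11
    c_single = c_forward + c_backward - nl + 1                # Eq. 12
    c_p = c_single + (p - 1)                                  # Eq. 13
    c_update = c_single - (n1 + n2 - 2)                       # Eq. 15
    c_total = c_single * p if p == t else c_p + c_update * t  # Eq. 16
    connections = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    return connections * p / (c_total * cycle_time_us)        # Eq. 17 (MCUPS)

# NETtalk, one pattern per update: about 247 MCUPS with the 13 740 weights
# (Table 2 quotes 248, counting the 86 threshold connections as well).
print(performance([203, 60, 26], p=1, t=1))
# (128-128-128) with 64K pipelined patterns and a single update: several
# hundred thousand MCUPS, the regime of the 324 000 entry in Table 2.
print(performance([128, 128, 128], p=65536, t=1))
```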

Speed measurements of several high-performance neural network implementations have been performed with NETtalk [16] as a benchmark [3, 9, 22-24]. NETtalk learns to transform written English text into a phonetic representation. The network consists of three layers (203-60-26), and the total number of connections is 13 826, including 86 connections to the true unit for the thresholds. Assuming we implement this design with a cycle time of 100 ns for computation and 50 ns for communication, the performance comparison with high-performance neurocomputers is shown in Table 2. The potentially high performance of this design shows that an array structure that can exploit the massive parallelism of neural networks plays an important role in VLSI systolic array designs.

Table 2
Performance comparison of neurocomputers

Machine | No. of PEs | No. of training patterns | MCUPS for NETtalk (203-60-26) | MCUPS for (128-128-128) network | MCUPS for (256-128-256) network | Ref.
MicroVax | 1 | | 0.008 | | | [22]
Sun 3/75 | 1 | | 0.01 | | | [22]
Vax 780 | 1 | | 0.027 | | | [22]
Sun 3/160 | 1 | | 0.034 | | | [22]
Ridge 32 | 1 | | 0.05 | | | [3]
Dec 8600 | 1 | | 0.06 | | | [22]
Convex | 1 | | 0.8 | | | [22]
Convex C-1 | 1 | | 1.8 | | | [3]
Cray-2 | 1 | | 7 | | | [22]
VACA | 4K | 1 | 51.4 | | | [23]
Sandy/8 | 256 | 1 | 116 | | | [9]
Warp | 10 | 1 | 17 | | | [3]
Warp | 20 | 1 | 32 | | | [23]
iWarp | 10 | 1 | 36 | | | [4]
DAP-610 (8 bits) | 4K | 1 | 160 | | | [24]
DAP-610 (16 bits) | 4K | 1 | 40 | | | [24]
CM-1 | 16K | 1 | 2.6 | | | [3]
CM-1 | 64K | 1 | 13 | | | [22]
CM-2 | 64K | 1 | 40 | | | [25]
CM-2 | 64K | 64K | 1300 | | | [21]
This design | 13K | 1 | 248 | | |
This design | 32K | 64K | | 324 000 | |
This design | 64K | 1 | | | 637 |

5. Conclusions

In this paper, a systolic array for the backpropagation neural network that executes the forward and backward passes in parallel and exploits the pipelining of multiple training patterns in each pass has been presented. The estimated performance of this design shows that the pipelining of multiple training patterns is an important factor in VLSI implementations of artificial neural networks.

From the hardware implementation point of view, an extendable and reconfigurable architecture design to be used as a generic building block for neural network implementation is in demand. Even with the future 0.3-μm technology, single-chip direct implementation of a complete neural network is difficult, since applications may require large quantities of neurons and the size of the neural networks required in the application field becomes larger. To implement a large network with the systolic design proposed in this paper, an efficient mapping method that still allows the exploitation of the pattern pipelined parallelism is required as a further study.

References

[1] DARPA, Neural Network Study (AFCEA International Press, 1988).
[2] H.T. Kung, Why systolic architectures?, IEEE Comput. (Jan. 1982) 37-46.
[3] D.A. Pomerleau, G.L. Gusciora, D.S. Touretzky and H.T. Kung, Neural network simulation at Warp speed: How we got 17 million connections per second, Proc. IEEE Internat. Conf. on Neural Networks, Vol. II, San Diego, CA (Jul. 1988) 143-150.
[4] S. Borkar et al., iWarp: An integrated solution to high-speed parallel computing, Proc. Supercomputing '88, Orlando, FL, IEEE Computer Society (Nov. 1988) 330-339.
[5] J.R. Millán and P. Bofill, Learning by back-propagation: A systolic algorithm and its transputer implementation, Neural Networks (3) (Jul. 1989) 119-137.
[6] R. Mann and S. Haykin, A parallel implementation of Kohonen feature maps on the Warp systolic computer, Proc. Internat. Joint Conf. on Neural Networks, Vol. II, Washington, DC (Jan. 1990) 84-87.
[7] G. Chinn, K.A. Grajski, C. Chen, C. Kuszmaul and S. Tomboulian, Systolic array implementations of neural nets on the MasPar MP-1 massively parallel processor, Proc. Internat. Joint Conf. on Neural Networks, Vol. II, San Diego, CA (Jun. 1990) 169-173.
[8] S.Y. Kung and J.N. Hwang, Parallel architectures for artificial neural nets, Proc. IEEE Internat. Conf. on Neural Networks, Vol. II, San Diego, CA (Jul. 1988) 165-172.
[9] H. Kato, H. Yoshizawa, H. Iciki and K. Asakawa, A parallel neurocomputer architecture towards billion connection updates per second, Proc. Internat. Joint Conf. on Neural Networks, Vol. II, Washington, DC (Jan. 1990) 51-54.
[10] U. Ramacher and W. Raab, Fine-grain system architectures for systolic emulation of neural algorithms, Proc. Internat. Conf. on Application Specific Array Processors (IEEE Computer Society Press, Sep. 1990) 554-566.
[11] A. Hiraiwa, M. Fujita, S. Kurosu, S. Arisawa and M. Inoue, Implementation of ANN on RISC processor array, Proc. Internat. Conf. on Application Specific Array Processors (IEEE Computer Society Press, Sep. 1990) 677-688.
[12] F. Blayo and P. Hurat, A VLSI systolic array dedicated to Hopfield neural network, VLSI for Artificial Intelligence (Kluwer, Dordrecht, 1989) 255-264.
[13] F. Blayo and C. Lehmann, A systolic implementation of the self organization algorithm, Proc. Internat. Neural Network Conf., Vol. II, Paris (Jul. 1990) 600.
[14] H.K. Kwan and P.C. Tsang, Systolic implementation of multi-layer feed-forward neural network with back-propagation learning scheme, Proc. Internat. Joint Conf. on Neural Networks, Vol. II, Washington, DC (Jan. 1990) 84-87.
[15] D.E. Rumelhart, G.E. Hinton and R.J. Williams, Learning internal representations by error propagation, in: Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, Foundations (MIT Press, Cambridge, MA, 1986) 318-362.
[16] T.J. Sejnowski and C.R. Rosenberg, Parallel networks that learn to pronounce English text, Complex Syst. (1987) 145-168.
[17] R. Hecht-Nielsen, Neurocomputing: Picking the human brain, IEEE Spectrum (Mar. 1988) 36-41.
[18] S.E. Fahlman, Faster-learning variations on back-propagation: An empirical study, Proc. 1988 Connectionist Models Summer School (Morgan Kaufmann, Los Altos, CA, Jun. 1988) 38-51.
[19] J.L. McClelland and D.E. Rumelhart, Explorations in Parallel Distributed Processing: A Handbook of Models, Programs, and Exercises (MIT Press, Cambridge, MA, 1988) 121-159.
[20] H. Yoon, J.H. Nang and S.R. Maeng, Neural networks on parallel computers, in: B. Souček, ed., Neural and Intelligent Systems Integration (Wiley, New York, 1991).
[21] A. Singer, Exploiting the inherent parallelism of artificial neural networks to achieve 1300 million interconnects per second, Proc. Internat. Neural Network Conf., Vol. II, Paris (Jul. 1990) 656-660.
[22] G. Blelloch and C.R. Rosenberg, Network learning on the Connection Machine, Proc. 10th Internat. Joint Conf. on Artificial Intelligence (IJCAI '87), Milano, Italy (Aug. 1987) 323-326.
[23] B. Faure and G. Mazare, Implementation of back-propagation on a VLSI asynchronous cellular architecture, Proc. Internat. Neural Network Conf., Vol. II, Paris (Jul. 1990) 631-634.
[24] F.J. Núñez and J.A.B. Fortes, Performance of connectionist learning algorithms on 2-D SIMD processor arrays, in: D.S. Touretzky, ed., Advances in Neural Information Processing Systems (Morgan Kaufmann, Los Altos, CA, 1989) 810-817.
[25] X. Zhang, M. McKenna, J.P. Mesirov and D.L. Waltz, An efficient implementation of the back-propagation algorithm on the Connection Machine CM-2, in: D.S. Touretzky, ed., Advances in Neural Information Processing Systems (Morgan Kaufmann, Los Altos, CA, 1989) 801-809.

Jai-Hoon Chung received the B.S. degree in Computer Engineering from the Seoul National University, Korea, in 1986, and the M.S. degree in Computer Science from the Korea Advanced Institute of Science and Technology in 1988. He is currently working towards the Ph.D. degree in Computer Science at the Korea Advanced Institute of Science and Technology. His research interests include parallel computer architecture, multicomputer networks, VLSI architecture, and neural networks.

Hyunsoo Yoon received the B.S. degree in Electronics Engineering from the Seoul National University, Korea, in 1979, the M.S. degree in Computer Science from the Korea Advanced Institute of Science and Technology, in 1981, and the Ph.D. degree in Computer and Information Science from the Ohio State University, Columbus, Ohio, USA, in 1988. From 1978 to 1980, he was with the Tongyang Broadcasting Company, Korea, from 1980 to 1984, with the Samsung Electronics Company, Korea, and from 1988 to 1989, with the AT&T Bell Labs, as a Member of Technical Staff. Since 1989 he has been a faculty member of the Department of Computer Science of the Korea Advanced Institute of Science and Technology. His research interests include parallel computer architectures, interconnection networks, parallel computing, and neural networks.

Seung Ryoul Maeng received the B.S. degree in Electronics Engineering from the Seoul National University, Korea, in 1977, and the M.S. and Ph.D. degrees in Computer Science from the Korea Advanced Institute of Science and Technology, in 1979 and 1984, respectively. Since 1984 he has been a faculty member of the Department of Computer Science of the Korea Advanced Institute of Science and Technology. From 1988 to 1989, he was with the University of Pennsylvania as a visiting scholar. His research interests include parallel computer architecture, dataflow machines, vision architecture, and neural networks.
