CNN On FPGA

CESAR: Emulating Cellular Networks on FPGA
Jens Müller, Ralf Becker, Jan Müller and Ronald Tetzlaff

Institute of Fundamentals of Electrical Engineering, Faculty of Electrical and Computer Engineering, Technische Universität Dresden, 01062 Dresden, Germany. Email: jens.mueller1@tu-dresden.de

Abstract: Complex dynamical systems offer entirely new possibilities for the development of groundbreaking data processing methods. In the domains of image and video processing, locally coupled cellular array computers based on Cellular Nonlinear Networks (CNN) accelerate the computation of large amounts of data in real time, owing to their inherent massive parallelism. Current VLSI implementations, however, have several distinct drawbacks. The computational accuracy of most currently available systems is limited to 8 bit, and the volatile, capacitively stored state values of analogue realisations often lead to errors when multiple tasks are processed sequentially. Moreover, these systems hardly allow a CNN program code to be run that provides the full functionality of a CNN-UM. In this contribution, the novel CESAR architecture is proposed for the digital emulation of a time-discrete CNN-UM. The programmable array computer facilitates the powerful computation of consecutive CNN operations and the cost-efficient implementation of several application-specific configurations with variable network size and data representation. The presented architecture retains the inherent parallel paradigm of CNN and assigns one processing element to each cell of the network. The cell outputs are coupled and stored locally, thus minimising data exchange with external structures and maximising the computation speed. The internal fixed-point multiplications are accelerated by using on-chip DSP resources provided by current FPGAs. By this means, a CNN-based embedded system with 128 cells, a 3 × 3 neighbourhood and 18 bit data representation was implemented on a Xilinx Virtex-5 FPGA.
I. INTRODUCTION

The continuously growing relevance of multi-core processors and of distributed computation on multi-node clusters increasingly demands sophisticated parallel algorithms that exploit the full power of the underlying hardware. In this context, CNN are regarded not only as a powerful array of cellular processors but also as a paradigm of inherently parallel computation. This paradigm has been successfully applied to real-time image and video processing [1], e.g. for in-line process control [2], and has been widely studied for modelling and simulating complex systems [3]. Since the introduction of the CNN by Chua and Yang in 1988 [4] and its extension to the CNN Universal Machine (CNN-UM) in 1993 [5], numerous analogue and digital designs have been proposed, mainly for image processing purposes [6]–[8]. Although they offer outstanding performance for binary and gray-scale operations, these systems provide only low accuracy (usually 8 bit) and inflexible designs. Especially for the simulation of PDEs [9] and time series analysis [10], full configurability is desired with respect to the network dimension, the coupling of adjacent cells, and the representation and precision of state values. The emulation of a discrete-time CNN model on reconfigurable hardware combines the performance of VLSI implementations with the flexibility and computational accuracy of software simulations [9], [11]. In this paper, a previously proposed architecture implemented on an FPGA [10] is extended by feedforward couplings to cell inputs and an efficient handling of consecutive operations. Thus, we provide the functionality of the CNN-UM, i.e. the processing of a CNN program code.

II. ARCHITECTURE

A. Processing element

To retain the powerful structure of the genuine CNN paradigm, a cellular array of processing elements was designed, targeting an implementation on reconfigurable hardware. As opposed to analogue VLSI approaches, the digital cell state is represented by a bit vector, which increases the effort needed to wire the interconnections. The architecture is currently restricted to a 3 × 3 neighbourhood. A wider neighbourhood is feasible yet unfavourable, since the number of interconnections grows quadratically with the neighbourhood radius $r$: a $(2r+1) \times (2r+1)$ neighbourhood couples each cell to $(2r+1)^2 - 1$ neighbours. We apply an Euler forward discretisation with step size $h$ and assume a discrete version of the full-signal-range model [12], which limits the range of cell states to $[-1; 1]$. From the well-known CNN state equation [4] with states $x_{ij}(n)$ at iteration $n$, inputs $u_{ij}$, feedback and feedforward coupling weights $a_{kl}$ and $b_{kl}$, respectively, and bias $z$, we obtain the discrete iteration step

$$x_{ij}(n+1) = N\Big( x_{ij}(n) + \sum_{k,l \in S_r} \hat{a}_{kl}\, x_{i+k,j+l}(n) + \hat{w}_{ij} \Big) \tag{1}$$

with

$$N(x) = \begin{cases} -1, & x < -1 \\ x, & -1 \le x \le 1 \\ 1, & x > 1 \end{cases} \tag{2}$$

and

$$\hat{w}_{ij} = \sum_{k,l \in S_r} \hat{b}_{kl}\, u_{i+k,j+l} + \hat{z} = \text{const}. \tag{3}$$
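The iteration (1) to (3) can be sketched in a few lines of plain Python. This is a minimal floating-point model, not the 18 bit fixed-point hardware; it assumes the step-size-scaled weights of this section ($\hat{a}_{0,0} = h(a_{0,0} - 1)$, all other weights and the bias multiplied by $h$) and a fixed zero boundary value. All function names and the example template are illustrative and do not come from the paper.

```python
def saturate(x):
    """Non-linearity N of Eq. (2): clip a state to the range [-1, 1]."""
    return max(-1.0, min(1.0, x))


def scale_templates(a, b, z, h):
    """Fold the step size h into the 3x3 templates a, b and the bias z."""
    a_hat = [[h * a[r][c] for c in range(3)] for r in range(3)]
    a_hat[1][1] = h * (a[1][1] - 1.0)  # centre weight: h * (a_00 - 1)
    b_hat = [[h * b[r][c] for c in range(3)] for r in range(3)]
    return a_hat, b_hat, h * z


def weighted_sum(weights, grid, i, j, boundary=0.0):
    """3x3 weighted sum around cell (i, j).

    Cells outside the grid read the fixed value `boundary` (assumption).
    """
    rows, cols = len(grid), len(grid[0])
    total = 0.0
    for dk in (-1, 0, 1):      # row offset k
        for dl in (-1, 0, 1):  # column offset l
            r, c = i + dk, j + dl
            inside = 0 <= r < rows and 0 <= c < cols
            value = grid[r][c] if inside else boundary
            total += weights[dk + 1][dl + 1] * value
    return total


def run_operation(a, b, z, u, x0, h, n_it):
    """Run n_it Euler iterations of Eq. (1) and return the final states."""
    a_hat, b_hat, z_hat = scale_templates(a, b, z, h)
    rows, cols = len(x0), len(x0[0])
    # Eq. (3): the input term is constant, so it is computed once.
    w_hat = [[weighted_sum(b_hat, u, i, j) + z_hat for j in range(cols)]
             for i in range(rows)]
    x = [row[:] for row in x0]
    for _ in range(n_it):
        # Every cell is updated from the states of the previous iteration.
        x = [[saturate(x[i][j] + weighted_sum(a_hat, x, i, j) + w_hat[i][j])
              for j in range(cols)] for i in range(rows)]
    return x


# Illustrative template: self-feedback only, no input coupling, zero bias.
A = [[0, 0, 0], [0, 2, 0], [0, 0, 0]]
B = [[0, 0, 0], [0, 0, 0], [0, 0, 0]]
result = run_operation(A, B, 0.0, u=[[0.0, 0.0]], x0=[[0.5, -0.3]],
                       h=0.1, n_it=50)
print(result)
```

With this self-feedback template each state is pushed away from zero until it saturates, so positive initial states settle at +1 and negative ones at -1. Computing the input term once before the loop mirrors the split into two sub-operations described in this section.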
The modified coupling weights are $\hat{a}_{0,0} = h\,(a_{0,0} - 1)$ and $\hat{a}_{kl} = h\,a_{kl}$ for $(k,l) \neq (0,0)$, together with $\hat{b}_{kl} = h\,b_{kl}$ and the modified bias $\hat{z} = h z$; the non-linearity $N$ implements the full-signal-range model in practice. Since the inputs and the bias are assumed to be constant during a single CNN operation, the expression $\hat{w}_{ij}$ can be calculated in advance of the actual iterations (1). Based on this operation scheme, we developed the architecture of a processing element representing one network cell, as depicted in Fig. 1. Both the input multiplications (3) and the state calculations (1) are computed in the same core in order to use the required hardware efficiently. Hence, an operation is divided into two sub-operations, as presented in the schedule diagrams in Fig. 2. Initially, after storing $\hat{z}$ in the accumulator, the input couplings are processed and $\hat{w}_{ij}$ is cached in the corresponding register (Fig. 2(a)). Afterwards, the feedback couplings and the non-linearity are calculated for each Euler iteration, updating $x_{ij}(n+1)$. Fig. 2(b) outlines one iteration. Three-stage pipelined multipliers are used to improve the throughput of the PEs and thus to speed up the whole CNN operation. The calculation of $\hat{w}_{ij}$ and the first iteration overlap in a pipelined fashion, further reducing the execution time. Overlapping of iterations is not possible, since the results $x_{ij}(n)$ from neighbouring cells are required to start the next iteration.

Fig. 1. Processing element architecture.

Fig. 2. Time schedule for the processing of a 3 × 3 neighbourhood: (a) calculation of $\hat{w}_{ij}$, (b) calculation of next state $x_{ij}(n+1)$.

Fig. 3. Interconnections of cell $C_{ij}$ to its neighbours.

Fig. 4. Instruction register (IR) format. Fields, from bit 31 downwards: not used, N_it, U_s, X(0)_s, BC.

B. Multi-operation processing

The CESAR architecture enables highly efficient computation of sequential CNN operations, as required in many image-processing applications. Sequential processing minimises time-consuming RAM accesses and thus reduces the total execution time. To understand the concept of multi-operation processing, it is worth taking a closer look at the sub-operations required to complete a CNN operation. As the network does not process direct inputs (e.g. from photo sensors), both the inputs $u_{ij}$ and the initial states $x_{ij}(0)$ have to be read in sequentially for each processing element. After the computation of a given number of iterations, the resulting states are written back sequentially as well. As a consequence, for larger networks the time required for initialisation and readout, $t_{IO}$, easily exceeds the actual computation time $t_c$. For the given architecture and typical gray-scale CNN operations we would obtain $t_{IO} > t_c$, and thus a low utilisation of the processor, for more than 300 processing elements. On the other hand, in practical applications (especially in the field of image processing) the applied CNN operations often refer to preceding results, forming a cascade of operators for a single input array. Many operations do not require either inputs or initial states, or expect them to be zero. We integrated these requirements into the processing control, which allows the sources of $u_{ij}$ and $x_{ij}(0)$ to be chosen prior to each operation:

Case I: Read from memory.
Case II: Copy from result of preceding operation.
Case III: Set to zero.

In this way, most templates related to the standard CNN can be processed very efficiently. Fig. 3 illustrates the interconnection of a processing element to its neighbours and to global memory. Multiplexer M2 and register R2 facilitate the implementation of Cases I to III for inputs and initial states, respectively. Since $u$ is loaded and processed prior to $x(0)$, only one register R2 is required to cache both of them; it can be loaded from R1 (the resulting state of the previous operation) or from memory. A global reset of R2 generates a zero input for all cells.

All parameters and input configurations of CNN operations are coded in microinstructions, which are part of the program code required to run the system. Fig. 4 shows the format of the instruction register containing the microinstructions. The BC field selects the CNN boundary condition (Neumann or Dirichlet). The number of iterations (N_it) can be adjusted for each operation to adapt to the requirements of the CNN templates and the input data. The vectors U_s and X(0)_s select one of the source modes Case I to III for inputs and initial states, respectively. In addition to the microinstructions, the program code contains, for each operation, address pointers with offsets to the memories for state values, input values and templates (Fig. 5). As in a conventional processor, the code is read sequentially by the CNN controller module, starting with the total number of operations and the pointers to the first data sections. The microinstruction code completes this code fragment, and the CNN controller can proceed to read the offsets of the subsequent operation. These pointers allow the re-use of both the CNN program code (in terms of templates) and the data sections, which reduces memory requirements when processing similar tasks, e.g. applying the same mask to different input images. Using 32-bit-aligned memory, 16 B (bytes) of machine code and an additional 76 B for the 19 CNN template values are required per operation. The memory size for state and input values depends on the network size and re-use intensity; 4 B are required per state or per input.

Fig. 5. Address space comprising program code and sections for state values, input values and templates. One operation is shaded in gray. Numbers in parentheses refer to the corresponding operation.
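The microinstruction fields and the per-operation memory figures can be illustrated with a small encoder and decoder. The sketch below is hedged: the text names the fields (N_it, U_s, X(0)_s, BC) but does not state their exact bit positions, so the layout used here (N_it in bits 24 to 5, U_s in bits 4 to 3, X(0)_s in bits 2 to 1, BC in bit 0) is an assumption read from the partly legible Fig. 4, and the numeric meaning of each field value is left open. All identifiers are hypothetical.

```python
from dataclasses import dataclass

# ASSUMED field layout (bit 0 = least significant); not confirmed by the text.
N_IT_SHIFT, N_IT_BITS = 5, 20   # number of iterations
U_S_SHIFT, U_S_BITS = 3, 2      # source mode for the inputs u
X0_S_SHIFT, X0_S_BITS = 1, 2    # source mode for the initial states x(0)
BC_SHIFT, BC_BITS = 0, 1        # boundary condition select

# Byte counts quoted in the text (32-bit-aligned memory).
CODE_BYTES_PER_OPERATION = 16
TEMPLATE_BYTES_PER_OPERATION = 76
BYTES_PER_VALUE = 4


@dataclass(frozen=True)
class Microinstruction:
    n_it: int   # iterations to run for this operation
    u_s: int    # raw source-mode code for the inputs
    x0_s: int   # raw source-mode code for the initial states
    bc: int     # raw boundary-condition code


def _extract(word, shift, bits):
    """Return the `bits`-wide field of `word` starting at bit `shift`."""
    return (word >> shift) & ((1 << bits) - 1)


def decode_ir(word):
    """Split a 32-bit instruction word into its fields (assumed layout)."""
    return Microinstruction(
        n_it=_extract(word, N_IT_SHIFT, N_IT_BITS),
        u_s=_extract(word, U_S_SHIFT, U_S_BITS),
        x0_s=_extract(word, X0_S_SHIFT, X0_S_BITS),
        bc=_extract(word, BC_SHIFT, BC_BITS),
    )


def encode_ir(instr):
    """Pack the fields into one instruction word, checking field ranges."""
    fields = (
        (instr.n_it, N_IT_SHIFT, N_IT_BITS),
        (instr.u_s, U_S_SHIFT, U_S_BITS),
        (instr.x0_s, X0_S_SHIFT, X0_S_BITS),
        (instr.bc, BC_SHIFT, BC_BITS),
    )
    word = 0
    for value, shift, bits in fields:
        if not 0 <= value < (1 << bits):
            raise ValueError(f"value {value} does not fit in {bits} bits")
        word |= value << shift
    return word


def operation_bytes(cells, input_sets=1, state_sets=1):
    """Memory needed by one operation: code, templates and stored data.

    input_sets / state_sets count how many full arrays are read from or
    written to memory; 0 models a source that is copied or set to zero.
    """
    data = cells * BYTES_PER_VALUE * (input_sets + state_sets)
    return CODE_BYTES_PER_OPERATION + TEMPLATE_BYTES_PER_OPERATION + data


word = encode_ir(Microinstruction(n_it=50, u_s=0, x0_s=2, bc=0))
print(hex(word), decode_ir(word), operation_bytes(128))
```

An operation that takes both inputs and initial states from the preceding result or from zero needs no data section of its own, which is where the memory saving from pointer re-use shows up: `operation_bytes(128, 0, 0)` counts only the 92 B of code and templates.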
III. IMPLEMENTATION

A. Embedded system

The CESAR architecture for the emulation of discrete-time CNNs is implemented as part of an embedded system on a Xilinx FPGA (Fig. 6). The multiplication, as the core function of the cell, is mapped to hardware multipliers in dedicated DSP slices, saving general-purpose logic resources. The PowerPC coprocessor is mainly used to control the data exchange between the memories and a host server via a Gigabit Ethernet interface. The memories storing program code, input data, initial states and templates are implemented as dual-port RAMs that can be accessed by both the coprocessor and the CNN controller. In addition, the coprocessor executes sequential parts of the CNN algorithms, such as pre- and post-processing of the data. The CNN controller is initialised via software accessible registers and issues an interrupt to the coprocessor after completion of the CNN processing. The fine-grain operation sequences are controlled by a finite state machine.

Fig. 6. Embedded system consisting of CNN, RAM, coprocessor, bus and peripherals. INT: interrupt register, SAR: software accessible register, FSM: finite state machine.

B. Resources and performance

Using a Xilinx Virtex-5 XC5VFX70T and an 18 bit data representation, a network with 128 cells has been implemented. A clock frequency of 100 MHz was chosen as a trade-off between processing speed and routability of the design. The number of processing elements is mainly limited by the available DSP slices in that FPGA type. In the case of 128 cells, more than 90 % of the FPGA's slices are occupied. As shown in Fig. 7, the network size can be increased at the expense of the computational accuracy and vice versa. Hence, if logic resources prevent the implementation of larger networks, a reduction to 8 bit accuracy would significantly reduce the resource requirements. The power consumption of the system amounts to 3.1 W. The required computing time strongly depends on the chosen step size $h$ and the convergence properties of the respective CNN template (and thus the number of iterations). Independent of network parameters, it takes 15 clock cycles to compute the subsequent state $x_{ij}(n+1)$. For conventional binary and gray-scale image operations, approximately 20 iterations are sufficient, corresponding to 3 µs (20 × 15 cycles at 100 MHz). Initialisation and read-out of the cells are performed consecutively through the dual-port block RAM.

Fig. 7. Utilisation (a) of LUTs and registers depending on network size (number of cells) for 18 bit data representation and (b) of LUTs depending on data representation (state resolution in bit) for 64 and 128 cell networks, respectively. Using a Virtex-5 XC5VFX70T with 128 DSP slices, only 128 cells with up to 18 bit can be mapped.

Fig. 8. CNN operation examples: (a) performing a horizontal Connected-Component Detection on an 8 × 16 network; (b) performing multiple operations on gray-scale and binary images on an 8 × 8 network (panels: input, threshold, dilation, mask, logic AND, logic NOT).

C. Results
A horizontal Connected-Component Detection (CCD) [13] is used to illustrate the computation of complex CNN behaviour. Since a feedback CNN coupling is required, the CCD cannot be implemented on systems lacking this feature, such as the Q-Eye chip [6]. In the example depicted in Fig. 8(a), a network consisting of 8 × 16 cells is initialised with the binary image shown. Using $h = 0.1$, the network reaches a steady state after 100 iterations, corresponding to 18 µs including data I/O. The CCD template is a wave-type operator, shifting and compressing the scene to the right-hand edge. Thus, the computation time $t_c$ strongly depends on the network dimensions and increases roughly linearly in an $M \times N$ network:

$$t_c \propto M. \tag{4}$$

In order to illustrate the application of several consecutive operations, we chose a simplified example resembling an algorithm for CNN-based control of laser beam welding [2]. Fig. 8(b) shows a sequence consisting of one gray-scale operation (threshold) and three binary operations (dilation, AND and NOT). Running on our implementation, each operation required no more than 20 iterations, or 3 µs. RAM access was necessary only three times: to load the input image, to load the mask image and to write back the results. Apart from this, the operations are executed without any delay. In this example, the total time for image loading and read-out is very low ($t_{IO} < 2$ µs), yet it increases rapidly with the network size:

$$t_{IO} \propto M \cdot N. \tag{5}$$

In contrast, the computation time ($t_c \approx 12$ µs) does not vary with the network size:

$$t_c = \text{const}. \tag{6}$$
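The timing figures quoted in Sections III-B and III-C follow from simple cycle counting, which the short sketch below reproduces. The 100 MHz clock and the 15 cycles per iteration are taken from the text. The I/O model (one value transferred per clock cycle for each array that is loaded or written back) is an assumption that is not stated in the paper; it is used here only because it is consistent with the reported values. The function names are illustrative.

```python
CLOCK_MHZ = 100            # clock frequency from Section III-B
CYCLES_PER_ITERATION = 15  # cycles per Euler iteration from Section III-B


def compute_time_us(iterations, operations=1):
    """Computation time t_c in microseconds; independent of network size."""
    return operations * iterations * CYCLES_PER_ITERATION / CLOCK_MHZ


def io_time_us(cells, transfers, values_per_cycle=1):
    """I/O time t_IO in microseconds.

    ASSUMPTION: each transfer moves one array of `cells` values at
    `values_per_cycle` values per clock cycle.
    """
    return cells * transfers / (values_per_cycle * CLOCK_MHZ)


# Single gray-scale or binary operation with 20 iterations.
single_op = compute_time_us(20)

# Fig. 8(a): CCD on 8 x 16 cells, 100 iterations, load states + write back.
ccd_total = compute_time_us(100) + io_time_us(8 * 16, transfers=2)

# Fig. 8(b): four operations of at most 20 iterations on 8 x 8 cells,
# with three RAM accesses (input image, mask image, result).
cascade_compute = compute_time_us(20, operations=4)
cascade_io = io_time_us(8 * 8, transfers=3)

print(single_op, ccd_total, cascade_compute, cascade_io)
```

Under these assumptions the model gives 3 µs for a single operation, about 17.6 µs for the CCD example (reported as 18 µs), 12 µs of computation for the cascade and 1.92 µs of I/O (reported as below 2 µs). It also makes the scaling in (5) and (6) explicit: only `io_time_us` depends on the number of cells.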
IV. CONCLUSION

The enhanced CESAR architecture for the emulation of a CNN-UM was developed and implemented on a Xilinx Virtex-5 FPGA. The proposed system provides higher computational accuracy and flexibility than current ASIC implementations, at comparable processing speeds for binary and gray-scale CNN templates. The recently presented design for EEG signal processing using CNN [10] was extended to process the discrete-time CNN state equations including feedforward couplings, and to allow an efficient computation of consecutive operations. In a next step, the design will be ported to an FPGA cluster comprising several Virtex-6 FPGAs coupled through high-speed serial connectors. Furthermore, a complex network with polynomial couplings will be distributed on this system, and the architecture will be extended to assign more cells to each processing element in order to process large-scale input data.

REFERENCES
[1] Á. Zarándy and C. Rekeczky, "2D operators on topographic and non-topographic architectures - implementation, efficiency analysis, and architecture selection methodology," International Journal of Circuit Theory and Applications, vol. 39, no. 10, pp. 983–1005, 2011.
[2] L. Nicolosi, F. Abt, A. Blug, A. Heider, R. Tetzlaff, and H. Höfler, "A novel spatter detection algorithm based on typical cellular neural network operations for laser beam welding processes," Measurement Science and Technology, vol. 23, p. 015401, 2012.
[3] F. Gollas, C. Niederhöfer, and R. Tetzlaff, "Toward an autonomous platform for spatio-temporal EEG-signal analysis based on cellular nonlinear networks," Int. J. Circuit Theory Appl., vol. 36, no. 5-6, pp. 623–639, 2008.
[4] L. O. Chua and L. Yang, "Cellular neural networks: theory," IEEE Transactions on Circuits and Systems, vol. 35, no. 10, pp. 1257–1272, 1988.
[5] T. Roska and L. O. Chua, "The CNN universal machine: an analogic array computer," IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 40, no. 3, pp. 163–173, 1993.
[6] A. Rodriguez-Vazquez, R. Dominguez-Castro, F. Jimenez-Garrido, S. Morillas, J. Listan, L. Alba, C. Utrera, S. Espejo, and R. Romay, "The Eye-RIS CMOS vision system," in Analog Circuit Design. Sensors, Actuators and Power Drivers; Integrated Power Amplifiers from Wireline to RF; Very High Frequency Front Ends, H. Casier, M. Steyaert, and A. H. M. van Roermund, Eds. Online edition: Springer, 2008, pp. 15–32.
[7] P. Dudek, "An asynchronous cellular logic network for trigger-wave image processing on fine-grain massively parallel arrays," IEEE Transactions on Circuits and Systems II: Express Briefs, vol. 53, no. 5, pp. 354–358, 2006.
[8] P. Földesy, A. Zarandy, and C. Rekeczky, "Configurable 3D-integrated focal-plane cellular sensor-processor array architecture," International Journal of Circuit Theory and Applications, vol. 36, no. 5-6, pp. 573–588, 2008.
[9] Z. Nagy and P. Szolgay, "Configurable multilayer CNN-UM emulator on FPGA," IEEE Transactions on Circuits and Systems I: Fundamental Theory and Applications, vol. 50, no. 6, pp. 774–778, 2003.
[10] J. Müller, J. Müller, and R. Tetzlaff, "A new cellular nonlinear network emulation on FPGA for EEG signal processing in epilepsy," in Proceedings of SPIE, vol. 8068, 2011, p. 80680M.
[11] J. Albo-Canals, J. A. Villasante-Bembibre, J. Riera-Babures, and X. Vilasis-Cardona, "8-Bit gray-scale DTCNN implementation over an FPGA for Robot Guiding algorithm," in Proc. 12th Int. Cellular Nanoscale Networks and Their Applications (CNNA) Workshop, 2010, pp. 1–2.
[12] S. Espejo, R. Carmona, R. Domínguez-Castro, and A. Rodríguez-Vázquez, "A VLSI-oriented continuous-time CNN model," International Journal of Circuit Theory and Applications, vol. 24, pp. 341–356, 1996.
[13] T. Roska, L. Kék, L. Nemes, A. Zarandy, M. Brendel, and P. Szolgay, "CNN software library (templates and algorithms)," Analogical and Neural Computing Laboratory, Computer and Automation Institute of the Hungarian Academy of Sciences, Version 7, 1999.
