
An Exploration of FPGA based Multilayer Perceptron using Residue Number System for Space Applications


Jain, Avi¹*, Pitchika, Eswar Deep²*, Bharadwaj, Shivang³*
¹,² Department of Electronics and Communication Engineering, Manipal Academy of Higher Education, Karnataka, India
³ Department of Computer Science Engineering, Manipal Academy of Higher Education, Karnataka, India
Email ID: jainavi0401@gmail.com, eeswardeep@gmail.com, shivang.bharadwaj3@gmail.com

Abstract—In recent times, most satellite applications consist of complex and computationally intensive data processing systems. The challenge is to meet the demands of onboard processing while keeping power consumption at a minimum. In this paper, we explore the scope of an FPGA based neural network using the Residue Number System for space based applications. We propose an implementation that uses RNS arithmetic to exploit the parallelism present in neural networks for faster computations, thus meeting the onboard processing demands of satellites without the use of high powered CPUs or GPUs.

Keywords—FPGA, RNS, Neural Network, Space Application

I. INTRODUCTION

A. Overview

A plethora of satellite applications demand high computing resources that are generally satisfied by using high-power-consuming embedded CPUs or onboard GPUs. In this work, we explore the possibility of using an FPGA as a substitute in order to curb the high power demand and to meet the real-time data processing and transmission needs of the system. To show the feasibility of the proposed setup, we implement a neural network classifier acting as a payload data processing system on an FPGA. The onboard payload data processing system allows us to reduce the load on the main CPU as well as perform computation on the data collected by the satellite for onboard analysis, transmitting only the end result to the ground station and thus reducing the cost of data transmission.

The FPGA allows us to use an appropriate low precision representation, which reduces hardware resources and increases clock frequency [5]. Another significant merit of the FPGA is its lower power consumption than that of a traditional CPU or GPU. A previous work [5] showed that an FPGA based neural network is about 10 times more efficient, in terms of performance per watt, than a GPU based neural network.

Representing numbers in the Residue Number System decomposes a number into smaller units, allowing us to perform arithmetic operations on the numbers in a parallel manner. RNS also obviates the need for a carry mechanism and allows us to execute both addition and multiplication in the same time as that required by an addition operation alone [2].

B. Organisation of the Paper

The paper is organized as follows: In Section II, we discuss related work in the field of the Residue Number System and the FPGA implementation of neural networks. In Section III, we describe the key concepts of neural networks and the Residue Number System as background necessary for our work. In Section IV, we describe the proposed method along with the hardware flow. In Section V, we present an implementation to show the feasibility of our work. In Section VI, we conclude the paper with the results obtained, and in Section VII, we provide a brief mention of future work on this topic.

II. RELATED WORK

A lot of work has been done on the Residue Number System and its use in implementing neural networks. A comprehensive study covering the theory and implementation of the Residue Number System and its arithmetic is presented in [3]; it provides a thorough insight into RNS usage and practical hardware implementations. Reference [4] presents a compilation of studies on FPGA and ASIC implementations of artificial neural networks, covering important parameters and constraints to be considered when implementing a neural network on hardware platforms.

A proposed implementation of a trained multilayer perceptron network is presented in [6]. A framework for neural network implementation is provided in [7]. Various types of neural networks require different design considerations for their hardware models. Hardware designs for convolutional neural networks introduce many constraints due to their size and complexity; references [8] and [9] present implementations of CNNs on FPGA platforms.

An implementation of the residue number system in neural networks can be seen in [5], which proposes a Nested Residue Number System to realize a deep convolutional neural

XXX-X-XXXX-XXXX-X/XX/$XX.00 ©20XX IEEE * - These authors contributed equally


network. In reference [14], the author discusses methods to improve the performance of convolutional neural networks using the residue number system.

III. BACKGROUND

A. Artificial Neural Networks

A neural network is a tool modelled on the human brain that is used to solve a plethora of real world problems. Neural networks have the ability to adapt to changing input without modifying the output criteria. A neural network is made of highly connected layers of basic data processing units called neurons. Each layer receives the output of the previous layer, performs basic arithmetic operations on it, and passes its output to the next layer.

Net_m = Σ_{n=0}^{N−1} W_mn · X_n    (1)

In equation (1), Net_m is the synaptic input to the m-th neuron in a layer, X is the input data vector to the layer, and W is the weight-bias matrix of the layer.

For quite some time, neural networks have been utilized to perform a variety of functions, or to optimize already existing functions, in space based applications, including but not limited to onboard image processing and control algorithms [11], [12] and [13].

Neural networks exhibit several types of parallelism, and careful analysis of these is required to efficiently map neural network structures onto hardware. Generally, a fully parallel implementation of such structures is not feasible except for small networks: as the number of neurons and cascaded layers increases, a fully parallel network at some point uses more resources than it can efficiently utilise and therefore gives sub-optimal performance [4]. There are three effective types of parallelism that can be exploited in neural networks. (i) Layer parallelism, present in multilayered networks, which can be exploited through pipelining. (ii) Node parallelism, which corresponds to the individual neurons in a layer; this is the most essential type and is very well suited to FPGA implementation. (iii) Weight parallelism, which corresponds to the computation of the total synaptic input to a neuron [4]. Our implementation heavily exploits node parallelism and can also exploit layer parallelism. Each neuron is placed on the FPGA with its own dedicated hardware, and all neurons in a layer are evaluated simultaneously and independently. The layers of the neural network can be pipelined so that each layer processes its own set of inputs and outputs in parallel. We do not exploit weight parallelism, so the net synaptic input to each neuron is computed sequentially, although independently of the other neurons.

B. Residue Number System

The Residue Number System represents an integer as a collection of its remainders with respect to a set of predetermined co-prime numbers. The decomposition of a number into smaller units allows us to exploit the intrinsic parallelism of modular arithmetic, thereby reducing the total net computation required for basic arithmetic operations on the hardware.

In a fixed radix system, the number system is specified by its radix base. Similarly, an RNS is completely described by stating its base, which consists of an n-tuple rather than a single integer. Thus, the radix of an RNS is defined by a set of integers m_1, m_2, m_3, …, m_n, where each member is coprime relative to the others and is called a modulus.

For a set of predetermined values m_1, m_2, m_3, …, m_n, an integer x is represented as a tuple x_1, x_2, x_3, …, x_n, where each x_i is an integer defined by (2).

x = q_i · m_i + x_i    (2)

In (2), 1 ≤ i ≤ n, and q_i is chosen such that 0 ≤ x_i < m_i. From this we can infer that q_i is the integer quotient when x is divided by m_i. A residue number system can also be identified by its dynamic range M.

M = ∏_{i=1}^{n} m_i    (3)

To demonstrate the addition and multiplication operations, consider two numbers X and Y represented in the residue number system by the residues x_i and y_i. A moduli set m_1, m_2, …, m_n is determined in which any pair of moduli is coprime; the dynamic range M of the system can then be found using (3).

|X + Y|_M ↔ ( |x_1 + y_1|_m_1, …, |x_n + y_n|_m_n )    (4)

|X · Y|_M ↔ ( |x_1 · y_1|_m_1, …, |x_n · y_n|_m_n )    (5)

Residue number systems have not been very successful in practical implementations because converting a number from its RNS representation back to the conventional binary weighted representation partway through an arithmetic process is complex and time consuming. This conversion creates a computational bottleneck and contributes largely to why residue number systems are avoided in practice. What must be noted, however, is that this costly conversion is required only when it is absolutely necessary: computing is ultimately about insight, not about the representation of the numbers. A neural network classifier, whose result for every input is effectively a class decision (a classification is either correct or it is not), does not require a reverse conversion. Thus we can do away with the costly reverse conversion process and exploit the fast arithmetic properties of the Residue Number System [4].

Fig1. Structure of the Basic Functional Unit; computes the synaptic input to neurons.
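As a concrete illustration of equations (2), (4) and (5), the digit-wise arithmetic can be modelled in a few lines of Python. This is a software sketch only; the moduli below are a small arbitrary example, not the set chosen later in the paper.

```python
# Illustrative software model of RNS arithmetic, equations (2), (4) and (5).
from math import prod

moduli = (31, 37, 41)            # small example set of pairwise coprime moduli
M = prod(moduli)                 # dynamic range, equation (3)

def to_rns(x):
    """Forward conversion: x -> (|x|_m1, ..., |x|_mn), per equation (2)."""
    return tuple(x % m for m in moduli)

def rns_add(a, b):
    """Equation (4): digit-wise modular addition, no carries between digits."""
    return tuple((ai + bi) % m for ai, bi, m in zip(a, b, moduli))

def rns_mul(a, b):
    """Equation (5): digit-wise modular multiplication."""
    return tuple((ai * bi) % m for ai, bi, m in zip(a, b, moduli))

x, y = 1234, 5678
assert rns_add(to_rns(x), to_rns(y)) == to_rns((x + y) % M)
assert rns_mul(to_rns(x), to_rns(y)) == to_rns((x * y) % M)
```

Each digit position is computed independently of the others, which is exactly the parallelism the hardware exploits.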
IV. PROPOSED METHOD
We propose an onboard satellite system which splits the work between a low power embedded microcontroller/CPU and an onboard FPGA. The MCU/CPU is responsible for handling critical satellite functionalities such as attitude control, the communication system and scheduling, whereas the FPGA is responsible for the high computing needs, in our case a pre-trained neural network. The need for an onboard neural network, and hence for the proposed onboard FPGA unit, depends on the purpose and application requirements of the satellite.

There are several modules in our FPGA implementation that allow us to effectively map the network structure onto the hardware. We use a Forward Converter Module to convert the binary weighted inputs into their RNS representation. Multiply and Accumulate Modules, comprising modulo-m multipliers and adders, are used to calculate the net synaptic input to each neuron. We use an Activation Function Module to apply the neuron activation function to the net synaptic input. Multiple layers of neurons, implemented using a hierarchical structure of the above hardware modules, can be cascaded. At the end of the final/output layer, we use an RNS Comparator Module to provide the result of the classification for an input to the network. This result is then fed back to the MCU/CPU.

In order to work in the Residue Number System, we first need to determine the precision criterion and dynamic range using the criteria specified in [1] and [2]. While determining these, we also need to consider the range of values that can occur inside the neural network; we therefore take a range large enough to account for all the values that flow through the network. We then find a set of k coprime moduli which satisfy the above criteria, hereafter referred to as the moduli set. Each modulus of the moduli set is represented using B_m bits.

Fig2. Structure of a single MLP Layer.

In the proposed system, we assume that the input data to the FPGA has been standardized or normalized in accordance with the training data. In the FPGA, we convert the standardized/normalized input vector into its RNS representation and then feed it as the input vector to the first layer of the implemented neural network. The weight-bias matrix of each layer of the trained neural network is converted beforehand into its corresponding RNS representation and stored in a lookup table. The short latency access to the LUTs, coupled with RNS arithmetic, allows for expeditious computation inside the module.

The Forward Converter Module takes as input a data vector consisting of N values; these N values are converted in parallel into their corresponding RNS representations. In order to perform binary weighted to RNS conversion in parallel, we make use of N × k Sequential Lookup Table Converters, where k is the total number of moduli in our moduli set. The output of this module is a vector containing N values, each represented using k × B_m bits.
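The per-neuron multiply-and-accumulate described above can be sketched in software as follows. This is an illustrative model of equation (1) carried out entirely on RNS digits; the function and variable names are for illustration and do not correspond to our HDL.

```python
# Illustrative sketch of one layer's MAC units operating on RNS digits.
moduli = (31, 37, 41)            # small example moduli set (k = 3)

def to_rns(x):
    return tuple(x % m for m in moduli)

def layer_net_inputs(weights_rns, inputs_rns):
    """weights_rns: M x N matrix of RNS tuples; inputs_rns: N RNS tuples.
    Returns the M net synaptic inputs Net_m, each still in RNS form."""
    nets = []
    for row in weights_rns:      # in hardware, one MAC unit per neuron, in parallel
        acc = (0,) * len(moduli)
        for w, x in zip(row, inputs_rns):
            p = tuple((wi * xi) % m for wi, xi, m in zip(w, x, moduli))
            acc = tuple((ai + pi) % m for ai, pi, m in zip(acc, p, moduli))
        nets.append(acc)
    return nets

W = [[2, 3], [5, 7]]             # toy 2-neuron layer with 2 inputs
X = [4, 6]
nets = layer_net_inputs([[to_rns(w) for w in row] for row in W],
                        [to_rns(x) for x in X])
assert nets[0] == to_rns(2*4 + 3*6)
assert nets[1] == to_rns(5*4 + 7*6)
```

Note that no reverse conversion is needed between layers: the accumulated synaptic inputs stay in RNS form throughout.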

Fig3. Forward Converter Module: Binary Weighted Values are converted to Residue Representation for K moduli.
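The parallel conversion of Fig. 3 can be modelled as an N × k bank of converters, one per (value, modulus) pair. A sketch, using the moduli set stated in Section V:

```python
# Sketch of the parallel forward conversion of Fig. 3: an input vector of N
# values is converted by an N x k bank of converters operating simultaneously.
moduli = (31, 37, 41, 43, 47, 53, 59, 61)     # the paper's moduli set, k = 8

def forward_convert(vector):
    """Return an N x k array of residues; in hardware, all N*k Sequential
    Lookup Table Converters run at the same time."""
    return [[x % m for m in moduli] for x in vector]

digits = forward_convert([1234, 56789, 424242])
assert len(digits) == 3 and all(len(row) == 8 for row in digits)
# Every residue is below 61, so each digit fits in B_m = 6 bits (k*B_m = 48 bits).
assert all(d < 2**6 for row in digits for d in row)
```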

In each layer, the output vector of the previous layer and the weight-bias matrix of the current layer are used to calculate the neuron outputs; refer to equation (1). The outputs of the neurons in the current layer form the output vector, which is subjected to an activation function and passed as the input vector to the next layer. In order to exploit parallelism, we make use of M activation function modules, where M is the total number of neurons in the current layer. We repeat this cycle for every layer of the neural network. The total number of clock cycles required to convert an input vector to its RNS representation depends upon the representation of the input binary weighted numbers (32-bit or 64-bit). Finally, in the last layer, we use a comparator module to return the target class of the input vector. The final result is a ⌈log_2 n⌉ bit output, where n is the total number of target classes.

V. IMPLEMENTATION

A. About the DataSet

In this paper, we worked with an in-house trained classifier on data from the UCI Machine Learning Repository Crowdsourced Mapping DataSet [16]. The dataset was derived from geospatial data from two sources:
1) Landsat time-series satellite imagery from the years 2014-2015.
2) Crowdsourced georeferenced polygons with land cover labels obtained from OpenStreetMap.
The data contained 28 features, and the neural network was trained to classify an input vector into one of six classes (water, farm, impervious, orchard, grass, forest). This dataset was chosen to show the applicability of this work to orbital satellites that could perform onboard data processing.
The neural network was trained in MATLAB 2017a with an accuracy of 94.37%. The input was standardized with respect to the training data so as to achieve optimum output. The activation function used for training was the Rectifier (ReLU).

B. Moduli Set Determination

As discussed before, we need to determine a precision criterion as well as the moduli set before we can work in the residue number system. The set of moduli used for the arithmetic operations in RNS depends upon various properties of the pre-trained neural network, and in turn determines the architecture and complexity of the hardware implementation. One fallout of working in the residue number system is the representation of numbers containing fractional parts. In our implementation, the non-integer inputs are scaled by a factor of 10^4 and then rounded to the nearest integer before they are used in the FPGA. Scaling the numbers using a precision criterion such as this does not cause a significant drop in the accuracy of the neural network, and allows us to work in the residue number system. One can work with higher precision, but that requires increasing the dynamic range of the system, which, depending on the moduli set taken, may require more hardware.

In this work, we determined the set {31, 37, 41, 43, 47, 53, 59, 61} to be our moduli set, with a dynamic range of 1.812 × 10^13. The arithmetic precisions for synthesis are 32 bits for the conventional binary weighted inputs and 6 bits for each modulus in the moduli set, resulting in a 48 bit RNS representation of each element of the input data vector after forward conversion. The elements of the weight-bias matrix are stored in the RNS representation corresponding to the same moduli set; thus each weight is also represented by 48 bits.

C. Hardware Realization

Figures 1 to 3 show the organisation of the hardware modules used for mapping the neural network onto the FPGA. Each hardware module can be realised as an all-ROM structure, as all-combinational logic, or as a combination of both. A simple and direct way to implement the Forward Converter Module is a sequential structure consisting of a lookup table that stores all the values of |2^j|_m, a modular adder, a counter and an accumulator [3].
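The choice of moduli set and the fixed-point scaling described in Section V-B can be sanity-checked with a short script. This is a software sketch, not part of the hardware flow; the 10^4 factor and the moduli are those stated above.

```python
# Quick sanity checks on the moduli set and the fixed-point input scaling.
from math import prod, gcd
from itertools import combinations

moduli = (31, 37, 41, 43, 47, 53, 59, 61)

# The moduli must be pairwise coprime for the RNS representation to be unique.
assert all(gcd(a, b) == 1 for a, b in combinations(moduli, 2))

M = prod(moduli)                  # dynamic range, about 1.81e13
assert M > 1.8e13

# Every modulus is below 2^6, so each residue digit needs B_m = 6 bits,
# giving the 48 bit (8 x 6) representation stated above.
assert all(m < 2**6 for m in moduli)

def quantize(value, scale=10**4):
    """Scale a fractional input by 10^4 and round to the nearest integer."""
    return round(value * scale)

assert quantize(3.14159) == 31416
```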

Fig4. Layer Diagram of the implemented Neural Network Classifier.

Fig5. Sequential Lookup Table Converter

|X|_m = | Σ_{j=0}^{n−1} x_j · 2^j |_m    (6)
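Equation (6) is exactly what the Sequential Lookup Table Converter evaluates. A software sketch of the same bit-serial accumulation, in which the precomputed table of |2^j|_m values plays the role of the hardware lookup table:

```python
# Software sketch of the Sequential Lookup Table Converter: equation (6)
# evaluated as at most n conditional modular additions of precomputed |2^j|_m.
def residue_bit_serial(x, m, n_bits=32):
    pow2_mod = [pow(2, j, m) for j in range(n_bits)]   # the LUT of |2^j|_m
    acc = 0                                            # accumulator register
    for j in range(n_bits):                            # counter over the bits
        if (x >> j) & 1:                               # x_j is either 1 or 0
            acc += pow2_mod[j]                         # modular adder: one add,
            if acc >= m:                               # then one conditional
                acc -= m                               # subtraction of m
    return acc

assert residue_bit_serial(1234567, 31) == 1234567 % 31
```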

Based on equation (6), we can infer that only n sequential modular additions are required to calculate the residue of X with respect to a modulus m, where n is the number of bits used to represent the input X in the conventional binary weighted representation (note that each x_j is either 1 or 0). For each element of the input data vector fed into the neural network, we need to calculate k residues; thus, to support a parallel forward conversion process, we use k parallel Sequential Lookup Table Converters [3] for each input of the input data vector.

We use a modulo-m adder, defined by equations (7) and (8), for all modular additions required in the forward conversion and multiply accumulate units.

|A + B|_m = A + B, if A + B < m    (7)

|A + B|_m = A + B − m, otherwise    (8)

Implementing this procedure in a simplistic way requires very minimal hardware, consisting of three carry propagate adders: one for the addition, one for the subtraction and one for the comparison [3].

Fig6. Modulo-m Adder

As depicted in Figure 1, to implement the multiplication operation in the residue number system we make use of a Quarter-Square modulo-m multiplier. Modular multiplication in this way can be implemented directly using three adders for the addition and the two subtractions, together with two lookup tables that produce the quarter squares [3].

Fig7. Quarter Square Modulo-m Multiplier

To implement the activation function, ReLU in this implementation, we use an RNS Comparator Module based on the study presented in [10]. We use a modified version of the same to evaluate the final result of the output layer (consisting of six neurons in this case) and infer which class the input vector belongs to.

This comparator compares a value to a fixed reference value and is designed to have the reference value be an odd number, which leads to a reduction in hardware requirements and power consumption. A zero value checker is connected to the output of the subtractor to check whether the number is zero; since zero in RNS is represented by all-zero residues, this block is a multiple-input NOR gate. The equality condition (zero checker = 1) overrides all other conditions of the Comparator Module.

ReLU is realized by using a comparator with the midpoint as the reference. The outputs of the comparator are fed to a logic network driving a multiplexer, which outputs either the number being compared or the value zero.

Fig8. Realisation of ReLU using an RNS comparator
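The modulo-m adder of equations (7) and (8) and the quarter-square multiplier can both be modelled behaviourally in a few lines; this is a software sketch of the arithmetic, not the synthesized logic.

```python
# Behavioural sketches of the modulo-m adder (equations (7) and (8)) and the
# quarter-square modulo-m multiplier.
def mod_add(a, b, m):
    """One carry-propagate addition, then a conditional subtraction of m."""
    s = a + b
    return s if s < m else s - m

def quarter_square_mod_mul(a, b, m):
    """a*b = floor((a+b)^2 / 4) - floor((a-b)^2 / 4); in hardware the two
    quarter squares are read from lookup tables [3]."""
    qs = lambda t: (t * t) // 4
    return (qs(a + b) - qs(abs(a - b))) % m

# Exhaustive check against ordinary modular arithmetic for m = 31.
assert all(mod_add(a, b, 31) == (a + b) % 31
           for a in range(31) for b in range(31))
assert all(quarter_square_mod_mul(a, b, 31) == (a * b) % 31
           for a in range(31) for b in range(31))
```

The quarter-square identity holds exactly for integers because a + b and a − b always have the same parity, so the two floors cancel.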
VI. RESULTS AND CONCLUSION

The MLP neural network is implemented on the Xilinx Spartan-6 XC6SLX100 FPGA at a clock frequency of 100 MHz. The FPGA implementation summary is as follows:

Layer                      Number of Nodes (Neurons)   Activation Function
Input                      29                          -
1st Layer (Hidden Layer)   10                          ReLU
2nd Layer (Output Layer)   6                           ReLU

Table1: MLP Structure

Resource          Utilization
DSP48A1 Slices    43.3%
Flip Flops        32.6%
LUTs              14.2%

Table2: FPGA Resource Utilization

Each Spartan-6 FPGA slice contains 4 LUTs and 8 flip-flops.

We observe that for the above network structure, with an input layer of 29 (28 inputs and 1 bias) 32-bit binary inputs, the net execution time from an input sequence to an output classification on the Spartan-6 FPGA implementation is always less than 8 μs (7.81 μs if the input sequences are pipelined). The accuracy of the FPGA based MLP classifier is 82.46%, whereas execution of the same MLP neural network on an Intel i7 6700HQ CPU with a fixed precision of 4 decimal places gave an accuracy of 86.80% along with an input-output latency of 3.67 μs. Thus, from the above results we can conclude that even though the FPGA based implementation causes a drop in the performance of the neural network, the trade-off of performance against power is decent enough for the method to be considered a viable option for implementing data processing systems onboard a satellite.

The accuracy of the classifier can be increased by taking a higher precision for the RNS system, and thus by using more of the FPGA resources. Depending on the sensitivity of the application, an appropriate trade-off can be made to achieve acceptable output accuracy. The proposed method helps achieve an appropriate number of computational operations per unit of power consumed; hence the method can be utilized as an alternative to the conventional implementation of onboard data processing units.

VII. FUTURE WORK

Future work on this paper would entail working towards a more scalable solution that could provide support for multiple data processing systems. We would also like to increase hardware parallelism by using hardware methods that facilitate parallel computation in RNS arithmetic. We would like to scale the method to facilitate the development of other neural networks, such as convolutional neural networks and recurrent neural networks, in the residue number system. We would also like to optimize our hardware so as to provide a platform for training neural networks onboard without much trade-off against time and accuracy.

REFERENCES

[1] Sweidan Andraos, "Fixed Point Ensigned Fractional Representation in Residue Number System", IEEE 39th Midwest Symposium on Circuits and Systems, 1996.
[2] Harvey L. Garner, "The Residue Number System", IRE Transactions on Electronic Computers, Vol. EC-8, Issue 2, June 1959.
[3] Amos Omondi and Benjamin Premkumar, "Residue Number Systems", Imperial College Press, 2007.
[4] Amos R. Omondi and Jagath C. Rajapakse, "FPGA Implementations of Neural Networks", Springer, 2006.
[5] Hiroki Nakahara and Tsutomu Sasao, "A Deep Convolutional Neural Network Based on Nested Residue Number System", 25th International Conference on Field Programmable Logic and Applications, 2015.
[6] Seema Singh, Shreyashree Sanjeevi, Suma V and Akhil Talashi, "FPGA Implementation of a Trained Neural Network", IOSR Journal of Electronics and Communication Engineering, Vol. 10, Issue 3, Ver. III, May-June 2015, pp. 45-54.
[7] P. Škoda, T. Lipić, Á. Srp, B. Medved Rogina, K. Skala and F. Vajda, "Implementation Framework for Artificial Neural Networks on FPGA", MIPRO 2011, May 23-27, 2011, Opatija, Croatia.
[8] Marco Bacis, Giuseppe Natale, Emanuele Del Sozzo and Marco Domenico Santambrogio, "A Pipelined and Scalable Dataflow Implementation of Convolutional Neural Networks on FPGA", 2017 IEEE International Parallel and Distributed Processing Symposium Workshops.
[9] Abhinav Podili, Chi Zhang and Viktor Prasanna, "Fast and Efficient Implementation of Convolutional Neural Networks on FPGA", 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors.
[10] Jen-Shiun Chiang and Mi Lu, "Floating-Point Numbers in Residue Number Systems", Computers & Mathematics with Applications, Vol. 22, No. 10, pp. 127-140, 1991.
[11] Keyang Cai and Hong Wang, "Cloud Classification of Satellite Image based on Convolutional Neural Networks", 2017 IEEE 8th Conference on Software Engineering and Service Science.
[12] Emmanuel Maggiori, Guillaume Charpiat, Yuliya Tarabalka and Pierre Alliez, "Recurrent Neural Networks to Correct Satellite Image Classification Maps", IEEE Transactions on Geoscience and Remote Sensing, Vol. 55, Issue 9, September 2017.
[13] Robert Boutros and Mohamed Ibnkahla, "New Adaptive Polynomial and Neural Network Predistortion Techniques for Satellite Transmissions", 2002 9th International Symposium on Antenna Technology and Applied Electromagnetics.
[14] N. I. Chervyakov, P. A. Lyakhov and M. V. Valueva, "Increasing of convolutional neural network performance using residue number system", 2017 International Multi-Conference on Engineering, Computer and Information Sciences.
[16] B. A. Johnson and K. Iizuka, "Integrating OpenStreetMap crowdsourced data and Landsat time-series imagery for rapid land use/land cover (LULC) mapping: Case study of the Laguna de Bay area of the Philippines", Applied Geography, Vol. 67, pp. 140-149, 2016.
