
A Non-Sigmoidal Activation Function for Feedforward Artificial Neural Networks

Pravin Chandra, Udayan Ghose and Apoorvi Sood

P. Chandra and U. Ghose are faculty at the University School of Information and Communication Technology, Guru Gobind Singh Indraprastha University, Dwarka, Sector 16C, New Delhi (INDIA) - 110078 (email: P. Chandra: pchandra@ipu.ac.in, chandra.pravin@gmail.com; U. Ghose: udayan@ipu.ac.in, g.udayan@lycos.com). A. Sood is a Ph.D. scholar at the University School of Information and Communication Technology, Guru Gobind Singh Indraprastha University, Dwarka, Sector 16C, New Delhi (INDIA) - 110078 (email: soodapoorvi@yahoo.com).


Abstract - For a single hidden layer feedforward artificial neural network to possess the universal approximation property, it is sufficient that the hidden layer nodes' activation functions are continuous non-polynomial functions. It is not required that the activation function be a sigmoidal function. In this paper a simple continuous, bounded, non-constant, differentiable, non-sigmoid and non-polynomial function is proposed for use as the activation function at hidden layer nodes. The proposed activation function does not require the computation of an exponential function, and thus is computationally less intensive than either the log-sigmoid or the hyperbolic tangent function. On a set of 10 function approximation tasks we demonstrate the efficiency and efficacy of the proposed activation function. The results demonstrate that, in equal epochs of training, the networks using the proposed activation function reach deeper minima of the error functional and also generalize better in most of the cases; statistically, they are as good as if not better than networks using the logistic function as the activation function at the hidden nodes.

I. INTRODUCTION


Artificial Neural Networks (ANN) are inspired by the computational paradigm of biological neural networks. Thus, the initial thrust was to develop a structure based on the interconnection of simple computational units¹. The simplest node used for approximating the behaviour of the biological neuron is a threshold device, also called the McCulloch-Pitts node [1]. The net input to a node for inputs [x₁, x₂, ..., xₙ]ᵀ is given by:

$$ \mathrm{net} = \sum_{i=1}^{n} w_i x_i + \theta \qquad (1) $$

where wᵢ is the strength of the connection for input xᵢ and θ is the bias of the node². The bias decides at what point the transition of the node state from dormant to excited takes place. This can be seen from the output of the node, which is:

$$ o = \begin{cases} 1, & \mathrm{net} \geq 0 \\ 0, & \mathrm{net} < 0 \end{cases} \qquad (2) $$
¹ Generally this unit is called a node or neuron, while the interconnection strengths are called weights.
² By an abuse of terminology, the collection of weights and biases together is also referred to as weights.
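As an illustration of (1) and (2), a minimal sketch of the McCulloch-Pitts node follows (Python is used only for illustration; the function name and the AND-gate example are ours, and the firing convention net ≥ 0 is an assumption, since (2) only fixes a step-like transition):

```python
import numpy as np

def mcculloch_pitts(x, w, theta):
    """McCulloch-Pitts threshold node: fires (1) when the net input
    of Eq. (1), sum_i w_i * x_i + theta, is non-negative."""
    net = np.dot(w, x) + theta
    return 1.0 if net >= 0.0 else 0.0

# Example: a node computing the logical AND of two binary inputs.
w = np.array([1.0, 1.0])
theta = -1.5
print([mcculloch_pitts(np.array(p), w, theta)
       for p in [(0, 0), (0, 1), (1, 0), (1, 1)]])
# -> [0.0, 0.0, 0.0, 1.0]
```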


This output function for the node is discontinuous. A continuous variant / extension of this function, which is also bounded, differentiable and monotonically increasing, is the sigmoidal class of functions. A sigmoidal function may be defined as [2]:

Definition 1. A sigmoidal function σ(·) is a map σ : ℝ → ℝ, where ℝ is the set of all real numbers, having the following limits as the argument of the function (x) tends to ±∞:

$$ \lim_{x \to +\infty} \sigma(x) = \beta \qquad (3) $$

$$ \lim_{x \to -\infty} \sigma(x) = \alpha \qquad (4) $$

where α < β. The generally observed values for α and β are α ∈ {−1, 0} and β = 1.

The most commonly used sigmoidal functions are the hyperbolic tangent function and the log-sigmoid or logistic function, which is defined as:

$$ \sigma(x) = \frac{1}{1 + e^{-x}} \qquad (5) $$
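A quick numerical check (a sketch of ours, not from the paper) that the logistic function (5) satisfies Definition 1 with α = 0 and β = 1:

```python
import numpy as np

def sigma(x):
    """Logistic (log-sigmoid) function of Eq. (5)."""
    return 1.0 / (1.0 + np.exp(-x))

# Limits of Definition 1: alpha = lim_{x -> -inf} sigma(x) = 0,
# beta = lim_{x -> +inf} sigma(x) = 1, with alpha < beta.
print(sigma(-30.0), sigma(30.0))  # ~9.4e-14 and ~1.0
```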

Nodes using a specific non-linearity as their output function are arranged in a structure which is broadly classified as a feedforward architecture. The weights and/or biases of this Feedforward ANN (FFANN) are estimated by using a (in general) non-linear optimization technique to minimize the error between the desired and the obtained response of the FFANN. These methods range from first-order methods (using only the gradient of the error functional) to second-order methods (using the explicit Hessian, or some estimate of the Hessian, of the error functional). For example, the classical error backpropagation method [3], [4], [5] and the Resilient Backpropagation (RPROP) [6], [7], [8] methods are gradient based (first-order) methods, while the Levenberg-Marquardt [9] and the conjugate gradient [10], [11] training algorithms may be classified as second-order methods. By themselves these training mechanisms do not impose any condition on the activation functions or the non-linearity used as the output function of the nodes. The only additional algorithmic requirement is that these functions should be differentiable [2].
The initial set of results about the universal approximation property (UAP)³ was obtained in the context of hidden (layer) nodes that were sigmoidal in nature, under a variety of conditions [12], [13], [14], [15], [16].

³ The property that a FFANN, with at least one hidden layer with non-linear nodes, can approximate any function arbitrarily well, provided there is a sufficient number of hidden nodes.

Fig. 1. The schematic diagram of a single hidden layer FFANN.

Fig. 2. The function σ(x) (5) and its derivative (8).

Stinchcombe and White (1990) showed that the UAP for a FFANN depends not on the sigmoidality of the non-linearity at the hidden nodes, but on the feedforward structure; the sigmoidality properties of the hidden nodes are not as crucial for the UAP [17]. Hornik (1991) established that the UAP holds for a FFANN if the non-linearity (henceforth called the activation function) at the hidden layer nodes is any continuous, non-constant and bounded function [18]. Leshno et al. [19] established that the UAP is obtained if the activation function of the hidden nodes is not a polynomial. See [20] for a survey of these UAP results. Thus, from these results (obtained after 1989), we may impose the following conditions on an activation function for networks using it at the hidden layer nodes: the activation function must be a continuous, non-constant and bounded non-polynomial function⁴.

⁴ Incidentally, all sigmoidal functions reported belong to this class, but they are not the only type of function in this class.

This set of results, which has expanded the class of functions that can be used as activation functions, has not been reflected in the empirical works reported. Some results have been reported using polynomial activations [21], [22], but, as the result in [19] established, such networks do not possess the UAP. In [23] Hermite polynomials are used in a constructive approach to neural networks. Most of the research on the role of activation functions in the training of FFANNs has concentrated on sigmoidal activations [24], [25], [26]. The activation functions used in the learning algorithms for FFANN training play an important role in determining the speed of training [27], [24]. In this paper we use a simple non-sigmoidal function that is continuous, differentiable, bounded, non-constant and non-polynomial, and study its efficiency and efficacy as an activation function for hidden nodes in a single hidden layer FFANN over 10 function approximation tasks.

The paper is organized as follows: Section II describes the FFANN architecture and the activation function used. Section III describes the design of the experiments. Section IV presents the results, while conclusions are presented in Section V.

II. FFANN ARCHITECTURE AND ACTIVATION FUNCTION

The schematic diagram of a single hidden layer FFANN is shown in Fig. 1. The number of inputs to the network is I, the inputs are labeled xᵢ where i ∈ {1, 2, ..., I}, the number of hidden nodes in the single hidden layer is H, the weight connecting the ith input to the jth hidden node is wⱼᵢ where i ∈ {1, 2, ..., I} and j ∈ {1, 2, ..., H}, the threshold of the jth hidden node is θⱼ with j ∈ {1, 2, ..., H}, and the connection strength between the output of the jth hidden node and the output node is aⱼ where j ∈ {1, 2, ..., H}, while γ is the threshold or bias of the output node. With this structure the net input to the jth hidden node is:

$$ n_j = \sum_{i=1}^{I} w_{ji} x_i + \theta_j; \qquad j \in \{1, 2, \ldots, H\} \qquad (6) $$

If f(·) represents an arbitrary activation function used at the hidden layer nodes, we may write the net output of the network as:

$$ y = \sum_{j=1}^{H} a_j f(n_j) + \gamma \qquad (7) $$
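As a sketch (with our own naming; the paper's experiments were run in MATLAB), the forward computation of (6) and (7) for a single hidden layer FFANN may be written as:

```python
import numpy as np

def ffann_forward(x, W, theta, a, gamma, f):
    """Single hidden layer FFANN of Fig. 1.
    x: input vector of length I; W: H x I matrix of hidden weights w_ji;
    theta: H hidden biases; a: H hidden-to-output weights; gamma: output bias;
    f: hidden layer activation function, applied elementwise."""
    n = W @ x + theta          # net inputs n_j of Eq. (6)
    return a @ f(n) + gamma    # network output of Eq. (7)

# Example with the logistic activation (5) plugged in for f:
rng = np.random.default_rng(0)
I, H = 3, 5
W = rng.uniform(-1, 1, (H, I)); theta = rng.uniform(-1, 1, H)
a = rng.uniform(-1, 1, H); gamma = rng.uniform(-1, 1)
sigma = lambda z: 1.0 / (1.0 + np.exp(-z))
print(ffann_forward(np.array([0.1, -0.2, 0.3]), W, theta, a, gamma, sigma))
```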

The activation function used in this work as the baseline for comparing the experimental results is the logistic function (5). The derivative of this function is:

$$ \sigma'(x) = \frac{d}{dx}\,\sigma(x) = \sigma(x)\left(1 - \sigma(x)\right) \qquad (8) $$

The function σ(x) and its derivative are shown in Fig. 2.
The proposed activation is:

$$ \psi(x) = \frac{1}{x^2 + 1}; \qquad x \in \mathbb{R} \qquad (9) $$

while its derivative is:

$$ \psi'(x) = \frac{d\psi(x)}{dx} = -2x\left(\psi(x)\right)^2 \qquad (10) $$

The function ψ(·) and its derivative are shown in Fig. 3. This function is seen to be continuous, bounded, non-constant, non-sigmoidal and non-polynomial. The shape of the function in Fig. 3 is similar to the bell-shaped functions used in Gaussian Potential Function Networks (GPFN) [28] or the Gaussian functions used in Radial Basis Function (RBF) networks [29]. But, in contrast to the proposed activation function, where the input to the node / activation function is given by (6) as a scalar product of the inputs with the associated weights, in GPFN or RBF networks the input to the activation function is essentially the distance between the input vector and the mean of the input vectors (called the center of the radial basis function) [28], [29]. Thus, the proposed activation is not equivalent to a radial basis function. It can easily be seen that the first as well as the second derivative of the proposed function is bounded. This implies that training algorithms based on the gradient (first-order methods) and/or methods based on (an estimate of) the Hessian (second-order methods) can be used for training networks that use the proposed function as the activation function.

Fig. 3. The function ψ(x) (9) and its derivative (10).
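A minimal sketch of the proposed activation (9) and its derivative (10), together with a finite-difference check of (10) (our own illustration):

```python
import numpy as np

def psi(x):
    """Proposed activation (9): psi(x) = 1 / (x^2 + 1).
    Bounded in (0, 1], non-constant, non-sigmoidal, non-polynomial,
    and free of exponentials (one multiply, one add, one divide)."""
    return 1.0 / (x * x + 1.0)

def dpsi(x):
    """Derivative (10): psi'(x) = -2x * psi(x)^2, also bounded."""
    p = psi(x)
    return -2.0 * x * p * p

# Check Eq. (10) against a central finite-difference approximation:
xs = np.linspace(-4.0, 4.0, 9)
h = 1e-6
print(np.max(np.abs(dpsi(xs) - (psi(xs + h) - psi(xs - h)) / (2 * h))))
# -> on the order of 1e-9
```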


III. EXPERIMENT DESIGN

The following functions are used to construct 10 function approximation tasks:

1) One dimensional input function taken from the MATLAB sample file humps.m:

$$ f_1(x) = \frac{1}{(x - 0.3)^2 + 0.01} + \frac{1}{(x - 0.9)^2 + 0.04} - 6 \qquad (11) $$

where x ∈ (0, 1).

2) Two dimensional input function taken from the MATLAB sample file peaks.m:

$$ f_2(x, y) = 3(1 - x)^2 e^{-x^2 - (y+1)^2} - 10\left(\frac{x}{5} - x^3 - y^5\right) e^{-x^2 - y^2} - \frac{1}{3}\, e^{-(x+1)^2 - y^2} \qquad (12) $$

where x ∈ (−3, 3) and y ∈ (−3, 3).


3) Two dimensional input function from [30], [31], [32]:

$$ f_3(x, y) = \sin(x\, y) \qquad (13) $$

where x ∈ (−2, 2) and y ∈ (−2, 2).

4) Two dimensional input function f₄ from [30], [31], [32] (14), where x ∈ (−1, 1) and y ∈ (−1, 1).

5) Two dimensional input function from [33], [31], [32]:

$$ f_5(x, y) = 1.3356\left(1.5(1 - x) + e^{2x - 1}\sin\!\big(3\pi(x - 0.6)^2\big) + e^{3(y - 0.5)}\sin\!\big(4\pi(y - 0.9)^2\big)\right) \qquad (15) $$

where x ∈ (0, 1) and y ∈ (0, 1).

6) Two dimensional input function from [33], [31], [32]:

$$ f_6(x, y) = 1.9\left(1.35 + e^{x}\sin\!\big(13(x - 0.6)^2\big)\, e^{-y}\sin(7y)\right) \qquad (16) $$

where x ∈ (0, 1) and y ∈ (0, 1).

7) Two dimensional input function from [33], [31], [32]:

$$ f_7(x, y) = 42.659\left(0.1 + x\left(0.05 + x^4 - 10x^2 y^2 + 5y^4\right)\right) \qquad (17) $$

where x ∈ (−0.5, 0.5) and y ∈ (−0.5, 0.5).

8) Two dimensional input function from [31], [32]:

$$ f_8(x, y) = \frac{1 + \sin(2x + 3y)}{3.5 + \sin(x - y)} \qquad (18) $$

where x ∈ (−2, 2) and y ∈ (−2, 2).

9) Four dimensional input function from [31], [32]:

$$ f_9(x_1, x_2, x_3, x_4) = 4(x_1 - 0.5)(x_4 - 0.5)\sin\!\left(2\pi\sqrt{x_2^2 + x_3^2}\right) \qquad (19) $$

where xᵢ ∈ (−1, 1).

10) Six dimensional input function from [31], [32]:

$$ f_{10}(x_1, \ldots, x_6) = 10\sin(\pi x_1 x_2) + 20(x_3 - 0.5)^2 + 10x_4 + 5x_5 + 0 \cdot x_6 \qquad (20) $$

where xᵢ ∈ (−1, 1).

For each of the above enumerated function approximation problems, a set of 900 points is generated from the input domain of the function, and the corresponding outputs are computed to form the data set used for learning. Of these, 200 tuples are used for training the FFANN (the training data set), while the remaining 700 data points form the test data set.
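A sketch of this data generation protocol for one of the tasks (f₈, Eq. (18)) is given below; uniform random sampling of the domain and min-max scaling are our assumptions, as the paper does not state the sampling scheme or the scaling method:

```python
import numpy as np

rng = np.random.default_rng(42)

def f8(x, y):
    # Task 8, Eq. (18)
    return (1.0 + np.sin(2.0 * x + 3.0 * y)) / (3.5 + np.sin(x - y))

# 900 samples from the input domain (-2, 2) x (-2, 2);
# the first 200 for training, the remaining 700 for testing.
X = rng.uniform(-2.0, 2.0, (900, 2))
t = f8(X[:, 0], X[:, 1])

def scale(v):
    """Min-max scale each column to [-1, 1], as done for all data sets."""
    lo, hi = v.min(axis=0), v.max(axis=0)
    return 2.0 * (v - lo) / (hi - lo) - 1.0

X, t = scale(X), scale(t)
X_train, t_train = X[:200], t[:200]
X_test, t_test = X[200:], t[200:]
```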

The data sets used for training and testing the networks are all scaled to the range [−1, 1], and the error on the training data set and the test data set is measured by the mean squared error (MSE) over the data set:

$$ E = \frac{1}{2P} \sum_{k=1}^{P} \left( t^{(k)} - y^{(k)} \right)^2 \qquad (21) $$

where P is the number of data points/tuples in the training/test data set, and t⁽ᵏ⁾ and y⁽ᵏ⁾ represent the desired output and the obtained output of the FFANN for the kth input.

The architecture of the FFANNs used is decided by varying the number of hidden nodes in the single hidden layer until a satisfactory solution is reached. The architectural parameters and the sizes of the training and test data sets are summarized in Table I.
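The error functional (21) in code form (a short sketch):

```python
import numpy as np

def mse(t, y):
    """Error functional of Eq. (21): E = (1 / 2P) * sum_k (t_k - y_k)^2."""
    P = len(t)
    return np.sum((t - y) ** 2) / (2.0 * P)
```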

TABLE I. DATA AND NETWORK SIZE SUMMARY FOR TASKS (I: NUMBER OF INPUTS, H: NUMBER OF HIDDEN NODES, O: NUMBER OF OUTPUTS).

For each of the learning tasks, an ensemble of fifty initial weight configurations is generated by using uniform random values in the range (−1, 1). For each task, the initial weight ensemble is used to construct two FFANN ensembles in which the initial weights are equal but the activation functions differ: one ensemble uses the activation function σ(·) (5), while the other uses the activation function ψ(·) (9), as the activation function at the hidden layer nodes. Thus, in all, 10 × 2 × 50 = 1000 networks are trained. Each network is trained for 1000 epochs on the training data set.


The networks are trained using a variant of the resilient backpropagation algorithm called resilient backpropagation with improved weight-backtracking (iRPROP+) [8]. This algorithm implements an improved mechanism for reversing a weight update in case the update leads to a higher error. It shares the properties of the resilient backpropagation algorithm (RPROP) as given in [6] of being a fast first-order algorithm whose space and time complexity scales linearly with the number of parameters to be optimized. The minima of the error functional achieved by this algorithm are comparable to those of second-order methods such as the Levenberg-Marquardt or scaled conjugate gradient algorithms (for example, see the results reported in [26]) on learning tasks.
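For orientation, a simplified sketch of the sign-based step-size adaptation at the core of the RPROP family follows; it omits the error-dependent weight-backtracking that distinguishes iRPROP+ [8], and all names and constants (η⁺ = 1.2 and η⁻ = 0.5 are the commonly quoted defaults) are illustrative:

```python
import numpy as np

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    """One sketch iteration of sign-based RPROP [6]: each parameter keeps
    its own step size, grown when the gradient sign repeats and shrunk
    when it flips (iRPROP+ additionally reverts updates that increased
    the error, which is not modeled here)."""
    same = grad * prev_grad > 0
    flip = grad * prev_grad < 0
    step = np.where(same, np.minimum(step * eta_plus, step_max), step)
    step = np.where(flip, np.maximum(step * eta_minus, step_min), step)
    grad = np.where(flip, 0.0, grad)   # skip the update after a sign flip
    w = w - np.sign(grad) * step
    return w, grad, step
```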

The experiments were done using MATLAB version 2013a on a 64-bit Intel i7 based Microsoft Windows 7 system with 6 GB RAM.

We report the ensemble average of the error metric (MMSE), the standard deviation of the ensemble networks' error (STD), the median of the ensemble error values (MeMSE), the minimum MSE achieved by a network in the ensemble (MIN), and the maximum MSE achieved by a network in the ensemble (MAX), for both the training and the test data set. We also count the number of problems in which the MMSE value for the ensemble using the activation function σ (5) is smaller than the MMSE for the ensemble corresponding to the usage of the non-sigmoidal activation ψ (9), and vice versa, for both the training and the test set. We use the 1-sided Student's t-test [34] to find the number of problems in which the MMSE corresponding to an ensemble using the activation function σ has a statistically significant smaller value than that of the ensemble characterized by the activation function ψ (and vice versa), over both the training and the test data set. The t-test is performed at a significance level of α = 0.05. Similarly, we use the 1-tailed Wilcoxon rank sum test [34] to find the number of problems in which the MeMSE corresponding to an ensemble using the activation function σ has a statistically significant smaller value than that of the ensemble characterized by the activation function ψ (and vice versa), over both the training and the test data set. The Wilcoxon rank sum test is performed at a significance level of α = 0.05.
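Both tests can be reproduced with standard library routines; the following sketch (with synthetic stand-in ensembles) uses SciPy's one-sided alternatives (available in recent SciPy versions). Note that SciPy returns the usual p-value q, whereas the paper reports p = 1 − q:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mse_sigma = rng.lognormal(-4.0, 0.3, 50)  # stand-ins for the 50-network ensembles
mse_psi   = rng.lognormal(-4.2, 0.3, 50)

# 1-sided t-test: alternative "mean MSE of psi-ensemble < mean MSE of sigma-ensemble".
t_res = stats.ttest_ind(mse_psi, mse_sigma, alternative='less')
# 1-sided Wilcoxon rank sum test on the same ordering.
w_res = stats.ranksums(mse_psi, mse_sigma, alternative='less')

alpha = 0.05
print('t-test    h =', int(t_res.pvalue < alpha), ' q =', t_res.pvalue)
print('rank sum  h =', int(w_res.pvalue < alpha), ' q =', w_res.pvalue)
```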

IV. RESULTS

Due to the volume of data obtained in the experiment process, we report only a summary of the results. First the training data set results are presented, followed by the test data set results.

A. Training Data Set Results

The summary of the results for the data set used for training is presented in Tables II, III and IV. From the obtained results we may infer the following:
1) On inspection of the values of MMSE for all the tasks, we find that in all cases the MMSE obtained for the ensemble using the function ψ(·) (9) (hereafter called the ensemble identified by ψ) has a lower value than the ensemble using σ(·) (5) (hereafter called the ensemble identified by σ). There is a substantial decrease in the value of the MMSE in almost all cases (except f10) for the ensemble identified by ψ as compared to the MMSE observed for the ensemble identified by σ.

2) The values of MMSE reported in Table II allow us only to claim that the MMSE of the ensemble identified by ψ is smaller than that of the ensemble identified by σ; they do not allow us to conclude whether this difference is statistically significant. To check for statistical significance of this difference, we perform a 1-tail t-test of the alternative hypothesis that the MMSE of the ensemble identified by ψ is smaller than the MMSE of the ensemble identified by σ, at a significance level α = 0.05. We report the values of the indicator h and the probability of observing the alternative hypothesis. The value h = 1 indicates that the alternative hypothesis holds, with p being the probability of observing the alternative hypothesis (this corresponds to 1 − q, where q is the probability of observing a test statistic as extreme as, or more extreme than, the observed value under the null hypothesis). The obtained results are tabulated in Table III. From the table we may infer that in 9 tasks the ensemble identified by ψ has a statistically significant lower value of the MMSE than the ensemble identified by σ (for f1-f9), while for the approximation of f10 the ensemble identified by ψ has a MMSE that is not statistically significantly different from the MMSE of the ensemble identified by σ at the α = 0.05 significance level.

3) In no case is it found that the ensemble identified by σ has a statistically significant lower MMSE value than the ensemble identified by ψ, at a significance level of α = 0.05.

4) There is a substantial decrease in the value of the MeMSE in almost all cases (except f10) for the ensemble identified by ψ as compared to the MeMSE observed for the ensemble identified by σ. For all tasks, we observe (Table II) that the ensemble identified by ψ has a lower MeMSE value than the ensemble identified by σ. To check for statistical significance of this difference, we perform a 1-tail Wilcoxon rank sum test of the alternative hypothesis that the MeMSE of the ensemble identified by ψ is smaller than the MeMSE of the ensemble identified by σ, at a significance level α = 0.05. We report the values of the indicator h and the probability of observing the alternative hypothesis. The obtained results are tabulated in Table IV. From the table we may infer that in 9 tasks the ensemble identified by ψ has a statistically significant lower value of the MeMSE than the ensemble identified by σ (for f1-f9), while for the approximation of f10 the ensemble identified by ψ has a MeMSE that is not statistically significantly different from the MeMSE of the ensemble identified by σ at the α = 0.05 significance level.

5) In no case is it found that the ensemble identified by σ has a statistically significant lower MeMSE value than the ensemble identified by ψ, at a significance level of α = 0.05.

6) In all tasks the MIN value is lower for the ensemble identified by ψ than for the ensemble identified by σ.

7) The MAX value for the ensemble identified by ψ is lower than the MAX value for the ensemble identified by σ in 7 cases out of 10, the exceptions being the approximation of f3, f9, and f10.

8) From these results we assert that, for the training of FFANNs, a single hidden layer FFANN using the proposed non-sigmoidal and non-polynomial activation function ψ (9) is as good as if not better than one using the standard logistic function σ (5), at least on the function approximation tasks reported herein.

TABLE II. SUMMARY OF NETWORK TRAINING DATA SET RESULTS FOR ACTIVATION FUNCTIONS σ(·) AND ψ(·) (STATISTICS PER TASK: MMSE, STD, MEMSE, MAX, MIN). ALL VALUES OF THE STATISTICS ARE REPORTED ×10⁻³.

TABLE III. t-TEST RESULTS FOR THE ALTERNATIVE HYPOTHESIS THAT THE MMSE OF THE ENSEMBLE IDENTIFIED BY ψ IS SMALLER THAN THE MMSE OF THE ENSEMBLE IDENTIFIED BY σ, FOR THE TRAINING DATA SET, AT A SIGNIFICANCE LEVEL OF α = 0.05.

TABLE IV. WILCOXON RANK SUM TEST RESULTS FOR THE ALTERNATIVE HYPOTHESIS THAT THE MEMSE OF THE ENSEMBLE IDENTIFIED BY ψ IS SMALLER THAN THE MEMSE OF THE ENSEMBLE IDENTIFIED BY σ, FOR THE TRAINING DATA SET, AT A SIGNIFICANCE LEVEL OF α = 0.05.
B. Test Data Set Results

The summary of the results for the test data set is presented in Tables V, VI and VII. The trend observed for this data set is similar to that obtained for the training data set. From the obtained results we may infer the following:

1) On inspection of the values of MMSE for all the tasks, we find that in the first 9 cases the MMSE obtained for the ensemble using the function ψ(·) (9) has a lower value than the ensemble using σ(·). Only for the case of f10, the MMSE of


the ensemble identified by ψ has a slightly higher value as compared to the MMSE of the ensemble identified by σ.

2) There is a substantial decrease in the value of the MMSE in almost all cases (except f10) for the ensemble identified by ψ as compared to the MMSE observed for the ensemble identified by σ.

3) The values of MMSE reported in Table V allow us only to claim that the MMSE of the ensemble identified by ψ is smaller than that of the ensemble identified by σ in 9 cases; they do not allow us to conclude whether this difference is statistically significant. Also, in the case of f10, though the MMSE of the ensemble identified by ψ has a higher value as compared to the MMSE of the ensemble identified by σ, we cannot comment on


the statistical significance on the basis of the MMSE value(s) alone. Thus, as for the training data set results, we perform a 1-tailed t-test. The obtained results are tabulated in Table VI. From the table we may infer that in 9 tasks the ensemble identified by ψ has a statistically significant lower value of the MMSE than the ensemble identified by σ (for f1-f9), while for the approximation of f10 the ensemble identified by ψ has a MMSE that is not statistically significantly different from the MMSE of the ensemble identified by σ at the α = 0.05 significance level.

4) In no case is it found that the ensemble identified by σ has a statistically significant lower MMSE value than the ensemble identified by ψ, at a significance level of α = 0.05. There is a substantial decrease in the value of the MeMSE in almost all cases (except f10) for the ensemble identified by ψ as compared to the MeMSE observed for the ensemble identified by σ.

5) For 9 tasks, we observe (Table V) that the ensemble identified by ψ has a lower MeMSE value than the ensemble identified by σ. To check for statistical significance of this difference, we perform a 1-tail Wilcoxon rank sum test, as for the training data set results, at a significance level α = 0.05. The obtained results are tabulated in Table VII. From the table we may infer that in 9 tasks the ensemble identified by ψ has a statistically significant lower value of the MeMSE than the ensemble identified by σ (for f1-f9), while for the


approximation of f10, the ensemble identified by ψ has a MeMSE that is not statistically significantly different from the MeMSE of the ensemble identified by σ at the α = 0.05 significance level.

6) In no case is it found that the ensemble identified by σ has a statistically significant lower MeMSE value than the ensemble identified by ψ, at a significance level of α = 0.05.

7) In all tasks the MIN value is lower for the ensemble identified by ψ than for the ensemble identified by σ.

8) The MAX value for the ensemble identified by ψ is lower than the MAX value for the ensemble identified by σ in 7 cases out of 10, the exceptions being the approximation of f3, f9, and f10.

9) From these results we assert that, on the test data sets, a single hidden layer FFANN using the proposed non-sigmoidal and non-polynomial activation function ψ (9) is as good as if not better than one using the standard logistic function σ (5), at least for the function approximation tasks reported herein.

TABLE V. SUMMARY OF NETWORK TEST DATA SET RESULTS FOR ACTIVATION FUNCTIONS σ(·) AND ψ(·). ALL VALUES OF THE STATISTICS ARE REPORTED ×10⁻³.

TABLE VI. t-TEST RESULTS FOR THE ALTERNATIVE HYPOTHESIS THAT THE MMSE OF THE ENSEMBLE IDENTIFIED BY ψ IS SMALLER THAN THE MMSE OF THE ENSEMBLE IDENTIFIED BY σ, FOR THE TEST DATA SET, AT A SIGNIFICANCE LEVEL OF α = 0.05.

TABLE VII. WILCOXON RANK SUM TEST RESULTS FOR THE ALTERNATIVE HYPOTHESIS THAT THE MEMSE OF THE ENSEMBLE IDENTIFIED BY ψ IS SMALLER THAN THE MEMSE OF THE ENSEMBLE IDENTIFIED BY σ, FOR THE TEST DATA SET, AT A SIGNIFICANCE LEVEL OF α = 0.05.

V. CONCLUSION

The possession of the universal approximation property by single hidden layer FFANNs was established for continuous, bounded and non-constant activation functions at the hidden layer nodes in [18]. This result was extended in [19], wherein it was established that any continuous function that is not a polynomial can be used as an activation function at the hidden layer nodes of a single hidden layer FFANN, and the FFANN would possess the UAP. Based on these results, a simple continuous, bounded, non-constant, differentiable, non-sigmoid and non-polynomial function is proposed in this work for use as the activation function at hidden layer nodes. The efficiency and efficacy of this function as an activation function at the hidden layer nodes of a single hidden layer FFANN is demonstrated over a set of 10 function approximation tasks. These networks are statistically as good as, if not better than, networks using the logistic function as the activation, in terms of the achieved training error values and generalization error values. Since activation functions play an important role in determining the speed of training of FFANNs [27], [24], [25], it is conjectured that the usage of new activations which are non-sigmoidal in nature may also be shown to have a beneficial consequence for finding fast training mechanisms. Moreover, the proposed function does not involve the calculation of an exponential term, and thus is computationally less intensive than the calculation of either the log-sigmoid activation function or the hyperbolic tangent activation function.
REFERENCES

[1] W. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.
[2] P. Chandra and Y. Singh, "Feedforward sigmoidal networks - equicontinuity and fault-tolerance," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1350-1366, 2004.
[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, Oct. 1986.
[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and The PDP Research Group, Eds. Cambridge: MIT Press, 1987, pp. 318-362.
[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," in Neurocomputing: Foundations of Research, J. A. Anderson and E. Rosenfeld, Eds. Cambridge, MA, USA: MIT Press, 1988, pp. 696-699.
[6] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Proc. of the IEEE Conference on Neural Networks, vol. 1, San Francisco, 1993, pp. 586-591.
[7] M. Riedmiller, "Advanced supervised learning in multi-layer perceptrons - from backpropagation to adaptive learning algorithms," Computer Standards & Interfaces, vol. 16, no. 3, pp. 265-278, 1994.
[8] C. Igel and M. Hüsken, "Empirical evaluation of the improved Rprop learning algorithms," Neurocomputing, vol. 50, pp. 105-123, 2003.
[9] M. T. Hagan and M. B. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Transactions on Neural Networks, vol. 5, pp. 989-993, 1994.
[10] M. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Aarhus University Computer Science Department, Aarhus, Denmark, Tech. Rep. PB-339, 1990.
[11] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141-166, 1992.
[12] A. R. Gallant and H. White, "There exists a neural network that does not make avoidable mistakes," in Proceedings of the Second International Joint Conference on Neural Networks, vol. 1, 1988, pp. 593-606.
[13] S. M. Carroll and B. W. Dickinson, "Construction of neural networks using the Radon transform," in Proc. of the IJCNN, vol. 1, 1989, pp. 607-611.
[14] G. Cybenko, "Approximation by superpositions of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 2, pp. 303-314, 1989.
[15] K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, pp. 183-192, 1989.
[16] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.
[17] M. Stinchcombe and H. White, "Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions," in Proc. of the International Joint Conference on Neural Networks (IJCNN), vol. 1, 1989, pp. 613-617.
[18] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251-257, 1991.
[19] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a non-polynomial activation function can approximate any function," Neural Networks, vol. 6, pp. 861-867, 1993.


[20] A. Pinkus, "Approximation theory of the MLP model in neural networks," Acta Numerica, vol. 8, pp. 143-195, 1999.
[21] S. Guarnieri, F. Piazza, and A. Uncini, "Multilayer feedforward networks with adaptive spline activation function," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 672-683, May 1999.
[22] M. Solazzi and A. Uncini, "Artificial neural networks with adaptive multidimensional spline activation functions," in Proc. of the IEEE-INNS-ENNS International Joint Conference on Neural Networks (IJCNN 2000), vol. 3, 2000, pp. 471-476.
[23] J.-N. Hwang, S.-R. Lay, M. Maechler, R. D. Martin, and J. Schimert, "Regression modeling in back-propagation and projection pursuit learning," IEEE Transactions on Neural Networks, vol. 5, no. 3, pp. 342-353, May 1994.
[24] W. Duch and N. Jankowski, "Survey of neural network transfer functions," Neural Computing Surveys, vol. 2, pp. 163-212, 1999.
[25] P. Chandra, "Sigmoidal function classes for feedforward artificial neural networks," Neural Processing Letters, vol. 18, no. 3, pp. 205-215, 2003.
[26] S. S. Sodhi and P. Chandra, "Bi-modal derivative activation function for sigmoidal feedforward networks," Neurocomputing, vol. 143, pp. 182-196, 2014.
[27] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade, ser. LNCS 1524, G. B. Orr and K.-R. Müller, Eds. Berlin: Springer, 1998, pp. 9-50.
[28] S. Lee and R. M. Kil, "A Gaussian potential function network with hierarchically self-organizing learning," Neural Networks, vol. 4, no. 2, pp. 207-224, 1991.
[29] S. Haykin, Neural Networks: A Comprehensive Foundation. New Jersey: Prentice Hall, Inc., 1999.
[30] L. Breiman, "The Π method for estimating multivariate functions from noisy data," Technometrics, vol. 33, no. 2, pp. 125-160, 1991.
[31] V. Cherkassky, D. Gehring, and F. Mulier, "Comparison of adaptive methods for function estimation from samples," IEEE Transactions on Neural Networks, vol. 7, no. 4, pp. 969-984, 1996.
[32] V. Cherkassky and F. Mulier, Learning from Data - Concepts, Theory and Methods. New York: John Wiley, 1998.
[33] M. Maechler, D. Martin, J. Schimert, M. Csoppenszky, and J. Hwang, "Projection pursuit learning networks for regression," in Proc. of the 2nd International IEEE Conference on Tools for Artificial Intelligence, 1990, pp. 350-358.
[34] J. D. Gibbons and S. Chakraborti, Nonparametric Statistical Inference. New York: Marcel Dekker, Inc., 2003.