
A Non-Sigmoidal Activation Function for Feedforward Artificial Neural Networks

Abstract - For a single hidden layer feedforward artificial neural network to possess the universal approximation property, it is sufficient that the hidden layer nodes' activation functions are continuous non-polynomial functions. It is not required that the activation function be a sigmoidal function. In this paper a simple continuous, bounded, non-constant, differentiable, non-sigmoid and non-polynomial function is proposed, for usage as the activation function at hidden layer nodes. The proposed activation function does not require the computation of an exponential function, and thus is computationally less intensive as compared to either the log-sigmoid or the hyperbolic tangent function. Experiments over 10 function approximation tasks demonstrate the efficiency and efficacy of the usage of the proposed activation function. The results obtained allow us to assert that, at least on the 10 function approximation tasks, networks using the proposed activation function reach deeper minima of the error functional and also generalize better in most of the cases, and statistically are as good as if not better than networks using the logistic function as the activation function at the hidden nodes.

I. INTRODUCTION

Artificial neural networks (ANNs) arose as an attempt to mimic the computational paradigm of the biological neural networks. Thus, the initial thrust was to develop a structure based on the interconnection of simple computational units¹. The simplest

node used for approximating the behaviour of the biological

neuron is a threshold device also called the McCulloch-Pitts

node [1]. The net input to a node for inputs [x_1, x_2, ..., x_n]^T is given by:

n = Σ_{i=1}^{n} w_i x_i + θ   (1)

where θ is the bias of the node². The bias decides at what point the transition of the node state from dormant to excited takes place. This can be seen from the output of the node, which is:

y = 1 if n ≥ 0;  y = 0 otherwise   (2)
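As a concrete illustration of (1) and (2), the following is a minimal sketch of a McCulloch-Pitts node (our illustration, not code from the paper; the AND-gate bias value is chosen only for the example):

```python
import numpy as np

def mcculloch_pitts(x, w, theta):
    # Net input (1): n = sum_i w_i * x_i + theta.
    n = np.dot(w, x) + theta
    # Threshold output (2): the node fires only when the net input
    # is non-negative.
    return 1 if n >= 0 else 0

# A two-input node computing logical AND: fires only when both inputs are 1.
w = np.array([1.0, 1.0])
print([mcculloch_pitts(np.array(x), w, theta=-1.5)
       for x in [(0, 0), (0, 1), (1, 0), (1, 1)]])  # [0, 0, 0, 1]
```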

P. Chandra and Udayan Ghose are faculty at the University School of Information and Communication Technology, Guru Gobind Singh Indraprastha University, Dwarka, Sector 16C, New Delhi (INDIA) - 110078 (email: P. Chandra: pchandra@ipu.ac.in, chandra.pravin@gmail.com; U. Ghose: udayan@ipu.ac.in, g.udayan@lycos.com). A. Sood is a Ph.D. scholar at the University School of Information and Communication Technology, Guru Gobind Singh Indraprastha University, Dwarka, Sector 16C, New Delhi (INDIA) - 110078 (email: soodapoorvi@yahoo.com).

¹Generally this unit is called a node or neuron, while the interconnection strengths are called weights.

²By an abuse of terminology, the collection of weights and biases together are also referred to as weights.

978-1-4799-1959-8/15/$31.00 ©2015 IEEE

A continuous variant / extension of this function, which is also bounded, differentiable and monotonically increasing, is the sigmoidal class of functions. A sigmoidal function may be defined as [2]:

Definition 1. A sigmoidal function σ(·) is a map σ : R → R, where R is the set of all real numbers, having the following limits for the argument of the function (x) tending to ±∞:

lim_{x→∞} σ(x) = β   (3)

lim_{x→−∞} σ(x) = α   (4)

where α and β are real numbers with α < β; usually α ∈ {−1, 0} and β = 1. Commonly used sigmoidal functions are the hyperbolic tangent function, and the log-sigmoid or the logistic function, which is defined as:

σ(x) = 1 / (1 + e^{−x})   (5)

Typically, a set of nodes using a sigmoidal activation

function is arranged in a structure which is broadly classified as a feedforward architecture. The weights and/or biases of this feedforward ANN (FFANN) are estimated by using a non-linear optimization technique (in general) to minimize the error between the desired and the obtained response of the FFANN. These methods range from first-order methods (only using the gradient of the error functional) to second-order methods (using the explicit Hessian or some estimate of the Hessian of the error functional). For example, the classical error backpropagation method [3], [4], [5] and the Resilient Backpropagation (RPROP) [6], [7], [8] methods are gradient based (first-order) methods, while the Levenberg-Marquardt [9] and the conjugate gradient [10], [11] training algorithms may be classified as second-order methods. By themselves these training mechanisms do not impose any condition on the activation functions or the non-linearity used as the output function of the nodes. The only additional algorithmic requirement is that these functions should be differentiable [2].

The initial set of results about the universal approximation property (UAP)³ was obtained in the context of hidden (layer) nodes that were sigmoidal in nature, under a variety of conditions [12], [13], [14], [15], [16].

³The property that a FFANN, with at least one hidden layer, with non-linear nodes in the hidden layer, can approximate any function arbitrarily well, provided there is a sufficient number of hidden nodes.

Fig. 1. The single hidden layer FFANN architecture.

Fig. 2. The logistic function σ(x) and its derivative dσ(x)/dx.

Stinchcombe and White (1990) showed that the UAP for a FFANN depends not on the sigmoidality of the non-linearity at the hidden nodes, but on the feedforward structure, and that the properties of sigmoidality of the hidden nodes are not as crucial for the UAP [17]. Hornik (1991) established that the UAP exists for a FFANN if the non-linearity (henceforth called the activation function) at the hidden layer nodes is any continuous, non-constant and bounded function [18]. Leshno et al. [19] established that the UAP is obtained if the activation function of the hidden nodes is not a polynomial. See [20] for a survey of these UAP results. Thus, from these results (obtained after 1989), we may impose the following conditions on an activation function, for the networks using it at the hidden layer nodes: the activation function must be a continuous, non-constant and bounded non-polynomial function⁴.

⁴These are not the only type of function in this class.

The developments in the theory, allowing for wider classes of function that can be used as activation function, have not been reflected in the empirical works reported. Some of the results have been reported using polynomial activations [21], [22], but as the result in [19] established, these networks do not possess the UAP. In [23] Hermite polynomials are used in the constructive approach to neural networks. Most of the research on the role of activation functions in the training of FFANNs has concentrated on sigmoidal activations [24], [25], [26]. The activation functions used in the learning algorithms for FFANN training play an important role in determining the speed of training [27], [24]. In this paper we use a simple non-sigmoidal function that is continuous, differentiable, bounded, non-constant and non-polynomial, and study its efficiency and efficacy as an activation function for hidden nodes in a single hidden layer FFANN over 10 function approximation tasks.

The paper is organized as follows: Section II describes the FFANN architecture and the activation function used. Section III describes the design of the experiments. Section IV presents the results, while conclusions are presented in Section V.

II. FFANN ARCHITECTURE AND THE ACTIVATION FUNCTION

The architecture of the single hidden layer FFANN used in this work

is shown in Fig. 1. The number of inputs to the network is I, the inputs are labeled x_i where i ∈ {1, 2, ..., I}, the number of hidden nodes in the single hidden layer is H, the weight connecting the ith input with the jth hidden node is w_{ji} where i ∈ {1, 2, ..., I} and j ∈ {1, 2, ..., H}, the threshold of the jth hidden node is θ_j with j ∈ {1, 2, ..., H}, and the connection strength between the output of the jth hidden node and the output node is a_j where j ∈ {1, 2, ..., H}, while γ is the threshold or the bias of the output node. With this structure the net input to the jth hidden node is:

n_j = Σ_{i=1}^{I} w_{ji} x_i + θ_j   (6)

With f(·) as the activation function at the hidden layer nodes, we may write the net output of the network as:


y = Σ_{j=1}^{H} a_j f(n_j) + γ   (7)
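The following is a small vectorized sketch of the net input (6) and network output (7) for a batch of P input vectors; the function and variable names are our own illustration, and f stands for the hidden layer activation function:

```python
import numpy as np

def ffann_forward(X, W, theta, a, gamma, f):
    # X: (P, I) batch of inputs; W: (H, I) input-to-hidden weights;
    # theta: (H,) hidden biases; a: (H,) hidden-to-output weights;
    # gamma: output bias; f: hidden layer activation function.
    N = X @ W.T + theta       # net inputs n_j of (6), shape (P, H)
    return f(N) @ a + gamma   # network output of (7), shape (P,)
```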

The activation function at the hidden layer nodes against which we compare the experimental results is the logistic function (5). The derivative of this function is:

σ'(x) = dσ(x)/dx = σ(x)(1 − σ(x))   (8)

The function σ(x) and its derivative are shown in Fig. 2.

The proposed activation function is:

ψ(x) = 1 / (x² + 1);   x ∈ R   (9)

while its derivative is:

ψ'(x) = dψ(x)/dx = −2x (ψ(x))²   (10)
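A minimal sketch of the logistic function (5), (8) and the proposed function (9), (10) as they might be coded; note that ψ needs only a multiply, an add and a divide, while σ needs an exponential (the function names are ours):

```python
import numpy as np

def logistic(x):
    # Log-sigmoid activation (5): sigma(x) = 1 / (1 + exp(-x)).
    return 1.0 / (1.0 + np.exp(-x))

def logistic_deriv(x):
    # Derivative (8): sigma'(x) = sigma(x) * (1 - sigma(x)).
    s = logistic(x)
    return s * (1.0 - s)

def psi(x):
    # Proposed activation (9): psi(x) = 1 / (x^2 + 1); no exponential.
    return 1.0 / (x * x + 1.0)

def psi_deriv(x):
    # Derivative (10): psi'(x) = -2x * psi(x)^2.
    p = psi(x)
    return -2.0 * x * p * p

x = np.linspace(-4.0, 4.0, 9)
print(psi(x), psi_deriv(x))
```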

The function ψ(·) and its derivative are shown in Fig. 3. This function is seen to be continuous, bounded, non-constant, non-sigmoidal and non-polynomial. The shape of the function in Fig. 3 is similar to the bell-shaped functions used in the Gaussian Potential Function Networks (GPFN) [28] or the Gaussian functions used in Radial Basis Function (RBF) networks [29]. But, in contrast to the proposed network activation function, where the input to the node / activation function is given by (6) as a scalar product of the inputs with the associated weights, in the GPFN or RBF networks the input to the activation function is essentially the distance between the input vector and the mean of the input vectors (called the center of the radial basis function) [28], [29]. Thus, the proposed activation is not equivalent to a radial basis function. It can easily be seen that the first as well as the second derivative of the proposed function are bounded. This implies that training algorithms based on the gradient (first-order methods) and/or methods based on (an estimate of the) Hessian (second-order methods) can be used for training networks that use the proposed function as the activation function.

Fig. 3. The function ψ(x) and its derivative dψ(x)/dx.

III. EXPERIMENT DESIGN

The proposed activation function is evaluated on the following 10 function

approximation tasks:

1) One dimensional input function taken from the MATLAB sample file humps.m:

f_1(x) = 1/((x − 0.3)² + 0.01) + 1/((x − 0.9)² + 0.04)   (11)

where x ∈ (0, 1).

2) Two dimensional input function taken from the MATLAB sample file peaks.m:

f_2(x, y) = 3(1 − x)² e^{−x² − (y+1)²} − 10(x/5 − x³ − y⁵) e^{−x² − y²} − (1/3) e^{−(x+1)² − y²}   (12)

3) Two dimensional input function from [30], [31], [32]:

f_3(x, y) = sin(x y)   (13)

4) Two dimensional input function from [30], [31], [32]:

f_4(x, y)   (14)

where x ∈ (−1, 1) and y ∈ (−1, 1).

5) Two dimensional input function from [33], [31], [32]:

f_5(x, y) = 1.3356 (1.5 (1 − x) + e^{2x−1} sin(3π (x − 0.6)²) + e^{3(y−0.5)} sin(4π (y − 0.9)²))   (15)

6) Two dimensional input function from [33], [31], [32]:

f_6(x, y) = 1.9 (1.35 + e^{x} sin(13 (x − 0.6)²) e^{−y} sin(7y))   (16)

7) Two dimensional input function from [33], [31], [32]:

f_7(x, y) = 42.659 (0.1 + x (0.05 + x⁴ − 10x²y² + 5y⁴))   (17)

where x ∈ (−0.5, 0.5) and y ∈ (−0.5, 0.5).

8) Two dimensional input function from [31], [32]:

f_8(x, y) = (1 + sin(2x + 3y)) / (3.5 + sin(x − y))   (18)

9) Four dimensional input function from [31], [32]:

f_9(x_1, x_2, x_3, x_4) = 4 (x_1 − 0.5) (x_4 − 0.5) sin(2π √(x_2² + x_3²))   (19)

10) Six dimensional input function from [31], [32]:

f_10(x_1, ..., x_6) = 10 sin(π x_1 x_2) + 20 (x_3 − 0.5)² + 10 x_4 + 5 x_5 + 0 x_6   (20)

For each of the function approximation problems, a set of 900 points is generated from the input domain of the function, and the corresponding outputs are generated to form the data set used for learning; out of this, 200 tuples are used for training the FFANN (and called the training data set) while the rest of the 700 data points form the test data set. The networks are trained by varying the number of hidden nodes in the single hidden layer until a satisfactory solution is reached. The architectural parameters and the sizes of the training and test data sets are reflected in Table I.

TABLE I. THE ARCHITECTURAL PARAMETERS (NUMBER OF HIDDEN NODES H) AND THE TRAINING (200) AND TEST (700) DATA SET SIZES FOR THE TASKS f_1 TO f_10.

The data sets used for training and testing the networks are all scaled in the range [−1, 1], and the error measured on the training data set and the test data set is the mean squared error (MSE) over the data set:

E = (1/2P) Σ_{k=1}^{P} (t(k) − y(k))²   (21)

where P is the number of patterns in the training/test data set, and t(k) and y(k) represent the desired output and the obtained output from the FFANN for the kth input.
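As an illustration of the data pipeline just described, the following sketch samples 900 points for the two dimensional task f_8, splits them 200/700, scales the targets to [−1, 1] and computes the MSE of (21); the uniform sampling domain (−1, 1)² is an assumption for the example, since the extraction does not preserve the domain of f_8:

```python
import numpy as np

rng = np.random.default_rng(0)

def f8(x, y):
    # Task 8 target function (18).
    return (1.0 + np.sin(2 * x + 3 * y)) / (3.5 + np.sin(x - y))

# 900 points sampled from the (assumed) input domain; 200 train / 700 test.
X = rng.uniform(-1.0, 1.0, size=(900, 2))
t = f8(X[:, 0], X[:, 1])

# Scale the targets into [-1, 1] as described in the text.
t = 2.0 * (t - t.min()) / (t.max() - t.min()) - 1.0

X_train, X_test = X[:200], X[200:]
t_train, t_test = t[:200], t[200:]

def mse(t_true, y_pred):
    # Error functional (21): E = (1 / 2P) * sum_k (t(k) - y(k))^2.
    return np.sum((t_true - y_pred) ** 2) / (2 * len(t_true))
```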



For each of the learning tasks an ensemble of fifty initial weight configurations is generated by using uniform random values in the range (−1, 1). For each task, the initial weight ensemble is used to construct two FFANN ensembles where the initial weights are equal but the activation functions used differ. One ensemble uses the activation function σ(·) (5) while the other uses the activation function ψ(·) (9) as the activation function at the hidden layer nodes. Thus, in all, 10 × 2 × 50 = 1000 networks are trained. Each network is trained for 1000 epochs on the training data set.
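A sketch of the ensemble construction described above, in which fifty shared initial weight configurations are reused so that the two ensembles differ only in the activation function (the flat parameter layout and all names are our own illustration):

```python
import numpy as np

rng = np.random.default_rng(42)
I, H = 2, 10                      # inputs and hidden nodes for one task
n_params = H * I + H + H + 1      # W, theta, a, gamma flattened

# Fifty initial weight configurations, uniform in (-1, 1), shared by
# both ensembles so that only the activation function differs.
initial_weights = rng.uniform(-1.0, 1.0, size=(50, n_params))

ensembles = {'sigma': initial_weights.copy(),
             'psi': initial_weights.copy()}
```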

The training algorithm used is a variant of the resilient backpropagation algorithm called the resilient backpropagation algorithm with improved weight-backtracking (iRPROP+) [8]. This algorithm implements an improved mechanism for reversing a weight update in the case that the update obtains a higher error. It shares the property of the resilient backpropagation algorithm (RPROP), as given in [6], of being a fast first-order algorithm, and its space and time complexity scales linearly with the number of parameters to be optimized. The minima of the error functional achieved by this algorithm are comparable to those of second-order methods like the Levenberg-Marquardt based or scaled conjugate gradient algorithms (for example, see the results reported in [26]) for learning tasks. All trainings are performed on a 64-bit Intel i7 based Microsoft Windows 7 system with 6 GB RAM.
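A condensed sketch of the iRPROP step-size adaptation with improved weight-backtracking, following the description of the algorithm in [8]; the hyperparameter defaults (η⁺ = 1.2, η⁻ = 0.5, Δ_max = 50, Δ_min = 10⁻⁶) are the commonly used values and are an assumption on our part:

```python
import numpy as np

def irprop_plus_step(w, grad, state, error, eta_plus=1.2, eta_minus=0.5,
                     delta_max=50.0, delta_min=1e-6):
    # state carries the previous gradient, per-weight step sizes,
    # previous update and previous error between calls.
    sign_change = state['grad'] * grad
    delta, dw = state['delta'], state['dw']

    inc = sign_change > 0                 # same sign: accelerate
    dec = sign_change < 0                 # sign flipped: minimum overstepped
    delta[inc] = np.minimum(delta[inc] * eta_plus, delta_max)
    delta[dec] = np.maximum(delta[dec] * eta_minus, delta_min)

    new_dw = -np.sign(grad) * delta
    if np.any(dec):
        # Improved backtracking: revert the previous update on these
        # weights only if the overall error increased.
        if error > state['error']:
            w[dec] -= dw[dec]
        new_dw[dec] = 0.0
        grad = grad.copy()
        grad[dec] = 0.0                   # suppress adaptation next step

    w += new_dw
    state.update(grad=grad, delta=delta, dw=new_dw, error=error)
    return w
```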

We report the ensemble average of the error metric (MMSE), the standard deviation of the ensemble networks' error (STD), the median of the ensemble error values (MeMSE), the minimum MSE achieved by a network in the ensemble (MIN), and the maximum MSE achieved by a network in the ensemble (MAX), for both the training and the test data set. We also count the number of problems in which the MMSE value for the ensemble using the activation function σ (5) is smaller than the MMSE for the ensemble corresponding to the usage of the non-sigmoidal activation ψ (9), and vice versa, for both the training and the test set. We use the 1-sided Student's t-test [34] to find the number of problems in which the MMSE corresponding to an ensemble using the activation function σ has a statistically significantly smaller value than that of the ensemble characterized by the activation function ψ (and vice versa), over both the training and the test data set, at a significance level of α = 0.05. Similarly, we use the 1-tailed Wilcoxon rank sum test [34] to find the number of problems in which the MeMSE corresponding to an ensemble using the activation function σ has a statistically significantly smaller value than that of the ensemble characterized by the activation function ψ (and vice versa), over both the training and the test data set. The Wilcoxon rank sum test is also performed at a significance level of α = 0.05.

Since it is not possible to report the results of each training process, we report the summary of the results only. First the training data set results are presented, followed by the test data set results.
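The two significance tests can be reproduced with standard tools; the following sketch uses synthetic placeholder MSE values and the standard p-value convention, converting to the paper's h and p = 1 − q reporting:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Placeholder final MSEs of the 50 networks in each ensemble for one
# task (synthetic values for illustration only).
mse_sigma = rng.gamma(2.0, 2e-3, size=50)
mse_psi = rng.gamma(2.0, 1e-3, size=50)

alpha = 0.05

# 1-sided t-test of the alternative "MMSE of psi < MMSE of sigma".
t_res = stats.ttest_ind(mse_psi, mse_sigma, alternative='less')
h_t = int(t_res.pvalue < alpha)           # h = 1: alternative holds
p_t = 1.0 - t_res.pvalue                  # the paper's p = 1 - q

# 1-sided Wilcoxon rank sum test for the medians (MeMSE).
w_res = stats.ranksums(mse_psi, mse_sigma, alternative='less')
h_w = int(w_res.pvalue < alpha)
p_w = 1.0 - w_res.pvalue

print(h_t, p_t, h_w, p_w)
```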

IV. RESULTS

A. Training Data Set Results

The summary of the results for the data set used for training is presented in Tables II, III and IV. From the obtained results we may infer the following:

1) Comparing the MMSE values obtained for the 10

tasks, we find that in all cases the MMSE obtained for the ensemble using the function ψ(·) (9) (hereafter, we will call this the ensemble identified by ψ) has a lower value as compared to the ensemble using σ(·) (5) (hereafter called the ensemble identified by σ). There is a substantial decrease in the value of the MMSE in almost all cases (except f_10) for the ensemble identified by ψ as compared to the MMSE observed for the ensemble identified by σ.

2) The values of MMSE reported in Table II allow us only to claim that the MMSE of the ensemble identified by ψ is smaller than that of the ensemble identified by σ. They do not allow us to conclude whether this difference is statistically significant or not. To check for statistical significance of this difference, we perform a 1-tail t-test of the alternative hypothesis that the MMSE of the ensemble identified by ψ is smaller than the MMSE of the ensemble identified by σ, at a significance level α = 0.05. We report the values of the indicator h and the probability of observing the alternative hypothesis. The value h = 1 indicates that the alternative hypothesis holds, with p being the probability of observing the alternative hypothesis (this corresponds to 1 − q, where q is the probability of observing a test statistic as extreme as, or more extreme than, the observed value under the null hypothesis). The obtained results are tabulated in Table III. From the table we may infer that in 9 tasks the ensemble identified by ψ has a statistically significantly lower value of the MMSE as compared to the ensemble identified by σ (for f_1 to f_9), while for the approximation of f_10 the ensemble identified by ψ has a MMSE that is not statistically significantly different from the MMSE of the ensemble identified by σ at the α = 0.05 significance level.

In no case is it found that the ensemble identified by σ has a statistically significantly lower MMSE value as compared to the ensemble identified by ψ, at a significance level of α = 0.05.

3) For all tasks, we observe (Table II) that the ensemble identified by ψ has a lower MeMSE value as compared to the ensemble identified by σ. There is a substantial decrease in the value of the MeMSE in almost all cases (except f_10) for the ensemble identified by ψ as compared to the MeMSE observed for the ensemble identified by σ.

To check for statistical significance of this difference, we perform a 1-tail Wilcoxon rank sum test of the alternative hypothesis that the MeMSE of the ensemble identified by ψ is smaller than the MeMSE of the ensemble identified by σ, at a significance level α = 0.05. We report the values of the indicator h and the probability of observing the alternative hypothesis, as for the t-test. The obtained results are tabulated in Table IV. From the table we may infer that in 9 tasks the ensemble identified by ψ has a statistically significantly lower value of the MeMSE as compared to the ensemble identified by σ (for f_1 to f_9), while for the approximation of f_10 the ensemble identified by ψ has a MeMSE that is not statistically significantly different from the MeMSE of the ensemble identified by σ at the α = 0.05 significance level. In no case is it found that the ensemble identified by σ has a statistically significantly lower MeMSE value as compared to the ensemble identified by ψ, at a significance level of α = 0.05.

4) In all tasks the MIN value is lower for the ensemble identified by ψ as compared to the MIN value for the ensemble identified by σ.

5) The MAX value for the ensemble identified by ψ is lower than the MAX value for the ensemble identified by σ in 7 cases out of 10, the exceptions being the approximation of f_3, f_9 and f_10.

6) From these results we assert that the single hidden layer FFANN using the proposed non-sigmoidal and non-polynomial activation function ψ (9) is as good as, if not better than, the standard logistic function σ (5), for at least the function approximation tasks reported herein, for the training of FFANNs.

TABLE III. RESULTS OF THE 1-TAIL t-TEST OF THE ALTERNATIVE HYPOTHESIS THAT THE MMSE OF THE ENSEMBLE IDENTIFIED BY ψ IS SMALLER THAN THE MMSE OF THE ENSEMBLE IDENTIFIED BY σ, ON THE TRAINING DATA SET, AT α = 0.05.

TABLE IV. RESULTS OF THE 1-TAIL WILCOXON RANK SUM TEST OF THE ALTERNATIVE HYPOTHESIS THAT THE MeMSE OF THE ENSEMBLE IDENTIFIED BY ψ IS SMALLER THAN THE MeMSE OF THE ENSEMBLE IDENTIFIED BY σ, ON THE TRAINING DATA SET, AT α = 0.05.

TABLE II. TRAINING DATA SET RESULTS (MMSE, STD, MeMSE, MAX AND MIN, ALL VALUES ×10⁻³) FOR THE ENSEMBLES USING THE ACTIVATION FUNCTIONS σ(·) AND ψ(·) ON TASKS f_1 TO f_10.

B. Test Data Set Results

The summary of the results for the test data set is

presented in Tables V, VI and VII. The trend observed for this data set is similar to the results obtained for the training data set. From the obtained results we may infer the following:

1) Comparing the MMSE values obtained for the 10 tasks, we find that in the first 9 cases the MMSE obtained for the ensemble using the function ψ(·) (9) has a lower value as compared to the ensemble using σ(·). Only for the case of f_10 does the MMSE of

the ensemble identified by ψ have a higher value as compared to the MMSE of the ensemble identified by σ.

There is a substantial decrease in the value of the MMSE in almost all cases (except f_10) for the ensemble identified by ψ as compared to the MMSE observed for the ensemble identified by σ.

TABLE V. TEST DATA SET RESULTS (MMSE, STD, MeMSE, MAX AND MIN, ALL VALUES ×10⁻³) FOR THE ENSEMBLES USING THE ACTIVATION FUNCTIONS σ(·) AND ψ(·) ON TASKS f_1 TO f_10.

2) The values of MMSE reported in Table V allow us only to claim that the MMSE of the ensemble identified by ψ is smaller than that of the ensemble identified by σ in 9 cases. They do not allow us to conclude whether this difference is statistically significant or not. Also, in the case of f_10, though the MMSE of the ensemble identified by ψ has a higher value as compared to the MMSE of the ensemble identified by σ, we cannot make a comment on the statistical significance of the difference in the

MMSE value(s). Thus, similarly to the case of the training data set results, we perform a 1-tailed t-test. The obtained results are tabulated in Table VI. From the table we may infer that in 9 tasks the ensemble identified by ψ has a statistically significantly lower value of the MMSE as compared to the ensemble identified by σ (for f_1 to f_9), while for the approximation of f_10 the ensemble identified by ψ has a MMSE that is not statistically significantly different from the MMSE of the ensemble identified by σ at the α = 0.05 significance level. In no case is it found that the ensemble identified by σ has a statistically significantly lower MMSE value as compared to the ensemble identified by ψ, at a significance level of α = 0.05.

3) There is a substantial decrease in the value of the MeMSE in almost all cases (except f_10) for the ensemble identified by ψ as compared to the MeMSE observed for the ensemble identified by σ.

For 9 tasks, we observe (Table V) that the ensemble identified by ψ has a lower MeMSE value as compared to the ensemble identified by σ. To check for statistical significance of this difference, we perform a 1-tail Wilcoxon rank sum test, similar to the case of the training data set results, at a significance level α = 0.05. The obtained results are tabulated in Table VII. From the table we may infer that in 9 tasks the ensemble identified by ψ has a statistically significantly lower value of the MeMSE as compared to the ensemble identified by σ (for f_1 to f_9), while for the function

f_10 approximation the ensemble identified by ψ

has a MeMSE that is not statistically significantly different from the MeMSE of the ensemble identified by σ at the α = 0.05 significance level. In no case is it found that the ensemble identified by σ has a statistically significantly lower MeMSE value as compared to the ensemble identified by ψ, at a significance level of α = 0.05.

4) In all tasks the MIN value is lower for the ensemble identified by ψ as compared to the MIN value for the ensemble identified by σ.

5) The MAX value for the ensemble identified by ψ is lower than the MAX value for the ensemble identified by σ in 7 cases out of 10, the exceptions being the approximation of f_3, f_9 and f_10.

TABLE VI. RESULTS OF THE 1-TAILED t-TEST OF THE ALTERNATIVE HYPOTHESIS THAT THE MMSE OF THE ENSEMBLE IDENTIFIED BY ψ IS SMALLER THAN THE MMSE OF THE ENSEMBLE IDENTIFIED BY σ, ON THE TEST DATA SET, AT α = 0.05.

TABLE VII. RESULTS OF THE 1-TAILED WILCOXON RANK SUM TEST OF THE ALTERNATIVE HYPOTHESIS THAT THE MeMSE OF THE ENSEMBLE IDENTIFIED BY ψ IS SMALLER THAN THE MeMSE OF THE ENSEMBLE IDENTIFIED BY σ, ON THE TEST DATA SET, AT α = 0.05.

6) From these results we assert that the single hidden

layer FFANN using the proposed non-sigmoidal and

non-polynomial activation function ψ (9) is as good as, if not better than, the standard logistic function σ (5), for at least the function approximation tasks reported herein, on the test data sets.

V. CONCLUSION

The universal approximation property (UAP)

for single hidden layer FFANNs was established for continuous, bounded and non-constant activation functions at the hidden layer nodes in [18]. This result was extended in [19], wherein it was established that any continuous function that is not a polynomial can be used as an activation function at the hidden layer nodes of a single hidden layer FFANN, and the FFANN would possess the UAP. Based on these results, a simple continuous, bounded, non-constant, differentiable, non-sigmoid and non-polynomial function is proposed in this work, for usage as the activation function at hidden layer nodes. The efficiency and efficacy of the usage of this function as an activation function at the hidden layer nodes of a single hidden layer FFANN is demonstrated in this work, over a set of 10 function approximation tasks. These networks are statistically as good as, if not better than, the networks using the logistic function as the activation, in terms of the achieved training error values or generalization error values. Since activation functions have an important role to play in determining the speed of training of FFANNs [27], [24], [25], it is conjectured that the usage of new activations which are non-sigmoidal in nature may also be shown to have a beneficial consequence for finding a fast training mechanism. Moreover, the proposed function does not involve the calculation of an exponential term, and thus is computationally less intensive as compared to either the log-sigmoid activation function or the hyperbolic tangent activation function.

REFERENCES

[1] W. S. McCulloch and W. Pitts, "A logical calculus of the ideas immanent in nervous activity," Bulletin of Mathematical Biophysics, vol. 5, pp. 115-133, 1943.

[2] P. Chandra and Y. Singh, "Feedforward sigmoidal networks - equicontinuity and fault-tolerance," IEEE Transactions on Neural Networks, vol. 15, no. 6, pp. 1350-1366, 2004.

[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," Nature, vol. 323, pp. 533-536, Oct. 1986.

[4] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning internal representations by error propagation," in Parallel Distributed Processing: Volume 1: Foundations, D. E. Rumelhart, J. L. McClelland, and The PDP Research Group, Eds. Cambridge: MIT Press, 1987, pp. 318-362.

[5] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, "Learning representations by back-propagating errors," in Neurocomputing: Foundations of Research, J. A. Anderson and E. Rosenfeld, Eds. Cambridge, MA, USA: MIT Press, 1988, pp. 696-699.

[6] M. Riedmiller and H. Braun, "A direct adaptive method for faster backpropagation learning: The RPROP algorithm," in Proc. of the IEEE Conference on Neural Networks, vol. 1, San Francisco, 1993, pp. 586-591.

[7] M. Riedmiller, "Advanced supervised learning in multi-layer perceptrons - from backpropagation to adaptive learning algorithms," Computer Standards & Interfaces, vol. 16, no. 3, pp. 265-278, 1994.

[8] C. Igel and M. Hüsken, "Empirical evaluation of the improved Rprop learning algorithms," Neurocomputing, vol. 50, pp. 105-123, 2003.

[9] M. T. Hagan and M. B. Menhaj, "Training feedforward networks with the Marquardt algorithm," IEEE Transactions on Neural Networks, vol. 5, pp. 989-993, 1994.

[10] M. Møller, "A scaled conjugate gradient algorithm for fast supervised learning," Aarhus University Computer Science Department, Aarhus, Denmark, Tech. Rep. PB-339, 1990.

[11] R. Battiti, "First- and second-order methods for learning: between steepest descent and Newton's method," Neural Computation, vol. 4, no. 2, pp. 141-166, 1992.

[12] A. R. Gallant and H. White, "There exists a neural network that does not make avoidable mistakes," in Proceedings of the Second International Joint Conference on Neural Networks, vol. 1, 1988, pp. 593-606.

[13] S. M. Carroll and B. W. Dickinson, "Construction of neural networks using the Radon transform," in Proc. of the IJCNN, vol. 1, 1989, pp. 607-611.

[14] G. Cybenko, "Approximation by superposition of a sigmoidal function," Mathematics of Control, Signals and Systems, vol. 5, pp. 233-243, 1989.

[15] K. Funahashi, "On the approximate realization of continuous mappings by neural networks," Neural Networks, vol. 2, pp. 183-192, 1989.

[16] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, pp. 359-366, 1989.

[17] M. Stinchcombe and H. White, "Universal approximation using feedforward networks with non-sigmoid hidden layer activation functions," in Proc. of the IJCNN, vol. 1, 1989, pp. 613-617.

[18] K. Hornik, "Approximation capabilities of multilayer feedforward networks," Neural Networks, vol. 4, no. 2, pp. 251-257, 1991.

[19] M. Leshno, V. Y. Lin, A. Pinkus, and S. Schocken, "Multilayer feedforward networks with a non-polynomial activation function can approximate any function," Neural Networks, vol. 6, pp. 861-867, 1993.

[20] A. Pinkus, "Approximation theory of the MLP model in neural networks," Acta Numerica, vol. 8, pp. 143-195, 1999.

[21] S. Guarnieri, F. Piazza, and A. Uncini, "Multilayer feedforward networks with adaptive spline activation function," IEEE Transactions on Neural Networks, vol. 10, no. 3, pp. 672-683, May 1999.

[22] M. Solazzi and A. Uncini, "Artificial neural networks with adaptive multidimensional spline activation functions," in Proc. of the IJCNN, 2000.

[23] J.-N. Hwang, S.-R. Lay, M. Maechler, R. Martin, and J. Schimert, "Regression modeling in back-propagation and projection pursuit learning," IEEE Transactions on Neural Networks, vol. 5, no. 3, pp. 342-353, May 1994.

[24] W. Duch and N. Jankowski, "Survey of neural network transfer functions," Neural Computing Surveys, vol. 2, pp. 163-212, 1999.

[25] P. Chandra, "Sigmoidal function classes for feedforward artificial neural networks," Neural Processing Letters, vol. 18, no. 3, pp. 205-215, 2003.

[26] S. S. Sodhi and P. Chandra, "Bi-modal derivative activation function for sigmoidal feedforward networks," Neurocomputing, vol. 143, pp. 182-196, 2014.

[27] Y. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, "Efficient backprop," in Neural Networks: Tricks of the Trade, ser. LNCS 1524, G. B. Orr and K.-R. Müller, Eds. Berlin: Springer, 1998, pp. 9-50.

[28] S. Lee and R. M. Kil, "A Gaussian potential function network with hierarchically self-organizing learning," Neural Networks, vol. 4, no. 2, pp. 207-224, 1991.

[29] S. Haykin, Neural Networks: A Comprehensive Foundation. New Jersey: Prentice Hall, Inc., 1999.

[30] L. Breiman, "The Π method for estimating multivariate functions from noisy data," Technometrics, vol. 33, no. 2, pp. 125-160, 1991.

[31] V. Cherkassky, D. Gehring, and F. Mulier, "Comparison of adaptive methods for function estimation from samples," IEEE Transactions on Neural Networks, vol. 7, no. 4, pp. 969-984, 1996.

[32] V. Cherkassky and F. Mulier, Learning from Data - Concepts, Theory and Methods. New York: John Wiley, 1998.

[33] M. Maechler, D. Martin, J. Schimert, M. Csoppenszky, and J. Hwang, "Projection pursuit learning networks for regression," in Proc. of the 2nd International IEEE Conference on Tools for Artificial Intelligence, 1990, pp. 350-358.

[34] J. D. Gibbons and S. Chakraborti, Nonparametric Statistical Inference. New York: Marcel Dekker, Inc., 2003.