
IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 8, AUGUST 2013 1327


Analysis of Boundedness and Convergence of
Online Gradient Method for Two-Layer
Feedforward Neural Networks
Lu Xu, Jinshu Chen, Defeng Huang, Senior Member, IEEE,
Jianhua Lu, Senior Member, IEEE, and Licai Fang
Abstract—This paper presents a theoretical boundedness and convergence analysis of the online gradient method for the training of two-layer feedforward neural networks. The well-known linear difference equation is extended to apply to the general case of linear or nonlinear activation functions. Based on this extended difference equation, we investigate the boundedness and convergence of the parameter sequence of concern, which is trained by finite training samples with a constant learning rate. We show that the uniform upper bound of the parameter sequence, which is very important in the training procedure, is the solution of an inequality regarding the bound. It is further verified that, for the case of a linear activation function, a solution always exists and, moreover, the parameter sequence can be uniformly upper bounded, while for the case of a nonlinear activation function, some simple adjustment methods on the training set or the activation function can be derived to improve the boundedness property. Then, for the convergence analysis, it is shown that the parameter sequence can converge into a zone around an optimal solution at which the error function attains its global minimum, where the size of the zone is associated with the learning rate. Particularly, for the case of perfect modeling, a strong global convergence result, where the parameter sequence always converges to an optimal solution, is proved.

Index Terms—Boundedness, convergence, online gradient method, two-layer feedforward neural networks.
I. INTRODUCTION

Nowadays, neural networks are being employed in many applications, such as channel estimation and equalization, signal approximation, machine learning, and pattern recognition (for details, see [1]–[4] and the references cited therein). Because of their simple structure, two-layer feedforward neural networks (TLFNNs) have inspired much scientific research and many engineering applications in past decades including, for instance, the lattice polynomial perceptron (LPP) with the Gram–Schmidt orthogonal decomposition employed for a cellular mobile communication receiver [5], the functional-link neural network and its variants for overcoming co-channel interference (CCI) [6], nonlinear dynamic system identification [7], nonlinear channel equalization [8], and intelligent sensors in wireless sensor networks [9].

Manuscript received September 4, 2012; accepted March 31, 2013. Date of publication May 13, 2013; date of current version June 28, 2013. This work was supported in part by the Australian Research Council's Discovery Projects DP1093000, and the National Natural Science Foundation of China under Grant 61021001.
L. Xu, J. Chen, and J. Lu are with the Department of Electronic Engineering, Tsinghua University, Beijing 100084, China (e-mail: xulu06@mails.tsinghua.edu.cn; chenjs@tsinghua.edu.cn; lhhdee@mail.tsinghua.edu.cn).
D. Huang and L. Fang are with the School of Electrical, Electronic and Computer Engineering, University of Western Australia, Perth 6009, Australia (e-mail: huangdf@ee.uwa.edu.au; 21252521@student.uwa.edu.au).
Digital Object Identifier 10.1109/TNNLS.2013.2257845

Fig. 1. General model of TLFNNs [7].
A general model of TLFNNs is shown in Fig. 1. At the kth step, let a_k be the desired signal, y_k the output signal, e_k the output error, X_k = [x_{k,1}, x_{k,2}, ..., x_{k,N}]^T the input signal of dimension N, and H_k = [h_{k,1}, h_{k,2}, ..., h_{k,N}]^T the parameter vector. {(a_k, X_k)}_{k=1}^{K} is the training set, where K is the number of training samples. In Fig. 1, g(x) is the activation function, which can be linear or nonlinear. For instance, a linear g(x) = x [10], [11] or a hyperbolic tangent function g(x) = tanh(x) [12]–[14] is usually used.
The online gradient method (OGM) for TLFNN training is normally used to update the parameter vector immediately after each training sample is fed (according to the gradient-descent principle), so as to find an optimal parameter vector ultimately. An optimal parameter vector, also called an optimal solution, refers to a vector H* at which the error function

J(H) = Σ_{k=1}^{K} (a_k − g(X_k^T H))^2

attains its global minimum

H* = arg min_{H ∈ R^N} J(H).    (1)
2162-237X/$31.00 © 2013 IEEE
Note that the optimal solution may not be unique, as in J(H) we have

g(X_k^T H*) = g(X_k^T (H* + H^⊥)),  k = 1, ..., K

where H^⊥ is a vector orthogonal to the input sequence {X_k}_{k=1}^{K}. It is not difficult to verify that there exist optimal solutions that are linear combinations of some elements of {X_k}_{k=1}^{K}. For the sake of clarification, in the rest of this paper, we let H* denote one of these optimal solutions. Given H*, we now have

a_k = g(X_k^T H*) + ε_k,  k = 1, ..., K    (2)

where {ε_k}_{k=1}^{K} denotes the disturbance sequence that represents all nonidealities, such as modeling error and noise.
In this paper, we consider a finite number of training samples and two cyclic training schemes, namely, OGM-F (fixed order), in which each sample in the set is supplied to the network exactly once in a fixed order in each training cycle [15], [16], and OGM-SS (special stochastic order), in which each sample in the set is supplied to the network exactly once in a stochastic order in each training cycle [16], [17]. The OGM recursion equation of the parameter vector can be written as

H_{nK+k+1} = H_{nK+k} + u (a_{nK+k} − g(X_{nK+k}^T H_{nK+k})) g'(X_{nK+k}^T H_{nK+k}) X_{nK+k},
k = 1, ..., K,  n = 0, 1, ...    (3)

where u is the learning rate. For the OGM-F scheme, (a_{nK+k}, X_{nK+k}) = (a_k, X_k) for n = 0, 1, ... and k = 1, ..., K, and for the OGM-SS scheme, every set {(a_{nK+k}, X_{nK+k})}_{k=1}^{K} for n = 0, 1, ... is a stochastic permutation of the set {(a_k, X_k)}_{k=1}^{K}.
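As an illustration only (not the authors' code), the cyclic recursion (3) can be sketched in Python; the tanh activation, zero initialization, and all dimensions below are assumptions made for the example:

```python
import math
import random

def ogm_train(samples, u, n_cycles, scheme="F"):
    """Online gradient method for a TLFNN with g = tanh, per recursion (3).

    samples: list of (a_k, X_k) pairs, each X_k a list of floats.
    scheme:  "F"  -> OGM-F, fixed sample order in every training cycle;
             "SS" -> OGM-SS, a fresh stochastic order in every cycle
                     (each sample is still used exactly once per cycle).
    """
    N = len(samples[0][1])
    H = [0.0] * N                                  # initialization H_1
    for _ in range(n_cycles):
        order = list(range(len(samples)))
        if scheme == "SS":
            random.shuffle(order)
        for k in order:
            a_k, X_k = samples[k]
            s = sum(x * h for x, h in zip(X_k, H))           # X_k^T H
            grad = (a_k - math.tanh(s)) * (1.0 - math.tanh(s) ** 2)
            H = [h + u * grad * x for h, x in zip(H, X_k)]   # step (3)
    return H
```

With targets generated from a known parameter vector and no disturbance, both schemes drive H toward that vector for a sufficiently small constant learning rate.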
The boundedness of the OGM depends on the upper bound of the parameter sequence {H_i}_{i=1}^{∞}, where i = nK + k. It is an important issue because, in practice, an oversized parameter may crash the training procedure or damage the equipment that is being used. Likewise, the convergence of the OGM (i.e., the convergence of the parameter sequence {H_i}_{i=1}^{∞} to an optimal solution in some sense) is a prerequisite of any successful application of the OGM.
The boundedness and convergence of the OGM have in fact inspired a great deal of research, and many excellent analytic mechanisms have been proposed, with the activation function g(x) linear or nonlinear. For the case of linear g(x), the parameter estimation error V_i = H_i − H* was often considered, and the well-known linear difference equation [18]

V_{i+1} = (I − u A_i) V_i + u ε_i X_i    (4)

where A_i = X_i X_i^T, was usually used, thanks to the linearity of g(x). Based on this linear difference equation, some elegant probabilistic stochastic analyses of the boundedness and convergence of the OGM have been conducted as the number of training samples goes to infinity [18]–[25]. Also, for the case of nonlinear g(x), some deterministic boundedness and convergence analyses of the OGM were given in [26]–[36]. With a penalty term proportional to the norm of the parameter vector added to the error function J(H), it was proved in [26] that {H_i}_{i=1}^{∞} can be automatically bounded. An integrated error function and a tuning learning law were used in [27], and the boundedness of {H_i}_{i=1}^{∞} was derived. Using Lyapunov functions, [28] and [29] proved the boundedness of {H_i}_{i=1}^{∞} and of the output error sequence {e_i}_{i=1}^{∞} under the assumption of bounded uncertainty for two-layer and three-layer feedforward neural networks, respectively. Lyapunov analysis was carried out in [30] and [31], which considered recurrent neural networks, and in [32], where radial basis function (RBF) neural networks were considered, to obtain the boundedness of {H_i}_{i=1}^{∞}. However, the boundedness results derived in these papers are based on a time-varying learning rate and do not apply to the case of a constant learning rate. For the convergence property, by investigating the monotonicity of the sequence {J(H_{nK})}_{n=0}^{∞}, [33]–[35] derived some important results, such as the weak convergence lim_{i→∞} ||∇J(H_i)|| = 0 and the strong convergence lim_{i→∞} H_i ∈ {H : ∇J(H) = 0}, with a learning rate adjusted by a proper decreasing scheme. Using a constant learning rate, a similar weak convergence result was then given in [36]. Nevertheless, the convergence results obtained in [33]–[36] are local convergence results, as the parameter sequence {H_i}_{i=1}^{∞} may converge to a local minimum of the error function J(H). Additionally, the uniform upper boundedness assumption on g(x) in those works makes them inapplicable to the case of linear g(x).
In this paper, we present a deterministic analysis of the boundedness and convergence of the parameter sequence {H_i}_{i=1}^{∞} for the general case of a linear or nonlinear activation function, with a constant learning rate, and for both the OGM-F and the OGM-SS training schemes. We note that the linear difference equation (4) of V_i plays a key role in the analyses of [18]–[25]. However, this useful equation is applicable only to the case of a linear activation function. Using the differential mean value theorem [37], we derive an extended difference equation of V_i for the general case of a linear or nonlinear activation function. Based on this extended difference equation, some new results on the boundedness and convergence of the parameter sequence {H_i}_{i=1}^{∞} are then obtained.
Under the mild restrictions of a bounded training set {(a_k, X_k)}_{k=1}^{K} and a bounded initialization H_1, we prove that the parameter sequence {H_i}_{i=1}^{∞} is uniformly upper bounded as long as there exists a solution of an inequality [see (16) of this paper] regarding the bound. To the best of our knowledge, this explicit criterion for the deterministic boundedness of the parameter sequence {H_i}_{i=1}^{∞} has not been derived before. A further investigation of this explicit criterion is then carried out, and we show that a solution of the inequality always exists for the case of linear g(x), which means that the parameter sequence {H_i}_{i=1}^{∞} can always be uniformly upper bounded. For the case of nonlinear g(x), some simple adjustment methods on the training set {(a_k, X_k)}_{k=1}^{K}, the activation function g(x), or the initialization H_1, which can improve the upper boundedness property of {H_i}_{i=1}^{∞}, may be found on the basis of the analysis of the inequality.
Then, based on the boundedness result, the convergence of the parameter sequence {H_i}_{i=1}^{∞} is obtained. We show
that H_i converges into a zone around an optimal solution as i → ∞, and the size of the zone is associated with the disturbance sequence {ε_k}_{k=1}^{K} and the learning rate u. Indeed, this result can be considered a deterministic counterpart of the stochastic results obtained in [23]–[25], which consider only the case of a linear activation function.
The convergence property is further investigated for the special case of a vanishing disturbance sequence {ε_k}_{k=1}^{K}, also called the case of perfect modeling [38], [39]. A global convergence result, i.e., that the parameter vector H_i always converges to an optimal solution [which leads to a global minimum of the error function J(H) as shown in (1)], is proved. Compared with the convergence result of [33] and [35], where H_i may be trapped in a local minimum, the global convergence derived in this paper is stronger.
The rest of this paper is organized as follows. Some assumptions for the analysis are listed in Section II. In Section III, with the OGM-F or OGM-SS training scheme, the main boundedness and convergence results, along with some necessary lemmas, are presented. The proofs of the main results and lemmas are gathered in Section IV. Finally, Section V gives the conclusions.
II. ASSUMPTIONS

In this section, for the boundedness and convergence analysis, the following assumptions are established.

1) The size of the training set {(a_k, X_k)}_{k=1}^{K} is finite, i.e., K < ∞.
2) The training set {(a_k, X_k)}_{k=1}^{K} is upper bounded, i.e., there exist a B_a ∈ R and a B_X ∈ R such that |a_k| ≤ B_a < ∞ and ||X_k|| ≤ B_X < ∞ for k = 1, ..., K.
3) The initialization H_1 and the optimal solution H* are upper bounded, i.e., there exist a B_{H_1} ∈ R and a B* ∈ R such that ||H_1|| ≤ B_{H_1} < ∞ and ||H*|| ≤ B* < ∞.
4) The activation function g(x) is continuously differentiable, and g'(x) is positive and uniformly upper bounded, i.e., there exists a B_g ∈ R such that 0 < g'(x) ≤ B_g < ∞.
5) For all bounded x with |x| ≤ B_x < ∞, g'(x) is lower bounded by a small positive value, i.e., g'(x) ≥ D_g(B_x) > 0, where D_g(B_x) := min{g'(x) : |x| ≤ B_x}.
6) The activation function g(x) satisfies g(0) = 0.

The notation ||·|| here is the Euclidean norm for vectors and the matrix norm induced by the Euclidean norm for matrices, i.e.,

||A|| = max_{||X||=1} ||AX||,  A ∈ R^{m×n}, X ∈ R^n.

The restriction to a finite-size training set in Assumption 1 is reasonable because obtaining an infinitely (or very) large number of training samples is a hard task in most applications. For Assumptions 2 and 3, the upper boundedness of the desired signals, the input signals, and the initialization can be readily justified in practice.

For Assumptions 4–6, we only assume the upper and lower boundedness of g'(x), and replace the boundedness assumption on g(x) used in [33]–[36] with g(0) = 0. It is not difficult to see that some widely used linear or nonlinear activation functions, e.g., the linear function g(x) = x and the hyperbolic tangent function g(x) = tanh(x), are consistent with Assumptions 4–6.
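As a quick numeric sanity check (ours, not from the paper), one can verify on a grid that g(x) = tanh(x) meets Assumptions 4–6 with B_g = 1 and D_g(B_x) = 1 − tanh²(B_x); the grid width B_x = 2 is an arbitrary choice for the example:

```python
import math

def g(x):
    return math.tanh(x)

def g_prime(x):
    return 1.0 - math.tanh(x) ** 2      # derivative of tanh

B_x = 2.0                               # arbitrary bound on |x| for Assumption 5
grid = [i * 1e-3 for i in range(-2000, 2001)]   # samples of |x| <= B_x

# Assumption 4: 0 < g'(x) <= B_g = 1 on the whole grid.
assert all(0.0 < g_prime(x) <= 1.0 for x in grid)

# Assumption 5: on |x| <= B_x, g'(x) >= D_g(B_x) = g'(B_x) > 0
# (g' is even and decreasing in |x|, so its minimum sits at the endpoints).
D_g = g_prime(B_x)
assert D_g > 0.0
assert all(g_prime(x) >= D_g - 1e-12 for x in grid)

# Assumption 6: g(0) = 0.
assert g(0.0) == 0.0
```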
III. MAIN RESULTS

We summarize in this section the main results obtained for the boundedness and convergence of the parameter sequence under Assumptions 1–6. The proofs of the results are postponed to the next section.

Let i = nK + k for n = 0, 1, ... and k = 1, ..., K; the boundedness and convergence properties of the parameter sequence {H_i}_{i=1}^{∞} are studied through an investigation of the parameter estimation error V_i for i = 1, 2, ... Let us first extend the well-known linear difference equation (4) of V_i to the case of a nonlinear activation function g(x). With V_i = H_i − H*, we can rewrite the recursion (3) of the OGM as follows:

V_{i+1} = V_i + u (a_i − g(X_i^T H_i)) g'(X_i^T H_i) X_i,  i = 1, 2, ...    (5)

Using the differential mean value theorem, there exists an H'_i such that

g(X_i^T H_i) = g(X_i^T H*) + g'(X_i^T H'_i)(X_i^T H_i − X_i^T H*) = g(X_i^T H*) + g'(X_i^T H'_i) X_i^T V_i    (6)

where X_i^T H'_i satisfies

min(X_i^T H_i, X_i^T H*) ≤ X_i^T H'_i ≤ max(X_i^T H_i, X_i^T H*).

Inserting (6) into (5) and using (2), we now obtain an extended difference equation for the general case of linear or nonlinear g(x) as follows:

V_{i+1} = (I − u t_i A_i) V_i + u ε_i g'(X_i^T H_i) X_i    (7)

where

t_i = g'(X_i^T H'_i) g'(X_i^T H_i)    (8)

A_i = X_i X_i^T.    (9)

Comparing (4) with (7), it is clear that the linear difference equation can be seen as a special case of the extended difference equation with a constant g'(x) for linear g(x). Note that the extended difference equation is no longer linear for nonlinear g(x), as t_i is associated with V_i through H_i and H'_i. However, using this equation, some excellent results can still be deduced, as will be shown below.

Using the extended difference equation (7), after i recursions, we can obtain the following expression for V_{i+1}:

V_{i+1} = F_i + S_i,  i = 1, 2, ...    (10)

where

F_i = U_{0,i} V_1    (11)

S_i = u Σ_{j=1}^{i} ε_j g'(X_j^T H_j) U_{j,i} X_j    (12)
and U_{j,i} is the transition matrix defined as

U_{j,i} = (I − u t_i A_i) ··· (I − u t_{j+1} A_{j+1}) for j < i, and U_{i,i} = I.    (13)

With (10), as V_{i+1} = H_{i+1} − H*, we can now derive the boundedness and convergence properties of the parameter sequence {H_i}_{i=1}^{∞} through an investigation of F_i and S_i.
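For the special case of linear g(x) = x (so g'(x) = 1 and t_i = 1), the unrolled decomposition (10)–(13) can be checked numerically against the raw recursion (3); the random data below are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
N, steps, u = 3, 6, 0.05
X = rng.normal(size=(steps, N))        # inputs X_1 .. X_i
eps = rng.normal(size=steps)           # disturbances eps_1 .. eps_i
H_star = rng.normal(size=N)
a = X @ H_star + eps                   # a_j = X_j^T H* + eps_j (linear g)

# Raw OGM recursion (3) with g(x) = x, g'(x) = 1.
H = np.zeros(N)
for j in range(steps):
    H = H + u * (a[j] - X[j] @ H) * X[j]
V_direct = H - H_star                  # V_{i+1}

# Decomposition (10): V_{i+1} = U_{0,i} V_1 + u * sum_j eps_j U_{j,i} X_j,
# with U_{j,i} = (I - u A_i) ... (I - u A_{j+1}) and A_l = X_l X_l^T.
def U(j, i):
    M = np.eye(N)
    for l in range(j + 1, i + 1):
        M = (np.eye(N) - u * np.outer(X[l - 1], X[l - 1])) @ M
    return M

V1 = np.zeros(N) - H_star              # V_1 = H_1 - H* with H_1 = 0
V_decomp = U(0, steps) @ V1 + u * sum(
    eps[j - 1] * U(j, steps) @ X[j - 1] for j in range(1, steps + 1))

assert np.allclose(V_direct, V_decomp)
```

The two expressions agree to floating-point accuracy, which is exactly the content of (10) in the linear case.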
A. Boundedness of the Parameter Sequence {H_i}_{i=1}^{∞}

Let us first investigate a single factor (I − u t_i A_i) of U_{j,i} in (13) and introduce the following lemma. We note here that all the lemmas, theorems, and corollaries in this paper are derived for the OGM-F or OGM-SS training scheme, i.e., (a_i, X_i, ε_i) ∈ {(a_k, X_k, ε_k)}_{k=1}^{K} for i = 1, 2, ...

Lemma 1: Under Assumptions 2 and 4, let Z be an N-dimensional vector, and let ρ_i be the correlation coefficient between X_i and Z, i.e., ρ_i = <X_i, Z>/(||X_i|| ||Z||) (<·,·> denotes the inner product) for X_i ≠ 0 and Z ≠ 0, and ρ_i = 1 for X_i = 0 or Z = 0. There exists a positive constant

u_0 = 1/(B_g^2 B_X^2)    (14)

such that for all 0 < u ≤ u_0, we have

||(I − u t_i A_i) Z|| ≤ √(1 − u ρ_i^2 t_i λ_i) ||Z||

where t_i is defined in (8), A_i is defined in (9), and λ_i = ||X_i||^2.
Noting that t_i is positive by Assumption 4 and that 0 ≤ u ρ_i^2 t_i λ_i ≤ 1 for 0 < u ≤ u_0, as shown in the proof in the next section, from Lemma 1, for nonzero vectors Z and X_i, it is not difficult to verify that ||(I − u t_i A_i) Z|| < ||Z|| when ρ_i ≠ 0 (i.e., Z is not orthogonal to X_i). With this property of (I − u t_i A_i), we can obtain the following important lemma about U_{j,i}.
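Lemma 1's single-step bound can be probed numerically in the linear case g(x) = x, where t_i = 1 and B_g = 1; the vectors and bounds below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
N = 4
B_X = 2.0
u0 = 1.0 / B_X ** 2                # u_0 = 1/(B_g^2 B_X^2) with B_g = 1
u = 0.8 * u0

for _ in range(1000):
    X = rng.normal(size=N)
    X *= min(1.0, B_X / np.linalg.norm(X))     # enforce ||X_i|| <= B_X
    Z = rng.normal(size=N)
    lam = X @ X                                # lambda_i = ||X_i||^2
    rho2 = (X @ Z) ** 2 / (lam * (Z @ Z))      # rho_i^2
    lhs = np.linalg.norm((np.eye(N) - u * np.outer(X, X)) @ Z)
    rhs = np.sqrt(1.0 - u * rho2 * lam) * np.linalg.norm(Z)
    # ||(I - u A_i) Z|| <= sqrt(1 - u rho_i^2 lambda_i) ||Z||
    assert lhs <= rhs + 1e-12
```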
Lemma 2: Under Assumptions 2–5, let Z be an upper bounded N-dimensional vector that is a linear combination of some elements of the sequence {X_l}_{l=j+1}^{i} of finite length i − j, and let the parameter vectors {H_l}_{l=j+1}^{i} be upper bounded by B_H, i.e., ||H_l|| ≤ B_H for l = j+1, ..., i. Then for all 0 < u ≤ u_0, we have

||U_{j,i} Z|| ≤ √(1 − u θ) ||Z||

where

θ = η δ_g^2 D_X^2 > 0    (15)

with δ_g = D_g(max(B_X B_H, B_X B*)) the lower bound of g'(x) when |x| ≤ max(B_X B_H, B_X B*), D_X the minimal value of the norms of the nonzero elements of the input sequence {X_k}_{k=1}^{K}, and η ∈ (0, 1] a constant.

For 0 < u ≤ u_0, from (14) and (15), it can be seen that 0 < u θ ≤ 1. Using Lemmas 1 and 2, we can now derive the important boundedness property of {H_i}_{i=1}^{∞} as follows.
Theorem 1: Under Assumptions 1–6, if there exists at least one solution of the inequality

f(B) := (B − B_{H_1}) D_g^2(B B_X)/B_g − 9 K B_a B_X/(4 D_X^2) ≥ 0    (16)

where B ∈ R and D_g(B B_X) denotes the lower bound of g'(x) when |x| ≤ B B_X, then for all 0 < u ≤ u_0, the parameter sequence {H_i}_{i=1}^{∞} is uniformly upper bounded:

||H_i|| ≤ B_H < ∞,  i = 1, 2, ...

where

B_H = min{B : f(B) ≥ 0}.
It follows immediately from Theorem 1 that, for a training procedure with a specific training set and activation function, the parameter sequence {H_i}_{i=1}^{∞} can be uniformly upper bounded as long as there exists a B that satisfies (16). We then investigate the existence of a solution of (16). As D_g^2(B B_X) in (16) differs for different activation functions g(x), it is clear that the solution of (16) varies with g(x).

For the case of linear g(x), D_g^2(B B_X) = C_g is a positive constant, and we can readily conclude that solutions of (16) exist, and that the parameter sequence {H_i}_{i=1}^{∞} can always be uniformly upper bounded by B_H given as follows:

B_H = B_{H_1} + 9 K B_a B_X B_g/(4 D_X^2 C_g) < ∞.

However, for the case of nonlinear g(x), it is difficult to obtain an explicit expression for B_H. Indeed, for some scenarios, e.g., an activation function g(x) = tanh(x) and a large training set (i.e., a large K), a solution of (16) may not exist, and then a uniform upper bound for {H_i}_{i=1}^{∞} cannot be guaranteed. In these scenarios, however, by investigating each term of the inequality (16), we may obtain some methods for improving the upper boundedness property of {H_i}_{i=1}^{∞}. For example, we can adjust the training set {(a_k, X_k)}_{k=1}^{K} or the activation function g(x), including scaling down the desired signals {a_k}_{k=1}^{K} to decrease B_a, scaling up the input sequence {X_k}_{k=1}^{K} to reduce B_X/D_X^2, and/or scaling up g(x) by a positive number to increase D_g^2(B B_X)/B_g. Besides, properly decreasing the training-set size K, rearranging the order of the input signals in {X_k}_{k=1}^{K} to increase η, or setting a lower initialization H_1 to reduce B_{H_1} may also be useful for obtaining a nonnegative f(B).
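The criterion (16) lends itself to a direct numeric check: sweep B over a grid and take B_H = min{B : f(B) ≥ 0}. All constants below are illustrative assumptions for a tanh network, not values from the paper; the example also shows the adjustment idea that scaling down the desired signals (reducing B_a) can restore a solution.

```python
import math

# Illustrative constants (assumed, not from the paper) for a tanh TLFNN:
B_g, B_H1, B_X, D_X, K = 1.0, 0.1, 1.0, 1.0, 4

def f(B, B_a):
    """Left-hand side of inequality (16). For g = tanh, g' is even and
    decreasing in |x|, so D_g(y) = min{g'(x) : |x| <= y} = 1 - tanh(y)^2."""
    D_g = 1.0 - math.tanh(B * B_X) ** 2
    return (B - B_H1) * D_g ** 2 / B_g - 9.0 * K * B_a * B_X / (4.0 * D_X ** 2)

def bound(B_a):
    """B_H = min{B : f(B) >= 0} over a grid, or None if none is found."""
    feasible = [b * 1e-3 for b in range(20001) if f(b * 1e-3, B_a) >= 0.0]
    return min(feasible) if feasible else None

print(bound(0.05))   # -> None: with these constants, (16) has no solution
print(bound(0.02))   # smaller B_a restores a finite bound (about 0.32 here)
```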
B. Convergence of the Parameter Sequence {H_i}_{i=1}^{∞}

Let P = {X_k}_{k=1}^{K} denote the input sequence, and assume that the dimension of P is N_P (i.e., the maximum number of linearly independent elements of P is N_P). It is not difficult to verify from (2) that there exists more than one optimal solution to (1) if N_P < N, where N is the dimension of X_k, because we have

J(H*) = J(H* + H^⊥)

where H^⊥ denotes a nonzero vector orthogonal to P, and H^⊥ = 0 if N_P = N. Each optimal solution can then be expressed as H* + H^⊥. Based on this argument, together with the boundedness result obtained in Theorem 1, the convergence property of the parameter sequence {H_i}_{i=1}^{∞} can be deduced.

Theorem 2: Under Assumptions 1–6, with an arbitrary initialization H_1, if there exists a solution of (16), then for all
0 < u ≤ u_0, we have

lim_{i→∞} ||H_i − H* − H_1^⊥|| ≤ Λ C(u)

where H_1^⊥ is the component of H_1 that is orthogonal to the input sequence P, Λ = B_X B_g Σ_{k=1}^{K} |ε_k| is a positive constant, and

C(u) = u/(1 − √(1 − u θ)) + u.
It is clear that H* + H_1^⊥ is an optimal solution to (1). Theorem 2 thus shows that the parameter sequence {H_i}_{i=1}^{∞} converges into a zone around this optimal solution, where the size of the zone is associated with the disturbance sequence {ε_k}_{k=1}^{K} and the learning rate u.
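The zone behavior described by Theorem 2 can be observed in a small OGM-F simulation with a nonvanishing disturbance; the constants, disturbance level, and the loose tolerance below are illustrative assumptions, not the theorem's C(u):

```python
import math
import random

random.seed(2)
K, u, cycles = 4, 0.1, 3000
H_star = [0.4, -0.2]
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
eps = [0.01 * random.uniform(-1, 1) for _ in range(K)]   # fixed disturbances
a = [math.tanh(sum(x * h for x, h in zip(X[k], H_star))) + eps[k]
     for k in range(K)]

H = [0.0, 0.0]
tail = []                                  # distances to H* over last 100 cycles
for c in range(cycles):
    for k in range(K):                     # OGM-F: fixed order in each cycle
        s = sum(x * h for x, h in zip(X[k], H))
        grad = (a[k] - math.tanh(s)) * (1.0 - math.tanh(s) ** 2)
        H = [h + u * grad * x for h, x in zip(H, X[k])]
    if c >= cycles - 100:
        tail.append(math.dist(H, H_star))

# The iterates settle into a small zone around the optimum, not exactly onto it.
assert 0.0 < max(tail) < 0.1
```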
We further study the convergence property of {H_i}_{i=1}^{∞} for the special case of a vanishing disturbance sequence {ε_k}_{k=1}^{K}; then (2) can be rewritten as

a_k = g(X_k^T H*),  k = 1, ..., K.

This special case can be satisfied. For example, in [40] it was proved that there exists a converged parameter vector such that a TLFNN with input dimension N can approximate the mapping of a set {a_k, X_k}_{k=1}^{N} with arbitrarily small error. Furthermore, the universal approximation analysis conducted in [41] and [42] shows that a TLFNN can approximate any continuous target function with vanishing output deviation by using a proper parameter vector.

For this special case, it can be seen that

S_i = u Σ_{j=1}^{i} ε_j g'(X_j^T H_j) U_{j,i} X_j = 0,  i = 1, 2, ...

as ε_i ∈ {ε_k}_{k=1}^{K} and ε_k = 0 for k = 1, ..., K, and we have

V_{i+1} = F_i = U_{0,i} V_1,  i = 1, 2, ...
A global convergence result can then be obtained from Theorem 2 as follows.

Corollary 1: Under Assumptions 1–5, for the case of a vanishing disturbance sequence {ε_k}_{k=1}^{K}, with an arbitrary initialization H_1 and for all 0 < u ≤ u_0, H_i always converges to an optimal solution as i → ∞:

lim_{i→∞} H_i = H* + H_1^⊥

where H_1^⊥ is the component of H_1 that is orthogonal to the input sequence P = {X_k}_{k=1}^{K}.

Corollary 1 states that, in this special case, H_i always converges to a point that achieves a global minimum of the performance measure J(H), which is a strong convergence result. Note that, for this special case, Assumption 6 and the inequality (16) need not hold, because the bounded initialization H_1 and the bounded optimal solution H* of Assumption 3 can always guarantee the boundedness of the parameter sequence {H_i}_{i=1}^{∞}, as shown in the proof in the next section.
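Corollary 1 can be illustrated with a vanishing-disturbance (perfect modeling) simulation: the targets are generated exactly as a_k = g(X_k^T H*), the inputs span R^N, and H_1 = 0, so H_1^⊥ = 0 and the iterates should approach H* itself. All values below are illustrative:

```python
import math

H_star = [0.5, -0.3]
X = [[1.0, 0.0], [0.0, 1.0]]           # inputs span R^2, so H_1-perp vanishes
a = [math.tanh(sum(x * h for x, h in zip(X_k, H_star)))
     for X_k in X]                     # eps_k = 0: perfect modeling

u = 0.5                                # u <= u_0 = 1/(B_g^2 B_X^2) = 1 here
H = [0.0, 0.0]                         # H_1 = 0
for _ in range(5000):                  # OGM-F cycles
    for a_k, X_k in zip(a, X):
        s = sum(x * h for x, h in zip(X_k, H))
        grad = (a_k - math.tanh(s)) * (1.0 - math.tanh(s) ** 2)
        H = [h + u * grad * x for h, x in zip(H, X_k)]

assert math.dist(H, H_star) < 1e-8     # H_i -> H*, a global minimum of J
```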
IV. PROOFS OF RESULTS

The proofs of the lemmas, theorems, and corollaries presented in Section III are given in this section.

Proof of Lemma 1: For X_i = 0, the result is straightforward. We consider the case X_i ≠ 0 below.

From (9), we can see that A_i is a symmetric and positive semidefinite matrix, and

rank(A_i) = rank(X_i) = 1,  X_i ≠ 0.

Then there exists one positive eigenvalue of A_i, and all other eigenvalues are equal to zero. It follows from the orthogonal decomposition of symmetric matrices that there exists an orthogonal matrix Q_i = [q_{i,1}, q_{i,2}, ..., q_{i,N}] with

q_{i,1} = X_i/||X_i||    (17)

such that

I − u t_i A_i = Q_i Λ_i Q_i^T    (18)

where Λ_i = diag(1 − u t_i λ_i, 1, ..., 1) and λ_i = ||X_i||^2 is the positive eigenvalue of A_i.

With (18), we have

(I − u t_i A_i) Z = Q_i Λ_i Q_i^T Z = Q_i [(1 − u t_i λ_i) q_{i,1}^T Z, q_{i,2}^T Z, ..., q_{i,N}^T Z]^T.    (19)

Then it can be derived from (19) that

||(I − u t_i A_i) Z|| = ||Q_i [(1 − u t_i λ_i) q_{i,1}^T Z, q_{i,2}^T Z, ..., q_{i,N}^T Z]^T||
= ||[(1 − u t_i λ_i) q_{i,1}^T Z, q_{i,2}^T Z, ..., q_{i,N}^T Z]^T||
= [(1 − u t_i λ_i)^2 (q_{i,1}^T Z)^2 + (q_{i,2}^T Z)^2 + ··· + (q_{i,N}^T Z)^2]^{1/2}
= [||Z||^2 − u (2 t_i λ_i − u t_i^2 λ_i^2)(q_{i,1}^T Z)^2]^{1/2}.    (20)

Using (17), we have

(q_{i,1}^T Z)^2 = ρ_i^2 ||Z||^2    (21)

where ρ_i = <X_i, Z>/(||X_i|| ||Z||) is the correlation coefficient between X_i and Z, and 0 ≤ ρ_i^2 ≤ 1.

Plugging (21) into (20), we have

||(I − u t_i A_i) Z|| = √(1 − u ρ_i^2 (2 t_i λ_i − u t_i^2 λ_i^2)) ||Z||.    (22)

It follows from Assumption 4 that

0 < t_i = g'(X_i^T H'_i) g'(X_i^T H_i) ≤ B_g^2    (23)

and from Assumption 2 that

0 ≤ λ_i = ||X_i||^2 ≤ B_X^2.    (24)
Now, if the learning rate u satisfies

0 < u ≤ u_0 = 1/(B_g^2 B_X^2)    (25)

it is not difficult to verify that

u (2 t_i λ_i − u t_i^2 λ_i^2) = u t_i λ_i (2 − u t_i λ_i) ≥ u t_i λ_i (2 − u B_g^2 B_X^2) ≥ u t_i λ_i    (26)

where the first inequality uses (23) and (24), and the second inequality uses (25).

Consequently, inserting (26) into (22), we have

||(I − u t_i A_i) Z|| ≤ √(1 − u ρ_i^2 t_i λ_i) ||Z||.

As 0 ≤ ρ_i^2 ≤ 1, from (23), (24), and (25), it is clear that 0 ≤ u ρ_i^2 t_i λ_i ≤ 1. This completes the proof.
Proof of Lemma 2: If Z = 0, then the result is straightforward. We consider the case Z ≠ 0 below.

We have

U_{j,i} Z = (I − u t_i A_i)(I − u t_{i−1} A_{i−1}) ··· (I − u t_{j+1} A_{j+1}) Z

where A_l = X_l X_l^T and X_l ∈ {X_k}_{k=1}^{K} for l = j+1, ..., i. Let ρ_l denote the correlation coefficient between X_l and U_{j,l−1} Z

ρ_l = <X_l, U_{j,l−1} Z>/(||X_l|| ||U_{j,l−1} Z||),  l = j+1, ..., i.

As Assumptions 2 and 4 hold and the learning rate u satisfies 0 < u ≤ u_0, using the definition of U_{j,i} in (13), it follows from Lemma 1 (by replacing Z in Lemma 1 with U_{j,l−1} Z for l = j+1, ..., i) that

||U_{j,i} Z|| = ||(I − u t_i A_i) U_{j,i−1} Z|| ≤ √(1 − u ρ_i^2 t_i λ_i) ||U_{j,i−1} Z|| ≤ √(1 − u ρ_i^2 t_i λ_i) ··· √(1 − u ρ_{j+1}^2 t_{j+1} λ_{j+1}) ||Z||.    (27)

Also, in (27), we have

0 ≤ u ρ_l^2 t_l λ_l ≤ 1

where

t_l = g'(X_l^T H'_l) g'(X_l^T H_l)

with

min(X_l^T H_l, X_l^T H*) ≤ X_l^T H'_l ≤ max(X_l^T H_l, X_l^T H*)

and λ_l = ||X_l||^2 for l = j+1, ..., i.

If X_l = 0, then λ_l = ||X_l||^2 = 0 and √(1 − u ρ_l^2 t_l λ_l) = 1.

Suppose that there exist L nonzero elements of {X_l}_{l=j+1}^{i}. Since the nonzero vector Z is a linear combination of some elements of {X_l}_{l=j+1}^{i}, it is not difficult to verify that L ≥ 1. Let the indices of the nonzero elements of {X_l}_{l=j+1}^{i} be m_1, m_2, ..., m_L ∈ [j+1, i], and without loss of generality let m_1 < m_2 < ··· < m_L; we can then rewrite (27) as follows:

||U_{j,i} Z|| ≤ √(1 − u ρ_{m_L}^2 t_{m_L} λ_{m_L}) ··· √(1 − u ρ_{m_1}^2 t_{m_1} λ_{m_1}) ||Z||    (28)

where λ_{m_l} = ||X_{m_l}||^2 > 0 for l = 1, ..., L.

Let

ρ_{m_p}^2 = max_{1≤l≤L} ρ_{m_l}^2

with p ∈ {1, 2, ..., L}. As 0 ≤ u ρ_{m_l}^2 t_{m_l} λ_{m_l} ≤ 1 for l = 1, ..., L, it follows from (28) that

||U_{j,i} Z|| ≤ √(1 − u ρ_{m_p}^2 t_{m_p} λ_{m_p}) ||Z||.    (29)

We now investigate the lower bound of the term ρ_{m_p}^2 t_{m_p} λ_{m_p} in (29). Let

D_X = min_{1≤k≤K, X_k≠0} ||X_k||.

As X_{m_l} ∈ {X_k}_{k=1}^{K} and X_{m_l} ≠ 0, we have

λ_{m_l} = ||X_{m_l}||^2 ≥ D_X^2 > 0,  l = 1, ..., L.    (30)

It is then clear that

λ_{m_p} ≥ D_X^2 > 0.    (31)

The parameter vectors {H_l}_{l=j+1}^{i} are upper bounded by B_H, and the input sequence {X_k}_{k=1}^{K} and the optimal solution H* are also upper bounded, by Assumptions 2 and 3, respectively. As

min(X_{m_l}^T H_{m_l}, X_{m_l}^T H*) ≤ X_{m_l}^T H'_{m_l} ≤ max(X_{m_l}^T H_{m_l}, X_{m_l}^T H*)

the quantities X_{m_l}^T H'_{m_l} and X_{m_l}^T H_{m_l} are upper bounded as well, i.e., for l = 1, ..., L

|X_{m_l}^T H'_{m_l}| ≤ max(B_X B_H, B_X B*) and |X_{m_l}^T H_{m_l}| ≤ max(B_X B_H, B_X B*).

From Assumption 5, g'(X_{m_l}^T H'_{m_l}) and g'(X_{m_l}^T H_{m_l}) are lower bounded by the small positive value

δ_g = D_g(max(B_X B_H, B_X B*))

and consequently

t_{m_l} = g'(X_{m_l}^T H'_{m_l}) g'(X_{m_l}^T H_{m_l}) ≥ δ_g^2 > 0,  l = 1, ..., L.    (32)

It is then clear that

t_{m_p} ≥ δ_g^2 > 0.    (33)

We claim that there exists a constant η ∈ (0, 1] such that ρ_{m_p}^2 ≥ η. Proof by contradiction is used to verify this claim. Suppose, to the contrary, that for any ζ > 0 we have ρ_{m_p}^2 ≤ ζ; then ρ_{m_l}^2 ≤ ζ for all l = 1, ..., L. Since X_l = 0 for l ∉ {m_1, m_2, ..., m_L}, U_{j,i} Z can be rewritten as

U_{j,i} Z = (I − u t_{m_L} A_{m_L}) ··· (I − u t_{m_1} A_{m_1}) Z

with L ≤ i − j.
From (22), we have

||U_{j,i} Z|| = √(1 − u ρ_{m_L}^2 (2 t_{m_L} λ_{m_L} − u t_{m_L}^2 λ_{m_L}^2)) ··· √(1 − u ρ_{m_1}^2 (2 t_{m_1} λ_{m_1} − u t_{m_1}^2 λ_{m_1}^2)) ||Z||.    (34)

For 0 < u ≤ u_0, from (26), (30), and (32), we have

u (2 t_{m_l} λ_{m_l} − u t_{m_l}^2 λ_{m_l}^2) ≥ u t_{m_l} λ_{m_l} > 0,  l = 1, ..., L.    (35)

We also have

u (2 t_{m_l} λ_{m_l} − u t_{m_l}^2 λ_{m_l}^2) ≤ 1    (36)

as (u t_{m_l} λ_{m_l} − 1)^2 ≥ 0.

It follows from (35), (36), and ρ_{m_l}^2 ≤ ζ that

√(1 − u ρ_{m_l}^2 (2 t_{m_l} λ_{m_l} − u t_{m_l}^2 λ_{m_l}^2)) ≥ √(1 − ζ),  l = 1, ..., L

and consequently, using (34), we have

||Z|| ≥ ||U_{j,i} Z|| ≥ (√(1 − ζ))^L ||Z||.    (37)

Since Z is a nonzero upper bounded vector, L is finite, and ζ > 0 can be arbitrarily small, it can readily be obtained from (37) that

||U_{j,i} Z|| = ||Z||.

From (20) in the proof of Lemma 1, it can be seen that ||(I − u t_i A_i) Z|| = ||Z|| if and only if q_{i,1}^T Z = 0. As q_{i,1} has the same direction as X_i by (17), q_{i,1}^T Z = 0 means that Z is orthogonal to X_i. In the same vein, after i − j iterations, it is readily concluded from ||U_{j,i} Z|| = ||Z|| that Z is orthogonal to {X_l}_{l=j+1}^{i}, which contradicts the fact that Z is a linear combination of some elements of {X_l}_{l=j+1}^{i}. Therefore, there exists a constant η ∈ (0, 1] such that

ρ_{m_p}^2 ≥ η > 0.    (38)

Inserting (31), (33), and (38) into (29), we have

||U_{j,i} Z|| ≤ √(1 − u θ) ||Z||

where 0 < u θ ≤ 1 and θ = η δ_g^2 D_X^2 > 0. This completes the proof.
Proof of Theorem 1: We use induction to prove Theorem 1.

1) The upper bound for i = 1: As B_H is a solution of (16), it is clear that B_H ≥ B_{H_1}, and consequently

||H_1|| ≤ B_{H_1} ≤ B_H.

2) The upper bound for i > 1: Suppose that ||H_j|| ≤ B_H for all j = 1, ..., i; we now need to prove that ||H_{i+1}|| ≤ B_H.

Let i = nK + k for n = 0, 1, ... and k = 1, ..., K. We can rewrite the recursion (3) of the OGM as follows:

H_{i+1} = H_i + u (a_i − g(X_i^T H_i)) g'(X_i^T H_i) X_i,  i = 1, 2, ...    (39)

Using the differential mean value theorem, there exists an H'_i such that

g(X_i^T H_i) = g(0) + g'(X_i^T H'_i)(X_i^T H_i − 0) = g(0) + g'(X_i^T H'_i) X_i^T H_i    (40)

where X_i^T H'_i satisfies

min(X_i^T H_i, 0) ≤ X_i^T H'_i ≤ max(X_i^T H_i, 0).    (41)

Inserting (40) into (39) and using g(0) = 0 of Assumption 6, we have

H_{i+1} = (I − u t_i A_i) H_i + u a_i g'(X_i^T H_i) X_i    (42)

where A_i is defined in (9) and

t_i = g'(X_i^T H'_i) g'(X_i^T H_i).    (43)

Using (42), after i recursions, we can obtain the following expression for H_{i+1}:

H_{i+1} = G_i + R_i    (44)

where

G_i = U_{0,i} H_1    (45)

R_i = u Σ_{j=1}^{i} a_j g'(X_j^T H_j) U_{j,i} X_j.    (46)

For the term G_i in (45), if the learning rate u satisfies 0 < u ≤ u_0, it follows from Lemma 1 and Assumption 3 that

||G_i|| = ||U_{0,i} H_1|| ≤ √(1 − u ρ_i^2 t_i λ_i) ··· √(1 − u ρ_1^2 t_1 λ_1) ||H_1|| ≤ ||H_1|| ≤ B_{H_1}    (47)

where ρ_l denotes the correlation coefficient between X_l and U_{0,l−1} H_1

ρ_l = <X_l, U_{0,l−1} H_1>/(||X_l|| ||U_{0,l−1} H_1||)

and 0 ≤ u ρ_l^2 t_l λ_l ≤ 1 for l = 1, ..., i.

For the term R_i in (46), we consider the cases i ≤ K and i > K separately.

When i ≤ K, from Lemma 1, the upper bound of R_i can be obtained as

||R_i|| ≤ u i B_a B_g B_X ≤ u K B_a B_g B_X,  1 < i ≤ K.    (48)

When i > K, using the property of the cyclic training and recalling that i = nK + k for n = 1, 2, ... and k = 1, ..., K, R_i can be rewritten as

R_i = u Σ_{j=1}^{K} U_{nK, nK+k} Σ_{l=1}^{n} a_j g'(X_j^T H_{(l−1)K + j^{(l−1)}}) M^{(j)}_{n,l} X_j + u Σ_{j=1}^{k} a_{nK+j} g'(X_{nK+j}^T H_{nK+j}) U_{nK+j, nK+k} X_{nK+j}    (49)
where $j^{(l-1)}$ denotes the position of $X_j$ in the $l$th training cycle for $l = 1, \dots, n$, and
$$M_{n,l}^{(j)} = \begin{cases} U_{(n-1)K,nK} \cdots U_{lK,(l+1)K}\, U_{(l-1)K+j^{(l-1)},\,lK}, & l = 1, \dots, n-1 \\ U_{(l-1)K+j^{(l-1)},\,lK}, & l = n. \end{cases} \tag{50}$$
For $l = 1, \dots, n$ and $j = 1, \dots, K$, $j^{(l-1)} = j$ for the OGM-F scheme, while $j^{(l-1)}$ is an arbitrary number between 1 and $K$ for the OGM-SS scheme.
The boundedness of the first term on the RHS of (49) is investigated first. It follows from Lemma 1 and Assumptions 2 and 4 that this term is upper bounded by
$$u B_a B_{g'} \sum_{j=1}^{K} \left\| \sum_{l=1}^{n} M_{n,l}^{(j)} X_j \right\|. \tag{51}$$
The proof of the boundedness of $\sum_{l=1}^{n} M_{n,l}^{(j)} X_j$ in (51) then becomes a key step in the proof of the boundedness of the first term.
When $l = n$, from Lemma 1 we have
$$\left\|M_{n,n}^{(j)} X_j\right\| = \left\|U_{(n-1)K+j^{(n-1)},\,nK}\, X_j\right\| \le \|X_j\|.$$
From (19), we can see that $(I - u t_i A_i)Z$ is obtained by reducing the projection of $Z$ on $X_i$. Then, it is clear that $U_{(l-1)K+j^{(l-1)},\,lK}\, X_j$ is obtained by reducing from $X_j$ the projections on $\{X_{(l-1)K+j^{(l-1)}+1}, \dots, X_{lK}\}$. As $X_j$ can be linearly combined from some elements of $\{X_k\}_{k=1}^{K}$, and
$$\{X_{(l-1)K+j^{(l-1)}+1}, \dots, X_{lK}\} \subseteq \{X_k\}_{k=1}^{K}$$
it is not difficult to verify that $U_{(l-1)K+j^{(l-1)},\,lK}\, X_j$ can be linearly combined from some elements of $\{X_k\}_{k=1}^{K}$, a set of finite size $K$ (using Assumption 1), as well. Then, as
$$\|H_j\| \le B_H, \quad j = 1, \dots, nK + k$$
using Lemma 2, we have
$$\left\|U_{lK,(l+1)K}\, U_{(l-1)K+j^{(l-1)},\,lK}\, X_j\right\| \le \sqrt{1 - u\beta}\; \left\|U_{(l-1)K+j^{(l-1)},\,lK}\, X_j\right\| \le \sqrt{1 - u\beta}\; \|X_j\|, \quad l = 1, \dots, n-1$$
where $0 < u\beta \le 1$ with
$$\beta = \sigma_{g'}^2 D_X^2. \tag{52}$$
Using (41) and (43), from the proof of Lemma 2, we have
$$\sigma_{g'} = D_{g'}(\max(B_X B_H, 0)) = D_{g'}(B_X B_H). \tag{53}$$
In a similar vein, using Lemma 2, the upper bound of $\|M_{n,l}^{(j)} X_j\|$ can be derived as
$$\left\|M_{n,l}^{(j)} X_j\right\| \le \left(\sqrt{1 - u\beta}\right)^{n-l} \|X_j\|, \quad l = 1, \dots, n. \tag{54}$$
From (54), we have
$$\left\|\sum_{l=1}^{n} M_{n,l}^{(j)} X_j\right\| \le \sum_{l=0}^{n-1} \left(\sqrt{1 - u\beta}\right)^{l} \|X_j\| = \frac{1 - \left(\sqrt{1 - u\beta}\right)^{n}}{1 - \sqrt{1 - u\beta}}\, \|X_j\| \le \frac{1}{1 - \sqrt{1 - u\beta}}\, \|X_j\|. \tag{55}$$
Inserting (55) into (51) and using Assumption 2, we obtain the following upper bound of the first term in the RHS of (49):
$$u K B_a B_{g'} B_X\, \frac{1}{1 - \sqrt{1 - u\beta}}. \tag{56}$$
From Lemma 1, the second term in the RHS of (49) is upper bounded by
$$u k B_a B_{g'} B_X. \tag{57}$$
Plugging (56) and (57) into (49), and noting that $k \le K$, we obtain the upper bound of $\|R_i\|$ when $i > K$ as follows:
$$\|R_i\| \le K B_a B_{g'} B_X\, C(u), \quad i > K \tag{58}$$
where
$$C(u) = u + \frac{u}{1 - \sqrt{1 - u\beta}}. \tag{59}$$
As $0 < u\beta \le 1$, it is clear that $C(u) > u$. Now we can combine the upper bounds of $\|R_i\|$ for $i \le K$ [see (48)] and $i > K$ [see (58)], and obtain the following:
$$\|R_i\| \le K B_a B_{g'} B_X\, C(u), \quad i > 1. \tag{60}$$
Given the bound of $\|G_i\|$ in (47) and the bound of $\|R_i\|$ in (60), from (44) we can get an upper bound of $\|H_{i+1}\|$ as follows:
$$\|H_{i+1}\| \le \|G_i\| + \|R_i\| \le B_{H_1} + K B_a B_{g'} B_X\, C(u).$$
Now, if we have
$$B_{H_1} + K B_a B_{g'} B_X\, C(u) \le B_H \tag{61}$$
then $\|H_{i+1}\| \le B_H$ is proved.
From (59), we have
$$C'(u) = \frac{(6 - 2u\beta)\sqrt{1 - u\beta} - 6 + 5u\beta}{(4 - 2u\beta)\sqrt{1 - u\beta} - 4 + 4u\beta}. \tag{62}$$
Then from (62), it is not difficult to verify that $C(u)$ has a stationary point at $u = 3/(4\beta)$, with $C'(u) > 0$ for $u < 3/(4\beta)$ and $C'(u) < 0$ for $u > 3/(4\beta)$. Then, as
$$0 < u \le u_0 = \frac{1}{B_{g'}^2 B_X^2} \le \frac{1}{\sigma_{g'}^2 D_X^2} = \frac{1}{\beta}$$
we can prove that
$$\frac{2}{\beta} \le C(u) \le \frac{9}{4\beta} \tag{63}$$
where the upper bound is obtained when $u = 3/(4\beta)$, and the lower bound is obtained when $u = 1/\beta$.
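The stationary-point and range claims for $C(u)$ can be checked numerically. The value of $\beta$ below is an arbitrary positive constant chosen for illustration, since both bounds in (63) scale as $1/\beta$.

```python
import math

beta = 2.0                                  # any beta > 0 works for the check
C = lambda u: u + u / (1.0 - math.sqrt(1.0 - u * beta))   # C(u) from (59)

# sample C on a fine grid over (0, 1/beta]
grid = [(i / 10000) * (1.0 / beta) for i in range(1, 10001)]
vals = [C(u) for u in grid]

assert abs(C(3.0 / (4.0 * beta)) - 9.0 / (4.0 * beta)) < 1e-12  # value 9/(4*beta) at u = 3/(4*beta)
assert abs(C(1.0 / beta) - 2.0 / beta) < 1e-12                  # value 2/beta at u = 1/beta
assert max(vals) <= 9.0 / (4.0 * beta) + 1e-9                   # C(u) <= 9/(4*beta) on the interval
assert min(vals) >= 2.0 / beta - 1e-9                           # C(u) >= 2/beta on the interval
```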
Using the upper bound of $C(u)$ in (63), it is clear that if the inequality
$$B_H - B_{H_1} - \frac{9 K B_a B_X B_{g'}}{4\beta} \ge 0 \tag{64}$$
holds, then (61) is satisfied. Plugging (52) and (53) into (64), we have
$$(B_H - B_{H_1})\, \frac{D_{g'}^2(B_H B_X)}{B_{g'}} - \frac{9 K B_a B_X}{4 D_X^2} \ge 0. \tag{65}$$
As $B_H$ is a solution of (16), (65) always holds, and then we get $\|H_{i+1}\| \le B_H$.

This completes the induction and the proof.
Proof of Theorem 2: As there exists a solution for the inequality (16) and Assumptions 1-6 hold, from Theorem 1, the parameter sequence $\{H_i\}_{i=1}^{\infty}$ is uniformly upper bounded. Note that $V_{i+1} = F_i + S_i$ [see (10)]. Let us investigate the limits of $\|F_i\|$ and $\|S_i\|$ as $i \to \infty$ separately.
A. Limit of $\|F_i\|$

As $i = nK + k$ for $n = 0, 1, \dots$ and $k = 1, \dots, K$, we have
$$F_i = U_{0,i} V_1 = U_{nK,nK+k}\, U_{(n-1)K,nK} \cdots U_{0,K}\, V_1. \tag{66}$$
In (66), each term $U_{(l-1)K,lK}$ for $l = 1, \dots, n$ corresponds to a transmission of the OGM-F or OGM-SS training scheme, and in each transmission all vectors of the input sequence $P = \{X_k\}_{k=1}^{K}$ have been used.
Two cases, 1) $N_P = N$ and 2) $N_P < N$, where $N_P$ is the dimension of the input sequence $P = \{X_k\}_{k=1}^{K}$, are considered below.

1) The case of $N_P = N$.
Consider the first transmission $U_{0,K} V_1$. As $N_P = N$ (i.e., the input sequence $P$ is of full dimension), it is clear that the $N$-dimensional vector $V_1$ can be linearly combined from some elements of $P$. Then, using Lemma 2, we have
$$\left\|U_{0,K} V_1\right\| \le \sqrt{1 - u\beta}\; \|V_1\|$$
where $0 < u\beta \le 1$.

For the second transmission, since $U_{0,K} V_1$ is an $N$-dimensional vector, it can be linearly combined from some elements of $P$. Then it can be seen that
$$\left\|U_{K,2K}\, U_{0,K} V_1\right\| \le \sqrt{1 - u\beta}\; \left\|U_{0,K} V_1\right\| \le \left(\sqrt{1 - u\beta}\right)^{2} \|V_1\|.$$
In a similar vein, and noting that $\|U_{nK,nK+k}\| \le 1$ from Lemma 1, we have
$$\|F_i\| \le \left(\sqrt{1 - u\beta}\right)^{n} \|V_1\|.$$
Since $0 \le \sqrt{1 - u\beta} < 1$ and $\|V_1\|$ is upper bounded by the bounded initialization $H_1$ and bounded optimal solution $H^*$ in Assumption 3, the limit of $\|F_i\|$ as $i \to \infty$ can be obtained as
$$0 \le \lim_{i \to \infty} \|F_i\| \le \lim_{n \to \infty} \left(\sqrt{1 - u\beta}\right)^{n} \|V_1\| = 0$$
and consequently
$$\lim_{i \to \infty} \|F_i\| = 0. \tag{67}$$
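The geometric decay of $\|F_i\|$ in the full-dimension case can be illustrated for the linear activation, where $t_i = 1$ and $U_{0,i}$ is a product of matrices $(I - uA_k)$. The particular input set and learning rate below are illustrative assumptions; the standard basis plus one extra random vector guarantees $N_P = N$.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 3
X = np.vstack([np.eye(N), rng.normal(size=(1, N))])   # K = 4 inputs spanning R^N (N_P = N)
u = 0.5 / np.max(np.sum(X**2, axis=1))                # small constant learning rate

V = rng.normal(size=N)                                # plays the role of V_1
norms = [np.linalg.norm(V)]
for _ in range(500):                                  # each pass applies one factor U_{(l-1)K,lK}
    for x in X:
        V = V - u * (x @ V) * x                       # (I - u A_k) V with A_k = X_k X_k^T
    norms.append(np.linalg.norm(V))

assert all(b <= a + 1e-12 for a, b in zip(norms, norms[1:]))  # the norm never increases
assert norms[-1] < 1e-6 * norms[0]                            # per-cycle contraction drives it to 0
```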
2) The case of $N_P < N$.

As the dimension of $P = \{X_k\}_{k=1}^{K}$ is $N_P$, all elements of $P$ can be linearly combined from a set of orthonormal bases of order $N_P$. Suppose that this set is $(\phi_1, \dots, \phi_{N_P})$. Clearly, there also exists a set of orthonormal bases of order $N$, $(\phi_1, \dots, \phi_{N_P}, \phi_{N_P+1}, \dots, \phi_N)$, such that each $N$-dimensional vector can be generated by this set.
Let $Q = \left[\phi_1, \dots, \phi_{N_P}, \phi_{N_P+1}, \dots, \phi_N\right]$ and $\alpha_{k,i} = X_k^T \phi_i$ for $i = 1, \dots, N_P$; we can represent each element in $P$ as
$$X_k = Q \alpha_k, \quad k = 1, \dots, K \tag{68}$$
where
$$\alpha_k = \left[\alpha_{k,1}, \dots, \alpha_{k,N_P}, 0, \dots, 0\right]^T.$$
Let
$$\alpha_k' = \left[\alpha_{k,1}, \dots, \alpha_{k,N_P}\right]^T.$$
A new sequence
$$P' = \{\alpha_k'\}_{k=1}^{K} \tag{69}$$
can be obtained, and it is noted that this new sequence $P'$ is of full dimension.
of full dimension.
We then represent the term V
1
and U
0,i
in F
i
by Q. Clearly,
V
1
can be rewritten as
V
1
= H

1
H

+ H
1

where H

1
is the projection of H
1
on the subspace formed by
the set (
1
, . . . ,
N
P
), and H
1

is the projection of H
1
on the
subspace formed by the set (
N
P
+1
, . . . ,
N
). Since H

can
be linearly combined by some elements of P, H

1
H

can
be generated by the set (
1
, . . . ,
N
P
) as well.
Let $w_i = (H_1' - H^*)^T \phi_i$ for $i = 1, \dots, N_P$. Using $Q$, we can represent $V_1$ as follows:
$$V_1 = Q \begin{bmatrix} W_1' \\ 0 \end{bmatrix} + H_1^{\perp} \tag{70}$$
where
$$W_1' = \left[w_1, \dots, w_{N_P}\right]^T. \tag{71}$$
Plugging (68) into (9), we have
$$A_i = X_i X_i^T = Q \alpha_i \alpha_i^T Q^T = Q \begin{bmatrix} A_i' & 0 \\ 0 & 0 \end{bmatrix} Q^T \tag{72}$$
where $A_i' = \alpha_i' \alpha_i'^T$, and $\alpha_i' \in P'$.
Using (72), $U_{0,i}$ can then be represented as
$$U_{0,i} = (I - u t_i A_i) \cdots (I - u t_1 A_1) = Q \begin{bmatrix} I - u t_i A_i' & 0 \\ 0 & I \end{bmatrix} Q^T \cdots Q \begin{bmatrix} I - u t_1 A_1' & 0 \\ 0 & I \end{bmatrix} Q^T = Q \begin{bmatrix} (I - u t_i A_i') \cdots (I - u t_1 A_1') & 0 \\ 0 & I \end{bmatrix} Q^T = Q \begin{bmatrix} U_{0,i}' & 0 \\ 0 & I \end{bmatrix} Q^T \tag{73}$$
where
$$U_{j,i}' = \begin{cases} (I - u t_i A_i') \cdots (I - u t_{j+1} A_{j+1}'), & j < i \\ I, & j = i \end{cases} \tag{74}$$
in which the new sequence $P' = \{\alpha_k'\}_{k=1}^{K}$ defined by (69) is used.
Inserting (70) and (73) into (66), $F_i$ can be represented as follows:
$$F_i = U_{0,i} V_1 = U_{0,i} \left( Q \begin{bmatrix} W_1' \\ 0 \end{bmatrix} + H_1^{\perp} \right) = Q \begin{bmatrix} U_{0,i}' & 0 \\ 0 & I \end{bmatrix} Q^T Q \begin{bmatrix} W_1' \\ 0 \end{bmatrix} + U_{0,i} H_1^{\perp} = Q \begin{bmatrix} F_i' \\ 0 \end{bmatrix} + H_1^{\perp} \tag{75}$$
where for the second equality we use (70), for the third equality we use (73), and for the fourth equality we use
$$F_i' = U_{0,i}' W_1' \tag{76}$$
with $U_{0,i}'$ defined in (74) and $W_1'$ defined in (71). Note that $U_{0,i} H_1^{\perp} = H_1^{\perp}$, as $H_1^{\perp}$ is orthogonal to the input sequence $P$.
It is easy to see that (76) is a case using the new sequence $P' = \{\alpha_k'\}_{k=1}^{K}$ in (69), which is of full dimension. Therefore, using the result for the case of $N_P = N$ investigated above, we have
$$\lim_{i \to \infty} \left\|F_i'\right\| = 0. \tag{77}$$
Consequently, we have
$$\lim_{i \to \infty} \left\| Q \begin{bmatrix} F_i' \\ 0 \end{bmatrix} \right\| = 0. \tag{78}$$
Also, from (75), we have
$$V_{i+1} - H_1^{\perp} = Q \begin{bmatrix} F_i' \\ 0 \end{bmatrix} + S_i. \tag{79}$$
B. Limit of $\|S_i\|$

In the same vein as deriving the upper bound of $\|R_i\|$ in the proof of Theorem 1, we can obtain an explicit expression of $S_i$ as well. Indeed, replacing $a_j$ and $a_{nK+j}$ with $\epsilon_j$ and $\epsilon_{nK+j}$, respectively, in (49), we obtain the expression of $S_i$.
From (55), we have
$$\left\|\sum_{l=1}^{n} M_{n,l}^{(j)} X_j\right\| \le \frac{1 - \left(\sqrt{1 - u\beta}\right)^{n}}{1 - \sqrt{1 - u\beta}}\, \|X_j\|.$$
Then it can be readily derived that
$$\|S_i\| \le u B_{g'} B_X\, \frac{1 - \left(\sqrt{1 - u\beta}\right)^{n}}{1 - \sqrt{1 - u\beta}} \sum_{l=1}^{K} \|\epsilon_l\| + u B_{g'} B_X \sum_{j=1}^{k} \left\|\epsilon_{nK+j}\right\|. \tag{80}$$
As $k \le K$ and $\epsilon_{nK+j} \in \{\epsilon_l\}_{l=1}^{K}$, (80) can be rearranged as
$$\|S_i\| \le u B_{g'} B_X \left( 1 + \frac{1 - \left(\sqrt{1 - u\beta}\right)^{n}}{1 - \sqrt{1 - u\beta}} \right) \sum_{l=1}^{K} \|\epsilon_l\|.$$
Since $0 \le \sqrt{1 - u\beta} < 1$, we have
$$\lim_{i \to \infty} \|S_i\| \le \Omega\, C(u) \tag{81}$$
where $\Omega = B_X B_{g'} \sum_{l=1}^{K} \|\epsilon_l\|$ is a positive constant, and $C(u)$ is defined in (59).
Then, given the limits of $\|F_i\|$ and $\|S_i\|$, we can obtain the following results. For the case of $N_P = N$, it follows from (67) and (81) that
$$\lim_{i \to \infty} \left\|H_i - H^* - H_1^{\perp}\right\| = \lim_{i \to \infty} \|V_i\| \le \lim_{i \to \infty} \|F_i\| + \lim_{i \to \infty} \|S_i\| \le \Omega\, C(u)$$
where $H_1^{\perp} = 0$. For the case of $N_P < N$, from (78), (79), and (81), we have
$$\lim_{i \to \infty} \left\|H_i - H^* - H_1^{\perp}\right\| = \lim_{i \to \infty} \left\|V_i - H_1^{\perp}\right\| \le \lim_{i \to \infty} \left\| Q \begin{bmatrix} F_i' \\ 0 \end{bmatrix} \right\| + \lim_{i \to \infty} \|S_i\| \le \Omega\, C(u).$$
This completes the proof.
Proof of Corollary 1: With a vanishing disturbance sequence $\{\epsilon_k\}_{k=1}^{K}$, we have
$$S_i = u \sum_{j=1}^{i} \epsilon_j g'(X_j^T H_j)\, U_{j,i} X_j = 0, \quad i = 1, 2, \dots$$
as $\epsilon_i \in \{\epsilon_k\}_{k=1}^{K}$ and $\epsilon_k = 0$ for $k = 1, \dots, K$. Then, from (10), $V_{i+1}$ can be rewritten as
$$V_{i+1} = F_i = U_{0,i} V_1, \quad i = 1, 2, \dots. \tag{82}$$
Let us first investigate the boundedness of the parameter sequence $\{H_i\}_{i=1}^{\infty}$ for this special case. Let $\rho_l$ denote the correlation coefficient between $X_l$ and $U_{0,l-1} V_1$
$$\rho_l = \frac{\langle X_l, U_{0,l-1} V_1\rangle}{\|X_l\|\, \left\|U_{0,l-1} V_1\right\|}, \quad l = 1, \dots, i.$$
As the learning rate $u$ satisfies $0 < u \le u_0$, using Lemma 1 (by replacing $Z$ in Lemma 1 with $U_{0,l-1} V_1$ for $l = 1, \dots, i$), we have
$$\left\|U_{0,i} V_1\right\| = \left\|(I - u t_i A_i)\, U_{0,i-1} V_1\right\| \le \sqrt{1 - u\|X_i\|^2 t_i \rho_i^2}\; \left\|U_{0,i-1} V_1\right\| \le \sqrt{1 - u\|X_i\|^2 t_i \rho_i^2} \cdots \sqrt{1 - u\|X_1\|^2 t_1 \rho_1^2}\; \|V_1\| \le \|V_1\|, \quad i = 1, 2, \dots \tag{83}$$
where
$$0 \le u\|X_l\|^2 t_l \rho_l^2 \le 1.$$
Inserting (83) into (82), with bounded initialization $H_1$ and bounded optimal solution $H^*$ by Assumption 3, we have
$$\|V_{i+1}\| = \left\|U_{0,i} V_1\right\| \le \|V_1\| \le \|H_1\| + \|H^*\| \le B_{H_1} + B_{H^*}, \quad i = 1, 2, \dots \tag{84}$$
Then, from (84), the boundedness of $\{H_i\}_{i=1}^{\infty}$ is a straightforward result, as
$$\|H_i\| = \left\|H_i - H^* + H^*\right\| \le \left\|H_i - H^*\right\| + \|H^*\| = \|V_i\| + \|H^*\| \le 2 B_{H^*} + B_{H_1}, \quad i = 1, 2, \dots \tag{85}$$
Note that, compared with the proof of Theorem 1, for this special case Assumption 6 and the inequality (16) do not necessarily have to hold to obtain the boundedness result (85). Actually, from the proof of Theorem 2, we can see that these two conditions are only used to guarantee the uniform upper boundedness of $\{H_i\}_{i=1}^{\infty}$.
Given the boundedness result (85), we can directly use some results of Theorem 2. For the case of $N_P = N$, from (67), we have
$$\lim_{i \to \infty} \|F_i\| = 0$$
and consequently
$$\lim_{i \to \infty} V_{i+1} = \lim_{i \to \infty} F_i = 0. \tag{86}$$
Also, for the case of $N_P < N$, it follows from (75) that
$$V_{i+1} = Q \begin{bmatrix} F_i' \\ 0 \end{bmatrix} + H_1^{\perp}.$$
From (77), we have
$$\lim_{i \to \infty} \left\|F_i'\right\| = 0$$
and consequently
$$\lim_{i \to \infty} V_{i+1} = \lim_{i \to \infty} \left( Q \begin{bmatrix} F_i' \\ 0 \end{bmatrix} + H_1^{\perp} \right) = H_1^{\perp}. \tag{87}$$
We now conclude from (86) and (87) that
$$\lim_{i \to \infty} H_i = H^* + H_1^{\perp} \tag{88}$$
where $H_1^{\perp} = 0$ for the case of $N_P = N$.

This completes the proof.
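The limit (88) can be illustrated for the linear activation $g(z) = z$ under perfect modeling (zero disturbance): the component of $H_1$ orthogonal to the inputs is never touched by the updates, so OGM-F converges to $H^* + H_1^{\perp}$. All sizes, the particular inputs, and the learning-rate choice below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
N, N_P = 4, 2
B = rng.normal(size=(N, N_P))                         # basis of the input subspace span(P)
C = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])    # well-conditioned subspace coefficients
X = C @ B.T                                           # K = 3 inputs confined to span(P), N_P < N
H_star = B @ rng.normal(size=N_P)                     # optimal solution inside span(P)
a = X @ H_star                                        # targets: perfect modeling, epsilon_k = 0
u = 0.5 / np.max(np.sum(X**2, axis=1))                # small constant learning rate

H1 = rng.normal(size=N)                               # initialization H_1
H = H1.copy()
for _ in range(10000):                                # OGM-F cycles with g(z) = z, so g'(z) = 1
    for k in range(len(X)):
        H = H + u * (a[k] - X[k] @ H) * X[k]

# H_1^perp: projection of H_1 onto the orthogonal complement of span(P)
Qb, _ = np.linalg.qr(B)                               # orthonormal basis of span(P)
H1_perp = H1 - Qb @ (Qb.T @ H1)

assert np.allclose(H, H_star + H1_perp, atol=1e-5)    # limit (88): H_i -> H* + H_1^perp
assert np.linalg.norm(H1_perp) > 0                    # the orthogonal component persists
```

With the three inputs confined to a 2-dimensional subspace of $\mathbb{R}^4$, this is the $N_P < N$ case; rerunning with full-dimensional inputs gives $H_1^{\perp} = 0$ and plain convergence to $H^*$.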
V. CONCLUSION

In this paper, the boundedness and convergence of OGM for TLFNNs were studied in the general case of linear or nonlinear activation functions, with a constant learning rate, a finite-sized training set, and the two cyclic training schemes OGM-F and OGM-SS. Our main contributions can be summarized as follows.

1) We extended the well-known linear difference equation to the case of a nonlinear activation function.

2) We gave a uniform upper bound of the parameter sequence, which is the solution of an inequality regarding the bound.

3) We proved that the parameter sequence can converge into a zone around an optimal solution. We also showed that the size of this zone is associated with the disturbance sequence and the learning rate.

4) For the special case of a vanishing disturbance sequence, we proved a strong global convergence result, where the parameter sequence can always converge to an optimal solution.
Lu Xu received the B.S.E.E. and M.S.E.E. degrees
in electronic engineering from Tsinghua University,
Beijing, China, in 2003 and 2006, respectively.
He is currently pursuing the Ph.D. degree with
the Department of Electronic Engineering, Tsinghua
University.
His current research interests include neural
networks, blind signal processing, and satellite
communication.
Jinshu Chen received the M.S. degree in electronic engineering from Tsinghua University, Beijing, China, in 1993.

He is currently an Associate Professor with the Department of Electronic Engineering, Tsinghua University. His current research interests include satellite communication, communication and information systems, data recording, and signal processing.
Defeng Huang (M'01-S'02-M'05-SM'07) received
the B.S.E.E. and M.S.E.E. degrees in electronic
engineering from Tsinghua University, Beijing,
China, in 1996 and 1999, respectively, and the
Ph.D. degree in electrical and electronic engineering
from the Hong Kong University of Science and
Technology, Kowloon, Hong Kong, in 2004.
He is currently a Professor with the School of
Electrical, Electronic and Computer Engineering,
University of Western Australia.
Dr. Huang serves as an Editor for the IEEE WIRELESS COMMUNICATIONS LETTERS and served as an Editor for the IEEE TRANSACTIONS ON WIRELESS COMMUNICATIONS from 2005 to 2011.
Jianhua Lu (M'98-SM'07) received the B.S.E.E. and M.S.E.E. degrees from Tsinghua University, Beijing, China, in 1986 and 1989, respectively, and the Ph.D. degree in electrical and electronic engineering from the Hong Kong University of Science and Technology, Kowloon, Hong Kong.
He has been with the Department of Electronic
Engineering, Tsinghua University, since 1989, where
he is currently a Professor. He has published more
than 180 technical papers in international journals
and conference proceedings. His current research
interests include broadband wireless communication, multimedia signal
processing, satellite communication, and wireless networking.
Dr. Lu has been an active member of several professional societies. He
was a recipient of the Best Paper Award at the International Conference
on Communications, Circuits and Systems in 2002 and ChinaCom in 2006,
and received the National Distinguished Young Scholar Fund from the NSF
Committee of China in 2006. He has served in numerous IEEE conferences as
a member of Technical Program Committees and served as a Lead Chair of the
General Symposium of IEEE ICC in 2008, as well as a Program Committee
Co-Chair of the 9th IEEE International Conference on Cognitive Informatics.
He is a Senior Member of the IEEE Signal Processing Society.
Licai Fang received the M.S. degree in radio electronics from Peking University, Beijing, China, in 1998. He is currently pursuing the Ph.D. degree with the School of Electrical, Electronic and Computer Engineering, University of Western Australia, Perth, Australia.

His current research interests include signal processing techniques for turbo equalization systems.