Σ_{k=1}^{K} (a_k − g(X_k^T H))^2 attains its global minimum

H* = arg min_{H ∈ R^N} J(H). (1)
2162-237X/$31.00 2013 IEEE
1328 IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, VOL. 24, NO. 8, AUGUST 2013
Note that the optimal solution may not be unique. As in J(H), we have

g(X_k^T H*) = g(X_k^T (H* + H̄)), k = 1, …, K

where H̄ is a vector orthogonal to the input sequence {X_k}_{k=1}^K. With the optimal solution H*, we now have

a_k = g(X_k^T H*) + ε_k, k = 1, …, K (2)

where {ε_k}_{k=1}^K denotes the disturbance sequence that represents all nonidealities such as modeling error and noise.
In this paper, we consider a finite number of training samples and two cycle training schemes, namely, OGM-F (fixed order), in which each sample in the set is supplied to the network exactly once in a fixed order in each training cycle [15], [16], and OGM-SS (special stochastic order), in which each sample in the set is supplied to the network exactly once in a stochastic order in each training cycle [16], [17]. The OGM recursion equation of the parameter vector can be written as
H_{nK+k+1} = H_{nK+k} + u (a_{nK+k} − g(X_{nK+k}^T H_{nK+k})) g′(X_{nK+k}^T H_{nK+k}) X_{nK+k}
k = 1, …, K, n = 0, 1, … (3)
where u is the learning rate. For the OGM-F scheme, (a_{nK+k}, X_{nK+k}) = (a_k, X_k) for n = 0, 1, … and k = 1, …, K, and for the OGM-SS scheme, every set {(a_{nK+k}, X_{nK+k})}_{k=1}^K for n = 0, 1, … is a stochastic permutation of the set {(a_k, X_k)}_{k=1}^K.
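As an illustration, the recursion (3) under the two cycle training schemes can be sketched in a few lines of Python. This is a minimal sketch under our own naming choices (the function `ogm_train` and all variable names are ours, not from the paper):

```python
import numpy as np

def ogm_train(a, X, H1, u, g, gp, n_cycles, scheme="F", rng=None):
    """Run recursion (3): one update per sample, cycling through all K samples.

    scheme="F"  : OGM-F, fixed order 1..K in every training cycle
    scheme="SS" : OGM-SS, a fresh stochastic permutation of 1..K per cycle
    """
    rng = rng or np.random.default_rng(0)
    K, H = len(a), H1.copy()
    for _ in range(n_cycles):
        order = np.arange(K) if scheme == "F" else rng.permutation(K)
        for k in order:
            s = X[k] @ H                            # X_k^T H
            H = H + u * (a[k] - g(s)) * gp(s) * X[k]
    return H

# usage on a small realizable problem (eps_k = 0), g(x) = tanh(x)
rng = np.random.default_rng(3)
X = rng.normal(size=(5, 4))
H_star = 0.5 * rng.normal(size=4)
a = np.tanh(X @ H_star)
gp = lambda x: 1.0 / np.cosh(x) ** 2               # g'(x) = sech^2(x)
H = ogm_train(a, X, np.zeros(4), 0.05, np.tanh, gp, 5000, scheme="SS")
print(np.sum((a - np.tanh(X @ H)) ** 2))           # training error J(H), close to 0
```

With a small constant learning rate and a realizable target, both schemes drive the error function toward its global minimum here.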
The boundedness of the OGM then depends on the upper bound of the parameter sequence {H_i}_{i=1}^∞, where i = nK + k. It is an important issue because, in practice, an oversized parameter may crash the training procedure or damage the equipment that is being used. As a result, the convergence of the OGM (i.e., the convergence of the parameter sequence {H_i}_{i=1}^∞ to an optimal solution in some sense) is a prerequisite of any successful application of the OGM.
Actually, the boundedness and convergence of the OGM have inspired a lot of research, and many excellent analytic mechanisms have been proposed, with the activation function g(x) linear or nonlinear. For the case of linear g(x), the parameter estimation error V_i = H_i − H* satisfies the well-known linear difference equation

V_{i+1} = (I − u X_i X_i^T) V_i + u ε_i X_i (4)

and the parameter sequence {H_i}_{i=1}^∞ can be automatically bounded. An integrated error function and a tuning learning law were used in [27], and the boundedness result of {H_i}_{i=1}^∞ was derived. Using the Lyapunov function, [28] and [29] proved the boundedness of {H_i}_{i=1}^∞ and of the output error sequence {e_i}_{i=1}^∞ under the assumption of bounded uncertainty for two-layer and three-layer feedforward neural networks, respectively. Lyapunov analysis was carried out in [30] and [31], which considered recurrent neural networks, and in [32], where radial basis function (RBF) neural networks were considered, to obtain the boundedness result of {H_i}_{i=1}^∞. However, the boundedness results derived in these papers are based on the case of a time-varying learning rate, and cannot apply to the case of a constant learning rate. For the convergence property, by investigating the monotonicity of the sequence {J(H_{nK})}_{n=0}^∞, [33]–[35] derived some important results, like the weak convergence lim_{i→∞} ‖∇J(H_i)‖ = 0 and the strong convergence lim_{i→∞} H_i ∈ {H : ∇J(H) = 0}, with a learning rate adjusted with a proper decreasing scheme. Then, using a constant learning rate, a similar weak convergence result was given in [36]. Nevertheless, the convergence results obtained in [33]–[36] are local convergence results, as the parameter sequence {H_i}_{i=1}^∞ may converge into a local minimum of the error function J(H). Additionally, the uniformly upper bounded assumption of g(x) in those works makes them inapplicable to the case of linear g(x).
In this paper, we present a deterministic analysis of the boundedness and convergence of the parameter sequence {H_i}_{i=1}^∞ for the general case of a linear or nonlinear activation function, with a constant learning rate, and for both the OGM-F and the OGM-SS training schemes. We note that the linear difference equation (4) of V_i plays a key role in the analyses of [18]–[25]. However, this useful equation is only applicable to the case of a linear activation function. Using the differential mean value theorem [37], we derive an extended difference equation of V_i for the general case of a linear or nonlinear activation function. Based on this extended difference equation, some new results for the boundedness and convergence of the parameter sequence {H_i}_{i=1}^∞ are then obtained.
With the mild restrictions of a bounded training set {(a_k, X_k)}_{k=1}^K and initialization H_1, we prove that the parameter sequence {H_i}_{i=1}^∞ is uniformly upper bounded as long as there exists a solution for an inequality [see (16) of this paper] regarding the bound. To the best of our knowledge, this explicit criterion for the deterministic boundedness of the parameter sequence {H_i}_{i=1}^∞ has not been derived before. A further investigation of this explicit criterion is then carried out, and we show that a solution for the inequality always exists for the case of linear g(x), which means that the parameter sequence {H_i}_{i=1}^∞ can be uniformly upper bounded. For the case of nonlinear g(x), some simple adjustment methods on the training set {(a_k, X_k)}_{k=1}^K, the activation function g(x), or the initialization H_1, which can improve the upper boundedness property of {H_i}_{i=1}^∞, may be found on the basis of the analysis of the inequality.
Then, based on the boundedness result, the convergence of the parameter sequence {H_i}_{i=1}^∞ is obtained. We show that H_i converges into a zone around an optimal solution as i → ∞, and the size of the zone is associated with the disturbance sequence {ε_k}_{k=1}^K and the learning rate u. Indeed, this result can be considered as a deterministic counterpart of the stochastic results obtained in [23]–[25], which only consider the case of a linear activation function.
The convergence property is further investigated for the special case of a vanishing disturbance sequence {ε_k}_{k=1}^K, also called the case of perfect modeling [38], [39]. A global convergence result, i.e., that the parameter vector H_i can always converge to an optimal solution [which leads to a global minimum of the error function J(H) as shown in (1)], is proved. Compared with the convergence result that H_i may be trapped in a local minimum obtained in [33] and [35], the global convergence derived in this paper is stronger.
The rest of this paper is organized as follows. Some assumptions for analysis are listed in Section II. In Section III, with the OGM-F or OGM-SS training scheme, the main boundedness and convergence results, along with some necessary lemmas, are presented. The proofs of the main results and lemmas are gathered in Section IV. Finally, Section V gives the conclusions.
II. ASSUMPTIONS
In this section, for the boundedness and convergence analysis, a list of assumptions is established as follows.
1) The size of the training set {(a_k, X_k)}_{k=1}^K is finite, i.e., K < ∞.
2) The training set {(a_k, X_k)}_{k=1}^K is upper bounded, i.e., there exist a B_a ∈ R and a B_X ∈ R such that |a_k| ≤ B_a < ∞ and ‖X_k‖ ≤ B_X < ∞ for k = 1, …, K.
3) The initialization H_1 and the optimal solution H* are upper bounded, i.e., there exist a B_{H_1} ∈ R and a B_{H*} ∈ R such that ‖H_1‖ ≤ B_{H_1} < ∞ and ‖H*‖ ≤ B_{H*} < ∞.
4) The activation function g(x) is continuously differentiable, and 0 < g′(x) ≤ B_g < ∞.
5) For all bounded x with |x| ≤ B_x < ∞, g′(x) is lower bounded by a small positive value, i.e., g′(x) ≥ D_g(B_x) > 0, where D_g(B_x) := min{g′(x) : |x| ≤ B_x}.
6) The activation function g(x) satisfies g(0) = 0.

The notation ‖·‖ here is the Euclidean norm for vectors and the matrix norm induced by the Euclidean norm for matrices, i.e.,

‖A‖ = max_{‖X‖=1} ‖AX‖, A ∈ R^{m×n}, X ∈ R^n.
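For concreteness, this induced 2-norm equals the largest singular value of A, which can be checked numerically. A small sketch (NumPy's `np.linalg.norm(A, 2)` computes exactly this induced norm; the power iteration approximates the maximizing unit vector X):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))

# ||A|| = max_{||X||=1} ||A X||: approximate the maximizer by power iteration on A^T A
x = rng.normal(size=4)
for _ in range(500):
    x = A.T @ (A @ x)
    x /= np.linalg.norm(x)

print(np.isclose(np.linalg.norm(A @ x), np.linalg.norm(A, 2)))  # True
```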
The restriction of a nite-size training set in Assumption 1
is reasonable because obtaining an innite (large) number of
training samples is a hard task for most applications. For
Assumptions 2 and 3, the upper boundedness of the desired
signals, input signals, and the initialization can be readily
justied in practice.
For Assumptions 4–6, we only assume the upper boundedness and lower boundedness of the derivative g′(x).

III. MAIN RESULTS

The boundedness and convergence of the parameter sequence {H_i}_{i=1}^∞ are studied by an investigation of the parameter estimation error V_i for i = 1, 2, …. Let us first extend the well-known linear difference equation (4) of V_i to the case of the nonlinear activation function g(x). With V_i = H_i − H*, we can rewrite the recursion (3) of OGM as follows:
V_{i+1} = V_i + u (a_i − g(X_i^T H_i)) g′(X_i^T H_i) X_i, i = 1, 2, …. (5)
Using the differential mean value theorem, there exists an H̄_i such that

g(X_i^T H_i) = g(X_i^T H*) + g′(X_i^T H̄_i)(X_i^T H_i − X_i^T H*)
= g(X_i^T H*) + g′(X_i^T H̄_i) X_i^T V_i (6)

where X_i^T H̄_i satisfies

min(X_i^T H_i, X_i^T H*) ≤ X_i^T H̄_i ≤ max(X_i^T H_i, X_i^T H*).
Inserting (6) into (5) and using (2), we now obtain an extended difference equation for the general case of linear or nonlinear g(x) as follows:

V_{i+1} = (I − u t_i A_i) V_i + u ε_i g′(X_i^T H_i) X_i (7)

where

t_i = g′(X_i^T H_i) g′(X_i^T H̄_i) (8)
A_i = X_i X_i^T. (9)
Comparing (4) with (7), it is clear that the linear difference equation can be seen as a special case of the extended difference equation with a constant g′(X_i^T H̄_i). The extended difference equation (7) is no longer linear, as t_i varies with the parameter vector. However, using this equation, some excellent results can still be deduced, as will be shown below.
Using the extended difference equation (7), after i recursions, we can obtain the following expression of V_{i+1}:

V_{i+1} = F_i + S_i, i = 1, 2, … (10)

where

F_i = U_{0,i} V_1 (11)
S_i = u Σ_{j=1}^{i} ε_j g′(X_j^T H_j) U_{j,i} X_j (12)
and U_{j,i} is the transition matrix defined as

U_{j,i} = (I − u t_i A_i) ⋯ (I − u t_{j+1} A_{j+1}) for j < i, and U_{i,i} = I. (13)
With (10), as V_{i+1} = H_{i+1} − H*, we can now study the boundedness and convergence of the parameter sequence {H_i}_{i=1}^∞ by an investigation of F_i and S_i.
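The algebra behind (7) can be checked numerically for a single step: the mean-value slope g′(X_i^T H̄_i) in (6) is simply the difference quotient of g between X_i^T H_i and X_i^T H*. A minimal sketch with g(x) = tanh(x) (the concrete numbers and names are our own):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 4
g, gp = np.tanh, lambda x: 1.0 / np.cosh(x) ** 2

H_star = rng.normal(size=N)          # an optimal solution H*
H = rng.normal(size=N)               # current parameter H_i
X = rng.normal(size=N)               # current input X_i
eps = 0.1                            # disturbance eps_i
a = g(X @ H_star) + eps              # desired signal, equation (2)
u = 0.05

# raw OGM step, equation (5)
V = H - H_star
V_next = V + u * (a - g(X @ H)) * gp(X @ H) * X

# extended difference equation (7)-(9), with the mean-value slope from (6)
slope = (g(X @ H) - g(X @ H_star)) / (X @ V)   # plays the role of g'(X_i^T Hbar_i)
t = gp(X @ H) * slope                          # t_i, equation (8)
A = np.outer(X, X)                             # A_i, equation (9)
V_ext = (np.eye(N) - u * t * A) @ V + u * eps * gp(X @ H) * X

print(np.allclose(V_next, V_ext))  # True: (7) reproduces the raw step (5)
```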
A. Boundedness of the Parameter Sequence {H_i}_{i=1}^∞
Let us first investigate a single factor (I − u t_i A_i) of U_{j,i} in (13) and introduce the following lemma. We note here that all the lemmas, theorems, and corollaries in this paper are derived for the OGM-F or OGM-SS training scheme, i.e., (a_i, X_i, ε_i) ∈ {(a_k, X_k, ε_k)}_{k=1}^K for i = 1, 2, ….
Lemma 1: Under Assumptions 2 and 4, let Z be an N-dimensional vector, and γ_i the correlation coefficient between X_i and Z, i.e., γ_i = <X_i, Z>/(‖X_i‖ ‖Z‖) (<·,·> denotes the inner product) for X_i ≠ 0 and Z ≠ 0, and γ_i = 1 for X_i = 0 or Z = 0. There exists a positive constant

u_0 = 1/(B_g^2 B_X^2) (14)

such that for all 0 < u ≤ u_0, we have

‖(I − u t_i A_i) Z‖ ≤ √(1 − u γ_i^2 t_i λ_i) ‖Z‖

where t_i is defined in (8), A_i is defined in (9), and λ_i = ‖X_i‖^2.
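Lemma 1 can be probed numerically: for any t_i in (0, B_g^2], any input with ‖X_i‖ ≤ B_X, and any 0 < u ≤ u_0, the contraction bound holds. A hedged sketch under these assumptions (the bound values and sampling ranges below are our own illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)
N, Bg, BX = 5, 1.0, 2.0
u0 = 1.0 / (Bg**2 * BX**2)                   # equation (14)

ok = True
for _ in range(1000):
    X = rng.normal(size=N)
    X *= min(1.0, BX / np.linalg.norm(X))    # enforce ||X|| <= B_X
    Z = rng.normal(size=N)
    t = rng.uniform(1e-3, Bg**2)             # any t_i in (0, B_g^2]
    u = rng.uniform(1e-6, u0)
    lam = X @ X                              # lambda_i = ||X_i||^2
    gamma = (X @ Z) / (np.linalg.norm(X) * np.linalg.norm(Z))
    lhs = np.linalg.norm((np.eye(N) - u * t * np.outer(X, X)) @ Z)
    rhs = np.sqrt(1 - u * gamma**2 * t * lam) * np.linalg.norm(Z)
    ok &= bool(lhs <= rhs + 1e-12)

print(ok)  # True
```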
Noting that t_i is positive by Assumption 4 and that 0 ≤ u γ_i^2 t_i λ_i ≤ 1 for 0 < u ≤ u_0, as shown in the proof in the next section, from Lemma 1, for nonzero vectors Z and X_i, it is not difficult to verify that ‖(I − u t_i A_i) Z‖ < ‖Z‖ when γ_i ≠ 0 (i.e., Z is not orthogonal to X_i). With this property of (I − u t_i A_i), we can obtain the following important lemma about U_{j,i}.
Lemma 2: Under Assumptions 2–5, let Z be an upper bounded N-dimensional vector that can be linearly combined from some elements of the sequence {X_l}_{l=j+1}^{i} with a finite size i − j, and let the parameter vectors {H_l}_{l=j+1}^{i} be upper bounded by B_H, ‖H_l‖ ≤ B_H for l = j+1, …, i. Then for all 0 < u ≤ u_0, we have

‖U_{j,i} Z‖ ≤ √(1 − u β) ‖Z‖

where

β = φ Δ_g^2 D_X^2 > 0 (15)

with Δ_g = D_g(max(B_X B_H, B_X B_{H*})), D_X the minimal value of the norms of the nonzero elements of the input sequence {X_k}_{k=1}^K, and φ ∈ (0, 1] a constant.

For 0 < u ≤ u_0, from (14) and (15), it can be seen that 0 < u β ≤ 1. Using Lemmas 1 and 2, we can now derive the important boundedness property of {H_i}_{i=1}^∞ as follows.
Theorem 1: Under Assumptions 1–6, if there exists at least one solution for the inequality

f(B) := (B − B_{H_1}) D_g^2(B B_X)/B_g − 9 K B_a B_X/(4 φ D_X^2) ≥ 0 (16)

where B ∈ R and D_g(B B_X) denotes the lower bound of g′(x) when |x| ≤ B B_X, then for all 0 < u ≤ u_0 the parameter sequence {H_i}_{i=1}^∞ is uniformly upper bounded

‖H_i‖ ≤ B_H < ∞, i = 1, 2, …

where

B_H = min{B : f(B) ≥ 0}.
It immediately follows from Theorem 1 that, for a training procedure with a specific training set and activation function, the parameter sequence {H_i}_{i=1}^∞ can be uniformly upper bounded as long as there exists a B that satisfies (16). We then investigate the existence of a solution for (16). As D_g^2(B B_X) in (16) is different for different activation functions g(x), it is clear that the solution for (16) varies with g(x).

For the case of linear g(x), D_g^2(B B_X) = C_g is a positive constant, and then we can readily conclude that there exist solutions for (16), and the parameter sequence {H_i}_{i=1}^∞ can always be uniformly upper bounded by B_H given as follows:

B_H = B_{H_1} + 9 K B_a B_X B_g/(4 φ D_X^2 C_g) < ∞.
However, for the case of nonlinear g(x), it is difficult to obtain an explicit expression for B_H. Indeed, for some scenarios, e.g., an activation function g(x) = tanh(x) and a large training set (i.e., a large K), a solution for (16) may not exist, and then a uniform upper bound for {H_i}_{i=1}^∞ cannot be guaranteed. In these scenarios, however, from the investigation of each term of the inequality (16), we may obtain some improvement methods for the upper boundedness property of {H_i}_{i=1}^∞. For example, we can adjust the training set {(a_k, X_k)}_{k=1}^K or the activation function g(x), including scaling down the desired signals {a_k}_{k=1}^K to decrease B_a, scaling up the input sequence {X_k}_{k=1}^K to reduce B_X/D_X^2, and/or scaling up g(x) by a positive number to increase D_g^2(B B_X)/B_g. Besides, properly decreasing the training set size K, rearranging the order of the input signals in {X_k}_{k=1}^K to increase φ, or setting a lower initialization H_1 to reduce B_{H_1} may also be useful to obtain a positive f(B).
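For a concrete feel for criterion (16), one can scan f(B) numerically for g(x) = tanh(x), where D_g(B B_X) = sech^2(B B_X). All numeric values below (B_a, B_X, D_X, B_{H_1}, B_g, φ) are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def f(B, K, Ba=0.01, BX=1.0, DX=1.0, BH1=0.5, Bg=1.0, phi=1.0):
    # inequality (16) with g(x) = tanh(x): D_g(B*BX) = sech^2(B*BX)
    Dg = 1.0 / np.cosh(B * BX) ** 2
    return (B - BH1) * Dg**2 / Bg - 9 * K * Ba * BX / (4 * phi * DX**2)

def bound_BH(K):
    """B_H = min{B : f(B) >= 0}, or None if criterion (16) has no solution."""
    Bs = np.linspace(0.5, 50.0, 200000)
    sols = Bs[f(Bs, K) >= 0]
    return float(sols.min()) if sols.size else None

print(bound_BH(1))   # a finite bound B_H exists for a small training set
print(bound_BH(5))   # None: for this larger K no B satisfies (16)
```

This reproduces the qualitative point above: for tanh, enlarging K eventually makes (16) unsolvable, while shrinking B_a or K restores a valid bound.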
B. Convergence of the Parameter Sequence {H_i}_{i=1}^∞
Let P = {X_k}_{k=1}^K denote the input sequence. Assume that the dimension of P is N_P (i.e., the maximum number of linearly independent elements in P is N_P). It is not difficult to verify from (2) that there exists more than one optimal solution to (1) if N_P < N, where N is the dimension of X_k, because we have

J(H*) = J(H* + H̄)

where H̄ is orthogonal to P, and such an H̄ ≠ 0 exists if N_P ≠ N. Then each optimal solution can be expressed as H* + H̄. The following convergence result of the parameter sequence {H_i}_{i=1}^∞ can then be deduced.
Theorem 2: Under Assumptions 1–6, with an arbitrary initialization H_1, if there exists a solution for (16), then for all 0 < u ≤ u_0, we have

lim_{i→∞} ‖H_i − H* − H_1^⊥‖ ≤ ε̄ C(u)

where H_1^⊥ is the component of H_1 that is orthogonal to the input sequence P, ε̄ = B_X B_g Σ_{k=1}^K |ε_k| is a positive constant, and

C(u) = u/(1 − √(1 − u β)) + u.
It is clear that H* + H_1^⊥ is also an optimal solution of (1), as H_1^⊥ is orthogonal to P. Theorem 2 thus states that the parameter sequence {H_i}_{i=1}^∞ can converge into a zone around this optimal solution, where the size of the zone is associated with the disturbance sequence {ε_k}_{k=1}^K and the learning rate u.
We further study the convergence property of {H_i}_{i=1}^∞ for the special case of a vanishing disturbance sequence {ε_k}_{k=1}^K; then (2) can be rewritten as

a_k = g(X_k^T H*), k = 1, …, K.

This special case can be satisfied. For example, in [40] it was proved that there exists a converged parameter vector such that a TLFNN with input dimension N can approximate the mapping of a set {a_k, X_k}_{k=1}^N with arbitrarily small error. Furthermore, the universal approximation analysis conducted in [41] and [42] shows that a TLFNN can approximate any continuous target function with vanishing output deviation by using a proper parameter vector.
For this special case, it can be seen that

S_i = u Σ_{j=1}^{i} ε_j g′(X_j^T H_j) U_{j,i} X_j = 0, i = 1, 2, …

as ε_i ∈ {ε_k}_{k=1}^K and ε_k = 0 for k = 1, …, K. And we have

V_{i+1} = F_i = U_{0,i} V_1, i = 1, 2, ….
A global convergence result can then be obtained from Theorem 2 as follows.

Corollary 1: Under Assumptions 1–5, for the case of a vanishing disturbance sequence {ε_k}_{k=1}^K, with an arbitrary initialization H_1, for all 0 < u ≤ u_0, H_i can always converge to an optimal solution as i → ∞

lim_{i→∞} H_i = H* + H_1^⊥

where H_1^⊥ is the component of H_1 that is orthogonal to the input sequence P = {X_k}_{k=1}^K.
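Corollary 1 can be illustrated with a small simulation: inputs confined to the first two of N = 3 coordinates (so N_P = 2 < N), perfect modeling, and g(x) = tanh(x). The third coordinate of H_1 is never updated by (3), so H_i tends to H* + H_1^⊥. The concrete numbers are our own illustrative choices:

```python
import numpy as np

g = np.tanh
gp = lambda x: 1.0 / np.cosh(x) ** 2

# N = 3, but the inputs span only the first two coordinates, so N_P = 2 < N
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 0.0]])
H_star = np.array([0.5, -0.3, 0.0])   # optimal solution chosen inside span(P)
a = g(X @ H_star)                      # perfect modeling: eps_k = 0
H = np.array([1.0, 1.0, 1.0])          # H_1; its component orthogonal to P is (0, 0, 1)
u = 0.1                                # well below u_0 = 1/(B_g^2 B_X^2) = 0.5 here
for _ in range(20000):                 # OGM-F: fixed order in every cycle
    for k in range(3):
        s = X[k] @ H
        H = H + u * (a[k] - g(s)) * gp(s) * X[k]   # recursion (3)

H_limit = H_star + np.array([0.0, 0.0, 1.0])       # H* + H_1^perp, Corollary 1
print(np.allclose(H, H_limit, atol=1e-6))  # True
```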
Corollary 1 states that, for this special case, H_i can always converge to a point that achieves a global minimum of the performance measure J(H), which gives a strong convergence result. Note that, for this special case, Assumption 6 and the inequality (16) do not necessarily need to hold, because the bounded initialization H_1 and the bounded optimal solution H* of Assumption 3 can always guarantee the boundedness of the parameter sequence {H_i}_{i=1}^∞, as shown in the proof in the next section.
IV. PROOFS OF RESULTS
The proofs of the lemmas, theorems, and corollaries presented in Section III are given in this section.

Proof of Lemma 1: For X_i = 0, the result is straightforward. We consider the case of X_i ≠ 0 below.
From (9), we can see that A_i is a symmetric and positive semidefinite matrix, and

rank(A_i) = rank(X_i) = 1, X_i ≠ 0.

Then there exists one positive eigenvalue of A_i, and all other eigenvalues are equal to zero. It follows from the orthogonal decomposition of symmetric matrices that there exists an orthogonal matrix Q_i = [q_{i,1}, q_{i,2}, …, q_{i,N}] with

q_{i,1} = X_i/‖X_i‖ (17)

such that

I − u t_i A_i = Q_i Λ_i Q_i^T (18)

where Λ_i = diag(1 − u t_i λ_i, 1, …, 1), and λ_i = ‖X_i‖^2 is the positive eigenvalue of A_i.
With (18), we have

(I − u t_i A_i) Z = Q_i Λ_i Q_i^T Z = Q_i [(1 − u t_i λ_i) q_{i,1}^T Z, q_{i,2}^T Z, …, q_{i,N}^T Z]^T. (19)

Then it can be derived from (19) that

‖(I − u t_i A_i) Z‖ = ‖Q_i [(1 − u t_i λ_i) q_{i,1}^T Z, q_{i,2}^T Z, …, q_{i,N}^T Z]^T‖
= ‖[(1 − u t_i λ_i) q_{i,1}^T Z, q_{i,2}^T Z, …, q_{i,N}^T Z]^T‖
= [(1 − u t_i λ_i)^2 (q_{i,1}^T Z)^2 + (q_{i,2}^T Z)^2 + ⋯ + (q_{i,N}^T Z)^2]^{1/2}
= [‖Z‖^2 − u(2 t_i λ_i − u t_i^2 λ_i^2)(q_{i,1}^T Z)^2]^{1/2}. (20)
Using (17), we have

(q_{i,1}^T Z)^2 = γ_i^2 ‖Z‖^2 (21)

where

γ_i = <X_i, Z>/(‖X_i‖ ‖Z‖)

is the correlation coefficient between X_i and Z, and 0 ≤ γ_i^2 ≤ 1.
Plugging (21) into (20), we have

‖(I − u t_i A_i) Z‖ = √(1 − u γ_i^2 (2 t_i λ_i − u t_i^2 λ_i^2)) ‖Z‖. (22)

It follows from Assumption 4 that

0 < t_i = g′(X_i^T H_i) g′(X_i^T H̄_i) ≤ B_g^2 (23)

and from Assumption 2 that

0 ≤ λ_i = ‖X_i‖^2 ≤ B_X^2. (24)
Now if the learning rate u satisfies

0 < u ≤ u_0 = 1/(B_g^2 B_X^2) (25)

it is not difficult to verify that

u(2 t_i λ_i − u t_i^2 λ_i^2) = u t_i λ_i (2 − u t_i λ_i) ≥ u t_i λ_i (2 − u B_g^2 B_X^2) ≥ u t_i λ_i (26)

where for the first inequality we use (23) and (24), and for the second inequality we use (25).

Consequently, inserting (26) into (22), we have

‖(I − u t_i A_i) Z‖ ≤ √(1 − u γ_i^2 t_i λ_i) ‖Z‖.

As 0 ≤ γ_i^2 ≤ 1, from (23), (24), and (25), it is clear that 0 ≤ u γ_i^2 t_i λ_i ≤ 1.

This completes the proof.
Proof of Lemma 2: If Z = 0, then the result is straightforward. We consider the case of Z ≠ 0 below.

We have

U_{j,i} Z = (I − u t_i A_i)(I − u t_{i−1} A_{i−1}) ⋯ (I − u t_{j+1} A_{j+1}) Z

where A_l = X_l X_l^T and X_l ∈ {X_k}_{k=1}^K for l = j+1, …, i.
Let γ_l denote the correlation coefficient between X_l and U_{j,l−1} Z

γ_l = <X_l, U_{j,l−1} Z>/(‖X_l‖ ‖U_{j,l−1} Z‖), l = j+1, …, i.

As Assumptions 2 and 4 hold and the learning rate u satisfies 0 < u ≤ u_0, using the definition of U_{j,i} in (13), it follows from Lemma 1 (by replacing Z in Lemma 1 with U_{j,l−1} Z for l = j+1, …, i) that

‖U_{j,i} Z‖ = ‖(I − u t_i A_i) U_{j,i−1} Z‖ ≤ √(1 − u γ_i^2 t_i λ_i) ‖U_{j,i−1} Z‖
≤ √(1 − u γ_i^2 t_i λ_i) ⋯ √(1 − u γ_{j+1}^2 t_{j+1} λ_{j+1}) ‖Z‖. (27)
Also, in (27), we have

0 ≤ u γ_l^2 t_l λ_l ≤ 1

where

t_l = g′(X_l^T H_l) g′(X_l^T H̄_l)

with

min(X_l^T H_l, X_l^T H*) ≤ X_l^T H̄_l ≤ max(X_l^T H_l, X_l^T H*)

and λ_l = ‖X_l‖^2 for l = j+1, …, i.

If X_l = 0, then λ_l = ‖X_l‖^2 = 0 and √(1 − u γ_l^2 t_l λ_l) = 1.
Suppose that there exist L nonzero elements of {X_l}_{l=j+1}^{i}. Since the nonzero vector Z can be linearly combined from some elements of {X_l}_{l=j+1}^{i}, it is not difficult to verify that L ≥ 1. Let the indices of the nonzero elements of {X_l}_{l=j+1}^{i} be m_1, m_2, …, m_L ∈ [j+1, i], and without loss of generality, let m_1 < m_2 < ⋯ < m_L; we can rewrite (27) as follows:

‖U_{j,i} Z‖ ≤ √(1 − u γ_{m_L}^2 t_{m_L} λ_{m_L}) ⋯ √(1 − u γ_{m_1}^2 t_{m_1} λ_{m_1}) ‖Z‖ (28)

where λ_{m_l} = ‖X_{m_l}‖^2 > 0 for l = 1, …, L.
Let

γ_{m_p}^2 = max_{1≤l≤L} (γ_{m_l}^2)

with p ∈ {1, 2, …, L}. As we have 0 ≤ u γ_{m_l}^2 t_{m_l} λ_{m_l} ≤ 1 for l = 1, …, L, it follows from (28) that

‖U_{j,i} Z‖ ≤ √(1 − u γ_{m_p}^2 t_{m_p} λ_{m_p}) ‖Z‖. (29)

We now investigate the lower bound of the term γ_{m_p}^2 t_{m_p} λ_{m_p} in (29).
Let

D_X = min_{1≤k≤K, X_k≠0} ‖X_k‖.

As X_{m_l} ∈ {X_k}_{k=1}^K and X_{m_l} ≠ 0, we have

λ_{m_l} = ‖X_{m_l}‖^2 ≥ D_X^2 > 0, l = 1, …, L. (30)

It is then clear that

λ_{m_p} ≥ D_X^2 > 0. (31)
The parameter vectors {H_l}_{l=j+1}^{i} are upper bounded by B_H, and the input sequence {X_k}_{k=1}^K and the optimal solution H* are upper bounded by Assumptions 2 and 3. As

min(X_{m_l}^T H_{m_l}, X_{m_l}^T H*) ≤ X_{m_l}^T H̄_{m_l} ≤ max(X_{m_l}^T H_{m_l}, X_{m_l}^T H*)

the terms X_{m_l}^T H_{m_l} and X_{m_l}^T H̄_{m_l} are upper bounded as well, i.e., for l = 1, …, L

|X_{m_l}^T H_{m_l}| ≤ max(B_X B_H, B_X B_{H*})

and

|X_{m_l}^T H̄_{m_l}| ≤ max(B_X B_H, B_X B_{H*}).
From Assumption 5, g′(X_{m_l}^T H_{m_l}) and g′(X_{m_l}^T H̄_{m_l}) are lower bounded by the small positive value

Δ_g = D_g(max(B_X B_H, B_X B_{H*}))

and consequently

t_{m_l} = g′(X_{m_l}^T H_{m_l}) g′(X_{m_l}^T H̄_{m_l}) ≥ Δ_g^2 > 0, l = 1, …, L. (32)

It is then clear that

t_{m_p} ≥ Δ_g^2 > 0. (33)
We claim that there exists a constant φ ∈ (0, 1] such that γ_{m_p}^2 ≥ φ. Proof by contradiction is used to verify this claim. Suppose instead that for an arbitrarily small δ > 0 we have γ_{m_p}^2 ≤ δ. Then γ_{m_l}^2 ≤ δ for all l = 1, …, L. Since X_l = 0 for l ∉ {m_1, m_2, …, m_L}, U_{j,i} Z can be rewritten as

U_{j,i} Z = (I − u t_{m_L} A_{m_L}) ⋯ (I − u t_{m_1} A_{m_1}) Z

with L ≤ i − j.
From (22), we have

‖U_{j,i} Z‖ = √(1 − u γ_{m_L}^2 (2 t_{m_L} λ_{m_L} − u t_{m_L}^2 λ_{m_L}^2)) ⋯ √(1 − u γ_{m_1}^2 (2 t_{m_1} λ_{m_1} − u t_{m_1}^2 λ_{m_1}^2)) ‖Z‖. (34)
For 0 < u ≤ u_0, from (26), (30), and (32), we have

u(2 t_{m_l} λ_{m_l} − u t_{m_l}^2 λ_{m_l}^2) ≥ u t_{m_l} λ_{m_l} > 0, l = 1, …, L. (35)

We also have

u(2 t_{m_l} λ_{m_l} − u t_{m_l}^2 λ_{m_l}^2) ≤ 1 (36)

as (u t_{m_l} λ_{m_l} − 1)^2 ≥ 0.

It follows from (35) and (36) that

√(1 − u γ_{m_l}^2 (2 t_{m_l} λ_{m_l} − u t_{m_l}^2 λ_{m_l}^2)) ≥ √(1 − δ), l = 1, …, L

and consequently, using (34), we have

‖Z‖ ≥ ‖U_{j,i} Z‖ ≥ (√(1 − δ))^L ‖Z‖. (37)
Since Z is a nonzero and upper bounded vector, L is finite, and δ > 0 can be arbitrarily small, it can be readily obtained from (37) that

‖U_{j,i} Z‖ = ‖Z‖.

From (20) in the proof of Lemma 1, it can be seen that ‖(I − u t_i A_i) Z‖ = ‖Z‖ if and only if q_{i,1}^T Z = 0. As q_{i,1} has the same direction as X_i from (17), q_{i,1}^T Z = 0 means that Z is orthogonal to X_i. In the same vein, after i − j iterations, it is readily concluded from ‖U_{j,i} Z‖ = ‖Z‖ that Z is orthogonal to {X_l}_{l=j+1}^{i}, which contradicts the fact that Z can be linearly combined from some elements of {X_l}_{l=j+1}^{i}. Therefore, there exists a constant φ ∈ (0, 1] such that

γ_{m_p}^2 ≥ φ > 0. (38)
Inserting (31), (33), and (38) into (29), we have

‖U_{j,i} Z‖ ≤ √(1 − u β) ‖Z‖

where 0 < u β ≤ 1 and

β = φ Δ_g^2 D_X^2 > 0.

This completes the proof.
Proof of Theorem 1: We use induction to prove Theorem 1.

1) The upper bound when i = 1. As B_H is a solution for (16), it is clear that B_H ≥ B_{H_1}, and consequently

‖H_1‖ ≤ B_{H_1} ≤ B_H.

2) The upper bound when i > 1. Suppose ‖H_j‖ ≤ B_H for all j = 1, …, i; we now need to prove that ‖H_{i+1}‖ ≤ B_H.
Let i = nK + k for n = 0, 1, … and k = 1, …, K. We can rewrite the recursion (3) of OGM as follows:

H_{i+1} = H_i + u (a_i − g(X_i^T H_i)) g′(X_i^T H_i) X_i, i = 1, 2, …. (39)
Using the differential mean value theorem, there exists an H̄_i such that

g(X_i^T H_i) = g(0) + g′(X_i^T H̄_i)(X_i^T H_i − 0) = g(0) + g′(X_i^T H̄_i) X_i^T H_i (40)

where X_i^T H̄_i satisfies

min(X_i^T H_i, 0) ≤ X_i^T H̄_i ≤ max(X_i^T H_i, 0). (41)
Inserting (40) into (39) and using g(0) = 0 of Assumption 6, we have

H_{i+1} = (I − u t_i A_i) H_i + u a_i g′(X_i^T H_i) X_i (42)

where A_i is defined in (9), and

t_i = g′(X_i^T H_i) g′(X_i^T H̄_i). (43)
Using (42), after i recursions, we can obtain the following expression of H_{i+1}:

H_{i+1} = G_i + R_i (44)

where

G_i = U_{0,i} H_1 (45)
R_i = u Σ_{j=1}^{i} a_j g′(X_j^T H_j) U_{j,i} X_j. (46)
For the term G_i in (45), if the learning rate u satisfies 0 < u ≤ u_0, it follows from Lemma 1 and Assumption 3 that

‖G_i‖ = ‖U_{0,i} H_1‖ ≤ √(1 − u γ_i^2 t_i λ_i) ⋯ √(1 − u γ_1^2 t_1 λ_1) ‖H_1‖ ≤ ‖H_1‖ ≤ B_{H_1} (47)

where γ_l denotes the correlation coefficient between X_l and U_{0,l−1} H_1

γ_l = <X_l, U_{0,l−1} H_1>/(‖X_l‖ ‖U_{0,l−1} H_1‖)

and

0 ≤ u γ_l^2 t_l λ_l ≤ 1

for l = 1, …, i.
For the term R_i in (46), we consider the cases of i ≤ K and i > K separately.

When i ≤ K, from Lemma 1, the upper bound of R_i can be obtained as

‖R_i‖ ≤ u i B_a B_g B_X ≤ u K B_a B_g B_X, 1 < i ≤ K. (48)
When i > K, using the property of the cycle training and recalling that i = nK + k for n = 1, 2, … and k = 1, …, K, R_i can be rewritten as

R_i = u Σ_{j=1}^{K} U_{nK,nK+k} [Σ_{l=1}^{n} a_j g′(X_j^T H_{(l−1)K+j^{(l−1)}}) M_{n,l}^{(j)} X_j]
+ u Σ_{j=1}^{k} a_{nK+j} g′(X_{nK+j}^T H_{nK+j}) U_{nK+j,nK+k} X_{nK+j} (49)
where j^{(l−1)} denotes the position of X_j in the l-th training cycle for l = 1, …, n, and

M_{n,l}^{(j)} = U_{(n−1)K,nK} ⋯ U_{lK,(l+1)K} U_{(l−1)K+j^{(l−1)},lK} for l = 1, …, n−1, and M_{n,n}^{(j)} = U_{(n−1)K+j^{(n−1)},nK}. (50)

For l = 1, …, n and j = 1, …, K, j^{(l−1)} = j for the OGM-F scheme, while j^{(l−1)} is an arbitrary number between 1 and K for the OGM-SS scheme.
The boundedness of the first term on the RHS of (49) is investigated first. It follows from Lemma 1 and Assumptions 2 and 4 that this term is upper bounded by

u B_a B_g Σ_{j=1}^{K} ‖Σ_{l=1}^{n} M_{n,l}^{(j)} X_j‖. (51)

The proof of the boundedness of Σ_{l=1}^{n} M_{n,l}^{(j)} X_j in (51) then becomes a key step in the proof of the boundedness of the first term.
When l = n, from Lemma 1 we have

‖M_{n,n}^{(j)} X_j‖ = ‖U_{(n−1)K+j^{(n−1)},nK} X_j‖ ≤ ‖X_j‖.
From (19), we can see that (I − u t_i A_i) Z is obtained by reducing the projection of Z on X_i. Then, it is clear that U_{(l−1)K+j^{(l−1)},lK} X_j is obtained by reducing the projections on {X_{(l−1)K+j^{(l−1)}+1}, …, X_{lK}} from X_j. As X_j can be linearly combined from some elements of {X_k}_{k=1}^K, and

{X_{(l−1)K+j^{(l−1)}+1}, …, X_{lK}} ⊂ {X_k}_{k=1}^K

it is not difficult to verify that U_{(l−1)K+j^{(l−1)},lK} X_j can be linearly combined from some elements of {X_k}_{k=1}^K with a finite size K (using Assumption 1) as well. Then, as

‖H_j‖ ≤ B_H, j = 1, …, nK + k

using Lemma 2, we have

‖U_{lK,(l+1)K} U_{(l−1)K+j^{(l−1)},lK} X_j‖ ≤ √(1 − u β) ‖U_{(l−1)K+j^{(l−1)},lK} X_j‖ ≤ √(1 − u β) ‖X_j‖, l = 1, …, n−1

where 0 < u β ≤ 1 with

β = φ Δ_g^2 D_X^2. (52)
Using (41) and (43), from the proof of Lemma 2, we have

Δ_g = D_g(max(B_X B_H, 0)) = D_g(B_X B_H). (53)
In a similar vein, using Lemma 2, the upper bound of M_{n,l}^{(j)} X_j can be derived as

‖M_{n,l}^{(j)} X_j‖ ≤ (√(1 − u β))^{n−l} ‖X_j‖, l = 1, …, n. (54)

From (54), we have

‖Σ_{l=1}^{n} M_{n,l}^{(j)} X_j‖ ≤ Σ_{l=0}^{n−1} (√(1 − u β))^{l} ‖X_j‖ = (1 − (√(1 − u β))^{n})/(1 − √(1 − u β)) ‖X_j‖ ≤ 1/(1 − √(1 − u β)) ‖X_j‖. (55)
Inserting (55) into (51) and using Assumption 2, we obtain the following upper bound of the first term on the RHS of (49):

u K B_a B_g B_X/(1 − √(1 − u β)). (56)

From Lemma 1, the second term on the RHS of (49) is upper bounded by

u k B_a B_g B_X. (57)
Plugging (56) and (57) into (49), and noting that k ≤ K, we obtain the upper bound of ‖R_i‖ when i > K as follows:

‖R_i‖ ≤ K B_a B_g B_X C(u), i > K (58)

where

C(u) = u + u/(1 − √(1 − u β)). (59)

As 0 < u β ≤ 1, it is clear that C(u) > u. Now we can combine the upper bounds of ‖R_i‖ for i ≤ K [see (48)] and i > K [see (58)] together, and obtain the following:

‖R_i‖ ≤ K B_a B_g B_X C(u), i > 1. (60)
Given the bound of ‖G_i‖ in (47) and the bound of ‖R_i‖ in (60), from (44) we can get an upper bound of ‖H_{i+1}‖ as follows:

‖H_{i+1}‖ ≤ ‖G_i‖ + ‖R_i‖ ≤ B_{H_1} + K B_a B_g B_X C(u).

Now, if we have

B_{H_1} + K B_a B_g B_X C(u) ≤ B_H (61)

then ‖H_{i+1}‖ ≤ B_H is proved.
From (59), we have

C′(u) = ((6 − 2 u β) √(1 − u β) − 6 + 5 u β)/((4 − 2 u β) √(1 − u β) − 4 + 4 u β). (62)

Then, from (62), it is not difficult to verify that C(u) has a stationary point at u = 3/(4β), and, as u_0 = 1/(B_g^2 B_X^2) ≤ 1/β, that

2/β ≤ C(u) ≤ 9/(4β) (63)

where the upper bound is obtained when u = 3/(4β), and the lower bound is obtained when u = 1/β.
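The properties of C(u) claimed in (62) and (63) are easy to confirm numerically for an arbitrary positive β (β = 0.7 below is just an illustrative value of ours):

```python
import numpy as np

beta = 0.7
C = lambda u: u + u / (1.0 - np.sqrt(1.0 - u * beta))   # equation (59)

u = np.linspace(1e-6, 1.0 / beta, 200001)
vals = C(u)

print(abs(vals.max() - 9 / (4 * beta)) < 1e-6)          # max C = 9/(4 beta) ...
print(abs(u[vals.argmax()] - 3 / (4 * beta)) < 1e-3)    # ... attained at u = 3/(4 beta)
print(abs(C(1.0 / beta) - 2 / beta) < 1e-6)             # C(1/beta) = 2/beta
print(bool(vals.min() >= 2 / beta - 1e-9))              # 2/beta <= C(u) on (0, 1/beta]
```

(A quick simplification confirms this: C(u) = u + (1 + √(1 − uβ))/β, whose derivative vanishes exactly when √(1 − uβ) = 1/2.)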
Using the upper bound of C(u) in (63), it is clear that if the inequality

B_H − B_{H_1} − 9 K B_a B_X B_g/(4β) ≥ 0 (64)

holds, then (61) can be satisfied. Plugging (52) and (53) into (64), we have

(B_H − B_{H_1}) D_g^2(B_H B_X)/B_g − 9 K B_a B_X/(4 φ D_X^2) ≥ 0. (65)

As B_H is a solution for (16), (65) always holds, and then we get ‖H_{i+1}‖ ≤ B_H.

This completes the induction and the proof.
Proof of Theorem 2: As there exists a solution for the inequality (16) and Assumptions 1–6 hold, from Theorem 1, the parameter sequence {H_i}_{i=1}^∞ is uniformly upper bounded. Note that V_{i+1} = F_i + S_i [see (10)]. Let us investigate the limits of F_i and S_i as i → ∞ separately.

A. Limit of F_i as i → ∞

1) The case of N_P = N. In this case, the input sequence P = {X_k}_{k=1}^K is of full dimension, and from (11)

F_i = U_{0,i} V_1. (66)

For the first training cycle, as V_1 is an N-dimensional vector, it can be linearly combined from some elements of P, and it follows from Lemma 2 that

‖U_{0,K} V_1‖ ≤ √(1 − u β) ‖V_1‖

where 0 < u β ≤ 1.
For the second training cycle, since U_{0,K} V_1 is an N-dimensional vector, it can be linearly combined from some elements of P. Then it can be seen that

‖U_{K,2K} U_{0,K} V_1‖ ≤ √(1 − u β) ‖U_{0,K} V_1‖ ≤ (√(1 − u β))^2 ‖V_1‖.
In a similar vein, and noting that ‖U_{nK,nK+k}‖ ≤ 1 from Lemma 1, we have

‖F_i‖ ≤ (√(1 − u β))^n ‖V_1‖.

Since 0 ≤ √(1 − u β) < 1 and ‖V_1‖ is upper bounded by the bounded initialization H_1 and bounded optimal solution H* in Assumption 3, the limit of ‖F_i‖ as i → ∞ can be obtained as

0 ≤ lim_{i→∞} ‖F_i‖ ≤ lim_{n→∞} (√(1 − u β))^n ‖V_1‖ = 0

and consequently

lim_{i→∞} ‖F_i‖ = 0. (67)
2) The case of N_P < N. As the dimension of P = {X_k}_{k=1}^K is N_P, all elements of P can be linearly combined from a set of normalized orthogonal bases of order N_P. Suppose that the set is (σ_1, …, σ_{N_P}). Clearly, there also exists a set of normalized orthogonal bases of order N, (σ_1, …, σ_{N_P}, σ_{N_P+1}, …, σ_N), such that each N-dimensional vector can be generated by this set.

Let Q = [σ_1, …, σ_{N_P}, σ_{N_P+1}, …, σ_N] and α_{k,i} = X_k^T σ_i for i = 1, …, N_P; we can represent each element in P as

X_k = Q α_k, k = 1, …, K (68)

where

α_k = [α_{k,1}, …, α_{k,N_P}, 0, …, 0]^T.
Let

α̂_k = [α_{k,1}, …, α_{k,N_P}]^T.

A new sequence

P̂ = {α̂_k}_{k=1}^K (69)

can be obtained, and it is noted that this new sequence P̂ is of full dimension.
We then represent the terms V_1 and U_{0,i} in F_i by Q. Clearly, V_1 can be rewritten as

V_1 = H_1^∥ − H* + H_1^⊥

where H_1^∥ is the projection of H_1 on the subspace formed by the set (σ_1, …, σ_{N_P}), and H_1^⊥ is the projection of H_1 on the subspace formed by the set (σ_{N_P+1}, …, σ_N). Since H* can be linearly combined from some elements of P, H_1^∥ − H* can be generated by the set (σ_1, …, σ_{N_P}) as well.
Let w_i = (H_1^∥ − H*)^T σ_i for i = 1, …, N_P. Using Q, we can represent V_1 as follows:

V_1 = Q [W_1^T, 0]^T + H_1^⊥ (70)

where

W_1 = [w_1, …, w_{N_P}]^T. (71)
Plugging (68) into (9), we have

A_i = X_i X_i^T = Q α_i α_i^T Q^T = Q [Â_i 0; 0 0] Q^T (72)

where Â_i = α̂_i α̂_i^T and α̂_i ∈ P̂ (the notation [·;·] denotes block rows). Using (72), U_{0,i} can then be represented as
can then
be represented as
U
0,i
= (I ut
i
A
i
) . . . (I ut
1
A
1
)
= Q
_
I ut
i
A
i
0
0 I
_
Q
T
. . . Q
_
I ut
1
A
1
0
0 I
_
Q
T
= Q
_
(I ut
i
A
i
) . . . (I ut
1
A
1
) 0
0 I
_
Q
T
= Q
_
U
0,i
0
0 I
_
Q
T
(73)
where
U
j,i
=
_
(I ut
i
A
i
) . . . (I ut
j +1
A
j +1
), j < i
I, j = i
(74)
in which the new sequence P
= {
k
}
K
k=1
dened by (69) is
used.
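The block decomposition (72)-(74) can be verified numerically: confining the X_k to an N_P-dimensional subspace and completing the basis to Q, the matrix Q^T U_{0,K} Q is block diagonal with an identity lower-right block. A sketch (all dimensions and the sample t_i values are illustrative choices of ours):

```python
import numpy as np

rng = np.random.default_rng(2)
N, Np, K = 4, 2, 3
B = np.linalg.qr(rng.normal(size=(N, Np)))[0]     # orthonormal basis (sigma_1..sigma_Np) of span(P)
X = (B @ rng.normal(size=(Np, K))).T              # inputs X_k confined to that subspace
Q = np.linalg.qr(np.hstack([B, rng.normal(size=(N, N - Np))]))[0]  # completed orthonormal basis

u = 0.1
t = np.array([0.8, 0.9, 0.7])                     # sample values standing in for t_i
U = np.eye(N)
for i in range(K):                                # U_{0,K} = (I - u t_K A_K)...(I - u t_1 A_1)
    U = (np.eye(N) - u * t[i] * np.outer(X[i], X[i])) @ U

M = Q.T @ U @ Q                                   # should match the block form in (73)
print(np.allclose(M[Np:, Np:], np.eye(N - Np)))   # lower-right identity block: True
print(np.allclose(M[:Np, Np:], 0.0) and np.allclose(M[Np:, :Np], 0.0))  # zero off-diagonal blocks: True
```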
Inserting (70) and (73) into (66), F_i can be represented as follows:

F_i = U_{0,i} V_1 = U_{0,i} (Q [W_1; 0] + H_1^⊥)
= Q [Û_{0,i}, 0; 0, I] Q^T Q [W_1; 0] + U_{0,i} H_1^⊥
= Q [F̂_i; 0] + H_1^⊥ (75)

where for the second equality we use (70), for the third equality we use (73), and for the fourth equality we use

F̂_i = Û_{0,i} W_1 (76)

with Û_{0,i} defined in (74) and W_1 defined in (71). Note that U_{0,i} H_1^⊥ = H_1^⊥, as H_1^⊥ is orthogonal to every element of the input sequence P. The recursion of F̂_i in (76) is driven by the new input sequence P̂ = {α̂_k}_{k=1}^K in (69), which is of full dimension. Therefore, using the result of the case of N_P = N investigated above, we have

lim_{i→∞} ‖F̂_i‖ = 0. (77)

Consequently, we have

lim_{i→∞} ‖Q [F̂_i; 0]‖ = 0. (78)
Also, from (75) and (10), we have

V_{i+1} − H_1^⊥ = Q [F̂_i; 0] + S_i. (79)
B. Limit of S_i as i → ∞

Similar to the analysis of R_i in the proof of Theorem 1 [see (49)], S_i can be decomposed along the training cycles, and from (55) we have

‖Σ_{l=1}^{n} M_{n,l}^{(j)} X_j‖ ≤ (1 − (√(1 − u β))^n)/(1 − √(1 − u β)) ‖X_j‖.

Then it can be readily derived that

‖S_i‖ ≤ u B_g B_X (1 − (√(1 − u β))^n)/(1 − √(1 − u β)) Σ_{l=1}^{K} |ε_l| + u B_g B_X Σ_{j=1}^{k} |ε_{nK+j}|. (80)
As k ≤ K and ε_{nK+j} ∈ {ε_l}_{l=1}^K, (80) can be rearranged as

‖S_i‖ ≤ u B_g B_X (1 + (1 − (√(1 − u β))^n)/(1 − √(1 − u β))) Σ_{l=1}^{K} |ε_l|.

Since 0 ≤ √(1 − u β) < 1, we have

lim_{i→∞} ‖S_i‖ ≤ ε̄ C(u) (81)

where ε̄ = B_X B_g Σ_{l=1}^{K} |ε_l| is a positive constant, and C(u) is defined in (59).
Then, given the limits of F_i and S_i, we can obtain the following results. For the case of N_P = N, it follows from (67) and (81) that

lim_{i→∞} ‖H_i − H* − H_1^⊥‖ = lim_{i→∞} ‖V_i‖ ≤ lim_{i→∞} ‖F_i‖ + lim_{i→∞} ‖S_i‖ ≤ ε̄ C(u)

where H_1^⊥ = 0, as P is of full dimension. For the case of N_P < N, it follows from (78), (79), and (81) that

lim_{i→∞} ‖H_i − H* − H_1^⊥‖ = lim_{i→∞} ‖V_i − H_1^⊥‖ ≤ lim_{i→∞} ‖Q [F̂_i; 0]‖ + lim_{i→∞} ‖S_i‖ ≤ ε̄ C(u).

This completes the proof.
Proof of Corollary 1: With a vanishing disturbance sequence {ε_k}_{k=1}^K, we have

S_i = u Σ_{j=1}^{i} ε_j g′(X_j^T H_j) U_{j,i} X_j = 0, i = 1, 2, …

as ε_i ∈ {ε_k}_{k=1}^K and ε_k = 0 for k = 1, …, K. Then, from (10), V_{i+1} can be rewritten as

V_{i+1} = F_i = U_{0,i} V_1, i = 1, 2, …. (82)
Let us first investigate the boundedness of the parameter sequence {H_i}_{i=1}^∞ for this special case.

Let γ_l denote the correlation coefficient between X_l and U_{0,l−1} V_1

γ_l = <X_l, U_{0,l−1} V_1>/(‖X_l‖ ‖U_{0,l−1} V_1‖), l = 1, …, i.
As the learning rate u satisfies 0 < u ≤ u_0, using Lemma 1 (by replacing Z in Lemma 1 with U_{0,l−1} V_1 for l = 1, …, i), we have

‖U_{0,i} V_1‖ = ‖(I − u t_i A_i) U_{0,i−1} V_1‖ ≤ √(1 − u γ_i^2 t_i λ_i) ‖U_{0,i−1} V_1‖
≤ √(1 − u γ_i^2 t_i λ_i) ⋯ √(1 − u γ_1^2 t_1 λ_1) ‖V_1‖ ≤ ‖V_1‖, i = 1, 2, … (83)

where 0 ≤ u γ_l^2 t_l λ_l ≤ 1.
Inserting (83) into (82), with the bounded initialization H_1 and bounded optimal solution H* of Assumption 3, we have

‖V_{i+1}‖ = ‖U_{0,i} V_1‖ ≤ ‖V_1‖ ≤ ‖H_1‖ + ‖H*‖ ≤ B_{H_1} + B_{H*}, i = 1, 2, …. (84)
Then, from (84), the boundedness of {H_i}_{i=1}^∞ is a straightforward result, as

‖H_i‖ = ‖H_i − H* + H*‖ ≤ ‖H_i − H*‖ + ‖H*‖ = ‖V_i‖ + ‖H*‖ ≤ 2 B_{H*} + B_{H_1}, i = 1, 2, …. (85)
Note that, compared with the proof of Theorem 1, for this special case Assumption 6 and the inequality (16) do not necessarily need to hold to obtain the boundedness result (85). Actually, from the proof of Theorem 2, we can see that these two conditions are only used to guarantee the uniform upper boundedness of {H_i}_{i=1}^∞.
Given the boundedness result (85), we can directly use some results of Theorem 2. For the case of N_P = N, from (67), we have

lim_{i→∞} ‖F_i‖ = 0

and consequently

lim_{i→∞} V_{i+1} = lim_{i→∞} F_i = 0. (86)
Also, for the case of N_P < N, it follows from (75) that

V_{i+1} = Q [F̂_i; 0] + H_1^⊥.

From (77), we have

lim_{i→∞} ‖F̂_i‖ = 0

and consequently

lim_{i→∞} V_{i+1} = lim_{i→∞} (Q [F̂_i; 0] + H_1^⊥) = H_1^⊥. (87)
We now conclude from (86) and (87) that

lim_{i→∞} H_i = H* + H_1^⊥ (88)

where H_1^⊥ is the component of H_1 that is orthogonal to the input sequence P.