Error Bounds for Approximation with Neural Networks
Martin Burger and Andreas Neubauer
Abstract. In this paper we prove convergence rates for the problem of approximating functions f by neural networks and similar constructions. We show that the smoother the activation functions are, the better the rates, provided that f satisfies an integral representation. We give error bounds not only in Hilbert spaces but in general Sobolev spaces $W^{m,r}(\Omega)$. Finally, we apply our results to a class of perceptrons and present a sufficient smoothness condition on f guaranteeing the integral representation.
1. Introduction
The aim of this paper is to find error bounds for the approximation of functions by feed-forward networks with a single hidden layer and a linear output layer, which can be written as

$$f_n(x) = \sum_{j=1}^{n} c_j\,\sigma(x;t_j) \tag{1.1}$$

or, in the special case of so-called ridge constructions,

$$f_n(x) = \sum_{j=1}^{n} c_j\,\sigma(a_j^T x + b_j). \tag{1.2}$$
The interest in such networks grew since Hornik et al. [5] showed that functions of the form (1.2) are dense in $C(\Omega)$ if σ is a function of sigmoidal form. Another special case are radial basis function networks, where $\sigma(x;t) = \varphi(\|x - t\|)$ (cf. [7]).
We consider the problem of approximating a function $f \in W^{m,r}(\Omega)$, where $W^{m,r}(\Omega)$ denotes the usual Sobolev space and Ω is a (not necessarily bounded) domain in $\mathbb{R}^d$. This problem can be written in the abstract form

$$\inf_{g \in X_n} \|f - g\|_X, \tag{1.3}$$

where

$$X_n = \Big\{\, g = \sum_{j=1}^{n} c_j\,\sigma(x;t_j) \;:\; t_j \in P \subset \mathbb{R}^p,\ c_j \in \mathbb{R} \,\Big\}. \tag{1.4}$$

Throughout, we assume that f admits the integral representation

$$f(x) = \int_P h(t)\,\sigma(x;t)\,dt \tag{1.5}$$

for some function h on P. For such f it was shown in [6] that

$$\inf_{g \in X_n} \|f - g\|_{L^2(\Omega)} = O\big(n^{-\frac12}\big). \tag{1.6}$$

Supported by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung under grant SFB F013/1308.
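For orientation, both network types are easy to realize numerically. The following sketch (ours, not part of the analysis; the Gaussian kernel and the logistic activation are arbitrary illustrative choices) evaluates the general form (1.1) and the ridge form (1.2):

    import numpy as np

    # Evaluate the general form (1.1): f_n(x) = sum_j c_j * sigma(x; t_j).
    # The Gaussian kernel below is an illustrative choice, not prescribed above.
    def f_n_kernel(x, c, t, sigma):
        return sum(cj * sigma(x, tj) for cj, tj in zip(c, t))

    # Evaluate the ridge form (1.2): f_n(x) = sum_j c_j * act(a_j^T x + b_j).
    def f_n_ridge(x, c, a, b, act):
        return (c * act(x @ a.T + b)).sum(axis=1)

    rng = np.random.default_rng(0)
    d, n, N = 2, 5, 4
    x = rng.standard_normal((N, d))       # N evaluation points in R^d
    c = rng.standard_normal(n)            # output-layer weights
    a, b = rng.standard_normal((n, d)), rng.standard_normal(n)
    t = rng.standard_normal((n, d))       # kernel parameters (p = d here)
    gauss = lambda x, t: np.exp(-np.sum((x - t) ** 2, axis=1))
    logistic = lambda s: 1.0 / (1.0 + np.exp(-s))
    print(f_n_kernel(x, c, t, gauss))
    print(f_n_ridge(x, c, a, b, logistic))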
2. Error Bounds
An inspection of the proof of (1.6) in [6] shows that the result can be improved if the activation function is Hölder continuous. Moreover, rates can be obtained in $H^m(\Omega)$:
Theorem 2.1. Let $X_n$ be defined as in (1.4) with $P \subset \mathbb{R}^p$ bounded and such that

$$\|\sigma(\cdot\,;t) - \sigma(\cdot\,;s)\|_{H^m(\Omega)} \le c\,\|t - s\|^{\alpha}, \qquad \alpha \in (0,1],\ c > 0,\ m \in \mathbb{N}_0. \tag{2.1}$$

Moreover, let $f \in H^m(\Omega)$ satisfy (1.5) with $h \in L^\infty(P)$. Then we obtain the rate

$$\inf_{g \in X_n} \|f - g\|_{H^m(\Omega)} = O\big(n^{-\frac12 - \frac{\alpha}{p}}\big).$$
Proof. We divide P into n disjoint subsets $P_j$ such that

$$P = \bigcup_{j=1}^{n} P_j, \qquad P_i \cap P_j = \emptyset,\ i \neq j, \qquad \operatorname{diam}(P_j) = O\big(n^{-\frac1p}\big), \qquad |P_j| = O\big(\tfrac1n\big), \qquad i,j = 1,\dots,n. \tag{2.2}$$
We now define coefficients

$$c_j := \int_{P_j} h(t)\,dt \qquad\text{and}\qquad \chi_j(t) := \begin{cases} \dfrac{h(t)}{c_j}, & t \in P_j, \\[0.5ex] 0, & \text{otherwise}, \end{cases}$$

so that $\int_P \chi_j(t)\,dt = 1$. In the following, $E[z]$ denotes the expectation of z when the knots $t_j \in P_j$ are chosen independently at random with densities $\chi_j$.
Then

$$\begin{aligned}
E\Big[\big\|f - \sum_{j=1}^{n} c_j\,\sigma(\cdot\,;t_j)\big\|_{H^m(\Omega)}^2\Big]
&= \|f\|_{H^m(\Omega)}^2 - 2\sum_{j=1}^{n} c_j \int_P \chi_j(t)\,\langle f, \sigma(\cdot\,;t)\rangle_{H^m(\Omega)}\,dt \\
&\quad + \sum_{i \neq j} c_i c_j \int_P\!\int_P \chi_i(s)\,\chi_j(t)\,\langle \sigma(\cdot\,;s), \sigma(\cdot\,;t)\rangle_{H^m(\Omega)}\,ds\,dt
 + \sum_{j=1}^{n} c_j^2 \int_P \chi_j(t)\,\|\sigma(\cdot\,;t)\|_{H^m(\Omega)}^2\,dt \\
&= \Big\| \int_P \big[h(t) - \sum_{j=1}^{n} c_j\,\chi_j(t)\big]\,\sigma(\cdot\,;t)\,dt \Big\|_{H^m(\Omega)}^2 \\
&\quad + \sum_{j=1}^{n} c_j^2 \Big[ \int_P \chi_j(t)\,\|\sigma(\cdot\,;t)\|_{H^m(\Omega)}^2\,dt - \Big\|\int_P \chi_j(t)\,\sigma(\cdot\,;t)\,dt\Big\|_{H^m(\Omega)}^2 \Big] \\
&= \sum_{j=1}^{n} c_j^2 \sum_{|\mu| \le m} \int_\Omega \Big[ \int_P \chi_j(t)\Big(\frac{\partial^{|\mu|}\sigma}{\partial x^\mu}(x;t)\Big)^2 dt - \Big( \int_P \chi_j(t)\,\frac{\partial^{|\mu|}\sigma}{\partial x^\mu}(x;t)\,dt \Big)^2 \Big]\,dx \\
&= \sum_{j=1}^{n} c_j^2 \sum_{|\mu| \le m} \int_\Omega \int_{P_j} \chi_j(t) \Big[ \int_{P_j} \chi_j(s) \Big( \frac{\partial^{|\mu|}\sigma}{\partial x^\mu}(x;t) - \frac{\partial^{|\mu|}\sigma}{\partial x^\mu}(x;s) \Big) ds \Big]^2 dt\,dx,
\end{aligned}$$

where the first term after the second equality vanishes, since $h = \sum_j c_j\chi_j$ on P by construction.
Noting that $h \in L^\infty(P)$ and (2.2) imply that $c_j = O(\frac1n)$, we now obtain together with (2.1), (2.2), and the Cauchy–Schwarz inequality in the inner integral that

$$E\Big[\big\|f - \sum_{j=1}^{n} c_j\,\sigma(\cdot\,;t_j)\big\|_{H^m(\Omega)}^2\Big]
\le \sum_{j=1}^{n} c_j^2 \int_{P_j}\!\int_{P_j} \chi_j(t)\,\chi_j(s)\,\|\sigma(\cdot\,;t) - \sigma(\cdot\,;s)\|_{H^m(\Omega)}^2\,ds\,dt
= O\big(n \cdot n^{-2} \cdot n^{-\frac{2\alpha}{p}}\big) = O\big(n^{-1-\frac{2\alpha}{p}}\big).$$

Hence, there is at least one choice of knots $t_j \in P_j$ such that

$$\inf_{g \in X_n} \|f - g\|_{H^m(\Omega)} \le \big\|f - \sum_{j=1}^{n} c_j\,\sigma(\cdot\,;t_j)\big\|_{H^m(\Omega)} \le E\Big[\big\|f - \sum_{j=1}^{n} c_j\,\sigma(\cdot\,;t_j)\big\|_{H^m(\Omega)}^2\Big]^{\frac12} = O\big(n^{-\frac12-\frac{\alpha}{p}}\big),$$

where the $c_j$ are as above. □
We believe that the result above also holds if $h \in L^2(P)$. However, the choice of the subsets $P_j$ in (2.2) would then have to be more delicate, since $c_j = O(\frac1n)$ will no longer hold in general.
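As a numerical illustration of the construction in the proof of Theorem 2.1 (ours; the hat-shaped kernel and $h \equiv 1$ are arbitrary choices satisfying the assumptions with $p = 1$, $m = 0$, $\alpha = 1$): for $h \equiv 1$ we have $c_j = |P_j| = \frac1n$ and $\chi_j$ is the uniform density on $P_j$, and the observed $L^2$ errors should decay roughly like $n^{-\frac12-\frac{\alpha}{p}} = n^{-\frac32}$.

    import numpy as np

    # Randomized construction from the proof of Theorem 2.1 (illustration only):
    # P = [0,1], h = 1, so c_j = |P_j| = 1/n and chi_j is uniform on
    # P_j = [(j-1)/n, j/n].  The kernel is Lipschitz in t (alpha = 1, p = 1).
    rng = np.random.default_rng(1)
    xs = np.linspace(0.0, 1.0, 1001)       # grid on Omega = [0,1]
    ts = np.linspace(0.0, 1.0, 2001)       # quadrature grid on P

    def sigma(x, t):
        return np.maximum(0.0, 1.0 - 5.0 * np.abs(x - t))

    # f(x) = int_P sigma(x,t) dt, computed by quadrature
    f = np.trapz(sigma(xs[:, None], ts[None, :]), ts, axis=1)

    for n in [4, 16, 64, 256]:
        err2 = 0.0
        for _ in range(50):                        # average over 50 draws
            tj = (np.arange(n) + rng.random(n)) / n   # t_j drawn from chi_j
            fn = sigma(xs[:, None], tj[None, :]).sum(axis=1) / n
            err2 += np.trapz((f - fn) ** 2, xs)
        print(n, np.sqrt(err2 / 50))       # decays roughly like n^(-3/2)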
We will now turn to other estimates in spaces $W^{m,r}(\Omega)$. The error bounds will depend on the dimension p of $P \subset \mathbb{R}^p$. The proofs are based on the following results from finite element theory:
Let $P = \bigtimes_{i=1}^{p} [\alpha_i, \beta_i]$ and, for $\nu \in \mathbb{N}$, subdivide each interval $[\alpha_i, \beta_i]$ into ν subintervals of equal length, so that P is partitioned into the $\nu^p$ cells $P_{l_1\cdots l_p}$, $l_i = 0,\dots,\nu-1$, $i = 1,\dots,p$. Then, obviously,

$$P = \bigcup_{\substack{l_i = 0,\dots,\nu-1 \\ i = 1,\dots,p}} P_{l_1\cdots l_p}, \qquad \rho := \max_{i=1,\dots,p} \frac{\beta_i - \alpha_i}{\nu}.$$

Moreover, we define I as the tensor-product polynomial interpolation operator associated with knots $t_j \in P$ and basis functions $\chi_j$ (products of one-dimensional Lagrange basis polynomials $q_{j_1\cdots j_p}$ on each cell), i.e.,

$$(Iv)(t) := \sum_{j=1}^{n} v(t_j)\,\chi_j(t). \tag{2.6}$$

Proposition 2.2. Let I be defined by (2.6). Then, for $0 \le \mu \le k$,

$$\|v - Iv\|_{H^\mu(P)} \le c_0\,\rho^{k-\mu}\,\|v\|_{H^k(P)} \quad\text{for all } v \in H^k(P),\ k > \tfrac{p}{2}, \tag{2.7}$$

$$\|v - Iv\|_{C^\mu(P)} \le c_0\,\rho^{k-\mu}\,\|v\|_{C^k(P)} \quad\text{for all } v \in C^k(P), \tag{2.8}$$

where the condition $k > \frac{p}{2}$ in (2.7) guarantees that point values of v are defined.

Proof. The proof follows with Theorem 3.1 and Theorem 3.3 in [8]. □
For our main result we need the following types of smoothness of σ: $\sigma \in W^{m,r}(\Omega; Y)$ with $Y = H^k(P)$ or $Y = C^k(P)$ and norms

$$\|\sigma\|_{W^{m,r}(\Omega;Y)} := \begin{cases} \Big( \displaystyle\sum_{|\mu|\le m} \int_\Omega \Big\|\frac{\partial^{|\mu|}\sigma}{\partial x^\mu}(x;\cdot)\Big\|_Y^r\,dx \Big)^{\frac1r}, & 1 \le r < \infty, \\[2ex] \displaystyle\max_{|\mu|\le m}\ \operatorname*{ess\,sup}_{x\in\Omega} \Big\|\frac{\partial^{|\mu|}\sigma}{\partial x^\mu}(x;\cdot)\Big\|_Y, & r = \infty. \end{cases}$$

Theorem 2.3. Let $X_n$ be defined as in (1.4) with $P = \bigtimes_{i=1}^{p} [\alpha_i, \beta_i]$, and let $f \in W^{m,r}(\Omega)$ satisfy (1.5), where either $\sigma \in W^{m,r}(\Omega; H^k(P))$ with $k > \frac{p}{2}$ and $h \in L^2(P)$, or $\sigma \in W^{m,r}(\Omega; C^k(P))$ and $h \in L^1(P)$. Then

$$\inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O\big(n^{-\frac{k}{p}}\big).$$
Proof. If we choose $c_j$ as

$$c_j := \int_P h(t)\,\chi_j(t)\,dt, \qquad \chi_j \in L^\infty(P),$$

with $\chi_j$ and $t_j$ as in (2.6), then $\sum_j \chi_j(t)\,\sigma(\cdot\,;t_j) = I\sigma(\cdot\,;t)$ with interpolation taken with respect to the variable t, and hence

$$\Big\|f - \sum_{j=1}^{n} c_j\,\sigma(\cdot\,;t_j)\Big\|_{W^{m,r}(\Omega)} = \Big\| \int_P h(t)\,\sigma(\cdot\,;t)\,dt - \sum_{j=1}^{n} c_j\,\sigma(\cdot\,;t_j) \Big\|_{W^{m,r}(\Omega)} = \Big\| \int_P h(t)\big(\sigma(\cdot\,;t) - I\sigma(\cdot\,;t)\big)\,dt \Big\|_{W^{m,r}(\Omega)}.$$

Note that this interpolating property also holds for all derivatives of σ with respect to x. Applying (2.7) (μ = 0) for $Y = H^k(P)$ and (2.8) (μ = 0) for $Y = C^k(P)$ we obtain the estimates
$$\Big\|f - \sum_{j=1}^{n} c_j\,\sigma(\cdot\,;t_j)\Big\|_{W^{m,r}(\Omega)} \le c_0\,\rho^k\,\|h\|_{L^2(P)}\,\|\sigma\|_{W^{m,r}(\Omega;H^k(P))} \tag{2.9}$$

and

$$\Big\|f - \sum_{j=1}^{n} c_j\,\sigma(\cdot\,;t_j)\Big\|_{W^{m,r}(\Omega)} \le c_0\,\rho^k\,\|h\|_{L^1(P)}\,\|\sigma\|_{W^{m,r}(\Omega;C^k(P))}, \tag{2.10}$$

respectively. Now the assertion follows together with the fact that $n = O(\nu^p)$ and hence $\rho = O(n^{-\frac1p})$. □
Remark 2.4. The idea of choosing $c_j$, $t_j$, and $\chi_j$ as in the proof above can be found in a paper by Wahba [9] for one-dimensional P. Here this idea was extended to higher dimensions, i.e., to $P \subset \mathbb{R}^p$.
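To make the deterministic choice in the proof of Theorem 2.3 concrete, here is a one-dimensional sketch (ours; σ, h, and the hat basis are illustrative choices): with piecewise linear interpolation in t, i.e., k = 2 in (2.7), the coefficients $c_j = \int_P h(t)\chi_j(t)\,dt$ are plain quadratures, and the error should decay like $\rho^2 = O(n^{-2})$ provided $\sigma(x,\cdot) \in H^2(P)$.

    import numpy as np

    # Deterministic construction from the proof of Theorem 2.3 (illustration):
    # uniform knots t_j in P = [0,1], chi_j the piecewise linear "hat" basis
    # (so I is piecewise linear interpolation), c_j = int_P h(t) chi_j(t) dt.
    xs = np.linspace(0.0, 1.0, 801)                    # grid on Omega
    ts = np.linspace(0.0, 1.0, 4001)                   # quadrature grid on P
    h = lambda t: 1.0 + np.sin(2.0 * np.pi * t)        # some h in L^2(P)
    sigma = lambda x, t: np.exp(-25.0 * (x - t) ** 2)  # smooth in t

    f = np.trapz(h(ts)[None, :] * sigma(xs[:, None], ts[None, :]), ts, axis=1)

    def hat(t, tj, dt):                                # hat function at knot tj
        return np.maximum(0.0, 1.0 - np.abs(t - tj) / dt)

    for n in [5, 10, 20, 40]:
        knots = np.linspace(0.0, 1.0, n)
        dt = knots[1] - knots[0]
        c = [np.trapz(h(ts) * hat(ts, tj, dt), ts) for tj in knots]
        fn = sum(cj * sigma(xs, tj) for cj, tj in zip(c, knots))
        print(n, np.sqrt(np.trapz((f - fn) ** 2, xs)))  # roughly O(n^(-2))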
The following extension of Theorem 2.3 is obvious from the proof: the coefficients may be chosen separately for each derivative order, i.e.,

$$c_j^\mu := \int_P h(t)\,\chi_j^\mu(t)\,dt, \qquad |\mu| \le m. \tag{2.11}$$

If the functions $\chi_j^\mu$ are chosen such that for each μ they coincide with the appropriate derivative of the basis functions $q_{j_1\cdots j_p}$ in $P_{l_1\cdots l_p}$, we obtain together with Proposition 2.2 the rates

$$\inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O\big(n^{-\frac{k-m}{p}}\big).$$
Finally, we want to mention that the rates above and in Theorem 2.3 decrease with increasing dimension p. There is no dimension-independent term like $n^{-\frac12}$ as in (1.6) or Theorem 2.1. Since the estimates in the proof of Theorem 2.3 are based on a fixed choice of knots $t_j$, this dependence on p is to be expected. We were not able to improve the rates for a possible optimal choice of knots. However, since Proposition 2.2 is also valid for many other non-uniform choices of knots $t_j$, the rates in Theorem 2.3 remain valid for many choices of $t_j$ (also non-optimal ones) as long as the $c_j$ are chosen optimally.
3. Application to Perceptrons
We now apply the results of the previous section to perceptrons with a single hidden layer, namely ridge constructions (cf. (1.2)),

$$X_n = \Big\{\, g = \sum_{j=1}^{n} c_j\,\sigma(a_j^T x + b_j) \;:\; a_j \in A \subset \mathbb{R}^d,\ b_j \in B \subset \mathbb{R} \,\Big\},$$

where σ is a function of sigmoidal form. If σ is such that

$$\lim_{t\to-\infty} \sigma(t) = 0 \qquad\text{and}\qquad \lim_{t\to+\infty} \sigma(t) = 1,$$

a prototypical choice with prescribed smoothness is

$$\sigma(t) := \begin{cases} 1, & t > 1, \\ p(t), & -1 \le t \le 1, \\ 0, & t < -1, \end{cases} \tag{3.1}$$

with p the unique polynomial of degree 2k+1 satisfying

$$p(-1) = 0, \qquad p(1) = 1, \qquad p^{(i)}(-1) = p^{(i)}(1) = 0, \quad i = 1,\dots,k, \tag{3.2}$$

so that $\sigma \in C^k(\mathbb{R})$.
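The polynomial p in (3.1), (3.2) is determined by solving a small linear system; the following sketch (ours) computes its monomial coefficients for arbitrary k:

    import numpy as np
    from math import factorial

    # Compute the polynomial p of degree 2k+1 in (3.1) from the 2k+2
    # conditions p(-1) = 0, p(1) = 1, p^(i)(-1) = p^(i)(1) = 0, i = 1,...,k.
    def sigmoid_polynomial(k):
        deg = 2 * k + 1
        A, rhs = [], []
        for s, val in ((-1.0, 0.0), (1.0, 1.0)):
            for i in range(k + 1):
                # i-th derivative of t^m, evaluated at t = s
                A.append([factorial(m) // factorial(m - i) * s ** (m - i)
                          if m >= i else 0.0 for m in range(deg + 1)])
                rhs.append(val if i == 0 else 0.0)
        return np.linalg.solve(np.array(A), np.array(rhs))

    print(sigmoid_polynomial(0))   # k = 0: p(t) = (t + 1)/2
    print(sigmoid_polynomial(1))   # k = 1: p(t) = (2 + 3t - t^3)/4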
[Figure: graph of a sigmoidal activation function σ as in (3.1).]
Example 3.1. Let

$$\sigma(t) := \begin{cases} 1, & t > 1, \\ \dfrac{t+1}{2}, & -1 \le t \le 1, \\ 0, & t < -1, \end{cases} \tag{3.3}$$

i.e., the case k = 0 of (3.1), and let $A := \bigtimes_{i=1}^{d} [-a_i, a_i]$ and $B := [-\bar b, \bar b]$ with $a_i > 0$ and $\bar b > 0$ such that

$$\forall a \in A\ \ \forall x \in \Omega: \quad |a^T x| \le \bar b - 1.$$
Since $\sigma(x; a, b) := \sigma(a^T x + b)$ satisfies (2.1) with m = 0 and α = 1, Theorem 2.1 implies that

$$\inf_{g \in X_n} \|f - g\|_{L^2(\Omega)} = O\big(n^{-\frac12 - \frac{1}{d+1}}\big)$$

if

$$f(x) = \int_A \int_{-\bar b}^{\bar b} h(a,b)\,\sigma(a^Tx + b)\,db\,da = \int_A \Big[ \int_{-1-a^Tx}^{1-a^Tx} h(a,b)\,\frac{1 + a^Tx + b}{2}\,db + \int_{1-a^Tx}^{\bar b} h(a,b)\,db \Big]\,da \tag{3.4}$$

for some $h \in L^\infty(A \times B)$.
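For completeness, a short verification (ours) of the Hölder estimate used here: the function σ of (3.3) is Lipschitz continuous with constant $\frac12$, so for all $(a,b), (a',b') \in A \times B$ and $x \in \Omega$,

$$|\sigma(a^Tx + b) - \sigma(a'^Tx + b')| \le \tfrac12\,\big(\|x\|\,\|a - a'\| + |b - b'|\big) \le \tfrac12\,\big(\sup_{x\in\Omega}\|x\| + 1\big)\,\|(a,b) - (a',b')\|,$$

and hence, Ω being bounded here (the condition $|a^Tx| \le \bar b - 1$ for all $a \in A$ forces this),

$$\|\sigma(\cdot\,;a,b) - \sigma(\cdot\,;a',b')\|_{L^2(\Omega)} \le c\,\|(a,b) - (a',b')\| \qquad\text{with } c := \tfrac12\,|\Omega|^{\frac12}\big(\sup_{x\in\Omega}\|x\| + 1\big),$$

which is (2.1) with m = 0 and α = 1.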
Example 3.2. We now consider the general case, where σ is defined by (3.1), (3.2), and where A and B are as in Example 3.1. Since $\sigma(x; a, b) := \sigma(a^Tx + b)$ satisfies $\sigma \in W^{m,\infty}(\Omega; C^{k-m}(A \times B))$ ($m \le k$) and $\sigma \in W^{m,\infty}(\Omega; H^{k+1-m}(A \times B))$ ($m \le k+1$), we may apply Theorem 2.3 to obtain

$$\inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O\big(n^{-\frac{k-m}{d+1}}\big)$$

if $f \in W^{m,r}(\Omega)$ satisfies

$$f(x) = \int_A \Big[ \int_{-1-a^Tx}^{1-a^Tx} h(a,b)\,p(a^Tx + b)\,db + \int_{1-a^Tx}^{\bar b} h(a,b)\,db \Big]\,da \tag{3.5}$$

for some $h \in L^1(A \times B)$, and the rate

$$\inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O\big(n^{-\frac{k+1-m}{d+1}}\big)$$

if $f \in W^{m,r}(\Omega)$ satisfies (3.5) for some $h \in L^2(A \times B)$ and $k + 1 - m > \frac{d+1}{2}$. Note that for m = 0 and $k > \frac{d+1}{2}$ the rate above is better than the one in Example 3.1.
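Indeed, comparing exponents for m = 0 (a one-line check):

$$\frac{k+1}{d+1} > \frac12 + \frac{1}{d+1} \iff 2(k+1) > (d+1) + 2 \iff k > \frac{d+1}{2}.$$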
From both examples we can see that the conditions (3.4) and (3.5) can only be satisfied if f is several times differentiable. We will now give a sufficient condition on f that guarantees (3.4).

Let $\varepsilon_0 := 0$ and

$$\varepsilon_n := \frac{\pi}{2}\,(4n^j - 3), \qquad n \in \mathbb{N},$$

for some $j \in \mathbb{N}$ to be specified later, and let $\delta_n := \frac{\varepsilon_n}{\varepsilon_{n+1}}$ and $A_n := \delta_n A$. We define the function h as follows:

$$h(a,b) := \sum_{n=1}^{\infty} \big[ \beta_n(a)\cos(b\varepsilon_n) + \gamma_n(a)\sin(b\varepsilon_n) \big], \tag{3.6}$$

where

$$\beta_n(a) := \begin{cases} -(2\pi)^{-\frac d2}\,\varepsilon_n^3\,\Im\hat f(a\varepsilon_n), & a \in A_n \setminus A_{n-1}, \\ 0, & \text{else}, \end{cases} \qquad \gamma_n(a) := \begin{cases} (2\pi)^{-\frac d2}\,\varepsilon_n^3\,\Re\hat f(a\varepsilon_n), & a \in A_n \setminus A_{n-1}, \\ 0, & \text{else}. \end{cases} \tag{3.7}$$

Note that, due to the definition of $\delta_n$ and $A_n$, the sum in (3.6) is almost always finite. ℑ and ℜ denote the imaginary and real part, respectively. With $\hat f$ we denote the Fourier transform of any function $\tilde f$ satisfying $\tilde f = f$ in Ω.
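For later use we note (a one-line check of the choice of $\varepsilon_n$): since $\varepsilon_n = 2\pi n^j - \frac{3\pi}{2}$,

$$\sin\varepsilon_n = \sin\big(\!-\tfrac{3\pi}{2}\big) = 1 \qquad\text{and}\qquad \cos\varepsilon_n = \cos\big(\!-\tfrac{3\pi}{2}\big) = 0,$$

which is what makes the inner integrals in the proof of Proposition 3.4 below collapse.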
Lemma 3.3. Let f be such that $(1 + |\cdot|^{3+\epsilon-\frac1p})\hat f(\cdot) \in L^p(\mathbb{R}^d)$, where $\hat f$ is as above and $\epsilon = 0$ for p = 1 and $\epsilon > 0$ for $1 < p \le \infty$, and let A and B be as in Example 3.1. Then it holds for h defined by (3.6) and (3.7) with $j \in \mathbb{N}$ sufficiently large (see the definition of $\varepsilon_n$) that

$$h \in L^p(A \times B).$$
Proof. Let $p < \infty$. Then we obtain with (3.6), (3.7), and $\varepsilon_n$, $\delta_n$ as defined above that

$$\int_A \int_{-\bar b}^{\bar b} |h(a,b)|^p\,db\,da = O\Big( \int_A \Big( \sum_{n=1}^{\infty} \big(|\beta_n(a)| + |\gamma_n(a)|\big) \Big)^p da \Big) = O\Big( \int_A \Big( \sum_{n=1}^{\infty} \varepsilon_n^3\,|\hat f(a\varepsilon_n)| \Big)^p da \Big).$$

Applying Hölder's inequality to the splitting $\varepsilon_n^3\,|\hat f(a\varepsilon_n)| = \big(\varepsilon_n^{3+\epsilon}\,|\hat f(a\varepsilon_n)|\big)\,\varepsilon_n^{-\epsilon}$ yields

$$\int_A \int_{-\bar b}^{\bar b} |h(a,b)|^p\,db\,da = O\Big( \Big( \sum_{n=1}^{\infty} \varepsilon_n^{-\frac{\epsilon p}{p-1}} \Big)^{p-1}\, \sum_{n=1}^{\infty} \int_{A_n \setminus A_{n-1}} \varepsilon_n^{(3+\epsilon)p}\,|\hat f(a\varepsilon_n)|^p\,da \Big),$$

where $\sum_{n=1}^{\infty} \varepsilon_n^{-\frac{\epsilon p}{p-1}} < \infty$ if j is sufficiently large and $\epsilon = 0$ for p = 1 and $\epsilon > 0$ for p > 1, which we assume to hold in the following. Substituting $z = a\varepsilon_n$ finally gives

$$\int_A \int_{-\bar b}^{\bar b} |h(a,b)|^p\,db\,da = O\Big( \int_{\mathbb{R}^d} \big(1 + |z|^{3+\epsilon-\frac1p}\big)^p\,|\hat f(z)|^p\,dz \Big) < \infty.$$

For $p = \infty$, for every $a \in A$ only finitely many terms of the sum in (3.6) are nonzero, and we obtain directly that

$$|h(a,b)| \le \sum_{n=1}^{\infty} \big(|\beta_n(a)| + |\gamma_n(a)|\big) = O\Big( \sup_{n\in\mathbb{N}} \varepsilon_n^3\,|\hat f(a\varepsilon_n)| \Big) = O\Big( \sup_{z\in\mathbb{R}^d} \big(1 + |z|^{3+\epsilon}\big)\,|\hat f(z)| \Big) < \infty. \qquad\Box$$
Proposition 3.4. Let f, A, and B satisfy the conditions in Lemma 3.3. Moreover, let f be such that $(1 + |\cdot|)\hat f(\cdot) \in L^1(\mathbb{R}^d)$. Then f has an integral representation (3.4) for some $h \in L^p(A \times B)$.
Proof. With the special choice of h as in (3.6) and (3.7) we know from Lemma 3.3 that $h \in L^p(A \times B)$. We will now show that

$$\begin{aligned}
g(x) &:= \int_A \Big[ \int_{-1-a^Tx}^{1-a^Tx} h(a,b)\,\frac{1 + a^Tx + b}{2}\,db + \int_{1-a^Tx}^{\bar b} h(a,b)\,db \Big]\,da \\
&= \sum_{n=1}^{\infty} \int_A \Big[ \beta_n(a)\Big( \int_{-1-a^Tx}^{1-a^Tx} \cos(b\varepsilon_n)\,\frac{1 + a^Tx + b}{2}\,db + \int_{1-a^Tx}^{\bar b} \cos(b\varepsilon_n)\,db \Big) \\
&\qquad\qquad + \gamma_n(a)\Big( \int_{-1-a^Tx}^{1-a^Tx} \sin(b\varepsilon_n)\,\frac{1 + a^Tx + b}{2}\,db + \int_{1-a^Tx}^{\bar b} \sin(b\varepsilon_n)\,db \Big) \Big]\,da
\end{aligned}$$

coincides with f up to an additive constant. Since $\sin\varepsilon_n = 1$ and $\cos\varepsilon_n = 0$, the inner integrals evaluate to

$$\frac{\sin(\varepsilon_n a^Tx)}{\varepsilon_n^2} + \frac{\sin(\varepsilon_n \bar b)}{\varepsilon_n} \qquad\text{and}\qquad \frac{\cos(\varepsilon_n a^Tx)}{\varepsilon_n^2} - \frac{\cos(\varepsilon_n \bar b)}{\varepsilon_n},$$

respectively, and hence, with (3.7),

$$g(x) = \sum_{n=1}^{\infty} (2\pi)^{-\frac d2} \int_{A_n \setminus A_{n-1}} \big( \Re\hat f(a\varepsilon_n)\cos(\varepsilon_n a^Tx) - \Im\hat f(a\varepsilon_n)\sin(\varepsilon_n a^Tx) \big)\,\varepsilon_n\,da - (2\pi)^{-\frac d2} \sum_{n=1}^{\infty} \int_{A_n \setminus A_{n-1}} \big( \Im\hat f(a\varepsilon_n)\sin(\varepsilon_n \bar b) + \Re\hat f(a\varepsilon_n)\cos(\varepsilon_n \bar b) \big)\,\varepsilon_n^2\,da.$$

The second term above is a constant, since $(1 + |\cdot|)\hat f(\cdot) \in L^1(\mathbb{R}^d)$. (The proof is similar to the one in Lemma 3.3.) We denote this constant by C in the following. Hence, we obtain that

$$g(x) = (2\pi)^{-\frac d2} \int_{\mathbb{R}^d} \big( \Re\hat f(z)\cos(z^Tx) - \Im\hat f(z)\sin(z^Tx) \big)\,dz + C = (2\pi)^{-\frac d2} \int_{\mathbb{R}^d} \hat f(z)\,e^{iz^Tx}\,dz + C = f(x) + C.$$

It remains to be shown that the constant function satisfies (3.4) for some $h \in L^\infty(A \times B)$. Let $h(a,b) := \frac{C}{\bar b\,|A|}$. Then we obtain that

$$\int_A \Big[ \int_{-1-a^Tx}^{1-a^Tx} h(a,b)\,\frac{1 + a^Tx + b}{2}\,db + \int_{1-a^Tx}^{\bar b} h(a,b)\,db \Big]\,da = \frac{C}{\bar b\,|A|} \int_A \big( \bar b + a^Tx \big)\,da = C,$$

since

$$\int_A a^Tx\,da = 0$$

for the special choice of A (see Example 3.1). □
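For the record, the two inner integrals above evaluate (substituting $s = 1 + a^Tx + b$ in the first) to

$$\int_{-1-a^Tx}^{1-a^Tx} \frac{1 + a^Tx + b}{2}\,db = \int_0^2 \frac{s}{2}\,ds = 1 \qquad\text{and}\qquad \int_{1-a^Tx}^{\bar b} db = \bar b - 1 + a^Tx,$$

which gives the factor $(\bar b + a^Tx)$ used in the last display.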
Remark 3.5. For the case p = 1, the condition $(1 + |\cdot|)\hat f(\cdot) \in L^1(\mathbb{R}^d)$ in Proposition 3.4 is superfluous, since it is implied by the condition $(1 + |\cdot|^2)\hat f(\cdot) \in L^1(\mathbb{R}^d)$ in Lemma 3.3. This sufficient condition for (3.4) actually means that f has a $C^2$-extension to the exterior of Ω. On the other hand, it is easy to see that for condition (3.4) to hold it is necessary that f is twice weakly differentiable.

For the case p = 2, the conditions in Proposition 3.4 mean that f has a $C^1$-extension to the exterior of Ω and that f may be extended to a function in $H^{\frac52+\epsilon}(\mathbb{R}^d)$ for some $\epsilon > 0$.
For the general case of perceptrons ($k \in \mathbb{N}$) in Example 3.2, one can prove a result similar to Proposition 3.4 by constructing the function h in Lemma 3.3 similarly to (3.6) and (3.7). The sufficient conditions for (3.5) to hold are

$$(1 + |\cdot|)\hat f(\cdot) \in L^1(\mathbb{R}^d) \qquad\text{and}\qquad (1 + |\cdot|^{3+k+\epsilon-\frac1p})\hat f(\cdot) \in L^p(\mathbb{R}^d).$$
It was shown in [1] that $(1 + |\cdot|)\hat f(\cdot) \in L^1(\mathbb{R}^d)$ is sufficient for the rate

$$\inf_{g \in X_n} \|f - g\|_{L^2(\Omega)} = O\big(n^{-\frac12}\big)$$

if $P = \mathbb{R}^{d+1}$. It is obvious that better rates can only be obtained under stronger conditions on f. Unfortunately, the rates in Theorem 2.3 are only better than $O(n^{-\frac12})$ if k is sufficiently large depending on the dimension d. On the other hand, the rates in Theorem 2.3 are also valid for non-optimally chosen $\{t_j\}$ (compare Remark 2.4).
References

[1] A. R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. Inf. Theory 39 (1993), 930–945.

[2] C. M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.

[3] M. Burger and H. W. Engl, Training neural networks with noisy data as an ill-posed problem, Adv. Comp. Math. (2000), to appear.

[4] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer, Dordrecht, 1996.

[5] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989), 359–366.

[6] P. Niyogi and F. Girosi, Generalization bounds for function approximation from scattered noisy data, Adv. Comp. Math. 10 (1999), 51–80.

[7] R. Schaback, Approximation by radial basis functions with finitely many centers, Constr. Approx. 12 (1996), 331–340.

[8] G. Strang and G. J. Fix, An Analysis of the Finite Element Method, Prentice-Hall, Englewood Cliffs, 1973.

[9] G. Wahba, Convergence rates of certain approximate solutions to Fredholm integral equations of the first kind, J. Approx. Theory 7 (1973), 167–185.