
Error Bounds for Approximation with Neural Networks

Martin Burger and Andreas Neubauer

Institut für Industriemathematik, Johannes-Kepler-Universität, A-4040 Linz, Austria

Abstract. In this paper we prove convergence rates for the problem of approximating functions $f$ by neural networks and similar constructions. We show that the rates improve as the smoothness of the activation functions increases, provided that $f$ satisfies an integral representation. We give error bounds not only in Hilbert spaces but in general Sobolev spaces $W^{m,r}(\Omega)$. Finally, we apply our results to a class of perceptrons and present a sufficient smoothness condition on $f$ guaranteeing the integral representation.

Key Words: Neural networks, error bounds, nonlinear function approximation.


AMS Subject Classifications: 41A30, 41A25, 92B20, 68T05

1. Introduction

The aim of this paper is to find error bounds for the approximation of functions by feed-forward networks with a single hidden layer and a linear output layer, which can be written as
$$ f_n(x) = \sum_{j=1}^{n} c_j \, \sigma(x; t_j) , \qquad (1.1) $$
where $c_j \in \mathbb{R}$ and $t_j \in P \subset \mathbb{R}^p$ are parameters to be determined.


An important special case of (1.1) are so-called ridge constructions, i.e.,
$$ f_n(x) = \sum_{j=1}^{n} c_j \, \sigma(a_j^T x + b_j) . \qquad (1.2) $$

The interest in such networks grew since Hornik et al. [5] showed that functions of the form (1.2) are dense in $C(\Omega)$ if $\sigma$ is a function of sigmoidal form. Another special case are radial basis function networks, where $\sigma(x; t) = \phi(\|x - t\|)$ (cf. [7]).
We consider the problem of approximating a function $f \in W^{m,r}(\Omega)$, where $W^{m,r}(\Omega)$ denote the usual Sobolev spaces and $\Omega$ is a (not necessarily bounded) domain in $\mathbb{R}^d$. This problem can be written in the abstract form
$$ \inf_{g \in X_n} \|f - g\|_X , \qquad (1.3) $$

Supported by the Austrian Fonds zur Förderung der wissenschaftlichen Forschung under grant SFB F013/1308


where $X_n$ denotes the set of all functions of form (1.1), i.e.,
$$ X_n = \Big\{ g = \sum_{j=1}^{n} c_j \, \sigma(x; t_j) : t_j \in P \subset \mathbb{R}^p ,\ c_j \in \mathbb{R} \Big\} . \qquad (1.4) $$
$\sigma$ is assumed smooth enough so that $X_n \subset X$; $P$ is a (usually bounded) domain.


Usually, the convergence of solutions of (1.3), if they exist (note that $X_n$ is not a finite-dimensional subspace of $X$), is arbitrarily slow, since the approximation problem is asymptotically ill-posed, i.e., arbitrarily small errors in the observation can lead to arbitrarily large errors in the approximation as $n \to \infty$ (cf., e.g., [2, 3]). It was shown in [3] that the set of functions to which networks of the form (1.1) converge is just the closure of the range of the integral operator
$$ h \mapsto \int_P h(t) \, \sigma(\cdot; t) \, dt . $$
Rates are usually only obtained under additional conditions on $f$ (cf., e.g., [4]). A natural condition seems to be that $f$ is in the range of the above operator, i.e.,
$$ f(x) = \int_P h(t) \, \sigma(x; t) \, dt . \qquad (1.5) $$
It was shown in [6] that under this condition the rate
$$ \inf_{g \in X_n} \|f - g\|_{L^2(\Omega)} = O(n^{-\frac{1}{2}}) \qquad (1.6) $$
is obtained if $\sigma$ is a continuous function. We improve this result under additional smoothness assumptions on the basis function $\sigma$ in the next section, with estimates also in $H^m(\Omega)$. Moreover, we will give error bounds in $W^{m,r}(\Omega)$ that depend on the dimension $p$ of $P$, where the analysis is based on finite-element theory. In Section 3, we apply the results to perceptrons and give sufficient conditions on $f$ for condition (1.5) to hold.
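As a purely illustrative aside (our addition, not part of the original text), the sum (1.1) with the ridge form (1.2) is straightforward to evaluate numerically; the following minimal Python sketch does so, with a logistic sigmoid chosen only as a placeholder activation:

```python
import numpy as np

def ridge_network(x, c, a, b, sigma):
    """Evaluate f_n(x) = sum_j c_j * sigma(a_j^T x + b_j) as in (1.2)."""
    # x: (d,) input; c: (n,) outer weights; a: (n, d) inner weights; b: (n,) biases
    return np.dot(c, sigma(a @ x + b))

rng = np.random.default_rng(0)
n, d = 50, 3
c, a, b = rng.normal(size=n), rng.normal(size=(n, d)), rng.normal(size=n)
logistic = lambda t: 1.0 / (1.0 + np.exp(-t))   # placeholder sigmoidal activation
print(ridge_network(np.ones(d), c, a, b, logistic))
```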

2. Error Bounds

An inspection of the proof of (1.6) in [6] shows that the result can be improved if the activation function $\sigma$ is Hölder continuous. Moreover, rates can be obtained in $H^m(\Omega)$:

Theorem 2.1. Let $X_n$ be defined as in (1.4) with $P \subset \mathbb{R}^p$ bounded and $\sigma$ such that
$$ \|\sigma(\cdot; t) - \sigma(\cdot; s)\|_{H^m(\Omega)} \le c \, \|t - s\|^\kappa , \qquad \kappa \in (0, 1] ,\ c > 0 ,\ m \in \mathbb{N}_0 . \qquad (2.1) $$
Moreover, let $f \in H^m(\Omega)$ satisfy (1.5) with $h \in L^\infty(P)$. Then we obtain the rate
$$ \inf_{g \in X_n} \|f - g\|_{H^m(\Omega)} = O(n^{-\frac{1}{2} - \frac{\kappa}{p}}) . $$

Proof. Let $P_+ := \{ t \in P : h(t) \ge 0 \}$ (note that $P_+$ is unique up to a set of measure zero) and $\tilde{n} := [\frac{n}{2}]$. Since $P$ is bounded, it is possible to find bounded measurable sets $P_j$ such that
$$ P_+ = \bigcup_{j=1}^{\tilde{n}} P_j , \qquad P \setminus P_+ = \bigcup_{j=\tilde{n}+1}^{n} P_j , \qquad P_i \cap P_j = \emptyset ,\ i \ne j , $$
$$ \operatorname{diam}(P_j) = O(n^{-\frac{1}{p}}) , \qquad |P_j| = O(\tfrac{1}{n}) , \qquad i, j = 1, \dots, n . \qquad (2.2) $$
We now define coefficients
$$ c_j := \int_{P_j} h(t) \, dt $$
and probability measures
$$ \mu_j(t) := \begin{cases} c_j^{-1} h(t) , & t \in P_j , \\ 0 , & \text{otherwise} , \end{cases} \quad \text{for } c_j \ne 0 , \quad \text{and } \mu_j \text{ arbitrary for } c_j = 0 . $$
Furthermore, we consider the variables $t_j \in P$ as random variables distributed with probability distribution $\mu_j$. The expected value of $z(t_1, \dots, t_n)$ is defined as
$$ E[z] := \int_P \cdots \int_P z(t_1, \dots, t_n) \, \mu_1(t_1) \cdots \mu_n(t_n) \, dt_1 \dots dt_n . $$

With $c_j$ and $\mu_j$ as above and $f$ as in (1.5) we obtain that
$$ E\Big[\big\|f - \sum_{j=1}^{n} c_j \sigma(\cdot; t_j)\big\|_{H^m(\Omega)}^2\Big] = \|f\|_{H^m(\Omega)}^2 - 2\sum_{j=1}^{n} c_j \Big\langle f , \int_P \mu_j(t_j)\,\sigma(\cdot; t_j)\,dt_j \Big\rangle_{H^m(\Omega)} $$
$$ \qquad + \sum_{i \ne j} c_i c_j \Big\langle \int_P \mu_i(t_i)\,\sigma(\cdot; t_i)\,dt_i , \int_P \mu_j(t_j)\,\sigma(\cdot; t_j)\,dt_j \Big\rangle_{H^m(\Omega)} + \sum_{j=1}^{n} c_j^2 \int_P \mu_j(t_j)\,\|\sigma(\cdot; t_j)\|_{H^m(\Omega)}^2\,dt_j $$
$$ = \Big\|\int_P \Big[h(t) - \sum_{j=1}^{n} c_j \mu_j(t)\Big]\sigma(\cdot; t)\,dt\Big\|_{H^m(\Omega)}^2 + \sum_{j=1}^{n} c_j^2 \Big[\int_P \mu_j(t)\,\|\sigma(\cdot; t)\|_{H^m(\Omega)}^2\,dt - \Big\|\int_P \mu_j(t)\,\sigma(\cdot; t)\,dt\Big\|_{H^m(\Omega)}^2\Big] $$
$$ = \sum_{j=1}^{n} c_j^2 \sum_{|\alpha| \le m} \int_\Omega \Big[\int_P \mu_j(t)\big(\partial_x^\alpha \sigma(x; t)\big)^2\,dt - \Big(\int_P \mu_j(t)\,\partial_x^\alpha \sigma(x; t)\,dt\Big)^2\Big]dx $$
$$ = \sum_{j=1}^{n} c_j^2 \sum_{|\alpha| \le m} \int_\Omega \int_{P_j} \mu_j(t)\Big(\int_{P_j} \mu_j(s)\big(\partial_x^\alpha \sigma(x; t) - \partial_x^\alpha \sigma(x; s)\big)\,ds\Big)^2 dt\,dx \,; $$
here the first term in the second line vanishes, since $\sum_{j=1}^{n} c_j \mu_j = h$ almost everywhere.

Noting that $h \in L^\infty(P)$ and (2.2) imply that $c_j = O(\frac{1}{n})$, we now obtain together with (2.1) and (2.2) that
$$ E\Big[\big\|f - \sum_{j=1}^{n} c_j \sigma(\cdot; t_j)\big\|_{H^m(\Omega)}^2\Big] = \sum_{j=1}^{n} c_j^2 \sum_{|\alpha| \le m} \int_\Omega \int_{P_j} \mu_j(t) \int_{P_j} \mu_j(s) \big(\partial_x^\alpha \sigma(x; t) - \partial_x^\alpha \sigma(x; s)\big)^2 \, ds \, dt \, dx $$
$$ \le \sum_{j=1}^{n} c_j^2 \int_{P_j} \mu_j(t) \int_{P_j} \mu_j(s) \, \|\sigma(\cdot; t) - \sigma(\cdot; s)\|_{H^m(\Omega)}^2 \, ds \, dt = O(n \cdot n^{-2} \cdot n^{-\frac{2\kappa}{p}}) = O(n^{-1 - \frac{2\kappa}{p}}) . $$
Therefore, there exists a set of elements $t_j \in P$ such that
$$ \inf_{g \in X_n} \|f - g\|_{H^m(\Omega)} \le \big\|f - \sum_{j=1}^{n} c_j \sigma(\cdot; t_j)\big\|_{H^m(\Omega)} \le E\Big[\big\|f - \sum_{j=1}^{n} c_j \sigma(\cdot; t_j)\big\|_{H^m(\Omega)}^2\Big]^{1/2} = O(n^{-\frac{1}{2} - \frac{\kappa}{p}}) , $$
where $c_j$ is as above. $\square$
We think that the proposition above is also true if $h \in L^2(P)$. However, the choice of the subsets $P_j$ in (2.2) has to be more tricky, since $c_j = O(\frac{1}{n})$ will no longer hold in general.
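The probabilistic argument in the proof above can be illustrated numerically. The following sketch is our addition; all concrete choices ($\Omega = P = [0,1]$, $h \equiv 1$, and a Gaussian kernel for $\sigma$) are assumptions made for the demonstration. It samples the knots $t_j$ from the measures $\mu_j$ and prints the $L^2$ error, which decays roughly like $n^{-1/2 - \kappa/p} = n^{-3/2}$ here ($\kappa = 1$, $p = 1$, $m = 0$):

```python
import numpy as np

def kernel(x, t):
    # sigma(x; t), Lipschitz in t, so (2.1) holds with kappa = 1 and m = 0
    return np.exp(-(x - t) ** 2)

rng = np.random.default_rng(1)
xs = np.linspace(0.0, 1.0, 400)          # grid on Omega = [0, 1]
tq = (np.arange(4000) + 0.5) / 4000      # midpoint quadrature nodes on P = [0, 1]
f = kernel(xs[:, None], tq[None, :]).mean(axis=1)   # f(x) = int_P sigma(x; t) dt  (h == 1)

for n in (10, 40, 160, 640):
    edges = np.linspace(0.0, 1.0, n + 1)   # cells P_j with |P_j| = 1/n, hence c_j = 1/n
    tj = rng.uniform(edges[:-1], edges[1:])             # t_j ~ mu_j (uniform on P_j here)
    fn = kernel(xs[:, None], tj[None, :]).mean(axis=1)  # sum_j c_j sigma(.; t_j)
    err = np.sqrt(np.mean((f - fn) ** 2))               # approximate L^2(Omega) error
    print(n, err)   # observed decay is roughly n^(-3/2)
```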
We will now turn to other estimates in spaces $W^{m,r}(\Omega)$. The error bounds will depend on the dimension $p$ of $P \subset \mathbb{R}^p$. The proofs are based on the following results from finite-element theory:

Let
$$ P := \prod_{i=1}^{p} [p_i^-, p_i^+] \quad \text{and} \quad P_{l_1 \dots l_p} := \prod_{i=1}^{p} \Big[ p_i^- + \frac{p_i^+ - p_i^-}{\nu} \, l_i ,\ p_i^- + \frac{p_i^+ - p_i^-}{\nu} \, (l_i + 1) \Big] , \qquad \nu \in \mathbb{N} . $$
Then, obviously,
$$ P = \bigcup_{\substack{l_i = 0, \dots, \nu - 1 \\ i = 1, \dots, p}} P_{l_1 \dots l_p} . $$
Moreover, we define
$$ t_{j_1 \dots j_p} := (t_{j_1 \dots j_p, 1}, \dots, t_{j_1 \dots j_p, p}) \in \mathbb{R}^p , \qquad t_{j_1 \dots j_p, i} := p_i^- + \frac{p_i^+ - p_i^-}{k\nu} \, j_i , \qquad j_i = 0, \dots, k\nu . \qquad (2.3) $$


Then for all $k l_i \le j_i \le k(l_i + 1)$ there exists a unique polynomial function
$$ q_{j_1 \dots j_p} \in Q_{k, l_1 \dots l_p} := \Big\{ q(t) = \sum_{0 \le \hat{j}_i \le k} c_{\hat{j}_1 \dots \hat{j}_p} \, t_1^{\hat{j}_1} \cdots t_p^{\hat{j}_p} : t = (t_1, \dots, t_p) \in P_{l_1 \dots l_p} \Big\} \qquad (2.4) $$
satisfying
$$ q_{j_1 \dots j_p}(t_{\hat{j}_1 \dots \hat{j}_p}) = \prod_{i=1}^{p} \delta_{j_i \hat{j}_i} , \qquad k l_i \le j_i ,\, \hat{j}_i \le k(l_i + 1) . \qquad (2.5) $$
The function $u_I$, defined by
$$ u_I \big|_{P_{l_1 \dots l_p}} := \sum_{k l_i \le j_i \le k(l_i + 1)} u(t_{j_1 \dots j_p}) \, q_{j_1 \dots j_p} , \qquad (2.6) $$
interpolates $u \in C(\bar{P})$ at the knots $t_{j_1 \dots j_p}$, $0 \le j_i \le k\nu$, $1 \le i \le p$. Note that $u_I \in C(\bar{P}) \cap H^1(P)$.

Proposition 2.2. Let $P \subset \mathbb{R}^p$ be rectangular. If $u \in H^k(P)$ with $k > \frac{p}{2}$, then there is a constant $c > 0$ such that for all multiindices $\alpha$ with $|\alpha| = \mu < k$ and for all $l_i \in \{0, \dots, \nu - 1\}$, $i = 1, \dots, p$, it holds that
$$ \|D^\alpha (u - u_I)\|_{L^2(P_{l_1 \dots l_p})} \le c \, \nu^{-(k - \mu)} \, |u|_{H^k(P_{l_1 \dots l_p})} . \qquad (2.7) $$
If $u \in C^k(\bar{P})$, then there is a constant $c > 0$ such that for all multiindices $\alpha$ with $|\alpha| = \mu < k$ and for all $l_i \in \{0, \dots, \nu - 1\}$, $i = 1, \dots, p$, it holds that
$$ \|D^\alpha (u - u_I)\|_{L^\infty(P_{l_1 \dots l_p})} \le c \, \nu^{-(k - \mu)} \max_{|\hat{\alpha}| = k} \|D^{\hat{\alpha}} u\|_{L^\infty(P_{l_1 \dots l_p})} . \qquad (2.8) $$

Proof. The proof follows with Theorem 3.1 and Theorem 3.3 in [8]. $\square$
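As a numerical illustration (ours, not the paper's), the cell-wise Lagrange interpolant (2.6) is easy to build in $p = 1$ dimension, and the sup-norm error then shrinks with $\nu$ at least at the rate guaranteed by (2.8) with $\mu = 0$. The concrete choices below ($P = [0,1]$, $k = 2$, $u = \sin$) are assumptions for the demo:

```python
import numpy as np

def interpolate_cellwise(u, k, nu, t):
    """u_I of (2.6) on P = [0, 1] for p = 1: nu cells, degree-k Lagrange
    interpolation at k + 1 equispaced knots per cell (cf. (2.3)-(2.5))."""
    uI = np.zeros_like(t)
    for l in range(nu):                          # cell P_l = [l/nu, (l+1)/nu]
        lo, hi = l / nu, (l + 1) / nu
        knots = np.linspace(lo, hi, k + 1)       # knots t_j inside this cell
        mask = (t >= lo) & (t <= hi)
        uI[mask] = 0.0
        for j, tj in enumerate(knots):           # Lagrange basis q_j of (2.4)/(2.5)
            others = np.delete(knots, j)
            qj = np.prod((t[mask, None] - others) / (tj - others), axis=1)
            uI[mask] += u(tj) * qj
    return uI

t = np.linspace(0.0, 1.0, 2001)
for nu in (2, 4, 8, 16):
    err = np.max(np.abs(np.sin(t) - interpolate_cellwise(np.sin, 2, nu, t)))
    print(nu, err)   # decays at least like nu^(-k), consistent with (2.8)
```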

For our main result we need the following types of smoothness of $\sigma$: $\sigma \in W^{m,r}(\Omega; Y)$ with $Y = H^k(P)$ or $Y = C^k(\bar{P})$ and norms
$$ \|\sigma\|_{W^{m,r}(\Omega; Y)} := \begin{cases} \Big( \displaystyle\sum_{|\alpha| \le m} \int_\Omega \|\partial_x^\alpha \sigma(x, \cdot)\|_Y^r \, dx \Big)^{1/r} , & \text{if } 1 \le r < \infty , \\ \displaystyle\max_{|\alpha| \le m} \operatorname*{ess\,sup}_{x \in \Omega} \|\partial_x^\alpha \sigma(x, \cdot)\|_Y , & \text{if } r = \infty . \end{cases} $$

Theorem 2.3. Let $X_n$ be defined as in (1.4) with $P \subset \mathbb{R}^p$ bounded and rectangular and let $\sigma \in W^{m,r}(\Omega; Y)$ with $Y = H^k(P)$, $k > \frac{p}{2}$, or $Y = C^k(\bar{P})$. Moreover, let $f \in W^{m,r}(\Omega)$ satisfy (1.5) with $h \in L^2(P)$ if $Y = H^k(P)$ and $h \in L^1(P)$ if $Y = C^k(\bar{P})$. Then we obtain the rate
$$ \inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O(n^{-\frac{k}{p}}) . $$

Proof. If we choose $c_j$ as
$$ c_j := \int_P h(t) \, \varphi_j(t) \, dt , \qquad \varphi_j \in L^\infty(P) , $$
with $h$ as in (1.5), then we obtain that
$$ \big\|f - \sum_{j=1}^{n} c_j \sigma(\cdot; t_j)\big\|_{W^{m,r}(\Omega)} = \Big\|\int_P h(t)\Big(\sigma(\cdot; t) - \sum_{j=1}^{n} \varphi_j(t)\,\sigma(\cdot; t_j)\Big)dt\Big\|_{W^{m,r}(\Omega)} . $$
Let us define $\nu := [n^{1/p}/k] - 1$ and $\tilde{n} := (k\nu + 1)^p \le n$. Then we choose $t_j$ and $\varphi_j$ as follows: For $j = \tilde{n} + 1, \dots, n$ let $t_j$ be arbitrary and $\varphi_j \equiv 0$. For $j = 1, \dots, \tilde{n}$ let $t_j$ and $\varphi_j$ be the appropriate knots and basis functions such that the sum above equals the interpolating function $\sigma_I(\cdot; t)$ (see (2.3)-(2.6)), i.e.,
$$ \big\|f - \sum_{j=1}^{n} c_j \sigma(\cdot; t_j)\big\|_{W^{m,r}(\Omega)} = \Big\|\int_P h(t)\big(\sigma(\cdot; t) - \sigma_I(\cdot; t)\big)\,dt\Big\|_{W^{m,r}(\Omega)} . $$
Note that this interpolating property also holds for all derivatives of $\sigma$ with respect to $x$. Applying (2.7) ($\mu = 0$) for $Y = H^k(P)$ and (2.8) ($\mu = 0$) for $Y = C^k(\bar{P})$ we obtain the estimates
$$ \big\|f - \sum_{j=1}^{n} c_j \sigma(\cdot; t_j)\big\|_{W^{m,r}(\Omega)} \le c' \, \nu^{-k} \, \|h\|_{L^2(P)} \, \|\sigma\|_{W^{m,r}(\Omega; H^k(P))} \qquad (2.9) $$
and
$$ \big\|f - \sum_{j=1}^{n} c_j \sigma(\cdot; t_j)\big\|_{W^{m,r}(\Omega)} \le c' \, \nu^{-k} \, \|h\|_{L^1(P)} \, \|\sigma\|_{W^{m,r}(\Omega; C^k(\bar{P}))} , \qquad (2.10) $$
respectively. Now the assertion follows together with the fact that $\nu \sim n^{1/p}$. $\square$

Remark 2.4. The idea of choosing $c_j$, $t_j$, and $\varphi_j$ as in the proof above was found in a paper by Wahba [9] for one-dimensional $P$. This idea was extended here to higher dimensions, i.e., $P \subset \mathbb{R}^p$.
The following extensions of Theorem 2.3 are obvious from the proof:

• If $P$ is not rectangular but $\operatorname{supp}(h) \subset \tilde{P} \subset P$ with $\tilde{P}$ rectangular, then the results are still valid.

• If $Y = C^k(\bar{P})$, the condition (1.5) for $f$ with $h \in L^1(P)$ may be replaced by: $f$ is such that there exists a uniformly bounded sequence $h_l$ in $L^1(P)$ with
$$ \Big\|f - \int_P h_l(t) \, \sigma(\cdot; t) \, dt\Big\|_{W^{m,r}(\Omega)} \to 0 \quad \text{as } l \to \infty . $$

• Condition (1.5) may be generalized to
$$ f(x) = \sum_{|\alpha| \le \mu} \int_P h_\alpha(t) \, \frac{\partial^{|\alpha|}}{\partial t^\alpha} \sigma(x; t) \, dt , \qquad \mu < k . \qquad (2.11) $$
If the functions $\varphi_j$ are chosen such that for each $\alpha$ they coincide with the appropriate derivative of the basis functions $q_{j_1 \dots j_p}$ in $P_{l_1 \dots l_p}$, we obtain together with Proposition 2.2 the rates
$$ \inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O(n^{-\frac{k - \mu}{p}}) . $$

Finally, we want to mention that the rates above and in Theorem 2.3 decrease with increasing dimension $p$. There is no dimensionless term like $n^{-\frac{1}{2}}$ in (1.6) or Theorem 2.1. Since the estimates in the proof of Theorem 2.3 are based on a fixed choice of knots $t_j$, this dependence on $p$ is to be expected. We were not able to improve the rates for a possible optimal choice of knots. However, since Proposition 2.2 is also valid for many other non-uniform choices of knots $t_j$, the rates in Theorem 2.3 are valid for many choices of $t_j$ (also non-optimal ones) if at least $c_j$ is chosen optimally.
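To make the coefficient choice in the proof of Theorem 2.3 concrete, here is a small sketch (our addition; the Gaussian kernel, $P = [0,1]$, $h \equiv 1$, and $k = 1$ hat functions are illustrative assumptions). It computes $c_j = \int_P h(t)\,\varphi_j(t)\,dt$ for fixed uniform knots and shows the resulting error decay:

```python
import numpy as np

def kernel(x, t):
    return np.exp(-(x - t) ** 2)   # sigma(x; t): analytic in t, so Y = C^k(P) for any k

xs = np.linspace(0.0, 1.0, 400)
tq = (np.arange(4000) + 0.5) / 4000              # quadrature nodes on P = [0, 1]
h = np.ones_like(tq)                             # illustrative h in (1.5)
f = np.mean(h * kernel(xs[:, None], tq[None, :]), axis=1)   # f = int_P h(t) sigma(.; t) dt

for n in (5, 10, 20, 40):
    tj = np.linspace(0.0, 1.0, n)                # fixed uniform knots (k = 1)
    # hat functions phi_j and coefficients c_j = int_P h(t) phi_j(t) dt:
    phi = np.maximum(0.0, 1.0 - np.abs(tq[None, :] - tj[:, None]) * (n - 1))
    c = np.mean(h * phi, axis=1)
    fn = kernel(xs[:, None], tj[None, :]) @ c    # sum_j c_j sigma(.; t_j)
    print(n, np.max(np.abs(f - fn)))             # decays at least like n^(-k/p) = n^(-1)
```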

3. Application to perceptrons

We now apply the results of the previous section to perceptrons with a single hidden layer, namely ridge constructions (cf. (1.2)) where $\sigma$ is a function of sigmoidal form, i.e.,
$$ X_n = \Big\{ g = \sum_{j=1}^{n} c_j \, \sigma(a_j^T x + b_j) : a_j \in A \subset \mathbb{R}^d ,\ b_j \in B \subset \mathbb{R} \Big\} $$
and $\sigma$ is piecewise continuous, monotonically increasing, and such that
$$ \lim_{t \to -\infty} \sigma(t) = 0 \qquad \text{and} \qquad \lim_{t \to +\infty} \sigma(t) = 1 . $$
If $\sigma$ is such that
$$ \sigma(t) := \begin{cases} 1 , & t > 1 , \\ p(t) , & -1 \le t \le 1 , \\ 0 , & t < -1 , \end{cases} \qquad (3.1) $$
with $p$ the unique polynomial of degree $2k + 1$ satisfying
$$ p(-1) = 0 , \quad p(1) = 1 , \quad \text{and} \quad p^{(l)}(-1) = 0 = p^{(l)}(1) , \quad 1 \le l \le k , \qquad (3.2) $$
then $\sigma \in C^{k,1}$ and $\sigma \in W^{k+1,\infty}$ (see Figure 3.1).

[Figure 3.1: Function $\sigma$ from (3.1) and (3.2) for $k = 0, 1, 2, 3$.]
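The polynomial $p$ of (3.2) is a standard Hermite interpolation problem and can be computed by solving a small linear system. The following sketch (our addition, for illustration only) constructs $p$ and the resulting sigmoid $\sigma$ of (3.1) for arbitrary $k$:

```python
import numpy as np
from numpy.polynomial import polynomial as P

def transition_polynomial(k):
    """Coefficients (ascending) of the unique degree-(2k+1) polynomial p of (3.2):
    p(-1) = 0, p(1) = 1, and p^(l)(-1) = p^(l)(1) = 0 for 1 <= l <= k."""
    m = 2 * k + 2
    A, rhs = np.zeros((m, m)), np.zeros(m)
    row = 0
    for s, val in ((-1.0, 0.0), (1.0, 1.0)):
        for l in range(k + 1):
            for j in range(l, m):    # d^l/dt^l of t^j at t = s is j!/(j-l)! * s^(j-l)
                A[row, j] = np.prod(np.arange(j - l + 1, j + 1)) * s ** (j - l)
            rhs[row] = val if l == 0 else 0.0
            row += 1
    return np.linalg.solve(A, rhs)

def sigma(t, k):
    """The sigmoid of (3.1): 0 for t < -1, p(t) on [-1, 1], 1 for t > 1."""
    p = transition_polynomial(k)
    return np.where(t > 1.0, 1.0, np.where(t < -1.0, 0.0, P.polyval(t, p)))

print(sigma(np.linspace(-1.5, 1.5, 7), 3))   # k = 3, one of the curves in Figure 3.1
```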

Example 3.1. Let us consider the special case of $k = 0$, i.e.,
$$ \sigma(t) := \begin{cases} 1 , & t > 1 , \\ \frac{t + 1}{2} , & -1 \le t \le 1 , \\ 0 , & t < -1 , \end{cases} \qquad (3.3) $$
and let $A := \prod_{i=1}^{d} [-a_i, a_i]$ and $B := [-b, b]$ with $a_i > 0$ and $b > 0$ such that
$$ \forall a \in A \ \forall x \in \Omega : \quad |a^T x| \le b - 1 . $$
Since $\sigma(x; a, b) := \sigma(a^T x + b)$ satisfies (2.1) with $m = 0$ and $\kappa = 1$, Theorem 2.1 implies that
$$ \inf_{g \in X_n} \|f - g\|_{L^2(\Omega)} = O(n^{-\frac{1}{2} - \frac{1}{d + 1}}) $$
if
$$ f(x) = \int_A \int_{-b}^{b} h(a, b) \, \sigma(a^T x + b) \, db \, da = \int_A \Big[ \int_{-1 - a^T x}^{1 - a^T x} h(a, b) \, \frac{1 + a^T x + b}{2} \, db + \int_{1 - a^T x}^{b} h(a, b) \, db \Big] da \qquad (3.4) $$
for some $h \in L^\infty(A \times B)$.

Example 3.2. We consider now the general case, where $\sigma$ is defined by (3.1), (3.2), and where $A$ and $B$ are as in Example 3.1.

Since $\sigma(x; a, b) := \sigma(a^T x + b)$ satisfies $\sigma \in W^{m,\infty}(\Omega; C^{k-m}(A \times B))$ ($m \le k$) and $\sigma \in W^{m,\infty}(\Omega; H^{k+1-m}(A \times B))$ ($m \le k + 1$), we may apply Theorem 2.3 to obtain
$$ \inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O(n^{-\frac{k - m}{d + 1}}) $$
if $f \in W^{m,r}(\Omega)$ satisfies
$$ f(x) = \int_A \Big[ \int_{-1 - a^T x}^{1 - a^T x} h(a, b) \, p(a^T x + b) \, db + \int_{1 - a^T x}^{b} h(a, b) \, db \Big] da \qquad (3.5) $$
for some $h \in L^1(A \times B)$, and
$$ \inf_{g \in X_n} \|f - g\|_{W^{m,r}(\Omega)} = O(n^{-\frac{k + 1 - m}{d + 1}}) $$
if $f \in W^{m,r}(\Omega)$ satisfies (3.5) for some $h \in L^2(A \times B)$ and $k + 1 - m > \frac{d + 1}{2}$. Note that for $m = 0$ and $k > \frac{d + 1}{2}$ the rate above is better than the one in Example 3.1.
From both examples, we can see that the conditions (3.4) and (3.5) can only be satisfied if $f$ is several times differentiable. We will now give a sufficient condition on $f$ that guarantees (3.4):

Let $\varepsilon_0 := 0$ and $\varepsilon_n := \frac{\pi}{2}(4 n^j - 3)$, $n \in \mathbb{N}$, for some $j \in \mathbb{N}$ to be specified later, and let $\delta_n := \varepsilon_n / \varepsilon_{n+1}$. We define the function $h$ as follows:
$$ h(a, b) = \sum_{n=1}^{\infty} \big( \alpha_n(a) \cos(b \varepsilon_n) + \beta_n(a) \sin(b \varepsilon_n) \big) , \qquad (3.6) $$
where
$$ \alpha_n(a) := \begin{cases} -(2\pi)^{-\frac{d}{2}} \, \varepsilon_n^3 \, \Im \hat{f}(a \varepsilon_n) , & \text{if } a \in A \setminus \delta_{n-1} A , \\ 0 , & \text{else} , \end{cases} \qquad (3.7) $$
$$ \beta_n(a) := \begin{cases} (2\pi)^{-\frac{d}{2}} \, \varepsilon_n^3 \, \Re \hat{f}(a \varepsilon_n) , & \text{if } a \in A \setminus \delta_{n-1} A , \\ 0 , & \text{else} . \end{cases} $$
Note that, due to the definition of $\alpha_n$ and $\beta_n$, the sum in (3.6) will be almost always finite. $\Im$ and $\Re$ denote the imaginary and real part, respectively. With $\hat{f}$ we denote the Fourier transform of any function $\tilde{f}$ satisfying $\tilde{f} = f$ in $\Omega$.

Lemma 3.3. Let $f$ be such that $(1 + |\cdot|^{3 + \gamma - \frac{1}{p}}) \hat{f}(\cdot) \in L^p(\mathbb{R}^d)$, where $\hat{f}$ is as above and $\gamma = 0$ for $p = 1$ and $\gamma > 0$ for $1 < p \le \infty$, and let $A$ and $B$ be as in Example 3.1. Then it holds for $h$ defined by (3.6) and (3.7) with $j \in \mathbb{N}$ sufficiently large (see the definition of $\varepsilon_n$ above) that
$$ h \in L^p(A \times B) . $$
Proof. Let $p < \infty$. Then we obtain with (3.6), (3.7), and $\varepsilon_n$, $\delta_k$ defined as above that
$$ \int_A \int_{-b}^{b} |h(a, b)|^p \, db \, da = \sum_{k=1}^{\infty} \int_{\delta_k A \setminus \delta_{k-1} A} \int_{-b}^{b} \Big| \sum_{n=1}^{k} \big( \alpha_n(a) \cos(b \varepsilon_n) + \beta_n(a) \sin(b \varepsilon_n) \big) \Big|^p \, db \, da $$
$$ = O\Big( \sum_{k=1}^{\infty} \int_{\delta_k A \setminus \delta_{k-1} A} \Big( \sum_{n=1}^{k} \big( |\alpha_n(a)| + |\beta_n(a)| \big) \Big)^p \, da \Big) = O\Big( \sum_{k=1}^{\infty} \int_{\delta_k A \setminus \delta_{k-1} A} \Big( \sum_{n=1}^{k} \varepsilon_n^3 \, |\hat{f}(a \varepsilon_n)| \Big)^p \, da \Big) . $$
This together with the estimate
$$ \Big( \sum_{n=1}^{k} \varepsilon_n^3 \, |\hat{f}(a \varepsilon_n)| \Big)^p \le \Big( \sum_{n=1}^{\infty} \varepsilon_n^{(3 + \gamma) p} \, |\hat{f}(a \varepsilon_n)|^p \Big) \Big( \sum_{n=1}^{\infty} \varepsilon_n^{-\frac{\gamma p}{p - 1}} \Big)^{p - 1} $$
and the fact that
$$ \sum_{n=1}^{\infty} \varepsilon_n^{-\frac{\gamma p}{p - 1}} < \infty $$
if $\gamma > 0$, $p > 1$, and $j > \frac{p - 1}{\gamma p}$, implies that
$$ \int_A \int_{-b}^{b} |h(a, b)|^p \, db \, da = O\Big( \sum_{n=1}^{\infty} \int_{A \setminus \delta_{n-1} A} \varepsilon_n^{(3 + \gamma) p} \, |\hat{f}(a \varepsilon_n)|^p \, da \Big) = O\Big( \sum_{n=1}^{\infty} \int_{\varepsilon_n A \setminus \varepsilon_{n-1} A} \varepsilon_n^{(3 + \gamma) p - 1} \, |\hat{f}(z)|^p \, dz \Big) $$
if $j$ is sufficiently large and $\gamma = 0$ for $p = 1$ and $\gamma > 0$ for $p > 1$, which we assume to hold in the following. Since
$$ \exists \, C > 0 \ \forall \, z \in \varepsilon_n A \setminus \varepsilon_{n-1} A : \quad \varepsilon_n^{(3 + \gamma) p - 1} \le C \big( 1 + |z|^{3 + \gamma - \frac{1}{p}} \big)^p , $$
we finally obtain that
$$ \int_A \int_{-b}^{b} |h(a, b)|^p \, db \, da = O\Big( \sum_{n=1}^{\infty} \int_{\varepsilon_n A \setminus \varepsilon_{n-1} A} \big( 1 + |z|^{3 + \gamma - \frac{1}{p}} \big)^p \, |\hat{f}(z)|^p \, dz \Big) = O\Big( \int_{\mathbb{R}^d} \big( 1 + |z|^{3 + \gamma - \frac{1}{p}} \big)^p \, |\hat{f}(z)|^p \, dz \Big) . $$
This proves the assertion for $p < \infty$.


Let us now consider the case $p = \infty$: We assume that $\gamma > 0$ and that $j > \frac{1}{\gamma}$. Then we obtain for all $a \in \delta_k A \setminus \delta_{k-1} A$ that
$$ |h(a, b)| \le \sum_{n=1}^{k} \big( |\alpha_n(a)| + |\beta_n(a)| \big) = O\Big( \sum_{n=1}^{k} \varepsilon_n^3 \, |\hat{f}(a \varepsilon_n)| \Big) $$
$$ = O\Big( \sum_{n=1}^{k} \big( 1 + (|a| \varepsilon_n)^{3 + \gamma} \big) \varepsilon_n^{-\gamma} \, |\hat{f}(a \varepsilon_n)| \Big) = O\big( \| (1 + |\cdot|^{3 + \gamma}) \hat{f}(\cdot) \|_{L^\infty(\mathbb{R}^d)} \big) . $$
This proves the assertion for $p = \infty$. $\square$

Proposition 3.4. Let $f$, $A$, and $B$ satisfy the conditions in Lemma 3.3. Moreover, let $f$ be such that $(1 + |\cdot|) \hat{f}(\cdot) \in L^1(\mathbb{R}^d)$. Then $f$ has an integral representation (3.4) for some $h \in L^p(A \times B)$.

Proof. With the special choice of $h$ as in (3.6) and (3.7) we know from Lemma 3.3 that $h \in L^p(A \times B)$. We will now show that
$$ g(x) := \int_A \Big[ \int_{-1 - a^T x}^{1 - a^T x} h(a, b) \, \frac{1 + a^T x + b}{2} \, db + \int_{1 - a^T x}^{b} h(a, b) \, db \Big] da $$
$$ = \sum_{k=1}^{\infty} \int_{\delta_k A \setminus \delta_{k-1} A} \sum_{n=1}^{k} \bigg[ \alpha_n(a) \Big( \int_{-1 - a^T x}^{1 - a^T x} \cos(b \varepsilon_n) \, \frac{1 + a^T x + b}{2} \, db + \int_{1 - a^T x}^{b} \cos(b \varepsilon_n) \, db \Big) $$
$$ \qquad + \beta_n(a) \Big( \int_{-1 - a^T x}^{1 - a^T x} \sin(b \varepsilon_n) \, \frac{1 + a^T x + b}{2} \, db + \int_{1 - a^T x}^{b} \sin(b \varepsilon_n) \, db \Big) \bigg] da $$
is identical to $f$ up to a constant. The integrals with respect to $b$ may be calculated analytically. Together with $\sin(\varepsilon_n) = 1$ this yields that
$$ g(x) = \sum_{k=1}^{\infty} \int_{\delta_k A \setminus \delta_{k-1} A} \sum_{n=1}^{k} \Big[ \alpha_n(a) \big( \varepsilon_n^{-1} \sin(b \varepsilon_n) + \varepsilon_n^{-2} \sin(a^T x \, \varepsilon_n) \big) + \beta_n(a) \big( -\varepsilon_n^{-1} \cos(b \varepsilon_n) + \varepsilon_n^{-2} \cos(a^T x \, \varepsilon_n) \big) \Big] da $$
$$ = (2\pi)^{-\frac{d}{2}} \sum_{n=1}^{\infty} \int_{\varepsilon_n A \setminus \varepsilon_{n-1} A} \big( \Re \hat{f}(z) \cos(z^T x) - \Im \hat{f}(z) \sin(z^T x) \big) \, dz - (2\pi)^{-\frac{d}{2}} \sum_{n=1}^{\infty} \int_{\varepsilon_n A \setminus \varepsilon_{n-1} A} \varepsilon_n \big( \Re \hat{f}(z) \cos(b \varepsilon_n) + \Im \hat{f}(z) \sin(b \varepsilon_n) \big) \, dz . $$
The second term above is a constant, since $(1 + |\cdot|) \hat{f}(\cdot) \in L^1(\mathbb{R}^d)$. (The proof is similar to the one in Lemma 3.3.) We denote this constant by $C$ in the following. Hence, we obtain that
$$ g(x) = (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} \big( \Re \hat{f}(z) \cos(z^T x) - \Im \hat{f}(z) \sin(z^T x) \big) \, dz + C = (2\pi)^{-\frac{d}{2}} \int_{\mathbb{R}^d} \hat{f}(z) \, e^{i z^T x} \, dz + C = f(x) + C . $$
It remains to be shown that the constant function satisfies (3.4) for some $h \in L^\infty(A \times B)$. Let $h(a, b) := \frac{C}{b |A|}$. Then we obtain that
$$ \int_A \Big[ \int_{-1 - a^T x}^{1 - a^T x} h(a, b) \, \frac{1 + a^T x + b}{2} \, db + \int_{1 - a^T x}^{b} h(a, b) \, db \Big] da = \frac{C}{b |A|} \int_A (b + a^T x) \, da = C , $$
where we used the fact that
$$ \int_A a^T x \, da = 0 $$
for the special choice of $A$ (see Example 3.1). $\square$
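The analytic $b$-integration used above is easy to spot-check numerically. The following sketch (our addition; the values of $j$, $n$, $u = a^T x$, and $b$ are arbitrary illustrative choices) verifies the cosine part of the identity on a quadrature grid:

```python
import numpy as np

# Spot-check of the analytic b-integrals behind g(x): with eps = eps_n
# (so sin(eps) = 1 and cos(eps) = 0), the cosine part should equal
# eps^-1 sin(b*eps) + eps^-2 sin(u*eps), where u = a^T x.
j, n = 3, 2
eps = 0.5 * np.pi * (4 * n ** j - 3)             # eps_n from the definition above
u, b = 0.3, 5.0                                  # illustrative u = a^T x and |u| <= b - 1
s = np.linspace(-1.0 - u, 1.0 - u, 200001)       # quadrature grid for the first integral
ds = s[1] - s[0]
y = np.cos(s * eps) * (1.0 + u + s) / 2.0
first = (y[0] / 2 + y[1:-1].sum() + y[-1] / 2) * ds          # trapezoidal rule
second = (np.sin(b * eps) - np.sin((1.0 - u) * eps)) / eps   # elementary antiderivative
lhs = first + second
rhs = np.sin(b * eps) / eps + np.sin(u * eps) / eps ** 2
print(lhs - rhs)                                 # ~ 0 up to quadrature error
```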

Remark 3.5. For the case $p = 1$, the condition $(1 + |\cdot|) \hat{f}(\cdot) \in L^1(\mathbb{R}^d)$ in Proposition 3.4 is superfluous, since it is implied by the condition $(1 + |\cdot|^2) \hat{f}(\cdot) \in L^1(\mathbb{R}^d)$ in Lemma 3.3. This sufficient condition for (3.4) actually means that $f$ has a $C^2$-extension into the exterior of $\Omega$. On the other hand, it is easy to see that for condition (3.4) to hold it is necessary that $f$ is two times weakly differentiable.

For the case $p = 2$, the conditions in Proposition 3.4 mean that $f$ has a $C^1$-extension into the exterior of $\Omega$ and that $f$ may be extended to a function in $H^{\frac{5}{2} + \gamma}(\mathbb{R}^d)$ for some $\gamma > 0$.

For the general case of perceptrons ($k \in \mathbb{N}$) in Example 3.2, one can prove a result similar to Proposition 3.4 by constructing the function $h$ in Lemma 3.3 similarly to (3.6) and (3.7). The sufficient conditions for (3.5) to hold are:
$$ (1 + |\cdot|) \hat{f}(\cdot) \in L^1(\mathbb{R}^d) \qquad \text{and} \qquad (1 + |\cdot|^{3 + k + \gamma - \frac{1}{p}}) \hat{f}(\cdot) \in L^p(\mathbb{R}^d) . $$
It was shown in [1] that $(1 + |\cdot|) \hat{f}(\cdot) \in L^1(\mathbb{R}^d)$ is sufficient for the rate
$$ \inf_{g \in X_n} \|f - g\|_{L^2(\Omega)} = O(n^{-\frac{1}{2}}) $$
if $P = \mathbb{R}^{d+1}$. It is obvious that better rates can only be obtained under stronger conditions on $f$. Unfortunately, the rates in Theorem 2.3 are only better than $O(n^{-\frac{1}{2}})$ if $k$ is sufficiently large depending on the dimension $d$. On the other hand, the rates in Theorem 2.3 are also valid for non-optimally chosen $\{t_j\}$ (compare Remark 2.4).

References

[1] A. R. Barron, Universal approximation bounds for superpositions of a sigmoidal function, IEEE Trans. Inf. Theory 39 (1993), 930–945.
[2] C. M. Bishop, Neural Networks for Pattern Recognition, Clarendon Press, Oxford, 1995.
[3] M. Burger and H. W. Engl, Training neural networks with noisy data as an ill-posed problem, Adv. Comp. Math. (2000), to appear.
[4] H. W. Engl, M. Hanke, and A. Neubauer, Regularization of Inverse Problems, Kluwer, Dordrecht, 1996.
[5] K. Hornik, M. Stinchcombe, and H. White, Multilayer feedforward networks are universal approximators, Neural Networks 2 (1989), 359–366.
[6] P. Niyogi and F. Girosi, Generalization bounds for function approximation from scattered noisy data, Adv. Comp. Math. 10 (1999), 51–80.
[7] R. Schaback, Approximation by radial basis functions with finitely many centers, Constr. Approx. 12 (1996), 331–340.
[8] G. Strang and G. J. Fix, An Analysis of the Finite Element Method, Prentice-Hall, Englewood Cliffs, 1973.
[9] G. Wahba, Convergence rates of certain approximate solutions to Fredholm integral equations of the first kind, J. Approx. Theory 7 (1973), 167–185.
