Artificial Neural Networks: An Econometric Perspective

INTRODUCTION

Artificial neural networks are a class of models developed by cognitive scientists

interested in understanding how computation is performed by the brain. These networks are

capable of learning through a process of trial and error that can be appropriately viewed as statistical estimation of model parameters. Although inspired by certain aspects of the way information is processed in the brain, these network models and their associated learning paradigms are still far from anything close to a realistic description of how brains actually work. They nevertheless provide a rich, powerful and interesting modeling framework with proven and potential application across the sciences. To mention just a handful of such applications, artificial neural networks have been successfully used to translate printed English text into speech (Sejnowski and Rosenberg, 1986), to recognize hand-printed characters (Fukushima and Miyake, 1984), to perform complex coordination tasks (Selfridge, Sutton and Barto, 1985), to play backgammon (Tesauro, 1989), to diagnose chest pain (Baxt, 1991), and to decode deterministic chaos (Lapedes and Farber, 1987; White, 1989; Gallant and White, 1991). Successes in these and other areas suggest that artificial neural network models may

serve as a useful addition to the tool-kits of economists and econometricians. Areas with particular potential for application include time-series modeling and forecasting, nonparametric estimation, and learning by economic agents. The purpose of this article is two-fold: first, to review the basic concepts and theory required to make artificial neural networks accessible to economists and econometricians, with particular focus on econometrically relevant methodology; and second, to develop theory for a leading neural network learning paradigm to a point comparable to that of the modern theory of estimation and inference for misspecified nonlinear dynamic models (e.g., Gallant and White, 1988a; Pötscher and Prucha, 1991a,b). As we hope will become apparent from our development, not only do artificial neural networks have much to offer economics and econometrics, but there is also considerable


potential for economics and econometrics to benefit the neural network field, arising to a considerable degree from economic and econometric experience in modeling and estimating dynamic

systems. Thus, a larger goal of this article is to provide an entry point and appropriate background for those wishing to engage in the fascinating intellectual arbitrage required to fully realize the potential gains from trade between economics, econometrics and artificial neural networks.

NEURAL NETWORK MODELS

The simplest general artificial neural network (ANN) models draw primarily on three features of the way that biological neural networks process information: massive parallelism, nonlinear neural unit response to neural unit input, and processing by multiple layers of neural

units. Incorporation of a fourth feature, dynamic feedback among units, leads to even greater

generality and richness. In this section, we describe how these features are embodied in now standard approaches to ANN modeling, and some of the implications of these embodiments. Because of the very considerable breadth of ANN paradigms, we cannot do justice to the entire spectrum of such models; instead, we focus our attention on those most easily related to and with greatest relevance for econometrics. Although not usually thought of in such terms, parallelism is a familiar aspect of

econometric modeling. A schematic of a simple parallel processing network is shown in Figure

1. Here, input units ("sensors") send real-valued signals (x_i, i = 1, ..., r) in parallel over connections to subsequent units, designated "output units" for now. The signal from input unit i to output unit j may be attenuated or amplified by a factor γ_ji ∈ ℝ, so that signals x_i γ_ji reach output unit j, i = 1, ..., r. The factors γ_ji are known as "network weights" or "connection strengths."

In simple ANN models, the receiving units process parallel incoming signals in typically simple ways. The simplest is to add the signals seen by the receiver, in which case the output unit produces output

∑_{i=1}^{r} x_i γ_ji ,   j = 1, ..., v.

If, as is common, we permit an input, say x_0, to supply x_0 = 1 to the network (a "bias unit" in network jargon), output can be represented as

f_j(x, γ) = x'γ_j ,   j = 1, ..., v,

or

f(x, γ) = (I_v ⊗ x')γ ,

where f = (f_1, ..., f_v)', x = (1, x_1, ..., x_r)', γ = (γ'_1, ..., γ'_v)', and γ_j = (γ_j0, γ_j1, ..., γ_jr)'.

The "output function" f is easily recognized as that of a system of seemingly unrelated (linear) equations; in the neural network literature, an electronic version of this network was introduced as the MADALINE network of Widrow and Hoff (1960). With a single output unit (Widrow and Hoff's ADALINE), f is easily recognized as the simple linear model, the workhorse of empirical econometrics.
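The parallel linear computation is easy to make concrete. A minimal numerical sketch (function and variable names here are ours, purely illustrative) stacks the weight vectors γ_j as rows of a matrix:

```python
import numpy as np

# Each output unit j computes f_j(x, gamma) = x'gamma_j, where the augmented
# input x = (1, x_1, ..., x_r)' includes the bias unit x_0 = 1.
def linear_network(x, Gamma):
    """x: length-r input; Gamma: (v, r+1) matrix whose row j is gamma_j'."""
    x_aug = np.concatenate(([1.0], x))   # prepend the bias unit
    return Gamma @ x_aug                 # v outputs computed in parallel

x = np.array([2.0, -1.0])                # r = 2 input signals
Gamma = np.array([[0.5, 1.0, 2.0],       # v = 2 output units
                  [0.0, -1.0, 1.0]])
f = linear_network(x, Gamma)             # -> [0.5, -3.0]
```

With v > 1 this is exactly the seemingly unrelated regressions structure noted above; with v = 1 it collapses to the simple linear model.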

In biological neural systems, the number of processing units can range into the millions or billions and beyond (hence the term "massive" parallelism). While such numbers are not usually encountered in economic models, the essential feature of parallel processing is common to both. From the outset of their development, the behavior of artificial neural networks was formulated to include another stylized feature of biological systems. This is the tendency of certain types of neurons to be quiescent in the presence of modest levels of input activity, and to become active themselves only after input activity passes a particular threshold. Beyond this threshold, increases in input activity have little further effect. This introduces the fundamental feature of nonlinear response into the ANN paradigm.

For present purposes, it suffices to think either of neural units switching on or off, or to imagine a single dimension along which neural activity (e.g. neural firing rate) can smoothly vary from fully off to fully on. In their seminal article, McCulloch and Pitts (1943) considered


the first possibility, proposing networks with output unit activity given by

f_j(x, γ) = G(x'γ_j) ,   j = 1, ..., v,

where G is the "Heaviside" or unit step function, G(a) = 1 if a > 0 and G(a) = 0 otherwise. Output unit j thus turns on when x'γ_j > 0, i.e. when input activity ∑_{i=1}^{r} x_i γ_ji exceeds the threshold -γ_j0. For this reason the Heaviside function is said to implement a "threshold logic unit" (TLU). G is called the "activation function" of the (output) unit. Networks with TLU's are appropriate for classification and recognition tasks: the study of such networks exclusively preoccupied the ANN field through the 1950's and dominated the field through the 1960's. In retrospect, a major breakthrough in the ANN literature

occurred when it was proposed to replace the Heaviside activation function with a smooth sigmoid (s-shaped) function, in particular the logistic function, G(a) = 1/(1 + exp(-a)) (Cowan, 1967). Instead of switching abruptly, units now turn on gradually as input activity increases; further consequences of this choice from the ANN standpoint will be discussed in the next section. Here, however, we observe that with this G the output function

f_j(x, γ) = G(x'γ_j) = 1/(1 + exp(-x'γ_j))

is precisely the binary logit model of conditional probability (e.g. Amemiya, 1981; 1985, p. 268). Other choices for G yield other models appropriate for classification or qualitative response modeling; for example, if G is the normal cumulative distribution function, we have the binary probit model, etc. As Amemiya (1981) documents in his

classic survey, such models have great utility in econometric applications where binary

classifications or decisions are involved. Although biological networks with direct connections from input to output units are well-known (e.g., the knee-jerk reflex is mediated by direct connections from sensory receptors in the knee onto motoneurons in the spinal cord that then activate leg muscles), it is much more common to observe processing occurring in multiple layers of units. For example, six distinct processing layers are at work in the human cortex. Such multilayered structures were introduced into the ANN literature by Rosenblatt (1957, 1958) and by Gamba and his associates (Palmieri


and Sanna, 1960; Gamba, et al., 1961). Figure 2 shows a schematic diagram of a network containing a single intermediate layer of processing units separating input from output. Intermediate layers of this sort are often called "hidden" layers to distinguish them from the input and output layers. Processing in such networks is straightforward. Units in one layer treat the units in the preceding layer as input, and produce outputs to be processed by the succeeding layer. The output function for such a network with a single hidden layer (as in Figure 2) is thus of the form

f_h(x, θ) = F(β_h0 + ∑_{j=1}^{q} G(x'γ_j) β_hj) ,   h = 1, ..., v,       (1.1.1)

where F is the activation function of the output units, G the activation function of the hidden units, and β_hj is the connection strength from hidden unit j (j = 0 indexes a bias unit) to output unit h. The vector θ = (β'_1, ..., β'_v, γ'_1, ..., γ'_q)' (with β_h = (β_h0, ..., β_hq)') collects together all network weights.

As originally introduced, the hidden layer network activation functions F and G implemented TLU's. However, modern practice permits F and G to be chosen quite freely. Leading choices are F(a) = G(a) = 1/(1 + exp(-a)) (the logistic) or F(a) = a (the identity) and G(a) = 1/(1 + exp(-a)). Because of its notational simplicity and considerable generality, we emphasize the latter choice with a single output unit, so that network output is given by

f(x, θ) = β_0 + ∑_{j=1}^{q} G(x'γ_j) β_j .       (1.1.2)

Although we have seen econometrically familiar models emerge in our foregoing discussion of ANN models (e.g. seemingly unrelated regression systems and logit models), equation (1.1.2) is not so familiar. It does bear a strong resemblance to the projection pursuit models of modern statistics (Friedman and Stuetzle, 1981; Huber, 1985) in which output response is given by

∑_{j=1}^{q} G_j(x'γ_j) .

However, in projection pursuit models the functions G_j are unknown and must be estimated from data (permitting β_j to be absorbed into G_j), whereas in the hidden layer network model (1.1.2), G is given. The hidden layer network model is thus somewhat simpler than the projection pursuit model.
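A sketch of the single hidden layer output function (1.1.2) with the logistic choice of G; all names and the sample numbers are our illustrative assumptions:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

# Eq. (1.1.2): f(x, theta) = beta_0 + sum_{j=1}^q G(x'gamma_j) beta_j,
# with x = (1, x_1, ..., x_r)' and Gamma holding the gamma_j as rows.
def hidden_layer_net(x, beta0, beta, Gamma, G=logistic):
    x_aug = np.concatenate(([1.0], x))
    hidden = G(Gamma @ x_aug)        # q hidden unit activations in (0, 1)
    return beta0 + hidden @ beta

rng = np.random.default_rng(0)
x = np.array([0.3, -0.7, 1.2])       # r = 3 inputs
Gamma = rng.standard_normal((4, 4))  # q = 4 hidden units, each with r+1 weights
beta = rng.standard_normal(4)
y = hidden_layer_net(x, 0.1, beta, Gamma)
```

Unlike projection pursuit, G is fixed in advance; only beta0, beta and Gamma are free parameters.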

A variant of the single hidden layer network that is particularly relevant for

econometric applications is depicted in Figure 3. This network has direct connections from the input to the output layer as well as a single hidden layer. Output for this network can be expressed as

f(x, θ) = F(x'α + β_0 + ∑_{j=1}^{q} G(x'γ_j) β_j) ,       (1.1.3)

where α is the r × 1 vector of direct input-output connection weights (applied to the inputs (x_1, ..., x_r)'), and θ is now taken to be θ = (α', β_0, ..., β_q, γ'_1, ..., γ'_q)'. By suitable choice of F and θ we nest familiar models as special cases. In particular, with F(a) = a (the identity) we have a standard linear model augmented by nonlinear terms. Given the popularity of linear models in econometrics, this form is particularly appealing, as it suggests that ANN models can be viewed as extensions of, rather than as alternatives to, the familiar models. The hidden unit activations can then be viewed as latent variables whose inclusion enriches the linear model. We shall refer to an ANN model with output of the form (1.1.3) as an "augmented" single hidden layer network. Such networks will figure prominently in what follows.
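A sketch of the augmented network (1.1.3), assuming the identity choice for F so that the output is a linear model plus hidden-unit ("latent variable") terms; names and numbers are illustrative:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

# Eq. (1.1.3) with F the identity: direct connections x'alpha, an intercept
# beta_0, and q nonlinear hidden-unit terms G(x'gamma_j) beta_j.
def augmented_net(x, alpha, beta0, beta, Gamma, F=lambda a: a, G=logistic):
    x_aug = np.concatenate(([1.0], x))           # bias-augmented input for gamma_j
    return F(x @ alpha + beta0 + G(Gamma @ x_aug) @ beta)

x = np.array([1.0, 2.0])
alpha = np.array([0.5, -0.25])                   # direct input-output weights (r = 2)
Gamma = np.zeros((3, 3))                         # q = 3 hidden units; G(0) = 0.5 each
beta = np.array([2.0, 2.0, 2.0])
y = augmented_net(x, alpha, 1.0, beta, Gamma)    # 0.0 + 1.0 + 3 * 0.5 * 2.0 = 4.0
```

Setting beta to zero recovers the plain linear model, which is the sense in which the hidden-unit terms "augment" rather than replace it.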

What originally commanded the attention and excitement of a diverse range of disciplines was the demonstrated successes that models of the form (1.1.1) and (1.1.2) had in solving previously intractable classification, forecasting and control problems, or in producing superior solutions to difficult problems in orders of magnitude less time than traditional approaches. Until recently, a theoretical basis for such successes was unknown --artificial neural networks just worked.


Motivated by a desire either to delineate the limitations of network models or to explain their successes, researchers have recently obtained results establishing that functions of the form (1.1.2) can be viewed as "universal approximators," that is, as a flexible functional form that, provided with sufficiently many hidden units and properly adjusted parameters, can approximate an arbitrary function g : ℝ^r → ℝ arbitrarily well in useful spaces of functions. Results of this sort have been given by Carroll and Dickinson (1989), Cybenko (1989), Funahashi (1989), Hecht-Nielsen (1989), Hornik, Stinchcombe and White (1989, 1990) (HSWa, HSWb) and Stinchcombe and White (1989), among others. The flavor of such results is conveyed by the following paraphrase of part of Theorem 2.4 of HSWa.

THEOREM 1.1.1: Let Σ^r(G) = {f : ℝ^r → ℝ | f(x) = β_0 + ∑_{j=1}^{q} G(x'γ_j) β_j, x ∈ ℝ^r; γ_j ∈ ℝ^{r+1}, β_j ∈ ℝ, j = 1, ..., q; q = 1, 2, ...} be the set of single hidden layer network functions, where G : ℝ → [0,1] is any cumulative distribution function. Then Σ^r(G) is uniformly dense on compacta in C(ℝ^r), i.e. for every g in C(ℝ^r), every compact subset K of ℝ^r, and every ε > 0, there exists f in Σ^r(G) such that sup_{x ∈ K} |f(x) - g(x)| < ε.

Thus, the biologically inspired combination of parallelism, nonlinear response and multilayer processing leads us to a class of functions that can approximate members of the useful class C(ℝ^r) arbitrarily well. Similar results hold for network models with general (not necessarily sigmoid) activation functions approximating functions in L_p spaces with compactly supported measures, and, as HSWb and Hornik (1991) show, in general Sobolev spaces. Thus, functions of the form (1.1.2) can approximate a function and its derivatives arbitrarily well, and in this sense are as flexible as Gallant's (1981) flexible Fourier form. Indeed, Gallant and White (1988b) construct a sigmoid choice for G (the "cosine squasher") that nests Fourier series within (1.1.2), so that the flexible Fourier form is a special case of (1.1.3) even for sigmoid G.


The econometric usefulness of the flexible form (1.1.2) has been further enhanced by

Hu and Joerding (1990) and Joerding and Meador (1990), who show how to impose constraints

ensuring monotonicity and concavity (or convexity) of the network output function. The interested reader is referred to these papers for details.

An issue of both theoretical and practical importance is the "degree of approximation" problem: how rapidly does the approximation to an arbitrary function improve as the number of hidden units q increases? Classic results for Fourier series are provided by Edmunds and Moscatelli (1977). Similar results for ANN models are only beginning to appear, and so far are not as sharp as those for Fourier series. Barron (1991a) exploits results of Jones (1991) to establish essentially that ‖f - g‖₂ = O(1/q^{1/2}) (‖·‖₂ denotes an L₂ norm), where f is an element of Σ^r(G) with q hidden units and g is a suitably smooth (e.g., differentiable) function. An open area for further work is the extension and deepening of results of this sort, especially as such results may provide key insight into advantages and disadvantages of ANN models compared to standard flexible function families. Degree of approximation results are also necessary for establishing rates of convergence for nonparametric estimation based on ANN models.

Our focus so far on networks with a single hidden layer is justified by their relative simplicity and their approximative power. However, if nature is any guide, there are advantages to using networks of many hidden layers, as depicted in Figure 4. Output of an l-layer network

can be represented as

a_hi = G_h(A_hi(a_{h-1})) ,   i = 1, ..., q_h;  h = 1, ..., l,       (1.1.4)

where A_hi(a) = ã'γ_hi for some (q_{h-1} + 1) × 1 vector γ_hi (with ã = (1, a')'), G_h is the activation function for units of layer h, a_0 = x, q_0 = r, and q_l = v. The single hidden layer networks discussed above correspond to l = 2 in this representation.
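The recursion (1.1.4) is just an alternation of affine maps and activations; a generic forward pass might be sketched as follows (logistic hidden activations and all names are our illustrative choices):

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

# Eq. (1.1.4): a_h = G_h(A_h(a_{h-1})), a_0 = x, where each affine map acts on
# the bias-augmented previous layer (1, a_{h-1}')'.
def forward(x, layers):
    """layers: list of (Gamma_h, G_h) pairs; Gamma_h has shape (q_h, q_{h-1}+1)."""
    a = np.asarray(x, dtype=float)
    for Gamma, G in layers:
        a = G(Gamma @ np.concatenate(([1.0], a)))
    return a

rng = np.random.default_rng(1)
identity = lambda a: a
layers = [(rng.standard_normal((5, 3)), logistic),  # first hidden layer, q_1 = 5
          (rng.standard_normal((4, 6)), logistic),  # second hidden layer, q_2 = 4
          (rng.standard_normal((1, 5)), identity)]  # output layer, v = 1
out = forward(np.array([0.2, -0.4]), layers)        # r = 2 inputs, l = 3 weight layers
```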


An interesting open question is to what extent networks with l ≥ 3 layers may be advantageous. Specifically, can a three layer network achieve a given degree of accuracy with fewer connections (free parameters) than a two layer network? Examples are known in which a two layer network cannot exactly represent a function exactly representable by a three layer network (Blum and Li, 1991), and it is known that certain mappings containing discontinuities relevant in control theory can be uniformly approximated in three but not two layers (Sontag, 1990). HSWa (Corollary 2.7) have shown that additional layers cannot hurt, in the sense that approximation properties of single hidden layer networks (l = 2) carry over to multi-hidden layer networks. Further research in this interesting area is needed.

A further generalization of the networks represented by (1.1.4) is obtained by replacing the affine function A_hi(·) with a polynomial P_hi(·) with degree possibly dependent on i and h. This modification yields a class of networks containing as a special case the so-called "sigma-pi" (ΣΠ) networks (Maxwell, Giles, Lee and Chen, 1986; Williams, 1986). Stinchcombe (1991) has studied the approximation properties of networks for which an arbitrary "inner function" replaces A_hi in (1.1.4).

The richness of this class of network models is now fairly apparent. However, we still have not exploited a known feature of biological networks, that of internal feedback. Such feedback can be represented schematically as in Figure 5. In Figure 5(a), network output feeds back into the hidden layer with a time delay, as proposed by Jordan (1986). In Figure 5(b), hidden layer output feeds back into the hidden layer with a time delay, as proposed by Elman (1988). The output function of the Elman network can thus be represented as

a_tj = G(x_t'γ_j + a'_{t-1} δ_j) ,   j = 1, ..., q;  t = 0, 1, 2, ...,       (1.1.5)


where a_t = (a_t1, ..., a_tq)'. Network output thus depends on the initial value a_0, and the entire history of system inputs, x^t = (x_1, ..., x_t).
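Iterating (1.1.5) directly makes the dependence on a_0 and the whole input history x^t explicit; a sketch (dimensions and names are illustrative assumptions):

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

# Elman recursion (1.1.5): a_tj = G(x_t'gamma_j + a_{t-1}'delta_j), j = 1, ..., q.
def elman_states(X, Gamma, Delta, a0):
    """X: (T, r+1) bias-augmented inputs; Gamma: (q, r+1); Delta: (q, q)."""
    a = a0
    states = []
    for x_t in X:                           # each step feeds back the prior state
        a = logistic(Gamma @ x_t + Delta @ a)
        states.append(a)
    return np.array(states)

rng = np.random.default_rng(2)
T, r, q = 6, 2, 3
X = np.hstack([np.ones((T, 1)), rng.standard_normal((T, r))])
states = elman_states(X, rng.standard_normal((q, r + 1)),
                      0.5 * rng.standard_normal((q, q)), np.zeros(q))
```

Changing a0 or any early x_t changes every later a_t, which is exactly the memory property exploited in dynamic applications.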

Such networks are capable of rich dynamic behavior, exhibiting memory and context sensitivity. Because of the presence of internal feedbacks, these networks are referred to in the literature as "recurrent networks," while networks lacking feedback (e.g., with output functions

(1.1.3)) are designated "feedforward networks." In econometric terms, a model of the form (1.1.5) can be viewed as a nonlinear dynamic latent variables model. Such models have a great many potential applications in economics and finance. Their estimation would appear to present some serious computational challenges (see e.g. Hendry and Richard, 1990, and Duffie and Singleton, 1990), but in fact some straightforward recursive estimation procedures related to the Kalman filter can deliver consistent estimates of model parameters (Kuan, Hornik and White, 1990; Kuan and White, 1991). We discuss this further in the next section.

Although we have covered a fair amount of ground in this section, we have only scratched the surface of the modeling possibilities offered by artificial neural networks. To mention some additional models treated in the ANN literature, we note that fully interconnected networks have been much studied (with applications to such areas as associative memory and solution of problems like the traveling salesman problem; see e.g. Xu and Tsai, 1990, and Xu and Tsai, 1991), and that networks running in continuous rather than discrete time are also standard objects of investigation (e.g. Williams and Zipser, 1989). Although fascinating, these network models appear to be less relevant to econometrics than those discussed so far, and we shall not

treat them further. As rich as ANN models are, they still ignore a host of biologically relevant features.

Neural systems that have taken perhaps billions of years to evolve will take humans a little more time to model exhaustively than the five decades devoted so far! To mention just a few items,

biological neurons communicate over multiple pathways, chemical as well as electrical --the single communication dimension ("activation") assumed in most ANN models is quite incomplete.


Also, biological neurons respond to input activity stochastically and in much more complicated ways than as modeled by the sigmoid activation function --neurons output complex spike trains through time, and are in fact not simple processing units. Of course, these and other limitations of ANN models are daily being challenged by ANN modelers, and we may expect a continuing increase in the richness of ANN models as the diverse interdisciplinary talents of the ANN community are brought to bear on these issues.

Despite these limitations as descriptions of biological reality, ANN models are sufficiently attractive to be useful for econometric modeling.

Given models, the econometrician wants estimators. We take up estimation in the next section, where we encounter additional interesting tools developed by the ANN community in their study of learning in artificial neural networks.

The discussion of the previous section establishes ANN models as flexible functional forms, extending standard linear specifications. As such, they are potentially useful for econometric modeling. To fulfill this potential, we require methods for finding useful values for the free parameters of the model, the network weights.

To any econometrician versed in the standard tools of the trade, a multitude of relevant estimation procedures for finding useful parameter values present themselves, typically dependent on the behavior of the data generating process and the goals of the analysis.

For example, suppose we observe a realization of a random sequence of s × 1 vectors Z_t = (Y_t, X'_t)', and we wish to forecast the scalar Y_t on the basis of X_t. The minimum mean-squared error forecast of Y_t given X_t is the conditional expectation g(X_t) = E(Y_t | X_t). Although the function g is unknown, we can attempt to approximate it using a neural network with some sufficient number of hidden units. If we adopt (1.1.3) with F the iden-


tity, we have

f(x, θ) = x'α + β_0 + ∑_{j=1}^{q} G(x'γ_j) β_j ,

with θ = (α', β_0, ..., β_q, γ'_1, ..., γ'_q)'. In using f(·, θ) as an approximation to g, we recognize from the outset that it is misspecified. Nevertheless, the theory of least squares for misspecified nonlinear regression models (White, 1981; 1992, Ch. 5; Domowitz and White, 1982; Gallant and White, 1988a) applies immediately to establish that a nonlinear least squares estimator θ̂_n solving the problem

min_{θ ∈ Θ}  n^{-1} ∑_{t=1}^{n} [Y_t - f(X_t, θ)]²

exists and converges almost surely under general conditions as n → ∞ to θ*, the solution to the problem

min_{θ ∈ Θ}  E([Y_t - f(X_t, θ)]²) - σ²_u ,

where σ²_u = E([Y_t - g(X_t)]²).

(See Sussman (1991) for discussion of issues relating to identification.)

Further, under general conditions √n(θ̂_n - θ*) converges in distribution as n → ∞ to a multivariate normal distribution with estimable covariance matrix (White, 1981; 1992, Ch. 6; Domowitz and White, 1982). Although least squares is a leading case, the properties of the dependent variable Y_t will often suggest the appropriateness of a quasi-maximum likelihood procedure different from

least squares. For example, if Y_t is a binary choice indicator taking values 0 or 1 only, it may be assumed to follow a conditional Bernoulli distribution, given X_t. A network model to approximate the conditional probability that Y_t = 1 given X_t can then be specified as

f(x, θ) = F(x'α + β_0 + ∑_{j=1}^{q} G(x'γ_j) β_j) ,       (1.2.1)


where F(·) is now some appropriate c.d.f. (e.g., the logistic or normal). The mean quasi-log likelihood function for a sample of size n is then

L_n(Z^n, θ) = n^{-1} ∑_{t=1}^{n} [Y_t log f(X_t, θ) + (1 - Y_t) log(1 - f(X_t, θ))].

A quasi-maximum likelihood estimator θ̂_n solving the problem

max_{θ ∈ Θ}  L_n(Z^n, θ)

can be shown under general conditions to exist and converge to θ*, the solution to the problem

max_{θ ∈ Θ}  E[Y_t log f(X_t, θ) + (1 - Y_t) log(1 - f(X_t, θ))].

(See White, 1982; 1992, Ch. 3-5.) The solution θ* minimizes the Kullback-Leibler divergence of the approximate probability model f(X_t, θ*) from the true g(X_t). As in the least squares case, √n(θ̂_n - θ*) converges in distribution as n → ∞ to a multivariate normal distribution.
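The Bernoulli quasi-log likelihood is straightforward to evaluate; the sketch below (our construction, with the plain logit model standing in for the network probability model) illustrates it on simulated data:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

# Mean quasi-log likelihood L_n(Z^n, theta) for a probability model p = f(X, theta).
def mean_quasi_loglik(Y, X, f):
    p = np.clip(f(X), 1e-12, 1 - 1e-12)   # numerical guard for the logs
    return np.mean(Y * np.log(p) + (1 - Y) * np.log(1 - p))

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 2))
theta_true = np.array([1.0, -2.0])
Y = (rng.random(200) < logistic(X @ theta_true)).astype(float)

L_true = mean_quasi_loglik(Y, X, lambda X: logistic(X @ theta_true))
L_flat = mean_quasi_loglik(Y, X, lambda X: np.full(len(X), 0.5))
```

A quasi-maximum likelihood estimator would maximize this objective over theta; here we simply compare the generating parameter value against an uninformative constant-probability model, which it dominates on average.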

If Y_t represents count data, then a Poisson quasi-maximum likelihood procedure is natural (e.g. Gourieroux, Monfort and Trognon, 1984a,b), where f is as in (1.2.1) with F chosen to ensure non-negativity (e.g. F(a) = exp(a)), so as to permit f(X_t, θ) to plausibly approximate g(X_t) = E(Y_t | X_t). If Y_t represents a survival time, then a Cox proportional hazards model (e.g. Amemiya, 1985, pp. 449-454) is a natural choice, with hazard rate of the form λ(t) f(X_t, θ).

From an econometric standpoint, then, ANN models can be used anywhere one would ordinarily use a nonlinear parametric model, with estimation proceeding via appropriate quasi-maximum likelihood (or, alternatively, generalized method of moments) techniques. The now rather well-developed theory of estimation of misspecified models (White, 1982, 1992; Gallant and White, 1988a; Pötscher and Prucha, 1991a,b) applies immediately to provide interpretations and inferential procedures.


The natural instincts of econometricians are not the instincts of those concerned with

artificial neural network learning, however. This is a double blessing, because it means not only

that econometrics has much to offer those who study and apply artificial neural networks, but also

that econometrics may benefit from novel techniques developed by the ANN community. In considering how an artificially intelligent system must go about learning, ANN researchers view learning as the process by which knowledge is acquired; it follows that knowledge accumulates as learning experiences occur, i.e. as new data are observed.

In ANN models, knowledge is embodied in the network connection strengths, θ. Learning is the updating of these connection strengths as new observations arrive,

θ̂_{t+1} = θ̂_t + Δ_t ,

where the update Δ_t incorporates the information in the new observation Z_t = (Y_t, X'_t)' into current knowledge θ̂_t (learning). A successful learning procedure must specify some appropriate way to form the update Δ_t from observables and previous knowledge. Thus we seek an appropriate function ψ_t for which

Δ_t = ψ_t(Z_t, θ̂_t).

Current leading ANN learning methods can trace their history from seminal work of Rosenblatt (1957, 1958, 1961) and Widrow and Hoff (1960). Rosenblatt's learning network, the α-perceptron, was concerned with pattern classification and utilized threshold logic units. Widrow and Hoff's ADALINE networks do not require a TLU, as they are not restricted to being classifiers.

For their linear networks (with output for now given by f(x, θ) = x'θ) Widrow and Hoff proposed a version of recursive least squares (itself traceable back to Gauss, 1809 --see Young, 1984), the "delta rule"

θ̂_{t+1} = θ̂_t + α X_t (Y_t - X'_t θ̂_t).       (1.2.2)

Here Y_t - X'_t θ̂_t is the "error" between computed output X'_t θ̂_t and the "target" value Y_t. The scalar α > 0 is a "learning rate" to be adjusted by trial and error. This recursion

was motivated explicitly by consideration of minimizing expected squared error loss. For networks with nonlinear output f(x, θ) the direct generalization of the delta rule is

θ̂_{t+1} = θ̂_t + α ∇f(X_t, θ̂_t) (Y_t - f(X_t, θ̂_t)) ,       (1.2.3)

where ∇f(x, ·) is the gradient of f(x, ·) with respect to θ (a column vector).

In the ANN literature, this recursion is called the "generalized delta rule" or the method of "backpropagation" (a term invented for a related procedure by Rosenblatt, 1961). Its discovery is attributable to many (Werbos, 1974; Parker, 1982, 1985; Le Cun, 1985), but the influential work of Rumelhart, Hinton and Williams (1986) is perhaps most responsible for its widespread adoption.
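A sketch of the generalized delta rule for the single hidden layer model (1.1.2) with logistic hidden units and identity output; the gradient lines write out ∇f term by term, and the toy fitting loop, names, and tuning constants are our illustrative assumptions:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

# One step of eq. (1.2.3): theta <- theta + alpha * grad f * (y - f),
# for f(x, theta) = beta_0 + sum_j beta_j G(x'gamma_j) with logistic G.
def backprop_step(x_aug, y, beta0, beta, Gamma, alpha):
    h = logistic(Gamma @ x_aug)                         # hidden activations
    err = y - (beta0 + h @ beta)                        # the "delta"
    grad_Gamma = np.outer(beta * h * (1.0 - h), x_aug)  # chain rule through G
    beta0 = beta0 + alpha * err                         # df/dbeta_0 = 1
    beta = beta + alpha * err * h                       # df/dbeta_j = G(x'gamma_j)
    Gamma = Gamma + alpha * err * grad_Gamma
    return beta0, beta, Gamma

def predict(xs, beta0, beta, Gamma):
    X_aug = np.column_stack([np.ones_like(xs), xs])
    return beta0 + logistic(X_aug @ Gamma.T) @ beta

# Toy target: learn g(x) = x^2 on [-1, 1] with q = 5 hidden units.
rng = np.random.default_rng(4)
beta0, beta, Gamma = 0.0, rng.standard_normal(5), rng.standard_normal((5, 2))
grid = np.linspace(-1.0, 1.0, 41)
mse_before = np.mean((predict(grid, beta0, beta, Gamma) - grid**2) ** 2)
for _ in range(4000):
    x = rng.uniform(-1.0, 1.0)
    beta0, beta, Gamma = backprop_step(np.array([1.0, x]), x**2,
                                       beta0, beta, Gamma, alpha=0.1)
mse_after = np.mean((predict(grid, beta0, beta, Gamma) - grid**2) ** 2)
```

The constant learning rate alpha here mirrors (1.2.3); replacing it with a decreasing sequence gives the stochastic approximation variant treated later in the text.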

This apparently straightforward generalization of (1.2.2) in fact caused a revolution

in the ANN field, spurring the explosive growth in ANN modeling responsible for its vigor today

and the appearance of an article such as this in a journal devoted to econometrics. The reasons

for this revolution are essentially two. First, until its discovery, there were no methods known to ANN modelers for finding good weights for connections into the hidden units. The focus on threshold logic units in multilayer networks in the 1950's and 1960's led researchers away from

gradient methods: the derivative of a TLU is zero almost everywhere, so a TLU network does not obviously lend itself to gradient-based training. This is why the introduction of sigmoid activation functions by

Cowan (1967) amounted to such a significant breakthrough --straightforward gradient methods

become possible with such activation functions. Even so, it took over a decade to sink into the

collective consciousness of the ANN community that a solution to a problem long considered

intractable (even impossible, viz. Minsky and Papert, 1969) was now at hand. The second reason is that once feasible methods for training hidden layer networks were available, they were applied to a vast range of problems with some startling successes. That this should be so is all the more impressive given the considerable difficulties in obtaining convergence via (1.2.3). For


some time, the field proceeded with considerable accompanying hype and extravagant claims. In 1987 one of us (White, 1987a) pointed out that (1.2.3) is in fact an application of the method of stochastic approximation (Robbins and Monro, 1951; Blum, 1954) to the nonlinear least squares problem (as in Albert and Gardner, 1967). The least squares stochastic approximation recursions are in fact a little more general, having the form

θ̂_{t+1} = θ̂_t + α_t ∇f(X_t, θ̂_t) (Y_t - f(X_t, θ̂_t)) ,   t = 1, 2, ...       (1.2.4)

The difference is that here the learning rate α_t is indexed by t, whereas in (1.2.3) it is a constant.

This is quite an important difference. With a constant learning rate, the recursion

(1.2.3) can converge only under extremely stringent conditions (there must exist θ_o such that Y = f(X, θ_o) almost surely, where Z_t has the distribution of Z = (Y, X')', t = 1, 2, ...). When this condition fails, the recursion of (1.2.3) generally converges to a Brownian motion (see Kushner and Huang, 1981; Hornik and Kuan, 1990), not an appealing behavior in this context. However, if α_t ∝ t^{-κ} with 1/2 < κ ≤ 1, standard results from the theory of stochastic approximation can be applied (e.g., White, 1989a) to establish the almost sure convergence of θ̂_t in (1.2.4) to θ*, a local solution of the least squares problem

min_{θ ∈ Θ}  E([Y - f(X, θ)]²).

Repeated initialization of the recursion (1.2.4) from different starting values θ̂_1 (e.g., following the parameter space partitioning strategy of Morris and Wong, 1991) can lead to rather good local solutions.

This fact is significant. The recursion (1.2.4) provides a computationally very simple

algorithm for getting a consistent estimator for a locally mean square optimal parameter vector in a nonlinear model with just a single pass through the data. Multiple passes through the data

(which can be executed in parallel) permit exploration for a global optimum. Thus, in addition to


and Duffie and Singleton (1990). Duffie and Singleton derive consistency and asymptotic normality results for MSM estimators of correctly specified models of conditional distribution. The recursive estimator (1.2.6) is computationally simpler by several orders of magnitude and has useful approximation properties even with misspecified models. It is therefore an interesting estimator in its own right; it also appears promising as a generator of starting estimates for MSM estimation.

In all of the discussion so far, we have implicitly assumed that network complexity (indexed by the number of hidden units) is fixed. However, the universal approximation properties described in Section 1.1 suggest that ANN models may prove a useful vehicle for nonparametric estimation. This intuition is correct: using results of White and Wooldridge (1991), White (1990a) shows that nonparametric sieve estimators (Grenander, 1981; Geman and Hwang,

1982) based on ANN models can consistently estimate a square-integrable conditional expectation function, and White (1990b) shows that nonparametric sieve estimators based on ANN models can consistently estimate conditional quantile functions. Using results of Gallant (1987), Gallant and White (1991) establish the consistency in Sobolev norm of nonparametric sieve estimators based on ANN models. Thus, ANN models can consistently estimate unknown functions and their derivatives in a manner analogous to the performance of the flexible Fourier form (Gallant, 1981; Elbadawi, Gallant and Souza, 1983). Given the early stage of development of degree of approximation results for ANN models, rate of convergence results for nonparametric ANN estimators are only beginning to be obtained. However, Barron (1991b) has obtained rate of convergence results for nonparametric least squares estimators of conditional expectation functions. For i.i.d. samples, these rates are slightly slower than n^{1/2}.

To gain some insight into the issues that arise in nonparametric estimation using ANN models, we briefly consider the problem treated by White (1990a). The estimation problem considered there has the standard sieve estimation form

	min_{θ ∈ T(G, q_n, Δ_n)} n^{-1} Σ_{t=1}^n [Y_t − θ(X_t)]² ,  n = 1, 2, ... ,	(1.2.7)

where the sieve T(G, q_n, Δ_n) consists of single hidden layer network functions

	f^q(x, δ) = Σ_{j=1}^q G(x̃'γ_j) β_j ,  x ∈ ℝ^r ,

G is a given hidden layer activation function, {q_n ∈ ℕ} and {Δ_n ∈ ℝ_+} are sequences tending to infinity with n, Θ is the space of functions square integrable with respect to the distribution of X_t, and now δ = (β_1, ..., β_q, γ'_1, ..., γ'_q)'.

Given this setup, the estimation problem (1.2.7) is equivalent to the constrained nonlinear least squares problem

	min_{δ ∈ D_n} n^{-1} Σ_{t=1}^n [Y_t − f^{q_n}(X_t, δ)]² ,  n = 1, 2, ... ,	(1.2.8)

where D_n = {δ : Σ_{j=1}^{q_n} |β_j| ≤ Δ_n , Σ_{j=1}^{q_n} Σ_{i=0}^r |γ_ji| ≤ q_n Δ_n}. Thus, for each sample of size n, one performs a constrained nonlinear least squares estimation on a model with q_n hidden units, satisfying certain summability restrictions on the network weights. By letting the number of hidden units q_n increase gradually with n, and by gradually relaxing the weight constraints, the network model becomes increasingly flexible as n increases. Proper control of q_n and Δ_n eliminates overfitting asymptotically, delivering consistency of the sieve estimator for θ_0, θ_0(X_t) = E(Y_t | X_t). For mixing processes {Z_t} of a specific size, Δ_n = o(n^{1/4}) and q_n Δ_n² log(q_n Δ_n) = o(n^{1/2}) suffice for consistency.

In practice, determining appropriate network complexity is precisely analogous to determining how many terms to include in a nonparametric series regression. As in that case, either cross-validation or information-theoretic methods can be used to determine the number of hidden units optimal for a given sample. Information-theoretic methods in which one optimizes a


Sawa, 1978) have been shown to have desirable properties by Barron (1990). Extension of the analysis of Li (1987), as applied by Andrews (1991a) to cross-validated selection of the number of terms in a standard series regression, may deliver appropriate optimality results for cross-validated selection of network complexity, and is an interesting area for further research.

Also open is the question of the asymptotic distribution of nonparametric neural network estimators. Results of Andrews (1991b) for series estimators may also be extendable to treat nonparametric estimators of ANN models. Additional interesting insights should arise from this analysis.
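To make the complexity-selection idea concrete, the following sketch (not from the original; the function names are ours, and randomly drawn hidden weights with OLS output weights stand in for full nonlinear least squares at each q) chooses the number of hidden units by minimizing the SIC.

```python
import numpy as np

def sic(y, yhat, k):
    """Schwarz information criterion: n*log(mse) + k*log(n)."""
    n = len(y)
    return n * np.log(np.mean((y - yhat) ** 2)) + k * np.log(n)

def select_q(y, x, q_max=8, seed=0):
    """Pick the number of hidden units by SIC.  Hidden weights are drawn
    at random and output weights fit by OLS -- a simplified stand-in for
    nonlinear least squares estimation of each candidate network."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), x])      # intercept + input
    best_q, best_sic = 0, None
    for q in range(q_max + 1):
        gamma = rng.uniform(-4, 4, size=(X.shape[1], q))
        H = np.column_stack([X, 1.0 / (1.0 + np.exp(-X @ gamma))])
        b = np.linalg.lstsq(H, y, rcond=None)[0]
        crit = sic(y, H @ b, H.shape[1])
        if best_sic is None or crit < best_sic:
            best_q, best_sic = q, crit
    return best_q
```

Cross-validation can be substituted for the SIC inside the same loop; in practice the criterion would be evaluated at the nonlinear least squares optimum for each q.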

1.3. SPECIFICATION TESTING

Consider the nonlinear regression model based on (1.1.3) with F the identity function,

	Y_t = X̃'_t α + Σ_{j=1}^q G(X̃'_t γ_j) β_j + ε_t .

The standard linear model occurs as the special case in which β_1 = β_2 = ··· = β_q = 0. Thus, a test of linearity can be based on the hypotheses

	H_0: β* = 0  vs.  H_a: β* ≠ 0 ,

where β* = (β_1, ..., β_q)'.

A moment's reflection reveals an interesting obstacle to straightforward application of the usual tools of statistical inference: the "nuisance parameters" γ_j, j = 1, ..., q, are not identified under the null hypothesis, but are identified only under the alternative. Fortunately, there is now available a variety of tools that permit testing of H_0 in this context.

The simplest, most naive procedure is to avoid treating the γ_j as free parameters, instead choosing them a priori in some fashion (e.g., drawing them at random from some appropriate distribution) and then proceeding to test H_0 using standard methods, e.g. via Lagrange multiplier or Wald statistics, conditional on the values selected for the γ_j. A procedure of precisely this sort was proposed by White (1989b), and the properties of the resulting "neural network test for neglected nonlinearity" were compared to a number of other recognized procedures for testing linearity by Lee, White and Granger (1991). (See White, 1989b, and Lee, White and Granger, 1991, for implementation details.) The network test was found to perform well in comparison with other procedures. Though no one test dominated the others considered, the network test had good size, was often most powerful, and when not most powerful, was often one of the more powerful procedures. It thus appears to be a useful addition to the modern arsenal of specification testing procedures.
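As an illustration of the naive procedure, the sketch below (ours, not the implementation of White, 1989b, which includes further refinements such as principal components of the activations) draws q hidden-unit directions at random and computes an nR²-type statistic from the linear-model residuals.

```python
import numpy as np

def nn_linearity_test(y, X, q=10, seed=0):
    """nR^2-type neural network test for neglected nonlinearity:
    regress linear-model residuals on the regressors and q randomly
    generated logistic hidden-unit activations."""
    rng = np.random.default_rng(seed)
    n = len(y)
    Xt = np.column_stack([np.ones(n), X])          # include intercept
    beta = np.linalg.lstsq(Xt, y, rcond=None)[0]
    e = y - Xt @ beta                              # null-model residuals
    gamma = rng.uniform(-2, 2, size=(Xt.shape[1], q))
    H = 1.0 / (1.0 + np.exp(-Xt @ gamma))          # random activations
    W = np.column_stack([Xt, H])
    b = np.linalg.lstsq(W, e, rcond=None)[0]
    r2 = 1 - ((e - W @ b) ** 2).sum() / ((e - e.mean()) ** 2).sum()
    return n * r2                                  # approx chi2(q) under H0
```

Under the null of linearity the statistic is approximately χ²(q); large values signal neglected nonlinearity.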

A more sophisticated procedure is to choose γ values that optimize the direction in which nonlinearity is sought. Following Bierens (1990), estimate the linear model by least squares, yielding residuals ε̂_t = Y_t − X̃'_t θ̂_n, where θ̂_n is an estimator of θ* = (β_0, α')' with θ̂_n →p θ*. Consider the moment function

	M̂(γ) = n^{-1} Σ_{t=1}^n ε̂_t G(X̃'_t γ) .

Bierens (1990) specifies G(·) = exp(·), but as we discuss below, this is not the only possible choice. It follows that

	Ŵ(γ) = n M̂(γ)² / σ̂²_n(γ) →d χ²(1)

under correct specification of the linear model, where σ̂²_n(γ) is a consistent estimator of

	σ²(γ) = var([G(X̃'_t γ) − b*(γ) X̃_t] ε_t) ,  b*(γ) = E(G(X̃'_t γ) X̃'_t) A*^{-1} ,  A* = E(X̃_t X̃'_t) .

Under the alternative, Ŵ(γ)/n → η(γ) > 0 a.s. for essentially every choice of γ, as Bierens (1990, Theorem 2) shows.

To avoid picking γ at random, Bierens proposes maximizing Ŵ(γ) with respect to γ ∈ Γ (an appropriately specified compact set), yielding Ŵ(γ̂), say. As Bierens notes, this maximization renders the χ²(1) distribution inapplicable under H_0. However, a χ²(1) statistic can be constructed by the following device: choose c > 0, λ ∈ (0, 1) and γ_0 ∈ Γ independently of the sample, and put

	γ̃ = γ_0  if Ŵ(γ̂) − Ŵ(γ_0) ≤ c n^λ ,
	γ̃ = γ̂  otherwise.

Bierens (1990, Theorem 4) shows that under correct specification Ŵ(γ̃) →d χ²(1), while under the alternative Ŵ(γ̃)/n → sup_{γ ∈ Γ} η(γ) > 0 a.s. Bierens' results hold regardless of how γ_0 is chosen.

In recent related work, Stinchcombe and White (1991) show that Bierens' conclusions are preserved if G is chosen to belong to a certain wide class of functions, including G(·) = exp(·). Other members of this class are the logistic function G(a) = 1/(1 + exp(−a)) and the hyperbolic tangent G(a) = tanh(a).

The choice of c, λ, and γ_0 in Bierens' construction is problematic. Two researchers using the same data and models but using different values for c, λ and γ_0 can arrive at differing conclusions in finite samples regarding correctness of a given specification. One way to avoid such difficulties is to confront the problem head-on and determine the distribution of Ŵ(γ̂). Some useful inequalities are given by Davies (1977, 1987), but these are not terribly helpful when γ contains more than a few (say 3) variables. Recently, Hansen (1991) has proposed a computationally intensive procedure that permits computation of an asymptotic distribution for Ŵ(γ̂) under H_0.


An interesting topic for research is a comparison of the relative performance and computational cost of the procedures discussed here: the naive procedure of picking the γ_j's at random; Bierens' Ŵ(γ̃) procedure; and use of Hansen's (1991) asymptotic distribution for Ŵ(γ̂).
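The Bierens-type construction can be sketched as follows (ours, not from the original: the finite grid standing in for Γ, the simplified variance estimator, and the default constants c, λ are illustrative choices); G(·) = exp(·) as in Bierens (1990).

```python
import numpy as np

def bierens_test(y, X, Gamma, c=1.0, lam=0.5, gamma0=None):
    """Sketch of Bierens' (1990) device: maximize W(gamma) over a finite
    grid Gamma, then fall back to gamma0 unless the improvement exceeds
    c * n**lam.  Returns W(gamma_tilde), asymptotically chi-squared(1)
    under correct linear specification."""
    n = len(y)
    Xt = np.column_stack([np.ones(n), X])
    beta = np.linalg.lstsq(Xt, y, rcond=None)[0]
    e = y - Xt @ beta                         # linear-model residuals
    A_inv = np.linalg.inv(Xt.T @ Xt / n)

    def W(g):
        G = np.exp(Xt @ g)                    # Bierens' exponential weight
        b = (G @ Xt / n) @ A_inv              # projection of G on regressors
        u = (G - Xt @ b) * e                  # influence of the moment
        return n * np.mean(G * e) ** 2 / np.mean(u ** 2)

    if gamma0 is None:
        gamma0 = Gamma[0]
    Ws = [W(g) for g in Gamma]
    g_hat = Gamma[int(np.argmax(Ws))]
    if max(Ws) - W(gamma0) <= c * n ** lam:   # the c*n^lambda fallback rule
        return W(gamma0)
    return W(g_hat)
```

A finer grid for Γ approximates the maximization over a compact set; Hansen's (1991) procedure would instead simulate the null distribution of the maximized statistic.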

These procedures extend to testing the specification of nonlinear models, as well as to testing the specification of likelihood or method of moment-based models. For testing correct specification of a nonlinear regression model Y_t = h(X_t, α) + ε_t (where h for convenience includes an intercept), one can test H_0: β = 0 vs. H_a: β ≠ 0 in the augmented model

	Y_t = h(X_t, α) + Σ_{j=1}^q G(X̃'_t γ_j) β_j + ε_t .	(1.3.2)

If α̂_n is the nonlinear least squares estimator under the null (with α̂_n →p α* under H_a; see White, 1981), then with ε̂_t = Y_t − h(X_t, α̂_n) we have the same asymptotic behavior as before, where now

	σ²(γ) = var([G(X̃'_t γ) − b*(γ) ∇_α h(X_t, α*)] ε_t) ,
	b*(γ) = E(G(X̃'_t γ) ∇'_α h(X_t, α*)) A*^{-1} ,  A* = E(∇_α h(X_t, α*) ∇'_α h(X_t, α*)) .

We again have Ŵ(γ) = n M̂(γ)² / σ̂²_n(γ) →d χ²(1) under H_0, while Ŵ(γ)/n → η(γ) > 0 a.s. under H_a (misspecification) for essentially all γ. A consistent specification test is therefore available. Optimizing Ŵ(γ) over choice of γ leads to considerations regarding asymptotic testing identical to those arising in the linear case.

For testing correct specification of a likelihood-based model, a consistent m-test (Newey, 1985; Tauchen, 1985; White, 1987b, 1992) can be performed. The starting point is the fact that if l(Z_t, θ) is a correctly specified conditional log-likelihood for Y_t given X_t (i.e., for some θ_0 it is the logarithm of the true conditional density), then the score s(Z_t, θ) = ∇_θ l(Z_t, θ) satisfies

	E(s(Z_t, θ_0) | X_t) = 0 ,

so that

	E(s(Z_t, θ_0) G(X̃'_t γ)) = 0  for all γ ∈ Γ .

Under standard conditions (e.g. White, 1992, Ch. 9) it follows that with θ̂_n the (quasi-) maximum likelihood estimator consistent under misspecification for θ*, we have an asymptotically normal moment statistic, with covariance matrix Σ(γ) constructed, analogously to σ²(γ) above, from

	b*(γ) = E(G(X̃'_t γ) ∇'_θ s*_t) ,  A* = E(∇'_θ s*_t) ,

where s*_t = s(Z_t, θ*) and ∇'_θ s*_t = ∇'_θ s(Z_t, θ*). Consequently, the associated statistic M̂(γ) →d χ² under correct specification. An argument analogous to that of Bierens (1990, Theorem 4) delivers Ŵ(γ̃) →d χ², while Ŵ(γ̃)/n → η(γ̃) > 0 a.s. under misspecification for essentially all γ, given an appropriate choice of G, e.g. G(a) = exp(a) as in Bierens (1990), or G(a) = tanh(a), as in Stinchcombe and White (1991). Optimizing Ŵ(γ) over choice of γ leads to considerations identical to those already discussed.

Because ANN models must be recognized from the outset as misspecified, one cannot test hypotheses about estimated parameters of the ANN model in the same way that one would test hypotheses about correctly specified nonlinear models (e.g. as in Gallant, 1973, 1975). Nevertheless, one can test interesting and useful hypotheses within the context of inference for misspecified models (White, 1982, 1992; Gallant and White, 1988a). In this context, two issues arise: the first concerns the interpretation of the hypothesis itself, and the second concerns construction of an appropriate test statistic. Both of these issues can be conveniently illustrated in the context of nonlinear regression, as in White (1981).

The nonlinear least squares estimator θ̂_n solves

	min_{θ ∈ Θ} n^{-1} Σ_{t=1}^n [Y_t − f(X_t, θ)]² ,

where, for concreteness, we take f(X_t, θ) to be of the form (1.1.3) with F the identity function. White (1981) provides conditions ensuring that θ̂_n → θ* a.s., where θ* is the solution to

	min_{θ ∈ Θ} E([E(Y_t | X_t) − f(X_t, θ)]²) ,

so that f(·, θ*) is the best approximation within the class f to E(Y_t | X_t). One can therefore test hypotheses about the parameters of the best approximation. A leading case is that in which a specified explanatory variable (say the rth variable, X_tr) is hypothesized to afford no improvement in predicting Y_t within the class of approximations permitted by f:

	H_0: S_r θ* = 0  vs.  H_a: S_r θ* ≠ 0 ,

where S_r is a (q + 1) × k selection matrix picking out the elements of θ* associated with X_tr (i.e., α_r, γ_1r, ..., γ_qr).

Testing H_0 against H_a in the context of a misspecified model can be conveniently done using either Lagrange multiplier (LM) or Wald-type test statistics, but not likelihood ratio statistics, for reasons described in Foutz and Srivastava (1977), White (1982, 1992) and Gallant and White (1988a): the likelihood ratio statistic requires for its validity the information matrix equality (White, 1982, 1992), which fails under misspecification. The classical LM or Wald statistics also require the validity of the information matrix equality, but can be modified by replacing classical estimators of the asymptotic covariance matrix of θ̂_n with specification-robust estimators (White, 1981, 1982, 1992; Gallant and White, 1988a). Thus, a test of H_0 against H_a can be conducted using the Wald statistic

	Ŵ_n = n θ̂'_n S'_r (S_r Ĉ_n S'_r)^{-1} S_r θ̂_n ,

where Ĉ_n = Â_n^{-1} B̂_n Â_n^{-1}. The covariance estimator Ĉ_n is consistent when {Z_t} is i.i.d., but modifications preserving consistency are available in other contexts. Under the hypothesis that X_tr is irrelevant (and with consistent Ĉ_n), one can show that Ŵ_n →d χ²(q + 1), and that the test is consistent against the alternative. Similar results hold for the LM test statistic. Details can be found in Gallant and White (1988a, Ch. 7) and White (1982; 1992, Ch. 8).
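The robust Wald computation can be sketched as follows (a minimal illustration with names of our choosing; the i.i.d. form of B̂_n is used, and J would be the Jacobian ∇_θ f(X_t, θ̂_n) evaluated at the estimate).

```python
import numpy as np

def robust_wald(J, e, theta, S):
    """Wald statistic for H0: S @ theta = 0 using the specification-robust
    covariance C = A^{-1} B A^{-1} (i.i.d. case).  J is the n x k Jacobian
    of f at the estimate, e the n-vector of residuals."""
    n = J.shape[0]
    A = J.T @ J / n                       # Hessian-type term
    B = (J * e[:, None] ** 2).T @ J / n   # outer-product (score) term
    A_inv = np.linalg.inv(A)
    C = A_inv @ B @ A_inv                 # sandwich covariance
    St = S @ theta
    V = S @ C @ S.T / n                   # covariance of S @ theta-hat
    return float(St @ np.linalg.solve(V, St))
```

For a linear model J is simply the regressor matrix; for an ANN model it would be obtained by differentiating the network output with respect to all weights.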

In this section we illustrate methods for estimating ANN models by fitting single hidden layer feedforward networks to time series generated by three deterministic chaotic processes:

(a) The logistic map:

	Y_{t+1} = 3.8 Y_t (1 − Y_t) .

(b) The circle map (Thompson and Stewart, 1986, pp. 164, 285-6):

	Y_{t+1} = Y_t + (2/π) sin(2π Y_t + ½) .

(c) The Bier–Bountis map:

	Y_{t+1} = −2 + 28.5 Y_t / (1 + Y_t²) .

Chaos (a) is by now a familiar example to economists and econometricians. Chaos (b) and chaos (c) are less familiar, but these three examples, representing polynomial, sinusoidal and rational polynomial functions, provide a modest range of different functions with which to demonstrate ANN capabilities. Time-series plots of the three series are given in Figures 6, 7 and 8.
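The series can be generated directly from the recursions; a minimal sketch for maps (a) and (c), with starting values that are arbitrary choices of ours (map (b) can be iterated the same way):

```python
import numpy as np

def iterate(f, y0, n):
    """Iterate a scalar map f from y0, returning n observations."""
    ys = np.empty(n)
    y = y0
    for t in range(n):
        y = f(y)
        ys[t] = y
    return ys

logistic = lambda y: 3.8 * y * (1.0 - y)                  # map (a)
bier_bountis = lambda y: -2.0 + 28.5 * y / (1.0 + y * y)  # map (c)

y_a = iterate(logistic, 0.3, 500)    # stays in (0, 1)
y_c = iterate(bier_bountis, 0.5, 500)  # bounded by |y| <= 2 + 28.5/2
```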

Because we shall not be adding observational error to the chaotic series, our examples will provide direct insight into the approximation abilities of single hidden layer feedforward networks. In each case, we fit ANN models of the form

	f(X_t, θ) = X̃'_t α + β_0 + Σ_{j=1}^q G(X̃'_t γ_j) β_j ,	(1.4.1)

with G the logistic function. Several models are examined in each instance. Specifically, the input X_t is a single lag of the target series Y_t, while the number of hidden units (q) varies from zero to eight. The best model is chosen from these alternatives using the Schwarz Information Criterion (SIC). For each network configuration, we estimate model parameters by a version of the method of nonlinear least squares, i.e., we attempt to solve

	min_θ n^{-1} Σ_{t=1}^n [Y_t − f(X_t, θ)]² .


Optimization proceeds in two stages. First, the parameter estimates α̂_n are obtained by ordinary least squares, with the parameters β constrained to zero. (Note that α̂_n contains an intercept.) Then, if q > 0, second stage parameter estimates β̂_n and γ̂_n are obtained in such a way as to exploit the structure of (1.4.1); the α̂_n estimates are not subsequently modified, forcing the hidden layer to extract any available structure from the least-squares residuals.

Inspecting (1.4.1), we see that for given γ_j's, ordinary least squares gives fully optimal estimates for β. Thus, we choose a large number of random values for the elements of γ_j, j = 1, ..., q, and compute the least squares estimates for β. This implements a form of global random search of the parameter space. The best fitting values of β and γ are then used as starting values for local steepest descent with respect to β and γ. Within steepest descent, the step size is dynamically adjusted: it increases when improvements to mean squared error occur, and otherwise decreases until a mean squared error improvement is found. Convergence is judged to occur when (mse(k) − mse(k − 1))/(1 + mse(k − 1)) is sufficiently small, where mse(k) denotes sample mean squared error on the kth steepest descent iteration. Once a local minimum is reached, the procedure terminates. This algorithm has been found to be fast and reliable across a variety of applications investigated by the authors.

The results of least squares estimation of a linear model are given in Table 1. The simple linear model explains only 12% of the target variance for the circle map, while explaining 84% of the target variance for the Bier–Bountis map. The logistic map is intermediate at 36%. Results for the single hidden layer feedforward network are given in Table 2. In each case the hidden layer network chooses to take as many hidden units as are offered (8), and with this number of hidden units, nearly perfect fits are obtained. Because the relationships studied here are noiseless, the SIC starts to limit the number of hidden units chosen essentially only when machine imprecision begins to corrupt the computations. This limit was not reached in these examples. Our examples show that single hidden layer feedforward networks do have
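The first stage and the global random search can be sketched as follows (a simplified illustration, not the authors' code; the local steepest-descent refinement with the adaptive step size is omitted, and the weight range is an assumption of ours):

```python
import numpy as np

def fit_two_stage(y, x, q=8, n_draws=200, seed=0):
    """Stage 1: OLS for the linear part with beta constrained to zero.
    Stage 2: global random search over hidden weights gamma, with OLS
    for beta at each draw, fitted to the stage-1 residuals."""
    rng = np.random.default_rng(seed)
    n = len(y)
    X = np.column_stack([np.ones(n), x])           # intercept + single lag
    alpha = np.linalg.lstsq(X, y, rcond=None)[0]   # stage 1: linear OLS
    resid = y - X @ alpha                          # hidden layer fits these
    best = (np.inf, None, None)
    for _ in range(n_draws):
        gamma = rng.uniform(-8, 8, size=(2, q))    # random hidden weights
        H = 1.0 / (1.0 + np.exp(-X @ gamma))       # logistic activations
        beta = np.linalg.lstsq(H, resid, rcond=None)[0]  # optimal beta
        mse = np.mean((resid - H @ beta) ** 2)
        if mse < best[0]:
            best = (mse, gamma, beta)
    return alpha, best
```

The returned best (mse, gamma, beta) would seed the local steepest descent stage.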


appealing flexibility, and can be profitably used to extract approximations at least to some simple chaos-generating functions. Experience in a wide variety of applications across a spectrum of scientific disciplines suggests that the usefulness of this flexibility is likely to extend broadly to econometric contexts. ANN models thus appear to be useful additions to the modern econometrician's tool-kit.

II. STOCHASTIC APPROXIMATION WITH DEPENDENT OBSERVATIONS

II.1. INTRODUCTION

In Part I, we briefly discussed the method of stochastic approximation (Robbins and Monro, 1951). The Robbins–Monro (RM) algorithm recursively approximates the zero of an unknown function Ψ(θ), say θ*, by

	θ̂_{t+1} = θ̂_t + a_t ψ(Z_t, θ̂_t) ,  t = 1, 2, ... ,	(II.1.1)

where a_t is a "learning rate" tending to zero, and ψ(Z_t, θ) is a measurement of Ψ(θ) at time t, influenced by the random variables Z_t. When Ψ(θ) = E(ψ(Z_t, θ)), this method yields a recursive implementation of the method of m-estimation of Huber (1964). In particular, the method can be used to estimate recursively the parameters of nonlinear regression models, such as those arising in neural network applications.

The RM algorithm has two significant advantages: (1) its recursive nature places few demands on computer resources; and (2) in theory, just one pass through a sufficiently large data set can yield a consistent estimate. The RM algorithm is therefore particularly appealing for estimating parameters of nonlinear models in large data sets.

Very general results relevant to the convergence properties of the RM algorithm have been given by Kushner and Clark (1978) (KC) and Kushner and Huang (1979) (KH). However, the conditions of KC/KH are not primitive and require some effort to apply. In this part of the paper, we bridge an existing gap between the results of KC/KH and some interesting and fairly broad


For the measurement function ψ(z, θ) there exist functions ρ_1 : ℝ_+ → ℝ_+ and h_3 : ℝ^s → ℝ_+ such that ρ_1(u) → 0 as u → 0, h_3 is measurable-ℬ^s, and for all (z, θ_1, θ_2) in ℝ^s × Θ × Θ

	|ψ(z, θ_1) − ψ(z, θ_2)| ≤ ρ_1(|θ_1 − θ_2|) h_3(z) .

ASSUMPTION A.3: E|ψ(Z_t, θ)| < ∞ for each θ in Θ, and there exists a function Ψ : Θ → ℝ^k such that n^{-1} Σ_{t=1}^n E ψ(Z_t, θ) → Ψ(θ) as n → ∞ for each θ in Θ.

ASSUMPTION A.4: a_t → 0 as t → ∞, and Σ_{t=0}^n a_t → ∞ as n → ∞.

ASSUMPTION A.5: For j = 1, 2, 3, there exist bounded non-stochastic sequences {η_jt} such that Σ_{t=0}^n a_t [h_j(Z_t) − η_jt] converges a.s.-P.

Assumption A.1 introduces the data generating process, and Assumption A.2 imposes some suitable and relatively mild restrictions on the growth and smoothness properties of the measurement function ψ. Assumption A.3 is a mild asymptotic mean stationarity requirement. In Assumption A.4, a_t → 0 ensures that the effect of error adjustment eventually vanishes, while the condition Σ_{t=1}^n a_t → ∞ allows the adjustment to continue for an arbitrarily long time, so that eventual convergence is always plausible.

Assumption A.5 imposes mild convergence conditions on processes depending on Z_t. Below we consider more primitive mixingale conditions that ensure the validity of this assumption. Let π : ℝ^k → Θ be a measurable projection function (for θ ∈ Θ, π(θ) = θ). With these ingredients, a convergence theorem (Theorem II.2.1) can be established.

This result generalizes classical results (e.g., Blum, 1954) in several respects. First, Z_t is not required to enter the function ψ additively. Second, the learning rate a_t is not required to be square summable. Most importantly, general behavior for Z_t is allowed, provided that Assumption A.5 holds. As examples, KC consider martingale difference sequences and moving average processes.

To provide more primitive conditions ensuring the convergence conditions of Assumption A.5, we apply the mixingale concept of McLeish (1975). Let ‖·‖_p denote the L_p-norm, defined whenever each element of X belongs to L_p(P). For matrix-valued X, ‖·‖_p is as just defined, with |·| denoting the spectral norm induced by the Euclidean norm. We use the following definition.

DEFINITION II.2.2: Let {X_t} be a sequence of random variables belonging to L_2(P), and let {IF_t} be a filtration of IF. The sequence {X_t, IF_t} is a mixingale if there exist sequences of nonnegative real constants {c_t} and {ξ_m}, where ξ_m → 0 as m → ∞, such that for all t ≥ 1 and m ≥ 0 we have

	‖E(X_t | IF_{t−m})‖_2 ≤ c_t ξ_m ,
	‖X_t − E(X_t | IF_{t+m})‖_2 ≤ c_t ξ_{m+1} .

{X_t} is a mixingale of size −a if ξ_m = O(m^λ) for some λ < −a. (We drop explicit reference to the filtration when there is no danger of confusion.)

Our definition of size is convenient, but also stronger than that considered by McLeish (1975). As special cases, mixingale processes include independent sequences, martingale difference sequences, φ-, ρ- and α-mixing processes, finite and certain infinite order moving average processes, and sequences of near epoch dependent functions of infinite histories of mixing processes (discussed further in the next section). Mixingales thus constitute a rather broad class of dependent heterogeneous processes.

In what follows, we always assume that the relevant random variables are measurable with respect to IF_t, so that the adaptedness condition holds automatically. This avoids anticipativity of the RM algorithm.


ASSUMPTION A.4′: Σ_{t=1}^n a_t → ∞ as n → ∞, and Σ_{t=1}^∞ a_t² < ∞.

ASSUMPTION A.5′:
(a) For each θ in Θ, sup_t ‖ψ(Z_t, θ)‖_2 ≤ Δ_θ < ∞ and {ψ(Z_t, θ) − Eψ(Z_t, θ), IF_t} is a mixingale;
(b) for j = 1, 2, 3, {h_j(Z_t) − Eh_j(Z_t), IF_t} is a mixingale.

Assumption A.4′ implies Assumption A.4. Note that Assumption A.5 is implied by Assumptions A.5′(b) and A.2(b.i), and that we may take η_jt = Eh_j(Z_t). We have the following result.

COROLLARY II.2.3: Given Assumptions A.1–A.3, A.4′ and A.5′, let {θ̂_t} be given by (II.1.1) with θ̂_0 chosen arbitrarily. Then the conclusions of Theorem II.2.1 hold. □

This

provides

general

and

fairly

primitive

conditions

ensuring

the

convergence

Only

Assumption A.5' is a reasonable candidate for further specialization to achieve additional simpliThis is most conveniently done by placing conditions on h 1, h2, h3 and {2, } sufficient to ensure that the mixingale property is valid. We give examples of this in the next section. The present result gives a very considerable generalization of a convergence result of

White (1989a, Proposition 3.1). There Zr is taken to be an i.i.d. uniformly bounded sequence.

Corollary II.2.3 also generalizes results of Englund, HoIst and Ruppert (1988), who assume that

{ Zt } is a stationary mixing process and that 1/' is a bounded function.

fastest rate of convergence obtains with at = (t + 1)-1; we adopt this rate for the rest of this sec-


For given θ* ∈ ℝ^k we write U_t = (t + 1)^{1/2} (θ̂_t − θ*). Straightforward manipulations allow us to write

	U_{t+1} = [I_k + (t + 1)^{-1} H_t] U_t + (t + 1)^{-1/2} q*_t ,	(II.2.4)

where

	H_t = ∇_θ ψ*_t + [((t + 2)/(t + 1))^{1/2} − 1] ∇_θ ψ*_t + I_k/2 + O((t + 1)^{-1}) I_k	(II.2.5)

and q*_t collects the remaining terms, with ψ*_t = ψ(Z_t, θ*) and ∇_θ ψ*_t = ∇_θ ψ(Z_t, θ*). The piecewise constant interpolation U⁰(·) of U_t on [0, ∞) with interpolation intervals {a_t} is defined as

	U⁰(τ) = U_t ,  τ ∈ [τ_t, τ_{t+1}) ,  τ ≥ 0 .

The asymptotic distribution of θ̂_t is found by showing that U^t(·) converges weakly to the stationary solution of a stochastic differential equation. For this we impose the following conditions.

ASSUMPTION B.1: {Z_t} is a stationary process.

ASSUMPTION B.2:
(a) For each z ∈ ℝ^s, ψ(z, ·) is continuously differentiable in Θ;
(b) there exist functions ρ_2 : ℝ_+ → ℝ_+ and h_4 : ℝ^s → ℝ_+ such that ρ_2(u) → 0 as u → 0, h_4 is measurable-ℬ^s, and for all z and all θ, θ_0 in Θ

	|∇_θ ψ(z, θ) − ∇_θ ψ(z, θ_0)| ≤ ρ_2(|θ − θ_0|) h_4(z) .

ASSUMPTION B.3: ψ*_t ∈ L_6(P), ∇_θ ψ*_t ∈ L_2(P), and the eigenvalues of H = H* + I_k/2 (with H* = E(∇_θ ψ*_t)) have negative real parts.

ASSUMPTION B.5:
(a) Σ_{j=0}^∞ sup_t ‖E(ψ*_t ψ*'_{t+j} | IF_0) − σ_j‖_1 < ∞, where σ_j = E(ψ*_t ψ*'_{t+j});
(b) for some η_4 ∈ ℝ_+, Σ_{t=0}^∞ a_t [h_4(Z_t) − η_4] and Σ_{t=0}^∞ a_t [∇_θ ψ*_t − H*] converge a.s.-P;
(c) Σ_{t=0}^∞ a_t [|∇_θ ψ*_t| − h*] converges a.s.-P, where h* = E|∇_θ ψ*_t|.

The stationarity imposed in Assumption B.1 is extremely convenient; without it, the analysis becomes exceedingly complicated. Assumption B.2(b) imposes a Lipschitz condition on ∇_θ ψ, while Assumption B.3 imposes additional moment conditions and ensures that θ* acts as an asymptotically stable equilibrium. Assumption B.4 plays a role analogous to Assumption A.4 or A.4′. Finally, Assumption B.5 imposes some further convergence conditions beyond those of A.5. Assumption B.5(a) restricts the local fluctuations (quadratic variation) induced by (t + 1)^{-1/2} q*_t in (II.2.4) to be compatible with those of a Wiener process. Assumption B.5(b, c) (together with B.2) ensures that the effects of the second term and the last term in (II.2.5) eventually vanish.

The asymptotic normality result can be stated as follows.

THEOREM II.2.4: Suppose Assumptions B.1–B.3 and B.5 hold, and that θ̂_t → θ* a.s.-P, where {θ̂_t} is generated by (II.1.1) with a_t = (t + 1)^{-1} and θ* an isolated element of Θ*. Then:
(a) {U_t} is tight in ℝ^k;
(b) Σ ≡ Σ_{j=−∞}^∞ σ_j converges, with |Σ| < ∞;
(c) U^t(·) converges weakly to the stationary solution of the stochastic differential equation

	dU(τ) = H U(τ) dτ + Γ dW(τ) ,

where Γ Γ' = Σ and W(·) is a k-variate standard Wiener process; this stationary solution is normally distributed with mean zero and covariance matrix F* = ∫_0^∞ exp[H c] Σ exp[H' c] dc. In particular, F* solves the matrix equation H F* + F* H' = −Σ;
(d) if H* is symmetric, then F* = M L M', where M is an orthogonal matrix such that M' H* M is diagonal, with L constructed from M' Σ M and the diagonal matrix containing the eigenvalues of H*.

If a_t is chosen to be (t + 1)^{-1} A (for finite nonsingular k × k matrix A), then the SDE in Theorem II.2.4(c) becomes dU(τ) = (A H* + I_k/2) U(τ) dτ + A Γ dW(τ), and the covariance matrix of the asymptotic distribution becomes A F* A'. Part (d) gives an alternative expression for the covariance matrix of the asymptotic distribution, analogous to that given by Fabian (1968). Despite the assumed stationarity, Theorem II.2.4 generalizes previous results in that the random variables can be unbounded and the measurements can be correlated (cf. Ljung and Söderström, 1983, Ch. 4, and Fabian, 1968). Again, the properties of mixingales can be exploited to verify the convergence conditions. We impose

ASSUMPTION B.5′:
(a) (i) {ψ*_t, IF_t} is a mixingale of size −2 with c_t ≤ K for some K < ∞, t = 1, 2, ...; (ii) there exists a nonnegative sequence {b_j} of size −2 such that ‖E(ψ*_t ψ*'_{t+j} | IF_0) − σ_j‖_1 ≤ b_j;
(b) {h_4(Z_t) − E(h_4(Z_t)), IF_t}, {∇_θ ψ*_t − H*, IF_t} and {|∇_θ ψ*_t| − h*, IF_t} are mixingales.

COROLLARY II.2.5: Suppose Assumptions B.1–B.3 and B.5′ hold and that θ̂_t → θ* a.s.-P, where {θ̂_t} is generated by (II.1.1) with θ̂_0 arbitrary, a_t = (t + 1)^{-1} and θ* is an isolated element of Θ*. Then the conclusions of Theorem II.2.4 hold. □

This considerably generalizes the earlier result of White (1989a, Theorem 4.1) from the i.i.d. uniformly bounded case to the stationary dependent case. Englund, Holst and Ruppert (1988) also give a result for i.i.d. observations.

Consider a nonlinear regression model f(X_t, θ), f : ℝ^r × D → ℝ, θ ∈ D ⊂ ℝ^k, with X_t a random r × 1 vector and Y_t the random dependent variable. It is common to seek θ*, a solution to

	min_{θ ∈ D} E([Y_t − f(X_t, θ)]²) ,

or equivalently a solution to the first order conditions

	Ψ(θ) = E(∇_θ f(X_t, θ) [Y_t − f(X_t, θ)]) = 0 ,

a k × 1 column vector. The simple RM algorithm for this problem in nonlinear least squares regression is the algorithm (II.1.1) with ψ(Z_t, θ) = ∇_θ f(X_t, θ) [Y_t − f(X_t, θ)], i.e.

	θ̂_{t+1} = θ̂_t + a_t ∇_θ f̂_t [Y_t − f̂_t] ,	(II.3.1)

where we have written f̂_t = f(X_t, θ̂_t), ∇_θ f̂_t = ∇_θ f(X_t, θ̂_t). This is known as a "stochastic gradient method." In this section we consider the properties of this algorithm and two useful variants, the "quick" and the "modified" RM algorithms.

A disadvantage of the simple RM algorithm is that it may converge very slowly (e.g. White, 1988). To improve the speed of convergence, a natural modification is to use an approximate Gauss–Newton direction.

White, 1988). To improve the speed of convergence, a natural modification mate Gauss-Newton

[0/1

1JI(Zt, 8) =

(Zt,

1fI2(Zt,

'/f2(Zt.

0) = G 1 Vo f(Xt.

0) [rt -f(Xt.

0 )J

(II.3.2a)

1 8t+1 =8t+atGt+1 --Vo!t[ft-ft].

(II.3.2b)

symmetric matrix.

A We take Go to be an arbitrary

positive-definite

" The difficulties of applying this algorithm are: (1) the inversion of Gt+l is computation ally

A demanding, and (2) the updating estimates Gt need not be positive-definite, pointing the algo-

rithm in the wrong direction. The first problem can be solved by use of the rank one updating formula for the matrix

inverse. Let Pt+l = Gt"ll and }.t = (1at)/ at. The modified RM algorithm is algebraically

equJvalem 10

(II.3.3a)

A A " " " Ot+l =Ot + at Pt+l Volt [ft-It],

of. Ljung and Sodorstrom (1983, A Ch. 2 & 3). Thc; c;hoil;c; Po = Ik j1) uflclll;ullvcnlem.

(II.3.3b)

" Gt " + at [V 8ft V 8ft ",

" -Gt],

modification

of (1I.3.2a):

Gt+l

(II.3.4a)

" Gt+l =

(II.3.4b)

A where E is some predetermined positive number, and Mt+l (E) is chosen so that Gt+l -El is positive-semidefinite. Some practical implementations of this can be found in Ljung and SoderA strom (1983, Ch. 6). A similar device can be applied to Pt. Implementation be understood to employ a projection device restricting .-pact convex set r such that the max-imum and minimum of this algorithm will
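The rank-one inverse update can be sketched numerically (an illustration of ours, with λ_t = (1 − a_t)/a_t as in the text); it maintains P_t = Ĝ_t^{-1} without explicit inversion.

```python
import numpy as np

def update_G(G, g, a):
    """Covariance recursion (II.3.2a): G <- G + a*(g g' - G)."""
    return G + a * (np.outer(g, g) - G)

def update_P(P, g, a):
    """Rank-one update of P = G^{-1} for the same recursion, avoiding
    explicit inversion; lam = (1 - a)/a as in the text."""
    lam = (1.0 - a) / a
    Pg = P @ g
    return (P - np.outer(Pg, Pg) / (lam + g @ Pg)) / (1.0 - a)
```

Because Ĝ_{t+1} = (1 − a_t)(Ĝ_t + g g'/λ_t), the Sherman–Morrison identity gives the inverse in O(k²) operations per step.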

A simplification of the modified RM algorithm is to choose G to be a diagonal matrix, in particular a scalar, so that matrix inversion is avoided. The quick RM algorithm is (II.1.1) with

	ψ_1(Z_t, ·) = ∇'_θ f(X_t, θ) ∇_θ f(X_t, θ) − e ,
	ψ_2(Z_t, ·) = e^{-1} ∇_θ f(X_t, θ) [Y_t − f(X_t, θ)] ,

i.e.

	ê_{t+1} = ê_t + a_t [∇'_θ f̂_t ∇_θ f̂_t − ê_t] ,	(II.3.5a)
	θ̂_{t+1} = θ̂_t + a_t ê^{-1}_{t+1} ∇_θ f̂_t [Y_t − f̂_t] .	(II.3.5b)

The scalar ê_t can be easily modified to be positive in a manner analogous to (II.3.4); we also restrict ê_t to be bounded. The quick RM algorithm is a compromise between the other two algorithms in that it takes a negative gradient direction with a scaling factor utilizing some local curvature information. Consequently, the quick algorithm ought to converge more quickly than the simple algorithm, though less quickly than the modified algorithm. When a_t = (t + 1)^{-1}, the quick algorithm reduces to the "quick and dirty" algorithm of Albert and Gardner (1967, Ch. 7).
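A minimal sketch of the simple RM recursion (II.3.1) follows (the learning-rate cap and the multiple-pass loop are practical choices of ours, not part of the original description):

```python
import numpy as np

def rm_fit(y, x, theta0, f, grad_f, passes=1,
           a=lambda t: min(0.2, 2.0 / (t + 1))):
    """Simple RM / stochastic-gradient recursion (II.3.1):
    theta <- theta + a_t * grad_f(x_t, theta) * (y_t - f(x_t, theta)).
    Multiple passes through the data continue the learning-rate clock."""
    theta = np.asarray(theta0, dtype=float)
    t = 0
    for _ in range(passes):
        for xt, yt in zip(x, y):
            theta = theta + a(t) * grad_f(xt, theta) * (yt - f(xt, theta))
            t += 1
    return theta
```

With a_t proportional to (t + 1)^{-1} and several passes, the iterate approaches the nonlinear least squares solution; replacing the scalar update with (II.3.5) gives the quick variant.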

It is straightforward to impose conditions ensuring the validity of all assumptions required for the convergence results of the preceding section. Only the mixingale assumptions A.5′ and B.5′ require particular attention. We make use of a convenient and fairly general class of mixingales, near epoch dependent (NED) functions of mixing processes (Billingsley, 1968; McLeish, 1975; Gallant and White, 1988a).

Let {V_t} be a stochastic process on (Ω, IF, P) and define the mixing coefficients

	φ_m = sup_τ sup { |P(G | F) − P(G)| : F ∈ IF^τ_{−∞}, G ∈ IF^∞_{τ+m}, P(F) > 0 } ,
	α_m = sup_τ sup { |P(G ∩ F) − P(G) P(F)| : F ∈ IF^τ_{−∞}, G ∈ IF^∞_{τ+m} } ,

where IF^t_τ = σ(V_τ, ..., V_t). When φ_m → 0 or α_m → 0 as m → ∞ we say that {V_t} is φ-mixing (uniform mixing) or α-mixing (strong mixing). When φ_m = O(m^λ) for some λ < −a, we say {V_t} is φ-mixing of size −a, and similarly for α_m. We use the following definition of near epoch dependence, where we adopt the notation E^{t+m}_{t−m}(·) ≡ E(· | IF^{t+m}_{t−m}).

DEFINITION II.3.1: Let {Z_t} be a sequence of random variables belonging to L_2(P), and let {V_t} be a stochastic process on (Ω, IF, P). Then {Z_t} is near epoch dependent (NED) on {V_t} of size −a if ν_m ≡ sup_t ‖Z_t − E^{t+m}_{t−m}(Z_t)‖_2 is of size −a.

The following three results make it straightforward to impose conditions sufficing for Assumptions A.5′ and B.5′. The first is obtained by following the argument of Theorem 3.1 of McLeish (1975).

PROPOSITION II.3.2: Let {Z_t ∈ L_p(P)}, p ≥ 2, be NED on {V_t} of size −a, where {V_t} is a mixing sequence with φ_m of size −ap/(p − 1) or α_m of size −2ap/(p − 2), p > 2. Then {Z_t − E(Z_t)} is a mixingale of size −a. □

PROPOSITION II.3.3: Let g : ℝ^s → ℝ satisfy a Lipschitz condition, |g(z_1) − g(z_2)| ≤ L |z_1 − z_2|, L < ∞, z_1, z_2 ∈ ℝ^s. Then if {Z_t} satisfies the conditions of Proposition II.3.2, {g(Z_t) − E(g(Z_t))} is a mixingale of size −a. □

PROPOSITION II.3.4: Let {U_t} and {W_t} be two sequences NED on {V_t} of size −a.
(a) If sup_t |W_t| ≤ Δ < ∞ and sup_t ‖U_t‖_4 ≤ Δ < ∞, then {U_t W_t} is NED on {V_t};
(b) if sup_t ‖W_t‖_8 ≤ Δ < ∞ and sup_t ‖U_t‖_8 ≤ Δ < ∞, then sup_t ‖U_t W_t‖_4 ≤ Δ² and {U_t W_t} is NED on {V_t} of size −a/2;
(c) if sup_t ‖U_t‖_8 ≤ Δ < ∞ and {V_t} satisfies the conditions of Proposition II.3.2, then there exist K < ∞ and a sequence {b_t} such that |E(U_t U_{t+j})|^{1/2} ≤ K b_t, with b_t of size −a/2. □

Our subsequent results will make use of Proposition II.3.4(a), requiring sup_t ‖Y_t‖_4 ≤ Δ and a bound on the elements of X_t. Part (b) illustrates use of the Cauchy–Schwarz inequality to relax the boundedness condition; the price for this is a corresponding strengthening of the moment conditions on U_t (corresponding to Y_t). Here we shall adopt boundedness conditions on X_t to minimize the moment conditions placed on Y_t and to facilitate verification of the Lipschitz condition of Proposition II.3.3. Part (c) permits verification of Assumption B.5′(a.ii). We impose the following conditions.

ASSUMPTION C.1: Assumption A.1 holds with {Z_t} NED on {V_t}, where {V_t} is a mixing process; X_t is bounded and sup_t ‖Y_t‖_p ≤ Δ < ∞, p ≥ 4.

ASSUMPTION C.2: f : ℝ^r × D → ℝ. For each x ∈ ℝ^r, f(x, ·) is continuously differentiable, and f(x, ·) and ∇_θ f(x, ·) each satisfy a Lipschitz condition with Lipschitz constants L_1(x) and L_2(x), where L_1 and L_2 are each Lipschitz continuous in x. For each θ ∈ D, f(·, θ) and ∇_θ f(·, θ) each satisfies a Lipschitz condition.


methods thus coincide, so that the RM estimators tend to the same limit(s) as the nonlinear least squares estimator (cf. Ljung and Söderström, 1983). Corollary II.3.5 is more general than the i.i.d. case treated by White (1989a) and the examples given in KC (Ch. 2), as we allow the data to be moderately dependent and heterogeneous. This result differs from those of Metivier and Priouret (1984) in that we require neither "conditional independence" nor stationarity.

Corollary II.3.5 also generalizes a result of Ruppert (1983). Ruppert assumes that for some θ*, Y_t = f(X_t, θ*) + ε_t and that (X_t, ε_t) is strong mixing of size −p/(p − 2), a condition that may fail when X_t contains lagged Y_t, because Y_t need not be mixing when it is generated in this manner, even when ε_t and other elements of X_t are mixing. Indeed, this fact partially motivates our usage of near epoch dependence. Also, we do not require that Y_t is generated in the manner assumed by Ruppert (i.e., we may be estimating a "misspecified" model). Compared to the result of Ljung and Söderström (1983), we allow more dependence in the data, as the data need not be generated by a linear filter.

The modified RM algorithm can be identified with the extended Kalman filter for the nonlinear signal model

    Yt = f(Xt, θt) + εt,
    θt = θ0 for all t.

The Kalman gain is at P̂t+1 ∇θ f̂t. Corollary II.3.5 thus provides conditions more general than previously available ensuring consistency of the filter. In particular, the model can be misspecified and the data can be NED on some underlying mixing sequence. Because the quick RM algorithm includes Albert and Gardner's quick and dirty algorithm, Corollary II.3.5 directly generalizes their consistency result to the case of dependent observations.
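To fix ideas, the simple RM recursion θ̂t+1 = θ̂t + at ∇θ f(Xt, θ̂t)[Yt − f(Xt, θ̂t)] with gain at = 1/(t + 1) can be sketched as below. This is only an illustrative sketch: the linear output function and all variable names are ours, standing in for the network output function f of the text.

```python
import numpy as np

def rm_estimate(X, y, f, grad_f, theta0, n_passes=20):
    """Simple Robbins-Monro (RM) recursion:
    theta_{t+1} = theta_t + a_t * grad_f(x_t, theta_t) * (y_t - f(x_t, theta_t)),
    with decreasing gain a_t = 1/(t+1)."""
    theta = np.array(theta0, dtype=float)
    t = 0
    for _ in range(n_passes):
        for x_t, y_t in zip(X, y):
            a_t = 1.0 / (t + 1)
            theta = theta + a_t * grad_f(x_t, theta) * (y_t - f(x_t, theta))
            t += 1
    return theta

# A linear "network" f(x, theta) = theta'x stands in for the network output.
f = lambda x, th: th @ x
grad_f = lambda x, th: x  # gradient of f with respect to theta

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=500)
theta_hat = rm_estimate(X, y, f, grad_f, theta0=np.zeros(2))
```

With the decreasing gain, repeated passes drive θ̂t toward the least squares limit, here near (1, -2).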

To obtain asymptotic normality results for the case of nonlinear regression, we impose the following conditions.


For this we impose appropriate conditions. In particular, we adopt Assumption C.1. The assumption of uniformly bounded Xt causes no loss of generality in the present context. This is a consequence of the fact that Xt may be replaced by X̃t, where X̃ti = Φ(Xti), i = 1, ..., r, with Φ a bounded one-to-one mapping. If Xt is not uniformly bounded, we proceed in what follows with the implicit understanding that Xt has been transformed so that Assumption C.1 holds. Note, however, that Yt is not assumed bounded, providing the desired generality.

ASSUMPTION E.1: f: ℝ^r × Θ → ℝ is given by (II.4.1), with the network weights restricted to compact subsets of the appropriate Euclidean spaces, and with G: ℝ → ℝ a bounded, continuously differentiable function.

The conditions on G are readily verified for the logistic c.d.f. and hyperbolic tangent "squashers."
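These two squashers can be checked numerically, as in the sketch below; the derivative formulas G' = G(1 - G) and tanh' = 1 - tanh² are the standard ones, and the function names are ours.

```python
import numpy as np

def logistic(u):
    """Logistic c.d.f. squasher G(u) = 1 / (1 + exp(-u)), bounded in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-u))

def d_logistic(u):
    """G'(u) = G(u)(1 - G(u)); continuous and bounded by 1/4."""
    g = logistic(u)
    return g * (1.0 - g)

def d_tanh(u):
    """tanh'(u) = 1 - tanh(u)^2; continuous and bounded by 1."""
    return 1.0 - np.tanh(u) ** 2

u = np.linspace(-15.0, 15.0, 10001)
assert np.all((logistic(u) > 0.0) & (logistic(u) < 1.0))  # G is bounded
assert np.all(np.abs(np.tanh(u)) < 1.0)                   # tanh is bounded
assert np.max(d_logistic(u)) <= 0.25                      # derivative bounds
assert np.max(d_tanh(u)) <= 1.0
```

Both activations are thus bounded with bounded continuous derivatives, as Assumption E.1 requires.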

COROLLARY II.4.1: Given Assumptions C.1, E.1, C.3 and A.4', let {θ̂t} be given by (II.3.1), (II.3.2) or (II.3.5) (the simple, modified and quick algorithms, respectively) with θ̂0 chosen arbitrarily. Then the conclusions of Corollary II.3.5 hold.

Thus the method of back-propagation and its generalizations converge to a parameter vector giving a locally optimal approximation to E(Yt | Xt). This conclusion generalizes Theorem 3.2 of White (1989a). For the asymptotic distribution results, we impose the following condition.

ASSUMPTION F.1: G is continuously differentiable of order 4.

COROLLARY II.4.2: Suppose Assumptions D.1, D.2 and F.1 hold and that θ̂t → θ* a.s.-P, where {θ̂t} is generated by (II.3.1), (II.3.2) or (II.3.5) with θ̂0 chosen arbitrarily, at = (t + 1)⁻¹, and θ* is an isolated element of Θ*. Then the conclusions of Theorem II.2.4 hold.


considered here. For many choices of ψ, the analysis parallels that for the least squares case rather closely. These results are within relatively easy reach for such estimation procedures.

For neural network models, it is desirable to relax the assumption that q is fixed. Letting q → ∞ as the available sample becomes arbitrarily large permits use of neural network models for purposes of non-parametric estimation. Off-line non-parametric estimation methods for the case of mixing processes are treated by White (1990a) using results for the method of sieves (Grenander, 1981; White and Wooldridge, 1991). On-line non-parametric estimation methods appear possible, but will require convergence to a global optimum of the underlying least squares problem, not just the local optimum that the present methods deliver. Results of Kushner (1987) for the method of simulated annealing provide hope that convergence to the global optimum is achievable for the case of dependent observations with appropriate modifications to the RM procedure.

Finally, it is of interest to consider RM algorithms for neural network models that generalize the feedforward networks treated here by allowing certain internal feedbacks. Such "recurrent" network models have been considered by Jordan (1986), Elman (1988) and Williams and Zipser (1989). For example, in the Elman (1988) setup, hidden layer activations feed back,

so that network output at each date depends on current inputs and on the previous hidden layer activations (At0, At1, ..., Atq), permitting quite complicated dynamic behavior of network output. Learning in such models is complicated by the fact that at any stage of learning, network output depends not only on the entire past history of inputs Xt, but also on the entire past history of estimated parameters θ̂t. Results of KC are relevant for treating such internal feedbacks. Convergence of RM estimates in recurrent networks is studied by Kuan, Hornik and White (1990).
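A minimal sketch of such an internal feedback (an Elman-style recurrence) is given below; the dimensions, weights and names here are illustrative inventions, not the notation of the original.

```python
import numpy as np

def elman_step(x_t, a_prev, W_in, W_rec, beta, squash=np.tanh):
    """One step of an Elman-style recurrent network: the hidden activations
    a_prev from the previous period feed back into the hidden layer, so
    current output depends on the entire past history of inputs."""
    a_t = squash(W_in @ x_t + W_rec @ a_prev)  # hidden layer with feedback
    o_t = beta @ a_t                           # network output
    return a_t, o_t

rng = np.random.default_rng(1)
r, q = 3, 4                            # input and hidden-layer dimensions
W_in = rng.normal(size=(q, r))         # input-to-hidden weights
W_rec = 0.5 * rng.normal(size=(q, q))  # hidden-to-hidden feedback weights
beta = rng.normal(size=q)              # hidden-to-output weights

a_t = np.zeros(q)                      # initial hidden activations
outputs = []
for t in range(10):
    x_t = rng.normal(size=r)
    a_t, o_t = elman_step(x_t, a_t, W_in, W_rec, beta)
    outputs.append(o_t)
```

Because a_t enters the next step, perturbing any past input changes all subsequent outputs, which is precisely what complicates learning in recurrent models.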


II.2.1(b) follows from Theorem II.2.1(c). Finally, we show that cycling between two asymptotically stable equilibria is impossible. It is easy to see that points in Θ* must be isolated. Let θ1* and θ2* be two isolated points in Θ*, and let Nε1 and Nε2 be neighborhoods of θ1* and θ2*, respectively, such that Nε1 ⊂ d(Θ*), Nε2 ⊂ d(Θ*), and Nε1 ∩ Nε2 = ∅. If the path of θ̂t cycles between θ1* and θ2*, then θ̂t must move from, say, N1 to N2 infinitely often: for every T there is a t > T such that θ̂t ∈ Nε1, and a later t for which θ̂t ∈ Nε2. But then a solution θ̄(·) satisfying the limiting ODE dθ̄/dτ = π[h̄(θ̄)] cannot converge, contradicting Theorem II.2.1(d).

PROOF OF COROLLARY II.2.3: The result follows from Theorem II.2.1, because the summability condition on at in Assumption A.4' implies at → 0 as t → ∞, and Assumption A.5' implies Assumption A.5 by the mixingale convergence theorem (McLeish, 1975, Corollary 1.8). □

PROOF OF THEOREM II.2.4: We first observe that the conditions [A1], [A4], [A7] and [A8] of KH are directly assumed, and that [A3] of KH is ensured by Assumption B.5(c) and Lemma A1.

Second, we show that the consequence of [A2] of KH holds under Assumptions B.2(b) and B.5(b, c). This amounts to showing that the second assertion in Lemma 1 of KH holds. By Assumption B.2(b) we have the bound (a.10). Clearly, the integral on the RHS of (a.10) converges to zero a.s., because θ̂t → θ* a.s. Let {εk} be a sequence of positive real numbers such that Σk εk < ∞, and let {Nk} be a sequence of integers tending to infinity.


PROOF OF COROLLARY II.2.5: We observe that Assumption B.5'(b) is a mixingale condition ensuring Assumption B.5(b, c) by the mixingale convergence theorem. To establish Assumption B.5(a), we see that Assumption B.5'(a.i) ensures that for some K < ∞,

    κt = || E(ψt* | F0) ||2 ≤ K μt.   (a.15)

The fact that the memory coefficient μt is of size -2 implies that Σt κt < ∞, establishing Assumption B.5(a.i). Similarly,

    ζt = sup_{j≥0} || E(ψt* ψt+j*′ - E(ψt* ψt+j*′) | F0) ||2 ≤ K bt.

That bt is of size -2 ensures that Σt ζt < ∞. This establishes Assumption B.5(a.ii). □

PROOF OF PROPOSITION II.3.3: See Andrews (1989, Lemma 1). □

PROOF OF PROPOSITION II.3.4: (a) Let Ut,m = E_{t-m}^{t+m}(Ut) and Wt,m = E_{t-m}^{t+m}(Wt). We employ the fact that E_{t-m}^{t+m}(Ut Wt) is the best L2-predictor of Ut Wt among all F_{t-m}^{t+m}-measurable functions. Hence,

    || Ut Wt - E_{t-m}^{t+m}(Ut Wt) ||2 ≤ || Ut Wt - Ut,m Wt,m ||2.
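The step from the best-predictor property to the NED bound rests on the usual product decomposition, which (in our reconstruction, not verbatim from the original) reads:

```latex
\begin{align*}
U_t W_t - U_{t,m} W_{t,m}
  &= U_t\,(W_t - W_{t,m}) + W_{t,m}\,(U_t - U_{t,m}),\\
\bigl\| U_t W_t - U_{t,m} W_{t,m} \bigr\|_2
  &\le \bigl\| U_t\,(W_t - W_{t,m}) \bigr\|_2
     + \bigl\| W_{t,m}\,(U_t - U_{t,m}) \bigr\|_2 ,
\end{align*}
```

after which each term is controlled by the Cauchy-Schwartz inequality together with the moment bounds on Ut and Wt.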


Similarly,

    || Ut Wt - Ut,m Wt ||2 ≤ Δ^{3/2} || Ut - Ut,m ||2.

Consequently, || Ut Wt - E_{t-m}^{t+m}(Ut Wt) ||2 is of the order of (νU,m + νW,m), where νU,m and νW,m are the NED coefficients of Ut and Wt.

(c) By the same reasoning,

    || Ut Ut+j - E_{t-m}^{t+j+m}(Ut Ut+j) ||2
      ≤ || Ut Ut+j - E_{t-m}^{t+m}(Ut) E_{t+j-m}^{t+j+m}(Ut+j) ||2
      ≤ || Ut Ut+j - Ut E_{t+j-m}^{t+j+m}(Ut+j) ||2 + || Ut E_{t+j-m}^{t+j+m}(Ut+j) - E_{t-m}^{t+m}(Ut) E_{t+j-m}^{t+j+m}(Ut+j) ||2.

Writing E0(·) = E(· | F0), we have

    || E0(Ut Ut+j) - E(Ut Ut+j) ||2
      ≤ || E0 E_{t-s}^{t+j+s}(Ut Ut+j) - E(Ut Ut+j) ||2 + || E0[Ut Ut+j - E_{t-s}^{t+j+s}(Ut Ut+j)] ||2,   (a.17)

where s = [t/2], with the second term controlled by the near epoch dependence of {Ut}. By Jensen's inequality, the first term is bounded by K times the relevant mixing coefficient, where K is a constant. The result then follows from Lemma 2.1 of McLeish (1975) and Lemma 3.14 of Gallant and White (1988a). □

PROOF OF COROLLARY II.3.5: We verify the conditions of Corollary II.2.3. Because the other conditions obviously hold, for the simple RM estimates it suffices to show that Assumptions A.2(b) and A.5' hold. Given Assumption C.2, it is straightforward to verify that f and ∇θf are such that |f(x, θ)| ≤ Q1(x) and |∇θf(x, θ)| ≤ Q2(x) for all θ ∈ Θ (compact), where Q1 and Q2 are Lipschitz continuous in x.


Hence

    |ψ(z, θ)| = |∇θf(x, θ)[y - f(x, θ)]| ≤ Q2(x)[|y| + Q1(x)] ≡ h1(z),   (a.18)

and

    |ψ(z, θ1) - ψ(z, θ2)| = |∇θf(x, θ1)[y - f(x, θ1)] - ∇θf(x, θ2)[y - f(x, θ2)]|
      ≤ |∇θf(x, θ1)y - ∇θf(x, θ2)y| + |∇θf(x, θ2)f(x, θ2) - ∇θf(x, θ1)f(x, θ1)|.

It follows from Assumption C.2 that

    |∇θf(x, θ1)y - ∇θf(x, θ2)y| ≤ |y| L2(x) |θ1 - θ2|,

and

    |∇θf(x, θ2)f(x, θ2) - ∇θf(x, θ1)f(x, θ1)|
      ≤ |∇θf(x, θ2)f(x, θ2) - ∇θf(x, θ2)f(x, θ1)| + |∇θf(x, θ2)f(x, θ1) - ∇θf(x, θ1)f(x, θ1)|
      ≤ [Q2(x) L1(x) + Q1(x) L2(x)] |θ1 - θ2|.

This establishes Assumption A.2(b.ii).


Because |y|, L1(x), L2(x), Q1(x) and Q2(x) satisfy Lipschitz conditions, Proposition II.3.3 ensures that |Yt|, L1(Xt), L2(Xt), Q1(Xt) and Q2(Xt) are NED on {Vt} of size -1. Because Xt is bounded, Q1(Xt), Q2(Xt), L1(Xt) and L2(Xt) are bounded. Because ||Yt||4 ≤ Δ1, it follows from Proposition II.3.4(a) and Corollary 4.3(a) of Gallant and White (1988a) (i.e., sums of random variables NED of size -a are also NED of size -a) that h3(Zt) is NED on {Vt} of size -1/2. The mixing conditions of Assumption C.1 then ensure that {h3(Zt) - Eh3(Zt)} is a mixingale of size -1/2 by Proposition II.3.2, verifying Assumption A.5'(ii). Similarly, {h2(Zt) - Eh2(Zt)} is a mixingale of size -1/2.

We next verify that for each θ ∈ Θ, {ψ(Zt, θ)} is a mixingale of size -1/2. Fix θ ∈ Θ. Observe that the Lipschitz condition on f(·, θ) and the conditions on {Zt} imply by Proposition II.3.3 that {f(Xt, θ)} is NED on {Vt} of size -1, so that {Yt - f(Xt, θ)} is also NED on {Vt} of size -1. Similarly, the continuity of f(·, θ) and the Lipschitz condition on {∇θf(·, θ)} imply by Proposition II.3.3 that {∇θf(Xt, θ)} is also NED on {Vt} of size -1. Further, the elements of ∇θf(Xt, θ) are bounded, so that by Proposition II.3.4(a), {ψ(Zt, θ) = ∇θf(Xt, θ)[Yt - f(Xt, θ)]} is NED on {Vt} of size -1/2. It then follows from the mixing conditions imposed on {Vt} by Assumption C.1 that Assumption A.5'(i) holds, and the result for the simple RM procedure follows.

For the modified RM estimates, we first note that every element of G⁻¹ is bounded above, so that |G⁻¹| < η for some η.

Now

    |G⁻¹ ∇θf(x, θ)[y - f(x, θ)]| ≤ η Q2(x)[|y| + Q1(x)],


and

    |vec(∇θf(x, θ) ∇θf(x, θ)′ - G)| ≤ |∇θf(x, θ)|² + |vec G|,

where for any matrix A, |A| = [tr(A′A)]^{1/2}. Hence Assumption A.2(b.i) holds, as both components of ψ are dominated by a function h1(z) of the required form. Next, the elements of G⁻¹ are continuously differentiable functions of the elements of G on a convex compact set Γ, so the mean value theorem applies.

A matrix differentiation result gives that when G is symmetric and nonsingular,

    ∂ vec(G⁻¹)/∂gij = -vec(G⁻¹ Sij G⁻¹),

where gij is the ij-th element of G and Sij is a selection matrix whose every element is zero except that the ij-th and ji-th elements are one; see Graybill (1983, p. 358). Hence we can write

    vec[∇θf(x, θ1) ∇θf(x, θ1)′] - vec[∇θf(x, θ2) ∇θf(x, θ2)′]
      = (I ⊗ ∇θf(x, θ1)) [∇θf(x, θ1) - ∇θf(x, θ2)] + (∇θf(x, θ2) ⊗ I) [∇θf(x, θ1) - ∇θf(x, θ2)],

using the fact that vec(ABC) = (C′ ⊗ A) vec B. It can be verified that |I ⊗ ∇θf(x, θ1)| and |∇θf(x, θ2) ⊗ I| are each bounded by k^{1/2} Q2(x), where k is the dimension of θ.

It follows that

    |ψ1(z, θ1) - ψ1(z, θ2)| ≤ 2K Q2(x) L2(x) |θ1 - θ2|

and

    |ψ2(z, θ1) - ψ2(z, θ2)| = |vec(G2 - G1)| ≤ h3′(z) |θ1 - θ2|.

Hence Assumption A.2(b.ii) holds, as

    |ψ(z, θ1) - ψ(z, θ2)| ≤ |ψ1(z, θ1) - ψ1(z, θ2)| + |ψ2(z, θ1) - ψ2(z, θ2)| ≤ h3(z) |θ1 - θ2|

for an appropriate h3(z). Using the same arguments as before, we have that {h2(Zt) - Eh2(Zt)}, {h3(Zt) - Eh3(Zt)} and, for each θ ∈ Θ, {ψ(Zt, θ) - Eψ(Zt, θ)} are mixingales of size -1/2.


Hence Assumption A.5' also holds. This yields the desired results for the modified RM estimates. The conclusions for the quick RM estimates follow because the quick algorithm is a special case of the modified algorithm. □

PROOF OF COROLLARY II.3.6: For the RM estimates we need to show that Assumptions B.2(b) and B.5' hold. In this case,

    ∇θψ(z, θ) = ∇θ(∇θf(x, θ)[y - f(x, θ)]) = ∇θθf(x, θ)[y - f(x, θ)] - ∇θf(x, θ) ∇θf(x, θ)′.

Hence, for θ in int Θ and θ° in Θ° (writing f for f(x, θ) and f° for f(x, θ°)),

    |∇θψ(z, θ) - ∇θψ(z, θ°)|
      = |∇θθf (y - f) - ∇θf ∇θf′ - ∇θθf° (y - f°) + ∇θf° ∇θf°′|
      ≤ |∇θθf y - ∇θθf° y| + |(∇θθf°) f° - (∇θθf) f| + |∇θf° ∇θf°′ - ∇θf ∇θf′|.

Now |∇θθf y - ∇θθf° y| ≤ |y| L3(x) |θ - θ°|, and

    |(∇θθf°) f° - (∇θθf) f| ≤ |∇θθf°| L1(x) |θ - θ°| + Q1(x) L3(x) |θ - θ°| ≤ [Q3(x) L1(x) + Q1(x) L3(x)] |θ - θ°|,

since |∇θθf| ≤ Q3(x), with Q3 Lipschitz continuous in x by straightforward arguments. Further,

    |∇θf° ∇θf°′ - ∇θf ∇θf′| ≤ |∇θf° ∇θf°′ - ∇θf° ∇θf′| + |∇θf° ∇θf′ - ∇θf ∇θf′| ≤ 2 Q2(x) L2(x) |θ - θ°|,

so that (a.20) holds. (Under Assumptions D.1 and D.2, θ° = θ*.)


It follows from (a.20) that the first term in (a.22) is less than

    η [ |y| L3(x) + Q3(x) L1(x) + Q1(x) L3(x) + 2 Q2(x) L2(x) ] |θ - θ°|.

It can also be verified that the second term in (a.22) is less than

    |G⁻¹ - (G°)⁻¹| [ Q3(x)(|y| + Q1(x)) + (Q2(x))² ],

with |G⁻¹ - (G°)⁻¹| controlled via |vec(G - G°)|. Hence

    | G⁻¹ [∇θθf (y - f) - ∇θf ∇θf′] - (G°)⁻¹ [∇θθf° (y - f°) - ∇θf° ∇θf°′] | ≤ h4′(z) |θ - θ°|,

where

    h4′(z) ≡ η [ |y| L3(x) + Q3(x) L1(x) + Q1(x) L3(x) + 2 Q2(x) L2(x) ].

We also note the fact that |A| ≤ |vec A| ≤ Σi Σj |aij|, where A is a square matrix and aij its ij-th element. Hence

    |∇θψ(z, θ) - ∇θψ(z, θ°)| ≤ h4(z) |θ - θ°|,

which establishes Assumption B.2(b).

The asymptotic distribution result for θ̂t then follows from Corollary II.2.5 with H* = E(∇θψt*), where ∇θψ(z, θ) is given by (a.21).


where the first equality follows from the fact that exp[(-Ik/2)c] = [exp(-c/2)] Ik. The matrix H3* is also block triangular, and the lower right k×k block of Σ3 is such that

    (t + 1)^{1/2} (θ̂t - θ*) →d N(0, F3),

where F3 - (G*)⁻¹ is a positive semidefinite matrix. From Theorem II.2.4(c) we get

    -Σ1 = (H1 + I/2) F1 + F1 (H1 + I/2)′ = H1 F1 + F1 H1′ + F1.

Hence,

    -(G*)⁻¹ Σ1 (G*)⁻¹ = (G*)⁻¹ (H1* F1* + F1* H1*′ + F1*) (G*)⁻¹,

and

    (G*)⁻¹ (H1* F1* + F1* H1*′ + F1*) (G*)⁻¹ - F1* (G*)⁻¹ - (G*)⁻¹ F1* + (G*)⁻¹ F1* (G*)⁻¹

is positive semidefinite, where (F1*)^{1/2} is such that (F1*)^{1/2} (F1*)^{1/2} = F1*. Since Σ̃1 = Σ1, the result holds. □

PROOF OF COROLLARY II.4.1: Given the special structure of f in (II.4.1) and the continuous differentiability of G, it is straightforward to verify the domination and Lipschitz conditions required for application of Corollary II.3.5. □

PROOF OF COROLLARY II.4.2: Direct application of Corollary II.3.6. □


TABLE 1

DETERMINISTIC CHAOS APPROXIMATED BY LINEAR MODEL†

Logistic Map (columns: N, â, σ̂, R², SIC; recorded SIC values: -1.60, -1.20)

† N = number of observations; â = regression coefficient; σ̂ = regression standard error; SIC = Schwartz Information Criterion: SIC = log σ̂ + k log(N)/(2N), where k = number of estimated coefficients (= 2).
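For reference, the SIC defined in the footnote to Table 1 can be computed as in the following sketch (the function and variable names are ours):

```python
import math

def schwarz_ic(sigma_hat, k, N):
    """Schwartz Information Criterion as defined in the table footnote:
    SIC = log(sigma_hat) + k * log(N) / (2 * N),
    with sigma_hat the regression standard error, k the number of
    estimated coefficients, and N the number of observations."""
    return math.log(sigma_hat) + k * math.log(N) / (2.0 * N)

# Smaller SIC is better: the penalty term k*log(N)/(2N) trades goodness
# of fit against the number of estimated coefficients.
sic_small = schwarz_ic(sigma_hat=0.01, k=2, N=250)
sic_big = schwarz_ic(sigma_hat=0.01, k=25, N=250)
```

Holding σ̂ fixed, the more heavily parameterized model is penalized, so sic_big exceeds sic_small.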

TABLE 2

DETERMINISTIC CHAOS APPROXIMATED BY SINGLE HIDDEN LAYER NETWORK†

Logistic Map (columns: q*, N, σ̂, R², SIC; recorded values: q* = 8, N = 250, σ̂ = 1.35 x 10^-3, 2.34 x 10^-2, 2.68 x 10^-4, R² = .9999, SIC = -6.32, -3.46, -7.93)

† q* = SIC-optimal number of hidden units; remaining symbols as in Table 1.

REFERENCES

Albert, A.E. and L.A. Gardner (1967): Stochastic Approximation and Nonlinear Regression. Cambridge: M.I.T. Press.

Amemiya, T. (1981): "Qualitative Response Models: A Survey," Journal of Economic Literature 19, 1483-1536.

Amemiya, T. (1985): Advanced Econometrics. Cambridge: Harvard University Press.

Andrews, D. W.K. (1989): "An Empirical Process Central Limit Theorem for Dependent NonIdentically Distributed Random Variables," Cowles Foundation Discussion Paper, Yale University.

Andrews, D.W.K. (1991a): "Asymptotic Optimality of Generalized CL, Cross-validation and Generalized Cross-validation in Regression with Heteroskedastic Errors," Journal of Econometrics 47, 359-378.

Andrews, D.W.K. (1991b): "Asymptotic Normality of Series Estimators for Nonparametric and Semi-parametric Regression Models," Econometrica 59, 307-345.

Arnold, L. (1974): Stochastic Differential Equations: Theory and Applications. New York: John Wiley & Sons.

Barron, A.: University of Illinois at Urbana-Champaign Department of Statistics Technical Report 57.

Barron, A. (1991a): "Universal Approximation Bounds for Superpositions of a Sigmoidal Function," University of Illinois at Urbana-Champaign Department of Statistics Technical Report.

Barron, A. (1991b): "Approximation and Estimation Bounds for Artificial Neural Networks," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 59.

Baxt, W.G. (1991): "The Optimization of the Training of an Artificial Neural Network Trained to Recognize the Presence of Myocardial Infarction by the Variance of Disease Likelihood," UC San Diego Medical Center Technical Report.

Bierens, H. (1990): "A Consistent Conditional Moment Test of Functional Form," Econometrica 58, 1443-1458.

Billingsley, P. (1968): Convergence of Probability Measures. New York: John Wiley & Sons.

Blum, J.R. (1954): "Approximation Methods Which Converge with Probability One," Annals of Mathematical Statistics 25, 382-386.

Blum, E.K. and L.K. Li (1991): "Approximation Theory and Feedforward Networks," Neural Networks 4, 511-516.

Carroll, S.M. and B.W. Dickinson (1989): "Construction of Neural Nets Using the Radon Transform," in Proceedings of the International Joint Conference on Neural Networks.

Cybenko, G. (1989): "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals and Systems 2, 303-314.

Cowan, J. (1967): "A Mathematical Theory of Central Nervous Activity," unpublished Ph.D. dissertation, University of London.

"Hypothesis

-80-

AItemative," Biometrika 64,247-254. Davies, R.B. (1987): "Hypothcsis Tcsting Whcn a Nui3ancc Paramctcci~ Pcc~cnt 011ly UllUCl lIIC

Domowitz, I. and H. White (1982): "Misspecified Models with Dependent Observations," Journal of Econometrics 20, 35-58.

Duffie, D. and K.J. Singleton (1990): "Simulated Moments Estimation of Markov Models of Asset Prices."

Elbadawi, I., A.R. Gallant and G. Souza (1983): "An Elasticity Can be Estimated Consistently Without A Priori Knowledge of Functional Form," Econometrica 51, 1731-1752.

Elman, J.L. (1988): "Finding Structure in Time," CRL Report 8801, Center for Research in Language, UC San Diego.

Englund, J.-E., U. Holst, and D. Ruppert (1988): "Recursive M-Estimators of Location and Scale for Dependent Sequences," Scandinavian Journal of Statistics 15, 147-159.

Fabian, V. (1968): "On Asymptotic Normality in Stochastic Approximation," Annals of Mathematical Statistics 39, 1327-1332.

Foutz, R.V. and R.C. Srivastava (1977): "The Performance of the Likelihood Ratio Test When the Model is Incorrect," Annals of Statistics 5, 1183-1194.

Friedman, J.H. and W. Stuetzle (1981): "Projection Pursuit Regression," Journal of the American Statistical Association 76, 817-823.

Fukushima, K. and S. Miyake (1984): "Neocognitron: A New Algorithm for Pattern Recognition Tolerant of Deformations and Shifts in Position."

Gallant, A.R. (1973): "Inference for Nonlinear Models," North Carolina State University, Institute of Statistics, Mimeograph Series No. 875.

Gallant, A.R.: "... Unbiased ... Functional ... Regression Model," ... .

Gallant, A.R.: in T. Bewley ed., Advances in Econometrics: Fifth World Congress. New York: Cambridge University Press.

Gallant, A.R. and H. White (1988a): A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Basil Blackwell.

Gallant, A.R. and H. White (1988b): "There Exists a Neural Network That Does Not Make Avoidable Mistakes," Proceedings of the Second Annual IEEE Conference on Neural Networks, San Diego. New York: IEEE Press.

Gallant, A.R. and H. White (1991): "On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks," Neural Networks 4 (to appear).

Gamba, A., L. Gamberini, G. Palmieri and R. Sanna (1961): "Further Experiments with PAPA," Nuovo Cimento Suppl. 20, 221-231.

Gauss, C.F. (1809): Theoria Motus Corporum Coelestium. English translation (1963): Theory of the Motion of the Heavenly Bodies. New York: Dover.

Gerencser, L. (1986): "Parameter Tracking of Time-Varying Continuous-Time Linear Stochastic Systems," in C.E. Byrnes and A. Lindquist eds., Modelling, Identification and Robust Control. New York: Elsevier, pp. 581-594.

Goldstein, L. (1988): "On the Choice of Step Size in the Robbins-Monro Procedure," Statistics and Probability Letters 6, 299-303.

Gourieroux, C., A. Monfort and A. Trognon (1984a): "Pseudo-Maximum Likelihood Methods: Theory," Econometrica 52, 681-700.

Gourieroux, C., A. Monfort and A. Trognon (1984b): "Pseudo-Maximum Likelihood Methods: Application to Poisson Models," Econometrica 52, 701-720.

Graybill, F.A. (1983): Matrices with Applications in Statistics, second edition. Belmont: Wadsworth.

Hansen, B. (1991): "Inference When a Nuisance Parameter Is Not Identified Under the Null Hypothesis."

Hecht-Nielsen, R. (1989): "Theory of the Back-Propagation Neural Network," Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. I:593-606.

... (1990): "... Models," Duke Institute of Statistics and Decision Sciences Discussion Paper 90A15.

... Neural Networks 4, 231-242.

Hornik, K. and C.-M. Kuan (1990): "Convergence of Learning Algorithms with Constant Learning Rates," University of Illinois at Urbana-Champaign Department of Economics Discussion Paper.

Hornik, K., M. Stinchcombe, and H. White (1989): "Multi-Layer Feedforward Networks Are Universal Approximators," Neural Networks 2, 359-366.

Hornik, K., M. Stinchcombe and H. White (1990): "Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks," Neural Networks 3, 551-560.

Hu, S. and W. Joerding (1990): "Monotonicity and Concavity Restrictions for a Single Hidden Layer Feedforward Network," Washington State University Department of Economics Discussion Paper.

"Robust Estimation

Statis-

tics 35,73-101.

Huber, P.J. (1967): "The Behavior of Maximum Likelihood Estiamtes Under Nonstandard ConOitions,' E'rOCeedmgs of the r tflh Berkeley ~ympostum on Mathematical

and Probability. Berkeley: University of California Press, 1, pp. 221-233.

Statistics

... "... A Priori Information in Neural Networks," ... .

Jones, L.K. (1991): "A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training,"

Annals of Statistics (forthcoming).


Jordan, M.I. (1986): "Serial Order: A Parallel Distributed Processing Approach," Institute for Cognitive Science Report, UC San Diego.

UC San Diego,

Kuan, C.-M. (1989): "Estimation of Neural Network Models," Ph.D. Dissertation, UC San Diego.

Kuan, C.-M., K. Hornik and H. White (1990): "Some Convergence Results for Learning in Recurrent Neural Networks," UCSD Department of Economics Discussion Paper.

Kuan, C.-M. and H. White (1991): "Strong Convergence of Recursive m-estimators for Models with Dynamic Latent Variables," UC San Diego Department of Economics Discussion Paper 91-05R.

Kushner, H.J. (1987): "Asymptotic Global Behavior for Stochastic Approximation and Diffusions with Slowly Decreasing Noise Effects: Global Minimization via Monte Carlo," SIAM Journal of Applied Mathematics 47, 169-185.

Kushner, H.J. and D.S. Clark (1978): Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag.

Kushner, H.J. and H. Huang (1979): "Rates of Convergence for Stochastic Approximation Type Algorithms," SIAM Journal of Control and Optimization 17, 607-617.

Kushner, H.J. and H. Huang (1981): "Asymptotic Properties of Stochastic Approximations with Constant Coefficients," SIAM Journal of Control and Optimization 19, 87-105.

Lapedes, A. and R. Farber (1987): "Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling," Los Alamos National Laboratory Technical Report.

Le Cun, Y. (1985): "Une Procedure d'Apprentissage pour Reseau a Seuil Asymetrique," Proceedings of Cognitiva 85.

Lee, T.H., H. White and C.W.J. Granger (1991): "Testing for Neglected Nonlinearity in Time Series Models."

Li, K.-C. (1987): "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized CrossValidation: Discrete Index Set," Annals of Statistics 15, 958-975.

Ljung, L. (1977): "Analysis of Recursive Stochastic Algorithms," IEEE Transactions on Automatic Control AC-22, 551-575.

Ljung, L. and T. Soderstrom (1983): Theory and Practice of Recursive Identification. Cambridge: M.I.T. Press.

Lukacs, E. (1975): Stochastic Convergence. 2nd ed., New York: Academic Press.

Marcet, A. and T.J. Sargent (1989): "Convergence of Least Squares Learning Mechanisms in Self Referential, Linear Stochastic Models," Journal of Economic Theory 48, 337-368.

Maxwell, T., G.L. Giles, Y.C. Lee and H.H. Chen (1986): "Nonlinear Dynamics of Artificial Neural Systems," in J. Denker ed., Neural Networks for Computing. New York:

American Institute of Physics.

McLeish, D.L. (1975): "A Maximal Inequality and Dependent Strong Laws," Annals of Probability 3, 829-839.

Metivier, M. and P. Priouret (1984): "Applications of a Kushner and Clark Lemma to General Classes of Stochastic Algorithm," IEEE Transactions on Information Theory IT-30, 140-151.

McCulloch, W.S. and W. Pitts (1943): "A Logical Calculus of the Ideas Immanent in Nervous

Activity," Bulletin of Mathematical Biophysics 5, 115-133.

... Cambridge: MIT Press.


Morris, R. and W.-S. Wong (1991): "Systematic Choice of Initial Points in Local Search: Extensions and Application to Neural Networks," Information Processing Letters (forthcoming).

Newey, W. (1985): "Maximum Likelihood Specification Testing and Conditional Moment Tests," Econometrica 53, 1047-1070.

Palmieri, G. and R. Sanna (1960): Methodos 12, No. 48.

Parker, D.B. (1982): "Learning Logic," Invention Report 581-64 (File 1), Stanford University Office of Technology Licensing.

Parker, D.B. (1985): "Learning Logic," MIT Center for Computational Research in Economics and Management Science Technical Report.

Potscher, B. and I. Prucha (1991a): "Basic Structure of the Asymptotic Theory in Dynamic Nonlinear Econometric Models, Part I: Consistency and Approximation Concepts," Econometric Reviews (forthcoming).

Potscher, B. and I. Prucha (1991b): "Basic Structure of the Asymptotic Theory in Dynamic Nonlinear Econometric Models, Part II: Asymptotic Normality," Econometric Reviews (forthcoming).

Robbins, H. and S. Monro (1951): "A Stochastic Approximation Method," Annals of Mathematical Statistics 22, 400-407.

Rosenblatt, F. (1957): "The Perceptron: A Perceiving and Recognizing Automaton," Project PARA, Cornell Aeronautical Laboratory.

Rosenblatt, F. (1958): "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review 65, 386-408.

Rosenblatt, F. (1961): Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington D.C.: Spartan Books.

Rumelhart, D.E., G.E. Hinton and R.J. Williams (1986): "Learning Internal Representations by Error Propagation," in D.E. Rumelhart and J.L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge: MIT Press, 1, pp. 318-362.

Ruppert, D. (1983): "Convergence of Stochastic Approximation Algorithms with Non-Additive Dependent Disturbances and Applications," in U. Herkenrath, D. Kalin and W. Vogel eds., Mathematical Learning Models-Theory and Algorithms. New York: Springer-Verlag, pp. 182-190.

Sawa, T. (1978): "Information Criteria for Discriminating Among Alternative Regression Models," Econometrica 46, 1273-1291.

Sejnowski, T.J. and C.R. Rosenberg (1986): "NETtalk: A Parallel Network That Learns to Read Aloud," Johns Hopkins University Department of Electrical Engineering and Computer Science Technical Report 86/01.

Selfridge, O., R. Sutton and A. Barto (1985): "Training and Tracking in Robotics," Proceedings of the Ninth International Joint Conference on Artificial Intelligence. Los Angeles: Morgan Kaufman, 1, pp. 670-672.

Sontag, E. (1990): "Feedback Stabilization Using Two-Hidden-Layer Nets," Rutgers Center for Systems and Control Technical Report SYCON-90-11.

Stinchcombe, M. (1991): "Inner Functions and Universal Approximation Properties," UC San Diego Department of Economics Discussion Paper.


Stinchcombe, M. and H. White (1989): "Universal Approximation Using Feedforward Networks with Non-Sigmoid Hidden Layer Activation Functions," Proceedings of the International Joint Conference on Neural Networks, San Diego. New York: IEEE Press, pp. I:612-617.

Stinchcombe, M. and H. White, (1991): "Consistent Specification Testing Using Duality," UC San Diego Department of Economics Discussion Paper.

Sussman, H. (1991): "Uniqueness of the Weights for Minimal Feedforward Nets with a Given Input-Output Map," Rutgers Center for Systems and Control Technical Report SYCON-91-06.

Sydsaeter, K. (1981): Topics in Mathematical Analysis for Economists. New York: Academic Press.

Tauchen, G. (1985): "Diagnostic Testing and Evaluation of Maximum Likelihood Models," Journal of Econometrics 30, 415-444.

Thompson, J.M.T. and H.B. Stewart (1986): Nonlinear Dynamics and Chaos. New York: Wiley.

Walk, H. (1977): "An Invariance Principle for the Robbins-Monro Process in a Hilbert Space," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 30, 135-150.

Werbos, P. (1974): "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," unpublished Ph.D. Dissertation, Harvard University, Department of Applied Mathematics.

White, H. (1981): "Consequences and Detection of Misspecified Nonlinear Regression Models," Journal of the American Statistical Association 76, 419-433.


White, H. (1982): "Maximum Likelihood Estimation of Misspecified Models," Econometrica 50, 1-25.

White, H. (1987a): "Some Asymptotic Results for Back-Propagation," Proceedings of the IEEE First International Conference on Neural Networks, San Diego. New York: IEEE Press,pp. III:261-266.

White, H. (1987b):

White, H. (1988): "Economic Prediction Using Neural Networks: The Case of IBM Stock Prices," Proceedings of the Second Annual IEEE Conference on Neural Networks.

White, H. (1989a): "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models," Journal of the American Statistical Association 84, 1003-1013.

White, H. (1989b): "An Additional Hidden Unit Test for Neglected Nonlinearity," Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. II:451-455.

White, H. (1990a): "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks 3, 535-549.

White, H. (1990b): "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," UC San Diego Department of Economics Discussion Paper.

White, H. (1992): Estimation, Inference and Specification Analysis. New York: Cambridge University Press (forthcoming).

White, H. and J. Wooldridge (1991): "Some Results for Sieve Estimation with Dependent Observations," in W. Barnett, J. Powell and G. Tauchen eds., Nonparametric and Semiparametric Methods in Econometrics and Statistics. New York: Cambridge University Press, pp. 459-493.

Williams, R.J. (1986): "The Logic of Activation Functions," in D.E. Rumelhart and J.L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge: MIT Press, 1, pp. 423-443.

Williams, R.J. and D. Zipser (1989): "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks," Neural Computation 2, 270-280.

Xu, X. and W.T. Tsai (1990): "Constructing Associative Memories Using Neural Networks," Neural Networks 3, 301-310.

Xu, X. and W.T. Tsai (1991): "Effective Neural Algorithms for the Traveling Salesman Problem," Neural Networks 4, 193-206.

Young, P.C. (1984): Recursive Estimation and Time-Series Analysis. New York: Springer-Verlag.