

Artificial neural networks are a class of models developed by cognitive scientists interested in understanding how computation is performed by the brain. These networks are capable of learning through a process of trial and error that can be appropriately viewed as statistical estimation of model parameters. Although inspired by certain aspects of the way information is processed in the brain, these network models and their associated learning paradigms are still far from anything close to a realistic description of how brains actually work. They nevertheless provide a rich, powerful and interesting modeling framework with proven and potential application across the sciences. To mention just a handful of such applications, artificial neural networks have been successfully used to translate printed English text into speech (Sejnowski and Rosenberg, 1986), to recognize hand-printed characters (Fukushima and Miyake, 1984), to perform complex coordination tasks (Selfridge, Sutton and Barto, 1985), to play backgammon (Tesauro, 1989), to diagnose chest pain (Baxt, 1991), and to decode deterministic chaos (Lapedes and Farber, 1987; White, 1989; Gallant and White, 1991). Successes in these and other areas suggest that artificial neural network models may serve as a useful addition to the tool-kits of economists and econometricians. Areas with particular potential for application include time-series modeling and forecasting, nonparametric estimation, and learning by economic agents.

The purpose of this article is two-fold: first, to review the basic concepts and theory required to make artificial neural networks accessible to economists and econometricians, with particular focus on econometrically relevant methodology; and second, to develop theory for a leading neural network learning paradigm to a point comparable to that of the modern theory of estimation and inference for misspecified nonlinear dynamic models (e.g., Gallant and White, 1988a; Potscher and Prucha, 1991a,b). As we hope will become apparent from our development, not only do artificial neural networks have much to offer economics and econometrics, but there is also considerable


potential for economics and econometrics to benefit the neural network field, arising to a considerable degree from economic and econometric experience in modeling and estimating dynamic systems. Thus, a larger goal of this article is to provide an entry point and appropriate background for those wishing to engage in the fascinating intellectual arbitrage required to fully realize the potential gains from trade between economics, econometrics and artificial neural networks.




The simplest general artificial neural network (ANN) models draw primarily on three features of the way that biological neural networks process information: massive parallelism, nonlinear neural unit response to neural unit input, and processing by multiple layers of neural units. Incorporation of a fourth feature, dynamic feedback among units, leads to even greater generality and richness. In this section, we describe how these features are embodied in now standard approaches to ANN modeling, and some of the implications of these embodiments. Because of the very considerable breadth of ANN paradigms, we cannot do justice to the entire spectrum of such models; instead, we focus our attention on those most easily related to and with greatest relevance for econometrics.

Although not usually thought of in such terms, parallelism is a familiar aspect of econometric modeling. A schematic of a simple parallel processing network is shown in Figure 1. Here, input units ("sensors") send real-valued signals (x_i, i = 1, ..., r) in parallel over connections to subsequent units, designated "output units" for now. The signal from input unit i to output unit j may be attenuated or amplified by a factor γ_ji ∈ ℝ, so that signals x_i γ_ji reach output unit j, i = 1, ..., r. The factors γ_ji are known as "network weights" or "connection strengths."

In simple ANN models, the receiving units process parallel incoming signals in typically simple ways. The simplest is to add the signals seen by the receiver, in which case output unit j produces output

    Σ_{i=1}^r x_i γ_ji ,    j = 1, ..., v.

If, as is common, we permit an input, say x_0, to supply x_0 = 1 to the network (a "bias unit" in network jargon), output can be represented as

    f_j(x, γ) = x'γ_j ,    j = 1, ..., v,

or, collecting the output units together,

    f(x, γ) = (f_1(x, γ), ..., f_v(x, γ))' ,

where f = (f_1, ..., f_v)', x = (1, x_1, ..., x_r)', γ = (γ'_1, ..., γ'_v)', and γ_j = (γ_j0, γ_j1, ..., γ_jr)'.



The network output function f is easily recognized as the systematic part of the standard system of seemingly unrelated (linear) equations; in the neural network literature, an electronic version of this network was introduced as the MADALINE (Multiple ADAptive LINear) network by Widrow and Hoff (1960). When v = 1 (only a single output), we have the ADALINE network of Widrow and Hoff (1960), easily recognized as the simple linear model, the workhorse of empirical econometrics.
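To fix ideas, the parallel linear computation f_j(x, γ) = x'γ_j may be sketched in a few lines of Python (an illustration of ours, not code from the neural network literature; all names and numerical values are arbitrary):

```python
import numpy as np

def linear_network_output(x, Gamma):
    """MADALINE-style linear network output f_j(x, gamma) = x' gamma_j.

    x     : input signals, shape (r,)
    Gamma : connection strengths; row j holds (gamma_j0, ..., gamma_jr)
    """
    x_aug = np.concatenate(([1.0], x))  # prepend the bias unit x_0 = 1
    return Gamma @ x_aug                # one inner product per output unit, in parallel

x = np.array([2.0, -1.0])                       # r = 2 input signals
Gamma = np.array([[0.5, 1.0, 0.0],              # v = 2 output units
                  [0.0, 2.0, 3.0]])
print(linear_network_output(x, Gamma))          # [2.5 1.0]
```

The ADALINE case (v = 1) corresponds to Gamma having a single row, i.e. the simple linear model.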
In biological neural systems, the number of processing units can range into the millions or billions and beyond (hence the term "massive" parallelism). While such numbers are not usually encountered in economic models, the essential feature of parallel processing is common to both.

From the outset of their development, the behavior of artificial neural networks was formulated to include another stylized feature of biological systems. This is the tendency of certain types of neurons to be quiescent in the presence of modest levels of input activity, and to become active themselves only after input activity passes a particular threshold. Beyond this threshold, increases in input activity have little further effect. This introduces the fundamental feature of nonlinear response into the ANN paradigm.

For present purposes, it suffices to think either of neural units switching on or off, or to imagine a single dimension along which neural activity (e.g. neural firing rate) can smoothly vary from fully off to fully on. In their seminal article, McCulloch and Pitts (1943) considered


the first possibility, proposing networks with output unit activity given by

    f_j(x, γ) = G(x'γ_j) ,    j = 1, ..., v,

where G(a) = 1 if a > 0 and G(a) = 0 if a ≤ 0. This choice for G implements the "Heaviside" or unit step function. Output unit j thus turns on when x'γ_j > 0, i.e. when input activity Σ_{i=1}^r x_i γ_ji exceeds the threshold -γ_j0. For this reason the Heaviside function is said to implement a "threshold logic unit" (TLU). G is called the "activation function" of the (output) unit. Networks with TLUs are appropriate for classification and recognition tasks: the study of such networks exclusively pre-occupied the ANN field through the 1950's and dominated the field through the 1960's.

In retrospect, a major breakthrough in the ANN literature

occurred when it was proposed to replace the Heaviside activation function with a smooth sigmoid (s-shaped) function, in particular the logistic function, G(a) = 1/(1 + exp(-a)) (Cowan, 1967). Instead of switching abruptly from off to on, "sigmoidal" units turn on gradually as input activity increases. The reason why this constituted a breakthrough from the ANN standpoint will be discussed in the next section. From the econometric standpoint, however, we observe that with this modification, the familiar

    f_j(x, γ) = G(x'γ_j) = 1/(1 + exp(-x'γ_j))

is precisely the binary logit model (e.g. Amemiya, 1981; 1985, p. 268). Other choices for G yield other models appropriate for classification or qualitative response modeling; for example, if G is the normal cumulative distribution function, we have the binary probit model, etc. As Amemiya (1981) documents in his
classic survey, such models have great utility in econometric applications where binary

classifications or decisions are involved. Although biological networks with direct connections from input to output units are well-known (e.g., the knee-jerk reflex is mediated by direct connections from sensory receptors in the knee onto motoneurons in the spinal cord that then activate leg muscles), it is much more common to observe processing occurring in multiple layers of units. For example, six distinct processing layers are at work in the human cortex. Such multilayered structures were introduced into the ANN literature by Rosenblatt (1957, 1958) and by Gamba and his associates (Palmieri


and Sanna, 1960; Gamba, et al., 1961). Figure 2 shows a schematic diagram of a network containing a single intermediate layer of processing units separating input from output. Intermediate layers of this sort are often called "hidden" layers to distinguish them from the input and output layers. Processing in such networks is straightforward. Units in one layer treat the units in the preceding layer as input, and produce outputs to be processed by the succeeding layer. The output function for such a network with a single hidden layer (as in Figure 2) is thus of the form

    f_h(x, θ) = F(β_h0 + Σ_{j=1}^q G(x'γ_j) β_hj) ,    h = 1, ..., v.    (1.1.1)

Here F: ℝ → ℝ is the output activation function, and β_hj, j = 0, 1, ..., q, h = 1, ..., v are connection strengths from hidden unit j (j = 0 indexes a bias unit) to output unit h. The vector θ = (β'_1, ..., β'_v, γ'_1, ..., γ'_q)' (with β'_h = (β_h0, ..., β_hq)) collects all network weights. Note that we have q hidden units.

As originally introduced, the hidden layer network activation functions F and G were threshold functions; modern practice permits F and G to be chosen quite freely. Leading choices are F(a) = G(a) = 1/(1 + exp(-a)) (the logistic) or F(a) = a (the identity) and G(a) = 1/(1 + exp(-a)). Because of its notational simplicity and considerable generality, we adopt the latter choice, and for further simplicity set v = 1. Thus we shall pay particular attention to "single hidden layer" networks with output functions of the form

    f(x, θ) = β_0 + Σ_{j=1}^q G(x'γ_j) β_j .    (1.1.2)



Although we have seen econometrically familiar models emerge in our foregoing discussion of ANN models (e.g. seemingly unrelated regression systems and logit models), equation (1.1.2) is not so familiar. It does bear a strong resemblance to the projection pursuit models of modern statistics (Friedman and Stuetzle, 1981; Huber, 1985) in which output response is given by

    f(x, θ) = β_0 + Σ_{j=1}^q G_j(x'γ_j) β_j .

However, in projection pursuit models the functions G_j are unknown and must be estimated from data (permitting β_j to be absorbed into G_j), whereas in the hidden layer network model (1.1.2), G is given. The hidden layer network model is thus somewhat simpler than the projection pursuit model.
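For concreteness, the single hidden layer output function (1.1.2) with logistic G may be sketched as follows (an illustration of ours; the function names are arbitrary):

```python
import numpy as np

def logistic(a):
    """The logistic squasher G(a) = 1 / (1 + exp(-a))."""
    return 1.0 / (1.0 + np.exp(-a))

def hidden_layer_output(x, beta0, beta, Gamma, G=logistic):
    """Single hidden layer output f(x, theta) = beta0 + sum_j G(x'gamma_j) beta_j.

    x     : inputs, shape (r,); a bias unit x_0 = 1 is prepended internally
    beta  : hidden-to-output weights (beta_1, ..., beta_q)
    Gamma : input-to-hidden weights; row j holds gamma_j, shape (q, r + 1)
    """
    x_aug = np.concatenate(([1.0], x))
    return beta0 + G(Gamma @ x_aug) @ beta
```

With Gamma = 0, each hidden unit contributes G(0) = 1/2, so the output reduces to beta0 + (1/2) Σ_j beta_j, a convenient sanity check.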
A variant of the single hidden layer network that is particularly relevant for

econometric applications is depicted in Figure 3. This network has direct connections from the
input to output layers as well as a single hidden layer. Output for this network can be expressed as

    f(x, θ) = F(x'α + β_0 + Σ_{j=1}^q G(x'γ_j) β_j) ,    (1.1.3)

where α is the r × 1 vector of input-to-output weights, and θ is now taken to be θ = (α', β_0, ..., β_q, γ'_1, ..., γ'_q)'. By suitable choice of G, α and β = (β_0, β_1, ..., β_q)', we nest as special cases all of the networks discussed so far.

In particular, with F(a) = a (the identity) we have a standard linear model augmented by nonlinear terms. Given the popularity of linear models in econometrics, this form is particularly appealing, as it suggests that ANN models can be viewed as extensions of, rather than as alternatives to, the familiar models. The hidden unit activations can then be viewed as latent variables whose inclusion enriches the linear model. We shall refer to an ANN model with output of the form (1.1.3) as an "augmented" single hidden layer network. Such networks will play an important role in the discussion of subsequent sections.
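The nesting of the linear model in the augmented network can be verified numerically; in the following sketch (ours, with arbitrary values), setting the hidden-to-output weights beta to zero recovers the pure linear specification x'α + β_0:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def augmented_network(x, alpha, beta0, beta, Gamma):
    """Augmented network (1.1.3) with F the identity:
    f(x, theta) = x'alpha + beta0 + sum_j G(x'gamma_j) beta_j."""
    x_aug = np.concatenate(([1.0], x))
    return x @ alpha + beta0 + logistic(Gamma @ x_aug) @ beta

# With beta = 0 the hidden units drop out and the linear model remains:
x = np.array([1.5, -0.5])
alpha = np.array([2.0, 4.0])
print(augmented_network(x, alpha, 1.0, np.zeros(3), np.zeros((3, 3))))  # 2.0
```

Nonzero beta then adds the hidden unit activations as latent regressors enriching the linear model.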

What originally commanded the attention and excitement of a diverse range of disciplines was the demonstrated successes that models of the form (1.1.1) and (1.1.2) had in solving previously intractable classification, forecasting and control problems, or in producing superior solutions to difficult problems in orders of magnitude less time than traditional approaches. Until recently, a theoretical basis for such successes was unknown --artificial neural networks just seemed to work surprisingly well.

Motivated by a desire either to delineate the limitations of network models or to

understand their diverse successes, a number of researchers independently produced rigorous

results establishing that functions of the form (1.1.2) can be viewed as "universal approximators," that is, as a flexible functional form that, provided with sufficiently many hidden units and properly adjusted parameters, can approximate an arbitrary function g: ℝ^r → ℝ arbitrarily well in useful spaces of functions. Results of this sort have been given by Carroll and Dickinson (1989), Cybenko (1989), Funahashi (1989), Hecht-Nielsen (1989), Hornik, Stinchcombe and White (1989, 1990) (HSWa, HSWb) and Stinchcombe and White (1989), among others. The flavor of such results is conveyed by the following paraphrase of part of Theorem 2.4 of HSWa.

For r ∈ ℕ, let Σ^r(G) = {f: ℝ^r → ℝ | f(x) = β_0 + Σ_{j=1}^q G(x'γ_j) β_j, x ∈ ℝ^r; γ_j ∈ ℝ^{r+1}, j = 1, ..., q; β_j ∈ ℝ, j = 0, 1, ..., q; q ∈ ℕ} be the class of single hidden layer network output functions, where G: ℝ → [0, 1] is any cumulative distribution function. Then Σ^r(G) is dense on compacta in C(ℝ^r), i.e. for every g in C(ℝ^r), every compact subset K of ℝ^r, and every ε > 0, there exists f ∈ Σ^r(G) such that sup_{x ∈ K} |f(x) - g(x)| < ε.

Thus, the biologically inspired combination of parallelism, nonlinear response and multilayer processing leads us to a class of functions that can approximate members of the useful class C(ℝ^r) arbitrarily well. Similar results hold for network models with general (not necessarily sigmoid) activation functions approximating functions in L_p spaces with compactly supported measures, and, as HSWb and Hornik (1991) show, in general Sobolev spaces. Thus, functions of the form (1.1.2) can approximate a function and its derivatives arbitrarily well, and in this sense are as flexible as Gallant's (1981) flexible Fourier form. Indeed, Gallant and White (1988b) construct a sigmoid choice for G (the "cosine squasher") that nests Fourier series within (1.1.2), so that the flexible Fourier form is a special case of (1.1.3) even for sigmoid G.


The econometric usefulness of the flexible form (1.1.2) has been further enhanced by
Hu and Joerding (1990) and Joerding and Meador (1990), who show how to impose constraints

ensuring monotonicity and concavity (or convexity) of the network output function. The interested reader is referred to these papers for details.
An issue of both theoretical and practical importance is the "degree of approximation" problem: how rapidly does the approximation to an arbitrary function improve as the number of hidden units q increases? Classic results for Fourier series are provided by Edmunds and Moscatelli (1977). Similar results for ANN models are only beginning to appear, and so far are not as sharp as those for Fourier series. Barron (1991a) exploits results of Jones (1991) to establish essentially that ||f - g||_2 = O(1/q^{1/2}) (||·||_2 denotes an L_2 norm) when f is an element of Σ^r(G) having q hidden units and a continuously differentiable sigmoid activation function, and g belongs to a certain class of smooth functions satisfying a summability condition on its Fourier transform. An important open area for further work is the extension and deepening of results of this sort, especially as such results may provide key insight into advantages and disadvantages of ANN models compared to standard flexible function families. Degree of approximation results are also necessary for establishing rates of convergence for nonparametric estimation based on ANN models.

Our focus so far on networks with a single hidden layer is justified by their relative simplicity and their approximative power. However, if nature is any guide, there are advantages to using networks of many hidden layers, as depicted in Figure 4. Output of an l-layer network
can be represented as

    a_hi = G_h(A_hi(a_{h-1})) ,    i = 1, ..., q_h ,    h = 1, ..., l,    (1.1.4)

where a_h is a q_h × 1 vector with elements a_hi, A_hi(·) is an affine function of its argument (i.e. A_hi(a) = ã'γ_hi for some (q_h + 1) × 1 vector γ_hi, with ã = (1, a')'), G_h is the activation function for units of layer h, a_0 = x, q_0 = r, and q_l = v. The single hidden layer networks discussed above correspond to l = 2 in this representation.
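The layer recursion (1.1.4) can be sketched directly (our illustration; the chosen weights and activations are arbitrary):

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def multilayer_output(x, weights, activations):
    """Layer recursion a_h = G_h(A_h(a_{h-1})), a_0 = x  -- equation (1.1.4).

    weights     : list of l matrices; row i of weights[h] holds gamma_hi,
                  acting on the bias-augmented vector (1, a_{h-1}')'
    activations : list of l layer activation functions G_h
    """
    a = np.asarray(x, dtype=float)            # a_0 = x
    for W, G in zip(weights, activations):
        a_tilde = np.concatenate(([1.0], a))  # affine map A_hi(a) = (1, a')' gamma_hi
        a = G(W @ a_tilde)
    return a

# l = 2 recovers the single hidden layer network: logistic hidden layer, identity output.
W1 = np.zeros((2, 3))                          # q_1 = 2 hidden units, r = 2 inputs
W2 = np.array([[1.0, 4.0, 4.0]])               # v = 1 output: beta_0 = 1, beta = (4, 4)
print(multilayer_output([0.3, -0.3], [W1, W2], [logistic, lambda a: a]))  # [5.]
```

Additional layers are obtained simply by lengthening the two lists, consistent with the recursion's indexing h = 1, ..., l.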


An interesting open question is to what extent networks with l ≥ 3 layers may be preferable to networks with l = 2 layers. For example, for what classes of functions can a three layer network achieve a given degree of accuracy with fewer connections (free parameters) than a two layer network? Examples are known in which a two layer network cannot exactly represent a function exactly representable by a three layer network (Blum and Li, 1991), and it is known that certain mappings containing discontinuities relevant in control theory can be uniformly approximated in three but not two layers (Sontag, 1990). HSWa (Corollary 2.7) have shown that additional layers cannot hurt, in the sense that approximation properties of single hidden layer networks (l = 2) carry over to multi-hidden layer networks. Further research in this interesting area is needed.
A further generalization of the networks represented by (1.1.4) is obtained by replacing the affine function A_hi(·) with a polynomial P_hi(·) with degree possibly dependent on i and h. This modification yields a class of networks containing as a special case the so-called "sigma-pi" (ΣΠ) networks (Maxwell, Giles, Lee and Chen, 1986; Williams, 1986). Stinchcombe (1991) has studied the approximation properties of networks for which an arbitrary "inner function" I_hi replaces A_hi in (1.1.4).
The richness of this class of network models is now fairly apparent. However, we
still have not exploited a known feature of biological networks, that of internal feedback.

Returning to the relatively simple single hidden layer networks, such feedbacks can be represented schematically as in Figure 5. In Figure 5(a), network output feeds back into the hidden layer with a time delay, as proposed by Jordan (1986). In Figure 5(b), hidden layer output feeds back into the hidden layer with a time delay, as proposed by Elman (1988). The output function of the Elman network can thus be represented as

    f_t(x_t, θ) = β_0 + Σ_{j=1}^q a_tj β_j ,
    a_tj = G(x_t'γ_j + a'_{t-1} δ_j) ,    (1.1.5)

where a_t = (a_t1, ..., a_tq)'. As a consequence of this feedback, network output depends on the initial value a_0 and the entire history of system inputs, x^t = (x_1, ..., x_t).

Such networks are capable of rich dynamic behavior, exhibiting memory and context sensitivity. Because of the presence of internal feedbacks, these networks are referred to in the literature as "recurrent networks," while networks lacking feedback (e.g., with output functions (1.1.3)) are designated "feedforward networks." In econometric terms, a model of the form (1.1.5) can be viewed as a nonlinear dynamic latent variables model. Such models have a great many potential applications in economics and finance. Their estimation would appear to present some serious computational challenges (see e.g. Hendry and Richard, 1990, and Duffie and Singleton, 1990), but in fact some straightforward recursive estimation procedures related to the Kalman filter can deliver consistent estimates of model parameters (Kuan, Hornik and White, 1990; Kuan and White, 1991). We discuss this further in the next section.
Although we have covered a fair amount of ground in this section, we have only

scratched the surface of the modeling possibilities offered by artificial neural networks. To mention some additional models treated in the ANN literature, we note that fully interconnected networks have been much studied (with applications to such areas as associative memory and solu-

tion of problems like the traveling salesman problem; see e.g. Xu and Tsai, 1990, and Xu and Tsai, 1991), and that networks running in continuous rather than discrete time are also standard objects of investigation (e.g. Williams and Zipser, 1989). Although fascinating, these network models appear to be less relevant to econometrics than those discussed so far, and we shall not
treat them further. As rich as ANN models are, they still ignore a host of biologically relevant features.

Neural systems that have taken perhaps billions of years to evolve will take humans a little more time to model exhaustively than the five decades devoted so far! To mention just a few items,
biological neurons communicate over multiple pathways, chemical as well as electrical --the single communication dimension ("activation") assumed in most ANN models is quite incomplete.


Also, biological neurons respond to input activity stochastically and in much more complicated ways than as modeled by the sigmoid activation function --neurons output complex spike trains through time, and are in fact not simple processing units. Of course, these and other limitations of ANN models are daily being challenged by ANN modelers, and we may expect a continuing increase in the richness of ANN models as the diverse interdisciplinary talents of the ANN community are brought to bear on these issues.
Despite these limitations as descriptions of biological reality, ANN models are sufficiently rich as to present a potentially attractive set of tools for econometric modeling.

Given models, the econometrician wants estimators. We take up estimation in the next section, where we encounter additional interesting tools developed by the ANN community in their study of learning in artificial neural networks.


The discussion of the previous section establishes ANN models as flexible functional forms, extending standard linear specifications. As such, they are potentially useful for econometric modeling. To fulfill this potential, we require methods for finding useful values for the free parameters of the model, the network weights.

To any econometrician versed in the standard tools of the trade, a multitude of relevant estimation procedures for finding useful parameter values present themselves, typically dependent on the behavior of the data generating process and the goals of the analysis.
For example, suppose we observe a realization of a random sequence of s × 1 vectors {Z_t = (Y_t, X_t')'} (assumed stationary for simplicity), and we wish to forecast Y_t on the basis of X_t. The minimum mean-squared error forecast of Y_t given X_t is the conditional expectation g(X_t) = E(Y_t | X_t). Although the function g is unknown, we can attempt to approximate it using a neural network with some sufficient number of hidden units. If we adopt (1.1.3) with F the identity, we obtain a regression model of the form




    f(x, θ) = x'α + β_0 + Σ_{j=1}^q G(x'γ_j) β_j ,

where θ = (α', β_0, β_1, ..., β_q, γ'_1, ..., γ'_q)' and for simplicity we choose q and G a priori. Because this model is only intended as an approximation, we must acknowledge from the outset that it is misspecified. Nevertheless, the theory of least squares for misspecified nonlinear regression models (White, 1981; 1992, Ch. 5; Domowitz and White, 1982; Gallant and White, 1988a) applies immediately to establish that a nonlinear least squares estimator θ̂_n solving the problem

    min_{θ ∈ Θ} n^{-1} Σ_{t=1}^n [Y_t - f(X_t, θ)]^2

exists and converges almost surely under general conditions as n → ∞ to θ*, the solution to the problem

    min_{θ ∈ Θ} E([Y_t - f(X_t, θ)]^2).

Note that E([Y_t - f(X_t, θ)]^2) = σ*^2 + E([f(X_t, θ) - g(X_t)]^2), where σ*^2 = E([Y_t - g(X_t)]^2), so that f(·, θ*) is a mean-squared-error optimal approximation to g. (See Sussman for issues relating to the identification of θ*.) Further, under general conditions √n(θ̂_n - θ*) converges in distribution as n → ∞ to a multivariate normal distribution with mean zero and consistently estimable covariance matrix (White, 1981; 1992, Ch. 6; Domowitz and White, 1982).

Although least squares is a leading case, the properties of the dependent variable Y_t will often suggest the appropriateness of a quasi-maximum likelihood procedure different from
least squares. For example, if Y_t is a binary choice indicator taking values 0 or 1 only, it may be assumed to follow a conditional Bernoulli distribution, given X_t. A network model to approximate g(X_t) = P[Y_t = 1 | X_t] = E(Y_t | X_t) can be specified as

    f(x, θ) = F(x'α + β_0 + Σ_{j=1}^q G(x'γ_j) β_j) ,    (1.2.1)

where F(·) is now some appropriate c.d.f. (e.g., the logistic or normal). The mean quasi-log likelihood function for a sample of size n is then



    L_n(θ) = n^{-1} Σ_{t=1}^n [Y_t log f(X_t, θ) + (1 - Y_t) log(1 - f(X_t, θ))].

A quasi-maximum likelihood estimator θ̂_n solving the problem

    max_{θ ∈ Θ} L_n(θ)

can be shown under general conditions to exist and converge to θ*, the solution to the problem

    max_{θ ∈ Θ} E[Y_t log f(X_t, θ) + (1 - Y_t) log(1 - f(X_t, θ))].

(See White, 1982; 1992, Ch. 3-5.) The solution θ* minimizes the Kullback-Leibler divergence of the approximate probability model f(X_t, θ*) from the true g(X_t). As in the least squares case, √n(θ̂_n - θ*) converges in distribution as n → ∞ to a multivariate normal distribution with mean zero and consistently estimable covariance matrix (White, 1982; 1992, Ch. 6).
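The objects entering this quasi-maximum likelihood problem are easily computed; the following sketch (ours; the helper names are not from the text) evaluates the network probability model and the mean quasi-log likelihood for a sample:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def network_prob(X, alpha, beta0, beta, Gamma):
    """f(X_t, theta) = F(x'alpha + beta0 + sum_j G(x'gamma_j) beta_j), F logistic.

    X : sample of inputs, shape (n, r); the bias column is added internally.
    """
    Xa = np.column_stack([np.ones(len(X)), X])
    return logistic(X @ alpha + beta0 + logistic(Xa @ Gamma.T) @ beta)

def mean_quasi_loglik(y, p):
    """n^{-1} sum_t [Y_t log f(X_t, theta) + (1 - Y_t) log(1 - f(X_t, theta))]."""
    return np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))
```

Maximizing mean_quasi_loglik over the network weights (e.g., by a general-purpose numerical optimizer) yields the quasi-maximum likelihood estimator described above.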

If Y_t represents count data, then a Poisson quasi-maximum likelihood procedure is natural (e.g. Gourieroux, Monfort and Trognon, 1984a,b), where f is as in (1.2.1) with F chosen to ensure non-negativity (e.g. F(a) = exp(a)), so as to permit f(X_t, θ) to plausibly approximate g(X_t) = E(Y_t | X_t). If Y_t represents a survival time, then a Cox proportional hazards model (e.g. Amemiya, 1985, pp. 449-454) is a natural choice, with hazard rate of the form λ(t) f(X_t, θ).
From an econometric standpoint, then, ANN models can be used anywhere one would ordinarily use a linear (or transformed linear) specification, with estimation via appropriate quasi-maximum likelihood (or, alternatively, generalized method of moments) techniques. The now rather well-developed theory of estimation of misspecified models (White, 1982, 1992; Gallant and White, 1988a; Potscher and Prucha, 1991a,b) applies immediately to provide interpretations and inferential procedures.


The natural instincts of econometricians are not the instincts of those concerned with
artificial neural network learning, however. This is a double blessing, because it means not only

that econometrics has much to offer those who study and apply artificial neural networks, but also
that econometrics may benefit from novel techniques developed by the ANN community. In considering how an artificially intelligent system must go about learning, ANN modelers from the outset viewed learning as a sequential process. Viewing learning as the process by which knowledge is acquired, it follows that knowledge accumulates as learning experiences occur, i.e. as new data are observed.


In ANN models, knowledge is embodied in the network weights. Given knowledge θ̂_t at time t, knowledge θ̂_{t+1} at time t + 1 is then

    θ̂_{t+1} = θ̂_t + Δ_t ,

where Δ_t embodies incremental knowledge. A successful learning procedure must therefore specify some appropriate way to form the update Δ_t from previous knowledge θ̂_t and current experience Z_t = (Y_t, X_t')'. Thus we seek an appropriate function ψ_t for which Δ_t = ψ_t(θ̂_t, Z_t).





Current leading ANN learning methods can trace their history from seminal work of Rosenblatt (1957, 1958, 1961) and Widrow and Hoff (1960). Rosenblatt's learning network, the a-perceptron, was concerned with pattern classification and utilized threshold logic units. Widrow and Hoff's ADALINE networks do not require a TLU, as they are not restricted to classification tasks. As a consequence, the Widrow-Hoff (or "delta") learning law could be generalized in just the right way to permit application to nonlinear networks.

For their linear networks (with output for now given by f(x, θ) = x'θ) Widrow and Hoff proposed a version of recursive least squares (itself traceable back to Gauss, 1809 --see Young, 1984),

    θ̂_{t+1} = θ̂_t + α X_t ε̂_t ,    t = 1, 2, ...,    (1.2.2)

where ε̂_t = Y_t - X_t'θ̂_t is the "network error," the difference between the target Y_t and the network output X_t'θ̂_t. The scalar α > 0 is a "learning rate" to be adjusted by trial and error. This recursion was motivated explicitly by consideration of minimizing expected squared error loss.

For networks with nonlinear output f(x, θ) the direct generalization of the delta rule is

    θ̂_{t+1} = θ̂_t + α ∇f(X_t, θ̂_t)(Y_t - f(X_t, θ̂_t)) ,    t = 1, 2, ...,    (1.2.3)

where ∇f(x, θ) is the gradient of f(x, θ) with respect to θ (a column vector). In the ANN literature, this recursion is called the "generalized delta rule" or the method of "backpropagation" (a term invented for a related procedure by Rosenblatt, 1961). Its discovery is attributable to many (Werbos, 1974; Parker, 1982, 1985; Le Cun, 1985), but the influential work of Rumelhart, Hinton and Williams (1986) is perhaps most responsible for its widespread adoption.
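For the single hidden layer network (1.1.2) with logistic G, the gradient in the generalized delta rule can be computed analytically; the following sketch (ours, with arbitrary values) performs one update per observation:

```python
import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

def generalized_delta_step(beta0, beta, Gamma, x, y, lr):
    """One generalized delta rule update for the network
    f(x, theta) = beta0 + sum_j G(x'gamma_j) beta_j, with logistic G.
    The gradient of f with respect to theta is hand-coded (backpropagation)."""
    x_aug = np.concatenate(([1.0], x))
    h = logistic(Gamma @ x_aug)                 # hidden unit activations
    err = y - (beta0 + beta @ h)                # network error Y_t - f(X_t, theta_t)
    # Gradient components, evaluated at the current weights:
    g_Gamma = np.outer(beta * h * (1.0 - h), x_aug)   # chain rule through logistic
    return beta0 + lr * err, beta + lr * err * h, Gamma + lr * err * g_Gamma

# Repeated updates on a single observation drive the squared error toward zero.
beta0, beta, Gamma = 0.0, np.zeros(2), np.zeros((2, 2))
for _ in range(50):
    beta0, beta, Gamma = generalized_delta_step(beta0, beta, Gamma,
                                                np.array([1.0]), 1.0, 0.1)
```

Each step moves the weights in the direction of the gradient of f scaled by the current error, exactly the structure displayed in the recursion above.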
This apparently straightforward generalization of (1.2.2) in fact caused a revolution

in the ANN field, spurring the explosive growth in ANN modeling responsible for its vigor today
and the appearance of an article such as this in a journal devoted to econometrics. The reasons

for this revolution are essentially two. First, until its discovery, there were no methods known to ANN modelers for finding good weights for connections into the hidden units. The focus on threshold logic units in multilayer networks in the 1950's and 1960's led researchers away from
gradient methods, as the derivative of a TLU is zero almost everywhere, and does not obviously

lend itself to gradient methods. This is why the introduction of sigmoid activation functions by
Cowan (1967) amounted to such a significant breakthrough --straightforward gradient methods

become possible with such activation functions. Even so, it took over a decade to sink into the
collective consciousness of the ANN community that a solution to a problem long considered

intractable (even impossible, viz. Minsky and Papert, 1969) was now at hand. The second reason is that once feasible methods for training hidden layer networks were available, they were applied to a vast range of problems with some startling successes. That this should be so is all the more impressive given the considerable difficulties in obtaining convergence via (1.2.3). For


a period, ANN models coupled with the method of backpropagation came to be viewed as magic, with considerable accompanying hype and extravagant claims.

In 1987 one of us (White, 1987a) pointed out that (1.2.3) is in fact an application of the method of stochastic approximation (Robbins and Monro, 1951; Blum, 1954) to the nonlinear least squares problem (as in Albert and Gardner, 1967). The least squares stochastic approximation recursions are in fact a little more general, having the form

    θ̂_{t+1} = θ̂_t + α_t ∇f(X_t, θ̂_t)(Y_t - f(X_t, θ̂_t)) ,    t = 1, 2, ....    (1.2.4)


The difference is that here the learning rate α_t is indexed by t, whereas in (1.2.3) it is a constant. This is quite an important difference. With a constant learning rate, the recursion (1.2.3) can converge only under extremely stringent conditions (there must exist θ_0 such that Y = f(X, θ_0) almost surely, where Z_t has the distribution of Z = (Y, X')', t = 1, 2, ...). When this condition fails, the recursion (1.2.3) generally converges to a Brownian motion (see Kushner and Huang, 1981; Hornik and Kuan, 1990), not an appealing behavior in this context. However, whenever α_t depends on t appropriately (e.g. α_t > 0, Σ_{t=1}^∞ α_t = ∞, Σ_{t=1}^∞ α_t^2 < ∞, for which it suffices that α_t ∝ t^{-κ}, 1/2 < κ ≤ 1), standard results from the theory of stochastic approximation can be applied (e.g., White, 1989a) to establish the almost sure convergence of θ̂_t in (1.2.4) to θ*, a local solution of the least squares problem

    min_{θ ∈ Θ} E([Y_t - f(X_t, θ)]^2).
Judicious choice of the initial value θ̂_0 (e.g., following the parameter space partitioning strategy of Morris and Wong, 1991) can lead to rather good local solutions.
This fact is significant. The recursion (1.2.4) provides a computationally very simple

algorithm for getting a consistent estimator for a locally mean square optimal parameter vector in a nonlinear model with just a single pass through the data. Multiple passes through the data

(which can be executed in parallel) permit exploration for a global optimum. Thus, in addition to


and Duffie and Singleton (1990). Duffie and Singleton derive consistency and asymptotic normality results for MSM estimators of correctly specified models of conditional distribution. The recursive estimator (1.2.6) is computationally simpler by several orders of magnitude and has useful approximation properties even with misspecified models. It is therefore an interesting estimator in its own right; it also appears promising as a generator of starting estimates for MSM estimation.

In all of the discussion so far, we have implicitly assumed that network complexity (indexed by the number of hidden units) is fixed. However, the universal approximation properties described in Section 1.1 suggest that ANN models may prove a useful vehicle for nonparametric estimation. This intuition is correct: using results of White and Wooldridge (1991), White (1990a) shows that nonparametric sieve estimators (Grenander, 1981; Geman and Hwang, 1982) based on ANN models can consistently estimate a square-integrable conditional expectation function, and White (1990b) shows that nonparametric sieve estimators based on ANN models can consistently estimate conditional quantile functions. Using results of Gallant (1987), Gallant and White (1991) establish the consistency in Sobolev norm of nonparametric sieve estimators based on ANN models. Thus, ANN models can consistently estimate unknown functions and their derivatives in a manner analogous to the performance of the flexible Fourier functional form (Gallant, 1981; Elbadawi, Gallant and Souza, 1983).

Given the early stage of development of degree of approximation results for ANN models, rate of convergence results for nonparametric ANN estimators are only beginning to be obtained. However, Barron (1991b) has obtained rate of convergence results for nonparametric least squares estimators of conditional expectation functions. For i.i.d. samples, these rates are slightly slower than n^{1/2}.

To gain some insight into the issues that arise in nonparametric estimation using ANN models, we briefly consider the problem treated by White (1990a). The estimation problem considered there has the standard sieve estimation form

θ̂_n = argmin_{θ ∈ Θ_n(G)} n⁻¹ Σ_{t=1}^n [Y_t − θ(X_t)]²,   (1.2.7)

where the sieve Θ_n(G) is given by Θ_n(G) = T(G, q_n, Δ_n), with

T(G, q, Δ) = {θ ∈ Θ : θ(·) = f(·, δ_q), f(x, δ_q) = β₀ + Σ_{j=1}^q G(x′γ_j)β_j, x ∈ ℝʳ, Σ_{j=0}^q |β_j| ≤ Δ, Σ_{j=1}^q Σ_{i=0}^r |γ_{ji}| ≤ qΔ},

G is a given hidden layer activation function, {q_n ∈ ℕ} and {Δ_n ∈ ℝ⁺} are sequences tending to infinity with n, Θ is the space of functions square integrable with respect to the distribution of X_t, and now δ_q = (β₀, β₁, ..., β_q, γ′₁, ..., γ′_q)′.

Given this setup, the estimation problem (1.2.7) is equivalent to the constrained nonlinear least squares problem

min_{δ_{q_n} ∈ D_n} n⁻¹ Σ_{t=1}^n [Y_t − f(X_t, δ_{q_n})]²,  n = 1, 2, ...,

where D_n = {δ_{q_n} : Σ_{j=0}^{q_n} |β_j| ≤ Δ_n, Σ_{j=1}^{q_n} Σ_{i=0}^r |γ_{ji}| ≤ q_n Δ_n}.

The idea is that for a sample of

size n, one performs a constrained nonlinear least squares estimation on a model with q_n hidden units, satisfying certain summability restrictions on the network weights. By letting the number of hidden units q_n increase gradually with n, and by gradually relaxing the weight constraints, the network model becomes increasingly flexible as n increases. Proper control of q_n and Δ_n eliminates overfitting asymptotically, allowing consistent estimation of θ₀, θ₀(X_t) = E(Y_t | X_t), in the sense that ‖θ̂_n − θ₀‖₂ → 0 in probability. White (1990a) shows that for bounded i.i.d. {Z_t}, consistency is achieved with Δ_n, q_n → ∞ as n → ∞, Δ_n = o(n^{1/4}) and q_n Δ_n² log(q_n Δ_n) = o(n). For bounded mixing processes of a specific size, Δ_n = o(n^{1/4}) and q_n Δ_n² log(q_n Δ_n) = o(n^{1/2}) suffice for consistency.

In practice, determining appropriate network complexity is precisely analogous to determining how many terms to include in a nonparametric series regression. As in that case, either cross-validation or information-theoretic methods can be used to determine the number of hidden units optimal for a given sample. Information-theoretic methods in which one optimizes a complexity-penalized quasi-log likelihood (closely related to the Schwarz Information Criterion; Sawa, 1978) have been shown to have desirable properties by Barron (1990). Extension of analysis by Li (1987), as applied by Andrews (1991a) to cross-validated selection of the number of terms in a standard series regression, may deliver appropriate optimality results for cross-validated selection of network complexity, and is an interesting area for further research.
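As an illustration of the information-theoretic approach, the sketch below (our construction; the target function, weight ranges and exact penalty are illustrative assumptions) selects the number of hidden units by minimizing an SIC-type criterion log σ̂² + k log(n)/n, with hidden-unit weights drawn at random and output weights fit by ordinary least squares:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
x = rng.uniform(-2.0, 2.0, n)
y = np.tanh(3.0 * x) + 0.1 * rng.standard_normal(n)   # nonlinear target

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def fit_sic(q):
    """OLS of y on [1, x] plus q random sigmoid hidden units; SIC-type criterion."""
    cols = [np.ones(n), x]
    for _ in range(q):
        g0, g1 = rng.uniform(-5.0, 5.0, 2)     # hidden-unit weights drawn at random
        cols.append(sigmoid(g0 + g1 * x))
    X = np.column_stack(cols)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    sigma2 = np.mean((y - X @ beta) ** 2)
    k = X.shape[1]                              # number of estimated parameters
    return np.log(sigma2) + k * np.log(n) / n   # complexity-penalized criterion

sic = {q: fit_sic(q) for q in range(9)}         # q = 0 (linear) through 8
best_q = min(sic, key=sic.get)                  # best_q > 0 for this nonlinear target
```

The penalty term k log(n)/n trades fit against complexity, so additional hidden units are retained only when they reduce the residual variance by more than the penalty increment.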

Also an open question is that of the asymptotic distribution of nonparametric neural network estimators. Results of Andrews (1991b) for series estimators may also be extendable to treat nonparametric estimators of ANN models. Additional interesting insights should arise from this analysis.



Consider the nonlinear regression model based on (1.1.3) with F the identity function. The standard linear model occurs as the special case in which β₁ = β₂ = ⋯ = β_q = 0. Thus, correct specification of the linear model can be tested as

H₀: β = 0  vs.  Hₐ: β ≠ 0,

where β = (β₁, ..., β_q)′.

A moment's reflection reveals an interesting obstacle to straightforward application of the usual tools of statistical inference: the "nuisance parameters" γ_j, j = 1, ..., q, are not identified under the null hypothesis, but are identified only under the alternative. Fortunately, there is now available a variety of tools that permits testing of H₀ in this context.

The simplest, most naive procedure is to avoid treating the γ_j as free parameters, instead choosing them a priori in some fashion (e.g., drawing them at random from some appropriate distribution) and then proceeding to test H₀ using standard methods, e.g., via Lagrange multiplier or Wald statistics, conditional on the values selected for the γ_j. A procedure of precisely this sort was proposed by White (1989b), and the properties of the resulting "neural network test for neglected nonlinearity" were compared to a number of other recognized procedures for testing linearity by Lee, White and Granger (1991). (See White, 1989b, and Lee, White and Granger, 1991, for implementation details.) The network test was found to perform well in comparison with other procedures. Though no one test dominated the others considered, the network test had good size, was often most powerful, and when not most powerful, was often one of the more powerful procedures. It thus appears to be a useful addition to the modern arsenal of specification testing procedures.
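A minimal sketch of the naive procedure (our construction; the implementations in White, 1989b, and Lee, White and Granger, 1991, differ in details such as the use of principal components of the hidden-unit activations): fix the γ_j a priori, regress the linear-model residuals on the original regressors plus the hidden-unit activations, and refer an nR²-type Lagrange multiplier statistic to χ²(q):

```python
import numpy as np

rng = np.random.default_rng(2)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def nn_test_stat(x, y, gammas):
    """n * R^2 from regressing linear-model residuals on [1, x] plus
    the sigmoid hidden-unit activations; approximately chi2(q) under H0."""
    n = len(y)
    X = np.column_stack([np.ones(n), x])
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b                                  # linear-model residuals
    psi = [sigmoid(g0 + g1 * x) for g0, g1 in gammas]
    Z = np.column_stack([X] + psi)
    c, *_ = np.linalg.lstsq(Z, e, rcond=None)
    r2 = 1.0 - np.sum((e - Z @ c) ** 2) / np.sum(e ** 2)
    return n * r2

gammas = [(0.0, 2.0), (1.0, -3.0), (-1.0, 3.0)]    # chosen a priori (q = 3)
n = 400
x = rng.uniform(-2.0, 2.0, n)
y_lin = 1.0 + 0.5 * x + 0.2 * rng.standard_normal(n)                   # linear DGP
y_nl = 1.0 + 0.5 * x + np.sin(2.0 * x) + 0.2 * rng.standard_normal(n)  # neglected nonlinearity

stat_lin = nn_test_stat(x, y_lin, gammas)   # compare to the chi2(3) 5% critical value, 7.81
stat_nl = nn_test_stat(x, y_nl, gammas)     # far exceeds 7.81
```

The hidden-unit activations serve as directions in which departures from linearity are sought; structure left in the residuals loads onto them and inflates the statistic.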
A more sophisticated procedure is to choose γ values that optimize the direction in which nonlinearity is sought. Bierens (1990) proposes a specification test of precisely this sort. First, the model is estimated under the null hypothesis (linearity), yielding residuals ε̂_t = Y_t − X̃′_t λ̂_n, where λ̂_n is an estimator of λ = (β₀, α′)′. Let M̂_n(γ) = n⁻¹ Σ_{t=1}^n ε̂_t G(X̃′_t γ). For given γ one can show under general conditions that

√n M̂_n(γ) → N(0, σ²(γ)) in distribution

under the linearity hypothesis, where with {Z_t} i.i.d. we have σ²(γ) = var([G(X̃′_t γ) − b(γ)′A⁻¹X̃_t]ε_t), b(γ) = E(G(X̃′_t γ) X̃_t), A = E(X̃_t X̃′_t), and λ̂_n → λ in probability. Bierens (1990) specifies G(·) = exp(·), but as we discuss below, this is not the only possible choice.

It follows that

Ŵ(γ) = n M̂_n(γ)² / σ̂²_n(γ) → χ²(1) in distribution

under correct specification of the linear model, where σ̂²_n(γ) is a consistent estimator of σ²(γ). Under the alternative, Ŵ(γ)/n → η(γ) > 0 a.s. for essentially every choice of γ, as Bierens (1990, Theorem 2) shows.

To avoid picking γ at random, Bierens proposes maximizing Ŵ(γ) with respect to γ ∈ Γ (an appropriately specified compact set), yielding Ŵ(γ̂), say. As Bierens notes, this maximization renders the χ²(1) distribution inapplicable under H₀. However, a χ²(1) statistic can be constructed by the following device: choose c > 0, λ ∈ (0, 1) and γ₀ ∈ Γ independently of the sample

and put

γ̃ = γ̂  if Ŵ(γ̂) − Ŵ(γ₀) > c nᵘ,  γ̃ = γ₀  otherwise,

with u = λ. Then Ŵ(γ̃) → χ²(1) in distribution under H₀, while Ŵ(γ̃)/n → sup_{γ ∈ Γ} η(γ) > 0 a.s. under the alternative.


Both results hold regardless of how γ₀ is chosen.

In recent related work, Stinchcombe and White (1991) show that Bierens' conclusions are preserved if G is chosen to belong to a certain wide class of functions, including G(·) = exp(·). Other members of this class are G(a) = 1/(1 + exp(−a)) and G(a) = tanh(a).

The choice of c, λ, and γ₀ in Bierens' construction is problematic. Two researchers using the same data and models but using different values for c, λ and γ₀ can arrive at differing conclusions in finite samples regarding correctness of a given specification. One way to avoid such difficulties is to confront the problem head-on and determine the distribution of Ŵ(γ̂). Some useful inequalities are given by Davies (1977, 1987), but these are not terribly helpful when r ≥ 3 (recall r is the number of explanatory variables). Recently, Hansen (1991) has proposed a computationally intensive procedure that permits computation of an asymptotic distribution for Ŵ(γ̂) under H₀.


An interesting area for further research is a comparison of the relative power and computational cost of the procedures discussed here: the naive procedure of picking the γ_j at random; Bierens' Ŵ(γ̃) procedure; and use of Hansen's (1991) asymptotic distribution for Ŵ(γ̂).

The specification testing procedures just described extend to testing correctness of nonlinear models, as well as testing the specification of likelihood or method of moments-based models. For testing correct specification of a nonlinear model, say Y_t = h(X_t, α) (which for convenience includes an intercept), one can test H₀: β = 0 vs. Hₐ: β ≠ 0 in the augmented model

Y_t = h(X_t, α) + Σ_{j=1}^q G(X̃′_t γ_j) β_j + ε_t.

If α̂_n is the nonlinear least squares estimator under the null (with α̂_n → α* in probability under Hₐ; see White, 1981), then with ε̂_t = Y_t − h(X_t, α̂_n) we have √n M̂_n(γ) → N(0, σ²(γ)) in distribution, where now σ²(γ) = var([G(X̃′_t γ) − b*(γ)′A*⁻¹ ∇_α h(X_t, α*)]ε_t), with

b*(γ) = E(G(X̃′_t γ) ∇_α h(X_t, α*)),  A* = E(∇_α h(X_t, α*) ∇_α h(X_t, α*)′).

We again have

Ŵ(γ) = n M̂_n(γ)² / σ̂²_n(γ) → χ²(1) in distribution

under correct specification, while Ŵ(γ)/n → η(γ) > 0 a.s. under the alternative (misspecification) for essentially all γ. A consistent specification test is therefore available. Optimizing Ŵ(γ) over choice of γ leads to considerations regarding asymptotic testing identical to those arising in the linear case.
For testing correct specification of a likelihood-based model, a consistent m-test (Newey, 1985; Tauchen, 1985; White, 1987b, 1992) can be performed. The starting point is the fact that if l(Z_t, θ) is a correctly specified conditional log-likelihood for Y_t given X_t (i.e., for some θ₀, exp l(Z_t, θ₀) is the conditional density of Y_t given X_t), then

E(s(Z_t, θ₀) | X_t) = 0,

where s is the k × 1 log-likelihood score function, s(Z_t, θ) = ∇_θ l(Z_t, θ). It follows from the law of iterated expectations that with correct specification

E(s(Z_t, θ₀) G(X̃′_t γ)) = 0  for all γ ∈ Γ.

Under standard conditions (e.g., White, 1992, Ch. 9) it follows that with θ̂_n the (quasi-) maximum likelihood estimator consistent under misspecification for θ*, and M̂_n(γ) = n⁻¹ Σ_{t=1}^n (G(X̃′_t γ) ⊗ I_k) ŝ_t, ŝ_t = s(Z_t, θ̂_n), we have √n M̂_n(γ) → N(0, Σ*(γ)) in distribution, where

Σ*(γ) = var([(G(X̃′_t γ) ⊗ I_k) − b*(γ)A*⁻¹] s*_t),

b*(γ) = E([G(X̃′_t γ) ⊗ I_k] ∇_θ s*_t),  A* = E(∇_θ s*_t),

with s*_t = s(Z_t, θ*) and ∇_θ s*_t = ∇_θ s(Z_t, θ*).

Consequently, an argument analogous to that of Bierens (1990, Theorem 4) delivers

Ŵ(γ) = n M̂_n(γ)′ [Σ̂_n(γ)]⁻¹ M̂_n(γ) → χ²(k) in distribution

under correct specification, while Ŵ(γ)/n → η(γ) > 0 a.s. under misspecification for essentially all γ, given an appropriate choice of G, e.g., G(a) = exp(a) as in Bierens (1990), or G(a) = 1/(1 + exp(−a)) or G(a) = tanh(a), as in Stinchcombe and White (1991). A consistent m-test is thus available. Optimizing Ŵ(γ) over choice of γ leads to considerations regarding asymptotic testing identical to those arising in the linear model.

Because ANN models must be recognized from the outset as misspecified, one cannot test hypotheses about estimated parameters of the ANN model in the same way that one would test hypotheses about correctly specified nonlinear models (e.g., as in Gallant, 1973, 1975). Nevertheless, one can test interesting and useful hypotheses within the context of inference for misspecified models (White, 1982, 1992; Gallant and White, 1988a). In this context, two issues arise: the first concerns the interpretation of the hypothesis itself; and the second concerns construction of an appropriate test statistic. Both of these issues can be conveniently illustrated in the context of nonlinear regression, as in White (1981).

The nonlinear least squares estimator θ̂_n solves

min_{θ ∈ Θ} n⁻¹ Σ_{t=1}^n [Y_t − f(X_t, θ)]²,

where, for concreteness, we take f(X_t, θ) to be of the form (1.1.3) with F the identity function. White (1981) provides conditions ensuring that θ̂_n → θ* a.s., where θ* is the solution to

min_{θ ∈ Θ} E([E(Y_t | X_t) − f(X_t, θ)]²).

Thus θ* is the parameter vector of a minimum mean squared error approximation f(X_t, θ*) to E(Y_t | X_t). One can therefore test hypotheses about the parameters of the best approximation. A leading case is that in which a specified explanatory variable (say the rth variable, X_tr) is hypothesized to afford no improvement in predicting Y_t, within the class of approximations permitted by f.

This hypothesis and its alternative are specified as

H₀: S_r θ* = 0  vs.  Hₐ: S_r θ* ≠ 0,

where S_r is a (q + 1) × k selection matrix that picks out the appropriate elements of θ (i.e., the coefficients multiplying X_tr: α_r and γ_{jr}, j = 1, ..., q).

Testing H₀ against Hₐ in the context of a misspecified model can be conveniently done using either Lagrange multiplier (LM) or Wald-type test statistics, but not likelihood ratio statistics, for reasons described in Foutz and Srivastava (1977), White (1982, 1992) and Gallant and White (1988a). The likelihood ratio statistic requires for its convenient use as a χ²(q+1) statistic the validity of the information matrix equality (White, 1982, 1992), which fails under misspecification. The classical LM or Wald statistics also require the validity of the information matrix equality, but can be modified by replacing classical estimators of the asymptotic covariance matrix of θ̂_n with specification robust estimators (White, 1981, 1982, 1992; Gallant and White, 1988a). Thus, a test of H₀ against Hₐ can be conducted using the Wald statistic

Ŵ_n = n θ̂′_n S′_r (S_r Ĉ_n S′_r)⁻¹ S_r θ̂_n,  Ĉ_n = Â_n⁻¹ B̂_n Â_n⁻¹.

The covariance estimator Ĉ_n given here is consistent when {Z_t} is i.i.d., but modifications preserving consistency are available in other contexts. Under the hypothesis that X_tr is irrelevant (and with consistent Ĉ_n), one can show that Ŵ_n → χ²(q+1) in distribution, and that the test is consistent for the alternative. Similar results hold for the LM test statistic. Details can be found in Gallant and White (1988a, Ch. 7) and White (1982; 1992, Ch. 8).
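The robust Wald computation can be sketched in the simplest special case (our construction): when f(X_t, θ) = X̃′_t θ, the estimators Â_n and B̂_n reduce to the familiar heteroskedasticity-robust "sandwich" ingredients, and S_r simply selects the coefficient under test:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 500
x1 = rng.standard_normal(n)                      # irrelevant regressor
x2 = rng.standard_normal(n)                      # relevant regressor
y = 1.0 + 2.0 * x2 + (1.0 + 0.5 * np.abs(x2)) * rng.standard_normal(n)

X = np.column_stack([np.ones(n), x1, x2])        # gradient of f is simply X here
theta = np.linalg.lstsq(X, y, rcond=None)[0]
e = y - X @ theta

A = X.T @ X / n                                  # A_n: mean of grad f * grad f'
Xe = X * e[:, None]
B = Xe.T @ Xe / n                                # B_n: mean of grad f * grad f' * e^2
C = np.linalg.inv(A) @ B @ np.linalg.inv(A)      # specification-robust covariance

def wald(j):
    # S_r selects the j-th coordinate, so S_r C S_r' is just C[j, j]
    return theta[j] ** 2 / (C[j, j] / n)

w_irrelevant = wald(1)   # x1: compare to the chi2(1) 5% critical value, 3.84
w_relevant = wald(2)     # x2: very large
```

The same construction applies to the nonlinear ANN case with X replaced by the matrix of gradients ∇f(X_t, θ̂_n)′, at the cost of a nonlinear optimization for θ̂_n.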


In this section we illustrate methods for estimating ANN models by fitting single
hidden layer feedforward networks to time series generated by three deterministic chaos

processes. The generating equations for these time series are:

(a) The logistic map (Thompson and Stewart, 1986, p. 162):

Y_{t+1} = 3.8 Y_t (1 − Y_t).

(b) The circle map (Thompson and Stewart, 1986, pp. 164, 285–6):

Y_{t+1} = Y_t + (22/π) sin(2π Y_t + …).

(c) The Bier–Bountis map (Thompson and Stewart, 1986, p. 171):

Y_{t+1} = −2 + 28.5 Y_t / (1 + Y_t²).

Chaos (a) is by now a familiar example to economists and econometricians. Chaos (b) and chaos (c) are less familiar, but these three examples, representing polynomial, sinusoidal and rational polynomial functions, provide a modest range of different functions with which to demonstrate ANN capabilities. Time-series plots of the three series are given in Figures 6, 7 and 8.

Because we shall not be adding observational error to the chaotic series, our examples will provide direct insight into the approximation abilities of single hidden layer feedforward networks. In each case, we fit ANN models of the form

f(X_t, θ) = X̃′_t α + β₀ + Σ_{j=1}^q G(X̃′_t γ_j) β_j   (1.4.1)

to the target chaos, Y_t, where G(a) = 1/(1 + exp(−a)), the logistic. Several models are examined in each instance. Specifically, the input X_t is a single lag of the target series Y_t, while the number of hidden units (q) varies from zero to eight. The best model is chosen from these alternatives using the Schwarz Information Criterion (SIC). For each network configuration, we estimate model parameters by a version of the method of nonlinear least squares, i.e., we attempt to solve


Optimization proceeds in two stages. First, the parameter estimates α̂_n are obtained by ordinary least squares, with the parameters β constrained to zero. (Note that α̂_n contains an intercept.) Then, if q > 0, second stage parameter estimates β̂_n and γ̂_n are obtained in such a way as to exploit the structure of (1.4.1); the α̂_n estimates are not subsequently modified, forcing the hidden layer to extract any available structure from the least-squares residuals.

Inspecting (1.4.1), we see that for given γ_j's, ordinary least squares gives fully optimal estimates for β. Thus, we choose a large number of random values for the elements of γ_j, j = 1, ..., q, and compute the least squares estimates for β. This implements a form of global random search of the parameter space. The best fitting values of β and γ are then used as starting values for local steepest descent with respect to β and γ. Within steepest descent, the step size is dynamically adjusted: it increases when improvements to mean squared error occur, and otherwise decreases until a mean squared error improvement is found. Convergence is judged to occur when (mse(k) − mse(k − 1))/(1 + mse(k − 1)) is sufficiently small, where mse(k) denotes sample mean squared error on the kth steepest descent iteration. Once a local minimum is reached, the procedure terminates. This algorithm has been found to be fast and reliable across a variety of applications investigated by the authors.

The results of least squares estimation of a linear model are given in Table 1. The simple linear model explains only 12% of the target variance for the circle map, while explaining 84% of the target variance for the Bier–Bountis map. The logistic map is intermediate at 36%. Results for the single hidden layer feedforward network are given in Table 2. In each case the network takes as many hidden units as are offered (8), and with this number of hidden units, nearly perfect fits are obtained. Because the relationships studied here are noiseless, the SIC starts to limit the number of hidden units chosen essentially only when machine imprecision begins to corrupt the computations. This limit was not reached in these examples.

Our examples show that single hidden layer feedforward networks do have


appealing flexibility, and can be profitably used to extract approximations at least to some simple
chaos-generating functions. Experience in a wide variety of applications across a spectrum of

scientific disciplines suggests that the usefulness of this flexibility is likely to extend broadly to
econometric contexts.
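The logistic-map experiment can be replicated in a few lines. The sketch below (our construction, simplified relative to the text: q is fixed at 8, there is no steepest-descent refinement, and α and β are re-estimated jointly by OLS at each candidate γ rather than in two stages) generates the series and compares the linear and network fits:

```python
import numpy as np

rng = np.random.default_rng(4)

# Logistic map: y_{t+1} = 3.8 * y_t * (1 - y_t)
n = 500
y = np.empty(n + 1)
y[0] = 0.3
for t in range(n):
    y[t + 1] = 3.8 * y[t] * (1.0 - y[t])
x, target = y[:-1], y[1:]                 # input is a single lag of the target

def r2(fitted):
    return 1.0 - np.mean((target - fitted) ** 2) / np.var(target)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Linear model
X0 = np.column_stack([np.ones(n), x])
alpha = np.linalg.lstsq(X0, target, rcond=None)[0]
r2_lin = r2(X0 @ alpha)

# Network with q = 8 hidden units: random search over gamma, OLS for the rest
best_mse, best_fit = np.inf, None
for _ in range(200):
    gam = rng.uniform(-10.0, 10.0, (8, 2))            # random hidden-unit weights
    H = sigmoid(gam[:, 0] + np.outer(x, gam[:, 1]))   # n x 8 hidden activations
    Z = np.column_stack([X0, H])
    beta = np.linalg.lstsq(Z, target, rcond=None)[0]
    fit = Z @ beta
    mse = np.mean((target - fit) ** 2)
    if mse < best_mse:
        best_mse, best_fit = mse, fit
r2_net = r2(best_fit)    # near 1, versus roughly 0.36 for the linear model
```

Even without the steepest-descent stage, the global random search over γ with optimal OLS values for the output weights recovers most of the nonlinear structure of the map.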



thus appear to be worthy


to the modem

econometrician's tool-kit.






In Part I, we briefly discussed the method of stochastic approximation (Robbins and Monro, 1951). The Robbins–Monro (RM) algorithm recursively approximates the zero of an unknown function Ψ(θ), say θ*, by

θ̂_{t+1} = θ̂_t + a_t ψ(Z_t, θ̂_t),   (II.1.1)

where a_t is a "learning rate" tending to zero, and ψ(Z_t, θ) is a measurement of Ψ(θ) at time t, generated by random variables Z_t. When Ψ(θ) = E(ψ(Z_t, θ)), this method yields a recursive implementation of the method of m-estimation of Huber (1964). In particular, the method can be
used to estimate recursively the parameters of nonlinear regression models, such as those arising

in neural network applications. The RM algorithm has two significant advantages: (1) its recursive nature places few demands on computer resources; and (2) in theory , just one pass through a sufficiently large data
set can yield a consistent estimate. The RM algorithm is therefore particularly appealing for

estimating parameters of nonlinear models in large data sets. Very general results relevant to the convergence properties of the RM algorithm have been given by Kushner and Clark (1978) (KC) and Kushner and Huang (1979) (KH). However, the conditions of KC/KH are not primitive and require some effort to apply. In this part of the paper,

we bridge an existing gap between the results of KC/KH and some interesting and fairly broad classes of applications.

(b) (i) |ψ(z, θ)| ≤ b(θ) h₁(z) + h₂(z); and

(ii) there exist functions ρ₁ : ℝ⁺ → ℝ⁺ and h₃ : ℝˢ → ℝ⁺ such that ρ₁(u) → 0 as u → 0, h₃ is measurable-𝔅ˢ, and for each (z, θ₁, θ₂) in ℝˢ × Θ × Θ,

|ψ(z, θ₁) − ψ(z, θ₂)| ≤ ρ₁(|θ₁ − θ₂|) h₃(z),

where |·| denotes the Euclidean norm.
denotes the Euclidean norm.



ASSUMPTION A.3: E|ψ(Z_t, θ)| < ∞ for each θ in Θ, and there exists a function Ψ : Θ → ℝᵏ continuous on Θ such that for each θ in Θ, Ψ(θ) = lim_{t→∞} E ψ(Z_t, θ).

ASSUMPTION A.4: {a_t} is a sequence of positive real numbers such that a_t → 0 as t → ∞ and Σ_{t=0}^n a_t → ∞ as n → ∞.

ASSUMPTION A.5: (a) For each θ in Θ, Σ_{t=0}^n a_t [ψ(Z_t, θ) − Eψ(Z_t, θ)] converges a.s.-P; and (b) for j = 1, 2, 3, there exist bounded non-stochastic sequences {η_jt} such that Σ_{t=0}^n a_t [h_j(Z_t) − η_jt] converges a.s.-P.

Assumption A.1 introduces the data generating process, and Assumption A.2 imposes some suitable and relatively mild restrictions on the growth and smoothness properties of the measurement function ψ. Assumption A.3 is a mild asymptotic mean stationarity requirement. In Assumption A.4, the condition a_t → 0 ensures that the effect of error adjustment eventually vanishes; the condition Σ_{t=1}^n a_t → ∞ allows the adjustment to continue for an arbitrarily long time, so that the eventual convergence of (II.1.1) is always plausible.

Assumption A.5 imposes mild convergence conditions on the processes depending on Z_t. Below we consider more primitive mixingale conditions that ensure the validity of this assumption.

Let π : ℝᵏ → Θ be a measurable projection function (for θ ∈ Θ, π(θ) = θ). We then have that for all RM estimates θ̂_t, π(θ̂_t) ∈ Θ. In what follows, θ̂_t will also denote the projected estimates.

This result generalizes classical results (e.g., Blum, 1954) in several respects. First, Z_t is not required to enter the function ψ additively. Second, the learning rate a_t is not required to be square summable. Most importantly, general behavior for Z_t is allowed, provided that Assumption A.5 holds. As examples, KC consider martingale difference sequences and moving average processes.

A general class of stochastic processes satisfying the convergence conditions of Assumption A.5 is the class of mixingales (McLeish, 1975). Let ‖·‖_p denote the L_p-norm, ‖X‖_p = (E|X|ᵖ)^{1/p}. When ‖X‖_p < ∞ we write X ∈ L_p(P). If X is a matrix or vector, X ∈ L_p(P) whenever each element of X belongs to L_p(P); in this case ‖·‖_p is as just defined, with |·| denoting the spectral norm induced by the Euclidean norm. We use the following definition.


DEFINITION II.2.2: Let {X_t} be a sequence of random variables belonging to L₂(P) and let {F_t} be a filtration of F. The sequence {X_t, F_t} is a mixingale process if for sequences of nonnegative constants {c_t} and {ξ_m} such that ξ_m → 0 as m → ∞, we have for all t ≥ 1 and m ≥ 0

‖E(X_t | F_{t−m})‖₂ ≤ c_t ξ_m  and  ‖X_t − E(X_t | F_{t+m})‖₂ ≤ c_t ξ_{m+1}.

{X_t} is a mixingale of size −a if ξ_m = O(mᵘ) for some u < −a. (We drop explicit reference to the filtration when there is no risk of confusion.) When ξ_m satisfies this last condition, we also say that ξ_m is of size −a.

Our definition of size is convenient, but also stronger than that considered by McLeish (1975). As special cases, mixingale processes include independent sequences, martingale difference sequences, φ-, ρ- and α-mixing processes, finite and certain infinite order moving average processes, and sequences of near epoch dependent functions of infinite histories of mixing processes (discussed further in the next section). Mixingales thus constitute a rather broad class of dependent heterogeneous processes.

In our applications, we always assume that the relevant random variables are measurable with respect to F_t, so that the second mixingale condition holds automatically. This avoids anticipativity of the RM algorithm.


The following conditions permit application of McLeish's mixingale convergence theorem (McLeish, 1975, Corollary 1.8) to verify the conditions of Assumption A.5.

ASSUMPTION A.4′: {a_t} is a sequence of positive real numbers such that Σ_{t=1}^n a_t → ∞ as n → ∞ and Σ_{t=1}^∞ a_t² < ∞.

ASSUMPTION A.5′: (a) For each θ in Θ, sup_t ‖ψ(Z_t, θ)‖₂ ≤ Δ_θ < ∞ and {ψ(Z_t, θ) − Eψ(Z_t, θ), F_t} is a mixingale of size −1/2, where F_t = σ(Z₁, ..., Z_t); and (b) for j = 1, 2, 3, sup_t ‖h_j(Z_t)‖₂ ≤ Δ < ∞ and {h_j(Z_t) − Eh_j(Z_t), F_t} is a mixingale of size −1/2.

Note also that sup_t ‖ψ(Z_t, θ)‖₂ ≤ Δ_θ < ∞ is implied by Assumptions A.5′(b) and A.2(b.i), and that we may take η_jt = Eh_j(Z_t). We have the following result.
COROLLARY II.2.3: Given Assumptions A.1–A.3, A.4′ and A.5′, let {θ̂_t} be given by (II.1.1) with θ̂₀ chosen arbitrarily. Then the conclusions of Theorem II.2.1 hold for θ̂_t.












Assumption A.5′ is a reasonable candidate for further specialization to achieve additional simplicity. This is most conveniently done by placing conditions on h₁, h₂, h₃ and {Z_t} sufficient to ensure that the mixingale property is valid. We give examples of this in the next section.

The present result gives a very considerable generalization of a convergence result of White (1989a, Proposition 3.1), where Z_t is taken to be an i.i.d. uniformly bounded sequence. Corollary II.2.3 also generalizes results of Englund, Holst and Ruppert (1988), who assume that {Z_t} is a stationary mixing process and that ψ is a bounded function.

Asymptotic normality follows as a consequence of Theorem 2 of KH. As KH show, the fastest rate of convergence obtains with a_t = (t + 1)⁻¹; we adopt this rate for the rest of this section. For given θ* ∈ ℝᵏ we write u_t = (t + 1)^{1/2}(θ̂_t − θ*).

Straightforward manipulations allow us to write

u_{t+1} = [I_k + (t + 1)⁻¹ H_t] u_t + (t + 1)^{−1/2} q*_t,   (II.2.4)

where

H_t = ∇_θ ψ*_t + [((t + 2)/(t + 1))^{1/2} − 1] ∇_θ ψ*_t + I_k/2 + O((t + 1)⁻¹) I_k,   (II.2.5)

q*_t = ((t + 2)/(t + 1))^{1/2} ψ*_t,

with ψ*_t = ψ(Z_t, θ*) and ∇_θ ψ*_t = ∇_θ ψ(Z_t, θ*). The piecewise constant interpolation u⁰(·) of u_t on [0, ∞) with interpolation intervals {a_t} is defined as u⁰(τ) = u_t, τ ∈ [τ_t, τ_{t+1}), and the leftward shifts are defined as uᵗ(τ) = u⁰(τ_t + τ).


The asymptotic distribution of θ̂_t is found by showing that uᵗ(·) converges to the solution of a stochastic differential equation (SDE) and then characterizing the weak limit of uᵗ(·). We adopt the following conditions:

ASSUMPTION B.1: Assumption A.1 holds and {Z_t, t = 0, ±1, ±2, ...} is a stationary sequence on (Ω, F, P).


ASSUMPTION B.2: (a) Assumption A.2(a) holds; and (b) for each z ∈ ℝˢ, ψ(z, ·) is continuously differentiable on Θ, and there exist functions ρ₂ : ℝ⁺ → ℝ⁺ and h₄ : ℝˢ → ℝ⁺ such that ρ₂(u) → 0 as u → 0, h₄ is measurable-𝔅ˢ, and for some θ⁰ interior to Θ and each (z, θ) in ℝˢ × Θ⁰, Θ⁰ an open neighborhood in Θ of θ⁰,

|∇_θ ψ(z, θ) − ∇_θ ψ(z, θ⁰)| ≤ ρ₂(|θ − θ⁰|) h₄(z).

ASSUMPTION B.3: There exists θ* ∈ int Θ such that θ* = θ⁰ in Assumption B.2, Eψ*_t = 0, ψ*_t ∈ L₆(P), ∇_θ ψ*_t ∈ L₂(P), and the eigenvalues of H = H* + I_k/2 (with H* = E(∇_θ ψ*_t)) have negative real parts.



ASSUMPTION B.5: Let F₀ = σ(Z_t, t ≤ 0), and suppose: (a) Σ_{j=0}^∞ ‖E(ψ*_t ψ*′_{t+j} | F₀) − σ_j‖₂ < ∞, where σ_j = E(ψ*_t ψ*′_{t+j}); (b) there exists η₄ ∈ ℝ such that Σ_{t=0}^∞ (t + 1)⁻¹ [h₄(Z_t) − η₄] converges a.s.-P; and (c) Σ_{t=0}^∞ (t + 1)⁻¹ [∇_θ ψ*_t − H*] and Σ_{t=0}^∞ (t + 1)⁻¹ [|∇_θ ψ*_t| − h*] converge a.s.-P, where h* = E|∇_θ ψ*_t|.

The stationarity imposed in Assumption B.1 is extremely convenient; without this, the analysis becomes exceedingly complicated. Assumption B.2(b) imposes a Lipschitz condition on ∇_θ ψ analogous to that of A.2(b.ii) for ψ. Assumption B.3 imposes additional moment conditions and identifies θ* as a stable equilibrium. As we take a_t = (t + 1)⁻¹, there is no analog to Assumption A.4 or A.4′. Assumption B.5 imposes some further convergence conditions beyond those of A.5. Assumption B.5(a) restricts the local fluctuations (quadratic variation) induced by (t + 1)^{−1/2} q*_t in (II.2.4) to be compatible with those of a Wiener process. Assumption B.5(b, c) (together with B.2) ensures that the effects of the second term and the last term in (II.2.5) eventually vanish.
The asymptotic normality result can be stated as follows.

THEOREM II.2.4: Suppose Assumptions B.1–B.3 and B.5 hold, and that θ̂_t → θ* a.s., where {θ̂_t} is generated by (II.1.1) with θ̂₀ arbitrary, a_t = (t + 1)⁻¹, and θ* is an isolated element of Θ*. Then:

(a) {u_t} is tight in ℝᵏ;

(b) {uᵗ(·)} converges weakly to the stationary solution U(·) of the SDE dU(τ) = H U(τ) dτ + Σ*^{1/2} dW(τ), where W(·) denotes a standard Wiener process and Σ* = Σ_{j=−∞}^∞ σ_j;

(c) u_t → N(0, F*) in distribution, where F* = ∫₀^∞ exp[Hs] Σ* exp[H′s] ds is the unique solution to the matrix equation HF* + F*H′ = −Σ*;

(d) if H* is symmetric, F* = M L M′, where M is the orthogonal matrix that diagonalizes −H*, with the eigenvalues (λ₁, ..., λ_k) of −H* in decreasing order on the diagonal, and L has (i, j) element (λ_i + λ_j − 1)⁻¹ K_ij, where K = M′ Σ* M.

If a_t is chosen to be (t + 1)⁻¹ A (for finite nonsingular k × k matrix A), then the SDE in Theorem II.2.4(b) becomes dU(τ) = H U(τ) dτ + A Σ*^{1/2} dW(τ), and the covariance matrix of the asymptotic distribution becomes A F* A′. Part (d) gives an alternative expression for the covariance matrix of the asymptotic distribution, analogous to that given by Fabian (1968). Despite the assumed stationarity, Theorem II.2.4 generalizes previous results in that the random variables can be unbounded and the measurements can be correlated (cf. Ljung and Söderström, 1983, Ch. 4, and Fabian, 1968). Again, the properties of mixingales can be exploited to verify the convergence conditions. We impose


ASSUMPTION B.5′: (i) {ψ*_t, F_t} is a mixingale of size −2 with c_t ≤ K for some K < ∞, t = 1, 2, ...; (ii) there exists a constant K < ∞ and a sequence of real numbers {b_t} such that ‖E(ψ*_t ψ*′_{t+j} | F₀) − σ_j‖₂ ≤ K b_t for all j, and {b_t} is of size −2; and (iii) {∇_θ ψ*_t − H*, F_t} and {|∇_θ ψ*_t| − h*, F_t} are mixingales of size −1/2.

We have the following result.

COROLLARY II.2.5: Suppose Assumptions B.1–B.3 and B.5′ hold, and that θ̂_t → θ* a.s., where {θ̂_t} is generated by (II.1.1) with θ̂₀ arbitrary, a_t = (t + 1)⁻¹ and θ* is an isolated element of Θ*. Then the conclusions of Theorem II.2.4 hold.

This considerably generalizes an analogous result of White (1989a, Proposition 4.1) from the i.i.d. uniformly bounded case to the stationary dependent case. Englund, Holst and Ruppert (1988) also give a result for i.i.d. observations.



Suppose the nonlinear model f(X_t, δ) (f : ℝʳ × D → ℝ, X_t a random r × 1 vector, δ ∈ D ⊂ ℝᵏ) is to be used to forecast the random variable Y_t. It is common to seek δ*, a solution to the problem

min_{δ ∈ D} E([Y_t − f(X_t, δ)]²),

and form a forecast f(X_t, δ*). The solution δ* is also a solution to the problem

Ψ(δ) = E(∇_δ f(X_t, δ) [Y_t − f(X_t, δ)]) = 0,

where ∇_δ is the gradient operator with respect to δ, yielding a k × 1 column vector. The simple RM algorithm for this problem in nonlinear least squares regression is the algorithm (II.1.1) with

ψ(Z_t, θ) = ∇_δ f(X_t, δ) [Y_t − f(X_t, δ)],

where Z_t = (Y_t, X′_t)′ and θ = δ. The updating equation is

δ̂_{t+1} = δ̂_t + a_t ∇_δ f̂_t [Y_t − f̂_t],

where f̂_t = f(X_t, δ̂_t) and ∇_δ f̂_t = ∇_δ f(X_t, δ̂_t). This algorithm is known as a "stochastic gradient method." In this section we consider the properties of this algorithm and two useful variants, the "quick" and the "modified" RM algorithms.

A disadvantage of the simple RM algorithm is that it may converge very slowly (e.g., White, 1988). To improve the speed of convergence, a natural modification is to take an approximate Gauss–Newton step at each stage. This yields the modified RM algorithm, also known as the "stochastic Newton method." The algorithm is given by (II.1.1) with

ψ(Z_t, θ) = [ψ₁(Z_t, θ)′, ψ₂(Z_t, θ)′]′, where

ψ₁(Z_t, θ) = vec[∇_δ f(X_t, δ) ∇_δ f(X_t, δ)′ − G],

ψ₂(Z_t, θ) = G⁻¹ ∇_δ f(X_t, δ) [Y_t − f(X_t, δ)],

and θ = (vec(G)′, δ′)′.

The updating equations are then

Ĝ_{t+1} = Ĝ_t + a_t [∇_δ f̂_t ∇_δ f̂_t′ − Ĝ_t],   (II.3.2a)

δ̂_{t+1} = δ̂_t + a_t Ĝ⁻¹_{t+1} ∇_δ f̂_t [Y_t − f̂_t].   (II.3.2b)

We take Ĝ₀ to be an arbitrary symmetric matrix.

The difficulties of applying this algorithm are: (1) the inversion of Ĝ_{t+1} is computationally demanding; and (2) the updating estimates Ĝ_t need not be positive-definite, pointing the algorithm in the wrong direction. The first problem can be solved by use of the rank one updating formula for the matrix inverse. Let P_{t+1} = Ĝ⁻¹_{t+1} and λ_t = (1 − a_t)/a_t. The modified RM algorithm is algebraically equivalent to

δ̂_{t+1} = δ̂_t + a_t P_{t+1} ∇_δ f̂_t [Y_t − f̂_t],

with P_{t+1} updated recursively from P_t without matrix inversion; cf. Ljung and Söderström (1983, Ch. 2 & 3). The choice P̂₀ = I_k is often convenient.
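The rank one updating formula can be made concrete as follows (our sketch, using the standard Sherman–Morrison identity rather than the λ_t parameterization used in recursive least squares treatments): writing (II.3.2a) as Ĝ_{t+1} = (1 − a_t)Ĝ_t + a_t ∇f̂_t ∇f̂_t′, the inverse P_t can be propagated without ever inverting a matrix:

```python
import numpy as np

rng = np.random.default_rng(5)
k = 3
G = np.eye(k)        # G_0: symmetric positive-definite starting value
P = np.eye(k)        # P_0 = inverse of G_0

for t in range(1, 51):
    a = 1.0 / (t + 1)
    v = rng.standard_normal(k)       # stands in for the gradient of f at step t
    # Direct update (II.3.2a): G_{t+1} = (1 - a) G_t + a v v'
    G = (1.0 - a) * G + a * np.outer(v, v)
    # Rank-one (Sherman-Morrison) update of the inverse, with A = (1 - a) G_t:
    # (A + a v v')^{-1} = A^{-1} - a (A^{-1} v)(A^{-1} v)' / (1 + a v' A^{-1} v)
    Ainv = P / (1.0 - a)
    Av = Ainv @ v
    P = Ainv - a * np.outer(Av, Av) / (1.0 + a * v @ Av)

err = np.max(np.abs(P - np.linalg.inv(G)))   # P tracks the inverse exactly
```

Each update costs O(k²) operations rather than the O(k³) of a full inversion, which is the point of the rank one formula.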


To ensure that Ĝ_t is positive-definite, we may use the following modification of (II.3.2a):

Ĝ_{t+1} = Ĝ_t + a_t [∇_δ f̂_t ∇_δ f̂_t′ − Ĝ_t] + M_{t+1}(ε),

where ε is some predetermined positive number, and M_{t+1}(ε) is chosen so that Ĝ_{t+1} − εI is positive-semidefinite. Some practical implementations of this can be found in Ljung and Söderström (1983, Ch. 6). A similar device can be applied to P̂_t. Implementation of this algorithm will be understood to employ a projection device restricting δ̂_t to a compact set D and Ĝ_t to a compact convex set Γ such that the maximum and minimum eigenvalues of Ĝ_t lie in a bounded strictly positive interval.

A simplification of the modified RM algorithm is to choose G to be a diagonal matrix. In particular, we take G = c Ik, where c is a positive scalar, so that matrix inversion is avoided. This yields the quick RM algorithm, given by (II.1.1) with ψ = [ψ1', ψ2']', where now

ψ1(Zt, η) = ∇θ f(Xt, θ)' ∇θ f(Xt, θ) - c,

ψ2(Zt, η) = c⁻¹ ∇θ f(Xt, θ)[Yt - f(Xt, θ)],

so that the updating equations become

ĉt+1 = ĉt + at [∇θ f̂t' ∇θ f̂t - ĉt],    (II.3.5a)

θ̂t+1 = θ̂t + at ĉt+1⁻¹ ∇θ f̂t [Yt - f̂t].    (II.3.5b)

The scalar ĉt can easily be modified to be positive in a manner analogous to (3.4); we also restrict ĉt to be bounded. The quick RM algorithm is a compromise between the other two algorithms, in that it takes a negative gradient direction with a scaling factor utilizing some local curvature information. Consequently, the quick algorithm ought to converge more quickly than the simple algorithm but more slowly than the modified algorithm. When at = (t + 1)⁻¹, the quick algorithm reduces to the "quick and dirty" algorithm of Albert and Gardner (1967, Ch. 7).
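A minimal sketch of the quick RM recursion on a toy one-parameter logistic-unit model; the positivity floor on ĉt and the projection of θ̂t onto a compact interval correspond to the devices described in the text, and all numerical choices are illustrative assumptions.

```python
import math
import random

def f(x, theta):
    # toy one-parameter "network": a single logistic unit
    return 1.0 / (1.0 + math.exp(-theta * x))

rng = random.Random(1)
theta_star, theta, c = 1.5, 0.0, 0.25    # c estimates the mean squared gradient
for t in range(50000):
    x = rng.uniform(-2.0, 2.0)
    y = f(x, theta_star) + rng.gauss(0.0, 0.02)
    p = f(x, theta)
    g = p * (1.0 - p) * x                # gradient of f in theta
    a_t = 1.0 / (t + 2)                  # "quick and dirty" gains
    c = max(c + a_t * (g * g - c), 0.02)             # curvature recursion, floored
    theta = theta + a_t * g * (y - p) / c            # scaled gradient step
    theta = max(-3.0, min(3.0, theta))               # projection onto a compact set
```

The scalar ĉt self-normalizes the step size, which is exactly how the quick algorithm improves on the raw gradient recursion while avoiding any matrix inversion.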
It is straightforward to impose conditions ensuring the validity of all assumptions required for the convergence results of the preceding section. Only the mixingale assumptions A.5' and B.5' require particular attention. We make use of a convenient and fairly general class of mixingales: near epoch dependent (NED) functions of mixing processes (Billingsley, 1968; McLeish, 1975; Gallant and White, 1988a).

Let {Vt} be a stochastic process on (Ω, IF, P) and define the mixing coefficients

φm = sup_τ sup{F ∈ IF_{-∞}^τ, G ∈ IF_{τ+m}^∞ : P(F) > 0} |P(G|F) - P(G)|,

αm = sup_τ sup{F ∈ IF_{-∞}^τ, G ∈ IF_{τ+m}^∞} |P(G ∩ F) - P(G) P(F)|,

where IF_τ^t = σ(Vτ, ..., Vt). When φm -> 0 or αm -> 0 as m -> ∞ we say that {Vt} is φ-mixing (uniform mixing) or α-mixing (strong mixing). When φm = O(m^λ) for some λ < -a we say that {Vt} is φ-mixing of size -a, and similarly for αm. We use the following definition of near epoch dependence, where we adopt the notation E_{t-m}^{t+m}(·) ≡ E(· | IF_{t-m}^{t+m}).


DEFINITION II.3.1: Let {Zt} be a sequence of random variables belonging to L2(P), and let {Vt} be a stochastic process on (Ω, IF, P). Then {Zt} is near epoch dependent (NED) on {Vt} of size -a if νm ≡ sup_t ||Zt - E_{t-m}^{t+m}(Zt)||2 is of size -a.
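For a concrete example of the NED definition, consider a stationary AR(1) process regarded as a function of its own innovations: the two-sided conditional expectation truncates the moving-average representation, so νm is available in closed form and decays geometrically. The parameter values below are illustrative.

```python
import math

rho, sigma = 0.9, 1.0    # AR(1): Z_t = rho * Z_{t-1} + V_t, V_t iid (0, sigma^2)

def nu(m):
    # || Z_t - E(Z_t | V_{t-m}, ..., V_{t+m}) ||_2 for the linear AR(1):
    # Z_t = sum_{j>=0} rho^j V_{t-j}, and the conditional expectation keeps
    # the terms with j <= m, so the L2 error is the tail of the MA sum.
    return sigma * rho ** (m + 1) / math.sqrt(1.0 - rho ** 2)

# nu(m) decays geometrically at rate rho, hence is of size -a for every a,
# so {Z_t} is NED on {V_t} of every size
ratios = [nu(m + 1) / nu(m) for m in range(10)]
```

Because the decay is geometric rather than merely polynomial, linear AR processes satisfy the NED requirement of any size, which is why the definition comfortably covers lag-dependent regressors.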

The following three results make it straightforward to impose conditions sufficing for Assumptions A.5' and B.5'. The first is obtained by following the argument of Theorem 3.1 of McLeish (1975). The second simplifies a result of Andrews (1989). The third allows simple treatment of products of NED sequences.


PROPOSITION II.3.2: Let {Zt ∈ Lp(P)}, p ≥ 2, be NED on {Vt} of size -a, where {Vt} is a sequence with φm of size -ap/(p - 1), or with αm of size -2ap/(p - 2), p > 2. Then {Zt - E(Zt)} is a mixingale of size -a.

PROPOSITION II.3.3: Let {Zt} satisfy the conditions of Proposition II.3.2, and let g: IRs -> IR satisfy

|g(z1) - g(z2)| ≤ L |z1 - z2|

for some L < ∞ and all z1, z2 in IRs. Then {g(Zt) ∈ L2(P)} is NED on {Vt} of size -a. If {Vt} satisfies the conditions of Proposition II.3.2, then {g(Zt) - E(g(Zt))} is a mixingale of size -a.

PROPOSITION II.3.4: Let {Ut} and {Wt} be two sequences NED on {Vt} of size -a.

(a) If sup_t |Wt| ≤ Δ < ∞ and sup_t ||Ut||4 ≤ Δ < ∞, then sup_t ||Ut Wt||4 ≤ Δ² and {Ut Wt} is NED on {Vt} of size -a/2.

(b) If sup_t ||Wt||8 ≤ Δ < ∞ and sup_t ||Ut||8 ≤ Δ < ∞, then sup_t ||Ut Wt||4 ≤ Δ² and {Ut Wt} is NED on {Vt} of size -a/2.

(c) If sup_t ||Ut||8 ≤ Δ < ∞ and {Vt} satisfies the conditions of Proposition II.3.2, then there exist K < ∞ and a sequence of real numbers {bt} such that sup_{j≥0} ||E(Ut Ut+j | IF⁰) - E(Ut Ut+j)||2 ≤ K bt, and bt is of size -a/2.

Our subsequent use of Proposition II.3.4(a) requires sup_t ||Yt||4 ≤ Δ and a bound on the elements of Xt. Part (b) illustrates use of the Cauchy-Schwartz inequality to relax the boundedness condition; the price for this is a corresponding strengthening of moment conditions on Ut (corresponding to Yt). Here we shall adopt boundedness conditions on Xt to minimize moment conditions placed on Yt and facilitate verification of the Lipschitz condition of Proposition II.3.3. Part (c) permits verification of Assumption B.5'(a.ii). We impose the following conditions.

ASSUMPTION C.1: {Zt = (Yt, Xt')'} is NED on {Vt} of size -1, with Xt bounded and sup_t ||Yt||p ≤ Δ < ∞, where {Vt} is a mixing sequence on (Ω, IF, P) with φm of size -p/2(p - 1), or αm of size -p/(p - 2), p ≥ 4.


ASSUMPTION C.2: f: IRr x Θ -> IR is jointly measurable, where Θ is a compact subset of IRk. For each x ∈ IRr, f(x, ·) is continuously differentiable, and f(x, ·) and ∇θ f(x, ·) each satisfy a Lipschitz condition with Lipschitz constants L1(x) and L2(x), where L1 and L2 are each Lipschitz continuous in x. For each θ ∈ Θ, f(·, θ) and ∇θ f(·, θ) each satisfies a Lipschitz condition.

methods thus coincide, so that the RM estimators tend to the same limit(s) as the nonlinear least squares estimator (cf. Ljung and Söderström, 1983). Corollary II.3.5 is more general than the i.i.d. case treated by White (1989a) and the examples given in KC (Ch. 2), as we allow the data to be moderately dependent and heterogeneous. This result differs from those of Metivier and Priouret (1984) in that we require neither "conditional independence" nor stationarity.

Corollary II.3.5 also generalizes a result of Ruppert (1983). Ruppert assumes that for some θ*, Yt = f(Xt, θ*) + εt, and that (Xt, εt) is strong mixing of size -p/(p - 2), a condition that may fail when Xt contains lagged Yt, because Yt need not be mixing when it is generated in this manner, even when εt and other elements of Xt are mixing. Indeed, this fact partially motivates our use of near epoch dependence. Also, we do not require that Yt is generated in the manner assumed by Ruppert (i.e., we may be estimating a "misspecified" model). Compared to the result of Ljung and Söderström (1983), we allow more dependence in the data, as the data need not be generated by a linear filter.

The modified RM algorithm can be identified with the extended Kalman filter for the nonlinear signal model

Yt = f(Xt, θt) + εt,

θt = θ0 for all t.

The Kalman gain is at P̂t+1 ∇θ f̂t. Corollary II.3.5 thus provides conditions more general than previously available ensuring consistency of the filter. In particular, the model can be misspecified, and the data can be NED on some underlying mixing sequence. Because the quick RM algorithm includes Albert and Gardner's quick and dirty algorithm, Corollary II.3.5 directly generalizes their consistency result to the case of dependent observations.
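The identification with the Kalman filter (equivalently, recursive least squares for a constant state) can be checked numerically in the scalar linear case f(x, θ) = θx: with at = (t + 1)⁻¹, the modified RM recursion reproduces the ordinary least squares estimator exactly, whatever the starting values. The data-generating numbers below are illustrative.

```python
import random

rng = random.Random(2)
xs, ys = [], []
theta, G = 5.0, 1.0                # arbitrary starting values
for t in range(500):
    x = rng.uniform(0.5, 2.0)
    y = 0.8 * x + rng.gauss(0.0, 0.1)
    xs.append(x)
    ys.append(y)
    a_t = 1.0 / (t + 1)
    G = G + a_t * (x * x - G)      # curvature recursion; at t = 0 it wipes G_0
    theta = theta + a_t * x * (y - theta * x) / G   # gain a_t * G^{-1} * x

# closed-form least squares estimate from the full sample
ols = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
```

With these gains, (t + 1)Ĝt+1 accumulates to the sum of squared regressors, so the recursive gain is exactly the least squares gain; the two estimates agree up to rounding.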

To obtain asymptotic normality results for the case of nonlinear regression, we impose the following conditions.


For this we impose appropriate conditions. In particular, we adopt Assumption C.1. The assumption of uniformly bounded Xt causes no loss of generality in the present context. This is a consequence of the fact that E(Yt | Xt) = E(Yt | X̃t), where X̃ti = Φ(Xti), i = 1, ..., r, and Φ: IR -> [0, 1] is a strictly increasing continuous function. If Xt is not uniformly bounded then X̃t is, and we seek an approximation to g(X̃t) = E(Yt | X̃t). We revert to our original notation in what follows, with the implicit understanding that Xt has been transformed so that Assumption C.1 holds. Note, however, that Yt is not assumed bounded, providing the desired generality.
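The no-loss-of-generality argument can be illustrated directly: when Φ is strictly increasing and continuous, the composition g = m ∘ Φ⁻¹ on the bounded scale agrees everywhere with the original conditional mean function m on the unbounded scale. The logistic choice of Φ and the particular m below are illustrative.

```python
import math

def squash(x):                 # Phi: strictly increasing map from R onto (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def unsquash(u):               # Phi inverse
    return math.log(u / (1.0 - u))

def m(x):                      # some conditional mean function E(Y | X = x)
    return math.sin(x) + 0.1 * x

def g(u):
    # target on the transformed, bounded input scale: g = m o Phi^{-1}
    return m(unsquash(u))

# approximating g on (0, 1) is equivalent to approximating m on R
err = max(abs(g(squash(x)) - m(x)) for x in [i / 10.0 for i in range(-50, 51)])
```

Nothing about the conditional mean is lost in the transformation; only the scale of the regressors changes, which is what lets boundedness be assumed without loss of generality.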

ASSUMPTION E.1: f: IRr x Θ -> IR is given by (II.4.1), where Θ = A x B x Γ, with A and B compact subsets of IR^{q+1} and IR^{q(r+1)} respectively, Γ as before, and with G: IR -> IR a bounded function continuously differentiable of order 3.

The conditions on G are readily verified for the logistic c.d.f. and hyperbolic tangent "squashers" commonly used in neural network applications.

THEOREM II.4.1: Given Assumptions C.1, E.1, C.3 and A.4', let {θ̂t} be given by (II.3.1), (II.3.2) or (II.3.5) (the simple, modified and quick algorithms, respectively) with θ̂0 chosen arbitrarily. Then the conclusions of Theorem II.2.1 hold.

Thus the method of back-propagation and its generalizations converge to a parameter vector giving a locally mean square optimal approximation to the conditional expectation function E(Yt | Xt) under general conditions on the stochastic process {Zt}. This result considerably generalizes Theorem 3.2 of White (1989a). For the asymptotic distribution results, we impose the following condition.



ASSUMPTION E.2: Assumption E.1 holds with G continuously differentiable of order 4.

THEOREM II.4.2: Let {θ̂t} be generated by (II.3.1), (II.3.2) or (II.3.5) with θ̂0 chosen arbitrarily, at = (t + 1)⁻¹, and let θ* be an isolated element of Θ*. Then the conclusions of Theorem II.2.4 hold.

considered here. For many choices of ψ, the analysis parallels that for the least squares case rather closely. These results are within relatively easy reach for such estimation procedures.

For neural network models, it is desirable to relax the assumption that q is fixed. Letting q -> ∞ as the available sample becomes arbitrarily large permits use of neural network models for purposes of non-parametric estimation. Off-line non-parametric estimation methods for the case of mixing processes are treated by White (1990a) using results for the method of sieves (Grenander, 1981; White and Wooldridge, 1991). On-line non-parametric estimation methods appear possible, but will require convergence to a global optimum of the underlying least squares problem, not just the local optimum that the present methods deliver. Results of Kushner (1987) for the method of simulated annealing provide hope that convergence to the global optimum is achievable for the case of dependent observations with appropriate modifications to the RM procedure.

Finally, it is of interest to consider RM algorithms for neural network models that generalize the feedforward networks treated here by allowing certain internal feedbacks. Such "recurrent" network models have been considered by Jordan (1986), Elman (1988) and Williams and Zipser (1989). For example, in the Elman (1988) setup, hidden layer activations feed back, so that network output is Ot = F(At' β), Atj = G(Xt' δj + At-1' θj), j = 1, ..., q, where At = (At0, At1, ..., Atq)', At0 = 1. This allows for internal network memory and for rich dynamic behavior of network output. Learning in such models is complicated by the fact that at any stage of learning, network output depends not only on the entire past history of inputs Xt, but also on the entire past history of estimated parameters θ̂t. Results of KC are relevant for treating such internal feedbacks. Convergence of RM estimates in recurrent networks is studied by Kuan (1989) and Kuan, Hornik and White (1990).
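A minimal forward-pass sketch of an Elman-type recurrent network of the form just described, with the output map F taken to be the identity and G the logistic squasher; the weights and dimensions are arbitrary illustrative values. The final comparison exhibits the internal memory: the output at the last date changes when the earlier inputs are removed.

```python
import math
import random

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def elman_output(x_seq, delta, theta, beta, q):
    # hidden activations feed back: A_tj = G(x_t' delta_j + A_{t-1}' theta_j)
    A_prev = [1.0] + [0.0] * q           # A_{t-1}; the leading 1 is the bias unit
    outputs = []
    for x in x_seq:
        A = [1.0]
        for j in range(q):
            s = sum(xi * d for xi, d in zip(x, delta[j]))
            s += sum(ai * th for ai, th in zip(A_prev, theta[j]))
            A.append(logistic(s))
        outputs.append(sum(ai * b for ai, b in zip(A, beta)))  # O_t = A_t' beta
        A_prev = A
    return outputs

rng = random.Random(3)
q, r = 3, 2                              # 3 hidden units; inputs x_t = (1, x)'
delta = [[rng.gauss(0.0, 1.0) for _ in range(r)] for _ in range(q)]
theta = [[rng.gauss(0.0, 0.5) for _ in range(q + 1)] for _ in range(q)]
beta = [rng.gauss(0.0, 1.0) for _ in range(q + 1)]
x_seq = [[1.0, rng.gauss(0.0, 1.0)] for _ in range(5)]

outs = elman_output(x_seq, delta, theta, beta, q)
# same final input with the history removed: A_{t-1} differs, so output differs
no_history = elman_output(x_seq[-1:], delta, theta, beta, q)
```

Because the hidden state A_{t-1} enters every activation, the current output depends on the whole input history, which is precisely the complication for recursive learning noted in the text.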


II.2.1(b) follows from Theorem II.2.1(c). Finally, we show that cycling between two asymptotically stable equilibria is impossible. It is easy to see that points in Θ* must be isolated. Let θ1* and θ2* be two isolated points in Θ*, and let N1 and N2 be neighborhoods of θ1* and θ2*, respectively, such that N1 ⊆ d(Θ*), N2 ⊆ d(Θ*), and N1 ∩ N2 = ∅. If the path of θ̂t cycles between θ1* and θ2*, θ̂t must move from, say, N1 to N2 infinitely often. Let {ti} be an infinite subsequence of {t} such that θ̂ti ∈ N1. Then θ̂^{ti}(·) is a subsequence of θ̂^t(·) and has limit θ̄(·) satisfying the ODE θ̇ = π[h̄(θ)]. Then θ̄(0) ∈ N1, but for every τ there is a t > τ such that θ̄(t) is not in N1, so θ̄(·) cannot remain in N1. This violates the asymptotic stability of θ1* and proves Theorem II.2.1(c).



PROOF OF COROLLARY II.2.3: The result follows from Theorem II.2.1, because the condition on at in Assumption A.4' implies at -> 0 as t -> ∞, and Assumption A.5' implies Assumption A.5 by the mixingale convergence theorem (McLeish, 1975).



PROOF OF THEOREM II.2.4: We verify the conditions for Theorem 2 of KH. We first observe that the conditions [A1], [A4], [A7] and [A8] of KH are directly assumed, and that [A3] of KH is ensured by Assumption B.5(c) and Lemma A1.

Second, we show that the consequence of [A2] of KH holds under Assumptions B.2(b) and B.5(b, c). This amounts to showing that the second assertion in Lemma 1 of KH holds. By Assumption B.2(b) we have (a.10). Clearly, the integral on the RHS of (a.10) converges to zero a.s. because θ̂t -> θ* a.s. Let {εk} be a sequence of positive real numbers such that Σk εk < ∞, and let {Nk} be a sequence tending to infinity as k -> ∞. Define measurable sets Ak, Bk, Ck, Dk and Fk as:



PROOF OF COROLLARY II.2.5: Only Assumption B.5 needs to be verified. We observe that Assumption B.5'(b) is a mixingale condition ensuring Assumption B.5(b, c) by the mixingale convergence theorem. To establish Assumption B.5(a), we see that Assumption B.5'(a.i) ensures that for some K < ∞,

||E(ψt* | IF0)||2 ≤ K κt,

where κt is the mixingale memory coefficient. The fact that κt is of size -2 implies that Σ_{t=0}^∞ κt < ∞. This establishes Assumption B.5(a.i). Similarly, Assumption B.5'(a.ii) imposes

ζt ≡ sup_{j≥0} ||E(ψt* ψt+j*' - E(ψt* ψt+j*') | IF0)||2 ≤ K bt.

That bt is of size -2 ensures that Σ_{t=0}^∞ ζt < ∞. This establishes Assumption B.5(a.ii).




PROOF OF PROPOSITION II.3.2: See Gallant and White (1988a, Lemma 3.14).

PROOF OF PROPOSITION II.3.3: See Andrews (1989).

PROOF OF PROPOSITION II.3.4: (a) We first observe that

||Ut Wt - E_{t-m}^{t+m}(Ut Wt)||2 ≤ ||Ut Wt - Ut,m Wt,m||2,

where Ut,m ≡ E_{t-m}^{t+m}(Ut) and Wt,m ≡ E_{t-m}^{t+m}(Wt). Here we employ the fact that E_{t-m}^{t+m}(Ut Wt) is the best L2-predictor of Ut Wt among all IF_{t-m}^{t+m}-measurable functions. Hence,

||Ut Wt - Ut,m Wt,m||2 ≤ ||Ut (Wt - Wt,m)||2 + ||(Ut - Ut,m) Wt,m||2,

and the boundedness of Wt (hence of Wt,m) together with sup_t ||Ut||4 ≤ Δ delivers the stated moment bound and the NED property of size -a/2.

(b) The result follows by the same argument, applying the Cauchy-Schwartz inequality to each term in place of the boundedness of Wt, using the L8 bounds on Ut and Wt.

(c) Using the same argument as in (b) we can show

||Ut Ut+j - E_{t-s}^{t+s}(Ut Ut+j)||2 ≤ ||Ut [Ut+j - E_{t-s}^{t+s}(Ut+j)]||2 + ||[Ut - E_{t-s}^{t+s}(Ut)] E_{t-s}^{t+s}(Ut+j)||2.

Hence, with E0(·) ≡ E(· | IF0), we have

||E0(Ut Ut+j) - E(Ut Ut+j)||2 ≤ ||E0[Ut Ut+j - E_{t-s}^{t+s}(Ut Ut+j)]||2 + ||E0[E_{t-s}^{t+s}(Ut Ut+j)] - E(Ut Ut+j)||2,    (a.17)

where s = [t/2] is the integer part of t/2. By Jensen's inequality, the second term in (a.17) is bounded by a term of the form K bt, where K is a constant. It then follows from Lemma 2.1 of McLeish (1975) and Lemma 3.14 of Gallant and White (1988a) that bt can be chosen of size -a/2.

PROOF OF COROLLARY II.3.5: We verify the conditions of Corollary II.2.3. Because the other conditions obviously hold, for the simple RM estimates it suffices to show that Assumptions A.2(b) and A.5' hold. Given Assumption C.2, it is straightforward to verify that f and ∇θ f are such that |f(x, θ)| ≤ Q1(x) and |∇θ f(x, θ)| ≤ Q2(x) for all θ ∈ Θ (compact), where Q1 and Q2 are Lipschitz continuous in x. Therefore,

|ψ(z, θ)| = |∇θ f(x, θ)[y - f(x, θ)]| ≤ Q2(x)[|y| + Q1(x)],

so that Assumption A.2(b.i) holds for b(θ) = 1, h1(z) = 1, and h2(z) = Q2(x)[|y| + Q1(x)]. Next,

|∇θ f(x, θ1)[y - f(x, θ1)] - ∇θ f(x, θ2)[y - f(x, θ2)]| ≤ |∇θ f(x, θ1) y - ∇θ f(x, θ2) y| + |∇θ f(x, θ2) f(x, θ2) - ∇θ f(x, θ1) f(x, θ1)|.    (a.18)

It follows from Assumption C.2 that

|∇θ f(x, θ1) y - ∇θ f(x, θ2) y| ≤ |y| L2(x) |θ1 - θ2|,

and

|∇θ f(x, θ2) f(x, θ2) - ∇θ f(x, θ1) f(x, θ1)| ≤ |∇θ f(x, θ2)| |f(x, θ2) - f(x, θ1)| + |f(x, θ1)| |∇θ f(x, θ2) - ∇θ f(x, θ1)|

≤ |∇θ f(x, θ2)| L1(x) |θ1 - θ2| + |f(x, θ1)| L2(x) |θ1 - θ2|.

Hence (a.18) becomes

|ψ(z, θ1) - ψ(z, θ2)| ≤ h3(z) |θ1 - θ2|,

where h3(z) = |y| L2(x) + Q2(x) L1(x) + Q1(x) L2(x). This establishes Assumption A.2(b.ii).

Proposition II.3.3 ensures that |Yt|, L1(Xt), L2(Xt), Q1(Xt) and Q2(Xt) are NED on {Vt} of size -1. Because Xt is bounded, Q1(Xt), Q2(Xt), L1(Xt) and L2(Xt) are bounded. Because ||Yt||4 ≤ Δ, it follows from Proposition II.3.4(a) and Corollary 4.3(a) of Gallant and White (1988a) (i.e., sums of random variables NED of size -a are also NED of size -a) that h3(Zt) is NED on {Vt} of size -1/2. The mixing conditions of Assumption C.1 then ensure that {h3(Zt) - Eh3(Zt)} is a mixingale of size -1/2 by Proposition II.3.2. Similarly, {h2(Zt) - Eh2(Zt)} is a mixingale of size -1/2, establishing Assumption A.5'(ii).

We next verify that for each θ ∈ Θ, {ψ(Zt, θ)} is a mixingale of size -1/2. Fix θ ∈ Θ. The Lipschitz condition on f(·, θ) and the conditions on {Zt} imply by Proposition II.3.3 that {f(Xt, θ)} is NED on {Vt} of size -1. The triangle inequality, the continuity of f(·, θ), the boundedness of Xt, and the fact that ||Yt||4 ≤ Δ < ∞ imply that ||Yt - f(Xt, θ)||4 ≤ Δ̃ < ∞. The Lipschitz condition on ∇θ f(·, θ) and the conditions on {Zt} imply by Proposition II.3.3 that {∇θ f(Xt, θ)} is also NED on {Vt} of size -1. Further, the elements of ∇θ f(Xt, θ) are bounded, so that by Proposition II.3.4(a), {ψ(Zt, θ) = ∇θ f(Xt, θ)[Yt - f(Xt, θ)]} is NED on {Vt} of size -1/2. It follows from Proposition II.3.2 that {∇θ f(Xt, θ)[Yt - f(Xt, θ)]} is a mixingale of size -1/2, given the mixing conditions imposed on {Vt} by Assumption C.1. Thus Assumption A.5'(i) holds, and the result for the simple RM procedure follows.

For the modified RM estimates we first note that every element of G⁻¹ is bounded above, so that |G⁻¹| < Δ for some Δ.











Next,

|ψ1(z, η)| = |vec(∇θ f(x, θ) ∇θ f(x, θ)' - G)| ≤ |vec(∇θ f(x, θ) ∇θ f(x, θ)')| + |vec G| = |∇θ f(x, θ)|² + |vec G|,

where we use the fact that |vec A| = [tr(A'A)]^{1/2}; because G is restricted to the compact set Γ, |vec G| is bounded.

We now establish a mean-value expansion result for G⁻¹. Recall that G is restricted to a convex compact set Γ, so the mean value theorem applies. A matrix differentiation result shows that when G is symmetric and nonsingular, dG⁻¹/dg_ij = -G⁻¹ S_ij G⁻¹, where g_ij is the ij-th element of G and S_ij is a selection matrix whose every element is zero except that the ij-th and ji-th elements are one; see Graybill (1983, p. 358). Hence we can write, for some K < ∞,

|vec(G1⁻¹ - G2⁻¹)| ≤ K |vec(G1 - G2)|.    (a.20)

Next,

|ψ1(z, η1) - ψ1(z, η2)| ≤ |vec(∇θ f(x, θ1) ∇θ f(x, θ1)' - ∇θ f(x, θ2) ∇θ f(x, θ2)')| + |vec(G1 - G2)|.    (a.19)

The first term of (a.19) is less than 2K Q2(x) L2(x) |θ1 - θ2|, where we used the fact that vec(ABC) = (C' ⊗ A) vec B, and K is a constant depending only on k, the dimension of θ. Thus (a.19) becomes

|ψ1(z, η1) - ψ1(z, η2)| ≤ 2K Q2(x) L2(x) |θ1 - θ2| + |vec(G1 - G2)| ≤ h3'(z) |η1 - η2|,

where h3'(z) = 2K Q2(x) L2(x) + 1. Hence Assumption A.2(b.ii) holds, with the Lipschitz function for ψ given by the sum of h3'(z) and the corresponding function obtained for ψ2 as in the simple RM case. It can be verified as before that the resulting domination and Lipschitz functions, and {ψ(Zt, η) - Eψ(Zt, η)}, are mixingales of size -1/2. Hence Assumption A.5' also holds. This yields the desired results for the modified RM estimates. The conclusions for the quick RM estimates follow because the quick algorithm is a special case of the modified algorithm.


PROOF OF COROLLARY II.3.6: We verify the conditions of Corollary II.2.5. For the simple RM estimates we need to show that Assumptions B.2(b) and B.5' hold. In this case

∇θ ψ(z, θ) = ∇θ(∇θ f(x, θ)[y - f(x, θ)]) = ∇θθ f(x, θ)[y - f(x, θ)] - ∇θ f(x, θ) ∇θ f(x, θ)';    (a.21)

hence, writing f = f(x, θ), f° = f(x, θ°), etc.,

|∇θ ψ(z, θ) - ∇θ ψ(z, θ°)| ≤ |∇θθ f [y - f] - ∇θθ f° [y - f°]| + |∇θ f ∇θ f' - ∇θ f° ∇θ f°'|.

By Assumption D.2, θ° = θ*. Applying Assumption D.3 we get

|∇θθ f y - ∇θθ f° y| ≤ |y| L3(x) |θ - θ°|,

and, using |∇θθ f| ≤ Q3(x),

|(∇θθ f) f - (∇θθ f°) f°| ≤ |(∇θθ f) f - (∇θθ f) f°| + |(∇θθ f) f° - (∇θθ f°) f°| ≤ [Q3(x) L1(x) + Q1(x) L3(x)] |θ - θ°|.

Similarly,

|∇θ f° ∇θ f°' - ∇θ f ∇θ f'| ≤ |∇θ f° ∇θ f°' - ∇θ f° ∇θ f'| + |∇θ f° ∇θ f' - ∇θ f ∇θ f'| ≤ 2 Q2(x) L2(x) |θ - θ°|.

For the modified RM estimates we must in addition bound

|G⁻¹ [∇θθ f (y - f) - ∇θ f ∇θ f'] - (G°)⁻¹ [∇θθ f° (y - f°) - ∇θ f° ∇θ f°']|.    (a.22)

It follows from (a.20) that the first term in (a.22) is less than

Δ [|y| L3(x) + Q3(x) L1(x) + Q1(x) L3(x) + 2 Q2(x) L2(x)] |θ - θ°|.

It can also be verified that the second term in (a.22) is less than

|G⁻¹ - (G°)⁻¹| [Q3(x)(|y| + Q1(x)) + (Q2(x))²].

Thus (a.22) is bounded by h4''(z) |η - η°|, where

h4''(z) = Δ [|y| L3(x) + Q3(x) L1(x) + Q1(x) L3(x) + 2 Q2(x) L2(x)] + K [Q3(x)(|y| + Q1(x)) + (Q2(x))²].

We also note the fact that |A| ≤ |vec A| ≤ Σi Σj |aij|, where A is a square matrix and aij are its elements. Combining these results we immediately get

|∇η ψ(z, η) - ∇η ψ(z, η°)| ≤ h4(z) |η - η°|,

where h4(z) is the sum of the bounding functions obtained above. This establishes Assumption B.2(b). All other conditions can be verified as in the proof for the simple RM algorithm. Thus the asymptotic distribution result for θ̂t follows from Corollary II.2.5 with H* = E(∇η ψt*), where ∇η ψ(z, η) is given by (a.21).


Here the first equality follows from the fact that exp[(-Ik/2)c] = [exp(-c/2)] Ik. For the quick RM algorithm, H3* is also block triangular, and the lower right k x k block of F3*, denoted F3, can be computed in the same way, so that

(t + 1)^{1/2} (θ̂t - θ*) ->d N(0, F3).

We now show that F1 - G*⁻¹ Σ1 G*⁻¹ is positive semidefinite. From Theorem II.2.4(c) we get

-Σ1 = H1 F1 + F1 H1',

so that

(H1 + I/2) F1 + F1 (H1' + I/2) = F1 - Σ1.

Hence F1 - G*⁻¹ Σ1 G*⁻¹ is positive semidefinite, where (F1)^{1/2} is such that (F1)^{1/2}(F1)^{1/2} = F1, and the result follows.

PROOF OF THEOREM II.4.1: Owing to the compactness of the relevant domains, the special structure of f in (II.4.1) and the continuous differentiability of G, it is straightforward to verify the domination and Lipschitz conditions required for application of Corollary II.3.5.


[Table 1: regression results for data generated by the Logistic, Circle and Bier-Bountis maps. N = number of observations (250 in each case); σ̂ = regression standard error; R² = squared multiple correlation; SIC = Schwartz Criterion; k = number of estimated parameters (= 2).]






[Table 2: neural network regression results for the Logistic, Circle and Bier-Bountis maps (q = 8 hidden units, N = 250 observations each). q* = SIC-optimal number of hidden units; other notation as in Table 1.]

Albert, A.E., and L.A. Gardner (1967): Stochastic Approximation and Nonlinear Regression, Cambridge: M.I. T. Press.

Amemiya, T. (1981): "Qualitative Response Models: A Survey," Journal of Economic Literature

19, 1483-1536.
Amemiya, T. (1985): Advanced Econometrics. Cambridge: Harvard University Press.

Andrews, D. W.K. (1989): "An Empirical Process Central Limit Theorem for Dependent NonIdentically Distributed Random Variables," Cowles Foundation Discussion Paper, Yale University.

Andrews, D.W.K. (1991a): "Asymptotic Optimality of Generalized CL, Cross-validation and Generalized Cross-validation in Regression with Heteroskedastic Errors," Journal of Econometrics 47, 359-378.

Andrews, D.W.K. (1991b): "Asymptotic Normality of Series Estimators for Nonparametric and Semi-parametric Regression Models," Econometrica 59, 307-345.


Arnold, L. (1974): Stochastic Differential Equations: Theory and Applications. New York: John Wiley & Sons.

Barron, A. (1990): "Complexity Regularization with Application to Artificial Neural Networks," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 57.

Barron, A. (1991a): "Universal Approximation Bounds for Superpositions of a Sigmoidal Function," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 58.


Barron, A. (1991b): "Approximation and Estimation Bounds for Artificial Neural Networks," University of Illinois at Urbana-Champaign Department of Statistics Technical Report 59.

Baxt, W.G. (1991): "The Optimization of the Training of an Artificial Neural Network Trained to Recognize the Presence of Myocardial Infarction by the Variance of Disease Likelihood," UC San Diego Medical Center Technical Report.
Bierens, H. (1990): "A Consistent Conditional Moment Test of Functional Form," Econometrica 58, 1443-1458.

Billingsley, P. (1968): Convergence of Probability Measures. New York: John Wiley & Sons.

Billingsley, P. (1979): Probability and Measure. New York: Wiley.

Blum, J.R. (1954): "Approximation Methods Which Converge with Probability One," Annals of
Mathematical Statistics 25,382-386.

Blum, E.K. and L.K. Li (1991): "Approximation Theory and Feedforward Networks," Neural Networks 4, 511-516.

Carroll, S.M. and B.W. Dickinson (1989): "Construction of Neural Nets Using the Radon Transform," in Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. I:607-611.

Cybenko, G. (1989): "Approximation by Superpositions of a Sigmoidal Function," Mathematics of Control, Signals, and Systems 2, 303-314.

Cowan, J. (1967): "A Mathematical Theory of Central Nervous Activity," unpublished Ph.D. dissertation, University of London.

Davies, R.B. (1977): "Hypothesis Testing When a Nuisance Parameter is Present Only Under the Alternative," Biometrika 64, 247-254.

Davies, R.B. (1987): "Hypothesis Testing When a Nuisance Parameter is Present Only Under the Alternative," Biometrika 74, 33-43.

Domowitz, I. and H. White (1982): "Misspecified Models with Dependent Observations," Journal of Econometrics 20, 35-58.

Duffie, D. and K.J. Singleton (1990): "Simulated Moments Estimation of Markov Models of Asset Prices," NBER Technical Paper 87.

Elbadawi, I., A.R. Gallant and G. Souza (1983): "An Elasticity Can be Estimated Consistently Without A Priori Knowledge of Functional Form," Econometrica 51, 1731-1752.

Elman, J.L. (1988): "Finding Structure in Time," CRL Report 8801, Center for Research in Language, UC San Diego.

Englund, J.-E., U. Holst, and D. Ruppert (1988): "Recursive M-Estimators of Location and Scale for Dependent Sequences," Scandinavian Journal of Statistics 15, 147-159.

Fabian, V. (1968): "On Asymptotic Normality in Stochastic Approximation," Annals of Mathematical Statistics 39, 1327-1332.

Foutz, R.V. and R.C. Srivastava (1977): "The Performance of the Likelihood Ratio Test When the Model is Incorrect," Annals of Statistics 5, 1183-1194.

Friedman, J.H. and W. Stuetz1e (1981): "Projection Pursuit Regression," Journal of the American Statistical Association 76,817-823.

Fukushima, K. and S. Miyake (1982): "Neocognitron: A New Algorithm for Pattern Recognition Tolerant of Deformations and Shifts in Position," Pattern Recognition 15, 455-469.

Funahashi, K. (1989): "On the Approximate Realization of Continuous Mappings by Neural Networks," Neural Networks 2, 183-192.

Gallant, A.R. (1973): "Inference for Nonlinear Models," North Carolina State University, Institute of Statistics, Mimeograph Series No. 875.

Gallant, A.R. (1975): "Testing a Subset of the Parameters of a Nonlinear Regression Model," Journal of the American Statistical Association 70, 927-932.

Gallant, A.R. (1981): "On the Bias in Flexible Functional Forms and an Essentially Unbiased Form: The Fourier Flexible Form," Journal of Econometrics 15, 211-245.

Gallant, A.R. (1987): "Identification and Consistency in Seminonparametric Regression," in T. Bewley ed., Advances in Econometrics: Fifth World Congress. New York: Cambridge University Press, pp. 145-170.

Gallant, A.R. and H. White (1988a): A Unified Theory of Estimation and Inference for Nonlinear Dynamic Models. Oxford: Basil Blackwell.

Gallant, A.R. and H. White (1988b): "There Exists a Neural Network that Does Not Make Avoidable Mistakes," Proceedings of the Second Annual IEEE Conference on Neural Networks, San Diego. New York: IEEE Press, pp. I:657-664.

Gallant, A.R. and H. White (1991): "On Learning the Derivatives of an Unknown Mapping with Multilayer Feedforward Networks," Neural Networks 4 (to appear).

Gamba, A., L. Gamberini, G. Palmieri and R. Sanna (1961): "Further Experiments with PAPA," Nuovo Cimento Suppl. 20, 221-231.

Gauss, K.F. (1809): Theory of the Motion of Heavenly Bodies. New York: Dover.

Geman, S. and C. Hwang (1982): "Nonparametric Maximum Likelihood Estimation by the


Method of Sieves," Annals of Statistics 10,401-414.

Gerencser, L. (1986): "Parameter Tracking of Time-Varying Continuous-Time Linear Stochastic Systems," in C.E. Byrnes and A.
Robust Control, New York: Elsevier,
Lindquist eds., Modelling, Identification and

pp. 581-594.

Go1dstein,L. (1988): "On the Choice of Step Size in the Robbins-Monro Procedure," Statistics and Probability Letters 6, 299-303.

Gourieroux, C., A. Monfort and A. Trognon (1984a): "Pseudo-Maximum Likelihood Methods: Theory," Econometrica 52, 681-700.

Gourieroux, C., A. Monfort and A. Trognon (1984b): "Pseudo-Maximum Likelihood Methods: Application to Poisson Models," Econometrica 52, 701- 720.

Graybill, F.A. (1983): Matrices with Applications in Statistics, second edition. Belmont: Wadsworth.

Grenander, U. (1981): Abstract Inference. New York: Wiley.

Hansen, B. (1991): "Inference When a Nuisance Parameter is Not Identified Under the Null Hypothesis," University of Rochester Department of Economics Discussion Paper.

Hecht-Nielsen, R. (1989): "Theory of the Back-Propagation Neural Network," Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. I:593-606.

Hendry, D.F. and J.-F. Richard: "Likelihood Evaluation for Dynamic Latent Variable Models," Duke Institute of Statistics and Decision Sciences Discussion Paper 90A15.

Hornik, K. (1991): "Approximation Capabilities of Multilayer Feedforward Nets," Neural Networks 4, 251-257.

Hornik, K. and C.-M. Kuan (1990): "Convergence of Learning Algorithms with Constant Learning Rates," University of Illinois at Urbana-Champaign Department of Economics Discussion Paper.

Hornik, K., M. Stinchcombe, and H. White (1989): "Multi-Layer Feedforward Networks Are Universal Approximators," Neural Networks 2, 359-366.

Hornik, K., M. Stinchcombe and H. White (1990): "Universal Approximation of an Unknown Mapping and Its Derivatives Using Multilayer Feedforward Networks," Neural Networks 3, 551-560.

Hu, S. and W. Joerding (1990): "Monotonicity and Concavity Restrictions for a Single Hidden Layer Feedforward Network," Washington State University Department of Economics Discussion Paper.

Huber, P.J. (1964): "Robust Estimation of a Location Parameter," Annals of Mathematical Statistics 35, 73-101.

Huber, P.J. (1967): "The Behavior of Maximum Likelihood Estimates Under Nonstandard Conditions," Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability. Berkeley: University of California Press, 1, pp. 221-233.


Huber, P.J. (1985): "Projection Pursuit," Annals of Statistics, 13,435-475.

Joerding, W. and J. Meador (1990): "Encoding A Priori Knowledge in Neural Networks," Washington State University Department of Economics Discussion Paper.

Jones, L.K. (1991): "A Simple Lemma on Greedy Approximation in Hilbert Space and Convergence Rates for Projection Pursuit Regression and Neural Network Training,"
Annals of Statistics (forthcoming).


Jordan, M.I. (1986): "Serial Order: A Parallel Distributed Processing Approach," UC San Diego, Institute for Cognitive Science Report 8604.

Kuan, C.-M. (1989): "Estimation of Neural Network Models," Ph.D. Dissertation, UC San Diego.

Kuan, C.-M., K. Hornik and H. White (1990): "Some Convergence Results for Learning in Recurrent Neural Networks," UCSD Department of Economics Discussion Paper.

Kuan, C.-M. and H. White (1991): "Strong Convergence of Recursive m-estimators for Models with Dynamic Latent Variables," UC San Diego Department of Economics Discussion Paper 91-05R.

Kushner, H.J. (1987): "Asymptotic Global Behavior for Stochastic Approximation and Diffusions with Slowly Decreasing Noise Effects: Global Minimization via Monte Carlo," SIAM Journal of Applied Mathematics 47, 169-185.

Kushner, H.J. and D.S. Clark (1978): Stochastic Approximation Methods for Constrained and Unconstrained Systems. New York: Springer-Verlag.

Kushner, H.J. and H. Huang (1979): "Rates of Convergence for Stochastic Approximation Type Algorithms," SIAM Journal of Control and Optimization 17, 607-617.

Kushner, H.J. and H. Huang (1981): "Asymptotic Properties of Stochastic Approximations with Constant Coefficients," SIAM Journal of Control and Optimization 19, 87-105.

Lapedes, A. and R. Farber (1987): "Nonlinear Signal Processing Using Neural Networks: Prediction and System Modeling," Los Alamos National Laboratory Technical Report.

Le Cun, Y. (1985): "Une Procedure d'Apprentissage pour Reseau a Seuil Assymetrique" (A Learning Procedure for an Asymmetric Threshold Network), Proceedings of Cognitiva 85, 599-604.

Lee, T.H., H. White and C.W.J. Granger (1991): "Testing for Neglected Nonlinearity in Time Series Models," Journal of Econometrics (forthcoming).

Li, K.-C. (1987): "Asymptotic Optimality for Cp, CL, Cross-Validation and Generalized CrossValidation: Discrete Index Set," Annals of Statistics 15, 958-975.

Ljung, L. (1977): "Analysis of Recursive Stochastic Algorithms," IEEE Transactions on Automatic Control AC-22, 551-575.

Ljung, L. and T. Soderstrom (1983): Theory and Practice of Recursive Identification. Cambridge:
M.I.T. Press.

Lukacs, E. (1975): Stochastic Convergence. 2nd ed., New York: Academic Press.

Marcet, A. and T.J. Sargent (1989): "Convergence of Least Squares Learning Mechanisms in Self Referential, Linear Stochastic Models," Journal of Economic Theory 48, 337-368.

Maxwell, T., C.L. Giles, Y.C. Lee and H.H. Chen (1986): "Nonlinear Dynamics of Artificial Neural Systems," in J. Denker ed., Neural Networks for Computing. New York: American Institute of Physics.

McCulloch, W.S. and W. Pitts (1943): "A Logical Calculus of the Ideas Immanent in Nervous Activity," Bulletin of Mathematical Biophysics 5, 115-133.

McLeish, D.L. (1975): "A Maximal Inequality and Dependent Strong Laws," Annals of Probability 3, 829-839.

Metivier, M. and P. Priouret (1984): "Applications of a Kushner and Clark Lemma to General Classes of Stochastic Algorithms," IEEE Transactions on Information Theory IT-30,

Minsky, M. and S. Papert (1969): Perceptrons. Cambridge: M.I.T. Press.


Morris, R. and W.-S. Wong (1991): "Systematic Choice of Initial Points in Local Search: Extensions and Application to Neural Networks," Information Processing Letters (forthcoming).

Newey, W. (1985): "Maximum Likelihood Specification Testing and Conditional Moment Tests," Econometrica 53, 1047-1070.

Palmieri, G. and R. Sanna (1960): Methodos 12, No.48.

Parker, D.B. (1982): "Learning Logic," Invention Report 581-64 (File 1), Stanford University Office of Technology Licensing.

Parker, D.B. (1985): "Learning-Logic," M.I.T. Center for Computational Research in Economics and Management Science Technical Report TR-47.

Potscher, B. and I. Prucha (1991a): "Basic Structure of the Asymptotic Theory in Dynamic Nonlinear Econometric Models, Part I: Consistency and Approximation Concepts," Econometric Reviews (forthcoming).

Potscher, B. and I. Prucha (1991b): "Basic Structure of the Asymptotic Theory in Dynamic Nonlinear Econometric Models, Part II: Asymptotic Normality," Econometric Reviews (forthcoming).

Robbins, H. and S. Monro (1951): "A Stochastic Approximation Method," Annals of Mathematical Statistics 22, 400-407.

Rosenblatt, F. (1957): "The Perceptron: A Perceiving and Recognizing Automaton (Project PARA)," Cornell Aeronautical Laboratory Report 85-460-1.

Rosenblatt, F. (1958): "The Perceptron: A Probabilistic Model for Information Storage and Organization in the Brain," Psychological Review 65, 386-408.


Rosenblatt, F. (1961): Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Washington, D.C.: Spartan Books.
Rumelhart, D.E., G.E. Hinton and R.J. Williams (1986): "Learning Internal Representations by Error Propagation," in D.E. Rumelhart and J.L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge: M.I.T. Press, 1, pp. 318-362.

Ruppert, D. (1983): "Convergence of Stochastic Approximation Algorithms with Non-Additive Dependent Disturbances and Applications," in U. Herkenrath, D. Kalin and W. Vogel eds., Mathematical Learning Models: Theory and Algorithms. New York: Springer-Verlag, pp. 182-190.
Sawa, T. (1978): "Information Criteria for Discriminating Among Alternative Regression Models," Econometrica 46, 1273-1292.

Sejnowski, T. and C. Rosenberg (1986): "NETtalk: A Parallel Network That Learns to Read Aloud," Johns Hopkins University Department of Electrical Engineering and Computer Science Technical Report 86/01.

Selfridge, O., R. Sutton and A. Barto (1985): "Training and Tracking in Robotics," Proceedings of the Ninth International Joint Conference on Artificial Intelligence. Los Angeles: Morgan Kaufmann, 1, pp. 670-672.

Sontag, E. (1990): "Feedback Stabilization Using Two-Hidden-Layer Nets," Rutgers Center for Systems and Control Technical Report SYCON-90-11.

Stinchcombe, M. (1991): "Inner Functions and Universal Approximation Properties," UC San Diego Department of Economics Discussion Paper.

Stinchcombe, M. and H. White (1989): "Universal Approximation Using Feedforward Networks with Non-Sigmoid Hidden Layer Activation Functions," Proceedings of the International Joint Conference on Neural Networks, San Diego. New York: IEEE Press, pp. I:612-617.

Stinchcombe, M. and H. White (1991): "Consistent Specification Testing Using Duality," UC San Diego Department of Economics Discussion Paper.

Sussman, H. (1991): "Uniqueness of the Weights for Minimal Feedforward Nets with a Given Input-Output Map," Rutgers Center for Systems and Control Technical Report SYCON-91-06.


Tauchen, G. (1985): "Diagnostic Testing and Evaluation of Maximum Likelihood Models," Journal of Econometrics 30, 415-444.

Tesauro, G. (1989): "Neurogammon Wins Computer Olympiad," Neural Computation 1, 321-323.

Thompson, J.M.T. and H.B. Stewart (1986): Nonlinear Dynamics and Chaos. New York: Wiley.

Walk, H. (1977): "An Invariance Principle for the Robbins-Monro Process in a Hilbert Space," Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete 39, 135-150.

Werbos, P. (1974): "Beyond Regression: New Tools for Prediction and Analysis in the Behavioral Sciences," unpublished Ph.D. Dissertation, Harvard University, Department of Applied Mathematics.

White, H. (1981): "Consequences and Detection of Misspecified Nonlinear Regression Models," Journal of the American Statistical Association 76, 419-433.


White, H. (1982): "Maximum Likelihood Estimation of Misspecified Models," Econometrica 50, 1-25.

White, H. (1987a): "Some Asymptotic Results for Back-Propagation," Proceedings of the IEEE First International Conference on Neural Networks, San Diego. New York: IEEE Press, pp. III:261-266.

White, H. (1987b): "Specification Testing in Dynamic Models," in Truman Bewley ed., Advances in Econometrics, Fifth World Congress. New York: Cambridge University Press, pp. 1-58.

White, H. (1988): "Economic Prediction Using Neural Networks: The Case of IBM Stock Prices," Proceedings of the Second Annual IEEE Conference on Neural Networks. New York: IEEE Press, pp. II:451-458.

White, H. (1989a): "Some Asymptotic Results for Learning in Single Hidden Layer Feedforward Network Models," Journal of the American Statistical Association 84, 1003-1013.

White, H. (1989b): "An Additional Hidden Unit Test for Neglected Nonlinearity," Proceedings of the International Joint Conference on Neural Networks, Washington D.C. New York: IEEE Press, pp. II:451-455.

White, H. (1990a): "Connectionist Nonparametric Regression: Multilayer Feedforward Networks Can Learn Arbitrary Mappings," Neural Networks 3, 535-549.

White, H. (1990b): "Nonparametric Estimation of Conditional Quantiles Using Neural Networks," UC San Diego Department of Economics Discussion Paper.

White, H. (1992): Estimation, Inference and Specification Analysis. New York: Cambridge University Press (forthcoming).

White, H. and J. Wooldridge (1991): "Some Results for Sieve Estimation with Dependent Observations," in W. Barnett, J. Powell and G. Tauchen eds., Nonparametric and Semiparametric Methods in Economics. New York: Cambridge University Press, pp.


Widrow, B. and M.E. Hoff (1960): "Adaptive Switching Circuits," Institute of Radio Engineers WESCON Convention Record, Part 4, 96-104.

Williams, R. (1986): "The Logic of Activation Functions," in D.E. Rumelhart and J.L. McClelland eds., Parallel Distributed Processing: Explorations in the Microstructures of Cognition. Cambridge: M.I.T. Press, 1, pp. 423-443.

Williams, R.J. and D. Zipser (1989): "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks," Neural Computation 1, 270-280.

Woodford, M. (1990): "Learning to Believe in Sunspots," Econometrica 58, 277-308.

Xu, X. and W.T. Tsai (1990): "Constructing Associative Memories Using Neural Networks," Neural Networks 3, 301-310.

Xu, X. and W.T. Tsai (1991): "Effective Neural Algorithms for the Traveling Salesman Problem," Neural Networks 4, 193-206.

Young, P.C. (1984): Recursive Estimation and Time-Series Analysis. New York: Springer-Verlag.
