MODELING UNCERTAINTY
Edited by
MOSHE DROR
University of Arizona
PIERRE L’ECUYER
Université de Montréal
FERENC SZIDAROVSZKY
University of Arizona
No part of this eBook may be reproduced or transmitted in any form or by any means, electronic,
mechanical, recording, or otherwise, without written consent from the Publisher.
Preface xvii
Part I 13
2
Stability of Single Class Queueing Networks 13
Harold J. Kushner
1 Introduction 13
2 The Model 15
3 Stability: Introduction 22
4 Perturbed Liapunov Functions 23
5 Stability 28
3
Sequential Optimization Under Uncertainty 35
Tze Leung Lai
1 Introduction 35
2 Bandit Theory 37
2.1 Nearly optimal rules based on upper confidence bounds and Gittins indices 37
2.2 A hypothesis testing approach and block experimentation 42
2.3 Applications to machine learning, control and scheduling of queues 44
3 Adaptive Control of Markov Chains 44
3.1 Parametric adaptive control 45
3.2 Nonparametric adaptive control 47
4 Stochastic Approximation 49
4
Exact Asymptotics for Large Deviation Probabilities, with Applications 57
vi MODELING UNCERTAINTY
Iosif Pinelis
1 Limit Theorems on the last negative sum and applications to nonparametric bandit theory 59
1.1 Condition (4)&(8): exponential and superexponential cases 62
1.2 Condition (4)&(8): exponential (beyond (14)) and subexponential cases 63
1.3 The conditional distribution of the initial segment of the sequence of the partial sums given 66
1.4 Application to Bandit Allocation Analysis 68
1.4.1 Test-times-only based strategy 68
1.4.2 Multiple bandits and all-decision-times based strategy 70
2 Large deviations in a space of trajectories 72
3 Asymptotic equivalence of the tail of the sum of independent random vectors and the tail of their maximum 77
3.1 Introduction 77
3.2 Exponential inequalities for probabilities of large deviation of sums of independent Banach space valued r.v.’s 81
3.3 The case of a fixed number of independent Banach space valued r.v.’s. Application to asymptotics of infinitely divisible probability distributions in Banach spaces 83
3.4 Tails decreasing no faster than power ones 86
3.5 Tails decreasing faster than any power ones 88
3.6 Tails decreasing no faster than 89
Part II 95
5
Stochastic Modelling of Early HIV Immune Responses Under Treatment by Protease Inhibitors 95
Wai-Yuan Tan and Zhihua Xiang
1 Introduction 96
2 A Stochastic Model of Early HIV Pathogenesis Under Treatment by a Protease Inhibitor 97
2.1 Modeling the Effects of Protease Inhibitors 98
2.2 Modeling the Net Flow of HIV From Lymphoid Tissues to Plasma 99
2.3 Derivation of Stochastic Differential Equations for The State Variables 100
3 Mean Values of 103
4 A State Space Model for the Early HIV Pathogenesis Under Treatment by Protease Inhibitors 104
4.1 Estimation of given 106
4.2 Estimation of Given with and 107
5 An Example Using Real Data 108
6 Some Monte Carlo Studies 113
6
The impact of re-using hypodermic needles 117
B. Barnes and J. Gani
1 Introduction 117
2 Geometric distribution with variable success probability 118
3 Validity of the distribution 119
4 Mean and variance of I 120
5 Intensity of epidemic 122
6 Reducing infection 123
7 The spread of the Ebola virus in 1976 124
8 Conclusions 128
7
Nonparametric Frequency Detection and Optimal Coding in Molecular Biology 129
David S. Stoffer
1 Introduction 129
2 The Spectral Envelope 133
3 Sequence Analyses 140
4 Discussion 152
Part IV 249
12
The Birth of Limit Cycles in Nonlinear Oligopolies with Continuously Distributed Information Lag 249
Carl Chiarella and Ferenc Szidarovszky
1 Introduction 249
2 Nonlinear Oligopoly Models 251
3 The Dynamic Model with Lag Structure 251
4 Bifurcation Analysis in the General Case 253
5 The Symmetric Case 259
6 Special Oligopoly Models 263
7 Conclusions 267
13
A Differential Game of Debt Contract Valuation 269
A. Haurie and F. Moresino
1 Introduction 269
2 The firm and the debt contract 270
3 A stochastic game 273
4 Equivalent risk neutral valuation 275
4.1 Debt and Equity valuations when bankruptcy is not considered 276
4.2 Debt and Equity valuations when liquidation may occur 278
5 Debt and Equity valuations for Nash equilibrium strategies 280
6 Liquidation at fixed time periods 281
7 Conclusion 282
14
Huge Capacity Planning and Resource Pricing for Pioneering Projects 285
David Porter
1 Introduction 285
2 The Model 287
3 Results 291
3.1 Cost and Performance Uncertainty 292
3.2 Cost Uncertainty and Flexibility 297
3.3 Performance Uncertainty and Flexibility 298
4 Conclusion 298
15
Affordable Upgrades of Complex Systems: A Multilevel, Performance-Based Approach 301
James A. Reneke, Matthew J. Saltzman, and Margaret M. Wiecek
1 Introduction 301
2 Multilevel complex systems 306
2.1 An illustrative example 309
2.2 Computational models for the example 312
3 Multiple criteria decision making 313
3.1 Generating candidate methods 314
3.2 Choosing a preferred selection of upgrades 315
3.3 Application to the example 317
4 Stochastic analysis 320
4.1 Random systems and risk 321
4.2 Application to the example 321
5 Conclusions 322
Appendix: Stochastic linearization 327
1 Origin of stochastic linearization 327
2 Stochastic linearization for random surfaces 327
16
On Successive Approximation of Optimal Control of Stochastic Dynamic Systems 333
Fei-Yue Wang, George N. Saridis
1 Introduction 334
2 Problem Statement 335
3 Sub-Optimal Control of Nonlinear Stochastic Dynamic Systems 337
4 The Infinite-time Stochastic Regulator Problem 346
5 Procedure for Iterative Design of Sub-optimal Controllers 349
5.1 Exact Design Procedure 349
5.2 Approximate Design Procedures for the Regulator Problem 353
6 Closing Remarks by Fei-Yue Wang 356
17
Stability of Random Iterative Mappings 359
László Gerencsér
1 Introduction 359
2 Preliminary results 364
3 The proof of Theorem 1.1 367
Appendix 368
Part V 373
18
’Unobserved’ Monte Carlo Methods for Adaptive Algorithms 373
Victor Solo
1 El Sid 373
2 Introduction 374
3 On-line Binary Classification 375
4 Binary Classification with Noisy Measurements of Classifying Variables - Offline 376
5 Binary Classification with Errors in Classifying Variables - Online 378
6 Conclusions 380
19
Random Search Under Additive Noise 383
Luc Devroye and Adam Krzyzak
1 Sid’s contributions to noisy optimization 383
2 Formulation of search problem 384
3 Random search: a brief overview 385
4 Noisy optimization by random search: a brief survey 390
5 Optimization and nonparametric estimation 393
6 Noisy optimization: formulation of the problem 394
7 Pure random search 394
8 Strong convergence and strong stability 398
9 Mixed random search 399
10 Strategies for general additive noise 400
11 Universal convergence 410
20
Recent Advances in Randomized Quasi-Monte Carlo Methods 419
Pierre L’Ecuyer and Christiane Lemieux
1 Introduction 420
2 A Closer Look at Low-Dimensional Projections 423
3 Main Constructions 425
3.1 Lattice Rules 426
3.2 Digital Nets 428
3.2.1 Sobol’ Sequences 431
3.2.2 Generalized Faure Sequences 431
3.2.3 Niederreiter Sequences 432
3.2.4 Polynomial Lattice Rules 433
3.3 Constructions Based on Small PRNGs 435
3.4 Halton sequence 438
3.5 Sequences of Korobov rules 439
3.6 Implementations 439
4 Measures of Quality 440
4.1 Criteria for standard lattice rules 441
4.2 Criteria for digital nets 444
5 Randomizations 448
5.1 Random shift modulo 1 449
5.2 Digital shift 449
5.3 Scrambling 450
5.4 Random Linear Scrambling 451
5.5 Others 452
6 Error and Variance Analysis 452
6.1 Standard Lattices and Fourier Expansion 453
6.2 Digital Nets and Haar or Walsh Expansions 455
6.2.1 Scrambled-type estimators 455
6.2.2 Digitally shifted estimators 457
7 Transformations of the Integrand 461
8 Related Methods 462
9 Conclusions and Discussion 464
Appendix: Proofs 464
Part VI 475
21
Singularly Perturbed Markov Chains and Applications to Large-Scale Systems under Uncertainty 475
G. Yin, Q. Zhang, K. Yin and H. Yang
1 Introduction 476
2 Singularly Perturbed Markov Chains 480
2.1 Continuous-time Case 481
2.2 Time-scale Separation 483
3 Properties of the Singularly Perturbed Systems 485
3.1 Asymptotic Expansion 485
3.2 Occupation Measures 487
3.3 Large Deviations and Exponential Bounds 492
Moshe Sniedovich
1 Introduction 735
2 Remedies 738
3 The Big Fix 739
4 The Rest is Mathematics 740
5 Refinements 744
6 Non-Markovian Objective functions 746
7 Discussion 748
30
Reflections on Statistical Methods for Complex Stochastic Systems 751
Marcel F. Neuts
1 The Changed Statistical Scene 751
2 Measuring Teletraffic Data Streams 754
3 Monitoring Queueing Behavior 757
Fifty authors from all over the world collectively contributed 30 papers to
this volume. Each of these papers was reviewed and in the majority of cases
the original submission was revised before being accepted for publication in
the book. The papers cover a great variety of topics in probability, statistics,
economics, stochastic optimization, control theory, regression analysis, simulation, stochastic programming, Markov decision processes, applications in the HIV context, and others. Some of the papers have a theoretical emphasis and others
focus on applications. A number of papers have the flavor of survey work in a
particular area and in a few papers the authors present their personal view of a
topic. This book has a considerable number of expository articles which should
be accessible to a nonexpert, say a graduate student in a mathematics, statistics, engineering, or economics department, or anyone with some mathematical background who is interested in a preliminary exposition of a particular
topic. A number of papers present the state of the art of a specific area or
represent original contributions which advance the present state of knowledge.
Thus, the book has something for almost anybody with an interest in stochastic
systems.
The editors have loosely grouped the chapters into 8 segments, according
to some common mathematical thread. Since none of us (the co-editors) is an expert in all the topics covered in this book, it is quite conceivable that the papers could have been grouped differently. Part 1 starts with a paper on stability in queueing networks by H.J. Kushner. Part 1 also includes a queueing-related
paper by T.L. Lai, and a paper by I. Pinelis on asymptotics for large deviation
probabilities. Part 2 groups together 3 papers related to HIV modelling. The
first paper in this group, by W.-Y. Tan and Z. Xiang, is about modelling early immune responses; it is followed by a paper by B. Barnes and J. Gani on the impact of re-using hypodermic needles, and the group closes with a paper by D.S. Stoffer. Part 3
groups together optimization and regression papers. It contains 4 papers starting
with a paper by A. Nemirovski and R.Y. Rubinstein about classical stochastic
approximation. The next paper is by B. Kedem and K. Fokianos on regression
models for binary time series, followed by a paper by H. Walk on properties of Nadaraya-Watson regression estimates, and closing with a paper on sequential predictions of stationary time series by L. Györfi and G. Lugosi. Part 4's six papers are in the area of economic analysis, starting with a nonlinear oligopolies
paper by C. Chiarella and F. Szidarovszky. The paper by A. Haurie and F.
Moresino examines a differential game of debt contract valuation. Next comes
a paper by D. Porter, followed by a paper about complex systems in relation to
affordable upgrades by J.A. Reneke, M.J. Saltzman, and M.M. Wiecek. The 5th
paper in this group, by F.-Y. Wang and G.N. Saridis, concerns optimal control in stochastic dynamic systems, and the last paper, by L. Gerencsér, is about
stability of random iterative mappings. Part 5 loosely groups 3 papers starting
with a paper by V. Solo on Monte Carlo methods for adaptive algorithms, fol-
lowed by a paper on random search with noise by L. Devroye and A. Krzyzak,
and closes with a survey paper on randomized quasi-Monte Carlo methods by
P. L’Ecuyer and C. Lemieux. Part 6 is a collection of 3 papers sharing a focus
on Markov decision analysis. It starts with a paper by G. Yin, Q. Zhang, K.
Yin, and H. Yang on singularly perturbed Markov chains. The second paper, on
risk sensitivity in average Markov decision chains, is by R. Cavazos–Cadena
and E. Fernández–Gaucherand. The 3rd paper, by G.G. Roussas, is on statis-
tical inference in a Markovian framework. Part 7 includes a paper on order
statistics by P.J. Boland, T. Hu, M. Shaked, and J.G. Shanthikumar, followed
by a survey paper on routing with stochastic demands by M. Dror, a paper on
fast Fourier and Walsh transforms by P.J. Sanchez, J.S. Ramberg, and L. Head,
a paper by J.C. Spall on parameter estimation with limited data, and a tuto-
rial paper on data compression by J.C. Kieffer. Part 8 contains 2 ‘reflections’
papers. The first paper is by M. Sniedovich – an ex-student of Sid Yakowitz.
It reexamines Bellman's principle of optimality. The last paper in this volume, on statistical methods for complex stochastic systems, is by M.F. Neuts.
The efforts of many people have gone into this volume, which would not have been possible without the collective work of all the authors and of the reviewers who read the papers and commented constructively. We would like to take this opportunity to thank the authors and the reviewers for their contributions. This book would have faced a much more difficult 'endgame' without Ray Brice's dedication and painstaking attention to production details. We are very grateful
for Ray’s help in this project. Paul Jablonka is the artist who contributed the art
work for the book’s jacket. He was a good friend to Sid and we appreciate his
contribution. We would also like to thank Gary Folven, our editor at Kluwer Academic Publishers, for his initial and never-fading support throughout this project. Thank you, Gary!
B. Barnes
School of Mathematical Sciences
Australian National University
Canberra, ACT 0200
Australia
Philip J. Boland
Department of Statistics
University College Dublin
Belfield, Dublin 4
Ireland
Rolando Cavazos–Cadena
Departamento de Estadística y Cálculo
Universidad Autónoma Agraria Antonio Narro
Buenavista, Saltillo COAH 25315
MÉXICO
Carl Chiarella
School of Finance and Economics
University of Technology
Sydney
P.O. Box 123, Broadway, NSW 2007
Australia
carl.chiarella@uts.edu.au
Luc Devroye
School of Computer Science
McGill University
Montreal, Canada H3A 2K6
Moshe Dror
Department of Management Information Systems
The University of Arizona
Tucson, AZ 85721, USA
mdror@bpa.arizona.edu
Emmanuel Fernández–Gaucherand
Department of Electrical & Computer Engineering
& Computer Science
University of Cincinnati
Cincinnati, OH 45221-0030
USA
Konstantinos Fokianos
Department of Mathematics & Statistics
University of Cyprus
P.O. Box 20537, Nicosia 1678, Cyprus
J. Gani
School of Mathematical Sciences
Australian National University
Canberra, ACT 0200
Australia
László Gerencsér
Computer and Automation Institute
Hungarian Academy of Sciences
H-1111, Budapest Kende u 13-17
Hungary
László Györfi
Department of Computer Science and Information Theory
Technical University of Budapest
1521 Stoczek u. 2,
Budapest, Hungary
gyorfi@szit.bme.hu
A. Haurie
University of Geneva
Geneva Switzerland
Larry Head
Siemens Energy & Automation, Inc.
Tucson, AZ 85715
Taizhong Hu
Department of Statistics and Finance
University of Science and Technology
Hefei, Anhui 230026
People’s Republic of China
Benjamin Kedem
Department of Mathematics
University of Maryland
College Park, Maryland 20742, USA
John C. Kieffer
ECE Department
University of Minnesota
Minneapolis, MN 55455
Adam Krzyzak
Department of Computer Science
Concordia University
Montreal, Canada H3G 1M8
Harold J. Kushner
Applied Mathematics Dept.
Lefschetz Center for Dynamical Systems
Brown University
Providence RI 02912
Pierre L’Ecuyer
Département d’Informatique et de Recherche Opérationnelle
Université de Montréal, C.P. 6128, Succ. Centre-Ville
Montréal, H3C 3J7, Canada
lecuyer@iro.umontreal.ca
Christiane Lemieux
Department of Mathematics and Statistics
University of Calgary, 2500 University Drive N.W.
Calgary, T2N 1N4, Canada
lemieux@math.ucalgary.ca
Gábor Lugosi
Department of Economics,
Pompeu Fabra University
Ramon Trias Fargas 25-27,
08005 Barcelona, Spain
lugosi@upf.es
F. Moresino
Cambridge University
United Kingdom
Arkadi Nemirovski
Faculty of Industrial Engineering and Management
Technion—Israel Institute of Technology
Haifa 32000, Israel
Marcel F. Neuts
Department of Systems and Industrial Engineering
The University of Arizona
Tucson, AZ 85721, U.S.A.
marcel@sie.arizona.edu
Iosif Pinelis
Department of Mathematical Sciences
Michigan Technological University
Houghton, Michigan 49931
ipinelis@math.mtu.edu
David Porter
College of Arts and Sciences
George Mason University
John S. Ramberg
Systems and Industrial Engineering
University of Arizona
Tucson, AZ 85721
James A. Reneke
Dept. of Mathematical Sciences
Clemson University
Clemson SC 29634-0975
George G. Roussas
University of California, Davis
Reuven Y. Rubinstein
Faculty of Industrial Engineering and Management
Technion—Israel Institute of Technology
Haifa 32000, Israel
Matthew J. Saltzman
Dept. of Mathematical Sciences
Clemson University
Clemson SC 29634-0975
Paul J. Sanchez
Operations Research Department
Naval Postgraduate School
Monterey, CA 93943
George N. Saridis
Department of Electrical, Computer and Systems Engineering
Rensselaer Polytechnic Institute
Troy, New York 12180
Moshe Shaked
Department of Mathematics
University of Arizona
Tucson, Arizona 85721
USA
J. George Shanthikumar
Industrial Engineering & Operations Research
University of California
Berkeley, California 94720
USA
Moshe Sniedovich
Department of Mathematics and Statistics
The University of Melbourne
Parkville VIC 3052, Australia
m.sniedovich@ms.unimelb.edu.au
Victor Solo
School of Electrical Engineering and Telecommunications
University of New South Wales
Sydney NSW 2052, Australia
vsolo@syscon.ee.unsw.edu.au
James C. Spall
The Johns Hopkins University
Applied Physics Laboratory
Laurel, MD 20723-6099
james.spall@jhuapl.edu
David S. Stoffer
Department of Statistics
University of Pittsburgh
Pittsburgh, PA 15260
Ferenc Szidarovszky
Department of Systems and Industrial Engineering
University of Arizona
Tucson, Arizona, 85721-0020, USA
szidar@sie.Arizona.edu
Wai-Yuan Tan
Department of Mathematical Sciences
The University of Memphis
Memphis, TN 38152-6429
waitan@memphis.edu
Harro Walk
Mathematisches Institut A
Universität Stuttgart
Pfaffenwaldring 57, D-70569
Stuttgart, Germany
Fei-Yue Wang
Department of Systems and Industrial Engineering
University of Arizona
Tucson, Arizona 85721
Margaret M. Wiecek
Dept. of Mathematical Sciences
Clemson University
Clemson SC 29634-0975
Zhihua Xiang
Organon Inc.
375 Mt. Pleasant Avenue
West Orange, NJ 07052
z.xiang@organoninc.com
D. S. Yakowitz
Tucson, Arizona
H. Yang
Department of Wood and Paper Science
University of Minnesota
St. Paul, MN 55108
hyang@ece.umn.edu
G. Yin
Department of Mathematics
Wayne State University
Detroit, MI 48202
gyin@math.wayne.edu
K. Yin
Department of Wood and Paper Science
University of Minnesota
St. Paul, MN 55108
kyin@crn.umn.edu
Q. Zhang
Department of Mathematics
University of Georgia
Athens, GA 30602
qingz@math.uga.edu
This book is dedicated to the
memory of Sid Yakowitz.
Chapter 1
D. S. Yakowitz
Tucson, Arizona
the most cited papers in this field and has been republished in the Handbook of
Mathematical Economics. They could also extend the constructive proof for the
case when the price and cost functions are not differentiable. They proved that
even in the case of multiple equilibria, the total output of the industry is unique
and the set of all equilibria is a simplex. They also considered the effect of
coalition formation on the profit functions (Szidarovszky and Yakowitz, 1982).
Sid was an expert in time series, both parametric and nonparametric. On the
nonparametric side he made contributions regarding nearest neighbor methods
applied to time series prediction, density and transition function estimation for
Markov sequences, and pattern recognition (Denny and Yakowitz, 1978; Schus-
ter and Yakowitz, 1979; Yakowitz, 1979; Szilagyi et al., 1984; Yakowitz, 1987;
Yakowitz, 1988; Yakowitz, 1989d; Rutherford and Yakowitz, 1991; Yakowitz
and Lowe, 1991; Yakowitz and Tran, 1993; Yakowitz, 1993a; Morvai et al.,
1998; Yakowitz et al., 1999). In particular Sid worked in the area of stochas-
tic hydrology over many years including analyzing hydrologic time series
such as flood and rainfall data to investigate their major statistical properties
and use them for forecasting (Yakowitz, 1973; Denny et al., 1974; Yakowitz,
1976; Yakowitz, 1976; Szidarovszky and Yakowitz, 1976; Yakowitz, 1979;
Yakowitz and Szidarovszky, 1985; Karlsson and Yakowitz, 1987a; Karlsson
and Yakowitz, 1987b; Noakes et al., 1988; Yakowitz and Lowe, 1991).
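The nearest-neighbor approach to time series prediction mentioned above can be illustrated with a short sketch (a generic k-nearest-neighbor forecaster, not code from any of the cited papers; the window length, the number of neighbors, and the test series are all illustrative choices):

```python
import numpy as np

def knn_forecast(series, window=5, k=10):
    """One-step-ahead forecast by nearest-neighbor pattern matching:
    embed the series into overlapping windows, find the k past windows
    closest to the most recent one, and average their successors."""
    x = np.asarray(series, dtype=float)
    # All windows whose successor has already been observed.
    patterns = np.array([x[i:i + window] for i in range(len(x) - window)])
    successors = x[window:]
    query = x[-window:]                      # the most recent pattern
    dists = np.linalg.norm(patterns - query, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k closest patterns
    return successors[nearest].mean()

rng = np.random.default_rng(0)
t = np.arange(400)
series = np.sin(2 * np.pi * t / 25) + 0.1 * rng.standard_normal(t.size)
print(knn_forecast(series))      # close to the noiseless next value, near 0
```

No parametric model of the series is fitted; the forecast is driven entirely by recurring patterns in the observed data, which is the sense in which such predictors are nonparametric.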
On the parametric side, Sid applied his deep understanding of linear filtering
of stationary time series in the problem of frequency estimation in the pres-
ence of noise. Here he authored several papers on frequency estimation using
contraction mappings, constructed from the first order auto-correlation, that
involved sophisticated sequences of linear filters with a shrinking bandwidth.
In particular, he showed that the contraction mapping of He and Kedem, which
requires a certain filtering property, can be extended quite broadly. This and the
shrinking bandwidth were very insightful (Yakowitz, 1991; Yakowitz, 1993c; Li
et al., 1994; Kedem and Yakowitz, 1994; Yakowitz, 1994a).
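The contraction-mapping idea can be sketched numerically (a simplified illustration in the spirit of the He-Kedem scheme, not their exact algorithm: here the parametric filter is an AR(2) resonator centered at the current estimate, and the update takes the arccosine of the lag-1 autocorrelation of the filtered series; the pole radius, the starting guess, and the signal are illustrative assumptions):

```python
import numpy as np

def resonate(y, theta, r=0.9):
    """Narrowband AR(2) filter with poles at r*exp(+/- i*theta):
    z[t] = y[t] + 2 r cos(theta) z[t-1] - r^2 z[t-2]."""
    b1, b2 = 2.0 * r * np.cos(theta), -r * r
    z = np.zeros(len(y))
    for t in range(len(y)):
        z[t] = y[t]
        if t >= 1:
            z[t] += b1 * z[t - 1]
        if t >= 2:
            z[t] += b2 * z[t - 2]
    return z

def estimate_frequency(y, theta0=1.0, iters=30):
    """Iterate: filter around the current guess, then re-estimate the
    frequency from the lag-1 autocorrelation of the filtered series.
    The sinusoid's frequency is (approximately) a fixed point."""
    theta = theta0
    for _ in range(iters):
        z = resonate(y, theta)
        rho1 = np.dot(z[1:], z[:-1]) / np.dot(z, z)
        theta = np.arccos(np.clip(rho1, -1.0, 1.0))
    return theta

rng = np.random.default_rng(1)
t = np.arange(2000)
y = np.cos(0.7 * t + 0.3) + 0.5 * rng.standard_normal(t.size)
print(estimate_frequency(y))     # close to the true frequency 0.7
```

Recentering the filter on each new estimate is what shrinks the effective bandwidth around the true frequency and suppresses the noise-induced bias in the autocorrelation.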
He found numerous applications of nonparametric statistical methods in ma-
chine learning (Yakowitz, 1989c; Yakowitz and Lugosi, 1990; Yakowitz et al.,
1992a; Yakowitz and Kollier, 1992; Yakowitz and Mai, 1995; Lai and Yakowitz,
1995). As a counterpart to his earlier work on numerical computation, Sid in-
troduced a course at the University of Arizona on Non-numerical Computation.
This course, which resulted in an unpublished textbook on the topic, developed
methods applicable to machine learning, games and epidemics. Sid loved this
topic dearly and enjoyed teaching it. He continued to explore this area up to
the time of his death.
In 1986 Sid met Joe Gani, and they worked together intermittently from
that time until his death. Over a period of 13 years, Sid and Joe (together with
students and colleagues) wrote 10 joint papers. Their earliest interest was in the
silting of dams, which they studied (with Peter Todorovic of UCSB) (Gani et al.,
...readers come to us to learn about certain specific facts. But they will value us
more if we give them more insight and entertainment than they bargained for.
from a letter to Gene Koppel on the value of creative writing courses
himself when he did not succeed. Sid was not known positively for his teaching
style. He was stressed and anxious over lectures and presentations.
When I have to go somewhere to give an invited lecture, I always take crayon
drawings that [my children] have made for me. I put them on the podium and
they give me strength.
from a letter to his friend Steve Berry
This lifelong battle with “stage fright” often led to comic episodes of forget-
fulness, fumbling, and embarrassment. He had little patience with laziness or
disinterest from his students. Those who took interest in the subject, and the
time to get to know Sid, grew to love him for his honesty, insight, enthusiasm,
humor and cheerfulness.
Sid was never too busy to answer a question. His work ethic defined, for me and
many other graduate students, a standard which we measure ourselves against.
Manbir Sodhi
I am privileged to have been his student. I will never forget his sense of humor
and the fact that he didn’t take anything too seriously, including himself.
T. Jayawardena
He distinguished between principle and process. He warned... against becoming
lost in process, and forgetting the initial principles with which they were to
concern themselves.
....He inveighed passionately against engineers simply putting their heads down,
and becoming lost in the mechanistic ritual of their jobs.
John Stevens Berry, Stanford roommate and lifelong friend.
I leave you with an image that most will find familiar, and many endearing.
The Yakowitz grin. Not a relentlessly upbeat smiley-face grin, also not a sneer.
A smile that asked for very little as a reason for its appearance: anything at all in
any way amusing and that he could share with others.
Robert Hymer, Stanford roommate and lifelong friend
I wish to thank Joe Gani, Ben Kedem, Dan Murray and Szidar for their
assistance in preparing this introduction. I and all of Sid’s family members are
grateful to Moshe Dror, Szidar, and Pierre L’Ecuyer for proposing and working
so hard as editors of this tribute to my husband and mentor.
Papers
Harold J. Kushner
Applied Mathematics Dept.
Lefschetz Center for Dynamical Systems
Brown University
Providence RI 02912 *
1. INTRODUCTION
Queueing networks are ubiquitous in modern telecommunications and com-
puter systems and much effort has been devoted to the study of their stability
properties. Consider a system where there are K processing stations, each with
an infinite buffer. The stations might have exogenous input streams (inputs
* Supported in part by NSF grants ECS 9703895 and ECS 9979250 and ARO contract DAAD19-99-1-0-223
from the outside of the network) as well as inputs from the other stations. Each
customer eventually leaves the system. The service is first come first served
(FCFS) at each station and the service time distributions depend only on the
station. This paper is concerned with the stability of such systems. The basic
analysis supposes that the systems are working under conditions of heavy traf-
fic. Then the result is specialized to the case of a fixed system in arbitrary (not
necessarily heavy) traffic. Loosely speaking, by heavy traffic we mean that the
fraction of time that the processors are idle is small; equivalently, the traffic
intensities at each processor are close to unity. Heavy traffic is quite common
in modern computer and communications systems, and also models the effects
of “bottleneck” nodes in general. Many of the queueing systems of current
interest are much too complicated to be directly solvable. Under conditions of
heavy traffic, laws of large numbers and central limit theorems can be used to
greatly simplify the problem.
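The meaning of heavy traffic can be checked on the simplest example (a minimal illustration, not part of the chapter: an M/M/1 queue simulated through the FCFS departure recursion, with rates chosen so that the traffic intensity is 0.95):

```python
import numpy as np

def idle_fraction(lam, mu, n=100_000, seed=0):
    """Simulate n customers of an M/M/1 FCFS queue and return the
    fraction of time the server is idle; for a stable queue this is
    close to 1 - rho, where rho = lam / mu is the traffic intensity."""
    rng = np.random.default_rng(seed)
    arrivals = np.cumsum(rng.exponential(1.0 / lam, n))   # arrival epochs
    services = rng.exponential(1.0 / mu, n)
    depart, idle = 0.0, 0.0
    for a, s in zip(arrivals, services):
        idle += max(0.0, a - depart)    # server waits for the next arrival
        depart = max(a, depart) + s     # FCFS departure recursion
    return idle / depart

print(idle_fraction(lam=0.95, mu=1.0))   # heavy traffic: idle fraction near 0.05
```

As the traffic intensity is pushed toward unity the idle fraction vanishes, which is exactly the regime in which the limit theorems below take over.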
Most work on stability has dealt with the more difficult multiclass case (al-
though under simpler conditions on the probabilistic structure of the driving
random variables), where each station can work on several job classes and
there are strict priorities (Banks and Dai, 1997; Bertsimas et al., 1996; Bram-
son, 1994; Bramson and Dai, 1999; Chen and Zhang, 2000; Dai, 1995; Dai,
1996; Dai and Vande Vate, 2001; Dai et al., 1999; Dai and Weiss, 1995; Down
and Meyn, 1997; Dai and Meyn, 1995; Kumar and Meyn, 1995; Kumar and
Seidman, 1990; Lin and Kumar, 1984; Lu and Kumar, 1991; Meyn and Down,
1994; Perkins and Kumar, 1989; Rybko and Stolyar, 1992). A typical result is
that the system is stable if a certain “fluid” or “averaged” model is stable. The
interarrival and service intervals are usually assumed to be mutually indepen-
dent, with the members of each set being mutually independent and identically
distributed and the routing is “Markov.” Stability was shown in Bramson (1996)
for a class of FIFO networks where the service times do not depend on the class.
The counterexamples in Bramson (1994), Kumar and Seidman (1990), Lu and
Kumar (1991), Rybko and Stolyar (1992), Seidman (1994) have shown that the
multiclass problem is quite subtle and that even apparently reasonable strategies
for scheduling the competing classes can be unstable.
The stability situation is simpler when the service at each processor is FCFS
and the service time distribution depends only on the processor, and here too a
typical result is that the system is stable if a certain “fluid” or “averaged” model is
stable. The i.i.d. assumption on the interarrival (resp., service) times and Markov
assumption on the routing are common, although results are available under
certain “stationary–ergodic” assumptions (Baccelli and Foss, 1994; Borovkov,
1986). In the purely Markov chain context, where one deals with a reflected
random walk, works via the classical stochastic stability techniques for Markov
chains include Fayolle (1989), Fayolle et al. (1995), and Malyshev (1995).
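For a single reflected random walk, the classical stochastic stability criterion is a negative Liapunov drift away from the origin; a Monte Carlo check of this condition might look as follows (a sketch with an arbitrarily chosen Poisson arrival distribution and the simplest Liapunov function V(x) = x, not an example from the works cited):

```python
import numpy as np

def drift(x, lam=0.8, reps=100_000, seed=0):
    """Estimate the one-step Liapunov drift E[V(X') - V(X) | X = x]
    with V(x) = x for the reflected random walk
    X' = max(X + A - 1, 0), where A ~ Poisson(lam) arrivals compete
    with one service per slot."""
    rng = np.random.default_rng(seed)
    a = rng.poisson(lam, reps)
    return np.maximum(x + a - 1, 0).mean() - x

# Foster-Liapunov criterion: the drift is negative away from the origin.
for x in (0, 5, 20):
    print(x, drift(x))
```

With lam < 1 the drift is about lam - 1 < 0 at every state away from the boundary, which is the drift-toward-a-compact-set condition underlying the Markov chain stability results cited above.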
For the single class case, and under the assumption that the “fluid” approxi-
mation is stable, it will be seen that stability holds under quite general assump-
tions on the probabilistic structure of the interarrival and service intervals, and
on the routing processes. This will be demonstrated by use of the perturbed
Liapunov function methods of Kushner (1984).
The basic class of models under heavy traffic will be defined in Section
2, where various assumptions are stated to put the problem into a context of
interest in current applications. These are for intuitive guidance only, and
weaker conditions will be used in the actual stability results. The basic class of
models which is motivated by the heavy traffic analysis of queueing networks is
then generalized to include a form of the so-called “Skorohod problem” model
which covers a broader class of systems. Stability under heavy traffic is actually
stability of a sequence of queueing systems, and it is “uniform” in the traffic
intensity parameter as that tends to unity. The stability results depend on a basic
theorem of Dupuis and Williams (1994) which established the existence of a
Liapunov function for the fluid approximation, and this is stated in Section 3.
The idea of perturbed Liapunov functions is developed in Section 4. Section 5
gives the main stability theorem for the sequence of systems in heavy traffic,
as well as the result for a single queue which is not necessarily in heavy traffic.
The results can be extended to account for many of the features of queueing
networks, such as batch arrivals and processing or server breakdown.
2. THE MODEL
A classical queueing network: The heavy traffic scalings. Heavy traffic analysis works with a sequence of queues, indexed by and with arrival and
service “rates” depending on such that, as the fraction of time
at which the processors are idle goes to zero. With appropriate scaling and
under broad conditions, the sequence of scaled queues converges weakly to
a process which is the solution to a reflected stochastic differential equation
(Kushner, 2000; Reiman, 1984).1 Let there be K processors or servers, denoted by . Let denote the size of the vector of queues in the network at real time for the system in the sequence. There are two scalings
which are in common use. One scaling works with the state process defined by
where both time and amplitude are “squeezed.” This is
used for classical queueing problems where the interarrival and service intervals
are O(1). In many modern communications and computer systems, the arrivals
and services (say, arrivals and transmissions of the packets) are fast, and then
one commonly works with the scaling . We will concentrate
on the first scaling, although all the results are transferable to the second scaling
with an appropriate identification of variables.
Consider the first scaling where the model is of the type used in Reiman
(1984). See also Kushner (2000; Chapter 6). Let denote the in-
terarrival interval for exogenous arrivals to and let denote the
service interval at Define times the number of exogenous
arrivals to by real time and define analogously for the service
completions there. Let denote times the number of departures
from by real time that have gone to For some centering constants
which will be further specified below, define
the processes:
If the upper index of a sum is not an integer, we always take the integer part.
Let denote the indicator function of the event that the departure from
goes to Define the “routing” vector
In applications to communications systems, where the second (the “fast”)
scaling would be used, the scale parameter is the actual physical speed or size
of the physical system. Then, the actual interarrival and service intervals would
be defined by and resp. Thus, in this latter case we are working
with a sequence of systems of increasing speed. Under the given conditions,
the results in Section 5 will state that the processes are uniformly (in )
stable for large
Assumptions for the classical network in heavy traffic. Three related types
of models will be dealt with. The first form is the classical queueing network
type described above. Conditions (A2.1)–(A2.3) are typical of those used at
present for this problem class. These are given for motivational purposes and
to help illustrate the basic “averaging” method. The actual conditions which
are to be used are considerably weaker. The second type of model, called
the Skorohod problem, includes the first model. But it covers other queueing
systems which are not of the above network form (see, for example, Kushner
(2000; Chapter 7)). Finally, it will be seen that the stability results are valid
even without the heavy traffic assumptions, provided that the mean flows are
stable.
A2.0. For each the initial condition (0) is independent of all future
arrival and service times and of all routing decisions, and other future driving
random variables.
A2.1. The set of service and interarrival intervals is independent of the set
of routing decisions. There are constants such that
The set of processes defined in (2.1) is
tight in the Skorohod topology (Billingsley, 1968; Ethier and Kurtz, 1986) and
has continuous weak sense limits.
Stability of Single Class Queueing Networks 17
A2.2. The are mutually independent for each There are con-
stants and such that and The spectral
radius of the matrix is less than unity. Write
The spectral radius condition in (A2.2) implies that each customer will even-
tually leave the system and that the number of services of each customer is
stochastically bounded by a geometrically distributed random variable. We
will often write It is convenient to introduce the defi-
nition Condition (A2.3) is known as the heavy traffic condition. It
quantifies the rate at which the difference between the mean input rate and mean
output rate goes to zero at each processor as
A2.3. There are real such that
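The force of the spectral radius condition can be seen in a small numerical sketch (the two-station routing matrix Q and exogenous rates below are invented for illustration and are not from the chapter): because the spectral radius of Q is below one, the Neumann series for (I − Q′)⁻¹ converges geometrically, so the traffic equation for the total arrival rates has a finite solution.

```python
# Hypothetical 2-station routing matrix: Q[i][j] = P(go to j after service at i).
Q = [[0.0, 0.5],
     [0.3, 0.0]]
lam = [1.0, 0.5]          # exogenous arrival rates (illustrative)

def mat_vec_T(Q, v):
    """Apply Q' (the transpose of Q) to the vector v."""
    n = len(v)
    return [sum(Q[i][j] * v[i] for i in range(n)) for j in range(n)]

# Fixed-point iteration lam_bar <- lam + Q' lam_bar; since the spectral radius
# of Q is < 1 this is a geometric contraction, i.e. the Neumann series for
# (I - Q')^{-1} applied to lam.
lam_bar = lam[:]
for _ in range(200):
    qv = mat_vec_T(Q, lam_bar)
    lam_bar = [lam[j] + qv[j] for j in range(len(lam))]

print([round(x, 4) for x in lam_bar])   # finite total arrival rates
```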
The limit system and fluid approximation for the classical model. The
basic state equation is
The stability of the system (2.6) is the usual departure point for the study of the
stability of the actual physical stochastic queues.
processor, and that they both occur “just before” the times
These are still referred to as the departures and arrivals at the times Con-
sequently, there cannot be a departure from a processor at real time
if its queue is empty at real time
Some details for (2.4). A formal summary of a few of the details which
take (2.3) to (2.4) will be helpful to get a feeling for the derivation of (2.4)
and motivation for the role of (2.6) in the stability analysis. The development
involves representations for the terms in (2.3) that separate the “drift” from
the “noise.” Let denote the indicator function of the event that there is
a departure from at real time and let denote the indicator
function of the event that this departure goes to Let be the indicator
function of the event that there is an exogenous arrival to at real time
By the definitions, we have
Write
where
Use the representation (2.8) for the coefficient of in the right hand term of
(2.10). Then, the (negative of the) idle time terms in the equation (2.3) for
sum to
Now, consider the exogenous arrival processes. Modulo a residual time error
term,
Alternatively,
where
The difference between the right hand terms of (2.11) and (2.12) is also a
residual time error term and is asymptotically negligible.
Putting the expansions together and using the heavy traffic condition (A2.3)
yields (modulo asymptotically negligible errors, from the point of view of weak
convergence)
where
and is defined by
Let denote the boundary face on which and let denote the
reflection direction on that face. Then, for some nonnegative and uniformly
bounded random variables can be written as
where has the form (2.14a) for uniformly bounded random variables
and for some nonnegative and uniformly bounded random variables
the reflection term can be written as
Assumptions for the model (2.16), (2.17). Condition (A2.4) below holds for
the original queueing network model, as does (A2.5) because of the condition
on the spectral radius of Q.
A2.4. There are vectors such that the state space G is the intersection
of a finite number of closed halfspaces in each containing the origin and
defined by and it is the closure of its interior (i.e., it is a
“wedge”). Let denote the faces of G, and the interior
normal to Interior to the reflection direction is denoted by the unit
vector and for each The possible reflection directions at
points on the intersections of any subset of the are in the convex hull of the
directions on the adjoining faces. Let denote the set of reflection directions
at the point whether it is a singleton or not.
A2.5. For define the index set Suppose that
lies in the intersection of more than one boundary; i.e., has the
3. STABILITY: INTRODUCTION
Stability of a queue can be defined in many ways, but it almost always
means something close to a uniform recurrence property, which we can define
as follows. Let be a large real number. Suppose that the current queue
size is Then there is a real valued function K(·) such that the mean time,
conditioned on the systems data to the present, for the queue size to get within the
centered at the origin, is bounded by with probability
one. Then we say that the queue process is uniformly recurrent. The queue
process need not be Markovian, or a component of a Markov process, or even
ergodic or stationary in any sense.
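The definition can be exercised on an assumed toy model (a single discrete-time queue with down-drift, not the chapter's network): the estimated mean time to re-enter a fixed ball around the origin is finite and grows roughly linearly in the starting state, which is exactly the role of the function K(·).

```python
# Toy model (our assumption): Q_{k+1} = max(Q_k - 1, 0) + Bernoulli(p), p < 1/2,
# so the queue has negative drift.  We estimate, by simulation, the mean time to
# re-enter the ball {q <= radius} from various starting states.
import random

def mean_hitting_time(start: int, radius: int = 5, p: float = 0.3,
                      reps: int = 2000, seed: int = 0) -> float:
    rng = random.Random(seed)
    total = 0
    for _ in range(reps):
        q, t = start, 0
        while q > radius:
            q = max(q - 1, 0) + (1 if rng.random() < p else 0)
            t += 1
        total += t
    return total / reps

for x0 in (10, 20, 40):
    print(x0, round(mean_hitting_time(x0), 1))
# mean re-entry times grow roughly linearly in the starting state
```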
Now, consider a sequence of queues in heavy traffic, scaled as in Section
2. Then the uniform recurrence property is rephrased as follows. Suppose that
the current scaled queue size is given. Then the mean time, conditioned on the
system's data to the present, for the scaled queue size to enter a ball
whose center is the origin, is bounded by a function of the current size, with
probability one and uniformly in the (large) scale parameter.
The study of the theory of the stability of Markovian processes, via stochas-
tic Liapunov functions, goes back to Bucy (1965), Khasminskii (1982), and
Kushner (1990a) and that of non-Markovian processes, via perturbed Liapunov
functions, to Blankenship and Papanicolaou (1978), Kushner (1984), and Kush-
ner (1990a). It was shown in Harrison and Williams (1987) that a necessary
and sufficient condition for the recurrence of the heavy traffic limit (2.4) (when
is a Wiener process) is that
which might not arise as a limit of the sort of queueing processes (2.3) dis-
cussed in Section 2. They constructed a Liapunov function for the associated
fluid approximation, and used this to prove the recurrence of the heavy traffic
limit. The needed properties of their Liapunov function are stated in Theorem
3.1.
The next two sections contain an analysis of a class of systems whose heavy
traffic limits are either of the queuing type of (A2.0)–(A2.3), or of the more
general Skorohod model type (2.16), (2.17). The same methods will also be
applied to a single queue (and not a sequence of queues) which might not be in
heavy traffic. The method is interesting in that it can handle quite complicated
correlations in the arrival, service and routing processes, and even allow non-
stationarities. We aim to avoid stronger assumptions which would require the
processes to be either stationary or to be representable in terms of a component
of the state of a Markov process. Thus, the notion of Harris recurrence is not
directly relevant. Our opinion is that stability results should be as robust as pos-
sible to small variations in the assumptions. The perturbed Liapunov function
method is ideal for such robustness.
The following is a form of the main theorem of Dupuis and Williams (1994).
The reference used the orthant, but the proof still
holds if G is a wedge as in (A2.4). For a point of G, define
to be the set of indices such that In the theorem, denotes
the gradient of V(·).
Theorem 3.1. Assume (A2.4)–(A2.6). Then, there exists a real–valued
function V(·) on with the following properties. It is continuous,
together with its partial derivatives up to second order. There is a (twice
continuously differentiable) surface such that any ray from the origin crosses
once and only once, and for a scalar and
4. PERTURBED LIAPUNOV FUNCTIONS
This section works with the scaled system (2.4) and (2.5), but the assumptions (A2.0)–(A2.3) will be weakened. By the
scaling of time, can change values only at times
with the departures occurring “just before” the arrivals.
Perturbed Liapunov functions. Introduction and motivation. This sec-
tion will introduce the idea of perturbed Liapunov functions (Blankenship and
Papanicolaou, 1978; Kushner, 1984; Kushner, 1990b; Kushner and Yin, 1997)
and their structure. The computations are intended to be illustrative and will be
formal. But, the actual proofs of the theorems in the next section go through
the same steps.
The classical Liapunov function method is quite limited, for problems such as
ours, since (owing to the correlations or possibly non-Markov character) there
is not usually a “contraction” at each step to yield the local supermartingale
property which is needed to prove stability. The perturbed Liapunov function
method is a powerful extension of the classical method. In the perturbed Lia-
punov function method, one adds a small perturbation to the original Liapunov
function. As will be seen, this perturbation provides an “averaging” which
is needed to get the local supermartingale property. The primary Liapunov
function will be simply the function V(·) of Theorem 3.1. The final Liapunov
function will be of the form where is
“small” in a sense to be defined.
Let denote the expectation given all of the system's data up to and including
real time We can write
where we define
where the are O(1), uniformly in all variables. Similarly, we can write
Also,
in (4.5a) equals
Repeating this for all of the other first order terms yields
By the heavy traffic condition (A2.3), times the terms in brackets in the
second line in (4.7) converges to as
Now, turn to the boundary terms. Define and
Thus, asymptotically, we can write (4.7) as
Next, let us dominate the second order terms in (4.1). For large enough
Average the indicator functions in the second order part of (4.1) via use of
the perturbations as done for the first order terms. This yields the bound
for the second order terms, for large
Finally, combining the above expansions and bounds, for large enough
Theorem 3.1 allows us to write
5. STABILITY
Discussion of the perturbations. Let us examine the in (4.4a) more
closely to understand why our O(1) requirement on its value is reasonable.
Since is merely a centering constant for the entire sequence, the actual
mean values or rates can vary with time (say, being periodic, etc.). Fix and let
and be the real times of the first two exogenous arrivals to queue
after real time Consider the part of given by
This equals
Next, for the moment, suppose that the interarrival times are mutually in-
dependent and identically distributed, with finite second moments, and mean
Then (5.1) equals zero w.p.1, since
Obviously and can be any two successive exogenous arrival
times with the same result. Thus, under the independence assumption,
is just
Then, grouping terms and formally speaking, we see that is just (5.2) plus
the series
This sum is well defined and bounded uniformly in under broad conditions.
Similar computations can be done for the and
The perturbations which are to be used. The perturbations defined in (4.4)
are well defined and O(1) uniformly in under broad mixing conditions. But,
there are interesting cases where they are not well defined. A typical such case
is the purely deterministic problem where and
where H is an integer. Then the sum, taken from to is periodic in
with each segment moving linearly between zero and The
most convenient way of circumventing this problem of nonconvergence and
including such examples in our result is to suitably discount the defining sums
(Kushner and Yin, 1997; Solo and Kong, 1995). Thus, if the sums in (4.4) are
not well defined, then we will use the alternative perturbations where the sums
are changed to the following discounted forms, for some small
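To see numerically why discounting repairs the periodic deterministic case, consider a toy zero-mean periodic sequence (our own example in the spirit of the text; the symbols are not the chapter's): the ordinary partial sums oscillate forever, while the discounted sums are well defined and uniformly bounded for every positive discount rate.

```python
# Zero-mean periodic "noise": H ones followed by H minus-ones, repeated.
# Its ordinary partial sums never converge, but sum_k exp(-eps*k) * xi_k is
# well defined for every eps > 0 and uniformly bounded.
import math

H = 4
xi = [1.0 if (k % (2 * H)) < H else -1.0 for k in range(100_000)]

partial = []
s = 0.0
for x in xi[:50]:
    s += x
    partial.append(s)
print("undiscounted partial sums oscillate between",
      min(partial), "and", max(partial))

def discounted_sum(eps: float) -> float:
    return sum(math.exp(-eps * k) * x for k, x in enumerate(xi))

for eps in (0.1, 0.01, 0.001):
    print(eps, round(discounted_sum(eps), 4))   # bounded for every eps > 0
```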
The sums in (5.4) are always well defined for each and the conditional
expectation can be taken either inside or outside of the summation. Finally, the
above discussion is summarized in the following assumption.
A5.1. There is a constant B such that and are bounded
by B w.p.1, for each
The actual perturbed Liapunov function which is to be used is
where
Fast arrivals and services: The second scaling. In the queueing network
model of Section 2, we defined This is the traditional
scaling for queues. But, in many applications to computer and communications
systems, the channel capacity is large and the arrivals and services occur “fast,”
proportionally to the capacity. Then, the parameter is taken to be the basic
speed of the system and one uses (Altman and Kushner,
1999; Kushner et al., 1995). As noted at the beginning of Section 2, the service
and interarrival intervals are then defined to be with centering
constants To be consistent with the previous development, suppose
that arrivals and departures can occur only “just before the real times
The development that led to Theorem 5.1 used only the scaled system, and
the results are exactly the same if we have fast arrivals and services.
The Skorohod problem model (2.16), (2.17). We will use the perturbation
where
The proof of the next theorem is the same as that of Theorem 5.1.
Theorem 5.2. Assume the model (2.16), (2.17) and the conditions (A2.4)–
(A2.6). Suppose that is tight and that the of (5.9) are bounded,
w.p.1, uniformly in Then is recurrent, uniformly in If the functions
(5.9) are well defined and uniformly bounded without the discounting, then the
undiscounted form can be used.
Fixed queues: Non-heavy-traffic problems. Consider a single queueing
network (not a sequence) of the type discussed in Section 2. Let denote the
size of the queue at server at time The primary assumptions in Theorem 5.1
were first (A2.4)–(A2.6) which enabled the construction of the fundamental
Liapunov function of Theorem 3.1 and, second, (A5.1) which provided the
averaging. The fact that is sufficient for the (average
of the) first order term in (4.1) to dominate the (average of the) second order
term for large even without their relative and scalings.
Drop the in the definitions, and suppose that arrivals and departures
only occur as in the queueing network model of Section 2; i.e. “just before”
times for some small In particular, for define
Define
All of the equations in this and in the previous section hold if the is dropped.
Thus, we have the following theorem.
Theorem 5.3. Assume that the spectral radius of Q is less than unity, condi-
tion (A2.6) and that the sums in (5.10) are bounded w.p.1, uniformly in
and Then Q(·) is recurrent. If the functions (5.10) are well defined and
uniformly bounded without the discounting, then the undiscounted forms can
be used.
NOTES
1. All weak convergence is in the Skorohod topology (Ethier and Kurtz, 1986).
2. Thus the gradient is the same at all points on any ray from the origin.
REFERENCES
Altman, E. and H.J. Kushner. (1999). Admission control for combined guar-
anteed performance and best effort communications systems under heavy
traffic. SIAM J. Control and Optimization, 37:1780–1807.
Baccelli, F. and S. Foss. (1994). Stability of Jackson-type queueing networks.
Queueing Systems, 17:5–72.
Banks, J. and J.G. Dai. (1997). Simulation studies of multiclass queueing net-
works. IIE Trans., 29:213–219.
Bertsimas, D., D. Gamarnik, and J. Tsitsiklis. (1996). Stability conditions for
multiclass fluid queueing networks. IEEE Trans. Aut. Control, 41:1618–
1631.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
Blankenship, G. and G.C. Papanicolaou. (1978). Stability and control of systems
with wide band noise disturbances. SIAM J. Appl. Math., 34:437–476.
Borovkov, A. A. (1986). Limit theorems for queueing networks. Theory of Prob-
ability and its Applications, 31:413–427.
Bramson, M. (1994). Instability of FIFO queueing networks. Ann. Appl. Probab.,
4:414–431.
Bramson, M. (1996). Convergence to equilibria for FIFO queueing networks.
Queueing Systems, 22:5–45.
Bramson, M. and J.G. Dai. (1999). Heavy traffic limits for some queueing
networks. Preprint.
Bucy, R. S. (1965). Stability and positive supermartingales. J. Differential Equa-
tions, 1:151–155.
Chen, H. and H. Zhang. (2000). Stability of multiclass queueing networks under
priority service disciplines. Operations Research, 48:26–37.
Dai, J., J. Hasenbein, and J. Vande Vate. (1999). Stability of a three station fluid
network. Queueing Systems, 33:293–325.
Dai, J.G. and S. Meyn. (1995). Stability and convergence of moments for multiclass
queueing networks via fluid limit models. IEEE Trans. on Aut. Control,
40:1889–1904.
Dai, J. and J. Vande Vate. (2001). The stability of two station multi–type fluid
networks. To appear in Operations Research.
Dai, J. G. (1995). On positive Harris recurrence of multiclass queueing net-
works: a unified approach via fluid models. Ann. Appl. Probab., 5:49–77.
Dai, J. G. (1996). A fluid–limit model criterion for instability of multiclass
queueing networks. Ann. of Appl. Prob., 6:751–757.
Dai, J. G. and G. Weiss. (1995). Stability and instability of fluid models for
reentrant lines. Math. of Oper. Res., 21:115–135.
Down, D. and S.P. Meyn. (1997). Piecewise linear test functions for stability
and instability of queueing networks. Queueing Systems, 27:205–226.
Dupuis, P. and R.J. Williams. (1994). Lyapunov functions for semimartingale
reflecting Brownian motions. Ann. Prob., 22:680–702.
Ethier, S. N. and T.G. Kurtz. (1986). Markov Processes: Characterization and
Convergence. Wiley, New York.
Fayolle, G. (1989). On random walks arising in queueing systems: ergodicity
and transience via quadratic forms as liapunov functions, Part 1. Queueing
Systems, 5:167–184.
Fayolle, G., V.A. Malyshev, and M.V. Menshikov. (1995). Topics in the Con-
structive Theory of Markov Chains. Cambridge University Press, Cambridge,
UK.
Harrison, J. M. and R.J. Williams. (1987). Brownian models of open queueing
networks with homogeneous customer populations. Stochastics and Stochas-
tics Rep., 22:77–115.
Khasminskii, R. Z. (1982). Stochastic Stability of Differential Equations. Sijthoff
& Noordhoff, Alphen aan den Rijn, The Netherlands.
Kumar, P. R. and S.P. Meyn. (1995). Stability of queueing networks and schedul-
ing policies. IEEE Trans. on Automatic Control, 40:251–260.
Kumar, P. R. and T.I. Seidman. (1990). Dynamic instabilities and stabiliza-
tion methods in distributed real time scheduling policies. IEEE Trans. on
Automatic Control, 35:289–298.
3
Sequential Optimization Under Uncertainty
Tze Leung Lai
Abstract
Herein we review certain problems in sequential optimization when the underlying
dynamical system is not fully specified but has to be learned during the
operation of the system. A prototypical example is the multi-armed bandit prob-
lem, which was one of Yakowitz’s many research areas. Other problems under
review include stochastic approximation and adaptive control of Markov chains.
1. INTRODUCTION
Sequential optimization, when the underlying function or dynamical system
is not fully specified but has to be learned during the operation of the system,
was one of Yakowitz’s major research areas, to which he made many important
contributions in a variety of topics. In this paper we give an overview of some of
these topics and related developments, and review in this connection Yakowitz’s
contributions to these areas.
The optimization problem of finding the value that maximizes
a given function is difficult when the search space is large and the function does not
have nice smoothness and concavity properties. Probabilistic algorithms, such as sim-
ulated annealing introduced by Kirkpatrick et al. (1983), have proved useful
to reduce the computational complexity. The problem becomes even more
challenging if is some unknown regression function so that an observation
at a given has substantial “uncertainties” concerning its mean value
In such stochastic settings, statistical techniques and probabilistic methods are
indispensable tools to tackle the problem.
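A minimal simulated-annealing sketch in the spirit of Kirkpatrick et al. (1983) (the objective, proposal distribution, and cooling schedule below are our illustrative assumptions): proposals that worsen the objective are still accepted with probability exp(−Δ/T), which lets the search escape local maxima.

```python
import math, random

def simulated_annealing(f, x0, steps=20_000, seed=0):
    rng = random.Random(seed)
    x, best = x0, x0
    for k in range(1, steps + 1):
        temp = 1.0 / math.log(k + 1)              # slow (logarithmic) cooling
        y = x + rng.gauss(0.0, 0.5)               # random local proposal
        delta = f(x) - f(y)                       # > 0 means y is worse (maximizing)
        if delta <= 0 or rng.random() < math.exp(-delta / temp):
            x = y                                 # accept, possibly uphill in cost
        if f(x) > f(best):
            best = x                              # track the best point visited
    return best

# multimodal objective: global maximum near x = 2, local maximum near x = -2
f = lambda x: -0.1 * (x - 2) ** 2 + math.cos(3 * x)
print(round(simulated_annealing(f, x0=-2.0), 2))
```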
When is finite, the above optimization problem in stochastic settings can be
viewed as a stochastic adaptive control problem with a finite control set, which is
often called a “multi-armed bandit problem”. In its simplest form, the problem
2. BANDIT THEORY
The “multi-armed bandit problem”, introduced by Robbins (1952), derives
its name from an imagined slot machine with arms. When an arm is
pulled, the player wins a random reward. For each arm there is an unknown
probability distribution of the reward, and the player’s problem is to choose
N successive pulls on the arms so as to maximize the total expected reward.
The problem is prototypical of a general class of adaptive control problems in
which there is a fundamental dilemma between “information” (such as the need
to learn from all populations about their parameter values) and “control” (such
as the objective of sampling only from the best population), cf. Kumar (1985).
Another often cited example of such problems is in the context of clinical trials,
where there are treatments of unknown efficacy to be chosen sequentially to
treat a large class of N patients, cf. Chernoff (1967).
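A concrete index rule of the upper-confidence-bound type discussed in Section 2.1 can be sketched as follows (this is the UCB1 form of the idea, used as an illustrative stand-in; the Bernoulli arm means are invented): at each pull the rule maximizes the sample mean plus an exploration bonus, so sampling concentrates on the best arm while every arm keeps being tested at a logarithmic rate.

```python
import math, random

def ucb1(means, horizon=5000, seed=0):
    rng = random.Random(seed)
    k = len(means)
    counts, sums = [0] * k, [0.0] * k
    total_reward = 0.0
    for t in range(1, horizon + 1):
        if t <= k:                       # pull each arm once to initialize
            arm = t - 1
        else:                            # sample mean + exploration bonus
            arm = max(range(k), key=lambda i: sums[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        r = 1.0 if rng.random() < means[arm] else 0.0   # Bernoulli reward
        counts[arm] += 1
        sums[arm] += r
        total_reward += r
    return counts, total_reward

counts, reward = ucb1([0.3, 0.5, 0.7])
print(counts)   # most pulls concentrate on the best (0.7) arm
```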
as and
There is also a similar asymptotic theory of the finite-horizon bandit problem
in which the agent's objective is to maximize the expected total reward over a
horizon of N observations. The optimal rule samples from the first population (with
unknown mean) up to a certain stage and then takes the remaining
observations from the second population (with known mean 0), where is
the posterior mean based on observations from the first population
and are positive constants that can be determined by backward induction.
Writing and treating as a
continuous variable, Lai (1987) approximates by where
is the posterior variance of and
where is defined as the largest number in if the set in (8) is empty, and
is the generalized likelihood ratio (GLR) statistic for testing
sample estimates at every stage. It tries to mimic the optimal rule (assuming
known system parameters) by updating the parameter estimates based on all the
available data. It is particularly attractive when the optimal (or asymptotically
optimal) control scheme assuming known system parameters has a simple re-
cursive form that can be implemented in real time and when there are real-time
recursive algorithms for updating the parameter estimates. Such is the case with
stationary control of Markov chains to maximize the long-run average reward.
where represents the one-step reward at state when action is used and
is the stationary distribution (which is assumed to exist) of
Since and are finite, the set of stationary control laws is
finite, which will be denoted by If were known, then one would
use the stationary control law such that
as in (10). Assume no switching cost for switching among the optimal stationary
control laws that attain the maximum in (12) and a positive switching cost
for each switch from one to another where and are not
both optimal. Let be the cumulative switching cost of an adaptive
control rule up to stage N. An adaptive control rule is said to be “uniformly
good” if and for every Graves
and Lai (1997) showed that for uniformly good rules,
where is defined below after a few other definitions. First the analogue of
the Kullback-Leibler information number in (2) now takes the form
which will be assumed to be finite for all and which assumes the
transition probabilities to be absolutely continuous (having density functions
) with respect to a measure on Next the finiteness of enables us to
Using sequential likelihood ratio tests and block experimentation ideas (similar
to those described in Section 2.2) to introduce “uncertainty adjustments” into
the certainty equivalence rule, Graves and Lai (1997) constructed uniformly
good adaptive control rules that attain the asymptotic lower bound (14).
exists for every and and such that there is a maximum value
of over For a control rule that chooses adaptively some stationary
control law in to use at every stage, its regret is defined by
where is the number of times that the control rule uses the stationary
control law up to stage N and denotes expectation under the probability
measure of the controlled Markov chain starting at and using control
rule Since the state in a controlled Markov chain is governed by the
preceding state irrespective of which control law is used at time
it is important to adhere to the same control law (“arm”) over a block of times
(“block experimentation”), instead of switching freely among different arms as
in conventional multi-armed bandits.
Let Take and let Partition
into blocks of consecutive integers,
each block having length except possibly the last one whose length may range
from to Label these blocks as so that the block
begins at stage The basic idea here is to try out the
first stationary control laws for the stages from to Specifically,
for if with use stationary
control law for the entire block of stages if
and use for all the stages in the block otherwise. In (12),
denotes the number of times is used up to stage
under the assumption that it is a metric space with the Borel σ-field. For
controlled Markov chains with a nondenumerable set of stationary control
laws, Lai and Yakowitz (1995) showed how to construct an adaptive control rule
with for any and such that
where is the time spent in the queueing system by the job, is the
service time for that job, and is the service rate in effect during that service
time, with A being a parameter of the problem. The decision maker need not
know in advance the arrival distribution, or how costs depend on service time,
or even how the service rate being adjusted is related to service time. One
desires a strategy to minimize the average job cost. Making use of the control
rule described earlier in this paragraph, Lai and Yakowitz (1995) showed how
decision functions can be chosen adaptively from a space of functions
mapping the number of jobs in the system into a prescribed interval
of service rates, so that the average control costs converge, as to the
optimal performance level the expectation being with
respect to the invariant measure induced by the decision function
4. STOCHASTIC APPROXIMATION
Consider the regression model
When the random disturbances are present, using Newton’s method (20)
entails that
where at the stage observations and are taken at the design levels
and respectively, and are positive constants
such that
and is an estimate of
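A minimal Robbins–Monro sketch of a recursion of type (23) (the linear regression function, unit-variance noise, and gains a_n = 1/n below are our illustrative assumptions, not the paper's example): at each stage a noisy observation is taken at the current level and the level is moved against it with slowly decreasing gains.

```python
import random

def robbins_monro(steps=20_000, x0=5.0, seed=0):
    rng = random.Random(seed)
    M = lambda x: 2.0 * (x - 1.0)         # regression function with root at x = 1
    x = x0
    for n in range(1, steps + 1):
        y = M(x) + rng.gauss(0.0, 1.0)    # noisy measurement at the current level
        x = x - (1.0 / n) * y             # gains: sum a_n = inf, sum a_n^2 < inf
    return x

print(round(robbins_monro(), 2))   # close to the root x = 1
```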
Beginning with the seminal papers of Robbins and Monro (RM) and Kiefer
and Wolfowitz (KW), there is a vast literature on stochastic approximation
schemes of the type (23) and (25). In particular, Blum (1954) proved almost
sure (a.s.) convergence of the RM and KW schemes under certain conditions
on M and For the case of i.i.d. with mean 0 and variance Sacks (1958)
showed that an asymptotically optimal choice of in the RM scheme (23) is
for which has a limiting normal distribution with
mean 0 and variance assuming that This led Lai and
Robbins (1979) to develop adaptive stochastic approximation schemes of the
form
that it is possible to have both asymptotically minimal regret and efficient fi-
nal estimate, i.e., a.s. and has a
limiting normal distribution with mean 0 and variance as by
using a modified least squares estimate in (27). Asymptotic normality of the
KW scheme (25) has also been established by Sacks (1958). However, instead
of the usual rate, one has the rate for the choices and
assuming M to be three times continuously differentiable in some
neighborhood of The reason for the slower rate is that the estimate of
has a bias of the order when This slower rate is
common to nonparametric regression and density estimation problems, where
it is known that the rate of convergence can be improved by making use of addi-
tional smoothness of M. Fabian (1967, 1971) showed how to redefine in
(25) when M is continuously differentiable in some neighborhood
of for even integers so that letting has
a limiting normal distribution.
In control engineering, stochastic approximation (SA) procedures are usu-
ally applied to dynamical systems. Besides the dynamics in the SA recursion,
the dynamics of the underlying stochastic system also plays a basic role in the
convergence analysis. Ljung (1977) developed the so-called ODE method that
has been widely used in such convergence analysis in the engineering literature;
it studies the convergence of SA or other recursive algorithms in stochastic dy-
namical systems via the stability analysis of an associated ordinary differential
equation (ODE) that defines the “asymptotic paths” of the recursive scheme;
see Kushner and Clark (1978) and Benveniste, Metivier and Priouret (1987).
Moreover, a wide variety of KW-type algorithms have been developed for con-
strained or unconstrained optimization of objective functions on-line in the
presence of noise. Spall (1992) introduced “simultaneous
perturbation” SA schemes that take only 2 measurements per stage (instead of one
pair per coordinate) to estimate a smoothed gradient approximation; see also
Spall and Cristion (1994). For other recent developments of SA, see Kushner
and Yin (1997).
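The simultaneous-perturbation idea can be sketched as follows (the quadratic objective, noise level, and gain sequences are our illustrative assumptions): all coordinates are perturbed at once by independent random signs, so two noisy measurements per stage estimate the whole gradient regardless of the dimension.

```python
import random

def spsa(steps=5000, seed=0):
    rng = random.Random(seed)
    def L(t):
        # noisy measurement of (t0 - 1)^2 + (t1 + 2)^2, minimized at (1, -2)
        return (t[0] - 1.0) ** 2 + (t[1] + 2.0) ** 2 + rng.gauss(0.0, 0.1)
    theta = [0.0, 0.0]
    for n in range(1, steps + 1):
        a_n = 0.5 / n                      # step-size gains
        c_n = 1.0 / n ** 0.25              # perturbation widths
        delta = [rng.choice((-1.0, 1.0)) for _ in theta]   # random signs
        plus = [t + c_n * d for t, d in zip(theta, delta)]
        minus = [t - c_n * d for t, d in zip(theta, delta)]
        diff = L(plus) - L(minus)          # only 2 measurements per stage
        theta = [t - a_n * diff / (2.0 * c_n * d) for t, d in zip(theta, delta)]
    return theta

print([round(t, 2) for t in spsa()])   # near the minimizer (1, -2)
```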
The ODE approach to analyze SA procedures usually assumes that the as-
sociated ODE is initialized in the domain of attraction of an equilibrium point.
Moreover, the theory on the convergence rate of under conditions such as
unimodality and smoothness refers only to so large that lies in a suffi-
ciently small neighborhood of the limit where the regression function can
be approximated by a local polynomial. The need for good starting values to
initialize SA procedures is also widely recognized in practice. By synthesiz-
ing constrained KW search with nonparametric bandit theory, Yakowitz (1993)
developed the following multistart procedure that converges to the global max-
imum of where is a convex subset of even though M may
have multiple local maxima and minima.
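A schematic of the multistart idea (a loose caricature combining local stochastic search with bandit-style allocation; it is not Yakowitz's actual 1993 procedure, and the objective and tuning constants are invented): several local searches run from different starts, effort is allocated toward the start whose estimated value is currently best, and the remaining starts receive forced exploration.

```python
import math, random

def multistart(f_noisy, starts, rounds=3000, seed=0):
    rng = random.Random(seed)
    xs = list(starts)                       # current iterate of each local search
    est = [f_noisy(x, rng) for x in xs]     # running value estimates
    pulls = [1] * len(xs)
    for _ in range(rounds):
        if rng.random() < 0.1:              # forced exploration of all starts
            i = rng.randrange(len(xs))
        else:                               # favor the currently best start
            i = max(range(len(xs)), key=lambda j: est[j])
        step = 0.5 / pulls[i]               # shrinking local search step
        y = xs[i] + rng.gauss(0.0, step)    # random local probe
        if f_noisy(y, rng) > est[i]:        # keep the probe if it looks better
            xs[i] = y
        pulls[i] += 1
        est[i] = 0.9 * est[i] + 0.1 * f_noisy(xs[i], rng)   # refresh estimate
    # pick the search whose iterate averages best over fresh noisy evaluations
    return max(xs, key=lambda x: sum(f_noisy(x, rng) for _ in range(50)) / 50)

# bimodal noisy objective: global maximum near x = 3, local one near x = -3
f = lambda x, rng: (math.exp(-(x - 3) ** 2)
                    + 0.5 * math.exp(-(x + 3) ** 2) + rng.gauss(0.0, 0.1))
print(round(multistart(f, starts=[-4.0, 0.0, 4.0]), 2))
```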
REFERENCES
Agrawal, R., D. Teneketzis and V. Anantharam. (1989a). Asymptotically ef-
ficient adaptive allocation schemes for controlled I.I.D. processes: Finite
parameter space. IEEE Trans. Automat. Contr. 34, 258-267.
Agrawal, R., D. Teneketzis and V. Anantharam. (1989b). Asymptotically effi-
cient adaptive allocation schemes for controlled Markov chains: Finite pa-
rameter space. IEEE Trans. Automat. Contr. 34, 1249-1259.
Anantharam, V., P. Varaiya and J. Walrand. (1987). Asymptotically efficient
allocation rules for multiarmed bandit problems with multiple plays. Part II:
Markovian rewards. IEEE Trans. Automat. Contr. 32, 975-982.
Banks, J. S. and R.K. Sundaram. (1992). Denumerable-armed bandits. Econo-
metrica 60, 1071-1096.
Banks, J. S. and R.K. Sundaram. (1994). Switching costs and the Gittins index.
Econometrica 62, 687-694.
Benveniste, A., M. Metivier, and P. Priouret. (1987). Adaptive Algorithms and
Stochastic Approximations. Springer-Verlag, New York.
Berry, D. A. (1972). A Bernoulli two-armed bandit. Ann. Math. Statist. 43,
871-897.
Blum, J. (1954). Approximation methods which converge with probability one.
Ann. Math. Statist. 25, 382-386.
Borkar, V. and P. Varaiya. (1979). Adaptive control of Markov chains. I: Finite
parameter set. IEEE Trans. Automat. Contr. 24, 953-958.
Brezzi, M. and T.L. Lai. (2000a). Incomplete learning from endogenous data
in dynamic allocation. Econometrica 68, 1511-1516.
Brezzi, M. and T.L. Lai. (2000b). Optimal learning and experimentation in
bandit problems. To appear in J. Economic Dynamics & Control.
Chang, F. and T.L. Lai. (1987). Optimal stopping and dynamic allocation. Adv.
Appl. Probab. 19, 829-853.
Chernoff, H. (1967). Sequential models for clinical trials. Proc. Fifth Berkeley
Symp. Math. Statist. & Probab. 4, 805-812. Univ. California Press.
Fabian, V. (1967). Stochastic approximation of minima with improved asymp-
totic speed. Ann. Math. Statist. 38, 191-200.
Fabian, V. (1971). Stochastic approximation. In Optimizing Methods in Statis-
tics (J. Rustagi, ed.), 439-470. Academic Press, New York.
Fabius, J. and W.R. van Zwet. (1970). Some remarks on the two-armed bandit.
Ann. Math. Statist. 41, 1906-1916.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices (with
discussion). J. Roy. Statist. Soc. Ser. B 41, 148-177.
Gittins, J.C. (1989). Multi-Armed Bandit Allocation Indices. Wiley, New York.
Gittins, J.C. and D.M. Jones. (1974). A dynamic allocation index for the se-
quential design of experiments. In Progress in Statistics (J. Gani et al., ed.),
241-266. North Holland, Amsterdam.
Graves, T. L. and T.L. Lai. (1997). Asymptotically efficient adaptive choice
of control laws in controlled Markov chains. SIAM J. Contr. Optimiz. 35,
715-743.
Kaelbling, L.P., M.L. Littman and A.W. Moore. (1996). Reinforcement learning:
A survey. J. Artificial Intelligence Res. 4, 237-285.
Kiefer, J. and J. Wolfowitz. (1952). Stochastic estimation of the maximum of a
regression function. Ann. Math. Statist. 23, 462-466.
Kirkpatrick, S., C.D. Gelatt and M.P. Vecchi. (1983). Optimization by simulated
annealing. Science 220, 671-680.
Klimov, G.P. (1974/1978). Time-sharing service systems I, II. Theory Probab.
Appl. 19/23, 532-551/314-321.
Kumar, P.R. (1985). A survey of some results in stochastic adaptive control.
SIAM J. Contr. Optimiz. 23, 329-380.
Kushner, H.J. and D.S. Clark. (1978). Stochastic Approximation for Constrained
and Unconstrained Systems. Springer-Verlag, New York.
Kushner, H.J. and G. Yin. (1997). Stochastic Approximation Algorithms and
Applications. Springer-Verlag, New York.
Lai, T.L. (1987). Adaptive treatment allocation and the multi-armed bandit
problem. Ann. Statist. 15, 1091-1114.
Lai, T.L. and H. Robbins. (1979). Adaptive design and stochastic approximation.
Ann. Statist. 7, 1196-1221.
Iosif Pinelis
Department of Mathematical Sciences
Michigan Technological University
Houghton, Michigan 49931
ipinelis@math.mtu.edu
Keywords: large deviation probabilities, final zero crossing, last negative sum, last positive
sum, nonparametric bandit theory, subexponential distributions, superexponen-
tial distributions, exponential probabilistic inequalities
Abstract Three related groups of problems are surveyed, all of which concern asymp-
totics of large deviation probabilities themselves – rather than the much more
commonly considered asymptotics of the logarithm of such probabilities. The
former kind of asymptotics is sometimes referred to as “exact”, while the latter as
“rough”. Obviously, “exact” asymptotics statements provide more information;
the tradeoff is that additional restrictions on regularity of underlying probability
distributions and/or on the corresponding zone of deviations are then required.
The literature on large deviations is vast. Most of the results concern the
so-called large deviation principle (LDP), which deals with asymptotics of the
logarithm of the large deviation probability (which is most often Gaussian-like),
rather than with asymptotics of the probability itself.
We write a ~ b if a/b tends to 1. Some authors
refer to asymptotics of the logarithm of the large deviation probability as “rough"
asymptotics, in contrast with “exact" asymptotics of the probability
itself. To appreciate the difference, note that such two different cases of “exact"
asymptotics as, e.g.,
while the second summand dominates the first one in the zone of “very" large
deviations for every
If the exponent then, assuming for simplicity the symmetry of the
distribution of the one has
Put
The notation as in (3) is convenient since it unifies both cases. Note that
analogues of the subsequent results hold when either the underlying distribution
is any lattice one or has a non-zero absolutely continuous component.
Now let stand for the density of which is defined, in both cases,
analogously to (3).
Let
where
Moreover,
Exact asymptotics for large deviation probabilities‚ with applications 61
More explicit expressions for can be given for the case of exponentially or
superexponentially decreasing left tails; the expressions are especially simple
when X has a maximum span 1 lattice lower-semicontinuous distribution‚ i.e.‚
when
note that this case corresponds‚ in the application to the learning problems
described in the next section‚ to the situation when‚ say‚ and
In particular‚ yet simpler formula‚
takes place when X can assume only the three values (–1)‚ 0‚1 with probabil-
ities respectively‚ so that this will correspond
to the situation when and can assume values 0 or 1 only.
As to the case when X has a maximum span 1 lattice upper-semicontinuous
distribution‚ i.e.‚
where
where
where
Using (12), one can see that (4) is satisfied in this case with
moreover,
Note that in the Cramér case, under the additional condition that there exists
a number such that
or
where
But it is known (or may be obtained using (12)) that under the conditions (11)
and (14)‚ the main term asymptotics of differs from that of
given by (12)‚ only by a constant factor. Our conclusion therefore is that under
the conditions (11) and (14)‚
In case (15) does take place‚ a comparison of (21) and the result of Siegmund
(1975) shows that
where One can see that (23) does not contain defined by
(15); we therefore conjecture that (23) holds without (15)‚ just whenever the
Cramér condition (11)&(14) is satisfied.
In order to compute using (23)‚ one needs to deal with the distributions
of all the But‚ as was said above‚ the situation is much simpler in the
case when X has a maximum span 1 lower-semicontinuous distribution. In
this case‚ it is easy to obtain from a “renewal” integral equation similar to‚ say‚
(3.16) in Feller (1971) that for integral hence‚ (6) and (8)
yield
where — cf. Borovkov (1976), Chap. 4, Section 19, (14) and Feller
(1971), Chap. XII, (5.9), then (6) and (8) yield
Let us now see what happens when the Cramér condition fails altogether. Suppose‚
e.g.‚ that (16) takes place with Then (18) and (19) yield
Hence‚
where is defined by (6); by the Scheffé theorem (see, e.g., Billingsley (1968)),
this convergence then takes place also in variation, and hence in distribution;
moreover, for large enough
If and either the Cramér condition (11) & (14) or the subexponentiality
condition (16) with (or (17) with and (0,1/2)) is
satisfied, then for all and the conditional distribution
of given (or, equivalently, given weakly con-
verges to the uniform distribution on [0,1]. The same is true for the conditional
distribution of given
We assume that the better bandit is the one with observations Y‚ that is‚ EY < EZ. (This is of course
unknown to the observer; we suppose that in general no information about the
distributions of Y and Z is available except that the means EY and EZ are
finite.)
Let N(1) = 1 < N(2) < N(3) < . . . be a sequence of nonrandom integers‚
called here test times; the observations are supposed to be made only at these
time moments.
Consider
the number of the test times up through the time moment and
Remark. Recall that condition (4) holds under very general circumstances.
Equations (20)‚ (23)‚ and those subsequent to them give more explicit forms of (26)
for different tail decrease rates.
Let M be the total number of the test times at which the wrong bandit was
selected.
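An illustrative simulation of this test-times scheme (assumed Gaussian observations and geometric test-time spacing; a sketch, not the exact strategy analyzed in this section): both bandits are observed at the sparse nonrandom test times, the current leader (the arm with the smaller sample mean) would be played in between, and M counts the test times at which the wrong bandit leads.

```python
import numpy as np

rng = np.random.default_rng(0)

# Nonrandom test times N(1) = 1 < N(2) < ..., geometrically spaced so they
# are sparse among all decision times up to the horizon.
horizon, test_times = 20000, []
t = 1
while t <= horizon:
    test_times.append(t)
    t = int(np.ceil(1.3 * t))

mean_Y, mean_Z, sd = 0.2, 0.8, 0.4    # Y is the better (smaller-mean) bandit
sum_Y = sum_Z = 0.0
n = 0
M = 0   # number of test times at which the wrong bandit leads
for _ in test_times:
    sum_Y += rng.normal(mean_Y, sd)   # at a test time, every bandit is observed
    sum_Z += rng.normal(mean_Z, sd)
    n += 1
    if sum_Y / n > sum_Z / n:         # the leader is wrong: Z looks better
        M += 1
# Between test times one simply plays the current leader, so the cost of
# sampling the inferior bandit is confined to the sparse test times.
```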
(at the test times all the bandits are tested), and if and
then if and
otherwise‚ where
the last time moment at which the best bandit is not observed. By the
above construction‚
Consider
and
Note that for all and Thus‚ one can combine Theorem 5 and
previous results in order to obtain estimates for the distribution tail of
Suppose that‚ for each is the p.d.f. of
(understood as in (3))‚ is the c.d.f. of and (cf. (4))
as where
A function H is said to be regularly varying with exponent if
for all
the of A.
For and define the step function by
as where
It is easy to see that this condition is very mild. Indeed, e.g.,
(27) takes place if, for some
and for such a choice of A one may even take See also
Remark 2 below.
Let us refer to a set as if
as
According to the many results on the large deviation principle (LDP),
the class of such sets is very large.
Finally, for consider the two sets, and defined by
whenever and
Then
whenever and
An analogous assertion can be given also for the case when the set of contact
points is of nonzero Lebesgue measure.
Thus, we see that the condition of the of is quite mild.
Let, as usual, Φ stand for the standard normal c.d.f.:
whenever and
The following assertion, given above as (1), is due to A.V. Nagaev (1969, 1971).
whenever and
where
to in which all the jumps are small enough, smaller than and
which then behave as the trajectories of the Gaussian process and (ii)
the trajectories corresponding to in which there is a large jump larger
than and which then behave as the random jump function
Note that provided that and vary
so that one of the sides of this relation tends to 0. Similarly,
provided that and vary so that one of the sides of
this relation tends to 0.
It is easy to see that the “jump" term
is negligible in comparison to the “Gaussian" probability
in the zone with and a sufficiently
small provided that A is Thus, in this case
As we shall see in the next section, the latter kind of behavior is exhibited in
very general settings, when the summands may take on values in a Banach
space and do not have to be identically distributed.
which is equivalent to
This means that the large deviations of the sum occur mainly due to just one
large summand. Note that in (29) or (30) it is not at all necessary that
in fact‚ the number of summands may just as well be fixed.
It is not difficult to understand the underlying causes of the two different
large deviations mechanisms. For instance‚ let there exist a bounded density
and let Take a large enough positive number
We are interested in the of numbers that maximize
the joint density or‚ equivalently‚ minimize
under the condition
Proposition 3 Let
Thus‚ one can see that if is convex‚ then (the Cramér condition is satisfied
and) the maximum of the joint density is attained when the summands are equal:
On the other hand‚ if the alternative takes place‚ then (the Cramér condition
fails and) the maximum of the joint density is attained when only one of the
summands is large (equal to and the rest of them are small (equal
to
In other words, the critical influence of the Cramér condition as to the type of
the large deviation mechanism is related to the facts that (i) “regular" densities
are log-concave if the Cramér condition holds and log-convex (in a neighbor-
hood of otherwise, and (ii) the direction of convexity of the logarithm of
the density determines the kind of large deviation mechanism.1
Of course, the density does not have to be either log-concave or log-convex for
one of the two large deviation mechanisms to work. The role of log-convexity
(say) may be assumed, e.g., by a condition that the tail of the distribution
is significantly heavier, in a certain sense, than exponential ones. Thus, in
Theorems 14 and 16 below, where the tails decrease no faster than with
there are no convexity conditions. On the other hand, such a
condition is part of Theorem 15 below, where the tails can be arbitrarily close
to exponential ones; that a convexity condition is essential is also shown by
Example 1 below.
For a fixed relation (30) was studied by Chistyakov (1964), with an ap-
plication to a renewal equation. It was shown in Chistyakov (1964) that for a.s.
positive X, relation (30) holds for any fixed iff it is true for i.e.,
when
for any fixed. These and other properties of the class of all subexponential
c.d.f.’s were established in Chistyakov (1964). The class has attracted the
attention of a number of authors; see, e.g., Embrechts, Goldie, and Veraverbeke
(1979), Goldie (1978), Vinogradov (1994), and the bibliography therein.
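The defining relation (30) for two summands — the tail of the sum of two independent copies is asymptotically twice the tail of a single copy — is easy to check numerically for a concrete subexponential law. The sketch below uses a Pareto distribution, with the exponent and threshold chosen arbitrarily for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, x, n = 1.5, 50.0, 2_000_000
# Classical Pareto on [1, inf): P(X > t) = t ** -alpha for t >= 1.
# (numpy's pareto() samples the shifted Lomax form, hence the "+ 1".)
X1 = rng.pareto(alpha, n) + 1.0
X2 = rng.pareto(alpha, n) + 1.0
p_sum = np.mean(X1 + X2 > x)     # Monte Carlo estimate of P(X1 + X2 > x)
p_tail = x ** -alpha             # exact P(X > x)
ratio = p_sum / (2.0 * p_tail)   # tends to 1 as the threshold x grows
```

For this threshold the ratio is already close to 1; the residual discrepancy is a second-order term that vanishes as the threshold grows.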
If (30) holds for every fixed then it remains true for uniformly in
some zone of the form where In general nothing can be
said on the rate of growth of. There have been many publications concerning
(30), including estimates of; see, e.g., Andersen (1970); Anorina and
Nagaev (1971); Godovan’chuk (1978); Heyde (1968); Linnik (1961); Nagaev
(1964; 1969; 1971; 1973; 1979); Pinelis (1981); Rohatgi (1973).
How do the results listed below differ from those found in the preceding
literature?
First, the restriction that the summands are identically distributed is removed.
This removal seems quite natural: as an extreme case, if all summands except
one are zero, then (29) is trivial.
Second‚ it is assumed here that the summand r.v.’s take on values in an ar-
bitrary separable Banach space rather than just real values. Note
that‚ in a variety of settings‚ when the Banach space is “like" infinite-
dimensionality works rather toward the only-one-large-summand large devia-
tion mechanism. Indeed‚ let here and let be
independent real-valued r.v.’s such that the distributions of the are contin-
uous; let and otherwise. Then
where
We shall be interested in the conditions under which the large deviation asymp-
totics
obtain.
where
for some constant C > 0, depending only on and It follows, e.g., that if the
are i.i.d. and for some then for any there is
a constant such that
Theorem 11 Let a c.d.f. F with F(0) = 0 satisfy (32). Assume also that there
exist numbers and a function such that
Theorem 11 shows that the tail of a subexponential c.d.f. may vary rather
arbitrarily between and cf. Proposition 4 below. In
particular‚ if for some
It is well known (see, e.g., Araujo and Giné (1979)‚ page 108) that any
infinitely divisible probability distribution on admits a unique
representation of the form
as
It can be shown that if the expectation then (39) is incompatible
with the Cramér condition. However‚ one can construct counterexamples
showing that then both (39) and the Cramér condition may
hold; for instance‚ consider
where
for some Then one has the relations (33) and (34).
for some positive real numbers Suppose also that at least one of the
following three conditions holds:
(i) is a Banach space of type 2‚ for all and and
(44) takes place;
Suppose that
as Then
The latter corollary is for illustrative purposes only. It shows that the
individual distributions of the summands may be very irregular; they may even
be two-point distributions. For (33) and (34) to hold‚ it suffices that the sum
of the tails be regular enough.
Note also that Theorem 12 follows from Theorem 14.
as Introduce
and
Corollary 6 Let be fixed. Suppose that conditions (49) and (50) hold. Fi-
nally‚ assume that
for some and all large enough integral Then one has (51).
In the previous two subsections‚ theorems were given covering the whole
spectrum of tails‚ from to‚ for which the relations (33) and (34)
may hold in general. However‚ tails such as the ones mentioned in the title of
this subsection may still remain of interest. As one can see from the statement
of Theorem 16‚ a restriction is now imposed only on the first derivative of the
tail function‚ while in Theorem 15 one has to take into account essentially the
sign of a second derivative. Note also that Theorem 16 generalizes Theorems
1 and 2 of Nagaev (1977) simultaneously.
Consider the following classes of nonnegative non-decreasing functions
defined on some interval where
(i) class defined by the condition: is
non-increasing in
(ii) class for defined by the condition:
whenever
(iii) class for defined by the condition: is
absolutely continuous, and its derivative satisfies the inequality
Proposition 5 If then
In particular‚
This proposition shows that all these three kinds of classes are essentially
the same.
for where
and Suppose also that
and
NOTES
1. Similar observations on large deviation mechanisms were offered by Nagaev (1964).
REFERENCES
Andersen‚ G. R. (1970) Large deviation probabilities for positive random vari-
ables. Proc. Amer. Math. Soc. 24 382–384.
Anorina‚ L. A.; Nagaev‚ A. V. (1971) An integral limit theorem for sums of inde-
pendent two-dimensional random vectors with allowance for large deviations
in the case when Cramér’s condition is not satisfied. (Russian) Stochastic pro-
cesses and related problems‚ Part 2 (Russian)‚ “Fan" Uzbek. SSR‚ Tashkent‚
3–11.
Araujo‚ A. and Giné‚ E. (1979) On tails and domains of attraction of stable
measures in Banach spaces. Trans. Amer. Math. Soc. 248‚ no. 1‚ 105–119.
Bahadur‚ R.R. and Ranga Rao‚ R. (1960) On deviations of the sample mean‚
Ann. Math. Statist. 31‚ 1015-1027.
Billingsley‚ P. (1968) Convergence of Probability Measures‚ Wiley‚ New York.
Borovkov‚ A.A. (1964). Analysis of large deviations for boundary problems
with arbitrary boundaries I‚ II. In: Selected Translations in Math. Statistics
and Probability‚ Vol.6‚ 218-256‚ 257-274.
Borovkov‚ A.A. (1976). Stochastic Processes in Queuing Theory. Springer-
Verlag‚ New York-Berlin.
Borovkov‚ A.A. and Rogozin‚ B.A. (1965) On the multidimensional central
limit theorem‚ Theory Probab. Appl.‚ 10‚ 52-62.
Chistyakov‚ V.P. (1964) A theorem on sums of independent positive random
variables and its application to branching processes‚ Theory Probab. Appl.‚
9‚ 710–718.
Dembo‚ A. and Zeitouni‚ O. (1998) Large deviations techniques and appli-
cations. Second edition. Applications of Mathematics‚ 38. Springer-Verlag‚
New York.
Embrechts‚ P.; Goldie‚ C. M.; Veraverbeke‚ N. (1979) Subexponentiality and
infinite divisibility. Z. Wahrsch. Verw. Gebiete 49‚ no. 3‚ 335–347.
Feller‚ W.‚ (1971) An Introduction to Probability Theory and its Applications‚
II‚ 2nd ed.‚ John Wiley & Sons‚ New York.
Fill‚ J.A. (1983) Convergence rates related to the strong law of large numbers‚
Ann. Probab. 11‚ 123-142.
Godovan’chuk‚ V.V. (1978) Probabilities of large deviations of sums of inde-
pendent random variables attached to a stable law‚ Theory Probab. Appl. 23‚
602–608.
Goldie‚ C.M. (1978) Subexponential distributions and dominated-variation tails.
J. Appl. Probability 15‚ no. 2‚ 440–442.
Wai-Yuan Tan
Department of Mathematical Sciences
The University of Memphis
Memphis‚ TN 38152-6429
waitan@memphis.edu
Zhihua Xiang
Organon Inc.
375 Mt. Pleasant Avenue
West Orange‚ NJ 07052
z.xiang@organoninc.com
Abstract It is well documented that‚ in many cases‚ most of the free HIV are generated in
the lymphoid tissues rather than in the plasma. This is especially true in the late
stage of HIV pathogenesis because in this stage‚ the total number of T
cells in the plasma is very small‚ whereas the number of free HIV in the plasma
is very large. In this paper we have developed a state space model in plasma
involving net flow of HIV from lymph nodes‚ extending the original model of
Tan and Xiang (1999). We have applied this model and the theory to the data of
a patient (patient No.104) considered in Perelson et al. (1996)‚ in which RNA
virus copies per were observed on 18 occasions over a three week period.
This patient was treated by a protease inhibitor‚ ritonavir‚ so that a large number
of non-infectious HIV was generated by the treatment. For this patient‚ by using
the state space model over the three week span‚ we have estimated the numbers
of productively HIV-infected T cells‚ the total number of infectious HIV‚ as well
as the number of non-infectious HIV. Our results showed that within this period‚
most of the HIV in the plasma was non-infectious‚ indicating that the drug is
quite effective.
Keywords: Infectious HIV‚ lymph nodes‚ Monte Carlo studies‚ non-infectious HIV‚ protease
inhibitors‚ state space models‚ stochastic differential equations.
1. INTRODUCTION
In a recent paper‚ Tan and Xiang (1999) developed some stochastic and state
space models for HIV pathogenesis under treatment by anti-viral drugs. In
this paper we extend these models into models involving net flow of HIV from
lymphoid tissues. This extension is important and necessary because of the
following biological observations:
(1) HIV normally exists either in the plasma as free HIV‚ or trapped by follic-
ular dendritic cells in the germinal center of the lymph nodes during all stages
of HIV infection (Fauci and Pantaleo‚ 1997; Levy‚ 1998; Cohen‚ Weissman and
Fauci‚ 1998; Fauci‚ 1996; Tan and Ye‚ 2000). Further‚ Haase et al. (1996) and
Cohen et al. (1997‚ 1998) have shown that the majority of the free HIV exist in
lymphoid tissues. This is true especially in the late stage of HIV pathogenesis.
For example‚ Perelson et al. (1996) have considered a patient (Patient No. 104
in Perelson et al. (1996)) treated by a protease inhibitor‚ ritonavir. For this
patient‚ at the start of the treatment‚ the total number of T cells in the
blood was yet the total number of RNA virus copies was
in the blood; many other examples can be found in Piatak et al. (1993). Thus‚
in the late stage‚ it is unlikely that most of the free HIV in the plasma were
generated by productively HIV-infected CD4 T cells in the plasma. (Note that
the total number of T cells includes the productively HIV-infected T
cells.)
(2) Lafeuillade et al. (1996) and many others have shown that both the free
HIV in the plasma‚ and the HIV in lymph nodes can infect T cells‚ generating
similar dynamics in the plasma as well as the lymph nodes. Furthermore‚ the
infection process in the lymph nodes is much faster than in the plasma (Fauci‚
1996; Cohen‚ Weissman and Fauci‚ 1998; Cohen et al.‚ 1997; Haase et al.‚ 1996;
Kirschner and Web‚ 1997; Lafeuillade et al.‚ 1996; Tan and Ye‚ 2000). From
these observations‚ it follows that most of the free HIV in the blood must have
come from HIV in the lymph nodes‚ rather than from productively HIV-infected
CD4 cells in the blood; this is true especially in the late stages.
from the plasma to the lymph nodes. In this paper we thus consider only net
flow of HIV from lymphoid tissues to plasma‚ ignoring net flow of T cells.
(ii) The uninfected CD4(+) T cells have a relatively long life span (the
average life span of uninfected T cells is more than 50 days; see
Cohen et al. (1998) and Tan and Ye (1999a)).
(iii) Mitter et al. (1998) have provided additional justifications for assumption
(2); they have shown that even for a much longer period since treatment‚
the assumption has little effect on the total number of free HIV.
Let denote the numbers of productively HIV-infected
T cells‚ non-infectious free HIV and infectious free HIV in the blood
at time t respectively. Then we are considering a three dimensional stochastic
process With the above assumptions‚ we now
proceed to derive stochastic equations for the state variables in plasma of this
stochastic process under treatment by protease inhibitors and with net flow of
HIV from lymphoid tissue. We first illustrate how to model the effects of pro-
tease inhibitors and the net flow of HIV from the lymphoid tissues to the plasma.
where is the flow inhibiting function‚ the flow potential function and
the saturation constant (Kirschner and Webb‚ 1997).
Let denote the net flow of HIV from the lymphoid tissues to the
plasma during the net flow of HIV from the lymphoid tissues
to the plasma during Then‚
and
and
where
and
In the above equations, the variables on the right hand side are random
variables. To specify the probability distributions of these variables, let
denote the HIV infection rate of T cells by free HIV in the plasma at time t; let
denote the death rate of productively HIV-infected T cells in the plasma
at time t and the rate at which free HIV or in the plasma are being
removed or die at time t. Then, the conditional probability distributions of the
above random variables are given by:
where
and
Table 1
Variances and Covariances of the Random Noises
3. MEAN VALUES OF
Let and denote the mean values of and
respectively. Then‚ by using equations (2.4)-(2.6)‚ we obtain:
If we follow Perelson et al. (1996) in assuming that the drug is 100% effective
(i.e. then the solution of is
Since is usually very small‚ the above equations then lead to the
following equation for the approximation of
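Numerically, the picture under the 100% effectiveness assumption is easy to reproduce: infectious virus is no longer produced and decays at its clearance rate, while infected cells continue to shed non-infectious virions. The sketch below integrates this standard two-compartment approximation with illustrative parameter values, not the estimates fitted in this chapter.

```python
import numpy as np

# Illustrative rates (per day); assumptions for the sketch only.
c, delta, N_virions = 3.0, 0.5, 480.0
dt, n_steps = 0.001, 10000          # integrate over 10 days
Tstar, V_I, V_NI = 1.0e4, 1.0e5, 0.0
for _ in range(n_steps):            # forward Euler integration
    dTstar = -delta * Tstar                         # infected cells die off
    dV_I = -c * V_I                                 # no new infectious virus
    dV_NI = N_virions * delta * Tstar - c * V_NI    # non-infectious shedding
    Tstar += dt * dTstar
    V_I += dt * dV_I
    V_NI += dt * dV_NI
# The exact solution of the infectious pool is V_I(t) = V_I(0) * exp(-c * t):
# it vanishes quickly, while V_NI first rises and then decays more slowly.
```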
the observed total number of RNA virus load at time Then the observation
model based on the RNA virus load is given by:
for
(i) For
satisfies the following equations with boundary conditions
for where
and
where
and
for
and
where
Table 2. Observed RNA Virus Copies per for Patient No. 104
where and
To fit the above model to the data in Table (2), we use the estimates
and of Perelson et al. (1996).
We use N = 2500 from Tan and Ye (1999, 2000), with
and
Stochastic Modelling of Early HIV Immune Responses 111
Using these estimates and the Kalman filter methods in Tan and Xiang (1998,
1999), we have estimated the number of cells, infectious HIV as well as
non-infectious HIV per of blood over time. Plotted in Figure (1) are the
observed total number of RNA copies per together with the estimates by
the Kalman filter method and the estimates by the deterministic model in two
cases (Case a: Case b: Plotted
in Figures (2)-(3) are the estimated numbers of infected T-cells and free HIV
(infectious and non-infectious HIV) respectively.
From Figure (1)‚ we observed that the Kalman filter estimates followed the
observed numbers‚ whereas the deterministic model estimates appeared to draw
a smooth line to match the observed numbers‚ and could not follow the fluctu-
ations of the observed numbers. Thus‚ there are some differences between the
two estimates within the first 8 hours although the differences are not noticeable
in the figure; however‚ after 8 hours‚ there is little difference between the two
estimates. Furthermore‚ the curves appeared to have reached a steady state low
level in 200 hours (a little more than a week).
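For readers unfamiliar with the method, the sketch below shows the Kalman filtering step on a toy scalar model, a noisily observed and geometrically decaying state. It is a deliberate simplification of the nonlinear multivariate state space model used here, and every parameter value is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, a = 200, 0.95                # time steps; decay factor per step
q, r = 0.05, 0.3                # state and observation noise variances
x = 10.0                        # true (log) state
truth, obs = [], []
for _ in range(n):
    x = a * x + rng.normal(0.0, np.sqrt(q))
    truth.append(x)
    obs.append(x + rng.normal(0.0, np.sqrt(r)))

m, P = 10.0, 1.0                # filter mean and variance
est = []
for y in obs:
    m_pred, P_pred = a * m, a * a * P + q        # predict
    K = P_pred / (P_pred + r)                    # Kalman gain
    m = m_pred + K * (y - m_pred)                # update with the observation
    P = (1.0 - K) * P_pred
    est.append(m)

mse_filter = np.mean((np.array(est) - np.array(truth)) ** 2)
mse_raw = np.mean((np.array(obs) - np.array(truth)) ** 2)
```

The filter's mean squared error is well below that of the raw observations, which is the sense in which the filtered estimates follow the data while smoothing its fluctuations.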
From Figure (3)‚ we observed that at the time of starting the treatment‚ the
estimated number of infectious HIV begins to decrease very sharply and reaches
the lowest steady state level within 10 hours‚ and there is little difference be-
tween the Kalman filter estimates and the estimates of the deterministic model;
on the other hand‚ the estimated number of non-infectious HIV first increases‚
reaching the maximum in about 6-8 hours before decreasing steadily to reach a
very low steady state level in about 200 hours. Within 50 hours of treatment‚
there appeared to be significant differences between the two estimates of the
number of non-infectious HIV; after 50 hours such differences appeared to be
very small.
Comparing the two cases in Figures (1)-(3)‚ we observed that the estimates
assuming (Case b) are almost identical to the corresponding
ones assuming (Case a). These results suggest that
the delay effects are very small in this example.
interested only in the estimation of T cells and free HIV‚ the two models make
little difference.
scribed in Section 3. Similar results have also been obtained by Tan and Xiang
(1998,1999) in other models of a similar nature.
(2) The estimates by using the deterministic model seem to draw a smooth line
across the generated numbers. Thus, although the results of the deterministic
model cannot follow the fluctuations of the generated numbers, due presumably to the
randomness of the state variables, they are still quite useful in assessing the
behavior and trend of the process.
(3) For the numbers of cells and free HIV, there are small differ-
ences between the Kalman filter estimates and the estimates of the determin-
istic model, due presumably to the small numbers of these cells. For the non-
infectious free HIV (i.e. however, there are significant differences between
the Kalman filter estimates and the estimates using the deterministic model. It
appears that the Kalman filter estimates have revealed much stronger effects of
the treatment at early times (before 10 hours), which could not be detected by
the deterministic model.
ACKNOWLEDGMENTS
The authors wish to thank the referee for his help in smoothing the English
language.
B. Barnes
School of Mathematical Sciences
Australian National University
Canberra‚ ACT 0200
Australia*
J. Gani
School of Mathematical Sciences
Australian National University
Canberra‚ ACT 0200
Australia
Abstract This paper considers an epidemic which is spread by the re-use of unsterilized
hypodermic needles. The model leads to a geometric type distribution with
varying success probabilities‚ which is applied to model the propagation of the
Ebola virus.
Keywords: epidemic and its intensity‚ Ebola virus‚ geometric type distribution.
1. INTRODUCTION
In her book “The Coming Plague”‚ Laurie Garrett (1994) describes how the
shortage of hypodermic needles in Third World countries‚ as well as shared
needles in most countries‚ is responsible for the spread of infectious diseases.
She cites one case from the Soviet Union in 1988‚ where 3 billion injections
were administered‚ although only 30 million needles were manufactured. In
one instance of shared needles that year‚ a single hospitalised HIV-infected
baby led to the infection of all the other infants in the same ward.
* This project was funded by the Centre for Mathematics and its Applications‚ Australian National University‚
Canberra‚ University College‚ University of New South Wales‚ Canberra.
have
and thus
equation (2.6),
or
Now applying the well-known binomial identity for positive
integers (see Abramowitz and Stegun, 1964)
to the left-hand side of equation (3.9), the equality is
proved as required.
5. INTENSITY OF EPIDEMIC
It is of interest to estimate the intensity of the epidemic, defined here as the
average proportion of susceptibles in a group of N individuals who are infected
when the same needle is used to inject them. This proportion has a least value
of 1/2 when there is only a single infective in the population. The proportion
rises rapidly as the number of susceptibles decreases, approaching a value
close to 1 for large N. The variance of this proportion can also be derived.
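The intensity can also be checked by simulation. The following is our own minimal Monte Carlo sketch, not the paper's derivation; the function name and setup are illustrative, and certain transmission (probability 1) is assumed so that the least value 1/2 is visible.

```python
import random

def simulate_intensity(n_total, n_infective=1, p_transmit=1.0,
                       trials=20000, seed=1):
    """Monte Carlo estimate of the epidemic intensity: the average
    proportion of susceptibles infected when one unsterilized needle
    is used on all n_total individuals in a random order."""
    rng = random.Random(seed)
    n_susc = n_total - n_infective
    total = 0.0
    for _ in range(trials):
        order = [1] * n_infective + [0] * n_susc
        rng.shuffle(order)
        contaminated = False
        new_infections = 0
        for person in order:
            if person == 1:
                contaminated = True          # needle picks up the virus
            elif contaminated and rng.random() < p_transmit:
                new_infections += 1
                contaminated = True          # newly infected person re-contaminates
        total += new_infections / n_susc
    return total / trials
```

With n_total = 20 and the defaults, the estimate is close to 0.5, in agreement with the least value quoted above: with certain transmission, on average half of the susceptibles are injected after the single infective.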
6. REDUCING INFECTION
An obvious and simple strategy for reducing the risk of infection is to divide
the population into K groups (K = 2, 3, 4, ...) of N/K individuals each, where
N/K is an integer, and to sterilize the needle after vaccinating each group.
Using a subscript to number each of the groups, the expected number of new
infectives is obtained by summing the quantities from equation (6.16) over the
K groups.
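The effect of sterilizing between groups can be illustrated by extending the simulation idea. The sketch below is ours, not an implementation of equation (6.16): it simply resets the needle's contamination at each group boundary.

```python
import random

def simulate_group_intensity(n_total, k_groups, p_transmit=1.0,
                             trials=20000, seed=2):
    """Average proportion of susceptibles infected when the population
    is split into k_groups consecutive groups of n_total/k_groups
    individuals and the needle is sterilized after each group."""
    assert n_total % k_groups == 0
    rng = random.Random(seed)
    size = n_total // k_groups
    n_susc = n_total - 1                     # a single initial infective
    total = 0.0
    for _ in range(trials):
        order = [1] + [0] * n_susc
        rng.shuffle(order)
        new_infections = 0
        for g in range(k_groups):
            contaminated = False             # needle sterilized here
            for person in order[g * size:(g + 1) * size]:
                if person == 1:
                    contaminated = True
                elif contaminated and rng.random() < p_transmit:
                    new_infections += 1
                    contaminated = True
        total += new_infections / n_susc
    return total / trials
```

With a single infective, only the group containing it contributes, so the intensity falls roughly in proportion to 1/K: for n_total = 20 the estimate drops from about 0.5 with K = 1 to about 0.1 with K = 4.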
Returning to the available data concerning the Ebola virus in Garrett (1994),
the epidemic began on September 5, 1976. In that first month 38 of the 300
Yambuku residents died from Ebola. The hospital was closed to new patients
after 25 September, a quarantine was imposed on the region on 30 September,
and by October 9 the virus was observed to be in recession. By November, 46
villages had been affected with 350 deaths recorded. The probability of con-
tracting the virus from an infected needle was estimated to be 90%, and once
contracted by this means the probability of death was between 88% and 92.5%,
or possibly higher. After the symptoms appear, the Ebola virus takes approx-
imately 10 days to kill a victim, through a particularly painful and gruesome
means; the patient, losing hair, skin and nails, literally bleeds to death as the
membranes containing the body fluids disintegrate and the body organs liquify.
From the data we can establish estimates for parameters in the model. Of
the primary cases, 72 of the 103 were from injection at the hospital. We then
assume that 0.7 of the deaths resulting from the virus were as a direct result of
infected needles. Thus four weeks after the onset of the epidemic, from the 38
deaths in the Yambuku region, we estimate 0.7 × 38 = 26.6 were caused by
infected needles. Towards the end of the epidemic, by November, of the 350
recorded deaths we estimate that 0.7 × 350 = 245 were as a result of infected
needles. (The virus was also spread through body fluid contact, and in one cited
case, a single individual spread the disease to 21 family members and friends, 18
of whom died as a result. However, in general, individuals who contracted the
disease in this manner had a 43% chance of survival. This method of infection
is not considered in our model.) We take the period between infection by the
Ebola virus and death to be 21 days. This follows as the incubation period
was between 2 and 21 days, typically 4 to 16 days, and the time until death
after the appearance of the major symptoms was approximately a week to 10 days. The
probability of infection through the re-use of a needle is difficult to estimate
from the available data. Taking into account the existing, but insufficient,
sterilization process described above, it has been estimated as 1/50, and a
number of different values have been compared. (See Figure 6.1 with values
0.03, 0.05, 0.07 and 0.1.) The probability of death following infection from
a needle is between 88% and 92.5%, and is taken here as 88%. Recall that
the probability of infection from a single infected injection at the hospital was
given as 90%.
The results of simulations for the number of victims of the Ebola virus from
5 September to November, 1976, resulting from the introduction of a single
infective into the Yambuku hospital, are illustrated in Figures 6.1, 6.2 and 6.3.
Infection through means other than infected needles have not been considered.
The number of hospital beds at Yambuku was 150, and between 300 and 600
patients were treated each day, including those in the hospital. We assume that
250 of these received an injection on a single day, using 5 different needles.
The expected number of new infectives is calculated from equation (7.21).
As is illustrated in Figure 6.1, for the number of deaths caused by infection
through re-use of needles, the model predicts 26.4 deaths after 4 weeks, and
over the following months, after November, the number of deaths approaches
245 in the limit. By the end of November the model predicts a total of 236
deaths. These figures compare well with the data given above: 38 deaths in the
first month (28 days), 0.7 × 38 = 26.6 of which are due to infection through
The impact of re-using hypodermic needles 127
needles, and 350 deaths by November, 0.7 × 350 = 245 of which are expected
to have been from infected needles. Furthermore, the model is in agreement
with the data which states that by 10 October, the disease was in recession. This
is illustrated in Figure 6.2. The epidemic is seen to peak and begin to decline in
the first week of October, between day 30 and day 40. While the closure of the
hospital to new patients, with a drastic reduction in the susceptible population,
would certainly have had an impact on this decline, it is clear (Figure 6.2) that
the number of new infectives was declining by day 15 to 20, that is from 20 to 25
September. Figure 6.3 graphs the number of susceptibles and infectives in the
hospital over time, illustrating the proportion of infectives to susceptibles and
demonstrating a gradual decline of the epidemic in late September, followed
by a sharp decline when the hospital was closed and the region quarantined.
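For readers who wish to experiment, a simplified daily-cohort version of such a simulation can be sketched as follows. This is not the paper's model (equation (7.21) is not reproduced here); the parameters are the ones quoted in the text (250 injections per day over 5 needles, needle transmission probability 1/50, death probability 0.88, 21 days from infection to death), while the cohort bookkeeping, the patient-pool size, and all names are our own illustrative assumptions.

```python
import random

def simulate_hospital(days=90, patients_per_day=250, needles=5,
                      p_transmit=0.02, p_death=0.88, days_to_death=21,
                      initial_infectives=1, population=2000, seed=3):
    """Toy daily simulation of needle-borne spread in a hospital.

    Each day the injections are split evenly over the needles; once a
    needle has touched an infective it passes the virus to each later
    patient in its queue with probability p_transmit.  Infectives are
    removed (dead or recovered) days_to_death days after infection.
    """
    rng = random.Random(seed)
    susceptible = population - initial_infectives
    infected_on = [0] * initial_infectives   # infection day of each infective
    deaths = 0
    for day in range(1, days + 1):
        # resolve cases that have run their course
        active = []
        for d in infected_on:
            if day - d >= days_to_death:
                if rng.random() < p_death:
                    deaths += 1              # otherwise: recovered
            else:
                active.append(d)
        infected_on = active
        pool = susceptible + len(infected_on)
        if pool == 0:
            break
        for _ in range(needles):
            contaminated = False
            for _ in range(patients_per_day // needles):
                if rng.random() < len(infected_on) / pool:
                    contaminated = True      # needle touched an infective
                elif contaminated and rng.random() < p_transmit:
                    if susceptible > 0:
                        susceptible -= 1
                        infected_on.append(day)
    return deaths
```

Varying p_transmit over the values compared in Figure 6.1 (0.03, 0.05, 0.07, 0.1) gives a feel for how sensitive the death toll is to the sterilization level.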
The damage caused by the Ebola virus in 10 days is comparable with that
caused by AIDS over a period of 10 years. It is no wonder that it is considered
the second most lethal disease of the century.
8. CONCLUSIONS
A geometric type distribution model has been developed for the spread of
an infection through the re-use of infected needles. It provides a reasonable
description of the dynamics of the Ebola virus epidemic in Yambuku, in 1976,
although the available data is scarce. Further work, comparing this model with
the spread of a disease through contact with some external source, using the
Greenwood and Reed-Frost chain binomial models, as well as the dynamics of
a deterministic model, is planned to provide an insight into the differences in
the propagation of epidemics.
REFERENCES
Abramowitz, M. and I. A. Stegun. (1964). Handbook of Mathematical Func-
tions. National Bureau of Standards, Washington D. C.
Feller, W. (1968). An Introduction to Probability Theory and its Applications.
John Wiley, New York.
Garrett, L. (1994). The Coming Plague: Newly Emerging Diseases in a World
out of Balance. Penguin, New York.
Chapter 7
Nonparametric Frequency Detection and Optimal Coding in Molecular Biology
David S. Stoffer
Department of Statistics
University of Pittsburgh
Pittsburgh, PA 15260
1. INTRODUCTION
Rapid accumulation of genomic sequences has increased demand for meth-
ods to decipher the genetic information gathered in data banks such as Gen-
Bank. While many methods have been developed for a thorough micro-analysis
of short sequences, there is a shortage of powerful procedures for the macro-
analyses of long DNA sequences. Combining statistical analysis with modern
computer power makes it feasible to search, at high speeds, for diagnostic pat-
terns within long sequences. This combination provides an automated approach
to evaluating similarities and differences among patterns in long sequences and
aids in the discovery of the biochemical information hidden in these organic
molecules.
observed that bases at codon site three were chosen to be unlike the neighboring bases
to the left and to the right with respect to the strong-weak (S-W) alphabet. While
the various studies on codon usage exhibit many substantial differences, most
of them agree on one point, namely the existence of some kind of periodicity
in coding sequences. This widely accepted observation is supported by the
spectral envelope approach, which shows a very strong period-three signal in
genes, a signal that disappears in noncoding regions. This method may even be helpful
in detecting wrongly assigned gene segments as will be seen. In addition, the
spectral envelope provides not only the optimal period lengths but also most
favorable alphabets, for example {S, W} or {G, H}, where H = (A, C, T). This
analysis might help decide which among the different suggested patterns [such
as RNY, GHN, etc., where R = (A, G), Y = (C, T), and N is anything] are the
most valid.
The spectral envelope methodology is computationally fast and simple be-
cause it is based on the fast Fourier transform and is nonparametric (that is, it is
model independent). This makes the methodology ideal for the analysis of long
DNA sequences. Fourier analysis has been used in the analysis of correlated
data (time series) since the turn of the twentieth century. Of fundamental inter-
est in the use of Fourier techniques is the discovery of hidden periodicities or
regularities in the data. Although Fourier analysis and related signal processing
are well established in the physical sciences and engineering, they have only
recently been applied in molecular biology. Because a DNA sequence can be
regarded as a categorical-valued time series it is of interest to discover ways in
which time series methodologies based on Fourier (or spectral) analysis can be
applied to discover patterns in a long DNA sequence or similar patterns in two
long sequences.
One naive approach for exploring the nature of a DNA sequence is to assign
numerical values (or scales) to the nucleotides and then proceed with standard
time series methods. It is clear, however, that the analysis will depend on the
particular assignment of numerical values. Consider the artificial sequence
ACGT ACGT ACGT ... . Then, setting A = G = 0 and C = T = 1 yields the
numerical sequence 0101 0101 0101 ..., or one cycle every two base pairs (that
is, a frequency of oscillation of 1/2, or a period of oscillation of length 2 bp).
Another interesting scaling is A = 1, C = 2, G = 3, and T = 4,
which results in the sequence 1234 1234 1234 ..., or one cycle every four bp
(a frequency of 1/4, or a period of 4 bp). In this example, both scalings (that is, {A, C, G, T} = {0, 1, 0, 1}
and {A, C, G, T} = {1, 2, 3, 4}) of the nucleotides are interesting and bring out
different properties of the sequence. It is clear, then, that one does not want to
focus on only one scaling. Instead, the focus should be on finding all possible
scalings that bring out interesting features of the data. Rather than choose values
arbitrarily, the spectral envelope approach selects scales that help emphasize any
periodic feature that exists in a DNA sequence of virtually any length, in a quick
and automated fashion.
Nonparametric Frequency Detection and Optimal Coding in Molecular Biology 133
where s² is the sample variance of the data; the last term is dropped if n is odd.
One usually plots the periodogram, I(j/n), versus the fundamental frequencies
j/n, for j = 1, 2, ..., n/2, and inspects the graph for large values. As
previously mentioned, large values of the periodogram at j/n indicate that the
data are highly correlated with a sinusoid that is oscillating at a frequency of
j cycles in n observations.
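The two scalings of the artificial sequence above can be checked numerically. The sketch below is ours (the function name is illustrative): it computes the periodogram with NumPy's FFT and recovers the peak at frequency 1/2 for the {0, 1, 0, 1} scaling and at 1/4 for the {1, 2, 3, 4} scaling.

```python
import numpy as np

def periodogram(x):
    """Periodogram I(j/n) at the fundamental frequencies j/n,
    j = 1, ..., n//2: the squared modulus of the DFT, scaled by 1/n."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    x = x - x.mean()
    power = np.abs(np.fft.fft(x)) ** 2 / n
    freqs = np.arange(1, n // 2 + 1) / n
    return freqs, power[1:n // 2 + 1]

seq = "ACGT" * 64                      # the artificial sequence from the text
scale01 = {"A": 0, "G": 0, "C": 1, "T": 1}
freqs, I = periodogram([scale01[b] for b in seq])
peak = freqs[np.argmax(I)]             # 0.5: one cycle every two base pairs
```

Replacing scale01 with {A: 1, C: 2, G: 3, T: 4} moves the dominant peak to 0.25, one cycle every four base pairs, which is the point of examining all scalings rather than just one.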
An algorithm for estimating the spectral envelope and the optimal scalings
given a particular DNA sequence (using the nucleotide alphabet, {A, C, G, T},
for the purpose of example) is as follows:
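A minimal sketch of the computation follows. The sketch is ours, not the paper's listing: it uses indicator series for the first k − 1 letters, a plain moving-average smoother in place of the paper's triangular smoother, and omits the normalization and inference details of Stoffer et al. (1993a).

```python
import numpy as np

def spectral_envelope(seq, alphabet=("A", "C", "G", "T"), m=2):
    """Sample spectral envelope of a categorical sequence: at each
    fundamental frequency, the largest eigenvalue of the (smoothed,
    real part of the) periodogram matrix of the indicator series,
    standardized by the sample covariance of those series; the
    leading eigenvector gives the optimal scaling."""
    k, n = len(alphabet), len(seq)
    idx = {a: i for i, a in enumerate(alphabet)}
    Y = np.zeros((n, k - 1))               # indicators; last letter -> zero vector
    for t, b in enumerate(seq):
        if idx[b] < k - 1:
            Y[t, idx[b]] = 1.0
    Y -= Y.mean(axis=0)
    V = Y.T @ Y / n                        # sample covariance of the indicators
    Linv = np.linalg.inv(np.linalg.cholesky(V))
    D = np.fft.fft(Y, axis=0) / np.sqrt(n)
    P = np.einsum("ja,jb->jab", D, np.conj(D)).real   # periodogram matrices
    w = np.ones(2 * m + 1) / (2 * m + 1)   # moving-average smoother
    Ps = np.empty_like(P)
    for a in range(k - 1):
        for b in range(k - 1):
            Ps[:, a, b] = np.convolve(P[:, a, b], w, mode="same")
    nf = n // 2
    freqs = np.arange(1, nf + 1) / n
    lam = np.empty(nf)
    betas = np.empty((nf, k - 1))
    for j in range(1, nf + 1):
        M = Linv @ Ps[j] @ Linv.T
        vals, vecs = np.linalg.eigh((M + M.T) / 2)
        lam[j - 1] = vals[-1]
        betas[j - 1] = Linv.T @ vecs[:, -1]  # optimal scaling (unnormalized)
    return freqs, lam, betas
```

On the artificial ACGT-repeat, the envelope is concentrated at the frequencies 1/4 and 1/2, the only periodic components of that sequence.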
can be done with any finite number of possible categories, and is not restricted
to looking only at nucleotides. Inference for the sample spectral envelope and
the sample optimal scalings is described in detail in Stoffer et al. (1993a). A
few of the main results of that paper are as follows.
If the sequence is uncorrelated and no smoothing is used, then a large sample
approximation based on the chi-square distribution is valid for the sample
spectral envelope, with degrees of freedom depending on the number of letters
in the alphabet being used (for example, four in the nucleotide alphabet). If a
consistent spectral estimator is used and if, for each frequency, the largest root
of the spectral matrix is distinct, then the sample spectral envelope and the
sample optimal scalings are asymptotically normal.
3. SEQUENCE ANALYSES
Our initial investigations have focused on herpesviruses because we regard
them as scientifically and medically important. Eight genomes are completely
sequenced and a large amount of additional knowledge about their biology is
known. This makes them a perfect source of data for statistical analyses. Here
we report on an analysis of nearly all of the CDS of the Epstein-Barr virus via
methods involving the spectral envelope. The data are taken from the EMBL
database.
The study of nucleosome positioning is important because nucleosomes en-
gage in a large spectrum of regulatory functions and because nucleosome re-
search has come to a point where experimental data and analytical methods from
different directions begin to merge and to open ways to develop a more uni-
fied and accurate picture of formation, structure and function of nucleosomes.
While ten years ago many investigators regarded histones as mere packing tools,
irrelevant for regulation, there are now vast amounts of evidence suggesting the
participation of nucleosomes in many important cellular events such as replication,
segregation, development, and transcription (for reviews, see Grunstein
shown in Figure 3.5b has the second position destroyed and Figure 3.5c is the
result of destroying the third position. While destroying the first position has
virtually no effect on the signal, destroying the second or third position does
have an effect on the signal. In the next three panels of Figure 3.5, the results
of destroying two positions simultaneously are shown. Figure 3.5d shows what
happens when the first and second positions are destroyed, Figure 3.5e has the
first and third positions destroyed, and Figure 3.5f has the second and third
positions destroyed. It is clear that the major destruction to the signal occurs
when either the first and third positions, or the second and third positions are
destroyed, although in either case there is still some evidence that the signal has
survived. From Figures 3.5e and 3.5f we see that the first and second positions
cannot thoroughly do the job of carrying the signal alone; however, the signal
remains when the third position is destroyed (Figure 3.5c), so the job does
not belong solely to position three.
To show how this technology can help detect heterogeneities and wrongly
assigned gene segments we focus on a dynamic (or sliding-window) analysis of
BNRF1 (bp 1736-5689) of Epstein-Barr. Figure 3.6 shows the spectral envelope
(using triangular smoothing) of the entire CDS (approximately
4000 bp). The figure shows a strong signal at frequency 1/3; the corresponding
optimal scaling was A = 0.04, C = 0.71, G = 0.70, T = 0, which indicates that
the signal is in the strong-weak bonding alphabet, S = {C, G} and W = {A, T}.
Next, we computed the spectral envelope over two windows: the first half
and the second half of BNRF1 (each section being approximately 2000 bp long).
We do not show the result of that analysis here, but the spectral envelopes and
the corresponding optimal scalings were different enough to warrant further
investigation. Figure 3.7 shows the result of computing the spectral envelope
over four 1000 bp windows across the CDS, namely, the first, second, third,
fourth quarters of BNRF1. An approximate 0.001 significance threshold is
0.69%. The first three quarters contain the signal at the frequency 1/3 (Figure
3.7a-c); the corresponding sample optimal scalings for the first three windows
were: (a) A = 0.06, C = 0.69, G = 0.72, T = 0; (b) A = 0.09, C = 0.70, G =
0.71, T = 0; (c) A = 0.18, C = 0.59, G = 0.77, T = 0. The first two windows
are strongly consistent with the overall analysis, the third section, however,
shows some minor departure from the strong-weak bonding alphabet. The
most interesting outcome is that the fourth window shows that no signal is
present. This result suggests the fourth quarter of BNRF1 of Epstein-Barr is
just a random assignment of nucleotides (noise).
CDS. We see, however, that this alphabet is nonexistent in the final 1000 bp of
BNRF1. This lack of periodicity prompted us to reexamine this region with a
number of other tools, and we now strongly believe that this segment is indeed
noncoding.
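The idea behind such a dynamic (sliding-window) analysis can be illustrated with a toy example of our own, not the EBV data: a sequence whose first part carries a period-three signal and whose tail is random noise, scanned in fixed-width windows.

```python
import numpy as np

rng = np.random.default_rng(0)
coding = "ACG" * 666                        # 1998 bp with period-3 structure
noise = "".join(rng.choice(list("ACGT"), size=999))
seq = coding + noise

def period3_power(window, scale={"A": 0, "C": 1, "G": 1, "T": 0}):
    """Periodogram ordinate of a scaled window at frequency 1/3."""
    x = np.array([scale[b] for b in window], dtype=float)
    x -= x.mean()
    n = len(x)
    return abs(np.fft.fft(x)[n // 3]) ** 2 / n

windows = [seq[i:i + 999] for i in range(0, len(seq), 999)]
powers = [period3_power(w) for w in windows]
# the two coding windows carry a large period-3 peak; the noisy tail does not
```

The same window-by-window drop in the period-three ordinate is what flags the final 1000 bp of BNRF1 as potentially noncoding.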
Herpesvirus saimiri (taken from GenBank) has a CDS from bp 6821 to 10561
(3741 bp) where the similarity to EBV BNRF1 is noted. To see if a similar
problem existed in HVS BNRF1, and to generally compare the periodic behavior
of the genes we analyzed HVS BNRF1 in a similar fashion to EBV BNRF1
as displayed in Figure 3.6. Figure 3.9 shows the smoothed (triangular)
spectral envelope of HVS BNRF1 for (a) the first 1000 bp, (b) the
second 1000 bp, (c) the third 1000 bp, and (d) the remaining 741 bp. There
are some obvious similarities, that is, for the first three sections the cycle 1/3
is common to both the EBV and the HVS gene. The obvious differences are
the appearance of the 1/10 cycle in the third section and the fact that in HVS,
the fourth section shows the strong possibility of containing the 1/3 periodicity
(the data were padded to n = 756 for use with the FFT; the 0.001 significance
threshold in this case is (2/756)exp(3.71/3) = 0.91%, and the peak value of the
spectral envelope at 1/3 in this section was 0.89%), whereas in EBV the fourth section is
noise. Next, we compared the scales for each section of the HVS analysis. In
the first section, the scales corresponding to the 1/3 cycle are A = 0.2, C = 0.96,
G = 0.18, T = 0, which suggests that the signal is driven by C, not-C. In the
second section the scales corresponding to the 1/3 signal are A = 0.26, C = 0.63,
G = 0.73, T = 0, which suggests the strong-weak bonding alphabet. In the third
section there are two signals; at the approximate 1/10 cycle the scales are A =
0.83, C = 0.47, G = 0.30, T = 0 (suggesting a strong bonding-A-T alphabet),
at the 1/3 cycle the scales are A = 0.20, C = 0.32, G = 0.93, T = 0 (suggesting
a G-H alphabet). In the final section, the scales corresponding to the (perhaps
not significant) 1/3 signal are A = 0.28, C = 0.51, G = 0.81, T = 0, which does
not suggest any collapsing of the nucleotides.
Finally, we tabulated the results of the analysis of nearly every CDS in Epstein-
Barr. Only genes that exceed 500 bp in length are reported (BNRF1 and BcLF1
are not reported again here). In every analysis we used triangular smoothing.
These analyses were performed on the entire gene, and it is possible that a
dynamic analysis would find significant periodicities in sections of a CDS other
than those listed here. Table 2 lists the CDS
analyzed, the 0.001 critical value (CV) for that sequence, the significant values
of the smoothed sample spectral envelope (SpecEnv), the frequency at which
the spectral envelope is significant (Freq), and the scalings for A, C, and G at
the significant frequency (T = 0 in all cases). Note that for some genes, there
is no evidence to support that the sequence is anything other than noise; these
genes should be investigated further. The occurrence of the zero frequency has
many explanations, but we are not certain which applies, and this warrants further
investigation. One explanation is that the CDS has long memory in that sections
in the CDS that are far apart are highly correlated with one another. Another
possibility is that the CDS is not entirely coding. For example, we analyzed
the entire lambda virus (approximately 49,000 bp) and found a strong peak at
the zero frequency and at the one-third frequency; however, when we focused
on any particular CDS, only the one-third frequency peak remained. We have
noticed this in other analyses of sections that contain coding and noncoding
regions (see Stoffer et al., 1993b), but this is not consistent across all of our analyses.
4. DISCUSSION
The spectral envelope, as a basic tool, appears to be suited for fast auto-
mated macro-analyses of long DNA sequences. Interactive computer programs
are currently being developed. The analyses described in this paper were per-
formed either using a cluster of C programs that compile on Unix operating
systems, or using the Gauss programming system for analyses on Windows
operating systems. We have presented some ways to adapt the technology to
the analysis of DNA sequences. These adaptations were not presented in the
original spectral envelope article (Stoffer et al., 1993a) and it is clear that there
are many possible ways to extend the original methodology for use on various
problems encountered in molecular biology. For example, we have recently
developed similar methods to help with the problem of discovering whether
two sequences share common signals in a type of local alignment and a type of
global alignment of sequences (Stoffer and Tyler, 1998). Finally, the analyses
presented here point to some inconsistencies in established gene segments and,
evidently, some additional investigation and explanation is warranted.
ACKNOWLEDGMENTS
This article is dedicated to the memory of Sid Yakowitz and his research in the
field of time series analysis; in particular, his contributions and perspectives on
fast methods for frequency detection. Part of this work was supported by a grant
from the National Science Foundation. This work benefited from discussions
with Gabriel Schachtel, University of Giessen, Germany.
REFERENCES
Bernardi, G. and G. Bernardi. (1985). Codon usage and genome composition.
Journal of Molecular Evolution, 22, 363-365.
Bina, M. (1994). Periodicity of dinucleotides in nucleosomes derived from
simian virus 40 chromatin. Journal of Molecular Biology, 235, 198-208.
Blaisdell, B.E. (1983). Choice of base at silent codon site 3 is not selectively
neutral in eucaryotic structural genes: It maintains excess short runs of weak
and strong hydrogen bonding bases. Journal of Molecular Evolution, 19,
226-236.
Buckingham, R.H. (1990). Codon context. Experientia, 46, 1126-1133.
Cornette, J.L., K.B. Cease, H. Margalit, J.L. Spouge, J.A. Berzofsky, and C.
DeLisi. (1987). Hydrophobicity scales and computational techniques for detecting
amphipathic structures in proteins. Journal of Molecular Biology,
195, 659-685.
Tavaré, S. and B.W. Giddings. (1989). Some statistical aspects of the primary
structure of nucleotide sequences. In Mathematical Methods for DNA Se-
quences, M.S. Waterman ed., pp. 117-131, Boca Raton, Florida: CRC Press.
Travers, A.A. and A. Klug. (1987). The bending of DNA in nucleosomes and
its wider implications. Philosophical Transactions of the Royal Society of
London, B, 317, 537-561.
Trifonov, E.N. (1991). DNA in profile. Trends in Biochemical Sciences, 16,
467-470.
Trifonov, E.N. and J.L. Sussman. (1980). The pitch of chromatin DNA is re-
flected in its nucleotide sequence. Proc. Natl. Acad. Sci., 77, 3816-3820.
Uberbacher, E.C., J.M. Harp, and G.J. Bunick. (1988). DNA sequence patterns
in precisely positioned nucleosomes. Journal of Biomolecular Structure and
Dynamics, 6, 105-120.
Viari, A., H. Soldano, and E. Ollivier. (1990). A scale-independent signal processing
method for sequence analysis. Computer Applications in the Biosciences, 6,
71-80.
Zhurkin, V.B. (1983) Specific alignment of nucleosomes on DNA correlates
with periodic distribution of purine-pyrimidine and pyrimidine-purine dimers.
Febs Letters, 158, 293-297.
Zhurkin, V.B. (1985). Sequence-dependent bending of DNA and phasing of
nucleosomes. Journal of Biomolecular Structure and Dynamics, 2, 785-804.
Part III
Chapter 8
An Efficient Algorithm for Saddle Point Problems
Arkadi Nemirovski
Reuven Y. Rubinstein
Faculty of Industrial Engineering and Management
Technion—Israel Institute of Technology
Haifa 32000, Israel
Abstract We show that Polyak’s (1990) stochastic approximation algorithm with averag-
ing originally developed for unconstrained minimization of a smooth strongly
convex objective function observed with noise can be naturally modified to solve
convex-concave stochastic saddle point problems. We also show that the ex-
tended algorithm, considered on general families of stochastic convex-concave
saddle point problems, possesses a rate of convergence unimprovable in order
in the minimax sense. We finally present supporting numerical results for the
proposed algorithm.
1. INTRODUCTION
We start with the classical stochastic approximation algorithm and its mod-
ification given in Polyak (1990).
where the exact gradients of the objective are replaced with their unbiased estimates. In
the notation from Example 2.1, the CSA algorithm is
It turns out that under the same assumptions as for the CSA (smooth nondegenerate
convex objective attaining its minimum at an interior point of X),
Polyak’s algorithm possesses the same asymptotically unimprovable convergence
rate as the CSA. At the same time, in Polyak’s algorithm there is no
need for “fine adjustment” of the stepsizes to the “curvature” of the objective. Moreover,
Polyak’s algorithm with properly chosen stepsizes preserves a “reasonable” (close to
O(1/√N)) rate of convergence even when the (convex) objective is nonsmooth
and/or degenerate.
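The scheme can be illustrated with a minimal sketch of our own (names, problem instance, and parameter values are illustrative, not Polyak's original formulation): run plain SA with slowly decreasing stepsizes gamma_t = gamma0 · t^(-alpha), 1/2 < alpha < 1, and return the running average of the iterates alongside the last one.

```python
import numpy as np

def polyak_sa(grad, x0, n_steps, gamma0=0.5, alpha=0.7, seed=0):
    """Polyak's SA with averaging: iterate x <- x - gamma_t * g_t,
    gamma_t = gamma0 * t**(-alpha), where g_t is a noisy unbiased
    gradient; return both the last iterate and the trajectory average."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    avg = 0.0
    for t in range(1, n_steps + 1):
        g = grad(x) + rng.normal(0.0, 1.0)   # unbiased noisy gradient
        x -= gamma0 * t ** (-alpha) * g
        avg += (x - avg) / t                 # running average of iterates
    return x, avg

# minimize f(x) = (x - 2)^2 / 2, so grad(x) = x - 2 and the minimizer is 2
last, averaged = polyak_sa(lambda x: x - 2.0, x0=10.0, n_steps=50000)
```

The averaged iterate is typically noticeably closer to the minimizer than the last iterate, and this is achieved without tuning gamma0 to the curvature of the objective.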
A somewhat different aggregation in SA algorithms was proposed earlier
by Nemirovski and Yudin (1978, 1983). For additional references on the CSA
An Efficient Algorithm for Saddle Point Problems 157
algorithm and its outlined modification, see Ermoliev (1969), Ermoliev and
Gaivoronski (1992), L’Ecuyer, Giroux, and Glynn (1994), Ljung, Pflug and
Walk (1992), Pflug (1992), Polyak (1990) and Tsypkin (1970) and references
therein.
Our goal is to extend Polyak’s algorithm from unconstrained convex minimization
to the saddle point case. We shall show that, although for the general
saddle point problems below the rate of convergence slows down from O(1/N)
to O(1/√N), the resulting stochastic approximation saddle point (SASP) algorithm,
as applied to stochastic saddle point (SSP) problems, preserves the
optimality properties of Polyak’s method.
The rest of this paper is organized as follows. In Section 2 we define the SSP
problem, present the associated SASP algorithm, and discuss its convergence
properties. We show that the SASP algorithm is a straightforward extension of
its stochastic counterpart with averaging, originally proposed by Polyak (1990)
for stochastic minimization problems as in Example 2.1 below. It turns out
that in the general case the rate of convergence of the SASP algorithm becomes
O(1/√N) instead of O(1/N), that is, instead of the convergence rate of Polyak’s algorithm.
We demonstrate in Section 3 that this slowing down is an unavoidable price for
extending the class of problems handled by the method. In Section 4 we present
numerical results for the SASP algorithm as applied to the stochastic Minimax
Steiner problem and to an on-line queuing optimization problem. Appendix
contains the proofs of the rate of convergence results for the SASP algorithm.
It is not our intention in this paper to compare the SASP algorithm with other
optimization algorithms suitable for off-line and on-line stochastic optimization,
like the stochastic counterpart method (Rubinstein and Shapiro, 1993). It is merely to
show the high potential of the SASP method and to promote it for further
applications.
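To fix ideas before the formal development, here is a minimal sketch of our own of the SASP scheme on a toy bilinear problem: projected stochastic gradient descent in x and ascent in y with O(t^(-1/2)) stepsizes, returning the averaged trajectory as the approximate saddle point. The problem instance and all names are illustrative assumptions, not the paper's notation.

```python
import numpy as np

def sasp(grad_oracle, project, z0, n_steps, gamma0=0.1, seed=0):
    """Stochastic approximation for saddle points: descend in x,
    ascend in y, project onto X x Y, and average the iterates."""
    rng = np.random.default_rng(seed)
    z = np.asarray(z0, dtype=float)
    avg = np.zeros_like(z)
    for t in range(1, n_steps + 1):
        g = grad_oracle(z, rng)                    # noisy (grad_x f, -grad_y f)
        z = project(z - gamma0 * t ** (-0.5) * g)  # O(t^{-1/2}) stepsizes
        avg += (z - avg) / t                       # averaged trajectory
    return avg

# toy convex-concave problem: f(x, y) = x * y on [-1, 1]^2, saddle at (0, 0)
def oracle(z, rng):
    x, y = z
    return np.array([y, -x]) + rng.normal(0.0, 0.1, size=2)

def project(z):
    return np.clip(z, -1.0, 1.0)

z_hat = sasp(oracle, project, z0=[0.9, -0.8], n_steps=40000)
```

On a bilinear objective, the raw gradient descent-ascent iterates circle the saddle point rather than converging to it; it is the averaging step that pulls the returned solution toward (0, 0).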
(the primal and the dual objectives, respectively), and the following pair of
optimization problems
It is well known (see, e.g., Rockafellar (1970)) that under assumption A both
problems (P) and (D) are solvable, with the optimal values equal to each other,
and the set of saddle points of on X × Y is exactly the set
being the “observation noises”. We assume that these noises are independent
random variables, identically distributed according to a Borel probability
measure P, taking values in a Polish (i.e., metric separable complete)
space
We also make the following
Assumption B. The functions on are Borel
functions taking values in respectively, such that
and
Here
An Efficient Algorithm for Saddle Point Problems 159
since
2.2. EXAMPLES
We present now several stochastic optimization problems which can be nat-
urally posed in the SSP form.
Example 2.1 Simple single-stage stochastic program. Consider the simplest
stochastic programming problem
Assume further that when solving (2.7), we cannot compute the objective directly, but
are given an iid random sample distributed according to P and know
how to compute at every given point
Under these assumptions program (2.7) can be treated as an SSP problem
with
and that the accuracy measure (2.5) in this problem is just the residual in terms
of the objective:
The variation of observations for this source can be bounded from above as
Note that in this case the accuracy measure satisfies the inequality
subject to
on the set. Note that if (2.12) – (2.13) satisfies the Slater condition,
then the Lagrange function possesses a saddle point, and the solutions to (2.12) – (2.13)
coincide with the x-components of the saddle points of the Lagrange function.
Assume that we have prior information on the problem, which enables us to
identify a compact convex set Y containing the y-component of some
saddle point of the Lagrange function. Then we can replace, in the Lagrange saddle point problem,
the nonnegative orthant with Y, thus obtaining an equivalent saddle point
problem
form a stochastic source of information for (2.14), we see that (2.12) – (2.13) can
be reduced to an SSP. The variation of observations for the associated stochastic
source of information clearly can be bounded as
These examples demonstrate that the SSP setting is a natural form of many
stochastic optimization problems.
Algorithm 2.1
where
is the projector on X × Y :
the vector is
Here denotes the smallest integer which is With this setup, (2.20)
results in
for all
Let us associate with the SASP algorithm (2.15) – (2.18), (2.23) the following
(deterministic) sequence:
where the supremum is taken over all trajectories associated with the SSP prob-
lem in question.
Theorem 2 Let the stepsizes in the SASP algorithm (2.15) – (2.18) be chosen
according to (2.23), and the remaining parameters according
to (2.21). Then under assumptions A, B, C the expected inaccuracy
of the approximate solution generated by the SASP algorithm can
be estimated, for every N > 1, as follows:
where is our current guess for the unknown quantity Since by definition
(2.4)
where and
present some a priori guesses for lower and upper bounds on and
respectively. Then the truncated version of (2.29) is
Clearly, the stepsize policy (2.27), (2.30) satisfies (2.24) – it suffices to take
and In addition,
for the truncated version we have
3. DISCUSSION
3.1. COMPARISON WITH POLYAK’S ALGORITHM
As applied to convex optimization problems (see Example 2.1), the SASP
algorithm with the setup (2.21) looks completely similar to Polyak’s algorithm
with averaging. There is, however, an important difference: the stepsizes
given by (2.21) are not quite suitable for Polyak’s method.
For the latter, the standard setup uses more slowly decreasing stepsizes, and this is
the setup for which Polyak’s method possesses its most attractive property: an
O(1/N), as opposed to O(1/√N), rate of convergence on strongly convex (i.e., smooth
and nondegenerate) objective functions. Specifically, let
be the class of all stochastic optimization problems
on a compact convex set with twice differentiable
objective satisfying the condition
and equipped with a stochastic source of information with variation of the ob-
servations not exceeding L. Note that problems from this class possess uniformly
smooth objectives. In addition, in the “well-posed
case”, the objectives are uniformly nondegenerate as well.
Polyak’s method with properly chosen stepsizes ensures that the expected error
of the N-th approximate solution does not exceed C/N, where C depends only on
the data of the problem. Under the same circumstances the stepsizes given by
(2.21) will result in a slower convergence, namely, O((ln N)/N). Thus,
in the “well-posed case” the SASP method with setup (2.21) is slower, by a
factor logarithmic in N, than the original Polyak’s method.
The situation changes dramatically when we pass from
the “well-posed” case to the “ill-posed” one. Here the SASP algorithm still ensures,
uniformly over the problems of the class, the O(1/√N) rate of convergence,
which is not the case for Polyak’s method. Indeed, consider the simplest
case when X = [0, 1] and assume that observation noises are absent.
Consider next the subfamily comprised
of the objectives
the expectation being taken over the distribution of observation noises; here
is the inaccuracy measure (2.5) associated with instance
The of on the entire family is
An Efficient Algorithm for Saddle Point Problems 169
Remark 3.1 The outlined optimality property of the SASP method means that
as far as the performance on the entire family is concerned, no
alternative solution method outperforms the SASP algorithm by more than an
absolute constant factor. This fact, of course, does not mean that it is impossible
to outperform the SASP method substantially on a given subfamily of
Applying the scheme presented in Example 2.2 to the above stochastic Min-
imax problem, we convert it into an equivalent stochastic saddle point problem,
which is readily seen to be identical with (3.4) – (3.5).
Note that the family of stochastic saddle point problems
is contained in We claim that the family is "difficult".
Indeed, denoting by the accuracy measure associated with the problem
(3.4) and taking into account (2.5), we have
4. NUMERICAL RESULTS
In this section we apply the SASP algorithm to a pair of test problems:
the stochastic Minimax Steiner problem and on-line optimization of a simple
queuing model.
being the location of the facility. The problem is to find a location for the
facility which minimizes the worst-case inconvenience of service over all towns.
Mathematically, we have the following minimax stochastic
program
We assume that the only source of information for the problem is a sample
The above Minimax problem can be naturally posed as an SSP problem (cf.
Example 1.2) with the objective
and ran 2,000 steps of the SASP algorithm, starting at the point
The results are presented in Fig. 1. We found that the relative inaccuracy
where is the expected steady state waiting time of a customer, is the cost
of a waiting customer, are parameters of the distributions of
interarrival and service times, is the cost per unit increase of and is
the transpose of Note that for most exponential families of distributions (see
e.g., Rubinstein and Shapiro (1993), Chapter 3) the expected performance
is a convex function of
To proceed with the program (4.1), consider Lindley’s recursive (sample
path) equation for the waiting time of a customer in a GI/G/1 queue (e.g.,
Kleinrock (1975), p. 277):
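Lindley's recursion, W_{n+1} = max(0, W_n + S_n - A_{n+1}) with S_n the n-th service time and A_{n+1} the next interarrival time, is easy to simulate. The sketch below estimates the mean steady-state wait in the M/M/1 special case, where it can be checked against the closed-form value lambda / (mu * (mu - lambda)); names and parameters are illustrative.

```python
import random

def simulate_mm1_wait(lam, mu, n_customers, seed=1):
    """Estimate the mean waiting time via Lindley's recursion
        W_{n+1} = max(0, W_n + S_n - A_{n+1}).
    Both service and interarrival times are exponential, i.e. M/M/1."""
    rng = random.Random(seed)
    w, total = 0.0, 0.0
    for _ in range(n_customers):
        s = rng.expovariate(mu)      # service time S_n
        a = rng.expovariate(lam)     # interarrival time A_{n+1}
        w = max(0.0, w + s - a)
        total += w
    return total / n_customers

# For M/M/1 the steady-state mean wait is lam / (mu * (mu - lam)) = 1.0 here.
w_hat = simulate_mm1_wait(lam=0.5, mu=1.0, n_customers=200_000)
```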
can be regarded as the efficiency of the SASP sequence relative to the CSASP
one. From the results of Tables 4.3 and 4.4 it follows that the efficiency is quite
significant. E.g., for the experiments presented in Table 4.3 we have
and
We applied to problem (4.1)–(4.3) the SASP algorithm with the adaptive
stepsize policy (2.27), (2.30) and used it for various single-node queuing models
with different interarrival and service time distributions. In all our experiments
we found that the SASP algorithm converges reasonably fast to the optimal
solution
5. CONCLUSIONS
We have shown that
be the scaled Euclidean distance between and Note that due to the standard properties of
the projection operator, we have
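The standard projection property presumably being invoked here is nonexpansiveness of the Euclidean projection with respect to points of the set:

```latex
% Nonexpansiveness of the Euclidean projection \Pi_X onto the convex set X:
% for every point u and every x \in X,
\[
  \| \Pi_X(u) - x \| \;\le\; \| u - x \| .
\]
```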
Setting
where
and
Summing the inequalities (0.5) over and applying the Cauchy inequal-
ity, we obtain
where
Applying next Jensen’s inequality to the convex functions and and taking into
account (2.18), we obtain that
because both and belong to X × Y . In view of these inequalities we obtain from (0.6)
The right hand side of (0.8) is independent of consequently, (0.8) majorizes the
upper bound of the left-hand side over This upper bound is equal to
and similarly
Finally,
Since, by definition,
and, consequently,
we arrive at (2.26).
Let further
It is clear that the probability for to reject the hypothesis when it is valid is exactly
the probability for to get, as a result of its work on a point with In this
case the inaccuracy of regarded as an approximate solution to is at least and since
the expected inaccuracy of on the indicated probability is at most 1/4. By similar
considerations, the probability for to reject when this hypothesis is valid is also
Thus, the integer is such that there exists a routine for distin-
guishing between the aforementioned pair of statistical hypotheses with probability of rejecting
the true hypothesis (whether it is or ) at most 1/4. By standard statistical arguments,
this is possible only if
with an appropriately chosen positive absolute constant O(1), which yields the sought lower
bound on N.
REFERENCES
Asmussen, S. and R. Y. Rubinstein. (1992). “The efficiency and heavy traffic
properties of the score function method in sensitivity analysis of queuing
models", Advances in Applied Probability, 24(1), 172–201.
Ermoliev, Yu.M. (1969). “On the method of generalized stochastic gradients and
quasi-Fejér sequences", Cybernetics, 5(2), 208–220.
Ermoliev, Y.M. and A. A. Gaivoronski. (1992). “Stochastic programming tech-
niques for optimization of discrete event systems", Annals of Operations
Research, 39, 1–41.
Kleinrock, L. (1975). Queuing Systems, Vols. I and II, Wiley, New York.
Benjamin Kedem
Department of Mathematics
University of Maryland
College Park, Maryland 20742, USA
Konstantinos Fokianos
Department of Mathematics & Statistics
University of Cyprus
P.O. Box 20537 Nicosia, 1678, Cyprus
Abstract We consider the general regression problem for binary time series where the
covariates are stochastic and time dependent and the inverse link is any differ-
entiable cumulative distribution function. This means that the popular logistic
and probit regression models are special cases. The statistical analysis is carried
out via partial likelihood estimation. Under a certain large sample assumption
on the covariates, and owing to the fact that the score process is a martingale, the
maximum partial likelihood estimator is consistent and asymptotically normal.
From this we obtain the asymptotic distribution of a certain useful goodness of
fit statistic.
1. INTRODUCTION
Consider a binary time series taking the values 0 or 1, and related
covariate or auxiliary stochastic data represented by a column vector
The binary series may be stationary or nonstationary, and
the time dependent random covariate vector process may represent one
or more time series and functions thereof that influence the evolution of the
primary series of interest The covariate vector process need not
be stationary per se; however, it is required to possess the “nice” long term
behavior described by Assumption A below. Conveniently, may contain
past values of and/or past values of an underlying process that produces
(1.2), or equivalently (1.3), is called logistic regression, and when F is the cdf of
the standard normal distribution, the model is called probit regression.
The most popular link functions for binary regression are listed in Table 9.1.
The regression model (1.2) has received much attention in the literature,
mostly under independence. See, among many more, Cox (1970), Diggle,
Liang, and Zeger (1994), Fahrmeir and Kaufmann (1987), Fahrmeir and Tutz (1994),
Fokianos and Kedem (1998), Kaufmann (1987), Keenan (1982), Slud and Ke-
dem (1994), and Zeger and Qaqish (1988).
Regression Models for Binary Time Series 187
Thus, if we define,
then for a binary time series with covariate information the partial likeli-
hood of takes on the simple product form,
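For the logistic inverse link, this product form gives the familiar Bernoulli log partial likelihood, sum_t [y_t log p_t + (1 - y_t) log(1 - p_t)] with p_t = F(beta' z_{t-1}). The sketch below illustrates it; the names y, z, and beta are hypothetical.

```python
import math

def log_partial_likelihood(y, z, beta):
    """Log partial likelihood for a binary time series with logistic
    inverse link: p_t = 1 / (1 + exp(-beta' z_t)).
    y: list of 0/1 responses; z: covariate vectors, one per time point."""
    ll = 0.0
    for y_t, z_t in zip(y, z):
        eta = sum(b * x for b, x in zip(beta, z_t))
        p = 1.0 / (1.0 + math.exp(-eta))
        ll += y_t * math.log(p) + (1 - y_t) * math.log(1.0 - p)
    return ll

# Tiny illustration with an intercept and one covariate.
y = [1, 0, 1, 1]
z = [(1.0, 0.2), (1.0, -0.5), (1.0, 1.1), (1.0, 0.7)]
ll = log_partial_likelihood(y, z, beta=(0.0, 1.0))
```

Maximizing this function over beta yields the maximum partial likelihood estimator discussed below.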
Assumption A
A1. The true parameter belongs to an open set
A2. The covariate vector almost surely lies in a nonrandom compact
subset of such that
A3. There is a probability measure v on such that is positive
definite, and such that for Borel sets
where,
Just as is the case with the regular (full) likelihood, assuming differentiability,
the score vector
where,
plays an important role in large sample theory based on partial likelihood. The
score vector process, is defined by the partial sums,
Observe that the score process, being the sum of martingale differences, is
a martingale with respect to the filtration That is,
Clearly,
Define,
and
where
It follows that,
Thus, an application of the central limit theorem for martingales gives (Fokianos
and Kedem, 1998; Slud and Kedem, 1994),
We now have
Theorem 2.1 (Fokianos and Kedem, 1998; Slud and Kedem, 1994). The MPLE
is almost surely unique for all sufficiently large N, and as
(i)
(ii)
(iii)
2.4. PREDICTION
An immediate application of Theorem 2.1 is in constructing prediction in-
tervals for from By the delta method (see Rao, 1973, p. 388), (ii)
in Theorem 2.1 implies that
where
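A delta-method interval of this type can be sketched for the logistic case: the standard error of p = F(beta' z) is approximated by F'(beta' z) times the square root of z' Cov(beta_hat) z. All names below, and the normal quantile 1.96, are illustrative.

```python
import math

def logistic_prediction_interval(beta_hat, cov, z, z_alpha=1.96):
    """Delta-method interval for p = F(beta' z), F the logistic cdf.
    cov is the estimated covariance matrix of beta_hat (list of lists)."""
    eta = sum(b * x for b, x in zip(beta_hat, z))
    p = 1.0 / (1.0 + math.exp(-eta))
    f = p * (1.0 - p)                 # logistic density F'(eta) = F(1 - F)
    d = len(z)
    quad = sum(z[i] * cov[i][j] * z[j] for i in range(d) for j in range(d))
    half = z_alpha * f * math.sqrt(quad)
    return max(0.0, p - half), min(1.0, p + half)

lo, hi = logistic_prediction_interval([0.5, -0.2], [[0.04, 0.0], [0.0, 0.09]], [1.0, 1.0])
```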
3. GOODNESS OF FIT
The (scaled) deviance
and
Put
The next result follows readily from Theorem 2.1, and several applications
of the multivariate Martingale Central Limit Theorem as given in Andersen and
Gill (1982), Appendix II. Assume that the true parameter is
Theorem 3.1 (Slud and Kedem, 1994). Consider the general regression model
(1.2) where F is a cdf with density f. Let be a partition of
Then we have as
(i)
(ii) As
is
4. LOGISTIC REGRESSION
The logistic regression model where (see (1.4)) is the most widely
used regression model for binary data. The model can be written as,
The equivalent inverse transformation of (4.18), and another way to write the
model, is the canonical link for binary data referred to as logit,
For this important special case the previous results simplify greatly as
all and Thus, the score vector has the simplified
form,
and the sample information matrix per single observation con-
verges to a special case of the limit (2.13),
and
4.1. A DEMONSTRATION
Consider a binary time series obeying a logistic autoregres-
sion model containing a deterministic periodic component,
To illustrate the asymptotic normality result (ii) in Theorem 2.1, the model
was simulated 1000 times for N = 200, 300, 1000. In each run, the partial
likelihood estimates of the were obtained by maximizing (2.6). This gives
1000 estimates from which sample means and variances were
computed.
The theoretical variances of the estimators were approximated by inverting
in (4.21). The results are summarized in Table 9.2. There is a close
agreement between the theory and the experimental results.
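A simulation of this kind can be sketched as follows. The model form, coefficients, and period below are illustrative stand-ins, since the exact design of the experiment is not reproduced here.

```python
import math
import random

def simulate_logistic_ar(beta, n, period=12, seed=2):
    """Simulate a binary series with a logistic autoregression containing a
    deterministic periodic component:
        logit p_t = b0 + b1 * Y_{t-1} + b2 * cos(2*pi*t / period)."""
    rng = random.Random(seed)
    b0, b1, b2 = beta
    y = [0]
    for t in range(1, n):
        eta = b0 + b1 * y[-1] + b2 * math.cos(2 * math.pi * t / period)
        p = 1.0 / (1.0 + math.exp(-eta))
        y.append(1 if rng.random() < p else 0)
    return y

y = simulate_logistic_ar((0.3, -0.5, 1.0), n=1000)
```

Repeating such runs and maximizing the partial likelihood in each gives the sampling distribution of the estimates compared with the theory above.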
A graphical illustration of the prediction limits (2.16) is given in Figure 9.2
where we can see that is nestled quite comfortably within the prediction
limits. Again we inverted (4.21) for the approximation
where is given in (4.18). The Q-Q plots in Figure 9.3 were obtained from
1000 independent time series (4.23) of length N = 200
and N = 400, and 1000 independent random variables.
Except for a few outliers, the approximation is quite good.
5. CATEGORICAL DATA
The previous analysis can readily be extended to categorical time series where
admits values representing categories. We only mention two types of
models to show the proximity to regression models for binary time series. For
a thorough treatment see Fokianos (1996), and Fokianos and Kedem (1998).
Generally speaking, we have to distinguish between two types of categorical
variables: nominal, where the categories are not ordered (e.g., daily choice
of dinner categorized as vegetarian, dairy, and everything else), and ordinal,
where the categories are ordered (e.g. hourly blood pressure categorized as
low, normal, and high). Interval data can be treated as ordinal.
A possible model for nominal categorical time series is the multinomial logits
model (Agresti, 1990),
Since the set of cumulative probabilities corresponds one to one to the set of the
response probabilities, estimating the former enables estimation of the latter.
Various choices for F can arise. For example, the logistic distribution gives the
so called proportional odds model. In principle any link used for binary time
series can be used here as well.
A Final Note. As noted above, this paper is an extension of Slud and Kedem
(1994). In that paper there is a data analysis example that uses National Weather
Service rainfall/runoff measurements. The data were graciously provided to us
in 1987 by Sid Yakowitz, blessed be his memory. When Slud and Kedem (1994)
finally appeared in 1994, Sid upon receiving a reprint made some encouraging
remarks. The present extension is written in his memory.
REFERENCES
Agresti, A. (1990). Categorical Data Analysis. Wiley, New York.
Andersen, P. K. and R. D. Gill. (1982). Cox’s regression model for counting
processes: A large sample study. Annals of Statistics, 10, 1100–1120.
Cox, D. R. (1970). The Analysis of Binary Data. Methuen, London.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 69–76.
Diggle, P. J., K-Y. Liang, and S. L. Zeger. (1994). Analysis of Longitudinal
Data. Oxford University Press, Oxford.
Fahrmeir, L. and H. Kaufmann. (1987). Regression models for nonstationary
categorical time series. Journal of Time Series Analysis, 8, 147–160.
Fahrmeir, L. and G. Tutz. (1994). Multivariate Statistical Modelling Based on
Generalized Linear Models. Springer, New York.
Fokianos, K. (1996). Categorical Time Series: Prediction and Control. Ph.D.
Thesis, Department of Mathematics, University of Maryland, College Park.
Fokianos, K. and B. Kedem. (1998). Prediction and classification of non-
stationary categorical time series. Journal of Multivariate Analysis, 67, 277–
296.
Harro Walk
Mathematisches Institut A
Universität Stuttgart
Pfaffenwaldring 57, D-70569
Stuttgart, Germany
1. INTRODUCTION
In this paper a self-contained treatment of some convergence problems in
nonparametric regression estimation is given. For an observable
random vector X and a not observable square integrable real random variable
Y the best estimate of a realization of Y on the basis of an observed realization
of X in a mean square sense is given by the regression function
defined by for minimizing with respect to
measurable because of
function is estimated by
(with This estimate, also for more general kernel functions K, has
been proposed by Nadaraya (1964) and Watson (1964). Set
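With the window (naive) kernel K(u) = 1{|u| <= 1}, the Nadaraya-Watson estimate reduces to a local average over the data points falling in the window. A minimal sketch, using the usual 0/0 := 0 convention:

```python
def nw_window_estimate(x_data, y_data, x, h):
    """Nadaraya-Watson estimate with the window kernel:
        m_n(x) = sum_i Y_i 1{|X_i - x| <= h} / sum_i 1{|X_i - x| <= h},
    returning 0 when the window is empty (the 0/0 := 0 convention)."""
    num = den = 0.0
    for xi, yi in zip(x_data, y_data):
        if abs(xi - x) <= h:
            num += yi
            den += 1.0
    return num / den if den > 0 else 0.0

# Noise-free illustration: regressing y = x**2 on a grid of x values.
xs = [i / 10 for i in range(-20, 21)]
ys = [x * x for x in xs]
m_hat = nw_window_estimate(xs, ys, x=1.0, h=0.15)
```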
of large In the proof, the essential step (Theorem 1) consists in the verification
of a condition in Györfi’s (1991) criterion for strong universal consistency of
a class of regression estimates. The Cesàro summability result (Theorem 3) is
proved for a general bandwidth sequence satisfying
especially as above. The results can be transferred to partitioning
estimates and, refining arguments for binomial random variables and using a
more general covering lemma due to Devroye and Krzyzak (1989), also to
Nadaraya-Watson estimates with rather general kernel; as to the convergence
results (with another argument) see Walk (1997). The only tools used in the
present paper without proof are Doob’s submartingale convergence theorem
(Loève (1977), section 32.3) and the fact that the set of continuous functions
with compact support is dense in (Dunford and Schwartz (1958), p.
298). It is well known in summability theory (see Zeller and Beekmann (1970),
section 44, or Hardy (1949), Theorem 114) that Cesàro summability (even Abel
summability) of a sequence together with a gap condition on the increments
implies its convergence. This gap condition is fulfilled for the increments of
in Theorem 2, but not for the increments of the sequence
there, thus Theorem 2 is not implied by Theorem 3, but needs
a separate proof. Section 2 contains the results (Theorems 1,2,3). Section 3
contains lemmas and proofs.
2. RESULTS
The results concern Nadaraya-Watson regression estimates with window ker-
nel according to (1) and (2). Theorem 1 is an essential tool in the proof of
Theorem 2 and is stated in this section because of its independent interest. In
contrast to the other results, Theorem 1 deals with integrable (instead of square
integrable) nonnegative real (instead of real) random variables
where, as in the other parts of the paper, are independent
copies of ( X , Y).
Theorem 1. If
at most for the indices
where for fixed D > 1, then with some
for bounded real random variables one can show the latter assertion even
for integrable real random variables But we shall content ourselves to
show the corresponding convergence result for square integrable real
formulated as Theorem 2. This theorem states strong universal consistency of
window kernel regression estimates for special sequences of bandwidths.
Theorem 2. Let satisfy
at most for the indices
where for fixed D > 1 and
Then
for
PROOF.
Now a variant of inequalities of Efron and Stein (1981), Steele (1986) and
Devroye (1991) on the variance of a function of independent random variables
will be established. Assumptions concerning symmetry of the function or iden-
tical distribution of the random variables or bounded function value differences
are avoided.
Lemma 2. Let be independent m-dimensional ran-
dom vectors where is a copy of For measurable
assume square integrable. Then
(1977), section 32.3) yields a.s. convergence of from which the assertion
follows by a.s. convergence of
The next lemma is a specialized version of a covering lemma of Devroye
and Krzyzak (1989), compare also Devroye and Wagner (1980) and Spiegelman
and Sacks (1980).
Lemma 4. There is a finite constant only depending on such that for
each and probability measure
will be used. First a criterion for strong universal consistency will be given.
Lemma 5. (Györfi (1991)) is strongly universally consistent if the
following conditions are fulfilled:
a)
a.s.
for each distribution of (X,Y) with bounded Y ,
b) there is a constant such that for each distribution of ( X , Y) with
satisfying only
and let and be the functions and when Y and are replaced
by and respectively. Then
By the Cauchy-Schwarz inequality
The following lemma is well-known from the classical proof and from Ete-
madi’s (1981) proof of Kolmogorov’s strong law of large numbers (see e.g.
Bauer (1991), §12).
Lemma 6. For identically distributed random variables with
let be the truncation of at i.e.
Almost Sure Convergence Properties of Nadaraya-Watson Regression Estimates 209
Then
In view of the third assertion, for let denote the minimal index
with Then
and we obtain
for
In the first step, it will be shown
For we have
(by exchangeability)
Noticing
we similarly obtain
Further, by
we have thus
and thus
and thus
one has a.s. from some random index on. Thus, because of (5), it
suffices to show that for each fixed
for bounded Y. This was proved for general bandwidth sequences by Devroye
and Krzyzak (1989) via exponential inequalities. For the special bandwidth
sequence here, we shall use Lemma 3. By uniform boundedness of and
apparently it’s enough to show
With and
we obtain
yields
It holds
and (14) is obtained in the general case. Now from (13) and (14) relation (10)
follows. By (10) with one has for
and thus, considering also the case relation (9). One is led to (16) by a
majorization and a minorization and then using
for We notice
(because of Lemma 6), by use of the Kronecker lemma. From (19) we obtain
because of
obtained as (7). (20) and (21) yield (17). Now assume Y is bounded. In view
of (18) it suffices to show
for each sphere S* around 0. The proof of this is reduced to the proof of
We notice
by
with some and by the Kronecker lemma. (23) together with (14) yields
(22).
The author thanks the referee and M. Kohler, whose suggestions improved
the readability of the paper.
REFERENCES
Aizerman, M.A., E. M. Braverman, and L. I. Rozonoer. (1964). The proba-
bility problem of pattern recognition learning and the method of potential
functions. Automation and Remote Control 25, 1175–1190.
Bauer, H. (1991). Wahrscheinlichkeitstheorie, 4th ed. W. de Gruyter, Berlin.
Devroye, L. (1991). Exponential inequalities in nonparametric estimation. In:
G. Roussas, Ed., Nonparametric Functional Estimation and Related Topics.
NATO ASI Ser. C, Kluwer Acad. Publ., Dordrecht, 31-44.
Devroye, L., L. Györfi, A. Krzyzak, and G. Lugosi. (1994). On the strong
universal consistency of nearest neighbor regression function estimates. Ann.
Statist. 22, 1371–1385.
Devroye, L., L. Györfi, and G. Lugosi. (1996). A Probabilistic Theory of Pattern
Recognition. Springer, New York, Berlin, Heidelberg.
Devroye, L. and A. Krzyzak. (1989). An equivalence theorem for conver-
gence of the kernel regression estimate. J. Statist. Plann. Inference 23, 71–82.
Devroye, L. and T. J. Wagner. (1980). Distribution-free consistency results
in nonparametric discrimination and regression function estimation. Ann.
Statist. 8, 231–239.
Devroye, L. and T. J. Wagner. (1980). On the convergence of kernel esti-
mators of regression functions with applications in discrimination. Z. Wahr-
scheinlichkeitstheorie Verw. Gebiete 51, 15–25.
Dunford, N. and J. Schwartz. (1958). Linear Operators, Part I. Interscience
Publ., New York.
Efron, B. and C. Stein. (1981). The jackknife estimate of variance. Ann. Statist.
9, 586–596.
László Györfi
Department of Computer Science and Information Theory
Technical University of Budapest
1521 Stoczek u. 2,
Budapest, Hungary
gyorfi@szit.bme.hu
Gábor Lugosi
Department of Economics,
Pompeu Fabra University
Ramon Trias Fargas 25-27,
08005 Barcelona, Spain,
lugosi@upf.es*
Abstract We present simple procedures for the prediction of a real valued sequence. The
algorithms are based on a combination of several simple predictors. We show
that if the sequence is a realization of a bounded stationary and ergodic random
process then the average of squared errors converges, almost surely, to that of
the optimum, given by the Bayes predictor. We offer an analogous result for the
prediction of stationary gaussian processes.
1. INTRODUCTION
One of the many themes of Sid’s research was the search for prediction and
estimation methods for time series that do not necessarily satisfy the classical
assumptions for autoregressive markovian and gaussian processes (see, e.g.,
Morvai et al., 1996; Morvai et al., 1997; Yakowitz, 1976; Yakowitz, 1979;
*The work of the second author was supported by DGES grant PB96-0300
where
is the minimal mean squared error of any prediction for the value of based
on the infinite past Note that it follows by
stationarity and the martingale convergence theorem (see, e.g., Stout, 1974)
that
This lower bound gives sense to the following definition:
Universal strategies asymptotically achieve the best possible loss for all er-
godic processes in the class. Algoet (1992) and Morvai et al. (1996) proved
Strategies for sequential prediction of stationary time series 227
that there exists a prediction strategy universal with respect to the class of
all bounded ergodic processes. However, the prediction strategies exhibited
in these papers are either very complex or have an unreasonably slow rate of
convergence even for well-behaved processes.
The purpose of this paper is to introduce several simple prediction strategies
which, apart from having the above mentioned universal property of Algoet
(1992) and Morvai et al. (1996), promise much improved performance for
“nice” processes. The algorithms build on a methodology worked out in recent
years for prediction of individual sequences, see Vovk (1990), Feder et al.
(1992), Littlestone and Warmuth (1994), Cesa-Bianchi et al. (1997), Kivinen
and Warmuth (1999), Singer and Feder (1999), and Merhav and Feder (1998)
for a survey.
An approach similar to the one of this paper was adopted by Györfi et al.
(1999), where prediction of stationary binary sequences was addressed. There
we introduced a simple randomized predictor which predicts asymptotically as
well as the optimal predictor for all binary ergodic processes. The present setup
and results differ in several important points from those of Györfi et al. (1999).
On the one hand, special properties of the squared loss function considered here
allow us to avoid randomization of the predictor, and to define a significantly
simpler prediction scheme. On the other hand, possible unboundedness of a
real-valued process requires special care, which we demonstrate on the example
of gaussian processes. We refer to Nobel (2000), Singer and Feder (1999),
Singer and Feder (2000), Yang (1999), and Yang (2000) for recent closely related
work.
In Section 2 we introduce a universal strategy for bounded ergodic processes
which is based on a combination of partitioning estimates. In Section 3, still
for bounded processes, we consider, as an alternative, a prediction strategy
based on combining generalized linear estimates. In Section 4 we replace the
boundedness assumption by assuming that the sequence to predict is an ergodic
gaussian process, and show how the techniques of Section 3 may be modified
to take care of the difficulties originating in the unboundedness of the process.
The results of the paper are given in an autoregressive framework, that is, the
value is to be predicted based on past observations of the same process.
We may also consider the more general situation when is predicted based on
and where is an process such that
is a jointly stationary and ergodic process. The prediction problem is similar
to the one defined above with the exception that the sequence of is also
available to the predictor. One may think about the as side information.
Formally, now a prediction strategy is a sequence of functions
With some abuse of notation, for any and we write for the
sequence Fix positive integers and for each
string of positive integers, define the partitioning regression function estimate
in the past. Then it predicts according to the average of the outcomes following
the string.
The proposed prediction algorithm proceeds as follows: let be a prob-
ability distribution on the set of all pairs of positive integers such that for
all Put and define the weights
Then the prediction scheme defined above is universal with respect to the
class of all ergodic processes such that
One of the main ingredients of the proof is the following lemma, whose proof
is a straightforward extension of standard arguments in the prediction theory
of individual sequences, see, for example, Kivinen and Warmuth (1999), and
Singer and Feder (2000).
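The weight update behind this kind of combination is the exponentially weighted average forecaster from the individual-sequence literature cited above. The sketch below uses squared loss, a uniform prior, and an illustrative learning rate eta; the constants need not match those of the chapter.

```python
import math

def aggregate_predict(expert_preds, outcomes, eta=0.5, q=None):
    """Exponentially weighted average forecaster for squared loss.
    expert_preds[k][t] is expert k's prediction at time t; the combined
    prediction is the weighted average with weights proportional to
    q_k * exp(-eta * cumulative_loss_k)."""
    K, T = len(expert_preds), len(outcomes)
    q = q or [1.0 / K] * K
    losses = [0.0] * K
    combined = []
    for t in range(T):
        w = [q[k] * math.exp(-eta * losses[k]) for k in range(K)]
        s = sum(w)
        combined.append(sum(w[k] * expert_preds[k][t] for k in range(K)) / s)
        for k in range(K):
            losses[k] += (expert_preds[k][t] - outcomes[t]) ** 2
    return combined

# Two experts: one always predicts 0, one always 1; the truth is all 1s,
# so the weights concentrate on the second expert.
preds = [[0.0] * 50, [1.0] * 50]
targets = [1.0] * 50
p = aggregate_predict(preds, targets)
```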
with and
Here − ln 0 is treated as +∞.
Note that
so that
and therefore
Now by Lemma 1,
Proof. By Theorem 1,
then
where I denotes the indicator function. Algoet (1994) showed that for any
guessing strategy and stationary ergodic binary process
almost surely, where
Thus, any predictor with the universal property established in Theorem 1 may
be converted, in a natural way, into a universal guessing scheme. An alternative
proof of the same fact is given by Nobel (2000).
such that the coefficients are calculated based on the past observations
Before defining the coefficients, note that one is tempted to define the
if and the all-zero vector otherwise. However, even though the minimum
always exists, the minimizer is not unique in general, and therefore not well-
defined. Instead, we define the coefficients by a standard recursive procedure
as follows (see, e.g., Tsypkin, 1971, Györfi, 1984, Singer and Feder, 2000).
Introduce
and
(Note that since is symmetric and positive semidefinite, its eigenvalues are
all real and nonnegative.) Let be the integer for which if
and if Express the vector as
and define
(It is easy to see by stationarity that the value of the vector is independent of
It is shown by Györfi (1984) that
and moreover
Then
On the other hand, by the denseness assumption of the theorem, for any fixed
Finally, by Lemma 1,
as before, but the bound is increased with Also, we need to modify the way
of combining these elementary predictors.
The proposed predictor is based on a convex combination of linear predictors
of different orders. For each introduce
where
with
Thus, we divide the time instances into intervals of exponentially increasing
length and, after initializing the predictor at the beginning of such an interval,
we use a different way of combining the elementary predictors in each such
segment. The reason for this is that to be able to combine elementary predictors
as in Lemma 1, we need to make sure that the predictor as well as the outcome
to predict is appropriately bounded. In our case this can be achieved based on
Lemma 3 below which implies that with very large probability, the maximum
of identically distributed normal random variables is at most of the order of
At a key point the proof uses the following well-known properties of gaussian
random variables:
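The well-known properties invoked here presumably include the standard Gaussian tail bound and its consequence for maxima; for completeness, a hedged reconstruction:

```latex
% Standard Gaussian tail bound: for X ~ N(0, \sigma^2) and x > 0,
\[
  \mathbb{P}\{|X| > x\} \;\le\; 2\, e^{-x^2/(2\sigma^2)},
\]
% hence, by a union bound, for identically distributed N(0, \sigma^2)
% variables X_1, \dots, X_n and any \delta > 1,
\[
  \mathbb{P}\Bigl\{ \max_{1 \le i \le n} |X_i| > \sigma \sqrt{2(1+\delta)\ln n} \Bigr\}
  \;\le\; 2\, n^{-\delta},
\]
% whose right-hand side is summable in n, so the Borel--Cantelli lemma applies.
```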
This implies, by the Borel-Cantelli lemma, that with probability one there exists
a finite index such that for all Also,
there exists a finite index T such that for all
In other words,
This proves the second statement of the theorem. To prove the claimed univer-
sality property, it suffices to show that for all ergodic gaussian processes,
since
Proof. We proceed exactly as in the proof of Corollary 1. The only thing that
needs a bit more care is checking the conditions of Kolmogorov’s strong law
for sums of martingale differences, since in the gaussian case the corresponding
martingale differences are not bounded. By the Cauchy-Schwarz inequality,
On the other hand, Gerencsér (1994) showed under some mixing conditions for
processes that there exists a predictor such that
Further rate-of-convergence results under more general conditions for the pro-
cess were established by Gerencsér (1992). Another general branch of bounds
can be found in Goldenshluger and Zeevi (1999). Consider the repre-
sentation of
and for
Thus, for the processes investigated by Goldenshluger and Zeevi, the predictor
of Theorem 3 achieves the rate of convergence
REFERENCES
Algoet, P. (1992). Universal schemes for prediction, gambling, and portfolio
selection. Annals of Probability, 20:901–941.
Algoet, P. (1994). The strong law of large numbers for sequential decisions
under uncertainty. IEEE Transactions on Information Theory, 40:609–634.
Bailey, D. H. (1976). Sequential schemes for classifying and predicting ergodic
processes. PhD thesis, Stanford University.
Breiman, L. (1957). The individual ergodic theorem of information theory.
Annals of Mathematical Statistics, 28:809–811. Correction (1960): Annals of
Mathematical Statistics, 31:809–810.
Cesa-Bianchi, N., Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and
M.K. Warmuth. (1997). How to use expert advice. Journal of the ACM,
44(3):427–485.
Chow, Y.S. (1965). Local convergence of martingales and the law of large
numbers. Annals of Mathematical Statistics, 36:552–558.
Devroye, L., L. Györfi, and G. Lugosi. (1996). A Probabilistic Theory of Pattern
Recognition. Springer-Verlag, New York.
Carl Chiarella
School of Finance and Economics
University of Technology
Sydney
P.O. Box 123, Broadway, NSW 2007
Australia
carl.chiarella@uts.edu.au
Ferenc Szidarovszky
Department of Systems and Industrial Engineering
University of Arizona
Tucson, Arizona, 85721-0020, USA
szidar@sie.Arizona.edu
Abstract The dynamic behavior of the output in nonlinear oligopolies is examined when the
equilibrium is locally unstable. Continuously distributed time lags are assumed in
obtaining information about rivals’ output as well as in obtaining or implementing
information about the firms’ own output. The Hopf bifurcation theorem is used
to find conditions under which limit cycle motion is born. In addition to the
classical Cournot model, labor managed and rent seeking oligopolies are also
investigated.
1. INTRODUCTION
During the last three decades many researchers have investigated the stability
of dynamic oligopolies in both continuous and discrete time scales. A comprehensive summary of the key results is given in Okuguchi (1976), and their multiproduct extensions are presented in Okuguchi and Szidarovszky (1999).
where is the inverse demand function, and is the cost function of firm. In the case of labor managed oligopolies,
where is the inverse production function and is the cost unrelated to labor. In the case of rent-seeking games,
where is the cost function of agent. The first term represents the probability of winning the rent. If unit rent is assumed, then is the expected profit of agent.
Let be an equilibrium of the game, and let
Assume that in a neighborhood of this equilibrium the best response is unique
for each player, and the best response functions are differentiable. Let
denote the best response of player for a given
own output. We consider the situation in which each firm experiences a time
lag in obtaining information about the rivals’ output as well as a time lag
in management receiving or implementing information about its own output.
Russell et al. (1986) used the adjustment process
and
Here we assume that T > 0, and is a nonnegative integer. Notice that this
weighting function has the following properties:
(a) for the weights are exponentially declining, with the most weight given to the most recent output;
(b) for zero weight is assigned to the most recent output, rising to a maximum at and declining exponentially thereafter;
(c) the area under the weighting function is unity for all T and
(d) as increases, the function becomes more peaked around For sufficiently large values of the function may, for all practical purposes, be regarded as very close to the Dirac delta function centered at
(e) as the function tends to the Dirac delta function.
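The explicit weighting function is lost in this copy; a standard kernel in the continuously-distributed-lag literature, consistent with properties (a)-(e), is the gamma-type kernel w(s) = (1/T)e^(-s/T) for m = 0 and w(s) = (m/T)^(m+1) s^m e^(-ms/T)/m! for m >= 1. The following short numerical check of properties (b)-(d), with the illustrative value T = 2, assumes this form:

```python
import numpy as np
from math import factorial

def weight(s, T, m):
    """Gamma-type lag kernel: exponentially declining for m = 0 (property a),
    single-peaked at s = T for m >= 1 (property b), unit mass (property c)."""
    if m == 0:
        return np.exp(-s / T) / T
    return (m / T) ** (m + 1) * s ** m * np.exp(-m * s / T) / factorial(m)

T = 2.0
s = np.linspace(0.0, 60.0, 600001)
ds = s[1] - s[0]
areas = {m: float(np.sum(weight(s, T, m)) * ds) for m in (0, 1, 5, 20)}
peaks = {m: float(s[np.argmax(weight(s, T, m))]) for m in (1, 5, 20)}
print(areas)   # all close to 1 (property c)
print(peaks)   # all at s = T = 2 (properties b and d)
```

As m grows, the kernel concentrates around s = T, consistent with property (e).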
The Birth of Limit Cycles in Nonlinear Oligopolies 253
so that the model with discrete time lags is approached when is chosen
sufficiently large.
Property (e) implies that for small values of T,
in which case we would recover the case of no information lags usually considered in the literature. Substituting equation (3.4) into (3.2) and (3.3), and the
resulting expressions for and into equation (3.1), a system of nonlinear
integro-differential equations is obtained around the equilibrium. In order to
analyze the local dynamic behavior of the system we consider the linearized
system. Letting and denote the deviations of and from their equilibrium levels, the linearized system can then be formulated as follows:
where
with
and
with
Notice that in the most general case the left hand side is a polynomial in
of degree
The determinant (4.6) can be expanded by using a special idea used earlier
by Okuguchi and Szidarovszky (1999). Introduce the notation
where we have used the simple fact that for any vectors and
This relation can be easily proved by using finite induction on the dimension
of and A value is an eigenvalue if either
and
and
we have
and
Equating the real and imaginary parts to zero leads to the following:
and
Notice that the left hand side has an odd degree, and no terms of even degree are present in its polynomial form. Therefore zero is always a root. However,
it is difficult to find simple conditions that guarantee the existence of nonzero
real roots.
Differentiating equation (4.10) with respect to we obtain
Therefore
and so
Assuming that the numerator is nonzero, the Hopf bifurcation theorem (see, for example, Guckenheimer and Holmes (1983)) implies that there is a limit cycle for in the neighborhood of
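The characteristic equations themselves are elided in this copy; as an illustrative stand-in (my own construction, not the chapter's system (4.10)), the linear chain trick with the m = 1 kernel reduces one symmetric mode of such an adjustment process to the cubic (s + k)(s + 1/T)^2 - k*phi/T^2 = 0, where k is the adjustment speed, T the lag, and phi the relevant eigenvalue of the best-response Jacobian used as the bifurcation parameter. Its Routh-Hurwitz boundary is exactly where a pure imaginary pair, and hence a Hopf bifurcation, appears:

```python
import numpy as np

def coeffs(phi, k=1.0, T=1.0):
    # expand (s + k)(s + 1/T)^2 - k*phi/T^2 into s^3 + a2 s^2 + a1 s + a0
    a2 = k + 2.0 / T
    a1 = 1.0 / T ** 2 + 2.0 * k / T
    a0 = k * (1.0 - phi) / T ** 2
    return [1.0, a2, a1, a0]

def critical_phi(k=1.0, T=1.0):
    # Routh-Hurwitz boundary a2 * a1 = a0: the pure imaginary pair sits here
    _, a2, a1, _ = coeffs(0.0, k, T)
    return 1.0 - a2 * a1 * T ** 2 / k

phi_c = critical_phi()                      # -8.0 for k = T = 1
margins = {}
for label, phi in (("stable", phi_c + 1.0),
                   ("hopf", phi_c),
                   ("unstable", phi_c - 1.0)):
    margins[label] = np.roots(coeffs(phi)).real.max()
print(phi_c, margins)   # max real part: negative, ~0, positive respectively
```

Crossing phi_c from above, the leading pair of roots moves from the left half-plane across the imaginary axis, which is the transversality picture the Hopf theorem requires.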
Consider next equation (4.9). It can be rewritten as
with
Separating the real and imaginary parts yields the following two equations:
and
In this case we have more freedom than in the previous case, since now we
have bifurcation parameters. Assume that with some values of
these equations have a common nonzero real solution
Differentiating equation (4.18) we have
Notice that this equation is very similar to (4.10); therefore the same idea will be used to examine pure complex roots. Introduce the polynomials
and
Equating again the real and imaginary parts to zero we have the following
two equations:
and
Therefore
Assuming that the numerator is nonzero, the Hopf bifurcation theorem implies the existence of a limit cycle for in the neighborhood of
As a special case assume first that the firm’s own information lag S is much smaller than T, the information lag about the rival firms. Thus we select S = 0, and equation (5.1) reduces to the following:
Thus from the second equality in equation (5.12) we see that the combination
of parameter values for which pure complex roots are possible is given by the
relation
Thus the Hopf bifurcation theorem implies the existence of a limit cycle for
in the neighborhood of
and
which is negative for The sign coincides with that given in (5.13).
Considering equation (6.3) together with the relation (5.13), we see that the
parameters and T can be selected so that there is a limit cycle for all
Consider next the classical Cournot model with the same price function as
before but assume that the cost function of firm is a nonlinear function
Assuming interior best response, we find that for all firms
which holds if Thus this version of the classical Cournot model can yield limit cycles with any number of firms.
Assume next that the price function is linear, and the cost
functions are nonlinear. Then the profit of firm can be obtained as
holds.
Differentiating equation (6.9) with respect to shows that
with all parameters being positive. Assume again that the best response is interior in a neighborhood of the equilibrium; then an easy calculation shows that
Notice that the right hand side is strictly decreasing in and a straightforward calculation shows that
Notice that and a comparison to (5.13) shows that limit cycles may be born for The cases of nonlinear and can be examined similarly to the classical Cournot oligopolies; therefore the details are not discussed here.
Consider finally the case of rent-seeking oligopolies (2.4). Notice that they are mathematically identical to classical Cournot oligopolies with the selection of Therefore our conclusions for the classical model apply here.
7. CONCLUSIONS
A dynamic model with continuously distributed lags in obtaining information about rivals’ output, as well as in the firms obtaining or implementing information about their own outputs, was examined. Classical bifurcation theory was applied to the governing integro-differential equations. Time lags can also be modeled by differential-difference equations; however, in this case one has to deal with an infinite spectrum, which makes the use of bifurcation theory analytically intractable. In addition, fixed time lags are not realistic in real economic situations.
We have derived the characteristic equation in the general case, and therefore
the existence of pure complex roots can be analyzed by using standard numerical
techniques. The derivatives of the best response functions were selected as the bifurcation parameters. The derivatives of the pure complex roots with respect to the bifurcation parameters were given in closed form, which makes the application of the Hopf bifurcation theorem straightforward.
The classical Cournot model, labor-managed oligopolies, and rent-seeking
games were examined as special cases. If identical firms are present with linear cost functions and a hyperbolic price function, then under a special adjustment process limit cycles are guaranteed for a sufficiently large number of firms. However, with nonlinear cost functions, limit cycles may be born with an arbitrary number of firms. If the price as well as the costs are linear, no limit cycle is guaranteed; however, if the costs are nonlinear, then limit cycles can be born with arbitrary values of Similar conclusions have been reached for labor-managed oligopolies. Rent-seeking games are mathematically equivalent to the classical Cournot model with a hyperbolic price function; therefore the same conclusions hold as those presented earlier.
REFERENCES
Arnold, V.I. (1978). Ordinary Differential Equations. MIT Press, Cambridge,
MA.
Carr, J. (1981). Applications of Center Manifold Theory. Springer-Verlag, New
York.
Chiarella, C. and A. Khomin. (1996). An Analysis of the Complex Dynamic
Behavior of Nonlinear Oligopoly Models with Time Lags. Chaos, Solitons & Fractals, Vol. 7, No. 12, pp. 2049-2065.
Cox, J.C. and M. Walker. (1998). Learning to Play Cournot Duopoly Strategies.
J. of Economic Behavior and Organization, Vol. 36, pp. 141-161.
Cushing, J.M. (1977). Integro-differential Equations and Delay Models in Pop-
ulation Dynamics. Springer-Verlag, Berlin/Heidelberg/New York.
Guckenheimer, J. and P. Holmes. (1983). Nonlinear Oscillations, Dynamical
Systems and Bifurcations of Vector Fields. Springer-Verlag, New York.
A. Haurie
University of Geneva
Geneva Switzerland
F. Moresino
Cambridge University
United Kingdom
Abstract This paper deals with a problem of uncertainty management in corporate finance.
It represents, in a continuous time setting, the strategic interaction between a
firm owner and a lender when a debt contract has been negotiated to finance a
risky project. The paper takes its inspiration from a model by Anderson and
Sundaresan (1996) where a simplifying assumption on the information structure
was used. This model is a good example of the possible contribution of stochastic
games to modern finance theory. In our development we consider the two possible
approaches for the valuation of risky projects: (i) the discounted expected net
present value when the firm and the debt are not traded on a financial market, (ii)
the equivalent risk neutral valuation when the equity and the debt are considered
as derivatives traded on a spanning market. The Nash equilibrium solution is
characterized qualitatively.
1. INTRODUCTION
In Anderson and Sundaresan (1996) an interesting dynamic game model of
debt contract has been proposed and used to explain some observed discrep-
ancies on the yield spread of risky debts. The model is cast in a discrete time
setting, with a simplifying assumption on the information structure allowing
for a relatively easy sequential formulation of the equilibrium conditions as a
sequence of Stackelberg solutions where the firm owner is the leader and the
lender is the follower.
where
W is a standard Wiener process,
is the instantaneous growth rate,
is the instantaneous variance.
This state could represent, for example, the price of the output from the project.
The firm expects a stream of cash flows defined as a function of the
state of the project. Therefore, if the firm has a discount rate the equity of the unlevered firm, that is, the debt-free firm, when evaluated as the net present value of expected cash flows, is given by
A Differential Game of Debt Contract Valuation 271
Using a standard technique of stochastic calculus, one can characterize the function as the solution of the following differential equation
The boundary condition (1.4) comes from the fact that a project with zero value
will remain with a zero value and thus will generate no cash flow. An interesting case is the one where since (1.4) now rewrites as
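The closed-form value itself is elided above; under the illustrative assumption that the cash flow is f(V) = V with discount rate rho exceeding the growth rate mu, the net present value of the unlevered firm is V0/(rho - mu). A small Monte Carlo sketch (all parameter values hypothetical) can confirm this numerically:

```python
import numpy as np

# Geometric Brownian state dV = mu*V dt + sigma*V dW, cash flow f(V) = V:
# the discounted expected cash flow should equal V0 / (rho - mu).
rng = np.random.default_rng(0)
V0, mu, sigma, rho = 1.0, 0.03, 0.20, 0.10
dt, horizon, n_paths = 0.05, 200.0, 20_000
steps = int(horizon / dt)

V = np.full(n_paths, V0)
pv = np.zeros(n_paths)
disc = 1.0
for _ in range(steps):
    pv += disc * V * dt                          # accumulate discounted cash flow
    z = rng.standard_normal(n_paths)
    V *= np.exp((mu - 0.5 * sigma ** 2) * dt + sigma * np.sqrt(dt) * z)
    disc *= np.exp(-rho * dt)

value_mc = pv.mean()
print(value_mc, V0 / (rho - mu))   # both close to 14.29
```

The horizon is long enough that the truncated tail is negligible; the simulation is only a sanity check on the linear special case, not the chapter's general boundary-value problem.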
This function is illustrated in Figure 1.1. The strategic structure of the problem
is due to the fact that the firm controls its payments and may decide not to abide by the contract at some time. At time let the variable give the state of the debt service, which is the cumulative payments made by the firm up to time This state variable evolves according to the following differential equation
The strategic structure of the problem is also due to the fact that the lender is
authorized to take control of the firm when the owner is late in his payments,
i.e. at any time where is the set
This assumes that the lender will find another firm ready to buy the project at
equity value and K is a constant liquidation cost. For the example of
debt contract given above, the debt balance is
3. A STOCHASTIC GAME
Let us assume that the risk involved in developing the project, or in financing
it through the agreed debt contract cannot be spanned by assets traded on a
market. So we represent the strategic interaction between two individuals,
the firm owner, who strives to maximize the expected present value of the net
cash flow, using a discount rate and a lender who maximizes the present
value of the debt, using a discount rate The state of the system is the pair
An admissible strategy for the firm owner
is given by a measurable function1 A strategy
for the lender is a stopping time defined by where
B is a Borel set Associated with a strategy pair we
define the payoffs of the two players
Remark 1 It is in the treatment of the debt service that our model differs signif-
icantly from Anderson and Sundaresan (1996). The creditor does not forget the
late payments. The creditor can also take control at maturity, if the condition
holds. The firm is allowed to “overpay” and it is thus possible
that be positive at maturity.2 It is then normal that the firm gets back this amount at time T as indicated in (1.13).
for the agents. In the next section we shall use another approach which is
valid when the equity and debt can be considered as derivatives obtained from
assets that are traded or spanned by an efficient market. We will see that the
real option or equivalent risk neutral valuation method eliminates the need to
identify these parameters when defining an equilibrium solution.
To simplify, assume that the firm’s project has a value V that is perfectly correlated with the asset which is traded. We assume that this asset pays no dividend, so its entire return is from capital gains. Then evolves according to
Here the drift rate should be equal to the expected rate of return from holding an asset with these risk characteristics, according to the CAPM theory (Duffie, 1992)
4
where is the risk-free rate, is the market price of risk, and is the correlation of with the market portfolio. One also calls the risk-adjusted expected rate of return that investors would require to own the project.
Keeping in mind that the equity pays a dividend and the portfolio pays
a dividend the strategy is self-financing if7
which leads to
We can now write the stochastic equation satisfied by the portfolio value
One observes that, as intended, the portfolio replicates the risk of V. Due to our sixth assumption, must have the same return as V; otherwise there would be arbitrage opportunities. Moreover, since at time zero, the value of
A similar equation could be derived for the other derivative (debt) value D
This formulation is interesting, since the model then reduces to a single-controller framework. The firm’s owner controls the trajectory but runs the risk
of having a terminal value drawn to 0. Since the payments are upper bounded
by the cash flow value it might be optimal for the borrower to
anticipate the payments, for some configurations of We shall not pursue
that development, but rather consider the more realistic case where the lender
can liquidate the firm when it is late in its payments.
where
and
and denotes the expected value w.r.t. the probability measure induced by the strategies (Here is the stochastic process induced by the feedback law.) This is the usual change of measure occurring in Black-Scholes evaluations.
The optimal strategy for the firm is “bang-bang” and is defined by the switching manifold
The lender takes control as soon as the following conditions are satisfied
The game simplifies considerably. The only relevant state variable is the value
of the firm V(t) at t. The firm’s strategy is now
pay the debt service at otherwise don’t pay
anything.
7. CONCLUSION
The design of the “best” debt contract would be the combination of design parameters that maximizes the Nash-equilibrium value of equity E(0; (0,V)) while the Nash-equilibrium debt value D(0; (0, V)) is at least equal to the needed amount C. This is a very complicated problem that can only be addressed by direct search methods in which the Nash equilibria are computed for a variety of design parameter values. In this paper we have concentrated on the evaluation of the equity and debt values when, given some contract terms, the firm owner and the lender act strategically and play a dynamic Nash equilibrium. The interesting aspect of this model, which was already present in Anderson and Sundaresan (1996), is the use of the equivalent risk-neutral, or real option, valuation technique in a stochastic game. The contribution of this paper has been to extend the model to a continuous time setting and to consider a hereditary effect of past deviations from the contracted debt service. We feel that this formulation is relatively general and that it contributes to the extension of game theory to Black-Scholes economics.
NOTES
1. In all generality the control of the firm owner could be described as a process which is adapted to
For our purpose the definition as a feedback law will suffice.
2. Notice that the overpayment is a form of investment by the firm. We could allow it to invest a part
of the cash flows in another type of asset; this would be more realistic from a financial point of view but it
would further complicate the model.
3. Note that we have assumed here that the agents are optimizing discounted cash flows, not the utility
of them. If risk aversion has to be incorporated in the model then the equilibrium characterization will be
harder.
4. It is defined by where and are the expected return and standard deviation of the
market portfolio respectively.
5. This can be verified by constructing a portfolio where all dividends are immediately reinvested in the project. Such a portfolio is a replication of and must therefore follow the same dynamics as
6. Notice that the usual way to obtain the Black-Scholes equation, would have been to construct a
self-financing portfolio composed of the risk-free asset B and the underlying asset V in such proportions
that it is a replication of the derivative (either E or D). Then, according to the last two assumptions, we know that this portfolio has to give the same return as the underlying asset. However in our case the unlevered
firm is not traded; so we proceed in a symmetric way and construct a self-financing portfolio composed of
the risk-free asset B and the derivative E that will replicate V.
7. Such a dynamic adaptation of the portfolio composition is feasible, as is measurable with respect
to the filtration generated by W(t).
8. One uses the notation arbitrarily small.
9. One uses the notation arbitrarily small.
REFERENCES
Anderson R.W. and S. Sundaresan. (1996). Design and Valuation of Debt Con-
tracts, The Review of Financial Studies, Vol. 9, pp. 37-68.
Black F. and M. Scholes. (1973). The Pricing of Options and Corporate Lia-
bilities, The Journal of Political Economy, Vol. 81, pp. 637-654.
Dixit, A.K. and R.S. Pindyck. (1993). Investment under Uncertainty, Princeton University Press.
Duffie, D. (1992). Dynamic Asset Pricing Theory, Princeton University Press.
Chapter 14
David Porter
College of Arts and Sciences
George Mason University
Abstract Pioneering projects are systems that are designed and operated with new, virtually
untested, technologies. Thus, decisions concerning the capacity of the initial
project, its expansion over time and its operations are made with uncertainty.
This is particularly true for NASA’s earth orbiting Space Station. A model is
constructed that describes the input-output structure of the Space Station in which cost and performance are uncertain. It is shown that when there is performance uncertainty, the optimal pricing policy is not to price at expected marginal cost.
The optimal capacity decisions require the use of contingent contracts prior to
construction to determine the optimal expansion path.
1. INTRODUCTION
A pioneering project is defined as a system in which both the reliability and
actual operations have considerable uncertainties (see Merrow et al. (1981)).
Managing such projects is a daunting task. Management decision making in the face of these sizable uncertainties is the focus of this paper. In particular, the
decisions on the initial system capacity and design flexibility for future growth
of the project will be addressed. The typical management process for pioneering
projects is to design to a wish list of requirements from those who will be using
the project’s resources during its operations (this is sometimes referred to by
engineers as user “desirements”). The main management policy of the designers
is to hold margins of resources for each subsystem to mitigate the uncertainties
that may arise in development and operations. The amount of reserve to be
held is not well defined, but it is sometimes referred to as an insurance pool.
However, unlike an insurance pool, the historical result for pioneering projects has been universal “bad luck” by all subsystems, which depletes the reserves (see Merrow et al. (1981) and Wessen and Porter (1998)). Rarely are incentive
systems used to obtain information on system reliability and the cost of various
designs.1 Instead of focusing on these types of management policies, we want
to determine the optimal planning, organizational and incentive systems for
managing pioneering projects.
The application that will be used throughout will be NASA’s Space Station.
The Space Station is an integrated earth orbiting system. The Station is to
supply resources for the operation of scientific, commercial and technology
based payloads. The structure of the Station design is an interrelated system
of inputs and outputs. For example, the design of the power subsystem has a profound effect on the design of the propulsion subsystem, since power will
be provided via photovoltaics. The solar array subsystem creates drag on the
Station which will require reboosts from the propulsion subsystem. Fox and
Quirk (1987) developed an input-output model of the Station’s subsystems in
which the coefficients of the input-output matrix are random. This model has
been analyzed for the case with uniform distributions over the coefficients and
cost parameters, with a safety-first constraint on the net output of the Station.
Using semi-variances, they determine the distribution of costs over the Station’s
net outputs. Using some test data from engineering subsystems, the model has
been exercised for the uniform Leontief system (see Quirk et al. (1989)). While
this attempt at modelling the interaction of the subsystems and the uncertainty in
cost and performance has shown that the Station is very sensitive to performance
uncertainties and that the errors propagate, there is very little in the way of
policies to help guide the design process.
One of the main features of the operation of the Station is that information
about subsystem performance and cost accrues over time (see Furniss (2000)).
Thus, design and allocation decisions need to take into account the future effects
on resource availability. For the Station, a capacity for each subsystem must
be selected at “time zero” along with the design parameters of each subsystem.
Next, users of the Station must design their individual payloads and be scheduled
within the capacity constraints of the system. After payloads are manifested,
the Station operations parameters must be determined. After the Station starts
operating, a decision on how to grow the system must be made. The timing of
decisions can be found in Diagram 1 below. An important aspect of the analysis
to follow is that new information will become available that was previously
unknown during the decision timeline.
Huge Capacity Planning and Resource Pricing for Pioneering Projects 287
The decision variables include the vector of initial subsystem design capaci-
ties X, the vector of initial design parameters v, the vector of planned resources
u that will be used by payloads, the realization of resources q that are available
to payloads for operations and the capacity expansion, redesign and operations
after there is experience with the capabilities of the Station. Not
pictured in Diagram 1 is the fact that actual Station operating capabilities are realized between time and It is at that time that the uncertainty is resolved and becomes known to all parties.
Within the confines of the decision structure defined above, the following
questions are addressed:
Given the uncertainty over cost and performance, what should be the
crucial considerations in the initial capacity and design of the Station?
2. THE MODEL
The Space Station can be viewed as a multiproduct “input-output” system in which decisions are related across time. In addition, there is uncertainty as
to the cost and performance of the system and this uncertainty is resolved
over time. The Station is a series of subsystems (e.g. logistics, environmental
control, electric power, propulsion, etc.) which require inputs from each other to
operate. With a project as new and complex as the Station, the actual operations
are uncertain. Specifically, the performance and cost of each subsystem are uncertain. Let denote the gross capacity of subsystem i and the design
parameters of subsystem i. Let denote the vector of resources of other
subsystems utilized by subsystem i. Let denote the random variable associated
with the cost and performance of each subsystem. The input-output structure
of the Station is given by:
Let i = 1,...,n so that there is a total of n subsystems that constitute the Station. If the Station subsystems were related by a fixed linear coefficient technology, (1) could be represented by Y = AX, where Y would be an n × 1 matrix representing the internal use of subsystem outputs, A would be the n × n matrix of input-output coefficients whose entries are random variables, and X is the n × 1 matrix of subsystem capacities. We consider the more general nonlinear structure for our model.
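Since the coefficients themselves are elided here, a hypothetical 3-subsystem instance (none of these numbers are Station data) can illustrate the fixed-coefficient special case Y = AX with random coefficients and a safety-first check in the spirit of Fox and Quirk (1987):

```python
import numpy as np

# Net output q = X - A X, with multiplicative uniform uncertainty on the
# input-output coefficients A. All values below are made up for illustration.
rng = np.random.default_rng(1)
X = np.array([100.0, 80.0, 60.0])        # gross subsystem capacities
A_mid = np.array([[0.00, 0.10, 0.05],    # e.g. power, propulsion, logistics
                  [0.15, 0.00, 0.10],
                  [0.05, 0.05, 0.00]])

draws = A_mid * rng.uniform(0.5, 1.5, size=(10_000, 3, 3))
net = X - draws @ X                       # net output left for payloads
req = np.array([80.0, 50.0, 45.0])        # required net output levels
p_safe = (net >= req).all(axis=1).mean()  # safety-first probability
print(net.mean(axis=0))                   # approx [89, 59, 51]
print(p_safe)
```

Even modest coefficient uncertainty propagates through the whole system: a single subsystem with a thin margin drives the joint safety probability below one, which is the sensitivity the text attributes to the Quirk et al. (1989) exercise.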
Costs are assumed to be separable across subsystems so that total cost can
be represented as
This definition captures the idea that the larger (or more complex) the project, the larger will be the benefits of learning to operate the system. If it is costly
to dismantle or redesign existing capacity then the project will have what is
commonly referred to as capital irreversibilities.
To find the system optimum, we compare benefits and costs. The benefit
side of the model is one of the most difficult to estimate in practice. Clearly,
however, the payload designer is the individual in the best position to determine
these benefits. The difficulty in obtaining this information is that there is an incentive to overstate the benefits, since NASA does not base its pricing system on this information; it uses it only to aid its subsystem design decisions.
Nonetheless, we model the benefit side of the model using a per period dollar
benefit function for each payload j. Specifically, let denote payload
j’s monetary payoff from consuming units of subsystem capacities, where
The present value of benefits at time t=0, starting at is
given by:
where r is “the” discount rate. The present value of benefits between time t=0 and is given by:
Turning to the cost side of the objective function, the present value of costs
at time t=0 if the state is is:
The present value of system costs at time for state is given by:
In order to maximize net benefits, we must choose initial capacities and designs, a growth path, and an allocation of subsystem capacities to maximize:
subject to equations (4), (6), and (7), where is the expectation operator over This is a dynamic programming problem, so we must solve the problem at first, taking the t=0 decisions as given, and then proceed to solve the and t=0 problems. For the most part we are interested in the comparative static properties of the model and the pricing solutions. For the solution to the problem and its associated comparative statics, we will use a form of the envelope theorem for dynamic programming. Recall that the envelope theorem gives:
This theorem provides an easy way to find the marginal effects of the parameters of the model. Extending this result to the dynamic programming case is straightforward. We define the maximization problem at as:
Differentiating we obtain:
Thus, at t=0 we can safely ignore the “indirect” effects of the parameters
on the solution at We now use this fact to obtain some results from
this model.
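The envelope argument above can be checked numerically on a toy objective (my own example, not the paper's model): the total derivative of the value function with respect to a parameter equals the direct partial effect evaluated at the maximizer, with the indirect effect through the optimizer vanishing.

```python
import numpy as np

def f(x, theta):
    return -(x - 2.0) ** 2 + theta * x      # concave in x for every theta

def V(theta, xs=np.linspace(-10, 10, 200001)):
    return f(xs, theta).max()               # value function via grid search

theta, h = 1.5, 1e-4
dV = (V(theta + h) - V(theta - h)) / (2 * h)   # total derivative of V
x_star = 2.0 + theta / 2.0                     # argmax of f(., theta)
partial = x_star                               # df/dtheta at x* is just x
print(dV, partial)                             # both approx 2.75
```

The finite-difference slope of V matches the direct partial at the optimum, which is exactly why the "indirect" effects can be ignored at t=0 in the dynamic programming argument.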
3. RESULTS
We will present a series of results starting with the easiest case and steadily
add complexity to the decision making model.
Subject to where
The necessary conditions
for a maximum are:
where is the Lagrange multiplier for the ith constraint. Rewriting (22) and (23) in matrix notation we find:
Moving to the problem, is known and X,v have been selected; the only
issue remaining is allocating the available resources among payloads. Since
future benefits and costs are not affected by the allocation we need only
maximize benefits given revealed supply subject to any commitments made at
through the contracts Formally,
The prices for net outputs q at defined in the equations above, will be
functions of i.e. contingent contracts. If we let denote the prices
that solve the equations above, this means that contracts signed at t=0 must be
contingent on For practical purposes, such a complete set of contingent
contracts is infeasible. However, a less restrictive set of contracts can be
designed that can increase efficiency. Let
and where denotes the sample
space for
Suppose priority classes are not used to create contracts, but instead, when is realized, a Vickrey auction (see Vickrey (1961)) is conducted wherein each payload j submits a two-tuple and the auctioneer computes winners as follows:
The Vickrey auction has a dominant strategy under which each bidder reveals its true value. However, since each bidder does not know the value of
or the values of other participants prior to developing their payloads they must
calculate the probability that resources will be available and that they have
one of the winning bids. In particular, payload j must select based on the
probability that their value will be greater than other bidders and that they fit
within the resource constraints (35). Since the only way to increase one’s
probability of being selected is to fit within the capacity constraints, enters
into the decision calculus through the functions B, C and the probability of
fitting. That is, if j selects a larger then for a smaller choice of we
can obtain the same net benefits but increase the probability of fitting (see Figure 1.1 above). Thus, there is an incentive to design less resource-intensive payloads, which are inefficient payload designs relative to the use of priority contracts.7
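A minimal sketch of this allocation rule (bids and the capacity are hypothetical, and the brute-force search is only practical for a handful of payloads): the winner set maximizes reported value subject to the realized capacity, and each winner pays the externality it imposes on the others, which is what makes truthful bidding dominant.

```python
from itertools import combinations

def best_value(bids, capacity, exclude=frozenset()):
    """Highest total reported value among feasible payload subsets."""
    ids = [j for j in range(len(bids)) if j not in exclude]
    best, best_set = 0.0, ()
    for r in range(len(ids) + 1):
        for combo in combinations(ids, r):
            if sum(bids[j][1] for j in combo) <= capacity:
                v = sum(bids[j][0] for j in combo)
                if v > best:
                    best, best_set = v, combo
    return best, best_set

def vcg(bids, capacity):
    total, winners = best_value(bids, capacity)
    payments = {}
    for j in winners:
        others_without_j, _ = best_value(bids, capacity, exclude={j})
        others_with_j = total - bids[j][0]
        payments[j] = others_without_j - others_with_j   # externality on rivals
    return winners, payments

# three payloads: (reported value, resource request), realized capacity 10
bids = [(8.0, 6), (5.0, 4), (4.0, 4)]
winners, pay = vcg(bids, capacity=10)
print(winners, pay)   # payloads 0 and 1 win; each pays 4.0
```

Payments depend only on the other bidders' reports, so misreporting one's own value cannot lower one's payment, but it can only change whether one wins.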
Using our envelope theorem from Section 2, (37) can be written as:
where the second term is evaluated at and the last terms are
evaluated at Using equation (22) we find that
4. CONCLUSION
We can now address the questions listed at the beginning of this paper. Con-
cerning the initial size of the project, if the technology is such that additions to
capacity or design changes are costly, then a smaller, more flexible system
should be built and operated until more information concerning performance
is obtained. If the technology is such that irreversibilities are small and
marginal productivity increases “dramatically” as operators learn how the tech-
nology works, then a larger, more capable system should be built initially. These
flexibility issues also need to be part of the pricing regime to suppress the ap-
petite of payloads to demand more resources.
For the second question that was posed, there is a better alternative to
what NASA currently uses to price resources. In particular, it is clear that the
use of priority contracts is essential and prices should not be solely based on
incremental costs. Project management should devote more effort to obtaining
accurate information concerning cost and performance distributions and should
provide more flexible contract procedures to assist users in contingent planning.
In addition, contract regimes should be put in place to aid operators in the
rationing of resources when there are performance shortfalls. The market for
manifesting payloads and the scheduling of resources should be interactive and
should be done years before actual operations. In this way price information
can guide payload design and Station growth paths.
NOTES
1. One notable exception is the management policy used on the Jet Propulsion Laboratory’s Cassini
Mission to Saturn (see Wessen and Porter (1997)).
2. The Space Shuttle went through this process in determining the price to charge private companies
for the use of the Shuttle bay to launch satellites and middeck lockers for R&D payloads. The initial pricing
policy was to charge for the maximum percentage of Shuttle mass or volume capacity used by a payload
times the expected long-run marginal cost of a Shuttle launch, or short-run marginal cost if demand faltered,
see Waldrop (1982).
3. In general, so that benefits are in terms of contract specifications based on the
realization of We investigate the structure of such contracts in Section 3.1, but for now we only consider
use of subsystem capacities.
4. It is assumed that and are continuously differentiable.
5. This type of contract has been investigated by Harris and Raviv (1981) as a way for a monopolist to
segment the market. Chao and Wilson (1987) examine this form of contracting to price reliability in an
electric network.
6. Resources expended to design payloads are among the most expensive elements in payload develop-
ment. Payloads can be designed to be more autonomous so as to use less astronaut time, but this may increase
the use of on-board power and data storage. These trade-offs are extremely important but are usually made
in a haphazard manner (see Polk (1998)).
7. This phenomenon has been observed on the Space Shuttle due to the reduced number of flight opportunities
for science payloads. The Shuttle program has created a Get-Away-Special (GAS) container for payloads
that do not require man-tending or special environmental controls. The number of payloads of
the GAS variety has increased 10-fold since the beginning of the Shuttle program.
8. Noussair and Porter (1992) have developed an auction process to allocate a specific form of priority
contract. The experiments they conduct show that their auction results in highly efficient allocations and
outperforms proportional rationing schemes.
9. A somewhat similar result can be found in Yildizoglu (1994).
REFERENCES
Abel, A. and J. Eberly. (June 1998). “The Mix and Scale of Factors with Irre-
versibility and Fixed Costs of Investment,” Carnegie-Rochester Conference
Series on Public Policy 48:101-135.
Chao, H. and R. Wilson. (December 1987). “Priority Service: Pricing, Invest-
ment and Market Organization,” American Economic Review 77:899-916.
Fox, G. and J. Quirk. (October 1985). “Uncertainty and Input-Output Analysis,”
JPL Economics Research Series 23.
Furniss, T. (July 2000). “International Space Station,” Spaceflight 42:267-289.
Harris, M. and A. Raviv. (June 1981). “A Theory of Monopoly Pricing Schemes
with Demand Uncertainty,” American Economic Review 71:347-365.
Merrow, E., K. Philips and C. Myers. (September 1981). “Understanding Cost
Growth and Performance Shortfalls in Pioneering Process Plants,” Rand Cor-
poration Report R-2569-DOE.
300 MODELING UNCERTAINTY
James A. Reneke
Matthew J. Saltzman
Margaret M. Wiecek
Dept. of Mathematical Sciences
Clemson University
Clemson SC 29634-0975
1. INTRODUCTION
In this paper we address the problem of upgrading a complex system where
there are multiple levels of decision making, the effects of the choices interact,
and the choices are accompanied by uncertainty. We propose a framework for
choosing, from among a number of alternatives, a set of affordable upgrades
to a system that exhibits these characteristics. We present the methodology in
terms of an illustrative example which highlights its application to each of these
areas of difficulty.
The system of interest is decomposed into multiple levels, each consisting of a
network describing the interactions of the components at that level. The process
constructs conceptual models working from top to bottom, and evaluates the
impact of proposed upgrades using computational models from bottom to top.
In general, as one proceeds down the levels of decision making, alternatives
depend on more independent variables representing exogenous influences and
hence uncertainty.
The upgraded enterprise operates within a range of environments represented
by the exogenous variables and so the best set of choices might not be optimal for
any particular environment. Understanding the tradeoffs leads to better choices.
Because of the interaction of choices for component upgrades, the focus for the
decision maker must be on overall enterprise performance. Major upgrades of
some components may have little impact on overall performance while minor
upgrades on other components could have a large impact. Finally, upgrade
options must be evaluated in terms of their performance in noisy environments
and the contribution of component uncertainty to enterprise uncertainty.
The upgrade problem for complex systems is related to the system design
problem. Hazelrigg (2000) critiques classical systems engineering methodolo-
gies. He proposes properties that a design methodology should satisfy, includ-
ing: independence of the engineering discipline, inclusion of uncertainty and
risk, consistency and rationality in comparing alternative solutions and inde-
pendence of the order in which the solutions are considered, capability of rank
ordering of candidate solutions. The method should not impose preferences
on the decision maker nor constraints on the decision-making process, should
have positive association with information (the more the better) and be derivable
from axioms.
Papalambros (2000) and Papalambros and Michelena (2000) survey the ap-
plication of optimization methods to general systems design problems. In ad-
dition to identifying optimal structural and control configurations of physical
artifacts, optimization methods can be applied to decomposition of systems
into subsystems and integration of optimal subsystem designs into the final
overall design. Rogers (1999) proposes a system for decomposition of design
projects based on the concept of a design structure matrix (DSM). Rogers and
Salas (1999) describe a Web-based design management tool that uses the DSM.
Models and methodologies for upgrade decisions have also been studied
by mathematical and management scientists, and by operations researchers.
Equipment upgrade and replacement along with equipment capacity expan-
sion to meet demand growth is studied by Rajagopalan et al. (1998) and Ra-
jagopalan (1998). The former presents a model for making acquisition and
upgrade decisions to meet future demand growth and develops a stochastic dy-
namic programming algorithm while the latter unifies capacity expansion and
equipment replacement within a deterministic integer programming model.
Replacement and repair decisions are studied by Makis et al. (2000) with
the objective to find the repair/replacement policy minimizing the long-run
expected average cost of a system. The problem is formulated as a continuous
time decision problem and the results are based on the theory of jump processes.
Carrillo and Gaimon (2000) introduce an optimal control model to couple
the improvement of system performance with decisions about the selection
and timing of process change alternatives and related knowledge creation and
accumulation.
Majety et al. (1999) describe a system-as-network model for reliability al-
location in system design. In this model, the methods are associated with
reliability measures as well as costs. Cost is the objective to be minimized and
a certain level of overall system reliability is to be achieved. Here the objective
is linear and the constraints are potentially nonlinear, depending on the structure
of the network.
Luman (1997; 2000) describes an analysis of upgrades to a complex, multi-
level weapons system (a “system of systems”). The methodology treats cost as
the independent variable, allowing the decision maker to analyze the tradeoff
between cost and overall performance. Luman addresses the need to under-
stand the relationship between overall performance and performance measures
of major subsystems. The upgrade problem is modeled as a complex nonlinear
optimization problem. Closed form approximations can adequately represent
some systems, but for greater complexity, simulation-based optimization meth-
ods may be required. The methodology is illustrated on a mine countermeasures
system involving subsystems for reconnaissance and neutralization.
note, simulations are obtained as matrix multiplications and the methods return
basic statistics exactly.
(see Figure 15.1). Each level below the master level might have
multiple decision makers. Conceptual modeling proceeds from the top down.
At the master level, the overall system (the enterprise) is modeled as a single
task. At every level but the lowest, a task can be decomposed into
subtasks assigned to lower level decision makers who do not interact. The lowest
level is reached when tasks are not decomposed further. Each decision maker,
in decomposing a task, develops a conceptual model of interactions among
subtasks (see Figure 15.2). Tasks are viewed as networks whose nodes represent
subtasks to be accomplished with arcs representing performance functions.
Also for each task, the subtasks are accomplished in an environment modeled
by a set of exogenous variables so the performance functions may depend
on several independent variables. The network structure indicates how the
performance of each subtask influences and is influenced by other subtasks.
Thus a node (a subtask) acts on performance functions represented by inbound
arcs to produce a performance function associated with outbound arcs.
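This node behavior can be sketched as a small data structure. The `Task` class and the example transforms below are purely illustrative assumptions, not the chapter's actual models:

```python
class Task:
    """A node in the conceptual model: it acts on the performance
    functions carried by inbound arcs to produce its own output."""
    def __init__(self, name, transform):
        self.name = name
        self.transform = transform  # (inbound outputs, exogenous vars) -> value
        self.inbound = []           # upstream Task objects (the inbound arcs)

    def output(self, env):
        # Evaluate upstream performance at the exogenous variables `env`,
        # then apply this task's own transform.
        upstream = tuple(t.output(env) for t in self.inbound)
        return self.transform(upstream, env)

# A two-node chain: a materials task feeds a fabrication task.
materials = Task("M", lambda ups, env: env[0] + env[1])
fabrication = Task("F", lambda ups, env: 0.8 * ups[0])
fabrication.inbound = [materials]
```

Evaluating `fabrication.output((1.0, 2.0))` propagates the exogenous variables through the network. Note that a feedback network like the one used later in the chapter would need a fixed-point treatment rather than this simple recursion.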
Computational models are developed and decisions made from the bottom
up. During the decision making phase, methods for accomplishing the tasks are
associated with the nodes. Starting at the lowest level, decision makers develop
alternative feasible methods for their tasks, develop computational models for
these methods, and pass methods and models up to the next higher level. Meth-
ods for a task are feasible provided they satisfy the cost constraint assigned to
that task. Computational models at the next higher level are constructed from
the sub-method models using the previously developed conceptual models. Se-
lections are made from the alternative feasible methods and these methods and
models are passed up to the next higher level. At the master level, computational
models of feasible methods for the system task are constructed and a preferred
method is chosen from among the alternatives. The overall goal is to find a
preferred method satisfying the enterprise cost constraint, i.e., an affordable
selection of upgrades.
Methods are modeled by means of linear operators relating the input perfor-
mance of a task to the output performance. The performance functions may
be vector or scalar valued and are defined on the space of exogenous variables
characteristic for the task. By restricting models of methods to linear opera-
tors, computational models at higher levels can be obtained by “plugging” the
alternatives from below into the conceptual models.
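On a finite grid of exogenous variables, a linear operator is just a matrix acting on the discretized performance function, so composing subtask operators reduces to a matrix product. The matrices below are placeholders chosen only to show the mechanics of "plugging in" alternatives; they are not the chapter's operators.

```python
import numpy as np

n = 4                                        # grid points (flattened)

# Two candidate methods per subtask, each modeled as a linear operator.
M_methods = [np.eye(n), 2.0 * np.eye(n)]     # first subtask
F_methods = [np.eye(n), 0.5 * np.eye(n)]     # downstream subtask

def compose(M, F):
    # Because methods are linear, the higher-level computational model is
    # obtained by plugging chosen operators into the conceptual model:
    # here, F applied after M.
    return F @ M

x = np.ones(n)                               # discretized input performance
y = compose(M_methods[1], F_methods[1]) @ x  # select method 2 for each subtask
```

Swapping in a different alternative from below only changes which matrix enters the product; nothing upstream or downstream needs to be re-derived.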
Models for methods (linear operators) are based on stochastic linearizations
of the underlying components. The theory and methodology of stochastic lin-
earization are discussed elsewhere (Reneke, 1997; 2001). The convergence
theory is well developed for the one variable (signal) case (Reneke et al., 1987;
Reneke, 1998). The theory for the multiple variable (response surface) case is
emerging but is complete for significant special cases. Stochastic linearization,
Note that the two environmental variables reflect the importance assigned to
the materials component. We could have decided that the distribution compo-
nent was the most important and chosen different environmental variables.
The system to this point can be visualized as in Figure 15.4 where the arrows
indicate relations between components. For instance, the performance of
the fabrication component depends on the performance of the materials
component and influences the performance of the distribution component.
The performance functions require engineering insight and may be based
on detailed investigation of the individual components. For this simplified
presentation we assume that models the availability of raw material in the
market, models material purchased and stored and available for fabrication,
models product available for distribution, models customer demand, and
is the enterprise performance function, measuring unmet consumer demand.
Assuming might look like Figure 15.6. (Note in this
figure and subsequent figures the surfaces are computed at grid points
The independent axes are labeled by the grid indices.) For
increasing and supply of material increases and prices fall. If there is no
Affordable Upgrades of Complex Systems 311
component performance are equal. Thus the diagram does not model flows of
material (there is no conservation law) but rather the interrelationships among
components.
The model given above describes an existing system. In the upgrade prob-
lem we assume that each subtask can be accomplished using one of several
methods. The overall goal is to optimize enterprise performance by choosing
one method (linear operator) for each of the subtasks. Note that the choices
interact, complicating the decision process.
to the input function The results are displayed in Figure 15.7. Of interest is
how the operators (methods) transform an input function. The response of M
to increases rapidly for and small but increasing. The response is flatter
for and large. Similar observations could be made for F and D.
Combining the component models in the feedback network, Figure 15.5, we
obtain the enterprise performance, Figure 15.8.
The decision stage is performed only for the master-level task and yields
a final preferred response surface and the corresponding selection of methods
which determine a preferred selection of upgrades for the system.
In this section, we first formulate an optimization problem suitable for the
optimization of a task and then discuss how a preferred selection of upgrades
could be found. The presented approaches and methods are then illustrated on
the bi-level example.
target value at each grid point. The surface S* might not be a response surface
for the system. A surface associated with a selection is said
to dominate a surface associated with a selection if at each
grid point the value of is at least as close to the reference value as
that of and the value of is closer than that of at at least one grid point.
Thus, nondominated surfaces are those for which every other surface associated
with a feasible selection of methods yields an inferior value at one or more grid
points. The method selections associated with each nondominated surface are
called efficient.
In other words, a feasible surface associated with a selection
is said to be nondominated if there is no other feasible surface such that
for all and
for at least one
With respect to this model, generation of the nondominated surfaces related
to the multiple grid optimization problem leads to solving the following multiple
objective program
The preferred surface is the one that achieves the minimum. Note that the
definition of the norm can be extended to treat over-performance and under-
performance asymmetrically if appropriate.
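The dominance screen defined above can be sketched directly: each candidate surface is compared to the reference surface grid point by grid point. The function names and toy surfaces below are illustrative.

```python
import numpy as np

def dominates(s1, s2, s_ref):
    """True if surface s1 dominates s2: s1 is at least as close to the
    reference at every grid point and strictly closer at some point."""
    d1, d2 = np.abs(s1 - s_ref), np.abs(s2 - s_ref)
    return bool(np.all(d1 <= d2) and np.any(d1 < d2))

def nondominated(surfaces, s_ref):
    """Keys of the surfaces that no other feasible surface dominates;
    the associated method selections are the efficient ones."""
    return [k for k in surfaces
            if not any(dominates(surfaces[j], surfaces[k], s_ref)
                       for j in surfaces if j != k)]
```

Here `surfaces` would be keyed by method selections such as "111" or "312", and the returned keys are the efficient selections.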
mentation increasing from method 1 to method 3. Recall that the models are
to serve the decision maker. A detailed model of any of the suggested meth-
ods might be very complex with many variables and parameters. Further, any
reasonable physical model would almost certainly be nonlinear. However, the
decision maker is only concerned with how a method transforms an input per-
formance function into an output performance function. Even these simplified
models must be restricted to linear transformations so the decision maker can
account for interactions among choices.
Similar comments can be made about the other two components, F and D.
So for each component we assume three efficient choices of methods in as-
cending order of cost. The enterprise performance function for the lowest cost
choices for M, F, and D is given in Figure 15.9. Among the twenty-seven
choices for methods, assumed to be feasible for the master-level task, seven
selections produce nondominated enterprise performance functions. The effi-
cient selections are {111, 311, 312, 321, 322, 331, 332} where the selection
means we have chosen the method for M, the method for F, and the
method for D.
Recall that every nondominated surface is closer to the reference than any other
feasible surface at one or more grid points. We might attempt comparing the nondominated surfaces on
subregions of the set of exogenous variables in order to pick out preferred
solutions. In Figures 15.10 and 15.11, we produce graphs of the diagonals of
the nondominated enterprise functions.
To illustrate the selection of a preferred surface at the master level, we cal-
culate the weighted norm of the difference between each nondominated surface
and the reference surface which passes through performance level 0 at each grid
point. Based on the shape of the input performance function in the example,
we partition the domain of exogenous variables into three diagonal bands and
assign probability 1/2 to the central band and probability 1/4 to the bands on
each side. The decision maker assigns importance 1 to each grid point. The
norm is The cost of the methods associated with a surface is taken to
be the sum of the digits in the triple identifying the surface. Thus, we have the
data in Table 15.1. The norms and costs are plotted in Figure 15.12.
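This scoring step can be sketched as follows. Since the chapter's exact norm expression is not reproduced here, a weighted l2 distance is assumed; the digit-sum cost rule is the one stated in the text.

```python
import numpy as np

def weighted_norm(surface, ref, weights):
    """Assumed weighted l2 distance from the reference surface over the
    grid; `weights` combines the band probabilities (1/2 central, 1/4 on
    each side) with the per-grid-point importances."""
    return float(np.sqrt(np.sum(weights * (surface - ref) ** 2)))

def method_cost(selection):
    """Cost of a selection such as "312": the sum of its digits."""
    return sum(int(c) for c in selection)
```

Ranking each nondominated surface by the pair (norm, cost) reproduces the two-criteria comparison plotted in Figure 15.12.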
It can be seen that the surface with the most desirable norm is 312. When
norm and cost are treated as two criteria to be minimized simultaneously, the
nondominated surfaces with respect to these criteria are 111, 311, and 312. The
decision maker must select from among these the final surface and associated
methods.
4. STOCHASTIC ANALYSIS
In practice, systems must be considered under uncertainty. The first source
of uncertainty is the environment in which the system operates, modeled in
this paper by the exogenous variables. Another source of uncertainty comes
from system inputs. In the example we can consider either or both of the
system inputs, i.e., and as random. Figure 15.13 illustrates the case of
random customer demand and Figure 15.14 shows the corresponding enterprise
performance. We always assume that the random input functions have been
modeled leading to a stochastic decision problem which we can discuss in
terms of risk.
The nature of the process for generating candidate methods in the previous
section precludes the use of models with random inputs. Therefore we envision
resolution of the stochastic decision problem in two steps.
2 Analyze the risk for the preferred choices using random inputs.
If the risk is unacceptable then iterate after eliminating the component choices
which contribute the most to enterprise risk. This pruning process will be aided
by considering the obtained from simulations of the complete model.
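The risk-analysis step can be sketched as a plain Monte Carlo evaluation of a preferred selection under random inputs; the model, the sampler, and the statistics reported below are placeholders, not the chapter's procedure.

```python
import numpy as np

def enterprise_risk(model, input_sampler, n_samples, rng):
    """Simulate enterprise performance under random inputs and summarize
    the spread, from which acceptability of the risk can be judged."""
    outcomes = np.array([model(input_sampler(rng)) for _ in range(n_samples)])
    return {"mean": float(outcomes.mean()), "std": float(outcomes.std())}
```

For example, feeding in a linear enterprise model of a random customer-demand input yields the mean performance and its dispersion, the quantities one would inspect before pruning high-risk component choices.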
5. CONCLUSIONS
This paper presents a novel system-science approach to complex systems
decision making. A system to be upgraded is modeled as a multilevel network
whose nodes represent tasks to be accomplished and whose arcs describe the
interactions of component performance measures. Since a task can be carried
out by a collection of methods, a decision maker is interested in choosing a
best method for each task so that the system performance is optimized. Thus
the well-known concept of a network—typically used to model physics-based
subsystems, components of the system, and flows between them—now becomes
a carrier of higher-level performance analysis conducted on tasks the system
is supposed to accomplish and on available methods. This performance-based
rather than physics-based framework is independent of engineering discipline
and results in other desirable properties highlighted in the paper.
At each decision level, the system performance depends upon exogenous
variables modeling the uncertainty related to the environment in which the
system may operate. This source of uncertainty is accounted for in the de-
cision process through the use of multicriteria optimization. Uncertainty in
the selection chosen for implementation associated with random inputs (i.e.,
performance functions) is assessed using a stochastic analysis technique. The
methodology has been examined for the case of two exogenous variables and
is under development for the more general case. Also required for application
are component models, understanding of relations between components, and
mean input surfaces. Finally, for stochastic inputs the random part of the input
surface is limited to the standard Wiener surface.
The value of our approach consists of
Basic hypothesis. The central idea in this subsection is to produce a condition on the
covariance kernel of a random field (function of two variables) which for fields F generated
by linear systems, i.e., of the form F = AW, is sufficient to produce a representation of A.
(Here W is the standard Wiener field, see below.) If F is not generated by a linear system then
the method produces a linearization of the unknown nonlinear operator Â, namely A. In this
case the utility of the linearization depends upon the particular application and how nearly the
(perhaps uncheckable) condition holds.
The condition is the following: for each pair of points and in [0, U] × [0, V]
where cov(·) is the covariance operator. (See Vanmarcke (1988), p. 82.) Note that the condition
is on the observation field F rather than on the underlying unmodeled system.
In general, the independence expressed in the condition might not hold. However, such
independence is implicit in the common engineering practice of exploring a given physical
system by allowing only one quantity to vary at a time.
The Standard Wiener field. The standard Wiener field on [0, U] × [0, V] has the
following defining properties:
sample fields are continuous
if then
where E[·] is the expectation operator. Simulation of the standard Wiener process proceeds
as follows. If then has mean zero and covariance
i.e., is indistinguishable from Suppose, in addition, that
Then has mean zero and covariance i.e., is indistinguishable from
Hence, TW has the mean and covariance of F and so T is the stochastic linearization of the
system generating F.
Construction of linear models in the example. The recipe yields, for the com-
ponent operator M, the representation and where
and
If the modeler has observations and available then and can
be estimated from the data. This scenario might correspond to the case where an implementation
of M exists and is available for experimentation.
If the modeler does not have observations to use in estimating and then they might be
based on engineering experience or intuition. Except for scaling, the shapes of and will
likely come from a class of simple shapes. The shapes chosen in the example are representative
of the possibilities for and Some small amount of data might be useful in establishing
the appropriate scales for the operators.
Note that in the example the domains were assumed equal ([0, U] = [0, V] = [0, 5]).
There is no requirement that this be true.
Similar comments hold for the construction of the component operators F and D.
REFERENCES
Burman, M., S. B. Gershwin and C. Suyematsu. (1998). “Hewlett-Packard uses
operations research to improve the design of a printer production line,” Inter-
faces 28, 24–36.
Carrillo, J.E. and Ch. Gaimon. (2000). “Improving manufacturing performance
through process change and knowledge creation,” Management Science 46,
265–288.
Cover, A., J. Reneke, S. Lenhart, and V. Protopopescu. (1996). “RKH space
methods for low level monitoring and control of nonlinear systems,” Mathematical
Models and Methods in Applied Sciences 6, 77–96.
Hart, D.T. and E. D. Cook. (1995). “Upgrade versus replacement: a practical
guide to decision-making,” IEEE Transactions on Industry Applications 31,
1136–1139.
Hazelrigg, G.A. (2000). “Theoretical foundations of systems engineering,” pre-
sented at INFORMS National Meeting, San Antonio.
Korman, R.S., D. Capitanio and A. Puccio. (1996). “Upgrading a bulk chemical
distribution system to meet changing demands,” MICRO 14, 37–41.
Luman, R.R. (1997). “Quantitative decision support for upgrading complex sys-
tems of systems,” D.Sc. thesis, The George Washington University, Wash-
ington, DC.
Luman, R.R. (2000). “Upgrading complex systems of systems: a CAIV method-
ology for warfare area requirements allocation,” Military Operations Re-
search 5, 53–75.
Makis, V., X. Jiang and K. Cheng. (2000). “Optimal preventive replacement
under minimal repair and random repair cost,” Mathematics of Operations
Research 25, 141–156.
Majety, S.R.V., M. Dawande, and J. Rajgopal. (1999). “Optimal reliability al-
location with discrete cost-reliability data for components,” Operations Re-
search 47, 899–906.
McIntyre, M.G. and J. Meitz. (1994). “Applying yield impact models as a first
pass in upgrade decisions,” Proceedings of the IEEE/SEMI 1994 Advanced
Semiconductor Manufacturing Conference and Workshop, Cambridge, MA,
November 1994, 147–149.
Olson, D.L. (1996). Decision Aids for Selection Problems, Springer, New York.
Yan, P., M.-Ch. Zhou and R. Caudill. (2000). “Life cycle engineering approach
to FMS development,” Proceedings of IEEE International Conference on
Robotics and Automation ICRA 2000, San Francisco, CA, April 2000, 395–
400.
Chapter 16
Fei-Yue Wang
Department of Systems and Industrial Engineering
University of Arizona
Tucson, Arizona 85721
George N. Saridis
Department of Electrical, Computer and Systems Engineering
Rensselaer Polytechnic Institute
Troy, New York 12180
Abstract An approximation theory of optimal control for nonlinear stochastic dynamic sys-
tems has been established. Based on the generalized Hamilton-Jacobi-Bellman
equation of the cost function of nonlinear stochastic systems, general iterative
procedures for approximating the optimal control are developed by successively
improving the performance of a feedback control law until a satisfactory sub-
optimal solution is achieved. A successive design scheme using upper and lower
bounds of the exact cost function has been developed for the infinite-time stochas-
tic regulator problem. The determination of the upper and lower bounds requires
the solution of a partial differential inequality instead of equality. Therefore it
provides a degree of flexibility in the design method over the exact design method.
Stability of the infinite-time sub-optimal control problem was established under
not very restrictive conditions, and stable sequences of controllers can be gen-
erated. Several examples are used to illustrate the application of the proposed
approximation theory to stochastic control. It has been shown that in the case of
linear quadratic Gaussian problems, the approximation theory leads to the exact
solution of optimal control.
1. INTRODUCTION
The problem of controlling a stochastic dynamic system, such that its be-
havior is optimal with respect to a performance cost, has received considerable
attention over the past two decades. From a theoretical as well as practical
point of view, it is desirable to obtain a feedback solution to the optimal control
problem. In situations of linear stochastic systems with additive white Gaus-
sian noise and quadratic performance indices (so-called LQG problems), the
separation theorem is directly applicable, and the optimal control theory is well
established (Aoki, 1967; Wonham, 1970; Kwakernaak and Sivan, 1972; Sage
and White, 1977).
However, due to difficulties associated with the mathematics of stochastic
processes, only fragmentary results are available for the design of optimal con-
trol of nonlinear stochastic systems. On the other hand, there is need to design
optimal and suboptimal controls for practical implementation in engineering
applications (Panossian, 1988).
The objective of this paper is to develop an approximation theory that may
be used to find feasible, practical solutions to the optimal control of nonlinear
stochastic systems. To this end, the problem of stochastic control is addressed
from an inverse point of view:
Given an arbitrary selected admissible feedback control, it is desirable to compare
it to other feedback controls, with respect to a given performance cost, and to
successively improve its design to converge to the optimal.
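For the LQ special case mentioned in the abstract, this successive-improvement viewpoint has a concrete classical instance, sketched below as an assumed illustration (Kleinman-style policy iteration, not the paper's general stochastic procedure): starting from any stabilizing feedback, each sweep evaluates the cost of the current law and then improves the gain, converging to the optimal (Riccati) solution.

```python
import numpy as np

def lyap(A_cl, Q_k):
    """Solve A_cl.T @ P + P @ A_cl = -Q_k by vectorization (small systems)."""
    n = A_cl.shape[0]
    M = np.kron(np.eye(n), A_cl.T) + np.kron(A_cl.T, np.eye(n))
    p = np.linalg.solve(M, -Q_k.flatten(order="F"))
    return p.reshape((n, n), order="F")

def successive_improvement(A, B, Q, R, K, iters=25):
    """Evaluate the cost of the feedback law u = -K x, then improve the
    gain from that cost; repeat until (here, a fixed number of) sweeps."""
    for _ in range(iters):
        P = lyap(A - B @ K, Q + K.T @ R @ K)   # cost of the current law
        K = np.linalg.solve(R, B.T @ P)        # improved feedback gain
    return K, P

# Double integrator with an initial (suboptimal) stabilizing gain.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q, R = np.eye(2), np.array([[1.0]])
K_opt, _ = successive_improvement(A, B, Q, R, K=np.array([[1.0, 2.0]]))
```

For this example the iterates converge to the optimal gain [1, √3], the gain the Riccati equation would give directly, which mirrors the paper's observation that in the LQG case the approximation theory recovers the exact optimal control.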
Various direct approximations of the optimal control have been widely stud-
ied for nonlinear deterministic systems (Rekasius, 1964; Leake and Liu, 1967;
Saridis and Lee, 1979; Saridis and Balaram, 1986), and appeared to be more
promising than the linearization type approximation methods that have met
with limited success (Al’brekht, 1961; Lukes, 1969; Nishikawa et al., 1962).
For stochastic systems, a method of successive approximation to solve the
Hamilton-Jacobi-Bellman equation for a stochastic optimal control problem
using quasilinearization was proposed in Ohsumi (1984), but systematic pro-
cedures for the construction of suboptimal controllers were not established.
This paper presents a theoretical procedure to develop suboptimal feedback
controllers for stochastic nonlinear systems (Wang and Saridis, 1992), as an ex-
tension of the Approximation Theory of Optimal Control developed by Saridis
and Lee (1979) for deterministic nonlinear systems. The results are organized
as follows. Section 2 gives the mathematical preliminaries of the stochastic
optimal control problem. Section 3 describes major theorems that can be used
for the construction of successively improved controllers. For the infinite-time
stochastic regulator problem, a design theory using upper and lower bounds of
the cost function and the corresponding stability considerations are discussed
in Section 4. Two proposed design procedures are outlined and illustrated with
Successive Approximation of Stochastic Dynamic Systems 335
2. PROBLEM STATEMENT
For the purpose of obtaining explicit expressions, and without loss of gen-
erality since the results can be immediately generalized, consider a nonlin-
ear stochastic control system described by the following stochastic differential
equation:
If it is assumed that the optimal control law, exists and if the cor-
responding value function, is sufficiently smooth, then and V*
may be found by solving the well-known Hamilton-Jacobi-Bellman equation
(Bellman, 1956),
where is a suitable constant. Then the necessary and sufficient conditions for
to be the value function of an admissible fixed feedback control law
i.e.,
are
Proof: From (1.7), using Itô’s integration formula (Itô, 1951), it follows that:
Therefore,
338 MODELING UNCERTAINTY
The sufficient condition results from the above equation. For the necessary
condition, assume that Then from the above equation,
and for
Therefore,
then is an upper (or a lower) bound of the cost function of system (1.1).
That is‚
while Theorem 6 presents a method for constructing upper and lower bounds
to the optimal cost function‚ which can be used to evaluate the acceptability of
suboptimal controllers.
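Since the displayed equations in this section were lost in reproduction, the standard form of the stochastic Hamilton-Jacobi-Bellman equation referred to above may help orient the reader. The notation below (drift f, diffusion g, running cost L, value function V*) is assumed here rather than taken from the original:

```latex
% Stochastic HJB equation for the diffusion  dx = f(x,u)\,dt + g(x)\,dw
% with cost  J(u) = E\big[\int_0^T L(x,u)\,dt\big]  and value function V^*:
\min_{u}\Big\{ L(x,u) + \frac{\partial V^*}{\partial t}
    + \Big(\frac{\partial V^*}{\partial x}\Big)^{\!\top} f(x,u)
    + \tfrac12 \operatorname{tr}\Big[ g(x)\,g(x)^{\top}
        \frac{\partial^2 V^*}{\partial x^2} \Big] \Big\} = 0 .
```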
Theorem 3. Given the admissible controls and with
and be the corresponding cost functions satisfying Eqs. (1.7)
and (1.8) for and respectively‚ define the Hamiltonian function for
and 2‚
where
It is shown that
when
Proof: Let
then
Therefore
Hence‚
then
Proof: Since control and the corresponding value function must satisfy
Eqs. (1.9) and (1.10)‚ according to Theorem 1‚ it follows that for every
that is‚
Since‚
From (1.10)‚
that is‚
and
for all and where is the limit of the cost functions . The corresponding
limit of control sequence can be identified from (1.24) as‚
Clearly‚ and thus obtained‚ still satisfy Eqs. (1.9) and (1.10) of Theo-
rem 1. However‚ from the construction of control sequence minimizes
the pre-Hamiltonian function associated with the value function In other
words‚ and satisfy the Hamilton-Jacobi-Bellman equation for the optimal
control of stochastic system (1.1)
Hence‚
are the optimal control and optimal value function of the stochastic control
problem (1.5).
Remark 1: It follows from this theorem that the optimal feedback control
and the optimal cost function V* are related by
derived control laws and their corresponding value functions. However‚ for
a nonlinear stochastic control system as in (1.1)‚ the admissibility of the new
control laws is not always easy to show.
Finally‚ the following theorem presents a method for the construction of
an upper (or a lower) bound of the optimal cost function Since the
optimal cost function is extremely difficult to find‚ its upper (or lower) bounds
can provide a practical measure to evaluate the effectiveness of the sub-optimal
controllers.
Theorem 6. Assume that there exists a function satisfying condition
(1.7) of Theorem 1‚ for which the associated control
When
Proof: Following the same procedure used in the proof for Theorem 3‚ one
can show that
hence‚
The next theorem is the counterpart of Theorem 4‚ and its proof can be carried
out by the same procedure used in Theorem 4.
Theorem 8. Assume that there exists a control and a function
pair satisfying (1.11) of Theorem 2. If there exists a
then‚
2 Since all the states are assumed available for measurement‚ system (1.1)
is obviously Completely Observable.
3 The Performance Cost (1.37) is bounded because‚
where
Select the first control to satisfy the previous theorems of the Approxi-
mation Theory‚ and the condition‚
Then‚
where and can be found by solving Eqs. (1.9) and (1.10) of Theorem
1‚ i.e.‚
and
which is the optimal control for the linear stochastic system with quadratic
performance criterion (Wonham, 1970).
This solution demonstrates the use of Theorem 5 to sequentially improve the
control parameters towards the optimal values in a Linear Quadratic Gaussian
system with well-known solution.
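For this linear-quadratic case, a minimal numerical sketch may be useful (a hypothetical scalar system with illustrative constants, not the chapter's worked example). For dx = (a x + b u) dt + sigma dw with cost E∫(q x² + r u²) dt, the optimal gain follows from the scalar algebraic Riccati equation, and by certainty equivalence the stochastic gain coincides with the deterministic LQR gain:

```python
import math

def lqg_scalar_gain(a, b, q, r):
    """Solve the scalar algebraic Riccati equation 2*a*P - (b**2/r)*P**2 + q = 0
    for the stabilizing root P > 0.  By certainty equivalence, u = -k*x with
    k = b*P/r is optimal for the noisy system as well; additive noise only
    shifts the optimal cost, not the gain."""
    c = b * b / r
    P = (2.0 * a + math.sqrt(4.0 * a * a + 4.0 * c * q)) / (2.0 * c)
    return P, b * P / r

P, k = lqg_scalar_gain(a=-1.0, b=1.0, q=1.0, r=1.0)
# closed-loop drift a - b*k is negative, so the controlled system is stable
```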
Example 2. The second example illustrates the design method by the fol-
lowing nonlinear first-order stochastic system:
Such a controller was selected to be of the same order as the partial derivative
of the cost function as Theorem 5 suggests. The corresponding cost
function is assumed to be‚
which yields
which is satisfied by
which‚ with the rest of the inequalities‚ produce acceptable values for
and For example‚ one can show that
are a set of the acceptable values. The lower and upper bounds of the value
functions in this case are found to be‚
ACKNOWLEDGMENTS
I would like to dedicate this paper to the memory of Professor Sidney J.
Yakowitz.
REFERENCES
Al’brekht, E. G. (1961). On the optimal stabilization of nonlinear systems, J.
Appl. Math. Mech. (PMM), 25, 5, 1254-1266.
Aoki‚ M. (1967). Optimization of Stochastic Systems‚ Academic Press‚ N.Y.
Bellman‚ R. (1956). Dynamic Programming‚ Princeton University Press‚ Prince-
ton‚ N.J.
Doob, J. L. (1953). Stochastic Processes. Wiley, N.Y.
Dynkin, E. B. (1965). Markov Processes. Academic Press, N.Y.
Itô, K. (1951). On Stochastic Differential Equations, Mem. Amer. Math. Soc.,
4.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical
Review, 106, 620-630.
Kushner, H. (1971). Introduction to Stochastic Control, Holt, Rinehart and
Winston, NY.
Kwakernaak‚ H. and R. Sivan. (1972). Linear Optimal Control Systems‚ Wiley‚
N.Y.
Leake‚ R.J. and R.-W. Liu. (1967). Construction of suboptimal control se-
quences‚ SIAM J. Control‚ 5‚ 1‚ 54-63.
Lukes‚ D. L. (1969). Optimal regulation of nonlinear dynamical systems. SIAM
J. Control‚7‚ 1‚75-100.
Nishikawa, Y., N. Sannomiya and H. Itakura. (1962). A method for suboptimal
design of nonlinear feedback systems, Automatica, 7, 6, 703-712.
Ohsumi, A. (1984). Stochastic control with searching a randomly moving target,
Proc. of American Control Conference, San Diego, CA, 500-504.
Panossian, H. V. (1988). Algorithms and computational techniques in stochastic
optimal control, C. T. Leondes (ed.), Control and Dynamic Systems, 28, 1,
1-55.
Rekasius‚ Z.V. (1964). Suboptimal design of intentionally nonlinear controllers.
IEEE Trans. Automatic Control‚ AC-9‚ 4‚ 380-386.
Sage, A. P. and C. C. White. (1977). Optimum Systems Control, Prentice-Hall,
Englewood Cliffs, N.J.
László Gerencsér
Computer and Automation Institute
Hungarian Academy of Sciences
H-1111‚ Budapest Kende u 13-17
Hungary*
Abstract We consider a sequence of not necessarily i.i.d. random mappings that arise in
discrete-time fixed-gain recursive estimation processes. This is related to the
sequence generated by the discrete-time deterministic recursion defined in terms
of the averaged field. The tracking error is majorized by an L-mixing process
the moments of which can be estimated. Thus we get a discrete-time stochastic
averaging principle. The paper is a simplification and extension of Gerencsér
(1996).
1. INTRODUCTION
Most of the commonly used identification methods for linear stochastic sys-
tems can be considered as special cases of a general estimation scheme, which
was proposed in Djereveckii and Fradkov (1974); Ljung (1977), and further
elaborated in Djereveckii and Fradkov (1981); Ljung and Söderström (1983).
This scheme can be described as follows. Let us define a parameter-dependent,
stochastic process by the state-space equation:
their eigenvalues inside the unit disc, and A(.), B(.) and C(.) are
of
Let be an quadratic function, defined on all
components of which are homogeneous quadratic functions of Let
As an extreme case, may not depend on at all. This is the case with
the celebrated LMS method in adaptive filtering. Here a particular component
of a wide-sense stationary signal is approximated by the linear combination of
the remaining components. Formally: let be an
Stability of Random Iterative Mappings 361
(1.9)
The on-line computation of the best weights is obtained by the LMS method:
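The displayed update was lost here; the standard fixed-gain LMS recursion, w_{n+1} = w_n + mu * e_n * x_n with e_n the prediction error (notation assumed, not taken from the original), can be sketched as:

```python
import numpy as np

def lms(x, d, mu, w0=None):
    """Fixed-gain LMS: approximate d[t] by x[t] @ w, updating w with a
    constant step size mu (a fixed gain, rather than a decreasing one)."""
    T, p = x.shape
    w = np.zeros(p) if w0 is None else np.asarray(w0, dtype=float)
    for t in range(T):
        e = d[t] - x[t] @ w      # prediction error
        w = w + mu * e * x[t]    # fixed-gain stochastic-gradient update
    return w

# illustrative noiseless regression: the weights converge to w_true
rng = np.random.default_rng(0)
w_true = np.array([0.5, -1.0])
X = rng.standard_normal((4000, 2))
d = X @ w_true
w_hat = lms(X, d, mu=0.05)
```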
Fixed gain procedures are quite well-known in adaptive filtering. The normal-
ized tracking error has been analyzed from a number of aspects.
Weak convergence‚ a central limit theorem and an invariance principle‚ when
have been derived in Kushner and Schwartz (1984); Györfi and Walk
(1996); Gerencsér (1995); Joslin and Heunis (2000); Kouritzin (1994). Bounds
for higher order moments for any fixed 0 have been derived in Gerencsér
(1995).
In this paper we consider general fixed-gain recursive estimation processes
that include processes given by (1.12). In modern terminology these can be con-
sidered as iterated random maps. Thus we consider random iterative processes
of the form
characterization of the tracking error process. The focus of that paper was on
processes in continuous-time.
The application of the so-called ODE method for discrete-time fixed-gain
recursive estimation processes requires the often painful analysis of the effect
of discretization error. The main contribution of the present paper is an ex-
tension of Gerencsér (1996) to discrete-time processes and the development
of a discrete-time ODE-method‚ where ODE now is for Ordinary Difference
Equation. This new method is much more convenient for applications and
also gives more accurate characterization of the error process. These advan-
tages will be exploited in the forthcoming paper (Gerencsér and Vágó‚ 2001) to
analyze the convergence properties of the so-called SPSA (simultaneous per-
turbation stochastic approximation) methods‚ developed in Spall (1992); Spall
(1997); Spall (2000)‚ when applied to noise-free optimization. In the condi-
tions below we use the definitions and notations given in the Appendix. The
key technical condition that ensures a stochastic averaging effect is Condition
1.2.
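As a toy illustration of this discrete-time averaging principle (the scalar field H(theta, X) = -theta + X and its mean field G(theta) = -theta are illustrative choices, not the paper's), the stochastic fixed-gain recursion can be run next to its averaged difference equation and the tracking error observed to stay small for small gain:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, K = 0.01, 2000
theta, y = 1.0, 1.0
worst = 0.0
for k in range(K):
    X = rng.standard_normal()
    theta += mu * (-theta + X)   # stochastic fixed-gain recursion, field H
    y += mu * (-y)               # ordinary difference equation, field G = EH
    worst = max(worst, abs(theta - y))
# for small mu the tracking error remains of the order sqrt(mu)
```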
Condition 1.1 The random fields H and are bounded for
say K and We
let
Condition 1.2 H and are L-mixing uniformly in for and in
for respectively, with respect to a pair of families
of
Condition 1.3 The mean field EH is independent of i.e. we can
write
The conditions imposed on the averaged field G are described in terms of
(1.16) given below; this is more convenient for applications. Thus consider the
continuous-time deterministic process defined by
The condition ensures the existence and uniqueness of the solution of (1.16)
in some finite or infinite time interval for any which will be
denoted by It is well-known that is a continuously differen-
tiable function of and also exists and is continuous
in
The time-homogeneous flow associated with (1.16) is defined as the mapping
Let D be a subset of D such that for we
have for any For any fixed t the image of under will be
denoted as i.e. The union of
these sets will be denoted by i.e.
The above theorem has two new features compared to previous results by
other authors: first‚ the higher order moments of the tracking error are estimated.
Secondly‚ the upper bound is shown to be an L-mixing process. An interesting
consequence is the following stochastic averaging result:
Theorem 1.2 Let the conditions of Theorem 1.1 be satisfied. Let F(.) be a
continuously differentiable function of defined in D. Then we have
2. PRELIMINARY RESULTS
In the first result of this section we show a simple method to couple continuous-
time and discrete-time processes. Then we show that the discrete flow defined
by (1.15) is exponentially stable with respect to initial perturbations for suffi-
ciently small
Lemma 2.1 Assume that Conditions 1.4‚ 1.5 and 1.6 are satisfied. Let be the
solution of (1.16) and be the solution of (1.15). Then if
then will stay in and for all
Proof: Let denote the piecewise linear extension of
for Taking into account Lemma 0.2 we
can write as long as for all
Lemma 2.2 Assume that Conditions 1.4‚ 1.5 and 1.6 are satisfied. Then d >
implies that will stay in moreover for any
we have
or equivalently
Thus
To estimate the first term in the integral we differentiate (2.6) with respect to
to obtain
Returning to (2.4)‚ the first term under the first integral on the right hand side
is estimated using Lemma 0.3. For the second term we write
Thus the absolute value of the first integral on the right hand side of (2.4) is
majorized by
For the absolute value of the second integral we get the upper bound
Altogether we get
with Writing
For the proof first note that Condition 1.5 and an obvious adaptation of
Lemma 2.1 ensure that for all whenever
There is no loss of generality to assume that and then we have
In the first integral take supremum over and over the initial
condition which enters implicitly through Then we
get the random variable defined in (3.2) with Since H is Lipschitz-
continuous in with Lipschitz-constant L‚ the second term is majorized by
Applying a discrete-time Bellman–Gronwall lemma we get the
claim for
Now Lemma 2.2 of Gerencsér (1996)‚ which itself is proved by direct com-
putations using Theorems of Gerencsér (1989)‚ implies that the process is
L–mixing with respect to and for any and we
have
where
APPENDIX
The basic concepts of the theory of L-mixing processes developed in Gerencsér (1989) will
be presented. Let a probability space be given, let be an open domain and
let be a parameter-dependent stochastic process. Alternatively, we
may consider as a time-varying random field. We say that is M-bounded if for
all
Here | · | denotes the Euclidean norm. We shall use the same terminology if or t degenerate
into a single point. Also we shall use the following notation: if is M-bounded we write
Moreover‚ if is a positive real-valued function we write
if
The following lemma is a direct consequence of Lemma 2.4 of Gerencsér (1989)‚ which itself
is verified by direct calculations:
Lemma 0.1 Let be a discrete time L-mixing process with respect to a pair of
families of Define the process
Two simple analytical lemmas follow. Due to its importance the proof of the first Lemma
will be given. The proof of the second lemma is given in Gerencsér (1996):
Lemma 0.2 (cf. Geman, 1979) Let be a function satisfying Conditions 1.4, 1.5 and let
be solution of (1.16). Further let be a continuous, piecewise continuously differentiable
curve such that Then for
Proof: Consider the function Obviously the left hand side of (0.3) can be
written as Write
which follows from the identity after differentiation with respect to we get
the lemma.
REFERENCES
Benveniste, A., M. Métivier, and P. Priouret. (1990). Adaptive algorithms and
stochastic approximations. Springer-Verlag, Berlin.
Bittanti, S. and M. Campi. (1994). Bounded error identification of time-varying
parameters by RLS techniques. IEEE Trans. Automat. Contr., 39(5): 1106–
1110.
Caines, P. E. (1988). Linear Stochastic Systems. Wiley.
Campi, M. (1994). Performance of RLS identification algorithms with forget-
ting factor: a approach. J. of Mathematical Systems, Estimation
and Control, 4:1–25.
Djereveckii, D. P. and A.L. Fradkov. (1974). Application of the theory of
Markov-processes to the analysis of the dynamics of adaptation algorithms.
Automation and Remote Control, (2):39–48.
Djereveckii, D. P. and A.L. Fradkov. (1981). Applied theory of discrete adaptive
control systems. Nauka, Moscow. In Russian.
Geman, S. (1979). Some averaging and stability results for random differential
equations. SIAM Journal of Applied Mathematics, 36:87–105.
Gerencsér, L. (1989). On a class of mixing processes. Stochastics, 26:165–191.
Gerencsér, L. (1995). Rate of convergence of the LMS algorithm. Systems and
Control Letters, 24:385–388.
Gerencsér, L. (1996). On fixed gain recursive estimation processes. J. of Ma-
thematical Systems, Estimation and Control, 6:355–358. Retrieval code for
full electronic manuscript: 56854.
Gerencsér, L. and Zs. Vágó. (2001). A general framework for noise-free SPSA.
In Proceedings of the 40th Conference on Decision and Control, CDC’01,
submitted.
Victor Solo
School of Electrical Engineering and Telecommunications
University of New South Wales
Sydney NSW 2052
Australia
vsolo@syscon.ee.unsw.edu.au
Keywords: Markov Chain, Monte Carlo, Adaptive Algorithm.
Abstract Many Signal Processing and Control problems are complicated by the presence
of unobserved variables. Even in linear settings this can cause problems in con-
structing adaptive parameter estimators. In previous work the author investigated
the possibility of developing an on-line version of so-called Markov Chain Monte
Carlo methods for solving these kinds of problems. In this article we present a
new and simpler approach to the same group of problems based on direct simu-
lation of unobserved variables.
1. EL SID
When I first started wandering away from the statistics beaten track into
the physical and engineering sciences I soon encountered a small but intrepid
band of ‘statistical explorers’; statisticians who had taken the pains necessary to
break into other areas (where their mathematical skills did not excite the same
level of awe as they might have in some of the social or biological sciences)
but who were having a considerable impact, out of proportion to their numbers,
elsewhere.
Of course there were the better known names, the Spekes and Burtons of
these strange lands: Tukey certainly was there in geophysics; and Kiefer in
information theory. But there were others, less well known, the Bakers and
Marchands, perhaps more enterprising, but no less deserving of respect. Sid
was one of these, but different even then; too maverick to belong to a group of
mavericks!
I first encountered Sid’s work in adaptive control in the 1970’s (Yakowitz,
1969). Then a little later I was astounded to discover he’d worked on the
marvellous ill-conditioned inverse problem of aquifer modelling (Neuman and
Yakowitz, 1979). And yet again there was his work on Kriging (Yakowitz and
Szidarovszky, 1984), providing some rare clarity on a much mis-discussed sub-
ject. He was worrying about nonparametric estimation of stochastic processes
long before it became fashionable : e.g. Yakowitz (1985) and earlier references.
He was, then, an early player in areas which became extremely important. In
his research Sid cut to the heart of the problem and asked the hard (and em-
barrassing) questions; he obviously took application seriously but saw the utility
of rigour too. At a personal level he was very humble and very encouraging to
other younger researchers.
Because Sid worked on so many diverse problems he did not get the kind
of recognition he deserved but he certainly rates along with others who did.
Given that his first work was on adaptive algorithms I am pleased to offer a
contribution on this topic. This work, with a statistical bent, would appeal to
Sid; although he would not be pleased by the lack of rigour!
2. INTRODUCTION
In many problems of real time parameter estimation there is a necessity
to estimate unobserved signals. These may be states or auxiliary variables
measured with error. In this paper we concentrate on this latter problem of
errors in variables.
Thus in Computer Vision, consider the fundamental problem of estimating
three-dimensional motion from a sequence of two-dimensional images using
a feature based method - a highly non-linear problem (Tekalp, 1995). Many
current methods ignore the presence of noise in the measurements, which turns
the problem into an errors in variables problem. Kanatani (1996) has tackled
the problem of noise but his techniques only apply in high signal to noise
ratio cases and under independence assumptions. An approach that overcomes
this is in Ng and Solo (2001). In multiuser communication based on spread
spectrum methods, detection methods may require spreading codes which may
not be known and so have to be estimated. They occur in a nonlinear fashion
(Rappaport, 1996).
To construct an adaptive parameter estimator for a non-linear problem there
are two routes. In the model free or stochastic approximation approach (Ben-
veniste et al., 1990; Ljung, 1983; Solo and Kong, 1995) an analytic expression
is needed for the gradient of the likelihood with respect to the parameters. In
the model based approach which usually leads to approximations based on the
’Unobserved’ Monte Carlo Methods for Adaptive Algorithms 375
extended Kalman filter (Solo and Kong, 1995; Ljung, 1983) this gradient is also
needed. But there are many problems where it is not possible to develop ana-
lytic expressions for the likelihood much less its gradient. Errors in variables
typically produce such a situation.
In the statistics literature in the last decade or so Markov Chain Monte
Carlo methods (MCMC)(Roberts and Rosenthal, 1998), originating in Physics
(Metropolis et al., 1953) and Image Processing (Geman and Geman, 1984) have
provided a powerful simulation based methodology for solving these kinds of
problems. In previous work (Solo, 1999) we have pursued the possibility of
using this method in an offline setting. Here based on more recent work of the
author (Solo, 2000ab) we develop an on-line version of an alternative and sim-
pler method. We use a simple binary classification problem as an exploratory
platform. For the errors in variables case we consider below there seems to
be no previous literature (aside from Solo (2000a)) using the model free ap-
proach. For the model based approach one ends up appending the parameters
being tracked to the states and pursues a nonlinear filtering approach. There
are examples of this e.g. Kitagawa (1998) but not really aimed at the kind of
errors in variables problem envisaged here.
The remainder of the paper is organised as follows. In section 2 we describe
the problem (without measurement errors) and briefly review an adaptive es-
timator. In section 3 we discuss the same problem but where the auxiliary
(classifying) variables are measured with error and describe the ’Unobserved’
Monte Carlo method for parameter estimation (offline). In section 4 we de-
velop an online version of the estimator and sketch a convergence analysis.
Conclusions are offered in section 5.
In the sequel we use for the density of an unobserved signal or variable;
for a conditional density and for the density of an observed variable.
We also use to denote a probability of a binary event; this should cause no
confusion.
We cannot directly generate draws from this conditional density since the nor-
malising constant (or partition function) cannot be calculated. Note that this
partition function is precisely the density that we cannot calculate. And now
MCMC comes into play because we can generate instead draws from a Markov
chain which has as its limit or invariant or ’steady state’ density. There
are numerous MCMC algorithms available (Roberts and Rosenthal, 1998).
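As a sketch of the kind of MCMC algorithm meant here, a random-walk Metropolis sampler uses only ratios of the unnormalised target, so the uncomputable partition function cancels. The target and step size below are illustrative assumptions, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(6)

def metropolis(log_target, x0, n, step=0.5):
    """Random-walk Metropolis: propose x' = x + step*N(0,1) and accept with
    probability min(1, pi(x')/pi(x)).  Only the *ratio* of the target appears,
    so the normalising constant (partition function) is never needed."""
    x, lp = float(x0), log_target(x0)
    out = np.empty(n)
    for i in range(n):
        xp = x + step * rng.standard_normal()
        lpp = log_target(xp)
        if np.log(rng.random()) < lpp - lp:   # accept/reject step
            x, lp = xp, lpp
        out[i] = x
    return out

# unnormalised standard normal target: log pi(x) = -x^2/2 + const
chain = metropolis(lambda t: -0.5 * t * t, 0.0, n=50_000)
```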
An alternative to EM is MC-NR (Monte Carlo Newton Raphson) (Kuk and
Cheng, 1997) in which the idea is to get by simulation also. The EM
framework remains useful. We find
This idea was developed into an online procedure in the previous work (Solo,
1999). But here we take a different route.
In recent work (Solo, 2000ab) the author has found a simpler route to Monte
Carlo based estimation by means of direct simulation of the unobserved variable.
The method is dubbed ’unobserved’ Monte Carlo (UMC).1 Referring to (4.1)
we see that if we draw iid samples from the density q(u)
then a Monte Carlo estimate of is
Now if is small, N not too large then in the sum can be approximated by
its value at the beginning of the sum while can be approximated by a
draw from So we get
Note the unusual feature that is the true density of Now argue in reverse
to find the averaged system
It is interesting that this is the same averaged system found in the previous work
(Solo, 1999) based on MCMC simulation. Continuing we find
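To make the UMC construction concrete (the displayed formulas were lost in reproduction; the Gaussian model below is purely illustrative), the likelihood of an observation y is estimated by averaging p(y | u; theta) over draws taken directly from the known density q of the unobserved variable u:

```python
import numpy as np

rng = np.random.default_rng(1)

def umc_likelihood(y, theta, n=200_000):
    """UMC estimate of p(y; theta) = E_q[ p(y | u; theta) ], where the
    unobserved u has known density q = N(0, 1) and y | u ~ N(u, theta)."""
    u = rng.standard_normal(n)                       # direct draws from q
    return float(np.mean(np.exp(-(y - u) ** 2 / (2.0 * theta))
                         / np.sqrt(2.0 * np.pi * theta)))

y, theta = 0.7, 0.5
est = umc_likelihood(y, theta)
# closed form for this toy model: marginally y ~ N(0, 1 + theta)
exact = float(np.exp(-y * y / (2.0 * (1.0 + theta)))
              / np.sqrt(2.0 * np.pi * (1.0 + theta)))
```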
6. CONCLUSIONS
In this paper we have pursued the issue of on-line use of simulation meth-
ods for adaptive parameter estimation in nonstandard problems. Here we have
concentrated on errors in variables problems. We have shown that a simpler
simulation method than previously developed has great promise for these kinds
of problems. Previously problems like these have been ignored, or treated ap-
proximately by Taylor series expansions. And it now appears to be possible to
do much better. The careful reader can cast an eye back over our discussion
to see that nothing special about the errors in variables setup is invoked. The
approach developed here extends in a straightforward way to deal with partially
observed dynamical systems. It should be noted however that there will remain
some problems where even to draw from the density of the unobserved signal
will itself require an MCMC approach. In future work we will look at imple-
mentations for more realistic problems and study stability and performance in
more detail.
NOTES
1. The current and related works were completed in the second half of 1999 after a sabbatical. Recently
the author became aware of work on so-called sequential Monte Carlo methods, of which to some extent
UMC is a special case.
2. A more formal analysis can be developed using the methods in (Solo and Kong, 1995) but will be
pursued elsewhere.
REFERENCES
Benveniste, A., M. Métivier, and P. Priouret. (1990). Adaptive Algorithms and
Stochastic Approximations. Springer-Verlag, New York.
Dempster, A. P., N. M. Laird, and D. B. Rubin. (1977). Maximum likelihood from
incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39:1–38.
Geman, S. and D. Geman. (1984). Stochastic relaxation, Gibbs distributions and
the Bayesian restoration of images. IEEE Trans. Patt. Anal. Machine Intell.,
6:721–741.
Kanatani, K. (1996). Statistical optimization for Geometric Computation: The-
ory and Practice. North-Holland, Amsterdam.
Kuk, A. Y. C. and Y. W. Cheng. (1997). The Monte Carlo Newton-Raphson
algorithm. J. Stat. Comput. Simul.
Kitagawa, G. (1998). Self-organising state space model. J. Amer. Stat. Assoc.,
93:1203–1215.
Kushner, H.J. (1984). Approximation and weak convergence methods for ran-
dom processes with application to stochastic system theory. MIT Press, Cam-
bridge MA.
Ljung, L. (1983). Theory and practice of recursive identification. MIT Press,
Cambridge, Massachusetts.
Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. (1953).
Equations of state calculations by fast computing machines. J. Chem. Phys.,
21:1087–1091.
Ng, L. and V. Solo. (2001). Errors-in-variables modelling in optical flow esti-
mation. IEEE Trans. Im.Proc., to appear.
Neuman, S.P. and S. Yakowitz. (1979). A statistical approach to the inverse
problem of aquifer hydrology, I. Water Resour Res, 15:845–860.
Rappaport, T.S. (1996). Wireless Communication. Prentice Hall, New York.
Ripley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge Uni-
versity Press, Cambridge UK.
Roberts, G.O. and J.S. Rosenthal. (1998). Markov chain Monte Carlo: Some
practical implications of theoretical results. Canadian Jl. Stat., 26:5–31.
Sastry, S. and M. Bodson. (1989). Adaptive Control. Prentice Hall, New York.
Solo, V. and X. Kong. (1995). Adaptive Signal Processing Algorithms. Prentice
Hall, New Jersey.
Solo, V. (1999). Adaptive algorithms and Markov chain Monte Carlo methods.
In Proc. IEEE Conf Decision Control 1999, Phoenix, Arizona, IEEE.
Solo, V. (2000a). ’Unobserved’ Monte Carlo method for system identification of
partially observed nonlinear state space systems, Part I: Analog observations.
In Proc. JSM 2001, Atlanta, Georgia, August, to appear. Am. Stat. Assoc.
Luc Devroye
School of Computer Science
McGill University
Montreal, Canada H3A 2K6
Adam Krzyzak
Department of Computer Science
Concordia University
Montreal, Canada H3G 1M8
Note first of all the distribution-free character of this statement: its universality
is both appealing and limiting. We note in passing here that many papers have
been written about how one could decide to stop random search at a certain
point.
To focus the search somewhat, random covering methods may be considered.
For example, Lipschitz functions may be dealt with in the following manner (Shubert,
1972): at the trial points we know Q and can thus derive piecewise linear
bounds on Q. The next trial point is given by
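The selection rule was lost in the display above; in one dimension it is the minimizer of the current piecewise-linear lower envelope, as in this sketch (the test function and Lipschitz constant are illustrative assumptions):

```python
def shubert_minimize(f, a, b, K, n_iter=60):
    """Piyavskii-Shubert search: f is K-Lipschitz on [a, b].  Each trial point
    minimizes the piecewise-linear lower bound  max_i { f_i - K|x - x_i| };
    between neighbouring trial points the bound's minimum is explicit."""
    xs, fs = [a, b], [f(a), f(b)]
    for _ in range(n_iter):
        best = None
        for i in range(len(xs) - 1):
            # intersection of the two bounding lines over [xs[i], xs[i+1]]
            x = 0.5 * (xs[i] + xs[i + 1]) + (fs[i] - fs[i + 1]) / (2.0 * K)
            v = 0.5 * (fs[i] + fs[i + 1]) - 0.5 * K * (xs[i + 1] - xs[i])
            if best is None or v < best[0]:
                best = (v, x, i)
        _, x_new, i = best
        xs.insert(i + 1, x_new)
        fs.insert(i + 1, f(x_new))
    f_best, x_best = min(zip(fs, xs))
    return x_best, f_best

x_star, f_star = shubert_minimize(lambda x: (x - 0.3) ** 2, 0.0, 1.0, K=2.0)
```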
Bohachevsky, Johnson and Stein (1986) adjust during the search to make
the probability of accepting a trial point hover near a constant. Nevertheless, if
is taken as above, the rate of convergence to the minimum is bounded from
below by which is much slower than the polynomial rate we would
have if Q were multimodal but Lipschitz.
Several ideas deserve more attention as they lead to potentially efficient
algorithms. These are listed here in arbitrary order. In 1975, Jarvis introduced
competing searches such as competing local random searches. If N is the
number of such searches, a trial (or time unit) is spent on the search with
probability where is adapted as time evolves; a possible formula is to
replace by where is a weight, and
are constants, and is the trial point for the competing search. More
energy is spent on promising searches.
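A runnable sketch of this competition scheme follows; since the exact weight-update formula above was lost in reproduction, the multiplicative reward used here is an assumption, not Jarvis's rule:

```python
import numpy as np

rng = np.random.default_rng(2)

def competing_search(f, starts, n_trials=3000, step=0.1, eta=0.1):
    """Jarvis-style competition: each local random search keeps a best point;
    a trial is allocated to search i with probability proportional to a
    weight that grows when search i improves."""
    xs = [np.asarray(s, dtype=float) for s in starts]
    fs = [f(x) for x in xs]
    w = np.ones(len(xs))
    for _ in range(n_trials):
        i = rng.choice(len(xs), p=w / w.sum())
        cand = xs[i] + step * rng.standard_normal(xs[i].shape)
        fc = f(cand)
        if fc < fs[i]:                  # local random search: keep improvements
            xs[i], fs[i] = cand, fc
            w[i] *= 1.0 + eta           # reward the improving search
        else:
            w[i] *= 1.0 - eta / 10.0    # mild decay keeps allocation adaptive
    j = int(np.argmin(fs))
    return xs[j], fs[j]

best_x, best_f = competing_search(
    lambda v: float(np.sum((v - 1.0) ** 2)), [[-2.0], [0.0], [3.0]])
```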
This idea was pushed further by several researchers in one form or another.
Several groups realized that when two searches converge to the same local
minimum, many function evaluations could be wasted. Hence the need for
on-line clustering, the detection of points that belong somehow to the same
local valley of the function. See Becker and Lago (1970), Törn (1974, 1976),
de Biase and Frontini (1978), Boender et al. (1982), and Rinnooy Kan and
Timmer (1984, 1987).
The picture is now becoming clearer—it pays to keep track of several base
points, i.e., to increase the storage. In Price’s controlled random search for
example (Price, 1983), one has a cloud of points of size about where is
the dimension of the space. A random simplex is drawn from these points, and
the worst point of this simplex is replaced by a trial point, if this trial point is
better. The trial point is picked at random inside the simplex.
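Price's procedure as just described can be sketched as follows (the cloud size and test function are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def controlled_random_search(f, bounds, n_iter=3000):
    """Price-style CRS: keep a cloud of points; draw a random simplex of d+1
    of them; pick a uniform random point inside the simplex; replace the worst
    vertex of that simplex if the trial point is better."""
    d = len(bounds)
    m = 10 * (d + 1)                                  # cloud size, a common choice
    lo = np.array([b[0] for b in bounds], dtype=float)
    hi = np.array([b[1] for b in bounds], dtype=float)
    pts = lo + (hi - lo) * rng.random((m, d))
    vals = np.array([f(p) for p in pts])
    for _ in range(n_iter):
        idx = rng.choice(m, size=d + 1, replace=False)
        lam = rng.dirichlet(np.ones(d + 1))           # uniform point in simplex
        cand = lam @ pts[idx]
        fc = f(cand)
        w = idx[int(np.argmax(vals[idx]))]            # worst vertex of the simplex
        if fc < vals[w]:
            pts[w], vals[w] = cand, fc
    j = int(np.argmin(vals))
    return pts[j], vals[j]

x_best, f_best = controlled_random_search(
    lambda p: float(np.sum((p - 0.5) ** 2)), [(0.0, 1.0), (0.0, 1.0)])
```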
Independently, the German school developed the Evolutionsstrategie (Rechen-
berg, 1973; Schwefel, 1981). Here a population of base points gives rise to a
population of trial points. Of the group of trial points, we keep the best N, and
repeat the process.
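A plain (mu, lambda) Evolutionsstrategie in this spirit can be sketched as below; the constants and the crude cooling schedule are illustrative, and the self-adaptive step-size rules of Rechenberg and Schwefel are omitted:

```python
import numpy as np

rng = np.random.default_rng(4)

def evolution_strategy(f, x0, mu=5, lam=20, sigma=0.3, n_gen=120):
    """(mu, lambda)-ES: each generation, the mu parents spawn lam Gaussian
    offspring; the best mu offspring become the next parent population."""
    parents = np.tile(np.asarray(x0, dtype=float), (mu, 1))
    d = parents.shape[1]
    for _ in range(n_gen):
        pick = rng.integers(0, mu, size=lam)          # random parent per child
        kids = parents[pick] + sigma * rng.standard_normal((lam, d))
        vals = np.array([f(k) for k in kids])
        parents = kids[np.argsort(vals)[:mu]]         # comma selection
        sigma *= 0.97                                 # crude cooling schedule
    best = min(parents, key=lambda p: f(p))
    return best, float(f(best))

x_best, f_best = evolution_strategy(lambda p: float(np.sum(p ** 2)), [3.0, -2.0])
```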
Bilbro and Snyder (1991) propose tree annealing: all trial points are stored
in tree format, with randomly picked leaves spawning two children. The leaf
probabilities are determined as products of edge probabilities on the path to the
root, and the tree represents the classical tree partition of the space. Their
approach is at the same time computationally efficient and fast.
To deal with high-dimensional spaces, the coordinate projection method of
Zakharov (1969) and Hartman (1973) deserves some attention. Picture the
space as being partitioned by an N × · · · × N regular grid. With each marginal
interval of each coordinate we associate a weight proportional to the likelihood
that the global minimum is in that interval. A cell is grabbed at random in the
grid according to these (product) probabilities, and the marginal weights are
such as normally distributed noise, or indeed, any noise with tails that decrease
faster to zero than exponential, then we have convergence in the given sense.
The reader should not confuse our notion of stability, which is taken from the
order statistics literature (Geffroy, 1958), with that of the stable distribution.
Stable noise is interesting because an i.i.d. sequence drawn from G
satisfies in probability for some sequence; see,
for example, Rubinstein and Weissman (1979). Additional results are presented
in this paper.
In noisy optimization in general, it is possible to observe a sample drawn
from distribution at each with possibly different for each The mean
of is If there are just two and the probe points selected by us
are where each of the is one of the then the purpose in
bandit problems is to minimize
to be exactly at the best but that is impossible since some sampling of the
non-optimal value or values is necessary. Similarly, we may sometimes wish
to minimize
This is nothing but the pure random search algorithm, employed as if we were
unaware of the presence of any noise. Our study of this algorithm will reveal
how noise-sensitive or robust pure random search really is. Not unexpectedly,
the behavior of the algorithm depends upon the nature of the noise distribution.
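For reference, pure random search applied to noisy evaluations, run exactly as if the noise were absent, can be sketched as follows (function and parameter names are ours):

```python
import random

def pure_random_search(f, bounds, n, noise, seed=2):
    """Pure random search on noisy evaluations f(x) + noise.

    Returns the probe point whose single noisy observation was the
    smallest -- i.e., the algorithm ignores the presence of noise.
    """
    rng = random.Random(seed)
    best_x, best_y = None, float("inf")
    for _ in range(n):
        x = [rng.uniform(lo, hi) for lo, hi in bounds]
        y = f(x) + noise(rng)  # one noisy observation per probe point
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y
```

With light-tailed ("stable") noise the returned point concentrates near the minimizer as n grows; with heavy-tailed noise it need not, which is exactly the issue analyzed below.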
The noise will be called stable if for all
The reason why stable noise will turn out to be manageable is that min
is basically known to fall into an interval of arbitrarily small positive length around
some deterministic value with probability tending to one for some sequence
It could thus happen that as yet this is not a problem.
This was also observed by Rubinstein and Weissman (1979). In the next sec-
tion, we obtain a necessary and sufficient condition for the weak convergence
of for the pure random search algorithm.
THEOREM 1. If is stable, then in probability. Conversely,
if is not stable, then does not tend to in probability for any search
problem for which for all small enough, > 0.
We picked the name “stable noise” because the minimum of is
stable in the sense used in the literature on order statistics, that is, there exists a
sequence such that in probability. We will prove the minimal
properties needed further on in this section. The equivalence property A of
Lemma 1 is due to Gnedenko (1943), while parts B and C are inherent in the
fundamental paper of Geffroy (1958).
LEMMA 1.
A. G is the distribution function of stable noise if and only if in
probability for some sequence
C. If the noise distribution is not stable, then there exist positive constants
a sequence a subsequence and an such that
and for all
PROOF. We begin with property B. Note that by assumption,
and thus Also, implies
Observe that for any
This shows that eventually, Thus,
and
Let us turn to A. We first show that B implies Gnedenko’s condition. We
can assume without loss of generality that is monotone decreasing since
can be replaced by in view of property B. For every we find
such that Thus, and
Thus,
Similarly,
where we used property B of Lemma 1 again. This concludes the proof of the
sufficiency.
The necessity is obtained as follows. Since G is not stable, we can find
positive constants a sequence a subsequence and an such
that and for all (Lemma 1, property C). Let
be in this subsequence and let be the multinomial
random vector with the number of values in
and respectively. We first condition on this vector. Clearly, if for
some in the second interval we have while for all
in the first interval, we have then Thus,
the conditional probability of this event is
The event on the right hand side has zero probability. Hence so does the event
on the left hand side.
It is more difficult to provide characterizations of strongly stable noises,
although several sufficient and a few necessary conditions are known. For an
in-depth treatment, we refer to Geffroy (1958). It suffices to summarize a few
key results here. The following condition due to Geffroy is sufficient:
This condition comes close to being necessary. Indeed, if G is strongly stable, and
is monotone in the left tail beyond some point, then Geffroy’s
condition must necessarily hold. If G has a density then another sufficient
condition is that
Random Search Under Additive Noise 399
where we use the notation of Theorem 2. All events on the right-hand-side have
probability tending to zero with
The best point is the one with the smallest average. Clearly, this strategy cannot
be universal since for good performance, it is necessary that the law of large
numbers applies, and thus that where is the noise random variable.
However, in view of its simplicity and importance, we will return to this solution
in a further subsection.
If we order all the components of the vector so as to
obtain then other measures of goodness may include
the medians of these components, or “quick and dirty” methods such as
Gastwirth’s statistic (Gastwirth, 1966)
consider other rank statistics based on medians of observations or upon the gen-
eralized Wilcoxon statistic (Wilcoxon, 1945; Mann and Whitney, 1947), where
only comparisons between are used.
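As an illustration, a commonly cited form of Gastwirth's statistic weights the lower-third order statistic, the median, and the upper-third order statistic by 0.3, 0.4, and 0.3. The precise order-statistic indices below are our assumption, not the chapter's lost formula:

```python
def gastwirth(sample):
    """Gastwirth's robust location estimate (one common form):
    0.3 * (lower-third order statistic) + 0.4 * median
    + 0.3 * (upper-third order statistic).
    """
    xs = sorted(sample)
    n = len(xs)
    k = n // 3
    lower = xs[k]            # roughly the 1/3 quantile
    upper = xs[n - 1 - k]    # roughly the 2/3 quantile
    mid = n // 2
    median = xs[mid] if n % 2 == 1 else 0.5 * (xs[mid - 1] + xs[mid])
    return 0.3 * lower + 0.4 * median + 0.3 * upper
```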
The number of function evaluations used up to iteration is In addition,
in some cases, some sorting may be necessary, and in nearly all the cases, the
entire array needs to be kept in storage. Also, there are matches to
determine the wins, leading, in the iteration alone, to a complexity of about
This seems extremely wasteful. We discuss some time-saving
modifications further on. We provide a typical result here (Theorem 5) for the
tournament method.
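One plausible reading of the tournament method is sketched below; the exact match rule is not reproduced in the text, so deciding each match by majority vote over r paired noisy comparisons is our assumption:

```python
import random

def tournament_select(points, f, noise, r=15, seed=4):
    """Round-robin tournament among probe points under additive noise.

    Every pair of points plays a match decided by majority vote over
    r paired noisy observations; the point with the most match wins
    is returned.  (Illustrative match rule, not the text's exact one.)
    """
    rng = random.Random(seed)
    n = len(points)
    wins = [0] * n
    for i in range(n):
        for j in range(i + 1, n):
            votes = sum(f(points[i]) + noise(rng) < f(points[j]) + noise(rng)
                        for _ in range(r))
            if votes > r // 2:
                wins[i] += 1
            else:
                wins[j] += 1
    return points[max(range(n), key=wins.__getitem__)]
```

Note the n(n - 1)/2 matches, each consuming r paired observations — precisely the kind of per-iteration cost the text calls wasteful.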
THEOREM 5. Let be chosen by the tournament method. If
then almost surely as
PROOF. Fix and let G be the noise distribution, and let be the
empirical distribution function obtained from observations taken at
Note that can be considered as an estimate of In fact, by
the Glivenko-Cantelli lemma, we know that
almost surely. But much more is true. By an inequality due to Dvoretzky,
Kiefer and Wolfowitz (1956), in a final form derived by Massart (1990), we
have for all
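The inequality referred to states that P(sup_x |F_n(x) - F(x)| > eps) <= 2 exp(-2 n eps^2), with the tight constant 2 due to Massart (1990). A small empirical sanity check of this bound for uniform samples (names and parameters are ours):

```python
import math
import random

def ecdf_sup_deviation(n, eps, trials=200, seed=3):
    """Empirically check the Dvoretzky-Kiefer-Wolfowitz bound
    P( sup_x |F_n(x) - F(x)| > eps ) <= 2 exp(-2 n eps^2)
    for Uniform(0,1) samples, where F(x) = x.
    """
    rng = random.Random(seed)
    exceed = 0
    for _ in range(trials):
        xs = sorted(rng.random() for _ in range(n))
        # sup deviation of the empirical CDF, attained at the jumps
        dev = max(max(abs((i + 1) / n - x), abs(i / n - x))
                  for i, x in enumerate(xs))
        if dev > eps:
            exceed += 1
    return exceed / trials, 2.0 * math.exp(-2.0 * n * eps * eps)

freq, bound = ecdf_sup_deviation(n=100, eps=0.2)
```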
Consider next the tournament, and partition the into four groups according
to whether belongs to one of these intervals:
The cardinalities of the groups are
If holds, then any member of group 1 wins its match against any
member of groups 3 and 4, for at least wins. Any member of group 4
can at most win against other members of group 4 or all members of group 3,
for at most wins. Thus, the tournament winner must come from
groups 1, 2 or 3, unless there is no in any of these groups. Thus,
Hence,
Now,
and
Then
Thus, strong convergence follows from weak convergence if we can prove that
This follows if
and
Define
We recall that
where
We note that if L is the distance between the third and first quartiles of G,
then This is easily seen by partitioning the two quartile
interval of length L into intervals of length or less. The
where
provided that is large enough. For any fixed we can choose so large that
this upper bound is smaller than a given small constant. Thus,
is smaller than a given small constant if we first choose large enough,
and then choose appropriately. This concludes the proof of Theorem 6 when
is used.
Both the sequential and nonsequential strategies can be applied to the case
in which we compare points on the basis of the average of observations
made at This is in fact nothing more than the situation we will encounter
when we wish to minimize a regression function. Indeed, taking averages
would only make sense when the mean exists. Assume thus that we have the
regression model
REFERENCES
Aarts, E. and J. Korst. (1989). Simulated Annealing and Boltzmann Machines,
John Wiley, New York.
Gastwirth, J.L. (1966). “On robust procedures,” Journal of the American Sta-
tistical Association, vol. 61, pp. 929–948.
Gaviano, M. (1975). “Some general results on the convergence of random search
algorithms in minimization problems,” in: Towards Global Optimization,
(edited by L. C. W. Dixon and G. P. Szegö), pp. 149–157, North Holland,
New York.
Geffroy, J. (1958). “Contributions à la théorie des valeurs extrêmes,” Publica-
tions de l’Institut de Statistique des Universités de Paris, vol. 7, pp. 37–185.
Gelfand, S.B. and S. K. Mitter. (1991). “Weak convergence of Markov chain
sampling methods and annealing algorithms to diffusions,” Journal of Opti-
mization Theory and Applications, vol. 68, pp. 483–498.
Geman, S. and C.-R. Hwang. (1986). “Diffusions for global optimization,”
SIAM Journal on Control and Optimization, vol. 24, pp. 1031–1043.
Gidas, B. (1985). “Global optimization via the Langevin equation,” in: Proceed-
ings of the 24th IEEE Conference on Decision and Control, Fort Lauderdale,
pp. 774–778.
Gnedenko, B.V. (1943). “Sur la distribution du terme maximum d’une série
aléatoire,” Annals of Mathematics, vol. 44, pp. 423–453.
Goldberg, D.E. (1989). Genetic Algorithms in Search Optimization and Ma-
chine Learning, Addison-Wesley, Reading, Mass.
Gurin, L.S. (1966). “Random search in the presence of noise,” Engineering
Cybernetics, vol. 4, pp. 252–260.
Gurin, L.S. and L. A. Rastrigin. (1965). “Convergence of the random search
method in the presence of noise,” Automation and Remote Control, vol. 26,
pp. 1505–1511.
Haario, H. and E. Saksman. (1991). “Simulated annealing process in general
state space,” Advances in Applied Probability, vol. 23, pp. 866–893.
Hajek, B. (1988). “Cooling schedules for optimal annealing,” Mathematics of
Operations Research, vol. 13, pp. 311–329.
Hajek, B. and G. Sasaki. (1989). “Simulated annealing—to cool or not,” Systems
and Control Letters, vol. 12, pp. 443–447.
Holland, J.H. (1973). “Genetic algorithms and the optimal allocation of trials,”
SIAM Journal on Computing, vol. 2, pp. 88–105.
Holland, J.H. (1992). Adaptation in Natural and Artificial Systems : An In-
troductory Analysis With Applications to Biology, Control, and Artificial
Intelligence, MIT Press, Cambridge, Mass.
Jarvis, R.A. (1975). “Adaptive global search by the process of competitive
evolution,” IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-
5, pp. 297–311.
Johnson, D.S., C. R. Aragon, L. A. McGeogh, and C. Schevon. (1989). “Opti-
mization by simulated annealing: an experimental evaluation; part I, graph
partitioning,” Operations Research, vol. 37, pp. 865–892.
Rinnooy Kan, A.H.G. and G. T. Timmer. (1984). “Stochastic methods for global
optimization,” American Journal of Mathematical and Management Sci-
ences, vol. 4, pp. 7–40.
Karmanov, V.G. (1974). “Convergence estimates for iterative minimization
methods,” USSR Computational Mathematics and Mathematical Physics,
vol. 14(1), pp. 1–13.
Kiefer, J. and J. Wolfowitz. (1952). “Stochastic estimation of the maximum of a
regression function,” Annals of Mathematical Statistics, vol. 23, pp. 462–466.
Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi. (1983). “Optimization by sim-
ulated annealing,” Science, vol. 220, pp. 671–680.
Koronacki, J. (1976). “Convergence of random-search algorithms,” Automatic
Control and Computer Sciences, vol. 10(4), pp. 39–45.
Kushner, H.J. (1987). “Asymptotic global behavior for stochastic approxima-
tion via diffusion with slowly decreasing noise effects: global minimization
via Monte Carlo,” SIAM Journal on Applied Mathematics, vol. 47, pp. 169–
185.
Lai, T.L. and H. Robbins. (1985) “Asymptotically efficient adaptive allocation
rules,” Advances in Applied Mathematics, vol. 6, pp. 4–22.
Mann, H.B. and D. R. Whitney. (1947). “On a test of whether one or two random
variables is stochastically larger than the other,” Annals of Mathematical
Statistics, vol. 18, pp. 50–60.
Marti, K. (1982). “Minimizing noisy objective functions by random search
methods,” Zeitschrift für Angewandte Mathematik und Mechanik, vol. 62,
pp. T377–T380.
Marti, K. (1992). “Stochastic optimization in structural design,” Zeitschrift für
Angewandte Mathematik und Mechanik, vol. 72, pp. T452–T464.
Massart, P. (1990). “The tight constant in the Dvoretzky-Kiefer-Wolfowitz in-
equality,” Annals of Probability, vol. 18, pp. 1269–1283.
Matyas, J. (1965). “Random optimization,” Automation and Remote Control,
vol. 26, pp. 244–251.
Meerkov, S.M. (1972). “Deceleration in the search for the global extremum of
a function,” Automation and Remote Control, vol. 33, pp. 2029–2037.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
(1953). “Equation of state calculation by fast computing machines,” Journal
of Chemical Physics, vol. 21, pp. 1087–1092.
Mockus, J.B. (1989). Bayesian Approach to Global Optimization, Kluwer Aca-
demic Publishers, Dordrecht, Netherlands.
Männer, R. and H.-P. Schwefel. (1991). “Parallel Problem Solving from Na-
ture,” vol. 496, Lecture Notes in Computer Science, Springer-Verlag, Berlin.
Petrov, V.V. (1975). Sums of Independent Random Variables, Springer-Verlag,
Berlin.
Yakowitz, S.J., T. Jayawardena, and S. Li. (1992). “Theory for automatic learn-
ing under Markov-dependent noise, with applications,” IEEE Transactions
on Automatic Control, vol. AC-37, pp. 1316–1324.
Yakowitz, S.J. and M. Kollier. (1992). “Machine learning for blackjack counting
strategies,” Journal of Statistical Planning and Inference, vol. 33, pp. 295–
309.
Yakowitz, S.J. and W. Lowe. (1991). “Nonparametric bandit methods,” Annals
of Operations Research, vol. 28, pp. 297–312.
Yakowitz, S.J. and E. Lugosi. (1990). “Random search in the presence of noise,
with application to machine learning,” SIAM Journal on Scientific and Sta-
tistical Computing, vol. 11, pp. 702–712.
Yakowitz, S.J. and A. Vesterdahl. (1993). “Contribution to automatic learning
with application to self-tuning communication channel,” Technical Report,
Systems and Industrial Engineering Department, University of Arizona.
Zhigljavsky, A.A. (1991). Theory of Global Random Search, Kluwer Academic
Publishers, Hingham, MA.
Chapter 20
Recent Advances in Randomized Quasi-Monte Carlo Methods
Pierre L’Ecuyer
Département d’Informatique et de Recherche Opérationnelle
Université de Montréal, C.P. 6128, Succ. Centre-Ville, Montréal, H3C 3J7, CANADA
lecuyer@iro.umontreal.ca
Christiane Lemieux
Department of Mathematics and Statistics
University of Calgary, 2500 University Drive N.W., Calgary, T2N 1N4, CANADA
lemieux@math.ucalgary.ca
1. Introduction
To approximate the integral of a real-valued function defined over
the unit hypercube given by
supporting this are that first, the upper bound given in (20.3) is a worst-
case bound for the whole set It does not necessarily reflect the be-
havior of on a given function in this set. Second, it happens often in
practice that even if the dimension is large, the integrand can be well
approximated (in a sense to be specified in the next section) by a sum of
low-dimensional functions. In that case, a good approximation for
can be obtained by simply making sure that the corresponding pro-
jections of on these low-dimensional subspaces are well distributed.
These observations have recently led many researchers to turn to other
tools than the setup that goes with (20.3) for analyzing and improv-
ing the application of QMC methods to practical problems, where the
dimension is typically large, or even infinite (i.e., there is no a priori
bound on it). In connection with these new tools, the idea of randomizing
QMC point sets has been an important contribution that has extended
the practical use of these methods. The purpose of this chapter is to
give a survey of these recent findings, with an emphasis on the theoreti-
cal results that appear most useful in practice. Along with explanations
describing why these methods work, our goal is to provide the reader with
tools for applying QMC methods to his or her own specific problems.
Our choice of topics is subjective. We do not cover all the recent de-
velopments regarding QMC methods. We refer the readers to, e.g., Fox
(1999), Hellekalek and Larcher (1998), Niederreiter (1992b), and Spanier
and Maize (1994) for other viewpoints. Also, the fact that we chose not
to talk more about inequalities like (20.3) does not mean that they are
useless. In fact, the concept of discrepancy turns out to be useful for
defining selection criteria on which exhaustive or random searches to find
“good” sets can be based, as we will see later. Furthermore, we think
it is important to be aware of the discouraging order of magnitude for
required for the rate to be better than and to under-
stand that this problem is simply a consequence of the fact that placing
points uniformly in is harder and harder as increases because the
space to fill becomes too large. This suggests that the success of QMC
methods in practice is due to a clever choice of point sets exploiting the
features of the functions that are likely to be encountered, rather than
to an unexplainable way of breaking the “curse of dimensionality”.
Highly-uniform point sets can also be used for estimating the mini-
mum of a function instead of its integral, sometimes in a context where
function evaluations are noisy. This is discussed in Chapter 6 of Nieder-
reiter (1992b), Chapter 13 of Fox (1999), and was also the subject of
collaborative work between Sid Yakowitz and the first author (Yakowitz,
L’Ecuyer, and Vázquez-Abad 2000).
Recent Advances in Randomized Quasi-Monte Carlo Methods 423
if I is non-empty, and
This property is certainly desirable, for the lack of it means that some of the
ANOVA components of are integrated by fewer than points even if
evaluations of have been done. Secondly, we say that is dimension-
stationary (L’Ecuyer and Lemieux 2000) if for any
such that
that is, only the spacings between the indices in I are relevant in the def-
inition of the projections of a dimension-stationary point set, and
not their individual values. Hence not all non-empty projections of
need to be considered when measuring the quality of since many
are the same. Another advantage of dimension-stationary point sets is
that because the quality of their projections does not deteriorate as the
first index increases, they can be used to integrate functions that have
important ANOVA components associated with subsets I having a large
value of Therefore, when working with those point sets it is not nec-
essary to try rewriting so that the important ANOVA components are
associated with subsets I having a small first index (as is often done;
see, e.g., Fox 1999). We underline that not all types of QMC point sets
have these properties.
3. Main Constructions
In this section, we present constructions for low-discrepancy point sets
that are often used in practice. We first introduce lattice rules (Sloan
and Joe 1994), and a special case of this construction called Korobov
rules (Korobov 1959), which turn out to fit in another type of construc-
tion based on successive overlapping produced by a recurrence
defined over a finite ring. This type of construction is also used to de-
fine pseudorandom number generators (PRNGs) with huge period length
when the underlying ring has a very large cardinality (e.g., low-
discrepancy point sets are, in contrast, obtained by using a ring with a
small cardinality (e.g., between and For this reason, we refer
to this type of construction as small PRNGs, and discuss it after having
introduced digital nets (Niederreiter 1992b), which form an important
family of low-discrepancy point sets that also provides examples of small
PRNGs. Various digital net constructions are described. We also recall
the Halton sequence (Halton 1960), and discuss a method by which the
number of points in a Korobov rule can be increased sequentially, thus
offering an alternative to digital sequences. Additional references regard-
ing the implementation of QMC methods are provided at the end of the
section.
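For concreteness, a Korobov lattice point set with n points and multiplier a in s dimensions is {((i/n)(1, a, a^2, ..., a^(s-1))) mod 1 : i = 0, ..., n-1}; it can be generated as follows (the function name is ours):

```python
def korobov_points(n, a, s):
    """Korobov lattice point set: the i-th point is
    ((i/n) * (1, a, a^2, ..., a^(s-1))) mod 1, for i = 0..n-1.
    """
    points = []
    for i in range(n):
        point, g = [], 1  # g runs through 1, a, a^2, ... (mod n)
        for _ in range(s):
            point.append((i * g % n) / n)
            g = g * a % n
        points.append(tuple(point))
    return points
```

The multiplier a fully determines the rule, which is what makes computer searches for good parameters practical.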
and Joe 1994). For example, if are positive integers such that
for at least one then the lattice point set
2. bijections for
and compute
Then with is a
digital net over R in base
This scheme has been used to construct point sets having a low-
discrepancy property that can be described by introducing the notion of
(q_1, ..., q_s)-equidistribution (Lemieux and L'Ecuyer 2001). Let q_1, ..., q_s be
non-negative integers, and consider the b^(q_1+...+q_s) boxes obtained by
partitioning [0,1)^s into b^(q_j) equal intervals along the j-th axis. If each of
these boxes contains exactly n/b^(q_1+...+q_s) points from a point set P_n,
where n = b^m, then P_n is said to be (q_1, ..., q_s)-equidistributed in base b.
If a digital net is (q_1, ..., q_s)-equidistributed whenever
q_1 + ... + q_s = m - t for some integer t, it is called a (t, m, s)-net (Niederre-
iter 1992b). The smallest integer t having this property is a widely-used
measure of uniformity for digital nets and we call it the t-value of P_n.
Note that the t-value is meaningless if t = m, and that the smaller t is, the
better is the quality of P_n. Criteria for measuring the equidistribution
of digital nets are discussed in more detail in Section 4.2.
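The equidistribution property discussed here can be checked directly for small point sets. The sketch below partitions [0,1)^s into b^(q_1) x ... x b^(q_s) boxes and verifies the equal-count condition; the notation follows Lemieux and L'Ecuyer (2001), and the helper name is ours:

```python
from collections import Counter

def is_equidistributed(points, b, qs):
    """Check (q_1,...,q_s)-equidistribution in base b: partition [0,1)^s
    into b^(q_j) equal intervals along axis j and verify that every box
    holds exactly n / b^(q_1+...+q_s) points.
    """
    n = len(points)
    total = 1
    for q in qs:
        total *= b ** q
    if n % total != 0:
        return False
    # index of the box containing each point, along every axis
    boxes = Counter(tuple(int(x * (b ** q)) for x, q in zip(p, qs))
                    for p in points)
    return len(boxes) == total and set(boxes.values()) == {n // total}
```

For example, the four points (0, 0), (1/4, 1/2), (1/2, 1/4), (3/4, 3/4) in base 2 pass the checks with (q_1, q_2) = (1, 1) and (2, 0).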
Figure 20.2 shows an example of a two-dimensional point set with
n = 81 points in base 3, having the best possible equidistribution;
that is, its t-value is 0 and thus, any partition of the unit square into
ternary boxes of volume 1/81 is such that exactly one point is in each box.
The figure shows two examples of such a partition, into rectangles of sizes
and respectively. The other partitions (not shown
here) are into rectangles of sizes and For all
of these partitions, each rectangle contains one point of This point
set contains the first 81 points of the two-dimensional Faure sequence
in base 3. In this case, in the definition above, all bijections
are the identity function over ; the generating matrix for the first
dimension is the identity matrix, and is given by
posed by Niederreiter and Xing (1997, 1998), as they require the in-
troduction of many concepts from algebraic function fields that go well
beyond the scope of this chapter. These sequences are built so as to
optimize the asymptotic behavior of their t-value as a function of the
dimension, for a fixed base See Pirsic (2001) for a definition of these
sequences and a description of a software implementation.
to the dimension (which means that these sequences are practical only
for small values of An important feature of this construction is that
their t-value has the best possible value provided is of the
form for some integers Assuming the matrices
take the form:
Most definitions and results that we mentioned for lattice rules, which
from now on are referred to as standard lattice rules, have their counter-
part for polynomial lattice rules, as we now explain. First, we refer to
point sets that define polynomial lattice rules as polynomial lattice
point sets, whose rank is equal to the smallest number of basis vectors
required to write as
where A is defined as
(see, e.g., Lemieux and L'Ecuyer 2001), and the coefficients are such
that = The coefficients for satisfy the
recurrence determined by i.e.,
Note that in the definition of Niederreiter (1992b) and Pirsic and Schmid
(2001), the matrices are restricted to rows.
required in the PRNG context because in this case, this point set can be
seen as the sampling space for the PRNG when it is used on a problem
requiring uniform numbers per run. See L’Ecuyer (1994) for details on
this aspect of PRNGs and more.
We now describe two particular cases of this type of construction that
provide an alternative way of generating Korobov-type lattice point sets
(either standard or polynomial).
where the integers are typically chosen as the first prime numbers
sorted in increasing order, and is the radical-inverse function
in base defined by
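The radical-inverse function simply reflects the base-b digits of the index about the radix point. A direct implementation (names are ours):

```python
def radical_inverse(i, b):
    """Radical-inverse function: reflect the base-b digits of i about
    the radix point, e.g. i = 6 = 110 in base 2 -> 0.011 = 0.375."""
    x, inv_b = 0.0, 1.0 / b
    f = inv_b
    while i > 0:
        x += (i % b) * f  # next digit, one place further right
        i //= b
        f *= inv_b
    return x

def halton_point(i, bases):
    """i-th Halton point, one radical inverse per coordinate."""
    return tuple(radical_inverse(i, b) for b in bases)
```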
3.6. Implementations
The points of a digital net in base can be generated efficiently
using a Gray code. This idea was first suggested by Antonov and Saleev
(1979) for the Sobol' sequence, and for other constructions by, e.g., Hong
and Hickernell (2001), Pirsic and Schmid (2001), Tezuka (1995), and
Bratley, Fox, and Niederreiter (1992). Assuming the idea is
to modify the order in which the points are generated by replacing the
digits from the expansion of by the Gray code
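The standard binary reflected Gray code of i is i XOR (i >> 1). Since consecutive Gray codes differ in a single bit, each new point of a base-2 digital net can be obtained from the previous one by adding a single column of the generating matrix. A sketch of the enumeration order (the incremental matrix update itself is omitted):

```python
def gray_code(i):
    """Standard binary reflected Gray code of i."""
    return i ^ (i >> 1)

def gray_order_indices(m):
    """Order in which the 2^m points of a base-2 digital net are
    generated incrementally: consecutive codes differ in exactly one
    bit, so each new point needs only one column update."""
    return [gray_code(i) for i in range(1 << m)]
```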
4. Measures of Quality
In this section, we present a number of criteria that have been pro-
posed in the literature for measuring the uniformity (or non-uniformity)
of a point set in the unit hypercube i.e., for measuring the
criterion and the quality measure described above have been used
for selecting generators defining sequences of nested Korobov lattice
point sets by Hickernell, Hong, L'Ecuyer, and Lemieux (2001). Note
that can be considered as a special case of the weighted spectral test
of Hellekalek (1998), Definition 6.1. Various other measures of quality
for lattice rules can be found in Niederreiter (1992b), Sloan and Joe
(1994), Hickernell (1998b), and the references therein.
That is, measures the regularity of all projections and returns the
worst case. Inside this definition of we can also normalize the value of
so that projections over subspaces of different dimensions are judged
more equitably, in the same way as the value is used to normalize
in the criterion To do so, we can use the lower bound for in
dimensions, given by Niederreiter and Xing (1998)
and define,
for example. The idea behind this is that a large value of for a low-
dimensional subset I is usually worse than when I is high-dimensional
and therefore it should be more penalized.
The definition of the that we used so far is of a geometrical
nature, similar to the interpretation of as the inverse of the
distance between the hyperplanes of a lattice. Interestingly, just like
The analysis of Niederreiter and Pirsic (2001) assumes that the generat-
ing matrices are but we will explain shortly why it remains valid
even if we start with matrices and truncate them.
Let be the null space of the row space of C, i.e.,
We refer to as the dual space of the digital net from now on. Define
the following norm on : for any nonzero let
and Define the norm of a
vector by
We now explain why this result is valid even if the matrices have
been truncated to their first rows. Let denote the dual space that
would be obtained without the truncation. Observe that by definition,
Also, using Proposition 1 of Niederreiter and Pirsic (2001) and the fact
that the dimension of the row space of C is not larger than we have
that
Therefore,
then
where is the resolution of the polynomial lattice point set. This re-
sult is discussed by Couture, L’Ecuyer, and Tezuka (1993), Couture
and L’Ecuyer (2000), Lemieux and L’Ecuyer (2001), and Tezuka (1995).
The resolution is often used for measuring the quality of PRNGs based
on linear recurrences modulo 2 such as Tausworthe generators (Tootill,
Robinson, and Eagle 1973; L’Ecuyer 1996). From a geometrical point
of view, the resolution is the largest integer for which is
equidistributed. Obviously, if
This concept can be extended from polynomial lattice point sets to
general digital nets by replacing by above. More precisely, we
have:
Proof: The proof of this proposition requires results given in the forth-
coming sections, and it can be found in the appendix.
The resolution can be computed for any projection of for
let
point sets (Lemieux and L'Ecuyer 2001), and it could also be used for
any digital net:
where the set has the same meaning as in the definition of the
criterion in (20.14).
Another criterion is the digital version of the quality measure It
is closely related to the dyadic dyaphony (Hellekalek and Leeb 1997)
and the weighted spectral test (Hellekalek 1998), and was introduced by
Lemieux and L’Ecuyer (2001) for polynomial lattice point sets in base
2. It uses a norm W(h) defined as
where
powers of the bases are typically used in practice, and the quality of
is measured only for (or less) projections, where is
small. This illustrates the fact that for comparisons to be fair between
different constructions, the t-value should be considered in
conjunction with the base. Algorithms to compute the t-value are discussed by
Pirsic and Schmid (2001).
The resolution has been used by Sobol’ for finding optimal values to
initialize recurrences defining the direction numbers in his construction
(Sobol' 1967; Sobol' 1976). More precisely, his Property A means that
the first points of the sequence have the maximal resolution of 1,
and his Property A' means that the first points have the maximal
resolution of 2.
Following ideas from Morokoff and Caflisch (1994) and Cheng and
Druzdzel (2000), a criterion related to is used to find initial
direction numbers for the Sobol' sequence in dimensions in the forth-
coming RandQMC library (Lemieux, Cieslak, and Luttmer 2001); the max-
imum in (20.21) is taken over all two-dimensional subsets I of the form
where is the dimension for which we want to find
initial direction numbers, and Examples of parame-
ters for polynomial lattice point sets chosen with respect to for
different values of are given by Lemieux and L'Ecuyer (2001).
5. Randomizations
Once a construction is chosen for and the approximation given
by (20.2) is computed, it is usually important to have an estimate of
the error For that purpose, upper bounds of the form (20.3)
are not very useful since they are usually much too conservative, in
addition to being hard to compute and restricted to a possibly small set
of functions. Instead, one can randomize the set so that: 1) each point
in the randomized point set has a uniform distribution over
2) the regularity (or low-discrepancy property) of as measured by
a specific quality criterion, is preserved under the randomization. The
first property guarantees that the approximation
When is a lattice point set, the length of the shortest vector asso-
ciated with any projection is preserved under this randomization.
An explicit expression for the variance of in that case will be given
in Section 6.1.
With this randomization, each shifted point mod 1)
is uniformly distributed over Therefore, even if the dimension
is much larger than the number of points and if many coordinates
are equal within a given point (for instance, when comes from a
LCG with a small period, as in Section 3.3), these coordinates become
mutually independent after the randomization. Hence each point has the
same distribution as in the MC method; the difference with MC is that
the points of the shifted lattice are not independent. See L’Ecuyer
and Lemieux (2000), Section 10.3, for a concrete numerical example
with and These properties also hold for the other
randomizations described below.
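The random shift (often attributed to Cranley and Patterson) adds a single uniform vector to every point, modulo 1 in each coordinate. A sketch (names are ours):

```python
import random

def random_shift(points, seed=5):
    """Random-shift (Cranley-Patterson) randomization: add one uniform
    vector U to every point of the set, modulo 1 coordinate-wise."""
    rng = random.Random(seed)
    s = len(points[0])
    u = [rng.random() for _ in range(s)]
    return [tuple((x + ui) % 1.0 for x, ui in zip(p, u)) for p in points]
```

Repeating the shift with independent draws of U gives i.i.d. unbiased copies of the estimator, whose sample variance then estimates the integration error; the pairwise differences between points (mod 1), and hence the lattice structure, are preserved.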
if and
we compute
where
and let
This randomization was suggested to us by Raymond Couture for
point sets based on linear recurrences modulo 2 (see also Lemieux and
L’Ecuyer 2001). It is also used in an arbitrary base (along with other
more time-consuming randomizations) in Hong and Hickernell (2001)
and (1998) as we will see in Section 5.4. It is best suited for
digital nets in base and its application preserves the resolution and
the t-value of any projection.
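In base 2, this digital shift amounts to XORing the binary digits of every coordinate with those of a single uniform random vector. A sketch, where the truncation to w digits is our simplification:

```python
import random

def digital_shift_base2(points, w=32, seed=6):
    """Random digital shift in base 2: XOR the first w binary digits of
    every coordinate with those of one uniform random vector."""
    rng = random.Random(seed)
    s = len(points[0])
    shift = [rng.getrandbits(w) for _ in range(s)]
    scale = 1 << w
    return [tuple((int(x * scale) ^ d) / scale for x, d in zip(p, shift))
            for p in points]
```

Because XOR is an involution, applying the same shift twice recovers the original points, and the digit-level (linear) structure of a base-2 digital net is preserved.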
5.3. Scrambling
This randomization has been proposed by Owen (1995), and it also
preserves the t-value of a digital net and its resolution, for any projec-
tion. It works as follows: in each dimension partition the
interval [0, 1) into equal parts and permute them uniformly and ran-
domly; then, partition each of these sub-intervals into equal parts and
permute them uniformly and randomly; etc. More precisely, to scramble
L digits one needs to randomly and uniformly generate several indepen-
dent permutations of the integers [0... (assuming a specific
bijection has been chosen to identify the elements in with those in
if is not prime), where
and compute
where
(1998) points out, if and no two points have the same first
digits in each dimension (i.e., for each the unidimensional projection
has a maximal resolution of then the permutations after level
are independent for each point and therefore, the random digits
for can be generated uniformly and independently over Hence
in this case we do not need to generate any permutation after level
Owen (1995), Section 3.3, has suggested a similar implementation.
When the base and the number of digits to scramble are large, the amount of memory required for storing
all the permutations becomes very large, and only a partial scrambling
might then be feasible; that is, scramble the leading digits and generate the
remaining ones randomly and uniformly (Fox 1999), or reuse the permutations
for the later digits (Tan and Boyle 2000). A clever way of avoiding
storage problems is discussed by Matoušek (1998), and a related idea
is used in Morohosi’s code (which can be found at www.misojiro.t.u-tokyo.ac.jp/~morohosi) for scrambling Faure sequences. The idea
is to avoid storing all the permutations by reinitializing the
underlying PRNG appropriately, so that the permutations can be regenerated as
they are needed. This is especially useful when the base is large, which
happens when Faure sequences are used in large or even moderate dimensions.
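The regeneration trick can be sketched as follows: instead of storing the tree of permutations that the nested scrambling requires, derive each permutation deterministically from a PRNG seeded by the digit prefix that indexes it. We scramble a single coordinate only, and all names below are ours; this is a sketch of the idea, not Morohosi's actual code:

```python
import random

def nested_scramble(x, b=2, L=10, seed=12345):
    """Owen-style nested scrambling of one coordinate in base b.  The
    permutation applied to digit k depends on the k-1 preceding digits
    (the 'prefix'); rather than storing all permutations, each one is
    regenerated on demand from a PRNG seeded by (seed, prefix)."""
    digits = []
    prefix = ()
    y = x
    for _ in range(L):
        y *= b
        d = int(y)        # next digit of x in base b
        y -= d
        rng = random.Random(f"{seed}|{prefix}")  # permutation keyed by the prefix
        perm = list(range(b))
        rng.shuffle(perm)
        digits.append(perm[d])
        prefix += (d,)    # deeper permutations depend on the earlier digits
    return sum(dk * b ** -(k + 1) for k, dk in enumerate(digits))
```

Because the same seed regenerates the same permutations, all points of a net are scrambled consistently without ever storing the 1 + b + b² + ... permutations per dimension.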
5.5. Others
We briefly mention some other ideas that can be used to randomize
QMC point sets. In addition to the linear scrambling, Matoušek (1998)
proposes randomization techniques for digital sequences that are easier
to generate than the scrambling method, while retaining enough randomness
for the purpose of some specific theoretical analyses. Hong and
Hickernell (2001) suggest another form of linear scrambling that incorporates
transformations proposed by Faure and Tezuka (2001). Randomizations
that use permutations in each dimension to reorder the Halton sequence
are discussed by Braaten and Weller (1979) and Morokoff and Caflisch
(1994). Wang and Hickernell (2000) propose to randomize this sequence
by randomly generating its starting point.
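As an illustration of the last idea, a simple variant draws a random starting index and generates the Halton points from there (Wang and Hickernell draw the starting point itself at random; the index-based version below, with names of our choosing, conveys the same idea):

```python
import random

def radical_inverse(n, b):
    """Van der Corput radical inverse: reflect the base-b digits of n
    about the radix point."""
    inv, denom = 0.0, 1.0
    while n > 0:
        n, d = divmod(n, b)
        denom *= b
        inv += d / denom
    return inv

def random_start_halton(npts, bases=(2, 3), rng=random):
    """Halton points started at a random index n0, one prime base per
    coordinate."""
    n0 = rng.randrange(1 << 30)
    return [tuple(radical_inverse(n0 + n, b) for b in bases)
            for n in range(npts)]
```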
Some authors (Ökten 1996; Spanier 1995) suggest partitioning the
set of dimensions into two subsets (typically of successive indices)
and then using a QMC method
(randomized or not) on one subset and MC on the other. One of
the justifications for this approach is that some digital nets (e.g., Sobol’
sequence) are known to have projections with better properties
when I contains small indices; this suggests using QMC on the first few
dimensions and MC on the remaining ones. However, this argument
becomes irrelevant if a dimension-stationary point set is used. More im-
portantly, if the QMC point set is randomized and can be shown (or
presumed) to do no worse than MC in terms of its variance, then there
is no advantage or “safety net” gained by using MC on one part of the
problem. Estimators obtained by “padding” randomized QMC point
sets with MC are analyzed by Owen (1998a) and Fox (1999). Owen
(1998a) discusses some other padding techniques, as well as a method
called Latin Supercube Sampling to handle very high-dimensional prob-
lems.
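A padded estimator of this kind can be sketched as follows. The helper names are ours; here the QMC part is a randomly shifted Halton set, and which coordinates receive QMC treatment is up to the user:

```python
import random

def radical_inverse(n, b):
    """Van der Corput radical inverse of n in base b."""
    inv, denom = 0.0, 1.0
    while n > 0:
        n, d = divmod(n, b)
        denom *= b
        inv += d / denom
    return inv

def padded_points(npts, qmc_dims, mc_dims, rng=random):
    """Pad a QMC point set with MC: the first qmc_dims coordinates come
    from a randomly shifted (modulo 1) Halton set, the remaining mc_dims
    coordinates are i.i.d. uniforms."""
    primes = [2, 3, 5, 7, 11, 13][:qmc_dims]
    shift = [rng.random() for _ in primes]   # random shift modulo 1
    pts = []
    for n in range(npts):
        qmc = [(radical_inverse(n + 1, b) + s) % 1.0
               for b, s in zip(primes, shift)]
        mc = [rng.random() for _ in range(mc_dims)]
        pts.append(tuple(qmc + mc))
    return pts
```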
The result (20.25) was proved independently by Tuffin (1998), but under
the stronger assumption that the integrand has an absolutely convergent Fourier
series. Notice that, by contrast with the MC estimator, there is no factor
of 1/n multiplying the sum of squared Fourier coefficients for the
randomly shifted lattice rule estimator. Hence in the worst case, the
variance of this estimator could be n times as large as the MC estimator’s variance.
This worst case corresponds to an extremely unlucky pairing of
function and point set, for which all the Fourier mass sits on the dual lattice. However, in
the expression for the variance, the coefficients are summed only
over the dual lattice, which contains n times fewer points than the set
over which the sum is taken in the MC case. Therefore, if the dual
lattice is such that the squared Fourier coefficients are smaller “on
average” over it than over the whole set of frequencies, then the variance of the randomly
shifted lattice rule estimator will be smaller than the variance of the MC estimator.
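The randomly shifted lattice rule estimator discussed above is easy to state in code. In this sketch (our notation), z is a generating vector for a rank-1 lattice; the one used below is illustrative, not an optimized choice:

```python
import random

def shifted_lattice_estimate(f, n, z, rng=random):
    """Cranley-Patterson randomization of a rank-1 lattice rule: average
    f over the n lattice points {i*z/n mod 1} shifted modulo 1 by one
    uniform vector U.  The estimator is unbiased for the integral of f."""
    u = [rng.random() for _ in z]
    total = 0.0
    for i in range(n):
        x = tuple(((i * zj) / n + uj) % 1.0 for zj, uj in zip(z, u))
        total += f(x)
    return total / n

# Example: integrate f(x, y) = x*y over the unit square (true value 1/4).
est = shifted_lattice_estimate(lambda x: x[0] * x[1], n=1021, z=(1, 306))
```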
From the results given in the previous proposition, different bounds on
the error and variance can be obtained by making additional assumptions
on the integrand (Sloan and Joe 1994; Hickernell 1998b; Hickernell
2000). Most of these bounds involve one of two quality measures;
hence a point set that minimizes one of these two criteria minimizes
a bound on the error or variance for the class of functions for which
those bounds hold. Such analyses often provide arguments in favor of
these criteria. A different type of analysis, based on the belief that the
largest squared Fourier coefficients tend to be associated with “short
vectors” h, corresponding to the low-frequency terms of the integrand, suggests that
the lattice point set should be chosen so that its dual lattice does not contain those
“short” vectors. From this point of view, a spectral-type criterion seems
appropriate, since it ensures that the dual lattice does not contain vectors with a
small Euclidean length. This criterion also has the advantage of
usually being much faster to compute than the alternatives (Entacher, Hellekalek,
and L’Ecuyer 2000; Hickernell, Hong, L’Ecuyer, and Lemieux 2001).
Recent Advances in Randomized Quasi-Monte Carlo Methods 455
where
where
Using this‚ Owen obtains the following bound on the variance of the
scrambled-net estimator:
Proposition 3 (Owen 1998b, Theorem 1) Let the estimator be constructed
from a scrambled digital net with n points. For any square-integrable
function,
(a)
(b) is uniformly distributed over
Larcher, Niederreiter, and Schmid (1996) have shown that the above
sum is 0 when h satisfies
Proof: If then h · since
is in the row space of C for all and the result follows
easily. If then where We are interested
in the scalar product h · for Notice that
= which is the image of under the
application of a mapping that corresponds to the multiplication by
Since the dimension of this image is 1 and the dimension of
the kernel of this mapping is thus Hence each element in has
pre-images in under this mapping, and therefore as a multiset
{h · contains copies of each element of Using
this and the fact that
where otherwise}.
Hence in the digital shift case, in comparison with MC, the contribution
of a basis function to the variance expression is either multiplied
by n (if h is in the dual space) or by 0, whereas in the scrambled
case, this contribution is multiplied by 0 for “small vectors”, and otherwise by a
factor that can be upper-bounded by a quantity independent of n.
This factor being sometimes as large as n in the digital shift case prevents us
from bounding the variance by a constant times the MC variance. Similarly, the case of
smooth functions yields a variance bound for digitally-shifted
estimators that is larger by a factor of n than the order of the
bound obtained for scrambled-type estimators. On the other hand, the
digital shift is a very simple randomization that is easy to implement; the estimator
can typically be constructed in the same (or less) time as the
MC estimator based on the same number of points.
Based on the expression (20.30) for the variance of the digitally shifted
estimator, the same type of heuristic arguments as those given for randomly
shifted lattice rules can be used to justify selection criteria
for choosing digital nets. That is, if we assume that the largest
Walsh coefficients are those associated with “small” vectors h, then it is
reasonable to choose the net so that the dual space does not contain those
small vectors. If we use the norm ||h|| defined in (20.20) to measure h,
this suggests using a criterion based on the resolution;
if instead we use the norm V(h) defined in (20.18), then the criterion or
the variant defined in (20.16) should be employed. We refer the reader
to Hellekalek and Leeb (1997) and Tezuka (1987) for additional connections
between Walsh expansions and nonuniformity measures (e.g., the
so-called ‘Walsh-spectral test’ of Tezuka). Note that criteria based on
the resolution are faster to compute than the alternatives.
8. Related Methods
We now discuss integration methods that are closely related to QMC
methods but that do not exactly fit the framework presented so far.
First, a natural extension of the estimators considered so far would be to
assign weights to the different evaluation points; that is, for a point set,
define
Additional results on this method can be found in‚ e.g.‚ Avramidis and
Wilson (1996)‚ Owen (1998a)‚ and the references therein. In particular‚
see Owen (1992a) and Loh (1996b) for results showing that the LHS
estimator obeys a central-limit theorem. A related method that extends
the uniformity property of LHS and has close connections with digital
nets is a randomized orthogonal array design; we refer the reader to Owen
(1992b, 1994, 1995, 1997) and Loh (1996a) for more on this method.
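Since the LHS estimator plays a role here, a minimal construction may help (names ours): each dimension is divided into n strata, exactly one point falls in each stratum, and the strata are matched across dimensions by independent random permutations.

```python
import random

def latin_hypercube(n, s, rng=random):
    """Latin hypercube sample of n points in s dimensions: one point per
    stratum [k/n, (k+1)/n) in every dimension, with an independent random
    permutation per dimension deciding which point gets which stratum."""
    cols = []
    for _ in range(s):
        perm = list(range(n))
        rng.shuffle(perm)
        cols.append([(k + rng.random()) / n for k in perm])
    return [tuple(col[i] for col in cols) for i in range(n)]
```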
Appendix: Proofs
Proof of Proposition 1: The result is obtained by first generalizing Proposition 5.2 of
Lemieux and L’Ecuyer (2001) to arbitrary digital nets in base This can be done
by using Lemma 2 from Section 6.2. More precisely, we show that is
equidistributed if and only if where
Consider the class of all real-valued functions that are constant on each
of the boxes in the definition of Clearly, is
if and only if the corresponding point set integrates every
function with zero error. But due to its periodic structure, each function
has a Walsh expansion of the form
We now show that which will prove the result. Since the
resolution is it means that is not Therefore,
the matrix L formed by concatenating the transposes of the first
rows of each generating matrix has a row space whose dimension is strictly smaller
than Hence there exists a nonzero vector x in such that
Furthermore, we can assume since would contradict our
assumption that has a resolution of Define
by
for any then (20.A.3) is equal to zero and the result is proved.
To show (20.A.4), it suffices to observe that
Acknowledgments
This work was supported by NSERC-Canada individual grants to the two authors
and by an FCAR-Québec grant to the first author. We thank Bennett L. Fox, Fred
J. Hickernell, Harald Niederreiter, and Art B. Owen for helpful comments and sug-
gestions.
References
Acworth, P., M. Broadie, and P. Glasserman. (1997). A comparison
of some Monte Carlo and quasi-Monte Carlo techniques for option
pricing. In Monte Carlo and Quasi-Monte Carlo Methods in Scien-
tific Computing, ed. P. Hellekalek and H. Niederreiter, Number 127
in Lecture Notes in Statistics, 1–18. Springer-Verlag.
Åkesson, F., and J. P. Lehoczky. (2000). Path generation for quasi-Monte
Carlo simulation of mortgage-backed securities. Management Sci-
ence 46:1171–1187.
Antonov, I. A., and V. M. Saleev. (1979). An economic method of com-
puting LP_τ-sequences. Zh. Vychisl. Mat. i Mat. Fiz. 19:243–245.
In Russian.
Avramidis, A. N., and J. R. Wilson. (1996). Integrated variance reduc-
tion strategies for simulation. Operations Research 44:327–346.
G.Yin
Department of Mathematics
Wayne State University
Detroit, MI 48202
gyin@math.wayne.edu
Q. Zhang
Department of Mathematics
University of Georgia
Athens, GA 30602
qingz@math.uga.edu
Abstract This chapter is concerned with large-scale hybrid stochastic systems, in which the
dynamics involve both continuously evolving components and discrete events.
Corresponding to different discrete states, the dynamic behavior of the under-
lying system could be markedly different. To reduce the complexity of these
systems, singularly perturbed Markov chains are used to characterize the sys-
tem. Asymptotic expansions of probability vectors and the structural properties
of these Markov chains are provided. The ideas of decomposition and aggrega-
tion are presented using two typical optimal control problems. Such an approach
leads to control policies that are simple to obtain and perform nearly as well as
the optimal ones with substantially reduced complexity.
1. INTRODUCTION
In memory of our distinguished colleague and dear friend Sidney Yakowitz,
who made significant contributions to mathematics, control and systems the-
ory, and operations research, we write this chapter to celebrate his lifetime
achievements and to survey some of the most recent developments in singu-
larly perturbed Markov chains and their applications in control and optimiza-
tion of large-scale systems under uncertainty, which are related to Sid’s work
on automatic learning, adaptive control, and nonparametric theory (see Lai
and Yakowitz (1995); Yakowitz (1969); Yakowitz et al. (1992); Yakowitz et
al. (2000) and the references therein). Many adaptive control and learning
problems give rise to Markov decision processes. In solving such problems,
one often has to face the curse of dimensionality. The singular perturbation
approach is an effort in the direction of reduction of complexity.
Our study is motivated by the desire to solve numerous control and
optimization problems in engineering, operations research, management, biology,
and the physical sciences. To show why Markovian models are useful and
preferable, we begin with the well-known Leontief model, which is a dynamic
system of a multi-sector economy (see, for example, Kendrick (1972)). The
classical formulation is: Let there be sectors, the output of sector at
time and the demand for the product of sector at time Denote
and let
be the amount of commodity that sector needs in production and the
proportion of commodity that is transferred to commodity Write
and The matrix B is termed a Leontief input-output matrix. The
Leontief dynamic model is given by
where is a continuous-time Markov chain (see Yin and Zhang (2001) for
more details). In fact, the use of Markovian models has assumed a prominent
role in time series analysis, financial engineering, and economic systems (see
Hamilton and Susmel (1994); Hansen (1992) and the references therein). In
addition to the modeling point mentioned above, many systems arising from
communication, manufacturing, reliability, and queueing networks, among oth-
ers, exhibit jump discontinuity in their sample paths. A common practice is to
resort to Markovian jump processes in modeling and optimization.
This chapter is devoted to such Markovian models having large state spaces
with complex structures, which frequently appear in various applications and
which may cause serious obstacles in obtaining optimal controls for the under-
lying systems. The traditional approach of dynamic programming for obtaining
optimal control does not work well in such systems. The large size that renders
computation infeasible is known as the “curse of dimensionality.”
Owing to the pervasive applications of Markovian formulation in numerous
areas, there has been resurgent interest in further exploring various properties of
Markov chains. Hierarchical structure, a feature common to many systems of
practical concerns has proved very useful for reducing complexity. As pointed
out by Simon and Ando (1961), all systems in the real world have a certain
hierarchy. Therefore it is natural to use the ideas of decomposition and aggre-
gation for complexity reduction. Because the transitions (switches or jumps)
in various states of a large-scale system often occur at different rates, the de-
composition and aggregation of the states of the corresponding Markov chain
can be achieved according to their rates of changes. Taking advantage of the
hierarchical structure, the first step is to divide a large task into smaller pieces.
The subsequent decompositions and aggregations will lead to a simpler ver-
sion of the originally formidable system. Such an idea has been applied to
queueing networks for resource organization, to computer systems for memory
level aggregation, to economic models for complexity reduction (see Courtois
(1977); Simon and Ando (1961)), and to manufacturing systems for production
planning (see Sethi and Zhang (1994)). Owing to their different time scales,
the problems fit reasonably well into the framework of singular perturbation
in control and optimization; see Abbad et al. (1992); Gershwin (1994); Pan
and (1995); Sethi and Zhang (1994); Yin and Zhang (1997b); Yin and
Zhang (1998) and the references therein. Related work on singularly perturbed
Markov chains can be found in Di Masi et al. (1995); Pervozvanskii and Gaits-
478 MODELING UNCERTAINTY
see Sethi and Zhang (1994) and Yin and Zhang (1998), Appendix for details.
Even if the demand is a constant, it is difficult to obtain the closed-form
solution of the optimal control, not to mention the added difficulty due to the
possible large state space (i.e., being a large number) since to find the
optimal control using the dynamic programming approach requires solving
equations of the form (1.3). To overcome the difficulty, we introduce a
small parameter and assume that the Markov chain is generated by
where Q is an irreducible generator of a continuous-time Markov
chain. Using a singular perturbation approach, we can derive a limit problem
in which the stochastic capacity is replaced by its average with respect to the
stationary measures. The limit problem is in fact deterministic and is much
simpler to solve. Intuitively, a higher level manager in a manufacturing firm
need not know every details of floor events, only an averaged overview will be
sufficient for the upper level decision making; see Sethi and Zhang (1994) for
more discussion along this line.
Mathematically, for sufficiently small the problem can be approxi-
mated by a limit control problem with dynamics
such that and are themselves generators. A Markov chain with generator
given in (2.4) is a singularly perturbed Markov chain. The simple example
given below illustrates the effect of the small parameter
Example. Suppose that is a Markov chain having four states
Let
Simulation of sample paths of the Markov chain yields Fig. 21.1, which displays
paths for and respectively.
Observe that the small parameter has a squeezing effect that rescales
the sample paths of the Markov chain. When it is small, the chain generated
by the fast part of the generator undergoes rapid variations. Combined with the
slow part, this produces a generator that has both a rapidly varying part and a
slowly changing part, with weak and strong interactions.
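The squeezing effect is easy to reproduce in simulation. The two generator matrices of the chapter's example are not reproduced above, so the sketch below substitutes illustrative four-state generators with the same structure (a block-diagonal fast part with two weakly irreducible blocks, and a slow part coupling them); all names are ours:

```python
import random

def perturbed_generator(Qt, Qh, eps):
    """The singularly perturbed generator Q(eps) = Qt/eps + Qh of (2.4)."""
    n = len(Qt)
    return [[Qt[i][j] / eps + Qh[i][j] for j in range(n)] for i in range(n)]

def count_jumps(Q, T, x0=0, rng=random):
    """Simulate a continuous-time Markov chain with generator Q on [0, T]
    (exponential holding times, embedded jump chain) and count the jumps."""
    t, x, jumps = 0.0, x0, 0
    while True:
        t += rng.expovariate(-Q[x][x])   # holding time at rate -q_xx
        if t >= T:
            return jumps
        targets = [j for j in range(len(Q)) if j != x]
        x = rng.choices(targets, weights=[Q[x][j] for j in targets])[0]
        jumps += 1

# Illustrative generators: two fast blocks coupled by a slow part.
Qt = [[-1, 1, 0, 0], [1, -1, 0, 0], [0, 0, -1, 1], [0, 0, 1, -1]]
Qh = [[-1, 0, 1, 0], [0, -1, 0, 1], [1, 0, -1, 0], [0, 1, 0, -1]]
```

Simulating with eps = 1 and eps = 0.01 exhibits the rescaling: the smaller eps is, the more rapidly the chain switches within each block.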
Let
be the probability vector associated with the Markov chain. Then it is known
that satisfies the differential equation
A direct application of the singular perturbation idea may not work here since, although (2.6) is a
linear system of differential equations, the generator has an eigenvalue 0. A
first glance may lead one to believe that the solution is unbounded in the limit. However,
this is not the case. Removing the eigenvalue 0 from the spectrum of the generator, the
remaining eigenvalues all lie in the left half of the complex plane, producing
rapid convergence to the stationary distribution.
In view of (2.4), the matrix dominates the asymptotics. Let us con-
centrate on the fast changing part of the generator first. According to A.N.
Kolmogorov, the states of any Markov chain can be classified as either recur-
rent or transient. It is also known that any finite-state Markov chain has at
least one recurrent state, i.e., not all states are transient. As a result (Iosifescu,
1980, p.94), by appropriate arrangements, it is always possible to write the
corresponding generator as either
Singularly Perturbed Markov Chains 483
or
The generator (2.7) corresponds to a Markov chain that has recurrent classes,
whereas (2.8) corresponds to a Markov chain that includes transient states in
addition to the recurrent classes. For the stationary cases, (2.7) and (2.8)
together exhaust all possibilities of practical concern. In the discussion
to follow, we will use the notion of weak irreducibility (see the appendix for a
definition), which is an extension of the usual notion of irreducibility. We will also
deal with partitioned matrices of the forms (2.7) and (2.8). Nevertheless, in
lieu of recurrent classes, we will consider weakly irreducible classes, which
generalize the classical formulation.
Step 1. Separate the entries of the matrix based on their orders of magnitude.
The numbers {1,2} are at a scale (order of magnitude) different from that of
484 MODELING UNCERTAINTY
the numbers {10, –11, –12, 21, –22, 30, –33}. So we write Q as
Step 2. Adjust the entries to make each of the above matrices a generator.
This requires moving entries so that each of the two matrices satisfies the
conditions of a generator, i.e., nonnegative off-diagonal elements,
nonpositive diagonal elements, and zero row sums. After such a rearrangement the
matrix Q becomes
Step 3. Permute the columns and rows so that the dominating matrix is of a
desired block diagonal form (corresponding to the weakly irreducible blocks).
In this example, exchanging the orders of and in i.e., considering
yields
Let Then
Note that the decomposition procedure is not unique. We can also write
for some
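Steps 1 and 2 can be carried out mechanically. Since the chapter's matrix is not reproduced above, the sketch below (our names) applies the same two steps to a stand-in generator whose off-diagonal entries have the same two orders of magnitude:

```python
def split_generator(Q, thresh):
    """Steps 1-2: separate the off-diagonal entries of Q by order of
    magnitude (|entry| >= thresh goes to the fast part), then fix each
    diagonal so both parts have zero row sums, i.e. are generators."""
    n = len(Q)
    fast = [[0.0] * n for _ in range(n)]
    slow = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                (fast if abs(Q[i][j]) >= thresh else slow)[i][j] = Q[i][j]
        fast[i][i] = -sum(fast[i][j] for j in range(n) if j != i)
        slow[i][i] = -sum(slow[i][j] for j in range(n) if j != i)
    return fast, slow

# Stand-in generator: off-diagonal entries of magnitudes ~1-2 and ~10-30.
Q = [[-11, 10, 1, 0],
     [21, -22, 0, 1],
     [2, 0, -32, 30],
     [0, 2, 10, -12]]
fast, slow = split_generator(Q, thresh=5)
```

Here fast + slow reproduces Q and each part is itself a generator; Step 3 (permuting rows and columns to reveal the block-diagonal fast part) amounts to a relabeling of the states.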
Assume the setting above. Then one of the following two alternatives holds: (1) the
homogeneous equation has only the zero solution, in which
case the relevant point belongs to the resolvent set of A, the resolvent is bounded, and the
inhomogeneous equation has exactly one solution
for each right-hand side; (2) the homogeneous equation has a nonzero
solution, in which case the inhomogeneous equation has a
solution iff the right-hand side is annihilated by every solution of the adjoint equation
or
where each is itself a transition matrix within the weakly irreducible class
for and the last row corresponds to the transient states.
The two-time-scale interpretation becomes a normal rate of change versus a
slow rate of change. Let be the solution of
and write
The immediate questions are: How good is such an approximation? What is the
asymptotic distribution of the error? Using basic probability arguments, we can show
a mean-square estimate for the unscaled
occupation measure. Such an estimate implies that a scaled sequence of
the occupation measures may have a nontrivial asymptotic distribution. Define
For the definitions of weak convergence and jump diffusions, see the ap-
pendix. Loosely speaking, a switching diffusion process is a combination of
switching processes and diffusion processes. It possesses both diffusive behavior
and jump properties. Switching diffusions are widely used in
many applications. For example, in a manufacturing system, the demand may
be modeled as a diffusion process and the machine production rate as a
Markov jump process.
where
Observe that
with probability one (w.p.1). The central limit theorem implies that
The last line above is a consequence of the fact that is normally distributed.
Chernoff’s bound,
where denotes the usual inner product. It follows that is also irre-
ducible. For each let be the Perron–Frobenius eigenvalues of
the matrix (see Seneta (1981) for a definition). Define
By using the Gärtner-Ellis Theorem (see Dembo and Zeitouni (1998)), we can
derive the following large deviations bounds: For any set
where and denote the interior and the closure of G, respectively, and
[0, T] with its derivative being Lipschitz, and is also Lipschitz. Then, there
exist and K > 0 such that for and
in which
The above results are mainly for time-varying generators. Better results can
be obtained for constant and where the corresponding conclusion is that
there exist positive constants and K such that for and
and
and
In solving this problem, one will face the difficulty caused by large dimension-
ality if the state space is large (i.e., is a large number), where a total of
equations must be solved. To resolve this, we aggregate the states in as one
state and obtain an aggregated process defined by when
The process is not necessarily Markovian. However, using
certain probabilistic arguments, we have shown in Yin and Zhang (1998); Yin
et al. (2000a) that converges weakly to the process generated by
which has the form
and
and
4.2. DISCRETE-TIME LQ
This section discusses the discrete-time LQ problem involving a singularly
perturbed Markov chain. Many systems and physical models in economics,
biology, and engineering are represented in discrete time, mainly because
various measurements are only available at discrete instants. As a result, the
planning decisions, strategic policies, and control actions regarding the underlying
systems are made at discrete times. The continuous-time models can
be regarded as an approximation to the “discrete” reality. For example, using
Define
and
The optimal feedback control for the LQ problem is linear in the state variable:
As noted before, the solution of the problem depends entirely on the number
of equations to be solved. Thus the reduction of complexity is achieved in that
only the reduced set of equations needs to be solved, as compared to the full set
of equations in the original problem. If the original number of equations is substantially larger
than the reduced one, the complexity is significantly diminished. It can be shown that
the state of the original problem converges weakly to the limit state, which is the solution of the hybrid system
where
and
For the two-dimensional dynamic system (4.42) and the cost (4.43), we work with
a time horizon [0, T] with T = 5. We use a small step size to discretize
the limit Riccati equations. The trajectories of the resulting solutions
are given in Fig. 21.5. The simulation results
show that the continuous-time hybrid LQG problem closely approximates the
corresponding discrete-time linear quadratic regulator problem.
Using this limit system, we can find its optimal control
where
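The computational core of such an LQ problem is a backward Riccati recursion. The chapter's problem has Markov-modulated coefficients; the sketch below strips the modulation away and shows only the scalar, un-modulated backbone (all names ours), which is the kind of recursion solved for each aggregated state in the reduced problem:

```python
def lqr_gains(A, B, M, N, horizon):
    """Backward Riccati recursion for the scalar discrete-time LQ problem
    x_{k+1} = A x_k + B u_k with stage cost M x^2 + N u^2 and zero
    terminal cost.  Returns gains K_0..K_{T-1}; the optimal feedback
    control is u_k = -K_k x_k, linear in the state."""
    P = 0.0
    gains = []
    for _ in range(horizon):
        K = A * B * P / (N + B * B * P)
        P = M + A * A * P - (A * B * P) ** 2 / (N + B * B * P)
        gains.append(K)
    gains.reverse()          # the recursion runs backward in time
    return gains
```

For A = B = M = N = 1 the gains converge (away from the terminal time) to the stationary value 1/φ ≈ 0.618, where φ is the golden ratio, the positive root of P² = P + 1.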
5. FURTHER REMARKS
This work provides a survey on using singularly perturbed Markov chains
to model uncertainty in various applications, with the aim of reducing the
complexity of large-scale systems. We have mainly focused on finite-state Markov
chains. For future work, we point out:
is called a diffusion.
The concept of weak convergence is a substantial generalization of convergence
in distribution in elementary probability theory. Let P and P_n (n ≥ 1)
denote probability measures defined on a metric space. The
sequence {P_n} is said to converge weakly to P if
A Markov process having piecewise constant sample paths and taking values
in either (for some positive integer or
is called a Markov chain. The set is termed the state space of the Markov
chain. The function
that is, there exists a constant K such that for all and
for and and (2) for all
bounded real-valued functions defined on
is a martingale.
We say that a generator or the corresponding Markov chain is weakly
irreducible if the system of equations
Denote Let
with satisfies
Then
Thus the reconstructed process and the original process have the
same distribution.
REFERENCES
Abbad, M., J.A. Filar, and T.R. Bielecki. (1992). Algorithms for singularly
perturbed limiting average Markov control problems, IEEE Trans. Automat.
Control AC-37, 1421-1425.
Bertsekas, D. (1995). Dynamic Programming and Optimal Control, Vol. I & II,
Athena Scientific, Belmont, MA.
Billingsley, P. (1999). Convergence of Probability Measures, 2nd Ed., J. Wiley,
New York.
Blankenship, G. (1981). Singularly perturbed difference equations in optimal
control problems, IEEE Trans. Automat. Control T-AC 26, 911-917.
Courtois, P.J. (1977). Decomposability: Queueing and Computer System Applications,
Academic Press, New York.
Chung, K.L. (1967). Markov Chains with Stationary Transition Probabilities,
Second Edition, Springer-Verlag, New York.
Davis, M.H.A. (1993). Markov Models and Optimization, Chapman & Hall,
London.
Delebecque, F. and J. Quadrat. (1981). Optimal control for Markov chains
admitting strong and weak interactions, Automatica 17, 281-296.
Yakowitz, S., R. Hayes, and J. Gani. (1992). Automatic learning for dynamic
Markov fields with application to epidemiology, Oper. Res. 40, 867-876.
Yakowitz, S., P. L’Ecuyer, and F. Vázquez-Abad. (2000). Global stochastic
optimization with low-dispersion point sets, Oper. Res. 48, 939-950.
Yang, H., G. Yin, K. Yin, and Q. Zhang. (2001). Control of singularly perturbed
Markov chains: A numerical study, to appear in J. Australian Math. Soc. Ser.
B: Appl. Math.
Yin, G. (2001). On limit results for a class of singularly perturbed switching
diffusions, to appear in J. Theoretical Probab.
Yin, G. and M. Kniazeva. (1999). Singularly perturbed multidimensional switching
diffusions with fast and slow switchings, J. Math. Anal. Appl. 229, 605-630.
Yin, G. and J.F. Zhang. (2001). Hybrid singular systems of differential equations,
to appear in Scientia Sinica.
Yin, G. and Q. Zhang. (1997a). Control of dynamic systems under the influence
of singularly perturbed Markov chains, J. Math. Anal. Appl. 216, 343-367.
Yin, G. and Q. Zhang (Eds.) (1997b). Mathematics of Stochastic Manufacturing
Systems, Proc. 1996 AMS-SIAM Summer Seminar in Applied Mathematics,
Lectures in Applied Mathematics, LAM 33, Amer. Math. Soc., Providence, RI.
Yin, G. and Q. Zhang. (1998). Continuous-time Markov Chains and Applications:
A Singular Perturbation Approach, Springer-Verlag, New York.
Yin, G. and Q. Zhang. (2000). Singularly perturbed discrete-time Markov
chains, SIAM J. Appl. Math. 61, 834-854.
Yin, G., Q. Zhang, and G. Badowski. (2000a). Asymptotic properties of a singularly
perturbed Markov chain with inclusion of transient states, Ann. Appl.
Probab. 10, 549-572.
Yin, G., Q. Zhang, and G. Badowski. (2000b). Singularly perturbed Markov
chains: Convergence and aggregation, J. Multivariate Anal. 72, 208-229.
Yin, G., Q. Zhang, and G. Badowski. (2000c). Occupation measures of singularly
perturbed Markov chains with absorbing states, Acta Math. Sinica 16,
161-180.
Yin, G., Q. Zhang, and G. Badowski. (2000d). Decomposition and aggregation
of large-dimensional Markov chains in discrete time, preprint.
Yin, G., Q. Zhang, and Q.G. Liu. (2000e). Error bounds for occupation measure
of singularly perturbed Markov chains including transient states, Probab.
Eng. Informational Sci. 14, 511-531.
Yin, G., Q. Zhang, H. Yang, and K. Yin. (2001). Discrete-time dynamic systems
arising from singularly perturbed Markov chains, to appear in Nonlinear
Anal., Theory, Methods Appl.
Rolando Cavazos–Cadena
Departamento de Estadística y Cálculo
Universidad Autónoma Agraria Antonio Narro
Buenavista, Saltillo COAH 25315
MÉXICO* †
Emmanuel Fernández–Gaucherand
Department of Electrical & Computer Engineering
& Computer Science
University of Cincinnati
Cincinnati‚ OH 45221-0030
USA‡
emmanuel@ececs.uc.edu
Abstract This work concerns discrete-time Markov decision processes with denumerable
state space and bounded costs per stage. The performance of a control policy
is measured by a (long-run) risk-sensitive average cost criterion associated with a
utility function with constant risk sensitivity coefficient, and the main objective
of the paper is to study the existence of bounded solutions to the risk-sensitive
average cost optimality equation for arbitrary values of the risk sensitivity
coefficient. The main results are as follows: When the state space is finite,
if the transition law is communicating, in the sense that under an arbitrary
stationary policy transitions are possible
*This work was partially supported by a U.S.-México Collaborative Program‚ under grants from the National
Science Foundation (NSF-INT 9602939)‚ and the Consejo Nacional de Ciencia y Tecnología (CONACyT)
(No. E 120.3336)‚ and United Engineering Foundation under grant 00/ER-99.
†We dedicate this paper to our wonderful colleague, Sid Yakowitz, a scholar and friend whom we greatly
miss.
‡The support of the PSF Organization under Grant No. 200-350–97–04 is deeply acknowledged by the first
author.
between every pair of states, the optimality equation has a bounded solution for
arbitrary non-null risk sensitivity. However, when the state space is infinite and denumerable,
the communication requirement and a strong form of the simultaneous Doeblin
condition do not, in general, yield a bounded solution to the optimality equation if the risk
sensitivity coefficient has a sufficiently large absolute value.
Keywords: Markov decision processes, Exponential utility function, Constant risk sensitivity,
Constant average cost, Communication condition, Simultaneous Doeblin
condition, Bounded solutions to the risk-sensitive optimality equation.
1. INTRODUCTION
This work considers discrete–time Markov decision processes (MDP’s) with
(finite or infinite) denumerable state space and bounded costs. Besides a stan-
dard continuity–compactness requirement‚ the main structural feature of the
decision model is that‚ under the action of each stationary policy‚ every pair
of states communicate (see Assumption 2.3 below). On the other hand‚ it is
assumed that the decision maker grades two different random costs according
to the expected value of an exponential utility function with (non-null) constant
risk sensitivity coefficient and the performance index of a control policy
is the risk–sensitive (long–run) average cost criterion. Within this context,
the main purpose of the paper is to study the existence of bounded solutions
to the risk–sensitive average cost optimality equation corresponding to a
non-null value of the risk sensitivity coefficient, which, under the
continuity–compactness conditions in Assumption 2.1, yields an optimal
stationary policy with constant risk–sensitive average cost. Thus, we are
concerned in this paper with
fundamental theoretical issues. The reader is referred to a growing body of
literature in the application of risk-sensitive models in operations research and
engineering‚ e.g.‚ Fernández-Gaucherand and Marcus (1997)‚ Avila-Godoy et
al. (1997); Avila-Godoy and Fernández-Gaucherand (1998); Avila-Godoy and
Fernández-Gaucherand (2000); Shayman and Fernández-Gaucherand (1999).
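The criterion in (2.4)–(2.6) is not reproduced in this excerpt. In the literature cited above, the risk-sensitive average cost of a policy from a given initial state typically takes the form below; the symbols $\lambda$ (the risk sensitivity coefficient), $C$ (the bounded one-stage cost), $X_t$, and $A_t$ are our notation, not taken from the source:

```latex
J(\pi, x) \;=\; \limsup_{n \to \infty} \frac{1}{n\,\lambda}\,
  \log E_x^{\pi}\!\left[\exp\!\left(\lambda \sum_{t=0}^{n-1} C(X_t, A_t)\right)\right]
```

Here $\lambda > 0$ corresponds to risk-averse and $\lambda < 0$ to risk-seeking behavior, the two cases distinguished after Theorem 3.1.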
The study of stochastic dynamical systems with risk–sensitive criteria can
be traced back‚ at least‚ to the work of Howard and Matheson (1972)‚ Jacob-
son (1973), and Jaquette (1973; 1976). In particular, in Howard and Matheson
(1972) the case of MDP’s with finite state and action spaces was considered
and, under Assumption 2.3 below and assuming aperiodicity of the transi-
tion matrix induced by each stationary policy, a solution to the optimality
equation was obtained via the Perron–Frobenius theory of positive matrices
for arbitrary non-null risk sensitivity coefficients.
Recently, there has been increasing interest in MDP’s endowed with
risk–sensitive criteria (Cavazos–Cadena and Fernández–Gaucherand‚ 1998a–
d; Fernández–Gaucherand and Marcus‚ 1997; Fleming and McEneaney‚ 1995;
Fleming and Hernández–Hernández‚ 1997b; Hernández–Hernández and Mar-
cus‚ 1996; James et al.‚ 1994; Marcus et al.‚ 1996; Runolfsson‚ 1994; Whit–
Notation. Throughout the remainder and stand for the set of real numbers
and nonnegative integers, respectively, and for
Given a nonempty set S endowed with denotes the space of
all measurable real-valued and bounded functions defined on S, and
is the supremum norm of On the other hand, given
denotes the indicator function of that is if
and Finally, for an event W the corresponding indicator
function is denoted by I[W] and, as usual, all relations involving conditional
expectation are supposed to hold true almost everywhere with respect to the
underlying probability measure without explicit reference.
with the (usual) convention that the minimum of the empty set is +∞.
Assumption 2.4.
(a) The state and action spaces are finite (notice that in this situation Assumption
2.1 is automatically satisfied);
(b) Assumption 2.3 holds and the transition matrix induced by an arbitrary
stationary policy is aperiodic.
On the other hand, it has recently been shown in Cavazos–Cadena and Fernández–
Gaucherand (1998a) that, even when the state space is finite, under Assumptions
2.1 and 2.2 the optimality equation has a bounded solution only if the risk
sensitivity coefficient is small enough, and an example was given showing that
this conclusion cannot be extended to arbitrary values of the coefficient. The
difference between the conclusions in Cavazos–Cadena and Fernández–Gaucherand
(1998a) and Howard and Matheson (1972) comes from the different settings in
the two papers. In
particular, Assumption 2.3 is imposed in Howard and Matheson (1972), but
not in Cavazos–Cadena and Fernández–Gaucherand (1998a), and an additional
aperiodicity condition is used in the latter reference.
3. MAIN RESULTS
The main problem considered in the paper consists in studying whether Assumptions
2.1–2.3 yield (bounded) solutions to the optimality equation for arbitrary values
of the risk sensitivity coefficient. It turns out that the answer depends on the
state space: if S is finite, Assumption 2.3, combined with Assumption 2.1,
implies the existence of a solution to the optimality equation for arbitrary
non-null values; this result is presented below as Theorem 3.1 and extends that in
Howard and Matheson (1972). On the other hand, as will be shown via a
detailed example, such a conclusion cannot be extended to the case in which S is
countably infinite, thus providing an extension of the results in Cavazos–Cadena
and Fernández–Gaucherand (1998a).
Finite State Space Models.
The following theorem shows that Assumption 2.1 and Assumption 2.3 are
sufficient to guarantee a (bounded) solution to the optimality equation for
arbitrary values of the non-null risk sensitivity coefficient. Hence, our
results extend those in Howard and Matheson (1972) in that the aperiodicity in
Assumption 2.4(b) is not required, and in that our results hold for both the
risk-seeking and risk-averse cases.
Theorem 3.1. Let the state space S be finite and suppose that Assumptions 2.1
and 2.3 hold true. In this case, for every there exist a constant
and a function such that the following are true.
(i) The pair satisfies the
Proposition 3.1. For the MDP in Example 3.1 above, Assumptions 2.1–2.3
hold true and, moreover, the transition matrix in (3.2) is aperiodic.
Proof. Assumption 2.1 clearly holds in this example, since A is a singleton.
To verify Assumption 2.2, let and notice that, since from state
transitions are possible to or to state it is not difficult to see that
Observe now that, for Example 3.1, the optimality equation reduces to the following
Poisson equation:
In the following proposition it will be shown that this equation does not admit
a bounded solution if is large enough. First, let be determined by
Proposition 3.2. For the MDP in Example 3.1, with the following
assertions hold:
(i)
(ii)
(iii) For there is no pair satisfying (3.5).
Proof. (i) From (3.2) and (3.4), it is not difficult to see that
for every positive integer so that
which is equivalent to
Using (a)–(d), (3.9) implies, via the dominated convergence theorem, that
i.e.,
The first result establishes that Assumptions 2.1 and 2.3 together imply a
strong form of the Simultaneous Doeblin condition.
The proof of this theorem, based on ideas related to the risk–neutral average
cost criterion, is contained in Appendix A. Suppose now that the initial state
is The following result provides a lower bound for the probability of
reaching a state before returning to
Theorem 4.2. Let with arbitrary but fixed and suppose that
Assumptions 2.1 and 2.3 hold true. In this case,
(i) There exists a constant such that, for every
The arguments leading to the proof of this result are also based on ideas using
the risk–neutral average cost criterion, and are presented in Appendix B.
The following lemma will be useful in the proof of Theorem 3.1 (which is
presented in Section 7).
Lemma 4.1. Suppose that Assumptions 2.1 and 2.3 hold true, and let
be such that for some and
Let be fixed.
(i) For every positive integer
and (iii)
so that
Notice now that for a given Theorem 4.1 implies that there exists a
positive integer such that so that
so that
(iii) Since
part (ii) yields that
whereas
(ii) If then
(a) for every
moreover,
(b) The following optimality equation holds:
Similarly,
(iii ) If then
Proof. (i) Let be an arbitrary policy, and observe that for every
Jensen’s inequality yields
completing the proof of part (i), since it is clear that see (5.1) and
(5.2).
(ii) Let and be arbitrary and for a fixed state with
define the combined policy as follows: At time
for every whereas for and
if for whereas
where the equality used that and coincide before the state is reached in
a positive time. Therefore, under the condition it follows
that
In this case,
In this case
(i) There exists such that
Moreover,
(ii) for every
Proof. Let be such that
(i) The starting point is the optimality equation in Theorem 5.1(ii):
where
and then
(ii) First, notice that the inequality in (5.12) yields
it follows that
so that, setting
where K is as in Theorem 4.1, and combining this inequality with (5.14) and
(5.15) it follows that
Then, setting
The proof of this result is presented in two parts below, as Lemmas 6.1 and
6.2. First, it is convenient to introduce some useful notation.
and
Via the monotone convergence theorem, this inequality and (6.1) together yield
that
Therefore, so that
(iii) By parts (i) and (ii), there exists a policy such that
whereas
(i) By Theorem 6.1, (7.1) and (7.2) yield that and then, the optimality
equations in Theorem 5.1 imply that
and
8. CONCLUSIONS
This paper considered Markov decision processes endowed with the risk–
sensitive average cost optimality criterion in (2.4)–(2.6). The main result of the
paper, namely, Theorem 3.1, shows that under standard continuity–compactness
conditions (see Assumption 2.1), the communication condition in Assumption
2.3 guarantees the existence of a solution to the optimality equation stated in
(3.1) for arbitrary values of the risk sensitivity coefficient when the state
space is finite. Furthermore, it was
shown via Example 3.1, that the conclusions in Theorem 3.1 cannot be extended
to the case of countably infinite state space models. Hence, the results presented
in the paper significantly extend those in Howard and Matheson (1972), and also
the recent work of the authors presented in Cavazos–Cadena and Fernández–
Gaucherand (1998a-d); see Remark 3.1.
REFERENCES
REFERENCES
Arapostathis, A., V. S. Borkar, E. Fernández–Gaucherand, M. K. Ghosh and S.
I. Marcus (1993). Discrete–time controlled Markov processes with average
cost criteria: a survey, SIAM Journal on Control and Optimization, 31,
282–334.
Avila-Godoy, G., A. Brau and E. Fernández-Gaucherand. (1997). “Controlled
Markov chains with discounted risk-sensitive criteria: applications to
machine replacement,” in Proc. 36th IEEE Conference on Decision and Control,
San Diego, CA, pp. 1115-1120.
Avila-Godoy, G. and E. Fernández-Gaucherand. (1998). “Controlled Markov
chains with exponential risk-sensitive criteria: modularity, structured policies
and applications,” in Proc. 37th IEEE Conference on Decision and Control,
Tampa, FL, pp. 778-783.
Avila-Godoy, G.M. and E. Fernández-Gaucherand. (2000). “Risk-Sensitive In-
ventory Control Problems,” in Proc. Industrial Engineering Research Con-
ference 2000, Cleveland, OH.
Bertsekas, D.P. (1987). Dynamic Programming: Deterministic and Stochastic
Models. Prentice-Hall, Englewood Cliffs.
Borkar, V.S. (1984). On minimum cost per unit of time control of Markov
chains, SIAM Journal on Control and Optimization, 21, 965–984.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998a). Controlled Markov
Chains with Risk-Sensitive Criteria: Average Cost, Optimality Equations,
and Optimal Solutions, ZOR: Mathematical Methods of Operations Re-
search. To appear.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998b). Controlled Markov
Chains with Risk–Sensitive Average Cost Criterion: Necessary Conditions
for Optimal Solutions Under Strong Recurrence Assumptions. Submitted for
Publication.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998c). Markov Deci-
sion Processes with Risk–Sensitive Average Cost Criterion: The Discounted
Stochastic Games Approach. Submitted for publication.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998d). The Vanishing
Discount Approach in Markov Chains with Risk–Sensitive Criteria. Submit-
ted for publication.
Fernández–Gaucherand, E., A. Arapostathis and S.I. Marcus. (1990). Remarks
on the Existence of Solutions to the Average Cost Optimality Equation in
Markov Decision Processes, Systems and Control Letters, 15, 425–432.
Fernández-Gaucherand, E. and S.I. Marcus. (1997). Risk-Sensitive Optimal
Control of Hidden Markov Models: Structural Results. IEEE Transactions
on Automatic Control, 42, 1418-1422.
Theorem A. Let the state space be finite and suppose that Assumptions 2.1 and 2.3 hold true. In this
case, the simultaneous Doeblin condition is valid. More explicitly, there exists such
that
Notice that the conclusion of Theorem A refers to the class of stationary policies, whereas
Theorem 4.1 involves the family of all policies. However, Theorem 4.1 can be deduced from The-
orem A as in Cavazos–Cadena and Fernández–Gaucherand (1998a) or Hernández–Hernández
and Marcus (1996).
The proof of Theorem A has been divided into three steps presented in the following three
lemmas. To begin with, notice that, since S is finite, Assumption 2.2 implies that the Markov
chain induced by a stationary policy has a unique invariant distribution, denoted by
and characterized as in Loève (1980)
Using that S is finite, it is clear that and since each has these
properties; moreover, for each
where the last equality used (A.1) and Assumption 2.1. Therefore, has all the properties to
be the unique invariant distribution of the transition matrix determined by so that
(ii) As already noted, by Assumption 2.3, so that using the equality
(Loève, 1980), the assertion follows from part (i).
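The invariant distribution characterized above is unique because S is finite and the chain induced by a stationary policy communicates; it can also be computed numerically. A minimal sketch, using a hypothetical three-state transition matrix that is our illustration and not taken from the text:

```python
import numpy as np

def invariant_distribution(P):
    """Unique invariant distribution of a finite, communicating chain:
    solve mu @ P = mu with sum(mu) = 1 by replacing one (redundant)
    balance equation with the normalization constraint."""
    n = P.shape[0]
    A = np.vstack([(P.T - np.eye(n))[:-1], np.ones(n)])
    b = np.zeros(n)
    b[-1] = 1.0
    return np.linalg.solve(A, b)

# Hypothetical transition matrix of the chain induced by some
# stationary policy (illustration only).
P = np.array([[0.5, 0.5, 0.0],
              [0.2, 0.3, 0.5],
              [0.0, 0.4, 0.6]])
mu = invariant_distribution(P)
```

Since the balance equations mu P = mu are linearly dependent for a stochastic matrix, one of them is dropped in favor of the normalization, which makes the linear system nonsingular for a communicating chain.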
(c)
Therefore, denoting the generated by by the Markov
property yields,
so that
Lemma A.3. Let be fixed and let be the indicator function of i.e.,
and set
Then,
(i) and there exists such that
(ii) Define by
and then Lemma A.1 yields the existence of at which the supremum is achieved. In this
case, since by Assumption 2.3.
(ii) Notice that the equality follows from (A.5). Also, using the Markov property, it
is not difficult to see that for every
which is equivalent to
To establish (A.6), pick an arbitrary pair and define the discrepancy function
and the policy by
and
and then,
Combining this last equality with (A.7) and using the fact that it follows that
Therefore (see (A.9)), and then, since the
pair was arbitrary,
Lemma B.1. Suppose that Assumptions 2.1 and 2.3 hold true. For given define
the number of times in which the state process visits before returning to the initial
state in a positive time, by
where K is as in Theorem 4.1. Hence, taking expectation with respect to it follows that
and then
Proof of Theorem 4.2. (i) Let be given with Since the simultaneous Doeblin
condition holds, there exists and such that (Arapostathis et al., 1993;
Jaquette, 1973; Jaquette, 1976)
If for each is such that minimizes the term in brackets in (B.1), it follows
that
and
Therefore‚ (B.3) allows us to conclude‚ via the dominated convergence theorem‚ that
(ii) and
(iii)
and then, by Assumption 2.1, there exists a policy such that for every
From this equality, a simple induction argument using the Markov property
yields that for every positive integer and
George G. Roussas
University of California‚ Davis *
*This work was supported in part by a research grant from the University of California‚ Davis.
1. INTRODUCTION
Traditionally‚ much of statistical inference has been carried out under the ba-
sic assumption that the observations involved are independent random variables
(r.v.s). Most of the time‚ this is augmented by the supplemental assumption that
the r.v.s also have the same distribution‚ so that we are dealing‚ in effect‚ with
independent identically distributed (i.i.d.) r.v.s. This set-up is based on two
considerations‚ one of which is mathematical convenience‚ and the other the
fact that these requirements are‚ indeed‚ met in a broad variety of applications.
As for the stochastic or statistical models employed‚ they are classified mainly
as parametric and nonparametric. In the former case‚ it is stipulated that the
underlying r.v.s are drawn from a model of known functional form except for a
parameter belonging to a (usually open) subset‚ the parameter space‚ of the
Euclidean space In the latter case‚ it is assumed that the underlying
model is not known except that it is a member of a wide class of possible models
obeying some very broad requirements.
The i.i.d. paradigm described above has subsequently been extended to cover
cases where dependence of successive observations is inescapable. An early
attempt to model such situations was the introduction of Markov processes‚
based on the Markovian property. According to this property‚ and in a discrete
time parameter framework‚ the conditional distribution of given the
entire past and the present‚ depends only on the present‚
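In symbols (a standard rendering of the Markovian property; the process notation $X_n$ is ours, as the original display is not reproduced in this excerpt):

```latex
P\bigl(X_{n+1} \in B \mid X_0, X_1, \ldots, X_n\bigr)
  \;=\; P\bigl(X_{n+1} \in B \mid X_n\bigr)
```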
For statistical inference purposes‚ most of the time we assume the existence
of probability density functions (p.d.f.s). Then‚ if these p.d.f.s depend on a
parameter as described above‚ we are dealing with parametric inference (about
if not‚ we are in the realm of nonparametric inference. Both cases will be
addressed below.
According to Markovian dependence‚ the past is irrelevant or‚ to put it
perhaps differently‚ the past is summarized by the present in our attempt to
make probability statements about the future. Should that not be the case‚ then
new statistical models must be invented‚ where the entire past enters into
the picture. A class of such models is known under the general name of mixing.
There are several modes of mixing used in the literature‚ but in this paper we
shall confine ourselves to only three of them. The fundamental characteristic
embodied in the concept of mixing is that the past and the future‚ as expressed
by the underlying stochastic process‚ are approximately independent‚ provided
they are sufficiently far apart. Precise definitions are given in Section 3 (see
Some Aspects of Statistical Inference in a Markovian and Mixing Framework 557
2. MARKOVIAN DEPENDENCE
Let be r.v.s constituting a (strictly) stationary Markov process,
defined on a probability space open subset
of and taking values in the real line As it has already been stated,
most often in Statistics we assume the existence of p.d.f.s. To this effect, let
be the initial p.d.f. (the p.d.f. of with respect to a dominating measure
such as the Lebesgue measure), and let be likewise the p.d.f. of the joint
distribution of Then is the p.d.f. of the transition
distribution of given
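Writing, in our notation, $f(x_0;\theta)$ for the initial p.d.f. and $f(x_0, x_1;\theta)$ for the joint p.d.f. of two consecutive observations, the transition p.d.f. just described is the ratio

```latex
f(x_1 \mid x_0; \theta) \;=\; \frac{f(x_0, x_1; \theta)}{f(x_0; \theta)}
```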
and let
where all logarithms are taken with base Under suitable regularity conditions,
a MLE, exists and enjoys several optimality proper-
ties. Actually, the parametric inference results in a Markovian framework were
well summarized in the monograph Billingsley (1961a) to which the interested
reader is referred for the technical details. From (2.1) and (2.2), we have the
likelihood equations
or just
by omitting the first term in (2.3) since all results are asymptotic.
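The likelihood and the likelihood equations in (2.1)–(2.4) are not displayed in this excerpt. As a hedged illustration of the same recipe (transition densities multiplied along the chain, with the initial-density term dropped), here is a minimal sketch for a Gaussian AR(1) Markov process; the model, the parameter name `theta`, and the sample size are our assumptions, not the source's:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate a stationary Markov process: Gaussian AR(1),
# X_j = theta * X_{j-1} + eps_j, whose transition p.d.f. is N(theta*x, 1).
theta_true = 0.6
n = 20000
x = np.empty(n)
x[0] = rng.normal()
for j in range(1, n):
    x[j] = theta_true * x[j - 1] + rng.normal()

# Likelihood equation: sum_j d/dtheta log f(X_j | X_{j-1}; theta) = 0,
# omitting the initial-density term (harmless asymptotically, as in the
# text). For the Gaussian transition density the root is closed-form:
theta_hat = np.sum(x[:-1] * x[1:]) / np.sum(x[:-1] ** 2)
```

Theorems 2.1 and 2.2 below then give consistency and asymptotic normality of such a root of the likelihood equations under conditions (C1)–(C2).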
The results to be stated below hold under certain conditions imposed on the
underlying Markov process. These conditions include strict stationarity for the
process‚ as already mentioned‚ absolute continuity of the initial and the tran-
sition probability measures‚ joint measurability of the respective p.d.f.s‚ their
differentiability up to order three‚ integrability of certain suprema‚ finiteness
of certain expectations‚ and nonsingularity of specified matrices. Their precise
formulation is as follows and can be found in Billingsley (1961a); see also
Billingsley (1961b).
Then
(iv) For each there is a neighborhood of it‚ lying in such
that‚ for each and and as above‚ it holds:
where
and
Theorem 2.1. Under conditions (C1)–(C2) and for each the following
are true: (i) There is a solution of (2.4), with probability
tending to 1; (ii) This solution is a local maximum of
(that is, essentially, of the log-likelihood function); (iii) The vector
in (iv) If is another sequence
which satisfies (i) and (iii), then
Thus, this theorem ensures, essentially, the existence of a consistent (in the
probability sense) MLE of
The MLE of the previous theorem is also asymptotically normal, when
properly normalized. Rephrasing the relevant part of Theorem 2.2 in Billingsley
(1961a) (see also Billingsley (1961b)), we have
Theorem 2.2. Let be the MLE whose existence is ensured by Theorem 2.1.
Then‚ under conditions (C1)-(C2)‚
This is‚ actually‚ given in Theorem 3.1 of Billingsley (1961a)‚ which states that:
Classical results on the MLE in the i.i.d. case were derived by Wald (1941‚
1943). See also Roussas (1965b) for an extension to the Markovian case of a
certain result by Wald‚ and Roussas (1968a) for an asymptotic normality result
of the MLE in a Markovian framework again. Questions regarding asymptotic
efficiency of the MLE‚ in the i.i.d. case again‚ are discussed in Bahadur (1964)‚
and for generalized MLE in Weiss and Wolfowitz (1966). When an estimate
is derived by the principle of maximizing the probability of concentration over
a certain class of sets‚ the resulting estimate is called a maximum probability
estimate. Much of the relevant information may be found in Wolfowitz (1965)‚
and Weiss and Wolfowitz (1967‚ 1970‚ 1974). Parametric statistical inference
for general stochastic processes is discussed in Basawa and Prakasa Rao (1980).
so that
Then‚ Theorems 2.4-2.6 stated below hold under the following set of as-
sumptions.
For some comments on these assumptions and examples where they are
satisfied‚ see pages 45-52‚ in Roussas (1972). Theorems 2.4-2.6‚ given below‚
are a consolidated restatement of Theorems 4.1-4.6‚ pages 53-54‚ and Theorem
1.1‚ page 72‚ in the reference just cited.
(ii)
(iii)
Also‚
Theorem 2.5. In the notation of the previous theorem‚ and under the same
assumptions‚
in probability.
It should be mentioned at this point that‚ unlike the classical approach‚ The-
orem 2.5 is obtained from Theorem 2.4 without much effort at all. This is so
because of the contiguity of the sequences and established in
Proposition 6.1‚ pages 65-66‚ in Roussas (1972)‚ in conjunction with Corollary
7.2‚ page 35‚ Lemma 7.1‚ pages 36-37‚ and Theorem 7.2‚ pages 38-39‚ in the
same reference.
As an application of Theorems 2.4 and 2.5‚ one may construct tests‚ based
essentially on the log-likelihood function‚ which are either asymptotically uni-
formly most powerful or asymptotically uniformly most powerful unbiased.
For the justification of the above assertion and certain variations of it‚ see
Theorems 3.1 and 3.2‚ Corollaries 3.1 and 3.2‚ and subsequent examples in
pages 100–107 of Roussas (1972).
A rough interpretation of part in Theorem 2.4 is that‚ for all sufficiently
large and in the neighborhood of the likelihood function
behaves as follows:
Then‚ we have
For a discussion of this application‚ see Theorem 5.1‚ pages 115–121 in Roussas
(1972).
Theorem 2.6 may be used in testing the hypothesis even if
The result obtained enjoys several asymptotic optimal properties. Actually‚ the
discussion of such properties is the content of Theorems 2.1 and 2.2, pages
170–171, Theorem 4.1, pages 183–184, and Theorems 6.1 and 6.2, pages
191–196, in Roussas (1972). This theorem can also be employed in obtaining a
certain representation of the asymptotic distribution of a class of estimates;
these estimates need not be MLEs.
To be more precise, for an arbitrary and any set
so that for all sufficiently large Let be a class of estimates
of defined as follows:
and it stays in any given state before it moves to the next state a random
amount of time. This time depends both on and and is not necessarily
exponential, as is the case in Markovian processes. It may be shown that, under
suitable conditions, the process may be represented as follows:
where is a Markov chain taking
values in and are stopping times taking values in
{1, 2, ...}. For an extensive discussion of this problem, the interested reader is
referred to Roussas and Bhattacharya (1999b). See also Akritas and Roussas
(1980).
where
Some of the properties of the estimates given by (2.16) - (2.18) are summa-
rized in the theorem below. The assumptions under which these results hold‚
as well as their justification‚ can be found in Theorems 2.2‚ 3.1‚ 4.2‚ 4.3 and
Corollary 3.1 of Roussas (1969a)‚ and in Theorems 4.1 and 4.2 of Roussas
(1988b).
and these convergences are uniform over compact subsets of and re-
spectively.
and
where
The estimates given by (2.19) and (2.20) are seen to enjoy the familiar Glivenko-
Cantelli uniform strong consistency property; namely‚
(i)
The justification of these results is given in Theorems 3.1 and 3.2 of Roussas
(1969b).
In addition‚ the estimate is shown to be asymptotically normal‚ as
the following result states. Its proof is found in Roussas (1991a); see Theorem
2.3 there.
where
and
The estimate is consistent and not much different from the estimate
as the following result states.
and‚ of course‚
where
and
all
where
4. The second order derivative of the joint p.d.f. of the r.v.s and
satisfies the condition:
5. The joint p.d.f.s of the r.v.s are bounded, and
so are the joint p.d.f.s of where
6. The one-step transition d.f. of the process has a unique p-th
quantile for and
7. For suitable and is continuous in
where means
3. MIXING
3.1. INTRODUCTION AND DEFINITIONS
As has already been mentioned in the introductory section‚ mixing is a kind
of dependence which allows for the entire past‚ in addition to the present‚ to
influence the future. There are several modes of mixing‚ but we are going
to concentrate on three of them‚ which go by the names of weak mixing or
φ-mixing‚ ρ-mixing‚ and strong mixing or α-mixing. There is a large probabilistic
literature on mixing; however‚ in this paper‚ we focus on estimation problems
in mixing models. A brief study of those probability results‚ pertaining to
statistical inference‚ may be found in Roussas and Ioannides (1987‚ 1988) and
Roussas (1988a).
All mixing definitions will be given for stationary stochastic processes‚ al-
though the stationarity condition is not necessary. The concept of strong mixing
or α-mixing was introduced by Rosenblatt (1956b) and is as follows.
The meaning of (3.1) is clear. It states that the past and the future‚ defined
on the underlying process‚ are approximately independent‚ provided they are
sufficiently far apart.
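Condition (3.1) itself is not displayed in this excerpt. Rosenblatt's strong mixing coefficient is standardly defined as follows, where $\mathcal{F}_1^k = \sigma(X_1,\ldots,X_k)$ and $\mathcal{F}_{k+n}^{\infty} = \sigma(X_{k+n}, X_{k+n+1},\ldots)$; the $\sigma$-field notation is ours:

```latex
\alpha(n) \;=\; \sup_{k \ge 1}\,
  \sup\Bigl\{\, \bigl|P(A \cap B) - P(A)\,P(B)\bigr| :
  A \in \mathcal{F}_1^k,\; B \in \mathcal{F}_{k+n}^{\infty} \Bigr\},
\qquad \alpha(n) \;\longrightarrow\; 0 \quad (n \to \infty)
```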
Definition 3.2. In the notation of the previous definition‚ the sequence of r.v.s
is said to be ρ-mixing with mixing coefficient if
Remark 3.1. One arrives at Definition 3.2 by first defining the maximal corre-
lation by:
being
being with
Relations (3.3) show that and are all equivalent‚ in the sense
that if and only if if and only if
The third and last mode of mixing to be considered here is the weak mixing
or defined below.
Definition 3.3. In the notation of Definition 3.1‚ the sequence of r.v.s is said to
be weak mixing or φ-mixing with mixing coefficient if
with and
(ii) Under
(iii) Under
for real-valued
and
for complex-valued
for real-valued
and
for complex-valued
for real-valued
and
for complex-valued
(ii) Under
for real-valued
and
for complex-valued
(iii) Under
for real-valued
and
for complex-valued
The proof of (3.6) is carried out by induction on and the relevant details
may be found in Roussas (1988a).
Although inequality (3.6) does provide a bound for probabilities of the form
a stronger bound is often required. In other words‚
a Bernstein-Hoeffding type bound would be desirable in the present set-up.
Results of this type are available in the literature‚ and the one stated here is taken
from Roussas and Ioannides (1988).
Theorem 3.6. Let the r.v.s be as in the previous theorem‚ and let
Then‚ under certain additional regularity conditions‚ it holds
constants,
and (and subject to an additional restriction.)
Also‚
where
where
Remark 3.2. In reference to part (iii) of the theorem‚ it must be mentioned that
stationarity alone is enough for its validity; mixing is not required.
Detailed listing of the assumptions under which these results hold‚ as well
as their proofs‚ may be found in Cai and Roussas (1992) (Corollary 2.1 and
Theorem 3.2)‚ Roussas (1989b) (Propositions 2.1 and 2.2)‚ and Roussas (1989c)
(Theorem 3.1 and Propositions 4.1 and 4.2). However‚ see also the material
right after the end of Theorem 3.16. An additional relevant reference‚ among
others‚ is the paper by Yamato (1973).
Theorem 3.8. In the mixing framework and under other additional assumptions‚ the
estimate defined by (3.8) has the following properties:
(i) Asymptotic unbiasedness:
(ii) Strong consistency with rates:
for some
Also,
and
and
Also,
Formulation of the assumptions under which the above results hold, as well
as their proofs, may be found in Roussas (1990a) (Theorems 3.1, 4.1 and 5.1),
Roussas (1988b) (Theorems 2.1, 2.2, 3.1 and 3.2), Cai and Roussas (1992)
(Theorems 4.1 and 4.4), and Roussas and Tran (1992a) (Theorem 7.1).
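The estimate in (3.8) is not reproduced above; density estimates of this kind are typically kernel estimates, and a minimal sketch can be given as follows. The Gaussian kernel, the bandwidth value, and the use of i.i.d. data (for simplicity; the theorems above cover mixing sequences) are our assumptions:

```python
import numpy as np

def kde(x_grid, sample, h):
    """Kernel density estimate: f_n(x) = (1/(n*h)) * sum_i K((x - X_i)/h),
    here with a standard Gaussian kernel K and bandwidth h."""
    u = (x_grid[:, None] - sample[None, :]) / h
    K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
    return K.mean(axis=1) / h

rng = np.random.default_rng(1)
sample = rng.normal(size=5000)        # observations from the stationary law
grid = np.linspace(-3.0, 3.0, 61)     # evaluation points (spacing 0.1)
f_hat = kde(grid, sample, h=0.3)
```

The asymptotic unbiasedness and strong consistency statements of Theorem 3.8 concern estimates of exactly this plug-in form as the bandwidth shrinks with the sample size.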
Suppose now that the r-th order derivative of exists, and let us estimate
it by where
Theorem 3.9. In the mixing framework and under other additional assumptions,
the estimate defined by (3.9) is uniformly strongly consistent with rates;
namely,
and, in particular,
provided
These results are discussed in Cai and Roussas (1992) (Theorem 4.4).
There is an extensive literature on this subject matter. The following con-
stitute only a sample of relevant references dealing with various estimates and
their behavior. They are Bradley (1983), Masry (1983, 1989), Robinson (1983,
1986), Tran (1990), and Roussas and Yatracos (1996, 1997).
3.4.3 Estimating the Hazard Rate. Hazard analysis has broad appli-
cations in systems reliability and survival analysis. It was therefore thought
appropriate to touch upon some basic issues in this subject matter. Recall at
this point that if F and are the d.f. and the p.d.f. of a r.v. X, then the hazard
rate is defined as follows,
Theorem 3.10. In the mixing framework and under other additional assump-
tions, the estimate defined in (3.11) has the following properties:
(i) Strong pointwise consistency:
and, in particular,
provided
where
(iv) Joint asymptotic normality: For any distinct continuity points of
Precise statement of the assumptions under which the above results hold, and
their justification, may be found in Roussas (1989b) (Theorems 2.1, 2.2 and
2.3), Cai and Roussas (1992) (Theorem 4.2), Roussas (1990a) (Theorem 4.2),
and Roussas and Tran (1992a) (Theorem 8.1). Relevant are also the references
Watson and Leadbetter (1964a, b).
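The displays (3.10)–(3.11) are not reproduced above. The standard hazard rate of a r.v. X with d.f. F and p.d.f. f is r(x) = f(x)/(1 − F(x)), and a plug-in estimate of the kind studied here can be sketched as follows; the Gaussian kernel, the bandwidth, and the exponential lifetimes are our illustrative assumptions:

```python
import numpy as np

def hazard_estimate(x_grid, sample, h):
    """Plug-in hazard estimate r_n(x) = f_n(x) / (1 - F_n(x)): a Gaussian
    kernel density estimate divided by the empirical survival function."""
    u = (x_grid[:, None] - sample[None, :]) / h
    f_hat = (np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)).mean(axis=1) / h
    survival = (sample[None, :] > x_grid[:, None]).mean(axis=1)
    return f_hat / survival

rng = np.random.default_rng(2)
lifetimes = rng.exponential(scale=1.0, size=20000)  # true hazard rate is 1
grid = np.linspace(0.5, 1.5, 11)
r_hat = hazard_estimate(grid, lifetimes, h=0.1)
```

For exponential lifetimes the true hazard is constant, so the estimate should be close to 1 away from the boundary at 0, which is the pointwise consistency asserted in Theorem 3.10(i).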
Theorem 3.11. In the mixing set-up and under other additional assumptions,
the estimate given in (3.13) is uniformly strongly consistent with rates over
and, in particular,
provided
For the justification of this result, the interested reader is referred to Cai and
Roussas (1992) (Theorem 4.3). It appears that the idea of using a smooth estimate
for a d.f., such as the one employed above, goes back to Nadaraya (1964b).
The estimate has several optimality properties some of which are sum-
marized below.
Theorem 3.12. In the mixing framework and under additional suitable condi-
tions, the recursive estimate defined in (3.14) has the following properties:
(i) Asymptotic unbiasedness:
Also,
Remark 3.3. The results in parts (ii) and (iv) justify the statement made
earlier about reduction of the variance of the asymptotic normal distribution.
The justification of the statements in Theorem 3.12 may be found in Roussas
and Tran (1992a) (relations (2.6), (2.10), (2.15), and Theorems 3.1 and 4.1).
Now, if is the empirical survival function, it can be written as follows:
Among the several asymptotic properties of is the one stated in the fol-
lowing result (see Theorem 6.1 in Roussas and Tran (1992a)).
Theorem 3.13. Under mixing and suitable additional conditions, the estimate
defined in (3.20) has the following joint asymptotic normality property;
that is, for any distinct continuity points of
Remark 3.4. Applying this last result and comparing it with the result stated
in Theorem 3.1(iii), one sees the superiority of the present estimate, in terms
of asymptotic variance, over the earlier one.
From among several relevant references, we mention those by Masry (1986),
Masry and Györfi (1987), Györfi and Masry (1990), Wegman and Davis (1979),
and Roussas (1992), the last two concerning themselves with the independent
case.
This problem has been studied extensively in the i.i.d. case (see, for example,
Priestley and Chao (1972), Gasser and Müller (1979), Ahmad and Lin (1984),
and Georgiev (1988)). Under mixing conditions, however, this problem has
been dealt with only in the last decade or so. Here, we present a summary of
some of the most important results which have been obtained in the mixing
framework.
Theorem 3.14. Under mixing assumptions and further suitable conditions, the
fixed design regression estimate defined in (3.22) has the following properties:
(iii)Strong consistency:
Precise statement of conditions under which the above results hold, as well
as their proofs, can be found in Roussas (1989a) (Theorems 2.1, 2.2 and 3.1),
and Roussas et al. (1992) (Theorems 2.1 and 3.1).
is finite, and then the problem is that of estimating on the basis of the
observations at hand.
Before we proceed further, we present another formulation of the problem
which provides more motivation for what is to be done. To this effect, let
be real-valued r.v.s forming a stationary time series. Suppose we
wish to predict the r.v. on the basis of the previous r.v.s
As predictor, we use the conditional expectation, assuming,
of course, that it is finite. By setting
the pairs form a stationary
sequence, and the problem of prediction in the time series setting is equiv-
alent to that of estimating the conditional expectation
on the basis of the available observations. Actually, one
may take it one step further by considering a (known) real-valued function
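The prediction framework just described can be illustrated with a standard kernel (Nadaraya-Watson) regression estimate of the conditional expectation. This is a generic sketch, not the specific estimates (3.24) and (3.25) discussed below; the Gaussian kernel, the AR(1) test series, and the bandwidth are illustrative assumptions:

```python
import numpy as np

def nw_regression(x_obs, y_obs, x_eval, h):
    """Nadaraya-Watson estimate of E[Y | X = x] with a Gaussian kernel.

    x_obs, y_obs: observed pairs; x_eval: evaluation points; h: bandwidth.
    """
    x_obs = np.asarray(x_obs, dtype=float)
    y_obs = np.asarray(y_obs, dtype=float)
    out = np.empty(len(x_eval))
    for k, x in enumerate(x_eval):
        w = np.exp(-0.5 * ((x - x_obs) / h) ** 2)  # kernel weights
        out[k] = np.dot(w, y_obs) / w.sum()        # locally weighted mean of Y
    return out

# One-step-ahead prediction for a stationary AR(1) series: regress X_{t+1}
# on X_t, so the estimate targets E[X_{t+1} | X_t = x] = 0.6 x.
rng = np.random.default_rng(0)
series = np.zeros(500)
for t in range(499):
    series[t + 1] = 0.6 * series[t] + rng.normal(scale=0.5)
pred = nw_regression(series[:-1], series[1:], np.array([0.0, 1.0]), h=0.3)
```

For this AR(1) series the true regression function is 0.6x, so `pred` should be close to (0, 0.6) up to smoothing bias and sampling noise.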
Theorem 3.15. Under the basic assumption of mixing and further suitable
conditions, the regression estimate given in (3.24) is strongly consistent with
rates, uniformly over compact subsets of namely,
This estimate is asymptotically normal (see Theorems 2.1 - 2.3 in Roussas and
Tran (1992b)), as the following theorem states.
Theorem 3.16. Under the basic assumption of mixing and further suitable
conditions, the recursive regression estimate given in (3.25) has the properties
stated below.
(i) Asymptotic normality: For any continuity point for and for which
where
(i) as
(ii) and for suitable
(iii) K is Lipschitz of order 1; i.e.,
(iv) The derivative exists, is continuous and of bounded variation, and
(i)
(ii)
(iii)
(iv) They also satisfy some additional conditions involving other entities (such
as mixing coefficients and rates of convergence).
(v) In particular, in the recursive case, the bandwidths satisfy conditions such
as:
for some there is a sequence of positive numbers such
that and
The formulation of the conditions employed in the fixed regression, as well
as the stochastic regression case, requires the introduction of a large amount
of notation. We choose not to do it, and refer the reader to the original papers
already cited.
It is to be emphasized that in any one of the results, Theorems 3.5-3.16,
stated above, the proof requires only some of the assumptions just listed.
Early papers on regression were those by Nadaraya (1964a, 1970) and Watson
(1964). Subsequently, there has been a large number of contributions in this
area. Among them are those by Burman (1991), Masry (1996), and Masry and Fan (1997).
ACKNOWLEDGMENTS
Thanks are due to an anonymous reviewer whose constructive comments,
resulting from a careful and expert reading of the manuscript, helped improve
the original version of this work.
REFERENCES
Ahmad, I. A. and P. E. Lin. (1984). Fitting a multiple regression function.
Journal of Statistical Planning and Inference 9, 163 - 176.
Philip J. Boland
Department of Statistics
University College Dublin
Belfield, Dublin 4
Ireland
Taizhong Hu
Department of Statistics and Finance
University of Science and Technology
Hefei, Anhui 230026
People’s Republic of China
Moshe Shaked
Department of Mathematics
University of Arizona
Tucson, Arizona 85721
USA
J. George Shanthikumar
Industrial Engineering & Operations Research
University of California
Berkeley, California 94720
USA
Abstract In this paper we survey some recent developments involving comparisons of order
statistics and spacings in various stochastic senses.
Keywords: Reliability theory, systems, IFR, DFR, hazard rate order, likelihood
ratio order, dispersive order, sample spacings.
1. INTRODUCTION
Order statistics are basic probabilistic quantities that are useful in the theory
of probability and statistics. Almost every student of probability and statistics
encounters these random variables at an early stage of his/her studies because
these statistics are associated with an elegant theory, are useful in applications,
and are also a convenient tool to use in order to illustrate in a basic (though not
trivial) way some probabilistic concepts such as transformations, conditional
probabilities, lack of independence, and the foundations of stochastic processes.
In the area of statistical inference, order statistics are the basic quantities used
to define observable functions such as the empirical distribution function, and
in reliability theory these are the lifetimes of systems.
In 1996 Boland, Shaked and Shanthikumar wrote a survey (which appeared
in 1998 (Boland, Shaked, and Shanthikumar, 1998)), covering most of what
had been developed in the area of stochastic ordering of order statistics up to
that time. During the last few years this area has experienced an explosion of
new developments. In this paper we try to describe and summarize some of
these recent developments.
The notation that we use in this paper is the following. Let
be independent random variables which may or may not be identically dis-
tributed. The corresponding order statistics are denoted by
Thus, and
If is another collection of indepen-
dent random variables, then the corresponding order statistics are denoted by
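In this notation the order statistics of a sample are simply its sorted values, with X_(1) the minimum and X_(n) the maximum, and X_(k) increases with k in the usual stochastic order. A quick Monte Carlo illustration (the i.i.d. Exp(1) sample, sample size, and evaluation point are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
n, reps = 5, 20000

# Each sorted row gives one realization of (X_(1), ..., X_(n)) for an
# i.i.d. Exp(1) sample of size n.
order_stats = np.sort(rng.exponential(scale=1.0, size=(reps, n)), axis=1)

# Empirical survival probabilities P(X_(k) > t) at a fixed point t; they
# must be nondecreasing in k, since X_(k) <= X_(k+1) holds pathwise.
t = 1.0
surv = (order_stats > t).mean(axis=0)
```

For the maximum, P(X_(5) > 1) = 1 - (1 - e^{-1})^5 ≈ 0.899, which `surv[-1]` should approximate.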
In the sequel we will also touch upon another stochastic order, introduced
in Lillo, Nanda and Shaked (2001), which is defined as follows. Let X and Y
be two absolutely continuous random variables with support We say
that X is smaller than Y in the down shifted likelihood ratio order, denoted as
if
Note that in the above definition we compare only nonnegative random vari-
ables. This is because for the down shifted likelihood ratio order we cannot
take an analog of (2.2), such as, as a definition. The reason is
that here, by taking very large, it is seen that practically there are no random
variables that satisfy such an order relation. Note that in the definition above,
the right hand side can take on (when varies) any value in the
and
In fact, Lillo, Nanda and Shaked (2001) showed that it is not possible to replace
in (2.4).
When the [respectively, ] above are i.i.d., then from (2.3)-(2.5) it
follows that
and
this result has been obtained by Raqab and Amin (1996) and, independently,
by Khaledi and Kochar (1999).
Stochastic Ordering of Order Statistics II 611
Similarly, from (2.8) it follows that if the above are i.i.d. with a logconvex
density function then
We say that X is smaller than Y in the reversed hazard rate order, denoted
as if
for any such function In Nanda and Shaked (2000) it is shown that
for any such function The latter two implications correct a mistake in Theo-
rems 1.B.2 and 1.B.22 in Shaked and Shanthikumar (1994) — the parenthetical
statements there are incorrect. These two implications often enable us to trans-
form results about the hazard rate order into results about the reversed hazard
rate order and vice versa.
The first result regarding ordering order statistics in the sense of the hazard
and the reversed hazard rate orders is the following useful proposition; later
(see Theorem 3.1) we use it in order to obtain a new stronger result.
Proposition 3.1. Let [respectively, be in-
dependent (not necessarily i.i.d.) absolutely continuous random variables, all
with support for some
(i) If for all and then
and
and the desired result follows from the fact that the likelihood ratio order implies
the hazard rate order.
With the aid of (3.4) and (3.5) it can be shown that statement (3.7) is equiv-
alent to (3.6).
Theorem 3.2 extends and unifies the relatively restricted Theorems 3.8 and 3.9
of Nanda and Shaked (2000).
In light of (4.1) in Section 4, one may wonder whether the conclusion in
(3.6) holds if it is only assumed there that for all (rather than
for all In order to see that this is not the case, consider the
The first inequality in (3.8) was proven in Boland, El-Neweihi and Proschan
(1994) for nonnegative random variables; however, by (3.3), the inequality holds
also without the nonnegativity assumption. The second inequality in (3.8) is
taken from Hu and He (2000). The inequalities in (3.9) can be found in Block,
Savits and Singh (1998) and in Hu and He (2000). Again, using (3.4) and (3.5)
it can be shown that (3.9) is actually equivalent to (3.8); see Nanda and Shaked
(2000) for details.
For the next inequalities we need to have, among the a "largest" [respectively
"smallest"] variable in the sense of the hazard [reversed hazard] rate
order. Again, let be independent (not necessarily i.i.d.) ab-
solutely continuous random variables, all with support for some
Then
(i) If then
(ii) If then
Boland, El-Neweihi and Proschan (1994) proved part (i) above for nonnegative
random variables; however, again by (3.3), the inequality in part (i) is valid
without the nonnegativity assumption. Part (ii) above is Theorem 4.2 of Block,
Savits and Singh (1998). Again, using (3.4) and (3.5) once more it can be
shown that part (ii) is equivalent to part (i); see Nanda and Shaked (2000) for
details.
The usual stochastic order is implied by the orders studied in Sections 2 and 3.
Therefore, comparisons of order statistics, associated with one collection of
random variables, follow from previous results in these sections. For example,
in (2.9), (3.8) and (3.9), the orders and can be replaced by
However, when we try to compare order statistics that are associated with two
different sets of random variables (that is, a set of and a set of we
may get new results because the assumption that an is smaller than a in
the usual stochastic order is weaker than a similar assumption involving any of
the orders discussed in Sections 2 and 3.
At present there are not many such results available. We just mention one
recent result that has been derived, independently, by Belzunce, Franco, Ruiz
and Ruiz (2001) and by Nanda and Shaked (2000). Let [re-
spectively, ] be independent (not necessarily i.i.d.) absolutely
continuous random variables, all with support for some Then for
any and we have that
The above two results were obtained by Barlow and Proschan (1966) who also
showed that if F is DFR (decreasing failure rate; that is, log is convex on
then the inequalities above are reversed. Kochar and Kirmani (1995)
strengthened this DFR result as follows. Let be i.i.d. nonnegative
random variables. If the common distribution function F is DFR then the
normalized spacings satisfy
Khaledi and Kochar (1999) proved, in addition to the above, that if F is DFR
then the normalized spacings satisfy
Proof. Let F and denote, respectively, the distribution function and the den-
sity function of Given and the conditional
density of at the point is and the conditional
density of at the point is Since is
increasing [decreasing] it is seen that, conditionally,
and therefore, conditionally, But the usual stochastic
order is closed under mixtures, and this yields the stated result.
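The normalized spacings appearing here are easy to experiment with numerically. For i.i.d. Exp(1) observations (the boundary case, both IFR and DFR) the Rényi representation says the normalized spacings D_k = (n - k + 1)(X_(k) - X_(k-1)) are themselves i.i.d. Exp(1); a Monte Carlo sketch (sample size and seed are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 6, 40000

x = np.sort(rng.exponential(1.0, size=(reps, n)), axis=1)
# Spacings X_(k) - X_(k-1), with the convention X_(0) = 0.
gaps = np.diff(np.concatenate([np.zeros((reps, 1)), x], axis=1), axis=1)
# Normalized spacings D_k = (n - k + 1)(X_(k) - X_(k-1)), k = 1..n.
norm_spacings = gaps * (n - np.arange(n))

means = norm_spacings.mean(axis=0)  # each should be close to 1 for Exp(1)
```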
is DFR then
In fact, they showed a stronger result, namely, that the random vector
is smaller than the random vector
in the multivariate likelihood ratio order (see Shaked and Shanthikumar
(1994) for the definition).
Bartoszewicz (1985) and Bagai and Kochar (1986) have shown that
This follows from (3.8), with the aid of (6.2), and from the fact that is
DFR. If are i.i.d. DFR random variables, then Khaledi and Kochar
(2000a) showed that
Consider now the normalized spacings that are associated with the i.i.d.
random variables From (5.1)–(5.3), from (6.2), and from the fact
that the spacings here are DFR (see Barlow and Proschan (1966)), it is seen
that if are i.i.d. nonnegative DFR random variables, then we get
the following results of Kochar and Kirmani (1995) and of Khaledi and Kochar
(1999):
or, in summary,
for all increasing functions for which the expectations exist. The
following theorem is stated in Bartoszewicz (1986) with an incomplete proof.
Theorem 6.1. Let U and V be as above. If then In
particular,
The fact that now follows from a well-known property of the multi-
variate order
Theorem 2.7 on page 182 of Kamps (1995) extends (6.4) to the spacings of
the so-called generalized order statistics.
Rojo and He (1991) proved a converse of Theorem 6.1. Specifically, they
showed that if
for all then
ACKNOWLEDGMENTS
We thank Baha-Eldin Khaledi for useful comments on a previous version of
this paper.
REFERENCES
Alzaid, A. A. and F. Proschan. (1992). Dispersivity and stochastic majorization.
Statistics and Probability Letters 13, 275–278.
Arnold, B. C. and J. A. Villasenor. (1998). Lorenz ordering of order statistics and
record values. In Handbook of Statistics, Volume 16 (Eds: N. Balakrishnan
and C. R. Rao), Elsevier, Amsterdam, 75–87.
Bagai, I. and S.C. Kochar. (1986). On tail-ordering and comparison of failure
rates. Communications in Statistics—Theory and Methods 15, 1377–1388.
Barlow, R. E. and F. Proschan. (1966). Inequalities for linear combinations of
order statistics from restricted families. Annals of Mathematical Statistics
37, 1574–1592.
Barlow, R. E. and F. Proschan. (1975). Statistical Theory of Reliability and Life
Testing, Probability Models, Holt, Rinehart, and Winston, New York, NY.
Bartoszewicz, J. (1985). Dispersive ordering and monotone failure rate distri-
butions. Advances in Applied Probability 17, 472–474.
Bartoszewicz, J. (1986). Dispersive ordering and the total time on test transfor-
mation. Statistics and Probability Letters 4, 285–288.
Bartoszewicz, J. (1998a). Applications of a general composition theorem to the
star order of distributions. Statistics and Probability Letters 38, 1–9.
Bartoszewicz, J. (1998b). Characterizations of the dispersive order of distribu-
tions by the Laplace transform. Statistics and Probability Letters 40, 23–29.
Belzunce, F., M. Franco, J.-M. Ruiz, and M. C. Ruiz. (2001). On partial order-
ings between coherent systems with different structures. Probability in the
Engineering and Informational Sciences 15, 273–293.
Block, H. W., T.H. Savits, and H. Singh. (1998). The reversed hazard rate
function. Probability in the Engineering and Informational Sciences 12, 69–
90.
Boland, P. J., E. El-Neweihi, and F. Proschan. (1994). Applications of the hazard
rate ordering in reliability and order statistics. Journal of Applied Probability
31, 180–192.
Li, X., Z. Li, and B-Y. Jing. (2000). Some results about the NBUC class of life
distributions. Statistics and Probability Letters 46, 229–237.
Lillo, R. E., A.K. Nanda, and M. Shaked. (2001). Preservation of some like-
lihood ratio stochastic orders by order statistics. Statistics and Probability
Letters 51, 111–119.
Misra, N. and E.C. van der Meulen. (2001). On stochastic properties of
spacings. Technical Report, Department of Mathematics, Katholieke Universiteit
Leuven.
Nanda, A. K., K. Jain, and H. Singh. (1998). Preservation of some partial or-
derings under the formation of coherent systems. Statistics and Probability
Letters 39, 123–131.
Nanda, A. K. and M. Shaked. (2000). The hazard rate and the reversed hazard
rate orders, with applications to order statistics. Annals of the Institute of
Statistical Mathematics, to appear.
Oja, H. (1981). On location, scale, skewness and kurtosis of univariate distri-
butions. Scandinavian Journal of Statistics 8, 154–168.
Pledger, G. and F. Proschan. (1971). Comparisons of order statistics and spac-
ings from heterogeneous distributions. In Optimizing Methods in Statistics
(Ed: J. S. Rustagi), Academic Press, New York, 89–113.
Raqab, M. Z. and W.A. Amin. (1996). Some ordering results on order statistics
and record values. IAPQR Transactions 21, 1–8.
Rojo, J. and G.Z. He. (1991). New properties and characterizations of the dis-
persive ordering. Statistics and Probability Letters 11, 365–372.
Shaked, M. and J.G. Shanthikumar. (1994). Stochastic Orders and Their Appli-
cations, Academic Press, Boston.
Shanthikumar, J. G. and D.D. Yao. (1986). The preservation of likelihood ratio
ordering under convolution. Stochastic Processes and Their Applications 23,
259–267.
Wilfling, B. (1996). Lorenz ordering of power-function order statistics. Statistics
and Probability Letters 30, 313–319.
Chapter 25
Vehicle Routing with Stochastic Demands: Models and Computational Methods
Moshe Dror
Department of Management Information Systems
The University of Arizona
Tucson, AZ 85721, USA
mdror@bpa.arizona.edu
Abstract In this paper we provide an overview and modeling details regarding vehicle
routing in situations in which customer demand is revealed only when the vehicle
arrives at the customer’s location. Given a fixed capacity vehicle, this setting
gives rise to the possibility that the vehicle on arrival does not have sufficient
inventory to completely supply a given customer’s demand. Such an occurrence
is called a route failure and it requires additional vehicle trips to fully replenish
such a customer. Given a set of customers, the objective is to design vehicle
routes and response policies which minimize the expected delivery cost by a
fleet of fixed capacity vehicles. We survey the different problem statements and
formulations. In addition, we describe a number of the algorithmic developments
for constructing routing solutions. Primarily we focus on stochastic programming
models with different recourse options. We also present a Markov decision
approach for this problem and conclude with a challenging conjecture regarding
finite sums of random variables.
1. INTRODUCTION
Consider points indexed in a bounded subset B of Euclidean
space. Given a distance matrix between the point pairs, a traveling
salesman problem solution for points is described by a cyclic per-
mutation such that
is minimized over all cyclic permutations where represents
the point in the position in In this generic traveling salesman problem
(TSP) statement it is assumed that all the elements (positions of the points
and the corresponding distances) are known in advance. In a setting like this,
stochasticity can be introduced in a number of ways. First, consider that a
Another stochastic version of the TSP can be stated in terms of the distance
matrix by assuming a nonnegative random variable
with a known probability distribution for each pair
The value serves as a travel time factor for the distance
We simply define a new random variable as a random travel time
between points and In this setting an optimal cyclic permutation would
be one which minimizes the expected TSP distance with respect to the
matrix. In real-life terms, this setting seems appropriate when all points have to
be visited; however, the travel time between pairs of points is a random variable
with a known distribution (Laipala, 1978).
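Because expectation is linear, the expected length of any fixed tour under the random travel-time matrix equals its length under the matrix of mean travel times, so this version reduces to a deterministic TSP on E[t_ij]. A brute-force sketch (the random symmetric distances and the constant mean travel-time factor are assumptions of the example):

```python
import itertools
import numpy as np

rng = np.random.default_rng(3)
n = 5
base = rng.uniform(1.0, 10.0, size=(n, n))   # deterministic distances d_ij
base = (base + base.T) / 2                   # symmetrize
np.fill_diagonal(base, 0.0)

mean_factor = 1.5                            # E of the random travel-time factor
mean_times = mean_factor * base              # E[t_ij] = E[factor] * d_ij

def tour_length(matrix, perm):
    """Length of the cyclic tour visiting `perm` in order and returning home."""
    return sum(matrix[perm[i], perm[(i + 1) % len(perm)]]
               for i in range(len(perm)))

# Minimizing the expected length is a deterministic TSP on the mean matrix.
best = min(itertools.permutations(range(1, n)),
           key=lambda p: tour_length(mean_times, (0,) + p))
best_expected = tour_length(mean_times, (0,) + best)
```

Enumeration is only feasible for tiny instances; it is used here solely to make the reduction concrete.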
Clearly, there are real-life examples for both of the above problems and for
a hybrid of the two. However, in this chapter we survey yet a different setting
of a stochastic routing problem which we refer to as vehicle routing with
stochastic demands (SVRP). For instance, in the case of automatic bank teller
machines the daily demand for cash from each machine is uncertain. The max-
imal amount of cash that may be carried by an armored cash delivery vehicle
is limited for security and insurance reasons. Thus, given a preplanned route
(sequence of cash machines), there might not be enough cash on the designated
vehicle to supply all the machines on its route resulting in a delivery stockout
(referred to as a route failure), forcing a decision of how to supply the ma-
chines which have not been serviced because the cash ran out. The problem
can be described as follows: given the points in some bounded
subset space, a point (the depot), and a positive real
value (the capacity of a vehicle), associate with each
a bounded, nonnegative random variable (the demand at ). The objective is to
construct ( is determined as a part of the solution) cyclic paths, all
sharing the point {0}, where some paths may share other points as well,
such that for each realization of the demand vector the demand
Vehicle Routing with Stochastic Demands: Models & Computational Methods 627
on each of the cyclic paths does not exceed Q (the vehicle capacity), the total
realized demand is less than or equal to , and all realized demands are satisfied.
A key assumption is that the value is revealed only when a vehicle visits the
point for the first time. Note that the actual construction of the cyclic paths
might depend on demand realizations at the points already visited. The objec-
tive is to find a routing solution, perhaps in the form of routing rules, which has
a minimal expected distance. Note that some demand values might be split-
delivered over a number of these cyclic paths. The problem can be viewed as a
single vehicle problem with multiple routes all visiting the point {0} - the depot.
The above version of the SVRP has been associated with real-life settings
such as sludge disposal (Larson, 1988) and delivery of home heating oil (Dror,
Ball, and Golden, 1985), among others. In these two examples there is no
advanced reporting of the values which represent current inventory levels or
the size of replenishment orders. Thus, the amount to be delivered becomes
known only as a given site is first visited.
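The effect of demands being revealed only on arrival can be estimated by simulation. The sketch below evaluates one fixed route under the simplest recourse (refill at the depot upon a route failure and return to the interrupted customer); the three customer locations, the capacity Q, and the normal demands are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(4)

depot = np.array([0.0, 0.0])
sites = np.array([[4.0, 0.0], [4.0, 3.0], [0.0, 3.0]])  # fixed visiting order
Q = 10.0

def dist(a, b):
    return float(np.linalg.norm(a - b))

def route_cost(demands):
    """Distance traveled for one demand realization: on a stockout the
    vehicle refills at the depot and returns to the interrupted customer."""
    load, pos, cost = Q, depot, 0.0
    for site, d in zip(sites, demands):
        cost += dist(pos, site)
        pos = site
        while d > load:                     # route failure: refill round trip
            d -= load
            load = Q
            cost += 2 * dist(site, depot)
        load -= d
    return cost + dist(pos, depot)

demands = np.clip(rng.normal(4.0, 1.0, size=(8000, 3)), 0.0, None)
expected_cost = np.mean([route_cost(d) for d in demands])
```

The failure-free tour length here is 14; the Monte Carlo mean exceeds it by the expected cost of the refill trips.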
assume a priori that a route delivery sequence will be carried out as planned
without interruption. In addition, given an ordered sequence of customer visits
on an undirected network (symmetric distances), we show in Example 2 below
that, unlike in the deterministic vehicle routing problem, it makes a difference
whether such a sequence is visited as ordered "from left to right" or in reverse
"from right to left".
Example 2: Assume a route with five customers located at (5,0), (5,5), (0,5),
(–5,5), and (–5,0), and a depot located at (0,0) forming a rectangle of length
10 and width 5. The customers are denoted as 1,2,3,4, and 5, respectively.
Assume a straight line travel distance between the locations and the expected
demands of and 4, and for the customer
located at (–5,0). We also assume a very simple recourse policy in the case of
route failure by servicing each customer who was not delivered a full demand
on the planned route individually (with multiple back and forth trips). We de-
note the route planned counterclockwise by
and the (opposite direction) clockwise route as
In the case that the customers' demands are independent, normally distributed
random variables with mean demands as described above and an identical
coefficient of variation of , the expected travel distances for the two routes
are quite different. For it is equal to 40.5563, and for the expected
travel distance is 48.8362 (Dror and Trudeau, 1986).
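A computation in the spirit of Example 2 still exhibits the directional asymmetry. The exact demand values of the example are garbled in our copy, so the means below are hypothetical and the resulting numbers do not reproduce 40.5563 and 48.8362; the recourse is also simplified to a refill round trip at the point of failure rather than individual back-and-forth trips:

```python
import numpy as np

rng = np.random.default_rng(5)

depot = np.array([0.0, 0.0])
pts = np.array([[5.0, 0.0], [5.0, 5.0], [0.0, 5.0], [-5.0, 5.0], [-5.0, 0.0]])
mu = np.array([1.0, 2.0, 3.0, 4.0, 4.0])  # hypothetical expected demands
Q, cv = 12.0, 0.2

def dist(a, b):
    return float(np.linalg.norm(a - b))

def expected_cost(order, reps=5000):
    """Monte Carlo expected travel distance for one visiting direction,
    with a depot refill round trip whenever the vehicle runs out."""
    total = 0.0
    all_demands = np.clip(rng.normal(mu, cv * mu, size=(reps, 5)), 0.0, None)
    for d in all_demands:
        load, pos, cost = Q, depot, 0.0
        for i in order:
            cost += dist(pos, pts[i])
            pos = pts[i]
            need = d[i]
            while need > load:              # route failure
                need -= load
                load = Q
                cost += 2 * dist(pos, depot)
            load -= need
        total += cost + dist(pos, depot)
    return total / reps

e_ccw = expected_cost([0, 1, 2, 3, 4])      # customers 1..5 in order
e_cw = expected_cost([4, 3, 2, 1, 0])       # the reverse direction
```

With these means, failures on the counterclockwise route concentrate at the last, cheap-to-reach customer, so its expected cost comes out lower than that of the reverse direction.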
A classical, and perhaps the most popular, heuristic for constructing vehicle-
routing solutions for the case of deterministic customers' demands is the so-
called Clarke and Wright heuristic (Clarke and Wright, 1964). The thrust of the
Clarke and Wright route construction is the concept that if two deliveries
on two different routes can be joined on a single route, savings can be
realized. In the VRP the savings are calculated for each pair of delivery
points as (assuming a symmetric travel matrix).
The savings are ordered and the customers are joined according to the largest
saving available, as long as the demand of the combined route does not exceed
the capacity of the vehicle – Q. This basic savings idea can be generalized for
the stochastic vehicle routing case as follows:
= [the expected cost of the route with customer on it] + [the expected
cost of the route with customer on it] - [the expected cost of the combined
route where customer immediately precedes customer ].
When computing the savings terms in the stochastic version of the Clarke and
Wright heuristic, one has to account for the direction of the route. For each pair
of points two different directional situations have to be considered, and
only the one with the higher saving value is kept.
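A sketch of the directional savings computation follows. The locations, mean demands, coefficient of variation, and refill-at-depot recourse are assumptions of the example, not the formulation of the cited papers:

```python
import numpy as np

rng = np.random.default_rng(6)
pts = np.array([[0.0, 0.0], [6.0, 0.0], [6.0, 4.0], [0.0, 5.0]])  # node 0 = depot
mu = np.array([0.0, 3.0, 4.0, 5.0])   # hypothetical mean demands
Q, cv = 10.0, 0.3

def dist(i, j):
    return float(np.linalg.norm(pts[i] - pts[j]))

def expected_route_cost(order, reps=4000):
    """Expected distance of the route (depot, *order, depot) under
    refill-at-depot recourse, estimated by Monte Carlo."""
    total = 0.0
    for _ in range(reps):
        load, pos, cost = Q, 0, 0.0
        for i in order:
            cost += dist(pos, i)
            pos = i
            need = max(rng.normal(mu[i], cv * mu[i]), 0.0)
            while need > load:            # route failure: refill round trip
                need -= load
                load = Q
                cost += 2 * dist(i, 0)
            load -= need
        total += cost + dist(pos, 0)
    return total / reps

def stochastic_saving(i, j):
    """Expected saving of merging the single-customer routes of i and j;
    both directions are evaluated and the larger saving is kept."""
    single = expected_route_cost([i]) + expected_route_cost([j])
    return max(single - expected_route_cost([i, j]),
               single - expected_route_cost([j, i]))

s = stochastic_saving(1, 2)
```

With these data the deterministic Clarke and Wright saving would be d(0,1) + d(0,2) - d(1,2) ≈ 9.21; the stochastic saving is slightly smaller because the merged route occasionally fails.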
In Dror and Trudeau (1986), the above stochastic Clarke and Wright heuristic
was implemented to construct a routing solution for a 75-customer test prob-
lem in Eilon, Watson-Gandy, and Christofides (1971). The truck capacity for
this experiment is set at Q = 160. The depot is located at the center point
of an 80 × 80 square; however, we do not list here the coordinates
for all 75 points. Table 1 lists the routes constructed by the stochastic
savings heuristic. It is interesting to note that some of the 10 routes constructed
have a high probability of failure. For instance, routes marked as 9 and 10 have
a probability of failure of 0.239 and 0.679, respectively. The expected demand
of route 10 exceeds the truck capacity. However, upon closer examination of the
two routes we find that the probability of route failure before the last customer
is negligible, and the failures, if occurring at all, are only likely to materialize
as the truck arrives at the last customer who is located very close to the depot
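The distinction drawn here, between the probability that a route fails anywhere and the probability that it fails before the last customer, is easy to quantify when demands are modeled as independent normals, since (for nonnegative demands) failing by some customer is equivalent to the cumulative demand up to that customer exceeding Q. A sketch with hypothetical means and variances, not the actual data of the 75-customer problem:

```python
import math

def p_cum_exceeds(mu, sigma2, Q):
    """P(sum of demands > Q) for independent normal demands with the given
    lists of means and variances (normal tail via the error function)."""
    m, v = sum(mu), sum(sigma2)
    return 0.5 * (1.0 - math.erf((Q - m) / math.sqrt(2.0 * v)))

# Hypothetical route: nine customers, then a last one close to the depot.
mu = [15.0] * 9 + [20.0]
sigma2 = [9.0] * 10
Q = 160.0

p_before_last = p_cum_exceeds(mu[:9], sigma2[:9], Q)  # failure before last stop
p_overall = p_cum_exceeds(mu, sigma2, Q)              # failure anywhere
```

Here the overall failure probability is close to 0.3, yet the chance of failing before the final customer is below 0.003, so almost all failures occur at the last, conveniently located stop.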
where is a binary decision variable which takes the value 1 if vehicle travels
directly from to and is 0 otherwise; NV denotes the number of available ve-
hicles; is the set of feasible routes for the traveling salesman problem with
NV salesmen. and Q are defined as before and is the maximum allow-
able probability that a route might fail. The chance-constrained SVRP model
presented above is in the spirit of mathematical models developed by Charnes
and Cooper in the 1950s and early 1960s. One of the main premises of such
models was that complicated stochastic optimization problems are convertible
into equivalent deterministic problems while controlling for the probability of
"bad" events such as route failures for the SVRP. Stewart and Golden (1983)
showed that this conversion process (stochastic to deterministic) of the SVRP is
possible for some random demand distributions. In addition, a number of sim-
ple penalty-based SVRP models have also been proposed by the same authors
and are restated below.
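For independent normally distributed demands, the conversion exploited by Stewart and Golden is elementary: the chance constraint P(sum of demands > Q) <= alpha is equivalent to the deterministic constraint sum(mu) + z_{1-alpha} * sqrt(sum(sigma^2)) <= Q. A sketch of the resulting feasibility check (the demand data are hypothetical):

```python
from statistics import NormalDist

def route_feasible(mu, sigma2, Q, alpha):
    """Deterministic equivalent of the chance constraint
    P(sum of demands > Q) <= alpha for independent normal demands."""
    z = NormalDist().inv_cdf(1.0 - alpha)          # z_{1-alpha}
    return sum(mu) + z * (sum(sigma2) ** 0.5) <= Q

ok = route_feasible([20.0, 25.0, 30.0], [16.0, 16.0, 25.0], Q=100.0, alpha=0.05)
bad = route_feasible([20.0, 25.0, 30.0], [16.0, 16.0, 25.0], Q=80.0, alpha=0.05)
```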
The apparent problem with the above modeling approach for the SVRP is
that designing a complete set of vehicle routes while controlling or penalizing
route failures irrespective of their likely location and the cost of such failures,
might result in bad routing decisions. It is important to remember that the cost
of the recourse action taken in response to route failure is more critical than the
mere likelihood of such failure.
This is the theme of a recent paper by W.-H. Yang, K. Mathur, and R.H. Ballou
(2000), in which a customer’s stochastic demand does not exceed the vehicle
capacity Q. For simplicity, they assume that has a discrete distribution with
possible values with probability mass function
In their solution of the SVRP, Yang et al. adopt a simple recourse action
of returning to the depot whenever the vehicle runs out of stock, in addition to
preset refill decisions based on the demand realizations along the route. Hence
the vehicle may return to the depot before stockouts actually occur. What is
Theorem 1. (Yang et al., 2000) For each customer there exists a quantity
such that the optimal decision, after serving node is to continue to node
if or return to the depot if
For the proof we refer the reader to the original paper and to Yang’s (1996)
thesis. The computation of the values is recursive and for a given routing se-
quence requires computing the two parts of equation (1). Since there are many
different routing options, constructing the best route for the SVRP with this
simple recourse scheme requires direct examination of these routing options
(say by enumeration) and is only feasible for small problems. In Yang (1996),
a branch-and-bound algorithm is described which produces optimal solutions to
some typical problems of up to 10 customers. Since this recourse policy allows
for restocking the vehicle at any point on its route, even when it is clear that
the total customer demand exceeds the vehicle’s capacity, it is not necessary to
consider multiple routes. In fact, Yang, Mathur, and Ballou (2000) prove that a
single route is more efficient than a multiple-vehicle-route system. Obviously, this
result assumes no other routing constraints, such as time duration, which
might require implementation of a multiple-route system.
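The structure of this recourse policy is easy to simulate. In the sketch below the thresholds are hand-picked rather than computed from the recursion in Yang et al. (2000), and the instance (one cheap-to-reach customer followed by two distant ones) is contrived so that a preventive refill pays off:

```python
import numpy as np

rng = np.random.default_rng(7)

depot = np.array([0.0, 0.0])
pts = np.array([[2.0, 0.0], [10.0, 0.0], [10.0, 1.0]])  # visiting order fixed
Q = 10.0

def dist(a, b):
    return float(np.linalg.norm(a - b))

def route_cost(demands, thresholds):
    """Cost under the threshold policy: after serving node j, refill at the
    depot whenever the remaining load falls below thresholds[j + 1]
    (a stockout still forces a refill round trip)."""
    load, pos, cost = Q, depot, 0.0
    for j, (site, d) in enumerate(zip(pts, demands)):
        cost += dist(pos, site)
        pos = site
        while d > load:                     # forced refill on route failure
            d -= load
            load = Q
            cost += 2 * dist(site, depot)
        load -= d
        if j + 1 < len(pts) and load < thresholds[j + 1]:
            cost += 2 * dist(site, depot)   # preventive refill trip
            load = Q
    return cost + dist(pos, depot)

demands = np.clip(rng.normal(4.0, 1.0, size=(8000, 3)), 0.0, None)
failure_only = np.mean([route_cost(d, [0.0, 0.0, 0.0]) for d in demands])
preventive = np.mean([route_cost(d, [0.0, 8.0, 0.0]) for d in demands])
```

Refilling after the first (cheap) stop costs a 4-unit detour but avoids 20-unit emergency trips at the distant stops, so `preventive` comes out well below `failure_only`.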
Laporte, Louveaux, and Van hamme (2001) examine the same problem from
a somewhat different perspective. The SVRP is examined and an optimal so-
lution methodology by means of an integer L-shaped method is proposed for
the simple recourse of back and forth vehicle trips to the depot for refill in the
case of route failure. We restate below the main points of the solution approach
from Laporte et al. (2001), and Laporte and Louveaux (1993), as applied to
the SVRP model in Dror et al. (1989). However, since Yang et al. (2000)
have shown that a single-route design is more efficient than multiple routes, and
(based on Theorem 1 of Yang et al., 2000) a better recourse policy is obtained
by allowing a return to the depot for refill even if the vehicle has not run out
of commodity at node , we modify the L-shaped solution approach. That is,
we assume that a single delivery route will be traced until either a route failure
occurs followed by a round trip to the depot, or based on the result of Theorem
1, the vehicle returns to the depot to refill before continuing the route.
subject to
The cost term can be viewed as a two part cost of a given solution.
In the first part we have the term cx denoting the cost of the initial pre-planned
routing sequence represented by x, and the second part, denoted by Q(x, q),
would be the cost of recourse given x and a realization of q. Thus, Q(x, q)
reflects the cost of return trips incurred by route failures and decisions to return
for refill before route failures, minus some resulting savings. In this represen-
tation, we write simply Note that the two routing
vectors and x are not the same. The binary vector x represents an initial TSP
route, whereas is the binary routing vector which includes all the recourse
decisions. We want to keep the vector binary and for this purpose we assume
that (i) the probability of a node demand being greater than the capacity of a vehicle
is zero, and (ii) the probability that a vehicle, upon a failure, will go back to the
depot after returning to a node to complete its delivery is also zero. Since the
vector x represents a routing solution for a single route, it satisfies constraints
(3)-(6) ((3) and (4) with equality only). Setting the expectation with respect to
q denoted as the objective function (2) becomes
In principle, the function Q(x) can be of any sign and is bounded (from below
and above). However, in our case Constraints (3)-(6) ensure connec-
tivity and conservation of flow. At this point we describe the function Q(x, q)
in the standard framework of the Two-Stage Stochastic Linear Programs. In
the second stage the customer deliveries are explicitly represented.
where y is the binary vector representing the recourse initiated trips to the de-
pot. T(q) represents the deliveries made by the x vector given q, is the
demand realization for q which has to be met (delivered) either by x or the
recourse y.
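Since the displayed formulas are not reproduced in this text, the two-stage structure just described can be sketched in generic stochastic-programming notation; the symbols f, W, and h are our stand-ins, guided only by the text's description of T(q) and of the recourse vector y:

```latex
\begin{aligned}
&\min_{x}\; cx \;+\; \mathcal{Q}(x),
\qquad \mathcal{Q}(x) \;=\; \mathbb{E}_{q}\bigl[\,Q(x,q)\,\bigr],
\qquad x\ \text{binary, satisfying (3)--(6)},\\[4pt]
&Q(x,q) \;=\; \min_{y}\;\bigl\{\, f\,y \;:\; W y \;\ge\; h(q) \,-\, T(q)\,x,\;\; y\ \text{binary} \,\bigr\}.
\end{aligned}
```

Here the first stage chooses the pre-planned route x, and the second stage charges the recourse cost (return trips minus savings) once the demand realization q is revealed.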
subject to
Vehicle Routing with Stochastic Demands: Models & Computational Methods 635
4 Check for any violations of constraints (12) or integrality values for and
introduce at least one violated constraint. At this stage, valid inequalities
or lower bounding functionals may also be generated. Return to Step 2.
Otherwise, if fathom the current node and return to Step
1.
In order to find a valid lower bound, one can rank all costs of types (a) and
(b) among the nodes in and compute the least expected
additional cost as follows:
One lower bound calculation for the expected cost through the partial tour is
obtained by assuming that the vehicle starts the partial tour at its full capacity.
Denote this lower bound as A somewhat tighter lower bound is described
in Yang (1996).
(SVRP) Minimize
subject to
(1) Any arc is traversed at most once in the optimal SVRP solution.
(2) The number of trips to the depot in an optimal SVRP solution is less than
or equal to
(3) In the optimal SVRP solution no customer will be visited more than
times.
(customer nodes).
The permutation preserves the relative order of the set of nodes in the
new graph which represent a node in the original graph. Given that customer
is fully replenished, say on the visit to that customer, then the rest of the
nodes corresponding to customer are visited in succession at no additional
cost, which is consistent with the property of
subject to
where denotes the flow from node to node and cannot exceed the vehicle
capacity Q. Each node in the graph is visited exactly once. Constraints
(25) are stated in a symbolic form expressing the fact that disconnected subtours
are not allowed in the solution.
The above model is stated purely for conceptual understanding of the rout-
ing decisions with respect to new demand information as it is revealed one
customer at a time. From the computational perspective, this model is not very
practical and would require a number of restrictive assumptions (in terms of
demand distributions, recourse options, etc.) before operational policies could
be computed.
Consider a single vehicle of fixed capacity Q located at the depot and assume
simply that no customer ever demands a quantity greater than Q. In essence,
all our assumptions are the same as before. One can assume discrete demand
distributions at the customers’ locations in an attempt to reduce the size of the
state space (Secomandi, 1998), however, in this model presentation we assume
continuous demand distributions. A basic rule regarding vehicle deliveries is
that once a vehicle arrives at customer location it delivers the customer’s full
demand or as much of the demand as it has available. Thus, upon arrival (and
after delivery) only one decision has to be taken in case there is some com-
modity amount left on the vehicle: Which location to move next. The vehicle
automatically returns to the depot only when empty, or when all customers have
been fully replenished. In the initial state the vehicle is at the depot fully loaded
and no customer has been replenished. The final state of the system is when all
customers have been fully replenished and the vehicle is back at the depot.
The states of the system are recorded each time the vehicle arrives for the
first time at one of the customer locations and each time the vehicle enters the
depot. Let be the times at
which the vehicle arrived for the first time at a new customer location. These are
transition times and correspond to the times at which decisions are taken. The
state of the system at time is described by a vector
where denotes the position of the vehicle, and de-
scribes the commodity level in the vehicle. implies automatically
(i.e., the vehicle is full at the depot). If customer has been visited then its exact
demand is known and after a replenishment (partial or complete) denotes the
remaining demand. In this case, If the customer has not been
visited yet, then is set to -1. (That is the demand is unknown.) The state
space is a subset S of which satisfies
the above conditions. Given a transition time a decision is selected from de-
cision space where and
means that the vehicle goes from its present position, say to customer whose
demand is yet undetermined, and on its route from to it replenishes a subset
of customers P (a shortest path over that set and its end points) whose demand
is already known. In that case the vehicle might also visit the depot. In many
cases the subset P may be empty. For instance, at the first and second transi-
tion times. The decision is admissible only if The decision
whenever or for (and every customer
has been fully replenished). For each let denote the set of
admissible decisions when the system is in state and
the set of admissible state-decision pairs.
ending at location In this model we assume that the service time is zero. Let
be the transition law. That is for every Borel subset of
is the probability that the next state belongs to given and
where is the number of transitions and is the usual time distance between
and if the next transition occurs at without visiting any other nodes, or is
the shortest time path from to if a subset of nodes is replenished between
the two transition times.
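The state description above can be put into code. The sketch below is illustrative only: the names (`State`, `admissible_targets`), and the reduction of a decision to just its target customer (dropping the en-route subset P) are our assumptions, not the authors' or Secomandi's implementation; the -1 convention for unobserved demand is taken from the text.

```python
from dataclasses import dataclass
from typing import List, Tuple

UNKNOWN = -1  # demand not yet observed, as in the text


@dataclass(frozen=True)
class State:
    position: int               # 0 = depot, 1..n = customer locations
    load: float                 # commodity remaining on the vehicle
    demands: Tuple[float, ...]  # residual demand per customer; -1 if unvisited


def is_terminal(s: State) -> bool:
    # final state: vehicle back at the depot, every customer fully replenished
    return s.position == 0 and all(d == 0 for d in s.demands)


def admissible_targets(s: State) -> List[int]:
    """Customers with still-unknown demand the vehicle may move to next.

    In the chapter a decision also carries a subset P of known-demand
    customers to replenish en route (possibly via the depot); here we
    enumerate only the target customer, which is the part the text pins down.
    """
    if is_terminal(s):
        return []
    return [j for j, d in enumerate(s.demands, start=1) if d == UNKNOWN]
```

For example, with the vehicle at the depot and customers 1 and 2 still unvisited, `admissible_targets(State(0, 10.0, (UNKNOWN, UNKNOWN, 4.0)))` returns `[1, 2]`.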
Since this model was presented in Dror et al. (1989) and Dror (1993), there
has been (to our knowledge) only one substantial attempt to solve this hard
problem as a Markov decision model: This was done by Secomandi (1998)
and explored further in the form of heuristics in Secomandi (2000). Secomandi
(1998) reports solving problems with up to 10 customers. However, in order
to simplify the state space somewhat, Secomandi assumes identical discrete
demand distributions for all the customers. Clearly, this is a very hard problem
for a number of reasons. One is the size of the state space. Another reason
is that some subproblems might require exact TSP path solutions. However,
so far this is the most promising methodology for solving the SVRP exactly
without narrowly restricting the policy space.
In a typical vehicle routing setting with a large number of customers - like the
case of propane distribution (Dror et al. 1985, Larson 1988) - time constraints
for completing the daily deliveries cast the problem as a multi-vehicle delivery
problem with each vehicle assigned a route on which the customers’ demand
values are uncertain. From the point of view of a single route with customers
on it, designing a route with the likelihood of numerous route failures makes
little or no practical sense. In practice, a dispatcher of propane delivery vehicles
seldom expects a delivery truck to return more than once to the depot for refill
in order to satisfy all customers’ demands.
In this case, one can construct a tour assuming that in the case of a route
failure at a last customer the vehicle returns to the depot to refill and then incurs
the cost of a back-and-forth trip to the last customer. To construct an optimal
SVRP solution in this case requires solving a TSP in which for all customers
the cost is replaced by A better solution might be
obtained by permitting a return to the depot for refill at a suitable location along
the route and thus prevent the potential of route failure at the last node. This
recourse option can be formulated as a TSP by introducing 2 artificial nodes in
the following manner: Let and be the two artificial nodes. If the
node is entered from one of the nodes in N (note that the depot is denoted
by {0}), it indicates that the solution requires a ‘preventive refill’ (a return to
the depot before the last customer). If the node is entered from one of the
nodes in N, the solution contains no ‘preventive refill’. Thus, only one of the
two artificial nodes will be visited. The costs associated with the nodes
and are:
element in each of the subsets. Solution methodologies for the GTSP have been
developed by Noon and Bean (1991). In addition, one can transform a GTSP
into a TSP (Noon and Bean, 1992). For instance, in our case set
for some large constant M, forcing the two nodes and
to be visited consecutively. Then reset
This makes the transformation complete.
Solving the above TSP over nodes solves our SVRP (with a failure
potentially occurring only at the last node), allowing us to incorporate the option
of preventive refills.
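The two-artificial-node device can be illustrated in code. This is a hedged sketch: the function name, the cost vectors `c_v1`/`c_v2` for arcs to the artificial nodes, and the large negative arc cost -M coupling the two artificial nodes (a common Noon-Bean-style device for forcing consecutive visits) are our assumptions; the chapter's exact cost resetting is not reproduced in this text.

```python
M = 10**6  # large constant, as in the text's transformation

def augment_costs(c, c_v1, c_v2):
    """Augment an n x n TSP cost matrix with two artificial nodes v1, v2.

    c_v1[i], c_v2[i] are the (symmetric) costs of the arcs between
    original node i and the artificial nodes. The -M arc between v1
    and v2 makes any optimal tour traverse the pair consecutively,
    so effectively only one of them is 'chosen', as described above.
    """
    n = len(c)
    big = [row[:] + [c_v1[i], c_v2[i]] for i, row in enumerate(c)]
    big.append([c_v1[j] for j in range(n)] + [0, -M])   # row for v1
    big.append([c_v2[j] for j in range(n)] + [-M, 0])   # row for v2
    return big
```

Solving a TSP over the augmented matrix (and adding M back to the optimal value) then corresponds to choosing between the 'preventive refill' and 'no preventive refill' options.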
In principle, in the expression above the random variables need not be inde-
pendent or identical. They are just ordered and the first are summed into a
partial sum. The connection to the SVRP as described in this paper is obvi-
ous. However, this partial sum describes a setting which is not particular to the
SVRP. In fact, in the stochastic processing and reliability literature, is
referred to as a ‘threshold detection probability’. It provides a measure of
likelihood of overstepping a boundary Q at exactly the trial in successive
steps accumulating the effects of a random sampled phenomenon. Some other
related results together with a conjecture statement are described in Kreimer and
Dror (1990). More specifically, Kreimer and Dror (1990) address the following
questions:
(1) What is the most likely number of trials required to overstep the threshold
value Q ?
(2) In what range of is the sequence monotonically
increasing (decreasing) ?
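Questions (1) and (2) can be explored numerically. The sketch below is our own Monte Carlo illustration, not the authors' method: the demand distribution, function names, and replication count are assumptions. It estimates the distribution of the number of trials needed for the partial sums to overstep the threshold Q.

```python
import random

def overstep_trial_count(Q, sample_demand):
    """Number of i.i.d. draws needed for the partial sum to first exceed Q."""
    total, n = 0.0, 0
    while total <= Q:
        total += sample_demand()
        n += 1
    return n

def estimate_pn(Q, sample_demand, reps=20000, seed=1):
    """Empirical estimate of P_n = P(overstepping occurs at trial n)."""
    random.seed(seed)
    counts = {}
    for _ in range(reps):
        n = overstep_trial_count(Q, sample_demand)
        counts[n] = counts.get(n, 0) + 1
    return {n: c / reps for n, c in sorted(counts.items())}
```

For instance, `estimate_pn(5.0, lambda: random.expovariate(1.0))` returns an empirical histogram over n whose mode and monotone tails can be inspected directly against questions (1) and (2).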
1 Coefficient of variation
We have proven this conjecture for the normal distribution and a few other
distributions. However, the (as yet unproven) claim is that this monotonicity
property is true in general!
8. SUMMARY
This paper deals with the stochastic vehicle routing problem, which involves
designing routes for fixed capacity vehicles serving (delivering to or collecting from
but not both) a set of customers whose individual demands are only revealed
when the vehicles arrive to provide service. At the beginning, we describe the
problem and provide examples which demonstrate the impact demand uncer-
tainty might have on the routing solutions. Since this topic has been examined
by academics and routing professionals for over 20 years, there is a considerable
body of research papers. We certainly have not covered them all in this overview.
The vehicle routing problem with stochastic demands is a very hard problem
and we have attempted to cover all the significant developments for solving
this problem. Starting with early work on simple heuristics, such as the stochastic
Clarke and Wright procedure, followed by chance-constrained formulations and stochastic
programming with recourse models, we have attempted a broad overview. In the
literature, the most frequently encountered papers have focused on stochastic
programming models with limited recourse options such as back-and-forth ve-
hicle trips to the depot for refill in the event of route failure. This approach
has been examined more recently using the so called L-shaped optimization
method (Laporte et al. 2001). Other approaches (Yang et al. 2000) have added
interesting recourse options which could improve on the solution quality. How-
ever, the most promising approach is that of modeling the problem as a Markov
decision process, presented in Dror et al (1989) and Dror (1993), with signifi-
cant modeling and computational progress made recently by Secomandi (1998,
2000).
In short, the vehicle routing problem is very easy to state, but, like a number of
other similar problems, very hard to solve. It combines combinatorial elements
with stochastic elements. The problem is ‘real’ in the sense that we can point
out numerous real-life applications; unfortunately the present state-of-the-art
for solving the problem is not very satisfactory. It is a challenging problem and
we are looking forward to significant improvements in solution procedures -
hopefully in the near future.
REFERENCES
Applegate, D., R. Bixby, V. Chvatal, and W. Cook. (1998). "On the solution of
traveling salesman problems", Documenta Mathematica, Extra Volume ICM
1998 III, 645-656.
Bertsimas, D.J. (1992). "A vehicle routing problem with stochastic demand",
Operations Research 40, 574-585.
Bertsimas, D.J., P. Chervi, and M. Peterson. (1995). "Computational approaches
to stochastic vehicle routing problems", Transportation Science 29, 342-352.
Birge, J.R. (1985). "Decomposition and partition methods for multistage stochas-
tic linear programs", Operations Research 33, 989-1007.
Clarke, C. and J.W. Wright. (1964). "Scheduling of vehicles from a central
depot to a number of delivery points", Operations Research 12, 568-581.
Dror, M. (1983). The Inventory Routing Problem, Ph.D. Thesis, University of
Maryland. College Park, Maryland, USA.
Dror, M. (1993). "Modeling vehicle routing with uncertain demands as a stochas-
tic program: Properties of the corresponding solution", European J. of Op-
erational Research 64, 432-441.
Stewart, W.R., Jr., B.L. Golden, and F. Gheysens. (1983). "A survey of stochastic
vehicle routing", Working Paper MS/S, College of Business and Manage-
ment, University of Maryland at College Park.
Trudeau, P. and M. Dror. (1992). "Stochastic inventory routing: Stockouts and
route failure", Transportation Science 26, 172-184.
Yang, W.-H. (1996). "Stochastic Vehicle Routing with Optimal Restocking",
Ph.D. Thesis, Case Western Reserve University, Cleveland, OH.
Yang, W.-H., K. Mathur, and R.H. Ballou. (2000). "Stochastic vehicle routing
problem with restocking", Transportation Science 34, 99-112.
Chapter 26

Life in the Fast Lane: Yates’s Algorithm, Fast Fourier and Walsh Transforms
Paul J. Sanchez
Operations Research Department
Naval Postgraduate School
Monterey, CA 93943
John S. Ramberg
Systems and Industrial Engineering
University of Arizona
Tucson, AZ 85721
Larry Head
Siemens Energy & Automation, Inc.
Tucson, AZ 85715
Abstract Orthogonal functions play an important role in factorial experiments and time
series models. In the latter half of the twentieth century orthogonal functions be-
came prominent in industrial experimentation methodologies that employ com-
plete and fractional factorial experiment designs, such as Taguchi orthogonal
arrays. Exact estimates of the parameters of linear model representations can be
computed effectively and efficiently using “fast algorithms.” The origin of “fast
algorithms” can be traced to Yates in 1937. In 1958 Good created the ingenious
fast Fourier transform, using Yates’s concept as a basis. This paper is intended
to illustrate the fundamental role of orthogonal functions in modeling, and the
close relationship between two of the most significant of the fast algorithms.
This in turn yields insights into the fundamental aspects of experiment design.
1. INTRODUCTION
Our purpose in writing this paper is to illustrate the role of orthogonal func-
tions in factorial design, and one usage in Walsh and Fourier analysis for sig-
nal and image processing. We also want to exhibit the relationship between the
Yates “fast Factorial Algorithm” and the fast Walsh transform, and to show how
Yates’s algorithm contributed to the development of fast Fourier transforms.
We would like to de-mystify the “black box” aura which often surrounds the
presentation of these algorithms in a classroom setting, and to encourage the
discussion of the topic of computationally efficient algorithms. We think that
this approach is valuable because it demonstrates close links between statistics
and a number of other fields, such as thermodynamics and signal processing,
which are often viewed as quite divergent.
Orthogonal functions serve many purposes in a wide variety of applications.
For example, orthogonal design matrices or arrays play important roles in the
statistical design and analysis of experiments. Discrete Fourier and Walsh
transforms play comparable roles in digital signal and image processing. An
important distinction between the various application areas is the data collec-
tion scheme employed. In statistical experiment design the functions represent
the factors and their interactions, and the experiment is typically run in a ran-
domized order. In signal processing the data are collected over time and repre-
sented in terms of a set of orthogonal functions which are explicit functions of
time.
Historically, the importance of these methods created interest in developing
effective and efficient computational approaches, which are often called “fast”
algorithms. Fast algorithms produce the same mathematical result as the stan-
dard algorithms, and are typically more computationally stable, yielding better
numerical accuracy and precision. Thus they should not be confused with so-
called “quick and dirty” statistical techniques which yield only approximate
results.
Nair (1990) has indicated the importance of attracting applications arti-
cles in the new technological areas of physical science and engineering, such
as semiconductors. Hoadley and Kettenring (1990) have stressed the impor-
tance of communication between statisticians, engineers, and physical scien-
tists. Stoffer (1991) has discussed statistical applications based on the Walsh–
Fourier transform, and has outlined the existing Walsh-Fourier theory for real-
time stationary time series. Understanding the relationship of experiment de-
sign to other orthogonal function based techniques and the corresponding fast
algorithms should be useful in enhancing this communication.
At first glance, high speed computation capabilities might seem to negate the
need for the computational efficiency available using fast transforms. In pattern
recognition problems and signal processing the amount of data being processed
Life in the Fast Lane: Yates’s Algorithm, Fast Fourier and Walsh Transforms 653
2. LINEAR MODELS
Generally, the role of linear models in factorial experiments, Walsh analysis,
and Fourier analysis is not made explicit. The methods of analysis which we
will discuss are all based upon linear models. A discrete indexed linear model
can be represented in matrix form in either a deterministic or statistical context
as
or
In this case, the projection onto the basis B is one-to-one, i.e., y and are com-
pletely interchangeable since either can be constructed (reconstructed) from
the other.
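In symbols (our reconstruction of the standard setup, since the displays themselves are not reproduced here):

```latex
y = B\beta \ \ \text{(deterministic)}
\qquad\text{or}\qquad
y = B\beta + \varepsilon \ \ \text{(statistical)};
\qquad
B^{\top}B = N I
\ \Longrightarrow\
\hat{\beta} = \tfrac{1}{N}\,B^{\top}y,
\quad y = B\hat{\beta}.
```

When B is square with orthogonal columns of common squared length N, the second display shows why y and the coefficient vector are completely interchangeable: either is recovered from the other by a single matrix multiplication.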
We will be applying the linear model in three settings: Factorial Analysis,
Walsh Analysis, and Fourier Analysis. The table in Appendix A serves as a
quick reference to notation.
In general, the analysis matrix for a full factorial experiment can be writ-
ten in matrix form in Yates’s standard order as
The vector is inserted as the first column. The generating vectors and in-
teractions complete the basis. For example, the product vector obtained from
the interaction of and is found as the fourth column of (2.7). The
column vectors of equation (2.7) are represented graphically in Figure 26.2.
It is easily verified that the columns of X all have modulus so that if least
squares is used to estimate
where I is the identity matrix. Thus the estimator is given from (2.4) as
and only one matrix multiplication is required to obtain the
estimates.
The estimator has the same number of elements as the data vector, y, and
can be viewed as an orthogonal transformation of the data vector to the pa-
rameter space. Furthermore, the observation vector can be computed from the
parameter vector by Note that the transformation is information
preserving, since the original vector can be recovered from the parameter vec-
tor. Finally, a smoothed (or parsimonious) predictor of y can be obtained by
setting certain elements of to zero (perhaps those which are not statistically
significant). If we call the new vector then the smoothed predictor is given
by
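The estimation-and-reconstruction cycle described above can be sketched in code (our own; the 2^3 size and the function names are illustrative assumptions): the ±1 design matrix is built in Yates standard order via Kronecker products, the estimates follow from a single matrix multiplication since X'X = NI, and zeroing entries of the estimate before predicting gives the smoothed predictor.

```python
def kron(A, B):
    """Kronecker product of two matrices given as lists of rows."""
    return [[a * b for a in rowA for b in rowB]
            for rowA in A for rowB in B]

def yates_design(k):
    """Design matrix of a full 2^k factorial, columns in Yates
    standard order (mean, A, B, AB, C, AC, BC, ...), entries +/-1."""
    X = [[1]]
    for _ in range(k):
        X = kron([[1, -1], [1, 1]], X)
    return X

def estimates(X, y):
    """Least squares: since X'X = N I, beta-hat = X'y / N."""
    N = len(y)
    return [sum(X[r][c] * y[r] for r in range(N)) / N for c in range(N)]

def predict(X, beta):
    """Reconstruct y (or smooth it, if some beta entries were zeroed)."""
    return [sum(x * b for x, b in zip(row, beta)) for row in X]
```

Calling `predict` with the full estimate recovers the data exactly (the transformation is information preserving); calling it after setting statistically insignificant entries to zero yields the parsimonious predictor.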
Note that the index of a generating vector is the number of adjacent pluses or
minuses, and that for where is the factorial
generating vector with index
To determine the value of a Walsh function whose index is not a power of 2,
decompose the index into its binary representation and take the product of all
Walsh functions corresponding to a 1 bit. For example, the Walsh function with
index 5 is the product of those with indices 4 and 1, since the binary
representation of 5 is 101. As with factorial
designs, the total number of vectors generated in this fashion is To
obtain a complete orthogonal basis, we again insert the vector The
basis is constructed by placing as the column.
In general, the matrix for a Walsh analysis can be written in Hadamard
order as
The generating vectors and are the second, third, and fifth columns
of (2.10), respectively. The product vector obtained from the interaction of
and for example, is found as the fourth column.
Note that this is the analysis matrix X from Yates’s algorithm given in (2.7)
to within a scale factor of {±1} for each column.
where I is the identity matrix. Thus the matrix form for the discrete Walsh
transform (DWT) of a vector y of length is defined as
Note that the DWT is the least squares estimator for Y. Also note that due to
the symmetric nature of the Hadamard matrix, so it is correct to write
the DWT as
In other words, the DWT is its own inverse to within a scale factor of
The transform notation emphasizes the fact that the transformation is infor-
mation preserving — the transform vector contains all of the information in
the original vector. In other words, the transform is not actually changing the
data, but rather is changing our viewpoint of the data. Both the original and the
transform vectors represent exactly the same point in space.
All we have done in the transformation is to change the set of axes from which
we choose to view that point.
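A minimal sketch (our code, with names assumed) of the Hadamard-ordered Walsh basis and the self-inverse property of the DWT: the Sylvester construction builds the matrix, and applying the unnormalized transform twice returns N times the original vector.

```python
def hadamard(k):
    """Sylvester construction H_{2n} = [[H, H], [H, -H]]: a +/-1
    matrix of size 2^k with columns in Hadamard (natural) order."""
    H = [[1]]
    for _ in range(k):
        H = ([row + row for row in H] +
             [row + [-v for v in row] for row in H])
    return H

def dwt(y):
    """Unnormalized discrete Walsh transform H y. Because H is
    symmetric and H H = N I, the same multiplication inverts the
    transform up to the scale factor N."""
    k = len(y).bit_length() - 1
    H = hadamard(k)
    return [sum(h * v for h, v in zip(row, y)) for row in H]
```

For a constant vector all the energy lands in the first (mean) coefficient, mirroring the statistical interpretation of the basis.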
Any orthogonal basis could be used for viewing the data, but some bases
might be more interesting than others because of the physical interpretation
we could place on the results. For statistical interpretation, an appropriate
choice of basis would be the set of vectors we used to determine the inputs to
an experiment. If the outputs fall solidly in some subspace corresponding to
certain inputs, we infer that those inputs are important factors in determining
the output.
This matrix can be expressed in a simpler form as the matrix of exponents after
dividing the exponents by
This notation makes clear that the relationship between the columns is com-
parable to that in a factorial design array. Note, however, that the interaction
columns are the sums of the corresponding factor columns here, since prod-
ucts of powers of a common base can be expressed in terms of the sums of the
exponents.
We can also express the transform in terms of trigonometric functions us-
ing the relationship This is useful for computational
purposes – we can keep track of the magnitudes of the real and imaginary
components separately for each element. In other words, each complex scalar
element of the vector of data is represented by two scalar values, corresponding
to the real and imaginary components.
Complex multiplication can be equivalently expressed in matrix terms. If
and then can be
written as
As with the other bases we have discussed, the vectors are mutually orthogonal,
so the term forms a diagonal matrix and the estimator simplifies to a
single matrix multiplication. However, the first and last column have a different
modulus than the other terms – they are scaled by while the rest are
scaled by Note that is symmetric about its diagonal, so that it is its
own transpose. However, since is complex must be its complex conjugate,
i.e., it is given as equation (2.12) with all signs reversed. We will designate the
complex conjugate henceforth as
When the data are real-valued, as with statistical applications, the imaginary
part of each observation is zero and can be omitted. Hence we can eliminate
the even numbered rows in equation (2.14). (Rows are eliminated, rather than
columns, because the estimators are obtained from The result is that
is a 16 × 8 matrix – clearly eight of the rows are redundant, since we only
need an 8 × 8 matrix to have a complete basis. The question of which vectors
to include in the basis can be resolved by examining Figure 26.5. The set of
points obtained by evaluating form a circle of unit radius
in the complex plane. Note that for that and
In other words, frequencies in the range can
It can be easily verified that the columns form a real-valued orthogonal basis.
The columns of are plotted in Figure 26.6.
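One hedged sketch of such a real-valued basis (our code; N = 8 and the column ordering are assumptions): a constant column, cosine/sine pairs at the Fourier frequencies, and the alternating Nyquist column. The pairwise dot products vanish, and the squared column lengths are N for the first and last columns and N/2 for the rest, which matches the differing moduli noted in the text.

```python
import math

def real_fourier_basis(N):
    """Columns (as lists): constant; cos and sin at frequencies
    j = 1, ..., N/2 - 1; and the alternating Nyquist column cos(pi*t)."""
    cols = [[1.0] * N]
    for j in range(1, N // 2):
        cols.append([math.cos(2 * math.pi * j * t / N) for t in range(N)])
        cols.append([math.sin(2 * math.pi * j * t / N) for t in range(N)])
    cols.append([math.cos(math.pi * t) for t in range(N)])  # (-1)^t
    return cols

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))
```

Verifying orthogonality numerically (all off-diagonal dot products below round-off) confirms that the estimator again reduces to a single matrix multiplication, just as for the factorial and Walsh bases.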
3. AN EXAMPLE
In this section we will present a vector of data and show how the coefficients
are calculated for each of the three bases we have discussed. The results will
then be compared.
Suppose we have run a planned experiment in which the three factors
were varied in controlled fashion. In practice it is usually recommended that
We can then analyze it by calculating the estimator given in equation 2.3. The
resulting estimates are
In other words
We can analyze the same vector in terms of the Walsh basis. The resulting
vector estimate is:
Finally, we analyze the data set using Fourier analysis. The resulting vector
of estimated coefficients is
4. FAST ALGORITHMS
Recall that for any orthogonal basis, the least squares estimator is obtained
from a single matrix multiplication. If the number of operations required to
compute this can be substantially reduced the result is a computationally ef-
ficient algorithm. These are often called “fast” algorithms. The Fast Fourier
Transform (FFT) is probably the best known of these fast algorithms.
At first glance, high speed computation capabilities might seem to negate
the need for the computational efficiency available using fast transforms. A
straightforward implementation of the discrete Fourier transform requires
calculations, where is the number of factors, while the corresponding FFT
requires calculations. The amount of time required for a single-
threaded computer implementation of the algorithm to compute the transform
is proportional to the number of calculations being performed. Figure 26.8
so that
and thus
The multiplication is not explicitly performed for the zero terms in M. The matrix X contains
no zero elements, while M in fact contains precisely two non-zero elements
per row or column, as can be seen above. This means that we perform only
additions or subtractions for each of the M matrices in the factored form
of the algorithm. Since there are such matrix factors, the total amount of
work in the Yates’s approach is rather than if least squares
is applied straightforwardly. Paraphrasing, for we have
so the algorithm is in complexity rather than
Figure 26.9 illustrates how Yates’s algorithm works for the case of
This flow diagram contains equivalent information to the matrix representation
– it describes how to combine those terms with non-zero coefficients. The
diagram could also be used as a schematic for implementing Yates’s algorithm
in hardware. Straight lines indicate that a term is to be added, while dotted lines
indicate that the term is to be subtracted. Observations are input on the left, and
combined as indicated by the lines to produce intermediate values
and then contrasts Thus, at the first intermediate stage
we have
Note that this is exactly the same result we would obtain by explicitly per-
forming the multiplication The contrasts can be scaled to produce
estimates of by either pre- or post-multiplying by which corresponds
to the scale factor of the least squares estimator. For general values of
each column is in length, and there are columns with transforma-
tions being applied. Yates’s is a constant-geometry algorithm, i.e., exactly the
same sequence of operations is applied in each of the transformations. This
is undoubtedly a benefit when the calculations are being performed by hand.
Given the flow diagram, the operations can be expressed algorithmically.
We will need several storage vectors of length which are labeled
for We define to be the element of the data vector
at the iteration. The algorithm follows.
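The algorithm itself is elided in this text; the following is our reconstruction of the standard statement of Yates's method, not necessarily the authors' exact pseudocode. Each of k passes replaces the working vector by the pairwise sums (first half) followed by the pairwise differences (second half), producing the contrasts X'y in Yates standard order, in O(k 2^k) additions and subtractions.

```python
def yates(y):
    """Yates's algorithm for a 2^k factorial design.

    Each pass maps the working vector z to
        [z[0]+z[1], z[2]+z[3], ..., z[1]-z[0], z[3]-z[2], ...];
    after k passes the result is the vector of contrasts X'y in
    Yates standard order (total, A, B, AB, C, AC, BC, ...)."""
    z = list(y)
    k = len(z).bit_length() - 1
    for _ in range(k):
        sums = [z[i] + z[i + 1] for i in range(0, len(z), 2)]
        diffs = [z[i + 1] - z[i] for i in range(0, len(z), 2)]
        z = sums + diffs
    return z
```

Dividing the output by N (or pre-multiplying by the appropriate scale factor, as discussed above) converts the contrasts into the least squares estimates.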
yields the same matrix for a experimental design. Further, it is not nec-
essary to have the component matrices of the factorization be identical (i.e.,
The efficiency comes from sparseness, not symmetry, of the matrix
factors. As we will see in the next section, there exists a whole class of “fast”
algorithms which can be used to evaluate factorial experiments.
where
and
Comparable results exist for the other Walsh ordering schemes. In fact, one
way of performing the sequency ordered FWT is to apply the algorithm just
described after pre-sorting the data into a particular order.
and
The S matrix contains only one non-zero entry per row and column, and so
sorts the data set as described earlier in O(N) time. If desired, the ones can
be replaced with so that the estimators are properly scaled with no
additional work.
Note that the functional form of the matrices for the FFT is identical to
that for the FWT. Only the coefficients have changed. (In fact, calculating the
sequency ordered FWT involves pre-multiplying by exactly the same sorting
matrix.) This is the basis of generalized transform algorithms, which use the
same matrix structure and substitute different sets of coefficients to perform
the various transforms of interest.
5. CONCLUSIONS
It is clear that many of the researchers referenced herein were aware of and
influenced by Yates’s algorithm. However, the fundamental role of orthogonal
transform theory, and relationships between the various “fast algorithms”, ap-
pear to be unfamiliar to many statisticians. The fields of orthogonal factorial
experimental designs and orthogonal transform theory appear at first glance to
have evolved in parallel, with little cross-communication. Despite the initial
impact of Yates’s work, many statisticians treat orthogonal transform theory
and the closely related field of spectral analysis as tools for “time series”, and
appear unaware of the applicability of the work to factorial experimental de-
signs.
Researchers in the field of digital signal processing (DSP) have significantly
extended the work of Yates, Good, Cooley, and Tukey. We believe that it
is worthwhile for statisticians to become familiar with the DSP research in
orthogonal function decomposition for a number of reasons.
The FWT offers two benefits relative to the traditional Yates’s algorithm:
the FWT is an in-place algorithm;
the FWT is its own inverse.
DSP literature is based upon generalized transform theory:
results can be generalized for many orthogonal designs, not just for
and factorials. We have illustrated this using the FFT.
REFERENCES
Ahmed, N., and K. R. Rao. (1971). “The generalised transform.” Proc. Applic.
Walsh Functions, Washington, D.C., AD727000, 60–67.
Beauchamp, K.G. (1984). “Applications of Walsh and Related Functions.” Aca-
demic Press, London.
Chatfield, C. (1984). “The Analysis of Time Series: An Introduction.” Chapman
and Hall, New York.
Cooley, J.W., and J. W. Tukey. (1965). “An algorithm for the machine calcula-
tion of complex Fourier series.” Math. Comput. 19, 297–301.
Good, I.J. (1958). “The interaction algorithm and practical Fourier analysis.” J.
Roy. Stat. Soc. (London), B20, 361–372.
Heideman, M.T., D. H. Johnson, and C. S. Burrus. (1984). “Gauss and the
History of the Fast Fourier Transform.” IEEE ASSP Magazine, Oct. 1984,
14–21.
Hoadley, A.B., and J. R. Kettenring. (1990). “Communications Between Statis-
ticians and Engineers/Physical Scientists.” Technometrics 32(3), 243–247.
Kiefer, J., and J. Wolfowitz. (1959). “Optimum Designs in Regression Prob-
lems.” Ann. Math. Stat. 30, 271–294.
Manz, J.W. (1972). “A sequency-ordered fast Walsh transform.” IEEE Trans.
Audio Electroacoust. AU-20, 204–205.
Nelson, L.S. (1982). “Analysis of Two-Level Factorial Experiments.” Journal of
Quality Technology 14(2), 95–98.
Pratt, W.K., J. Kane, and H. C. Andrews. (1969). “Hadamard transform image
coding.” Proc. IEEE 57, 58–68.
Sanchez, P.J., and S. M. Sanchez. (1991). “Design of frequency domain exper-
iments for discrete-valued factors.” Applied Mathematics and Computation
42(1), 1–21.
Stoffer, D.S. (1991). “Walsh–Fourier Analysis and Its Statistical Applications.”
J. American Statistical Association 86(414), 461–485.
Walsh, J.L. (1923). “A closed set of normal orthogonal functions.” Amer. J.
Math. 45, 5–24.
Yates, F. (1937). “The Design and Analysis of Factorial Experiments.” Techni-
cal Communication No. 35, Imperial Bureau of Soil Science, London.
Uncertainty Bounds in Parameter Estimation with Limited Data
James C. Spall
The Johns Hopkins University
Applied Physics Laboratory
Laurel‚ MD 20723-6099
e-mail: james.spall@jhuapl.edu
Acknowledgments and Comments: This work was supported by U.S. Navy Contract
N00024-98-D-8124. Dr. John L. Maryak of JHU/APL provided many helpful comments and Mr. Robert C.
Koch of the Federal National Mortgage Association (Fannie Mae) provided valuable computa-
tional assistance in carrying out the example. A preliminary version of this paper was published in
the Proceedings of the IEEE Conference on Decision and Control‚ December 1995. This paper is
dedicated to the memory of Sid Yakowitz—a scholar and a gentleman.
1. INTRODUCTION
Meaningful inference in parameter estimation usually involves an estimation
process and an uncertainty calculation. For many estimators—such as least squares‚
maximum likelihood‚ minimum prediction error‚ maximum a posteriori‚ etc.—
there exists an asymptotic theory that provides the basis for determining probabili-
ties and uncertainty regions in large samples (e.g.‚ Hoadley‚ 1971; Ljung‚ 1978;
Serfling‚ 1980). However‚ except for relatively simple cases‚ it is generally not
possible to determine this uncertainty information in the small-sample setting. This
paper presents an approach to determining small-sample probabilities and uncer-
tainty regions for a general class of multivariate M-estimators (M-estimates are
those found as the solution to a system of equations‚ and include those estimates
mentioned above). Theory and implementation aspects will be presented.
The approach is based on a simple—but apparently unexamined—idea.
Suppose that the statistical model being used is some distance (to be defined
below) away from an “idealized” model‚ where the small-sample distribution of
the M-estimate for the idealized model is known. Then the known probabilities
and uncertainty regions for the idealized model provide the basis for computing
the probabilities and uncertainty regions in the actual model. The distance may
be reflected in a conservative adjustment to the idealized quantities. This approach
is fundamentally different from other finite-sample approaches (see below)‚ where
the accuracy of the relevant approximations is tied to the size of the sample
versus the deviation from an idealized model.
The M-estimation framework for the approach encompasses most estimators
of practical interest and allows us to develop concrete regularity conditions that
are largely in terms of the score function (the score function is typically the
gradient of the objective function‚ which is being set to zero to create the system
of equations that yields the estimate). One of the significant challenges in assess-
ing the small-sample behavior of M-estimates is that they are usually nonlinear‚
implicitly defined functions of the data.
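As a toy illustration of this implicit definition (our own construction, not an example from the paper), consider the score equation for an exponential log-likelihood: the M-estimate is defined as the root of the score, which in this simple case can be checked against the closed-form MLE n / sum(x_i).

```python
def score(lam, data):
    # derivative of sum(log(lam) - lam * x_i):  n / lam - sum(x_i)
    return len(data) / lam - sum(data)

def solve_score(data, lo=1e-6, hi=1e6, tol=1e-12):
    """Bisection root-finder; the score above is strictly decreasing in lam."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if score(mid, data) > 0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

data = [0.5, 1.5, 2.0, 4.0]
lam_hat = solve_score(data)    # M-estimate: the root of the score equation
```

For most models of interest no closed form exists and only the implicit, root-of-the-score characterization is available, which is precisely what makes small-sample distribution theory difficult.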
Perhaps the most popular current approach to small-sample analysis is com-
puter-based resampling‚ most notably the bootstrap (e.g.‚ Efron and Tibshirani‚
1986; Hall‚ 1992; and Hjorth‚ 1994). The main appeal of this approach is relative
ease of use, even for complex estimation problems. Rutherford and Yakowitz (1991)
show how the bootstrap applies in the nonparametric regression problem, for
which an analytical treatment would rarely be possible. Resampling techniques make
few analytical demands on the user, instead shifting the burden to computation.
However, the bootstrap may provide a highly inaccurate description of M-estimate
uncertainties in small samples (e.g., Lunneborg, 2000, pp. 97–98). This
poor performance is inherently linked to the limited amount of information in
the small sample‚ with little improvement possible through a larger amount of
resampling.
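For contrast with the approach of this paper, a bare-bones nonparametric bootstrap (a generic sketch, not the paper's method) looks as follows: resample with replacement, recompute the estimate, and read off a percentile interval. With n = 5, every resample is drawn from the same five values, which is the information limitation noted above.

```python
import random

def mean(xs):
    return sum(xs) / len(xs)

def bootstrap_interval(data, estimator, n_boot=2000, alpha=0.05, seed=1):
    """Percentile bootstrap interval for estimator(data)."""
    rng = random.Random(seed)
    stats = sorted(
        estimator([rng.choice(data) for _ in data]) for _ in range(n_boot)
    )
    return stats[int(alpha / 2 * n_boot)], stats[int((1 - alpha / 2) * n_boot) - 1]

data = [2.1, 1.9, 2.4, 2.2, 1.8]   # a very small sample
lo, hi = bootstrap_interval(data, mean)
```

Increasing n_boot sharpens only the Monte Carlo accuracy of the percentiles, not the information content of the five observations.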
Note also that one is working with small samples, and desirable properties as
n → ∞ may not be relevant.
The remainder of this paper is organized as follows. Section 2 describes the
fundamental problem and formally introduces the concept of the idealized dis-
tribution. Associated artificial estimators and data that will be used in character-
izing the probabilities of interest for the real estimator and data are also intro-
duced in this section. Section 3 summarizes how the formulation in Section 2
applies in the areas of signal-plus-noise modeling, nonlinear regression, and time
series correlation analysis. Section 4 presents the main theoretical results, which
characterize the error between the idealized (known) probability and real (un-
known) probability for the parameter estimate lying in a particular compact set
(i.e., uncertainty region). Section 5 presents a thorough analysis of the signal-
plus-noise example introduced in Section 3, including a numerical evaluation.
Section 6 offers a summary and some concluding remarks. The Appendix pre-
sents technical details and a proof of the Theorem.
2. PROBLEM FORMULATION
Suppose we have a vector of data x (representing a sample of size n) whose
distribution depends on a parameter θ and on a known scalar ε, where θ is to be
estimated by maximizing some objective function (say, as in maximum likelihood). The
estimate is the quantity for which we wish to characterize the uncertainty when
n is small. It is assumed to be found as the objective-maximizing solution to the
score equation (2.1), in which the gradient of the objective function is set to zero.
and such that the artificial data have the same distribution as x for the chosen ε and for
ε = 0, respectively. Then, from (2.1), the question becomes one of deter-
mining whether uncertainty regions from this idealized distribution are accept-
able approximations to the unknown uncertainty regions resulting from non-
identical Q_i. In employing the Theorem (via (2.2a, b)), we let the deviation
from the idealized case be expressed through ε.
In cases with a larger degree of difference in the Q_i (as ex-
pressed through a larger ε), this idealized approximation for the uncer-
tainty regions may not be adequate; the implied constants associated with the
bound of the Theorem provide a means of altering the idealized uncertainty re-
gions (these implied constants depend on terms other than ε).
This example illustrates the apparent arbitrariness sometimes present in speci-
fying a numerical value of ε (e.g., if the elements of the h_i are made larger, then the
value of ε must be made proportionally smaller to preserve algebraic equiva-
lence). This apparent arbitrariness has no effect on the fundamental limiting pro-
cess, as it is only the relative values of ε that have meaning after the other param-
eters (e.g., Q, etc.) have been specified. In particular, the numerical value of
the bound does not depend on the way in which the deviation from the
idealized case is allocated to ε and to the other parameters; in this example, the bound
depends on the products ε h_i, which are certainly not arbitrary. We will return
to this signal-plus-noise example in Section 5 for a more thorough treatment.
(The estimates in, say, Anderson (1971, Subsection 6.1) correspond to the MLE when the data
are normally distributed.)
To define the likelihood function for performing the ML estimation, one needs
to choose a particular model of the non-normality. Although the method here
can work with any of the common non-normal models, let us consider the fairly
simple way of supposing the data are distributed according to a nonlinear trans-
formation of a normal random vector. (Two other ways may also be appropriate:
(i) suppose that the data are distributed according to a mixture distribution, where
at least one of the distributions in the mixture is normal and where the weighting
on the other distributions is expressed in terms of ε; or (ii) suppose that the data
are composed of a convolution of two random vectors, one of which is normal
and the other non-normal, with a weighting expressed by ε.) In particular,
consistent with (2.2a, b), suppose that x has the same distribution as a
transformation of z, where z is a normally distributed random vector, with ε
measuring the degree of nonlinearity of the transformation. Since the ε = 0
transformation is linear, the resulting artificial estimate has one of the
finite-sample distributions shown in Anderson (1971, Sect. 6.7) or Wilks (1962,
pp. 592–593) (the specific form of distribution depends on the properties of the
eigenvalues of matrices defining the time series progression). Note that, aside
from entering the score function through the artificial data, ε appears explicitly
(à la (2.1)) through its effect on the form of the distribution (and hence likelihood
function) for the data x. Then, provided that ε is not too large, the Theorem in
Section 4 (with or without the implied constant of the bound, as appropriate)
can be used with the known finite-sample distribution to determine set
probabilities for testing the hypothesis of sequential uncorrelatedness in the
non-normal case of interest.
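A simulation sketch of this device (the cubic form of the transformation is our illustrative choice, not the paper's): eps measures the departure from normality, and eps = 0 recovers the normal draw exactly.

```python
import random

def non_normal_sample(n, eps, seed=0):
    """Draw z ~ N(0, 1) and return (z, x) with x a nonlinear distortion of z."""
    rng = random.Random(seed)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [zi + eps * zi ** 3 for zi in z]   # illustrative transformation z + eps*z^3
    return z, x

z, x0 = non_normal_sample(100, eps=0.0)   # idealized (normal) data
_, x1 = non_normal_sample(100, eps=0.2)   # mildly non-normal data, same z
```

Since z and z**3 share a sign, every distorted observation is at least as large in magnitude as its normal counterpart, giving heavier tails for eps > 0.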
4. MAIN RESULTS
4.1. Background and Notation
This section presents the main result, showing how the difference between the un-
known and known probabilities for the estimate lying in a rectangle
shrinks as ε → 0. In particular, we will be interested in characterizing the prob-
abilities associated with such rectangles.
We present one such constant here; another is presented in Spall (1995). The
constant here will tend to be conservative in that it is based on upper bounds to
certain quantities in the proof of the Theorem. This conservativeness may be
desirable in cases where ε is relatively large, to ensure that the inequality in (4.3) is
preserved in practical applications (i.e., when the higher-order term is ignored).2 The con-
stant in Spall (1995) is less conservative and is determined through a computer
resampling procedure.
Aside from the assumptions of the Theorem, there are two additional condi-
tions under which the constant is valid: (i) the elements of x are mutually indepen-
dent, each having a bounded, continuous density function on some set such
that every point of the rectangle is interior to this set, and (ii) the density
(from Subsection 4.1) is uniformly bounded (note that, even when the density is
continuous, the Theorem itself does not require boundedness).
2 The issue of ignoring higher-order error terms is common in many small- and large-
sample estimation methods. For example, the saddlepoint, bootstrap, and central limit theorem, in
general, all have unquantifiable higher-order error terms in terms of n.
5.1. Background
This section returns to the example of Subsection 3.1 and presents an analysis
of how the small-sample approach would apply in practice. In particular, con-
sider independent scalar observations that are normally distributed with a common
mean and with known, observation-specific noise contributions Q_i to the variance,
where the mean and variance parameters are to be estimated (jointly) using maximum
likelihood. As mentioned in Subsection 3.1, when Q_i ≠ Q_j for at least one pair i, j (the
actual case), no closed-form expression (and hence no computable distri-
bution) is generally available for the estimate. When Q_i = Q_j for all i, j (the idealized
case), the distribution of the estimate is known (see (5.2a, b) below).
For this estimation problem, Subsection 5.2 discusses the regularity condi-
tions of the Theorem and comments on the calculation of the implied constant
c(a, h), and Subsection 5.3 presents some numerical results. This two-parameter
estimation problem is one where the other analytical techniques discussed in
Section 1 (i.e., Edgeworth expansion and saddlepoint approximation) are im-
practical because of the unwieldy calculations required (say, as related to the
cumulant generating function and its inverse).
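A simulation sketch of the setup as we read it from the (partly garbled) text, with our own symbol names m, sigma2, and q for the mean, signal variance, and known noise contributions: in the idealized case with a common q, the MLE has the closed form below, and the possibility of a negative variance estimate in small samples is visible directly.

```python
import random

def simulate(m, sigma2, q_list, seed=3):
    """Independent scalar observations x_i ~ N(m, sigma2 + q_i)."""
    rng = random.Random(seed)
    return [rng.gauss(m, (sigma2 + q) ** 0.5) for q in q_list]

def idealized_mle(x, q):
    """MLE of (m, sigma2) when every observation shares the same known q."""
    n = len(x)
    m_hat = sum(x) / n
    var_hat = sum((xi - m_hat) ** 2 for xi in x) / n
    return m_hat, var_hat - q          # second term may go negative for small n

x = simulate(m=1.0, sigma2=0.5, q_list=[0.2] * 10)
m_hat, s2_hat = idealized_mle(x, q=0.2)
```

When the q_i differ, no such closed form is available and the estimate must be found numerically from the score equation, which is what forces the idealized-versus-actual comparison of this section.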
When using the distribution for the idealized estimate as an approximation to the actual
distribution (when justified by the Theorem), we choose a value of Q correspond-
ing to the "information average" of the individual Q_i; i.e., Q is such that
1/Q is the mean of the 1/Q_i. (The idea of summing information terms for different mea-
surements is analogous to the idea in Rao (1973, pp. 329–331).) As discussed in
Subsection 3.1, deviations of order ε from the common Q are then naturally ex-
pressed in the inverse domain, 1/Q_i = 1/Q + ε h_i, where the h_i are some fixed
quantities (discussed below). Working with information averages has proven de-
sirable as a way of down-weighting the relative contribution of the larger Q_i
versus what their contribution would be, say, if Q were a simple mean of the Q_i
(from (5.1) below, we see that the score expression also down-weights the data
associated with larger Q_i). A further reason to favor the information average is
that the score is naturally parameterized directly in terms of the 1/Q_i through use of
the relationship just given. Hence, 1/Q represents the
mean of the natural nuisance parameters in the problem. Finally, we have found
numerically that the "idealized" probabilities computed with the information average
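The "information average" as we read it from the text (the defining relation, 1/Q equal to the mean of the 1/Q_i, is our reconstruction) is a harmonic-type mean, so large Q_i are down-weighted relative to the simple mean:

```python
def information_average(q_list):
    """Q such that 1/Q is the average of the 1/Q_i."""
    n = len(q_list)
    return 1.0 / (sum(1.0 / q for q in q_list) / n)

q = [1.0, 1.0, 10.0]
info_avg = information_average(q)   # 3 / 2.1, about 1.43
simple_avg = sum(q) / len(q)        # 4.0
```

The single large noise term dominates the simple mean but barely moves the information average, mirroring the down-weighting of noisy observations in the score.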
3 Note that for this example, the noise levels are relatively small, so techniques such as those in
Chesher (1991) may have been useful. However, this example was chosen with small noise only to
avoid the "tangential" (to this paper) issue of coping with negative variance estimates; there is
nothing inherent in the small-sample approach that requires small noise (e.g., the other two
examples discussed in Section 3 do not fit into the small-noise framework).
for the two values of ε considered. We see that with the smaller ε the true and small-sample densities are virtually
identical throughout the domain, while the asymptotic-based density is dramati-
cally different. For the larger ε there is some degradation in the match between the
true and idealized small-sample densities, but the match is still much better than
that between the true and asymptotic-based densities. Of course, it is the purpose of
the adjustment based on c(a, h) to compensate for such a discrepancy in the
confidence interval calculation. Note that the true densities illustrate the fre-
quency with which we can expect to see a negative variance estimate, which is
an inherent problem due to the small size of the sample (the asymptotic-based
density significantly overstates this frequency). Because of the relatively poor
performance of the asymptotic-based approach, we focus below on comparing
uncertainty regions from only the true distributions and the small-sample ap-
proach.4
Figure 5.2 translates the above into a comparison of small-sample uncertainty
regions with the true regions. Included here are regions based on the error term of
the Theorem when quantified through use of the constant c(a, h). The small-
sample regions are "nominal" regions in the sense that the distributions in (5.2a, b)
4 However, as one illustration of the comparative performance of the small-sample and asymptotic
approaches for confidence region calculation, consider the CEP estimation problem mentioned in
Subsection 3.1 above. The CEP estimates are based on the signal-plus-noise estimation example
of this section. For the ε = 0.15 case, the 90% confidence interval of interest for the CEP estimate
was approximately 30% narrower when using the small-sample approach than when using the
standard asymptotic approach. Hence, by more properly characterizing the distribution of the
underlying parameter estimates, we are able to extract more information about the CEP quantity
of interest.
and are evaluated at the true parameter values (consistent with a hypothesis testing
framework where there is an assumed "true" parameter). The indicated interval end points
were chosen based on preserving equal probability (0.025) in each tail, with the
exception of the conservative case; here the lower bound went slightly
below 0 using symmetry, so the lower end point was shifted upward to 0 with a
corresponding adjustment made to the upper end point to preserve at least 95%
coverage. (Spall (1995) includes more detail on how the prob-
ability adjustment was translated into an uncertainty interval adjustment.) For
the smaller-ε case, we see that the idealized small-sample bound is identical to the
true bound (this, of course, is the most desirable situation, since there is then no
need to work with the c(a, h)-based adjustment). As expected, the uncertainty
intervals with the conservative (c(a, h)-based) adjustments are wider. For the
larger-ε case, there is some degradation in the accuracy of coverage for the ide-
alized small-sample interval, which implies a greater need to use the conserva-
tive interval to ensure the intended coverage probability for the interval.
The above study is fully representative of others that we have conducted for
this estimation framework (e.g., nominal coverage probabilities of 90% and 99%,
and other values of ε). They illustrate that with relatively small values
of ε the idealized uncertainty intervals are very accurate, but that with larger
values (e.g., 0.30) the idealized interval becomes visibly too short. In
these cases, the c(a, h)-based adjustment to the idealized interval provides a means
for broadening the coverage to encompass the true uncertainty interval.
6. SUMMARY AND CONCLUDING REMARKS
Implementations of the approach were discussed for three distinct well-
known identification settings to illustrate the range of potential applications. These
were a signal-plus-noise maximum likelihood estimation problem, a general non-
linear regression setting, and a problem in non-Gaussian time series correlation
analysis.
The signal-plus-noise example was considered in greatest detail. This prob-
lem is motivated for the author by the analysis and (state-space) modeling of a
naval defense system. The small-sample approach was relatively easy to imple-
ment for this problem and yielded accurate results. The required idealized case
was one where the data are i.i.d. When the actual data were moderately non-i.i.d.,
it was shown that the small-sample approach yielded results close to the true
(uncomputable) uncertainty regions. In cases where the actual data deviated a
greater amount from the i.i.d. setting, the small-sample approach yielded conser-
vative uncertainty regions based on a quantification of the implied constant of the
order bound of the Theorem. This example provided a realistic illustration of the
types of issues arising in using the approach in a practical application.
While this paper provides a new method for small-sample analysis, further
work would enhance the applicability of the approach. One area would be to
explore, in detail, the application to problems other than the signal-plus-noise
problem. Two candidate problems in time series and nonlinear regression were
sketched in Section 3. Another valuable extension of the current work would be
to automate (as much as possible) the computation of the implied constant that is
used to provide a conservative uncertainty region. It would also be useful to carry
out a careful comparative analysis of the relative accuracy, ease of implementa-
tion, and applicability of the saddlepoint method, the bootstrap, and the method
of this paper. Despite these open problems, the method here offers a viable ap-
proach to a range of small-sample estimation problems.
Briefly, the conditions above are used in the proof as follows. The bounded
sets assumed in C.1 ensure that a domain of interest can be covered by a finite
number of "small" neighborhoods. The fact that the derivative in C.2 exists and is
bounded away from 0 ensures that an important ratio with this derivative in the de-
nominator exists a.s. The assumption in C.3 regarding the continuous differentia-
bility of certain expressions ensures (via the implicit function theorem) the local
one-to-oneness of certain transformations, which in turn ensures the existence of
certain density functions. C.4 allows for the substitution of an easier probability
expression for a more complicated probability (to within "negligible" error).
Finally, C.5 guarantees that local density functions exist for the "artificial" data
vector z, which can then be used via the mean value theorem to characterize a set
probability of interest.
In the proof of the Theorem below, we use the following two Lemmas, which
have relatively straightforward proofs given in Spall (1995).
Lemma 1 For a random vector let be continuity points for
its associated distribution function F(·). Then
where
For Lemma 2‚ represents a term such that converges to 0 in prob-
ability as
Lemma 2 Let and A be two random variables and one event (respec-
tively)‚ with dependent on the introduced above. Suppose that
where and that Then
and
For the probabilities in the first sum, we know that the event occurs
only if Letting we have that the event for the jth summand
corresponds to
For the probabilities in the second sum on the right-hand side of (A.1), we know
that so the event for the jth summand corresponds to
By condition C.4 we know that the probability of the event on the r.h.s. of (A.2a)
can be bounded above to within by
Hence, from (A.2a, b), to within each of the 2p probabilities associated with
the summands in (A.1) may be bounded above by the expression in (A.3).
Now, for almost all z in the set we know by condition C.2 and Taylor’s
theorem that
Since Lemma 2 applies (see Spall (1995))‚ we can replace (to within error)
the two probabilities in (A.5) by the following probabilities that do not depend on
and are easier to analyze:
We now show that these two probabilities are This will establish the main
result to be proved.
To show the above, Spall (1995) shows that the conditional (on ) densities for
exist near 0 for each j and ± sign such that We then use these
densities to characterize the probabilities in (A.6). (When an the corresponding
probability in (A.6) is by C.1 and C.2.) To characterize these densities,
we first use C.2, C.3, and the inverse function theorem to establish that
local densities exist in a finite number of disjoint regions. The inverse
and implicit function theorems are then used to establish the form for the joint
density for and (representing the elements of z after the
scalar element from C.3 has been removed) in each of the local regions (likewise
for Then can be integrated out, leaving the densities of
interest for in each local region. On each of these local regions, it can
be shown that the relevant probability is Taking the union over the finite
number of local regions establishes that the probabilities in (A.6) are Q.E.D.
REFERENCES
Anderson, T. W. (1971). The Statistical Analysis of Time Series. Wiley, New York.
Bickel, P. J., and K. A. Doksum (1977). Mathematical Statistics. Holden-Day, Oakland,
CA.
Bisgaard, S., and B. Ankenman (1996). Standard errors for the eigenvalues in second-
order response surface models. Technometrics, 38:238–246.
Chen, Z., and K.-A. Do (1994). The bootstrap method with saddlepoint approximations
and importance sampling. Statistica Sinica, 4:407–421.
Chesher, A. (1991). The effect of measurement error. Biometrika, 78:451–462.
Cook, P. (1988). Small-sample Bayesian frequency-domain analysis of autoregressive
models. In J. C. Spall, editor, Bayesian Analysis of Time Series and Dynamic Models,
pp. 101–126. Marcel Dekker, New York.
Daniels, H. E. (1954). Saddlepoint approximations in statistics. Annals of Mathematical
Statistics, 25:631–650.
Davison, A. C., and D. V. Hinkley (1988). Saddlepoint approximations in resampling
methods. Biometrika, 75:417–431.
Efron, B., and R. Tibshirani (1986). Bootstrap methods for standard errors, confidence
intervals, and other measures of statistical accuracy (with discussion). Statistical
Science, 1:54–77.
Field, C., and E. Ronchetti (1990). Small Sample Asymptotics. IMS Lecture Notes–
Monograph Series (vol. 13). Institute of Mathematical Statistics, Hayward, CA.
Fraser, D. A. S., and N. Reid (1993). Third-order asymptotic models: Likelihood functions
leading to accurate approximations for distribution functions. Statistica Sinica, 3:
67–82.
Ghosh, J. K. (1994). Higher Order Asymptotics. NSF-CBMS Regional Conference Se-
ries in Probability and Statistics, Volume 4. Institute of Mathematical Statistics,
Hayward, CA.
Ghosh, M., and J. N. K. Rao (1994). Small area estimation: An appraisal (with discussion).
Statistical Science, 9:55–93.
Goutis, C., and G. Casella (1999). Explaining the saddlepoint approximation. American
Statistician, 53:216–224.
Hall, P. (1992). The Bootstrap and the Edgeworth Expansion. Springer-Verlag, New York.
Hjorth, J. S. U. (1994). Computer Intensive Statistical Methods. Chapman and Hall, Lon-
don.
Hoadley, B. (1971). Asymptotic properties of maximum likelihood estimates for the
independent not identically distributed case. Annals of Mathematical Statistics,
42:1977–1991.
Hui, S. L., and J. O. Berger (1983). Empirical Bayes estimation of rates in longitudinal
studies. Journal of the American Statistical Association, 78:753–760.
Huzurbazar, S. (1999). Practical saddlepoint approximations. American Statistician,
53:225–232.
James, A. T., and W. N. Venables (1993). Matrix weighting of several regression coeffi-
cient vectors. Annals of Statistics, 21:1093–1114.
Joshi, S. S., H. D. Sherali, and J. D. Tew (1994). An enhanced RSM algorithm using
gradient-deflection and second-order search strategies. In J. D. Tew et al., editors, Pro-
ceedings of the Winter Simulation Conference, pp. 297–304.
Kmenta, J. (1971). Elements of Econometrics. Macmillan, New York.
Kolassa, J. E. (1991). Saddlepoint approximations in the case of intractable cumulant
generating functions. In Selected Proceedings of the Sheffield Symposium on Applied
Probability. IMS Lecture Notes–Monograph Series (vol. 18). Institute of Mathemati-
cal Statistics, Hayward, CA, pp. 236–255.
Laha, R. G., and V. K. Rohatgi (1979). Probability Theory. Wiley, New York.
Lieberman, O. (1994). On the approximation of saddlepoint expansions in statistics.
Econometric Theory, 10:900–916.
Ljung, L. (1978). Convergence analysis of parametric identification methods. IEEE Trans-
actions on Automatic Control, AC-23:770–783.
A Tutorial on Hierarchical Lossless Data Compression
John C. Kieffer
Department of Electrical and Computer Engineering
University of Minnesota
Minneapolis, MN 55455
Abstract Hierarchical lossless data compression is a compression technique that has been
shown to effectively compress data in the face of uncertainty concerning a proper
probabilistic model for the data. In this technique, one represents a data sequence
x using one of three kinds of structures: (1) a tree called a pointer tree, which
generates x via a procedure called “subtree copying”; (2) a data flow graph
which generates x via a flow of data sequences along its edges; or (3) a context-
free grammar which generates x via parallel substitutions accomplished with
the production rules of the grammar. The data sequence is then compressed
indirectly via compression of the structure which represents it. This article is
a survey of recent advances in the rapidly growing field of hierarchical lossless
data compression. In the article, we illustrate how the three distinct structures
for representing a data sequence are equivalent, outline a simple method for
designing compact structures for representing a data sequence, and indicate the
level of compression performance that can be obtained by compression of the
structure representing a data sequence.
1. INTRODUCTION
A modern day data communication system must be capable of transmitting
data of all types, including textual data, speech/audio data, or image/video
data. The block diagram which follows depicts a data communication system,
consisting of encoder, channel, and decoder:
Every leaf vertex of the tree is labelled by either a symbol from the data
alphabet or by a pointer label “pointing to” some nonterminal vertex of
the tree. (A vertex containing a pointer label shall be called a pointer
vertex.)
There exists at least one leaf vertex of the tree which is labelled by a
pointer label.
The data sequence represented by the pointer tree can be recovered from
the pointer tree via “subtree copying.”
We explain the "subtree copying" procedure. Define a data tree to be a tree
in which every leaf vertex of the tree is labelled by a symbol from the data
alphabet. Suppose T is a pointer tree, and that u is a leaf vertex of T pointing
to a nonterminal vertex v of T. Suppose the subtree of T rooted at v is a data
tree. Let T' be the tree obtained by appending to u the subtree of T rooted at v.
We say that the tree T' is obtained from the tree T by one round of subtree
copying. The tree T' is either a pointer tree possessing one less pointer vertex
than T, or it is a data tree. Suppose we start with a pointer tree T_1 having exactly
k pointer vertices, and are able to construct trees T_2, ..., T_{k+1} such that T_{i+1} is
obtained from T_i via one round of subtree copying, i = 1, ..., k. Then T_{k+1}
must be a data tree. It can be shown that if T'_1, ..., T'_{k+1} is some
other sequence of trees in which T'_{i+1} is obtained from T'_i via subtree copying,
then T'_{k+1} = T_{k+1}. Therefore, we may characterize T_{k+1} as the unique
data tree obtainable via finitely many rounds of subtree copying, starting from
the pointer tree T_1. We call T_{k+1} the data tree induced by the pointer tree T_1.
Order the leaf vertices of the induced data tree in depth-first order. If we write down the sequence
of data labels that we see according to this ordering, we obtain a data sequence,
which we will term the data sequence represented by the pointer tree T_1.
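The pointer-tree machinery above can be sketched in a few lines (the data representation is ours; the article works from figures): internal vertices carry an id, and leaves carry either a data symbol or a pointer "ptr:k" to the internal vertex with id k. Expanding all pointers yields the induced data tree, whose depth-first leaf labels give the represented data sequence. The sketch assumes, as the article does, that the expansion terminates.

```python
def find(tree, target_id):
    """Depth-first search for the internal vertex (id, children) with the given id."""
    if isinstance(tree, tuple):
        if tree[0] == target_id:
            return tree
        for child in tree[1]:
            hit = find(child, target_id)
            if hit is not None:
                return hit
    return None

def expand(tree):
    """Replace each pointer leaf by a copy of the subtree it points to."""
    if isinstance(tree, str):
        return tree
    vid, children = tree
    new_children = []
    for child in children:
        if isinstance(child, str) and child.startswith("ptr:"):
            new_children.append(expand(find(root, int(child[4:]))))  # subtree copying
        else:
            new_children.append(expand(child))
    return (vid, new_children)

def leaves(tree):
    """Data sequence represented by a data tree: leaf labels in depth-first order."""
    if isinstance(tree, str):
        return [tree]
    return [s for child in tree[1] for s in leaves(child)]

# Vertex 0 is the root; vertex 1 has data children a, b; the second child
# of the root points back to vertex 1.
root = (0, [(1, ["a", "b"]), "ptr:1", "c"])
sequence = leaves(expand(root))    # ['a', 'b', 'a', 'b', 'c']
```

The repeated block "ab" is stored once and referenced once, which is the source of the compression.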
Example 2: Fig. 5 illustrates a pointer tree. By convention, we take the
pointer labels to mean that we are pointing to the nonterminal vertices
labelled 1, 2, 3, respectively. (The internal labels 1, 2, 3 are really not needed;
we shall see how to get along without these labels in Section 3.) The reader
can easily verify that the tree in Fig. 6 is obtainable from the Fig. 5 tree via four
rounds of subtree copying. Therefore, this tree is the data tree induced by the
Fig. 5 tree. (The reader can also verify that the Fig. 6 tree is obtained regardless
of the order in which the subtree copyings are done. For example, one can
do rounds of subtree copying to the vertices which in depth-first ordering have
labels respectively; one can also do the copying according to the
In Fig. 6, we have kept the internal labels 1,2,3 that were in the Fig. 5 tree,
in order to illustrate an important principle. Notice that the subtrees of the
Fig. 6 data tree rooted at vertices 1,2,3 are distinct. Our important principle
is the following: One should strive to find a pointer tree representing a given
data sequence so that any two leaf vertices of the pointer tree which point to
distinct vertices should have distinct data trees appended to them in the rounds
of subtree copying. If a given pointer tree does not obey this property, then it
can be reduced to a simpler pointer tree.
(p.2) Each noninput vertex of the data flow graph possesses two or more
incoming edges, and there is an implicit ordering of these incoming edges.
(p.3) Each input vertex V contains a label which is a symbol from the
data alphabet.
(p.4) The input labels uniquely determine a data sequence label for each
noninput vertex V of the graph. The labels on the vertices of the graph
satisfy the equations
Property (p.4) indicates why we call our graph a data flow graph. Visualize
“data flow” through the graph in several cycles. In the first cycle, one computes
at each vertex V whose incoming edges all start at input vertices. In each
succeeding cycle, one computes the label at each vertex V whose incoming
edges all start at vertices whose labels have been determined previously. In the
final cycle, the data sequence represented by the data flow graph is computed
at the output vertex. The following example illustrates the procedure.
Example 3: Let us compute the data sequence represented by the data flow
graph in Fig. 7. We suppose that the direction of flow along edges is from left
to right, and that incoming edges to a vertex are ordered from top to bottom.
On the first cycle, we compute
Computation of the label on the fourth and last cycle tells us what is:
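Fig. 7 is not reproduced here, so the cycle-by-cycle evaluation can only be illustrated on an assumed graph. The sketch below also assumes, since the defining equations above were lost, that each noninput vertex's label is the concatenation of its predecessors' labels in edge order, which is the standard rule for grammar-derived data flow graphs:

```python
def evaluate(inputs, preds, output):
    """Evaluate a data flow graph in 'cycles': in each cycle, label every
    still-unlabelled vertex all of whose incoming edges start at vertices
    labelled in earlier cycles.

    inputs: dict input vertex -> data symbol
    preds:  dict noninput vertex -> ordered list of predecessor vertices
            (label assumed to be the concatenation of predecessor labels)
    output: the output vertex
    """
    label = dict(inputs)
    while output not in label:
        ready = [v for v in preds
                 if v not in label and all(p in label for p in preds[v])]
        if not ready:
            raise ValueError('graph is not acyclic or not connected')
        for v in ready:                       # one cycle
            label[v] = ''.join(label[p] for p in preds[v])
    return label[output]
```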
One property of a good data flow graph for representing a data sequence
should be mentioned here: The sequences computed at the vertices of the data
flow graph should be distinct. If this property fails, then two or more vertices
of the data flow graph can be merged to yield a simpler data flow graph which
also represents the given data sequence.
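The merging step can be sketched as a bottom-up hash-consing pass: two vertices compute the same sequence whenever they are inputs carrying the same symbol, or noninput vertices whose ordered predecessor lists agree after merging. The two-dict graph encoding is an illustrative assumption:

```python
def merge_duplicates(inputs, preds):
    """Collapse data flow graph vertices that compute the same sequence.

    inputs: dict input vertex -> data symbol
    preds:  dict noninput vertex -> ordered predecessor list
    Assumes the graph is acyclic.  Returns the merged (inputs, preds).
    """
    rep_of_sig = {}           # structural signature -> representative
    rep = {}                  # vertex -> representative

    def canon(v):
        if v not in rep:
            if v in inputs:
                sig = ('in', inputs[v])
            else:
                sig = ('op', tuple(canon(p) for p in preds[v]))
            rep[v] = rep_of_sig.setdefault(sig, v)
        return rep[v]

    for v in list(inputs) + list(preds):
        canon(v)
    new_inputs = {rep[v]: inputs[v] for v in inputs}
    new_preds = {rep[v]: [rep[p] for p in preds[v]] for v in preds}
    return new_inputs, new_preds
```

Here, merging two inputs with the same symbol can make two downstream vertices identical in turn, which the recursive signature computation picks up automatically.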
where V, the left member of the production rule (1.2), is a variable of the
grammar G, and each belonging to the right member of the production rule
(1.2) is either a variable of G or a symbol from the data alphabet. The variables
of G shall be denoted by capital letters the symbols in the data
alphabet are distinct from the variables of G, and shall be denoted by lower case
letters The variable is a special variable called the root variable
of G; it is the unique variable which does not appear in the right members of
the production rules of G. For each variable V of the grammar G, it is required
that there be one and only one production rule (1.2) of G whose left member
is V; such a grammar is said to be deterministic. With these assumptions,
A Tutorial on Hierarchical Lossless Data Compression 719
one is assured that the language L(G) generated by G must satisfy one of the
following two properties:
L(G) consists of exactly one data sequence; or
L(G) is empty.
To see which of these two properties holds, one performs rounds of parallel
substitutions using the production rules of G, starting with the root variable
In each round of parallel substitutions, one starts with a certain sequence of
variables and data symbols generated from the previous round; each variable
in this sequence is replaced by the right member of the production rule whose
left member is that variable—all of the substitutions are done simultaneously.
There are only two possibilities:
Possibility 1: After finitely many rounds of parallel substitutions, one encoun-
ters a data sequence for the first time; or
Possibility 2: One never encounters a data sequence, no matter how many
rounds of parallel substitutions are performed.
Let be the number of variables of the grammar G. In Kieffer and Yang
(2000), it is shown that if one does not encounter a data sequence after rounds
of parallel substitutions, then Possibility 2 must hold. This gives us an algorithm
which runs in a finite amount of time to determine whether or not Possibility 1
holds.
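The finite-time test can be sketched as follows; grammars are encoded as a dict from variables (ints) to right members (lists of variables and one-character data symbols), an illustrative convention:

```python
def language_of(rules, root):
    """Return the single data sequence in L(G), or None if L(G) is empty.

    By the Kieffer-Yang bound quoted above: if no data sequence appears
    within k rounds of parallel substitution, where k is the number of
    variables, then none ever will (Possibility 2).
    """
    k = len(rules)
    seq = [root]
    for _ in range(k + 1):
        if not any(s in rules for s in seq):
            return ''.join(seq)             # Possibility 1
        seq = [t for s in seq
                 for t in (rules[s] if s in rules else [s])]
    return None                             # Possibility 2
```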
Suppose Possibility 1 holds. Let be the data sequence generated by the
grammar G after finitely many rounds of parallel substitutions. Then, L(G) =
We call the data sequence represented by G.
We list the requirements that shall be placed on any grammar G that is used
for representing a data sequence:
Requirement (i): The grammar G is a deterministic context-free grammar.
could be made simpler if (iii) or (iv) were not true. A grammar G satisfying
requirements (i)-(iv) shall be called an admissible grammar.
Example 4: Let G be the admissible grammar whose production rules are
Starting with the root variable the sequences that are obtained via rounds
of parallel substitutions are:
Traverse the entries of the rows of the display (1.3) in the top-down, left-to-right
order; if you write down the first appearances of the variables you encounter
in order of their appearance, you obtain the ordering
which is in accordance with the numbering of the variables of G. We can always
number the variables of an admissible grammar so that this property will be true,
and shall always do so in the future. The rounds of parallel substitutions that
we went through to obtain the sequence (1.4) are easily accomplished via the
four line Mathematica program
S={1};
P={{2,3,4,4,5,2}, {6,6,b}, {a,6}, {3,b,a}, {b,b}, {a,5}};
Do[S=Flatten[S/. Table[i->P[[i]],{i,Length[P]}]],{i,Length[P]}];
S
which the reader is encouraged to try. You can run this program given any
admissible grammar to find the data sequence represented by the grammar. All
that needs to be changed each time is the second line of the program, which
gives the right members of the production rules of the grammar.
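For readers without Mathematica, the same rounds of parallel substitutions are easy to reproduce in Python; the grammar below is transcribed from the second line of the program (variables 1 through 6, data symbols a and b):

```python
def parallel_expand(rules, root=1):
    """Apply all production rules in parallel until only data symbols
    remain (assumes the grammar does terminate, as in Example 4)."""
    seq = [root]
    while any(isinstance(s, int) for s in seq):
        seq = [t for s in seq
                 for t in (rules[s] if isinstance(s, int) else [s])]
    return ''.join(seq)

# right members of the production rules, transcribed from the line P
rules = {1: [2, 3, 4, 4, 5, 2],
         2: [6, 6, 'b'],
         3: ['a', 6],
         4: [3, 'b', 'a'],
         5: ['b', 'b'],
         6: ['a', 5]}
```

Calling parallel_expand(rules) produces the 32-symbol data sequence represented by this grammar; as with the Mathematica program, only the rules dict needs to change to handle a different admissible grammar.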
Notice that the grammar G in Example 4 obeys the following two properties:
Property 1: Every variable of G except the root variable appears at least twice
in the right members of the production rules of G.
Property 2: If you slide a window of width two along the right members of
the production rules of G, you will never encounter two disjoint windows
containing the same sequence of length two.
An admissible grammar satisfying Property 1 and Property 2 is said to be irre-
ducible. There is a growing body of literature on the design of hierarchical loss-
less compression schemes employing irreducible grammars (Nevill-Manning
and Witten, 1997a-b; Kieffer and Yang, 2000; Yang and Kieffer, 2000).
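Both properties are mechanically checkable. The sketch below reads "two disjoint windows" as two non-overlapping occurrences of the same length-2 block, possibly in different right members — an interpretation, since the formal definition is not spelled out here:

```python
from collections import Counter

def is_irreducible(rules, root):
    """Check Properties 1 and 2 for an admissible grammar.

    rules: dict variable -> right member (list of variables and data
    symbols); the variables are exactly the dict keys.
    """
    # Property 1: every variable except the root appears at least twice
    # in the right members
    counts = Counter(s for rhs in rules.values() for s in rhs if s in rules)
    if any(v != root and counts[v] < 2 for v in rules):
        return False
    # Property 2: no length-2 window recurs at disjoint positions
    seen = {}                              # pair -> [(rule, position)]
    for v, rhs in rules.items():
        for i in range(len(rhs) - 1):
            pair = (rhs[i], rhs[i + 1])
            if any(w != v or abs(i - j) >= 2 for (w, j) in seen.get(pair, [])):
                return False
            seen.setdefault(pair, []).append((v, i))
    return True
```

Note that two *overlapping* occurrences of the same pair (as in a run like a, a, a) do not violate Property 2 under this reading.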
Example 5: Referring to the pointer tree in Fig. 5, we see that the corre-
sponding grammar has production rules
Any tree which can be built up from these 9 trees is a pointer tree for the same
data sequence represented by the grammar (2.6). (Start the tree building process
by joining two of the trees in the array (2.7) to form a single tree—joining of
two trees is accomplished by merging the root vertex of one tree with a leaf
vertex of the other tree, where these two vertices have the same label. This
gives an array of 8 trees; join two of the trees in this array. Repeated joinings,
8 of them in all, gradually reduce the original array (2.7) to a single tree, the
desired pointer tree.) One of the pointer trees constructible by this method is
given in Fig. 8; another one is given in Fig. 9.
right members of the production rules of G. Grow a directed acyclic graph with
vertices and edges as follows:
Step 1: Draw vertices on a piece of paper. Label of them with the
labels respectively. Label the remaining of them with
the data symbols appearing in the right members of the production rules.
Step 2: For each let
PRUNING ALGORITHM
Step 1: List the nonterminal vertices of in depth-first order. Let this list
be
Step 2: Traverse the list (3.9) from left to right, underlining each for which
has not been seen previously. Let be the set of underlined
vertices in the list (3.9), and let be the set consisting of each nonun-
derlined vertex in the list (3.9) whose father in belongs to
Step 3: Prune from the tree all vertices which are successors of the
vertices in Let be the resulting pruned tree. (If this step has
been done correctly, then the set of nonterminal vertices of will be
and the set of leaves of which are not leaves of will
be
Step 4: Attach a pointer label to each vertex V in which points to the
unique vertex in for which T(V) and are the same. The
pruned tree with the pointer labels attached, is a pointer tree
representing the data sequence
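With the trees of Figs. 10 and 11 unavailable, the four steps can be sketched abstractly: traverse the nonterminal vertices of the data tree in depth-first order, keep the first occurrence of each distinct subtree, and replace every repeat by a pointer to that first occurrence. The tuple representation and numbering below are illustrative assumptions:

```python
def prune(tree):
    """Sketch of the PRUNING ALGORITHM.

    Internal vertices are tuples of children; leaves are data symbols.
    Kept nonterminal vertices are numbered in depth-first order; a
    repeated subtree becomes ('ptr', k), pointing to the kept vertex
    whose subtree T(V) is identical (Steps 2-4 above).
    """
    first = {}                    # subtree -> index of first occurrence
    count = [0]
    def walk(node):
        if not isinstance(node, tuple):
            return node           # leaf: a data symbol
        if node in first:
            return ('ptr', first[node])      # pruned; pointer attached
        count[0] += 1
        first[node] = count[0]
        return tuple(walk(c) for c in node)
    return walk(tree)
```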
Example 8: We find a pointer tree representation for the data sequence
of length 12. Forming the tree and then the tree
we obtain the tree in Fig. 10. Notice that we have enumerated the nonterminal
vertices of in Fig. 10 in depth-first order. Executing Steps 1 and 2, we
see that
4. ENCODING METHODOLOGY
Let us now refer back to the two parts of a hierarchical lossless compression
scheme given in Figs. 2 and 3. From the preceding sections, the reader under-
stands the nature of a transform that can be used in Fig. 2 and the corresponding
inverse transform in Fig. 3. To be precise, we have learned three distinct options
for (transform,inverse transform) in dealing with a data sequence
Option 1: Supposing to be the length of one could transform into the
pointer tree which represents it, via the PRUNING ALGORITHM
of Section 3. Then can be the “data structure” in Fig. 2. The inverse
transform in Fig. 3 then employs the subtree copying method of Section
1.1 to obtain from
Option 2: One could take the “data structure” in Fig. 2 to be the admissible
grammar G associated with the pointer tree as in Section 2.1.
The inverse transform in Fig. 3 then employs several rounds of parallel
substitutions to obtain from G, as in Example 4.
Option 3: The “data structure” in Fig. 2 could be the data flow graph DFG(G)
formed from the grammar G of Option 2, as described in Section 2.2. Ex-
ample 3 illustrates the inverse transform method via which is computed
from DFG(G) via a flow of data sequences along the edges of DFG(G).
We have not yet discussed how the encoder compresses the data structure in
Fig. 2 that is presented to it—this section addresses this question. Because of
the equivalences between the tree, graph, and grammar structures discussed in
Section 2, we can explain the encoder’s methodology for the pointer tree data
structure in Fig. 2 only. Thus, we assume that the data structure in Fig. 2 that
is to be assigned a binary codeword by the encoder is the pointer tree
introduced in Section 3, where is the length of the data sequence (assumed
to satisfy ).
The binary codeword generated by the encoder will consist of the concate-
nation of the three binary sequences discussed below.
The sequence The purpose of this sequence is to let the decoder know
what is. There is a simple way to do this (see Example 9) in which
consists of binary symbols.
The sequence The purpose of this sequence is to let the decoder know
the structure of the unlabelled pointer tree, i.e., the tree without
the pointer labels and without the data labels. If a vertex of is a
nonterminal vertex, there will be a corresponding entry of equal to 1; if
the vertex is a leaf of there will be a corresponding entry of
equal to 0 (see Example 9).1
The sequence The purpose of this sequence is to let the decoder know
each data label and each pointer label that has to be entered into the
unlabelled pointer tree to obtain the pointer tree
Example 9: As in Example 8, we consider the data sequence
of length Referring to the tree in Fig. 10 and the pointer tree
in Fig. 11, we can see what the sequences and generated by
the encoder have to be. In general, the length can be processed to form in
two steps:
Step 1: Expand the integer into its binary representation consist-
ing of binary symbols where is
the most significant bit.
Step 2: Generate (That is, repeat every
digit of and then write down followed by )
In this particular case, Step 1 gives us the binary expansion of the integer 12,
which is 1100, and then Step 2 gives us
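The exact bit pattern produced by Step 2 did not survive extraction above, so the sketch below implements one standard self-delimiting variant consistent with the two steps: each bit of the binary expansion of n except the last is written twice, and the last bit is written once followed by its complement, so the decoder can recognize where the codeword ends. This is an assumed construction, not necessarily the book's:

```python
def encode_length(n):
    """Self-delimiting binary code for an integer n >= 1 (assumed variant):
    double every bit of the binary expansion except the last; then write
    the last bit followed by its complement."""
    bits = bin(n)[2:]
    out = ''.join(b + b for b in bits[:-1])
    return out + bits[-1] + ('1' if bits[-1] == '0' else '0')

def decode_length(stream):
    """Inverse: read bit pairs; an equal pair contributes that bit, an
    unequal pair contributes its first bit and ends the codeword.
    Returns (n, number_of_bits_consumed)."""
    bits, i = '', 0
    while True:
        pair = stream[i:i + 2]
        i += 2
        bits += pair[0]
        if pair[0] != pair[1]:
            return int(bits, 2), i
```

Under this variant, n = 12 (binary 1100) encodes as 11 11 00 followed by 01, and the decoder can strip the codeword off the front of the full bit stream without knowing its length in advance.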
Let us now determine how we can form in order to convey to the decoder
the tree in Fig. 11 without the data and pointer labels. The first part of
the binary codeword received by the decoder (the part) tells the decoder that
whereupon the decoder knows that the nonterminal vertices of
are the vertices 1,2,3,4,5,6,7,8,9,10,11 in Fig. 10. Of these, the decoder
knows that vertices 1,2,3,4 will automatically be nonterminal vertices in Fig.
11. If the remaining vertices are processed by encoder and decoder in the
breadth-first order 9,6,10,11,5,7,8, then the decoder learns the structure of
with the transmission of 5 bits. Specifically, a 0 is transmitted for vertex
9 (indicating that this vertex is a leaf vertex in Fig. 11), vertices 10, 11 are
deleted from the list (since these vertices cannot belong to and then
bits are sent for each of vertices 6,5,7,8 to indicate which are leaf vertices and
which are nonterminal vertices in Fig. 11. This gives us
Finally, we discuss how the encoder constructs the sequence The entries
of tell the decoder what the data labels and the pointer labels are in Fig. 11.
The data labels are which we encode as 0, 1, 1, 1, respectively. The
pointer vertices are vertices 5, 8, 9 (see Fig. 10). The decoder already knows
where vertex 5 points (as discussed in Example 8), so no pointer label needs
to be encoded for vertex 5. The pointer labels 1, 2 on vertices 8, 9 can be very
simply encoded as 0, 1, respectively. (For a very long data sequence, one would
instead use arithmetic coding to encode the resulting large number of pointer
labels—see Kieffer and Yang (2000) and Kieffer et al. (2000) for the arithmetic
coding details which we have omitted here.) Concatenating the encoded data
labels 0, 1, 1, 1 with the encoded pointer labels 0, 1, we see that
The total length of the binary codeword generated by the encoder is the sum
of the lengths of and which is 17 bits. If we assume that the
decoder knows the length of the data sequence to be 12, but does not know
which binary data sequence of length 12 is being transmitted, then it is not
necessary to form and the encoder’s binary codeword consists of and
only. In this case, the length of the binary codeword is 11 bits. We are achieving
a modest level of data compression in this example, since transmitting to the
decoder without compression would take 12 > 11 bits—for a much longer data
sequence, more compression than this could be achieved.
Theorem 1 tells us that the ratio is small for large In other words,
the size of the pointer tree is small relative to the length of the data
sequence, which is a compactness condition on the tree. From our
discussion at the beginning of Section 3, this suggests to us that the use of the
pointer tree in a hierarchical lossless compression scheme might lead
to good compression performance. Specifically, we consider the hierarchical
lossless compression scheme in which, for each and each data sequence
of length the pointer tree is compressed according to the procedure
described in Section 4. In the ensuing development, we will see that this pointer
tree based hierarchical lossless compression scheme performs extremely well.
Let A denote our fixed finite alphabet. We consider how well our hierar-
chical lossless compression scheme can compress a randomly generated data
sequence generated according to the probability mass function
Information theory tells us that for any lossless compres-
sion scheme, the expected length of the binary codeword into which is en-
coded cannot be less than the entropy
and that the best lossless compression scheme for encoding (the Huffman
code (Cover and Thomas, 1991)) assigns a binary codeword of expected
length no worse than Unfortunately, the Huffman code can be
constructed only if the probability model is known. However, what
if we are uncertain about the probability model? Remarkably, for large our
pointer tree based hierarchical lossless compression scheme provides us with
near-optimum compression performance regardless of the probability model
according to which the data is generated (provided that we assume stationarity
of the model). In other words, if faced with uncertainty about the true data
model, one can employ a hierarchical lossless compression scheme not based
on any probability model which performs as well as a special-purpose loss-
less compression scheme based upon the true data model, asymptotically as
the length of the data sequence grows without bound. The following theorem,
proved in Kieffer and Yang (2000) and Kieffer et al. (2000), makes this precise.
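The entropy lower bound and the Huffman guarantee quoted above are easy to check numerically. The sketch computes H(p) and the expected length of a Huffman code, using the standard identity that the expected length equals the sum of the probabilities of the merged (internal) nodes:

```python
import heapq
import math

def entropy(p):
    """H(p) = -sum_i p_i log2 p_i, the lower bound quoted above."""
    return -sum(q * math.log2(q) for q in p if q > 0)

def huffman_expected_length(p):
    """Expected codeword length of a Huffman code for a pmf p with
    len(p) >= 2 entries; information theory guarantees
    H(p) <= L < H(p) + 1."""
    heap = [(q, i, 0.0) for i, q in enumerate(p)]  # (prob, tiebreak, acc)
    heapq.heapify(heap)
    tag = len(p)
    while len(heap) > 1:
        q1, _, a1 = heapq.heappop(heap)
        q2, _, a2 = heapq.heappop(heap)
        tag += 1                 # unique tiebreak so tuples never compare acc
        heapq.heappush(heap, (q1 + q2, tag, a1 + a2 + q1 + q2))
    return heap[0][2]
```

For a dyadic pmf such as (1/2, 1/4, 1/4), the Huffman expected length meets the entropy exactly; in general it sits strictly inside the one-bit window.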
Discussion: We point out subcases of Theorem 2 that occur for special classes
of stationary processes. First, suppose that is a memoryless
process, meaning that the random variables are statistically independent,
each having the same marginal probabilities
Letting
where
There are other lossless compression schemes for which Theorem 1 is true for
arbitrary stationary processes, and for which asymptotics of the form (5.12)
occur for arbitrary Markov processes of finite order. For example, the Lempel-
Ziv compression scheme (Ziv and Lempel, 1978) is another such scheme. It is an
open question whether the hierarchical lossless scheme presented in this paper
or whether the Lempel-Ziv scheme gives the smaller constant times
in the term in (5.12). However, hierarchical lossless compression
schemes have some advantages over the Lempel-Ziv scheme. Two of these
advantages are: (1) hierarchical schemes are easily scalable; (2) hierarchical
schemes sometimes yield state-of-the-art compression performance in practical
applications.
NOTES
1. The length of is at worst the number of vertices of which is by
Theorem 1.
2. The smallest value of C for which Theorem 1 is true is not known.
REFERENCES
Barnsley, M. and L. Hurd. (1993). Fractal Image Compression. Wellesley, MA:
AK Peters, Ltd.
Burt, P. and E. Adelson. (1983). “The Laplacian Pyramid as a Compact Image
Code,” IEEE Trans. Commun., Vol. 31, pp. 532–540.
Cameron, R. (1988). “Source Encoding Using Syntactic Information Source
Models,” IEEE Trans. Inform. Theory, Vol. 34, pp. 843–850.
Chui, C. (1992). (ed.), Wavelets: A Tutorial in Theory and Applications. New
York: Academic Press.
Cover, T. and J. Thomas. (1991). Elements of Information Theory. New York:
Wiley.
Fisher, Y. (1995). (ed.), Fractal Image Compression: Theory and Application.
New York: Springer-Verlag.
Kawaguchi, E. and T. Endo. (1980). “On a Method of Binary-Picture Representation
and its Application to Data Compression,” IEEE Trans. Pattern
Anal. Machine Intell., Vol. 2, pp. 27–35.
Kieffer, J. and E.-H. Yang. (2000). “Grammar-Based Codes: A New Class of
Universal Lossless Source Codes,” IEEE Trans. Inform. Theory, Vol. 46, pp.
737–754.
Kieffer, J., E.-H. Yang, G. Nelson, and P. Cosman. (2000). “Universal Lossless
Compression Via Multilevel Pattern Matching,” IEEE Trans. Inform. Theory,
Vol. 46, pp. 1227–1245.
Moshe Sniedovich
Department of Mathematics and Statistics
The University of Melbourne
Parkville VIC 3052, Australia
m.sniedovich@ms.unimelb.edu.au
Abstract Ever since Bellman formulated his Principle of Optimality in the early 1950s,
the Principle has been the subject of considerable criticism. In fact, a number
of dynamic programming (DP) scholars quantified specific difficulties with the
common interpretation of Bellman’s Principle and proposed constructive reme-
dies. In the case of stochastic processes with a non-denumerable state space, the
remedy requires the incorporation of the faithful "with probability one" clause. In
this short article we are reminded that if one sticks to Bellman’s original version
of the principle, then no such fix is necessary. We also reiterate the central
role that Bellman’s favourite "final state condition" plays in the theory of DP in
general and the validity of the Principle of Optimality in particular.
1. INTRODUCTION
All of us are familiar with Bellman’s Principle of Optimality (Bellman, 1957,
p. 83) and the major role that it played in Bellman’s monumental work on
DP. What is not so well known - yet very well documented - is that Bellman’s
Principle of Optimality has been the subject of serious criticism, e.g. Denardo
and Mitten (1967), Karp and Held (1967), Yakowitz (1969), Porteus (1975),
Morin (1982), Sniedovich (1986, 1992).
In fact, almost every aspect of the Principle - e.g. its exact meaning, validity,
role in DP - is problematic in the sense that scholars have conflicting views on
the matter. For the purposes of this discussion it will suffice to provide two
pairs of quotes. The first pair refers to the title "Principle":
... Equation (3.24) is a fundamental equation of Dynamic Programming. It
expresses a fundamental principle, the principle of optimality (Bellman [B4],
[B5]), which can also be expressed in the following way: …
Kushner (1971, p. 87)
The term principle of optimality is, however, somewhat misleading; it suggests
that this is a fundamental truth, not a consequence of more primitive things.
Denardo (1982, p. 16)
Counter-Example 1:
Consider the network depicted in Figure 29.1 and assume that the objective
is to determine the shortest path from node 1 to node 5, where the length of a
path is equal to the length of the longest arc on that path. By inspection, there
Eureka! Bellman’s Principle of Optimality is valid! 737
are two optimal paths, namely p=(1,2,3,5) and q=(1,2,4,5), both of length 3.
Consider now the optimal path q and the state (node) resulting from the first
decision, namely node 2. Clearly, the remaining decisions of q, namely the
subpath (2,4,5), does not constitute an optimal policy with respect to node 2 -
it is clearly longer than (2,3,5). Hence, the optimal path q does not obey the
Principle of Optimality.
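Figure 29.1 is not reproduced here, so the arc lengths below are hypothetical values chosen to be consistent with the description: both p and q have bottleneck length 3, while the tail (2,4,5) of q is beaten by (2,3,5):

```python
def bottleneck(path, arc_len):
    """Length of a path = the length of the longest arc on it
    (the metric of Counter-Example 1)."""
    return max(arc_len[u, v] for u, v in zip(path, path[1:]))

# hypothetical arc lengths for the network of Figure 29.1 (assumed)
arc_len = {(1, 2): 3, (2, 3): 1, (3, 5): 2, (2, 4): 2, (4, 5): 3}

p = (1, 2, 3, 5)
q = (1, 2, 4, 5)
```

Both p and q are optimal (length 3), yet the remainder of q from node 2 is strictly worse than (2,3,5) — exactly the violation of the common reading of the Principle described above.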
Counter-Example 2:
Consider the following naive stochastic game: there are two stages (n=1,2)
and at each stage there are two feasible decisions n=1,2. The
dynamics of the process are as follows: The process starts at stage 1 with a
given state, Upon making the first decision, the process moves to the
next stage, where we observe a new state Then we make the second
decision, and the process terminates. The second state, is a continuous
random variable on the set S=[0,1] whose conditional density function depends
on both and The return generated by the game is equal to the sum of the
two decisions, namely
The objective is to determine a policy so as to maximize the expected value
of the total return
Clearly, the best policy is to always use that is it is best to set
This will yield an expected return equal to This
policy obeys the Principle of Optimality. But consider the policy according to
which and
While these fixes are fine, there are other possibilities. In particular, it is pos-
sible to fix these - and other - bugs by adhering to Bellman’s original formulation
of DP and the Principle of Optimality.
The main objective of this paper is to briefly show how this can be done.
2. REMEDIES
In this section we provide two remedies to the Principle. These remedies not
only fix the bugs discussed above, they also indicate how elegantly Bellman
dealt with a number of thorny modelling and technical aspects of DP.
Remedy 1: Final state formulation
A close examination of Bellman’s work on DP reveals that Bellman contin-
ually struggled with the following dilemma: how do you keep the formulation
of DP simple, yet enable it to tackle complex problems? The Principle of Opti-
mality was conceived as a device that will keep the description of the main ideas
of basic DP simple. In particular, in contrast to the DP models developed in
the 1960s with the stated goal of putting DP on a rigorous mathematical foundation
(e.g. Denardo and Mitten (1967), Karp and Held (1967)), Bellman’s
original treatment of DP paid very little attention to the objective function of
the process.
As a matter of fact, systematically and consistently Bellman avoided the need
to deal with this issue by a very drastic assumption: the overall return from the
decision process depends only on the final state of the process. Readers who
are sceptical about this fact are invited to read (carefully) Bellman’s first book on
DP, where they can find the following definition of an optimal policy:
. . . Let us now agree to the following terminology: A policy is any rule for making
decisions which yields an allowable sequence of decisions; and an optimal policy
is a policy which maximizes a preassigned function of the final state variable ...
Bellman (1957, p. 82)
Needless to say, there are of course problems where this condition is not
satisfied. However, from our perspective this is not the point because - following
Bellman - we require the model to be a final state model in which case the above
condition is trivially satisfied. Rather, the point is that there is no need here for
the "with probability one" clause.
Before we address this issue any further it will be instructive to re-examine
Counter-Example 2 and see whether it satisfies the above condition.
Counter Example 2: Revisited
The objective function associated with Problem is equal to
thus the optimal policy for this problem is The objective function
for Problem is hence the optimal policy is regardless of what
value takes. Thus, the above condition is satisfied.
In short, the alternative to the "with probability one" fix works well not only
in the framework of final state models.
s.t.
where
for
This leads to the notion of modified problems:
Problem
subject to (4.7)-(4.8).
Let denote the set of all feasible solutions to Problem
and let denote the set of all optimal solutions to this prob-
lem. It is assumed that the set is not empty for any
quadruplet. Observe that by construction
Clearly then (by inspection),
Lemma 1
for all
Problem
Then clearly,
Corollary 3
Note that the expectation is taken with respect to the random variable
whose probability function is conditioned by and
Let denote the set of all the optimal solutions to Problem
and let denote the set of all the optimal solutions to Problem
Then the Markovian conditions for stochastic processes can
be stated as follows:
Markovian Condition (Stochastic processes):
Hence,
Corollary 4
If the objective function is separable under conditional expectation and the
Markovian condition (stochastic processes) holds, then
for all
5. REFINEMENTS
The Markovian condition can be refined a bit to reflect the fact that for the
DP functional equation to be valid, it is sufficient that the Principle of Optimality
is satisfied by one policy, rather than by all the optimal policies.
This leads to the following:
Weak Markovian Condition (Deterministic processes):
Corollary 6
If the Weak Markovian condition (deterministic processes) holds then the DP
functional equation (4.22) is valid.
Weak Markovian Condition (stochastic processes):
In words, any conditional modified problem shares at least one optimal so-
lution with each modified problem giving rise to it.
Corollary 7
If the objective function is separable under conditional expectation and the
Weak Markovian condition (stochastic processes) holds then the DP functional
equation (4.22) is valid.
2. There is not always a choice in this matter. That is, some objective
functions are Markovian in nature so it is not possible to formulate for
them valid DP functional equations that satisfy the Weak Markovian
condition but do not satisfy the Markovian condition.
The question naturally arises: what happens if the Markovian condition does
not hold?
in the state variable of the DP model so that the expanded state variable is of the
form We can then consider the “expanded”
objective function
where
and
The point is that, for any given value of this function is separable and
Markovian with respect to the original state and decision variables. The idea
is then to identify a value for the parameter such that if we optimize the
problem using this value of we obtain an optimal solution to the original
problem. This typically involves a line search which in turn requires solving
the parametric problem for a number of values of the parameter Under
appropriate conditions, composite concave programming can be used for this
purpose (Sniedovich, 1992).
7. DISCUSSION
One of the fascinating aspects of Bellman’s work on dynamic programming
is his attempt to capture the essence of the method by a short non-technical
description. Over the years this description - The Principle of Optimality - has
become synonymous with dynamic programming.
Unfortunately, the non-technical nature of the description has also led to
difficulties with common interpretations of the Principle, which in turn led to
criticism of Bellman’s work itself.
It was shown in this paper that a proper reading and interpretation of Bell-
man’s formulation of dynamic programming in general and the Principle in
particular can overcome the above difficulties.
REFERENCES
Bellman, R. (1957). Dynamic Programming, Princeton University Press, Prince-
ton, NJ.
Bertsekas, D.P. (1976). Dynamic Programming and Stochastic Control, Aca-
demic Press, NY.
Carraway, R.L., T.L. Morin, and H. Moskowitz. (1990). Generalized dynamic
programming for multicriteria optimization, European Journal of Operational
Research, 44, 95-104.
Denardo, E.V. and L.G. Mitten. (1967). Elements of sequential decision pro-
cesses, Journal of Industrial Engineering, 18, 106-112.
Denardo, E.V. (1982). Dynamic Programming Models and Applications, Prentice-Hall,
Englewood Cliffs, NJ.
Domingo, A. and M. Sniedovich. (1993). Experiments with algorithms for
nonseparable dynamic programming problems, European Journal of Operational
Research 67(4.1), 172-187.
Karp, R.M. and M. Held. (1967). Finite-state processes and dynamic program-
ming, SIAM Journal of Applied Mathematics, 15, 693-718.
Kushner, H. (1971). Introduction to Stochastic Control, Holt, Rinehart and Win-
ston, NY.
Mitten, L.G. (1964). Composition principles for synthesis of optimal multistage
processes, Operations Research, 12, 414-424.
Morin, T.L. (1982). Monotonicity and the principle of optimality, Journal of
Mathematical analysis and Applications, 88, 665-674.
Porteus, E. (1975). An informal look at the principle of optimality, Management
Science, 21, 1346-1348.
Sniedovich, M. (1986). A new look at Bellman’s principle of optimality, Journal
of Optimization Theory and Applications, 49(1.1), 161-176.
Sniedovich, M. (1992). Dynamic Programming, Marcel Dekker, NY.
Yakowitz, S. (1969). Mathematics of Adaptive Control Processes, Elsevier, NY.
Woeginger, G.J. (2000). When does a dynamic programming formulation guarantee
the existence of a fully polynomial time approximation scheme (FPTAS),
INFORMS Journal on Computing.
Chapter 30
Marcel F. Neuts
Department of Systems and Industrial Engineering
The University of Arizona
Tucson, AZ 85721, U.S.A.
marcel@sie.arizona.edu
When we were students, multivariate and time series analyses were already
beautiful mathematical theories. Sid had a more direct interest in these than I
but we agreed that the computational burden and the paucity of tractable results
made their application to actual data a daunting task. I, for one, was happy to
leave that to people in biology, economics, and the social sciences where the
reality of multivariate, highly dependent data could not be overlooked.
Work on stochastic models kept me well occupied and I could only follow
developments in statistics from a distance. From colleagues at Purdue University
and elsewhere, I learned about Bayesian procedures, about selection and
ranking, about variable selection in linear models, and other such work. During
the 1970s, there clearly was a growing preoccupation with numerical results
obtained by substantial algorithms, yet the statistical laboratory and the com-
puting center remained clearly separated worlds. In the first, people engaged
in statistical thinking, in the second, one sought advice and found help with
massive computer jobs.
In 1980 or so, during a one-day visit to Princeton University, I vividly ex-
perienced the thrill of seeing a changed, enriched statistical scene. A graduate
student demonstrated a software package for time series that he was developing.
A rich variety of statistical estimators, tests, and data transformations could be
interactively implemented to serve in the exploration of one or more traces of
a time series. Algorithm and computation, once barriers between methodolog-
ical and physical insight, had become our faithful servants, if not yet our allies.
The doctoral student had excellent knowledge of statistical theory and of the
computer’s capabilities. He combined them in a creative, synergistic research
project.
There are now many highly numerate statistical researchers; the years since
1980 brought major progress in the algorithmization of statistics. Professionally
written statistical software is now readily available. Judging by the text books,
by my experience during the latter years of my teaching career, and by visits
to universities in many countries, academic education in statistics still lags far
behind these developments. With few exceptions, students learn the elementary
mathematics underlying the most classical estimators and tests, not the deeper,
substantive insights needed to use existing software packages with confidence
and competence.
When trying to stir interest and enthusiasm, ponderous preaching about generalities
is counter-productive. When asked to give a look-ahead talk, I prefer to
choose some specific problems that are just beyond our present capabilities.
After explaining why they are important, I speculate about promising new
approaches - promising in that they may get the job done, rather than merely
lead to easily publishable papers. In Neuts (1998), I discussed selected
problems in stochastic modelling. That area could benefit from greater
emphasis on understanding the physical behavior of the models.
754 MODELING UNCERTAINTY
users do not look at theory. In an effort to belie that quip, one of my later
doctoral students, David Rauschenberg (Rauschenberg, 1994), examined ways
of summarizing long strings of counts in short strings of informative icons that
reflect the qualitative behavior of the counts over long substrings. Although
it is regrettably unpublished, I consider his thesis a seminal work leading to
data-analytic procedures that merit much further attention.
Reconstituting a traffic stream from counting statistics is only an example in a
vast class of problems dealing with random transformations of point processes.
During the early 1990s, I worked with several Ph.D. students on local Poissoni-
fication, the operation whereby the events during successive intervals of length
a are uniformly redistributed over these intervals, see Neuts et al. (1992) and
Neuts (1994). What was our qualitative thinking behind that construction?
If you only have the event counts over intervals of length a, you cannot
recover information about the exact arrival times during those intervals. You
can redistribute the points regularly, place them all in a "group arrival" in the
middle of the interval, or, as we did, you can imagine that they are uniformly and
randomly distributed - as they would be if the original process was a Poisson
process. What we studied has the intuitive flavor of a "linear interpolation."
That intuition was indeed borne out by some formulas for lower order moments
that we derived.
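The operation is easy to prototype. Below is a minimal sketch of local Poissonification in Python; it is my illustration, not code from the cited papers, and the function name and use of NumPy are my own choices. The interval length is the a of the text.

```python
import numpy as np

def local_poissonification(arrival_times, a, rng=None):
    """Locally poissonify a point process: count the events in each
    interval [k*a, (k+1)*a) and redistribute that many points uniformly
    at random over the same interval."""
    rng = np.random.default_rng(rng)
    arrival_times = np.asarray(arrival_times, dtype=float)
    n_bins = int(np.ceil(arrival_times.max() / a))
    # Event counts over successive intervals of length a.
    counts, _ = np.histogram(arrival_times, bins=n_bins, range=(0.0, n_bins * a))
    # Redistribute each interval's points uniformly, as for a Poisson process.
    pieces = [k * a + a * rng.random(c) for k, c in enumerate(counts)]
    return np.sort(np.concatenate(pieces))
```

By construction, the event counts over every interval of length a are identical for the original and the poissonified streams; only the positions within each interval change.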
Unless a is large, differences between the original and the reconstituted
processes should not matter greatly - exactly the same idea that underlies the
grouping of univariate or bivariate data. The statement of that intuition is vague.
One can assail it with criticism or, constructively, one can give technically
precise formulations that are amenable to rigorous scientific inquiry.
In Neuts et al. (1992), we initiated a theoretical study of local Poissonification
for the family of versatile benchmark processes, known as MAPs (Markovian
arrival processes), see e.g., Neuts (1992), Narayana and Neuts (1992), and
Lucantoni (1993). I wish we had been able to pursue that study further along
the following lines:
The pertinent engineering question is whether and when we can use counting
data instead of detailed, but expensive traces. The answer to that question is
context dependent. It depends on the service mechanisms to which the traffic
is offered. In a queueing context, for example, when there is any appreciable
queueing at all, the operation of service systems is little or not affected by slight
perturbations in the arrival times of packets. Whether the packet comes a little
earlier or a little later only means that it spends a little more or a little less time
waiting in the queue.
With the restrictive assumptions needed for classical queueing analysis, it
is impossible to model the effect of local Poissonification (or of other trans-
formations known in the engineering literature as traffic shaping) by standard
analytic or algorithmic methods. Moreover, to compare an input stream and its
poissonifications for various values of a, it is not enough to treat each case
separately. Using simulation terminology, one should run simultaneous, parallel
simulations in which the various poissonifications of a given input stream are
subject to identical service times. Valid comparisons are possible only when
that is done. Experimental physicists know that, in meaningful comparison
studies, one varies only one or two parameters between experimental runs,
keeping all other conditions as much as possible the same. People with solid
grounding in probability understand that, to compare two experiments (and not
merely some simple descriptor, such as a mean) you formalize both on the
same probability space. Therein lies the fundamental difficulty of - and the
serious scientific reservations to - the many engineering approximations com-
mon in applied stochastic models. For approximate models to be scientifically
validated, we need to compare differences in the realizations of the stochastic
processes, not merely in crude descriptors such as the mean or standard deviation
of the delay.
A major difficulty in doing so is the paucity of theoretical results on
multivariate stochastic processes and the extreme difficulty of obtaining them.
For a few years now, I have increasingly realized the importance of computer
experimentation in stochastic modelling. In Neuts (1997) and Neuts (1998), I adduce reasons for that
importance. As a case in point, the study of the effect of the window size a
leads to a pretty, seminal computer experiment. We generate a large database
of, say, 10 million or more interarrival times and we construct poissonifications
of that random point sequence with K different values of a. These we offer
(in parallel) to single servers with identical strings of service times, generated
from a common distribution F(·). We so obtain K realizations of the queueing
process that differ only in the values of the parameter a.
What are some technical issues that arise in the design and analysis of that
experiment? In the first place, note that a common input stream is used for all K
poissonifications. Poissonification does not affect the order of the arrivals. We
may therefore think of each arrival as a clearly identified job. The service time
of that job is unaffected by the choice of a. Therefore, the original input and
all K poissonified streams are subject to the same queueing delays; a common
sequence of service times is used.
If the original arrival stream comes from a stationary point process, it is
easy to assure that all K poissonified streams are also realizations of stationary
processes. The most interesting part is the analysis of the output of the exper-
iment. As we are mainly interested in the differences between the queue with
the original input and each of the K models with poissonified arrivals, I would
form the sequences of differences between the delay of each job in the original
queue and in each of the poissonified models. Each such sequence is a trace
of a stationary process; we can apply established statistical procedures to it.
In comparing distinct sequences of differences, that is, comparing the results
for different values of a, we must bear in mind that these are highly dependent
stochastic processes. It is likely that only data-analytic comparisons remain
possible. Qualitative conclusions from such comparisons need to be validated by
replications of the entire experiment with different, independently generated
data sets. The choice of the statistics used in comparisons, the informative
representation and summary of data, and the efficient performance of the ex-
periments, all present interesting new questions and challenges. Experience
gained from one experimental study facilitates future ones and therein lies the
potential growth of this field.
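Under simplifying assumptions (renewal input with exponential interarrival times, and one common string of i.i.d. service times reused across every run), the parallel experiment just described can be sketched as follows. This is my own illustration, not the original study's code; `poissonify`, the window values, and all parameters are hypothetical.

```python
import numpy as np

def poissonify(times, a, rng):
    # Redistribute the points of each interval [k*a, (k+1)*a) uniformly over it.
    nb = int(np.ceil(times.max() / a))
    counts, _ = np.histogram(times, bins=nb, range=(0.0, nb * a))
    return np.sort(np.concatenate(
        [k * a + a * rng.random(c) for k, c in enumerate(counts)]))

def waiting_times(arrivals, services):
    # FIFO single-server Lindley recursion: W[n+1] = max(0, W[n] + S[n] - T[n+1]).
    inter = np.diff(arrivals)
    w = np.zeros(len(arrivals))
    for n in range(len(inter)):
        w[n + 1] = max(0.0, w[n] + services[n] - inter[n])
    return w

rng = np.random.default_rng(42)
arrivals = np.cumsum(rng.exponential(1.0, 10_000))   # original input stream
services = rng.exponential(0.8, len(arrivals))       # one common string of service times

w_orig = waiting_times(arrivals, services)
# K = 3 parallel runs: only the window size a varies; job n keeps service time n.
delay_diffs = {a: waiting_times(poissonify(arrivals, a, rng), services) - w_orig
               for a in (0.5, 2.0, 8.0)}
```

Because every run sees the same jobs with the same service times, each array in `delay_diffs` is a trace of the difference process described above, to which established statistical procedures for stationary sequences can be applied.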
The problem and the methodological approach that I have just described
have important counterparts in engineering practice. I already mentioned the
traffic traces of telecommunications applications. How are different traces or
simulated traces from a proposed traffic model compared? A common practice
is to use various measured or simulated arrival flows as input to single or multiple
queues with constant service times. For many highly calibrated manufacturing
or communications devices, the assumption of constant processing times is
plausible. These input processes are typically offered - in parallel simulations
- to servers with different holding times. A given input to various servers with
different constant holding times is then interpreted as though input streams of
various rates were offered to a server with a single, fixed holding time.
Measured quantities, such as the average delay or the frequency of loss in a
finite-buffer model, are typically quite robust. Useful engineering conclusions
are drawn from them although without a formal statistical justification. The
high dependence between the various simulated realizations and the heuristic
manner in which estimates are obtained offer challenges to statistical analysis.
In both problems I have mentioned, the general methodological issue is the
same. How can we meaningfully measure differences between (dependent)
stochastic processes whose realizations are relatively minor perturbations of
each other?
ACKNOWLEDGMENTS
The research of M. F. Neuts was supported in part by NSF Grant Nr. DMI-9988749.
REFERENCES
Li, J-M., M. F. Neuts, and I. Widjaja. (1998). Congestion detection in ATM
networks. Performance Evaluation, 34:147–168.
Lucantoni, D.M. (1993). The BMAP/G/1 queue: A tutorial. In Lorenzo Donatiello
and Randolph Nelson, editors, Performance Evaluation of Computer
and Communication Systems: Joint Tutorial Papers of Performance '93 and
Sigmetrics '93, pages 330–358. Springer-Verlag, Berlin.
Narayana, S. and M. F. Neuts. (1992). The first two moments matrices of
the counts for the Markovian arrival process. Communications in Statistics:
Stochastic Models, 8(3):459–477.
Neuts, M.F. (1988). Profile curves for the M/G/1 queue with group arrivals.
Communications in Statistics: Stochastic Models, 4(2):277–298.
Neuts, M.F. (1992). Models based on the Markovian arrival process. IEICE
Transactions On Communications, E75-B(12):1255–65.
Neuts, M.F. (1994). The Palm measure of a poissonified stationary point process.
In Ramón Gutierrez and Mariano J. Valderrama, editors, Selected Topics on
Stochastic Modelling, pages 26–40. Singapore: World Scientific.
Neuts, M.F. (1997). Probability Modelling in the Computer Age. Keynote ad-
dress, International Conference, Stochastic and Numerical Modelling and
Applications, Utkal University, Bhubaneswar, India.
Neuts, M.F. (1998). Some promising directions in algorithmic probability. In
Attahiru S. Alfa and Srinivas R. Chakravarthy, editors, Advances in Matrix
Analytic Methods for Stochastic Models, pages 429–443. Neshanic Station,
NJ: Notable Publications, Inc.
Neuts, M.F. and J-M Li. (1999). Point processes competing for runs: A new tool
for their investigation. Methodology and Computing in Applied Probability,
1:29–53.
Neuts, M.F. and J-M. Li. (2000). The input/output process of a queue. Applied
Stochastic Models in Business and Industry, 16:11–21.
Neuts, M.F., D. Liu, and S. Narayana. (1992). Local poissonification of the
Markovian arrival process. Communications in Statistics: Stochastic Models,
8(1):87–129.
Rauschenberg, D.E. (1994). Computer-Graphical Exploration of Large Data
Sets from Teletraffic. PhD thesis, The University of Arizona, Tucson, Arizona.
Widjaja, I., M. F. Neuts, and J-M. Li. (1996). Conditional Overflow Probability
and Profile Curve for ATM Congestion Detection. IEEE.
Yakowitz, S. (1977). Computational Probability and Simulation. Addison-Wesley,
Reading, MA.
Author Index
Billingsley, 32, 66–67, 90, 510, 558, 560, 595
Bina, 131, 133, 152
Birge, 647
Bisgaard, 692
Bittanti, 370
Bixby, 647
Black, 270, 282
Blaisdell, 131, 152–153
Blankenship, 22, 32, 510
Block, 621
Blount, 11, 575
Blum, 50, 52
Bodson, 376, 381
Boender, 389, 411
Bohachevsky, 411
Boland, 608, 612, 614, 621
Bolshoy, 153
Boltzmann, 388
Borkar, 45, 52, 545
Borovkov, 32, 62–63, 65, 72, 75, 84, 90–91
Bosq, 581, 595
Boyle, 467, 474
Braaten, 467
Bradley, 581, 585, 595–596
Bramson, 14, 32
Bratley, 467
Brau, 545
Braverman, 206, 221
Breiman, 246
Bremermann, 387, 390, 411
Brezzi, 39–43, 52
Broadie, 466–467
Brooks, 411
Bucher, 153
Buckingham, 131, 152
Bucy, 22, 32
Bunick, 154
Burman, 302, 329, 594, 596
Burrus, 682
Burt, 712, 732
Buyukkoc, 39, 54
Byrnes, 546
Caflisch, 467, 472
Caines, 370–371
Calladine, 131, 141, 153
Cameron, 732
Campi, 370
Cantelli, 242, 402–403, 409, 569, 574
Cao, 115
Capitanio, 329
Carr, 250, 267
Carraway, 748
Carrillo, 303, 329
Casella, 687
Castelana, 594
Castellana, 596
Cauchy, 180
Caudill, 331
Cavazos-Cadena, 516–517, 520–523, 525, 529, 532, 543–545
Cavert, 99, 114
Cease, 152
Cerny, 412
Cesa-Bianchi, 246
Cesaro, 176
Chakravarthy, 760
Chang, 39–41, 52
Chao, 298–299, 590
Charnes, 630
Chatfield, 662, 682
Chebyshev, 67
Chen, 32, 687
Cheng, 329, 381
Chernoff, 37, 53
Chesher, 700
Chevi, 647
Chiarella, 252, 267
Chin, 690
Chistyakov, 79, 91
Cholesky, 327
Chow, 246
Christofides, 629, 648
Chui, 732
Chung, 510
Chvatal, 647
Cieslak, 471
Clark, 51, 53, 156, 629, 647
Clarke, 184, 647
Clements, 330
Cobb, 692
Cochran, 467
Cohen, 96, 98–99, 114
Columban, 2
Compagner, 474
Conover, 471
Conway, 467
Cook, 329, 647, 692
Cooley, 682
Cooper, 630
Cornette, 152
Corput, 438
Cosman, 732
Cournot, 2, 249–251, 263–267
Courtois, 510
Cover, 327, 329, 730, 732
Cox, 198–199, 250, 267
Cranley, 467
Cristion, 51, 54
Crothers, 131, 153
Cushing, 250, 267
Dai, 32–33
Daley, 4, 11
Daniels, 687
Danielson, 678
Datta, 546
Davis, 510, 603
Davison, 687
Davydov, 577, 596
Dawande, 329
Dekker, 199, 223, 330, 749
Dekkers, 388, 412
Delebecque, 496, 510
DeLisi, 152
Dembo, 58, 91, 510
Dempster, 381
Denardo, 735–736, 738, 748–749
Denny, 6–7, 574, 596
Devroye, 202–203, 205, 207, 216, 221, 236, 246, 386–387, 392–393, 400, 404, 412, 575, 596
Di Masi, 511
Dietrich, 9, 11
Diggle, 198
Dirac, 252
Dixit, 283
Djereveckii, 370
Do, 687
Doeblin, 516–517, 520, 522, 526
Doksum, 692
Domingo, 747, 749
Doob, 335, 357, 572–573, 575, 580, 596
Dordrecht, 221–222, 414
Douglas, 692
Doukhan, 577, 581, 596
Down, 33–34
Drew, 131, 141, 152–153
Dror, 5, 627–629, 633, 638–639, 641, 643–649
Druzdzel, 467
Duckstein, 6–7
Duff, 8
Duffie, 283, 468
Dunford, 203, 218, 221
Dupuis, 15, 23, 33
Dvoretzky, 402, 412
Dykstra, 621
Dynkin, 335–336, 357
Eberly, 288, 299
Edgeworth, 687
Efimov, 469
Efron, 205, 221, 468, 686
Eilon, 629, 648
Eisenberg, 133, 153
Elliot, 546
Ellis, 493
Embrechts, 79, 85, 91
Endo, 732
Entacher, 468
Erlang, 62, 65
Ermakov, 390, 412
Ermoliev, 157, 183, 385, 412
Essunger, 115
Etemadi, 221
Ethier, 33, 511
Eubank, 575, 596
Fabian, 53
Fabius, 40, 53
Fahrmeir, 198
Fauci, 96, 99, 114
Faure, 429–430, 432, 440–441, 447, 451, 468
Fayolle, 14, 33
Feder, 227, 229, 246–247
Federgruen, 388, 411
Feller, 63, 65, 67, 91, 118, 128
Fennell, 330
Fernández-Gaucherand, 516–517, 520–523, 525, 529, 532, 543–545
Field, 687
Fill, 63, 91
Fincke, 468
Firth, 198
Fisher, 6, 379, 383, 392, 394, 400–401, 412, 416, 696, 700, 732
Fishwick, 471
Fleming, 479, 500, 511, 516, 545
Fokianos, 190–191, 198
Foss, 32
Fourier, 132–134, 138, 453–454, 651–653, 662, 664, 667, 669, 678
Fox, 286, 299, 466–468, 511
Fradkov, 370
Franco, 621
Fraser, 687
Fredholm, 486
Freund, 246
Friedel, 468
Frobenius, 516, 521–522
Frontini, 389, 412
Furniss, 286, 299
Fushimi, 472
Gaimon, 329
Gaitsgori, 512
Gaivoronski, 157, 183
Galambosi, 11
Gallager, 512
Gamarnik, 32
Gani, 3, 5, 9–11, 383, 513, 575, 596
Garrett, 117, 125, 128
Gastwirth, 401, 412
Gauss, 703
Gaviano, 387, 413
Geffroy, 391, 395, 398–399, 413
Gelatt, 53, 387, 414
Gelfand, 388, 413
Geman, 369–370, 381, 387, 413
Georgiev, 575, 590, 596–597
Gerencsér, 245–247, 362, 370
Gershwin, 329, 511
Gheysens, 649