
MODELING UNCERTAINTY

An Examination of Stochastic Theory,


Methods, and Applications
INTERNATIONAL SERIES IN
OPERATIONS RESEARCH & MANAGEMENT SCIENCE
Frederick S. Hillier, Series Editor Stanford University

Vanderbei, R. / LINEAR PROGRAMMING: Foundations and Extensions


Jaiswal, N.K. / MILITARY OPERATIONS RESEARCH: Quantitative Decision Making
Gal, T. & Greenberg, H. / ADVANCES IN SENSITIVITY ANALYSIS AND
PARAMETRIC PROGRAMMING
Prabhu, N.U. / FOUNDATIONS OF QUEUEING THEORY
Fang, S.-C., Rajasekera, J.R. & Tsao, H.-S.J. / ENTROPY OPTIMIZATION
AND MATHEMATICAL PROGRAMMING
Yu, G. / OPERATIONS RESEARCH IN THE AIRLINE INDUSTRY
Ho, T.-H. & Tang, C. S. / PRODUCT VARIETY MANAGEMENT
El-Taha, M. & Stidham, S. / SAMPLE-PATH ANALYSIS OF QUEUEING SYSTEMS
Miettinen, K. M. / NONLINEAR MULTIOBJECTIVE OPTIMIZATION
Chao, H. & Huntington, H. G. / DESIGNING COMPETITIVE ELECTRICITY MARKETS
Weglarz, J. / PROJECT SCHEDULING: Recent Models, Algorithms & Applications
Sahin, I. & Polatoglu, H. / QUALITY, WARRANTY AND PREVENTIVE MAINTENANCE
Tavares, L. V. / ADVANCED MODELS FOR PROJECT MANAGEMENT
Tayur, S., Ganeshan, R. & Magazine, M. / QUANTITATIVE MODELING FOR SUPPLY
CHAIN MANAGEMENT
Weyant, J./ ENERGY AND ENVIRONMENTAL POLICY MODELING
Shanthikumar, J.G. & Sumita, U./APPLIED PROBABILITY AND STOCHASTIC PROCESSES
Liu, B. & Esogbue, A.O. / DECISION CRITERIA AND OPTIMAL INVENTORY PROCESSES
Gal, T., Stewart, T.J., Hanne, T./ MULTICRITERIA DECISION MAKING: Advances in MCDM
Models, Algorithms, Theory, and Applications
Fox, B. L./ STRATEGIES FOR QUASI-MONTE CARLO
Hall, R.W. / HANDBOOK OF TRANSPORTATION SCIENCE
Grassman, W.K./ COMPUTATIONAL PROBABILITY
Pomerol, J-C. & Barba-Romero, S./MULTICRITERION DECISION IN MANAGEMENT
Axsäter, S./ INVENTORY CONTROL
Wolkowicz, H., Saigal, R., Vandenberghe, L./ HANDBOOK OF SEMI-DEFINITE
PROGRAMMING: Theory, Algorithms, and Applications
Hobbs, B. F. & Meier, P. / ENERGY DECISIONS AND THE ENVIRONMENT: A Guide
to the Use of Multicriteria Methods
Dar-El, E./ HUMAN LEARNING: From Learning Curves to Learning Organizations
Armstrong, J. S./ PRINCIPLES OF FORECASTING: A Handbook for Researchers and
Practitioners
Balsamo, S., Personé, V., Onvural, R./ ANALYSIS OF QUEUEING NETWORKS WITH BLOCKING
Bouyssou, D. et al/ EVALUATION AND DECISION MODELS: A Critical Perspective
Hanne, T./ INTELLIGENT STRATEGIES FOR META MULTIPLE CRITERIA DECISION MAKING
Saaty, T. & Vargas, L./ MODELS, METHODS, CONCEPTS & APPLICATIONS OF THE ANALYTIC
HIERARCHY PROCESS
Chatterjee, K. & Samuelson, W./ GAME THEORY AND BUSINESS APPLICATIONS
Hobbs, B. et al/ THE NEXT GENERATION OF ELECTRIC POWER UNIT COMMITMENT MODELS
Vanderbei, R.J./ LINEAR PROGRAMMING: Foundations and Extensions, 2nd Ed.
Kimms, A./ MATHEMATICAL PROGRAMMING AND FINANCIAL OBJECTIVES FOR
SCHEDULING PROJECTS
Baptiste, P., Le Pape, C. & Nuijten, W./ CONSTRAINT-BASED SCHEDULING
Feinberg, E. & Shwartz, A./ HANDBOOK OF MARKOV DECISION PROCESSES: Methods
and Applications
Ramík, J. & Vlach, M. / GENERALIZED CONCAVITY IN FUZZY OPTIMIZATION
AND DECISION ANALYSIS
Song, J. & Yao, D. / SUPPLY CHAIN STRUCTURES: Coordination, Information and
Optimization
Kozan, E. & Ohuchi, A./ OPERATIONS RESEARCH/ MANAGEMENT SCIENCE AT WORK
Bouyssou et al/ AIDING DECISIONS WITH MULTIPLE CRITERIA: Essays in
Honor of Bernard Roy
Cox, Louis Anthony, Jr./ RISK ANALYSIS: Foundations, Models and Methods
MODELING UNCERTAINTY
An Examination of Stochastic Theory,
Methods, and Applications

Edited by
MOSHE DROR
University of Arizona

PIERRE L’ECUYER
Université de Montréal

FERENC SZIDAROVSZKY
University of Arizona

KLUWER ACADEMIC PUBLISHERS


NEW YORK, BOSTON, DORDRECHT, LONDON, MOSCOW
eBook ISBN: 0-306-48102-2
Print ISBN: 0-7923-7463-0

©2005 Springer Science + Business Media, Inc.

Print ©2002 Kluwer Academic Publishers, Dordrecht

All rights reserved

No part of this eBook may be reproduced or transmitted in any form or by any means, electronic,
mechanical, recording, or otherwise, without written consent from the Publisher

Created in the United States of America

Visit Springer's eBookstore at: http://ebooks.springerlink.com


and the Springer Global Website Online at: http://www.springeronline.com
Contents

Preface xvii

Contributing Authors xxi


1
Professor Sidney J. Yakowitz 1
D. S. Yakowitz

Part I 13
2
Stability of Single Class Queueing Networks 13
Harold J. Kushner
1 Introduction 13
2 The Model 15
3 Stability: Introduction 22
4 Perturbed Liapunov Functions 23
5 Stability 28
3
Sequential Optimization Under Uncertainty 35
Tze Leung Lai
1 Introduction 35
2 Bandit Theory 37
2.1 Nearly optimal rules based on upper confidence bounds and
Gittins indices 37
2.2 A hypothesis testing approach and block experimentation 42
2.3 Applications to machine learning, control and scheduling of
queues 44
3 Adaptive Control of Markov Chains 44
3.1 Parametric adaptive control 45
3.2 Nonparametric adaptive control 47
4 Stochastic Approximation 49
4
Exact Asymptotics for Large Deviation Probabilities, with Applications 57

Iosif Pinelis
1. Limit Theorems on the last negative sum and applications to non-
parametric bandit theory 59
1.1 Condition (4)&(8): exponential and superexponential cases 62
1.2 Condition (4)&(8): exponential (beyond (14)) and subexpo-
nential cases 63
1.3 The conditional distribution of the initial segment
of the sequence of the partial sums given 66
1.4 Application to Bandit Allocation Analysis 68
1.4.1 Test-times-only based strategy 68
1.4.2 Multiple bandits and all-decision-times based strategy 70
2 Large deviations in a space of trajectories 72
3 Asymptotic equivalence of the tail of the sum of independent random
vectors and the tail of their maximum 77
3.1 Introduction 77
3.2 Exponential inequalities for probabilities of large deviation
of sums of independent Banach space valued r.v.’s 81
3.3 The case of a fixed number of independent Banach space val-
ued r.v.’s. Application to asymptotics of infinitely divisible
probability distributions in Banach spaces 83
3.4 Tails decreasing no faster than power ones 86
3.5 Tails, decreasing faster than any power ones 88
3.6 Tails, decreasing no faster than 89

Part II 95
5
Stochastic Modelling of Early HIV Immune Responses Under Treatment by Protease Inhibitors 95
Wai-Yuan Tan and Zhihua Xiang
1 Introduction 96
2 A Stochastic Model of Early HIV Pathogenesis Under Treatment by
a Protease Inhibitor 97
2.1 Modeling the Effects of Protease Inhibitors 98
2.2 Modeling the Net Flow of HIV From Lymphoid Tissues to
Plasma 99
2.3 Derivation of Stochastic Differential Equations for The State
Variables 100
3 Mean Values of 103
4 A State Space Model for the Early HIV Pathogenesis Under Treat-
ment by Protease Inhibitors 104
4.1 Estimation of given 106
4.2 Estimation of Given with and
107
5 An Example Using Real Data 108
6 Some Monte Carlo Studies 113
6
The impact of re-using hypodermic needles 117
B. Barnes and J. Gani
1 Introduction 117
2 Geometric distribution with variable success probability 118
3 Validity of the distribution 119
4 Mean and variance of I 120
5 Intensity of epidemic 122
6 Reducing infection 123
7 The spread of the Ebola virus in 1976 124
8 Conclusions 128
7
Nonparametric Frequency Detection and Optimal Coding in Molecular Biology 129
David S. Stoffer
1 Introduction 129
2 The Spectral Envelope 133
3 Sequence Analyses 140
4 Discussion 152

Part III 155


8
An Efficient Stochastic Approximation Algorithm for Stochastic Saddle Point Problems 155
Arkadi Nemirovski and Reuven Y. Rubinstein
1 Introduction 155
1.1 Classical stochastic approximation 155
2 Stochastic saddle point problem 157
2.1 The problem 157
2.1.1 Stochastic setting 158
2.1.2 The accuracy measure 159
2.2 Examples 159
2.3 The SASP algorithm 162
2.4 Rate of convergence and optimal setup: off-line choice of
the stepsizes 163
2.5 Rate of convergence and optimal setup: on-line choice of the
stepsizes 164
3 Discussion 167
3.1 Comparison with Polyak’s algorithm 167
3.2 Optimality issues 168
4 Numerical Results 172
4.1 A Stochastic Minimax Steiner problem 172
4.2 A simple queuing model 174
5 Conclusions 178
Appendix: A: Proof of Theorems 1 and 2 179

Appendix: B: Proof of the Proposition 182


9
Regression Models for Binary Time Series 185
Benjamin Kedem, Konstantinos Fokianos
1 Introduction 185
2 Partial Likelihood Inference 187
2.1 Definition of Partial Likelihood 187
2.2 An Assumption Regarding the Covariates 188
2.3 Partial Likelihood Estimation 188
2.4 Prediction 190
3 Goodness of Fit 191
4 Logistic Regression 192
4.1 A Demonstration 194
5 Categorical Data 196
10
Almost Sure Convergence Properties of Nadaraya-Watson Regression Estimates 201
Harro Walk
1 Introduction 201
2 Results 203
3 Lemmas and Proofs 205
11
Strategies for Sequential Prediction of Stationary Time Series 225
László Györfi, Gábor Lugosi
1 Introduction 225
2 Universal prediction by partitioning estimates 228
3 Universal prediction by generalized linear estimates 236
4 Prediction of Gaussian processes 240

Part IV 249
12
The Birth of Limit Cycles in Nonlinear Oligopolies with Continuously Distributed Information Lag 249
Carl Chiarella and Ferenc Szidarovszky
1 Introduction 249
2 Nonlinear Oligopoly Models 251
3 The Dynamic Model with Lag Structure 251
4 Bifurcation Analysis in the General Case 253
5 The Symmetric Case 259
6 Special Oligopoly Models 263
7 Conclusions 267
13
A Differential Game of Debt Contract Valuation 269
A. Haurie and F. Moresino
1 Introduction 269
2 The firm and the debt contract 270
3 A stochastic game 273
4 Equivalent risk neutral valuation 275
4.1 Debt and Equity valuations when bankruptcy is not considered 276
4.2 Debt and Equity valuations when liquidation may occur 278
5 Debt and Equity valuations for Nash equilibrium strategies 280
6 Liquidation at fixed time periods 281
7 Conclusion 282
14
Huge Capacity Planning and Resource Pricing for Pioneering Projects 285
David Porter
1 Introduction 285
2 The Model 287
3 Results 291
3.1 Cost and Performance Uncertainty 292
3.2 Cost Uncertainty and Flexibility 297
3.3 Performance Uncertainty and Flexibility 298
4 Conclusion 298
15
Affordable Upgrades of Complex Systems: A Multilevel, Performance-Based Approach 301
James A. Reneke and Matthew J. Saltzman and Margaret M. Wiecek
1 Introduction 301
2 Multilevel complex systems 306
2.1 An illustrative example 309
2.2 Computational models for the example 312
3 Multiple criteria decision making 313
3.1 Generating candidate methods 314
3.2 Choosing a preferred selection of upgrades 315
3.3 Application to the example 317
4 Stochastic analysis 320
4.1 Random systems and risk 321
4.2 Application to the example 321
5 Conclusions 322
Appendix: Stochastic linearization 327
1 Origin of stochastic linearization 327
2 Stochastic linearization for random surfaces 327
16
On Successive Approximation of Optimal Control of Stochastic Dynamic Systems 333
Fei-Yue Wang, George N. Saridis

1 Introduction 334
2 Problem Statement 335
3 Sub-Optimal Control of Nonlinear Stochastic Dynamic Systems 337
4 The Infinite-time Stochastic Regulator Problem 346
5 Procedure for Iterative Design of Sub-optimal Controllers 349
5.1 Exact Design Procedure 349
5.2 Approximate Design Procedures for the Regulator Problem 353
6 Closing Remarks by Fei-Yue Wang 356
17
Stability of Random Iterative Mappings 359
László Gerencsér
1 Introduction 359
2 Preliminary results 364
3 The proof of Theorem 1.1 367
Appendix 368

Part V 373
18
’Unobserved’ Monte Carlo Methods for Adaptive Algorithms 373
Victor Solo
1 El Sid 373
2 Introduction 374
3 On-line Binary Classification 375
4 Binary Classification with Noisy Measurements of Classifying Variables - Offline 376
5 Binary Classification with Errors in Classifying Variables - Online 378
6 Conclusions 380
19
Random Search Under Additive Noise 383
Luc Devroye and Adam Krzyzak
1 Sid’s contributions to noisy optimization 383
2 Formulation of search problem 384
3 Random search: a brief overview 385
4 Noisy optimization by random search: a brief survey 390
5 Optimization and nonparametric estimation 393
6 Noisy optimization: formulation of the problem 394
7 Pure random search 394
8 Strong convergence and strong stability 398
9 Mixed random search 399
10 Strategies for general additive noise 400
11 Universal convergence 410
20
Recent Advances in Randomized Quasi-Monte Carlo Methods 419
Pierre L’Ecuyer and Christiane Lemieux
1 Introduction 420
2 A Closer Look at Low-Dimensional Projections 423
3 Main Constructions 425
3.1 Lattice Rules 426
3.2 Digital Nets 428
3.2.1 Sobol’ Sequences 431
3.2.2 Generalized Faure Sequences 431
3.2.3 Niederreiter Sequences 432
3.2.4 Polynomial Lattice Rules 433
3.3 Constructions Based on Small PRNGs 435
3.4 Halton sequence 438
3.5 Sequences of Korobov rules 439
3.6 Implementations 439
4 Measures of Quality 440
4.1 Criteria for standard lattice rules 441
4.2 Criteria for digital nets 444
5 Randomizations 448
5.1 Random shift modulo 1 449
5.2 Digital shift 449
5.3 Scrambling 450
5.4 Random Linear Scrambling 451
5.5 Others 452
6 Error and Variance Analysis 452
6.1 Standard Lattices and Fourier Expansion 453
6.2 Digital Nets and Haar or Walsh Expansions 455
6.2.1 Scrambled-type estimators 455
6.2.2 Digitally shifted estimators 457
7 Transformations of the Integrand 461
8 Related Methods 462
9 Conclusions and Discussion 464
Appendix: Proofs 464

Part VI 475
21
Singularly Perturbed Markov Chains and Applications to Large-Scale Systems under Uncertainty 475
G. Yin, Q. Zhang, K. Yin and H. Yang
1 Introduction 476
2 Singularly Perturbed Markov Chains 480
2.1 Continuous-time Case 481
2.2 Time-scale Separation 483
3 Properties of the Singularly Perturbed Systems 485
3.1 Asymptotic Expansion 485
3.2 Occupation Measures 487
3.3 Large Deviations and Exponential Bounds 492

3.3.1 Large Deviations 492


3.3.2 Exponential Bounds 493
4 Controlled Singularly Perturbed Markovian Systems 494
4.1 Continuous-time Hybrid LQG 495
4.2 Discrete-time LQ 498
5 Further Remarks 504
6 Appendix: Mathematical Preliminaries 505
6.1 Stochastic Processes 505
6.2 Markov chains 506
6.3 Connections of Singularly Perturbed Models: Continuous
Time vs. Discrete Time 508
22
Risk–Sensitive Optimal Control in Communicating Average Markov Decision Chains 515
Rolando Cavazos–Cadena, Emmanuel Fernández–Gaucherand
1 Introduction 516
2 The Decision Model 518
3 Main Results 521
4 Basic Technical Preliminaries 526
5 Auxiliary Expected–Total Cost Problems: I 529
6 Auxiliary Expected–Total Cost Problems: II 538
7 Proof of Theorem 3.1 542
8 Conclusions 544
Appendix: A: Proof of Theorem 4.1 547
Appendix: B: Proof of Theorem 4.2 551
23
Some Aspects of Statistical Inference in a Markovian and Mixing Framework 555
George G. Roussas
1 Introduction 556
2 Markovian Dependence 557
2.1 Parametric Case - The Classical Approach 558
2.2 Parametric Case - The Local Asymptotic Normality Approach 561

2.3 The Nonparametric Case 567


3 Mixing 576
3.1 Introduction and Definitions 576
3.2 Covariance Inequalities 578
3.3 Moment and Exponential Probability Bounds 581
3.4 Some Estimation Problems 582
3.4.1 Estimation of the Distribution Function or Survival Function 582
3.4.2 Estimation of a Probability Density Function and its Derivatives 584
3.4.3 Estimating the Hazard Rate 586
3.4.4 A Smooth Estimate of F and 588
3.4.5 Recursive Estimation 589
3.4.6 Fixed Design Regression 591
3.4.7 Stochastic Design Regression 592

Part VII 607


24
Stochastic Ordering of Order Statistics II 607
Philip J. Boland, Taizhong Hu, Moshe Shaked and J. George Shanthikumar
1 Introduction 608
2 Likelihood Ratio Orders Comparisons 609
3 Hazard and Reversed Hazard Rate Orders Comparisons 611
4 Usual Stochastic Order Comparisons 615
5 Stochastic Comparisons of Spacings 615
6 Dispersive Ordering of Order Statistics and Spacings 618
7 A Short Survey on Further Results 620
25
Vehicle Routing with Stochastic Demands: Models & Computational Methods 625
Moshe Dror
1 Introduction 625
2 An SVRP Example and Simple Heuristic Results 627
2.1 Chance Constrained Models 630
3 Modeling SVRP as a stochastic programming with recourse problem 631
3.1 The model 633
3.2 The branch-and-cut procedure 635
3.3 Computation of a lower bound on and on Q(x) 636
4 Multi-stage model for the SVRP 638
4.1 The multi-stage model 640
5 Modeling SVRP as a Markov decision process 641
6 SVRP routes with at most one failure – a more ‘practical’ approach 643
7 The Dror conjecture 645
8 Summary 646
26
Life in the Fast Lane: Yates’s Algorithm, Fast Fourier and Walsh Transforms 651
Paul J. Sanchez, John S. Ramberg and Larry Head
1 Introduction 652
2 Linear Models 653
2.1 Factorial Analysis 654
2.1.1 Definitions and Background 654
2.1.2 The Model 656
2.1.3 The Coefficient Estimator 658
2.2 Walsh Analysis 658
2.2.1 Definitions and Background 658
2.2.2 The Model 662
2.2.3 Discrete Walsh Transforms 662
2.3 Fourier Analysis 663
2.3.1 Definitions and Background 663
2.3.2 The Model 665
3 An Example 666

4 Fast Algorithms 670


4.1 Yates’s Fast Factorial Algorithm 671
4.2 Fast Walsh Transforms 676
4.3 Fast Fourier Transforms 679
5 Conclusions 682
Appendix: A: Table of Notation 684
27
Uncertainty Bounds in Parameter Estimation with Limited Data 685
James C. Spall
1 Introduction 686
2 Problem Formulation 688
3 Three Examples of Appropriate Problem Settings 689
3.1 Example 1: Parameter Estimation in Signal-Plus-Noise Model
with Non-i.i.d. Data 690
3.2 Example 2: Nonlinear Input-Output (Regression) Model 691
3.3 Example 3: Estimates of Serial Correlation for Time Series 692
4 Main Results 693
4.1 Background and Notation 693
4.2 Order Result on Small-Sample Probabilities 695
4.3 The Implied Constant of Bound 695
5 Application of Theorem for the MLE of Parameters in Signal-Plus-
Noise Problem 697
5.1 Background 697
5.2 Theorem Regularity Conditions and Calculation of Implied
Constant 698
5.3 Numerical Results 699
6 Summary and Conclusions 702
Appendix: Theorem Regularity Conditions and Proof (Section 4) 703
28
A Tutorial on Hierarchical Lossless Data Compression 711
John C. Kieffer
1 Introduction 711
1.1 Pointer Tree Representations 715
1.2 Data Flow Graph Representations 716
1.3 Context-Free Grammar Representations 718
2 Equivalences Between Structures 721
2.1 Equivalence of Pointer Trees and Admissible Grammars 721
2.2 Equivalence of Admissible Grammars and Data Flow Graphs 723
3 Design of Compact Structures 725
4 Encoding Methodology 727
5 Performance Under Uncertainty 729

Part VIII 735


29
Eureka! Bellman’s Principle of Optimality is valid! 735

Moshe Sniedovich
1 Introduction 735
2 Remedies 738
3 The Big Fix 739
4 The Rest is Mathematics 740
5 Refinements 744
6 Non-Markovian Objective functions 746
7 Discussion 748
30
Reflections on Statistical Methods for Complex Stochastic Systems 751
Marcel F. Neuts
1 The Changed Statistical Scene 751
2 Measuring Teletraffic Data Streams 754
3 Monitoring Queueing Behavior 757

Author Index 761


Preface

This volume, titled MODELING UNCERTAINTY: An Examination of Stochastic Theory, Methods, and Applications, has been compiled by the friends and
colleagues of Sid Yakowitz in his honor as a token of love, appreciation, and
sorrow for his untimely death. The first paper in the book is authored by Sid’s
wife – Diana Yakowitz – and in it Diana describes Sid the person, his drive
for knowledge and his fascination with mathematics, particularly with respect
to uncertainty modelling and applications. This book is a collection of papers
with uncertainty as its central theme.

Fifty authors from all over the world collectively contributed 30 papers to
this volume. Each of these papers was reviewed and in the majority of cases
the original submission was revised before being accepted for publication in
the book. The papers cover a great variety of topics in probability, statistics,
economics, stochastic optimization, control theory, regression analysis, simula-
tion, stochastic programming, Markov decision processes, applications in the HIV
context, and others. Some of the papers have a theoretical emphasis and others
focus on applications. A number of papers have the flavor of survey work in a
particular area and in a few papers the authors present their personal view of a
topic. This book has a considerable number of expository articles which should
be accessible to a nonexpert, say a graduate student in mathematics, statistics,
engineering, and economics departments, or just anyone with some mathemat-
ical background who is interested in a preliminary exposition of a particular
topic. A number of papers present the state of the art of a specific area or
represent original contributions which advance the present state of knowledge.
Thus, the book has something for almost anybody with an interest in stochastic
systems.

The editors have loosely grouped the chapters into 8 segments, according
to some common mathematical thread. Since none of us (the co-editors) is an
expert in all the topics covered in this book, it is quite conceivable that the pa-
pers could have been grouped differently. Part 1 starts with a paper on stability
in queueing networks by H.J. Kushner. Part 1 also includes a queueing-related

paper by T.L. Lai, and a paper by I. Pinelis on asymptotics for large deviation
probabilities. Part 2 groups together 3 papers related to HIV modelling. The
first paper in this group is by W.-Y. Tan and Z. Xiang about modelling early
immune responses, followed by a paper of B. Barnes and J. Gani on the impact
of re-using hypodermic needles, and closes with a paper by D.S. Stoffer. Part 3
groups together optimization and regression papers. It contains 4 papers starting
with a paper by A. Nemirovski and R.Y. Rubinstein about classical stochastic
approximation. The next paper is by B. Kedem and K. Fokianos on regression
models for binary time series, followed by a paper by H. Walk on properties of
Nadaraya-Watson regression estimates, and closing with a paper on sequential
predictions of stationary time series by L. Györfi and G. Lugosi. Part 4’s 6 pa-
pers are in the area of economic analysis, starting with a nonlinear oligopolies
paper by C. Chiarella and F. Szidarovszky. The paper by A. Haurie and F.
Moresino examines a differential game of debt contract valuation. Next comes
a paper by D. Porter, followed by a paper about complex systems in relation to
affordable upgrades by J.A. Reneke, M.J. Saltzman, and M.M. Wiecek. The 5th
paper in this group, by F.-Y. Wang and G.N. Saridis, concerns optimal control
in stochastic dynamic systems, and the last paper, by L. Gerencsér, is about
stability of random iterative mappings. Part 5 loosely groups 3 papers starting
with a paper by V. Solo on Monte Carlo methods for adaptive algorithms, fol-
lowed by a paper on random search with noise by L. Devroye and A. Krzyzak,
and closes with a survey paper on randomized quasi-Monte Carlo methods by
P. L’Ecuyer and C. Lemieux. Part 6 is a collection of 3 papers sharing a focus
on Markov decision analysis. It starts with a paper by G. Yin, Q. Zhang, K.
Yin, and H. Yang on singularly perturbed Markov chains. The second paper, on
risk sensitivity in average Markov decision chains, is by R. Cavazos–Cadena
and E. Fernández–Gaucherand. The 3rd paper, by G.G. Roussas, is on statis-
tical inference in a Markovian framework. Part 7 includes a paper on order
statistics by P.J. Boland, T. Hu, M. Shaked, and J.G. Shanthikumar, followed
by a survey paper on routing with stochastic demands by M. Dror, a paper on
fast Fourier and Walsh transforms by P.J. Sanchez, J.S. Ramberg, and L. Head,
a paper by J.C. Spall on parameter estimation with limited data, and a tuto-
rial paper on data compression by J.C. Kieffer. Part 8 contains 2 ‘reflections’
papers. The first paper is by M. Sniedovich – an ex-student of Sid Yakowitz.
It reexamines Bellman’s principle of optimality. The last paper in this volume,
on statistical methods for complex stochastic systems, is by M.F. Neuts.

The efforts of many people have gone into this volume, which would not have
been possible without the collective work of all the authors and reviewers who
read the papers and commented constructively. We would like to take this op-
portunity to thank the authors and the reviewers for their contributions. This
book would have required a more difficult ’endgame’ without Ray Brice’s dedication and painstaking attention to production details. We are very grateful
for Ray’s help in this project. Paul Jablonka is the artist who contributed the art
work for the book’s jacket. He was a good friend to Sid and we appreciate his
contribution. We would also like to thank Gary Folven, the editor of Kluwer
Academic Publishers, for his initial and never-fading support throughout this
project. Thank you, Gary!

Moshe Dror Pierre L’Ecuyer Ferenc Szidarovszky


Contributing Authors

B. Barnes
School of Mathematical Sciences
Australian National University
Canberra, ACT 0200
Australia

Philip J. Boland
Department of Statistics
University College Dublin
Belfield, Dublin 4
Ireland

Rolando Cavazos–Cadena
Departamento de Estadística y Cálculo
Universidad Autónoma Agraria Antonio Narro
Buenavista, Saltillo COAH 25315
MÉXICO

Carl Chiarella
School of Finance and Economics
University of Technology
Sydney
P.O. Box 123, Broadway, NSW 2007
Australia
carl.chiarella@uts.edu.au

Luc Devroye
School of Computer Science
McGill University
Montreal, Canada H3A 2K6

Moshe Dror
Department of Management Information Systems
The University of Arizona
Tucson, AZ 85721, USA
mdror@bpa.arizona.edu

Emmanuel Fernández–Gaucherand
Department of Electrical & Computer Engineering
& Computer Science
University of Cincinnati
Cincinnati, OH 45221-0030
USA

Konstantinos Fokianos
Department of Mathematics & Statistics
University of Cyprus
P.O. Box 20537 Nicosia, 1678, Cyprus

J. Gani
School of Mathematical Sciences
Australian National University
Canberra, ACT 0200
Australia

László Gerencsér
Computer and Automation Institute
Hungarian Academy of Sciences
H-1111, Budapest Kende u 13-17
Hungary

László Györfi
Department of Computer Science and Information Theory
Technical University of Budapest
1521 Stoczek u. 2,
Budapest, Hungary
gyorfi@szit.bme.hu

A. Haurie
University of Geneva
Geneva Switzerland

Larry Head
Siemens Energy & Automation, Inc.
Tucson, AZ 85715

Taizhong Hu
Department of Statistics and Finance
University of Science and Technology
Hefei, Anhui 230026
People’s Republic of China

Benjamin Kedem
Department of Mathematics
University of Maryland
College Park, Maryland 20742, USA

John C. Kieffer
ECE Department
University of Minnesota
Minneapolis, MN 55455

Adam Krzyzak
Department of Computer Science
Concordia University
Montreal, Canada H3G 1M8

Harold J. Kushner
Applied Mathematics Dept.
Lefschetz Center for Dynamical Systems
Brown University
Providence RI 02912

Tze Leung Lai
Stanford University
Stanford, California

Pierre L’Ecuyer
Département d’Informatique et de Recherche Opérationnelle
Université de Montréal, C.P. 6128, Succ. Centre-Ville
Montréal, H3C 3J7, Canada
lecuyer@iro.umontreal.ca

Christiane Lemieux
Department of Mathematics and Statistics
University of Calgary, 2500 University Drive N.W.
Calgary, T2N 1N4, Canada
lemieux@math.ucalgary.ca

Gábor Lugosi
Department of Economics,
Pompeu Fabra University
Ramon Trias Fargas 25-27,
08005 Barcelona, Spain
lugosi@upf.es

F. Moresino
Cambridge University
United Kingdom

Arkadi Nemirovski
Faculty of Industrial Engineering and Management
Technion—Israel Institute of Technology
Haifa 32000, Israel

Marcel F. Neuts
Department of Systems and Industrial Engineering
The University of Arizona
Tucson, AZ 85721, U.S.A.
marcel@sie.arizona.edu

Iosif Pinelis
Department of Mathematical Sciences
Michigan Technological University
Houghton, Michigan 49931
ipinelis@math.mtu.edu

David Porter
College of Arts and Sciences
George Mason University

John S. Ramberg
Systems and Industrial Engineering
University of Arizona
Tucson, AZ 85721

James A. Reneke
Dept. of Mathematical Sciences
Clemson University
Clemson SC 29634-0975

George G. Roussas
University of California, Davis

Reuven Y. Rubinstein
Faculty of Industrial Engineering and Management
Technion—Israel Institute of Technology
Haifa 32000, Israel

Matthew J. Saltzman
Dept. of Mathematical Sciences
Clemson University
Clemson SC 29634-0975

Paul J. Sanchez
Operations Research Department
Naval Postgraduate School
Monterey, CA 93943

George N. Saridis
Department of Electrical, Computer and Systems Engineering
Rensselaer Polytechnic Institute
Troy, New York 12180

Moshe Shaked
Department of Mathematics
University of Arizona
Tucson, Arizona 85721
USA

J. George Shanthikumar
Industrial Engineering & Operations Research
University of California
Berkeley, California 94720
USA

Moshe Sniedovich
Department of Mathematics and Statistics
The University of Melbourne
Parkville VIC 3052, Australia
m.sniedovich@ms.unimelb.edu.au

Victor Solo
School of Electrical Engineering and Telecommunications
University of New South Wales
Sydney NSW 2052, Australia
vsolo@syscon.ee.unsw.edu.au

James C. Spall
The Johns Hopkins University
Applied Physics Laboratory
Laurel, MD 20723-6099
james.spall@jhuapl.edu

David S. Stoffer
Department of Statistics
University of Pittsburgh
Pittsburgh, PA 15260

Ferenc Szidarovszky
Department of Systems and Industrial Engineering
University of Arizona
Tucson, Arizona, 85721-0020, USA
szidar@sie.Arizona.edu

Wai-Yuan Tan
Department of Mathematical Sciences
The University of Memphis
Memphis, TN 38152-6429
waitan@memphis.edu

Harro Walk
Mathematisches Institut A
Universität Stuttgart
Pfaffenwaldring 57, D-70569
Stuttgart, Germany

Fei-Yue Wang
Department of Systems and Industrial Engineering
University of Arizona
Tucson, Arizona 85721

Margaret M. Wiecek
Dept. of Mathematical Sciences
Clemson University
Clemson SC 29634-0975

Zhihua Xiang
Organon Inc.
375 Mt. Pleasant Avenue
West Orange, NJ 07052
z.xiang@organoninc.com

D. S. Yakowitz
Tucson, Arizona

H.Yang
Department of Wood and Paper Science
University of Minnesota
St. Paul, MN 55108
hyang@ece.umn.edu

G. Yin
Department of Mathematics
Wayne State University
Detroit, MI 48202
gyin@math.wayne.edu

K. Yin
Department of Wood and Paper Science
University of Minnesota
St. Paul, MN 55108
kyin@crn.umn.edu

Q. Zhang
Department of Mathematics
University of Georgia
Athens, GA 30602
qingz@math.uga.edu
This book is dedicated to the
memory of Sid Yakowitz.
Chapter 1

PROFESSOR SIDNEY J. YAKOWITZ

D. S. Yakowitz
Tucson, Arizona

Sidney Jesse Yakowitz was born in San Francisco, California on March 8, 1937 and died in Eugene, Oregon on September 1, 1999. Sid’s parents,
Morris and MaryVee, were chemists with the Food and Drug Administration
and encouraged Sid to be a life-long learner. He attended Stanford University
and after briefly toying with the idea of medicine, settled into engineering (“I
saved hundreds of lives with that decision!”). Sid graduated from Stanford with
a B.S. in Electrical Engineering in 1960.
His first job out of Stanford was as a design engineer with the University
of California’s Lawrence Radiation Laboratory (LRL) at Berkeley. Sid was
unhappy after college but claimed that he learned the secret to happiness from
his office mate at LRL, Jim Sherwood, who told him he was being paid to be
creative. Sid decided that “Good engineering design is a synonym for ‘invent-
ing’.”
For graduate school, Sid chose Arizona State University. By this time, his
battle since childhood with acute asthma made a dry desert climate a manda-
tory consideration. In graduate school he flourished. He received his M.S. in
Electrical Engineering in 1965, an M.A. in Mathematics in 1966, and Ph.D. in
Electrical Engineering in 1967. His new formula for happiness in his work led
him to consider each topic or problem that he approached as an opportunity to
“invent”.
In 1966 Sid was hired as an Assistant Professor in the newly founded De-
partment of Systems and Industrial Engineering at the University of Arizona in
Tucson. This department remained his “home” for 33 years with the exception
of brief sabbaticals and leaves, such as a National Academy of Sciences Post-
doctoral Fellowship at the Naval Postgraduate School in Monterey, California
in 1970-1971.
In 1969 Sid’s book Mathematics of Adaptive Control Processes (Yakowitz,
1969) was published as a part of Richard Bellman’s Elsevier book series. This
book was essentially his Ph.D. dissertation and was the first of four published
books. Later Sid was instrumental in the popularization of differential dynamic
programming (DDP). Overcoming the “curse of dimensionality” made possible
the solution of problems that could at that time only be solved approximately,
for example, high dimensional multireservoir control problems. His paper
with then Ph.D. student Dan Murray in Water Resources Research (Murray
and Yakowitz, 1979) demonstrated quite dramatically what could be done with
DDP.
In addition to his own prolific accomplishments Sid had another important
talent - the ability to recognize talent in others. He enthusiastically collabo-
rated with colleagues on numerous subjects including hydrology, economics,
information-theory, statistics, numerical methods, and machine learning.
Sid’s international work started in 1973 with his participation in a joint NSF
sponsored US-Hungarian research project. According to Ferenc Szidarovszky
(Szidar), also involved in the project, his extraordinary talents in combining
probabilistic and statistical ideas with numerical computations made him one
of the most important contributors. Several papers on uncertainty in water
resources management, conference presentations, and invited lectures were the
result of this grant that was renewed until 1981. This was the period that
he had the most intensive collaboration with his many Hungarian colleagues.
This cooperation also resulted in the two textbooks on numerical analysis with
Szidar, Principles and Procedures of Numerical Analysis (Szidarovszky and
Yakowitz, 1978) and An Introduction to Numerical Computations (Yakowitz
and Szidarovszky, 1986). Long after the project terminated, Sid continued to
collaborate with Hungarian scientists who often visited him in Tucson enjoying
his hospitality.
Sid’s ability in combining probabilistic ideas and numerical computations
made him an expert in simulation. His book Computational Probability and
Simulation (Yakowitz, 1977) is considered one of the best of its kind. His paper
on weighted Monte-Carlo simulation (Yakowitz et al., 1978) offered a new
integration method that was much faster than the known classical procedures.
Sid had a very successful six-year cooperation with Professors Kolumban
Hutter, of ETH Zurich, and Szidar, working on an NSF-sponsored project on
the mathematical modeling and computer solutions of ice-sheets, glaciers and
avalanches. In this work, Sid’s expertise on numerical analysis was the essential
factor in solving large-scale differential equations with unusual boundary and
normalizing conditions (Yakowitz et al., 1985; Hutter et al., 1986a; Yakowitz
et al., 1986; Hutter et al., 1986b; Hutter et al., 1987; Szidarovszky et al., 1987;
Hutter et al., 1987; Szidarovszky et al., 1989).
Sid’s algorithmic way of thinking resulted in two major contributions to
game theory. With Szidar he developed a new proof for the existence of a unique
equilibrium of Cournot oligopolies, which is constructive, offering an algorithm
to find the equilibrium. This paper, (Szidarovszky and Yakowitz, 1977) is one of
the most cited papers in this field and has been republished in the Handbook of
Mathematical Economics. They also extended the constructive proof to the
case when the price and cost functions are not differentiable. They proved that
even in the case of multiple equilibria, the total output of the industry is unique
and the set of all equilibria is a simplex. They also considered the effect of
coalition formation on the profit functions (Szidarovszky and Yakowitz, 1982).
Sid was an expert in time series, both parametric and nonparametric. On the
nonparametric side he made contributions regarding nearest neighbor methods
applied to time series prediction, density and transition function estimation for
Markov sequences, and pattern recognition (Denny and Yakowitz, 1978; Schus-
ter and Yakowitz, 1979; Yakowitz, 1979; Szilagyi et al., 1984; Yakowitz, 1987;
Yakowitz, 1988; Yakowitz, 1989d; Rutherford and Yakowitz, 1991; Yakowitz
and Lowe, 1991; Yakowitz and Tran, 1993; Yakowitz, 1993a; Morvai et al.,
1998; Yakowitz et al., 1999). In particular Sid worked in the area of stochas-
tic hydrology over many years including analyzing hydrologic time series
such as flood and rainfall data to investigate their major statistical properties
and use them for forecasting (Yakowitz, 1973; Denny et al., 1974; Yakowitz,
1976; Yakowitz, 1976; Szidarovszky and Yakowitz, 1976; Yakowitz, 1979;
Yakowitz and Szidarovszky, 1985; Karlsson and Yakowitz, 1987a; Karlsson
and Yakowitz, 1987b; Noakes et al., 1988; Yakowitz and Lowe, 1991).
On the parametric side, Sid applied his deep understanding of linear filtering
of stationary time series in the problem of frequency estimation in the pres-
ence of noise. Here he authored several papers on frequency estimation using
contraction mappings, constructed from the first order auto-correlation, that
involved sophisticated sequences of linear filters with a shrinking bandwidth.
In particular, he showed that the contraction mapping of He and Kedem, which
requires a certain filtering property, can be extended quite broadly. This and the
shrinking bandwidth were very insightful (Yakowitz, 1991; Yakowitz, 1993c; Li
et al., 1994; Kedem and Yakowitz, 1994; Yakowitz, 1994a).
He found numerous applications of nonparametric statistical methods in ma-
chine learning (Yakowitz, 1989c; Yakowitz and Lugosi, 1990; Yakowitz et al.,
1992a; Yakowitz and Kollier, 1992; Yakowitz and Mai, 1995; Lai and Yakowitz,
1995). As a counterpart to his earlier work on numerical computation, Sid in-
troduced a course at the University of Arizona on Non-numerical Computation.
This course, which resulted in an unpublished textbook on the topic, developed
methods applicable to machine learning, games and epidemics. Sid loved this
topic dearly and enjoyed teaching it. He continued to explore this area up to
the time of his death.
In 1986 Sid met Joe Gani, and they worked together intermittently from
that time until his death. Over a period of 13 years, Sid and Joe (together with
students and colleagues) wrote 10 joint papers. Their earliest interest was in the
silting of dams, which they studied (with Peter Todorovic of UCSB) (Gani et al.,
1987), followed by a paper on the prediction of reservoir lifetimes under silting
(Gani and Yakowitz, 1989). In both, they made use of the Moran Markovian
reservoir model.
By this time, Sid and Joe had been awarded an NIH grant, which lasted until
Joe’s retirement from the University of California, Santa Barbara in 1994. They
used the grant to analyze various epidemic models, starting with cellular au-
tomaton models (with Sid’s student R. Hayes) in 1990 (Yakowitz et al., 1990),
and automatic learning for dynamic Markov fields in epidemiology in 1992
(Yakowitz et al., 1992b). Sid soloed on a decision model paper for the AIDS
epidemic that year (Yakowitz, 1992) as well as collaborating with Joe on a basic
paper on the spread of HIV among intravenous drug users (Gani and Yakowitz,
1993). Two further papers followed in 1995, one on interacting groups in AIDS
epidemics (Gani and Yakowitz, 1995), and another on deterministic approxi-
mations to epidemic Markov processes (Gani and Yakowitz, 1995). More work
followed on expectations for large compartmentalized models for the spread of
AIDS in prisons (with Sid’s student M. Blount) (Yakowitz et al., 1996; Gani et
al., 1997). In these problems, Sid brought to bear his vast knowledge of Markov
processes and, as Joe describes, his formidable computational skills.
Sid spent part of his last sabbatical in Australia working with Joe Gani. This
work culminated with a paper, published posthumously in 2000, on individual
susceptibilities and infectivities (with D.J. Daley and J. Gani) (Daley et al.,
2000). Sid was extremely proud of the work on this topic as well as the paper
on the spread of AIDS in prisons (Yakowitz et al., 1996). Both of these pa-
pers required delicate analysis, and careful computation, at both of which he
excelled.
The above is surely a truncated and inadequate representation of the impact
that Sid has had on the many topics that he was interested in and the people
that he worked with over the years. Sid authored over 100 journal papers and
books and probably an equal number of papers in conference proceedings. The
single common thread that is woven into all of his research is uncertainty. All
of his publications have a probabilistic component and this book is perhaps
the most appropriate tribute that could be given him. “Write it down!” is an
admonishment that he made on many occasions. His own writing was greatly
influenced by his love of literature and he felt quite strongly that technical
writers could learn much from the art of creative writing. Perhaps Sid said it
best:

...readers come to us to learn about certain specific facts. But they will value us
more if we give them more insight and entertainment than they bargained for.
from a letter to Gene Koppel on the value of creative writing courses

He hoped not only to pass on information but to inspire further investigation into the topics.
This desire to inspire often led him to experiment in his teaching. He wanted
to pass on more than just the information in the textbook and he was hard on
himself when he did not succeed. Sid was not known for a polished teaching
style. He was stressed and anxious over lectures and presentations.
When I have to go somewhere to give an invited lecture, I always take crayon
drawings that [my children] have made for me. I put them on the podium and
they give me strength.
from a letter to his friend Steve Berry
This lifelong battle with “stage fright” often led to comic episodes of forget-
fulness, fumbling, and embarrassment. He had little patience with laziness or
disinterest from his students. Those who took interest in the subject, and the
time to get to know Sid, grew to love him for his honesty, insight, enthusiasm,
humor and cheerfulness.
Sid was never too busy to answer a question. His work ethic defined, for me and
many other graduate students, a standard which we measure ourselves against.
Manbir Sodhi
I am privileged to have been his student. I will never forget his sense of humor
and the fact that he didn’t take anything too seriously, including himself.
T. Jayawardena
He distinguished between principle and process. He warned... against becoming
lost in process, and forgetting the initial principles with which they were to
concern themselves.
....He inveighed passionately against engineers simply putting their heads down,
and becoming lost in the mechanistic ritual of their jobs.
John Stevens Berry, Stanford roommate and lifelong friend.

I leave you with an image that most will find familiar, and many endearing.
The Yakowitz grin. Not a relentlessly upbeat smiley-face grin, also not a sneer.
A smile that asked for very little as a reason for its appearance: anything at all in
any way amusing and that he could share with others.
Robert Hymer, Stanford roommate and lifelong friend

I wish to thank Joe Gani, Ben Kedem, Dan Murray and Szidar for their
assistance in preparing this introduction. I and all of Sid’s family members are
grateful to Moshe Dror, Szidar, and Pierre L’Ecuyer for proposing and working
so hard as editors of this tribute to my husband and mentor.

PUBLICATIONS OF SID YAKOWITZ


Books

Yakowitz, S. (1969). Mathematics of Adaptive Control Processes. Elsevier, New York.
Yakowitz, S. (1977). Computational Probability and Simulation. Addison-Wesley,
Reading, MA.
Szidarovszky, F. and S. Yakowitz. (1978). Principles and Procedures of Nu-
merical Analysis. Plenum Press, New York.
Yakowitz, S. and F. Szidarovszky. (1986). An Introduction to Numerical Computations, 1st edn. Macmillan, New York [2nd edn 1989].

Papers

Yakowitz, S. and J. Spragins. (1968). On the identifiability of finite mixtures. Ann. Math. Statist. 39, 209-214.
Yakowitz, S. (1969). A consistent estimator for the identification of finite mix-
tures. Ann. Math. Statist. 40, 1728-1735.
Yakowitz, S. (1970). Unsupervised learning and the identification of finite mix-
tures. IEEE Trans. Inform. Theory 16, 330-338.
Fisher, L. and S. Yakowitz. (1970). Estimating mixing contributions in metric
spaces. Sankhya A 32, 411-418.
Yakowitz, S. and L. Fisher. (1973). On sequential search for the maximum of
an unknown function. J. Math. Anal. Appl. 41, 234-359.
Yakowitz, S. and S. Parker. (1973). Computation of bounds for digital filter
quantization errors. IEEE Trans. Circuit Theory 20, 391-396.
Yakowitz, S. (1973). A stochastic model for daily river flows in an arid region.
Water Resources Research 9, 1271-1285.
Yakowitz, S. (1974). Multiple hypothesis testing by finite-memory algorithms.
Ann. Statist. 2, 323-336.
Yakowitz, S., L. Duckstein, and C. Kisiel. (1974). Decision analysis of a gamma
hydrologic variate. Water Resources Research 10, 695-704.
Denny, J., C. Kisiel, and S. Yakowitz. (1974). Procedures for determining the or-
der of Markov dependence in streamflow records. Water Resources Research
10, 947-954.
Parker, S. and S. Yakowitz. (1975). A general method for calculating quantiza-
tion error bounds due to round off in multivariate digital filters. IEEE Trans.
Circuits Systems 22, 570-572.
Sagar, B., S. Yakowitz, and L. Duckstein. (1975) A direct method for the iden-
tification of the parameters of dynamic nonhomogeneous aquifers. Water
Resources Research 11, 563-570.
Szidarovszky, F., S. Yakowitz, and R. Krzysztofowicz. (1975). A Bayes ap-
proach for simulating sediment yield. J. Hydrol. Sci. 3, 33-45.
Fisher, L. and S. Yakowitz. (1976). Uniform convergence of the potential func-
tion algorithm. SIAM J. Control Optim. 14, 95-103.
Yakowitz, S. (1976). Small sample hypothesis tests of Markov order with ap-
plication to simulated and hydrologic chains. J. Amer. Statist. Assoc. 71,
132-136.
Yakowitz, S. and P. Noren. (1976) On the identification of inhomogeneous
parameters in dynamic linear partial differential equations. J. Math. Anal.
Appl. 53, 521-538.
Yakowitz, S. (1976). Model-free statistical methods for water table prediction. Water Resources Research 12, 836-844.
Yakowitz, S., T.L. Williams, and G.D. Williams. (1976). Surveillance of several
Markov targets. IEEE Trans. Inform. Theory 22, 716-724.
Szidarovszky, F. and S. Yakowitz. (1976). Analysis of flooding for an open
channel subject to random inflow and blockage. J. Hydrol. Sci. 3, 93-103.
Duckstein, L., F. Szidarovszky, and S. Yakowitz. (1977). Bayes design of a
reservoir under random sediment yield. Water Resources Research 13, 713-
719.
Szidarovszky, F. and S. Yakowitz. (1977). A new proof of the existence and
uniqueness of the Cournot equilibrium. Int. Econom. Rev. 18, 181-183.
Denny, J. and S. Yakowitz. (1978). Admissible run-contingency type tests for
independence and Markov dependence. J. Amer. Statist. Assoc. 73, 117-181.
Yakowitz, S., J. Krimmel, and F. Szidarovszky. (1978). Weighted Monte Carlo
integration. SIAM J. Numer. Anal. 15, 1289-1300.
Schuster, R. and S. Yakowitz. (1979). Contributions to the theory of nonpara-
metric regression with application to system identification. Ann. Statist. 7,
139-149.
Yakowitz, S. (1979). Nonparametric estimation of Markov transition functions.
Ann. Statist. 7, 671-679.
Neuman, S. and S. Yakowitz. (1979). A statistical approach to the inverse prob-
lem of aquifer hydrology: Part 1. Theory. Water Resources Research 15,
845-860.
Murray, D. and S. Yakowitz. (1979). Constrained differential dynamic program-
ming and its application to multireservoir control. Water Resources Research
15, 1017-1027.
Yakowitz, S. (1979). A nonparametric Markov model for daily river flow. Water
Resources Research 15, 1035-1043.
Krzysztofowicz, R. and S. Yakowitz. (1980). Large-sample methods analysis
of gamma variates. Water Resources Research 16, 491-500.
Yakowitz, S. and L. Duckstein. (1980). Instability in aquifer identification -
theory and case studies. Water Resources Research 16, 1045-1064.
Pebbles, R., R. Smith, and S. Yakowitz. (1981). A leaky reservoir model for
ephemeral flow recession. Water Resources Research 17, 628-636.
Murray, D. and S. Yakowitz. (1981). The application of optimal control method-
ology to non-linear programming problems. Math. Programming 21, 331-
347.
Szidarovszky, F. and S. Yakowitz. (1982). Contributions to Cournot oligopoly
theory. J. Econom. Theory 28, 51-70.
Yakowitz, S. (1982). Dynamic programming applications in water resources.
Water Resources Research 18, 673-696.
Yakowitz, S. (1983). Convergence rate of the state increment dynamic programming method. Automatica 19, 53-60.
Yakowitz, S. and B. Rutherford. (1984). Computational aspects of discrete-time
optimal-control. Appl. Math. Comput. 15, 29-45.
Szilagyi, M., S. Yakowitz, and M. Duff. (1984). A procedure for electron and
ion lens optimization. Appl. Phys. Lett. 44, 7-9.
Murray, D. and S. Yakowitz. (1984). Differential dynamic programming and
Newton’s method for discrete optimal control problems. J. Optim. Theory
Appl. 42, 395-415.
Yakowitz, S. (1985). Nonparametric density estimation, prediction and regres-
sion for Markov sequences. J. Amer. Statist. Assoc. 80, 215-221.
Yakowitz, S. (1985). Markov flow models and the flood warning problem. Water
Resources Research 21, 81-88.
Yakowitz, S. and F. Szidarovszky. (1985). A comparison of Kriging with non-
parametric regression methods. J. Multivariate Anal. 6, 21-53.
Yakowitz, S., K. Hutter, and F. Szidarovszky. (1985). Toward computation of
steady-state profiles of ice sheets. Z. für Gletscherkunde 21, 283-289.
Schuster, E. and S. Yakowitz. (1985). Parametric/nonparametric mixture density
estimation with application to flood frequency analysis. Water Resources
Bulletin 21, 797-804.
Yakowitz, S. (1986). A stagewise Kuhn-Tucker condition and differential dy-
namic programming. IEEE Trans. Automat. Control 31, 25-30.
Hutter, K., S. Yakowitz, and F. Szidarovszky. (1986a). A numerical study of
plane ice sheet flow. J. Glaciology 32, 139-160.
Yakowitz, S., K. Hutter, and F. Szidarovszky. (1986). Elements of a computa-
tional theory for glaciers. J. Comput. Phys. 66, 132-150.
Hutter, K., F. Szidarovszky, and S. Yakowitz. (1986b). Plane steady shear-flow
of a cohesionless granular material down an inclined plane – a model
for flow avalanches: Part I. Theory. Acta Mechanica 63, 87-112.
Hutter, K., F. Szidarovszky, and S. Yakowitz. (1987). Plane steady shear-flow of
a cohesionless granular material down an inclined plane – a model
for flow avalanches: Part II. Numerical results. Acta Mechanica 65, 239-261.
Yakowitz, S. (1987). Nearest neighbour methods in time-series analysis. J. Time
Series Anal. 2, 235-247.
Szidarovszky, F., K. Hutter, and S. Yakowitz. (1987). A numerical study of
steady plane granular chute flows using the Jenkins-Savage model
and its extensions. J. Numer. Methods Eng. 24, 1993-2015.
Hutter, K., S. Yakowitz, and F. Szidarovszky. (1987). Coupled thermomechani-
cal response of an axisymmetrical cold ice-sheet. Water Resources Research
23, 1327-1339.
Sen, S. and S. Yakowitz. (1987). A quasi-Newton differential dynamic program-
ming algorithm for discrete-time optimal control. Automatica 23, 749-752.
Karlsson, M. and S. Yakowitz. (1987a). Nearest-neighbor methods for nonparametric rainfall-runoff forecasting. Water Resources Research 23, 1300-1308.
Karlsson, M. and S. Yakowitz. (1987b). Rainfall-runoff forecasting methods,
old and new. Stoch. Hydrol. Hydraul. 1, 303-318.
Gani, J., P. Todorovic, and S. Yakowitz. (1987). Silting of dams by sedimentary
particles. Math. Scientist 12, 81-90.
Noakes, D., K. Hipel, A.I. McLeod, and S. Yakowitz. (1988). Forecasting annual
geophysical time series. Int. J. Forecasting 4, 103-115.
Yakowitz, S. (1988). Parametric and nonparametric density-estimation to ac-
count for extreme events. Adv. Appl. Prob. 20, 13.
Szidarovszky, F., K. Hutter, and S. Yakowitz. (1989). Computational ice-divide
analysis of a cold plane ice sheet under steady conditions. Ann. Glaciology
12, 170-178.
Yakowitz, S. (1989a). Algorithms and computational techniques in differential
dynamic programming. Control Dynamic Systems 31, 75-91.
Yakowitz, S. (1989b). Theoretical and computational advances in differential
dynamic programming. Control Cybernet. 17, 172-189.
Yakowitz, S. (1989c). A statistical foundation for machine learning, with ap-
plication to Go-Moku. Comput. Math. Appl. 17, 1095-1102.
Yakowitz, S. (1989d). Nonparametric density and regression estimation for
Markov sequences without mixing assumptions. J. Multivariate Anal. 30,
124-136.
Gani, J. and S. Yakowitz. (1989). A probabilistic sedimentation analysis for
predicting reservoir lifetime. Water Resources Management 3, 191-203.
Yakowitz, S. and E. Lugosi. (1990). Random search in the presence of noise,
with application to machine learning. SIAM J. Sci. Statist. Comput. 11, 702-
712.
Yakowitz, S., J. Gani, and R. Hayes. (1990). Cellular automaton modeling of
epidemics. Appl. Math. Comput. 40, 41-54.
Rutherford, B. and S. Yakowitz. (1991). Error inference for nonparametric re-
gression. Ann. Inst. Statist. Math. 43, 115-129.
Yakowitz, S. and W. Lowe. (1991). Nonparametric bandit methods. Ann. Op-
erat. Res. 28, 297-312.
Dietrich, R.D. and S. Yakowitz. (1991). A rule based approach to the trim-loss
problem. Int. J. Prod. Res. 29, 401-415.
Yakowitz, S. (1991). Some contributions to a frequency location problem due
to He and Kedem. IEEE Trans. Inform Theory 17, 1177-1182.
Yakowitz, S., T. Jayawardena, and S. Li. (1992a). Theory for automatic learning
under partially observed Markov-dependent noise. IEEE Trans. Automat.
Control 37, 1316-1324.
Yakowitz, S., R. Hayes, and J. Gani. (1992b). Automatic learning for dynamic
Markov-fields with application to epidemiology. Operat. Res. 40, 867-876.
Yakowitz, S. and M. Kollier. (1992). Machine learning for optimal blackjack counting strategies. J. Statist. Plann. Inference 33, 295-309.
Yakowitz, S. (1992). A decision model and methodology for the AIDS epidemic.
Appl. Math. Comput. 52, 149-172.
Yakowitz, S. and L.T. Tran. (1993). Nearest Neighbor estimators for random
fields. J. Multivariate Anal. 44, 23-46.
Yakowitz, S. (1993a). Nearest neighbor regression estimation for null-recurrent
Markov time series. Stoch. Proc. Appl. 48, 311-318.
Gani, J. and S. Yakowitz. (1993). Modeling the spread of HIV among intra-
venous drug users. IMA J. Math. Appl. Medicine Biol. 10, 51-65.
Yakowitz, S. (1993b). A globally convergent stochastic approximation. SIAM
J. Control Optim. 31, 30-40.
Yakowitz, S. (1993c). Asymptotic theory for a fast frequency detector. IEEE
Trans. Inform. Theory 39, 1031-1036.
Li, T.H., B. Kedem, and S. Yakowitz. (1994). Asymptotic normality of sample
autocovariances with an application in frequency estimation. Stoch. Proc.
Appl. 52, 329-349.
Pinelis, I. and S. Yakowitz. (1994). The time until the final zero-crossing of
random sums with application to nonparametric bandit theory. Appl. Math.
Comput. 63, 235-263.
Kedem, B. and S. Yakowitz. (1994). Practical aspects of a fast algorithm for
frequency detection. IEEE Trans. Commun. 42, 2760-2767.
Yakowitz, S. (1994a). Review of Time series analysis of higher order crossings,
by B. Kedem. SIAM Rev. 36, 680-682.
Yakowitz, S. (1994b). From a microcosmic IVDU model to a macroscopic HIV
epidemic. In Modeling the AIDS Epidemic: Planning, Policy, and Prediction,
eds E.H. Kaplan and M.L. Brandeau. Raven Press, New York, pp. 365-383.
Yakowitz, S. and J. Mai. (1995). Methods and theory for off-line machine learn-
ing. IEEE Trans. Automat. Control 40, 161-165.
Gani, J. and S. Yakowitz. (1995). Computational and stochastic methods for
interacting groups in the AIDS epidemic. J. Comput. Appl. Math. 59, 207-
220.
Yakowitz, S. (1995). Computational methods for Markov series with large state-
spaces, with application to AIDS Modeling. Math. Biosci. 127, 99-121.
Lai, T.L. and S. Yakowitz. (1995). Machine learning and nonparametric bandit
theory. IEEE Trans. Automat. Control 40, 1199-1209.
Gani, J. and S. Yakowitz. (1995). Error bounds for deterministic approximation
to Markov processes, with applications to epidemic models. J. Appl. Prob.
32, 1063-1076.
Yakowitz, S. and R.D. Dietrich. (1996). Sequential design with application to
the trim-loss problem. Int. J. Production Res. 34, 785-795.
Tran, L., G. Roussas, S. Yakowitz, and B. Van Truong. (1996). Fixed-design
regression for linear time series. Ann. Statist. 24, 975-991.
Jayawardena, T. and S. Yakowitz. (1996). Methodology for the stochastic graph
completion time problem. INFORMS J. Comput. 8, 331-342.
Morvai, G., S. Yakowitz, and L. Györfi. (1996). Nonparametric inferences for
ergodic, stationary time series. Ann. Statist. 24, 370-379.
Yakowitz, S., M. Blount, and J. Gani. (1996). Computing marginal expectations
for large compartmentalized models with application to AIDS evolution in
a prison system. IMA J. Math. Appl. Medicine Biol. 13, 223-244.
Blount, S., A. Galambosi, and S. Yakowitz. (1997). Nonlinear and dynamic
programming for epidemic intervention. Appl. Math. Comput. 86, 123-136.
Gani, J., S. Yakowitz, and M. Blount. (1997). The spread and quarantine of HIV
infection in a prison system. SIAM J. Appl. Math. 57, 1510-1530.
Morvai, G., S. Yakowitz, and P. Algoet. (1998). Weakly convergent nonpara-
metric forecasting of stationary time series. IEEE Trans. Inform. Theory 44,
886-892.
Yakowitz, S., L. Györfi, J. Kieffer, and G. Morvai. (1999). Strongly consistent
nonparametric forecasting and regression for stationary ergodic sequences.
J. Multivariate Anal. 71, 24-41.
Daley, D.J., J. Gani, and S. Yakowitz. (2000). An epidemic with individual
infectivities and susceptibilities. Math. and Comp. Modelling 32, 155-167.
Part I
Chapter 2

STABILITY OF SINGLE CLASS QUEUEING NETWORKS

Harold J. Kushner
Applied Mathematics Dept.
Lefschetz Center for Dynamical Systems
Brown University
Providence RI 02912 *

* Supported in part by NSF grants ECS 9703895 and ECS 9979250 and ARO contract DAAD19-99-1-0-223

Abstract The stability of queueing networks is a fundamental problem in modern com-
munications and computer networks. Stability (or recurrence) is known under
various independence or ergodic conditions on the service and interarrival time
processes if the “fluid or mean approximation” is asymptotically stable. The basic
property of stability should be robust to variations in the data. Perturbed Lia-
punov function methods are exploited to give effective criteria for the recurrence
under very broad conditions on the “driving processes” if the fluid approximation
is asymptotically stable. In particular, stationarity is not required, and the data
can be correlated. Various single class models are considered. For the problem of
stability in heavy traffic, where one is concerned with a sequence of queues, both
the standard network model and a more general form of the Skorohod problem
type are dealt with and recurrence, uniformly in the heavy traffic parameter, is
shown. The results can be extended to account for many of the features of queue-
ing networks, such as batch arrivals and processing or server breakdown. While
we concentrate on the single class network, analogous results can be obtained for
multiclass systems.

1. INTRODUCTION
Queueing networks are ubiquitous in modern telecommunications and com-
puter systems and much effort has been devoted to the study of their stability
properties. Consider a system where there are K processing stations, each with
an infinite buffer. The stations might have exogenous input streams (inputs

from the outside of the network) as well as inputs from the other stations. Each
customer eventually leaves the system. The service is first come first served
(FCFS) at each station and the service time distributions depend only on the
station. This paper is concerned with the stability of such systems. The basic
analysis supposes that the systems are working under conditions of heavy traf-
fic. Then the result is specialized to the case of a fixed system in arbitrary (not
necessarily heavy) traffic. Loosely speaking, by heavy traffic we mean that the
fraction of time that the processors are idle is small; equivalently, the traffic
intensities at each processor are close to unity. Heavy traffic is quite common
in modern computer and communications systems, and also models the effects
of “bottleneck” nodes in general. Many of the queueing systems of current
interest are much too complicated to be directly solvable. Under conditions of
heavy traffic, laws of large numbers and central limit theorems can be used to
greatly simplify the problem.
Most work on stability has dealt with the more difficult multiclass case (al-
though under simpler conditions on the probabilistic structure of the driving
random variables), where each station can work on several job classes and
there are strict priorities (Banks and Dai, 1997; Bertsimas et al., 1996; Bram-
son, 1994; Bramson and Dai, 1999; Chen and Zhang, 2000; Dai, 1995; Dai,
1996; Dai and Vande Vate, 2001; Dai et al., 1999; Dai and Weiss, 1995; Down
and Meyn, 1997; Dai and Meyn, 1995; Kumar and Meyn, 1995; Kumar and
Seidman, 1990; Lin and Kumar, 1984; Lu and Kumar, 1991; Meyn and Down,
1994; Perkins and Kumar, 1989; Rybko and Stolyar, 1992). A typical result is
that the system is stable if a certain “fluid” or “averaged” model is stable. The
interarrival and service intervals are usually assumed to be mutually indepen-
dent, with the members of each set being mutually independent and identically
distributed and the routing is “Markov.” Stability was shown in Bramson (1996)
for a class of FIFO networks where the service times do not depend on the class.
The counterexamples in Bramson (1994), Kumar and Seidman (1990), Lu and
Kumar (1991), Rybko and Stolyar (1992), Seidman (1994) have shown that the
multiclass problem is quite subtle and that even apparently reasonable strategies
for scheduling the competing classes can be unstable.
The stability situation is simpler when the service at each processor is FCFS
and the service time distribution depends only on the processor, and here too a
typical result is that the system is stable if a certain “fluid” or “averaged” model is
stable. The i.i.d. assumption on the interarrival (resp., service) times and Markov
assumption on the routing are common, although results are available under
certain “stationary–ergodic” assumptions (Baccelli and Foss, 1994; Borovkov,
1986). In the purely Markov chain context, where one deals with a reflected
random walk, works via the classical stochastic stability techniques for Markov
chains include Fayolle (1989), Fayolle et al. (1995), and Malyshev (1995).
For the single class case, and under the assumption that the “fluid” approxi-
mation is stable, it will be seen that stability holds under quite general assump-
tions on the probabilistic structure of the interarrival and service intervals, and
on the routing processes. This will be demonstrated by use of the perturbed
Liapunov function methods of Kushner (1984).
The basic class of models under heavy traffic will be defined in Section
2, where various assumptions are stated to put the problem into a context of
interest in current applications. These are for intuitive guidance only, and
weaker conditions will be used in the actual stability results. The basic class of
models which is motivated by the heavy traffic analysis of queueing networks is
then generalized to include a form of the so-called “Skorohod problem” model
which covers a broader class of systems. Stability under heavy traffic is actually
stability of a sequence of queueing systems, and it is “uniform” in the traffic
intensity parameter as that tends to unity. The stability results depend on a basic
theorem of Dupuis and Williams (1994) which established the existence of a
Liapunov function for the fluid approximation, and this is stated in Section 3.
The idea of perturbed Liapunov functions is developed in Section 4. Section 5
gives the main stability theorem for the sequence of systems in heavy traffic,
as well as the result for a single queue which is not necessarily in heavy traffic.
The results can be extended to account for many of the features of queueing
networks, such as batch arrivals and processing or server breakdown.

2. THE MODEL
A classical queueing network: The heavy traffic scalings. Heavy traffic
analysis works with a sequence of queues, indexed by and with arrival and
service “rates” depending on such that, as the fraction of time
at which the processors are idle goes to zero. With appropriate scaling and
under broad conditions, the sequence of scaled queues converges weakly to
a process which is the solution to a reflected stochastic differential equation
(Kushner, 2000; Reiman, 1984).1 Let there be K processors or servers, denoted
by Let denote the size of the vector of queues in the network
at real time for the system in the sequence. There are two scalings
which are in common use. One scaling works with the state process defined by
where both time and amplitude are “squeezed.” This is
used for classical queueing problems where the interarrival and service intervals
are O(1). In many modern communications and computer systems, the arrivals
and services (say, arrivals and transmissions of the packets) are fast, and then
one commonly works with the scaling We will concentrate
on the first scaling, although all the results are transferable to the second scaling
with an appropriate identification of variables.
Consider the first scaling where the model is of the type used in Reiman
(1984). See also Kushner (2000; Chapter 6). Let denote the in-
terarrival interval for exogenous arrivals to and let denote the
service interval at Define times the number of exogenous
arrivals to by real time and define analogously for the service
completions there. Let denote times the number of departures
from by real time that have gone to For some centering constants
which will be further specified below, define
the processes:

If the upper index of a sum is not an integer, we always take the integer part.
Let denote the indicator function of the event that the departure from
goes to Define the “routing” vector
In applications to communications systems, where the second (the “fast”)
scaling would be used, the scale parameter is the actual physical speed or size
of the physical system. Then, the actual interarrival and service intervals would
be defined by and resp. Thus, in this latter case we are working
with a sequence of systems of increasing speed. Under the given conditions,
the results in Section 5 will state that the processes are uniformly (in )
stable for large
Assumptions for the classical network in heavy traffic. Three related types
of models will be dealt with. The first form is the classical queueing network
type described above. Conditions (A2.1)–(A2.3) are typical of those used at
present for this problem class. These are given for motivational purposes and
to help illustrate the basic “averaging” method. The actual conditions which
are to be used are considerably weaker. The second type of model, called
the Skorohod problem, includes the first model. But it covers other queueing
systems which are not of the above network form (see, for example, Kushner
(2000; Chapter 7)). Finally, it will be seen that the stability results are valid
even without the heavy traffic assumptions, provided that the mean flows are
stable.
A2.0. For each the initial condition (0) is independent of all future
arrival and service times and of all routing decisions, and other future driving
random variables.
A2.1. The set of service and interarrival intervals is independent of the set
of routing decisions. There are constants such that
The set of processes defined in (2.1) is
tight in the Skorohod topology (Billingsley, 1968; Ethier and Kurtz, 1986) and
has continuous weak sense limits.
A2.2. The are mutually independent for each There are con-
stants and such that and The spectral
radius of the matrix is less than unity. Write
The spectral radius condition in (A2.2) implies that each customer will even-
tually leave the system and that the number of services of each customer is
stochastically bounded by a geometrically distributed random variable. We
will often write It is convenient to introduce the defi-
nition Condition (A2.3) is known as the heavy traffic condition. It
quantifies the rate at which the difference between the mean input rate and mean
output rate goes to zero at each processor as
A2.3. There are real such that

Note that (2.2) implies that

or, equivalently, that

Let (resp., denote times the number of exogenous ar-
rivals to (resp., jobs completely served) at by real time
An example. See Figure 2.1 for a two dimensional example. The state space
is the positive quadrant.

The limit system and fluid approximation for the classical model. The
basic state equation is
Define the vector Under (A2.0)–(A2.3), the sequence
converges weakly to the reflected diffusion process defined by Kushner (2000)
and Reiman (1984)

where is a continuous process (usually a Wiener process, in applications).
The reflection term has the form

where is continuous, nondecreasing, with and can increase only
at where The vector is the column of and is the
reflection direction on the line or face The reflection
directions for the example of Figure 2.1 are given in Figure 2.2.
Define Then, converges weakly to the “fluid
approximation” which solves the reflected ODE or Skorohod problem:

The stability of the system (2.6) is the usual departure point for the study of the
stability of the actual physical stochastic queues.
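For concreteness, the fluid stability check can be carried out numerically. The sketch below is illustrative only: the exogenous rate vector lam, the service rate vector mu and the routing matrix Q are hypothetical stand-ins for the network data of this section. It solves the usual traffic equations of an open single class network and tests the two conditions discussed here, namely that the spectral radius of Q is below unity and that the effective arrival rate is below the service rate at every station.

```python
import numpy as np

# Hypothetical data for a two-station single class network (illustrative only).
lam = np.array([0.9, 0.0])   # exogenous arrival rates at the stations
mu = np.array([1.0, 1.0])    # service rates
Q = np.array([[0.0, 0.5],    # Q[i, j] = routing probability from station i to j;
              [0.0, 0.0]])   # row sums < 1, so each customer eventually leaves

# Spectral radius condition of (A2.2): every customer eventually exits.
assert np.max(np.abs(np.linalg.eigvals(Q))) < 1.0

# Traffic equations: the total input rate vector a satisfies a = lam + Q' a.
a = np.linalg.solve(np.eye(len(lam)) - Q.T, lam)
rho = a / mu                 # traffic intensity at each station

print("effective arrival rates:", a)
print("traffic intensities:   ", rho)
print("fluid model stable:    ", bool(np.all(rho < 1.0)))
```

In the heavy traffic setting of this chapter one works with a sequence of such systems whose traffic intensities tend to unity, and the point of the stability results below is that recurrence then holds uniformly in the heavy traffic parameter.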

A simplifying assumption on the arrival and service times. To simplify
the notation, it will be assumed in the rest of the paper that there are which
can go to zero as fast as we wish as such that arrivals and service
completions can occur only at integral multiples of (real time) We further
suppose that at most one exogenous arrival and one service completion can
occur at a time at each processor. These conventions are not restrictive since
can go to zero as fast as we wish as All results hold without this con-
vention. Furthermore, for notational simplicity and without loss of generality,
let us suppose that service completions occur “just before” any arrivals at any
processor, and that they both occur “just before” the times
These are still referred to as the departures and arrivals at the times Con-
sequently, there cannot be a departure from a processor at real time
if its queue is empty at real time
Some details for (2.4). A formal summary of a few of the details which
take (2.3) to (2.4) will be helpful to get a feeling for the derivation of (2.4)
and motivation for the role of (2.6) in the stability analysis. The development
involves representations for the terms in (2.3) that separate the “drift” from
the “noise.” Let denote the indicator function of the event that there is
a departure from at real time and let denote the indicator
function of the event that this departure goes to Let be the indicator
function of the event that there is an exogenous arrival to at real time
By the definitions, we have

Write

Alternatively, we can write

where

Define a residual time error term to be a [constant] × [a residual interarrival
or service interval] plus a deterministic process which converges to zero. They
will always converge weakly to the “zero” process. The right hand terms of
(2.8) and (2.9) differ by a residual time error term, which is times
the number of steps (of real time length each) between real time and the
time of last departure at or before real time
Let denote times the real idle time at by real time The
right hand term of (2.8) can be written as
Now, write the coupling terms as

Use the representation (2.8) for the coefficient of in the right hand term of
(2.10). Then, the (negative of the) idle time terms in the equation (2.3) for
sum to

Now, consider the exogenous arrival processes. Modulo a residual time error
term,

Alternatively,

where

The difference between the right hand terms of (2.11) and (2.12) is also a
residual time error term and is asymptotically negligible.
Putting the expansions together and using the heavy traffic condition (A2.3)
yields (modulo asymptotically negligible errors, from the point of view of weak
convergence)

where is from (2.2), has the form

where
and is defined by

Let denote the boundary face on which and let denote the
reflection direction on that face. Then, for some nonnegative and uniformly
bounded random variables can be written as

Under (A2.0)–(A2.3), converges weakly to a continuous process and
converges weakly to the solution of (2.4).
The Skorohod problem. A more general model. The above discussion has
motivated the model (2.13)–(2.15) in terms of the new “primitives” and
with the state space A more general model starts from
the form (2.13)–(2.15), and generalizes it to the following. The state space G
is a “wedge” or convex cone (replacing the set and is
formally defined in (A2.4). The system model is represented as

where has the form (2.14a) for uniformly bounded random variables
and for some nonnegative and uniformly bounded random variables
the reflection term can be written as

Assumptions for the model (2.16), (2.17). Condition (A2.4) below holds for
the original queueing network model, as does (A2.5) because of the condition
on the spectral radius of Q.
A2.4. There are vectors such that the state space G is the intersection
of a finite number of closed halfspaces in each containing the origin and
defined by and it is the closure of its interior (i.e., it is a
“wedge”). Let denote the faces of G, and the interior
normal to Interior to the reflection direction is denoted by the unit
vector and for each The possible reflection directions at
points on the intersections of any subset of the are in the convex hull of the
directions on the adjoining faces. Let denote the set of reflection directions
at the point whether it is a singleton or not.
A2.5. For define the index set Suppose that
lies in the intersection of more than one boundary; i.e., has the
form for some Let denote the convex hull of
the interior normals to resp., at Then, there is
some vector such that for all
A2.6. The solution to the deterministic Skorohod problem (the fluid approxi-
mation to (2.4) or (2.16))

converges to zero for each initial condition.

3. STABILITY: INTRODUCTION
Stability of a queue can be defined in many ways, but it almost always
means something close to a uniform recurrence property, which we can define
as follows. Let be a large real number. Suppose that the current queue
size is Then there is a real-valued function K(·) such that the mean time,
conditioned on the system’s data to the present, for the queue size to get within the
centered at the origin, is bounded by with probability
one. Then we say that the queue process is uniformly recurrent. The queue
process need not be Markovian, or a component of a Markov process, or even
ergodic or stationary in any sense.
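Schematically, and in notation introduced here only for illustration (Q(t) for the queue-size vector and F_t for the system data up to time t), the property reads: with $\tau_B(t) = \inf\{s \ge t : |Q(s)| \le B\}$, the first time after $t$ that the queue enters the ball of radius $B$ about the origin,

$$
E\left[\tau_B(t) - t \,\middle|\, \mathcal{F}_t\right] \;\le\; K(Q(t)) \qquad \text{w.p.1.}
$$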
Now, consider a sequence of queues in heavy traffic, scaled as in Section
2. Then the uniform recurrence property is rephrased as follows. Suppose that
the current scaled queue size is Then the mean time, conditioned on the
system’s data to the present, for the scaled queue size to get within the
whose center is the origin, is bounded by
with probability one and uniformly in (large)
The study of the theory of the stability of Markovian processes, via stochas-
tic Liapunov functions, goes back to Bucy (1965), Khasminskii (1982), and
Kushner (1990a) and that of non-Markovian processes, via perturbed Liapunov
functions, to Blankenship and Papanicolaou (1978), Kushner (1984), and Kush-
ner (1990a). It was shown in Harrison and Williams (1987) that a necessary
and sufficient condition for the recurrence of the heavy traffic limit (2.4) (when
is a Wiener process) is that

in that each component of the vector is negative. If is the
routing probability from to then is the matrix of
the total mean number of current and future visits to for customers currently
at
The paper (Dupuis and Williams, 1994) extended this result to more general
reflection processes and state spaces by replacing (2.4) by a stochastic differ-
ential equation of a more general type (the solution to a Skorohod problem),
which might not arise as a limit of the sort of queueing processes (2.3) dis-
cussed in Section 2. They constructed a Liapunov function for the associated
fluid approximation, and used this to prove the recurrence of the heavy traffic
limit. The needed properties of their Liapunov function are stated in Theorem
3.1.
The next two sections contain an analysis of a class of systems whose heavy
traffic limits are either of the queueing type of (A2.0)–(A2.3), or of the more
general Skorohod model type (2.16), (2.17). The same methods will also be
applied to a single queue (and not a sequence of queues) which might not be in
heavy traffic. The method is interesting in that it can handle quite complicated
correlations in the arrival, service and routing processes, and even allow non-
stationarities. We aim to avoid stronger assumptions which would require the
processes to be either stationary or to be representable in terms of a component
of the state of a Markov process. Thus, the notion of Harris recurrence is not
directly relevant. Our opinion is that stability results should be as robust as pos-
sible to small variations in the assumptions. The perturbed Liapunov function
method is ideal for such robustness.
The following is a form of the main theorem of Dupuis and Williams (1994).
The reference used the orthant But the proof still
holds if G is a wedge as in (A2.4). For a point the interior of G, define
to be the set of indices such that In the theorem, denotes
the gradient of V(·).
Theorem 3.1. Assume (A2.4)–(A2.6). Then, there exists a real–valued
function V(·) on with the following properties. It is continuous,
together with its partial derivatives up to second order. There is a (twice
continuously differentiable) surface such that any ray from the origin crosses
once and only once, and for a scalar and
For

Thus, the second partial derivatives are of the order of as Also,
there are real such that

For There is such that for
Define V(0) = 0. Then V(·) is globally Lipschitz
continuous.

4. PERTURBED LIAPUNOV FUNCTIONS
In this motivational section, we will work with the first model of Section 2,
the sequence of networks in heavy traffic as modeled by (2.7) with limit process
(2.4) and (2.5), but the assumptions (A2.0)–(A2.3) will be weakened. By the
scaling of time, can change values only at times
with the departures occurring “just before” the arrivals.
Perturbed Liapunov functions. Introduction and motivation. This sec-
tion will introduce the idea of perturbed Liapunov functions (Blankenship and
Papanicolaou, 1978; Kushner, 1984; Kushner, 1990b; Kushner and Yin, 1997)
and their structure. The computations are intended to be illustrative and will be
formal. But, the actual proofs of the theorems in the next section go through
the same steps.
The classical Liapunov function method is quite limited, for problems such as
ours, since (owing to the correlations or possibly non-Markov character) there
is not usually a “contraction” at each step to yield the local supermartingale
property which is needed to prove stability. The perturbed Liapunov function
method is a powerful extension of the classical method. In the perturbed Lia-
punov function method, one adds a small perturbation to the original Liapunov
function. As will be seen, this perturbation provides an “averaging” which
is needed to get the local supermartingale property. The primary Liapunov
function will be simply the function V(·) of Theorem 3.1. The final Liapunov
function will be of the form where is
“small” in a sense to be defined.
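In hedged, generic notation (ours, following the template in Kushner (1984) and Kushner and Yin (1997), rather than this chapter's own symbols), the construction runs as follows. If $E_n$ denotes conditional expectation given the data through step $n$, $x_n$ is the state, $\{\xi_j\}$ is the driving noise sequence, and $\bar g(x)$ is the centering (“mean”) value of $g(x,\xi_j)$, one sets

$$
\delta V(n) \;=\; \sum_{j=n}^{\infty} E_n\left[\,g(x_n,\xi_j) - \bar g(x_n)\,\right],
\qquad
V^{\delta}(n) \;=\; V(x_n) + \delta V(n).
$$

The telescoping in $E_n[\delta V(n+1)] - \delta V(n)$ then cancels the correlated term $g(x_n,\xi_n)$ and replaces it by its centered value $\bar g(x_n)$, modulo terms that are negligible for large state values; this is the averaging referred to above.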
Let denote the expectation given all the system’s data up to and including
real time We can write

where the are O(1), uniformly in all variables.


Before proceeding, for motivation let us temporarily modify the model and
formally examine the first term on the right side of (4.1), under the assumptions
that and that the (real time) interarrival and service
intervals are exponentially distributed with rates resp., suppos-
ing that is “infinitesimal,” and letting the set of all intervals be mutually
independent. The conditional mean value of the bracketed term is then
By the heavy traffic condition (A2.3), times this “formally converges” to


Putting this “limit” into (4.1) yields that the first term of (4.1) is (asymp-
totically) By Theorem 3.1, this is less than If
we ignore the second order term in (4.1) and the behavior on the boundary of
G–{0} where at least one component of is zero, we then have that

I.e., has the supermartingale property for The order
of the conditional mean change per step and standard stochastic stability theo-
rems (Kushner, 1984) imply the uniform recurrence. For “non-exponential”
distributions, one needs some mechanism that allows the indicator functions
to be replaced by their “centering” or “mean” values. This is done by adding
a perturbation to the Liapunov function, and this will also allow us to
account for the behavior on the boundary and deal with the second order term
as well.
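The stability criterion invoked here is classical (Kushner, 1967; Kushner, 1984). Schematically, in stand-in notation: if, for some $c > 0$ and $\lambda > 0$,

$$
E_n\left[V(x_{n+1}) - V(x_n)\right] \;\le\; -c
\qquad \text{whenever } |x_n| \ge \lambda,
$$

then, by optional stopping, the mean time to return to the set $\{x : |x| \le \lambda\}$ from an initial state $x_0$ is at most $V(x_0)/c$; this is the sense in which a local supermartingale property yields uniform recurrence.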
Motivation for the form of the Liapunov function perturbations. Now,
drop the “exponential and independence assumption.” Before defining the ac-
tual perturbation, for additional motivation we will discuss the general principle
(Kushner, 1984) behind the construction of the perturbation by use of a slightly
simpler form of it. Let be small and let us work with large enough
such that for small
For our centering constants define Pro-
ceeding formally until further notice, the first suggested perturbed Liapunov
function will have the form:

where we define

The individual perturbations in (4.2b) are defined by:

where we define
We suppose (until further notice) that the and are O(1),
uniformly in all variables, w.p.1. Clearly, the centering constants must be such
that the sums are well defined. Under broad mixing conditions, there are such
centering constants. While continuing to proceed formally now, we will return
to this point in the next section and modify the C–functions to extend the
conditions under which the sums are well defined.
Define

By the definitions of the perturbations, we can expand as

where the are O(1), uniformly in all variables. Similarly, we can write

Also,
Note that the term in (4.1) plus the corre-
sponding term

in (4.5a) equals

Repeating this for all of the other first order terms yields

By the heavy traffic condition (A2.3), times the terms in brackets in the
second line in (4.7) converges to as
Now, turn to the boundary terms. Define and
Thus, asymptotically, we can write (4.7) as

Next, let us dominate the second order terms in (4.1). For large enough

Average the indicator functions in the second order part of (4.1) via use of
the perturbations as done for the first order terms. This yields the bound
for the second order terms, for large
Finally, combining the above expansions and bounds, for large enough
Theorem 3.1 allows us to write

The boundary term can be written as

which is nonpositive by Theorem 3.1. Thus, has the supermartingale
property for large state values, say for for some positive number
Suppose that Then, asymptotically, the
mean number of steps of length which are required for it to return to the
set where is bounded above by
Thus, in the interpolated time scale, it requires an average of units
of time. Since the perturbation goes to zero as this bound also holds
asymptotically for as well. Hence, the desired stability.

5. STABILITY
Discussion of the perturbations. Let us examine the in (4.4a) more
closely to understand why our O(1) requirement on its value is reasonable.
Since is merely a centering constant for the entire sequence, the actual
mean values or rates can vary with time (say, being periodic, etc.). Fix and let
and be the real times of the first two exogenous arrivals to queue
after real time Consider the part of given by

This equals

Next, for the moment, suppose that the interarrival times are mutually in-
dependent and identically distributed, with finite second moments, and mean
Then (5.1) equals zero w.p.1, since
Obviously and can be any two successive exogenous arrival
times with the same result. Thus, under the independence assumption,
is just

where is just the conditional expectation of the mean real time
to the next exogenous arrival to queue after real time given the data to
real time For use below, keep in mind that (w.p.1) this quantity is bounded
uniformly in under the above assumptions on the independence and the
moments.
Now, suppose that the interarrival times are correlated, still with centering
constant Let denote the sequence of exogenous arrival
(real time) times to after Then, for

Then, grouping terms and formally speaking, we see that is just (5.2) plus
the series

This sum is well defined and bounded uniformly in under broad conditions.
Similar computations can be done for the and
The perturbations which are to be used. The perturbations defined in (4.4)
are well defined and O(1) uniformly in under broad mixing conditions. But,
there are interesting cases where they are not well defined. A typical such case
is the purely deterministic problem where and
where H is an integer. Then the sum, taken from to is periodic in
with each segment moving linearly between zero and a fixed value. The
most convenient way of circumventing this problem of nonconvergence and
including such examples in our result is to suitably discount the defining sums
(Kushner and Yin, 1997; Solo and Kong, 1995). Thus, if the sums in (4.4) are
not well defined, then we will use the alternative perturbations where the sums
are changed to the following discounted forms, for some small
The sums in (5.4) are always well defined for each and the conditional
expectation can be taken either inside or outside of the summation. Finally, the
above discussion is summarized in the following assumption.
A5.1. There is a constant B such that and are bounded
by B w.p.1, for each
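A hedged illustration of the kind of discounted sum intended (schematic notation following the device in Kushner and Yin (1997), not the chapter's own (5.4)): a perturbation sum $\sum_{j \ge n} E_n[\,\cdot\,]$ is replaced by

$$
\delta V^{\rho}(n) \;=\; \sum_{j=n}^{\infty} (1-\rho)^{\,j-n}\, E_n\left[\,g(x_n,\xi_j) - \bar g(x_n)\,\right],
\qquad 0 < \rho \ \text{small},
$$

which converges absolutely whenever the conditional expectations are uniformly bounded, at the cost of an error term of the kind handled in the note on the proof of Theorem 5.1 below.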
The actual perturbed Liapunov function which is to be used is

where

and, analogously to (4.4), the individual perturbations are defined by:

Theorem 5.1. Let be tight. Assume the network model of Section
2, (A2.6), and that the spectral radius of Q is less than unity. Then
is recurrent, uniformly in If (5.4) is well defined and uniformly bounded
without the discounting, then the undiscounted form can be used.
Note on the proof. Note that, for the discounted function,

and similarly for the conditional expectations of the increments in and
The right hand “error term” is hence it adds
to the right side of (4.9), and this is dominated by the first order term there.
The rigorous development with the perturbed Liapunov function defined by
(5.4)–(5.6) is exactly the same as the formal development using (4.2)–(4.4),
with the addition of the error terms arising from the second term on the right of
(5.7) (and analogously from the and and we will not repeat the
details.
Fast arrivals and services: The second scaling. In the queueing network
model of Section 2, we defined This is the traditional
scaling for queues. But, in many applications to computer and communications
systems, the channel capacity is large and the arrivals and services occur “fast,”
proportionally to the capacity. Then, the parameter is taken to be the basic
speed of the system and one uses (Altman and Kushner,
1999; Kushner et al., 1995). As noted at the beginning of Section 2, the service
and interarrival intervals are then defined to be with centering
constants To be consistent with the previous development, suppose
that arrivals and departures can occur only “just before” the real times

The development that led to Theorem 5.1 used only the scaled system, and
the results are exactly the same if we have fast arrivals and services.
The Skorohod problem model (2.16), (2.17). We will use the perturbation

where

The proof of the next theorem is the same as that of Theorem 5.1.
Theorem 5.2. Assume the model (2.16), (2.17) and the conditions (A2.4)–
(A2.6). Suppose that is tight and that the of (5.9) are bounded,
w.p.1, uniformly in Then is recurrent, uniformly in If the functions
(5.9) are well defined and uniformly bounded without the discounting, then the
undiscounted form can be used.
Fixed queues: Non-heavy-traffic problems. Consider a single queueing
network (not a sequence) of the type discussed in Section 2. Let denote the
size of the queue at server at time The primary assumptions in Theorem 5.1
were first (A2.4)–(A2.6) which enabled the construction of the fundamental
Liapunov function of Theorem 3.1 and, second, (A5.1) which provided the
averaging. The fact that is sufficient for the (average
of the) first order term in (4.1) to dominate the (average of the) second order
term for large even without their relative and scalings.
Drop the in the definitions, and suppose that arrivals and departures
only occur as in the queueing network model of Section 2; i.e., “just before”
times for some small In particular, for define
Define

All of the equations in this and in the previous section hold if the is dropped.
Thus, we have the following theorem.
Theorem 5.3. Assume that the spectral radius of Q is less than unity, condi-
tion (A2.6) and that the sums in (5.10) are bounded w.p.1, uniformly in
and Then Q(·) is recurrent. If the functions (5.10) are well defined and
uniformly bounded without the discounting, then the undiscounted forms can
be used.

NOTES
1. All weak convergence is in the Skorohod topology (Ethier and Kurtz, 1986).
2. Thus the gradient is the same at all points on any ray from the origin.

REFERENCES
Altman, E. and H.J. Kushner. (1999). Admission control for combined guar-
anteed performance and best effort communications systems under heavy
traffic. SIAM J. Control and Optimization, 37:1780–1807.
Baccelli, F. and S. Foss. (1994). Stability of Jackson-type queueing networks.
Queueing Systems, 17:5–72.
Banks, J. and J.G. Dai. (1997). Simulation studies of multiclass queueing net-
works. IIE Trans., 29:213–219.
Bertsimas, D., D. Gamarnik, and J. Tsitsiklis. (1996). Stability conditions for
multiclass fluid queueing networks. IEEE Trans. Aut. Control, 41:1618–
1631.
Billingsley, P. (1968). Convergence of Probability Measures. Wiley, New York.
Blankenship, G. and G.C. Papanicolaou. (1978). Stability and control of systems
with wide band noise disturbances. SIAM J. Appl. Math., 34:437–476.
Borovkov, A. A. (1986). Limit theorems for queueing networks. Theory of Prob-
ability and its Applications, 31:413–427.
Bramson, M. (1994). Instability of FIFO queueing networks. Ann. Appl. Probab.,
4:414–431.
Bramson, M. (1996). Convergence to equilibria for FIFO queueing networks.
Queueing Systems, 22:5–45.
Bramson, M. and J.G. Dai. (1999). Heavy traffic limits for some queueing
networks. Preprint.
Bucy, R. S. (1965). Stability and positive supermartingales. J. Differential Equa-
tions, 1:151–155.
Chen, H. and H. Zhang. (2000). Stability of multiclass queueing networks under
priority service disciplines. Operations Research, 48:26–37.
Dai, J., J. Hasenbein, and J. Vande Vate. (1999). Stability of a three station fluid
network. Queueing Systems, 33:293–325.
Dai, J.G. and S. Meyn. (1995). Stability and convergence of moments for multi-
class queueing networks via fluid limit models. IEEE Trans on Aut. Control,
40:1889–1904.
Dai, J. and J. Vande Vate. (2001). The stability of two station multi–type fluid
networks. To appear in Operations Research.
Dai, J. G. (1995). On positive Harris recurrence of multiclass queueing net-
works: a unified approach via fluid models. Ann. Appl. Probab., 5:49–77.
Dai, J. G. (1996). A fluid–limit model criterion for instability of multiclass
queueing networks. Ann. of Appl. Prob., 6:751–757.
Dai, J. G. and G. Weiss. (1995). Stability and instability of fluid models for
reentrant lines. Math. of Oper. Res., 21:115–135.
Down, D. and S.P. Meyn. (1997). Piecewise linear test functions for stability
and instability of queueing networks. Queueing Systems, 27:205–226.
Dupuis, P. and R.J. Williams. (1994). Lyapunov functions for semimartingale
reflecting Brownian motions. Ann. Prob., 22:680–702.
Ethier, S. N. and T.G. Kurtz. (1986). Markov Processes: Characterization and
Convergence. Wiley, New York.
Fayolle, G. (1989). On random walks arising in queueing systems: ergodicity
and transience via quadratic forms as Lyapunov functions, Part 1. Queueing
Systems, 5:167–184.
Fayolle, G., V.A. Malyshev, and M.V. Menshikov. (1995). Topics in the Con-
structive Theory of Markov Chains. Cambridge University Press, Cambridge,
UK.
Harrison, J. M. and R.J. Williams. (1987). Brownian models of open queueing
networks with homogeneous customer populations. Stochastics and Stochas-
tics Rep., 22:77–115.
Khasminskii, R. Z. (1982). Stochastic Stability of Differential Equations. Sijthoff
& Noordhoff, Alphen aan den Rijn.
Kumar, P. R. and S.P. Meyn. (1995). Stability of queueing networks and schedul-
ing policies. IEEE Trans. on Automatic Control, 40:251–260.
Kumar, P. R. and T.I. Seidman. (1990). Dynamic instabilities and stabiliza-
tion methods in distributed real time scheduling policies. IEEE Trans. on
Automatic Control, 35:289–298.
Kushner, H. J. (1967). Stochastic Stability and Control. Academic Press, New
York.
Kushner, H. J. (1972). Stochastic stability. In R. Curtain, editor, Stability of
Stochastic Dynamical Systems; Lecture Notes in Math. 294, pages 97–124,
Berlin and New York, Springer-Verlag.
Kushner, H. J. (1984). Approximation and Weak Convergence Methods for Ran-
dom Processes with Applications to Stochastic Systems Theory. MIT Press,
Cambridge, Mass.
Kushner, H. J. (1990a). Numerical methods for stochastic control problems in
continuous time. SIAM J. Control Optim., 28:999–1048.
Kushner, H. J. (1990b). Weak Convergence Methods and Singularly Perturbed
Stochastic Control and Filtering Problems, volume 3 of Systems and Control.
Birkhäuser, Boston.
Kushner, H. J. (2000). Heavy Traffic Analysis of Controlled and Uncontrolled
Queueing and Communication Networks. Springer Verlag, Berlin and New
York.
Kushner, H. J., D. Jarvis, and J. Yang. (1995). Controlled and optimally con-
trolled multiplexing systems: A numerical exploration. Queueing Systems,
20:255–291.
Kushner, H. J. and G. Yin. (1997). Stochastic Approximation Algorithms and
Applications. Springer-Verlag, Berlin and New York.
Lin, W. and P.R. Kumar. (1984). Optimal control of a queueing system with two
heterogeneous servers. IEEE Trans. on Automatic Control, AC-29:696–703.
Lu, S. H. and P.R. Kumar. (1991). Distributed scheduling based on due dates
and buffer priorities. IEEE Trans. on Automatic Control, 36:1406–1416.
Malyshev, V. A. (1995). Networks and dynamical systems. Adv. in Appl. Probab.,
25:140–175.
Meyn, S. P. and D. Down. (1994). Stability of generalized Jackson networks.
Ann. Appl. Prob., 4:124–148.
Perkins, J. R. and P.R. Kumar. (1989). Stable distributed real–time scheduling
of flexible manufacturing/assembly/disassembly systems. IEEE Trans. on
Automatic Control, 34:139–148.
Reiman, M. I. (1984). Open queueing networks in heavy traffic. Math. Oper.
Res., 9:441–458.
Rybko, A. N. and A.L. Stolyar. (1992). On the ergodicity of stochastic pro-
cesses describing open queueing networks. Problems Inform. Transmission,
28:199–220.
Seidman, T. I. (1994). First come, first served, can be unstable. IEEE Trans on
Automatic Control, 39:2166–2171.
Solo, V. and X. Kong. (1995). Adaptive Signal Processing Algorithms. Prentice-
Hall, Englewood Cliffs, NJ.
Chapter 3

SEQUENTIAL OPTIMIZATION UNDER UNCERTAINTY

Tze Leung Lai
Stanford University

Abstract Herein we review certain problems in sequential optimization when the under-
lying dynamical system is not fully specified but has to be learned during the
operation of the system. A prototypical example is the multi-armed bandit prob-
lem, which was one of Yakowitz’s many research areas. Other problems under
review include stochastic approximation and adaptive control of Markov chains.

1. INTRODUCTION
Sequential optimization, when the underlying function or dynamical system
is not fully specified but has to be learned during the operation of the system,
was one of Yakowitz’s major research areas, to which he made many important
contributions in a variety of topics. In this paper we give an overview of some of
these topics and related developments, and review in this connection Yakowitz’s
contributions to these areas.
The optimization problem of finding the value which maximizes
a given function is difficult when is large and does not have nice
smoothness and concavity properties. Probabilistic algorithms, such as sim-
ulated annealing introduced by Kirkpatrick et al. (1983), have proved useful
to reduce the computational complexity. The problem becomes even more
challenging if is some unknown regression function so that an observation
at a given has substantial “uncertainties” concerning its mean value
In such stochastic settings, statistical techniques and probabilistic methods are
indispensable tools to tackle the problem.
When is finite, the above optimization problem in stochastic settings can be
viewed as a stochastic adaptive control problem with a finite control set, which is
often called a “multi-armed bandit problem”. In its simplest form, the problem
can be described as follows. There are statistical populations
with univariate density functions with respect to some
measure M. At each time we can sample from one of these populations and
the reward is the sampled value Thus the control set is where
control action refers to sampling from An adaptive sampling rule consists
of a sequence of random variables taking values in such
that the event (“ is sampled from ”) belongs to the σ-field
generated by Let If were
known, then we would sample from the population with the largest mean,
i.e., where
is assumed to be finite. In ignorance of the true parameters, the problem is to sample
sequentially from the populations to maximize or equivalently
to minimize the regret

as N → ∞, where the indicator of an event A equals 1 if A occurs and 0
otherwise. Section 2 reviews some important developments and basic results
of the multi-armed bandit problem, ranging from the above parametric set-
ting with independent observations to nonparametric models with dependent
observations pioneered by Yakowitz and his collaborators in a series of papers.
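For reference, in commonly used notation (ours; cf. Lai and Robbins (1985)), with $\mu^* = \max_j \mu_j$ and $T_N(j)$ the number of observations taken from the $j$-th population up to stage $N$, the regret (1) takes the form

$$
R_N(\theta) \;=\; N\mu^* - E_\theta\Big(\sum_{t=1}^{N} X_t\Big)
\;=\; \sum_{j:\,\mu_j < \mu^*} (\mu^* - \mu_j)\, E_\theta\, T_N(j),
$$

so that minimizing the regret amounts to keeping small the expected number of observations taken from each inferior population.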
Returning to deterministic optimization as described in the first paragraph,
if is a convex subset of and is smooth and unimodal, then efficient gradi-
ent methods are available for deterministic problems and their counterparts in
stochastic settings have been developed under the rubric of “stochastic approx-
imation”. It is widely recognized that for satisfactory performance stochastic
approximation procedures have to be initialized at good starting values. One
possible approach for modifying stochastic approximation algorithms accord-
ingly is to incorporate them into a multistart procedure. This idea was used by
Yakowitz (1993) to find the global maximum of a function that may have mul-
tiple local maxima. Section 4 reviews this work and some other developments
in stochastic approximation.
In the engineering literature, stochastic approximation schemes are usually
applied to optimization and control problems in dynamical systems, instead
of to static regression functions considered above. In principle, given a prior
distribution of the unknown system parameters and the joint probability dis-
tribution of the sequence of random variables that determine the stochastic
system, one can formulate a stochastic adaptive control problem as a dynamic
programming problem in which the “state” is the conditional distribution of the
original system state and of the parameter vector given the past observations.
However, because of the complexity of the systems usually encountered in prac-
tice, the dynamic programming equations are prohibitively difficult to handle,
both computationally and analytically. Moreover, it is often not possible to
specify a realistic probability law for all the random variables involved and a
reasonable prior distribution for the unknown parameter vector. Instead of the
Bayesian approach, a much more practical alternative that is commonly used in
the engineering literature is the “certainty equivalence” approach that replaces
unknown parameters in the optimal control rule by their sample estimates at
every stage. Section 3 gives a brief review of stochastic adaptive control in
controlled Markov chains. It shows how asymptotically optimal control rules
can be constructed by a modification of the certainty equivalence approach,
called “certainty equivalence with uncertainty adjustments” by Graves and Lai
(1997), that incorporates uncertainties in the parameter estimates.

2. BANDIT THEORY
The “multi-armed bandit problem”, introduced by Robbins (1952), derives
its name from an imagined slot machine with arms. When an arm is
pulled, the player wins a random reward. For each arm there is an unknown
probability distribution of the reward, and the player’s problem is to choose
N successive pulls on the arms so as to maximize the total expected reward.
The problem is prototypical of a general class of adaptive control problems in
which there is a fundamental dilemma between “information” (such as the need
to learn from all populations about their parameter values) and “control” (such
as the objective of sampling only from the best population), cf. Kumar (1985).
Another often cited example of such problems is in the context of clinical trials,
where there are treatments of unknown efficacy to be chosen sequentially to
treat a large class of N patients, cf. Chernoff (1967).

2.1. NEARLY OPTIMAL RULES BASED ON UPPER CONFIDENCE BOUNDS AND GITTINS INDICES
For the regret defined in (1), Robbins (1952) showed that it is possible to
achieve by a “certainty equivalence rule with forcing” that
chooses from the population (“arm”) with the largest sample mean (“certainty
equivalence”) except at a sparse set of times when is chosen (“forcing”)
for each Lai and Robbins (1985) showed how to construct
sampling rules for which at every These rules are called
“uniformly good.” They also developed asymptotic lower bounds for the regret
of uniformly good rules and showed that the rules constructed actually
attain these asymptotic lower bounds and are therefore asymptotically efficient.
Specifically, they showed that under certain regularity conditions

for uniformly good rules, where is the Kullback-Leibler information
number. Instead of sampling from the population with the largest sample mean,
they proposed to sample from the population with the largest upper confidence
bound for and showed how these confidence bounds can be constructed
for the sampling rule to attain the asymptotic lower bound (2). Their result
was subsequently generalized by Anantharam, Varaiya and Walrand (1987)
to the multi-armed bandit problem in which each represents an aperiodic,
irreducible Markov chain on a finite state space S so that the successive ob-
servations from are no longer independent but are governed by a Markov
transition density This extension was motivated by the more general
problem of adaptive control of finite-state Markov chains with a finite control
set, details of which are given in Section 3.
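In the notation of the preceding sketch, the lower bound (2) referred to above is the celebrated Lai–Robbins bound

$$
\liminf_{N\to\infty} \frac{R_N(\theta)}{\log N}
\;\ge\; \sum_{j:\,\mu_j < \mu^*} \frac{\mu^* - \mu_j}{I(\theta_j, \theta^*)},
\qquad
I(\theta,\lambda) \;=\; E_\theta\left[\log \frac{f(X;\theta)}{f(X;\lambda)}\right],
$$

where $\theta^*$ is the parameter of a population attaining $\mu^*$ and $I$ is the Kullback-Leibler information number.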
Besides control engineering, the theory of multi-armed bandits also has an
extensive literature in economics. In particular, it has been applied to pricing
under demand uncertainty, decision making in labor markets, general search
problems and resource allocation (cf. Rothschild (1974), Mortensen (1985),
Banks and Sundaram (1992)). Unlike the formulation above, the formulation
of adaptive allocation problems in the economics literature involves a discount
factor that relates future rewards to their present values. Moreover, an economic
agent typically incorporates his prior beliefs about the unknown parameters into
his choice of actions. Suppose an agent chooses actions sequentially from a
finite set such that the reward of action has a proba-
bility distribution depending on an unknown parameter which has a prior
distribution The agent’s objective is to maximize the total discounted
reward

where is a discount factor and denotes the action chosen by


the agent at time The optimal solution to this problem, commonly called
the “discounted multi-armed bandit problem”, was shown by Gittins and Jones
(1974) and Gittins (1979) to be the “index rule” that chooses at each stage the
action with the largest “dynamic allocation index” (also called “Gittins index”),
which is a complicated functional of the posterior distribution given the
rewards of action up to stage where denotes the
total number of times that action has been used up to stage
Let have distribution function (depending on the unknown param-
eter ) so that are independent random variables with common
distribution function Let be a prior distribution on The Gittins
index associated with is defined as

where the supremum is over all stopping times defined on
(cf. Gittins (1979)). As is well known, the conditional distribution of
given can be described by that
are independent having common distribution function
and that has distribution which is the posterior distribution of given
Chapter 7 of Gittins (1989) describes computational methods
to calculate Gittins indices for normal, Bernoulli and exponential with
the prior distribution of belonging to a conjugate family. These methods
involve approximating the infinite horizon in the optimal stopping problem
(4) by a finite horizon N and using backward induction. When is near 1,
a good approximation requires a very large N and becomes computationally
prohibitive.
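A hedged reconstruction of (4) in its standard form (cf. Gittins (1979); the symbols are ours): with $Y_1, Y_2, \ldots$ the successive rewards from the arm, $\beta$ the discount factor, and expectations taken under the predictive distribution induced by the prior $\pi$,

$$
m(\pi) \;=\; \sup_{\tau \ge 1}\,
\frac{E\Big[\sum_{t=1}^{\tau} \beta^{\,t-1} Y_t\Big]}
     {E\Big[\sum_{t=1}^{\tau} \beta^{\,t-1}\Big]},
$$

the supremum being over stopping times $\tau$ of the sequence $Y_1, Y_2, \ldots$; the index is thus the best achievable discounted reward per unit of discounted time.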
Varaiya, Walrand and Buyukkoc (1985) have suggested a simple way to view
the complicated index (4) and to see why the index rule is optimal. Suppose
there are machines with deterministic rewards
In analogy with (4), the index at time 1 of machine is defined as

Suppose and is attained at From the definition of it follows
that

These inequalities can be used in conjunction with to prove the follow-
ing: For any rule let T be the stage that operates machine 1 for the st
time, so machine 2 is operated times by during the first T stages.
Consider the rule that operates machine 1 for the first stages, machine
2 for the next stages and such that is the same as after stage
T. Then the total discounted reward of is larger than or equal to that of
showing that it is better to use the index rule until stage The argument
can then be repeated starting at stage proving the optimality of the index rule.
See Section II.A of Varaiya, Walrand and Buyukkoc (1985) for the algebraic
details.
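In this deterministic setting the index has a transparent form: with $r_i(t)$ denoting the known reward of the $t$-th operation of machine $i$ (notation ours),

$$
\nu_i \;=\; \max_{s \ge 1}\,
\frac{\sum_{t=1}^{s} \beta^{\,t}\, r_i(t)}{\sum_{t=1}^{s} \beta^{\,t}},
$$

a discounted running average maximized over horizons, and the interchange argument just sketched shows that operating the machine with the larger index first can only increase the total discounted reward.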
Bounds and approximations to the Gittins index have been developed by
Brezzi and Lai (2000a,b) and Chang and Lai (1987). In particular, Brezzi and
Lai (2000a) showed that

where and denote the mean and standard deviation of re-
spectively. Making use of these bounds, Brezzi and Lai (2000a) gave a simple
proof of the incompleteness of learning from endogenous data by an optimizing
economic agent. Specifically, they showed that with positive probability the in-
dex rule uses the optimal action only finitely often and that it can estimate consis-
tently only one of the unknown parameters, generalizing Rothschild’s (1974) “incomplete learning
theorem” for Bernoulli two-armed bandits. Moreover, the Gittins index can be
written as an upper confidence bound of the form
where is a nonnegative function of and When the are normal with
mean and variance 1 and the prior distribution is normal, Chang and
Lai (1987) showed that can be expressed as
where

as and
There is also a similar asymptotic theory of the finite-horizon bandit problem
in which the agent’s objective is to maximize the total reward

where is a prior distribution of the vector Even when the
are independent under (so that is a product of marginal distributions as in
(3)), the optimal rule that maximizes (6) does not reduce to an index rule. In
principle, one can use dynamic programming to maximize (6). In the case of
Bernoulli populations with independent Beta priors for their parameters,
Fabius and van Zwet (1970) and Berry (1972) studied the dynamic programming
equations analytically and obtained several qualitative results concerning the
optimal rule. Lai (1987) showed that although index-type rules do not provide
exact solutions to the optimization problem (6), they are asymptotically optimal
as and have nearly optimal performance from both the Bayesian and
frequentist viewpoints for moderate and small values of N.
The starting point in Lai’s approximation to the optimal rule is to consider the
normal case. Suppose that an experimenter can choose at each stage
between sampling from two normal populations with known variance 1 such that
one has unknown mean and the other has known mean 0. Assuming a normal
prior distribution on the optimal rule that maximizes the expected
sum of N observations samples from the first population (with unknown mean)
until stage and then takes the remaining
observations from the second population (with known mean 0), where is
the posterior mean based on observations from the first population
and are positive constants that can be determined by backward induction.
Writing and treating as a
continuous variable, Lai (1987) approximates by where
is the posterior variance of and

The function is obtained by first evaluating numerically the boundary of the
corresponding optimal stopping problem for the Brownian motion and then
developing some simple closed-form approximation to the boundary. Although
it differs from the function in (5) because of the difference between the finite-
horizon criterion and the discounted criterion, note that
as Brezzi and Lai (2000b) recently developed a similar closed-form
approximation to by computing the optimal stopping boundary for Brownian
motion in the discounted case.
More generally, without assuming a prior distribution on the unknown pa-
rameters, suppose are independent random variables from a one-
parameter exponential family with density function
with respect to some dominating measure. Then is in-
creasing in since and the Kullback-Leibler information
number is
Let be the maximum likelihood estimate of based on the data. Lai
(1987) considered an upper confidence bound for of the form
where

Note that is the generalized likelihood ratio statistic for testing
so the above upper confidence bound is tantamount to the usual construction
of confidence limits by inverting an equivalent test. Lai (1987) showed that this
upper confidence bound rule is uniformly good and attains the lower bound (2)
not only at fixed as (so that the rule is asymptotically
optimal from the frequentist viewpoint), but also uniformly over a wide range
of parameter configurations, which can be integrated to show that the rule is
asymptotically Bayes with respect to a large class of prior distributions for
There is also an analogous asymptotic theory for the discounted
multi-armed bandit problem as as shown by Chang and Lai (1987).
The construction of asymptotically efficient adaptive allocation rules that attain
(2) at fixed in Lai and Robbins (1985) uses similar upper confidence bounds
which, unlike (7), do not involve the horizon N and for which the asymptotics
in (2) is not uniform over
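In the Bernoulli case, an upper confidence bound of the form (7) can be computed by inverting the Kullback-Leibler information number numerically. A minimal sketch, assuming Bernoulli populations and taking log(N/n) as an illustrative stand-in for the boundary function used by Lai (1987):

```python
import math

def kl_bernoulli(p, q):
    """Kullback-Leibler information number I(p, q) for Bernoulli laws."""
    p = min(max(p, 1e-12), 1 - 1e-12)
    q = min(max(q, 1e-12), 1 - 1e-12)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def ucb(p_hat, n, N):
    """Largest q >= p_hat with n * I(p_hat, q) <= log(N/n), found by bisection."""
    level = math.log(N / n) / n
    lo, hi = p_hat, 1.0
    for _ in range(50):            # q -> I(p_hat, q) is increasing on [p_hat, 1)
        mid = (lo + hi) / 2
        if kl_bernoulli(p_hat, mid) <= level:
            lo = mid
        else:
            hi = mid
    return lo

# at each stage, sample from the arm with the largest upper confidence bound
print(ucb(0.4, 25, 1000))
```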

2.2. A HYPOTHESIS TESTING APPROACH AND BLOCK EXPERIMENTATION
When switching costs are present, even the discounted multi-armed bandit
problem does not have an optimal solution in the form of an index-type rule, as
shown by Banks and Sundaram (1994). At any stage one has a greater propen-
sity to stick to the current arm instead of switching to the arm with the largest
index and incurring a switching cost. Although approximating an index by an
upper confidence bound that incorporates parameter uncertainty is no longer
applicable, we can re-interpret confidence bounds as hypothesis tests (as ex-
plained in the sentence following (7)) and modify the preceding rules in the
presence of switching costs by using hypothesis testing to decide which popu-
lation to sample from. Brezzi and Lai (2000b) recently used this approach to
construct a class of “block experimentation” rules in which active experimenta-
tion with an apparently inferior population is carried out in blocks. Specifically
consider statistical populations such that has density function
with respect to some common dominating measure
for Let be a positive integer divisible by and partition time
into frames such that the length of the frame is for and is
for The frame is further subdivided into blocks of equal length so
that refers to the block in frame Let be a random
permutation of (i.e., all permutations are equally likely). The
block in the first frame is devoted to sampling from For the
frame denote the population with the largest sample mean among
all populations not yet eliminated at the end of the st frame by Let
denote the number of such populations and let Let denote the
population with the largest sample mean among all populations not yet elimi-
nated at the end of the block where the end of block means the
end of frame Let denote successive observations from
and be the sample mean based on For the block which
will be denoted by (with ), we sample from until stage

where is defined as the largest number in if the set in (8) is empty, and
is the generalized likelihood ratio (GLR) statistic for testing
based on and is given by (9)


below. If the set in (8) is non-empty, eliminate (or ) from further sampling
if and the remaining observations in the block
are sampled from (or ) that is not eliminated. For the block
is devoted to sampling from the population with the largest sample mean
among all populations not yet eliminated at the end of block
If for some integer J, the preceding definition of the “block ex-
perimentation” rule applies to all J frames. If we mod-
ify the definition of the Jth frame by proceeding as before until the Nth
observation. The GLR statistic for testing based on
is

with noting that the function is con-


tinuous and increasing and therefore has an inverse. By choosing to be of
order with Brezzi and Lai (2000b) showed that such block
experimentation rules attain the asymptotic lower bound (2) for the regret while
the expected number of switches converges (as ) to at any
fixed with a unique component that attains
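For unit-variance normal populations, the GLR statistic for testing equality of two means has a simple closed form, and the elimination step can be sketched as follows; the threshold below is an arbitrary illustrative choice, not the boundary of Brezzi and Lai (2000b).

```python
import math

def glr_normal(mean1, n1, mean2, n2):
    """Log GLR statistic for H0: mu1 = mu2, unit-variance normal populations."""
    return (n1 * n2 / (n1 + n2)) * (mean1 - mean2) ** 2 / 2.0

def eliminate(mean_leader, n_leader, mean_other, n_other, threshold):
    """Eliminate the trailing population once the GLR evidence is strong enough."""
    return (mean_other < mean_leader and
            glr_normal(mean_leader, n_leader, mean_other, n_other) > threshold)

print(eliminate(1.0, 200, 0.2, 50, threshold=math.log(100)))
```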
Similar hypothesis testing and block experimentation ideas were used by
Lai and Yakowitz (1995) in their development of nonparametric bandit theory
which was initiated by Yakowitz and Lowe (1991). In this theory the successive
observations from different arms may be dependent and no parametric model for
the underlying stochastic sequences generating these observations is prescribed.
To begin with, suppose that there are stochastic sequences
such that and exists and is finite, where
for Let and define
the optimal index set Assuming polynomial bounds of the
order on and exponential bounds
on the left tails of for Lai and Yakowitz (1995) showed
how a sequential allocation rule can be constructed for choosing among so
that the regret

is of the order O(log N) , where is the number of observations from


that have been taken up to stage N. Their basic idea is to replace the para-
metric likelihood-based upper confidence bounds described in Section 2.1 by
nonparametric sample-mean-based tests. They also extended the construction
to the case where there are countably infinitely many stochastic sequences
whose means have a finite maximum such that Given
any nondecreasing sequence of positive numbers such that and


as they showed how can be incorporated into the
allocation rule so that the regret satisfies Not only do
these results improve and generalize earlier work on nonparametric bandits by
Yakowitz and Lowe (1991), but they can also be extended to controlled Markov
chains, as will be shown in Section 3.2.
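The flavor of these sample-mean-based rules can be conveyed by a minimal sketch, an illustration of the idea rather than the precise rule of Lai and Yakowitz (1995): follow the arm with the largest sample mean, but force an observation from any arm whose sample size has fallen below a logarithmic threshold, so that an apparently inferior arm is still tested on the order of log N times.

```python
import math, random

def nonparametric_bandit(arms, N, c=2.0):
    """arms: list of zero-argument callables, each returning one observation."""
    k = len(arms)
    counts = [0] * k
    sums = [0.0] * k
    for t in range(1, N + 1):
        # force exploration of any undersampled arm; otherwise follow the leader
        under = [i for i in range(k) if counts[i] < c * math.log(t + 1)]
        if under:
            i = random.choice(under)
        else:
            i = max(range(k), key=lambda j: sums[j] / counts[j])
        sums[i] += arms[i]()
        counts[i] += 1
    return counts

print(nonparametric_bandit(
    [lambda: random.gauss(0.0, 1), lambda: random.gauss(0.5, 1)], 5000))
```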

2.3. APPLICATIONS TO MACHINE LEARNING, CONTROL AND SCHEDULING OF QUEUES
Yakowitz (1989), Yakowitz and Lugosi (1990), Yakowitz and Kollier (1992),
Yakowitz and Mai (1995) and Kaelbling, Littman and Moore (1996) have given
various applications of bandit theory to machine learning. These applications
illustrate the usefulness of Yakowitz’s “black-box” (nonparametric) approach
to stochastic optimization, with which one can deal with processes of unknown
structure while still securing long-run average cost optimality, with
regret rates only slightly larger than the optimal rates for parametric problems.
Although it is not applicable to queuing control for which optimal policies
are not of index type, Gittins’ discounted bandit theory has been applied to
determine optimal scheduling policies for queuing networks, which turn out
to be of index type. A major step in this direction was undertaken by Whittle
(1981) who introduced the notion of “open bandit processes”, in which new
projects (arms) are continually appearing. Whittle used a dynamic program-
ming equation to define the Gittins index of a project of type in state
assuming Markovian dynamics for the evolution of a project. He showed that
the optimal policy that maximizes the infinite-horizon discounted reward is to
work at each time on an available project that has the largest of those Gittins
indices which exceed the index of a project constantly left idle (at the
“no-further-action” state ) and to remain idle when no such project is avail-
able. Lai and Ying (1988) showed that under certain stability assumptions the
open bandit problem is asymptotically equivalent to a closed bandit problem
in which there is no arrival of new projects, as the discount factor approaches
1. Using this result, they showed that Klimov’s (1974, 1978) priority indices
for scheduling queuing networks are limits of Gittins indices for the associated
closed bandit problem and extended Klimov’s priority indices to preemptive
policies and to general queuing systems.

3. ADAPTIVE CONTROL OF MARKOV CHAINS


To design control rules for a stochastic system whose dynamics depends on
certain unknown parameters, the “certainty equivalence” approach first finds the
optimal (or asymptotically optimal) control rule when the system parameters
are known and then replaces the parameter values in this control rule by their
sample estimates at every stage. It tries to mimic the optimal rule (assuming
known system parameters) by updating the parameter estimates based on all the
available data. It is particularly attractive when the optimal (or asymptotically
optimal) control scheme assuming known system parameters has a simple re-
cursive form that can be implemented in real time and when there are real-time
recursive algorithms for updating the parameter estimates. Such is the case with
stationary control of Markov chains to maximize the long-run average reward.

3.1. PARAMETRIC ADAPTIVE CONTROL


Mandl (1974) studied such certainty equivalence control rules in finite-state
Markov chains whose transition functions depend on the action
chosen from a finite control set and an unknown parameter belonging to
a compact metric space Let denote the state space. The objective is to
choose a stationary control law that maximizes the long-run average
reward

where represents the one-step reward at state when action is used and
is the stationary distribution (which is assumed to exist) of
Since and are finite, the set of stationary control laws is
finite, which will be denoted by If were known, then one would
use the stationary control law such that

In ignorance of a certainty equivalence control rule uses the control law


at time where is an estimate of based on
Mandl (1974) chose to be the minimum contrast estimate and showed that
converges a.s. to under a restrictive identifiability condition and some other
regularity conditions. Borkar and Varaiya (1979) removed this identifiability
condition and showed that when is finite, the maximum likelihood estimate
converges a.s. to a random variable such that
for all They also gave an example for which
with positive probability, showing that the certainty equivalence rule
eventually uses with positive probability only the suboptimal stationary control
law to the exclusion of other control laws because of premature conver-
gence of the parameter estimates to a wrong parameter value.
In view of this difficulty with the certainty equivalence rule, various modifi-
cations of the rule have been proposed, including (i) forced choice schemes that
reserve some sparse set of times for experimentation with all stationary control
laws in (ii) randomization schemes that select according to a probability
distribution, depending on past data, that assigns positive probability to every
and (iii) using penalized (cost-biased) maximum likelihood estimators;
see Kumar’s (1985) survey of these methods. Motivated by the lower bound
(2) for the multi-armed bandit problem, which is a special case of controlled
i.i.d. processes, Agrawal, Teneketzis and Anantharam (1989a) developed a
similar lower bound for a controlled independent sequence by making use of
the finiteness of and introducing a finite set of “bad” parameter values
associated with Whereas in the multi-armed bandit problem
has component parameterizing an individual arm parameterizes all the
arms in a controlled independent sequence so that rewriting (2) in terms
of the “bad” set provides the main clue in extending (2). When the state
space the control set and the parameter space are all finite, Agrawal,
Teneketzis and Anantharam (1989b) developed a “translation scheme” which
together with the construction of an “extended probability space” enabled them
to extend the lower bound (2) further to controlled Markov chains by converting
it to a form similar to that for controlled i.i.d. processes.
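A minimal sketch of a certainty equivalence rule with forced choices for a small controlled Markov chain follows; the model, the Laplace-smoothed estimator, and the geometric forced-choice schedule are all illustrative assumptions rather than the schemes of the papers cited above.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
S, A = 3, 2
P = rng.dirichlet(np.ones(S), size=(S, A))   # true transition law p(.|x,a), unknown to the agent
r = rng.random((S, A))                       # one-step rewards r(x, a)
laws = list(itertools.product(range(A), repeat=S))  # all stationary control laws f: state -> action

def avg_reward(Phat, f):
    """Long-run average reward of law f under kernel Phat, via its stationary distribution."""
    Pf = np.array([Phat[x, f[x]] for x in range(S)])
    evals, evecs = np.linalg.eig(Pf.T)
    pi = np.real(evecs[:, np.argmax(np.real(evals))])
    pi = np.abs(pi) / np.abs(pi).sum()
    return float(sum(pi[x] * r[x, f[x]] for x in range(S)))

counts = np.ones((S, A, S))                  # Laplace-smoothed transition counts
forced = {2 ** k for k in range(1, 13)}      # sparse times reserved for experimentation
x = 0
for t in range(1, 3001):
    Phat = counts / counts.sum(axis=2, keepdims=True)
    if t in forced:
        f = laws[rng.integers(len(laws))]    # forced choice of a random law
    else:
        f = max(laws, key=lambda g: avg_reward(Phat, g))  # certainty equivalence
    a = f[x]
    y = rng.choice(S, p=P[x, a])
    counts[x, a, y] += 1
    x = y
```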
Graves and Lai (1997) removed the finiteness assumptions on and
and used another approach involving change of measures to extend the lower
bound (2) to controlled Markov chains when the set of stationary control laws
is finite. Define and by (11) and (12) and the regret of an adaptive
control rule by

as in (10). Assume no switching cost for switching among the optimal stationary
control laws that attain the maximum in (12) and a positive switching cost
for each switch from one to another where and are not
both optimal. Let be the cumulative switching cost of an adaptive
control rule up to stage N. An adaptive control rule is said to be “uniformly
good” if and for every Graves
and Lai (1997) showed that for uniformly good rules,

where is defined below after a few other definitions. First the analogue of
the Kullback-Leibler information number in (2) now takes the form

which will be assumed to be finite for all and which assumes the
transition probabilities to be absolutely continuous (having density functions
) with respect to a measure on Next the finiteness of enables us to
decompose as the union of subsets: where

i.e., is an optimal stationary control law if For let

Thus, is the set of all optimal stationary control laws when


is the true parameter value, and consists of all “bad” parameter values
which are statistically indistinguishable from if one only uses
the optimal control laws because Define as

Using sequential likelihood ratio tests and block experimentation ideas (similar
to those described in Section 2.2) to introduce “uncertainty adjustments” into
the certainty equivalence rule, Graves and Lai (1997) constructed uniformly
good adaptive control rules that attain the asymptotic lower bound (14).

3.2. NONPARAMETRIC ADAPTIVE CONTROL


Without assuming a parametric family of transition densities
Lai and Yakowitz (1995) consider a controlled Markov chain on state space
with control set and transition probability function
where is a of subsets of Let represent the
one-step reward at time For a stationary control law we shall use
to denote the transition probability function
of the controlled Markov chain under the control law and to denote the
conditional probability measure of this chain starting at state Let be a
countable (possibly infinite) set of stationary control laws such that

exists for every and and such that there is a maximum value
of over For a control rule that chooses adaptively some stationary
control law in to use at every stage, its regret is defined by

where is the number of times that the control rule uses the stationary
control law up to stage N and denotes expectation under the probability
measure of the controlled Markov chain starting at and using control
rule Since the state in a controlled Markov chain is governed by the
preceding state irrespective of which control law is used at time
it is important to adhere to the same control law (“arm”) over a block of times
(“block experimentation”), instead of switching freely among different arms as
in conventional multi-armed bandits.
Let Take and let Partition
into blocks of consecutive integers,
each block having length except possibly the last one whose length may range
from to Label these blocks as so that the block
begins at stage The basic idea here is to try out the
first stationary control laws for the stages from to Specifically,
for if with use stationary
control law for the entire block of stages if

and use for all the stages in the block otherwise. In (12),
denotes the number of times is used up to stage

and is so chosen that

Let be any nondecreasing sequence of positive numbers such that


and as Lai and Yakowitz (1995) showed how and
can be chosen in (16) and (17) so that the regret defined by (15) satisfies
for every
Yakowitz, Jayawardena and Li (1992) extended the nonparametric bandit
theory of Yakowitz and Lowe (1991) to the case of a nondenumerable set of
arms. Instead of the regret (10), they introduced the “learning loss”

under the assumption that is a metric space with the Borel For
controlled Markov chains with a nondenumerable set of stationary control
laws, Lai and Yakowitz (1995) showed how to construct an adaptive control rule
with for any and such that
by sampling control laws from for block experimentation. The basic
underlying idea is to sample independently from according to some
underlying idea is to sample independently from according to some
probability distribution such that for every open ball B. This yields
a countable set of stationary control laws for which the same
strategy as in the preceding paragraph can be applied. Lai and Yakowitz (1995)
applied this control rule to the problem of adaptively adjusting a service rate
parameter in an M/M/1 queue with finite buffer, on the basis of observed times
spent in the queuing system and the service costs, to minimize the long-run
average cost. Specifically, suppose the cost for the item serviced is

where is the time spent in the queuing system by the job, is the
service time for that job, and is the service rate in effect during that service
time, with A being a parameter of the problem. The decision maker need not
know in advance the arrival distribution, or how costs depend on service time,
or even how the service rate being adjusted is related to service time. One
desires a strategy to minimize the average job cost. Making use of the control
rule described earlier in this paragraph, Lai and Yakowitz (1995) showed how
decision functions can be chosen adaptively from a space of functions
mapping the number of jobs in the system into a prescribed interval
of service rates, so that the average control costs converge, as to the
optimal performance level the expectation being with
respect to the invariant measure induced by the decision function

4. STOCHASTIC APPROXIMATION
Consider the regression model

    $y_n = M(x_n) + \varepsilon_n$,    (19)

where $y_n$ denotes the response at the design level $x_n$, $M$ is an unknown regression
function, and $\varepsilon_n$ represents unobservable noise. In the deterministic case (where
$\varepsilon_n = 0$ for all $n$), Newton's method for finding the root $\theta$ of a smooth function
$M$ is a sequential scheme defined by the recursion

    $x_{n+1} = x_n - y_n / M'(x_n)$.    (20)

When the random disturbances $\varepsilon_n$ are present, using Newton's method (20)
entails that

    $x_{n+1} - x_n = -\{M(x_n) + \varepsilon_n\} / M'(x_n)$.    (21)

Hence, if $x_n$ should converge to $\theta$, so that $x_{n+1} - x_n \to 0$ and $M(x_n) \to M(\theta) = 0$
(assuming $M$ to be smooth and to have a unique root $\theta$ such that $M'(\theta) > 0$),
then (21) implies that $\varepsilon_n \to 0$, which is not possible for many kinds of random
disturbances (e.g., when the $\varepsilon_n$ are independent and identically distributed
(i.i.d.) with mean 0 and variance $\sigma^2 > 0$). To dampen the effect of the errors $\varepsilon_n$,
Robbins and Monro (1951) replaced $1/M'(x_n)$ in (20) by constants $a_n$ that
converge to 0. Specifically, assuming that

    $(x - \theta)\, M(x) > 0$ for all $x \neq \theta$,    (22)

the Robbins-Monro scheme is defined by the recursion

    $x_{n+1} = x_n - a_n y_n$,    (23)

where $a_n$ are positive constants such that

    $\sum_{n=1}^{\infty} a_n = \infty$, $\qquad \sum_{n=1}^{\infty} a_n^2 < \infty$.    (24)

Noting that maximization of a smooth unimodal regression function $M$
is equivalent to solving the equation $M'(x) = 0$, Kiefer and Wolfowitz
(1952) proposed the following recursive maximization scheme

    $x_{n+1} = x_n + a_n (y_n' - y_n'') / (2 c_n)$,    (25)

where at the $n$th stage observations $y_n'$ and $y_n''$ are taken at the design levels
$x_n + c_n$ and $x_n - c_n$, respectively, and $a_n$, $c_n$ are positive constants
such that

    $c_n \to 0$, $\quad \sum a_n = \infty$, $\quad \sum a_n c_n < \infty$, $\quad \sum a_n^2 c_n^{-2} < \infty$,    (26)

and $(y_n' - y_n'')/(2 c_n)$ is an estimate of $M'(x_n)$.
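Both recursions are easy to simulate. In the sketch below, the regression functions, the choices $a_n = 1/n$ and $c_n = n^{-1/4}$ (which satisfy (24) and (26)), and the Gaussian noise are illustrative assumptions.

```python
import random

def rm_root(M, x0, n_steps):
    """Robbins-Monro: x_{n+1} = x_n - a_n * y_n with a_n = 1/n, y_n = M(x_n) + noise."""
    x = x0
    for n in range(1, n_steps + 1):
        y = M(x) + random.gauss(0.0, 1.0)
        x -= y / n
    return x

def kw_max(M, x0, n_steps):
    """Kiefer-Wolfowitz: step along a symmetric finite-difference gradient estimate."""
    x = x0
    for n in range(1, n_steps + 1):
        c = n ** -0.25
        y_plus = M(x + c) + random.gauss(0.0, 1.0)
        y_minus = M(x - c) + random.gauss(0.0, 1.0)
        x += (1.0 / n) * (y_plus - y_minus) / (2.0 * c)
    return x

print(rm_root(lambda x: 2.0 * (x - 1.0), 0.0, 100000))   # root at 1
print(kw_max(lambda x: -(x - 1.0) ** 2, 0.0, 100000))    # maximum at 1
```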
Beginning with the seminal papers of Robbins and Monro (RM) and Kiefer
and Wolfowitz (KW), there is a vast literature on stochastic approximation
schemes of the type (23) and (25). In particular, Blum (1954) proved almost
sure (a.s.) convergence of the RM and KW schemes under certain conditions
on $M$ and the $\varepsilon_n$. For the case of i.i.d. $\varepsilon_n$ with mean 0 and variance $\sigma^2$, Sacks (1958)
showed that an asymptotically optimal choice of $a_n$ in the RM scheme (23) is
$a_n = 1/(n\beta)$, for which $\sqrt{n}(x_n - \theta)$ has a limiting normal distribution with
mean 0 and variance $\sigma^2/\beta^2$, assuming that $\beta := M'(\theta) > 0$. This led Lai and
Robbins (1979) to develop adaptive stochastic approximation schemes of the
form

    $x_{n+1} = x_n - y_n / (n b_n)$,    (27)

in which $b_n$ is an estimate of $\beta$ based on the current and past observations. Not-
ing that the inputs should be set at $\theta$ if it were known, Lai and Robbins (1979)
defined the "regret" of an adaptive design to be $\sum_{i=1}^{n} (x_i - \theta)^2$. They showed
that it is possible to have both asymptotically minimal regret and efficient fi-
nal estimate, i.e., $\sum_{i=1}^{n} (x_i - \theta)^2 \sim (\sigma^2/\beta^2) \log n$ a.s. and $\sqrt{n}(x_n - \theta)$ has a
limiting normal distribution with mean 0 and variance $\sigma^2/\beta^2$ as $n \to \infty$, by
using a modified least squares estimate $b_n$ in (27). Asymptotic normality of the
KW scheme (25) has also been established by Sacks (1958). However, instead
of the usual $n^{-1/2}$ rate, one has the $n^{-1/3}$ rate for the choices $a_n = a/n$ and
$c_n = c\, n^{-1/6}$, assuming M to be three times continuously differentiable in some
neighborhood of $\theta$. The reason for the slower rate is that the estimate of $M'(x_n)$
has a bias of the order $c_n^2$ when $M'''$ is bounded near $\theta$. This slower rate is
common to nonparametric regression and density estimation problems, where
it is known that the rate of convergence can be improved by making use of addi-
tional smoothness of M. Fabian (1967, 1971) showed how to redefine the estimate of $M'(x_n)$ in
(25) when M is continuously differentiable to higher order in some neighborhood
of $\theta$, so that, for even integers $2s$ and $c_n = c\, n^{-1/(4s+2)}$, $n^{s/(2s+1)}(x_n - \theta)$ has
a limiting normal distribution.
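A sketch of an adaptive scheme of the form (27) follows, with $b_n$ a least squares slope estimate truncated to a fixed interval; the truncation bounds are an illustrative stabilization device, not the modification used by Lai and Robbins (1979).

```python
import random

def adaptive_rm(M, x0, n_steps, b_min=0.1, b_max=10.0):
    """x_{n+1} = x_n - y_n / (n * b_n), with b_n a truncated LS slope estimate."""
    xs, ys = [], []
    x = x0
    for n in range(1, n_steps + 1):
        y = M(x) + random.gauss(0.0, 1.0)
        xs.append(x)
        ys.append(y)
        xbar = sum(xs) / n
        ybar = sum(ys) / n
        sxx = sum((u - xbar) ** 2 for u in xs)
        b = (sum((u - xbar) * (v - ybar) for u, v in zip(xs, ys)) / sxx
             if sxx > 0 else b_min)
        b = min(max(b, b_min), b_max)      # truncate to keep the steps stable
        x -= y / (n * b)
    return x

print(adaptive_rm(lambda x: 3.0 * (x - 1.0), 0.0, 20000))  # root at 1, slope 3
```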
In control engineering, stochastic approximation (SA) procedures are usu-
ally applied to dynamical systems. Besides the dynamics in the SA recursion,
the dynamics of the underlying stochastic system also plays a basic role in the
convergence analysis. Ljung (1977) developed the so-called ODE method that
has been widely used in such convergence analysis in the engineering literature;
it studies the convergence of SA or other recursive algorithms in stochastic dy-
namical systems via the stability analysis of an associated ordinary differential
equation (ODE) that defines the “asymptotic paths” of the recursive scheme;
see Kushner and Clark (1978) and Benveniste, Metivier and Priouret (1987).
Moreover, a wide variety of KW-type algorithms have been developed for con-
strained or unconstrained optimization of objective functions on-line in the
presence of noise. For Spall (1992) introduced “simultaneous
perturbation” SA schemes that take only 2 (instead of measurements to
estimate a smoothed gradient approximation to at every stage; see also
Spall and Cristion (1994). For other recent developments of SA, see Kushner
and Yin (1997).
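The simultaneous perturbation idea can be sketched as follows: whatever the dimension p, each iteration uses only two noisy measurements of the objective, taken at points perturbed along a random Rademacher direction. The gain sequences below are illustrative choices, not those recommended by Spall (1992).

```python
import numpy as np

rng = np.random.default_rng(1)

def spsa_min(L, x0, n_steps):
    """Minimize L with two noisy measurements per step (simultaneous perturbation)."""
    x = np.asarray(x0, dtype=float)
    for n in range(1, n_steps + 1):
        a, c = 0.1 / n, n ** -0.25
        delta = rng.choice([-1.0, 1.0], size=x.shape)    # Rademacher directions
        y_plus = L(x + c * delta) + rng.normal()
        y_minus = L(x - c * delta) + rng.normal()
        # elementwise division by delta gives the simultaneous perturbation
        # gradient estimate from the same two measurements for every coordinate
        x -= a * (y_plus - y_minus) / (2.0 * c * delta)
    return x

print(spsa_min(lambda z: float(np.sum((z - 1.0) ** 2)), np.zeros(5), 50000))
```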
The ODE approach to analyze SA procedures usually assumes that the as-
sociated ODE is initialized in the domain of attraction of an equilibrium point.
Moreover, the theory on the convergence rate of under conditions such as
unimodality and smoothness refers only to so large that lies in a suffi-
ciently small neighborhood of the limit where the regression function can
be approximated by a local polynomial. The need for good starting values to
initialize SA procedures is also widely recognized in practice. By synthesiz-
ing constrained KW search with nonparametric bandit theory, Yakowitz (1993)
developed the following multistart procedure that converges to the global max-
imum of where is a convex subset of even though M may
have multiple local maxima and minima.
Let and let be a probability measure on that has a positive
continuous density function with respect to Lebesgue measure. Let
and The set represents times at which new starting
values are generated. Specifically, for choose a new starting value
at random according to the probability distribution Let be a hypercube
centered at with sides of length and observe a response
at design level which will be used to initialize a KW scheme constrained
(by projection) inside For carry out a constrained KW step for each
hypercube (“bandit arm”) whose sample mean based on all the observed
inside the hypercube up to stage differs from the largest sample mean by
no more than or whose sample size is smaller than The basic idea
is to occasionally generate new regions for local exploration using a constrained
KW procedure. Letting denote the KW design level at stage inside the
hypercube with the largest sample mean among all hypercubes whose sample
sizes exceed Yakowitz (1993) showed that has a limiting
normal distribution under certain regularity conditions, where is the global
maximum of M (which may have many local maxima) over
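An illustrative rendering of the multistart idea follows, with deliberately simplified bookkeeping; the restart schedule, the comparison tolerance, and the minimum sample size are assumptions, not the constants of Yakowitz (1993).

```python
import math, random

def noisy(M, x):
    return M(x) + random.gauss(0.0, 0.3)

def multistart_kw(M, lo, hi, n_steps):
    """Occasionally start a new local search; run KW steps only on promising starts."""
    arms = []   # each arm: current point, running sum of responses, sample count
    for t in range(1, n_steps + 1):
        if t == 1 or t == int(math.exp(len(arms))):      # sparse restart times
            arms.append({"x": random.uniform(lo, hi), "sum": 0.0, "n": 0})
        means = [a["sum"] / a["n"] for a in arms if a["n"] > 0]
        best = max(means) if means else 0.0
        for a in arms:
            mean = a["sum"] / a["n"] if a["n"] else float("inf")
            if a["n"] < 5 or mean >= best - 0.2:         # near-best or young arms
                n = a["n"] + 1
                c = n ** -0.25
                yp, ym = noisy(M, a["x"] + c), noisy(M, a["x"] - c)
                step = (yp - ym) / (2 * c * n)
                a["x"] = min(max(a["x"] + step, lo), hi)  # projected KW step
                a["sum"] += (yp + ym) / 2
                a["n"] = n
    return max(arms, key=lambda a: a["sum"] / a["n"] if a["n"] else -1e9)["x"]

# M has two local maxima; the global maximum is near x = 2
print(multistart_kw(lambda x: -0.1 * (x + 1) ** 2 + math.exp(-4 * (x - 2) ** 2),
                    -4, 4, 3000))
```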

REFERENCES
Agrawal, R., D. Teneketzis and V. Anantharam. (1989a). Asymptotically ef-
ficient adaptive allocation schemes for controlled I.I.D. processes: Finite
parameter space. IEEE Trans. Automat. Contr. 34, 258-267.
Agrawal, R., D. Teneketzis and V. Anantharam. (1989b). Asymptotically effi-
cient adaptive allocation schemes for controlled Markov chains: Finite pa-
rameter space. IEEE Trans. Automat. Contr. 34, 1249-1259.
Anantharam, V., P. Varaiya and J. Walrand. (1987). Asymptotically efficient
allocation rules for multiarmed bandit problems with multiple plays. Part II:
Markovian rewards. IEEE Trans. Automat. Contr. 32, 975-982.
Banks, J. S. and R.K. Sundaram. (1992). Denumerable-armed bandits. Econo-
metrica 60, 1071-1096.
Banks, J. S. and R.K. Sundaram. (1994). Switching costs and the Gittins index.
Econometrica 62, 687-694.
Benveniste, A., M. Metivier, and P. Priouret. (1987). Adaptive Algorithms and
Stochastic Approximations. Springer-Verlag, New York.
Berry, D. A. (1972). A Bernoulli two-armed bandit. Ann. Math. Statist. 43,
871-897.
Blum, J. (1954). Approximation methods which converge with probability one.
Ann. Math. Statist. 25, 382-386.
Borkar, V. and P. Varaiya. (1979). Adaptive control of Markov chains. I: Finite
parameter set. IEEE Trans. Automat. Contr. 24, 953-958.
Brezzi, M. and T.L. Lai. (2000a). Incomplete learning from endogenous data
in dynamic allocation. Econometrica 68, 1511-1516.
Brezzi, M. and T.L. Lai. (2000b). Optimal learning and experimentation in
bandit problems. To appear in J. Economic Dynamics & Control.
Chang, F. and T.L. Lai. (1987). Optimal stopping and dynamic allocation. Adv.
Appl. Probab. 19, 829-853.
Chernoff, H. (1967). Sequential models for clinical trials. Proc. Fifth Berkeley
Symp. Math. Statist. & Probab. 4, 805-812. Univ. California Press.
Fabian, V. (1967). Stochastic approximation of minima with improved asymp-
totic speed. Ann. Math. Statist. 38, 191-200.
Fabian, V. (1971). Stochastic approximation. In Optimizing Methods in Statis-
tics (J. Rustagi, ed.), 439-470. Academic Press, New York.
Fabius, J. and W.R. van Zwet. (1970). Some remarks on the two-armed bandit.
Ann. Math. Statist. 41, 1906-1916.
Gittins, J. C. (1979). Bandit processes and dynamic allocation indices (with
discussion). J. Roy. Statist. Soc. Ser. B 41, 148-177.
Gittins, J.C. (1989). Multi-Armed Bandit Allocation Indices. Wiley, New York.
Gittins, J.C. and D.M. Jones. (1974). A dynamic allocation index for the se-
quential design of experiments. In Progress in Statistics (J. Gani et al., ed.),
241-266. North Holland, Amsterdam.
Graves, T. L. and T.L. Lai. (1997). Asymptotically efficient adaptive choice
of control laws in controlled Markov chains. SIAM J. Contr. Optimiz. 35,
715-743.
Kaelbling, L.P., M.L. Littman and A.W. Moore. (1996). Reinforcement learning:
A survey. J. Artificial Intelligence Res. 4, 237-285.
Kiefer, J. and J. Wolfowitz. (1952). Stochastic estimation of the maximum of a
regression function. Ann. Math. Statist. 23, 462-466.
Kirkpatrick, S., C.D. Gelatt and M.P. Vecchi. (1983). Optimization by simulated
annealing. Science 220, 671-680.
Klimov, G.P. (1974, 1978). Time-sharing service systems I, II. Theory Probab.
Appl. 19, 532-551 and 23, 314-321.
Kumar, P.R. (1985). A survey of some results in stochastic adaptive control.
SIAM J. Contr. Optimiz. 23, 329-380.
Kushner, H.J. and D.S. Clark. (1978). Stochastic Approximation for Constrained
and Unconstrained Systems. Springer-Verlag, New York.
Kushner, H.J. and G. Yin. (1997). Stochastic Approximation Algorithms and
Applications. Springer-Verlag, New York.
Lai, T.L. (1987). Adaptive treatment allocation and the multi-armed bandit
problem. Ann. Statist. 15, 1091-1114.
Lai, T.L. and H. Robbins. (1979). Adaptive design and stochastic approxima-
tion. Ann. Statist. 7, 1196-1221.
Lai, T.L. and H. Robbins. (1985). Asymptotically efficient adaptive allocation
rules. Adv. Appl. Math. 6, 4-22.
Lai, T.L. and S. Yakowitz. (1995). Machine learning and nonparametric bandit
theory. IEEE Trans. Automat. Contr. 40, 1199-1209.
Lai, T.L. and Z. Ying. (1988). Open bandit processes and optimal scheduling
of queuing networks. Adv. Appl. Probab. 20, 447-472.
Ljung, L. (1977). Analysis of recursive stochastic algorithms. IEEE Trans. Au-
tomat. Contr. 22, 551-575.
Mandl, P. (1974). Estimation and control of Markov chains. Adv. Appl. Probab.
6, 40-60.
Mortensen, D. (1985). Job search and labor market analysis. Handbook of Labor
Economics 2, 849-919.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bull.
Amer. Math. Soc. 58, 527-535.
Robbins, H. and S. Monro. (1951). A stochastic approximation method. Ann.
Math. Statist. 22, 400-407.
Rothschild, M. (1974). A two-armed bandit theory of market pricing. J. Eco-
nomic Theory 9, 185-202.
Sacks, J. (1958). Asymptotic distribution of stochastic approximation proce-
dures. Ann. Math. Statist. 29, 375-405.
Spall, J.C. (1992). Multivariate stochastic approximation using a simultaneous
perturbation gradient approximation. IEEE Trans. Automat. Contr. 37, 332-
341.
Spall, J.C. and J.A. Cristion. (1994). Nonlinear adaptive control using neural
networks: Estimation with a smoothed form of simultaneous perturbation
gradient approximation. Statistica Sinica 4, 1-27.
Varaiya, P.P., J.C. Walrand and C. Buyukkoc. (1985). Extensions of the mul-
tiarmed bandit problem: the discounted case. IEEE Trans. Automat. Contr.
30, 426-439.
Whittle, P. (1981). Arm-acquiring bandits. Ann. Probab. 9, 284-292.
Yakowitz, S. (1989). A statistical foundation for machine learning, with appli-
cation to Go-moku. Computers & Math. 17, 1085-1102.
Yakowitz, S. (1993). A global stochastic approximation. SIAM J. Contr. Optimiz.
31, 30-40.
Yakowitz, S., J. Jayawardena and S. Li. (1992). Theory for automatic learning
under partially observed Markov-dependent noise. IEEE Trans. Automat.
Contr. 37, 1316-1324.
Yakowitz, S. and M. Kollier. (1992). Machine learning for blackjack counting
strategies. J. Statist. Planning & Inference 13, 295-309.
Yakowitz, S. and W. Lowe. (1991). Nonparametric bandit methods. Ann. Op-
erat. Res. 28, 297-312.
Yakowitz, S. and E. Lugosi. (1990). Random search in the presence of noise,
with application to machine learning. SIAM J. Scient. & Statist. Comput. 11,
702-712.
Yakowitz, S. and J. Mai. (1995). Methods and theory for off-line machine learn-
ing. IEEE Trans. Automat. Contr. 40, 161-165.
Chapter 4

EXACT ASYMPTOTICS FOR LARGE DEVIATION PROBABILITIES,
WITH APPLICATIONS

Iosif Pinelis
Department of Mathematical Sciences
Michigan Technological University
Houghton, Michigan 49931
ipinelis@math.mtu.edu
Keywords: large deviation probabilities, final zero crossing, last negative sum, last positive
sum, nonparametric bandit theory, subexponential distributions, superexponential
distributions, exponential probabilistic inequalities

Abstract Three related groups of problems are surveyed, all of which concern asymp-
totics of large deviation probabilities themselves – rather than the much more
commonly considered asymptotics of the logarithm of such probabilities. The
former kind of asymptotics is sometimes referred to as “exact”, while the latter as
“rough”. Obviously, “exact” asymptotics statements provide more information;
the tradeoff is that additional restrictions on regularity of underlying probability
distributions and/or on the corresponding zone of deviations are then required.

Most broadly, large deviation probabilities may be defined as any small
probabilities. Thus, large deviation probabilities will always arise in hypothesis
testing (say) when small error probabilities are desired. For example, consider
the “small deviation” probability where X is a continuous random
variable (r.v.) and is a small positive number; obviously, this probability is
small and can be rewritten as the large deviation probability where
and Of course, when a large deviation probability, say of
the form is considered, usually the r.v. Y is assumed to represent
a certain structure of interest; e.g., Y may be a linear or quadratic form in
independent r.v.’s.
Literature on large deviations is vast. Most of the results concern the so-called
large deviation principle (LDP), which deals with asymptotics of the logarithm
of the large deviation probability, ln (which is most often Gaussian-
like), rather than with asymptotics of the probability itself.
We write if and if Some authors
refer to asymptotics of the logarithm of the large deviation probability as "rough"
asymptotics, in contrast with "exact" asymptotics of the probability
itself. To appreciate the difference, note that such two different cases of "exact"
asymptotics as, e.g.,

for any correspond to the same “rough" asymptotics

A great deal of literature on "rough" asymptotics is reviewed and referenced
in the monograph by Dembo and Zeitouni (1998).
In this paper, we shall survey a few results concerning “exact" asymptotics.
Naturally, the volume of literature in this direction is much smaller. Yet, it still
exceeds by far our capabilities to do any kind of justice to them in this paper.
See another recent monograph, Vinogradov (1994).
A typical result in this direction is the following beautiful theorem due to
A. V. Nagaev (1971). Let $S_n = X_1 + \cdots + X_n$ be the partial sum of independent identically
distributed (i.i.d.) r.v.'s $X_1, X_2, \ldots$ with a common cumulative distribution
function (c.d.f.) $F$ such that $EX_1 = 0$ and $\operatorname{Var} X_1 = 1$,
and $1 - F(x) = x^{-\alpha} L(x)$, where $\alpha > 2$ and $L$ is slowly varying at $\infty$.
Then

    $P(S_n > x) \sim \big(1 - \Phi(x/\sqrt{n})\big) + n\,\big(1 - F(x)\big)$    (1)

whenever $n \to \infty$ and $x \geq \sqrt{n}$;
here, $\Phi$ is the c.d.f. of the standard normal law. The first summand,
$1 - \Phi(x/\sqrt{n})$, is roughly the probability of the set of the trajectories
$(S_1, \ldots, S_n)$ such that $S_n > x$ and all the jumps $X_i$ are relatively small.
The second summand, $n(1 - F(x))$, is roughly the probability of the
set of the trajectories such that $S_n > x$ and exactly one of the
jumps, say $X_i$, is close to $x$, while the input of all the rest, the $X_j$ with $j \neq i$, is
negligible. The first summand on the right-hand side of (1) dominates the sec-
ond one in the zone of "moderately" large deviations $x \leq \sqrt{c\,n \ln n}$ with $c < \alpha - 2$,
while the second summand dominates the first one in the zone of "very" large
deviations $x \geq \sqrt{c\,n \ln n}$, for every $c > \alpha - 2$.
If the exponent $\alpha \in (0, 2)$ then, assuming for simplicity the symmetry of the
distribution of the $X_i$, one has

    $P(S_n > x) \sim n\,\big(1 - F(x)\big)$    (2)

whenever $x\, n^{-1/\alpha} \to \infty$; see Heyde (1968). In the borderline case $\alpha = 2$,
the behavior of the large deviation probabilities depends on that of the slowly
varying function $L$; see Tkachuk (1974; 1975). Finally, for subexponential tails
lighter than powers, the nature of large deviations is more complicated; see
again Nagaev (1971).
Just as for (1), certain conditions of regularity of tails are required for “exact"
asymptotics in the entire zone of the large deviation probabilities in which rela-
tion (2) obtains. However, all standard subexponential statistical distributions
possess much more than the required regularity.
Let us now briefly describe the contents of this paper.
In Section 1, we present our joint results with Sid Yakowitz. There, “exact"
asymptotics of large deviation probabilities concerning the last negative sum of
i.i.d. r.v.’s with a positive mean are given, as well as related invariance principles
for conditional distributions in the space of trajectories of the cumulative sums.
Applications to nonparametric bandit theory are given here as well.
In Section 2, "exact" asymptotics of large deviation probabilities in the space
of trajectories of the cumulative sums are described, which generalize (1).
In Section 3, “exact" asymptotics of “very" large deviation probabilities in
general Banach spaces are described, which generalize (2).

1. LIMIT THEOREMS ON THE LAST NEGATIVE SUM AND APPLICATIONS
TO NONPARAMETRIC BANDIT THEORY
The results of this section stem from the line of inquiry begun by Robbins
(1952) and continued by Lai and Robbins (1985) and Yakowitz and Lowe
(1991). They are taken from Pinelis and Yakowitz (1994), unless specified
otherwise.
Let be a sequence of i.i.d. r.v.’s with and let
denote the c.d.f. of X. Let for
and let

By the strong law of large numbers, T is a proper random variable, as well as



Put

We assume that the distribution of X is either absolutely continuous or lattice.


The latter means that for some and the only values assumed by X are
the ones of the form without loss of generality, we may and
do assume that the maximum span of the lattice distribution is 1; that is, the
greatest such that for some is 1; in addition, it is
assumed that Then we can say that the distribution of X , in both cases,
is absolutely continuous with respect to the measure that is the Lebesgue one
in the “absolutely continuous" case and the counting measure in the maximum
span 1 lattice case. In both cases,

where is the density function; if is the counting measure, then


for every integer and (3) means in this case that

The notation as in (3) is convenient since it unifies both cases. Note that
analogues of the subsequent results hold when either the underlying distribution
is any lattice one or has a non-zero absolutely continuous component.
Now let stand for the density of which is defined, in both cases,
analogously to (3).
Let

The following assumption is crucial: there exist a constant and a
positive function such that

Theorem 1 If (4) holds, then

where

Moreover,

More explicit expressions for can be given for the case of exponentially or
superexponentially decreasing left tails; the expressions are especially simple
when X has a maximum span 1 lattice lower-semicontinuous distribution, i.e.,
when

note that this case corresponds, in the application to the learning problems
described in the next section, to the situation when, say, and
In particular, a yet simpler formula,

takes place when X can assume only the three values (-1), 0, 1 with probabil-
ities respectively, so that this will correspond
to the situation when and can assume values 0 or 1 only.
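For this three-valued case, the distribution of the last negative sum is easy to examine by simulation; a minimal sketch, with illustrative probabilities q, r, p for the values -1, 0, 1 and a finite truncation horizon standing in for infinity:

```python
import random

def last_negative_sum(q, r, p, horizon):
    """T = last n with S_n < 0 for i.i.d. steps in {-1, 0, 1}; requires p > q so EX > 0."""
    s, T = 0, 0
    for n in range(1, horizon + 1):
        u = random.random()
        s += -1 if u < q else (0 if u < q + r else 1)
        if s < 0:
            T = n
    return T   # with EX > 0, S_n < 0 happens finitely often, so truncation is harmless

samples = [last_negative_sum(0.2, 0.3, 0.5, 1000) for _ in range(5000)]
for n in (5, 10, 20):
    print(n, sum(t == n for t in samples) / len(samples))   # estimates P(T = n)
```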
As to the case when X has a maximum span 1 lattice upper-semicontinuous
distribution, i.e.,

one can obtain the following improvement of Theorem 1 itself:

where

The asymptotic result (5), provided by Theorem 1, depends only on condition
(4), which is related to the local limit theorem for large deviations of sums
of i.i.d. r.v.'s. Naturally, the question remains, "When is the restriction (4)
satisfied?"
Our conjecture is that (4) is always valid provided only that (i) EX > 0,
(ii) P(X < 0) > 0, and (iii) the density is bounded.
Moreover, we conjecture that for all

where

if these integrals both exist, otherwise. Note that


and always, so that is a well-defined non-positive
real number.
Consideration below shows that our argument is applicable to all superex-
ponentially decreasing and also to some complete enough spectra of exponen-
tially and subexponentially decreasing left distribution-tails of X. In particu-
lar, conjecture (4)&(8) is satisfied for all usual statistical distributions, e.g., if
and the distribution of Y belongs to any
one of the following classes: all bounded distributions, Poisson, normal, F,
Pareto, Gamma (including the Erlang, and geometric ones), the extreme
value distributions (both kinds of tails), lognormal and Weibull distributions,
Pascal and negative binomial distributions (including discrete geometrical one),
etc.
Let us now discuss the applicability of conjecture (4)&(8) in more detail.
First, consider

1.1. CONDITION (4)&(8): EXPONENTIAL AND SUPEREXPONENTIAL CASES
Suppose that the following Cramér type condition (cf., e.g., Hu (1991)) takes
place: for some and for each compact

In particular, all bounded distributions, Poisson, normal, Gamma (including
Erlang, and geometric ones), Pascal and negative binomial distributions
(including discrete geometrical one) and their Cramér transforms, etc., are taken
care of via this case.
Take, for a moment, the distribution of X to be absolutely continuous. We
of course assume throughout that EX > 0 and P(X < 0) > 0. Then (see, e.g.,
Borovkov and Rogozin (1965) or Hu (1991), formula (A.1), page 415),

as uniformly in belonging to any compact subset of the set

where

indicates the derivative and is the unique (because ) solution
to the equation Note that this definition of is consistent
with (10). Suppose that

(Recall that and so, for (14) to be true, it is sufficient that
the left tail be superexponential, i.e., or, more generally, that

Using (12), one can see that (4) is satisfied in this case with
moreover,
Note that in the Cramér case, under the additional condition that there exists
a number such that

Siegmund (1975) gave integral analogues of (4).


Condition (15), under which the treatment does get simplified, was also used
in various large deviation problems; e.g., see Bahadur and Rao (1960), Feller
(1971), Borovkov (1976), and Fill (1983).
The lattice distribution case can be treated in essentially the same way, using,
e.g., the result of Petrov (1965); also see Petrov (1975).
In the literature, it has become a kind of natural modus vivendi to confine
analysis of large deviations of order in boundary problems for random walks
only to the case when the Cramér condition (11) holds; see, e.g., Hu (1991) and
the references therein. We have seen that in such a case, our approach, via
condition (4)&(8), is valid as well as the habitually used Cramér transform
methods.
What is more, condition (4)&(8) (if it is indeed a condition in the sense that
it is not always satisfied!) leads to essentially new horizons. For example,
consider here the following vast class of distributions for which the Cramér
condition (11) and/or (14) may fail while (4) is true.

1.2. CONDITION (4)&(8): EXPONENTIAL (BEYOND (14)) AND SUBEXPONENTIAL CASES
Let X be either absolutely continuous or maximum span 1 lattice, with the
density of the form either

or

where and are functions slowly varying at (i.e.,
as ) and vanishing sufficiently rapidly at
say, where We emphasize that the
"subexponential" case is included here as well. The classes (16) and
(17) contain such distributions as F, Pareto, Weibull, stable distributions, and
their Cramér transforms. If (recall (10)) < 0, then (14) is satisfied, and
so, condition (4)&(8) is true, in view of the above discussion.
Suppose now that If (16) holds, then (cf. Nagaev (1971))

uniformly in for any here,

this implies

uniformly in for any An analogous statement holds
for the case (17); cf. Nagaev (1969). Note, finally, that here as
defined by (9), so that (4)&(8) hold again, while (14) fails.
Theorem 1 and (7) show that the adequate term for description of the asymp-
totics of is Moreover, Theorem 1 implies

Usually, the sum in (19) can be easily treated.
Suppose, e.g., that conditions (11) and (14) are valid. Then (12) gives

where

From (19) and (20) it easily follows that

But it is known (or may be obtained using (12)) that under the conditions (11)
and (14), the main term asymptotics of differs from that of
given by (12), only by a constant factor. Our conclusion therefore is that under
the conditions (11) and (14),

where is a positive constant depending only on F.
The last assertion, also under a Cramér type condition similar to (11)&(14),
but with the additional condition (15), was found by Siegmund (1975).
In case (15) does take place, a comparison of (21) and the result of Siegmund
(1975) shows that

where One can see that (23) does not contain defined by
(15); we therefore conjecture that (23) holds without (15), just whenever the
Cramér condition (11)&(14) is satisfied.
In order to compute using (23), one needs to deal with the distributions
of all the But as it was said above, the situation is much simpler in the
case when X has a maximum span 1 lower-semicontinuous distribution. In
this case, it is easy to obtain from a "renewal" integral equation similar to, say,
(3.16) in Feller (1971) that for integral hence, (6) and (8)
yield

where is given by (15),

As we stated above, the expression for becomes especially simple when X
has a maximum span 1 distribution which is both upper- and lower-semicontinuous,
i.e., when X can assume only the three values (-1), 0, 1 with probabilities, say,
respectively. Here, (7) takes place with

Another case, in which a simpler formula for may be found, is when
enjoys a more explicit expression; see, e.g., Borovkov (1976), Chap. 4,
Section 19. In particular, this is true in the case of the Erlang distribution. If,
e.g., is "purely continuous exponential" on that is,

where — cf. Borovkov (1976), Chap. 4, Section 19, (14) and Feller
(1971), Chap. XII, (5.9), then (6) and (8) yield

Let us now see what happens when the Cramér condition fails altogether. Suppose,
e.g., that (16) takes place with Then (18) and (19) yield

where is a positive constant. But it is known (cf. Nagaev (1971)) that
in this case,

Hence,

so that the asymptotics is similar to that of (22).

1.3. THE CONDITIONAL DISTRIBUTION OF THE INITIAL SEGMENT OF THE
SEQUENCE OF THE PARTIAL SUMS GIVEN T = N
Here, we shall present, in particular, limit theorems for the conditional dis-
tribution of the number of negative values among given
First, we present the limit theorem for the conditional distribution of given
denote by the density of this distribution (in the
lattice case, this notation may be regarded literally as well).

Theorem 2 If condition (4) holds, then for all

where is defined by (6); by the Scheffé theorem (see, e.g., Billingsley (1968)),
this convergence then takes place also in variation, and hence in distribution;
moreover, for large enough

where C is the same as in (4).

Consider now the conditional distribution of given


where Observe that this distribution remains the same if the
condition is replaced by — recall the independence
of the Unlike the universal character of the limiting behavior of the


distributions described by Theorems 1 and 2, that of essentially depends
on whether the Cramér condition (11)&(14) is satisfied or not.
Consider the random process on [0,1] defined by the formula

where is the integral part of The trajectories of live in the space
denoted by D (see, e.g., Billingsley (1968)) of the functions without disconti-
nuities of the second kind. As usual, D is assumed to be endowed with the
Chebyshev norm let be the corresponding
Borel in D.

Theorem 3 If the Cramér condition (11)&(14) is satisfied and


then for all and the conditional distribution in D of
the process

given (or, equivalently, given weakly converges to


that of the standard Brownian bridge; for the definition of the latter, see, e.g.,
Billingsley (1968); is defined in (13). The same is true for the conditional
distribution of given

As mentioned earlier, if the Cramér condition fails, the limiting behavior


of the process defined by (25), is completely different; instead of the
Brownian bridge as the limiting process one has to deal with the following
Z-shaped trajectories:

If U is a random variable uniformly distributed on [0, 1] and, as before,
let us call the random function the Z process.

Theorem 4 If the subexponentiality condition (16) with is satisfied, then


for all and the conditional distribution in D of the
process

given (or, equivalently, given weakly converges to


that of the Z process. The same is true for the conditional distribution of
given
Remark. Theorem 4 holds if instead of (16), condition (17) with


and is assumed; if however (17) takes place with but
the limiting behavior of the process given (or
becomes somewhat whimsical, taking on shapes intermediate between
those of the Brownian bridge and the Z process; here, we do not give further
details on this case. Nonetheless, while in the two cases — (11)&(14) on the
one hand and (16) on the other — we have so different patterns as the Brownian
bridge and the Z process, the asymptotics of the conditional distribution of the
number of, say, negative sums are exactly the same in both cases — which is
no wonder: cf., e.g., Feller (1971), Theorem 3, Section 9, Chapter XII.
More rigorously, it is easy to extract from Theorems 3 and 4 the following.

Corollary 1 Let denote any of the following random variables:

If and either the Cramér condition (11)&(14) or the subexponen-
tiality condition (16) with (or (17) with and (0,1/2)) is
satisfied, then for all and the conditional distribution
of given (or, equivalently, given weakly con-
verges to the uniform distribution on [0,1]. The same is true for the conditional
distribution of given

An obvious observation based on this corollary is that the "striking distributional
similarity" between the asymptotics of (in our terms) and
which is the result of Kao (1978), is actually by no
means surprising.

1.4. APPLICATION TO BANDIT ALLOCATION ANALYSIS
First, we consider a simplified strategy, which nevertheless turns out to be in
a certain sense optimal, at least in the exponential and superexponential cases,
as shown in Yakowitz and Lowe (1991).

1.4.1 Test-times-only based strategy. Suppose that we have two bandits
and whose behavior is described by two independent sequences
of i.i.d. r.v.'s and respectively; and are re-
garded as the results of potential observations made on and at the time
moment as if the observations were made indeed,
We assume that the better bandit is that is, EY < EZ. (This is of course
unknown to the observer; we suppose that in general no information about the
distributions of Y and Z is available except that the means EY and EZ are
finite.)
Let N(1) = 1 < N(2) < N(3) < ... be a sequence of nonrandom integers,
called here test times; the observations are supposed to be made only at these
time moments.
Consider

the number of the test times up through the time moment and

the cumulative sums of the observations made on and up to the time
moment
Let us say that bandit is apparently better at the time moment (which
leads to the erroneous decision that is better indeed) if
Denote by the last time moment at which is apparently better:

Since the following statement is
straightforward from (19).

Proposition 1 If the distribution of satisfies (4), then for the
constant given in (6),

where is the density of

Remark. Recall that condition (4) holds under very general circumstances.
Equations (20), (23), and those subsequent to them give more explicit forms
for (26) for different tail decrease rates.
Let M be the total number of the test times at which the wrong bandit was
selected.
Proposition 2 Put Suppose that If (16)
with then

where the constant c is given in (6); if instead (17) with and
(0,1/2)) are satisfied, then
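A simulation sketch of the test-times-only comparison follows (with illustrative normal rewards and test times N(k) = k): it records the last test time at which the worse bandit is apparently better, the quantity bounded by Propositions 1 and 2.

```python
import random

def last_wrong_decision(n_tests):
    """Two bandits with EY < EZ; return the last test time where Y looks better."""
    sy = sz = 0.0
    last = 0
    for k in range(1, n_tests + 1):
        sy += random.gauss(0.0, 1.0)   # worse bandit Y
        sz += random.gauss(0.3, 1.0)   # better bandit Z
        if sy >= sz:                   # Y apparently better: a wrong decision
            last = k
    return last

samples = sorted(last_wrong_decision(5000) for _ in range(2000))
print("median:", samples[len(samples) // 2],
      "95%:", samples[int(0.95 * len(samples))])
```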

1.4.2 Multiple bandits and all-decision-times based strategy. Here
we consider a more common strategy: continue to observe the apparently
worse bandit at the test times and use every observation (including all those
made between test times) in calculating the sample mean. At test times, switch
the observation allocation whenever the order relation of the bandit sample
means changes. Let us now give a rigorous description of the model.
Let stand for the bandits under consideration.
For each let be a sequence of i.i.d. r.v.’s
with the common mean is regarded as the result of a potential
observation made on at the time moment as if the observation were made
indeed,
We assume that the best bandit is that is, (this fact
is of course assumed to be unknown to the observer).
Let N(1) = 1 < N(2) < N(3) < · · · be a sequence of nonrandom integers,
called here test times.
We define recursively the sequences
and where The first of
them is the sequence of the decision indicators:

is the number of times has been observed up to the test time


is the cumulative sum of the corresponding results of observations; and
is the corresponding sample mean.
More exactly, we put

(at the test times all the bandits are tested), and if and
then if and
otherwise, where

Thus the control-decision process is completely defined.


Let us now introduce

the last time moment at which the best bandit is not observed. By the
above construction,

Fix any real number such that and put

Consider

and

the number of the test times up to the time moment


Theorem 5 For all

Note that for all and Thus, one can combine Theorem 5 and
previous results in order to obtain estimates for the distribution tail of
Suppose that, for each is the p.d.f. of
(understood as in (3)), is the c.d.f. of and (cf. (4))

some constant and function Then, by (19),


for some constants The last expression can be further
treated (cf. (22) and (24)).
The estimate given by Theorem 5 might seem too rough, but actually it is
not, because, by the strong law of large numbers, the quality of the test at the
time moment is basically determined by the number of observations made up
to the time moment on the apparently worst bandit, which is close to if
is large.

2. LARGE DEVIATIONS IN A SPACE OF TRAJECTORIES
The results of this section are taken from Pinelis (1981), unless specified
otherwise.
As in Section 1, let be i.i.d. r.v.’s with a common distribution
function F and let stand for the cumulative sums of the Let the
random process be defined by the same formula (25). However, in this
section we shall assume that EX = 0, rather than EX > 0.
Also, in this section we shall impose more restrictions on F. First, we shall
assume that Var X is finite; actually, without loss of generality, we shall assume
that Var X = 1. Further restrictions on F will be imposed in the statements of
results below.
In this section, we shall describe the asymptotic behavior of large deviation
probabilities of the form

where A is separated from and and vary arbitrarily
subject only to the conditions and Here, we assume
that the tails of the distribution of X decrease in a certain sense as as
where A similar problem corresponding to was
considered by Godovan'chuk (1978). The asymptotic behavior of probabilities
of large deviations in the space of trajectories when Cramér's condition holds
was studied by Borovkov (1964).
Let us write if

as where
A function H is said to be regularly varying with exponent if

for all
Let us say that H is completely regularly varying with exponent if

for some function which is regularly varying with exponent It is not
difficult to see that (i) every function which is completely regularly varying
with exponent is regularly varying with exponent and (ii) every
function H which is regularly varying with exponent is asymptotically
equivalent to a function which is completely regularly varying with exponent
that is, as
Let us say that a set is separated from 0 if

For any as usual, let denote the boundary of A,

the distance from to A, and

the of A.
For and define the step function by

For and consider the “cross section"

For any consider the measures F and defined by

for any Borel set


Let us refer to a set separated from 0, as (in the Riemann
sense) if the function is continuous almost everywhere (a.e.) on
[0,1].
For an set consider the measure defined by the
Riemann integral as

Let us refer to an set separated from 0, as
if

Note that for the of a set separated from 0, it suffices


that at least one of the following two conditions hold:
(i) the set is for all small enough (so that the
Lebesgue integral exists) and
(ii) uniformly in
Let be a standard (continuous) Wiener process. Then the probability
is defined for any
Let us refer to a set as if, for any

as where
It is easy to see that this condition is very mild. Indeed, e.g.,
(27) takes place if, for some

and for such a choice of A one may even take See also
Remark 2 below.
Let us refer to a set as if

as
According to the great many results on the large deviation principle (LDP),
the class of sets is very large.
Finally, for consider the two sets, and defined by

Now we are able to state the main results of this section.

Theorem 6 Let a set be and and such
that the sets and are Let the tail functions and
be completely regularly varying with exponent Then

whenever and

Theorem 7 Assume that a function is a.e. continuous,
and the set

is Assume also that and the tail function is
regularly varying with exponent Let

Then

whenever and

Remark. By the Donsker-Prohorov invariance principle, if then


is equivalent to
Remark 1. There arises the question: For what functions with will
the set be —Let us call a contact point for if

Let have only a finite number of contact points and let
be infinitely differentiable in a neighborhood of each contact point; moreover,
for each assume that not all the derivatives are
equal to 0; if we mean here a left neighborhood of and the left
derivatives at
Then, using Theorem 13 (Borovkov, 1964), it is not hard to see that is

An analogous assertion can be given also for the case when the set of contact
points is of nonzero Lebesgue measure.
Thus, we see that the condition of the of is quite mild.
Let, as usual, stand for the standard normal c.d.f.:

Corollary 2 Assume that and the tail function is regularly


varying with exponent Then

whenever and

The following assertion, given above as (1), is due to A.V. Nagaev (1969, 1971).

Corollary 3 Assume that and the tail function is regularly


varying with exponent Then

whenever and

Remark. The conditions of Theorem 6 contain the requirement of complete


regular variation of the tails of the distribution of X; that is‚ the requirement
of regular variation of the density of the distribution of X. This is necessitated
by the breadth of the class of sets for which Theorem 6 holds; such
sets A in Theorem 6 do not have to be “integral"; thus‚ Theorem 6 has a “local"
aspect to it. On the other hand‚ the condition of the completeness of the regular
variation is not necessary in the “integral" Theorem 7 and Corollaries 2 and 3‚
and indeed is not imposed there.
A few words about the method of the proof. Let us write

where

is a bounded sequence such that as for any


is a positive constant; R is the “remainder".
Then and behave mainly as and
respectively‚ while the remainder R is negligible.
To prove this‚ one uses so-called “methods of one and the same probability
space” to show that‚ except for a negligible probability‚ there are only two kinds
of trajectories providing for large deviations: (i) the trajectories corresponding

to in which all the jumps are small enough, smaller than and
which then behave as the trajectories of the Gaussian process and (ii)
the trajectories corresponding to in which there is a large jump larger
than and which then behave as the random jump function
Note that provided that and vary
so that one of the sides of this relation tends to 0. Similarly,
provided that and vary so that one of the sides of
this relation tends to 0.
It is easy to see that the “jump" term
is negligible in comparison to the “Gaussian" probability
in the zone with and a sufficiently
small provided that A is Thus, in this case

On the other hand, the “Gaussian" probability in Theorem 6


is negligible in comparison to the “jump" term
in the zone with and a sufficiently large
provided that Thus, in this case

As we shall see in the next section, the latter kind of behavior is exhibited in
very general settings, when the summands may take on values in a Banach
space and do not have to be identically distributed.

3. ASYMPTOTIC EQUIVALENCE OF THE TAIL OF


THE SUM OF INDEPENDENT RANDOM
VECTORS AND THE TAIL OF THEIR MAXIMUM
3.1. INTRODUCTION
Here, we follow Pinelis (1985).
To begin with, let us consider, as in the previous sections, i.i.d. r.v.’s
with a common c.d.f. Set

Consider a “typical" trajectory of the sums which significantly


deviates from 0, say in the sense that there occurs an event of the form
where is such that is small. What does such a typical trajectory
look like?
If the Cramér condition

is satisfied for some then a “typical large deviation" trajectory of the


sums "straightens out", and the contributions of the summands are
overall comparable with one another.
If the Cramér condition does not take place, then the shape of a “typical
large deviation" trajectory changes significantly. If the tail is not too
irregular and (28) does not hold, then in a zone of the form
one has the asymptotic relation

which is equivalent to

This means that the large deviations of the sum occur mainly due to just one
large summand. Note that in (29) or (30) it is not at all necessary that
in fact‚ the number of summands may just as well be fixed.
It is not difficult to understand the underlying causes of the two different
large deviations mechanisms. For instance‚ let there exist a bounded density
and let Take a large enough positive number
We are interested in the of numbers that maximize
the joint density or‚ equivalently‚ minimize
under the condition

Proposition 3 Let

Suppose that one of the following two cases takes place:


(i) the function is convex on
(ii) the function is concave on for some real and as

Then in the first case for large enough while in the


second case for a fixed and

Thus‚ one can see that if is convex‚ then (the Cramér condition is satisfied
and) the maximum of the joint density is attained when the summands are equal:

On the other hand‚ if the alternative takes place‚ then (the Cramér condition
fails and) the maximum of the joint density is attained when only one of the
summands is large (equal to and the rest of them are small (equal
to

In other words, the critical influence of the Cramér condition as to the type of
the large deviation mechanism is related to the facts that (i) “regular" densities
are log-concave if the Cramér condition holds and log-convex (in a neighbor-
hood of otherwise, and (ii) the direction of convexity of the logarithm of
the density determines the kind of large deviation mechanism.1
Of course, the density does not have to be either log-concave or log-convex for
one of the two large deviation mechanisms to work. The role of log-convexity
(say) may be assumed, e.g., by a condition that the tail of the distribution
is significantly heavier, in a certain sense, than exponential ones. Thus, in
Theorems 14 and 16 below, where the tails decrease no faster than with
there are no convexity conditions. On the other hand, such a
condition is part of Theorem 15 below, where the tails can be arbitrarily close
to exponential ones; that a convexity condition is essential is also shown by
Example 1 below.
For a fixed relation (30) was studied by Chistyakov (1964), with an ap-
plication to a renewal equation. It was shown in Chistyakov (1964) that for a.s.
positive X, relation (30) holds for any fixed iff it is true for i.e.,
when

C.d.f.’s F satisfying (31) were later on called subexponential because of their


property

which follows from the relation

for any fixed These and other properties of the class of all subexponen-
tial c.d.f.’s were established in Chistyakov (1964). The class attracted the
attention of a number of authors; see e.g. Embrechts, Goldie, and Veraverbeke (1979),
Goldie and Klüppelberg (1978), Vinogradov (1994), and the bibliography therein.
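Relation (31) is easy to probe numerically. The following is a minimal Monte Carlo sketch, assuming Pareto tails P(X > t) = t^(-alpha), t ≥ 1 (a convenient subexponential example chosen here for illustration): for a large threshold x, the probabilities P(S_n > x) and P(max_k X_k > x) and the value n P(X_1 > x) should all be close, in line with (29)-(30).

```python
import numpy as np

rng = np.random.default_rng(1)
n, alpha, x, m = 5, 1.5, 200.0, 10**6    # summands, tail index, threshold, runs

# Pareto(alpha) samples with tail P(X > t) = t**(-alpha) for t >= 1
X = rng.pareto(alpha, size=(m, n)) + 1.0
S, M = X.sum(axis=1), X.max(axis=1)

print((S > x).mean(),      # P(S_n > x)
      (M > x).mean(),      # P(max_k X_k > x)
      n * x**(-alpha))     # n P(X_1 > x); all three should be close
```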
If (30) holds for every fixed then it remains true for uniformly in
some zone of the form where In general, nothing can be
said about the rate of growth of There have been many publications concern-
ing (30), including estimates of see, e.g., Andersen (1970); Anorina and
Nagaev (1971); Godovan’chuk (1978); Heyde (1968); Linnik (1961); Nagaev
(1969; 1971; 1964; 1973; 1979); Pinelis (1981); Rohatgi (1973).
How do the results listed below differ from those found in the preceding
literature?
First, the restriction that the summands are identically distributed is removed.
This removal seems quite natural: as an extreme case, if all summands except
one are zero, then (29) is trivial.

Second‚ it is assumed here that the summand r.v.’s take on values in an ar-
bitrary separable Banach space rather than just real values. Note
that‚ in a variety of settings‚ when the Banach space is “like" infinite-
dimensionality works rather toward the only-one-large-summand large devia-
tion mechanism. Indeed‚ let here and let be
independent real-valued r.v.’s such that the distributions of the are contin-
uous; let and otherwise. Then

so that (29) is trivial.


Third, conditions of regularity of tails are relaxed here, especially when the
second moment does not exist. What is even more important is that regularity
conditions are now imposed only on the sum of the tails of the distributions of
the summands, rather than on every one of the tails.
Fourth, the optimal zone is studied for the most difficult case
when the tails may be arbitrarily close to exponential ones; see Theorem 15 and
Proposition 4 below. The only precedent to this result is the work by Nagaev
(1962), where it is assumed that the summands are real valued and identically
distributed and stricter regularity conditions are imposed on the tails. The
method of Nagaev (1962) does not work in the more general setting.
It may seem somewhat surprising that the proofs of the limit theorems stated
below do not use a Cramér transform of any kind but are based on the inequal-
ities given in the next section. This shows, in particular, how powerful those
inequalities are.
To explain the idea on how to obtain limit theorems based on the upper
bounds, let us turn back to (1). It is comparatively easy to show that

On the other hand, as was said, the second summand –


– on the right-hand side of (1) dominates the first one – – in the
zone of “very" large deviations, when Suppose now
that one has an upper bound of the form

where

in the zone for any given and large enough


cf. (36) below. Clearly, that would be sufficient for the asymptotics

in the large deviations zone of the same form –


Let be a double-index array of r.v.’s
with values in a separable Banach space such that for each the r.v.’s
are independent. Let

We shall be interested in the conditions under which the large deviation asymp-
totics

obtain.

3.2. EXPONENTIAL INEQUALITIES FOR


PROBABILITIES OF LARGE DEVIATION OF
SUMS OF INDEPENDENT BANACH SPACE
VALUED R.V.’S
Let

where is any nonnegative Borel function.


For any function and any positive real numbers
consider the generalized inverse function

where we set sup


Assume that for any
Theorem 8 Let a function be log-concave on an interval for some
for all where is a positive number. Then, for
any positive numbers such that one has

where

Remark. If for some then

for some constant C > 0, depending only on and It follows, e.g., that if the
are i.i.d. and for some then for any there is
a constant such that

in the zone of “moderate" deviations. The constant 2 in


is obviously the best possible. Similar important properties of the bound (35)
result in the optimal zones of large deviations in which (33) and (34) take place;
cf. (44) and (56) below.
Theorem 8 is based in part on the following result.
Theorem 9 (Pinelis and Sakhanenko (1985)) For all

Moment inequalities for sums of infinite-dimensional random vectors are


also known; see e.g. Pinelis (1994) and references therein.

3.3. THE CASE OF A FIXED NUMBER OF


INDEPENDENT BANACH SPACE VALUED R.V.’S.
APPLICATION TO ASYMPTOTICS OF
INFINITELY DIVISIBLE PROBABILITY
DISTRIBUTIONS IN BANACH SPACES
Let here and fix

Theorem 10 If the c.d.f. is subexponential (i.e.‚


satisfies (31))‚ then the asymptotic relation (33) takes place as

Theorem 10 is complemented by the following powerful and easy-to-check


sufficient condition for subexponentiality.

Theorem 11 Let a c.d.f. F with F(0) = 0 satisfy (32). Assume also that there
exist numbers and a function such that

as and the function

Then F is a subexponential c.d.f.

Theorem 11 shows that the tail of a subexponential c.d.f. may vary rather
arbitrarily between and cf. Proposition 4 below. In
particular‚ if for some

as and F(0) = 0, then F is a subexponential c.d.f. This contradicts


a statement in Teugels (1975), page 1001 that such an F is not subexponential
for That this statement is in error was noticed also by Pitman (1980),
who used a different method.
The following example shows that the sufficient condition for subexponen-
tiality provided by Theorem 11 is rather close to a necessary one. In particular,
one cannot replace there by 1.

Example 1 Let and, for all let

where the are chosen so that is continuous, and Let then


and let a c.d.f. F be such that for all

Then F is not subexponential, and the function


is concave only for Yet, the necessary condition (32) is satisfied.

On the other hand, the choice of the function in the definition


is rather arbitrary. Instead of one may use here any concave
function such that as and the function
is integrable in a neighborhood of In particular‚ any
of the following functions will do:

Theorem 11 works especially well for subexponential tails close to exponen-


tial ones. For tails decreasing no faster than power ones‚ the following criterion
is useful.
Theorem 12 (Borovkov(1976)‚ page 173) Let a c.d.f. F with F(0) = 0 satisfy
(32). Assume also that F is such that

as Then F is a subexponential c.d.f.


Consider now applications to asymptotics of infinite-dimensional infinitely
divisible probability distributions. By measures on a Banach space we
mean Borel measures on A probability distribution on is called
infinitely divisible if it can be represented, for every natural number $n$, as the
$n$-fold convolution of some probability distribution on For any measure on with a
finite variation one can define the compound Poisson distribution

with Lévy measure Let denote the distribution concentrated in a point


For any measure and positive numbers let

for all Borel A in where Let

A measure is called a Lévy measure if for all


and there exists the weak limit for some Any such
limit (all of them being equal to one another up to a shift) is denoted by
s Pois

It is well known (see e.g. Araujo and Giné (1979)‚ page 108) that any
infinitely divisible probability distribution on admits the unique represen-
tation of the form

where is a centered Gaussian distribution. In what follows‚ we assume that


and are as in (37). Let

Theorem 13 If is a subexponential c.d.f. for some (or‚ equivalently‚ for all)


then

as Conversely‚ if M is a subexponential c.d.f.‚ then is one too and


(38) holds.

When and is concentrated on the result of Theorem 13


was obtained by Embrechts, Goldie, and Veraverbeke (1979); when
and is regularly varying, by Zolotarev (1961); when is a stable
distribution of an index by Araujo and Giné (1979).
In the case one may obtain one-sided analogues of the results of
this subsection, using the following extension of the notion of a subexponential
c.d.f. Let us refer to a c.d.f. F as positive-subexponential if the c.d.f.
is subexponential. It is easy to see that if F is a positive-
subexponential c.d.f., then for any fixed one has

as
It can be shown that if the expectation then (39) is incom-
patible with the Cramér condition. However‚ one can construct counterexamples
showing that then both (39) and the Cramér condition may
hold; for instance‚ consider

where

3.4. TAILS DECREASING NO FASTER THAN POWER


ONES
In this subsection‚ we shall deal with nonnegative nonincreasing functions
H‚ satisfying a condition of the form for some
constant C > 1 and some positive numbers It is easy to see that such
functions decrease no faster than a power function: if and
then

Theorem 14 Let and vary so that

In addition‚ suppose that

for all and

for some Then one has the relations (33) and (34).

Corollary 4 Suppose that

for where H is a positive function such that

for some positive real numbers Suppose also that at least one of the
following three conditions holds:
(i) is a Banach space of type 2‚ for all and and
(44) takes place;

(ii) is a Banach space of type for some and


for all and
(iii)
Then one has the relations (33) and (34) provided and (43).

Recall that a Banach space is of type if there is a constant


such that

for all independent zero-mean r.v.’s with values in


Remark. Conditions (45)–(47) are satisfied if‚ e.g.‚

for some and some slowly varying function Thus, Corollary 4 is


a generalization of results of Andersen (1970); Godovan’chuk (1978); Heyde
(1968)‚ for all the values of in (48) except for If, with
it is additionally assumed that, for instance, all the are symmetrically
distributed and is of type then one still has the relations (33) and (34)
provided and (43).

Corollary 5 Let be fixed‚

Suppose that

as Then

The latter Corollary is for illustrative purposes only. It shows that the indi-
vidual distributions of the summands may be very irregular; they may even
be two-point distributions. For (33) and (34) to hold‚ it suffices that the sum of
the tails be regular enough.
Note also that Theorem 12 follows from Theorem 14.

3.5. TAILS‚ DECREASING FASTER THAN ANY


POWER ONES
Let stand for the set of all differentiable functions satisfying the conditions

as Introduce

Theorem 15 Suppose that

where the positive sequence is non-decreasing and for some


Suppose that

and

for some Then one has (33) and (34).

Note that condition (52) implies that the function is con-


cave‚ while (53) implies Roughly‚ conditions (52) and (53) mean‚
in addition to certain regularity‚ that the tail decreases slower than any exponen-
tial one but faster than any power one. The following proposition shows that
Theorem 15 concerns tails that may be arbitrarily close to exponential ones.

Proposition 4 For any function such that as there


exists a function such that as

Condition (56) is essentially (44) rewritten so that the other conditions of


Theorem 15 are taken into account. Together with (55)‚ it determines the
boundary of the zone of large deviations in which one has (33) and (34); this
boundary is best possible‚ as a comparison with the results of Nagaev (1973) shows.
Note that‚ for the tails close enough to exponential ones‚ the boundary of the

zone is essentially determined by condition (55); the role of the “separator


between the spheres of influence" of conditions (55) and (56) is played by the
function
The following corollary is parallel to Corollary 5.

Corollary 6 Let be fixed. Suppose that conditions (49) and (50) hold. Fi-
nally‚ assume that

for some and all large enough integral Then one has (51).

3.6. TAILS‚ DECREASING NO FASTER THAN

In the previous two subsections‚ theorems were given covering the whole
spectrum of tails‚ from to for which the relations (33) and (34)
may hold in general. However‚ tails such as the ones mentioned in the title of
this subsection may still remain of interest. As one can see from the statement
of Theorem 16‚ a restriction is now imposed only on the first derivative of the
tail function‚ while in Theorem 15 one has to take into account essentially the
sign of a second derivative. Note also that Theorem 16 generalizes Theorems
1 and 2 of Nagaev (1977) simultaneously.
Consider the following classes of nonnegative non-decreasing functions
defined on some interval where
(i) class defined by the condition: is
non-increasing in
(ii) class for defined by the condition:
whenever
(iii) class for defined by the condition: is
absolutely continuous, and its derivative satisfies the inequality

Proposition 5 If then

In particular‚

This proposition shows that all three kinds of classes are essentially
the same.

Theorem 16 Suppose that and‚ for all

for where
and Suppose also that

and

Then relations (33) and (34) hold.


Corollary 7 Suppose that one has (54) for some function where
such that as Suppose that and

Then relations (33) and (34) hold.


Corollary 8 Suppose that

where is non-decreasing and is slowly varying at


Suppose that and

Then relations (33) and (34) hold.


Of course, the latter corollary could have been obtained from Theorem 14
as well, and in that case one would have obtained more precise boundaries (as
compared with (59)) of the corresponding zone of large deviations. On the other
hand, Corollaries 7 and 8 generalize the first statement of each of Theorems 1
and 2 of Nagaev (1977). The remaining statements of the latter two theorems
can be deduced quite similarly, if one observes that Theorem 16 can be modified
as follows:
(i) replace (57) by the condition

(ii) add condition (41);


(iii) replace the right-hand side in (58) by

NOTES
1. Similar observations on large deviation mechanisms were offered by Nagaev (1964).

REFERENCES
Andersen‚ G. R. (1970) Large deviation probabilities for positive random vari-
ables. Proc. Amer. Math. Soc. 24 382–384.
Anorina‚ L. A.; Nagaev‚ A. V. (1971) An integral limit theorem for sums of inde-
pendent two-dimensional random vectors with allowance for large deviations
in the case when Cramér’s condition is not satisfied. (Russian) Stochastic pro-
cesses and related problems‚ Part 2 (Russian)‚ “Fan" Uzbek. SSR‚ Tashkent‚
3–11.
Araujo‚ A. and Giné‚ E. (1979) On tails and domains of attraction of stable
measures in Banach spaces. Trans. Amer. Math. Soc. 248‚ no. 1‚ 105–119.
Bahadur‚ R.R. and Rao‚ R. Ranga. (1960) On deviations of the sample mean‚
Ann. Math. Statist.‚ 31‚ 1015–1027.
Billingsley‚ P. (1968) Convergence of Probability Measures‚ Wiley‚ New York.
Borovkov‚ A.A. (1964). Analysis of large deviations for boundary problems
with arbitrary boundaries I‚ II. In: Selected Translations in Math. Statistics
and Probability‚ Vol.6‚ 218-256‚ 257-274.
Borovkov‚ A.A. (1976). Stochastic Processes in Queuing Theory. Springer-
Verlag‚ New York-Berlin.
Borovkov‚ A.A. and Rogozin‚ B.A. (1965) On the multidimensional central
limit theorem‚ Theory Probab. Appl.‚ 10‚ 52-62.
Chistyakov‚ V.P. (1964) A theorem on sums of independent positive random
variables and its application to branching processes‚ Theory Probab. Appl.‚
9‚ 710–718.
Dembo‚ A. and Zeitouni‚ O. (1998) Large deviations techniques and appli-
cations. Second edition. Applications of Mathematics‚ 38. Springer-Verlag‚
New York.
Embrechts‚ P.; Goldie‚ C. M.; Veraverbeke‚ N. (1979) Subexponentiality and
infinite divisibility. Z. Wahrsch. Verw. Gebiete 49‚ no. 3‚ 335–347.
Feller‚ W.‚ (1971) An Introduction to Probability Theory and its Applications‚
II‚ 2nd ed.‚ John Wiley & Sons‚ New York.
Fill‚ J.A. (1983) Convergence rates related to the strong law of large numbers‚
Annals Probab.‚ 11‚ 123-142.
Godovan’chuk‚ V.V. (1978) Probabilities of large deviations of sums of inde-
pendent random variables attached to a stable law‚ Theory Probab. Appl. 23‚
602–608.
Goldie‚ C.M. (1978) Subexponential distributions and dominated-variation tails.
J. Appl. Probability 15‚ no. 2‚ 440–442.

Goldie‚ C.M. and Klüppelberg‚ C. (1978) Subexponential distributions. Uni-


versity of Sussex: Research Report 96/06 CSSM/SMS.
Heyde‚ C. C. (1968) On large deviation probabilities in the case of attraction
to a non-normal stable law. Sankhyā Ser. A 30‚ 253–258.
Hu‚ I. (1991) Nonlinear renewal theory for conditional random walks‚ Annals
Probab.‚ 19‚ 401–22.
Kao‚ C-s. (1978) On the time and the excess of linear boundary crossing of
sample sums‚ Annals Probab.‚ 6‚ 191–199.
Lai‚ T.L. and H. Robbins (1985) Asymptotically efficient adaptive allocation
rules‚ Advances in Applied Mathematics‚ 6‚ 4–22.
Linnik‚ Ju. V. (1961) On the probability of large deviations for the sums of
independent variables. Proc. 4th Berkeley Sympos. Math. Statist. and Prob.‚
Vol. II‚ pp. 289–306. Univ. California Press‚ Berkeley‚ Calif.
Nagaev‚ A.V. (1969) Limit theorems taking into account large deviations when
Cramér’s condition fails‚ Izv. AN UzSSR‚ Ser. Fiz-Matem.‚ 6‚ 17–22.
Nagaev‚ A.V. (1969) Integral limit theorems taking large deviations into account
when Cramér’s condition does not hold‚ I‚ II‚ Theory Probab. Appl.‚ 24‚ 51–64‚
193–208.
Nagaev‚ A.V. (1971)‚ Probabilities of Large Deviations of Sums of Independent
Random Variables‚ D. Sc. Dissertation‚ Tashkent (in Russian).
Nagaev‚ A.V. (1977) A property of sums of independent random variables.
(Russian.) Teor. Verojatnost. i Primenen. 22‚ no. 2‚ 335–346.
Nagaev‚ S.V. (1962)‚ Integral limit theorem for large deviations‚ Izv. AN UzSSR‚
Ser. fiz.-mat. nauk. 37–43. (in Russian).
Nagaev‚ S.V. (1964)‚ Limit theorems for large deviations‚ Winter School on
Probability Theory and Mathematical Statistics. Kiev‚ 147–163. (in Russian).
Nagaev‚ S. V. (1973) Large deviations for sums of independent random vari-
ables. Transactions of the Sixth Prague Conference on Information The-
ory‚ Statistical Decision Functions‚ Random Processes (Tech. Univ.‚ Prague‚
1971; dedicated to the memory of Antonín Špaček)‚ Academia‚ Prague‚ 657–
674.
Nagaev‚ S. V. (1979) Large deviations of sums of independent random variables.
Ann. Probab. 7‚ no. 5‚ 745–789.
Petrov‚ V.V. (1965) On probabilities of large deviations of sums of independent
random variables‚ Teor. Veroyatnost. Primen.‚ 10‚ 310-322. (in Russian).
Petrov‚ V.V.‚ (1975) Sums of Independent Random Variables‚ Springer-Verlag‚
New York.
Pinelis‚ I.F. (1981) A problem on large deviations in a space of trajectories‚
Theory Probab. Appl. 26‚ no. 1‚ 69–84.
Pinelis‚ I.F. (1985) Asymptotic equivalence of the probabilities of large devi-
ations for sums and maximum of independent random variables. (Russian)

Limit theorems of probability theory‚ 144–173‚ 176‚ Trudy Inst. Mat.‚ 5‚


“Nauka" Sibirsk. Otdel.‚ Novosibirsk.
Pinelis‚ Iosif (1994) Optimum bounds for the distributions of martingales in
Banach spaces. Ann. Probab. 22‚ no. 4‚ 1679–1706.
Pinelis‚ I. F.; Sakhanenko‚ A. I. (1985) Remarks on inequalities for probabilities of
large deviations. (Russian) Theory Probab. Appl. 30‚ no. 1‚ 143–148.
Pinelis‚ I.; Yakowitz‚ S. (1994) The time until the final zero crossing of random
sums with application to nonparametric bandit theory‚ Appl. Math. Comput.
63‚ no. 2-3‚ 235–263.
Pitman‚ E. J. G. (1980) Subexponential distribution functions. J. Austral. Math.
Soc. Ser. A 29‚ no. 3‚ 337–347.
Robbins‚ H.‚ (1952) Some aspects of the sequential design of experiments‚ Bull.
Amer. Math. Soc. 58‚ 527-535.
Rohatgi‚ V. K. (1973) On large deviation probabilities for sums of random
variables which are attracted to the normal law. Comm. Statist. 2‚ 525–533.
Siegmund‚ D. (1975) Large deviations probabilities in the strong law of large
numbers‚ Z. Wahrsch. verw. Gebiete 31‚ 107-113.
Teugels‚ J. L. (1975) The class of subexponential distributions. Ann. Probability
3‚ no. 6‚ 1000–1011.
Tkachuk‚ S. G. (1974) Theorems on large deviations in the case of a stable
limit law. In Random Process and Statistical Inference‚ no. 4‚ Fan‚ Tashkent
178–184. (In Russian)
Tkachuk‚ S. G. (1975) Theorems on large deviations in the case of distributions
with regularly varying tails. In Random Process and Statistical Inference‚
no. 5‚ Fan‚ Tashkent 164–174. (In Russian)
Vinogradov‚ V. (1994) Refined large deviation limit theorems. Pitman Research
Notes in Mathematics Series‚ 315. Longman Scientific & Technical‚ Harlow;
copublished in the United States with John Wiley & Sons‚ Inc.‚ New York.
Yakowitz‚ S. and W. Lowe‚ (1991) Nonparametric bandits‚ Annals of Operations
Research‚ 28‚ 297-312.
Zolotarev‚ V. M. (1961) On the asymptotic behavior of a class of infinitely
divisible laws. (Russian) Teor. Verojatnost. i Primenen. 6‚ 330–334.
Part II

STOCHASTIC MODELLING OF EARLY HIV


IMMUNE RESPONSES UNDER TREATMENT
BY PROTEASE INHIBITORS

Wai-Yuan Tan
Department of Mathematical Sciences
The University of Memphis
Memphis‚ TN 38152-6429
waitan@memphis.edu

Zhihua Xiang
Organon Inc.
375 Mt. Pleasant Avenue
West Orange‚ NJ 07052
z.xiang@organoninc.com

Abstract It is well documented that‚ in many cases‚ most of the free HIV are generated in
the lymphoid tissues rather than in the plasma. This is especially true in the late
stage of HIV pathogenesis because in this stage‚ the total number of T
cells in the plasma is very small‚ whereas the number of free HIV in the plasma
is very large. In this paper we have developed a state space model in plasma
involving net flow of HIV from lymph nodes‚ extending the original model of
Tan and Xiang (1999). We have applied this model and the theory to the data of
a patient (patient No.104) considered in Perelson et al. (1996)‚ in which RNA
virus copies per were observed on 18 occasions over a three week period.
This patient was treated by a protease inhibitor‚ ritonavir‚ so that a large number
of non-infectious HIV was generated by the treatment. For this patient‚ by using
the state space model over the three week span‚ we have estimated the numbers
of productively HIV-infected T cells‚ the total number of infectious HIV‚ as well
as the number of non-infectious HIV. Our results showed that within this period‚
most of the HIV in the plasma was non-infectious‚ indicating that the drug is
quite effective.

Keywords: Infectious HIV‚ lymph nodes‚ Monte Carlo studies‚ non-infectious HIV‚ protease
inhibitors‚ state space models‚ stochastic differential equations.

1. INTRODUCTION
In a recent paper‚ Tan and Xiang (1999) developed some stochastic and state
space models for HIV pathogenesis under treatment by anti-viral drugs. In
this paper we extend these models into models involving net flow of HIV from
lymphoid tissues. This extension is important and necessary because of the
following biological observations:

(1) HIV normally exists either in the plasma as free HIV‚ or trapped by follic-
ular dendritic cells in the germinal center of the lymph nodes during all stages
of HIV infection (Fauci and Pantaleo‚ 1997; Levy‚ 1998; Cohen‚ Weissman and
Fauci‚ 1998; Fauci‚ 1996; Tan and Ye‚ 2000). Further‚ Haase et al. (1996) and
Cohen et al. (1997‚ 1998) have shown that the majority of the free HIV exist in
lymphoid tissues. This is true especially in the late stage of HIV pathogenesis.
For example‚ Perelson et al. (1996) have considered a patient (Patient No. 104
in Perelson et al. (1996)) treated by a protease inhibitor‚ ritonavir. For this
patient‚ at the start of the treatment‚ the total number of T cells in the
blood was yet the total number of RNA virus copies was
in the blood; many other examples can be found in Piatak et al. (1993). Thus‚
in the late stage‚ it is unlikely that most of the free HIV in the plasma were
generated by productively HIV-infected CD4 T cells in the plasma. (Note that
the total number of T cells includes the productively HIV-infected T
cells.)

(2) Lafeuillade et al. (1996) and many others have shown that both the free
HIV in the plasma‚ and the HIV in lymph notes can infect T cells‚ generating
similar dynamics in the plasma as well as the lymph nodes. Furthermore‚ the
infection process in the lymph nodes is much faster than in the plasma (Fauci‚
1996; Cohen‚ Weissman and Fauci‚ 1998; Cohen et al.‚ 1997; Haase et al.‚ 1996;
Kirschner and Webb‚ 1997; Lafeuillade et al.‚ 1996; Tan and Ye‚ 2000). From
these observations‚ it follows that most of the free HIV in the blood must have
come from HIV in the lymph nodes‚ rather than from productively HIV-infected
CD4 cells in the blood; this is true especially in the late stages.

To model the HIV pathogenesis in the blood‚ it is therefore necessary to in-


clude net flow of HIV from the lymph nodes or other tissues to the plasma. On
the other hand‚ since the T lymphocytes are less mobile (Weiss‚ 1996) and are
much larger in size than the HIV‚ one would expect that the number of T cells
flowing from the lymph nodes to the plasma is about the same as that flowing

from the plasma to the lymph nodes. In this paper we thus consider only net
flow of HIV from lymphoid tissues to plasma‚ ignoring net flow of T cells.

In Section (2)‚ we illustrate how to develop a stochastic model for HIV


pathogenesis under treatment by protease inhibitors‚ involving net flow of HIV
from lymphoid tissues to plasma. To compare results with the deterministic
model‚ in Section 3 we give equations for the expected numbers of the state
variables. By using the stochastic model in Section 2 as the stochastic system
model‚ in Section 4 we develop a state space model for the early HIV patho-
genesis under treatment by protease inhibitors‚ and with net flow of HIV from
lymphoid tissues to the plasma. Observation of the state space model is based
on the RNA virus copies per of blood taken over time. In Section 5 we
illustrate the model and the theory by applying it to the data of a patient reported
by Perelson et al. (1996). Finally in Section 6‚ we generate some Monte Carlo
studies to confirm the usefulness of the extended Kalman filter method.

2. A STOCHASTIC MODEL OF EARLY HIV


PATHOGENESIS UNDER TREATMENT BY A
PROTEASE INHIBITOR
Consider the situation in which a HIV-infected individual is treated by a
protease inhibitor. Then‚ for this individual‚ both infectious HIV and non-
infectious HIV will be generated in both the plasma and the lymphoid tissues.
Since HIV data such as the RNA virus copies and/or the T cell counts
are usually sampled from the blood‚ we illustrate in this section how to develop
stochastic models in the plasma for early HIV pathogenesis‚ under treatment
by protease inhibitors and with a net flow of HIV from lymphoid tissues to
plasma. We look for models which capture most of the important characteristics
of the HIV pathogenesis‚ yet are simple enough to permit efficient estimation
of parameters and /or state variables. Thus‚ since the contribution to HIV by
latently HIV-infected T cells is very small (less than 1%; see Perelson et al.
(1996))‚ we will‚ like these authors‚ ignore latently HIV-infected T cells. Also‚
in the early stage‚ when the period since treatment is short (e.g.‚ one month)‚
one may assume that drug resistance of HIV to the anti-viral drugs has not
yet developed; further‚ because of the following biological observations‚ we
will follow Perelson et al. (1996) in assuming that the number of un-infected
cells is constant:

(i) Before the start of treatment‚ the HIV pathogenesis is in a steady-state


situation.

(ii) The uninfected CD4(+) T cells have a relatively long life span (The
average life span of uninfected T cells is more than 50 days; see
Cohen et al. (1998)‚ Tan and Ye (1999a)).
(iii) Mittler et al. (1998) have provided additional justifications for assumption
(2); they have shown that even for a much longer period since treatment‚
the assumption has little effect on the total number of free HIV.
Let denote the numbers of productively HIV-infected
T cells‚ non-infectious free HIV and infectious free HIV in the blood
at time t respectively. Then we are considering a three dimensional stochastic
process With the above assumptions‚ we now
proceed to derive stochastic equations for the state variables in plasma of this
stochastic process under treatment by protease inhibitors and with net flow of
HIV from lymphoid tissue. We first illustrate how to model the effects of pro-
tease inhibitors and the net flow of HIV from the lymphoid tissues to the plasma.

2.1. MODELING THE EFFECTS OF PROTEASE


INHIBITORS
Under treatment by protease inhibitors‚ the HIV can infect the T cells
effectively; however‚ due to inhibition of the enzyme protease‚ most of the free
HIV released by the death of productively HIV-infected T cells is non-infectious.
It follows that under treatment by protease inhibitors‚ both non-infectious free
HIV (to be denoted by ) and infectious free HIV (to be denoted by ) will
be generated. When the drug is effective‚ most of the released free HIV is
non-infectious.

To model this stochastically‚ denote by


The probability that a free HIV released in the plasma at time t by
the death of a productively infected T cell is non-infectious.
if the protease inhibitors are 100 percent effective and if the
protease inhibitors are not effective. For sensitive HIV‚
Number of free HIV released by the death at time t of a
productively HIV-infected T cell in the plasma.
The number of the non-infectious HIV among the re-
leased free HIV.
Given the number of released free HIV, the number of non-infectious HIV among
them is a binomial random variable with the corresponding parameters. That is,

2.2. MODELING THE NET FLOW OF HIV FROM


LYMPHOID TISSUES TO PLASMA
As shown by Cavert et al. (1996), as in plasma, the anti-retroviral drugs in
Highly Active Anti-Retroviral Therapy (HAART) treatment can reduce HIV
to a very low or undetectable level in the lymphoid tissues. Further, dynamics
for HIV pathogenesis in the lymphoid tissues and in the plasma (Lafeuillade
et al., 1996) have been observed to be similar. Thus, by dynamics similar to
those in the plasma, both and will be generated in the lymphoid tissues
under treatment by protease inhibitors. Furthermore, as shown by Fauci (1996),
Cohen, Weissman and Fauci (1998), Cohen et al. (1997), Haase et al. (1996)
and Lafeuillade et al. (1996), the infection process of HIV pathogenesis in the
lymph nodes is much faster than in the plasma. Hence, there are net flows of HIV
from the lymph nodes or other tissues to the plasma; on the other hand, since
the T lymphocytes are less mobile and much larger in size than the HIV, one
may ignore the net flow of T cells (Weiss, 1996). To model this stochastically,
denote by:
The total net flow of HIV during generated by all
productively HIV-infected T cells in the lymphoid tissues at time t‚
The total net flow of HIV during generated by
in the lymphoid tissues at time t.
Then‚ consists of only whereas consists of both infec-
tious and non-infectious free HIV. Let be the number of infectious HIV
among the free HIV and the probability that a released free HIV
in the lymphoid tissues at time t becomes non-infectious in the lymphoid tissues
under treatment by protease inhibitors. The conditional probability distribution
of given is then given by:

1 if the protease inhibitors are 100 percent effective in the lymphoid


tissues and if the protease inhibitors are not effective in the lymphoid
tissues. For sensitive HIV‚

In Tan and Ye (1999b)‚ is modelled by a negative binomial


distribution given by:

where is the flow inhibiting function‚ the flow potential function and
the saturation constant (Kirschner and Webb‚ 1997).
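As a minimal sampling sketch of the two mechanisms of Sections 2.1 and 2.2: binomial thinning of the released virions, and a negative binomial total net flow. All names and numbers below (xi_p, xi_L, r, prob, n_released) are illustrative placeholders rather than fitted quantities, and the binomial split of the inflow into infectious and non-infectious parts is an assumption standing in for the conditional law displayed above.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative placeholder values only; none of these are fitted quantities.
n_released = 300    # free HIV released in plasma by dying infected T cells
xi_p = 0.9          # assumed prob. a plasma-released virion is non-infectious
xi_L = 0.9          # assumed prob. a lymphoid-released virion is non-infectious
r, prob = 5, 0.01   # assumed negative binomial parameters standing in for the
                    # flow-inhibiting, flow-potential and saturation quantities

non_inf_plasma = rng.binomial(n_released, xi_p)      # Section 2.1: binomial thinning
total_flow = rng.negative_binomial(r, prob)          # Section 2.2: total net HIV inflow
inf_in_flow = rng.binomial(total_flow, 1.0 - xi_L)   # assumed binomial split of inflow
print(non_inf_plasma, total_flow, inf_in_flow)
```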

Let denote the net flow of HIV from the lymphoid tissues to the
plasma during the net flow of HIV from the lymphoid tissues
to the plasma during Then‚
and

The expected values of and are given respectively by:

and
where

and

2.3. DERIVATION OF STOCHASTIC DIFFERENTIAL


EQUATIONS FOR THE STATE VARIABLES
To derive stochastic equations for the state variables of the above stochastic
process‚ consider the time interval and denote by:
The number of cells generated by infection of T cells by
HIV during
The number of deaths of cells during
The number of non-infectious HIV generated by the death of
the cell during
The total number of HIV (infectious or non-infectious)
generated by the death of one cell during
The number of net flow of HIV for non-infectious and
for infectious HIV) from lymph nodes to the plasma during
The number of deaths of HIV for non- infectious
HIV and for infectious HIV) during
Then we have the following stochastic equations for the state variables

In the above equations, the variables on the right hand side are random vari-
ables. To specify the probability distributions of these variables, let
denote the HIV infection rate of T cells by free HIV in the plasma at time t; let
denote the death rate of productively HIV-infected T cells in the plasma
at time t and the rate at which free HIV or in the plasma are being
removed or die at time t. Then, the conditional probability distributions of the
above random variables are given by:

Further‚ with and we


have:

The expected values of and are given respectively by:

where

and

Let be the number of uninfected T cells and


(Note the number of uninfected T cells has been assumed to be constant
during the study period.) Define the random noises by:

Then‚ denoting by the above equations (2.1)-(2.3)


can be rewritten as the following stochastic differential equations respectively:

In equations (2.4)-(2.6), the random noises have expecta-


tion zero and are uncorrelated with the state variables and
Since the random noises are random variables associated with the random

Table 1
Variances and Covariances of the Random Noises

transitions during the interval these random noises are uncorrelated


with the random noises for all and if Further‚ since these random
noises are basically linear combinations of binomial and multinomial random
variables‚ their variances and covariances can readily be derived. These results are
given in Table 1.
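To make this bookkeeping concrete, the following sketch advances the plasma state over one small time step with binomial transition counts, in the spirit of equations (2.1)-(2.6). All parameter names and values (k, delta, c, xi, nbar, fN, fI) are stand-ins assumed here for illustration, and the number of virions released per cell death is taken as a fixed constant for simplicity.

```python
import numpy as np

rng = np.random.default_rng(3)

def step(Tstar, Vn, Vi, T0, dt, k, delta, c, xi, nbar, fN, fI):
    """Advance (T*, non-infectious HIV, infectious HIV) by one small step dt.
    Transition counts are binomial, as in Section 2.3; k, delta, c, xi stand
    in for the infection, death and clearance rates and the protease effect,
    nbar is a fixed burst size, fN and fI the net inflow terms."""
    new_inf = rng.binomial(T0, min(1.0, k * Vi * dt))   # newly infected cells
    dead = rng.binomial(Tstar, min(1.0, delta * dt))    # deaths of T* cells
    released = dead * nbar                              # virions released
    non_inf = rng.binomial(released, xi)                # non-infectious part
    lostN = rng.binomial(Vn, min(1.0, c * dt))          # cleared non-infectious
    lostI = rng.binomial(Vi, min(1.0, c * dt))          # cleared infectious
    return (Tstar + new_inf - dead,
            Vn + non_inf + fN - lostN,
            Vi + (released - non_inf) + fI - lostI)

state = (50, 0, 100_000)
for _ in range(10):
    state = step(*state, T0=2, dt=0.01, k=1e-6, delta=0.5,
                 c=3.0, xi=1.0, nbar=300, fN=0, fI=0)
print(state)
```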

3. MEAN VALUES OF
Let and denote the mean values of and
respectively. Then‚ by using equations (2.4)-(2.6)‚ we obtain:

From equation (3.3)‚ we obtain

If we follow Perelson et al. (1996) in assuming that the drug is 100% effective
(i.e. then the solution of is

Denote Then‚ since is very large‚

Since is usually very small‚ the above equations then lead to the
following equation for the approximation of

Hence an approximation for is

In the example‚ we will use this approximation of to estimate
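For comparison, the classical closed-form mean viral load of Perelson et al. (1996) under a 100% effective protease inhibitor can be evaluated directly; their model carries no lymphoid inflow term, so this is only the baseline that the present model extends, and the parameter values below are illustrative placeholders.

```python
import numpy as np

def perelson_viral_load(t, V0, c, delta):
    """Mean plasma virus load of Perelson et al. (1996) under a 100%
    effective protease inhibitor and a constant uninfected-cell count;
    c is the virion clearance rate, delta the infected-cell death rate."""
    r = c / (c - delta)
    return V0 * (np.exp(-c * t)
                 + r * (r * (np.exp(-delta * t) - np.exp(-c * t))
                        - delta * t * np.exp(-c * t)))

t = np.linspace(0.0, 7.0, 8)   # days since the start of treatment
print(perelson_viral_load(t, V0=1e5, c=3.0, delta=0.5))  # placeholder values
```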

4. A STATE SPACE MODEL FOR THE EARLY HIV


PATHOGENESIS UNDER TREATMENT BY
PROTEASE INHIBITORS
The state space model consists of a stochastic model of the system and an
observation model‚ which is a statistical model relating the observed data to
the system. Thus‚ the state space model has significant advantages over either
the stochastic model or the statistical model alone‚ since it combines information
from both models. For the state space model of HIV
pathogenesis under treatment by protease inhibitors with the flow of HIV from
lymphoid tissues to the plasma‚ the stochastic system model is that given by the
stochastic difference equations (2.4)-(2.6) of the previous section. Let be

the observed total number of RNA virus load at time Then the observation
model based on the RNA virus load is given by:

where and is the random error associated with


measuring (If T cell counts at different times are available‚ then the
observation model will contain some additional equations involving observed
T cell counts.)
In equation (4.1)‚ one may assume that the have expected value 0 and
variance and are uncorrelated with the random noises of equations (2.4)-(2.6)
of the previous section.
In the above state space model‚ the system model is nonlinear. Also‚ unlike
the classical discrete-time Kalman filter model‚ the above model has observa-
tions only at times (This is the so-called missing observations
model.) For this type of non-linear state space model‚ because the HIV infec-
tion rate is very small‚ it is shown in Tan and Xiang (1999) that the model can be
closely approximated by an extended Kalman filter model. This has in fact been
confirmed by some Monte Carlo studies given in Section 6; see also Tan and
Xiang (1998‚ 1999). In this paper we will thus use the extended Kalman filter
model to derive the estimates and the predicted numbers of the state variables

To illustrate‚ write equations (2.4)-(2.6) and (4.1) respectively as:

for

Let be an estimator of with estimated residual


and define as an unbiased estimator of if Sup-
pose that is the unbiased estimator for with covariance matrix
P(0). Then‚ starting with the procedures given in the following two
subsections provide close approximations to some optimal methods for esti-
mating and predicting The proofs of these procedures can be found in
Tan and Xiang (1999).

4.1. ESTIMATION OF GIVEN

Let be an estimator (or predictor) of given data


with estimated residual and denote the
covariance matrix of by Write
P(0) = P(0|0) and
Then, starting with with P(0) = P(0|0), the
linear, unbiased and minimum variance estimators of given the data
are closely approximated by the following recur-
sive equations:

(i) For
satisfies the following equations with boundary conditions

where is given in (iii) below:

(ii) For the covariance matrix satisfies


the following equations with boundary conditions

where is given in (iii) below:

for where

(iii) Denote by the from (i) and


from (ii). Then and are given
respectively by:

and

where

and

To implement the above procedure, one starts with and


P(0) = P(0|0). Then by (i) and (ii), one derives and for
and also derives and P(1|1) by (iii). Repeating these pro-
cedures one may derive and for
These procedures are referred to as forward filtering procedures.
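The following is a schematic of one generic extended-Kalman-filter forward step, not Tan and Xiang's exact recursions; the functions f, h and their Jacobians are placeholders for the system and observation models above, and passing y = None covers the time points with missing observations.

```python
import numpy as np

def ekf_step(x, P, f, F_jac, Q, y=None, h=None, H_jac=None, R=None):
    """One extended-Kalman-filter step: propagate the estimate through the
    nonlinear system model, then correct it when an observation is present
    (y is None at the missing-observation time points)."""
    F = F_jac(x)                       # Jacobian of f at the current estimate
    x, P = f(x), F @ P @ F.T + Q       # prediction and covariance propagation
    if y is None:
        return x, P
    H = H_jac(x)
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    x = x + K @ (y - h(x))             # measurement update
    P = (np.eye(len(x)) - K @ H) @ P
    return x, P

# toy usage: a scalar random walk observed directly
I = np.eye(1)
x, P = ekf_step(np.zeros(1), I, f=lambda x: x, F_jac=lambda x: I,
                Q=0.01 * I, y=np.array([1.0]), h=lambda x: x,
                H_jac=lambda x: I, R=0.1 * I)
print(x, P)
```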

4.2. ESTIMATION OF GIVEN WITH


AND
Let be an estimator of given the data
Let be the covariance matrix of Denote
and Then‚ starting with
and P(0) = P(0|0)‚ the linear‚ unbiased and minimum variance esti-
mators of given data are closely approximated by the
following recursive equations:

(i) For satisfies the following


equations with boundary conditions where
is given in (iii) below:

for

(ii) For satisfies the following


equations with boundary condition

where is given in (iii) below.

(iii) and for are given by the following


recursive equations:

and

where

To implement the above procedure in deriving for a given initial


distribution of at one first derives results by using formulas in Section
(4.1) (forward filtering). Then one goes backward from to 1 by using formu-
las in Section (4.2) (backward filtering).

5. AN EXAMPLE USING REAL DATA


As an illustration‚ in this section we apply the model and the theories of the
previous sections to analyze the data from a HIV-infected patient (No. 104)
who was treated by ritonavir (a protease inhibitor) as reported in the paper by
Perelson et al. (1996). For this patient‚ the RNA virus copies in the plasma
have been measured on 18 occasions within 3 weeks. The observed RNA
virus copies per of blood in the plasma are given in Table (2). (The
data were provided by Dr. David Ho of the Diamond AIDS Research Center
in New York city.) For this individual‚ at the time of initiating the treatment‚
there were only 2 T cells per of blood‚ but there were RNA
copies per of blood. Thus‚ most of the HIV must have come from lymph
nodes or other sources. To develop a stochastic model for the data in Table
(2)‚ we follow Perelson et al. (1996) in assuming that during the 3-weeks
period since treatment‚ drug resistance has not yet developed‚ and the number
of uninfected T cells remains constant. As discussed in Section (2)‚ these
assumptions appear to be justified by the observations. Thus‚ for the data

Table 2. Observed RNA Virus Copies per for Patient No. 104

in Table (2)‚ a reasonable model may be a homogeneous three-dimensional


stochastic process with a flow of HIV from
lymph nodes to the plasma. This is the model described in Section (2) with
time homogeneous parameters. That is‚ we assume
N(T) = N. Then‚ an approximate
solution for the mean is

where and

Under the assumption that for

is equivalent to assuming that the drug is 100% effective.


This is the assumption made in Perelson et al. (1996‚ 1997). Our Monte Carlo
studies seem to suggest that this assumption would not significantly distort the
results and the pattern.

To fit the above model to the data in Table (2), we use the estimates
and of Perelson et
al. (1996). We use N = 2500 from Tan and Ye (1999, 2000), with

from Tan and Xiang (1999)‚ and and

Because the decline of HIV in the lymph nodes appeared to be piece-wise


linear in log scale (Perelson et al., 1997; Tan and Ye, 1999; 2000), we as-
sume h(t) to be given by for non-
overlapping intervals
where the are constants and is the indicator function of
if and if

Assume the above approximation and the parameter values of


and take Under these
conditions‚ the best-fitted parameter values are
and

(The estimates were derived by minimizing the residual sum of squares


where is the observed RNA virus copies per unit volume of blood at the
observation times.)
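A sketch of this fitting step: parameters are chosen to minimize the residual sum of squares between observed RNA copies and the model mean. The chapter's mean-value solution is replaced here by a simple bi-exponential placeholder and the data are synthetic, since only the fitting mechanics are being illustrated.

```python
import numpy as np
from scipy.optimize import least_squares

def model_V(t, theta):
    """Bi-exponential placeholder for the mean RNA copy number V(t)."""
    A, c, B, d = theta
    return A * np.exp(-c * t) + B * np.exp(-d * t)

rng = np.random.default_rng(4)
t_obs = np.linspace(0.1, 21.0, 18)           # 18 occasions over three weeks
true_theta = (8e4, 3.0, 2e4, 0.4)            # assumed illustrative values
y_obs = model_V(t_obs, true_theta) * rng.lognormal(0.0, 0.05, size=18)

# Minimise the residual sum of squares between data and model mean.
fit = least_squares(lambda th: y_obs - model_V(t_obs, th),
                    x0=np.array([5e4, 2.0, 1e4, 0.2]))
print(fit.x)                                 # recovered parameter values
```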

Under the above specification‚ the stochastic equations for


are given by:

and

Using these estimates and the Kalman filter methods in Tan and Xiang (1998,
1999), we have estimated the number of cells, infectious HIV as well as
non-infectious HIV per of blood over time. Plotted in Figure (1) are the
observed total number of RNA copies per together with the estimates by
the Kalman filter method and the estimates by the deterministic model in two
cases (Case a: Case b: Plotted
in Figures (2)-(3) are the estimated numbers of infected T-cells and free HIV
(infectious and non-infectious HIV) respectively.

From Figure (1)‚ we observed that the Kalman filter estimates followed the
observed numbers‚ whereas the deterministic model estimates appeared to draw
a smooth line to match the observed numbers‚ and could not follow the fluctu-
ations of the observed numbers. Thus‚ there are some differences between the
two estimates within the first 8 hours although the differences are not noticeable
in the figure; however‚ after 8 hours‚ there is little difference between the two
estimates. Furthermore‚ the curves appeared to have reached a steady state low
level in 200 hours (a little more than a week).

From Figure (3)‚ we observed that at the time of starting the treatment‚ the
estimated number of infectious HIV begins to decrease very sharply and reaches

the lowest steady state level within 10 hours‚ and there is little difference be-
tween the Kalman filter estimates and the estimates of the deterministic model;
on the other hand‚ the estimated number of non-infectious HIV first increases‚
reaching the maximum in about 6-8 hours before decreasing steadily to reach a
very low steady state level in about 200 hours. Within 50 hours since treatment‚
there appeared to be significant differences between the two estimates of the
number of non-infectious HIV; after 50 hours such differences appeared to be
very small.

Comparing the two cases in Figures (1)-(3)‚ we observed that the estimates
assuming (Case b) are almost identical to the corresponding
ones assuming (Case a). These results suggest that
the delay effects are very small in this example.

Comparing results in Figures (1)-(3) with the corresponding results in Tan


and Xiang (1999)‚ one may note that the two models give very close results.
However‚ it is important to note that the model in Perelson et al. (1996) and in
Tan and Xiang (1999) assumed that all HIV were generated by the productively
HIV-infected T cells‚ while the model in Section (4) assumed that most of the
HIV came from lymph nodes. The results suggest‚ however‚ that if one is

interested only in the estimation of T cells and free HIV‚ the two models make
little difference.

6. SOME MONTE CARLO STUDIES


To justify and confirm the usefulness of the extended Kalman filter method
for the state space model in Section 4‚ we generate some Monte Carlo studies
using a computer on the basis of the model in Sections 2 and 4. The parameters
of this model were taken from those of the estimates of patient No. 104 given
above. To generate the observation model‚ we add some Gaussian noises to the
total number of the generated free HIV to produce That is‚ we
generate by the equation

where the are as given above, and is assumed to be a Gaussian


variable with mean 0 and variance with
Using the generated data we then use the same method as in Section 4 to
derive the Kalman filter estimates from the state space model, and the estimates
from the deterministic model. From these we have observed the following
results:
(1) In the estimation of the numbers of T* cells, free HIV and free
HIV, the estimates by the extended Kalman filter method appear to follow
the generated numbers very closely. These results suggest that in estimating
these numbers, one may in fact use the extended Kalman filter method as de-

scribed in Section 4. Similar results have also been obtained by Tan and Xiang
(1998, 1999) in other models of a similar nature.
(2) The estimates by using the deterministic model seem to draw a smooth line
across the generated numbers. Thus, although results of deterministic model
cannot follow the fluctuations of the generated numbers, due presumably to the
randomness of the state variables, they are still quite useful in assessing the
behavior and trend of the process.
(3) For the numbers of cells and free HIV, there are small differ-
ences between the Kalman filter estimates and the estimates of the determin-
istic model, due presumably to the small numbers of these cells. For the non-
infectious free HIV (i.e. however, there are significant differences between
the Kalman filter estimates and the estimates using the deterministic model. It
appears that the Kalman filter estimates have revealed much stronger effects of
the treatment at early times (before 10 hours), which could not be detected by
the deterministic model.

ACKNOWLEDGMENTS
The authors wish to thank the referee for his help in smoothing the English
language.

REFERENCES


Cavert‚ W.‚ D.W. Notermans‚ K. Staskus‚ et al. (1996). Kinetics of response in
lymphoid tissues to antiretroviral therapy of HIV-1 infection. Science‚ 276‚
p. 960-964.
Cohen‚ O.J.‚ G. Pantaleo‚ G.K. Lam‚ and A.S. Fauci. (1997). Studies on lym-
phoid tissue from HIV-infected individuals: Implications for the design of
therapeutic strategies. In: "Immunopathogenesis of HIV Infection‚ A.S. Fauci
and G. Pantaleo (eds.)”. Springer-Verlag‚ Berlin‚ p. 53-70.
Cohen‚ O.J.‚ D. Weissman‚ and A.S. Fauci. (1998). The immunopathogenesis
of HIV infection. In: “Fundamental Immunology‚ Fourth edition.” ed. W.E.
Paul‚ Lippincott-Raven Publishers‚ Philadelphia‚ Chapter 44‚ p. 1511-1534.
Haase‚ A.T.‚ K. Henry‚ M. Zupancic‚ et al. (1996). Quantitative image analysis
of HIV-1 infection in lymphoid tissues. Science‚ 274‚ p. 985-989.
Fauci‚ A.S. and G. Pantaleo. (Eds.) (1997). Immunopathogenesis of HIV Infec-
tion. (Springer- Verlag‚ Berlin‚ 1997.)
Fauci‚ A.S. (1996). Immunopathogenic mechanisms of HIV infection. Annals
of Internal Medicine 124‚ p. 654-663.
Kirschner‚ D.E. and G.F. Webb. (1997). Resistance‚ remission‚ and qualitative
differences in HIV chemotherapy. Emerging Infectious Diseases‚ 3‚ p. 273-
283.

Lafeuillade‚ A.‚ C. Poggi‚ N. Profizi‚ et al. (1996). Human immunodeficiency


virus type 1 kinetics in lymph nodes compared with plasma. The Jour. Infec-
tious Diseases‚ 174‚ p. 404-407.
Mittler‚ J.E.‚ B. Sulzer‚ A.U. Neumann‚ and A.S. Perelson. (1998). Influence of
delayed viral production on viral dynamics in HIV-1 infected patients. Math.
Biosciences‚ 152‚ p. 143-163.
Perelson‚ A.S.‚ A.U. Neumann‚ M. Markowitz‚ et al. (1996). HIV-1 dynamics
in vivo: Virion clearance rate‚ infected cell life-span‚ and viral generation
time. Science‚ 271‚ p. 1582-1586.
Perelson‚ A.S.‚ O. Essunger‚ Y.Z. Cao‚ et al. (1997). Decay characteristics
of HIV infected compartments during combination therapy. Nature‚ 387‚ p.
188-192.
Piatak‚ M. Jr.‚ M.S. Saag‚ L.C. Yang‚ et al. (1993). High levels of HIV-1 in plasma
during all stages of infection determined by competitive PCR. Science‚ 259‚
p. 1749-1754.
Tan‚ W.Y. and H. Wu. (1998). Stochastic modeling of the dynamics of CD4(+)
T cells by HIV infection and some Monte Carlo studies. Math. Biosciences‚
147‚ p. 173-205.
Tan‚ W.Y. and Z.H. Xiang. (1998). State Space Models for the HIV pathogenesis.
In: " Mathematical Models in Medicine and Health Sciences”‚ eds. M.A.
Horn‚ G. Simonett and G. Webb. (Vanderbilt University Press‚ Nashville‚
TN‚ 1998)‚ p. 351-368.
Tan‚ W.Y. and Z.H. Xiang. (1999). A state space model of HIV pathogene-
sis under treatment by anti-viral drugs in HIV-infected individuals. Math.
Biosciences‚ 156‚ p. 69-94.
Tan‚ W.Y. and Z.Z. Ye. (1999). Stochastic modeling of HIV pathogenesis under
HAART and development of drug resistance. (Proceeding of the International
99’ ISAS meeting‚ Orlando‚ Fl. 1999.)
Tan‚ W.Y. and Z.Z. Ye. (2000). Assessing effects of different types of HIV and
macrophage on the HIV pathogenesis by stochastic models of HIV pathogen-
esis in HIV-infected individuals. Jour. Theoretical Medicine‚ 2‚ p. 245-265.
Weiss‚ R.A. (1996). HIV receptors and the pathogenesis of AIDS. Science‚ 272‚
p. 1885-1886.
Chapter 6

THE IMPACT OF RE-USING HYPODERMIC NEEDLES

B. Barnes
School of Mathematical Sciences
Australian National University
Canberra‚ ACT 0200
Australia*

J. Gani
School of Mathematical Sciences
Australian National University
Canberra‚ ACT 0200
Australia

Abstract This paper considers an epidemic which is spread by the re-use of unsterilized
hypodermic needles. The model leads to a geometric type distribution with
varying success probabilities‚ which is applied to model the propagation of the
Ebola virus.

Keywords: epidemic and its intensity‚ Ebola virus‚ geometric type distribution.

1. INTRODUCTION
In her book “The Coming Plague”, Laurie Garrett (1994) describes how the
shortage of hypodermic needles in Third World countries, as well as shared
needles in most countries, is responsible for the spread of infectious diseases.
She cites one case from the Soviet Union in 1988, where 3 billion injections
were administered, although only 30 million needles were manufactured. In
one instance of shared needles that year, a single hospitalised HIV-infected
baby led to the infection of all other infants in the same ward.

* This project was funded by the Centre for Mathematics and its Applications, Australian National University,
Canberra, and University College, University of New South Wales, Canberra.
We will consider the impact of a single needle, used to inject a group of
size $N$, which may contain $a$ infectives and $s = N - a$ susceptibles, without
sterilisation of the needle between uses on the individuals. Thus, if the person
injected at stage $j$ is the first infective on which the needle is used, then the
$I = s - j + 1$ susceptibles injected subsequently will all be infected. Clearly
$I = s$ if all susceptibles are infected, while $I = 0$ if none of the
susceptibles is infected.
In the following sections we derive a distribution for the number of infected
susceptibles. We then consider the impact of using several needles, and apply
the results to the 1976 spread of the Ebola virus in Central Africa.

2. GEOMETRIC DISTRIBUTION WITH VARIABLE SUCCESS PROBABILITY
The standard geometric distribution is a waiting-time distribution for independent Bernoulli random variables $X_1, X_2, \ldots$ such that $P(X_i = 1) = p$ and $P(X_i = 0) = q = 1 - p$, with $0 < p < 1$. The waiting time $T$, until the first success, is given by

$P(T = k) = q^{k-1} p, \qquad k = 1, 2, \ldots,$

with the probability generating function (pgf)

$E(z^T) = pz/(1 - qz).$

Suppose now that the probability of success $p_k$ varies with each trial, so that

$P(X_k = 1) = p_k, \qquad P(X_k = 0) = q_k = 1 - p_k,$

with $0 < p_k < 1$ (see Feller (1968)). The distribution of the time $T$ until a first success is given by

$P(T = k) = p_k \prod_{i=1}^{k-1} q_i.$

In the particular case of infection by re-used needles, when we start with $s = N - a$ susceptibles, $a$ infectives, and $I$ denotes the number of new infectives, we have

$p_k = \frac{a}{N - k + 1}, \qquad q_k = \frac{s - k + 1}{N - k + 1}, \qquad k = 1, \ldots, s + 1. \quad (2.5)$

We thus have a distribution for the random variable $I$, the number of susceptibles infected when the person injected at stage $T = s - I + 1$ is the first infective. We can write it in general as

$P(I = i) = \frac{a \, s! \, (N - s + i - 1)!}{i! \, N!}, \qquad i = 0, 1, \ldots, s. \quad (2.6)$
3. VALIDITY OF THE DISTRIBUTION

We provide a simple example which suggests this to be a valid distribution,
and then prove that it is indeed so in general.
Let $N = 3$ and $a = 1$, so that $s = 2$. We need to show that the sum over all $i$
is 1. From equation (2.5), $p_k = 1/(4-k)$, and thus

$P(I = 0) = P(I = 1) = P(I = 2) = \tfrac{1}{3}, \quad (3.7)$

so the probabilities sum to 1.
We can show $P(I = i)$ to be an honest distribution in general, by applying
a simple binomial identity. Consider the sum of probabilities in the form of
equation (2.6),

$\sum_{i=0}^{s} \frac{a \, s! \, (N - s + i - 1)!}{i! \, N!}. \quad (3.8)$

We need to prove this sum equal to 1, that is,

$a \, s! \sum_{i=0}^{s} \frac{(N - s + i - 1)!}{i!} = N!,$

or

$\sum_{i=0}^{s} \binom{N - s + i - 1}{i} = \binom{N}{s}. \quad (3.9)$

Now applying the well known binomial identity, for $r, n$ both positive
integers (see Abramowitz and Stegun, 1964),

$\sum_{k=0}^{n} \binom{r + k}{k} = \binom{r + n + 1}{n} \quad (3.10)$

to the left hand side of equation (3.9), and writing $r = N - s - 1$, $n = s$, the equality is
proved as required.

4. MEAN AND VARIANCE OF I

The same identity (equation (3.10)) and technique can be used to establish
expressions for the mean and variance of the distribution. Suppose we wish to
determine the mean number of susceptibles infected; then

$E(I) = \sum_{i=0}^{s} i \, P(I = i) = \sum_{i=1}^{s} i \, \frac{a \, s! \, (N - s + i - 1)!}{i! \, N!},$

since $i \, P(I = i) = 0$ when $i = 0$. Hence, using identity (3.10), we obtain

$E(I) = \frac{s \, a}{a + 1} = \frac{a (N - a)}{a + 1}, \quad (4.11)$

which provides an expression for the mean value.


The variance of $I$ is

$\mathrm{var}(I) = E(I^2) - [E(I)]^2. \quad (4.12)$

In order to evaluate it we first find an expression for $E(I^2)$. From equation
(3.8), this is

$E(I^2) = \frac{a s \,[\, N + 1 + a s (a + 2) \,]}{(a + 1)^2 (a + 2)}. \quad (4.13)$

Hence we obtain (from (4.12) and (4.13)),

$\mathrm{var}(I) = \frac{a (N - a)(N + 1)}{(a + 1)^2 (a + 2)}. \quad (4.14)$

Applying these results to the simple example of §3, with $N = 3$, $a = 1$ and $s = 2$,
we have, from equation (4.11), $E(I) = 1$, and from equation (4.14), $\mathrm{var}(I) = 2/3$.
5. INTENSITY OF EPIDEMIC
It is of interest to estimate the intensity of the epidemic, defined here as the
average proportion of susceptibles in a group of N individuals who are infected
when the same needle is used to inject them. This proportion is

$E(I)/s = \frac{a}{a + 1}. \quad (5.15)$

It has a least value of 1/2 when $a = 1$, that is, when there is only a
single infective in the population. The proportion rises rapidly as the number
of susceptibles decreases, and if $s = 1$ then $E(I)/s = (N - 1)/N$, which is
close to 1 for large N. The variance of $I/s$ is

$\mathrm{var}(I)/s^2 = \frac{a (N + 1)}{s (a + 1)^2 (a + 2)}.$

Applying these proportions to our simple example (3.7) we have $E(I)/s = 1/2$
and $\mathrm{var}(I/s) = 1/6$.
6. REDUCING INFECTION
An obvious and simple strategy for reducing the risk of infection is to divide
the population into K groups (K = 2, 3, 4, ...) of N/K individuals, where N/K
is an integer, and sterilize the needle after vaccinating each group.
Using a subscript $k = 1, \ldots, K$ to number the groups, with $s_k$ susceptibles and
$a_k$ infectives in group $k$, we have

$E(I_k) = \frac{s_k a_k}{a_k + 1}, \quad (6.16)$

where originally we had

$E(I) = \frac{s \, a}{a + 1}. \quad (6.17)$

We can show that

$\sum_{k=1}^{K} E(I_k) \le E(I). \quad (6.18)$

The sum of the quantities from equation (6.16) can be expressed in the following
form

$\sum_{k=1}^{K} E(I_k) = \sum_{k=1}^{K} s_k \, \frac{a_k}{a_k + 1}. \quad (6.19)$

For the original equation (6.17) for E(I) we have

$E(I) = s \, \frac{a}{a + 1} = \sum_{k=1}^{K} s_k \, \frac{a}{a + 1}. \quad (6.20)$

Now, since $a_k \le a$ for every $k$ (with $a_k < a$ for at least some $k$), the function
$x/(x + 1)$ is increasing, and each $s_k$ is non-negative by definition,
comparing equation (6.20) with equation (6.19) term by term, we have the inequality
(6.18).
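The inequality is easy to verify numerically; a minimal sketch, with an arbitrary illustrative split of the infectives, follows:

```python
# Numerical check of inequality (6.18): sterilizing between K groups can
# only lower the expected number of new infections. The split below is an
# arbitrary illustrative choice.
def expected_new(s, a):
    return s * a / (a + 1)          # equation (4.11), restated in (6.16)/(6.17)

N, a, K = 120, 9, 4
m = N // K                          # group size
split = [3, 3, 2, 1]                # one way the a infectives may fall
assert sum(split) == a
grouped = sum(expected_new(m - ak, ak) for ak in split)
single = expected_new(N - a, a)
print(round(grouped, 1), "<=", round(single, 1))   # 73.7 <= 99.9
```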
7. THE SPREAD OF THE EBOLA VIRUS IN 1976

Although it is well understood today that disease is easily spread through
the use of shared needles, this was not so well known in 1976 in Yambuku, a
central African town in the Republic of Zaire, where the re-use of unsterilized
hypodermic needles turned out to be a most efficient and deadly amplifier for the
Ebola virus. Whilst the exact manner in which the disease emerged in August
1976 is uncertain, the consensus is that a single traveller, foreign to those parts,
who came to the Catholic Mission Hospital at Yambuku on 28 August with a
mysterious illness, carried with him the Ebola virus. He was hospitalised for 2
days and then disappeared.
At the time, Yambuku hospital housed 150 beds in 8 wards, and treated
between 300 and 600 patients and outpatients each day. Many were given injections,
but with only 5 needles, which were re-used many times over. The
sterilization equipment consisted of a 30 litre autoclave and boiling water atop a Primus
stove. Most outpatients were pregnant women, visiting the hospital for check-ups,
and they were injected with a vitamin B complex.
We can use the above model to get some idea of the rapid spread of the virus.
We will consider the use of K = 5 needles on a population of size N, in which there
are $a$ infectives. In general, we have the $a$ infectives distributed between K groups
of equal size N/K, such that $a_1 + a_2 + \cdots + a_K = a$. Assume that each possible
assignment of the $a$ infectives to the K groups is equally likely.
We are interested in the total number of new infectives after each set of injections,
which can be expressed as

$I = \sum_{k=1}^{K} I_k. \quad (7.21)$

With $a_k$ infectives in a particular group, and $P(a_k = j)$ the probability of $j$ infectives
occurring in that group,

$E(I_k) = \sum_{j=0}^{a} P(a_k = j) \left( \frac{N}{K} - j \right) \frac{j}{j + 1},$

where $I_k$ is the number of new infectives generated by the $a_k$ initial infectives in group $k$. The
probability $P(a_k = j)$ can be expressed as a binomial distribution

$P(a_k = j) = \binom{a}{j} \left( \frac{1}{K} \right)^{j} \left( 1 - \frac{1}{K} \right)^{a - j},$

where it is assumed any infective has the probability 1/K of belonging to a
particular group. Thus, with K groups, the expected total number of new
infectives is

$E(I) = K \sum_{j=0}^{a} \binom{a}{j} \left( \frac{1}{K} \right)^{j} \left( 1 - \frac{1}{K} \right)^{a - j} \left( \frac{N}{K} - j \right) \frac{j}{j + 1}.$

This expression for E(I) can be simplified. Set $q = 1 - 1/K$; then, since
$j/(j+1) = 1 - 1/(j+1)$ and

$\sum_{j=0}^{a} \binom{a}{j} \left( \frac{1}{K} \right)^{j} q^{a - j} \, \frac{1}{j + 1} = \frac{K \,(1 - q^{a+1})}{a + 1},$

which gives

$E(I) = N - a + K - \frac{K (N + K)(1 - q^{a+1})}{a + 1}.$
Returning to the available data concerning the Ebola virus in Garrett (1994),
the epidemic began on September 5, 1976. In that first month, 38 of the 300
Yambuku residents died from Ebola. The hospital was closed to new patients
after 25 September, a quarantine was imposed on the region on 30 September,
and by October 9 the virus was observed to be in recession. By November, 46
villages had been affected, with 350 deaths recorded. The probability of contracting
the virus from an infected needle was estimated to be 90%, and once
contracted by this means the probability of death was between 88% and 92.5%,
or possibly higher. After the symptoms appear, the Ebola virus takes approximately
10 days to kill a victim, through particularly painful and gruesome
means; the patient, losing hair, skin and nails, literally bleeds to death as the
membranes containing the body fluids disintegrate and the body organs liquefy.
From the data we can establish estimates for parameters in the model. Of
the primary cases, 72 of the 103 were from injection at the hospital. We therefore
assume that 0.7 of the deaths resulting from the virus were a direct result of
infected needles. Thus, four weeks after the onset of the epidemic, of the 38
deaths in the Yambuku region, we estimate 0.7 × 38 = 26.6 were caused by
infected needles. Towards the end of the epidemic, by November, of the 350
recorded deaths we estimate that 0.7 × 350 = 245 were a result of infected
needles. (The virus was also spread through body fluid contact, and in one cited
case, a single individual spread the disease to 21 family members and friends, 18
of whom died as a result. However, in general, individuals who contracted the
disease in this manner had a 43% chance of survival. This method of infection
is not considered in our model.) We take the period between infection by the
Ebola virus and death to be 21 days. This follows as the incubation period
was between 2 and 21 days, typically 4 to 16 days, and the time until death after
the appearance of the major symptoms was approximately a week to 10 days. The
probability of infection through the re-use of a needle is difficult to estimate
from the available data. Taking into account the existing, but insufficient,
sterilization process described above, it has been estimated as 1/50, and a
number of different values have been compared (see Figure 6.1, with values
0.03, 0.05, 0.07 and 0.1). The probability of death following infection from
a needle is between 88% and 92.5%, and is taken here as 88%. Recall that
the probability of infection from a single infected injection at the hospital was
given as 90%.
The results of simulations for the number of victims of the Ebola virus from
5 September to November 1976, resulting from the introduction of a single
infective into the Yambuku hospital, are illustrated in Figures 6.1, 6.2 and 6.3.
Infection through means other than infected needles has not been considered.
The number of hospital beds at Yambuku was 150, and between 300 and 600
patients were treated each day, including those in the hospital. We assume that
250 of these received an injection on a single day, using 5 different needles.
The expected number of new infectives is calculated from equation (7.21).
As is illustrated in Figure 6.1, for the number of deaths caused by infection
through re-use of needles, the model predicts 26.4 deaths after 4 weeks, and
over the following months, after November, the number of deaths approaches
245 in the limit. By the end of November the model predicts a total of 236
deaths. These figures compare well with the data given above: 38 deaths in the
first month (28 days), 0.7 × 38 = 26.6 of which are due to infection through
needles, and 350 deaths by November, 0.7 × 350 = 245 of which are expected
to have been from infected needles. Furthermore, the model is in agreement
with the data which states that by 10 October, the disease was in recession. This
is illustrated in Figure 6.2. The epidemic is seen to peak and begin to decline in
the first week of October, between day 30 and day 40. While the closure of the
hospital to new patients, with a drastic reduction in the susceptible population,
would certainly have had an impact on this decline, it is clear (Figure 6.2) that
the number of new infectives was declining by day 15 to 20, that is from 20 to 25
September. Figure 6.3 graphs the number of susceptibles and infectives in the
hospital over time, illustrating the proportion of infectives to susceptibles and
demonstrating a gradual decline of the epidemic in late September, followed
by a sharp decline when the hospital was closed and the region quarantined.
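The following sketch indicates the kind of day-by-day bookkeeping that can produce such curves. It is our own deterministic simplification (expected values only, a fixed 21-day infective period, and the needle-infection probability folded in as a single factor), reusing expected_new_infectives from the sketch above; it is not the authors' simulation code.

```python
# Rough expected-value iteration of the Yambuku epidemic through the
# needle channel only; parameter values are those quoted in the text.
def run_epidemic(days=90, patients=250, K=5, p_inf=1/50, p_death=0.88, lag=21):
    new = [1.0] + [0.0] * days      # expected new infectives per day; day 0: index case
    for day in range(1, days + 1):
        # infectives assumed still circulating among the day's patients
        a = min(round(sum(new[max(0, day - lag):day])), patients - 1)
        new[day] = p_inf * expected_new_infectives(patients, a, K)
    deaths = p_death * sum(new)     # deaths eventually caused by these infections
    return new, deaths

new, deaths = run_epidemic()
print(round(deaths, 1))             # of the same order as the totals quoted above
```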
The damage caused by the Ebola virus in 10 days is comparable with that
caused by AIDS over a period of 10 years. It is no wonder it is considered the
second most lethal disease of the century.

8. CONCLUSIONS
A geometric type distribution model has been developed for the spread of
an infection through the re-use of infected needles. It provides a reasonable
description of the dynamics of the Ebola virus epidemic in Yambuku, in 1976,
although the available data are scarce. Further work, comparing this model with
the spread of a disease through contact with some external source, using the
Greenwood and Reed–Frost chain binomial models, as well as with the dynamics of
a deterministic model, is planned to provide insight into the differences in
the propagation of epidemics.

REFERENCES
Abramowitz, M. and I. A. Stegun. (1964). Handbook of Mathematical Func-
tions. National Bureau of Standards, Washington D. C.
Feller, W. (1968). An Introduction to Probability Theory and its Applications.
John Wiley, New York.
Garrett, L. (1994). The Coming Plague: Newly Emerging Diseases in a World
out of Balance. Penguin, New York.
Chapter 7

NONPARAMETRIC FREQUENCY DETECTION AND


OPTIMAL CODING IN MOLECULAR BIOLOGY

David S. Stoffer
Department of Statistics
University of Pittsburgh
Pittsburgh, PA 15260

Abstract The concept of spectral envelope for analyzing periodicities in categorical-valued
time series was introduced in the statistics literature as a computationally simple
and general statistical methodology for the harmonic analysis and scaling of
non-numeric sequences. One benefit of this technique is that it combines nonparametric
statistical analysis with modern computer power to quickly search
for diagnostic patterns within long sequences. An interesting area of application
is the study of nucleosome positioning signals and optimal alphabets in long DNA
sequences. The examples focus on period lengths in nucleosome signals and on
optimal alphabets in herpesviruses, and we point out some inconsistencies in
established gene segments.

Keywords: Spectral Analysis, Optimal Scaling, Nucleosome Positioning Signals, Herpesviruses, DNA Sequences.

1. INTRODUCTION
Rapid accumulation of genomic sequences has increased demand for meth-
ods to decipher the genetic information gathered in data banks such as Gen-
Bank. While many methods have been developed for a thorough micro-analysis
of short sequences, there is a shortage of powerful procedures for the macro-
analyses of long DNA sequences. Combining statistical analysis with modern
computer power makes it feasible to search, at high speeds, for diagnostic pat-
terns within long sequences. This combination provides an automated approach
to evaluating similarities and differences among patterns in long sequences and
aids in the discovery of the biochemical information hidden in these organic
molecules.
Briefly, a DNA strand can be viewed as a long string of linked nucleotides.
Each nucleotide is composed of a nitrogenous base, a five carbon sugar, and a
phosphate group. There are four different bases that can be grouped by size, the
pyrimidines, thymine (T) and cytosine (C), and the purines, adenine (A) and
guanine (G). The nucleotides are linked together by a backbone of alternating
sugar and phosphate groups, with the 5′ carbon of one sugar linked to the 3′
carbon of the next, giving the string direction. DNA molecules occur naturally
as a double helix composed of polynucleotide strands with the bases facing
inwards. The two strands are complementary, so it is sufficient to represent
a DNA molecule by a sequence of bases on a single strand. Thus, a strand
of DNA can be represented as a sequence of letters, termed base pairs (bp),
from the finite alphabet {A, C, G, T}. The order of the nucleotides contains
the genetic information specific to the organism. Expression of information
stored in these molecules is a complex multistage process. One important task
is to translate the information stored in the protein-coding sequences (CDS)
of the DNA. A common problem in analyzing long DNA sequence data is in
identifying CDS that are dispersed throughout the sequence and separated by
noncoding regions (which make up most of the DNA). For example, the
entire DNA sequence of a small organism such as the Epstein-Barr virus (EBV)
consists of approximately 172,000 bp. Table 1 shows part of the EBV DNA
sequence.
The idea of rotational signals for nucleosome positioning is based on the
fact that the nucleosomal DNA is tightly wrapped around its protein core. The
bending of the wound DNA requires compression of the grooves that face to-
ward the core and a corresponding widening of the grooves facing the outside.
Because, depending on the nucleotide sequence, DNA bends more easily in one
plane than another, Trifonov and Sussman (1980) proposed that the association
between the DNA sequence and its preferred bending direction might facilitate
the necessary folding around the core particle. This sequence dependent bend-
ability motivated the theoretical and experimental search for rotational signals.
These signals were expected to exhibit some kind of periodicity in the sequence,
reflecting the structural periodicity of the wound nucleosomal DNA.
While model calculations as well as experimental data strongly agree that
some kind of periodic signal exists, they largely disagree about the exact type
of periodicity. A number of questions remain unresolved. Do the periodicities
in rotational signals occur predominantly in dinucleotides, in trinucleotides, or
even in higher-order patterns? Ioshikhes et al (1992) reported evidence
for dinucleotide signals, while the analysis of Satchwell et al (1986) resulted
in a trinucleotide pattern that was supported by data from Muyldermans and
Travers (1994). Which nucleotide alphabets are involved
in rotational signals? Satchwell et al (1986) used a strong, S = (G, C), versus
weak, W = (A, T), hydrogen bonding alphabet to propose one signal, while
Zhurkin (1985) suggested the purine-pyrimidine alphabet with another pattern,
and Trifonov and coworkers propose a different motif. What is the exact period
length? The helical repeat of free DNA is about 10.5 bp, but the periodicities of
rotational signals tend to be slightly shorter than 10.5 in general, for example:
10.1 bp in Shrader and Crothers (1990), 10.2 bp in Satchwell et al (1986), 10.3 bp
in Bina (1994), and 10.4 bp in Ioshikhes et al (1992). Consistent with all these
data is the proposition by Shrader and Crothers (1992) that nucleosomal DNA
is overwound by about 0.3 bp per turn. Are there other periodicities besides
the approximate 10 bp period? Uberbacher et al (1988) observed several additional
periodic patterns of lengths 6 to 7, 10, and 21 bp. Bina (1994) reports a
TT-period of 6.4 bp.
Of course one could extend this list of controversial questions about the
properties and characteristics of positioning signals. Depending on the choice
among these divergent observations and claims, different sequence-directed
algorithms for nucleosomic mapping have been developed, for example, by
Mengeritsky and Trifonov (1983), Zhurkin (1983), Drew and Calladine (1987),
Uberbacher et al (1988), and Pina et al (1990). An attempt to analyze existing
data by the spectral envelope (Stoffer et al, 1993a) could result in a more
unified picture about the major periodic signals that contribute to nucleosome
positioning. This, in turn, might lead to a new reliable and efficient way to
predict nucleosome locations in long DNA sequences by computer.
In addition to positioning, the spectral envelope could prove to be a useful
tool in examining codon usage. Regional fluctuations in G+C content not
only influence silent sites but seem to create a general tendency in high G+C
regions toward G+C rich codons (G+C pressure); see Bernardi and Bernardi
(1985) and Sueoka (1988). Schachtel et al (1991) compared two closely related
herpesviruses and showed that for pairs of homologous genes, G+C
frequencies differed in all three codon positions, reflecting the large difference
in their global G+C content. In perfect agreement with their overall compositional
bias, the usage for each individual amino acid type was shifted significantly
toward codons of preferred G+C content. Several authors reported codon
context related biases (see Buckingham 1990, for a review). Blaisdell (1983)
observed that bases at codon site three are chosen to be unlike the neighboring bases to
the left and to the right with respect to the strong-weak (S-W) alphabet. While
the various studies on codon usage exhibit many substantial differences, most
of them agree on one point, namely the existence of some kind of periodicity
in coding sequences. This widely accepted observation is supported by the
spectral envelope approach, which shows a very strong period-three signal in
genes that disappears in noncoding regions. This method may even be helpful
in detecting wrongly assigned gene segments, as will be seen. In addition, the
spectral envelope provides not only the optimal period lengths but also the most
favorable alphabets, for example {S, W} or {G, H}, where H = (A, C, T). This
analysis might help decide which among the different suggested patterns [such
as RNY, GHN, etc., where R = (A, G), Y = (C, T), and N is anything] are the
most valid.
The spectral envelope methodology is computationally fast and simple be-
cause it is based on the fast Fourier transform and is nonparametric (that is, it is
model independent). This makes the methodology ideal for the analysis of long
DNA sequences. Fourier analysis has been used in the analysis of correlated
data (time series) since the turn of the twentieth century. Of fundamental inter-
est in the use of Fourier techniques is the discovery of hidden periodicities or
regularities in the data. Although Fourier analysis and related signal processing
are well established in the physical sciences and engineering, they have only
recently been applied in molecular biology. Because a DNA sequence can be
regarded as a categorical-valued time series it is of interest to discover ways in
which time series methodologies based on Fourier (or spectral) analysis can be
applied to discover patterns in a long DNA sequence or similar patterns in two
long sequences.
One naive approach for exploring the nature of a DNA sequence is to assign
numerical values (or scales) to the nucleotides and then proceed with standard
time series methods. It is clear, however, that the analysis will depend on the
particular assignment of numerical values. Consider the artificial sequence
ACGT ACGT ACGT... . Then, setting A = G = 0 and C = T = 1 yields the
numerical sequence 0101 0101 0101..., or one cycle every two base pairs (that
is, a frequency of oscillation of 1/2, or a period of oscillation of length
2). Another interesting scaling is A = 1, C = 2, G = 3, and T = 4,
which results in the sequence 1234 1234 1234..., or one cycle every four bp
(a frequency of 1/4, or a period of length 4). In this example, both scalings (that is, {A, C, G, T} = {0, 1, 0, 1}
and {A, C, G, T} = {1, 2, 3, 4}) of the nucleotides are interesting and bring out
different properties of the sequence. It is clear, then, that one does not want to
focus on only one scaling. Instead, the focus should be on finding all possible
scalings that bring out interesting features of the data. Rather than choose values
arbitrarily, the spectral envelope approach selects scales that help emphasize any
periodic feature that exists in a DNA sequence of virtually any length in a quick
and automated fashion. In addition, the technique can determine whether a
sequence is merely a random assignment of letters.
Fourier analysis has been applied successfully in molecular genetics; McLachlan
and Stewart (1976) and Eisenberg et al (1984) studied the periodicity in proteins
with Fourier analysis. They used predefined scales (for example, the hydrophobicity
alphabet) and observed the frequency of amphipathic helices.
Because predetermination of the scaling is somewhat arbitrary and may not
be optimal, Cornette et al (1987) reversed the problem: starting from a given
frequency, they proposed a method to establish an ‘optimal’ scaling at that
frequency. In this setting, optimality roughly refers to the fact that the scaled
(numerical) sequence is maximally correlated with the sinusoid that oscillates
at the given frequency. Viari et al (1990) generalized this approach to a systematic
calculation of a type of spectral envelope and of
the corresponding optimal scalings over all fundamental frequencies. While the
aforementioned authors dealt exclusively with amino acid sequences, various
forms of harmonic analysis have been applied to DNA by, for example, Tavaré
and Giddings (1989), and in connection to nucleosome positioning by Satchwell
et al (1986) and Bina (1994). Recently, Stoffer et al (1993a) proposed the
spectral envelope as a general technique for analyzing categorical-valued time
series in the frequency domain. The basic technique is similar to the methods
established by Tavaré and Giddings (1989) and Viari et al (1990); however,
there are some differences. The main difference is that the spectral envelope
methodology is developed in a statistical setting that allows the investigator to
distinguish between significant results and those results that can be attributed
to chance. In particular, tests of significance and confidence intervals can be
calculated using large sample techniques.

2. THE SPECTRAL ENVELOPE

Briefly, spectral analysis has to do with partitioning the variance of a stationary
time series, $x_t$, into components of oscillation indexed
by frequency $\omega$, measured in cycles per unit of time, for $-1/2 < \omega \le 1/2$.
Given a numerical-valued time series sample, $x_1, \ldots, x_n$, that has been
centered by its sample mean, the sample spectral density (or periodogram) is
defined in terms of frequency $\omega$ as

$I(\omega) = \frac{1}{n} \left| \sum_{t=1}^{n} x_t e^{-2\pi i \omega t} \right|^2. \quad (2.1)$
134 MODELING UNCERTAINTY

The periodogram is essentially the squared correlation of the data with the sines
and cosines that oscillate at frequency $\omega$. For example, if $x_t$ represents hourly
measurements of a person's body temperature that happens to oscillate at a rate
of one cycle every 24 hours, then $I(\omega)$ will be large at $\omega = 1/24$ because the data will
be highly correlated with the cosine and/or sine term that oscillates at a cycle
of 1/24, but $I(\omega)$ at other values of $\omega$ will be small.
Although not the optimal choice of a definition, the spectral density $f(\omega)$ of
the time series can be defined as the limit, as the sample size tends to infinity,
of $E[I(\omega)]$, provided that it exists. It is worthwhile to note that $f(\omega) \ge 0$,
$f(\omega) = f(-\omega)$, and

$\sigma^2 = \int_{-1/2}^{1/2} f(\omega) \, d\omega = 2 \int_{0}^{1/2} f(\omega) \, d\omega, \quad (2.2)$

where $\sigma^2 = \mathrm{var}(x_t)$. Thus, the spectral density can be thought of as the
variance density of a time series relative to frequency of oscillation. That is,
for positive frequencies between 0 and 1/2, the proportion of the variance that
can be attributed to oscillations in the data at frequencies in a neighborhood of
$\omega$ is roughly $2 f(\omega)\, d\omega$. If the time series is white noise, that is, $x_t$ and $x_s$ are
uncorrelated for $t \ne s$ and $\mathrm{var}(x_t) = \sigma^2$ for all $t$, then $f(\omega) \equiv \sigma^2$; that
is, a uniform distribution. This is interpreted as all frequencies being present
at the same power (variance), and hence the name white noise, from an analogy
to white light, indicating that all possible periodic oscillations are present with
equal strength.
If $n$ is a highly composite integer, the fast Fourier transform (FFT) provides
for extremely fast calculation of $I(\omega_j)$ for $\omega_j = j/n$, $j = 0, 1, \ldots, \lfloor n/2 \rfloor$, where
$\lfloor n/2 \rfloor$ is the greatest integer less than or equal to $n/2$. If $n$ is not highly composite,
one may remove some observations or pad the series with zeros (see Shumway
and Stoffer, 2000, §3.5). The frequencies $\omega_j = j/n$ are called the fundamental
(or Fourier) frequencies. The sample equivalent of the integral equation (2.2)
is

$s^2 = \frac{2}{n} \sum_{j=1}^{\lfloor (n-1)/2 \rfloor} I(\omega_j) + \frac{1}{n} I(1/2), \quad (2.3)$

where $s^2$ is the sample variance of the data; the last term is dropped if $n$ is odd.
One usually plots the periodogram, $I(\omega_j)$, versus the fundamental frequencies
for $j = 1, \ldots, \lfloor n/2 \rfloor$, and inspects the graph for large values. As
previously mentioned, large values of the periodogram at $\omega_j$ indicate that the
data are highly correlated with the sinusoid that is oscillating at a frequency of
$j$ cycles in $n$ observations.
As a simple example, Figure 2.1 shows a time plot of 128 observations
generated by

$x_t = 2\cos(2\pi \omega_0 t + \phi) + z_t, \qquad t = 1, \ldots, 128, \quad (2.4)$

where $\omega_0$ is the frequency of oscillation, $\phi$ is a phase
shift, and $z_t$ ~ iid N(0,1); the cosine signal, $2\cos(2\pi \omega_0 t + \phi)$, is superimposed
on the data in Figure 2.1. Figure 2.2 shows the standardized periodogram
of the data shown in Figure 2.1. Note that there is a large value of
the periodogram at $\omega_0$ and small values elsewhere (if there were no
noise in (2.4) then the periodogram would be non-zero only at $\omega_0$).
Because—no matter how large the sample size—the variance of the periodogram
is unduly large, the graph of the periodogram can be very choppy. To overcome
this problem, a smoothed estimate of the spectral density is typically used. One
form of an estimate is

$\hat f(\omega_j) = \sum_{i=-m}^{m} h_i \, I(\omega_{j+i}), \quad (2.5)$

where the weights $h_i = h_{-i} \ge 0$ are chosen so that $\sum_{i=-m}^{m} h_i = 1$.
A simple average corresponds to the case where $h_i = 1/(2m+1)$ for
$i = -m, \ldots, m$. The number $m$ is chosen to obtain a desired degree of smoothness.
Larger values of $m$ lead to smoother estimates, but one has to be careful
not to smooth away significant peaks.
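A short numerical illustration of (2.4)–(2.5) might look as follows; the frequency $\omega_0 = 16/128$ and phase are our assumed choices, since the values used in the original figures are not reproduced here.

```python
# Periodogram of a cosine signal in noise, raw and smoothed by a simple
# moving average, in the spirit of Figures 2.1 and 2.2.
import numpy as np

rng = np.random.default_rng(0)
n, w0, phi = 128, 16 / 128, 0.6           # omega_0 and phase are our choices
t = np.arange(1, n + 1)
x = 2 * np.cos(2 * np.pi * w0 * t + phi) + rng.standard_normal(n)
x = x - x.mean()

I = np.abs(np.fft.fft(x))**2 / n          # periodogram at frequencies j/n
j = np.arange(1, n // 2 + 1)
I = I[j]

m = 2                                     # simple average over 2m+1 ordinates
h = np.ones(2 * m + 1) / (2 * m + 1)
I_smooth = np.convolve(I, h, mode="same")

print(j[np.argmax(I)] / n)                # recovers roughly w0 = 0.125
```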
An analogous theory applies if one collects $k$ numerical-valued time series,
say $x_{1t}, \ldots, x_{kt}$, for $t = 1, \ldots, n$. In this case, write $x_t = (x_{1t}, \ldots, x_{kt})'$ as
the $k \times 1$ column vector of data at time $t$. The periodogram is now a
$k \times k$ complex matrix

$I(\omega_j) = d(\omega_j)\, d(\omega_j)^*, \qquad d(\omega_j) = n^{-1/2} \sum_{t=1}^{n} x_t e^{-2\pi i \omega_j t},$

where * means to transpose and conjugate. Smoothing the periodogram can be
accomplished as in the univariate case, that is, $\hat f(\omega_j) = \sum_{i=-m}^{m} h_i I(\omega_{j+i})$.
The population spectral density matrix, $f(\omega)$, is again defined as the limit
as $n$ tends to infinity of $E[I(\omega)]$. The spectral matrix is Hermitian
and non-negative definite. The diagonal elements of $f(\omega)$, say $f_{ii}(\omega)$
for $i = 1, \ldots, k$, are the individual spectra, and the off-diagonal elements,
say $f_{il}(\omega)$ for $i \ne l$, are related to the pairwise dependence structure
among the sequences (these are called cross-spectra). Details for the spectral
analysis of univariate or multivariate time series can be found in Shumway and
Stoffer (2000, Chapters 3 and 5).
The spectral envelope is an extension of spectral analysis to the case where the data are
categorical-valued, such as DNA sequences. To briefly describe the technique
using the nucleotide alphabet, let $X_t$, $t = 1, \ldots, n$, be a DNA sequence taking
values in {A, C, G, T}. For real numbers $\beta_1, \beta_2, \beta_3, \beta_4$, not all equal,
denote the scaled (numerical) data by $X_t(\beta)$, where $X_t(\beta) = \beta_1$ if $X_t$ = A,
$\beta_2$ if $X_t$ = C, $\beta_3$ if $X_t$ = G, and $\beta_4$ if $X_t$ = T.
Then, for each frequency, we call $\beta(\omega)$ the optimal scaling at frequency $\omega$ if it
satisfies

$\lambda(\omega) = \sup_{\beta \ne c1} \left\{ \frac{f_\beta(\omega)}{\sigma^2_\beta} \right\}, \quad (2.6)$

where $f_\beta(\omega)$ is the spectral density of the scaled data, $\sigma^2_\beta = \mathrm{var}[X_t(\beta)]$ is a real number,
1 is a vector of ones, and $c$ is any constant. Note that $\lambda(\omega)$ can be thought of as
the largest proportion of the power (variance) that can be obtained at frequency
$\omega$ for any scaling of the DNA sequence, and $\beta(\omega)$ is the particular scaling
that maximizes the power at frequency $\omega$. Thus, $\lambda(\omega)$ is called the spectral
envelope. The name spectral envelope is appropriate because $\lambda(\omega)$ envelopes
the standardized spectrum of any scaled process. That is, for any assignment of numbers
to letters, the standardized spectral density of a scaled sequence is no bigger
than the spectral envelope, with equality only when the numerical assignment
is proportional to the optimal scaling, $\beta(\omega)$. We say ‘proportional to’ because
the optimal scaling vector, $\beta(\omega)$, is not unique. It is, however, unique up to
location and scale changes; that is, any scaling of the form $a\beta(\omega) + b1$ yields the
same value of the spectral envelope, where $a \ne 0$ and $b$ are real numbers.
For example, the numerical assignments {A, C, G, T} = {0, 1, 0, 1} and {A, C,
G, T} = {–1, 1, –1, 1} will yield the same normalized spectral density. The
value of $\lambda(\omega)$, however, does not depend on the particular choice of scales;
details can be found in Stoffer et al (1993a). For ease of computation, we set
one element of $\beta$ equal to zero (that is, for example, the scale for T is held
fixed at T = 0) and then proceed with the computations.
For example, to find the spectral envelope, $\lambda(\omega)$, and the corresponding
optimal scaling, $\beta(\omega)$, holding the scale for T fixed at zero, form the 3 × 1 indicator vectors

$Y_t = \big( 1_{\{X_t = A\}}, \; 1_{\{X_t = C\}}, \; 1_{\{X_t = G\}} \big)'.$

Now, with $\beta = (\beta_1, \beta_2, \beta_3)'$, the scaled sequence $X_t(\beta)$ can be obtained from
the vector sequence $Y_t$ by the relationship $X_t(\beta) = \beta' Y_t$. This relationship
implies

$f_\beta(\omega) = \beta' f(\omega) \beta \qquad \text{and} \qquad \sigma^2_\beta = \beta' V \beta,$

where $f(\omega)$ is the 3 × 3 spectral density matrix of the indicator process $Y_t$,
and $V$ is the population variance-covariance matrix of $Y_t$. Because the
imaginary part of $f(\omega)$ is skew-symmetric, the following relationship holds:
$\beta' f(\omega) \beta = \beta' f^{re}(\omega) \beta$, where $f^{re}(\omega)$ denotes the real part of $f(\omega)$. It
follows that $\lambda(\omega)$ and $\beta(\omega)$ can easily be obtained by solving an eigenvalue
problem with real-valued matrices.
An algorithm for estimating the spectral envelope and the optimal scalings
given a particular DNA sequence (using the nucleotide alphabet, {A, C, G, T},
for the purpose of example) is as follows:

1 Given a DNA sequence of length $n$, form the 3 × 1 indicator vectors $Y_t$, $t = 1, \ldots, n$,
as previously described.

2 Calculate the fast Fourier transform of the data:

$d(\omega_j) = n^{-1/2} \sum_{t=1}^{n} Y_t e^{-2\pi i \omega_j t}.$

Note that $d(\omega_j)$ is a 3 × 1 complex-valued vector. Calculate the periodogram,
$I(\omega_j) = d(\omega_j) d(\omega_j)^*$, for $j = 1, \ldots, \lfloor n/2 \rfloor$, and retain only
the real part, say $I^{re}(\omega_j)$.

3 Smooth the periodogram, that is, calculate

$\hat f^{re}(\omega_j) = \sum_{i=-m}^{m} h_i \, I^{re}(\omega_{j+i}),$

where $h_i = h_{-i} \ge 0$ are symmetric positive weights, and $m$ controls the degree of
smoothness. See Shumway and Stoffer (2000, Ch 3), for example, for
further discussions on periodogram smoothing.

4 Calculate the 3 × 3 variance-covariance matrix of the data,

$S = n^{-1} \sum_{t=1}^{n} (Y_t - \bar Y)(Y_t - \bar Y)',$

where $\bar Y$ is the sample mean of the data.

5 For each $\omega_j$, determine the largest eigenvalue and the corresponding
eigenvector of the matrix $2 n^{-1} S^{-1/2} \hat f^{re}(\omega_j) S^{-1/2}$. Note that $S^{1/2}$
is the unique square root matrix of $S$, and $S^{-1/2}$ is the inverse of that
matrix.

6 The sample spectral envelope $\hat\lambda(\omega_j)$ is the eigenvalue obtained in the
previous step. If $\hat b(\omega_j)$ denotes the eigenvector obtained in the previous
step, the optimal sample scaling is $\hat\beta(\omega_j) = S^{-1/2} \hat b(\omega_j)$; this will result
in three values, the fourth being held fixed at zero.
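For concreteness, here is a compact Python rendering of these six steps. It is our own sketch, and the overall normalizing constant (taken here as 2/n, to match the null level discussed below) may differ from the authors' implementation by a constant factor that does not affect peak locations.

```python
# Sample spectral envelope of a categorical sequence (steps 1-6 above).
import numpy as np

def spectral_envelope(seq, alphabet=("A", "C", "G", "T"), m=2):
    n, k = len(seq), len(alphabet) - 1           # last letter's scale fixed at 0
    # Step 1: k x n indicator series
    Y = np.array([[1.0 if s == a else 0.0 for s in seq] for a in alphabet[:k]])
    Yc = Y - Y.mean(axis=1, keepdims=True)
    # Step 2: FFT and periodogram matrices (real part only)
    d = np.fft.fft(Yc, axis=1) / np.sqrt(n)
    nf = n // 2
    P = np.empty((nf, k, k))
    for j in range(1, nf + 1):
        P[j - 1] = np.outer(d[:, j], d[:, j].conj()).real
    # Step 3: smooth each matrix entry with a simple average over 2m+1 points
    h = np.ones(2 * m + 1) / (2 * m + 1)
    F = np.apply_along_axis(lambda v: np.convolve(v, h, mode="same"), 0,
                            P.reshape(nf, k * k)).reshape(nf, k, k)
    # Step 4: variance-covariance matrix of the indicators
    S = np.cov(Y, bias=True)
    # Steps 5-6: eigenproblem in the S^{-1/2}-standardized metric
    w, U = np.linalg.eigh(S)
    Sih = U @ np.diag(w ** -0.5) @ U.T           # S^{-1/2}
    env = np.empty(nf)
    beta = np.empty((nf, k))
    for j in range(nf):
        lam, vec = np.linalg.eigh(Sih @ F[j] @ Sih)
        env[j] = 2.0 * lam[-1] / n               # sample spectral envelope
        beta[j] = Sih @ vec[:, -1]               # optimal scaling (last letter 0)
    return np.arange(1, nf + 1) / n, env, beta

freqs, env, beta = spectral_envelope("ACGT" * 250)
print(freqs[np.argmax(env)])                     # 0.25 (or its harmonic 0.5)
```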

Any standard programming language can be used to do the calculations;
basically, one only has to be able to compute fast Fourier transforms and eigenvalues
and eigenvectors of real symmetric matrices. Note that this procedure
can be done with any finite number of possible categories, and is not restricted
to looking only at nucleotides. Inference for the sample spectral envelope and
the sample optimal scalings is described in detail in Stoffer et al (1993a). A
few of the main results of that paper are as follows.
If $X_t$ is an uncorrelated sequence, and if no smoothing is used (that is, $m = 0$),
then the following large sample approximation based on the chi-square
distribution is valid for $0 < \omega < 1/2$:

$P\{ n \hat\lambda(\omega) > x \} \approx P\{ \chi^2_{2(c-1)} > x \},$

where $c$ is the number of letters in the alphabet being used (for example,
$c = 4$ in the nucleotide alphabet). If $\hat f$ is a consistent spectral estimator and if,
for each $\omega$, the largest root of $f^{re}(\omega)$ is distinct, then

$\sqrt{\nu}\, \{\hat\lambda(\omega) - \lambda(\omega)\}/\lambda(\omega), \qquad \sqrt{\nu}\, \{\hat\beta(\omega) - \beta(\omega)\} \quad (2.7)$

converge jointly in distribution to independent zero-mean normal
distributions, the first of which is standard normal; the covariance structure of
the asymptotic (normal) distribution of $\hat\beta(\omega)$ is given in Shumway and Stoffer
(2000, Section 5.8). The term $\nu$ in (2.7) depends on the type of estimator being
used. For example, in the case of weighted averaging as in (2.5) (taking $m \to \infty$
but $m/n \to 0$ as $n \to \infty$), we have $\nu = \big( \sum_{i=-m}^{m} h_i^2 \big)^{-1}$. If a
simple average is used, that is, $h_i = 1/(2m+1)$, then $\nu = 2m + 1$. Based
on these results, asymptotic normal confidence intervals and tests for $\lambda(\omega)$ can
be readily constructed. Similarly, for $\beta(\omega)$, asymptotic confidence ellipsoids
and chi-square tests can be constructed; details can be found in Stoffer et al.
(1993a, Theorems 3.1 – 3.3).
Peak searching for the smoothed spectral envelope estimate can be aided
using the following approximations. Using a first order Taylor expansion we
have

$\log \hat\lambda(\omega) \approx \log \lambda(\omega) + \frac{\hat\lambda(\omega) - \lambda(\omega)}{\lambda(\omega)},$

so that $\sqrt{\nu}\, [\log \hat\lambda(\omega) - \log \lambda(\omega)]$ is approximately standard normal. It also follows
that $E[\log \hat\lambda(\omega)] \approx \log \lambda(\omega)$ and $\mathrm{var}[\log \hat\lambda(\omega)] \approx 1/\nu$. If there is no signal
present in a sequence of length $n$, we expect $\lambda(j/n) \approx 2/n$ for $0 < j/n < 1/2$,
and hence approximately $(1 - \alpha) \times 100\%$ of the time, $\log \hat\lambda(\omega)$ will be less than
$\log(2/n) + z_\alpha/\sqrt{\nu}$, where $z_\alpha$ is the upper $\alpha$ tail cutoff of the standard
normal distribution. Exponentiating, the critical value for $\hat\lambda(\omega)$ becomes
$(2/n) \exp(z_\alpha/\sqrt{\nu})$. From our experience, thresholding at very small values of $\alpha$
relative to the sample size works well.
As a simple example, consider the sequence data presented in Whisenant
et al (1991), which were used in an analysis of a human Y-chromosomal DNA
fragment; the fragment is a string of length $n = 4156$ bp. The sample spectral
envelope (based on the periodogram) of the sequence is plotted in Figure 2.3a,
where frequency is measured in cycles per bp. The spectral envelope can
be interpreted as the largest proportion of the total variance at frequency $\omega$
that can be obtained for any scaling of the DNA sequence. The graph can be
inspected for peaks by employing the approximate null probabilities previously
given. In Figure 2.3a, we show the approximate null significance thresholds of
0.0001 (0.60%) and 0.00001 (0.71%) for a single a priori specified frequency.
The null significances were chosen small in view of the problem of making
simultaneous inferences about the value of the spectral envelope over more than
one frequency.
Figure 2.3a shows a major peak at approximately $\omega = 3/4156 \approx 0.0007$ cycles per
bp (about three cycles in the DNA fragment of 4156 bp), with corresponding
sample scaling A = 1, C = 0.1, G = 1.1, T = 0. This particular scaling suggests
that the purine-pyrimidine dichotomization best explains the slow cycling in the
fragment. There is also a secondary peak, at a higher frequency,
with a corresponding sample scaling of A = 1, C = 1.5, G = 0.9, T = 0. Again
we see the pairing of the purines, but the pyrimidines C and T are set apart;
the significance of this scaling and frequency warrants further investigation, and
at this time we can offer no insight into this result. Using the tests on the
specific scalings, we found that we could not reject the hypotheses that (i) at
the first peak, A = G = 0, C = T = 1, and (ii) at the second peak, A = G = 0, C = –1, T = 1.
To show how smoothing helps, Figure 2.3b shows a smoothed spectral envelope
based on a simple average. For this example, an
approximate 0.0001 significance threshold is (2/4156) exp(3.71/5) = 0.10%.
Note that there is considerably less variability in the smoothed estimate and
only the significant peaks are visible in the figure.

3. SEQUENCE ANALYSES
Our initial investigations have focused on herpesviruses because we regard
them as scientifically and medically important. Eight genomes are completely
sequenced and a large amount of additional knowledge about their biology is
known. This makes them a perfect source of data for statistical analyses. Here
we report on an analysis of nearly all of the CDS of the Epstein-Barr virus via
methods involving the spectral envelope. The data are taken from the EMBL
data base.
The study of nucleosome positioning is important because nucleosomes en-
gage in a large spectrum of regulatory functions and because nucleosome re-
search has come to a point where experimental data and analytical methods from
different directions begin to merge and to open ways to develop a more unified
and accurate picture of the formation, structure and function of nucleosomes.
While ten years ago many investigators regarded histones as mere packing tools,
irrelevant for regulation, there are now vast amounts of evidence suggesting the
participation of nucleosomes in many important cellular events such as replication,
segregation, development, and transcription (for reviews, see Grunstein
(1992) or Thoma (1992)). Although nucleosomes are now praised as a long
overlooked “parsimonious way of regulating biological activity" (Drew and
Calladine, 1987) or as the “structural code" (Travers and Klug, 1987), no obvious
signals that would distinguish them from the rest of the DNA are yet known
(Trifonov, 1991). It therefore remains an important task to understand and unravel
this complex structural code. The genetic code is degenerate: more than
one triplet codes for the same amino acid; nevertheless, strong biases exist in
the use of supposedly equivalent codons. This occurrence raises many questions
concerning codon preferences, and understanding of these preferences will
provide valuable information. It is our goal to contribute to a better understanding
of coding sequences by presenting the statistical methodology to perform
a systematic statistical analysis of periodicities and of codon-context patterns.
To this end, we suggest the following types of analyses based on the spectral
envelope. The uses of the spectral envelope and the analyses presented here
are by no means exhaustive; we will most likely raise more questions than we
answer.
We first explore the gene BcLF1 of Epstein-Barr. Figure 3.4 shows the spectral
envelope of the CDS, which is 4143 bp long with the following nucleotide
distribution: 900 A, 1215 C, 1137 G, 891 T. There is a clear signal at one cycle
every three bp. Smoothing was performed in the calculation of the
sample spectral envelope using triangular smoothing (in this case
a 0.0001 [0.00001] significance threshold
is approximately 0.17% [0.20%]). In this analysis, the scalings at the peak
frequency of one cycle every three bp were A = 1.16, C = 0.87, G = –0.14, T
= 0. This suggests that BcLF1's signal is in the {M, K} alphabet, where M =
{A or C} and K = {G or T}.
The next question is which positions in the codon are critical to the BcLF1
signal. To address this we did the following. First, every codon-position 1 was
replaced with a letter chosen at random from the nucleotide alphabet, {A, C, G,
T}, with probability equal to the proportion of each letter in the entire gene. For
example, the first nine values of BcLF1 are ATG GCC TCA; they are changed
to N1TG N2CC N3CA, where the N_i, for i = 1, 2, 3, are independently chosen
letters such that the probability that N_i is: an A is 900/4143, a C is 1215/4143,
a G is 1137/4143, a T is 891/4143. This is done over the entire gene, and
then the spectral envelope is computed to see if destroying the sequence in that
position destroys the signal. The graph shown in Figure 3.5a is the resulting
spectral envelope. There it is noted that not very much has changed from
Figure 3.4, so that destroying position 1 has no effect on the signal. The graph
shown in Figure 3.5b has the second position destroyed and Figure 3.5c is the
result of destroying the third position. While destroying the first position has
virtually no effect on the signal, destroying the second or third position does
have an effect on the signal. In the next three panels of Figure 3.5, the results
of destroying two positions simultaneously are shown. Figure 3.5d shows what
happens when the first and second positions are destroyed, Figure 3.5e has the
first and third positions destroyed, and Figure 3.5f has the second and third
positions destroyed. It is clear that the major destruction to the signal occurs
when either the first and third positions, or the second and third positions are
destroyed, although in either case there is still some evidence that the signal has
survived. From Figures 3.5e and 3.5f we see that the first and second positions
cannot thoroughly do the job of carrying the signal alone, however, the signal
remains when the third position is destroyed (Figure 3.5c), so that the job does
not belong solely to position three.
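The randomization just described is straightforward to express in code; a sketch (our names, with random.choices as the sampling device) is:

```python
# Replace every letter at one codon position (1, 2 or 3) by a random draw
# from the gene-wide base frequencies, as in the position-destroying test.
import random

def destroy_position(seq, pos, rng=random):
    counts = [seq.count(b) for b in "ACGT"]
    out = list(seq)
    for i in range(pos - 1, len(seq), 3):
        out[i] = rng.choices("ACGT", weights=counts)[0]
    return "".join(out)

# e.g. recompute the spectral envelope of destroy_position(bclf1, 1)
# and compare with Figure 3.4 (bclf1 holding the BcLF1 CDS as a string).
```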
To show how this technology can help detect heterogeneities and wrongly
assigned gene segments, we focus on a dynamic (or sliding-window) analysis of
BNRF1 (bp 1736-5689) of Epstein-Barr. Figure 3.6 shows the spectral envelope
(using triangular smoothing) of the entire CDS (approximately
4000 bp). The figure shows a strong signal at frequency 1/3; the corresponding
optimal scaling was A = 0.04, C = 0.71, G = 0.70, T = 0, which indicates that
the signal is in the strong-weak bonding alphabet, S = {C, G} and W = {A, T}.
Next, we computed the spectral envelope over two windows: the first half
and the second half of BNRF1 (each section being approximately 2000 bp long).
We do not show the result of that analysis here, but the spectral envelopes and
the corresponding optimal scalings were different enough to warrant further
investigation. Figure 3.7 shows the result of computing the spectral envelope
over four 1000 bp windows across the CDS, namely, the first, second, third,
and fourth quarters of BNRF1. An approximate 0.001 significance threshold is
0.69%. The first three quarters contain the signal at the frequency 1/3 (Figure
3.7a-c); the corresponding sample optimal scalings for the first three windows
were: (a) A = 0.06, C = 0.69, G = 0.72, T = 0; (b) A = 0.09, C = 0.70, G =
0.71, T = 0; (c) A = 0.18, C = 0.59, G = 0.77, T = 0. The first two windows
are strongly consistent with the overall analysis, the third section, however,
shows some minor departure from the strong-weak bonding alphabet. The
most interesting outcome is that the fourth window shows that no signal is
present. This result suggests the fourth quarter of BNRF1 of Epstein-Barr is
just a random assignment of nucleotides (noise).
To investigate these matters further, we took a window of size 1000 and


moved it across the CDS 200 bp at a time. For example, the first analysis was
on bp 1700-2700 of Epstein-Barr, the second analysis is on bp 1900-2900, the
third on bp 2100-3100, and so on, until the final analysis on bp 4700-5700. Each
analysis showed the frequency 1/3 signal except the last analysis on bp 4700-
5700 (this is an amazing result considering that the analysis prior to the last one
is on bp 4500-5500). Figure 3.8 shows the optimal sample scalings at the 1/3
frequency from each window analysis; the horizontal axis shows the starting
location of the 1000 bp window (1700, 1900, 2100, ..., 4700), and the vertical
axis shows the scalings. In Figure 3.8 the scalings are obtained as follows:
each analysis fixes G = 0 and then the scales for A, C and T are calculated.
Next, we divided each scale by the value obtained for C, so that C = 1 in each
analysis. Hence, the vertical axis of Figure 3.8 shows the scales, at frequency
1/3, in each window with G = 0 (solid line), C = 1 (dashed line), A free (solid
line), T free (dashed line). This was done primarily to assess the homogeneity
of the strong-weak bonding alphabet across the CDS. We see that, for the first
quarter or so of the CDS the {S, W} alphabet is strong. That strength fades a
bit in the middle and then comes back (though not as strong) near the end of the
CDS. We see, however, that this alphabet is nonexistent in the final 1000 bp of
BNRF1. This lack of periodicity prompted us to reexamine this region with a
number of other tools, and we now strongly believe that this segment is indeed
noncoding.
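The sliding-window analysis just described can be reproduced with the spectral_envelope sketch given earlier. In the fragment below, seq is assumed to hold the EBV sequence as a string (our code, not the authors'):

```python
# Sliding-window spectral envelope over BNRF1 (bp 1736-5689), window 1000,
# step 200; prints the envelope at the one-cycle-every-three-bp ordinate.
window, step = 1000, 200
cds = seq[1735:5689]                   # 0-based slice of the stated coordinates
for start in range(0, len(cds) - window + 1, step):
    freqs, env, beta = spectral_envelope(cds[start:start + window], m=2)
    j3 = round(window / 3) - 1         # index of the frequency nearest 1/3
    print(1735 + start, round(env[j3], 5))
```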
Herpesvirus saimiri (HVS; taken from GenBank) has a CDS from bp 6821 to 10561
(3741 bp) where the similarity to EBV BNRF1 is noted. To see if a similar
problem existed in HVS BNRF1, and to compare the periodic behavior
of the genes generally, we analyzed HVS BNRF1 in a similar fashion to EBV BNRF1
as displayed in Figure 3.6. Figure 3.9 shows the smoothed (triangular)
spectral envelope of HVS BNRF1 for (a) the first 1000 bp, (b) the
second 1000 bp, (c) the third 1000 bp, and (d) the remaining 741 bp. There
are some obvious similarities; that is, for the first three sections the cycle 1/3
is common to both the EBV and the HVS gene. The obvious differences are
the appearance of the 1/10 cycle in the third section and the fact that in HVS,
the fourth section shows the strong possibility of containing the 1/3 periodicity
(the data were padded to $n = 756$ for use with the FFT; the 0.001 significance
threshold in this case is (2/756) exp(3.71/3) = 0.91%; the peak value of the spectral
envelope at 1/3 in this section was 0.89%), whereas in EBV the fourth section is
noise. Next, we compared the scales for each section of the HVS analysis. In
the first section, the scales corresponding to the 1/3 cycle are A = 0.2, C = 0.96,
G = 0.18, T = 0, which suggests that the signal is driven by a C, not-C dichotomy. In the
second section the scales corresponding to the 1/3 signal are A = 0.26, C = 0.63,
G = 0.73, T = 0, which suggests the strong-weak bonding alphabet. In the third
section there are two signals; at the approximate 1/10 cycle the scales are A =
0.83, C = 0.47, G = 0.30, T = 0 (suggesting a strong bonding-A-T alphabet), and
at the 1/3 cycle the scales are A = 0.20, C = 0.32, G = 0.93, T = 0 (suggesting
a G-H alphabet). In the final section, the scales corresponding to the (perhaps
not significant) 1/3 signal are A = 0.28, C = 0.51, G = 0.81, T = 0, which does
not suggest any collapsing of the nucleotides.
Finally, we tabulated the results of the analysis of nearly every CDS in Epstein-Barr.
Only genes that exceed 500 bp in length are reported (BNRF1 and BcLF1
are not reported again here). In every analysis we used triangular smoothing.
These analyses were performed on the
entire gene, and it is possible that a dynamic analysis would find other significant
periodicities in sections of a CDS than are listed here. Table 2 lists the CDS
analyzed, the 0.001 critical value (CV) for that sequence, the significant values
of the smoothed sample spectral envelope (SpecEnv), the frequency at which
the spectral envelope is significant (Freq), and the scalings for A, C, and G at
the significant frequency (T = 0 in all cases). Note that for some genes there
is no evidence to support that the sequence is anything other than noise; these
genes should be investigated further. The occurrence of the zero frequency has
many explanations, but we are not certain which applies, and this warrants further
investigation. One explanation is that the CDS has long memory, in that sections
of the CDS that are far apart are highly correlated with one another. Another
possibility is that the CDS is not entirely coding. For example, we analyzed
the entire lambda virus (approximately 49,000 bp) and found a strong peak at
the zero frequency and at the one-third frequency; however, when we focused
on any particular CDS, only the one-third frequency peak remained. We have
noticed this in other analyses of sections that contain coding and noncoding
(see Stoffer et al, 1993b), but this is not consistent across all of our analyses.

4. DISCUSSION
The spectral envelope, as a basic tool, appears to be suited for fast auto-
mated macro-analyses of long DNA sequences. Interactive computer programs
are currently being developed. The analyses described in this paper were per-
formed either using a cluster of C programs that compile on Unix operating
systems, or using the Gauss programming system for analyses on Windows
operating systems. We have presented some ways to adapt the technology to
the analysis of DNA sequences. These adaptations were not presented in the
original spectral envelope article (Stoffer et al, 1993a) and it is clear that there
are many possible ways to extend the original methodology for use on various
problems encountered in molecular biology. For example, we have recently
developed similar methods to help with the problem of discovering whether
two sequences share common signals in a type of local alignment and a type of
global alignment of sequences (Stoffer and Tyler, 1998). Finally, the analyses
presented here point to some inconsistencies in established gene segments and,
evidently, some additional investigation and explanation is warranted.

ACKNOWLEDGMENTS
This article is dedicated to the memory of Sid Yakowitz and his research in the
field of time series analysis; in particular, his contributions and perspectives on
fast methods for frequency detection. Part of this work was supported by a grant
from the National Science Foundation. This work benefited from discussions
with Gabriel Schachtel, University of Giessen, Germany.

REFERENCES
Bernardi, G. and G. Bernardi. (1985). Codon usage and genome composition.
Journal of Molecular Evolution, 22, 363-365.
Bina, M. (1994). Periodicity of dinucleotides in nucleosomes derived from
simian virus 40 chromatin. Journal of Molecular Biology, 235, 198-208.
Blaisdell, B.E. (1983). Choice of base at silent codon site 3 is not selectively
neutral in eucaryotic structural genes: It maintains excess short runs of weak
and strong hydrogen bonding bases. Journal of Molecular Evolution, 19,
226-236.
Buckingham, R.H. (1990). Codon context. Experientia, 46, 1126-1133.
Cornette, J.L., K.B. Cease, H. Margalit, J.L. Spouge, J.A. Berzofsky, and C.
DeLisi. (1987). Hydrophobicity scales and computational techniques for detecting
amphipathic structures in proteins. Journal of Molecular Biology,
195, 659-685.
Drew, H.R. and C. R. Calladine. (1987). Sequence-specific positioning of core


histones on an 860 base-pair DNA: Experiment and theory. Journal of Molec-
ular Biology, 195, 143-173.
Eisenberg, D., R.M. Weiss, and T.C. Terwilliger. (1984). The hydrophobic moment
detects periodicity in protein hydrophobicity. Proc. Natl. Acad. Sci.,
81, 140-144.
Grunstein, M. (1992). Histones as regulators of genes. Scientific American, 267,
68-74.
Ioshikhes, I., A. Bolshoy, and E.N. Trifonov. (1992). Preferred positions of AA
and TT dinucleotides in aligned nucleosomal DNA sequences. Journal of
Biomolecular Structure and Dynamics, 9, 1111-1117.
McLachlan, A.D. and M. Stewart. (1976). The 14-fold periodicity in alpha-
tropomyosin and the interaction with actin. Journal of Molecular Biology,
103, 271-298.
Mengeritsky, G. and E.N. Trifonov. (1983). Nucleotide sequence-directed map-
ping of the nucleosomes. Nucleic Acids Research, 11, 3833-3851.
Muyldermans, S. and A.A. Travers. (1994). DNA sequence organization in chromatosomes.
Journal of Molecular Biology, 235, 855-870.
Pina, B., D. Barettino, M. Truss, and M. Beato. (1990). Structural features of a
regulatory nucleosome. Journal of Molecular Biology, 216, 975-990.
Satchwell, S.C., H.R. Drew, and A.A. Travers. (1986). Sequence periodicities in
chicken nucleosome core DNA. Journal of Molecular Biology, 191, 659-675.
Schachtel, G.A., P. Bucher, E.S. Mocarski, B.E. Blaisdell, and S. Karlin. (1991).
Evidence for selective evolution in codon usage in conserved amino acid segments
of human alphaherpesvirus proteins. Journal of Molecular Evolution,
33, 483-494.
Shrader, T.E. and D.M. Crothers. (1990). Effects of DNA sequence and histone-
histone interactions on nucleosome placement. Journal of Molecular Biol-
ogy, 216, 69-84.
Shumway, R.H. and D.S. Stoffer. (2000). Time Series Analysis and Its Applica-
tions. New York: Springer.
Stoffer, D.S., D.E. Tyler, and A.J. McDougall. (1993a). Spectral analysis for
categorical time series: Scaling and the spectral envelope. Biometrika, 80,
611-622.
Stoffer, D.S., D.E. Tyler, A.J. McDougall, and G.A. Schachtel. (1993b). Spectral
analysis of DNA sequences (with discussion). Bulletin of the International
Statistical Institute, Bk 1, 345-361; Bk 4, 63-69.
Stoffer, D.S. and D.E. Tyler. (1998). Matching sequences: Cross-spectral anal-
ysis of categorical time series. Biometrika, 85, 201-213.
Sueoka, N. (1988). Directional mutation pressure and neutral molecular evolution.
Proc. Natl. Acad. Sci., 85, 2653-2657.
Tavaré, S. and B.W. Giddings. (1989). Some statistical aspects of the primary
structure of nucleotide sequences. In Mathematical Methods for DNA Se-
quences, M.S. Waterman ed., pp. 117-131, Boca Raton, Florida: CRC Press.
Travers, A.A. and A. Klug. (1987). The bending of DNA in nucleosomes and
its wider implications. Philosophical Transactions of the Royal Society of
London, B, 317, 537-561.
Trifonov, E.N. (1991). DNA in profile. Trends in Biochemical Sciences, 16,
467-470.
Trifonov, E.N. and J.L. Sussman. (1980). The pitch of chromatin DNA is re-
flected in its nucleotide sequence. Proc. Natl. Acad. Sci., 77, 3816-3820.
Uberbacher, E.C., J.M. Harp, and G.J. Bunick. (1988). DNA sequence patterns
in precisely positioned nucleosomes. Journal of Biomolecular Structure and
Dynamics, 6, 105-120.
Viari, A., H. Soldano, and E. Ollivier. (1990). A scale-independent signal processing
method for sequence analysis. Computer Applications in the Biosciences, 6,
71-80.
Zhurkin, V.B. (1983). Specific alignment of nucleosomes on DNA correlates
with periodic distribution of purine-pyrimidine and pyrimidine-purine dimers.
FEBS Letters, 158, 293-297.
Zhurkin, V.B. (1985). Sequence-dependent bending of DNA and phasing of
nucleosomes. Journal of Biomolecular Structure and Dynamics, 2, 785-804.
Part III
Chapter 8

AN EFFICIENT STOCHASTIC APPROXIMATION


ALGORITHM FOR STOCHASTIC SADDLE
POINT PROBLEMS

Arkadi Nemirovski

Reuven Y. Rubinstein
Faculty of Industrial Engineering and Management
Technion—Israel Institute of Technology
Haifa 32000, Israel

Abstract We show that Polyak’s (1990) stochastic approximation algorithm with averaging,
originally developed for unconstrained minimization of a smooth strongly
convex objective function observed with noise, can be naturally modified to solve
convex-concave stochastic saddle point problems. We also show that the extended
algorithm, considered on general families of stochastic convex-concave
saddle point problems, possesses a rate of convergence unimprovable in order
in the minimax sense. We finally present supporting numerical results for the
proposed algorithm.

1. INTRODUCTION
We start with the classical stochastic approximation algorithm and its mod-
ification given in Polyak (1990).

1.1. CLASSICAL STOCHASTIC APPROXIMATION


Classical stochastic approximation (CSA) originates from the papers of
Robbins and Monro and of Kiefer and Wolfowitz. It is basically a steepest
descent method for solving the minimization problem
$$\min_{x \in X} f(x), \qquad\qquad (1.1)$$
where the exact gradients of $f$ are replaced with their unbiased estimates. In
the notation from Example 2.1, the CSA algorithm is
$$x_{t+1} = \pi_X\bigl(x_t - \gamma_t \nabla_x F(x_t, \xi_t)\bigr), \qquad\qquad (1.2)$$
where $x_1$ is an arbitrary point from X, $\pi_X(x)$ is the point of X closest to $x$ (the
projection of $x$ onto X); the stepsizes $\gamma_t > 0$ are normally chosen as
$$\gamma_t = C/t, \qquad\qquad (1.3)$$
C being a positive constant. Under appropriate regularity assumptions (see,
e.g., Kushner and Clark (1978)) the sequence $\{x_t\}$ converges almost surely and
in the mean square to the unique minimizer of the objective.
Unfortunately, the CSA algorithm possesses poor robustness. In the case
of a smooth (i.e., with a Lipschitz continuous gradient) and nondegenerate
(i.e., with a nonsingular Hessian) convex objective, the rate of convergence is
$O(t^{-1})$ and is unimprovable, in a certain precise sense. However, to achieve
this rate, one must adjust the constant C in (1.3) to the “curvature” of the
objective; a “bad” choice of C – by an absolute constant factor less than the
optimal one – can convert the convergence rate to $O(t^{-\kappa})$ with $\kappa < 1$.
Finally, if the objective, although smooth, is not nondegenerate, (1.3) may
result in extremely slow convergence.
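To make the recursion concrete, here is a minimal Python sketch of (1.2)–(1.3); the names `grad_est` and `project` are illustrative assumptions standing in for an unbiased gradient oracle and the projector onto X:

```python
import numpy as np

def csa_minimize(grad_est, project, x0, C, n_steps, rng):
    """Classical stochastic approximation: projected stochastic gradient
    steps x_{t+1} = proj_X(x_t - (C/t) * g_t), with g_t an unbiased
    estimate of the gradient at x_t."""
    x = np.asarray(x0, dtype=float)
    for t in range(1, n_steps + 1):
        g = grad_est(x, rng)            # noisy gradient observation
        x = project(x - (C / t) * g)    # steepest-descent step, then project
    return x

# Toy run: minimize E (x - xi)^2 / 2 over X = [0, 1], xi ~ N(0.3, 1);
# the exact minimizer is x* = 0.3.
rng = np.random.default_rng(0)
x_hat = csa_minimize(
    grad_est=lambda x, rng: x - (0.3 + rng.standard_normal()),
    project=lambda x: np.clip(x, 0.0, 1.0),
    x0=1.0, C=1.0, n_steps=10_000, rng=rng)
print(x_hat)
```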
The CSA algorithm was significantly improved by Polyak (1990). In his
algorithm, the stepsizes are larger in order than those given by (1.3) (they
are of order of $t^{-\gamma}$ with $\gamma \in (1/2, 1)$), so that the rate of convergence of
the trajectory (1.2) to the solution is worse in order than for the usual CSA.
The crucial difference between Polyak’s algorithm and the CSA is that the
sequence (1.2) is used only to collect information about the objective rather than
to estimate the solution itself. Approximate solutions $\bar{x}_t$ to (1.1) are obtained
by averaging the “search points” $x_s$ in (1.2) according to
$$\bar{x}_t = \frac{1}{t} \sum_{s=1}^{t} x_s.$$
It turns out that under the same assumptions as for the CSA (smooth nonde-
generate convex objective attaining its minimum at an interior point of X),
Polyak’s algorithm possesses the same asymptotically unimprovable conver-
gence rate $O(t^{-1})$ as the CSA. At the same time, in Polyak’s algorithm there is no
need for “fine adjustment” of the stepsizes to the “curvature” of the objective.
Moreover, Polyak’s algorithm with properly chosen $\gamma$ preserves a “reasonable”
(close to $O(t^{-1/2})$) rate of convergence even when the (convex) objective is
nonsmooth and/or degenerate.
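A minimal sketch of the averaging modification, under the same assumptions as the previous snippet (the exponent `gamma` is the $\gamma \in (1/2, 1)$ discussed above):

```python
import numpy as np

def polyak_sa(grad_est, project, x0, C, gamma, n_steps, rng):
    """Polyak's modification: larger steps C * t**(-gamma), 1/2 < gamma < 1,
    drive the search points x_t, while the running average x_bar_t of the
    search points serves as the approximate solution."""
    x = np.asarray(x0, dtype=float)
    x_bar = x.copy()
    for t in range(1, n_steps + 1):
        g = grad_est(x, rng)
        x = project(x - C * t ** (-gamma) * g)  # info-collecting trajectory
        x_bar += (x - x_bar) / t                # averaged approximate solution
    return x_bar
```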
A somewhat different aggregation in SA algorithms was proposed earlier
by Nemirovski and Yudin (1978, 1983). For additional references on the CSA
algorithm and its outlined modification, see Ermoliev (1969), Ermoliev and
Gaivoronski (1992), L’Ecuyer, Giroux, and Glynn (1994), Ljung, Pflug and
Walk (1992), Pflug (1992), Polyak (1990) and Tsypkin (1970) and references
therein.
Our goal is to extend Polyak’s algorithm from unconstrained convex mini-
mization to the saddle point case. We shall show that, although for the general
saddle point problems below the rate of convergence slows down from $O(t^{-1})$
to $O(t^{-1/2})$, the resulting stochastic approximation saddle point (SASP) al-
gorithm, as applied to stochastic saddle point (SSP) problems, preserves the
optimality properties of Polyak’s method.
The rest of this paper is organized as follows. In Section 2 we define the SSP
problem, present the associated SASP algorithm, and discuss its convergence
properties. We show that the SASP algorithm is a straightforward extension of
its stochastic counterpart with averaging, originally proposed by Polyak (1990)
for stochastic minimization problems as in Example 2.1 below. It turns out
that in the general case the rate of convergence of the SASP algorithm becomes
$O(N^{-1/2})$ instead of $O(N^{-1})$, that is, the convergence rate of Polyak’s algorithm.
We demonstrate in Section 3 that this slowing down is an unavoidable price for
extending the class of problems handled by the method. In Section 4 we present
numerical results for the SASP algorithm as applied to the stochastic Minimax
Steiner problem and to an on-line queuing optimization problem. The Appendix
contains the proofs of the rate of convergence results for the SASP algorithm.
It is not our intention in this paper to compare the SASP algorithm with other
optimization algorithms suitable for off-line and on-line stochastic optimization,
like the stochastic counterpart (Rubinstein and Shapiro, 1993); our aim is merely
to show the high potential of the SASP method and to promote it for further
applications.

2. STOCHASTIC SADDLE POINT PROBLEM


2.1. THE PROBLEM
Consider the following saddle point problem:
(SP) Given a function $\phi(x, y) : X \times Y \to \mathbf{R}$, find a saddle point $(x_*, y_*)$
of $\phi$ on X × Y, i.e., a point at which $\phi(x, y)$ attains its minimum
in $x \in X$ and its maximum in $y \in Y$.

In what follows, we write (SP) down as
$$\min_{x \in X}\, \max_{y \in Y}\ \phi(x, y). \qquad\qquad (2.1)$$
We make the following


Assumption A. X and Y are convex compact sets and is convex in


concave in and Lipschitz continuous on X × Y.
Let us associate with (SP) the following pair of functions

(the primal and the dual objectives, respectively), and the following pair of
optimization problems

It is well known (see, e.g., Rockafellar (1970)) that under assumption A both
problems (P) and (D) are solvable, with the optimal values equal to each other,
and the set of saddle points of on X × Y is exactly the set

2.1.1 Stochastic setting. We are interested in the situation where


neither the function in (SP), nor the derivatives
are available explicitly; we assume, however, that at a time instant
one can obtain, for every desired point “noisy estimates” of
the aforementioned partial derivatives. These estimates form a realization of
the pair of random vectors

being the “observation noises”. We assume that these noises are in-
dependent identically distributed, according to a Borel probability measure P,
random variables taking values in a Polish (i.e., metric separable complete)
space
We also make the following
Assumption B. The functions on are Borel
functions taking values in respectively, such that

and

Here
and are the sub- and super-differentials of in


respectively;
is the standard Euclidean norm on the corresponding
and are the Euclidean diameters of X and Y, respectively.
We refer to and in (2.2) satisfying Assumption B as to a stochastic source
of information (SSI) for problem (SP), and refer to problem (SP) satisfying
Assumption A and equipped with a particular stochastic source of information
and as to a stochastic saddle point (SSP) problem. The associated quan-
tity will be called the variation of observations of the stochastic source of
information.
Our goal is to develop a stochastic approximation algorithm for solving the
SSP problem.

2.1.2 The accuracy measure. As an accuracy measure of a candidate


solution of problem (SP), we use the following function

Note that is expressed in terms of the objective function rather than of


the distance from to the saddle set of It is nonnegative everywhere
and equals 0 exactly at the saddle set of This is so, since the saddle points
of are exactly the pairs Note finally that

since
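In standard notation, the accuracy measure can be written as follows (a hedged reconstruction; the symbols for the primal and dual objectives are ours, not necessarily those of the lost display):

```latex
\nu(x, y) \;=\; \max_{y' \in Y} \phi(x, y') \;-\; \min_{x' \in X} \phi(x', y)
         \;=\; \bar{\phi}(x) - \underline{\phi}(y)
         \;=\; \bigl[\bar{\phi}(x) - \min_X \bar{\phi}\bigr]
             + \bigl[\max_Y \underline{\phi} - \underline{\phi}(y)\bigr] \;\ge\; 0,
```

where the last equality uses $\min_X \bar{\phi} = \max_Y \underline{\phi}$, so that $\nu$ vanishes exactly when $x$ solves (P) and $y$ solves (D), i.e., exactly at the saddle set.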

2.2. EXAMPLES
We present now several stochastic optimization problems which can be nat-
urally posed in the SSP form.
Example 2.1 Simple single-stage stochastic program. Consider the simplest
stochastic programming problem

with convex compact feasible domain X and convex objective Here is a


Polish space and P is a Borel probability measure on
Assume that the integrand is sufficiently regular, namely,
is summable for every


F is differentiable in for all with Borel in
is Lipschitz continuous in with a certain constant and

Assume further that when solving (2.7), we cannot compute directly, but
are given an iid random sample distributed according to P and know
how to compute at every given point
Under these assumptions program (2.7) can be treated as an SSP problem
with

and a trivial – singleton – set Y (which enables us to set ). It is readily


seen that the resulting SSP problem satisfies assumptions A and B with

and that the accuracy measure (2.5) in this problem is just the residual in terms
of the objective:

Example 2.2 Minimax stochastic program. Consider the following system


of stochastic inequalities:

with convex compact domain X and convex constraints Here


P are the same as in Example 2.1, and each of the integrands
possesses the same regularity properties as in Example 2.1.
Clearly, to solve (2.9) it suffices to solve the optimization problem

which is the same as solving the following saddle point problem

Note that the latter problem clearly satisfies Assumption A.


Similarly to Example 2.1, assume that when solving (2.10), we cannot com-
pute (and thus explicitly, but are given an iid sample distributed
according to P and, given and we can compute


Note that under this assumption the SSI
for the saddle point problem (2.11) is given by

The variation of observations for this source can be bounded from above as

Note that in this case the accuracy measure satisfies the inequality

so that it presents an upper bound for the residual in (2.10).


Although problem (2.7) can be posed as a convex minimization problem
(2.10) rather than the saddle point problem (2.11), it cannot be solved directly.
Indeed, to solve (2.10) by a Stochastic Approximation algorithm, we need
unbiased estimates of subgradients of and we cannot built estimates of this
type from the only available for us unbiased estimates of Thus, in the
case under consideration the saddle point reformulation seems to be the only
one suitable for handling “noisy observations”.
Example 2.3 Single-stage stochastic program with stochastic constraints.
Consider the following stochastic program:

subject to

with convex compact domain X and convex functions and let


P and the integrands satisfy the same assumptions as in Examples 2.1,
2.2. As above, assume that when solving (2.12) – (2.13), we cannot compute
explicitly, but are given an iid sample distributed according to P and,


given and can compute
To solve problem (2.12) – (2.13), it suffices to find a saddle point of the
Lagrange function

on the set Note that if (2.12) – (2.13) satisfies the Slater condition,
then possesses a saddle point on and the solutions to (2.12), (2.13)
coincide with the of the saddle points of
Assume that we have prior information on the problem, which enables us to
identify a compact convex set containing the of some
saddle point of Then we can replace in the Lagrange saddle point problem
the nonnegative orthant with Y, thus obtaining an equivalent saddle point
problem

with convex and compact set X and Y .


Noting that the vectors

form a stochastic source of information for (2.14), we see that (2.13) – (2.12) can
be reduced to an SSP. The variation of observations for the associated stochastic
source of information clearly can be bounded as

These examples demonstrate that the SSP setting is a natural form of many
stochastic optimization problems.

2.3. THE SASP ALGORITHM


The SASP algorithm for the stochastic saddle point problem (2.1), (2.3) is
as follows:
Algorithm 2.1

where
is the projector on X × Y :

the vector is

and being positive parameters of the method;


are positive stepsizes which, in principle, can be either deterministic
or stochastic (see subsections 2.3 and 2.4, respectively);
the initial point is an arbitrary (deterministic) point in X × Y.
As an approximate solution of the SSP problem we take the moving average

where is a deterministic function taking, for every integer values


between 1 and
In the two subsections which follow we discuss the rate of convergence of the
SASP algorithm and the choice of its parameters. Subsections 2.4 and 2.5 deal
with off-line and on-line choice of the stepsizes, respectively.
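A minimal Python sketch of the scheme (illustrative names; `gx_est`/`gy_est` stand for the noisy partial-derivative observations (2.2), and full Cesàro averaging replaces the general moving average (2.18)):

```python
import numpy as np

def sasp(gx_est, gy_est, proj_X, proj_Y, x0, y0, stepsize, n_steps, rng):
    """Projected stochastic gradient descent in x and ascent in y; the
    running average of the search points is returned as the approximate
    saddle point (cf. (2.15)-(2.18))."""
    x, y = np.asarray(x0, dtype=float), np.asarray(y0, dtype=float)
    x_bar, y_bar = x.copy(), y.copy()
    for t in range(1, n_steps + 1):
        gx = gx_est(x, y, rng)          # noisy partial gradient in x
        gy = gy_est(x, y, rng)          # noisy partial gradient in y
        gamma = stepsize(t)
        x = proj_X(x - gamma * gx)      # descent in the convex variable
        y = proj_Y(y + gamma * gy)      # ascent in the concave variable
        x_bar += (x - x_bar) / t        # averaged approximate solution
        y_bar += (y - y_bar) / t
    return x_bar, y_bar
```

Averaging over only the last M(t) search points, as in (2.18), changes nothing but the bookkeeping of the averages.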

2.4. RATE OF CONVERGENCE AND OPTIMAL


SETUP: OFF-LINE CHOICE OF THE STEPSIZES
Here we consider the case of deterministic sublinear stepsizes. Namely,
assume that

where C > 0. As we shall see in Section 3, properly chosen parameters
in (2.19) yield an unimprovable in order (in a certain precise sense) rate
of convergence.
Theorem 1 Under assumptions A, B and (2.19), the expected inaccuracy


of the approximate solutions generated by the method can be
bounded from above, for every N > 1, as follows:

The proof of Theorem 1 is given in Appendix A.


It is easily seen that the parameters minimizing, up to an absolute constant
factor, the right hand side of (2.20) are given by the setup

Here $\lceil a \rceil$ denotes the smallest integer which is $\geq a$. With this setup, (2.20)
results in

2.5. RATE OF CONVERGENCE AND OPTIMAL


SETUP: ON-LINE CHOICE OF THE STEPSIZES
Setup (2.21) requires a priori knowledge of the parameters
When the domains X, Y are “simple” (e.g., boxes, Euclidean balls or perfect
simplices), there is no problem in computing the diameters And in
actual applications we can handle simple X and Y only, since we should know
how to project onto these sets. Computation of the variation of observations
is, however, trickier. Typically the exact value of is not available, and a
bad initial guess for can significantly slow down the convergence rate. For
practical purposes it might be better to use an on-line policy for updating guesses
for and our current goal is to demonstrate that there exists a reasonably wide
family of these policies preserving the convergence rate of the SASP algorithm.
We shall focus on stochastic stepsizes of the type (cf. (2.19))

where is fixed and depends on the observations


For the theoretical analysis, we make the following
Assumption C. For every depends only on the observations collected
at the first steps, i.e., is a deterministic function of
Moreover, there exist “safety bounds” – two positive constants


and – such that

for all

Let us associate with the SASP algorithm (2.15) – (2.18), (2.23) the following
(deterministic) sequence:

where the supremum is taken over all trajectories associated with the SSP prob-
lem in question.

Theorem 2 Let the stepsizes in the SASP algorithm (2.15) – (2.18) be chosen
according to (2.23), and the remaining parameters according
to (2.21). Then under assumptions A, B, C the expected inaccuracy
of the approximate solution generated by the SASP algorithm can
be estimated, for every N > 1, as follows:

Note that (2.20) is a particular case of (2.26) with


The proof of Theorem 2 is given in Appendix A.
We now present a simple example of adaptive choice of Recalling (see
(2.19), (2.21)) that the optimal stepsize is the one with and
where is the variation of observations, it is natural to choose as

where is our current guess for the unknown quantity Since by definition
(2.4)
a natural candidate for the role of is given by

– the sample mean of “magnitude of observations”. Such a choice of


may violate assumption (2.24), since fluctuations in the observations
may result in being either too small or too large to satisfy (2.24).
It thus makes sense to replace (2.29) with its truncated version. More precisely,
let

where and
present some a priori guesses for lower and upper bounds on and
respectively. Then the truncated version of (2.29) is

Clearly, the stepsize policy (2.27), (2.30) satisfies (2.24) – it suffices to take
and In addition,
for the truncated version we have

where O(1) depends solely on our safety bounds Inequalities


(2.26), (2.31) combined with the stepsize policy (2.27), (2.30) result in the
same rate of convergence as in (2.22).
Note that the motivation behind the stepsize policy (2.27), (2.30) is, roughly
speaking, to choose stepsizes according to the actual magnitudes of observa-
tions along the trajectory rather than according to the worst-case
“expected magnitudes”
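In code, the truncated on-line guess can look as follows (a hedged sketch; the names and the proportionality constant `scale` are assumptions, and `grad_sq_norms` collects the observed squared magnitudes of the stochastic gradients along the trajectory):

```python
import numpy as np

def truncated_theta(grad_sq_norms, L_lo, L_hi, scale=1.0):
    """On-line stepsize factor: estimate the variation of observations by
    the root of the sample mean of the observed squared gradient magnitudes,
    truncated to the a priori safety interval [L_lo, L_hi], and set theta
    proportional to its inverse (cf. (2.27), (2.30))."""
    L_hat = np.sqrt(np.mean(grad_sq_norms))   # sample-mean guess for L
    L_hat = min(max(L_hat, L_lo), L_hi)       # truncation keeps (2.24) valid
    return scale / L_hat                      # theta_t ~ 1 / L-estimate
```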
3. DISCUSSION
3.1. COMPARISON WITH POLYAK’S ALGORITHM
As applied to convex optimization problems (see Example 2.1), the SASP
algorithm with the setup (2.21) looks completely similar to Polyak’s algorithm
with averaging. There is, however, an important difference: the stepsizes
given by (2.21) are not quite suitable for Polyak’s method.
For the latter, the standard setup is (2.19) with $\gamma$ close to 1, and this is
the setup for which Polyak’s method possesses its most attractive property – the
$O(t^{-1})$ (as opposed to $O(t^{-1/2})$) rate of convergence on strongly convex (i.e., smooth
and nondegenerate) objective functions. Specifically, let
be the class of all stochastic optimization problems
on a compact convex set with twice differentiable
objective satisfying the condition

and equipped with a stochastic source of information with variation of the ob-
servations not exceeding L. Note that problems from class possess uniformly
smooth objectives. In addition, if which corresponds to the “well-posed
case”, the objectives are uniformly nondegenerate as well.
For Polyak’s method with stepsizes and properly
chosen ensures that the expected error of N-th
approximate solution does not exceed where depend only on
the data of Under the same circumstances the stepsizes given by
(2.21) will result in a slower convergence, namely, $O(N^{-1}\ln N)$. Thus,
in the “well-posed case” the SASP method with setup (2.21) is slower by a
logarithmic in N factor than the original Polyak’s method.
The situation changes dramatically when that is, when we pass from
the “well-posed” case to the “ill-posed” one. Here the SASP algorithm still en-
sures (uniformly in problems from ) the rate of convergence
which is not the case for Polyak’s method. Indeed, consider the simplest
case when X = [0,1] and assume that observation noises are absent, so that
Consider next the subfamily of comprised
of the objectives

where and let us apply Polyak’s method to with stepsizes


starting at The search points are

where is continuous on [1/2,1). For the points as


well as their averages, belong to the domain where
Therefore in order to get an of Polyak’s algorithm requires at


least steps. We conclude that in the ill-posed case the worst-case,
with respect to rate of convergence of Polyak’s algorithm cannot be better
than Thus, in the ill-posed case Polyak’s setup with
results in worse, by factor of order of rate of convergence
then the setup
We believe that the outlined considerations provide enough arguments in
favor of the rule unless we are certain that the problem is “well-
posed”. As we shall see below, in the case of “genuine” saddle point problems
(not reducing to minimization of a convex function via unbiased observations
of its subgradients) the rate of convergence of the SASP algorithm with setup
(2.21) is unimprovable even for “well-posed” problems.

3.2. OPTIMALITY ISSUES


We are about to demonstrate that as far as general families of SSP problems
are concerned, the SASP algorithm with setup (2.21) is optimal in order in the
minimax sense. To this end, let us define the family of stochastic saddle point
problems as that of all SSP instances on X × Y (recall that an SSP
problem always satisfies assumptions A, B) with the variance of observations
not exceeding L > 0.
Given a positive and a subfamily let us denote by
the information-based of defined as follows.
a) Let us define a solution method for the family as a procedure which, as
applied to an instance from the family, generates sequences of “search points”
and “approximate solutions” with the pair
defined solely on the basis of observations along the previous search points:

Formally, the method is exactly the collection of “rules”


and the only restriction on these rules is that all
function pairs must be Borel.
b) For a solution method its on an instance
is defined as the smallest N such that

the expectation being taken over the distribution of observation noises; here
is the inaccuracy measure (2.5) associated with instance
The of on the entire family is
i.e., it is the worst case, over all instances from of on an


instance.
For example, (2.22) says that the complexity of the SASP method on the
family of stochastic saddle point problems can be bounded from
above as

provided that one uses the setup (2.21) with


Finally, the of the family is the minimum, over
all solution methods of of the methods on the family:

A method is called optimal in order on if there exists such that

Optimality in order of a method on a family means that does not ad-


mit a solution method “much faster” than for every required accuracy
solves every problem from within (expected) accuracy in not
more than steps, while every competing method in less than
steps fails to solve, within the same accuracy, at least one (de-
pending on ) problem from
We are about to establish the optimality in order of the SASP algorithm on
families
Proposition 3.1 The complexity of every nontrivial (with a non-singleton X ×
Y) family of stochastic saddle point problems admits a lower bound

C being a positive absolute constant.


The proof of Proposition 3.1 is given in Appendix B.
Taking into account (3.1), we arrive at
Corollary 3.1 For every convex compact sets and every
L > 0, the SASP algorithm with setup (2.21) ( is set to L) is optimal in order
on the family of SSP problems.

Remark 3.1 The outlined optimality property of the SASP method means that
as far as the performance on the entire family is concerned, no
alternative solution method outperforms the SASP algorithm by more than an
absolute constant factor. This fact, of course, does not mean that it is impos-
sible to outperform essentially the SASP method on a given subfamily of

For example, the family of convex stochastic minimization problems


introduced in Subsection 3.1 can be treated as a subfamily of
(think of a convex optimization problem as a saddle point problem
with objective independent of ). As explained in Subsection 3.1, in the “well-
posed” case the SASP algorithm is not optimal in order on (the
complexity of the method on is while the complexity of is

In view of Remark 3.1, it makes sense to present a couple of examples of what


we call “difficult” subfamilies – those for which
is of the same order as the complexity of the entire family
For the sake of simplicity, let us restrict ourselves to the case
Y = [0,1], L = 10. It is readily seen that if both X , Y are non-singletons,
then can be naturally embedded into
so that "difficult" subfamilies of generate "difficult" sub-
families in every family of stochastic saddle point problems with
nontrivial X , Y .
1) The first example of a “difficult” subfamily in is the family of ill-
posed smooth stochastic convex optimization problems
associated with X = [–1,1], A = 1, L = 10, see Subsection 3.1.
Indeed, consider a 2-parametric family of optimization programs

the parameters being and We assume that the family


(3.3) is equipped with the stochastic source of information

Thus, we seek to minimize a simple quadratic function of a single variable in


the situation when the derivative of the objective function is corrupted with an
additive Gaussian white noise of unit intensity.
The same reasoning as in the proof of Proposition 3.1 demonstrates that the
family is indeed "difficult", the reason being that the programs (3.3) become
more and more ill-posed as approaches 0.
2) The second example is a single-parametric family of “genuine” saddle
point problems
equipped with the stochastic sources of information

here the parameter varies in [–1,1],


The origin of the stochastic saddle point problem (3.4) – (3.5) is very simple.
What we seek in fact is to solve a Minimax problem (see Example 2.2) of the
form

specified by an unknown parameter What we can observe are the


values and derivatives of at any desired point They are corrupted with
noise and equal to

respectively. The observations of are, respectively,

Applying the scheme presented in Example 2.2 to the above stochastic Min-
imax problem, we convert it into an equivalent stochastic saddle point problem,
which is readily seen to be identical with (3.4) – (3.5).
Note that the family of stochastic saddle point problems
is contained in We claim that the family is "difficult".
Indeed, denoting by the accuracy measure associated with the problem
(3.4) and taking into account (2.5), we have

It follows that In other words, if there exists a method which


solves in N steps every instance from within an expected inaccuracy then
this method is capable of recovering, within the same expected inaccuracy, the
value of underlying the instance. On the other hand, from the viewpoint
of collected information, the N observations (3.5) used by the method are
equivalent to observing a sample of N iid random variables. Thus, if
then one can recover, within the expected inaccuracy the


unknown mean of from N-point sample drawn from this
distribution, regardless of the actual value of the mean. It is well-
known that the latter is possible only when Thus,
as claimed.
Note finally that the stochastic Minimax problems (3.6) which give rise to the
stochastic saddle point problems from are “as nice as a Minimax problem can
be”. Indeed, the components present just shifts by of simple quadratic
functions on the axis. Moreover, problems (3.6) are perfectly well posed – the
solution is “sharp”, i.e., the residual
is of the first order in the distance from a candidate solution
to the exact solution provided that this distance is small. We see that in
situations less trivial than the one considered in case 1), "difficult" stochastic
saddle point problems can arise already from quite simple and perfectly well-
posed stochastic optimization models.

4. NUMERICAL RESULTS
In this section we apply the SASP algorithm to a pair of test problems:
the stochastic Minimax Steiner problem and on-line optimization of a simple
queuing model.

4.1. A STOCHASTIC MINIMAX STEINER PROBLEM


Assume that in a two-dimensional domain X there are towns of the same
population, say equal to 1. The distribution of the population over the area
occupied by town is All towns are served by a single facility, say, an
ambulance. The “inconvenience of service” for town is measured by the
mean distance from the facility to the customers, i.e., by the function

being the location of the facility. The problem is to find a location for the
facility which minimizes the worst-case, with respect to all towns, inconve-
nience of service. Mathematically we have the following minimax stochastic
program

We assume that the only source of information for the problem is a sample

with mutually independent entries distributed according to i.e., a


random sample of N tuples of requests for service, one request per town in a
tuple.
The above Minimax problem can be naturally posed as an SSP problem (cf.
Example 2.2) with the objective

the observations being

In our experiments we chose X to be the square


and dealt with towns placed at the vertices of a regular
pentagon, being the normal two-dimensional distribution with mean and
the unit covariance matrix. We used setup (2.19), (2.21) with the parameters

and ran 2,000 steps of the SASP algorithm, starting at the point
The results are presented in Fig. 1. We found that the relative inaccuracy

in 20 runs varied from 0.0006 to 0.006.
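The experiment can be reproduced, up to the authors’ exact constants, with the following hedged sketch (the square’s side, the pentagon’s radius, the stepsize constant, and the use of full rather than moving averaging are illustrative assumptions):

```python
import numpy as np

def proj_simplex(v):
    """Euclidean projection onto the probability simplex (sort-based)."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > css)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1.0), 0.0)

def steiner_sasp(centers, n_steps, C, rng):
    """SASP on phi(x, y) = sum_i y_i * E||x - Z_i||, Z_i ~ N(center_i, I),
    with x in a square and y in the probability simplex."""
    k = len(centers)
    x = np.zeros(2)                    # facility location
    y = np.full(k, 1.0 / k)            # mixing weights over the k towns
    x_bar, y_bar = x.copy(), y.copy()
    for t in range(1, n_steps + 1):
        z = centers + rng.standard_normal((k, 2))      # one request per town
        d = np.linalg.norm(x - z, axis=1)
        gx = ((y / d)[:, None] * (x - z)).sum(axis=0)  # noisy grad_x phi
        gy = d                                         # noisy grad_y phi
        gamma = C / np.sqrt(t)
        x = np.clip(x - gamma * gx, -1.0, 1.0)         # project onto the square
        y = proj_simplex(y + gamma * gy)
        x_bar += (x - x_bar) / t                       # Cesaro averages
        y_bar += (y - y_bar) / t
    return x_bar, y_bar

rng = np.random.default_rng(0)
angles = 2.0 * np.pi * np.arange(5) / 5.0
centers = 0.5 * np.column_stack([np.cos(angles), np.sin(angles)])
x_hat, y_hat = steiner_sasp(centers, 2000, 0.5, rng)
print(x_hat, y_hat)   # by symmetry the optimal location is near the origin
```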


4.2. A SIMPLE QUEUING MODEL


Here we apply the SASP algorithm to optimization of simple queuing mod-
els in steady state, such as the GI/G/1 queue. We consider the following
minimization program:
where the domain X is an box

In particular, we assume that the expected performance is given as

where is the expected steady state waiting time of a customer, is the cost
of a waiting customer, are parameters of the distributions of
interarrival and service times, is the cost per unit increase of and is
the transpose of Note that for most exponential families of distributions (see
e.g., Rubinstein and Shapiro (1993), Chapter 3) the expected performance
is a convex function of
To proceed with the program (4.1), consider Lindley’s recursive (sample
path) equation for the waiting time of a customer in a GI/G/1 queue (e.g.
Kleinrock (1975), p. 277):

where, for fixed is an iid sequence of random variables


(differences between the interarrival and the service times) with distribution
depending on the parameter vector
It is readily seen that the corresponding algorithm for calculating an estimate
of can be written as follows:
Algorithm 4.1 :
1. Generate the output process using Lindley ’s recursive equation

here are the distri-


butions of the interarrival and service times, respectively, and
are independent random variables uniformly distributed in [0,1].
2. Differentiate (4.5) with respect to thus getting a recurrent formula for
and use this recurrence to construct the estimates
of
Note that under mild regularity assumptions (see, e.g., Rubinstein and Shapiro
(1993), Chapter 4) the expectation of converges to 0 as
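For concreteness, here is a minimal sketch of step 1 in the M/M/1 special case used below; the parameter values are illustrative, and the inverse-transform sampling mirrors (4.5):

```python
import numpy as np

def lindley_waits(lam, mu, n, rng):
    """Waiting times in an M/M/1 queue via Lindley's recursion
    L_{j+1} = max(L_j + S_j - A_j, 0), with interarrival times A_j
    (rate lam) and service times S_j (rate mu) generated by inverse
    transform from uniforms, as in step 1 of Algorithm 4.1."""
    A = -np.log(1.0 - rng.random(n)) / lam    # interarrival: F1^{-1}(u)
    S = -np.log(1.0 - rng.random(n)) / mu     # service:      F2^{-1}(u)
    L = np.zeros(n)
    for j in range(n - 1):
        L[j + 1] = max(L[j] + S[j] - A[j], 0.0)
    return L

rng = np.random.default_rng(1)
W = lindley_waits(lam=0.5, mu=1.0, n=100_000, rng=rng)
print(W.mean())   # steady-state mean wait lam/(mu*(mu - lam)) = 1.0 here
```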
Application of the SASP algorithm (2.15)–(2.18) to program (4.1)–(4.3)
yields
where are the estimates of yielded by Algorithm 4.1, with in (4.5)


replaced by (see (4.6)), and
It is important to note that now we are in a situation different from the
one postulated in Theorem 1 in the sense that the stochastic estimates of
used in (4.6) are biased and depend on each other. The numerical experiments
below demonstrate that the SASP algorithm handles these kinds of problems
reasonably well. In these experiments, we considered an M/M/1 queue,
and being the interarrival and service rates; is the decision variable in the
program (4.1)–(4.3). Taking into account that it is readily
seen that the value which minimizes the performance measure in (4.1)–
(4.3) is We set (which corresponds
to and ) and choose as X the segment

which in terms of the traffic intensity is

To demonstrate the effect of the Cesàro averaging (4.6) we present below


statistical results for both the sequences and (see (2.15)–(2.18)), i.e., with
and without Cesàro averaging. We shall call the sequence the crude SASP
(CSASP) sequence.
Tables 4.1 and 4.2 present 10 realizations of the estimates for and yielded
by the CSASP and SASP algorithms (denoted respectively) along with
the corresponding values of the objective; two estimates in the same column
correspond to a common simulation. Each of the 10 experiments related to
Table 4.1 was started at the point which corresponds to the
starting point for the experiments from Table 4.2 was The
observations used in the method were given by the Lindley equation approach;
the “memory” and the stepsizes were chosen according to (2.21), with
and In each experiment, we performed 2000
steps of the algorithm (i.e., simulated 2000 arrivals of customers).
Tables 4.3 and 4.4 summarize the statistics of Tables 4.1 and 4.2, namely,
they present the sample means

and and the associated confidence


intervals.
Let and be the widths of the confidence intervals associated with the
CSASP and SASP sequences, respectively. The quantity
can be regarded as the efficiency of the SASP sequence relative to the CSASP
one. From the results of Tables 4.3 and 4.4 it follows that the efficiency is quite
significant. E.g., for the experiments presented in Table 4.3 we have
and
We applied to problem (4.1)–(4.3) the SASP algorithm with the adaptive
stepsize policy (2.27), (2.30) and used it for various single-node queuing models
with different interarrival and service time distributions. In all our experiments
we found that the SASP algorithm converges reasonably fast to the optimal
solution
5. CONCLUSIONS
We have shown that

The SASP algorithm (2.15)–(2.18) is applicable to a wide variety of


stochastic saddle point problems, in particular to those associated with
single-stage constrained convex stochastic programming programs (Ex-
amples 2.1–2.3). The method works under rather mild conditions: it
requires only convexity-concavity of the associated saddle point prob-
lem (Assumption A) and conditional independence and unbiasedness of
the observations (Assumption B).

In contrast to the classical Stochastic Approximation, no smoothness or


nondegeneracy assumptions are needed. The rate of convergence of the
method is data- and dimension-independent and is optimal in order, in
the minimax sense, on wide families of convex-concave stochastic saddle
point problems.

As applied to general saddle point problems, the method seems to be


the only stochastic approximation type routine converging without addi-
tional smoothness and nondegeneracy assumptions. The only alternative
method for treating these problems is the so-called stochastic counterpart
method (see Shapiro (1996)), which, however, requires more powerful
nonlocal information on the problem. (For more details on the stochas-
tic approximation versus the stochastic counterpart method, see Shapiro


(1996)).

APPENDIX: A: PROOF OF THEOREMS 1 AND 2


As was already mentioned, the statement of Theorem 1 is a particular case of that of Theorem
2 corresponding to so that it suffices to prove Theorem 2.
Let

be the scaled Euclidean distance between and Note that due to the standard properties of
the projection operator, we have

We assume, without loss of generality, that


Note that

where for each are deterministic Borel functions.


Let us fix and consider the random variable

We have from (0.1)

Setting

we can rewrite the resulting relation as

where

and

Since is convex in and we have


Similarly, whence
Substituting this inequality in (0.2), we obtain

Summing the inequalities (0.5) over and applying the Cauchy inequal-
ity, we obtain

where

Applying next Jensen’s inequality to the convex functions and and taking into
account (2.18), we obtain that

Since we also have that and Clearly

because both and belong to X × Y . In view of these inequalities we obtain from (0.6)

The right hand side of (0.8) is independent of consequently, (0.8) majorizes the
upper bound of the left-hand side over This upper bound is equal to
(see (2.6)). Thus, we have derived the following inequality

In view of (2.23) and assumption C we have

Consequently, (0.9) yields

(we have taken into account that


To obtain the desired estimate for it suffices to take expectation of both sides
of (0.11). When doing so, one should take into account that
In view of assumption B the conditional expectations of the vectors and
(for fixed ) are zero, those of their squared norms do not exceed
and by construction is a deterministic function of This implies the


inequalities

and similarly

Finally,

With these inequalities we obtain from (0.11)

Since, by definition,

and, consequently,

we arrive at (2.26).

APPENDIX: B: PROOF OF THE PROPOSITION


Without loss of generality we may assume that X is not a singleton. By evident homogeneity
reasons, we may also assume that the diameter of X is 2 and that X contains the segment
being a unit vector. For a given consider the two
problems and with the following functions:

Let further
be the associated estimates of the partial (with respect to and ) derivatives of


and respectively. Assume, finally, that is a standard Gaussian random
variable.
It is readily seen that the problems indeed belong to
Let By the definition of complexity, there exists a method
which in N steps solves all problems from (in particular, both and ) with
expected inaccuracy at most The method clearly implies a routine for distinguishing
between two hypotheses, and on the distribution of an iid sample

where states that the distribution of every is The


routine is as follows:
In order to decide which of the hypotheses takes place, we treat the observed
sample as the sequence of coefficients at in the N subsequent observations of
the gradient with respect to in a saddle point problem on X × Y (and add zero
observations of the gradient with respect to ). Applying the first N steps of
method to these observations, we form the N-th approximate solution and
check whether If it is the case, we accept otherwise we accept

It is clear that the probability for to reject the hypothesis when it is valid is exactly
the probability for to get, as a result of its work on a point with In this
case the inaccuracy of regarded as an approximate solution to is at least and since
the expected inaccuracy of on the indicated probability is at most 1/4. By similar
considerations, the probability for to reject when this hypothesis is valid is also
Thus, the integer is such that there exists a routine for distin-
guishing between the aforementioned pair of statistical hypotheses with probability of rejecting
the true hypothesis (whether it is or ) at most 1/4. By standard statistical arguments,
this is possible only if

with an appropriately chosen positive absolute constant O(1), which yields the sought lower
bound on N.

REFERENCES
Asmussen, S. and R. Y. Rubinstein. (1992). “The efficiency and heavy traffic
properties of the score function method in sensitivity analysis of queuing
models", Advances of Applied Probability, 24(1), 172–201.
Ermoliev, Y.M. (1969). “On the method of generalized stochastic gradients and
quasi-Fejér sequences", Cybernetics, 5(2), 208–220.
Ermoliev, Y.M. and A. A. Gaivoronski. (1992). “Stochastic programming tech-
niques for optimization of discrete event systems", Annals of Operations
Research, 39, 1–41.
Kleinrock, L. (1975). Queueing Systems, Vols. I and II, Wiley, New York.
Kushner, H.J. and D.S. Clark. (1978). Stochastic Approximation Methods for
Constrained and Unconstrained Systems, Springer-Verlag, Applied Math.
Sciences, Vol. 26.
L’Ecuyer, P., N. Giroux, and P.W. Glynn. (1994). “Stochastic optimization by
simulation: Numerical experiments for the M/M/1 queue in steady-state",
Management Science, 40, 1245–1261.
Ljung, L., G. Pflug, and H. Walk. (1992). Stochastic Approximation and Opti-
mization of Stochastic Systems. Birkhäuser Verlag, Basel.
Nemirovski, A. and D. Yudin. (1978). “On Cesàro’s convergence of the gradi-
ent descent method for finding saddle points of convex-concave functions",
Doklady Akademii Nauk SSSR, Vol. 239, No. 4, (in Russian; translated into
English as Soviet Math. Doklady).
Nemirovski, A. and D. Yudin. (1983). Problem Complexity and Method Effi-
ciency in Optimization, J. Wiley & Sons.
Pflug, G. Ch. (1992). “Optimization of simulated discrete event processes".
Annals of Operations Research, 39, 173–195.
Polyak, B.T. (1990). “New method of stochastic approximation type", Automat.
Remote Control, 51, 937–946.
Rockafellar, R.T. (1970). Convex Analysis, Princeton University Press.
Rubinstein, R.Y. and A. Shapiro. (1993). Discrete Event Systems: Sensitivity
Analysis and Stochastic Optimization via the Score Function Method, John
Wiley & Sons, New York.
Shapiro, A. (1996). “Simulation based optimization—Convergence analysis
and statistical inference", Stochastic Models, to appear.
Tsypkin, Ya.Z. (1970). Adaptation and Learning in Automatic Systems, Aca-
demic Press, New York.
Chapter 9

REGRESSION MODELS FOR BINARY TIME SERIES

Benjamin Kedem
Department of Mathematics
University of Maryland
College Park, Maryland 20742, USA

Konstantinos Fokianos
Department of Mathematics & Statistics
University of Cyprus
P.O. Box 20537, Nicosia 1678, Cyprus

Abstract We consider the general regression problem for binary time series where the
covariates are stochastic and time dependent and the inverse link is any differ-
entiable cumulative distribution function. This means that the popular logistic
and probit regression models are special cases. The statistical analysis is carried
out via partial likelihood estimation. Under a certain large sample assumption
on the covariates, and owing to the fact that the score process is a martingale, the
maximum partial likelihood estimator is consistent and asymptotically normal.
From this we obtain the asymptotic distribution of a certain useful goodness of
fit statistic.

1. INTRODUCTION
Consider a binary time series taking the values 0 or 1, and related
covariate or auxiliary stochastic data represented by a column vector
The binary series may be stationary or nonstationary, and
the time dependent random covariate vector process may represent one
or more time series and functions thereof that influence the evolution of the
primary series of interest The covariate vector process need not
be stationary per se, however, it is required to possess the “nice” long term
behavior described by Assumption A below. Conveniently, may contain
past values of and/or past values of an underlying process that produces
We wish to study the regression problem of estimating the conditional success


probability

through a parameter vector where represents all that


is known to the observer at time about the time series and the covariate
information; clearly, More precisely, the problem is
to model the conditional probability (1.1) by a regression model depending on
and then estimate the latter given a binary time series and its time dependent
random covariates.
The present paper follows the construction in Fokianos and Kedem (1998)
and Slud and Kedem (1994), where categorical time series and logistic regres-
sion were considered, respectively. It is primarily an extension of Slud and
Kedem (1994), although still a special case of Fokianos and Kedem (1998).
Accordingly, we model (1.1) by the general regression model,

where F is a differentiable distribution function. Following the terminology of


generalized linear models, the term link is reserved here for the inverse function
(McCullagh and Nelder, 1989),

Thus, is the inverse link.


Any suitable differentiable inverse link F that maps the real line onto the
interval [0, 1] will do, but we shall assume without loss of too much generality
that F is a differentiable cumulative distribution function (cdf) with probability
density function (pdf) In particular, when F is the logistic cdf (the
case of canonical link),

(1.2), or equivalently (1.3), is called logistic regression, and when F is the cdf of
the standard normal distribution, the model is called probit regression.
The most popular link functions for binary regression are listed in Table 9.1.
The regression model (1.2) has received much attention in the literature,
mostly under independence. See, among many more, Cox (1970), Diggle,
Liang, and Zeger (1994), Fahrmeir and Kaufmann (1987), Fahrmeir and Tutz (1994),
Fokianos and Kedem (1998), Kaufmann (1987), Keenan (1982), Slud and Ke-
dem (1994), and Zeger and Qaqish (1988).
2. PARTIAL LIKELIHOOD INFERENCE


The useful idea of forming certain likelihood functions by taking products
of conditional densities where the formed products do not necessarily give
complete joint or full likelihood information is due to Cox (1975), and later
studied rigorously in Jacod (1987), Slud (1982), Slud (1992), and Wong (1986).
In this section we define precisely what we mean by partial likelihood with
respect to an increasing sequence of sigma–fields, and then apply it in regression
models for binary time series. This is done quite generally for a large class of
link functions where is a differentiable distribution function.

2.1. DEFINITION OF PARTIAL LIKELIHOOD


Partial likelihood with respect to a nested sequence of conditioning histories
is defined as follows.

Definition. Let $\mathcal{F}_0 \subset \mathcal{F}_1 \subset \mathcal{F}_2 \subset \cdots$ be an increasing sequence of
$\sigma$-fields, and let $Y_1, Y_2, \ldots$ be a sequence of random variables on some
common probability space such that $Y_t$ is $\mathcal{F}_t$-measurable. Denote the density
of $Y_t$ given $\mathcal{F}_{t-1}$ by $f_t(y_t; \beta)$, where $\beta$ is a fixed parameter. The partial
likelihood (PL) function relative to $\beta$, $\{\mathcal{F}_t\}$, and the data $Y_1, \ldots, Y_N$ is given
by the product
$$PL(\beta) = \prod_{t=1}^{N} f_t(y_t; \beta).$$
Thus, if we define,
$$\pi_t(\beta) = P(Y_t = 1 \mid \mathcal{F}_{t-1}) = F(\beta' Z_{t-1}),$$
then for a binary time series with covariate information the partial likeli-
hood of $\beta$ takes on the simple product form,
$$PL(\beta) = \prod_{t=1}^{N} \pi_t(\beta)^{y_t} \bigl(1 - \pi_t(\beta)\bigr)^{1 - y_t}.$$
In the next section we study the maximizer $\hat{\beta}$ of $PL(\beta)$, referred to as the
maximum partial likelihood estimator (MPLE) of $\beta$.

2.2. AN ASSUMPTION REGARDING THE


COVARIATES
The following assumption guarantees the asymptotic stability of the covari-
ate process

Assumption A
A1. The true parameter belongs to an open set
A2. The covariate vector almost surely lies in a nonrandom compact
subset of such that
A3. There is a probability measure v on such that is positive
definite, and such that for Borel sets

in probability as at the true value of


A4. The inverse link function F is twice continuously differentiable and

2.3. PARTIAL LIKELIHOOD ESTIMATION


It is simpler to derive the maximum partial likelihood estimator by maximiz-
ing with respect to the log-partial likelihood,

Assuming differentiability, when exists it can be obtained from an estimating


equation referred to as the partial likelihood score equation,

where,

Just as is the case with the regular (full) likelihood, assuming differentiability,
the score vector
where,

plays an important role in large sample theory based on partial likelihood. The
score vector process, is defined by the partial sums,

Observe that the score process, being the sum of martingale differences, is
a martingale with respect to the filtration That is,
Clearly,
Define,

and

where

Then the sample information matrix satisfies,

By Assumption A, has a limit in probability,

while Fokianos and Kedem (1998),

It follows that,
and we refer to as the information matrix per single observation for


estimating By assumption A it is positive definite and hence also nonsingular
for every
By expanding using Taylor series to one term about
and by Assumption A, we obtain the useful approximation up to terms asymp-
totically negligible in probability,

Thus, an application of the central limit theorem for martingales gives (Fokianos
and Kedem, 1998; Slud and Kedem, 1994),

We now have
Theorem 2.1 (Fokianos and Kedem, 1998; Slud and Kedem, 1994). The MPLE
is almost surely unique for all sufficiently large N, and as

(i)

(ii)

(iii)

2.4. PREDICTION
An immediate application of Theorem 2.1 is in constructing prediction in-
tervals for from By the delta method (see Rao, 1973, p.388), (ii)
in Theorem 2.1 implies that

where

Therefore, an asymptotic prediction interval is given by,
$$F(z'\hat{\beta}) \;\pm\; z_{\alpha/2}\, f(z'\hat{\beta})\, \sqrt{z'\, G^{-1}(\hat{\beta})\, z \,/\, N}.$$
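As a sketch under the logit link (the helper name is an assumption; `cov_beta` stands for the estimated covariance of the MPLE, e.g. the inverse of the sample information matrix):

```python
import numpy as np

def prediction_interval(z, beta_hat, cov_beta, z_alpha=1.96):
    """Delta-method interval for pi = F(z'beta) under the logit link:
    dF/d(eta) = F(1 - F), so se{F(z'beta_hat)} = F(1-F) * sqrt(z'Cz),
    with C the estimated covariance of beta_hat."""
    p = 1.0 / (1.0 + np.exp(-(z @ beta_hat)))       # point prediction
    se = p * (1.0 - p) * np.sqrt(z @ cov_beta @ z)  # delta-method std. error
    return p - z_alpha * se, p + z_alpha * se
```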
3. GOODNESS OF FIT
The (scaled) deviance

is used routinely in testing for goodness of fit of generalized linear models. It


turns out, however, that the deviance is quite problematic when confronting
binary data (Firth, 1991; McCullagh and Nelder, 1989). An alternative is a
goodness of fit statistic constructed by classifying the binary data according to
the covariates (Fokianos and Kedem, 1998; Schoenfeld, 1980; Slud and
Kedem, 1994).
Let constitute a partition of For define,

and

Put
The next result follows readily from Theorem 2.1, and several applications
of the multivariate Martingale Central Limit Theorem as given in Andersen and
Gill (1982), Appendix II. Assume that the true parameter is
Theorem 3.1 (Slud and Kedem, 1994). Consider the general regression model
(1.2) where F is a cdf with density f. Let be a partition of
Then we have as
(i)

where is a square matrix of dimension

Here A is a diagonal matrix with the diagonal element given


by

The matrix is the limiting inverse of the information


matrix, and the column of B is given by
(ii) As

(iii) As the asymptotic distribution of the statistic

is

In verifying (3.17) in Theorem 3.1 it is helpful to note that


is a zero-mean martingale with asymptotic variance and that for
and are orthogonal.
Replacing by its estimator in (3.17), the goodness of fit statistic
can be used for model adequacy. In this case is stochastically smaller
than when and are obtained from the same (training)
data, and stochastically larger when is obtained from one data set (training
data) but comes from a different independent data set (testing
data). Indeed, in the first case,

On the other hand, when and are obtained from independent


data sets,

Appropriate modification of the degrees of freedom must be made to accom-


modate the two cases.

4. LOGISTIC REGRESSION
The logistic regression model, where $F$ is the logistic cdf (see (1.4)), is the most widely
used regression model for binary data. The model can be written as,
$$\pi_t(\beta) = \frac{\exp(\beta' Z_{t-1})}{1 + \exp(\beta' Z_{t-1})}. \qquad\qquad (4.18)$$
The equivalent inverse transformation of (4.18), and another way to write the
model, is the canonical link for binary data referred to as logit,
$$\log\left\{\frac{\pi_t(\beta)}{1 - \pi_t(\beta)}\right\} = \beta' Z_{t-1}. \qquad\qquad (4.19)$$
For this important special case the previous results simplify greatly, since
for the logistic cdf $f = F(1 - F)$. Thus, the score vector has the simplified
form,
$$S_N(\beta) = \sum_{t=1}^{N} Z_{t-1}\bigl(Y_t - \pi_t(\beta)\bigr), \qquad\qquad (4.20)$$
and the sample information matrix reduces to,
$$G_N(\beta) = \sum_{t=1}^{N} \pi_t(\beta)\bigl(1 - \pi_t(\beta)\bigr)\, Z_{t-1} Z_{t-1}'. \qquad\qquad (4.21)$$
It is easily seen that is the sum of conditional covariance matrices,

and that the sample information matrix per single observation con-
verges to a special case of the limit (2.13),

There is also a simplification in Theorem 3.1.


Corollary 4.1 Under the assumption of logistic regression, Theorem 3.1 holds
with

and
4.1. A DEMONSTRATION
Consider a binary time series obeying a logistic autoregres-
sion model containing a deterministic periodic component,

so that A time series from the model and its


corresponding success probability, are plotted in Figure 9.1.

To illustrate the asymptotic normality result (ii) in Theorem 2.1, the model
was simulated 1000 times for N = 200, 300,1000. In each run, the partial
likelihood estimates of the were obtained by maximizing (2.6). This gives
1000 estimates from which sample means and variances were
computed.
The theoretical variances of the estimators were approximated by inverting
in (4.21). The results are summarized in Table 9.2. There is a close
agreement between the theory and the experimental results.
A graphical illustration of the prediction limits (2.16) is given in Figure 9.2
where we can see that is nestled quite comfortably within the prediction
limits. Again we inverted (4.21) for the approximation
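A self-contained sketch of this experiment follows; the covariate specification (an intercept, the lagged response, and a cosine seasonal term) is a hedged reading of model (4.23), and the constants are illustrative:

```python
import numpy as np

def simulate_series(beta, N, rng, period=12):
    """Binary series with pi_t = logistic(b0 + b1*Y_{t-1} + b2*cos(2*pi*t/period))."""
    y = np.zeros(N, dtype=float)
    for t in range(1, N):
        z = np.array([1.0, y[t - 1], np.cos(2.0 * np.pi * t / period)])
        y[t] = rng.random() < 1.0 / (1.0 + np.exp(-(z @ beta)))
    return y

def mple(y, period=12, iters=25):
    """Newton scoring for the MPLE under the logit link:
    score as in (4.20), sample information as in (4.21)."""
    N = len(y)
    t = np.arange(1, N)
    Z = np.column_stack([np.ones(N - 1), y[:-1], np.cos(2.0 * np.pi * t / period)])
    Y = y[1:]
    beta = np.zeros(Z.shape[1])
    for _ in range(iters):
        pi = 1.0 / (1.0 + np.exp(-(Z @ beta)))
        G = (Z * (pi * (1.0 - pi))[:, None]).T @ Z        # sample information
        beta = beta + np.linalg.solve(G, Z.T @ (Y - pi))  # scoring step
    return beta, np.linalg.inv(G)                         # MPLE, est. covariance

rng = np.random.default_rng(2)
y = simulate_series(np.array([-0.5, 1.0, 1.0]), 1000, rng)
beta_hat, cov = mple(y)
print(beta_hat, np.sqrt(np.diag(cov)))  # estimates and asymptotic std. errors
```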
To demonstrate the tendency of the goodness of fit statistic (3.17) towards


a chi-square distribution with the indicated number of degrees of freedom,
consider again the logistic regression model (4.23). It is sufficient to partition
the set of values of into disjoint sets. Let

Then, k=4, is the sum of those ’s for which is in


and the are obtained similarly. In forming (3.17), we replace by its
estimator,

where is given in (4.18). The Q-Q plots in Figure 9.3 were obtained from
1000 independent time series (4.23) of length N = 200
and N = 400 and 1000 independent random variables.
Except for a few outliers, the approximation is quite good.

5. CATEGORICAL DATA
The previous analysis can readily be extended to categorical time series where
admits values representing categories. We only mention two types of
models to show the proximity to regression models for binary time series. For
a thorough treatment see Fokianos (1996), and Fokianos and Kedem (1998).
Generally speaking, we have to distinguish between two types of categorical
variable, nominal, where the categories are not ordered (e.g. daily choice
of dinner categorized as vegetarian, dairy, and everything else), and ordinal,
where the categories are ordered (e.g. hourly blood pressure categorized as
low, normal, and high.) Interval data can be treated as ordinal.
A possible model for nominal categorical time series is the multinomial logits
model (Agresti, 1990),

where is a p-dimensional regression parameter and is a vector of


stochastic time dependent covariates of the same dimension.
A well known model for the analysis of ordinal data is the cumulative odds
model (McCullagh, 1980). We can illustrate the model using a latent variable
Thus, let where is a sequence of i.i.d. random
variables with cumulative distribution F, is a vector of parameters, and


is a covariate vector of the same dimension. Suppose that we observe

for where are threshold


parameters. It follows that

The model can be formulated somewhat more compactly by the equation:

Since the set of cumulative probabilities corresponds one to one to the set of the
response probabilities, estimating the former enables estimation of the latter.
Various choices for F can arise. For example, the logistic distribution gives the
so called proportional odds model. In principle any link used for binary time
series can be used here as well.
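A minimal latent-variable sketch (logistic errors, giving the proportional odds model; the sign convention and names are illustrative assumptions):

```python
import numpy as np

def simulate_cumulative_odds(Z, beta, thresholds, rng):
    """Ordinal responses via the latent variable Y*_t = beta'Z_{t-1} + e_t
    with logistic errors: the observed category is the number of
    thresholds theta_1 < ... < theta_{m-1} lying below Y*_t."""
    latent = Z @ beta + rng.logistic(size=Z.shape[0])
    return np.searchsorted(thresholds, latent)   # categories 0, ..., m-1

rng = np.random.default_rng(4)
Z = rng.standard_normal((8, 2))
print(simulate_cumulative_odds(Z, np.array([1.0, -0.5]), [-1.0, 1.0], rng))
```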

A Final Note As said before, this paper is an extension of Slud and Kedem
(1994). In that paper there is a data analysis example that uses National Weather
Service rainfall/runoff measurements. The data were graciously provided to us
in 1987 by Sid Yakowitz, blessed be his memory. When Slud and Kedem (1994)
finally appeared in 1994, Sid upon receiving a reprint made some encouraging
remarks. The present extension is written in his memory.

REFERENCES
Agresti, A. (1990). Categorical Data Analysis. Wiley, New York.
Andersen, P. K. and R. D. Gill. (1982). Cox’s regression model for counting
processes: A large sample study. Annals of Statistics, 10, 1100-1120.
Cox, D. R. (1970). The Analysis of Binary Data. Methuen, London.
Cox, D. R. (1975). Partial likelihood. Biometrika, 62, 69-76.
Diggle, P. J., K-Y. Liang, and S. L. Zeger. (1994). Analysis of Longitudinal
Data. Oxford University Press, Oxford.
Fahrmeir, L. and H. Kaufmann. (1987). Regression models for nonstationary
categorical time series. Journal of Time Series Analysis, 8, 147-160.
Fahrmeir, L. and G. Tutz. (1994). Multivariate Statistical Modelling Based on
Generalized Linear Models. Springer, New York.
Fokianos, K. (1996). Categorical Time Series: Prediction and Control. Ph.D.
Thesis, Department of Mathematics, University of Maryland, College Park.
Fokianos, K. and B. Kedem. (1998). Prediction and Classification of non-
stationary categorical time series. Journal of Multivariate Analysis, 67, 277-
296.
Firth, D. (1991). Generalized linear models. Chapter 3 of D. Hinkley et al., eds.,
Statistical Theory and Modelling. In Honour of Sir David Cox, FRS,
Chapman and Hall, London.
Jacod, J. (1987). Partial likelihood processes and asymptotic normality. Stochas-
tic Processes and their Applications, 26, 47-71.
Kaufmann, H. (1987). Regression models for nonstationary categorical time
series: Asymptotic estimation theory. Annals of Statistics, 15, 79-98.
Kedem, B. (1980). Binary Time Series. Dekker, New York.
Keenan, D. M. (1982). A time series analysis of binary data. Journal of the
American Statistical Association, 77, 816-821.
McCullagh, P. and J. A. Nelder. (1989). Generalized Linear Models, 2nd ed.
Chapman and Hall, London.
McCullagh, P. (1980). Regression models for ordinal data (with discussion).
Journal of the Royal Statistical Society, B, 42, 109-142.
Rao, C. R. (1973). Linear Statistical Inference and its Applications, 2nd ed.
John Wiley, New York.
Schoenfeld, D. (1980). Chi-squared goodness-of-fit for the proportional hazard
regression model. Biometrika, 67, 145-153.
Slud, E. (1982). Consistency and efficiency of inferences with the partial like-
lihood. Biometrika, 69, 547-552.
Slud, E. (1992). Partial likelihood for continuous-time stochastic processes.
Scandinavian Journal of Statistics, 19, 97-109.
Slud, E. and B. Kedem. (1994). Partial likelihood analysis of logistic regression
and autoregression. Statistica Sinica, 4, 89-106.
Wong, W. H. (1986). Theory of partial likelihood. Annals of Statistics, 14, 88-
123.
Zeger, S. L. and B. Qaqish. (1988). Markov regression models for time series:
A quasi likelihood approach. Biometrics, 44, 1019-1031.
Chapter 10

ALMOST SURE CONVERGENCE PROPERTIES OF


NADARAYA-WATSON REGRESSION ESTIMATES

Harro Walk
Mathematisches Institut A
Universität Stuttgart
Pfaffenwaldring 57, D-70569
Stuttgart, Germany

Abstract For Nadaraya-Watson regression estimates with window kernel, self-contained
proofs of strong universal consistency for special bandwidths and of the corre-
sponding Cesàro summability for general bandwidths are given.

1. INTRODUCTION
In this paper a self-contained treatment of some convergence problems in nonparametric regression estimation is given. For an observable random vector X and a non-observable square integrable real random variable Y, the best estimate of a realization of Y on the basis of an observed realization of X in the mean square sense is given by the regression function $m$ defined by $m(x) = \mathbf{E}(Y \mid X = x)$: it minimizes $\mathbf{E}(Y - f(X))^2$ with respect to measurable $f$ because of

$$\mathbf{E}(Y - f(X))^2 = \mathbf{E}(Y - m(X))^2 + \int (f(x) - m(x))^2 \, \mu(dx),$$
where $\mu$ is the distribution of X. Knowledge of $m$ also allows an interpretation of the relation between X and Y. The regression function, which is usually unknown, can be estimated on the basis of an observable training sequence $(X_1, Y_1), (X_2, Y_2), \dots$ of independent (identically distributed) copies of the random vector (X, Y). Let S be a fixed bounded sphere in $\mathbb{R}^d$ around 0 and $K = 1_S$ be its indicator function (window kernel function). Choose a sequence $(h_n)$ in $(0, \infty)$ of so-called bandwidths. Now by use of the observations $(X_1, Y_1), \dots, (X_n, Y_n)$ the regression function is estimated by

$$m_n(x) = \frac{\sum_{i=1}^{n} Y_i K((x - X_i)/h_n)}{\sum_{j=1}^{n} K((x - X_j)/h_n)} \qquad (1)$$

(with 0/0 := 0). This estimate, also for more general kernel functions K, has been proposed by Nadaraya (1964) and Watson (1964). Set

Weak universal consistency of $m_n$, i.e.,

$$\mathbf{E}\int |m_n(x) - m(x)|^2 \, \mu(dx) \to 0 \qquad (n \to \infty)$$

for each distribution of (X, Y) with $\mathbf{E}Y^2 < \infty$, again for more general kernels and the choice $h_n \to 0$, $n h_n^d \to \infty$, has been established by Devroye and Wagner (1980) and Spiegelman and Sacks (1980). The concept of (weak) universal consistency in regression estimation was introduced by Stone (1977), who showed this property for nearest neighbor regression estimates. It is an open question whether $m_n$ is strongly universally consistent, i.e., whether

$$\int |m_n(x) - m(x)|^2 \, \mu(dx) \to 0 \quad \text{a.s.}$$

for the usual choice of bandwidths. For nearest neighbor regression estimates strong universal consistency was shown by Devroye, Györfi, Krzyzak and Lugosi (1994). Strong consistency of the Nadaraya-Watson kernel estimates in the case of bounded Y was shown by Devroye and Krzyzak (1989). On the other hand, from Yakowitz and Heyde (1997), Györfi, Morvai and Yakowitz (1998) and Nobel (1999) it is known that the concept of strong universal consistency cannot be transferred to the general case of stationarity and ergodicity, not even for {0, 1}-valued Y.
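For concreteness, the estimate (1) is easy to realize directly; the following minimal sketch (Python; the simulated data and all names are our own illustration) averages the Y_i whose X_i fall within distance h of the query point, with the 0/0 := 0 convention.

    import numpy as np

    def nw_window_estimate(x, X, Y, h):
        # Nadaraya-Watson estimate with window (indicator) kernel: the
        # average of the Y_i with ||X_i - x|| <= h; 0 if the window is empty.
        inside = np.linalg.norm(X - x, axis=1) <= h
        return float(Y[inside].mean()) if inside.any() else 0.0

    # illustrative data: regression function sin(x_1), noisy observations
    rng = np.random.default_rng(0)
    X = rng.uniform(-2.0, 2.0, size=(1000, 2))
    Y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(1000)
    print(nw_window_estimate(np.zeros(2), X, Y, h=0.3))  # close to sin(0) = 0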
The present paper deals with almost sure convergence in $L_2(\mu)$ and the corresponding Cesàro summability (a.s. convergence of arithmetic means) of the $L_2(\mu)$ errors of window kernel regression estimates. The convergence result (Theorem 2; strong universal consistency) is established for a piecewise constant sequence of bandwidths, especially

(3)

([ ] denoting the integer part) with arbitrary fixed parameters. The bandwidth sequence above is well approximated by (3) via the choice of a large parameter there. In the proof, the essential step (Theorem 1) consists in the verification of a condition in Györfi's (1991) criterion for strong universal consistency of a class of regression estimates. The Cesàro summability result (Theorem 3) is proved for a general bandwidth sequence satisfying the usual conditions, especially the choice as above. The results can be transferred to partitioning estimates and, refining arguments for binomial random variables and using a more general covering lemma due to Devroye and Krzyzak (1989), also to Nadaraya-Watson estimates with rather general kernel; as to the convergence results (with another argument) see Walk (1997). The only tools used in the present paper without proof are Doob's submartingale convergence theorem (Loève (1977), section 32.3) and the fact that the set of continuous functions with compact support is dense in $L_2(\mu)$ (Dunford and Schwartz (1958), p. 298). It is well known in summability theory (see Zeller and Beekmann (1970), section 44, or Hardy (1949), Theorem 114) that Cesàro summability (even Abel summability) of a sequence together with a gap condition on the increments implies its convergence. This gap condition is fulfilled for the increments in Theorem 2, but not for the increments of the general sequence in Theorem 3; thus Theorem 2 is not implied by Theorem 3, but needs a separate proof. Section 2 contains the results (Theorems 1, 2, 3). Section 3 contains lemmas and proofs.

2. RESULTS
The results concern Nadaraya-Watson regression estimates with window ker-
nel according to (1) and (2). Theorem 1 is an essential tool in the proof of
Theorem 2 and is stated in this section because of its independent interest. In
contrast to the other results, Theorem 1 deals with integrable (instead of square
integrable) nonnegative real (instead of real) random variables
where, as in the other parts of the paper, are independent
copies of ( X , Y).
Theorem 1. If
at most for the indices
where for fixed D > 1, then with some

for each distribution of ( X , Y) with integrable



Avoiding a formulation which uses (X, Y), the assertion of Theorem 1 can also be stated in the form

with independent identically distributed random vectors


in the definition of where is independent
of the underlying distribution of which has to satisfy
This means that in spite of possible unboundedness of the
s the sequence is a.s. bounded with an asymptotic bound
depending on By this result, as in the proof of Theorem 2, using

for bounded real random variables one can show the latter assertion even
for integrable real random variables But we shall content ourselves to
show the corresponding convergence result for square integrable real
formulated as Theorem 2. This theorem states strong universal consistency of
window kernel regression estimates for special sequences of bandwidths.
Theorem 2. Let satisfy
at most for the indices
where for fixed D > 1 and

e.g. with Then

Remark. In Theorem 2 the statistician does not change the bandwidth at each change of n, as is done in the usual choice.
But if the special choice (3) is written in the form one has
such that and are of the same order and even the
factor in can be arbitrarily well approximated by use of a sufficiently large
in the definition of This is important in view of the rate of convergence under
regularity assumptions (see e.g. Härdle (1990), ch. 4, with further references).
The next theorem states that for very general choice of the bandwidth se-
quence including choice the sequence of
is Cesàro summable to 0 a.s.
Theorem 3. Let satisfy

Then

3. LEMMAS AND PROOFS


The first lemma, which concerns binomially distributed random variables, is elementary and well-known.
Lemma 1. If B is a binomial variable, then

for
PROOF.

Now a variant of inequalities of Efron and Stein (1981), Steele (1986) and
Devroye (1991) on the variance of a function of independent random variables
will be established. Assumptions concerning symmetry of the function or iden-
tical distribution of the random variables or bounded function value differences
are avoided.
Lemma 2. Let be independent m-dimensional ran-
dom vectors where is a copy of For measurable
assume square integrable. Then

PROOF. We use arguments from the proofs of Theorem 9.2 (McDiarmid (1989)) and Theorem 9.3 (Devroye (1991)) in Devroye, Györfi and Lugosi (1996), and Jensen's inequality. Set

form a martingale difference sequence with respect to


We have

Let denote the distribution of Then

for which yields the assertion.
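As a numerical illustration of inequalities of this kind, the following sketch (our own choice of a nonsymmetric f and non-identically distributed arguments; for concreteness the classical Efron-Stein form with factor 1/2 is checked) compares both sides by Monte Carlo.

    import numpy as np

    rng = np.random.default_rng(1)
    n, reps = 5, 200_000
    scales = np.array([1.0, 0.5, 2.0, 1.5, 0.7])   # non-identical distributions

    def f(x):                                      # a nonsymmetric function
        return np.max(x, axis=-1) + 0.5 * x[..., 0] ** 2

    X  = rng.standard_normal((reps, n)) * scales   # X_1, ..., X_n
    Xp = rng.standard_normal((reps, n)) * scales   # independent copies X'_i
    lhs, rhs = f(X).var(), 0.0
    for i in range(n):
        Xi = X.copy()
        Xi[:, i] = Xp[:, i]                        # replace the i-th argument
        rhs += 0.5 * np.mean((f(X) - f(Xi)) ** 2)
    print(lhs, "<=", rhs)                          # the variance bound holds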


The following lemma is well-known especially in stochastic approximation.
Lemma 3. (Aizerman, Braverman and Rozonoer (1964), Gladyshev (1965),
MacQueen (1967), Van Ryzin (1969), Robbins and Siegmund (1971)) Let
and be sequences of integrable nonnegative real random variables on a
probability space with and let be a nondecreasing
sequence of of such that and are measurable with
respect to (i.e. events for each and

Then converges a.s.


PROOF. Setting

one notices that is a submartingale with respect to satisfying


sup Then Doob’s submartingale convergence theorem (Loève

(1977), section 32.3) yields a.s. convergence of from which the assertion
follows by a.s. convergence of
The next lemma is a specialized version of a covering lemma of Devroye
and Krzyzak (1989), compare also Devroye and Wagner (1980) and Spiegelman
and Sacks (1980).
Lemma 4. There is a finite constant only depending on such that for
each and probability measure

PROOF. Let S (with radius R) be covered by finitely many open spheres of radius R/2 (M depending only on the dimension). Thus
For each and relation implies
thus Then for each and
one has

In the following the notation

will be used. First a criterion for strong universal consistency will be given.
Lemma 5. (Györfi (1991)) is strongly universally consistent if the
following conditions are fulfilled:
a)
a.s.
for each distribution of (X,Y) with bounded Y ,
b) there is a constant such that for each distribution of ( X , Y) with
satisfying only

PROOF. Let ( X , Y) be arbitrary with Assume without


loss of generality. Fix For all define

and let and be the functions and when Y and are replaced
by and respectively. Then

By a) and uniform boundedness of and

By Cauchy-Schwarz inequality

if L is chosen sufficiently large, further for this L and by b),

The following lemma is well-known from the classical proof and from Ete-
madi’s (1981) proof of Kolmogorov’s strong law of large numbers (see e.g.
Bauer (1991), §12).
Lemma 6. For identically distributed random variables with
let be the truncation of at i.e.

Then

Moreover, for with D > 1 we have

PROOF. Noticing that

we obtain the first assertion from

by the Kronecker lemma, and the second assertion via

In view of the third assertion, for let denote the minimal index
with Then

and we obtain

PROOF OF THEOREM 1. Without loss of generality may


be assumed (as long as for some insert integer part of
into the sequence As in the classical proof and Etemadi’s (1981) proof
of Kolmogorov’s strong law of large numbers (see e.g. Bauer (1991), §12) a
truncation argument is used. Set

for (as in the proof of Lemma 5), and

(as in Lemma 6). Further set

for
In the first step, it will be shown

for and for some suitable constant Let


be independent (identically distributed) copies of (X, Y) and let
be obtained from via replacing
By Lemma 2

For we have

(by exchangeability)

(by exchangeability and


=: 8B + 8C + 8D.
In the following several times Lemma 4 will be used, also the independence
assumption for taking expectations and First
we obtain

(by Cauchy-Schwarz inequality and Lemma 1 for

Noticing

we similarly obtain

(by Cauchy-Schwarz inequality and Lemma 1 for and



Further, by

and exchangeability, we obtain

These bounds yield (4).


In the second step

will be shown. We use a monotonicity argument (compare Etemadi's (1981) proof of the strong law of large numbers). For

we have thus

By the independence assumption, Lemmas 1 and 4, we obtain

and thus

Using (4) and Lemma 6, by we obtain

and thus

Now (6), (7), (8) yield (5).


In the third step the assertion of the theorem will be shown. Because of

one has a.s. from some random index on. Thus, because of (5), it
suffices to show that for each fixed

But this follows from



(see above upper bound for B).


PROOF OF THEOREM 2. According to Lemma 5 and Theorem 1 it suffices
to show

for bounded Y. This was proved for general bandwidth sequences by Devroye
and Krzyzak (1989) via exponential inequalities. For the special bandwidth
sequence here, we shall use Lemma 3. By uniform boundedness of and it is apparently enough to show

for each sphere S* around 0. First we shall prove

With and
we obtain

Taking the integral on S* with respect to by Lemma 3 we obtain a.s. convergence of the sequence

because of piecewise constancy of and

the corresponding relation with replaced by and

with a suitable constant To show (11) and (12) we notice

with suitable with for some and then use


Lemma 4; further we notice

with some D* > 1. A.s. convergence of together with



for some and

yields

It holds

To show this notice that the set of continuous functions with compact support is dense in $L_2(\mu)$ (see e.g. Dunford and Schwartz (1958), p.
298). Now for an arbitrary and choose such an with
Obviously (14) holds for Further, by
Lemma 4

and (14) is obtained in the general case. Now from (13) and (14) relation (10)
follows. By (10) with one has for

(15) and (10) yield (9).


It should be mentioned that (9) can also be proved using once more Etema-
di’s (1981) argument for the strong law of large numbers, thus avoiding the
above martingale argument. Noticing that in the context of the piecewise constant

sequence one can assume with


arbitrarily close to 1, one obtains

and thus, considering also the case relation (9). One is led to (16) by a
majorization and a minorization and then using

(generalized Lebesgue density theorem; see e.g. Wheeden and Zygmund (1977), Theorem 10.49) in view of an expectation term and

(by (11)) in view of a variance term.


PROOF OF THEOREM 3. The argument is similar to that for Theorem 2, the notations of which we use. Analogously to Lemma 5 one notices that it
suffices to show that for some

for each integrable and that

in the case of bounded Y. In view of (17) as in the proof of Theorem 1 we set

for We notice

The latter assertion is obtained from



(see (4)) and thus

(because of Lemma 6), by use of the Kronecker lemma. From (19) we obtain

as in the third step of the proof of Theorem 1. Further

because of

obtained as (7). (20) and (21) yield (17). Now assume Y is bounded. In view
of (18) it suffices to show

for each sphere S* around 0. The proof of this is reduced to the proof of

We notice

by

with some and by the Kronecker lemma. (23) together with (14) yields
(22).
The author thanks the referee and M. Kohler, whose suggestions improved
the readability of the paper.

REFERENCES
Aizerman, M.A., E. M. Braverman, and L. I. Rozonoer. (1964). The proba-
bility problem of pattern recognition learning and the method of potential
functions. Automation and Remote Control 25, 1175-1190.
Bauer, H. (1991). Wahrscheinlichkeitstheorie, 4th ed. W. de Gruyter, Berlin.
Devroye, L. (1991). Exponential inequalities in nonparametric estimation. In:
G. Roussas, Ed., Nonparametric Functional Estimation and Related Topics.
NATO ASI Ser. C, Kluwer Acad. Publ., Dordrecht, 31-44.
Devroye, L., L. Györfi, A. Krzyzak, and G. Lugosi. (1994). On the strong
universal consistency of nearest neighbor regression function estimates. Ann.
Statist. 22, 1371–1385.
Devroye, L., L. Györfi, and G. Lugosi. (1996). A Probabilistic Theory of Pattern
Recognition. Springer, New York, Berlin, Heidelberg.
Devroye, L. and A. Krzyzak. (1989). An equivalence theorem for convergence of the kernel regression estimate. J. Statist. Plann. Inference 23, 71–82.
Devroye, L. and T. J. Wagner. (1980). Distribution-free consistency results
in nonparametric discrimination and regression function estimation. Ann.
Statist. 8, 231–239.
Devroye, L. and T. J. Wagner. (1980). On the convergence of kernel esti-
mators of regression functions with applications in discrimination. Z. Wahr-
scheinlichkeitstheorie Verw. Gebiete 51, 15–25.
Dunford, N. and J. Schwartz. (1958). Linear Operators, Part I. Interscience
Publ., New York.
Efron, B. and C. Stein. (1981). The jackknife estimate of variance. Ann. Statist.
9, 586-596.

Etemadi, N. (1981). An elementary proof of the strong law of large numbers. Z. Wahrscheinlichkeitstheorie Verw. Gebiete 55, 119–122.
Gladyshev, E.G. (1965). On stochastic approximation. Theor. Probab. Appl. 10,
275-278.
Györfi, L. (1991). Universal consistencies of a regression estimate for unboun-
ded regression functions. In: G. Roussas, Ed., Nonparametric Functional
Estimation and Related Topics. NATO ASI Ser. C, Kluwer Acad. Publ.,
Dordrecht, 329–338.
Györfi, L., G. Morvai, and S. Yakowitz. (1998). Limits to consistent on-line
forecasting for ergodic time series. IEEE Trans. Information Theory, 44,
886-892.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University
Press, Cambridge.
Hardy, G. H. (1949). Divergent Series. Clarendon Press, Oxford.
Loève, M. (1977). Probability Theory II, 4th ed. Springer, New York, Heidel-
berg, Berlin.
MacQueen, J. (1967). Some methods for classification and analysis of multi-
variate observations. In: J. Neyman, Ed., Proceedings of the Fifth Berkeley
Symposium on Mathematical Statistics and Probability. University of Cali-
fornia Press, Berkeley and Los Angeles, 281-297.
McDiarmid, C. (1989). On the method of bounded differences. In: Surveys in
Combinatorics 1989. Cambridge University Press, Cambridge, 148-188.
Nadaraya, E.A. (1964). On estimating regression. Theory Probab. Appl. 9, 141–142.
Nobel, A.B. (1999). Limits to classification and regression estimation from
ergodic processes. Ann. Statist. 27, 262-273.
Robbins, H. and D. Siegmund. (1971). A convergence theorem for nonnega-
tive almost supermartingales and some applications. In: J.S. Rustagi, Ed.,
Optimizing Methods of Statistics. Academic Press, New York, 233–257.
Spiegelman, C. and J. Sacks. (1980). Consistent window estimation in nonpa-
rametric regression. Ann. Statist. 8, 240–246.
Steele, J.M. (1986). An Efron-Stein inequality for nonsymmetric statistics. Ann.
Statist. 14, 753-758.
Stone, C.J. (1977). Consistent nonparametric regression. Ann. Statist. 5, 595–
645.
Van Ryzin, J. (1969). On strong consistency of density estimates. Ann. Math.
Statist. 40, 1765-1772.
Walk, H. (1997). Strong universal consistency of kernel and partitioning regres-
sion estimates. Preprint 97-1, Math. Inst. A, Univ. Stuttgart.
Watson, G.S. (1964). Smooth regression analysis. Sankhyā Ser. A 26, 359–372.
Wheeden, R.L. and A. Zygmund. (1977). Measure and Integral. Marcel Dekker,
New York.

Yakowitz, S. and C. C. Heyde. (1997). Long-range dependent effects with implications for forecasting and queueing inference. Preprint.
Zeller, K. and W. Beekmann. (1970). Theorie der Limitierungsverfahren, 2.
Aufl. Springer, Berlin, Heidelberg, New York.
Chapter 11

STRATEGIES FOR SEQUENTIAL PREDICTION OF
STATIONARY TIME SERIES

László Györfi
Department of Computer Science and Information Theory
Technical University of Budapest
1521 Stoczek u. 2,
Budapest, Hungary
gyorfi@szit.bme.hu

Gábor Lugosi
Department of Economics,
Pompeu Fabra University
Ramon Trias Fargas 25-27,
08005 Barcelona, Spain,
lugosi@upf.es *

Abstract We present simple procedures for the prediction of a real valued sequence. The algorithms are based on a combination of several simple predictors. We show that if the sequence is a realization of a bounded stationary and ergodic random process, then the average of squared errors converges, almost surely, to that of the optimum, given by the Bayes predictor. We offer an analogous result for the prediction of stationary gaussian processes.

1. INTRODUCTION
One of the many themes of Sid’s research was the search for prediction and
estimation methods for time series that do not necessarily satisfy the classical
assumptions for autoregressive markovian and gaussian processes (see, e.g.,
Morvai et al., 1996; Morvai et al., 1997; Yakowitz, 1976; Yakowitz, 1979;

*The work of the second author was supported by DGES grant PB96-0300

Yakowitz, 1985; Yakowitz, 1987; Yakowitz, 1989; Yakowitz et al., 1999). He


firmly believed that most real-world applications require such robust methods.
This note is a contribution to the line of research pursued and promoted by Sid
who directed us to this beautiful area of research.
We study the problem of sequential prediction of a real valued sequence. At
each time instant the predictor is asked to guess the value of the
next outcome of a sequence of real numbers with knowledge of
the past (where denotes the empty string). Thus,
the predictor’s estimate, at time is based on the value of Formally, the
strategy of the predictor is a sequence of decision functions

and the prediction formed at time is After time instants, the


normalized cumulative prediction error on the string is

In this paper we assume that are realizations of the random vari-


ables drawn from the real valued stationary and ergodic process
The fundamental limit for the predictability of the sequence can
be determined based on a result of Algoet (1994), who showed that for any
prediction strategy and stationary ergodic process

where

is the minimal mean squared error of any prediction for the value of based
on the infinite past Note that it follows by
stationarity and the martingale convergence theorem (see, e.g., Stout, 1974)
that
This lower bound gives sense to the following definition:

Definition 1 A prediction strategy is called universal with respect to a class


of stationary and ergodic processes if for each process in the class,

Universal strategies asymptotically achieve the best possible loss for all er-
godic processes in the class. Algoet (1992) and Morvai et al. (1996) proved

that there exists a prediction strategy universal with respect to the class of
all bounded ergodic processes. However, the prediction strategies exhibited
in these papers are either very complex or have an unreasonably slow rate of
convergence even for well-behaved processes.
The purpose of this paper is to introduce several simple prediction strategies
which, apart from having the above mentioned universal property of Algoet
(1992) and Morvai et al. (1996), promise much improved performance for
“nice” processes. The algorithms build on a methodology worked out in recent
years for prediction of individual sequences, see Vovk (1990), Feder et al.
(1992), Littlestone and Warmuth (1994), Cesa-Bianchi et al. (1997), Kivinen
and Warmuth (1999), Singer and Feder (1999), and Merhav and Feder (1998)
for a survey.
An approach similar to the one of this paper was adopted by Györfi et al.
(1999), where prediction of stationary binary sequences was addressed. There
we introduced a simple randomized predictor which predicts asymptotically as
well as the optimal predictor for all binary ergodic processes. The present setup
and results differ in several important points from those of Györfi et al. (1999).
On the one hand, special properties of the squared loss function considered here
allow us to avoid randomization of the predictor, and to define a significantly
simpler prediction scheme. On the other hand, possible unboundedness of a
real-valued process requires special care, which we demonstrate on the example
of gaussian processes. We refer to Nobel (2000), Singer and Feder (1999),
Singer and Feder (2000), Yang (1999), and Yang (2000) for recent closely related
work.
In Section 2 we introduce a universal strategy for bounded ergodic processes
which is based on a combination of partitioning estimates. In Section 3, still
for bounded processes, we consider, as an alternative, a prediction strategy
based on combining generalized linear estimates. In Section 4 we replace the
boundedness assumption by assuming that the sequence to predict is an ergodic
gaussian process, and show how the techniques of Section 3 may be modified
to take care of the difficulties originating in the unboundedness of the process.
The results of the paper are given in an autoregressive framework, that is, the
value is to be predicted based on past observations of the same process.
We may also consider the more general situation when is predicted based on
and where is an process such that
is a jointly stationary and ergodic process. The prediction problem is similar
to the one defined above with the exception that the sequence of is also
available to the predictor. One may think about the as side information.
Formally, now a prediction strategy is a sequence of functions

so that the prediction formed at time is The normalized cumu-


lative prediction error for any fixed pair of sequences is now

All results of the paper may be extended, in a straightforward manner, to this more general prediction problem. As the extension does not require new ideas,
we omit the details.
Another direction for generalizing the results is to consider predicting vector-
valued processes. Once again, the extension to processes
is obvious, and the details are omitted.

2. UNIVERSAL PREDICTION BY PARTITIONING ESTIMATES
In this section we introduce our first prediction strategy for bounded ergodic
processes. We assume throughout the section that is bounded by a constant
B > 0, with probability one. First we assume that the bound B is known. The
case of unknown B will be treated later in a remark.
The prediction strategy is defined, at each time instant, as a convex combina-
tion of elementary predictors, where the weighting coefficients depend on the
past performance of each elementary predictor.
We define an infinite array of elementary predictors as
follows. Let be a sequence of finite partitions
of the feature space and let be the corresponding quantizer:

With some abuse of notation, for any and we write for the
sequence Fix positive integers and for each
string of positive integers, define the partitioning regression function estimate

where 0/0 is defined to be 0.


Now we define the elementary predictor by

That is, it quantizes the sequence according to the partition and looks for all appearances of the last seen quantized string of length

in the past. Then it predicts according to the average of the following the
string.
The proposed prediction algorithm proceeds as follows: let be a prob-
ability distribution on the set of all pairs of positive integers such that for
all Put and define the weights

and their normalized values

The prediction strategy is defined by

Theorem 1 Assume that


(a) the sequence of partitions is nested, that is, any cell of is a subset of
a cell of
(b) if denotes the diameter of a set, then for each
sphere S centered at the origin

Then the prediction scheme defined above is universal with respect to the
class of all ergodic processes such that

One of the main ingredients of the proof is the following lemma, whose proof
is a straightforward extension of standard arguments in the prediction theory
of individual sequences, see, for example, Kivinen and Warmuth (1999), and
Singer and Feder (2000).

Lemma 1 Let be a sequence of prediction strategies (experts), and


let be a probability distribution on the set of positive integers. Assume
that and Define

with and

If the prediction strategy is defined by

then for every

Here −ln 0 is treated as ∞.

Proof. Introduce and for First we show that


for each

Note that

so that

Therefore, (2.2) becomes



which is implied by Jensen’s inequality and the concavity of the function


for Thus, (2.2) implies that

which concludes the proof.

Another main ingredient of the proof of Theorem 1 is known as Breiman's generalized ergodic theorem (Breiman, 1960); see also Algoet (1994).

Lemma 2 (BREIMAN, 1960). Let be a stationary and ergodic process. Let T denote the left shift operator. Let be a sequence of real-valued
functions such that for some function almost surely. Assume
that Then

Proof of Theorem 1. By a double application of the ergodic theorem, as $n \to \infty$, almost surely,

and therefore

Thus, by Lemma 2, as almost surely,

Since the partitions are nested, the sequence is a martingale indexed by the pair. Thus, the martingale convergence theorem (see, e.g., Stout, 1974) and assumption (b) for the sequence of partitions imply that

Now by Lemma 1,

and therefore, almost surely,

and the proof of the theorem is finished.

Theorem 1 shows that, asymptotically, the predictor defined by (2.1) predicts as well as the optimal predictor given by the regression function. In fact, it gives a good estimate of the regression function in the following sense:

Corollary 1 Under the conditions of Theorem 1

Proof. By Theorem 1,

Consider the following decomposition:

Then the ergodic theorem implies that



It remains to show that

But this is a straightforward consequence of Kolmogorov's classical strong law of large numbers for martingale differences due to Chow (1965) (see also Stout,
1974, Theorem 3.3.1). It states that if is a martingale difference sequence
with

then

Thus, (2.4) is implied by Chow’s theorem since the martingale differences


are bounded by
(To see that the indeed form a martingale difference sequence just note
that for all
Remark. UNKNOWN B. The prediction strategy studied in this section may be easily extended to the case when the process is bounded but B is unknown, that is, when no upper bound on the range of the process is known. In such a case we may simply start with the hypothesis B = 1 and predict according to (2.1) until we find an outcome whose absolute value exceeds the current bound. Then we reset the algorithm and start the predictor again with double the previous value of B, and keep doing this. The universal property of Theorem 1 obviously remains valid for this modified strategy.
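A sketch of this restart logic (predict_bounded stands for the strategy (2.1) with known bound and is a hypothetical callable here):

    def predict_with_unknown_bound(y_past, predict_bounded):
        # Doubling trick: run the bounded-range predictor with the current
        # hypothesis B; whenever an outcome exceeds B, double B (as often
        # as needed) and restart the predictor from the next time instant.
        B, start = 1.0, 0
        for i, y in enumerate(y_past):
            if abs(y) > B:
                while abs(y) > B:
                    B *= 2.0
                start = i + 1              # restart after the violation
        return predict_bounded(y_past[start:], B)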
Remark. CHOICE OF Theorem 1 is true independently of the choice of
the as long as these values are strictly positive for all and In practice,
however, the choice of may have an impact on the performance of the
predictor. For example, if the distribution has a very rapidly decreasing
tail, then the term – will be large for moderately large values of and
and the performance of will be determined by the best of just a few of the
elementary predictors Thus, it may be advantageous to choose to
be a large-tailed distribution. For example, is a safe choice,
where is an appropriate normalizing constant.
Remark. SEQUENTIAL GUESSING. If the process takes values from a finite
set, one is often interested in the sequential guessing of upon observing the

past. Such a problem was investigated (among others) by Györfi et al. (1999), where it was assumed that the process takes one of two values:
Sequential guessing is then formally defined by a sequence of
decision functions

and the guess formed at time is The normalized cumulative loss of


guessing by on the string is

where I denotes the indicator function. Algoet (1994) showed that for any
guessing strategy and stationary ergodic binary process
almost surely, where

is the minimal expected probability of error of guessing based on the infinite


past The existence of a guessing scheme for which
almost surely follows from results of Ornstein (1978) and Bailey (1976).
In Györfi et al. (1999) a simple guessing procedure was proposed with the
same asymptotic guarantees and with a good finite-sample behavior for Markov
processes. The disadvantage of the predictor given in Györfi et al. (1999) is
that it requires randomization. Here we observe that with the help of predictors
having the universal convergence property of Theorem 1 we may easily define
a nonrandomized guessing scheme with the desired convergence properties.
Given a prediction strategy

for a binary process we simply define the guessing scheme by the


decision functions

Then we may use the properties of established in Corollary 1 to conclude


that the guessing scheme defined above has an average number of mistakes
converging to the optimum R* almost surely. Indeed, if we define the
decision based on observing the infinite past as the one minimizing the
probability of error of guessing

then we may write

(by the ergodic theorem)

(by the martingale convergence theorem)

(by Theorem 2.2 in Devroye et al. (1996))

(by Corollary 1.)

Thus, any predictor with the universal property established in Theorem 1 may
be converted, in a natural way, into a universal guessing scheme. An alternative
proof of the same fact is given by Nobel (2000).
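In code the conversion is a one-line threshold (a sketch; predict stands for any strategy with the property of Theorem 1 and is hypothetical here):

    def guess(y_past, predict):
        # For a {0, 1}-valued process, guess 1 exactly when the real-valued
        # prediction exceeds 1/2, mimicking the Bayes decision.
        return 1 if predict(y_past) > 0.5 else 0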

3. UNIVERSAL PREDICTION BY GENERALIZED LINEAR ESTIMATES
This section is devoted to an alternative way of defining a universal predic-
tor for the class of all bounded ergodic processes. Once again, we apply the
method described by Lemma 1 to combine elementary predictors, but now, in-
stead of partitioning-based predictors, we use elementary predictors which are
generalized linear predictors. Once again, we consider bounded processes, and
assume that a positive constant B is known such that (The
case of unknown B may be treated similarly as in Section 2.)
We define an infinite array of elementary predictors as
follows. Let be real-valued functions defined on The elementary
predictor generates a prediction of form

such that the coefficients are calculated based on the past observations
Before defining the coefficients, note that one is tempted to define the

as the coefficients which minimize

if and the all-zero vector otherwise. However, even though the minimum always exists, the minimizing vector is not unique in general, and therefore such a definition would be ambiguous. Instead, we define the coefficients by a standard recursive procedure
as follows (see, e.g., Tsypkin, 1971, Györfi, 1984, Singer and Feder, 2000).
Introduce

(where the superscript T denotes transpose),

Let be an arbitrary positive constant and put where I is the identity matrix. Define

It is easy to see that the inverse can be calculated recursively by

which makes the calculation of the coefficient vector easy.
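A sketch of the resulting procedure (variable names are ours; delta plays the role of the positive constant in the initialization):

    import numpy as np

    class RecursiveLinearPredictor:
        # Ridge-initialized recursive least squares: maintains the inverse of
        # R_n = delta*I + sum of x_i x_i^T via a rank-one (Sherman-Morrison)
        # update, so no matrix inversion is needed at any step.
        def __init__(self, dim, delta=1.0):
            self.Rinv = np.eye(dim) / delta
            self.s = np.zeros(dim)             # running sum of x_i * y_i

        def update(self, x, y):
            Rx = self.Rinv @ x
            self.Rinv -= np.outer(Rx, Rx) / (1.0 + x @ Rx)
            self.s += x * y

        def predict(self, x):
            return float(x @ (self.Rinv @ self.s))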


Theorem 2 Define

Suppose and for any fixed the set

is dense in Define a prediction strategy by combining the ele-


mentary predictors given by (2.1). The obtained predictor is universal
with respect to the class of all ergodic processes with

Proof. Let and let be an eigensystem


of that is, are orthogonal solutions of the equation

and

(Note that since is symmetric and positive semidefinite, its eigenvalues are
all real and nonnegative.) Let be the integer for which if
and if Express the vector as

and define

(It is easy to see by stationarity that the value of the vector is independent of the time index.) It is shown by Györfi (1984) that

and moreover

Also, observe that by the ergodic theorem, for any fixed



Therefore by (3.6), (3.5) and Lemma 2

Next define the coefficient vector to be any vector which achieves the minimum in

Then

It is immediate by the martingale convergence theorem that



On the other hand, by the denseness assumption of the theorem, for any fixed

Thus, we conclude that

Finally, by Lemma 1,

which concludes the proof.


Again, as in Corollary 1, we may compare the predictor directly to the re-
gression function. By the same argument, we obtain the following result. The
details are left to the reader.

Corollary 2 Under the conditions of Theorem 2

4. PREDICTION OF GAUSSIAN PROCESSES


Up to this point we have always assumed that the process to predict is
bounded. This excludes some traditionally well-studied unbounded processes
such as gaussian processes. In this section we define a predictor which is univer-
sal with respect to the class of all stationary and ergodic gaussian processes. For
gaussian processes the best predictor (i.e., the regression function) is linear, and
therefore we may use the techniques of the previous section in the special case
when However, the unboundedness of the process introduces
some additional difficulty. To handle it, we use bounded elementary predictors

as before, but the bound is increased with n. Also, we need to modify the way of combining these elementary predictors.
The proposed predictor is based on a convex combination of linear predictors
of different orders. For each introduce

where the vector of coefficients is calculated by the formula introduced in Section 3:

where is a positive number, and


with for
Introduce the notation

Then the predictor is defined as follows: for all if is such


that then

where

with
Thus, we divide the time instances into intervals of exponentially increasing
length and, after initializing the predictor at the beginning of such an interval,
we use a different way of combining the elementary predictors in each such
segment. The reason for this is that to be able to combine elementary predictors
as in Lemma 1, we need to make sure that the predictor as well as the outcome
to predict is appropriately bounded. In our case this can be achieved based on Lemma 3 below, which implies that with very large probability, the maximum of n identically distributed normal random variables is at most of the order of $\sqrt{\ln n}$.

Theorem 3 The prediction strategy defined above is universal with respect to the class of all stationary and ergodic zero-mean gaussian processes. Also,

At a key point the proof uses the following well-known properties of gaussian
random variables:

Lemma 3 (Pisier, 1986). Let $Z_1, \dots, Z_n$ be zero-mean gaussian random variables with $\sigma^2 = \max_{i \le n} \mathbf{E} Z_i^2$. Then

$$\mathbf{E} \max_{i \le n} Z_i \le \sigma \sqrt{2 \ln n},$$

and for each $u > 0$,

$$\mathbf{P}\Big\{ \max_{i \le n} Z_i > \mathbf{E} \max_{i \le n} Z_i + u \Big\} \le e^{-u^2/(2\sigma^2)}.$$
Proof of Theorem 3. Lemma 3 implies, by taking

This implies, by the Borel-Cantelli lemma, that with probability one there exists
a finite index such that for all Also,
there exists a finite index T such that for all

Therefore, denoting we may write

In other words,

This proves the second statement of the theorem. To prove the claimed univer-
sality property, it suffices to show that for all ergodic gaussian processes,

This can be done similarly to the proof of Theorem 2:

Define the coefficient vector such that it minimizes

(If the minimum is not unique, choose one arbitrarily.) Then

since

Now the proof can be finished by mimicking the proof of Theorem 2.


Once again, we may derive a property analogous to Corollary 1:

Corollary 3 Under the conditions of Theorem 3,

Proof. We proceed exactly as in the proof of Corollary 1. The only thing that
needs a bit more care is checking the conditions of Kolmogorov’s strong law
for sums of martingale differences, since in the gaussian case the corresponding
martingale differences are not bounded. By the Cauchy-Schwarz inequality,

where C and are positive constants, which implies so


the condition of Kolmogorov’s theorem is satisfied.
Remark. RATES OF CONVERGENCE. The inequality of Theorem 3 shows that
the rate of convergence of to L* is determined by the performance of the
best elementary predictor The price of adaptation to the best elementary
predictor is merely an additional term of the order of This additional
term is not much larger than an inevitable estimation error. This is supported
by a result of Gerencsér and Rissanen (1986) who showed that for any gaussian
process and for any predictor

On the other hand, Gerencsér (1994) showed under some mixing conditions for
processes that there exists a predictor such that

Further rate-of-convergence results under more general conditions for the pro-
cess were established by Gerencsér (1992). Another general branch of bounds
can be found in Goldenshluger and Zeevi (1999). Consider the repre-
sentation of

with transfer function

Goldenshluger and Zeevi show that if for and

then for large

and for

Thus, for the processes investigated by Goldenshluger and Zeevi, the predictor
of Theorem 3 achieves the rate of convergence

REFERENCES
Algoet, P. (1992). Universal schemes for prediction, gambling, and portfolio
selection. Annals of Probability, 20:901–941.
Algoet, P. (1994). The strong law of large numbers for sequential decisions
under uncertainty. IEEE Transactions on Information Theory, 40:609–634.
Bailey, D. H. (1976). Sequential schemes for classifying and predicting ergodic
processes. PhD thesis, Stanford University.
Breiman, L. (1957). The individual ergodic theorem of information theory. Annals of Mathematical Statistics, 28:809–811. Correction (1960): Annals of Mathematical Statistics, 31:809–810.
Cesa-Bianchi, N., Y. Freund, D.P. Helmbold, D. Haussler, R. Schapire, and
M.K. Warmuth. (1997). How to use expert advice. Journal of the ACM,
44(3):427–485.
Chow, Y.S. (1965). Local convergence of martingales and the law of large
numbers. Annals of Mathematical Statistics, 36:552–558.
Devroye, L., L. Györfi, and G. Lugosi. (1996). A Probabilistic Theory of Pattern
Recognition. Springer-Verlag, New York.

Feder, M., N. Merhav, and M. Gutman. (1992). Universal prediction of individual sequences. IEEE Transactions on Information Theory, 38:1258–1270.
Gerencsér, L. (1992). AR(∞) estimation and nonparametric stochastic complexity. IEEE Transactions on Information Theory, 38:1768–1779.
Gerencsér, L. (1994). On Rissanen’s predictive stochastic complexity for sta-
tionary ARMA processes. J. of Statistical Planning and Inference, 41:303–
325.
Gerencsér, L. and J. Rissanen. (1986). A prediction bound for Gaussian ARMA
processes. Proc. of the 25th Conference on Decision and Control, 1487–1490.
Goldenshluger, A. and A. Zeevi. (1999). Non-asymptotic bounds for autoregressive time series modeling. Submitted for publication.
Györfi, L. (1984). Adaptive linear procedures under general conditions. IEEE
Transactions on Information Theory, 30:262–267.
Györfi, L., G. Lugosi, and G. Morvai. (1999). A simple randomized algorithm
for consistent sequential prediction of ergodic time series. IEEE Transactions
on Information Theory, 45:2642–2650.
Kivinen, J. and M. K. Warmuth. (1999). Averaging expert predictions. In P. Fischer and H. U. Simon, editors, Computational Learning Theory: Proceedings of the Fourth European Conference, EuroCOLT'99, pages 153–167. Springer, Berlin. Lecture Notes in Artificial Intelligence 1572.
Littlestone, N. and M. K. Warmuth. (1994). The weighted majority algorithm.
Information and Computation, 108:212–261.
Merhav, N. and M. Feder. (1998). Universal prediction. IEEE Transactions on
Information Theory, 44:2124–2147.
Morvai, G., S. Yakowitz, and L. Györfi. (1996). Nonparametric inference for
ergodic, stationary time series. Annals of Statistics, 24:370–379.
Morvai, G., S. Yakowitz, and P. Algoet. (1997). Weakly Convergent Stationary
Time Series. IEEE Transactions on Information Theory, 43:483–498.
Nobel, A. (2000). Aggregate schemes for sequential prediction of ergodic pro-
cesses. Manuscript.
Ornstein, D. S. (1978). Guessing the next output of a stationary process. Israel
Journal of Mathematics, 30:292–296.
Pisier, G. (1986). Probabilistic methods in the geometry of Banach spaces. In
Probability and Analysis. Lecture Notes in Mathematics, 1206, pages 167–
241. Springer, New York.
Singer, A. and M. Feder. (1999). Universal linear prediction by model order
weighting. IEEE Transactions on Signal Processing, 47:2685–2699.
Singer, A. C. and M. Feder. (2000). Universal linear least-squares prediction.
International Symposium of Information Theory.
Stout, W.F. (1974). Almost sure convergence. Academic Press, New York.
Tsypkin, Ya. Z. (1971). Adaptation and Learning in Automatic Systems. Aca-
demic Press, New York.

Yakowitz, S. (1976). Small-sample hypothesis tests of Markov order, with application to simulated and hydrologic chains. Journal of the American Statistical Association, 71:132–136.
Yakowitz, S. (1979). Nonparametric estimation of Markov transition functions.
Annals of Statistics, 7:671–679.
Yakowitz, S. (1985). Nonparametric density estimation, prediction, and regres-
sion for Markov sequences. Journal of the American Statistical Association,
80:215–221.
Yakowitz, S. (1987). Nearest-neighbour methods for time series analysis. Jour-
nal of Time Series Analysis, 8:235–247.
Yakowitz, S. (1989). Nonparametric density and regression estimation for Markov
sequences without mixing assumptions. Journal of Multivariate Analysis,
30:124–136.
Yakowitz, S., L. Györfi, J. Kieffer, and G. Morvai. (1999). Strongly consis-
tent nonparametric estimation of smooth regression functions for stationary
ergodic sequences. Journal of Multivariate Analysis, 71:24–41.
Vovk, V.G. (1990). Aggregating strategies. In Proceedings of the Third Annual
Workshop on Computational Learning Theory, pages 372–383. Association
of Computing Machinery, New York.
Yang, Y. (1999). Aggregating regression procedures for a better performance.
Manuscript.
Yang, Y. (2000). Combining different procedures for adaptive regression. Jour-
nal of Multivariate Analysis, 74:135–161.
Part IV
Chapter 12

THE BIRTH OF LIMIT CYCLES IN NONLINEAR
OLIGOPOLIES WITH CONTINUOUSLY DISTRIBUTED
INFORMATION LAGS

Carl Chiarella
School of Finance and Economics
University of Technology
Sydney
P.O. Box 123, Broadway, NSW 2007
Australia
carl.chiarella@uts.edu.au

Ferenc Szidarovszky
Department of Systems and Industrial Engineering
University of Arizona
Tucson, Arizona, 85721-0020, USA
szidar@sie.Arizona.edu

Abstract The dynamic behavior of the output in nonlinear oligopolies is examined when the
equilibrium is locally unstable. Continuously distributed time lags are assumed in
obtaining information about rivals’ output as well as in obtaining or implementing
information about the firms’ own output. The Hopf bifurcation theorem is used
to find conditions under which limit cycle motion is born. In addition to the
classical Cournot model, labor managed and rent seeking oligopolies are also
investigated.

1. INTRODUCTION
During the last three decades many researchers have investigated the stability
of dynamic oligopolies in both continuous and discrete time scales. A com-
prehensive summary of the key results is given in Okuguchi (1976) and their
multiproduct extensions are presented in Okuguchi and Szidarovszky (1999).

However, relatively little research has been devoted to the investigation of unstable oligopolies. There are two main reasons why unstable oligopolies
were neglected in the main stream of research. First, it was usually assumed
that no economy would operate under unstable conditions. Second, there was
a lack of appropriate technical tools to investigate unstable dynamical systems.
During the last two decades, the qualitative theory of nonlinear differential
equations has resulted in appropriate techniques that enable economists to in-
vestigate the behavior of locally unstable economies. There is an expanding
literature on this subject. The works of Arnold (1978), Guckenheimer and
Holmes (1983), Jackson (1989) and Cushing (1977) can be mentioned as main
sources for the mathematical methodology.
Theoretical research on unstable dynamical systems has evolved in three
main streams. Bifurcation theory (see for example, Guckenheimer and Holmes
(1983)) is used to identify regions in which certain types of oscillating attractors
are born. Centre manifold theory (see for example, Carr (1981)) can be applied
to reduce the dimensionality of the governing differential equation to a manifold
on which the attractor lies. Many other studies have used computer methods
(see for example, Kubicek and Marek (1986)) to determine the typical types of
attractors and bifurcation diagrams.
Empirical research can be done either by using computer simulations (see
for example, Kopel (1996)) or by performing laboratory experiments (see for
example, Cox and Walker (1998)). In these studies observations are recorded
about the dynamic behavior of the system under consideration.
In this paper a general oligopolistic model will be examined which contains
the classical Cournot model, labor managed oligopolies, as well as rent-seeking
games as special cases. Time lags are assumed for each firm in obtaining infor-
mation about rivals’ output as well as in obtaining or implementing information
about its own output. The time lag is unknown; it is considered a random variable with a given distribution. The dynamic model including the expected values is a nonlinear integro-differential equation, the asymptotic behavior of which will be examined via linearization. The Hopf bifurcation theorem will
be then applied to find conditions that guarantee the existence of limit cycles.
The paper develops as follows. Nonlinear oligopoly models will be introduced
in Section 2, and the corresponding dynamic models with time lags will be
formulated in Section 3. Bifurcation analysis will be performed in Section 4
for the general case, and the special case of identical firms will be analyzed
in Section 5. Special oligopoly models such as the classical Cournot model,
labor-managed oligopolies, and rent-seeking games will be considered in Sec-
tion 6, where special conditions will be given for the existence of limit cycles.
Section 7 concludes the paper.

2. NONLINEAR OLIGOPOLY MODELS


In this paper an N-person game will be examined, where the strategy set for
each player is and the payoff function of player is given as

with Here is a given function. Notice that this


model includes the classical Cournot model, labor-managed oligopolies, as well
as rent-seeking oligopolies as special cases. In the case of the Cournot model

where is the inverse demand function and is the cost function of the firm. In the case of labor-managed oligopolies,

where is the inverse production function and is the cost unrelated to labor.
In the case of rent-seeking games

where is the cost function of the agent. The first term represents the probability of winning the rent. If unit rent is assumed, then this is the expected profit of the agent.
Let be an equilibrium of the game, and let
Assume that in a neighborhood of this equilibrium the best response is unique
for each player, and the best response functions are differentiable. Let
denote the best response of player for a given

3. THE DYNAMIC MODEL WITH LAG STRUCTURE


We assume that in a disequilibrium situation each firm adjusts to the desired
level of output according to the adaptive rule

where > 0 is the speed of adjustment, is the expectation by player


on the output of the rest of the players, and is the expectation on his/her

own output. We consider the situation in which each firm experiences a time
lag in obtaining information about the rivals’ output as well as a time lag
in management receiving or implementing information about its own output.
Russel et al. (1986) used the adjustment process

which is a differential-difference equation with an infinite eigenvalue spectrum.


In economic situations the lags and are usually uncertain; therefore we will model the time lags in a continuously distributed manner that allows us to use finite dimensional bifurcation techniques. Accordingly, the expectations and are modeled by using the same lag structure as given earlier in Invernizzi and Medio (1991) and in Chiarella and Khomin (1996). In particular we assume that

and

where the weighting function is assumed to have the form

Here we assume that T > 0 and m is a nonnegative integer. Notice that this weighting function has the following properties:
(a) for m = 0, weights are exponentially declining, with the most weight given to the most current output;
(b) for m ≥ 1, zero weight is assigned to the most recent output, rising to a maximum at t = T and declining exponentially thereafter;
(c) the area under the weighting function is unity for all T and m;
(d) as m increases, the function becomes more peaked around t = T; for sufficiently large values of m the function may for all practical purposes be regarded as very close to the Dirac delta function centered at t = T;
(e) as m → ∞, the function tends to the Dirac delta function.
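A sketch that numerically confirms properties (c) and (d), assuming (as in Invernizzi and Medio (1991)) the gamma-type form w(t) = (1/T)e^{-t/T} for m = 0 and w(t) = ((m/T)^{m+1}/m!) t^m e^{-mt/T} for m >= 1, with m the nonnegative integer above:

    import numpy as np
    from math import factorial

    def w(t, T, m):
        # gamma-type weighting kernel over the lag t >= 0
        if m == 0:
            return np.exp(-t / T) / T
        return (m / T) ** (m + 1) / factorial(m) * t ** m * np.exp(-m * t / T)

    t = np.linspace(0.0, 20.0, 200_001)
    dt = t[1] - t[0]
    for m in (0, 1, 5, 25):
        vals = w(t, T=2.0, m=m)
        area = vals.sum() * dt        # property (c): total weight is unity
        peak = t[np.argmax(vals)]     # property (d): peak near t = T for m >= 1
        print(m, round(float(area), 4), round(float(peak), 3))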

Property (d) implies that for large values of m,

so that the model with discrete time lags is approached when m is chosen sufficiently large.
Property (e) implies that for small values of T,

in which case we would recover the case of no information lags usually consid-
ered in the literature. Substituting equation (3.4) into (3.2) and (3.3), and the
resulting expressions for and into equation (3.1), a system of nonlinear
integro-differential equations is obtained around the equilibrium. In order to
analyze the local dynamic behavior of the system we consider the linearized
system. Letting and denote the deviation of and from their equi-
librium levels, the linearized system can then be formulated as follows:

where

4. BIFURCATION ANALYSIS IN THE GENERAL CASE
The characteristic equation of the Volterra integro-differential equations (3.7)
can be obtained by a technique expounded by Miller (1972) but originally due
to Volterra (1931). We seek the solution in the form

Substituting this form into equation (3.7) and allowing we have



By introducing the notation

with

and

with

equation (4.2) can be simplified as

Therefore the characteristic equation has the form

Notice that in the most general case the left hand side is a polynomial in
of degree
The determinant (4.6) can be expanded by using a special idea used earlier
by Okuguchi and Szidarovszky (1999). Introduce the notation

to see that equation (4.6) can be rewritten as

where we have used the simple fact that for any vectors and

This relation can be easily proved by using finite induction on the dimension
of and A value is an eigenvalue if either

for at least two players, or

We recall that in order to apply the Hopf bifurcation theorem, we need to study the behaviour of the eigenvalues as they cross the pure imaginary axis.
As a first step we need to obtain the conditions under which the eigenvalue
equations (4.8) and (4.9) yield pure complex roots. We consider each of these
equations in turn.
Consider first equation (4.8). A value solves this equation if

Assume now that the root is a pure complex number. In order to separate the real and imaginary parts of the left hand side, introduce the following operators. Let

be a polynomial with real coefficients Define polynomials

and

Then it is easy to see that with any real

By introducing the polynomials

and

we have

and

Then equation (4.10) can be rewritten as

Equating the real and imaginary parts to zero leads to the following:

and

If is considered as the bifurcation parameter, then any real solution must satisfy the equation

Notice that the left hand side has odd degree, and no terms of even degree are present in its polynomial form. Therefore zero is always a root. However, it is difficult to find simple conditions that guarantee the existence of nonzero
real roots.
Differentiating equation (4.10) with respect to we obtain

If P denotes the multiplier of then equation (4.15) implies that

Therefore

and so

Assuming that the numerator is nonzero, the Hopf bifurcation theorem (see
for example, Guckenheimer and Holmes (1983)) implies that there is a limit
cycle for in the neighborhood of
Consider next equation (4.9). It can be rewritten as

with

Assume now that then equation (4.18) simplifies as

Separating the real and imaginary parts yields the following two equations:

and

In this case we have more freedom than in the previous case, since now we
have bifurcation parameters. Assume that with some values of
these equations have a common nonzero real solution
Differentiating equation (4.18) we have

If is nonzero, then the Hopf bifurcation theorem implies the existence of a limit cycle around
We may summarise the above analysis by stating that the existence of a limit
cycle is guaranteed by two conditions:
(i) Equations (4.14) and/or (4.19) have nonzero real solutions;
(ii) The real part of the corresponding derivative is nonzero.
Condition (i) is difficult to check in general. In the next section we will show
cases when nonzero real roots exist and also cases when there is no nonzero
real root. In the general case computer methods are available for solving these
polynomial equations. A comprehensive summary of the methods for solving
nonlinear equations can be found for example, in Szidarovszky and Yakowitz
(1978). After a real root is found, can be obtained by solving equation
(4.12) (or (4.13)) or (4.19) (or (4.20)), and the value of has to be
substituted into the derivative expressions (4.16) or (4.21). Finally, we have to check if the real part of the derivative is nonzero.
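To make this numerical step concrete, the following Python sketch finds the nonzero real roots of a polynomial of the kind arising here: odd degree, with no even-degree terms, so that zero is always a root. The coefficients below are hypothetical placeholders, not values derived from the model.

    import numpy as np

    # Hypothetical coefficients (highest degree first) of an odd-degree
    # polynomial with no even-degree terms: p(w) = w^5 - 3w^3 + 2w.
    # Zero is always a root of such a polynomial, as noted in the text.
    coeffs = [1.0, 0.0, -3.0, 0.0, 2.0, 0.0]

    roots = np.roots(coeffs)

    # Keep the real, nonzero roots; these are the candidate frequencies of
    # the pure complex eigenvalues needed in the Hopf bifurcation argument.
    real_nonzero = sorted(r.real for r in roots
                          if abs(r.imag) < 1e-9 and abs(r.real) > 1e-9)
    print("candidate nonzero real roots:", real_nonzero)

Each root found this way would still have to be substituted into the derivative expressions (4.16) or (4.21), and the real part of the resulting derivative checked for being nonzero, before the Hopf bifurcation theorem can be invoked.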

5. THE SYMMETRIC CASE


In this section identical firms are assumed, that is, it is assumed that
Therefore
as well. Assume in addition that the initial values are identical.
Then equations (4.3), (4.4), and (4.5) imply that the characteristic equation
has the form

Notice that this equation is very similar to (4.10); therefore the same idea will be used to examine pure complex roots. Introduce the polynomials

and

A complex number is a solution of equation (5.1) if and only if

Equating again the real and imaginary parts to zero we have the following
two equations:

and

Consider again as the bifurcation parameter. Then there is a real solution if and only if satisfies equation

Notice that this is a polynomial equation. As in the general case, it is difficult to find a simple condition that guarantees the existence of a nonzero real root.
As in the general case, the degree of this polynomial equation is odd, and no
even degree terms are present in its polynomial form.

Differentiating equation (5.1) with respect to we have

Let now denote the multiplier of then (5.6) can be rewritten as

Therefore

Assuming that the numerator is nonzero, the Hopf bifurcation theorem implies the existence of a limit cycle for in the neighborhood of
As a special case assume first that the firm’s own information lag S is much
smaller than T, the information lag about rival firms. Thus we select S = 0,
and equation (5.1) reduces to the following:

If then this equation simplifies as

The real parts of the roots are negative for If 1 – then there are two nonzero real roots. If = 0,
then and is negative. Therefore there is no nonzero pure complex
root.
Consider next the case of Then equation (5.9) has the special form

Let be a pure complex root. Substituting it into equation (5.11) and equating the real and imaginary parts we see that

Thus from the second equality in equation (5.12) we see that the combination
of parameter values for which pure complex roots are possible is given by the
relation

Simple calculation and relations (5.12) imply that at

Thus the Hopf bifurcation theorem implies the existence of a limit cycle for
in the neighborhood of

In order to illustrate a case when there are lags in obtaining information on both the rivals’ and own outputs and no pure complex root exists, select
Then equations (5.3) and (5.4) are simultaneously satisfied if

and

By substituting into equation (5.17) we see that

With fixed values of if or even if S – T is a small positive number, then becomes negative. In such cases no pure complex root exists. If
S – T > then there is a nonzero real root. However, this case, in which the own information lag is longer than the lag in obtaining information about rival firms, seems economically unrealistic.

6. SPECIAL OLIGOPOLY MODELS


Consider first the classical Cournot model (2.2). Assuming that

a simple calculation shows that

and in the case of identical firms,

which is negative for The sign coincides with that given in (5.13).
Considering equation (6.3) together with the relation (5.13), we see that the
parameters and T can be selected so that there is a limit cycle for all

Consider next the classical Cournot model with the same price function as
before but assume that the cost function of firm is a nonlinear function
Assuming interior best response, we find that for all firms

Hence can be obtained by solving equation

for In order to guarantee a unique solution assume that in the neighborhood of the equilibrium and that the right hand side is strictly increasing, i.e.

which holds if

in the neighborhood of the equilibrium. Differentiate equation (6.4) with respect to to obtain

which implies that

By using relation (6.4) again we see that at the equilibrium

where Notice that becomes negative if the output of each firm is smaller than the output of the rest of the
industry. Figure 2 illustrates the value of as a function of Notice
also that with the appropriate choice of (namely
can take any negative value. Thus from equation (5.13) we see that this
version of the classical Cournot model can yield limit cycles with any number
of firms.

Assume next that the price function is linear, and the cost
functions are nonlinear. Then the profit of firm can be obtained as

The best response is the solution of the equation

for Assume that in the neighborhood of the equilibrium, and that the right hand side strictly increases in order to guarantee uniqueness,
i.e. the condition

holds.
Differentiating equation (6.9) with respect to shows that

Under assumption (6.10) the value of is always negative. The shape of the function with independent variable is similar to the one shown in Figure 2. By controlling the value of any negative value can be obtained for showing that limit cycles exist for all If is linear, then
is zero, therefore = for all Comparing this value to (6.3)
we see that no satisfies this value, so no limit cycle can be guaranteed in this
case.
Consider now the model of labor-managed oligopolies (2.3) with a linear or
nonlinear and with and where all parameters
are positive. The profit of firm per labor can be then given as

which is mathematically equivalent to the classical Cournot model with hyperbolic price function and cost functions

Consider again labor-managed oligopolies (2.3) and now assume that

with all parameters being positive. Assume again that the best response is
interior in the neighborhood of the equilibrium, then an easy calculation shows
that

Notice that the right hand side is strictly decreasing in and a straightforward calculation shows that

Notice that and a comparison to (5.13) shows that limit cycles may
be born for The cases of nonlinear and can be examined
similarly to classical Cournot oligopolies; therefore, details are not discussed here.
Consider finally the case of rent-seeking oligopolies (2.4). Notice that they
are identical mathematically to classical Cournot oligopolies with the selection
of Therefore our conclusions for the classical model apply here.

7. CONCLUSIONS
A dynamic model with continuously distributed lags, both in obtaining information about rivals’ outputs and in the firms’ obtaining or implementing information about their own outputs, was examined. Classical bifurcation theory was applied to the governing integro-differential equations. Time lags can also be modeled by using differential-difference equations; however, in this case one has to deal with an infinite spectrum, which makes the use of bifurcation theory analytically intractable. In addition, fixed time lags are not realistic in real economic situations.
We have derived the characteristic equation in the general case, and therefore
the existence of pure complex roots can be analyzed by using standard numerical
techniques. The derivatives of the best response functions were selected as the
bifurcation parameters. The derivatives of the pure complex roots with respect
to the bifurcation parameters were given in closed form, which makes the application of the Hopf bifurcation theorem straightforward.
The classical Cournot model, labor-managed oligopolies, and rent-seeking
games were examined as special cases. If identical firms are present with linear
cost functions and hyperbolic price function, then under a special adjustment
process limit cycles are guaranteed for a sufficiently large number of firms.
However, with nonlinear cost functions, limit cycles may be born with an arbitrary number of firms. If the price as well as the costs are linear, no limit cycle is guaranteed; however, if the costs are nonlinear, then limit cycles can be born with arbitrary values of Similar conclusions have been reached for labor-
managed oligopolies. Rent-seeking games are mathematically equivalent to
the classical Cournot model with hyperbolic price function, therefore the same
conclusions can be given as those presented earlier.

REFERENCES
Arnold, V.I. (1978). Ordinary Differential Equations. MIT Press, Cambridge,
MA.
Carr, J. (1981). Applications of Center Manifold Theory. Springer-Verlag, New
York.
Chiarella, C. and A. Khomin. (1996). An Analysis of the Complex Dynamic
Behavior of Nonlinear Oligopoly Models with Time Lags. Chaos, Solitons
& Fractals, Vol. 7. No. 12, pp. 2049-2065.
Cox, J.C. and M. Walker. (1998). Learning to Play Cournot Duopoly Strategies.
J. of Economic Behavior and Organization, Vol. 36, pp. 141-161.
Cushing, J.M. (1977). Integro-differential Equations and Delay Models in Pop-
ulation Dynamics. Springer-Verlag, Berlin/Heidelberg/New York.
Guckenheimer, J. and P. Holmes. (1983). Nonlinear Oscillations, Dynamical
Systems and Bifurcations of Vector Fields. Springer-Verlag, New York.

Invernizzi, S. and A. Medio. (1991). On Lags and Chaos in Economic Dynamic Models. J. Math. Econ., Vol. 20, pp. 521-550.
Jackson, E.A. (1989). Perspectives of Nonlinear Dynamics. Vols. 1 & 2. Cam-
bridge University Press.
Kopel, M. (1996). Simple and Complex Adjustment Dynamics in Cournot
Duopoly Models. Chaos, Solitons & Fractals, Vol. 7, No. 12, pp. 2031-2048.
Kubicek, M. and M. Marek. (1986). Computational Methods in Bifurcation
Theory and Dissipative Structures. Springer-Verlag, Berlin/Heidelberg/New
York.
Miller, R.K. (1972). Asymptotic Stability and Perturbations for Linear Volterra
Integrodifferential Systems. In Delay and Functional Differential Equations
and Their Applications, edited by K. Schmitt. Academic Press, New York.
Okuguchi, K. (1976). Expectations and Stability in Oligopoly Models. Springer-
Verlag, Berlin/Heidelberg/New York.
Okuguchi, K. and F. Szidarovszky. (1999). The Theory of Oligopoly with Multiproduct Firms (2nd Edition). Springer-Verlag, Berlin/Heidelberg/New York.
Russell, A.M., J. Rickard and T.D. Howroyd. (1986). The Effects of Delays on
the Stability and Rate of Convergence to Equilibrium of Oligopolies. Econ.
Record, Vol. 62, pp. 194-198.
Szidarovszky, F. and S. Yakowitz. (1978). Principles and Procedures of Nu-
merical Analysis. Plenum Press, New York/London.
Volterra, V. (1931). Leçons sur la Théorie Mathématique de la Lutte pour la Vie. Gauthier-Villars, Paris.
Chapter 13

A DIFFERENTIAL GAME OF DEBT CONTRACT VALUATION

A. Haurie
University of Geneva
Geneva, Switzerland

F. Moresino
Cambridge University
United Kingdom

Abstract This paper deals with a problem of uncertainty management in corporate finance.
It represents, in a continuous time setting, the strategic interaction between a
firm owner and a lender when a debt contract has been negotiated to finance a
risky project. The paper takes its inspiration from a model by Anderson and
Sundaresan (1996) where a simplifying assumption on the information structure
was used. This model is a good example of the possible contribution of stochastic
games to modern finance theory. In our development we consider the two possible
approaches for the valuation of risky projects: (i) the discounted expected net
present value when the firm and the debt are not traded on a financial market, (ii)
the equivalent risk neutral valuation when the equity and the debt are considered
as derivatives traded on a spanning market. The Nash equilibrium solution is
characterized qualitatively.

1. INTRODUCTION
In Anderson and Sundaresan (1996) an interesting dynamic game model of debt contract has been proposed and used to explain some observed discrepancies in the yield spread of risky debts. The model is cast in a discrete time
setting, with a simplifying assumption on the information structure allowing
for a relatively easy sequential formulation of the equilibrium conditions as a
sequence of Stackelberg solutions where the firm owner is the leader and the
lender is the follower.

In the present paper we revisit the Anderson-Sundaresan model but we formulate it in continuous time and we characterize a fully dynamic Nash equilibrium
solution. The purpose of the exercise is not to complement the results obtained
in Anderson and Sundaresan (1996) (this will be done in another paper using
again the discrete time formulation), but rather to explore the possible mixing of Black and Scholes valuation principles (Black and Scholes, 1973) with
stochastic differential game concepts. It happens that the debt contract valua-
tion problem provides a very natural framework where antagonistic parties act
strategically to manage risk in an uncertain environment.
The paper is organized as follows: in section 2 we define the firm, which envisions developing a risky project, needs some amount C to launch it, and negotiates a debt contract to raise this money. In section 3 we formulate a stochastic differential game between a firm owner and a lender in a situation where there is no market in which to trade debt or equity. In section 4 we explore the equivalent
risk neutral valuation principles applied to equity and debt values when there is a market where these derivatives can be traded. In section 5 we analyze qualitatively the Nash solution to the stochastic differential game characterized by
the Hamilton-Jacobi-Bellman equations satisfied by the equilibrium values of
equity and debt. In section 6 we discuss a reformulation of the problem where
liquidation can take place only at discrete time periods.

2. THE FIRM AND THE DEBT CONTRACT


A firm has a project which is characterized by a stochastic state equation in
the form of a geometric Brownian process with drift:

where
W : is a standard Wiener process,
is the instantaneous growth rate
is the instantaneous variance.
This state could represent, for example, the price of the output from the project.
The firm expects a stream of cash flows defined as a function of the
state of the project. Therefore, if the firm has a discount rate the equity of
the unlevered firm, that is the debt free firm, when evaluated as the net present
value of expected cash flows, is given by

Using a standard technique of stochastic calculus one can characterize the function as the solution of the following differential equation

The boundary condition (1.4) comes from the fact that a project with zero value
will remain with a zero value and thus will generate no cash flow. An interesting case is the one where since (1.4) can now be rewritten as

A linear function can be used as a test function. It satisfies the boundary condition (1.6) and it satisfies (1.5) if A is such that for all This defines and therefore
Later on, in section 4, we will see another way of evaluating the equity, which does not require specifying the discount rate when there is a market spanning the risk of the project.
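The valuation just described can be illustrated numerically. Below is a minimal Monte Carlo sketch, in Python, of the expected discounted cash flow of the unlevered firm, assuming a geometric Brownian state and, for concreteness, cash flows proportional to the state; all parameter values and the finite simulation horizon are hypothetical choices, not taken from the chapter.

    import numpy as np

    rng = np.random.default_rng(0)

    alpha, sigma = 0.03, 0.20   # hypothetical drift and volatility of the state
    rho = 0.08                  # hypothetical discount rate (rho > alpha)
    x0 = 1.0                    # initial state of the project
    dt, horizon, n_paths = 0.01, 50.0, 10_000
    n_steps = int(horizon / dt)

    # Simulate geometric Brownian paths and accumulate discounted cash
    # flows, taking the cash flow proportional to the state, pi(x) = x.
    x = np.full(n_paths, x0)
    npv = np.zeros(n_paths)
    for k in range(n_steps):
        npv += np.exp(-rho * (k * dt)) * x * dt
        z = rng.standard_normal(n_paths)
        x *= np.exp((alpha - 0.5 * sigma**2) * dt + sigma * np.sqrt(dt) * z)

    # For pi(x) = x the linear test function argument gives x0/(rho - alpha);
    # the finite horizon truncates the tail, so the simulation falls
    # slightly short of this closed form.
    print("Monte Carlo equity estimate:", npv.mean())
    print("closed form x0/(rho-alpha) :", x0 / (rho - alpha))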
Suppose that the firm needs to borrow the amount C to finance its project.
The financing is arranged through the issue of a debt contract with a lender. The
contract is defined by the following parameters also called the contract terms:
outstanding principal amount P
maturity T
grace period
coupon rate
The contract terms define a debt service that we represent by a function
It gives the cumulative contractual payments that have to be made up to time
The contract may also include an interest rate in case of delayed payments and an interest rate in case of advance payments. If one assumes a linear amortization, the function is the solution to

This function is illustrated in Figure 1.1. The strategic structure of the problem
is due to the fact that the firm controls its payments and may decide not to abide by the contract at some time. At time let the variable give the state of the
debt service, which is the cumulated payments made by the firm up to time
This state variable evolves according to the following differential equation

where is the payment at time and We assume that the payment rate
is upper bounded by the cash flows generated by the project
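The contract terms and the payment dynamics can be made concrete with a small sketch. The Python fragment below, with hypothetical contract terms, computes the cumulative contractual debt service under linear amortization of the principal after a grace period, together with a cumulated-payment state driven by a chosen payment rate; the specific functional form is an illustrative assumption, not the paper's exact specification.

    import numpy as np

    P, T, grace, coupon = 100.0, 10.0, 2.0, 0.06   # hypothetical contract terms

    def contractual_service(t):
        """Cumulative payments due by time t: accrued coupon on the
        outstanding principal plus linear amortization of P between
        the grace period and maturity (an illustrative schedule)."""
        amort = P * np.clip((t - grace) / (T - grace), 0.0, 1.0)
        # Outstanding principal declines linearly after the grace period,
        # so the accrued coupon is the integral of coupon * principal(t).
        t1 = np.minimum(t, grace)
        t2 = np.clip(t - grace, 0.0, T - grace)
        accrued = coupon * (P * t1 + P * (t2 - t2**2 / (2 * (T - grace))))
        return accrued + amort

    dt = 0.01
    times = np.arange(0.0, T + dt, dt)
    s = contractual_service(times)

    # Cumulated payments y(t) for a constant payment rate w (the firm's
    # control); the firm is "late" whenever y(t) < s(t), which is when
    # liquidation becomes admissible for the lender.
    w = 12.0
    y = w * times
    late = times[y < s]
    print("first time the firm is late:", late[0] if late.size else "never")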

The strategic structure of the problem is also due to the fact that the lender is
authorized to take control of the firm when the owner is late in his payments,
i.e. at any time where is the set

The lender action is an impulse control. If at time the lender takes control and liquidates the firm, he receives the minimum between the debt balance and the liquidation value of the firm, i.e.

A possible form for the liquidation value is

This assumes that the lender will find another firm ready to buy the project at
equity value and K is a constant liquidation cost. For the example of
debt contract given above, the debt balance is

The argument of Anderson and Sundaresan is that, if the firm is in financial distress and its liquidation value is too low, the lender may find it advantageous not to liquidate, even if the condition is satisfied.
This strategic interaction between the firm owner and the lender will be
modelled as a noncooperative differential game where the payoffs are linked to the equity and debt values, respectively. These values will depend on the time, the state of the project, and the state of the debt service, and will be determined
either by optimizing the expected utility of the agents or, if equity and debt can
be considered as derivatives traded on a market, through the so-called equivalent
risk neutral valuation.

3. A STOCHASTIC GAME
Let us assume that the risk involved in developing the project, or in financing it through the agreed debt contract, cannot be spanned by assets traded on a market. So we represent the strategic interaction between two individuals,
the firm owner, who strives to maximize the expected present value of the net
cash flow, using a discount rate and a lender who maximizes the present
value of the debt, using a discount rate The state of the system is the pair
An admissible strategy for the firm owner
is given by a measurable function1 A strategy
for the lender is a stopping time defined by where
B is a Borel set Associated with a strategy pair we
define the payoffs of the two players

For the firm’s owner

For the lender

Remark 1 It is in the treatment of the debt service that our model differs significantly from Anderson and Sundaresan (1996). The creditor does not forget the late payments. The creditor can also take control at maturity, if the condition holds. The firm is allowed to “overpay” and it is thus possible that be positive at maturity.2 It is then normal that the firm gets back this amount at time T as indicated in (1.13).

Definition 1 A strategy pair is a subgame perfect Nash equilibrium for this stochastic game, if for any the following holds

We shall not pursue here the characterization of the equilibrium strategies. Indeed, we recognize that it will be affected by several assumptions regarding the attitude of the agents toward risk3 (i.e. their utility functions) and their relative discount rates. What is problematic, in practice, is that these data are not readily observable, and it is difficult to assume that they are common knowledge for the agents. In the next section we shall use another approach which is
valid when the equity and debt can be considered as derivatives obtained from
assets that are traded or spanned by an efficient market. We will see that the
real option or equivalent risk neutral valuation method eliminates the need to
identify these parameters when defining an equilibrium solution.

4. EQUIVALENT RISK NEUTRAL VALUATION


Throughout the rest of the paper we make the following assumptions characterizing the financial market where the debt contract takes place.
Assumption 1 We assume that the following conditions hold
A1: no transaction costs
A2: trading takes place continuously
A3: assets are assumed to be perfectly divisible
A4: unlimited short selling allowed
A5: borrowing and lending at the risk-free rate, i.e. the risk-free asset can be
sold short
A6: the firm has no impact on the risk structure of the whole market
A7: no arbitrage
A8: there exists a risk-free security B, for example a zero-coupon bond, paying
the risk-free interest rate The dynamics of B is given by

To simplify, assume that the firm’s project has a value V that is perfectly correlated with the asset which is traded. We assume that this asset pays no dividend, so its entire return is from capital gains. Then evolves according
to

Here the drift rate should be equal to the expected rate of return from holding an asset with these risk characteristics, according to the CAPM theory (Duffie,
1992)

where is the risk free rate, is the market price of risk,4 and is the correlation of with the market portfolio. One also calls the risk-adjusted expected rate of return that investors would require to own the project.
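For convenience, the standard CAPM relation alluded to here can be written out explicitly; the notation below is a hedged reconstruction (the original symbols were lost in extraction), consistent with footnote 4:

    % Risk-adjusted expected rate of return required for holding the asset:
    % r is the risk-free rate, \phi the market price of risk, and \rho_{xm}
    % the correlation of the asset with the market portfolio.
    \mu = r + \phi \, \rho_{xm} \, \sigma,
    \qquad
    \phi = \frac{r_m - r}{\sigma_m},

where r_m and sigma_m are the expected return and standard deviation of the market portfolio (cf. footnote 4).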

The project is assumed to pay dividends where is the payout ratio. Then the value V evolves according to the following geometric Brownian motion5

where If we repeat the developments of (1.3)-(1.6), but with and we will check that V is indeed the equity value for the unlevered firm when one uses the risk-adjusted rate of return as the discount rate.
Again we assume that the firm needs an amount C to launch the project.
The firm, which is not default free, cannot sell the risk-free security short, since this would be equivalent to borrowing at the risk-free rate. Therefore
the assumption A5 applies only to investors who may buy the firm’s equity or
debt. The firm’s owner and the lender are now interested in maximizing the
equity and the debt value for the levered firm respectively. For that they act
strategically, playing a dynamic Nash-equilibrium based on the evolutions of
equity and debt.

4.1. DEBT AND EQUITY VALUATIONS WHEN BANKRUPTCY IS NOT CONSIDERED
At time the firm pays to contribute to the debt service and
the remaining cash, given by is a dividend paid to the equity
holder. Assume that the lender strategy is never to ask for bankruptcy while
the borrower’s strategy is given by a feedback law We are interested
in defining the value of Equity under these strategies. The equity is a function
of the time the value of the hypothetical unlevered firm V and the state of
debt service and is denoted by To simplify notation we omit
the arguments and simply refer to E. One constructs a portfolio composed of
risk-free bonds B and derivative E replicating the unlevered firm. To avoid
arbitrage opportunities this portfolio must provide the same return as the
unlevered firm, since it has the same risk6. Applying Ito’s lemma we obtain the
following stochastic equation for the equity value dynamics:

The dynamics of E and V are perfectly correlated. Therefore it is possible to construct a self-financing portfolio consisting of shares of the risk-free
asset B and share of the equity E (all dividend paid by the equity are
immediately reinvested in the portfolio), which replicates the risky term of V

and pays the same dividend as V. The portfolio value at time is

Keeping in mind that the equity pays a dividend and the portfolio pays
a dividend the strategy is self-financing if7

which leads to

In order to get a risk replication, the weight must be given by

Then the weight must verify

We can now write the stochastic equation satisfied by the portfolio value

One observes that, as intended, the portfolio replicates the risk of V. Due to
our sixth assumption, must have the same return as V, otherwise there
would be arbitrage opportunities. Moreover, since at time zero the value of the portfolio is equal to the value of V, we must have for all


Matching the drift terms and multiplying by one obtains the following
partial differential equation that has to be satisfied by the equity value on the
domain

At maturity, one should have

A similar equation could be derived for the other derivative (debt) value D

At maturity one should have

Remark 2 As we assume that no bankruptcy occurs, we have to suppose that, at maturity, the balance of the debt service will be cleared, if the value of the
project permits it. This is represented in the terminal conditions (1.28) and
(1.30).

This formulation is interesting, since the model then reduces to a single-controller framework. The firm’s owner controls the trajectory but runs the risk
of having a terminal value drawn to 0. Since the payments are upper bounded
by the cash flow value it might be optimal for the borrower to
anticipate the payments, for some configurations of We shall not pursue
that development, but rather consider the more realistic case where the lender
can liquidate the firm when it is late in its payments.

4.2. DEBT AND EQUITY VALUATIONS WHEN LIQUIDATION MAY OCCUR
Suppose now that both actors have chosen a strategy, the lender having a stopping-time strategy which defines the conditions under which liquidation occurs. After liquidation the equity E is not available to construct the portfolio Therefore, such a portfolio has a return which is smaller than or equal to the

return of V. One obtains the following relations for E and D respectively

Equality holds at points where there is no liquidation. At a liquidation time the following boundary conditions8 hold

where

and

At maturity T when the boundary conditions are

Consider a strategy pair Then we can rewrite the debt value as

and the equity value as



where is the auxiliary process defined by

and denotes the expected value w.r.t. the probability measure induced by the strategies. (Here is the stochastic process induced by the feedback law.) This is the usual change of measure occurring in Black-Scholes evaluations.

5. DEBT AND EQUITY VALUATIONS FOR NASH EQUILIBRIUM STRATEGIES
Now let us reformulate the noncooperative dynamic game sketched out in
section 3 but now using the real option approach to value debt and equity. The
players’ strategies are now chosen in such a way that a Nash equilibrium is
reached at all points Under the appropriate regularity conditions a feedback Nash equilibrium will be characterized by the HJB equations
satisfied by equity and debt values

The boundary conditions at maturity T when are

The boundary conditions in are9

The optimal strategy for the firm is “bang-bang” and is defined by the switching manifold

The equilibrium strategy is given by

The manifold with equation determines the behavior of the firm. If the equation can be solved as then

The lender takes control as soon as the following conditions are satisfied

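Schematically, the bang-bang rule can be expressed in a few lines of Python. The switching-manifold equation itself is lost from the extraction, so the fragment below uses a purely hypothetical switching function Phi and cash-flow cap w_max standing in for the objects defined above; it is a sketch of the structure of the rule, not the chapter's actual equilibrium.

    def firm_payment(t, V, y, Phi, w_max):
        """Bang-bang equilibrium payment rule: pay at the maximal feasible
        rate on one side of the switching manifold Phi(t, V, y) = 0 and
        pay nothing on the other side (a schematic sketch; the actual
        manifold comes from the HJB equations for equity and debt)."""
        return w_max(t, V) if Phi(t, V, y) <= 0.0 else 0.0

    # Toy usage with placeholder functions (purely illustrative):
    Phi = lambda t, V, y: y - 0.5 * V      # hypothetical switching function
    w_max = lambda t, V: 0.1 * V           # payments capped by cash flow
    print(firm_payment(1.0, V=10.0, y=3.0, Phi=Phi, w_max=w_max))  # -> 1.0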
6. LIQUIDATION AT FIXED TIME PERIODS


Often, in practice, the debt service is paid at fixed periods
The lender can only take control at a payment period, if the condition
holds. This is the framework used in Anderson and Sundaresan (1996). If one
assumes, as in Anderson and Sundaresan (1996), that, at each period the unpaid
debt service can trigger a liquidation but if no liquidation occurs the unpaid
service is forgotten, then the state variable becomes unnecessary. In a
payment is due. If the value

does not satisfy the contract terms, i.e. if



the lender can liquidate. The HJB equations now become

The game simplifies considerably. The only relevant state variable is the value
of the firm V(t) at t. The firm’s strategy is now: pay the debt service at otherwise don’t pay anything.

7. CONCLUSION
The design of the “best” debt contract would be the combination of design parameters that maximizes the Nash-equilibrium value of equity E(0; (0,V)) while the Nash-equilibrium debt value D(0; (0, V)) is at least equal to the needed amount C. This is a very complicated problem that can only be addressed by direct search methods where the Nash-equilibria will
be computed for a variety of design parameter values. In this paper we have
concentrated on the evaluation of the equity and debt values, when, given some
contract terms, the firm owner and the lender act strategically, and play a dy-
namic Nash equilibrium. The interesting aspect of this model, which was
already present in Anderson and Sundaresan (1996), is the use of equivalent
risk neutral or real option valuation technique in a stochastic game. The con-
tribution of this paper has been to extend the model to a continuous time setting
and to consider a hereditary effect of past deviations from the contracted debt
service. We feel that this formulation is relatively general and that it contributes
to the extension of game theory to Black-Scholes economics.

NOTES
1. In all generality the control of the firm owner could be described as a process which is adapted to
For our purpose the definition as a feedback law will suffice.
2. Notice that the overpayment is a form of investment by the firm. We could allow it to invest a part
of the cash flows in another type of asset; this would be more realistic from a financial point of view but it
would further complicate the model.
3. Note that we have assumed here that the agents are optimizing discounted cash flows, not the utility
of them. If risk aversion has to be incorporated in the model then the equilibrium characterization will be
harder.
4. It is defined by where and are the expected return and standard deviation of the
market portfolio respectively.
5. This can be verified by constructing a portfolio where all dividends are immediately reinvested in the
project. Such a portfolio is a replication of and must, therefore follow the same dynamics as
6. Notice that the usual way to obtain the Black-Scholes equation, would have been to construct a
self-financing portfolio composed of the risk-free asset B and the underlying asset V in such proportions
that it is a replication of the derivative (either E or D). Then, according to the last two assumptions, we know
that this portfolio has to give the same return as the underlying asset. However in our case the unlevered
firm is not traded; so we proceed in a symmetric way and construct a self-financing portfolio composed of
the risk-free asset B and the derivative E that will replicate V.
7. Such a dynamic adaptation of the portfolio composition is feasible, as is measurable with respect
to the filtration generated by W(t).
8. One uses the notation arbitrarily small.
9. One uses the notation arbitrarily small.

REFERENCES
Anderson R.W. and S. Sundaresan. (1996). Design and Valuation of Debt Con-
tracts, The Review of Financial Studies, Vol. 9, pp. 37-68.
Black F. and M. Scholes. (1973). The Pricing of Options and Corporate Lia-
bilities, The Journal of Political Economy, Vol. 81, pp. 637-654.
Dixit, A.K. and R.S. Pindyck. (1993). Investment under Uncertainty, Princeton University Press.
Duffie, D. (1992). Dynamic Asset Pricing Theory, Princeton University Press.
Chapter 14

HUGE CAPACITY PLANNING AND RESOURCE PRICING FOR PIONEERING PROJECTS

David Porter
College of Arts and Sciences
George Mason University

Abstract Pioneering projects are systems that are designed and operated with new, virtually
untested, technologies. Thus, decisions concerning the capacity of the initial
project, its expansion over time and its operations are made with uncertainty.
This is particularly true for NASA’s earth orbiting Space Station. A model is
constructed that describes the input-output structure of the Space Station in which
cost and performance are uncertain. It is shown that when there is performance
uncertainty the optimal pricing policy is not to price at expected marginal cost.
The optimal capacity decisions require the use of contingent contracts prior to
construction to determine the optimal expansion path.

1. INTRODUCTION
A pioneering project is defined as a system in which both the reliability and
actual operations have considerable uncertainties (see Merrow et al. (1981)).
Managing such projects is a daunting task. Management decision making in
the face of these sizable uncertainties is the focus of this paper. In particular, the
decisions on the initial system capacity and design flexibility for future growth
of the project will be addressed. The typical management process for pioneering
projects is to design to a wish list of requirements from those who will be using
the project’s resources during its operations (this is sometimes referred to by
engineers as user “desirements”). The main management policy of the designers
is to hold margins of resources for each subsystem to mitigate the uncertainties
that may arise in development and operations. The amount of reserve to be
held is not well defined, but it is sometimes referred to as an insurance pool.
However, unlike an insurance pool, the historical results for pioneering projects
have been universal “bad luck” by all subsystems, which depletes the reserves
(see Merrow et al. (1981) and Wessen and Porter (1998)). Rarely are incentive

systems used to obtain information on system reliability and the cost of various
designs.1 Instead of focusing on these types of management policies, we want
to determine the optimal planning, organizational and incentive systems for
managing pioneering projects.
The application that will be used throughout will be NASA’s Space Station.
The Space Station is an integrated earth orbiting system. The Station is to
supply resources for the operation of scientific, commercial and technology
based payloads. The structure of the Station design is an interrelated system
of inputs and outputs. For example, the design of the power subsystem has a
profound effect on the design of the propulsion subsystem since power will
be provided via photovoltaics. The solar array subsystem creates drag on the
Station which will require reboosts from the propulsion subsystem. Fox and
Quirk (1987) developed an input-output model of the Station’s subsystems in
which the coefficients of the input-output matrix are random. This model has
been analyzed for the case with uniform distributions over the coefficients and
cost parameters, with a safety-first constraint on the net output of the Station.
Using semi-variances, they determine the distribution of costs over the Station’s
net outputs. Using some test data from engineering subsystems, the model has
been exercised for the uniform Leontief system (see Quirk et al. (1989)). While
this attempt at modelling the interaction of the subsystems and the uncertainty in
cost and performance has shown that the Station is very sensitive to performance
uncertainties and that the errors propagate, there is very little in the way of
policies to help guide the design process.
One of the main features of the operation of the Station is that information
about subsystem performance and cost accrues over time (see Furniss (2000)).
Thus, design and allocation decisions need to take into account the future effects
on resource availability. For the Station, a capacity for each subsystem must
be selected at “time zero” along with the design parameters of each subsystem.
Next, users of the Station must design their individual payloads and be scheduled
within the capacity constraints of the system. After payloads are manifested,
the Station operations parameters must be determined. After the Station starts
operating, a decision on how to grow the system must be made. The timing of
decisions can be found in Diagram 1 below. An important aspect of the analysis
to follow is that new information will become available that was previously
unknown during the decision timeline.

The decision variables include the vector of initial subsystem design capacities X, the vector of initial design parameters v, the vector of planned resources
u that will be used by payloads, the realization of resources q that are available
to payloads for operations and the capacity expansion, redesign and operations
after there is experience with the capabilities of the Station. Not
pictured in Diagram 1 is the fact that actual Station operating capabilities are
realized between time and It is at that time that the uncertainty is
resolved and known to all parties.
Within the confines of the decision structure defined above, the following
questions are addressed:

Given the uncertainty over cost and performance, what should be the
crucial considerations in the initial capacity and design of the Station?

How should outputs of the Station be priced to users; in particular, can management rely on traditional long-run marginal cost principles to allocate outputs and services efficiently?2

2. THE MODEL
The Space Station can be viewed as a multiproduct “input-output” system in
which decisions are related across time. In addition, there is uncertainty as
to the cost and performance of the system and this uncertainty is resolved
over time. The Station is a series of subsystems (e.g. logistics, environmental
control, electric power, propulsion, etc.) which require inputs from each other to
operate. With a project as new and complex as the Station, the actual operations
are uncertain. Specifically, the performance and cost of each subsystem are
uncertain. Let denote the gross capacity of subsystem i and the design
parameters of subsystem i. Let denote the vector of resources of other
subsystems utilized by subsystem i. Let denote the random variable associated
with the cost and performance of each subsystem. The input-output structure
of the Station is given by:

Let i = 1,...,n so that there is a total of n subsystems that constitute the Station.
If the Station subsystems were related by a fixed linear coefficient technology,
(1) could be represented by Y=AX where Y would be a 1 matrix representing
the internal use of subsystem outputs, A would be the matrix of input-
output coefficients where the entries are random variables, and X is the
1 matrix of subsystem capacities. We consider the more general nonlinear
structure for our model.
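For the special linear case just mentioned, a small numpy sketch shows how net capacity propagates performance uncertainty when the input-output coefficients are random, in the spirit of the uniform Leontief analyses cited above (Quirk et al. (1989)); all distributions and numbers below are hypothetical.

    import numpy as np

    rng = np.random.default_rng(1)

    n = 3                               # number of subsystems
    X = np.array([100.0, 80.0, 60.0])   # hypothetical gross capacities

    # Mean input-output coefficients: entry (i, j) is the amount of
    # subsystem i's output consumed per unit of subsystem j's capacity,
    # so internal use is Y = A X and net capacity is X - A X.
    A_mean = np.array([[0.00, 0.10, 0.05],
                       [0.20, 0.00, 0.10],
                       [0.05, 0.15, 0.00]])

    draws = 10_000
    net = np.empty((draws, n))
    for k in range(draws):
        # Uniform perturbation of each coefficient around its mean value.
        A = A_mean * rng.uniform(0.8, 1.2, size=(n, n))
        net[k] = X - A @ X

    print("mean net capacity:", net.mean(axis=0).round(2))
    print("5th percentile   :", np.percentile(net, 5, axis=0).round(2))

Even this toy example shows how coefficient uncertainty in one subsystem propagates into the net capacities of the others, which is the sensitivity the Fox-Quirk analysis emphasizes.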

Costs are assumed to be separable across subsystems so that total cost can
be represented as

Where is the cost of maintaining subsystem i with capacity and design in state
At time t=0, the initial subsystem designs and capacities are selected under
uncertainty. Let i=1,...,n denote subsystem net capacities so that

Thus, the amount of capacity available to service payloads will be a function of X, The model that has been described up to this point is defined in
terms of stock units. Changing the model to accommodate resource flows would
unduly complicate the model. Using flow resources would take into account
peak loads and other temporal phenomena that are important in the scheduling
of payloads but would detract from the general management decision making
issues that are the focus of this paper.
At time payloads are manifested to the Station. Let be the contract
signed with payload j to deliver services. In general, could be a list of
contingent resource levels as well as expected resource deliveries and monetary
damages from the failure to deliver services. The important issue to note is that
the contract must be made before is known. Let denote payload j’s
consumption of the capacity of subsystem i and define
Given the set of contracts signed at the vector of subsystem capacities
available at time will be committed for through The
mapping of contract specifications to resource commitments will be defined
through If the contracts m are based on actual resource delivery
so that nonperformance contracting is infeasible, it would be the case that
for all At time the value of is realized and the
actual level of subsystem capacities is known and contract commitments and
penalties are assessed. At time management must decide on the increments
to initial subsystem capacities and redesign choices
One of the most discussed items in pioneering projects (see Abel and Eberly
(1998)) is the dynamic nature of costs and performance based on the installed
capacity. How one designs the Station to incorporate modular additions and
the level of experience gained from operating a complex system can have profound effects on the incremental costs of expansion. We capture the notions of
growth flexibility through the function f(X,v). This function enters the cost and
performance elements of the project at the growth stage. Specifically, at
we have:

Where is known and resources are allocated among payloads. We assume that That is, the more flexible the project, the smaller will be
the corresponding cross-utilization of subsystems and costs when the decision
to expand is made.

Definition 1 A project is said to exhibit learning-by-doing properties if

This definition captures the idea that the larger (or more complex) the project, the larger will be the benefits of learning to operate the system. If it is costly
to dismantle or redesign existing capacity then the project will have what is
commonly referred to as capital irreversibilities.

Definition 2 A project is said to have capacity irreversibilities if


for

Definition 3 A project is said to have design irreversibilities if

To find the system optimum, we compare benefits and costs. The benefit
side of the model is one of the most difficult to estimate in practice. Clearly,
however, the payload designer is the individual in the best position to determine
these benefits. The difficulty in obtaining this information is that there is an
incentive overstate the benefits since NASA does not base its pricing systems
on this information, it uses it only to aid in its subsystem design decisions.
Nonetheless, we model the benefit side using a per-period dollar benefit function for each payload j. Specifically, let denote payload
j’s monetary payoff from consuming units of subsystem capacities, where
The present value of benefits at time t=0, starting at is
given by:

Where r is “the” discount rate. The present value of benefits between time
t=0 and is given by:

Turning to the cost side of the objective function, the present value of costs
at time t=0 if the state is is:

The present value of system costs at time for state is given by:

In order to maximize net benefits, we must choose initial capacities and designs, a growth path, and an allocation of subsystem capacities to maximize:

Subject to equations (4), (6), and (7), and where is the expectation operator over This is a dynamic programming problem, so we must solve the problem at first, taking the t=0 decisions as given, and then proceed to solve the and t=0 problems. For the most part we are interested in the comparative static properties of the model and the pricing solutions. For the solution to the problem and its associated comparative statics we will use a form of the envelope theorem for dynamic programming. Recall that the envelope theorem states:

Where is the solution to the problem of choosing x for fixed so as to maximize subject to and (the Lagrangian),4 so that
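In standard notation, the envelope theorem invoked here reads as follows; this is a reconstruction of the relation whose symbols were lost from the original, stated under the usual differentiability assumptions:

    % If J*(alpha) = max_x f(x, alpha) subject to g(x, alpha) = 0, with
    % Lagrangian L(x, lambda, alpha) = f + lambda' g, then at the optimum
    % x = x*(alpha):
    \frac{d J^{*}(\alpha)}{d \alpha}
      = \left. \frac{\partial \mathcal{L}(x, \lambda, \alpha)}
               {\partial \alpha} \right|_{x = x^{*}(\alpha)} .

That is, the indirect effect of a parameter through the optimal choice variables vanishes at the optimum, which is exactly the fact exploited in the dynamic programming argument that follows.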

This theorem allows an easy way to find the marginal effects of the parameters
of the model. Extending this model to the dynamic programming case is easy.
We define the maximization problem at as:

Where X is the vector of choice variables and is a vector of fixed parameters. Let be a solution to (14) and define
Suppose we can partition where are the choice variables
from the t=0 problem and are fixed parameters for all periods. At t=0 the
optimization problem is:

Since is a parameter in we can find:

Differentiating we obtain:

The first order conditions for require that:

Adding (16) to (17) and using (18) we find:

Thus, at t=0 we can safely ignore the “indirect” effects of the parameters
on the solution at We now use this fact to obtain some results from
this model.

3. RESULTS
We will present a series of results starting with the easiest case and steadily
add complexity to the decision making model.

3.1. COST AND PERFORMANCE UNCERTAINTY


We assume that f(X,v)= 0. At the value of is realized and expansion
and redesign decisions are made. Thus the maximization problem becomes:

Subject to where
The necessary conditions
for a maximum are:

Where is the Lagrange multiplier for the ith constraint. Rewriting (22)
and (23) in matrix notation we find:

Where is the vector of multipliers, is the vector of marginal capacity costs, is the vector of marginal redesign costs, is the diagonal matrix of
marginal capacity cross-utilizations and is the diagonal matrix of marginal
redesign cross-utilizations.
Result 1 (standard marginal cost pricing): Optimal decisions are made
when That is, capacity adjustments and redesigns should equate marginal costs with demands at those prices
for resources.

Moving to the problem, is known and X,v have been selected; the only
issue remaining is allocating the available resources among payloads. Since
future benefits and costs are not affected by the allocation, we need only
maximize benefits given revealed supply subject to any commitments made at
through the contracts Formally,

Subject to for positive levels of output. Let be the maximum value of (27) subject to the net output constraint for different values of X, v, and For fixed values of X and v, gives the optimal contingent plans of the payloads. The solution to (27) is the maximum
benefits that can be derived subject to the supply constraint. This is nothing
more than the use of a spot market to clear supply and demand.
Result 2 (spot market): At time when is known, if there are no
contracts restricting supply, prices should be set by spot contracting to solve
(27).
If contracts are signed prior to then for the contracts to create incentives
for the optimal allocation of resources, the contracts must be dependent on
In particular, for fixed values of X and v, prices p must be a function of and
solve the following problem:

The prices for net outputs q at defined in the equations above, will be
functions of i.e. contingent contracts. If we let denote the prices
that solve the equations above, this means that contracts signed at t=0 must be
contingent on For practical purposes, such a complete set of contingent
contracts is infeasible. However, a less restrictive set of contracts can be
designed that can increase efficiency. Let
and where denotes the sample
space for

Definition 4 A priority contract of class for resource where there are


classes, is a contract that specifies delivery of output if and only
if for for

Thus the dispatch of resources, once the uncertainty is resolved, is governed by the priority class in which the resource resides.5 We will examine this contract type when we examine the problem.
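A priority contract of this kind can be dispatched with a few lines of code once the uncertainty is resolved. In the hypothetical Python sketch below, realized capacity for a single resource is allocated to contracts in class order; the class labels and quantities are illustrative, not the paper's notation.

    def dispatch(realized_capacity, contracts):
        """Serve priority contracts for a single resource in class order
        (class 1 first) until realized capacity is exhausted; returns
        the quantity delivered per contract (an illustrative sketch)."""
        remaining = realized_capacity
        delivered = []
        for _, quantity in sorted(contracts, key=lambda c: c[0]):
            served = min(quantity, remaining)
            delivered.append(served)
            remaining -= served
        return delivered

    # Hypothetical contracts: (priority class, contracted quantity).
    contracts = [(2, 30.0), (1, 50.0), (3, 40.0)]
    print(dispatch(realized_capacity=70.0, contracts=contracts))
    # -> [50.0, 20.0, 0.0]: class 1 fully served, class 2 partially,
    #    class 3 not at all.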
The problem is essentially the planning of resource use by payloads
given the installed capacity X and the design v. If no contracts have been
signed at t=0, then the problem is either to sign contingent contracts by solving
(28) and (29) above or just assign payloads to the station so that the resource
demands by all the manifested payloads is less than and use spot
markets to clear the demand once is revealed. Suppose that for payload j, the
net benefits derived from the payload are a function of the payload design.6 In
particular, let us represent the design variable for payload j with so that net
benefits can be modelled as where
denotes gross benefits from the payload and denotes the costs of designing
and operating the payload. We assume that benefits and costs are increasing functions of net output use and payload design, i.e., If
priority contracts are utilized and if denotes the price for priority class k for
resource i then each payload j would maximize:

Where E is the expectation operator over the joint distribution of priority classes and net outputs. The maximization equations and market clearing
constraints become

If priority classes are not used to create contracts, but instead when is
realized a Vickrey auction (see Vickrey (1961)) is conducted wherein each
payload j submits a two-tuple and the auctioneer computes winners as
follows:

The Vickrey auction has a dominant strategy that has each bidder revealing
their true value. However, since each bidder does not know the value of
or the values of other participants prior to developing their payloads, they must
calculate the probability that resources will be available and that they have
one of the winning bids. In particular, payload j must select based on the
probability that their value will be greater than other bidders’ and that they fit
within the resource constraints (35). Since the only way to increase one’s
probability of being selected is to fit within the capacity constraints, enters
into the decision calculus through the functions B, C and the probability of
fitting. That is, if j selects a larger then for a smaller choice of we
can obtain the same net benefits, but increase the probability of fitting (see
Figure 1.1 above). Thus, there is an incentive to design less resource-intensive payloads, which are inefficient payload designs relative to the use of priority contracts.7
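To fix ideas, here is a minimal unit-demand variant of the Vickrey mechanism in Python: each payload submits a bid for one slot, the k highest bidders win, and each winner pays the highest losing bid. This simplifies the two-tuple bid format described above, whose exact winner-determination rule was lost with the equation; all numbers are hypothetical.

    def vickrey_unit_demand(bids, k):
        """k identical slots, unit demand: the k highest bids win and
        each winner pays the (k+1)-th highest bid (zero if there are no
        losing bids). Truthful bidding is dominant in this variant."""
        order = sorted(range(len(bids)), key=lambda i: bids[i], reverse=True)
        winners = order[:k]
        price = bids[order[k]] if len(bids) > k else 0.0
        return winners, price

    # Hypothetical payload valuations for a single resource slot:
    bids = [12.0, 7.5, 9.0, 15.0, 4.0]
    winners, price = vickrey_unit_demand(bids, k=2)
    print("winning payloads:", winners, "each pays:", price)
    # -> winning payloads: [3, 0] each pays: 9.0

The point made in the text survives in this toy version: the price is set by a losing bid, so a payload can raise its probability of winning only by fitting within the capacity constraint, which distorts payload design.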

Result 3 (priority contracting): The use of priority contracts results in a more efficient assignment of resources than the use of spot contracts.
Result 4 (inefficient payload design): If payloads have design choices,
then reliance on a spot market when there are performance uncertainties will
result in inefficient payload designs.

We are now in a position to examine the t=0 problem. The maximization problem becomes:

The first order conditions for installed capacities are:

Using our envelope theorem from Section 2, (37) can be written as:

Where the second term is evaluated at and the last terms are
evaluated at Using equation (22) we find that

The above equation along with the fact that yields:

Thus, capacity should be selected so that:

If there is no performance uncertainty then it is a simple rule to price where marginal benefits equal expected marginal cost at time t=0, i.e. However, when there is performance uncertainty, we
again need priority contracting to get information on the marginal benefits of
reliability to be equated with incremental costs.
Result 5: If there are no performance uncertainties, then prices should be
set to expected marginal cost and capacity should be built to equate demand
and supplies at these prices. If there are performance uncertainties priority
contracting should be used to obtain demand information on reliability.8
We now turn to an environment in which there are capital irreversibilities or learning by doing.

3.2. COST UNCERTAINTY AND FLEXIBILITY


Suppose that there is no performance uncertainty, but that flexibility is an
issue. Thus, we need to introduce the constraints (4)-(7) into the model. At
the first order conditions will implicitly define the expansion path
redesign and demands as functions of X, v, and f. Since there is no
performance uncertainty, we do not need to consider the contingent allocation
problems at Thus, examining the decisions that are affected at time t=0 by using the implicit supply and demand functions at time and applying our envelope theorem, we find the following first order conditions:

Equation (42) implies that payloads should be charged for resource use. Using (30) we can derive the pricing equation:

Result 6 (capacity commitment pricing): If there are capital irreversibilities, i.e. then the optimal pricing policy is to charge an amount greater than long-run expected marginal cost. Specifically, a capacity commitment charge of needs to be charged in order to ensure optimal capacity is installed.9

Definition 5 Project A is said to have greater capital irreversibilities or lower learning by doing effects than Project B if and only if

From (45) we can obtain the following result:


Result 7 (installed capacity): The larger (smaller) are the capital irreversibility (learning by doing) effects, the smaller should be the installed capacity.

3.3. PERFORMANCE UNCERTAINTY AND FLEXIBILITY
Returning to our model described in the beginning of Section 3 and applying
the envelope theorem when flexibility is present, we will obtain the following
first order conditions:

Compared to equation (40), we have an additional term associated with flexibility. Thus, the solution entails both the use of contingent or priority contracting to obtain contingent demands to optimally ration supply while incorporating
the effects on performance and costs of capital irreversibilities. With capital
irreversibilities, solving the reliability problem by installing additional capacity
or more sophisticated designs is costly if you are wrong and learn that changing
the expansion path is expensive. Thus the additional terms in the equations
above tell us:
Result 8 (demand reduction): When there are capital irreversibilities (learning by doing) the priority contract prices will be higher (lower) so that resource demands are suppressed (enhanced), reflecting the cost (benefit) of committing to capacity at time t=0.

4. CONCLUSION
We can now address the questions listed at the beginning of this paper. Con-
cerning the initial size of the project, if technology is such that additions to
capacity or design changes are costly, then a smaller, more flexible system
should be built and operated until more information concerning performance
is obtained. If the technology is such that irreversibilities are small and that
marginal productivity increases “dramatically” as operators learn how the tech-
nology works, then a larger, more capable system should be built initially. These
flexibility issues also need to be part of the pricing regime to suppress the appetite of payloads to demand more resources.
For the second question that was posed there is a better alternative than
what NASA currently uses to price resources. In particular, it is clear that the
use of priority contracts is essential and prices should not be solely based on
incremental costs. Project management should devote more effort to obtaining
accurate information concerning cost and performance distributions and should
provide more flexible contract procedures to assist users in contingent planning.
In addition, contract regimes should be put in place to aid operators in the
rationing of resources when there are performance shortfalls. The market for
manifesting payloads and the scheduling of resources should be interactive and
should be done years before actual operations. In this way price information
can guide payload design and Station growth paths.

NOTES
1. One notable exception is the management policy used on the Jet Propulsion Laboratory’s Cassini
Mission to Saturn (see Wessen and Porter (1997)).
2. The Space Shuttle went through this process in determining the price to charge private companies
for the use of the Shuttle bay to launch satellites and middeck lockers for R&D payloads. The initial pricing
policy was to charge for the maximum percentage of Shuttle mass or volume capacity used by a payload
times the expected long-run marginal cost of a Shuttle launch, or short-run marginal cost if demand faltered,
see (Waldrop (1982)).
3. In general, so that benefits are in terms of contract specifications based on the
realization of We investigate the structure of such contracts in Section 3.1, but for now we only consider
use of subsystem capacities.
4. It is assumed that and are continuously differentiable.
5. This type of contract has been investigated by Harris and Raviv (1981) as a way for a monopolist to
segment the market. Chao and Wilson (1987) examine this form of contracting to price reliability in an
electric network.
6. Resources expended to design payloads are one of the most expensive elements in payload develop-
ment. Payloads can be designed to be more autonomous so as to use less astronaut time but this may increase
the use of on-board power and data storage. These trade-offs are extremely important but are usually done
in a haphazard manner (see Polk (1998)).
7. This phenomenon has been observed on the Space Shuttle due to the reduced number of flight opportunities for science payloads. The Shuttle program has created a Get-Away-Special (GAS) container for payloads that do not require man-tending or special environmental controls. The number of payloads of the GAS variety has increased ten-fold since the beginning of the Shuttle program.
8. Noussair and Porter (1992) have developed an auction process to allocate a specific form of priority contract. The experiments they conduct show that their auction results in highly efficient allocations and outperforms proportional rationing schemes.
9. A somewhat similar result can be found in Yildizoglu (1994).

REFERENCES
Abel, A. and J. Eberly. (June 1998). “The Mix and Scale of Factors with Irreversibility and Fixed Costs of Investment,” Carnegie-Rochester Conference Series on Public Policy 48:101-135.
Chao, H. and R. Wilson. (December 1987). “Priority Service: Pricing, Investment and Market Organization,” American Economic Review 77:899-916.
Fox, G. and J. Quirk. (October 1985). “Uncertainty and Input-Output Analysis,”
JPL Economics Research Series 23.
Furniss, T. (July 2000). “International Space Station,” Spaceflight 42: 267-289.
Harris, M. and A. Raviv. (June 1981). “A Theory of Monopoly Pricing Schemes with Demand Uncertainty,” American Economic Review 71:347-365.
Merrow, E., K. Philips and C. Myers. (September 1981). “Understanding Cost
Growth and Performance Shortfalls in Pioneering Process Plants,” Rand Cor-
poration Report R-2569-DOE.
Noussair, C. and D. Porter. (1992). “Allocating Priority with Auctions: An Experimental Analysis,” Journal of Economic Behavior and Organization 19:169-195.
Polk, C. (1993). The Organization of Production: Moral Hazard and R&D. Ph.D. Dissertation, California Institute of Technology, Pasadena, California.
Quirk, J., M. Olson, H. Habib-agahi and G. Fox. (May 1989). “Uncertainty and
Leontief Systems: An Application to the Selection of Space Station Designs,”
Management Science 35:585-596.
Vickrey, W. (1961). “Counterspeculation, Auctions and Competitive Sealed-Bid Tenders,” Journal of Finance 16:8-37.
Waldrop, M. (1982). “NASA Struggles with Space Shuttle Pricing,” Science 216:278-279.
Wessen, R. and D. Porter. (August 1997). “A Management Approach for Allocating Instrument Development Resources,” Space Policy 13:191-201.
Wessen, R. and D. Porter. (March 1998). “Market-Based Approaches for Controlling Space Mission Costs: The Cassini Resource Exchange,” Journal of Reduced Mission Operations Costs 2:119-132.
Yildizoglu, M. (July 1994). “Strategic Investment, Uncertainty and Irreversibility Effect,” Annales d’Economie et de Statistique 35:87-106.
Chapter 15

AFFORDABLE UPGRADES OF COMPLEX SYSTEMS:
A MULTILEVEL, PERFORMANCE-BASED APPROACH

James A. Reneke

Matthew J. Saltzman

Margaret M. Wiecek
Dept. of Mathematical Sciences
Clemson University
Clemson SC 29634-0975

Abstract A modeling and methodological approach to complex system decision making is proposed. A system is modeled as a multilevel network whose components inter-
act and decisions on affordable upgrades of the components are to be made under
uncertainty. The system is studied within a framework of overall performance
analysis in a range of exogenous environments and in the presence of random
inputs. The methodology makes use of stochastic analysis and multiple-criteria
decision analysis. An illustrative example of upgrading an idealized industrial
production system with complete computations is included.

1. INTRODUCTION
In this paper we address the problem of upgrading a complex system where
there are multiple levels of decision making, the effects of the choices interact,
and the choices are accompanied by uncertainty. We propose a framework for
choosing, from among a number of alternatives, a set of affordable upgrades
to a system that exhibits these characteristics. We present the methodology in
terms of an illustrative example which highlights its application to each of these
areas of difficulty.
The system of interest is decomposed into multiple levels, each consisting of a
network describing the interactions of the components at that level. The process
constructs conceptual models working from top to bottom, and evaluates the
impact of proposed upgrades using computational models from bottom to top.
In general, as one proceeds down the levels of decision making, alternatives
depend on more independent variables representing exogenous influences and
hence uncertainty.
The upgraded enterprise operates within a range of environments represented
by the exogenous variables and so the best set of choices might not be optimal for
any particular environment. Understanding the tradeoffs leads to better choices.
Because of the interaction of choices for component upgrades, the focus for the decision maker must be on overall enterprise performance. Major upgrades of
some components may have little impact on overall performance while minor
upgrades on other components could have a large impact. Finally, upgrade
options must be evaluated in terms of their performance in noisy environments
and the contribution of component uncertainty to enterprise uncertainty.

Relation of this work to the literature. The problem of upgrading complex systems has a wide range of applications in business and engineering. Korman
et al. (1996) review the process of making upgrade or replacement decisions for
a chemical distribution system. Hewlett-Packard Corporation used operations
research based methods to develop design changes in order to improve the
performance of a printer production line, as reported by Burman et al. (1998).
Samuelson (1999) describes the effort of modular software design that allowed
the “plug-in” replacement of the control software and improved the performance
of a telephone system of International Telesystems Corporation. Su et al. (2000)
study two distribution management systems of a Taiwan power company and
consider system upgrades or migration to improve performance.
In engineering communities, the upgrade problem has been extensively stud-
ied. As upgrades are part of recoverable manufacturing, researchers in the area
of manufacturing planning and control have been involved. Yield impact mod-
els are applied by McIntyre and Meits (1994) to evaluate cost effectiveness of
upgrades in semiconductor manufacturing. Hart and Cook (1995) proposed
a checklist type approach to decision making about upgrade versus replace-
ment in the manufacturing environment. Upgrade decisions in the context of
life-cycle decision making for flexible manufacturing systems are examined by
Yan et al. (2000). Some authors view upgrades from the perspective of system
modernization. Wallace et al. (1996) propose a system-modernization decision framework and analyze engineering tradeoffs made in modernizing a manufac-
turing design system to use distributed object technology. Another perspective
is used by van Voorthuysen and Platfoot (2000). They develop a system to iden-
tify relationships between critical process variables for the purpose of process
improvement and upgrading, and they demonstrate the methodology on a case
study of a commercial printing process.
The upgrade problem for complex systems is related to the system design
problem. Hazelrigg (2000) critiques classical systems engineering methodolo-
gies. He proposes properties that a design methodology should satisfy, including: independence of the engineering discipline; inclusion of uncertainty and risk; consistency and rationality in comparing alternative solutions, independent of the order in which the solutions are considered; and the capability of rank ordering candidate solutions. The method should not impose preferences on the decision maker nor constraints on the decision-making process, should have positive association with information (the more the better), and should be derivable from axioms.
Papalambros (2000) and Papalambros and Michelena (2000) survey the ap-
plication of optimization methods to general systems design problems. In ad-
dition to identifying optimal structural and control configurations of physical
artifacts, optimization methods can be applied to decomposition of systems
into subsystems and integration of optimal subsystem designs into the final
overall design. Rogers (1999) proposes a system for decomposition of design
projects based on the concept of a design structure matrix (DSM). Rogers and
Salas (1999) describe a Web-based design management tool that uses the DSM.
Models and methodologies for upgrade decisions have also been studied by mathematical and management scientists, and by operations researchers.
Equipment upgrade and replacement along with equipment capacity expansion to meet demand growth is studied by Rajagopalan et al. (1998) and Rajagopalan (1998). The former presents a model for making acquisition and upgrade decisions to meet future demand growth and develops a stochastic dynamic programming algorithm, while the latter unifies capacity expansion and equipment replacement within a deterministic integer programming model.
Replacement and repair decisions are studied by Makis et al. (2000) with
the objective to find the repair/replacement policy minimizing the long-run
expected average cost of a system. The problem is formulated as a continuous
time decision problem and the results are based on the theory of jump processes.
Carrillo and Gaimon (2000) introduce an optimal control model to couple
the improvement of system performance with decisions about the selection
and timing of process change alternatives and related knowledge creation and
accumulation.
Majety et al. (1999) describe a system-as-network model for reliability al-
location in system design. In this model, the methods are associated with
reliability measures as well as costs. Cost is the objective to be minimized and
a certain level of overall system reliability is to be achieved. Here the objective
is linear and the constraints are potentially nonlinear, depending on the structure
of the network.
Luman (1997; 2000) describes an analysis of upgrades to a complex, multilevel weapons system (a “system of systems”). The methodology treats cost as
the independent variable, allowing the decision maker to analyze the tradeoff
between cost and overall performance. Luman addresses the need to under-
stand the relationship between overall performance and performance measures
of major subsystems. The upgrade problem is modeled as a complex nonlinear
optimization problem. Closed form approximations can adequately represent
some systems, but for greater complexity, simulation-based optimization meth-
ods may be required. The methodology is illustrated on a mine countermeasures
system involving subsystems for reconnaissance and neutralization.

Nature of the results. This paper presents a comprehensive modeling and decision methodology in which the modeling techniques are designed for
decision making and decisions are based on system performance over a range of
operating conditions constrained by affordability considerations. The decision
methodology makes use of multicriteria decision analysis within a framework
that allows for uncertainty.
A thorough exposition of ideas in the paper is given in terms of an illustrative
example of an industrial production system. The problem has two decision
levels and illustrates how one might model interactions among possible choices.
Finally, the role of uncertainty is discussed in terms of risk associated with the
final design.
The paper provides computational schemes which can be adapted to more
complex systems. Performance is represented by response surfaces or func-
tions with respect to independent variables defining operating conditions. Each
component is modeled as a transformation of an input performance function
to an output function. Interaction between components at a single level are
modeled through the performance functions: the transformation describing a
component may take account of the output function of another component as
part of its input.
In the modeling phase, the relationships between tasks or components are
developed from the top level down. Each component at a given level is de-
composed into subtasks, and the relationships between tasks are modeled as a
network. In the decision phase, selected affordable upgrade alternatives at the
lower level are passed to the higher level, where they are combined to form
alternative upgrade plans for that level. These, in turn, are passed up to the next
higher level. At the top (enterprise) level, the decision maker chooses a most de-
sirable selection of upgrades from among alternatives passed up. At each level,
multicriteria optimization is used to identify upgrades that are near optimal
over the range of operating conditions. Stochastic linearization representations
(Reneke, 1998) are extended to systems with two independent variables and
used to assess risk associated with the chosen solution. This technique has
been applied previously only to systems with one independent variable. Of
Affordable Upgrades of Complex Systems 305

note, simulations are obtained as matrix multiplications and the methods return
basic statistics exactly.

Significance of the results. While we do not claim that we have completely met the criteria sketched by Hazelrigg, we assert that our approach does satisfy
a majority of the desired features and does not suffer from pitfalls common
for other methods. The reliance on performance functions rather than on the
physical system makes our model independent of the engineering discipline
and allows for rational analysis by a decision maker who is unfamiliar with
all the details of the system under consideration. Performance functions can
potentially be derived from underlying physical (or economic) models where
appropriate, or they can be based on simulations. But this independence means
that the approach can be applied to non-physical systems as well as the physical
systems traditionally in the purview of systems engineering.
The process limits the amount of information which must be passed between
levels. Of course, if all information is available to decision makers at adjacent
levels then there is no need to distinguish between levels.
The use of stochastic analysis guarantees that decisions are analyzed under
conditions of uncertainty with exogenous variables modeling uncertainty at each
level. Due to the use of multicriteria analysis, the set of feasible decisions at the
lower level is rationally and consistently narrowed down to a set of candidate
decisions that are passed up to the higher level at which they become feasible
alternatives. The final preferred decision at the top level is made according to
the preferences expressed by the decision maker.
The work described here opens up new directions for research. The approach as outlined in the example provides a framework for many focused investigations: in particular, into applying the system modeling/simulation methodology to problems based on real data, into clarifying the use of stochastic analysis in a multiple-criteria decision process, and into numerical methods for large-scale simulations, i.e., systems with many components and choices. More research is needed to explore the
application of multicriteria optimization in aiding the decision maker to arrive at
a preferred decision. Since the resulting problems might be of high dimension,
techniques have to be developed to reduce the problems to a level at which the
analysis can be easily comprehended.

Outline of the remainder of the paper. Section 2 describes the modeling process. At each level, each task or component is modeled as a linear trans-
formation defined on a set of performance functions. All of the performance
functions at a given level depend on the same set of exogenous variables. Rela-
tions among component choices, i.e., which components influence which, are
describable using a network with nodes (components) representing operators
and links representing performance functions. A bi-level example of a simple industrial production model is introduced and the modeling of tasks and their interactions is carried out. We then take up, in order, simulation of components and computation of the system performance function in terms of external influences.
These computational tools are used in the discussion of decision methods which
follows.
The decision process is described in Section 3. At each level beginning
with the lowest, a multicriteria optimization technique is used to search for
nondominated upgrade choices from among affordable alternatives for each
task. These solutions form the set of choices that are integrated by the decision
maker to form the upgrade alternatives that are passed to the next higher level.
The decision maker at the top level chooses a single preferred solution from
among combinations of alternatives provided for the subsystems at this level.
Section 4 discusses stochastic analysis including the two sources of decision
uncertainty, modeling component risk, and evaluation of a final design (solution)
in terms of system risk. The technical details of the stochastic analysis appear
in the Appendix.
Finally, in the conclusion (Section 5) we discuss applicability of the method-
ology and the value of our approach, and we list some questions to be resolved.

2. MULTILEVEL COMPLEX SYSTEMS


In general, a complex system is viewed as a system composed of several decision levels, with the top level referred to as the master level and the remaining levels below it (see Figure 15.1). Each level below the master level might have multiple decision makers. Conceptual modeling proceeds from the top down. At the master level, the overall system (the enterprise) is modeled as a single task. At every level but the lowest, a task can be decomposed into
subtasks assigned to lower level decision makers who do not interact. The lowest
level is reached when tasks are not decomposed further. Each decision maker,
in decomposing a task, develops a conceptual model of interactions among
subtasks (see Figure 15.2). Tasks are viewed as networks whose nodes represent
subtasks to be accomplished with arcs representing performance functions.
Also for each task, the subtasks are accomplished in an environment modeled
by a set of exogenous variables so the performance functions may depend
on several independent variables. The network structure indicates how the
performance of each subtask influences and is influenced by other subtasks.
Thus a node (a subtask) acts on performance functions represented by inbound
arcs to produce a performance function associated with outbound arcs.
Computational models are developed and decisions made from the bottom
up. During the decision making phase, methods for accomplishing the tasks are
associated with the nodes. Starting at the lowest level, decision makers develop
alternative feasible methods for their tasks, develop computational models for
these methods, and pass methods and models up to the next higher level. Meth-
ods for a task are feasible provided they satisfy the cost constraint assigned to
that task. Computational models at the next higher level are constructed from
the sub-method models using the previously developed conceptual models. Se-
lections are made from the alternative feasible methods and these methods and
models are passed up to the next higher level. At the master level, computational
models of feasible methods for the system task are constructed and a preferred
method is chosen from among the alternatives. The overall goal is to find a
preferred method satisfying the enterprise cost constraint, i.e., an affordable
selection of upgrades.
Methods are modeled by means of linear operators relating the input perfor-
mance of a task to the output performance. The performance functions may
be vector or scalar valued and are defined on the space of exogenous variables
characteristic for the task. By restricting models of methods to linear opera-
tors, computational models at higher levels can be obtained by “plugging” the
alternatives from below into the conceptual models.
Models for methods (linear operators) are based on stochastic linearizations
of the underlying components. The theory and methodology of stochastic lin-
earization are discussed elsewhere (Reneke, 1997; 2001). The convergence
theory is well developed for the one variable (signal) case (Reneke et al., 1987;
Reneke, 1998). The theory for the multiple variable (response surface) case is
emerging but is complete for significant special cases. Stochastic linearization,
discussed further in Section 4, provides an “easy to apply” modeling methodology based on natural assumptions for sub-method behavior.
In the two-dimensional case, real-valued performance functions can be thought of as (response) surfaces. Further, discrete surfaces can be represented as matrices. Finite-dimensional approximations for each linear operator in this case require two matrices. For instance, a linear operator M acting on a discrete response surface $S$ can be represented by $M(S) = A S B^{\mathsf{T}}$, where $A$ is an $m \times m$ matrix, $S$ is an $m \times n$ matrix, and $B$ is an $n \times n$ matrix.
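As a minimal numerical sketch of this discrete representation — assuming the two-matrix form reconstructed above, with hypothetical operator matrices and grid sizes — applying a component operator to a response surface is just a pair of matrix multiplications:

```python
import numpy as np

# Hypothetical grid sizes for the two exogenous variables.
m, n = 20, 20

# Hypothetical operator matrices A (m x m) and B (n x n); in the paper these
# would come from the stochastic linearization recipe in the Appendix.
A = np.tril(np.ones((m, m))) / m   # illustrative smoothing/accumulation operator
B = np.tril(np.ones((n, n))) / n

def apply_operator(A, B, S):
    """Apply the discrete linear operator (A, B) to an m x n response surface S."""
    return A @ S @ B.T

# An input surface on the grid (illustrative shape, scaled to [0, 1]).
u = np.linspace(0.0, 5.0, m)
v = np.linspace(0.0, 5.0, n)
S_in = np.outer(u, v) / 25.0

S_out = apply_operator(A, B, S_in)  # the component's output surface
```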
The decision making process for a task involves a choice of efficient selections of methods (linear operators) from among feasible selections of methods by which the task may be performed (see Figure 15.2). The process starts at the lowest level, where there is a collection of unrelated simple tasks that cannot be further decomposed and which do not interact. Each task is assigned to a decision maker who selects a set of efficient methods for the task. These efficient methods are then passed to the next higher level.
Each intermediate-level task, as well as the master-level task, is evaluated for every efficient selection of methods for every subtask, and a set of efficient selections of methods for that task is produced. In particular, a task’s output performance function is computed for every feasible, i.e., affordable, selection of methods for its subtasks. This process yields the set of
feasible selections of methods for the task. The set of efficient selections of
methods is then found among the feasible selections and passed to the higher
level at which the task becomes a subtask. When a set of efficient selections
of methods is produced for the master-level task, this level's decision maker
chooses a preferred selection of methods from among the efficient selections
which becomes the optimal selection for the system and concludes the decision
making process.
As stated above, each task at each level has a set of exogenous variables modeling the task’s uncertain environment. When passing efficient selections of methods from subtasks to a higher-level task, exogenous variables not common to all subtasks making up the task are eliminated. The set of efficient selections of methods for the master-level task is computed in the space of exogenous variables characteristic only for that task, and all exogenous variables are eliminated when the optimal selection has been found.

2.1. AN ILLUSTRATIVE EXAMPLE


In this paper, we focus on a bi-level example modeling a simplified indus-
trial production system. The model consists of two levels. The master level
represents the entire production process. The lower level consists of three components: a component M for acquiring and storing raw materials, a component F for fabricating the product, and a component D for distributing the product. (See Figure 15.3.) Each component corresponds to a specific task and
performances of the tasks interact to produce the enterprise performance.
As mentioned above, the system is part of a larger environment beyond the
control of the decision maker which exerts an influence on system performance.
We assume that the external environment is characterized by two variables:
potential demand for the product and total industry capacity for producing the product. The influence of each component on other components is characterized by a performance function depending on these two variables. Also, there is a performance function reflecting availability of inputs of
raw material and a performance function reflecting actual customer demand for
product from this particular enterprise. In general, the performance functions
will be vector valued but for this simplified example we assume the functions
of two variables are scalar valued, i.e., scalar fields. The units do not have to
be the same for each component.

Note that the two environmental variables reflect the importance assigned to
the materials component. We could have decided that the distribution compo-
nent was the most important and chosen different environmental variables.
The system to this point can be visualized as in Figure 15.4 where the arrows
indicate relations between components. For instance, the performance of
the fabrication component depends on the performance of the materials
component and influences the performance of the distribution component.
The performance functions require engineering insight and may be based
on detailed investigation of the individual components. For this simplified
presentation we assume performance functions modeling, respectively: the availability of raw material in the market; material purchased, stored, and available for fabrication; product available for distribution; customer demand; and the enterprise performance, which measures unmet consumer demand. The raw-material availability function might look like Figure 15.6. (Note that in this figure and subsequent figures the surfaces are computed at grid points; the independent axes are labeled by the grid indices.) For increasing demand and capacity, the supply of material increases and prices fall. If there is no potential demand or no production capacity, then the supply of raw materials must be low. The values are scaled between 0 and 1, but there is nothing significant in this.
Associated with each component (or subtask) is a method modeled by a linear operator which transforms the sum of the performance functions affecting that component into that component’s performance function. For instance, the output of the materials component is obtained by applying its operator to the sum of its input performance functions, where we use the component designation for the associated linear operator.
We complete the diagram of the master-level network in Figure 15.5. The added feedback operators are dictated by nature, i.e., are not subject to the decision maker, and all arrows emanating from a given component carry that component’s performance function, i.e., are equal. Thus the diagram models not flows of material (there is no conservation law) but the interrelationships among components.
The model given above describes an existing system. In the upgrade prob-
lem we assume that each subtask can be accomplished using one of several
methods. The overall goal is to optimize enterprise performance by choosing
one method (linear operator) for each of the subtasks. Note that the choices
interact, complicating the decision process.

2.2. COMPUTATIONAL MODELS FOR THE EXAMPLE
To illustrate the methodology we have provided linear transformations to model M, F, and D and assume that the feedback operator is the negative of the identity operator. The input function is again chosen as in Figure 15.7. To aid in visualizing the effect of applying the component operators, we apply them to the input function; the results are displayed in Figure 15.7. Of interest is how the operators (methods) transform an input function. The response of M increases rapidly where the exogenous variables are small but increasing; the response is flatter where they are large. Similar observations could be made for F and D.
Combining the component models in the feedback network, Figure 15.5, we
obtain the enterprise performance, Figure 15.8.
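To make the composition concrete, the sketch below chains three discrete component operators in the feedback layout of Figure 15.5 and resolves the negative-identity feedback by fixed-point iteration. The operator shapes, the input surfaces, and the exact wiring are illustrative assumptions, not the authors’ computational model.

```python
import numpy as np

m = 20
A = np.tril(np.ones((m, m))) / m          # one hypothetical operator matrix,
                                          # reused for M, F, and D for brevity

def component(S):
    """A placeholder component operator of the two-matrix form A S A^T."""
    return A @ S @ A.T

u = np.linspace(0.0, 5.0, m)
g_avail = np.outer(u, u) / 25.0           # assumed raw-material availability surface
g_demand = 1.0 - g_avail                  # assumed customer demand surface

# Fixed-point iteration on the loop: materials sees availability plus the
# negative-identity feedback of the distribution output.
d = np.zeros((m, m))
for _ in range(200):
    mat = component(g_avail - d)          # M, with feedback operator -I
    fab = component(mat)                  # F
    d_next = component(fab)               # D
    if np.max(np.abs(d_next - d)) < 1e-12:
        d = d_next
        break
    d = d_next

enterprise = g_demand - d                 # unmet consumer demand surface
```

Because the placeholder operators are strongly damping, the loop converges in a handful of iterations; with other operator choices a different solution scheme might be needed.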

3. MULTIPLE CRITERIA DECISION MAKING


Multiple criteria decision making (MCDM) is a field of system science deal-
ing with complex systems performing in the presence of multiple conflicting
criteria. In our multilevel model, the space of exogenous variables of each task
creates such an environment since the best task performance varies depending
upon regions or even points of this space. There is no single response surface
representing the best task performance but rather a set of alternatives (response
surfaces) competing with each other over that space. Analogously, there is
no single response surface representing the best overall system performance.
Within this framework we apply two classic stages of MCDM: the optimiza-
tion stage of screening for efficient solutions (nondominated outcomes) from
among feasible alternatives and then the decision stage of choosing a preferred
efficient solution (nondominated outcome). Several effective decision aids sup-
porting the latter are collected by Olson (1996), while theoretical foundations
and advances are thoroughly analyzed by Roy (1996).
The optimization stage is performed for every task at every level of the system
and involves the elimination of task response surfaces that are outperformed by
other surfaces over the entire space of exogenous variables for this task. The
optimization produces a set of candidate methods for a task that will be passed
up to the next higher level at which this task becomes a subtask.
The decision stage is performed only for the master-level task and yields
a final preferred response surface and the corresponding selection of methods
which determine a preferred selection of upgrades for the system.
In this section, we first formulate an optimization problem suitable for the
optimization of a task and then discuss how a preferred selection of upgrades
could be found. The presented approaches and methods are then illustrated on
the bi-level example.

3.1. GENERATING CANDIDATE METHODS


The quality of the upgrade decision is gauged by a system performance
measure that is not available in a closed form but rather in the form of a response
surface determined by the values achieved at some preselected grid points. The
decision maker is interested in choosing a selection of methods for each task
such that the system performance is at its best at every grid point at which this
performance is evaluated.
Given the space of exogenous variables of a task, every grid point of this
space can be viewed as a criterion (or scenario) with respect to which the
task performance should be optimized and the decision maker is to choose a
selection that would be as good as possible across all the criteria (scenarios).
In this context, the upgrade problem becomes a multiple-criteria optimization
problem or a multiple-grid-point optimization problem and should be treated
within a multiple-criteria decision making (MCDM) framework. However, it
is not a typical MCDM problem, in which a preferred decision has to be made with respect to several (typically fewer than 10) objective functions available in closed form. The upgrade problem will usually have a large number of grid points (100-200), so the resulting optimization problem will have the same large number of criteria, represented by the values on response surfaces.
Let a set of feasible selections of methods for a task at some level be given, and let $S_1, \dots, S_J$ be the surfaces associated with the feasible selections, where $J$ is the number of surfaces. The task’s decision maker has obtained the selections from lower-level decision makers and checked the feasibility of the selections against a cost constraint for the task. A selection of methods is feasible if it satisfies the cost constraint for the task.
Let $g_1, \dots, g_K$ be the grid points selected in the domain of exogenous variables (see page 308), where $K$ is the number of points. Let $S_j(g_k)$ be the value of the surface $S_j$ assumed at the grid point $g_k$.
Depending on the application, the optimal performance level at a grid point
might be the maximum or minimum achievable or it may be a particular fixed
target value between the maximum and minimum. The collection of target
values at the grid points determines a reference surface S* passing through the
target value at each grid point. The surface S* might not be a response surface for the system. A surface $S_i$ associated with one selection is said to dominate a surface $S_j$ associated with another selection if at each grid point the value of $S_i$ is at least as close to the reference value as that of $S_j$, and the value of $S_i$ is closer than that of $S_j$ at at least one grid point. Thus, nondominated surfaces are those for which every other surface associated with a feasible selection of methods yields an inferior value at one or more grid points. The method selections associated with each nondominated surface are called efficient.
In other words, a feasible surface $S_j$ is said to be nondominated if there is no other feasible surface $S_i$ such that
$$|S_i(g_k) - S^*(g_k)| \le |S_j(g_k) - S^*(g_k)| \quad \text{for all } k = 1, \dots, K,$$
with strict inequality for at least one $k$.
With respect to this model, generation of the nondominated surfaces related to the multiple-grid optimization problem leads to solving the following multiple objective program
$$\min_{j \in \{1, \dots, J\}} \big( |S_j(g_1) - S^*(g_1)|, \dots, |S_j(g_K) - S^*(g_K)| \big),$$
where the minimum is vector valued.
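A minimal sketch of this screening stage, assuming each surface is stored as its vector of values at the K grid points and the reference surface as a vector of target values (all names and the toy data are hypothetical):

```python
import numpy as np

def nondominated_indices(surfaces, reference):
    """Return indices of surfaces whose deviation from the reference is not
    dominated at every grid point; grid points play the role of criteria."""
    dev = np.abs(surfaces - reference)            # J x K deviations from targets
    keep = []
    for j in range(dev.shape[0]):
        dominated = any(
            i != j and np.all(dev[i] <= dev[j]) and np.any(dev[i] < dev[j])
            for i in range(dev.shape[0])
        )
        if not dominated:
            keep.append(j)
    return keep

# Toy data: four feasible surfaces evaluated at five grid points, target 0.
surfaces = np.array([[0.2, 0.3, 0.1, 0.4, 0.2],
                     [0.1, 0.4, 0.2, 0.3, 0.2],
                     [0.3, 0.5, 0.3, 0.5, 0.4],   # dominated by the first two
                     [0.2, 0.2, 0.2, 0.2, 0.2]])
print(nondominated_indices(surfaces, np.zeros(5)))  # -> [0, 1, 3]
```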

3.2. CHOOSING A PREFERRED SELECTION OF UPGRADES
Consider the set of all nondominated surfaces for the master-level task.
Given this set, the master decision maker is interested in choosing a preferred
nondominated surface and an efficient selection of methods that produces the
chosen surface. This process can be performed in many different ways due to
incomplete preference information provided by the multiple grid optimization
problem at the master level and the resulting need to introduce additional in-
formation to resolve the selection process. One can apply a variety of methods
developed in the area of multiple criteria decision analysis (MCDA) that deal
with choosing the best alternative from among the finite number of choices.
In this study we explore two methods that we adapt to our specific appli-
cation. One is the well-known method of choosing a preferred nondominated solution with respect to the reference (ideal) solution, while the other combines
the preference ranking of solutions with their costs to produce a new, higher-
level multicriteria problem.

Ranking nondominated surfaces. If the target is to maximize the performance level, then we can construct the reference surface as follows. At every grid point $g_k$ we find a surface that yields the best system performance. This is accomplished by solving the single-grid-point optimization problem at each grid point:
$$\max_{j \in \{1, \dots, J\}} S_j(g_k). \qquad (3.1)$$
Let $S_{j(k)}$ be an optimal solution of (3.1), with optimal value $S_{j(k)}(g_k)$. Having solved problem (3.1) for every grid point, we get a collection of optimal surfaces. Define now a utopia surface S* as the upper envelope of all the surfaces in this collection, made up of the portions of the nondominated surfaces visible from above. The surface S* might not be a response surface but represents ideal system performance over the entire domain of exogenous variables. Reference surfaces for performance measures to be minimized can be constructed similarly. For finite target values, the reference surface simply passes through the target value at each grid point.
We now choose a preferred nondominated surface as the one that is the closest
to the reference surface, where a norm of choice measures the distance between
the surfaces in the multidimensional space of grid points.
When constructing the norm, the decision maker can assign different weights
to each grid point in order to model the probability of the system having to
perform within a neighborhood of the values of exogenous variables related to
this point. One may partition the domain of exogenous variables into $B$ subsets and assign to each subset $b$ a probability $p_b$ chosen by the decision maker such that $\sum_{b=1}^{B} p_b = 1$. Hence, the weight assigned to each point in subset $b$ is equal to $p_b / n_b$, where $n_b$ is the number of points in subset $b$.
In addition to variations in the probability of encountering certain operating
conditions, the decision maker may make judgments regarding the importance
of achieving optimal performance under such conditions. For example, there
may be a rare combination of operating conditions under which optimal system
performance is critical. These judgments can be captured by eliciting from
the decision maker weights associated with the grid points (or with
regions, as above). Larger weights penalize poor performance at those points.
Define the weighted $\ell_p$ distance for each nondominated surface $S_j$ by computing
$$d_j = \Big( \sum_{k=1}^{K} w_k \, |S_j(g_k) - S^*(g_k)|^p \Big)^{1/p},$$
where the $w_k$ are the grid-point weights described above. We then solve the problem
$$\min_j d_j.$$
The preferred surface is the one that achieves the minimum. Note that the
definition of the norm can be extended to treat over-performance and under-
performance asymmetrically if appropriate.
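A sketch of this ranking computation under the weighted distance reconstructed above; the probability weights, importance weights, and the exponent p are hypothetical inputs chosen by the decision maker:

```python
import numpy as np

def weighted_distance(S, S_star, w, p=2):
    """Weighted l_p distance between a surface and the reference surface,
    both given as vectors of values at the K grid points."""
    return float((w * np.abs(S - S_star) ** p).sum() ** (1.0 / p))

def preferred_surface(surfaces, S_star, probs, importance, p=2):
    """Choose the nondominated surface closest to the reference surface.
    probs: probability weight per grid point (derived from the partition);
    importance: the decision maker's importance weight per grid point."""
    w = probs * importance
    dists = [weighted_distance(S, S_star, w, p) for S in surfaces]
    return int(np.argmin(dists)), dists
```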

Ranking versus cost. An alternative approach is to compute the weighted distance $d_j$ from the reference surface for each nondominated surface $S_j$. This value can be treated as a single criterion in a bicriteria optimization problem, where nondominated surfaces (based on grid points) are to be compared based on value and implementation cost. The higher-order decision problem is
$$\min_j (d_j, c_j),$$
where $c_j$ is the minimum cost of an efficient selection of methods corresponding to nondominated surface $S_j$ and the minimum is vector valued. The solutions
to this problem are nondominated in the sense that any decrease in distance
from the reference surface comes at the expense of an increase in cost and any
decrease in cost is accompanied by an increase in distance. Identifying such
solutions can be accomplished using weighted Tchebycheff norms or other
techniques for bicriteria optimization. The decision maker must then choose
a most preferred nondominated surface using other preference identification
techniques.
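The higher-order bicriteria screen admits the same kind of sketch; each alternative carries its distance d_j and cost c_j, and the filter keeps the (distance, cost) pairs no other pair improves upon in both coordinates (any values supplied to it here would be placeholders):

```python
def bicriteria_nondominated(dist, cost):
    """Indices of (distance, cost) pairs not dominated by any other pair."""
    n = len(dist)
    return [j for j in range(n)
            if not any(i != j and dist[i] <= dist[j] and cost[i] <= cost[j]
                       and (dist[i] < dist[j] or cost[i] < cost[j])
                       for i in range(n))]
```

With the digit-sum costs used below in Subsection 3.3 (for instance, selection 312 costs 3 + 1 + 2 = 6) and the norms of Table 15.1, a filter of this kind is what reduces the seven efficient selections to the three bicriteria-nondominated ones reported there.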

3.3. APPLICATION TO THE EXAMPLE


We continue with our outline of the example and assume that each of the
three decision makers at level 1 has already chosen efficient methods for the
task from among feasible (affordable) methods for that task (i.e., those methods
that satisfy a cost constraint for the task).
Assume for illustration that there are three efficient methods for task M of
acquiring and storing raw materials to be used by the fabrication component.
Materials are to be purchased on an open market with supplies and prices
influenced by the environmental variables, size of potential demand for the
product and total industry capacity for producing the product. The methods are
1 Purchase for “just in time” delivery, minimizing storage before fabrication.
2 Purchase and storage of critical raw materials subject to scarcity, minimizing interruptions in the fabrication process.
3 Development of flexible handling and storage facilities capable of responding to changing raw material requirements.
We do not discuss modeling these methods but assume the methods are
equally effective under “normal” operating conditions, with costs for implementation increasing from method 1 to method 3. Recall that the models are
to serve the decision maker. A detailed model of any of the suggested meth-
ods might be very complex with many variables and parameters. Further, any
reasonable physical model would almost certainly be nonlinear. However, the
decision maker is only concerned with how a method transforms an input per-
formance function into an output performance function. Even these simplified
models must be restricted to linear transformations so the decision maker can
account for interactions among choices.
Similar comments can be made about the other two components, F and D.
So for each component we assume three efficient choices of methods in as-
cending order of cost. The enterprise performance function for the lowest cost
choices for M, F, and D is given in Figure 15.9. Among the twenty-seven
choices for methods, assumed to be feasible for the master-level task, seven
selections produce nondominated enterprise performance functions. The efficient selections are {111, 311, 312, 321, 322, 331, 332}, where the selection $ijk$ means we have chosen the $i$th method for M, the $j$th method for F, and the $k$th method for D.
Recall that every nondominated surface outperforms each of the other surfaces at one or more grid points. We might attempt comparing the nondominated surfaces on
subregions of the set of exogenous variables in order to pick out preferred
solutions. In Figures 15.10 and 15.11, we produce graphs of the diagonals of
the nondominated enterprise functions.
To illustrate the selection of a preferred surface at the master level, we cal-
culate the weighted norm of the difference between each nondominated surface
and the reference surface which passes through performance level 0 at each grid
point. Based on the shape of the input performance function in the example,
we partition the domain of exogenous variables into three diagonal bands and
assign probability 1/2 to the central band and probability 1/4 to the bands on
each side. The decision maker assigns importance 1 to each grid point. The
norm is the weighted norm of Subsection 3.2. The cost of the methods associated with a surface is taken to be the sum of the digits in the triple identifying the surface. Thus, we have the data in Table 15.1. The norms and costs are plotted in Figure 15.12.

It can be seen that the surface with the most desirable norm is 312. When
norm and cost are treated as two criteria to be minimized simultaneously, the
nondominated surfaces with respect to these criteria are 111, 311, and 312. The
decision maker must select from among these the final surface and associated
methods.

4. STOCHASTIC ANALYSIS
In practice, systems must be considered under uncertainty. The first source of uncertainty is the environment in which the system operates, modeled in this paper by the exogenous variables. Another source of uncertainty comes from system inputs. In the example we can consider either or both of the system inputs, i.e., raw-material availability and customer demand, as random. Figure 15.13 illustrates the case of
random customer demand and Figure 15.14 shows the corresponding enterprise
performance. We always assume that the random input functions have been
modeled, leading to a stochastic decision problem which we can discuss in
terms of risk.
The nature of the process for generating candidate methods in the previous
section precludes the use of models with random inputs. Therefore we envision
resolution of the stochastic decision problem in two steps.

1 Optimize the deterministic system as in the previous section using expected inputs.

2 Analyze the risk for the preferred choices using random inputs.

If the risk is unacceptable, then iterate after eliminating the component choices which contribute the most to enterprise risk. This pruning process will be aided by considering the statistics obtained from simulations of the complete model, as sketched below.
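Step 2 can be sketched as a Monte Carlo assessment: push sampled random inputs through the system model for each candidate selection and compare the variability of the resulting enterprise surface. The `system` callable and the white-noise perturbation below are placeholders; in the paper the random part of the input is a scaled Wiener field (Section 4.2).

```python
import numpy as np

def enterprise_risk(system, mean_input, noise_scale=0.1, n_samples=200, seed=None):
    """Estimate the pointwise standard deviation of the enterprise performance
    surface when the input is its mean plus random perturbations."""
    rng = np.random.default_rng(seed)
    outputs = []
    for _ in range(n_samples):
        noisy = mean_input + noise_scale * rng.standard_normal(mean_input.shape)
        outputs.append(system(noisy))
    return np.stack(outputs).std(axis=0)   # risk surface: large values flag
                                           # regions of high enterprise risk
```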

4.1. RANDOM SYSTEMS AND RISK


Risk is associated with random variability in the enterprise performance
function. If we admit random elements into our system model then random
variability in performance cannot be eliminated. However, the variability can
be “shaped” by our choices of methods for system tasks. Methods are graded
in terms of risk by their ability to damp out variation in the input function. High
risk methods damp out the least and the low risk methods the most.
In general, the goal is risk reduction, which in the model is equivalent to reducing random variation in the enterprise performance function. The master-level
decision maker must determine if the variability in the enterprise performance
as given in Figure 15.14 is acceptable when customer demand, for example,
has the variability pictured in Figure 15.13.
A system upgrade decision problem forces us to balance cost and risk. Simplistically choosing the cheapest option for every method might result in a design with unacceptable risk. Alternatively, always choosing the least risky method for each task might result in a design which is too costly to
implement. The upgrade problem leads us to search for those components
where the cheaper (perhaps, more risky) choices have the least effect on risk
for the overall system.

4.2. APPLICATION TO THE EXAMPLE


The stochastic version of the performance function modeling availability of raw materials is assumed to be the deterministic surface perturbed by a term involving Z, a two-dimensional random surface defined over the domain of exogenous variables. We make a similar assumption for modeling random customer demand.
Consider the three efficient methods for task (or component) D from Subsection 3.3, pictured in Figure 15.15 in terms of their transformation of the mean input. In Figure 15.16 we see the action of these transformations on the random input. (The same sample surface for Z was used in all of these graphs and all of the succeeding ones.) The random variation of the component response is reduced more as we proceed from the cheapest to the most expensive alternatives. Thus we conclude that the methods can be rated as riskiest to least risky in the same order.
Under “normal” conditions, i.e., no noise, the three methods as part of a
complete system perform about the same. See Figure 15.17 for the enterprise
performance functions for the three options. In Figure 15.18 we see the three
enterprise functions for the stochastic cases. Clearly the least risky option yields
the best performance for the overall system.
In Figure 15.19 we see the system performance for a deterministic raw-material input and stochastic customer demand. The three options for D again perform about the same: the noise enters the system too late to be influenced by D. If the riskier option costs less, then under these circumstances that option merits consideration.
This discussion of risk concludes our treatment of the stochastic optimization problem. However, the introduction of random input surfaces enables us
to explain the modeling methodology used for the component operators in the
decision process. This material is presented in Appendix 15.A.

5. CONCLUSIONS
This paper presents a novel system-science approach to complex systems
decision making. A system to be upgraded is modeled as a multilevel network
whose nodes represent tasks to be accomplished and whose arcs describe the
interactions of component performance measures. Since a task can be carried
out by a collection of methods, a decision maker is interested in choosing a
best method for each task so that the system performance is optimized. Thus
the well-known concept of a network—typically used to model physics-based subsystems, components of the system, and flows between them—now becomes
a carrier of higher-level performance analysis conducted on tasks the system
is supposed to accomplish and on available methods. This performance-based
rather than physics-based framework is independent of engineering discipline
and results in other desirable properties highlighted in the paper.
At each decision level, the system performance depends upon exogenous
variables modeling the uncertainty related to the environment in which the
system may operate. This source of uncertainty is accounted for in the de-
cision process through the use of multicriteria optimization. Uncertainty in

the selection chosen for implementation associated with random inputs (i.e.,
performance functions) is assessed using a stochastic analysis technique. The
methodology has been examined for the case of two exogenous variables and
is under development for the more general case. Also required for application
are component models, understanding of relations between components, and
mean input surfaces. Finally, for stochastic inputs the random part of the input
surface is limited to the standard Wiener surface.
The value of our approach consists of

1 The comprehensive approach we have taken to multilevel decisions.

2 The simplified modeling methodology for decision making.

3 The framework allowing for decisions with uncertainty.

Many questions have yet to be resolved: How to extend the methodology to more than two exogenous variables? How to quantify risk in the context of the
upgrade problem? What is the proper relationship between physics-based and performance-based models in decision making? In addition, it will be necessary
to develop processes for carrying out the analysis that can be implemented by
decision makers working with real systems.
APPENDIX: STOCHASTIC LINEARIZATION


The method of stochastic linearization is reviewed in the next subsection in the context of
stochastic processes. In the last subsection we consider stochastic linearization in the context of
random surfaces with the aid of an assumption on the surface covariance kernels. The factoriza-
tion of the surface kernels following from this assumption enables us to present a representation
of linear operators generating the random surfaces.

1. ORIGIN OF STOCHASTIC LINEARIZATION


Normal processes have the property that all of the information about the process obtainable
from observations is already contained in the process mean (a function of one variable) and
the process covariance (a function of two variables). Thus normal processes are the easiest to
analyze. Stochastic processes generated by linear systems are normal.
Stochastic linearization is a method of analysis which attempts to approximate the underlying
generating system (assuming a zero mean response) with a linear system which in turn generates a
normal process having the mean and covariance of the original process. If the underlying system
has a known dynamical model (the case considered in control applications), then stochastic
linearization has a certain flavor. If the only information available for the underlying system is
the mean and covariance of the observation process (the case considered in signal processing
applications), then stochastic linearization has a different flavor. Our work derives from the latter
case (Cover et al., 1996).
Suppose a given process has mean function $m$ and covariance function $R$, observed at the points $t_1 < \dots < t_n$. Here, $\mu$ is the $n$-vector with components $m(t_i)$ and $\Sigma$ is the $n \times n$ matrix defined by $\Sigma_{ij} = R(t_i, t_j)$. (This shorthand is used throughout the remainder of this section.) If $z$ is a $(0, 1)$-normal vector with independent components, then $x = \mu + C^{\mathsf{T}} z$, where $C$ is the upper Cholesky factor of $\Sigma$, has mean $\mu$ and covariance $\Sigma$; i.e., it provides a stochastic linearization of the underlying system generating the given process. The linear system approximates the underlying system in that the two have the same second-order statistics.
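A minimal numerical sketch of this construction, assuming the mean vector and covariance matrix have already been assembled on the observation grid (names are hypothetical):

```python
import numpy as np

def sample_linearized_process(mu, Sigma, seed=None):
    """Sample x = mu + C^T z with Sigma = C^T C, C the upper Cholesky factor,
    so that x has mean mu and covariance Sigma."""
    rng = np.random.default_rng(seed)
    C = np.linalg.cholesky(Sigma).T        # upper factor: Sigma = C^T C
    z = rng.standard_normal(len(mu))       # (0,1)-normal, independent components
    return mu + C.T @ z
```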

2. STOCHASTIC LINEARIZATION FOR RANDOM SURFACES
For scalar fields defined on two-dimensional rectangles, sufficient conditions on the system
covariance kernel are given for the development of a system linearization based on a factorization
of the system covariance kernel. The mathematical question of limitations imposed by the
covariance condition can be discussed in terms of properties of the resulting linearizations. A
set of reasonable conditions on linear systems can be shown to be equivalent to the covariance
condition. Thus the methods apply to a rich class of random fields.

Basic hypothesis. The central idea in this subsection is to produce a condition on the
covariance kernel of a random field (function of two variables) which for fields F generated
by linear systems, i.e., of the form F = AW, is sufficient to produce a representation of A.
(Here W is the standard Wiener field, see below.) If F is not generated by a linear system then
the method produces a linearization of the unknown nonlinear operator Â, namely A. In this
case the utility of the linearization depends upon the particular application and how close the condition (which is perhaps uncheckable) is to holding.
The condition is the following: for each pair of points $(u_1, v_1)$ and $(u_2, v_2)$ in [0, U] × [0, V], the covariance of the field values at the two points factors into a product of one-variable kernels,
$$\mathrm{cov}\big(F(u_1, v_1), F(u_2, v_2)\big) = r(u_1, u_2)\, s(v_1, v_2),$$
where cov(·) is the covariance operator. (See Vanmarcke (1988), p. 82.) Note that the condition
is on the observation field F rather than on the underlying unmodeled system.
In general, the independence expressed in the condition might not hold. However, such
independence is implicit in the common engineering practice of exploring a given physical
system by allowing only one quantity to vary at a time.

The Standard Wiener field. The standard Wiener field $W$ on [0, U] × [0, V] has the following defining properties:
sample fields are continuous;
$W(u, 0) = W(0, v) = 0$ for all $u$ and $v$;
if $(u_1, v_1)$ and $(u_2, v_2)$ are points of the rectangle then $E[W(u_1, v_1)\,W(u_2, v_2)] = \min(u_1, u_2)\,\min(v_1, v_2)$,
where E[·] is the expectation operator. Simulation of the standard Wiener field proceeds as follows. Cumulative sums of independent mean-zero normal increments, scaled by the square root of the grid cell area, have mean zero and covariance $\min(u_1, u_2)$ in the first index, i.e., are indistinguishable from a Wiener process in that index. Taking cumulative sums in the second index as well yields a field with mean zero and covariance $\min(u_1, u_2)\,\min(v_1, v_2)$, i.e., indistinguishable from $W$.
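A sketch of this simulation on a regular grid: independent normal increments scaled by the square root of the cell area, accumulated along both indices (grid sizes and domain are arbitrary):

```python
import numpy as np

def wiener_field(nu, nv, U=5.0, V=5.0, seed=None):
    """Simulate the standard Wiener field on [0, U] x [0, V] at an nu x nv grid;
    the result has mean 0 and covariance min(u1, u2) * min(v1, v2)."""
    rng = np.random.default_rng(seed)
    du, dv = U / nu, V / nv
    incr = rng.standard_normal((nu, nv)) * np.sqrt(du * dv)  # cell increments
    return incr.cumsum(axis=0).cumsum(axis=1)                # accumulate in u and v
```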

Linear models. We provide a recipe for representations of linear systems generating random fields satisfying the basic hypothesis. The recipe works for a large class of systems. Recall from Section 2 that finite-dimensional approximations for a linear operator T require two matrices. For instance, $T(W) = A W B^{\mathsf{T}}$, where $A$ is an $m \times m$ matrix, $W$ is an $m \times n$ matrix, and $B$ is an $n \times n$ matrix. The representation problem for a linear transformation T reduces to a search for appropriate matrices $A$ and $B$. If $A$ and $B$ are chosen so that the factored covariance kernels of the basic hypothesis are reproduced in the two grid directions, then TW has the mean and covariance of F, and so T is the stochastic linearization of the system generating F.
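Under the separable-covariance reading of the basic hypothesis adopted above, one hedged way to realize the recipe numerically is to factor the target kernels and the Wiener (min) kernel on the grid and take A and B as Cholesky-factor ratios, so that A W B^T has the target covariance; the kernel choices here are illustrative, not the authors’ specific matrices.

```python
import numpy as np

def operator_factors(R_u, R_v, u, v):
    """Matrices A, B such that F = A @ W @ B.T has separable covariance
    R_u[i, k] * R_v[j, l] when W is the discrete standard Wiener field.
    Grid points u, v must be strictly positive so the min kernels are
    nonsingular."""
    K_u = np.minimum.outer(u, u)            # Wiener covariance in each variable
    K_v = np.minimum.outer(v, v)
    L_u, L_v = np.linalg.cholesky(K_u), np.linalg.cholesky(K_v)
    C_u, C_v = np.linalg.cholesky(R_u), np.linalg.cholesky(R_v)
    A = C_u @ np.linalg.inv(L_u)            # then A K_u A^T = R_u
    B = C_v @ np.linalg.inv(L_v)            # and  B K_v B^T = R_v
    return A, B
```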

Construction of linear models in the example. The recipe yields, for the component operator M, a representation in terms of two matrices of the form above. If the modeler has observations of the input and output performance functions available, then the two matrices can be estimated from the data. This scenario might correspond to the case where an implementation of M exists and is available for experimentation.
If the modeler does not have observations to use in estimating the matrices, then they might be based on engineering experience or intuition. Except for scaling, their shapes will likely come from a class of simple shapes; the shapes chosen in the example are representative of the possibilities. Some small amount of data might be useful in establishing the appropriate scales for the operators.
Note that in the example the two domains were assumed equal ([0, U] = [0, V] = [0, 5]). There is no requirement that this be true.
Similar comments hold for the construction of the component operators F and D.

REFERENCES
Burman, M., S. B. Gershwin and C. Suyematsu. (1998). “Hewlett-Packard uses operations research to improve the design of a printer production line,” Interfaces 28, 24–36.
Carrillo, J.E. and Ch. Gaimon. (2000). “Improving manufacturing performance
through process change and knowledge creation,” Management Science 46,
265–288.
Cover, A., J. Reneke, S. Lenhart, and V. Protopopescu. (1996). “RKH space methods for low level monitoring and control of nonlinear systems,” Mathematical Models and Methods in Applied Sciences 6, 77–96.
Hart, D.T. and E. D. Cook. (1995). “Upgrade versus replacement: a practical
guide to decision-making,” IEEE Transactions on Industry Applications 31,
1136–1139.
Hazelrigg, G.A. (2000). “Theoretical foundations of systems engineering,” pre-
sented at INFORMS National Meeting, San Antonio.
Korman, R.S., D. Capitanio and A. Puccio. (1996). “Upgrading a bulk chemical
distribution system to meet changing demands,” MICRO 14, 37–41.
Luman, R.R. (1997). “Quantitative decision support for upgrading complex sys-
tems of systems,” D.Sc. thesis, The George Washington University, Wash-
ington, DC.
Luman, R.R. (2000). “Upgrading complex systems of systems: a CAIV method-
ology for warfare area requirements allocation,” Military Operations Re-
search 5, 53–75.
Makis, V., X. Jiang and K. Cheng. (2000). “Optimal preventive replacement
under minimal repair and random repair cost,” Mathematics of Operations
Research 25, 141–156.
Majety, S.R.V., M. Dawande, and J. Rajgopal. (1999). “Optimal reliability al-
location with discrete cost-reliability data for components,” Operations Re-
search 47, 899–906.
McIntyre, M.G. and J. Meitz. (1994). “Applying yield impact models as a first
pass in upgrade decisions,” Proceedings of the IEEE/SEMI 1994 Advanced
Semiconductor Manufacturing Conference and Workshop, Cambridge, MA,
November 1994, 147–149.
Olson, D.L. (1996). Decision Aids for Selection Problems, Springer, New York.
330 MODELING UNCERTAINTY

Papalambros, P.Y. (2000). “Extending the Optimization Paradigm in Engineer-


ing Design,” Proceedings of the 3rd International Symposium on Tools and
Methods of Competitive Engineering, Delft, The Netherlands.
Papalambros, P.Y. and N. F. Michelena. (2000). “Trends and challenges in sys-
tem design and optimization,” Proceedings of the International Workshop on
Multidisciplinary Design Optimization, Pretoria, South Africa.
Rajagopolan, S. (1998). “Capacity expansion and equipment replacement: a
unified approach,” Operations Research 46, 846–857.
Rajagopolan, S., M. S. Singh and T. E. Morton. (1998). “Capacity expansion
and replacement in growing markets with uncertain technological break-
throughs,” Management Science 44, 12–30.
Reneke, J., R. Fennell, and R. Minton. (1987). Structured Hereditary Systems,
Marcel Dekker, New York.
Reneke, J. (1997). “Stochastic linearizations based on random fields,” Proceed-
ings of the 2nd World Congress of Nonlinear Analysts, Athens, Greece,
Nonlinear Analysis, Theory, Methods & Applications 30, 265–274.
Reneke, J. (1998). “Stochastic linearization of nonlinear point dissipative sys-
tems,” www.math.clemson.edu/affordability/.
Reneke, J. (in preparation, 2001). “Reproducing kernel Hilbert space methods
for spatial signal analysis”.
Rogers, J.L. (1999). “Tools and techniques for decomposing and managing
complex design projects,” Journal of Aircraft 36, 266–274.
Rogers, J.L. and A. O. Salas. (1999). “Toward a more flexible Web-based frame-
work for multidisciplinary design,” Advances in Engineering Software 30,
439–444.
Roy, B. (1996). Multicriteria methodology for Decision Aiding, Kluwer, Dor-
drecht.
Samuelson, D.A. (1999). “Predictive dialing for outbound telephone call cen-
ters,” Interfaces 29, 66–81.
Su, C.L., C. N. Lu and M. C. Lin. (2000). “Wide area network performance study
of a distribution management system,” International Journal of Electrical
Power & Energy Systems 22, 9–14.
Vanmarcke, E. (1998). Random Fields: Analysis and Synthesis, The MIT Press,
Cambridge, MA.
van Voorthuysen, E.J. and R. A Platfoot. (2000). “A flexible data acquisition
system to support process identification and characterization,” Journal of
Engineering Manufacture 214, 569–579.
Wallace, E., P. C. Clements and K. C. Wallnau. (1996). “Discovering a system
modernization decision framework: a case study in migrating to distributed
object technology,” Proceedings of the 1996 IEEE Conference on Software
Maintenance, ICSM. Monterey, CA, November 1996, 185–195.
REFERENCES 331

Yan, P., M.-Ch. Zhou and R. Caudill. (2000). “Life cycle engineering approach
to FMS development,” Proceedings of IEEE International Conference on
Robotics and Automation ICRA 2000, San Francisco, CA, April 2000, 395–
400.
Chapter 16

ON SUCCESSIVE APPROXIMATION OF OPTIMAL CONTROL OF STOCHASTIC DYNAMIC SYSTEMS

Fei-Yue Wang
Department of Systems and Industrial Engineering
University of Arizona
Tucson, Arizona 85721

George N. Saridis
Department of Electrical, Computer and Systems Engineering
Rensselaer Polytechnic Institute
Troy, New York 12180

Abstract An approximation theory of optimal control for nonlinear stochastic dynamic sys-
tems has been established. Based on the generalized Hamilton-Jacobi-Bellman
equation of the cost function of nonlinear stochastic systems, general iterative
procedures for approximating the optimal control are developed by successively
improving the performance of a feedback control law until a satisfactory sub-
optimal solution is achieved. A successive design scheme using upper and lower
bounds of the exact cost function has been developed for the infinite-time stochas-
tic regulator problem. The determination of the upper and lower bounds requires
the solution of a partial differential inequality instead of equality. Therefore it
provides a degree of flexibility in the design method over the exact design method.
Stability of the infinite-time sub-optimal control problem was established under
not very restrictive conditions, and stable sequences of controllers can be gen-
erated. Several examples are used to illustrate the application of the proposed
approximation theory to stochastic control. It has been shown that in the case of
linear quadratic Gaussian problems, the approximation theory leads to the exact
solution of optimal control.

Keywords: Hamilton-Jacobi-Bellman equation, optimal control, nonlinear stochastic systems

1. INTRODUCTION
The problem of controlling a stochastic dynamic system, such that its be-
havior is optimal with respect to a performance cost, has received considerable
attention over the past two decades. From a theoretical as well as practical
point of view, it is desirable to obtain a feedback solution to the optimal control
problem. In situations of linear stochastic systems with additive white Gaus-
sian noise and quadratic performance indices (so-called LQG problems), the
separation theorem is directly applicable, and the optimal control theory is well
established (Aoki, 1967; Wonham, 1970; Kwakernaak and Sivan, 1972; Sage and White, 1977).
However, due to difficulties associated with the mathematics of stochastic
processes, only fragmentary results are available for the design of optimal con-
trol of nonlinear stochastic systems. On the other hand, there is need to design
optimal and suboptimal controls for practical implementation in engineering
applications (Panossian, 1988).
The objective of this paper is to develop an approximation theory that may
be used to find feasible, practical solutions to the optimal control of nonlinear
stochastic systems. To this end, the problem of stochastic control is addressed
from an inverse point of view:
Given an arbitrary selected admissible feedback control, it is desirable to compare
it to other feedback controls, with respect to a given performance cost, and to
successively improve its design to converge to the optimal.

Various direct approximations of the optimal control have been widely stud-
ied for nonlinear deterministic systems (Rekasius, 1964; Leake and Liu, 1967;
Saridis and Lee, 1979; Saridis and Balaram, 1986), and appeared to be more
promising than the linearization type approximation methods that have met
with limited success (Al’brekht, 1961; Lukes, 1969; Nishikawa et al., 1971).
For stochastic systems, a method of successive approximation to solve the
Hamilton-Jacobi-Bellman equation for a stochastic optimal control problem
using quasilinearization, was proposed in Ohsumi (1984), but systematic pro-
cedures for the construction of suboptimal controllers were not established.
This paper presents a theoretical procedure to develop suboptimal feedback
controllers for stochastic nonlinear systems (Wang and Saridis, 1992), as an extension of the Approximation Theory of Optimal Control developed by Saridis
and Lee (1979) for deterministic nonlinear systems. The results are organized
as follows. Section 2 gives the mathematical preliminaries of the stochastic
optimal control problem. Section 3 describes major theorems that can be used
for the construction of successively improved controllers. For the infinite-time
stochastic regulator problem, a design theory using upper and lower bounds of
the cost function and the corresponding stability considerations are discussed
in Section 4. Two proposed design procedures are outlined and illustrated with
several examples in Section 5. Section 6 concludes the paper with remarks regarding its history and notes to Dr. Sid Yakowitz.

2. PROBLEM STATEMENT
For the purpose of obtaining explicit expressions, and without loss of gen-
erality since the results can be immediately generalized, consider a nonlin-
ear stochastic control system described by the following stochastic differential
equation:

where is the state vector of the stochastic system,


is a control vector, is a specified compact set of admissible controls, and
is a separable Wiener process;
and are measurable system functions. Equation (1.1) was
studied first by Itô (1951) and later under less restrictive conditions, by Doob
(1953), Dynkin (1953), Skorokhod (1965) and Kushner (1971). It is assumed
that the control law satisfies the following conditions:
i) Linear Growth Condition,

ii) Uniform Lipschitz Condition,

where is the Euclidean norm operator, and is some constant.
For a given initial state (deterministic) and feedback control
the performance cost of system is defined as

with nonnegative functions and J is also called the Cost Function of system (1.1).
The infinitesimal generator of the stochastic process specified by (1.1) is
defined to be,

where has compact support and is continuous up to all its second order derivatives (Dynkin, 1953), and and are transpose and
trace operators, respectively. The differential operators are defined as

A pre-Hamiltonian function of the system with respect to the given performance cost (1.2) and a control law is defined as

The optimal control problem of the stochastic system can now be stated as follows:
Optimal Stochastic Control Problem: For a given initial condition
and the performance cost (1.2), find such that

If it is assumed that the optimal control law, exists and if the cor-
responding value function, is sufficiently smooth, then and V*
may be found by solving the well-known Hamilton-Jacobi-Bellman equation
(Bellman, 1956),
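which in standard notation (a sketch; the symbols may differ from those of the original display) reads

    \[
    0 = \min_{u \in \Omega_u} \Big[ V^*_t + (\nabla_x V^*)^{\top} f(x, u, t)
          + \tfrac{1}{2} \operatorname{tr}\!\big( g\, g^{\top} \nabla_x^2 V^* \big)
          + L(x, u, t) \Big], \qquad V^*(x, T) = \phi(x, T),
    \]

where f is the drift and g the diffusion coefficient of (1.1), and L and φ are the running and terminal costs of (1.2).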

Unfortunately, except in the case of linear quadratic Gaussian controls, where the problem has been solved by Wonham (1970), a closed-form feedback solution of the Hamilton-Jacobi-Bellman equation for the optimal stochastic control problem cannot be obtained in general when the system of Equation (1.1) is nonlinear.
Instead, one may consider the optimal control problem relaxed to that of
finding an admissible feedback control law, that has an acceptable,
but not necessarily optimal, cost function. This gives rise to a stochastic sub-
optimal control solution that could conceivably be solved with less difficulty
than the original optimal stochastic control problem. The exact conditions for
acceptability of a given cost function should be determined from practical con-
siderations for the specific problem. The solution of the stochastic suboptimal
control problem that converges successively to the optimal is discussed in more
detail in the next two sections.
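Since the approach rests on comparing admissible feedback laws through their costs, it may help to see how such a comparison can be carried out numerically. The following sketch (Python; the scalar dynamics, noise level, cost weights, horizon, and candidate gains are illustrative assumptions, not the system of this chapter) estimates the cost (1.2) of a given feedback law by Euler-Maruyama simulation and Monte Carlo averaging:

    import numpy as np

    def monte_carlo_cost(u, x0=1.0, T=1.0, dt=1e-3, sigma=0.5,
                         n_paths=2000, seed=0):
        """Estimate J = E[ int_0^T (x^2 + u^2) dt + x(T)^2 ] for the
        illustrative scalar SDE dx = (-x + u(x)) dt + sigma dw."""
        rng = np.random.default_rng(seed)
        x = np.full(n_paths, x0)
        cost = np.zeros(n_paths)
        for _ in range(int(T / dt)):
            ux = u(x)
            cost += (x**2 + ux**2) * dt                 # running cost
            dw = rng.normal(0.0, np.sqrt(dt), n_paths)  # Wiener increments
            x += (-x + ux) * dt + sigma * dw            # Euler-Maruyama step
        return cost.mean() + (x**2).mean()              # add terminal cost

    # Compare two admissible feedback laws by their estimated costs.
    print(monte_carlo_cost(lambda x: -0.5 * x))
    print(monte_carlo_cost(lambda x: -1.0 * x))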

3. SUB-OPTIMAL CONTROL OF NONLINEAR STOCHASTIC DYNAMIC SYSTEMS
This section contains the main results of the approximation theory for the
optimal solution of nonlinear stochastic control problems. Two theorems, one
for the evaluation of performance of control laws and the other for the con-
struction of lower and upper bounds of value functions, are established first.
Then theoretical procedures that can lead to the iterative design of suboptimal
controls are developed based on those theorems.
Theorem 1. (Performance evaluation of a control law) Let V : be an arbitrary function with continuous V, and and satisfying the condition

where is a suitable constant. Then the necessary and sufficient conditions for
to be the value function of an admissible fixed feedback control law
i.e.,

are

Proof: From (1.7), using Itô’s integration formula (Itô, 1951), it follows that:

Therefore,

The sufficient condition results from the above equation. For the necessary
condition, assume that Then from the above equation,
and for

Therefore,

has to be true for all hence

which proves the necessary condition.


Remark 1: For a feedback controller that makes system (1.1) unstable, the
associated does not satisfy the assumption of Theorem 1, in particular
condition (1.7).
Remark 2: The relation in Eqs. (1.9) and (1.10), called the Generalized
Hamilton-Jacobi-Bellman equation for the stochastic control system (1.1), is re-
duced to the Hamilton-Jacobi-Bellman partial differential equation (1.6), when
the control is replaced by the optimal control
Remark 3: The exact cost function for a given control law is found by
solving the Generalized Hamilton-Jacobi-Bellman partial differential equation.
This equation, though simpler and more general than the Hamilton-Jacobi-
Bellman equation, is difficult to solve in a closed form.
Remark 4: For a more general stochastic process defined by the stochastic
differential equation and the general form of perfor-
mance cost, one can show that Theorem 1 is still true.
Since it is generally difficult to find the exact cost functions satisfying Eqs.
(1.9) and (1.10) of Theorem 1, the following theorem introduces a method of
constructing the lower and upper bounds of the cost functions. This method can
be used for the design of simpler suboptimal controllers based only on the upper
bounds to value functions.
Theorem 2 (Lower and upper bounds of cost functions) For an admissible
fixed control law and a continuous function with for
all If the function satisfies Eq. (1.7) with continuous
V, and and

then is an upper (or a lower) bound of the cost function of system (1.1).
That is‚

Proof: By a procedure similar to the proof of Theorem 1, it can be shown that

Therefore‚ from Eqs. (1.11) and (1.12)‚ it follows that‚

This completes the proof.


Remark 1: In general‚ the function in this theorem does not need to
be calculated. However‚ stating the inequality as in (1.11) gives an additional
degree of flexibility that enables the determination of an upper (or lower) bound
to the cost function
Remark 2: The function in this theorem is the exact cost function for
a system with a performance cost augmented by

where is a terminal cost function such that


Having established the two theorems for the evaluation of the performance of a given feedback control law, it is now necessary to develop algorithms to improve the control law. The following Theorems 3-5 provide a theoretical procedure for designing suboptimal feedback controllers based on Theorem 1,

while Theorem 6 presents a method for constructing upper and lower bounds
to the optimal cost function‚ which can be used to evaluate the acceptability of
suboptimal controllers.
Theorem 3. Given the admissible controls and with
and being the corresponding cost functions satisfying Eqs. (1.7)
and (1.8) for and respectively‚ define the Hamiltonian function for
and 2‚

where

It is shown that

when

Proof: Let

then

Therefore

which‚ from assumption (1.18)‚ implies that

In addition‚ from Eq. (1.10) of Theorem 1‚

However, applying Itô’s integration formula to along the trajectory generated by the control it follows that

Hence‚

Remark: One should not try to find by subtracting from directly based on their individual Itô formulas evaluated along the
trajectories generated by their corresponding controls. This is due to the fact
that the two state trajectories‚ generated by and respectively‚
are different.
A combination of Theorems 1 and 3, where represents the cost func-
tion of the system (1.1) when driven by control yields an inequality that
serves as a basis of suboptimal control algorithms‚ to iteratively reduce the cost
of the performance of the system. This is outlined by the following theorem.
Theorem 4. Assume that there exists a control and a corresponding function satisfying Eqs. (1.7) and (1.8) of Theorem 1. If there exists a function satisfying the same condition of Theorem 1, whose associated control has been selected to satisfy

then

Proof: Since control and the corresponding value function must satisfy
Eqs. (1.9) and (1.10)‚ according to Theorem 1‚ it follows that for every

This can be rewritten as

that is‚

Similarly‚ one can find‚

Since‚

It follows from Eqs. (1.22) and (1.23) that

Hence‚ according to Theorem 2‚

which proves the theorem.


Remark: Clearly, condition (1.20) in Theorem 4 is much easier to test than condition (1.18) in Theorem 3, since (1.20) involves neither the time derivative nor the infinitesimal generator
Based on Theorems 3 and 4‚ the following theorem establishes a sequence of
feedback controls which successively improve the cost of performance of the
system‚ and converge to the optimal feedback control.

Theorem 5. Let a sequence of pairs satisfy Eqs. (1.7) and (1.8) of Theorem 1, and be obtained by minimizing the pre-Hamiltonian function corresponding to the previous cost function, that is,

then the corresponding cost functions satisfy the inequality,

Thus by selecting the pairs sequentially, in the above manner, the resulting sequence converges monotonically to the optimal cost function
V*‚ and the corresponding sequence converges to the optimal control
associated with V*.
Proof: Since the control of (1.24) and the corresponding cost function
satisfy (1.9) and (1.10) of Theorem 1‚ it follows from (1.22) of Theorem 4 that

where Therefore‚ application of (1.19) of Theorem 2 yields

From (1.10)‚

hence, Itô’s integration formula applied to along the trajectory generated by leads to the inequality,

that is‚

which proves (1.25).



To show the convergence of the sequence, note that is a non-negative and monotonically decreasing sequence that satisfies (1.7). Therefore, the following limits exist:

and

for all and where is the limit of the cost functions. The corresponding
limit of control sequence can be identified from (1.24) as‚

Clearly‚ and thus obtained‚ still satisfy Eqs. (1.9) and (1.10) of Theo-
rem 1. However‚ from the construction of control sequence minimizes
the pre-Hamiltonian function associated with the value function In other
words‚ and satisfy the Hamilton-Jacobi-Bellman equation for the optimal
control of stochastic system (1.1)

Hence‚

are the optimal control and optimal value function of the stochastic control
problem (1.5).
Remark 1: It follows from this theorem that the optimal feedback control
and the optimal cost function V* are related by

which is a relationship that results from the minimization of the Hamiltonian function associated with the stochastic system (1.1).
Remark 2: As indicated by the conditions‚ in applying the theorem‚ assump-
tions must be made a priori‚ regarding the admissibility of the successively

derived control laws and their corresponding value functions. However‚ for
a nonlinear stochastic control system as in (1.1)‚ the admissibility of the new
control laws is not always easy to show.
Finally‚ the following theorem presents a method for the construction of
an upper (or a lower) bound of the optimal cost function Since the
optimal cost function is extremely difficult to find‚ its upper (or lower) bounds
can provide a practical measure to evaluate the effectiveness of the sub-optimal
controllers.
Theorem 6. Assume that there exists a function satisfying condition
(1.7) of Theorem 1‚ for which the associated control

is an admissible one. Then, is an upper (or a lower) bound to the optimal cost function of system (1.1), if it satisfies the following conditions

where is continuous and for all


Proof: From (1.31)‚ and the Hamilton-Jacobi-Bellman equation‚ it is obvious
that the optimal control and the optimal cost function‚ are related

and similarly‚ for and

For subtracting (1.35) from (1.36) yields

where From assumption (1.34)‚



Therefore, application of Itô’s integration formula to along the trajectory generated by control yields,

So is an upper bound to the optimal cost function For subtracting (1.36) from (1.35) leads to,

where It can be shown, using condition (1.34) and Itô’s integration formula, that

In this case‚ is a lower bound to the optimal cost function


Theorems‚ which lead to the design of simpler sub-optimal controllers based
on the upper and lower bounds of cost functions‚ may also be constructed.
A more detailed discussion of such derivation for the infinite-time stochastic
regulator problem is given in the next section.

4. THE INFINITE-TIME STOCHASTIC REGULATOR PROBLEM
The infinite-time stochastic regulator problem is defined as a control problem
for nonlinear stochastic system (1.1)‚ with infinite duration All state
trajectories generated by admissible controls in must be bounded uniformly
in
For the infinite-time stochastic regulator problem and assuming that the sys-
tem is stable‚ the Performance Cost exists and is defined as

A discussion of the stability of system (1.1) with Eq. (1.37) as is given later in this section. Applying Itô’s integration formula before the limit, the cost function becomes

where satisfies and (1.8) of Theorem 1 for the possible state trajectories, which is true for all
All the theorems developed in the previous section remain valid for the infinite-time stochastic regulator problem, except that the terminal conditions at in those theorems are no longer required. Moreover, in this case theorems can be constructed which lead to the iterative design of simpler sub-optimal controls based only on upper and lower bounds; since these bounds can be obtained without solving the partial differential equation (1.9) of Theorem 1, those theorems have great potential for application. Two such theorems, corresponding to Theorems 3 and 4, are given in the sequel.
Theorem 7. Given admissible controls and with and being their corresponding cost functions defined by (1.37), if there exist function pairs and satisfying (1.11) of Theorem 2 for and respectively, then,

When

Proof: Following the same procedure used in the proof for Theorem 3‚ one
can show that

where Therefore‚ Itô’s integration formula yields‚

hence‚

which implies that

The next theorem is the counterpart of Theorem 4‚ and its proof can be carried
out by the same procedure used in Theorem 4.
Theorem 8. Assume that there exists a control and a function
pair satisfying (1.11) of Theorem 2. If there exists a

function pair satisfying the same condition of Theorem 2, of which the associated control has been selected to satisfy

then‚

where and are the cost functions of and respectively.


Note that neither Theorem 7 nor Theorem 8 is true for the stochastic system
(1.1) with a cost function defined by (1.2).
Stability of the infinite-time approximate control problem will be treated in a manner similar to the deterministic case; that is, it suffices to show that the Performance Index (1.37) of the system (1.1) is bounded for all the controls generated by the Approximation Theory.
Lemma 1. If Theorem 7 and/or Theorem 8 is satisfied, stability of the infinite horizon systems, driven by the subsequent controls generated by the approximation theory, is guaranteed if the first controller is selected to yield a bounded
The proof of the above lemma follows directly from the statements of Theorems 7 and/or 8.
In order to prove the stability of the system for the first controller‚ the fol-
lowing steps are appropriate for consideration:

1 System (1.1) must be Completely Controllable, for all admissible controls

2 Since all the states are assumed available for measurement‚ system (1.1)
is obviously Completely Observable.
3 The Performance Cost (1.37) is bounded because‚

From Itô’s integration formula‚

where

Select the first control to satisfy the previous theorems of the Approxi-
mation Theory‚ and the condition‚

Then‚ using the Mean Value Theorem‚

Then‚

The boundedness of the Performance Index of the infinite-time problem establishes the stability of the system for all the controllers derived from the Approximation Theory.

5. PROCEDURE FOR ITERATIVE DESIGN OF


SUB-OPTIMAL CONTROLLERS
The optimal feedback control and its associated satisfying
the Hamilton-Jacobi-Bellman equation‚ Eq. (1.6)‚ obviously satisfy all the the-
orems developed in Section 3. However, in most cases of nonlinear stochastic
control systems‚ the optimal solution is very difficult‚ if not impossible‚ to im-
plement either because the solution is unavailable or because some of the states
are not available for measurement. In both cases, the theory developed in Section 3
may serve to obtain controllers which can make the system stable‚ and then be
successively modified to approximate the optimal solution. Upper and lower
bounds of the value function of the nonlinear stochastic system may be used to
evaluate the effectiveness of the approximation.

5.1. EXACT DESIGN PROCEDURE


This approach‚ based on the assumption that the cost function for a
control can be found to satisfy Eqs. (1.9) and (1.10) of Theorem 1‚ may
be implemented according to Theorems 3 and 4‚ by the following procedure:

1 Select a feedback control law for system (1.1), set


2 Find a to satisfy Theorem 1 for

3 Obtain and a to satisfy Theorem 1, and Theorem 3 or 4 for and is an improved controller;
4 From Theorem 6‚ find a lower bound to the optimal cost function
and then use as a measure to evaluate as
an approximation to the optimal control If acceptable‚ stop.
5 If the approximation is not acceptable, repeat Step 2 by increasing the index by one and continue.

The improved controller in Step 3 can also be constructed by using Theorem 5, if the corresponding cost function can be obtained. When
a lower bound to the optimal value function is difficult to find‚ Step 4 can be
omitted and then the criterion for the acceptability of the approximation has to
be determined based on other considerations.
Example 1. (Linear Stochastic Systems) In order to better understand the
method‚ the design procedure of a sub-optimal controller will be first applied
to a linear stochastic system‚ the optimal solution of which is well known. The
linear stochastic system is described by the following differential equation:

The cost function of the system has the quadratic form

The infinitesimal generator of the linear stochastic process is

Assume first a linear non-optimal control‚

where is a feedback matrix. The corresponding cost function is assumed to be

where and can be found by solving Eqs. (1.9) and (1.10) of Theorem
1‚ i.e.‚

and

The feedback law is improved by using Theorem 5. From Eq. (1.24)‚

and the corresponding cost function is assumed to be

where and are determined by solving the equations,

As approaches S, the solution of the matrix Riccati equation, i.e.,

and correspondingly, the control approaches

which is the optimal control for linear stochastic systems with a quadratic performance criterion (Wonham, 1970).
This solution demonstrates the use of Theorem 5 to sequentially improve the
control parameters towards the optimal values in a Linear Quadratic Gaussian
system with well-known solution.
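In this setting the improvement step of Theorem 5 is essentially the classical policy-iteration (Kleinman/Newton) recursion for the Riccati equation: the cost matrix of the current gain solves a Lyapunov equation (the linear-quadratic instance of the Generalized Hamilton-Jacobi-Bellman equation), and the improved gain minimizes the pre-Hamiltonian. A minimal numerical sketch, with illustrative matrices A, B, Q, R and an assumed stabilizing initial gain:

    import numpy as np
    from scipy.linalg import solve_lyapunov, solve_continuous_are

    A = np.array([[0.0, 1.0], [-1.0, -0.5]])  # illustrative system matrices
    B = np.array([[0.0], [1.0]])
    Q = np.eye(2)                             # state cost weight
    R = np.array([[1.0]])                     # control cost weight

    K = np.array([[1.0, 1.0]])                # stabilizing initial gain, u = -K x
    for _ in range(20):
        Acl = A - B @ K
        # Cost matrix S_i of the current law solves the Lyapunov equation
        # Acl' S + S Acl + Q + K' R K = 0.
        S = solve_lyapunov(Acl.T, -(Q + K.T @ R @ K))
        # The improved gain minimizes the pre-Hamiltonian: K = R^{-1} B' S.
        K = np.linalg.solve(R, B.T @ S)

    S_star = solve_continuous_are(A, B, Q, R)  # Riccati solution, for comparison
    print(np.allclose(S, S_star, atol=1e-8))   # the iteration converges to S*

By certainty equivalence, the additive noise raises the achieved cost but does not change the optimal gain, so this deterministic recursion recovers the LQG feedback.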
Example 2. The second example illustrates the design method by the fol-
lowing nonlinear first-order stochastic system:

with a cost function selected to represent a minimum error, minimum input energy specification criterion, for a regulator control problem

The infinitesimal generator of the stochastic process becomes

First, assume a linear control law,

The corresponding cost function is assumed to be

Equation (1.9) of Theorem 1 yields

which is true for

Next‚ select a higher order control law‚

Such a controller was selected to be of the same order as the partial derivative
of the cost function as Theorem 5 suggests. The corresponding cost
function is assumed to be‚

In this case‚ in order to satisfy Eq. (1.9)‚ one must solve

which is true for

To satisfy Theorem 4‚ controllers and must satisfy‚

which yields

Their corresponding cost functions can be compared‚

If Theorem 5 is to be used for the above and must be selected according to

and a satisfying Eq. (1.9) of Theorem 1 exists if

Comparing the cost functions‚ one finds that‚

In both cases, since the performance of the system has been improved by replacing the linear controller by the nonlinear controller. All the above controllers make the origin an equilibrium point for the system.

5.2. APPROXIMATE DESIGN PROCEDURES FOR THE REGULATOR PROBLEM
In many cases, the selection of a to satisfy (1.9) and (1.10) of Theorem 1 is a very difficult task. In such cases, approximate design procedures which use the upper and lower bounds of the cost function obtained through Theorem 2 can be constructed. For the infinite-time stochastic regulator problem, the following design procedure is proposed based on Theorems 7 and 8 in Section 4:
1 Select a feedback control law for system (1.1), set

2 For an find a for to satisfy Theorem 2 for a lower bound.
3 Obtain a and for an and find a for to
satisfy Theorem 2 as an upper bound. found
should also satisfy conditions (1.40) or (1.41) for the improvement of
performance.
4 Using a lower bound to the optimal cost function‚ which is determined
according to Theorem 6‚ the approximation of the optimal control can be
measured. If acceptable‚ stop.
5 If the approximation is not acceptable, repeat Step 2 by increasing the index by one and continue.

Example 3. The design method is illustrated with the following nonlinear first-order stochastic regulator problem:

with a cost function

The infinitesimal generator of the stochastic process is

For a linear control law‚

The lower bound of its cost function is assumed to be

Application of Eq. (1.11) of Theorem 2 leads to

which is satisfied by

For a higher order control law‚



the upper bound to its value function is assumed to be

In this case‚ application of (1.11) yields‚

which is true for

where are arbitrary.


Improvement of performance occurs if Eq. (1.40) or Eq. (1.41) is
satisfied‚ which leads to

which, with the rest of the inequalities, produces acceptable values for and For example, one can show that

are a set of acceptable values. The lower and upper bounds of the value
functions in this case are found to be‚

Improvement of performance occurs by replacing a linear controller by a nonlinear controller of the same order as the partial derivative of Note
that in this case the actual cost function of control cannot be found by simply
using the method applied in Example 2.
This approach has a great potential for application since one does not have
to solve the partial differential equation of Theorem 1 every time.
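The flexibility afforded by the inequality of Theorem 2 can also be checked symbolically. The sketch below (Python with sympy; the scalar dynamics with multiplicative noise, the gain, and the candidate bound are illustrative assumptions, not the actual coefficients of Example 3) verifies that a quadratic candidate satisfies the upper-bound inequality:

    import sympy as sp

    x = sp.symbols('x', real=True)
    k = 1                            # illustrative feedback gain, u = -k x
    sigma2 = sp.Rational(1, 4)       # illustrative noise intensity sigma^2
    a = sp.Rational(2, 3)            # coefficient of the candidate bound

    V = a * x**2                     # candidate upper-bound function
    u = -k * x
    f = -x + u                       # drift of dx = f dt + sigma x dw
    L_run = x**2 + u**2              # running cost

    # Generator of the scalar diffusion with multiplicative noise:
    #   LV = f V' + (sigma^2 x^2 / 2) V''.
    LV = f * sp.diff(V, x) + sigma2 * x**2 / 2 * sp.diff(V, x, 2)

    # For V = a x^2 this equals x^2 (a (sigma^2 - 2 - 2k) + 1 + k^2);
    # here it simplifies to -x^2/2 <= 0, so V bounds the cost from above.
    print(sp.simplify(LV + L_run))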

6. CLOSING REMARKS BY FEI-YUE WANG


In the Fall of 1987 at Rensselaer, I took the course ECSE 35649, State Estimation and Stochastic Control, taught by Professor George N. Saridis, then my Ph.D. advisor. One day in class, we had an argument regarding the Entropy
reformulation of optimal control theory‚ and I went to the blackboard to show
my derivation. George was upset and dismissed the class for the day. That was
how this paper started. For the rest of the course‚ I concentrated on formulating
and proving the theorems and developing iterative design procedures‚ and by
the end of the semester, I had most of the results shown here. However, I did not
find any useful results regarding the Entropy formulation of the optimal stochastic control problem, the reason for the paper's beginning and a favorite topic of George's.
Until the end of my Rensselaer years‚ I never found the time to complete the
paper‚ partly due to the fact that my effort was focused on Intelligent Control
Systems then.
I came to Tucson in the middle of the hot summer of 1990. Sid was one of the first few colleagues I talked to at the SIE Department, and he became a good friend ever since. Over the years until his untimely death in 1999, I received generous support and advice on many issues from Sid, and admired his strong passion for scholarly work, especially analytical research work. He often commented
that a paper without an equation can hardly be a good paper‚ which kind of
affected my decision to complete this paper‚ since there are certainly a lot of
equations here.
I would like to acknowledge that most of the results in this paper had been reported at the 1992 IEEE Conference on Decision and Control in
Tucson‚ Arizona (see Wang and Saridis‚ 1992). When I completed the final
version of this paper for journal submission‚ George had asked me to add the
Entropy reformulation of the Generalized Hamilton-Jacobi-Bellman equation
to the paper along the lines of Saridis (1988) and Jaynes (1957). I agreed that this was a good idea but could not find a meaningful way to do so at that time; actually, even today. Then in 1994, George developed a way to reformulate the Approximation Theory of Stochastic Optimal Control, added it to the paper as an additional section, and submitted it to Control – Theory and Advanced Technology in Japan (Saridis and Wang, 1994), a journal which was discontinued in 1994.

I believe the theoretical procedures outlined in this paper could be an extremely useful foundation for the development of a numerical package for designing practical controllers for nonlinear stochastic systems in complex applications found in economic, social, political, and management fields. This
will be a direction for future research.

ACKNOWLEDGMENTS
I would like to dedicate this paper to the memory of Professor Sidney J.
Yakowitz.

REFERENCES
Al’brekht, E. G. (1961). On the optimal stabilization of nonlinear systems, J. Appl. Math. Mech. (PMM), 25, 5, 1254-1266.
Aoki‚ M. (1967). Optimization of Stochastic Systems‚ Academic Press‚ N.Y.
Bellman‚ R. (1956). Dynamic Programming‚ Princeton University Press‚ Prince-
ton‚ N.J.
Doob, J. L. (1953). Stochastic Processes. Wiley, N.Y.
Dynkin, E. B. (1953). Markov Processes. Academic Press, N.Y.
Itô, K. (1951). On Stochastic Differential Equations, Mem. Amer. Math. Soc., 4.
Jaynes, E. T. (1957). Information theory and statistical mechanics. Physical Review, 106, 620-630.
Kushner, H. (1971). Introduction to Stochastic Control, Holt, Rinehart and Winston, NY.
Kwakernaak‚ H. and R. Sivan. (1972). Linear Optimal Control Systems‚ Wiley‚
N.Y.
Leake‚ R.J. and R.-W. Liu. (1967). Construction of suboptimal control se-
quences‚ SIAM J. Control‚ 5‚ 1‚ 54-63.
Lukes, D. L. (1969). Optimal regulation of nonlinear dynamical systems. SIAM J. Control, 7, 1, 75-100.
Nishikawa, Y., N. Sannomiya and H. Itakura. (1971). A method for suboptimal design of nonlinear feedback systems, Automatica, 7, 6, 703-712.
Ohsumi, A. (1984). Stochastic control with searching a randomly moving target, Proc. of American Control Conference, San Diego, CA, 500-504.
Panossian, H. V. (1988). Algorithms and computational techniques in stochastic optimal control, C. T. Leondes (ed.), Control and Dynamic Systems, 28, 1, 1-55.
Rekasius‚ Z.V. (1964). Suboptimal design of intentionally nonlinear controllers.
IEEE Trans. Automatic Control‚ AC-9‚ 4‚ 380-386.
Sage, A. P. and C. C. White. (1977). Optimum Systems Control, Prentice-Hall, Englewood Cliffs, N.J.

Saridis, G. N. and Fei-Yue Wang. (1994). Suboptimal control of nonlinear stochastic systems, Control – Theory and Advanced Technology, Vol. 10, No. 4, Part 1, pp. 847-871.
Saridis, G. N. (1988). Entropy formulation for optimal and adaptive control. IEEE Trans. Automatic Control, AC-33, 8, 713-721.
Saridis, G. N. and J. Balaram. (1986). Suboptimal control for nonlinear systems. Control-Theory and Advanced Technology (C-TAT), 2, 3, 547-562.
Saridis, G. N. and C. S. G. Lee. (1979). An approximation theory of optimal control for trainable manipulators. IEEE Trans. Systems, Man, and Cybernetics, SMC-9, 3, 152-159.
Skorokhod‚ A. V. (1965). Studies in the Theory of Random Processes. Addison-
Wesley‚ Reading‚ MA.
Wang‚ Fei-Yue and G. N. Saridis. (1992). Suboptimal control for nonlinear
stochastic systems‚ Proceedings of 31st Conference on Decision and control‚
Tucson‚ AZ‚ Dec.
Wonham, W. M. (1970). Random differential equations in control theory. A. T. Bharucha-Reid (ed.), Probabilistic Methods in Applied Mathematics, Academic Press, NY.
Chapter 17

STABILITY OF RANDOM ITERATIVE MAPPINGS

László Gerencsér
Computer and Automation Institute
Hungarian Academy of Sciences
H-1111 Budapest, Kende u. 13-17
Hungary*

Abstract We consider a sequence of not necessarily i.i.d. random mappings that arise in
discrete-time fixed-gain recursive estimation processes. This is related to the
sequence generated by the discrete-time deterministic recursion defined in terms
of the averaged field. The tracking error is majorized by an L-mixing process
the moments of which can be estimated. Thus we get a discrete-time stochastic
averaging principle. The paper is a simplification and extension of Gerencsér
(1996).

1. INTRODUCTION
Most of the commonly used identification methods for linear stochastic sys-
tems can be considered as special cases of a general estimation scheme, which
was proposed in Djereveckii and Fradkov (1974); Ljung (1977), and further
elaborated in Djereveckii and Fradkov (1981); Ljung and Söderström (1983).
This scheme can be described as follows. Let us define a parameter-dependent,
stochastic process by the state-space equation:

where are matrices of appropriate dimensions that depend on a parameter and D is an open domain. Here is
an exogenous noise source‚ which is assumed to be a vector-valued‚ wide-sense
stationary stochastic process. The matrices are assumed to have

*Partially supported by the National Foundation of Hungary under contract No T 032932.



their eigenvalues inside the unit disc, and A(.), B(.) and C(.) are
of
Let be an quadratic function, defined on all
components of which are homogeneous quadratic functions of Let

It is easy to see that is well-defined. The general abstract estimation problem is then defined as follows: find the solution of the nonlinear algebraic equation

using observed values of for various The solution will be denoted by It is assumed that is locally identifiable, i.e. is non-singular.
The following stochastic gradient-type algorithm was proposed in Djereveckii
and Fradkov (1974); Ljung (1977):

where is an on-line estimate of the frozen-parameter value defined by

This algorithm is closely related to the following frozen-parameter procedure:

The above general scheme includes a number of important identification procedures, such as the recursive maximum-likelihood method for Gaussian ARMA-
processes‚ or the maximum-likelihood method for the identification of finite-
dimensional, multivariable linear stochastic systems (cf. Caines, 1988; Ljung
and Söderström‚ 1983). In fact‚ for multivariable linear stochastic systems the
general estimation scheme is not directly applicable‚ and a simple extension is
needed: take a function which is linear in Q and in and then
define

As an extreme case, may not depend on at all. This is the case with
the celebrated LMS method in adaptive filtering. Here a particular component
of a wide-sense stationary signal is approximated by the linear combination of
the remaining components. Formally: let be an

second order stationary process, where is a real-valued process. Consider the minimization problem


where is an valued weighting vector. This is equivalent to solving

The on-line computations of the best weights is obtained by the LMS method:

In many practical applications the system to be estimated has slowly changing dynamics. In this case, instead of accumulating past information, we gradually forget it. This is accomplished by replacing the gain in (1.4) or in (1.7)
by a fixed gain‚ say Thus we arrive at the procedure:
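A minimal numerical sketch of the resulting fixed-gain procedure (Python; the dimensions, gain, drift rate, and data model are illustrative assumptions): the weight estimate is updated by the standard constant-gain LMS rule, adding at each step the gain times the regressor times the prediction error, and tracks a slowly drifting optimum.

    import numpy as np

    rng = np.random.default_rng(0)
    d, n, mu = 3, 20000, 0.01
    w_true = np.array([1.0, -0.5, 0.25])  # slowly drifting target weights

    theta = np.zeros(d)
    for k in range(n):
        w_true += 1e-4 * rng.standard_normal(d)       # slow parameter drift
        x = rng.standard_normal(d)                    # remaining components
        y = x @ w_true + 0.1 * rng.standard_normal()  # component to predict
        e = y - x @ theta                             # prediction error
        theta += mu * x * e                           # fixed-gain LMS update

    print(theta, w_true)  # theta tracks the drifting optimum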

Fixed gain procedures are quite well-known in adaptive filtering. The normal-
ized tracking error has been analyzed from a number of aspects.
Weak convergence‚ a central limit theorem and an invariance principle‚ when
have been derived in Kushner and Schwartz (1984); Györfi and Walk
(1996); Gerencsér (1995); Joslin and Heunis (2000); Kouritzin (1994). Bounds
for higher order moments for any fixed 0 have been derived in Gerencsér
(1995).
In this paper we consider general fixed-gain recursive estimation processes
that include processes given by (1.12). In modern terminology these can be con-
sidered as iterated random maps. Thus we consider random iterative processes
of the form

where is a random variable defined over some probability space for 1 and where D is an open domain. Fixed gain
estimation processes of have been analyzed in a number of interesting cases
beyond LMS. Stability results for Markovian time-varying systems are given
in Meyn and Caines (1987). Time-varying regression-type models have been
investigated in Bittanti and Campi (1994); Campi (1994); Guo (1990). A power-
ful and general result in a Markovian setting is given as Theorem 7, Chapter
4.4‚ Part II of Benveniste et al. (1990). The advance of Gerencsér (1996)
is the extension of the so-called ODE-method (ODE for Ordinary Differen-
tial Equations) to general fixed-gain estimation schemes‚ and a more complete

characterization of the tracking error process. The focus of that paper was on
processes in continuous-time.
The application of the so-called ODE method for discrete-time fixed-gain
recursive estimation processes requires the often painful analysis of the effect
of discretization error. The main contribution of the present paper is an ex-
tension of Gerencsér (1996) to discrete-time processes and the development
of a discrete-time ODE-method‚ where ODE now is for Ordinary Difference
Equation. This new method is much more convenient for applications and
also gives a more accurate characterization of the error process. These advan-
tages will be exploited in the forthcoming paper (Gerencsér and Vágó‚ 2001) to
analyze the convergence properties of the so-called SPSA (simultaneous per-
turbation stochastic approximation) methods‚ developed in Spall (1992); Spall
(1997); Spall (2000)‚ when applied to noise-free optimization. In the condi-
tions below we use the definitions and notations given in the Appendix. The
key technical condition that ensures a stochastic averaging effect is Condition
1.2.
Condition 1.1 The random fields H and are bounded for
say K and We
let
Condition 1.2 H and are L-mixing uniformly in for and in
for respectively, with respect to a pair of families
of
Condition 1.3 The mean field EH is independent of i.e. we can
write

The process will be compared with the discrete-time deterministic process defined by

The conditions imposed on the averaged field G are described in terms of (1.16) given below; this is more convenient for applications. Thus consider the
continuous-time deterministic process defined by

Condition 1.4 The function G defined on D is continuous and bounded in together with its first and second partial derivatives, say

Here denotes the operator norm.

The condition ensures the existence and uniqueness of the solution of (1.16)
in some finite or infinite time interval for any which will be
denoted by It is well-known that is a continuously differen-
tiable function of and also exists and is continuous
in
The time-homogenous flow associated with (1.16) is defined as the mapping
Let D be a subset of D such that for we
have for any For any fixed t the image of under will be
denoted as i.e. The union of
these sets will be denoted by i.e.

The neighborhood of a set will be denoted by i.e. for some Finally the interior of a compact domain
is denoted by int In the following condition a certain stability property
of (1.16) is described:

Condition 1.5 There exist compact domains such that we have for some and for some

Condition 1.6 The autonomous ordinary differential equation (1.16) is exponentially asymptotically stable with respect to initial perturbations, i.e. for some and we have for all

Theorem 1.1 Assume that Conditions 1.1-1.6 are satisfied and and Then for we have for all and Moreover setting with some fixed c we have

where is an L-mixing process with respect to and we have for any and

where and here depends only on and the domains and D, and is a system's constant, i.e. it depends only on K, L, and

The above theorem has two new features compared to previous results by
other authors: first‚ the higher order moments of the tracking error are estimated.
Secondly‚ the upper bound is shown to be an L-mixing process. An interesting
consequence is the following stochastic averaging result:
Theorem 1.2 Let the conditions of Theorem 1.1 be satisfied. Let F(.) be a
continuously differentiable function of defined in D. Then we have

with probability 1‚ where c is a deterministic constant which is independent of
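A toy numerical illustration of this averaging principle (Python; the random field, its averaged field G(theta) = -theta, the gain, and the horizon are illustrative assumptions): the fixed-gain stochastic iterate stays within an error of order sqrt(mu) of the deterministic iterate (1.15) driven by the averaged field.

    import numpy as np

    rng = np.random.default_rng(42)
    mu, n_steps = 0.01, 5000

    def H(theta, noise):
        """Illustrative random field; its averaged field is G(theta) = -theta."""
        return -theta + noise

    theta = 1.0  # stochastic iterate: theta_{k+1} = theta_k + mu H(theta_k, X_k)
    y = 1.0      # averaged iterate:   y_{k+1} = y_k + mu G(y_k)
    max_err = 0.0
    for _ in range(n_steps):
        theta += mu * H(theta, rng.standard_normal())
        y += mu * (-y)
        max_err = max(max_err, abs(theta - y))

    print(max_err, np.sqrt(mu))  # tracking error is of order sqrt(mu)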

In the above studies the boundedness of the estimator process is ensured by strong conditions, among others the boundedness of the random field. It is not
known if these conditions can be relaxed. A globally convergent procedure
has been developed in Yakowitz (1993) for the case of Kiefer-Wolfowitz-type
stochastic approximation procedures with decreasing gain‚ but it is not known if
the technique developed there can be extended to general‚ fixed gain estimation
processes. On the other hand‚ averaging is also important in deterministic
systems (cf. Sanders and Verhulst‚ 1985). The use of deterministic techniques
in a stochastic setting is not yet fully explored.

2. PRELIMINARY RESULTS
In the first result of this section we show a simple method to couple continuous-
time and discrete-time processes. Then we show that the discrete flow defined
by (1.15) is exponentially stable with respect to initial perturbations for suffi-
ciently small
Lemma 2.1 Assume that Conditions 1.4‚ 1.5 and 1.6 are satisfied. Let be the
solution of (1.16) and be the solution of (1.15). Then if
then will stay in and for all
Proof: Let denote the piecewise linear extension of
for Taking into account Lemma 0.2 we
can write as long as for all

where denotes the integer part of But


Taking into account Condition 1.6 we get

Now hence if then will be smaller than the distance between and where denotes the complement of the set hence will stay in for all Thus the proposition of the lemma is proved.
An interesting property of the discrete flow defined by (1.15) is that it inherits exponential stability with respect to initial perturbations if is sufficiently small.
Let denote the solution of (1.15) with initial condition

Lemma 2.2 Assume that Conditions 1.4‚ 1.5 and 1.6 are satisfied. Then d >
implies that will stay in moreover for any
we have

whenever is sufficiently small.

Proof: Write and We have

where Differentiating with respect to we get

First we consider the local error Setting a second-order Taylor-series expansion gives

or equivalently

From (1.16) we have



Thus

Differentiating (2.5) with respect to we get

To estimate the first term in the integral we differentiate (2.6) with respect to
to obtain

where * denotes tensor-product. Note that by Condition 1.6 we have and hence

Thus we finally get

Returning to (2.4)‚ the first term under the first integral on the right hand side
is estimated using Lemma 0.3. For the second term we write

Differentiating with respect to we get

Thus the absolute value of the first integral on the right hand side of (2.4) is
majorized by

For the absolute value of the second integral we get the upper bound

Altogether we get

where is a system-constant. Write Then we get

It follows by standard arguments that where the latter is the solution of the

with Writing

we have Obviously we have for any whenever is sufficiently small, and thus the proposition of the lemma follows.

3. THE PROOF OF THEOREM 1.1


First it is easy to show that for all whenever
The proof is almost identical with the proof of Lemma 2.1.
Let us subdivide the set of integers into intervals of length T. In the interval
we consider the solution of (1.15) starting from at time
denoted by Thus is defined by

Now we prove that for we have

where is defined in terms of the random field along deterministic trajectories by:

For the proof first note that Condition 1.5 and an obvious adaptation of Lemma 2.1 ensure that for all whenever There is no loss of generality in assuming that and then we have

In the first integral take the supremum over and over the initial condition which enters implicitly through Then we get the random variable defined in (3.2) with Since H is Lipschitz-continuous in with Lipschitz constant L, the second term is majorized by Applying a discrete-time Bellman-Gronwall lemma we get the claim for
Now Lemma 2.2 of Gerencsér (1996)‚ which itself is proved by direct com-
putations using Theorems of Gerencsér (1989)‚ implies that the process is
L–mixing with respect to and for any and we
have

where depends only on and the domains


Let the solution of (1.15) be denoted by Then we claim that for any
and for sufficiently small

where

and is a system’s constant‚ its value being given by


The proof is based on a representation analogous to (2.3). Using (3.1) and
Lemma 2.2 with the role of will be taken by
and thus we get the claim. It follows that we have for any

with This is obtained by applying Lemma 0.1 for the process with To complete the proof of Theorem 1.1 we combine the estimates (3.5), (3.7), (3.4).

APPENDIX
The basic concepts of the theory of L-mixing processes developed in Gerencsér (1989) will
be presented. Let a probability space be given, let be an open domain and
let be parameter-dependent stochastic process. Alternatively, we
may consider as a time-varying random field. We say that is M-bounded if for
all

Here | · | denotes the Euclidean norm. We shall use the same terminology if or t degenerate
into a single point. Also we shall use the following notation: if is M-bounded we write
Moreover‚ if is a positive real-valued function we write
if

Let be a family of monotone increasing and be a monotone decreasing family of We assume that for all and are
independent. For we set A stochastic process is L-mixing
with respect to uniformly in if it is M-bounded and if we set for

where is a positive integer then

The following lemma is a direct consequence of Lemma 2.4 of Gerencsér (1989)‚ which itself
is verified by direct calculations:

Lemma 0.1 Let be a discrete time L-mixing process with respect to a pair of
families of Define the process

with Then is L-mixing with respect to and for we have

To capture the smoothness of a stochastic process with respect to we define

for A stochastic process is M-Lipschitz-continuous in if the process is M-bounded, i.e. if for all we have

We define in an analogous way. Finally we introduce the notations

Two simple analytical lemmas follow. Due to its importance, the proof of the first lemma will be given. The proof of the second lemma is given in Gerencsér (1996):

Lemma 0.2 (cf. Geman, 1979) Let be a function satisfying Conditions 1.4, 1.5 and let be the solution of (1.16). Further let be a continuous, piecewise continuously differentiable curve such that Then for

Proof: Consider the function Obviously the left hand side of (0.3) can be
written as Write

Taking into account the equality

which follows from the identity after differentiation with respect to we get
the lemma.

Lemma 0.3 Under Conditions 1.4‚ 1.5 and 1.6 we have

REFERENCES
Benveniste, A., M. Métivier, and P. Priouret. (1990). Adaptive algorithms and
stochastic approximations. Springer-Verlag, Berlin.
Bittanti, S. and M. Campi. (1994). Bounded error identification of time-varying
parameters by RLS techniques. IEEE Trans. Automat. Contr., 39(5): 1106–
1110.
Caines, P. E. (1988). Linear Stochastic Systems. Wiley.
Campi, M. (1994). Performance of RLS identification algorithms with forgetting factor: a φ-mixing approach. J. of Mathematical Systems, Estimation and Control, 4:1–25.
Djereveckii, D. P. and A.L. Fradkov. (1974). Application of the theory of
Markov-processes to the analysis of the dynamics of adaptation algorithms.
Automation and Remote Control, (2):39–48.
Djereveckii, D. P. and A.L. Fradkov. (1981). Applied theory of discrete adaptive
control systems. Nauka, Moscow. In Russian.
Geman, S. (1979). Some averaging and stability results for random differential
equations. SIAM Journal of Applied Mathematics, 36:87–105.
Gerencsér, L. (1989). On a class of mixing processes. Stochastics, 26:165–191.
Gerencsér, L. (1995). Rate of convergence of the LMS algorithm. Systems and
Control Letters, 24:385–388.
Gerencsér, L. (1996). On fixed gain recursive estimation processes. J. of Ma-
thematical Systems, Estimation and Control, 6:355–358. Retrieval code for
full electronic manuscript: 56854.
Gerencsér, L. and Zs. Vágó. (2001). A general framework for noise-free SPSA. In Proceedings of the 40th Conference on Decision and Control, CDC'01. Submitted.

Guo‚ L. (1990). Estimating time-varying parameters by the Kalman-filter based


algorithm: stability and convergence. IEEE Trans. Automat. Contr.‚ 35:141–
157.
Györfi‚ L. and H. Walk. (1996). On the average stochastic approximation for
linear regression. SIAM J. Control and Optimization‚ 34(1):31–61.
Joslin‚ J. A. and A. J. Heunis. (2000). Law of the iterated logarithm for a
constant-gain linear stochastic gradient algorithm. SIAM J. on Control and
Optimization‚ 39:533–570.
Kouritzin‚ M. (1994). On almost-sure bounds for the LMS algorithm. IEEE
Trans. Informat. Theory‚ 40:372–383.
Kushner‚ H. J. and A. Schwartz. (1984). Weak convergence and asymptotic
properties of adaptive filters with constant gains. IEEE Trans. Informat. The-
ory‚ 30(2):177–182.
Ljung‚ L. (1977). Analysis of recursive stochastic algorithms. IEEE Trans.
Automat. Contr.‚ 22:551–575.
Ljung‚ L. and T. Söderström. (1983). Theory and practice of recursive identifi-
cation. The MIT Press.
Meyn‚ S. P. and P.E. Caines. (1987). A new approach to stochastic adaptive
control. IEEE Trans. Automat. Contr.‚ 32:220–226.
Sanders‚ J. A. and F. Verhulst. (1985). Averaging methods in nonlinear dynam-
ical systems. Springer-Verlag‚ Berlin.
Spall‚ J. C. (1992). Multivariate stochastic approximation using a simultaneous
perturbation gradient approximation. IEEE Trans. Automat. Contr.‚ 37:332–
341.
Spall‚ J. C. (1997). A one-measurement form of simultaneous perturbation
stochastic approximation. Automatica‚ 33:109–112.
Spall, J. C. (2000). Adaptive stochastic approximation by the simultaneous perturbation method. IEEE Trans. Automat. Contr., 45:1839–1853.
Yakowitz‚ S. (1993). A globally convergent stochastic approximation. SIAM J.
Control and Optimization‚ 31:30–40.
Part V
Chapter 18

’UNOBSERVED’ MONTE CARLO METHODS FOR ADAPTIVE ALGORITHMS

Victor Solo
School of Electrical Engineering and Telecommunications
University of New South Wales
Sydney NSW 2052
Australia
vsolo@syscon.ee.unsw.edu.au
Keywords: Markov Chain, Monte Carlo, Adaptive Algorithm.

Abstract Many Signal Processing and Control problems are complicated by the presence
of unobserved variables. Even in linear settings this can cause problems in con-
structing adaptive parameter estimators. In previous work the author investigated
the possibility of developing an on-line version of so-called Markov Chain Monte
Carlo methods for solving these kinds of problems. In this article we present a
new and simpler approach to the same group of problems based on direct simu-
lation of unobserved variables.

1. EL SID
When I first started wandering away from the statistics beaten track into
the physical and engineering sciences I soon encountered a small but intrepid
band of ‘statistical explorers’; statisticians who had taken the pains necessary to
break into other areas (where their mathematical skills did not excite the same
level of awe as they might have in some of the social or biological sciences)
but who were having a considerable impact, out of proportion to their numbers,
elsewhere.
Of course there were the better known names, the Spekes and Burtons of these strange lands: Tukey certainly was there in geophysics; and Kiefer in information theory. But there were others, less well known, the Bakers and Marchands, perhaps more enterprising, but no less deserving of respect. Sid
was one of these, but different even then; too maverick to belong to a group of
mavericks!
I first encountered Sid’s work in adaptive control in the 1970’s (Yakowitz,
1969). Then a little later I was astounded to discover he’d worked on the
marvellous ill-conditioned inverse problem of aquifer modelling (Neuman and
Yakowitz, 1979). And yet again there was his work on Kriging (Yakowitz and
Szidarovszky, 1984), providing some rare clarity on a much mis-discussed sub-
ject. He was worrying about nonparametric estimation of stochastic processes
long before it became fashionable: e.g., Yakowitz (1985) and earlier references.
He was, then, an early player in areas which became extremely important. In his research Sid cut to the heart of the problem and asked the hard (and embarrassing) questions; he obviously took application seriously but saw the utility
of rigour too. At a personal level he was very humble and very encouraging to
other younger researchers.
Because Sid worked on so many diverse problems he did not get the kind
of recognition he deserved but he certainly rates along with others who did.
Given that his first work was on adaptive algorithms I am pleased to offer a
contribution on this topic. This work, with a statistical bent, would appeal to Sid, although he would not be pleased by the lack of rigour!

2. INTRODUCTION
In many problems of real-time parameter estimation there is a necessity
to estimate unobserved signals. These may be states or auxiliary variables
measured with error. In this paper we concentrate on this latter problem of
errors in variables.
Thus in Computer Vision, consider the fundamental problem of estimating
three-dimensional motion from a sequence of two-dimensional images using
a feature-based method - a highly non-linear problem (Tekalp, 1995). Many current methods ignore the presence of noise in the measurements, which makes the problem an errors-in-variables problem. Kanatani (1996) has tackled the problem of noise, but his techniques apply only in high signal-to-noise-ratio cases and under independence assumptions. An approach that overcomes this is given in Ng and Solo (2001). In multiuser communication based on spread-spectrum methods, detection methods may require spreading codes which may not be known and so have to be estimated; they occur in a nonlinear fashion (Rappaport, 1996).
To construct an adaptive parameter estimator for a non-linear problem there
are two routes. In the model free or stochastic approximation approach (Ben-
veniste et al., 1990; Ljung, 1983; Solo and Kong, 1995) an analytic expression
is needed for the gradient of the likelihood with respect to the parameters. In
the model based approach, which usually leads to approximations based on the extended Kalman filter (Solo and Kong, 1995; Ljung, 1983), this gradient is also needed. But there are many problems where it is not possible to develop analytic expressions for the likelihood, much less its gradient. Errors in variables typically produce such a situation.
In the statistics literature in the last decade or so Markov Chain Monte
Carlo methods (MCMC) (Roberts and Rosenthal, 1998), originating in Physics
(Metropolis et al., 1953) and Image Processing (Geman and Geman, 1984) have
provided a powerful simulation based methodology for solving these kinds of
problems. In previous work (Solo, 1999) we have pursued the possibility of
using this method in an offline setting. Here, based on more recent work of the author (Solo, 2000a,b), we develop an on-line version of an alternative and simpler method. We use a simple binary classification problem as an exploratory
platform. For the errors in variables case we consider below there seems to
be no previous literature (aside from Solo (2000a)) using the model free ap-
proach. For the model based approach one ends up appending the parameters
being tracked to the states and pursues a nonlinear filtering approach. There
are examples of this e.g. Kitagawa (1998) but not really aimed at the kind of
errors in variables problem envisaged here.
The remainder of the paper is organised as follows. In section 3 we describe the problem (without measurement errors) and briefly review an adaptive estimator. In section 4 we discuss the same problem but where the auxiliary (classifying) variables are measured with error, and describe the ’Unobserved’ Monte Carlo method for parameter estimation (offline). In section 5 we develop an online version of the estimator and sketch a convergence analysis. Conclusions are offered in section 6.
In the sequel we use for the density of an unobserved signal or variable;
for a conditional density and for the density of an observed variable.
We also use to denote a probability of a binary event; this should cause no confusion.

3. ON-LINE BINARY CLASSIFICATION


Suppose we have iid (independent identically distributed) data where
are binary random variables e.g. yes/no results in an object (e.g. a bolt on
a conveyer belt) recognition task and are regressors or classifying auxiliary
variables (e.g. a spatial moment estimated from an image of the object). A
typical model linking the recognition variable to the classifying variable is
(Ripley, 1996) the logistic regression model which specifies the conditional
distribution of given as,
In our previous work (Solo, 1999) we gave, as background, details of the simple instantaneous steepest descent algorithm for estimating the parameter on-line, based on the one-point likelihood, namely

An averaging stability analysis (Kushner, 1984; Benveniste et al., 1990; Solo and Kong, 1995; Sastry and Bodson, 1989) was sketched there. This is all reasonably straightforward.
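For concreteness, a minimal sketch of such an instantaneous steepest-descent estimator (our notation, not the chapter's: the model is taken to be P(y = 1 | x) = 1/(1 + exp(-theta'x)) with a small constant step size mu, and all names are illustrative):

import numpy as np

def online_logistic_sgd(xs, ys, mu=0.05):
    """Instantaneous steepest descent for logistic regression.

    Each step ascends the gradient of the one-point log likelihood
    l_k(theta) = y_k log p_k + (1 - y_k) log(1 - p_k), where
    p_k = sigmoid(theta' x_k).  An illustrative sketch only.
    """
    theta = np.zeros(xs.shape[1])
    for x, y in zip(xs, ys):
        p = 1.0 / (1.0 + np.exp(-theta @ x))   # model probability
        theta = theta + mu * (y - p) * x       # instantaneous gradient step
    return theta

# Example: recover theta from noise-free classifying variables.
rng = np.random.default_rng(0)
theta_true = np.array([1.5, -2.0])
X = rng.normal(size=(5000, 2))
Y = (rng.random(5000) < 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)
print(online_logistic_sgd(X, Y))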

4. BINARY CLASSIFICATION WITH NOISY MEASUREMENTS OF CLASSIFYING VARIABLES - OFFLINE
Now we complicate the problem by assuming that the auxiliary variables are
measured with error. The model becomes.

The unknowns are and we denote the data as Then direct maximum likelihood estimation (off-line) for is not straightforward because, although the density is given and the conditional density is easily computed, there is no closed-form analytic expression for the marginal density of

and hence for the likelihood.


In such a case a natural idea is to try the EM (expectation maximization) algorithm (Dempster et al., 1977), which requires the kernel (wherein subscript 0 denotes expectation with the value

But the conditional density is intractable and so the kernel cannot be constructed. The idea now is to generate an approximation to the kernel (Wei and Tanner, 1990)

where are supposed to be iid draws from

We cannot directly generate draws from this conditional density since the nor-
malising constant (or partition function) cannot be calculated. Note that this
partition function is precisely the density that we cannot calculate. And now
MCMC comes into play because we can generate instead draws from a Markov
chain which has as its limit or invariant or ’steady state’ density. There
are numerous MCMC algorithms available (Roberts and Rosenthal, 1998).
An alternative to EM is MC-NR (Monte Carlo Newton Raphson) (Kuk and
Cheng, 1997) in which the idea is to get by simulation also. The EM
framework remains useful. We find

So generate from an MCMC relating to and use, e.g., steepest descent (or NR)

This idea was developed into an online procedure in the previous work (Solo,
1999). But here we take a different route.
In recent work (Solo, 2000ab) the author has found a simpler route to Monte
Carlo based estimation by means of direct simulation of the unobserved variable.
The method is dubbed ’unobserved’ Monte Carlo (UMC).1 Referring to (4.1)
we see that if we draw iid samples from the density q(u)
then a Monte Carlo estimate of is

Similarly the gradient is given by (with

which leads to a Monte Carlo estimator

We can get an estimate of the score by dividing (4.5) by (4.3). Detailed discussion of the technique in an offline setting is given in Solo (2000a) and Solo (2000b). Here we pursue an on-line version.
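Before doing so, the offline recipe just described can be sketched in code, under an assumed scalar model chosen for illustration only: y | u ~ Bernoulli(s(theta u)) with s the logistic function, observed x = u + v with v ~ N(0, sigma_v^2), and u drawn from a known density q through a user-supplied sampler draw_u.

import numpy as np

def umc_score(theta, y, x, sigma_v, draw_u, m=200):
    """UMC estimate of the per-observation score (offline flavour).

    The likelihood, cf. (4.3), and its theta-gradient, cf. (4.5), are
    approximated by averages over iid draws from q(u); their ratio
    estimates the gradient of the log likelihood.  A sketch only, not
    the chapter's exact formulas.
    """
    u = draw_u(m)                                   # iid draws from q(u)
    p = 1.0 / (1.0 + np.exp(-theta * u))            # P(y = 1 | u)
    lik = p**y * (1.0 - p)**(1 - y) * np.exp(-0.5 * ((x - u) / sigma_v) ** 2)
    grad = (y - p) * u * lik                        # d/dtheta of each term
    return grad.mean() / lik.mean()

Note that the Gaussian normalising constant cancels in the ratio, which is why it can be omitted from lik.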

5. BINARY CLASSIFICATION WITH ERRORS IN CLASSIFYING VARIABLES - ONLINE
To get an online or adaptive algorithm we follow the usual approach (Solo
and Kong, 1995) and use an ’instantaneous’ gradient. So (4.5) leads us to

where is a draw based on This is much simpler than the MCMC approach developed in Solo (1999), since here the Monte Carlo draws are generated directly, not as the result of an auxiliary limiting procedure.
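In code, the recursion just described might look as follows (a sketch in the same assumed scalar errors-in-variables model as above; all names are illustrative):

import numpy as np

def umc_online(theta0, stream, sigma_v, draw_u, mu=0.01):
    """On-line UMC estimator: one unobserved-variable draw per time step.

    The Monte Carlo average is replaced by a single draw u_k from q, and
    theta moves along the corresponding 'instantaneous' gradient term of
    the simulated one-point likelihood.  A sketch only.
    """
    theta = float(theta0)
    for y, x in stream:
        u = draw_u(1)[0]                            # one draw per time step
        p = 1.0 / (1.0 + np.exp(-theta * u))
        w = p**y * (1.0 - p)**(1 - y) * np.exp(-0.5 * ((x - u) / sigma_v) ** 2)
        theta += mu * (y - p) * u * w               # instantaneous gradient step
    return theta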
Note that this will not converge to the maximum likelihood estimator; rather, it is an estimator in its own right. It is not our aim here to attempt a full-blown convergence and performance analysis of this algorithm. But we can sketch a stability analysis.
We now develop an averaging analysis using the kind of heuristics discussed e.g. in Solo and Kong (1995).2 Sum over an interval of extent N to get
Now if is small, N not too large then in the sum can be approximated by
its value at the beginning of the sum while can be approximated by a
draw from So we get

Continuing this can be approximated by its average

Note the unusual feature that is the true density of Now argue in reverse
to find the averaged system

It is interesting that this is the same averaged system found in the previous work
(Solo, 1999) based on MCMC simulation. Continuing we find

We now note trivially that is an equilibrium point of the averaged system since We also get local exponential stability of the linearized system since

And under certain regularity conditions R is positive definite, since it is the Fisher information. Using the more formal methods in Solo and Kong (1995) we can link the behaviour of the averaged system to the original algorithm through so-called hovering theorems; thus (5.1) has no equilibrium points and so, under certain regularity conditions, hovers or jitters around the equilibrium points of the averaged system. This analysis will be developed in detail elsewhere.
However, we can sketch a global stability analysis. If we replace the discrete averaged system (5.2) by a continuous-time one, i.e. an ODE (this is discussed in Solo (1999)), then there is a natural Lyapunov function, namely

By Jensen’s inequality it has the property and assuming one can show identifiability of then additionally if and only if We assume is continuous. Also we see that so that

Thus is uniformly bounded and converges. We assume is continuous and as For our example this holds e.g. with and This then ensures that is bounded. Since all limit points of must be stationary points of V. And local identifiability will ensure these are isolated.

6. CONCLUSIONS
In this paper we have pursued the issue of on-line use of simulation meth-
ods for adaptive parameter estimation in nonstandard problems. Here we have
concentrated on errors in variables problems. We have shown that a simpler
simulation method than previously developed has great promise for these kinds
of problems. Previously, problems like these have been ignored or treated approximately by Taylor series expansions, and it now appears to be possible to do much better. The careful reader can cast an eye back over our discussion to see that nothing special about the errors in variables setup is invoked. The approach developed here extends in a straightforward way to deal with partially observed dynamical systems. It should be noted, however, that there will remain some problems where even to draw from the density of the unobserved signal will itself require an MCMC approach. In future work we will look at implementations for more realistic problems and study stability and performance in more detail.

NOTES
1. The current and related works were completed in the second half of 1999 after a sabbatical. Recently the author became aware of work on so-called sequential Monte Carlo methods, of which to some extent UMC is a special case.
2. A more formal analysis can be developed using the methods in Solo and Kong (1995) but will be pursued elsewhere.
REFERENCES
Benveniste, A., M. Metivier, and P. Priouret. (1990). Adaptive Algorithms and
stochastic approximations. Springer-Verlag, New York.
Dempster, A.P., N.M. Laird, and D.B. Rubin. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Stat. Soc. B, 39:1–38.
Geman, S. and D. Geman. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Patt. Anal. Machine Intell.,
6:721–741.
Kanatani, K. (1996). Statistical optimization for Geometric Computation: The-
ory and Practice. North-Holland, Amsterdam.
Kitagawa, G. (1998). Self organising state space model. Jl. Amer. Stat. Assoc, 93:1203–1215.
Kuk, A.Y.C. and Y.W. Cheng. (1997). The Monte Carlo Newton-Raphson algorithm. Jl. Stat. Computation Simul.
Kushner, H.J. (1984). Approximation and weak convergence methods for ran-
dom processes with application to stochastic system theory. MIT Press, Cam-
bridge MA.
Ljung, L. (1983). Theory and practice of recursive identification. MIT Press,
Cambridge, Massachusetts.
Metropolis, N., A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. (1953).
Equations of state calculations by fast computing machines. J. Chem. Phys.,
21:1087–1091.
Ng, L. and V. Solo. (2001). Errors-in-variables modelling in optical flow esti-
mation. IEEE Trans. Im.Proc., to appear.
Neuman, S.P. and S. Yakowitz. (1979). A statistical approach to the inverse
problem of aquifer hydrology, I. Water Resour Res, 15:845–860.
Rappaport, T.S. (1996). Wireless Communication. Prentice Hall, New York.
Ripley, B.D. (1996). Pattern recognition and Neural networks. Cambridge Uni-
versity Press, Cambridge UK.
Roberts, G.O. and J.S. Rosenthal. (1998). Markov chain Monte Carlo: Some practical implications of theoretical results. Canadian Jl. Stat., 26:5–31.
Sastry, S. and M. Bodson. (1989). Adaptive Control. Prentice Hall, New York.
Solo, V. and X. Kong. (1995). Adaptive Signal Processing Algorithms. Prentice
Hall, New Jersey.
Solo, V. (1999). Adaptive algorithms and Markov chain Monte Carlo methods.
In Proc. IEEE Conf Decision Control 1999, Phoenix, Arizona, IEEE.
Solo, V. (2000a). ’Unobserved’ Monte Carlo method for system identification of partially observed nonlinear state space systems, Part I: Analog observations. In Proc. JSM 2001, Atlanta, Georgia, August, to appear. Amer. Stat. Assoc.
Solo, V. (2000b). ’Unobserved’ Monte Carlo method for system identification of partially observed nonlinear state space systems, Part II: Counting process observations. In Proc. 39th IEEE CDC, Sydney, Australia. IEEE.
Tekalp, M. (1995). Digital Video Processing. Prentice-Hall, Englewood Cliffs,
N.J.
Wei, G.C.G. and M. Tanner. (1990). A Monte Carlo implementation of the EM
algorithm and the poor man’s data augmentation algorithms. Jl. Amer. Stat.
Assoc, 85:699–704.
Yakowitz, S. (1969). Mathematics of Adaptive Control Processes. Elsevier, New
York.
Yakowitz, S. (1985). Nonparametric density estimation, prediction and regres-
sion for Markov sequences. Jl. Amer. Stat. Assoc, 80:215–221.
Yakowitz, S. and F. Szidarovszky. (1984). A comparison of kriging with nonparametric regression methods. Jl. Mult. Anal.
Chapter 19

RANDOM SEARCH UNDER ADDITIVE NOISE

Luc Devroye
School of Computer Science
McGill University
Montreal, Canada H3A 2K6

Adam Krzyzak
Department of Computer Science
Concordia University
Montreal, Canada H3G 1M8

1. SID’S CONTRIBUTIONS TO NOISY OPTIMIZATION
From the early days in his career, Sid Yakowitz showed interest in noisy
function optimization. He realized the universality of random search as an
optimization paradigm, and was particularly interested in the minimization of
functions Q without making assumptions on the form of Q. Especially the noisy
optimization problem appealed to him, as exact computations of Q often come at a tremendous cost, while rough or noisy evaluations are computationally
cheaper. His early contributions were with Fisher (Fisher and Yakowitz, 1976;
Yakowitz and Fisher, 1973). The present paper builds on these fundamental
papers and provides further results along the same lines. It is also intended to
situate Sid’s contributions in the growing random search literature.
Always motivated by the balance between accurate estimation or optimiza-
tion and efficient computations, Sid then turned to so-called bandit problems, in
which noisy optimization must be performed within a given total computational
effort (Yakowitz and Lowe, 1991).
The computational aspects of optimization brought him closer to learning
and his work there included studies of game playing strategies (Yakowitz, 1989;
Yakowitz and Kollier, 1992), epidemiology (Yakowitz, 1992; Yakowitz, Hayes
and Gani, 1992) and communication theory (Yakowitz and Vesterdahl, 1993).
Sid formulated machine learning invariably as a noisy optimization problem,
both over finite and infinite sample spaces: Yakowitz (1992), Yakowitz and
Lugosi (1990), and Yakowitz, Jayawardena and Li (1992) summarize his main
views and results in this respect.
Another thread he liked to follow was stochastic approximation, and in par-
ticular the Kiefer-Wolfowitz method (1952) for the local optimization in the
presence of noise. In a couple of technical reports in 1989 and in his 1993
SIAM paper, Sid presented globally convergent extensions of this method by
combining ideas of random search and stochastic approximation.
We have learned from his insights and shared his passion for nonparametric
estimation, machine learning and algorithmic statistics. Thank you, Sid.

2. FORMULATION OF SEARCH PROBLEM


We wish to locate the global minimum of a real-valued function Q on some
search domain a subset of As we pose it, this problem may have no
solution. First of all, the function Q may not have a minimum on (consider
and and if a minimum exists, it may not be unique
(consider the real line and and if it exists and is unique, it may
be nearly impossible to find it exactly, although we can hope to approximate it
in some sense. But is even that possible? Take for example the function on
defined by everywhere except on a finite set A on which the function takes
the value – 1. Without a priori information about the location of the points of
A, it is impossible to locate any point in A, and thus to find a global minimum.
To get around this, we take a probabilistic view. Assume that we can probe our
space with the help of a probability distribution such as the uniform density on
or the standard normal distribution on If X is a random variable with
probability distribution we can define the global minimum by the essential
infimum:

This means that and for


all The value of depends heavily on It is the smallest possible
value that we can hope to reach in a search process if the search is carried
out at successive independent points with common probability
distribution For example, if is the uniform distribution on then is
the (Lebesgue) essential infimum of Q in the unit cube. To push the formalism
to an extreme, we could say that a couple defines a search problem if
(i) Q is a Borel measurable function on
(ii) is a probability measure on the Borel sets of
(iii)
Formally, a search algorithm is a sequence of mappings from to
where is a place at which is
computed or evaluated. The objective is to have tend to and, if possible, to assure that the rate of this convergence is fast. In
random search methods, the mapping is replaced by a distribution on
that is a function of and is a random variable drawn from
that distribution. The objective remains the same. Noisy optimization problems
will be formally defined further on in the paper.

3. RANDOM SEARCH: A BRIEF OVERVIEW


Random search methods are powerful optimization techniques. They in-
clude pure random search, adaptive random search, simulated annealing, ge-
netic algorithms, neural networks, evolution strategies, nonparametric estima-
tion methods, bandit problems, simulation optimization, clustering methods,
probabilistic automata and random restart. The ability of random search methods to locate the global extremum has made them an indispensable tool in many areas of science and engineering. The explosive growth of random search is partially
documented in books such as those by Aarts and Korst (1989), Ackley (1987),
Ermoliev and Wets (1988), Goldberg (1989), Holland (1992), Pintér (1996),
Schwefel (1977, 1981), Törn and Žilinskas (1989), Van Laarhoven and Aarts
(1987), Wasan (1969) and Zhigljavsky (1991).
Random search algorithms are usually easy and inexpensive to implement.
Since they either ignore the past or use a small collection of points from iteration
to iteration they are easily parallelizable. Convergence of most random search
procedures is not affected by the cost function, in particular its smoothness
or multimodality. In a minimax sense, random search is more powerful than
deterministic search: this means it is nearly the best method in the worst possible
situation (discontinuities, high dimensionality, multimodality) but possibly the
worst method in the best situation (smoothness, continuity, unimodality) (Jarvis,
1975). The simplest random search method, pure random search, can be used to select a starting point for more sophisticated random search techniques, and can also act as a benchmark against which the performance of other search algorithms is measured. Also, random search is much less sensitive than deterministic search to function evaluations perturbed by additive noise, and that motivates the present paper.
In ordinary random search, we denote by the best estimate of the (global)
minimum after n iterations, and by a random probe point. In pure random search, are i.i.d. with a given fixed probability measure over the parameter space The simple ordinary random search algorithm is given below:
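(Our restatement, a minimal sketch: the names Q, sample and n are illustrative, with sample() drawing one probe point from the fixed measure.)

import numpy as np

def pure_random_search(Q, sample, n):
    """Pure random search: i.i.d. probe points from a fixed distribution;
    keep the best point seen so far."""
    best_x = sample()
    best_q = Q(best_x)
    for _ in range(n - 1):
        x = sample()                  # i.i.d. probe point
        q = Q(x)
        if q < best_q:                # retain only strict improvements
            best_x, best_q = x, q
    return best_x, best_q

# Example: uniform probing of the unit square for a multimodal Q.
rng = np.random.default_rng(1)
Q = lambda x: np.sin(13 * x[0]) * np.sin(17 * x[1]) + (x ** 2).sum()
print(pure_random_search(Q, lambda: rng.random(2), 10000))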

In local random search on a discrete space, usually is a random neighbor of


where the definition of a neighborhood depends upon the application. In
local random search in a Euclidean space, one might set

where is a random perturbation usually centered at zero. The fundamental


properties of pure random search (Brooks, 1958) are well documented. Let
be the distribution function of Then
is approximately distributed as where E is an exponential random vari-
able. This follows from the fact that if F is nonatomic,

Note first of all the distribution-free character of this statement: its universality
is both appealing and limiting. We note in passing here that many papers have
been written about how one could decide to stop random search at a certain
point.
To focus the search somewhat, random covering methods may be considered.
For example, Lipschitz functions may be dealt with in the following manner (Shubert,
1972): at the trial points we know Q and can thus derive piecewise linear
bounds on Q. The next trial point is given by

where C is the Lipschitz constant. This is a beautiful approach, whose im-


plementation for large d seems very hard. For noisy problems, or when the
dimension is large, a random version of this was proposed in Devroye (1978).
If is compact, is taken uniformly in minus the union of the balls
centered at the with radius If C is
unknown, replace it in the formula for the radius by and let such
that and (example:
for Then almost surely.
Global random search is a phrase used to denote many methods. Some of
these methods proceed in a local manner, yet find a global minimum. Assume
for example that we set

where are i.i.d. normal random vectors, and is a given


deterministic sequence. The new probe point is not far from the old best point,
as if one is trying to mimic local descent algorithms. However, over a compact
set, global convergence takes place whenever This is merely
due to the fact that form a cloud that becomes dense in the
expanding sphere of radius Hence, we will never get stuck in a local
minimum. The convergence result does not put any restrictions on Q. The
above result, while theoretically pleasing, is of modest value in practice as must be adapted to the problem at hand. A key paper in this respect is by Matyas
(1965), who suggests making adaptive and setting

where is a preferred direction that is made adaptive as well. A rule of thumb, that may be found in several publications (see Devroye, 1972, and more
recently, Bäck, Hoffmeister and Schwefel, 1991), is that should increase
after a successful step, and decrease after a failure, and that the parameters
should be adjusted to keep the probability of success around 1/5. Schumer and
Steiglitz (1968) and others investigate the optimality of similar strategies for
local hill-climbing. Alternately, may be found by a one-dimensional search
along the direction given by (Bremermann, 1968; Gaviano, 1975).
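A per-step caricature of this rule of thumb (the adaptation factors are our choices; with up * down**4 close to 1 the scheme is roughly neutral at a success rate of 1/5):

import numpy as np

def one_fifth_rule_search(Q, x0, n, sigma0=1.0, rng=None):
    """Local random search with step-size adaptation: lengthen steps after
    a success, shorten them after a failure.  Illustrative sketch only."""
    rng = rng or np.random.default_rng(2)
    up, down = 1.5, 0.9               # 1.5 * 0.9**4 is approximately 0.98
    x = np.asarray(x0, dtype=float)
    q, sigma = Q(x), sigma0
    for _ in range(n):
        y = x + sigma * rng.normal(size=x.shape)   # perturbation centred at x
        qy = Q(y)
        if qy < q:
            x, q, sigma = y, qy, sigma * up        # success: grow sigma
        else:
            sigma *= down                          # failure: shrink sigma
    return x, q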
In simulated annealing, one works with random probes as in random search,
but instead of letting be the best of (the probe point) and (the
old best point), a randomized decision is introduced, that may be reformulated
as follows (after Hajek and Sasaki, 1989):

where is a positive constant depending for now on only and is an i.i.d. sequence of positive random variables. The best point thus walks around
the space at random. If the temperature, is zero, we obtain ordinary random
search. If is a random walk over the parameter space.
If and is exponentially distributed, then we obtain the Metropolis
Markov chain or the Metropolis algorithm (Metropolis et al., 1953; Kirkpatrick, Gelatt and Vecchi, 1983; Meerkov, 1972; Cerny, 1985; Hajek and Sasaki, 1989).
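In code, the randomized comparison above reads as follows (a sketch; sample() and the cooling schedule temp(k) are left abstract, and temp identically zero recovers ordinary random search):

import numpy as np

def metropolis_annealing(Q, sample, n, temp):
    """Simulated annealing via the randomized-threshold form: accept the
    trial point when Q(trial) <= Q(best) + T_k * W_k with W_k i.i.d.
    exponential, which is the Metropolis rule with acceptance probability
    exp(-(Q(trial) - Q(best))_+ / T_k)."""
    rng = np.random.default_rng(3)
    x = sample()
    for k in range(1, n + 1):
        y = sample()                        # trial point
        w = rng.exponential()               # i.i.d. positive random variable
        if Q(y) <= Q(x) + temp(k) * w:      # randomized acceptance decision
            x = y
    return x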
Yet another version of simulated annealing has emerged, called the heat bath
Markov chain (Geman and Hwang, 1986; Aluffi-Pentini et al, 1985), which
proceeds by setting

where now are i.i.d. random variables and is the temperature parameter. If the are distributed as the extreme-value distribution
(with distribution function then we obtain the original version of the
heat bath Markov chain. Note that each is then distributed as log log(1/U)
where U is uniform [0,1], so that computer simulation is not hampered.
The two schemes are not dramatically different. The heat bath Markov chain
as we presented it here has the feature that function evaluations are intentionally
corrupted by noise. This clearly reduces the information content and must slow
down the algorithm. Most random search algorithms take random steps but
do not add noise to measurements; in simulated annealing, one deliberately destroys valuable information. It should be possible to formulate an algorithm
that does not corrupt expensive function evaluations with noise (by storing them)
and outperforms the simulated annealing algorithm in some sense. One should
be careful though and only compare algorithms that occupy equal amounts of
storage for the program and the data.
We now turn to the choice of In view of the representation given above,
it is clear that is bounded from below by a constant times
as is the threshold we allow in steps away from the minimum. Hence the
need to make small. This need clashes with the condition of convergence
(typically, must be at least for some constant ). The condition
of convergence depends upon the setting (the space and the definition of
given We briefly deal with the specific case of finite-domain simulated
annealing below. In continuous spaces, progress has been made by Vanderbilt
and Louie (1984), Dekkers and Aarts (1991), Bohachevsky, Johnson and Stein
(1986), Gelfand and Mitter (1991), and Haario and Saksman (1991). Other
key references on simulated annealing include Aarts and Korst (1989), Van
Laarhoven and Aarts (1987), Anily and Federgruen (1987), Gidas (1985), Hajek
(1988), and Johnson, Aragon, McGeoch and Schevon (1989).
Further work seems required on an information-theoretic proof of the inad-
missibility of simulated annealing and on a unified treatment of multistart and
simulated annealing, where multistart is a random search procedure in which
one starts at a randomly selected place at given times or whenever one is stuck
in a local minimum.
On a finite connected graph, simulated annealing proceeds by picking a trial
point uniformly at random from its neighbors. Assume the graph is regular,
i.e., each node has an equal number of neighbors. If we keep the temperature
fixed, then there is a limiting distribution for called the Gibbs
distribution or Maxwell-Boltzmann distribution: for the Metropolis algorithm,
the asymptotic probability of node is proportional to Interestingly,
this is independent of the structure of the graph. If we now let then
with probability tending to one, belongs to the collection of local minima.
With probability tending to one, belongs to the set of global minima if
additionally, (for example, for
will do). Here is the maximum of all depths of strictly local minima (Hajek,
1988). The only condition on the graph is that all connected components of
are strongly connected for any The slow convergence of
puts a severe lower bound on the convergence rate of simulated annealing.
Let us consider optimization on a compact of and let Q be bounded there.
If we let have a fixed density that is bounded from below by a
constant times the indicator of the unit ball, then in the Metropolis algorithm
converges to the global minimum in probability if yet
Bohachevsky, Johnson and Stein (1986) adjust during the search to make
the probability of accepting a trial point hover near a constant. Nevertheless, if
is taken as above, the rate of convergence to the minimum is bounded from
below by which is much slower than the polynomial rate we would
have if Q were multimodal but Lipschitz.
Several ideas deserve more attention as they lead to potentially efficient
algorithms. These are listed here in arbitrary order. In 1975, Jarvis introduced
competing searches such as competing local random searches. If N is the
number of such searches, a trial (or time unit) is spent on the search with
probability where is adapted as time evolves; a possible formula is to
replace by where is a weight, and
are constants, and is the trial point for the competing search. More
energy is spent on promising searches.
This idea was pushed further by several researchers in one form or another.
Several groups realized that when two searches converge to the same local
minimum, many function evaluations could be wasted. Hence the need for
on-line clustering, the detection of points that belong somehow to the same
local valley of the function. See Becker and Lago (1970), Törn (1974, 1976),
de Biase and Frontini (1978), Boender et al (1982), and Rinnooy Kan and
Timmer (1984, 1987).
The picture is now becoming clearer—it pays to keep track of several base
points, i.e., to increase the storage. In Price’s controlled random search for
example (Price, 1983), one has a cloud of points of size about where is
the dimension of the space. A random simplex is drawn from these points, and
the worst point of this simplex is replaced by a trial point, if this trial point is
better. The trial point is picked at random inside the simplex.
Independently, the German school developed the Evolutionsstrategie (Rechen-
berg, 1973; Schwefel, 1981). Here a population of base points gives rise to a
population of trial points. Of the group of trial points, we keep the best N, and
repeat the process.
Bilbro and Snyder (1991) propose tree annealing: all trial points are stored
in tree format, with randomly picked leaves spawning two children. The leaf
probabilities are determined as products of edge probabilities on the path to the
root, and the tree represents the classical tree partition of the space. Their
approach is at the same time computationally efficient and fast.
To deal with high-dimensional spaces, the coordinate projection method of
Zakharov (1969) and Hartman (1973) deserves some attention. Picture the
space as being partitioned by an N × · · · × N regular grid. With each marginal
interval of each coordinate we associate a weight proportional to the likelihood
that the global minimum is in that interval. A cell is grabbed at random in the
grid according to these (product) probabilities, and the marginal weights are
updated. While this method is not fool-proof, it attempts at least to organize global search effort in some logical way.
Consider a population of points, called a generation. By selecting good
points, modifying or mutating good points, and combining two or more good
points, one may generate a new generation, which, hopefully, is an improve-
ment over the parent generation. Iterating this process leads to the evolutionary
search method (Bremermann, 1962, 1968; Rechenberg, 1973; Schwefel, 1977;
Jarvis, 1975) and a body of methods called genetic algorithms (Holland, 1975).
Mutations may be visualized as little perturbations by noise vectors in a con-
tinuous space. However, if is the space then mutations become bit
flips, and combinations of points are obtained by merging bit strings in some
way. The term cross-over is often used. In optimization on graphs, mutations
correspond to picking a random neighbor. The selection of good points may be
extinctive or preserving, elitist or non-elitist. It may be proportional or based on
ranks. As well, it may be adaptive and allow for immigration (new individuals).
In some cases, parents never die and live in all subsequent generations. The pop-
ulation size may be stable or explosive. Intricate algorithms include parameters
of the algorithm itself as part of the genetic structure. Convergence is driven
by mutation and can be proved under conditions not unlike those of standard
random search. Evolution strategies aim to mimic true biological evolution.
In this respect, the early work of Bremermann (1962) makes for fascinating
reading. Ackley’s thesis (1987) provides some practical implementations. In
a continuous space, the method of generations as designed by Ermakov and
Zhigljavsky (1983) lets the population size change over time. To form a new
generation, parents are picked with probability proportional to

and random perturbation vectors are added to each individual, where is to be specified. The latter are distributed as where the ’s are i.i.d. and
is a time-dependent scale factor. This tends to maximize Q if we let tend
to infinity at a certain rate. For more recent references, see Goldberg (1989),
Schwefel (1995) or Banzhaf, Nordin and Keller (1998).

4. NOISY OPTIMIZATION BY RANDOM SEARCH: A BRIEF SURVEY
Here is a rather general optimization problem: for each point we can
observe a random process with almost surely, where
Q is the function to be minimized. We refer to this as the noisy optimization
model. For example, at we can observe independent copies of
where is measurement noise satisfying and Averaging
these observations naturally leads to a sequence with the given convergence property. In simulation optimization, may represent a simulation run for a
system parametrized by It is necessary to take large for accuracy, but taking
too large would be wasteful for optimization. Beautiful compromises are
awaiting the analyst. Finally, in some cases, is known to be the expected
value or an integral, as in or where A
is a fixed set and T is a given random variable. In both cases, may represent
a certain Monte Carlo estimate of which may be made as accurate as
desired by taking large enough.
By additive noise, we mean that each is corrupted by an independent
realization of a random variable Z, so that we can only observe The
first question to ask is whether ordinary random search is still convergent. For-
mally, if are independent realizations of Z, the algorithm generates
trials and at observes Then is defined
as the trial point among with the lowest value Assume that
with probability at least is sampled according to a fixed distribution
with support on Even though the decisions are arbitrary, as in simulated
annealing, and even though there is no converging temperature factor, the above
algorithm may be convergent in some cases, i.e., in probability.
For stable noise, i.e., noise with distribution function G satisfying

such as normally distributed noise, or indeed any noise with tails that decrease to zero faster than exponentially, we have convergence in the given sense. The reader should not confuse our notion of stability, which is taken from the order statistics literature (Geffroy, 1958), with that of the stable distribution.
Stable noise is interesting because an i.i.d. sequence drawn from G,
satisfies in probability for some sequence See
for example Rubinstein and Weissman (1979). Additional results are presented
in this paper.
In noisy optimization in general, it is possible to observe a sample drawn
from distribution at each with possibly different for each The mean
of is If there are just two and the probe points selected by us
are where each of the is one of the then the purpose in
bandit problems is to minimize

in some sense (by, e.g., keeping small). This minimization is with respect to the sequential choice of the Obviously, we would like all
to be exactly at the best but that is impossible since some sampling of the
non-optimal value or values is necessary. Similarly, we may sometimes wish
to minimize

where is the global minimum of Q. This is relevant whenever we want to optimize a system on the fly, such as an operational control system or a game-
playing program. Strategies have been developed based upon certain parametric
assumptions on the or in a purely nonparametric setting. A distinction is
also made between finite horizon and infinite horizon solutions. With a finite
number of bandits, if at least one is nondegenerate, then for any algorithm,
we must have for some constant on some optimization
problem (Robbins, 1952; Lai and Robbins, 1985).
In the case of bounded noise, Yakowitz and Lowe (1991) devised a play-the-
leader strategy in which the trial point is the best point seen thus far (based
on averages) unless for some integer ( and are fixed positive
numbers), at which times is picked at random from all possible choices.
This guarantees Thus, the optimum is missed at most log
times out of
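One reading of this strategy is sketched below, with the garbled exploration condition replaced by a geometric schedule of forced-exploration times (our assumption), so that only O(log n) of the first n rounds are exploratory:

import numpy as np

def play_the_leader(arms, n, beta=1.5, rng=None):
    """Play-the-leader for finitely many noisy alternatives: sample the arm
    with the best running average, except at sparse forced-exploration
    times, when an arm is chosen uniformly at random.  Sketch only."""
    rng = rng or np.random.default_rng(4)
    k = len(arms)
    sums, counts = np.zeros(k), np.zeros(k)
    explore, t, j = set(), 0, 1
    while t <= n:                            # geometric exploration times
        t = int(np.ceil(beta ** j))
        explore.add(t)
        j += 1
    for i in range(k):                       # one initial sample per arm
        sums[i] += arms[i]()
        counts[i] += 1
    for t in range(k + 1, n + 1):
        if t in explore:
            a = int(rng.integers(k))           # forced exploration
        else:
            a = int(np.argmin(sums / counts))  # play the leader
        sums[a] += arms[a]()
        counts[a] += 1
    return int(np.argmin(sums / counts))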
Another useful strategy for parametric families was proposed by Lai
and Robbins (1985). Here confidence intervals are constructed for all
The with the smallest lower confidence interval endpoint is sampled.
Exact lower bounds were derived by them for this situation. For two normal
distributions with means and variances and Holland (1973)
showed that
Yakowitz and Lugosi (1989) illustrate how one may optimize an evaluation
function on-line in the Japanese game of gomoku. Here each represents
a Bernoulli distribution and is nothing but the probability of winning
against a random opponent with parameters
In a noisy situation when is uncountable, we may minimize Q if we are
given infinite storage. More formally, let be trial points, with the
only restriction being that at each with probability at least is sampled
from a distribution whose support is the whole space (such as the normal
density, or the uniform density on a compact). The support of a random variable
X is the smallest closed set S such that We also make sure
that at least observations are available for each at time If the noise
is additive, we may consider the pairings for all the observations at each
of and recording all values of the number of wins of over
where a win occurs when for a pair of observations
For each let and define as the
trial point with maximal value. If and then
almost surely (Devroye, 1977; Fisher and Yakowitz, 1973).
Interestingly, there are no conditions whatever on the noise distribution. With averaging instead of a statistic based on ranks, a tail condition on the noise
would have been necessary. Details and proofs are provided in this paper. For
non-additive noise,

for all (where Y is drawn from ) suffices, for example, when is obtained by minimizing the at the trial points.
Gurin (1966) was the first to explore the idea of averages of repeated mea-
surements. Assume again the condition on the selection of trial points and
let denote the average of observations. Then, if Gurin proceeds
by setting

This is contrary to all principles of simulated annealing, as we are gingerly accepting new best points by virtue of the threshold Devroye (1976) has
obtained some sufficient conditions for the strong convergence of
One set includes and
(a very strong condition indeed). If and for each is
stochastically smaller than Z where for some then and
are sufficient as well. In the latter case, the conditions ensure that, with probability one, we make a finite number of incorrect decisions.
references along the same lines include Marti (1982), Pintér (1984), Karmanov
(1974), Solis and Wets (1978), Koronacki (1976) and Tarasenko (1977).

5. OPTIMIZATION AND NONPARAMETRIC ESTIMATION
To extract the maximum amount of information from past observations, we
might store these observations and construct a nonparametric estimate of the
regression function where Y is an observation from
Assume that we have pairs where a diverging number of
are drawn from a global distribution, and the are corresponding noisy
observations. Estimate by which may be obtained by averaging
those whose is among the nearest neighbors of It should be obvious
that if almost surely, then almost surely if
To this end, it suffices for example that
that the noise be uniformly bounded, and that be compact.
Such nonparametric estimates may also be used to identify local minima.
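A minimal nearest-neighbour version of such an estimate (names are ours; consistency needs the number of neighbours to grow while its fraction of the sample shrinks, as stated above):

import numpy as np

def knn_regression_estimate(x, X, Y, k):
    """Estimate Q(x) = E[Y | X = x] by averaging the Y's attached to the k
    stored probe points closest to x.  Illustrative sketch only."""
    d = np.linalg.norm(np.asarray(X) - np.asarray(x), axis=1)
    nearest = np.argsort(d)[:k]        # indices of the k nearest probes
    return float(np.mean(np.asarray(Y)[nearest]))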
6. NOISY OPTIMIZATION: FORMULATION OF THE PROBLEM
We consider a search problem on a subset B of where is the
probability distribution of a generic random variable X that has support on
B. Typically, is the uniform distribution on B. For every it is possible
to obtain an i.i.d. sequence distributed as where
is a random variable (“the noise”) with a fixed but unknown distribution. We
can demand to see as little or as much of the sequence as we wish. With this formulation, it is still possible to define a random search
procedure such that almost surely for all search problems
and all distributions of Note that we do not even assume that has
a mean. Throughout this paper, F is the distribution function of
The purpose of this paper is to draw attention to such universally convergent
random search algorithms that do not place any conditions on and F, just as
Sid Yakowitz showed us in 1973 (Yakowitz and Fisher, 1973).

7. PURE RANDOM SEARCH


In this section, we analyze the behavior of unaltered pure random search
under additive noise. The probe points form an i.i.d. sequence drawn
from a distribution with probability distribution At each probe point we
observe where the ’s are i.i.d. random variables distributed
as Then we define

This is nothing but the pure random search algorithm, employed as if we were
unaware of the presence of any noise. Our study of this algorithm will reveal
how noise-sensitive or robust pure random search really is. Not unexpectedly,
the behavior of the algorithm depends upon the nature of the noise distribution.
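In code, the algorithm just defined is simply (our sketch; noise() returns one independent copy of the additive noise):

import numpy as np

def noisy_pure_random_search(Q, sample, noise, n):
    """Pure random search run as if unaware of the noise: each i.i.d. probe
    is evaluated once through Q(x) + Z, and the probe with the smallest
    noisy value is reported.  Whether this converges depends on the noise
    tail, as analyzed below."""
    best_x, best_v = None, np.inf
    for _ in range(n):
        x = sample()
        v = Q(x) + noise()             # a single corrupted evaluation
        if v < best_v:
            best_x, best_v = x, v
    return best_x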
The noise will be called stable if for all

where G is the distribution function of and 0/0 is considered as zero. This will be called Gnedenko’s condition (see Lemma 1). A sufficient condition for
stability is that G has a density and

EXAMPLES. If G does not have a left tail (i.e., for some ), then the noise is stable. Normal noise is also stable, but double exponential noise is not. In fact, the exponential distribution is on the
borderline between stability and non-stability. Distributions with a diverging hazard rate as we travel from the origin out to are stable. Thus, stable
noise distributions have small left tails. In fact, for all

The reason why stable noise will turn out to be manageable is that min is basically known to fall into an interval of arbitrarily small positive length around
some deterministic value with probability tending to one for some sequence
It could thus happen that as yet this is not a problem.
This was also observed by Rubinstein and Weissman (1979). We now obtain a necessary and sufficient condition for the weak convergence of for the pure random search algorithm.
THEOREM 1. If is stable, then in probability. Conversely,
if is not stable, then does not tend to in probability for any search
problem for which for all small enough, > 0.
We picked the name “stable noise” because the minimum of is
stable in the sense used in the literature on order statistics, that is, there exists a
sequence such that in probability. We will prove the minimal
properties needed further on in this section. The equivalence property A of
Lemma 1 is due to Gnedenko (1943), while parts B and C are inherent in the
fundamental paper of Geffroy (1958).
LEMMA 1.
A. G is the distribution function of stable noise if and only if in
probability for some sequence

B. If in probability for some sequence then


and as for all Also, if
then and as for all
Note:

C. If the noise distribution is not stable, then there exist positive constants
a sequence a subsequence and an such that
and for all
PROOF. We begin with property B. Note that by assumption,
and thus Also, implies
Observe that for any
This shows that eventually, Thus,
and
Let us turn to A. We first show that B implies Gnedenko’s condition. We
can assume without loss of generality that is monotone decreasing since
can be replaced by in view of property B. For every we find
such that Thus, and
Thus,

The case of bounded is trivial, so assume Now let (and thus ), and deduce that Since is
arbitrary, we obtain Gnedenko’s condition.
Next, part A follows if we can show that Gnedenko’s condition implies the
existence of the sequence ; proving the existence of is equivalent to proving
the existence of such that and for all
Let us take From the definition of G, we note that for
any Thus, by the Gnedenko
condition, for

Similarly,

This concludes the proof of part A.


For part C, we see that necessarily We define By
assumption, there exists an a sequence and an such that
for all Next, by definition of we note that
for infinitely many indices we have These define the
subsequence that we will use. Observe that for all while for
all with

PROOF OF THEOREM 1. Let F be the distribution function of and let G be the distribution function of We first show that stable noise
is sufficient for convergence. For brevity, we denote by Furthermore,
is an arbitrary constant, and Observe that the event
is implied by where is the event that for some
and simultaneously, and is the event that
for all we either have or The convergence
follows if we can show that and as
This tends to zero by property B of Lemma 1. Next,

where we used property B of Lemma 1 again. This concludes the proof of the
sufficiency.
The necessity is obtained as follows. Since G is not stable, we can find
positive constants a sequence a subsequence and an such
that and for all (Lemma 1, property C). Let
be in this subsequence and let be the multinomial
random vector with the number of values in
and respectively. We first condition on this vector. Clearly, if for
some in the second interval we have while for all
in the first interval, we have then Thus,
the conditional probability of this event is

To un-condition, we use the multinomial moment generating function

where and are the parameters of the multinomial distribution. This yields

provided that This can be guaranteed, since we can make smaller without compromising the validity of the lower bound. This concludes
the proof of Theorem 1.
8. STRONG CONVERGENCE AND STRONG STABILITY
There is a strong convergence analog of Theorem 1. We say that G is strongly
stable noise if the minimal order statistic of is strongly stable, i.e.,
there exists a sequence of numbers such that almost surely as

THEOREM 2. If is strongly stable, then almost surely.


PROOF. Since strong stability implies stability, we recall from Lemma 1
that we can assume without loss of generality that For
let Observe that in any case as Let
be arbitrary, and let be the set of indices between 1 and for which
Let denote the cardinality of this set. As is binomial
it is easy to see that except possibly finitely often
with probability one. Define

Since almost surely, we have and almost surely. Define Consider the following inclusion of events
(assuming for convenience that is integer-valued):

The event on the right hand side has zero probability. Hence so does the event
on the left hand side.
It is more difficult to provide characterizations of strongly stable noises,
although several sufficient and a few necessary conditions are known. For an
in-depth treatment, we refer to Geffroy (1958). It suffices to summarize a few
key results here. The following condition due to Geffroy is sufficient:

This function comes close to being necessary. Indeed, if G is strongly stable, and
is monotone in the left tail beyond some point, then Geffroy’s
condition must necessarily hold. If G has a density then another sufficient
condition is that
It is easy to verify now that noise with density is strongly stable for and is not stable when The borderline is once again
close to the double exponential distribution. To more clearly identify the
threshold cases consider the noise distribution function given by
where is a positive function. It can
be shown that for constant the noise is stable but not strongly
stable. However, if as then G is strongly stable (Geffroy,
1958).

9. MIXED RANDOM SEARCH


Assume next that we are not using pure random search, in the hope of assuring
consistency for more noise distributions, or speeding up the method of the
previous section. A certain minimum amount of global search is needed in
any case. So, we will consider the following prototype model of an algorithm:
has distribution given by where is a sequence
of probabilities, and is an arbitrary probability distribution that may depend
upon the past (all with and all observations made up to time ).
We call this mixed random search, since with probability the trial point is
generated according to the pure random search distribution In the noiseless
case, is necessary and sufficient for strong convergence of
to for all search problems and all ways of choosing One is
tempted to think that under stable noise and the same condition on we still
have at least weak convergence. Unfortunately, this is not so. What has gone
wrong is that it is possible that too few probe points have small Q-values, and
that the smallest corresponding to these probe values is not small enough to
“beat” the smallest among the other probe values. In the next theorem, we
establish that convergence under stable noise conditions can only be guaranteed
when a positive fraction of the search effort is spent on global search, e.g. when
Otherwise, we can still have convergence of but we
lose the guarantee, as there are several possible counterexamples.

THEOREM 3. If for some then under stable noise conditions, in probability, and under strong stable noise
conditions, almost surely. Conversely, if
then there exists a search problem, a stable noise distribution G, and a way
of choosing the sequence such that does not converge weakly to
PROOF. We only prove the first part with strong stability. We mimic
the proof of Theorem 2 with the following modification: let be the set
of indices between 1 and for which and is generated
according to the portion of the removed mixture distribution. Note that
is distributed as where the are i.i.d. Bernoulli random variables
with parameter Note that almost surely,
so that except possibly finitely often with probability one. Then apply the event inclusion of Theorem 2 with The weak
convergence is obtained in a similar fashion from the inclusion

where we use the notation of Theorem 2. All events on the right-hand side have
probability tending to zero with

10. STRATEGIES FOR GENERAL ADDITIVE NOISE


From the previous sections, we conclude that under some circumstances,
noise can be tolerated in pure random search. However, it is very difficult
to verify whether the noise at hand is indeed stable; and the rate of conver-
gence takes a terrible beating with some stable noise distributions. There are
algorithms whose convergence is guaranteed under all noise distributions, and
whose rate of convergence depends mainly upon the search distribution F, and
not on the noise distribution! Such niceties do not come free: a slower rate of
convergence results even when the algorithm operates under no noise; and in
one of the two strategies discussed further on, the storage requirements grow
unbounded with time.
How can we proceed? If we stick to the idea of trying to obtain improvements
of by comparing observations drawn at with observations drawn at
then we should be very careful not to accept as the new unless
we are reasonably sure that Thus, several noise-perturbed
observations are needed at each point, and some additional protection is needed
in terms of thresholds that give the edge in a comparison. This is only
natural, since embodies all the information gathered so far, and we should
not throw it away lightly. To make the rate of convergence less dependent upon
the noise distribution, we should consider comparisons between observations
that are based on the relative ranks of these only. This kind of solution was first
proposed in Devroye (1977).
Yakowitz and Fisher (1973) proposed another strategy, in which no informa-
tion is ever thrown away. We store for each all the observations ever made at
it. At time draw more observations at all the previous probe points and at
a new probe point and choose from the entire pool of probe points.
From an information theoretic point of view, this is a clever though costly pol-
icy. The decision of which probe point to take should be based upon ranks, once
again. Yakowitz and Fisher (1973) employ the empirical distribution functions
of the observations at each probe point. Devroye (1977) in a related approach
advocates the use of the Wilcoxon-Mann-Whitney rank statistic and modifica-
tions of it. Note that the fact that no information is ever discarded makes these
algorithms nonsequential in nature; this is good for parallel implementations,
but notoriously bad when rates of convergence are considered.
Consider first the nonsequential strategy in its crudest form: define as the “best” among the first probe points which in turn are i.i.d.
random vectors with probability distribution Let be a sequence of
integers to be chosen by the user. We make sure that at each time the function
is sampled times at each If the previous observations are not thrown away,
then this means that new observations are necessary,
at and at each of The observations are stored
in a giant array As an index of the “goodness” of
we could consider the average

The best point is the one with the smallest average. Clearly, this strategy cannot
be universal since for good performance, it is necessary that the law of large
numbers applies, and thus that where is the noise random variable.
However, in view of its simplicity and importance, we will return to this solution
in a further subsection.
If we order all the components of the vector so as to
obtain then other measures of the goodness may include
the medians of these components, or “quick and dirty” methods such as
Gastwirth’s statistic (Gastwirth, 1966)

We might define as that point among with the smallest value of
We could also introduce the notion of pairwise competitions between the
for example, we say that wins against if The with the
most wins is selected to be This view leads to precisely that with the
smallest value of However, pairwise competitions can go further, as we
now illustrate. Yakowitz and Fisher (1973) thought it useful to work with the
empirical distribution functions Our approach
may be considered as a tournament between in which matches
are played, one per pair Let

We say that wins its match with when We define as that member of with the most wins. In case of ties, we take the mem-
ber with the smallest index. We call this the tournament method. Rather than
using the Kolmogorov-Smirnov based statistic suggested above, one might also
consider other rank statistics based on medians of observations or upon the gen-
eralized Wilcoxon statistic (Wilcoxon, 1945; Mann and Whitney, 1947), where
only comparisons between are used.
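The sketch below implements the tournament method under one plausible reading
of the match rule (a Kolmogorov-Smirnov-type comparison of the two empirical
distribution functions); the decision rule and all names are stand-ins, not the
authors' exact statistic.

import numpy as np

def edge(a, b):
    # One-sided KS-type statistic: sup_t (F_a(t) - F_b(t)) over the two
    # empirical cdfs; a large value means the observations in a tend to
    # be smaller (better, for minimization) than those in b.
    grid = np.sort(np.concatenate([a, b]))
    fa = np.searchsorted(np.sort(a), grid, side="right") / len(a)
    fb = np.searchsorted(np.sort(b), grid, side="right") / len(b)
    return np.max(fa - fb)

def tournament_winner(samples):
    # Round-robin tournament: samples[i] holds all observations taken at
    # probe point i; one match is played per pair, and ties in the number
    # of wins go to the member with the smallest index, as in the text.
    m = len(samples)
    wins = [0] * m
    for i in range(m):
        for j in range(i + 1, m):
            if edge(samples[i], samples[j]) > edge(samples[j], samples[i]):
                wins[i] += 1
            else:
                wins[j] += 1
    return max(range(m), key=lambda i: (wins[i], -i))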
The number of function evaluations used up to iteration is In addition,
in some cases, some sorting may be necessary, and in nearly all the cases, the
entire array needs to be kept in storage. Also, there are matches to
determine the wins, leading to a complexity, in the iteration alone of about
This seems to be extremely wasteful. We discuss some time-saving
modifications further on. We provide a typical result here (Theorem 4) for the
tournament method.
THEOREM 4. Let be chosen by the tournament method. If
then almost surely as
PROOF. Fix and let G be the noise distribution, and let be the
empirical distribution function obtained from observations taken at
Note that can be considered as an estimate of In fact, by
the Glivenko-Cantelli lemma, we know that
almost surely. But much more is true. By an inequality due to Dvoretzky,
Kiefer and Wolfowitz (1956), in a final form derived by Massart (1990), we
have for all

For we define the positive constant

and let be the event that for all


Clearly, then,

An important event for us is the event that all matches


with have a fair outcome, that is, wins
against Let us compute a bound on the probability of We observe that
so that To see this, fix with
Then if

but this in turn implies that

which is impossible. Thus, every such match has a fair outcome.


Consider next the tournament, and partition the into four groups according
to whether belongs to one of these intervals:
The cardinalities of the groups are
If holds, then any member of group 1 wins its match against any
member of groups 3 and 4, for at least wins. Any member of group 4
can at most win against other members of group 4 or all members of group 3,
for at most wins. Thus, the tournament winner must come from
groups 1, 2 or 3, unless there is no in any of these groups. Thus,

We showed the following:

This is summable in for every so that we can conclude


almost surely by the Borel-Cantelli lemma.
It is a simple exercise to modify Theorem 4 to include mixing schemes, i.e.,
schemes of sampling in which has probability measure
where and is an arbitrary probability measure. We note that
still converges to almost surely if we merely add the standard mixing
condition

Following the proof of Theorem 4, we notice indeed that for arbitrary N ,

Hence,

Here we used an argument as in the proof of Theorem 3 in the


section on mixed global random search. The bound tends to 0 with N, and thus
almost surely if
Consider the simplest sequential strategy, comparable to a new player enter-
ing at each iteration in the tournament, and playing against the best player seen
thus far. Assume that the are i.i.d. with probability measure and that at
iteration we obtain samples and at and
respectively. For comparing performances, we use suitably modified statistics
such as
where is the empirical distribution function for the sample, and is


the empirical distribution function for the other sample. Define according
to the rule

where is a threshold designed to give some advantage to since


the information contained in is too valuable to throw away without some
form of protection. We may introduce mixing as long as we can guarantee
that for any Borel set A, and is the usual mixing
coefficient. This allows us to combine global and local search and to concentrate
global search efforts in promising regions of the search domain.
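A minimal sketch of this sequential strategy follows, with stand-ins for the
sampling measure (plain uniform, i.e., no mixing), the schedule a(n), the
threshold delta(n), and the comparison statistic.

import numpy as np

rng = np.random.default_rng(1)

def sequential_search(f, dim, n_iters, a=lambda n: n,
                      delta=lambda n: 0.25, noise=1.0):
    # The incumbent plays one match per iteration against a fresh uniform
    # challenger; the challenger is accepted only if it wins by more than
    # the protection threshold delta(n), which shields the information
    # accumulated in the incumbent.
    def edge(u, v):                      # sup_t (F_u(t) - F_v(t))
        grid = np.sort(np.concatenate([u, v]))
        fu = np.searchsorted(np.sort(u), grid, side="right") / len(u)
        fv = np.searchsorted(np.sort(v), grid, side="right") / len(v)
        return np.max(fu - fv)

    x = rng.uniform(0.0, 1.0, size=dim)              # incumbent
    for n in range(1, n_iters + 1):
        y = rng.uniform(0.0, 1.0, size=dim)          # challenger
        fx = f(x) + noise * rng.standard_normal(a(n))
        fy = f(y) + noise * rng.standard_normal(a(n))
        if edge(fy, fx) > edge(fx, fy) + delta(n):   # clear win required
            x = y
    return x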
THEOREM 5 (Devroye, 1978). Assume that the above sequential strategy is
used with mixing, and that comparisons are based upon Assume further-
more that If
then in probability as If

then with probability one as


PROOF. We will use the fact that if in probability, and

then almost surely. This follows easily from the inequality

Now,
where we used the Dvoretzky-Kiefer-Wolfowitz inequality. Thus, the summa-


bility condition is satisfied. To obtain the weak convergence, we argue as
follows. Define

and

Then

A bit of analysis shows that when and either


or But we have already seen that

Define Then, for

Thus, when and either or


we have weak convergence of to This concludes the proof of The-
orem 5.
Choosing and is an arbitrary process. Can we do without? For example,
can we boldly choose and still have guaranteed convergence for all
search problems and all noises? The answer is yes, provided that increases
faster than quadratically in This result may come as a bit of a surprise, since
we based the proof of Theorem 5 on the observation that the event
occurs finitely often with probability one. We will now allow this
event to happen infinitely often with any positive probability, but by increasing
quickly enough, the total sum of the "bad" moves
is finite almost surely.
THEOREM 6. Assume that the sequential strategy is used with mixing,
and that comparisons are based upon Assume furthermore that
and Then with probability


one as
PROOF. Let be arbitrary. Observe that

Thus, strong convergence follows from weak convergence if we can prove that

This follows if

and

By the Beppo-Levi theorem, the former condition is implied by

By the Beppo-Levi theorem, the latter condition is implied by

Define

We recall that

where

We note that if L is the distance between the third and first quartiles of G,
then This is easily seen by partitioning the two quartile
interval of length L into intervals of length or less. The
maximum probability content of one of these intervals is at least


From the proof of Theorem 5,

This is summable in as required. Also,

and this is summable in To establish the weak convergence of we


begin with the following inclusion of events, in which is a positive integer:

where

Here we used the notation of the proof of Theorem 5. By estimates


obtained there, we note that
provided that is large enough. For any fixed we can choose so large that
this upper bound is smaller than a given small constant. Thus,
is smaller than a given small constant if we first choose large enough,
and then choose appropriately. This concludes the proof of Theorem 6 when
is used.
Both the sequential and nonsequential strategies can be applied to the case
in which we compare points on the basis of the average of observations
made at This is in fact nothing more than the situation we will encounter
when we wish to minimize a regression function. Indeed, taking averages
would only make sense when the mean exists. Assume thus that we have the
regression model

where is a real-valued function, and is some random variable. For fixed


we cannot observe realizations of but rather an i.i.d. sample
where and the are i.i.d. In the additive noise case, we have the
special form where is a zero mean random variable. We
first consider the nonsequential model, in which the probe points are i.i.d. with
probability measure The convergence is established in Theorem 7. Clearly,
the choice of (and thus our cost) increases with the size of the tail or tails of
In the best scenario, should increase faster than
THEOREM 7. Let be chosen on the basis of the smallest value of
Assume that is such that Then in proba-
bility as when condition A holds: for some
(a sufficient condition for this is and
We have strong convergence if B or C hold: (condition B) for all
in an open neighborhood of the origin, and (con-
dition C) for some and
Finally, for any noise with zero moment, there exists a sequence
such that almost surely as
PROOF. Let be the event that for all We
note that

Arguing as in the proof of Theorem 4, we have


This implies weak convergence of to if From the


work of Baum and Katz (1965) (see Petrov (1975, pp. 283–286)), we retain
that for all if where
If for some then by Theorem 28 of Petrov (1975),
so condition A is satisfied. Finally, if
for all in an open neighborhood of the origin, then for
some constant (see e.g. Petrov (1975, pp. 54-55)). This proves the
weak convergence portion of the theorem.
The strong convergence under condition B follows without trouble from the
above bounds and the Borel-Cantelli lemma. For condition C, we need the fact
that if for every where is a
constant, then we have

(see Petrov, 1975, p. 284). Now observe that under condition C,

for some constant This tends to 0 with N.


The last part of the theorem follows from the weak law of large numbers.
Indeed, there exists a function with as such that
Thus, Clearly, this is
summable in if we choose so large that Therefore, the last
part of the theorem follows by the Borel-Cantelli lemma.
The remarks regarding rates of convergence made following Theorem 4 apply
here as well. What is new, though, is that we have
lost the universality, since we have to impose conditions on the noise. If we


apply the algorithm with some predetermined choice of we have no guaran-
tee whatsoever that the algorithm will be convergent. And even if we knew the
noise distribution, it may not be possible to avoid divergence for any manner of
choosing

11. UNIVERSAL CONVERGENCE


In the search for universal optimization methods, we conclude with the fol-
lowing observation. Let Q be a function on the positive integers with finite
infimum. Assume that for each there exists an infinite
sequence of i.i.d. random variables called the observa-
tions. We have A search algorithm is a sequence of
functions

where as a function of describes a distribution. The sequence


is called the history. For each starting with
we generate a pair from the distribution given by This
pair allows us to look at Thus, after iterations, we have ac-
cessed at most observations. A search algorithm needs a function of
that maps to to determine which integer is
taken as the best estimate of the minimum:

A search algorithm is a sequence of mappings A search algorithm is


universally convergent if for all functions Q with and all
distributions of in probability. We do
not know if a universally convergent search algorithm exists. The difficulty
of the question follows from the following observation. At time we have
explored at most integers and looked at at most observations. Assume
that we have observations at each of the first integers (consider this as a
present of additional observations). Let us average these observations,
and define as the integer with the smallest average. While at each integer,
the law of large numbers holds, it is not true that the averages converge at the
same rate to their means, and this procedure may actually see diverge
to infinity in some probabilistic sense.
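The naive averaging procedure from this observation is easy to state in code;
the sketch below uses a hypothetical Q and heavy-tailed (Cauchy) noise, for
which the averages need not settle down, illustrating why universality is in
doubt.

import numpy as np

rng = np.random.default_rng(2)

def naive_average_search(Q, noise_draw, n):
    # Average n observations at each of the first n integers and return the
    # integer with the smallest average; without moment conditions on the
    # noise, this choice may drift off in some probabilistic sense.
    avgs = [np.mean(Q(k) + noise_draw(n)) for k in range(1, n + 1)]
    return 1 + int(np.argmin(avgs))

best = naive_average_search(lambda k: (k - 5) ** 2,
                            lambda m: rng.standard_cauchy(m), n=100)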

REFERENCES
Aarts, E. and J. Korst. (1989). Simulated Annealing and Boltzmann Machines,
John Wiley, New York.
Ackley, D.H. (1987). A Connectionist Machine for Genetic Hillclimbing, Kluwer
Academic Publishers, Boston.
Aluffi-Pentini, F., V. Parisi, and F. Zirilli. (1985). “Global optimization and
stochastic differential equations,” Journal of Optimization Theory and Ap-
plications, vol. 47, pp. 1–16.
Anily, S. and A. Federgruen. (1987). “Simulated annealing methods with gen-
eral acceptance probabilities,” Journal of Applied Probability, vol. 24, pp. 657–
667.
Banzhaf, W., P. Nordin, and R. E. Keller. (1998). Genetic Programming: An
Introduction: On the Automatic Evolution of Computer Programs and Its
Applications, Morgan Kaufman, San Mateo, CA.
Baum, L. and M. Katz. (1965). “Convergence rates in the law of large numbers,”
Transactions of the American Mathematical Society, vol. 121, pp. 108–123.
Becker, R.W. and G. V. Lago. (1970). “A global optimization algorithm,” in:
Proceedings of the 8th Annual Allerton Conference on Circuit and System
Theory, pp. 3–12.
Bekey, G.A. and M. T. Ung. (1974). “A comparative evaluation of two global
search algorithms,” IEEE Transactions on Systems, Man and Cybernetics,
vol. SMC-4, pp. 112–116.
Bilbro, G.L. and W. E. Snyder. (1991). “Optimization of functions with
many minima,” IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-
21, pp. 840–849.
Boender, C.G.E., A. H. G. Rinnooy Kan, L. Stougie, and G. T. Timmer. (1982).
“A stochastic method for global optimization,” Mathematical Programming,
vol. 22, pp. 125–140.
Bohachevsky, I.O., M. E. Johnson, and M. L. Stein. (1986). “Generalized simu-
lated annealing for function optimization,” Technometrics, vol. 28, pp. 209–
217.
Bremermann, H.J. (1962). “Optimization through evolution and recombina-
tion,” in: Self-Organizing Systems, (edited by M. C. Yovits, G. T. Jacobi and
G. D. Goldstein), pp. 93–106, Spartan Books, Washington, D.C.
Bremermann, H.J. (1968). “Numerical optimization procedures derived from
biological evolution processes,” in: Cybernetic Problems in Bionics, (edited
by H. L. Oestreicher and D. R. Moore), pp. 597–616, Gordon and Breach
Science Publishers, New York.
Brooks, S.H. (1958). “A discussion of random methods for seeking maxima,”
Operations Research, vol. 6, pp. 244–251.
Brooks, S.H. (1959). “A comparison of maximum-seeking methods,” Opera-
tions Research, vol. 7, pp. 430–457.
Bäck, T., F. Hoffmeister, and H.-P. Schwefel. (1991). “A survey of evolution
strategies,” in: Proceedings of the Fourth International Conference on Ge-
netic Algorithms, (edited by R. K. Belew and L. B. Booker), pp. 2–9, Morgan
Kaufman Publishers, San Mateo, CA.
Cerny, V. (1985). “Thermodynamical approach to the traveling salesman prob-
lem: an efficient simulation algorithm,” Journal of Optimization Theory and
Applications, vol. 45, pp. 41–51.
Dekkers, A. and E. Aarts. (1991). “Global optimization and simulated anneal-
ing,” Mathematical Programming, vol. 50, pp. 367–393.
Devroye, L. (1972). “The compound random search algorithm,” in: Proceed-
ings of the International Symposium on Systems Engineering and Analysis,
Purdue University, vol. 2, pp. 195–110.
Devroye, L. (1976). “On the convergence of statistical search,” IEEE Transac-
tions on Systems, Man and Cybernetics, vol. SMC-6, pp. 46–56.
Devroye, L. (1976). “On random search with a learning memory,” in: Pro-
ceedings of the IEEE Conference on Cybernetics and Society, Washington,
pp. 704–711.
Devroye, L. (1977). “An expanding automaton for use in stochastic optimiza-
tion,” Journal of Cybernetics and Information Science, vol. 1, pp. 82–94.
Devroye, L. (1978a). “The uniform convergence of nearest neighbor regression
function estimators and their application in optimization,” IEEE Transactions
on Information Theory, vol. IT-24, pp. 142–151.
Devroye, L. (1978b). “Rank statistics in multimodal stochastic optimization,”
Technical Report, School of Computer Science, McGill University.
Devroye, L. (1978c). “Progressive global random search of continuous func-
tions,” Mathematical Programming, vol. 15, pp. 330–342.
Devroye, L. (1979). “Global random search in stochastic optimization prob-
lems,” in: Proceedings of Optimization Days 1979, Montreal.
de Biase, L. and F. Frontini. (1978). “A stochastic method for global optimiza-
tion: its structure and numerical performance,” in: Towards Global Opti-
mization 2, (edited by L. C. W. Dixon and G. P. Szegö), pp. 85–102, North
Holland, Amsterdam.
Dvoretzky, A., J. C. Kiefer, and J. Wolfowitz. (1956). “Asymptotic minimax
character of the sample distribution function and of the classical multinomial
estimator,” Annals of Mathematical Statistics, vol. 27, pp. 642–669.
Ermakov, S.M. and A. A. Zhiglyavskii. (1983). “On random search for a global
extremum,” Theory of Probability and its Applications, vol. 28, pp. 136–141.
Ermoliev, Yu. and R. Wets. (1988). “Stochastic programming, an introduction,”
in: Numerical Techniques of Stochastic Optimization, (edited by R. J.-B. Wets
and Yu. M. Ermoliev), pp. 1–32, Springer-Verlag, New York.
Fisher, L. and S. J. Yakowitz. (1976). “Uniform convergence of the potential
function algorithm,” SIAM Journal on Control and Optimization, vol. 14,
pp. 95–103.
Gastwirth, J.L. (1966). “On robust procedures,” Journal of the American Sta-
tistical Association, vol. 61, pp. 929–948.
Gaviano, M. (1975). “Some general results on the convergence of random search
algorithms in minimization problems,” in: Towards Global Optimization,
(edited by L. C. W. Dixon and G. P. Szegö), pp. 149–157, North Holland,
New York.
Geffroy, J. (1958). “Contributions à la théorie des valeurs extrêmes,” Publica-
tions de l’Institut de Statistique des Universités de Paris, vol. 7, pp. 37–185.
Gelfand, S.B. and S. K. Mitter. (1991). “Weak convergence of Markov chain
sampling methods and annealing algorithms to diffusions,” Journal of Opti-
mization Theory and Applications, vol. 68, pp. 483–498.
Geman, S. and C.-R. Hwang. (1986). “Diffusions for global optimization,”
SIAM Journal on Control and Optimization, vol. 24, pp. 1031–1043.
Gidas, B. (1985). “Global optimization via the Langevin equation,” in: Proceed-
ings of the 24th IEEE Conference on Decision and Control, Fort Lauderdale,
pp. 774–778.
Gnedenko, B.V. (1943). “Sur la distribution du terme maximum d’une série
aléatoire,” Annals of Mathematics, vol. 44, pp. 423–453.
Goldberg, D.E. (1989). Genetic Algorithms in Search, Optimization and Ma-
chine Learning, Addison-Wesley, Reading, Mass.
Gurin, L.S. (1966). “Random search in the presence of noise,” Engineering
Cybernetics, vol. 4, pp. 252–260.
Gurin, L.S. and L. A. Rastrigin. (1965). “Convergence of the random search
method in the presence of noise,” Automation and Remote Control, vol. 26,
pp. 1505–1511.
Haario, H. and E. Saksman. (1991). “Simulated annealing process in general
state space,” Advances in Applied Probability, vol. 23, pp. 866–893.
Hajek, B. (1988). “Cooling schedules for optimal annealing,” Mathematics of
Operations Research, vol. 13, pp. 311–329.
Hajek, B. and G. Sasaki. (1989). “Simulated annealing—to cool or not,” Systems
and Control Letters, vol. 12, pp. 443–447.
Holland, J.H. (1973). “Genetic algorithms and the optimal allocation of trials,”
SIAM Journal on Computing, vol. 2, pp. 88–105.
Holland, J.H. (1992). Adaptation in Natural and Artificial Systems : An In-
troductory Analysis With Applications to Biology, Control, and Artificial
Intelligence, MIT Press, Cambridge, Mass.
Jarvis, R.A. (1975). “Adaptive global search by the process of competitive
evolution,” IEEE Transactions on Systems, Man and Cybernetics, vol. SMC-
5, pp. 297–311.
Johnson, D.S., C. R. Aragon, L. A. McGeoch, and C. Schevon. (1989). “Opti-
mization by simulated annealing: an experimental evaluation; part I, graph
partitioning,” Operations Research, vol. 37, pp. 865–892.
Rinnooy Kan, A.H.G. and G. T. Timmer. (1984). “Stochastic methods for global
optimization,” American Journal of Mathematical and Management Sci-
ences, vol. 4, pp. 7–40.
Karmanov, V.G. (1974). “Convergence estimates for iterative minimization
methods,” USSR Computational Mathematics and Mathematical Physics,
vol. 14(1), pp. 1–13.
Kiefer, J. and J. Wolfowitz. (1952). “Stochastic estimation of the maximum of a
regression function,” Annals of Mathematical Statistics, vol. 23, pp. 462–466.
Kirkpatrick, S., C. D. Gelatt, and M. P. Vecchi. (1983). “Optimization by sim-
ulated annealing,” Science, vol. 220, pp. 671–680.
Koronacki, J. (1976). “Convergence of random-search algorithms,” Automatic
Control and Computer Sciences, vol. 10(4), pp. 39–45.
Kushner, H.J. (1987). “Asymptotic global behavior for stochastic approxima-
tion via diffusion with slowly decreasing noise effects: global minimization
via Monte Carlo,” SIAM Journal on Applied Mathematics, vol. 47, pp. 169–
185.
Lai, T.L. and H. Robbins. (1985) “Asymptotically efficient adaptive allocation
rules,” Advances in Applied Mathematics, vol. 6, pp. 4–22.
Mann, H.B. and D. R. Whitney. (1947). “On a test of whether one or two random
variables is stochastically larger than the other,” Annals of Mathematical
Statistics, vol. 18, pp. 50–60.
Marti, K. (1982). “Minimizing noisy objective functions by random search
methods,” Zeitschrift für Angewandte Mathematik und Mechanik, vol. 62,
pp. T377–T380.
Marti, K. (1992). “Stochastic optimization in structural design,” Zeitschrift für
Angewandte Mathematik und Mechanik, vol. 72, pp. T452–T464.
Massart, P. (1990). “The tight constant in the Dvoretzky-Kiefer-Wolfowitz in-
equality,” Annals of Probability, vol. 18, pp. 1269–1283.
Matyas, J. (1965). “Random optimization,” Automation and Remote Control,
vol. 26, pp. 244–251.
Meerkov, S.M. (1972). “Deceleration in the search for the global extremum of
a function,” Automation and Remote Control, vol. 33, pp. 2029–2037.
Metropolis, N., A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, and E. Teller.
(1953). “Equation of state calculation by fast computing machines,” Journal
of Chemical Physics, vol. 21, pp. 1087–1092.
Mockus, J.B. (1989). Bayesian Approach to Global Optimization, Kluwer Aca-
demic Publishers, Dordrecht, Netherlands.
Männer, R. and H.-P. Schwefel. (1991). Parallel Problem Solving from Nature,
Lecture Notes in Computer Science, vol. 496, Springer-Verlag, Berlin.
Petrov, V.V. (1975). Sums of Independent Random Variables, Springer-Verlag,
Berlin.
Pinsky, M.A. (1991). Lectures on Random Evolution, World Scientific
Publishing Company, Singapore.
Pintér, J. (1984). “Convergence properties of stochastic optimization proce-
dures,” Mathematische Operationsforschung und Statistik, Series Optimiza-
tion, vol. 15, pp. 405–427.
Pintér, J. (1996). Global Optimization in Action, Kluwer Academic Publishers,
Dordrecht.
Price, W.L. (1983). “Global optimization by controlled random search,” Journal
of Optimization Theory and Applications, vol. 40, pp. 333–348.
Rechenberg, I. (1973). Evolutionsstrategie—Optimierung technischer Systeme
nach Prinzipien der biologischen Evolution, Frommann-Holzboog, Stuttgart.
Rinnooy Kan, A.H.G. and G. T. Timmer. (1987). “Stochastic global optimiza-
tion methods part II: multi level methods,” Mathematical Programming,
vol. 39, pp. 57–78.
Rinnooy Kan, A.H.G. and G. T. Timmer. (1987). “Stochastic global opti-
mization methods part I: clustering methods,” Mathematical Programming,
vol. 39, pp. 27–56.
Robbins, H. (1952). “Some aspects of the sequential design of experiments,”
Bulletin of the American Mathematical Society, vol. 58, pp. 527–535.
Rubinstein, R. Y. and I. Weissman. (1979). “The Monte Carlo method for global
optimization,” Cahiers du Centre d’Etude de Recherche Operationelle, vol. 21,
pp. 143–149.
Schumer, M.A. and K. Steiglitz. (1968). “Adaptive step size random search,”
IEEE Transactions on Automatic Control, vol. AC-13, pp. 270–276.
Schwefel, H.-P. (1977). Numerische Optimierung von Computer-Modellen mittels
der Evolutionsstrategie, Birkhäuser Verlag, Basel.
Schwefel, H.-P. (1981). Numerical Optimization of Computer Models, John
Wiley, Chichester.
Schwefel, H.-P. (1995). Evolution and Optimum Seeking, Wiley, New York.
Sechen, C. (1988). VLSI Placement and Global Routing using Simulated An-
nealing, Kluwer Academic Publishers.
Shorack, G.R. and J. A. Wellner. (1986). Empirical Processes with Applications
to Statistics, John Wiley, New York.
Shubert, B.O. (1972). “A sequential method seeking the global maximum of a
function,” SIAM Journal on Numerical Analysis, vol. 9, pp. 379–388.
Solis, F.J. and R. B. Wets. (1981). “Minimization by random search tech-
niques,” Mathematics of Operations Research, vol. 6, pp. 19–30.
Tarasenko, G.S. (1977). “Convergence of adaptive algorithms of random search,”
Cybernetics, vol. 13, pp. 725–728.
Törn, A. (1974). Global Optimization as a Combination of Global and Local
Search, Skriftserie Utgiven av Handelshogskolan vid Abo Akademi, Abo,
Finland.
Törn, A. (1976). “Probabilistic global optimization, a cluster analysis approach,”
in: Proceedings of the EURO II Conference, Stockholm, Sweden, pp. 521–
527, North Holland, Amsterdam.
Törn, A. and A. Žilinskas. (1989). Global Optimization, Lecture Notes in Com-
puter Science, vol. 350, Springer-Verlag, Berlin.
Uosaki, K., H. Imamura, M. Tasaka, and H. Sugiyama. (1970). “A heuristic
method for maxima searching in case of multimodal surfaces,” Technology
Reports of Osaka University, vol. 20, pp. 337–344.
Vanderbilt, D. and S. G. Louie. (1984). “A Monte Carlo simulated annealing ap-
proach to optimization over continuous variables,” Journal of Computational
Physics, vol. 56, pp. 259–271.
Van Laarhoven, P.J.M. and E. H. L. Aarts. (1987). Simulated Annealing: Theory
and Applications, D. Reidel, Dordrecht.
Wasan, M.T. (1969). Stochastic Approximation, Cambridge University Press,
New York.
Wilcoxon, F. (1945). “Individual comparisons by ranking methods,” Biometrics
Bulletin, vol. 1, pp. 80–83.
Yakowitz, S. (1992). “Automatic learning: theorems for concurrent simula-
tion and optimization,” in: 1992 Winter Simulation Conference Proceedings,
(edited by J. J. Swain, D. Goldsman, R. C. Crain and J. R. Wilson), pp. 487–
493, ACM, Baltimore, MD.
Yakowitz, S.J. (1989). “A statistical foundation for machine learning, with appli-
cation to go-moku,” Computers and Mathematics with Applications, vol. 17,
pp. 1095–1102.
Yakowitz, S.J. (1989). “A globally-convergent stochastic approximation,” Tech-
nical Report, Systems and Industrial Engineering Department, University of
Arizona, Tucson, AZ.
Yakowitz, S.J. (1989). “On stochastic approximation and its generalizations,”
Technical Report, Systems and Industrial Engineering Department, Univer-
sity of Arizona, Tucson, AZ.
Yakowitz, S.J. (1992). “A decision model and methodology for the AIDS epi-
demic,” Applied Mathematics and Computation, vol. 55, pp. 149–172.
Yakowitz, S.J. (1993). “Global stochastic approximation,” SIAM Journal on
Control and Optimization, vol. 31, pp. 30–40.
Yakowitz, S.J. and L. Fisher. (1973). “On sequential search for the maximum of
an unknown function,” Journal of Mathematical Analysis and Applications,
vol. 41, pp. 234–259.
Yakowitz, S.J., R. Hayes, and J. Gani. (1992). “Automatic learning for dynamic
Markov fields, with applications to epidemiology,” Operations Research,
vol. 40, pp. 867–876.
Yakowitz, S.J., T. Jayawardena, and S. Li. (1992). “Theory for automatic learn-
ing under Markov-dependent noise, with applications,” IEEE Transactions
on Automatic Control, vol. AC-37, pp. 1316–1324.
Yakowitz, S.J. and M. Kollier. (1992). “Machine learning for blackjack counting
strategies,” Journal of Statistical Planning and Inference, vol. 33, pp. 295–
309.
Yakowitz, S.J. and W. Lowe. (1991). “Nonparametric bandit methods,” Annals
of Operations Research, vol. 28, pp. 297–312.
Yakowitz, S.J. and E. Lugosi. (1990). “Random search in the presence of noise,
with application to machine learning,” SIAM Journal on Scientific and Sta-
tistical Computing, vol. 11, pp. 702–712.
Yakowitz, S.J. and A. Vesterdahl. (1993). “Contribution to automatic learning
with application to self-tuning communication channel,” Technical Report,
Systems and Industrial Engineering Department, University of Arizona.
Zhigljavsky, A.A. (1991). Theory of Global Random Search, Kluwer Academic
Publishers, Hingham, MA.
Chapter 20

RECENT ADVANCES IN RANDOMIZED


QUASI-MONTE CARLO METHODS

Pierre L’Ecuyer
Département d’Informatique et de Recherche Opérationnelle
Université de Montréal, C.P. 6128, Succ. Centre-Ville, Montréal, H3C 3J7, CANADA
lecuyer@iro.umontreal.ca

Christiane Lemieux
Department of Mathematics and Statistics
University of Calgary, 2500 University Drive N.W., Calgary, T2N 1N4, CANADA
lemieux@math.ucalgary.ca

Abstract We survey some of the recent developments on quasi-Monte Carlo (QMC)


methods, which, in their basic form, are a deterministic counterpart to
the Monte Carlo (MC) method. Our main focus is the applicability of
these methods to practical problems that involve the estimation of a
high-dimensional integral. We review several QMC constructions and
different randomizations that have been proposed to provide unbiased
estimators and for error estimation. Randomizing QMC methods allows
us to view them as variance reduction techniques. New and old results
on this topic are used to explain how these methods can improve over
the MC method in practice. We also discuss how this methodology can
be coupled with clever transformations of the integrand in order to re-
duce the variance further. Additional topics included in this survey are
the description of figures of merit used to measure the quality of the
constructions underlying these methods, and other related techniques
for multidimensional integration.
1. Introduction
To approximate the integral of a real-valued function defined over
the unit hypercube given by

a frequently-used approach is to choose a point set


and then take the average value of over

as an approximation of Many problems can be formulated as in


(20.1), e.g., when simulation is used to estimate an expectation and each
simulation run requires calls to a pseudorandom number generator that
outputs numbers between 0 and 1; see Section 2 for more details.
If is very smooth and is small, the product of one-dimensional
integration rules such as the rectangular or trapezoidal rules can be used
to define (Davis and Rabinowitz 1984). When these conditions are
not met, the Monte Carlo method (MC) is usually more appropriate. It
amounts to choosing as a set of i.i.d. uniformly distributed vectors
over With this method, is an unbiased estimator of
whose error can be approximated via the central limit theorem, and
whose variance is given by

assuming that (i.e., is square-integrable). This means that the


error associated with the MC method has probabilistic order
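A sketch of the plain MC estimator and its CLT-based error estimate, with a
hypothetical integrand, makes the setup concrete.

import numpy as np

rng = np.random.default_rng(3)

def mc_integrate(f, s, n):
    # Average f over n i.i.d. uniform points in [0,1)^s; also return the
    # standard error, which decreases at the O(n^{-1/2}) rate noted above.
    u = rng.uniform(0.0, 1.0, size=(n, s))
    y = np.array([f(ui) for ui in u])
    return y.mean(), y.std(ddof=1) / np.sqrt(n)

# Example in s = 5 dimensions; the exact value of the integral here is 0.
est, se = mc_integrate(lambda u: np.prod(np.sqrt(12.0) * (u - 0.5)), 5, 10000)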

Quasi-Monte Carlo (QMC) methods can be seen as a deterministic


counterpart to the MC method. They are based on the idea of using more
regularly distributed point sets to construct the approximation (20.2)
than the random point set associated with MC. The fact that QMC
methods are deterministic suggests that one has to make assumptions
on the integrand in order to guarantee a certain level of accuracy for
In other words, the improved regularity of comes with worst-case
functions for which the QMC approximation is bad. For this reason,
the usual way to analyze QMC methods consists in choosing a set
of functions and a definition of discrepancy to measure, in some
way, how far from the uniform distribution on is the empirical


distribution induced by Once and are determined, one can
usually derive upper bounds on the deterministic error, of the following
form (Niederreiter 1992b; Hickernell 1998a):

where is a measure of the variability of such that for


all A well-known special case of (20.3) is the Koksma-Hlawka
inequality (Hlawka 1961), in which is the set of functions having
bounded variation in the sense of Hardy and Krause, and is the
rectangular-star discrepancy. To compute this particular definition of
one considers all rectangular boxes in aligned with the
axes and with a “corner” at the origin, and then takes the supremum,
over all these boxes, of the absolute difference between the volume of a
box and the fraction of points of that fall in it. The requirement that
in this case implies in particular that has continuous partial
derivatives; see Niederreiter (1992b), page 19, for a precise definition.
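In dimension 1 the rectangular-star discrepancy has a well-known closed form
over the sorted points, which the following sketch implements; exact
computation in higher dimensions is far more expensive.

import numpy as np

def star_discrepancy_1d(points):
    # D*_n = max_i max( i/n - x_(i), x_(i) - (i-1)/n ) over the sorted
    # points x_(1) <= ... <= x_(n); the supremum over anchored intervals
    # [0, t) is attained at the points themselves.
    x = np.sort(np.asarray(points, dtype=float))
    n = len(x)
    i = np.arange(1, n + 1)
    return max(np.max(i / n - x), np.max(x - (i - 1) / n))

# The centered one-dimensional lattice {(2i-1)/(2n)} attains the optimum 1/(2n).
print(star_discrepancy_1d((2 * np.arange(1, 33) - 1) / 64.0))   # 1/64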
It is clear from (20.3) that a small value of is desirable for the
set This leads to the notion of low-discrepancy sequences, which are
sequences of points such that if is constructed by taking the first
points of then is significantly smaller than the discrepancy
of a typical set of i.i.d. uniform points. The term low-discrepancy
point set usually refers to a set obtained by taking the first points of a
low-discrepancy sequence, although it is sometimes used in a looser way
to describe a point set that has a better uniformity than an independent
and uniform point set.
In the case where is the rectangular-star discrepancy, it is
common to say that is a low-discrepancy sequence if
Following this, the usual argument supporting the su-
periority of QMC methods over MC is to say that if is obtained
using a low-discrepancy point set, then the error bound (20.3) is in
which for a fixed dimension is a better asymptotic rate
than the rate associated with MC. For this reason, one expects
QMC methods to approximate with a smaller error than MC if is suf-
ficiently large. However, the dimension does not need to be very large
in order to have for large values of For example, if
one must have to ensure that
and thus the superiority of the convergence rate of the QMC bound
over MC is meaningful only for values of that are much too large for
practical purposes.
Nevertheless, this does not mean that QMC methods cannot improve
upon MC in practice, even for problems of large dimension. Arguments
supporting this are that first, the upper bound given in (20.3) is a worst-
case bound for the whole set It does not necessarily reflect the be-
havior of on a given function in this set. Second, it happens often in
practice that even if the dimension is large, the integrand can be well
approximated (in a sense to be specified in the next section) by a sum of
low-dimensional functions. In that case, a good approximation for
can be obtained by simply making sure that the corresponding pro-
jections of on these low-dimensional subspaces are well distributed.
These observations have recently led many researchers to turn to other
tools than the setup that goes with (20.3) for analyzing and improv-
ing the application of QMC methods to practical problems, where the
dimension is typically large, or even infinite (i.e., there is no a priori
bound on ). In connection with these new tools, the idea of randomizing
QMC point sets has been an important contribution that has extended
the practical use of these methods. The purpose of this chapter is to
give a survey of these recent findings, with an emphasis on the theoreti-
cal results that appear most useful in practice. Along with explanations
describing why these methods work, our goal is to provide the reader with
tools for applying QMC methods to his or her own specific problems.
Our choice of topics is subjective. We do not cover all the recent de-
velopments regarding QMC methods. We refer the readers to, e.g., Fox
(1999), Hellekalek and Larcher (1998), Niederreiter (1992b), and Spanier
and Maize (1994) for other viewpoints. Also, the fact that we chose not
to talk more about inequalities like (20.3) does not mean that they are
useless. In fact, the concept of discrepancy turns out to be useful for
defining selection criteria on which exhaustive or random searches to find
“good” sets can be based, as we will see later. Furthermore, we think
it is important to be aware of the discouraging order of magnitude for
required for the rate to be better than and to under-
stand that this problem is simply a consequence of the fact that placing
points uniformly in is harder and harder as increases because the
space to fill becomes too large. This suggests that the success of QMC
methods in practice is due to a clever choice of point sets exploiting the
features of the functions that are likely to be encountered, rather than
to an unexplainable way of breaking the “curse of dimensionality”.
Highly-uniform point sets can also be used for estimating the mini-
mum of a function instead of its integral, sometimes in a context where
function evaluations are noisy. This is discussed in Chapter 6 of Nieder-
reiter (1992b), Chapter 13 of Fox (1999), and was also the subject of
collaborative work between Sid Yakowitz and the first author (Yakowitz,
L’Ecuyer, and Vázquez-Abad 2000).
This chapter is organized as follows. In Section 2, we give some in-


sight on how point sets should be constructed by using an ANOVA
decomposition of the integrand over low-dimensional subspaces. Section
3 recalls the definition of different families of low-discrepancy point sets.
In Section 4, we present measures of quality (or selection criteria) for
low-discrepancy point sets that take into account the properties of the
decomposition discussed in Section 2. Various randomizations that have
been proposed for QMC methods are described in Section 5. Results on
the error and variance of approximations based on (randomized) QMC
methods are presented in Section 6. The purpose of Section 7 is to
briefly review different classes of transformations that can be applied
to the integrand for reducing the variance further by exploiting, or
not, the structure of the point set Integration methods that are
somewhere between MC and QMC but that exploit specific properties
of the integrand more directly are discussed in Section 8. Conclusions
and ideas for further research are given in Section 9.

2. A Closer Look at Low-Dimensional


Projections
We mentioned earlier that as the dimension increases, it becomes
difficult to cover the unit hypercube very well with a fixed number
of points. However, if instead our goal is to make sure that for some
chosen subsets the projections over the subspace of
indexed by the coordinates in I are evenly distributed, the task is
easier. By doing this, one can get a small integration error if the chosen
subsets I match the most important terms in the functional ANOVA
decomposition of which we now explain.
The functional ANOVA decomposition (Efron and Stein 1981; Hoeffd-
ing 1948; Owen 1998a) writes a square-integrable function as a sum
of orthogonal functions as follows:

where corresponds to the part of that depends


only on the variables in Moreover, this decomposition is
such that

if I is non-empty, and
if Defining we then have

The best mean-square approximation of by a sum of


(or less) functions is Also, the relative importance
of each component indicates which variables or which subsets of vari-
ables are the most important (Hickernell 1998b).
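Since the display equations above did not survive typesetting, here is a sketch
of the decomposition in the standard notation (as in Owen 1998a), with
S = {1, ..., s}:

\[
  f(\mathbf{u}) = \sum_{I \subseteq S} f_I(\mathbf{u}), \qquad
  f_{\emptyset} \equiv \mu = \int_{[0,1)^s} f(\mathbf{u})\,d\mathbf{u},
\]
\[
  \int_0^1 f_I(\mathbf{u})\,du_j = 0 \ \text{ for each } j \in I, \qquad
  \sigma^2 = \sum_{\emptyset \neq I \subseteq S} \sigma_I^2,
  \quad \text{where } \sigma_I^2 = \int_{[0,1)^s} f_I^2(\mathbf{u})\,d\mathbf{u}.
\]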
A function has effective dimension (in the superposition sense)
if (Caflisch, Morokoff, and Owen 1997). Functions de-
fined over many variables but having a low effective dimension often arise
in practical applications (Caflisch, Morokoff, and Owen 1997; Lemieux
and Owen 2001). The concept of effective dimension has actually been
introduced (in a different form than above) by Paskov and Traub (1995)
in the context of financial pricing to explain the much smaller error ob-
tained with QMC methods compared with MC, for a problem defined
over a 360-dimensional space.
A broad class of problems that are likely to have low effective dimen-
sion (relative to ) are those arising from simulation applications. To see
this, note that simulation is typically used to estimate the expectation
of some measure of performance defined over a stochastic system, and
proceeds by transforming in a more or less complicated way a sequence
of numbers between 0 and 1 produced by a pseudorandom generator into
an observation of the measure of interest. Hence it fits the framework
of equation (20.1), with equal to the number of uniforms required for
each simulation, and taken as the mapping that transforms a point in
into an observation of the measure of performance. In that con-
text, it is frequent that the uniform numbers that are generated close
to each other in the simulation (i.e., corresponding to dimensions that
are close together) are associated to random variables that interact more
together. In other words, for these applications it is often the subsets I
containing nearby indices and not too many of them that are the most
important in the ANOVA decomposition. This suggests that to design
point sets that will work well for this type of problems, one should
consider the quality of the projections corresponding to these “im-
portant” subsets. We present, in Section 4, measures of quality defined
on this basis.
We conclude this section by recalling two important properties re-
lated to the projections of a point set in Firstly, we say
that is fully projection-regular (Sloan and Joe 1994; L’Ecuyer and
Lemieux 2000) if each of its projections over a non-empty subset
of dimensions contains distinct points. Such a prop-
erty is certainly desirable, for the lack of it means that some of the
ANOVA components of are integrated by fewer than points even if
evaluations of have been done. Secondly, we say that is dimension-
stationary (L’Ecuyer and Lemieux 2000) if for any
such that
that is, only the spacings between the indices in I are relevant in the def-
inition of the projections of a dimension-stationary point set, and
not their individual values. Hence not all non-empty projections of
need to be considered when measuring the quality of since many
are the same. Another advantage of dimension-stationary point sets is
that because the quality of their projections does not deteriorate as the
first index increases, they can be used to integrate functions that have
important ANOVA components associated with subsets I having a large
value of Therefore, when working with those point sets it is not nec-
essary to try rewriting so that the important ANOVA components are
associated with subsets I having a small first index (as is often done;
see, e.g., Fox 1999). We underline that not all types of QMC point sets
have these properties.

3. Main Constructions
In this section, we present constructions for low-discrepancy point sets
that are often used in practice. We first introduce lattice rules (Sloan
and Joe 1994), and a special case of this construction called Korobov
rules (Korobov 1959), which turn out to fit in another type of construc-
tion based on successive overlapping produced by a recurrence
defined over a finite ring. This type of construction is also used to de-
fine pseudorandom number generators (PRNGs) with huge period length
when the underlying ring has a very large cardinality; low-
discrepancy point sets are, in contrast, obtained by using a ring with a
small cardinality (e.g., between and ). For this reason, we refer
to this type of construction as small PRNGs, and discuss it after having
introduced digital nets (Niederreiter 1992b), which form an important
family of low-discrepancy point sets that also provides examples of small
PRNGs. Various digital net constructions are described. We also recall
the Halton sequence (Halton 1960), and discuss a method by which the
number of points in a Korobov rule can be increased sequentially, thus
offering an alternative to digital sequences. Additional references regard-
ing the implementation of QMC methods are provided at the end of the
section.
3.1. Lattice Rules


The general construction for lattice rules has been introduced by Sloan
and his collaborators (see Sloan and Joe 1994 and the references therein)
by building upon ideas developed by Korobov (1959, 1960), Bakhvalov
(1959), and Hlawka (1962). The following definition is taken from the
expository book of Sloan and Joe (1994).
Definition: Let be a set of vectors linearly
independent over and with coordinates in [0,1). Define

and assume that The approximation based on the set


is a lattice rule. The number of points in the rule
is equal to the inverse of the absolute value of the determinant of the
matrix V whose rows are the vectors This number is
called the order of the rule.
Note that the basis for is not unique, but the determinant of the
matrix V remains constant for all choices of basis.
Figure 20.1 gives an example of a point set that corresponds to a
two-dimensional lattice rule, with Here, the two vectors shown
in the figure, and form a
basis for the lattice Another basis for the same lattice is formed by
and
These 101 points cover the unit square quite uniformly. They are also
placed very regularly on equidistant parallel lines, for several families of
lines. For example, any of the vectors or given above determines
one family of lines that are parallel to this vector. This regularity prop-
erty stems from the lattice structure and it holds for any lattice rule—in
more than two dimensions, the lines are simply replaced by equidistant
parallel hyperplanes (Knuth 1998; Conway and Sloane 1999). For to
cover the unit hypercube quite well, the successive hyperplanes should
never be too far apart (to avoid wide uncovered gaps), for any choice of
family of parallel hyperplanes. Selection criteria for lattice rules, based
on the idea of minimizing the distance between successive hyperplanes
for the “worst-case” family, are discussed in Section 4.1.
From now on, we refer to point sets giving rise to lattice rules
as lattice point sets. Each lattice point set has a rank associated
with it, which can be defined as the smallest integer such that can
be obtained by taking all integer combinations, modulo 1, of vectors
independent over Alternatively, the rank can be defined as
the smallest number of cyclic groups whose direct sum yields (Sloan
and Joe 1994). For example, if are positive integers such that
for at least one then the lattice point set

has rank 1 and contains distinct points. It can also be obtained by


taking with and for
in (20.4), where is a vector of zeros with a one in the
position. The condition for all is necessary and
sufficient for a rank-1 lattice point set to be fully projection-regular.
A Korobov rule is obtained by choosing an integer
and taking in (20.5), for all In this
case, having and relatively prime is a necessary and sufficient condi-
tion for to be both fully projection-regular and dimension-stationary
(L’Ecuyer and Lemieux 2000). The integer is usually referred to as the
generator of the lattice point set For instance, the point set given
on Figure 20.1 has a generator equal to 12. In Section 3.3, we describe
an efficient way of constructing for Korobov rules when is prime
and is a primitive element modulo
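A small sketch generating a rank-1 Korobov lattice point set (the n points
(i/n)(1, a, a^2, ..., a^{s-1}) mod 1) follows; the function name is
hypothetical, and the example reproduces the n = 101, a = 12 point set of
Figure 20.1.

import numpy as np

def korobov_points(n, a, s):
    # Generating vector (1, a, a^2, ..., a^{s-1}) mod n, then all integer
    # multiples i/n of it modulo 1; gcd(a, n) = 1 gives full projection-
    # regularity and dimension-stationarity, as noted in the text.
    g = np.array([pow(a, j, n) for j in range(s)])
    return (np.outer(np.arange(n), g) % n) / float(n)

P = korobov_points(101, 12, 2)    # the two-dimensional example in the text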
In the definition of a lattice rule, we assumed This implies
that has a period of 1 in each dimension. A lattice with this prop-
erty is called an integration lattice (Sloan and Joe 1994). A necessary
and sufficient condition for to be an integration lattice is that the
inverse of the matrix V has only integer entries. In this case, it can
be shown that if the determinant of V is then there exists a basis


for with coordinates of the form where We assume
from now on that all lattices considered are integration lattices.
The columns of the inverse matrix form a basis of the dual lattice
of defined as

If the determinant of is then contains times less points per


unit of volume than Also, is periodic with a period of in each
dimension. As we will see in Sections 4.1 and 6.1, this dual lattice plays
an important role in the error and variance analysis for lattice rules, and
in the definition of selection criteria.
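In the usual notation (Sloan and Joe 1994), the definition of the dual lattice
lost in the display above can be sketched as

\[
  L_s^{\perp} = \{\, \mathbf{h} \in \mathbb{Z}^s :
  \mathbf{h} \cdot \mathbf{v} \in \mathbb{Z}
  \ \text{for all } \mathbf{v} \in L_s \,\}.
\]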

3.2. Digital Nets


We first recall the general definition of a digital net in base a concept
that was first introduced by Sobol’ (1967) in base 2, and subsequently
generalized by Faure (1982), Niederreiter (1987), and Tezuka (1995).
The following definition is from Niederreiter (1992b), with the conven-
tion, as in Tezuka (1995), that the generating matrices contain an
infinite number of rows (although often, only a finite number of these
rows are nonzero).

Definition 1 Let and be integers. Choose


1. a commutative ring R with identity and with cardinality (usually

2. bijections for

3. bijections for and

4. Generating matrices of dimension over R.


For let

with be the digit expansion of in base Consider the vector

and compute
where each element is in R. For let

Then with is
digital net over R in base

This scheme has been used to construct point sets having a low-
discrepancy property that can be described by introducing the notion
of (Lemieux and L’Ecuyer 2001). Let
where the are non-negative integers, and consider the
boxes obtained by partitioning into equal intervals
along the axis. If each of these boxes contains exactly points
from a point set where then is said to be
If a digital net is when-
ever for some integer it is called a (Niederre-
iter 1992b). The smallest integer having this property is a widely-used
measure of uniformity for digital nets and we call it the of
Note that the is meaningless if and that the smaller is, the
better is the quality of Criteria for measuring the equidistribution
of digital nets are discussed in more details in Section 4.2.
Figure 20.2 shows an example of a two-dimensional point set with
points in base 3, having the best possible equidistribution;
that is, its is 0 and thus, any partition of the unit square into
ternary boxes of size is such that exactly one point is in each box.
The figure shows two examples of such a partition, into rectangles of sizes
and respectively. The other partitions (not shown
here) are into rectangles of sizes and For all
of these partitions, each rectangle contains one point of This point
set contains the first 81 points of the two-dimensional Faure sequence
in base 3. In this case, in the definition above, all bijections
are the identity function over the generating matrix for the first
dimension is the identity matrix, and is given by

The general definition of this type of construction is given below.
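The generating matrix shown above did not survive typesetting; under the
standard convention for the Faure sequence (first generating matrix the
identity, second the Pascal matrix mod 3, consistent with Section 3.2.2
below), the 81-point example can be reproduced by the following sketch.

import numpy as np
from math import comb

def faure_net_2d(b, k):
    # First b^k points of the 2-dimensional Faure sequence in prime base b:
    # a digital net with C_1 = I and C_2 = the Pascal matrix (C(c, r)) mod b.
    n = b ** k
    P = np.array([[comb(c, r) % b for c in range(k)] for r in range(k)])
    scales = float(b) ** -np.arange(1, k + 1)
    pts = np.empty((n, 2))
    for i in range(n):
        digits = np.array([(i // b ** r) % b for r in range(k)])  # base-b digits of i
        pts[i, 0] = digits @ scales                 # C_1 = identity
        pts[i, 1] = ((P @ digits) % b) @ scales     # C_2 = Pascal mod b
    return pts

pts = faure_net_2d(3, 4)    # the 81 points shown in Figure 20.2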


From now on, is assumed to be a prime power, and in all the con-
structions described below, R is taken equal to Also, we assume
that a bijection from to has been chosen to identify the elements
of with the “digits” and all bijections and are


defined according to this bijection. In particular, when is prime, these
bijections are equal to the identity function over and all operations
are performed modulo The base and the generating matrices
therefore completely describe these constructions, because the ex-
pansion of a given coordinate of is obtained by simply multiplying
with the digit expansion of in base The goal is then to choose
the generating matrices so that the equidistribution property mentioned
above holds for boxes that are as small as possible. In terms of
these matrices, this roughly means that we want them to have a large
number of rows that are linearly independent. For example, if for each
the first rows of the matrix are linearly independent, then each
box obtained by partitioning the axis into equal intervals of
length has one point from In particular, this implies that is
fully projection-regular.
Digital sequences in base (see, e.g., Larcher 1998 and Niederreiter
1992b) are infinite sequences obtained in the same way as digital nets ex-
cept that the generating matrices have an infinite number of columns;
the first points of the sequence thus form a digital net for each
For example, Sobol’, Generalized Faure, and Niederreiter sequences (to
be discussed below) are all defined as digital sequences since the “recipe”
to add extra columns in the generating matrices is given by the
method.
We now describe specific well-known constructions of digital nets with
a good equidistribution. We do not discuss recent constructions pro-
posed by Niederreiter and Xing (1997, 1998), as they require the in-
troduction of many concepts from algebraic function fields that go well
beyond the scope of this chapter. These sequences are built so as to
optimize the asymptotic behavior of their as a function of the
dimension, for a fixed base See Pirsic (2001) for a definition of these
sequences and a description of a software implementation.

3.2.1 Sobol’ Sequences. Here the base is and the spec-


ification of each generating matrix requires a primitive polynomial
over and integers for to initialize
a recurrence based on that generates the direction numbers defin-
ing The method specifies that the polynomial should be the
one in the list of primitive polynomials over sorted by increasing
degree (within each degree, Sobol’ specifies a certain order which is given
in the code of Bratley and Fox (1988) for ). There remain the
parameters to control the quality of the point set.
Assume where for each
The direction numbers are rationals of the form

where is an odd integer smaller than for The first


values must be (carefully) chosen, and the following ones
are obtained through the recurrence

where denotes a bit-by-bit exclusive-or operation, and


means that the binary expansion of is shifted by positions to
the right (i.e., is divided by ). These direction numbers are then
used to define whose entry in the row and column is given by

A good choice of the initial values (or in


each dimension is important for the success of this method, especially
when is small. The implementation of Bratley and Fox (1988) uses
the initial values given by Sobol’ and Levitan (1976) for with
and More details on how to choose these initial
values to optimize certain quality criteria are given in Section 4.2.
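In integer form (direction numbers m_k, with v_k = m_k/2^k, as in the
implementation of Bratley and Fox 1988), the recurrence can be sketched as
follows; the polynomial and the initial values in the example are only
illustrative.

def sobol_direction_numbers(c, m_init, count):
    # c = (c_1, ..., c_d): inner coefficients of a degree-d primitive
    # polynomial over F_2 (the trailing coefficient c_d equals 1);
    # m_init = (m_1, ..., m_d): the odd initial values to be chosen.
    # Recurrence: m_k = 2 c_1 m_{k-1} ^ 4 c_2 m_{k-2} ^ ... ^ 2^d m_{k-d} ^ m_{k-d}.
    d = len(c)
    m = list(m_init)
    for k in range(d, count):
        mk = m[k - d] ^ (m[k - d] << d)        # the terms 2^d m_{k-d} ^ m_{k-d}
        for j in range(1, d):
            if c[j - 1]:
                mk ^= m[k - j] << j            # the term 2^j c_j m_{k-j}
        m.append(mk)
    return m

# Example with x^3 + x + 1, i.e. (c_1, c_2, c_3) = (0, 1, 1), and
# illustrative initial values (1, 3, 7): yields 1, 3, 7, 5, 7, 43, ...
print(sobol_direction_numbers((0, 1, 1), (1, 3, 7), 8))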

3.2.2 Generalized Faure Sequences. These sequences were


introduced by Faure (1982) and generalized by Tezuka (1995). For this
type of sequence, the base is the smallest prime power larger or equal
to the dimension (which means that these sequences are practical only
for small values of ). An important feature of this construction is that
their has the best possible value provided is of the
form for some integers Assuming the matrices
take the form:

where P is the Pascal matrix (i.e., with entries


and is an arbitrary non-singular lower-triangular matrix.
The original Faure sequence is obtained by taking prime and as
the identity matrix for all By allowing the matrices
to be different from the identity matrix, point sets that avoid some of
the defects observed on the original Faure sequence can be built (Mo-
rokoff and Caflisch 1994; Boyle, Broadie, and Glasserman 1997; Tan and
Boyle 2000). For instance, the Generalized Faure sequence of Tezuka
and Tokuyama (1994) amounts to taking Recently, Faure
suggested another form of Generalized Faure sequence that consists in
taking equal to the lower-triangular matrix with all nonzero entries
equal to 1, for all (Faure 2001).

3.2.3 Niederreiter Sequences. This construction has been


proposed by Niederreiter (1988). We first describe a special case (Nieder-
reiter 1992b, Section 4.5), and then briefly explain how it can be general-
ized. The base is assumed to be a prime power, and for this special case
the generating matrices are defined by distinct monic irreducible poly-
nomials in Let for
In what follows, represents the field of formal Laurent series
over that is,

To find the element on the row and column of for


consider the expansion

where is the unique integer satisfying and


and is also uniquely determined. The element on the row
and column of is then given by For these sequences, the
is given by which suggests that to minimize the
should be taken equal to the monic irreducible polynomials of


smallest degree over In the general definition of Niederreiter (1988),
the numerator in (20.7) can be multiplied by a different polynomial for
each pair of dimension and row index, and the just need to
be pairwise relatively prime polynomials. Tezuka (1995), Section 6.1.2,
generalizes this concept a step further by removing more restrictions on
this numerator.

3.2.4 Polynomial Lattice Rules. This construction can be
seen as a lattice rule in which the ring of integers is replaced by a ring
of polynomials over a finite field. As we explain below, it generalizes
the construction discussed by Niederreiter (1992a), Niederreiter (1992b),
Section 4.4, and Larcher, Lauss, Niederreiter, and Schmid (1996), and
is studied in more detail by Lemieux and L'Ecuyer (2001).
Definition: Let {v_1(z), ..., v_s(z)} be a set of s vectors of
formal Laurent series over F_b, linearly independent over F_b[z],
where b is a prime power. Define

L_s = { q_1(z)v_1(z) + ... + q_s(z)v_s(z) : q_j(z) in F_b[z], j = 1, ..., s },

and assume that F_b[z]^s is contained in L_s. Let phi be the
mapping that evaluates each component of a vector in L_b^s
at z = b. The approximation based on the set P_n = phi(L_s) intersected with [0, 1)^s
is a polynomial lattice rule. The number of points in the rule is called
the order of the rule and is equal to n = b^k, where k is the degree of the
inverse of the determinant of the matrix V whose rows are the vectors
v_1(z), ..., v_s(z).
Most definitions and results that we mentioned for lattice rules, which
from now on are referred to as standard lattice rules, have their counterpart for polynomial lattice rules, as we now explain. First, we refer to
point sets P_n that define polynomial lattice rules as polynomial lattice
point sets, whose rank is equal to the smallest number of basis vectors
required to write P_n as a sum of terms of the form appearing in (20.9) below.
In that expression, the truncation operation {.} represents the operation by which
all non-negative powers of z are dropped. The construction introduced by Niederreiter (1992a), also discussed in Niederreiter (1992b),
Section 4.4, and Larcher, Lauss, Niederreiter, and Schmid (1996), and

sometimes referred to as the “method of optimal polynomials”, is a polynomial lattice point set of rank 1, since it can be obtained as

P_n = { ( phi({q(z)g_1(z)/P(z)}), ..., phi({q(z)g_s(z)/P(z)}) ) : q(z) in F_b[z], deg(q(z)) < k },    (20.9)

where P(z) is a polynomial of degree k over F_b and the g_j(z) are polynomials in the ring
of polynomials over F_b modulo P(z). A Korobov polynomial lattice
point set is obtained by taking g_j(z) = g(z)^{j-1} mod P(z) in (20.9)
for some g(z). As in the standard case, the condition
gcd(g(z), P(z)) = 1 is necessary and sufficient to guarantee that a Korobov
polynomial lattice point set is dimension-stationary and fully projection-regular (Lemieux and L'Ecuyer 2001). An efficient way of generating this
type of point set when P(z) is a primitive polynomial is described in the
next subsection.
The condition that F_b[z]^s is contained in L_s implies that L_s has a period
of 1 in each dimension, and we call L_s a polynomial integration lattice
in this case. If V represents a matrix whose rows are formed by basis
vectors of L_s, then L_s is a polynomial integration lattice if and only if
all entries in V^{-1} are in F_b[z]. In that case, a basis for L_s with vectors
having coordinates of the form g_j(z)/P(z) can be found, where P(z)
is the determinant of V^{-1} and each numerator g_j(z) is in F_b[z].
Finally, the columns of V^{-1} form a basis of the dual lattice of L_s, which
is defined by

L_s^* = { h(z) in L_b^s : h(z) . v(z) in F_b[z] for all v(z) in L_s },

where the scalar product is defined as h(z) . v(z) = h_1(z)v_1(z) + ... + h_s(z)v_s(z).

This construction is a special case of digital nets in which the generating matrices C_1, ..., C_s are defined from the coefficients of the formal
Laurent series expansions of the coordinates of certain basis vectors of L_s.
More precisely, for some nonzero polynomials coming from a triangular basis of L_s
(see Lemieux and L'Ecuyer 2001, Lemma A.2, for more details),
the column index within C_j is determined by the structure of L_s as follows:
the first columns of each matrix C_j contain the coefficients associated with the first basis vector, the
next columns are associated with the second basis vector,
and so on.
For polynomial lattice point sets of rank 1, the corresponding generating matrices can be described a bit more easily (Niederreiter 1992b,
Section 4.4), as we now explain. For each dimension j = 1, ..., s, consider the formal Laurent series expansion

g_j(z)/P(z) = sum_{l=1}^infinity u_{j,l} z^{-l}.

The first k coefficients satisfy a linear system determined by a matrix A
built from the coefficients of P(z)
(see, e.g., Lemieux and L'Ecuyer 2001), and the coefficients u_{j,l} for l > k satisfy the
recurrence determined by P(z), i.e.,

u_{j,l} = -(a_1 u_{j,l-1} + ... + a_k u_{j,l-k}),

where P(z) = z^k + a_1 z^{k-1} + ... + a_k and the minus sign represents the subtraction in F_b. The entries of
the generating matrices are then given by the element u_{j,l+c-1} in row l and column c of C_j.
Note that in the definition of Niederreiter (1992b) and Pirsic and Schmid
(2001), the matrices C_j are restricted to k rows.

3.3. Constructions Based on Small PRNGs

Consider a PRNG based on a recurrence over a finite ring R and a
specific output function from R to [0, 1). The idea proposed here is to
define P_n as the set of all vectors of s successive overlapping output values that can be
obtained by running the PRNG from all possible initial states in R.
More precisely, let f : R -> R be a bijection over R called the transition
function of the PRNG, and define

s_i = f(s_{i-1})

for i >= 1. The sequence obtained from any seed s_0 in R has a
period of at most |R|. Now let g : R -> [0, 1) be an output function, and
define

u_i = g(s_i)    (20.11)

for i >= 0. We call the point set

P_n = { (u_0, ..., u_{s-1}) : s_0 in R },    (20.10)

where n = |R|, a recurrence-based point set. The requirement that f
be a bijection guarantees that P_n is dimension-stationary, and therefore
fully projection-regular (L'Ecuyer and Lemieux 2000). In concrete realizations, it is often the case that the recurrence s_i = f(s_{i-1}) has two main
cycles, one of period length |R| - 1, and the other of period length 1 and
which contains 0, the zero of R. In this case, P_n can be generated very
efficiently (provided the function g is easy to compute) by running the
PRNG for |R| + s - 2 steps to obtain the point set

{ (u_i, ..., u_{i+s-1}) : 0 <= i <= |R| - 2 },

and adding the zero vector (0, ..., 0) to this set.


An important advantage of this construction is that it can be used
easily on problems for which f depends on a random, possibly unbounded,
number of variables. For this type of function, one needs a point set
whose defining parameters are independent of s and for which points
of arbitrary size can be generated. Recurrence-based point sets satisfy
these requirements since they are determined by the functions f and g and
the set R, and those are independent of the dimension. Also, points of
any size can be obtained by running the PRNG as long as required. Note
that when s exceeds the period of the recurrence, the coordinates of each point in P_n have a periodic
structure, but by randomizing P_n as in Section 5, this periodicity disappears
and the coordinates of each point become mutually independent,
so one can have s > n without any problem.
The link between PRNG constructions and QMC point sets was pointed out by Niederreiter (1986). Not only have similar constructions been
proposed in both fields but, in addition, some of the selection criteria used
in the two settings have interesting connections. Criteria measuring the
uniformity of the point set P_n given in (20.10) are also
required in the PRNG context because, in this case, this point set can be
seen as the sampling space for the PRNG when it is used on a problem
requiring s uniform numbers per run. See L'Ecuyer (1994) for details on
this aspect of PRNGs and more.
We now describe two particular cases of this type of construction that
provide an alternative way of generating Korobov-type lattice point sets
(either standard or polynomial).

Example 1 Let R = Z_m for some prime m, define

f(x) = ax mod m    (20.12)

for some nonzero a in Z_m, and let g(x) = x/m. This type of PRNG is a
linear congruential generator (LCG) (Knuth 1998; L'Ecuyer 1994), and
the recurrence-based point set P_n is a Korobov lattice point set. When a
is a primitive element modulo m (see Lidl and Niederreiter 1994 for the
definition), the recurrence (20.12) has the maximal period of m - 1 for
any nonzero seed, and P_n can thus be generated efficiently using (20.12)
and (20.11).
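As a small illustration (not taken from the original references), the following Python sketch builds the Korobov lattice point set of Example 1 directly from the LCG; the function name korobov_point_set and the parameter values below are our own choices.

```python
# A minimal sketch of the recurrence-based point set of Example 1: an
# LCG with modulus m and multiplier a, run from every seed, yields the
# Korobov lattice point set.  Names are ours, for illustration only.
def korobov_point_set(m, a, s):
    """Return the m points (g(x), g(f(x)), ..., g(f^{s-1}(x))), x in Z_m,
    with f(x) = a*x mod m and g(x) = x/m."""
    points = []
    for seed in range(m):
        x, point = seed, []
        for _ in range(s):
            point.append(x / m)
            x = (a * x) % m
        points.append(tuple(point))
    return points

# With m prime and a primitive mod m, all nonzero seeds lie on one cycle,
# so the set can also be built by running the LCG m + s - 2 steps and
# appending the zero vector, as described above.
pts = korobov_point_set(101, 12, 3)
```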

Example 2 Let P(z) = z^k + a_1 z^{k-1} + ... + a_k be a primitive polynomial
over F_b (see Lidl and Niederreiter 1994 for the definition), let nu be a positive
integer, and define the transition function over R = F_b[z]/(P(z)) by

f(q(z)) = z^nu q(z) mod P(z).    (20.13)

Let g be given by the composition g = phi o psi,

where psi(q(z)) = q(z)/P(z) and phi is defined as the evaluation of a formal
Laurent series over F_b at z = b, that is,

phi( sum_{l=w}^infinity a_l z^{-l} ) = sum_{l=w}^infinity a_l b^{-l}.

This type of PRNG is called a polynomial LCG (L'Ecuyer 1994; Tezuka
1995), and the recurrence-based point set P_n is a Korobov polynomial
lattice point set. When nu and b^k - 1 are relatively prime, the recurrence
(20.13) has maximal period length b^k - 1. Polynomial LCGs can also
be obtained via Tausworthe-type linear feedback shift-register sequences
(Tausworthe 1965): the idea is to use the recurrence

x_i = a_1 x_{i-1} + ... + a_k x_{i-k}

over F_2 and to define the output at step i as

u_i = sum_{l=1}^infinity x_{i nu + l - 1} 2^{-l},

which in practice is typically truncated to the word-length of the computer. Tezuka and L'Ecuyer give an efficient algorithm to generate the
output under some specific conditions (Tezuka and L'Ecuyer 1991;
L'Ecuyer 1996).

Combining a few Tausworthe generators to define the output can
greatly help improve the quality of the associated set P_n, as explained
by L'Ecuyer (1996), Tezuka and L'Ecuyer (1991), and Wang and Compagner (1993). Another way of enhancing the quality of point sets based
on linear recurrences modulo 2 is to use tempering transformations (Matsumoto and Kurita 1994; Matsumoto and Nishimura 1998; L'Ecuyer and
Panneton 2000). Note that these transformations generally destroy the
lattice structure of P_n (Lemieux and L'Ecuyer 2001). However, the point
set obtained is still a digital net and can therefore be studied under this more general setting. Conditions that preserve the dimension-stationarity of P_n under these transformations are given by Lemieux
and L'Ecuyer (2001). The idea of combining different constructions to
build sets with better equidistribution properties is also discussed in
Niederreiter and Pirsic (2001) in the more general setting of digital nets.

3.4. Halton sequence

This sequence was introduced by Halton (1960) for constructing point
sets of arbitrary length, and is a generalization of the one-dimensional
van der Corput sequence (Niederreiter 1992b). Although it is not a
digital sequence, it uses similar ideas and can thus be thought of as an
ancestor of those sequences. For i >= 0, the i-th point in the sequence is
given by

x_i = ( phi_{b_1}(i), ..., phi_{b_s}(i) ),

where the integers b_1, ..., b_s are typically chosen as the first s prime numbers
sorted in increasing order, and phi_b is the radical-inverse function
in base b, defined by

phi_b(i) = sum_{l=0}^infinity a_l b^{-l-1},

where the integers a_l are the coefficients in the expansion i = sum_{l=0}^infinity a_l b^l,
i.e., as in the digital net definition.
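To make the definition concrete, here is a minimal Python sketch of the radical-inverse function and the resulting Halton points; the names radical_inverse and halton_point, and the two-dimensional example, are our own choices.

```python
# A minimal sketch of the radical-inverse function phi_b and the Halton
# sequence in the first s prime bases; names are ours for illustration.
def radical_inverse(i, b):
    """phi_b(i): reflect the base-b digits of i about the radix point."""
    inv, base_power = 0.0, 1.0 / b
    while i > 0:
        i, digit = divmod(i, b)
        inv += digit * base_power
        base_power /= b
    return inv

def halton_point(i, bases):
    return tuple(radical_inverse(i, b) for b in bases)

# First five points of the two-dimensional Halton sequence (bases 2, 3).
for i in range(5):
    print(halton_point(i, (2, 3)))
```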

3.5. Sequences of Korobov rules

With (infinite) digital sequences, one can always add new points
until the estimation error is deemed small enough. The lattice point sets
that we have discussed so far, on the other hand, contain only a fixed
number of points. We now consider a method discussed by Sobol' (1967),
Maize (1981), Hickernell and Hong (1997), and Hickernell, Hong, L'Ecuyer, and
Lemieux (2001) for generating an infinite sequence of nested point sets
P_{n_0}, P_{n_1}, ... such that each P_{n_l} is a Korobov (polynomial) lattice
point set, P_{n_l} is contained in P_{n_{l+1}}, and n_l goes to infinity.
Sequences based on nested Korobov lattice point sets can be constructed by choosing a nonzero odd integer a and letting P_{n_l} be the
Korobov lattice point set with n_l = 2^l points and generator a.
Hickernell, Hong, L'Ecuyer, and Lemieux (2001)
give tables of generators a to build these sequences, that were chosen
with respect to different selection criteria to be explained in Section 4.1.
One way of constructing a sequence based on Korobov polynomial
lattice point sets is to choose a polynomial P(z) of degree 1 in F_2[z] (i.e.,
P(z) = z or P(z) = z + 1) and a generating polynomial g(z)
such that gcd(g(z), P(z)) = 1. The l-th point set is then the Korobov polynomial
lattice point set of order 2^l associated with the modulus P(z)^l and the
generator g(z) mod P(z)^l,

where phi is defined as on page 433, and the powers of g(z) are taken mod P(z)^l.
In this case, the sequence turns out to be
a special case of a digital sequence; see also Larcher (1998), page 187,
for a particular case, and for a more general setting
based on irrational formal Laurent series.

3.6. Implementations
The points of a digital net in base b can be generated efficiently
using a Gray code. This idea was first suggested by Antonov and Saleev
(1979) for the Sobol' sequence, and for other constructions by, e.g., Hong
and Hickernell (2001), Pirsic and Schmid (2001), Tezuka (1995), and
Bratley, Fox, and Niederreiter (1992). Assuming b = 2, the idea is
to modify the order in which the points are generated by replacing the
digits a_0, a_1, ... from the expansion of i by the Gray code g_0, g_1, ... of i,
which satisfies the following property: the Gray codes
for i and i + 1 only differ in one position; if l is the smallest index such
that a_l = 0, then g_l is the digit whose value changes, and it is flipped
in the Gray code for i + 1 (Tezuka 1995, Theorem 6.6). This
reduces the generation time because only an addition of two vectors over F_2
has to be performed in each dimension to generate a point.
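As an illustration of this trick in base 2, the following Python sketch generates the points of a digital net from the columns of its generating matrices; the identity-matrix columns used in the example stand in for real direction numbers (which would be chosen as in Section 3.2) and are our own choice.

```python
# A minimal sketch of Gray-code point generation for a digital net in
# base 2.  Each generating matrix is stored as a list of column integers
# (bit pattern of the column); the identity columns below give the
# one-dimensional van der Corput sequence and are a hypothetical stand-in
# for real direction numbers.
def gray_code_net(columns, n_points, n_bits=31):
    """columns[j][c] is the c-th column of the generating matrix of
    coordinate j, packed into an integer."""
    s = len(columns)
    state = [0] * s                      # current point, as integers
    points = [tuple(0.0 for _ in range(s))]
    for i in range(1, n_points):
        c = (i & -i).bit_length() - 1    # index of the lowest set bit of i
        for j in range(s):
            state[j] ^= columns[j][c]    # XOR in one matrix column
        points.append(tuple(x / 2.0 ** n_bits for x in state))
    return points

# Identity-matrix columns (hypothetical direction numbers), 1 dimension:
ident = [[1 << (31 - c - 1) for c in range(31)]]
print(gray_code_net(ident, 8))
```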
It is sometimes suggested in the literature (Bratley, Fox, and Niederreiter 1992; Acworth, Broadie, and Glasserman 1997;
Fox 1986; Morokoff and Caflisch 1994) that for most low-discrepancy
sequences, an initial portion of the sequence should be discarded because of the so-called “leading-zeros phenomenon”. For sequences such
as Sobol's that require initial parameters, this problem can be avoided
(at least partially) by choosing these parameters carefully. Using a sufficiently large number of points and randomizing the point set can also
help alleviate this problem. We refer the reader to the papers mentioned
above for more details.
The FORTRAN code of Bratley and Fox (1988) can be used to generate the Sobol' sequence for dimensions up to 40 and is available from the Collected
Algorithms of the ACM at www.acm.org/calgo, where the code of Fox
(1986) for generating the Faure sequence can also be found, as well as a
code for Niederreiter's sequence (Bratley, Fox, and Niederreiter 1994). A
code to generate Generalized Faure sequences (provided the matrices A_j
have been precomputed) is given by Tezuka (1995). Owen has written
code to generate scrambled nets; it is available at www-stat.stanford.edu/~owen/scramblednets, and is free for noncommercial uses. A C++
library called libseq was recently developed by Friedel and Keller (2001),
in which they use efficient algorithms to generate scrambled digital sequences, Halton sequences, and other techniques such as Latin Hypercube and Supercube Sampling (Mckay, Beckman, and Conover 1979;
Owen 1998a). This library can be found at www.multires.caltech.edu/software/libseq/index.html. There are also a few commercial
software packages to generate different QMC point sets; e.g., QR Streams
at www.mathdirect.com/products/qrn/, and the FinDer software of
Paskov and Traub (1995) at www.cs.columbia.edu/~ap/html/finder.html.

4. Measures of Quality
In this section, we present a number of criteria that have been proposed in the literature for measuring the uniformity (or non-uniformity)
of a point set P_n in the unit hypercube [0, 1)^s, i.e., for measuring the
discrepancy between the distribution of the points of P_n and the uniform
distribution, in the context of QMC integration.
In one dimension (i.e., s = 1), several such measures are widely used
in statistics for testing the goodness of fit of a data set with the uniform
distribution; e.g., the Kolmogorov-Smirnov (KS), Anderson-Darling, and
chi-square test statistics. Chi-square tests also apply in more than one
dimension, but their efficiency vanishes quickly as the dimension increases. The rectangular-star discrepancy discussed earlier turns out
to be one possible multivariate generalization of the KS test statistic.
Other variants of this measure are discussed by Niederreiter (1992b),
and connections between discrepancy measures and goodness-of-fit tests
used in statistics are studied in Hickernell (1999).
The asymptotic behavior of quality measures like the rectangular-star discrepancy is known for many constructions. For example, the
Halton sequence has a rectangular-star discrepancy in O((log n)^s / n), but
with a hidden constant that grows superexponentially with s. This is
also true for the Sobol' sequence, but with a smaller hidden constant. By
contrast, the Faure and Niederreiter-Xing sequences are built so that this
hidden constant goes to zero exponentially fast as the dimension goes
to infinity.
In practice, however, as soon as the number of dimensions exceeds
a few units, these general measures of discrepancy are unsatisfactory
because they are too hard to compute. A more practical and less general
approach is to define measures of uniformity that exploit the structure
of a given class of highly structured point sets. Here we concentrate our
discussion on these types of criteria, starting with criteria for standard
lattice rules, then covering those for digital nets.

4.1. Criteria for standard lattice rules

The criteria discussed here all relate to the dual lattice L_s^* defined
in Section 3. The first criterion we introduce has a nice geometrical interpretation, and is often used to measure the quality of LCGs
through the so-called spectral test (Fishman and Moore III 1986; Fishman 1990; Knuth 1998; L'Ecuyer 1999). It was introduced by Coveyou and MacPherson (1967) and was used by Entacher, Hellekalek, and
L'Ecuyer (2000) to choose lattice point sets. It amounts to computing
the Euclidean length of the shortest nonzero vector in L_s^*, i.e.,

ell_s = min{ ||h||_2 : h in L_s^*, h != 0 }.

The quantity ell_s turns out to be equal to the inverse of the distance
between the successive hyperplanes in the family of most distant parallel
hyperplanes on which the points of L_s lie. This distance should be as
small as possible, so that there are no wide gaps in [0, 1)^s without any point
from P_n. Equivalently, ell_s should be as large as possible. Algorithms
for computing ell_s are discussed in, e.g., Dieter (1975), Fincke and Pohst
(1985), Knuth (1998), and L'Ecuyer and Couture (1997). For instance,
the dual of the basis shown in Figure 20.1 contains the shortest vector
in L_s^*, given by h = (5, 8), so ell_2 = sqrt(89) in this case.
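For intuition, a brute-force version of this computation is easy to write for a small two-dimensional Korobov lattice; the sketch below is our own (it is not one of the algorithms of Dieter or Fincke and Pohst cited above), and its function name and search bound are chosen for illustration only.

```python
# A minimal sketch: brute-force search for the shortest nonzero vector
# in the dual of a small two-dimensional Korobov lattice.  For generator
# a and n points, the dual lattice is {h in Z^2 : h1 + a*h2 = 0 (mod n)},
# and (n, 0) belongs to it, so searching |h_j| <= n is sufficient.
import itertools
import math

def shortest_dual_vector(n, a):
    best, best_len = None, math.inf
    for h1, h2 in itertools.product(range(-n, n + 1), repeat=2):
        if (h1, h2) == (0, 0):
            continue
        if (h1 + a * h2) % n == 0:
            length = math.hypot(h1, h2)
            if length < best_len:
                best, best_len = (h1, h2), length
    return best, best_len

# 1/best_len is the spacing of the most distant family of parallel lines
# covering the points of the lattice.
print(shortest_dual_vector(101, 12))
```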
This test can be applied not only to L_s but also to any projection
of L_s. More precisely, assume I = {i_1, ..., i_d} is a subset of {1, ..., s} and let L_s(I) be the
integration lattice obtained by keeping only the coordinates of L_s whose indices are in I. The idea
is to compute

ell_I = min{ ||h||_2 : h in L_s^*(I), h != 0 },

where L_s^*(I) is the dual of L_s(I). To define a criterion in which ell_I is computed for several subsets I, it is convenient to
normalize, so that the same scale is used to compare the different projections. This can be achieved by using upper bounds ell_d^* derived from
the “best” possible lattice in d dimensions (not necessarily an integration
lattice). Such bounds can be found in, e.g., Conway and Sloane (1999)
and L'Ecuyer (1999). Criteria using these ideas have been used for measuring LCGs (Fishman and Moore III 1986; Fishman 1990; L'Ecuyer
1999), usually for subsets I containing successive indices, i.e., of the
form I = {1, ..., d}. The following criterion is more general and has
been used to construct tables of “good” Korobov lattice point sets by
L'Ecuyer and Lemieux (2000):

M = min over the selected subsets I of ( ell_I / ell_{|I|}^* ),    (20.14)

where the subsets I are selected as follows. Let the range of
a subset I = {i_1, ..., i_d}, with i_1 < ... < i_d, be defined as i_d - i_1 + 1. The criterion
considers projections over successive indices whose range is at most t_1,
over pairs of indices whose range is at most t_2, over triplets of indices
whose range is at most t_3, etc. Note that for dimension-stationary point
sets, it is sufficient to do as in the definition of M and to only
consider subsets I having 1 as their first index.
The next criterion is called the Babenko-Zaremba index, and is similar
to ell_s except that a different norm is used for measuring the vectors h in
L_s^*. It is defined as follows (Niederreiter 1992b):

rho_s = min{ r(h) : h in L_s^*, h != 0 },

where r(h) = prod_{j=1}^s max(1, |h_j|). It has been used by Maisonneuve (1972)
to provide tables of “good” Korobov rules, but its computation is typically much more time-consuming than computing ell_s. Both ell_s and rho_s
can be seen as special cases of more general criteria such as general discrepancy measures defined, e.g., by Hickernell (1998b), Theorem 3.8, or
the generalized spectral test of Hellekalek (1998), Definition 5.2. These
criteria use a general norm to measure h and apply to point sets that
do not necessarily come from a lattice.
The following criterion, called P_alpha, uses the same norm as the
Babenko-Zaremba index, but it sums a fixed power of the length of all
vectors in the dual lattice instead of only considering the smallest
one. For an arbitrary integer alpha > 1, it is defined as (Sloan and Joe 1994;
Hickernell 1998b)

P_alpha = sum over nonzero h in L_s^* of r(h)^{-alpha}.

When alpha is even, it simplifies to (Sloan and Joe 1994; Hickernell 1998b)

P_alpha = -1 + (1/n) sum_{u in P_n} prod_{j=1}^s ( 1 - (-4 pi^2)^{alpha/2} B_alpha(u_j) / alpha! ),

where B_alpha is the Bernoulli polynomial of degree alpha (Sloan and Joe
1994), and it can then be computed in O(ns) time. This criterion has
been used in, e.g., Haber (1983), Sloan and Walsh (1990), and Sloan and
Joe (1994) to choose lattice point sets.
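The O(ns) product formula is straightforward to implement. The following Python sketch, our own, evaluates P_2 for a Korobov lattice point set using B_2(x) = x^2 - x + 1/6; the function and variable names are chosen for illustration.

```python
# A minimal sketch computing P_2 for a Korobov lattice point set via
# the product formula above:
#   P_2 = -1 + (1/n) * sum over points of prod_j (1 + 2*pi^2*B_2(u_j)).
import math

def p2_korobov(n, a, s):
    total = 0.0
    for i in range(n):
        prod, x = 1.0, i % n
        for _ in range(s):
            u = x / n                      # coordinate of the point
            b2 = u * u - u + 1.0 / 6.0     # Bernoulli polynomial B_2
            prod *= 1.0 + 2.0 * math.pi ** 2 * b2
            x = (x * a) % n                # next Korobov coordinate
        total += prod
    return -1.0 + total / n

print(p2_korobov(1021, 76, 5))
```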
The definition of P_alpha has been generalized by Hickernell (1998b) by the
introduction of weights beta_I that take into account the relative importance
of each subset I (e.g., with respect to the ANOVA decomposition of f).
The generalization is defined as

P_{alpha,beta} = sum over nonzero h in L_s^* of beta_{I(h)} r(h)^{-alpha},

where I(h) is the set of nonzero indices of h. Assuming
that alpha is even and the beta_I are product-type weights, i.e., that

beta_I = prod_{j in I} beta_j,    (20.15)

this “weighted” P_alpha becomes

P_{alpha,beta} = -1 + (1/n) sum_{u in P_n} prod_{j=1}^s ( 1 - beta_j (-4 pi^2)^{alpha/2} B_alpha(u_j) / alpha! ),

which can still be computed in O(ns) time. By letting beta_j = 1 for
all j, we recover the criterion P_alpha. A normalized version of this
criterion and the quality measure M described above have been used
for selecting generators defining sequences of nested Korobov lattice
point sets by Hickernell, Hong, L'Ecuyer, and Lemieux (2001). Note
that P_{alpha,beta} can be considered as a special case of the weighted spectral test
of Hellekalek (1998), Definition 6.1. Various other measures of quality
for lattice rules can be found in Niederreiter (1992b), Sloan and Joe
(1994), Hickernell (1998b), and the references therein.

4.2. Criteria for digital nets

As we mentioned in Section 3.2, the t-value of a digital net is often used
to assess its quality. To compute this value, one has to find the largest integer v
such that for any non-negative integers d_1, ..., d_s satisfying
d_1 + ... + d_s = v, the vectors consisting of the first d_j rows of each
generating matrix C_j, j = 1, ..., s, are linearly
independent over F_b; with n = b^k points, the t-value is then t = k - v. Hence, computing t is typically quite time-consuming,
since for a given integer v there are C(v + s - 1, s - 1) different vectors
(d_1, ..., d_s) satisfying d_1 + ... + d_s = v and d_j >= 0 (Pirsic and Schmid
2001).
If we define t_I as the t-value of the projection P_n(I), an
equivalent definition of t is

t = max over subsets I of {1, ..., s} of t_I.

That is, this criterion measures the regularity of all projections and returns the
worst case. Inside this definition of t, we can also normalize the value of
t_I, so that projections over subspaces of different dimensions are judged
more equitably, in the same way as the value ell_d^* is used to normalize ell_I
in the criterion M. To do so, we can use the lower bound t_d^* for the t-value in d
dimensions, given by Niederreiter and Xing (1998),
and define,

max over subsets I of ( t_I - t_{|I|}^* ),    (20.16)

for example. The idea behind this is that a large value of t_I for a low-dimensional subset I is usually worse than when I is high-dimensional,
and therefore it should be penalized more.
The definition of the t-value that we used so far is of a geometrical
nature, similarly to the interpretation of ell_s as being the inverse of the
distance between the hyperplanes of a lattice. Interestingly, just like ell_s,
the t-value can also be related to the length of a shortest vector in
a certain dual space (Tezuka 1995; Larcher, Niederreiter, and Schmid
1996; Niederreiter and Pirsic 2001; Couture and L'Ecuyer 2000), as we
now explain. Our presentation follows Niederreiter and Pirsic (2001).
Let C_1, ..., C_s be the generating matrices associated with a
digital net in base b with n = b^k points. Let C be the matrix
obtained by concatenating the transposes of the first k rows of each C_j;
that is, if C_j^{(k)} denotes the matrix containing the first k rows of C_j, then

C = ( (C_1^{(k)})^T ... (C_s^{(k)})^T ).

The analysis of Niederreiter and Pirsic (2001) assumes that the generating matrices are k x k, but we will explain shortly why it remains valid
even if we start with larger matrices and truncate them.
Let N be the null space of the row space of C, i.e.,

N = { h in F_b^{ks} : Ch = 0 }.

We refer to N as the dual space of the digital net from now on. Define
the following norm on F_b^k: for any nonzero a = (a_1, ..., a_k), let
v(a) = max{ l : a_l != 0 }, and let v(0) = 0. Define the norm of a
vector h = (h_1, ..., h_s) in F_b^{ks} by

V(h) = sum_{j=1}^s v(h_j).    (20.18)

The following result about the t-value is proved by Niederreiter and
Pirsic (2001):

t = k + 1 - min{ V(h) : h in N, h != 0 }.    (20.19)
We now explain why this result is valid even if the matrices C_j have
been truncated to their first k rows. Let N' denote the dual space that
would be obtained without the truncation. Observe that, by definition,
N' is contained in N.
Also, using Proposition 1 of Niederreiter and Pirsic (2001) and the fact
that the dimension of the row space of C is not larger than k, we have
that the minimum of V(h) over the nonzero vectors is the same for N and N'.
Therefore, (20.19) is also true if we replace N by N'. From
now on, we assume that N actually represents the dual space obtained
without truncating the generating matrices C_j. Also, we view N as a
subspace of F_b[z]^s, which means that each element in N is represented
by a vector h(z) = (h_1(z), ..., h_s(z)) of polynomials over F_b, and the
norm is defined by V(h) = sum_{j=1}^s (deg(h_j) + 1), with the convention
that deg(0) = -1.
In the special case where P_n is a polynomial lattice point set, the dual
space corresponds to the dual lattice L_s^*. If we define the norm

||h|| = max_{1 <= j <= s} deg(h_j),    (20.20)

then

ell = min{ ||h|| : h in L_s^*, h != 0 },

where ell is the resolution of the polynomial lattice point set. This result is discussed by Couture, L'Ecuyer, and Tezuka (1993), Couture
and L'Ecuyer (2000), Lemieux and L'Ecuyer (2001), and Tezuka (1995).
The resolution is often used for measuring the quality of PRNGs based
on linear recurrences modulo 2, such as Tausworthe generators (Tootill,
Robinson, and Eagle 1973; L'Ecuyer 1996). From a geometrical point
of view, the resolution is the largest integer ell for which P_n is
(ell, ..., ell)-equidistributed. Obviously, ell <= floor(k/s) if n = b^k.
This concept can be extended from polynomial lattice point sets to
general digital nets by replacing L_s^* by N above. More precisely, we
have:

Proposition 1 Let P_n be a digital net in base b and let N be the dual
space of P_n. The resolution ell of P_n satisfies:

ell = min{ ||h|| : h in N, h != 0 },

where ||h|| is defined as in (20.20).

Proof: The proof of this proposition requires results given in the forth-
coming sections, and it can be found in the appendix.
The resolution can be computed for any projection P_n(I) of P_n: for
I a subset of {1, ..., s}, let

ell_I = min{ ||h|| : h in N_I, h != 0 },

where N_I is the dual of the row space of the matrix built from the generating
matrices with indices in I.
The following criterion has been proposed to select polynomial lattice
point sets (Lemieux and L'Ecuyer 2001), and it could also be used for
any digital net:

max over the selected subsets I of ( floor(k/|I|) - ell_I ),    (20.21)

where the set of subsets I considered has the same meaning as in the definition of the
criterion M in (20.14).
Another criterion is the digital version of the quality measure P_alpha. It
is closely related to the dyadic diaphony (Hellekalek and Leeb 1997)
and the weighted spectral test (Hellekalek 1998), and was introduced by
Lemieux and L'Ecuyer (2001) for polynomial lattice point sets in base
2. It uses a norm W(h) based on the position of the last nonzero digit of
each coordinate of h, and is defined for an integer alpha > 1 and weights beta_I
as a weighted sum of W(h)^alpha over the nonzero vectors h of the dual space,
where the weight attached to h is beta_{I(h)} and I(h) is the set of nonzero indices of h.
In the special case where alpha is even
and the weights are product-type weights as in (20.15), this digital P_alpha
can again be computed in O(ns) time from a product formula analogous to
the one given earlier for the standard P_alpha (see Lemieux and L'Ecuyer 2001).


We conclude this subsection by giving references where numerical values of the previous criteria are given for specific point sets. The value
of the t-value for the Sobol' sequence can be found in Niederreiter and Xing
(1998) for a range of dimensions; these values are compared in that paper
with those obtained from the improved base-2 sequences proposed by
Niederreiter and Xing (1997, 1998). The “Salzburg Tables” given in
Pirsic and Schmid (2001) list optimal polynomial pairs (P(z), g(z)) and
their associated t-values to build Korobov polynomial rules. Generalized
Faure sequences are built so that t = 0, but with the drawback that the
base b must be at least as large as the dimension s. Hence only small
powers m of the base are typically used in practice, and the quality of
P_n is measured only for n = b^m points (or less), where m is
small. This illustrates the fact that for comparisons to be fair between
different constructions, the value of the t-value should be considered in
conjunction with the base b. Algorithms to compute t are discussed by
Pirsic and Schmid (2001).
The resolution has been used by Sobol' for finding optimal values to
initialize the recurrences defining the direction numbers in his construction
(Sobol' 1967; Sobol' 1976). More precisely, his Property A means that
the first 2^s points of the sequence have the maximal resolution of 1,
and his Property A' means that the first 2^{2s} points have the maximal
resolution of 2.
Following ideas from Morokoff and Caflisch (1994) and Cheng and
Druzdzel (2000), a criterion related to (20.21) is used to find initial
direction numbers for the Sobol' sequence in high dimensions in the forthcoming RandQMC library (Lemieux, Cieslak, and Luttmer 2001); the maximum in (20.21) is taken over all two-dimensional subsets I of the form
I = {j', j}, where j is the dimension for which we want to find
initial direction numbers and j' < j. Examples of parameters for polynomial lattice point sets chosen with respect to (20.21) for
different values of n are given by Lemieux and L'Ecuyer (2001).

5. Randomizations
Once a construction is chosen for P_n and the approximation Q_n given
by (20.2) is computed, it is usually important to have an estimate of
the error Q_n - mu. For that purpose, upper bounds of the form (20.3)
are not very useful, since they are usually much too conservative, in
addition to being hard to compute and restricted to a possibly small set
of functions. Instead, one can randomize the set P_n so that: 1) each point
in the randomized point set has a uniform distribution over [0, 1)^s;
2) the regularity (or low-discrepancy property) of P_n, as measured by
a specific quality criterion, is preserved under the randomization. The
first property guarantees that the approximation

Q~_n = (1/n) sum_{i=0}^{n-1} f(u~_i),    (20.22)

where u~_0, ..., u~_{n-1} denote the randomized points,
is an unbiased estimator of mu. When the second property holds, the
variance of the estimator Q~_n can usually be expressed in a way that
establishes a relation between the optimization of the criterion whose
value is preserved under the randomization, and the minimization of the
estimator's variance. Hence these two properties help in viewing randomized QMC methods as variance reduction techniques that preserve the
unbiasedness of the MC estimator. In practice, the variance of Q~_n can
be estimated by generating i.i.d. copies of Q~_n through independent
replications of the randomization. This estimator can then be compared
with the estimated variance of the MC estimator to assess the effectiveness of QMC for any particular problem.
We now describe some randomizations having these two properties
that have been proposed in the literature for the constructions presented in the preceding section.

5.1. Random shift modulo 1

The following randomization was originally proposed by Cranley
and Patterson (1976) for standard lattice rules. Some authors suggested
that it could also be used for other low-discrepancy point sets (Tuffin
1996; Morohosi and Fushimi 2000).
Let P_n = {u_0, ..., u_{n-1}} be a given point set in [0, 1)^s and Delta
an s-dimensional random vector uniformly distributed over [0, 1)^s. The
randomly shifted estimator based on P_n is defined as

Q~_n = (1/n) sum_{i=0}^{n-1} f( (u_i + Delta) mod 1 ),

where the modulo is applied coordinate-wise.
When P_n is a lattice point set, the length of the shortest vector associated with any projection is preserved under this randomization.
An explicit expression for the variance of Q~_n in that case will be given
in Section 6.1.
With this randomization, each shifted point (u_i + Delta mod 1)
is uniformly distributed over [0, 1)^s. Therefore, even if the dimension s
is much larger than the number of points n, and if many coordinates
are equal within a given point (for instance, when P_n comes from an
LCG with a small period, as in Section 3.3), these coordinates become
mutually independent after the randomization. Hence each point has the
same distribution as in the MC method; the difference with MC is that
the points of the shifted lattice are not independent. See L'Ecuyer
and Lemieux (2000), Section 10.3, for a concrete numerical example.
These properties also hold for the other
randomizations described below.
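A minimal Python sketch of this randomization follows; the function name, the test integrand, and the small Korobov lattice used as input are our own choices for illustration.

```python
# A minimal sketch of the Cranley-Patterson random shift: one uniform
# vector shifts every point of the set, modulo 1.  Replicating the
# randomization gives i.i.d. copies whose sample variance estimates
# the variance of the shifted estimator.
import numpy as np

rng = np.random.default_rng(42)

def randomly_shifted_estimator(f, points, n_replications=10):
    points = np.asarray(points)            # shape (n, s)
    estimates = []
    for _ in range(n_replications):
        shift = rng.random(points.shape[1])
        shifted = (points + shift) % 1.0   # each point is now U[0,1)^s
        estimates.append(np.mean([f(u) for u in shifted]))
    return np.array(estimates)

# Example: a 3-dimensional Korobov lattice and an integrand with mu = 1.
n, a = 101, 12
lattice = [((i * np.array([1, a, a * a]) % n) / n) for i in range(n)]
est = randomly_shifted_estimator(lambda u: np.prod(1 + (u - 0.5)), lattice)
print(est.mean(), est.var(ddof=1))
```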

5.2. Digital shift

When P_n is a digital net in base b, the counterpart of the previous
method is to consider the base-b expansion of the random vector Delta and
to add it to each point of P_n using operations over F_b. More precisely,
if u_i = (u_{i,1}, ..., u_{i,s}) and Delta = (Delta_1, ..., Delta_s) with

u_{i,j} = sum_{l>=1} u_{i,j,l} b^{-l}  and  Delta_j = sum_{l>=1} Delta_{j,l} b^{-l},

we compute

u~_{i,j,l} = (u_{i,j,l} + Delta_{j,l}) mod b,

and let u~_{i,j} = sum_{l>=1} u~_{i,j,l} b^{-l}.
This randomization was suggested to us by Raymond Couture for
point sets based on linear recurrences modulo 2 (see also Lemieux and
L'Ecuyer 2001). It is also used in an arbitrary base b (along with other
more time-consuming randomizations) in Hong and Hickernell (2001)
and Matousek (1998), as we will see in Section 5.4. It is best suited for
digital nets in base b, and its application preserves the resolution and
the t-value of any projection.
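In base 2, with points stored as integers, the digit-wise addition modulo 2 is simply a bitwise XOR, as in this minimal sketch (names and the inline example are ours).

```python
# A minimal sketch of the digital shift in base 2: XOR each coordinate's
# integer digit vector with a random L-bit integer.
import random

def digital_shift_base2(int_points, n_bits=31, seed=7):
    """int_points: list of s-tuples of integers in [0, 2^n_bits).
    Returns the shifted points, mapped to [0,1)^s."""
    rng = random.Random(seed)
    s = len(int_points[0])
    shift = [rng.getrandbits(n_bits) for _ in range(s)]
    return [tuple((x ^ d) / 2.0 ** n_bits for x, d in zip(p, shift))
            for p in int_points]

# Example: shift the first 8 integer states of the one-dimensional
# van der Corput net generated in Gray-code order (recomputed inline).
states, x = [(0,)], 0
for i in range(1, 8):
    x ^= 1 << (30 - ((i & -i).bit_length() - 1))
    states.append((x,))
print(digital_shift_base2(states))
```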

5.3. Scrambling
This randomization has been proposed by Owen (1995), and it also
preserves the t-value of a digital net and its resolution, for any projection. It works as follows: in each dimension j = 1, ..., s, partition the
interval [0, 1) into b equal parts and permute them uniformly and randomly; then, partition each of these sub-intervals into b equal parts and
permute them uniformly and randomly; etc. More precisely, to scramble
L digits one needs to randomly and uniformly generate several independent permutations of the integers {0, ..., b - 1} (assuming a specific
bijection has been chosen to identify the elements in F_b with those in
Z_b if b is not prime): one permutation per dimension, per level l <= L,
and per sub-interval of the previous level, applied to the successive digits
of each coordinate of each point.
In practice, L may be chosen equal to the word-length of the computer,
and the digits beyond position L are then dropped. However, as
Matousek (1998) points out, if n = b^k and no two points have the same first k
digits in each dimension (i.e., for each j the unidimensional projection
has a maximal resolution of k), then the permutations after level k
are independent for each point and therefore, the random digits
for l > k can be generated uniformly and independently. Hence
in this case we do not need to generate any permutation after level k.
Owen (1995), Section 3.3, has suggested a similar implementation.
When b and s are large, the amount of memory required for storing
all the permutations becomes very large, and only a partial scrambling
might then be feasible; that is, scramble only the first few digits, and generate the
remaining ones randomly and uniformly (Fox 1999), or reuse permutations across digits (Tan and Boyle 2000). A clever way of avoiding
storage problems is discussed by Matousek (1998), and a related idea
is used in Morohosi's code (which can be found at www.misojiro.t.u-tokyo.ac.jp/~morohosi) for scrambling Faure sequences. The idea
is to avoid storing all the permutations by reinitializing appropriately
the underlying PRNG so that the permutations can be regenerated as
they are needed. This is especially useful when the base is large, which
happens when Faure sequences are used in large or even moderate dimensions.
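A minimal sketch of nested scrambling in base 2 follows (our own construction; a production implementation would use the storage-saving ideas just discussed). In base 2, permuting the two halves of an interval amounts to flipping a digit, and the flip applied at level l may depend on all preceding digits; the recursion below makes that dependence explicit.

```python
# A minimal sketch of nested (Owen-type) scrambling in base 2: one
# independent random flip bit per node of the binary interval tree.
import random

def owen_scramble_1d(digit_lists, level, n_levels, rng):
    """Recursively scramble a list of binary digit vectors in place."""
    if level == n_levels or not digit_lists:
        return
    flip = rng.getrandbits(1)                  # one permutation per node
    left, right = [], []
    for d in digit_lists:
        d[level] ^= flip
        (left if d[level] == 0 else right).append(d)
    owen_scramble_1d(left, level + 1, n_levels, rng)   # independent
    owen_scramble_1d(right, level + 1, n_levels, rng)  # sub-permutations

def owen_scramble(points, n_levels=16, seed=1):
    rng = random.Random(seed)
    s = len(points[0])
    digits = [[[(int(p[j] * 2 ** n_levels) >> (n_levels - l - 1)) & 1
                for l in range(n_levels)] for p in points] for j in range(s)]
    for j in range(s):
        owen_scramble_1d(digits[j], 0, n_levels, rng)
    return [tuple(sum(digits[j][i][l] * 2.0 ** -(l + 1)
                      for l in range(n_levels)) for j in range(s))
            for i in range(len(points))]

print(owen_scramble([(0.0,), (0.5,), (0.25,), (0.75,)]))
```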

5.4. Random Linear Scrambling

Matousek (1998) proposes an alternative scrambling method that does
not require as much randomness and storage. It borrows ideas from
the scrambling technique and the transformations proposed by Faure and
Tezuka (2001) and Tezuka (1995). This method is also discussed by
Hong and Hickernell (2001), where it is called “Owen-Scrambling”; our
presentation follows theirs, but we prefer the name used by Matousek
to avoid any confusion with the actual scrambling proposed by Owen.
The idea is to generate s nonsingular lower-triangular matrices
M_1, ..., M_s with elements chosen randomly, independently, and uniformly
over Z_b (the elements on the main diagonal of each M_j are chosen over
the nonzero elements of Z_b, so that M_j is nonsingular), and s vectors
d_1, ..., d_s with entries independently and uniformly distributed over Z_b.
The digits of a randomized coordinate u~_{i,j} are then obtained as

(u~_{i,j,1}, u~_{i,j,2}, ...)^T = M_j (u_{i,j,1}, u_{i,j,2}, ...)^T + d_j,

where all operations are performed in Z_b.
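The following Python sketch, our own, applies this randomization in base 2, representing each M_j by explicit rows of bits; function and variable names are chosen for illustration.

```python
# A minimal sketch of random linear scrambling in base 2: each
# coordinate's digit vector y is replaced by M y + d over F_2, with M
# a random nonsingular (unit-diagonal) lower-triangular bit matrix and
# d a random digital shift.
import random

def linear_scramble(points, n_digits=16, seed=3):
    rng = random.Random(seed)
    s = len(points[0])
    # mats[j][l] = row l of M_j: l random bits, then the diagonal 1.
    mats = [[[rng.getrandbits(1) for _ in range(l)] + [1]
             for l in range(n_digits)] for _ in range(s)]
    shifts = [[rng.getrandbits(1) for _ in range(n_digits)]
              for _ in range(s)]
    out = []
    for p in points:
        q = []
        for j, u in enumerate(p):
            y = [(int(u * 2 ** n_digits) >> (n_digits - 1 - l)) & 1
                 for l in range(n_digits)]
            z = [(sum(r * yl for r, yl in zip(row, y)) + d) % 2
                 for row, d in zip(mats[j], shifts[j])]
            q.append(sum(zl * 2.0 ** -(l + 1) for l, zl in enumerate(z)))
        out.append(tuple(q))
    return out

print(linear_scramble([(0.0, 0.5), (0.25, 0.75)]))
```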



5.5. Others
We briefly mention some other ideas that can be used to randomize
QMC point sets. In addition to the linear scrambling, Matousek (1998)
proposes randomization techniques for digital sequences that are easier
to generate than the scrambling method, while retaining enough randomness for the purpose of some specific theoretical analyses. Hong and
Hickernell suggest another form of linear scrambling that incorporates
transformations proposed by Faure and Tezuka (2001). Randomizations
that use permutations in each dimension to reorder the Halton sequence
are discussed by Braaten and Weller (1979) and Morokoff and Caflisch
(1994). Wang and Hickernell (2000) propose to randomize this sequence
by randomly generating its starting point.
Some authors (Ökten 1996; Spanier 1995) suggest partitioning the
set {1, ..., s} of dimensions into two subsets (typically, of successive indices, i.e., {1, ..., d} and {d + 1, ..., s}) and then using a QMC method
(randomized or not) on one subset and MC on the other one. One of
the justifications for this approach is that some digital nets (e.g., the Sobol'
sequence) are known to have projections with better properties
when I contains small indices; this suggests using QMC on the first few
dimensions and MC on the remaining ones. However, this argument
becomes irrelevant if a dimension-stationary point set is used. More importantly, if the QMC point set is randomized and can be shown (or
presumed) to do no worse than MC in terms of its variance, then there
is no advantage or “safety net” gained by using MC on one part of the
problem. Estimators obtained by “padding” randomized QMC point
sets with MC are analyzed by Owen (1998a) and Fox (1999). Owen
(1998a) discusses some other padding techniques, as well as a method
called Latin Supercube Sampling to handle very high-dimensional problems.

6. Error and Variance Analysis

In this section, we study the error for the approximations Q_n based
on low-discrepancy point sets, and the variance of estimators obtained
by randomizing these sets. All the results mentioned here are obtained
by using a particular basis for the set L_2([0, 1)^s) of square-integrable
functions over [0, 1)^s to expand the integrand f. In each case, the basis is
carefully chosen, depending on the structure of P_n, so that the behavior
of the approximation on the basis functions is easy to analyze.
The results presented here describe, for a given square-integrable function, the variance of different randomized QMC estimators. This is one
possible approach for analyzing the performance of these estimators;

different aspects of randomized QMC methods have been studied in the


literature. In particular‚ Hickernell and (2001) have re-
cently studied the behavior of different types of error on Hilbert spaces
of integrands in a general setting that includes MC‚ scrambled nets‚ and
randomly shifted lattices. They consider the “worst-case error” (error
on the worst integrand for a given QMC approximation)‚ the “random-
case error” (expected error‚ e.g.‚ in the root mean square sense‚ of a
randomized QMC estimator on a given integrand)‚ and the “average-
case error” (average error over and they give explicit formulas for
these errors. This type of analysis provides useful connections between
variance expressions such as those studied here (which corresponds to
the random-case setting)‚ and the discrepancy measures discussed in the
previous section (which corresponds to the worst-case setting). Specific
results for scrambled-nets estimators are given by Heinrich‚ Hickernell‚
and Yue (2001b).

6.1. Standard Lattices and Fourier Expansion

For standard lattice rules, the following result suggests that expanding f in a Fourier series is appropriate for error and variance analysis.
Recall that the Fourier basis for L_2([0, 1)^s) is orthonormal and given by
{ e_h(u) = exp(2 pi i h.u) : h in Z^s }, where i = sqrt(-1) and
h.u = h_1 u_1 + ... + h_s u_s.

Lemma 1 (Sloan and Joe 1994, Lemma 2.7) If P_n = {u_0, ..., u_{n-1}}
is a lattice point set, then for any h in Z^s,

(1/n) sum_{i=0}^{n-1} e_h(u_i) = 1 if h is in L_s^*, and 0 otherwise.

Hence, the lattice rule integrates a basis function e_h with no error
when the nonzero h is outside L_s^*, and with error 1 otherwise. Using this, we get
the following result:

Proposition 2 Let P_n be a lattice point set, Q~_n be defined as in (20.22),
and let

f^(h) = integral over [0,1)^s of f(u) exp(-2 pi i h.u) du

be the Fourier coefficient of f evaluated in h. (From Sloan and Joe 1994)
If f has an absolutely convergent Fourier series, then

Q_n - mu = sum over nonzero h in L_s^* of f^(h).    (20.24)

(From L'Ecuyer and Lemieux 2000) If f is square-integrable, then

Var(Q~_n) = sum over nonzero h in L_s^* of |f^(h)|^2,    (20.25)

whereas for the MC estimator based on n points,

Var = (1/n) sum over nonzero h in Z^s of |f^(h)|^2.

The result (20.25) was proved independently by Tuffin (1998), but under the stronger assumption that f has an absolutely convergent Fourier
series. Notice that by contrast with the MC estimator, there is no factor
of 1/n that multiplies the sum of squared Fourier coefficients for the
randomly shifted lattice rule estimator. Hence in the worst case, the
variance of Q~_n could be n times as large as the MC estimator's variance. This worst case corresponds to an extremely unlucky pairing of
function and point set, for which f^(h) = 0 for all h outside L_s^*. However, in
the expression for the variance of Q~_n, the coefficients are summed only
over the dual lattice L_s^*, which contains n times fewer points than the set
Z^s over which the sum is taken in the MC case. Therefore, if the dual
lattice is such that the squared Fourier coefficients are smaller “on
average” over L_s^* than over Z^s, then the variance of Q~_n will be smaller
than the variance of the MC estimator.
From the results given in the previous proposition, different bounds on
the error and variance can be obtained by making additional assumptions
on the integrand f (Sloan and Joe 1994; Hickernell 1998b; Hickernell
2000). Most of these bounds involve the quality measures P_alpha or rho_s.
Hence a point set that minimizes one of these two criteria minimizes
a bound on the error or variance for the class of functions for which
those bounds hold. Such analyses often provide arguments in favor of
these criteria. A different type of analysis, based on the belief that the
largest squared Fourier coefficients tend to be associated with “short
vectors” h, corresponding to the low frequency terms of f, suggests that
the lattice point set should be chosen so that L_s^* does not contain those
“short” vectors. From this point of view, a criterion like M in (20.14) seems
appropriate since it makes sure that L_s^* does not contain vectors with a
small Euclidean length. This criterion also has the advantage of being
usually much faster to compute than P_alpha (Entacher, Hellekalek,
and L'Ecuyer 2000; Hickernell, Hong, L'Ecuyer, and Lemieux 2001).

6.2. Digital Nets and Haar or Walsh Expansions

Recall that digital nets are usually built so as to satisfy different
equidistribution properties with respect to partitions of the unit hypercube into rectangular boxes. For this reason, it is convenient to use
a basis consisting of step functions that are constant over such boxes
for studying their associated error and variance. Both Walsh and Haar
basis functions have this property. In addition, the Walsh functions form
an orthonormal basis of L_2([0, 1)^s).

6.2.1 Scrambled-type estimators. We first define the Haar
basis functions in base b, following the presentation of Owen (1997) and
Heinrich, Hickernell, and Yue (2001a). Let k = (k_1, ..., k_s)
be a vector of positive integers, c = (c_1, ..., c_s) be a vector of box indices
at the corresponding levels, and t = (t_1, ..., t_s) be such that 0 <= t_j < b. A multivariate Haar
wavelet basis function is defined as the product

psi_{k,c,t}(u) = prod_{j=1}^s psi_{k_j,c_j,t_j}(u_j),

where each factor is a step function supported on the c_j-th interval of the
partition of [0, 1) into b^{k_j - 1} equal intervals.
Now consider the part of the Haar expansion of f
that depends on the basis functions associated with a given vector
k, i.e., let

f_k(u) = sum_{c,t} <f, psi_{k,c,t}> psi_{k,c,t}(u),    (20.26)

where <.,.> denotes the inner product in L_2([0, 1)^s).
The function f_k is also a step function, which is constant within the
boxes obtained by partitioning [0, 1)^s into b^{k_j} equal intervals along
the j-th axis, for each j. Owen (1997) shows that the variance of the
estimator based on a scrambled digital net with n points is given by

Var(Q~_n) = (1/n) sum_k Gamma_k sigma_k^2,    (20.27)

where

sigma_k^2 = integral over [0,1)^s of f_k^2(u) du,

and Gamma_k depends on the equidistribution properties of the digital net and
the definition of the scrambling.

Assuming that the t-value of the net is t, Owen (1998b) gives an explicit
bound on the coefficients Gamma_k. Using this, Owen obtains the following bound on the variance of the
scrambled-net estimator:
Proposition 3 (Owen 1998b, Theorem 1) Let Q~_n be the estimator constructed from a scrambled digital net with n points. For any square-integrable function f,

Var(Q~_n) <= (b^t / n) ((b + 1)/(b - 1))^s sigma^2,    (20.28)

where sigma^2 denotes the MC variance of f.

Hence the variance of the scrambled-net estimator cannot be larger
than the MC estimator's variance, up to a certain constant (independent
of n, but growing exponentially with s for a fixed base b). For a
(0, m, s)-net, t = 0, which implies that this constant can be bounded by
e, approximately 2.718 (Owen 1997). In the case where f satisfies certain smoothness properties
(its mixed partial derivatives satisfy a Lipschitz condition), Owen shows
that the coefficients sigma_k^2 decay rapidly as k_1 + ... + k_s grows.
Under this assumption, he obtains a bound in O(n^{-3} (log n)^{s-1}) for the
variance of the scrambled-net estimator. Other results on the asymptotic
properties of the scrambled-net estimator, that use Haar wavelets, can
be found in Heinrich, Hickernell, and Yue (2001a), Heinrich, Hickernell,
and Yue (2001b), and the references cited there. Haar series are also
considered in the context of QMC integration by, e.g., Entacher (1997)
and Sobol' (1969).
An important point to mention is that the scrambling is not the only
randomization for which (20.27) holds; the result is valid for any randomization satisfying the following properties (Hong and Hickernell 2001;
Matousek 1998):
1 each point in the randomized point set P~_n = {u~_0, ..., u~_{n-1}} is
uniformly distributed over [0, 1)^s;
2 for any two points u, v in P_n and each coordinate j, if their digits satisfy
u_{j,l} = v_{j,l} for l < l* but u_{j,l*} != v_{j,l*}, then
(a) the randomized digits satisfy u~_{j,l} = v~_{j,l} for l < l*;
(b) the pair (u~_{j,l*}, v~_{j,l*}) is uniformly distributed over the pairs of distinct digits;

(c) the deeper digits of u~ and v~ are uncorrelated, for any l > l*.

Hong and Hickernell (2001) have shown that the random linear scrambling mentioned in Section 5.4 satisfies these properties, so the bound
(20.28) given in Proposition 3 holds for linearly scrambled estimators
as well. This is interesting since this method has a faster implementation than the scrambling of Section 5.3. Note that the digital
shift does not satisfy 2(b), since the digits are such that
u~_{j,l*} - v~_{j,l*} = u_{j,l*} - v_{j,l*} (mod b).
6.2.2 Digitally shifted estimators. To study the
variance of an estimator based on a digitally shifted digital net, we use a Walsh expansion for f.
Walsh series have also been used to analyze the error produced by (non-randomized) digital nets by Larcher, Niederreiter, and Schmid (1996),
Larcher, Lauss, Niederreiter, and Schmid (1996), and Larcher and Pirsic (1999). In the presentation of the forthcoming results, the vector h
will be used both to represent elements of the set of integer vectors {0, ..., b^k - 1}^s, where n = b^k, and
elements of the digit space F_b^{ks}. When required, we will use the bijection
defined by mapping each integer h_j to the vector of its base-b digits,
for each j, to go back and forth between these
two spaces.
For any h, the Walsh basis function in h is defined as

w_h(u) = exp( (2 pi i / b) h.u ),

where h.u = sum_{j=1}^s sum_{l>=1} h_{j,l} u_{j,l}, the coefficients
h_{j,l} and u_{j,l} are the base-b digits of h_j and u_j, respectively, and all
operations are performed in Z_b. For u and v in [0, 1)^s, we have that

w_h(u + v) = w_h(u) w_h(v),

where the addition u + v corresponds to a digit-by-digit addition over Z_b (as if
we were adding the corresponding elements in F_b^{ks}). See Larcher,
Niederreiter, and Schmid (1996) and Larcher and Pirsic (1999) for more
information on generalized definitions of Walsh series in the context of
QMC integration.
Let f~(h) denote the Walsh coefficient of f in h, that is,

f~(h) = integral over [0,1)^s of f(u) conj(w_h(u)) du.    (20.29)
Let denote the Walsh coefficient of in h‚ that is

The following result may be interpreted as the digital counterpart of the
result stated in Lemma 1. Recall that N denotes the dual space of P_n.
Lemma 2 Let P_n be a digital net in base b with n = b^k. For any h,
we have

(1/n) sum_{i=0}^{n-1} w_h(u_i) = 1 if h is in N, and 0 otherwise.

Larcher, Niederreiter, and Schmid (1996) have shown a closely related result.
Proof: If h is in N, then h . u_i = 0 for each i, since each u_i
is built from the row space of C; the result follows
easily. If h is not in N, then Ch = y, where y != 0. We are interested
in the scalar products h . u_i for i = 0, ..., n - 1. Notice that the multiset
{h . u_i} is the image of F_b^k under the
application of a mapping that corresponds to the multiplication by y.
Since y != 0, the dimension of this image is 1, and the dimension of
the kernel of this mapping is thus k - 1. Hence each element in F_b has
b^{k-1} pre-images under this mapping, and therefore, as a multiset,
{h . u_i} contains b^{k-1} copies of each element of F_b. Using
this and the fact that the b-th roots of unity sum to zero,
the result immediately follows.


Using this lemma, we get the following result, which is analogous to
that presented in Proposition 2. It is proved by Lemieux and L'Ecuyer
(2001) for the case where b = 2 and P_n is a polynomial lattice point set.
Proposition 4 Let P_n be a digital net in base b. For any function f
having an absolutely convergent Walsh series expansion, we have

Q_n - mu = sum over nonzero h in N of f~(h).

Let P~_n be a digitally shifted digital net in base b and Q~_n be the associated
estimator. For any square-integrable function f, we have

Var(Q~_n) = sum over nonzero h in N of |f~(h)|^2,    (20.30)

and the variance of the MC estimator based on n points is given by

Var = (1/n) sum over nonzero h of |f~(h)|^2.


Proof: Assume P~_n is obtained from P_n by the digital shift Delta. Then we can write

Q~_n = (1/n) sum_{i=0}^{n-1} f(u_i + Delta).

If f is square-integrable, then the function f_Delta defined by

f_Delta(u) = f(u + Delta)

is also square-integrable, where the addition corresponds to a digit-by-digit addition in Z_b. In addition, its Walsh series converges in the L_2 sense,
because Parseval's equality holds for the Walsh series expansion (see Golubov, Efimov, and Skvortsov 1991, for example). Now, for any h,
we have

f~_Delta(h) = w_h(Delta) f~(h),    (20.31)

which is obtained from (20.29) by the change of variable v = u + Delta,
and thus u = v - Delta, where the minus sign denotes a digit-by-digit subtraction in Z_b.
From (20.31), the result follows. The variance of the MC estimator
is obtained by applying Parseval's equality.
To see the connection with scrambled-type estimators‚ we use the
following result (proved for by Lemieux 2000)‚ whose proof is
given in the appendix. The result also makes a connection between
variance expressions in terms of Haar and Walsh expansions‚ because
is defined in terms of Haar coefficients.

Lemma 3 Let be a vector of positive integers and be defined as


in (20.26). If is square-integrable‚ then

where otherwise}.

Hence in the digital shift case‚ in comparison with MC‚ the contribu-
tion of a basis function to the variance expression is either multi-
plied by (if h is in the dual space) or by 0‚ whereas in the scrambled
case‚ this contribution is multiplied by 0 for “small vectors”‚ and by a
factor that can be upper-bounded by a quantity independent of other-
wise. This factor being sometimes in the digital shift case prevents us
from bounding by a constant times Similarly‚ the case of
smooth functions yields a variance bound in for digitally-
shifted estimators‚ which is larger by a factor than the order of the
bound obtained for scrambled-type estimators. On the other hand‚ the
shift is a very simple randomization easy to implement; the esti-
mator can typically be constructed in the same (or less) time as the
MC estimator based on the same number of points.
Based on the expression (20.30) for the variance of the digitally shifted
estimator‚ the same type of heuristic arguments as those given for ran-
domly shifted lattice rules can be used to justify selection criteria such
as to choose digital nets. That is‚ if we assume that the largest
Walsh coefficients are those associated with “small” vectors h‚ then it is
reasonable to choose so that the dual space does not contain those
small vectors. If we use the norm ||h|| defined in (20.20) to measure h‚
this suggests using a criterion based on the resolution such as
if instead we use the norm V(h) defined in (20.18)‚ then the or
the variant defined in (20.16) should be employed. We refer the reader
to Hellekalek and Leeb (1997) and Tezuka (1987) for additional connec-
tions between Walsh expansions and nonuniformity measures (e.g.‚ the
so-called ‘Walsh-spectral test’ of Tezuka). Note that criteria based on
the resolution are faster to compute than those based on the
Recent Advances in Randomized Quasi-Monte Carlo Methods 461

because the latter verifies the of for a


much larger number of vectors

7. Transformations of the Integrand

So far, our description of how to use QMC methods can be summarized as follows: Choose a construction and a randomization; choose a
selection criterion; find a good point set with respect to this criterion
(or use a precomputed table of “good” point sets); randomize the point
set, and compute Q~_n as an estimator for mu. If the selection criterion mimics the variance of Q~_n well enough, one should obtain a low
variance estimator with this approach. Most of the selection criteria
presented in Section 4 are defined so that they should imitate more or
less the variance of Q~_n for a large class of functions, i.e., they provide “general-purpose” low-discrepancy point sets. However, once the
problem at hand is known, the variance can sometimes be reduced further by making use of information on f in a clever way. In particular,
techniques used to reduce the MC variance can also be used in combination with QMC methods. Examples of such techniques are antithetic
variates, control variables, importance sampling, and conditional Monte
Carlo. These methods can all be seen as transformations applied to f in
order to reduce its variability; that is, one replaces f by a function f~ such
that f~ has the same integral mu and (hopefully) a smaller variance.
If the function f~ requires more computation time for its evaluation, one
should make sure that the variance reduction gained is worth the extra
effort, i.e., that the efficiency is improved.
A second class of methods that can reduce the variance of QMC estimators for certain functions are dimension reduction methods. Among
this class are the Brownian bridge technique of Caflisch and Moskowitz
(1995), approaches based on principal components (Acworth, Broadie,
and Glasserman 1997; Åkesson and Lehoczy 2000), and various methods discussed by Fox (1999) for generating Poisson and other stochastic
processes.
Typically, these methods are used when f is defined in terms of a
stochastic process for which a sample path is generated using the uniform
numbers provided by a point of P_n. The goal is then
to generate the sample path in a way that will decrease the effective
dimension of f. This is usually achieved by using a method that gives
a lot of importance to a few uniform numbers. As an illustration, we
describe in the example below the Brownian bridge technique, which can
be used to generate the sample paths of a Brownian motion.
be used to generate the sample paths of a Brownian motion.

Example 3 As often happens in financial simulations (see e.g.,
Boyle, Broadie, and Glasserman 1997; Caflisch, Morokoff, and Owen
1997), suppose we want to generate the sample path of a Brownian
motion B at s different times t_1 < ... < t_s using s uniform numbers
u_1, ..., u_s. For instance, this Brownian motion might be driving the
price process of an asset on which an option has been written (Duffie
1996). Instead of generating these observations sequentially (that is, by
using u_j to generate the Gaussian random variable B(t_j) given B(t_{j-1})),
u_1 is used to generate B(t_s), u_2 is used to generate B(t_{s/2}) given B(t_s),
u_3 and u_4 are used for B(t_{s/4}) and B(t_{3s/4}), etc. This can be done easily since for
t_l < t_j < t_r, the distribution of B(t_j) given B(t_l) and B(t_r) is Gaussian
with parameters depending only on t_l, t_j, and t_r. By generating the Brownian
motion path this way, more importance is given to the first few uniform
numbers since they determine important aspects of the path, such as its
value at the end, middle, first quarter, etc.
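A minimal Python sketch of this construction on 2^m equally spaced observation times follows; the names are ours, and the inverse normal CDF comes from the standard library's statistics module.

```python
# A minimal sketch of the Brownian bridge construction: u[0] fixes the
# terminal value, and each subsequent uniform fills in the midpoint of
# an existing interval, which is conditionally Gaussian given its
# endpoints.
from statistics import NormalDist
import numpy as np

def brownian_bridge_path(u, T=1.0):
    """u: array of 2^m uniforms; returns B at times T/d, 2T/d, ..., T."""
    d = len(u)
    z = np.array([NormalDist().inv_cdf(v) for v in u])  # normals
    B = np.zeros(d + 1)                  # B[0] = B(0) = 0
    B[d] = np.sqrt(T) * z[0]             # terminal value first
    k, h = 1, d                          # h = current interval length
    while h > 1:
        half = h // 2
        for left in range(0, d, h):      # fill the midpoint of each interval
            mid, right = left + half, left + h
            mean = 0.5 * (B[left] + B[right])
            std = np.sqrt(half * (T / d) / 2.0)  # conditional std dev
            B[mid] = mean + std * z[k]
            k += 1
        h = half
    return B[1:]

print(brownian_bridge_path(np.array([0.6, 0.3, 0.8, 0.1])))
```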
Another type of transformation that can sometimes reduce the actual
dimension of the problem (and not only the effective one) is the conditional Monte Carlo method; see L'Ecuyer and Lemieux (2000), Section
10.1, for example, where this method is used with randomly shifted lattice rules. This method can also be viewed as a “smoothing technique”,
i.e., it replaces the integrand by a smoother function (in this case, a
conditional expectation) that, e.g., satisfies the conditions required to obtain the faster variance convergence rates discussed in Section 6.2. We refer
the reader to Spanier and Maize (1994) and Fox (1999), Chapters 7 and
8, for more on smoothing techniques that can be used in combination
with QMC methods.

8. Related Methods
We now discuss integration methods that are closely related to QMC
methods, but that do not exactly fit the framework presented so far.
First, a natural extension for the estimator Q_n or Q~_n would be to
assign weights to the different evaluation points; that is, for a point set
P_n = {u_0, ..., u_{n-1}}, define

Q_n^w = sum_{i=0}^{n-1} w_i f(u_i),

with sum_{i=0}^{n-1} w_i = 1. Hickernell (1998b) proved that when P_n is a lattice point set, then a certain measure of discrepancy (defined so
that these weights are accounted for) is minimized by setting the weights to be all equal to 1/n. In other words,
using weights is useless in this case.
However, if P_n is not restricted to have a particular form, then it
can be shown that in some cases allowing different weights can bring a
significant improvement (Yakowitz, Krimmel, and Szidarovszky 1978).
For example, one can use the MC method with weights defined by
the Voronoï tessellation induced by the uniform and random points
u_0, ..., u_{n-1}; more precisely, define

w_i = lambda(V_i),

where V_i is the Voronoï cell associated with u_i and lambda denotes the Lebesgue measure on [0, 1)^s. This approach yields
an estimator whose variance converges faster than 1/n when f is sufficiently smooth. Weighted approximations also based on Voronoï tessellations are discussed by Pagès
(1997).
A closely related idea is used in stratified sampling (Cochran 1977). In
this method, the unit hypercube is partitioned into n cells B_0, ..., B_{n-1}
of volume 1/n each, and X_i is uniformly distributed over B_i, for i = 0, ..., n - 1. The
stratified sampling estimator is then

Q_n^{st} = (1/n) sum_{i=0}^{n-1} f(X_i),

where the X_i are independent. It can be shown (Cochran 1977) that for any square-integrable function,

Var(Q_n^{st}) <= Var of the MC estimator based on n points.

The amount of variance reduction depends on the definition of the cells
and their interaction with the integrand f.
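For concreteness, here is a minimal Python sketch, our own, of stratified sampling over a grid of congruent cells; the function name and the test integrand are chosen for illustration.

```python
# A minimal sketch of stratified sampling with a grid of equal-volume
# cells: one uniform point per cell gives an unbiased estimator whose
# variance cannot exceed that of plain MC.
import numpy as np

rng = np.random.default_rng(11)

def stratified_estimate(f, cells_per_axis, s):
    """Partition [0,1)^s into cells_per_axis^s congruent boxes and draw
    one uniform point in each."""
    grid = np.stack(np.meshgrid(*[np.arange(cells_per_axis)] * s,
                                indexing="ij"), axis=-1).reshape(-1, s)
    points = (grid + rng.random(grid.shape)) / cells_per_axis
    return np.mean([f(u) for u in points])

# Example: integrate f(u) = u1 * u2 (true value 1/4) with 16^2 strata.
print(stratified_estimate(lambda u: u[0] * u[1], 16, 2))
```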
An integration method that is guaranteed to yield an estimator with
a variance not larger than that of the MC estimator for monotone functions is
Latin Hypercube Sampling (LHS) (Mckay, Beckman, and Conover
1979). It uses a point set whose unidimensional projections are evenly
distributed (i.e., one point per interval [l/n, (l + 1)/n) for l = 0, ..., n -
1). To construct this point set, one needs to generate s random, uniform,
and independent permutations pi_1, ..., pi_s of the integers from 0 to n -
1, and ns independent shifts U_{i,j} uniformly distributed over [0, 1).
Then define the point set

P_n = { ( (pi_1(i) + U_{i,1})/n, ..., (pi_s(i) + U_{i,s})/n ) : i = 0, ..., n - 1 }.

Additional results on this method can be found in, e.g., Avramidis and
Wilson (1996), Owen (1998a), and the references therein. In particular,
see Owen (1992a) and Loh (1996b) for results showing that the LHS
estimator obeys a central-limit theorem. A related method that extends
the uniformity property of LHS and has close connections with digital
nets is a randomized orthogonal array design; we refer the reader to Owen
(1992b, 1994, 1995, 1997) and Loh (1996a) for more on this method.
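A minimal Python sketch of the LHS construction just described follows (names ours).

```python
# A minimal sketch of Latin Hypercube Sampling: in each dimension, an
# independent random permutation spreads one point per interval
# [l/n, (l+1)/n), and an independent uniform shift places the point
# inside its interval.
import numpy as np

rng = np.random.default_rng(5)

def latin_hypercube(n, s):
    perms = np.column_stack([rng.permutation(n) for _ in range(s)])
    return (perms + rng.random((n, s))) / n

pts = latin_hypercube(8, 3)
# Each column of pts has exactly one value per interval [l/8, (l+1)/8).
print(np.sort(np.floor(pts * 8), axis=0))
```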

9. Conclusions and Discussion

We have described various QMC constructions that can be used for
multidimensional numerical integration. Measures of quality that can
help in selecting parameters for a given construction have been presented.
We also discussed different randomizations, and provided results on the
variance of estimators obtained by applying these randomizations. In
particular, we gave a new result that expresses the variance of an estimator based on a digitally shifted net as a sum of squared Walsh
coefficients over the dual space of the net.
In the near future, we plan to compare empirically various constructions and randomizations on practical problems, to study selection criteria and compare their effectiveness, and to investigate in more detail
the effect of transformations, such as those discussed in Section 7, on
the variance of the randomized QMC estimators.

Appendix: Proofs
Proof of Proposition 1: The result is obtained by first generalizing Proposition 5.2 of
Lemieux and L’Ecuyer (2001) to arbitrary digital nets in base This can be done
by using Lemma 2 from Section 6.2. More precisely, we show that is
equidistributed if and only if where

Consider the class of all real-valued functions that are constant on each
of the boxes in the definition of Clearly, is
if and only if the corresponding point set integrates every
function with zero error. But due to its periodic structure, each function
has a Walsh expansion of the form

where and is the bijection between


and mentioned on page 457. To see this, note that any can be
written as

where the are real numbers. When there exists


and an integer such that Let and
recall that where the coefficients and come from the
representation and respectively. When goes


from to goes from 0 to in and this
dot product is then equal to each number between 0 and exactly times.
Hence, if we first integrate with respect to when computing via (20.29),
any term from the sum (20.A.2) will be 0 because

Therefore, if and (20.A.1) follows. Now, for any nonzero


is in Hence by Proposition 4, the
error obtained by using to integrate is since the
only nonzero Walsh coefficient of is the one evaluated in (and it is equal to 1). From
this, we see that if is then
Hence if has a resolution of then it is and therefore

We now show that which will prove the result. Since the
resolution is it means that is not Therefore,
the matrix L formed by concatenating the transposes of the first
rows of each generating matrix has a row space whose dimension is strictly smaller
than Hence there exists a nonzero vector x in such that
Furthermore, we can assume since would contradict our
assumption that has a resolution of Define
by

Since L is just a truncated version of C and the coefficients of for powers of


larger than are zero for all we have that Ch = 0, and therefore with
which proves the result.
Proof of Lemma 3: Recall that

where is a step function constant on the boxes obtained by partitioning


the axis of in equal intervals, for each Using the same notation
as for the preceding proof, we have that where

Hence from the proof of Proposition 1, we know that if


Assume and that there exists one such that
We need to verify that Now,
Now, since the function is constant over any interval of the


form Hence, if

for any then (20.A.3) is equal to zero and the result is proved.
To show (20.A.4), it suffices to observe that

for any which proves the result.

Acknowledgments
This work was supported by NSERC-Canada individual grants to the two authors
and by an FCAR-Québec grant to the first author. We thank Bennett L. Fox, Fred
J. Hickernell, Harald Niederreiter, and Art B. Owen for helpful comments and sug-
gestions.

References
Acworth, P., M. Broadie, and P. Glasserman. (1997). A comparison
of some Monte Carlo and quasi-Monte Carlo techniques for option
pricing. In Monte Carlo and Quasi-Monte Carlo Methods in Scien-
tific Computing, ed. P. Hellekalek and H. Niederreiter, Number 127
in Lecture Notes in Statistics, 1–18. Springer-Verlag.
Åkesson, F., and J. P. Lehoczy. (2000). Path generation for quasi-Monte
Carlo simulation of mortgage-backed securities. Management Sci-
ence 46:1171–1187.
Antonov, I. A., and V. M. Saleev. (1979). An economic method of com-
puting $LP_\tau$-sequences. Zh. Vychisl. Mat. Mat. Fiz. 19:243–245.
In Russian.
Avramidis, A. N., and J. R. Wilson. (1996). Integrated variance reduc-
tion strategies for simulation. Operations Research 44:327–346.
Bakhvalov, N. S. (1959). On approximate calculation of multiple
integrals. Vestnik Moskovskogo Universiteta, Seriya Matematiki,
Mehaniki, Astronomi, Fiziki, Himii 4:3–18. In Russian.
Boyle, P., M. Broadie, and P. Glasserman. (1997). Monte Carlo methods
for security pricing. Journal of Economic Dynamics & Control 21
(8-9): 1267–1321. Computational financial modelling.
Braaten, E., and G. Weller. (1979). An improved low-discrepancy se-
quence for multidimensional quasi-Monte Carlo integration. Journal
of Computational Physics 33:249–258.
Bratley, P., and B. L. Fox. (1988). Algorithm 659: Implementing Sobol’s
quasirandom sequence generator. ACM Transactions on Mathemat-
ical Software 14 (1): 88–100.
Bratley, P., B. L. Fox, and H. Niederreiter. (1992). Implementation and
tests of low-discrepancy sequences. ACM Transactions on Modeling
and Computer Simulation 2:195–213.
Bratley, P., B. L. Fox, and H. Niederreiter. (1994). Algorithm 738: Pro-
grams to generate Niederreiter’s low-discrepancy sequences. ACM
Transactions on Mathematical Software 20:494–495.
Caflisch, R. E., W. Morokoff, and A. Owen. (1997). Valuation of
mortgage-backed securities using Brownian bridges to reduce effec-
tive dimension. The Journal of Computational Finance 1 (1): 27–46.
Caflisch, R. E., and B. Moskowitz. (1995). Modified Monte Carlo meth-
ods using quasi-random sequences. In Monte Carlo and Quasi-
Monte Carlo Methods in Scientific Computing, ed. H. Niederreiter
and P. J.-S. Shiue, Number 106 in Lecture Notes in Statistics, 1–16.
New York: Springer-Verlag.
Cheng, J., and M. J. Druzdzel. (2000). Computational investigation
of low-discrepancy sequences in simulation algorithms for Bayesian
networks. In Uncertainty in Artificial Intelligence Proceedings 2000,
72–81.
Cochran, W. G. (1977). Sampling techniques. Second ed. New York:
John Wiley and Sons.
Conway, J. H., and N. J. A. Sloane. (1999). Sphere packings, lattices and
groups. 3rd ed. Grundlehren der Mathematischen Wissenschaften
290. New York: Springer-Verlag.
Couture, R., and P. L’Ecuyer. (2000). Lattice computations for random
numbers. Mathematics of Computation 69 (230): 757–765.
Couture, R., P. L’Ecuyer, and S. Tezuka. (1993). On the distribution
of $k$-dimensional vectors for simple and combined Tausworthe se-
quences. Mathematics of Computation 60 (202): 749–761, S11–S16.
Coveyou, R. R., and R. D. MacPherson. (1967). Fourier analysis of
uniform random number generators. Journal of the ACM 14:100–
119.
Cranley, R., and T. N. L. Patterson. (1976). Randomization of num-
ber theoretic methods for multiple integration. SIAM Journal on
Numerical Analysis 13 (6): 904–914.
Davis, P., and P. Rabinowitz. (1984). Methods of numerical integration.
second ed. New York: Academic Press.
Dieter, U. (1975). How to calculate shortest vectors in a lattice. Math-
ematics of Computation 29 (131): 827–833.
Duffie, D. (1996). Dynamic asset pricing theory. second ed. Princeton
University Press.
Efron, B., and C. Stein. (1981). The jackknife estimator of variance.
Annals of Statistics 9:586–596.
Entacher, K. (1997). Quasi-Monte Carlo methods for numerical integra-
tion of multivariate Haar series. BIT 37:846–861.
Entacher, K., P. Hellekalek, and P. L’Ecuyer. (2000). Quasi-Monte
Carlo node sets from linear congruential generators. In Monte
Carlo and Quasi-Monte Carlo Methods 1998, ed. H. Niederreiter
and J. Spanier, 188–198. Berlin: Springer.
Faure, H. (1982). Discrépance des suites associées à un système de
numération. Acta Arithmetica 61:337–351.
Faure, H. (2001). Variations on $(0,s)$-sequences. Journal of Complexity.
To appear.
Faure, H., and S. Tezuka. (2001). A new generation of
To appear.
Fincke, U., and M. Pohst. (1985). Improved methods for calculating
vectors of short length in a lattice, including a complexity analysis.
Mathematics of Computation 44:463–471.
Fishman, G. S. (1990). Multiplicative congruential random number
generators with modulus $2^\beta$: An exhaustive analysis for $\beta = 32$ and
a partial analysis for $\beta = 48$. Mathematics of Computation 54 (189):
331–344.
Fishman, G. S., and L. S. Moore III. (1986). An exhaustive analysis of
multiplicative congruential random number generators with modu-
lus $2^{31}-1$. SIAM Journal on Scientific and Statistical Computing 7
(1): 24–45.
Fox, B. L. (1986). Implementation and relative efficiency of quasiran-
dom sequence generators. ACM Transactions on Mathematical Soft-
ware 12:362–376.
Fox, B. L. (1999). Strategies for quasi-Monte Carlo. Boston, MA: Kluwer
Academic.
Friedel, I., and A. Keller. (2001). Fast generation of randomized low-
discrepancy point sets. In Monte Carlo and Quasi-Monte Carlo
Methods 2000, ed. K.-T. Fang, F. J. Hickernell, and H. Niederreiter:
Springer. To appear.
Golubov, B., A. Efimov, and V. Skvortsov. (1991). Walsh series and
transforms: Theory and applications, Volume 64 of Mathematics and
Applications: Soviet Series. Boston: Kluwer Academic Publishers.
Haber, S. (1983). Parameters for integrating periodic functions of several
variables. Mathematics of Computation 41:115–129.
Halton, J. H. (1960). On the efficiency of certain quasi-random se-
quences of points in evaluating multi-dimensional integrals. Nu-
merische Mathematik 2:84–90.
Heinrich, S., F. J. Hickernell, and R.-X. Yue. (2001a). Integration of
multivariate Haar wavelet series. Submitted.
Heinrich, S., F. J. Hickernell, and R.-X. Yue. (2001b). Optimal quadra-
ture for Haar wavelet spaces. Submitted.
Hellekalek, P. (1998). On the assessment of random and quasi-
random point sets. In Random and Quasi-Random Point Sets, ed.
P. Hellekalek and G. Larcher, Volume 138 of Lecture Notes in Statis-
tics, 49–108. New York: Springer.
Hellekalek, P., and G. Larcher. (Eds.) (1998). Random and quasi-random
point sets, Volume 138 of Lecture Notes in Statistics. New York:
Springer.
Hellekalek, P., and H. Leeb. (1997). Dyadic diaphony. Acta Arith-
metica 80:187–196.
Hickernell, F. J. (1998a). A generalized discrepancy and quadrature
error bound. Mathematics of Computation 67:299–322.
Hickernell, F. J. (1998b). Lattice rules: How well do they measure up?
In Random and Quasi-Random Point Sets, ed. P. Hellekalek and
G. Larcher, Volume 138 of Lecture Notes in Statistics, 109–166. New
York: Springer.
Hickernell, F. J. (1999). Goodness-of-fit statistics, discrepancies and
robust designs. Statistical and Probability Letters 44:73–78.
Hickernell, F. J. (2000). What affects accuracy of quasi-Monte Carlo
quadrature? In Monte Carlo and Quasi-Monte Carlo Methods 1998,
ed. H. Niederreiter and J. Spanier, 16–55. Berlin: Springer.
Hickernell, F. J., and H. S. Hong. (1997). Computing multivariate normal
probabilities using rank-1 lattice sequences. In Proceedings of the
Workshop on Scientific Computing (Hong Kong), ed. G. H. Golub,
S. H. Lui, F. T. Luk, and R. J. Plemmons, 209–215. Singapore:
Springer-Verlag.
Hickernell, F. J., H. S. Hong, P. L’Ecuyer, and C. Lemieux. (2001). Ex-
tensible lattice sequences for quasi-Monte Carlo quadrature. SIAM
Journal on Scientific Computing 22 (3): 1117–1138.
Hickernell, F. J., and H. Woźniakowski. (2001). The price of pessimism
for multidimensional quadrature. Journal of Complexity 17. To
appear.
Hlawka, E. (1961). Funktionen von beschränkter Variation in der Theorie
der Gleichverteilung. Ann. Mat. Pura Appl. 54:325–333.
Hlawka, E. (1962). Zur angenäherten berechnung mehrfacher integrale.
Monatshefte für Mathematik 66:140–151.
Hoeffding, W. (1948). A class of statistics with asymptotically normal
distributions. Annals of Mathematical Statistics 19:293–325.
Hong, H. S., and F. H. Hickernell. (2001). Implementing scrambled
digital sequences. Submitted for publication.
Knuth, D. E. (1998). The art of computer programming, volume 2:
Seminumerical algorithms. Third ed. Reading, Mass.: Addison-
Wesley.
Korobov, N. M. (1959). The approximate computation of multiple inte-
grals. Dokl. Akad. Nauk SSSR 124:1207–1210. In Russian.
Korobov, N. M. (1960). Properties and calculation of optimal coeffi-
cients. Dokl. Akad. Nauk SSSR 132:1009–1012. In Russian.
Larcher, G. (1998). Digital point sets: Analysis and applications.
In Random and Quasi-Random Point Sets, ed. P. Hellekalek and
G. Larcher, Volume 138 of Lecture Notes in Statistics, 167–222. New
York: Springer.
Larcher, G., A. Lauss, H. Niederreiter, and W. C. Schmid. (1996). Opti-
mal polynomials for $(t,m,s)$-nets and numerical integration of mul-
tivariate Walsh series. SIAM Journal on Numerical Analysis 33 (6):
2239–2253.
Larcher, G., H. Niederreiter, and W. C. Schmid. (1996). Digital nets
and sequences constructed over finite rings and their application to
quasi-Monte Carlo integration. Monatshefte für Mathematik 121 (3):
231–253.
Larcher, G., and G. Pirsic. (1999). Base change problems for generalized
Walsh series and multivariate numerical integration. Pacific Journal
of Mathematics 189:75–105.
L’Ecuyer, P. (1994). Uniform random number generation. Annals of
Operations Research 53:77–120.
L’Ecuyer, P. (1996). Maximally equidistributed combined Tausworthe
generators. Mathematics of Computation 65 (213): 203–213.
L’Ecuyer, P. (1999). Tables of linear congruential generators of different
sizes and good lattice structure. Mathematics of Computation 68
(225): 249–260.
L’Ecuyer, P., and R. Couture. (1997). An implementation of the lat-
tice and spectral tests for multiple recursive linear random number
generators. INFORMS Journal on Computing 9 (2): 206–217.
L’Ecuyer, P., and C. Lemieux. (2000). Variance reduction via lattice
rules. Management Science 46 (9): 1214–1235.
L’Ecuyer, P., and F. Panneton. (2000). A new class of linear feedback
shift register generators. In Proceedings of the 2000 Winter Simula-
tion Conference, ed. J. A. Joines, R. R. Barton, K. Kang, and P. A.
Fishwick, 690–696. Piscataway, NJ: IEEE Press.
Lemieux, C. (2000), May. L’utilisation de règles de réseau en simula-
tion comme technique de réduction de la variance. Ph. D. thesis,
Université de Montréal.
Lemieux, C., M. Cieslak, and K. Luttmer. (2001). RandQMC user’s
guide. In preparation.
Lemieux, C., and P. L’Ecuyer. (2001). Selection criteria for lattice rules
and other low-discrepancy point sets. Mathematics and Computers
in Simulation 55 (1–3): 139–148.
Lemieux, C., and A. B. Owen. (2001). Quasi-regression and the relative
importance of the ANOVA components of a function. In Monte
Carlo and Quasi-Monte Carlo Methods 2000, ed. K.-T. Fang, F. J.
Hickernell, and H. Niederreiter: Springer. To appear.
Lidl, R., and H. Niederreiter. (1994). Introduction to finite fields and
their applications. Revised ed. Cambridge: Cambridge University
Press.
Loh, W.-L. (1996a). A combinatorial central limit theorem for ran-
domized orthogonal array sampling designs. Annals of Statis-
tics 24:1209–1224.
Loh, W.-L. (1996b). On Latin hypercube sampling. The Annals of
Statistics 24:2058–2080.
Maisonneuve, D. (1972). Recherche et utilisation des “bons treillis”,
programmation et résultats numériques. In Applications of Number
Theory to Numerical Analysis, ed. S. K. Zaremba, 121–201. New
York: Academic Press.
Maize, E. (1981). Contributions to the theory of error reduction in quasi-
Monte Carlo methods. Ph. D. thesis, Claremont Graduate School,
Claremont, CA.
Matoušek, J. (1998). On the L2-discrepancy for anchored boxes. Journal
of Complexity 14:527–556.
Matsumoto, M., and Y. Kurita. (1994). Twisted GFSR generators II.
ACM Transactions on Modeling and Computer Simulation 4 (3):
254–266.
Matsumoto, M., and T. Nishimura. (1998). Mersenne twister: A 623-
dimensionally equidistributed uniform pseudo-random number gen-
erator. ACM Transactions on Modeling and Computer Simulation 8
(1): 3–30.
McKay, M. D., R. J. Beckman, and W. J. Conover. (1979). A comparison
of three methods for selecting values of input variables in the analysis
of output from a computer code. Technometrics 21:239–245.
Morohosi, H., and M. Fushimi. (2000). A practical approach to the er-
ror estimation of quasi-Monte Carlo integration. In Monte Carlo
and Quasi-Monte Carlo Methods 1998, ed. H. Niederreiter and
J. Spanier, 377–390. Berlin: Springer.
Morokoff, W. J., and R. E. Caflisch. (1994). Quasi-random sequences and
their discrepancies. SIAM Journal on Scientific Computing 15:1251–
1279.
Niederreiter, H. (1986). Multidimensional numerical integration using
pseudorandom numbers. Mathematical Programming Study 27:17–
38.
Niederreiter, H. (1987). Point sets and sequences with small discrepancy.
Monatshefte für Mathematik 104:273–337.
Niederreiter, H. (1988). Low-discrepancy and low-dispersion sequences.
Journal of Number Theory 30:51–70.
Niederreiter, H. (1992a). Low-discrepancy point sets obtained by digital
constructions over finite fields. Czechoslovak Math. Journal 42:143–
166.
Niederreiter, H. (1992b). Random number generation and quasi-Monte
Carlo methods, Volume 63 of SIAM CBMS-NSF Regional Confer-
ence Series in Applied Mathematics. Philadelphia: SIAM.
Niederreiter, H., and G. Pirsic. (2001). Duality for digital nets and its
applications. Acta Arithmetica 97:173–182.
Niederreiter, H., and C. Xing. (1997). The algebraic-geometry ap-
proach to low-discrepancy sequences. In Monte Carlo and Quasi-
Monte Carlo Methods in Scientific Computing, ed. P. Hellekalek,
G. Larcher, H. Niederreiter, and P. Zinterhof, Volume 127 of Lec-
ture Notes in Statistics, 139–160. New York: Springer-Verlag.
Niederreiter, H., and C. Xing. (1998). Nets, $(t,s)$-sequences, and al-
gebraic geometry. In Random and Quasi-Random Point Sets, ed.
P. Hellekalek and G. Larcher, Volume 138 of Lecture Notes in Statis-
tics, 267–302. New York: Springer.
Ökten, G. (1996). A probabilistic result on the discrepancy of a hybrid-
Monte Carlo sequence and applications. Monte Carlo methods and
Applications 2:255–270.
Owen, A. B. (1992a). A central limit theorem for Latin hypercube sam-
pling. Journal of the Royal Statistical Society B 54 (2): 541–551.
Owen, A. B. (1992b). Orthogonal arrays for computer experiments,
integration and visualization. Statistica Sinica 2:439–452.
Owen, A. B. (1994). Lattice sampling revisited: Monte Carlo variance
of means over randomized orthogonal arrays. Annals of Statis-
tics 22:930–945.
Owen, A. B. (1995). Randomly permuted $(t,m,s)$-nets and $(t,s)$-
sequences. In Monte Carlo and Quasi-Monte Carlo Methods in Sci-
entific Computing, ed. H. Niederreiter and P. J.-S. Shiue, Number
106 in Lecture Notes in Statistics, 299–317. New York: Springer-Verlag.
Owen, A. B. (1997). Monte Carlo variance of scrambled equidistribution
quadrature. SIAM Journal on Numerical Analysis 34 (5): 1884–
1910.
Owen, A. B. (1998a). Latin supercube sampling for very high-
dimensional simulations. ACM Transactions of Modeling and Com-
puter Simulation 8 (1): 71–102.
Owen, A. B. (1998b). Scrambling Sobol and Niederreiter-Xing points.
Journal of Complexity 14:466–489.
Pagès, G. (1997). A space quantization method for numerical integra-
tion. Journal of Computational and Applied Mathematics 89:1–38.
Paskov, S., and J. Traub. (1995). Faster valuation of financial derivatives.
Journal of Portfolio Management 22:113–120.
Pirsic, G. (2001). A software implementation of Niederreiter-Xing se-
quences. In Monte Carlo and Quasi-Monte Carlo Methods 2000,
ed. K.-T. Fang, F. J. Hickernell, and H. Niederreiter: Springer. To
appear.
Pirsic, G., and W. C. Schmid. (2001). Calculation of the quality param-
eter of digital nets and application to their construction. Journal of
Complexity. To appear.
Sloan, I. H., and S. Joe. (1994). Lattice methods for multiple integration.
Oxford: Clarendon Press.
Sloan, I. H., and L. Walsh. (1990). A computer search of rank-2 lattice
rules for multidimensional quadrature. Mathematics of Computa-
tion 54:281–302.
Sobol’, I. M. (1967). The distribution of points in a cube and the approx-
imate evaluation of integrals. U.S.S.R. Comput. Math. and Math.
Phys. 7:86–112.
Sobol’, I. M. (1969). Multidimensional quadrature formulas and Haar
functions. Moscow: Nauka. In Russian.
Sobol’, I. M. (1976). Uniformly distributed sequences with an additional
uniform property. USSR Comput. Math. Math. Phys. Academy of
Sciences 16:236–242.
Sobol’, I. M., and Y. L. Levitan. (1976). The production of points uni-
formly distributed in a multidimensional cube. Technical Report
Preprint 40, Institute of Applied Mathematics, USSR Academy of
Sciences. In Russian.
Spanier, J. (1995). Quasi-Monte Carlo methods for particle transport
problems. In Monte Carlo and Quasi-Monte Carlo Methods in Sci-
entific Computing, ed. H. Niederreiter and P. J.-S. Shiue, Volume
106 of Lecture Notes in Statistics, 121–148. New York: Springer-
Verlag.
Spanier, J., and E. H. Maize. (1994). Quasi-random methods for estimat-
ing integrals using relatively small samples. SIAM Review 36:18–44.
Tan, K. S., and P. P. Boyle. (2000). Applications of randomized low dis-
crepancy sequences to the valuation of complex securities. Journal
of Economic Dynamics and Control 24:1747–1782.
Tausworthe, R. C. (1965). Random numbers generated by linear recur-
rence modulo two. Mathematics of Computation 19:201–209.
Tezuka, S. (1987). Walsh-spectral test for GFSR pseudorandom num-
bers. Communications of the ACM 30 (8): 731–735.
Tezuka, S. (1995). Uniform random numbers: Theory and practice. Nor-
well, Mass.: Kluwer Academic Publishers.
Tezuka, S., and P. L’Ecuyer. (1991). Efficient and portable combined
Tausworthe random number generators. ACM Transactions on Mod-
eling and Computer Simulation 1 (2): 99–112.
Tezuka, S., and T. Tokuyama. (1994). A note on polynomial arithmetic
analogue of Halton sequences. ACM Transactions on Modeling and
Computer Simulation 4:279–284.
Tootill, J. P. R., W. D. Robinson, and D. J. Eagle. (1973). An asymptot-
ically random Tausworthe sequence. Journal of the ACM 20:469–
481.
Tuffin, B. (1996). On the use of low-discrepancy sequences in Monte
Carlo methods. Technical Report No. 1060, I.R.I.S.A., Rennes,
France.
Tuffin, B. (1998). Variance reduction order using good lattice points in
Monte Carlo methods. Computing 61:371–378.
Wang, D., and A. Compagner. (1993). On the use of reducible poly-
nomials as random number generators. Mathematics of Computa-
tion 60:363–374.
Wang, X., and F. J. Hickernell. (2000). Randomized Halton sequences.
Math. Comput. Modelling 32:887–899.
Yakowitz, S., J. E. Krimmel, and F. Szidarovszky. (1978). Weighted
Monte Carlo integration. SIAM Journal on Numerical Analy-
sis 15:1289–1300.
Yakowitz, S., P. L’Ecuyer, and F. Vázquez-Abad. (2000). Global stochas-
tic optimization with low-discrepancy point sets. Operations Re-
search 48 (6): 939–950.
Part VI
Chapter 21

SINGULARLY PERTURBED MARKOV


CHAINS AND APPLICATIONS TO
LARGE-SCALE SYSTEMS
UNDER UNCERTAINTY

G. Yin
Department of Mathematics
Wayne State University
Detroit, MI 48202
gyin@math.wayne.edu

Q. Zhang
Department of Mathematics
University of Georgia
Athens, GA 30602
qingz@math.uga.edu

K. Yin and H. Yang


Department of Wood and Paper Science
University of Minnesota
St. Paul, MN 55108
kyin@crn.umn.edu, hyang@ece.umn.edu

Abstract This chapter is concerned with large-scale hybrid stochastic systems, in which the
dynamics involve both continuously evolving components and discrete events.
Corresponding to different discrete states, the dynamic behavior of the under-
lying system could be markedly different. To reduce the complexity of these
systems, singularly perturbed Markov chains are used to characterize the sys-
tem. Asymptotic expansions of probability vectors and the structural properties
of these Markov chains are provided. The ideas of decomposition and aggrega-
tion are presented using two typical optimal control problems. Such an approach
leads to control policies that are simple to obtain and perform nearly as well as
the optimal ones with substantially reduced complexity.
Key words. singular perturbation, Markov chain, near optimality, optimal
control, LQG, MDP.

1. INTRODUCTION
In memory of our distinguished colleague and dear friend Sidney Yakowitz,
who made significant contributions to mathematics, control and systems the-
ory, and operations research, we write this chapter to celebrate his lifetime
achievements and to survey some of the most recent developments in singu-
larly perturbed Markov chains and their applications in control and optimiza-
tion of large-scale systems under uncertainty, which are related to Sid’s work
on automatic learning, adaptive control, and nonparametric theory (see Lai
and Yakowitz (1995); Yakowitz (1969); Yakowitz et al. (1992); Yakowitz et
al. (2000) and the references therein). Many adaptive control and learning
problems give rise to Markov decision processes. In solving such problems,
one often has to face the curse of dimensionality. The singular perturbation
approach is an effort in the direction of reduction of complexity.
Our study is motivated by the desire to solve numerous control and op-
timization problems in engineering, operations research, management, biol-
ogy, and the physical sciences. To show why Markovian models are useful and
preferable, we begin with the well-known Leontief model, which is a dynamic
system of a multi-sector economy (see, for example, Kendrick (1972)). The
classical formulation is as follows. Let there be $n$ sectors, with $x_i(t)$ the output
of sector $i$ at time $t$ and $d_i(t)$ the demand for the product of sector $i$ at time $t$.
Denote $x(t) = (x_1(t), \dots, x_n(t))'$ and $D(t) = (d_1(t), \dots, d_n(t))'$, and let $a_{ij}$
be the amount of commodity $i$ that sector $j$ needs in production and $b_{ij}$ the
proportion of commodity $i$ that is transferred to commodity $j$. Write $A = (a_{ij})$
and $B = (b_{ij})$. The matrix $B$ is termed a Leontief input-output matrix. The
Leontief dynamic model is given by

$x(t) = Ax(t) + B\dot{x}(t) + D(t).$

In the classical Leontief model, the coefficients are fixed. Nevertheless, in
reality, more often than not, not only are $A$, $B$, and $D$ time varying, but they
are also subject to discrete shifts in regime, episodes across which the behavior
of the corresponding dynamic systems is markedly different. As a result,
a promising alternative to the traditional model is to allow sudden, discrete
changes in the values of the parameters, which leads to a “hybrid” or “switching
model” governed by a Markov chain:

$x(t) = A(\alpha(t))x(t) + B(\alpha(t))\dot{x}(t) + D(t, \alpha(t)),$

where $\alpha(t)$ is a continuous-time Markov chain (see Yin and Zhang (2001) for
more details).
more details). In fact, the use of Markovian models have assumed a prominent
role in time series analysis, financial engineering, and economic systems (see
Hamilton and Susmel (1994); Hansen (1992) and the references therein). In
addition to the modeling point mentioned above, many systems arising from
communication, manufacturing, reliability, and queueing networks, among oth-
ers, exhibit jump discontinuity in their sample paths. A common practice is to
resort to Markovian jump processes in modeling and optimization.
This chapter is devoted to such Markovian models having large state spaces
with complex structures, which frequently appear in various applications and
which may cause serious obstacles in obtaining optimal controls for the under-
lying systems. The traditional approach of dynamic programming for obtaining
optimal control does not work well in such systems. The large size that renders
computation infeasible is known as the “curse of dimensionality.”
Owing to the pervasive applications of Markovian formulation in numerous
areas, there has been resurgent interest in further exploring various properties of
Markov chains. Hierarchical structure, a feature common to many systems of
practical concern, has proved very useful for reducing complexity. As pointed
out by Simon and Ando (1961), all systems in the real world have a certain
hierarchy. Therefore it is natural to use the ideas of decomposition and aggre-
gation for complexity reduction. Because the transitions (switches or jumps)
in various states of a large-scale system often occur at different rates, the de-
composition and aggregation of the states of the corresponding Markov chain
can be achieved according to their rates of changes. Taking advantage of the
hierarchical structure, the first step is to divide a large task into smaller pieces.
The subsequent decompositions and aggregations will lead to a simpler ver-
sion of the originally formidable system. Such an idea has been applied to
queueing networks for resource organization, to computer systems for memory
level aggregation, to economic models for complexity reduction (see Courtois
(1977); Simon and Ando (1961)), and to manufacturing systems for production
planning (see Sethi and Zhang (1994)). Owing to their different time scales,
the problems fit reasonably well into the framework of singular perturbation
in control and optimization; see Abbad et al. (1992); Gershwin (1994); Pan
and Başar (1995); Sethi and Zhang (1994); Yin and Zhang (1997b); Yin and
Zhang (1998) and the references therein. Related work on singularly perturbed
Markov chains can be found in Di Masi et al. (1995); Pervozvanskii and Gaits-
gori (1988); Phillips and Kokotovic (1981), among others. In Khasminskii et


al. (1996), we analyzed singularly perturbed systems with an approach com-
bining matched asymptotic expansion techniques and Markov properties. This
combined approach enables us to obtain the desired asymptotic expansion of
the probability vector and transition probabilities. We have furthered our un-
derstanding by considering singularly perturbed chains under weak and strong
interactions (Khasminskii et al., 1997). This line of research has been carried
forward substantially and has included asymptotic normality, exponential error
bounds, Markov chains with countable state spaces, and the associated semi-
groups (see Yin and Zhang (1998)), which are the cornerstones for an in-depth
investigation on various asymptotic properties and which are useful for many
applications. Studies of the structural properties of the underlying singularly
perturbed Markov chains and other properties of related occupation measures
are in Zhang and Yin (1996); Zhang and Yin (1997) and Yin et al. (2000a); Yin
et al. (2000b); Yin et al. (2000c). The asymptotically optimal and nearly op-
timal controls of dynamic systems driven by singularly perturbed
Markov chains have been examined in Yin and Zhang (1997a); Zhang et al.
(1997); Zhang and Yin (1999). Numerical methods and a simulation study can
be found in Yang et al. (2001). These results are extended to discrete-time
models and included in Yin and Zhang (2000); see also Liu et al. (2001); Yang
et al. (2001).
To proceed, we present an illustrative example of a production planning
model for a failure-prone manufacturing system. The system consists of a single
machine whose production capacity is modeled by a finite-state continuous-time
Markov chain $\alpha(t)$ having a generator $Q$ and taking
values in $\mathcal{M} = \{1, \dots, m\}$. For example, 1 represents the machine being down,
$m$ represents a full capacity operation, and the additional states represent
different machine capacities in between. Let $x(t)$, $u(t)$, and $z$ denote the
surplus, the rate of production, and the rate of demand, respectively. Here $z$
can be either deterministic or stochastic. The system equation is

$\dot{x}(t) = u(t) - z, \qquad x(0) = x.$

Our objective is to choose the production rate $u(t)$, subject to the
constraint that it lie between zero and the current machine capacity,
and to minimize the infinite-horizon discounted cost function

$J(x, \alpha, u(\cdot)) = E \int_0^\infty e^{-\rho t} G(x(t), u(t))\, dt,$
where $\rho > 0$ is a discount factor and $G(\cdot)$ is a running cost function. Using
the dynamic programming approach (Fleming and Rishel, 1975), for each $i \in \mathcal{M}$
we can write the differential equation satisfied by the value function $v(x, i)$
(the optimal value of the cost) as

$\rho v(x, i) = \min_{u} \left[ (u - z)\frac{\partial v(x, i)}{\partial x} + G(x, u) \right] + Qv(x, \cdot)(i); \qquad (1.3)$

see Sethi and Zhang (1994) and Yin and Zhang (1998), Appendix, for details.
Even if the demand $z$ is a constant, it is difficult to obtain the closed-form
solution of the optimal control, not to mention the added difficulty due to the
possibly large state space (i.e., $m$ being a large number), since finding the
optimal control using the dynamic programming approach requires solving $m$
equations of the form (1.3). To overcome the difficulty, we introduce a
small parameter $\varepsilon > 0$ and assume that the Markov chain $\alpha^\varepsilon(t)$ is generated by
$Q^\varepsilon = Q/\varepsilon$, where $Q$ is an irreducible generator of a continuous-time Markov
chain. Using a singular perturbation approach, we can derive a limit problem
in which the stochastic capacity is replaced by its average with respect to the
stationary measure. The limit problem is in fact deterministic and is much
simpler to solve. Intuitively, a higher-level manager in a manufacturing firm
need not know every detail of floor events; an averaged overview is sufficient
for the upper-level decision making; see Sethi and Zhang (1994) for more
discussion along this line.
Mathematically, for sufficiently small $\varepsilon$, the problem can be approxi-
mated by a limit control problem with dynamics

$\dot{x}(t) = \sum_{i=1}^{m} \nu_i u_i(t) - z,$

where $\nu = (\nu_1, \dots, \nu_m)$ is the stationary distribution (or steady-state probability)
of the Markov chain generated by $Q$, which satisfies $\nu Q = 0$ and
$\sum_{i=1}^{m} \nu_i = 1$. In the
cost function, the expectation can be replaced by the average with respect to this
stationary distribution,

$\bar{J}(x, U(\cdot)) = \int_0^\infty e^{-\rho t} \sum_{i=1}^{m} \nu_i G(x(t), u_i(t))\, dt,$

and the associated value function satisfies a single equation, where
$U(\cdot) = (u_1(\cdot), \dots, u_m(\cdot))$ is the control of the limit problem. As can be seen
from the above example, the essence is
that, by using a small $\varepsilon$ and letting $\varepsilon \to 0$, the study of the detailed variation
is replaced by its corresponding “average” with respect to the stationary distri-
bution. In what follows, we will consider more complex systems. However,
even for the above seemingly simple example, the averaging approach reduces
the complexity noticeably.
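The effect of this averaging can be seen in a small simulation. The Python sketch below (with a hypothetical two-state capacity chain and an ad hoc production policy of our own choosing, not the chapter's model data) compares the surplus trajectory under a fast-switching capacity with the trajectory of the deterministic limit in which the per-state controls are averaged by the stationary distribution.

import numpy as np

rng = np.random.default_rng(7)
Q = np.array([[-1.0, 1.0], [2.0, -2.0]])   # hypothetical irreducible generator
nu = np.array([2.0, 1.0]) / 3.0            # stationary distribution: nu @ Q = 0
cap = np.array([0.0, 1.0])                 # production capacity in each state
z, eps, dt, T = 0.5, 0.01, 0.001, 5.0

def policy(x, c):
    # Ad hoc feedback: produce min(capacity, demand) above the zero
    # surplus target, full capacity below it.
    return min(c, z) if x >= 0.0 else c

state, x, x_avg = 1, 0.5, 0.5
for _ in range(int(T / dt)):
    # True system: the capacity switches according to the fast chain Q/eps.
    x += (policy(x, cap[state]) - z) * dt
    # Limit system: the per-state controls are averaged by nu.
    u_bar = sum(nu[i] * policy(x_avg, cap[i]) for i in range(2))
    x_avg += (u_bar - z) * dt
    # Euler approximation of a jump of the chain over a step of length dt.
    if rng.uniform() < -Q[state, state] / eps * dt:
        state = 1 - state
print(x, x_avg)   # close for small eps: the averaged model is accurate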
Note that the small parameter $\varepsilon$ may not be present in the original sys-
tems. To facilitate the analysis and hierarchical decomposition, we introduce
it into the systems. How small should $\varepsilon$ be? In applications,
constants such as 0.1 or even 0.5 could be considered small if all other coef-
ficients are of the order of magnitude 1; only the relative order of magnitude
matters. The mathematical results to be presented herein serve as guidelines for
various approximations and for the estimation of error bounds. The asymptotic
results of the underlying system (as $\varepsilon \to 0$) provide insights into the struc-
ture of the system and heuristics for applications. A thorough understanding of
the intrinsic behavior of the systems is instructive and beneficial to explo-
rations of the rich diversity of applications in hierarchical production planning,
Markov decision processes, random evolution, and control and optimization of
stochastic dynamic systems involving singularly perturbed Markov chains.
Since the subject discussed in this chapter is at the intersection of singular
perturbation and stochastic processes, our approaches consist of both analytic
techniques (purely deterministic) and stochastic analysis (purely probabilistic).
To make the chapter appealing to a wide audience in the fields of systems
science, operations research, management science, applied mathematics, and
engineering, we emphasize the motivation and present the main results.
Interested readers can refer to the references provided for further study and
technical details.
The rest of the chapter is organized as follows. Section 2 discusses the formu-
lation of singularly perturbed Markov chains. To handle singularly perturbed
systems driven by Markov chains, it is essential to have a thorough under-
standing of the underlying probabilistic structure. These properties are given
in Section 3. Based on a couple of motivational examples, Section 4 presents
decomposition/aggregation and nearly optimal controls and focuses on the opti-
mal controls of linear quadratic regulators. Some additional remarks are made
in Section 5. To make the chapter self-contained, we provide mathematical
preliminaries and necessary background in the appendix.

2. SINGULARLY PERTURBED MARKOV CHAINS


In this section we introduce singularly perturbed Markovian models in both
continuous time and discrete time. The discussion is confined to the cases of
stationary processes for simplicity. The power of the approach to be presented,
however, is better reflected by nonstationary models.

2.1. CONTINUOUS-TIME CASE


Many natural phenomena include processes changing at different rates. To
describe such fast/slow two-time scales in the formulation, we introduce a small
real number $\varepsilon > 0$. Let $\alpha^\varepsilon(t)$ be a continuous-time Markov chain with finite
state space $\mathcal{M} = \{1, \dots, m\}$ and generator

$Q^\varepsilon = \frac{1}{\varepsilon}\widetilde{Q} + \widehat{Q} \qquad (2.4)$

such that $\widetilde{Q}$ and $\widehat{Q}$ are themselves generators. A Markov chain with generator
given in (2.4) is a singularly perturbed Markov chain. The simple example
given below illustrates the effect of the small parameter $\varepsilon$.
Example. Suppose that $\alpha^\varepsilon(t)$ is a Markov chain having four states, with
$\widetilde{Q}$ and $\widehat{Q}$ given by (2.5).
Simulation of sample paths of the Markov chain yields Fig. 21.1, which displays
paths for two values of $\varepsilon$.

Observe that the small parameter $\varepsilon$ has a squeezing effect that rescales
the sample paths of the Markov chain. When $\varepsilon$ decreases, the Markov chain gen-
erated by $\widetilde{Q}/\varepsilon$ undergoes increasingly rapid variations. This, combined with $\widehat{Q}$,
produces a generator that has both a rapidly varying part and a slowly changing
part, with weak and strong interactions.
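The squeezing effect is easy to reproduce. The Python sketch below simulates a chain with generator $\widetilde{Q}/\varepsilon + \widehat{Q}$ exactly (exponential holding times, then a jump drawn from the off-diagonal rates); the two block generators are hypothetical stand-ins, since the matrices in (2.5) are not reproduced here.

import numpy as np

def simulate_ctmc(G, t_end, rng, state=0):
    # Exact simulation of a continuous-time Markov chain with generator G:
    # exponential holding times, then a jump chosen from the off-diagonal rates.
    t, path = 0.0, [(0.0, state)]
    while True:
        rate = -G[state, state]
        t += rng.exponential(1.0 / rate)
        if t >= t_end:
            return path
        p = G[state].copy(); p[state] = 0.0
        state = rng.choice(len(p), p=p / rate)
        path.append((t, state))

# Hypothetical block generators: two fast blocks coupled by a slow part.
Qt = np.array([[-1, 1, 0, 0], [1, -1, 0, 0],
               [0, 0, -1, 1], [0, 0, 1, -1]], dtype=float)
Qh = np.array([[-1, 0, 1, 0], [0, -1, 0, 1],
               [1, 0, -1, 0], [0, 1, 0, -1]], dtype=float)
rng = np.random.default_rng(3)
for eps in (0.1, 0.01):
    path = simulate_ctmc(Qt / eps + Qh, 10.0, rng)
    print(eps, len(path))   # within-block jumps proliferate as eps decreases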
Let

$p^\varepsilon(t) = (P(\alpha^\varepsilon(t) = 1), \dots, P(\alpha^\varepsilon(t) = m))$

be the probability vector associated with the Markov chain. Then it is known
that $p^\varepsilon(t)$ satisfies the differential equation

$\frac{dp^\varepsilon(t)}{dt} = p^\varepsilon(t) Q^\varepsilon, \qquad p^\varepsilon(0) = p^0, \qquad (2.6)$

where $p^0$ is a probability vector (i.e., all of its components satisfy $p_i^0 \ge 0$ and
$\sum_{i=1}^{m} p_i^0 = 1$).
The equation (2.6) is known as the forward equation. Take, for simplicity,
$Q^\varepsilon$ constant in time. Then the solution of (2.6) is $p^\varepsilon(t) = p^0 \exp(Q^\varepsilon t)$. A direct use
of the singular perturbation idea may not work here since, although (2.6) is a
linear system of differential equations, the generator has an eigenvalue 0. A
first glance may lead one to believe that $p^\varepsilon(t)$ becomes unbounded as $\varepsilon \to 0$. However,
this is not the case. Removing 0 from the spectrum of the generator, the
remaining eigenvalues all lie in the left half of the complex plane, producing
rapid convergence to the stationary distribution.
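Numerically, the boundedness of $p^\varepsilon(t)$ and the rapid approach to stationarity can be observed by evaluating $p^\varepsilon(t) = p^0 \exp(Q^\varepsilon t)$ with a matrix exponential; the two-state generators in the following sketch are hypothetical.

import numpy as np
from scipy.linalg import expm

Qt = np.array([[-1.0, 1.0], [1.0, -1.0]])   # fast part (hypothetical)
Qh = np.array([[-0.5, 0.5], [0.2, -0.2]])   # slow part (hypothetical)
eps = 0.01
Qeps = Qt / eps + Qh
p0 = np.array([1.0, 0.0])
for t in (0.0, 0.005, 0.05, 1.0):
    print(t, p0 @ expm(Qeps * t))           # p(t) = p0 exp(Q_eps t)
# After an initial layer of thickness O(eps), p(t) is already close to the
# stationary distribution (0.5, 0.5) of the fast part, confirming that p(t)
# stays bounded despite the 1/eps factor in the generator.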
In view of (2.4), the matrix $\widetilde{Q}/\varepsilon$ dominates the asymptotics. Let us con-
centrate on the fast-changing part of the generator first. According to A.N.
Kolmogorov, the states of any Markov chain can be classified as either recur-
rent or transient. It is also known that any finite-state Markov chain has at
least one recurrent state, i.e., not all states are transient. As a result (Iosifescu,
1980, p. 94), by appropriate arrangements, it is always possible to write the
corresponding generator as either

$\widetilde{Q} = \mathrm{diag}(\widetilde{Q}^1, \dots, \widetilde{Q}^l) \qquad (2.7)$

or

$\widetilde{Q} = \begin{pmatrix} \widetilde{Q}^1 & & & \\ & \ddots & & \\ & & \widetilde{Q}^l & \\ \widetilde{Q}^{*,1} & \cdots & \widetilde{Q}^{*,l} & \widetilde{Q}^{*} \end{pmatrix}. \qquad (2.8)$

The generator (2.7) corresponds to a Markov chain that has $l$ recurrent classes,
whereas (2.8) corresponds to a Markov chain that includes transient states in
addition to the recurrent classes. For the stationary cases, (2.7) and (2.8)
together exhaust all possibilities of practical concern. In the discussion
to follow, we will use the notion of weak irreducibility (see the appendix for a
definition), which is an extension of the usual notion of irreducibility. We will also
deal with the partitioned matrices of the forms (2.7) and (2.8). Nevertheless, in
lieu of recurrent classes, we will consider weakly irreducible classes, a
generalization of the classical formulation.

2.2. TIME-SCALE SEPARATION


Our formulation can accommodate and indicate different rates of change
among different states. In model (2.7), for example, a Markov chain generated
by $\widetilde{Q}$ has $l$ classes of weakly irreducible states, within which the transitions
take place at a faster pace. It follows that the state space of the Markov chain
generated by $\widetilde{Q}$ has a natural classification $\mathcal{M} = \mathcal{M}_1 \cup \cdots \cup \mathcal{M}_l$. A similar in-
terpretation holds for the generator given by (2.8). Note that the component
$\widetilde{Q}^{*,k}$ on the last row of the matrix in (2.8) denotes transitions from the transient
states to the $k$th weakly irreducible class, and $\widetilde{Q}^{*}$ denotes the transitions among
those transient states. In what follows, we demonstrate how any given generator
can be rewritten as (2.4) after appropriate rearrangements. For simplicity, we
present a case where the dominating part of the generator, $\widetilde{Q}$, corresponds to a
chain with weakly irreducible classes. The example to follow is drawn from
Chapter 3 of Yin and Zhang (1998).
Consider a generator Q given by

with the corresponding state space

Step 1. Separate the entries of the matrix based on their orders of magnitude.
The numbers {1,2} are at a scale (order of magnitude) different from that of
484 MODELING UNCERTAINTY

the numbers {10, –11, –12, 21, –22, 30, –33}. So we write Q as

Step 2. Adjust the entries to make each of the above matrices a generator.
This requires moving the entries so that each of the two matrices satisfies the
conditions of a generator, i.e., nonnegative off-diagonal elements, nonpositive
diagonal elements, and zero row sums. After such a rearrangement the
matrix Q becomes

Step 3. Permute the columns and rows so that the dominating matrix is of a
desired block diagonal form (corresponding to the weakly irreducible blocks).
In this example, exchanging the orders of and in i.e., considering
yields

Let Then

Note that the decomposition procedure is not unique. We can also write

with Therefore, by using elementary row and column operations and


some rearrangements, we have transformed a matrix Q into the form (2.4). The
above procedure is applicable to time-dependent generators as well. It can also
be used to incorporate generators with transient states.
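Steps 1 and 2 are mechanical and can be automated. The Python sketch below separates the entries of a generator by a magnitude threshold and then resets the diagonals so that each part is again a generator; the matrix and the threshold are hypothetical, chosen only to mimic the two scales of the example above.

import numpy as np

def split_generator(Q, threshold):
    # Steps 1-2 of the decomposition, as a sketch: entries of magnitude at
    # least `threshold` go into the fast part, the rest into the slow part;
    # the diagonals are then reset so that each part has zero row sums and
    # nonnegative off-diagonal entries, i.e., each part is a generator.
    fast = np.where(np.abs(Q) >= threshold, Q, 0.0)
    slow = Q - fast
    for G in (fast, slow):
        np.fill_diagonal(G, 0.0)
        np.fill_diagonal(G, -G.sum(axis=1))
    return fast, slow   # Q = fast + slow; identify Qtilde = eps*fast, Qhat = slow

# Hypothetical generator with two scales of transition rates.
Q = np.array([[-12.0, 10.0, 2.0],
              [21.0, -22.0, 1.0],
              [1.0, 2.0, -3.0]])
Qt, Qh = split_generator(Q, threshold=5.0)
print(Qt, Qh, sep="\n")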

3. PROPERTIES OF THE SINGULARLY PERTURBED


SYSTEMS
This section contains several asymptotic properties of singularly perturbed
Markov chains. It begins with asymptotic expansions, proceeds to occupation
measures, continues with functional central limit results, and concludes with
large deviations and exponential error bounds.

3.1. ASYMPTOTIC EXPANSION


The basic question concerns the probability distribution of the Markov chain
as $\varepsilon \to 0$. For the continuous-time case, let the Markov chain have state
space $\mathcal{M}$ and generator (2.4). Our study focuses on the forward equation (with
time-dependent generators):

$\frac{dp^\varepsilon(t)}{dt} = p^\varepsilon(t)\left(\frac{1}{\varepsilon}\widetilde{Q}(t) + \widehat{Q}(t)\right), \qquad p^\varepsilon(0) = p^0, \qquad (3.9)$

where $\widetilde{Q}(t)$ and $\widehat{Q}(t)$ are generators for each $t$. The detailed proof of the
following results can be found in Yin and Zhang (1998), Chapter 6.

Theorem 1. Suppose that $\widetilde{Q}(t)$ is given by (2.7) such that, for each $t \in [0, T]$
and each $k$, $\widetilde{Q}^k(t)$ is weakly irreducible; for some positive integer $n$,
$\widetilde{Q}(\cdot)$ and $\widehat{Q}(\cdot)$ are $(n+1)$-times and $n$-times continuously differentiable
on $[0, T]$, respectively. Moreover, $(d^{n+1}/dt^{n+1})\widetilde{Q}(\cdot)$ and $(d^{n}/dt^{n})\widehat{Q}(\cdot)$ are
Lipschitz on $[0, T]$. Then, for sufficiently small $\varepsilon$, the probability vector $p^\varepsilon(t)$
can be approximated to a desired accuracy by a series in $\varepsilon$: we can
construct two sequences $\{\varphi_i(\cdot)\}$ and $\{\psi_i(\cdot)\}$ such that

$\sup_{0 \le t \le T} \left| p^\varepsilon(t) - \sum_{i=0}^{n} \varepsilon^i \varphi_i(t) - \sum_{i=0}^{n} \varepsilon^i \psi_i\!\left(\frac{t}{\varepsilon}\right) \right| = O(\varepsilon^{n+1}).$

Moreover, all the $\varphi_i(\cdot)$ are sufficiently smooth, and

$\left| \psi_i(\tau) \right| \le K \exp(-\kappa \tau)$

for some $K > 0$ and $\kappa > 0$.

The asymptotic expansion enables us to simplify the computation. Note that
$\psi_i(t/\varepsilon) \to 0$ as $\varepsilon \to 0$ for each $t > 0$. Thus, in lieu of using $p^\varepsilon(t)$, we can use the
limit of the probability distribution. In many cases, we actually need both the
limit of the probability distribution and certain higher-order corrections. For the
scaled occupation measure to be studied in the subsequent section, for example,
the initial-layer terms play an essential role. If transient states are included, with $\widetilde{Q}(t)$
given by (2.8), we also need all eigenvalues of $\widetilde{Q}^{*}(t)$ to have negative real
parts.
In accordance with (3.10), for sufficiently small $\varepsilon$, $p^\varepsilon(t)$ can be approxi-
mated by a series with an error bound of the order $O(\varepsilon^{n+1})$. The approximation
is uniform in $t$; it reflects the two time scales, the regular time $t$ and the
stretched time $t/\varepsilon$. The terms involving $t$ are known as outer expansion
terms, and the terms involving $t/\varepsilon$ are the initial-layer corrections. The
outer term approximates $p^\varepsilon(t)$ well for those $t$ away from a layer of
thickness $O(\varepsilon)$ near $t = 0$, but it does not satisfy the initial condition. Therefore we intro-
duce the initial-layer corrections to accommodate it. The outer expansion terms are all smooth
and well behaved, and all the initial layer corrections decay exponentially.
The method of obtaining the $\varphi_i(\cdot)$ and $\psi_i(\cdot)$ is constructive. The key
is to choose the initial conditions appropriately so that they match both the outer
expansion and the initial-layer correction. The procedure requires the use of weak
irreducibility (see the definition in the appendix, in particular (6.54)), certain
orthogonality conditions, and the Fredholm alternative, which is stated as follows: Let $B$ denote
an $m \times m$ matrix. For any $\lambda$, define an operator $A$ as

$A = \lambda I - B,$

where $I$ is the identity matrix. Then the adjoint operator is given by

$A^{*} = \bar{\lambda} I - B'.$

Then one of the following two alternatives holds: (1) The
homogeneous equation $Ax = 0$ has only the zero solution, in which
case $\lambda \in \rho(A)$ (the resolvent set of $A$), $A^{-1}$ is bounded, and the
inhomogeneous equation $Ax = y$ has exactly one solution
for each $y$. (2) The homogeneous equation $Ax = 0$ has a nonzero
solution, in which case the inhomogeneous equation $Ax = y$ has a
solution iff $\langle y, z \rangle = 0$ for every solution $z$ of the adjoint equation $A^{*}z = 0$.
After obtaining the formal expansion of the asymptotic series, we proceed


to validate the expansion by estimating the corresponding error bounds. An
interested reader is referred to Yin and Zhang (1998), Chapters 4 and 6 for
detailed treatments.
Discrete-time Problems. For discrete-time cases, we must use transition prob-
ability matrices in lieu of a generator. Similar to the continuous-time case,
suppose that the transition matrix is given by

$P^\varepsilon = P + \varepsilon Q, \qquad (3.12)$

where $\varepsilon$ is a small parameter, $P$ is the transition matrix of a discrete-time
Markov chain, and $Q$ is a generator. Then any transition probability
matrix of a finite-state Markov chain with stationary transition probabilities can
be put in the form (Iosifescu, 1980, p. 94) of either

$P = \mathrm{diag}(P^1, \dots, P^l)$

or

$P = \begin{pmatrix} P^1 & & & \\ & \ddots & & \\ & & P^l & \\ P^{*,1} & \cdots & P^{*,l} & P^{*} \end{pmatrix},$

where each $P^k$ is itself a transition matrix within the $k$th weakly irreducible class
for $k = 1, \dots, l$, and the last row corresponds to the transient states.
The two-time-scale interpretation becomes a normal rate of change versus a
slow rate of change. Let $p_k^\varepsilon$ be the solution of

$p_{k+1}^\varepsilon = p_k^\varepsilon (P + \varepsilon Q), \qquad p_0^\varepsilon = p^0,$

where P is a transition matrix and Q is a generator. Parallel to the develop-


ment of the continuous-time systems, we can obtain the following asymptotic
expansions:

$p_k^\varepsilon = \sum_{i=0}^{n} \varepsilon^i \varphi_i(\varepsilon k) + \sum_{i=0}^{n} \varepsilon^i \psi_i(k) + e_k^\varepsilon,$

where $e_k^\varepsilon$ represents the approximation error, and the $\psi_i(k)$ decay geometrically
fast in $k$. The term $\varphi_i(\varepsilon k)$ is the outer expansion and $\psi_i(k)$ is the boundary-layer cor-
rection. For a detailed treatment, we refer the readers to Yin and Zhang (2000).
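The two-rate interpretation can be checked directly: under $P^\varepsilon = P + \varepsilon Q$, the distribution equilibrates within the blocks of $P$ in a few steps, while mass moves between blocks on the slower time scale $k \sim 1/\varepsilon$. The matrices in the sketch below are hypothetical (two irreducible blocks for $P$, a generator $Q$ coupling them).

import numpy as np

P = np.array([[0.2, 0.8, 0.0, 0.0],
              [0.6, 0.4, 0.0, 0.0],
              [0.0, 0.0, 0.5, 0.5],
              [0.0, 0.0, 0.7, 0.3]])
Q = np.array([[-1.0, 0.0, 1.0, 0.0],
              [0.0, -1.0, 0.0, 1.0],
              [0.5, 0.0, -0.5, 0.0],
              [0.0, 0.5, 0.0, -0.5]])
eps = 0.01
Peps = P + eps * Q                  # two-time-scale transition matrix
p = np.array([1.0, 0.0, 0.0, 0.0])
for k in range(1, 301):
    p = p @ Peps                    # p_{k+1} = p_k P_eps
    if k in (5, 50, 300):
        # Within-block shape settles quickly; total mass per block drifts slowly.
        print(k, p.round(4), "block mass:", (p[:2].sum(), p[2:].sum()))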

3.2. OCCUPATION MEASURES


The asymptotic expansion obtained in the previous section is based on a
purely deterministic approach. To study how the underlying Markov chain
evolves, it is also important to investigate the stochastic characteristics of the
Markov chain. A quantity of particular interest is the so-called occupation
measure, or occupation time, which indicates the amount of time the chain
spends in a given state.
Let us begin with the continuous-time case in which $\widetilde{Q}(t)$ is weakly
irreducible (a single block). Using the idea of the previous
section, we can develop asymptotic expansions of $p^\varepsilon(t)$. In this case, the lead-
ing term in the asymptotic expansion is the quasi-stationary distribution $\nu(t) =
(\nu_1(t), \dots, \nu_m(t))$.
Consider the time that the Markov chain spends in state $i$ up to time $t$, that
is, $\int_0^t I_{\{\alpha^\varepsilon(s) = i\}}\, ds$. It is easy to show that this integral can be approximated
by $\int_0^t \nu_i(s)\, ds$, which is a non-random function and is the mean (as $\varepsilon \to 0$) of
the occupation measure indexed by $i$. Thus, we define a sequence of centered
occupation measures as

$O_i^\varepsilon(t) = \int_0^t \left( I_{\{\alpha^\varepsilon(s) = i\}} - \nu_i(s) \right) ds \qquad (3.16)$

and write $O^\varepsilon(t) = (O_1^\varepsilon(t), \dots, O_m^\varepsilon(t))$.
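A small experiment illustrates both the approximation and its quality. The Python sketch below simulates a two-state chain with constant generator $\widetilde{Q}/\varepsilon$ (a hypothetical $\widetilde{Q}$), computes the centered occupation measure of one state, and estimates its mean square, which shrinks like $O(\varepsilon)$, consistent with the estimates discussed below.

import numpy as np

def occupation_error(Qt, eps, i, t_end, rng):
    # Centered occupation measure for state i, as in (3.16): time spent in
    # state i up to t_end minus nu_i * t_end, where nu is the stationary
    # distribution of the weakly irreducible part Qt.
    m = Qt.shape[0]
    # Stationary distribution: solve nu Qt = 0 with sum(nu) = 1 (least squares).
    A = np.vstack([Qt.T, np.ones(m)])
    b = np.concatenate([np.zeros(m), [1.0]])
    nu = np.linalg.lstsq(A, b, rcond=None)[0]
    t, state, occ = 0.0, 0, 0.0
    while t < t_end:
        hold = rng.exponential(-eps / Qt[state, state])  # rate -Qt[ss]/eps
        hold = min(hold, t_end - t)
        if state == i:
            occ += hold
        t += hold
        p = Qt[state].copy(); p[state] = 0.0
        state = rng.choice(m, p=p / (-Qt[state, state]))
    return occ - nu[i] * t_end

Qt = np.array([[-1.0, 1.0], [3.0, -3.0]])   # hypothetical generator
rng = np.random.default_rng(11)
for eps in (0.1, 0.01, 0.001):
    errs = [occupation_error(Qt, eps, 0, 1.0, rng) for _ in range(200)]
    print(eps, np.mean(np.square(errs)))    # mean square shrinks like O(eps)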

If, in lieu of assuming weak irreducibility of a single block, $\widetilde{Q}(t)$ has more
than one block as indicated in (2.7), then a non-
random limit cannot be expected. To proceed, aggregate the states in each of
the subspaces $\mathcal{M}_k$ into one state and define $\bar{\alpha}^\varepsilon(t) = k$ if $\alpha^\varepsilon(t) \in \mathcal{M}_k$. This
aggregated process in general is not Markov. However, it can be shown that
$\bar{\alpha}^\varepsilon(\cdot)$ converges weakly to a Markov chain $\bar{\alpha}(\cdot)$ with generator $\bar{Q}(t)$. Recall
that a sequence of random variables $\xi^\varepsilon$ converges weakly to $\xi$ if and only if
$Eh(\xi^\varepsilon) \to Eh(\xi)$ for any bounded and continuous function $h(\cdot)$.
If the Markov chain corresponding to $\widetilde{Q}(t)$ has more than one weakly ir-
reducible class, a simple “deterministic approximation” as in (3.16) by the
quasi-stationary distribution is not enough. A pertinent definition of the occu-
pation measure becomes

$O_i^\varepsilon(t) = \int_0^t \left( I_{\{\alpha^\varepsilon(s) = i\}} - \nu_i(s) I_{\{\bar{\alpha}^\varepsilon(s) = k\}} \right) ds, \qquad i \in \mathcal{M}_k.$

The immediate questions are: How good is such an approximation? What is the
asymptotic distribution of $O^\varepsilon(\cdot)$? Using probability basics, we can show that
$\sup_{0 \le t \le T} E|O^\varepsilon(t)|^2 = O(\varepsilon)$. That is, a mean squares estimate for the unscaled
occupation measure holds. Such an estimate implies that a scaled sequence of
the occupation measures may have a nontrivial asymptotic distribution. Define

$n^\varepsilon(t) = \frac{O^\varepsilon(t)}{\sqrt{\varepsilon}},$

which is a scaled sequence of occupation measures. The scaling is suggested
by the central limit theorem. The detailed proof of the following results can be
found in Yin and Zhang (1998), Chapter 7.

Theorem 2. Assume that, for each $t \in [0, T]$ and each $k$, $\widetilde{Q}^k(t)$ is weakly
irreducible. Suppose $\widetilde{Q}(\cdot)$ is twice differentiable with Lipschitz continuous sec-
ond derivative and $\widehat{Q}(\cdot)$ is differentiable with Lipschitz continuous derivative.
For each $i \in \mathcal{M}$ and $t \in [0, T]$, let $\beta_i(t)$ be a bounded and Lipschitz
continuous deterministic function. Then $n^\varepsilon(\cdot)$ converges weakly to a switching
diffusion $n(\cdot)$, where

$n(t) = \int_0^t \sigma(s, \bar{\alpha}(s))\, dw(s),$

and $w(\cdot)$ is a standard $m$-dimensional Brownian motion.

For the definitions of weak convergence and jump diffusions, see the ap-
pendix. Loosely speaking, a switching diffusion process is a combination of
switching processes and diffusion processes. It possesses both diffusive be-
havior as well as jump properties. Switching diffusions are widely used in
many applications. For example, in a manufacturing system, the demand may
be modeled as a diffusion process, whereas the machine production rate as a
Markov jump process.

Discrete-time Chains. For discrete-time Markov chains, analogous results


can be obtained. Consider the weakly irreducible case
and define a sequence of occupation measures

Then define a continuous-time interpolation:

With definition (3.19), can be written recursively as a difference equation



where

Define an aggregated process of by

Furthermore, define continuous-time interpolations with the interpolation in-
terval $\varepsilon$ on $[0, T]$, by

Similar to the continuous-time case, a mean squares estimate can be obtained


and the weak convergence of can be derived. Subsequently, for each
and each define the normalized
occupation measure:

Observe that

To proceed, define a continuous-time interpolation:

For each fixed $t$, intuitively, we expect that the random variable $n^\varepsilon(t)$
has an approximately normal distribution. This is indeed the case, as illustrated by Fig. 21.2.
We used 1000 replications (each replication corresponding to one realization).
The computation results are displayed in the form of a histogram.
On the other hand, keeping the sample point fixed and taking $n^\varepsilon$ as a
function of $t$ only yields a sample path as shown in Fig. 21.3. Note the diffusion-
type behavior displayed in the figure.
Moreover, it can be shown that $n^\varepsilon(\cdot)$ converges weakly to a switching
diffusion process and that the asymptotic covariance, which depends on the zeroth-
order initial-layer terms, is computable. The detailed development of the asymp-
totic distribution is quite involved; we refer readers to Yin and Zhang (1998),
Chapter 7, for the case of several weakly irreducible classes, and to Yin et al.
(2000a) for cases including transient states.

3.3. LARGE DEVIATIONS AND EXPONENTIAL


BOUNDS
3.3.1 Large Deviations. We begin with a simple example to review the
concept of large deviations. Suppose that $\{X_n\}$ is a sequence of i.i.d. random
variables with mean 0 and variance $\sigma^2$. For simplicity, assume that $X_1$ is Gaus-
sian, which allows us to get an explicit representation of the moment generating
function. Let $S_n$ be the partial sum $S_n = X_1 + \cdots + X_n$. We are interested in the
probability of the event $\{S_n/n \ge x\}$ for $x > 0$. The well-known strong law of
large numbers indicates

$\frac{S_n}{n} \to 0$

with probability one (w.p.1). The central limit theorem implies that $S_n/\sqrt{n}$
converges in distribution to $N(0, \sigma^2)$.
Then for large $n$, $\{S_n/n \ge x\}$ is a rare event in that its probability goes to 0.
Although some approximations can be derived, neither the law of large numbers
nor the central limit theorem can give a precise quantification of this rareness
or its associated probability. A more detailed description beyond the normal
deviation range is needed. That is the role of the large deviations approach.
Consider the Cramér–Legendre transformation $L$, defined by

$L(x) = \sup_{\theta} \left[ \theta x - \log E e^{\theta X_1} \right] = \frac{x^2}{2\sigma^2}.$

The last equality is a consequence of the fact that $X_1$ is normally distributed.
Chernoff's bound,

$P\left( \frac{S_n}{n} \ge x \right) \le \exp(-n L(x)),$

indicates that $P(S_n/n \ge x)$ is “exponentially equivalent to” $\exp(-n x^2/(2\sigma^2))$. In
other words, the probability of the rare event is exponentially small.
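For the Gaussian case the exact tail probability is available, so the quality of the bound can be checked directly; the following Python sketch compares the exact tail with the Chernoff bound and shows that the bound captures the exponential rate.

import numpy as np
from scipy.stats import norm

# Exact Gaussian tail versus the Chernoff/large-deviations bound:
# P(S_n/n >= x) = P(Z >= x*sqrt(n)/sigma) <= exp(-n x^2 / (2 sigma^2)).
sigma, x = 1.0, 0.5
for n in (10, 50, 100, 200):
    exact = norm.sf(x * np.sqrt(n) / sigma)
    bound = np.exp(-n * x * x / (2 * sigma * sigma))
    print(n, exact, bound, np.log(exact) / np.log(bound))
# The log-ratio tends to 1 as n grows: the bound is loose by a
# subexponential factor only, i.e., it captures the exponential rate.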
To illustrate the use of large deviations for singularly perturbed Markov
chains, let us consider the discrete-time problem. For simplicity, assume that $P$
has only one block and is irreducible. Then the centered occupation measures
become

$O_k^\varepsilon = \varepsilon \sum_{j=0}^{k-1} \left( I_{\{\alpha_j^\varepsilon = i\}} - \nu_i \right),$

where $\nu_i$ denotes the $i$th component of the stationary distribution of $P$. Let
$\theta = (\theta_1, \dots, \theta_m)$. Using the approach in Dembo and
Zeitouni (1998), pp. 71–75, define a matrix $\Pi_\theta$ with entries

$\pi_{ij}(\theta) = p_{ij} \exp(\langle \theta, e_j \rangle),$

where $\langle \cdot, \cdot \rangle$ denotes the usual inner product and $e_j$ is the $j$th unit vector. It
follows that $\Pi_\theta$ is also irre-
ducible. For each $\theta$, let $\rho(\Pi_\theta)$ be the Perron–Frobenius eigenvalue of
the matrix $\Pi_\theta$ (see Seneta (1981) for a definition). Define

$\Lambda(\theta) = \log \rho(\Pi_\theta).$

By using the Gärtner–Ellis theorem (see Dembo and Zeitouni (1998)), we can
derive the following large deviations bounds: For any Borel set $G$,

$-\inf_{x \in G^{\circ}} \Lambda^{*}(x) \le \liminf \varepsilon \log P(O^\varepsilon \in G) \le \limsup \varepsilon \log P(O^\varepsilon \in \bar{G}) \le -\inf_{x \in \bar{G}} \Lambda^{*}(x),$

where $G^{\circ}$ and $\bar{G}$ denote the interior and the closure of $G$, respectively, and

$\Lambda^{*}(x) = \sup_{\theta} \left[ \langle \theta, x \rangle - \Lambda(\theta) \right].$

3.3.2 Exponential Bounds. Many control and optimization problems
(see Zhang (1995); Zhang (1996)) involve singularly perturbed Markov chains
and require the use of exponential-type upper bounds for the probability
errors. Such exponential bounds also provide alternative ways to establish
asymptotic normality. Such an idea was used in Zhang and Yin (1996) for
error estimates in continuous-time Markov chains where $\widetilde{Q}(t)$ is weakly
irreducible. In fact, such a bound is crucial in obtaining asymptotically optimal
controls of nonlinear dynamic systems (Zhang et al., 1997). The work on
such bounds involving only weakly irreducible classes can be found
in Yin and Zhang (1998), and the inclusion of transient states is in Yin et al.
(2000e). Concentrating on weakly irreducible states only, we have the following
theorem.

Theorem 3. Suppose the Markov chain $\alpha^\varepsilon(t)$ has time-dependent generator $Q^\varepsilon(t)$,
with $\widetilde{Q}(t)$ being of the form (2.7). Suppose that, for each $t \in [0, T]$
and each $k$, $\widetilde{Q}^k(t)$ is weakly irreducible; $\widetilde{Q}(\cdot)$ is differentiable on
$[0, T]$ with its derivative being Lipschitz, and $\widehat{Q}(\cdot)$ is also Lipschitz. Then, there
exist $\varepsilon_0 > 0$ and $K > 0$ such that, for $0 < \varepsilon \le \varepsilon_0$,

where $T_\varepsilon$ is a constant satisfying

in which

An application of the error bounds yields that, for any $\delta > 0$,

The above results are mainly for time-varying generators. Better results can
be obtained for constant $\widetilde{Q}$ and $\widehat{Q}$, where the corresponding conclusion is that
there exist positive constants $\varepsilon_0$, $\kappa$, and $K$ such that, for $0 < \varepsilon \le \varepsilon_0$,

4. CONTROLLED SINGULARLY PERTURBED


MARKOVIAN SYSTEMS
In this section, by using a continuous-time LQG (linear quadratic Gaussian)
regulator problem and a discrete-time LQ problem, we illustrate
(1) the idea of hybrid system modeling under the influence of singularly
perturbed Markov chains,
(2) the decomposition and aggregation, and
(3) the associated nearly optimal controls.
We show that the complexity of the underlying systems can be reduced by
means of a hierarchical control approach. The main ideas are presented and the
end results are demonstrated. Interested readers can consult Zhang and Yin (1999)
for the hybrid LQG problem and Yang et al. (2001) for the discrete-time
LQ problem for further reading.

4.1. CONTINUOUS-TIME HYBRID LQG


For some finite $T > 0$, we consider the following linear system on the time
horizon $[0, T]$:

$dx(t) = \left( A(\alpha(t)) x(t) + B(\alpha(t)) u(t) \right) dt + \sigma\, dw(t), \qquad (4.30)$

where $x(t)$ denotes the state variables, $u(t)$ denotes the control vari-
ables, $A(i)$ and $B(i)$ are well defined and have finite
values for each $i \in \mathcal{M}$, $w(\cdot)$ is a standard Brownian motion, and $\alpha(t)$ is a
Markov chain with a finite state space $\mathcal{M} = \{1, \dots, m\}$. The classical work
on LQG assumes a fixed model (i.e., $A$, $B$, etc. are all fixed), which excludes
those systems subject to discrete-event variations. Similar to the Leontief model
given in the introduction, however, a model with switching is a better alterna-
tive in many applications. The premise of such a model is that many important
movements arise from discrete events. To accommodate this, we have included
in (4.30) both continuous dynamics and discrete-event perturbations, which
justifies the name hybrid systems. Our objective is to find an optimal control
to minimize the expected quadratic cost function

$J(x, i, u(\cdot)) = E\left[ \int_0^T \left( x'(t) M(\alpha(t)) x(t) + u'(t) N(\alpha(t)) u(t) \right) dt + x'(T) D x(T) \right], \qquad (4.31)$

where $E$ denotes the expectation given $\alpha(0) = i$ and $x(0) = x$;
the $M(i)$, $i \in \mathcal{M}$, are symmetric nonnegative definite matrices; the $N(i)$
and $D$ are symmetric positive definite matrices; and $w(\cdot)$ and $\alpha(\cdot)$ are independent.
The LQG problems are important since many real-world applications can be
approximated by linear or piecewise linear systems.
Following the classical approach given in Fleming and Rishel (1975) and
denoting the value function (the optimal cost) by $v(t, x, i)$, we write

$v(t, x, i) = x' K_i(t) x + q_i(t).$

The problem is reduced to finding $K_i(t)$ and $q_i(t)$. It can be solved by substi-
tuting the above quadratic function into the dynamic programming equation
and then solving the resulting so-called Riccati equations for the $K_i(t)$ and
the equations for the $q_i(t)$. The solution of the problem completely depends on the
solutions of the Riccati equations. Since many applications involve large-scale
systems, we consider the case where the Markov chain has a large state space ($m$ is large).
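To fix ideas, the Python sketch below integrates, by a backward Euler scheme, the system of $m$ coupled Riccati equations in the standard Markov-modulated LQ form; this is a generic stand-in (assumed form), not a reproduction of the chapter's own equations (4.36)-(4.37) referenced below, and all matrices are hypothetical two-state data.

import numpy as np

def coupled_riccati(A, B, M, N, D, Q, T, steps):
    # Backward-Euler sketch of the m coupled Riccati equations of a
    # Markov-modulated LQ problem (standard jump-LQ form, assumed here):
    #   -dK_i/dt = A_i'K_i + K_i A_i - K_i B_i N_i^{-1} B_i' K_i + M_i
    #              + sum_j q_ij K_j,    K_i(T) = D.
    # One Riccati equation per Markov state: m equations in total.
    m = len(A)
    K = np.array([D.copy() for _ in range(m)])
    dt = T / steps
    for _ in range(steps):
        Knew = np.empty_like(K)
        for i in range(m):
            coupling = sum(Q[i, j] * K[j] for j in range(m))
            dK = (A[i].T @ K[i] + K[i] @ A[i]
                  - K[i] @ B[i] @ np.linalg.solve(N[i], B[i].T) @ K[i]
                  + M[i] + coupling)
            Knew[i] = K[i] + dt * dK      # step backward from t to t - dt
        K = Knew
    return K   # K[i] ~ K_i(0); feedback u = -N_i^{-1} B_i' K_i(t) x

# Hypothetical two-state data (scalars written as 1x1 matrices).
A = [np.array([[0.5]]), np.array([[-1.0]])]
B = [np.array([[1.0]]), np.array([[0.5]])]
M = [np.eye(1), 2 * np.eye(1)]
N = [np.eye(1), np.eye(1)]
Q = np.array([[-2.0, 2.0], [3.0, -3.0]])
print(coupled_riccati(A, B, M, N, D=np.eye(1), Q=Q, T=1.0, steps=1000))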

A measure of complexity is the number of equations to be solved. Since the
Markov chain is involved, instead of solving one equation, one needs to solve
a system of $m$ equations. When $m$ is large, the computation involved may be
infeasible. To overcome the difficulty, we introduce a small parameter $\varepsilon > 0$ and
let the generator of the Markov chain be $Q^\varepsilon$, consisting of a rapidly changing
part and a slowly varying one of the form (2.4). Note that $Q^\varepsilon$ is obtainable
using the procedure described in the previous section. The resulting Markov
chain becomes $\alpha^\varepsilon(t)$, and the dynamic equation and the cost function
can be rewritten with $\alpha(t)$ replaced by $\alpha^\varepsilon(t)$; we refer to the resulting system
equation as (4.32) and the resulting cost as (4.33).
To be more specific, assume $\widetilde{Q}$ has the block-diagonal form (2.7), where the
$\widetilde{Q}^k$ are weakly irreducible for $k = 1, \dots, l$, and $\mathcal{M} = \mathcal{M}_1 \cup \cdots \cup \mathcal{M}_l$. A
small parameter $\varepsilon$ results in the system under consideration having a two-
time-scale behavior (Delebecque and Quadrat, 1981; Phillips and Kokotovic,
1981).
Using dynamic programming techniques (see Fleming and Rishel (1975)),
the value function $v^\varepsilon(t, x, i)$ satisfies a system of
Hamilton-Jacobi-Bellman (HJB) equations (4.34), one for each $i \in \mathcal{M}$,
with a boundary condition at $t = T$ determined by the terminal cost; the
operator involved is defined in (6.52).
Following the approach in Fleming and Rishel (1975), let

$v^\varepsilon(t, x, i) = x' K_i^\varepsilon(t) x + q_i^\varepsilon(t) \qquad (4.35)$

for some matrix-valued function $K_i^\varepsilon(\cdot)$ and a real-valued function $q_i^\varepsilon(\cdot)$.
Substituting (4.35) into (4.34) and comparing coefficients leads to the

following Riccati equations for

and

where is as defined in (6.52). In accordance with the results of


Fleming and Rishel (1975)‚ these equations have unique solutions. In view of
the positive definiteness of the optimal control has the form:

In solving this problem, one will face the difficulty caused by large dimension-
ality if the state space is large (i.e., is a large number), where a total of
equations must be solved. To resolve this, we aggregate the states in as one
state and obtain an aggregated process defined by when
The process is not necessarily Markovian. However, using
certain probabilistic arguments, we have shown in Yin and Zhang (1998); Yin
et al. (2000a) that converges weakly to the process generated by
which has the form
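(this display was lost in extraction; in the standard notation of Yin and Zhang (1998) it reads
\[ \bar{Q}(t) = \mathrm{diag}\bigl(\nu^{1}(t),\ldots,\nu^{l}(t)\bigr)\,\widehat{Q}(t)\,\mathrm{diag}\bigl(\mathbb{1}_{m_{1}},\ldots,\mathbb{1}_{m_{l}}\bigr), \]
a reconstruction offered for orientation rather than a verbatim quote)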

with being the stationary distribution of (for each


For and

and

uniformly on [0‚ T] as where



are the unique solutions to the differential equations

and

respectively, with Based on this result, we can use the optimal


control of the limit problem to construct controls of the original problem. It can
be shown that the controls so constructed are asymptotically optimal (Zhang
and Yin, 1999). To proceed, we illustrate the ideas by examining the example
that follows the brief computational sketch below.
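The sketch (Python, hypothetical two-block data; the blocks and the slow generator are illustrative assumptions, not the chapter's example) computes the stationary distribution of each fast block and the resulting aggregated generator:

import numpy as np

def stationary(Q):
    """Stationary distribution nu of a generator Q: nu Q = 0, nu 1 = 1."""
    m = Q.shape[0]
    A = np.vstack([Q.T, np.ones(m)])
    b = np.zeros(m + 1)
    b[-1] = 1.0
    return np.linalg.lstsq(A, b, rcond=None)[0]

Q1 = np.array([[-1.0, 1.0], [2.0, -2.0]])   # fast block 1
Q2 = np.array([[-3.0, 3.0], [1.0, -1.0]])   # fast block 2
Q_hat = np.array([                          # slow interactions between blocks
    [-0.5,  0.0,  0.5,  0.0],
    [ 0.0, -0.4,  0.0,  0.4],
    [ 0.3,  0.0, -0.3,  0.0],
    [ 0.0,  0.2,  0.0, -0.2],
])

nu1, nu2 = stationary(Q1), stationary(Q2)
V = np.zeros((2, 4))      # rows: stationary distributions of the fast blocks
V[0, :2] = nu1
V[1, 2:] = nu2
E = np.zeros((4, 2))      # columns: indicator vectors of the two blocks
E[:2, 0] = 1.0
E[2:, 1] = 1.0
Q_bar = V @ Q_hat @ E     # 2x2 aggregated generator of the limit process
print(Q_bar, Q_bar.sum(axis=1))   # rows sum to (nearly) zero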
Consider a four-state Markov chain having
the same as in (2.5). Suppose that the generator has the form such
that consists of and each of which is a 2 × 2 matrix. Thus and
Suppose that the dynamic system (4.32) is a 2-dimensional one with the
cost function (4.33). Choosing a time horizon [0, T] with T = 5, we discretize
the system equations with step size The trajectories of
and are given in Fig. 21.4. The computation was
done on a Sun Workstation ULTRA 5 and the CPU times were recorded. The
average CPU time for solving the original problem was 4.59 seconds, whereas
the CPU time used in solving the problem via aggregation and averaging was
only 2.46 seconds. Thus compared with the solution of the original system,
only a little more than 50% of the computational effort is needed to find the
nearly optimal control policy. The simulation results show that our algorithm
approximates the optimal solutions well.
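For readers who wish to reproduce this kind of experiment, here is a minimal sketch (hypothetical two-dimensional data, one fixed mode) of the backward discretization used for a continuous-time Riccati equation; the chapter's hybrid problem couples one such equation per state of the Markov chain through the generator terms, which the sketch omits:

import numpy as np

# Backward Euler integration of the fixed-mode LQR Riccati ODE
#   -dK/dt = A'K + KA - K B D^{-1} B' K + M,   K(T) = 0.

A = np.array([[0.0, 1.0], [-1.0, -0.5]])
B = np.array([[0.0], [1.0]])
M = np.eye(2)           # state-cost weight (symmetric nonnegative definite)
D = np.array([[1.0]])   # control-cost weight (symmetric positive definite)

T, h = 5.0, 0.001
K = np.zeros((2, 2))    # terminal condition K(T) = 0
Dinv = np.linalg.inv(D)

for _ in range(int(T / h)):   # step backward from t = T to t = 0
    dK = A.T @ K + K @ A - K @ B @ Dinv @ B.T @ K + M
    K = K + h * dK            # K grows as t decreases

print(K)   # approximate K(0); the feedback control is u = -D^{-1} B' K x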

4.2. DISCRETE-TIME LQ
This section discusses the discrete-time LQ problem involving a singularly
perturbed Markov chain. Many systems and physical models in economics‚
biology‚ and engineering are represented in discrete time mainly because var-
ious measurements are only available at discrete instants. As a result‚ the
planning decisions‚ strategic policies‚ and control actions regarding the under-
lying systems are made at discrete times. The continuous-time models can
be regarded as an approximation to the “discrete” reality. For example‚ using

a dynamic programming approach to solve a stochastic control problem (with


Markov chain driving noise) yields the so-called Hamilton-Jacobi-Bellman
equation (Fleming and Rishel‚ 1975). This equation often needs to be solved
numerically via discretization. Simulation and digitization also lead to various
discrete-time systems. Furthermore‚ owing to the rapid progress in computer
control technology‚ the study of sampled-data systems has become increasingly
important. All of this makes the study of discrete-time systems necessary.
Let be a discrete-time Markov chain with transition probability matrix
given by (3.11)‚ where P is a transition probability matrix and Q is a generator.
Use to denote the integer part for any For a finite real number
and the dynamic system is given by

where is the state‚ is the control‚ and


are well defined and have finite values for each and
is a sequence of random variables with zero mean. Define a sequence of
cost functions

where is the expectation given and Our objective


is to find an optimal control to minimize the expected quadratic cost function
The formulation of (4.42) is useful in reducing the complex-
ity as described in the last section. That is‚ the original system model without
the small parameter is considered first. To reduce the computational burden‚ we
introduce a small parameter into the system‚ which results in a singularly
perturbed model of the form (4.42).
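Assuming (3.11) has the form P_eps = P + eps * Q used in Yin and Zhang (2000) (an assumption here, since the display itself did not survive extraction), the construction can be sketched in Python with hypothetical data:

import numpy as np

eps = 0.05
P = np.array([   # stochastic matrix: fast, within-block transitions
    [0.7, 0.3, 0.0, 0.0],
    [0.4, 0.6, 0.0, 0.0],
    [0.0, 0.0, 0.5, 0.5],
    [0.0, 0.0, 0.2, 0.8],
])
Q = np.array([   # generator: slow, between-block transitions
    [-0.5,  0.0,  0.5,  0.0],
    [ 0.0, -0.4,  0.0,  0.4],
    [ 0.3,  0.0, -0.3,  0.0],
    [ 0.0,  0.2,  0.0, -0.2],
])

P_eps = P + eps * Q
# For eps small enough, P_eps is again a transition probability matrix:
assert np.allclose(P_eps.sum(axis=1), 1.0) and (P_eps >= 0).all()
print(P_eps)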
To obtain the desired asymptotic results‚ applying the dynamic programming
principle under suitable assumptions with a slight modification of the argument
in Bertsekas (1995) yields a system of dynamic programming equations. De-
note the value functions by Then the dynamic
programming (DP) equation is

for Let Assume

Define

Then we have the following Riccati equations

and

The optimal feedback control for the LQ problem is linear in the state variable:

For each define and define the piecewise


constant interpolated processes and as

Instead of solving the Riccati equations‚ we can use aggregation to


obtain a limit system‚ in which the functions involved are averaged out with
respect to the stationary distributions of the Markov chain. Although the original
problem is a discrete-time one‚ the limit dynamic system is in continuous time.
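For concreteness, the following minimal sketch (hypothetical data, one fixed mode) shows the backward discrete-time Riccati recursion underlying the equations above; the chapter's version carries one such recursion per state of the chain:

import numpy as np

# Backward recursion for the fixed-mode discrete-time LQ problem
#   x_{k+1} = A x_k + B u_k + noise,   cost: sum of x'M x + u'D u,
# with terminal K_N = 0 and
#   K_k = A'K A + M - A'K B (D + B'K B)^{-1} B'K A,  where K = K_{k+1}.

A = np.array([[1.0, 0.1], [0.0, 0.95]])
B = np.array([[0.0], [0.1]])
M = np.eye(2)
D = np.array([[1.0]])

N = 500
K = np.zeros((2, 2))                       # terminal condition K_N = 0
for _ in range(N):                         # step backward in time
    S = D + B.T @ K @ B
    G = np.linalg.solve(S, B.T @ K @ A)    # feedback gain at this step
    K = A.T @ K @ A + M - A.T @ K @ B @ G

print(K)   # approximate K_0; the optimal control near k = 0 is u = -G x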

As noted before‚ the solution of the problem completely depends on the number
of equations to be solved. Thus the reduction of complexity is achieved in that
only equations need to be solved as compared to solving the equations
in the original problem. If is substantially larger than (i.e.‚
the complexity is significantly diminished. It can be shown that
converges weakly to where is the solution of the hybrid system

where

and

In addition, and converge to and that are solutions


of (4.40) and (4.41), respectively.
As an example, consider a four-state Markov chain
with transition probability matrix

For a two-dimensional dynamic system (4.42) and the cost (4.43), we work with
a time horizon of with T = 5. We use step size to discretize
the limit Riccati equations.
Take The trajectories of and
are given in Fig. 21.5 for The simulation results
show that the continuous-time hybrid LQG problem closely approximates the
corresponding discrete-time linear quadratic regulator problem.
Using this limit system, we can find its optimal control

where for and

Subsequently‚ using the optimal control of the limit system‚ we construct a


control for the original discrete-time control system

where

Equivalently‚ we can write

Such a control can be shown to be asymptotically optimal.


As in the continuous-time case‚ to find the nearly optimal control requires the
solution of Riccati equations‚ whereas to solve the original problem requires
solving Riccati equations. Thus‚ when we will have a substantial
savings. In the example presented‚ we used and Again‚ the
computation was done on a Sun Workstation ULTRA 5. A CPU time of 2.21
seconds was needed to solve the problem via aggregation and averaging as
compared to 5.58 seconds used in solving the original problem‚ which amounts
to a 60% saving.

5. FURTHER REMARKS
This work provides a survey on using singularly perturbed Markov chains
to model uncertainty in various applications‚ with the aim of reducing the com-
plexity of large-scale systems. We have mainly focused on finite-state Markov
chains. For future work‚ we point out:

It will be of theoretical interest and practical concern to study Markov


chains having countable state spaces. In addition‚ including both diffu-
sions and jumps into various models is another important area of investi-
gation. Such switching diffusion models can be used for manufacturing
systems‚ in which the diffusion represents the random demand and the
jump process delineates the machine capacity‚ and for stock-investment
models‚ in which hybrid switching geometric Brownian motion models
can be used to capture the market trends and other economic factors such
as interest rates and business cycles. Although some results have been
reported in Il’in et al. (1999a); Il’in et al. (1999b); Yin (2001); Yin and
Kniazeva (1999)‚ much more in-depth investigation is needed.

For partially observable systems‚ there is growing interest in an-


alyzing filtering and state estimation procedures that involve continuous
dynamics and discrete-event interventions.

Another area that remains largely untouched is the study of the stability


of singularly perturbed Markovian systems. Here the focus is on the
asymptotic properties of the underlying systems as the small parameter tends to
zero and time grows to infinity.

All of these and their applications in signal processing‚ learning‚ economic


systems‚ and financial engineering open up new avenues for further research
and investigation.

6. APPENDIX: MATHEMATICAL PRELIMINARIES


To make the chapter as self-contained as possible‚ we briefly review some
basics of stochastic processes and related topics in this appendix.

6.1. STOCHASTIC PROCESSES


Let be a probability space, where is the sample space, is
the σ-algebra of subsets, and P is the probability measure defined on A
collection of or simply is called a filtration if
for It is understood that is complete in the sense that it
contains all null sets. A probability space together with a filtration
is termed a filtered probability space
A stochastic process defined on is a function of
two variables and where is a sample point and is a parameter
taking values from a suitable index set For each fixed is a
random variable and for each fixed is a function of called
sample function or a realization. It is customary to suppress the dependence of
from the notation. The index set may be discrete, e.g.,
or continuous, e.g., (an interval of real numbers).
A process is adapted to a filtration if for each is an
random variable; is progressively measurable if for each
the process restricted to is measurable with respect to the
in where denotes the Borel sets of
A progressively measurable process is measurable and adapted, whereas the
converse is not generally true. However, any measurable and adapted process
with right-continuous sample paths is progressively measurable.
A real-valued or vector-valued stochastic process is said to
be a martingale on with respect to if, for each is
and = w.p. 1 (with probability
one) for all If we only say that is a martingale without specifying
the filtration, is taken to be the generated by
A random vector is Gaussian if its characteristic function
has the form

where is a constant vector in is the usual inner product, denotes the


pure imaginary number satisfying and A is a symmetric nonnegative
definite matrix. A process is a Gaussian process if its finite-

dimensional distributions are Gaussian. That is‚ for any


and is a Gaussian vector. If for any
and

are independent, then is a process with independent increments.


An Gaussian random process for is called a Brownian
motion, if B(0) = 0 w.p.1, B(·) is a process with independent increments,
B(·) has continuous sample paths w.p.1, and the increments have
Gaussian distribution with and for some nonneg-
ative definite matrix where denotes the covariance of Given
processes and a process defined as

is called a diffusion.
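A short simulation may help fix ideas; the sketch below generates a Brownian path from independent Gaussian increments and a diffusion via the Euler-Maruyama scheme (the drift and diffusion coefficients are illustrative assumptions):

import numpy as np

rng = np.random.default_rng(0)
T, n = 1.0, 1000
h = T / n

dB = rng.normal(0.0, np.sqrt(h), size=n)     # independent Gaussian increments
B = np.concatenate([[0.0], np.cumsum(dB)])   # B(0) = 0, approximate sample path

b = lambda x: -x        # drift coefficient (mean reversion toward zero)
sigma = lambda x: 0.5   # constant diffusion coefficient

X = np.empty(n + 1)     # Euler-Maruyama approximation of the diffusion
X[0] = 1.0
for k in range(n):
    X[k + 1] = X[k] + b(X[k]) * h + sigma(X[k]) * dB[k]

print(B[-1], X[-1])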
The concept of weak convergence is a substantial generalization of con-
vergence in distribution in elementary probability theory. Let P and
denote probability measures defined on a metric space The
sequence is weakly convergent to P if

for every bounded and continuous function on Suppose that and


X are random variables associated with and P, respectively. The sequence
converges to X weakly if for any bounded and continuous function on F,
as A family of probability measures defined
on a metric space F is tight if for each there exists a compact set
such that

For more details on definitions‚ notation‚ and properties of stochastic pro-


cesses‚ see the references Hillier and Lieberman (1995); Karlin and Taylor
(1975); Karlin and Taylor (1981). A more mathematically inclined reader may
refer to Chung (1967); Davis (1993); Ethier and Kurtz (1986).

6.2. MARKOV CHAINS


A Markov process is a process having the property that‚ given the
current state‚ the probability of the future behavior of the process is not altered
by any additional knowledge of its past behavior. That is‚ for any any

and any set A (an interval of the real line)

A Markov process having piecewise constant sample paths and taking values
in either (for some positive integer or
is called a Markov chain. The set is termed the state space of the Markov
chain. The function

is known as the transition probability. If the transition probabilities depend only


on the time difference (not on then we say that the Markov chain has
stationary transition probabilities.
Much of our recent work concerns the asymptotic properties of nonstation-
ary singularly perturbed Markov chains. Interested readers can find more ad-
vanced treatments in the list of references of the current
chapter. Throughout this chapter, we consider finite-state Markov chains
with state space Unless explicitly stated, the state space of is given
by for some positive integer Note that the transition
probabilities form an matrix that is called a transition matrix. Such
a matrix has nonnegative entries with row sums being 1. For a discrete-time
Markov chain, the one-step (stationary) transition matrix is given by
whereas for a continuous-time Markov chain, the
transition probability matrix is given by
To characterize a discrete-time Markov chain, we use its transition probabil-
ities; whereas for a continuous-time Markov chain, the use of a generator is a
better alternative. Loosely speaking, a generator prescribes the rates of change
(“time derivative”) of the transition probabilities. Formally, (for a stationary
Markov chain) it is given by

where I is an identity matrix. For a suitable function


denotes
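(The displays above were lost in extraction. For reference, the standard formulas they presumably contained are
\[ Q = \lim_{t \downarrow 0} \frac{P(t) - I}{t}, \qquad Qf(i) = \sum_{j} q_{ij} f(j) = \sum_{j \neq i} q_{ij}\bigl(f(j) - f(i)\bigr), \]
a reconstruction consistent with the surrounding definitions, not a verbatim quote.)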

For a nonstationary Markov chain we have to modify the definition of


the generator. A matrix is a generator of if (1)
is Borel measurable for all and is uniformly bounded‚

that is, there exists a constant K such that for all and
for and and (2) for all
bounded real-valued functions defined on

is a martingale.
We say that a generator or the corresponding Markov chain is weakly
irreducible if the system of equations

has a unique nonnegative solution‚ where for a positive integer denotes


an column vector with all entries being 1. The unique solution
is termed a quasi-stationary distribution. The definition above is a slight
generalization of the traditional notion of irreducibility and stationary distribu-
tion. We allow the and the to be time dependent; we require only
nonnegative solutions of (6.54) instead of positive solutions.
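(For reference, (6.54) is presumably the standard weak-irreducibility system
\[ \nu(t)\,Q(t) = 0, \qquad \nu(t)\,\mathbb{1} = 1, \]
so that the quasi-stationary distribution annihilates the generator and its entries sum to one; again a reconstruction from the surrounding text, not a verbatim quote.)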

6.3. CONNECTIONS OF SINGULARLY PERTURBED
MODELS: CONTINUOUS TIME VS. DISCRETE TIME
This section illustrates the close connection between continuous-time sin-
gularly perturbed Markov chains and their discrete-time counterparts; some of
the related discussions are in Kumar and Varaiya (1984) and Yin and Zhang
(1998) (see also Fox and Glynn (1990) for work on simulation). Starting with a
continuous-time singularly perturbed Markov chain (as defined in the last
section)‚ we can construct a discrete-time singularly perturbed Markov chain
Conversely‚ starting from the discrete-time singularly perturbed Markov
chain we can construct a continuous-time singularly perturbed Markov
chain which has the same distribution as the original process
Consider a continuous-time singularly perturbed Markov chain with
state space and generator where and are generators of
some Markov chains with stationary transition probabilities. The probability
vector

satisfies the forward equation



Introduce a new variable Then the probability vector


satisfies the rescaled forward equation

Denote Let

and fix Define

It is clear that all entries of are nonnegative and where for a


positive integer denotes an column vector with all entries
being 1. Therefore‚ is a transition probability matrix. Eq. (6.55) can be
written as

The solution of the above forward equation is

Consider a discrete-time Markov chain having transition matrix Then


the corresponding probability vector

with satisfies

This together with (6.57) yields

Suppose is a Poisson process independent of having rate q. Then


for each nonnegative integer

Let be a sequence of “waiting times,” which is a sequence of indepen-


dent and identically distributed exponential random variables. Define
the time at which the event occurs. Now define a piecewise
constant process

Then

Thus the reconstructed process and the original process have the
same distribution.
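The construction just described is the classical uniformization, and a minimal Python sketch (with a hypothetical toy generator) may clarify it; cf. Fox and Glynn (1990):

import numpy as np

rng = np.random.default_rng(1)

Q = np.array([[-2.0, 2.0], [1.0, -1.0]])   # toy generator (an assumption)
q = np.max(-np.diag(Q))                    # uniformization rate
P = np.eye(2) + Q / q                      # embedded transition matrix
assert np.allclose(P.sum(axis=1), 1.0) and (P >= 0).all()

def sample_path(T, x0=0):
    """Simulate the chain on [0, T]: exponential waiting times with rate q,
    and one step of the embedded discrete-time chain at each event."""
    t, x, path = 0.0, x0, [(0.0, x0)]
    while True:
        t += rng.exponential(1.0 / q)
        if t > T:
            return path
        x = int(rng.choice(2, p=P[x]))
        path.append((t, x))

print(sample_path(1.0))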

Acknowledgement. We would like to thank Felisa Vázquez-Abad and Pierre


L’Ecuyer for critical reading of an early version of the manuscript and for
providing us with detailed comments and suggestions‚ which led to a much
improved presentation.
The research of G. Yin was supported in part by the National Science Foun-
dation under grant DMS-9877090. The research of Q. Zhang was supported
in part by the USAF Grant F30602-99-2-0548 and ONR Grant N00014-96-
1-0263. The research of K. Yin and H. Yang was supported in part by the
Minnesota Sea Grant College Program by the NOAA Office of Sea Grant‚
U.S. Department of Commerce‚ under grant NA46-RG0101.

REFERENCES
Abbad‚ M.‚ J.A. Filar‚ and T.R. Bielecki. (1992). Algorithms for singularly
perturbed limiting average Markov control problems‚ IEEE Trans. Automat.
Control AC-37‚ 1421-1425.
Bertsekas‚ D. (1995). Dynamic Programming and Optimal Control‚ Vol. I & II‚
Athena Scientific‚ Belmont‚ MA.
Billingsley‚ P. (1999). Convergence of Probability Measures‚ 2nd Ed.‚ J. Wiley‚
New York.
Blankenship‚ G. (1981). Singularly perturbed difference equations in optimal
control problems‚ IEEE Trans. Automat. Control T-AC 26‚ 911-917.
Courtois‚ P.J. (1977). Decomposability: Queueing and Computer System Ap-
plications‚ Academic Press‚ New York.
Chung‚ K.L. (1967). Markov Chains with Stationary Transition Probabilities‚
Second Edition‚ Springer-Verlag‚ New York.
Davis‚ M.H.A. (1993). Markov Models and Optimization‚ Chapman & Hall‚
London.
Delebecque‚ F. and J. Quadrat. (1981). Optimal control for Markov chains
admitting strong and weak interactions‚ Automatica 17‚ 281-296.

Dembo‚ A. and O. Zeitouni. (1998). Large Deviations Techniques and Appli-


cations‚ Springer-Verlag‚ New York.
Di Masi‚ G.B. and Yu. M. Kabanov. (1995). A first order approximation for the
convergence of distributions of the Cox processes with fast Markov switch-
ings‚ Stochastics Stochastics Rep. 54‚ 211-219.
Ethier‚ S.N. and T.G. Kurtz. (1986). Markov Processes: Characterization and
Convergence‚ J. Wiley‚ New York.
Fleming‚ W.H. and R.W. Rishel. (1975). Deterministic and Stochastic Optimal
Control‚ Springer-Verlag‚ New York.
Fox‚ B.L. and P.W. Glynn. (1990). Discrete-time conversion for simulating finite-
horizon Markov processes‚ SIAM J. Appl. Math.‚ 50‚ 1457-1473.
Gershwin‚ S.B. (1994). Manufacturing Systems Engineering‚ Prentice Hall‚
Englewood Cliffs‚ NJ.
Hillier‚ F.S. and G.J. Lieberman. (1995). Introduction to Operations Research‚
McGraw-Hill‚ New York‚ 6th Ed.
Hamilton‚ J.D. and R. Susmel. (1994). Autoregressive conditional heteroskedas-
ticity and changes in regime‚ J. Econometrics‚ 64‚ 307-333.
Hansen‚ B.E. (1992). The likelihood ratio test under nonstandard conditions:
Testing Markov trend model of GNP‚ J. Appl. Economics‚ 7‚ S61-S82.
Hoppensteadt‚ F.C. and W.L. Miranker. (1977). Multitime methods for systems
of difference equations‚ Studies Appl. Math. 56‚ 273-289.
Il’in‚ A.‚ R.Z. Khasminskii‚ and G. Yin. (1999a). Singularly perturbed switching
diffusions: Rapid switchings and fast diffusions‚ J. Optim. Theory Appl. 102‚
555-591.
Il’in‚ A.‚ R.Z. Khasminskii‚ and G. Yin. (1999b). Asymptotic expansions of so-
lutions of integro-differential equations for transition densities of singularly
perturbed switching diffusions: Rapid switchings‚ J. Math. Anal. Appl. 238‚
516-539.
Iosifescu‚ M. (1980). Finite Markov Processes and Their Applications‚ Wiley‚
Chichester.
Karlin‚ S. and H.M. Taylor. (1975). A First Course in Stochastic Processes‚ 2nd
Ed.‚ Academic Press‚ New York.
Karlin‚ S. and H.M. Taylor. (1981). A Second Course in Stochastic Processes‚
Academic Press‚ New York.
Kendrick‚ D. (1972). On the Leontief dynamic inverse‚ Quart. J. Economics‚
86‚ 693-696.
Khasminskii‚ R.Z. and G. Yin. (1996). On transition densities of singularly
perturbed diffusions with fast and slow components‚ SIAM J. Appl. Math.‚
56‚ 1794-1819.
Khasminskii‚ R.Z.‚ G. Yin‚ and Q. Zhang. (1996). Asymptotic expansions of
singularly perturbed systems involving rapidly fluctuating Markov chains‚
SIAM J. Appl. Math. 56‚ 277-293.

Khasminskii, R.Z., G. Yin, and Q. Zhang. (1997). Constructing asymptotic


series for probability distribution of Markov chains with weak and strong
interactions, Quart. Appl. Math. LV, 177-200.
Kumar, P.R. and P. Varaiya. (1984). Stochastic Systems: Estimation, Identifica-
tion and Adaptive Control, Prentice-Hall, Englewood Cliffs, N. J.
Kushner, H.J. (1972). Introduction to Stochastic Control Theory, Holt, Rinehart
and Winston, New York.
Kushner, H.J. (1990). Weak Convergence Methods and Singularly Perturbed
Stochastic Control and Filtering Problems, Birkhäuser, Boston.
Kushner, H.J. and G. Yin. (1997). Stochastic Approximation Algorithms and
Applications, Springer-Verlag, New York.
Lai, T.-L., and S. Yakowitz. (1995). Machine learning and nonparametric bandit
theory, IEEE Trans. Automat. Control 40, 1199-1210.
Liu, R.H., Q. Zhang, and G. Yin. (2001). Nearly optimal control of singularly
perturbed Markov decision processes in discrete time, to appear in Appl.
Math. Optim.
Pan, Z.G. and T. Başar. (1995). H∞-control of Markovian jump linear systems
and solutions to associated piecewise-deterministic differential games, in
New Trends in Dynamic Games and Applications, G. J. Olsder (Ed.), 61-94,
Birkhäuser, Boston.
Pervozvanskii, A.A. and V.G. Gaitsgori. (1988). Theory of Suboptimal Deci-
sions: Decomposition and Aggregation, Kluwer, Dordrecht.
Phillips, R.G. and P.V. Kokotovic. (1981). A singular perturbation approach to
modelling and control of Markov chains, IEEE Trans. Automat. Control 26,
1087-1094.
Ross, S. (1983). Introduction to Stochastic Dynamic Programming, Academic
Press, New York.
Seneta, E. (1981). Non-negative Matrices and Markov Chains, Springer-Verlag,
New York.
Sethi, S.P. and Q. Zhang. (1994). Hierarchical Decision Making in Stochastic
Manufacturing Systems, Birkhäuser, Boston.
Simon, H.A. and A. Ando. (1961). Aggregation of variables in dynamic systems,
Econometrica 29, 111-138.
Thompson, W.A. Jr. (1988). Point Process Models with Applications to Safety
and Reliability, Chapman and Hall, New York.
Tse, D.N.C., R.G. Gallager, and J.N. Tsitsiklis. (1995). Statistical multiplexing
of multiple time-scale Markov streams, IEEE J. Selected Areas Comm. 13,
1028-1038.
White, D.J. (1992). Markov Decision Processes, Wiley, New York.
Yakowitz, S. (1969). Mathematics of Adaptive Control Processes, Elsevier, New
York.

Yakowitz‚ S.‚ R. Hayes‚ and J. Gani. (1992). Automatic learning for dynamic
Markov fields with application to epidemiology‚ Oper. Res. 40‚ 867-876.
Yakowitz‚ S.‚ P. L’Ecuyer‚ and F. Vázquez-Abad. (2000). Global stochastic
optimization with low-dispersion point sets‚ Oper. Res.‚ 48‚ 939-950.
Yang‚ H.‚ G. Yin‚ K. Yin‚ and Q. Zhang. (2001). Control of singularly perturbed
Markov chains: A numerical study‚ to appear in J. Australian Math. Soc. Ser.
B: Appl. Math.
Yin‚ G. (2001). On limit results for a class of singularly perturbed switching
diffusions‚ to appear in J. Theoretical Probab.
Yin‚ G. and M. Kniazeva. (1999). Singularly perturbed multidimensional switch-
ing diffusions with fast and slow switchings‚ J. Math. Anal. Appl.‚ 229‚ 605-
630.
Yin‚ G. and J.F. Zhang. (2001). Hybrid singular systems of differential equa-
tions‚ to appear in Scientia Sinica.
Yin‚ G. and Q. Zhang. (1997a). Control of dynamic systems under the influence
of singularly perturbed Markov chains‚ J. Math. Anal. Appl. 216‚ 343-367.
Yin‚ G. and Q. Zhang (Eds.) (1997b). Mathematics of Stochastic Manufacturing
Systems‚ Proc. 1996 AMS-SIAM Summer Seminar in Applied Mathematics‚
Lectures in Applied Mathematics‚ LAM 33‚ Amer. Math. Soc.‚ Providence‚
RI.
Yin‚ G. and Q. Zhang. (1998). Continuous-time Markov Chains and Applica-
tions: A Singular Perturbation Approach‚ Springer-Verlag‚ New York.
Yin‚ G. and Q. Zhang. (2000). Singularly perturbed discrete-time Markov
chains‚ SIAM J. Appl. Math.‚ 61‚ 834-854.
Yin‚ G.‚ Q. Zhang‚ and G. Badowski. (2000a). Asymptotic properties of a sin-
gularly perturbed Markov chain with inclusion of transient states‚ Ann. Appl.
Probab.‚ 10‚ 549-572.
Yin‚ G.‚ Q. Zhang‚ and G. Badowski. (2000b). Singularly perturbed Markov
chains: Convergence and aggregation‚ J. Multivariate Anal. 72‚ 208-229.
Yin‚ G.‚ Q. Zhang‚ and G. Badowski. (2000c). Occupation measures of singu-
larly perturbed Markov chains with absorbing states‚ Acta Math. Sinica‚ 16‚
161-180.
Yin‚ G.‚ Q. Zhang‚ and G. Badowski. (2000d). Decomposition and aggregation
of large-dimensional Markov chains in discrete time‚ preprint.
Yin‚ G.‚ Q. Zhang‚ and Q.G. Liu. (2000e). Error bounds for occupation measure
of singularly perturbed Markov chains including transient states‚ Probab.
Eng. Informational Sci. 14‚ 511-531.
Yin‚ G.‚ Q. Zhang‚ H. Yang‚ and K. Yin. (2001). Discrete-time dynamic systems
arising from singularly perturbed Markov chains‚ to appear in Nonlinear
Anal.‚ Theory‚ Methods Appl.

Zhang‚ Q. (1995). Risk sensitive production planning of stochastic manufactur-


ing systems: A singular perturbation approach‚ SIAM J. Control Optim. 33‚
498-527.
Zhang‚ Q. (1996). Finite state Markovian decision processes with weak and
strong interactions‚ Stochastics Stochastics Rep. 59‚ 283-304.
Zhang‚ Q. and G. Yin. (1996). A central limit theorem for singularly perturbed
nonstationary finite state Markov chains‚ Ann. Appl. Probab. 6‚ 650-670.
Zhang‚ Q. and G. Yin. (1997). Structural properties of Markov chains with weak
and strong interactions‚ Stochastic Process Appl. 70‚ 181-197.
Zhang‚ Q. and G. Yin. (1999). On nearly optimal controls of hybrid LQG prob-
lems‚ IEEE Trans. Automat. Control‚ 44‚ 2271-2282.
Zhang‚ Q.‚ G. Yin‚ and E.K. Boukas. (1997). Controlled Markov chains with
weak and strong interactions: Asymptotic optimality and application in man-
ufacturing‚ J. Optim. Theory Appl. 94‚ 169-194.
Chapter 22

RISK–SENSITIVE OPTIMAL CONTROL IN


COMMUNICATING AVERAGE MARKOV
DECISION CHAINS

Rolando Cavazos–Cadena
Departamento de Estadística y Cálculo
Universidad Autónoma Agraria Antonio Narro
Buenavista, Saltillo COAH 25315
MÉXICO* †

Emmanuel Fernández–Gaucherand
Department of Electrical & Computer Engineering
& Computer Science
University of Cincinnati
Cincinnati‚ OH 45221-0030

USA
emmanuel@ececs.uc.edu

Abstract This work concerns discrete–time Markov decision processes with denumerable
state space and bounded costs per stage. The performance of a control policy
is measured by a (long–run) risk–sensitive average cost criterion associated with a
utility function with constant risk sensitivity coefficient and the main objective
of the paper is to study the existence of bounded solutions to the risk–sensitive
average cost optimality equation for arbitrary values of The main results are
as follows: When the state space is finite‚ if the transition law is communicating‚
in the sense that under an arbitrary stationary policy transitions are possible

*This work was partially supported by a U.S.-México Collaborative Program‚ under grants from the National
Science Foundation (NSF-INT 9602939)‚ and the Consejo Nacional de Ciencia y Tecnología (CONACyT)
(No. E 120.3336)‚ and United Engineering Foundation under grant 00/ER-99.
†We dedicate this paper to our wonderful colleague‚ Sid Yakowitz‚ a scholar and friend whom we greatly
miss.
‡The support of the PSF Organization under Grant No. 200-350–97–04 is deeply acknowledged by the first

author.

between every pair of states‚ the optimality equation has a bounded solution for
arbitrary non-null However‚ when the state space is infinite and denumerable‚
the communication requirement and a strong form of the simultaneous Doeblin
condition do not yield a bounded solution to the optimality equation if the risk
sensitivity coefficient has a sufficiently large absolute value‚ in general.

Keywords: Markov decision processes‚ Exponential utility function‚ Constant risk sensi-
tivity‚ Constant average cost‚ Communication condition‚ Simultaneous Doeblin
condition‚ Bounded solutions to the risk–sensitive optimality equation.

1. INTRODUCTION
This work considers discrete–time Markov decision processes (MDP’s) with
(finite or infinite) denumerable state space and bounded costs. Besides a stan-
dard continuity–compactness requirement‚ the main structural feature of the
decision model is that‚ under the action of each stationary policy‚ every pair
of states communicate (see Assumption 2.3 below). On the other hand‚ it is
assumed that the decision maker grades two different random costs according
to the expected value of an exponential utility function with (non-null) constant
risk sensitivity coefficient and the performance index of a control policy
is the risk–sensitive (long–run) average cost criterion. Within this context‚
the main purpose of the paper is to study the existence of bounded solutions
to the risk–sensitive average cost optimality equation corresponding to a non-
null value of (i.e.‚ the which‚ under the continuity–compactness
conditions in Assumption 2.1‚ yields an optimal stationary policy with con-
stant risk–sensitive average cost. Thus‚ we are concerned in this paper with
fundamental theoretical issues. The reader is referred to a growing body of
literature in the application of risk-sensitive models in operations research and
engineering‚ e.g.‚ Fernández-Gaucherand and Marcus (1997)‚ Avila-Godoy et
al. (1997); Avila-Godoy and Fernández-Gaucherand (1998); Avila-Godoy and
Fernández-Gaucherand (2000); Shayman and Fernández-Gaucherand (1999).
The study of stochastic dynamical systems with risk–sensitive criteria can
be traced back‚ at least‚ to the work of Howard and Matheson (1972)‚ Jacob-
son (1973)‚ and Jaquette (1973; 1976). Particularly‚ in Howard and Matheson
(1972) the case of MDP’s with finite state and action spaces was considered
and‚ under Assumption 2.3 below and assuming aperiodicity of the transi-
tion matrix induced by each stationary policy‚ a solution to the was
obtained via the Perron–Frobenious theory of positive matrices for arbitrary
Recently‚ there has been an increasing interest on MDP’s endowed with
risk–sensitive criteria (Cavazos–Cadena and Fernández–Gaucherand‚ 1998a–
d; Fernández–Gaucherand and Marcus‚ 1997; Fleming and McEneany‚ 1995;
Fleming and Hernández–Hernández‚ 1997b; Hernández–Hernández and Mar-
cus‚ 1996; James et al.‚ 1994; Marcus et al.‚ 1996; Runolfsson‚ 1994; Whit-
tle‚ 1990). In particular‚ it was shown in Cavazos–Cadena and Fernández–


Gaucherand (1998a-d) that even when the state space is finite‚ a strong recur-
rence restriction‚ like the simultaneous Doeblin condition in Assumption 2.2
below‚ guarantees the existence of solutions to the if is suf-
ficiently small‚ establishing an interesting contrast with the results in Howard
and Matheson (1972)‚ which provides the motivation for considering the role
of the communication property in the existence of solutions to the
The main result of the paper‚ stated below as Theorem 3.1‚ can be seen as
an extension of the results in Howard and Matheson (1972)‚ in that avoiding
other conditions (e.g.‚ aperiodicity)‚ it is shown that Assumptions 2.1 and 2.3
together imply the existence of solutions to the when the state space
is finite‚ and also‚ as a complement to Theorem 3.1 in Cavazos–Cadena
and Fernández-Gaucherand (1998a)‚ in that it is shown that incorporating the
communication Assumption 2.3 and requiring the finiteness of state space‚ the
existence of solutions to the is guaranteed for every non-null sensi-
tivity coefficient. On the other hand‚ it is shown‚ via a detailed example‚ that
Theorem 3.1 does not extend to the denumerable state space case; see Example
3.1 and Propositions 3.1 and 3.2 in Section 3.
In contrast with recent work based on the theory of stochastic games‚
which presently is a commonly used technique to study risk–sensitive crite-
ria (Hernández–Hernández and Marcus‚ 1996; Hordjik‚ 1974; James et al.‚
1994; Runolfsson‚ 1994)‚ the approach followed below is along more self–
contained methods within the MDP literature. Indeed‚ as in Cavazos–Cadena
and Fernández–Gaucherand (1998a)‚ the arguments are based on the study of
a parameterized expected–total cost problem with stopping time‚ and the result
is obtained by an appropriate selection of the parameter; this approach allows
us to include both the risk–averse and risk–seeking cases. For other recent ap-
proaches in the area of stochastic decision processes‚ the reader is referred to
the work of Yakowitz et al.‚ e.g.‚ Lai and Yakowitz (1995); Yakowitz (1995). For
earlier work in the foundations of stochastic decision processes‚ see Yakowitz
(1969).
The organization of the paper is as follows: In Section 2 the decision model
is introduced and the main result is stated in Section 3 in the form of Theorem
3.1‚ which assumes a finite state space; also‚ an example is used to show that
the conclusions in Theorem 3.1 can not be extended to MDP’s with infinite and
denumerable state space‚ in general. Section 4 contains the basic consequences
of the Assumptions 2.1 and 2.3 to be used in the sequel‚ whereas in Section 5
the auxiliary expected–total cost stopping problems are introduced‚ and funda-
mental properties of these problems are stated in Section 6. Then‚ in Section
7 a proof of Theorem 3.1 is given and the paper concludes in Section 8 with
some brief comments.

Notation. Throughout the remainder and stand for the set of real numbers
and nonnegative integers, respectively, and for
Given a nonempty set S endowed with denotes the space of
all measurable real-valued and bounded functions defined on S, and
is the supremum norm of On the other hand, given
denotes the indicator function of that is if
and Finally, for an event W the corresponding indicator
function is denoted by I[W] and, as usual, all relations involving conditional
expectation are supposed to hold true almost everywhere with respect to the
underlying probability measure without explicit reference.

2. THE DECISION MODEL


Following standard notation (Arapostathis et al., 1993; Hernández–Lerma
and Lasserre, 1996; Puterman, 1994), let an MDP be specified by the four-
tuple where the state space S is denumerable, the separable
metric space A is the control (decision, action) set, is the cost
function, and is the controlled transition law. The interpretation of
the model M is as follows: At each time the state of a dynamical system
is observed, say and an action is chosen. Then a
cost is incurred and, regardless of the previous states and actions, the
state of the system at time will be with probability
this is the Markov property of the decision model. Notice that it is assumed
that every is an admissible action at each state but, as noted in Borkar
(1984), this condition does not imply any loss of generality.
The following standard assumption is supposed to hold throughout the sequel.

Assumption 2.1. (i) The control set A is compact.


(ii) For each and are continuous mappings
in

Policies. For each the space of histories up to time is recursively


defined by and a generic element of is
denoted by where and A
control policy is a sequence where each is a stochastic kernel
on A given that is‚ for each is a probability measure
on A and for each Borel subset is a measurable
mapping in the number is the probability of choosing
action when the system is driven by Throughout the remainder
denotes the class of all policies. Given the policy being used and the
initial state the distribution of the state–action process
is uniquely determined via Ionescu Tulcea’s theorem (Hernández–Lerma and
Lasserre‚ 1996; Hinderer‚ 1970; Puterman‚ 1994); such a distribution is denoted

by whereas stands for the corresponding expectation operator. Set


so that consists of all functions A policy is
stationary if there exists such that‚ under at each time the action
applied is determined by The class of stationary policies is
naturally identified with and with this convention Under the
action of each stationary policy‚ the state process is a Markov chain with
stationary transition mechanism (Arapostathis et al.‚ 1993; Hernández–Lerma
and Lasserre‚ 1996; Puterman‚ 1994; Ross‚ 1994).
Utility Function. For each define the utility function with
(constant) risk sensitivity as follows: For

For a given and a (bounded) random cost the corresponding certain equivalent


with respect to is implicitly defined by

so that a controller with risk sensitivity who grades a random cost according
to the expectation of is indifferent between incurring the random cost
or paying the corresponding certain equivalent for sure. Notice that (2.1) and
(2.2) together yield that

Using Jensen’s inequality it follows that, when is nonconstant,


(resp. if (resp. A controller assessing
a random cost according to the expectation of is referred to as risk–
averse when and risk–seeking if ; if the decision maker is
risk–neutral.
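Since the displays (2.1)-(2.3) did not survive extraction, it may help to record the convention standard in this literature (a reconstruction, not a verbatim quote):
\[ U_{\lambda}(x) = \operatorname{sign}(\lambda)\, e^{\lambda x}, \qquad \mathcal{E}[\lambda, X] = \frac{1}{\lambda} \log E\bigl[e^{\lambda X}\bigr], \]
so that Jensen's inequality gives \mathcal{E}[\lambda, X] \ge E[X] for \lambda > 0 and \mathcal{E}[\lambda, X] \le E[X] for \lambda < 0.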

Performance Index. Let and be arbitrary. When the


system is driven by and is the initial state, denotes the certain
equivalent of the total cost incurred up to time with respect to i.e.,

(see (2.3)) whereas the long–run average cost under starting at


is defined by

The optimal average cost at state is given by

and a policy is optimal if


for every
Remark 2.1. From (2.3)–(2.6) it is not difficult to see that
so that the optimal average cost satisfies
see‚ for instance‚ Cavazos–Cadena
and Fernández–Gaucherand (1998a).
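In the notation standard for this criterion, the lost displays (2.4)-(2.6) presumably read
\[ J_{n}(x,\pi) = \frac{1}{\lambda} \log E_{x}^{\pi}\Bigl[ \exp\Bigl( \lambda \sum_{t=0}^{n-1} C(x_{t}, a_{t}) \Bigr) \Bigr], \qquad J(x,\pi) = \limsup_{n \to \infty} \frac{1}{n} J_{n}(x,\pi), \]
with J^{*}(x) = \inf_{\pi} J(x,\pi); then |J_{n}(x,\pi)| \le n\lVert C \rVert, which yields the bound |J^{*}(x)| \le \lVert C \rVert indicated in Remark 2.1. This is a reconstruction consistent with the surrounding text, not a verbatim quote.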

Communication and Stability Assumptions. In the risk–neutral case


it is well–known that a strong recurrence condition is necessary for the exis-
tence of a bounded solution to the corresponding average cost optimality equa-
tion (Arapostathis et al.‚ 1993; Bertsekas‚ 1987; Fernández-Gaucherand et al.‚
1990; Hernández–Lerma and Lasserre‚ 1996; Kumar and Varaiya‚ 1986; Put-
erman‚ 1994)‚ which in turn yields an optimal stationary policy in a standard
way. For the risk–sensitive average criterion in (2.4)–(2.6)‚ it was shown in
Cavazos–Cadena and Fernández–Gaucherand (1998a) that the following sta-
bility condition is necessary for the existence of a bounded solution to the
see (3.1) below.

Assumption 2.2. (Simultaneous Doeblin Condition (Arapostathis et al.‚ 1993;


Hernández–Lerma and Lasserre‚ 1996; Hordjik‚ 1974; Puterman‚ 1994; Ross‚
1994; Thomas‚ 1980).) There exist a state and a positive constant K
such that

where for each the first passage time to state is defined by

with the (usual) convention that the minimum of the empty set is

It was shown in Cavazos–Cadena and Fernández–Gaucherand (1998a–d)
that Assumptions 2.1 and 2.2 yield a bounded solution to the in
(3.1) below whenever the risk sensitivity coefficient is small enough. To study
the existence of bounded solutions to the for arbitrary the
following additional condition‚ first used in Howard and Matheson (1972)‚
will be employed.

Assumption 2.3. (Communication). Under every stationary policy each pair


of states communicate, i.e., given and there exists
such that
Remark 2.2. It will be shown in the sequel that, when the state and action
spaces are finite, Assumption 2.3 implies that Assumption 2.2 holds true.
Relation to the work of Howard-Matheson. Under Assumption 2.4 below, it
was proved in Howard and Matheson (1972), via the Perron–Frobenius theory
of positive matrices, that for arbitrary the has a (bounded)
solution (rewards instead of costs were considered in Howard and Matheson
(1972)).

Assumption 2.4.
(a) The state and action spaces are finite (notice that in this situation Assumption
2.1 is automatically satisfied);
(b) Assumption 2.3 holds and the transition matrix induced by an arbitrary
stationary policy is aperiodic.

On the other hand, it has been recently shown in Cavazos–Cadena and Fernández–
Gaucherand (1998a) that, even when the state space is finite, under Assumptions
2.1 and 2.2, the has a bounded solution only if the risk sensitivity coef-
ficient is small enough, and an example was given showing that this conclusion
cannot be extended to arbitrary values of The difference between the con-
clusions in Cavazos–Cadena and Fernández–Gaucherand (1998a) and Howard
and Matheson (1972) comes from the different settings in both papers. In
particular, Assumption 2.3 is imposed in Howard and Matheson (1972), but
not in Cavazos–Cadena and Fernández–Gaucherand (1998a), and an additional
aperiodicity condition is used in the former reference.

3. MAIN RESULTS
The main problem considered in the paper consists in studying whether Assumptions
2.1-2.3 yield (bounded) solutions to the for arbitrary values of the
risk sensitivity coefficient It turns out that the answer depends on the
state space: If S is finite, Assumption 2.3, combined with Assumption 2.1,
implies the existence of a solution to the for arbitrary this
result is presented below as Theorem 3.1 and gives an extension of that in
Howard and Matheson (1972). On the other hand, as it will be shown via a
detailed example, such a conclusion cannot be extended to the case in which S is
countably infinite, thus providing an extension to the results in Cavazos–Cadena
and Fernández–Gaucherand (1998a).
Finite State Space Models.

The following theorem shows that Assumption 2.1 and Assumption 2.3 are
sufficient to guarantee a (bounded) solution to the for arbitrary values
of the non-null risk sensitivity coefficient Hence, our results extend those in
Howard and Matheson (1972) in that the aperiodicity in Assumption 2.4(b) is
not required, and in that our results hold for both the risk-seeking and risk-averse
cases.
Theorem 3.1. Let the state space S be finite and suppose that Assumptions 2.1

and 2.3 hold true. In this case, for every there exist a constant
and a function such that the following are true.
(i) The pair satisfies the

(ii) for each


(iii) For every the term in brackets in the right–hand side of (3.1)
is a continuous function on A; thus, it has a minimizer and the
corresponding policy Moreover,

(iv) The pair in (3.1) is unique whenever satisfies


where is arbitrary but fixed.
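For orientation, the risk-sensitive average cost optimality equation (3.1), whose display was lost in extraction, has the standard certain-equivalent form
\[ g + h(x) = \min_{a \in A} \Bigl[ C(x,a) + \frac{1}{\lambda} \log \sum_{y \in S} p_{xy}(a)\, e^{\lambda h(y)} \Bigr], \qquad x \in S, \]
a reconstruction consistent with part (iii) of the theorem (the term in brackets is minimized over A), not a verbatim quote.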

Remark 3.1. Theorem 3.1 provides an extension to the results in Howard


and Matheson (1972), in that both the risk–averse and risk–seeking cases are
considered, the action space is not restricted to be finite, and the requirement
of aperiodicity of the transition matrices associated to stationary policies is
avoided; notice that this latter condition is essential in the development of
the Perron–Frobenius theory of positive matrices, which was the key tool
employed in Howard and Matheson (1972). Also, observing that Assumption
2.3 yields the Doeblin condition in Assumption 2.2 when the state space is
finite (see Theorem 4.1 in the next section), Theorem 3.1 above can be seen as
an extension to Theorem 3.1 in Cavazos–Cadena and Fernández–Gaucherand
(1998a), in that by restricting the framework to a finite state space, Assumptions
2.1 and 2.3 yield a solution to the for every In Cavazos–
Cadena and Fernández–Gaucherand (1998a), it was shown that the
admits a bounded solution only for small enough, in contrast to the claims
in Hernández–Hernández and Marcus (1996).

The (somewhat technical) proof of Theorem 3.1 will be presented in Section


7, after the necessary preliminaries stated in the following three sections. As

in Cavazos–Cadena and Fernández–Gaucherand (1998a)‚ the main idea is to


consider a family of auxiliary parameterized stopping problems for the MDP
endowed with a risk–sensitive expected–total cost criterion‚ and then The-
orem 3.1 will be obtained by an appropriate selection of the parameter.

Denumerable State Space Models. In view of Theorem 3.1 above it is natural


to ask the following: Do‚ for every Assumptions 2.1–2.3 together yield
a bounded solution to the when the state space is countably infinite?
Such a solution is indeed obtained under the above assumptions for the risk-
neutral case (Arapostathis et al.‚ 1993). The following example shows that the
answer to the question above is negative.
Example 3.1. For each positive integer define

where is selected in such a way that Now, for each define


an MDP as follows: The state space is the
action space A = {a} is a singleton, the transition law is defined by

whereas the cost function is given by

In the following proposition it will be proved that Assumptions 2.1–2.3 hold


in this example, and then, that the does not have a bounded solution
for large enough.

Proposition 3.1. For the MDP in Example 3.1 above, Assumptions 2.1–2.3
hold true and, moreover, the transition matrix in (3.2) is aperiodic.
Proof. Assumption 2.1 clearly holds in this example, since A is a singleton.
To verify Assumption 2.2, let and notice that, since from state
transitions are possible to or to state it is not difficult to see that

and from (3.2) it follows that



so that for every

verifying Assumption 2.2. Now observe that by (3.2) and (3.4),


and so that
and then the communi-
cation condition in Assumption 2.3 holds. Notice, finally, that
so that the transition law in (3.2) is aperiodic.

Observe now that, for Example 3.1, the reduces to the following
Poisson equation:

In the following proposition it will be shown that this equation does not admit
a bounded solution if is large enough. First, let be determined by

Proposition 3.2. For the MDP in Example 3.1, with the following
assertions hold:
(i)
(ii)
(iii) For there is no pair satisfying (3.5).
Proof. (i) From (3.2) and (3.4), it is not difficult to see that
for every positive integer so that

and it is clear that is equivalent to



(ii) Using (3.7) and part (i), it follows that

(iii) Let and suppose that satisfies (3.5).


From Theorem 3.1 (i) in Cavazos–Cadena and Fernández–Gaucherand (1998a)
(see also the verification theorem in Hernández–Hernández and Marcus (1996)
for the case it follows that is the optimal average cost
at every state, i.e., Then, from Remark 2.1 it follows that
where Next, observe that the Poisson equation (3.5) can
be written as

and then (see Lemma 4.1 below)

Next observe that for


whereas for
so that (3.8) yields that for every

To continue, observe that the following properties hold true:


(a) is finite and then surely, since
see also (2.8).
(b) As

(c) Using and (3.9), Fatou’s lemma yields that

which is equivalent to

(d) The following inequality holds:

Using (a)–(d), (3.9) implies, via the dominated convergence theorem, that

i.e.,

In this case, part (ii) implies that so that In short, it


has been proved that if a pair exists such that (3.5) is
satisfied, then

4. BASIC TECHNICAL PRELIMINARIES


This and the following two sections contain the ancillary technical results
that will be used to establish Theorem 3.1. The main objective is to collect
some basic consequences of Assumptions 2.1 and 2.3 which will be useful
in the sequel. These results are presented below as Theorems 4.1 and 4.2
and Lemma 4.1. Throughout the remainder of the paper, the state space S is
assumed to be finite.

The first result establishes that Assumptions 2.1 and 2.3 together imply a
strong form of the Simultaneous Doeblin condition.

Theorem 4.1. Under Assumptions 2.1 and 2.3, there exists


such that

The proof of this theorem, based on ideas related to the risk–neutral average
cost criterion, is contained in Appendix A. Suppose now that the initial state
is The following result provides a lower bound for the probability of
reaching a state before returning to

Theorem 4.2. Let with arbitrary but fixed and suppose that
Assumptions 2.1 and 2.3 hold true. In this case,
(i) There exists a constant such that, for every

(ii) Given there exists a positive integer such that

The arguments leading to the proof of this result are also based on ideas using
the risk–neutral average cost criterion, and are presented in Appendix B.
The following lemma will be useful in the proof of Theorem 3.1 (which is
presented in Section 7).

Lemma 4.1. Suppose that Assumptions 2.1 and 2.3 hold true, and let
be such that for some and

Let be fixed.
(i) For every positive integer

(ii) For each

and (iii)

Proof. (i) Define the sequence of random variables and


for In this case, for each and
the Markov property and (4.1) together yield

so that

i.e., is a martingale with respect to the probability measure and the


standard filtration regardless of the initial state. Therefore, by optional stopping,
for every and so that

(ii) By Theorem 4.1, always holds, so that, since


on the event (see (2.8)),

and then (4.2) and Fatou’s Lemma yield that

Notice now that for a given Theorem 4.1 implies that there exists a
positive integer such that so that

and since S is finite, there exist such that

Now set and observe that

so that

by the dominated convergence theorem.

(iii) Since
part (ii) yields that

Observe now that (4.2) is equivalent to

and, via part (ii), this equality implies that

5. AUXILIARY EXPECTED–TOTAL COST


PROBLEMS: I
The method of analysis of the risk sensitive criterion in (2.4)–(2.6) employed
in the sequel will be based on auxiliary stopping problems endowed with a risk–
sensitive total expected cost criterion; see also Cavazos–Cadena and Fernández–
Gaucherand (1998a). Formally, let be a given state and let
be a fixed one–stage cost function. The system will be allowed to run
until the state is reached in a positive time, and it will be stopped at that
moment without incurring any terminal cost. The total operation cost, namely,
will be assessed according to the expected value of
and both the maximization and minimization of this criterion
will be considered: For and define

whereas

In Section 7, these criteria will be used to establish Theorem 3.1 by selecting


where is an appropriate constant. The remainder
of the section establishes fundamental properties of the criteria in (5.1) and
(5.2) which follow from Assumptions 2.1 and 2.3; recall that the state space is
assumed to be finite.

Theorem 5.1. Let be arbitrary but fixed. In this case,


(i) For each

(ii) If then
(a) for every
moreover,
(b) The following optimality equation holds:

Similarly,
(iii) If then

(a) for every and

(b) The following optimality equation holds:

Proof. (i) Let be an arbitrary policy, and observe that for every
Jensen’s inequality yields

where K is as in Theorem 4.1. Therefore,



completing the proof of part (i), since it is clear that see (5.1) and
(5.2).
(ii) Let and be arbitrary and for a fixed state with
define the combined policy as follows: At time
for every whereas for and

if for whereas

If for some for and then

In words, a controller selecting actions according to operates as follows:


At time the action applied is chosen using the decision maker keeps on
using until is reached in a positive time, and at that moment the controller
switches to as if the process had started again. Let be the generated
by the history vector and observe, by
the Markov property, the definition of yields that for each positive integer

and taking expectation with respect to it follows that

where the equality uses the fact that and coincide before the state is reached in
a positive time. Therefore, under the condition it follows
that

which, since was arbitrary, yields

Selecting such that (which is


possible, by Theorem 4.2(ii)), it follows that establishing
part (a), whereas part (b) follows along the lines used in the proof of Theorem
3.1 in Cavazos–Cadena and Fernández–Gaucherand (1998a).
(iii) Suppose that In this case, there exists a policy
such that

Next, by the Markov property, for each positive and

where the shifted policy is defined by


Taking the expectation with respect to it follows that

and after selecting such that (by


Theorem 4.2(ii)), it follows that establishing part (a),
whereas part (b) can be proved using the same arguments as in the proof of
Theorem 3.1 in Cavazos–Cadena and Fernández–Gaucherand (1998a).

Theorem 5.2. Let and be arbitrary but fixed, and suppose


that

In this case, there exists a positive constant c such that



The proof relies on the following two lemmas.

Lemma 5.1. Let and be given and suppose


that

In this case,

Proof. Define and for


and, as before, let be the generated by the history vector
Notice now that for arbitrary and

where the shifted policy is defined by


and the Markov property was used to obtain the second equality. Therefore,
(5.5) yields that

so that is a submartingale with respect to each probability measure


and the filtration By optional stopping, for each positive integer

where it was used that on the event Observe now that


(by Theorem 4.1), so that letting increase to in the last
inequality, it follows that

and then, using the condition this yields that


1, and the conclusion follows, since and were arbitrary.
Lemma 5.2. Let be a given function and let be such that,

In this case
(i) There exists such that
Moreover,
(ii) for every
Proof. Let be such that
(i) The starting point is the optimality equation in Theorem 5.1(ii):

Next, define and by

where

Combining these definitions with (5.6), it is clear that

whereas for (5.6)–(5.9) yield



and in combination with (5.10) it follows that

Then, Lemma 5.1 implies that

Next, observe that

(see (5.8)), so that, setting in (5.11),

and then
(ii) First, notice that the inequality in (5.12) yields

Next observe that on the event there exists with for


which so that (5.12) yields

an inequality that, together with (5.13), implies

and combining this with

it follows that

Notice now that by Theorem 4.2(i), and define the measure


Q on the Borel sets of by

so that, setting

by Jensen’s inequality. Next, observe that

where K is as in Theorem 4.1, and combining this inequality with (5.14) and
(5.15) it follows that

To conclude, observe that the mapping is increasing


in so that the last inequality yields

where is as in Theorem 4.2. Then,

so that, by (5.1), (5.2) and (5.11),



After these preliminaries, Theorem 5.2 can now be proved.


Proof of Theorem 5.2. Let be such that and write
where and N is the number of elements of S.
Given a positive integer set and let denote the
indicator function of i.e.,

For consider the following claim:

(a) Applying Lemma 5.2 with (5.16) holds true for


(b) Suppose that (5.16) is valid for In this case, from an application
of Lemma 5.2 with instead of and with replacing D, it follows
that there exists such that for every

Then, setting

so that (5.16) holds true for


From (a) and (b), it follows that (5.16) is valid for that is, for
for all

6. AUXILIARY EXPECTED–TOTAL COST


PROBLEMS: II
In this section, by examining (3.1), (5.1-5.2), and by an appropriate choice
of the one-stage cost function D(·, ·) in the auxiliary expected-total cost MDP,
candidate solutions needed to establish Theorem 3.1 are constructed.

Theorem 6.1. Let and be arbitrary but fixed.


(i) There exists such that
Similarly,
(ii) There exists such that

The proof of this result is presented in two parts below, as Lemmas 6.1 and
6.2. First, it is convenient to introduce some useful notation.

Definition 6.1. For each and the sets and

are defined as follows:

and

Notice that when so that


for every policy and then and
(see (5.1) and (5.2)), which yields that both sets in Definition 6.1 are nonempty.
On the other hand, observe that when then which
yields that for every and then, this implies that
and are contained in

Lemma 6.1. For given and let In


this case,

Proof. Let be a sequence such that

In this case, for every policy it follows that

Via the monotone convergence theorem, this inequality and (6.1) together yield
that

and since is arbitrary, it follows that

i.e., To conclude, suppose that In this


case, Theorem 5.2 with instead of D and instead of yields a positive
constant such that so that
contradicting the definition of as the supremum of Therefore,
and the proof is complete.

The following lemma presents results for similar to those


in Lemma 6.1; however, the proof is substantially different.
Lemma 6.2. Let and be fixed.

(i) For each there exists such that

Let In this case,

(ii) and
(iii)

Proof. (i) Notice that since so that Theorem 5.1(iii) yields that

and then, by Assumption 2.1, there exists a policy such that for every

From this equality, a simple induction argument using the Markov property
yields that for every positive integer and

and since this last inequality implies, for each that

and then (see (5.1)–(5.2)), for every

(ii) Select a sequence such that

By part (i), for each there exists such that

where the positive integer N is arbitrary. Since is a compact metric space,


there is no loss of generality in assuming that, in addition to (6.3),
for some In this case, Assumption 2.1 yields, via Proposition 18 in
Royden (1968), p. 232, that

, and then (6.4) implies that

and since the integer N > 0 is arbitrary, it follows that

Therefore, so that

(iii) By parts (i) and (ii), there exists a policy such that

It will be shown, by contradiction, that Thus, suppose


that

and define a modified MDP by setting


and In this case, using (6.5), Lemma 5.2
applied to model yields a positive constant such that
and then so that and
this contradicts the definition of as the supremum of Therefore
and the proof is complete.
The two previous lemmas yield Theorem 6.1 immediately.
Proof of Theorem 6.1. Let be fixed. Setting part (i)
follows from Lemma 6.1, whereas defining Lemma 6.2 yields
part (ii).

7. PROOF OF THEOREM 3.1


After the basic technical results presented in the previous sections, Theorem
3.1 can be established as follows.
Proof of Theorem 3.1. Let and be fixed, and define
as follows:

whereas

(i) By Theorem 6.1, (7.1) and (7.2) yield that and then, the opti-
mality equations in Theorem 5.1 imply that

and

equalities that can be condensed into a single one: For

which is equivalent to (3.1).


(ii) and (iii) These parts can be obtained as in Theorem 3.1 in Cavazos–
Cadena and Fernández–Gaucherand (1998a), or from the verification theorem
in Hernández–Hernández and Marcus (1996) for the case
(iv) Let be such that

where for some fixed state It will be shown that


and that First, notice that from Theorem 3.1 in Cavazos–
Cadena and Fernández–Gaucherand (1998a) it follows that so that
is the optimal average cost at every state, which is uniquely determined, so
that Next, using Assumption 2.1, select such that
is a minimizer of the term in brackets in (7.3), so that

In this case, Lemma 4.1 with and instead of and D(·),


respectively, yields that for every and

(notice that on the event and

To conclude, consider the following two cases.


Case 1: Since in this situation (7.3) implies that
and following the same arguments as in the proof
of Lemma 4.1 (i),

and then, using that

and (7.4) and (7.5) imply that

so that the reverse inequality can be obtained in a similar way,


and then

Case 2: Using again that (7.3) yields


for every and then (7.6)–(7.8) occur if the inequalities are reversed and
in (7.7), is replaced by Therefore for every
so that and, similarly, so that

8. CONCLUSIONS
This paper considered Markov decision processes endowed with the risk–
sensitive average cost optimality criterion in (2.4)–(2.6). The main result of the
paper, namely, Theorem 3.1, shows that under standard continuity–compactness
conditions (see Assumption 2.1), the communication condition in Assumption
2.3 guarantees the existence of a solution to the optimality equation stated in (3.1) for
arbitrary values of when the state space is finite. Furthermore, it was
shown via Example 3.1, that the conclusions in Theorem 3.1 cannot be extended
to the case of countably infinite state space models. Hence, the results presented
in the paper significantly extend those in Howard and Matheson (1972), and also
the recent work of the authors presented in Cavazos–Cadena and Fernández–
Gaucherand (1998a-d); see Remark 3.1.

REFERENCES
Arapostathis, A., V. S. Borkar, E. Fernández–Gaucherand, M. K. Ghosh and S.
I. Marcus (1993). Discrete–time controlled Markov processes with average
cost criteria: a survey, SIAM Journal on Control and Optimization, 31, 282–
334.
Avila-Godoy, G., A. Brau and E. Fernández-Gaucherand. (1997). “Controlled
Markov chains with discounted risk-sensitive criteria: applications to ma-
chine replacement,” in Proc. 36th IEEE Conference on Decision and Control,
San Diego, CA, pp. 1115-1120.
Avila-Godoy, G. and E. Fernández-Gaucherand. (1998). “Controlled Markov
chains with exponential risk-sensitive criteria: modularity, structured policies
and applications,” in Proc. 37th IEEE Conference on Decision and Control,
Tampa, FL, pp. 778-783.
Avila-Godoy, G.M. and E. Fernández-Gaucherand. (2000). “Risk-Sensitive In-
ventory Control Problems,” in Proc. Industrial Engineering Research Con-
ference 2000, Cleveland, OH.
Bertsekas, D.P. (1987). Dynamic Programming: Deterministic and Stochastic
Models. Prentice-Hall, Englewood Cliffs.
Borkar, V.K. (1984). On minimum cost per unit of time control of Markov
chains, SIAM Journal on Control and Optimization, 21, 965–984.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998a). Controlled Markov
Chains with Risk-Sensitive Criteria: Average Cost, Optimality Equations,
and Optimal Solutions, ZOR: Mathematical Methods of Operations Re-
search. To appear.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998b). Controlled Markov
Chains with Risk–Sensitive Average Cost Criterion: Necessary Conditions
for Optimal Solutions Under Strong Recurrence Assumptions. Submitted for
Publication.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998c). Markov Deci-
sion Processes with Risk–Sensitive Average Cost Criterion: The Discounted
Stochastic Games Approach. Submitted for publication.
Cavazos–Cadena, R. and E. Fernández–Gaucherand. (1998d). The Vanishing
Discount Approach in Markov Chains with Risk–Sensitive Criteria. Submit-
ted for publication.
Fernández–Gaucherand, E., A. Arapostathis and S.I. Marcus. (1990). Remarks
on the Existence of Solutions to the Average Cost Optimality Equation in
Markov Decision Processes, Systems and Control Letters, 15, 425–432.
Fernández-Gaucherand, E. and S.I. Marcus. (1997). Risk-Sensitive Optimal
Control of Hidden Markov Models: Structural Results. IEEE Transactions
on Automatic Control, 42, 1418-1422.

Fleming, W.H. and W. M. McEneaney. (1995). Risk–sensitive control on an


infinite horizon, SIAM Journal on Control and Optimization, 33, 1881–
1915.
Fleming, W.H. and D. Hernández–Hernández. (1997b), Risk sensitive control
of finite state machines on an infinite horizon I, SIAM Journal on Control
and Optimization, 35, 1790–1810.
Hernández–Hernández, D. and S.I. Marcus. (1996). Risk sensitive control of
Markov processes in countable state space, Systems & Control Letters, 29,
147–155.
Hernández–Lerma, O. and J.B. Lasserre. (1996). Discrete-Time Markov Control
Processes, Springer, New York.
Hinderer, K. (1970). Foundations of Non–stationary Dynamic Programming
with Discrete Time Parameter, Lecture Notes on Operations Research and
Mathematical Systems, No. 33, Springer, New York.
Hordijk, A. (1974). Dynamic Programming and Potential Theory, Mathematical
Centre Tract No. 51, Mathematisch Centrum, Amsterdam.
Howard, R.A. and J.E. Matheson. (1972). Risk–sensitive Markov Decision pro-
cesses, Management Science, 18, 356–369.
Jacobson, D.H. (1973). Optimal stochastic linear systems with exponential per-
formance criteria and their relation to deterministic differential games, IEEE
Transactions on Automatic Control, 18, 124–131.
Jaquette, S.C. (1973). Markov decision processes with a new optimality crite-
rion: discrete time. The Annals of Statistics, 1, 496-505.
Jaquette, S.C. (1976). A utility criterion for Markov decision processes. Man-
agement Science, 23, 43-49.
James, M.R., J.S. Baras and R. J. Elliot. (1994). Risk–sensitive control and
dynamic games for partially observed discrete–time nonlinear systems, IEEE
Transactions on Automatic Control, 39, 780–792.
Kumar, P.R. and P. Varaiya. (1986). Stochastic Systems: Estimation, Identifi-
cation and Adaptive Control, Prentice-Hall, Englewood Cliffs.
Lai, T.L. and S. Yakowitz. (1995). Machine Learning and Nonparametric Bandit
Theory, IEEE Transactions on Automatic Control, 40, 1199–1209.
Loève, M. (1980). Probability Theory I, Springer, New York.
Marcus, S.I., E. Fernández-Gaucherand, D. Hernández-Hernández, S. Coraluppi
and P. Fard. (1996). Risk Sensitive Markov Decision Processes, in Systems
& Control in the Twenty-First Century. Series: Progress in Systems and Con-
trol, Birkhäuser. Editors: C.I. Byrnes, B.N. Datta, D.S. Gilliam, C.F. Martin,
263-279.
Puterman, M. (1994). Markov Decision Processes, Wiley, New York.
Ross, S.M. (1970). Applied Probability Models with Optimization Applications, Holden–
Day, San Francisco.
Royden, H.L. (1968). Real Analysis, MacMillan, London.

Runolfsson, T. (1994). The equivalence between infinite horizon control of stochas-


tic systems with exponential–of–integral performance index and stochastic
differential games, IEEE Transactions on Automatic Control, 39, 1551–1563.
Shayman, M.A. and E. Fernández-Gaucherand. (1999). “Risk–sensitive deci-
sion theoretic troubleshooting.” In Proc. 37th Annual Allerton Conference
on Communication, Control, and Computing, September 22-24.
Thomas, L.C. (1980). Connectedness conditions for denumerable state Markov
decision processes, in: R. Hartley, L.C. Thomas and D.J. White, Editors,
Recent Advances in Markov Decision Processes, Academic Press, New York.
Whittle, P. (1990). Risk–sensitive Optimal Control, Wiley, New York.
Yakowitz, S., T. Jawayardena, and S. Li. (1992). “Machine Learning for Depen-
dent Observations, with Application to Queueing Network Optimization,”
IEEE Trans. Auto. Control 37, 1316-1324.
Yakowitz, S. (1995). “The Nonparametric Bandit Approach to Machine Learning,” in
Proc. 34th IEEE Conf. Decision & Control, New Orleans, LA, 568–572.
Yakowitz, S. (1969). Mathematics of Adaptive Control Processes, Elsevier, New York.

APPENDIX: A: PROOF OF THEOREM 4.1


Theorem 4.1 stated in Section 4 is a consequence of the following.

Theorem A. Let the state space be finite and suppose that Assumptions 2.1 and 2.3 hold true. In this
case, the simultaneous Doeblin condition is valid. More explicitly, there exists such
that

Notice that the conclusion of Theorem A refers to the class of stationary policies, whereas
Theorem 4.1 involves the family of all policies. However, Theorem 4.1 can be deduced from The-
orem A as in Cavazos–Cadena and Fernández–Gaucherand (1998a) or Hernández–Hernández
and Marcus, 1996.
The proof of Theorem A has been divided into three steps presented in the following three
lemmas. To begin with, notice that, since S is finite, Assumption 2.2 implies that the Markov
chain induced by a stationary policy has a unique invariant distribution, denoted by
and characterized by (Loève, 1980)

Moreover, from Assumption 2.2 it follows that for every state


Lemma A.1. (i) The mapping is continuous in Consequently,

(ii) For each is finite and continuous in


Proof. (i) Let be such that To show that it is clearly sufficient
to prove that an arbitrary limit point of coincides with Thus, let be a limit point of
and notice that, without loss of generality, taking a subsequence if necessary it can be
supposed that

Using that S is finite, it is clear that and since each has these
properties; moreover, for each

where the last equality used (A.1) and Assumption 2.1. Therefore, has all the properties to
be the unique invariant distribution of the transition matrix determined by so that
(ii) As already noted, by Assumption 2.3, so that using the equality
(Loève, 1980), the assertion follows from part (i).
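
As a computational aside (our own illustration, not part of the original argument), the invariant distribution characterized by (A.1) can be obtained numerically for any row-stochastic matrix induced by a stationary policy. The following Python sketch, with a hypothetical two-state transition matrix, solves the defining linear system.

    import numpy as np

    def invariant_distribution(P):
        # Solve pi P = pi together with sum(pi) = 1, as in (A.1);
        # assumes P is row-stochastic with a unique invariant distribution.
        n = P.shape[0]
        A = np.vstack([P.T - np.eye(n), np.ones(n)])
        b = np.zeros(n + 1)
        b[-1] = 1.0
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return pi

    # Hypothetical chain induced by some stationary policy:
    P = np.array([[0.9, 0.1],
                  [0.4, 0.6]])
    print(invariant_distribution(P))  # approx. [0.8, 0.2]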

Lemma A.2. Let and with be arbitrary.


(i) There exists a positive integer such that

(ii) For each


Proof. (i) By Assumption 2.3, there exists a positive integer and states
such that

Let so that and notice that since Next, set


and notice that for so that (see (A.3))

(ii) Let be as in part (i), and notice that


(a)

(b) whereas for

(c)
Therefore, denoting the generated by by the Markov
property yields,

Combining (a)-(c), it follows that

so that

Since and by Lemma A.1(ii), it follows


that

Lemma A.3. Let be fixed and let be the indicator function of i.e.,

and set

Then,
(i) and there exists such that

(ii) Define by

Then, and the following (risk-neutral optimality) equation is satisfied:

Proof. (i) Since and for it is clear that


so that

and then Lemma A.1 yields the existence of at which the supremum is achieved. In this
case, since by Assumption 2.3.

(ii) Notice that the equality follows from (A.5). Also, using the Markov property, it
is not difficult to see that for every

which is equivalent to

To establish (A.6), pick an arbitrary pair and define the discrepancy function
and the policy by

and

Combining these definitions with (A.8), it follows that for every

and then [ ],

Combining this last equality with (A.7) and using the fact that it follows that
Therefore (see (A.9)), and then, since the
pair was arbitrary,

and (A.6) follows combining this inequality with (A.8).


Proof of Theorem A. The notation is as in the statement of Lemma A.3. From the optimality
equation (A.6), and using that it follows that for every and

and an induction argument yields that, for every



Observe now that whereas for positive


so that (A.10) implies that

and letting increase to it follows that

since is strictly less than 1, by Lemma A.3 (i), so that

To conclude observe that in this argument is arbitrary, so that setting

APPENDIX: B: PROOF OF THEOREM 4.2


The proof of Theorem 4.2 relies on the following lemma.

Lemma B.1. Suppose that Assumptions 2.1 and 2.3 hold true. For given define
the number of times in which the state process visits before returning to the initial
state in a positive time, by

In this case, for each

where K is as in Theorem 4.1. [Recall that is the indicator function of ]


Proof. For each positive notice that given

so that, setting = the generated by the Markov property yields



where the shifted policy is determined by


Therefore,

where K is as in Theorem 4.1. Hence, taking expectation with respect to it follows that

and then

which, using that yields

To conclude, observe that given on the event so that

Proof of Theorem 4.2. (i) Let be given with Since the simultaneous Doeblin
condition holds, there exists and such that (Arapostathis et al., 1993; Jaquette,
1973; Jaquette, 1976)

where without loss of generality, it is assumed that

If for each is such that minimizes the term in brackets in (B.1), it follows
that [ ]

On the other hand, (B.1) implies that for each and

where the equality used that as well as Next, an induction argument


using the Markov property, yields that for each positive integer and

Observe now that and Since

(by Theorem 4.1), so that as the following convergences hold


almost surely:

and

Therefore, (B.3) allows one to conclude, via the dominated convergence theorem, that

Setting in this inequality it follows‚ since that

and‚ since an application of Lemma B.1 implies that and the


conclusion follows by selecting
(ii) Let and with be given. By part (i), and using
the equality it follows that for
some positive integer and the overall conclusion is obtained by observing that

Chapter 23

SOME ASPECTS OF STATISTICAL INFERENCE


IN A MARKOVIAN AND MIXING FRAMEWORK

George G. Roussas
University of California‚ Davis *

Abstract This paper is a contribution to a special volume in memory of our departed


colleague, Sidney Yakowitz, of the University of Arizona, Tucson.
The material discussed is taken primarily from the existing literature on Marko-
vian and mixing processes. Emphasis is given to the statistical inference aspects
of such processes. In the Markovian case, both parametric and nonparametric
inferences are considered. In the parametric component, the classical approach is
used, whereupon the maximum likelihood estimate and the likelihood ratio func-
tion are the main tools. Also, methodology is employed based on the concept
of contiguity and related results. In the nonparametric approach, the entities of
fundamental importance are unconditional and conditional distribution functions,
probability density functions, conditional expectations, and quantiles. Asymp-
totic optimality properties are stated for the proposed estimates. In the mixing
context, three modes of mixing are entertained but only one, the strong mix-
ing case, is pursued to a considerable extent. Here the approach is exclusively
nonparametric. As in the Markovian case, the entities estimated are distribution
functions, probability density functions and their derivatives, hazard rates, and
regression functions. Basic asymptotic optimality properties of the proposed esti-
mates are stated, and precise references are provided. Estimation is preceded
by a discussion of probabilistic results necessary for statistical inference.
It is hoped that this brief and selected review of the literature on statistical infer-
ence in Markovian and mixing stochastic processes will serve as an introduction
to this area of research for those who entertain such an interest.
The reason for selecting this particular area of research for a review is that a
substantial part of Sidney's own contributions has been in this area.

Keywords: Approximate exponential, asymptotic normality, consistency (weak, strong, in
quadratic mean, uniform, with rates), contiguity, differentiability in quadratic
mean, design (fixed, stochastic), distribution function, estimate (maximum like-
lihood, maximum probability, nonparametric, of a distribution function, of a pa-
rameter, of a probability density function, of a survival function, of derivatives, of
hazard rate, parametric, recursive), likelihood ratio test, local asymptotic normal-
ity, Markov processes, mixing processes, random number of random variables,
stopping times, testing hypotheses.

*This work was supported in part by a research grant from the University of California, Davis.

1. INTRODUCTION
Traditionally, much of statistical inference has been carried out under the ba-
sic assumption that the observations involved are independent random variables
(r.v.s). Most of the time, this is augmented by the supplemental assumption that
the r.v.s also have the same distribution, so that we are dealing, in effect, with
independent identically distributed (i.i.d.) r.v.s. This set-up is based on two
considerations, one of which is mathematical convenience, and the other the
fact that these requirements are, indeed, met in a broad variety of applications.
As for the stochastic or statistical models employed, they are classified mainly
as parametric and nonparametric. In the former case, it is stipulated that the
underlying r.v.s are drawn from a model of known functional form except for a
parameter belonging to a (usually open) subset, the parameter space, of the
Euclidean space In the latter case, it is assumed that the underlying
model is not known except that it is a member of a wide class of possible models
obeying some very broad requirements.
The i.i.d. paradigm described above has been subsequently extended to cover
cases where dependence of successive observations is inescapable. An early
attempt to model such situations was the introduction of Markov processes,
based on the Markovian property. According to this property, and in a discrete
time parameter framework, the conditional distribution of given the
entire past and the present, depends only on the present,
For statistical inference purposes, most of the time we assume the existence
of probability density functions (p.d.f.s). Then, if these p.d.f.s depend on a
parameter as described above, we are dealing with parametric inference (about
if not, we are in the realm of nonparametric inference. Both cases will be
addressed below.
According to Markovian dependence, the past is irrelevant, or to put it,
perhaps, differently, the past is summarized by the present in our attempt to
make probability statements about the future. Should that not be the case, then
new statistical models must be invented, where the entire past enters into
the picture. A class of such models is known under the general name of mixing.
There are several modes of mixing used in the literature, but in this paper we
shall confine ourselves to only three of them. The fundamental characteristic
embodied in the concept of mixing is that the past and the future, as expressed
by the underlying stochastic process, are approximately independent, provided
they are sufficiently far apart. Precise definitions are given in Section 3 (see

Definitions 3.1 - 3.3). In a way, mixing models provide an enlargement of the


class of Markovian models, as it can be demonstrated that most (but not all) Markovian
processes are also mixing processes.
Unlike the Markovian case, where we have the options of considering a
parametric or a nonparametric framework, in a mixing set-up the nonparametric
approach is literally forced upon us. To the knowledge and in the experience
of this author, there is no general natural way of inserting a parameter in the
model to turn it into a parametric one.
Lest it be misunderstood that Markovian and mixing models are the only
ones providing for dependence of the underlying r.v.s, it should be stated here
categorically that there is a plethora of other models available in the literature.
Some of them, such as various time series models, have proven highly successful
in many applications. There are several reasons for confining ourselves to
Markovian and mixing processes alone here. One is parsimony, another is
this author's better familiarity with these kinds of processes, and a third
reason, not the least important, is that these are areas (in particular, Markovian
processes) where Sidney Yakowitz himself contributed heavily through his
research work.
The technical parts of this work and relevant references will be given in
the following sections. The present section is closed by stating that all limits in this
paper are taken as the sample size tends to infinity, unless otherwise specified.
This paper is not intended as an exhaustive review of the literature on sta-
tistical inference under Markovian dependence or under mixing. Rather, its
purpose is to present only some aspects of statistical inference, mostly estima-
tion, and primarily in a nonparametric framework. Such topics are taken, to a
large extent, from relevant works of this author and his collaborators. Refer-
ences are also given to selected works of other researchers, with emphasis on the
generous contributions to this subject matter by Sidney Yakowitz.

2. MARKOVIAN DEPENDENCE
Let be r.v.s constituting a (strictly) stationary Markov process,
defined on a probability space open subset
of and taking values in the real line As it has already been stated,
most often in Statistics we assume the existence of p.d.f.s. To this effect, let
be the initial p.d.f. (the p.d.f. of with respect to a dominating measure
such as the Lebesgue measure), and let be likewise the p.d.f. of the joint
distribution of Then is the p.d.f. of the transition
distribution of given

2.1. PARAMETRIC CASE - THE CLASSICAL


APPROACH
In this framework‚ there are two basic problems of statistical inference to
be addressed. Estimation of and testing hypotheses about Estimation of
is usually done by means of the maximum likelihood principle leading to a
Maximum Likelihood Estimate (MLE). This calls for maximizing‚ with respect
to the likelihood (or log-likelihood) function. Thus‚ consider the likelihood
function

and let

where all logarithms are taken with base Under suitable regularity conditions,
a MLE, exists and enjoys several optimality proper-
ties. Actually, the parametric inference results in a Markovian framework were
well summarized in the monograph Billingsley (1961a) to which the interested
reader is referred for the technical details. From (2.1) and (2.2), we have the
likelihood equations

or just

by omitting the first term in (2.3) since all results are asymptotic.
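
To make the procedure concrete, here is a minimal Python sketch (our own toy example; the AR(1) model and all names below are assumptions, not the chapter's) in which the conditional log-likelihood of the transition densities can be written down and the likelihood equation solved in closed form.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy stationary Markov process: X_i = theta * X_{i-1} + eps_i, eps_i ~ N(0, 1).
    theta_true, n = 0.5, 5000
    x = np.zeros(n)
    for i in range(1, n):
        x[i] = theta_true * x[i - 1] + rng.standard_normal()

    def conditional_loglik(theta):
        # sum_i log f(X_{i-1}, X_i; theta), up to an additive constant;
        # the initial term log p(X_0; theta) is omitted, as in the text.
        resid = x[1:] - theta * x[:-1]
        return -0.5 * np.sum(resid ** 2)

    # For this model the likelihood equation has the closed-form root:
    theta_hat = np.sum(x[1:] * x[:-1]) / np.sum(x[:-1] ** 2)
    print(theta_hat, conditional_loglik(theta_hat) > conditional_loglik(0.4))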
The results to be stated below hold under certain conditions imposed on the
underlying Markov process. These conditions include strict stationarity for the
process‚ as already mentioned‚ absolute continuity of the initial and the tran-
sition probability measures‚ joint measurability of the respective p.d.f.s‚ their
differentiability up to order three‚ integrability of certain suprema‚ finiteness
of certain expectations‚ and nonsingularity of specified matrices. Their precise
formulation is as follows and can be found in Billingsley (1961a); see also
Billingsley (1961b).

Condition (C1). (i) For each and there exists a stationary


transition measure where

the Borel in is measurable with respect to for


each and a probability measure on for each
(ii) For each and each assumes a unique sta-
tionary distribution i.e.‚ there is a unique probability measure on
such that
(iii) There is a measure on such that‚ for each
and the p.d.f. is jointly measurable with respect to
also‚ for each and each with
p.d.f. jointly measurable with respect to
Condition (C2). (i) For each and each
(ii) For each the set is independent
of
(iii) For each and in up to the third order partial derivatives of
with respect to exist and are continuous throughout
For and ranging from 1 to set:

Then
(iv) For each there is a neighborhood of it‚ lying in such
that‚ for each and and as above‚ it holds:

where

and

(v) For each the matrix is nonsingular‚


where

Condition (C3). Let be an open subset of and


let be a mapping from into Set and
assume that each has third order partial derivatives
and the matrix has rank
throughout
Then Theorem 2.1 in Billingsley (1961a) states that:

Theorem 2.1. Under conditions (C1) - (C2) and for each the following
are true: (i) There is a solution of (2.4), with
tending to 1; (ii) This solution is a local maximum of
(that is, essentially, of the log-likelihood function); (iii) The vector
in (iv) If is another sequence
which satisfies (i) and (iii), then
Thus, this theorem ensures, essentially, the existence of a consistent (in the
probability sense) MLE of
The MLE of the previous theorem is also asymptotically normal, when
properly normalized. Rephrasing the relevant part of Theorem 2.2 in Billingsley
(1961a) (see also Billingsley (1961b)), we have

Theorem 2.2. Let be the MLE whose existence is ensured by Theorem 2.1.
Then‚ under conditions (C1)-(C2)‚

where the covariance is given by the expression
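
In its standard formulation (restated here in the usual notation, with f the transition p.d.f.; see Billingsley (1961a)), the limit law takes the inverse Fisher-information form:

    \sqrt{n}\,(\hat{\theta}_n - \theta) \xrightarrow{d} N\big(0,\ \sigma(\theta)^{-1}\big),
    \qquad
    \sigma_{uv}(\theta) = E_\theta\!\Big[ \frac{\partial \log f(X_0, X_1; \theta)}{\partial \theta_u}\,
    \frac{\partial \log f(X_0, X_1; \theta)}{\partial \theta_v} \Big].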

Turning now to a testing hypothesis problem‚ let be as described in con-


dition (C3)‚ and suppose that we are interested in testing the (composite) hy-
pothesis at level of significance In testing the likelihood
ratio test is employed‚ and the (approximate) determination of the cut-off point
is to be obtained. For this purpose‚ the asymptotic distribution of the quantity
is required.

This is‚ actually‚ given in Theorem 3.1 of Billingsley (1961a)‚ which states that:

Theorem 2.3. Under conditions (C1)-(C3)‚

where is given in (2.4), is the dimension of and c is the dimension


of In particular, if for some then (2.5) becomes
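
In the usual notation, the limiting law just described takes the classical chi-square form (a standard restatement, with λ_n the likelihood ratio statistic, d the dimension of the full parameter space, and c that of the null hypothesis):

    -2 \log \lambda_n \xrightarrow{d} \chi^2_{d-c} \quad \text{under } H_0,

and, when the null hypothesis consists of a single point, the degrees of freedom become d.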

Classical results on the MLE in the i.i.d. case were derived by Wald (1941,
1943). See also Roussas (1965b) for an extension to the Markovian case of a
certain result by Wald, and Roussas (1968a) for an asymptotic normality result
of the MLE in a Markovian framework again. Questions regarding asymptotic
efficiency of the MLE, in the i.i.d. case again, are discussed in Bahadur (1964),
and for generalized MLE in Weiss and Wolfowitz (1966). When an estimate
is derived by the principle of maximizing the probability of concentration over
a certain class of sets, the resulting estimate is called a maximum probability
estimate. Much of the relevant information may be found in Wolfowitz (1965),
and Weiss and Wolfowitz (1967, 1970, 1974). Parametric statistical inference
for general stochastic processes is discussed in Basawa and Prakasa Rao (1980).

2.2. PARAMETRIC CASE - THE LOCAL


ASYMPTOTIC NORMALITY APPROACH
When two parameter points and say, are sufficiently far apart, then
any reasonable statistical procedure should be capable of differentiating between them.
Things are more complicated, however, if the parameter points are close to-
gether. A more precise formulation of the problem is as follows. Choose an
arbitrary point in and let be a sequence of parameter points tending
to Then study statistical procedures which will lead to differentiation of
and This is one of the things achieved by the introduction of the concept of
local asymptotic normality of the log-likelihood. This concept was introduced
and developed by Le Cam (1960). A fundamental role in these developments
is played by the concept of contiguity, also due to Le Cam. Contiguity may
be viewed as a measure of closeness of two sequences of probability measures
and is akin to asymptotic mutual absolute continuity. There are several charac-
terizations of contiguity available. Perhaps the simplest of them is as follows.
For consider the measurable spaces and let and be
probability measures defined on The sequences and are said
to be contiguous if whenever also and
vice versa. Subsequent relevant references are Le Cam (1966), Le Cam (1986),

Chapter 6, and Le Cam and Yang (2000). In a Markovian framework, results of


this type are discussed in Roussas (1972). For a brief description of the basic
concepts and results under contiguity see Roussas (2000).
As in the previous section, let be a stationary Markov process
defined on the probability space let be the induced
by the r.v.s and let be the restriction of to For
simplicity, it will be assumed that for any and all
that is, the probability measures involved are mutually absolutely continuous.
Then

so that

Then the likelihood function is expressed as follows:

Replace by and set for the resulting


log-likelihood; that is‚

One of the objectives of this section is to discuss the asymptotic distribution of


under the probability measures and Clearly, an immediate
consequence of it would be the (approximate) determination of the cut-off point
when testing the null hypothesis (or or in the case
that ) against appropriate alternatives, and also the determination of the
power of the test, when the test is based on the likelihood function.
A basic assumption made here is that, for each the random function
is differentiable in quadratic mean when the probability measure is
used. That is, there is a random vector the derivative in
quadratic mean of at such that

where the prime denotes transpose.


(For a more extensive discussion‚ the interested reader is referred to Lind and
Roussas (1977).) Define the random vector and the covariance by:

Then‚ Theorems 2.4-2.6 stated below hold under the following set of as-
sumptions.

(A1) For each the underlying Markov process is (strictly) stationary


and ergodic.
(A2) The probability measures are mutually absolutely contin-
uous for all
(A3) (i) For each the random function is differentiable in
quadratic mean with respect to at the point when is employed,
and let be the derivative in quadratic mean involved.
(ii) is where is the of Borel subsets
of
(iii) The covariance function defined by (2.10)‚ is positive definite
for every
(A4) (i) For each in as
where is defined in (2.7).
(ii) For each fixed is and
is

For some comments on these assumptions and examples where they are
satisfied‚ see pages 45-52‚ in Roussas (1972). Theorems 2.4-2.6‚ given below‚
are a consolidated restatement of Theorems 4.1-4.6‚ pages 53-54‚ and Theorem
1.1‚ page 72‚ in the reference just cited.

Theorem 2.4. Let and be defined by (2.9) and (2.10)‚


respectively. Then‚ under assumptions (A1)-(A4)‚
(i) in

(ii)

(iii)
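
In the standard notation for local asymptotic normality (a standard restatement, with h the local parameter and Λ_n the log-likelihood ratio defined above), the three conclusions read:

    \text{(i)}\ \ \Lambda_n - h'\Delta_n(\theta) \longrightarrow -\tfrac{1}{2}\, h'\Gamma(\theta)\, h
    \quad \text{in } P_{n,\theta}\text{-probability};
    \text{(ii)}\ \ \mathcal{L}\big(\Delta_n(\theta) \mid P_{n,\theta}\big) \Longrightarrow N\big(0, \Gamma(\theta)\big);
    \text{(iii)}\ \ \mathcal{L}\big(\Lambda_n \mid P_{n,\theta}\big) \Longrightarrow
    N\big(-\tfrac{1}{2}h'\Gamma(\theta)h,\ h'\Gamma(\theta)h\big).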

Also‚

Theorem 2.5. In the notation of the previous theorem‚ and under the same
assumptions‚
in probability.

It should be mentioned at this point that‚ unlike the classical approach‚ The-
orem 2.5 is obtained from Theorem 2.4 without much effort at all. This is so
because of the contiguity of the sequences and established in
Proposition 6.1‚ pages 65-66‚ in Roussas (1972)‚ in conjunction with Corollary
7.2‚ page 35‚ Lemma 7.1‚ pages 36-37‚ and Theorem 7.2‚ pages 38-39‚ in the
same reference.
As an application of Theorems 2.4 and 2.5‚ one may construct tests‚ based
essentially on the log-likelihood function‚ which are either asymptotically uni-
formly most powerful or asymptotically uniformly most powerful unbiased.

Application 2.1. For testing the hypothesis against the alternative


at level of significance define the test functions
by:

where and are defined by for all Then the sequence


is asymptotically uniformly most powerful (AUMP) in the sense that‚ if
is any other sequence of test functions of level then

For the justification of the above assertion and certain variations of it‚ see
Theorems 3.1 and 3.2‚ Corollaries 3.1 and 3.2‚ and subsequent examples in
pages 100 - 107 of Roussas (1972).
A rough interpretation of part in Theorem 2.4 is that‚ for all sufficiently
large and in the neighborhood of the likelihood function
behaves as follows:

that is‚ is approximately exponential with being the all important


random vector appearing in the exponent. This statement is made precise as
follows. Let be a suitable truncated version of for which

and define the (exactly) exponential measure by

Then‚ we have

Theorem 2.6. Let where is any bounded sequence in


Then‚ under assumptions (A1)-(A4)‚

This result provides for the following application.

Application 2.2. For testing the hypothesis against the alternative


at asymptotic level of significance define the test
functions by:

where and are chosen so that and is the upper


p-th quantile of Then the sequence is asymptotically uni-
formly most powerful unbiased (AUMPU) of asymptotic level of significance
That is‚ it is asymptotically unbiased‚ lim inf
and for any other sequence which is also of asymptotic level of signifi-
cance and asymptotically unbiased‚ it holds

For a discussion of this application‚ see Theorem 5.1‚ pages 115 - 121 in Roussas
(1972).
Theorem 2.6 may be used in testing the hypothesis even if
The result obtained enjoys several asymptotic optimal properties. Actually‚ the

discussion of such properties is the content of Theorems 2.1 and 2.2, pages 170
- 171, Theorem 4.1, pages 183 - 184, and Theorems 6.1 and 6.2, pages 191
- 196, in Roussas (1972). This theorem can also be employed in obtaining a
certain representation of the asymptotic distribution of a class of estimates of
these estimates need not be MLE.
To be more precise, for an arbitrary and any set
so that for all sufficiently large Let be a class of estimates
of defined as follows:

Then the limiting probability measure is represented as a convolution of


two probability measures, one of them being the distribution,
where is given in (2.10). The significance of this result derives from its
usefulness as a tool for studying asymptotic efficiency of estimates and minimax
issues. For its precise formulation and proof, the reader is referred to Theorem
3.1, pages 136-141, in Roussas (1972).
Some of the original papers on which the above results are based are Roussas
(1965a), Le Cam (1966), Johnson and Roussas (1969, 1970, 1972), Hájek
(1970), Inagaki (1970), and Roussas and Soms (1973). Some concrete examples
are treated in Stamatelos (1976). In Roussas (1968b), contiguity results are
employed to study asymptotic efficiency of estimates considered earlier by
Schmetterer (1966) in the framework of independence. In Roussas (1979),
results such as those stated in Theorems 2.4 and 2.5 were obtained for general
stochastic processes.
Statistical procedures are often carried out in accordance with a stopping
time A stopping time defined on the sequence is a non-
negative integer-valued r.v. tending non-decreasingly to a.s. and such that
for each Next, let be positive real numbers such
that and in Set
and let and be defined by (2.9) and (2.10),
respectively, with replaced by Then suitable versions of Theorems 2.4
and 2.5 hold true. Relevant details may be found in Roussas and Bhattacharya
(1999a), see Theorems 1.3.1 and 1.3.2 there. Also, for a version of Theorem
2.6 in the present framework, see Theorem 1.2.3 in Roussas and Bhattacharya
(1999b). See also the papers Akritas et al. (1979) and Akritas and Roussas
(1979).
A concrete case where random times appear in a natural way is the case of
semi-Markov processes. A continuous time parameter semi-Markov process
with finite state space consists of r.v.s defined on a probability
space and taking values in the set Furthermore, the
process moves from state to state, according to transition probabilities

and it stays in any given state before it moves to the next state a random
amount of time. This time depends both on and and is not necessarily
exponential, as is the case in Markovian processes. It may be shown that, under
suitable conditions, the process may be represented as follows:
where is a Markov chain taking
values in and are stopping times taking values in
{1, 2 . . . } . For an extensive discussion of this problem, the interested reader is
referred to Roussas and Bhattacharya (1999b). See also Akritas and Roussas
(1980).

2.3. THE NONPARAMETRIC CASE


Suppose that the i.i.d. r.v.s have a p.d.f. and that we wish
to estimate nonparametrically at a given point Although there are several
ways of doing this‚ we are going to focus exclusively on the so-called kernel
method of estimation. This method originated with Rosenblatt (1956a)‚ and it
was further developed by Parzen (1962). Presently‚ there is a vast amount of
literature on this subject matter. Thus‚ is estimated by defined by

where the kernel K is a known function‚ usually a p.d.f.‚ and is a


bandwidth sequence of positive numbers tending to 0. Under suitable regularity condi-
tions‚ it is shown that the estimate possesses several optimal properties‚
such as consistency‚ uniform consistency over compact subsets of or over
the entire real line‚ rates of convergence‚ asymptotic normality‚ and asymptotic
efficiency. Derivatives of may also be estimated likewise‚ and properties
for the estimates can be established, similar to the ones just cited. No further
elaboration will be made here on this matter in the context of i.i.d. r.v.s. Some
excellent references on this subject are given at the end of this subsection.
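
For concreteness, here is a minimal Python sketch of the kernel estimate (a Gaussian kernel and all names below are our assumptions):

    import numpy as np

    def kde(x_grid, data, h):
        # f_n(x) = (1 / (n h)) * sum_i K((x - X_i) / h), with Gaussian kernel K.
        u = (x_grid[:, None] - data[None, :]) / h
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
        return K.mean(axis=1) / h

    rng = np.random.default_rng(1)
    data = rng.standard_normal(1000)            # i.i.d. sample (toy example)
    grid = np.linspace(-3.0, 3.0, 7)
    print(kde(grid, data, h=1000 ** (-1 / 5)))  # bandwidth shrinking with n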
Up to 1969‚ the entire literature on kernel estimation of a p.d.f. and related
quantities was exclusively restricted to the i.i.d. framework. The papers Rous-
sas (1969a‚ b) were the first ones to address such issues in a Markovian set-up.
These papers were followed by those of Rosenblatt (1969, 1970, 1971). So, let
be a stationary Markov process with initial distribution function
(d.f.) F‚ one-step transition d.f. initial p.d.f. joint p.d.f. of
and one-step transition p.d.f. Two of the most important
problems here are those of (nonparametrically) estimating the transition p.d.f.
and transition d.f. In the process of doing so‚ one also estimates
F‚ and For a brief description of the basics‚ consider the segment
from the underlying Markov process, and estimate by

where

The kernel K is a known p.d.f. and is a bandwidth as described above. The


p.d.f. is estimated by where

Then‚ for each the natural estimate of is

Some of the properties of the estimates given by (2.16) - (2.18) are summa-
rized in the theorem below. The assumptions under which these results hold‚
as well as their justification‚ can be found in Theorems 2.2‚ 3.1‚ 4.2‚ 4.3 and
Corollary 3.1 of Roussas (1969a)‚ and in Theorems 4.1 and 4.2 of Roussas
(1988b).
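
In the same spirit as (2.16)-(2.18), here is a minimal sketch of the resulting transition p.d.f. estimate (our own toy version: Gaussian kernels, and a hypothetical AR(1) sample standing in for the observed segment):

    import numpy as np

    def gauss(u):
        return np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)

    def transition_pdf_estimate(x, y, X, h):
        # Kernel estimate of the joint p.d.f. of (X_i, X_{i+1}) at (x, y),
        # divided by a kernel estimate of the marginal p.d.f. at x.
        past, future = X[:-1], X[1:]
        joint = np.mean(gauss((x - past) / h) * gauss((y - future) / h)) / h ** 2
        marginal = np.mean(gauss((x - past) / h)) / h
        return joint / marginal

    rng = np.random.default_rng(2)
    n = 5000
    X = np.zeros(n)
    for i in range(1, n):                       # toy AR(1) Markov process
        X[i] = 0.5 * X[i - 1] + rng.standard_normal()
    print(transition_pdf_estimate(0.0, 0.0, X, h=n ** (-1 / 6)))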

Theorem 2.7. Under suitable regularity conditions (see material presented


below right after Theorem 2.12)‚ the estimates and of
and respectively‚ have the following properties:
(i) Asymptotic unbiasedness:

(ii) Consistency in quadratic mean:

and these convergences are uniform over compact subsets of and re-
spectively.

(iii) Weak consistency:


and this convergence is uniform over compact sets of
(iv) Strong consistency with rates:

(v) Uniform strong consistency with rates:



for any and some

(vi) Asymptotic normality:

and

where

Remark 2.1. In part (vi), centering at and is also possible.


Let us turn now to the estimation of the other important quantity‚ namely‚ the
transition d.f. The initial d.f. F may also be estimated‚ and the usual
popular estimate of it would be the empirical d.f. Namely‚

However‚ regarding the d.f. no such estimate is available. Instead‚


is estimated‚ naturally enough‚ as follows

Of course‚ F could be also estimated in a similar manner; that is‚

The estimates given by (2.19) and (2.20) are seen to enjoy the familiar Glivenko-
Cantelli uniform strong consistency property; namely‚

Theorem 2.8. Under suitable regularity conditions‚ and with and


given by (2.20) and (2.19)‚ respectively‚ it holds:

(i)

(ii) for all



The justification of these results is given in Theorems 3.1 and 3.2 of Roussas
(1969b).
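
Computationally, the estimate in (2.19) amounts to kernel weights in the conditioning variable averaged against indicators in y; a minimal sketch (our version; the names, kernel, and bandwidth are assumptions):

    import numpy as np

    def transition_cdf_estimate(y, x, X, h):
        # Estimate of F(y | x) = P(X_{i+1} <= y | X_i = x): Gaussian kernel
        # weights centered at x, averaged against indicators 1{X_{i+1} <= y}.
        past, future = X[:-1], X[1:]
        w = np.exp(-0.5 * ((x - past) / h) ** 2)
        return np.sum(w * (future <= y)) / np.sum(w)

    rng = np.random.default_rng(4)
    X = np.zeros(2000)
    for i in range(1, 2000):                    # toy AR(1) Markov process
        X[i] = 0.5 * X[i - 1] + rng.standard_normal()
    print(transition_cdf_estimate(0.0, 0.0, X, h=0.3))  # near 1/2 by symmetry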
In addition‚ the estimate is shown to be asymptotically normal‚ as
the following result states. Its proof is found in Roussas (1991a); see Theorem
2.3 there.

Theorem 2.9. Under certain regularity conditions‚ the estimate given


by (2.19)‚ is asymptotically normal; that is‚

where

and

Another important problem in the present context is that of estimating all


(finite‚ conditional) moments of

The natural estimate of would be where

It turns out that the quantity is somewhat complicated and is replaced,


instead‚ by where

The estimate is consistent and not much different from the estimate
as the following result states.

Theorem 2.10. Under suitable regularity conditions‚ for


and and given by (2.23) and (2.22):

These facts are established in Theorem 4.1 of Roussas (1969b).

Remark 2.2. In (2.21)‚ we could consider‚ more generally a conditional ex-


pectation of the form for some (measurable)‚

provided‚ of course‚ The estimation procedure and the estab-


lishment of consistency of the respective estimate would be quite similar.
As is well known‚ still another important problem here is that of estimating
the p-th quantile of for any It is assumed that‚ for
the equation has a unique root‚ the p-th quantile of to be
denoted by An obvious estimate of would be any root of the
equation For reasons of definiteness‚ we consider the smallest
such root, to be denoted by Regarding this estimate, the following
consistency holds‚ as is proven in Theorem 5.1 of Roussas (1969b).

Theorem 2.11. Under suitable regularity conditions‚ and with and


as defined above‚ it holds
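
Numerically, the smallest root can be approximated by a grid search on the estimated conditional d.f.; a minimal sketch under the same assumptions as the previous ones (names, grid, and bandwidth are hypothetical):

    import numpy as np

    def conditional_quantile(p, x, X, h, grid):
        # Smallest grid point y with F_n(y | x) >= p, mimicking the smallest
        # root of F_n(y | x) = p; assumes grid covers the relevant range.
        past, future = X[:-1], X[1:]
        w = np.exp(-0.5 * ((x - past) / h) ** 2)
        F = np.array([np.sum(w * (future <= y)) for y in grid]) / np.sum(w)
        return grid[np.argmax(F >= p)]

    rng = np.random.default_rng(5)
    X = np.zeros(2000)
    for i in range(1, 2000):                    # toy AR(1) Markov process
        X[i] = 0.5 * X[i - 1] + rng.standard_normal()
    grid = np.linspace(-4.0, 4.0, 161)
    print(conditional_quantile(0.5, 0.0, X, 0.3, grid))  # near 0 by symmetry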

This section is concluded with certain aspects of recursive estimation. The


transition d.f. was estimated by defined in (2.19) in terms
of In turn‚ was defined in (2.18) by means of
and given in (2.17) and (2.16)‚ respectively. Suppose now that we are
faced with the same estimation problems but the observations are available in
a sequential manner. On the basis of the segment of r.v.s from
an underlying Markov process‚ we construct estimates such as
and Next‚ suppose that a new observation becomes
available. Then the question arises as to how this observation is incorporated
in the available estimates without starting from scratch. Clearly‚ this calls for
a recursive formula which would facilitate such an implementation. At this
point‚ we are going to consider a recursive estimate for In the first
place‚ appropriate estimates for and to be denoted here by
and are respectively:

and‚ of course‚

Then and are given recursively by the formulas



where

and

all
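
The practical point of such recursions is that a new observation updates the current estimate in constant work per evaluation point, without recomputation from scratch. A minimal sketch of one standard recursive kernel update (the Wolverton-Wagner form, offered as an illustration rather than the chapter's exact formulas):

    import numpy as np

    def recursive_kde_update(f_prev, n, x_grid, x_new, h_n):
        # f_n(x) = ((n - 1) / n) f_{n-1}(x) + (1 / (n h_n)) K((x - x_new) / h_n),
        # so the new observation is absorbed into the running estimate.
        u = (x_grid - x_new) / h_n
        K = np.exp(-0.5 * u ** 2) / np.sqrt(2.0 * np.pi)
        return (n - 1) / n * f_prev + K / (n * h_n)

    rng = np.random.default_rng(3)
    grid = np.linspace(-3.0, 3.0, 7)
    f = np.zeros_like(grid)
    for n in range(1, 2001):                    # observations arrive one by one
        f = recursive_kde_update(f, n, grid, rng.standard_normal(), n ** (-1 / 5))
    print(f)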

Below‚ we state asymptotic normality only for the recursive estimate


whose proof is found in Roussas (1991b)‚ Theorem 2.1.

Theorem 2.12. Let the recursive estimate of be given by (2.26)


(see also relations (2.24) - (2.25) and (2.27) - (2.28)). Then‚ under suitable
regularity conditions‚ the estimate given in (2.25)‚ is asymptotically
normal; that is‚

where

and which is assumed


to exist and to be in (0‚1).
As has already been mentioned‚ precise sets of conditions under which The-
orems 2.7 - 2.12 hold‚ as well as their proofs‚ may be found in the references
cited. However‚ we thought it would be appropriate to list here a set of con-
ditions out of which one would choose those actually used in the proof of a
particular theorem. These assumptions are divided into three groups: Those
pertaining to the underlying process‚ the ones referring to the kernel employed‚
and‚ finally‚ the conditions imposed on the bandwidth.
Assumptions imposed on the underlying process.
1. The underlying process is a (strictly) stationary Markov process which
satisfies hypothesis (see Doob (1953), page 221), or the weaker
assumption of being geometrically ergodic (see Rosenblatt (1970), page
202).
2. (i) The process has 1-dimensional and 2-dimensional p.d.f.s (with respect
to the appropriate Lebesgue measure) denoted by and respec-
tively. Then the one-step transition p.d.f. is
provided

(ii) The p.d.f.s and are bounded and continuous.


(iii) The p.d.f.s and have continuous second order derivatives.
(iv) The p.d.f.s and are Lipschitz of order 1; i.e.,

(v) The p.d.f. satisfies the condition:

3. Let be the joint p.d.f. of the r.v.s and Then

4. The second order derivative of the joint p.d.f. of the r.v.s and
satisfies the condition:
5. The joint p.d.f.s of the r.v.s are bounded, and
so are the joint p.d.f.s of where
6. The one-step transition d.f. of the process has a unique p-th
quantile for and
7. For suitable and is continuous in

Assumptions imposed on the kernel.


The real-valued function K defined on is a bounded p.d.f. such that:
(i) as
(ii) and for suitable
(iii) K is continuous.
(iv) K is Lipschitz of order 1; i.e.,
(v) The derivative exists except, perhaps, for finitely many points, and

Assumptions imposed on the bandwidths.


is a sequence of positive numbers such that:
(i)
(ii) In some cases, with and subject to some further restric-
tions.
(iii) In the recursive case, and

Remark 2.3. In reference to the hypothesis imposed on the process‚ a


brief description of it is as follows.
Hypothesis
(a) There is a (finite-valued non-trivial) measure on the Borel in
an integer and a positive such that:
if

where is the step transition measure of the processes.


(b) There is only a simple ergodic set and this set contains no cyclically
moving subsets.
(For part (a) above and for the definition of ergodic sets, as well as cyclically
moving subsets‚ see Doob (1953)‚ pages 192‚ and 210-211).
Regarding the concept of geometric ergodicity‚ here are some relevant com-
ments. Suppose T is the one-step transition probability operator of the underly-
ing process; i.e.‚ if is the one-step transition p.d.f. and is a (real-valued‚
measurable) bounded function‚ then

For the operator is defined recursively. For


let be the of with respect to the measure induced by the
1-dimensional p.d.f. of the process; i.e.‚

For the of the operator is defined by:

where means

Then the process is said to be geometrically ergodic‚ if the operator T satisfies


condition for some with i.e.‚ for some
positive integer

Remark 2.4. It is to be mentioned that all results in Theorems 2.8 - 2.12,


where the p.d.f.s and and the one-step transition d.f. are
involved hold for points and such that is a continuity point of and
is a continuity point of
Also‚ it should be pointed out that the Glivenko-Cantelli Theorem for the em-
pirical d.f. does not require the Markovian property; it holds under stationarity
alone.

Sidney Yakowitz's contributions to Markovian inference, either as a sole au-


thor or as a co-author‚ have been profound. A sample of some of the relevant
works is presented here.
The early work of Sidney Yakowitz was devoted to applying stochastic pro-
cess modelling to a variety of problems related to daily stream flow records‚
Yakowitz (1972)‚ to daily river flows in arid regions‚ Yakowitz (1973)‚ to hydro-
logic chains‚ Yakowitz (1976a)‚ to water table predictions‚ Yakowitz (1976b)‚ to
rivers in the southwest‚ Yakowitz (1977a)‚ to daily river flow‚ Yakowitz (1977b)‚
to hydrologic time series‚ Yakowitz and Denny (1973)‚ to statistical inference
on stream flow processes‚ Denny‚ Kisiel and Yakowitz (1974).
Papers Yakowitz (1976a‚ 1977b) have a strong theoretical component related
to Markov processes. In Yakowitz (1979)‚ the author considers a Markov chain
with stationary transition d.f. which is
assumed to be continuous in He proceeds to construct an estimate
of which‚ under certain regularity conditions‚ satisfies the Glivenko-
Cantelli Theorem as a functions of At that time‚ this was a substantial
improvement over a similar result obtained earlier by the present author.
In the paper Yakowitz (1985)‚ the author undertakes the nonparametric
estimation problem of the transition p.d.f. and the conditional expectation
measurable) in a class of Markovian pro-
cesses‚ which include those satisfying the so-called Doob’s hypothesis (D)
(see‚ Doob (1953)‚ page 192). Results are illustrated by an analysis of river-
flow records. In Yakowitz (1989)‚ the author considers a Markov chain with
state space in which satisfies the so-called Harris condition‚ and proceeds
with the nonparametric estimation of the initial p.d.f.‚ as well as of the regres-
sion function The approach employed is based
on kernel methodology. In Yakowitz and Lowe (1991)‚ the authors frame the
Bandit problem as a Markovian decision problem‚ according to which‚ at each
decision time‚ a r.v. is selected from a finite collection of r.v.s‚ and an out-
come is observed. In Yakowitz et al. (1992)‚ the authors study the problem
of minimizing a function on the basis of a certain observed loss. This loss is
expressed in terms of a Markov decision process and the corresponding control
sequence. In Yakowitz (1993)‚ the author proves consistency of the nearest
neighbor regression estimate in a Markovian time series context. In Lai and
Yakowitz (1995)‚ the authors refine and pursue further the study of the Ban-
dit problem in the framework of controlled Markov processes. Finally‚ in the
paper Gani‚ Yakowitz and Blount (1997)‚ the authors study both deterministic
and stochastic models for the HIV spread in prisons‚ and provide some simu-
lations of epidemics. For computations in the case of stochastic modelling‚ a
new technique is employed for bounding variability in a Markov chain with a
large state space.

Results on various aspects of nonparametric estimation in the i.i.d. context


are contained in the following papers and books; namely‚ Watson and Leadbet-
ter (1964a‚ b)‚ Bhattacharya (1967)‚ Schuster (1969‚ 1972)‚ Nadaraya (1964a‚
1970)‚ Yamato (1973)‚ Devroye and Wagner (1980)‚ Cheng and Lin (1981)‚
Mack and Silverman (1982)‚ and Georgiev (1984a‚ b). Also‚ Devroye and
Györfi (1985)‚ Silverman (1986)‚ Devroye (1987)‚ Rüschendorf (1987)‚ Eu-
bank (1988)‚ and Scott (1992). The papers Prakasa Rao (1977) and Nguyen
(1981‚ 1984) refer to a Markovian framework. Finally‚ a general reference on
nonparametric estimation is the book Prakasa Rao (1983).

3. MIXING
3.1. INTRODUCTION AND DEFINITIONS
As has already been mentioned in the introductory section, mixing is a kind
of dependence which allows for the entire past‚ in addition to the present‚ to
influence the future. There are several modes of mixing‚ but we are going
to concentrate on three of them‚ which go by the names of weak mixing or
mixing‚ and strong mixing or There is a large probabilistic
literature on mixing; however‚ in this paper‚ we focus on estimation problems
in mixing models. A brief study of those probability results‚ pertaining to
statistical inference‚ may be found in Roussas and Ioannides (1987‚ 1988) and
Roussas (1988a).
All mixing definitions will be given for stationary stochastic processes‚ al-
though the stationarity condition is not necessary. The concept of strong mixing
or was introduced by Rosenblatt (1956b) and is as follows.

Definition 3.1. Consider the stationary sequence of r.v.s . . . ‚


and denote by and the induced by the r.v.s. . . . ‚
and respectively. Then the sequence is said to be strong mixing
or with mixing coefficient if

The meaning of (3.1) is clear. It states that the past and the future‚ defined
on the underlying process‚ are approximately independent‚ provided they are
sufficiently far apart.
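
In the usual notation, with the σ-fields as in the definition, the requirement (3.1) has the standard form (a standard restatement):

    \alpha(n) = \sup_{k \ge 1}\, \sup\big\{ |P(A \cap B) - P(A)P(B)| :
    A \in \sigma(X_1, \dots, X_k),\ B \in \sigma(X_{k+n}, X_{k+n+1}, \dots) \big\}
    \longrightarrow 0;

under stationarity, the supremum over k is often dropped.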

The next definition is that of Thus,

Definition 3.2. In the notation of the previous definition‚ the sequence of r.v.s
is said to be with mixing coefficient if

Remark 3.1. One arrives at Definition 3.2 by first defining the maximal corre-
lation by:

being
being with

next‚ specializing it to indicator functions and


to obtain

and‚ finally‚ modifying as follows:

It is shown that and are related as follows:

Relations (3.3) show that and are all equivalent‚ in the sense
that if and only if if and only if
The third and last mode of mixing to be considered here is the weak mixing
or defined below.

Definition 3.3. In the notation of Definition 3.1‚ the sequence of r.v.s is said to
be weak mixing or with mixing coefficient if

The concept of is due to Ibragimov (1959).


It is shown that the mixing coefficients and are related as
follows:

so that implies and implies Accord-


ingly‚ the terms “weak mixing” and “strong mixing” assigned to and
respectively, are misnomers. The kind of mixing which has proven
to be most useful in applications is It has been established that sev-
eral classes of stochastic processes are under rather weak regularity
conditions. For some examples‚ the interested reader is referred to Section 3 of
Roussas and Ioannides (1987).
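
For the record, the standard comparison inequalities behind these implications read (in the usual notation; see, e.g., Roussas and Ioannides (1987)):

    4\,\alpha(n) \ \le\ \rho(n) \ \le\ 2\,\phi^{1/2}(n),

so that φ-mixing implies ρ-mixing, which in turn implies α-mixing.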
The question naturally arises as to when a stochastic process satisfies a mix-
ing condition. Some answers to such a question are provided in the papers
by Davydov (1973)‚ Kesten and O’Brien (1976)‚ Gorodetskii (1977)‚ Withers
(1981a)‚ Pham and Tran (1985)‚ Athreya and Pantula (1986)‚ Pham (1986)‚ and
Doukhan (1994).

3.2. COVARIANCE INEQUALITIES


In statistical applications, mixing often enters the picture in the form of covariance inequalities, moment inequalities and exponential probability inequalities. With this in mind, we present below a series of theorems.
In the following theorem, the two functions involved are defined on the underlying stochastic process, they are A_k-measurable and B_{k+n}-measurable, respectively, and they may be either real-valued or complex-valued.

Theorem 3.1. (i) Under

for real-valued and


and

for complex-valued and


where

with and

(ii) Under

(iii) Under

for real-valued and


and
for complex-valued and

where with and and

Theorem 3.2. Let and be as above and also a.s., a.s.


Then:
(i) Under
for real-valued and
and
for complex-valued and
(ii) Under
for real-valued and
and
for complex-valued and
(iii) Under
for real-valued and
and
for complex-valued and
There are certain useful generalizations of the above two theorems which
are stated below. For this purpose, the following notation is required. Let blocks of consecutive r.v.s from the process be given, separated by gaps of length at least n, and let the σ-fields induced by these blocks be denoted accordingly. The functions considered are measurable with respect to these σ-fields, and they may be either real-valued or complex-valued.

Theorem 3.3. (i) Under

for real-valued

and

for complex-valued

where, for with


and
(ii) Under

for real-valued

and

for complex-valued

where, for with and


Finally, the following generalized version of Theorem 3.2 holds.

Theorem 3.4. For let be as described above and also


a.s. Then:
(i) Under

for real-valued

and

for complex-valued

(ii) Under

for real-valued

and

for complex-valued

(iii) Under

for real-valued

and

for complex-valued

Detailed proofs of Theorems 3.1 - 3.4 may be found in Sections 5 - 7 of Rous-


sas and Ioannides (1987). Specifically‚ see Theorems 5.1 - 5.5 and Corollary
5.1; Theorems 6.1 - 6.2 and Proposition 6.1; Theorems 7.1 - 7.4 and Corollaries
7.1 - 7.2 in the reference just cited.
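Although the displays of Theorems 3.1 - 3.4 are not reproduced here, their common content is that covariances of functions of the remote past and of the distant future shrink as the separation grows, at a rate governed by the mixing coefficients. The following Monte Carlo sketch illustrates this decay; the AR(1) path standing in for the mixing process, and the bounded functions tanh and sin, are assumptions made purely for illustration.

import numpy as np

def cov_decay(x, f, g, lags):
    """Monte Carlo look at the covariance decay that Theorems 3.1-3.4
    quantify: |Cov(f(past), g(future))| for a mixing sequence should
    shrink as the separation n grows. Here f and g are bounded,
    illustrative choices, not the functions of the theorems."""
    out = {}
    for n in lags:
        a, b = f(x[:-n]), g(x[n:])
        out[n] = np.mean((a - a.mean()) * (b - b.mean()))
    return out

rng = np.random.default_rng(1)
# Reuse an AR(1) path as the mixing process (an assumed model).
e = rng.standard_normal(50_000)
x = np.zeros_like(e)
for t in range(1, len(e)):
    x[t] = 0.5 * x[t - 1] + e[t]

for n, c in cov_decay(x, np.tanh, np.sin, (1, 5, 10, 25)).items():
    print(f"separation {n:2d}: |cov| ~= {abs(c):.4f}")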

3.3. MOMENT AND EXPONENTIAL PROBABILITY


BOUNDS
For the segment of r.v.s X_1, ..., X_n, set S_n = X_1 + ... + X_n for their sum, and let r (≥ 2) be any real number. For statistical inference purposes, often a bound of the moment E|S_n|^r is needed, assuming, of course, that this moment is finite. Such a bound is required, for example, when the Markov inequality is used. If the X_i are i.i.d., such a bound is easy to establish. A similar result has also been established for the case that the r.v.s are coming from a stationary Markov process; namely, E|S_n|^r ≤ Cn^{r/2} for some (> 0) constant C (see Doob (1953), Lemma 7.4, page 225). Such an inequality holds true as well under mixing. More specifically, one has the theorem stated below. The main conditions under which this theorem, as well as the remaining theorems in this subsection, hold are grouped together right after the end of Theorem 3.16.

Theorem 3.5. Let X_1, ..., X_n be a segment from a stationary sequence of r.v.s satisfying any one of the φ-mixing, ρ-mixing or α-mixing properties. Assume that E X_1 = 0 and E|X_1|^r < ∞. Then, under some additional requirements, it holds that

E|S_n|^r ≤ Cn^{r/2}, n = 1, 2, ..., C a positive constant.  (3.6)

The proof of (3.6) is carried out by induction on n, and the relevant details may be found in Roussas (1988a).
Although inequality (3.6) does provide a bound for probabilities of the form P(|S_n/n| ≥ ε), a stronger bound is often required. In other words, a Bernstein-Hoeffding type bound would be desirable in the present set-up. Results of this type are available in the literature, and the one stated here is taken from Roussas and Ioannides (1988).

Theorem 3.6. Let the r.v.s X_1, ..., X_n be as in the previous theorem, and let ε > 0. Then, under certain additional regularity conditions, an exponential (Bernstein-Hoeffding type) bound holds for P(|S_n| ≥ nε), with the constants involved positive and ε subject to an additional restriction. Also, a corresponding exponential bound holds for the suitably normed sum, where the (> 0) norming factor determines the rate at which S_n/n converges a.s. to 0.
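The practical content of Theorems 3.5 and 3.6 is that probabilities of the form P(|S_n/n| ≥ ε) can be controlled, polynomially via (3.6) combined with the Markov inequality, and exponentially via the Bernstein-Hoeffding type bound. The sketch below estimates such tail probabilities by simulation for an assumed AR(1) model; all parameter values are illustrative.

import numpy as np

def tail_prob(n, eps=0.1, phi=0.5, reps=2000, seed=2):
    """Monte Carlo estimate of P(|S_n / n| >= eps) for the centered
    AR(1) model X_t = phi*X_{t-1} + e_t (an assumed stand-in for a
    mixing sequence). Theorems 3.5 and 3.6 bound such probabilities."""
    rng = np.random.default_rng(seed)
    count = 0
    for _ in range(reps):
        e = rng.standard_normal(n)
        x = np.empty(n)
        x[0] = e[0]
        for t in range(1, n):
            x[t] = phi * x[t - 1] + e[t]
        count += abs(x.sum() / n) >= eps
    return count / reps

for n in (50, 200, 800):
    print(f"n = {n:4d}: P(|S_n/n| >= 0.1) ~= {tail_prob(n):.3f}")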
A discussion of some basic properties of mixing may be found in Bradley (1981, 1985). Issues regarding moment inequalities are dealt with in the references Longnecker and Serfling (1978), Yoshihara (1978), and Yokoyama (1980). Papers where the Central Limit Theorem and Strong Law of Large Numbers are discussed are Ibragimov (1963), Philipp (1969), Withers (1981b), Peligrad (1985), and Takahata (1993). Finally, books where mixing is discussed, along with other subject matters, are those by Ibragimov and Linnik (1971), Hall and Heyde (1980), Rosenblatt (1985), Yoshihara (1993, 1994a, b, 1995), Doukhan (1994) (Sections 1.4.1 and 1.4.2), and Bosq (1996).

3.4. SOME ESTIMATION PROBLEMS


The estimation problems to be addressed here are those of estimating a d.f., a p.d.f. and its derivatives, the hazard rate, fixed design regression, and stochastic design regression. The approach is nonparametric and the methodology is kernel methodology. Any one of the three modes of mixing discussed earlier is assumed to prevail.
In all that follows, X_1, ..., X_n is a segment from a stationary stochastic process obeying any one of the three modes of mixing: φ-mixing, ρ-mixing or α-mixing. Actually, many results are stated below only for α-mixing and are therefore valid for the remaining two stronger modes of mixing.

3.4.1 Estimation of the Distribution Function or Survival Function.


Let F be the d.f. of the X_i, so that F̄ = 1 − F defines the corresponding

survival function. So, if F_n is an estimate for F, then F̄ is estimated by F̄_n = 1 − F_n. Due to this relationship, statements made about F_n also hold for F̄_n, and vice versa. At this point, it should be pointed out that results stated below also hold for the case that the X_i are multi-dimensional vectors. For the sake of simplicity, we restrict ourselves to real-valued X_i.
The simplest estimate of F is the empirical d.f. F_n defined by

F_n(x) = (1/n) Σ_{i=1}^n I(X_i ≤ x), x real.  (3.7)

This estimate enjoys a number of optimal properties (mostly asymptotic) summarized in the following theorem.

Theorem 3.7. The empirical d.f. F_n, defined by (3.7), as an estimate of F, has the following properties, under mixing and some additional regularity conditions:
(i) Unbiasedness: E F_n(x) = F(x) for every x.
(ii) Strong consistency: F_n(x) → F(x) a.s. for every x.
(iii) Strong uniform consistency: sup_x |F_n(x) − F(x)| → 0 a.s.
(iv) Strong uniform consistency with rates:

where J is any compact subset of the real line and the (positive) norming factor determines the rate of convergence.
Also,

(v) Asymptotic normality:

where

(vi) The at the rate specified by the convergence;

(vii) Asymptotic uncorrelatedness: For x_1 ≠ x_2, the estimates F_n(x_1) and F_n(x_2) are asymptotically uncorrelated at the rate specified by the convergence:

where

Remark 3.2. In reference to part (iii) of the theorem, it must be mentioned that stationarity alone is enough for its validity; mixing is not required.
A detailed listing of the assumptions under which these results hold, as well as their proofs, may be found in Cai and Roussas (1992) (Corollary 2.1 and Theorem 3.2), Roussas (1989b) (Propositions 2.1 and 2.2), and Roussas (1989c) (Theorem 3.1 and Propositions 4.1 and 4.2). However, see also the material right after the end of Theorem 3.16. An additional relevant reference, among others, is the paper by Yamato (1973).
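In code, the empirical d.f. of (3.7), and with it the survival function estimate, is a one-liner; the Python sketch below uses an assumed exponential sample so that the true F is available for comparison.

import numpy as np

def empirical_df(sample, x):
    """Empirical d.f. of (3.7): F_n(x) = (1/n) * #{i : X_i <= x};
    the survival function is then estimated by 1 - F_n(x)."""
    sample = np.asarray(sample)
    return np.mean(sample[:, None] <= np.atleast_1d(x), axis=0)

rng = np.random.default_rng(3)
data = rng.exponential(scale=1.0, size=500)   # assumed sample, true F(x) = 1 - exp(-x)
xs = np.array([0.5, 1.0, 2.0])
Fn = empirical_df(data, xs)
for x, v in zip(xs, Fn):
    print(f"F_n({x}) = {v:.3f}, survival = {1 - v:.3f}, true F = {1 - np.exp(-x):.3f}")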

3.4.2 Estimation of a Probability Density Function and its Derivatives.


The setting here remains the same as above, except that we assume the existence of a p.d.f. f of the d.f. F. The problem then is that of estimating f and, perhaps, its derivatives as well. As in the previous subsection, although results below have been established for multi-dimensional random vectors, they will be stated here for real-valued r.v.s. To start with, f is estimated by f_n, where

f_n(x) = (1/(n h_n)) Σ_{i=1}^n K((x − X_i)/h_n);  (3.8)

it is recalled here that K is a kernel (known p.d.f.), and h_n is a bandwidth with 0 < h_n → 0 as n → ∞. The estimate f_n has several optimal properties which are summarized in the following theorem. C(f) stands for the set of continuity points of f.

Theorem 3.8. In the mixing framework and under other additional assumptions, the estimate f_n defined by (3.8) has the following properties:
(i) Asymptotic unbiasedness: E f_n(x) → f(x), x ∈ C(f).
(ii) Strong consistency with rates:

for some

(iii) Uniform a.s. consistency with rates:



Also,

and

(iv) Asymptotic normality:

and

(v) Joint asymptotic normality: For any distinct continuity points of f,

where is a diagonal matrix with diagonal elements

Also,

(vi) Asymptotic uncorrelatedness: For any with



Formulation of the assumptions under which the above results hold, as well as their proofs, may be found in Roussas (1990a) (Theorems 3.1, 4.1 and 5.1), Roussas (1988b) (Theorems 2.1, 2.2, 3.1 and 3.2), Cai and Roussas (1992) (Theorems 4.1 and 4.4), and Roussas and Tran (1992a) (Theorem 7.1).
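A direct transcription of (3.8) follows; the standard normal density as the kernel K and the bandwidth value are assumed choices, made only so the sketch runs.

import numpy as np

def kernel_density(sample, x, h):
    """Kernel density estimate of (3.8):
    f_n(x) = (1/(n*h)) * sum_i K((x - X_i)/h),
    with K taken here to be the standard normal p.d.f. (a common,
    assumed choice of kernel) and h a bandwidth with h -> 0."""
    sample = np.asarray(sample)
    u = (np.atleast_1d(x)[:, None] - sample[None, :]) / h
    K = np.exp(-0.5 * u**2) / np.sqrt(2.0 * np.pi)
    return K.mean(axis=1) / h

rng = np.random.default_rng(4)
data = rng.normal(size=1000)                    # assumed i.i.d. N(0,1) sample
xs = np.linspace(-2, 2, 5)
print(np.round(kernel_density(data, xs, h=0.3), 3))
print(np.round(np.exp(-0.5 * xs**2) / np.sqrt(2 * np.pi), 3))  # true density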
Suppose now that the r-th order derivative f^(r) of f exists, and let us estimate it by f_n^(r), where

f_n^(r)(x) = (1/(n h_n^{r+1})) Σ_{i=1}^n K^(r)((x − X_i)/h_n),  (3.9)

where K^(r) is the r-th order derivative of the kernel K. Regarding this estimate, it may be shown that it enjoys properties similar to the ones for f_n. One of them is stated below.

Theorem 3.9. In the mixing framework and under other additional assumptions, the estimate f_n^(r) defined by (3.9) is uniformly strongly consistent with rates; namely,

and, in particular,

provided
These results are discussed in Cai and Roussas (1992) (Theorem 4.4).
There is an extensive literature on this subject matter. The following con-
stitute only a sample of relevant references dealing with various estimates and
their behavior. They are Bradley (1983), Masry (1983, 1989), Robinson (1983,
1986), Tran (1990), and Roussas and Yatracos (1996, 1997).

3.4.3 Estimating the Hazard Rate. Hazard analysis has broad applications in systems reliability and survival analysis. It was therefore thought appropriate to touch upon some basic issues in this subject matter. Recall at

this point that if F and f are the d.f. and the p.d.f. of a r.v. X, then the hazard rate r(x) is defined as follows,

r(x) = f(x)/(1 − F(x)), F(x) < 1.  (3.10)

If the r.v. X represents the lifetime of an item, then r(x) is the instantaneous rate of failure of the item just past time x, given that it has survived to time x. In practice, r(x) is unknown and is estimated by r_n(x), where

r_n(x) = f_n(x)/(1 − F_n(x)),  (3.11)

where f_n is given in (3.8) and F_n is the empirical d.f. given in (3.7).
The estimate r_n has several properties, inherited from f_n and F_n. Some of them are summarized below.

Theorem 3.10. In the mixing framework and under other additional assumptions, the estimate r_n defined in (3.11) has the following properties:
(i) Strong pointwise consistency:

(ii) Uniform a.s. consistency with rates:

where J is any compact subset of the domain of r, and the norming factor specifies the


rate of convergence.
Also,

and, in particular,

provided

(iii) Asymptotic Normality:



where
(iv) Joint asymptotic normality: For any distinct continuity points of

where is a diagonal matrix with diagonal elements

Precise statements of the assumptions under which the above results hold, and their justification, may be found in Roussas (1989b) (Theorems 2.1, 2.2 and 2.3), Cai and Roussas (1992) (Theorem 4.2), Roussas (1990a) (Theorem 4.2), and Roussas and Tran (1992a) (Theorem 8.1). Also relevant are the references Watson and Leadbetter (1964a, b).
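Combining (3.7) and (3.8) as prescribed by (3.11) gives a plug-in hazard rate estimate; the sketch below applies it to an assumed exponential sample, for which the true hazard rate is constant and therefore easy to compare against.

import numpy as np

def hazard_estimate(sample, x, h):
    """Hazard rate estimate of (3.11): r_n(x) = f_n(x) / (1 - F_n(x)),
    with f_n the kernel estimate of (3.8) (Gaussian kernel assumed)
    and F_n the empirical d.f. of (3.7)."""
    sample = np.asarray(sample)
    xs = np.atleast_1d(x)
    u = (xs[:, None] - sample[None, :]) / h
    fn = (np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)).mean(axis=1) / h
    Fn = (sample[None, :] <= xs[:, None]).mean(axis=1)
    return fn / (1.0 - Fn)          # defined where F_n(x) < 1

rng = np.random.default_rng(5)
lifetimes = rng.exponential(scale=2.0, size=2000)   # assumed data; true hazard = 0.5
for x in (0.5, 1.0, 2.0):
    print(f"r_n({x}) ~= {hazard_estimate(lifetimes, x, h=0.25)[0]:.3f}")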

3.4.4 A Smooth Estimate of F and r. As is apparent from the previous discussions, in estimation problems we often assume the existence of the p.d.f. f of the d.f. F of a r.v. X. In such a case, F is a smooth curve, and one may find it more reasonable to estimate it by a smooth curve, unlike the empirical d.f. F_n, which is a step function. This approach was actually used in Roussas (1969b), and it may be pursued here as well, to a limited extent. Thus, F is estimated by F̃_n, where

F̃_n(x) = (1/n) Σ_{i=1}^n W((x − X_i)/h_n), with W(y) = ∫_{−∞}^y K(t) dt.  (3.12)

Then the hazard rate r may also be estimated by r̃_n, given below; namely,

r̃_n(x) = f_n(x)/(1 − F̃_n(x)).  (3.13)

It may be shown that the estimate F̃_n has properties similar to those of F_n, except that it is asymptotically unbiased rather than unbiased. Furthermore, it is shown that, under certain regularity conditions, the smooth estimate F̃_n is superior to the standard estimate, the empirical d.f. F_n, in a certain second order asymptotic efficiency sense. The measure of asymptotic comparison is sample size, and the criterion used is mean square error. We do not plan to elaborate on it any further; the interested reader is referred to Roussas (1996) and references cited there. We proceed to close this section with one optimal property of the estimate r̃_n. Namely,

Theorem 3.11. In the mixing set-up and under other additional assumptions, the estimate r̃_n given in (3.13) is uniformly strongly consistent with rates over compact subsets J; namely,

and, in particular,

provided
For the justification of this result, the interested reader is referred to Cai and Roussas (1992) (Theorem 4.3). It appears that the idea of using a smooth estimate for a d.f., such as the one employed above, belongs to Nadaraya (1964b).
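When K is the standard normal density, the smoothing d.f. W entering (3.12) is the standard normal d.f., which can be evaluated through the error function; the sketch below codes F̃_n under these assumed choices of kernel, bandwidth and sample.

import numpy as np
from math import erf, sqrt

def norm_cdf(z):
    # Standard normal d.f. via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def smooth_df(sample, x, h):
    """Smooth d.f. estimate of (3.12):
    F~_n(x) = (1/n) * sum_i W((x - X_i)/h),
    where W is the d.f. of the kernel K (here the standard normal d.f.,
    an assumed kernel choice). Unlike the empirical d.f. of (3.7), this
    estimate is a smooth curve rather than a step function."""
    sample = np.asarray(sample)
    return np.mean([norm_cdf((x - xi) / h) for xi in sample])

rng = np.random.default_rng(6)
data = rng.normal(size=400)              # assumed N(0,1) sample
for x in (-1.0, 0.0, 1.0):
    print(f"F~_n({x}) = {smooth_df(data, x, h=0.3):.3f}  (true {norm_cdf(x):.3f})")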

3.4.5 Recursive Estimation. As has already been seen, a p.d.f. may also be estimated recursively. This approach provides a handy way of incorporating an incoming observation into an existing estimate. Even in today's age of high speed computers, the technique of recursive estimation may provide time-saving advantages, not to mention the virtues of the principle of parsimony. Furthermore, the resulting recursive estimate enjoys other optimal properties over a non-recursive estimate, such as reduction of the variance of the asymptotic normal distribution.
Consider the recursive estimate f_n of f defined as follows:

f_n(x) = (1/n) Σ_{i=1}^n (1/h_i) K((x − X_i)/h_i).  (3.14)

Then, it is easily seen that

f_n(x) = ((n − 1)/n) f_{n−1}(x) + (1/(n h_n)) K((x − X_n)/h_n).  (3.15)

Here {h_n} is a sequence of bandwidths such that 0 < h_n → 0 as n → ∞.
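The point of the recursion (3.15) is that, at any fixed x (or grid of x values), the arrival of X_n updates f_{n−1} to f_n in constant time, without revisiting the earlier observations. A minimal sketch follows; the Gaussian kernel and the bandwidth sequence h_n = n^(−1/5) are assumed, illustrative choices.

import numpy as np

class RecursiveDensity:
    """Recursive kernel density estimate of (3.14), updated via (3.15):
    f_n(x) = ((n-1)/n) f_{n-1}(x) + (1/(n h_n)) K((x - X_n)/h_n).
    A Gaussian kernel and bandwidths h_n = n**(-1/5) are assumed here."""
    def __init__(self, grid):
        self.grid = np.asarray(grid)
        self.f = np.zeros_like(self.grid)
        self.n = 0

    def update(self, x_new):
        self.n += 1
        h = self.n ** (-0.2)                     # assumed bandwidth sequence
        u = (self.grid - x_new) / h
        kern = np.exp(-0.5 * u**2) / np.sqrt(2 * np.pi)
        self.f = ((self.n - 1) * self.f + kern / h) / self.n

rng = np.random.default_rng(7)
est = RecursiveDensity(np.linspace(-3, 3, 7))
for obs in rng.normal(size=5000):    # observations arrive one at a time
    est.update(obs)
print(np.round(est.f, 3))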

The estimate f_n has several optimality properties, some of which are summarized below.

Theorem 3.12. In the mixing framework and under additional suitable condi-
tions, the recursive estimate defined in (3.14) has the following properties:

(i) Asymptotic unbiasedness:

(ii) Possible reduction of the asymptotic variance:

(iii) Asymptotic uncorrelatedness:

(iv) Joint asymptotic normality: For any distinct continuity points of f,

where C is a diagonal matrix with diagonal elements

Also,

Remark 3.3. The results in parts (ii) and (iv) justify the statement made earlier about reduction of the variance of the asymptotic normal distribution.
The justification of the statements in Theorem 3.12 may be found in Roussas
and Tran (1992a) (relations (2.6), (2.10), (2.15), and Theorems 3.1 and 4.1).
Now, if F̄_n is the empirical survival function, it can be written as follows:

F̄_n(x) = (1/n) Σ_{i=1}^n I(X_i > x).  (3.17)

By means of (3.17), it can be easily seen that

F̄_n(x) = ((n − 1)/n) F̄_{n−1}(x) + (1/n) I(X_n > x),  (3.18)

so that a recursive formula for F̄_n is also available. By means of f_n and F̄_n, define r_n by

r_n(x) = f_n(x) / F̄_n(x), F̄_n(x) > 0.  (3.19)

Then, r_n may be evaluated recursively by way of the formula below; namely,

Among the several asymptotic properties of r_n is the one stated in the following result (see Theorem 6.1 in Roussas and Tran (1992a)).

Theorem 3.13. Under mixing and suitable additional conditions, the estimate r_n defined in (3.20) has the following joint asymptotic normality property; that is, for any distinct continuity points of r,

where C* is a diagonal matrix with diagonal elements

Remark 3.4. Applying this last result and comparing it with the result stated in Theorem 3.10(iii), one sees the superiority of the recursive estimate r_n, in terms of asymptotic variance, as it compares to the (non-recursive) estimate in (3.11).
From among several relevant references, we mention those by Masry (1986),
Masry and Györfi (1987), Györfi and Masry (1990), Wegman and Davis (1979),
and Roussas (1992), the last two concerning themselves with the independent
case.

3.4.6 Fixed Design Regression. The problem addressed in this subsection is the following. For n = 1, 2, ..., suppose one selects design points x_{n1}, ..., x_{nn} in a compact set S of the appropriate Euclidean space, at which respective observations Y_{n1}, ..., Y_{nn} are taken. It is assumed that these r.v.s have the following structure:

Y_{ni} = g(x_{ni}) + e_{ni}, i = 1, ..., n,  (3.21)

where g is an unknown real-valued function defined on S, and the e_{ni} are random errors. The problem at hand is to estimate g(x), for x ∈ S, by means of the x_{ni} and the Y_{ni}. The estimator usually used in this context is a linear weighted average of the Y_{ni}. More precisely, if g_n(x) stands for said estimator evaluated at x, then

g_n(x) = Σ_{i=1}^n w_{ni}(x) Y_{ni},  (3.22)

where the w_{ni}(x), i = 1, ..., n, are weights subject to certain requirements.

This problem has been studied extensively in the i.i.d. case (see, for example, Priestley and Chao (1972), Gasser and Müller (1979), Ahmad and Lin (1984), and Georgiev (1988)). Under mixing conditions, however, this problem has been dealt with only in the last decade or so. Here, we present a summary of some of the most important results which have been obtained in the mixing framework.

Theorem 3.14. Under mixing assumptions and further suitable conditions, the fixed design regression estimate defined in (3.22) has the following properties:

(i) Asymptotic unbiasedness:

(ii) Consistency in quadratic mean:

(iii) Strong consistency:

(iv) Asymptotic normality:

Precise statement of conditions under which the above results hold, as well
as their proofs, can be found in Roussas (1989a) (Theorems 2.1, 2.2 and 3.1),
and Roussas et al. (1992) (Theorems 2.1 and 3.1).
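One concrete way to fill in the weights of (3.22) is the Priestley-Chao type choice, kernel ordinates centered at the design points; this specific weight scheme, the equally spaced design, the Gaussian kernel and the regression function in the sketch below are all assumptions for illustration, not prescriptions of the theorems.

import numpy as np

def fixed_design_estimate(x, design, y, h):
    """Fixed design regression estimate of (3.22) with Priestley-Chao
    type weights w_ni(x) = (1/(n h)) K((x - x_ni)/h) for an equally
    spaced design (an assumed concrete choice; other weight schemes,
    e.g. of the Gasser-Mueller type, fit the same template)."""
    n = len(design)
    u = (x - design) / h
    w = np.exp(-0.5 * u**2) / (np.sqrt(2 * np.pi) * n * h)
    return np.sum(w * y)

rng = np.random.default_rng(8)
n = 400
design = (np.arange(1, n + 1) - 0.5) / n          # design points x_ni in [0, 1]
g = lambda t: np.sin(2 * np.pi * t)               # assumed unknown g
y = g(design) + 0.3 * rng.standard_normal(n)      # model (3.21): y = g(x) + error
for x in (0.25, 0.5, 0.75):
    print(f"g_n({x}) ~= {fixed_design_estimate(x, design, y, h=0.05):+.3f}"
          f"  (true {g(x):+.3f})")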

3.4.7 Stochastic Design Regression. The set-up presently differs from the one in the previous subsection in that both the X's and the Y's are r.v.s. More precisely, one has at hand n pairs (X_1, Y_1), ..., (X_n, Y_n) coming from a stationary process, where the Y_i are real-valued r.v.s and the X_i are random vectors. It is assumed that

m(x) = E(Y_1 | X_1 = x)  (3.23)

is finite, and then the problem is that of estimating m(x) on the basis of the observations at hand.
Before we proceed further, we present another formulation of the problem which provides more motivation for what is to be done. To this effect, let Z_1, Z_2, ... be real-valued r.v.s forming a stationary time series. Suppose we wish to predict the r.v. Z_{i+k} on the basis of the k previous r.v.s Z_i, ..., Z_{i+k−1}. As predictor, we use the conditional expectation E(Z_{i+k} | Z_i, ..., Z_{i+k−1}), assuming, of course, that this conditional expectation is finite. By setting X_i = (Z_i, ..., Z_{i+k−1}) and Y_i = Z_{i+k}, the pairs (X_i, Y_i) form a stationary sequence, and the problem of prediction in the time series setting is equivalent to that of estimating the conditional expectation m(x) = E(Y_1 | X_1 = x) on the basis of the available observations. Actually, one may take it one step further by considering a (known) real-valued function

φ defined on the real line, and entertain the problem of predicting φ(Z_{i+k}) by means of the (assumed to be finite) conditional expectation E[φ(Z_{i+k}) | Z_i, ..., Z_{i+k−1}]. Once again, by letting X_i be as above, and by setting Y_i = φ(Z_{i+k}), the problem becomes again that of estimating the conditional expectation E(Y_1 | X_1 = x).
So, it suffices to concentrate on estimating m(x) given in (3.23). The proposed estimate m_n(x) is defined by

m_n(x) = Σ_{i=1}^n Y_i K((x − X_i)/h_n) / Σ_{i=1}^n K((x − X_i)/h_n).  (3.24)

The estimate m_n enjoys several optimal properties; one of them is recorded below.

Theorem 3.15. Under the basic assumption of mixing and further suitable conditions, the regression estimate m_n given in (3.24) is strongly consistent with rates, uniformly over compact subsets; namely,

where J is any compact subset and the (positive) norming factor specifies the rate of convergence.
For details, the interested reader is referred to Roussas (1990b) (Theorem
4.1).
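The ratio form of (3.24) is the classical Nadaraya-Watson estimate and is straightforward to code; the sketch below applies it to the one-step time series prediction problem described in the reformulation above, with an assumed AR(1)-type series and Gaussian kernel.

import numpy as np

def nw_regression(X, Y, x, h):
    """Stochastic design regression estimate of (3.24):
    m_n(x) = sum_i Y_i K((x - X_i)/h) / sum_i K((x - X_i)/h)
    (Nadaraya-Watson form; Gaussian kernel assumed)."""
    u = (x - np.asarray(X)) / h
    w = np.exp(-0.5 * u**2)
    return np.sum(w * Y) / np.sum(w)

# One-step prediction for an assumed AR(1)-type series: predict Z_{i+1}
# from Z_i, i.e., X_i = Z_i and Y_i = Z_{i+1} as in the reformulation.
rng = np.random.default_rng(9)
z = np.zeros(3000)
for t in range(1, len(z)):
    z[t] = 0.6 * z[t - 1] + rng.standard_normal()
X, Y = z[:-1], z[1:]
for x in (-1.0, 0.0, 1.0):
    print(f"E(Z_next | Z = {x:+.1f}) ~= {nw_regression(X, Y, x, h=0.3):+.3f}"
          f"  (true {0.6 * x:+.3f})")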
A recursive version of the estimate is also available and is defined as follows:

m̃_n(x) = Σ_{i=1}^n h_i^{−d} Y_i K((x − X_i)/h_i) / Σ_{i=1}^n h_i^{−d} K((x − X_i)/h_i), d being the dimension of the X_i.  (3.25)

This estimate is asymptotically normal (see Theorems 2.1 - 2.3 in Roussas and Tran (1992b)), as the following theorem states.

Theorem 3.16. Under the basic assumption of mixing and further suitable conditions, the recursive regression estimate given in (3.25) has the properties stated below.
(i) Asymptotic normality: For any continuity point x for which

where

(ii) Joint asymptotic normality: For any distinct continuity points for which

where the covariance matrix D is a diagonal matrix with


diagonal elements given by

Precise formulation of conditions under which Theorems 3.5 - 3.16 hold,


as well as their proofs, can be found in references already given. In order for
the reader to get a taste of the assumptions used, a brief description of them is
presented below. They are divided into three groups as was done in the material
presented right after the end of Theorem 2.12.
Assumptions imposed on the underlying process.
(i) The underlying process is (strictly) stationary and satisfies one of the mixing conditions: φ-mixing, ρ-mixing or α-mixing.
(ii) The mixing coefficients involved satisfy certain summability conditions.


Furthermore, in some of the results, they are required to satisfy some conditions also involving other entities (such as bandwidths and rates of convergence). For example, for the mixing coefficients it may be required that they be summable, or that they tend to 0 at a suitable (polynomial or exponential) rate, possibly in conjunction with a certain sequence of positive numbers.
(iii) The 1-dimensional p.d.f. of the process is continuous; it is Lipschitz of order 1; it has bounded and

continuous second order derivative; and it stays bounded away from 0 on J, a compact subset of the real line.

(iv) The joint p.d.f. of pairs of the r.v.s involved satisfies a suitable boundedness condition.

(v) The 1-dimensional d.f. F is Lipschitz of order 1.
Assumptions imposed on the kernel.


The kernel K is a bounded p.d.f. such that:

(i) yK(y) → 0 as |y| → ∞;
(ii) certain moments of K are finite, and certain others vanish, for suitable orders;
(iii) K is Lipschitz of order 1;
(iv) the derivative K′ exists, is continuous and of bounded variation.
Assumptions imposed on the bandwidths.


{h_n} is a sequence of positive numbers such that:

(i) h_n → 0 as n → ∞;
(ii) nh_n → ∞;
(iii) further conditions on the rate at which h_n → 0 hold;
(iv) they also satisfy some additional conditions involving other entities (such as mixing coefficients and rates of convergence);
(v) in particular, in the recursive case, the bandwidths satisfy additional conditions: for some exponents, there is a sequence of positive numbers in terms of which the h_n are controlled.
The formulation of the conditions employed in the fixed design regression case, as well as in the stochastic design regression case, requires the introduction of a large amount of notation. We choose not to do so, and refer the reader to the original papers already cited.
It is to be emphasized that, in any one of the results Theorems 3.5 - 3.16 stated above, the proof requires only some of the assumptions just listed.
Early papers on regression were those by Nadaraya (1964a, 1970) and Watson (1964). Subsequently, there has been a large number of contributions in this area. Those by Burman (1991), Masry (1996), and Masry and Fan (1997) are

only a sample of them. In monograph form, there is the contribution of Härdle (1990).
Finally, it should be mentioned at this point that there is a significant number of papers and/or monographs dealing with dependencies which do not necessarily involve mixing. Some of them are those by Földes (1974), Györfi (1981), Castellana and Leadbetter (1986), Morvai, Yakowitz and Györfi (1996), Tran, Roussas, Yakowitz and Truong (1996), Morvai, Yakowitz and Algoet (1997), and the monographs by Müller (1988) (for the independent case) and Györfi et al. (1989).
Sidney Yakowitz, either alone or in collaboration with others, has also made
significant contributions to statistical inference under mixing conditions. The
papers below represent a limited sample of such works under mixing condi-
tions. In Yakowitz (1987), the author studies the problem of estimating the
nonlinear regression by nearest-neighbor method-
ology. The setting is that of a time series satisfying a mixing condition. In Tran
and Yakowitz (1993), the authors establish asymptotic normality for a nearest-
neighbor estimate of a p.d.f. in a random field framework satisfying a mixing
condition. In Györfi, Morvai and Yakowitz (1998), the authors address the
forecasting problem in a time-series set-up. They show that various plausible prediction problems are unsolvable under the weakest assumptions, whereas others are solvable by predictors which are known to be consistent under mixing conditions. Furthermore, in work as yet unpublished, Heyde and Yakowitz (2001) show that there is no procedure which would provide consistent density or regression estimates for every mixing process. Consistency is contingent upon assumptions regarding the rates of decay of the mixing coefficients. Such decay assumptions play an essential role in the analysis of any estimation scheme.
In conclusion, it should be mentioned that Yakowitz has also made contributions in other areas of research. However, such contributions are not included here, as they fall outside the scope of this work.

ACKNOWLEDGMENTS
Thanks are due to an anonymous reviewer whose constructive comments,
resulting from a careful and expert reading of the manuscript, helped improve
the original version of this work.

REFERENCES
Ahmad, I. A. and P. E. Lin. (1984). Fitting a multiple regression function.
Journal of Statistical Planning and Inference 9, 163 - 176.

Akritas, M.G. and G. G. Roussas. (1979). Asymptotic expansion of the log-


likelihood function based on stopping times defined on Markov processes.
Annals of the Institute of Statistical Mathematics 31A, 21 - 38.
Akritas, M.G., M. L. Puri, and G. G. Roussas. (1979). Sample size, parameter
rates and contiguity - the i.i.d. case. Communication in Statistics - Theoretical
Methods A8, 71 - 83.
Akritas, M.G. and G. G. Roussas. (1980). Asymptotic inference in continuous
time semi-Markov processes. Scandinavian Journal of Statistics 7, 73 - 79.
Athreya, K.B. and S. G. Pantula. (1986). Mixing properties of Harris chains
and autoregressive processes. Journal of Applied Probability 23, 880 - 892.
Bahadur, R.R. (1964). On Fisher’s bound for asymptotic variances. Annals of
Mathematical Statistics 35, 1545 - 1552.
Basawa, I.V. and B. L. S. Prakasa Rao. (1980). Statistical Inference for Stochas-
tic Processes. Academic Press, New York.
Bhattacharya, P.K. (1967). Estimation of probability density and its derivatives. Sankhyā Series A 29, 373 - 382.
Billingsley, P. (1961a). Statistical Inference for Markov Processes. University
of Chicago Press, Chicago.
Billingsley, P. (1961b). The Lindeberg-Lévy theorem for martingales. Proceed-
ings of the American Mathematical Society 12, 788 - 792.
Bosq, D. (1996). Nonparametric Statistics for Stochastic Processes. Springer -
Verlag, Berlin.
Bradley, R.C. (1981). Central limit theorem under weak dependence. Journal
of Multivariate Analysis 11, 1 - 16.
Bradley, R.C. (1983). Asymptotic normality of some kernel-type estimators of
probability density. Statistics and Probability Letters 1, 295 - 300.
Bradley, R.C. (1985). Basic properties of strong mixing conditions. In: Depen-
dence in Probability and Statistics, E. Eberlein and M.S. Taqqu (Eds.) 165 -
192, Birkhäuser, Boston.
Burman, P. (1991). Regression function estimation from dependent observa-
tions. Journal of Multivariate Analysis 36, 263 - 279.
Cai, Z. and G. G. Roussas. (1992). Uniform strong estimation under α-mixing, with rates. Statistics and Probability Letters 15, 47 - 55.
Castellana, J.V. and M. R. Leadbetter. (1986). On the smoothed probability
density estimation for stationary processes. Stochastic Processes and their
Applications 21, 179 - 193.
Cheng, K.F. and P. E. Lin. (1981). Nonparametric estimation of a regression
function. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete
57, 223 - 233.
Davydov, Y.A. (1973). Mixing conditions for Markov chains. Theory of Prob-
ability and its Applications. 18, 312 - 328.

Denny, J., C. Kisiel, and S. Yakowitz. (1974). Statistical inference on stream


flow processes with Markovian characteristics. Water Resources Research
10, 947 - 954.
Devroye, L. and T. J. Wagner. (1980). On the L1 convergence of kernel estimators of regression functions with applications in discrimination. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 51, 15 - 25.
Devroye, L. and L. Györfi. (1985). Nonparametric Density Estimation: The L1 View. John Wiley and Sons, Toronto.
Devroye, L. (1987). A Course in Density Estimation. Birkhäuser, Boston.
Doob, J.L. (1953). Stochastic Processes. Wiley, New York.
Doukhan, P. (1994). Mixing: Properties and examples. Lecture Notes in Statis-
tics No. 85, Springer-Verlag, New York.
Eubank, R. (1988). Spline Smoothing and Nonparametric Regression. Marcel-
Dekker, New York.
Földes, A. (1974). Density estimation for dependent samples. Studia Scien-
tiarum Mathematicarum Hungarica. 9, 443 - 452.
Gani, J., S. Yakowitz, and M. Blount. (1997). The spread and quarantine of HIV infection in a prison system. SIAM Journal of Applied Mathematics 57, 1510 - 1530.
Gasser, T. and H.-G. Müller. (1979). Kernel estimation of regression function.
In: Smoothing Techniques for Curve Estimation Lecture Notes in Mathemat-
ics. 757, 23 - 68. Springer-Verlag, Berlin.
Georgiev, A.A. (1984a). Kernel estimates of functions and their derivatives with
applications. Statistics and Probability Letters 2, 45 - 50.
Georgiev, A.A. (1984b). Speed of convergence in nonparametric kernel esti-
mation of a regression function and its derivatives. Annals of the Institute of
Statistical Mathematics 36, 455 - 462.
Georgiev, A.A. (1988). Consistent nonparametric multiple regression: The fixed
design case. Journal of Multivariate Analysis 25, 100 - 110.
Gorodetskii, V.V. (1977). On the strong mixing property for linear sequences.
Theory of Probability and its Applications 22, 411 - 413.
Györfi, L. (1981). Strongly consistent density estimate from ergodic samples.
Journal of Multivariate Analysis 11, 81 - 84.
Györfi, L., Härdle, W., Sarda, P. and Vieu, P. (1989). Nonparametric Curve
Estimation from Time Series. Springer-Verlag, Berlin.
Györfi, L. and E. Masry. (1990). The L1 and L2 strong consistency of recursive kernel density estimation from dependent samples. IEEE Transactions on Information Theory 36, 531 - 539.
Györfi, L., G. Morvai, and S. Yakowitz. (1998). Limits to consistent on-line forecasting for ergodic time series. IEEE Transactions on Information Theory 44, 886 - 892.

Hájek, J. (1970). A characterization of limiting distributions of regular estimates.


Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 14, 323 -
330.
Hall, P. and C.C. Heyde. (1980). Martingale Limit Theory and Its Applications.
Academic Press, New York.
Härdle, W. (1990). Applied Nonparametric Regression. Cambridge University
Press, Cambridge.
Heyde C. and S. Yakowitz. (2001). Unpublished work.
Ibragimov, I.A. (1959). Some limit theorems for strict sense stationary stochas-
tic processes. Doklady Akademii Nauk SSSR 125, 711 - 714.
Ibragimov, I.A. (1963). A central limit theorem for a class of dependent random
variables. Theory of Probability and its Applications 8, 83 - 94.
Ibragimov, I.A. and Yu. V. Linnik. (1971). Independent and Stationary Se-
quences of Random Variables. Wolters-Noordhoff Publishing, Groningen,
The Netherlands.
Inagaki, N. (1970). On the limiting distribution of a sequence of estimators with
uniformity property. Annals of the Institute of Statistical Mathematics 22, 1
- 13.
Johnson, R.A. and G.G. Roussas. (1969). Asymptotically most powerful tests
in Markov processes. Annals of Mathematical Statistics 40, 1207 - 1215.
Johnson, R.A. and G.G. Roussas. (1970). Asymptotically optimal test in Markov
processes. Annals of Mathematical Statistics 41, 918 - 938.
Johnson, R.A. and G.G. Roussas. (1972). Applications of contiguity to multi-
parameter hypothesis testing. Proceedings of the 6th Berkeley Symposium of
Mathematical Statistics and Probability 1, 195 - 226.
Kesten, H. and G.L. O’Brien. (1976). Examples of mixing sequences. Duke
Mathematical Journal 43, 405 - 415.
Lai, T.L. and S. Yakowitz. (1995). Machine learning and nonparametric bandit
theory. IEEE Transactions on Automatic Control 40, 1199 - 1209.
Le Cam, L. (1960). Locally asymptotically normal families of distributions.
University of California Publication of Statistics 3, 37 - 98.
Le Cam, L. (1966). Likelihood functions for large numbers of independent
observations. In: Research Papers in Statistics, Festschrift for J. Neyman,
F.N. David (Ed.) 167 - 187. John Wiley and Sons, New York.
Le Cam, L. (1986). Asymptotic Methods in Statistical Decision Theory. Springer-
Verlag, New York.
Le Cam, L. and G. Yang. (2000). Asymptotics in Statistics: Some Basic Con-
cepts, 2nd ed. Springer-Verlag, New York.
Lind, B. and G.G. Roussas. (1977). Cramér-type conditions and quadratic mean
differentiability. Annals of the Institute of Statistical Mathematics 29, 189 -
201.

Longnecker, M. and R.J. Serfling. (1978). Moment inequalities for under


general stationary mixing sequences. Zeitschrift für Wahrscheinlichkeitsthe-
orie und verwandte Gebiete 43, 1-21.
Mack, Y.P. and B.W. Silverman. (1982). Weak and strong uniform consistency
of kernel regression estimates. Zeitschrift für Wahrscheinlichkeitstheorie und
verwandte Gebiete 61, 405 - 415.
Masry, E. (1983). Probability density estimation from sampled data. IEEE
Transactions on Information Theory 29, 696 - 709.
Masry, E. (1986). Recursive probability density estimation for weakly dependent stationary processes. IEEE Transactions on Information Theory IT-32, 254 - 267.
Masry, E. and L. Györfi. (1987). Strong consistency and rates for recursive
probability density estimators of stationary processes. Journal of Multivari-
ate Analysis 22, 79 -93.
Masry, E. (1989). Nonparametric estimation of conditional probability densi-
ties and expectations of stochastic processes: Strong consistency and rates.
Stochastic Processes and their Applications 32, 109 - 127.
Masry, E. (1996). Multivariate local polynomial regression estimation for time
series: Uniform strong consistency and rates. Journal of Time Series Analysis
17, 571 - 599.
Masry, E. and J. Fan. (1997). Local polynomial estimation of regression func-
tions for mixing processes. Scandinavian Journal of Statistics 24, 165 - 179.
Morvai, G., S. Yakowitz, and L. Györfi. (1996). Nonparametric inference for
ergodic, stationary time series. Annals of Statistics 24, 370 - 379.
Morvai, G., S. Yakowitz, and P. Algoet. (1997). Weakly convergent nonparamet-
ric forecasting for stationary time series. IEEE Transactions on Information
Theory 43, 483 - 498.
Müller, H.-G. (1988). Nonparametric Regression Analysis of Longitudinal Data.
Lecture Notes in Statistics No. 46, Springer-Verlag, Heidelberg.
Nadaraya, E.A. (1964a). On estimating regression. Theory of Probability and
its Applications 9, 141 - 142.
Nadaraya, E.A. (1964b). Some new estimates for distribution functions. Theory
of Probability and its Applications 9, 491 - 500.
Nadaraya, E.A. (1970). Remarks on nonparametric estimates for density func-
tion and regression curves. Theory of Probability and its Applications 15,
134 - 137.
Nguyen, H.T. (1981). Asymptotic normality of recursive density estimators in
Markov processes. Publication of the Institute of Statistics, University of
Paris 26, 73-93.
Nguyen, H.T. (1984). Recursive nonparametric estimation in stationary Markov
processes. Publication of the Institute of Statistics University of Paris 29, 65
- 84.

Parzen, E. (1962). On estimation of a probability density function and mode.


Annals of Mathematical Statistics 33, 1065 - 1076.
Peligrad, M. (1985). Convergence rates of the strong law for stationary mixing
sequences. Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete
70, 307-314.
Pham, T.D. and L.T. Tran. (1985). Some strong mixing properties of time series
models. Stochastic Processes and their Applications 19, 297 - 303.
Pham, T.D. (1986). The mixing property of bilinear and generalized random co-
efficient autoregressive models. Stochastic Processes and their Applications
23, 291 - 300.
Philipp, W. (1969). The remainder in the central limit theorem for mixing
stochastic processes. Annals of Mathematical Statistics 40, 601 - 609.
Prakasa Rao, B.L.S. (1977). Berry-Esseen bound for density estimators of sta-
tionary Markov processes. Bulletin of Mathematical Statistics 17, 15 - 21.
Prakasa Rao, B.L.S. (1983). Nonparametric Functional Estimation. Academic
Press, New York.
Priestley, M.B. and M.T. Chao. (1972). Nonparametric functions fitting. Journal
of the Royal Statistical Society Series B 34, 385 - 392.
Robinson, P.M. (1983). Nonparametric estimators for time series. Journal of
Time Series Analysis 4, 185 - 207.
Robinson, P.M. (1986). On the consistency and finite-sample properties of non-
parametric kernel time series regression, autoregression and density estima-
tors. Annals of the Institute of Statistical Mathematics 38, Part A, 539 - 549.
Rosenblatt, M. (1956a). Remarks on some nonparametric estimates of a density
function. Annals of Mathematical Statistics 27, 823 - 835.
Rosenblatt, M. (1956b). A central limit theorem and a strong mixing condition.
Proceedings of the National Academy of Sciences, U.S.A. 42, 43 - 47.
Rosenblatt, M. (1969). Conditional probability density and regression estima-
tors. In: Multivariate Analysis II, P.R. Krishnaiah (Ed.) 25-31. Academic
Press, New York.
Rosenblatt, M. (1970). Density estimates and Markov sequences. In: Nonpara-
metric Techniques in Statistical Inference, M. Puri (Ed.). Cambridge Univer-
sity Press, Cambridge.
Rosenblatt, M. (1971). Curve estimates. Annals of Mathematical Statistics 42,
1815 -1842.
Rosenblatt, M. (1985). Stationary Sequences and Random Fields. Birkhäuser,
Boston.
Roussas, G.G. (1965a). Asymptotic inference in Markov processes. Annals of
Mathematical Statistics 36, 978 - 992.
Roussas, G.G. (1965b). Extension to Markov processes of a result by A. Wald
about the consistency of the maximum likelihood estimate. Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete 4, 69 - 73.

Roussas, G.G. (1968a). Asymptotic normality of the maximum likelihood es-


timate in Markov processes. Metrika 14, 62 - 70.
Roussas, G.G. (1968b). Some applications of the asymptotic distribution of the
likelihood functions to the asymptotic efficiency of estimates. Zeitschrift für
Wahrscheinlichkeitstheorie und verwandte Gebiete 10, 252 - 260.
Roussas, G.G. (1969a). Nonparametric estimation in Markov processes. Annals
of the Institute of Statistical Mathematics 21, 73 - 78.
Roussas, G.G. (1969b). Nonparametric estimation of the transition distribution
function of a Markov process. Annals of Mathematical Statistics 40, 1386 -
1400.
Roussas, G.G. (1972). Contiguity of Probability Measures: Some Applications
in Statistics. Cambridge University Press, Cambridge.
Roussas, G.G. and A. Soms. (1973). On the exponential approximation of a fam-
ily of probability measures and a representation theorem of Hájek-Inagaki.
Annals of the Institute of Statistical Mathematics 25, 27 - 39.
Roussas, G.G. (1979). Asymptotic distribution of the log-likelihood function
for stochastic processes. Zeitschrift für Wahrscheinlichkeitstheorie und ver-
wandte Gebiete 47, 31 - 46.
Roussas, G.G. and D. Ioannides. (1987). Moment inequalities for mixing se-
quences of random variables. Stochastic Analysis and Applications 5, 61 -
120.
Roussas, G.G. (1988a). A moment inequality for triangular arrays of
random variables under mixing conditions, with applications. In: Statistical
Theory and Data Analysis II, K. Matusita (Ed.) 273 - 292. North-Holland,
Amsterdam.
Roussas, G.G. (1988b). Nonparametric estimation in mixing sequences of ran-
dom variables. Journal of Statistical Planning and Inference 18, 135 -149.
Roussas, G.G. and D. Ioannides. (1988). Probability bounds for sums of tri-
angular arrays of random variables under mixing conditions. In: Statistical
Theory and Data Analysis II, K. Matusita (Ed.) 293 - 308. North-Holland,
Amsterdam.
Roussas, G.G. (1989a). Consistent regression estimation with fixed design
points under dependence conditions. Statistics and Probability Letters 8,
41 - 50.
Roussas, G.G. (1989b). Hazard rate estimation under dependence conditions.
Journal of Statistical Planning and Inference 22, 81 - 93.
Roussas, G.G. (1989c). Some asymptotic properties of an estimate of the sur-
vival function under dependence conditions. Statistics and Probability Let-
ters 8, 235 - 243.
Roussas, G.G. (1990a). Asymptotic normality of the kernel estimate under
dependence conditions: Application to hazard rate. Journal of Statistical
Planning and Inference 25, 81 - 104.

Roussas, G.G. (1990b). Nonparametric regression estimation and mixing con-


ditions. Stochastic Processes and their Applications 36, 107 - 116.
Roussas, G.G. (1991a). Estimation of transition distribution function and its
quantiles in Markov processes: Strong consistency and asymptotic normality.
In: Nonparametric Functional Estimation and Related Topics, G.G. Roussas
(Ed.) 443 - 462. Kluwer Academic Publishers, The Netherlands.
Roussas, G.G. (1991b). Recursive estimation of the transition distribution func-
tion of a Markov process: Asymptotic normality. Statistics and Probability
Letters 11, 435 - 447.
Roussas, G.G. (1992). Exact rates of almost sure convergence of a recursive
kernel estimate of a probability density function: Application to regression
and hazard rate estimation. Journal of Nonparametric Statistics 1, 171 -195.
Roussas, G.G. and L.T. Tran. (1992a). Joint asymptotic normality of kernel
estimates under dependence, with applications to hazard rate. Journal of
Nonparametric Statistics 1, 335 - 355.
Roussas, G.G. and L.T. Tran. (1992b). Asymptotic normality of the recursive
kernel regression estimate under dependence conditions. Annals of Statistics
20, 98 -120.
Roussas, G.G., L.T. Tran, and D.A. Ioannides. (1992). Fixed design regression
for time series: Asymptotic normality. Journal of Multivariate Analysis 40,
262 - 291.
Roussas, G.G. (1996). Efficient estimation of a smooth distribution function
under mixing. In: Research Developments in Probability and Statistics, Festschrift in honor of Madan L. Puri, E. Brunner and M. Denker (Eds.), 205 - 217. VSP, The Netherlands.
Roussas, G.G. and Y.G. Yatracos. (1996). Minimum distance regression type
estimates with rates under weak dependence. Annals of the Institute of Sta-
tistical Mathematics 48, 267 - 281.
Roussas, G.G. and Y.G. Yatracos. (1997). Minimum distance estimates with
rates under mixing. In: Research Papers in Probability and Statistics,
Festschrift for Lucien Le Cam, D. Pollard, E. Torgersen and G.L. Yang (Eds.)
337 - 344. Springer-Verlag, New York.
Roussas, G.G. and D. Bhattacharya. (1999a). Asymptotic behavior of the log-
likelihood function in stochastic processes when based on a random number
of random variables. In: Semi-Markov Models and Applications, J. Janssen
and N. Limnios (Eds.) 119 -147. Kluwer Academic Publishers, The Nether-
lands.
Roussas, G.G. and D. Bhattacharya. (1999b). Some asymptotic results and ex-
ponential approximation in semi-Markov models. In: Semi-Markov Models
and Applications, J. Janssen and N. Limnios (Eds.) 149 -166. Kluwer Aca-
demic Publishers, The Netherlands.

Roussas, G.G. (2000). Contiguity of Probability Measures. Encyclopaedia of


Mathematics, Supplement II. Kluwer Academic Publishers, pages 129 -130,
The Netherlands.
Rüschendorf, L. (1987). Consistency of estimates for multivariate density functions and for the mode. Sankhyā Series A 39, 243 - 250.
Schmetterer, L. (1966). On the asymptotic efficiency of estimates. In: Research
Papers in Statistics, Festschrift for J. Neyman; F.N. David (Ed.) 301 - 317.
John Wiley and Sons, New York.
Schuster, E.F. (1969). Estimation of a probability density function and its deriva-
tives. Annals of Mathematical Statistics 40, 1187 - 1195.
Schuster, E.F. (1972). Joint asymptotic distribution of the estimated regression
function at a finite number of distinct points. Annals of Mathematical Statis-
tics 43, 84 - 88.
Scott, D.W. (1992). Multivariate Density Estimation. Wiley, New York.
Silverman, B.W. (1986). Density Estimation for Statistics and Data Analysis.
Chapman and Hall, London.
Stamatelos, G.D. (1976). Asymptotic distribution of the log-likelihood function for stochastic processes: Some examples. Bulletin of the Mathematical Society of Greece 17, 92 - 116.
Takahata, H. (1993). On the rates in the central limit theorem for weakly depen-
dent random variables. Zeitschrift für Wahrscheinlichkeitstheorie und ver-
wandte Gebiete 62, 477 - 480.
Tran, L.T. (1990). Kernel density estimation under dependence. Statistics and
Probability Letters 10, 193-201.
Tran, L.T. and S. Yakowitz. (1993). Nearest neighbor estimators for random
fields. Journal of Multivariate Analysis 44, 23 - 46.
Tran, L.T., G.G. Roussas, S. Yakowitz, and V. Truong. (1996). Fixed-design
regression for linear time series. Annals of Statistics 24, 975 - 991.
Wald, A. (1941). Asymptotically most powerful tests of statistical hypotheses.
Annals of Mathematical Statistics 12, 1 - 19.
Wald, A. (1943). Tests of statistical hypotheses concerning several parameters
when the number of observations is large. Transactions of the American
Mathematical Society 54, 426 - 482.
Watson, G.S. (1964). Smooth regression analysis. Sankhyā Series A 26, 359 - 372.
Watson, G.S. and M.R. Leadbetter. (1964a). Hazard Analysis. I. Biometrika 51,
175 -184.
Watson, G.S. and M.R. Leadbetter. (1964b). Hazard Analysis II. Sankhyā Series A 26, 110 - 116.
Wegman, E.J. and H.I. Davis. (1979). Remarks on some recursive estimators
of a probability density. Annals of Statistics. 7, 316 - 327.

Weiss, L. and J. Wolfowitz. (1966). Generalized maximum likelihood estima-


tors. Theory of Probability and its Applications 11, 58-81.
Weiss, L. and J. Wolfowitz. (1967). Maximum probability estimators. Annals
of the Institute of Statistical Mathematics 19, 193 - 206.
Weiss, L. and J. Wolfowitz. (1970). Maximum probability estimators and asymp-
totic sufficiency. Annals of the Institute of Statistical Mathematics 22, 225 -
244.
Weiss, L. and J. Wolfowitz. (1974). Maximum Probability Estimators and Re-
lated Topics. Lecture Notes in Mathematics 424, Springer-Verlag, Berlin-
Heidelberg-New York.
Withers, C.S. (1981a). Conditions for linear processes to be strong mixing.
Zeitschrift für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 477 -
480.
Withers, C.S. (1981b). Central limit theorems for dependent variables I. Zeitschrift
für Wahrscheinlichkeitstheorie und verwandte Gebiete 57, 509 - 534.
Wolfowitz, J. (1965). Asymptotic efficiency of the maximum likelihood esti-
mator. Theory of Probability and its Applications 10, 247 - 260.
Yakowitz, S. (1972). A statistical model for daily stream flow records with
applications to the Rillito river. Proceedings, International Symposium on
Uncertainties in Hydrologic and Water Resources Systems, 273 - 283. Uni-
versity of Arizona, Tucson.
Yakowitz, S. (1973). A stochastic model for daily river flows in an arid region.
Water Resources Research 9, 1271 - 1285.
Yakowitz, S. and J. Denny. (1973). On the statistics of hydrologic time series.
Proceedings, 17th Annual Meeting of the Arizona Academy of Sciences 3,
146 -163. Tucson.
Yakowitz, S. (1976a). Small sample hypothesis tests of Markov order, with
applications to simulated and hydrologic chain. Journal of the American
Statistical Association 71, 132 -136.
Yakowitz, S. (1976b). Model-free statistical methods for water table predictions.
Water Resources Research 12, 836 - 844.
Yakowitz, S. (1977a). Statistical models and methods for rivers in the southwest.
Proceedings, 21st Annual Meeting of the Arizona Academy of Sciences. Las
Vegas.
Yakowitz, S. (1977b). Computational Probability and Simulation. Addison-
Wesley, Reading, Mass.
Yakowitz, S. (1979). Nonparametric estimation of Markov transition functions.
Annals of Statistics 7, 671 - 679.
Yakowitz, S. (1985). Nonparametric density estimation, prediction, and regres-
sion for Markov sequences. Journal of the American Statistical Association
80, 215 - 221.

Yakowitz, S. (1987). Nearest-neighbor methods for time series analysis. Journal


of Time Series Analysis 8, 235 - 247.
Yakowitz, S. (1989). Nonparametric density and regression estimation for Markov
sequences without mixing assumptions. Journal of Multivariate Analysis 30,
124 - 136.
Yakowitz, S. and W. Lowe. (1991). Nonparametric bandit methods. Annals of
Operation Research 28, 297 - 312.
Yakowitz, S., T. Jayawardena, and S. Li. (1992). Theory for automatic learning
under partially observed Markov-dependent noise. IEEE Transactions on
Automatic Control 37, 1316 - 1324.
Yakowitz, S. (1993). Nearest neighbor regression estimation for null-recurrent
Markov time series. Stochastic Processes and their Applications 48, 311 -
318.
Yamato, H. (1973). Uniform convergence of an estimator of a distribution function. Bulletin of Mathematical Statistics 15, 69 - 78.
Yokoyama, R. (1980). Moment bounds for stationary mixing sequences. Zeitschrift
für Wahrscheinlichkeitstheorie und verwandte Gebiete 52, 45-57.
Yoshihara, K. (1978). Moment inequalities for mixing sequences. Kodai Mathematical Journal 1, 316 - 328.
Yoshihara, K. (1993). Weakly Dependent Stochastic Sequences and Their Ap-
plications. Vol. II: Asymptotic Statistics based on Weakly Dependent Data.
Sanseido Co. Ltd., Tokyo.
Yoshihara, K. (1994a). Weakly Dependent Stochastic Sequences and Their Ap-
plications. Vol. IV: Curve Estimation based on Weakly Dependent Data.
Sanseido Co. Ltd., Tokyo.
Yoshihara, K. (1994b). Weakly Dependent Stochastic Sequences and Their Ap-
plications. Vol. V: Estimators based on Time Series. Sanseido Co, Ltd., Tokyo.
Yoshihara, K. (1995). Weakly Dependent Stochastic Sequences and Their Ap-
plications. Vol. VI: Statistical Inference based on Weakly Dependent Data.
Sanseido, Co, Ltd., Tokyo.
Part VII
Chapter 24

STOCHASTIC ORDERING OF ORDER


STATISTICS II

Philip J. Boland
Department of Statistics
University College Dublin
Belfield, Dublin 4
Ireland

Taizhong Hu
Department of Statistics and Finance
University of Science and Technology of China
Hefei, Anhui 230026
People’s Republic of China

Moshe Shaked
Department of Mathematics
University of Arizona
Tucson, Arizona 85721
USA

J. George Shanthikumar
Industrial Engineering & Operations Research
University of California
Berkeley, California 94720
USA

Abstract In this paper we survey some recent developments involving comparisons of order
statistics and spacings in various stochastic senses.

Keywords: Reliability theory, k-out-of-n systems, IFR, DFR, hazard rate order, likelihood ratio order, dispersive order, sample spacings.

1. INTRODUCTION
Order statistics are basic probabilistic quantities that are useful in the theory
of probability and statistics. Almost every student of probability and statistics
encounters these random variables at an early stage of his/her studies because
these statistics are associated with an elegant theory, are useful in applications,
and are also a convenient tool to use in order to illustrate in a basic (though not
trivial) way some probabilistic concepts such as transformations, conditional
probabilities, lack of independence, and the foundations of stochastic processes.
In the area of statistical inference, order statistics are the basic quantities used
to define observable functions such as the empirical distribution function, and
in reliability theory these are the lifetimes of k-out-of-n systems.
In 1996 Boland, Shaked and Shanthikumar wrote a survey (which appeared
in 1998 (Boland, Shaked, and Shanthikumar, 1998)), covering most of what
had been developed in the area of stochastic ordering of order statistics up to
that time. During the last few years this area has experienced an explosion of
new developments. In this paper we try to describe and summarize some of
these recent developments.
The notation that we use in this paper is the following. Let X_1, X_2, ..., X_n be independent random variables which may or may not be identically distributed. The corresponding order statistics are denoted by X_{1:n} ≤ X_{2:n} ≤ ... ≤ X_{n:n}. Thus, X_{1:n} = min{X_1, ..., X_n} and X_{n:n} = max{X_1, ..., X_n}. If Y_1, Y_2, ..., Y_m is another collection of independent random variables, then the corresponding order statistics are denoted by Y_{1:m} ≤ Y_{2:m} ≤ ... ≤ Y_{m:m}.
Many of the recent results in the literature yield stochastic comparisons of X_{i:n} with Y_{j:m} whenever i ≤ j and n − i ≥ m − j. This is to be contrasted with previous results which yielded stochastic comparisons of the above order statistics, but under restrictions on the indices such as i = j and/or n = m. In this paper we emphasize the newer kind of comparisons. Of course, by a simple choice of the indices i, j, m and n, one usually can obtain the older results from the newer ones. In this paper we also
cover some recent results which stochastically compare sample spacings.
We assume throughout the paper that the X_i and the Y_i have absolutely continuous distribution functions, though many of the results which are described below are valid also in the more general case where the distribution functions of these random variables are quite general.
In this paper “increasing” and “decreasing” mean “nondecreasing” and “nonincreasing,” respectively. For any random variable X and any event A we denote

by [X|A] any random variable whose distribution function is the conditional


distribution of X given A. Unless stated otherwise, all the stochastic orders
that are studied in this paper are described and extensively analyzed in Shaked
and Shanthikumar (1994).

2. LIKELIHOOD RATIO ORDERS COMPARISONS


In this section we describe some recent results which yield orderings of order
statistics with respect to various likelihood ratio orders.
Recall the definition of the likelihood ratio order when the compared random
variables have interval supports (possibly infinite) that need not be identical. Let
X and Y be two absolutely continuous random variables, each with an interval
support. Let and be the left and the right endpoints of the support of X.
Similarly define and The values and may be infinite. Let
and denote the density functions of X and Y, respectively. We say that X
is smaller than Y in the likelihood ratio order, denoted as if

Note that in (2.1), when we use the convention when


In particular, it is seen that if then
Shanthikumar and Yao (1986) have introduced and studied an order which they called the shifted likelihood ratio order. The following definition is slightly more general than the definition of Shanthikumar and Yao (1986), who considered only nonnegative random variables. We say that X is smaller than Y in the up shifted likelihood ratio order, denoted as X ≤_{lr↑} Y, if

In the sequel we will also touch upon another stochastic order, introduced in Lillo, Nanda and Shaked (2001), which is defined as follows. Let X and Y be two absolutely continuous random variables with support [0, ∞). We say that X is smaller than Y in the down shifted likelihood ratio order, denoted as X ≤_{lr↓} Y, if

Note that in the above definition we compare only nonnegative random variables. This is because for the down shifted likelihood ratio order we cannot take an analog of (2.2) as a definition. The reason is that here, by taking the shift very large, it is seen that practically there are no random variables that satisfy such an order relation. Note that in the definition above, the right hand side can take on (when the shift varies) any value in the

right neighborhood of 0. Therefore we restricted the support of the compared random variables to [0, ∞).
Lillo, Nanda and Shaked (2001) have obtained the following results. Let X_1, X_2, ..., X_n be independent (not necessarily i.i.d. (independent and identically distributed)) random variables, and let Y_1, Y_2, ..., Y_m be other independent (not necessarily i.i.d.) random variables, all having absolutely continuous distributions. Then
dependent (not necessarily i.i.d.) random variables, all having absolutely con-
tinuous distributions. Then

and

For nonnegative random variables with support [0, ∞), Lillo, Nanda and Shaked (2001) were not able to prove a complete analog of (2.3) and (2.4) for the ≤_{lr↓} order. Rather, they could only show that the minima can be compared, that is,

In fact, Lillo, Nanda and Shaked (2001) showed that it is not possible to replace the up shifted order by the down shifted order in (2.4).
When the X_i [respectively, Y_i] above are i.i.d., then from (2.3)–(2.5) it follows that

and

By a suitable choice of the indices in (2.6), we obtain a comparison of order statistics from the same distribution (though the sample sizes may differ). Specifically, from (2.6) it follows that if the X_i above are i.i.d. then

this result has been obtained by Raqab and Amin (1996) and, independently,
by Khaledi and Kochar (1999).

In general, a random variable X is not comparable in the up [respectively, down] shifted likelihood ratio order with itself, unless it has a logconcave [respectively, logconvex] density function. Thus, from (2.7) it follows that if the X_i above are i.i.d. with a logconcave density function then

Similarly, from (2.8) it follows that if the X_i above are i.i.d. with a logconvex density function then

These results can be found in Lillo, Nanda and Shaked (2001).
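Whether two given densities stand in the likelihood ratio order of (2.1) can be checked numerically by testing the monotonicity of g/f on a grid. The sketch below, in Python, does this for two exponential densities, a textbook case where the order is known to hold; the specific rates are assumed for illustration.

import numpy as np

def is_lr_ordered(f, g, grid):
    """Numerical check of (2.1): X <= Y in the likelihood ratio order
    iff g(t)/f(t) is increasing in t (over the union of the supports).
    This is a grid-based monotonicity test; f and g are densities."""
    ratio = g(grid) / f(grid)
    return bool(np.all(np.diff(ratio) >= -1e-12))

# Assumed example: X exponential with rate 2, Y exponential with rate 1;
# then g(t)/f(t) = exp(t)/2 is increasing, so X <=_lr Y holds.
f = lambda t: 2.0 * np.exp(-2.0 * t)
g = lambda t: np.exp(-t)
grid = np.linspace(0.01, 10.0, 2000)
print(is_lr_ordered(f, g, grid))   # True
print(is_lr_ordered(g, f, grid))   # False: the order is not symmetric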

3. HAZARD AND REVERSED HAZARD RATE


ORDERS COMPARISONS
In this section we describe some recent results which yield orderings of order
statistics with respect to the hazard and reversed hazard rate orders. An elegant
new result is also given in this section.
Recall the definition of the hazard rate and the reversed hazard rate orders when the compared random variables have supports that need not be identical. Let X and Y be two continuous (not necessarily nonnegative) random variables, each with an interval support which we denote by (l_X, u_X) and (l_Y, u_Y), respectively; l_X and l_Y may be −∞, and u_X and u_Y may be ∞. Let F and G be the distribution functions of X and Y, respectively, and let F̄ = 1 − F and Ḡ = 1 − G be the corresponding survival functions.
We say that X is smaller than Y in the hazard rate order, denoted as X ≤_hr Y, if

Ḡ(t)/F̄(t) increases in t.  (3.1)

Note that in (3.1), when F̄(t) = 0 we use the convention Ḡ(t)/F̄(t) = ∞ when Ḡ(t) > 0. In particular, it is seen that if X ≤_hr Y then u_X ≤ u_Y. If the hazard rate functions r(t) = f(t)/F̄(t) of X and q(t) = g(t)/Ḡ(t) of Y are well defined, then

X ≤_hr Y if and only if q(t) ≤ r(t) for all t.  (3.2)

We say that X is smaller than Y in the reversed hazard rate order, denoted
as if
612 MODELING UNCERTAINTY

Again it is seen that if then


In Shaked and Shanthikumar (1994) it is shown that

for any continuous function which is strictly increasing on Also,

for any such function In Nanda and Shaked (2000) it is shown that

for any continuous function which is strictly decreasing on Also,

for any such function The latter two implications correct a mistake in Theo-
rems 1.B.2 and 1.B.22 in Shaked and Shanthikumar (1994) — the parenthetical
statements there are incorrect. These two implications often enable us to trans-
form results about the hazard rate order into results about the reversed hazard
rate order and vice versa.
The first result regarding ordering order statistics in the sense of the hazard
and the reversed hazard rate orders is the following useful proposition; later
(see Theorem 3.1) we use it in order to obtain a new stronger result.
Proposition 3.1. Let [respectively, be in-
dependent (not necessarily i.i.d.) absolutely continuous random variables, all
with support for some
(i) If for all and then

(ii) If for all and then


Boland and Proschan (1994) proved part (i) of Proposition 3.1 for nonneg-
ative random variables; however, by (3.3), the inequality in part (i) is valid
under the weaker assumptions of Proposition 3.1. Part (ii) of Proposition 3.1
strengthens Corollary 3.1 of Nanda, Jain and Singh (1998). It is worthwhile
noting that by using (3.4) and (3.5) it can be shown that part (ii) is actually
equivalent to part (i); see Nanda and Shaked (2000) for details.
We now state and prove a new result which strengthens and unifies some
previous results. The following theorem gives hazard rate and reversed hazard
rate analogs of (2.3) and (2.4).
Theorem 3.2. Let be independent (not necessarily i.i.d.)
random variables, and let be other independent (not neces-
sarily i.i.d.) random variables, all having absolutely continuous distributions
Stochastic Ordering of OrderStatistics II 613

with support for some Then

and

Proof. First we prove (3.6). Assume that for all We will


now show that there exists a random variable Z with support such that
for all Let and denote the hazard rate functions
of the indicated random variables. From the assumption that for all
it follows by (3.2) that

Let be a function which satisfies

for example, let It can be shown that


is indeed a hazard rate function. Let Z be a random variable with the hazard
rate function Then indeed for all
Now, let be i.i.d. random variables distributed as Z.
Then

and the desired result follows from the fact that the likelihood ratio order implies
the hazard rate order.
With the aid of (3.4) and (3.5) it can be shown that statement (3.7) is equiv-
alent to (3.6).
Theorem 3.2 extends and unifies the relatively restricted Theorems 3.8 and 3.9
of Nanda and Shaked (2000).
In light of (4.1) in Section 4, one may wonder whether the conclusion in
(3.6) holds if it is only assumed there that for all (rather than
for all In order to see that this is not the case, consider the
614 MODELING UNCERTAINTY

independent exponential random variables and with hazard rates 1 and


3, respectively, and the independent exponential random variables and
with hazard rates 1 and 2, respectively. Then However, it
is easy to verify that then and are not comparable
in the hazard rate order. Similarly, the conclusion in (3.7) does not follow from
the mere assumption for all
We end this section by stating some results involving hazard rate and re-
versed hazard rate comparisons of order statistics constructed from one set of
random variables. Let be independent (not necessarily i.i.d.)
absolutely continuous random variables, all with support for some
Then

The first inequality in (3.8) was proven in Boland, El-Neweihi and Proschan
(1994) for nonnegative random variables; however, by (3.3), the inequality holds
also without the nonnegativity assumption. The second inequality in (3.8) is
taken from Hu and He (2000). The inequalities in (3.9) can be found in Block,
Savits and Singh (1998) and in Hu and He (2000). Again, using (3.4) and (3.5)
it can be shown that (3.9) is actually equivalent to (3.8); see Nanda and Shaked
(2000) for details.
For the next inequalities we need to have, among the a “largest" [re-
spectively “smallest"] variable in the sense of the hazard [reversed hazard] rate
order. Again, let be independent (not necessarily i.i.d.) ab-
solutely continuous random variables, all with support for some
Then

(i) If then

(ii) If then

Boland, El-Neweihi and Proschan (1994) proved part (i) above for nonnegative
random variables; however, again by (3.3), the inequality in part (i) is valid
without the nonnegativity assumption. Part (ii) above is Theorem 4.2 of Block,
Savits and Singh (1998). Again, using (3.4) and (3.5) once more it can be
shown that part (ii) is equivalent to part (i); see Nanda and Shaked (2000) for
details.
Stochastic Ordering of OrderStatistics II 615

4. USUAL STOCHASTIC ORDER COMPARISONS


Recall the definition of the usual stochastic order. Let X and Y be random
variables with survival functions and We say that X is smaller than Y in
the usual stochastic order, denoted as if

The usual stochastic order is implied by the orders studied in Sections 2 and 3.
Therefore, comparisons of order statistics, associated with one collection of
random variables, follow from previous results in these sections. For example,
in (2.9), (3.8) and (3.9), the orders and can be replaced by
However, when we try to compare order statistics that are associated with two
different sets of random variables (that is, a set of and a set of we
may get new results because the assumption that an is smaller than a in
the usual stochastic order is weaker than a similar assumption involving any of
the orders discussed in Sections 2 and 3.
At present there are not many such results available. We just mention one
recent result that has been derived, independently, by Belzunce, Franco, Ruiz
and Ruiz (2001) and by Nanda and Shaked (2000). Let [re-
spectively, ] be independent (not necessarily i.i.d.) absolutely
continuous random variables, all with support for some Then for
any and we have that

5. STOCHASTIC COMPARISONS OF SPACINGS


Let be nonnegative random variables with the associated
order statistics The corresponding
spacings are defined by and the corresponding
normalized spacings are defined as
The normalized spacings are of importance in reliability
theory because they are the building blocks of the TTT (total time on test)
statistic. Kochar (1998) has surveyed the literature on stochastic comparisons
of spacings and normalized spacings up to 1997. In this section we emphasize
more recent advances.
Let be i.i.d. nonnegative random variables. If the common dis-
tribution function F is IFR (increasing failure rate; that is, is concave on
) then the normalized spacings satisfy
616 MODELING UNCERTAINTY

The above two results were obtained by Barlow and Proschan (1966) who also
showed that if F is DFR (decreasing failure rate; that is, log is convex on
then the inequalities above are reversed. Kochar and Kirmani (1995)
strengthened this DFR result as follows. Let be i.i.d. nonnegative
random variables. If the common distribution function F is DFR then the
normalized spacings satisfy

Khaledi and Kochar (1999) proved, in addition to the above, that if F is DFR
then the normalized spacings satisfy

Summarizing (5.1)–(5.3) we have, under the assumption that F is DFR, that

Kochar and Kirmani (1995) claimed that if are i.i.d. non-


negative random variables with a common logconvex density function then
However, Misra and van der Mue-
len (2001) showed via a counterexample that this is not correct. Pledger and
Proschan (1971) showed that if are exponential random vari-
ables with possibly different parameters (or more generally, with decreasing
proportional hazard rate functions) then

For a particular choice of the parameters of the exponential random variables,


Khaledi and Kochar (2000b) showed that the above inequality holds with
replacing
In an effort to extend this comparison to the likelihood ratio order, Kochar
and Korwar (1996) obtained the following result which compares any normal-
ized spacing to the first normalized spacing. Again, if are
exponential random variables with possibly different parameters then

The following unpublished result which compares spacings is due to Joag-


Dev (1995).
Theorem 5.1. Let be i.i.d. random variables with a finite sup-
port, and with an increasing [decreasing] density function over that support.
Then
Stochastic Ordering of OrderStatistics II 617

Proof. Let F and denote, respectively, the distribution function and the den-
sity function of Given and the conditional
density of at the point is and the conditional
density of at the point is Since is
increasing [decreasing] it is seen that, conditionally,
and therefore, conditionally, But the usual stochastic
order is closed under mixtures, and this yields the stated result.

For a sample of i.i.d. nonnegative random variables, Hu


and Wei (2000) studied sums of adjacent spacings
which they called generalized spacings. For
example, the range of a sample is a generalized
spacing, Hu and Wei (2000) showed that if is DFR [IFR] then

This result generalizes (5.3).


We end this section by describing a few results that compare normalized
and generalized spacings from two different random samples. The following
two results have been derived by Khaledi and Kochar (1999). Let
be i.i.d. nonnegative random variables with an absolutely continuous common
distribution function, and let be i.i.d. nonnegative random variables
with a possibly different absolutely continuous common distribution function.
As above, denote the normalized spacings that are associated with the by
Also, denote the normalized
spacings that are associated with the by
If and if either or is DFR, then

If and if either or is DFR, then

In fact, (5.4) can be obtained from (5.5) by taking


Of course, if we set in the above inequalities and then we
obtain, as a special case, comparisons of the spacings (rather than the normalized
spacings) that are associated with these two random samples.
Let be the generalized
spacings from a sample of i.i.d. nonnegative random variables
as described above, and let be the
generalized spacings from another sample of i.i.d. nonnegative
random variables. Hu and Wei (2000) showed that if and if or
618 MODELING UNCERTAINTY

is DFR then

Now let be independent exponential random variables with


hazard rates respectively, and let be i.i.d. ex-
ponential random variables with hazard rate As above, let
and be the normalized spacings associated with
the and the respectively. Kochar and Rojo (1996) showed that then

In fact, they showed a stronger result, namely, that the random vector
is smaller than the random vector
in the multivariate likelihood ratio order (see Snaked and Shanthikumar
(1994) for the definition).

6. DISPERSIVE ORDERING OF ORDER STATISTICS


AND SPACINGS
An order that is useful in reliability theory, and in other areas of applied
probability, is the dispersive order. Let X (with distribution function F) and Y
(with distribution function G) be two random variables. We say that X is smaller
than Y in the dispersive order, denoted by if
for all where and are the
right continuous inverses of F and G, respectively. It is well-known (see, for
example, Shaked and Shanthikumar (1994)) that

Bartoszewicz (1985) and Bagai and Kochar (1986) have shown that

In this section we describe some recent results involving stochastic ordering of


order statistics and spacings in the dispersive order.
First note that from (3.8) it follows that if are independent
(not necessarily i.i.d.) absolutely continuous DFR random variables, all with
support for some then we get the following result of Kochar (1996):

This follows from (3.8), with the aid of (6.2), and from the fact that is
DFR. If are i.i.d. DFR random variables, then Khaledi and Kochar
(2000a) showed that
Stochastic Ordering of OrderStatistics II 619

Now let be a collection of i.i.d. random variables, and let


be another collection of i.i.d. random variables. Bartoszewicz
(1986) showed that

Under the assumption Alzaid and Proschan (1992) have obtained


some monotonicity results in and of the differences
In order to obtain a more general inequality than (6.3) we need to assume
more. Khaledi and Kochar (2000a) showed that if is a sequence of
i.i.d. random variables, and if is another sequence of i.i.d. random
variables, and if or is DFR, then

Khaledi and Kochar (2000c) showed that if are independent ex-


ponential random variables with hazard rates respectively, and
if are i.i.d. exponential random variables with hazard rate
then This strengthens a result of Dyk-
stra, Kochar and Rojo (1997) who proved the inequality
when are i.i.d. exponential random variables with hazard rate

Consider now the normalized spacings that are associated with the i.i.d.
random variables From (5.1)–(5.3), from (6.2), and from the fact
that the spacings here are DFR (see Barlow and Proschan (1966)), it is seen
that if are i.i.d. nonnegative DFR random variables, then we get
the following results of Kochar and Kirmani (1995) and of Khaledi and Kochar
(1999):

or, in summary,

We end this section by considering the spacings


that are associated with the nonnegative i.i.d.
absolutely continuous random variables and the spacings
that are associated
with the nonnegative i.i.d. absolutely continuous random variables
Define the random vectors and
Recall that means that
620 MODELING UNCERTAINTY

for all increasing functions for which the expectations exist. The
following theorem is stated in Bartoszewicz (1986) with an incomplete proof.
Theorem 6.1. Let U and V be as above. If then In
particular,

Proof. Let F and G denote the distribution functions of and respec-


tively. Define and and
Clearly,
Furthermore, from (6.1) we have that

The fact that now follows from a well-known property of the multi-
variate order
Theorem 2.7 in page 182 of Kamps (1995) extends (6.4) to the spacings of
the so called generalized order statistics.
Rojo and He (1991) proved a converse of Theorem 6.1. Specifically, they
showed that if
for all then

7. A SHORT SURVEY ON FURTHER RESULTS


In this last section we briefly mention some results that give stochastic com-
parisons of order statistics and spacings in senses different than those described
in Sections 2–6.
Arnold and Villasenor (1998) obtained various results comparing order statis-
tics from the uniform (0,1) distribution in the sense of the Lorenz order. In
particular, one of their results can be described as "sample medians exhibit less
variability as sample size increases." A comparison of two maxima of indepen-
dent (not necessarily i.i.d.) random variables in the increasing convex order is
implicit in Theorem 9 of Li, Li and Jing (2000).
Barlow and Proschan (1975), pages 107–108 obtained some results com-
paring order statistics (in their words, systems) in the convex, star,
and subadditive transform orders (see Shaked and Shanthikumar (1994), Sec-
tion 3.C for a discussion on these orders). Oja (1981) showed that the convex
transform order between two distributions implies that the ratios of spacings,
that correspond to samples from these distributions, are ordered in the usual
stochastic order. In (1998a) Bartoszewicz showed that the star transform or-
der between two distributions implies that some ratios of moments of order
REFERENCES 621

statistics from these distributions have some interesting monotonicity proper-


ties. In (1998b) Bartoszewicz obtained various comparisons of functions of
order statistics from different samples in the sense of the Laplace transform
order. Wilfling (1996) described some comparison results of order statistics
from exponential and Pareto distributions in the Laplace transform order.

ACKNOWLEDGMENTS
We thank Baha-Eldin Khaledi for useful comments on a previous version of
this paper.

REFERENCES
Alzaid, A. A. and F. Proschan. (1992). Dispersivity and stochastic majorization.
Statistics and Probability Letters 13, 275–278.
Arnold, B. C. and J. A. Villasenor. (1998). Lorenz ordering of order statistics and
record values. In Handbook of Statistics, Volume 16 (Eds: N. Balakrishnan
and C. R. Rao), Elsevier, Amsterdam, 75–87.
Bagai, I. and S.C. Kochar. (1986). On tail-ordering and comparison of failure
rates. Communications in Statistics—Theory and Methods 15, 1377–1388.
Barlow, R. E. and F. Proschan. (1966). Inequalities for linear combinations of
order statistics from restricted families. Annals of Mathematical Statistics
37, 1574–1592.
Barlow, R. E. and F. Proschan. (1975). Statistical Theory of Reliability and Life
Testing, Probability Models, Holt, Rinehart, and Winston, New York, NY.
Bartoszewicz, J. (1985). Dispersive ordering and monotone failure rate distri-
butions. Advances in Applied Probability 17, 472–474.
Bartoszewicz, J. (1986). Dispersive ordering and the total time on test transfor-
mation. Statistics and Probability Letters 4, 285–288.
Bartoszewicz, J. (1998a). Applications of a general composition theorem to the
star order of distributions. Statistics and Probability Letters 38, 1–9.
Bartoszewicz, J. (1998b). Characterizations of the dispersive order of distribu-
tions by the Laplace transform. Statistics and Probability Letters 40, 23–29.
Belzunce, F., M. Franco, J.-M. Ruiz, and M. C. Ruiz. (2001). On partial order-
ings between coherent systems with different structures. Probability in the
Engineering and Informational Sciences 15, 273–293.
Block, H. W., T.H. Savits, and H. Singh. (1998). The reversed hazard rate
function. Probability in the Engineering and Informational Sciences 12, 69–
90.
Boland, P. J., E. El-Neweihi, and F. Proschan. (1994). Applications of the hazard
rate ordering in reliability and order statistics. Journal of Applied Probability
31, 180–192.
622 MODELING UNCERTAINTY

Boland, P. J. and F. Proschan. (1994). Stochastic order in system reliability


theory. In Stochastic Orders and Their Applications (Eds: M. Shaked and J.
G. Shanthikumar), Academic Press, San Diego, 485–508.
Boland, P. J., M. Shaked, and J.G. Shanthikumar. (1998). Stochastic ordering of
order statistics. In Handbook of Statistics, Volume 19 (Eds: N. Balakrishnan
and C. R. Rao), Elsevier, Amsterdam, 89–103.
Dykstra, R., S. Kochar, and J. Rojo. (1997). Stochastic comparisons of parallel
systems of heterogeneous exponential components. Journal of Statistical
Planning and Inference 65, 203–211.
Hu, T. and F. He. (2000). A note on comparisons of systems with
respect to the hazard and reversed hazard rate orders. Probability in the
Engineering and Informational Sciences 14, 27–32.
Hu, T. and Y. Wei. (2000). Stochastic comparisons of spacings from restricted
families of distributions. Technical Report, Department of Statistics and Fi-
nance, University of Science and Technology of China.
Joag-Dev, K. (1995). Personal communication.
Kamps, O. (1995). A Concept of Generalized Order Statistics, B. G. Taubner,
Stuttgart.
Khaledi, B. and S. Kochar. (1999). Stochastic orderings between distributions
and their sample spacings — II. Statistics and Probability Letters 44, 161–
166.
Khaledi, B. and S. Kochar. (2000a). On dispersive ordering between order statis-
tics in one-sample and two-sample problems. Statistics and Probability Let-
ters 46, 257–261.
Khaledi, B. and S. Kochar. (2000b). Stochastic properties of spacings in a single
outlier exponential model. Technical Report, Indian Statistical Institute.
Khaledi, B. and S. Kochar. (2000c). Some new results on stochastic comparisons
of parallel systems. Technical Report, Indian Statistical Institute.
Kochar, S. C. (1996). Dispersive ordering of order statistics. Statistics and
Probability Letters 27, 271–274.
Kochar, S. C. (1998). Stochastic comparisons of spacings and order statistics. In
Frontiers in Reliability (Eds: A. P. Basu, S. K. Basu and S. Mukhopadhyay),
World Scientific, Singapore, 201–216.
Kochar, S. C. and S.N.U.A. Kirmani. (1995). Some results on normalized spac-
ings from restricted families of distributions. Journal of Statistical Planning
and Inference 46, 47–57.
Kochar, S. C. and R. Korwar. (1996). Stochastic orders for spacings of hetero-
geneous exponential random variables. Journal of Multivariate Analysis 57,
69–83.
Kochar, S. C. and J. Rojo. (1996). Some new results on stochastic compar-
isons of spacings from heterogeneous exponential distributions. Journal of
Multivariate Analysis 59, 272–281.
REFERENCES 623

Li, X., Z. Li, and B-Y. Jing. (2000). Some results about the NBUC class of life
distributions. Statistics and Probability Letters 46, 229–237.
Lillo, R. E., A.K. Nanda, and M. Shaked. (2001). Preservation of some like-
lihood ratio stochastic orders by order statistics. Statistics and Probability
Letters 51, 111–119.
Misra, N. and E.C. van der Meulen. (2001), On stochastic properties of
spacings, Technical Report, Department of Mathematics, Katholieke Uni-
versity Leuven.
Nanda, A. K., K. Jain, and H. Singh. (1998). Preservation of some partial or-
derings under the formation of coherent systems. Statistics and Probability
Letters 39, 123–131.
Nanda, A. K. and M. Shaked. (2000). The hazard rate and the reversed hazard
rate orders, with applications to order statistics. Annals of the Institute of
Statistical Mathematics, to appear.
Oja, H. (1981). On location, scale, skewness and kurtosis of univariate distri-
butions. Scandinavian Journal of Statistics 8, 154–168.
Pledger, G. and F. Proschan. (1971). Comparisons of order statistics and spac-
ings from heterogeneous distributions. In Optimizing Methods in Statistics
(Ed: J. S. Rustagi), Academic Press, New York, 89–113.
Raqab, M. Z. and W.A. Amin. (1996). Some ordering results on order statistics
and record values. IAPQR Transactions 21, 1–8.
Rojo, J. and G.Z. He. (1991). New properties and characterizations of the dis-
persive ordering. Statistics and Probability Letters 11, 365–372.
Shaked, M. and J.G. Shanthikumar. (1994). Stochastic Orders and Their Appli-
cations, Academic Press, Boston.
Shanthikumar, J. G. and D.D. Yao. (1986). The preservation of likelihood ratio
ordering under convolution. Stochastic Processes and Their Applications 23,
259–267.
Wilfling, B. (1996). Lorenz ordering of power-function order statistics. Statistics
and Probability Letters 30, 313–319.
Chapter 25

VEHICLE ROUTING WITH STOCHASTIC DEMANDS:


MODELS & COMPUTATIONAL METHODS

Moshe Dror
Department of Management Information Systems
The University of Arizona
Tucson, AZ 85721, USA
mdror @ bpa.arizona.edu

Abstract In this paper we provide an overview and modeling details regarding vehicle
routing in situations in which customer demand is revealed only when the vehicle
arrives at the customer’s location. Given a fixed capacity vehicle, this setting
gives rise to the possibility that the vehicle on arrival does not have sufficient
inventory to completely supply a given customer’s demand. Such an occurrence
is called a route failure and it requires additional vehicle trips to fully replenish
such a customer. Given a set of customers, the objective is to design vehicle
routes and response policies which minimize the expected delivery cost by a
fleet of fixed capacity vehicles. We survey the different problem statements and
formulations. In addition, we describe a number of the algorithmic developments
for constructing routing solutions. Primarily we focus on stochastic programming
models with different recourse options. We also present a Markov decision
approach for this problem and conclude with a challenging conjecture regarding
finite sums of random variables.

1. INTRODUCTION
Consider points indexed in a bounded subset B in the Eu-
clidian space. Given a distance matrix between the point pairs, a traveling
salesman problem solution for points is described by a cyclic per-
mutation such that
is minimized over all cyclic permutations where represents
the point in the position in In this generic traveling salesman problem
(TSP) statement it is assumed that all the elements (positions of the points
and the corresponding distances) are known in advance. In a setting like this,
stochasticity can be introduced in a number of ways. First, consider that a
626 MODELING UNCERTAINTY

problem instance is represented by a subset of the given points. Say a subset


is selected. Suppose that the prob-
lem instance represented by S occurs with probability Now suppose that
given a cyclic permutation on the solution is cal-
culated with respect to the point ordering determined by by simply skipping
the points, implying that if however
then the term appearing in the summation of pair distances is for the
minimal positive integer such that For a given the expected
distance is Thus, this stochastic version of the TSP is to
find a cyclic permutation over that minimizes the expected dis-
tance. This stochastic optimization problem is known as the probabilistic TSP
or PTSP for short (see Jaillet, 1985).

Another stochastic version of the TSP can be stated in terms of the distance
matrix by assuming a nonnegative random variable
with a known probability distribution for each pair
The value serves as a travel time factor for the distance
We simply define a new random variable as a random travel time
between points and In this setting an optimal cyclic permutation would
be one which minimizes the expected TSP distance with respect to the
matrix. In real-life terms, this setting seems appropriate when all points have to
be visited; however, the travel time between pairs of points is a random variable
with a known distribution (Laipala, 1978).

Clearly, there are real-life examples for both of the above problems and for
a hybrid of the two. However, in this chapter we survey yet a different setting
of a stochastic routing problem which we refer to as a vehicle routing with
stochastic demands (SVRP). For instance, in the case of automatic bank teller
machines the daily demand for cash from each machine is uncertain. The max-
imal amount of cash that may be carried by an armored cash delivery vehicle
is limited for security and insurance reasons. Thus, given a preplanned route
(sequence of cash machines), there might not be enough cash on the designated
vehicle to supply all the machines on its route resulting in a delivery stockout
(referred to as a route failure), forcing a decision of how to supply the ma-
chines which have not been serviced because the cash run out. The problem
can be described as follows: Given the points in some bounded
subset space, a point (the depot), and a positive real
value (capacity of a vehicle). Associate with each
a bounded, nonnegative random variable (demand at The objective is to
construct ( is determined as a part of the solution) cyclic paths all
sharing the point – {0}, and some paths may be sharing other points as well,
such that for each realization of the demand vector the demand
Vehicle Routing with Stochastic Demands: Models & Computational Methods 627

on each of the cyclic paths does not exceed Q–vehicle capacity, the total real-
ized demand is less than or equal to and all realized demands are satisfied.
A key assumption is that the value is revealed only when a vehicle visits the
point for the first time. Note that the actual construction of the cyclic paths
might depend on demand realizations at the points already visited. The objec-
tive is to find a routing solution, perhaps in the form of routing rules, which has
a minimal expected distance. Note that some demand values might be split-
delivered over a number of these cyclic paths. The problem can be viewed as a
single vehicle problem with multiple routes all visiting the point {0} - the depot.

The above version of the SVRP has been associated with real-life settings
such as sludge disposal (Larson, 1988) and delivery of home heating oil (Dror,
Ball, and Golden, 1985), among others. In these two examples there is no
advanced reporting of the values which represent current inventory levels or
the size of replenishment orders. Thus, the amount to be delivered becomes
known only as a given site is first visited.

2. AN SVRP EXAMPLE AND SIMPLE HEURISTIC


RESULTS
Consider a single route. Given customers 1,2,..., and a depot indexed
0, we have different routes for the customers, each corresponding to a
cyclic permutation over {0,1,2,..., }. Let be the set of all routes (cyclic
permutations) and for a particular route denote by the first customer,
the second customers, and in general by the customer. would
be the last customer before returning to the depot. The cost for route is simply

Denote by the probability that a vehicle of capacity Q runs-out of


the delivered commodity at or before customer on route Assuming that
the customers’ demands are independent random variables with nonnegative
means, the sequence is a nondecreasing sequence for any
Moreover, for normally distributed customer demands with coefficient of vari-
ation and the expected value for any
the sequence is strictly convex (see Dror and Trudeau,
1986).

Given the above condition that (i.e., the


expected demand of customers on the route is less than the vehicle’s capacity),
in Example 1 below, we show that the probabilities of the vehicle running-out
of commodity at points along the route can be quite significant, and can result
in a vehicle being forced to return to the depot for refill. Thus, one cannot
628 MODELING UNCERTAINTY

assume a priori that a route delivery sequence will be carried out as planned
without interruption. In addition, given a ordered sequence of customer visits
on an undirected network (symmetric distances), we show in Example 2 below
that unlike deterministic vehicle routing problem, it makes a difference if such
a sequence is visited as ordered "from left to right" or in reverse "from right to
left".

Example 1: Assume a route consisting of 20 customers. Customers’ de-


mands are represented by independent, identical, normally distributed ran-
dom variables We can tabulate the function for two
different values of the coefficient of variation
where
for a small such as The probability of not reaching the last
customer (customer = 20) without being forced to return to the depot for
refill is 0.3228 and 0.4052 respectively for coefficient of variation
and 20/21 (Dror and Trudeau, 1986). For the higher coefficient of variation the
probability of never reaching the customer is also high (= 0.3085).
Thus, the likelihood of a route failure in these settings is not negligible.

Example 2: Assume a route with five customers located at (5,0), (5,5), (0,5),
(–5,5), and (–5,0), and a depot located at (0,0) forming a rectangle of length
10 and width 5. The customers are denoted as 1,2,3,4, and 5, respectively.
Assume a straight line travel distance between the locations and the expected
demands of and 4, and for the customer
located at (–5,0). We also assume a very simple recourse policy in the case of
route failure by servicing each customer who was not delivered a full demand
on the planned route individually (with multiple back and forth trips). We de-
note the route planned counterclockwise by
and the (opposite direction) clockwise route as

The expected travel distance for the route is calculated as follows:


Vehicle Routing with Stochastic Demands: Models & Computational Methods 629

The expected travel distance for the route is calculated similarly.

In the case that the customers’ demands are independent, normally distributed
random variables with mean demands as described above and an identical co-
efficient of variation of the expected travel distances for the two routes
are quite different. For it is equal to 40.5563, and for the expected
travel distance is 48.8362 (Dror and Trudeau, 1986).

A classical, and perhaps the most popular, heuristic for constructing vehicle-
routing solutions for the case of deterministic customers’ demands is the so-
called Clark and Wright heuristic (Clark and Wright, 1964). The thrust of the
Clark and Wright heuristic route construction is in the concept that if two de-
liveries on two different routes can be joined on a single route, savings can be
realized. In the VRP the savings are calculated for each pair of delivery
points as (assuming a symmetric travel matrix).
The savings are ordered and the customers are joined according to the largest
saving available, as long as the demand of the combined route does not exceed
the capacity of the vehicle – Q. This basic savings idea can be generalized for
the stochastic vehicle routing case as follows:

= [the expected cost of the route with customer on it] + [the expected
cost of the route with customer on it] - [the expected cost of the combined
route where customer immediately precedes customer

When computing the savings terms in the stochastic version of Clark and
Wright heuristic one has to account for the direction of the route. For each pair
of points two different directional situations have to be considered and
only the one with the highest saving value will be kept.

In Dror and Trudeau (1986), the above stochastic Clark and Wright heuristic
was implemented to construct a routing solution for a 75 customer test prob-
lem in Eilon, Watson-Gandy, and Christofides (1971). The truck capacity for
this experiment is set at Q = 160. The depot is located at a center point
of a square 80 × 80, however, we do not list here the coordi-
nates for all the 75 points. Table 1 lists the routes constructed by the stochastic
savings heuristic. It is interesting to note that some of the 10 routes constructed
have a high probability of failure. For instance, routes marked as 9 and 10 have
a probability of failure of 0.239 and 0.679 respectively. The expected demand
of route 10 exceeds the truck capacity. However, upon closer examination of the
two routes we find that the probability of route failure before the last customer
is negligible, and the failures, if occurring at all, are only likely to materialize
as the truck arrives at the last customer who is located very close to the depot
630 MODELING UNCERTAINTY

(customer 4 on route 9 is located at and customer 75 of route


10 is located Thus, the projected penalties (the cost) for
route failures (a round trip to the depot and back) are very small.

2.1. CHANCE CONSTRAINED MODELS


A different way of dealing with demand uncertainty and the potential route
failure when a vehicle does not have enough commodity left to satisfy all cus-
tomer’s demand is by modeling the stochastic vehicle routing problem in the
form of chance-constrained programming. Essentially, for a given customer
demands parameters, such as demand distributions with their means and vari-
ances, one subjectively specifies a control probability for a route not to incur
a route failure. Following Stewart and Golden (1983), we restate below their
chance-constrained SVRP formulation.

where is a binary decision variable which takes the value 1 if vehicle travels
directly from to and is 0 otherwise; NV denotes the number of available ve-
hicles; is the set of feasible routes for the traveling salesman problem with
NV salesmen. and Q are defined as before and is the maximum allow-
able probability that a route might fail. The chance-constrained SVRP model
presented above is in the spirit of mathematical models developed by Charnes
and Cooper in the 50ties and early 60ties. One of the main premises of such
models was that complicated stochastic optimization problems are convertable
into equivalent deterministic problems while controlling for the probability of
Vehicle Routing with Stochastic Demands: Models & Computational Methods 631

"bad" events such as route failures for the SVRP. Stewart and Golden (1983)
showed that this conversion process (stochastic deterministic) of the SVRP is
possible for some random demand distributions. In addition, a number of sim-
ple penalty-based SVRP models have also been proposed by the same authors
and are restated below.

where is a fixed penalty incurred each time route fails.

where is a penalty per unit demand in excess of Q on route and


is the expected number of units by which the
demand on route exceeds the capacity of the vehicle.

The apparent problem with the above modeling approach for the SVRP is
that designing a complete set of vehicle routes while controlling or penalizing
route failures irrespective of their likely location and the cost of such failures,
might result in bad routing decisions. It is important to remember that the cost
of the recourse action taken in response to route failure is more critical than the
mere likelihood of such failure.

3. MODELING SVRP AS A STOCHASTIC


PROGRAMMING WITH RECOURSE PROBLEM
Given that a customers’ exact demands are revealed only when the delivery
vehicle visits the customer and that the driver is given the sequence in which
the customers need to be serviced while allowing him to return to the depot and
refill even before the vehicle actually runs out of commodity, how would we
design an optimal routing sequence including the option of dynamic decision
rules for when to refill the vehicle ?

This is the theme of a recent paper by W.-H. Yang, K. Mathur, and R.H. Ballou
(2000), in which a customer’s stochastic demand does not exceed the vehicle
capacity Q. For simplicity, they assume that has a discrete distribution with
possible values with probability mass function
In their solution of the SVRP, Yang, et al. adopt a simple recourse action
of returning to the depot whenever the vehicle runs out of stock, in addition to
preset refill decisions based on the demand realizations along the route. Hence
the vehicle may return to the depot before stockouts actually occur. What is
632 MODELING UNCERTAINTY

interesting in this simple recourse setting is that given a route (represented as


a sequence of customers for each customer
there exists a quantity such that the optimal decision in terms of the expected
length of the route, is to continue to customer after serving customer
if the quantity on the vehicle at that point is and to return to the depot
to refill if it is More formally, suppose that upon completion of delivery
to customer the remaining quantity in the vehicle is Let denote the
total expected cost for completing the deliveries from customer onward. Let
be the set of all possible loads on the vehicle after the delivery to customer
then, satisfies the following dynamic programming recursion,

with the boundary condition


Assuming that the cost matrix satisfies the triangular inequality, then
which leads to Theorem 1 below.

Theorem 1. (Yang et al., 2000) For each customer there exists a quantity
such that the optimal decision, after serving node is to continue to node
if or return to the depot if

For the proof we refer the reader to the original paper and to Yang’s (1996)
thesis. The computation of the values is recursive and for a given routing se-
quence requires computing the two parts of equation (1). Since there are many
different routing options, constructing the best route for the SVRP with this
simple recourse scheme requires direct examination of these routing options
(say by enumeration) and is only feasible for small problems. In Yang (1996),
a branch-and-bound algorithm is described which produces optimal solutions to
some typical problems of up to 10 customers. Since this recourse policy allows
for restocking the vehicle at any point on its route, even when it is clear that
the total customer demand exceeds the vehicle’s capacity, it is not necessary to
consider multiple routes. In fact, Yang, Mathur, and Ballou (2000), prove that a
single route is more efficient than multiple vehicle route system. Obviously, this
result assumes no other routing constraints such as time duration, etc. which
might require implementation of a multiple route system.
Vehicle Routing with Stochastic Demands: Models & Computational Methods 633

Laporte, Louveaux, and Van hamme (2001) exmines the same problem from
a somewhat different perspective. The SVRP is examined and an optimal so-
lution methodology by means of an integer L-shaped method is proposed for
the simple recourse of back and forth vehicle trips to the depot for refill in the
case of route failure. We restate below the main points of the solution approach
from Laporte et al. (2001), and Laporte and Louveaux (1993), as applied to
the SVRP model in Dror et al. (1989). However, since Yang et al. (2000)
have shown that single route design is more efficient than multiple routes, and
(based on Theorem 1, Yang et al. 2000) a better recourse policy is obtained
by allowing the return to the depot for refill even if the vehicle did not run-out
of commodity at node we modify the L-shaped solution approach. That is,
we assume that a single delivery route will be traced until either a route failure
occurs followed by a round trip to the depot, or based on the result of Theorem
1, the vehicle returns to the depot to refill before continuing the route.

3.1. THE MODEL


Let be the cost of the routing solution when
is the vector of the routing decisions
if the vehicle goes directly from node to node and otherwise), and
q is a vector of the customer demands which are revealed one at a time when the
vehicle visits the customer. Clearly, both and are random variables. The
cost of a particular routing solution predicated on a realization of is simply
an additive term Thus, the SVRP model can be stated as
follows

subject to

The cost term can be viewed as a two part cost of a given solution.
In the first part we have the term cx denoting the cost of the initial pre-planned
634 MODELING UNCERTAINTY

routing sequence represented by x, and the second part, denoted by Q(x, q),
would be the cost of recourse given x and a realization of q. Thus, Q(x, q)
reflects the cost of return trips incurred by route failures and decisions to return
for refill before route failures, minus some resulting savings. In this represen-
tation, we write simply Note that the two routing
vectors and x are not the same. The binary vector x represents an initial TSP
route, whereas is the binary routing vector which includes all the recourse
decisions. We want to keep the vector binary and for this purpose we assume
that (i) the probability of a node demand to be greater than capacity of a vehicle
is zero, and (ii) the probability that a vehicle, upon a failure, will go back to the
depot after returning to a node to complete its delivery is also zero. Since the
vector x represents a routing solution for a single route, it satisfies constraints
(3)-(6) ((3) and (4) with equality only). Setting the expectation with respect to
q denoted as the objective function (2) becomes

In principle, the function Q(x) can be of any sign and is bounded (from below
and above). However, in our case Constraints (3)-(6) ensure connec-
tivity and conservation of flow. At this point we describe the function Q(x, q)
in the standard framework of the Two-Stage Stochastic Linear Programs. In
the second stage the customer deliveries are explicitly represented.

where y is the binary vector representing the recourse initiated trips to the de-
pot. T(q) represents the deliveries made by the x vector given q, is the
demand realization for q which has to be met (delivered) either by x or the
recourse y.

The integer L-shaped solution method of Laporte et al. (2001), is a branch-


and-cut procedure modified for the stochastic setting such as the SVRP by
combining steps based on the combinatorial nature of the problem with an
estimation of the stochastic cost contribution for each partial solution. We
restate below the model for the problem statement required by the branch-and-
cut (integer L-shaped) method.
(SVRP)

subject to
Vehicle Routing with Stochastic Demands: Models & Computational Methods 635

3.2. THE BRANCH-AND-CUT PROCEDURE


The branch-and-cut procedure operates on the so-called ‘current problem’
(CP) obtained from the SVRP by: (a) relaxing the subtour elimination con-
straints (12), (b) at times solving a Linear Programming relaxation of it, and (c)
replacing Q(x) by a lower bound in the estimation of the solution value of
the (CP). This method assumes that given a feasible solution x, then Q(x) can
be computed exactly. Moreover, a finite lower bound L on Q(x) is assumed
to be available. Below, we repeat the steps of the procedure as presented in
Laporte et al. (2001).

1 Set the iteration count and introduce the bounding constraint


into (CP). Set the value of the best known solution equal to
The only pendant node corresponds to the initial current problem.

2 Select a pendant node from the list. If none exists stop.

3 Set and solve (CP). Let be an optimal solution.

4 Check for any violations of constraints (12) or integrality values for and
introduce at least one violated constraint. At this stage, valid inequalities
or lower bounding functionals may also be generated. Return to Step 2.
Otherwise, if fathom the current node and return to Step
1.

5 If the solution is not integer, branch on a fractional variable. Append


the corresponding subproblems to the list of pendant nodes and return to
Step 1.

6 Compute and set If

7 If then fathom the current node and return to Step 1.


Otherwise add a valid cut to (CP) and return to Step 2.
636 MODELING UNCERTAINTY

3.3. COMPUTATION OF A LOWER BOUND ON Z*


AND ON Q(X)
Since the solution to the SVRP described here includes the policy option of
returning to the depot just to refill and continue the route even if the vehicle did
not empty (as in Yang et al. 2000), the expected optimal solution value is never
more than that obtained by the L-shaped method of Laporte et al. (2001). If
we denote the optimal solution value obtained by implementing the L-shaped
method of Laporte et al. by and by the optimal solution value when
solving the SVRP incorporating the Yang, et al (2000) recourse options, then
we have that Moreover, both terms of the objective function
(7): cx and in the Laporte et al. (2001) solution are
respectively to the corresponding two terms when the recourse options are
expended as in Yang et al. (2000). Below we describe the lower bounding
calculations as in Yang (1996).

Since we are describing a branch-and-bound enumerative tree, we assume


that a node (a current problem - (CP) in the L-shaped method) in such a tree
can be represented by two sets E and I. The set E is the set of variables
which are set to zero, that is, the arcs which are excluded from the TSP solu-
tion. The set I consists of the variables which are set to one, these being the
variables which describe a (partial) TSP tour. If we denote the current branch-
and-bound node as node the corresponding E and I sets will be marked
by a superscript We describe the partial TSP tour determined by
the set as the sequence of nodes 0. To find the least ex-
pected cost given the sets (and the corresponding TSP path) we note that

The cost of the traveling salesman path though the nodes in


ending in the node can certainly serve as a lower
bound on the first part of the above cost in (**). In addition, since is a
nonincreasing function of gives a lower bound on the secondcom-
ponent of the above cost (**). Since there are a number of good codes which
solve in reasonable time large TSP instances of about 10,000 node problems
(Applegate et al. 1998), we assume that such a TSP code can be called to
generate an optimal path solution for the set ending in
the node Denote the TSP solution value for ending
in the node by To further tighten the lower bound we can employ
a number of additional procedures. For instance, given that the vehicle has a
Vehicle Routing with Stochastic Demands: Models & Computational Methods 637

finite capacity and the expected demand of the set is


greater than the vehicle capacity, the vehicle is likely to exhaust its load and go
back to the depot before completing the route through the set of unconnected
customers. At any node in the set the vehicle returns
to the depot when either (a) there is a route failure, in which case the vehicle has
to resume the route at with an additional cost of or (b) the vehicle goes
back to the depot after replenishing customer for a refill and then proceeds
directly to the next customer, say incurring an additional cost of

In order to find a valid lower bound, one can rank all costs of types (a) and
(b) among the nodes in and compute the least expected
additional cost as follows:

where and Clearly, is an upper


bound on potential replenishments for M. Note that this is a conservative lower
bound which requires an calculations.

One lower bound calculation for the expected cost through the partial tour is
obtained by assuming that the vehicle starts the partial tour at its full capacity.
Denote this lower bound as A somewhat tighter lower bound is described
in Yang (1996).

Combining the 3 lower bounds we obtain a valid lower bound at a node of


the branch-and-bound tree as In case one wants
to relax the corresponding TSP path problem and solve its linear programming
relaxation, adding subtour elimination constraints and valid TSP facets and
inequalities as in Laporte et al. (2001), a weaker lower bound at node is ob-
tained. However, this way one combines the TSP solution with the stochastic
programming solution procedure. Essentially, at a current branch-and-bound
node we have both the cx value and the lower bound estimate on making
it possible to use the 7 steps as outlined in Laporte et al. (2001). With this
procedure we should be able to solve to optimality problems with up to 100
customers as Laporte et al. did, but the solution quality should be better since
we are proposing to incorporate the finding of Yang et al. (2000).
638 MODELING UNCERTAINTY

In this context of restricted recourse (the original delivery sequence will


be followed), it is important to mention the work of Bertsimas (1992), and
Bertsimas et al. (1995). In those two papers given an identical discrete demand
distribution for each customer an asymptotically optimal heuristic is proposed
together with a closed expression for computing the expected length of such
a tour, followed by an empirical investigation of the heuristic’s behavior (first
paper). In the second paper, enhancement rules are proposed using dynamic
programming for selecting return trips to the depot.

4. MULTI-STAGE MODEL FOR THE SVRP


This section is based on Dror (1993), which provides a multi-stage stochastic
programming model for the SVRP without restricting the recourse options a pri-
ori. In principle, a multi-stage stochastic programming model can be viewed as
follows: Given certain initial (deterministic) information vec-
tor (think of it as the distance matrix and vehicle capacity Q), an
decision vector is selected at some cost
After executing (for instance determining and visiting the first customer), a
new information vector, is obtained and a new decision,
is selected at a cost This sequence of information de-
cision information decision, continues up to K stages. At each
stage the decisions are subject to constraints and depend on the
actual realizations of the vectors of random variables Essen-
tially, the decisions represent a recourse after new information or
new realization of values for the random variables has been observed.

A multi-stage stochastic (linear integer) program with fixed recourse can be


modeled in principle as follows (see also Birge, 1985):

(SVRP) Minimize

subject to
Vehicle Routing with Stochastic Demands: Models & Computational Methods 639

where is a known (cost) vector in


is a known vector in is a random defined on the probabil-
ity space are correspondingly dimensioned
real-valued matrices. For the SVRP the vector represents routing decisions
made in the first period. Future decision vectors depend on previous realiza-
tions of decisions and the distribution parameters
of as well as the problem’s static constraints (like vehicle-capacity
Q). The adoption of this model for the SVRP is a relatively straightforward
exercise presented next. But first we state a few simple observations regarding
properties of the SVRP multi-stage solution. The two assumptions are that (i)
(i.e., no individual demand is greater than the vehicle
capacity), and (ii) if a vehicle visits a customer, it has to deliver to that customer
whatever fraction of the customer’s demand the vehicle carries on arrival. If
the vehicle is empty it has to return directly to the depot (i.e., no visits for just
collecting demand information are allowed). Under these assumptions and a
distance matrix which satisfies triangular inequality in the strong sense, Dror
(1993) observed that:

(1) Any arc is traversed at most once in the optimal SVRP solution.
(2) The number of trips to the depot in an optimal SVRP solution is less than
or equal to
(3) In the optimal SVRP solution no customer will be visited more than
times.

The above three observations allow us to represent the SVRP as a problem on


a ‘new’ expanded (minimal) graph with the property that any optimal stochastic
vehicle routing solution of the original problem corresponds to a Hamiltonian
cycle on the ‘new’ graph. Moreover, any Hamiltonian cycle on the ‘new’ graph
describes an SVRP solution which satisfies the three above observations, given
that the realization of customers’ demands makes such a solution feasible with
respect to the vehicle’s capacity. The ‘new’ expanded graph is constructed as
follows: Duplicate each customer node so each such node is represented by
nodes. The depot is represented by nodes. Thus, the ‘new’ graph has
nodes. All nodes representing a single customer
are interconnected by arcs of zero distance and connected to the other nodes
by arcs with the same distance as in the original graph. The same is applied
for the depot nodes. On this new graph we seek a Hamiltonian tour (a TSP
solution) which is equivalent to minimizing the expected cost of the SVRP as
a multi-stage problem.

First, a notational convention. Let denote a cyclic permutation of the node


numbers with the following property:
640 MODELING UNCERTAINTY

(the depot node),

(customer nodes).

The permutation preserves the relative order of the set of nodes in the
new graph which represent a node in the original graph. Given that customer
is fully replenished, say on the visit to that customer, than the rest of the
nodes corresponding to customer are visited in succession at no additional
cost, which is consistent with the property of

Let and similarly, let


Set

The interpretation of is that the cyclic permutation selects customer


to be visited first, second, etc. Note that after the first visit to a cus-
tomer, assuming that he will be replenished within a short time, during which
his consumption is neglegible, the customer’s demand is known with certainty.
However, visiting a customer for the purpose of obtaining the exact demand
value without any delivery is not allowed.

4.1. THE MULTI-STAGE MODEL


The SVRP is stated as a multi-stage stochastic programming model with
stages over the expanded complete graph with nodes.
Minimize

subject to
Vehicle Routing with Stochastic Demands: Models & Computational Methods 641

where denotes the flow from node to node and cannot exceed
Q-vehicle capacity. Each node in the graph is visited exactly once. Constraints
(25) are stated in a symbolic form expressing the fact that disconnected subtours
are not allowed in the solution.

The above model is stated purely for conceptual understanding of the rout-
ing decisions with respect to new demand information as it is revealed one
customer at a time. From the computational prospective, this model is not very
practical and would require a number of restrictive assumptions (in terms of
demand distributions, recourse options, etc.) before operational policies could
be computed.

5. MODELING SVRP AS A MARKOV DECISION


PROCESS
We restate here a Markov decision process model for the SVRP as presented
in Dror (1993) which is very similar to the model presented in Dror et al. (1989).

Consider a single vehicle of fixed capacity Q located at the depot and assume
simply that no customer ever demands a quantity greater than Q. In essence,
all our assumptions are the same as before. One can assume discrete demand distributions at the customers’ locations in an attempt to reduce the size of the state space (Secomandi, 1998); however, in this model presentation we assume continuous demand distributions. A basic rule regarding vehicle deliveries is that once a vehicle arrives at a customer location it delivers the customer’s full demand or as much of the demand as it has available. Thus, upon arrival (and
after delivery) only one decision has to be taken in case there is some com-
modity amount left on the vehicle: Which location to move next. The vehicle

automatically returns to the depot only when empty, or when all customers have
been fully replenished. In the initial state the vehicle is at the depot fully loaded
and no customer has been replenished. The final state of the system is when all
customers have been fully replenished and the vehicle is back at the depot.

The states of the system are recorded each time the vehicle arrives for the
first time at one of the customer locations and each time the vehicle enters the
depot. Let be the times at
which the vehicle arrived for the first time at new customer location. These are
transition times and correspond to the times at which decisions are taken. The
state of the system at time is described by a vector
where denotes the position of the vehicle, and de-
scribes the commodity level in the vehicle. implies automatically
(i.e., the vehicle is full at the depot). If customer $i$ has been visited, then his exact demand is known and, after a replenishment (partial or complete), $d_i$ denotes the remaining demand. In this case, $0 \le d_i \le Q$. If the customer has not been visited yet, then $d_i$ is set to -1 (that is, the demand is unknown). The state
space is a subset S of which satisfies
the above conditions. Given a transition time a decision is selected from de-
cision space where and
means that the vehicle goes from its present position, say to customer whose
demand is yet undetermined, and on its route from to it replenishes a subset
of customers P (a shortest path over that set and its end points) whose demand
is already known. In that case the vehicle might also visit the depot. In many
cases the subset P may be empty. For instance, at the first and second transi-
tion times. The decision is admissible only if The decision
whenever or for (and every customer
has been fully replenished). For each let denote the set of
admissible decisions when the system is in state and
the set of admissible state-decision pairs.

At transition time the system is in some state and a decision has to be taken. The time for the next transition is dependent on
and is deterministic (based on the deterministic distance matrix). The next
state is generated according to the probability distribution which governs
the demand variables. More specifically, it governs the demand variable at the vehicle location of the next transition time. Suppose is
the observed state at the current transition time and is the decision
taken. Then the time until the next transition time is simply
where is the location of the vehicle at transition time
and In the case when then the time until the
next transition time is the shortest path travel time through the set P

ending at location In this model we assume that the service time is zero. Let
be the transition law. That is, for every Borel subset of
is the probability that the next state belongs to given and

A control policy is a function such that the selected decision is admissible in every state. An optimal solution policy for this SVRP Markov decision
model minimizes the expected travel time described by a decision sequence
Its starting state is and terminating state is
That is, minimizes the following expected value

where is the number of transitions and is the usual time distance between
and if the next transition occurs at without visiting any other nodes, or is
the shortest time path from to if a subset of nodes is replenished between
the two transition times.

Since this model was presented in Dror et al. (1989) and Dror (1993), there
has been (to our knowledge) only one substantial attempt to solve this hard
problem as a Markov decision model: This was done by Secomandi (1998)
and explored further in the form of heuristics in Secomandi (2000). Secomandi
(1998) reports solving problems with up to 10 customers. However, in order
to simplify the state space somewhat, Secomandi assumes identical discrete
demand distributions for all the customers. Clearly, this is a very hard problem
for a number of reasons. One is the size of the state space. Another reason
is that some subproblems might require exact TSP path solutions. However,
so far this is the most promising methodology for solving the SVRP exactly
without narrowly restricting the policy space.
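To fix ideas, the following minimal Python sketch shows one way the state vector and a single transition can be encoded. The encoding and names are ours, and the depot-refill moves and the path through a set P of known-demand customers are omitted for brevity.

    def initial_state(n, Q):
        # (vehicle position, commodity level, residual demands; -1 = unknown).
        # Position 0 is the depot: a full vehicle, no customer yet served.
        return (0, Q, (-1,) * n)

    def move_to(state, j, demand_sampler):
        # Decision: travel to a yet-unvisited customer j (customers are 1..n).
        # The demand is revealed on arrival; the vehicle delivers the full
        # demand or as much as it carries (the basic delivery rule above).
        pos, load, res = state
        d = demand_sampler(j)
        delivered = min(load, d)
        res = res[:j - 1] + (d - delivered,) + res[j:]
        return (j, load - delivered, res)

    # Example: three customers with demands uniform on [0, 4] (illustrative).
    import random
    s = initial_state(3, Q=10.0)
    s = move_to(s, 2, lambda j: random.uniform(0.0, 4.0))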

6. SVRP ROUTES WITH AT MOST ONE FAILURE – A MORE ‘PRACTICAL’ APPROACH
In section 2 of this paper we examined probabilities of route failure for the
case of independent normally distributed demands as a function of the coeffi-
cient of variation. We illustrated that for ‘reasonably’ constructed vehicle routes
(routes with total expected demand of customers below the vehicle capacity)
the likelihood of route failures is not negligible. However, it is quite reasonable
to assume that the likelihood of more than 1 or 2 failures on a route is negligi-
ble. Knowing that the number of SVRP route failures is small is certainly of
practical interest and is the topic of the paper by Dror et al. (1993). Here we
restate the main contributions of that paper.

In a typical vehicle routing setting with a large number of customers - like the
case of propane distribution (Dror et al. 1985, Larson 1988) - time constraints
for completing the daily deliveries cast the problem as a multi-vehicle delivery
problem with each vehicle assigned a route on which the customers’ demand
values are uncertain. From the point of view of a single route with customers
on it, designing a route with the likelihood of numerous route failures makes
little or no practical sense. In practice, a dispatcher of propane delivery vehicles
seldom expects a delivery truck to return more than once to the depot for refill
in order to satisfy all customers’ demands.

The probability of a route failure at a customer of a route is simply


We will return to this probability equation
later. However, in this section we assume that the probability of more than one
failure on a route is negligible (see section 2). Moreover, we assume that if a
failure occurs on a route, it will occur only at the last customer. That is, given a
set N, of customers,
and

In this case, one can construct a tour assuming that in the case of a route
failure at a last customer the vehicle returns to the depot to refill and then incurs
the cost of a back-and-forth trip to the last customer. To construct an optimal
SVRP solution in this case requires solving a TSP in which for all customers
the cost is replaced by A better solution might be
obtained by permitting a return to the depot for refill at a suitable location along
the route and thus prevent the potential of route failure at the last node. This
recourse option can be formulated as a TSP by introducing 2 artificial nodes in
the following manner: Let and be the two artificial nodes. If the
node is entered from one of the nodes in N (note that the depot is denoted
by {0}), it indicates that the solution requires a ‘preventive refill’ (a return to
the depot before the last customer). If the node is entered from one of the
nodes in N, the solution contains no ‘preventive refill’. Thus, only one of the
two artificial nodes will be visited. The costs associated with the nodes
and are:

Define singleton sets and a set


The problem then becomes that of solving what is referred
to as asymmetric Generalized TSP (GTSP) over the sets In
the GTSP one has to construct a minimal cost tour which visits exactly one

element in each of the subsets. Solution methodologies for the GTSP have been
developed by Noon and Bean (1991). In addition, one can transform a GTSP
into a TSP (Noon and Bean, 1993). For instance, in our case set
for some large constant M, forcing the two nodes and
to be visited consecutively. Then reset
This makes the transformation complete.
Solving the above TSP over nodes solves our SVRP (with a failure potentially occurring only at the last node), allowing us to incorporate the option of preventive refills.
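As a rough computational sketch of the simplest variant (a failure, if any, only at the last customer and no preventive refill), the cost substitution below encodes the expected extra back-and-forth depot trip. The particular form $c_{i0}(1 + 2p_i)$ is our reading of the recourse described above and should be checked against the formulas in Dror et al. (1993).

    import numpy as np

    def expected_return_costs(c, p_fail):
        # c      : (n+1) x (n+1) symmetric distance matrix, node 0 = depot
        # p_fail : p_fail[i] = probability of a failure at customer i when
        #          i is served last (computed from the demand model)
        # Assumed recourse reading: with probability p_fail[i] the vehicle
        # makes one extra round trip depot <-> i, adding 2*c[i,0] in expectation.
        c2 = np.array(c, dtype=float)
        n = c2.shape[0] - 1
        for i in range(1, n + 1):
            c2[i, 0] += 2.0 * p_fail[i] * c2[i, 0]
        return c2

A TSP solved over the matrix returned by this sketch then trades off the nominal route length against the expected failure cost at each candidate last customer.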

7. THE DROR CONJECTURE


A very important aspect of the SVRP is the computation and the properties of
the probability of a route failure at the customer on the route. In the SVRP
we assume that the customers’ demands are independent random variables. In some cases, for the sake of more ‘tractable’ computational studies, we assume that these demands are independent, identically distributed random variables. Given a sequence $X_1, X_2, \ldots$ of independent, identically distributed random variables with a positive mean value $\mu$ and a variance $\sigma^2$, and a positive fixed value $Q$, denote by $P_k(Q)$ the probability that the random variable expressed as the sum of $k$, but not $k-1$ or fewer, of the random variables exceeds the value $Q$. In other words,

$$P_k(Q) = \Pr\Big\{\sum_{i=1}^{k-1} X_i \le Q < \sum_{i=1}^{k} X_i\Big\}.$$

In principle, in the expression above the random variables need not be inde-
pendent or identical. They are just ordered and the first $k$ are summed into a
partial sum. The connection to the SVRP as described in this paper is obvi-
ous. However, this partial sum describes a setting which is not particular to the
SVRP. In fact, in the stochastic processing and reliability literature, $P_k(Q)$ is referred to as a ‘threshold detection probability’. $P_k(Q)$ provides a measure of the likelihood of overstepping a boundary Q at exactly the $k$th trial in successive steps accumulating the effects of a randomly sampled phenomenon. Some other related results, together with a conjecture statement, are described in Kreimer and
Dror (1990). More specifically, Kreimer and Dror (1990) address the following
questions:
(1) What is the most likely number of trials required to overstep the threshold value Q?
(2) In what range of $k$ is the sequence $P_k(Q)$, $k = 1, 2, \ldots$, monotonically increasing (decreasing)?

The second question can also be stated in terms of convexity (concavity) of the sum $\sum_{j \le k} P_j(Q)$ as a function of $k$.

In the spirit of the SVRP, denote by the value such that Kreimer and Dror (1990) prove that, for a number of distributions, the sequence $P_k(Q)$, $k = 1, 2, \ldots$, is monotonically increasing in the range (or ). However, here we would like only to state a conjecture that originated in Dror (1983). Attempts to prove (or disprove) this conjecture over the years have not been successful.

Conjecture: Given Q > 0 and a cumulative distribution function with mean $\mu$ and variance $\sigma^2$ that satisfies the following properties:

1 Coefficient of variation

2 Mean is greater than or equal to the median,

then the sequence $P_k(Q)$ is monotonically increasing in the range

For some distributions not all of these restrictions are necessary.

We have proven this conjecture for the normal distribution and a few other distributions. However, the (as yet unproven) claim is that this monotonicity property is true in general!
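The monotonicity claim is easy to probe numerically. The following Monte Carlo sketch is our own construction: it estimates $P_k(Q)$ by simulation, and the normal demand example is arbitrary.

    import numpy as np

    def threshold_probs(sampler, Q, kmax, reps=200_000, rng=None):
        # Estimate P_k(Q): the probability that the partial sum of the first
        # k i.i.d. demands exceeds Q while the first k-1 do not, k = 1..kmax.
        rng = np.random.default_rng(rng)
        x = sampler(rng, (reps, kmax))
        s = np.cumsum(x, axis=1)
        first_exceed = np.argmax(s > Q, axis=1)   # index of first crossing
        crossed = s[:, -1] > Q                    # crossings within kmax steps
        counts = np.bincount(first_exceed[crossed], minlength=kmax)
        return counts / reps

    # Example: normal demands (mean 10, s.d. 2), capacity Q = 100; the
    # conjecture predicts an increasing sequence up to roughly k = Q/mu = 10.
    p = threshold_probs(lambda rng, size: rng.normal(10, 2, size), 100.0, 15)
    print(np.round(p, 4))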

In the SVRP it is important to analyze the properties of $P_k(Q)$. In some practical cases encountered by the author, such as designing cost-efficient distribution for propane, convexity properties of $P_k(Q)$ play a crucial role (Trudeau and Dror, 1992).

8. SUMMARY
This paper deals with the stochastic vehicle routing problem, which involves design-
ing routes for fixed capacity vehicles serving (delivering to or collecting from
but not both) a set of customers whose individual demands are only revealed
when the vehicles arrive to provide service. At the beginning, we describe the
problem and provide examples which demonstrate the impact demand uncer-
tainty might have on the routing solutions. Since this topic has been examined
by academics and routing professionals for over 20 years, there is a considerable
body of research papers. We certainly have not covered them all in this overview.

The vehicle routing problem with stochastic demands is a very hard problem
and we have attempted to cover all the significant developments for solving
this problem. Starting with early work on simple heuristics such as the stochastic Clarke and Wright algorithm, followed by chance-constrained formulations and stochastic programming with recourse models, we have attempted a broad overview. In the
literature, the most frequently encountered papers have focused on stochastic
programming models with limited recourse options such as back-and-forth ve-
hicle trips to the depot for refill in the event of route failure. This approach
has been examined more recently using the so called L-shaped optimization
method (Laporte et al. 2001). Other approaches (Yang et al. 2000) have added
interesting recourse options which could improve on the solution quality. How-
ever, the most promising approach is that of modeling the problem as a Markov
decision process, presented in Dror et al. (1989) and Dror (1993), with signifi-
cant modeling and computational progress made recently by Secomandi (1998,
2000).

In short, the vehicle routing problem is very easy to state, but, like a number of
other similar problems, very hard to solve. It combines combinatorial elements
with stochastic elements. The problem is ‘real’ in the sense that we can point
out numerous real-life applications; unfortunately the present state-of-the-art
for solving the problem is not very satisfactory. It is a challenging problem and
we are looking forward to significant improvements in solution procedures -
hopefully in the near future.

REFERENCES
Applegate, D., R. Bixby, V. Chvátal, and W. Cook. (1998). "On the solution of traveling salesman problems", Documenta Mathematica, Extra Volume ICM 1998, III, 645-656.
Bertsimas, D.J. (1992). "A vehicle routing problem with stochastic demand",
Operations Research 40, 574-585.
Bertsimas, D.J., P. Chervi, and M. Peterson. (1995). "Computational approaches to stochastic vehicle routing problems", Transportation Science 29, 342-352.
Birge, J.R. (1985). "Decomposition and partitioning methods for multistage stochastic linear programs", Operations Research 33, 989-1007.
Clarke, G. and J.W. Wright. (1964). "Scheduling of vehicles from a central
depot to a number of delivery points", Operations Research 12, 568-581.
Dror, M. (1983). The Inventory Routing Problem, Ph.D. Thesis, University of
Maryland. College Park, Maryland, USA.
Dror, M. (1993). "Modeling vehicle routing with uncertain demands as a stochas-
tic program: Properties of the corresponding solution", European Journal of Operational Research 64, 432-441.

Dror, M. and P. Trudeau. (1986). "Stochastic vehicle routing with modified savings algorithm", European Journal of Operational Research 23, 228-235.
Dror, M., M.O. Ball, and B.L. Golden. (1985). "Computational comparison of
algorithms for inventory routing", Annals of Operations Research 4, 3-23.
Dror, M., G. Laporte, and P. Trudeau. (1989). "Vehicle routing with stochastic
demands: Properties and solution framework", Transportation Science 23,
166-176.
Dror, M., G. Laporte, and V.F. Louveaux. (1993). "Vehicle routing with stochas-
tic demands and restricted failures", ZOR – Zeitschrift fur Operations Re-
search 37, 273-283.
Eilon, S., C.D.T. Watson-Gandy, and N. Christofides. (1971). Distribution Manage-
ment: Mathematical Modelling and Practical Analysis, Griffin, London.
Jaillet, P. (1985). "Probabilistic traveling salesman problems", Ph.D. thesis, Mas-
sachusetts Institute of Technology, Cambridge, MA.
Kreimer J. and M. Dror. (1990). "The monotonicity of the threshold detection
probability in stochastic accumulation process", Computers & Operations
Research 17, 63-71.
Leipälä, T. (1978). "On the solutions of the stochastic traveling salesman prob-
lems", European J. of Operational Research 2, 291-297.
Laporte, G. and F.V. Louveaux. (1993). "The integer L-shaped method for
stochastic integer programs with complete recourse", Operations Research
Letters 13, 133-142.
Laporte, G., F.V. Louveaux, and L. Van hamme. (2001). "An integer L-shaped
algorithm for the capacitated vehicle routing problem with stochastic de-
mands", Operations Research (forthcoming).
Larson, R.C. (1988). "Transportation of sludge to the 106-mile site: An inven-
tory routing algorithm for fleet sizing and logistical system design", Trans-
portation Science 22, 186-198.
Noon, C.E. and J.C. Bean. (1991). "A Lagrangian based approach to the asym-
metric Generalized Traveling Salesman Problem", Operations Research 39,
623-632.
Noon, C.E. and J.C. Bean. (1993). "An efficient transformation of the Generalized Traveling Salesman Problem", INFOR 31, 39-44.
Secomandi, N. (1998). "Exact and Heuristic Dynamic Programming Algorithms
for the Vehicle Routing Problem with Stochastic Demands", Doctoral Dis-
sertation, University of Houston, USA.
Secomandi, N. (2000). "Comparing neuro-dynamic programming algorithms
for the vehicle routing problem with stochastic demands", Computers &
Operations Research 27, 1201-1225.
Stewart, W.R., Jr. and B.L. Golden. (1983). "Stochastic vehicle routing: A com-
prehensive approach", European Journal of Operational Research 14, 371-
385.

Stewart, W.R., Jr., B.L. Golden, and F. Gheysens. (1983). "A survey of stochastic
vehicle routing", Working Paper MS/S, College of Business and Manage-
ment, University of Maryland at College Park.
Trudeau, P. and M. Dror. (1992). "Stochastic inventory routing: Stockouts and
route failure", Transportation Science 26,172-184.
Yang, W.-H. (1996). "Stochastic Vehicle Routing with Optimal Restocking",
Ph.D. Thesis, Case Western Reserve University, Cleveland, OH.
Yang, W.-H., K. Mathur, and R.H. Ballou. (2000). "Stochastic vehicle routing
problem with restocking", Transportation Science 34, 99-112.
Chapter 26

LIFE IN THE FAST LANE: YATES’S ALGORITHM, FAST FOURIER AND WALSH TRANSFORMS

Paul J. Sanchez
Operations Research Department
Naval Postgraduate School
Monterey, CA 93943

John S. Ramberg
Systems and Industrial Engineering
University of Arizona
Tucson, AZ 85721

Larry Head
Siemens Energy & Automation, Inc.
Tucson, AZ 85715

Abstract Orthogonal functions play an important role in factorial experiments and time
series models. In the latter half of the twentieth century orthogonal functions be-
came prominent in industrial experimentation methodologies that employ com-
plete and fractional factorial experiment designs, such as Taguchi orthogonal
arrays. Exact estimates of the parameters of linear model representations can be
computed effectively and efficiently using “fast algorithms.” The origin of “fast
algorithms” can be traced to Yates in 1937. In 1958 Good created the ingenious
fast Fourier transform, using Yates’s concept as a basis. This paper is intended
to illustrate the fundamental role of orthogonal functions in modeling, and the
close relationship between two of the most significant of the fast algorithms.
This in turn yields insights into the fundamental aspects of experiment design.

1. INTRODUCTION
Our purpose in writing this paper is to illustrate the role of orthogonal func-
tions in factorial design, and one usage in Walsh and Fourier analysis for sig-
nal and image processing. We also want to exhibit the relationship between the
Yates “fast Factorial Algorithm” and the fast Walsh transform, and to show how
Yates’s algorithm contributed to the development of fast Fourier transforms.
We would like to de-mystify the “black box” aura which often surrounds the
presentation of these algorithms in a classroom setting, and to encourage the
discussion of the topic of computationally efficient algorithms. We think that
this approach is valuable because it demonstrates close links between statistics
and a number of other fields, such as thermodynamics and signal processing,
which are often viewed as quite divergent.
Orthogonal functions serve many purposes in a wide variety of applications.
For example, orthogonal design matrices or arrays play important roles in the
statistical design and analysis of experiments. Discrete Fourier and Walsh
transforms play comparable roles in digital signal and image processing. An
important distinction between the various application areas is the data collec-
tion scheme employed. In statistical experiment design the functions represent
the factors and their interactions, and the experiment is typically run in a ran-
domized order. In signal processing the data are collected over time and repre-
sented in terms of a set of orthogonal functions which are explicit functions of
time.
Historically, the importance of these methods created interest in developing
effective and efficient computational approaches, which are often called “fast”
algorithms. Fast algorithms produce the same mathematical result as the stan-
dard algorithms, and are typically more computationally stable, yielding better
numerical accuracy and precision. Thus they should not be confused with so-
called “quick and dirty” statistical techniques which yield only approximate
results.
Nair (1990) has indicated the importance of attracting applications arti-
cles in the new technological areas of physical science and engineering, such
as semiconductors. Hoadley and Kettenring (1990) have stressed the impor-
tance of communication between statisticians, engineers, and physical scien-
tists. Stoffer (1991) has discussed statistical applications based on the Walsh–
Fourier transform, and has outlined the existing Walsh-Fourier theory for real-
time stationary time series. Understanding the relationship of experiment de-
sign to other orthogonal function based techniques and the corresponding fast
algorithms should be useful in enhancing this communication.
At first glance, high speed computation abilities might seem to negate the
need for the computational efficiency available using fast transforms. In pattern
recognition problems and signal processing the amount of data being processed

is a major reason for their continued importance. Relating the algorithms to well-known statistical techniques provides a mechanism for determining confi-
dence limits when the signal is subject to random fluctuation, such as variation
in natural lighting or low signal to noise ratios. In factorial experiments, where
data sets are much smaller, the systematic computation format is advantageous
for implementing the algorithm as a program irrespective of language, and can
even be used to perform analyses in spreadsheets. Finally, these algorithms tend to have better numerical stability.

2. LINEAR MODELS
Generally, the role of linear models in factorial experiments, Walsh analysis,
and Fourier analysis is not made explicit. The methods of analysis which we
will discuss are all based upon linear models. A discrete indexed linear model can be represented in matrix form in either a deterministic or statistical context as

$$y = B\beta \qquad (2.1)$$

or

$$y = B\beta + \epsilon, \qquad (2.2)$$

respectively. In both cases y is an $n \times 1$ column vector, B is an $n \times p$ matrix representing the selected basis, and $\beta$ is a $p \times 1$ column vector of unknown parameters. In (2.2), $\epsilon$ is an $n \times 1$ column vector of random errors with zero mean and variance $\sigma^2$. In both cases, when $n \ge p$ and B is of rank $p$, the parameters can be determined (estimated) uniquely by least squares as

$$\hat{\beta} = (B'B)^{-1}B'y. \qquad (2.3)$$

The determined (estimated) parameters can be viewed as a projection of the data onto the basis represented by B, as opposed to the original basis, which was composed of the first $n$ elementary vectors.
If the basis is orthogonal, $B'B$ is a $p \times p$ diagonal matrix, and hence so is its inverse. Thus $\beta$ is determined (estimated) by

$$\hat{\beta} = DB'y, \qquad (2.4)$$

where D is a $p \times p$ diagonal matrix whose $i$th element is the inverse of the squared norm of the $i$th column vector of the basis. If all basis vectors have the same norm, the multiplication by D reduces to multiplication by a scalar constant.
For the special case $n = p$ the least squares estimator simplifies to

$$\hat{\beta} = B^{-1}y.$$

In this case, the projection onto the basis B is one-to-one, i.e., y and $\hat{\beta}$ are completely interchangeable since either can be constructed (reconstructed) from the other.
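A short numerical check makes the simplification concrete. The sketch below uses the $2^2$ factorial analysis matrix as the orthogonal basis B; the data vector is made up for illustration.

    import numpy as np

    # With an orthogonal basis, least squares reduces to one multiplication.
    B = np.array([[1, -1, -1,  1],
                  [1,  1, -1, -1],
                  [1, -1,  1, -1],
                  [1,  1,  1,  1]], dtype=float)
    y = np.array([3.0, 5.0, 2.0, 8.0])

    beta_full = np.linalg.solve(B.T @ B, B.T @ y)   # (B'B)^{-1} B'y
    beta_fast = (B.T @ y) / 4.0                     # here B'B = 4I
    assert np.allclose(beta_full, beta_fast)
    print(beta_fast)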
We will be applying the linear model in three settings: Factorial Analysis,
Walsh Analysis, and Fourier Analysis. The table in Appendix A serves as a
quick reference to notation.

2.1. FACTORIAL ANALYSIS


2.1.1 Definitions and Background. A factorial experimental design is
conducted by controlling the levels of the various factors during the course of
the experiment. In a factorial experiment there are factors, each of which
is set at “low” and “high” levels. These will be designated –1 and +1 and we
will use “–” and “+” to denote –1 and +1, respectively. The factor settings
are determined by enumerating all combinations of low and high settings for
each factor. This is demonstrated for three factors, in Figure 26.1.
It is easily seen that there are points, each of which can be represented
as a three-dimensional vector. The corner labeled 1 has factors one, two, and
three set at low, low, and low settings, respectively, and can be represented as
a triplet ( – , – , – ) . Corner 2 sets factors one, two, and three to high, low, and
low, respectively, and yields the triplet ( +, –, – ). In general, if there are $k$ factors the design can be represented as $2^k$ vectors of length $k$, where each of the elements specifies the level setting of a corresponding factor at a single design point. We can construct the design matrix by assigning the vector at design point $i$ as the $i$th row of the matrix.
Alternatively, we can consider the settings of one particular factor in the
order given by the numbering of the design points. This provides the column
vectors of the design matrix. Both viewpoints yield the same design matrix for
any given experiment. The column vectors, called the generating vectors for a factorial experiment, are:

Each factor in an experiment is assigned a (unique) generating vector.


For an experiment with one replication, each run corresponds to a row,
which is selected randomly (without replacement) and each factor is set ac-
cording to the level indicated by the generating vector. For an experiment with
blocks, this process is repeated times, each time with a new random se-
quence. For an experiment with replications, each row appears times, and
the rows are selected in a completely random order. A single observation of the
response is obtained for each row, resulting in observations for a full facto-
rial experiment. This set of observations is subsequently sorted into standard order for analysis. In this case the order in which the runs were carried out should be recorded to allow assessment of model assumptions.
The generating vectors are used to construct the analysis matrix. When
two or more factors affect the outcome based (non-additively) on their com-
bined values, we say that there is an interaction term in the model. For ex-
ample, when two or more factors interact in the form
this term may also be included in the model. The
generating vectors and all possible interaction terms form the contrast matrix.
The addition of a column of ones completes the basis, and results in the analy-
sis matrix. The columns of the analysis matrix representing factor interactions
are obtained by the element-wise multiplication of all the generating vectors of
the factors involved.

In general, the analysis matrix for a full factorial experiment can be writ-
ten in matrix form in Yates’s standard order as

It is constructed from the generating vectors as follows. At stage 0, X consists solely of the column vector $x_0$, which is a column of +’s. At stage $j$, for $j = 1, \ldots, k$, X is expanded by appending to it the element-wise product of the $j$th generating vector with each vector in the matrix at stage $j-1$, thus doubling the number of columns. Since the number of columns doubles at each of the $k$ iterations, and the column vectors are of length $2^k$, it follows that X is a $2^k \times 2^k$ square matrix. It is also of full rank, since the columns are mutually orthogonal. The table of contrasts, employed in many texts for the analysis of factorial designs, is the analysis matrix excluding $x_0$. (Contrast vectors always sum to zero, and $x_0$ does not fulfill that requirement and hence is not a contrast.)

The vector $x_0$ is inserted as the first column. The generating vectors and interactions complete the basis. For example, the product vector obtained from the interaction of the first and second generating vectors is found as the fourth column of (2.7). The column vectors of equation (2.7) are represented graphically in Figure 26.2.
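The staged construction is easy to mechanize. The following Python sketch builds the analysis matrix in Yates’s standard order by exactly the doubling procedure just described; the generating vectors alternate in blocks of length $2^j$.

    import numpy as np

    def yates_analysis_matrix(k):
        # Analysis matrix of a 2^k full factorial in Yates's standard order.
        # Column 0 is all ones; each stage appends the element-wise product of
        # the next generating vector with every column built so far.
        X = np.ones((2 ** k, 1), dtype=int)
        for j in range(k):
            g = np.tile(np.repeat([-1, 1], 2 ** j), 2 ** (k - j - 1))
            X = np.hstack([X, g[:, None] * X])
        return X

    print(yates_analysis_matrix(3))   # 8 x 8: I, A, B, AB, C, AC, BC, ABC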

2.1.2 The Model. Let y be the vector of responses, or response totals or means (depending upon the exact form of analysis of interest), which we wish to relate to the factors in some fashion. One commonly used model is the linear model given by (2.2) in Section 2. This is often written as

$$y = X\beta + \epsilon, \qquad (2.8)$$

where $\beta$ is the vector of coefficients corresponding to X, and X plays the role of the basis matrix B in (2.2). The vector $\beta$ is indexed by $0, 1, \ldots, 2^k - 1$; thus the index identifies the factor or interaction.

2.1.3 The Coefficient Estimator. The familiar least squares estimator for $\beta$ is

$$\hat{\beta} = (X'X)^{-1}X'y.$$

It is easily verified that the columns of X all have modulus $2^{k/2}$, so that if least squares is used to estimate $\beta$,

$$X'X = 2^k I,$$

where I is the identity matrix. Thus the estimator is given from (2.4) as $\hat{\beta} = 2^{-k}X'y$, and only one matrix multiplication is required to obtain the
estimates.
The estimator has the same number of elements as the data vector, y, and
can be viewed as an orthogonal transformation of the data vector to the pa-
rameter space. Furthermore, the observation vector can be computed from the
parameter vector by $y = X\hat{\beta}$. Note that the transformation is information
preserving, since the original vector can be recovered from the parameter vec-
tor. Finally, a smoothed (or parsimonious) predictor of y can be obtained by
setting certain elements of $\hat{\beta}$ to zero (perhaps those which are not statistically significant). If we call the new vector $\tilde{\beta}$, then the smoothed predictor is given by $\tilde{y} = X\tilde{\beta}$.

2.2. WALSH ANALYSIS


Walsh functions are binary valued functions of time. They first appeared in
the published literature in Walsh (1923). While the time index can be either
continuous or discrete, we will concentrate on discrete index applications. One
can view them as similar to sine/cosine functions, in that they can be used as
a complete orthogonal basis for representing time series, but Walsh functions
differ in that they are not periodic. The first eight Walsh functions are plotted
(in continuous time) in Figure 26.3 to illustrate this. The discrete-time vec-
tor representation can be obtained by taking the right-continuous limit of the
continuous-time function for
The binary nature of Walsh functions makes them a good choice for digital
signal processing applications, or many other settings in which we wish to
accurately represent discontinuous functions using a relatively small number of
terms. We will deal only with a few properties relevant to our application area,
experimental design. See Beauchamp (1984) for a more detailed treatment of
the functions and their properties.

2.2.1 Definitions and Background. By convention Walsh functions


are also encoded as {±1}. Three different ordering schemes are used to in-
dex them, each of which has merits for some types of applications. These are

known as natural, dyadic, and sequency ordering. We will concentrate on the natural (also known as Hadamard) order because it corresponds directly to the
Yates’s ordering. However, the indexing scheme used in sequency ordering
results in a nice notation for specifying factor interactions. In addition, the
sequency index is directly related to how often the factor is varying about its
mean, which closely corresponds to the concept of frequency in trigonomet-
ric functions. This may be of interest to the experimenter in certain settings
(Sanchez, 1991). In all three indexing schemes the function numbering starts
at zero, and the function indexed by zero is a vector of ones.
As with the factorial designs, we can derive Walsh functions from a com-
plete enumeration of all combinations of high and low settings of the k factors.
This is illustrated for k=3 by Figure 26.4. The three indexing schemes briefly
discussed above are obtained by sampling the eight design points in differ-
ent (systematic) orderings. Note that the Hadamard ordering is precisely the
reverse of the Yates’s ordering.
This yields a set of column vectors which can be used to construct an anal-
ysis matrix, exactly as we did with the factorial analysis. The corresponding

column generating vectors are:

Note that the index of a generating vector is the number of adjacent pluses or
minuses, and that for where is the factorial
generating vector with index
To determine the value of a Walsh function whose index is not a power of 2, decompose the index into its binary representation and take the product of all Walsh functions corresponding to a 1 bit. For example, $W_5 = W_4 W_1$, since the binary representation of 5 is 101. As with factorial designs, the total number of vectors generated in this fashion is $2^k$. To obtain a complete orthogonal basis, we again insert the all-ones vector $W_0$. The basis is constructed by placing $W_m$ as the $(m+1)$st column.
In general, the matrix for a Walsh analysis can be written in Hadamard
order as

Thus if the basis is

The generating vectors $W_1$, $W_2$, and $W_4$ are the second, third, and fifth columns of (2.10), respectively. The product vector $W_3 = W_1 W_2$ obtained from the interaction of $W_1$ and $W_2$, for example, is found as the fourth column.

Note that this is the analysis matrix X from Yates’s algorithm given in (2.7)
to within a scale factor of {±1} for each column.
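For reference, the natural-ordered Walsh basis can be generated mechanically by the standard Sylvester recursion, which we believe reproduces the Hadamard ordering described here; a short sketch:

    import numpy as np

    def walsh_natural(k):
        # Walsh functions of length 2^k in natural (Hadamard) order, built by
        # the Sylvester recursion H <- [[H, H], [H, -H]].
        H = np.array([[1]], dtype=int)
        for _ in range(k):
            H = np.block([[H, H], [H, -H]])
        return H

    H = walsh_natural(3)
    assert np.array_equal(H @ H.T, 8 * np.eye(8, dtype=int))  # orthogonality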

2.2.2 The Model. Let y be the vector of data which we wish to represent in terms of our basis H. By convention the time index starts at zero, i.e., the vector consists of $y_0, y_1, \ldots, y_{2^k - 1}$. One commonly used model is the linear model given by (2.1) in Section 2. This can be written as

$$y = HY,$$

where Y is the vector of Walsh coefficients corresponding to H, and H plays the role of the basis matrix B in (2.1).

2.2.3 Discrete Walsh Transforms. It is easily verified that the columns of H all have modulus $2^{k/2}$, so that

$$H'H = 2^k I,$$

where I is the identity matrix. Thus the matrix form for the discrete Walsh transform (DWT) of a vector y of length $2^k$ is defined as

$$Y = 2^{-k} H'y.$$

Note that the DWT is the least squares estimator for Y. Also note that due to the symmetric nature of the Hadamard matrix, $H' = H$, so it is correct to write the DWT as

$$Y = 2^{-k} Hy.$$

In other words, the DWT is its own inverse to within a scale factor of $2^k$.
The transform notation emphasizes the fact that the transformation is infor-
mation preserving — the transform vector contains all of the information in
the original vector. In other words, the transform is not actually changing the
data, but rather is changing our viewpoint of the data. Both the original and the
transform vectors represent exactly the same point in space.
All we have done in the transformation is to change the set of axes from which
we choose to view that point.
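A small numerical check of the self-inverse property, using scipy’s Hadamard matrix of order 8 (the data vector is made up):

    import numpy as np
    from scipy.linalg import hadamard

    H = hadamard(8)                        # natural (Hadamard) order, k = 3
    y = np.array([2.0, 6.0, 4.0, 8.0, 3.0, 7.0, 5.0, 9.0])
    Y = H @ y / 8.0                        # DWT: Y = 2^{-k} H y
    assert np.allclose(H @ Y, y)           # applying H again recovers y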
Any orthogonal basis could be used for viewing the data, but some bases
might be more interesting than others because of the physical interpretation
we could place on the results. For statistical interpretation, an appropriate
choice of basis would be the set of vectors we used to determine the inputs to
an experiment. If the outputs fall solidly in some subspace corresponding to
certain inputs, we infer that those inputs are important factors in determining
the output.

2.3. FOURIER ANALYSIS


The Fourier Transform is a common tool in engineering applications for an-
alyzing time series, solving differential equations, or performing convolutions.
For a thorough yet accessible introduction to Fourier transforms and other time
series methods, we refer the reader to the text by Chatfield (1984). When the
data consist of a discrete set of samples, the Discrete Fourier Transform (DFT)
is the applicable tool.

2.3.1 Definitions and Background. Let $w$ be the generating vector whose elements are given by $w_j = e^{2\pi i j/N}$ for $j = 0, 1, \ldots, N-1$, where $i = \sqrt{-1}$. The set of indices consists of the ordered set of integers $0, 1, \ldots, N-1$. If we now take the $m$th power (element-wise) of $w$ for $m = 0, 1, \ldots, N-1$, the resulting set of vectors is mutually orthogonal. We can construct a matrix, which we will call F, using these vectors as the columns.

For example, if the generating vector is

and thus the analysis matrix is:



Since these are periodic functions, (2.11) can be simplified by evaluating the exponents modulo $N$. The result is

This matrix can be expressed in a simpler form as the matrix of exponents after dividing the exponents by $2\pi i/N$:

This notation makes clear that the relationship between the columns is com-
parable to that in a factorial design array. Note, however, that the interaction
columns are the sums of the corresponding factor columns here, since prod-
ucts of powers of a common base can be expressed in terms of the sums of the
exponents.
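Concretely, the basis and its orthogonality can be generated in a few lines. This sketch follows the $+i$ sign convention of the generating vector defined above.

    import numpy as np

    # F[j, m] = exp(2*pi*i*j*m/N): column m is the m-th element-wise power
    # of the generating vector; exponents j*m may be reduced modulo N.
    N = 8
    idx = np.arange(N)
    F = np.exp(2.0j * np.pi * np.outer(idx, idx) / N)
    assert np.allclose(F.conj().T @ F, N * np.eye(N))   # mutually orthogonal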
We can also express the transform in terms of trigonometric functions using the relationship $e^{i\theta} = \cos\theta + i\sin\theta$. This is useful for computational
purposes – we can keep track of the magnitudes of the real and imaginary
components separately for each element. In other words, each complex scalar
element of the vector of data is represented by two scalar values, corresponding
to the real and imaginary components.
Complex multiplication can be equivalently expressed in matrix terms. If $z_1 = a + bi$ and $z_2 = c + di$, then $z_1 z_2$ can be written as

$$\begin{pmatrix} a & -b \\ b & a \end{pmatrix} \begin{pmatrix} c \\ d \end{pmatrix} = \begin{pmatrix} ac - bd \\ ad + bc \end{pmatrix}.$$

The multiplication of two complex scalars becomes a matrix multiplication of a 2 × 2 matrix by a 2 × 1 vector to yield a 2 × 1 vector. It can thus be seen that

the Fourier basis represented by (2.12) can be represented as a 16 × 16 matrix:

It is interesting to note that is a symmetric matrix, but is not.

2.3.2 The Model. Let y be the vector of data which we wish to represent in terms of our basis F. As with the Walsh basis, the time index starts at zero by convention, i.e., the vector consists of $y_0, y_1, \ldots, y_{N-1}$. One commonly used model is the linear model given by (2.1) in Section 2. This can be written as

$$y = FY,$$

where Y is the vector of Fourier coefficients corresponding to F, and F plays the role of the basis matrix in (2.1). The least squares estimator for Y is

As with the other bases we have discussed, the vectors are mutually orthogonal,
so the term forms a diagonal matrix and the estimator simplifies to a
single matrix multiplication. However, the first and last columns have a different modulus than the other terms: they are scaled by $1/N$ while the rest are scaled by $2/N$. Note that F is symmetric about its diagonal, so that it is its own transpose. However, since F is complex, the estimator must use its complex conjugate, i.e., equation (2.12) with the signs of all exponents reversed. We will designate the complex conjugate henceforth as $\bar{F}$.
When the data are real-valued, as with statistical applications, the imaginary
part of each observation is zero and can be omitted. Hence we can eliminate
the even numbered rows in equation (2.14). (Rows are eliminated, rather than
columns, because the estimators are obtained from $\bar{F}'y$.) The result is that the reduced matrix is 16 × 8 – clearly eight of the rows are redundant, since we only
need an 8 × 8 matrix to have a complete basis. The question of which vectors
to include in the basis can be resolved by examining Figure 26.5. The set of points obtained by evaluating the powers of the generating vector form a circle of unit radius in the complex plane. Note that for $N/2 < m < N$, $\cos(2\pi m j / N) = \cos(2\pi (N - m) j / N)$ and $\sin(2\pi m j / N) = -\sin(2\pi (N - m) j / N)$. In other words, frequencies in the range $(N/2, N)$ can be expressed as linear functions of frequencies in the range $[0, N/2]$, so we can obtain a complete basis from the sine and cosine terms in the latter range. For $N = 8$ the result is a reduced matrix

It can be easily verified that the columns form a real-valued orthogonal basis.
The columns of are plotted in Figure 26.6.

3. AN EXAMPLE
In this section we will present a vector of data and show how the coefficients
are calculated for each of the three bases we have discussed. The results will
then be compared.
Suppose we have run a planned experiment in which three factors were varied in a controlled fashion. In practice it is usually recommended that

such experiments be run in a randomized order, to try to reduce the effects of sampling order. The data are then pre-sorted into standard order.
For a classical factorial analysis, the data are placed in the order specified
by the design cube of Figure 26.1. Suppose our sample, in standard order, is
the vector

We can then analyze it by calculating the estimator given in equation (2.3). The
resulting estimates are

In other words

We can analyze the same vector in terms of the Walsh basis. The resulting
vector estimate is:

and the corresponding model is

Recall that $W_2$ corresponds to factor 2, and $W_4$ corresponds to factor 3. Note that the estimated coefficients have the same magnitude but the opposite sign as
those obtained from the traditional analysis. The explanation is straightforward
– we have analyzed the data in the wrong order for the Walsh basis. Looking
at Figure 26.4 it can be seen that the observation which has all three settings
low is the last observation for a Walsh-based experiment. In fact, the ordering
is exactly the reverse of the standard. Pre-sorting the data into reverse order
results in the model

Finally, we analyze the data set using Fourier analysis. The resulting vector of estimated coefficients is

which corresponds to the model



The fit can be seen in Figure 26.7.


Note that for this particular data set more terms are required to obtain a fit in
the Fourier basis. The Walsh and factorial bases require exactly the same number of terms, because the basis vectors are the same to within a scale factor of ±1.
In general, the number of non-zero Fourier coefficients will be different than
the number of non-zero Walsh or factorial coefficients. Which number is larger
depends entirely on the data set being analyzed.
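The three analyses can be reproduced side by side. In the sketch below an arbitrary illustrative data vector is used, but the relationships among the three sets of coefficients are exactly those just described.

    import numpy as np
    from scipy.linalg import hadamard

    y = np.array([3.0, 7.0, 5.0, 9.0, 2.0, 6.0, 4.0, 8.0])  # made-up sample

    # Factorial analysis matrix in Yates standard order (k = 3).
    X = np.ones((8, 1), dtype=int)
    for j in range(3):
        g = np.tile(np.repeat([-1, 1], 2 ** j), 2 ** (2 - j))
        X = np.hstack([X, g[:, None] * X])
    beta = (X.T @ y) / 8.0                  # factorial coefficients

    H = hadamard(8)
    walsh = (H @ y[::-1]) / 8.0             # Walsh analysis of reversed data
    assert np.allclose(beta, walsh)         # identical after re-ordering

    F = np.exp(2.0j * np.pi * np.outer(np.arange(8), np.arange(8)) / 8)
    fourier = (F.conj().T @ y) / 8.0        # Fourier coefficients
    assert np.allclose(F @ fourier, y)      # exact reconstruction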

4. FAST ALGORITHMS
Recall that for any orthogonal basis, the least squares estimator is obtained
from a single matrix multiplication. If the number of operations required to
compute this can be substantially reduced the result is a computationally ef-
ficient algorithm. These are often called “fast” algorithms. The Fast Fourier
Transform (FFT) is probably the best known of these fast algorithms.
At first glance, high speed computation capabilities might seem to negate
the need for the computational efficiency available using fast transforms. A straightforward implementation of the discrete Fourier transform requires $O(2^{2k})$ calculations, where $k$ is the number of factors, while the corresponding FFT requires $O(k2^k)$ calculations. The amount of time required for a single-
threaded computer implementation of the algorithm to compute the transform
is proportional to the number of calculations being performed. Figure 26.8

illustrates the relative efficiency of the fast algorithms in terms of computational time requirements. The horizontal axis is $k$ and the vertical axis plots the number of computations which must be performed for that value of $k$. The plot illustrates that the savings can be substantial for even relatively small values of $k$. This is particularly noticeable in a desktop computing environment.
Further, the fast algorithms are often preferable in terms of numerical stability
since they perform fewer calculations to yield the same answers. The sys-
tematic format of Yates’s algorithm is also advantageous in the programming
of computations associated with factorial or fractional factorial designs. See
Nelson (1982) as an illustration. Finally, simple techniques such as Yates’s
algorithm also improve the feasibility of analyzing factorial designs in spreadsheets without the need to program.

4.1. YATES’S FAST FACTORIAL ALGORITHM


Yates’s algorithm provides a computationally efficient method for estimating the coefficients $\beta$ in the model given in equation (2.8).
Many of us, when exposed to the algorithm for the first time, are surprised
and curious about its operation. Since it is usually viewed as a means to an
end, most of us do not explore its makeup. Indeed, Yates gives no explanation
of the origins of the algorithm.
Most statisticians appear to be familiar with Yates’s algorithm, but few have
more than a vague notion of why the algorithm works. We shall see in this
and subsequent sections that the algorithm is based on elegant symmetries in
the original experimental design, and that the central idea behind the Yates’s
algorithm is applicable to orthogonal functions in general.
Recall that $\hat{\beta} = 2^{-k}X'y$, since $X'X = 2^k I$. Yates’s algorithm can then be understood by representing the matrix $X'$ as the product of another matrix multiplied by itself $k$ times. Specifically, we find a special matrix M such that $M^k = X'$. We’ll first illustrate this matrix factorization for the case $k = 2$ and then generalize. For $k = 2$ we have

so that

and thus

The form of the matrix factorization for arbitrary values of $k$ is

The calculation of $X'y$ can be equivalently expressed as $M(M(\cdots(My)))$. At first glance this appears to be worse than the original algorithm, but the multiplication is not explicitly performed for the zero terms in M. The matrix X contains no zero elements, while M in fact contains precisely two non-zero elements per row or column, as can be seen above. This means that we perform only $2^k$ additions or subtractions for each of the M matrices in the factored form of the algorithm. Since there are $k$ such matrix factors, the total amount of work in the Yates’s approach is $k2^k$ rather than $2^{2k}$ if least squares is applied straightforwardly. Paraphrasing, for $N = 2^k$ we have $k = \log_2 N$, so the algorithm is $O(N \log N)$ in complexity rather than $O(N^2)$.
Figure 26.9 illustrates how Yates’s algorithm works for the case of $k = 3$.
This flow diagram contains equivalent information to the matrix representation
– it describes how to combine those terms with non-zero coefficients. The
diagram could also be used as a schematic for implementing Yates’s algorithm
in hardware. Straight lines indicate that a term is to be added, while dotted lines
indicate that the term is to be subtracted. Observations are input on the left, and
combined as indicated by the lines to produce intermediate values
and then contrasts Thus, at the first intermediate stage
we have

at the second intermediate stage we obtain



and at the final stage we get

Note that this is exactly the same result we would obtain by explicitly performing the multiplication $X'y$. The contrasts can be scaled to produce estimates of $\beta$ by either pre- or post-multiplying by $2^{-k}$, which corresponds to the scale factor of the least squares estimator. For general values of $k$, each column is $2^k$ in length, and there are $k$ transformations being applied. Yates’s is a constant-geometry algorithm, i.e., exactly the
same sequence of operations is applied in each of the transformations. This
is undoubtedly a benefit when the calculations are being performed by hand.
Given the flow diagram, the operations can be expressed algorithmically.
We will need several storage vectors of length $2^k$, which are labeled $y^{(j)}$ for $j = 0, 1, \ldots, k$. We define $y^{(j)}_i$ to be the $i$th element of the data vector at the $j$th iteration. The algorithm follows.

The symbol = indicates assignment of a value to a variable. References to $y^{(0)}$ should use the original vector of data. The output of the algorithm resides in the vector $y^{(k)}$.
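In executable form the procedure reads as follows. This Python sketch is our rendering of the constant-geometry sums-and-differences passes described by the flow diagram, not a verbatim transcription of the original listing; it returns the contrasts $X'y$ in standard order.

    import numpy as np

    def yates(y):
        # Input: 2^k responses in Yates standard order. Each pass replaces
        # the working column by the sums of successive pairs followed by
        # their differences; k passes yield the contrasts X'y.
        y = np.asarray(y, dtype=float)
        n = y.size
        for _ in range(n.bit_length() - 1):
            pairs = y.reshape(n // 2, 2)
            y = np.concatenate([pairs[:, 0] + pairs[:, 1],
                                pairs[:, 1] - pairs[:, 0]])
        return y

    # Divide by 2^k to obtain the least squares estimates.
    print(yates([3.0, 7.0, 5.0, 9.0, 2.0, 6.0, 4.0, 8.0]) / 8.0)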
The gains in computational efficiency arise from the sparseness of the matrix M which factors the matrix $X'$. The factorization which corresponds to Yates’s algorithm is not unique. For example, it is easily verified that other sparse factorizations yield the same matrix for a $2^2$ experimental design. Further, it is not necessary that the component matrices of the factorization be identical. The efficiency comes from sparseness, not symmetry, of the matrix
factors. As we will see in the next section, there exists a whole class of “fast”
algorithms which can be used to evaluate factorial experiments.

4.2. FAST WALSH TRANSFORMS


As with the Yates’s algorithm, the FWT is obtained by matrix factoriza-
tion. However, in this case the matrices are not all identical. This presents
no problem – we noted in an earlier discussion that the efficiency of the algo-
rithm comes from the sparseness of the matrix factors, not from any symmetry
properties. The FWT can be represented in matrix form as

where

and

The resulting fast algorithm is depicted in Figure 26.10.


As with the Yates’s factorization, we find that the FWT algorithm presented
here is not unique. There are many published algorithms for performing FWT’s
(Beauchamp, 1984) – hence our claim in the previous section that many fast
algorithms exist for evaluating factorial experiments.
Figure 26.11 illustrates a basic set of operations used for the FWT, and is
called a butterfly flow diagram (for obvious reasons). Note that all the operations in the natural ordered FWT can be expressed as butterflies. This is
quite significant, because the butterfly has the property that a pair of values is
operated on to produce a new pair, which go in the corresponding locations.
Since the data in a butterfly is used independently of data anywhere else in
the algorithm, the storage initially allocated for the data vector can be reused
at subsequent stages. The result is an in-place algorithm, and is highly ef-
ficient in both storage and computational time. However, since this is not a

constant-geometry algorithm, it is possibly less desirable than Yates’s for human calculation.
Given the basic operation specified by a butterfly, the natural ordered FWT can be described algorithmically, as in the sketch below.
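The sketch below is our rendering of the butterfly stages of Figures 26.10 and 26.11 in Python, checked against the Hadamard matrix; it is in-place up to the initial copy.

    import numpy as np
    from scipy.linalg import hadamard

    def fwt_natural(y):
        # Fast Walsh transform, natural (Hadamard) order: k stages of
        # butterflies, each pair (a, b) replaced by (a + b, a - b).
        y = np.asarray(y, dtype=float).copy()
        h = 1
        while h < y.size:
            for start in range(0, y.size, 2 * h):
                for i in range(start, start + h):
                    a, b = y[i], y[i + h]
                    y[i], y[i + h] = a + b, a - b    # one butterfly
            h *= 2
        return y

    x = np.array([2.0, 6.0, 4.0, 8.0, 3.0, 7.0, 5.0, 9.0])
    assert np.allclose(fwt_natural(x), hadamard(8) @ x)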

Comparable results exist for the other Walsh ordering schemes. In fact, one
way of performing the sequency ordered FWT is to apply the algorithm just
described after pre-sorting the data into a particular order.

4.3. FAST FOURIER TRANSFORMS


We will use the complex basis for notational simplicity in deriving a Fast
Fourier Transform (FFT) algorithm. As with Walsh functions, there are actu-
ally several FFT’s. The one we will illustrate here is based on a result known
as the Danielson–Lanczos lemma. Let $Y_m$ be the $m$th Fourier transform coefficient, and define the corresponding contrast transform to be $C_m = NY_m$. Danielson and Lanczos showed that $C_m$ could be expressed in terms of two other transforms, based on the even and odd data points, as follows:

$$C_m = E_m + w^m O_m, \qquad w = e^{2\pi i/N},$$

where $E_m$ and $O_m$ are the contrast transforms of subsequences of length N/2 composed of the even and odd data points, respectively. In other words, the transform coefficient can be obtained as a weighted sum of two other transform coefficients based on shorter subsequences, where the weighting factor is a constant to the $m$th power.
The lemma can now be applied recursively until subsequences of length one
are obtained. At each stage of the recursion the size of the data set is halved,
so it can be applied at most $\log_2 N$ times for each of the N coefficients. The result is a computational technique which is $O(N \log N)$ in complexity.
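The recursion translates directly into code. The sketch below is a standard recursive implementation of the lemma, with the $+i$ sign convention of the basis used in this chapter (hence the conjugate in the final check against numpy, whose convention uses $-i$).

    import numpy as np

    def fft_dl(y):
        # Recursive FFT via the Danielson-Lanczos lemma: the length-N
        # transform is a twiddle-weighted combination of the transforms of
        # the even- and odd-indexed subsequences.
        y = np.asarray(y, dtype=complex)
        N = y.size
        if N == 1:
            return y
        even, odd = fft_dl(y[0::2]), fft_dl(y[1::2])
        w = np.exp(2.0j * np.pi * np.arange(N // 2) / N)   # twiddle factors
        return np.concatenate([even + w * odd, even - w * odd])

    x = np.arange(8, dtype=float)
    assert np.allclose(fft_dl(x), np.fft.fft(x).conj())    # sign convention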
By making use of the fact that our basis consists of circular functions, we can evaluate the exponents modulo $N$. In doing so, we find a great deal of
redundancy in the corresponding matrices. We can also note that the algorithm
logically groups the data first by the least significant bit of the binary repre-
sentation of the index, then by the second least significant bit, and so on, up to
the most significant bit. By pre-sorting the data in reverse order of the bits, we
find the data are appropriately grouped to perform the transformation in place.
This pre-sorting is an O(N) operation, so it does not significantly impact the
overall computation time while reducing the storage requirements.
For example, the DFT for a set of eight points can be represented in matrix
form in terms of the complex conjugate of equation (2.12). It is easy to
verify via matrix multiplication that

to within reduction modulo 8 of each of the exponents, where



and

The S matrix contains only one non-zero entry per row and column, and so sorts the data set as described earlier in O(N) time. If desired, the ones can be replaced with $1/N$ so that the estimators are properly scaled with no additional work.
Note that the functional form of the matrices for the FFT is identical to that for the FWT. Only the coefficients have changed. (In fact, calculating the sequency ordered FWT involves pre-multiplying by exactly the same sorting matrix.) This is the basis of generalized transform algorithms, which use the same matrix structure and substitute different sets of coefficients to perform the various transforms of interest.

5. CONCLUSIONS
It is clear that many of the researchers referenced herein were aware of and influenced by Yates’s algorithm. However, the fundamental role of orthogonal transform theory, and the relationships between the various “fast algorithms”, appear to be unfamiliar to many statisticians. The fields of orthogonal factorial experimental designs and orthogonal transform theory appear at first glance to have evolved in parallel, with little cross-communication. Despite the initial impact of Yates’s work, many statisticians treat orthogonal transform theory and the closely related field of spectral analysis as tools for “time series”, and appear unaware of the applicability of the work to factorial experimental designs.

Researchers in the field of digital signal processing (DSP) have significantly extended the work of Yates, Good, Cooley, and Tukey. We believe that it is worthwhile for statisticians to become familiar with the DSP research in orthogonal function decomposition for a number of reasons.
The FWT offers two benefits relative to the traditional Yates’s algorithm:
the FWT is an in-place algorithm;
the FWT is its own inverse.
DSP literature is based upon generalized transform theory:

results can be generalized for many orthogonal designs, not just for two-level factorials. We have illustrated this using the FFT.

DSP literature offers a consistent notation and framework for representing orthogonal designs.

REFERENCES
Ahmed, N., and K.R. Rao. (1971). "The generalised transform." Proc. Applic. Walsh Functions, Washington, D.C., AD727000, 60–67.
Beauchamp, K.G. (1984). "Applications of Walsh and Related Functions." Academic Press, London.
Chatfield, C. (1984). "The Analysis of Time Series: An Introduction." Chapman and Hall, New York.
Cooley, J.W., and J.W. Tukey. (1965). "An algorithm for the machine calculation of complex Fourier series." Math. Comput. 19, 297–301.
Good, I.J. (1958). "The interaction algorithm and practical Fourier analysis." J. Roy. Stat. Soc. (London) B20, 361–372.
Heideman, M.T., D.H. Johnson, and C.S. Burrus. (1984). "Gauss and the History of the Fast Fourier Transform." IEEE ASSP Magazine, Oct. 1984, 14–21.
Hoadley, A.B., and J.R. Kettenring. (1990). "Communications Between Statisticians and Engineers/Physical Scientists." Technometrics, Vol. 32, No. 3, 243–247.
Kiefer, J., and J. Wolfowitz. (1959). "Optimum Designs in Regression Problems." Ann. Math. Stat., Vol. 30, 271–294.
Manz, J.W. (1972). "A sequency-ordered fast Walsh transform." IEEE Trans. Audio Electroacoust. AU-20, 204–205.
Nelson, L.S. (1982). "Analysis of Two-Level Factorial Experiments." JQT, Vol. 14, No. 2, 95–98.
Pratt, W.K., J. Kane, and H.C. Andrews. (1969). "Transform image coding." Proc. IEEE 57, 58–68.
Sanchez, P.J., and S.M. Sanchez. (1991). "Design of frequency domain experiments for discrete-valued factors." Applied Mathematics and Computation 42(1), 1–21.
Stoffer, D.S. (1991). "Walsh–Fourier Analysis and Its Statistical Applications." J. American Statistical Association, June 1991, Vol. 86, No. 414, 461–485.
Walsh, J.L. (1923). "A closed set of normal orthogonal functions." Amer. J. Math. 45, 5–24.
Yates, F. (1937). "The Design and Analysis of Factorial Experiments." Technical Communication No. 35, Imperial Bureau of Soil Science, London.
APPENDIX A: TABLE OF NOTATION


Chapter 27

UNCERTAINTY BOUNDS IN PARAMETER ESTIMATION WITH LIMITED DATA

James C. Spall
The Johns Hopkins University
Applied Physics Laboratory
Laurel‚ MD 20723-6099
e-mail: james.spall@jhuapl.edu

Abstract Consider the problem of determining uncertainty bounds for parameter
estimates with a small sample size of data. Calculating uncertainty bounds
requires information about the distribution of the estimate. Although many
common parameter estimation methods (e.g.‚ maximum likelihood‚ least
squares‚ maximum a posteriori‚ etc.) have an asymptotic normal distribu-
tion‚ very little is usually known about the finite-sample distribution. This
paper presents a method for characterizing the distribution of an estimate
when the sample size is small. The approach works by comparing the ac-
tual (unknown) distribution of the estimate with an “idealized” (known)
distribution. Some discussion and analysis are included that compare the
approach here with the well-known bootstrap and saddlepoint methods.
Example applications of the approach are presented in the areas of signal-
plus-noise modeling‚ nonlinear regression‚ and time series correlation analy-
sis. The signal-plus-noise problem is treated in greatest detail; this prob-
lem arises in many contexts‚ including state-space modeling‚ the problem
of combining several independent estimates‚ and quantile calculation for
projectile accuracy analysis.

Key Words: Small sample, parameter estimation, system identification, uncertainty regions, M-estimates, signal-plus-noise, nonlinear regression, time series correlation.

Acknowledgments and Comments: This work was supported by U.S. Navy Contract N00024-98-D-8124. Dr. John L. Maryak of JHU/APL provided many helpful comments and Mr. Robert C. Koch of the Federal National Mortgage Association (Fannie Mae) provided valuable computational assistance in carrying out the example. A preliminary version of this paper was published in the Proceedings of the IEEE Conference on Decision and Control, December 1995. This paper is dedicated to the memory of Sid Yakowitz, a scholar and a gentleman.
1. INTRODUCTION
Meaningful inference in parameter estimation usually involves an estimation
process and an uncertainty calculation. For many estimators—such as least squares‚
maximum likelihood‚ minimum prediction error‚ maximum a posteriori‚ etc.—
there exists an asymptotic theory that provides the basis for determining probabili-
ties and uncertainty regions in large samples (e.g.‚ Hoadley‚ 1971; Ljung‚ 1978;
Serfling‚ 1980). However‚ except for relatively simple cases‚ it is generally not
possible to determine this uncertainty information in the small-sample setting. This
paper presents an approach to determining small-sample probabilities and uncer-
tainty regions for a general class of multivariate M-estimators (M-estimates are
those found as the solution to a system of equations‚ and include those estimates
mentioned above). Theory and implementation aspects will be presented.
The approach is based on a simple—but apparently unexamined—idea.
Suppose that the statistical model being used is some distance (to be defined
below) away from an “idealized” model‚ where the small-sample distribution of
the M-estimate for the idealized model is known. Then the known probabilities
and uncertainty regions for the idealized model provide the basis for computing
the probabilities and uncertainty regions in the actual model. The distance may
be reflected in a conservative adjustment to the idealized quantities. This approach
is fundamentally different from other finite-sample approaches (see below)‚ where
the accuracy of the relevant approximations is tied to the size of the sample
versus the deviation from an idealized model.
The M-estimation framework for the approach encompasses most estimators
of practical interest and allows us to develop concrete regularity conditions that
are largely in terms of the score function (the score function is typically the
gradient of the objective function‚ which is being set to zero to create the system
of equations that yields the estimate). One of the significant challenges in assess-
ing the small-sample behavior of M-estimates is that they are usually nonlinear‚
implicitly defined functions of the data.
Perhaps the most popular current approach to small-sample analysis is com-
puter-based resampling‚ most notably the bootstrap (e.g.‚ Efron and Tibshirani‚
1986; Hall‚ 1992; and Hjorth‚ 1994). The main appeal of this approach is relative
ease of use‚ even for complex estimation problems. Rutherford and Yakowitz (1991)
show how the bootstrap applies in the nonparametric regression problem, for which an analytical treatment would rarely be possible. Resampling techniques make few analytical demands on the user, instead shifting the burden to one of computation. However, the bootstrap may provide a highly inaccurate description of M-estimate uncertainties in small samples (e.g., Lunneborg, 2000, pp. 97–98). This
poor performance is inherently linked to the limited amount of information in
the small sample‚ with little improvement possible through a larger amount of
resampling.
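For readers unfamiliar with the mechanics, here is a minimal bootstrap sketch (our own toy example, not taken from the references): a percentile interval for a variance MLE from n = 5 observations. It also hints at the limitation just noted: the resampled estimates can only recombine the five observed values, so increasing the number of bootstrap replications cannot add information.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=5)                         # a small sample (n = 5)

def m_estimate(sample):
    # MLE of the variance: an M-estimate (a root of a score equation)
    return np.mean((sample - sample.mean()) ** 2)

# Nonparametric bootstrap: resample the data with replacement many times
boot = np.array([m_estimate(rng.choice(x, size=x.size, replace=True))
                 for _ in range(5000)])
lo, hi = np.percentile(boot, [2.5, 97.5])      # percentile interval
print(f"bootstrap 95% interval for the variance: ({lo:.3f}, {hi:.3f})")
```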
Other relatively popular methods for small-sample probability and uncertainty
region calculation are those based on series expansions, particularly the Edgeworth
and saddlepoint approximations (e.g., Daniels, 1954; Davison and Hinkley, 1988; Reid, 1988;
Field and Ronchetti, 1990; and Ghosh, 1994). However, as noted in Reid (1988,
p. 223), “saddlepoint approximations have not yet had much impact on statistical
practice.” More recently, Goutis and Casella (1999) and Huzurbazar (1999) dis-
cuss some of the challenges to implementation in large-scale problems (“...deri-
vations and implementations rely on tools such as exponential tilting, Edgeworth
expansions, Hermite polynomials, complex integration, and other advanced no-
tions”). These references focus on easier problems than the types of problems
motivating our work here. The major limiting factors of these series-based meth-
ods are the cumbersome analytical form and numerical calculations involved.
Hence, much of the literature in this area focuses on the relatively tractable set-
ting of estimates that are a smooth function of a sample mean (Reid, 1988; Kolassa,
1991; Fraser and Reid, 1993; Lieberman, 1994; and Chen and Do, 1994). This
setting, however, is severely limiting in practice. Field and Ronchetti (1990, Sect.
4.5) consider the multivariate M-estimation problem. While their method has
reasonable regularity conditions, the implementation appears challenging in all
but the simplest problems. As an example M-estimate, Field and Ronchetti par-
tially work out the solution for a simple linear regression problem, but in even
this simple problem the solution is incomplete due to an unknown constant of
integration (which does affect the interval probabilities). Field and Ronchetti
(1990) also note difficulties in going to more general problems (“...it is not pos-
sible to find explicit solutions”).
The essential relationship of the small-sample approach here to the analytical
(saddlepoint) methods above is as follows. The saddlepoint methods make strong
analytical and computational demands on the user and appear infeasible in many
of the small-sample multivariate M-estimation problems encountered in practice
(where the estimate is usually implicitly defined and must be found numerically).
Although the approach here is generally easier to use for multivariate M-estimates,
it requires that an idealized setting be identified from which to make adjust-
ments. This is not always possible. Section 3 provides several distinct examples
to illustrate how the idealized case may be determined in practice. A fundamental
distinction between the saddlepoint method and the method here is in the nature
of the errors in the probability calculations. For the saddlepoint these errors are
in terms of the sample size n, and are typically of order 1/n; for the approach here
the error is in terms of the deviation from the idealized case. In particular, for the above-mentioned measure of deviation, ε, the error is of order ε for any n for which the estimate is defined. In addition, the implied constant of the O(ε) order term can be explicitly bounded if needed. Hence, while traditional methods have the desirable property of disappearing error as n → ∞, the small-sample approach here has disappearing error as the model deviation ε → 0. The philosophy here is that one is working with small samples, and desirable properties as n → ∞ may not be relevant.
The remainder of this paper is organized as follows. Section 2 describes the
fundamental problem and formally introduces the concept of the idealized dis-
tribution. Associated artificial estimators and data that will be used in character-
izing the probabilities of interest for the real estimator and data are also intro-
duced in this section. Section 3 summarizes how the formulation in Section 2
applies in the areas of signal-plus-noise modeling, nonlinear regression, and time
series correlation analysis. Section 4 presents the main theoretical results, which
characterize the error between the idealized (known) probability and real (un-
known) probability for the parameter estimate lying in a particular compact set
(i.e., uncertainty region). Section 5 presents a thorough analysis of the signal-
plus-noise example introduced in Section 3, including a numerical evaluation.
Section 6 offers a summary and some concluding remarks. The Appendix pre-
sents technical details and a proof of the Theorem.

2. PROBLEM FORMULATION
Suppose we have a vector of data x (representing a sample of size n) whose distribution depends on θ and a known scalar ε, where θ is to be estimated by maximizing some objective function (say, as in maximum likelihood). The estimate θ̂ is the quantity for which we wish to characterize the uncertainty when n is small. It is assumed to be found as the objective-maximizing solution to the score equation:

s(θ̂, x) = 0    (2.1)

(e.g., s(·) represents the gradient of a log-likelihood function with respect to θ when θ̂ represents a maximum likelihood estimate). Suppose further that if ε = 0 (the "idealized" case), then the distribution for the estimate is known for the small n of interest. Our goal is to establish conditions under which probabilities for θ̂ with ε > 0 (the real case) are close to the known probabilities in the idealized case. In particular, we show that the difference between the unknown and known probabilities for the estimates is proportional to ε when ε is small. This justifies using the known distribution for θ̂ when ε = 0 to construct approximate uncertainty regions for θ̂ when ε is small. Further, when ε is not so small, we show how the difference in the real and idealized probabilities can be approximated or bounded.
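As a self-contained illustration of an M-estimate defined by a score equation (our own example; the paper's setting is more general), the MLE of a Cauchy location parameter has no closed form and must be found as a numerical root of s(θ, x) = 0:

```python
import numpy as np
from scipy.optimize import brentq

rng = np.random.default_rng(0)
x = rng.standard_cauchy(5) + 2.0       # small sample; true location is 2.0

def score(theta):
    # gradient of the Cauchy log-likelihood with respect to theta
    return np.sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2))

theta_hat = brentq(score, x.min(), x.max())   # a root of s(theta, x) = 0
print(theta_hat)
```

The implicit, nonlinear dependence of theta_hat on the data is exactly what makes its finite-sample distribution hard to obtain directly.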
The complexity of θ̂ as a function of x (via (2.1)) makes direct calculation of the small-sample distribution of θ̂ impossible in all but the simplest cases. To characterize probabilities for the estimate, we introduce two artificial estimators that have the same distribution as θ̂ when ε > 0 and when ε = 0, respectively. This is analogous to the use of the well-known Skorokhod representation (Serfling,
1980, p. 23) where an easier-to-analyze "artificial" random process (defined on
a different probability space) is used to analyze the weak convergence and other properties of the real random process of interest. The two artificial estimators, say θ̂_ε and θ̂_0, are based, respectively, on fictitious vectors of data, x_ε and x_0, of the same dimension as x. To construct the two fictitious data vectors, we introduce a random vector z (same dimension as x), with associated transformations h_ε(·) and h_0(·) such that, with the θ underlying the transformations being the same as the unknown (true) θ generating the real data,

x_ε = h_ε(z),    x_0 = h_0(z),    (2.2a, b)

and such that x_ε and x_0 have the same distribution as x for the chosen ε and for ε = 0, respectively. Then, from (2.1),

s(θ̂_ε, x_ε) = 0,    s(θ̂_0, x_0) = 0.
As we will see in the examples of Section 3, it is relatively simple to specify


the transformation and given the basic problem structure assocated
with (2.1).
The fundamental point in the machinations above is that the distributions of θ̂_ε and θ̂_0 are identical to the distributions of the estimate θ̂ under ε > 0 and ε = 0, even though the various quantities (z, x_ε, x_0, etc.) have nothing per se to do with the real data and associated estimate. Our goal in Section 4 is to establish conditions under which probabilities for θ̂_ε are close to the known probabilities for θ̂_0, irrespective of the sample size n. This provides a basis for approximating (or bounding) the probabilities and uncertainty regions for θ̂ under ε > 0 through knowledge of the corresponding quantities for θ̂_0. Throughout the remainder of this paper, we use the order notation O(ε) and o(ε) to denote terms such that O(ε)/ε and o(ε)/ε are bounded and approach zero, respectively, as ε → 0.

3. THREE EXAMPLES OF APPROPRIATE PROBLEM SETTINGS
To illustrate the range of estimation problems for which the small-sample
approach is useful‚ this section sketches how the approach would be applied in
three distinct M-estimation problems. Further detailed analysis (including nu-
merical results) for the first of these problems is provided in Section 5.
3.1. Example 1: Parameter Estimation in Signal-Plus-Noise Model with Non-i.i.d. Data
Consider the problem of estimating the mean and covariance matrix of a random signal when the measurements of the signal include added independent noise with known distributional characteristics. In particular, suppose we have observations x_1, x_2, ..., x_n distributed N(μ, Σ + Q_i), where the noise covariances Q_i are known and the signal parameters μ, Σ (for which the unique elements are represented in vector format by θ) are to be jointly determined using maximum likelihood estimation (MLE). From the form of the score vector (see Section 5), we find that there is generally no closed form solution (and no known finite-sample distribution) for the MLE when Q_i ≠ Q_j for at least one pair i ≠ j. This corresponds to the actual case of interest. We also found that the saddlepoint method was analytically intractable for this problem (due to the relative complexity of the score vector) and that the bootstrap method worked poorly in sample sizes of practical interest (e.g., Spall (1995) discusses this further).
Estimation problems of this type (with either scalar or multivariate data) have
been considered in many different problem contexts in control and statistics.
Here are some examples: Shumway et al. (1981) and Sun (1982) in a state-space
model identification problem; Rao et al. (1981) in the estimation of a random
effects input-output model; James and Venables (1993) and the National Re-
search Council (1992‚ pp. 143–144) in a problem of combining independent
estimates of coefficients; Ghosh and Rao (1994) in small area estimation in sta-
tistical survey sampling; and Hui and Berger (1983) in the empirical Bayesian
estimation of a dose response curve. One of the author’s interests in this type of
problem lies in estimating projectile impact means and covariance matrices from
noisy observations of varying quality; these are then used in calculating CEP
values (the 50% circular quantile values) for measuring projectile accuracy as in
Spall and Maryak (1992) and Shnidman (1995). In addition, for general multivariate versions of this problem, Smith (1985) presents an approach for ensuring that the MLE of the covariance matrix is positive semidefinite, and Spall and Chin (1990) present an approach for data influence and sensitivity analysis.
Central to implementing the small-sample approach is the identification of the idealized case and the definition of ε relative to the problem structure. We can write Q_i⁻¹ = Q⁻¹ + εC_i, where Q and the C_i are known matrices. (We are using the inverse form here to relate the various matrices since the more natural parameterization in the score vector is in terms of Q_i⁻¹, not Q_i, as discussed in Section 5. However, this is not required, as the basic ideas would also apply in working with the noninverse form.) If ε = 0 (the idealized identical-covariance case), the distribution of the estimate is normal-Wishart for all sample sizes of at least two. For this application, the Theorem in Section 4 provides the basis for deter-
mining whether uncertainty regions from this idealized distribution are acceptable approximations to the unknown uncertainty regions resulting from nonidentical Q_i. In employing the Theorem (via (2.2a, b)), we let h_0(z) and h_ε(z) generate data having the common noise covariance Q and the actual noise covariances Q_i, respectively. In cases with a larger degree of difference in the Q_i (as expressed through a larger ε), this idealized approximation for the uncertainty regions may not be adequate; the implied constants associated with the O(ε) bound of the Theorem then provide a means of altering the idealized uncertainty regions (these implied constants depend on terms other than ε).
This example illustrates the apparent arbitrariness sometimes present in specifying a numerical value of ε (e.g., if the elements of the C_i are made larger, then the value of ε must be made proportionally smaller to preserve algebraic equivalence). This apparent arbitrariness has no effect on the fundamental limiting process, as it is only the relative values of ε that have meaning after the other parameters (e.g., Q, the C_i, etc.) have been specified. In particular, the numerical value of the O(ε) bound does not depend on the way in which the deviation from the idealized case is allocated to ε and to the other parameters; in this example, the bound depends on the products εC_i, which are certainly not arbitrary. We will return to this signal-plus-noise example in Section 5 for a more thorough treatment.

3.2. Example 2: Nonlinear Input-Output (Regression) Model
Although the standard linear regression framework is appropriate for modeling input-output relationships in some problems, a great number of practical problems have inherent nonlinearities. In particular, suppose that the data are modeled as coming from a relationship of the form

x = f(θ, ε) + w,

where f(·) is a nonlinear mapping and w is a random noise term. Typically, least squares, Bayesian, or MLE techniques are used to find an estimate of θ. In contrast to the linear setting (with normally distributed noise terms), the finite-sample distribution of θ̂ will rarely be known in the nonlinear setting. (In particular, although the problem of estimating parameters in nonlinear regression models is frequently solvable using numerical optimization methods, the "situation is much worse when considering the accuracy of the obtained estimates" [Pazman, 1990].) The small-sample approach here is appropriate when the degree of nonlinearity is moderate; the corresponding idealized case is a linear regression setting that, in a sense illustrated below, is close to the actual nonlinear setting. Relative to (2.2a, b), it is natural to choose h_ε(z) = f(θ, ε) + z, where z has the joint distribution of w.
Let us illustrate the ideas for two specific nonlinear cases. First, suppose that f(·) is a quadratic function of θ whose linear and quadratic coefficients are vectors
or matrices (as appropriate) of known constants, with the quadratic contribution scaled by ε. Such a setting might arise in an inversion problem of attempting to recover an unknown input value from observed outputs (as is, e.g., the main theme of fields such as statistical pattern recognition, image analysis, and signal processing); this particular quadratic model also plays a major role in experimental design and response surface identification (e.g., Walter and Pronzato, 1990; Joshi et al., 1994; and Bisgaard and Ankenman, 1996). Clearly, in the ε = 0 case we have the standard linear regression model, for which the distribution of θ̂ is known for w having certain distributions (e.g., normal). As with the example of Subsection 3.1, the apparent arbitrariness in specifying ε is accommodated since the product of ε and the quadratic coefficients is the inherent expression of nonlinearity appearing in the O(ε) bound. The second case pertains to a common model in econometrics. Suppose that f(·) represents the constant elasticity of substitution (CES) production function relating labor and capital inputs to production output within a sector of the economy (Kmenta, 1971, pp. 462–464 or Nicholson, 1978, pp. 200–201). This model includes a "substitution parameter," which we represent by ε. After making a standard log transformation, the CES model has a nonlinear-regression form in which the three parameters within the vector θ represent parameters of economic interest, and K_i and L_i represent the capital and labor input of firm i. As discussed in Kmenta (1971, p. 463) and Nicholson (1978, p. 201), when ε → 0, non-trivial limiting arguments show that the CES function reduces to the well-known (log-linear) Cobb-Douglas production function, representing the idealized case here. Hence, uncertainty regions for the estimate θ̂ in the CES model can be derived from the standard linear regression-based uncertainty regions for the Cobb-Douglas function through use of the Theorem of Section 4.
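The CES-to-Cobb-Douglas limit is easy to verify numerically. The sketch below uses a common textbook form of the CES function with our own illustrative parameter values (the chapter's exact notation is not reproduced); the gap between the log CES output and the log-linear Cobb-Douglas output shrinks as the substitution parameter tends to 0.

```python
import numpy as np

gamma, nu, delta = 2.0, 0.9, 0.4   # assumed CES parameters (illustrative)
K, L = 3.0, 5.0                    # capital and labor inputs

def log_ces(rho):
    # log CES production function after the standard log transformation
    return np.log(gamma) - (nu / rho) * np.log(delta * K**(-rho) + (1.0 - delta) * L**(-rho))

# Cobb-Douglas limit of the same function as rho -> 0
log_cd = np.log(gamma) + nu * delta * np.log(K) + nu * (1.0 - delta) * np.log(L)

for rho in [1.0, 0.1, 0.01, 0.001]:
    print(rho, log_ces(rho) - log_cd)   # the gap shrinks as rho -> 0
```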

3.3. Example 3: Estimates of Serial Correlation for Time Series
A basic problem in time series and dynamic modeling is to determine whether
a sequence of measurements is correlated over time, and, if so, to determine the
maximum order of correlation (i.e., the maximum number of time steps apart for
which the elements in the sequence are correlated). A standard approach for
testing this hypothesis is to construct estimates of correlation coefficients for
varying order correlations, and then to use the known distribution of the esti-
mates to test against the hypothesis of zero correlation. Let us suppose that we
construct MLEs of the jth-order correlation coefficients, j = 1, 2, .... Our inter-
est here is in the case where the data are non-normally distributed. This con-
trasts, for example, with the small-sample approach in Cook (1988), which is
oriented to normal (and autoregressive) models. (By the result in Bickel and Doksum (1977, pp. 220–221), we know that standard correlation estimate forms in, say, Anderson (1971, Subsection 6.1) correspond to the MLE when the data are normally distributed.)
To define the likelihood function for performing the ML estimation, one needs to choose a particular model of the non-normality. Although the method here can work with any of the common non-normal models, let us consider the fairly simple way of supposing the data are distributed according to a nonlinear transformation of a normal random vector. (Two other ways may also be appropriate: (i) suppose that the data are distributed according to a mixture distribution where at least one of the distributions in the mixture is normal and where the weighting on the other distributions is expressed in terms of ε; or (ii) suppose that the data are composed of a convolution of two random vectors, one of which is normal and the other being non-normal with a weighting expressed by ε.) In particular, consistent with (2.2a, b), suppose that x has the same distribution as h_ε(z), where z is a normally distributed random vector and h_ε(·) is a transformation, with ε measuring the degree of nonlinearity. Since h_0(·) is a linear transformation, the resulting artificial estimate θ̂_0 has one of the finite-sample distributions shown in Anderson (1971, Sect. 6.7) or Wilks (1962, pp. 592–593) (the specific form of distribution depends on the properties of the eigenvalues of matrices defining the time series progression). Note that, aside from entering the score function through the artificial data, ε appears explicitly (à la (2.1)) through its effect on the form of the distribution (and hence likelihood function) for the data x or x_ε. Then, provided that ε is not too large, the Theorem in Section 4 (with or without the implied constant of the O(ε) bound, as appropriate) can be used with the known finite-sample distribution to determine set probabilities for testing the hypothesis of sequential uncorrelatedness in the non-normal case of interest.
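A small simulation conveys the idea. In the sketch below, z is Gaussian white noise, the cubic term is a hypothetical nonlinearity standing in for h_ε(·), and the threshold 1.96/√n is the usual large-sample null approximation rather than the exact finite-sample distributions cited above; the Monte Carlo loop estimates the actual test level under the non-normal model so it can be compared with the level implied by idealized theory.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 30, 20000

def lag1_corr(x):
    x = x - x.mean()
    return np.dot(x[:-1], x[1:]) / np.dot(x, x)

for eps in [0.0, 0.1, 0.3]:
    rejects = 0
    for _ in range(reps):
        z = rng.normal(size=n)      # idealized case: Gaussian white noise
        x = z + eps * z ** 3        # hypothetical nonlinearity h_eps(z)
        rejects += abs(lag1_corr(x)) > 1.96 / np.sqrt(n)
    print(eps, rejects / reps)      # actual level vs the nominal 5%
```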

4. MAIN RESULTS
4.1. Background and Notation
This section presents the main result, showing how the difference in the unknown and known probabilities for θ̂ lying in a rectangle decreases as ε → 0. In particular, we will be interested in characterizing the probabilities associated with rectangles

[a − h, a + h] = {θ : a_j − h_j ≤ θ_j ≤ a_j + h_j for each j},

where a = (a_1, ..., a_p)ᵀ and each h_j > 0. (Of course, by considering a union of arbitrarily small rectangles, the results here can be applied to a nonrectangular compact set subject to an arbitrarily small error.) As discussed in Section 2, we will use the artificial estimates θ̂_ε and θ̂_0 in analyzing this difference. After some background material is presented in Subsection 4.1, Subsection 4.2 presents the Theorem showing that the difference in probabilities is O(ε). Subsection 4.3 presents a computable bound to the implied constant of the O(ε) bound.
An expression of critical importance in the Theorem below (and in the calculation of the implied constants for the O(ε) result in the Theorem; see Subsection 4.3) is the gradient ∂θ̂_ε/∂ε. From the fact that θ̂_ε depends on ε both directly and through the data x_ε = h_ε(z), we have the decomposition in (4.1) by the total derivative formula of calculus.
When the score s(·) is a continuously differentiable function in a neighborhood of θ̂_ε and the z of interest, and when ∂h_ε(z)/∂ε exists at these z, the implicit function theorem applies to two of the gradients on the right-hand side of (4.1), yielding (4.2), where the right-hand sides of the expressions in (4.2) are evaluated at θ̂_ε and the z of interest. All references to s = s(·) here and in the Theorem correspond to the definition in (2.1), with x_ε replacing x as in (2.2a). The remaining gradient on the right-hand side of (4.1) is obtainable directly as ∂h_ε(z)/∂ε. Note that h_ε(·) and its derivative in (4.1) are evaluated at the true θ (the θ generating the data), in contrast to the other expressions in (4.1) and (4.2), which are evaluated at the estimated θ. One interesting implication of (4.1) and (4.2) is that ∂θ̂_ε/∂ε is explicitly calculable even though θ̂_ε is, in general, only implicitly defined. From the implicit function theorem, the computation of ∂θ̂_ε/∂ε for the important special case of ε = 0 (see notation below) relies on the above-mentioned assumptions of continuous differentiability for s(·) holding for ε both slightly positive and negative.
The following notation will be used in the Theorem conditions and proof: Consistent with usage above, a subscript i or j on a vector (say, on z, θ̂_ε, etc.) denotes the ith or jth component. Further, the inverse image of [a − h, a + h] relative to θ̂_ε is the set of z values for which θ̂_ε lies in [a − h, a + h]; likewise, the inverse image relative to θ̂_0 is defined with θ̂_0 in place of θ̂_ε.
4.2. Order Result on Small-Sample Probabilities
The main theoretical result of this paper is presented in the Theorem below.
The proof is provided in the Appendix together with certain technical regularity
conditions. These conditions are quite modest, as discussed in the remarks fol-
lowing their presentation in the Appendix (and as demonstrated in the signal-
plus-noise example of Section 5). The regularity conditions pertain essentially
to smoothness properties of the score vector and to characteristics of the distri-
bution of z.
Theorem. Let θ̂_ε and θ̂_0 be as given in (2.2a, b), and let a ± h be continuity points of the associated distribution functions. Then, under regularity conditions C.1–C.5 in the Appendix,

P(θ̂_ε ∈ [a − h, a + h]) − P(θ̂_0 ∈ [a − h, a + h]) = O(ε).
4.3. The Implied Constant of the O(ε) Bound

Through the form of the calculations in the proof of the Theorem, it is possible to produce computable implied constants for the O(ε) bound, i.e., constants c(a, h) such that

|P(θ̂_ε ∈ [a − h, a + h]) − P(θ̂_0 ∈ [a − h, a + h])| ≤ c(a, h) ε + o(ε).    (4.3)
We present one such constant here; another is presented in Spall (1995). The constant here will tend to be conservative in that it is based on upper bounds to certain quantities in the proof of the Theorem. This conservativeness may be desirable in cases where ε is relatively large, to ensure that the inequality in (4.3) is preserved in practical applications (i.e., when the o(ε) term is ignored).² The constant in Spall (1995) is less conservative and is determined through a computer resampling procedure.
Aside from the assumptions of the Theorem, there are two additional conditions under which c(a, h) is valid: (i) the elements of z are mutually independent, with each element having a bounded, continuous density function on some set such that every point of the relevant inverse image is interior to this set, and (ii) ∂θ̂_ε/∂ε (from Subsection 4.1) is uniformly bounded on the inverse image (note that, even when ∂θ̂_ε/∂ε is continuous on the inverse image, the Theorem itself does not require boundedness, since the inverse image
2
The issue of ignoring higher-order error terms (à la the o(ε) term in (4.3)) is common in many small- and large-sample estimation methods. For example, the saddlepoint, bootstrap, and central limit theorem, in general, all have unquantifiable higher-order error terms in terms of n.
may be unbounded). Although these conditions may only be approximately satisfied in some practical problems, the value of c(a, h) computed as if the conditions were true may still provide critical insight into the true (uncomputable) constant (it should be no surprise that we have to impose additional conditions to obtain an analytical constant, given that the computation of implied constants in order bounds that appear in the statistics and identification literature is almost always notoriously difficult and is usually not carried out). Nevertheless, condition (i) can sometimes be fully satisfied, as illustrated in the problem of Section 5. Furthermore, in other problems, it can often be at least approximately satisfied by forming a new estimate through scaling the original one by the square root of the Fisher information matrix under ε = 0 (this uses the fact that many estimates are asymptotically normally distributed with covariance matrix given by the inverse Fisher information matrix and the fact that independence and uncorrelatedness are equivalent under joint normality).
Let us begin our derivation of c(a, h) by developing constants for each of the two probabilities in (A.6) of the Theorem proof. Then, by (A.1), c(a, h) will be twice the sum (over j) of the two constants for (A.6). Let the bound to |∂θ̂_ε,j/∂ε| on the inverse image be that implied by condition (ii). (In practice, if this bound is not available analytically, it could be estimated by randomly sampling z a large number of times over the inverse image, taking the estimated bound as the maximum observed value of |∂θ̂_ε,j/∂ε|.) Then, we have for one of the two probabilities in (A.6),

where the first factor is the marginal density function for the corresponding element of z, the remaining term is defined in the Appendix (eqn. (A.4)), and the last line follows from the independence condition (i) and the mean value theorem. For the other probability in (A.6), we follow an identical line of reasoning to obtain

where the analogous quantities appear. Because the true values of the parameters entering these expressions will not be known in practice, they could be chosen as the midpoints of the (assumed small) intervals in which they lie. Thus, from (A.1) and (A.6), the constant c(a, h) follows.
5. APPLICATION OF THEOREM FOR THE MLE OF PARAMETERS IN SIGNAL-PLUS-NOISE PROBLEM

5.1. Background
This section returns to the example of Subsection 3.1 and presents an analysis of how the small-sample approach would apply in practice. In particular, consider independent scalar observations x_i distributed N(μ, σ² + q_i), i = 1, 2, ..., n, where the noise variances q_i are known and μ and σ² are to be estimated (jointly) using maximum likelihood. As mentioned in Subsection 3.1, when q_i ≠ q_j for at least one pair i, j (the actual ε > 0 case), no closed-form expression (and hence no computable distribution) is generally available for the MLE. When q_i = Q for all i (the idealized ε = 0 case), the distribution of the MLE is known (see (5.2a, b) below).
For this estimation problem, Subsection 5.2 discusses the regularity condi-
tions of the Theorem and comments on the calculation of the implied constant
c(a, h), and Subsection 5.3 presents some numerical results. This two-parameter
estimation problem is one where the other analytical techniques discussed in
Section 1 (i.e., Edgeworth expansion and saddlepoint approximation) are im-
practical because of the unwieldy calculations required (say, as related to the
cumulant generating function and its inverse).
When using the ε = 0 distribution for θ̂_0 as an approximation to the actual distribution (when justified by the Theorem), we choose a value of Q corresponding to the "information average" of the individual q_i's, i.e., Q is such that Q⁻¹ = n⁻¹ Σ_i q_i⁻¹. (The idea of summing information terms for different measurements is analogous to the idea in Rao (1973, pp. 329–331).) As discussed in Subsection 3.1, deviations of order ε from the common Q are then naturally expressed in the inverse domain: q_i⁻¹ = Q⁻¹ + ε d_i, where the d_i are some fixed quantities (discussed below). Working with information averages has proven desirable as a way of down-weighting the relative contribution of the larger q_i versus what their contribution would be, say, if Q were a simple mean of the q_i (from (5.1) below, we see that the score expression also down-weights the data associated with larger q_i). A further reason to favor the information average is that the score is naturally parameterized directly in terms of the q_i⁻¹ through use of the relationship q_i⁻¹ = Q⁻¹ + ε d_i. Hence, Q⁻¹ represents the mean of the natural nuisance parameters in the problem. Finally, we have found numerically that the "idealized" probabilities computed with the information average have provided more accurate approximations to the true probabilities when the q_i vary moderately than, say, idealized probabilities based on an average equal to the mean of the q_i. Note, however, that any type of average will work when the q_i are sufficiently close, since q_i → Q if and only if q_i⁻¹ → Q⁻¹ when Q > 0.
The log-likelihood function, l(θ), for the estimation of θ = (μ, σ²)ᵀ is

l(θ) = −(1/2) Σ_i [log(2π v_i) + (x_i − μ)²/v_i],    (5.1)

where v_i = σ² + q_i, from which the score expression is found:

s(θ, x) = ( Σ_i (x_i − μ)/v_i,  −(1/2) Σ_i [1/v_i − (x_i − μ)²/v_i²] )ᵀ.

Since σ² ≥ 0, we will consider only those sets of interest [a − h, a + h] whose σ² components are nonnegative. This does not preclude having a practical estimate come from σ̂² < 0 (in which case one would typically set the estimate to 0); however, in specifying confidence sets, we will only consider those points in (μ, σ²) space that make physical sense. (Note that if n is reasonably large and/or the q_i are reasonably small relative to σ², then σ̂² from (5.1) will almost always be positive.)
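Here is a minimal numerical sketch of this estimation problem, assuming the scalar model x_i ~ N(μ, σ² + q_i) with known q_i as above (all variable names are ours): the actual-case MLE is obtained by numerically maximizing (5.1), while the idealized equal-noise case has the closed form underlying (5.2a, b).

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
q = np.array([0.0323, 0.0323, 0.0323, 0.0625, 0.0625])  # known noise variances
mu_true, s2_true = 0.0, 1.0
x = rng.normal(mu_true, np.sqrt(s2_true + q))           # one sample of size 5

def mle(x, q):
    """Numerically maximize the log-likelihood (5.1) over (mu, sigma^2)."""
    nll = lambda t: 0.5 * np.sum(np.log(t[1] + q) + (x - t[0]) ** 2 / (t[1] + q))
    res = minimize(nll, [x.mean(), 0.5],
                   bounds=[(None, None), (1e-6 - q.min(), None)])
    return res.x

mu_hat, s2_hat = mle(x, q)            # actual case: no closed form

Q = 1.0 / np.mean(1.0 / q)            # information average (about 0.04 here)
mu_0 = x.mean()                       # idealized-case MLE in closed form
s2_0 = np.mean((x - mu_0) ** 2) - Q   # may be negative in small samples
```

In the idealized case, mu_0 is exactly normal and n(s2_0 + Q)/(σ² + Q) is chi-squared with n − 1 degrees of freedom, which is what makes the ε = 0 distribution available for any n ≥ 2.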

5.2. Theorem Regularity Conditions and Calculation of the Implied Constant
The first step in checking the conditions for the Theorem is to define the artificial data sequences and the associated artificial estimators θ̂_ε and θ̂_0. From the definitions in Subsection 3.1, the two artificial MLEs are obtained by solving the score equation with the fictitious data x_ε = h_ε(z) and x_0 = h_0(z), where the noise variances are the q_i and the common Q, respectively. As required to apply the Theorem, θ̂_0 has a known distribution (the same, of course, as for the θ̂ of interest from (5.1) when ε = 0). In particular, μ̂_0 and σ̂²_0 satisfy

μ̂_0 ~ N(μ, (σ² + Q)/n),    (5.2a)

n(σ̂²_0 + Q)/(σ² + Q) ~ χ²(n − 1).    (5.2b)
Spall (1995) includes a verification of the regularity conditions C.1–C.4 in
the Appendix (C.5 is immediate by the definition of z). We assume that Q > 0; then q_i > 0 for all ε in a neighborhood of 0 (i.e., q_i is well-defined as a variance for all ε near 0, including ε < 0, as required by the implicit function theorem in computing ∂θ̂_ε/∂ε, as discussed in Subsection 4.1).
The two conditions for calculating the constant c(a, h) introduced in Subsection 4.3 were also satisfied. In particular, (i) the elements of z are independent, each with a bounded continuous density function, and (ii) ∂θ̂_ε/∂ε is bounded by the fact that it is continuous on the inverse image (via the implicit function theorem) and the inverse image is a bounded set. Although an analytical form is available for ∂θ̂_ε/∂ε, it may be easier in practice to approximate the maximum of |∂θ̂_ε,j/∂ε| for each j by randomly sampling z over the inverse image. This yields estimates of the two required bounds, and is the procedure used in computing c(a, h) in Subsection 5.3 below. The required probabilities for j = 1, 2 are readily available from the normal and chi-squared distributions for μ̂_0 and σ̂²_0. Likewise, the density-based values are easily approximated by evaluating the densities at an intermediate (we use the mid) point of the appropriate interval. This provides all the elements needed for a practical determination of c(a, h), as illustrated in Subsection 5.3.
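The random-sampling approximation of the derivative bounds can be sketched as follows, reusing mle() from the sketch after (5.1). The deviation pattern d below is an illustrative choice roughly consistent with the q_i sets reported in Subsection 5.3 (assuming q_i⁻¹ = Q⁻¹ + ε d_i as above), and the finite-difference gradient recomputes the estimate at perturbed ε with the same z:

```python
import numpy as np

rng = np.random.default_rng(4)
Q = 0.04
d = np.array([39.7, 39.7, 39.7, -60.0, -60.0])   # illustrative d_i values
mu, s2, n = 0.0, 1.0, 5
q_of = lambda eps: 1.0 / (1.0 / Q + eps * d)     # q_i^{-1} = Q^{-1} + eps*d_i

eps, h = 0.15, 1e-4
bound = np.zeros(2)
for _ in range(200):                   # random sampling over z
    z = rng.normal(size=n)             # the same z drives both data vectors
    x_hi = mu + np.sqrt(s2 + q_of(eps + h)) * z
    x_lo = mu + np.sqrt(s2 + q_of(eps - h)) * z
    grad = (mle(x_hi, q_of(eps + h)) - mle(x_lo, q_of(eps - h))) / (2 * h)
    bound = np.maximum(bound, np.abs(grad))
print(bound)   # running maxima: crude estimates of the two derivative bounds
```

A faithful implementation would restrict the sampled z to the inverse image of the interval of interest; the sketch takes the maximum over unrestricted draws and is therefore only indicative.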

5.3. Numerical Results
This subsection presents results of a numerical study of the above MLE problem. Our goals here are primarily to compare the accuracy of uncertainty intervals based on the small-sample theory with the actual (empirically determined) regions. We also briefly examine the performance of asymptotic MLE theory. In this study, we took n = 5 and generated data according to the model above, with σ² = 1 and with the q_i such that Q = 0.04 (so the average q_i, in an information sense, is 0.04 according to the discussion of Subsection 5.1). As discussed above, we estimate θ = (μ, σ²)ᵀ and are interested in uncertainty intervals for μ and σ². For ease of presentation and interpretation, we will focus largely on the marginal distributions and uncertainty intervals for each of μ̂ and σ̂²; this also is partly justified by the fact that μ̂ and σ̂² are approximately independent (i.e., when either ε = 0 or n is large, μ̂ and σ̂² are independent). We will report results for ε = 0.15 and ε = 0.30, which correspond to values of q_i equal to {0.0323, 0.0323, 0.0323, 0.0625, 0.0625} and {0.0271, 0.0271, 0.0271, 0.143, 0.143}, respectively. Results for these two values
of ε are intended to represent the performance of the small-sample theory for a small- and a moderate-sized ε.
Spall (1995) shows that for the μ portion of θ, the small-sample uncertainty intervals differ little from the true intervals or those obtained by asymptotic theory. Hence, we focus below on the σ² part of θ. Figure 5.1 depicts three density functions for σ̂² for each of the ε = 0.15 and ε = 0.30 cases: (i) the "true" density based on the marginal histogram constructed from estimates of σ² determined from independent sets of n = 5 measurements (a smoother in the SAS/GRAPH software system (SAS Institute, 1990, Reference Volume 1, p. 416) was used to smooth out the small amount of jaggedness in the empirical histogram), (ii) the small-sample density from (5.2b) (corresponding to the idealized ε = 0 case), and (iii) the asymptotic-based normal density with mean = 1 and variance given by the appropriate diagonal element of the inverse Fisher information matrix
3
Note that for this example, the noise levels are relatively small, so techniques such as those in
Chesher (1991) may have been useful. However, this example was chosen with small noise only to
avoid the “tangential” (to this paper) issue of coping with negative variance estimates; there is
nothing inherent in the small-sample approach that requires small noise (e.g., the other two ex-
amples discussed in Section 3 do not fit into the small-noise framework).
for θ. We see that with ε = 0.15 the true and small-sample densities are virtually identical throughout the domain, while the asymptotic-based density is dramatically different. For ε = 0.30, there is some degradation in the match between the true and idealized small-sample densities, but the match is still much better than that between the true and asymptotic-based densities. Of course, it is the purpose of the adjustment based on c(a, h) to compensate for such a discrepancy in confidence interval calculation. Note that the true densities illustrate the frequency with which we can expect to see a negative variance estimate, which is an inherent problem due to the small size of the sample (the asymptotic-based density significantly overstates this frequency). Because of the relatively poor performance of the asymptotic-based approach, we focus below on comparing uncertainty regions from only the true distributions and the small-sample approach.4
Figure 5.2 translates the above into a comparison of small-sample uncertainty regions with the true regions. Included here are regions based on the O(ε) term of the Theorem when quantified through use of the constant c(a, h). The small-sample regions are "nominal" regions in the sense that the distributions in (5.2a, b)
4
However, as one illustration of the comparative performance of the small-sample and asymptotic approaches for confidence region calculation, consider the CEP estimation problem mentioned in Subsection 3.1 above. The CEP estimates are based on the signal-plus-noise estimation example of this section. For the ε = 0.15 case, the 90% confidence interval of interest for the CEP estimate was approximately 30% narrower when using the small-sample approach than when using the standard asymptotic approach. Hence, by more properly characterizing the distribution of the underlying parameter estimates, we are able to extract more information about the CEP quantity of interest.
and the constant c(a, h) are evaluated at the true μ and σ² (consistent with a hypothesis-testing framework where there is an assumed "true" θ). The indicated interval end points were chosen based on preserving equal probability (0.025) in each tail, with the exception of the conservative ε = 0.30 case; here the lower bound went slightly below 0 using symmetry, so the lower end point was shifted upward to 0 with a corresponding adjustment made to the upper end point to preserve at least 95% coverage. (Spall (1995) includes more detail on how the c(a, h)-based probability adjustment was translated into an uncertainty interval adjustment.) For the ε = 0.15 case, we see that the idealized small-sample bound is identical to the true bound (this, of course, is the most desirable situation since there is then no need to work with the c(a, h)-based adjustment). As expected, the uncertainty intervals with the conservative (c(a, h)-based) adjustments are wider. For the ε = 0.30 case, there is some degradation in the accuracy of coverage for the idealized small-sample interval, which implies a greater need to use the conservative interval to ensure the intended coverage probability for the interval.
The above study is fully representative of others that we have conducted for this estimation framework (e.g., nominal coverage probabilities of 90% and 99% and other values of ε). They illustrate that with relatively small values of ε the idealized uncertainty intervals are very accurate, but that with larger values (e.g., ε = 0.30) the idealized interval becomes visibly too short. In these cases, the c(a, h)-based adjustment to the idealized interval provides a means for broadening the coverage to encompass the true uncertainty interval.
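A rough Monte Carlo coverage check in the same spirit, again reusing mle() and q_of() from the sketches above and the chi-squared pivot from (5.2b), looks like the following; consistent with the study just described, coverage of the idealized interval should be close to nominal for small ε and degrade as ε grows.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
mu, s2, n, Q = 0.0, 1.0, 5, 0.04
for eps in [0.0, 0.15, 0.30]:
    q = q_of(eps) if eps > 0 else np.full(n, Q)
    hits = 0
    for _ in range(2000):
        x = rng.normal(mu, np.sqrt(s2 + q))
        s2_hat = mle(x, q)[1]
        # invert the (5.2b) pivot, which is exact only when eps = 0
        lo = n * (s2_hat + Q) / stats.chi2.ppf(0.975, n - 1) - Q
        hi = n * (s2_hat + Q) / stats.chi2.ppf(0.025, n - 1) - Q
        hits += lo <= s2 <= hi
    print(eps, hits / 2000)   # empirical coverage of the nominal 95% interval
```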

6. SUMMARY AND CONCLUSIONS


Making inference in parameter estimation with limited data is a problem en-
countered in many applications. Although statistical techniques such as the boot-
strap and saddlepoint approximation have shown some promise in the small-
sample setting, there remain serious difficulties in accuracy and feasibility for
the type of multivariate M-estimation problems that are frequently encountered
in practice.
For a range of problem settings, the approach described here is able to provide
accurate information about the estimate uncertainties. The primary restriction is
that an idealized case must be identified (where the estimate uncertainty is known
for the given sample size) with which the estimate uncertainty for the actual case
is compared. This idealized case is related to the actual problem, but differs in
that an analytical solution to the distribution of the estimate is available. A com-
putable bound is available to characterize the difference in set probabilities for
the (known) idealized case and (unknown) actual problem.
A Theorem was presented that provides the basis for comparing the actual
and idealized cases. The Theorem provides an order bound on the difference
between the cases and a means for quantifying the implied constant of the order
bound. Implementations of the approach were discussed for three distinct well-
known identification settings to illustrate the range of potential applications. These
were a signal-plus-noise maximum likelihood estimation problem, a general non-
linear regression setting, and a problem in non-Gaussian time series correlation
analysis.
The signal-plus-noise example was considered in greatest detail. This prob-
lem is motivated for the author by the analysis and (state-space) modeling of a
naval defense system. The small-sample approach was relatively easy to imple-
ment for this problem and yielded accurate results. The required idealized case
was one where the data are i.i.d. When the actual data were moderately non-i.i.d.,
it was shown that the small-sample approach yielded results close to the true
(uncomputable) uncertainty regions. In cases where the actual data deviated a
greater amount from the i.i.d. setting, the small-sample approach yielded conser-
vative uncertainty regions based on a quantification of the implied constant of the
order bound of the Theorem. This example provided a realistic illustration of the
types of issues arising in using the approach in a practical application.
While this paper provides a new method for small-sample analysis, further
work would enhance the applicability of the approach. One area would be to
explore, in detail, the application to problems other than the signal-plus-noise
problem. Two candidate problems in time series and nonlinear regression were
sketched in Section 3. Another valuable extension of the current work would be
to automate (as much as possible) the computation of the implied constant that is
used to provide a conservative uncertainty region. It would also be useful to carry
out a careful comparative analysis of the relative accuracy, ease of implementa-
tion, and applicability of the saddlepoint method, the bootstrap, and the method
of this paper. Despite these open problems, the method here offers a viable ap-
proach to a range of small-sample estimation problems.

APPENDIX: THEOREM REGULARITY CONDITIONS AND PROOF (SECTION 4)
Regularity conditions C.1–C.5 for the Theorem are as follows. These are built from the score function (2.1) (or, equivalently, (2.2a, b)) in terms of the artificial estimators θ̂_ε and θ̂_0.
C.1 Let be a bounded set
is valid). Further, if suppose that is uniformly
bounded away from 0 on If then, except in an open neighborhood of
(i.e., a region such that for some radius > 0 an n-ball of this radius around any
point in is contained within this region), we have uniformly
bounded away from 0 on
C.2 Except on a set of probability measure 0, the derivative ∂θ̂_ε/∂ε exists on the relevant inverse images. Further, for each j, the jth component of this derivative is nonzero a.s. and bounded away from 0.

C.3 For each when suppose that there exists an open


neighborhood of (see C.1) such that and exist continuously in
the neighborhood. Further, for each j and sign ±, there exists some scalar element
in z, say such that we have

C.4

C.5 For each j = 1, 2 , . . . , p, the distribution of z is absolutely continuous in


an open neighborhood of (see C.1).

Remarks on C.1–C.5. Although some of the regularity conditions may seem


arcane, the conditions may be easily checked in some applications (see, e.g., the problem in Section 5). In checking C.1, note that the set in question will often be a bounded set (see Section 5), which automatically implies that the relevant inverse image is bounded. The other requirement, of being bounded away from 0 on that set, is straightforward to check since θ̂_0 has a known distribution. Given the form of the score s(·) and the transformation h_ε(·), eqns. (4.1) and (4.2) readily suggest when C.2 will be satisfied. Note that when the set is bounded and ∂θ̂_ε/∂ε is continuous on it (see the example in Section 5), the last part of the condition will be automatically satisfied, since ∂θ̂_ε/∂ε will be bounded above. The conditions in C.3 pertaining to θ̂_0 can generally be checked directly, since θ̂_0 is typically available in closed form (its distribution is certainly available in closed form); the condition on θ̂_ε can be checked through use of (4.1) and (4.2) (as in checking C.2). As illustrated in Section 5, to satisfy the near-symmetry condition, C.4, it is sufficient that z have a continuous density near the relevant points and that ∂θ̂_ε/∂ε (from (4.1)) be uniformly bounded (of course, this condition can also be satisfied in settings where ∂θ̂_ε/∂ε is not bounded in this way). Finally, C.5 is a straightforward condition to check since the analyst specifies the distribution for z. Note that none of the conditions impose any requirements of bounded moments for θ̂_ε (or, equivalently, for θ̂_0), which would be virtually impossible to check for most practical M-estimation problems.

Briefly, the conditions above are used in the proof as follows. The bounded
sets assumed in C.1 ensure that a domain of interest can be covered by a finite
number of “small” neighborhoods. The fact that the derivative exists and is
bounded away from 0 in C.2 ensures that an important ratio with this derivative in the denominator exists a.s. The assumption in C.3 regarding the continuous differentiability of certain expressions ensures (via the implicit function theorem) the local one-to-oneness of certain transformations, which in turn ensures the existence of certain density functions. C.4 allows for the substitution of an easier probability expression for a more complicated probability (to within "negligible" error). Finally, C.5 guarantees that local density functions exist for the "artificial" data vector z, which can then be used via the mean value theorem to characterize a set probability of interest.
In the proof of the Theorem below, we use the following two Lemmas, which
have relatively straightforward proofs given in Spall (1995).
Lemma 1 For a random vector let be continuity points for
its associated distribution function F(·). Then

where
For Lemma 2, o_p(ε) represents a term such that o_p(ε)/ε converges to 0 in probability as ε → 0.
Lemma 2 Let and A be two random variables and one event (respec-
tively)‚ with dependent on the introduced above. Suppose that
where and that Then

Proof of Theorem. Without loss of generality‚ we take Let us define

and

as appropriate). Then by the dominated convergence theorem and


Laha and Rohatgi (1979‚ pp. 145–146)‚
For the probabilities in the first sum, we know that the event occurs
only if Letting we have that the event for the jth summand
corresponds to

For the probabilities in the second sum on the right-hand side of (A.1) we know
that so the event for the jth summand corresponds to

By condition C.4 we know that the probability of the event on the r.h.s. of (A.2a) can be bounded above, to within o(ε), by the expression given in (A.3).

Hence, from (A.2a, b), to within o(ε), each of the 2p probabilities associated with the summands in (A.1) may be bounded above by the expression in (A.3).
Now, for almost all z in the set we know by condition C.2 and Taylor’s
theorem that

(suppressing the dependence on z). For all j = 1, 2, ..., p, we know by C.2 that


a.s. on Hence the quantity

is well-defined a.s. on Note that the expression in (A.3) is bounded above by


Since Lemma 2 applies (see Spall (1995)), we can replace (to within o(ε) error) the two probabilities in (A.5) by the following probabilities, which do not depend on ε and are easier to analyze (these are the probabilities in (A.6)).

We now show that these two probabilities are O(ε). This will establish the main result to be proved.

To show the above, Spall (1995) establishes that the appropriate conditional densities exist near 0 for each j and ± sign. We then use these densities to characterize the probabilities in (A.6). (When such a density does not exist for a given j, the corresponding probability in (A.6) is handled directly by C.1 and C.2.) To characterize these densities, we first use C.2, C.3, and the inverse function theorem to establish that local densities exist in a finite number of disjoint regions. The inverse and implicit function theorems are then used to establish the form for the joint density of the elements of z remaining after the scalar element from C.3 has been removed, in each of the local regions. These remaining elements can then be integrated out, leaving the densities of interest in each local region. On each of these local regions, it can be shown that the relevant probability is O(ε). Taking the union over the finite number of local regions establishes that the probabilities in (A.6) are O(ε). Q.E.D.

REFERENCES
Anderson‚ T. W. (1971). The Statistical Analysis of Time Series‚ Wiley‚ New York.
Bickel‚ P. J.‚ and K. A. Doksum (1977). Mathematical Statistics‚ Holden-Day‚ Oakland‚
CA.
Bisgaard‚ S. and B. Ankenman (1996). Standard errors for the eigenvalues in second-
order response surface models. Technometrics‚ 38:238–246.
Chen‚ Z. and K.-A. Do (1994). The bootstrap method with saddlepoint approximations
and importance sampling. Statistica Sinica‚ 4:407–421.
Chesher‚ A. (1991). The effect of measurement error. Biometrika‚ 78:451–462.
Cook‚ P. (1988). Small-sample Bayesian frequency-domain analysis of autoregressive
models. J. C. Spall‚ editor‚ Bayesian Analysis of Time Series and Dynamic Models‚
pp. 101–126. Marcel Dekker‚ New York.
Daniels‚ H. E. (1954). Saddlepoint approximations in statistics. Annals of Mathematical
Statistics‚ 25:631–650.
Davison‚ A. C.‚ and D. V. Hinkley (1988). Saddlepoint approximations in resampling
methods. Biometrika‚ 75:417–431.
Efron, B., and R. Tibshirani (1986). Bootstrap methods for standard errors, confidence
intervals‚ and other measures of statistical accuracy (with discussion). Statistical
Science‚ 1:54–77.
Field‚ C‚ and E. Ronchetti (1990). Small Sample Asymptotics. IMS Lecture Notes–
Monograph Series (vol. 13). Institute of Mathematical Statistics‚ Hayward‚ CA.
Fraser‚ D.A.S.‚ and N. Reid (1993). Third–order asymptotic models: Likelihood functions
leading to accurate approximations for distribution functions. Statistica Sinica‚ 3:
67–82.
Ghosh‚ J. K. (1994). Higher Order Asymptotics. NSF-CBMS Regional Conference Se-
ries in Probability and Statistics‚ Volume 4. Institute of Mathematical Statistics‚
Hayward‚ CA.
Ghosh, M., and J. N. K. Rao (1994). Small area estimation: An appraisal (with discussion). Statistical Science, 9:55–93.
Goutis‚ C.‚ and G. Casella (1999). Explaining the saddlepoint approximation. American
Statistician‚ 53:216–224.
Hall‚ P. (1992). The Bootstrap and the Edgeworth Expansion. Springer-Verlag‚ New York.
Hjorth‚ J. S. U. (1994). Computer Intensive Statistical Methods. Chapman and Hall‚ Lon-
don.
Hoadley‚ B. (1971). Asymptotic properties of maximum likelihood estimates for the
independent not identically distributed case. Annals of Mathematical Statistics‚
42:1977–1991.
Hui‚ S. L‚ and J. O. Berger (1983). Empirical Bayes estimation of rates in longitudinal
studies. Journal of the American Statistical Association‚ 78:753–760.
Huzurbazar‚ S. (1999). Practical saddlepoint approximations. American Statistician‚
53:225–232.
James‚ A. T.‚ and W. N. Venables (1993). Matrix weighting of several regression coeffi-
cient vectors. Annals of Statistics‚ 21:1093–1114.
Joshi‚ S. S.‚ H. D. Sherali‚ and J. D. Tew (1994). An enhanced RSM algorithm using
gradient-deflection and second-order search strategies. J. D. Tew et al.‚ editors‚ Pro-
ceedings of the Winter Simulation Conference‚ pp. 297–304.
Kmenta‚ J. (1971). Elements of Econometrics. Macmillan‚ New York.
Kolassa‚ J. E. (1991). Saddlepoint approximations in the case of intractable cumulant
generating functions. Selected Proceedings of the Sheffield Symposium on Applied
Probability. IMS Lecture Notes—Monograph Series (vol. 18). Institute of Mathemati-
cal Statistics‚ pp. 236–255‚ Hayward‚ CA.
Laha‚ R. G.‚ and V. K. Rohatgi (1979). Probability Theory. Wiley‚ New York.
Lieberman‚ O. (1994). On the approximation of saddlepoint expansions in statistics.
Econometric Theory‚ 10:900–916.
Ljung‚ L. (1978). Convergence analysis of parametric identification methods. IEEE Trans-
actions on Automatic Control‚ AC-23:770–783.
Lunneborg, C. E. (2000). Data Analysis by Resampling: Concepts and Applications.
Duxbury‚ Pacific Grove‚ CA.
National Research Council (1992). Combining Information: Statistical Issues and Opportunities for Research. National Academy of Sciences, Washington, DC.
Nicholson, W. (1978). Microeconomic Theory. Dryden, Hinsdale, IL.
Pazman, A. (1990). Small-sample distributional properties of nonlinear regression estimators (a geometric approach) (with discussion). Statistics, 21:323–367.
Rao, C. R. (1973). Linear Statistical Inference and its Applications. Wiley, New York.
Rao, P. S. R. S., J. Kaplan, and W. G. Cochran (1981). Estimators for the one-way random effects model with unequal error variances. Journal of the American Statistical Association, 76:89–97.
Reid, N. (1988). Saddlepoint methods and statistical inference (with discussion). Statistical Science, 3:213–238.
Rutherford, B., and S. Yakowitz (1991). Error inference for nonparametric regression. Annals of the Institute of Statistical Mathematics, 43:115–129.
SAS Institute (1990). SAS/GRAPH Software: Reference, Version 6, First Edition. SAS Institute, Cary, NC.
Serfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. Wiley, New York.
Shnidman, D. A. (1995). Efficient computation of the circular error probable (CEP) integral. IEEE Transactions on Automatic Control, 40:1472–1474.
Shumway, R. H., D. E. Olsen, and L. J. Levy (1981). Estimation and tests of hypotheses for the initial mean and covariance in the Kalman filter model. Communications in Statistics—Theory and Methods, 10:1625–1641.
Smith, R. H. (1985). Maximum likelihood mean and covariance matrix estimation constrained to general positive semi-definiteness. Communications in Statistics—Theory and Methods, 14:2163–2180.
Spall, J. C. (1995). Uncertainty bounds for parameter identification with small sample sizes. Proceedings of the IEEE Conference on Decision and Control, pp. 3504–3515.
Spall, J. C. and D. C. Chin (1990). First-order data sensitivity measures with applications to a multivariate signal-plus-noise problem. Computational Statistics and Data Analysis, 9:297–307.
Spall, J. C. and J. L. Maryak (1992). A feasible Bayesian estimator of quantiles for projectile accuracy from non-i.i.d. data. Journal of the American Statistical Association, 87:676–681.
Sun, F. K. (1982). A maximum likelihood algorithm for the mean and covariance of nonidentically distributed observations. IEEE Transactions on Automatic Control, AC-27:245–247.
Walter, E., and L. Pronzato (1990). Qualitative and quantitative experiment design for phenomenological models—A survey. Automatica, 26:195–213.
Wilks, S. S. (1962). Mathematical Statistics. Wiley, New York.
Chapter 28

A TUTORIAL ON HIERARCHICAL LOSSLESS DATA COMPRESSION

John C. Kieffer
Department of Electrical and Computer Engineering
University of Minnesota
Minneapolis, MN 55455

Abstract Hierarchical lossless data compression is a compression technique that has been
shown to effectively compress data in the face of uncertainty concerning a proper
probabilistic model for the data. In this technique, one represents a data sequence
x using one of three kinds of structures: (1) a tree called a pointer tree, which
generates x via a procedure called “subtree copying”; (2) a data flow graph
which generates x via a flow of data sequences along its edges; or (3) a context-
free grammar which generates x via parallel substitutions accomplished with
the production rules of the grammar. The data sequence is then compressed
indirectly via compression of the structure which represents it. This article is
a survey of recent advances in the rapidly growing field of hierarchical lossless
data compression. In the article, we illustrate how the three distinct structures
for representing a data sequence are equivalent, outline a simple method for
designing compact structures for representing a data sequence, and indicate the
level of compression performance that can be obtained by compression of the
structure representing a data sequence.

1. INTRODUCTION
A modern-day data communication system must be capable of transmitting
data of all types, including textual data, speech/audio data, and image/video
data. The block diagram in Fig. 1 depicts a data communication system
consisting of encoder, channel, and decoder.
The data sequence that is to be transmitted through the communication channel
will need to have allocated to it a certain portion of the available channel bandwidth. To keep the amount of channel bandwidth that is utilized to a minimum, data compression must take place: the data sequence is encoded by the encoder into a binary codeword for transmission through the communication channel. The compression scheme in Fig. 1, consisting of encoder and decoder, is called a lossless scheme if the decoder always reconstructs the original data sequence, and is called a lossy scheme if reconstruction error is allowed.
Hierarchical data compression is a rapidly growing subfield of information theory that started about 20 years ago. In hierarchical data compression, one employs structures such as trees, graphs, or grammars for data representation, thereby allowing progressive data reconstruction across scales, from low resolution to high resolution. In order to better accommodate the demands of communication system users, the next generation of data compression standards will be scalable in this sense, which is why hierarchical data compression is now coming into prominence.
A hierarchical data compression scheme can either be a lossy scheme or a
lossless scheme. Although the hierarchical lossy data compression schemes are
not the subject of this paper, this paragraph gives some pointers to the literature
on this subject. The most common hierarchical lossy compression schemes are
based on wavelets (Strang and Nguyen, 1996, Ch. 11; Chui, 1992, Ch. 7). In a wavelet-based scheme, a signal $f$ is lossily compressed by exploiting its wavelet decomposition

$f(t) = \sum_{m,n} c_{m,n}\,\psi_{m,n}(t), \qquad \psi_{m,n}(t) = 2^{m/2}\psi(2^m t - n),$

where $\psi$ is an appropriately chosen wavelet function. Each wavelet coefficient $c_{m,n}$ provides a different characteristic of the signal $f$ as the scaling parameter $m$ and the spatial parameter $n$ are varied. Compression of the signal $f$ results from expanding in bits each coefficient $c_{m,n}$ whose magnitude
exceeds a given threshold. Several important hierarchical data compression
schemes based upon the wavelet decomposition have been developed, includ-
ing the Burt-Adelson pyramid algorithm (Burt and Adelson, 1983), the Shapiro
zerotree algorithm (Shapiro, 1993), and the Said-Pearlman set partitioning al-
gorithm (Said and Pearlman, 1996). The second most common hierarchical
lossy compression schemes are the fractal-based schemes—we refer the reader
to the references Barnsley and Hurd (1993) and Fisher (1995) for a discussion
of this topic.
The present paper is devoted to coverage of hierarchical lossless data com-
pression schemes, since until recently, knowledge concerning how to design
efficient hierarchical lossless compression schemes lagged behind knowledge
concerning how to design efficient hierarchical lossy compression schemes.
The beginnings of the subject of hierarchical lossless data compression can
be traced to the papers of Kawaguchi and Endo (1980), Cameron (1988), and Kourapova and Ryabko (1995). In these papers, a context-free grammar is
used to model some (but not all) of the dependencies in the data, and then the
data is compressed by making use of the rules of the grammar. In recent years,
a new idea has been exploited, namely, the idea of constructing a grammar from
the data which models all of the dependencies in the data—since the data is
reconstructible from the grammar itself, the problem of compression of the data
reduces to the problem of compression of the grammar representing the data.
In recent years, this idea has spawned many advances in the theory and design
of hierarchical lossless data compression schemes, which shall be surveyed in
this paper.
Fig. 2 gives the encoding part of a hierarchical lossless data compression scheme, and Fig. 3 gives the decoding part. In the encoding part, a transform is applied to a given data sequence $x$ that is to be compressed, transforming $x$ into a data structure that is suitable for representing $x$. The data structure is typically one of three kinds of structures: (i) a pointer tree (Sec. 1.1); (ii) a data flow graph (Sec. 1.2); or (iii) a context-free grammar (Sec. 1.3). The encoder in Fig. 2 operates indirectly, compressing the data structure representing the given data sequence $x$ rather than directly applying itself to $x$. The decoding part unravels the operations done in the encoding part—the decoder and encoder are inverse mappings, and the inverse transform and transform are inverse mappings.

This paper is organized as follows. In the rest of this introductory section, we present examples illustrating the pointer tree, data flow graph, and
context-free grammar structures for data sequence representation. In Section
2, equivalences between the tree, graph, and grammar structures are indicated.
A simple algorithm for the design of a good structure to use in hierarchical
lossless compression is presented in Section 3. Section 4 is concerned with
the encoder operation in hierarchical lossless compression schemes. Finally,
Section 5 presents some results concerning the compression performance afforded by hierarchical lossless compression schemes. In particular, we will
see that hierarchical lossless compression schemes represent a good choice for
the communication system designer who is faced with uncertainty concerning
the “true” probability distribution which models the data that is encountered in
the communication system—such compression schemes can perform well no
matter what the probability model for the data may be.
Terminology and Notation. Throughout this paper, we assume some finite
data alphabet to be fixed in the background. When we refer to a sequence as
a data sequence, we simply mean that the sequence consists of finitely many
entries from the fixed finite alphabet. If $x$ is a sequence whose entries are $x_1, x_2, \ldots, x_n$, respectively, then we write the sequence as $x = x_1 x_2 \cdots x_n$ (i.e., without commas separating the entries). If $u_1, u_2, \ldots, u_k$ are sequences, then $u_1 u_2 \cdots u_k$ denotes the sequence obtained by left-to-right concatenation of the sequences $u_1, u_2, \ldots, u_k$ (see Example 3 for an illustration of the concatenation procedure). In this tutorial, when we refer to a tree, we will always
mean a finite ordered rooted tree in the sense of Knuth (1973). Such a tree has
a unique designated vertex which is the root vertex of the tree, and each of the
finitely many nonroot vertices is reachable via a unique path starting from the
root. The vertices which have children are called nonterminal vertices. We
shall assume that each nonterminal vertex has at least two children, which have
a designated ordering. The vertices of a tree which have no children are called
leaf vertices. We will represent a tree pictorially, adopting the convention that
the root vertex will be placed at the top of the picture, and that the tree’s edges
extend downward, with the children of each nonterminal vertex appearing in
the picture from left-to-right on the same level, corresponding to the designated
ordering of these children. There are two important orderings of the vertices of
a tree that we shall make use of that are compatible with the separate orderings
of the children of each nonterminal vertex: (1) the breadth-first ordering, and
(2) the depth-first ordering.
Example 1: Fig. 4 illustrates the depth-first and breadth-first orderings of the
vertices of a tree.
1.1. POINTER TREE REPRESENTATIONS


We can represent a data sequence $x$ using a tree called a pointer tree, which
satisfies the properties:

Every leaf vertex of the tree is labelled by either a symbol from the data
alphabet or by a pointer label “pointing to” some nonterminal vertex of
the tree. (A vertex containing a pointer label shall be called a pointer
vertex.)

There exists at least one leaf vertex of the tree which is labelled by a
pointer label.

The data sequence represented by the pointer tree can be recovered from
the pointer tree via “subtree copying.”
We explain the “subtree copying” procedure. Define a data tree to be a tree in which every leaf vertex of the tree is labelled by a symbol from the data alphabet. Suppose T is a pointer tree, and that $v$ is a leaf vertex of T pointing to a nonterminal vertex $w$ of T. Suppose the subtree of T rooted at $w$ is a data tree. Let T' be the tree obtained by appending to $v$ the subtree of T rooted at $w$. We say that the tree T' is obtained from the tree T by one round of subtree copying. The tree T' is either a pointer tree possessing one less pointer vertex than T, or it is a data tree. Suppose we start with a pointer tree $T_0$ having exactly $k$ pointer vertices, and are able to construct trees $T_1, T_2, \ldots, T_k$ such that $T_i$ is obtained from $T_{i-1}$ via one round of subtree copying, $i = 1, \ldots, k$. Then, $T_k$ must be a data tree. It can be shown that if $T_0, T_1', \ldots, T_k'$ is some other sequence of trees in which $T_i'$ is obtained from $T_{i-1}'$ via subtree copying, then $T_k' = T_k$. Therefore, we may characterize $T_k$ as the unique data tree obtainable via finitely many rounds of subtree copying, starting from the pointer tree $T_0$. We call $T_k$ the data tree induced by the pointer tree $T_0$. Order the leaf vertices of $T_k$ in depth-first order. If we write down the sequence of data labels that we see according to this ordering, we obtain a data sequence which we will term the data sequence represented by the pointer tree $T_0$.
Example 2: Fig. 5 illustrates a pointer tree. By convention, we take the pointer labels 1, 2, 3 to mean that we are pointing to the nonterminal vertices labelled 1, 2, 3, respectively. (The internal labels 1, 2, 3 are really not needed;
we shall see how to get along without these labels in Section 3.) The reader
can easily verify that the tree in Fig. 6 is obtainable from the Fig. 5 tree via four
rounds of subtree copying. Therefore, this tree is the data tree induced by the
Fig. 5 tree. (The reader can also verify that the Fig. 6 tree is obtained regardless
of the order in which the subtree copyings are done. For example, one can
do rounds of subtree copying to the vertices which in depth-first ordering have
labels respectively; one can also do the copying according to the
two orderings of the form Some other orderings of the rounds of
subtree copying are valid as well.) The data labels on the data tree in Fig. 6 give
us the sequence which is the data sequence represented
by the pointer tree in Fig. 5.

In Fig. 6, we have kept the internal labels 1,2,3 that were in the Fig. 5 tree,
in order to illustrate an important principle. Notice that the subtrees of the
Fig. 6 data tree rooted at vertices 1,2,3 are distinct. Our important principle
is the following: One should strive to find a pointer tree representing a given
data sequence so that any two leaf vertices of the pointer tree which point to
distinct vertices should have distinct data trees appended to them in the rounds
of subtree copying. If a given pointer tree does not obey this property, then it
can be reduced to a simpler pointer tree.
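The subtree copying procedure is easy to mechanize. Below is a minimal Python sketch; the encoding of trees and the dictionary nonterminals are assumptions of this sketch, not notation from this chapter. Expanding pointer leaves on demand yields the same data sequence as performing the rounds of subtree copying explicitly:

# A nonterminal vertex is ("node", children); a leaf is ("sym", a) for a data
# symbol a, or ("ptr", i), pointing to the nonterminal vertex registered
# under key i in the dictionary nonterminals.
def expand(vertex, nonterminals):
    kind, payload = vertex
    if kind == "sym":
        return [payload]                      # data leaf
    if kind == "ptr":                         # pointer leaf: copy the
        return expand(nonterminals[payload], nonterminals)   # target subtree
    out = []                                  # nonterminal vertex: concatenate
    for child in payload:                     # the children left to right
        out.extend(expand(child, nonterminals))
    return out

# Toy pointer tree (not the Fig. 5 tree): vertex 2 generates ab; one leaf of
# the root copies vertex 2 via a pointer.
nonterminals = {}
nonterminals[2] = ("node", [("sym", "a"), ("sym", "b")])
nonterminals[1] = ("node", [nonterminals[2], ("ptr", 2), ("sym", "b")])
print("".join(expand(nonterminals[1], nonterminals)))   # prints ababb

The recursion terminates for any tree meeting the pointer tree properties above, since every pointer ultimately resolves to a data tree.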

1.2. DATA FLOW GRAPH REPRESENTATIONS


Let $x$ be a data sequence. A data flow graph which represents $x$ has the
following properties:
(p.1) The data flow graph is a finite directed acyclic graph which contains
one or more vertices (called input vertices) having outgoing edges but
no incoming edges, a unique vertex (called the output vertex) having at
least one incoming edge but no outgoing edge, and possibly one or more
vertices which are neither input nor output vertices. There is at least one
directed path from each input vertex to the output vertex.

(p.2) Each noninput vertex of the data flow graph possesses two or more in-
coming edges, and there is an implicit ordering of these incoming edges.

(p.3) Each input vertex V contains a label $s(V)$ which is a symbol from the
data alphabet.

(p.4) The input labels uniquely determine a data sequence label $s(V)$ for each noninput vertex V of the graph. The labels on the vertices of the graph satisfy the equations

$s(V) = s(V_1)s(V_2)\cdots s(V_{k(V)}),$   (1.1)

where V varies through all noninput vertices and $V_1, V_2, \ldots, V_{k(V)}$ are the vertices at the beginnings of the ordered edges ending at V. (The fact that the graph is acyclic guarantees the existence and uniqueness of the solutions to equations (1.1).)

(p.5) The sequence $s(V)$ of data symbols computed at the output vertex V is the data sequence $x$.

Property (p.4) indicates why we call our graph a data flow graph. Visualize “data flow” through the graph in several cycles. In the first cycle, one computes $s(V)$ at each vertex V whose incoming edges all start at input vertices. In each succeeding cycle, one computes the label $s(V)$ at each vertex V whose incoming edges all start at vertices whose labels have been determined previously. In the
final cycle, the data sequence represented by the data flow graph is computed
at the output vertex. The following example illustrates the procedure.
Example 3: Let us compute the data sequence represented by the data flow
graph in Fig. 7. We suppose that the direction of flow along edges is from left
to right, and that incoming edges to a vertex are ordered from top to bottom.
On the first cycle, we compute

The second cycle yields


As a result of the third cycle, we obtain

Computation of the label at the output vertex on the fourth and last cycle tells us what $x$ is:

One property of a good data flow graph for representing a data sequence
should be mentioned here: The sequences computed at the vertices of the data
flow graph should be distinct. If this property fails, then two or more vertices
of the data flow graph can be merged to yield a simpler data flow graph which
also represents the given data sequence.
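In code, the data flow computation is simply an evaluation of the graph in topological order. Below is a small Python sketch under assumed conventions (the vertex names and dictionaries are hypothetical): incoming[v] lists, in order, the vertices at the tails of the ordered edges ending at v, and the input vertices appear only in inputs.

from graphlib import TopologicalSorter   # Python 3.9+

def evaluate_dfg(inputs, incoming):
    labels = dict(inputs)                 # input vertices carry data symbols
    deps = {v: set(tails) for v, tails in incoming.items()}
    for v in TopologicalSorter(deps).static_order():
        if v not in labels:               # noninput vertex: apply equation (1.1)
            labels[v] = "".join(labels[t] for t in incoming[v])
    return labels

# Toy graph (not the Fig. 7 graph): the output is s(v1)s(v1)s(w), s(v1) = ab.
inputs = {"u": "a", "w": "b"}
incoming = {"v1": ["u", "w"], "out": ["v1", "v1", "w"]}
print(evaluate_dfg(inputs, incoming)["out"])   # prints ababb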

1.3. CONTEXT-FREE GRAMMAR REPRESENTATIONS
A context-free grammar G for representing a data sequence can be specified in terms of its production rules. Each production rule of G takes the form

$V \to \alpha_1\alpha_2\cdots\alpha_k,$   (1.2)

where V, the left member of the production rule (1.2), is a variable of the grammar G, and each $\alpha_i$ belonging to the right member of the production rule (1.2) is either a variable of G or a symbol from the data alphabet. The variables of G shall be denoted by capital letters $V_1, V_2, V_3, \ldots$; the symbols in the data alphabet are distinct from the variables of G, and shall be denoted by lower case letters $a, b, c, \ldots$. The variable $V_1$ is a special variable called the root variable of G; it is the unique variable which does not appear in the right members of the production rules of G. For each variable V of the grammar G, it is required that there be one and only one production rule (1.2) of G whose left member is V; such a grammar is said to be deterministic. With these assumptions,
one is assured that the language L(G) generated by G must satisfy one of the
following two properties:
L(G) consists of exactly one data sequence; or

L(G) is empty.
To see which of these two properties holds, one performs rounds of parallel
substitutions using the production rules of G, starting with the root variable $V_1$. In each round of parallel substitutions, one starts with a certain sequence of
variables and data symbols generated from the previous round; each variable
in this sequence is replaced by the right member of the production rule whose
left member is that variable—all of the substitutions are done simultaneously.
There are only two possibilities:
Possibility 1: After finitely many rounds of parallel substitutions, one encoun-
ters a data sequence for the first time; or
Possibility 2: One never encounters a data sequence, no matter how many
rounds of parallel substitutions are performed.
Let $t$ be the number of variables of the grammar G. In Kieffer and Yang (2000), it is shown that if one does not encounter a data sequence after $t$ rounds of parallel substitutions, then Possibility 2 must hold. This gives us an algorithm which runs in a finite amount of time to determine whether or not Possibility 1 holds.
Suppose Possibility 1 holds. Let $x$ be the data sequence generated by the grammar G after finitely many rounds of parallel substitutions. Then, L(G) = $\{x\}$. We call $x$ the data sequence represented by G.
We list the requirements that shall be placed on any grammar G that is used
for representing a data sequence:
Requirement (i): The grammar G is a deterministic context-free grammar.

Requirement (ii): The production rules of G generate a data sequence after finitely many rounds of parallel substitutions.
Requirement (iii): The right member of each production rule of G should
contain at least two entries.

Requirement (iv): Every production rule of G must be used at least once in the finitely many rounds of parallel substitutions that generate a data sequence.
Requirements (i) and (ii) have been mentioned previously. Requirements (iii)
and (iv) are new, but are natural requirements to make, since the grammar
could be made simpler if (iii) or (iv) were not true. A grammar G satisfying
requirements (i)-(iv) shall be called an admissible grammar.
Example 4: Let G be the admissible grammar whose production rules are

$V_1 \to V_2V_3V_4V_4V_5V_2,\quad V_2 \to V_6V_6b,\quad V_3 \to aV_6,\quad V_4 \to V_3ba,\quad V_5 \to bb,\quad V_6 \to aV_5.$

Starting with the root variable $V_1$, the sequences that are obtained via rounds of parallel substitutions are:

$V_1$
$V_2V_3V_4V_4V_5V_2$
$V_6V_6baV_6V_3baV_3babbV_6V_6b$
$aV_5aV_5baaV_5aV_6baaV_6babbaV_5aV_5b$
$abbabbbaabbaaV_5baaaV_5babbabbabbb$
$abbabbbaabbaabbbaaabbbabbabbabbb$   (1.3)

The data sequence represented by G is thus

$x = abbabbbaabbaabbbaaabbbabbabbabbb.$   (1.4)

Traverse the entries of the rows of the display (1.3) in the top-down, left-to-right order; if you write down the first appearances of the variables you encounter in order of their appearance, you obtain the ordering $V_1, V_2, V_3, V_4, V_5, V_6$, which is in accordance with the numbering of the variables of G. We can always number the variables of an admissible grammar so that this property will be true, and shall always do so in the future. The rounds of parallel substitutions that we went through to obtain the sequence (1.4) are easily accomplished via the four-line Mathematica program
S={1};
P={{2,3,4,4,5,2}, {6,6,b}, {a,6}, {3,b,a}, {b,b}, {a,5}};
Do[S=Flatten[S/. Table[i->P[[i]],{i,Length[P]}]],{i,Length[P]}];
S
which the reader is encouraged to try. You can run this program given any
admissible grammar to find the data sequence represented by the grammar. All
that needs to be changed each time is the second line of the program, which
gives the right members of the production rules of the grammar.
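For readers without Mathematica, the following is a rough Python equivalent of the program above (a sketch under the same encoding: variables are the integers 1 through Length[P], and anything else is a data symbol):

# Perform len(P) rounds of parallel substitutions, starting from the root
# variable 1. Integers index production rules; other entries are data symbols.
def expand_grammar(P):
    s = [1]
    for _ in range(len(P)):
        s = [t for entry in s
               for t in (P[entry - 1] if isinstance(entry, int) else [entry])]
    return s

P = [[2, 3, 4, 4, 5, 2], [6, 6, "b"], ["a", 6], [3, "b", "a"], ["b", "b"], ["a", 5]]
print("".join(map(str, expand_grammar(P))))   # prints the data sequence (1.4)

As with the Mathematica program, only the list P of right members needs to be changed to handle a different admissible grammar.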
Notice that the grammar G in Example 4 obeys the following two properties:
Property 1: Every variable of G except the root variable appears at least twice
in the right members of the production rules of G.
Property 2: If you slide a window of width two along the right members of
the production rules of G, you will never encounter two disjoint windows
containing the same sequence of length two.
An admissible grammar satisfying Property 1 and Property 2 is said to be irre-
ducible. There is a growing body of literature on the design of hierarchical loss-
less compression schemes employing irreducible grammars (Nevill-Manning
and Witten, 1997a-b; Kieffer and Yang, 2000; Yang and Kieffer, 2000).
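Both properties are easy to test mechanically. Here is a small Python sketch (admissibility of the grammar is assumed, and the list encoding is the one used with the program of Example 4):

def is_irreducible(P):
    rights = [entry for rule in P for entry in rule]
    # Property 1: every variable except the root (variable 1) appears at
    # least twice in the right members of the production rules.
    if any(rights.count(v) < 2 for v in range(2, len(P) + 1)):
        return False
    # Property 2: no two disjoint width-two windows along the right members
    # contain the same sequence of length two.
    occurrences = {}
    for r, rule in enumerate(P):
        for j in range(len(rule) - 1):
            pair = (rule[j], rule[j + 1])
            for (r2, j2) in occurrences.get(pair, []):
                if r2 != r or abs(j2 - j) >= 2:   # the two windows are disjoint
                    return False
            occurrences.setdefault(pair, []).append((r, j))
    return True

Applied to the grammar of Example 4, is_irreducible returns True, in agreement with the observation above.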

2. EQUIVALENCES BETWEEN STRUCTURES


In the previous section, we discussed pointer trees, data flow graphs, and
admissible grammars as three different types of structures for representing a
data sequence. These structures are equivalent in the sense that if you have a structure of one of the three types representing a data sequence $x$, then a structure of either one of the other two types representing $x$ can easily be constructed. It is the purpose of this section to illustrate this phenomenon.

2.1. EQUIVALENCE OF POINTER TREES AND ADMISSIBLE GRAMMARS
Suppose we are given a pointer tree representing a data sequence $x$. Here is a four-step procedure for using the pointer tree to find an admissible grammar G representing $x$.
Step 1: Let $W_1, W_2, \ldots, W_t$ be the breadth-first ordering of the nonterminal vertices of the pointer tree. Label these vertices with the labels $V_1, V_2, \ldots, V_t$, respectively.
Step 2: Each leaf vertex of the pointer tree contains either a pointer label or
a data label. For each leaf vertex V containing a pointer label, assign a
new label consisting of the label of the nonterminal vertex to which
V points. For each leaf vertex containing a data label, keep this label
unchanged.
Step 3: For each $i = 1, \ldots, t$, form the production rule

$V_i \to \beta_1\beta_2\cdots\beta_{k_i},$   (2.5)

where $\beta_1, \beta_2, \ldots, \beta_{k_i}$ are the respective labels of the children of vertex $W_i$ (children ordered from left to right).

Step 4: The grammar G consists of the production rules (2.5), for $i = 1, \ldots, t$.
Example 5: Referring to the pointer tree in Fig. 5, we see that the corre-
sponding grammar has production rules

This is clear if we relabel the Fig. 5 tree according to Fig. 8.
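The four-step procedure is mechanical; a Python sketch follows. The nested-list tree encoding is an assumption of this sketch: a nonterminal vertex is a list of its children, and a leaf is either a data symbol or a pair ("ptr", k) pointing to the k-th nonterminal vertex in breadth-first order.

from collections import deque

def tree_to_grammar(root):
    # Step 1: breadth-first ordering of the nonterminal vertices.
    order, queue = [], deque([root])
    while queue:
        v = queue.popleft()
        if isinstance(v, list):
            order.append(v)
            queue.extend(v)
    label = {id(v): i + 1 for i, v in enumerate(order)}    # labels 1, 2, ..., t
    # Steps 2-4: each nonterminal's children give one production rule, with
    # pointer leaves relabelled by the label of the vertex they point to.
    rules = {}
    for v in order:
        rules[label[id(v)]] = [label[id(c)] if isinstance(c, list)
                               else c[1] if isinstance(c, tuple)
                               else c
                               for c in v]
    return rules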

In order to go from an admissible grammar to a pointer tree, one can reverse the four-step procedure given at the beginning of this subsection. The reverse
procedure can give rise to more than one possible pointer tree for the same
data sequence—just choose one of them. The following example illustrates the
technique.
Example 6: Suppose we start with the grammar (2.6). From the 9 production
rules listed in (2.6), one forms the following 9 trees, each of depth one:
Any tree which can be built up from these 9 trees is a pointer tree for the same
data sequence represented by the grammar (2.6). (Start the tree building process
by joining two of the trees in the array (2.7) to form a single tree—joining of
two trees is accomplished by merging the root vertex of one tree with a leaf vertex of the other tree, where these two vertices have the same label. This gives an array of 8 trees; join two of the trees in this array. Repeated joinings,
8 of them in all, gradually reduce the original array (2.7) to a single tree, the
desired pointer tree.) One of the pointer trees constructible by this method is
given in Fig. 8; another one is given in Fig. 9.

2.2. EQUIVALENCE OF ADMISSIBLE GRAMMARS AND DATA FLOW GRAPHS
Let G be an admissible grammar representing the data sequence $x$. We describe how to obtain a data flow graph DFG(G) representing $x$. Let $t$ be the number of variables of G; then these variables are $V_1, V_2, \ldots, V_t$. Let $d$ be the number of distinct symbols from the data alphabet which appear in the right members of the production rules of G. Let $L$ be the sum of the lengths of the right members of the production rules of G. Grow a directed acyclic graph with $t + d$ vertices and $L$ edges as follows:
Step 1: Draw $t + d$ vertices on a piece of paper. Label $t$ of them with the labels $V_1, \ldots, V_t$, respectively. Label the remaining $d$ of them with the data symbols appearing in the right members of the production rules.
Step 2: For each $i = 1, \ldots, t$, let

$V_i \to \beta_{i,1}\beta_{i,2}\cdots\beta_{i,k_i}$

be the production rule of G whose left member is $V_i$, where $\beta_{i,1}, \ldots, \beta_{i,k_i}$ are each either symbols from $\{V_1, \ldots, V_t\}$ or symbols from the data alphabet. For each $i$, draw $k_i$ outgoing edges from vertex $V_i$, and label these edges $1, 2, \ldots, k_i$, respectively. Make these edges terminate at the vertices with labels $\beta_{i,1}, \ldots, \beta_{i,k_i}$, respectively.
Step 3: For each edge that you have drawn on your piece of paper, reverse the direction of the edge. For each vertex with a label $V_i$, remove the label.
What you now see on your piece of paper is the data flow graph DFG(G).
(For each vertex of DFG(G) which is not an input vertex, the incoming
edges to that vertex have the implicit ordering in accordance with the
edge labelling performed in Step 2.)
It is not hard to see that the above procedure is invertible—from a given data flow graph representing data sequence $x$, one can obtain an admissible grammar G representing $x$ such that the given data flow graph is DFG(G).
Example 7: Let G be the grammar with production rules

Apply the above three-step procedure to G in order to obtain DFG(G). Redraw the pictorial representation of DFG(G) so that all edges go from left to right, so that the input vertices are on the left and the output vertex is on the right, and so that the incoming edges to each noninput vertex go from top to bottom according to the implicit ordering of these edges. This gives us the data flow
graph in Fig. 7 (without the labels). Conversely, given the data flow graph in Fig. 7, it is a simple exercise to construct the grammar (2.8).
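In the same encoding, the grammar-to-graph direction amounts to reading off adjacency lists. The sketch below (names hypothetical) returns the ordered incoming-edge lists of DFG(G); feeding them to the evaluate_dfg sketch of Section 1.2, with each input vertex labelling itself, recovers the represented data sequence.

def grammar_to_dfg(P):
    # Variable vertices are 1..len(P); the input vertices are the data symbols.
    symbols = sorted({e for rule in P for e in rule if not isinstance(e, int)})
    incoming = {i: list(P[i - 1]) for i in range(1, len(P) + 1)}
    inputs = {a: a for a in symbols}          # each input vertex labels itself
    return inputs, incoming                   # vertex 1 is the output vertex

inputs, incoming = grammar_to_dfg(P)          # P as in the Example 4 program
print(evaluate_dfg(inputs, incoming)[1])      # prints the sequence (1.4) again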
3. DESIGN OF COMPACT STRUCTURES


We have seen that we can equivalently represent a data sequence using pointer
trees, data flow graphs, or admissible grammars. The structure used to represent
a given data sequence (whether it be a tree, graph, or grammar) should be
“compact,” in order for us to be able to gain an advantage in compressing the
data sequence by compressing the tree, graph, or grammar which represents it.
Here are design principles which make clear what we mean by a “compact”
structure:
If the representing structure is to be a pointer tree or a data flow graph,
design the structure to be compact in the sense that the number of edges
is small relative to the length of the data sequence that is represented.
If the representing structure is to be a context-free grammar, design the
grammar to be compact in the sense that the total length of the right hand
members of its production rules is small relative to the length of the data
sequence that is represented.
Various design methods for finding a compact structure to represent a given data sequence are addressed in the papers of Nevill-Manning and Witten (1997a, 1997b), Kieffer et al. (1998), Kieffer and Yang (2000), and Yang and Kieffer (2000).
We shall not discuss all of these design methods here. Instead, we discuss
one particular design method which is both simple and useful.
We explain how to obtain a good pointer tree for representing each data sequence of a fixed length $n$. Choose a binary tree $T_n$ with $n$ leaf vertices as follows:

If $n$ is a power of two, take $T_n$ to be the full binary tree of depth $\log_2 n$. ($T_n$ consists of one root vertex, two vertices of depth one, four vertices of depth two, etc., until we have $n$ vertices of depth $\log_2 n$.)

If $n$ is not a power of two, first grow the full binary tree of depth $\lceil \log_2 n \rceil - 1$. This tree has $2^{\lceil \log_2 n \rceil - 1}$ leaf vertices. Choose $n - 2^{\lceil \log_2 n \rceil - 1}$ of the leaf vertices, and grow two edges from each of them. The resulting binary tree, which has $n$ leaf vertices, is taken to be $T_n$.

Now let $x = x_1 x_2 \cdots x_n$ be any data sequence of length $n$ that we wish to represent by a pointer tree. Let $L_1, L_2, \ldots, L_n$ be the depth-first ordering of the leaf vertices of $T_n$. Label each $L_i$ with $x_i$, and let $T_n(x)$ denote this labelled tree. For each nonterminal vertex V of $T_n(x)$, let $T(V)$ be the subtree of $T_n(x)$ rooted at V. Each tree T(V) is drawn pictorially, with each nonterminal vertex carrying no label and each leaf vertex carrying a data label. Two T(V)'s are regarded to be the same if they look the same pictorially.
To obtain a pointer tree from $T_n(x)$ that represents $x$, we have to prune certain vertices from $T_n(x)$. The following algorithm determines those vertices of $T_n(x)$ that have to be pruned.

PRUNING ALGORITHM
Step 1: List the nonterminal vertices of $T_n(x)$ in depth-first order. Let this list be

$u_1, u_2, \ldots, u_m.$   (3.9)

Step 2: Traverse the list (3.9) from left to right, underlining each $u_i$ for which $T(u_i)$ has not been seen previously. Let $U$ be the set of underlined vertices in the list (3.9), and let $W$ be the set consisting of each nonunderlined vertex in the list (3.9) whose father in $T_n(x)$ belongs to $U$.

Step 3: Prune from the tree $T_n(x)$ all vertices which are successors of the vertices in $W$. Let $t(x)$ be the resulting pruned tree. (If this step has been done correctly, then the set of nonterminal vertices of $t(x)$ will be $U$, and the set of leaves of $t(x)$ which are not leaves of $T_n(x)$ will be $W$.)

Step 4: Attach a pointer label to each vertex V in $W$ which points to the unique vertex $u$ in $U$ for which T(V) and $T(u)$ are the same. The pruned tree $t(x)$, with the pointer labels attached, is a pointer tree representing the data sequence $x$.
Example 8: We find a pointer tree representation for the data sequence $x$ of length 12. Forming the tree $T_{12}$ and then the tree $T_{12}(x)$, we obtain the tree in Fig. 10. Notice that we have enumerated the nonterminal vertices of $T_{12}(x)$ in Fig. 10 in depth-first order. Executing Steps 1 and 2, we see that

$U = \{1, 2, 3, 4, 6, 7\}, \qquad W = \{5, 8, 9\}.$

The vertices of $T_{12}(x)$ which are successors of the vertices in $W$ are now pruned from $T_{12}(x)$ to obtain the pointer tree $t(x)$ in Fig. 11. Pointer labels must be placed at the vertices in $W$, indicating the vertices in $U$ that they point to, and this is done as follows:
Since vertex 5 must point to vertex 4 of $t(x)$, no pointer label is assigned to vertex 5 in Fig. 11.
Vertex 8 can point only to vertex 4 or vertex 7 of $t(x)$. Since it actually points to the first of these, vertex 8 is assigned pointer label “1” in Fig. 11.
Vertex 9 can point only to vertex 3 or vertex 6 of $t(x)$. Since it actually points to the second of these, pointer label “2” is assigned to vertex 9 in
Fig. 11.
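A compact Python sketch of the whole construction - building the labelled tree and pruning it - follows. For simplicity it assumes that n is a power of two (so that the starting tree is the full binary tree), and it numbers each pointer by the depth-first rank of its target among the underlined vertices rather than with the cleverer relative labels used in this example; both simplifications are assumptions of the sketch.

def build(seq):                          # T_n(x) for n a power of two
    if len(seq) == 1:
        return seq[0]
    m = len(seq) // 2
    return [build(seq[:m]), build(seq[m:])]

def prune(tree):
    first, counter = {}, 0               # first occurrence of each T(V)

    def freeze(t):                       # hashable form of a subtree
        return tuple(freeze(c) for c in t) if isinstance(t, list) else t

    def walk(t):
        nonlocal counter
        if not isinstance(t, list):
            return t                     # leaf: keep its data label
        key = freeze(t)
        if key in first:                 # T(V) seen before: V becomes a pointer
            return ("ptr", first[key])
        counter += 1                     # "underline" this first occurrence
        first[key] = counter
        return [walk(c) for c in t]

    return walk(tree)

print(prune(build(list("abababab"))))
# prints [[['a', 'b'], ('ptr', 3)], ('ptr', 2)]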

4. ENCODING METHODOLOGY
Let us now refer back to the two parts of a hierarchical lossless compression
scheme given in Figs. 2 and 3. From the preceding sections, the reader under-
stands the nature of a transform that can be used in Fig. 2 and the corresponding
inverse transform in Fig. 3. To be precise, we have learned three distinct options
for (transform,inverse transform) in dealing with a data sequence
Option 1: Supposing to be the length of one could transform into the
pointer tree which represents it, via the PRUNING ALGORITHM
of Section 3. Then can be the “data structure” in Fig. 2. The inverse
transform in Fig. 3 then employs the subtree copying method of Section
1.1 to obtain from
Option 2: One could take the “data structure” in Fig. 2 to be the admissible grammar G associated with the pointer tree $t(x)$, as in Section 2.1. The inverse transform in Fig. 3 then employs several rounds of parallel substitutions to obtain $x$ from G, as in Example 4.
Option 3: The “data structure” in Fig. 2 could be the data flow graph DFG(G) formed from the grammar G of Option 2, as described in Section 2.2. Example 3 illustrates the inverse transform method, via which $x$ is computed from DFG(G) via a flow of data sequences along the edges of DFG(G).
We have not yet discussed how the encoder compresses the data structure in
Fig. 2 that is presented to it—this section addresses this question. Because of
the equivalences between the tree, graph, and grammar structures discussed in
Section 2, we can explain the encoder’s methodology for the pointer tree data
structure in Fig. 2 only. Thus, we assume that the data structure in Fig. 2 that
is to be assigned a binary codeword by the encoder is the pointer tree
introduced in Section 3, where is the length of the data sequence (assumed
to satisfy ).
The binary codeword generated by the encoder will consist of the concate-
nation of the three binary sequences discussed below.
The sequence $s_1$. The purpose of this sequence is to let the decoder know what $n$ is. There is a simple way to do this (see Example 9) in which $s_1$ consists of $2\lfloor \log_2 n \rfloor$ binary symbols.

The sequence $s_2$. The purpose of this sequence is to let the decoder know the structure of the unlabelled pointer tree, i.e., the tree $t(x)$ without the pointer labels and without the data labels. If a vertex of $t(x)$ is a nonterminal vertex, there will be a corresponding entry of $s_2$ equal to 1; if the vertex is a leaf of $t(x)$, there will be a corresponding entry of $s_2$ equal to 0 (see Example 9).¹

The sequence $s_3$. The purpose of this sequence is to let the decoder know each data label and each pointer label that has to be entered into the unlabelled pointer tree to obtain the pointer tree $t(x)$.
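Before working through the example, here is the basic idea behind the second sequence in simplified Python form (a sketch only: the actual encoder of Example 9 transmits fewer bits, because the decoder can rule some vertices in or out on its own):

from collections import deque

def structure_bits(tree):                # tree in the nested-list encoding
    bits, queue = [], deque([tree])      # used in the earlier sketches
    while queue:
        v = queue.popleft()
        if isinstance(v, list):          # nonterminal vertex
            bits.append(1)
            queue.extend(v)
        else:                            # leaf (data symbol or pointer)
            bits.append(0)
    return bits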
Example 9: As in Example 8, we consider the data sequence $x$ of length $n = 12$. Referring to the tree $T_{12}(x)$ in Fig. 10 and the pointer tree $t(x)$ in Fig. 11, we can see what the sequences $s_1$, $s_2$, and $s_3$ generated by the encoder have to be. In general, the length $n$ can be processed to form $s_1$ in two steps:
Step 1: Expand the integer $n$ into its binary representation $b_1 b_2 \cdots b_k$, consisting of $k = \lfloor \log_2 n \rfloor + 1$ binary symbols, where $b_1 = 1$ is the most significant bit.
Step 2: Generate $s_1 = b_2 b_2 b_3 b_3 \cdots b_{k-1} b_{k-1} b_k \bar{b}_k$. (That is, repeat every digit of $b_2 \cdots b_{k-1}$, and then write down $b_k$ followed by its complement $\bar{b}_k$.)
In this particular case, Step 1 gives us the binary expansion of the integer 12, which is 1100, and then Step 2 gives us $s_1 = 110001$.
Let us now determine how we can form $s_2$ in order to convey to the decoder the tree $t(x)$ in Fig. 11 without the data and pointer labels. The first part of the binary codeword received by the decoder (the $s_1$ part) tells the decoder that $n = 12$, whereupon the decoder knows that the nonterminal vertices of $T_{12}(x)$ are the vertices 1,2,3,4,5,6,7,8,9,10,11 in Fig. 10. Of these, the decoder knows that vertices 1,2,3,4 will automatically be nonterminal vertices in Fig. 11. If the remaining vertices are processed by encoder and decoder in the breadth-first order 9,6,10,11,5,7,8, then the decoder learns the structure of $t(x)$ with the transmission of 5 bits. Specifically, a 0 is transmitted for vertex 9 (indicating that this vertex is a leaf vertex in Fig. 11), vertices 10, 11 are deleted from the list (since these vertices cannot belong to $t(x)$), and then bits are sent for each of vertices 6,5,7,8 to indicate which are leaf vertices and which are nonterminal vertices in Fig. 11. This gives us $s_2 = 01010$.
Finally, we discuss how the encoder constructs the sequence $s_3$. The entries of $s_3$ tell the decoder what the data labels and the pointer labels are in Fig. 11. The data labels are encoded as 0, 1, 1, 1, respectively. The pointer vertices are vertices 5, 8, 9 (see Fig. 10). The decoder already knows where vertex 5 points (as discussed in Example 8), so no pointer label needs to be encoded for vertex 5. The pointer labels 1, 2 on vertices 8, 9 can be very simply encoded as 0, 1, respectively. (For a very long data sequence, one would instead use arithmetic coding to encode the resulting large number of pointer labels—see Kieffer and Yang (2000) and Kieffer et al. (2000) for the arithmetic coding details which we have omitted here.) Concatenating the encoded data labels 0, 1, 1, 1 with the encoded pointer labels 0, 1, we see that $s_3 = 011101$.
The total length of the binary codeword generated by the encoder is the sum of the lengths of $s_1$, $s_2$, and $s_3$, which is 17 bits. If we assume that the decoder knows the length of the data sequence to be 12, but does not know which binary data sequence of length 12 is being transmitted, then it is not necessary to form $s_1$, and the encoder's binary codeword consists of $s_2$ and $s_3$ only. In this case, the length of the binary codeword is 11 bits. We are achieving a modest level of data compression in this example, since transmitting $x$ to the decoder without compression would take 12 > 11 bits—for a much longer data sequence, more compression than this could be achieved.

5. PERFORMANCE UNDER UNCERTAINTY


How well do hierarchical lossless compression schemes perform? This sec-
tion is devoted to answering this question. First, let us focus on the size of
the pointer tree $t(x)$, as measured by the number of vertices it has. We let $|t(x)|$ denote the number of vertices in the pointer tree $t(x)$. It has been shown (Liaw and Lin, 1992; Kieffer et al., 2000) that there is a positive constant C such that the following is true.²

Theorem 1 For every $n \geq 2$ and for every data sequence $x$ of length $n$,

$|t(x)| \leq \dfrac{Cn}{\log_2 n}.$
Theorem 1 tells us that the ratio $|t(x)|/n$ is small for large $n$. In other words, the size of the pointer tree $t(x)$ is small relative to the length of the data sequence $x$, which is a compactness condition on the tree. From our discussion at the beginning of Section 3, this suggests to us that the use of the pointer tree $t(x)$ in a hierarchical lossless compression scheme might lead to good compression performance. Specifically, we consider the hierarchical lossless compression scheme in which, for each $n$ and each data sequence $x$ of length $n$, the pointer tree $t(x)$ is compressed according to the procedure described in Section 4. In the ensuing development, we will see that this pointer tree based hierarchical lossless compression scheme performs extremely well.
Let A denote our fixed finite alphabet. We consider how well our hierarchical lossless compression scheme can compress a randomly generated data sequence $X^{(n)} = X_1 X_2 \cdots X_n$ generated according to the probability mass function $p_n$ on $A^n$. Information theory tells us that for any lossless compression scheme, the expected length of the binary codeword into which $X^{(n)}$ is encoded cannot be less than the entropy $H(X_1, \ldots, X_n)$, and that the best lossless compression scheme for encoding $X^{(n)}$ (the Huffman code (Cover and Thomas, 1991)) assigns $X^{(n)}$ a binary codeword of expected length no worse than $H(X_1, \ldots, X_n) + 1$. Unfortunately, the Huffman code can be constructed only if the probability model $p_n$ is known. However, what
if we are uncertain about the probability model? Remarkably, for large $n$, our
pointer tree based hierarchical lossless compression scheme provides us with
near-optimum compression performance regardless of the probability model
according to which the data is generated (provided that we assume stationarity
of the model). In other words, if faced with uncertainty about the true data
model, one can employ a hierarchical lossless compression scheme not based
on any probability model which performs as well as a special-purpose loss-
less compression scheme based upon the true data model, asymptotically as
the length of the data sequence grows without bound. The following theorem,
proved in Kieffer and Yang (2000) and Kieffer et al. (2000), makes this precise.

Theorem 2 Let $X_1, X_2, X_3, \ldots$ be a stationary random process in which each random variable $X_i$ takes its values in the given finite data alphabet A. For each $n$, let $X^{(n)}$ be the random data sequence $X_1 X_2 \cdots X_n$. Then, the expected codeword length $E[L(X^{(n)})]$ arising from the encoding of $X^{(n)}$ with our pointer tree based hierarchical lossless compression scheme satisfies the bound

$E[L(X^{(n)})] \leq H(X_1, \ldots, X_n) + O\!\left(\frac{n}{\log_2 n}\right).$   (5.10)
Discussion: We point out subcases of Theorem 2 that occur for special classes of stationary processes. First, suppose that $X_1, X_2, \ldots$ is a memoryless process, meaning that the random variables $X_i$ are statistically independent, each having the same marginal probabilities

$p(a) = P(X_i = a), \qquad a \in A.$   (5.11)

Letting

$H = -\sum_{a \in A} p(a)\log_2 p(a),$

then, because of the independence, the joint entropy in (5.10) can be simply expressed as $H(X_1, \ldots, X_n) = nH$. Also, it is known in the memoryless case (see Kieffer and Yang (2000); Kieffer et al. (2000)) that the $O(n/\log_2 n)$ term in (5.10) can be expressed as a constant times the size of the pointer tree as given by Theorem 1. Therefore, for the memoryless process, equation (5.10) reduces to

$E[L(X^{(n)})] \leq nH + O\!\left(\frac{n}{\log_2 n}\right).$   (5.12)
Next, suppose that $X_1, X_2, \cdots$ is a stationary first order Markov process with transition probabilities

$p(a' \mid a) = P(X_{i+1} = a' \mid X_i = a), \qquad a, a' \in A,$

and marginal probabilities $p(a)$ in (5.11). Then, equation (5.10) reduces to

$E[L(X^{(n)})] \leq nH_1 + O\!\left(\frac{n}{\log_2 n}\right),$

where

$H_1 = -\sum_{a \in A}\sum_{a' \in A} p(a)\,p(a' \mid a)\log_2 p(a' \mid a).$
These extensions can be taken as far as one wishes. In general, if $X_1, X_2, \cdots$ is a stationary Markov process of any finite order $k$, one can define a nonnegative real constant $H_k$ (the so-called order $k$ entropy of the process) such that

$E[L(X^{(n)})] \leq nH_k + O\!\left(\frac{n}{\log_2 n}\right).$
There are other lossless compression schemes for which Theorem 2 is true for arbitrary stationary processes, and for which asymptotics of the form (5.12)
occur for arbitrary Markov processes of finite order. For example, the Lempel-
Ziv compression scheme (Ziv and Lempel, 1978) is another such scheme. It is an open question whether the hierarchical lossless scheme presented in this paper or the Lempel-Ziv scheme gives the smaller constant times $n/\log_2 n$ in the $O(n/\log_2 n)$ term in (5.12). However, hierarchical lossless compression
schemes have some advantages over the Lempel-Ziv scheme. Two of these
advantages are: (1) hierarchical schemes are easily scalable; (2) hierarchical
schemes sometimes yield state-of-the-art compression performance in practical
applications.

Acknowledgement. This work was supported by National Science Foundation Grants NCR-9627965 and CCR-9902081.

NOTES
1. The length of $s_2$ is at worst the number of vertices of $t(x)$, which is $O(n/\log_2 n)$ by Theorem 1.
2. The smallest value of C for which Theorem 1 is true is not known.

REFERENCES
Barnsley, M. and L. Hurd. (1993). Fractal Image Compression. Wellesley, MA:
AK Peters, Ltd.
Burt, P. and E. Adelson. (1983). “The Laplacian Pyramid as a Compact Image
Code,” IEEE Trans. Commun., Vol. 31, pp. 532–540.
Cameron, R. (1988). “Source Encoding Using Syntactic Information Source
Models,” IEEE Trans. Inform. Theory, Vol. 34, pp. 843–850.
Chui, C. (1992). (ed.), Wavelets: A Tutorial in Theory and Applications. New
York: Academic Press.
Cover, T. and J. Thomas. (1991). Elements of Information Theory. New York:
Wiley.
Fisher, Y. (1995). (ed.), Fractal Image Compression: Theory and Application.
New York: Springer-Verlag.
Kawaguchi, E. and T. Endo. (1980). “On a Method of Binary-Picture Rep-
resentation and its Application to Data Compression,” IEEE Trans. Pattern
Anal. Machine Intell., Vol. 2, pp. 27–35.
Kieffer, J. and E.-H. Yang. (2000). “Grammar-Based Codes: A New Class of
Universal Lossless Source Codes,” IEEE Trans. Inform. Theory, Vol. 46, pp.
737–754.
Kieffer, J., E.-H. Yang, G. Nelson, and P. Cosman. (2000). “Universal Lossless
Compression Via Multilevel Pattern Matching,” IEEE Trans. Inform. Theory,
Vol. 46, pp. 1227–1245.
Knuth, D. (1973). The Art of Computer Programming: Volume 1/Fundamental Algorithms. Reading, MA: Addison-Wesley.
Kourapova, E. and B. Ryabko. (1995). “Application of Formal Grammars for
Encoding Information Sources,” Probl. Inform. Transm., Vol. 31, pp. 23–26.
Liaw, H-T. and C-S. Lin. (1992). “On the OBDD-Representation of General
Boolean Functions,” IEEE Trans. Computers, Vol. 41, pp. 661–664.
Nevill-Manning, C. and I. Witten. (1997a). “Identifying Hierarchical Structure
in Sequences: A Linear-Time Algorithm,” Jour. Artificial Intell. Res., Vol. 7,
pp. 67–82.
Nevill-Manning, C. and I. Witten. (1997b). “Compression and Explanation
Using Hierarchical Grammars,” Computer Journal, Vol. 40, pp. 103–116.
Said, A. and W. Pearlman. (1996). “A New Fast and Efficient Image Codec
Based on Set Partitioning in Hierarchical Trees,” IEEE Trans. Circuits Sys.
Video Technol., Vol. 6, pp. 243–250.
Shapiro, J. (1993). “Embedded Image Coding Using Zerotrees of Wavelet Co-
efficients," IEEE Trans. Signal Proc., Vol. 41, pp. 3445-3462.
Strang, G. and T. Nguyen. (1996). Wavelets and Filter Banks. Wellesley, MA:
Wellesley-Cambridge Press.
Yang, E.-H. and J. Kieffer. (2000). “Efficient Universal Lossless Data Compres-
sion Algorithms Based on a Greedy Sequential Grammar Transform–Part
One: Without Context Models,” IEEE Trans. Inform. Theory, Vol. 46, pp.
755–777.
Ziv, J. and A. Lempel. (1978). “Compression of individual sequences via variable-
rate coding,” IEEE Trans. Inform. Theory, Vol. 24, pp. 530–536.
Part VIII
Chapter 29

EUREKA! BELLMAN’S PRINCIPLE OF OPTIMALITY IS VALID!

Moshe Sniedovich
Department of Mathematics and Statistics
The University of Melbourne
Parkville VIC 3052, Australia
m.sniedovich@ms.unimelb.edu.au

Abstract Ever since Bellman formulated his Principle of Optimality in the early 1950s,
the Principle has been the subject of considerable criticism. In fact, a number
of dynamic programming (DP) scholars identified specific difficulties with the common interpretation of Bellman’s Principle and proposed constructive remedies. In the case of stochastic processes with a non-denumerable state space, the remedy requires the incorporation of the faithful "with probability one" clause. In this short article we are reminded that if one sticks to Bellman’s original version of the principle, then no such fix is necessary. We also reiterate the central
role that Bellman’s favourite "final state condition" plays in the theory of DP in
general and the validity of the Principle of Optimality in particular.

Keywords: dynamic programming, principle of optimality, final state condition, stochastic processes, non-denumerable state space.

1. INTRODUCTION
All of us are familiar with Bellman’s Principle of Optimality (Bellman, 1957, p. 83) and the major role that it played in Bellman’s monumental work on DP. What is not so well known - yet very well documented - is that Bellman’s
Principle of Optimality has been the subject of serious criticism, e.g. Denardo
and Mitten (1967), Karp and Held (1967), Yakowitz (1969), Porteus (1975),
Morin (1982), Sniedovich (1986, 1992).
In fact, almost every aspect of the Principle - e.g. its exact meaning, validity,
role in DP - is problematic in the sense that scholars have conflicting views on
the matter. For the purposes of this discussion it will suffice to provide two
pairs of quotes. The first pair refers to the title "Principle":
... Equation (3.24) is a fundamental equation of Dynamic Programming. It expresses a fundamental principle, the principle of optimality (Bellman [B4], [B5]), which can also be expressed in the following way: …
Kushner (1971, p. 87)
The term principle of optimality is, however, somewhat misleading; it suggests
that this is a fundamental truth, not a consequence of more primitive things.
Denardo (1982, p. 16)

The second refers to the validity of the Principle:


Roughly the principle of optimality says the following rather obvious fact …
Bertsekas (1976, p. 48)
To see that Bellman’s original statement of the Principle of Optimality also does
not hold, simply flip the graph around ...
Morin (1982, p. 669)

To motivate the discussion that follows we consider two typical counter-examples to the validity of the principle. The first features a deterministic problem, the second a stochastic process. However, before we do this let us recall that Bellman’s phrasing of the principle is as follows (Bellman, 1957, p. 83):
PRINCIPLE OF OPTIMALITY: An optimal policy has the property that whatever
the initial state and initial decision are, the remaining decisions must constitute
an optimal policy with regard to the state resulting from the first decision.

Counter-Example 1:
Consider the network depicted in Figure 29.1 and assume that the objective
is to determine the shortest path from node 1 to node 5, where the length of a
path is equal to the length of the longest arc on that path. By inspection, there
are two optimal paths, namely p=(1,2,3,5) and q=(1,2,4,5), both of length 3.
Consider now the optimal path q and the state (node) resulting from the first
decision, namely node 2. Clearly, the remaining decisions of q, namely the subpath (2,4,5), do not constitute an optimal policy with respect to node 2 - it is clearly longer than (2,3,5). Hence, the optimal path q does not obey the
Principle of Optimality.
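Figure 29.1 is not reproduced here, but the phenomenon is easy to replay in code. The arc lengths below are hypothetical, chosen only to be consistent with the description:

# Length of a path = length of the longest arc on it (a bottleneck objective).
arc = {(1, 2): 3, (2, 3): 1, (3, 5): 1, (2, 4): 2, (4, 5): 2}

def path_length(path):
    return max(arc[e] for e in zip(path, path[1:]))

p, q = (1, 2, 3, 5), (1, 2, 4, 5)
print(path_length(p), path_length(q))                  # 3 3: both are optimal
print(path_length((2, 4, 5)), path_length((2, 3, 5)))  # 2 1: q's remainder from
                                                       # node 2 loses to (2,3,5)

The bottleneck objective is monotone but not strictly monotone, which is precisely the loophole addressed by the separability and monotonicity conditions mentioned below.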
Counter-Example 2:
Consider the following naive stochastic game: there are two stages (n = 1, 2), and at each stage there are two feasible decisions, say $x_n \in \{0, 1\}$, n = 1, 2. The dynamics of the process are as follows: The process starts at stage 1 with a given state, $s_1$. Upon making the first decision, $x_1$, the process moves to the next stage, where we observe a new state $s_2$. Then we make the second decision, $x_2$, and the process terminates. The second state, $s_2$, is a continuous random variable on the set S=[0,1] whose conditional density function depends on both $s_1$ and $x_1$. The return generated by the game is equal to the sum of the two decisions, namely $x_1 + x_2$.
The objective is to determine a policy so as to maximize the expected value of the total return.
Clearly, the best policy is to always use $x_n = 1$; that is, it is best to set $x_1 = x_2 = 1$. This will yield an expected return equal to $1 + 1 = 2$. This policy obeys the Principle of Optimality. But consider the policy according to which $x_1 = 1$ and, say,

$x_2 = 1$ if $s_2 \neq 1/2$; $x_2 = 0$ if $s_2 = 1/2$.

Because $s_2$ is a continuous random variable on S=[0,1], the probability that $s_2 = 1/2$ is equal to zero under this policy. Hence, this policy is also optimal. But this policy does not obey the Principle of Optimality: selecting $x_2 = 0$ when we are at stage 2 observing state $s_2 = 1/2$ is not an optimal policy with regard to this realization of $s_2$.
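Under the assumptions adopted above (decisions in {0, 1}, exceptional state 1/2, and, purely for the sake of the sketch, s2 uniform on [0, 1]), a short simulation confirms that both policies attain the same expected return:

import random

def expected_return(policy2, trials=100_000):
    total = 0
    for _ in range(trials):
        s2 = random.random()             # stand-in for the conditional density
        total += 1 + policy2(s2)         # x1 = 1 plus the second decision
    return total / trials

always_one = lambda s2: 1
flawed = lambda s2: 0 if s2 == 0.5 else 1    # suboptimal only on a null set
print(expected_return(always_one), expected_return(flawed))   # both about 2.0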
The traditional fixes for the apparent bugs in the conventional interpretation
of the Principle are as follows:

1 To disarm counter-examples such as our Counter-Example 1, it is assumed that the objective function of the process satisfies certain separability and (strict) monotonicity conditions (e.g. Yakowitz (1969)).

2 To disarm counter-examples such as our Counter-Example 2, the clause "with probability one" is added (e.g. Yakowitz (1969), Porteus (1975)) to the premise of the Principle.
While these fixes are fine, there are other possibilities. In particular, it is pos-
sible to fix these - and other - bugs by adhering to Bellman’s original formulation
of DP and the Principle of Optimality.
The main objective of this paper is to briefly show how this can be done.

2. REMEDIES
In this section we provide two remedies to the Principle. These remedies not
only fix the bugs discussed above, they also indicate how elegantly Bellman
dealt with a number of thorny modelling and technical aspects of DP.
Remedy 1: Final state formulation
A close examination of Bellman’s work on DP reveals that Bellman contin-
ually struggled with the following dilemma: how do you keep the formulation
of DP simple, yet enable it to tackle complex problems? The Principle of Opti-
mality was conceived as a device that will keep the description of the main ideas
of basic of DP simple. In particular, in contrast to the DP models developed in
the 1960’s with the stated goal of putting DP on a rigorous mathematical foun-
dation (e.g. Denardo and Mitten (1967), Karp and Held (1967)), Bellman’s
original treatment of DP paid very little attention to the objective function of
the process.
As a matter of fact, systematically and consistently Bellman avoided the need to deal with this issue by a very drastic assumption: the overall return from the decision process depends only on the final state of the process. Readers who are sceptical about this fact are invited to read (carefully) Bellman’s first book on DP, where they can find the following definition of an optimal policy:
. . . Let us now agree to the following terminology: A policy is any rule for making
decisions which yields an allowable sequence of decisions; and an optimal policy
is a policy which maximizes a preassigned function of the final state variable ...
Bellman (1957, p.82)

As argued in Sniedovich (1992), not surprisingly this type of objective function satisfies the traditional (strict) monotonicity conditions (Mitten, 1964; Yakowitz, 1969); hence, the Principle of Optimality is (trivially) valid in the
context of Bellman’s final state model of DP.
It should be stressed that Bellman’s final state approach to the modelling
aspects of DP is an extremely useful tool of thought. And it should not come
as a surprise that it is still used as such (e.g. Woeginger (2000)).
A close examination of Bellman’s treatment of stochastic processes (Bell-
man, 1957) immediately reveals that Bellman also devised a super elegant
alternative to the "with probability one" clause suggested by Yakowitz (1969)
and Porteus (1975). It works like this:
Remedy 2: Conditional Modified Problems
The conventional interpretation of the Principle of Optimality deals with two
optimization problems: the initial problem (before any decision is made) and the
problem associated with the state resulting from the first decision. Symbolically
we can call these problems Problem $P(1, s_1)$ and Problem $P(2, s_2)$, respectively. That is, the initial problem - Problem $P(1, s_1)$ - is the problem we have to solve when we are at the initial stage n=1 observing the initial state of the process, $s_1$. The second problem - Problem $P(2, s_2)$ - is the problem we face when we are at stage n=2, observing the state $s_2$ resulting from the first decision. Counter-Example 2 indicates that even though $s_2$ results from an implementation of an optimal policy with respect to Problem $P(1, s_1)$, there is no guarantee that the remainder of this policy is optimal with regard to Problem $P(2, s_2)$.
But suppose that we compare Problem $P(2, s_2)$ with the following problem: we still have to solve the original problem, but under the assumption that the first decision $x_1$ and the state $s_2$ resulting from it are given. Let this problem be denoted by Problem $C(2, s_2 \mid s_1, x_1)$. Is it true that if a policy is optimal with respect to Problem $P(1, s_1)$, then it is also optimal with respect to Problem $C(2, s_2 \mid s_1, x_1)$?

Needless to say, there are of course problems where this condition is not
satisfied. However, from our perspective this is not the point because - following
Bellman - we require the model to be a final state model in which case the above
condition is trivially satisfied. Rather, the point is that there is no need here for
the "with probability one" clause.
Before we address this issue any further it will be instructive to re-examine
Counter-Example 2 and see whether it satisfies the above condition.
Counter-Example 2: Revisited
The objective function associated with Problem $P(2, s_2)$ is equal to $x_2$; thus the optimal policy for this problem is $x_2 = 1$. The objective function for Problem $C(2, s_2 \mid s_1, x_1)$ is $x_1 + x_2$ with $x_1$ given; hence the optimal policy is $x_2 = 1$ regardless of what value $s_2$ takes. Thus, the above condition is satisfied.
In short, the alternative to the "with probability one" fix works well not only in the framework of final state models.

3. THE BIG FIX


What emerges from the above analysis is that if we stick to (1) Bellman’s final state model and (2) the relationship between the modified problems (Problem $P(n, s)$) and their respective conditional problems (Problem $C(n, s \mid s_1, x)$), then the Principle of Optimality is alive and well and does not require any fixing.
With regard to the relationship between the modified problems and their respective conditional problems, this is a typical Markovian property: it entails that Problem $P(n, s)$ and Problem $C(n, s \mid s_1, x)$ have the same optimal solutions regardless of the specific values of $s_1$ and $x$. This implies that the set of optimal
solutions of Problem $C(n, s \mid s_1, x)$ does not depend on $(s_1, x)$ and, as such, depends only on $(n, s)$. This is a Markovian property par excellence.
The nice thing about this condition is that it enables us to drop both the
final state requirement and the traditional monotonicity property. In short, the
Markovian condition can act as the ultimate fix for the Principle of Optimality
in that its validity guarantees the validity of the Principle of Optimality and the
functional equation of DP.

4. THE REST IS MATHEMATICS


Technical details concerning the Markovian condition and its role in DP can be found elsewhere (e.g. Sniedovich (1992)). Here we just illustrate how it can be used in a formal axiomatic manner as an alternative to the traditional monotonicity condition and the "with probability one" clause.
So let us first consider the following deterministic sequential decision problem:
Problem $P(s_1)$:

$\max_{x_1, \ldots, x_N}\; g(s_1, x_1, s_2, x_2, \ldots, s_N, x_N, s_{N+1})$

s.t.

$s_{n+1} = T(s_n, x_n), \quad x_n \in D(n, s_n), \quad n = 1, \ldots, N,$

where S denotes the state space, T is the transition function, $D(n, s)$ denotes the set of feasible decisions at stage n if the system is in state $s$, and $g$ - the objective function - is a real valued function of the state and decision variables. It is assumed that the initial state of the system, $s_1$, is given.
We refer to Problem as the initial problem. Let denote the set
of all the feasible solutions to this problem and let denote the set of
all (global) optimal solutions to this problem. It is assumed that is not
empty.
We now decompose the objective function, in the usual DP manner
by assuming that there exists a function and a sequence of functions
such that: and

where
Eureka! Bellman’s Principle of Optimality is valid! 741

for
This leads to the notion of modified problems:
Problem

Let denote the set of all feasible solutions to Problem and


let denote the set of all optimal solutions to this problem. It is
assumed that for any pair, the set is not empty.
In anticipation for the introduction of the Markovian condition, we also
consider the following conditional modified problems:
Problem

subject to (4.7)-(4.8).
Let denote the set of all feasible solutions to Problem
and let denote the set of all optimal solutions to this prob-
lem. It is assumed that the set is not empty for any
quadruplet Ob-
serve that by construction
Clearly then (by inspection),
Lemma 1

for all and



Our beloved Markovian condition can now be stated formally as follows:

Markovian Condition (Deterministic processes):

$$ X^*(n+1, s' \mid s, x) = X^*(n+1, s'), $$

for all $n = 1, \dots, N-1$, $s \in S$ and $x \in D(n, s)$, where $s' = T(n, s, x)$.

The following is then an immediate consequence of the definition of the Markovian property.

Corollary 1

If the Markovian condition (deterministic processes) holds, then

$$ q(n+1, s' \mid s, x) = \rho(s, x, p(n+1, s')), \quad s' = T(n, s, x), \tag{4.12} $$

for all $n = 1, \dots, N-1$, $s \in S$ and $x \in D(n, s)$.

Therefore, substituting (4.12) into (4.10) yields

Corollary 2

If the Markovian condition (deterministic processes) holds, then

$$ p(n, s) = \operatorname*{opt}_{x \in D(n, s)} \rho(s, x, p(n+1, T(n, s, x))), \tag{4.13} $$

for all $n = 1, \dots, N-1$ and $s \in S$.
This is the famous functional equation of DP.


In short, in the context of deterministic processes, the Markovian condition
can be used as a sufficient condition for the validity of the Principle of Optimality
and the functional equation of DP.
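By way of illustration, here is a minimal Python sketch of backward induction based on (4.13). It is not part of the formal development above; it simply assumes (our choices, for illustration only) a finite state space, maximization as the "opt", a terminal value attached to the state reached after the last decision, and the additive toy instance at the bottom.

```python
# A minimal sketch of backward induction for the functional equation (4.13).
# Assumptions (ours): finitely many states, maximization, and a terminal
# value attached to the state reached after the last decision.

def solve_dp(N, states, D, T, rho, terminal):
    """p(n, s) = opt_{x in D(n, s)} rho(s, x, p(n+1, T(n, s, x)))."""
    p = {(N + 1, s): terminal(s) for s in states}          # boundary values
    policy = {}
    for n in range(N, 0, -1):                              # stages N, N-1, ..., 1
        for s in states:
            best_x = max(D(n, s), key=lambda x: rho(s, x, p[(n + 1, T(n, s, x))]))
            policy[(n, s)] = best_x
            p[(n, s)] = rho(s, best_x, p[(n + 1, T(n, s, best_x))])
    return p, policy

# Toy additive instance: rho(s, x, v) = x + v, states 0..3, moves +1 or +2.
p, policy = solve_dp(
    N=3,
    states=range(4),
    D=lambda n, s: (1, 2),
    T=lambda n, s, x: min(s + x, 3),
    rho=lambda s, x, v: x + v,
    terminal=lambda s: 0.0,
)
print(p[(1, 0)])   # optimal value of the initial problem with s_1 = 0 (prints 6.0)
```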
Let us now examine the situation in the context of stochastic processes. We shall do this on the fly, as an extension of the deterministic process.

Observe that in the context of stochastic processes, the state observed at stage $n+1$ is not determined uniquely by the state and decision pertaining to stage $n$. Rather, it is determined by a (conditional) probabilistic law that depends on these two entities. In short, the state variable is a random variable.

It is therefore natural to introduce the notion of a decision making rule, that is, a rule according to which the decisions are determined. Furthermore, in anticipation of the Markovian condition, it is natural to consider the class of Markovian decision rules, namely Markovian policies. Formally, a Markovian policy is a rule $\delta$ that, for each stage-state pair $(n, s)$, assigns an element of $D(n, s)$. Let $\Delta$ denote the set of all the feasible Markovian policies associated with our model. Then, by definition, $\Delta$ consists of all the Markovian decision rules $\delta$ satisfying the following condition:

$$ \delta(n, s) \in D(n, s), \quad \text{for all } n = 1, \dots, N \text{ and } s \in S. $$
Our initial problem can thus be stated as follows:

Problem $P(s_1)$:

$$ p(s_1) := \operatorname*{opt}_{\delta \in \Delta} E_\delta[\, g(s_1, X_1, \dots, X_N) \mid s_1 \,], $$

where $E_\delta[\,\cdot \mid s_1\,]$ denotes the expected value of $g(s_1, X_1, \dots, X_N)$ generated by policy $\delta$ given the initial state $s_1$, where $X_n = \delta(n, S_n)$.

Similarly, the modified and conditional problems are defined as follows:

Problem $P(n, s)$:

$$ p(n, s) := \operatorname*{opt}_{\delta \in \Delta} E_\delta[\, g_n(s, X_n, \dots, X_N) \mid S_n = s \,]. $$

Problem $P(n+1, s' \mid s, x)$:

$$ q(n+1, s' \mid s, x) := \operatorname*{opt}_{\delta \in \Delta} E_\delta[\, \rho(s, x, g_{n+1}(s', X_{n+1}, \dots, X_N)) \mid S_n = s, X_n = x \,]. $$

Then clearly,

Corollary 3

$$ p(n, s) = \operatorname*{opt}_{x \in D(n, s)} E[\, q(n+1, S_{n+1} \mid s, x) \,], \tag{4.18} $$

for all $n$ and $s \in S$.

Note that the expectation is taken with respect to the random variable $S_{n+1}$, whose probability function is conditioned by $s$ and $x$.

Let $\Delta^*(n, s)$ denote the set of all the optimal solutions to Problem $P(n, s)$ and let $\Delta^*(n+1, s' \mid s, x)$ denote the set of all the optimal solutions to Problem $P(n+1, s' \mid s, x)$. Then the Markovian condition for stochastic processes can be stated as follows:

Markovian Condition (Stochastic processes):

$$ \Delta^*(n+1, s' \mid s, x) = \Delta^*(n+1, s'), $$

for all $n$, $s \in S$, $x \in D(n, s)$ and $s'$.

Now, assume that the decomposition scheme $(\rho, \{g_n\})$ of the objective function is separable under conditional expectation, namely suppose that

$$ E[\, \rho(s, x, g_{n+1}(S_{n+1}, X_{n+1}, \dots, X_N)) \mid s, x \,] = \rho(s, x, E[\, g_{n+1}(S_{n+1}, X_{n+1}, \dots, X_N) \mid s, x \,]), \tag{4.20} $$

for all $n$, $s \in S$ and $x \in D(n, s)$.

Under this condition, substituting (4.20) into (4.18) yields the following.

Corollary 4

If the objective function is separable under conditional expectation and the Markovian condition (stochastic processes) holds, then

$$ p(n, s) = \operatorname*{opt}_{x \in D(n, s)} E[\, \rho(s, x, p(n+1, S_{n+1})) \mid s, x \,], \tag{4.22} $$

for all $n$ and $s \in S$.

This is the famous DP functional equation for stochastic processes.

Note that here, as in (4.18), expectation is taken with respect to the random variable $S_{n+1}$, whose probability function is conditioned by $s$ and $x$.

In summary, then, the Markovian condition works well in both deterministic and stochastic processes.
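A companion sketch for the stochastic case spells out (4.22) for a finite transition law. The additive return assumed here (our choice, for illustration) is trivially separable under conditional expectation, so the functional equation applies verbatim.

```python
# Sketch of the stochastic functional equation (4.22) for a finite state
# space.  The transition law q(s, x) returns {next_state: probability}.

def solve_stochastic_dp(N, states, D, q, r, terminal):
    """p(n, s) = opt_{x in D(n, s)} E[ r(s, x) + p(n+1, S') ],  S' ~ q(s, x)."""
    p = {s: terminal(s) for s in states}        # values at stage N + 1
    rules = []
    for n in range(N, 0, -1):
        new_p, rule = {}, {}
        for s in states:
            def expected(x, s=s):               # expectation over the transition law
                return r(s, x) + sum(pr * p[s2] for s2, pr in q(s, x).items())
            best_x = max(D(n, s), key=expected)
            rule[s] = best_x
            new_p[s] = expected(best_x)
        rules.append(rule)
        p = new_p
    rules.reverse()                             # rules[n - 1] is the stage-n decision rule
    return p, rules
```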

5. REFINEMENTS

The Markovian condition can be refined a bit to reflect the fact that for the DP functional equation to be valid, it is sufficient that the Principle of Optimality is satisfied by one policy, rather than by all the optimal policies.

This leads to the following:
Weak Markovian Condition (Deterministic processes):

$$ X^*(n+1, s' \mid s, x) \cap X^*(n+1, s') \neq \emptyset, $$

for all $n = 1, \dots, N-1$, $s \in S$ and $x \in D(n, s)$, where $s' = T(n, s, x)$ and $\emptyset$ denotes the empty set.

In words, each conditional modified problem shares at least one optimal solution with each modified problem giving rise to it.

Corollary 5

If the Weak Markovian condition (deterministic processes) holds, then the DP functional equation (4.13) is valid.

Weak Markovian Condition (Stochastic processes):

$$ \Delta^*(n+1, s' \mid s, x) \cap \Delta^*(n+1, s') \neq \emptyset, $$

for all $n$, $s \in S$ and $x \in D(n, s)$.

In words, any conditional modified problem shares at least one optimal solution with each modified problem giving rise to it.

Corollary 6

If the objective function is separable under conditional expectation and the Weak Markovian condition (stochastic processes) holds, then the DP functional equation (4.22) is valid.

The following observations should be made with regard to the distinction drawn above between the Markovian condition and the Weak Markovian condition:

1. DP functional equations based on the Weak Markovian condition do not guarantee the recovery of all the optimal solutions.

2. There is not always a choice in this matter. That is, some objective functions are Markovian in nature, so it is not possible to formulate for them valid DP functional equations that satisfy the Weak Markovian condition but do not satisfy the Markovian condition.

3. The distinction is more pronounced in the context of deterministic processes than in stochastic processes. This is a reflection of the fact that the "separation under conditional expectation" condition (4.20) is very demanding; hence the class of objective functions satisfying the Weak Markovian condition - and hence the Markovian condition - is relatively small.

4. The distinction between Markovian and Weak Markovian conditions is similar - but not identical - to the distinction between monotone and strictly monotone decomposition schemes (Sniedovich, 1992).

The question naturally arises: what happens if the Markovian condition does
not hold?

6. NON-MARKOVIAN OBJECTIVE FUNCTIONS


Suppose that we have a sequential decision model whose objective function cannot be decomposed in such a way as to satisfy the (Weak) Markovian condition. In this case, the process of deriving the DP functional equation breaks down. This could be because the objective function is not separable, that is, it cannot be decomposed, or because, if it can be decomposed, the decomposition scheme does not satisfy the (Weak) Markovian condition.

There is no simple foolproof recipe for handling such cases. Rather, there seem to be two basic approaches, and the choice between them is very much problem dependent.
The first approach - which might be called "Pure DP" - regards the difficulty as
a DP difficulty and attempts to fix it by reformulating the problem - in particular
the state variable - so as to force the objective function to obey the Markovian
condition. Typically this leads to an expansion of the state space and/or the
return space with obvious implications for the efficiency of the algorithm used
to solve the resulting DP functional equation.
The second approach - which might be called "Hybrid" - attempts to deal
with the difficulty without altering the structure of the DP model. Instead, other
methods are employed to deal with the non-Markovian nature of the objective
function. Thus, the resulting approach is based on a collaborative scheme
between DP and other optimization methods.
As an example, consider the case where the objective function of the sequential decision model is of the following form:

$$ g(s_1, x_1, \dots, x_N) = \sum_{n=1}^{N} r_n(s_n, x_n) + \left( \sum_{n=1}^{N} w_n(s_n, x_n) \right)^{1/2}. \tag{6.1} $$

Clearly, this function is not separable and therefore it cannot be decomposed in a DP manner unless the structure of the model is expanded. There are two ways to expand the DP model so as to satisfy the Markovian property in this case:

The first involves an expansion of the return space resulting from replacing the objective function $g$ by two objective functions:

$$ r := \sum_{n=1}^{N} r_n(s_n, x_n) \quad \text{and} \quad w := \sum_{n=1}^{N} w_n(s_n, x_n), $$

and viewing the problem as a multi-objective problem. Since $g$ is monotone increasing with both $r$ and $w$, an optimal solution to the original problem is a Pareto efficient solution to the multi-objective problem. Thus, multi-objective DP can be used to generate all the efficient solutions to the multi-objective problem, from which the optimal solution to the original problem can be recovered (e.g. Carraway et al (1990), Domingo and Sniedovich (1993)).
The second possible expansion is in the state space. Here the difficulty caused by the square root term in (6.1) is resolved by incorporating the running sum

$$ y_n := \sum_{m=1}^{n-1} w_m(s_m, x_m), \qquad y_1 = 0, $$

in the state variable of the DP model, so that the expanded state variable is of the form $\sigma_n = (s_n, y_n)$. We can then consider the "expanded" objective function

$$ \hat g(\sigma_1, x_1, \dots, x_N) = \sum_{n=1}^{N} r_n(s_n, x_n) + \left( y_N + w_N(s_N, x_N) \right)^{1/2}, $$

where

$$ \sigma_{n+1} = (s_{n+1}, y_{n+1}), \quad s_{n+1} = T(n, s_n, x_n), $$

and

$$ y_{n+1} = y_n + w_n(s_n, x_n), \quad n = 1, \dots, N-1. $$

Observe that the expanded objective function is Markovian with respect to the expanded state variable.
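The state-space expansion can be sketched in a few lines: carry the accumulated argument of the square root alongside the original state and settle the non-separable term only at the horizon. The function names and the memoization device below are our own illustrative choices, and the sketch assumes the accumulated values are hashable (e.g., integers), so the expanded state space stays finite.

```python
# Sketch of the state-space expansion for (6.1): the expanded state is
# (n, s, y), with y the sum of the w-terms accumulated so far; the
# square root is applied once, at the horizon.

from functools import lru_cache

def solve_expanded(N, D, T, r, w):
    @lru_cache(maxsize=None)
    def p(n, s, y):
        if n > N:
            return y ** 0.5              # the non-separable term, settled last
        return max(r(n, s, x) + p(n + 1, T(n, s, x), y + w(n, s, x))
                   for x in D(n, s))
    return lambda s1: p(1, s1, 0)        # optimal value of the initial problem
```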
A possible "Hybrid" approach to the problem is to linearize the square root term on the right hand side of (6.1) and consider the parametric objective function:

$$ g_\lambda(s_1, x_1, \dots, x_N) = \sum_{n=1}^{N} r_n(s_n, x_n) + \lambda \sum_{n=1}^{N} w_n(s_n, x_n). $$

The point is that, for any given value of $\lambda$, this function is separable and Markovian with respect to the original state and decision variables. The idea is then to identify a value for the parameter $\lambda$ such that if we optimize the problem using this value of $\lambda$, we obtain an optimal solution to the original problem. This typically involves a line search, which in turn requires solving the parametric problem for a number of values of the parameter $\lambda$. Under appropriate conditions, composite concave programming can be used for this purpose (Sniedovich, 1992).
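The following sketch illustrates the hybrid scheme under simplifying assumptions of our own: a memoized DP solves the separable parametric problem for each fixed value of the multiplier, and a golden-section line search, a generic stand-in for the composite concave programming machinery cited above, selects the multiplier. The search is valid only if the original objective value is unimodal in the multiplier.

```python
# Sketch of the "Hybrid" approach for (6.1).  For a fixed multiplier lam
# the objective sum r_n + lam * sum w_n is separable, so ordinary DP
# applies; the outer line search scores each candidate with the ORIGINAL
# objective.

from functools import lru_cache

def solve_linearized(lam, N, D, T, r, w, s1):
    """Additive DP on the parametric objective; returns the decision sequence."""
    @lru_cache(maxsize=None)
    def p(n, s):
        if n > N:
            return 0.0, ()
        # (value, decisions) pairs; ties broken by comparing decision tuples
        return max((r(n, s, x) + lam * w(n, s, x) + p(n + 1, T(n, s, x))[0],
                    (x,) + p(n + 1, T(n, s, x))[1]) for x in D(n, s))
    return p(1, s1)[1]

def original_value(xs, T, r, w, s1):
    """Evaluate a decision sequence under the original objective (6.1)."""
    s, R, W = s1, 0.0, 0.0
    for n, x in enumerate(xs, start=1):
        R, W, s = R + r(n, s, x), W + w(n, s, x), T(n, s, x)
    return R + W ** 0.5

def hybrid(N, D, T, r, w, s1, lo=0.0, hi=10.0, tol=1e-4):
    """Golden-section search over lam (assumes unimodality of the score)."""
    def f(lam):
        return original_value(solve_linearized(lam, N, D, T, r, w, s1),
                              T, r, w, s1)
    phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    while b - a > tol:
        c, d = b - phi * (b - a), a + phi * (b - a)
        a, b = (a, d) if f(c) >= f(d) else (c, b)
    return solve_linearized((a + b) / 2, N, D, T, r, w, s1)
```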

7. DISCUSSION
One of the fascinating aspects of Bellman’s work on dynamic programming
is his attempt to capture the essence of the method by a short non-technical
description. Over the years this description - The Principle of Optimality - has
become synonymous with dynamic programming.
Unfortunately, the non-technical nature of the description has also led to
difficulties with common interpretations of the Principle, which in turn led to
criticism of Bellman’s work itself.
It was shown in this paper that a proper reading and interpretation of Bell-
man’s formulation of dynamic programming in general and the Principle in
particular can overcome the above difficulties.

REFERENCES
Bellman, R. (1957). Dynamic Programming, Princeton University Press, Prince-
ton, NJ.
Bertsekas, D.P. (1976). Dynamic Programming and Stochastic Control, Aca-
demic Press, NY.
Carraway, R.L., T.L. Morin, and H. Moskowitz. (1990). Generalized dynamic programming for multicriteria optimization, European Journal of Operational Research, 44, 95-104.
Denardo, E.V. and L.G. Mitten. (1967). Elements of sequential decision pro-
cesses, Journal of Industrial Engineering, 18, 106-112.
Denardo, E.V.(1982). Dynamic Programming Models and Applications, Prentice-
Hall, Englewood Cliffs, NJ.
Domingo, A. and M. Sniedovich. (1993). Experiments with algorithms for non-separable dynamic programming problems, European Journal of Operational Research, 67(1), 172-187.
Karp, R.M. and M. Held. (1967). Finite-state processes and dynamic programming, SIAM Journal on Applied Mathematics, 15, 693-718.
Kushner, H. (1971). Introduction to Stochastic Control, Holt, Rinehart and Win-
ston, NY.
Mitten, L.G. (1964). Composition principles for synthesis of optimal multistage
processes, Operations Research, 12, 414-424.
Morin, T.L. (1982). Monotonicity and the principle of optimality, Journal of Mathematical Analysis and Applications, 88, 665-674.
Porteus, E. (1975). An informal look at the principle of optimality, Management
Science, 21, 1346-1348.
Sniedovich, M. (1986). A new look at Bellman's principle of optimality, Journal of Optimization Theory and Applications, 49(1), 161-176.
Sniedovich, M. (1992). Dynamic Programming, Marcel Dekker, NY.
Woeginger, G.J. (2000). When does a dynamic programming formulation guarantee the existence of a fully polynomial time approximation scheme (FPTAS)?, INFORMS Journal on Computing.
Yakowitz, S. (1969). Mathematics of Adaptive Control Processes, Elsevier, NY.
Chapter 30

REFLECTIONS ON STATISTICAL METHODS FOR
COMPLEX STOCHASTIC SYSTEMS

Marcel F. Neuts
Department of Systems and Industrial Engineering
The University of Arizona
Tucson, AZ 85721, U.S.A.
marcel@sie.arizona.edu

Abstract Remembering conversations with Sidney Yakowitz on statistical methods for
stochastic systems, the author reflects on the difficulties of such methods and
describes several specific problems on which he and his students have worked.

Keywords: statistical methods for stochastic systems, computer experimental methods.

1. THE CHANGED STATISTICAL SCENE


Sid Yakowitz and I often had spirited discussions about statistical method-
ology for stochastic engineering systems. Sid is a friend who left many fond
memories. It is a tribute to him that his ideas and opinions still stir our minds.
This article offers some views on future developments of statistical methodol-
ogy for probability models and is my way of honoring the memory of a departed
friend.
We were graduate students around 1960, a decade after statistics fully came
into its own as an academic discipline and was widely recognized as an indis-
pensable tool in industrial, agricultural, and socio-political practice. In those
days, gathering data was always expensive and major computation was a chore.
The challenge to statistical methods was to squeeze useful insight from simple
models that, one hoped, captured the principal functional relations underlying
the data and could reasonably well account for the variability in those data.
In stochastic modelling, a similar situation prevailed. Those were the days
when queues, inventories, and epidemics were analyzed in detail under the
simplest Markovian assumptions and by using models with very few parameters.
Analytically explicit results were at a premium. Though burdened with ad
hoc, unrealistic assumptions, these models provided useful information such as
the equilibrium conditions for certain queues, simple operational rules for the
control of single-product inventories, and threshold theorems for epidemics.
It took a generation or longer before the impact of the computer was fully ap-
preciated but it surely changed the worlds of probability and statistics. Starting
in the early 1970s, I zealously argued that applied probability ought to place
greater emphasis on the mathematical structure underlying algorithmic solu-
tions than on rare cases of analytic tractability. Long before that was a com-
monly accepted view, I stated that a good computer algorithm offers greater
versatility and provides deeper physical insight than the heavily constrained
explicit formulas of our textbooks.
Early on, Sid independently reached similar conclusions. He was among the
first to teach a course on algorithmic methods for statistics and probability and in
1977, published the book Computational Probability and Simulation (Yakowitz,
1977), one of the earliest in algorithmic probability. He occasionally asked me
why, rather than devoting my efforts to algorithms for stochastic models, I had
not done really useful work and developed statistical procedures for these same
models instead.
In brief, my answer went as follows: Algorithmic methodology for proba-
bility models requires a complete rethinking of all existing theory. Based on
insight into the structure of the models, it is more abstract and mathematically
quite challenging. The constituent skills of algorithmic thinking are closer to
those of pure mathematics, say, of functional analysis and algebra, than of the
analytic manipulations that make up so much of applied mathematical education
and practice. The broad change of perspective needed of the researchers in our
field will (and did) take time. However, if that be true for probability models,
it is nothing compared to the technical difficulties, the radical paradigm shift,
and the social changes needed for statistical methods to meet the challenges
and opportunities of this age of rapid, massive information processing.
Why is the present time so exciting and, equally, so difficult for statistical
research? It is exciting because of the superbly enhanced means of gathering
and analyzing data. It is difficult because all existing statistical methods were
conceived and developed in an era of small data sets and onerous computation.
The principal statistical descriptors, such as moments and the familiar estima-
tors, arose out of work with univariate data. Decades of descriptive statistics
and of inferential analysis of data sets preceded and inspired the formal mathe-
matical development of statistics. Issues such as the best (always to mean, most
appropriate to the situation at hand) measures of central tendency and variability
were discussed for a long time. Now, the task of retooling is immense, the data
sets are huge and cannot easily be visualized, and the time pressure imposed
by the rapidly evolving scientific and technological needs is enormous.

When we were students, multivariate and time series analyses were already
beautiful mathematical theories. Sid had a more direct interest in these than I
but we agreed that the computational burden and the paucity of tractable results
made their application to actual data a daunting task. I, for one, was happy to
leave that to people in biology, economics, and the social sciences where the
reality of multivariate, highly dependent data could not be overlooked.
Work on stochastic models kept me well occupied and I could only follow
developments in statistics from a distance. From colleagues at Purdue Univer-
sity and elsewhere, I learned about bayesian procedures, about selection and
ranking, about variable selection in linear models, and other such work. During
the 1970s, there clearly was a growing preoccupation with numerical results
obtained by substantial algorithms, yet the statistical laboratory and the com-
puting center remained clearly separated worlds. In the first, people engaged
in statistical thinking, in the second, one sought advice and found help with
massive computer jobs.
In 1980 or so, during a one-day visit to Princeton University, I vividly ex-
perienced the thrill of seeing a changed, enriched statistical scene. A graduate
student demonstrated a software package for time series that he was developing.
A rich variety of statistical estimators, tests, and data transformations could be
interactively implemented to serve in the exploration of one or more traces of
a time series. Algorithm and computation, once barriers between methodolog-
ical and physical insight, had become our faithful servants, if not yet our allies.
The doctoral student had excellent knowledge of statistical theory and of the
computer’s capabilities. He combined them in a creative, synergistic research
project.
There are now many highly numerate statistical researchers; the years since
1980 brought major progress in the algorithmization of statistics. Professionally
written statistical software is now readily available. Judging by the textbooks,
by my experience during the latter years of my teaching career, and by visits
to universities in many countries, academic education in statistics still lags far
behind these developments. With few exceptions, students learn the elementary
mathematics underlying the most classical estimators and tests, not the deeper,
substantive insights needed to use existing software packages with confidence
and competence.
When trying to stir interest and enthusiasm, ponderous preaching about gen-
eralities is counter-productive. When asked for a look-ahead talk, I prefer to
choose some specific problems that are just beyond our present capabilities.
After explaining why they are important, I speculate about promising new ap-
proaches - promising in that they may get the job done, not in the first place
for leading to easily publishable papers. In Neuts (1998), I so discussed se-
lected problems in stochastic modelling. That area could benefit from greater
emphasis on understanding the physical behavior of the models.
Next, I shall mention illustrative problems suggested by questions in traffic
engineering and manufacturing. As academic researchers we can view those
problems from the rear lines, without the immediacy of deadlines. By taking
a general view, we can develop well-justified versatile methodology. If at all
successful, that methodology will find new, unanticipated applications.

2. MEASURING TELETRAFFIC DATA STREAMS


For their design and control, the measuring of Internet data streams is a rich
source of statistical problems. These are obviously relevant and timely. We
consider just one such problem and see where it can lead. A record of a data
stream is called a trace. It is a list of all teletraffic cells in the stream over
a span of time. It gives their arrival times, destinations, types, lengths, and
whatever other information is germane to their proper routing. A few seconds
of observation can yield traces of one billion items of data or more. Each
trace is only one observation; its length corresponds to the dimension of the
multivariate data. Obtaining traces is not easy. It requires special equipment
and careful planning; the mere fact of recording the trace interferes with the
data stream on the microscopic time scale where individual transactions take
place. For use in design work, teletraffic engineers maintain a limited collection
of representative traces.
It is much easier to obtain counts of the cells flowing past an observer during successive intervals of length a. The choice of a requires attention. How large can we choose a without losing important information about the traffic stream? Using only counts, can we reconstitute a reasonable approximation to the original trace for use, say, in future simulation experiments? Choosing a is clearly a recasting of the old statistical problem of grouping data for easier analysis or for plotting in histograms. A look at the literature of the past shows how long these issues were debated, even for ordinary, univariate data. The stated questions do not have clear-cut answers. In actuality, much depends on the context and on the specific application, but methodological studies serve to clarify criteria for the choice of a.
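As a concrete illustration of the reduction from a trace to counts, a few lines of Python suffice; the window length a is the parameter whose choice is at issue, and numpy's histogram routine does the grouping. The function name is our own.

```python
# A minimal sketch: bin the arrival epochs of a trace into consecutive
# windows of length a.  The resulting vector of counts is the reduced
# data whose information content is at issue.

import numpy as np

def trace_to_counts(arrival_times, a):
    t = np.asarray(arrival_times, dtype=float)
    edges = np.arange(0.0, t.max() + a, a)       # window boundaries 0, a, 2a, ...
    counts, _ = np.histogram(t, bins=edges)
    return counts
```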
Once a is chosen, how can we extract pertinent information from, say, ten
million counts? Obvious, inexpensive operations such as calculating a few
sample means and variances or plotting the marginal empirical density of the
counts are readily done, but what, if anything, do they actually tell us?
Extracting information from long strings of counts lies at the origin of a
growing literature on discrete time series. With one caveat, I believe that to be
a promising area for useful statistical work. Judging from the published work
of which I am aware, there is little interest in descriptive statistics and not much
contact with actual data sets. It is only a mild generalization - if one that is
painfully unkind - that academic statisticians do not look at data and applied
users do not look at theory. In an effort to belie that quip, one of my later
doctoral students, David Rauschenberg (Rauschenberg, 1994), examined ways
of summarizing long strings of counts in short strings of informative icons that
reflect the qualitative behavior of the counts over long substrings. Although
it is regrettably unpublished, I consider his thesis a seminal work leading to
data-analytic procedures that merit much further attention.
Reconstituting a traffic stream from counting statistics is only an example in a
vast class of problems dealing with random transformations of point processes.
During the early 1990s, I worked with several Ph.D. students on local Poissoni-
fication, the operation whereby the events during successive intervals of length
a are uniformly redistributed over these intervals, see Neuts et al. (1992) and
Neuts (1994). What was our qualitative thinking behind that construction?
If you only have the event counts over intervals of length a, you cannot
recover information about the exact arrival times during those intervals. You
can redistribute the points regularly, place them all in a "group arrival" in the
middle of the interval, or, as we did, you can imagine that they are uniformly and
randomly distributed - as they would be if the original process was a Poisson
process. What we studied has the intuitive flavor of a "linear interpolation."
That intuition was indeed borne out by some formulas for lower order moments
that we derived.
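In code, local Poissonification is a one-screen operation. The sketch below, our own illustration using numpy, keeps the count in each window of length a and replaces the recorded epochs by uniform draws, which is exactly the construction just described.

```python
# Sketch of local Poissonification: within each window [ka, (k+1)a),
# keep the NUMBER of events but place them uniformly at random, as they
# would fall for a Poisson process.

import numpy as np

def poissonify(arrival_times, a, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    t = np.asarray(arrival_times, dtype=float)
    edges = np.arange(0.0, t.max() + a, a)
    counts, _ = np.histogram(t, bins=edges)
    pieces = [np.sort(rng.uniform(edges[k], edges[k + 1], size=n))
              for k, n in enumerate(counts) if n > 0]
    return np.concatenate(pieces) if pieces else t
```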
Unless a is large, differences between the original and the reconstituted
processes should not matter greatly - exactly the same idea that underlies the
grouping of univariate or bivariate data. The statement of that intuition is vague.
One can assail it with criticism or, constructively, one can give technically
precise formulations that are amenable to rigorous scientific inquiry.
In Neuts et al. (1992), we initiated a theoretical study of local Poissonification
for the family of versatile benchmark processes, known as MAPs (Markovian
arrival processes), see e.g., Neuts (1992), Narayana and Neuts (1992), and
Lucantoni (1993). I wish we had been able to pursue that study further along
the following lines:
The pertinent engineering question is whether and when we can use counting
data instead of detailed, but expensive traces. The answer to that question is
context dependent. It depends on the service mechanisms to which the traffic
is offered. In a queueing context, for example, when there is any appreciable
queueing at all, the operation of service systems is little or not affected by slight
perturbations in the arrival times of packets. Whether the packet comes a little
earlier or a little later only means that it spends a little more or a little less time
waiting in the queue.
With the restrictive assumptions needed for classical queueing analysis, it
is impossible to model the effect of local Poissonification (or of other trans-
formations known in the engineering literature as traffic shaping) by standard
analytic or algorithmic methods. Moreover, to compare an input stream and its
poissonifications for various values of a, it is not enough to treat each case sep-
arately. Using simulation terminology, one should run simultaneous, parallel
simulations in which the various poissonifications of a given input stream are
subject to identical service times. Valid comparisons are possible only when
that is done. Experimental physicists know that, in meaningful comparison
studies, one varies only one or two parameters between experimental runs,
keeping all other conditions as much as possible the same. People with solid
grounding in probability understand that, to compare two experiments (and not
merely some simple descriptor, such as a mean) you formalize both on the
same probability space. Therein lies the fundamental difficulty of - and the
serious scientific reservations to - the many engineering approximations com-
mon in applied stochastic models. For approximate models to be scientifically
validated, we need to compare differences in the realizations of the stochastic
processes, not merely in crude descriptors such as the mean or standard deviation
of the delay.
A major difficulty in doing so is the paucity and extreme difficulty of theoretical results on multi-variate stochastic processes. For a few years now, I have increasingly realized the importance of computer experimentation in stochastic modelling. In Neuts (1997) and Neuts (1998), I adduce reasons for that importance. As a case in point, the study of the effect of the window size a leads to a pretty, seminal computer experiment. We generate a large data base of (say) 10 million or more interarrival times and we construct poissonifications of that random point sequence with K different values of a. These we offer (in parallel) to single servers with identical strings of service times, generated from a common distribution F(·). We so obtain K realizations of the queueing process that differ only in the values of the parameter a.
What are some technical issues that arise in the design and analysis of that
experiment? In the first place, note that a common input stream is used for all K
poissonifications. Poissonification does not affect the order of the arrivals. We
may therefore think of each arrival as a clearly identified job. The service time
of that job is unaffected by the choice of a. Therefore, the original input and
all K poissonified streams are subject to the same queueing delays; a common
sequence of service times is used.
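A bare-bones version of the experiment, reusing the poissonify sketch above, runs the same service-time sequence through the original stream and through each poissonified stream via Lindley's recursion. All numerical choices here (rates, window sizes, sample size) are arbitrary illustrations, not values from any actual study.

```python
# Parallel comparison under common service times.  Waiting times follow
# Lindley's recursion: W[0] = 0, W[j+1] = max(0, W[j] + S[j] - A[j+1]),
# where A are interarrival times and S the common service times.

import numpy as np

def waiting_times(arrivals, services):
    inter = np.diff(arrivals)
    W = np.zeros(len(arrivals))
    for j in range(len(inter)):
        W[j + 1] = max(0.0, W[j] + services[j] - inter[j])
    return W

rng = np.random.default_rng(1)
base = np.cumsum(rng.exponential(1.0, size=100_000))   # base arrival stream
services = rng.exponential(0.8, size=base.size)        # one common service sequence
W0 = waiting_times(base, services)
for a in (0.5, 2.0, 8.0):                              # K = 3 window sizes
    Wa = waiting_times(poissonify(base, a, rng), services)
    diffs = Wa - W0       # per-job delay differences: a trace of a stationary process
    print(a, diffs.mean(), diffs.std())
```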
If the original arrival stream comes from a stationary point process, it is
easy to assure that all K poissonified streams are also realizations of stationary
processes. The most interesting part is the analysis of the output of the exper-
iment. As we are mainly interested in the differences between the queue with
the original input and each of the K models with poissonified arrivals, I would
form the sequences of differences between the delay of each job in the original
queue and in each of the poissonified models. Each such sequence is a trace
of a stationary process; we can apply established statistical procedures to it.
In comparing distinct sequences of differences, that is, comparing the results
for different values of a, we must bear in mind that these are highly dependent
stochastic processes. It is likely that only data-analytic comparisons remain
possible. Qualitative conclusions from such comparisons need be validated by
replications of the entire experiment with different, independently generated
data sets. The choice of the statistics used in comparisons, the informative
representation and summary of data, and the efficient performance of the ex-
periments, all present interesting new questions and challenges. Experience
gained from one experimental study facilitates future ones and therein lies the
potential growth of this field.
The problem and the methodological approach that I have just described
have important counterparts in engineering practice. I already mentioned the
traffic traces of telecommunications applications. How are different traces or
simulated traces from a proposed traffic model compared? A common practice
is to use various measured or simulated arrival flows as input to single or multiple
queues with constant service times. For many highly calibrated manufacturing
or communications devices, the assumption of constant processing times is
plausible. These input processes are typically offered - in parallel simulations
- to servers with different holding times. A given input to various servers with
different constant holding times is then interpreted as though input streams of
various rates were offered to a server with a single, fixed holding time.
Measured quantities, such as the average delay or the frequency of loss in a
finite-buffer model, are typically quite robust. Useful engineering conclusions
are drawn from them although without a formal statistical justification. The
high dependence between the various simulated realizations and the heuristic
manner in which estimates are obtained offer challenges to statistical analysis.
In both problems I have mentioned, the general methodological issue is the
same. How can we meaningfully measure differences between (dependent)
stochastic processes whose realizations are relatively minor perturbations of
each other?

3. MONITORING QUEUEING BEHAVIOR


I have long been interested in procedures for monitoring the behavior of
queueing systems. Monitoring differs from control. Its objective is to signal
unusual excursions of the current workload (queue length or backlog of work)
and to classify such excursions into different categories to identify their causes.
When the cause of unusual excursions is identified, we can take appropriate
control actions.
A discussion of a monitor for a classical queueing model is found in Neuts
(1988), and an application is described in Widjaja et al. (1996), and Li et al.
(1998). For simple, tractable models, we can define and compute a profile
curve. In brief, that is a stochastic upper envelope of the rare excursions above
a threshold K. As explained in Neuts (1988), the monitor becomes active
when the queue length exceeds K and becomes dormant again when it returns
to values below K. However, when an active monitor detects that the queue
length exceeds the profile curve, a call for a control action ensues. The path
of the excursion between K and the profile curve reveals the nature of the
excursion and that is important to appropriate control intervention. Monitoring
is another subject that, had time and personnel resources permitted it, I would
have pursued in greater depth.
It is obvious that monitoring multi-dimensional stochastic processes with complex dependence, so prevalent in queueing networks and production systems, presents the greatest methodological challenge. I shall discuss two cases in some detail.
Low-Dimensional Processes: The profile curve is an involved analogue of the familiar control limits of quality control. It is much more involved as it deals with highly dependent stochastic process data. The situation is even more complex for multivariate stochastic processes. We could define control limits for low-dimensional stochastic processes as follows: A realization of the process traces out a path in d-dimensional Euclidean space. As the process is typically observed at discrete time epochs, our observations form a set S of isolated data points.

The successive peeled convex hulls of our data set are informative statistical descriptors of the multivariate process. We start with the standard convex hull of S and we delete all its extreme points. The convex hull of the remaining set is the first peeled convex hull of S. We repeat the peeling procedure on it to construct the second peeled convex hull, and so on.
We can define a monitoring region, for instance, as the difference between the convex hull of S and the first peeled convex hull to contain fewer than 95 percent of the original data points. The monitor is dormant when the process stays within that inner peeled convex hull. When it enters the monitoring region, we track the magnitude and the gradient of the changes in the process as information for whatever control actions are appropriate.
The implementation of such a monitoring scheme relies on methods that
are now standard in computational geometry. A detailed statistical analysis
of the procedure, if possible at all, will be limited to very special processes.
An extensive experimental exploration using various interesting multivariate
processes with few restrictions would, I believe, inspire greater confidence in
the statistical soundness and the practical utility of such a monitoring procedure.
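A sketch of the peeling step and of the resulting dormant/active test is given below, using scipy's standard computational-geometry routines. The 95 percent threshold follows the suggestion above; the two-dimensional toy data and the function names are our illustrative choices.

```python
# Convex-hull peeling: repeatedly delete the extreme points until the
# remaining set holds fewer than `fraction` of the original points; the
# band between the outer hull and this inner hull is the monitoring region.

import numpy as np
from scipy.spatial import ConvexHull, Delaunay

def inner_peeled_hull(points, fraction=0.95):
    pts = np.asarray(points, dtype=float)
    remaining = pts
    while len(remaining) >= fraction * len(pts) and len(remaining) > pts.shape[1] + 1:
        hull = ConvexHull(remaining)
        keep = np.ones(len(remaining), dtype=bool)
        keep[hull.vertices] = False            # peel off the extreme points
        remaining = remaining[keep]
    return remaining                            # points spanning the inner hull

def monitor_state(x, inner_points):
    """'dormant' inside the inner peeled hull, 'active' once outside it."""
    inside = Delaunay(inner_points).find_simplex(np.atleast_2d(x)) >= 0
    return "dormant" if inside.all() else "active"

rng = np.random.default_rng(0)
data = rng.normal(size=(2000, 2))               # toy planar process observations
inner = inner_peeled_hull(data)
print(monitor_state([0.1, -0.2], inner), monitor_state([4.0, 4.0], inner))
```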
Stochastic Networks: Modern technology abounds with examples of stochas-
tic networks. Only the rarest yield to modelling analysis. While, of necessity,
engineers have extensive experience with measurements and monitoring of net-
work traffic, their practice rests on the scantest of statistical methodology.
Taking queueing networks as a working metaphor, I wondered how we could
possibly monitor the traffic in such networks. A detailed observation of all
network transactions is nearly always impractical, if not infeasible. On the
other hand, we can usually observe, at little cost, the increments and decrements
of the job load within sub-nodes of the network. Usually, these correspond to
arrivals to and departures from the sub-node, but there are also cases where
jobs increase by splitting or may disappear from the node without completing
service.
My last doctoral student, Jian-Min Li, and I initiated work on the process
of the epochs of increments and decrements of a network node (Neuts and Li,
1999; Neuts and Li, 2000). We studied the fluctuations in the occurrences of
increments and decrements by imagining a competition for runs between these
two kinds of events. For a given, stable node in steady-state, we expect the
departures and arrivals to be well interlaced. A quick, large increase in the
arrivals suggests a rapid buildup in the content of the node. A long string of
departures may indicate that the node is being starved for work.
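The run statistics behind this idea are easy to extract. In the sketch below, our own illustration, arrivals and departures are merged into one time-ordered sequence of +1/-1 marks and the lengths of maximal runs of each kind are recorded.

```python
# Merge increment (+1) and decrement (-1) epochs and record maximal runs.
# Long runs of +1 flag a rapid buildup; long runs of -1 suggest a node
# that is being starved for work.

from itertools import groupby

def run_lengths(arrival_epochs, departure_epochs):
    events = sorted([(t, +1) for t in arrival_epochs] +
                    [(t, -1) for t in departure_epochs])
    marks = [m for _, m in events]
    return [(m, sum(1 for _ in g)) for m, g in groupby(marks)]

# Example: event types + + - + - - -  ->  [(1, 2), (-1, 1), (1, 1), (-1, 3)]
print(run_lengths([0.1, 0.2, 0.5], [0.3, 0.6, 0.7, 0.9]))
```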
The idea behind our work was to use the statistical characteristics of normal
fluctuations in the network to construct one or more monitors for selected sub-
nodes. Due to various circumstances, our work remained in its initial stages.
We believe, however, that the results were promising and merit further attention.
Sid was definitely right. Statistical methods for stochastic models are sorely
needed and of the greatest practical importance. Their development is chal-
lenging; it will require new, daring, and unconventional approaches.

ACKNOWLEDGMENTS
This research of M. F. Neuts was supported in part by NSF Grant Nr. DMI-
9988749.

REFERENCES
Li, J-M., M. F. Neuts, and I. Widjaja. (1998). Congestion detection in ATM
networks. Performance Evaluation, 34:147–168.
Lucantoni, D.M. (1993). The BMAP/G/1 queue: A Tutorial. In Lorenzo Do-
natiello and Randolph Nelson, editors, Performance Evaluation of Computer
and Communication systems: Joint Tutorial Papers of Performance ’93 and
Sigmetrics ’93, pages 330–358. Springer-Verlag, Berlin.
Narayana, S. and M. F. Neuts. (1992). The first two moments matrices of
the counts for the Markovian arrival process. Communications in Statistics:
Stochastic Models, 8(3):459–477.
Neuts, M.F. (1988). Profile curves for the M/G/1 queue with group arrivals.
Communications in Statistics: Stochastic Models, 4(2):277–298.
Neuts, M.F. (1992). Models based on the Markovian arrival processes. IEICE
Transactions On Communications, E75-B(12):1255–65.
Neuts, M.F. (1994). The Palm measure of a poissonified stationary point process.
In Ramón Gutierrez and Mariano J. Valderrama, editors, Selected Topics on
Stochastic Modelling, pages 26–40. Singapore: World Scientific.
Neuts, M.F. (1997). Probability Modelling in the Computer Age. Keynote ad-
dress, International Conference, Stochastic and Numerical Modelling and
Applications, Utkal University, Bhubaneswar, India.
Neuts, M.F. (1998). Some promising directions in algorithmic probability. In
Attahiru S. Alfa and Srinivas R. Chakravarthy, editors, Advances in Matrix
Analytic Methods for Stochastic Models, pages 429–443. Neshanic Station,
NJ: Notable Publications, Inc.
Neuts, M.F. and J-M Li. (1999). Point processes competing for runs: A new tool
for their investigation. Methodology and Computing in Applied Probability,
1:29–53.
Neuts, M.F. and J-M. Li. (2000). The input/output process of a queue. Applied
Stochastic Models in Business and Industry, 16:11–21.
Neuts, M.F., D. Liu, and S. Narayana. (1992). Local poissonification of the
Markovian arrival process. Communications in Statistics: Stochastic Models,
8(1):87–129.
Rauschenberg, D.E. (1994). Computer-Graphical Exploration of Large Data
Sets from Teletraffic. PhD thesis, The University of Arizona, Tucson, Arizona.
Widjaja, I., M. F. Neuts, and J-M. Li. (1996). Conditional Overflow Probability
and Profile Curve for ATM Congestion Detection. IEEE.
Yakowitz, S. (1977). Computational Probability and Simulation. Addison-Wesley,
Reading, MA.
Author Index

Aarts, 385, 388, 410, 412, 416 Bakhvalov, 466


Abbad, 510 Balarm, 334
Abel, 203, 288, 299 Ball, 627, 648
Abramowitz, 128 Ballou, 631–632, 649
Ackley, 385, 410 Banach, 77, 79–80, 84, 86
Acworth, 466 Banks, 32, 38, 42, 52
Adelson, 712, 732 Banzhaf, 390, 411
Agrawal, 46, 52 Baras, 546
Agresti, 196, 198 Barettino, 153
Ahmad, 590, 595 Barlow, 621
Ahmed, 682 Barnsley, 732
Aizerman, 206, 221 Bartoszewicz, 618–621
Akad, 470 Basawa, 561, 595
Akademi, 415 Basu, 622
Akritas, 566, 595 Bauer, 208, 210, 221
Algoet, 11, 226–227, 231, 235, 246–247, 594 Baum, 409, 411
Altman, 32 Bayes, 41, 225, 690–691
Aluffi-Pentini, 411 512
Alzaid, 618, 621 Bean, 648
Amin, 623 Beato, 153
Anantharam, 38, 46, 52 Beauchamp, 657, 676, 682
Andersen, 79, 87, 90, 198 Becker, 411
Anderson, 269–270, 273–274, 281–282, 693 Beckman, 471
Ando, 512 Beekmann, 203, 223
Andrews, 682 Bekey, 411
Anily, 388, 411 Bellman, 270, 333–334, 336, 338, 344–345, 349,
Ankenman, 692 356–357, 496, 500, 735–736, 738–739, 748
Anorina, 79, 90 Belmont, 510
Antonov, 466 Belzunce, 615, 621
Aoki, 334, 357 Benardi, 131
Applegate, 636, 647 Benveniste, 51–52, 370, 374, 376, 381
Aragon, 388, 413 Berger, 690
Arapostathis, 544–545 Bernardi, 131, 152
Araujo, 84–85, 90 Bernoulli, 39–40, 118, 392, 399, 443
Arnold, 250, 267, 620–621 Berry, 52
Asmussen, 183 Bertsekas, 510, 545, 736, 748
Athreya, 577, 595 Bertsimas, 32, 638, 647
Avila-Godoy, 545 Berzofsky, 152
Avramidis, 466 Bhattacharya, 566, 575, 595, 602
Baccelli, 32 Bhubaneswar, 760
Badowski, 513 Bianchi, 227
Bagai, 618, 621 Biase, 389, 412
Bahadur, 63, 90, 561, 595 Bickel, 692
Bailey, 235, 246 Bilbro, 389, 411
Billingsley, 32, 66–67, 90, 510, 558, 560, 595 Caudill, 331
Bina, 131, 133, 152 Cavazos–Cadena, 516–517, 520–523, 525, 529,
Birge, 647 532, 543–545
Bisgaard, 692 Cavert, 99, 114
Bittanti, 370 Cease, 152
Bixby, 647 Cerny, 412
Black, 270, 282 Cesa-Bianchi, 246
Blaisdell, 131, 152–153 Cesaro, 176
Blankenship, 22, 32, 510 Chakravarthy, 760
Block, 621 Chang, 39–41, 52
Blount, 11, 575 Chao, 298–299, 590
Blum, 50, 52 Charnes, 630
Bodson, 376, 381 Chatfield, 662, 682
Boender, 389, 411 Chebyshev, 67
Bohachevsky, 411 Chen, 32, 687
Boland, 608, 612, 614, 621 Cheng, 329, 381
Bolshoy, 153 Chernoff, 37, 53
Boltzmann, 388 Chesher, 700
Borkar, 45, 52, 545 Chevi, 647
Borovkov, 32, 62–63, 65, 72, 75, 84, 90–91 Chiarella, 252, 267
Bosq, 581, 595 Chin 690 Chistyakov, 79
Boyle, 467, 474 Chistyakov, 79, 91
Braaten, 467 Cholesky, 327
Bradley, 581, 585, 595–596 Chow, 246
Bramson, 14, 32 Christofides, 629, 648
Bratley, 467 Chui, 732
Brau, 545 Chung, 510
Braverman, 206, 221 Chvatal, 647
Breiman, 246 Cieslak, 471
Bremermann, 387, 390, 411 Clark, 51, 53, 156, 629, 647
Brezzi, 39–43, 52 Clarke, 184, 647
Broadie, 466–467 Clements, 330
Brooks, 411 Cobb, 692
Bucher, 153 Cochran, 467
Buckingham, 131, 152 Cohen, 96, 98–99, 114
Bucy, 22, 32 Columban, 2
Bunick, 154 Compagner, 474
Burman, 302, 329, 594, 596 Conover, 471
Burrus, 682 Conway, 467
Burt, 712, 732 Cook, 329, 647, 692
Buyukkoc, 39, 54 Cooley, 682
Byrnes, 546 Cooper, 630
Caflisch, 467, 472 Cornette, 152
Caines, 370–371 Corput, 438
Calladine, 131, 141, 153 Cosman, 732
Cameron, 732 Cournot, 2, 249–251, 263–267
Campi, 370 Courtois, 510
Cantelli, 242, 402–403, 409, 569, 574 Cover, 327, 329, 730, 732
Cao, 115 Cox, 198–199, 250, 267
Capitanio, 329 Cranley, 467
Carr, 250, 267 Cristion, 51, 54
Carraway, 748 Crothers, 131, 153
Carrillo, 303, 329 Cushing, 250, 267
Casella, 687 Dai, 32–33
Castelana, 594 Daley, 4, 11
Castellana, 596 Daniels, 687
Cauchy, 180 Danielson, 678
Datta, 546 Ethier, 33, 511
Davis, 510, 603 Eubank, 575, 596
Davison, 687 Fabian, 53
Davydov, 577, 596 Fabius, 40, 53
Dawande, 329 Fahrmeir, 198
Dekker, 199, 223, 330, 749 Fauci, 96, 99, 114
Dekkers, 388, 412 Faure, 429–430, 432, 440–441, 447, 451, 468
Delebecque, 496, 510 Fayolle, 14, 33
DeLisi, 152 Feder, 227, 229, 246–247
Dembo, 58, 91, 510 Federgruen, 388, 411
Dempster, 381 Feller, 63, 65, 67, 91, 118, 128
Denardo, 735–736, 738, 748–749 Fennell, 330
Denny, 6–7, 574, 596 Fernández–Gaucherand, 516–517, 520–523, 525,
Devroye, 202–203, 205, 207, 216, 221, 236, 246, 529, 532, 543–545
386–387, 392–393, 400, 404, 412, 575, 596 Fernández-Gaucherand, 545
Di Masi, 511 Field, 687
Dietrich, 9, 11 Fill, 63, 91
Diggle, 198 Fincke, 468
Dirac, 252 Firth, 198
Dixit, 283 Fisher, 6, 379, 383, 392, 394, 400–401, 412, 416,
Djereveckii, 370 696, 700, 732
Do, 687 Fishwick, 471
Doeblin, 516–517, 520, 522, 526 Fleming, 479, 500, 511, 516, 545
Doksum, 692 Fokianos, 190–191, 198
Domingo, 747, 749 Foss, 32
Doob, 335, 357, 572–573, 575, 580, 596 Fourier, 132–134, 138, 453–454, 651–653, 662,
Dordrecht, 221–222, 414 664, 667, 669, 678
Douglas, 692 Fox, 286, 299, 466–468, 511
Doukhan, 577, 581, 596 Fradkov, 370
Down, 33–34 Franco, 621
Drew, 131, 141, 152–153 Fraser, 687
Dror, 5, 627–629, 633, 638–639, 641, 643–649 Fredholm, 486
Druzdzel, 467 Freund, 246
Duckstein, 6–7 Friedel, 468
Duff, 8 Frobenious, 516, 521–522
Duffie, 283, 468 Frontini, 389, 412
Dunford, 203, 218, 221 Furniss, 286, 299
Dupuis, 15, 23, 33 Fushimi, 472
Dvoretzky, 402, 412 Gaimon, 329
Dykstra, 621 Gaitsgori, 512
Dynkin, 335–336, 357 Gaivoronski, 157, 183
Eberly, 288, 299 Galambosi, 11
Edgeworth, 687 Gallager, 512
Efimov, 469 Gamarnik, 32
Efron, 205, 221, 468, 686 Gani, 3, 5, 9–11, 383, 513, 575, 596
Eilon, 629, 648 Garrett, 117, 125, 128
Eisenberg, 133, 153 Gastwirth, 401, 412
Elliot, 546 Gauss, 703
Ellis, 493 Gaviano, 387, 413
Embrechts, 79, 85, 91 Geffroy, 391, 395, 398–399, 413
Endo, 732 Gelatt, 53, 387, 414
Entacher, 468 Gelfand, 388, 413
Erlang, 62, 65 Geman, 369–370, 381, 387, 413
Ermakov, 390, 412 Georgiev, 575, 590, 596–597
Ermoliev, 157, 183, 385, 412 Gerencsér, 245–247, 362, 370
Essunger, 115 Gershwin, 329, 511
Etemadi, 221 Gheysens, 649
Ghosh, 687, 690 Held, 735, 738, 749
Gibbs, 388 Hellekalek, 466, 468–470, 472
Gidas, 388, 413 Helmbold, 246
Giddings, 133, 154 Henry, 114
Giessen, 152 Hermite, 687
Gill, 198, 383 Hernández–Hernández, 546
Gilliam, 546 Heunis, 371
Giroux, 157, 184 Heyde, 59, 79, 87, 91, 202, 223, 581, 594
Gittins, 38–40, 44, 53 Hickernell, 452, 466, 468–471, 473–474
Gladyshev, 206, 222 Hilbert, 453
Glasserman, 466–467 Hillier, 511
Glynn, 157, 184, 511 Hinderer, 546
Gnedenko, 395–396, 413 Hingham, 417
Goldberg, 385, 390, 413 Hinkley, 687
Golden, 627, 630–631, 648–649 Hipel, 9
Goldenshluger, 245–247 Hjorth, 686
Goldie, 91 Hlawka, 421, 469–470
Golub, 469 Hoadley, 652, 682, 686
Golubov, 468 Hoeffding, 470, 581
Good, 682 Hoffmeister, 387, 411
Gorodetskii, 577, 597 Holland, 53, 385, 390, 392, 412–413, 416
Gosh, 545 Holmes, 250, 258, 267
Goutis, 687 Hopf, 249–250, 255, 258–259, 261–262, 267
Graves, 53 Hoppensteadt, 511
Groningen, 597 Hordjik, 546
Grunstein, 141, 153 Howard, 546
Gruyter, 221 Howroyd, 268
Guckenheimer, 250, 258, 267 Hu, 62–63, 91, 614, 617, 621
Guo, 370 Huffman, 730
Gurin, 393, 413 Hui, 690
Gutman, 246 Hurd, 732
Györfi, 202, 207, 227, 234–235, 238, 246–248, 371 Hutter, 8–9
Györfi, 221–222 Huzurbazar, 687
Haario, 388, 413 Hwang, 387, 413
Haase, 96, 99, 114 Hymer, 5
Haber, 469 Härdle, 222
Habib-agahi, 299 Ibragimov, 577, 581, 597
Hadamard, 659–661 Imamura, 416
Hajek, 387–388, 413 Inagaki, 566, 597
Hall, 686 Invernizzi, 252, 267
Halton, 425, 440–441, 452, 469 Ioannides, 575, 577, 580–581, 601
Hamilton, 511 Ionescu, 518
Hansen, 511 Ioshikhes, 131, 153
Hardy, 203, 222, 421 Iosifescu, 482, 487, 511
Harp, 154 Jackson, 250, 268
Harris, 22–23, 298, 575 Jacobson, 516, 546
Hart, 302, 329 Jacod, 199
Hartjey, 547 Jailet, 648
Hartman, 389 Jaillet, 626
Hasenbein, 32 Jain, 622
Haussler, 246 James, 546, 690
Hayes, 9–10, 383, 416, 513 Jaquette, 546
Hazelrigg, 305, 329 Jarvis, 34, 385, 389–390, 413
He, 621, 623 Jawayardena, 547
Heidelberg, 222–223 Jayawardena, 10–11, 48, 54, 384, 417, 604
Heideman, 682 Jaynes, 356–357
Heinrich, 469 Jensens, 205
Jiang, 329 Korobov, 425–427, 434, 437, 439, 442, 444, 447,
Jing, 622 470
Joag-Dev, 622 Koronacki, 393, 414
Johnson, 388–389, 411, 413, 566, 597, 682 Korst, 385, 388, 410
Joines, 471 Korwar, 616, 622
Jones, 38, 53 Kourapova, 733
Joshi, 692 Kouritzin, 371
Joslin, 371 Krause, 421
Kabanov, 511 Kreimer, 645–646, 648
Kaebling, 44, 53 Krimmel, 7, 474
Kalman, 97, 105, 111–114, 375 Kronecker, 209, 220–221
Kamps, 620, 622 Krzysztofowicz, 6–7
Kanatani, 381 202–203, 207, 216, 221
Kane, 682 Kubicek, 250, 268
Kao, 68, 91 Kuk, 381
Karlin, 153, 511 Kumar, 14, 33–34, 37, 53, 512, 546
Karlsson, 9 Kurita, 471
Karmanov, 393, 414 Kurtz, 33, 511
Karp, 735, 738, 749 Kushner, 15–17, 22, 25, 32–34, 51, 53, 156, 183,
Katz, 409, 411 335, 357, 371, 376, 381, 414, 512, 736, 749
Kaufman, 411 Kwakernaak, 334, 357
Kaufmann, 198–199 L’Ecuyer, 184, 474, 513
Kawaguchi, 732 Lafeuillade, 96, 99, 114
Kedem, 3, 5, 10, 190–191, 198–199 Lago, 389, 411
Keenan, 199 Lagrange, 162
Keller, 390, 468 Laha, 705
Kendrick, 511 Lai, 10, 37, 39–44, 46–50, 52–53, 59, 91, 392, 414,
Kesten, 577, 597 512, 546, 575, 598
Kettenring, 652, 682 Laipala, 626, 648
Khaledi, 610, 616–620, 622 Laird, 381
Khasminskii, 22, 33, 478, 511 Lam, 114
Khomin, 252, 267 Lanczos, 678
Kiefer, 50, 53, 373, 402, 412, 414, 682 Laplace, 620
Kieffer, 11, 248, 719, 721, 725, 729–733 Laporte, 633–637, 647–648
Kirkpatrick, 35, 53, 414 Larson, 627, 644, 648
Kirmani, 615–616, 619, 622 Laurent, 432–433, 435, 437, 439
Kirschner, 96, 99, 114 Lauss, 470
Kisiel, 6, 574, 596 Le Cam, 561, 566, 598
Kitagawa, 381 Leadbetter, 575, 586, 594, 603
Kivinen, 227, 229, 247 Leake, 334, 357
Kleinrock, 175, 183 Lebesgue, 51, 60, 73, 75, 384, 463, 557, 572
Klimov, 53 Lee, 334
Klug, 141, 154 Leeb, 469
Klüppelberg, 91 Legendre, 492
Kmenta, 692 Lehoczy, 466
Kniazeva, 513 Leibler, 38, 46
Knuth, 732 Lem, 397
Koch, 685 Lemieux, 469–471
Kochar, 610, 615–619, 621–622 Lempel, 732–733
Kohler, 221 Lenhart, 329
Kokotovic, 496, 512 Leontief, 286, 476, 495
Kolassa, 687 Levitan, 473
Kollier, 10, 44, 54, 383, 417 Levy, 96
Kong, 34, 374–376, 378–379, 381 Li, 10, 54, 417, 622, 759–760
Kopel, 250, 268 Liang, 198
Koppel, 4 Liapunov, 13, 15, 22–25, 30–31
Korman, 329 Liaw, 730, 733
Lidl, 471 Marti, 393, 414
Lieberman, 511, 687 Martingale, 191
Lillo, 610–611, 622 Maryak, 685, 690
Lin, 34, 330, 575, 590, 730 Masry, 585, 590, 594, 598
Lindley, 176 Massart, 402, 414
Linnik, 79, 91, 581 Mateo, 411–412
Lipschitz, 23, 156, 158, 160, 335, 386, 456, 485, Matheson, 546
489, 494, 572–573, 593 Mathur, 631–632, 649
Littlestone, 227, 247 Matous, 451
Littman, 44, 53 Matsumoto, 471
Liu, 334, 512–513, 733, 760 Matyas, 387, 414
Ljung, 51, 54, 184, 360, 371, 374, 381, 686 McCullagh, 186, 191, 196, 199
Loh, 471 McDiarmid, 205, 222
Longnecker, 581, 598 McDougall, 153
Lorenz, 620 McEneany, 516
Louie, 416 McGeoch, 388
Louveaux, 633, 648 McIntyre, 302, 329
Lowe, 9, 43–44, 48, 54, 59, 68, 93, 383, 392, 417, Mckay, 471
575 McLachlan, 133, 153
Loève, 546 Mcleod, 9
Loève, 222 Medio, 268
Lu, 14, 34, 330 Meerkov, 414
Lucantoni, 759 Meits, 302
Lugosi, 9, 44, 54, 202, 205, 221, 246–247, 384, Meitz, 329
392, 417 Mengeritsky, 131, 153
Lui, 469 Menshikov, 33
Luk, 469 Merhav, 227, 246–247
Lukes, 334, 357 Merrow, 285, 299
Luman, 304, 329 Methuen, 198
Lunneburg, 686 Metivier, 51–52, 381
Luttmer, 471 Metropolis, 381, 414
Lyapunov, 380 Meyn, 33–34, 371
Macmillan, 6 Michelena, 330
MacPherson, 467 Miller, 253, 268
MacQueen, 206, 222 Minton, 330
Mai, 10, 44, 54, 474 Miranker, 511
Maisonneuve, 471 Misra, 616, 622
Majety, 303, 329 Mitten, 735, 738, 749
Makis, 303, 329 Mitter, 98, 115, 388, 413
Makovian, 742 Mocarski, 153
Malyshev, 14, 33–34 Mockus, 414
Manbir, 5 Monro, 50, 54
Mandl, 45, 54 Monroe, 155
Mann, 402, 414 Moore, 53
Manning, 721, 725 Moran, 4
Manz, 682 Morin, 748–749
Marchands, 373 Morohosi, 471
Marcus, 516, 522, 525, 543, 545–546 Morokoff, 467, 472
Marek, 250, 268 Mortensen, 38, 54
Margaht, 152 Morton, 330
Markov, 3–4, 14, 22–24, 35, 37–38, 44–48, 235, Morvai, 11, 202, 222, 226–227, 247–248, 594, 598
373, 375, 377, 387, 476–483, 485, 487–489, Moskovwitz, 748
492–496, 498, 500–502, 504, 506–509, Moskowitz, 467
515–516, 518–519, 527, 531–533, 540, 544, Muelen, 616
556–558, 561, 563, 566–567, 571–572, Murray, 7–8
574–575, 580, 625, 641, 643, 647, 731–732 Muyldermans, 131, 153
Markowitz, 115 Myers, 299
Männer, 414 Pearlman, 712, 733
Métivier, 370 Pebbles, 7
Nadaraya, 202, 222, 575, 587, 594, 599 Peligrad, 581, 599
Nagaev, 58–59, 64, 66, 75, 79–80, 88–92 Pentini, 387
Nair, 652 Perelson, 95–97, 104, 108–110, 112, 115
Nanda, 609–615, 622 Perkins, 34
Naokes, 9 Pervozvanskii, 512
Narayana, 759–760 Petrov, 63, 92, 409, 414
Nauk, 470 Pflug, 157, 184
Nauka, 370 Pham, 577, 599
Nelder, 186, 191, 199 Philipp, 581, 599
Nelson, 682, 732 Phillips, 496, 512
Nemirovski, 156, 184 Philips, 299
Neuman, 7, 381 Piatak, 96, 115
Neumann, 115 Pina, 131, 153
Neuts, 759–760 Pindyck, 283
Nevill, 721, 725 Pinelis, 10, 59, 71, 77, 79, 82, 92
Nevill-Manning, 733 Pinsky, 414
Neweihi, 614 Pintér, 414
Neyman, 222, 598 Pirsic, 470, 472–473
Ng, 381 Pisier, 247
Nguyen, 575, 599, 712, 733 Pitman, 92
Nicholson, 692 Platfoot, 330
Niederreiter, 430–431, 447, 466–473 Pledger, 623
Nishikawa, 334, 357 Plemmons, 469
Nishimura, 471 Poggi, 115
Nobel, 202, 222, 227, 236, 247 Pohst, 468
Noon, 648 Polyak, 155–157, 184
Noordhoff, 33 Porter, 285, 298–299
Nordin, 390, 411 Porteus, 735, 737–738, 749
Noren, 6 Prakasa, 561, 575
Notermans, 114 Pratt, 682
Noussair, 298 Price, 389, 415
Ohsumi, 334, 357 Priestley, 590, 599
Oja, 620, 623 Priouret, 51–52, 370, 381
Okuguchi, 249, 254, 268 Profizi, 115
Ollivier, 154 Prohorov, 75
Olson, 299, 329 Pronzato, 692
Onrstein, 235 Proschan, 612, 614–616, 618–621, 623
Orlando, 115 Protopopescu, 329
Ornstein, 247 Puccio, 329
Owen, 440, 456, 464, 466–467, 471–473 Pura, 470
Pan, 512 Puterman, 546
Panneton, 471 Qaqish, 199
Panossian, 334, 357 Quadrat, 510
Pantaleo, 114 Quirk, 286, 299
Pantula, 577 Rabinowitz, 468
Papalambros, 303, 329–330 Rajagopalan, 303
Papanicolaou, 22, 32 Rajagopolan, 330
Pareto, 62–63, 620, 747 Rajgopal, 329
Parisi, 411 Ranga, 90
Parker, 6 Rao, 63, 90, 190, 199, 561, 575, 599, 682, 690, 697
Parzen, 567, 599 Rappaport, 381
Pascal, 62 Raqab, 610, 623
Paskov, 473 Rastrigin, 413
Patterson, 468 Raton, 154
Pazman, 691 Rauschenberg, 755, 760
Raviv, 298 Sanchez, 659, 682
Rechenberg, 389–390, 415 Sanders, 371
Reid, 687 Sanseido, 605
Reidel, 416 Sarda, 597
Reiman, 15, 17, 34 Saridis, 334, 356–358
Rekasius, 334, 357 Sasaki, 387, 413
Reneke, 304, 307, 329–330 Sastry, 376, 381
Riccati, 351, 495, 497, 501–502, 504 Satchwell, 131, 133, 153
Rickard, 268 Savits, 614, 621
Riemann, 73 Schachtel, 131, 152–153
Rijn, 33 Schapire, 246
Rinnooy Kan, 411, 415 Schevon, 388
Rinnooy, 389 Schmetterer, 566, 602
Ripley, 375, 381 Schmid, 470, 473
Rishel, 479, 500, 511 Schoenfeld, 191, 199
Rissanen, 245, 247 Scholes, 270, 279, 282–283
Robbins, 37, 41, 50, 53–54, 59, 91–92, 155, 206, Schumer, 387, 415
222, 392, 414–415 Schuster, 7–8, 575, 602
Roberts, 381 Schwartz, 203, 218, 221, 371
Rockafellar, 158, 184 Schwefel, 385, 387, 389–390, 415
Rogers, 303, 330 Scott, 602
Rogozin, 62, 91 Sechen, 415
Rohatgi, 79, 92, 705 Secomandi, 641, 643, 647–648
Rojo, 617, 619–623 Seidman, 14, 33–34
Ronchetti, 687 Sen, 8
Rosenblatt, 567, 572, 576, 581, 599–600 Seneta, 512
Rosenbluth, 381, 414 Serfling, 581, 686, 688
Rosenthal, 381 Sethi, 512
Ross, 512, 546 Shaked, 608–615, 618, 620–621, 623
Rothschild, 38, 54 Shanthikumar, 608–609, 611–612, 618, 620, 623
Roussas, 11, 221, 561–571, 575, 577, 580–581, Shapiro, 157, 175, 178–179, 184, 712, 733
583–587, 589–590, 592, 600–603 Shayman, 546
Roy, 330 Sherwood, 1
Royden, 546 Shiue, 467, 473
Rozonoer, 206, 221 Shnidman, 690
Rubin, 381 Shorack, 415
Rubinstein, 157, 175, 183–184, 391, 395, 415 Shrader, 131, 153
Ruiz, 615, 621 Shubert, 386, 415
Runolfsson, 546 Shumway, 134, 136, 138–139, 153, 690
Russel, 268 Siegmund, 63–64, 92, 206, 222
Rustagi, 53, 623 Sijthoff, 33
Rutherford, 8–9, 686 Silverman, 575, 602
Ryabko, 733 Simon, 512
Rybko, 14, 34 Singer, 227, 229, 247
Ryzin, 206 Singh, 330, 621–622
Saag, 115 Sivan, 334
Sacks, 54, 222 Skorohod, 13, 15–16, 18, 21–23
Skorokhod, 335, 358, 688
Sagar, 6 Skvortsov, 469
Sage, 334, 357 Sloan, 426, 473
Said, 712, 733 Sloane, 467
Sakhanenko, 92 Slud, 190–191, 199
Saksman, 388, 413 Smirnov, 401, 441
Salas, 303, 330 Smith, 7, 690
Saleev, 466 Sniedovich, 735, 738, 740, 746–749
Salzburg, 447 Snyder, 389, 411
Samuelson, 330 Soldano, 154
Solis, 393, 415 Thompson, 512
Solo, 34, 374–379, 381 Tibshirani, 686
Soms, 566, 600 Timmer, 414
Spall, 54, 371, 685, 690, 695, 699–700, 702, 705, Tkachuk, 59, 92–93
707 Todorovic, 3, 9
Spanier, 468–469, 472–473
Speakes, 373 Tokuyama, 474
Spiegelman, 202, 207, 222 Tootill, 474
Spouge, 152 Tran, 10–11, 603
Spragins, 6 Traub, 473
Stackelberg, 269 Travers, 131, 141, 153–154, 726
Stamatelos, 566, 602 Trifonov, 130–131, 141, 153–154
Staskus, 114 Trudeau, 628–629, 646, 648–649
Steele, 205, 222 Truss, 153
Stegun, 128 Tse, 512
Steiglitz, 387, 415
Stein, 205, 221, 388–389, 468 Tsitsiklis, 32, 512
Steiner, 157, 172 Tsypkin, 157, 184, 247
Stevens, 5 Tuffin, 474
Stewart, 153, 630–631, 648 Tukey, 373, 682
Stoffer, 131, 133–134, 136–139, 148, 152–153, Tutz, 198
652, 682 Tyler, 153
Stolyar, 14, 34 Törn, 415
Stone, 202, 222 Uberbacher, 131, 154
Stout, 247 Ung, 411
Strang, 712, 733 Uosaki, 416
Su, 330 Valderrama, 760
Sueoka, 131, 153 Van Hamme, 633, 648
Sugiyama, 416 Van Laarhoven, 385, 388, 416
Sulzer, 115 Van Ryzin, 222
Sun, 690 Van Truong, 11
Sundaram, 38, 42, 52 Vande Vate, 32–33
Sundaresan, 269–270, 273–274, 281–282 Vanderbilt, 388, 416
Susmel, 511 Vanmarcke, 328, 330
Sussman, 130, 154 Varaiya, 38–39, 45, 52, 54, 512, 546
Suyematsu, 329 Vecchi, 53, 387
Szidarovszky, 2, 5–9, 249, 254, 259, 268, 382, 474 Venables, 690
Szilagyi, 8 Veraverbeke, 79, 85, 91
Söderström, 371 Verhulst, 364, 371
Taguchi, 651 Verlag, 34, 114, 184, 415, 595
Takahata, 581, 602 Vesterdahl, 383, 417
Tan, 95–96, 98–99, 105, 109–112, 114–115, 474 Viari, 133, 154
Tanner, 382 Vickrey, 294, 299
Tarasenko, 393, 415 Vieu, 597
Tasaka, 416 Villasenor, 620–621
Tashkent, 90 Vinogradov, 58, 79, 93
Taubner, 622 Volterra, 253, 268
Tausworthe, 438, 446, 474 Voorthuysen, 302, 330
Taylor, 139, 190, 380, 511 Vovk, 227, 248
Tchebycheff, 317 Vychisl, 466
Tekalp, 382 Vágó, 362, 370
Teller, 381 Vázquez-Abad, 474
Teneketzis, 46, 52 Wagner, 202, 207, 221, 575
Terwilliger, 153 Wald, 561, 603
Teugels, 83, 92 Waldrop, 298–299
Tezuka, 438, 467–468, 474 Walk, 157, 184, 203, 222, 371
Thomas, 547, 730, 732 Walker, 250, 267
Wallace, 302, 330 Wong, 199
Wallnau, 330 Wonham, 334, 336, 351, 358
Walrand, 38–39, 52, 54 Wright, 629, 647
Walsh, 455, 457–460, 464–465, 473, 651–653, 657, Wu, 115
659–661, 664, 667, 669, 678, 682 Xiang, 95–96, 105, 110–112, 114–115
Walter, 692 Xing, 431, 441, 447, 472
Wang, 334, 356, 358, 474 Yakowitz, 1, 4–11, 36, 43–44, 47–49, 51–54, 59,
Warmuth, 227, 229, 246–247 68, 92–93, 152, 198, 202, 222–223,
Wasan, 385, 416 247–248, 259, 268, 371, 381–384, 392, 394,
Watson, 222, 603 400–401, 412, 416–417, 422, 474, 476,
Watson-Gandy, 648 512–513, 517, 546–547, 555, 557, 574–575,
Web, 96 594–596, 598–599, 603–604, 685–686, 735,
Webb, 99, 114 737–738, 749, 751–752, 760
Wegman, 590, 603 Yamato, 575, 583, 604
Wei, 382, 617, 622 Yan, 302, 330
Weibull, 62–63 Yang, 34, 115, 227, 248, 510, 513, 561, 631–633,
Weiss, 33, 96, 99, 115, 153, 561, 603 636–637, 647, 649, 719, 721, 725, 729–733
Weissman, 96, 99, 114, 391, 395, 415 Yao, 609
Wellesley, 732–733 Yates, 651–652, 655, 657, 659, 661, 670, 672,
Wellner, 415 675–676, 678, 681–682
Wessen, 285, 298–299 Yatracos, 585, 602
Wets, 385, 393, 412 Ye, 96, 98–99, 109–110, 115
Wheeden, 219, 222 Yildizoglu, 298–299
Whisenant, 139 Yin, 34, 53, 511–514
White, 334, 512, 547 Ying, 44, 54
Whitney, 414 Yokoyama, 581, 604
Whittle, 44, 54, 547 Yoshihara, 581, 604–605
Widjaja, 759–760 Yu, 469
Wiener, 18, 22, 73, 270, 326–328, 335 Yudin, 156, 184
Wilcoxon, 416 Zakharov, 389
Wiley, 32–33, 53, 90, 128, 183, 198–199, 357, 410, Zaremba, 442–443, 471
415, 510–512, 596, 602 Zeevi, 245–247
Wilfling, 620, 623 Zeger, 198–199
Wilks, 693 Zeitouni, 58, 91, 511
William, 7 Zeller, 203, 223
Williams, 7, 15, 22–23, 33 Zhang, 32, 493, 498, 510–514
Wilson, 298–299, 466 Zhigljavsky, 385, 390, 417
Winston, 357, 512, 749 Zhiglyavskii, 412
Wishart, 690 Zhou, 331
Withers, 603 Zhurkin, 131, 154
Zinterhof, 472
Witten, 721, 725, 733 Zirilli, 411
Woeginger, 738, 749 Ziv, 732–733
Zolotarev, 85, 93
Wolfowitz, 50, 53, 402, 405, 414, 561, 603 Zupancic, 114
Zygmund, 219