
NATO ASI Series

Advanced Science Institutes Series


A series presenting the results of activities sponsored by the NATO Science Committee,
which aims at the dissemination of advanced scientific and technological knowledge,
with a view to strengthening links between scientific communities.
The Series is published by an international board of publishers in conjunction with the
NATO Scientific Affairs Division.
A Life Sciences
B Physics
    Plenum Publishing Corporation, London and New York

C Mathematical and Physical Sciences
D Behavioural and Social Sciences
E Applied Sciences
    Kluwer Academic Publishers, Dordrecht, Boston and London

F Computer and Systems Sciences
G Ecological Sciences
H Cell Biology
I Global Environmental Change
    Springer-Verlag, Berlin Heidelberg New York Barcelona Budapest Hong Kong London Milan Paris Santa Clara Singapore Tokyo

Partnership Sub-Series
1. Disarmament Technologies Kluwer Academic Publishers
2. Environment Springer-Verlag /
Kluwer Academic Publishers
3. High Technology Kluwer Academic Publishers
4. Science and Technology Policy Kluwer Academic Publishers
5. Computer Networking Kluwer Academic Publishers

The Partnership Sub-Series incorporates activities undertaken in collaboration with


NATO's Cooperation Partners, the countries of the CIS and Central and Eastern
Europe, in Priority Areas of concern to those countries.

NATO-PCO Database
The electronic index to the NATO ASI Series provides full bibliographical references
(with keywords and/or abstracts) to about 50000 contributions from international
scientists published in all sections of the NATO ASI Series. Access to the NATO-PCO
Database compiled by the NATO Publication Coordination Office is possible in two
ways:
- via online FILE 128 (NATO-PCO DATABASE) hosted by ESRIN, Via Galileo Galilei,
I-00044 Frascati, Italy.
- via CD-ROM "NATO Science & Technology Disk" with user-friendly retrieval
software in English, French and German ( WTV GmbH and DATA WARE
Technologies Inc. 1992).
The CD-ROM can be ordered through any member of the Board of Publishers or
through NATO-PCO, B-3090 Overijse, Belgium.

Series F: Computer and Systems Sciences, Vol. 154


Springer-Verlag Berlin Heidelberg GmbH
Reliability and Maintenance
of Complex Systems

Edited by

Süleyman Özekici
Department of Industrial Engineering
Boğaziçi University
80815 Bebek-İstanbul, Turkey

In cooperation with
Erhan Çınlar
Princeton University
Frank Van der Duyn Schouten
Tilburg University
Nozer D. Singpurwalla
The George Washington University
Jack P.C. Kleijnen
Tilburg University
Rommert Dekker
Erasmus University Rotterdam

Springer
Proceedings of the NATO Advanced Study Institute on Current Issues
and Challenges in the Reliability and Maintenance of Complex Systems,
held in Kemer-Antalya, Turkey, June 12-22, 1995.

Library of Congress Cataloging-in-Publication Data

Reliability and maintenance of complex systems / edited by Süleyman Özekici.
p. cm. -- (NATO ASI series. Series F, Computer and systems sciences; vol. 154)
"Published in cooperation with NATO Scientific Affairs Division."
"Proceedings of the NATO Advanced Study Institute on Current Issues and Challenges in the Reliability and Maintenance of Complex Systems, held in Kemer-Antalya, Turkey, June 12-22, 1995"--T.p. verso.
Includes bibliographical references.
ISBN 978-3-642-08250-4    ISBN 978-3-662-03274-9 (eBook)
DOI 10.1007/978-3-662-03274-9
1. Reliability (Engineering) 2. Maintenance. I. Özekici, S. (Süleyman) II. North Atlantic Treaty Organization. Scientific Affairs Division. III. NATO Advanced Study Institute on Reliability and Maintenance of Complex Systems (1995: Kemer Bucağı, Antalya İli, Turkey) IV. Series: NATO ASI series. Series F, Computer and systems sciences; no. 154.
TA169.R4364 1996
620'.00452--dc20    96-13936    CIP

CR Subject Classification (1991): J.1, J.6, G.3-4, D.2, H.4, K.6


ISBN 978-3-642-08250-4
This work is subject to copyright. All rights are reserved, whether the whole or part of the
material is concerned, specifically the rights of translation, reprinting, reuse of illustrations,
recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data
banks. Duplication of this publication or parts thereof is permitted only under the provisions
of the German Copyright Law of September 9, 1965, in its current version, and permission for
use must always be obtained from Springer-Verlag Berlin Heidelberg GmbH.
Violations are liable for prosecution under the German Copyright Law.
Springer-Verlag Berlin Heidelberg 1996
Originally published by Springer-Verlag Berlin Heidelberg New York in 1996
Softcover reprint of the hardcover 1st edition 1996

Typesetting: Camera-ready by editor


Printed on acid-free paper
SPIN: 10486193 45/3142 - 5 43210
Preface

Technological developments in recent decades prompted the design and production of devices with many components which are substantially more complex in structure than their earlier versions. Computers, telecommunication
devices, airplanes, and manufacturing robots are only a few examples. Such
high technology devices are generally quite expensive and critical in the func-
tioning of the system they are installed in; thus, it is of utmost importance
that they should be highly reliable in design and properly maintained to
achieve an extended economically useful lifetime.
These words reflect the main theme of the NATO Advanced Study Insti-
tute (ASI) on Current Issues and Challenges in the Reliability and
Maintenance of Complex Systems. The meeting was held in Kemer-
Antalya in Turkey during June 12-22, 1995, with the participation of 90
scientists from 21 countries. The objective of the ASI was to create a high-
level discussion and educational forum in order to disseminate the current
knowledge and state-of-the-art on the reliability and maintenance of complex
systems. This volume contains edited versions of the lecture notes as well
as additional contributions from some of the participants. It is intended to
serve as a graduate-level textbook as well as a reference book for scientists
and academicians whose research interests coincide with the theme of the
meeting.
The book consists of a carefully blended mixture of chapters organized in
five complementary parts that cover the most important aspects of reliabil-
ity and maintenance. The complete list of contributions is presented in the
table of contents. It covers a wide range of topics related with reliability and
maintenance, from the theoretical aspects of stochastic modelling to the very
practical issues in maintenance management.
Part I concentrates on stochastic models of reliability and main-
tenance. The main feature of any reliability and maintenance model is the
uncertainty involved in the failure mechanism. The model is therefore repre-
sented by stochastic processes which depict the deterioration, aging, or failure
of the system in time. Analysis of system reliability and determination of op-
timal maintenance policies are all based on these stochastic failure models.
This part provides an overview of such models and discusses some of the
recent research along this direction, including stochastic models of fatigue
crack growth, physics of failure, dynamic modelling of reliability systems, reliability analysis via corrections, and age-based failure modelling.
The main theme of Part II is the maintenance of complex systems.
Although the literature in reliability and maintenance is still dominated by
single component models, there is now growing interest in complex devices
with many components. The emphasis in this part is on decision and op-
timization models for such devices. An overview of the state of the art in
multi-component maintenance modelling is presented followed by a number
of chapters on current research. They focus on the use of random environ-
ments and the intrinsic aging concept in complex systems, a framework for
maintenance activities, the marginal cost approach, availability analysis and
optimization of monotone systems, determination of maintenance frequen-
cies, and burn-in design problems for heterogeneous populations.
Part III is on stochastic methods in software engineering. Until
quite recently, classical applications of reliability and maintenance has been
on hardware involving devices like machines, production lines, workstations,
computers, etc. With the rapid advances in information technologies, soft-
ware systems and their reliability have gained special attention in current
research on reliability and maintenance. This part provides overviews of soft-
ware reliability engineering and assessment, discusses the operational profile,
and addresses important issues in the analysis of software failure data as well
as decision methods regarding optimal testing of software.
Part IV concerns computational methods and simulation in relia-
bility and maintenance. The computation of the reliability or the deter-
mination of the optimal maintenance policy for a complex system is often
complicated by factors like dimensionality or the difficulty of applying the
proposed solution technique. Simulation is a powerful tool used in all disci-
plines that can be applied to complicated mathematical models, which defy
analytical solution, so that a computer is employed to find a solution for the
model. This technique is widely used in many fields like queueing, telecommu-
nication and computer science, but its application in reliability and mainte-
nance is often overlooked. In addition to basic topics in simulation, this part
discusses some analytical computational techniques and algorithms for com-
plex Markovian systems along with the use of efficient techniques for estimat-
ing, via simulation, transient measures of highly dependable non-Markovian
systems.
Finally, Part V ends our treatment with maintenance management
systems. These are the prime tools in a maintenance organization to sup-
port its management. They usually consist of an information system through
which the daily data on workorders, execution of maintenance, and failures
are collected. In this last part, the focus is on decision support systems for
maintenance optimization and their integration with maintenance manage-
ment systems. Apart from case studies of optimization, actual decision sup-
port systems, their development, models, user interfaces, and applications
are also discussed. Important elements of maintenance management systems are presented, followed by discussions of a decision support system for opportunistic preventive maintenance and optimization with the delay time model.
I was very fortunate to have many people helping me during the meeting
and in the editorial process of this volume. I would especially like to acknowl-
edge the contributions of the members of the organizing committee; Profes-
sors Erhan Çınlar from Princeton University, Frank Van der Duyn Schouten
and Jack P.C. Kleijnen from Tilburg University, Nozer D. Singpurwalla from
the George Washington University, and Rommert Dekker from Erasmus Uni-
versity Rotterdam. They provided their support enthusiastically in the orga-
nization and execution of the meeting by coordinating the five sessions on
the areas mentioned above, and by serving in the editorial board during the
review process.
The book consists of 28 chapters written by 31 authors, most of whom
also participated in the ASI. I thank them all for the superb job they have
done in preparing their manuscripts and for their professional attitude that
allowed the timely production of this volume. The comments and suggestions
of the ASI participants during the meeting were incorporated to improve the
initial set of lecture notes significantly.
The whole manuscript was prepared using LaTeX, which required a lot of
expertise and assistance in putting all of the files in the format requested
by Springer-Verlag. This was done elegantly by my doctoral student Aslı
Erdem. I appreciate her contribution and thank also the graduate students
at our Industrial Engineering Department who offered their help generously
in the preparation of the lecture notes and this volume.
Finally, I am grateful to the Scientific and Environmental Affairs Divi-
sion of NATO for their financial support; without it neither the ASI nor
this volume would have been possible. As co-sponsors, Boğaziçi University
Foundation and Research Fund also provided financial assistance.

İstanbul, March 1996
Süleyman Özekici
Table of Contents

Part I. Stochastic Models of Reliability and Maintenance

Stochastic Models of Reliability and Maintenance:


An Overview
Uwe Jensen .................................................... 3

Fatigue Crack Growth


Erhan Çınlar . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 37

Predictive Modeling for Fatigue Crack Propagation via


Linearizing Time Transformations
Panickos N. Palettas and Prem K. Goel. . . . . . . . . . . . . . . . . . . . . . . . . . .. 53

The Case for Probabilistic Physics of Failure


Max Mendel. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 70

Dynamic Modelling of Discrete Time Reliability Systems


Moshe Shaked, J. George Shanthikumar,
and Jose Benigno Valdez-Torres. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 83

Reliability Analysis via Corrections


Igor N. Kovalenko. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .. 97

Towards Rational Age-Based Failure Modelling


Menachem P. Berg .............................................. 107

Part II. Maintenance of Complex Systems

Maintenance Policies for Multicomponent Systems:


An Overview
Frank Van der Duyn Schouten .................................... 117

Complex Systems in Random Environments


Süleyman Özekici ............................................... 137

Optimal Replacement of Complex Devices


Süleyman Özekici ............................................... 158

A Framework for Single-Parameter Maintenance Activities


and Its Use in Optimisation, Priority Setting and Combining
Rommert Dekker ............................................... 170

Economics Oriented Maintenance Analysis and the


Marginal Cost Approach
Menachem P. Berg .............................................. 189

Availability Analysis of Monotone Systems


Terje Aven . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 206

Optimal Replacement of Monotone Repairable Systems


Terje Aven . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 224

How to Determine Maintenance Frequencies for


Multi-Component Systems? A General Approach
Rommert Dekker, Hans Frenk, and Ralph E. Wildeman ............. 239

A Probabilistic Model for Heterogeneous Populations and


Related Burn-in Design Problems
Fabio Spizzichino ............................................... 281

Part III. Stochastic Methods in Software Engineering

An Overview of Software Reliability Engineering


John D. Musa .................................................. 319

The Operational Profile


John D. Musa ................................................... 333

Assessing the Reliability of Software: An Overview


Nozer D. Singpurwalla and Refik Soyer ............................ 345

The Role of Decision Analysis in Software Engineering


Jason Merrick and Nozer D. Singpurwalla .......................... 368

Analysis of Software Failure Data


Refik Soyer. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 389

Part IV. Computational Methods and Simulation in


Reliability and Maintenance

Simulation: Runlength Selection and Variance Reduction


Techniques
Jack P.C. Kleijnen .............................................. 411

Simulation: Sensitivity Analysis and Optimization Through


Regression Analysis and Experimental Design
Jack P.C. Kleijnen .............................................. 429

Markov Dependability Models of Complex Systems:


Analysis Techniques
Jogesh K. Muppala, Manish Malhotra, and Kishor S. Trivedi ......... 442

Bounded Relative Error in Estimating Transient Measures of


Highly Dependable Non-Markovian Systems
Philip Heidelberger, Perwez Shahabuddin, and Victor F. Nicola ....... 487

Part V. Maintenance Management Systems

Maintenance Management System:


Structure, Interfaces and Implementation
Wim Groenendijk ............................................... 519

PROMPT, A Decision Support System for


Opportunity-Based Preventive Maintenance
Rommert Dekker and Cyp van Rijn ............................... 530

Maintenance Optimisation with the Delay Time Model


Rose Baker. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 550

List of Contributors .......................................... 589


Part I

Stochastic Models of Reliability and


Maintenance
Stochastic Models of Reliability and
Maintenance: An Overview
Uwe Jensen
Institute of Stochastics, University of Ulm, D-89069 Ulm, Germany

Summary. An overview of some mathematical models of reliability and maintenance is presented. Emphasis is laid on some more recent developments, which use tools from the theory of stochastic processes. This framework allows one to observe reliability systems on different information levels and to optimize maintenance actions.

Keywords. Complex systems in reliability, maintenance, failure rate process, information-based minimal repair, Markov modulated repair processes, optimal stopping

1. Introduction

Reliability theory has gained much interest in recent times. This becomes
evident if one realizes the number of publications in this field. Such numbers
are available in the MATH DATABASE of STN International, the Scien-
tific & Technical Information Network. This database is the online version
of Zentralblatt für Mathematik/Mathematics Abstracts and contains all entries in the Zentralblatt since 1972. The following table shows the number of
publications from 1972 up to and including 1994 for some keywords:

Table 1.1. Number of publications


Keyword                        Number of publications (1972-1994)
Reliability                    8352
Reliability and statistics     3769
Reliability and martingale     60
Maintenance or repair          1909

About 0.8% of all mathematical publications are related to reliability. This shows the importance of that field and at the same time the impossibility of giving a complete overview of the matter. For this reason an introduction to different types of models is given first. Then emphasis is laid
on some more recent developments with a personal view of the author, of
course. In Section 2 a general set-up is given which can be used to treat most
of the reliability models in a framework of stochastic processes including the
possibility of observing the stochastic development of reliability systems on
different information levels. In Sections 3 and 4 the tools presented before
will be used to combine some lifetime models with maintenance actions. In
classical reliability a device or system is considered which fails at an unpredictable (this term is made precise later) random age ζ > 0. This real random variable is assumed to have a distribution F, F(t) = P(ζ ≤ t), t ∈ ℝ, with a density f. Then the hazard or failure rate λ is defined on the support of the distribution by

    λ(t) = f(t) / F̄(t)

with the survival function F̄(t) = 1 − F(t). The (cumulative) hazard function is denoted by Λ,

    Λ(t) = ∫₀ᵗ λ(s) ds = −ln F̄(t).

The well-known relation

    F̄(t) = P(ζ > t) = exp{−Λ(t)}                                  (1.1)

establishes the link between the cumulative hazard and the survival function.
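These relations can be checked numerically; the sketch below (Weibull hazard with illustrative parameters, not taken from the text) integrates the hazard rate and verifies that the survival function equals exp{−Λ(t)} as in (1.1):

```python
import math

# Weibull example (hypothetical parameters k, theta):
# hazard lambda(t) = (k/theta)(t/theta)^(k-1), cumulative hazard Lambda(t) = (t/theta)^k.
k, theta = 2.0, 1.5

def hazard(t):
    return (k / theta) * (t / theta) ** (k - 1)

def cumulative_hazard(t, steps=10_000):
    # Trapezoidal integration of the hazard rate from 0 to t.
    h = t / steps
    vals = [hazard(i * h) for i in range(steps + 1)]
    return h * (sum(vals) - 0.5 * (vals[0] + vals[-1]))

t = 2.0
Lambda_num = cumulative_hazard(t)
survival = math.exp(-Lambda_num)   # relation (1.1): Fbar(t) = exp(-Lambda(t))
print(Lambda_num, (t / theta) ** k, survival)
```

For this Weibull case the closed-form cumulative hazard is (t/θ)^k, so the numerical integral and exp{−Λ(t)} can be compared directly against it.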
Modeling in reliability theory is mainly concerned with additional informa-
tion about the state of a system, which can be gathered during the operating
time of the system. This additional information leads to updating predic-
tions about proneness to system failure. There are a lot of ways to introduce
such additional information into the model. In a general setting Arjas (1993)
uses marked point processes to describe this flow of information in an in-
structive way. In the following some examples of how to introduce additional
information are given.

1.1 Complex Systems

A complex system is composed of n components with positive random lifetimes X_i, i = 1, 2, …, n, n ∈ ℕ. Let Φ : {0,1}ⁿ → {0,1} be the structure function of the system, which is assumed to be monotone and coherent. The possible states of the components and of the system, "intact" and "failed", are indicated by "0" and "1" respectively. Then Φ_t = Φ(I_{X₁≤t}, I_{X₂≤t}, …, I_{Xₙ≤t}) describes the state of the system at time t, where I_{Xᵢ≤t} denotes the indicator function

    I_{Xᵢ≤t} = 1 if Xᵢ ≤ t, and 0 if Xᵢ > t,

which is 0 if component i is intact at time t and otherwise 1. The lifetime ζ of the system is then given by ζ = inf{t ∈ ℝ₊ : Φ_t = 1}. As a simple example the following system with three components is considered, which is intact if component 1 and at least one of the components 2 or 3 are intact. In this example

    Φ_t = 1 − (1 − I_{X₂≤t} I_{X₃≤t})(1 − I_{X₁≤t})

is easily obtained with ζ = inf{t ∈ ℝ₊ : Φ_t = 1} = X₁ ∧ (X₂ ∨ X₃), where as usual a ∧ b and a ∨ b denote min(a, b) and max(a, b) respectively. The additional information about the lifetime ζ is given by the observation of the state of
information about the lifetime ( is given by the observation of the state of
the single components.

Fig. 1.1. System with three components

As long as all components are intact only a failure of
component 1 leads to system failure. If one of the components 2 or 3 fails
first then the next component failure is a system failure.
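The three-component example can be checked numerically. The sketch below (with hypothetical exponential component lifetimes) scans the structure function over the component failure times and confirms that the resulting system lifetime equals X₁ ∧ (X₂ ∨ X₃):

```python
import random

def phi(x1_failed, x2_failed, x3_failed):
    # Structure function of Fig. 1.1 in the failure-state convention used here:
    # state 1 (failed) iff component 1 has failed, or both components 2 and 3 have.
    return 1 - (1 - x2_failed * x3_failed) * (1 - x1_failed)

def system_lifetime(x1, x2, x3):
    # zeta = inf{t : phi_t = 1}; with finitely many components it suffices
    # to scan the component failure times in increasing order.
    for t in sorted([x1, x2, x3]):
        if phi(int(x1 <= t), int(x2 <= t), int(x3 <= t)) == 1:
            return t

random.seed(1)
for _ in range(1000):
    x1, x2, x3 = (random.expovariate(1.0) for _ in range(3))
    assert system_lifetime(x1, x2, x3) == min(x1, max(x2, x3))
print("structure-function lifetime matches X1 ^ (X2 v X3)")
```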
Under the classical assumption that all components work independently,
i.e., the random variables Xi, i = 1, ... , n are independent, the investigations
concentrate on the following problems:
- Determining the system lifetime distribution from the known component
lifetime distributions or finding at least bounds for this distribution.
- Are certain properties of the component lifetime distributions like IFR
(increasing failure rate: λ(t) increasing) or IFRA (increasing failure rate
average: (1/t)Λ(t) increasing) preserved by forming monotone systems? One of these closure
theorems states for example that the distribution of the system lifetime is
IFRA if all component lifetimes have IFRA distributions.
- In what way does a certain component contribute to the function of the
whole system? The answer to this question leads to the definition of several
importance measures. A short survey of the importance of components in
a monotone coherent system has been given by Natvig (1988).
A basic reference for monotone coherent systems is still the book of Barlow
and Proschan (1975). More recent related publications, which contain a lot of
generalizations, are Aven (1992) and Shaked and Shanthikumar (1990). For
the state of the art the contributions of Aven (1996a, 1996b), Van der Duyn
Schouten (1996) and Özekici (1996a, 1996b) to this volume are referred to.

1.2 Damage Threshold Models

Additional information about the lifetime ζ can also be introduced into the model in a quite different way. If the state or damage of the system at time t ∈ ℝ₊ can be observed and this damage is described by a random variable X_t then the lifetime of the system

    ζ = inf{t ∈ ℝ₊ : X_t ≥ S}

can be defined as the first time the damage hits a given level S. Here S can be a constant or, more generally, a random variable independent of the damage process. Some examples of damage processes X = (X_t) of this kind are the following.
Wiener Process. The damage process is a Wiener process with positive
drift starting at 0 and the failure threshold S is a positive constant. Then
the lifetime of the system is known to have an inverse Gaussian distribution.
Models of this kind are especially of interest if one considers different envi-
ronmental conditions under which the system is working, as for example in
so-called burn-in models. An accelerated aging caused by additional stress or
different environmental conditions can be described by a change of time. Let τ : ℝ₊ → ℝ₊ be an increasing function; then Z_t := X_{τ(t)} denotes the actual observed damage. The time transformation τ drives the speed of the deterioration. Following Doksum (1991), one possible way to express different stress levels β_i in time intervals [t_i, t_{i+1}), 0 = t_0 < t_1 < … < t_k, i = 0, 1, …, k−1, k ∈ ℕ, is the choice

    τ(t) = Σ_{j=0}^{i−1} β_j (t_{j+1} − t_j) + β_i (t − t_i),    t ∈ [t_i, t_{i+1}).

In this case it is easily seen that if F₀ is the inverse Gauss distribution function of ζ = inf{t ∈ ℝ₊ : X_t ≥ S}, and F is the distribution function of the lifetime ζ_a = inf{t ∈ ℝ₊ : Z_t ≥ S} under accelerated aging, then F(t) = F₀(τ(t)).
Some further references on accelerated aging can be found in Doksum (1991).
The failure time distribution for damage processes with more general failure
thresholds is investigated by Domine (1996), among others. A generalization
in another direction is to consider a random time change which means that r
is a stochastic process. By this, randomly varying environmental conditions
can be modeled. This idea has been developed by Çınlar (1984) for semi-Markov processes and further by Çınlar and Özekici (1987) and by Çınlar et al. (1989).
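The piecewise-linear time transformation above can be sketched directly; the breakpoints t_i and stress levels β_i below are illustrative assumptions, not values from the text:

```python
import bisect

# Step-stress time transformation tau: stress level betas[i] applies on
# [breakpoints[i], breakpoints[i+1]); values are purely illustrative.
breakpoints = [0.0, 1.0, 2.5, 4.0]   # t_0 < t_1 < t_2 < t_3
betas = [1.0, 2.0, 0.5]              # beta_i on [t_i, t_{i+1})

def tau(t):
    # tau(t) = sum_{j < i} beta_j (t_{j+1} - t_j) + beta_i (t - t_i)
    # for t in [t_i, t_{i+1}); beyond the last breakpoint the last level is kept.
    i = min(bisect.bisect_right(breakpoints, t) - 1, len(betas) - 1)
    accumulated = sum(b * (breakpoints[j + 1] - breakpoints[j])
                      for j, b in enumerate(betas[:i]))
    return accumulated + betas[i] * (t - breakpoints[i])

print(tau(0.5), tau(1.0), tau(3.0))
```

Since each β_i > 0, τ is continuous and strictly increasing, as required of a time transformation; F(t) = F₀(τ(t)) then follows by plugging τ(t) into the baseline distribution.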
Compound Point Processes. Processes of this kind describe so-called
shock processes where the system is subjected to shocks which occur from
time to time and add a random amount to the damage. The successive
times of occurrence of shocks, T_n, are given by an increasing sequence 0 < T₁ ≤ T₂ ≤ … of random variables where the inequality is strict unless T_n = ∞. Each time point T_n is associated with a real-valued random mark M_n which describes the additional damage caused by the n-th shock. The marked point process is denoted (T, M) = (T_n, M_n)_{n∈ℕ}. From this marked
point process the corresponding compound point process X with

    X_t = Σ_{n=1}^∞ I_{T_n ≤ t} M_n
is derived which describes the accumulated damage up to time t. The simplest
example is a compound Poisson process in which the shock arrival process
is Poisson and the shock amounts (Mn) are i.i.d. random variables. The
lifetime ζ, as before, is the first time the damage process (X_t) hits the level
S. Assuming that S is a random failure level describes the situation in which
the damage process does not carry complete information about the state of
the system.
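A minimal simulation of such a compound Poisson shock process (all parameters hypothetical) can be sketched as follows; for Poisson arrivals with rate r and i.i.d. marks, the mean accumulated damage is E[X_t] = r · t · E[M₁]:

```python
import random

random.seed(42)

def damage_at(t, rate, mean_mark):
    # X_t = sum_n 1{T_n <= t} M_n: Poisson(rate) shock arrivals,
    # exponentially distributed marks with the given mean (illustrative choices).
    x, arrival = 0.0, random.expovariate(rate)
    while arrival <= t:
        x += random.expovariate(1.0 / mean_mark)   # add the damage of this shock
        arrival += random.expovariate(rate)        # next inter-arrival time
    return x

rate, mean_mark, t = 2.0, 1.0, 10.0
samples = [damage_at(t, rate, mean_mark) for _ in range(2000)]
estimate = sum(samples) / len(samples)
print(estimate)  # should be close to rate * t * mean_mark = 20
```

The lifetime ζ would then be obtained by tracking the first shock time at which the accumulated damage reaches the level S.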

1.3 Maintenance

In the last two subsections various ways of modeling the lifetime of a technical
system by introducing additional information were described. In addition to
such models it is often useful to take maintenance actions into account to
prolong the lifetime, to increase the availability and to reduce the probability
of an unpredictable failure. The most important maintenance actions include:
- Preventive replacements of parts of the system or of the whole system
- Providing spare parts
- Providing repair facilities
- Inspections to check the state of (parts of) the system if not observed
continuously
Taking maintenance actions into account leads, depending on the specific
model, to one of the following problem fields.
Availability Analysis. If the system or parts of it are repaired or replaced
when failures occur the problem is to characterize the performance of the
system. Different measures of performance can be defined as for example
- The intact probability at a certain time point or in a given time interval
- The mean time to first failure of the system
- The probability distribution of the downtime of the system in a given time
interval.
Of course, a lot of other measures and generalizations of the above ones
have been investigated. An overview of different performance measures for
monotone systems is given by Aven (1996a) in his contribution to this volume.
Optimization Models. If a valuation structure is given, i.e. costs of re-
placements, repairs, downtime ... and gains, then one is naturally led to the
problem of planning the maintenance action so as to minimize (maximize) a
given cost (gain) criterion. Examples of such criteria are expected costs per
unit time or total expected discounted costs. Surveys of these models can be
found in the review articles mentioned below.
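As an illustration of the "expected costs per unit time" criterion, the sketch below evaluates the classical age-replacement cost rate g(T) = (c_p F̄(T) + c_f F(T)) / ∫₀ᵀ F̄(u) du, a standard textbook example not spelled out in the text; the costs and the Weibull lifetime parameters are hypothetical:

```python
import math

# Age replacement: replace preventively at age T (cost c_p) or at failure
# (cost c_f > c_p); lifetimes are Weibull with illustrative parameters.
c_p, c_f = 1.0, 10.0
k, theta = 2.0, 1.0

def survival(t):
    return math.exp(-((t / theta) ** k))

def cost_rate(T, steps=2000):
    # g(T) = (c_p * Fbar(T) + c_f * F(T)) / integral_0^T Fbar(u) du
    h = T / steps
    integral = h * (sum(survival(i * h) for i in range(1, steps))
                    + 0.5 * (survival(0.0) + survival(T)))
    return (c_p * survival(T) + c_f * (1.0 - survival(T))) / integral

grid = [0.05 * i for i in range(1, 101)]   # candidate ages T in (0, 5]
T_opt = min(grid, key=cost_rate)
print(T_opt, cost_rate(T_opt))
```

With an increasing failure rate (k > 1) and c_f > c_p the minimizing age is interior: replacing very early wastes preventive cost, replacing very late pays the failure cost almost surely.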
One can imagine that thousands of models (and papers) can be created
in combining the different types of lifetime models with different mainte-
nance actions. Instead of providing a long and, inevitably, almost certainly
incomplete list of references, some of the surveys and review articles will be
mentioned in the following. Besides a number of Operations Research journals,
the IEEE Transactions on Reliability has concentrated on reliability models,
including statistics of reliability. From time to time, the Naval Research Lo-
gistics Quarterly journal publishes survey articles in this field, among them
the renowned article by Pierskalla and Voelker (1976) which appeared with
259 references, updated by Sherif and Smith (1981) with an extensive bib-
liography of 524 references, followed by Valdez-Flores and Feldman (1989)
with 129 references. The review by Bergman (1985) reflects the author's ex-
perience in industry and emphasizes the usefulness of reliability methods
in applications. Gertsbakh (1984) reviews asymptotic methods in reliability
and especially investigates under what conditions the lifetime of a complex
system with many components is approximately exponentially distributed.
The survey of Arjas (1989) considers reliability models using more advanced mathematical tools such as marked point processes and martingales. A key role in
this survey is played by an information-based hazard concept. This concept
will be used and described in some variations in what follows.

1.4 Different Information Levels

In Sections 1.1 and 1.2 it was pointed out in what way additional information
can lead to a reliability model. But it is also important to note that in one
and the same model different observation levels are possible, i.e. the amount
of actual available information about the state of a system may vary. So for
example in optimization models the optimal strategy will strongly depend on
the available information. The following two examples will show the effect of
different degrees of information.
Simpson's paradox. This paradox says that if one compares the death
rates in two countries, say A and B, then it is possible that the crude overall
death rate in country A is higher than in B although all age-specific death
rates in B are higher than in A. This can be transferred to reliability in the
following sense. Considering a two-component parallel system, the failure rate
of the system lifetime may increase although the component lifetimes have
decreasing failure rates. The following proposition, which can be proved by
some elementary calculations, yields an example of this.

Proposition 1.1. Let ζ = X₁ ∨ X₂ with i.i.d. random variables Xᵢ, i = 1, 2, following the common distribution F,

    F(t) = 1 − e^{−u(t)},  t ≥ 0,  u(t) = ct + (α/β)(1 − e^{−βt}),  α, β, c > 0.

If ln(c²/(2αβ)) > α/β and c < β then the failure rate λ of the lifetime ζ increases, whereas the component lifetimes Xᵢ have decreasing failure rates.
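The dichotomy of the proposition can be observed numerically. The parameters below are illustrative: they satisfy c < β, and the two monotonicity properties are checked on a grid rather than derived from the proposition's analytic condition:

```python
import math

# Illustrative parameters (not claimed to satisfy the proposition's sufficient
# condition); monotonicity is verified pointwise on a grid.
alpha, beta, c = 0.5, 2.0, 1.0

def u(t):
    return c * t + (alpha / beta) * (1.0 - math.exp(-beta * t))

def component_hazard(t):
    # u'(t) = c + alpha * e^{-beta t} is strictly decreasing: each X_i is DFR.
    return c + alpha * math.exp(-beta * t)

def system_hazard(t):
    # For zeta = X1 v X2 with i.i.d. components, F_sys = F^2, hence
    # lambda_sys(t) = 2 lambda(t) F(t) / (1 + F(t)).
    F = 1.0 - math.exp(-u(t))
    return 2.0 * component_hazard(t) * F / (1.0 + F)

grid = [0.05 * i for i in range(1, 121)]
comp = [component_hazard(t) for t in grid]
sys_ = [system_hazard(t) for t in grid]
assert all(a > b for a, b in zip(comp, comp[1:]))   # component hazard decreases
assert all(a < b for a, b in zip(sys_, sys_[1:]))   # system hazard increases
print("component lifetimes DFR, system lifetime IFR on the grid")
```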

This example shows that it makes a great difference whether only the
system lifetime can be observed (aging property: IFR) or additional infor-
mation about the component lifetimes is available (aging property: DFR). In
addition one may also notice that the aging property of the system lifetime
of a complex system does not only depend on the joint distribution of the
component lifetimes but of course also on the structure function. Consider,
instead of a two-component parallel system, a series system where the com-
ponent lifetimes have the same distributions as in the proposition. Then the
failure rate of $\zeta_{\mathrm{ser}} = X_1 \wedge X_2$ decreases, whereas $\zeta_{\mathrm{par}} = X_1 \vee X_2$ has an
increasing failure rate.
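This effect can be checked numerically. The following sketch (the parameter values $\alpha = 0.05$, $\beta = 1$, $c = 0.5$ are our illustrative choices, not taken from the proposition) computes the component failure rate $u'(t) = c + \alpha e^{-\beta t}$ and the resulting parallel- and series-system rates on a grid:

```python
import numpy as np

alpha, beta, c = 0.05, 1.0, 0.5   # illustrative values with c < beta and alpha/beta small
t = np.linspace(0.0, 10.0, 2001)

u = c * t + (alpha / beta) * (1 - np.exp(-beta * t))  # cumulative hazard of one component
du = c + alpha * np.exp(-beta * t)                    # component failure rate u'(t)
F = 1 - np.exp(-u)                                    # component distribution function

# hazard of the parallel system X1 v X2: 2 f F / (1 - F^2) = 2 u' F / (1 + F)
lam_par = 2 * du * F / (1 + F)
# hazard of the series system X1 ^ X2 is simply 2 u'(t)
lam_ser = 2 * du

print(np.all(np.diff(du) < 0))        # component rates decrease (DFR)
print(np.all(np.diff(lam_par) >= 0))  # parallel-system rate increases (IFR)
print(np.all(np.diff(lam_ser) < 0))   # series-system rate decreases (DFR)
```

The series case confirms that the aging property of the system depends on the structure function, not only on the component distributions.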
Predictable Lifetime. The Wiener process $X = (X_t)_{t\in\mathbb{R}_+}$ with positive
drift $\mu$ and variance scaling parameter $\sigma$ serves, as mentioned before, as a
popular damage threshold model. $X$ can be represented as $X_t = \sigma B_t + \mu t$,
where $B$ is standard Brownian motion. If one assumes that the failure level $S$
is a fixed known constant, then the lifetime $\zeta = \inf\{t \in \mathbb{R}_+ : X_t \ge S\}$ follows
an inverse Gaussian distribution with a finite mean $E\zeta = S/\mu$. One criticism
of this model, as Doksum (1991) mentions, is that the paths of $X$ are not
increasing. As a partial answer Doksum states that maintenance actions also
lead to improvements and thus $X$ could be decreasing at some time points.
A more severe criticism from the point of view of the available information is
the following. It is often assumed that in this model the paths of the damage
process can be observed continuously. But this would make the lifetime $\zeta$
a predictable random time (a precise definition follows in the next section),
i.e. there is an increasing sequence $T_n$, $n \in \mathbb{N}$, which announces the failure.
In this model one could choose $T_n = \inf\{t \in \mathbb{R}_+ : X_t \ge S - \frac{1}{n}\}$, take
$n$ large enough, and stop operating the system at $T_n$ "just" before failure
to carry out some preventive maintenance. This does not usually apply in
practical situations. This example shows that one has to distinguish carefully
between the different information levels for the model formulation (complete
information) and for the actual observation (partial information).
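The announcing sequence can be illustrated by simulating discretized paths of the damage process (a sketch; the drift, volatility, failure level and the index $n$ are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)
mu, sigma, S, n = 1.0, 0.5, 5.0, 10   # illustrative drift, volatility, failure level, index n
dt, steps = 1e-3, 40000

for _ in range(5):
    # discretized path of X_t = sigma * B_t + mu * t
    increments = mu * dt + sigma * np.sqrt(dt) * rng.standard_normal(steps)
    X = np.concatenate(([0.0], np.cumsum(increments)))
    T_n = dt * np.argmax(X >= S - 1.0 / n)   # first crossing of S - 1/n (announcing time)
    zeta = dt * np.argmax(X >= S)            # first crossing of S (failure time)
    assert T_n <= zeta                       # the announcing time precedes failure on every path
    print(round(T_n, 3), round(zeta, 3))
```

On every path the lower level $S - 1/n$ is hit before $S$, so continuous observation would always leave a window for preventive action just before failure.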

2. A General Model in Reliability


A general set-up should include all basic lifetime models, should take into
account the time-dynamic development and should allow for different infor-
mation and observation levels. Thus one is led in a natural way to the theory
of stochastic processes in the spirit of Arjas (1989, 1993) and Koch (1986).
It should be stressed that the aim of this contribution is rather to present
ideas than to give an excursion into the theory of stochastic processes. So
the mathematical technicalities are kept (almost) to a minimum, details of
the mathematical basis are provided in references such as Bremaud (1981)
or Rogers and Williams (1994).
Let $(\Omega, \mathcal{F}, P)$ be the basic probability space. The information up to time
$t$ is measured by the pre-$t$-history $\mathcal{F}_t$, which contains all events of $\mathcal{F}$ that
10 Uwe Jensen

can be distinguished up to and including time $t$. The filtration $\mathbb{F} = (\mathcal{F}_t)_{t\in\mathbb{R}_+}$,
which is the family of increasing pre-$t$-histories, is assumed to satisfy the usual
conditions of completeness and right continuity. In most cases the information,
and by that the filtration, is determined by a stochastic process. But since
it is sometimes desirable to observe one and the same stochastic process
on different information levels, it seems more convenient to use filtrations
as measures of information. In addition a stochastic process $Z = (Z_t)_{t\in\mathbb{R}_+}$ is
considered which is adapted to the filtration $\mathbb{F}$, i.e. on the $\mathbb{F}$-information level
the process can be observed, or in mathematical terms: $Z_t$ is $\mathcal{F}_t$-measurable. A
random variable $T$ with values in $\mathbb{R}_+ \cup \{\infty\}$ is called an $\mathbb{F}$-stopping time if $\{T \le t\} \in \mathcal{F}_t$ for all $t \in \mathbb{R}_+$. Thus a stopping time is related to the given information
in that at any time $t$ one can decide whether $T$ has already occurred or not,
using only information of the past and present but not anticipating the future.
A key role is played by a semimartingale representation of the process Z.
This is a decomposition into a drift or regression part and an additive random
fluctuation described by a martingale:

$$Z_t = Z_0 + \int_0^t f_s\,ds + M_t, \qquad (2.1)$$

where $f = (f_t)_{t\in\mathbb{R}_+}$ is a stochastic process with $E\big(\int_0^t |f_s|\,ds\big) < \infty$ for all
$t \in \mathbb{R}_+$, $E|Z_0| < \infty$, and $M = (M_t)_{t\in\mathbb{R}_+}$ is a martingale which starts in 0:
$M_0 = 0$. A martingale is the mathematical model of a fair game with constant
expectation function $EM_0 = 0 = EM_t$ for all $t \in \mathbb{R}_+$. Since the drift part
in the above decomposition is continuous, a process $Z$ which admits such a
representation is called a smooth semimartingale, or smooth $\mathbb{F}$-semimartingale
if one wants to emphasize that $Z$ is adapted to the filtration $\mathbb{F}$. For details
and basic results concerning smooth semimartingales see Jensen (1989).
First let us consider the simple indicator process $Z_t = I_{\{\zeta \le t\}}$, where $\zeta$
is the lifetime random variable defined on the basic probability space. The
paths of this indicator process are constant, except for one jump from 0 to
1 at $\zeta$. The general model now simply consists of the assumption that this
indicator process has a smooth $\mathbb{F}$-semimartingale representation:

$$I_{\{\zeta \le t\}} = \int_0^t I_{\{\zeta > s\}}\lambda_s\,ds + M_t, \quad t \in \mathbb{R}_+. \qquad (2.2)$$

The process $\lambda = (\lambda_t)_{t\in\mathbb{R}_+}$ is called failure rate or hazard rate process, and the
compensator $\Lambda_t = \int_0^t I_{\{\zeta > s\}}\lambda_s\,ds$ is called hazard process. Before investigating
under what conditions such a representation exists, some examples are given.
Example 2.1. If the failure rate process $\lambda$ is deterministic, then forming
expectations leads to the integral equation

$$F(t) = P(\zeta \le t) = EI_{\{\zeta \le t\}} = \int_0^t P(\zeta > s)\lambda_s\,ds = \int_0^t (1 - F(s))\lambda_s\,ds.$$

The unique solution $\bar{F}(t) = \exp\{-\int_0^t \lambda_s\,ds\}$ is just equation (1.1). This shows
that if the hazard rate process $\lambda$ is deterministic, then it coincides with the
standard failure rate.
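The integral equation and its exponential solution can be verified numerically; a sketch with an arbitrary deterministic rate $\lambda_s = s$ (i.e. a Weibull-type lifetime):

```python
import numpy as np

t = np.linspace(0.0, 3.0, 3001)
dt = t[1] - t[0]
lam = t                                  # deterministic failure rate lambda_s = s
# cumulative hazard by the trapezoidal rule
Lam = np.concatenate(([0.0], np.cumsum((lam[1:] + lam[:-1]) / 2 * dt)))
F = 1 - np.exp(-Lam)                     # candidate solution F(t) = 1 - exp(-int_0^t lambda ds)

# check the integral equation F(t) = int_0^t (1 - F(s)) lambda_s ds on the grid
integrand = (1 - F) * lam
rhs = np.concatenate(([0.0], np.cumsum((integrand[1:] + integrand[:-1]) / 2 * dt)))
print(np.max(np.abs(F - rhs)))           # discretization error only
```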
Example 2.2. In continuation of the example of a 3-component complex
system in Section 1.1 it is assumed that the component lifetimes $X_1, X_2, X_3$
are i.i.d. exponentially distributed with parameter $\alpha > 0$. What is the failure
rate process corresponding to the lifetime $\zeta = X_1 \wedge (X_2 \vee X_3)$? This depends
on the information level, i.e. the filtration $\mathbb{F}$.
- $\mathcal{F}_t = \sigma(I_{\{X_1 \le s\}}, I_{\{X_2 \le s\}}, I_{\{X_3 \le s\}},\ 0 \le s \le t)$. Observing on the component
level means that $\mathcal{F}_t$ is generated by the indicator processes of the component
lifetimes up to time $t$. It can be shown that the failure rate process of
the system lifetime is given by $\lambda_t = \alpha(1 + I_{\{X_2 \le t\}} + I_{\{X_3 \le t\}})$. As long as
all components work, the rate is $\alpha$ due to component 1. When one of the
two parallel components 2 or 3 fails first, the rate switches to $2\alpha$.
- $\mathcal{F}_t = \sigma(I_{\{\zeta \le s\}},\ 0 \le s \le t)$. If only the system lifetime can be observed, then
the failure rate process diminishes to the ordinary deterministic failure rate

$$\lambda_t = \alpha\left(1 + \frac{2(1 - e^{-\alpha t})}{2 - e^{-\alpha t}}\right).$$
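A small Monte Carlo check of the black-box rate (a sketch; $\alpha = 1$, the evaluation point and the window width are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(2)
alpha, t, h, n = 1.0, 1.0, 0.05, 400000
X = rng.exponential(1 / alpha, size=(n, 3))
zeta = np.minimum(X[:, 0], np.maximum(X[:, 1], X[:, 2]))  # system lifetime X1 ^ (X2 v X3)

alive = zeta > t
# empirical hazard: P(t < zeta <= t + h | zeta > t) / h
est = np.mean(zeta[alive] <= t + h) / h
exact = alpha * (1 + 2 * (1 - np.exp(-alpha * t)) / (2 - np.exp(-alpha * t)))
print(round(est, 3), round(exact, 3))   # the two values should roughly agree
```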
Example 2.3. Consider the damage threshold model in which the deterioration
was described by the Wiener process $X_t = \sigma B_t + \mu t$, where $B$ is
standard Brownian motion. In this case, whether and in what way the lifetime
$\zeta = \inf\{t \in \mathbb{R}_+ : X_t \ge S\}$, $S \in \mathbb{R}_+$, can be characterized by a failure
rate process also depends on the available information.
- $\mathcal{F}_t = \sigma(B_s,\ 0 \le s \le t)$. In this case, observing the actual state of the system
proves to be too informative to be described by a failure rate process. The
martingale part is identically 0; the drift part is the indicator process $I_{\{\zeta \le t\}}$
itself. No semimartingale representation (2.2) exists, because the lifetime is
predictable as mentioned in Section 1.4.
- $\mathcal{F}_t = \sigma(I_{\{\zeta \le s\}},\ 0 \le s \le t)$. If only the system lifetime can be observed, conditions
change completely. A representation (2.2) exists with the ordinary
failure rate of the inverse Gaussian distribution.

2.1 Existence of Failure Rate Processes

It is possible to formulate rather general conditions on $Z$ to ensure a semimartingale
representation (2.1) (see Jensen 1989). But in the reliability model
one has a more specific process $V_t = I_{\{\zeta \le t\}}$ for which a representation (2.2)
has to be found. Whether such a representation exists should depend on the
random variable $\zeta$ (or on the probability measure $P$ respectively) and on the
filtration $\mathbb{F}$. If $\zeta$ is a stopping time with respect to the filtration $\mathbb{F}$, then a representation
(2.2) only exists for stopping times which are totally inaccessible
in the following sense:

Definition. An $\mathbb{F}$-stopping time $T$ is called
- predictable if an increasing sequence $(T_n)_{n\in\mathbb{N}}$ of $\mathbb{F}$-stopping times $T_n < T$
exists such that $\lim_{n\to\infty} T_n = T$;
- totally inaccessible if $P(T = \sigma < \infty) = 0$ for all predictable $\mathbb{F}$-stopping
times $\sigma$.

It can be shown that if $V$ has a smooth semimartingale representation (2.2),
then $\zeta$ is a totally inaccessible stopping time. On the other hand, if $\zeta$ is totally
inaccessible, then there is a (unique) decomposition $V = \Lambda + M$ in
which the process $\Lambda$ is continuous. So the class of lifetime models with an
absolutely continuous compensator $\Lambda$, $\Lambda_t = \int_0^t I_{\{\zeta > s\}}\lambda_s\,ds$, is rich enough to
include most relevant systems in reliability theory. In view of Example 2.3
the condition that $V$ admits such a representation seems a natural restriction,
because if the lifetime could be predicted by an announcing sequence of
stopping times, maintenance actions would make no sense. In addition, Example
2.3 also shows that one and the same random variable $\zeta$ can be predictable
or totally inaccessible depending on the corresponding information filtration.
How can one ascertain the failure rate process $\lambda$ for a given information
level $\mathbb{F}$? In general one can determine $\lambda$ under some technical conditions as
the limit

$$I_{\{\zeta > t\}}\lambda_t = \lim_{h\to 0+} \frac{1}{h} P(t < \zeta \le t + h \mid \mathcal{F}_t)$$

in the sense of almost sure convergence; see Jensen (1989), Theorem 3.4. In
some special cases $\lambda$ can be represented in a more explicit form, as for example
for complex systems. As in Section 1.1 let $X_i$, $i = 1, \ldots, n$, be $n$ random variables
which describe the component lifetimes of a monotone complex system
with structure function $\phi$. For simplicity it is assumed that $P(X_i = X_j) = 0$
for $i \ne j$ and that each $X_i$ has an ordinary failure rate $\lambda_t(i)$. Note that no
independence assumption was made. To derive the failure rate process on
the component observation level $\mathbb{F}$, $\mathcal{F}_t = \sigma(I_{\{X_1 \le s\}}, \ldots, I_{\{X_n \le s\}},\ 0 \le s \le t)$,
Theorem 4.1 in Arjas (1981b) can be used to yield

$$I_{\{\zeta > t\}}\lambda_t = \sum_{i \in \Gamma_\phi(t)} I_{\{X_i > t\}}\lambda_t(i), \qquad (2.3)$$

where $\Gamma_\phi(t)$ is the set of critical components at time $t$, the failure of which
would immediately result in a system failure, i.e. $i \in \Gamma_\phi(t)$ if and only if

$$\phi(I_{\{X_1 \le t\}}, \ldots, I_{\{X_{i-1} \le t\}}, 1, I_{\{X_{i+1} \le t\}}, \ldots, I_{\{X_n \le t\}}) - \phi(\ldots, 0, \ldots) = 1.$$

Example 2.4 (Continuation of Example 2.2). If at time $t$ all three components
work, then only component 1 belongs to $\Gamma_\phi(t)$ and $I_{\{\zeta > t\}}\lambda_t = \alpha I_{\{X_1 > t\}}$
on $\{X_2 > t, X_3 > t\}$. If one of the components 2 or 3 has failed first before
time $t$, say component 2, then $\Gamma_\phi(t) = \{1, 3\}$ and $I_{\{\zeta > t\}}\lambda_t = \alpha(I_{\{X_1 > t\}} +
I_{\{X_3 > t\}})$ on $\{X_2 \le t\}$. Combining these two formulas yields the failure rate
process given in Example 2.2.
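The critical-component set in (2.3) can be computed mechanically from the structure function; a sketch for the system $\zeta = X_1 \wedge (X_2 \vee X_3)$ (the function names are ours):

```python
# structure function in terms of failure indicators y_i = 1 if component i has failed:
# the system X1 ^ (X2 v X3) is down iff component 1 is down or both 2 and 3 are down
def phi_failed(y):
    return int(y[0] == 1 or (y[1] == 1 and y[2] == 1))

def critical_set(y):
    """Indices i with phi(..., 1_i, ...) - phi(..., 0_i, ...) = 1, i.e. components
    whose failure would immediately bring the system down."""
    crit = []
    for i in range(3):
        up, down = list(y), list(y)
        up[i], down[i] = 0, 1
        if phi_failed(down) - phi_failed(up) == 1:
            crit.append(i + 1)  # 1-based component labels
    return crit

print(critical_set([0, 0, 0]))  # all components work -> [1]
print(critical_set([0, 1, 0]))  # component 2 failed  -> [1, 3]
```

The two printed sets reproduce the two cases of Example 2.4.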

The set $\Gamma_\phi(t)$ of critical components is non-decreasing in $t$. So
from (2.3) it can easily be seen that if all component failure rates increase,
then the $\mathbb{F}$-failure rate process $\lambda$ also increases and the hazard process $\Lambda$ is
convex (almost surely for $t \in (0, \zeta]$). Such a closure theorem does not hold
true for the ordinary failure rate of the lifetime $\zeta$, as can be seen from simple
counterexamples (see Barlow and Proschan 1975, p. 83). Now it is natural
to define that $\zeta$ has an $\mathbb{F}$-increasing failure rate if $\lambda$ increases.
Definition. If an $\mathbb{F}$-semimartingale representation (2.2) holds true for $\zeta$,
then the latter is called $\mathbb{F}$-IFR (increasing failure rate relative to $\mathbb{F}$) if $\lambda$ has
increasing paths almost surely for $t \in (0, \zeta]$.
An alternative definition, which is derived from notions of multivariate
aging, was given by Arjas (1981a); see also Shaked and Shanthikumar (1991).

2.2 Change of Information Level

One of the advantages of the semimartingale technique is the possibility of
studying the random evolution of a stochastic process on different information
levels. Let $\mathbb{A} = (\mathcal{A}_t)_{t\in\mathbb{R}_+}$ and $\mathbb{F} = (\mathcal{F}_t)_{t\in\mathbb{R}_+}$ be two filtrations on the same
probability space $(\Omega, \mathcal{F}, P)$. Then $\mathbb{A}$ is called a subfiltration of $\mathbb{F}$ if $\mathcal{A}_t \subseteq \mathcal{F}_t$ for
all $t \in \mathbb{R}_+$. In this case $\mathbb{F}$ can be viewed as the complete information filtration
and $\mathbb{A}$ as the observation filtration on a lower level. If $Z$ is a semimartingale
with representation

$$Z_t = Z_0 + \int_0^t f_s\,ds + M_t,$$

then the projection theorem of filtering theory (see Jensen (1989) for detailed
references) ensures that such a representation also applies to the conditional
expectation $\hat{Z}$ with $\hat{Z}_t = E(Z_t \mid \mathcal{A}_t)$:

$$\hat{Z}_t = \hat{Z}_0 + \int_0^t \hat{f}_s\,ds + \hat{M}_t, \qquad (2.4)$$

where $\hat{f}_t$ is some suitable version of the conditional expectation $E(f_t \mid \mathcal{A}_t)$ and
$\hat{M}$ is an $\mathbb{A}$-martingale. This projection theorem can be applied to the lifetime
indicator process $V_t = I_{\{\zeta \le t\}}$ with representation (2.2). If the lifetime can be
observed, i.e. $\{\zeta \le s\} \in \mathcal{A}_t$ for all $0 \le s \le t$, which is assumed throughout,
then the change of the information level from $\mathbb{F}$ to $\mathbb{A}$ leads from (2.2) to the
representation

$$\hat{V}_t = E(I_{\{\zeta \le t\}} \mid \mathcal{A}_t) = I_{\{\zeta \le t\}} = \int_0^t I_{\{\zeta > s\}}\hat{\lambda}_s\,ds + \hat{M}_t, \qquad (2.5)$$

where $\hat{\lambda}_t = E(\lambda_t \mid \mathcal{A}_t)$. The projection theorem shows that it is possible to
obtain the failure rate on a lower information level merely by forming conditional
expectations, under some mild technical conditions.

Remark 2.1. Unfortunately monotonicity properties are in general not preserved
when changing the observation level. As was noted above, if all components
of a monotone system have lifetimes with increasing failure rates, then $\zeta$
is $\mathbb{F}$-IFR on the component observation level. But switching to a subfiltration
$\mathbb{A}$ may lead to a non-monotone failure rate process $\hat{\lambda}$.

The following example from Heinrich and Jensen (1992) illustrates the role
of partial information.
Example 2.5. Consider a two-component parallel system with i.i.d. random
variables $X_i$, $i = 1, 2$, describing the component lifetimes, which follow
an exponential distribution with parameter $\alpha$. Then the system lifetime
is $\zeta = X_1 \vee X_2$ and the "complete information" filtration is given by
$\mathcal{F}_t = \sigma(I_{\{X_1 \le s\}}, I_{\{X_2 \le s\}},\ 0 \le s \le t)$. In this case the $\mathbb{F}$-semimartingale representation
(2.2) is given by

$$I_{\{\zeta \le t\}} = \int_0^t I_{\{\zeta > s\}}\alpha(I_{\{X_1 \le s\}} + I_{\{X_2 \le s\}})\,ds + M_t = \int_0^t I_{\{\zeta > s\}}\lambda_s\,ds + M_t.$$

Now several subfiltrations can describe different lower information levels,
where it is assumed that the system lifetime $\zeta$ can be observed on all observation
levels.
a) Information about component lifetimes with time lag $h > 0$:

$$\mathcal{A}_t^a = \begin{cases} \sigma(I_{\{\zeta \le s\}},\ 0 \le s \le t) & \text{for } 0 \le t < h, \\ \sigma(I_{\{\zeta \le s\}}, I_{\{X_1 \le u\}}, I_{\{X_2 \le u\}},\ 0 \le s \le t,\ 0 \le u \le t - h) & \text{for } t \ge h, \end{cases}$$

$$\hat{\lambda}_t^a = \begin{cases} 2\alpha\left(1 - (2 - e^{-\alpha t})^{-1}\right) & \text{for } 0 \le t < h, \\ \alpha\left(2 - I_{\{X_1 > t-h\}}e^{-\alpha h} - I_{\{X_2 > t-h\}}e^{-\alpha h}\right) & \text{for } t \ge h. \end{cases}$$

b) Information about $\zeta$ till $h$, after $h$ complete information:

$$\mathcal{A}_t^b = \begin{cases} \sigma(I_{\{\zeta \le s\}},\ 0 \le s \le t) & \text{for } 0 \le t < h, \\ \mathcal{F}_t & \text{for } t \ge h, \end{cases} \qquad \hat{\lambda}_t^b = \begin{cases} 2\alpha\left(1 - (2 - e^{-\alpha t})^{-1}\right) & \text{for } 0 \le t < h, \\ \lambda_t & \text{for } t \ge h. \end{cases}$$

c) Information about component lifetime $X_1$ and $\zeta$:

$$\mathcal{A}_t^c = \sigma(I_{\{\zeta \le s\}}, I_{\{X_1 \le s\}},\ 0 \le s \le t), \qquad \hat{\lambda}_t^c = \alpha\left(I_{\{X_1 \le t\}} + I_{\{X_1 > t\}}P(X_2 \le t)\right).$$

d) Information only about $\zeta$:

$$\mathcal{A}_t^d = \sigma(I_{\{\zeta \le s\}},\ 0 \le s \le t), \qquad \hat{\lambda}_t^d = 2\alpha\left(1 - (2 - e^{-\alpha t})^{-1}\right).$$
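The conditional rate at level c) can be checked by simulation (a sketch; $\alpha = 1$, the time point and the window width are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, t, h, n = 1.0, 1.0, 0.05, 400000
X1 = rng.exponential(1 / alpha, n)
X2 = rng.exponential(1 / alpha, n)
zeta = np.maximum(X1, X2)

# on {zeta > t, X1 > t} the projected rate is alpha * P(X2 <= t)
cond = X1 > t                        # zeta > t is automatic on this event
est = np.mean(zeta[cond] <= t + h) / h
exact = alpha * (1 - np.exp(-alpha * t))
print(round(est, 3), round(exact, 3))
```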
The failure rate corresponding to $\mathbb{A}^d$ in part d) of this example is the standard
deterministic failure rate, because $\{\zeta > t\}$ is an atom of $\mathcal{A}_t^d$, so that $\hat{\lambda}^d$
can always be chosen to be deterministic on $\{\zeta > t\}$. Example 2.1 showed
that such deterministic failure rates satisfy the well-known exponential formula
(1.1). One might ask under what conditions such an exponential formula
extends also to random failure rate processes. This question was referred to
briefly in Arjas (1989) and answered in Yashin and Arjas (1988) to some extent.
The following treatment differs slightly in that the starting point is the
basic model (2.2). The failure rate process $\lambda$ is assumed to be observable on
some level $\mathbb{A}$, i.e. $\lambda$ is adapted to that filtration. This observation level can be
somewhere between the trivial filtration $\mathbb{G} = (\mathcal{G}_t)_{t\in\mathbb{R}_+}$, $\mathcal{G}_t = \{\emptyset, \Omega\}$, which
does not allow for any random information, and the basic complete information
filtration $\mathbb{F}$. So $\zeta$ itself need not be observable at level $\mathbb{A}$ (and should
not, if we want to arrive at an exponential formula). Using the projection
theorem one obtains

$$E(I_{\{\zeta \le t\}} \mid \mathcal{A}_t) = 1 - \bar{F}_t = \int_0^t \bar{F}_s\lambda_s\,ds + \bar{M}_t, \qquad (2.6)$$

where $\bar{F}$ denotes the conditional survival probability, $\bar{F}_t = E(I_{\{\zeta > t\}} \mid \mathcal{A}_t)$, and
$\bar{M}$ is an $\mathbb{A}$-martingale. In general $\bar{F}$ can be rather irregular; it need not
even be monotone. But if $\bar{F}$ has continuous paths of bounded variation, then
the martingale $\bar{M}$ is identically 0 and the solution of the resulting integral
equation is

$$\bar{F}_t = \exp\left\{-\int_0^t \lambda_s\,ds\right\}, \qquad (2.7)$$

which is a generalization of formula (1.1). If $\mathbb{A}$ is the trivial filtration $\mathbb{G}$,
then (2.7) coincides with (1.1). For (2.7) to hold, it is necessary that the
observation of $\lambda$ and other events on level $\mathbb{A}$ only have "smooth" influence
on the conditional survival probability.
Remark 2.2. This is a more technical remark to show how one can proceed if
$\bar{F}$ is not continuous. Let $(\bar{F}_{t-})_{t\in\mathbb{R}_+}$ be the left-continuous version of $\bar{F}$. The
equation (2.6) can be rewritten as

$$\bar{F}_t = 1 - \int_0^t \bar{F}_{s-}\lambda_s\,ds - \bar{M}_t.$$

Under mild conditions an $\mathbb{A}$-martingale $L$ can be found such that $\bar{M}$ can
be represented as the (stochastic) integral $\bar{M}_t = \int_0^t \bar{F}_{s-}\,dL_s$. With the semimartingale
$Z$, $Z_t = -\int_0^t \lambda_s\,ds - L_t$, equation (2.6) becomes

$$\bar{F}_t = 1 + \int_0^t \bar{F}_{s-}\,dZ_s.$$

The unique solution of this integral equation is given by the so-called Doléans
exponential

$$\bar{F}_t = \exp\{Z_t^c\} \prod_{s \le t} (1 + \Delta Z_s),$$

where for $Z$ (and $L$ respectively) $Z^c$ denotes the continuous part of $Z$ and
$\Delta Z_s = Z_s - Z_{s-}$ the jump height at $s$. This extended exponential formula
shows that possible jumps of the conditional survival probability are not
caused by jumps of the failure rate process but by (unpredictable) jumps of
the martingale part.

3. Models of Minimal Repair


In this and the next section the general model presented will be combined
with maintenance actions. One of these actions is to repair the system. In-
stead of replacing a failed system by a new one a so-called minimal repair
restores the system to a certain degree. Models of this kind have been consid-
ered by Barlow and Hunter (1960), Aven (1983), Bergman (1985), Block et
al. (1985), Stadje and Zuckerman (1991), Shaked and Shanthikumar (1986)
and Beichelt (1993), among others. Often used verbal definitions for a mini-
mal repair are the following:
- "The ... assumption is made that the system failure rate is not disturbed
after performing minimal repair. For instance, after replacing a single tube
in a television set, the set as a whole will be about as prone to failure after
the replacement as before the tube failure" (Barlow and Hunter 1960).
- "A minimal repair is one which leaves the unit in precisely the condition
it was in immediately before the failure" (Phelps 1983).
The definition of the state of the system immediately before failure depends
to a considerable degree on the information one has about the system. So it
makes a difference whether all components of a complex system are observed
or only failure of the whole system is recognized. In the first case the lifetime
of the repaired component (tube of TV set) is associated with the residual
lifetime in that a further failure of this part will cause the whole system to
fail. In the second case the only information about the condition of the system
immediately before failure is the age. So a minimal repair in this case would
mean replacing the system (the whole TV set) by another one of the same
age that as yet has not failed. Minimal repairs of this kind are also called

black box or statistical minimal repairs, whereas the componentwise minimal


repairs are also called physical minimal repairs.
Example 3.1. We consider a simple two-component parallel system with
independent Exp(1) distributed component lifetimes $X_1, X_2$ and allow for
exactly one minimal repair.
- Physical minimal repair. After failure at $\zeta = T_1 = X_1 \vee X_2$ the component
which caused the system to fail is repaired minimally. Since the component
lifetimes are exponentially distributed, the additional lifetime is given by an
Exp(1) random variable $X_3$ independent of $X_1$ and $X_2$. The total lifetime
$T_1 + X_3$ has distribution

$$P(T_1 + X_3 \le t) = 1 - 2te^{-t} - e^{-2t}, \quad t \ge 0.$$

- Black box minimal repair. The lifetime $\zeta = T_1 = X_1 \vee X_2$ until the first
failure of the system has distribution $P(T_1 \le t) = (1 - e^{-t})^2$ and failure
rate $\lambda(t) = \frac{2(1 - e^{-t})}{2 - e^{-t}}$. The additional lifetime $T_2 - T_1$ until the second
failure is assumed to have conditional distribution

$$P(T_2 - T_1 \le x \mid T_1 = t) = P(T_1 \le t + x \mid T_1 > t) = 1 - e^{-x}\,\frac{2 - e^{-(t+x)}}{2 - e^{-t}}.$$

Then the total lifetime $T_2$ has distribution

$$P(T_2 \le t) = 1 - e^{-t}(2 - e^{-t})\left(1 + t - \ln(2 - e^{-t})\right), \quad t \ge 0.$$

It is (perhaps) no surprise that the total lifetime after a black box minimal
repair is stochastically greater than after a physical minimal repair:

$$P(T_2 > t) \ge P(T_1 + X_3 > t) \quad \text{for all } t \ge 0.$$
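A simulation sketch of both repair disciplines; the two-point mixture used for the black-box replacement's residual life follows from memorylessness of the exponential distribution (given survival to age $T_1$, both components of the replacement are alive with probability $e^{-T_1}/(2 - e^{-T_1})$, otherwise exactly one):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200000
X = rng.exponential(1.0, size=(n, 2))
T1 = X.max(axis=1)                      # first system failure T1 = X1 v X2
phys = T1 + rng.exponential(1.0, n)     # physical repair: one fresh Exp(1) lifetime remains

# black box repair: replacement by an independent same-age system that has not failed
p_both = np.exp(-T1) / (2 - np.exp(-T1))        # P(both components of replacement alive)
E = rng.exponential(1.0, size=(n, 2))
residual = np.where(rng.random(n) < p_both, E.max(axis=1), E[:, 0])
T2 = T1 + residual

for t in (1.0, 2.0, 3.0):
    p_phys = 2 * t * np.exp(-t) + np.exp(-2 * t)   # closed form for P(T1 + X3 > t)
    print(t, round(np.mean(phys > t), 3), round(p_phys, 4),
          np.mean(T2 > t) >= np.mean(phys > t))
```

The empirical survival probabilities reproduce the stochastic ordering stated above.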


As was pointed out by Bergman (1985), information plays an important role.
Further steps in investigating information-based minimal repair were carried
out by Arjas and Norros (1989) and Natvig (1990).

3.1 Information-Based Minimal Repair

Let the time points $T_n$ of minimal repairs be given by an increasing sequence
$0 < T_1 \le T_2 \le \ldots$ of random variables on the given probability space, where as
before the inequality is strict unless $T_n = \infty$, and moreover $\lim_{n\to\infty} T_n = \infty$.
It is assumed that these minimal repairs take negligible time. The process
$N = (N_t)_{t\in\mathbb{R}_+}$ with

$$N_t = \sum_{n=1}^{\infty} I_{\{T_n \le t\}}$$

counts the number of minimal repairs up to time $t$ and is adapted to some
filtration $\mathbb{F}$. Similar to the failure time model (2.2) it is now assumed that $N$
has an absolutely continuous compensator:

$$N_t = \int_0^t \lambda_s\,ds + M_t, \qquad (3.1)$$

where $\lambda$ is some non-negative intensity process observable on the $\mathbb{F}$-level and
$M$ is an $\mathbb{F}$-martingale. This point process model is consistent with the general
lifetime model (2.2). If the process $N$ is stopped at $T_1$, then (3.1) is reduced
to (2.2):

$$N_{t \wedge T_1} = I_{\{T_1 \le t\}} = \int_0^{t \wedge T_1} \lambda_s\,ds + M_{t \wedge T_1} = \int_0^t I_{\{T_1 > s\}}\lambda_s\,ds + M_t',$$

where $M'$ is the stopped martingale $M$, $M_t' = M_{t \wedge T_1}$. The time to first failure
corresponds to the original lifetime $\zeta = T_1$.
Example 3.2. Different types of minimal repair processes are characterized
by different intensities $\lambda$.
a) Poisson process with constant intensity $\lambda_t \equiv \lambda$. The times between successive
'minimal' repairs are independent Exp($\lambda$) distributed random variables.
This is the simple case in which repairs have the same effect as
replacements with new items.
b) If in a) the intensity is not constant but a random variable $\lambda(\omega)$ which is
known at the time origin ($\lambda$ is $\mathcal{F}_0$-measurable), then the process is called
a doubly stochastic Poisson process or Cox process.
c) If in a) the intensity is not constant but a time-dependent deterministic
function $\lambda_t = \lambda(t)$, then the process is a non-homogeneous Poisson process.
Most attention in the literature on minimal repairs has been paid to this
case of black box minimal repairs in which, after repairs, the failure intensity
remains the same as if the system had not failed before. In the case of
the parallel system in Example 3.1 one has $\lambda(t) = \frac{2(1 - e^{-t})}{2 - e^{-t}}$.
d) The general case: $\lambda$ is $\mathbb{F}$-adapted. This applies to the physical minimal
repair in Example 3.1: $\lambda_t = I_{\{X_1 \wedge X_2 \le t\}}$.
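Case c) can be simulated by thinning a homogeneous Poisson process (a sketch; here with the rate $\lambda(t) = 2(1-e^{-t})/(2-e^{-t})$ of Example 3.1, which is bounded by 1):

```python
import numpy as np

rng = np.random.default_rng(5)

def lam(t):
    return 2 * (1 - np.exp(-t)) / (2 - np.exp(-t))   # bounded above by 1

def nhpp_count(T, lam_max=1.0):
    """Number of points of the non-homogeneous Poisson process on [0, T] by thinning."""
    t, count = 0.0, 0
    while True:
        t += rng.exponential(1 / lam_max)            # candidate from a rate-lam_max Poisson process
        if t > T:
            return count
        if rng.random() < lam(t) / lam_max:          # accept with probability lam(t)/lam_max
            count += 1

T = 5.0
counts = [nhpp_count(T) for _ in range(20000)]
# the mean number of repairs should match the cumulative rate t - ln(2 - e^{-t})
print(round(float(np.mean(counts)), 3), round(T - np.log(2 - np.exp(-T)), 3))
```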
Example 3.1 suggests comparing the effects of minimal repairs on different information
levels. However, it seems difficult to define such point processes on
different levels. One possible way is sketched in the following, where considerations
are restricted to the given $\mathbb{F}$-level of the basic model (3.1) and the 'black-box
level' $\mathbb{A}$ generated by $\zeta = T_1$: $\mathcal{A}_t = \sigma(I_{\{T_1 \le s\}},\ 0 \le s \le t)$. Proceeding
from the representation (3.1), the time to first failure is governed by the
$\mathbb{F}$-hazard rate process $\lambda$ for $t \in (0, \zeta]$. The change to the $\mathbb{A}$-level by conditioning
leads to the failure rate $\hat{\lambda}$, $\hat{\lambda}_t = E(\lambda_t \mid \mathcal{A}_t)$. As described in Section 2.2,
$\hat{\lambda}$ can be chosen deterministically. For the time to first failure we have the
two representations

$$I_{\{T_1 \le t\}} = \int_0^t I_{\{T_1 > s\}}\lambda_s\,ds + M_t \qquad \mathbb{F}\text{-level},$$

$$I_{\{T_1 \le t\}} = \int_0^t I_{\{T_1 > s\}}\hat{\lambda}_s\,ds + \hat{M}_t \qquad \mathbb{A}\text{-level}.$$

From the deterministic failure rate $\hat{\lambda}$ a nonhomogeneous Poisson process
$(T_n')_{n\in\mathbb{N}}$, $0 < T_1' < T_2' < \ldots$, can be constructed, where $T_1$ and $T_1'$ have the
same distribution. This nonhomogeneous Poisson process with

$$N_t' = \sum_{n=1}^{\infty} I_{\{T_n' \le t\}} = \int_0^t \hat{\lambda}_s\,ds + M_t'$$

describes the minimal repair process on the $\mathbb{A}$-level. Comparing these two
information levels, Example 3.1 might suggest $EN_t \ge EN_t'$ for all positive
$t$. A general comparison, also for arbitrary subfiltrations, seems to be an open
problem (see Arjas 1989 and Natvig 1990).
Example 3.3. In the two-component parallel system of Example 3.1 we
have the failure rate process $\lambda_t = I_{\{X_1 \wedge X_2 \le t\}}$ on the component level and
$\hat{\lambda}_t = \frac{2(1 - e^{-t})}{2 - e^{-t}}$ on the black-box level. So one has two descriptions of the
same random lifetime $\zeta = T_1$:

$$I_{\{T_1 \le t\}} = \int_0^t I_{\{T_1 > s\}}I_{\{X_1 \wedge X_2 \le s\}}\,ds + M_t = \int_0^t I_{\{T_1 > s\}}\,\frac{2(1 - e^{-s})}{2 - e^{-s}}\,ds + \hat{M}_t.$$

The process $N$ counts the number of minimal repairs on the component
level:

$$N_t = \int_0^t I_{\{X_1 \wedge X_2 \le s\}}\,ds + M_t.$$

This is a delayed Poisson process, the (repair) intensity of which is equal to 1
after the first component failure. The process $N'$ counts the number of minimal
repairs on the black-box level:

$$N_t' = \int_0^t \frac{2(1 - e^{-s})}{2 - e^{-s}}\,ds + M_t'.$$

This is a nonhomogeneous Poisson process with an intensity which corresponds
to the ordinary failure rate of $T_1$. Elementary calculations yield indeed

$$EN_t = t - \tfrac{1}{2}\left(1 - e^{-2t}\right) \ge t - \ln\left(2 - e^{-t}\right) = EN_t', \quad t \ge 0.$$

To interpret this result one should note that on the component level only the
critical component which caused the system to fail is repaired. A black box
repair, which is a replacement by a system of the same age that has not yet
failed, can be a replacement by a system with both components working.
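The inequality between the two expected repair counts can be confirmed numerically; it also follows from $g(t) = \ln(2-e^{-t}) - \frac{1}{2}(1-e^{-2t})$ having derivative $e^{-t}(1-e^{-t})^2/(2-e^{-t}) \ge 0$ (a sketch):

```python
import numpy as np

t = np.linspace(0.0, 20.0, 20001)
EN = t - 0.5 * (1 - np.exp(-2 * t))   # component level: delayed Poisson process
ENp = t - np.log(2 - np.exp(-t))      # black-box level: nonhomogeneous Poisson process

print(np.all(EN >= ENp))              # more repairs on the component level, for all t
print(round(EN[-1] - ENp[-1], 4))     # the gap tends to ln 2 - 1/2
```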

3.2 A Markov Modulated Minimal Repair Process

In this section a model with a given reward structure is investigated in which


an optimal operating time of a system has to be found that balances some flow
of rewards and the increasing cost rate due to minimal repairs. The following
presentation follows the lines of Jensen and Hsu (1993) and Jensen (1996)
which include the technical details. Consider a one-unit system which fails
from time to time according to a point process. After failure a minimal repair
is carried out which leaves the state of the system unchanged. The system
can work in one of m unobservable states. State '1' stands for new or in good
condition and 'm' is defective or in bad condition. Aging of the system is
described by a link between the failure point process and the unobservable
state of the system. The failure or minimal repair intensity may depend on the
state of the system. There is some constant flow of income on the one hand
and on the other hand each minimal repair incurs a random cost amount. The
question is when to stop operating the system and to carry out an inspection
or a renewal in order to maximize some reward functional.
For the mathematical formulation let the basic probability space be
$(\Omega, \mathcal{F}, P)$, equipped with a filtration $\mathbb{F}$, the complete information level, to
which all processes are adapted, and let $S = \{1, \ldots, m\}$ be the set of unobservable
states. Moreover, the following (random) quantities are given.
- The changes of the states are driven by a homogeneous Markov process
$Y = (Y_t)_{t\in\mathbb{R}_+}$ with values in $S$ and infinitesimal parameters $q_i$, the rate to
leave state $i$, and $q_{ij}$, the rate to reach state $j$ from state $i$:

$$q_i = \lim_{h\to 0+} \frac{1}{h} P(Y_h \ne i \mid Y_0 = i), \qquad q_{ij} = \lim_{h\to 0+} \frac{1}{h} P(Y_h = j \mid Y_0 = i).$$

- The time points of failures (minimal repairs) $0 < T_1 < T_2 < \ldots$ form a
point process and $N = (N_t)_{t\in\mathbb{R}_+}$ is the corresponding counting process:

$$N_t = \sum_{n=1}^{\infty} I_{\{T_n \le t\}}.$$

It is assumed that $N$ has a stochastic intensity $\lambda_Y$ which depends on the
unobservable state, i.e. $N$ is a so-called Markov modulated Poisson process
with representation

$$N_t = \int_0^t \lambda_{Y_s}\,ds + M_t,$$

where $M$ is an $\mathbb{F}$-martingale and $0 < \lambda_i < \infty$, $i \in S$. In state $i$ the failure
point process is Poisson with rate $\lambda_i$. But note that the ordinary failure
rate of $T_1$ is not constant.
- $(X_n)_{n\in\mathbb{N}}$ is a sequence of positive i.i.d. random variables, independent of
$N$ and $Y$, with common distribution $F$ and finite mean $\mu$. $X_n$ describes
the cost caused by the $n$-th minimal repair at time $T_n$.

- There is an initial capital $u$ and an income of constant rate $c > 0$ per unit
time.
Now the process $R$, given by

$$R_t = u + ct - \sum_{n=1}^{N_t} X_n,$$

describes the available capital at time $t$ as the difference of the income and
the total amount of costs for minimal repairs up to time $t$.

Fig. 3.1. Risk reserve

The process $R$ is well known in collective risk theory, where it describes the risk
reserve at time $t$. In risk theory one is mainly interested in the distribution
of the time to ruin $\tau = \inf\{t \in \mathbb{R}_+ : R_t < 0\}$. The focus in the reliability
framework is on determining the optimal operating time with respect to the
given reward structure. For this one has to estimate the unobservable state
of the system at time $t$, given the history of the process $R$ up to time $t$.
This can be done by means of well-known results in filtering theory (see, e.g.
Bremaud 1981). Stopping at a fixed time $t$ results in the net gain

$$Z_t = R_t - \sum_{j=1}^m k_j U_t(j),$$

where $U_t(j) = I_{\{Y_t = j\}}$ is the indicator of the state at time $t$ and $k_j \in \mathbb{R}$, $j \in S$,
are stopping costs (for inspection and replacement) which may depend on the
stopping state. The process $Z$ cannot be observed directly because only the

failure time points and the costs for minimal repairs are known to an observer.
The observation filtration $\mathbb{A} = (\mathcal{A}_t)_{t\in\mathbb{R}_+}$ is given by

$$\mathcal{A}_t = \sigma(N_s, X_i,\ 0 \le s \le t,\ i = 1, \ldots, N_t).$$

Let $C^{\mathbb{A}} = \{\tau : \tau \text{ finite } \mathbb{A}\text{-stopping time},\ EZ_\tau^- < \infty\}$ be the set of feasible
stopping times in which the optimal one has to be found. As usual $a^- =
-\min(0, a)$ denotes the negative part of $a \in \mathbb{R}$. So the problem is to find that
$\tau^* \in C^{\mathbb{A}}$ which maximizes the expected net gain:

$$EZ_{\tau^*} = \sup\{EZ_\tau : \tau \in C^{\mathbb{A}}\}.$$

For the solution of this problem an $\mathbb{F}$-semimartingale representation of the
process $Z$ is needed, where it is assumed that the complete information filtration
$\mathbb{F}$ is generated by $Y$, $N$ and $(X_n)$:

$$\mathcal{F}_t = \sigma(Y_s, N_s, X_i,\ 0 \le s \le t,\ i = 1, \ldots, N_t).$$

Such a representation can easily be obtained (see Jensen 1995 for details):

$$Z_t = u - \sum_{j=1}^m k_j U_0(j) + \int_0^t \sum_{j=1}^m U_s(j) r_j\,ds + M_t, \quad t \in \mathbb{R}_+, \qquad (3.2)$$

where $M = (M_t)$ is an $\mathbb{F}$-martingale and the constants $r_j$ are defined by

$$r_j = c - \lambda_j\mu - \sum_{\nu \ne j} (k_\nu - k_j) q_{j\nu}.$$

These constants can be interpreted as net gain rates in state $j$:
- $c$ is the income rate,
- $\lambda_j$, the failure rate in state $j$, is the expected number of failures per unit
of time, and $\mu$ is the expected repair cost for one minimal repair, so $\lambda_j\mu$ is the
repair cost rate,
- the remaining sum is the stopping cost rate.
Since the state indicators $U_s(j)$, and therefore $Z$, cannot be observed, a projection
to the observation filtration $\mathbb{A}$ is needed. As described in Section 2.2,
such a projection from the $\mathbb{F}$-level (3.2) to the $\mathbb{A}$-level leads to the following
conditional expectations:

$$\hat{Z}_t = E(Z_t \mid \mathcal{A}_t) = u - \sum_{j=1}^m k_j \hat{U}_0(j) + \int_0^t \sum_{j=1}^m \hat{U}_s(j) r_j\,ds + \hat{M}_t, \quad t \in \mathbb{R}_+. \qquad (3.3)$$

The integrand $\sum_{j=1}^m \hat{U}_s(j) r_j$ with $\hat{U}_s(j) = E(U_s(j) \mid \mathcal{A}_s) = P(Y_s = j \mid \mathcal{A}_s)$ is the
conditional expectation of the net gain rate at time $s$ given the observations
up to time $s$. If this integrand has non-increasing paths, then it is said that one
is in the "monotone case" and the stopping problem can be solved under

some additional integrability conditions. To state monotonicity conditions for
the integrand in (3.3), an explicit representation of $\hat{U}_t(j)$ is needed, which can
be obtained by means of results in filtering theory (see Bremaud 1981, p. 98)
in form of "differential equations":
- between the jumps of $N$, i.e. for $T_n \le t < T_{n+1}$,

$$\hat{U}_t(j) = \hat{U}_{T_n}(j) + \int_{T_n}^t \left( \sum_{i \in S} \hat{U}_s(i)\,q_{ij} + \hat{U}_s(j)\big(\hat{\lambda}_s - \lambda_j\big) \right) ds, \qquad q_{ii} := -q_i,$$

with $\hat{\lambda}_s = \sum_{i \in S} \lambda_i \hat{U}_s(i)$ and initial condition $\hat{U}_0(j) = P(Y_0 = j)$, $j \in S$;
- at jumps,

$$\hat{U}_{T_n}(j) = \frac{\lambda_j\,\hat{U}_{T_n-}(j)}{\sum_{i \in S} \lambda_i\,\hat{U}_{T_n-}(i)},$$

where $\hat{U}_{T_n-}(j)$ denotes the left limit.


The following conditions ensure that the system ages, i.e. it moves from the "good" states with high net gains and low failure rates to the "bad" states with low and possibly negative net gains and high failure rates, and it can never return to a "better" state:

$$q_i > 0,\ i = 1, \ldots, m-1, \qquad q_{ij} = 0 \ \text{for } i > j,\ i, j \in S,$$
$$r_1 \ge r_2 \ge \cdots \ge r_m = c - \lambda_m \mu, \qquad r_m < 0, \tag{3.4}$$
$$0 < \lambda_1 \le \lambda_2 \le \cdots \le \lambda_m.$$
A reasonable candidate for an optimal $\mathbb{A}$-stopping time is
$$\tau^* = \inf\Big\{ t \in \mathbb{R}_+ : \sum_{j=1}^{m} \hat U_t(j)\, r_j < 0 \Big\}, \tag{3.5}$$
the first time the conditional expectation of the net gain rate falls below 0.
Theorem 3.1. Let $\tau^*$ be the $\mathbb{A}$-stopping time (3.5) and assume that conditions (3.4) hold true. If in addition $q_{im} > \lambda_m - \lambda_i$, $i = 1, \ldots, m-1$, then $\tau^*$ is optimal.

A proof can be found in Jensen and Hsu (1993). The additional condition $q_{im} > \lambda_m - \lambda_i$ ensures that the integrand of the drift term, $g_t = \sum_{j=1}^{m} \hat U_t(j)\, r_j$, has non-increasing paths, so that the monotone case applies. But in any case, under conditions (3.4), $g = (g_t)_{t \in \mathbb{R}_+}$ is a supermartingale and $\tau^*$ is optimal in a smaller set of $\mathbb{A}$-stopping times with finite expectation. Of special interest is the case $m = 2$, for which an explicit solution of the stopping problem will be given.
24 Uwe Jensen

Fig. 3.2. The failure process

The Case of m = 2 States. For two states the stopping problem can be reformulated as follows. At an unobservable random time, say $\sigma$, a switch from state 1 to state 2 occurs. This change has to be detected as well as possible by means of the failure process observations. The conditions (3.4) now read
$$q_1 = q_{12} =: q > 0, \qquad q_2 = q_{21} = 0,$$
$$r_1 = c - \lambda_1 \mu - q(k_2 - k_1) > 0 > r_2 = c - \lambda_2 \mu, \tag{3.6}$$
$$0 < \lambda_1 \le \lambda_2, \qquad P(Y_0 = 1) = 1.$$
The conditional distribution of $\sigma$ can be obtained explicitly as the solution of the above-mentioned differential equations: for $T_n \le t < T_{n+1}$,
$$\hat U_t(2) = 1 - \frac{e^{-g_n(t)}}{d_n + (\lambda_2 - \lambda_1)\int_{T_n}^t e^{-g_n(s)}\, ds},$$
and at the jumps
$$\hat U_{T_n}(2) = \frac{\lambda_2\, \hat U_{T_n-}(2)}{\lambda_1 + (\lambda_2 - \lambda_1)\, \hat U_{T_n-}(2)},$$
where $d_n = (1 - \hat U_{T_n}(2))^{-1}$ and $g_n(t) = (q - (\lambda_2 - \lambda_1))(t - T_n)$. The stopping time $\tau^*$ in (3.5) can now be written as

$$\tau^* = \inf\{t \in \mathbb{R}_+ : \hat U_t(2) > z^*\}, \qquad z^* = \frac{r_1}{r_1 - r_2}.$$
As a consequence of the above theorem, this $\mathbb{A}$-stopping time, which is of control-limit type, is optimal if $q \ge (\lambda_2 - \lambda_1)\,z^*$.
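These explicit formulas can be checked numerically. The sketch below (function names, parameter values and step sizes are my own choices) compares the closed-form expression for $\hat U_t(2)$ between failures (taking $T_n = 0$) with an Euler integration of the corresponding filter equation $du/ds = q(1-u) - (\lambda_2 - \lambda_1)\,u(1-u)$, and implements the Bayes update applied at each observed failure:

```python
import math

def u_closed(t, u0, q, lam1, lam2):
    """Closed-form U_t(2) between failure times (T_n = 0), as displayed above."""
    gamma = q - (lam2 - lam1)
    d = 1.0 / (1.0 - u0)
    integral = t if abs(gamma) < 1e-12 else (1.0 - math.exp(-gamma * t)) / gamma
    return 1.0 - math.exp(-gamma * t) / (d + (lam2 - lam1) * integral)

def u_euler(t, u0, q, lam1, lam2, n=100_000):
    """Euler integration of the filter ODE between jumps:
    du/ds = q(1-u) - (lam2 - lam1) * u * (1-u)."""
    u, dt = u0, t / n
    for _ in range(n):
        u += (q * (1.0 - u) - (lam2 - lam1) * u * (1.0 - u)) * dt
    return u

def jump_update(u_minus, lam1, lam2):
    """Bayes update of U(2) at an observed failure (minimal repair)."""
    return lam2 * u_minus / (lam1 + (lam2 - lam1) * u_minus)
```

Running it with, say, $q = 0.5$, $\lambda_1 = 1$, $\lambda_2 = 2$ shows that the two computations agree up to the discretization error, and that each observed failure pushes the conditional probability of state 2 upwards, as it should when $\lambda_2 > \lambda_1$.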

Remark 3.1. If the failure rates in both states coincide, i.e. $\lambda_1 = \lambda_2$, the observation of the failure time points gives no additional information about the change time point from state 1 to state 2. Indeed, in this case the conditional distribution of $\sigma$ is deterministic,
$$P(\sigma \le t \mid \mathcal{A}_t) = P(\sigma \le t) = 1 - \exp\{-qt\},$$
and $\tau^*$ is a constant. As is to be expected, the random observations are useless in this case.

In general the value of the stopping problem $\sup\{EZ_\tau : \tau \in C^{\mathbb{A}}\}$, the best possible expected net gain, cannot be determined explicitly. But it is possible to determine bounds for this value. For this, the semimartingale representation turns out to be useful again, because it allows, by means of the projection theorem, a comparison of different information levels. The constant stopping times are contained in $C^{\mathbb{A}}$, and $C^{\mathbb{A}} \subset C^{\mathbb{F}}$. Therefore the following inequality applies:
$$\sup\{EZ_t : t \in \mathbb{R}_+\} \le \sup\{EZ_\tau : \tau \in C^{\mathbb{A}}\} \le \sup\{EZ_\tau : \tau \in C^{\mathbb{F}}\}.$$


At the complete information level $\mathbb{F}$ the change time point $\sigma$ can be observed, and it is obvious that under conditions (3.6) the $\mathbb{F}$-stopping time $\sigma$ is optimal in $C^{\mathbb{F}}$. Thus we have found upper and lower bounds $b_u$ and $b_l$ with
$$b_l \le \sup\{EZ_\tau : \tau \in C^{\mathbb{A}}\} \le b_u.$$
For $\lambda_1 = \lambda_2$ the optimal stopping time is deterministic, so that in this case the lower bound is attained. The inequality is also sharp in the sense that constants can be found which obey conditions (3.6) such that the upper and lower bounds come arbitrarily close together.

4. Information-Based Replacement of Complex Systems

In this section the basic lifetime model is combined with the possibility of preventive replacements. A system with random lifetime $\zeta > 0$ is replaced by a new, equivalent one after failure. A preventive replacement can be carried out before failure. There are costs for each replacement, and an additional amount has to be paid for replacements after failures. The aim is to determine an optimal replacement policy with respect to some cost criterion.
There is an extensive literature on models of this kind, surveyed in the overviews by Pierskalla and Voelker (1976), Sherif and Smith (1981) and Valdez-Flores and Feldman (1989) mentioned before. Several cost criteria are known, among which the long-run average cost per unit time criterion is by far the most popular. A general set-up for cost-minimizing problems, similar to that of Aven and Bergman (1986), is introduced in Jensen (1990). It allows for specialization in different directions. As an example, the total expected discounted cost criterion as described by Aven (1983) will be applied. What goes beyond the results in Aven and

Bergman (1986) is the possibility to take different information levels into account. This will be applied to complex monotone systems, for which some examples of various observation levels were given in Section 2.2. For the special case of a two-component parallel system with dependent component lifetimes it is shown how the optimal replacement policy depends on the different information levels and on the degree of dependence of the components.

4.1 The Maintenance Model

Consider a technical system with random lifetime $\zeta > 0$ according to the basic model (2.2), i.e. there exists an $\mathbb{F}$-semimartingale representation
$$I_{\{\zeta \le t\}} = \int_0^t I_{\{\zeta > s\}}\,\lambda_s\, ds + M_t$$
on some information level $\mathbb{F}$. When the system fails it is immediately replaced


by an identical one and the process repeats itself. A preventive replacement
can be carried out before failure. Each replacement incurs a cost of c > 0 and
each failure adds a penalty cost k > 0 (for a more general cost structure which
allows for age dependent costs see Heinrich and Jensen 1992). The problem
is to find a replacement (stopping) time which minimizes the total expected
discounted costs. Other criteria are possible, e.g. the long run average cost
per unit time criterion, and the solution of the corresponding minimization
problem follows the same lines as below.
Let $\alpha > 0$ be the discount rate and $(Z_\tau, \tau), (Z_{\tau_1}, \tau_1), (Z_{\tau_2}, \tau_2), \ldots$ a sequence of i.i.d. pairs of positive random variables, where $\tau_i$ represents the replacement age of the $i$-th implemented system, i.e., the length of the $i$-th cycle, and $Z_{\tau_i}$ describes the costs incurred during the $i$-th cycle, discounted to the beginning of the cycle. Then the total expected discounted costs are
$$K_\tau = E\big(Z_{\tau_1} + e^{-\alpha\tau_1} Z_{\tau_2} + e^{-\alpha(\tau_1 + \tau_2)} Z_{\tau_3} + \cdots\big) = \frac{EZ_\tau}{E(1 - e^{-\alpha\tau})}.$$
It turns out that $K_\tau$ is the ratio of the expected discounted costs for one cycle and $E(1 - e^{-\alpha\tau})$. Since only replacement times less than or equal to $\zeta$ are possible, the set of admissible stopping times is defined by
$$C^{\mathbb{F}} = \{\tau : \tau \ \text{finite } \mathbb{F}\text{-stopping time},\ \tau \le \zeta,\ EZ_\tau < \infty\}.$$
The stopping problem is to find a stopping time $\sigma \in C^{\mathbb{F}}$ with
$$K^* = K_\sigma = \inf\{K_\tau : \tau \in C^{\mathbb{F}}\}. \tag{4.1}$$


Stopping at a fixed time t leads to the following costs for one cycle discounted
to the beginning of the cycle:

$$Z_t = \big(c + k\, I_{\{\zeta \le t\}}\big)\, e^{-\alpha t}, \qquad t \in \mathbb{R}_+.$$


Proceeding from the semimartingale representation (2.2), such a representation can also be obtained for $Z = (Z_t)_{t \in \mathbb{R}_+}$ by using a product rule for "differentiating" semimartingales (compare Jensen 1989 and Heinrich and Jensen 1992), which corresponds to the ordinary product rule. This yields for $t \in (0, \zeta]$:
$$Z_t = c + \int_0^t I_{\{\zeta > s\}}\,\alpha e^{-\alpha s}\Big(-c + \lambda_s \frac{k}{\alpha}\Big)\, ds + R_t = c + \int_0^t I_{\{\zeta > s\}}\,\alpha e^{-\alpha s}\, r_s\, ds + R_t, \tag{4.2}$$
where $r_s = \frac{1}{\alpha}(-\alpha c + \lambda_s k)$ is a cost rate and $R = (R_t)_{t \in \mathbb{R}_+}$ is a uniformly integrable $\mathbb{F}$-martingale.

4.2 Optimal Stopping

To find the minimum $K^*$ in (4.1) we proceed as follows. First, bounds $b_l$ and $b_u$ for $K^*$ are determined. Then the problem of minimizing the ratio is replaced by an equivalent problem of maximizing the expectation of a stochastic process for which a semimartingale representation is known. The bounds $b_l$, $b_u$ are used to state conditions under which a solution of the optimal stopping problem exists.
To determine the bounds, let $q = \inf\{r_t(\omega) : 0 \le t < \zeta(\omega),\ \omega \in \Omega\}$ be the infimum of the cost rate; note that $q \ge -c$. For $\tau \in C^{\mathbb{F}}$ one has $ER_\tau = 0$ and
$$EZ_\tau = c + E\Big(\int_0^\tau I_{\{\zeta > s\}}\,\alpha e^{-\alpha s}\, r_s\, ds\Big) \ge c + q\, E(1 - e^{-\alpha\tau}).$$
This yields the lower bound
$$K_\tau = \frac{EZ_\tau}{E(1 - e^{-\alpha\tau})} \ge \frac{c}{E(1 - e^{-\alpha\tau})} + q \ge \frac{c}{E(1 - e^{-\alpha\zeta})} + q =: b_l.$$

Because of $\zeta \in C^{\mathbb{F}}$ one can use $b_u = K_\zeta$ as an upper bound:
$$b_l = \frac{c}{E(1 - e^{-\alpha\zeta})} + q \le K^* \le b_u = \frac{E\big((c + k)\, e^{-\alpha\zeta}\big)}{E(1 - e^{-\alpha\zeta})}. \tag{4.3}$$

It is a well-known technique to replace the minimization problem (4.1) by an equivalent maximization problem. Observing that $K_\tau = EZ_\tau / E(1 - e^{-\alpha\tau}) \ge K^*$ is equivalent to $K^* E(1 - e^{-\alpha\tau}) - EZ_\tau \le 0$, where equality holds for an optimal stopping time, one has the maximization problem:

Find $\sigma \in C^{\mathbb{F}}$ with $EY_\sigma = \sup\{EY_\tau : \tau \in C^{\mathbb{F}}\} = 0$, where
$$Y_t = K^*(1 - e^{-\alpha t}) - Z_t \quad \text{and} \quad K^* = \inf\{K_\tau : \tau \in C^{\mathbb{F}}\}. \tag{4.4}$$

This new stopping problem can be solved by means of the semimartingale representation of the process $Y = (Y_t)_{t \in \mathbb{R}_+}$: for $t \in (0, \zeta]$,
$$Y_t = -c + \int_0^t I_{\{\zeta > s\}}\,\alpha e^{-\alpha s}\,(K^* - r_s)\, ds - R_t.$$
So if the cost rate $r$ crosses $K^*$ only once from below, then it is optimal to stop the first time $r$ hits $K^*$, since $ER_\tau = 0$ for all $\tau \in C^{\mathbb{F}}$. If $r$ has this monotonicity property, then instead of considering all stopping times $\tau \in C^{\mathbb{F}}$, one may restrict the search for an optimal stopping time to the class of indexed stopping times
$$\rho_x = \inf\{t \in \mathbb{R}_+ : r_t \ge x\} \wedge \zeta, \qquad x \in \mathbb{R}, \quad \inf\emptyset = \infty. \tag{4.5}$$
From $EY_\sigma = 0$ it then follows that the optimal stopping level $x^*$ is given by
$$x^* = \inf\{x \in \mathbb{R} : x\, E(1 - e^{-\alpha\rho_x}) - EZ_{\rho_x} \ge 0\}. \tag{4.6}$$

These observations are summarized in the following theorem.


Theorem 4.1. Assume that $Z$ has the semimartingale representation (4.2) and let $\rho_x$, $x \in \mathbb{R}$, and $x^*$ be defined as above in (4.5) and (4.6), respectively. If the cost rate $r$ has non-decreasing or bathtub-shaped paths with $r_0 \le b_l$ on $[0, \zeta)$ ($P$-a.s.), then $\sigma = \rho_{x^*}$ is an optimal stopping time and $x^* = K^*$.

Remark 4.1. The condition that $r$ has non-decreasing or bathtub-shaped paths (which decrease first and then increase) can be relaxed. It is only required that the paths of $r$ cross the value $K^*$ from below at most once. Since $K^*$ is unknown in advance, a monotonicity condition on the paths of $r$ can be imposed which relies on the bounds $b_l$ and $b_u$ of $K^*$: for all $t, h \in \mathbb{R}_+$, $r_t \ge b_l$ implies $r_{t+h} \ge r_t \wedge b_u$. The class of functions obeying this condition includes the monotone and the bathtub-shaped functions. If no such monotonicity condition holds, $x^*$ is at least an upper bound for $K^*$: $K^* \le x^* \le b_u$.
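For a deterministic, increasing failure rate the rule $\rho_x$ reduces to an age-replacement policy, and the fixed-point property $x^* = K^*$ of Theorem 4.1 can be observed numerically. A minimal sketch under illustrative assumptions (the hazard $\lambda_t = bt$ and all parameter values are my own choices, with penalty $k = 1$):

```python
import math

def discounted_cost(a, c=1.0, k=1.0, alpha=0.08, b=0.5, n=4000):
    """K_rho for the age-replacement rule rho = a ∧ zeta, where zeta has
    hazard rate lambda(t) = b*t, replacement cost c, failure penalty k,
    discount rate alpha; expectations computed by midpoint integration."""
    surv = lambda t: math.exp(-b * t * t / 2.0)          # P(zeta > t)
    dt = a / n
    # E[e^{-alpha*zeta}; zeta <= a], with density f(t) = b*t*surv(t)
    ef = sum(math.exp(-alpha * t) * b * t * surv(t) * dt
             for t in (dt * (i + 0.5) for i in range(n)))
    ea = math.exp(-alpha * a) * surv(a)                  # E[e^{-alpha*a}; zeta > a]
    return (c * ea + (c + k) * ef) / (1.0 - ea - ef)     # E Z_rho / E(1 - e^{-alpha*rho})

# crude grid search for the optimal replacement age a*
ages = [0.25 * i for i in range(1, 81)]
a_star = min(ages, key=discounted_cost)
K_star = discounted_cost(a_star)
# Theorem 4.1: at the optimum the cost rate r_t = -c + lambda(t)*k/alpha
# equals K*, so r(a_star) should be close to K_star (up to grid error).
r_at_optimum = -1.0 + (0.5 * a_star) / 0.08
```

The grid minimizer satisfies $r_{a^*} \approx K^*$, which is exactly the statement that the optimal level $x^*$ coincides with the cost minimum $K^*$.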
The possibility of observing the system on different information levels shall also be used for solving the stopping problem. Let $\mathbb{A}$ be a subfiltration of $\mathbb{F}$. The idea is then to use the projection $\hat Z$ of $Z$ to the $\mathbb{A}$-level and to apply the optimization technique described above to $\hat Z$. Of course, on the lower information level the cost minimum is increased,
$$\inf\{K_\tau : \tau \in C^{\mathbb{A}}\} \ge \inf\{K_\tau : \tau \in C^{\mathbb{F}}\},$$
and the question to what extent the information level influences the cost minimum has to be investigated.

Considerations are now restricted to coherent monotone systems with random component lifetimes $X_i > 0$, $i = 1, 2, \ldots, n$, $n \in \mathbb{N}$, and structure function $\Phi : \{0,1\}^n \to \{0,1\}$ as described in Section 1.1. The system lifetime $\zeta$ is given by $\zeta = \inf\{t \in \mathbb{R}_+ : \Phi_t = 1\}$, where $\Phi_t = \Phi(I_{\{X_1 \le t\}}, I_{\{X_2 \le t\}}, \ldots, I_{\{X_n \le t\}}) =$

$I_{\{\zeta \le t\}}$ indicates the state of the system at time $t$. It is assumed that $\Phi_t$ admits a semimartingale representation with failure rate process $\lambda$ with respect to the filtration $\mathbb{F}$ generated by the component lifetimes:
$$\mathcal{F}_t = \sigma(I_{\{X_1 \le s\}}, I_{\{X_2 \le s\}}, \ldots, I_{\{X_n \le s\}},\ 0 \le s \le t).$$


In the following, the effect of partial information is considered only for the case that no single component, or only some of the $n$ components, are observed, say those with indices in a subset $\{i_1, i_2, \ldots, i_r\} \subset \{1, 2, \ldots, n\}$, $r \le n$. The subfiltration $\mathbb{A}$ is then generated by $\zeta$, or by $\zeta$ and the corresponding component lifetimes, respectively. The projection theorem yields a representation on the corresponding observation level:
$$\hat\Phi_t = E(I_{\{\zeta \le t\}} \mid \mathcal{A}_t) = I_{\{\zeta \le t\}} = \int_0^t I_{\{\zeta > s\}}\,\hat\lambda_s\, ds + \hat M_t.$$
The $\mathbb{A}$-failure rate $\hat\lambda_t = E(\lambda_t \mid \mathcal{A}_t)$ can be inserted in (4.2) to give the $\mathbb{A}$-representation $\hat Z$ of $Z$. The optimal stopping time can then be determined as in the theorem above, with $Z$ replaced by $\hat Z$.

4.3 A Parallel System with Two Dependent Components

A two-component parallel system is considered now to demonstrate how


the optimal replacement rule can be determined explicitly. It is assumed

Fig. 4.1. Two-component parallel system

that the two component lifetimes $X_1$ and $X_2$ follow a bivariate exponential distribution. There are many multivariate extensions of the univariate exponential distribution; for an overview see Hutchinson and Lai (1990) or Basu (1988). But it seems that only the models of Freund (1961) and Marshall and Olkin (1967) are physically motivated, and that some other models are more or less mere formal generalizations of one-dimensional distributions.

The idea behind Freund's model is that after failure of one component in a two-component parallel system, the stress placed on the surviving component is changed. As long as both components work, the lifetimes follow independent exponential distributions with parameters $\beta_1$ and $\beta_2$. When one of the components fails, the parameter of the surviving component is switched to $\bar\beta_1$ or $\bar\beta_2$, respectively.
Marshall and Olkin proposed a bivariate exponential distribution for a
two-component system where the components are subjected to shocks. The
components may fail separately or both at the same time due to such shocks.
This model includes the possibility of a common cause of failure which de-
stroys the whole system at once.
As a combination of these two models the following bivariate distribution can be derived. Let the pair $(Y_1, Y_2)$ of random variables be distributed according to the model of Freund, and let $Y_{12}$ be another positive random variable, independent of $Y_1$ and $Y_2$, exponentially distributed with parameter $\beta_{12}$. Then $(X_1, X_2)$ with $X_1 = Y_1 \wedge Y_{12}$, $X_2 = Y_2 \wedge Y_{12}$ is said to follow a combined exponential distribution. For brevity the notation $\gamma_i = \beta_1 + \beta_2 - \bar\beta_i$, $i \in \{1, 2\}$, and $\beta = \beta_1 + \beta_2 + \beta_{12}$ is introduced. The survival function
$$\bar F(x, y) = P(X_1 > x, X_2 > y) = P(Y_1 > x, Y_2 > y)\, P(Y_{12} > x \vee y)$$
is then given by
$$\bar F(x, y) = \begin{cases} \dfrac{\beta_1}{\gamma_2}\, e^{-\gamma_2 x - (\bar\beta_2 + \beta_{12})y} - \dfrac{\bar\beta_2 - \beta_2}{\gamma_2}\, e^{-\beta y} & \text{for } x \le y,\\[8pt] \dfrac{\beta_2}{\gamma_1}\, e^{-\gamma_1 y - (\bar\beta_1 + \beta_{12})x} - \dfrac{\bar\beta_1 - \beta_1}{\gamma_1}\, e^{-\beta x} & \text{for } x > y, \end{cases} \tag{4.7}$$

where here and in the following $\gamma_i \ne 0$, $i \in \{1, 2\}$, is assumed. For $\beta_i = \bar\beta_i$ this formula reduces to the Marshall-Olkin distribution, and for $\beta_{12} = 0$, (4.7) gives the Freund distribution. A detailed derivation, statistical properties and methods of parameter estimation of this combined exponential distribution can be found in Heinrich and Jensen (1995). From (4.7) the distribution $H$ of the system lifetime $\zeta = X_1 \vee X_2$ can be obtained:

$$H(t) = P(\zeta \le t) = P(X_1 \le t,\, X_2 \le t) = 1 - \frac{\beta_2}{\gamma_1}\, e^{-(\bar\beta_1 + \beta_{12})t} - \frac{\beta_1}{\gamma_2}\, e^{-(\bar\beta_2 + \beta_{12})t} + \frac{\beta_1\bar\beta_2 + \beta_2\bar\beta_1 - \bar\beta_1\bar\beta_2}{\gamma_1\gamma_2}\, e^{-\beta t}. \tag{4.8}$$
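The construction and formula (4.8) can be cross-checked by simulation. In the sketch below (function names and the Monte Carlo set-up are my own; the parameter values are those used in the tables of Section 4.3.4), $(Y_1, Y_2)$ is sampled through the sequential description of Freund's model, the common shock $Y_{12}$ is superimposed, and the empirical distribution of $\zeta = X_1 \vee X_2$ is compared with $H$:

```python
import math, random

def sample_combined(beta1, beta2, bb1, bb2, beta12, rng):
    """One draw of (X1, X2): (Y1, Y2) from Freund's model plus an
    independent common shock Y12 ~ Exp(beta12), X_i = Y_i ∧ Y12."""
    # Freund: first failure after Exp(beta1+beta2); it is component 1
    # with probability beta1/(beta1+beta2); the survivor then runs at
    # its switched rate bb_j (memorylessness).
    t1 = rng.expovariate(beta1 + beta2)
    if rng.random() < beta1 / (beta1 + beta2):
        y1, y2 = t1, t1 + rng.expovariate(bb2)
    else:
        y2, y1 = t1, t1 + rng.expovariate(bb1)
    y12 = rng.expovariate(beta12)
    return min(y1, y12), min(y2, y12)

def H(t, beta1, beta2, bb1, bb2, beta12):
    """System-lifetime distribution (4.8) for zeta = max(X1, X2)."""
    g1, g2 = beta1 + beta2 - bb1, beta1 + beta2 - bb2
    b = beta1 + beta2 + beta12
    return (1.0 - beta2 / g1 * math.exp(-(bb1 + beta12) * t)
                - beta1 / g2 * math.exp(-(bb2 + beta12) * t)
                + (beta1 * bb2 + beta2 * bb1 - bb1 * bb2) / (g1 * g2)
                  * math.exp(-b * t))
```

Note that $H(0) = 0$ holds identically, since $\beta_2/\gamma_1 + \beta_1/\gamma_2 - (\beta_1\bar\beta_2 + \beta_2\bar\beta_1 - \bar\beta_1\bar\beta_2)/(\gamma_1\gamma_2) = 1$.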
According to the procedure described in the last section, the optimization problem (4.4) will be solved for three different information levels:
- Complete information about $X_1$, $X_2$ (and $\zeta$). The corresponding filtration $\mathbb{F}$ is generated by both component lifetimes:
$$\mathcal{F}_t = \sigma(I_{\{X_1 \le s\}},\, I_{\{X_2 \le s\}},\ 0 \le s \le t), \qquad t \in \mathbb{R}_+.$$

- Information about $X_1$ and $\zeta$. The corresponding filtration $\mathbb{A}$ is generated by one component lifetime, say $X_1$, and the system lifetime:
$$\mathcal{A}_t = \sigma(I_{\{X_1 \le s\}},\, I_{\{\zeta \le s\}},\ 0 \le s \le t), \qquad t \in \mathbb{R}_+.$$
- Information about $\zeta$. The filtration generated by $\zeta$ is denoted by $\mathbb{B}$:
$$\mathcal{B}_t = \sigma(I_{\{\zeta \le s\}},\ 0 \le s \le t), \qquad t \in \mathbb{R}_+.$$


In the following it is assumed that $\beta_i \le \bar\beta_i$, $i \in \{1, 2\}$, and $\bar\beta_1 \le \bar\beta_2$, i.e. after failure of one component the stress placed on the surviving one is increased. Without loss of generality the penalty cost for replacements after failures is set to $k = 1$. The solution of the stopping problem is sketched in the following; details are contained in Heinrich and Jensen (1996).
4.3.1 Complete information about $X_1$, $X_2$ and $\zeta$. It can be shown that the failure rate process $\lambda$ on the $\mathbb{F}$-observation level is given by
$$\lambda_t = \beta_{12} + \bar\beta_2\, I_{\{X_1 < t \le X_2\}} + \bar\beta_1\, I_{\{X_2 < t \le X_1\}}.$$


The bounds (4.3) for the stopping value $K^*$ are
$$b_l = \frac{cv}{1 - v} + \frac{\beta_{12}}{\alpha} \qquad \text{and} \qquad b_u = \frac{(c + 1)\,v}{1 - v},$$
where $v = E(e^{-\alpha\zeta})$ can be determined by means of the distribution $H$. Since the failure rate process is monotone, the optimal stopping time can be found among the control-limit rules $\rho_x = \inf\{t \in \mathbb{R}_+ : r_t \ge x\} \wedge \zeta$:

$$\rho_x = \begin{cases} 0 & \text{for } x \le \dfrac{\beta_{12}}{\alpha} - c,\\[6pt] X_1 \wedge X_2 & \text{for } \dfrac{\beta_{12}}{\alpha} - c < x \le \dfrac{\bar\beta_1 + \beta_{12}}{\alpha} - c,\\[6pt] X_1 & \text{for } \dfrac{\bar\beta_1 + \beta_{12}}{\alpha} - c < x \le \dfrac{\bar\beta_2 + \beta_{12}}{\alpha} - c,\\[6pt] \zeta & \text{for } x > \dfrac{\bar\beta_2 + \beta_{12}}{\alpha} - c. \end{cases}$$

The optimal control limit $x^*$ is the solution of the equation $x\, E(1 - e^{-\alpha\rho_x}) - EZ_{\rho_x} = 0$. Since the optimal value $x^*$ lies between the bounds $b_l$ and $b_u$, considerations can be restricted to the cases $x \ge b_l > \beta_{12}/\alpha - c$. In the first case, $\beta_{12}/\alpha - c < x \le (\bar\beta_1 + \beta_{12})/\alpha - c$, one has $\rho_x = X_1 \wedge X_2$ and
$$E(1 - e^{-\alpha\rho_x}) = \frac{\alpha}{\beta + \alpha}, \qquad EZ_{\rho_x} = c\, E(e^{-\alpha\rho_x}) + E\big(I_{\{\zeta \le \rho_x\}}\, e^{-\alpha\rho_x}\big) = c\,\frac{\beta}{\beta + \alpha} + \frac{\beta_{12}}{\beta + \alpha}.$$

The solution of the equation $x\,\frac{\alpha}{\beta + \alpha} - \big(c\,\frac{\beta}{\beta + \alpha} + \frac{\beta_{12}}{\beta + \alpha}\big) = 0$ is
$$x_1^* = \frac{1}{\alpha}\,(c\beta + \beta_{12}) \qquad \text{if } \frac{\beta_{12}}{\alpha} - c < x_1^* \le \frac{\bar\beta_1 + \beta_{12}}{\alpha} - c,$$
or, equivalently, if $0 < c \le c_1 := \frac{\bar\beta_1}{\beta + \alpha}$.

The remaining two cases, $(\bar\beta_1 + \beta_{12})\alpha^{-1} - c < x \le (\bar\beta_2 + \beta_{12})\alpha^{-1} - c$ and $x > (\bar\beta_2 + \beta_{12})\alpha^{-1} - c$, are treated in a similar manner. After some extensive calculations the following solution of the stopping problem is derived:
$$\rho_{x^*} = \begin{cases} X_1 \wedge X_2 & \text{for } 0 < c \le c_1,\\ X_1 & \text{for } c_1 < c \le c_2,\\ \zeta & \text{for } c_2 < c, \end{cases} \qquad x^* = \begin{cases} x_1^* & \text{for } 0 < c \le c_1,\\ x_2^* & \text{for } c_1 < c \le c_2,\\ x_3^* & \text{for } c_2 < c. \end{cases}$$
The explicit formulas for the optimal stopping value were only presented here to show how the procedure works, and that even in seemingly simple cases extensive calculations are necessary. The essential conclusion can be drawn from the structure of the optimal policy. For small values of $c$ (note that the penalty cost for failures is $k = 1$) it is optimal to stop and replace the system at the first component failure. For mid-range values of $c$ the replacement should take place when the "better" component with the lower residual failure rate ($\bar\beta_1 \le \bar\beta_2$) fails. If the "worse" component fails first, this results in an intentional replacement after system failure. For high values of $c$ preventive replacements do not pay, and it is optimal to wait until system failure. In this case the optimal stopping value equals the upper bound, $x^* = b_u$.
4.3.2 Information about $X_1$ and $\zeta$. The failure rate process corresponding to this observation level $\mathbb{A}$ can be computed explicitly (see Heinrich and Jensen 1996). Its paths depend only on the observable component lifetime $X_1$, as required, and not on $X_2$. The paths are non-decreasing,

so that the same procedure as in Section 4.3.1 can be applied. For $\gamma_1 = \beta_1 + \beta_2 - \bar\beta_1 > 0$ the following results can be obtained:

$$\rho_{x^*} = \begin{cases} X_1 \wedge b^* & \text{for } 0 < c \le c_1,\\ X_1 & \text{for } c_1 < c \le c_2,\\ \zeta & \text{for } c_2 < c, \end{cases} \qquad x^* = \begin{cases} \tilde x_1^* & \text{for } 0 < c \le c_1,\\ x_2^* & \text{for } c_1 < c \le c_2,\\ x_3^* & \text{for } c_2 < c. \end{cases}$$
The constants $c_1$, $c_2$ and the stopping values $x_2^*$, $x_3^*$ are the same as in Section 4.3.1: what is optimal on a higher information level and can be observed on a lower information level must be optimal on the latter, too. So only the case $0 < c \le c_1$ is new. In this case the optimal replacement time is $X_1 \wedge b^*$ with a constant $b^*$, which is the unique solution of the equation

$$d_1\, \exp\{\gamma_1 b^*\} + d_2\, \exp\{-(\bar\beta_1 + \beta_{12} + \alpha)\, b^*\} + d_3 = 0.$$
The constants $d_i$, $i \in \{1, 2, 3\}$, are lengthy expressions in $\alpha$ and the $\beta$'s and $\gamma$'s and are therefore not presented here (see Heinrich and Jensen 1996). The values of $b^*$ and $\tilde x_1^*$ have to be determined numerically. For $\gamma_1 < 0$ a similar result can be obtained.
4.3.3 Information about $\zeta$. On this lowest level $\mathbb{B}$ no additional information about the state of the components is available up to the time of system failure. The failure rate is deterministic and can be derived from the distribution $H$ in (4.8):
$$\lambda_t = -\frac{d}{dt}\,\ln(1 - H(t)).$$
In this case the replacement times $\rho_x = \zeta \wedge b$, $b \in \mathbb{R}_+ \cup \{\infty\}$, are the well-known age-replacement policies. Even if $\lambda$ is not monotone, such a policy is optimal on this $\mathbb{B}$-level. The optimal values $b^*$ and $x^*$ have to be determined by minimizing $K_{\rho_x}$ as a function of $b$.
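The $\mathbb{B}$-level failure rate can be evaluated directly from (4.8). A small numerical sketch (function names are my own; parameter values as in the tables below) recovers $\lambda_t$ by differencing and confirms that $\lambda_0 = \beta_{12}$ (a brand-new parallel system can only be killed by the common shock) and that, for these parameter values, $\lambda_t \to \bar\beta_1 + \beta_{12}$ as $t \to \infty$ (the slowest-decaying term of $1 - H$):

```python
import math

def survival(t, b1, b2, bb1, bb2, b12):
    """1 - H(t) for zeta = X1 v X2, taken from (4.8)."""
    g1, g2 = b1 + b2 - bb1, b1 + b2 - bb2
    beta = b1 + b2 + b12
    return (b2 / g1 * math.exp(-(bb1 + b12) * t)
            + b1 / g2 * math.exp(-(bb2 + b12) * t)
            - (b1 * bb2 + b2 * bb1 - bb1 * bb2) / (g1 * g2)
              * math.exp(-beta * t))

def hazard(t, *pars, h=1e-6):
    """lambda_t = -(d/dt) ln(1 - H(t)), by central difference."""
    return (math.log(survival(t - h, *pars))
            - math.log(survival(t + h, *pars))) / (2.0 * h)
```

The deterministic hazard is non-decreasing here, so even the simple age-replacement policy of this level can be optimized by the control-limit argument of Theorem 4.1.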
4.3.4 Numerical Examples. The following tables show the effects of changes of two parameters, the replacement cost parameter $c$ and the "dependence parameter" $\beta_{12}$. To be able to compare the cost minima $K^* = x^*$, both tables refer to the same set of parameters: $\beta_1 = 1$, $\beta_2 = 3$, $\bar\beta_1 = 1.5$, $\bar\beta_2 = 3.5$, $\alpha = 0.08$. The optimal replacement times are denoted:
a: $\rho_{x^*} = X_1 \wedge X_2$,  b: $\rho_{x^*} = X_1$,  c: $\rho_{x^*} = X_1 \wedge b^*$,  d: $\rho_{x^*} = \zeta \wedge b^*$,  e: $\rho_{x^*} = \zeta = X_1 \vee X_2$.

Table 4.1. $\beta_1 = 1$, $\beta_2 = 3$, $\beta_{12} = 0.5$, $\bar\beta_1 = 1.5$, $\bar\beta_2 = 3.5$, $\alpha = 0.08$. Entries: lower bound $b_l$, cost minimum $x^*$ with optimal rule on the information levels $\mathbb{F}$, $\mathbb{A}$, $\mathbb{B}$, and upper bound $b_u$.

  c       b_l       F-level      A-level      B-level      b_u
  0.01    6.453     6.813 a      9.910 c     11.003 d     20.506
  0.10    8.280    11.875 a     17.208 c     19.678 d     22.333
  0.50   16.402    28.543 b     28.543 b     30.455 e     30.455
  1.00   26.553    39.764 b     39.764 b     40.606 e     40.606
  2.00   46.856    60.900 e     60.900 e     60.900 e     60.900

Table 4.2. $\beta_1 = 1$, $\beta_2 = 3$, $\bar\beta_1 = 1.5$, $\bar\beta_2 = 3.5$, $c = 0.1$, $\alpha = 0.08$. Same column layout as Table 4.1, now varying the dependence parameter $\beta_{12}$.

  beta_12   b_l        F-level       A-level       B-level       b_u
   0.00      1.505      5.000 a      10.739 c      13.231 d      16.552
   0.10      2.859      6.375 a      12.032 c      14.520 d      17.698
   1.00     15.067     18.750 a      23.688 c      26.132 d      28.235
  10.00    138.106    142.500 b     142.500 b     144.168 e     144.168
  50.00    687.677    689.448 e     689.448 e     689.448 e     689.448

Table 4.1 shows the cost minima $x^*$ for different values of $c$. For small values of $c$ the influence of the information level is greater than for moderate values. For $c > 1.394$ preventive replacements do not pay, and additional information concerning $\zeta$ is not profitable. Table 4.2 shows how the cost minimum depends on the parameter $\beta_{12}$. For increasing values of $\beta_{12}$ the difference between the cost minima on the different information levels decreases, because the probability of a common failure of both components increases, and therefore extra information about a single component is not profitable.
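Reading the unlabeled outer columns of the tables as the bounds $b_l$ and $b_u$ of (4.3), the entries can be reproduced from the closed-form Laplace transform of $\zeta$: with $\bar H(t) = 1 - H(t)$ from (4.8), $v = E(e^{-\alpha\zeta}) = 1 - \alpha \int_0^\infty e^{-\alpha t}\,\bar H(t)\, dt$, and the integral is elementary. A quick check (my own function names):

```python
import math

def v_discount(alpha, b1, b2, bb1, bb2, b12):
    """v = E e^{-alpha*zeta} = 1 - alpha * int_0^inf e^{-alpha t}(1 - H(t)) dt,
    with 1 - H(t) taken term by term from (4.8)."""
    g1, g2 = b1 + b2 - bb1, b1 + b2 - bb2
    beta = b1 + b2 + b12
    integral = (b2 / g1 / (bb1 + b12 + alpha)
                + b1 / g2 / (bb2 + b12 + alpha)
                - (b1 * bb2 + b2 * bb1 - bb1 * bb2) / (g1 * g2) / (beta + alpha))
    return 1.0 - alpha * integral

alpha, c, b12 = 0.08, 0.01, 0.5
v = v_discount(alpha, 1.0, 3.0, 1.5, 3.5, b12)
b_lower = c * v / (1.0 - v) + b12 / alpha   # b_l = c*v/(1-v) + beta12/alpha
b_upper = (c + 1.0) * v / (1.0 - v)         # b_u = (c+1)*v/(1-v), with k = 1
```

With the parameters of Table 4.1 and $c = 0.01$ this gives $b_l \approx 6.453$ and $b_u \approx 20.506$, matching the first row of the table.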

References

Arjas, E.: A Stochastic Process Approach to Multivariate Reliability Systems: No-


tions Based on Conditional Stochastic Order. Mathematics of Operations Re-
search 6, 263-276 (1981a)
Arjas, E.: The Failure and Hazard Processes in Multivariate Reliability Systems.
Mathematics of Operations Research 6, 551-562 (1981b)
Arjas, E.: Survival Models and Martingale Dynamics. Scand. J. Statist. 16, 177-225 (1989)
Arjas, E.: Information and Reliability: A Bayesian Perspective. In: Barlow, R.,
Clarotti, C., Spizzichino, F. (eds.): Reliability and Decision Making. London:
Chapman and Hall 1993, pp. 115-135
Arjas, E., Norros, I.: Change of Life Distribution via Hazard Transformation: An
Inequality with Application to Minimal Repair. Mathematics of Operations
Research 14, 355-361 (1989)

Aven, T.: Optimal Replacement Under a Minimal Repair Strategy - A General


Failure Model. Adv. Appl. Prob. 15, 198-211 (1983)
Aven, T.: Reliability and Risk Analysis. London: Elsevier 1992
Aven, T.: Availability Analysis of Monotone Systems. In this volume (1996a), pp.
206-223
Aven T.: Optimal Replacement of Monotone Repairable Systems. In this volume
(1996b), pp. 224-238
Aven, T., Bergman, B.: Optimal Replacement Times, a General Set-up. J. Appl.
Prob. 23, 432-442 (1986)
Barlow, R., Hunter, L.: Optimum Preventive Maintenance Policies. Operations Res.
8, 90-100 (1960)
Barlow, R., Proschan, F.: Statistical Theory of Reliability and Life Testing. New
York: Holt, Rinehart and Winston 1975
Basu, A.: Multivariate Exponential Distributions and Their Applications in Relia-
bility. In: Krishnaiah, P. R., Rao, C. R. (eds.): Handbook of Statistics 7. Quality
Control and Reliability. Amsterdam: North-Holland 1988, pp. 99-111
Beichelt, F.: A Unifying Treatment of Replacement Policies with Minimal Repair.
Nav. Res. Log. Q. 40, 51-67 (1993)
Bergman, B.: On Reliability Theory and Its Applications. Scand. J. Statist. 12,
1-41 (1985)
Block, H., Borges, W., Savits, T.: Age-Dependent Minimal Repair. J. Appl. Prob.
22, 370-385 (1985)
Brémaud, P.: Point Processes and Queues. Martingale Dynamics. New York: Springer 1981
Çınlar, E.: Markov and Semi-Markov Models of Deterioration. In: Abdel-Hameed, M., Çınlar, E., Quinn, J. (eds.): Reliability Theory and Models. Orlando: Academic Press 1984, pp. 3-41
Çınlar, E., Özekici, S.: Reliability of Complex Devices in Random Environments. Probab. Engrg. Inform. Sci. 1, 97-115 (1987)
Çınlar, E., Shaked, M., Shanthikumar, G.: On Lifetimes Influenced by a Common Environment. Stoch. Proc. Appl. 33, 347-359 (1989)
Doksum, K.: Degradation Rate Models for Failure Time and Survival Data. CWI
Quarterly 4, 195-203 (1991)
Domine, M.: First Passage Time Distribution of a Wiener Process with Drift Concerning Two Elastic Barriers. J. Appl. Prob., to appear (1996)
Freund, J. E.: A Bivariate Extension of the Exponential Distribution. J. Amer.
Stat. Ass. 56, 971-977 (1961)
Gertsbakh, I.: Asymptotic Methods in Reliability: A Review. Adv. Appl. Prob. 16,
147-175 (1984)
Heinrich, G., Jensen, U.: Optimal Replacement Rules Based on Different Informa-
tion Levels. Nav. Res. Log. Q. 39, 937-955 (1992)
Heinrich, G., Jensen, U.: Parameter Estimation for a Bivariate Lifetime Distribution
in Reliability with Multivariate Extensions. Metrika 42, 49-65 (1995)
Heinrich, G., Jensen, U.: Bivariate Lifetime Distributions and Optimal Replace-
ment. ZOR-Methods and Models of OR. To appear (1996)
Hutchinson, T. P., Lai, C. D.: Continuous Bivariate Distributions, Emphasizing
Applications. Adelaide: Rumbsby Scientific Publishing 1990
Jensen, U.: Monotone Stopping Rules for Stochastic Processes in a Semimartingale Representation with Applications. Optimization 20, 837-852 (1989)
Jensen, U.: A General Replacement Model. ZOR-Methods and Models of Opera-
tions Research 34, 423-439 (1990)
Jensen, U.: An Optimal Stopping Problem in Risk Theory. Scand. Act. J., to appear (1996)

Jensen, U., Hsu, G.: Optimal Stopping by means of Point Process Observations with
Applications in Reliability. Mathematics of Operations Research 18, 645-657
(1993)
Koch, G.: A Dynamical Approach to Reliability Theory. Proc. Int. School of Phys.
"Enrico Fermi", XCIV. Amsterdam: North-Holland 1986, pp. 215-240
Marshall, A. W., Olkin, I.: A Multivariate Exponential Distribution. J. Amer. Stat. Ass. 62, 30-44 (1967)
Natvig, B.: Reliability: Importance of Components. In: Kotz, S., Johnson, N. (eds.):
Encyclopedia of Statistical Sciences 8. New York: Wiley 1988, pp. 17-20
Natvig, B.: On Information-Based Minimal Repair and the Reduction in Remaining System Lifetime due to the Failure of a Specific Module. J. Appl. Prob. 27, 365-375 (1990)
Özekici, S.: Complex Systems in Random Environments. In this volume (1996a), pp. 137-157
Özekici, S.: Optimal Replacement of Complex Systems. In this volume (1996b), pp. 158-169
Phelps, R.: Optimal Policy for Minimal Repair. J. Opl. Res. 34, 425-427 (1983)
Pierskalla, W., Voelker, J.: A Survey of Maintenance Models: The Control and
Surveillance of Deteriorating Systems. Nav. Res. Log. Q. 23, 353-388 (1976)
Rogers, C., Williams, D.: Diffusions, Markov Processes and Martingales. Vol. 1, 2nd
ed. Chichester: Wiley 1994
Shaked, M., Shanthikumar, G.: Multivariate Imperfect Repair. Oper. Res. 34, 437-
448 (1986)
Shaked, M., Shanthikumar, G.: Reliability and Maintainability. In: Heyman, D., Sobel, M. (eds.): Stochastic Models. Vol. 2. Amsterdam: North-Holland 1990, pp. 653-713
Shaked, M., Shanthikumar, G.: Dynamic Multivariate Aging Notions in Reliability Theory. Stoch. Proc. Appl. 38, 85-97 (1991)
Sherif, Y., Smith, M.: Optimal Maintenance Models for Systems Subject to Failure.
A Review. Nav. Res. Log. Q. 28, 47-74 (1981)
Stadje, W., Zuckerman, D.: Optimal Maintenance Strategies for Repairable Systems with General Degree of Repair. J. Appl. Prob. 28, 384-396 (1991)
Valdez-Flores, C., Feldman, R.: A Survey of Preventive Maintenance Models for
Stochastically Deteriorating Single-Unit Systems. Nav. Res. Log. Q. 36, 419-
446 (1989)
Van der Duyn Schouten, F.: Maintenance Policies for Multicomponent Systems: An
Overview. In this volume (1996), pp. 117-136
Yashin, A., Arjas, E.: A Note on Random Intensities and Conditional Survival Functions. J. Appl. Prob. 25, 630-635 (1988)
Fatigue Crack Growth
Erhan Çınlar
Department of Civil Engineering and Operations Research, Princeton University,
Princeton, NJ 08544, USA

Summary. Metal fatigue is a major cause of failure of mechanical and structural components. We review the fracture mechanics of fatigue and the Paris-Erdogan law for the mean behavior. After a consideration of the experimental data reported by Virkler et al. (1979), we propose a continuous semimarkov process to model crack growth. The model accounts for the material randomness and views the crack as a motion in a random field.

Keywords. Fatigue, crack growth, semimarkov process, Ornstein-Uhlenbeck fields,


gamma processes, Markov additive processes

1. Introduction

Metal fatigue is a major cause for failure of mechanical and structural compo-
nents. It is widely recognized to be a random phenomenon, two main reasons
being the randomness in stress loading and the random variations in the
material properties.
From the point of view of fracture mechanics, the fatigue damage of a
component is measured by the size of the dominant crack, and the failure
is defined to occur when that crack's size reaches a critical magnitude. The
cracks in question are in the range from 10 to 50 mm, so that micromechanical
considerations need not be made explicit. Indeed, micro level considerations
will be used to derive, through probabilistic reasoning, the likely laws for the
growth of macroscopic cracks.
Our model for crack growth is based on two considerations. First, the effect of material inhomogeneities is viewed as a Markov random field on the plane. Second, the motion of the crack tip is seen as a continuous increasing semimarkov process in that Markov random field. As such, our model is a refinement of all models known to us except one. In fact, as we shall make clear, the better-accepted models in the literature are averaged versions of ours. However, our model is best described in terms of the process of primary interest, namely, the random time $T_a$ it takes the crack to reach size $a$. Our model is that it is infinitesimally a gamma process given the Markov random field for the strength of the material.
In Section 2, we review the deterministic equation of Paris and Erdogan (1963), which is based on fracture-mechanical considerations. In Section 3, we describe the results of the experimental work by Virkler et al. (1979), which seems to be the only work suitable for probabilistic reasoning. In Section 4, we review various stochastic models proposed and the insights to be gained from them. In Section 5, we describe our model, justify its basic assumptions, and point out its relationships to earlier models. Finally, in Section 6, we present a summary of our model and some results.

2. Deterministic Analysis

Early work on fatigue crack growth has been from the point of view of fracture mechanics. In general, all such work treats cracks in infinite sheets subjected to a uniform stress perpendicular to the crack. It relates the crack length, $2a$, to the number $N$ of cycles of load applied, with the stress range $\sigma$ and some material constants $c_i$.
The single form in which all crack propagation laws can be written is
$$\frac{da}{dN} = f(\sigma, a, c_i), \tag{2.1}$$
and the problem is to figure out the function $f$. Note, incidentally, the treatment of $N$ as a continuous variable.
Paris and Erdogan (1963) present a critical evaluation of various forms of the equation proposed by earlier researchers. Here are some such models (see Paris and Erdogan (1963) for references):
$$\frac{da}{dN} = \frac{c_3\, \sigma^2 a}{c_2 - \sigma}, \tag{2.2}$$
called Head's formula;
$$\frac{da}{dN} = \frac{\sigma^3 a}{c_4}, \tag{2.3}$$
proposed by Frost and Dugdale;
$$\frac{da}{dN} = f\big(\sigma(1 + 2\sqrt{a/\rho})\big), \tag{2.4}$$
proposed by McEvilly and Illg, where $\rho$ is the end radius of the elliptical hole with semimajor axis $a$, and where $f$ is obtained empirically to be
$$f(x) = 0.00509\,x - 5.472 - \frac{34}{x - 34}.$$

From the continuum mechanics point of view, the essential factor is the
stress-intensity factor k. The latter reflects the effect of external load and
configuration on intensity of the whole stress field around the crack tip. More-
over, for various configurations, the crack tip stress fields have the same form.
Therefore, the stress-intensity factor k should control the rate of crack ex-
tension, that is, the law should be

    da/dN = C_5 k^p ,                                         (2.5)

where p is a material constant. Since k = σ√a in general, this is close to
(2.2).

Fatigue Crack Growth 39
Reviewing all these and still others, Paris and Erdogan (1963) propose
that the general model is

    da/dN = C_0 σ^q a^p ,                                     (2.6)

where p and q are material constants. Note that (2.6) is the general form of
all the models proposed so far, excepting (2.4).
Equation (2.6), called the Paris-Erdogan equation, is generally accepted to
be satisfactory for most purposes. Subsequent data analyses show that the
constant p is somewhere between 1 and 3, depending mostly on the material
involved.
We shall use (2.6) to guide us in figuring out the distributional require-
ments in our model. For us, (2.6) indicates the expected behavior.
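As a quick illustration of the growth behavior implied by (2.6), the equation can be integrated numerically. The sketch below uses a simple Euler scheme; the constants C0, sigma, q, p and the initial size are illustrative choices of ours, not values fitted to any material.

```python
def paris_erdogan_path(a0, cycles, dN, C0=1e-9, sigma=50.0, q=2.0, p=1.5):
    """Euler integration of the Paris-Erdogan law da/dN = C0 * sigma**q * a**p.

    All constants are illustrative placeholders, not fitted material values.
    Returns the list of (cycle count, crack size) points."""
    a, n = a0, 0
    path = [(n, a)]
    while n < cycles:
        a += C0 * sigma**q * a**p * dN   # da = f(sigma, a) * dN
        n += dN
        path.append((n, a))
    return path

# half crack size 9.00 mm, 100,000 cycles in steps of 1,000
path = paris_erdogan_path(a0=9.0, cycles=100000, dN=1000)
```

Because p > 1 here, the growth rate accelerates as the crack lengthens, which is the qualitative behavior the deterministic analysis predicts.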

3. Experimental Facts
A typical crack propagation test consists of a wide plate with a central crack
of some initial length 2a_0 subjected to uniform tension σ applied repeatedly.
During a single test, σ is kept constant, and the data consist of half crack
lengths a and the corresponding cycle numbers N_a (at which the crack length
becomes 2a).
Early experiments were conducted by engineers unfamiliar with statistical
or probabilistic thinking. Such experiments do not have the idea of replica-
tion and are therefore useless for our purposes. The first significant study
of random variability occurs in Virkler et al. (1979). We shall present their
results with some care, because their results are instructive and because they
have influenced the stochastic modeling of crack growth ever since.
Virkler et al. have designed their experiments with careful attention to
the random nature of the phenomenon. They have performed the test on
68 identical specimens coming from a single lot of material, an aluminum
alloy. Replicate tests were conducted under identical load and environmental
conditions: constant amplitude loading and careful experimental control,
with measurement errors of the order of 0.00141 mm (which can be ignored).
For each one of the 68 specimens, they started a crack and brought the
crack to the half size a_0 = 9.00 mm, and then they recorded the number
N_{a_i} of cycles it took to bring the crack to size a_i for 164 fixed values
a_1, a_2, ..., a_164. Thus, what we have is a random sample of size 68 from a
stochastic process {N_a , a ≥ 9.00}, each sample path having 164 data points
corresponding to N_{a_0} = 0, N_{a_1}, ..., N_{a_164}. Figure 3.1 shows the essentials of
their data. For purposes of clarity, instead of 68 paths, we have drawn only
5; instead of 164 data points (a_i, N_{a_i}), we have marked only 3 per path.
There are two qualitative observations we can make immediately: sample
paths do cross, that is, there is a good amount of randomness, but neither
[Figure: crack length a (in cm, vertical axis) against the number of cycles
N (horizontal axis, 0 to 325,000).]
Fig. 3.1. Crack length data against number of cycles

a nor N is Markovian. In fact, cracks that grow fast in the beginning seem
to grow fast at all times, modulo some randomness. There are three other
observations made by Virkler et al. We list these next.

Microscopic inhomogeneity of the material


It is best to quote from Virkler et al. (1979) directly: "Both sudden increases
and decreases in the growth rate, for varying lengths of time, were observed
as if the crack was passing through a different material possessing different
properties. It appears that the material is made up mostly of a fairly homo-
geneous material with many smaller areas located in a random fashion which
characteristically have vastly different crack propagation properties .... These
small areas obviously have a very large effect on the overall smoothness of a
versus N data set and on the total amount of scatter."

Distribution of increments of N
For each fixed crack size a_i, the data consist of a random sample of size 68.
Thus, for instance, for a_i = 38.20 mm and a_{i+1} = 38.60 mm, we have 68
independent observations of the random variable N_{i+1} − N_i, which is the
number of cycles it takes to grow the crack from size 38.20 to 38.60. The
flavor of the data is in Figure 3.2 where the horizontal axis is logarithmic
and the vertical axis is chosen so that cumulative normal distribution would
be a straight line.
As can be seen, the 3-parameter log-normal distribution is a good fit. In fact,
Virkler et al. (1979) have repeated this exercise with each a_i, i = 1, 2, ..., 163,
and tested the goodness of fit of five different distributions. Of these 163 fits,
the 3-parameter log-normal distribution had the best fit 137 times, the
3-parameter gamma distribution 16 times, the 2-parameter log-normal 7 times,
and the Weibull distribution 3 times.

[Figure: cumulative standard normal probability (vertical axis, 0.02 to 0.98)
against log N (horizontal axis).]

Fig. 3.2. Fit of the log-normal distribution to the number of cycles needed to bring
crack length to a = 3.82 cm
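The probability-plot device behind Figure 3.2 is easy to reproduce: if cycle counts are log-normal, then plotting normal quantiles of the empirical distribution function against log N should give nearly a straight line. The sketch below applies the idea to a synthetic two-parameter log-normal sample of size 68; the sample and its parameters merely stand in for, and are not, the Virkler measurements.

```python
import math, random
from statistics import NormalDist

def probability_plot_points(sample):
    """Return (log x, normal quantile of empirical CDF) pairs; for a
    log-normal sample these should fall near a straight line."""
    xs = sorted(sample)
    n = len(xs)
    nd = NormalDist()
    return [(math.log(x), nd.inv_cdf((i + 0.5) / n)) for i, x in enumerate(xs)]

rng = random.Random(2)
# synthetic log-normal "cycle counts" (parameters are arbitrary)
sample = [math.exp(rng.gauss(10.0, 0.3)) for _ in range(68)]
points = probability_plot_points(sample)
```

A straight-line pattern in these points is precisely what the vertical scale of Figure 3.2 is designed to reveal.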

Distribution of da/dN
Virkler et al. have also looked at the distribution of the slope da/dN as a
function of ΔK, the stress-intensity factor range. According to the determinis-
tic analysis, this is supposed to be a straight line when plotted on log-log
paper. Indeed the data do cluster around a line. But the statistical tests
are at best inconclusive. There were 136 different ΔK values, and for each ΔK
value 5 distributions were tested for their fit to the data for da/dN: two-
parameter normal, two-parameter log-normal, three-parameter log-normal,
three-parameter Weibull, and three-parameter gamma. The best fits were
obtained for these, respectively, in 27, 37, 26, 19, and 27 cases. In my opin-
ion, there were no random samples to fit a distribution to, and this part of
the statistical analysis is inconclusive.

Prediction of a versus N

There is a final, negative and useful result in Virkler et al. Using the distri-
butions fitted to da/dN at various ΔK levels, they have simulated 68 sample
paths for {N_a : a ≥ 9.00 mm} under the assumption that da/dN values at
different ΔK levels are independent. The simulation yields paths that cap-
ture the mean path quite well, but the variance is much smaller than with
the real data. We conclude from this that the assumed independence does
not hold true. This confirms the observed behavior, where the growth rate
shows abrupt changes.

4. Stochastic Models
An almost exhaustive survey of stochastic models that have been proposed
in the past can be found in Sobczyk and Spencer (1992). Their survey, sup-
plemented by a few recent papers, can be summarized as follows.
Stochastic models proposed have always been for the crack growth pro-
cesses parametrized by cycle counts, the latter being treated as a continuous
"time" parameter. There are three basic models, which we describe now.

Markov chain models

In many ways the simplest model was proposed by Kozin and Bogdanov;
see Kozin and Bogdanov (1989). Their model is a Markov chain {A_n : n =
0, 1, ...} where A_0 is the initial crack size and A_n is the crack size after
n cycles. This Markov chain is increasing, the state space is the set of all
positive integers, and each transition is either from a to a or from a to a + 1.
It follows that, starting at a, the crack size stays at a for a geometric random
amount of time with some mean 1/p(a) and then jumps to a + 1, stays at
a + 1 some random time with geometric distribution with mean 1/p(a + 1)
and then jumps to a + 2, and so on. Kozin and Bogdanov claim good fit to
data.
If we make the parameter continuous, so that we are talking of the crack
size A_t at time t, the model becomes a pure birth process with state space
{1, 2, ...}. The model captures the essentials of crack growth, but without
attempting to account for material inhomogeneity.
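A minimal simulation of this chain is easy to write down. The sojourn-probability function p(a) used below is an illustrative choice of ours (larger cracks advance faster), not a form given by Kozin and Bogdanov.

```python
import random

def simulate_markov_chain_crack(a0, a_max, p_of_a, rng):
    """Kozin-Bogdanov chain: from size a the crack stays put for a geometric
    number of cycles with mean 1/p(a), then jumps to a + 1.
    Returns the (cycle count, size) epochs of the jumps."""
    a, n = a0, 0
    epochs = [(n, a)]
    while a < a_max:
        sojourn = 1                      # geometric sojourn, success prob p(a)
        while rng.random() >= p_of_a(a):
            sojourn += 1
        n += sojourn
        a += 1
        epochs.append((n, a))
    return epochs

rng = random.Random(7)
# illustrative jump probability: crack advances faster as it grows
epochs = simulate_markov_chain_crack(9, 30, lambda a: min(0.9, a / 100.0), rng)
```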

Compound Poisson models

Beginning with Sobczyk (1986) and culminating in Sobczyk and Trebicki (1989),
crack size A_t at time t is modeled as

    A_t = A_0 + Σ_{i=1}^{N(t)} Y_i

where N(t) is a Poisson process and the Y_i are independent and identically
distributed random variables independent of the process N(t), t ≥ 0. In other
words, A is a compound Poisson process. Here, N(t) is viewed as the num-
ber of jumps in crack size, which presumably corresponds to the high level
exceedances in stress, and Y_i is the jump size for the ith jump. They cite
experiments performed by Kogajew and Liebiedinskij (1983) as a justifica-
tion for assuming that the distribution of Y_i be exponential with a random
parameter. See also Sobczyk and Trebicki (1991) for the same model, but
with the Y_i correlated.
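The compound Poisson model is equally easy to simulate. In the sketch below the rate lam and the mean jump size jump_mean are illustrative values chosen by us; the exponential jump distribution follows the assumption cited above.

```python
import random

def compound_poisson_crack(a0, t, lam, jump_mean, rng):
    """Crack size at time t in the compound Poisson model:
    A_t = A_0 + sum of the N(t) jumps, with N(t) ~ Poisson(lam * t)
    and i.i.d. exponential jump sizes of mean jump_mean.
    (lam and jump_mean are illustrative, not fitted, values.)"""
    n_jumps = 0
    s = rng.expovariate(lam)             # first Poisson arrival time
    while s <= t:
        n_jumps += 1
        s += rng.expovariate(lam)
    return a0 + sum(rng.expovariate(1.0 / jump_mean) for _ in range(n_jumps))

rng = random.Random(11)
sizes = [compound_poisson_crack(9.0, 10.0, lam=2.0, jump_mean=0.05, rng=rng)
         for _ in range(200)]
```

The mean size after time t is a0 + lam * t * jump_mean, here 10.0.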

Randomized Paris-Erdogan models

Numerous authors have proposed the following randomized version of the
Paris-Erdogan type equation (2.1):

    dA_t/dt = f(ΔK, K_max, S, R, A_t) X_t .                   (4.1)

Here, A_t is the crack size at time t, f is some positive function, ΔK is the
stress intensity factor range, K_max is the maximum stress intensity factor,
S is the stress amplitude, R is the stress ratio, and finally X_t is some process
to be specified.
The essential point here is that the deterministic model

    da/dt = f(ΔK, K_max, S, R, a)

is being randomized by multiplying the right hand side by some random pro-
cess X_t and replacing the deterministic function a(t) by the random process
A_t.
Different authors have argued variously regarding the process X_t; see the
references at the end of Spencer et al. (1989). Some have taken X_t = X_0 for
all t, and at the other extreme, some have taken X to be white noise. The
former, the random constant case, fails to account for random inhomogeneities
adequately. The latter should be written as a stochastic differential equation
(with everything fixed except A)
    dA_t = g(A_t) dW_t ,

where W is the Wiener process. Of course, this is unacceptable since it is
impossible to make A increasing.
The role of X_t in (4.1) should be to account for material inhomogeneities.
Thus, as was argued by Ortiz and Kiremidjian (1988) and by Spencer et al.
(1989), X_t should have the form

    X_t = Y(A_t)                                              (4.2)

where Y(a) stands for the material properties at the point a. Further, Y
should be a positive process, and they propose that

    Y(a) = e^{Z(a)} ,  a ≥ 0,                                 (4.3)

where Z is an Ornstein-Uhlenbeck process, that is, a Gaussian process with
mean 0 and covariance function

    (α²/2β) e^{−β|a−a′|} ,  a, a′ ≥ 0.                        (4.4)
Thus, in its essentials, (4.1) becomes

    dA/dt = g(A) Y(A)                                         (4.5)

where g is some positive function and Y = e^Z, where Z is Ornstein-
Uhlenbeck.
In fact, if we take g to be in the form of the Paris-Erdogan equation (2.6),
we get in summary form

    dA_t/dt = G A_t^p e^{Z(A_t)}                              (4.6)

where G is a constant, p > 0 is a constant, and Z = {Z(a); a ≥ 0} is an
Ornstein-Uhlenbeck process.
Various authors have reported good agreement with data for the model
(4.6). In preparation for the next section we note the following points regard-
ing (4.6):

i) The crack size process A is strictly increasing and differentiable.    (4.7)
ii) Given Z, the process A is deterministic.
iii) The process Z is a stationary Gaussian Markov process.
iv) The pair (A_t, Z(A_t)) is a Markov process with state space R_+ × R.

In addition, solving (4.6) for T_a, the time (or number of cycles) required to
bring the crack size from the initial size a_0 to size a, we obtain

    dT_a/da = G_0 a^{−p} e^{−Z(a)} ,  a > a_0,                (4.8)

where G_0 = 1/G. This simple observation seems to have escaped attention
so far. We note that (4.8) is in full agreement with the basic observations of
Virkler et al. (1979): for fixed crack size a, the right side has the log-normal
distribution, as observed in Figure 3.2.
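Equation (4.8) also suggests a direct simulation: sample an Ornstein-Uhlenbeck path Z on a grid of crack sizes and integrate G0 * a**(-p) * exp(-Z(a)). The sketch below does exactly that; the constants G0, p, alpha, beta and the grid are illustrative, not calibrated values.

```python
import math, random

def ou_path(a_grid, alpha, beta, rng):
    """Stationary Ornstein-Uhlenbeck path on a_grid: mean 0,
    covariance (alpha**2 / (2 * beta)) * exp(-beta * |a - a'|)."""
    sd = alpha / math.sqrt(2.0 * beta)
    z = [rng.gauss(0.0, sd)]
    for a_prev, a in zip(a_grid, a_grid[1:]):
        rho = math.exp(-beta * (a - a_prev))     # exact AR(1) transition
        z.append(rho * z[-1] + rng.gauss(0.0, sd * math.sqrt(1.0 - rho * rho)))
    return z

def cycles_to_size(a_grid, z, G0, p):
    """Left-endpoint integration of dT/da = G0 * a**(-p) * exp(-Z(a))."""
    T = 0.0
    for (a_prev, a), z_left in zip(zip(a_grid, a_grid[1:]), z):
        T += G0 * a_prev ** (-p) * math.exp(-z_left) * (a - a_prev)
    return T

rng = random.Random(3)
grid = [9.0 + 0.1 * i for i in range(411)]       # 9.0 mm to 50.0 mm
T = cycles_to_size(grid, ou_path(grid, 0.5, 1.0, rng), G0=1e5, p=2.0)
```

Repeating this over many OU paths produces the kind of between-specimen scatter seen in the Virkler data.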
5. Proposed Model
We shall let A_t denote the crack size at time t, "time" being the continuous
analog of the cumulative number of cycles. We shall take A_0 = a_0 fixed. As the
functional inverse of A, we let T_a denote the time at which the crack size
exceeds a; more precisely,

    T_a = inf {t : A_t > a} ,  a ≥ a_0 .                      (5.1)

Our model will be directly for the process T = {T_a : a ≥ a_0}, but we shall
motivate and justify our model by considerations on A = {A_t : t ≥ 0}.

Semimarkov jump processes


Consider an infinite plate with a crack, subjected to a constant macroscopic
stress σ, the load being applied repeatedly. Crack size (the half-length of the
crack) is a convenient summary statistic of the actual crack geometry. The
actual geometry and the granular structure of the material at the crack's tip are
the controlling factors for the crack's growth. It is clear, and is well-known, that
crack size follows a pure jump type trajectory when regarded at a high enough
level of magnification. Basically, if the current crack size is a, the size will
remain constant for some random time and then jump to some new size b. The
sojourn time at a is the time needed for the strain energy to grow to a level
sufficient to overcome the resistance of the grain at the crack tip. And, the
magnitude b - a of the jump is again controlled by the granular structure of
the matter at the crack tip. Assuming that the actual strength (and therefore
the actual stress) at the crack tip is known at all times, these considerations
imply that the crack size be modeled by a pure jump semimarkov process
with paths as in the figure below.
The conditional law, given the actual stress at the crack tip at all times, of
such a process is described by a transition kernel Q(a, db, dt) which specifies,
for each possible crack size a, the joint probability that the crack size jumps
to some value in an infinitesimal interval db around b after a sojourn of
some length in an infinitesimal interval dt around t. In fact, this model is
a generalization of the Markov chains proposed by Kozin and Bogdanov as
well as the compound Poisson models proposed by Sobczyk (1986), Sobczyk
and Trebicki (1989) and (1991). In the case of Markov chains, a and b are
restricted to integers and Q has the form

    Q(a, db, dt) = λ(a) e^{−λ(a)t} dt  if b = a + 1,  and 0 otherwise;

and in the compound Poisson cases,

    Q(a, db, dt) = μ e^{−μ(b−a)} λ e^{−λt} db dt .
Unless such a simplifying assumption is made, the data needed to obtain Q
statistically is prohibitively extensive and expensive to collect.
Fig. 5.1. Crack size against time (number of cycles) in the microscale

Continuous semimarkov processes

We propose an alternative approach that retains the logic of the semimarkov
model while simplifying the quantitative aspects. We argue that the sizes of
the jumps in Figure 5.1 are very small compared with the crack size. Indeed,
the jumps involved must be of magnitudes comparable to grain sizes, whereas
the crack size is of the order of 10 to 50 millimeters. Therefore, it should
be reasonable to view the crack size process A as an increasing continuous
process; see Figure 5.2 below.
We list this formally and retain the assumption that A have the semi-
markov property in the sense introduced in Çınlar (1979): at every stopping
time S that belongs to the set of times of increase of A, the future after S
is conditionally independent of the past before S given the crack size A_S at
that time.

Hypothesis 5.1. a) The process A is increasing and continuous.
b) Given the actual stresses at the crack tip at all times, the conditional law
of A is that of a semimarkov process in the sense of Çınlar (1979).

Consequences of this hypothesis will be spelled out below in terms of the
cumulative cycle process T_a parametrized by crack size a. For the present,
we turn our attention to the modeling of the actual stress field at the crack tip.
Fig. 5.2. Theoretical approximation of the crack size as a continuous semimarkov
process

Random stress field


Consider an infinite plate of some macroscopically homogeneous material.
Representing the plate by the plane R², let Y_x be the strength at the point
x in R². The random field Y = {Y_x : x ∈ R²} is positive and scalar-valued.
Macroscopic homogeneity implies that, as a random field, Y should be spa-
tially homogeneous and isotropic, that is, the probability law of Y should be
invariant under translations and orthogonal transformations of the plane R².
Moreover, since the inhomogeneities being modeled by Y are in the micro-
scale, the random field Y should be continuous and have the Markov property
in the plane: for example, given the values of Y_x for x on the boundary of
a disk D, the values of Y_x for x in the interior of D should be conditionally
independent of the values of Y_x for x outside D.
The only known workable processes Y satisfying all these requirements
have the form

    Y_x = f(Z_x) ,  x ∈ R²,

where f is a one-to-one continuous function from R onto R_+ and where Z is
Gaussian with mean 0 and covariance

    Cov(Z_x, Z_y) = (α²/2β) e^{−β|x−y|} ,  x, y ∈ R².

Assuming that the crack we are considering is a line segment, the strength
at the crack tip is Y_x, where x = (a, 0) if we choose the crack line as the
horizontal axis. For the function f, we choose the exponential function, which
seems justified by both the statistical data and experience. Thus, writing
simply Z_a instead of Z(a,0), we can state our assumptions regarding the stress
level at the crack tip when the size is a. Here σ is the macroscopic stress
magnitude.

Hypothesis 5.2. a) When the crack size is a, the actual stress at the crack
tip is

    σ e^{−Z_a} .

b) The process Z is a stationary Ornstein-Uhlenbeck process, that is, Z is
Gaussian with mean 0 and covariance

    (α²/2β) e^{−β|a−b|} ,  a, b ≥ a_0,

for some constants α and β > 0.


We should note that Z and −Z have the same probability law, and there-
fore, e^{−Z} and e^Z have the same law. Our reason for choosing e^{−Z} is a desire
to be consistent with our interpretation of Y as strength.

Cumulative cycle process

Recall Hypothesis (5.1): Given the process Z, the crack size process A is an
increasing continuous semimarkov process. The structure of such processes
was characterized in Çınlar (1979): every such process is the functional inverse
of a pure-jump strictly increasing process with (non-stationary) independent
increments. Since the functional inverse of A is T, we may state this result
as a consequence of Hypothesis (5.1).

Proposition 5.1. Under Hypothesis (5.1), the cumulative cycle process {T_a :
a ≥ a_0} is a pure-jump, strictly increasing process. Given Z, the conditional
law of T is that of a process with independent increments.
The structure of processes with independent increments is well-known. In the
case of pure-jump strictly increasing ones without fixed discontinuities, every
such process is obtained from a Poisson random measure M on R_+ × R_+ by
the formula

    T_a = ∫_{(a_0,a)×R_+} M(db, dt) t .                       (5.2)

The law of M itself is specified by its mean measure μ(db, dt), which gives
the mean number of points in the small box with sides db and dt. In our case,
since T is conditionally a process with independent increments, the measure
μ is random and depends on Z. It is clear that μ has the form

    μ(da, dt) = f(σ, a, Z_a, t) da dt

for some positive function f. Moreover, since T is to be strictly increasing
for some positive function f. Moreover, since T is to be strictly increasing
(so that A be continuous), we must have
    ∫_0^∞ f(σ, a, Z_a, t) dt = +∞ .                           (5.3)

Also, in our case, the conditional mean rate of increase of T_a should be finite
and in agreement with the Virkler et al. data, which requires us to have

    ∫_0^∞ f(σ, a, Z_a, t) t dt = C_0 (σ e^{−Z_a})^q a^{−p} ,  (5.4)

which is basically the same as (4.8).


Of course, there are many functions f that satisfy the conditions (5.3) and
(5.4). However, there is one f that recommends itself for some extra reasons.
We put this as our final hypothesis, and explain the extra reasons afterward.

Hypothesis 5.3. The conditional mean measure μ has the form

    μ(da, dt) = C_0 (σ e^{−Z_a})^q (e^{−a^p t}/t) da dt ,

where p and q are material constants, σ is the macroscopic stress level, and
Z is the Ornstein-Uhlenbeck process described in Hypothesis (5.2).

Stripped to its essentials, μ has the form

    μ(da, dt) = g(a) (e^{−h(a)t}/t) da dt .

If g(a) and h(a) were constants, say g(a) = γ and h(a) = λ, then the distri-
bution of T_a would have been the gamma distribution with shape parameter
γ(a − a_0) and scale parameter λ (so that the mean would be γ(a − a_0)/λ and
the variance γ(a − a_0)/λ²). Thus, Hypothesis (5.3) is that the number of cycles
needed to move the crack from a to a + δa (where δa is small) has the gamma
distribution with mean

    C_0 (σ e^{−Z_a})^q a^{−p} δa                              (5.5)

and variance

    C_0 (σ e^{−Z_a})^q a^{−2p} δa ,                           (5.6)

both depending on the random variable Z_a.
How does this conclusion square with the experimental data? As Virkler
et al. pointed out, their best fit was a log-normal distribution. I believe there
is no conflict between our conclusion and theirs: They fit their distribution
to the number of cycles needed to go from a to a + δa, which has the random
mean (5.5) according to our model, and the distribution of (5.5) in our model
is log-normal.
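Hypothesis (5.3) translates directly into a simulation recipe: given an Ornstein-Uhlenbeck path, draw the increments of T as conditionally independent gamma variables with shape S_a * da and rate a**p. All numerical constants in the sketch below are illustrative choices of ours.

```python
import math, random

def simulate_T(a_grid, z, C0, sigma, q, p, rng):
    """Draw T along a_grid: each increment is gamma with shape S_a * da and
    rate a**p (i.e. scale 1/a**p), where S_a = C0 * (sigma * exp(-z_a))**q."""
    T = [0.0]
    for (a_prev, a), z_a in zip(zip(a_grid, a_grid[1:]), z):
        shape = C0 * (sigma * math.exp(-z_a)) ** q * (a - a_prev)
        T.append(T[-1] + rng.gammavariate(shape, 1.0 / a_prev ** p))
    return T

rng = random.Random(5)
grid = [9.0 + 0.5 * i for i in range(83)]        # 9.0 mm to 50.0 mm
sd = 0.5 / math.sqrt(2.0)                        # stationary sd for alpha=0.5, beta=1
z = [rng.gauss(0.0, sd)]
for a_prev, a in zip(grid, grid[1:]):
    rho = math.exp(-(a - a_prev))                # exact AR(1) transition
    z.append(rho * z[-1] + rng.gauss(0.0, sd * math.sqrt(1.0 - rho * rho)))
T = simulate_T(grid, z, C0=2000.0, sigma=1.5, q=2.0, p=1.0, rng=rng)
```

Each call with a fresh Z path produces one simulated specimen, mimicking one of the 68 Virkler sample paths.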
6. Summary and Conclusions

Recall that time is measured continuously, but stands for the cumulative
number of cycles, that A_t denotes the crack size at time t, with A_0 = a_0
fixed, and that T_a is the time at which the crack size exceeds a. Finally, let us
introduce a new process S to simplify notation:

    S_a = C_0 (σ e^{−Z_a})^q .                                (6.1)

We think of S_a as the random stress intensity factor when the crack size is
a.
Throughout this section we assume that Hypotheses (5.1), (5.2), (5.3)
hold. The following describe the processes S, A, T one by one.

Intensity process

The process S is log-normal; more precisely, log S is a stationary Gaussian
process with constant mean

    log C_0 + q log σ                                         (6.2)

and covariance kernel

    (q² α²/2β) e^{−β|a−b|} ,  a, b ≥ a_0.                     (6.3)

In fact, S is a stationary Markov process; in the representation (6.1),
the process Z is a stationary Ornstein-Uhlenbeck process, that is, it is the
solution of the Langevin equation

    dZ_a = −β Z_a da + α dW_a ,                               (6.4)

with W denoting the Wiener process, and Z_{a_0} being independent of W and
having the Gaussian distribution with mean 0 and variance α²/2β.

Cumulative cycle count

The process T = {T_a : a ≥ a_0} is an infinitesimally gamma process with
the random shape index S_a and deterministic scale index a^p, where p ≥ 1 is
constant.
It can be constructed as follows. Consider a random point process on
(a_0, ∞) × R_+; let M(B) be the number of points in the Borel subset B of
(a_0, ∞) × R_+. Let the random measure M be conditionally Poisson with
mean measure μ given by

    μ(da, dt) = S_a (e^{−a^p t}/t) da dt .                    (6.5)
In other words, given the process S, the random variable M(B) has the
Poisson distribution

    P{M(B) = k | S} = e^{−μ(B)} μ(B)^k / k! ,  k = 0, 1, 2, ...   (6.6)

where

    μ(B) = ∫_B μ(da, dt) = ∫_B S_a (e^{−a^p t}/t) da dt .

In terms of M, the process T is defined as follows:

    T_a = ∫_{(a_0,a)×R_+} M(da, dt) t .                       (6.7)

In other words, if the points of M are the random pairs (a_i, t_i),

    T_a = Σ_{a_0 < a_i < a} t_i .                             (6.8)
Increments of T are conditionally independent given S (or, equivalently,
given Z). It follows from (6.7) that, for a_0 ≤ a < b,

    E(T_b − T_a | S) = ∫_a^b ∫_0^∞ μ(da, dt) t = ∫_a^b S_a a^{−p} da ,   (6.9)

    Var(T_b − T_a | S) = ∫_a^b S_a a^{−2p} da ,               (6.10)

    E(e^{−λ(T_b − T_a)} | S) = exp[ −∫_a^b ∫_0^∞ μ(da, dt)(1 − e^{−λt}) ]
                             = exp[ ∫_a^b S_a log (a^p/(λ + a^p)) da ] .  (6.11)

Unconditional expectations and Laplace transforms can be obtained from
these by taking expectations. For instance,

    E(T_b − T_a) = C_0 σ^q e^{q²α²/4β} ∫_a^b a^{−p} da .      (6.12)

Further computations do not yield explicit results, and numerical procedures
are needed.
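Formula (6.9) is easy to check by Monte Carlo. For simplicity the sketch below holds the intensity S fixed at a constant, rather than sampling the log-normal process, and compares the simulated mean of T_b − T_a with the integral of S * a**(-p); all numerical values are illustrative.

```python
import random

def draw_increment(a, b, S, p, steps, rng):
    """One draw of T_b - T_a given a constant intensity S: a sum of gamma
    increments with shape S * da and rate a**p over a partition of (a, b)."""
    da = (b - a) / steps
    total = 0.0
    for i in range(steps):
        ai = a + i * da
        total += rng.gammavariate(S * da, 1.0 / ai ** p)
    return total

rng = random.Random(1)
a, b, S, p = 10.0, 20.0, 50.0, 2.0
samples = [draw_increment(a, b, S, p, steps=100, rng=rng) for _ in range(400)]
mc_mean = sum(samples) / len(samples)
exact = S * (1.0 / a - 1.0 / b)      # the integral of S * x**(-2) from a to b
```

The two numbers agree to within Monte Carlo and discretization error, as (6.9) predicts.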
Crack size

Since the probability law of T is completely characterized, the relationship
(5.1) yields a complete characterization of the process A as well. In fact, we
note that (5.1) is equivalent to

    A_t = inf {a : T_a > t} ,  t ≥ 0.

For computing distributions, we note that

    P{A_t > a} = P{T_a < t} ,

which may be useful in numerical evaluations.
However, we believe that interest in A is for academic purposes only. The
real question of reliability concerns the process T directly. For, if the critical
crack size is a_cr, the lifetime of the structure involved is T_{a_cr}.

References

Çınlar, E.: On Increasing Continuous Processes. Stoch. Proc. Their Appl. 9, 147-
154 (1979)
Çınlar, E.: On a Generalization of Gamma Processes. J. Appl. Prob. 17, 467-480
(1980)
Kogajew, V.H., Liebiedinskij, S.G.: Probabilistic Model of Fatigue Crack Growth
(In Russian). Mashinoviedinije 4, 78-83 (1983)
Kozin, F., Bogdanov, J.L.: Recent Thought on Probabilistic Fatigue Crack Growth.
Appl. Mech. Rev. 42, S121-S127 (1989)
Noronha, P.J. et al.: Fastener Hole Quality, I and II. Tech. Report AFFDL-TR-78-
206, Wright-Patterson Air Force Base, Ohio (1978)
Ortiz, K., Kiremidjian, A.: Stochastic Modeling of Fatigue Crack Growth. Engng.
Fracture Mechanics 29, 317-334 (1988)
Paris, P.C., Erdogan, F.: A Critical Analysis of Crack Propagation Laws. J. Basic
Engng. 85, 528-534 (1963)
Sobczyk, K.: Modelling of Random Fatigue Crack Growth. Engng. Fracture Me-
chanics 24, 609-623 (1986)
Sobczyk, K., Spencer Jr., B.F.: Random Fatigue: From Data to Theory. Boston:
Academic Press 1992
Sobczyk, K., Trebicki, J.: Modelling of Random Fatigue by Cumulative Jump Pro-
cesses. Engng. Fracture Mechanics 34, 477-493 (1989)
Sobczyk, K., Trebicki, J.: Cumulative Jump-Correlated Model for Random Fatigue.
Engng. Fracture Mechanics 40, 201-210 (1991)
Spencer Jr, B.F., Tang, J., Artley, M.E.: Stochastic Approach to Modeling Fatigue
Crack Growth. AIAA Journal 27, 1628-1635 (1989)
Virkler, D.A., Hillberry, B.M., Goel, P.K.: The Statistical Nature of Fatigue Crack
Propagation. J. Engng. Materials Tech. Trans. ASME 101, 148-153 (1979)
Predictive Modeling for Fatigue Crack
Propagation via Linearizing Time
Transformations
Panickos N. Palettas^1 and Prem K. Goel^2
^1 Department of Statistics, Virginia Polytechnic Institute and State University,
Blacksburg, VA 24061-0439, USA
^2 Department of Statistics, The Ohio State University, Columbus, OH 43210-1247

Summary. Engineering structures operating under cyclic loading are subject to
fatigue damage accumulation, which leads to the initiation and subsequent growth
of fatigue cracks. Predictive modeling of fatigue crack growth is important for the
prevention of catastrophic failures commonly associated with aging air-transport
fleets, critical aerospace components, and other large engineering structures such
as hydroelectric dams and off-shore oil rigs. In this article, we demonstrate how a
training sample of a few experimental units from the underlying fatigue crack growth
process can be effectively used to predict the crack propagation of a unit in use
into the future, given its early growth history. Aspects of the process that represent
population (macro) growth characteristics are integrated in a semiparametric time
transformation model, which linearizes the functional relationship between average
crack length and the number of load cycles. On the transformed time scale, indi-
vidual (micro) growth characteristics are effectively accounted for by a linear model
fitted to whatever initial crack-growth data might be available for the particular
specimen of interest. The prediction results for an experimental data set are quite
appealing.

Keywords. Change point, crack growth, prediction, regression to the mean

1. Introduction

Structural fatigue occurs when a structure is exposed to cyclic loading over
a prolonged period of time. Even if the loading amplitude is restricted well
within conservative limits for the particular structural design, accumulation
of damage in the micro-structure of material subject to sustained alternating
stress intensity, such as that resulting for example from the cyclic usage of
an aircraft, a bridge, or an off-shore oil rig, is virtually unavoidable.
The initiation of fatigue cracks is generally the result of local stress con-
centration at sites of inherent material defects, scratches, and indentations.
Soon after initiation, microscopic cracks are typically halted at some thresh-
old length depending on the type of material. Sooner or later, however, prop-
agation of neighboring nucleating microscopic cracks would eventually yield
a large dominant crack. Generally, the alternating stress intensity resulting
from cyclic loading at the tip of such a crack is increasing with the length of
the crack. Thus the crack growth rate is steadily accelerating.
The inability to characterize damage accumulation in the micro-structure
of the material has traditionally focused interest on empirical models describ-
ing the macroscopic behavior of fatigue crack growth. Experimental evidence
suggests that such models should by design have the capacity to account for
the inherent dependence of crack growth on the loading history. Furthermore,
these models need to account for the enormous variability in crack growth
patterns, typical even in carefully controlled experimental testing. Palettas
(1988) and Palettas and Goel (1992) developed a modeling framework with
these two requirements in mind. In this article, we focus primarily on aspects
pertaining to the practicability and effectiveness of that framework.
In Section 2, we introduce the necessary notation and the formulation of
the modeling framework under consideration. In Section 3, we discuss model
implementation issues. In Section 4, we examine the performance of the model
in describing a set of experimental fatigue crack propagation (FCP) data.
Finally, in Section 5, we examine the capacity of this modeling framework to
capture growth trends and to predict the time for the crack to grow to specified
levels under a variety of extreme crack growth behaviors. The effect of varying
the level of available information, about the early-growth data for the unit
under observation, on the prediction errors is also examined.

2. The Modeling Framework


In general, we assume that longitudinal FCP measurements are available on
a training sample of p experimental units for a full range of crack lengths.
For a unit in use, whose early crack growth history is available, we
wish to predict its subsequent growth. Considering the crack length, a, as
a function of the number of load cycles, N, we use the notation a_i(N), i =
0, 1, ..., p, to represent the observed sample functions from the crack growth
process. The index "0" is used to denote the partially observed unit for which
prediction is desired. Similarly, the numbers of load cycles, N, to reach a
certain crack length, a, are denoted by N_i(a), i = 0, 1, 2, ..., p. Individual
sample measurements (see Section 4) for the training sample on a grid of fixed
crack lengths are denoted by (N_ij, a_ij), where a_ij = a_i(N_ij), j = 1, 2, ..., k_i,
i = 0, 1, 2, ..., p. Finally, the mean function for the crack growth process (as
a function of N) is denoted by m_a(N), while the mean number of cycles to
reach a specified crack length, a, is denoted by m_N(a).
In short, the modeling framework considered in this article is motivated
by the need to project the early crack growth history of a new unit relative
to a training set of test samples from the underlying growth process, with-
out having to postulate any particular parametric growth curve model for the
process. To that extent, the transformation g(a) = m_N(a) (i.e., the regression
of the number of load cycles on the crack length) is known to have the max-
imum correlation between any such transformation g(a) and N. Of course,
the transformed crack length is not necessarily linear in N for every replicate
test under consideration; nevertheless, it seems that piece-wise linearity with


a random change point is quite satisfactory as a model for prediction of FCP
for a specific unit under examination. That is,

mN(ai(N)) = βi N + δi (N − τi)+ + εi(N),    (2.1)

where τi denotes the change point and δi the shift in the growth rate relative
to the mean for the ith unit.
The complexities imposed by the longitudinal nature of the raw FCP
data, as well as the constraints imposed by the monotonicity of mN(a) on
the error terms in (2.1), can be overcome by a model implied by (2.1), in terms of
successive increments in mN(aij). Thus, at the expense of possibly increased
error variance, we use the shifting regimes regression model

dmN(aij) = βi dNij + δi wij + eij,    (2.2)

where dmN(aij) = mN(aij) − mN(ai,j−1) and wij = (Nij − τi)+ − (Ni,j−1 − τi)+,
for j = 1,2,...,ki, and i = 0,1,2,...,p. As discussed in Section 4,
there is reasonable ground to support the usual assumptions of constant
error variance and independence among the error terms, eij, j = 1,2,...,ki,
in (2.2).
Finally, the shift in the relative crack growth rate turns out to be largely
the consequence of regression to the mean, i.e., for i = 1,2, ... , p,

δi = δ(βi − 1),    (2.3)


where δ is the macro-level crack growth parameter associated with the par-
ticular fatigue process under study. Thus the model in (2.2) is reduced to:

dmN(aij) = βi dNij + δ(βi − 1) wij + eij,    (2.4)

where j = 1,2,...,ki, and i = 0,1,2,...,p.

3. Model Implementation

The fitting of the proposed model to data is made cumbersome by the
nonlinearity in (2.4) with respect to the unknown change points τi, i = 0,1,...,p,
and the need to estimate the linearizing transformation mN(a). Nevertheless,
the use of an iterative fitting algorithm in the spirit of alternating conditional
expectations (Breiman and Friedman 1985) is conceptually simple to imple-
ment. More specifically, starting with a set of initial values of βi and δi, say 1
and 0 respectively, i = 0,1,2,...,p, least squares estimates of these parame-
ters can be obtained by iterative repetition of the following two alternating
conditional estimation steps:
56 Panickos N. Palettas and Prem K. Goel

Step 1: Given the current set of estimates for βi, δi and τi, i = 1,2,...,p, the
linearizing transformation g(a) = mN(a) is obtained by means of a non-
parametric smoother of the scatter plot {(aij, βiNij + δi(Nij − τi)+); j =
1,2,...,ki, i = 1,2,...,p}. The supersmoother (Friedman and Stuetzle,
1982), in particular, is easy to use and generally works reasonably well.
Step 2(a): Given the transformation mN(a) and τi = τ̂i, i = 1,2,...,p,
the least squares estimates for the parameters βi and δi, i = 1,2,...,p, are
straightforward to obtain.
Step 2(b): Given mN(a), βi = β̂i and δi = δ̂i, the current estimate, τ̂i,
for the change point is recursively updated to α2/(α1 − α3), where α1 is
the estimate for the slope in (2.2) fitted to the data, from only the ith
replicate test, with abscissas to the left of τ̂i, and α2, α3 are the estimates
for the intercept and the slope in (2.2) fitted to the data, from only the ith
replicate test, with abscissas to the right of τ̂i.
In other words, the shifting regimes model in (2.2) is fitted to the data
for each replicate test separately to the left and to the right of the current
estimate of the change point. The point of intersection of these line segments
gives the new estimate of the change point.
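Step 2(b) can be sketched numerically. The snippet below is a minimal illustration on synthetic data, not the authors' implementation: it fits a separate line on each side of the current change-point estimate and replaces the estimate with the intersection of the two fitted segments. With a zero intercept on the left segment, this intersection reduces to the α2/(α1 − α3) update described above.

```python
import numpy as np

def update_change_point(N, y, tau):
    # Fit y = c + a*N separately to the left and right of the current estimate
    left = N <= tau
    a1, c1 = np.polyfit(N[left], y[left], 1)
    a3, c2 = np.polyfit(N[~left], y[~left], 1)
    # New estimate = intersection of the two fitted segments.
    # (With a zero left intercept this reduces to alpha_2/(alpha_1 - alpha_3).)
    return (c2 - c1) / (a1 - a3)

rng = np.random.default_rng(1)
tau_true, beta, delta = 120.0, 1.0, 0.4       # illustrative parameter values
N = np.linspace(10, 250, 60)
y = beta * N + delta * np.clip(N - tau_true, 0, None) + rng.normal(0, 1.0, N.size)

tau = 150.0                                    # rough starting value
for _ in range(20):                            # iterate Step 2(b) until it stabilizes
    # Clip to the interior of the data so both regimes keep enough points
    tau = float(np.clip(update_change_point(N, y, tau), N[2], N[-3]))
```

When the two regimes are well separated, as here, the iteration settles near the true change point within a few passes; when the shift is questionable it can instead wander, mirroring the divergence behavior discussed in the text.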
In theory, convergence of the estimation algorithm in Step 2(b) can be
adversely affected by outliers and spurious effects due to the random errors
in the neighborhood of the change point. In practice, however, this procedure
is typically well behaved, yielding a sequence of estimates that converge quite
rapidly to the overall least squares estimate for the change point. Yet, when-
ever a shift is actually questionable, this algorithm can also diverge rapidly.
Divergence thus becomes an indicator of this phenomenon.
Fitting (2.4) to the training set is done as described above, with the only
difference in Step 2 being to account for the constraint in (2.3). Thus,
given the linearizing transformation mN(a), obtained as in Step 1, βi =
β̂i, and τi = τ̂i, i = 1,2,...,p, the model in (2.4) is just a simple linear
regression model with unknown slope δ. Likewise, given mN(a), δ = δ̂, and
τi = τ̂i, i = 1,2,...,p, the model in (2.4) is also linear with respect to the
parameters β1, β2, ..., βp, which may be estimated by least squares. Finally,
given mN(a), δ = δ̂, and βi = β̂i, i = 1,2,...,p, the analog of Step 2(b)
provides a recursive updating scheme, in which τ̂i is repeatedly replaced by
α0/(δ̂(1 − α1)) until convergence, where α0 and α1 denote the least squares
estimates for the parameters in the corresponding linear regression model (3.1).

4. Experimental FCP Data

Experimental FCP data are usually collected by continuous monitoring of
the crack length and recording the number of load cycles corresponding to a

preselected sequence of crack lengths. More specifically, the dominating crack


on a test unit subject to uninterrupted cyclic loading is visually monitored
with the help of a microscope mounted on a digital traversing system. When
the crack tip reaches the cross-hair in the microscope, the cumulative number
of load cycles is recorded and the microscope is advanced to the position
corresponding to the next level among preselected crack lengths.
Virkler et al. (1978) obtained 68 replicated FCP tests as described above.
The test data was collected on center-cracked panels of 2024-T3 aluminum
alloy, 0.10" thick, 22" long, and 6" wide. Crack initiation was achieved with
the use of high amplitude stress in conjunction with a slit which was machined
at the center of each specimen. For each replicate test, 164 measurements
corresponding to a grid of regularly spaced crack lengths in the range from
9.0 to 49.8 mm were obtained. These sample curves defined by connecting
the 164 points by straight line segments are shown in Figure 4.1.


Fig. 4.1. Sample FCP Tests from 2024-T3 Aluminum Alloy (Virkler et al. 1979)

A subset of the Virkler et al. data, consisting of sample measurements
from 20 replicate tests intended to serve as the training set, is shown in
Figure 4.2. These 20 replicate tests were carefully picked to ensure that they
form a representative sample of the larger base of available data. Yet, through
this sample we attempt to recreate some features typically expected from
field data. More specifically, the sample shown in Figure 4.2 corresponds to a
relatively sparse and irregular grid of points, as opposed to a relatively large

Fig. 4.2. Sample measurements from a representative set of 20 sample functions
from those in Figure 4.1 as a training set

number of sample functions, a dense grid, and regularly spaced crack lengths
featured in the Virkler et al. data.
Figures 4.3 and 4.4 clearly indicate that the residuals from the shifting
regimes model (2.4) fitted to the data in the training set are free of any ap-
parent trend, thus indicating a satisfactory fit. Also notable is the lack of any
indication of serial correlation and no clear evidence of heteroscedasticity.
Figures 4.5 and 4.6, corresponding to the first sample unit in the training
set, further support the conclusions of adequate fit and lack of serial cor-
relation and absence of any sizable heteroscedasticity. Thus the assumption
of independent homoscedastic error terms, eij, j = 1,2, ... ,ki, seems to be
highly tenable. The validity of these assumptions is certainly essential to the
validity of bootstrap prediction regions, based on (2.4), that are presented in
Section 5.

5. FCP Prediction for a Unit in Use


The problem of primary interest in this paper is the projection of the partially
observed fatigue crack growth curve for a unit in use, that is not part of the
training sample. The objective is to predict the crack's subsequent growth
(subject to continual cyclic loading) in terms of its expected length at points
beyond the number of load cycles at which the crack was last observed.
To that end, we will use the model in (2.4) to extrapolate the crack's

Fig. 4.3. Residual increments in the transformed crack length, mN(aij), from the
shifting regimes model (2.4) fitted to the training set


Fig. 4.4. Residual increments in the transformed crack length, mN(aij), from the
shifting regimes model (2.4) fitted to the training set

Fig. 4.5. Residual increments in the transformed crack length, mN(aij), from the
shifting regimes model (2.4) fitted to the data from the Replicate Test 1


Fig. 4.6. Residual increments in the transformed crack length, mN(aij), from the
shifting regimes model (2.4) fitted to the data from the Replicate Test 1

early growth, as represented by the set of measurements (N0j, a0j), j =
1,2,...,k0, to predict a0(N) for values of N larger than N0k0, yet
within the general range represented by the training set.
Notably, predictions might be repeated for different values of N and pos-
sibly different fatigue cracks of interest. The training set, however, would
normally remain fixed and so would be any estimates, obtained on the basis
of the training set, for the common growth characteristics mN(a), δ, and Fτ,
where the latter denotes the distribution of the random change points for
the given family of growth curves. Consequently, it is important to realize
that even from the frequentist point of view the predictive inference consid-
ered in this Section should be conditional on the given training set and the
corresponding estimates m̂N(a), δ̂, and F̂τ, obtained in Section 4.
Depending on the length of the period over which the crack of interest
has been observed, the early growth data available for such a crack might not
provide any direct information pertaining to the change point, τ0, employed
in (2.4). Nevertheless, on the basis of exchangeability we may treat τ0 as a
random observation from Fτ. With no other prior information (or intuitive
understanding) concerning the form of Fτ, it seems appropriate to estimate
it by means of the empirical distribution, F̂τ, with support the set of change
point estimates, {τ̂1, τ̂2, ..., τ̂p}, obtained in Section 4 for the test units in
the training sample.
Given the estimates m̂N(a), δ̂, F̂τ and a variate τ0 representing a random
observation from F̂τ, (2.4) becomes

dmN(a0j) + δ̂ w0j = β0 (dN0j + δ̂ w0j) + e0j,    j = 1,2,...,k0.    (5.1)


Based on the discussion in Section 4, preceding Figures 4.3 and 4.4, it is
reasonable to treat the error terms in (5.1) as independent and identically
distributed. Thus β0 is merely the slope in a simple linear regression model
and it may be estimated via ordinary least squares. Furthermore, bootstrap-
ping residuals (see e.g., Efron and Tibshirani 1993, pp. 113-114) provides a
legitimate approach for the study of the variability in the fitted model and
the construction of prediction intervals for ao(N), for any desirable value
of N. More specifically, conditionally on a given τ0 generated as a random
variate from F̂τ, a bootstrap iterate for a0(N) is given by

â0(N) = m̂N⁻¹{ b0 N + δ̂ (b0 − 1)(N − τ0)+ },    (5.2)

where m̂N⁻¹(·) denotes the inverse mapping of the monotone transformation
g(a) = mN(a) and b0 is a bootstrap iterate for β0 obtained by means of
bootstrapping residuals in the context of (5.1). Averaging a large sample of
independent bootstrap iterates generated as in (5.2) yields the predicted value
for ao(N). Alternatively, sample quantiles calculated on the basis of such a
sample can be used to construct approximate prediction intervals and/or
bands for ao(N).
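The bootstrap scheme of (5.1)-(5.2) can be sketched as follows. Everything here is an illustrative stand-in, not the authors' code: the transform mN, the estimate δ̂, the change-point estimates forming the empirical F̂τ, and the early-growth data are all hypothetical, and the inverse m̂N⁻¹ is taken numerically by interpolation.

```python
import numpy as np

rng = np.random.default_rng(7)

# Hypothetical stand-ins for quantities estimated from the training set:
grid = np.linspace(9.0, 49.8, 500)                  # crack-length range of the data
mN = lambda a: 130.0 * np.log(a / 8.0)              # assumed monotone transform m_N(a)
mN_inv = lambda y: np.interp(y, mN(grid), grid)     # numerical inverse of m_N
delta_hat = 0.3                                     # assumed estimate of delta
tau_hats = np.array([90.0, 110.0, 120.0, 140.0])    # change points from the training units

# Synthetic early-growth history (N_0j, a_0j) for the unit in use (beta_0 ~ 0.95)
N0 = np.linspace(20.0, 80.0, 12)
a0 = mN_inv(0.95 * N0 + rng.normal(0.0, 0.5, N0.size))

def slope_through_origin(x, y):
    return np.sum(x * y) / np.sum(x * x)

def regressors(tau0):
    # Model (5.1): increments of m_N(a) + delta*w regressed on dN + delta*w
    w = np.diff(np.clip(N0 - tau0, 0.0, None))
    x = np.diff(N0) + delta_hat * w
    y = np.diff(mN(a0)) + delta_hat * w
    return x, y

def bootstrap_paths(N_future, B=1000):
    paths = np.empty((B, N_future.size))
    for b in range(B):
        tau0 = rng.choice(tau_hats)                 # tau_0 drawn from the empirical F_tau
        x, y = regressors(tau0)
        b_hat = slope_through_origin(x, y)
        res = y - b_hat * x
        y_star = b_hat * x + rng.choice(res, res.size, replace=True)
        b0 = slope_through_origin(x, y_star)        # bootstrap iterate for beta_0
        # Equation (5.2): map the projected transformed length back to crack length
        paths[b] = mN_inv(b0 * N_future + delta_hat * (b0 - 1.0)
                          * np.clip(N_future - tau0, 0.0, None))
    return paths

Nf = np.array([120.0, 160.0, 200.0])                # future load-cycle counts (in 1000's)
paths = bootstrap_paths(Nf)
pred = paths.mean(axis=0)                           # predicted crack lengths
lo, hi = np.quantile(paths, [0.025, 0.975], axis=0) # approximate 95% prediction interval
```

Averaging the iterates gives the point prediction, and the sample quantiles give approximate prediction intervals, exactly as described above.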

We now examine the capacity of the prediction scheme outlined above to
capture a variety of crack growth behaviors and yield meaningful predictions.
For this purpose, a carefully selected validation set of five replicate tests (# 5,
17, 47, 49, and 68; see Figure 5.1) from the Virkler et al. data is
chosen to represent the diversity of FCP behavior. These replicate tests do not
belong to the training set of 20 tests used in Section 4. The replicate tests 49
and 68 represent the slowest and the fastest crack-growth-rates, respectively.
Replicate tests 47 and 49 are the most extreme with regard to
irregularities in the crack growth behaviors. Finally, replicate tests 5 and 17
represent typical fatigue crack growth behavior.
For each of these replicate tests, predictions are obtained by fitting (5.1)
to the observed measurements over a segment of the early crack growth period
(e.g., the solid portion of each of the growth curves in Figure 5.1 correspond-
ing to the range from 9 to 15 mm in crack length). Each of the prediction
intervals and bands shown in Figures 5.2-5.10 is calculated on the basis
of the sample quantiles from 1000 independent bootstrap iterates obtained as in (5.2).


Fig. 5.1. The FCP curves for the Replicate Tests 5, 17, 47, 49, and 68 considered
for prediction. Early growth data (the solid portion of each curve) are used as
information available for predictions

In each case the observed sample FCP curves are also
shown for comparison. Again, the solid portion of each curve indicates the
range of the data used to fit the model in (5.1), while the dotted portion was
assumed to be the future to be predicted.

Overall, the model seems to perform very well as a prediction tool. In


particular, for FCP curves depicting typical growth behavior, e.g. Replicate
Tests 5 and 17, the prediction intervals obtained even from a few observations
during the unit's early crack growth history are both quite short and represent
the subsequently observed FCP curves quite accurately (see Figures 5.2 and
5.3). The dotted portion of the curves in these figures represents the observed
sample function assumed missing at the time of prediction. In Figure 5.2, the
prediction intervals obtained for the Replicate Test 5, which has exhibited
nearly average growth behavior, compared to the training set, are both quite
accurate and precise. In Figure 5.3, the prediction intervals obtained for the
Replicate Test 17 cover the true unknown FCP curve satisfactorily.


Fig. 5.2. Prediction intervals for the Replicate Test 5. Data from 9 to 15 mm crack
length (solid curve) used as available information

As one collects information on a unit under observation, one should, of


course, update the predictions. Figures 5.3, 5.4, and 5.5 show that as more
data becomes available by monitoring the early FCP behavior over a longer
period, the model does indeed yield progressively shorter and more accurate
prediction intervals. For the Replicate Test 17, a comparison of the prediction
intervals in Figure 5.3, based on sample measurements over the range from 9
to 15 mm in crack length, with the prediction intervals in Figure 5.4, based on
sample measurements over the range from 9 to 20 mm in crack length, exhibits
a significant reduction in uncertainty. A further comparison
of these figures with Figure 5.5, displaying the prediction intervals based

Fig. 5.3. Prediction intervals for the Replicate Test 17. Data from 9 to 15 mm in
crack length (solid curve) used as available information

on sample measurements over the range from 9 to 30 mm in crack length,
exhibits an even more substantial improvement with respect to both accuracy
and precision.
Furthermore, for the Replicate Test 68, which exhibited much faster
growth rate than any of the replicate tests used in the training set, pre-
diction intervals are given in Figure 5.6. Notice how the prediction intervals
are larger, i.e., less precise, compared to those obtained for the replicate
tests with typical growth behavior. However, these prediction intervals still
represent the subsequently observed growth measurements (dotted portion
of the curve) quite closely.
Similarly, for the Replicate Test 49, which exhibited the slowest growth
rate, as well as quite irregular growth behavior, prediction intervals are given
in Figure 5.7. Notice how these intervals become substantially larger than for
any others discussed above. Yet, these intervals represent the subsequently
observed growth measurements (dotted portion of the curve) closely, even though
they are based only on early growth history up to 15 mm. In Figure 5.8,
prediction intervals for the same replicate test, based on a somewhat longer
period of observation of the early growth history, up to 20 mm in crack
length, show improvement at least with respect to precision.
Finally, for the Replicate Test 47, which exhibited overall average growth
rates but the most irregular growth behavior, the prediction intervals
based on early growth history up to 20 mm in crack length are shown in

Fig. 5.4. Prediction intervals for the Replicate Test 17. Data from 9 to 20 mm in
crack length (solid curve) used as available information


Fig. 5.5. Prediction intervals for the Replicate Test 17. Data from 9 to 30 mm in
crack length (solid curve) used as available information

Fig. 5.6. Prediction intervals for the Replicate Test 68. Data from 9 to 30 mm in
crack length (solid curve) used as available information


Fig. 5.7. Prediction intervals for the Replicate Test 49. Data from 9 to 15 mm in
crack length (solid curve) used as available information

Fig. 5.8. Prediction intervals for the Replicate Test 49. Data from 9 to 20 mm in
crack length (solid curve) used as available information


Fig. 5.9. Prediction intervals for the Replicate Test 47. Data from 9 to 20 mm in
crack length (solid curve) used as available information

Figure 5.9. The irregular behavior is correctly picked up by the model, as
evidenced by the relatively large prediction intervals. As expected, with a longer
period of early growth history up to 30 mm in crack length, the width of the
prediction intervals for the same replicate test, given in Figure 5.10, is much
smaller in comparison to those in Figure 5.9.


Fig. 5.10. Prediction intervals for the Replicate Test 47. Data from 9 to 30 mm in
crack length (solid curve) used as available information

References

Breiman, L., Friedman, J.H.: Estimating Optimal Transformations for Multiple
Regression and Correlation. Journal of the American Statistical Association
80, 580-597 (1985)
Efron, B., Tibshirani, R.J.: An Introduction to the Bootstrap. New York: Chapman
& Hall 1993
Friedman, J.H., Stuetzle, W.: Smoothing of Scatterplots. Technical Report Orion
3. Department of Statistics, Stanford University (1982)
Palettas, P.N.: Stochastic Modeling and Prediction for Fatigue Crack Propagation.
Ph.D. Dissertation. The Ohio State University (1988)
Palettas, P.N., Goel, P.K.: Bayesian Modeling for Fatigue Crack Curves. In: Klein,
J.P., Goel, P.K. (eds.): Survival Analysis: State of the Art. NATO ASI Series.
Dordrecht: Kluwer 1992, pp. 153-170

Virkler, D.A., Hillberry, B.M., Goel, P.K.: The Statistical Nature of Fatigue Crack
Propagation. Technical Report No. 78-43. U.S. Air Force Flight Dynamics Lab-
oratory (1978)
Virkler, D.A., Hillberry, B.M., Goel, P.K.: The Statistical Nature of Fatigue Crack
Propagation. Journal of Engineering Materials and Technology 101, 148-153 (1979)
The Case for Probabilistic Physics of Failure
Max Mendel
Department of Industrial Engineering and Operations Research, University of
California, Berkeley, CA 94720, USA

Summary. In Probabilistic Physics of Failure or PPoF, the statistical lifetime


model is derived directly from the physics of the failure mechanism. This paper
describes the technique and argues for its importance in design for reliability.

Keywords: Lifetime modelling, physics of failure, probabilistic design, reliability,


Bayesian inference

1. Introduction

This chapter overviews the Probabilistic Physics of Failure or PPoF approach


to determining lifetime models. Lifetime models are typically chosen from
some parametric family such as the exponential or the more general Weibull
family. PPoF concerns the choice of the likelihood function and advocates
basing the likelihood function on the physics of the failure mechanism. This
can give a firm foundation to the choice of a likelihood, which is often chosen
for mathematical convenience alone. It also provides an alternative to the
data-based approaches for model identification. This is especially useful in
situations where there is little or no data. Finally, the PPoF models turn
out to be stated explicitly in terms of the physical parameters of the failure
process. This makes the PPoF models particularly useful for probabilistic
design for reliability. In the design phase, these physical parameters can be
influenced, and in this way the probabilistic reliability of the systems can be
determined beforehand and competing designs can be compared.
The following three sections address the what, how, and why of PPoF,
respectively. A final section concludes by showing how the PPoF models relate
to data.

2. What is PPoF?

Probabilistic physics of failure derives a probability model directly from the


physical failure mechanism. This is shown in Figure 3.1. To do this, one
first identifies the failure mechanism. Then a probability assessment is made
with respect to the physical and geometric structure specified by the failure
mechanism. This specifies a class of models, which is then expressed as a
statistical likelihood model for component lifetime. The outstanding feature
of PPoF is that no data is needed to specify a likelihood model. It is the

physics of the failure mechanism that replaces the data that is often used to
determine the probability distribution.
To make this distinction concrete with an example, consider the problem
of specifying a lifetime distribution for machine tools such as drill bits. One
might choose some parametric family for lifetime such as a Wei bull distribu-
tion:
(2.1)
where F = Prob(Lifetime > x) is the survival function. This choice fixes the
distribution up to two parameters, a and {3, which can then be estimated
from lifetime data. Table 2.1 shows estimates of these parameters for various
cutting speeds. Notice that the shape increases and the scale parameter de-
creases with cutting speed. Many data points were required to obtain these
estimates and more will be needed to determine the probability model at
other cutting speeds.

Table 2.1. Weibull scale and shape parameters for multiple feed rates as reported
by Negishi and Aoki (1976)

Feed rate (mm/rev)   Shape β   Scale α
0.265                0.632     1245
0.335                0.725     423
0.425                0.624     480
0.850                0.531     715
1.060                0.760     120
1.320                0.850     86
1.700                1.325     40

Consider now the following PPoF approach to the same problem. Assume
that wear is the dominant failure mechanism. Wear is studied extensively in
the field of tribology and is documented in so-called wear curves. Figure 2.1
provides several examples. Qualitatively, we expect the probability of failure
to increase with increasing wear. Quantitatively, we can assume that:
Assessment 2.1. If one bit has twice the wear as another, then it is twice
as likely to fail in an upcoming infinitesimal interval.
This assumption is further discussed in Section 3. For a potentially infinite
supply of drill bits, the assumption can be shown to imply that lifetimes are
conditionally independent and identically distributed according to:
F̄(x|θ) = e^(−G(x)/θ).

Here G(x) is the area under the wear curve evaluated at the lifetime x of a
generic bit, and θ is the limiting average value of the G(xi) as i ranges over the
bits in the supply:

θ = lim_{N→∞} (1/N) Σ_{i=1}^{N} G(xi).
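A numerical sketch of this wear-based lifetime model follows; the wear curve here is hypothetical, invented only to mimic the qualitative run-in / steady-wear / accelerated-wear shape of tribological curves like those in Figure 2.1, and the value of θ is assumed.

```python
import numpy as np

# Hypothetical wear curve w(t): rapid run-in, steady wear, then acceleration
def wear(t):
    return 0.05 * (1.0 - np.exp(-t / 5.0)) + 0.002 * t + 1e-7 * t ** 3

def G(x, n=2000):
    # Area under the wear curve up to lifetime x (trapezoidal rule)
    t = np.linspace(0.0, x, n)
    v = wear(t)
    return (t[1] - t[0]) * (v.sum() - 0.5 * (v[0] + v[-1]))

def survival(x, theta):
    # PPoF model: F(x | theta) = exp(-G(x)/theta), so the hazard at x is
    # wear(x)/theta -- "twice the wear, twice the chance of failing"
    return np.exp(-G(x) / theta)

theta = 5.0  # assumed limiting average of G(x_i) over the population
s50, s200 = survival(50.0, theta), survival(200.0, theta)
```

Because the hazard is proportional to the wear curve itself, the resulting survival function decays slowly during steady wear and sharply once wear accelerates.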


Fig. 2.1. Wear curves for M10 cemented carbide tools with various coatings (cut
speed v = 200 m/min; feed rate f = 0.41 mm/rev; cut depth a = 2 mm; work
piece: grey cast iron bar, hardness 170 HB) (from Schintlmeister et al. 1984).

We now compare these two solutions. First notice that in the PPoF model
all the components of the model have a direct tribological meaning. To com-
pare, we can think of the Weibull model as a tribological PPoF model to-
gether with the assumption that the wear curve is a power law. This latter
assumption is not too bad as can be seen from Figure 2.1, although it does
underestimate the effects of run-in wear. Under this assumption, the shape
parameter β would be determined by the wear curve itself and does not need
to be estimated from data. The role of the scale parameter α is played by the
average integrated wear θ. This parameter is not fixed by the wear curves;
given G, θ is a function of the unknown lifetimes, making it a random variable
itself.
Endowing the components with a tribological meaning has several ad-
vantages. First, the task of assessing the parameters is simplified; the shape
parameter follows directly from the wear curve and by giving a physical mean-
ing to the scale parameter one can imagine that it is easier to assess a prior
for it. More importantly, however, it makes it possible to actually predict
the reliability of the bits at various cutting speeds. Figure 2.2 shows a set
of wear curves at various cutting speeds. Notice that the curves climb faster
with increasing cutting speed. By substituting these curves into the proba-
bility model we can predict the probabilistic behavior of the bits at various
cutting speed. If one assumes that the wear curve can be approximated by a
power law, then it follows that the shape parameter increases with increasing

cutting speed. We also expect the average cumulative wear to decrease. This
agrees entirely with the empirical data in Table 2.1.

Fig. 2.2. Wear curves for high-speed cutting and normal cutting conditions
(wear in mm versus usage in cycles).

Two critical remarks on the PPoF approach are appropriate here. First
is the claim that no data is needed to determine the model. In the drill bit
example, this should be taken to mean that no lifetime data are needed.
The wear curves are data based, but this is data concerning the wear on the
tool's face. The PPoF approach eliminates the need to take additional data.
The second is to point to the weak link in the derivation: the assessment
that links wear to probability in Assessment 2.1. This assessment is neces-
sarily subjective and one may disagree with it. It is generally impossible to
avoid subjective assessments altogether in lifetime modelling. The choice of a
Weibull model is subjective, even if this choice is based on some type of data-
based identification method, since the choice of such a method is subjective.
From an engineering perspective, the goal is to provide simple statements
that relate directly to the relevant engineering quantities. An engineer can
then choose to agree or disagree. This is a critical component of PPoF and
several alternative methods for making assessments are overviewed in the
next section.

3. Failure Mechanisms and Assessments

Figure 3.1 divides the PPoF approach into three steps: identifying the failure
mechanism, making a probabilistic assessment with respect to the mecha-
nism, and translating this into a likelihood model. This section analyzes these
steps in more detail.

Failure Mechanism → Assessment → Likelihood Model

Fig. 3.1. Steps in deriving a PPoF model.

3.1 Failure Mechanism

First is the identification of the failure mechanism. The simplest models re-
sult when there is one failure mechanism that is dominant. Again, this is a
subjective engineering assumption. Multiple failure mechanisms are handled
using the theorem of total probability as follows:

F̄(x) = Σi F̄(x | mechanism i) P(mechanism i obtains)

The conditional probabilities are then handled the same way as the models
pertaining to a single mechanism. The marginal probabilities for the failure
mechanism have to be assessed using other means, though. They serve as the
weights that measure the importance of the various mechanisms. Choosing a
dominant failure mechanism, then, corresponds to assigning probability 1 to
that mechanism.
Other relevant factors such as multiple failure sites are handled in a similar
way. Although extending the analysis in this way is straightforward from a
theoretical perspective, it greatly increases the modelling and assessment
efforts. In practice, it may therefore be more expedient to limit oneself to
a small number of important mechanisms rather than attempting to be as
inclusive as possible.
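The total-probability construction above can be sketched in a few lines. The per-mechanism survival functions and the subjective weights below are invented purely for illustration.

```python
import math

# Hypothetical per-mechanism survival functions (illustrative forms only)
def surv_wear(x):      # Weibull-type survival under a dominant wear mechanism
    return math.exp(-(x / 120.0) ** 2)

def surv_fracture(x):  # exponential survival under a fracture mechanism
    return math.exp(-x / 300.0)

# Subjective weights P(mechanism i obtains); they must sum to one
weights = {"wear": 0.7, "fracture": 0.3}
mechanisms = {"wear": surv_wear, "fracture": surv_fracture}

def survival(x):
    # Total probability over mechanisms: F(x) = sum_i F(x | i) P(i)
    return sum(weights[m] * mechanisms[m](x) for m in weights)
```

Setting one weight to 1 recovers the single-dominant-mechanism case discussed in the text.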

3.2 Assessment

The assessment step relates the failure mechanism to the probability model.
The assessment should be simple and relate directly to the relevant engineer-
ing quantities.
An example is the assessment "twice the wear, twice the probability of
failure in an upcoming small interval" that was used in the introduction to
assess a lifetime distribution for drill bits. This example is taken from Chick
and Mendel (1994). To make this comparison precise, we have to consider
a batch of, say, N items (drill bits). Denote the vector of their lifetimes by
x = (x1, ..., xN) and let xi and xj be the lifetimes of two different items.
Let h be the upcoming time interval. Then, the assessment becomes:

F̄(xi + h | x) = [g(xi)/g(xj)] F̄(xj + h | x) + o(h).
It is shown in Chick and Mendel (1994) that this condition implies that, for
a population-size of N,

from which the expression given in the introduction follows after a passage
to the limit as N → ∞.
This example can be applied to many other damage models apart from
wear. For instance, in fatigue fracture, it is customary to express the fatigue
damage g( n) after n cycles as follows:

g(n) = ~ (L\r)-~
2 2f./
Here L\r is the shear strain range (in percent), f / is the fatigue ductility
coefficient, and c is the fatigue ductility exponent. A probability model that
is consistent with the statement that twice the damage, twice the probability
of failure is:

$$F(n) = \exp\left[-\left(\frac{n}{B(\Delta\gamma)}\right)^{2}\right]. \qquad (3.1)$$
This is a Weibull model with the average cumulative damage as scale param-
eter and a shape parameter of 2.
The assessment "twice the 'damage', twice the density of failure" can be
applied to any scalar damage model. Although it is a very simple assumption,
it does have certain attractive characteristics. It gives an entire lifetime dis-
tribution, integrates the damage model into this distribution, and it does not
introduce abstract parameters. Compare this with the usual Coffin-Manson
life equation, which gives only the median life:
76 Max Mendel

$$N_{50} = \frac{1}{2}\left(\frac{\Delta\gamma}{2\varepsilon_f'}\right)^{1/c}.$$
If a lifetime distribution is needed, it is common to choose a Weibull that
has the same median. However, this involves the introduction of a new and
abstract shape parameter that has to be estimated from data. Notice, by the
way, that the median pertaining to (3.1) is,

$$m_{50} = \frac{1}{2}\left(\frac{\Delta\gamma}{2\varepsilon_f'}\right)^{1/c}\bigl(-\ln[0.5]\bigr)^{1/2},$$
which is in general not equal to the Coffin-Manson median. When, however,
the average cumulative damage is close to the Coffin-Manson median, then
the two are quite close. Moreover, the model in (3.1) provides a mechanism
for adjusting the median based on an observed average.
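To make the comparison concrete, the following sketch computes the scale, median, and survival function of the shape-2 Weibull model. It assumes the scale B(Δγ) equals the Coffin-Manson expression (1/2)(Δγ/(2ε'_f))^(1/c), and the fatigue parameter values are illustrative, not taken from the paper:

```python
import math

def avg_cum_damage(dgamma, eps_f, c):
    # Assumed scale B(dgamma) = (1/2) * (dgamma / (2 * eps_f))**(1/c),
    # i.e. the Coffin-Manson median life (c < 0)
    return 0.5 * (dgamma / (2.0 * eps_f)) ** (1.0 / c)

def survival(n, B):
    # Shape-2 Weibull survival, cf. (3.1): exp(-(n/B)**2)
    return math.exp(-((n / B) ** 2))

def median_life(B):
    # Median of the shape-2 Weibull: m_50 = B * (-ln 0.5)**(1/2)
    return B * (-math.log(0.5)) ** 0.5

# Illustrative (hypothetical) fatigue parameters
dgamma, eps_f, c = 1.0, 0.6, -0.6
B = avg_cum_damage(dgamma, eps_f, c)
m50 = median_life(B)
print(B, m50, m50 / B)  # m50/B = sqrt(ln 2), a constant factor below B
```

The last ratio is the fixed factor by which this model's median sits below the scale parameter, which is exactly the adjustment mechanism mentioned above.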
When there is more physical structure available than a simple scalar dam-
age model, we can take a more sophisticated approach based on indifference
or invariance. The idea is to identify sets of outcomes that are equally likely
or, equivalently, identify a set of transformations that leave the distribution
invariant. This way of assessing likelihood models was pioneered in the statis-
tics literature by DeFinetti (1964) and further extended by several others (see
Bernardo and Smith 1994 for an overview). For engineering applications, vector fields on manifolds are a convenient way to identify equi-probable sets or, alternatively, to serve as the (infinitesimal) generators for the invariance transformations.
To illustrate this, consider the example in Figure 3.2. This is taken from
Shortle and Mendel (1996). A rotor is placed on a shaft which is suspended
by two bearings; in the figure, either the pair B_1 and B_2 or the pair B_1 and B'_2.
Inaccuracies in the manufacture of this assembly lead to imbalances. These
lead to torques in the bearings which cause the bearings to fail. We model
the torques probabilistically. There are two sources of imbalance: (1) Static
imbalance that occurs when the rotor's center of mass is off of the axis of
rotation and (2) dynamic imbalance that occurs when the rotor's principal
axes of inertia are not aligned with the axis of rotation.
Consider only the dynamic imbalance. To model it probabilistically, we
need to put a distribution on the space of inertia tensors. This is a 6-dimensional space; it is spanned by 3 normal moments of inertia and 3 cross moments. These are usually arranged in an inertia matrix:

$$I = \begin{pmatrix} I_{xx} & I_{xy} & I_{xz} \\ I_{xy} & I_{yy} & I_{yz} \\ I_{xz} & I_{yz} & I_{zz} \end{pmatrix}.$$
By re-orienting the rotor, the inertia tensor changes in a 3-dimensional subspace of this 6-dimensional space. Any such subspace is characterized by the 3 principal moments of inertia I_1, I_2, and I_3 (they are the scalar functions
that are invariant under re-orientations of the rotor). An infinitesimal re-orientation is modelled by a vector, and at each point there are 3 such vectors, one corresponding to each of the 3 orthogonal rotations of the rotor. These three vectors span an infinitesimal cube. These cubes give us enough
structure to apply the principle of indifference: a random inertia corresponds
to a distribution that gives equal probability to each of these infinitesimal
cubes. Calculations then show that the likelihood model for inertia is
$$l(I_{xx}, I_{xy}, I_{xz} \mid I_1, I_2, I_3) \propto \prod_{i=1}^{3}\left[(I_i + I_{xx})(I_{i+1} - I_{xx}) + I_{xy}^{2} + I_{xz}^{2}\right]^{-1/2}. \qquad (3.2)$$
Notice that the parameters are the principal moments of inertia. The distri-
bution for the torque it implies is

$$l(\tau \mid I_1, I_2, I_3) = \frac{1}{2}\left[\left(\frac{k - \sqrt{k^{2} - \tau^{2}}}{2k}\right)^{1/2} + \left(\frac{k + \sqrt{k^{2} - \tau^{2}}}{2k}\right)^{1/2}\right]\left(k^{2} - \tau^{2}\right)^{-1/2}. \qquad (3.3)$$
Here k = ω²|I_3 − I_1|/2 is the maximum torque required to spin the rotor
at an angular velocity of ω. This density is shown in Figure 3.3. It shows
clearly the problem in the manufacture: without control, it is much more
likely to produce an assembly that leads to high torques than one with
low torques.
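The shape of Figure 3.3 can also be explored without the closed form (3.3), by direct Monte Carlo: sample a uniformly random orientation (the indifference assumption), rotate a diagonal inertia tensor, and read off the products of inertia. The torque-amplitude form ω²·sqrt(I'_xz² + I'_yz²) used below is an illustrative assumption; its maximum over orientations is ω²(I_3 − I_1)/2.

```python
import math, random

def random_rotation():
    # Uniform (Haar) random rotation from a normalized Gaussian quaternion
    q = [random.gauss(0.0, 1.0) for _ in range(4)]
    nrm = math.sqrt(sum(v * v for v in q))
    w, x, y, z = (v / nrm for v in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - z * w),     2 * (x * z + y * w)],
        [2 * (x * y + z * w),     1 - 2 * (x * x + z * z), 2 * (y * z - x * w)],
        [2 * (x * z - y * w),     2 * (y * z + x * w),     1 - 2 * (x * x + y * y)],
    ]

def torque_sample(I1, I2, I3, omega):
    # Rotate diag(I1, I2, I3): I' = R D R^T, then take the products of inertia
    # I'_xz, I'_yz that couple spin about z into a bearing torque of assumed
    # amplitude omega**2 * sqrt(Ixz**2 + Iyz**2)
    R = random_rotation()
    D = [I1, I2, I3]
    Ixz = sum(R[0][k] * D[k] * R[2][k] for k in range(3))
    Iyz = sum(R[1][k] * D[k] * R[2][k] for k in range(3))
    return omega ** 2 * math.hypot(Ixz, Iyz)

random.seed(0)
taus = [torque_sample(1.0, 2.0, 3.0, 1.0) for _ in range(20000)]
taus.sort()
print(taus[len(taus) // 2], taus[-1])  # median and largest observed torque
```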

Fig. 3.2. Rotor on shaft with bearings (B_1, B_2) and (B_1, B'_2).

This example also demonstrates the peculiar mathematical difficulties one encounters in PPoF. The space of inertia tensors, although 6-dimensional, is not ℝ⁶, although it is tangent to ℝ⁶ at any point. It is foliated into 3-dimensional submanifolds that correspond to a set of principal moments of inertia. To determine the likelihoods on this space, we proceeded as follows. The three vectors wedge together to form a trivector field that represents the
Fig. 3.3. Bearing-torque density for a randomly oriented rotor (here k = 10).

infinitesimal cube within each submanifold. A 3-form field representing a probability density acts on this trivector field to give the scalar amount of
probability in each trivector. Setting this equal to a constant corresponds to
applying the principle of indifference conditional on a set of values for the
principal moments of inertia. By introducing the inertia matrix coordinates
one finds the expression in (3.2) and by changing variables appropriately one
finds the expression in (3.3).
To every indifference corresponds an equivalent invariance assumption
describing a symmetry in the distribution. For instance, the distributions in
the above example can be characterized as those that are invariant under
the action of the special orthogonal group SO(3) on the inertia-tensor space.
For ℝ^n, the way invariances determine a Bayesian likelihood model has been
much studied recently in the literature. More complicated manifolds such as
the tensor space in the example have received much less attention.

4. Design for Reliability


In the design phase of an engineering system we have that:
1. No lifetime data is available since nothing has been built yet.
2. Competing designs have to be compared to select a "best" design.
PPoF models have two characteristics that match these requirements:
1. No data is required to assess the PPoF likelihood models.
2. All parameters of the probability models are physical.
Instead of data we have the physics of the failure mechanism on which to base
the model. Presumably, this is known in the design phase. Also, the physics-of-
failure models become part of the PPoF models. For instance, the angular
velocity w appears explicitly in the likelihood for the torque in (3.3) and the
cumulative wear G in the likelihood for lifetime in (2.1). Also the statistical
parameters have physical meaning such as "average cumulative wear" and
"principal moment of inertia." Physical parameters may differ from design to
design and can potentially even be controlled. This implies that designs can
be compared on their probabilistic characteristics or that a design can even
be optimized with respect to the probabilistic characteristics.
To make this concrete, consider the tribology-based lifetime distribution
introduced in the introduction again. By increasing the drilling speed, we
increase the chances of shorter lifetimes and, hence, increase the total cost
due to downtime and replacement. On the other hand, the production can be
increased by increasing the drilling speed. These costs compete and so there
is an optimum. Because the tribology-based lifetime distribution is stated
directly in terms of the wear curve, we can determine from the handbooks
which cutting speed is cost optimal. (See Chick and Mendel 1996 for the
analysis and results in an age replacement policy).
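The trade-off can be sketched with a handbook-style Taylor tool-life law V·T^n = C standing in for the paper's tribology wear curve; all numbers below are hypothetical:

```python
def tool_life(V, n=0.25, C=400.0):
    # Taylor tool-life law V * T**n = C, solved for life T (minutes of cutting)
    return (C / V) ** (1.0 / n)

def cost_per_part(V, downtime_cost=5.0):
    # Competing costs: running cost per minute plus replacement/downtime cost
    # incurred once every tool_life(V) minutes; production rate grows with V
    running_cost = 1.0            # machine + operator cost rate (assumed)
    production_rate = V / 100.0   # parts per minute, proportional to speed (assumed)
    cost_rate = running_cost + downtime_cost / tool_life(V)
    return cost_rate / production_rate

# Scan cutting speeds for the cost-optimal one
speeds = [50 + 5 * i for i in range(51)]
best = min(speeds, key=cost_per_part)
print(best)  # the optimum balances tool replacement against lost production
```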
To illustrate this further, consider the problem of designing the manu-
facture of the rotor whose dynamic imbalance was analyzed in the previous
section. One possible way of manufacturing a rotor is shown in Figure 4.1.
Uncertainties in the angle θ of the drill lead to uncertainties in the dynamic
and the static imbalance of the rotor. For the details see Shortle and Mendel
(1995). Figure 4.2 shows the densities in the bearing torques as a function of
the drilling height h. Rather than attempting to plot the density in a third
dimensions, the pictures give the flow lines of the conditional density fCrlh).
For a fixed drilling height h, these graphs are interpreted as follows: Draw
an imaginary vertical line at the value of h (dotted in the figure). This line
intercepts the flow lines of the conditional density. In between each two ad-
jacent intercepts lies a unit amount of probability. Thus, the flow lines give
a picture of probability density comparable to that of mass density in com-
pressible fluid flow: when the flow lines cluster together the density goes up
and vice versa.
From these conditional density flows we can assess the best drilling height.
The figure shows that when the bearings are on either side of the rotor, the
model predicts the obvious, namely that the closer one brings the drill to
the rotor, the smaller the chances for large torques are. The interesting case
occurs, however, when the bearings are on one side of the rotor (as they would
be, for instance, on a helicopter blade). Then there is a definite best drilling
height. This depends, of course, on the particular loss function. A couple are
shown in the figure.
The results in Figure 4.2(b) are not obvious without the PPoF analysis
and in that way make a case for probabilistic design. The intuitive reason for
the phenomenon is the following: The manufacture of the rotor by drilling
leads both to a static imbalance and a dynamic imbalance. These are de-
pendent on one another: the larger the angle θ, the larger both imbalances.
However, when the signs are opposite the torques due to the imbalances can
partially cancel each other out. This apparently only happens when the bear-
Fig. 4.1. Manufacture of a rotor by drilling the center hole on a drill press. θ is the unknown error angle from the vertical.

[Flow-line plots of the conditional torque density f(τ|h) against drilling height, panels (a) and (b), for the loss functions "minimize worst-case τ" and "minimize E(τ)", with the optimal drill height marked.]

Fig. 4.2. Conditional density of the bearing torque r as a function of the drilling
height for either suspension case.
ings are on one side of the rotor. The PPoF analysis allows us to control the
chances of the various bearing torques by redesigning the manufacture.

5. Conclusions

The PPoF approach uses the physics of the failure process to derive a prob-
ability model. The paper argues that this is useful when there is no data
and when we wish to use probability in the design phase of an engineering
system. However, how would a PPoF approach use lifetime data when this is
available? This is the question this concluding section addresses.
The PPoF approach yields a likelihood model. This likelihood model can
be used to process any data that may be available. The situation is summa-
rized in Figure 5.1. The failure mechanism produces both the physics of the
failure and the failure data. The physics of failure leads to a PPoF likelihood
model that combines with data to provide an updated model.

[Diagram: the failure mechanism produces the physics of failure and the failure data; the physics of failure yields the PPoF likelihood model, which is combined with the data through the Bayes formalism to give the posterior probability model.]

Fig. 5.1. PPoF and data.

Because the statistical parameters are physical quantities (average cumulative wear, principal moment of inertia) rather than indices of true distributions, they must be random variables. Therefore, the Bayesian formalism
for parametric inference is the obvious one to use here. However, a prior is
then needed and this still has to be assessed.
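A minimal sketch of the Bayes step, assuming the shape-2 Weibull PPoF lifetime model F(t) = exp(−(t/B)²) of Section 3.2 with the physical scale B treated as a random variable. Writing θ = B², the squared lifetimes are exponentially distributed with mean θ, so an inverse-gamma prior on θ is conjugate (a standard result); the prior numbers are hypothetical:

```python
def posterior_update(a, b, lifetimes):
    # Prior theta ~ InvGamma(a, b); each t_i**2 is exponential with mean theta,
    # so the posterior is InvGamma(a + n, b + sum of t_i**2)
    n = len(lifetimes)
    return a + n, b + sum(t * t for t in lifetimes)

def posterior_mean_theta(a, b):
    # Mean of InvGamma(a, b), valid for a > 1
    return b / (a - 1.0)

# Prior assessed from the physics (hypothetical numbers), updated with data
a0, b0 = 3.0, 200.0
a1, b1 = posterior_update(a0, b0, [8.0, 11.0, 9.5])
print(posterior_mean_theta(a0, b0), posterior_mean_theta(a1, b1))
```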
The question arises whether the PPoF approach can be used to assess
the prior. To answer this question, it is important to understand why PPoF
gives a likelihood model. In physics of failure one often considers a class of
systems: for instance sets of drill bits that share the same wear curve. A
class of systems leads to a class of probability models, rather than a single
probability model. The class determines the likelihood function. Each member
of the class is determined by a prior. The prior captures what is different from
system to system. For the drill bits, this is the average cumulative wear of the
particular set of bits. If the physics of failure addresses a single system, the
PPoF approach will specify both likelihood and prior, although it is not clear
how useful the distinction is then. Thus, although the PPoF approach covers
this, additional information concerning the physics of failure of a particular
system has to be introduced to derive a prior.

Acknowledgement. Research done while visiting the Department of Mathematics and Informatics, Delft University of Technology. The author thanks Stephen Chick and John Shortle for their help. The derivation of the fatigue-fracture survival function on page 72 was done by Pei-Sung Tsai.

References

Chick, S.E., Mendel, M.B.: Deriving Accelerated Lifetime Models from Engineering
Curves with an Application to Tribology. 40th IES Annual Technical Meeting
Proceedings (1994)
Chick, S.E., Mendel, M.B.: Using Wear Curves to Predict the Cost of Changes in
Cutting Conditions. ASME Journal of Engineering for Industry. To appear in
(1996)
DeFinetti, B.: La Prevision: ses Lois Logiques, ses Sources Subjectives. Annales de
l'Institut Henri Poincare 7, 1-68 (1937). English translation in: Kyburg, Jr.,
H.E., Smokler, H.E.(eds.): Studies in Subjective Probability. New York: Wiley
1964
Smith, A.F.M., Bernardo, J.M.: Bayesian Theory. New York: Wiley 1994
Negishi, H., Aoki, K.: Investigations of Reliability of Carbide Cutting Tools (1st
Report). Precision Machining (Journal of the Japanese Society of Precision
Engineers) 42 (6-extra), 459-464 (1976)
Diaconis P., Freedman D.: A Dozen de Finetti-style Results in Search of a Theory.
Annales de l'Institut Henri Poincare 23, 397-423 (1987)
Schintlmeister, W., Wallgram, W., Kanz, J., Gigl, K.: Cutting Tool Materials
Coated by Chemical Vapour Deposition. In: Dowson, D.(ed.): Wear, a Cele-
bration Volume. Lausanne: Elsevier 1984, pp. 153-169
Shortle, J.F., Mendel, M.B.: Probabilistic Design of Rotors: Minimizing Static and
Dynamic Imbalance. Technical Report #95-29, ESRC (1995)
Shortle, J.F., Mendel, M.B.: Predicting Dynamic Imbalance in Rotors. Probabilistic
Engineering Mechanics. To appear in (1996)
Dynamic Modelling of Discrete Time
Reliability Systems
Moshe Shaked,1* J. George Shanthikumar,2** Jose Benigno Valdez-Torres3
1 Department of Mathematics, University of Arizona, Tucson, AZ 85721-0001, USA
2 The Walter A. Haas School of Management, University of California, Berkeley,
CA 94720, USA
3 Escuela de Ciencias Quimicas, Universidad Autonoma de Sinaloa, Culiacan,
Sinaloa, Mexico

Summary. In this paper we summarize recent results that have been obtained
in Shaked et al. (1994, 1995) on the dynamic modelling of reliability systems in
discrete time. Discrete time models of reliability systems are appropriate when the
system operates in cycles or the system is monitored at discrete time epochs. On
the other hand, discrete failure times arise naturally in several common situations
in reliability theory where clock time is not the best scale on which to describe
lifetime. Specifically, we model the dynamic behavior of the components of a relia-
bility system by discrete multivariate conditional hazard rates (which is equivalent
to specifying the joint life time distribution of the components). But this represen-
tation allows one to extend the basic model to incorporate repairs and replacements
of components in a natural way. An algorithm to construct sample paths of the dy-
namics of the components based on the discrete multivariate conditional hazard
rate is described. This algorithm can be used to simulate the system behavior and
can be used for numerical studies as well as for analytic stochastic comparisons. We
use this construction to study stochastic comparison of life times in the hazard rate
and other stochastic orderings (of vectors of discrete dependent random lifetimes).

Keywords. Time-dynamic modelling, stochastic ordering, likelihood ratio order-


ing, hazard rate ordering, discrete dynamic construction, history, simulation, con-
struction on the same probability space, discrete Freund model.

1. Introduction
This paper surveys and summarizes recent results which have been obtained
by Shaked et al. (1994, 1995) in the dynamic modelling of reliability systems
in discrete time. One may choose to model the dynamics of a reliability
system in discrete time when it is operated in cycles and the observation is the
number of cycles successfully completed prior to failure. In other situations
a device may be monitored only once per time period and the observation
then is the number of time periods successfully completed prior to the failure
of the device. On the other hand discrete failure times in reliability systems
may arise naturally in several common situations where clock time is not the
best scale on which to describe lifetimes. For example, in weapons reliability,
the number of rounds fired until failure is more important than age at failure,
and in the modelling of the landing gear in aeroplanes the number of take-offs
and landings is more important.

* Supported in part by the NSF Grant DMS 9303891
** Supported in part by the NSF Grant DMS 9308149
The time-dynamic modelling of multi-component reliability systems us-
ing a marked point approach in the continuous time was initially proposed
by Arjas (1981a, 1981b). These works were further extended by Arjas and
Norros (1984, 1989) and Norros (1985, 1986). The continuous analog of the
work described here was originally carried out in a series of papers start-
ing with Shaked and Shanthikumar (1986a, 1986b, 1987a, 1987b). Specif-
ically, among other things, a definition of multivariate conditional hazard
rate functions was introduced in Shaked and Shanthikumar (1986a). The
usefulness of these functions for modelling imperfect repair in the multivari-
ate setting (Shaked and Shanthikumar 1986a) and for characterizing aging
in the multivariate setting (see Shaked and Shanthikumar 1988, 1991a) has been demonstrated. Several notions of probabilistic ordering among vectors of random lifetimes, using this dynamic modelling, are studied in Shaked and Shanthikumar (1987b). A new hazard rate ordering relation among such random vectors is defined and its relationship to other probabilistic orderings is studied in Shaked and Shanthikumar (1990). A summary of these results
(in the context of continuous time modelling) can be found in Shaked and
Shanthikumar (1993b). The results of the present paper can be looked at as a discrete parallel development of the absolutely continuous case summarized in Shaked and Shanthikumar (1993b). However, in the discrete case there are some technical problems which do not appear in the absolutely continuous case. These require a different methodology, which is used in the present paper.
The notion of discrete multivariate conditional hazard rate functions is
presented in Section 2. In Section 3 we present an algorithm (called the dis-
crete dynamic construction) which can construct dynamically, using the dis-
crete multivariate conditional hazard rate functions, a random vector having
a desirable distribution. This algorithm may be used for simulation purposes,
but here we illustrate its use as a technical tool for proving stochastic ordering
among multi-component reliability systems. In Section 4 we give the defini-
tions of the probabilistic orderings which are studied later in the paper. A
result, which states that the discrete multivariate hazard rate ordering implies
stochastic ordering, is proved in Section 4. In the same section, we study the
relationship between the discrete likelihood ordering and the discrete hazard
rate ordering. In Section 5 we discuss the dependence structure among the
components. A summary is provided in Section 6.
Dynamic Modelling of Discrete Time Reliability Systems 85

2. Discrete Multivariate Conditional Hazard Rate Functions
Consider a random vector T = (T_1, T_2, ..., T_n) where T_i is the failure time of component i, i = 1, 2, ..., n. The random vector T takes on values in {1, 2, ...}^n ≡ N_{++}^n.
The following notation will be used. Let z = (z_1, z_2, ..., z_n) ∈ N_{++}^n and I = {i_1, i_2, ..., i_k} ⊂ {1, 2, ..., n}. Then z_I will denote (z_{i_1}, z_{i_2}, ..., z_{i_k}). The complement of I will be denoted by Ī = {1, 2, ..., n} − I. We will also denote e = (1, 1, ..., 1); the length of e will vary according to the expression in which e appears.
Suppose that all the components start to live at time 0 and are new
then. As time progresses the components fail one by one (we do not rule out
the possibility of multiple failures at the same time epoch). Thus, at time
t ∈ N_{++}, the information which has been gained by observing the components is an event of the form {T_I = t_I, T_Ī ≥ te} for some I ⊂ {1, 2, ..., n} and t_I < te. The multivariate conditional hazard rate functions of T are conditioned on such events. They are defined as

λ_{J|I}(t | t_I) = P{T_J = te, T_{Ī−J} > te | T_I = t_I, T_Ī ≥ te}   (2.1)


for some J ⊂ Ī, I ⊂ {1, 2, ..., n}, and t_I < te. If in (2.1) the probability of {T_I = t_I, T_Ī ≥ te} is zero, then λ_{J|I}(t | t_I) is defined as 1. Note that in (2.1) it is possible that J = ∅. In that case we have

λ_{∅|I}(t | t_I) = P{T_Ī > te | T_I = t_I, T_Ī ≥ te}.


If I = ∅ in (2.1) then we abbreviate λ_{J|I}(t | t_I) by λ_J(t). These hazard rates can be called initial because they describe the hazard rates of the components before any failures have occurred.
Clearly, the hazard rate functions are determined by the probability func-
tion of T. But also the converse is true. It is possible to express explicitly the
joint probability function of T by means of the hazard rate functions (2.1);
see Shaked et al. (1995). Specifically, when the probability of simultaneous
failures is zero, for 0 = t_0 < t_1 < t_2 < ... < t_n we have

$$P\{T_i = t_i,\ i = 1, \ldots, n\} = \prod_{k=1}^{n}\left\{\prod_{i=t_{k-1}+1}^{t_k - 1}\left[1 - \sum_{j=k}^{n} \lambda_{\{j\}\mid\{1,\ldots,k-1\}}\bigl(i \mid t_{\{1,\ldots,k-1\}}\bigr)\right]\right\} \times \lambda_{\{k\}\mid\{1,\ldots,k-1\}}\bigl(t_k \mid t_{\{1,\ldots,k-1\}}\bigr). \qquad (2.2)$$


It follows that in order to describe the life distribution of T it is enough
to postulate the hazard rate functions (2.1). This is a useful fact because in
the setting of reliability theory the hazard rate functions have more intuitive
meaning than the joint probability function. In this paper we use these func-
tions to characterize various probabilistic orderings of discrete multivariate
vectors of random lifetimes.
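Definition (2.1) can be checked by brute force on a small example. The sketch below computes the initial hazards λ_J(t) for n = 2 directly from a joint pmf; the uniform pmf on {1, ..., 4}² is hypothetical:

```python
from itertools import product

m = 4
pmf = {(i, j): 1.0 / (m * m) for i, j in product(range(1, m + 1), repeat=2)}

def hazard_initial(J, t):
    # lambda_J(t) = P(T_J = t*e, T_Jbar > t*e | T >= t*e) for the empty history
    Jbar = [i for i in (0, 1) if i not in J]
    cond = sum(v for (t1, t2), v in pmf.items() if t1 >= t and t2 >= t)
    if cond == 0.0:
        return 1.0  # convention from the text when the conditioning event is null
    num = sum(v for key, v in pmf.items()
              if all(key[i] == t for i in J) and all(key[i] > t for i in Jbar))
    return num / cond

# At t = 1: P(T1 = 1, T2 > 1) = 3/16, P(T1 = T2 = 1) = 1/16, P(T > e) = 9/16
print(hazard_initial((0,), 1), hazard_initial((0, 1), 1), hazard_initial((), 1))
```

At any fixed t the four hazards λ_∅, λ_{1}, λ_{2}, λ_{1,2} sum to 1, since the corresponding events partition the conditioning event.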
3. The Discrete Dynamic Construction


Let T = (T_1, T_2, ..., T_n) be a discrete random vector taking on values in N_{++}^n. Let λ_{·|·}(· | ·) be its discrete multivariate conditional hazard rate functions as described in (2.1). We describe now an algorithm, called the discrete dynamic construction, which, using the functions λ_{·|·}(· | ·), constructs a random vector T̂ = (T̂_1, T̂_2, ..., T̂_n) such that

T̂ =_st T   (3.1)
(here '=st' means equality in law).
The algorithm is similar to, but different from, the dynamic construc-
tion described in Shaked and Shanthikumar (1991b). The latter construction
applies to vectors of random lifetimes with absolutely continuous joint distri-
butions. In such a case, no two components can fail at the same time epoch.
Here, in the discrete case, it is possible. Therefore, the discrete construction
is different in nature than the one in Shaked and Shanthikumar (1991b) -
it has to allow multiple failures at some time epochs. The discrete dynamic
construction is described below by induction on t E N++ - the countable
number of time epochs in which components may fail. It is unlike the continu-
ous construction ofShaked and Shanthikumar (1991b) in which the induction
was over the ordered failure times.
We describe now the steps of the discrete dynamic construction. As we
mentioned above, they are indexed by t ∈ N_{++}. In general, Step t describes which components failed at time t, if any. These failure times are the T̂_i's.
Step 1. The algorithm enters this step when all the components are alive. The algorithm now chooses a set J ⊂ {1, 2, ..., n} with probability λ_J(1) [J may be empty], and defines (if J ≠ ∅)

T̂_J = e.   (3.2)

For i ∈ J̄ the algorithm does not define T̂_i in this step; these T̂_i's will be defined in a later step. Upon determination of J and T̂_J the algorithm sets t = 2 and then proceeds to Step t.
Thus, upon exit from Step 1, some of the T̂_i's (if any) have been determined already as described in (3.2), and the other T̂_i's (i.e., for i ∈ J̄) are still to be determined. Therefore T̂_{J̄} > e. (If J = ∅ then after Step 1 one has T̂ > e.)
Step t. Upon entrance to this step some of the T̂_i's (if any) have already been determined. Suppose that the algorithm has already determined the T̂_i's with i ∈ I for some set I ⊂ {1, 2, ..., n}. More explicitly, suppose that upon entrance to this step we already know that T̂_I = t_I (where, of course, t_I < te) and that T̂_Ī ≥ te. The algorithm now chooses a set J ⊂ Ī with probability λ_{J|I}(t | t_I) and defines (if J ≠ ∅)

T̂_J = te.

For i ∉ I ∪ J the algorithm does not define T̂_i in this step; these T̂_i's (if any) will be determined in a later step. From Step t the algorithm proceeds to Step t + 1 provided the complement of I ∪ J is nonempty. Otherwise the construction is complete.
Thus, upon exit from Step t, the T̂_i's with i ∈ I ∪ J have been determined already. The other T̂_i's (if any) are still to be determined, that is, T̂ restricted to the complement of I ∪ J exceeds te. Upon entrance to Step t + 1 (if ever) we already know the values of T̂_i for i ∈ I ∪ J.
The algorithm performs the steps in sequence until all the T̂_i's have been determined. With probability one this will happen in a finite number of steps whenever P{T_i < ∞, i = 1, 2, ..., n} = 1.
From the construction it is clear that T̂ has the discrete multivariate conditional hazard rate functions of T. Since the discrete multivariate conditional hazard rate functions uniquely determine the probability function, it follows that T̂ =_st T.
The discrete dynamic construction can be used to simulate discrete de-
pendent lifetimes. This can be done by generating a sequence of independent
uniform random variables {U_t, t ∈ N_{++}} and using U_t in order to generate
the required probabilities in Step t, t ∈ N_{++}. In this paper, however, we use
the discrete dynamic construction as a technical tool for proving Theorem 4.1
in Section 4.
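A runnable sketch of the construction, specialized to the case where at each epoch every surviving component fails independently with a probability depending only on the number of survivors (the Freund-type model of Example 4.1 below); choosing the failing set J this way realizes λ_{J|I}(t|t_I) = p_k^{|J|}(1 − p_k)^{k−|J|} with k survivors. The constant-rate sanity check is an illustrative assumption:

```python
import random

def dynamic_construction(p, n, rng):
    # p(k): per-component failure probability when k components survive
    alive = set(range(n))
    T = {}
    t = 0
    while alive:
        t += 1                     # Step t of the construction
        k = len(alive)
        failed = {i for i in alive if rng.random() < p(k)}  # the chosen set J
        for i in failed:
            T[i] = t               # T_J = t * e
        alive -= failed
    return [T[i] for i in range(n)]

rng = random.Random(1)
# Constant rate q: each lifetime is then geometric with mean 1/q = 5
q = 0.2
samples = [dynamic_construction(lambda k: q, 3, rng) for _ in range(4000)]
mean_T1 = sum(s[0] for s in samples) / len(samples)
print(mean_T1)  # close to 5
```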

4. Discrete Probabilistic Ordering

4.1 Definitions

Let X = (X_1, X_2, ..., X_n) and Y = (Y_1, Y_2, ..., Y_n) be two discrete random vectors taking on values in {..., −1, 0, 1, ...}^n = Z^n. The random vector X is said to be stochastically smaller than the random vector Y (denoted X ≤_st Y) if

Eφ(X) ≤ Eφ(Y)   (4.1)

for every real function φ, with domain in Z^n, which is increasing with respect to the componentwise partial ordering in Z^n (and for which the expectations in (4.1) exist). In this paper 'increasing' means 'nondecreasing' and 'decreasing' means 'nonincreasing'. If Q denotes the probability measure of X and R denotes the probability measure of Y then we sometimes write Q ≤_st R to denote X ≤_st Y.
The establishment of the relationship X ≤_st Y is of importance in various applications. One can view Theorem 4.1 in Section 4 as a set of sufficient conditions on the discrete multivariate conditional hazard rate functions which assure the stochastic ordering relation between two vectors of discrete random lifetimes.
In order to define the next ordering (the one we call the hazard rate
ordering) we need to introduce some notation. This ordering will be used
only in order to compare vectors of discrete random lifetimes. Therefore, we
assume now that X and Y can take on values only in N_{++}^n.
For t ∈ N_{++} let h_t denote a realization of the failure times of n components up to time t, exclusive. That is, if X_1, X_2, ..., X_n are the discrete random lifetimes of the components, then h_t is an event of the form {X_I = x_I, X_Ī ≥ te} for some I ⊂ {1, 2, ..., n} and x_I < te. On such events
we condition the probabilities in the definition (2.1) of the discrete multivari-
ate conditional hazard rate functions. Such an event will be called a history.
Fix a t ∈ N_{++}. If h_t and h'_t are two histories such that in h_t there are more failures than in h'_t and every component which failed in h'_t also failed in h_t, and, for components which failed in both histories, the failures in h_t are earlier than the failures in h'_t, then we say that h_t ≼ h'_t. More explicitly, if h_t is a history associated with X of the form {X_I = x_I, X_Ī ≥ te} and h'_t is a history associated with Y of the form {Y_A = y_A, Y_Ā ≥ te} then h_t ≼ h'_t if, and only if, A ⊂ I and x_A ≤ y_A (of course, we also have x_{I−A} < te and y_A < te).
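The comparison h_t ≼ h'_t is easy to mechanize. In the sketch below, a history at time t is stored as a dict mapping each already-failed component to its failure time (all recorded times < t); this representation is an illustrative assumption:

```python
def precedes(h, h_prime):
    # h precedes h' when every component failed in h' also failed in h,
    # with an earlier-or-equal failure time (A subset of I and x_A <= y_A)
    return all(i in h and h[i] <= s for i, s in h_prime.items())

# At t = 6: in h components 1 and 2 are down; in h' only component 1, later
h = {1: 2, 2: 4}
h_prime = {1: 3}
print(precedes(h, h_prime), precedes(h_prime, h))  # prints: True False
```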
Remark 4.1. Before proceeding, we note a 1-1 association between {0,1}^n and the set of subsets of {1, 2, ..., n}. For each point u ∈ {0,1}^n let A(u) ⊂ {1, 2, ..., n} be the set of the coordinates of u which are 1's. Conversely, for each set A = {i_1, i_2, ..., i_k} ⊂ {1, 2, ..., n} let u(A) ∈ {0,1}^n be the vector which has 1's in places i_1, i_2, ..., i_k and 0's elsewhere.
Let μ_{·|·}(· | ·) denote the discrete multivariate conditional hazard rate functions of X (as defined in (2.1)). Similarly let η_{·|·}(· | ·) be the hazard rate functions of Y.
Given a history h_t, associated with X, of the form {X_I = x_I, X_Ī ≥ te}, we define now a probability measure Q_{h_t} on {0,1}^n as follows. For A ⊂ Ī set

Q_{h_t}{u(I ∪ A)} = μ_{A|I}(t | x_I),   (4.2)

and let the mass of Q_{h_t} on all other points of {0,1}^n be 0. It is obvious that Q_{h_t} is a proper probability measure; it corresponds to the indicators of the components that have failed by time t, inclusive. We call Q_· the discrete multivariate conditional hazard rate measure of X.
Similarly, given a history h'_t, associated with Y, one can define, as in (4.2), the discrete multivariate conditional hazard rate measure of Y. It is denoted by R_·.

Definition 4.1. Let X and Y be random vectors which take on values in N_{++}^n. The random vector X is said to be smaller than Y in the discrete hazard rate ordering (denoted X ≤_h Y) if

Q_{h_t} ≥_st R_{h'_t} whenever h_t ≼ h'_t.   (4.3)

For example suppose that n = 2. Then (4.3) is equivalent to

μ_{{1,2}}(t) ≥ η_{{1,2}}(t),   t ∈ N_{++},   (4.4)
μ_{{1}}(t) + μ_{{1,2}}(t) ≥ η_{{1}}(t) + η_{{1,2}}(t),   t ∈ N_{++},   (4.5)
μ_{{2}}(t) + μ_{{1,2}}(t) ≥ η_{{2}}(t) + η_{{1,2}}(t),   t ∈ N_{++},   (4.6)
μ_{{1}}(t) + μ_{{2}}(t) + μ_{{1,2}}(t) ≥ η_{{1}}(t) + η_{{2}}(t) + η_{{1,2}}(t),   t ∈ N_{++},   (4.7)
μ_{{2}|{1}}(t | x_1) ≥ η_{{2}}(t) + η_{{1,2}}(t),   t > x_1 ≥ 1,
μ_{{1}|{2}}(t | x_2) ≥ η_{{1}}(t) + η_{{1,2}}(t),   t > x_2 ≥ 1,
μ_{{2}|{1}}(t | x_1) ≥ η_{{2}|{1}}(t | y_1),   t > y_1 ≥ x_1 ≥ 1, and
μ_{{1}|{2}}(t | x_2) ≥ η_{{1}|{2}}(t | y_2),   t > y_2 ≥ x_2 ≥ 1.
If in the case n = 2 there cannot be simultaneous failures, that is, if P{X_1 = X_2} = P{Y_1 = Y_2} = 0, then μ_{{1,2}}(t) = η_{{1,2}}(t) = 0, t ∈ N_{++}, and (4.7) is superfluous because it follows from (4.5) and (4.6). Also (4.4) then obviously holds. The remaining conditions can then be written as

μ_{{k}|I∪J}(t | x_I, x_J) ≥ η_{{k}|I}(t | y_I),   x_I ≤ y_I < te,   (4.8)

x_J < te, I ⊂ {1, 2}, J ⊂ {1, 2}, I ∩ J = ∅, k ∉ I ∪ J.
In fact, for a general n, if no two or more simultaneous failures can occur for a collection of components with lifetimes X_1, X_2, ..., X_n, and for a collection of components with lifetimes Y_1, Y_2, ..., Y_n, then (4.3) reduces to (4.8) (with {1, 2} replaced there by {1, 2, ..., n}). Condition (4.8) is similar to the
condition of Shaked and Shanthikumar (1990) which defines the hazard rate
ordering for vectors of random lifetimes with absolutely continuous distribu-
tions. But in Definition 4.1 we need condition (4.3) rather than (4.8) because
of the positive probability of multiple failures when one deals with discrete
failure times. One can see now the additional complexity which is involved
when one studies components with discrete random lifetimes which may have
multiple failures, as opposed to the case of random lifetimes with absolutely
continuous distributions.
Example 4.1. Consider the following discrete analogue of a model of Ross
(1984) and Freund (1961). Suppose n components start to live at time 0. The
discrete failure rate of each of them at time t = 1 is $p_n$ and they may fail at
time 1 independently of each other. At any time $t \in \mathbb{N}_{++}$, the failure rate of
each of the surviving components is independent of t. It depends only on the
number of surviving components, and the surviving components may fail at
time t independently of each other. More formally, the $\mu_{J|I}(t\,|\,x_I)$ of (2.1) is
now a function of $|I|$ (the cardinality of I) and of $|J|$ only. If $p_{|\bar I|}$ is the failure
rate of any of the surviving components then

$\mu_{J|I}(t\,|\,x_I) = p_{|\bar I|}^{|J|}\,(1 - p_{|\bar I|})^{|\bar I| - |J|}$, $J \subseteq \bar I$, $I \subseteq \{1, 2, \ldots, n\}$.


Let X be a vector of lifetimes having the above distribution. That is, suppose
that X has the discrete multivariate conditional hazard rate functions

$\mu_{J|I}(t\,|\,x_I) = p_{|\bar I|}^{|J|}\,(1 - p_{|\bar I|})^{|\bar I| - |J|}$, $J \subseteq \bar I$, $I \subseteq \{1, 2, \ldots, n\}$, $t \in \mathbb{N}_{++}$.

Let Y have the same distribution but with parameters $q_n, q_{n-1}, \ldots, q_1$ rather
than $p_n, p_{n-1}, \ldots, p_1$. That is, suppose that the discrete multivariate conditional hazard rate functions of Y are

$\eta_{J|I}(t\,|\,y_I) = q_{|\bar I|}^{|J|}\,(1 - q_{|\bar I|})^{|\bar I| - |J|}$, $J \subseteq \bar I$, $I \subseteq \{1, 2, \ldots, n\}$, $t \in \mathbb{N}_{++}$.

Then it can be verified (using coupling arguments) that if $p_i \ge q_j$, $j \ge i$, $i = 1, 2, \ldots, n$, then

$Q_{h_t} \ge_{st} R_{h_t'}$ whenever $h_t \le h_t'$,

where $Q_\cdot$ and $R_\cdot$ are as described in (4.2) and (4.3). Therefore $X \le_h Y$. ||
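For n = 2 and an empty history, the comparison $Q_{h_t} \ge_{st} R_{h_t'}$ in this example can be verified directly: a probability measure on $\{0,1\}^2$ stochastically dominates another if and only if it assigns at least as much mass to every upper set, which is exactly what conditions (4.4)-(4.7) express. A minimal sketch (the parameter values and function names are illustrative, not taken from the paper):

```python
from itertools import product

def hazard_measure(p, n_survivors=2):
    """Q_{h_t} on {0,1}^2 when both components survive: in the model of
    Example 4.1 each survivor fails independently w.p. p[n_survivors]."""
    r = p[n_survivors]
    return {(i, j): (r if i else 1 - r) * (r if j else 1 - r)
            for i, j in product((0, 1), repeat=2)}

def st_dominates(Q, R):
    """Q >=st R on {0,1}^2: Q(U) >= R(U) for every nonempty upper set U."""
    upper_sets = [{(1, 1)},
                  {(1, 1), (1, 0)},
                  {(1, 1), (0, 1)},
                  {(1, 1), (1, 0), (0, 1)},
                  {(1, 1), (1, 0), (0, 1), (0, 0)}]
    return all(sum(Q[x] for x in U) >= sum(R[x] for x in U) - 1e-12
               for U in upper_sets)

p = {1: 0.8, 2: 0.7}   # failure probabilities of X, indexed by #survivors
q = {1: 0.4, 2: 0.3}   # smaller failure probabilities for Y
print(st_dominates(hazard_measure(p), hazard_measure(q)))   # True
print(st_dominates(hazard_measure(q), hazard_measure(p)))   # False
```

The brute-force check over upper sets is feasible here only because $\{0,1\}^2$ has five nonempty upper sets; for general histories and larger n, the coupling argument of Section 4.2 plays this role.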


Let X and Y take on values in $\mathbb{Z}^n$. Let $f$ denote the discrete probability
density of X, that is, $f(x) = P\{X = x\}$, $x \in \mathbb{Z}^n$. Similarly, let $g$ denote the
discrete probability density of Y. We say that X is smaller than Y in the
likelihood ratio ordering (denoted $X \le_{lr} Y$) if

$f(x)g(y) \le f(x \wedge y)\,g(x \vee y)$, $x \in \mathbb{Z}^n$, $y \in \mathbb{Z}^n$,

where $x \wedge y$ denotes $(x_1 \wedge y_1, x_2 \wedge y_2, \ldots, x_n \wedge y_n)$ and $x \vee y$ denotes
$(x_1 \vee y_1, x_2 \vee y_2, \ldots, x_n \vee y_n)$; see Karlin and Rinott (1980) and Whitt (1982),
where examples of random vectors which are ordered by the likelihood ratio
ordering can be found.
It should be noted that $\le_h$ and $\le_{lr}$ are not orderings in the usual sense
because they are not necessarily reflexive.
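The inequality defining $\le_{lr}$ can be checked mechanically on any pair of discrete densities with finite support. A small sketch (the truncated independent geometric densities are an illustrative choice, not an example from the paper; the inequality is invariant under positive rescaling of f and g, so unnormalized truncations are harmless):

```python
from itertools import product

def is_lr_smaller(f, g, support):
    """X <=lr Y: f(x) g(y) <= f(x ^ y) g(x v y) for all x, y in the support."""
    for x, y in product(support, repeat=2):
        lo = tuple(min(a, b) for a, b in zip(x, y))   # componentwise minimum
        hi = tuple(max(a, b) for a, b in zip(x, y))   # componentwise maximum
        if f[x] * g[y] > f[lo] * g[hi] + 1e-15:       # tolerance for round-off
            return False
    return True

def geom_pair(p1, p2, n=6):
    """Unnormalized density of two independent geometric lifetimes on {1..n}."""
    return {(i, j): p1 * (1 - p1) ** (i - 1) * p2 * (1 - p2) ** (j - 1)
            for i in range(1, n + 1) for j in range(1, n + 1)}

f = geom_pair(0.6, 0.5)        # X: larger failure probabilities
g = geom_pair(0.3, 0.2)        # Y: smaller ones, hence longer lifetimes
print(is_lr_smaller(f, g, list(f)))   # True:  X <=lr Y
print(is_lr_smaller(g, f, list(f)))   # False: the reverse direction fails
```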

4.2 Hazard rate ordering and the usual stochastic ordering

In this section we prove the following result.



Theorem 4.1. Let $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$ be two
random vectors which can take on values in $\mathbb{N}_{++}^n$. If $X \le_h Y$ then

$X \le_{st} Y$.   (4.9)
Proof. The proof will be done by constructing, on the same probability space,
two random vectors $\hat X = (\hat X_1, \hat X_2, \ldots, \hat X_n)$ and $\hat Y = (\hat Y_1, \hat Y_2, \ldots, \hat Y_n)$ such that

$\hat X =_{st} X$,   (4.10)
$\hat Y =_{st} Y$, and   (4.11)
$\hat X \le \hat Y$ a.s.   (4.12)

From (4.10), (4.11) and (4.12) one obtains (4.9).
Denote the discrete multivariate conditional hazard rate functions of X
by $\mu_{\cdot|\cdot}(\cdot\,|\,\cdot)$ and of Y by $\eta_{\cdot|\cdot}(\cdot\,|\,\cdot)$.
The construction of $\hat X$ and $\hat Y$ will be done in steps indexed by $t \in \mathbb{N}_{++}$.
Here, as in the discrete dynamic construction, we describe an algorithm in
which t is to be thought of as a value of discrete time. In Step t it is determined
which $\hat X_i$'s (if any) and which $\hat Y_i$'s (if any) are equal to t.
Step 1. The algorithm enters this step with the obvious information that
$\hat X \ge e$ and $\hat Y \ge e$. Consider $Q_{h_1}$ as in (4.3) with t = 1 and $I = \emptyset$ (because
$h_1 = \{X \ge e\}$). Consider $R_{h_1'}$ as in (4.3) with t = 1 and $I = \emptyset$ except that
here $\eta$ replaces $\mu$. From (4.3) it follows that $Q_{h_1} \ge_{st} R_{h_1'}$. Therefore random
vectors $U_1$ and $V_1$, which can take on values in $\{0,1\}^n$, can be defined on
the same probability space such that $U_1$ has the probability measure $Q_{h_1}$, $V_1$
has the probability measure $R_{h_1'}$, and $U_1 \ge V_1$ with probability one (see, e.g.,
Kamae et al. 1977). Let $S_1$ be the joint probability measure of $(U_1, V_1)$. The
algorithm now chooses a realization $(u_1, v_1)$ according to $S_1$.
Let $A \subseteq \{1, 2, \ldots, n\}$ be the set associated with $u_1$ as described in Remark
4.1. Similarly let $A' \subseteq \{1, 2, \ldots, n\}$ be the set associated with $v_1$. Since
$u_1 \ge v_1$ it follows that $A \supseteq A'$. Of course $A'$ or $A$ may be the empty set.
Define

$\hat X_A = e$, $\hat Y_{A'} = e$,

set t = 2 and proceed to Step t.
Upon exit from Step 1 some of the $\hat X_i$'s and some of the $\hat Y_i$'s (if any) have
been determined and it is known, then, that $\hat X_{\bar A} > e$ and $\hat Y_{\bar A'} > e$. It follows
that we already have

$\hat X_A \le \hat Y_A$ with probability one.

Notice that not all the $\hat Y_i$'s with $i \in A$ have been already determined. Some
of the $\hat Y_i$'s (those with $i \in A - A'$) still have not been determined, but they
must satisfy $\hat Y_i > 1$.

Step t. Upon entrance to this step some of the $\hat X_i$'s and some of the $\hat Y_i$'s
(if any) have already been determined. Suppose that the $\hat X_i$'s have been
determined for all $i \in A$ for some set $A \subseteq \{1, 2, \ldots, n\}$. More explicitly,
suppose that $\hat X_A = x_A$, $\hat X_{\bar A} \ge te$. Suppose, also, that the $\hat Y_i$'s have been
determined for $i \in A'$ for some set $A' \subseteq \{1, 2, \ldots, n\}$. More explicitly, suppose
$\hat Y_{A'} = y_{A'}$, $\hat Y_{\bar A'} \ge te$. By the induction hypothesis, $A \supseteq A'$, $x_A < te$, $x_{A'} \le y_{A'} < te$. Therefore, if we define $h_t = \{X_{A'} = x_{A'},\ X_{A-A'} = x_{A-A'},\ X_{\bar A} \ge te\}$ and $h_t' = \{Y_{A'} = y_{A'},\ Y_{\bar A'} \ge te\}$ we have $h_t \le h_t'$. Consider now $Q_{h_t}$ and
$R_{h_t'}$ as defined in Section 4. From (4.3) it follows that $Q_{h_t} \ge_{st} R_{h_t'}$. Therefore,
random vectors $U_t$ and $V_t$, taking on values in $\{0,1\}^n$, can be defined, on
the same probability space, such that $U_t$ is distributed according to $Q_{h_t}$, $V_t$
is distributed according to $R_{h_t'}$, and $U_t \ge V_t$ with probability one. Let $S_t$
be the joint probability measure of $(U_t, V_t)$. The algorithm now chooses a
realization $(u_t, v_t)$ according to $S_t$.
Let $B \subseteq \{1, 2, \ldots, n\}$ be the set associated with $u_t$ as described in Remark
4.1 and let $B' \subseteq \{1, 2, \ldots, n\}$ be the set similarly associated with $v_t$. From
the definition of $Q_{h_t}$ it is clear that $B \supseteq A$. Similarly from the definition of $R_{h_t'}$
it is seen that $B' \supseteq A'$. Also, since $u_t \ge v_t$ it follows that $B \supseteq B'$. Define

$\hat X_{B-A} = te$, $\hat Y_{B'-A'} = te$

and proceed to Step t + 1.
Upon exit from Step t some of the $\hat X_i$'s and some of the $\hat Y_i$'s (if any) have
been determined and it is known that $\hat X_{\bar B} > te$ and $\hat Y_{\bar B'} > te$. Also, since
$B \supseteq B'$, it follows (using the induction hypothesis $\hat X_A \le \hat Y_A$ a.s.) that

$\hat X_B \le \hat Y_B$ a.s.

Notice that not necessarily all the $\hat Y_i$'s with $i \in B$ have been determined by
Step t. The $\hat Y_i$'s with $i \in B - B'$ have not been determined yet, but they
must satisfy $\hat Y_i > t$.
Performing the steps of this procedure in sequence, the algorithm finally
determines all the $\hat X_i$'s and $\hat Y_i$'s using a construction for all $h_t$ and $h_t'$ which
are realized. The resulting $\hat X$ and $\hat Y$ must satisfy (4.12). The $\hat X$ satisfies (4.10)
because it is marginally constructed as in the discrete dynamic construction.
Similarly $\hat Y$ satisfies (4.11). ||
As an example of the use of Theorem 4.1, consider the X and the Y
defined in Example 4.1. It has been shown in Example 4.1 that $X \le_h Y$. It
follows from Theorem 4.1 that $X \le_{st} Y$.
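For the model of Example 4.1 the coupling of $U_t$ and $V_t$ in the proof is particularly transparent: both measures are products of Bernoullis, so a single uniform per component couples them. The following sketch runs the Step-t construction and confirms $\hat X \le \hat Y$ samplewise (the parameter lists and function name are illustrative; they satisfy $p_i \ge q_j$ for $j \ge i$, as required in Example 4.1):

```python
import random

def coupled_lifetimes(p, q, rng):
    """One run of the Step-t coupling for the model of Example 4.1.

    p[m], q[m]: failure probability of each component when m survive.
    Here min(p[1:]) >= max(q[1:]), so U_t >= V_t holds in every step.
    Returns (X, Y) with X[i] <= Y[i] for every i, by construction.
    """
    n = len(p) - 1          # p, q are indexed by the number of survivors 1..n
    X = [None] * n          # \hat X: failure times, filled in step by step
    Y = [None] * n          # \hat Y
    t = 0
    while any(v is None for v in Y):
        t += 1
        mX = sum(v is None for v in X)   # X-survivors at the start of Step t
        mY = sum(v is None for v in Y)   # Y-survivors (>= mX throughout)
        for i in range(n):
            u = rng.random()             # one shared uniform per component
            if X[i] is None and u < p[mX]:
                X[i] = t                 # U_t coordinate i equals 1
            if Y[i] is None and u < q[mY]:
                Y[i] = t                 # V_t coordinate i equals 1
    return X, Y

rng = random.Random(0)
p = [None, 0.9, 0.8, 0.7, 0.6]   # failure rates for X (index = #survivors)
q = [None, 0.5, 0.4, 0.3, 0.2]   # smaller rates for Y
for _ in range(1000):
    X, Y = coupled_lifetimes(p, q, rng)
    assert all(x <= y for x, y in zip(X, Y))
print("coupled X <= Y in every trial")
```

Marginally, each vector is generated exactly as in the discrete dynamic construction (every survivor fails with the probability attached to the current number of survivors), which is what gives (4.10) and (4.11).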

4.3 Hazard Rate Ordering and the Likelihood Ratio Ordering


The following notation is used in this section: Let Z be a random variable
(or vector) and let E be an event. Then $[Z\,|\,E]$ denotes any random variable
(or vector) whose distribution is the conditional distribution of Z given E.
In this section we prove the following result.

Theorem 4.2. Let $X = (X_1, X_2, \ldots, X_n)$ and $Y = (Y_1, Y_2, \ldots, Y_n)$ be two
random vectors taking on values in $\mathbb{N}_{++}^n$. If $X \le_{lr} Y$ then

$X \le_h Y$.   (4.13)
Proof. Denote the discrete density of X by f and of Y by g.
Split $\{1, 2, \ldots, n\}$ into three mutually exclusive sets I, J and L (so that
$L = \overline{I \cup J}$). Fix $x_I$, $x_J$, $y_I$ and $t \in \mathbb{N}_{++}$ such that $x_I \le y_I < te$ and $x_J < te$.
Let $h_t = \{X_I = x_I,\ X_J = x_J,\ X_L \ge te\}$ and $h_t' = \{Y_I = y_I,\ Y_{J \cup L} \ge te\}$.
First we show that

$[(X_I, X_J, X_L)\,|\,X_I = x_I, X_J = x_J, X_L \ge te] \le_{lr} [(Y_I, Y_J, Y_L)\,|\,Y_I = y_I, Y_{J \cup L} \ge te]$.   (4.14)

Denote the discrete densities of $(X_I, X_J, X_L)$ and of $(Y_I, Y_J, Y_L)$ by $f$
and $g$, respectively. The discrete density of $[(X_I, X_J, X_L)\,|\,X_I = x_I, X_J = x_J, X_L \ge te]$ is

$\tilde f(a_I, a_J, a_L) = \dfrac{f(x_I, x_J, a_L)}{\sum_{a_L' \ge te} f(x_I, x_J, a_L')}$

provided $a_I = x_I$, $a_J = x_J$, $a_L \ge te$, and is 0 otherwise. The discrete density
of $[(Y_I, Y_J, Y_L)\,|\,Y_I = y_I, Y_{J \cup L} \ge te]$ is

$\tilde g(b_I, b_J, b_L) = \dfrac{g(y_I, b_J, b_L)}{\sum_{y_J' \ge te} \sum_{y_L' \ge te} g(y_I, y_J', y_L')}$

provided $b_I = y_I$, $b_J \ge te$, $b_L \ge te$, and is 0 otherwise. In order to prove
(4.14) we need to show that

$\tilde f(a_I, a_J, a_L)\,\tilde g(b_I, b_J, b_L) \le \tilde f(a_I \wedge b_I, a_J \wedge b_J, a_L \wedge b_L)\,\tilde g(a_I \vee b_I, a_J \vee b_J, a_L \vee b_L)$.   (4.15)
Since $x_I \le y_I < te$, $x_J < te$, it follows that (4.15) holds if

$f(x_I, x_J, a_L)\,g(y_I, b_J, b_L) \le f(x_I, x_J, a_L \wedge b_L)\,g(y_I, b_J, a_L \vee b_L)$

for $b_J \ge te$, $a_L \ge te$ and $b_L \ge te$. But this follows from the assumption that
$X \le_{lr} Y$. Thus (4.14) holds.
Since $\le_{lr} \Rightarrow \le_{st}$ (see, e.g., Karlin and Rinott (1980) or Whitt (1982)), it
follows from (4.14) that

$[(X_I, X_J, X_L)\,|\,h_t] \le_{st} [(Y_I, Y_J, Y_L)\,|\,h_t']$.   (4.16)
Now define, for $i \in \{1, 2, \ldots, n\}$,

$W_i = \begin{cases} 1 & \text{if } X_i \le t, \\ 0 & \text{if } X_i > t, \end{cases}$

and

$Z_i = \begin{cases} 1 & \text{if } Y_i \le t, \\ 0 & \text{if } Y_i > t. \end{cases}$

From (4.16) it follows that

$[W\,|\,h_t] \ge_{st} [Z\,|\,h_t']$.   (4.17)
The conditional distribution of W given $h_t$ is determined by $\mu_{A|I \cup J}(t\,|\,x_I, x_J)$,
$A \subseteq \overline{I \cup J}$, which are the discrete multivariate conditional hazard rate functions conditioned on $h_t$. This distribution is the one which is associated with
the discrete multivariate conditional hazard rate measure $Q_{h_t}$ of X (see Section 4 for its definition). Similarly, the conditional distribution of Z given $h_t'$
is the one associated with the discrete multivariate conditional hazard rate
measure $R_{h_t'}$ of Y. And (4.17) is equivalent to

$Q_{h_t} \ge_{st} R_{h_t'}$.   (4.18)

Since (4.18) has been shown whenever $h_t \le h_t'$, one obtains (4.3), and this
proves (4.13). ||
It is well known that $X \le_{lr} Y$ implies $X \le_{st} Y$. Theorem 4.2 gives a
stronger result, that is, that $X \le_h Y$. The order $\le_h$ enables us to compare
the underlying items 'locally' as time progresses, in contrast to the 'global'
comparison that the order $\le_{st}$ yields. More explicitly, given comparable histories associated with X and Y at time t, the order $\le_h$ allows us to stochastically compare the predicted behavior of the two underlying systems at the
next time point. Such a comparison is not possible by means of the order $\le_{st}$
alone.

5. Positive Dependence Concepts

In Shaked and Shanthikumar (1990b) there is an application of the continuous orderings to the area of positive dependence. Several notions of positive
dependence, pertaining to the random variables $X_1, X_2, \ldots, X_n$, are obtained in
Shaked and Shanthikumar (1990b) by requiring, for example, that $X \le_h X$
or that $X \le_{lr} X$. The relationships among the continuous orderings enable
one to study the relationships among the various resulting positive depen-
dence notions (see Shaked and Shanthikumar 1990b). These notions were also
compared there to other well known positive dependence notions such as the
positive association notion of Esary et al. (1967). In the present paper we
have not studied the corresponding analogous discrete positive dependence
notions. However we believe that no essential new technical difficulties arise

when one tries to study them. One use of Theorem 4.2 is to show that the
positive dependence notion defined by $X \le_{lr} X$ implies the positive dependence
notion defined by $X \le_h X$.

6. Conclusions and Some Remarks

In this paper we have presented the discrete multivariate hazard rate func-
tions as the time-dynamic models of multicomponent reliability systems
and studied stochastic order relationships among them. These orderings are
discrete analogues of the continuous orderings of Shaked and Shanthikumar
(1990b), but the technical difficulties which are encountered while studying
the discrete orderings are different from those involved with the continuous
orderings of Shaked and Shanthikumar (1990b).
In Shaked and Shanthikumar (1990b) an ordering relation, called the
cumulative hazard ordering, denoted by $\le_{ch}$, is also studied. An analogue of
this ordering is not studied here because a "correct" discrete analogue of $\le_{ch}$
is not easy to identify; see Valdez-Torres (1989).
Shaked and Shanthikumar (1991a) used the orderings of Shaked and Shanthikumar (1990b) in order to define several multivariate aging notions for continuous dependent random lifetimes, such as MIFR (multivariate increasing
failure rate) and a kind of multivariate logconcavity which was called MPF$_2$
(multivariate Pólya frequency of order 2). Similar discrete analogues can be
developed using the discrete multivariate orderings of the present paper. We
may do so elsewhere.

References

Arjas, E.: A Stochastic Process Approach to Multivariate Reliability Systems: No-


tions Based on Conditional Stochastic Order. Mathematics of Operations Re-
search 6, 263-276 (1981a)
Arjas, E.: The Failure and Hazard Processes in Multivariate Reliability Systems.
Mathematics of Operations Research 6, 551-562 (1981b)
Arjas, E., Norros, I.: Life Lengths and Association: A Dynamic Approach. Math-
ematics of Operations Research 9, 151-158 (1984)
Arjas, E., Norros, I.: Change of Life Distribution via A Hazard Transformation:
An Equality with Application to Minimal Repair. Mathematics of Operations
Research 14, 355-361 (1989)
Esary, J.D., Proschan, F., Walkup, D.W.: Association of Random Variables, with
Applications. Annals of Mathematical Statistics 38, 1466-1474 (1967)
Freund, J.E.: A Bivariate Extension of the Exponential Distribution. Journal of the
American Statistical Association 56, 971-977 (1961)
Kamae, T., Krengel, U., O'Brien, G.L.: Stochastic Inequalities on Partially Ordered
Spaces. Annals of Probability 5, 899-912 (1977)

Karlin, S., Rinott, Y.: Classes of Orderings of Measures and Related Correlation
Inequalities. I. Multivariate Totally Positive Distributions. Journal of Multivari-
ate Analysis 10, 467-498 (1980)
Norros, I.: Systems Weakened by Failures. Stochastic Processes and Their Appli-
cations 20, 181-196 (1985)
Norros, I.: A Compensator Representation of Multivariate Life Length Distribu-
tions, with Applications. Scandinavian Journal of Statistics 13, 99-112 (1986)
Ross, S.M.: A Model in Which Component Failure Rates Depend on the Working
Set. Naval Research Logistics Quarterly 31, 297-300 (1984)
Shaked, M., Shanthikumar, J.G.: Multivariate Imperfect Repair. Operations Re-
search 34, 437-448 (1986a)
Shaked, M., Shanthikumar, J.G.: The Total Hazard Construction, Antithetic Vari-
ates and Simulation of Stochastic Systems. Stochastic Models 2, 237-249
(1986b)
Shaked, M., Shanthikumar, J.G.: The Multivariate Hazard Construction. Stochastic
Processes and Their Applications 24, 241-258 (1987a)
Shaked, M., Shanthikumar, J.G.: Multivariate Hazard Rates and Stochastic Order-
ing. Advances in Applied Probability 19, 123-137 (1987b)
Shaked, M., Shanthikumar, J.G.: Multivariate Conditional Hazard Rates and the
MIFRA and MIFR Properties. Journal of Applied Probability 25, 150-168
(1988)
Shaked, M., Shanthikumar, J.G.: Multivariate Stochastic Orderings and Positive
Dependence in Reliability Theory. Mathematics of Operations Research 15,
545-552 (1990)
Shaked, M., Shanthikumar, J.G.: Dynamic Multivariate Aging Notions in Reliabil-
ity Theory. Stochastic Processes and Their Applications 38, 85-97 (1991a)
Shaked, M., Shanthikumar, J.G.: Dynamic Construction and Simulation of Random
Vectors. In: Block, H.W., Sampson, A., Savits, T.H. (eds.): Topics in Statistical
Dependence. IMS Lecture Notes (1991b), pp. 415-433
Shaked, M., Shanthikumar, J.G.: Dynamic Multivariate Mean Residual Functions.
Journal of Applied Probability 28, 613-629 (1991c)
Shaked, M., Shanthikumar, J.G.: Dynamic Conditional Marginal Distributions in
Reliability Theory. Journal of Applied Probability 30, 421-428 (1993a)
Shaked, M., Shanthikumar, J.G.: Multivariate Conditional Hazard Rate and Mean
Residual Life Functions and Their Applications. In: Barlow, R.E., Clarotti,
C.A., Spizzichino, F. (eds.): Reliability and Decision Making. Chapman and
Hall: New York 1993b, pp. 137-155
Shaked, M., Shanthikumar, J.G., Valdez-Torres, J.B.: Discrete Probabilistic Order-
ing in Reliability Theory. Statistica Sinica 4, 567-579 (1994)
Shaked, M., Shanthikumar, J.G., Valdez-Torres, J.B.: Discrete Hazard Rate Func-
tions. Computers and Operations Research 22, 391-402 (1995)
Valdez-Torres, J.B.: Multivariate Discrete Failure Rates with Some Applications.
Ph.D. Dissertation. University of Arizona (1989)
Whitt, W.: Multivariate Monotone Likelihood Ratio and Uniform Conditional
Stochastic Order. Journal of Applied Probability 19, 695-701 (1982)
Reliability Analysis via Corrections
Igor N. Kovalenko 1 ,2
1 STORM Research Centre, University of North London, 166-220 Holloway Road,
London N7 8DB, United Kingdom
2 V.M. Glushkov Institute of Cybernetics, National Academy of Sciences, Ukraine,
40 Glushkov Avenue, Kiev 252207, Ukraine

Summary. Some approaches are developed to the approximate analysis of reliability parameters via small corrections. It is assumed that the parameter under consideration can be computed by a formula for a simpler system slightly different from
the given one. In many cases, a correction can be derived for the difference of the
two values of the parameter. Variance reduction simulation methods can be applied.

Keywords. Queueing, light traffic, perturbation analysis, reliability, availability,
small parameter, redundancy, repairable, ultra-reliable, busy period, time-dependent
queueing, time-dependent reliability, rare events, simulation, variance reduction,
hybrid methods, complex systems, applied probability, special stochastic processes

1. Introductory Remarks
Let me cite from Asmussen and Rubinstein (1995): "Analytical and even
'good' asymptotical expressions for ... rare event probabilities ... are only
available for a very small class of systems." I agree 100% with this opinion,
but my experience suggests that it is not fruitless to seek more and
more general queueing models admitting the derivation of asymptotic or
approximate expressions for reliability parameters. And in case there
is no explicit formula for the desired parameter, one can very often choose an
appropriate formula for a close, slightly changed system, and then calculate the
necessary corrections.
The purpose of the present paper consists in the derivation of some
corrections in three problems typical of the investigation of complex systems
reliability.
For simplicity, only a simple queueing system M/G/2/2 is considered
throughout the paper. But the approach is fruitful in much more general
cases as well.
A short annotated bibliography is attached.

2. Variance Reduction Estimates for Some Busy Period Parameters
Consider a queueing system M/G/2/2 describing the behavior of a repairable
system. Denote by $\nu(t)$ the queue length at a time t, i.e. $\nu(t)$ is the number
of failed components. If $\nu(t) = k$ then a new component failure may occur in
a small interval $(t, t + dt)$ with a probability $\lambda\,dt$ as long as $k \le 3$ and with
zero probability as $k \ge 4$. Thus $\nu(t) \le 4$ with probability 1 if it is the case
for $t = 0$. The system failure is associated with the state $\nu(t) = 4$. There are
two repair channels, $B(t)$ being the distribution function of a repair time.
Consider a busy period originating at the initial time t = 0. Let $\zeta$ be the busy
period length, and let $\tau$ be the sojourn time in the failure state within the busy
period, so that

$\tau = \mathrm{mes}\{t : \nu(t) = 4,\ 0 \le t \le \zeta\}.$

The following two indices are of the main interest: $E\tau$ and $q = P\{\max_{0 \le t \le \zeta} \nu(t) = 4\}$.
The system failure within the busy period can occur through a finite
failure path. Among these, monotonic paths play the central role in the
theory of highly reliable repairable systems. For reference, see Soloviev (1994).
In particular, for our simple example M/G/2/2,

$q_0 = \lambda^3 \int_0^\infty \int_0^\infty \int_0^\infty e^{-\lambda(x+y+z)}\,\bar B(x+y+z)\,\bar B(y+z)\,dx\,dy\,dz,$

where $q_0$ is the system monotonic path failure probability. For a small $\lambda$,

$q_0 \approx \frac{1}{2}\lambda^3 \int_0^\infty \bar{\bar B}^2(t)\,dt,$

where $\bar B(t) = 1 - B(t)$, $\bar{\bar B}(t) = \int_t^\infty \bar B(z)\,dz$, provided the integral is finite.
Consider a random variable $T_0$ vanishing in each of the two cases: (i) no
system failure occurs within the busy period, (ii) a non-monotonic failure
occurs in the same period; and defined as the length of the first system
failure interval in the case that (iii) a monotonic path failure occurs within the
busy period. In the example being considered,

$ET_0 = \lambda^3 \int_0^\infty \int_0^\infty \int_0^\infty y\,e^{-\lambda(x+y)}\,\bar B(x+y+z)\,\bar B(y+z)\,dx\,dy\,dz.$

For small $\lambda$,

$ET_0 \approx \frac{1}{2}\lambda^3 \int_0^\infty z\,\bar{\bar B}^2(z)\,dz$

as soon as the right-hand side is finite.
The cited asymptotic expressions are well known, but non-monotonic failure paths can contribute essentially in practicable cases. Consider, for example, the exponential case $\bar B(t) = e^{-\mu t}$, $t \ge 0$. Set $\rho = \lambda/\mu$, and let $q_1$ be
defined as $q_1 = q - q_0$. For a small $\rho$,

$q_0 \approx \frac{1}{4}\rho^3, \qquad q_1 \approx \frac{3}{8}\rho^4, \qquad q_2 \approx \frac{9}{16}\rho^5,$

where $q_i$ is the probability of a system failure through $4+i$ failures of components within a busy period. The following table illustrates the relative contribution of non-monotonic paths for some values of $\rho$:

  $\rho$      $q_1/q_0$    $q_2/q_0$
  0.1        0.15         0.022
  0.01       0.015        0.0002
  0.001      0.0015       2·10⁻⁶

We have the following bound for the relative values:

$\dfrac{E[\tau - T_0]}{ET_0} \le (1+\rho)\left(1+\dfrac{\rho}{2}\right)^2 - 1 \sim 2\rho, \quad \rho \to 0.$

We have
$q = q_0 + \Delta q, \qquad E\tau = ET_0 + \Delta T,$
where $\Delta q$ and $\Delta T$ should be estimated via simulation.
Many recent investigations deal with the elaboration of variance reduction
methods for the estimation of rare event probabilities. We should mention the
monographs Asmussen (1987) and Rubinstein and Shapiro (1993). In both of
them the score function method, generalizing traditional importance sampling, was developed. I suggest an analytical computation of $q_0$ and $ET_0$,
whereas some estimates are applied for the computation of the correction
terms $\Delta q$ and $\Delta T$. The approach of stratified sampling combined with the
score function method is used.
Let $I_A$ denote the indicator of any random event A. Then

$\Delta q = E\,I\{\text{non-monotonic failure}\}, \qquad \Delta T = E(\tau - T_0).$

The above promised unbiased estimates are of the form

$\widehat{\Delta q} = \frac{1}{n}\sum_{k=1}^n I_k \delta_k, \qquad \widehat{\Delta T} = \frac{1}{n}\sum_{k=1}^n \Delta T_k \delta_k,$

where $I_k$ is the indicator of non-monotonic system failure in trial k, $\Delta T_k$ is
the value of $\tau - T_0$ in the same trial, and $\delta_k$ is the weight due to the change of the
p.d.f. $\lambda e^{-\lambda t}$ to the p.d.f. $\lambda_0 e^{-\lambda_0 t}$, i.e.

$\delta_k = \prod_{j=1}^{r_k} \left(\lambda e^{-\lambda x_{kj}} \big/ \lambda_0 e^{-\lambda_0 x_{kj}}\right),$

where $x_{kj}$, $1 \le j \le r_k$, denote the failure-free times in trial k, and $\lambda_0$ is the
parameter of the sampling exponential law.
Consider, for example, the construction of a small-variance unbiased
estimate of $\Delta q$ for the exponential case $\bar B(t) = e^{-\mu t}$, which can be reduced
to a stopped random walk with transition probabilities

$1 \to 2: \dfrac{\rho}{1+\rho}; \quad 2 \to 3: \dfrac{\rho}{2+\rho}; \quad 3 \to 4: \dfrac{\rho}{2+\rho};$
$2 \to 1: \dfrac{2}{2+\rho}; \quad 3 \to 2: \dfrac{2}{2+\rho}.$

The state 4 is an absorbing one; the walk is stopped with probability
$1/(1+\rho)$ in state 1.
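For this stopped walk the absorption probability solves three first-step equations, which yields elementary closed forms and a concrete check of the expansions $q_0 \approx \rho^3/4$ and $q_1 \approx \frac{3}{8}\rho^4$ quoted above. The closed forms below are consequences of the listed transition probabilities, not formulas taken from the text:

```python
def q_exact(rho):
    """q = P(absorption in state 4): solve a_k = P(hit 4 from k), k = 1, 2, 3:
    a_1 = rho/(1+rho) a_2,  a_2 = (rho a_3 + 2 a_1)/(2+rho),
    a_3 = (rho + 2 a_2)/(2+rho); elimination gives a_1 below."""
    return rho ** 3 / (4 + 2 * rho + rho ** 2 + rho ** 3)

def q0_exact(rho):
    """Probability of the monotonic path 1 -> 2 -> 3 -> 4."""
    return rho ** 3 / ((1 + rho) * (2 + rho) ** 2)

for rho in (0.1, 0.01, 0.001):
    q1 = q_exact(rho) - q0_exact(rho)
    print(rho, round(q1 / q0_exact(rho), 4))      # ~ 1.5 rho: the q1/q0 column
    assert abs(q1 / (0.375 * rho ** 4) - 1) < 3 * rho   # q1 ~ (3/8) rho^4
```

The printed ratios reproduce the $q_1/q_0$ column of the table in the previous subsection.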
Instead we suggest another random walk with transition probabilities

$1 \to 2: 1; \quad 2 \to 3: b; \quad 3 \to 4: b;$
$2 \to 1: 1-b; \quad 3 \to 2: 1-b.$

The value of b is chosen as

$b = (1 + \sqrt{10})/6 \approx 0.6937.$
Then

$\sigma^2[\widehat{\Delta q}] \approx \frac{1}{n}\cdot 0.6604\,\rho^4$

for small $\rho$, whereas $q_0 \approx \rho^3/4$.
The bounds

$\sigma[\widehat{\Delta q}/q_0] \le C\rho, \qquad \sigma[\widehat{\Delta T}/ET_0] \le C_1\rho$

can be established for a wide class of queueing systems, the constants C and
$C_1$ depending on an appropriate moment of the repair time distribution.
A further improvement can be suggested: to compute $q_0 + q_1$ analytically
and use a correction being computed by simulation. The simulated variable
in a single trial has the form

$\widehat{\delta q} = \delta\,I\{\text{failure after at least two repairs}\}.$

Then

$\sigma\big[(q_0 + q_1 + \widehat{\delta q})/q\big] = O(\rho^2)$ as $\rho \to 0$

under some moment condition.
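The whole scheme — analytic $q_0$ plus a simulated correction under the modified walk, reweighted step by step — can be sketched end to end for the exponential case. The function name and the choice $\rho = 0.5$ are illustrative; the closed forms for $q$ and $q_0$ follow by first-step analysis from the transition probabilities of the original walk:

```python
import random

def delta_q_is(rho, n, seed=1):
    """Importance-sampling estimate of Delta q = q - q0: sample the b-walk
    and reweight each step back to the original stopped walk."""
    b = (1 + 10 ** 0.5) / 6                     # ~ 0.6937, as suggested above
    p1 = rho / (1 + rho)                        # original 1 -> 2
    pu = rho / (2 + rho)                        # original 2 -> 3 and 3 -> 4
    pd = 2 / (2 + rho)                          # original 2 -> 1 and 3 -> 2
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        state, w, non_monotonic = 1, 1.0, False
        while state != 4:
            if state == 1:
                w *= p1                         # proposal takes 1 -> 2 w.p. 1
                state = 2
            elif rng.random() < b:
                w *= pu / b                     # likelihood weight of an up-step
                state += 1
            else:
                w *= pd / (1 - b)               # likelihood weight of a down-step
                state -= 1
                non_monotonic = True
        if non_monotonic:                       # only the correction term counts
            total += w
    return total / n

rho = 0.5
q  = rho ** 3 / (4 + 2 * rho + rho ** 2 + rho ** 3)   # exact absorption prob.
q0 = rho ** 3 / ((1 + rho) * (2 + rho) ** 2)          # exact monotonic part
est = q0 + delta_q_is(rho, 50_000)
print(est, q)    # the two agree to a fraction of a per cent
```

Paths that would stop in state 1 under the original walk contribute nothing to $\Delta q$, so giving them proposal probability zero keeps the estimator unbiased; the path weights stay bounded, which is what keeps the variance small.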
The $q_1$ can be computed in the following way.
Let $\xi_k, \xi_k', \eta_k, \eta_k'$ be independent random variables with p.d.f. $\lambda_k \exp\{-\lambda_k t\}$,
$t \ge 0$, for $\xi_k, \xi_k'$, and $b(t)$ for $\eta_k, \eta_k'$. Define the following random variables:

$\gamma_0 = \eta_2 \wedge (\eta_1 - \xi_1)$
$\gamma_1 = \eta_1 + \eta_2 - \xi_1 - 2\gamma_0$
$\gamma_2 = \eta_2 \wedge (\gamma_1 - \xi_1')$
$\gamma_3 = \gamma_1 \wedge \eta_3$

Then

$q_1 = P\{\xi_1 < \eta_1,\ \xi_1' < \eta_1',\ \xi_2 > \gamma_0,\ \xi_1' + \xi_2 < \gamma_2\} + P\{\xi_1 < \eta_1',\ \xi_2 < \gamma_0 < \xi_2 + \xi_3,\ \xi_2' + \xi_3' < \gamma_3\}.$

3. Corrections for Non-Markovity of the Failure Law


Consider a closed queueing system GI/G/2/2 which can be specified as follows. There are four components with a general failure law, so that A(t) is the
distribution function of the free-of-failure time. The repair time distribution
is characterized by the distribution function B(t) of a general form. There
are two repair channels. The system failure means that $\nu(t) = 4$, where $\nu(t)$
is the queue length. Assume that the system is in the statistical equilibrium,
and denote by q the probability of the system failure within a busy period.
Consider also an associated system $\lambda_k$/G/2/2 with a Markovian law of failures, so that each component has the constant failure rate $\lambda = 1\big/\int_0^\infty \bar A(t)\,dt$
and up-transitions of the process $\nu(t)$ have the rate $\lambda_k = (4-k)\lambda$ as soon
as $\nu(t) = k$, $0 \le k \le 4$. For the latter system the parameter q can be
estimated in a similar manner as it is done in Section 2. For a fixed B(t) and
small $\lambda$,
$q_M \approx 3\lambda^3 \int_0^\infty \bar{\bar B}^2(t)\,dt,$

where $q_M$ denotes the value of q for the system with the Markovian failure law.
Consider, for example, the Erlangian case

$dA(t) = 4\lambda^2 t\,e^{-2\lambda t}\,dt, \quad t > 0,$

though the following approach can be suggested for a case where such an approximation is not sufficient.
A free-of-failure period of each component consists of two exponential
phases: phase 2 and phase 1. Denote by $\gamma(t)$ the number of components in
phase 1 at a time t. The process of the system behavior can be considered as
an alternating process in which "Markovian" periods are changed by "non-Markovian" ones and vice versa. Let us introduce a pseudo-time variable $\tau$
increasing within Markovian periods and stopping its increase during non-Markovian periods.

Denote by $a$ the rate of the flow of instant busy periods in the pseudo-time,
and let $a_k$ be the rate of failure instant busy periods at a state k of the
$\gamma$-process. We have

$q = \frac{1}{a}\sum_{k=1}^4 a_k p_k,$

where $p_k$ is the steady-state probability of the event $\{\gamma = k\}$ in the pseudo-time. The values of $a$ and $a_k$ can be computed via busy period analysis; see
Section 2. As concerns $p_k$, the vector $p = (p_0, p_1, p_2, p_3, p_4)^T$ can be derived
as the solution of a perturbed system of linear equations

$A'p = \bar\delta,$

where $\bar\delta = (1, 0, 0, 0, 0)^T$. The following approximate equation holds for $A'$:

(!
1

A' = 0
o 0
3
~
-4 6p 2 -\2p

2
12p
-4+ Sp 3 -ISp
o
24p
1
-4+6p 4- 24p
1
000 1 -4
00

where p = ).//-1, 1
/-I
= f0 B(t)dt. We have

p = (0.0625,0.250,0.375,0.250,0.0625) + 0(p2), a = 4), + 0().p2)
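At $\rho = 0$ the $\gamma$-process in pseudo-time is a birth-death chain on $\{0, 1, \ldots, 4\}$ with up-rate $4-k$ and down-rate $k$ (each of the four components alternates between two exponential phases of the same rate), and its stationary vector is Binomial(4, 1/2) — exactly the leading term of p above. A quick numerical check, solving the unperturbed system with a hand-rolled solver to keep the sketch dependency-free:

```python
def solve(A, b):
    """Plain Gauss-Jordan elimination with partial pivoting (5x5 is tiny)."""
    n = len(A)
    M = [row[:] + [bv] for row, bv in zip(A, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [x - f * y for x, y in zip(M[r], M[c])]
    return [M[r][n] / M[r][r] for r in range(n)]

# rho = 0 matrix: normalization row plus four balance equations of the
# birth-death chain with up-rate 4-k and down-rate k.
A = [[1, 1, 1, 1, 1],
     [4, -4, 2, 0, 0],
     [0, 3, -4, 3, 0],
     [0, 0, 2, -4, 4],
     [0, 0, 0, 1, -4]]
p = solve(A, [1, 0, 0, 0, 0])
print([round(x, 4) for x in p])   # [0.0625, 0.25, 0.375, 0.25, 0.0625]
```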

An elementary asymptotic analysis (matrix inversion etc.) shows that

$q = 3\lambda^3 \int_0^\infty \bar{\bar B}^2(t)\,dt + O(\lambda^4).$

The same expression holds true for $q_M$. The computations show that the
term $O(\lambda^4)$ is the same for both cases, so that these expressions may differ
only by a term $O(\lambda^5)$.

4. Corrections for the Time-dependence

Consider an alternating renewal process starting with its up-phase. Assume
that the up-phases $X_k$ are exponentially distributed with parameter $\lambda$ and the
down-phases $Y_k$ are arbitrarily distributed with mean $1/\mu$. Set

$\rho = \lambda/\mu, \qquad \beta(s) = E\exp\{-s\mu Y_1\}.$

Let $P_0(t)$ denote the pointwise availability of the system. It is well known
that

$P_0(t) \to \dfrac{\mu}{\lambda+\mu}, \quad t \to \infty,$

and

$h(t) \to \dfrac{\lambda\mu}{\lambda+\mu}, \quad t \to \infty,$

where $h(t)$ is the up-to-down renewal rate [indeed, $h(t) = \lambda P_0(t)$]. It is
important, though, to estimate the deviations of both functions from their steady-state limits. Kovalenko and Birolini (1995) derive an exponential two-sided
bound for $P_0(t)$:

$-C_L\,e^{-(1+\rho)\mu t} \le P_0(t) - \dfrac{\mu}{\lambda+\mu} \le C_U\,e^{-(1+\rho)\mu t},$

where $\rho = \lambda/\mu$, under some additional conditions. For example, for any Erlangian distribution one may set $C_L = C_U = 1$ as soon as $\rho \le 0.1$.
Set

$P_0(t) = \dfrac{\mu}{\lambda+\mu} + \dfrac{\lambda}{\lambda+\mu}\,\Delta(t).$

Then the identity

$\Delta(t) = \sum_{n=0}^\infty (-1)^n \rho^n \big(\bar B * B^{*(n)}\big)(t)$

holds, and hence some Monte Carlo procedures can be derived for the computation of a non-stationary correction.
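For exponential repair, $\bar B(t) = e^{-\mu t}$, each convolution in the series has the elementary closed form $(\bar B * B^{*(n)})(t) = e^{-\mu t}(\mu t)^n/n!$, so the series sums to $e^{-(\lambda+\mu)t}$ and $P_0(t)$ reduces to the classical two-state availability formula. A numerical check of the partial sums (the closed form is an elementary computation, not a formula from the text):

```python
import math

def delta_series(t, lam, mu, nmax=60):
    """Partial sum of Delta(t) = sum_n (-1)^n rho^n (Bbar * B^{*(n)})(t)
    for exponential repair, where (Bbar * B^{*(n)})(t) = exp(-mu t)(mu t)^n/n!."""
    rho = lam / mu
    return sum((-rho) ** n * math.exp(-mu * t) * (mu * t) ** n / math.factorial(n)
               for n in range(nmax))

lam, mu = 0.2, 1.0
for t in (0.5, 2.0, 5.0):
    d = delta_series(t, lam, mu)
    p0 = mu / (lam + mu) + lam / (lam + mu) * d
    # exponential case: Delta(t) = exp(-(lam + mu) t) exactly
    assert abs(d - math.exp(-(lam + mu) * t)) < 1e-12
    print(t, p0)
```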

References

Asmussen, S.: Applied Probability and Queues. Chichester: Wiley 1987


Asmussen, S., Rubinstein, R.Y.: Complexity Properties of Steady-State Rare Event
Simulation in Queueing Models. In: Dshalalow, J. (ed.): Advances in Queueing.
Boca Raton: CRC Press 1995, pp. 481-506
Kovalenko, I.N., Birolini, A.: Uniform Exponential Bounds for the Time Dependent
Availability. Exploring Stochastic Laws. Zeist: VSP Publishers 1995
Rubinstein, R.Y., Shapiro, A.: Discrete Event Systems: Sensitivity Analysis and
Stochastic Optimization via the Score Function Method. New York: Wiley 1993
Soloviev, A.D.: Asymptotic Methods for Highly Reliable Repairable Systems. In:
Ushakov, I.A. (ed.): Handbook of Reliability Engineering. Chichester: Wiley
1994, pp. 112-137

Appendix

A. Short Annotated Bibliography


Light traffic theory, which has been developed within the last two decades, can be
fruitfully applied to reliability analysis.
An excellent survey by Blaszczyszyn et al. (1995) should be mentioned
first. For some of the most general results of the approach see Baccelli and
Schmidt (1992), Asmussen (1992), Zazanis (1992). As surveys concentrating on
busy period analysis and reliability applications see Kovalenko (1994, 1995).
Specifically reliability investigations in which light traffic approach was ap-
plied are: Soloviev (1994), Kovalenko (1980), Pechinkin (1984), Gertsbakh
(1984). Some investigations deal with the combination of light traffic analy-
sis and simulation: see Gnedenko and Kovalenko (1989), Kovalenko (1980),
Kovalenko and Kuznetsov (1988), Kuznetsov and Pegg (1996), Reiman and
Weiss (1989). Many recent investigations are devoted to the derivation of
variance reduction methods for the simulation of rare event parameters.
Those most close to reliability problems are: Kleijnen (1996), Kleijnen and
Van Groenendaal (1992), Rubinstein and Shapiro (1993), Heidelberger et al.
(1996), Shahabuddin (1994a, 1994b), Shanthikumar (1986), Shpak (1995). It
is worthwhile to mention also some works of a general character covering not
only specifically reliability problems: Muppala et al. (1996), Jerrum (1995),
Ermakov (1975), Asmussen (1987), Glasserman (1991). For
a variety of stochastic models to be tried via correction approaches see Gnedenko et al. (1969), Kovalenko et al. (1996), Çınlar (1996), Özekici (1996a,
1996b), Shaked et al. (1996), Jensen (1996), Srinivasan and Subramanian
(1980), Birolini (1994). As is illustrated in Kovalenko (1994), the key approach in the mathematical analysis of highly reliable repairable systems consists
in subtle analysis of busy period phenomena via recurrence methods: Prabhu
(1965), Cohen (1982), Stadje (1990).

Bibliography

Asmussen, S.: Applied Probability and Queues. New York: Wiley 1987
Asmussen, S.: Light Traffic Equivalence in Single Server Queues. Ann. Appl. Prob.
2, 555-574 (1992)
Baccelli, F., Schmidt, V.: Taylor Expansions for Poisson Driven (max, +)-linear
systems. Research Report No. 2494, INRIA (1995)
Birolini, A.: Quality and Reliability of Technical Systems. Berlin: Springer 1994
Blaszczyszyn, B., Rolski, T., Schmidt, V.: Light Traffic Approximations in Queues
and Related Stochastic Models. In: Dshalalow, J. H.: Advances in Queueing.
Boca Raton: CRC Press 1995, pp. 379-406
Cohen, J.W.: The Single Server Queue. Amsterdam: North Holland 1982
Çınlar, E.: Fatigue Crack Growth. In this volume (1996), pp. 37-52

Ermakov, S.M.: Die Monte-Carlo-Methode und verwandte Fragen. München: Oldenbourg 1975
Gertsbakh, I.B.: Asymptotic Methods in Reliability Theory: A Review. Adv. Appl.
Prob. 16, 147-175 (1984)
Glasserman, P.: Gradient Estimation via Infinitesimal Perturbation Analysis. Dor-
drecht: Kluwer 1991
Gnedenko, B.V., Belyayev, Y.K., Solovyev, A.D.: Mathematical Methods in Relia-
bility Theory. San Diego: Academic Press 1969
Gnedenko, B.V., Kovalenko, I.N.: Introduction to Queueing Theory. Boston:
Birkhäuser 1989
Heidelberger, P., Shahabuddin, P., Nicola, F.: Bounded Relative Error in Estimat-
ing Transient Measures of Highly Dependable Non-markovian Systems. In this
volume (1996), pp. 487-515
Jensen, U.: Stochastic Models of Reliability and Maintenance: An Overview. In this
volume (1996), pp. 3-36
Jerrum, M.: The "Markov Chain Monte Carlo" Method: Analytical Techniques
and Applications. A manuscript. Department of Computer Science, University
of Edinburgh (1995)
Kleijnen, J.P.C.: Simulation: Runlength Selection and Variance Reduction Tech-
niques. In this volume (1996), pp. 411-428
Kleijnen, J.P.C., Van Groenendaal: Simulation: A Statistical Perspective. Chich-
ester: Wiley 1992
Kovalenko, I.N.: Rare Event Analysis in Estimation of Systems Efficiency and Re-
liability (In Russian). Moscow: Radio i Sviaz 1980
Kovalenko, I.N.: Rare Events in Queueing Systems - A Survey. Queueing Systems
16, 1-49 (1994)
Kovalenko, I.N.: Approximations of Queues via Small Parameter Method. In: Dsha-
lalow, J.H.: Advances in Queueing. Boca Raton: CRC Press 1995, pp. 481-506
Kovalenko, I.N., Kuznetsov, N.Y.: Methods of Highly Reliable Systems Analysis (In
Russian). Moscow: Radio i Sviaz 1988
Kovalenko, LN., Kuznetsov, N.Y., Pegg, P.: The Mathematical Theory of Reliability
of Time-Dependent Systems, with Practical Applications. Chichester: Wiley. To
appear (1996)
Kovalenko, I.N., Kuznetsov, N.Y., Shurenkov, V.M.: Models of Random Processes.
A Handbook for Mathematicians and Engineers. Boca Raton: CRC Press. To
appear (1996)
Muppala, J.K., Malhotra, M., Trivedi, K.S.: Markov Dependability Models of Com-
plex Systems: Analysis Techniques. In this volume (1996), pp. 442-486
Ozekici, S.: Optimal Replacement of Complex Devices. In this volume (1996a), pp.
158-169
Ozekici, S.: Complex Systems in Random Environments. In this volume (1996b),
pp. 137-157
Pechinkin, A.V.: The Analysis of One-Server Systems with Small Load. Eng. Cy-
bern. 22, 129-135 (1984)
Prabhu, N.U.: Queues and Inventories. New York: Wiley 1965
Reiman, M.I., Weiss, A.: Light Traffic Derivatives via Likelihood Ratios. IEEE
Trans. on Inf. Theory 35, 648-654 (1989)
Rubinstein, R.Y., Shapiro, A.: Discrete Event Systems: Sensitivity Analysis and
Stochastic Optimization by the Score Function Method. New York: Wiley 1993
Shahabuddin, P.: Importance Sampling for the Simulation of Highly Reliable
Markovian Systems. Management Science 40, 333-352 (1994a)
Shahabuddin, P.: Fast Transient Simulation of Markovian Models of Highly De-
pendable Systems. Performance Evaluation 20, 267-286 (1994b)
106 Igor N. Kovalenko

Shaked, M., Shanthikumar, J.G., Valdez-Torres, J.B.: Dynamic Modelling of Dis-
crete Time Reliability Systems. In this volume (1996), pp. 83-96
Shanthikumar, J.G.: Uniformization and Hybrid Simulation/Analytic Models of
Renewal Processes. Operations Research 34, 573-580 (1986)
Shpak, V.D.: Accelerated Simulation Methods of Highly Reliable Semi-Markovian
Systems. (Short Summary). Reliability and Maintenance of Complex Systems.
Lecture Notes. NATO ASI, Kerner-Antalya, Turkey (1995)
Srinivasan, S.K., Subramanian, R.: Probabilistic Analysis of Redundant Systems.
Lecture Notes in Economics and Mathematical Systems. New York: Springer
1980
Stadje, W.: A New Approach to the Distribution of the Duration of the Busy Period
for a GI/G/1 Queueing System. J. Austral. Math. Soc. Ser. A 48, 89-100 (1990)
Zazanis, M.A.: Analyticity of Poisson-Driven Stochastic Systems. Adv. Appl. Prob.
24, 532-541 (1992)
Towards Rational Age-Based Failure Modelling
Menachem P. Berg
Department of Statistics, University of Haifa, Mount Carmel, Haifa 31905, Israel

Summary. Conventional age-based failure modelling is too arbitrary in the way
life distributions are selected. The specific focus here is on repairable items, for
which we propose a failure modelling procedure that is based on a relevant
physical notion, thereby making it less abstract and easier for the user engineer
to deal with effectively. Coupled with the modelling is statistical inference on
unknown parameters through suitable physical interpretations. In that regard we
adopt here a Bayesian approach, which in the context of repairable items requires
Bayesian revision models for stochastic processes.

Keywords. Maintenance, repairable items, age-based failure-modelling, Bayesian
inference

1. Introduction

Causal failure modelling, where we relate failures to specific causes, can be
in many instances unrealistic because of the crudeness of the assumptions
made, the extent of the necessary data to implement the model and the costs
involved in the actual penetration of the system to learn about its reliability
status. Therefore, analysts often resort to age (or usage)-based failure mod-
elling where the failure phenomenon is simply related to the age (calendar
or operational) of the item. From a mathematical point of view all that is
required with this black-box type approach is a probability-distribution that
represents the inherent randomness involved in the time-to-failure: the so-
called "life"-distribution. It is noteworthy that this course of action is very
much relevant to mechanical systems where, as with biological ones, wearout
can be related in a natural manner to aging. It is also customary to attribute
similar aging characteristics to electronic devices though with less conclusive
empirical evidence and a conspicuous exception in this regard is software reli-
ability where aging is essentially irrelevant to failures (and therefore a causal
type of modelling, of one sort or another, is mandatory).
Still, age-based failure modelling even with the most natural hardware
type, suffers from a serious deficiency in the way it is usually executed. There
is a disturbing level of arbitrariness in the way life distributions are selected.
As things are done, prominence and priority is given to certain probability dis-
tributions solely by virtue of their "nice" mathematical form. Once selected,
such a life-distribution candidate is subject to a statistical "test" at the end
of which it is either "accepted" or "rejected". Yet, given enough parameters,
not too much data in comparison and a not too harsh "level-of-significance",
acceptance is indeed not too rare and in any case it is obvious that being
tested first can give a meaningful advantage. Furthermore, there is no clear
mechanism for the transition to the next candidate if one is "rejected".
(Removing arbitrariness in failure modelling is also aimed at in Mendel 1996,
where a concrete physical law can be used for that purpose, and, although in
an altogether different reliability context, in Çınlar 1996 and Goel and
Palettas 1996.)
Ostensibly, the Bayesian approach is free of the above drawbacks since
there one chooses the initial life-distribution according to "judgment" and
further flexibility is rendered by leaving some parameters free for updating
according to gathered data. Yet, guidelines for choosing the initial life distri-
bution (call it effective ways of translating the general ideas and information
of the user engineer into probabilistic terms) could enhance the procedure
and improve convergence to the true distribution. Then, there is criticism
with regard to the conventional parametric Bayesian approach of the notion
of a generic "parameter" void of any clear meaning rooted in the physical
world, for which the user-engineer is still expected to be able to provide good
probabilistic assessment in terms of a "prior" distribution. Both issues will
be considered later.

2. Who Should be Nice?

The tendency to choose probability distributions with "nice" mathematical
forms to represent the time-to-failure distribution stems from an innate gen-
eral concept of symmetry and smoothness in real-world phenomena. While
this would be acceptable to many people if a physical process is involved,
extending it to artifacts like life distributions - based on the abstract notion
of probability - is quite dubious.
The apparent remedy of that is to use a relevant physical process and make
it the starting point of the fitting procedure. That would facilitate as well as
improve assessments by the user engineer who now deals with an empirical
quantity he understands and has experience with. In the next section we shall
identify such a physical quantity and tailor a fitting procedure around it.

3. Repairable Items

We shall restrict our attention in this paper to items that normally are re-
paired upon failure with a replacement being only an alternative maintenance
action. This situation, rather than the single-failure case where an item is
replaced upon its first failure, is the one encountered with most items of im-
portance: the large and the expensive ones, e.g., engines, and machines, and
even with circuit boards in electronic devices. For such items the term life-
distribution is clearly a misnomer (and hence the quotation marks we used
earlier) but conforming with the general literature we shall retain this term
here.
Since the subject of this paper is age-based failure-modelling we shall
assume here that repairs are minimal (see Barlow and Proschan 1975) so
that the past history of failures has no impact on the future of the fail-
ure process and only age counts. This last, essentially qualitative, type of
property translates elegantly to a specific stochastic process namely a (time
non-homogeneous) Poisson process [follows immediately from the indepen-
dent increment property of the (unit jump) failure process throughout the
life of an item, which is implied by the very definition of a minimal repair,
coupled with a corresponding characterization of Poisson processes (Çınlar
1975)]. Moreover, it can be shown that the intensity function of this Poisson
process is exactly the hazard function r(·) of the underlying life distribution
F(·). These two functions are interrelated by the one-to-one relationship:

F̄(x) = e^{−R(x)}   (3.1)


where R(x) = ∫₀ˣ r(u) du and F̄(·) is the survival function, i.e.,
F̄(x) = 1 − F(x).
Next, we utilize a basic property of the Poisson process, namely that R(x)
is equal to the expected number of events (failures here) in the time (age here)
interval [0, x], so that

μ(x) = R(x)/x,  x > 0   (3.2)

represents the expected number of failures per unit of time, as a function of
the item's age.
The function μ(x) not only represents a physical phenomenon but is also
an average and hence expecting "niceness" from its functional form is far
more founded. We shall now utilize this function for the development of a
selection procedure for age-based failure models.
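The relations (3.1) and (3.2) are easy to check numerically. The sketch below is our own illustration, not part of the chapter; the function names and the Weibull-type hazard r(u) = 2u (for which R(x) = x² in closed form) are assumptions made for the example:

```python
import math

def cum_hazard(x, r, n=10_000):
    """R(x) = integral from 0 to x of r(u) du, by the trapezoidal rule."""
    h = x / n
    return h * (0.5 * (r(0.0) + r(x)) + sum(r(k * h) for k in range(1, n)))

def survival(x, r):
    """Survival function F-bar(x) = exp(-R(x)), eq. (3.1)."""
    return math.exp(-cum_hazard(x, r))

def mu(x, r):
    """Expected failures per unit of time up to age x: mu(x) = R(x)/x, eq. (3.2)."""
    return cum_hazard(x, r) / x

# Weibull-type hazard with shape 2: r(u) = 2u, so R(x) = x**2 exactly.
r = lambda u: 2.0 * u
assert abs(cum_hazard(1.5, r) - 1.5 ** 2) < 1e-9   # trapezoid is exact for linear r
assert abs(survival(1.5, r) - math.exp(-1.5 ** 2)) < 1e-9
assert abs(mu(1.5, r) - 1.5) < 1e-9                # R(x)/x = x for this hazard
```

For a minimal-repair item, μ(x) is exactly what one would estimate empirically as (number of failures up to age x)/x, which is what makes it a natural modelling primitive.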

4. Generating New Life Distributions


As a first step we want to use (3.1) and (3.2) to generate an array of life
distributions that correspond to "nice" functional forms of μ(x). Let us now
consider several examples:

(a) μ(x) = Σ_{i=0}^{m} a_i x^i,  x ≥ 0;  a_i > 0

corresponds to a life distribution which is expressible as a product of the
survival functions of Weibull distributions with different (integer) shape
parameters and (possibly) different scale parameters.
(b) μ(x) = ln(ax + b),  x ≥ 0;  a > 0, b ≥ 1

corresponds to

F̄(x) = (ax + b)^{−x},  x ≥ 0

(c) μ(x) = ax/(1 + x),  x ≥ 0;  a > 0

corresponds to

F̄(x) = e^{−ax²/(1+x)}

(d) μ(x) = e^{ax},  x ≥ 0;  a > 0

corresponds to

F̄(x) = e^{−x e^{ax}},  x ≥ 0.
The first thing to observe is that even though the functional forms of μ(x)
in the above four examples are basic and "nice", the resulting life distributions
are rarely, if ever, used and are unlikely to be arrived at by direct "guessing".
Thus, perfectly legitimate candidates for age-based failure-modelling have
been ignored merely because the "niceness" of the mathematical functional
forms has been placed on the "wrong guys".
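Since R(x) = x·μ(x), each correspondence above amounts to the identity F̄(x) = e^{−x·μ(x)}. The quick numerical check below is our own sketch of examples (b), (c) and (d); the parameter values are arbitrary choices satisfying the stated constraints:

```python
import math

a, b = 0.5, 2.0  # arbitrary values with a > 0, b >= 1
cases = [
    # (b) mu(x) = ln(ax + b)     ->  F-bar(x) = (ax + b)^(-x)
    (lambda x: math.log(a * x + b), lambda x: (a * x + b) ** (-x)),
    # (c) mu(x) = ax / (1 + x)   ->  F-bar(x) = exp(-a x^2 / (1 + x))
    (lambda x: a * x / (1 + x),     lambda x: math.exp(-a * x * x / (1 + x))),
    # (d) mu(x) = e^(ax)         ->  F-bar(x) = exp(-x e^(ax))
    (lambda x: math.exp(a * x),     lambda x: math.exp(-x * math.exp(a * x))),
]
for mu_fn, sf in cases:
    for x in (0.1, 1.0, 3.0):
        # F-bar(x) = exp(-R(x)) with R(x) = x * mu(x), by (3.1)-(3.2)
        assert abs(math.exp(-x * mu_fn(x)) - sf(x)) < 1e-12
```

Example (a) can be checked the same way once the polynomial coefficients are fixed.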
We note that all the above μ(x) are monotonically increasing (technically
relating to the IFRA aging property; see discussion later), which is the natural
state of affairs with mechanical systems (at least once the burn-in period
is over), with the different functional forms above reflecting different
monotonicity characteristics (rate of increase, etc.). Continuing this pattern,
an extensive mapping of life distributions can be created which are interrelated
through the monotonicity characteristics of their corresponding μ(x) functions.
This scheme of life distributions should serve as the knowledge base for
the fitting procedure.

5. The Fitting Procedure


Since the function μ(x) has a clear physical interpretation, which the user-
engineer is able to relate to, the choice of a life distribution on its basis
becomes a much more natural exercise. The initial fitting, mainly on the basis
of the monotonicity characteristics of different potential functions μ(x), yields
a functional form for F(·) whose constant coefficients may not be known. At
this point we add, as mentioned earlier, the inference element, which would be
approached here according to the Bayesian paradigm. Consider for instance
an initial choice of μ(x) that belongs to the general class of functions of the
form

μ(x) = a f(x),  x ≥ 0;  a > 0,


where f(·) is a positive increasing function. Assuming that f(·) is specified
whereas "a" is not, we change the notation to the standard Bayesian form:

μ(x|θ) = θ f(x),  x ≥ 0,   (5.1)


where for θ we need a prior distribution Π(·). Since for any given θ we have

F̄(x|θ) = e^{−xμ(x|θ)} = e^{−θxf(x)},  x ≥ 0   (5.2)

the (unconditional) survival function is given by

F̄(x) = ∫ F̄(x|θ) Π(θ) dθ = Π*(xf(x)),   (5.3)

by (5.2), where Π*(·) is the Laplace transform of Π(·).


We first note that in the repairable (or multiple-failure) case considered
here, combined with the representation in (5.1), θ possesses a physical
interpretation and is not just an abstract "parameter". Thus, for f(x) = x,
θ represents the expected number of failures in one unit of time, and a similar
interpretation applies for any (increasing) f(·) through a time-transformation
(see Ozekici 1996 in this volume for an intrinsic-clock motivation for this
time-transformation). Therefore, suggesting a prior distribution for θ becomes
less of an abstract exercise, and intuition and experience with the item can be
incorporated in a natural manner.

6. Revision of the Prior and Life Distributions on the Basis of Observed Data

Suppose that an item of the above type has been observed in operation until
age y. The failure-data thus obtained can be summarized by

T_y = {N(y) = n, s = (s₁, …, s_n)}

where N(y) is the number of failures up to age y and s their moments of
occurrence.
To proceed with the Bayesian revision model we need the likelihood of T_y
given θ, and here we utilize the abovementioned fact that the failure process
must be Poisson with intensity function r(·) (the hazard function), which is
now denoted by r(·|θ). Then, applying some general results obtained in Berg
(1987) (within another problem context) regarding Bayesian revision models
for the Poisson process, it can be shown that for the particular form in (5.1),
the posterior distribution for θ is given by

Π(θ|T_y) = Π(θ|N(y)) = k Π(θ) θ^{N(y)} e^{−θyf(y)}   (6.1)

where k is a normalizing constant (independent of θ). Thus, in particular,
N(y) is sufficient for T_y in this Bayesian revision model. The (conditional)
survival function then becomes
F̄(x|T_y) = ∫ e^{−θxf(x)} Π(θ|T_y) dθ = Π*(xf(x)|T_y)   (6.2)

where Π*(·|T_y) is the Laplace transform of Π(·|T_y).

Carrying the example further, suppose that

Π(θ) = Gamma(α, β)   (6.3)

with α and β being the scale and shape parameters, respectively.
Therefore, by (5.3) and (6.3), the (unconditional) survival function is
given by

F̄(x) = (α/(α + xf(x)))^β,  x ≥ 0   (6.4)

From (6.1), we can obtain the posterior distribution for θ as

Π(θ|N(y)) = Gamma(α + yf(y), β + N(y))

(implying that the Gamma distribution is the conjugate prior for θ in this
Bayesian revision model). The revised (unconditional) survival function is
then, by (6.2) and (6.4),

F̄(x|N(y)) = ((α + yf(y))/(α + yf(y) + xf(x)))^{β+N(y)},  x ≥ 0
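The conjugate update can be sketched numerically. The code below is our own illustration (the choice f(x) = x, the prior parameters, and all function names are hypothetical): it checks the closed form (6.4) against direct integration of (5.3) over the prior, and evaluates the revised survival function after N(y) observed failures.

```python
import math

def gamma_prior(theta, alpha, beta):
    """Gamma prior with 'scale' alpha and shape beta, density proportional to
    theta**(beta-1) * exp(-alpha*theta); its Laplace transform is
    Pi*(s) = (alpha / (alpha + s))**beta."""
    return alpha ** beta * theta ** (beta - 1) * math.exp(-alpha * theta) / math.gamma(beta)

def survival_64(x, alpha, beta, f):
    """Unconditional survival function, eq. (6.4)."""
    return (alpha / (alpha + x * f(x))) ** beta

def survival_53(x, alpha, beta, f, hi=50.0, n=50_000):
    """Eq. (5.3) directly: integrate exp(-theta x f(x)) against the prior
    (midpoint rule on [0, hi])."""
    h = hi / n
    return sum(math.exp(-(k + 0.5) * h * x * f(x)) * gamma_prior((k + 0.5) * h, alpha, beta)
               for k in range(n)) * h

def survival_revised(x, alpha, beta, f, y, failures):
    """Revised survival after observing N(y) = failures up to age y."""
    a = alpha + y * f(y)
    return (a / (a + x * f(x))) ** (beta + failures)

f = lambda x: x            # hypothetical choice f(x) = x, so mu(x|theta) = theta * x
alpha, beta = 2.0, 3.0     # arbitrary prior; prior mean of theta is beta/alpha = 1.5
assert abs(survival_64(1.0, alpha, beta, f) - survival_53(1.0, alpha, beta, f)) < 1e-4
# Prior expected number of failures in [0, 2] is E[theta] * y * f(y) = 1.5 * 4 = 6;
# observing 10 failures lowers the predicted survival probability.
assert survival_revised(1.0, alpha, beta, f, y=2.0, failures=10) < survival_64(1.0, alpha, beta, f)
```

The same pattern applies to any increasing f(·): only N(y) and y f(y) enter the update, reflecting the sufficiency of N(y) noted above.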

7. Reliability-Deterioration Criteria in the Repairable Case

In identifying the appropriate μ(x) for the failure-modelling of a repairable
item, one input is the assessment of the reliability-deterioration as age
increases. In the single-failure case this is done through aging criteria, e.g.,
IFR (i.e., r(x) increasing) or IFRA (R(x)/x increasing) (see Barlow and
Proschan (1975) for more details on this topic). It has been argued in Berg
(1995) that in the single-failure context the IFR is a natural aging criterion
whereas the IFRA is rather pointless, since it lacks a clear physical
interpretation and because it is itself not preserved under aging (i.e., an item
that has this property when it is new may lose it, with respect to the remaining
life, at a later age). The IFR, in contrast, can be described in natural language
and is preserved under aging. Moreover, to make the IFRA also possess this
latter preservation property, clearly an essential one for an aging criterion,
requires upgrading it to IFR.
Whereas in the single-failure case the IFR criterion is represented through
the mathematical behavior of the function r(·), as appropriate there from
a physical point of view, here it should be done through the mathematical
behavior of μ(x), the relevant function here from a physical point of view. This
is mainly done through assessment of the monotonicity characteristics of μ(x), and
the most basic such property is merely that μ(x) is an increasing function of x.
In the present context of minimal repairs and the Poisson failure process thus
generated, so that μ(x) = R(x)/x, this property of μ(x) is technically identical
to the IFRA (we say technically because the IFRA was devised for the
single-failure case). We have thus "rescued" this mathematical property by
identifying a context where it is meaningful and useful. The payoff is that all
the mathematical results obtained for the IFRA (Barlow and Proschan 1975)
can be directly applied to the case considered here, namely, repairable items
with minimal repairs. In particular, this includes probably the most useful of
these results, namely the closure property of (coherent) systems, i.e. (when
phrased in the present context), if for each component the expected number
of failures per unit of time increases with age then so does that of the
system itself (assuming operational independence between the components).

References

Barlow, R.E., Proschan, F.: Statistical Theory of Reliability and Life Testing. New
York: Holt, Rinehart and Winston 1975
Berg, M.: Reliability Analysis for Mission-Critical Items. Naval Research Logistics
34, 417-429 (1987)
Berg, M.: Age-Based Failure Modelling: A Hazard-Function Approach. CentER
Discussion Paper (No. 9569), Tilburg University (1995)
Çınlar, E.: Introduction to Stochastic Processes. Englewood Cliffs: Prentice-Hall
1975
Çınlar, E.: Fatigue Crack Growth. In this volume (1996), pp. 37-52
Goel, P., Palettas, P.N.: Predictive Modelling for Fatigue Crack Propagation via
Linearizing Transformations. In this volume (1996), pp. 53-69
Mendel, M.: The Case for Probabilistic Physics of Failure. In this volume (1996),
pp. 70-82
Ozekici, S.: Complex Systems in Random Environments. In this volume (1996), pp.
137-157
Part II

Maintenance of Complex Systems


Maintenance Policies for
Multicomponent Systems: An Overview
Frank Van der Duyn Schouten
CentER for Economic Research, School of Management and Economics, Tilburg
University, 5000 LE Tilburg, The Netherlands

Summary. We present an overview of some recent developments in the area of
mathematical modelling of maintenance decisions for multicomponent systems. We
do not claim to be complete, but rather we expose some ideas both in modelling
and in solution procedures which turned out to be useful in understanding and sup-
porting complex maintenance management decision problems. The mathematical
tools that are used mainly stem from applied probability theory, renewal theory,
and Markov decision theory.

Keywords. Maintenance management, multicomponent systems, corrective and
preventive maintenance, economies of scale, availability, renewal theory, Markov
decision processes.

1. Introduction

In this chapter we review some mathematical models for maintenance man-
agement on multicomponent systems that were introduced and analyzed dur-
ing the last decade. We do not claim to give a complete overview of the state
of the art in this area, but instead we will highlight some models and methods
that turned out to be useful in analyzing these models. The presentation is
highly biased in favor of the author's own contributions to the field. Actually,
several other chapters of this book deal with the same subject.
Mathematical models for reliability analysis and maintenance manage-
ment are in their mathematical nature strongly related to queueing (control)
models and inventory control models. In particular the kit of mathematical
tools for analyzing models from these various areas is more or less the same:
applied probability theory, renewal reward processes, and Markov decision
theory have shown their value in all these areas of application. However, un-
like waiting time models and inventory control models, the application of
maintenance and replacement models has been rather limited in practice.
As an explanation of this distinction in applicability the difference in data
availability should be mentioned. The most successful applications of waiting
time models are to be found in computer and telecommunication architec-
ture. The availability of input data on arrival processes of jobs and service
times is usually not really problematic. The same holds for the input data of
inventory control models: demand distributions and lead time distributions.
However, the data required for successful application of mathematical models
for maintenance and replacement decisions include failure data of the equip-
ment under consideration which are usually not widely available nor easy
to obtain. This makes the application of mathematical models to support
maintenance and replacement decisions less obvious.
A second reason that is often put forward to explain the lack of success
in applications of maintenance and replacement models is the simplicity of
the models compared to the complex environment where the applications
occur. In particular the fact that up to ten years ago the vast majority of
the models were concerned with one single piece of equipment operating in
a fixed environment was considered as an intrinsic barrier for applications.
However, one should realise that this argument is also valid for waiting time
and inventory applications. The booming interest in polling models in queue-
ing theory and in multi-item inventory control models in logistics reflects this
increasing need for more realistic modelling of complex management prob-
lems. From this point of view also the increasing interest for multicomponent
maintenance models can be understood. In this context we should realise,
however, that the availability of reliable data becomes even more important
for successful applications of theoretical developments in this area. Successful
case studies on practical maintenance models are badly needed to convince
management of the potential cost savings in this management field. In sec-
tion 4 we will briefly describe the application of one of the models in the area
of road management. For other implementations of maintenance models we
refer to Dekker and Van Rijn (1996) and Groenendijk (1996).
This chapter is organised as follows. In Section 2 we present an overview
of the models to be discussed in this chapter. We also indicate the various
economic backgrounds that justify the choice of these models. In Section 3
we address the problem how to structure the (corrective) maintenance on
parallel and identical units. In Section 4 preventive maintenance on paral-
lel and non-identical units is considered, while in Section 5 we address the
problem how to combine in an economic optimal fashion corrective and pre-
ventive maintenance actions on a number of independent units. Finally in
Section 6 we pay attention to models which take explicitly into account that
maintenance activities should be considered as an intrinsic part of produc-
tion schedules, implying that scheduling of preventive maintenance activities
should not only be based on the physical condition of the equipment but also
on its immediate impact on the production process in which it operates.

2. Multicomponent Maintenance Models and Their Economic Justification
As for multi-item inventory models, many maintenance models for multicomponent
systems derive their value from the existence of economies of scale
in carrying out maintenance activities on several units simultaneously. The
most direct example of this situation is that of a parallel production system
consisting of a number of identical production units. The units are operating
simultaneously, but, at the expense of production losses, a failed unit does not
have to be repaired immediately, since production can continue, although at
a lower level, because of the remaining units. The same production losses are
of course incurred during repair activities on one or more units. Economies
of scale occur due to the maintenance cost structure. In order to repair one
or more failed machines a maintenance crew has to be brought to the spot
(which might be costly for example in case of offshore activities). Also it may
occur that due to safety regulations the whole production process has to be
interrupted during maintenance activities. Costs of this type usually consist
of a part that is independent of the size of the maintenance job, i.e. inde-
pendent of the number of failed units, and a part that is proportional to the
number of failed units included in the repair activity. The problem now is to
determine in which situations a maintenance activity should be started and
how many failed units should be included in this maintenance activity. In
this setting the problem resembles the classical single unit inventory control
problem. First we present some results for the case of identical units, as ob-
tained by Assaf and Shanthikumar (1987) and Ritchken and Wilson (1990).
Assaf and Shanthikumar (1987) prove that under special assumptions the
optimal policy has a control limit structure, i.e., start a maintenance activity
on all failed units as soon as the number of failed units has reached a given
threshold. For a more general situation Ritchken and Wilson (1990) analyse
a certain well structured class of policies. Next we review a paper by Jansen
and Van der Duyn Schouten (1995) who, unlike the previously mentioned
authors, take the repair times into account explicitly. They show that in this
case the cost structure can result in an optimal policy that prescribes idling
of operational components in some situations.
Section 4 is devoted to a model for the case of non-identical units. How-
ever, the focus here is on preventive rather than on corrective maintenance.
Corrective maintenance is supposed to be started as soon as a failure occurs
and hence it is not controllable. The problem is to plan the preventive main-
tenance activities on the various components in such a way that economies
of scale are obtained. Combining preventive and corrective maintenance ac-
tivities is not allowed. The results presented in this section are based on
Goyal and Kusy (1985), Dekker et al. (1996) and Vos De Wael (1995). The
latter reference includes an application on the maintenance of traffic control
systems.
In the situations described above the option exists to leave a (failed) com-
ponent out of operation for some period of time. Apparently, this is related to
the structure in which the unit operates. For example, in a series structure it
does not make sense to leave a failed unit out of operation longer than strictly
necessary (unless the costs for maintenance far outweigh the revenues of the
production process, in which case it seems to be better to stop the whole
operation). But also in case the unavailability of one single unit prevents the
system from operating, there might be room for combination of maintenance
activities. In particular the moments at which corrective maintenance activ-
ities are called for, might be used to carry out preventive maintenance on
non-failed, but deteriorated, units. Such a policy might reduce the number
of unexpected corrective maintenance activities at fairly low costs, since pre-
ventive maintenance, when combined with corrective maintenance, can be
carried out without substantial additional expenses. Models to describe de-
cision problems of this kind have been studied by many authors. In Section
5 we describe some results based on Haurie and L'Ecuyer (1982), Ozekici
(1988), Van der Duyn Schouten and Vanneste (1990, 1993) and Wijnmalen
and Hontelez (1996).
A major difficulty experienced in practical situations is that maintenance
and production are considered as responsibilities of different departments.
The maintenance department prefers to do preventive maintenance at those
moments at which the (maintenance) workload is low, while the production
department prefers to carry out maintenance activities when the demand rate
is low. Unfortunately, in general those dips in workload will not coincide.
The most sensible solution is to make the production department responsible
for maintenance of their own equipment as long as technical skills are not
prohibitive in this respect. In Section 6 we describe a model which is aimed
at illustrating the possible effects of this integration in terms of cost savings.
The presentation is based on Van der Duyn Schouten and Vanneste (1995)
and De Waal and Vanneste (1995).

3. Corrective Maintenance for Parallel Systems Consisting of Independently
Operating and Identical Units

Consider n identical units which are operating in parallel and subject to
random failures. The lifetimes of the individual units are independent and
identically distributed random variables with distribution function F(t) and
density function f(t). When a unit is waiting for repair or under repair it does
not affect the functionality of the other units. However, it will cause a certain
loss of productivity at a non-negative rate C₁(k) per unit of time when k units
are down simultaneously. There is full information on the number of failed
units as well as on the age of non-failed units. At failure epochs of individual
units one might decide to start repair or replacement of an arbitrary number
of failed units. Moreover, at the same time an arbitrary number of non-failed
units can be overhauled. Both repair of a failed unit and overhaul of a
non-failed unit result in a situation which is characterized as "as good as new".
The costs of repair and overhaul depend on the number of units included in
these operations, and are denoted by C₂(k) and C₃(k) respectively.
In Assaf and Shanthikumar (1987) it is assumed that the lifetime
distribution of individual units is negative exponential with parameter λ.
It is obvious that under this assumption overhaul of non-failed
units does not make sense, which makes the cost function C₃(k) obsolete.
Repairs are supposed to be instantaneous, and the cost rate C₁(k) and the
cost function C₂(k) are both supposed to be linear in k, i.e., C₁(k) = kC₁
and C₂(k) = C₀ + kC₂, where C₀, C₁ and C₂ are known constants.

Theorem 3.1. Suppose the system starts at time 0 with all units in
operational condition. When

C₁ ≤ λC₂

it is optimal not to repair at all. Otherwise there exists a critical number
1 ≤ m ≤ n, such that the optimal policy prescribes to start a repair on all
failed units if and only if the number of failed units has reached the level m.

Proof. The proof proceeds in three steps. First of all it is noted that in the
search for the optimal policy we can restrict ourselves to policies characterized
by two critical numbers m and l: repair l units whenever the number of failed
units reaches the level m, for some 1 ≤ l ≤ m ≤ n. This result is based on
the observations that the number of failed units increases by steps of size one,
while any repair will reduce the number of failed units. Secondly, it is shown
by renewal reward arguments that the average cost g(m, l) per unit of time
for the policy with critical numbers m and l is given by

g(m, l) = [λC₀ + lλC₂ + C₁ Σ_{k=0}^{l−1} (m−l+k)/(n−m+l−k)] / [Σ_{k=0}^{l−1} 1/(n−m+l−k)]

Next it is shown that

g(m, l) ≥ g(m, l + 1)   (1 ≤ l ≤ m − 1),

so that it is optimal to repair all failed units (l = m).

Theorem 3.2. (i) The optimal critical repair level m* is the smallest inte-
ger m for which

Σ_{k=0}^{m−1} (m−k)/(n−k) ≥ λC₀ / (C₁ − λC₂).

(ii) For large values of n, the optimal critical repair level m* is asymptotically
proportional to √n.
Proof. See Assaf and Shanthikumar (1987).
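To make the renewal-reward expression and the threshold rule of Theorem 3.2 concrete, the following sketch (with made-up parameter values and hypothetical helper names) evaluates g(m, l) and locates the critical repair level both by brute-force minimization of g(m, m) and by the threshold inequality:

```python
def avg_cost(n, m, l, lam, C0, C1, C2):
    """Renewal-reward average cost g(m, l): repair l failed units as soon
    as m units are down; exponential lifetimes (rate lam), instantaneous
    repairs, repair cost C0 + l*C2, downtime cost rate k*C1 when k are down."""
    cycle = sum(1.0 / (n - m + l - k) for k in range(l))          # lam * cycle length
    down = sum((m - l + k) / (n - m + l - k) for k in range(l))   # lam/C1 * downtime cost
    return (lam * C0 + l * lam * C2 + C1 * down) / cycle

def critical_level(n, lam, C0, C1, C2):
    """Smallest m with sum_{k=0}^{m-1} (m-k)/(n-k) >= lam*C0/(C1 - lam*C2)
    (Theorem 3.2); returns None when it is optimal never to repair."""
    if C1 - lam * C2 <= 0:
        return None
    threshold = lam * C0 / (C1 - lam * C2)
    for m in range(1, n + 1):
        if sum((m - k) / (n - k) for k in range(m)) >= threshold:
            return m
    return None

# Illustrative (made-up) parameters: the threshold rule and brute-force
# minimization of g(m, m) single out the same critical level.
n, lam, C0, C1, C2 = 50, 1.0, 10.0, 2.0, 0.5
m_star = critical_level(n, lam, C0, C1, C2)
m_best = min(range(1, n + 1), key=lambda m: avg_cost(n, m, m, lam, C0, C1, C2))
```

With these constants the critical level sits near √(2nλC₀/(C₁ − λC₂)), in line with the √n asymptotics of part (ii).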
122 Frank Van der Duyn Schouten

Ritchken and Wilson (1990) consider the case of general lifetime distri-
butions, also with instantaneous repair. They restrict attention to a class of
policies characterized by two critical numbers m and T, implying that a main-
tenance activity (including repair of all failed units and overhaul of all non
failed units) is started if and only if the number of failed units has reached the
level m or T units of time have passed since the previous maintenance activ-
ity. Since only combined maintenance on all units is considered the moments
at which maintenance is started are renewal points for the process describing
the ages of each of the individual components. From the analysis by Assaf
and Shanthikumar (1987) it can be concluded that, in case of exponentially
distributed lifetimes, the optimal policy within this class has T = ∞.
Note that the expected time between two subsequent maintenance activ-
ities equals

E[min(T, S₍m₎)] = ∫₀^T Σ_{j=0}^{m−1} (n choose j) F(t)^j (1 − F(t))^{n−j} dt,

where S₍m₎ denotes the m-th order statistic of the n i.i.d. lifetimes.
Similar expressions can be obtained for the total expected costs during one
cycle, which provides an explicit expression for the average costs per unit of
time as a function of the control parameters m and T. Using some properties
of this function Ritchken and Wilson present an algorithm to compute the
optimal values of m and T from a finite number of function evaluations.
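Under the (m, T) policy the maintenance epoch is min(T, S₍m₎), with S₍m₎ the m-th order statistic of the n lifetimes, so the expected cycle length can be evaluated by numerically integrating the survival probability. A small sketch under illustrative assumptions (Weibull lifetimes, hypothetical function names):

```python
from math import comb, exp

def expected_cycle_length(n, m, T, F, steps=2000):
    """E[min(T, S_(m))] for the (m, T) policy: expected time until either
    T elapses or the m-th failure among n i.i.d. lifetimes (c.d.f. F)
    occurs, via E[min(T, S_(m))] = integral_0^T P(S_(m) > t) dt
    (trapezoidal rule)."""
    def surv(t):
        # P(fewer than m of the n units have failed by time t)
        p = F(t)
        return sum(comb(n, j) * p ** j * (1 - p) ** (n - j) for j in range(m))
    h = T / steps
    return h * (0.5 * surv(0.0) + 0.5 * surv(T)
                + sum(surv(i * h) for i in range(1, steps)))

# Example: n = 10 units with Weibull(shape 2) lifetimes -- illustrative only.
weib_cdf = lambda t: 1.0 - exp(-(t * t))
cycle = expected_cycle_length(n=10, m=3, T=1.0, F=weib_cdf)
```

The same integral evaluated for a grid of (m, T) pairs is the building block of the finite search that Ritchken and Wilson describe.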
Assaf and Shanthikumar also show that under the assumptions of their
model there exists an optimal policy which does not allow operational units
to idle. Jansen and Van der Duyn Schouten (1995) show that this conclusion
is not correct in case the repair is not instantaneous. They consider the case
where the costs for production losses far outweigh the actual repair costs
of the machines, i.e. C₂(k) and C₃(k) are both assumed to be equal to zero.
However, the costs for production losses C₁(k) are non-decreasing and convex
in k. The lifetime distributions are again exponential with parameter λ (as
in the case of Assaf and Shanthikumar). Also the repair time of one single
unit is exponentially distributed with parameter μ. There are no economies of
scale in repair time, i.e. the total length of the repair time of two units is the
sum of two exponentials, each with parameter μ. Due to the assumptions on
the cost functions, it follows, in correspondence with the result of Assaf and
Shanthikumar, that the optimal policy will not allow a repair on a failed unit
to be postponed until other units have failed (the critical repair level is equal
to 1). In order to investigate whether it is profitable to allow operational units
to idle we assume that the running speed of each individual unit is adjustable
between 0 and 1. Using speed x for a unit simply means that the failure rate
of this unit is reduced from λ to xλ. So using speed 0 means that a unit
is completely idling. Consequently, when i units are operational, the total
production speed of all units together can be controlled within the interval
[0, i]. The function C₁(·) that represents the costs of loss of production is
now assumed to have a continuous argument. Jansen and Van der Duyn

Schouten (1995) consider the case of restricted repair capacity (meaning that
the number of available repair servers s is smaller than or equal to n, the
number of units). In this presentation we will only deal with the case of
ample repair capacity (s = n).
This control problem can be formulated as a semi-Markov decision model
with discrete state space {0, 1, ..., n} and continuous action space [0, i] in state
i. State i corresponds to the situation that i units are available and n − i
are under repair. Taking action a ∈ [0, i] means that the system produces at
capacity i − a, while capacity a is kept in reserve. Note that putting a unit in
reserve position has a negative impact on the present productivity level, but
has the advantage that this unit is not subject to failure and hence is available
with certainty when the next unit breaks down. In case the running speeds of
individual units are not adjustable the action space in state i simply reduces
to {0, 1, ..., i}. Now in state i, only transitions to states i − 1 and i + 1 will
occur, since the capacity that is kept in reserve is available again at the next
decision epoch. This leads to the following transition probabilities, expected
transition times, and expected one-step transition costs for the semi-Markov
decision process:

p_{i,i−1}(a) = (i−a)λ / [(i−a)λ + (n−i)μ]   (a ≤ i; j = i−1);
p_{i,i+1}(a) = (n−i)μ / [(i−a)λ + (n−i)μ]   (a ≤ i; j = i+1);       (3.1)
τ_i(a)      = 1 / [(i−a)λ + (n−i)μ]         (a ≤ i);
c_i(a)      = C₁(n−i+a) / [(i−a)λ + (n−i)μ] (a ≤ i).
The average cost optimality equations thus become

v_i = min_{0≤a≤i} [C₁(n−i+a) − g + (i−a)λ v_{i−1} + (n−i)μ v_{i+1}] / [(i−a)λ + (n−i)μ]
    = min_{0≤y≤i} [C₁(n−y) − g + yλ v_{i−1} + (n−i)μ v_{i+1}] / [yλ + (n−i)μ]   (i = 0, ..., n),   (3.2)

where y = i − a denotes the capacity actually put into production.

Now apply the uniformization technique to this semi-Markov decision model,
using a uniform transition rate of n(λ + μ); then, in state i under action y,
self-transitions occur with rate (n−y)λ + iμ. Also it may be assumed without
loss of generality that n(λ + μ) = 1, so that (3.2) can be written as

v_i = min_{0≤y≤i} {C₁(n − y) − g + yλ v_{i−1} + (n−i)μ v_{i+1} + [(n−y)λ + iμ] v_i}

⟹ 0 = min_{0≤y≤i} [C₁(n − y) + yλ(v_{i−1} − v_i)] − (n−i)μ(v_i − v_{i+1}) − g.

Hence

v_i − v_{i+1} = (min_{0≤y≤i} [C₁(n − y) + yλ(v_{i−1} − v_i)] − g) / ((n − i)μ)   (i = 1, ..., n−1).   (3.3)

A fixed policy y = (y₁, ..., y_n) can now be analyzed by means of a birth-death
process with transition rates

p_{i,i−1} = y_i λ      (i = 1, ..., n);       (3.4)
p_{i,i+1} = (n − i)μ   (i = 0, ..., n−1).

The steady-state probabilities immediately follow from (3.4):

π_i = (Π_{j=1}^{i} y_j)^{−1} · (n!/(n−i)!) · (μ/λ)^i · π₀   (i = 1, ..., n).   (3.5)

This gives the following expression for the average costs as a function of the
control rates y_i (i = 1, ..., n) (where y_i denotes the actual productivity level
when i units are available):

g(y) = Σ_{i=0}^{n} π_i C₁(n − y_i),   (3.6)

with y₀ = 0. The optimal policy is determined by the control rates y_i that
minimize (3.6) subject to y_i ≤ i (i = 1, ..., n). Moreover, (3.6) can be used
to construct an efficient policy iteration algorithm as follows. In any step of
the policy iteration procedure, the following system of equations has to be
solved for some fixed policy y = (y₁, ..., y_n).

v_i(y) = [C₁(n − y_i) − g(y) + y_i λ v_{i−1}(y) + (n − i)μ v_{i+1}(y)] / [y_i λ + (n − i)μ]   (i = 0, ..., n),   (3.7)

with the normalization v₀(y) = 0.

Since the average costs g(y) can be calculated from (3.6), system (3.7) can
be solved recursively as follows (compare (3.3)):

v₀(y) = 0;
v₀(y) − v₁(y) = (C₁(n) − g(y)) / (nμ);                                                   (3.8)
v_i(y) − v_{i+1}(y) = [C₁(n − y_i) − g(y) + y_i λ (v_{i−1}(y) − v_i(y))] / ((n − i)μ)   (i = 1, ..., n−1).

Using this efficient method, the policy iteration algorithm converges relatively
fast to the optimal policy as compared to the value iteration algorithm in
terms of number of iterations as well as calculation time. For n = 5 the
procedure takes in most cases less than 5 iterations and less than 10 seconds
of calculation time on an 80386 DX microprocessor to find the optimal policy.
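The policy iteration scheme described above can be sketched as follows for the quadratic loss rate C₁(x) = x², using the normalization n(λ + μ) = 1 assumed in the text; the implementation details (tolerances, iteration cap, helper names) are illustrative, not taken from Jansen and Van der Duyn Schouten (1995):

```python
def policy_iteration(n, lam, mu, max_iter=500, tol=1e-9):
    """Policy iteration for the reserve-capacity model with C1(x) = x**2,
    using (3.5)-(3.6) for the gain and the recursion (3.8) for relative
    values; assumes the normalization n*(lam + mu) = 1.
    y[i] = production rate used when i units are available (y[0] = 0)."""
    C1 = lambda x: x * x
    y = [0.0] + [float(i) for i in range(1, n + 1)]   # start: no control
    for _ in range(max_iter):
        # stationary distribution of the birth-death chain (3.4)-(3.5)
        w = [1.0]
        for i in range(1, n + 1):
            w.append(w[-1] * (n - i + 1) * mu / (max(y[i], 1e-12) * lam))
        Z = sum(w)
        pi = [x / Z for x in w]
        g = sum(pi[i] * C1(n - y[i]) for i in range(n + 1))   # gain (3.6)
        # relative value differences d[i] = v_i - v_{i+1} via (3.8)
        d = [0.0] * n
        d[0] = (C1(n) - g) / (n * mu)
        for i in range(1, n):
            d[i] = (C1(n - y[i]) - g + y[i] * lam * d[i - 1]) / ((n - i) * mu)
        # improvement: minimize C1(n - y) + y*lam*(v_{i-1} - v_i) over [0, i];
        # for C1(x) = x**2 the unconstrained minimizer is n - lam*d[i-1]/2
        y_new = [0.0] + [min(float(i), max(0.0, n - lam * d[i - 1] / 2.0))
                         for i in range(1, n + 1)]
        if max(abs(a - b) for a, b in zip(y, y_new)) < tol:
            return y_new, g
        y = y_new
    return y, g
```

For n = 5 and λ = μ = 0.1 (so ρ = 1 and n(λ + μ) = 1) the uncontrolled policy gives g = 7.5, and the iteration moves toward the unrestricted optimum reported in Table 3.1.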

Theorem 3.3. The optimal control rates y_i satisfy

y_i ≤ y_{i+1}   (i = 0, ..., n − 1).
For the complete proof of this theorem we refer to Jansen and Van der Duyn
Schouten (1995). Here we only provide a global indication of the various steps
of the proof, since these steps are more or less typical for proving structural
results of this kind. First define

v_{i,α}^{(k)} := the total α-discounted cost starting in state i with k transitions to go.


The following recursion holds:

v_{i,α}^{(k)} = min_{0≤y≤i} (C₁(n − y) + α[yλ v_{i−1,α}^{(k−1)} + (n − y)λ v_{i,α}^{(k−1)}])
              + α[(n − i)μ v_{i+1,α}^{(k−1)} + iμ v_{i,α}^{(k−1)}]   (i = 0, ..., n; k = 1, 2, ...).   (3.9)

Lemma 3.1. Suppose that C₁(x) is non-decreasing and convex in x (x ≥ 0)
and g_i is non-increasing and convex in i (i = 0, ..., n). Define

f_i := min_{0≤y≤i} [C₁(n − y) + yλ g_{i−1} + (n − y)λ g_i]   (i = 0, ..., n)

and

h_i := (n − i)μ g_{i+1} + iμ g_i   (i = 0, ..., n);

then both f_i and h_i are non-increasing and convex in i (i = 0, ..., n).
Lemma 3.2. v_i is non-increasing and convex in i.

Proof. The monotonicity and convexity of v_i in i is a direct consequence of
the monotonicity and convexity of v_{i,α}^{(k)} in i, which can be proved by
induction. For k = 1, the monotonicity and convexity of v_{i,α}^{(1)} = C₁(n − i)
in i follows directly from the assumptions on C₁. Furthermore, assuming
monotonicity and convexity of v_{i,α}^{(k−1)} in i, the monotonicity and
convexity of v_{i,α}^{(k)} follows, using Lemma 3.1. Define v_{i,α} as the total α-
discounted costs starting in state i (i = 0, ..., n). It is proved in Ross (1983,
Proposition II.3.1) that v_{i,α}^{(k)} converges uniformly to v_{i,α} as k → ∞, and so
v_{i,α} is again non-increasing and convex in i. Also v_{i,α} satisfies an optimality
equation that follows from (3.9) by letting k → ∞:

v_{i,α} = min_{0≤y≤i} (C₁(n − y) + α[yλ v_{i−1,α} + (n − y)λ v_{i,α}])
        + α[(n − i)μ v_{i+1,α} + iμ v_{i,α}]   (i = 0, ..., n).   (3.10)
Note that from the assumed non-negativity of C₁(x) and using (3.9), it
follows that v_{i,α}^{(k)} ≥ 0 and hence v_{i,α} ≥ 0. This, together with (3.10),
the monotonicity of v_{i,α} and the assumption n(λ + μ) = 1, implies

0 ≤ v_{i,α} − v_{n,α} ≤ C₁(n).
Now, using Theorem V.2.2 (ii) in Ross (1983), we conclude that there exists
a sequence of discount factors α_k ↑ 1 such that

v_i = lim_{k→∞} (v_{i,α_k} − v_{n,α_k})   (i = 0, ..., n).

From this limiting relation it follows that v_i inherits the monotonicity and con-
vexity of v_{i,α} (see also Ross 1983, p. 96, Remark 1).
Finally we note that, using (3.3) and differentiating the expression that
has to be minimized, it can be seen that the optimal control in state i equals
min{i, z_i}, with z_i satisfying

−C₁′(n − z_i) + λ(v_{i−1} − v_i) = 0   (i = 0, ..., n).

Since v_i is convex in i this gives

C₁′(n − z_i) = λ(v_{i−1} − v_i) ≥ λ(v_i − v_{i+1}) = C₁′(n − z_{i+1})   (i = 1, ..., n−1).

Using the convexity of C₁ it follows that n − z_i ≥ n − z_{i+1}, or z_i ≤ z_{i+1}. Hence

y_i = min{i, z_i} ≤ min{i + 1, z_{i+1}} = y_{i+1}   (i = 0, ..., n − 1).


Note that Theorem 3.3 indicates that the number of units that are actually
put into operation increases with the number of available units. However,
this does not automatically imply that the number of units put into
reserve position is increasing with the number of available units. To conclude
this section we present a numerical example illustrating Theorem 3.3.
In Table 3.1 below we present numerical results for the following input
parameters: s = n = 5; C₁(x) = x²; the value of ρ := λ/μ ranges from 0.01
(overhaul time very short relative to lifetime) to 100 (lifetime very short
relative to overhaul time). The column headed g₀ gives the average cost for
the situation without control (never put units into reserve position), while
g₁ and g₂ in the third column represent the optimal value of the average cost,
for the case in which the actions are restricted to integer values (g₁) and the
case where the actions are unrestricted (g₂), respectively. The two lines in the
column headed y_i give the corresponding optimal control rates in state i for
both the restricted and the unrestricted case.
From this table (and extensive other numerical experiments) it can be
concluded that the average costs can be reduced only slightly when control
is allowed. According to Theorem 3.3 the optimal control rate y_i increases
with i, while y_{i+1} − y_i ≤ 1. Furthermore, for fixed i, the optimal control
rates decrease as ρ increases. When ρ converges to zero the optimal strategy

Table 3.1. Minimal average costs and optimal productivity levels for various values
of the workload ρ

ρ       g₀        g₁ / g₂    y₁     y₂     y₃     y₄     y₅
0.01    0.05146   0.05146    1      2      3      4      5
                  0.05144    1      2      3      4      5
0.1     0.6198    0.6198     1      2      3      4      5
                  0.6174     1      2      3      4      4.94
0.2     1.3889    1.3889     1      2      3      4      5
                  1.3810     1      2      3      4      4.86
0.5     3.8889    3.8889     1      2      3      4      5
                  3.8675     1      2      3      4      4.60
1       7.5000    7.4806     1      2      3      4      4
                  7.4793     1      2      3      3.99   4.19
2       12.222    12.206     1      2      3      3      4
                  12.198     1      2      3      3.37   3.58
5       18.056    18.032     1      2      2      3      3
                  18.026     1      1.91   2.22   2.46   2.64
10      21.074    21.059     1      1      2      2      2
                  21.039     1      1.40   1.64   1.83   1.99
100     24.556    24.555     1      1      1      1      1
                  24.532     0.32   0.45   0.54   0.62   0.68

(For each value of ρ, the first line gives the integer-restricted optimum (g₁) and
the second line the unrestricted optimum (g₂).)

converges to y_i = i, i.e. no control. On the other hand, the optimal strategy
converges to y_i = 1 as ρ approaches infinity.
A class of models not explicitly dealt with in this section are the cold-
standby models. These models are appropriate for systems in which a very
high level of availability is required, such as communication and computer
systems. Van der Duyn Schouten and Wartenhorst (1994) consider a situation
in which an operating unit is backed by an identical cold standby unit. The
level of deterioration of the operating unit is visible (upon inspection) and
one has to decide when to take the working unit out of operation (and to
replace it by the standby) in order to maximize the overall availability of the
system. The incentive to take a working unit out of operation deliberately
and not waiting until failure, is to be found in the fact that a preventive
repair is usually less time consuming than a corrective repair. Van der Duyn
Schouten and Wartenhorst (1994) provide numerical procedures to compute
the complete distribution of up and down periods under a given maintenance
strategy of control limit type. Those distributions are subsequently used to
generate approximations for the interval availability distribution, i.e. the dis-
tribution of the fraction of a given finite interval during which the system is

available. The optimal control limit type policy can then be determined by a
straightforward one-dimensional search procedure.

4. Preventive Maintenance on Systems Consisting of Non-identical Units, with Compulsory Corrective Repair of Units Upon Failure
In this section we consider a system consisting of n non-identical units. The
lifetimes of the individual units are independent random variables with life-
time distribution F_j(t) for unit j, 1 ≤ j ≤ n. The focus here is on preventive
rather than on corrective maintenance. Corrective maintenance is supposed
to be compulsory as soon as a failure occurs. The preventive maintenance
activities on the various components should be planned in such a way that
economies of scale are obtained. Combining preventive and corrective main-
tenance activities is not allowed. The cost structure consists of a fixed cost Co
for any preventive maintenance operation, irrespective of the number of units
included in the operation. In addition to the fixed costs there is an individual
cost aj for preventive maintenance on unit j. For a corrective maintenance
operation on unit j a cost c_j is incurred. Finally we include cost functions
P_j(t) for production losses due to the unavailability of unit j during t units of
time.
This model resembles a multi-item inventory control model. Goyal and
Kusy (1985) were the first to realize this and they proposed for the main-
tenance optimization model a class of static strategies, which were already
studied for the inventory control model. This class of policies is character-
ized by n + 1 parameters T, k₁, ..., k_n, with the following interpretation: do
preventive maintenance on unit j every k_jT time units. Since failed units are
not allowed to be left unattended and both corrective and preventive main-
tenance are supposed to be instantaneous, the production loss functions P_j(t)
become obsolete. Note that by including a fixed penalty cost P_j(0) in the
corrective maintenance cost c_j for unit j, one can penalize a failure.
The average cost per unit of time under the policy (T, k₁, ..., k_n) is given
by

g(T, k) = C₀ Δ(k)/T + Σ_{j=1}^{n} [a_j + c_j M_j(k_j T)] / (k_j T),

where

Δ(k) = Σ_{j=1}^{n} (−1)^{j+1} Σ_{{a₁,...,a_j} ⊆ {1,...,n}} [lcm(k_{a₁}, ..., k_{a_j})]^{−1}

and M_j(t) denotes the renewal function generated by F_j(t) (see Dagpunar
1981).

Note that Δ(k) = 1 when min_{1≤j≤n} k_j = 1.


In general the minimization of a function of n + 1 variables like g(T, k) is very
time consuming. However, g(T, k) has some properties which can be exploited
in the search for the optimal values of T, k1 , ... , kn . The most important of
these properties is that, in case Ll(k) = 1, g(T, k) is separable in the variables
kl' ... , k n . For an excellent and complete discussion on the optimization of
(a generalization of) the function g(T, k) we refer to Dekker et al. (1996),
where several algorithms to find the minimum of g(T, k) are compared. All
algorithms are based on the assumption that Ll(k) = 1 for the optimal policy.
Here we present the variant proposed by Vos De Wael (1995). This algorithm
is essentially a variant on the algorithm of Goyal and Kusy (1985) and does
not guarantee an optimal solution.
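The inclusion-exclusion structure of Δ(k) can be computed directly; a minimal sketch (function names hypothetical):

```python
from itertools import combinations
from math import gcd

def lcm(values):
    """Least common multiple of a sequence of positive integers."""
    out = 1
    for v in values:
        out = out * v // gcd(out, v)
    return out

def setup_rate(k):
    """Delta(k) by inclusion-exclusion: the long-run fraction of the epochs
    T, 2T, ... at which at least one unit receives preventive maintenance,
    i.e. at which the epoch index is a multiple of some k_j (cf. Dagpunar 1981)."""
    n = len(k)
    total = 0.0
    for j in range(1, n + 1):
        for sub in combinations(k, j):
            total += (-1) ** (j + 1) / lcm(sub)
    return total
```

For example, setup_rate([2, 3]) gives 2/3: among any six consecutive epochs, four are multiples of 2 or 3.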

Define for 1 ≤ j ≤ n and integer values of k

h_j(k, T) := [a_j + c_j M_j(kT)] / (kT).

Step 0. Determine for every individual unit j the optimal replacement cycle
T_j, i.e. solve n independent one-dimensional optimization problems of the
following type:

T_j := arg min_{T>0} h_j(1, T).

Let T := min_{1≤j≤n} T_j.

Step 1. For every j ∈ {1, ..., n} determine the smallest value of k_j such that

h_j(k_j, T) ≤ h_j(k_j + 1, T).

If all k_j remain unchanged in two subsequent executions of Step 1 then
stop; else go to Step 2.

Step 2. Solve the one-dimensional non-linear optimization problem

min_{T≥0} g(T, k)

and go to Step 1.
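A sketch of the algorithm under illustrative assumptions: Erlang(2) lifetimes (whose renewal function has a known closed form) and a grid search in Steps 0 and 2. The stopping test and the form of the Step 1 condition follow the description above; all names and numerical settings are hypothetical, and the heuristic does not guarantee optimality.

```python
from math import exp

def m_erl2(lam):
    """Renewal function of an Erlang(2, lam) lifetime (standard closed form)."""
    return lambda t: lam * t / 2.0 - 0.25 * (1.0 - exp(-2.0 * lam * t))

def heuristic_Tk(units, C0, grid=None):
    """Goyal-Kusy style heuristic for a (T, k_1, ..., k_n) policy, assuming
    Delta(k) = 1 so that g(T, k) = C0/T + sum_j h_j(k_j, T), where
    h_j(k, T) = (a_j + c_j M_j(kT)) / (kT).
    `units` is a list of (a_j, c_j, M_j) triples."""
    grid = grid or [0.01 * i for i in range(1, 2001)]     # candidate T values
    h = lambda a, c, M, k, T: (a + c * M(k * T)) / (k * T)
    g = lambda T, ks: C0 / T + sum(h(a, c, M, k, T)
                                   for (a, c, M), k in zip(units, ks))
    # Step 0: individually optimal cycles; start from the smallest one.
    T = min(min(grid, key=lambda t: h(a, c, M, 1, t)) for a, c, M in units)
    ks = None
    for _ in range(50):
        # Step 1: smallest k_j with h_j(k_j, T) <= h_j(k_j + 1, T).
        new_ks = []
        for a, c, M in units:
            kj = 1
            while kj < 50 and h(a, c, M, kj, T) > h(a, c, M, kj + 1, T):
                kj += 1
            new_ks.append(kj)
        if new_ks == ks:           # unchanged in two subsequent executions
            break
        ks = new_ks
        # Step 2: re-optimize T for the fixed vector k.
        T = min(grid, key=lambda t: g(t, ks))
    return T, ks, g(T, ks)
```

With identical units the heuristic collapses to a single joint cycle (all k_j = 1), lengthened relative to the individual optima by the shared set-up cost C₀.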
In Vos De Wael (1995) an application of this model to the maintenance
of road traffic control systems is described. In this situation a unit consists
of a group of light bulbs functioning in a traffic control system on a certain
road crossing. Bulbs with the same burning and cost characteristics are put
together into one single group. A typical replacement rule used in practice is:
replace all bulbs serving red lights every three months, the bulbs serving the
green lights every six months, and the bulbs serving the yellow lights every
year. The following numerical example illustrates the algorithm.

The following input data are used: C₀ = 2 and

Table 4.1. Input parameter specification

Group   F_j(t)            μ_j     σ_j²    a_j   c_j
1       Erl. (2;2)        1.000   0.500   1     10
2       Erl. (3;2)        0.667   0.222   1     10
3       Erl. (4;3)        0.750   0.188   3     15
4       Weib. (0.8;1.5)   1.128   0.587   3     15
5       Weib. (0.4;3)     2.232   0.658   3     15

Here Erl. (λ; k) denotes the Erlang distribution with parameters λ and k,
while Weib. (λ; a) denotes the Weibull distribution with parameters λ and a.

Step 0. T₁ = 0.344; T₂ = 0.229; T₃ = 0.372; T₄ = 0.924; T₅ = 1.218; T = 0.229.

Step 1 (1st it.). k₁ = 2; k₂ = 1; k₃ = 2; k₄ = 4; k₅ = 4, with g(T, k) = 59.59.

Step 2 (1st it.). T = 0.294, with g(T, k) = 58.83.

Step 1 (2nd it.). k₁ = 1; k₂ = 1; k₃ = 1; k₄ = 3; k₅ = 4, with g(T, k) = 57.86.

Step 2 (2nd it.). T = 0.395, with g(T, k) = 56.52.

Step 1 (3rd it.). k₁ = 1; k₂ = 1; k₃ = 1; k₄ = 2; k₅ = 3, with g(T, k) = 56.27.

Step 2 (3rd it.). T = 0.441, with g(T, k) = 56.11.

Step 1 (4th it.). k₁ = 1; k₂ = 1; k₃ = 1; k₄ = 2; k₅ = 3.

Hence the algorithm provides the following solution:

T* = 0.441; k₁* = 1; k₂* = 1; k₃* = 1; k₄* = 2; k₅* = 3,

which turns out to be the optimal solution (as is verified by the exact proce-
dure presented in Dekker et al. 1996).

5. Combining Corrective and Preventive Maintenance on Multicomponent Systems
Except for the model of Ritchken and Wilson (1990), the models considered so
far do not allow combination of corrective and preventive maintenance activi-
ties. In Ritchken and Wilson (1990) combination of both types of maintenance
activities is only allowed in the following elementary fashion: the corrective
repair of all failed units can be combined with preventive repair of all remain-
ing units, resulting in a complete system replacement. One should realize that
in many situations combination of corrective and preventive repair is indeed
not realistic. Corrective maintenance is unexpected in its very nature, while
preventive maintenance can be planned in advance. Hence combination of
both types of activities is not always considered as a valuable option. How-
ever, there are certainly situations where combining can be advantageous.
In particular, when corrective repair on one single unit requires dismantling
of the whole system, carrying out preventive maintenance on neighbouring
units might be worthwhile. Two options can be considered here. When failed
units have to be repaired without any delay, the only possibility for combi-
nation exists in advancing preventive maintenance. On the other hand, when
failed units can be kept idling for some limited amount of time, one has the
option to delay the corrective maintenance action until the first scheduled
preventive maintenance operation, like in the model of Ritchken and Wil-
son (1990). The advantage of the latter is that one can stick to previously
scheduled maintenance activities.
In this section we will concentrate on the first option: deciding to ad-
vance preventive maintenance actions when corrective maintenance is re-
quired. Consider again the situation described in Section 4: a system con-
sisting of n non-identical independently operating units, with lifetime distri-
bution Fj(t) for component j. Corrective maintenance is supposed to start
as soon as a failure occurs. A fixed cost C₀ is incurred for any maintenance
operation (preventive or corrective), while additionally an individual cost a_j
or c_j is incurred whenever unit j is subject to preventive or corrective main-
tenance, respectively. All repairs are supposed to be instantaneous (which
makes a penalty function for unavailability unnecessary) and to result in a
unit which is "as good as new". When a corrective maintenance action on unit
j is combined with preventive maintenance on unit k, the associated costs
equal C₀ + c_j + a_k, which is C₀ cheaper than in case of separate actions. In
contrast with Section 4 we consider dynamic policies, i.e. policies which pre-
scribe actions based on the actual condition of all units. As condition variable
we will use the "age" of a unit, defined as the time since the last maintenance
action. A stationary policy is a function from ℝⁿ to the class of all subsets
of {1, ..., n}. Let π be any stationary policy. Then π(x₁, ..., x_n) = {i₁, ..., i_k}
has the following interpretation: when at a discrete decision epoch the age
configuration of the units is given by (x₁, ..., x_n), then the units i₁, ..., i_k
receive maintenance simultaneously.
Haurie and L'Ecuyer (1982) consider the simple case of identical units
with costs a_j = c_j = c. Hence the cost of a maintenance action including
ν units equals C₀ + νc for ν ≥ 1. Moreover, they assume that preventive
maintenance is only carried out when at least one unit has failed (which is

optimal due to their special cost structure). Haurie and L'Ecuyer show by
counterexample that the optimal policy is not necessarily monotone, not even
when the lifetime distribution is IFR. A policy π is called monotone when
π(x) ⊆ π(y) whenever x ≤ y.
Ozekici (1988) considers the same model under much more general as-
sumptions on the aging process (not necessarily independent lifetime distri-
butions) and on the cost functions. In this situation it is certainly not always
optimal to start a maintenance action only when a failure has occurred. In
Ozekici (1988), the following structural result is shown:
Theorem 5.1. The optimal policy π* has the following properties:

(i) If π*(x) = {1, ..., n}, then π*(y) = {1, ..., n} for all y ≥ x.   (5.1)
(ii) If i ∈ π*(x₁, ..., x_n), then i ∈ π*(x₁, ..., x_{i−1}, y, x_{i+1}, ..., x_n)
for all y ≥ x_i.

Relation (5.1) implies that when under a certain age configuration the com-
plete system is replaced, then this will also happen under a worse age con-
figuration.
In spite of the characterization provided by Theorem 5.1, the optimal
policy can have a rather complex structure, due to the absence of monotonic-
ity. Therefore approximate policies have been studied. Vergin and Scriabin
(1977) introduce for a system consisting of two identical units an (m, M)
policy: start maintenance on a single unit whenever it fails or reaches the
age M and include the other unit into the maintenance operation if and only
if its age exceeds m. Van der Duyn Schouten and Vanneste (1990) provide
an efficient numerical procedure to compute the optimal values of m and
M. Moreover, they show by extensive numerical experiments that the best
(m, M) policy is almost always less than 1% off from the overall optimal
policy. The simple structure of (m, M) policies and their near-optimality,
combined with the availability of a computationally efficient algorithm, make
(m, M) policies an interesting class of policies for two unit systems. However,
the proposed algorithm cannot easily be generalized to situations with more
than two units (see, however, also the last paragraph of this section).
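The performance of an (m, M) policy for a two-unit system can also be estimated by simulation; the sketch below is not the numerical procedure of Van der Duyn Schouten and Vanneste (1990), just an illustrative Monte-Carlo check with hypothetical parameters and cost conventions.

```python
import random

def simulate_mM(m, M, sample_life, a, c, C0, horizon=20000.0, seed=1):
    """Monte-Carlo estimate of the average cost rate of an (m, M) policy
    for a two-unit system with instantaneous repairs: maintain a unit when
    it fails or reaches age M, and include the other unit opportunistically
    when its age is at least m.  A maintenance action costs C0 plus c per
    failed unit and a per preventively maintained unit."""
    rng = random.Random(seed)
    ages = [0.0, 0.0]
    lives = [sample_life(rng), sample_life(rng)]
    t, cost = 0.0, 0.0
    while t < horizon:
        # next intervention: the first unit to fail or to reach age M
        triggers = [min(lives[u], M) - ages[u] for u in (0, 1)]
        i = 0 if triggers[0] <= triggers[1] else 1
        t += triggers[i]
        ages = [x + triggers[i] for x in ages]
        cost += C0 + (c if lives[i] <= M else a)   # corrective vs preventive
        ages[i], lives[i] = 0.0, sample_life(rng)
        j = 1 - i
        if ages[j] >= m:                           # opportunistic maintenance
            cost += a
            ages[j], lives[j] = 0.0, sample_life(rng)
    return cost / t
```

With m = 0 both units are always renewed together, so the system regenerates at the minimum of two fresh lifetimes, which gives an analytic benchmark for checking the simulation.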
The complexity of the optimal policy is due to the availability of detailed
information on the complete age configuration of all units. For the case of
identical units in Van der Duyn Schouten and Vanneste (1993) a model is
presented with sublimated age information: every unit can be in one out of
four possible states: good (0), doubtful (1), bad (2) and failed (3). Under a
given lifetime distribution these states can be seen as representing certain
age intervals. Van der Duyn Schouten and Vanneste (1993) assume that a
sojourn in one of the states is exponentially distributed with parameter de-
pending on the actual state. At the end of a sojourn a jump to another state
occurs according to a Markov chain. When a unit enters a bad state (2) a

preventive maintenance is required, while a corrective maintenance is com-
pulsory on entrance into state 3 (failed). At moments at which a maintenance
action becomes compulsory one has the option to maintain the whole system
or (which is mathematically equivalent) to maintain all doubtful components
as well. Two classes of policies are considered:

Policy class A: a complete system overhaul is carried out when a single
component enters state 2 or 3 and the number of doubtful (state 1) compo-
nents at that moment is greater than or equal to K.

Policy class B: a complete system overhaul is carried out at the first time
epoch at which an individual component enters state 2 or 3 after the first
moment at which the number of doubtful components has reached the level K.
The difference between the two control rules is rather subtle and concerns
the decision to be made when the number of doubtful components has reached
the level K. Under policy B a system overhaul will certainly be performed
at the first subsequent epoch at which one of the components turns bad or
fails. However, when this component was a doubtful one, a system overhaul is
not carried out under policy A, because the number of doubtful components
decreases from K to K − 1. For both types of policies explicit expressions are
derived for the average number of system overhauls per unit of time as well
as the expected number of preventive and corrective maintenance actions on
individual units per system lifetime. Also the authors show how this model
can be used as an approximation for the situation where the failures are
governed by a lifetime distribution. Assuming an IFR lifetime distribution
they propose to identify the age interval [0, r] with the good state, the age
interval [r, R] with the doubtful state and the interval [R, ∞) with the bad
state. Apart from the control limit K, the parameters r and R can also be used as
control variables.
Numerical investigations show that this approximation gives fairly good
results and certainly can be used to support the decision how to choose the
relevant control variables. In particular, it is noteworthy that the quality
of the approximations improves when the number of units increases. The
validation of the approximation is done by simulation.
To conclude this section we mention a recent paper by Wijnmalen and
Hontelez (1996), in which a promising computational procedure is proposed
to find "good" policies for the general model introduced in this section.
Attention is restricted to policies which are characterized by two vectors
(M₁, M₂, ..., M_n) and (m₁, m₂, ..., m_n), as a straightforward generalization
of the (m, M) policies introduced in Vergin and Scriabin (1977) for the two-
unit system. A maintenance action on component j is compulsory as soon
as its age (or condition) exceeds level M_j, and when maintenance on another
unit is carried out, unit j is included in this maintenance operation
whenever its age (or condition) exceeds m_j. As mentioned earlier, an exact

analysis of this class of policies (including optimization) is only possible for
the case n = 2. However, Wijnmalen and Hontelez (1996) propose an ap-
proximating optimization method by decomposing the n-dimensional prob-
lem into n one-dimensional problems. Each one-dimensional case is modelled
as a one-dimensional Markov decision problem with discount opportunities.
as a one-dimensional Markov decision problem with discount opportunities.
The discount opportunities (in reality the moments of compulsory mainte-
nance on other units) are modelled as if they arrive according to independent
Poisson processes. The rates of these Poisson processes are determined by
the maintenance thresholds. This gives rise to an iterative procedure in which
the maintenance limits and the opportunity arrival rates are alternately ad-
justed until sufficient convergence is achieved. This procedure bears a strong
resemblance to the computational procedure developed by Federgruen et
al. (1984) to compute "optimal" can-order policies in multi-item inventory
systems. For comments on the limitations of this procedure we refer to Van
Eijs (1994).

6. The Use of Safety Stocks in Maintenance Planning
In all previous sections the decision to start a maintenance activity was based
exclusively on the level of deterioration of the unit in operation or on the num-
ber of available units. By realizing that preventive maintenance finds its main
motivation in the fact that it can contribute to a more continuous production
process, we conclude that from this point of view preventive maintenance and
buffers play comparable roles in production processes. The difference is that
the decision to build buffer capacity is made at the design level, while main-
tenance decisions are made at the operational level. Nevertheless, those two
control mechanisms can be combined in the sense that one could consider
maintenance policies, which are based not only on the information about the
level of deterioration of the unit, but also on the content of a subsequent
buffer.
Van der Duyn Schouten and Vanneste (1995) study a production system
consisting of one single production unit with an infinite supply source and
followed by a buffer of finite capacity K. Demand for the product is constant
at rate d units per unit of time. As long as the buffer capacity is not reached,
the unit operates at a constant rate of p units per unit of time, and slows
down to d units per unit of time as soon as the buffer gets full. The unit is
subject to failures, with a given IFR-type probability distribution for the time
until failure. Upon failure corrective maintenance (CM) starts, bringing the
condition of the unit back to the "as good as new" state. During maintenance,
which lasts a stochastic amount of time, the unit is inoperative. When the
buffer becomes empty the demand is (partially) backlogged up to a given
level. When the backlog exceeds this level the overflow is lost. To prevent too
many failures preventive maintenance (PM) is allowed. The incentive for PM
is to be found in the fact that PM takes less time, on average, than CM. The
Maintenance Policies for Multicomponent Systems: An Overview 135

optimal policy for this decision process has a rather complex structure. Hence,
Van der Duyn Schouten and Vanneste (1995) restrict attention to the subclass
of (m, M, k)-policies. An (m, M, k)-policy prescribes starting a PM action if
and only if the age i of the production unit and the buffer content x satisfy
the relations i ≥ M and k ≤ x < K, or the relations i ≥ m and x = K.
Numerical experiments show that the class of (m, M, k)-policies performs
very well in the sense that the optimal policy within this restricted class is in
general within 1% of the overall optimal policy. The advantages of
the (m, M, k)-policies over the overall optimal policy are twofold. First,
these policies are relatively easy to implement; second, their performance
can be determined analytically, as shown in Van der Duyn
Schouten and Vanneste (1995). The latter advantage enables us to carry out
(brute force) optimization within the class of (m, M, k)-policies in a much more
efficient way than overall optimization by Markov decision theory. The effect
of using both the buffer content and the condition of the unit as indicators for the
advisability of PM can be quantified by comparing the best policy in the
(m, M, k)-class with the optimal age replacement policy, in which only the
condition of the unit counts. Numerical examples show that the differences
are typically in the range from 0 to 25%.
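The (m, M, k)-decision rule above is simple enough to state in a few lines of code. The following is a minimal sketch; the function name and the numerical thresholds in the example are illustrative assumptions, not values taken from the paper.

```python
def start_pm(age, buffer_content, m, M, k, K):
    """(m, M, k)-policy: start preventive maintenance if and only if
    the age i of the unit and the buffer content x satisfy
    (i >= M and k <= x < K) or (i >= m and x == K)."""
    i, x = age, buffer_content
    return (i >= M and k <= x < K) or (i >= m and x == K)

# illustrative thresholds: m = 5, M = 10, k = 2, buffer capacity K = 4
assert start_pm(age=11, buffer_content=3, m=5, M=10, k=2, K=4)  # old unit, buffer filling
assert start_pm(age=6, buffer_content=4, m=5, M=10, k=2, K=4)   # full buffer, lower age suffices
assert not start_pm(age=6, buffer_content=3, m=5, M=10, k=2, K=4)
```

With a full buffer the rule tolerates the lower age threshold m, reflecting that lost production during PM can then be covered from the buffer.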
In De Waal and Vanneste (1995) a detailed analysis is presented of the
transient behaviour of the buffer content process under various maintenance
strategies.

References

Assaf, D., Shanthikumar, J.G.: Optimal Group Maintenance Policies with Contin-
uous and Periodic Inspections. Management Sci. 33, 1440-1452 (1987)
Dagpunar, J.S.: Formulation of a Multi Item Single Supplier Inventory Problem.
Journal of the Operational Research Society 33, 285-286 (1981)
Dekker, R., van Rijn, C.: PROMPT, A Decision Support System for Opportunity-
Based Preventive Maintenance. In this volume (1996), pp. 530-549
Dekker, R., Frenk, J.B.G., Wildeman, R.E.: How to Determine Maintenance Fre-
quencies for Multi-component Systems? A General Approach. In this volume
(1996), pp. 239-280
De Waal, P.R., Vanneste, S.G.: System Effectiveness of a Production Unit with an
Output Buffer. Shell Research, Rep. AMER. 94.010 (1995)
Federgruen, A., Groenevelt, H., Tijms, H.C.: Coordinated Replenishments in a
Multi-Item Inventory System with Compound Poisson Demands. Management
Sci. 30, 344-357 (1984)
Goyal, S.K., Kusy, M.I.: Determining Economic Maintenance Frequency for a Fam-
ily of Machines. Journal of the Operational Research Society 36, 1125-1128
(1985)
Groenendijk, W.: Maintenance Management System: Structure, Interfaces and Im-
plementation. In this volume (1996), pp. 519-529
136 Frank Van der Duyn Schouten

Haurie, A., L'Ecuyer, P.: A Stochastic Control Approach to Group Preventive Re-
placement in a Multicomponent System. IEEE Trans Automat. Control 27,
387-393 (1982)
Jansen, J., Van der Duyn Schouten, F.A.: Maintenance Optimization on Parallel
Production Units. IMA J. Math. Appl. Bus. Indust. 6, 113-134 (1995)
Ozekici, S.: Optimal Periodic Replacement of Multi-Component Reliability Sys-
tems. Operations Res. 36, 542-552 (1988)
Ritchken, P., Wilson, J.G.: (m, T) Group Maintenance Policies. Management Sci.
36, 632-639 (1990)
Ross, S.M.: Introduction to Stochastic Dynamic Programming. Orlando: Academic
Press 1983
Van der Duyn Schouten, F.A., Vanneste, S.G.: Analysis and Computation of (n, N)-
Strategies for Maintenance of a Two-Component System. European Journal of
Operational Research 48, 260-274 (1990)
Van der Duyn Schouten, F.A., Vanneste, S.G.: Two Simple Control Policies for a
Multicomponent Maintenance System. Operations Res. 41, 1125-1136 (1993)
Van der Duyn Schouten, F.A., Vanneste, S.G.: Maintenance Optimization of a
Production System with Buffer Capacity. European Journal of Operational Re-
search 82, 323-338 (1995)
Van der Duyn Schouten, F.A., Wartenhorst, P.: Transient Analysis of a Two-Unit
Standby System with Markovian Degrading Units. Management Sci. 40, 418-
428 (1994)
Van Eijs, M.J.G.: On the Determination of the Control Parameters of the Optimal
Can-Order Strategy. Z. Oper. Res. 39, 289-304 (1994)
Vergin, R.C., Scriabin, M.: Maintenance Scheduling for Multicomponent Equip-
ment. AIIE Transactions 9, 297-305 (1977)
Vos De Wael, S.: Strategies for Lampremplace (In Dutch). Masters Thesis. Tilburg
University (1995)
Wijnmalen, D.J.D., Hontelez, J.A.M.: Coordinated Condition Based Repair Strate-
gies for Components of a Multicomponent Maintenance System with Discounts.
European Journal of Operational Research. To appear (1996)
Complex Systems in Random Environments

Süleyman Özekici

Department of Industrial Engineering, Boğaziçi University, 80815 Bebek-İstanbul, Turkey

Summary. In this paper, we consider various inventory, queueing and reliability
models where there is apparent complexity due to interacting components or
subsystems. In particular, our analysis focuses on a multi-item inventory model with
stochastically dependent demands, a queueing network where there are dependent
arrival and service processes, and a reliability model with stochastically dependent
component lifetimes. We discuss cases where this dependence is induced only by a
random environmental process in which the system operates. This process represents
the sources of variation that affect all deterministic and stochastic parameters
of the model. Thus, not only are the parameters of the model now stochastic
processes, but they are also all dependent due to the common environment to which
they are all subject. Our objective is to provide a convincing argument that, under fairly
reasonable conditions, the analysis techniques used in these models as well as their
solutions are not much more complicated than those where there is no environmental
variation.

Keywords. Complex systems, random environment, reliability and maintenance,
queueing, inventory

1. Introduction

Most of the literature on stochastic models in operations research and management
science involves models where the parameters remain unchanged. In
cases where they do change, the change is usually indexed by the time factor only,
leading to dynamic models. There are many real life applications where these
parameters, whether they involve the deterministic or stochastic structure
of the model, change randomly with respect to a randomly changing environmental
factor. Thus, the model parameters can be viewed as stochastic
processes rather than simple deterministic constants, as in stationary models,
or deterministic functions of time, as in dynamic models.
This paper takes a close look at complex stochastic models that oper-
ate in a randomly changing environment which affects the deterministic and
stochastic model parameters. Here, complexity is due not only to the variety
in the number of components of the model, but also to the fact that these
components are interrelated through their common environmental process.
For example, the demand processes for a multi-item inventory model, the
customer arrival processes to various queues in a network and the compo-
nent lifetimes in a multi-component reliability model will be dependent due
to the fact that they are all subject to a common environmental process. In
all of our models, we assume that stochastic dependence is only due to the
random environment and that there would be complete independence otherwise.
In other words, given the environmental process, all of these processes
are conditionally independent.
The concept of an "environmental" process, in one form or another, has
been used in the literature for various purposes. Neveu (1961) provides an
early reference to paired stochastic processes where the first component is
a Markov process while the second one has conditionally independent increments
given the first. Ezhov and Skorohod (1969) call it a Markov process
with a homogeneous second component. In a more modern setting, Çınlar
(1972a, 1972b) introduced Markov additive processes and provided a detailed
description of the structure of the additive component. The environment is
modelled as a Markov process in all these cases and the additive process
represents the stochastic evolution of a quantity of interest. In the context
of reliability theory, for example, this process is best suited to model the
deterioration of a device due to shock and wear as outlined in Çınlar (1977).
Özekici (1979) discusses other applications in storage, queueing and inventory
models.
In what follows, we assume that the system operates in a randomly changing
environment depicted by Y = {Y_t; t ∈ T} where Y_t is the state of the
environment at time t. The environmental process Y is a stochastic process
with time-parameter set T and some state space E which is assumed to be
discrete to simplify the notation.
We will discuss the implications of the environmental process in the con-
text of inventory models in Section 2, queueing models in Section 3 and,
finally, reliability models in Section 4.

2. Inventory Models

The deterministic structure of a typical inventory model is described by parameters
like purchase cost, ordering cost, holding cost, shortage cost, selling
price and salvage value. Usually, these parameters are taken to be constants
or deterministic functions of time for infinite horizon as well as finite horizon
decision problems. The objective is generally the minimization of the
expected total discounted cost or the average cost where the decision maker
chooses the order quantity as well as the order time.
The stochastic structure is depicted by the demand and the lead-time
processes. Once again, the parameters of this stochastic structure involving
demand rates and mean lead-times are constants in many cases for stationary
models. It is possible that they depend on time only in nonstationary models.
If the parameters are taken to be deterministic functions of time, especially
in finite horizon problems, we have dynamic models that can be analyzed
using the recursive technique of dynamic programming. In infinite horizon
problems, the deterministic parameters usually do not depend on time and
some kind of stationarity is assumed in the stochastic processes involved in
order to obtain smooth results which are computationally tractable.
In recent years, there has been growing interest in models where the pa-
rameters of the deterministic as well as stochastic structure do not remain
unchanged but, rather, change with respect to an environmental factor that
affects the model as a whole. Song and Zipkin (1993) argue that the demand
for the product may be affected by a randomly changing "state-of-the-world",
which we choose to call the "environment" in our exposition. This could represent
various important factors such as the randomly changing economic
conditions, market conditions for new products or products that may become
obsolete, or any environmental condition that may affect the demand as well
as the supply and the cost parameters. Song and Zipkin present a strong
case for this concept by discussing the implications of a randomly fluctuating
demand environment in a continuous review inventory model where the de-
mand process is a Markov modulated Poisson process, i.e., the environment
is represented by a Markov process and the demand in each environment is
a Poisson process with an arrival rate that depends on the environment.
This issue has received only scant attention in the earlier literature on inventory
management. For example, Iglehart and Karlin (1962) consider a
periodic review model where the periodic demand is modulated by a Markov
chain. In Kalymon (1971), the unit market price of the product varies randomly
over successive periods as a Markov chain. The steady-state distribution
of the inventory position in a continuous review inventory model is
derived by Feldman (1978) when the demand is a Markov modulated compound
Poisson process. Sethi and Cheng (1993) assume that the demand process
forms a Markov chain. With all the cost parameters taken as constants,
they show the optimality of (s, S) policies for different types of problems,
e.g., with no-ordering periods and storage and service level constraints.

2.1 A Simple Periodic Model

We now discuss the periodic review model of Özekici and Parlar (1995)
which unites the notions discussed above through an environmental process
which affects not only the demand, but also the supply availability and the
cost parameters. Consider a single product inventory system which is inspected
periodically over an infinite planning horizon. The state of the environment
observed at the beginning of period n is represented by Y_n and
we assume that Y = {Y_n; n ≥ 0} is a time-homogeneous Markov chain on
a discrete state space E with a given transition matrix P having elements
P(i,j) = P[Y_{n+1} = j | Y_n = i]. Let X_n denote the inventory position observed
at the beginning of period n. The basic assumption of this model is
that all parameters during a period depend on the state of the environment
at the beginning of that period. Therefore, the decision maker observes both
the inventory position and the environmental state to decide on the order
quantity, which we assume is delivered immediately. There is complete backlogging
of demand that occurs during the period, and all costs are discounted
periodically at the rate 0 < α < 1.
If D_n is the total demand during period n, then the demand process
D = {D_n; n ≥ 0} is modulated by the Markov chain Y so that its conditional
cumulative distribution function is

A_i(z) = P[D_n ≤ z | Y_n = i]    (2.1)

for i ∈ E, z ≥ 0. Therefore, the demand distribution is completely specified
by the environment.
Inventory models in the literature such as Song and Zipkin (1993) that
are related to ours mostly concentrate on the demand process which varies
stochastically in random environments. Although the demand is the primary
factor that depends on the environment, it is only fair to assume that the
environment affects the supply as well. Supply may not be readily available
due to, e.g., suppliers' strikes, embargoes, machine breakdowns, and, thus, an
order may not be placed. The issue of supplier uncertainty in the inventory
control context has been discussed by Silver (1981) and Nahmias (1993, p.
186); however, no models were presented. In a recent paper, Parlar and Berkin
(1991) formulate the supply uncertainty problem for the classical economic
order quantity model. Also, see Gupta (1993), Parlar (1993) and Gürler and
Parlar (1995) for extensions of the supplier uncertainty model incorporating,
e.g., random demand and multiple suppliers.
In our setting, we let U_n denote the availability of the supply in period
n so that U_n = 1 if supply is available and U_n = 0 if it is not.
This corresponds to the supplier being up (1) or down (0) to satisfy an order.
The order is placed and delivered immediately when the supply is available;
otherwise no order is placed. As in the demand structure, we assume that the
supply availability process U = {U_n; n ≥ 0} depends on the environment so
that

P[U_n = b | Y_n = i] = u_i^b (1 − u_i)^(1−b)    (2.2)

for b ∈ {0, 1}, where u_i is the probability that supply will be available in
environment i, and 1 − u_i is the probability that it will not be available in
environment i.
The cost parameters also depend on the environment. Therefore, in environment
i, K_i is the fixed cost of ordering, c_i is the unit purchase cost, h_i is
the unit holding cost incurred at the end of a period, and p_i is the unit shortage
cost incurred at the end of a period. As usual, holding cost is incurred if
there is positive inventory on hand and shortage cost is incurred if there is a
backlog.
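The primitives of this model, a Markov-chain environment, environment-modulated demand, and Bernoulli supply availability, can be simulated directly. The sketch below is illustrative only: the two-state transition matrix, the rates, and the choice of an exponential demand distribution are all assumptions (the model only requires some distribution A_i).

```python
import random

def simulate(P, demand_mean, u, n_periods, i0=0, seed=1):
    """Simulate (Y_n, D_n, U_n): a Markov-chain environment Y with
    transition matrix P, exponential demand with an environment-dependent
    mean, and supply available with probability u[i] in environment i."""
    random.seed(seed)
    i, path = i0, []
    for _ in range(n_periods):
        d = random.expovariate(1.0 / demand_mean[i])  # demand this period
        avail = 1 if random.random() < u[i] else 0    # supply up (1) or down (0)
        path.append((i, d, avail))
        # draw the next environment state from row i of P
        r, cum = random.random(), 0.0
        for j, pij in enumerate(P[i]):
            cum += pij
            if r < cum:
                i = j
                break
    return path

# illustrative two-state environment: 0 = "expanding", 1 = "contracting"
P = [[0.9, 0.1], [0.2, 0.8]]
hist = simulate(P, demand_mean=[10.0, 4.0], u=[0.95, 0.6], n_periods=50)
```

Each tuple in `hist` records the environment state, the realized demand and the supply availability of one period, which is exactly the information the decision maker observes in this model.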
An inventory management policy is defined as a function y : E × R → R,
where R = (−∞, ∞), so that y(i, x) is the order-up-to level if the environment
is i and the inventory position is x at the beginning of a period. The
admissibility condition requires that y(i, x) ≥ x since we do not allow for
discarding of any inventory without satisfying demand. In the present setting,
with an infinite planning horizon and a time-homogeneous Markov chain Y, the
optimal policy will be of the form prescribed.
For any policy y, it follows that the process (Y, X) = {(Y_n, X_n); n ≥ 0}
is a Markov chain where

X_{n+1} = U_n y(Y_n, X_n) + (1 − U_n) X_n − D_n    (2.3)

for n ≥ 0. Let v_i(x) be the optimal expected total discounted cost if the
initial environment is i and the inventory position is x. Then, using the Markov
property one can easily verify that v_i(x) satisfies the dynamic programming
equation (DPE)

v_i(x) = u_i min_{y ≥ x} {K_i δ(y − x) + G_i(y)} + (1 − u_i) G_i(x) − c_i x    (2.4)

for i ∈ E, x ∈ R, where δ(z) is the indicator function which is equal to 0 only
if z = 0 and 1 otherwise, and

G_i(y) = c_i y + L_i(y) + α ∑_{j∈E} P(i,j) ∫_0^∞ A_i(dz) v_j(y − z)    (2.5)

with

L_i(y) = h_i ∫_0^y A_i(dz) (y − z) + p_i ∫_y^∞ A_i(dz) (z − y).    (2.6)

The derivation of the DPE (2.4) is quite straightforward. Here, y is not the
policy but a real number which represents the order-up-to level. This notation
is used interchangeably to denote a policy as well as a decision whenever our
intention is clear from the expressions. Note that in environment i with the
inventory position at level x, a decision y ≥ x is taken only if the supply is
available, which happens with probability u_i. In this case, a fixed cost K_i is
incurred only if y > x and the purchase cost is c_i(y − x). Moreover, the
one-period expected holding and shortage cost is L_i(y) and, as usual, the sum
on the right-hand side of (2.5) is the expected optimal cost from the next period
onward. If the supply is not available, which happens with probability 1 − u_i,
then no ordering cost is incurred and the inventory position stays at the level x.
For any real-valued function f : E × R → R, define the mapping T as

T f(i, x) = u_i min_{y ≥ x} {K_i δ(y − x) + G_i(y)} + (1 − u_i) G_i(x) − c_i x    (2.7)

for i ∈ E, x ∈ R. It follows that the optimal expected discounted cost function
v is a fixed point of T since the DPE (2.4) can be rewritten as v = Tv.
Unfortunately, the fixed point of T is not unique since c_i(y − x) + L_i(y) ≥ 0
is not necessarily bounded for y ≥ x. We know, however, that v is the minimal
fixed point of T. In other words, if f = Tf, then v ≤ f. We also know that
the optimal solution v that satisfies (2.4) is given by v = lim_{k→∞} T^k f_0 with
f_0 = 0. This dynamic programming algorithm is useful if one needs to infer
properties of the optimal function v from properties of T^k f_0. For details
regarding these results the reader is referred to Section 5.4 of the excellent
textbook by Bertsekas (1987). As a matter of fact, this analysis of the inventory
models follows the one given in Section 6.2 of the same reference.
We will denote an optimal policy by y* so that y*(i, x) is chosen to be the
minimizing y on the right-hand side of the DPE (2.4). In case there are ties,
we set y*(i, x) to be the smallest such value.
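The successive-approximation scheme v = lim_{k→∞} T^k f_0 translates directly into code once the inventory positions and the demand are discretized. The sketch below iterates the operator of (2.4)-(2.7) on a truncated integer grid for an illustrative two-environment instance; every numerical parameter is an assumption for illustration, and clamping y − z at the bottom of the grid is a crude device to keep the example finite.

```python
# illustrative two-environment instance; all numbers are assumptions
E = [0, 1]
P = [[0.8, 0.2], [0.3, 0.7]]    # environment transition matrix
alpha = 0.9                     # periodic discount factor
K = [4.0, 6.0]; c = [1.0, 1.2]  # fixed and unit ordering costs
h = [0.5, 0.5]; p = [3.0, 3.0]  # holding and shortage costs
u = [0.95, 0.6]                 # supply availability probabilities
A = [{0: 0.2, 1: 0.5, 2: 0.3},  # demand pmf A_i per environment
     {0: 0.5, 1: 0.4, 2: 0.1}]
X = list(range(-5, 9))          # truncated inventory-position grid

def L(i, y):
    # one-period expected holding and shortage cost, eq. (2.6)
    return sum(A[i][z] * (h[i] * max(y - z, 0) + p[i] * max(z - y, 0))
               for z in A[i])

def G(i, y, v):
    # eq. (2.5); y - z is clamped to the grid to keep the sketch finite
    future = sum(P[i][j] * sum(A[i][z] * v[j][max(y - z, X[0])]
                               for z in A[i]) for j in E)
    return c[i] * y + L(i, y) + alpha * future

def T(v):
    # the DPE operator of eqs. (2.4) and (2.7)
    return {i: {x: u[i] * min((K[i] if y > x else 0.0) + G(i, y, v)
                              for y in X if y >= x)
                + (1 - u[i]) * G(i, x, v) - c[i] * x
                for x in X} for i in E}

v = {i: {x: 0.0 for x in X} for i in E}  # f_0 = 0
for _ in range(300):                     # v = lim_{k -> oo} T^k f_0
    v = T(v)
```

The minimizing y inside `T` recovers the decision y*(i, x); for this instance the {K_i} chosen happen to be α-excessive for the matrix P above, so by Theorem 2.2 below the resulting policy should exhibit an environment-dependent (s_i, S_i) structure.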

2.2 Optimal Policy


Suppose that K_i = 0 identically so that there is no fixed cost of ordering.
This provides a substantial simplification of the DPE (2.4) as well as of the
mathematical analysis. In stationary or nonstationary models, this assumption
generally leads to the optimality of so-called order-up-to level policies,
which are of a control-limit type specified by a single critical number (Bellman
et al. 1955 and Bertsekas 1987, Section 6.2). In our model with environment
dependent demand, supply and cost parameters, this simple characterization
still retains its validity.

Theorem 2.1. Suppose K_i = 0 for all i ∈ E. Then v_i(x) ≥ 0 is convex in x
for all i ∈ E and an environment dependent order-up-to level policy {S_i} is
optimal. In other words, there exist critical numbers {S_i} such that

y*(i, x) = { S_i − x,  if x ≤ S_i
           { 0,        if x > S_i,    (2.8)

for i ∈ E, x ∈ R.
We now consider the general problem where the environment dependent
fixed costs K_i ≥ 0 are not necessarily equal to zero. It is well known
(Iglehart 1963, Zheng 1991) that, under reasonable assumptions, a fixed ordering
cost leads to optimal (s, S) policies in the standard constant environment
setting. As we shall prove shortly, this simple characterization can still be made
in our model where a Markovian environment affects the demand, supply,
and cost parameters. This, however, requires an interesting and meaningful
restriction on the fixed costs {K_i}.
Definition 2.1. A finite and real-valued function g ≥ 0 on E is said to be
α-excessive if g_i ≥ α ∑_{j∈E} P(i,j) g_j for all i ∈ E.

These functions play an important role in the potential theory of Markov
chains. In particular, if the reward function in the optimal stopping problem
of the Markov chain X with discount factor α is α-excessive, then the optimal
decision is to take this reward and stop immediately. This follows from the
observation that the immediate reward collected by stopping is greater than
or equal to the expected reward that can be collected in the next period by
continuing in this game. The reader is referred to Çınlar (1975, p. 204) for
the basic results pertaining to the potential theory of Markov chains.
Theorem 2.2. Suppose {K_i} is α-excessive. Then v_i(x) ≥ 0 is K_i-convex
in x for all i ∈ E and an environment dependent {(s_i, S_i)} policy is optimal.
In other words, there exist critical numbers s_i ≤ S_i such that

y*(i, x) = { S_i − x,  if x ≤ s_i
           { 0,        if x > s_i,

for i ∈ E, x ∈ R.
For any i ∈ E, S_i is the smallest minimizer of G_i so that G_i(S_i) ≤ G_i(y)
for all y ∈ R. Moreover, s_i ≤ S_i can be computed by solving

G_i(s_i) = K_i + G_i(S_i).
The characterization provided in Theorem 2.2 is true if {K_i} is α-excessive.
By Definition 2.1, this means that the fixed cost of ordering in
any environment is greater than or equal to the expected discounted fixed
cost of ordering that will be incurred if the order is placed after one more
period. Considering the fixed costs only, this provides an additional incentive
for the decision maker not to place an order at the beginning of a period
and to wait an additional period so as to pay less in expectation. This tendency to
hold an order and wait, especially if the inventory level is not too low, is the
fundamental idea behind (s, S) inventory control policies. It is clear that the
special case where K_i = K for some constant K ≥ 0 satisfies the condition
imposed by Theorem 2.2 since K_i = K ≥ αK = α ∑_{j∈E} P(i,j) K_j so long as
0 < α ≤ 1. In a continuous-review model with a Markov modulated Poisson
demand process, Song and Zipkin (1993) provide a similar {(s_i, S_i)} optimal
policy where the fixed cost of ordering is constant over the environmental
states.
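Definition 2.1 is easy to verify numerically for a given transition matrix. The sketch below, with an illustrative transition matrix, also shows that constant fixed costs always pass the test while strongly environment-dependent ones may fail it.

```python
def is_alpha_excessive(g, P, alpha):
    """Check Definition 2.1: g_i >= alpha * sum_j P(i, j) g_j for all i,
    for a nonnegative function g on the environment space."""
    return all(g[i] >= alpha * sum(P[i][j] * g[j] for j in range(len(g)))
               for i in range(len(g)))

P = [[0.8, 0.2], [0.3, 0.7]]  # illustrative transition matrix

# a constant fixed cost K_i = K is always alpha-excessive for 0 < alpha <= 1
assert is_alpha_excessive([5.0, 5.0], P, alpha=0.9)

# strongly environment-dependent fixed costs can violate the condition
assert not is_alpha_excessive([1.0, 10.0], P, alpha=0.9)
```

In the second case the cheap-ordering environment communicates quickly with the expensive one, so waiting one period is expected to make ordering costlier, and the (s, S) characterization of Theorem 2.2 is no longer guaranteed.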
The basic periodic review model presented in Section 2.1 can be easily
extended and generalized in several directions. Much of the analysis is similar
and control-limit type policies with simple structures are still optimal. Proofs
of the basic results presented in this section and discussions of extensions
involving finite planning horizon, no backlogging, the average cost criterion
and lead times can be found in Özekici and Parlar (1995).

2.3 Multi-Item Models

In models with multiple inventory items, the random environmental process
can be used as a tool that explains the dependence between the demand
processes. As an illustration, suppose that the environmental process reflects
the economic conditions of the market with respect to business cycles. Then,
if the demand for refrigerators sold in a consumer goods store is high, it is
likely that the demand for washing machines, for example, is also high because
the economy is "expanding". The opposite may be true in a "contracting"
economy.

Suppose that there are m items in store and the demand for the k'th one
in period n is given by D^k_n so that

P[D^1_n ≤ z_1, …, D^m_n ≤ z_m | Y_n = i] = A^1_i(z_1) ⋯ A^m_i(z_m)    (2.9)

for any z = (z_1, …, z_m) ∈ R^m, where A^k_i(·) is the distribution function of the
periodic demand for item k in environment i. Note that the demand processes
for the m items are dependent due to the common environment, but they are
conditionally independent given the environment. Similarly, if U^k_n represents
supplier availability for item k in period n, then

u_i(b) ≡ u_i(b_1, …, b_m) = P[U^1_n = b_1, …, U^m_n = b_m | Y_n = i]
       = ∏_{k=1}^m (u^k_i)^{b_k} (1 − u^k_i)^{1−b_k}    (2.10)

for any b = (b_1, …, b_m) ∈ {0, 1}^m, where u^k_i is the probability that the
supplier of item k will be available in environment i. For item k, K^k_i is the
fixed cost of ordering, c^k_i is the unit purchase cost, h^k_i is the unit holding cost
incurred at the end of a period, and p^k_i is the unit shortage cost incurred at
the end of a period. Moreover, there is an additional fixed cost K incurred if
any order is placed at all.
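The product form (2.10) is straightforward to evaluate. A minimal sketch follows; the function name and the availability probabilities are illustrative assumptions, not from the paper.

```python
def joint_supply_prob(b, u_i):
    """P[U_n^1 = b_1, ..., U_n^m = b_m | Y_n = i] under the conditional
    independence of eq. (2.10); u_i[k] is the availability probability
    of item k's supplier in environment i."""
    prob = 1.0
    for b_k, u_k in zip(b, u_i):
        prob *= u_k if b_k == 1 else 1.0 - u_k
    return prob

# illustrative: two items with supplier availabilities (0.9, 0.5)
u_i = (0.9, 0.5)
p_both_up = joint_supply_prob((1, 1), u_i)  # 0.9 * 0.5 = 0.45
```

The 2^m values u_i(b) sum to one, which is what makes the enumeration over b ∈ {0, 1}^m in the DPE (2.12) below a proper expectation over the suppliers' joint outcome.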
An inventory management policy is now a function y : E × R^m → R^m so
that the elements of the vector y(i, x) = (y_1(i, x), y_2(i, x), …, y_m(i, x)) give
the order-up-to levels of the items if the environment is i and the inventory
position is x ∈ R^m at the beginning of a period.
For any policy y, it now follows that the process (Y, X) = {(Y_n, X_n); n ≥
0} is a Markov chain where X_n = (X^1_n, X^2_n, …, X^m_n) is a vector with elements

X^k_{n+1} = U^k_n y_k(Y_n, X_n) + (1 − U^k_n) X^k_n − D^k_n    (2.11)

for any item k and n ≥ 0. Here X represents the inventory position of the
whole system and X^k represents the inventory position of the k'th item. Let
v_i(x) be the optimal expected total discounted cost if the initial environment
is i and the inventory position is x. Then, using the Markov property one can
easily verify that v_i(x) satisfies the dynamic programming equation (DPE)

v_i(x) = min_{y ≥ x} { K δ(∑_{k=1}^m (y_k − x_k)) + ∑_{k=1}^m u^k_i K^k_i δ(y_k − x_k)
         + ∑_{b∈{0,1}^m} u_i(b) G_i(x + b(y − x)) } − ∑_{k=1}^m c^k_i x_k    (2.12)

for i ∈ E, x ∈ R^m, where δ(z) is the indicator function which is equal to 0
only if z = 0 and 1 otherwise, and

G_i(y) = ∑_{k=1}^m c^k_i y_k + L_i(y) + α ∑_{j∈E} P(i,j) ∫_{R^m} A_i(dz) v_j(y − z)    (2.13)
with

L_i(y) = ∑_{k=1}^m ( h^k_i ∫_0^{y_k} A^k_i(dz) (y_k − z) + p^k_i ∫_{y_k}^∞ A^k_i(dz) (z − y_k) ).    (2.14)

In (2.12), x + b(y − x) is the vector in R^m whose k'th element is x_k + b_k(y_k −
x_k), which is either x_k or y_k depending on whether b_k = 0 or 1. If K = 0,
it is pretty clear from (2.12)-(2.14) and our cost structure that the functions
involved will have an additive form so that v_i(x) = ∑_{k=1}^m v^k_i(x_k), G_i(y) =
∑_{k=1}^m G^k_i(y_k), and L_i(y) = ∑_{k=1}^m L^k_i(y_k). In this case, the problem with m
different inventory items can be decomposed into m independent problems
with a single inventory item. The optimal policy will have the form

y*(i, x) = (y^1*(i, x_1), y^2*(i, x_2), …, y^m*(i, x_m))    (2.15)

where y^k*(i, x_k) is the optimal policy corresponding to the k'th inventory item.
This can be determined by solving the DPE involving this item only. In particular,
if K^k_i = 0 for all i ∈ E, then the optimal policy is an environment
dependent base-stock policy {S^k_i} as stated by Theorem 2.1. Similarly, Theorem
2.2 states that an environment dependent {(s^k_i, S^k_i)} policy is optimal
if K^k_i ≠ 0 for some i ∈ E but K^k is α-excessive.
If K ≠ 0, the optimal inventory management policy for this complex system,
involving a random environment, may in general have a complicated
structure. However, under fairly reasonable conditions, we believe that it will
not be much more complicated than the one with a constant environment.
This view is illustrated in Özekici (1996a) by considering the optimal replacement
problem for a complex system operating in a random environment.

3. Queueing Models
Queueing models also involve stochastic and deterministic parameters that
are subject to variations depending on some environmental factors. The cus-
tomer arrival rate as well as the service rate are not necessarily constants
that remain intact throughout the entire operation of the queueing system.
The environmental process in this case could represent any factor that may
influence these rates. Arrival rates of vehicles to a highway and their service
rates on that highway obviously depend on weather conditions, and the
production rate of a machine or work station depends on how well it is performing
physically; in particular, this rate would be zero if it is in a failed
state. In server-vacation models, the service rate is zero if the server is
"vacationing". Production rates are routinely changed due to work schedules
and many businesses go through slack periods where hardly any customers
arrive.
A queueing model where the arrival and service rates depend on a randomly
changing two-state environment was first introduced by Eisen and
Tainiter (1963). This line of modelling was later extended by other authors
such as Neuts (1978a, 1978b) and Purdue (1974), who provide a matrix-geometric
solution for the steady-state behaviour of the queueing systems. In
the recent literature, models that involve arrival and service processes that are
modulated by a stochastic process are all aimed at this phenomenon, where
the modulating process represents our environment. A comprehensive
discussion of Markov modulated queueing systems can be found in Prabhu and
Zhu (1989).

3.1 M/M/1 Queue with Varying Arrival and Service Rates

As an illustration, we consider Neuts (1978a), where the environmental process
Y = {Y_t; t ≥ 0} is a finite state Markov process with generator Q and
state space E = {1, 2, …, M}. Customers arrive according to a Poisson
process with rate λ_i, and they are serviced by a single server who works
exponentially at rate μ_i in environment i. Let X_t denote the number of customers
in the system at time t; then it follows that (X, Y) is a Markov process
with a generator of the form

        [ A0 + A1   A2    0     0    … ]
        [ A0        A1    A2    0    … ]
Q*  =   [ 0         A0    A1    A2   … ]    (3.1)
        [ 0         0     A0    A1   … ]
        [ …         …     …     …    … ]

where A0(i,j) = μ_i I(i,j), A1(i,j) = Q(i,j) − (λ_i + μ_i) I(i,j), and A2(i,j) =
λ_i I(i,j) are all M × M matrices.
Suppose that the environment has the stationary distribution π_i =
lim_{t→∞} P[Y_t = i], which can be determined by solving πQ = 0 and π1 = 1.
The average arrival rate is πλ ≡ ∑_{i∈E} π_i λ_i and the average service rate is
πμ ≡ ∑_{i∈E} π_i μ_i, so that the traffic intensity is now expressed as ρ = πλ/πμ.
The stationary distribution

ν_n(i) = lim_{t→∞} P[X_t = n, Y_t = i]

is characterized by the following main result.


Theorem 3.1. The queue is stable if ρ < 1, and the stationary distribution
ν of (X, Y) is given by ν = (ν_0, ν_1, …) where

ν_n = ν_0 R^n    (3.2)

for n ≥ 0. The matrix R is the unique solution, among nonnegative matrices of
order M with spectral radius less than one, of the equation

A2 + R A1 + R² A0 = 0.    (3.3)

The stability condition ρ < 1 makes perfect sense, as in the standard
M/M/1 model. Note also that the solution (3.2) reduces trivially to the well-known
result lim_{t→∞} P[X_t = n] = ∑_{i∈E} ν_n(i) = (1 − ρ)ρ^n, where ρ = λ/μ,
if λ_i = λ and μ_i = μ for all states i ∈ E. Theorem 3.1 demonstrates the fact
that the "geometric" stationary distribution of the basic M/M/1 model is now
generalized as the "matrix-geometric" solution (3.2) of the model operating
in a random environment.
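The matrix R of Theorem 3.1 can be computed by the classical successive substitution R ← −(A2 + R²A0)A1^{−1}, a fixed-point rearrangement of the quadratic equation (3.3) that increases monotonically from R = 0 to the minimal nonnegative solution when ρ < 1. The sketch below uses an illustrative two-state environment; the generator and the rates are assumptions chosen so that ρ = 7/12 < 1.

```python
# illustrative 2-state environment; generator Q and rates are assumptions
lam = [3.0, 1.0]                  # arrival rates per environment
mu = [4.0, 4.0]                   # service rates per environment
Q = [[-1.0, 1.0], [2.0, -2.0]]    # environment generator

A0 = [[mu[0], 0.0], [0.0, mu[1]]]              # departures (level down)
A2 = [[lam[0], 0.0], [0.0, lam[1]]]            # arrivals (level up)
A1 = [[Q[0][0] - lam[0] - mu[0], Q[0][1]],     # local transitions
      [Q[1][0], Q[1][1] - lam[1] - mu[1]]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def inv2(A):
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [[A[1][1] / det, -A[0][1] / det],
            [-A[1][0] / det, A[0][0] / det]]

# successive substitution R <- -(A2 + R^2 A0) A1^{-1}, starting at R = 0
A1_inv = inv2(A1)
R = [[0.0, 0.0], [0.0, 0.0]]
for _ in range(200):
    R2A0 = mat_mul(mat_mul(R, R), A0)
    S = [[A2[i][j] + R2A0[i][j] for j in range(2)] for i in range(2)]
    R = [[-entry for entry in row] for row in mat_mul(S, A1_inv)]
```

Once R is available, the stationary probabilities follow from ν_n = ν_0 R^n, with ν_0 fixed by the boundary equations and normalization.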

3.2 M/M/1 Queueing Network in a Random Environment


Although there is considerable literature on models that focus on varying cus-
tomer arrival rates, the effect of the environment on other aspects of queue-
ing models is often neglected. These include cases where the number
of servers, the system capacity, or even the queue discipline changes with
respect to a randomly changing environment. What is more important is that
the random environment can be used to model complex queueing networks
where the arrival processes, or other model parameters involved, are stochas-
tically dependent. Once again, the dependence is due only to the common
environmental process. If the arrival rate of parts coming to a workstation
in a manufacturing system is increasing, it is very likely that the same phe-
nomenon is observed in another workstation in the same network. This could
be due to a heavy work schedule that applies to all workstations or due to
the increase in the demand for the final product which requires the process-
ing of more different parts that are later assembled together. It is common
knowledge that during certain "seasons" calls arriving at different nodes of
a telecommunications network all increase at the same time. Computational
algorithms to determine the matrix $R$ of Theorem 3.1 can be found in Neuts (1981).
Consider now such a queueing network with $m$ nodes that are all M/M/1.
The customer arrival rate to the $k$'th queue is $\lambda_i^k$ and the server works at rate
$\mu_i^k$ in environment $i \in E$. The routing probability matrix is $P_i$, so that $P_i(k, l)$
is the probability that a customer who leaves queue $k$ goes to queue $l$, and
$1 - \sum_{l=1}^m P_i(k, l)$ is the probability that the customer leaves the network. If
the number of customers in the $k$'th system at time $t$ is $X_t^k$, then the process
$(X^1, X^2, \ldots, X^m, Y)$ is a Markov process. It is clear that the arrival processes
of the $m$ queues are all dependent since they are all modulated by the same
environmental process $Y$. As a matter of fact, if the number of customers
arriving to the $k$'th queue until time $t$ is denoted by $N_t^k$, then, whenever the
environment is fixed at state $i$ throughout $[0, t]$,
$$P[N_t^1 = n_1, \ldots, N_t^m = n_m \mid Y = i] = \prod_{k=1}^m \frac{e^{-\lambda_i^k t}(\lambda_i^k t)^{n_k}}{n_k!} \qquad (3.4)$$
and, more generally, conditional on the environmental process $Y$,
$$P[N_t^1 = n_1, \ldots, N_t^m = n_m \mid Y] = \prod_{k=1}^m \frac{e^{-\bar\lambda^k(t)}(\bar\lambda^k(t))^{n_k}}{n_k!}, \qquad \bar\lambda^k(t) \equiv \sum_{i \in E} \lambda_i^k D_t(i) \qquad (3.5)$$
148 Süleyman Özekici

where $D_t(i) \equiv \int_0^t 1_i(Y_s)\,ds$ is the total amount of time the environment
stays at state $i$ until time $t$. It is clear from (3.5) that the arrival processes
$N^1, N^2, \ldots, N^m$ are dependent due to the common environment but, given
the environment, they are conditionally independent. An interesting prob-
lem here is to investigate the structure of the stationary joint distribution of
$(X, Y)$ to find out whether it has some sort of a product form as is the case
under no environmental variation.
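The conditional structure is easy to work with computationally: given the environment path, each count $N_t^k$ is Poisson with mean $\sum_{i \in E} \lambda_i^k D_t(i)$, and the counts are conditionally independent. The sketch below uses a hand-picked piecewise-constant environment path and made-up arrival rates.

```python
import math

# Made-up environment path on [0, 10]: (state, duration) segments.
segments = [(0, 3.0), (1, 2.0), (0, 4.0), (1, 1.0)]

# Hypothetical arrival rates lambda_i^k for m = 2 queues, states {0, 1}.
lam = {0: [1.0, 0.5], 1: [2.0, 3.0]}
m = 2

# Occupation times D_t(i): total time spent in each state until t = 10.
D = {}
for state, duration in segments:
    D[state] = D.get(state, 0.0) + duration

# Conditional means lambda_bar^k(t) = sum_i lambda_i^k D_t(i).
lam_bar = [sum(lam[i][k] * D[i] for i in D) for k in range(m)]

def joint_pmf(counts):
    """Conditional joint pmf of (N_t^1, ..., N_t^m) given the path:
    a product of Poisson probabilities, by conditional independence."""
    p = 1.0
    for k, n in enumerate(counts):
        p *= math.exp(-lam_bar[k]) * lam_bar[k] ** n / math.factorial(n)
    return p
```

For this path, $D_t(0) = 7$ and $D_t(1) = 3$, giving conditional means $13.0$ and $12.5$ for the two queues; the dependence between the queues enters only through the randomness of $D_t(\cdot)$.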

4. Reliability Models
In reliability and maintenance models, it is generally assumed that a device
always works in a given fixed environment. The probability law of the dete-
rioration and failure process thus remains intact throughout its useful life.
The life distribution and the corresponding failure rate function is taken to
be the one obtained through statistical life testing procedures that are usu-
ally conducted under ideal laboratory conditions by the manufacturer of the
device. Data on lifetimes may also be collected while the device is in oper-
ation to estimate the life distribution. In any case, the basic assumption is
that the prevailing environmental conditions either do not change in time or,
in case they do, they have no effect on the deterioration and failure of the
device. Therefore, statistical procedures in estimating the life distribution
parameters and decisions related with replacement and repair are based on
the calendar age of the item.
There has been growing interest in recent years in reliability and main-
tenance models where the main emphasis is placed on the so-called intrinsic
age of a device rather than its real age. This is necessitated by the fact that
devices often work in varying environments during which they are subject
to varying environmental conditions with significant effects on performance.
The deterioration and failure process therefore depends on the environment,
and it no longer makes much sense to measure the age in real time without
taking into consideration the different environments that the device has op-
erated in. There are many examples where this important factor can not be
neglected or overlooked. Consider, for example, a jet engine which is subject
to varying atmospheric conditions like pressure, temperature, humidity, and
mechanical vibrations during take-off, cruising, and landing. The changes in
these conditions cause the engine to deteriorate, or age, according to a set of
rules which may well deviate substantially from the usual one that measures
the age in real time irrespective of the environment.
As a matter of fact, the intrinsic age concept is being used routinely in
practice in one form or another. In aviation, the calendar age of an airplane
since the time it was actually manufactured is not of primary importance
in determining maintenance policies. Rather, the number of take-offs and
landings, total time spent cruising in fair conditions or turbulence, or total

miles flown since manufacturing or since the last major overhaul are more
important factors.
Another example is a machine or a workstation in a manufacturing sys-
tem which may be subject to varying loading patterns depending on the
production schedule. In this case, the atmospheric conditions do not neces-
sarily change too much in time, and the environment is now represented by
varying loading patterns so that, for example, the workstation ages faster
when it is overloaded, slower when it is underloaded, and not at all when
it is not loaded or kept idle. Therefore, the term "environment" is used in
a loose sense here so that it represents any set of conditions that affect the
deterioration and aging of the device.
Once again, it is also routine practice in manufacturing systems to mea-
sure the age of a workstation not with respect to its real age since installation,
but with respect to another criterion like the number of parts produced by
the workstation since its installation or since the last maintenance. Here, the
environment can be the production rate which can be set at different levels
depending on the production schedule or on a usually cyclic workload re-
quired during the production shifts on any given day. It is reasonable then
to suppose that if the production rate is increased or decreased, the worksta-
tion ages faster or slower respectively. In case the production rate is zero, the
workstation should not age at all or possibly at a very low failure rate that
accounts for the effect of deterioration in real time.

4.1 An Environment Dependent Periodic Model

Consider a periodic model where a device operates in a randomly changing


environment and the survival probability for each period depends on the
state of the environment. Letting Yn denote the state of the environment at
the beginning of the n'th period, we suppose that Y is a Markov chain with
some transition matrix P on the state space E. If the environment is at state
$i$, then the device survives that period with probability $q(i)$ and fails with
probability $p(i)$ where $p(i) + q(i) = 1$. The conditional probability distribution
of the lifetime $L$ of the device is given by
$$P[L = n \mid Y] = \begin{cases} p(Y_0), & \text{if } n = 1 \\ q(Y_0) q(Y_1) \cdots q(Y_{n-2}) p(Y_{n-1}), & \text{if } n \geq 2 \end{cases} \qquad (4.1)$$
which is simply the geometric distribution $P[L = n \mid Y] = q^{n-1} p$ if $q(i) = q$
and $p(i) = p$ for all $i \in E$, independent of the states of the environment.
So, the generalized geometric distribution (4.1) can be referred to as the
geometric distribution modulated by the Markov chain $Y$. We can also write

$$P[L > n \mid Y] = q(Y_0) q(Y_1) \cdots q(Y_{n-1}) \qquad (4.2)$$
for $n \geq 1$.
It follows trivially that (4.2) leads to the recursive formula

$$P_i[L > n+1] = q(i) \sum_{j \in E} P(i,j)\, P_j[L > n] \qquad (4.3)$$

with the obvious boundary condition $P_i[L > 0] = 1$ where $P_i[A] = P[A \mid Y_0 = i]$ for any event $A$.
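Recursion (4.3) is immediate to implement; the sketch below uses a made-up two-state environment (both the transition matrix and the survival probabilities are illustrative). When $q(i) \equiv q$, the recursion collapses to the geometric tail $P_i[L > n] = q^n$.

```python
# Hypothetical two-state environment: transition matrix of Y and
# per-period survival probabilities q(i).
P = [[0.7, 0.3],
     [0.4, 0.6]]
q = [0.95, 0.80]

def survival(q, P, n_max):
    """Compute P_i[L > n] for n = 0..n_max via recursion (4.3),
    starting from the boundary condition P_i[L > 0] = 1."""
    states = range(len(q))
    surv = [[1.0] * len(q)]                  # surv[n][i] = P_i[L > n]
    for _ in range(n_max):
        prev = surv[-1]
        surv.append([q[i] * sum(P[i][j] * prev[j] for j in states)
                     for i in states])
    return surv

surv = survival(q, P, 20)
```

The tail probabilities `surv[n][i]` are decreasing in $n$, and starting in the "bad" state 1, with its lower survival probability, gives smaller tail probabilities here.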
Life distribution classifications play an important role in many problems
on reliability and maintenance. Recall that stochastic processes are often
classified with respect to the life distribution classification of their first pas-
sage times. In particular, supposing that the state space of $Y$ is ordered as
$E = \{0, 1, 2, \ldots\}$, the Markov chain $Y$ is said to be an IFRA (DFRA)
process if the first passage time $T_j = \inf\{n \geq 0 : Y_n \geq j\}$ has a discrete IFRA
(DFRA) distribution, respectively, on $\{Y_0 = i\}$ for any $i < j$.
Theorem 4.1. Suppose that $q(i)$ is decreasing (increasing) in $i \in E$; if $Y$ is
an increasing IFRA (DFRA) process, then $L$ has an IFRA (DFRA) distri-
bution on $\{Y_0 = i\}$ for any $i \in E$.

Theorem 4.1 states that if the environmental process increases such that
the states get "worse" in the IFRA sense with decreasing survival probabil-
ities, then the lifetime L has an IFRA distribution. One of the implications
here is that the probability of failure in the next period increases in time.
The opposite conditions yield the DFRA case.
If the device consists of $m$ components connected in series so that com-
ponent $k$ survives a period in environment $i$ with probability $q_k(i)$ and fails
with probability $p_k(i) = 1 - q_k(i)$, then the characterization provided by (4.1)-(4.3) holds
for the lifetime $L_k$ for component $k$. The conditional joint distribution is
$$P[L_1 > n_1, L_2 > n_2, \ldots, L_m > n_m \mid Y] = \prod_{k=1}^m q_k(Y_0) q_k(Y_1) \cdots q_k(Y_{n_k - 1}) \qquad (4.4)$$
Moreover, for the series reliability model
$$P[L > n \mid Y] = \prod_{k=1}^m q_k(Y_0) q_k(Y_1) \cdots q_k(Y_{n-1}) \qquad (4.5)$$

and the recursive relationship
$$P_i[L > n+1] = \left( \prod_{k=1}^m q_k(i) \right) \sum_{j \in E} P(i,j)\, P_j[L > n] \qquad (4.6)$$
still holds true with $P_i[L > 0] = 1$. Comparison of (4.6) with (4.3) reveals the
obvious fact that the series system can be regarded as a single component
that has survival probability $q(i) = \prod_{k=1}^m q_k(i)$ in environment $i$. So, the life
distribution classification provided in Theorem 4.1 can easily be extended to
the complex case with many stochastically dependent components.
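This equivalence is easy to check numerically: recursion (4.6) for the series system and recursion (4.3) for a single component with survival probability $q(i) = \prod_{k=1}^m q_k(i)$ produce identical results. The two-state environment and the component probabilities below are made up for illustration.

```python
import math

# Hypothetical environment chain and per-component survival probabilities.
P = [[0.6, 0.4],
     [0.5, 0.5]]            # environment transition matrix
qk = [[0.99, 0.90],         # q_k(i): one row per component k,
      [0.97, 0.85],         # one column per environment state i
      [0.98, 0.92]]

states = range(len(P))

def step_series(prev):
    """One step of recursion (4.6) for the series system."""
    return [math.prod(q[i] for q in qk)
            * sum(P[i][j] * prev[j] for j in states) for i in states]

def step_single(prev, q):
    """One step of recursion (4.3) for a single component."""
    return [q[i] * sum(P[i][j] * prev[j] for j in states) for i in states]

# Single "aggregate" component with q(i) = prod_k q_k(i).
q_agg = [math.prod(q[i] for q in qk) for i in states]

series = [1.0, 1.0]
single = [1.0, 1.0]
for _ in range(10):
    series = step_series(series)
    single = step_single(single, q_agg)
```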

Corollary 4.1. Suppose that $q_k(i)$ is decreasing (increasing) in $i \in E$ for all
$k$; if $Y$ is an increasing IFRA (DFRA) process, then $L$ has an IFRA (DFRA)
distribution on $\{Y_0 = i\}$ for any $i \in E$.

The model introduced in this section can be further analyzed in several
directions. For detailed proofs of the results stated here and a complete anal-
ysis of the optimal maintenance problem, we refer the reader to Özekici and
Sevilir (1996). A rather general model in the context of a Bernoulli process
modulated by a Markov chain is discussed by Özekici (1996b) with an appli-
cation in reliability.

4.2 Intrinsic Aging

A complex reliability model is one that contains a large number of compo-


nents which may be highly interrelated. A rather restrictive and unrealistic
assumption of these models is the stochastic independence of the lifetimes of
the components that make up the system. This assumption is hardly true for
many cases. An interesting model of stochastic component dependence was
introduced by Çınlar and Özekici (1987) where stochastic dependence is in-
troduced by a randomly changing common environment that all components
of the system are subjected to. This model is based on the simple observation
that the aging or deterioration process of any component depends very much
on the environment that the component is operating in. They propose to
construct an intrinsic clock which ticks differently in different environments
to measure the intrinsic age of the device. The environment is modelled by a
semi-Markov jump process and the intrinsic age is represented by the cumu-
lative hazard accumulated in time during the operation of the device in the
randomly varying environment. This is a rather stylish choice which envisions
that the intrinsic lifetime of any device has an exponential distribution with
parameter 1. There are, of course, other methods of constructing an intrin-
sic clock to measure the intrinsic age. Also, the random environment model
can be used to study reliability and maintenance models involving complex
devices with many interacting components. The lifetimes of the components
of such complex devices are stochastically dependent due to the common
environment they are all subject to.
The concept of random hazard functions is also used in Gaver (1963) and
Arjas (1981). The intrinsic aging model of Çınlar and Özekici (1987) is stud-
ied further in Çınlar et al. (1989) to determine the conditions that lead to
associated component lifetimes, as well as multivariate increasing failure rate
(IFR) and new better than used (NBU) life distribution characterizations.
It was also extended in Shaked and Shanthikumar (1989) by discussions on
several different models with multicomponent replacement policies. Lindley
and Singpurwalla (1986) discuss the effects of the random environment on
the reliability of a system consisting of components which share the same
environment. Although the initial state of the environment is random, they

assume that it remains constant in time and components have exponential


life distributions in each possible environment. This model is also studied by
Lefevre and Malice (1989) to determine partial orderings on the number of
functioning components and the reliability of k-out-of-n systems, for different
partial orderings of the probability distribution on the environmental state.
The association of the lifetimes of components subjected to a randomly vary-
ing environment is discussed in Lefevre and Milhaud (1990). Singpurwalla and
Youngren (1993) also discuss multivariate distributions that arise in models
where a dynamic environment affects the failure rates of the components.
For a complex model with $m$ components, intrinsic aging in Çınlar and
Özekici (1987) is described by the basic relationship
$$\frac{dA_t}{dt} = f(Y_t, A_t) \qquad (4.7)$$
where $A_t = (A_t^1, A_t^2, \ldots, A_t^m)$ is the intrinsic age of the system at time $t$ that
consists of the intrinsic ages of the $m$ components, $Y = (Y^1, Y^2, \ldots, Y^d)$
is the environmental process with state space $E$ that reflects the states of
various environmental factors, and $f$ is the intrinsic hazard rate function.
For example, $Y_t^1$ can be the calendar time $t$, $Y_t^2$ could be the temperature
at $t$, $Y_t^3$ could be the pressure at time $t$, etc. Moreover, $f$ is of the form
$f(i, x) = (f_1(i, x), f_2(i, x), \ldots, f_m(i, x))$ where $f_k(i, x)$ is the intrinsic aging
rate of component $k$ in environment $i$ if the intrinsic ages of the components
are given by the vector $x = (x_1, x_2, \ldots, x_m)$.
In our exposition, we will further specialize on this basic model by adapt-
ing the notation and terminology of Özekici (1995) who analyzed the optimal
maintenance problem of a single-component device operating in a random en-
vironment. In particular, we suppose that the state space $E$ is discrete and
fk(i, x) = fk(i, Xk) such that the intrinsic aging rate of any component k
depends only on the environment and the intrinsic age of that component,
independent of the ages of all other components. This implies that both
stochastic dependence among the components and intrinsic aging of each
component depend only on environmental factors that the system as a whole
is subjected to. Furthermore, we will relate the intrinsic failure rate function
fk(i, x) to the failure rate function rf(t) of component k while it operates in
environment i. We will now present the details of the specific construction of
the intrinsic aging process.
Let $L_k$ denote the lifetime of the $k$'th component while $L$ represents the
lifetime of the system. Suppose, for now, that the environment remains fixed
at some state $i \in E$ so that $Y_t = i$ for all $t \geq 0$. In any environment $i \in E$,
the life distribution of component $k$ is given by the cumulative distribution
function
$$F_i^k(t) = P[L_k \leq t \mid Y = i] \qquad (4.8)$$
with failure rate function $r_i^k(t)$ and hazard function $R_i^k(t) = \int_0^t r_i^k(s)\,ds$ so
that the survival probability function $\bar{F}_i^k = 1 - F_i^k$ can be written as
$$\bar{F}_i^k(t) = P[L_k > t \mid Y = i] = \exp(-R_i^k(t)) \qquad (4.9)$$

We further suppose that the stochastic dependence among the components is
due to the environment only, and the components are otherwise independent.
This means
$$P[L_1 > u_1, L_2 > u_2, \ldots, L_m > u_m \mid Y = i] = \exp\left(-\sum_{k=1}^m R_i^k(u_k)\right) \qquad (4.10)$$
so that the lifetimes are independent so long as the environment is fixed.


Relationship (4.9) allows us to construct an intrinsic clock to measure the
intrinsic age of any component $k$ at time $t$ as $A_t^k = R_i^k(t)$, and the real lifetime
is characterized by
$$L_k = \inf\{t \geq 0;\ A_t^k \geq \hat{L}_k\} \qquad (4.11)$$
where $\hat{L}_k$ is a random variable representing the intrinsic lifetime of component
$k$. Moreover, it has an exponential distribution with parameter 1 since
$$P[\hat{L}_k > R_i^k(t)] = P[L_k > t] = \exp(-R_i^k(t)) \qquad (4.12)$$
Therefore, in the fixed environment $i \in E$, it follows that if the intrinsic
age is measured by the hazard function, then component $k$ has an exponen-
tially distributed intrinsic lifetime with parameter 1. Moreover, its intrinsic
clock ticks at the rate $r_i^k(t)$ at time $t$. If the real time is $t$, then the in-
trinsic clock shows time $R_i^k(t)$. Similarly, when the intrinsic time is $x$, the
corresponding real time is given by the inverse function
$$\hat{R}_i^k(x) = \inf\{t \geq 0;\ R_i^k(t) > x\} \qquad (4.13)$$
In other words, it takes $\hat{R}_i^k(x)$ units of real time operation to age a brand
new component to intrinsic age $x$ in environment $i$.
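As a concrete illustration (not from the text), take a Weibull hazard $R_i^k(t) = (t/\beta)^{\alpha}$ in environment $i$, so the inverse clock (4.13) is $\hat{R}_i^k(x) = \beta x^{1/\alpha}$. The parameters below are made up:

```python
# Weibull intrinsic clock in a single environment (illustrative
# parameters): R(t) = (t/beta)**alpha, inverse Rhat(x) = beta*x**(1/alpha).
alpha, beta = 2.0, 5.0

def R(t):
    """Cumulative hazard: intrinsic age after t real time units."""
    return (t / beta) ** alpha

def Rhat(x):
    """Inverse clock (4.13): real time needed to reach intrinsic age x."""
    return beta * x ** (1.0 / alpha)

# Round trip: aging a new component for t real time units and then
# asking how long that took must return t.
t = 3.7
x = R(t)
```

With $\alpha > 1$ the clock ticks faster as the component ages (the IFR case), while $\alpha < 1$ gives the DFR case.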
Let $a_i^k = \lim_{t \to +\infty} R_i^k(t)$ denote the maximum intrinsic age that compo-
nent $k$ can reach while operating in environment $i \in E$, and let $t_i^k = \inf\{t \geq 0;\ R_i^k(t) = a_i^k\}$ denote the time when this maximum age is reached. In most
environments $a_i^k = t_i^k = +\infty$, but it is also possible that $a_i^k, t_i^k < +\infty$. In
particular, if $\delta \in E$ represents an environmental state during which the com-
ponent is kept idle, then $r_\delta^k = R_\delta^k = a_\delta^k = t_\delta^k = 0$. Moreover, if $a_i^k < +\infty$,
then $r_i^k(s) = 0$ and $R_i^k(s) = a_i^k$ for all $s \geq t_i^k$. This is equivalent to saying
that once the component reaches the intrinsic age $a_i^k$, it does not fail or age
any more in environment $i$. As a matter of fact, if $a_i^k < +\infty$, then the life
distribution is defective with $P[L_k = +\infty \mid Y = i] = e^{-a_i^k} > 0$ and the device
may function forever without failing. This may also correspond to the case
where, upon reaching the critical age $a_i^k$, the component is used no more
in environment $i$. Note that the definition (4.13) implies that $\hat{R}_i^k(x) = +\infty$
whenever $x \geq a_i^k$. Throughout the remainder of this article the intrinsic age,
intrinsic time, and intrinsic lifetime will be referred to as simply the age,
time, and lifetime unless stated otherwise.

4.3 Intrinsic Aging in a Random Environment

Suppose that the environmental process is the minimal semi-Markov process
associated with a Markov renewal process. Let $T_n$ denote the time of the
$n$'th environment change and $X_n$ denote the $n$'th environmental state for
$n \geq 0$ with $T_0 \equiv 0$. The main assumption is that the process $(X, T) =
\{(X_n, T_n);\ n \geq 0\}$ is a Markov renewal process on the state space $E \times \mathbb{R}_+$ with
some semi-Markov kernel $Q$ where $\mathbb{R}_+ = [0, +\infty)$. Moreover, $Y = \{Y_t;\ t \geq 0\}$
is the minimal semi-Markov process associated with $(X, T)$. More precisely,
$Y_t = X_n$ whenever $T_n \leq t < T_{n+1}$. For any $i, j \in E$ and $t \geq 0$,
$$Q(i, j, t) = P[X_{n+1} = j, T_{n+1} - T_n \leq t \mid X_n = i] \qquad (4.14)$$
and it is well-known that $X$ is a Markov chain on $E$ with transition matrix
$P(i,j) = P[X_{n+1} = j \mid X_n = i] = Q(i, j, +\infty)$. We further assume that the
Markov renewal process has infinite lifetime so that $\sup_n T_n = +\infty$.
A stylish choice to extend the construction of the intrinsic aging process in
this setting is to measure the age by the total hazard accumulated during the
operation of the device in the randomly varying environment. Therefore, for
component $k$, the age process $A^k = \{A_t^k;\ t \geq 0\}$ is the continuously increasing
stochastic process defined by
$$A_{T_n + s}^k = \begin{cases} R_{X_n}^k(\hat{R}_{X_n}^k(A_{T_n}^k) + s), & \text{if } A_{T_n}^k < a_{X_n}^k \\ A_{T_n}^k, & \text{if } A_{T_n}^k \geq a_{X_n}^k \end{cases} \qquad (4.15)$$
for any $n \geq 0$, $s \leq T_{n+1} - T_n$, and initial age $A_0^k \geq 0$. The model therefore
supposes that if the component has already reached the critical maximum
age $a_{X_n}^k$ by time $T_n$, it is either kept idle or it does not fail or age any more
throughout the $n$'th environment $X_n$. An equivalent definition is provided by
the derivative
$$\frac{dA_t^k}{dt} = \begin{cases} r_{X_n}^k(\hat{R}_{X_n}^k(A_{T_n}^k) + (t - T_n)), & \text{if } A_{T_n}^k < a_{X_n}^k \\ 0, & \text{if } A_{T_n}^k \geq a_{X_n}^k \end{cases} \qquad (4.16)$$
for $T_n \leq t < T_{n+1}$. To simplify the notation, it is convenient to set
$$H_i^k(x, t) = \begin{cases} R_i^k(\hat{R}_i^k(x_k) + t), & \text{if } x_k < a_i^k \\ x_k, & \text{if } x_k \geq a_i^k \end{cases} \qquad (4.17)$$
for $i \in E$, $x \in \mathbb{R}_+^m$, and $t \geq 0$, so that this represents the amount of aging
caused by operating component $k$ of initial age $x_k$ in environment $i$ for $t$ real
time units. Note that $x = (x_1, x_2, \ldots, x_m)$ is the initial age of the system
that consists of $m$ components. If component $k$ is initially at age $x_k$ at the
beginning of environment $i$, then the amount of real time operation required
to age it $u$ time units in this environment is given by
$$\hat{H}_i^k(x, u) = \hat{R}_i^k(x_k + u) - \hat{R}_i^k(x_k) \qquad (4.18)$$

We will use the compact notation $H_i(x, t) = (H_i^1(x, t), \ldots, H_i^m(x, t))$ to
denote the intrinsic age of the system at time $t$ given that the initial age was
$x$ and the environment was $i$. It follows immediately that equation (4.15) can
be rewritten as $A_{T_n + s} = H_{X_n}(A_{T_n}, s)$.
To observe the relationship between the intrinsic failure rate $f$ in (4.7) of
the Çınlar and Özekici (1987) model and the ordinary failure rate function $r$
in the present setting, please note that (4.16) implies $f(i, x) = r_i(\hat{R}_i(x))$ in
compact notation, with the understanding that this is equivalent to $f_k(i, x_k) =
r_i^k(\hat{R}_i^k(x_k))$ for component $k$.
This intrinsic aging model simply combines the hazard functions of the
components in the environmental states. Given the failure rate functions
$\{r_i^k(\cdot);\ i \in E\}$ and a realization of the environmental process $Y$, the age
process $A \equiv (A^1, A^2, \ldots, A^m)$ is completely defined by (4.15) or (4.16). The
description of $A^k$ should be clear from these expressions. If the initial age
of component $k$ is $A_0^k = x_k$ and the initial environment is $X_0 = i$ with
$T_0 \equiv 0$, then the initial real age of the component is $\hat{R}_i^k(x_k)$ and it ages as
$A_s^k = H_i^k(x, s)$ for $s \leq T_1$. At some time $T_1 = u$, the environment jumps
to state $j \in E$ with some probability $Q(i, j, du)$ and the age is now the
accumulated hazard given by $A_{u+s}^k = H_j^k(A_u^k, s)$ for $u + s \leq T_2$. The sample
path of $A^k$ is constructed similarly in time as the environmental process
evolves so that, in general, if the environment jumps to some state $l \in E$ at
the $n$'th jump time $T_n = t$, then the age evolves as $A_{t+s}^k = H_l^k(A_t^k, s)$ so long
as $t + s \leq T_{n+1}$.
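The sample-path construction just described is easy to code. The sketch below tracks one component through a hand-picked sequence of environment sojourns, using illustrative Weibull hazards $R_i(t) = (t/\beta_i)^{\alpha_i}$ per environment (all parameters made up):

```python
# Per-environment Weibull hazards (illustrative): R_i(t) = (t/beta)**alpha.
params = {0: (2.0, 5.0),   # environment 0: (alpha, beta)
          1: (1.0, 2.0)}   # environment 1: exponential case

def R(i, t):
    a, b = params[i]
    return (t / b) ** a

def Rhat(i, x):
    a, b = params[i]
    return b * x ** (1.0 / a)

def H(i, x, t):
    """Intrinsic age after operating a component of age x in
    environment i for t real time units, per (4.17)."""
    return R(i, Rhat(i, x) + t)

# Hand-picked environment path: (state, sojourn length) pairs.
path = [(0, 1.0), (1, 2.0), (0, 0.5)]

age = 0.0                          # A_0 = 0: brand new component
for state, sojourn in path:
    age = H(state, age, sojourn)   # (4.15): age accrued over the sojourn
```

When the environment never changes, the construction reduces to the fixed-environment clock: `H(i, 0, t)` equals `R(i, t)`.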
It should be mentioned that the intrinsic age model described in this sec-
tion may provide substantial simplification on the statistical issues regarding
complex devices with many components. Suppose that there are m com-
ponents with dependent lifetimes operating in a deterministic environment
with N states. An important statistical problem is the estimation of the joint
life distribution which is a function of m variables. The intrinsic age model
reduces this statistical problem to one where we only need to estimate m
separate marginal life distributions for the m components in each one of the
$N$ environmental states. In other words, it may be easier to estimate $m \times N$
functions of a single variable than one function of m variables. This can be
achieved through statistical life testing procedures by testing each component
separately in a laboratory which is capable of simulating the environmental
states. If the device is made up of a single component, as in our setting, then
the intrinsic aging model described by (4.15) or (4.16) provides a means of
combining the hazard functions of the device which are determined by life
testing in a laboratory for each environment.
The intrinsic aging model described in detail in this section is used in the
context of an optimal replacement problem for a complex system in Özekici
(1996a).

References

Arjas, E.: The Failure and Hazard Process in Multivariate Reliability Systems.
Mathematics of Operations Research 6, 551-562 (1981)
Bellman, R., Glicksberg, I., and Gross, 0.: On the Optimal Inventory Equation.
Management Science 2, 83-104 (1955)
Bertsekas, D. P.:Dynamic Programming, Deterministic and Stochastic Models. En-
glewood Cliffs: Prentice-Hall 1987
Çınlar, E.: Markov Additive Processes: I. Z. Wahrscheinlichkeitstheorie verw. Geb.
24, 85-93 (1972a)
Çınlar, E.: Markov Additive Processes: II. Z. Wahrscheinlichkeitstheorie verw. Geb.
24, 95-121 (1972b)
Çınlar, E.: Introduction to Stochastic Processes. Englewood Cliffs, NJ: Prentice-
Hall 1975
Çınlar, E.: Shock and Wear Models and Markov Additive Processes. In: Shimi, I.N.,
Tsokos, C.P. (eds.): The Theory and Applications of Reliability 1. New York:
Academic Press 1977, pp. 193-214
Çınlar, E., Özekici, S.: Reliability of Complex Devices in Random Environments.
Probability in the Engineering and Informational Sciences 1, 97-115 (1987)
Çınlar, E., Shaked, M., Shanthikumar, J.G.: On Lifetimes Influenced by a Common
Environment. Stochastic Processes and Their Applications 33, 347-359 (1989)
Eisen, M., Tainiter, M.: Stochastic Variations in Queuing Processes. Operations
Research 11, 922-927 (1963)
Ezhov, I.I., Skorohod, A.V.: Markov Processes with Homogeneous Second Compo-
nent: I. Teor. Verojatn. Primen. 14, 1-13 (1969)
Feldman, R.: A Continuous Review (s, S) Inventory System in a Random Environ-
ment. Journal of Applied Probability 15, 654-659 (1978)
Gaver, D.P.: Random Hazard in Reliability Problems. Technometrics 5, 211-226
(1963)
Gupta, D.: The (Q, r) Inventory System with an Unreliable Supplier. Technical
Report. School of Business, McMaster University (1993)
Gürler, Ü., Parlar, M.: An Inventory Problem with Two Randomly Available Sup-
pliers. Technical Report. School of Business, McMaster University (1995)
Iglehart, D.L.: Dynamic Programming and Stationary Analysis of Inventory Prob-
lems. In: Scarf, H.E., Gilford, D.M., Shelly, M.W. (eds.): Multistage Inventory
Models and Techniques. Stanford: Stanford University Press 1963
Iglehart, D.L., Karlin, S.: Optimal Policy for Dynamic Inventory Process with Non-
stationary Stochastic Demands. In: Arrow, K.J., Karlin, S., Scarf, H. (eds.):
Studies in Applied Probability and Management Science. Stanford: Stanford
University Press 1962 pp. 127-147
Kalymon, B.: Stochastic Prices in a Single Item Inventory Purchasing Model. Op-
erations Research 19, 1434-1458 (1971)
Lefevre, C., Malice, M.P.: On a System of Components with Joint Lifetimes Dis-
tributed as a Mixture of Exponential Laws. Journal of Applied Probability 26,
202-208 (1989)
Lefevre, C., Milhaud, X.: On the Association of the Lifelengths of Components
Subjected to a Stochastic Environment. Advances in Applied Probability 22,
961-964 (1990)
Lindley, D.V., Singpurwalla, N.D.: Multivariate Distributions for the Lifelengths of
Components of a System Sharing a Common Environment. Journal of Applied
Probability 23, 418-431 (1986)

Nahmias, S.: Production and Operations Analysis. 2nd edition. Homewood: Irwin
1993
Neuts, M.F.: The M/M/1 Queue with Randomly Varying Arrival and Service Rates.
Opsearch 15, 139-157 (1978a)
Neuts, M.F.: Further Results on the M/M/1 Queue with Randomly Varying Rates.
Opsearch 15, 158-168 (1978b)
Neuts, M.F.: Matrix-Geometric Solutions in Stochastic Models. Baltimore: Johns
Hopkins University Press 1981
Neveu, J.: Une Généralisation des Processus à Accroissements Positifs Indépendants.
Abhandlungen aus dem Mathematischen Seminar der Universität Hamburg 25,
36-61 (1961)
Özekici, S.: Optimal Control of Storage Models with Markov Additive Inputs. Ph.D.
Dissertation, Northwestern University (1979)
Özekici, S.: Optimal Maintenance Policies in Random Environments. European
Journal of Operational Research 82, 283-294 (1995)
Özekici, S.: Optimal Replacement of Complex Devices. In this volume (1996a), pp.
158-169
Özekici, S.: Markov Modulated Bernoulli Process. Technical Report. Department
of Industrial Engineering, Boğaziçi University (1996b)
Özekici, S., Parlar, M.: Periodic-Review Inventory Models in Random Environments.
Technical Report. School of Business, McMaster University (1995)
Özekici, S., Sevilir, M.: Maintenance of a Device with Environment Dependent
Survival Probabilities. Technical Report. Department of Industrial Engineering,
Boğaziçi University (1996)
Parlar, M.: Continuous-Review Inventory Problem Where Supply Interruptions Fol-
low a Semi-Markov Process. Technical Report. School of Business (1993)
Parlar, M., Berkin, D.: Future Supply Uncertainty in EOQ Models. Naval Research
Logistics 38, 107-121 (1991)
Prabhu, N.U., Zhu, Y.: Markov-Modulated Queueing Systems. Queueing Systems
5, 215-246 (1989)
Purdue, P.: The M/M/1 Queue in a Markovian Environment. Operations Research
22, 562-569 (1974)
Sethi, S.P., Cheng, F.: Optimality of (s, S) Policies in Inventory Models with Marko-
vian Demand Processes. Technical Report. Faculty of Management, University
of Toronto (1993)
Shaked, M., Shanthikumar, J.G.: Some Replacement Policies in a Random Envi-
ronment. Probability in the Engineering and Informational Sciences 3, 117-134
(1989)
Silver, E.A.: Operations Research in Inventory Management: A Review and Cri-
tique. Operations Research 29, 628-645 (1981)
Singpurwalla, N.D., Youngren, M.A.: Multivariate Distributions Induced by Dy-
namic Environments. Scandinavian Journal of Statistics 20, 251-261 (1993)
Song, J.S., Zipkin, P.: Inventory Control in a Fluctuating Demand Environment.
Operations Research 41, 351-370 (1993)
Zheng, Y.S.: A Simple Proof of Optimality of (s, S) Policies in Infinite-Horizon
Inventory Systems. Journal of Applied Probability 28, 802-810 (1991)
Optimal Replacement of Complex Devices
Süleyman Özekici

Department of Industrial Engineering, Boğaziçi University, 80815 Bebek-İstanbul,
Turkey

Summary. Decision problems on complex systems usually require complex formu-


lations due to the large number of subsystems or components involved. In general,
optimal policies may have a rather complicated structure even when there are no
random environmental fluctuations in the model parameters. However, these poli-
cies and the solution procedures do not become any more complicated when the
system operates in a random environment under fairly reasonable assumptions. We
demonstrate this conjecture by considering the optimal replacement problem of a
simple device and a complex device with dependent components.

Keywords. Complex systems, random environment, dependent components, opti-


mal replacement, dynamic programming

1. Introduction
Optimization problems involving complex systems are quite challenging due
to the multidimensionality created by the large number of components or
subsystems that make up the whole system. These problems are further com-
plicated by the fact that, in many cases, the components or subsystems are
stochastically and economically dependent. We suppose that dependence is
induced by a randomly changing environment that all components or subsys-
tems operate in.
The formulation of the optimization problem, the characterization of op-
timal policies and the solution procedure is undoubtedly more complicated.
It is well-known that the structure of optimal policies may be quite complex
in multicomponent systems even when there are no environmental fluctua-
tions. However, it is surprising that, under fairly reasonable conditions, the
environmental process does not increase the complexity of the structure of
optimal policies or the solution procedures. We will demonstrate this con-
jecture on the optimal component replacement problem and show that the
random environment does not actually create optimal policies which are far
more complex than those obtained in the standard single-environment mod-
els. This chapter builds on the intrinsic aging model described in Section 4 of
Özekici (1996) in this volume; our notation and terminology follow those
introduced there.
Preventive replacement is perhaps the most widely used maintenance pol-
icy to prevent the device from failure during operation, thus incurring exces-
sive failure costs. The fixed environment case with several cost structures and
objectives is discussed extensively in the literature by many authors and, in

most cases, an age replacement policy is optimal if the life distribution is IFR.
Özekici (1985) provides an example along this direction, and a discussion on
the optimality conditions for control-limit policies can be found in So (1992).
Throughout the remainder of this chapter, we make a similar assumption by
requiring that the failure rate functions $\{r_i^k(\cdot);\ i \in E\}$ are all increasing. Note
that this assumption implies that $a_i^k = R_i^k(+\infty) = +\infty$ for all $i \in E$ except
for idle environmental states with $a_i^k = 0$ and $r_i^k = 0$ identically. Therefore,
$H_i^k(x, t) = R_i^k(\hat{R}_i^k(x_k) + t)$ for all $x \in \mathbb{R}_+^m$, $t \geq 0$ and $i \in E$ except for the
idleness case where $H_i^k(x, t) = x_k$.

2. Single-Component Model

We first consider a device consisting of a single component and present the
formulation and a summary of the main results for the optimal replacement
problem discussed in Ozekici (1995). The device is inspected at the beginning
of each environment and a decision is made whether or not to replace the
device by an identical brand new one. If an operational device at some age x is
replaced in environment i, then a preventive cost p_i is incurred and a new
device at age 0 is installed immediately. Otherwise, the device at age x is
used throughout the prevailing environment until the next decision epoch
when the environment jumps to a different state, and this decision process is
repeated at the beginning of each environment. In case the device is found to
be in a failed state at an inspection, a failure cost is incurred and it is replaced
immediately by a brand new one of age 0. The cost of failure replacement in
state i is f_i, where we suppose f_i ≥ p_i > 0 for all i ∈ E, i.e., the preventive
replacement cost is always less than or equal to the failure replacement cost.
In addition, if the device remains failed in state i for t units of time before
it is replaced at the next decision epoch, then an additional cost d_i(t) ≥ 0
is incurred. This accounts for additional costs due to downtime production
losses or other lost opportunities caused by the failed device. The opportunity
cost d_i(t) is increasing in t for all i ∈ E. In many cases, it may suffice to take
d_i = 0, p_i = c_p, and f_i = c_f for all i ∈ E with c_f ≥ c_p > 0, as is often
done in the literature. Finally, all costs are discounted at some rate α > 0.
For a technical reason which will become clear shortly, we further assume that
sup_{i∈E} E[e^{−αT₁} | X₀ = i] < 1.
The purpose is to find the replacement policy which minimizes the ex-
pected total discounted cost. Setting B_n = A_{T_n} as the age of the device at
the beginning of the n-th environment, it follows that (X, B) is a Markov
chain. Markov decision theory can be applied in a straightforward manner to
obtain characterizations of the optimal solution. Defining v_i(x) to denote the
minimum expected total discounted cost if the initial environment is i and
the device is at age x, then v satisfies the dynamic programming equation
(DPE)
160 Süleyman Özekici

v_i(x) = min{p_i + Γv(i, 0), Γv(i, x)}   (2.1)


for i ∈ E, x ≥ 0, where the operator Γ is defined by

Γg(i, x) = Σ_{j∈E} ∫₀^∞ Q(i, j, dt) e^{−αt} { e^{−(H_i(x,t)−x)} g_j(H_i(x, t))
    + ∫₀^{H_i(x,t)−x} du e^{−u} [f_j + g_j(0) + d_i(t − τ_i(x, u))] }   (2.2)

for any function g in the set B of all bounded nonnegative real-valued func-
tions defined on E × [0, +∞). Note that our notation implies Γg(i, x) =
(Γg)_i(x). The DPE (2.1) follows by observing that if preventive replacement
is made at state (i, x), then a cost p_i is incurred, the state is immediately
transformed to (i, 0), and the minimum expected total discounted cost using
the optimal policy from the decision epoch onward is Γv(i, 0). Similarly, if
no preventive replacement is made, then no immediate cost is incurred and
the minimum expected total discounted cost using the optimal policy from
the next decision epoch onward is Γv(i, x).
Expression (2.2) is obtained simply by conditioning on (X₁, T₁) and using
the fact that the remaining lifetime has an exponential distribution with
parameter 1. If no replacement is made at age x in environment i, then the
environment jumps to j in the vicinity of time t with probability Q(i, j, dt),
during which the device does not fail with probability exp(−(H_i(x, t) − x)) and
the new age is H_i(x, t) at the beginning of the new environment j. However,
the device may also fail with probability exp(−u) after u intrinsic time units
during [0, H_i(x, t) − x], incurring failure cost f_j and downtime cost d_i(t −
τ_i(x, u)), since τ_i(x, u) is the amount of calendar time required to age the
device by u intrinsic time units as defined by (4.18) in Ozekici (1996). Thus,
t − τ_i(x, u) is the duration of downtime at time t when the environment
changes and the replacement of the failed device is made in environment
j, with v_j(0) denoting the expected total discounted cost using the optimal
policy from then onward.
Recall that the decision maker is allowed to make an inspection and re-
placement only when the environment changes. Of course, this is quite re-
strictive since it eliminates the option of replacing the device when there is
no change in the environment. The reason behind this restriction is the ne-
cessity to use the Markov property at the times of environment changes in
writing the DPE (2.1). If the environmental process Y is a Markov process,
so that Q(i, j, t) = P(i, j)(1 − e^{−λ_i t}) and the sojourn in any environment
i ∈ E is exponentially distributed with some rate λ_i, then it is clear that the
decision problem is still represented by the DPE (2.1) irrespective of when
replacements can be done. The decision maker can replace the device at any
time during a given environment, and the memorylessness property of the
exponential distribution leads to the same DPE.
Theorem 2.1. There is a unique function v in B which satisfies the DPE
(2.1). Moreover, v_i(x) is increasing in x for all i ∈ E.

This result follows by noting that Γ is a contraction mapping, so that
v = Γv has a unique solution on the Banach space B with the usual supremum
norm. A detailed proof can be found in Ozekici (1988). The DPE (2.1) implies
that the optimal decision at state (i, x) is the one that yields the minimum of
the two expressions p_i + Γv(i, 0) and Γv(i, x). Since p_i + Γv(i, 0) is constant
and Γv(i, x) is increasing in x, it must be true that Γv(i, x) ≥ p_i + Γv(i, 0) for
all x ≥ N_i and Γv(i, x) < p_i + Γv(i, 0) for all x < N_i for some N_i ≥ 0, which
is the optimal replacement age in environment i. This proves the following
characterization result, which states that the optimal replacement policy is a
simple control-limit type age replacement policy.
Corollary 2.1. The optimal replacement policy is such that the device is
replaced in environment i at age x if and only if x ≥ N_i, where the optimal
replacement age N_i is given by

N_i = inf{y ≥ 0 ; Γv(i, y) ≥ p_i + Γv(i, 0)}   (2.3)

for any i ∈ E.

If y_i(x) represents the optimal decision taken in environment i with the
device at age x, then it has the form

y_i(x) = 0 (do not replace) if x < N_i, and y_i(x) = 1 (replace) if x ≥ N_i.   (2.4)

Although the stochastic structure of the environmental process is quite
general with minimal restrictions on its probability law, the optimal replace-
ment policy is astonishingly simple. The computational problem is reduced
to one where only the critical replacement ages {N_i ; i ∈ E} have to be iden-
tified. The characterization provided in Corollary 2.1 not only provides
computational simplification, but it is also helpful in real-life applications due
to the inherent practicality of the optimal policy.
The computation of the optimal solution is not straightforward, in par-
ticular for complex models with many components. The reader is referred
to Puterman (1990) or Bertsekas (1987) for solution techniques and
computational issues regarding probabilistic dynamic programming. In many
cases, it may be more convenient and efficient to develop a special algorithm
which exploits the specific structure of the optimal replacement or repair pol-
icy, especially if it is of a simple control-limit type described by a few critical
numbers. Examples of such algorithms are given in Tijms and Van der Duyn
Schouten (1985), and Van der Duyn Schouten and Vanneste (1990).
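To make the computation concrete, here is a minimal value-iteration sketch for a discrete-time caricature of the DPE (2.1). Everything numeric in it (the age grid, the environment transition matrix, the failure probabilities and the costs) is an illustrative assumption, not data from the chapter; the point is only that iterating the contraction recovers the control-limit structure of Corollary 2.1.

```python
import numpy as np

nx = 60                                   # age grid 0, 1, ..., nx-1
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])                # environment transition matrix (assumed)
beta = 0.95                               # per-period discount factor
p = np.array([1.0, 1.5])                  # preventive replacement costs p_i
f = np.array([5.0, 8.0])                  # failure replacement costs f_i >= p_i

# one-period failure probability, increasing in age (IFR in every environment)
q = np.array([[1 - np.exp(-(0.01 + 0.01 * x) * (i + 1)) for x in range(nx)]
              for i in range(2)])

def G(V):
    """Operate one period from (i, x), then continue optimally (a discrete Gamma)."""
    out = np.empty((2, nx))
    for i in range(2):
        EV0 = P[i] @ V[:, 0]              # continuation value right after a renewal
        for x in range(nx):
            nxt = min(x + 1, nx - 1)
            out[i, x] = q[i, x] * (f[i] + beta * EV0) \
                      + (1 - q[i, x]) * beta * (P[i] @ V[:, nxt])
    return out

V = np.zeros((2, nx))
for _ in range(3000):                     # value iteration: V <- min{p + GV(.,0), GV}
    GV = G(V)
    newV = np.minimum(p[:, None] + GV[:, [0]], GV)
    if np.max(np.abs(newV - V)) < 1e-10:
        V = newV
        break
    V = newV

GV = G(V)
replace = GV >= p[:, None] + GV[:, [0]]   # replace iff Gv(i,x) >= p_i + Gv(i,0)
N = [int(np.argmax(replace[i])) if replace[i].any() else None for i in range(2)]
print("critical replacement ages N_i:", N)
```

As Theorem 2.1 and Corollary 2.1 predict, v_i(x) comes out increasing in x and the replacement region is an interval [N_i, ∞) on the grid, so the computed policy is fully described by the two critical ages.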

3. Complex Model with Many Components

We now consider a series system with m components that operates under the
randomly changing environmental process Y. All components age intrinsically

as described in Section 4 of Ozekici (1996) in this volume. Recall that for any
component k, r_i^k(·) is the failure rate function and R_i^k(·) is the hazard function in
environment i (i.e., R_i^k(t) = ∫₀^t r_i^k(s) ds). Similarly, H_i^k(x_k, t) = R_i^k(n_i^k(x_k) +
t) is the intrinsic age of component k at time t in environment i if the initial
age is x_k, and τ_i^k(x_k, u) = n_i^k(x_k + u) − n_i^k(x_k) is the amount of real-time
operation required in environment i to age the k-th component intrinsically
by u time units, given that its initial age is x_k.
The lifetime of component k is denoted by L^k, and L = min_k L^k is the life-
time of the system. The age process of the system is A = (A¹, A², …, A^m),
where A_t^k is the intrinsic age of component k at time t. The construc-
tion of A^k is described in detail by equations (4.15) and (4.16) in Ozekici
(1996) in this volume. It is clear that the process A takes values in the
state space S = [0, +∞]^m, where +∞ denotes a failed component. For any
x = (x₁, x₂, …, x_m) ∈ S, x_k ∈ [0, +∞] is the intrinsic age of component
k. We also define H_i(x, t) = (H_i¹(x₁, t), H_i²(x₂, t), …, H_i^m(x_m, t)) ∈ S to be
the intrinsic age of the system at time t in environment i if the initial age is
x ∈ S.

3.1 System Reliability

The system reliability can be characterized by a Markov renewal equation.


Since (X, T) is a Markov renewal process on E × [0, +∞], ((X, A), T) is also
a Markov renewal process on E × S. Setting

f((i, x), t) = P[L > t | X₀ = i, A₀ = x]

for i ∈ E, x ∈ S, and t ≥ 0, it can be shown by a renewal-theoretic argument
that f satisfies the Markov renewal equation

f = g + Q̂ ∗ f   (3.1)

where

g((i, x), t) = [1 − Σ_{j∈E} Q(i, j, t)] exp(−Σ_{k=1}^m (H_i^k(x_k, t) − x_k))

and Q̂ is the semi-Markov kernel

Q̂((i, x), (j, dy), dt) = Q(i, j, dt) 1_{dy}(H_i(x, t)) exp(−Σ_{k=1}^m (H_i^k(x_k, t) − x_k)),

where 1_D(u) is the indicator function which is equal to 1 if and only if u ∈ D.
It follows from (3.1) that

f = R̂ ∗ g

where R̂ = Σ_{n=0}^∞ Q̂^{∗n} is the Markov renewal kernel corresponding to Q̂.
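For intuition, the survival probability f((i, x), t) can also be estimated by simulation rather than by solving the Markov renewal equation. The sketch below is a toy instance under assumed numbers: two environments with exponential sojourns whose jump chain simply alternates, two components with constant environment-dependent aging rates (so intrinsic age grows linearly within a sojourn), and independent Exp(1) intrinsic-lifetime thresholds as in the aging model.

```python
import random

random.seed(1)
lam = [0.5, 1.0]                  # sojourn rate of environment i (assumed)
r = [[0.1, 0.3],                  # r[i][k]: aging rate of component k in env i
     [0.4, 0.2]]                  # constant rates => intrinsic age grows linearly

def survives(horizon):
    """One sample path of the series system; True if L > horizon."""
    thresh = [random.expovariate(1.0) for _ in range(2)]   # Exp(1) thresholds
    age = [0.0, 0.0]                                       # intrinsic ages A^k
    env, t = 0, 0.0
    while t < horizon:
        dt = min(random.expovariate(lam[env]), horizon - t)
        for k in range(2):
            # component k hits its threshold during this sojourn => system fails
            if age[k] + r[env][k] * dt > thresh[k]:
                return False
            age[k] += r[env][k] * dt
        t += dt
        env = 1 - env             # two environments: the jump chain alternates
    return True

n = 20000
est = sum(survives(2.0) for _ in range(n)) / n
print("estimated P[L > 2 | env 0, new system] ~", round(est, 3))
```

Since each component's total intrinsic hazard over [0, 2] here lies between fixed bounds (total exposure between 0.6 and 1.4 across both components), the estimate must land between e^{−1.4} ≈ 0.25 and e^{−0.6} ≈ 0.55, which is a useful sanity check on the simulation.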

3.2 Optimal Replacement Problem

At the beginning of each environment, the age of the system is observed and a
decision is made whether or not to replace each component k. This is represented
by the binary variable s^k, which is 1 only if component k is replaced. Therefore,
the decisions on all the components are represented by s = (s¹, s², …, s^m) ∈
J = {0, 1}^m. If the system is observed to be in state x at the beginning of
environment i, then a state-dependent cost c_i(x) is incurred. For any decision
s, the cost of replacement is p_i(s), which gives the cost of replacing the
components {1 ≤ k ≤ m ; s^k = 1}. Finally, the downtime cost is d_i(t) if the
system is down for t units of time in environment i.

Assumption 3.1. For the optimal replacement problem, the following con-
ditions hold for all i ∈ E and all components k:
a. r_i^k(t) is increasing in t,
b. c_i(x) is increasing in x,
c. d_i(t) is increasing in t,
d. p_i(s) is increasing in s,
e. r, s ∈ J with r·s = 0 ⟹ p_i(r + s) ≤ p_i(r) + p_i(s).
These assumptions are quite reasonable and they do not impose unneces-
sary restrictions on our problem. The first one requires that all life distribu-
tions are IFR in all of the environments, an assumption that is often made
in optimal replacement problems. The second one simply states that it costs
more as the system gets "older". In particular, if there are only failure costs
involved, so that the failure cost of component k in environment i is c_i^k, then
it suffices to take

c_i(x) = Σ_{k=1}^m c_i^k 1_{{+∞}}(x_k).   (3.2)

The third assumption states that the downtime cost increases as the system
stays down for a longer duration of time. According to the fourth one, the
replacement cost increases as more components are replaced. Finally, the
last assumption reflects the economies of scale involved in replacing many
components at the same time. This is an important assumption which also
makes the components economically dependent. For example, the cost of
replacing components, say, 1 and 2 at the same time is less than or equal
to that of replacing them separately at different times. This is often true due
to possible set-up costs involved in making replacements. If the preventive
replacement cost is p_i^k for component k in environment i and there is a fixed
replacement cost K_i, then the replacement cost function

p_i(s) = K_i 1_{{s ≠ 0}} + Σ_{k=1}^m p_i^k s^k   (3.3)

satisfies this assumption.
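As a quick sanity check, a fixed-cost-plus-linear replacement cost of the kind just described — a set-up cost K_i charged once whenever any replacement is made, plus a per-component cost p_i^k — can be verified to satisfy Assumption 3.1(d) and (e) by enumerating all decision vectors. The numerical cost values below are made-up illustrations.

```python
from itertools import product

K = 4.0                           # fixed set-up cost K_i (assumed)
pk = [1.0, 2.5, 0.7]              # per-component preventive costs p_i^k (assumed)
m = len(pk)

def cost(s):
    """p_i(s) = K * 1{s != 0} + sum_k p_i^k s^k"""
    return (K if any(s) else 0.0) + sum(c * b for c, b in zip(pk, s))

J = list(product((0, 1), repeat=m))
# (d): p_i(s) is increasing in s
increasing = all(cost(s) <= cost(t) for s in J for t in J
                 if all(a <= b for a, b in zip(s, t)))
# (e): r.s = 0  =>  p_i(r + s) <= p_i(r) + p_i(s)
subadditive = all(cost(tuple(a + b for a, b in zip(r, s))) <= cost(r) + cost(s)
                  for r in J for s in J
                  if all(a * b == 0 for a, b in zip(r, s)))
print("increasing in s:", increasing, "| subadditive:", subadditive)
```

The set-up cost is the entire source of the economies of scale here: replacing two disjoint groups together pays K once instead of twice.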

3.3 Dynamic Programming Equation


Define v_i(x) to be the optimal expected total discounted cost using the opti-
mal policy if the system is at age x initially and the environment is i, where
there is continuous discounting at rate α > 0. By conditioning on the time
of the first change of state, we obtain the dynamic programming equation

v_i(x) = min_{s∈J} { c_i(x) + p_i(s) + Γ_α v(i, x(1 − s)) }   (3.4)

for i ∈ E, x ∈ S, where x(1 − s) denotes the age vector whose k-th entry is 0
if s^k = 1 and x_k otherwise, and where the operator Γ_α is


Γ_α g(i, x) = Σ_{j∈E} ∫₀^∞ Q(i, j, dt) e^{−αt} { exp(−Σ_{k=1}^m (H_i^k(x_k, t) − x_k)) g_j(H_i(x, t))

    + Σ_{k=1}^m ∫₀^{H_i^k(x_k,t)−x_k} du e^{−u} exp(−Σ_{n≠k} [H_i^n(x_n, τ_i^k(x_k, u)) − x_n])
      · [g_j(ℋ_i^k(x, u)) + d_i(t − τ_i^k(x_k, u))] }   (3.5)

for i ∈ E, x ∈ [0, +∞)^m, and

Γ_α g(i, x) = Σ_{j∈E} ∫₀^∞ Q(i, j, dt) e^{−αt} { g_j(x) + d_i(t) }   (3.6)

for i ∈ E and x ∈ S with x_k = +∞ for some 1 ≤ k ≤ m. In (3.5),

ℋ_i^k(x, u) = (H_i¹(x₁, τ_i^k(x_k, u)), …, H_i^m(x_m, τ_i^k(x_k, u)))   (3.7)

is the age of the system in environment i when the k-th component fails after
aging intrinsically for u time units, given that the initial system age was x.
The explanation for (3.4) and (3.5) is similar to that of (2.1) and (2.2), and it
will not be repeated here. The difference is due only to multidimensionality.

3.4 Characterizations of the Optimal Replacement Policy

The dynamic programming equation (3.4) has a rather complicated structure,
but it can be analyzed along the lines of Ozekici (1988), where there are no
environmental fluctuations. Using contraction mappings in the usual way, it
follows that (3.4) has a unique solution. Moreover, the characterizations of
the optimal replacement policies for multicomponent systems operating in
a constant environment are still true for our complex system in a random
environment. However, these characterizations are now dependent on the
state of the environment.
For any age x and environment i, let s_i(x) = (s_i¹(x), s_i²(x), …, s_i^m(x)) ∈ J
be the optimal decision, i.e., the minimizer of the right-hand side of (3.4).
The set of components that should be replaced is then C_i(x) = {1 ≤ k ≤
m ; s_i^k(x) = 1}.

Theorem 3.1. The following characterizations hold in all environments i ∈
E:
a. v_i(x) is increasing in x,
b. s_i(x(1 − s_i(x))) = 0 for all x ∈ S,
c. s_i(y) = s_i(x) if y_k ≥ x_k for k ∈ C_i(x) and y_k = x_k for k ∉ C_i(x),
d. s_i^k(x) is increasing in x for all k; in particular, s_i^k(x) = 1 ⟹ s_i^k(y) = 1
for all y ≥ x.
The reader is referred to the proofs of Theorems 4 and 5, and Corollary 1,
in Ozekici (1988) for the constant-environment setting of a similar reliability
model; they also apply to our case with some adjustments. The characteriza-
tions provided by Theorem 3.1 are best explained by a graphical illustra-
tion that shows the ages for all possible combinations of the components that
should be replaced. For simplicity, suppose that there are only two components,
so that the optimal policy can be identified by the ages at which only
component 1 is replaced (1), only component 2 is replaced (2), both
components are replaced (1, 2), or no component is replaced (0). Figure 3.1,
which appeared in Ozekici (1988), depicts a typical optimal policy.


Fig. 3.1. Optimal policy for the multicomponent model

It is clear that the structure of the optimal policy is, in general, nontrivial
and it cannot be identified by a few critical numbers. This is due
to two reasons: economic and stochastic dependence among the components.
Some rather counterintuitive observations can be made on Figure 3.1. For
example, at point A no component should be replaced, but at point B, where
there is substantial aging on component 2, both 1 and 2 should be replaced

even though component 1 is at the same age as at point A. This is called "op-
portunistic replacement" since 1 is replaced by making use of the opportunity
created in replacing component 2. Since an "old" component 2 is being replaced opti-
mally, it may be best to replace component 1 at the same time. Opportunistic
replacement is due to the fact that n₁ ≤ N₁ and n₂ ≤ N₂. This can be fur-
ther clarified by considering the case where the components are stochastically
independent while 1 has an exponentially distributed lifetime and 2 has an
IFR life distribution. If the components are economically dependent, Radner
and Jorgenson (1963) showed that the optimal policy has the simple form
given in Figure 3.2 for a fixed-environment model. Note that component 1 is
replaced only at failure (F) since it has an exponential lifetime. Component
2 is optimally replaced at the critical age N₂, but if component 1 has failed
and must be replaced, then 2 can be replaced opportunistically as early as
age n₂ ≤ N₂.

Fig. 3.2. Optimal policy when one component has exponential lifetime

Another interesting observation follows from the comparison of points C
and D in Figure 3.1. Note that component 1 is replaced at C but not at the
"older" system age D. The system is not interfered with at the "worse" state
D, while a replacement decision is taken at C. This is due to "opportunistic
nonreplacement" since it may be best not to do any replacement at D and
wait a little longer to replace both 1 and 2 at the same time.
The complexity in the structure of the optimal replacement policy creates
computational as well as practical difficulties, since the policy is not identified by a few

critical numbers. Even if the optimal policy is determined by any one of the
procedures, its implementation requires extreme precision. For these reasons,
one may have to approximate the optimal policy by one that has a much
simpler structure. Sasieni (1956) uses the (n, N) policy for a device in the tire
industry with two identical components that produce two tires at the same
time. Any one of the components is replaced whenever it fails or produces
N tires, and if this happens, the other one is also replaced provided that
it has produced at least n tires. A general opportunistic replacement model
with two stochastically independent IFR components with no breakdown or
failure costs is considered in Bouzitat (1962). However, the replacement costs
are such that it costs less to replace both components at the same time
than to replace them separately. A replacement decision is not made while both
components are functioning since there is no failure cost, but as soon as a
component fails, the other one is replaced opportunistically if its age exceeds
a critical number. Kumar (1968) considers suboptimal policies of the "replace
nothing" or "replace all" type, but these could lead to solutions that are far
from optimal. L'Ecuyer and Haurie (1983) propose a policy that replaces
all failed components and those whose ages exceed critical thresholds. Van
der Duyn Schouten and Vanneste (1990) analyzed the (n, N) policy given in
Figure 3.3.

Fig. 3.3. (n, N) policy for the multicomponent model

Note that the (n, N) policy provides substantial simplification when com-
pared with the optimal policy in Figure 3.1. The fact that n₁ ≤ N₁ and
n₂ ≤ N₂ shows that, in essence, these policies reflect the existence of oppor-
tunistic replacement. However, this does not allow for "opportunistic nonre-
placement" since the number of components replaced in an "older" system is
at least as large as that in a "younger" system. If this phenomenon is signif-
icant, then it may be better to use the modified (n, N, M) policy in Figure
3.4. In this case, the fact that N₁ ≤ M₁ and N₂ ≤ M₂ shows the existence of
"opportunistic nonreplacement".
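A hypothetical encoding of the (n, N) rule makes its opportunistic structure explicit. The function below is a sketch based on the verbal description only (the exact regions in Figures 3.3 and 3.4 may differ), and the limits n and N are made-up numbers with n_k ≤ N_k.

```python
INF = float("inf")   # a failed component has age +infinity

def nN_policy(x, n, N):
    """(n, N) rule sketched from the text: component k is replaced when it
    has failed or reached its own limit N[k]; if some component is being
    replaced, any other whose age has reached the opportunistic limit n[k]
    (n[k] <= N[k]) is replaced along with it."""
    s = [1 if x[k] >= N[k] else 0 for k in range(len(x))]
    if any(s):  # an opportunity: piggyback components older than n[k]
        s = [1 if (s[k] or x[k] >= n[k]) else 0 for k in range(len(x))]
    return tuple(s)

# two components, n = (2, 3), N = (5, 6)
print(nN_policy((1.0, 1.0), (2, 3), (5, 6)))   # -> (0, 0): no replacement
print(nN_policy((5.5, 1.0), (2, 3), (5, 6)))   # -> (1, 0)
print(nN_policy((5.5, 3.5), (2, 3), (5, 6)))   # -> (1, 1): opportunistic
print(nN_policy((INF, 2.0), (2, 3), (5, 6)))   # -> (1, 0): failure replacement
```

The (n, N, M) modification of Figure 3.4 adds a third set of limits with N_k ≤ M_k to allow the "opportunistic nonreplacement" discussed above; its regions are not reproduced here.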


Fig. 3.4. (n, N, M) policy for the multicomponent model

References

Bertsekas, D.P.: Dynamic Programming, Deterministic and Stochastic Models. En-


glewood Cliffs: Prentice-Hall 1987
Bouzitat, J.: Choix d'une Politique d'Exploitation dans un Ensemble Industriel
Complexe. Cah. Bur. Univ. Rech. Opnl. 4, 17-40 (1962)
Kumar, S.: Study of an Industrial Replacement Problem of High Dimension. Cah.
Cent. Etud. Rech. Opnl. 10, 35-45 (1968)
L'Ecuyer, P., Haurie, A.: Preventive Replacement for Multicomponent Systems: an
Opportunistic Discrete-Time Dynamic Programming Model. IEEE Transac-
tions on Reliability R-32, 117-118 (1983)
Ozekici, S.: Optimal Replacement of One-Unit Systems under Periodic Inspection.
SIAM Journal of Control and Optimization 23, 122-128 (1985)

Ozekici, S.: Optimal Periodic Replacement of Multicomponent Reliability Systems.


Operations Research 36, 542-552 (1988)
Ozekici, S.: Optimal Maintenance Policies in Random Environments. European
Journal of Operational Research 82, 283-294 (1995)
Ozekici, S.: Complex Systems in Random Environments. In this volume (1996), pp.
137-157
Puterman, M.L.: Markov Decision Processes. In: Heyman, D.P., Sobel, M.J. (eds.):
Handbooks in Operations Research and Management Science, Vol. 2. Amster-
dam: Elsevier 1990, pp. 331-434
Radner, R., Jorgenson, D.W.: Opportunistic Replacement of a Single Part in the
Presence of Several Monitored Parts. Management Science 10, 70-84 (1963)
Sasieni, M.W.: A Markov Chain Process in Industrial Replacement. Opnl. Res.
Quart. 18, 148-154 (1956)
So, K.C.: Optimality of Control Limit Policies in Replacement Models. Naval Re-
search Logistics 39, 685-697 (1992)
Tijms, H.C., Van Der Duyn Schouten, F.A.: A Markov Decision Algorithm for
Optimal Inspections and Revisions in a Maintenance System with Partial In-
formation. European Journal of Operational Research 21, 245-253 (1985)
Van der Duyn Schouten, F.A., Vanneste, S.G.: Analysis and Computation of (n, N)-
Strategies for Maintenance of a Two-Component System. European Journal of
Operational Research 48, 260-274 (1990)
A Framework for Single-Parameter
Maintenance Activities and its Use in
Optimisation, Priority Setting and Combining*
Rommert Dekker
Econometric Institute, Erasmus University Rotterdam, 3000 DR Rotterdam, The
Netherlands

Summary. In this paper we present an integration of optimisation, priority set-


ting and combining of maintenance activities. We use a framework which covers
several optimisation models, like the block replacement, a minimal repair and an
efficiency model, and develop a uniform analysis for all these models. From this
analysis penalty functions are derived which can act as priority criterion functions.
These penalty functions also serve as basic elements in a method to determine op-
timal combinations of activities and in maintenance planning.

Keywords. Maintenance, optimisation, framework, multi-components

1. Introduction

Every few years new surveys appear on maintenance optimisation, showing


that it is a lively field and that many interesting mathematical problems can
be found in the maintenance area (e.g., McCall 1965, Pierskalla and Voelker
1979, Sherif and Smith 1981, Valdez-Flores and Feldman 1989, Cho and Par-
lar 1991). Applications follow, but at a slow rate. These are stimulated by
the advent of decision support systems (d.s.s.) for maintenance optimisation.
One of the problems encountered in building a d.s.s. is which of the many
optimisation models to select for incorporation and how to assist a user in
choosing the right model.
Another problem encountered in practice is that many relationships exist
between components to be maintained. Modelling these relations directly
yields large models, which are difficult to analyse as they suffer from the
curse of dimensionality. A decomposition approach is then to be preferred. In
such an approach one applies simple models for individual components and
uses the outcomes as input in a comprehensive model. This requires a certain
structure of the underlying models. Little work is present in this respect.
Other problems are encountered in the implementation of maintenance
policies for individual components. Urgent corrective maintenance work usu-
ally sets preventive maintenance aside, and priorities have to be set. Further-
more, it can be profitable to combine maintenance activities, thereby saving
common preparation work. Finally, maintenance plans have to be made in
* This paper is to a large extent identical to Dekker (1995), which was published
in the European Journal of Operational Research

agreement with production plans, which may result in a restriction to certain


time windows where only a limited number of activities may be executed.
This paper tackles these problems in an integrated way. To facilitate the
selection of a model we present an overall framework for time-based preven-
tive maintenance and analyse it with respect to the questions when and
where there exists an average-cost minimum. The framework is based on the
marginal costs of deferring preventive maintenance. These costs may be di-
rectly estimated by a user or specified through a number of models, including
the block replacement model. The framework further allows an extension to
priority setting, combination and planning of maintenance (the last topic is
not addressed in this paper but in Dekker 1995). The framework is based
on experience with developing two decision support systems for maintenance
optimisation (see Dekker and Smeitink 1994) in which a number of these
problems had to be tackled.
Few papers attempt to unify maintenance optimisation models. Aven and
Bergman (1986) do, and in fact our framework is a simpler version of theirs.
Yet they only consider optimisation and do not make links with combination,
priority setting or finite horizon planning. The central notion in this paper,
the marginal cost rate of deferring maintenance, was first introduced by Berg
(1980). It was fruitfully applied in Berg and Cleroux (1982) and Berg (1995)
for repair-limit models and Dekker and Smeitink (1991) and Dekker and
Dijkstra (1992) for opportunity maintenance.
The structure of this paper is as follows. After introducing the frame-
work in Section 2 we provide a basic analysis and state which models can be
incorporated. Next we introduce penalty functions in Section 3. Combining
execution of maintenance activities is considered in Sections 4 and 5, priority
setting in Section 6, and a heuristic policy based on the marginal cost ideas
in Section 7.

2. A Framework for Single-Parameter Policies


2.1 Introduction

As there is quite a variety in maintenance activities there are many opti-


misation models. A method for planning, priority setting and combining of
preventive maintenance activities should therefore embrace as many types
of activities as possible. Priority setting refers to determining the order of
execution of activities which have to be carried out. Planning encompasses
the timing of activities in accordance with production and manpower require-
ments. Finally, combining refers to shifting originally planned execution times
to allow joint execution at possibly different moments. Notice that all these
processes share timing as dominant aspect and that they are not done once
but repeatedly. Being able to plan and shift execution times is in fact one of
the most important advantages of preventive maintenance over failure-based

maintenance. As a result of planning the work can be prepared beforehand


and necessary spare parts can be ordered. Shifting work in time also allows
a more evenly spread workload and thus a higher efficiency. In a method
for integrating optimisation with planning, priority setting and combining of
maintenance activities it seems necessary to restrict oneself to activities
whose execution can be planned in advance. We therefore restrict ourselves
in first instance to maintenance activities whose next execution moment is
determined from its last one, by a single parameter. To that end we formulate
a framework for optimisation models and derive the results necessary for the
existence and position of an average-cost minimum. It is a simpler version
of the framework from Aven and Bergman (1986) and of the marginal cost
analysis from Berg (1980, 1995). Yet our results are somewhat stronger. An
extension to other types of maintenance will be discussed in Section 2.5.

2.2 A Framework for Block Type Policies in Continuous Time

Here we present the general structure of the framework for a continuous time
setting; extensions follow later. Consider a component (for ease of terminology
we use this term, it may also be a part of a system) which deteriorates
in time and which can be returned to the as-good-as-new condition by a
preventive maintenance activity. The main question the framework focuses
at is when to execute the activity and the answer will be based on cost
considerations. We primarily consider long-term average costs as objective
criterion, as that best reflects what one should do on a long term. The central
notion in the framework is the so-called marginal expected cost of deferring
the execution of the activity for an infinitesimally small interval. We first
consider the case in which the activity can be carried out at any moment
against the same cost cp In this case it is natural to speak of the marginal
deterioration cost rate, denoted by mO, which is assumed to be a continuous
and piecewise differentiable function ofthe time t since the previous execution
of the activity. We will now show that these assumptions are sufficient to
determine an average optimal maintenance interval.
Let M(t) := ∫₀^t m(x) dx, i.e., the total expected cost due to deterioration
over an interval of t time units when the component was new at the start. It
easily follows from renewal theory that the average cost g(t) per time unit
when executing the activity every t time units amounts to

g(t) = (c_p + M(t)) / t.   (2.1)
Our first objective is to find the value of t which minimises g(t). To this end
we take the derivative and notice that g′(t) = [m(t) − g(t)]/t. Let Ψ(t) :=
t·m(t) − M(t) = ∫₀^t [m(t) − m(x)] dx. Hence

g′(t) = 0 ⟺ m(t) − g(t) = 0 ⟺ Ψ(t) = c_p.   (2.2)
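A tiny numerical check of (2.1)-(2.2), with an assumed marginal cost rate m(t) = 3t² (so M(t) = t³) and c_p = 5: minimise g(t) on a fine grid and verify the first-order condition m(t*) = g(t*), i.e. Ψ(t*) = c_p.

```python
cp = 5.0
m = lambda t: 3.0 * t ** 2        # increasing marginal deterioration cost rate
M = lambda t: t ** 3              # M(t) = integral of m over [0, t]
g = lambda t: (cp + M(t)) / t     # average cost per time unit, eq. (2.1)

ts = [i / 1000 for i in range(1, 10001)]     # grid on (0, 10]
t_star = min(ts, key=g)
print("t* ~", t_star, " g(t*) ~", round(g(t_star), 4), " m(t*) ~", round(m(t_star), 4))
```

On this example Ψ(t) = 3t³ − t³ = 2t³, so (2.2) gives t* = (c_p/2)^{1/3} ≈ 1.357 analytically, matching the grid search, and m(t*) coincides with the minimal average cost g(t*).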

Equation (2.2) is the key for the analysis of g(t). Notice that Ψ(t) is increas-
ing (decreasing) if m(t) is increasing (decreasing). The following theorem
summarises the relations between the behaviour of m(t) and the existence of
an average cost minimum. Part (iv) is a generalisation of results for exist-
ing models (see, e.g., Barlow and Proschan 1965 for the block replacement
model); the other parts are simple new results.
Theorem 2.1.
(i) if m(t) is decreasing or constant on [t₀, t₁] and m(t₀) < g(t₀), then g(t)
is also decreasing on [t₀, t₁],
(ii) if m₂(t) = m₁(t) + c for some c and all t > 0, then g₁(t) and g₂(t) have
the same extremes,
(iii) if m₀(t) is nonincreasing on (0, t₀) and increases thereafter, then g₀(t)
has the same minima as g₁(t), where m₁(t) = m₀(t₀) for t < t₀ and m₁(t) =
m₀(t) otherwise, and c_p¹ = c_p⁰ + ∫₀^{t₀} (m₀(t) − m₀(t₀)) dt,
(iv) if m(t) increases strictly for t > t₀, where m(t₀) < g(t₀), and if either
(a) lim_{t→∞} m(t) = ∞, or
(b) lim_{t→∞} m(t) = c and lim_{t→∞} [ct − M(t)] > c_p, for some c > 0,
then g(t) has a minimum, say g* at t*, which is unique on [t₀, ∞); moreover,

m(t) − g(t) { < 0 for t₀ < t < t*, = 0 for t = t*, > 0 for t > t* }   (2.3)

and

m(t) − g* { < 0 for t₀ < t < t*, = 0 for t = t*, > 0 for t > t* }   (2.4)

(v) if m(t) increases strictly for t > t₀, where m(t₀) < g(t₀), lim_{t→∞} m(t) =
c and lim_{t→∞} [ct − M(t)] < c_p, for some c > 0, then g(t) is decreasing for
t > t₀,
(vi) if m(t) is convex on [t₀, T], where m(t₀) < g(t₀) and ((T − t₀)/2)[m(T) −
m(t₀)] > c_p + ∫₀^{t₀} (m(t) − m(t₀)) dt, then g(t) has a minimum, say g* at t*,
which is unique on (t₀, T), and (2.3) and (2.4) hold on [t₀, T]. If t₀ = 0,
then it is sufficient that (T/2)[m(T) − m(0)] > c_p.

Proof. (i) Notice that m(t0) < g(t0) implies that Ψ(t0) < c^p. If m(t) is
decreasing or constant, then Ψ(t) is also decreasing or constant and the
result is immediate.
(ii) If m0(t) = m1(t) + c, then Ψ0(t) = Ψ1(t) and the result is immediate.
(iii) According to (i) neither g0(·) nor g1(·) has a minimum before t0. Notice
next that for t > t0 we have Ψ0(t) = c0^p ⇔ Ψ1(t) = c1^p, from which the
assertion follows.
174 Rommert Dekker

(iv) Notice that Ψ(t) - Ψ(t0) = ∫_{t0}^t [m(t) - m(x)] dx + t0[m(t) - m(t0)] >
(t1 - t0)[m(t) - m(t1)], t > t0, for some t1 ∈ (t0, t). Hence, Ψ(t) increases
strictly to infinity if m(t) does so. Since Ψ(t0) < c^p by (i), Ψ(t) passes
the level c^p only once for t > t0, which guarantees the uniqueness. If
lim_{t→∞} [ct - M(t)] = d for some d, then it easily follows that lim_{t→∞} g(t) = c.
Moreover, for t large enough, say t > t_ε, we have M(t) < ct - d + ε and
g(t) < c + [c^p - d + ε]/t for any ε > 0. Hence if c^p - d < 0, then g(t)
approaches c from below, implying that it must have a finite minimum.
The uniqueness of the minimum follows from the fact that (2.2) implies
that m(t) intersects g(t) in minima from below and in maxima from above.
As m(t) is strictly increasing on [t0, ∞), there can be no maxima in that
region.
(v) Notice that in this case g(t) approaches its limit c from above. If g(t)
had a minimum, it would also have to have a maximum. Since in each extreme
Ψ(t) = c^p, there is a contradiction: Ψ(t) is increasing because m(t) increases
and can therefore cross c^p only once for t > t0. Accordingly, g(t) is
decreasing for t > t0.
(vi) If m(t) is convex on [t0, T], then M(T) - M(t0) < (T - t0)[m(T) +
m(t0)]/2. Inserting this in Ψ(T) and using assertion (iii) shows after some
algebra that Ψ(T) > c^p, from which the results follow in the same way as
in part (iv). □
Remark A decreasing m(t) may be due to burn-in (or initial) failures. Part
(iii) of this theorem shows that we only need to estimate their contribution to
the total costs and that we can leave the burn-in failures out of the modelling
of m(t) provided that a compensation is made for them in c^p. In this way we
can take care of the bathtub curve.
Relation (2.4) can be interpreted in the following way (Berg 1980 was the
first to introduce it). Consider at time t the two options: (a) maintain
now or (b) defer maintenance for a time dt. For option (b) the expected costs
over [t, t + dt] amount to m(t)dt + c^p. For option (a) there are direct costs
c^p, and the renewal is dt time units before that of option (b). To compensate
for this time difference we associate costs g*dt with the interval, which gives
total expected costs of c^p + g*dt for option (a). Subtraction then yields
that maintaining is cost-effective if m(t) - g* > 0. The myopic stopping rule
"maintain if m(t) - g* ≥ 0" is therefore average optimal. Although a simple
enumeration to locate the average-cost minimum usually suffices in practice,
one can speed up calculations by using relations (2.3) and (2.4) and applying
a bisection procedure.
Special cases:
(i) if m(t) = a·t^{β-1}, a > 0, then Ψ(t) = a(1 - 1/β)t^β, which increases if
β > 1. In that case t* = [βc^p / (a(β - 1))]^{1/β}.
(ii) if m(t) = at + b, a, b > 0, then Ψ(t) = (1/2)at², and t* = √(2c^p/a).
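Special case (i) also gives a convenient numerical check: the closed-form t* should agree with a bisection on Ψ(t) = c^p, as suggested above. A minimal sketch, with illustrative values for a, β and c^p (not taken from the text):

```python
# Special case (i): m(t) = a*t**(beta - 1), so Psi(t) = a*(1 - 1/beta)*t**beta
# and the optimal interval solves Psi(t) = c_p (equation (2.2)).
# a, beta and c_p are illustrative values, not taken from the text.

a, beta, c_p = 2.0, 1.5, 10.0

def psi(t):
    return a * (1.0 - 1.0 / beta) * t**beta

# Closed form from special case (i):
t_star_closed = (beta * c_p / (a * (beta - 1.0))) ** (1.0 / beta)

# Bisection on the increasing function Psi(t) - c_p:
lo, hi = 1e-9, 1e6
while hi - lo > 1e-10:
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if psi(mid) < c_p else (lo, mid)
t_star_num = 0.5 * (lo + hi)

print(t_star_closed, t_star_num)
```

Both routes solve the same equation Ψ(t) = c^p, so the two values coincide up to the bisection tolerance.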
Equation (2.2) also allows us to do some sensitivity analysis. We have

Theorem 2.2.
(i) if m2(t) = λm1(t) with λ > 1 and c2^p = c1^p, then t2* < t1*,
(ii) if m2(t) - m1(t) increases in t and c2^p = c1^p, then t2* < t1*,
(iii) if c2^p > c1^p and m2(t) = m1(t), then t2* > t1*.

Proof. Notice that Ψ2'(t) - Ψ1'(t) = t[m2'(t) - m1'(t)]. For case (ii) we now
have that Ψ2(t) - Ψ1(t) increases in t and that Ψ2(t) reaches the level c1^p
earlier than Ψ1(t), from which the assertion follows. In case (i), Ψ2(t) - Ψ1(t)
is increasing if Ψ1(t) increases and the same argument holds. Assertion (iii)
is also a direct consequence of (2.2). □

2.3 Determination of the Deterioration Costs

In this section we present a number of models that can be captured by the
framework.

(i) The block replacement model


In this model a component is replaced upon failure and preventively after
a fixed interval of length t (see e.g. Barlow and Proschan 1965). The to-
tal deterioration costs over that interval, M(t), are made up of the failure
renewals, each against cost c^f. Let H(t) denote the expected number of fail-
ures in [0, t]; then we have M(t) = c^f H(t). It is well-known that for H(t) the
following asymptotic expansion holds: lim_{t→∞} [H(t) - t/μ] = (σ²/μ² - 1)/2.
Hence Theorem 2.1(iv) implies the existence of a minimum provided that
c^p < (c^f/2)(1 - σ²/μ²), which is exactly the condition derived in Barlow and
Proschan (1965).
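The block replacement model can be sketched numerically: for a discrete lifetime distribution the renewal function H(t) satisfies an exact recursion, after which g(t) = (c^p + c^f H(t))/t is minimised by enumeration. The lifetime distribution and costs below are illustrative, not from the text:

```python
# Block replacement (example (i)) with a discrete lifetime distribution,
# for which the renewal function H is exact:
#     H[n] = sum_k p[k] * (1 + H[n - k]).
# Average costs g(t) = (c_p + c_f*H(t))/t are minimised by enumeration.
# The lifetime distribution and costs are illustrative.

p = {3: 0.05, 4: 0.45, 5: 0.35, 6: 0.15}   # P(lifetime = k time units)
c_p, c_f = 1.0, 5.0
horizon = 60

H = [0.0] * (horizon + 1)
for n in range(1, horizon + 1):
    H[n] = sum(pk * (1.0 + H[n - k]) for k, pk in p.items() if k <= n)

g = {t: (c_p + c_f * H[t]) / t for t in range(1, horizon + 1)}
t_star = min(g, key=g.get)

# Existence condition from Theorem 2.1(iv): c_p < (c_f/2)*(1 - var/mu^2).
mu = sum(k * pk for k, pk in p.items())
var = sum(k * k * pk for k, pk in p.items()) - mu * mu
print(t_star, g[t_star], c_p < 0.5 * c_f * (1.0 - var / mu**2))
```

For these numbers the existence condition holds and the enumeration returns an interior minimum.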

(ii) Minimal repair model with block replacements


In this model failures of a system occur according to a nonhomogeneous
Poisson process with rate λ(t). Upon failure the system undergoes a minimal
repair which brings it back to an as-good-as-before condition. Next to that, the
system may be replaced, which has to be planned in advance and cannot be
combined with a failure repair. Let c^r, c^p denote the costs of a minimal repair
and a preventive replacement respectively; hence m(t) = c^r λ(t). An average-
cost minimum exists if either lim_{t→∞} λ(t) = ∞ or, in case lim_{t→∞} λ(t) = c for
some c > 0, then also c^r lim_{t→∞} [ct - Λ(t)] > c^p, where Λ(t) = ∫_0^t λ(s)ds.
Notice that if λ(t) follows a bathtub pattern, we may add by Theorem 2.1(iii)
the costs associated with initial failures to the preventive maintenance costs,
and consider for optimisation only the increasing part of λ(t). There is also a
more general version of the minimal repair model in which replacements may
be combined with failure repairs and in which the repair costs may vary. In
that case preventive maintenance is no longer plannable and the framework
of Section 2.5 has to be used.
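The bathtub argument of Theorem 2.1(iii) can be checked numerically. Below, a hypothetical bathtub-like intensity (an assumption for illustration) is flattened before its minimum t0 and the burn-in cost is added to c^p; the location of the average-cost minimum is unchanged:

```python
import math

# Minimal repair with a bathtub-like intensity (illustrative):
#     lambda(t) = 5*exp(-5t) + t,   m(t) = c_r*lambda(t).
# Theorem 2.1(iii): flatten the decreasing burn-in part of m on (0, t0) and
# add the burn-in cost to c_p; the location of the minimum is unchanged.

c_p, c_r = 2.0, 1.0
m = lambda t: c_r * (5.0 * math.exp(-5.0 * t) + t)
M = lambda t: c_r * (1.0 - math.exp(-5.0 * t) + 0.5 * t * t)  # = int_0^t m

t0 = math.log(25.0) / 5.0                 # argmin of m: burn-in ends here
c_p1 = c_p + (M(t0) - t0 * m(t0))         # c_p plus burn-in contribution
M1 = lambda t: t0 * m(t0) + M(t) - M(t0) if t > t0 else t * m(t0)

grid = [i / 1000.0 for i in range(1, 10001)]
t_star = min(grid, key=lambda t: (c_p + M(t)) / t)        # original model
t_star1 = min(grid, key=lambda t: (c_p1 + M1(t)) / t)     # flattened model
print(t_star, t_star1)
```

The two grid searches return the same interval, as the theorem predicts.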

(iii) A standard inspection model


In this inspection model a component is inspected every t time units against
costs c^p, which reveals whether the component is functioning. We assume that
inspection is always accompanied by corrective actions which bring the com-
ponent back to an as-good-as-new condition (e.g. calibration of instruments)
and whose costs can be neglected compared to c^p. A failure of the compo-
nent can only be detected by inspection. Let F(·) denote the c.d.f. of the
time to failure X and let c^u be the costs associated with a non-functioning
component per time unit. In this case the deterioration costs M(t) con-
sist of the expected costs due to unavailability of the component over [0, t].
Hence M(t) = c^u ∫_0^t F(x)dx, and m(t) = c^u F(t). It is easy to show that we
have lim_{t→∞} [M(t) - c^u·t] = -c^u·EX, where EX denotes the expected life-
time. Hence by Theorem 2.1 there exists a unique minimum provided that
c^p < c^u·EX (the unavailability costs during a lifetime of a component are
more than the inspection and repair costs).
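For an exponential time to failure the optimal inspection interval can be sketched as follows; Ψ(t) has a closed form and the root of Ψ(t) = c^p is found by bisection (all parameter values are illustrative):

```python
import math

# Inspection model (example (iii)) with exponential time to failure,
# F(t) = 1 - exp(-t/mu), so m(t) = c_u*F(t) and
# Psi(t) = c_u*(mu*(1 - exp(-t/mu)) - t*exp(-t/mu)), which is increasing.
# Parameter values are illustrative.

mu, c_u, c_p = 10.0, 1.0, 3.0

def psi(t):
    return c_u * (mu * (1.0 - math.exp(-t / mu)) - t * math.exp(-t / mu))

lo, hi = 1e-9, 1e3
while hi - lo > 1e-10:                    # bisection for Psi(t) = c_p
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if psi(mid) < c_p else (lo, mid)
t_star = 0.5 * (lo + hi)

# Cross-check by direct minimisation of g(t) = (c_p + M(t))/t:
M = lambda t: c_u * (t - mu * (1.0 - math.exp(-t / mu)))
t_grid = min((i / 100.0 for i in range(1, 10000)),
             key=lambda t: (c_p + M(t)) / t)
print(t_star, t_grid)
```

Here c^p < c^u·EX holds (3 < 10), so a unique minimum exists and both routes locate it.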

(iv) An efficiency model


Assume that the efficiency of a system drops in the course of time and that
preventive maintenance restores the efficiency to that of the as-good-as-new
condition. The efficiency can be measured in terms of output vs. input: either
the output can go down, or the input may have to be increased to
sustain the same output. The efficiency E(·), as a function of time, is scaled on
the interval [0,1], where E(0) = 1. Let c^e denote the costs per time unit as-
sociated with zero efficiency. The deterioration costs M(t) consist of the total
efficiency losses over [0, t], i.e. M(t) = c^e ∫_0^t (1 - E(x))dx. Let E(∞) denote
the limiting efficiency in case no maintenance is ever carried out. Similarly to
the previous model we can establish that lim_{t→∞} [M(t) - c^e(1 - E(∞))t] =
-c^e ∫_0^∞ (E(t) - E(∞))dt. Hence an optimal preventive maintenance interval
exists provided that c^p < c^e ∫_0^∞ (E(t) - E(∞))dt. This model is mathemat-
ically equivalent to that of running costs, see also Berg and Epstein (1979),
yet they do not derive conditions for optimality.

(v) A combined model


All aforementioned models may be combined as deterioration costs may con-
sist of failure costs, repair costs, efficiency losses and unavailability penalties
together.

2.4 Extensions

In this section we will give a number of extensions of the framework.


(i) Discrete time case
In the discrete time case actions may only be taken at discrete points in
time. The only change for the framework is that all functions have to be

discretised: i.e. m(t) indicates the expected deterioration costs until the next
time moment.

(ii) Scrapping value


Suppose a system is replaced by a new one every t time units against costs
c^p and that it has a scrapping value S(t) at age t. We assume that S(t) is
decreasing in the course of time. Let M(t) again denote the deterioration costs
of the system and let M̄(t) = M(t) - S(t). Notice that the total costs over a
replacement cycle [0, t] amount to c^p - S(t) + M(t), which equals c^p + M̄(t).
Hence a scrapping value can be taken care of by adjusting the deterioration
cost function. Finally, we would like to remark that this model is mathe-
matically equivalent to the block replacement model with time-dependent
replacement costs.

(iii) Discounted cost case


The analysis for the average costs case is easily extended to the discounted
costs case, as is also shown in Berg's (1980) marginal cost analysis. Assuming
a discount rate λ, we remark that the expected discounted deterioration cost
rate at time t is given by m(t)e^{-λt}. Hence the total expected discounted costs
over an interval of length t, starting with a new system/component, amount
to ∫_0^t m(y)e^{-λy}dy. The total expected discounted costs v^λ(t) over an infinite
horizon, when replacements are made every t time units and starting with
preventive maintenance, amount to

v^λ(t) = [c^p + ∫_0^t m(y)e^{-λy}dy] / [1 - e^{-λt}].           (2.5)

In this case we have

dv^λ(t)/dt = 0 ⇔ m(t) - λv^λ(t) = 0,

which leads to a similar analysis as for the average costs (see also Section 2.5).
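As a sketch, assuming a linear deterioration cost rate m(y) = ay (an illustrative choice, not from the text), (2.5) can be evaluated in closed form and its minimiser checked against the first-order condition m(t) = λv^λ(t):

```python
import math

# Discounted costs (2.5) for an assumed linear cost rate m(y) = a*y:
#     int_0^t a*y*exp(-lam*y) dy = a*(1 - exp(-lam*t)*(1 + lam*t))/lam**2.
# The minimiser of v(t) should satisfy m(t) = lam*v(t).
# Parameter values are illustrative.

a, lam, c_p = 1.0, 0.1, 2.0

def v(t):
    disc = math.exp(-lam * t)
    integral = a * (1.0 - disc * (1.0 + lam * t)) / lam**2
    return (c_p + integral) / (1.0 - disc)

t_star = min((i / 1000.0 for i in range(100, 10000)), key=v)
print(t_star, v(t_star))
```

With a small discount rate the optimum lies close to the undiscounted value √(2c^p/a) = 2 of special case (ii), as one would expect.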

(iv) Opportunity maintenance


Suppose that preventive maintenance can only be carried out at opportu-
nities which are generated according to a renewal process. Let the generic
r.v. Y denote the time between successive opportunities. We assume that
the opportunity process is independent of the deterioration process. In this
case maintenance can no longer be planned; instead we consider policies of
the control-limit type, i.e. maintain at the first opportunity after a threshold
time t since the last execution of the maintenance. Let the r.v. Z_t denote the
forward recurrence time, i.e. the time from t to the first opportunity. It easily
follows from renewal theory that for the average costs g_Y(t) we now have

g_Y(t) = [c^p + ∫_0^∞ M(t + x) dP(Z_t ≤ x)] / [t + EZ_t],       (2.6)

see Dekker and Smeitink (1991). In a similar, but far more complicated way,
they derive inequalities (2.3) and (2.4) with m(t) replaced by η(t) = ∫_0^∞ m(t +
z)dP(Y ≤ z), the expected deterioration costs until the next opportunity.

2.5 A Framework for Single-Parameter Age-Based Maintenance


In the well-known age replacement model the marginal deterioration cost rate
m(·) is a function of the age of a component rather than of the time since
the last execution of the preventive maintenance activity. The age is set back
to zero upon any renewal of the component, including failure renewals. This
implies that the renewal cycle has a variable length. The framework given
in Section 2.2 can be extended in the following way (see also Berg 1995).
Suppose the time to a system renewal, possibly caused by a breakdown, is
stochastic with c.d.f. F(t) and p.d.f. f(t). The long-run average costs are now
given by

g(t) = [c^p + ∫_0^t m(x)(1 - F(x))dx] / L(t),                   (2.7)

where L(t) = ∫_0^t (1 - F(x))dx indicates the expected cycle length. It is easily
shown that g'(t) = (m(t) - g(t))(1 - F(t))/L(t). Let Φ(t) be the analogue
of Ψ(t), i.e. Φ(t) = m(t)L(t) - ∫_0^t m(x)(1 - F(x))dx. Hence g'(t) = 0 ⇔
m(t) - g(t) = 0 ⇔ Φ(t) = c^p. Notice further that Φ'(t) = m'(t)L(t).
We are now in a position to formulate a theorem similar to Theorem 2.1,
whose proof is analogous.
Theorem 2.3.
(i) if m(t) is nonincreasing on [t0, t1] and m(t0) < g(t0), then g(t) has no
minimum on [t0, t1],
(ii) if m0(t) = m1(t) + c, for some c and all t > 0, then g0(t) and g1(t) have
the same extremes,
(iii) if m0(t) is nonincreasing on (0, t0) and increases thereafter, then g0(t)
has the same minima as g1(t), where m1(t) = m0(t0) for t < t0 and
m1(t) = m0(t) else and c1^p = c0^p + ∫_0^{t0} (m0(x) - m0(t0))(1 - F(x))dx,
(iv) if m(t) increases strictly for t > t0, where m(t0) < g(t0), and either
(a) lim_{t→∞} m(t) = ∞, or
(b) lim_{t→∞} m(t) = c where c > lim_{t→∞} g(t), for some c > 0,
then g(t) has a minimum, say g* in t*, which is unique on [t0, ∞); more-
over,

              < 0   for t0 < t < t*
m(t) - g(t)   = 0   for t = t*                                  (2.8)
              > 0   for t > t*
and
              < 0   for t0 < t < t*
m(t) - g*     = 0   for t = t*                                  (2.9)
              > 0   for t > t*

In case of the age replacement model the marginal deterioration cost rate
m(x) amounts to (c^f - c^p)r(x), where r(x) denotes the hazard rate, r(x) =
f(x)/(1 - F(x)). In that case the numerator of (2.7) equals c^p + (c^f - c^p)F(t).
Notice that the discounted cost case (see Section 2.4) can be regarded
as a special case of the extended framework by considering discounting as a
truncation of the system lifetime. Hence the c.d.f. of the time to system renewal,
F(x), should be defined as F(x) = 1 - e^{-λx}, where λ is the continuous
discount rate. In this case the expected lifetime equals 1/λ and hence the
total discounted costs per unit of time equal λv^λ(t), where v^λ(t) is given by
(2.5).
The main problem in using the age replacement extension for planning and
combining is that we can no longer predict in advance whether we will replace
at some time t, as that depends upon the possible occurrence of failures in
between. Doing a correct analysis implies that we have to condition on all
possible events between the moment of planning and the expected moment
of execution. This directly leads to intractable models in case of multiple
components. A heuristic way out is to do a conditional planning, assuming
that no failures occur in the planning horizon and taking the actual ages into
account. This is a reasonable approach since numerical experiments show that
in cases where preventive maintenance is really cost-effective, F(t*) is quite
small (up to 20%). Implementing this approach on a rolling horizon basis (i.e.
adapting the planning in the course of time with the occurrence of events) takes
care of failures. This idea was pursued in Dekker et al. (1993) in a discrete
time case.
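For the classical age replacement instance of the extended framework, g(t) in (2.7) can be minimised numerically and relation (2.8) checked at the optimum. The sketch below assumes a Weibull lifetime with illustrative parameters:

```python
import math

# Age-based framework (Section 2.5), classical age replacement instance:
# m(x) = (c_f - c_p)*r(x) with a Weibull lifetime (illustrative parameters).
# g(t) = [c_p + (c_f - c_p)*F(t)] / L(t),  L(t) = int_0^t (1 - F(x)) dx.
# At the minimum, (2.8) gives m(t*) = g(t*).

eta, beta, c_p, c_f = 10.0, 2.0, 1.0, 5.0
F = lambda t: 1.0 - math.exp(-((t / eta) ** beta))
r = lambda t: (beta / eta) * (t / eta) ** (beta - 1.0)   # hazard rate
m = lambda t: (c_f - c_p) * r(t)

h, n = 0.01, 3000
ts, g = [], []
acc = 0.0                                 # trapezoid accumulator for L(t)
for i in range(1, n + 1):
    ti = i * h
    acc += h * 0.5 * ((1.0 - F(ti - h)) + (1.0 - F(ti)))
    ts.append(ti)
    g.append((c_p + (c_f - c_p) * F(ti)) / acc)

i_star = min(range(n), key=lambda i: g[i])
t_star, g_star = ts[i_star], g[i_star]
print(t_star, g_star)
```

At the grid minimum the marginal cost rate m(t*) equals g* up to discretisation error, in line with (2.8).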

2.6 The Delay Time Model

In the delay time or two-phase model, an item passes through an observ-
able intermediate state (often called a fault) before failing (see e.g. Baker and
Christer 1994). Inspections are undertaken to see whether deterioration has
progressed so far that the intermediate state is visible. If so, a repair is
carried out immediately, which is also the case upon a failure. After the re-
pair the state is as-good-as-new. Suppose that faults occur according to a
Poisson process with a rate λ per time unit and that a c.d.f. F(·) of the so-
called delay time is available, i.e. the time left between the occurrence of the
intermediate state and a failure. Let c^i, c^r and c^f denote the costs of inspec-
tion, repair in the intermediate state (to the as-good-as-new condition) and
failure repair respectively. Notice that both after an inspection and a repair
the item is as-good-as-new. Let G(·) be the c.d.f. of the time between the as-
good-as-new state and failure if no inspections are carried out. Accordingly,
G(t) = ∫_0^t λe^{-λu} F(t - u)du. Assume further that inspections are carried out
at intervals of length t since the previous inspection or repair. The deterioration
cost rate m(t) now amounts to c^r + (c^f - c^r)G'(t)/(1 - G(t)). The delay-time
model is especially suited for cases where c^f is high compared to c^p and c^r,

hence an asymptotic criterion w.r.t. the existence of an optimum makes little
sense.

2.7 Relation with the Framework of Aven and Bergman

Aven and Bergman (1986) argue that the objective function in many main-
tenance optimisation models can be written as:

E[c(0) + ∫_0^T a(s)h(s)ds] / E[p(0) + ∫_0^T h(s)ds],

where T is a stopping time based on the information about the condition of
the system, a(s) is a nondecreasing stochastic process, h(s) a nonnegative
stochastic process and both c(0) and p(0) are nonnegative r.v.'s. The expec-
tation is taken w.r.t. all r.v.'s and stochastic processes. In our case h(s) = 1,
T is set at a prefixed value t, c(0) has the constant value c^p, p(0) = 0 and a(s)
represents the deterioration costs m(s). They show that equations (2.3) and
(2.4) hold, but give no further results. We also consider the case that m(s)
is first decreasing and next increasing (the bathtub pattern). Although their
framework is more general, one cannot predict the replacement in advance
and it is not yet clear how their framework can be used for planning and
combining.

3. Penalty Cost Function for Shifting from the Optimum

One important aspect of the framework is that it allows the derivation of
penalty costs for deviating from the individual optimal execution interval.
These penalty costs are input in comprehensive models for combination of
maintenance and for maintenance planning. Three different types of deviation
are possible: a short-term shift, a long-term shift and finally, a permanent
shift. Here we assume that a preventive maintenance activity is carried out
at regular intervals of length t. The short term shift changes one interval to
t + x, where x may be positive or negative, the next one to t - x, so that only
one execution moment is changed. The long-term shift changes one interval to
t + x and all following intervals remain constant. Finally, the permanent shift
changes all intervals to t + x. Which shift is most appropriate, depends on
how the preventive maintenance program is incorporated in the maintenance
management system and whether the shifts are permanent or not. If the
maintenance management system calculates all future execution dates from
the initially planned dates, then the short-term shift is appropriate. If it does
so from the actual execution dates, then the long-term shift should be used.
From the deterioration costs M (.) penalty cost functions for each of the shifts
can be derived. Let h_S(x), h_L(x) and h_P(x) denote the penalty functions for

deviating x time units from the optimum t* for a short-term shift, long-term
shift and permanent shift respectively. It is easy to see that
h_S(x) = M(t* + x) + M(t* - x) - 2M(t*),                        (3.1)

h_L(x) = M(t* + x) - M(t*) - xg* = ∫_{t*}^{t*+x} (m(y) - g*)dy,  (3.2)

h_P(x) = g(t* + x) - g(t*) = h_L(x)/(t* + x),                   (3.3)
where g* denotes the minimum long-term average costs. These penalty func-
tions can not only be used to assess the cost-effectiveness of any special sales
offer, but also for priority setting and to assist in combining activities or
integrating maintenance planning with production planning. Notice that the
penalty functions have the following properties: they are always nonnegative
and they are zero for x = 0. Furthermore, h_S(·) is symmetric around zero.
These penalty functions indicate the expected cost for deviating from the
optimum interval. It may happen, however, that the present state already
deviates from the optimum and that one does not need to take the costs into
account for arriving in the present state, but that one is interested in the
extra costs for deviating even further. More specifically, suppose one is at t
time units, t > t*, since the last execution of the activity. The expected costs
for deferring the activity (in this case there is no other option) for another x
time units amount to (we only consider the long-term shift)

h_L(x) = M(t + x) - M(t) - xg* = ∫_t^{t+x} (m(y) - g*)dy, x > 0.  (3.4)

In case of the extended framework we have to condition on the present age
and only consider the case where the component survives. Hence the penalty
costs for deferring preventive maintenance at age t to age t + x, where it is
normally executed at age t* (only the long-term shift is relevant), amount to

h_L(x) = ∫_t^{t+x} (m(y) - g*) [(1 - F(y))/(1 - F(t))] dy, x > 0.  (3.5)
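For the power-law case of special case (i), the penalty functions (3.1)-(3.3) have closed forms; the sketch below (illustrative a, β and c^p) also exercises the stated properties — nonnegativity, zero at x = 0, symmetry of h_S:

```python
# Penalty functions (3.1)-(3.3) for the power-law case m(t) = a*t**(beta-1)
# of special case (i); a, beta and c_p are illustrative.

a, beta, c_p = 2.0, 1.5, 10.0
M = lambda t: a * t**beta / beta
t_star = (beta * c_p / (a * (beta - 1.0))) ** (1.0 / beta)
g_star = (c_p + M(t_star)) / t_star        # minimum average costs g*

h_S = lambda x: M(t_star + x) + M(t_star - x) - 2.0 * M(t_star)
h_L = lambda x: M(t_star + x) - M(t_star) - x * g_star
h_P = lambda x: h_L(x) / (t_star + x)

print([round(h(1.0), 4) for h in (h_S, h_L, h_P)])
```

Note that h_L is positive for both positive and negative x, since m(y) - g* changes sign exactly at t*.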

4. Static Combining of Execution of Activities


One way of reducing maintenance costs is to combine the execution of main-
tenance activities. In many cases preparatory work, such as shutting down a
unit, scaffolding, travelling of the maintenance crew, has to take place before
maintenance can be done. Combining activities allows savings on this work.
On the other hand, combining mostly implies that one deviates from the
originally planned execution moments, which is not free. Combining activi-
ties can both be done on a long-term (e.g. creating maintenance packages)
or on a short-term basis, taking all once-off combinations into account. In this sec-
tion we consider a method for long-term combining, which was proposed by

Dekker et al. (1996). It is called static since fixed combinations are made.
Other long-term approaches apply a variable combining, based on the state
of the other components, such as the (n, N) policies.
Consider n maintenance activities a_i, i = 1, ..., n, which, if carried out
alone, cost c_i^p, i = 1, ..., n. All activities share the same set-up work. Hence
if k activities are carried out together the cost savings by joint execution
amount to (k - 1)c^s, where c^s are the costs of the set-up work. Suppose next
that the set-up work is done every t time units and that activity i is carried
out every k_i-th time, i.e. with an interval of k_i·t, where k_i is an integer decision
variable. The total long-term average costs g(t, k_1, ..., k_n) now amount to

g(t, k_1, ..., k_n) = c^s/t + Σ_{i=1}^n [c_i^p - c^s + M_i(k_i t)] / (k_i t).  (4.1)

The minimisation of g(t, k_1, ..., k_n) is now a mixed integer nonlinear program-
ming problem. This approach was first introduced by Goyal and Kusy (1981)
for the special case that M_i(t) = λ_i t^β, i = 1, ..., n (β is fixed for all activities).
They proposed an iterative optimisation method which does not always yield
the optimum. In Dekker et al. (1996) the optimisation is studied in more detail.
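A minimal sketch of this optimisation, assuming the cost structure described above (set-up every t units, activity i every k_i·t, M_i(t) = λ_i t^β): for fixed integers k_i the averaged costs reduce to A/t + B·t^{β-1}, so the optimal t follows from special case (i), and small k-vectors can simply be enumerated. All numbers are illustrative:

```python
from itertools import product

# Static combining: a sketch assuming average costs of the form
#     g(t, k_1..k_n) = c_s/t + sum_i [(c_i^p - c_s) + M_i(k_i*t)]/(k_i*t),
# with M_i(t) = lam_i*t**beta (beta fixed across activities, Goyal-Kusy).
# For fixed k the costs are A/t + B*t**(beta-1), minimised in closed form.
# All numbers are illustrative.

c_s, beta = 4.0, 2.0
c_p = [10.0, 12.0, 8.0]
lam = [0.5, 0.1, 0.05]

best = None
for k in product(range(1, 5), repeat=len(c_p)):
    A = c_s + sum((cp - c_s) / ki for cp, ki in zip(c_p, k))
    B = sum(li * ki ** (beta - 1.0) for li, ki in zip(lam, k))
    t = (A / ((beta - 1.0) * B)) ** (1.0 / beta)   # stationary point
    g = A / t + B * t ** (beta - 1.0)
    if best is None or g < best[0]:
        best = (g, t, k)

g_best, t_best, k_best = best
print(g_best, t_best, k_best)
```

Exhaustive enumeration is only feasible for small n; this is where the iterative methods cited in the text come in.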

5. Dynamic Combining of Execution of Activities

In this section we will consider short-term combining and show that the
penalty functions derived in Section 3 allow a cost-effectiveness evalua-
tion of combinations and assist in the timing of the execution. The main idea
is to apply a decomposition approach, that is, we first determine for each
activity its preferred execution moment and derive its penalty function. Next
we consider groups of activities, for which the preferred moment of execution
follows from a minimisation of the sum of the penalty functions involved. If
this sum is less than the set-up savings because of a joint execution, combin-
ing is cost-effective. Corrective maintenance work can also be involved in the
combination, provided that it is known at the outset of planning. In case it is
deferrable a penalty function for deferring should be determined. Determin-
ing the optimal groups can be formulated as a set-partitioning problem (see
Dekker et al. 1992). Wildeman et al. (1992) show that under certain condi-
tions the optimal grouping consists of groups with consecutive initial planning
moments, which allows the formulation of an O(n²) dynamic programming
algorithm (n being the number of activities).
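The consecutive-grouping property admits a short dynamic program. The sketch below uses hypothetical quadratic penalty proxies h_i(s) = w_i(s - t_i)², not the exact penalty functions of Section 3, together with illustrative planned moments and weights:

```python
# Consecutive grouping via O(n^2) dynamic programming (after Wildeman et
# al.), sketched with hypothetical quadratic penalty proxies
#     h_i(s) = w_i*(s - t_i)**2
# and illustrative planned moments t_i and weights w_i; joint execution of
# k activities saves (k - 1)*c_s in set-up costs.

c_s = 15.0
t = [0.0, 15.0, 32.0, 60.0, 100.0, 160.0, 180.0, 212.0]
w = [0.05] * len(t)

def group_cost(i, j):
    """Min summed penalty of executing activities i..j jointly, minus savings."""
    ws = sum(w[i:j + 1])
    s = sum(wk * tk for wk, tk in zip(w[i:j + 1], t[i:j + 1])) / ws
    penalty = sum(wk * (tk - s) ** 2 for wk, tk in zip(w[i:j + 1], t[i:j + 1]))
    return penalty - (j - i) * c_s

n = len(t)
f = [0.0] * (n + 1)             # f[j]: optimal cost for activities 0..j-1
choice = [0] * (n + 1)
for j in range(1, n + 1):
    f[j], choice[j] = min((f[i] + group_cost(i, j - 1), i) for i in range(j))

groups, j = [], n
while j > 0:                    # backtrack the optimal consecutive groups
    groups.append(list(range(choice[j], j)))
    j = choice[j]
groups.reverse()
print(groups, f[n])
```

A group is formed only when its set-up saving exceeds the summed penalties, which is exactly the cost-effectiveness test described in the text.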
Example. Table 5.1 provides data on 8 maintenance activities, each of which
replaces a unit. Deterioration costs of unit i are primarily due to small failures,
upon which a minimal repair is done. These occur independently of the state
of other units and the cost rate amounts to m_i(x_i) = c_i^r (β_i/λ_i)(x_i/λ_i)^{β_i-1},
where x_i denotes unit i's age. Special case (i) (see Section 2.2) gives a formula

for the individually optimal replacement age, which we denote by x_i*. Finally,
let ti be the resulting initial planning moment (counted from the start of the
planning horizon).

Table 5.1. Example data for combining 8 activities


Activity  λ_i (days)  β_i   c_i^r  c_i^p  t_i (days)  x_i* (days)
1         2380        1.70  46     60     0           229
2         2380        1.70  91     120    15          230
3         1900        2.00  14     180    32          681
4         2850        2.00  15     90     60          698
5         1620        1.70  86     300    100         278
6         2850        2.00  15     180    160         987
7         1950        1.75  45     60     180         195
8         1350        1.75  25     180    212         354

The resulting penalty functions are shown in Figure 5.1 (the numbers
refer to the activities).
We consider combining under short-term shifts, in which case the penalty
costs are given by equation (3.1). The planning horizon is [0,220]. As in
the previous section we assume that combining execution of any k activities
saves k - 1 times the set-up work (for any k), which is estimated at 15
cost units (about 10% of the preventive maintenance costs of an activity).
Using the algorithm of Wildeman et al. (1992) yields as optimal groupings:
{1,2,3} executed at day 12.6, {4,5} at day 97.9 and {6,7,8} at day 192.9.
The savings (set-up cost reduction minus penalty costs) for the combinations
amount to 29.4, 14.4 and 28.2 respectively. Total savings amount to 72.0,
which constitutes 6% of total preventive maintenance costs.
Dekker et al. (1993) give an analysis of the performance of this combina-
tion method for a more complex case where components are replaced using a
discrete time age replacement. They apply a conditional planning (assuming
no failures in the planning horizon) on a rolling horizon basis (implement the
decision for the current epoch, observe the new state at the next epoch and
make a new planning). They use the discrete version of the penalty func-
tions (3.5). They consider combining both for a finite and infinite horizon
and compare their planning method with an optimal solution obtained by
solving a large scale Markov decision chain numerically (which was tractable
up to four identical components only). It appears that for high set-up costs
and many components the cost allocation in the component decomposition
has to be changed because components are almost always replaced together.
When that has been done the loss of their strategy compared to the optimal
one is less than 1%.

[Figure 5.1 here: the eight penalty functions plotted as costs against time over 0-220 days.]

Fig. 5.1. Penalty functions for 8 maintenance activities

6. Priority Setting
Maintenance is usually classified into corrective and preventive work. The
first originates from a directly foreseeable, or already observed malfunction-
ing of systems, and the latter from a preconceived plan to keep systems in a
good condition in the long run. Often the first type of work is the most ur-
gent one. The maintenance capacity needed to take care of that may fluctuate
severely in time, due to the random character of failures. Hence preventive
work is often delayed in favor of the corrective work. Accordingly, there is
usually a large backlog of preventive work, with the implication that an indi-
vidual preventive maintenance activity is either delayed for an unknown time
or even never carried out. Most maintenance organisations have problems in
managing the backlog. It will be clear that the results of maintenance optimi-
sation decrease in value if the maintenance organisation is not able to do the
work on time, which is especially a problem for the many small maintenance
activities. Priority criterion functions, embedded in management information
systems, can be of help.
Here we propose the use of the penalty functions h_L(x) (or h_S(x) if ap-
propriate, see Section 3) as priority functions, where the long-term objective
is the average costs. Although they are formulated for a continuous time
setting, where at each moment a decision can be taken, they can easily be

extended to discrete intervals between decision moments, depending on how


often one wants to reset priorities. The same holds for a discounted costs
objective function. Before we give some pros and cons of these functions, we
first introduce some other priority criteria which have been used in practice
(see Pintelon 1990 for a review):
(i) a fixed priority according to the importance of the machine to be main-
tained
(ii) a machine importance factor multiplied by the waiting time for execu-
tion.
It will be clear that these criterion functions are heuristic and not related
to an optimisation model. The penalty function h_L(x), on the other hand,
has the following properties: it is negative before the optimal execution time,
zero exactly at that time and increases thereafter. It is expressed in money
terms, has an easy interpretation and is additive. The latter means that the
priority criterion for a group is the sum of the individual priority criteria.
Hence, splitting up activities into smaller activities does not affect the pri-
ority for the group. This is not the case for the criteria of type (i) and (ii):
these are expressed on an ordinal scale only! The penalty based criteria can
therefore also be used for the groups which are the result of the combination
of activities (just use the sum of the penalty functions minus their minimum
value). Furthermore, they can be used in more sophisticated planning. We would like
to remark that the penalty based priority criteria only express how important
it is to execute a certain activity. It does not express how much of a scarce
resource is needed for execution. Setting priorities between corrective main-
tenance and preventive maintenance is in principle possible using the above ideas, since
the priority functions are expressed in money. Corrective maintenance has to
be separated into deferrable and nondeferrable and for the first category a
cost rate for deferring has to be estimated, which can be compared with the
priority function for preventive maintenance.
Dekker and Smeitink (1994) provide an analysis of similar priority crite-
ria, though for a case where preventive maintenance could only be executed
at randomly occurring opportunities of restricted duration. Hence at each
opportunity priorities for execution had to be set. They computed long-term
average costs for a twenty-four component system under four different prior-
ity criteria, including one based on the penalty functions. This one performed
best in all cases considered.
Finally, in case of the extended framework for the age replacement model
(see Section 2.5) we may use a similar priority function. Given a present age
of x, the expected cost rate of delaying the maintenance activity amounts to
m(x) whereas we save by deferring on the average g* per time unit. Hence
the priority function reads m(x) - g* for the continuous time case.

7. Heuristic Replacement Criteria

Many problems are so complex that either the optimal strategy is unlikely
to have a simple structure, or the computational effort to determine it may
be prohibitive. In those cases one has to resort to approximate solutions. The
framework does allow the derivation of meaningful and often good performing
heuristics. Here we will show how they can be derived. First we state the
underlying philosophy.
We fix an action (e.g. either to replace or not) and focus on the tim-
ing aspect by considering at each moment whether deferring the action is
cost-effective. The replacement criterion is based on a comparison between
the local deterioration costs and the minimum average costs, i.e. equation
(2.2). Local deterioration costs are usually easy to determine, contrary to
the minimum average costs. For the latter one basically needs to enumerate
all possible deterioration and action possibilities. If there are many options
computational problems arise. Hence the heuristic criteria approximate the
minimum average costs by either restricting the number of options, or by
comparing with a simpler model. Concluding, the replacement criteria read:
"replace if m(t, I(t)) - 9 2: 0", where I(t) stands for all relevant information
available at time t and 9 for the minimum average costs in a suitable simpler,
but consistent model.
Example Dekker and Roelvink (1995) present such a heuristic for the fol-
lowing problem. Consider a maintenance package consisting of n activities,
each addressing one component within a unit. Upon failure of a component,
only the corresponding activity is executed, with the result that only that
component is renewed; the conditions of the other components remain the
same (upon a failure during operations only the respective activity is carried
out, whilst there is no time left to do the other activities). On a preventive
basis the full package is always executed, since that is only done when the
system is not needed. Hence the problem is when to execute the full package.
A simple strategy is to execute it at fixed time intervals (block replacement),
yet under this policy relatively new components may be replaced preven-
tively. On the other hand, it is relatively simple to calculate the minimum
average costs under this policy (it involves minimising a one dimensional
function only), and let us denote these costs by g_b*. Suppose now that at time
t the ages of all components are available, denoted by z_1, ..., z_n, and that
we consider the problem in a continuous time frame. Then local deteriora-
tion costs amount to Σ_j c_j^f r_j(z_j), where r_j(·) and c_j^f stand for component j's
hazard rate and failure costs respectively. Accordingly, we have the following
replacement criterion: replace if "Σ_j c_j^f r_j(z_j) − g_b* ≥ 0". The results obtained
by Dekker and Roelvink (1995) indicate that the difference in average costs
between this policy and the average optimal policy (which has been com-
puted for a 2-component case) is less than 1%, whereas the improvement
over block replacement varies between 0% and 10% (of total average costs).
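The criterion can be sketched in a few lines of code. All component data below are made up, and the block-replacement optimum g_b* is simply assumed; in practice it would come from the one-dimensional minimisation mentioned in the text.

```python
import math

# Sketch of the group-replacement criterion with hypothetical numbers:
# replace the whole package once the summed local deterioration costs of the
# components reach g_b*, the minimum average costs under block replacement.
failure_costs = [5.0, 8.0, 3.0]                  # c_j^f, invented
weibull = [(0.2, 2.0), (0.1, 3.0), (0.4, 1.5)]   # (theta_j, gamma_j), invented
g_b_star = 4.0                                   # assumed block-replacement optimum

def hazard(theta, gamma, z):
    # Weibull hazard rate r_j(z) = theta*gamma*z^(gamma-1)
    return theta * gamma * z ** (gamma - 1.0)

def replace_package(ages):
    # the criterion: replace if sum_j c_j^f r_j(z_j) - g_b* >= 0
    local = sum(c * hazard(t, g, z)
                for c, (t, g), z in zip(failure_costs, weibull, ages))
    return local - g_b_star >= 0.0

print(replace_package([0.5, 0.5, 0.5]))  # young components: defer
print(replace_package([3.0, 3.0, 3.0]))  # aged components: replace
```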
A Framework for Single-Parameter Maintenance Activities 187

8. Conclusions
In this paper we presented a framework for optimisation models which allows
integration with priority setting, planning and combination of activities. Fur-
ther research is required to investigate whether more models can be incor-
porated into the framework, and whether other models can be converted to
allow combining and planning as done in this paper.

Acknowledgement. The author would like to thank Adriaan Smit (Koninklijke/Shell-
Laboratorium, Amsterdam) for useful comments and Ralph Wildeman and Rob
van Egmond for their numerical support.

References

Aven, T., Bergman, B.: Optimal Replacement Times - A General Set-up. J. Appl.
Prob. 23, 432-442 (1986)
Baker, R.D., Christer, A.H.: Review of Delay-Time OR Modelling of Engineering
Aspects of Maintenance. Eur. Journ. Oper. Res. 73, 407-422 (1994)
Barlow, R.E., Proschan, F.: Mathematical Theory of Reliability. New York: John
Wiley 1965
Berg, M., Epstein, B.: A Note on a Modified Block Replacement Strategy with
Increasing Running Costs. Nav. Res. Log. Quart. 26, 157-159 (1979)
Berg, M.: A Marginal Cost Analysis for Preventive Replacement Policies. Eur.
Journ. Oper. Res. 4, 136-142 (1980)
Berg, M., Cleroux, R.: A Marginal Cost Analysis for an Age Replacement Policy
for Units with Minimal Repair. Infor. 20, 258-263 (1982)
Berg, M.: The Marginal Cost Analysis and Its Application to Repair and Replace-
ment Policies. Eur. Journ. Oper. Res. 82, 214-240 (1995)
Cho, D.I., Parlar, M.: A Survey of Maintenance Models for Multi-Unit Systems.
Eur. Journ. Oper. Res. 51, 1-23 (1991)
Dekker, R.: Applications of Maintenance Optimisation Models: A Review and
Analysis. Report Econometric Institute 9228/A, Erasmus University Rotter-
dam (1992)
Dekker, R.: Integrating Optimisation, Priority Setting, Planning and Combining of
Maintenance Activities. Eur. Journ. Oper. Res. 82, 225-240 (1995)
Dekker, R., Dijkstra, M.C.: Opportunity-Based Age Replacement: Exponentially
Distributed Times Between Opportunities. Naval Res. Log. 39, 175-190 (1992)
Dekker, R., Roelvink, I.F.K.: Marginal Cost Criteria for Group Replacement. Eur.
Journ. Oper. Res. 84, 467-480 (1995)
Dekker, R., Smeitink, E.: Opportunity-Based Block Replacement: The Single Com-
ponent Case. Eur. Journ. Oper. Res. 53, 46-63 (1991)
Dekker, R., Smeitink, E.: Preventive Maintenance at Opportunities of Restricted
Duration. Naval. Res. Log. 41, 335-353 (1994)
Dekker, R., Smit, A.C.J.M., Loosekoot, J.E.: Combining Maintenance Activities in
an Operational Planning Phase. IMA Journ. of Math. Appl. in Buss. Ind. 3,
315-332 (1992)
Dekker, R., Wildeman, R.E., Van Egmond, R.: Joint Replacement in an Opera-
tional Planning Phase. Report Econometric Institute 9438/A (revised version),
Erasmus University Rotterdam (1993)

Dekker, R., Frenk, J.B.G., Wildeman, R.E.: How to Determine Maintenance Fre-
quencies for Multi-component Systems? A General Approach. In this volume
(1996), pp. 239-280
Kamath, A.R.R., Al-Zuhairi, A.M., Keller, A.Z., Selman, A.C.: A Study of Ambu-
lance Reliability in a Metropolitan Borough. Rel. Eng. 9, 133-152 (1984)
McCall, J. J.: Maintenance Policies for Stochastically Failing Equipment: A Survey.
Mgmt. Sci. 11, 493-524 (1965)
Noortwijk, J.M. van, Dekker, R., Cooke, R.M., Mazzuchi, T.A.: Expert Judgment
in Maintenance Optimisation. IEEE Trans. on Reliab. 41, 427-432 (1992)
Pierskalla, W.P., Voelker, J.A.: A Survey of Maintenance Models: The Control and
Surveillance of Deteriorating Systems. Nav. Res. Log. Quart. 23, 353-388 (1976)
Pintelon, L.: Performance Reporting and Decision Tools for Maintenance Manage-
ment. Ph.D. Dissertation, University of Leuven (1990)
Sherif, Y.S., Smith, M.L.: Optimal Maintenance Models for Systems Subject To
Failure - A Review. Nav. Res. Log. Quart. 28, 47-74 (1981)
Valdez-Flores, C., Feldman, R.M.: A Survey of Preventive Maintenance Models
for Stochastically Deteriorating Single Unit Systems. Naval Res. Log. 36,
419-446 (1989)
Wildeman, R.E., Dekker, R., Smit, A.C.J.M.: Combining Activities in an Opera-
tional Planning Phase: A Dynamic Programming Approach. Report Economet-
ric Institute 9424/A (revised version), Erasmus University Rotterdam (1992)
Economics Oriented Maintenance Analysis and
the Marginal Cost Approach
Menachem P. Berg
Department of Statistics, University of Haifa, Mount Carmel, Haifa 31905, Israel

Summary. Marginal cost analysis of maintenance policies was introduced by this
author in 1980 and has since been applied, in a series of papers by this and
other authors, to various maintenance cases. In broader terms it can be categorized
as an economics oriented approach, as distinct from the common probability-
centered approach, and it is appropriate and useful when the costing side is
predominant in the maintenance planning.
We consider here, within a unified framework, a series of maintenance situations
and apply the marginal cost approach to their modelling and optimization. The
smoothness of the derivations and procedures, even for quite general models and
policies, and the clarity and insight introduced to the analysis, demonstrate the
natural role of the marginal cost approach and its effective use to maintenance de-
sign in costing environments - within the broad class of economics-based approaches.

Keywords. Maintenance, marginal cost analysis, repair and replacement policies

1. Introduction

Mathematical maintenance theory originated in the scientific literature
with a predominant emphasis on the modelling and analysis of the uncer-
tainty aspects of failures. While it is true that maintenance activities have
been primarily designed to avert failures, it is just as evident, particularly
in today's advanced economic planning climate, that the design of main-
tenance policies should be more closely linked to overall economic goals. It
is thus necessary to make the mathematical study of maintenance stand on
the two legs of uncertainty modelling, through probabilistic analysis, and
economic planning, on a more equal footing.
To be sure, even in the (by now) "classical" probabilistic maintenance theory
- very much the one developed in the seminal treatise by Barlow and Proschan
(1965) - the models do contain cost parameters and the objective function
to be minimized is often the long-term expected cost per unit time, a cost
function. Yet, it is amply clear that the setting and focus of that study are
centered on the probability analysis of failures with the economics side being
subservient to it - whereas real life cost planning often dictates otherwise.
Generally, the economic planning of maintenance is a broad issue covering
various circumstances. In this paper we focus our attention on the "classi-
cal" maintenance policies to which we apply the mathematical-economics ap-
proach of marginal cost analysis. Apart from the specific treatment of these
fundamental maintenance policies a more general purpose of the paper is to
190 Menachem P. Berg

serve as an orientation-setting vehicle for the modelling and optimization of
maintenance activities where the approach very much revolves around cost-
ing goals with the probability tools serving as the necessary mathematical
machinery for achieving them.
Essentially, the two main (classical) families of preventive maintenance
policies are the Age replacement family, where the preventive action is based
on the elapsed life (sometimes operating life), and the Block replacement
family, where preventive maintenance is done at fixed time intervals with no
(or little) relationship to whatever happens in between. For conciseness of
exposition we shall only dwell here on the Age replacement family, but the
concepts presented and methods used are just as relevant to the Block family.
For the latter we shall confine ourselves to referring the reader, in an annotated
manner, to relevant publications.
Obviously, the mathematical results with both approaches (for the basic
cases treated by both) do coincide as we shall see in the sequel. Still, the
clarified and simplified procedures with the approach here and the insight
thereby gained have the quite concrete (quantitative) benefit of paving the
way to the analysis of model generalizations and policy extensions that bring
maintenance planning closer to reality.

2. The Age Replacement Policy

The basic maintenance policy in the Age replacement family is the age re-
placement policy (ARP) under which an item is replaced at failure or oth-
erwise preventively when it reaches a certain critical age. We first note that
a policy of this type would mainly be appropriate (but not exclusively so,
because of the tempting simplicity of its implementation) when the failure
modelling of the item is of the age-based type - that is a 'black-box' statis-
tical approach that relates failures to age, or usage, through statistical data
gathering and inference procedures. (Other types of failure modelling which
attempt to go deeper into what causes a failure may have the disadvantage
of being prohibitively expensive in terms of the data required to infer the
multitude of additional parameters, as well as of being non-robust with
regard to the usually simplified set of assumptions employed).
The issue of age-based failure modelling and its ramifications is consid-
ered in Berg (1995b) and it is argued there that the pivotal quantity for
analysis should be the hazard (or, failure rate) function r(·) rather than the
life distribution F(·), and as demonstrated there this change of starting point
can have a concrete impact on the statistical procedure despite the equiva-
lence of probability information in both functions, which determine each other
through the mathematical relationship

    1 − F(x) = F̄(x) = e^(−∫_0^x r(u) du)      (2.1)


Economics Oriented Maintenance Analysis 191

In that regard we shall also note in the sequel that the hazard function, the
cornerstone of age-based failure modelling, indeed integrates well into the
analysis of the ARP: the age-dependent maintenance policy.
Once the ARP has been chosen as the maintenance policy employed, the
only undecided question is what critical age T, for preventive replacement, is
to be used? Given the hazard function r(·) which, as stated above, contains
all the probability information we use, and the costs c1 and c2 of failure and
preventive replacements, respectively, the optimal T is then the value which
minimizes our cost objective function. The most commonly used such objec-
tive function is the long-term expected cost per unit of time and, expressing
it in terms of T using the renewal-reward theorem (e.g., Ross 1983), we find
(Barlow and Proschan 1965),

    C(T) = [c1 F(T) + c2 F̄(T)] / ∫_0^T F̄(u) du      (2.2)
Applying regular calculus optimization to (2.2), the equation for T*, the
optimal T, is

    r(T) ∫_0^T F̄(u) du + F̄(T) = c1 / (c1 − c2)      (2.3)

and then combining (2.2) and (2.3) the optimal cost is found to be

    C* = (c1 − c2) r(T*)      (2.4)
Some conclusions can be drawn and observations made from examining
the optimality equation (2.3). Firstly, we see that T* depends on the replace-
ment costs only through their ratio since, trivially, c1/(c1 − c2) = (1 − c2/c1)^(−1).
That is clear as we can set the money unit as we desire. Then, it can be easily
verified that if r(·) is increasing, or, using the common terminology, "F(·) is
IFR (increasing failure rate)", the left-hand side of (2.3) is increasing in T,
which ensures that T* must be unique. It can, however, still be ∞, in case
there is no finite solution to (2.3), which essentially says that an ARP is only
superior to a sheer failure replacement policy if enough is gained by preven-
tive maintenance in terms of replacement cost savings to offset the wastage
in replacing operative items.
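A minimal numerical sketch of equations (2.2)-(2.4), assuming a Weibull (hence IFR) life distribution with invented parameters: T* is found by bisection on the increasing left-hand side of (2.3), and the relation C(T*) = (c1 − c2) r(T*) of (2.4) can then be checked on the result.

```python
import math

# Hypothetical Weibull life: Fbar(x) = exp(-theta*x^gamma); made-up costs.
theta, gamma = 1.0, 2.0
c1, c2 = 10.0, 1.0

def Fbar(x): return math.exp(-theta * x ** gamma)
def r(x): return theta * gamma * x ** (gamma - 1.0)

def integral_Fbar(T, n=2000):
    # composite trapezoidal rule for int_0^T Fbar(u) du
    h = T / n
    s = 0.5 * (Fbar(0.0) + Fbar(T)) + sum(Fbar(i * h) for i in range(1, n))
    return s * h

def lhs(T):
    # left-hand side of (2.3): r(T)*int_0^T Fbar(u)du + Fbar(T)
    return r(T) * integral_Fbar(T) + Fbar(T)

target = c1 / (c1 - c2)
lo, hi = 1e-6, 5.0
for _ in range(80):          # bisection: lhs is increasing for IFR lives
    mid = 0.5 * (lo + hi)
    if lhs(mid) < target:
        lo = mid
    else:
        hi = mid
T_star = 0.5 * (lo + hi)

def C(T):
    # long-term expected cost per unit time, equation (2.2)
    return (c1 * (1 - Fbar(T)) + c2 * Fbar(T)) / integral_Fbar(T)

print(T_star, C(T_star), (c1 - c2) * r(T_star))
```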
So far so good, but one may still have certain queries about these
results: why is the IFR property so focal in ensuring a unique optimum? Is
there a simple, intuitively clear condition for T* to be finite? And what is
behind the resulting concise (and "elegant") expression for C* in (2.4)?
It is clear that the probabilistic approach does not touch upon such issues
which is so because it is only a mathematical tool while these issues are con-
cerned with the true underlying nature of matters namely the economics of
maintenance. We shall now adopt an approach that makes economic consid-
erations the starting point of the analysis and, in particular, all the queries
above are consequently removed.

The key mathematical-economics notion in the context of these main-
tenance policies is that of the marginal cost for preventive replacement at
any given age which, when aggregated over all ages, yields the marginal cost
function (MCF). More precisely, define for an item of age x,
v1(x) - the cost of a preventive replacement now

v2(x, Δ) - the expected costs (failure or preventive replacement costs in-
cluded) in (x, x+Δ] if the preventive replacement is deferred to age x+Δ
(for an infinitesimal Δ).
Then, the marginal cost of a preventive replacement at age x is defined as

    η(x) = lim(Δ→0) [v2(x, Δ) − v1(x)] / Δ      (2.5)

and the resulting function of age on the left-hand side of (2.5) is the MCF. The
rationale is clear: the decision when to make a preventive replacement can be
decomposed into a sequence of decisions, at each age x, of whether to carry out
a preventive replacement now or to wait another (infinitesimally short)
time period. We note the implicit conditioning embedded in the definition of
the MCF, since for the preventive replacement at age x to be at all relevant
that age must be first survived.
For the ARP we have, on basic principles,

    v1(x) = c2   and
    v2(x, Δ) = r(x)Δ c1 + (1 − r(x)Δ) c2   (ignoring o(Δ) terms)      (2.6)
Substituting (2.6) into (2.5) we immediately obtain the MCF associated with
the ARP,
    η(x) = (c1 − c2) r(x)      (2.7)
The MCF can be also utilized to obtain the cost objective function C(T).
For that, however, we also need an appropriately selected underlying renewal
process (see Berg 1995a for elaboration) whose renewal-interval c.d.f. is de-
noted by G(·). For the ARP a convenient choice is that of the service-life of
an item, so that replacement moments of any kind constitute renewal epochs
for that purpose. Thus, by the definition of the ARP, we have

    Ḡ(x) = F̄(x),  x < T;     Ḡ(x) = 0,  x ≥ T      (2.8)

Now, by the very definition of the MCF with its abovementioned conditional
nature (for more details see Berg 1995a), the expected cost during the service-
life of an item is

    D(T) = ∫_0^T η(x) Ḡ(x) dx + c2      (2.9)

Note that (2.9), once (2.7) and (2.8) are substituted into it, coincides with the
numerator of (2.2). C(T) is then obtained by dividing D(T) by the expected
service-life

    U(T) = ∫_0^∞ Ḡ(x) dx = ∫_0^T F̄(x) dx      (2.10)

by (2.8).
Comparing the derivation of D(T) in the "classical" approach and the one
here we note that the main difference is that there we have an overall renewal-
interval calculation whereas here the procedure is broken into two steps. First
we have a micro-type calculation where we obtain, in a rather straightforward
manner and usually on mere basic principles, the MCF. This function is then
used, in a given formula, to yield D(T). While in mathematically simple
situations like the (basic) ARP the superiority of the approach here is not
obvious, as both methods provide an easy calculation, it is in the model
generalizations and policy extensions considered later that the simplification
of the setting and facilitation of the mathematics become apparent.
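The two-step procedure can be checked numerically: with hypothetical Weibull data, D(T) built from the MCF via (2.9) and divided by the U(T) of (2.10) reproduces the renewal-reward expression (2.2). All parameter values below are made up.

```python
import math

# Two-step check: build the MCF eta(x) = (c1-c2)*r(x) of (2.7) and the
# service-life survival Gbar(x) = Fbar(x) for x < T of (2.8), then compare
# D(T)/U(T) with the direct renewal-reward expression (2.2).
theta, gamma = 1.0, 2.0   # invented Weibull parameters
c1, c2 = 10.0, 1.0        # invented replacement costs
T = 0.8                   # an arbitrary critical age

def Fbar(x): return math.exp(-theta * x ** gamma)
def r(x): return theta * gamma * x ** (gamma - 1.0)
def eta(x): return (c1 - c2) * r(x)          # equation (2.7)

def trapz(f, a, b, n=4000):
    # composite trapezoidal rule for int_a^b f
    h = (b - a) / n
    return h * (0.5 * f(a) + 0.5 * f(b) + sum(f(a + i * h) for i in range(1, n)))

D = trapz(lambda x: eta(x) * Fbar(x), 0.0, T) + c2   # equation (2.9)
U = trapz(Fbar, 0.0, T)                              # equation (2.10)
C_mcf = D / U
C_direct = (c1 * (1 - Fbar(T)) + c2 * Fbar(T)) / U   # equation (2.2)
print(C_mcf, C_direct)
```

The two values agree to quadrature accuracy, confirming that the micro-level MCF calculation and the overall renewal-interval calculation deliver the same cost function.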
We now proceed to use the MCF for optimization and, invoking a basic
principle from mathematical economics, we have that the optimal critical age
T* is a solution of the equation

    η(T) = C(T)      (2.11)

which, as can be checked, is equivalent to (2.3).
Apart from delivering the required expressions and the simple setting of
the optimality equation, the marginal cost approach also clarifies the above-
stated queries.
First, invoking another economics-based principle, we have that a sufficient
condition for the existence of a unique T* is that the MCF is increasing, and
then T* is finite if and only if

    η(∞) > C(∞)      (2.12)

(The more general version of this last result, which covers non-monotonic
MCFs, is that η(T) and C(T) intersect at all the extrema of C(T), and
only there, so that η(·) crosses from below at the minima and from above at
the maxima (Berg 1980)). The above economics-based principle also clarifies
the role of the IFR property here since, by the functional form of (2.7), if r(·)
increases so does η(·).
The above clearly demonstrates the use of the marginal cost analysis as a
comprehensive tool for the study of the ARP. But then, no less importantly,
it is the insightfulness and smoothness of the procedures that bring about
the much valued virtue of the approach: it enables clear and straightforward
model generalizations and policy extensions, of much usefulness for real life
maintenance planning, which are otherwise demanding and cumbersome in
the mathematics (as is clearly revealed by comparing with other works that
have considered some of these generalizations - see specific references later).

3. Model Generalization: Age Dependent Replacement Costs
We consider the possibility that replacement costs may depend on the item's
age at the time of replacement due to salvage costs or other costing considera-
tions. To account for this economic type factor the necessary model extension
is to let the replacement costs be a function of the item's age, i.e., replace c1
and c2 by c1(x) and c2(x), respectively. In that case (2.6) becomes

    v1(x) = c2(x)   and
    v2(x, Δ) = r(x)Δ c1(x) + (1 − r(x)Δ) c2(x + Δ)      (3.1)
To obtain the revised D(T) for this model we need to use a somewhat gener-
alized form of formula (2.9) where c2 is replaced by c2(0) (for mathematical
details see Berg 1995a). For the derivation of D(T) we also need G(·), but this
distribution function is clearly unaffected by the model generalization here
and is still given by (2.8). Accordingly, U(T) remains unchanged as well. Di-
viding the revised D(T) by U(T) yields the long-run expected cost C(T) of
the ARP in this model.
An observation of interest here, which relates to one of the abovemen-
tioned queries about the results of the (basic) ARP, is that the IFR property
is no longer a sufficient condition for the existence of a unique solution
T* to the optimality equation (2.11) (or, equivalently, a unique minimum to
C(T)). Thus, its presumed focality in that regard is merely coincidental.

4. Model Generalization: Adding Running Costs


Yet another relevant economic type factor in maintenance design is the running
cost of an item (see Berg and Epstein 1979, Cleroux and Hanscomb 1974,
Berg 1976). These costs may include: regular maintenance costs, depreciation
costs, reduced output with aging (which can be translated into costs) etc.,
and because of their relationship to aging they should, and normally would,
be taken into account in preventive replacement decisions.
Letting k(x)Δ be the running costs of an item in the age interval (x, x + Δ]
(ignoring the o(Δ) term), it is easy to see that as far as the MCF is concerned
all that needs to be done to account for these costs in the ARP preventive
maintenance framework is to replace η(x) by

    η̃(x) = η(x) + k(x)      (4.1)

and then proceed exactly as before (with G(·) unchanged) to obtain (the
revised) C(T) and then the optimality equation

    η̃(T) = C(T)
The resulting optimal critical age is now based on the replacement costs as
well as on the running costs of the item.
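A sketch of this extension with an invented linear running-cost rate k(x): the revised marginal cost function enters D(T), and the root of the optimality equation is again found by bisection. All figures are made up.

```python
import math

# Hypothetical Weibull life, replacement costs, and running-cost rate k(x) = k0*x.
theta, gamma = 1.0, 2.0
c1, c2, k0 = 10.0, 1.0, 2.0

def Fbar(x): return math.exp(-theta * x ** gamma)
def r(x): return theta * gamma * x ** (gamma - 1.0)
def eta_tilde(x): return (c1 - c2) * r(x) + k0 * x   # equation (4.1)

def trapz(f, a, b, n=4000):
    h = (b - a) / n
    return h * (0.5 * f(a) + 0.5 * f(b) + sum(f(a + i * h) for i in range(1, n)))

def C(T):
    # running costs accrue over the service life, so they enter D(T) via eta_tilde
    D = trapz(lambda x: eta_tilde(x) * Fbar(x), 0.0, T) + c2
    return D / trapz(Fbar, 0.0, T)

lo, hi = 1e-6, 5.0
for _ in range(80):               # bisection for the root of eta_tilde(T) = C(T)
    mid = 0.5 * (lo + hi)
    if eta_tilde(mid) < C(mid):
        lo = mid
    else:
        hi = mid
T_star = 0.5 * (lo + hi)
print(T_star, C(T_star))
```

With these made-up figures the running costs pull the optimal critical age slightly forward compared with the pure replacement-cost case, as one would expect for costs that grow with age.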

5. Policy Extension: Repair or Replacement at Failure Depending on Costs
The ARP is restrictive in that the only possible corrective action at failure
is replacement. While that is appropriate for inexpensive items, which are
discarded upon failure, or for modular items which are replaced upon fail-
ure by an operative item of the same kind (and then sent for recovery in a
workshop), there exists a wide range of items, such as industrial machines,
for which repair (without modular replacement) is a natural corrective action
upon failure, with replacement being the alternative. A policy question then
would be: when to do what? As a matter of fact, the question can be posed in
a broader form to include the extent of the repair action. Here, however, we
adopt the simplifying but still often realistic assumption that a repair is only
"minimal", as defined in Barlow and Proschan (1965), i.e., it only restores
operability but makes no improvement. Mathematically, this means that the
hazard function following a repair is undisturbed so that the failure process
in the future is independent of this repair action (as well as of any previous
ones).
The question of whether to repair or replace upon failure in the above
context was considered by Cleroux et al. (1979) (and earlier by other authors
too, e.g., Drinkwater and Hastings 1967, but mathematically less systemat-
ically), and it is suggested there to make the decision depend on the repair
costs involved. More precisely, a limit δ is set on the repair costs, which are
essentially random, following a c.d.f. L(·), so that if the actual repair costs at
a given failure epoch exceed δ, a replacement is made at cost c1. Accordingly,
δ is called the repair-cost-limit (RCL henceforth) and the policy is named
likewise. The RCL policy still includes a preventive replacement at age T, so
that the ARP corresponds, in this context, to the special case of δ = 0.
On the other extreme we have the special case δ = ∞, which corresponds
to a policy that only allows repair at failures. This policy has already been
considered within the ("classical") set of maintenance policies in Barlow and
Proschan (1965), under the name minimal-repair-policy (MRP henceforth).
We first make several observations before proceeding to treat the MRP with
the marginal cost approach.
To begin with, the MRP is restrictive in the opposite direction to that
of the ARP as it does not allow a replacement upon failure at all (so that,
in principle, repairs can go on forever). In this sense, the RCL policy is a
compromise between the two extreme cases: the ARP and the MRP.
Another comment relates to the repair costs which are assumed to be
fixed in Barlow and Proschan (1965) but are random here - varying with the
actual repair extent required. It can, however, be observed that all the above-
mentioned results concerning the classical replacement policies remain valid
(under some independence assumptions) if the repair, or even replacement,
costs are allowed to be random variables with the means of these random vari-
ables substituting their constant counterparts in the original models. Thus,

the case δ → ∞ corresponds to the classical MRP, with the repair cost fig-
ure there replaced by the expected repair cost here (conditional on a repair
decision),

    c3 = ∫_0^δ u dL(u) / L(δ)      (5.1)
Another observation regarding the MRP is that as far as the preventive
maintenance action is concerned the policy is in fact a bridging case between
the ARP and the BRP (block replacement policy) since with it (in the ab-
sence of failure replacements) any item reaches the critical age for preventive
replacement T, so that preventive replacements will be made at equal time-
intervals of length T. From this argument it follows that the G(·) and U(T)
of the MRP correspond to those of the BRP (see later in (11.1) and (11.2),
respectively).
For the computation of the MCF of the MRP we first derive, on basic
principles,

    v1(x) = c2,   and
    v2(x, Δ) = c3 r(x)Δ + c2      (5.2)

and hence, by (2.5), we immediately obtain

    η(x) = c3 r(x)      (5.3)
The optimization procedure for T then follows the usual track (see Berg
(1995a) for details).
Returning to the RCL policy, it is convenient to reparametrize it by letting

    p = L(δ)  and  q = 1 − p,      (5.4)

so that the decision process, through and over the lifetimes of consecutive
items, corresponds to a Bernoulli stochastic process (see Çınlar 1975, p. 44)
since at each failure epoch we have either a repair, with probability p, or a
replacement, with probability q, and due to the "minimal" nature of repairs
and the (assumed) independence of the random repair costs at failures, the
choices at the different failure epochs are independent.
Consequently, since the MCF represents expected cost, applying the law
of total expectation (i.e., E(Z) = E[E(Z|I)], where Z represents the costs
and I is the Bernoulli random variable corresponding to the choice of action
at failure) we obtain the MCF of the RCL policy (Berg 1982a) as a linear
combination of those of the ARP, in (2.7), and the MRP, in (5.3),

    η(x) = [q(c1 − c2) + p c3] r(x)      (5.5)
Several comments are in order here. Firstly, we observe how easily the
marginal cost approach accommodates this non-trivial policy extension (com-
pare with the far more laborious mathematics in Cleroux et al. 1979 where

the other approach is employed). This is once more due to the use of an
approach which is designed to deal with economics-oriented modelling, as is
essentially the case here.
We already noted that the RCL policy is a compromise between the "ex-
treme" cases of the ARP and the MRP but with (5.5) we have now obtained
an exact mathematical relationship among them.
To find D(T), through the application of formula (2.9), we still need the
distribution of the service-life G(·). Probably the simplest way to find this
distribution here is to resort to the hazard function of G(·) (and recall here
our earlier argument about the pivotal role of hazard functions in "black-box"
(age-based) failure modelling contexts).
On basic principles we have

    r_G(x) = q r(x),  x < T      (5.6)

as the service life of an item will be terminated in the next interval of length
Δ, given that it survived age x, if there is a failure there and the repair cost
incurred also exceeds δ. Consequently, by the basic relationship in (2.1)
between life distributions and hazard functions, we immediately obtain

    Ḡ(x) = e^(−q ∫_0^x r(u) du) = (F̄(x))^q,  x < T      (5.7)

by (5.6).
Combining η(·) and Ḡ(·), from (5.5) and (5.7), respectively, we can pro-
ceed with our standard procedure and use formula (2.9) to obtain D(T),

    D(T) = ∫_0^T η(x) Ḡ(x) dx + c2 = [p c3 + q(c1 − c2)] (1 − (F̄(T))^q) / q + c2

Next, we substitute (5.7) into (2.10) to obtain U(T), and the cost objective
function C(T) is now given by their ratio,

    C(T) = ( [p c3 + q(c1 − c2)] (1 − (F̄(T))^q) + q c2 ) / ( q ∫_0^T (F̄(x))^q dx )

Equating C(T) to η(T) yields the optimality equation in this case for T*.
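The cost rate of the RCL policy is easy to evaluate once L is specified. The sketch below assumes an exponential repair-cost distribution, for which p = L(δ) and the conditional expected repair cost c3 of (5.1) have simple closed forms; all parameter values are invented.

```python
import math

# Hypothetical RCL-policy data: Weibull lifetime, exponential repair-cost c.d.f. L
# with mean mu, repair-cost limit delta; p = L(delta), q = 1 - p.
theta, gamma = 1.0, 2.0
c1, c2 = 10.0, 1.0
mu = 4.0      # mean of the repair-cost distribution (assumed)
delta = 6.0   # repair-cost limit (assumed)

p = 1.0 - math.exp(-delta / mu)   # L(delta) for an exponential L
q = 1.0 - p
# c3 = int_0^delta u dL(u) / L(delta), equation (5.1), in closed form here:
c3 = (mu - (mu + delta) * math.exp(-delta / mu)) / p

def Fbar(x): return math.exp(-theta * x ** gamma)

def trapz(f, a, b, n=4000):
    h = (b - a) / n
    return h * (0.5 * f(a) + 0.5 * f(b) + sum(f(a + i * h) for i in range(1, n)))

def C(T):
    # the RCL cost objective function derived in the text
    num = (p * c3 + q * (c1 - c2)) * (1.0 - Fbar(T) ** q) + q * c2
    den = q * trapz(lambda x: Fbar(x) ** q, 0.0, T)
    return num / den

print(p, c3, C(1.0))
```

Note that letting δ shrink to 0 (q → 1) recovers the ARP cost function (2.2), in line with the bridging role of the RCL policy described above.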

6. Generalization of This Policy Extension: Age-Dependent RCL

This is a further extension of the basic policy where we still adopt the RCL
rule, only it is now made age-dependent, i.e., it is a function δ(x) of the age
at failure x (rather than a constant). Realistically, this could reflect, through
a decreasing function δ(·), a lessening readiness to invest in the repair of an
item as it gets older.

This policy generalization, which is mathematically complicating within
the "classical" approach, is accommodated here in a straightforward and
mathematically effortless manner: all that needs to be done is to replace δ by
δ(x) in the above formulae and it all remains valid (compare with the rather
laborious mathematics in Block et al. (1988), where the other approach is
used). Note that the decision process is now a generalization of the standard
Bernoulli process due to variable probabilities of repair and replacement at
failures: p(x) and q(x), respectively, depending on the age x. Accordingly, the
service-life distribution now becomes

    Ḡ(x) = e^(−∫_0^x r(u) q(u) du),  x < T.      (6.1)

The expected repair cost is now a function of the age at failure x,

    c3(x) = ∫_0^δ(x) u dL(u) / L(δ(x)).      (6.2)
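The service-life survival function (6.1) can be evaluated by simple quadrature. In the sketch below, the decreasing limit δ(x) and the exponential repair-cost c.d.f. are invented for illustration.

```python
import math

# Hypothetical ingredients: Weibull hazard, linearly decreasing RCL delta(x),
# exponential repair-cost distribution with mean mu, so q(x) = 1 - L(delta(x)).
theta, gamma = 1.0, 2.0
mu = 4.0

def r(x): return theta * gamma * x ** (gamma - 1.0)
def delta(x): return max(6.0 - 1.5 * x, 0.0)     # invented decreasing RCL
def q(x): return math.exp(-delta(x) / mu)        # replacement probability at age x

def Gbar(x, n=2000):
    # Gbar(x) = exp(-int_0^x r(u) q(u) du), equation (6.1), by trapezoidal rule
    h = x / n
    f = lambda u: r(u) * q(u)
    s = h * (0.5 * f(0.0) + 0.5 * f(x) + sum(f(i * h) for i in range(1, n)))
    return math.exp(-s)

print(Gbar(0.5), Gbar(1.0), Gbar(2.0))
```

Since q(u) ≤ 1, the service life is stochastically longer than under the plain ARP: Ḡ(x) ≥ F̄(x) for every age x below T.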

7. Model Generalization of the RCL Policy: Age-Dependent Repair Cost Distribution
In both RCL policies above we can make a model generalization that could
account for a possibly varying repair cost distribution as the item gets older.
Simply replace L(·) throughout by L_x(·) and the procedure follows. Thus,
yet another economic type generalization is readily accommodated by our
economics-based approach.
It is noteworthy that with an age-dependent repair cost distribution, the
decision on a preventive replacement will depend on the changed reliability
status with age, as represented by the hazard function, as well as on the
repair cost variability with age. How exactly these two age-dependent
factors should be combined into a decision prescription is considered in the
next section.

8. The Determination of the RCL

For the implementation of the RCL policy we need to specify the RCL, and
this could either be done directly on the basis of some relevant costing con-
siderations, or we can make δ subject to optimization with respect to the
overall cost objective function (and thus treat the latter, for that purpose, as
a function of two variables: T and δ). This task, however, becomes harder in
the age-dependent case, and having an idea of the functional form of δ(x) is
very helpful. For instance, if we want to find an optimal RCL, then instead
of a mathematically hard functional optimization procedure which searches
for the optimal function δ(·), we only need to optimize for the constant coef-
ficients of a given function.
To approach this issue we take a closer look at the (economic) rationale of
using an age-dependent RCL. If we assume that the item's "revenue" is age-
invariant then the main reason to be less ready to invest in an older item is the
increased maintenance costs (and if the item's revenues are age-dependent,
for instance if the output reduces with age, then we can use the model with
running costs to account for that). Thus, the difference between the constant
RCL δ and the age-dependent one is a function of the anticipated repair costs.
This argument can be made formal by specifying a future planning horizon
of some length d (to be determined, as usual in economic planning, on the
basis of some exogenous considerations: budgeting or operational) and then
set

δ − δ(x) = Z_x(d)

where Z_x(d) is the expected repair cost for an item of age x in the next time
interval of length d (assuming no replacement there). Consequently,

δ(x) = δ − Z_x(d)                                                  (8.1)


so that to obtain a functional form for δ(x), all that is needed is to compute
Z_x(d). To do that we use a result that states that the failure process of an item
undergoing minimal repairs is a non-homogeneous Poisson process (which
is due to the independent-increment property of the failure process here
which follows, by the definition of a minimal repair, from the independence
of different failure events). Moreover, the intensity function of this NHPP
is exactly the hazard function r(·) of the item (yet another reminder of the
focality of the hazard function in "black box" (age-based) failure modelling
contexts). Using this we obtain right away, by virtue of a basic property of
the NHPP,

Z_x(d) = c_a ∫_x^{x+d} r(y) dy                                     (8.2)

(note that in the absence of aging, i.e., constant r(·), Z_x(d) is a constant,
independent of x, and can thus be absorbed in δ so that, in this case, a constant
RCL is appropriate). When the repair cost distribution is age-dependent this
can be generalized, by utilizing the independent-increment property of the
NHPP, to obtain

Z_x(d) = ∫_x^{x+d} c_a(y) r(y) dy                                  (8.3)
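Formulas (8.2)-(8.3) are plain one-dimensional integrals, so a minimal numerical sketch is easy to give. The hazard and cost functions below are illustrative assumptions (a constant hazard, i.e., an exponential life distribution, and a constant mean repair cost c_a), and the integral is approximated with a simple trapezoidal rule; the run confirms the remark above that without aging Z_x(d) does not depend on x.

```python
def expected_repair_cost(r, c_a, x, d, n=10_000):
    """Z_x(d) = integral from x to x+d of c_a(y) r(y) dy, as in (8.3);
    (8.2) is the special case of a constant cost function c_a."""
    h = d / n
    f = lambda y: c_a(y) * r(y)
    total = 0.5 * (f(x) + f(x + d))
    for k in range(1, n):
        total += f(x + k * h)
    return total * h

r_exp = lambda y: 0.01   # constant hazard: exponential life distribution (illustrative)
cost = lambda y: 5.0     # constant mean repair cost c_a (illustrative)
z0 = expected_repair_cost(r_exp, cost, 0.0, 100.0)
z50 = expected_repair_cost(r_exp, cost, 50.0, 100.0)
print(round(z0, 6), round(z50, 6))  # 5.0 5.0: with no aging, Z_x(d) is independent of x
```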

Example: Let

F̄(x) = exp(−θx^γ),  x ≥ 0,  θ, γ > 0,                             (8.4)

i.e., a Weibull life distribution, and

c_a(x) = f(x) = f x^σ,  σ > 0.                                     (8.5)
200 Menachem P. Berg

Then r(x) = θγx^{γ−1} and from (8.1), combined with (8.3), we find

δ(x) = δ − [fθγ/(γ+σ)] [(x+d)^{γ+σ} − x^{γ+σ}]                     (8.6)
We first note that even though the Weibull distribution is a commonly
used life distribution and f(x) is of a reasonable functional form, the resulting
functional form for the RCL in (8.6) is unlikely to be assessed directly. Indeed,
it is most likely, as a matter of routine practice, that δ(·) will be given a
"nice" functional form as, for instance, the (arbitrarily chosen) linear and
exponential functional forms for δ(·) in Berg et al. (1986).
For the function in (8.6) we have that δ(x) is decreasing if and only if
γ + σ ≥ 1. Thus an IFR Weibull, i.e., γ ≥ 1, is sufficient for a decreasing
tendency to repair an item as it gets older, regardless of whether the mean-
repair-cost function f(x) is increasing or not. However, if γ < 1 then f(x) has
to increase fast enough or, more precisely, we need σ > 1 − γ to ensure such
a monotonicity behavior of δ(x).

Some special cases:

(a) γ + σ = 1
This reduces the RCL to a constant and we are back to the previous
case.
(b) γ + σ = 2
In this case δ(·) in (8.6) becomes a linear function (so that, in particular,
a linear RCL function is a special case of this example) and we note that
it is decreasing whether F(·) possesses the IFR property (corresponding
here to 1 < γ ≤ 2) or the opposite DFR (decreasing failure rate) property
(corresponding to γ < 1). Also, since δ(x) in this case becomes negative
for
x > x_0 = δ/(fθγd) − d/2,                                          (8.7)
we conclude that the item has to be replaced at a failure beyond age x_0
(if x_0 in (8.7) is negative we set x_0 = 0 with the implication that the
item must be replaced upon its first failure, i.e., we are back to the age
replacement policy).
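Case (b) admits a quick numerical check. The parameter values below are purely illustrative, and the closed form coded here is the one read off from (8.6), δ(x) = δ − [fθγ/(γ+σ)][(x+d)^{γ+σ} − x^{γ+σ}], together with the crossing age x_0 of (8.7).

```python
def rcl(x, delta, f, theta, gamma, sigma, d):
    """Age-dependent repair cost limit delta(x) of (8.6)."""
    p = gamma + sigma
    return delta - (f * theta * gamma / p) * ((x + d) ** p - x ** p)

# Case (b): gamma + sigma = 2, so delta(x) is linear and crosses zero at
# x0 = delta/(f*theta*gamma*d) - d/2  -- formula (8.7)
delta, f, theta, gamma, sigma, d = 10.0, 1.0, 0.002, 1.5, 0.5, 50.0
x0 = delta / (f * theta * gamma * d) - d / 2
print(round(x0, 4), abs(rcl(x0, delta, f, theta, gamma, sigma, d)) < 1e-9)  # 41.6667 True
```

Beyond age x_0 the repair cost limit is negative, so a failure always triggers replacement, in line with the conclusion above.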

9. Policy Alteration/Generalization: Using the Expected Total Discounted Costs as Objective Function

In the basic ARP, as well as in its extensions and generalizations hitherto, the
long-term expected cost per unit of time is used as objective function. Another
widely used cost objective function is the expected total discounted costs in
the long-run - where discounting means that a payment of 1 unit of money
t time units hence has a present value of e^{−αt}, where α is the discounting
factor. Accordingly, we now introduce the notation

W_α(T) - the expected long term total discounted costs of an ARP with critical
age T when the discounting factor is α.
While this is an alteration it is also, in a sense, a generalization since as
α → 0 we are back to the previous case, only that an appropriate modification
is needed because W_α(T) is total costs whereas C(T) is average costs. Since
C(T) is in fact a rate of costs, to achieve commensuration we would need to
transform W_α(T) into a rate, and following Howard (1971, p. 853) we note
that a long-term continuous rate of costs of αW_α(T) yields, when discounted,
a total W_α(T) (simply: ∫_0^∞ e^{−αt} αW_α(T) dt = W_α(T)).
The derivation of W_α(T) here is a simple exercise through the renewal
type equation:

W_α(T) = ∫_0^T [c_1 + W_α(T)] e^{−αu} dF(u) + [c_2 + W_α(T)] e^{−αT} F̄(T)

from which we immediately obtain

αW_α(T) = [c_1 ∫_0^T e^{−αu} f(u) du + c_2 e^{−αT} F̄(T)] / ∫_0^T e^{−αu} F̄(u) du     (9.1)
From (9.1), comparing with (2.2), we have

lim_{α→0} αW_α(T) = C(T)                                           (9.2)

which concurs with the general relevant theoretical relationship (e.g., Ross
1970, p. 163).
The marginal cost function is obtained here, using the straightforward
derivations,

v_1(x) = c_2
v_2(x, Δ) = r(x)Δ c_1 + (1 − r(x)Δ) c_2 e^{−αΔ}

so that, by (2.5),

η_α(x) = (c_1 − c_2) r(x) − c_2 α                                  (9.3)
As expected, we have, comparing with (2.7),

lim_{α→0} η_α(x) = η(x)                                            (9.4)

The optimality equation here (as motivated by the above argumentation on
the relationship between W_α(T) and a cost rate) is

η_α(T*) = αW_α(T*)                                                 (9.5)

and combining (9.5) and (9.3) the optimal cost is

αW_α(T*) = (c_1 − c_2) r(T*) − c_2 α


where here T* ≡ T*(α). Assuming continuity of r(·) we also have, by (9.5)
and (2.11), with (9.2) and (9.4),

lim_{α→0} T*(α) = T*

The rest of the marginal cost procedure continues to follow as before, with
η_α(T) and αW_α(T) replacing η(T) and C(T), respectively, so that if η_α(T)
is increasing, a unique T* exists and it is then finite if η_α(∞) > αW_α(∞).
Also, in general, η_α(T) intersects αW_α(T) at the extrema of the latter (and
only there): from below at the minima and from above at the maxima.
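The whole discounted-cost procedure can be sketched numerically: evaluate αW_α(T) from (9.1) by quadrature, take η_α from (9.3), and solve the optimality equation η_α(T) = αW_α(T) of (9.5) by bisection. The Weibull life distribution and the cost parameters below are illustrative assumptions, not values from the text.

```python
import math

def alpha_W(T, alpha, c1, c2, F_bar, f, n=2000):
    """alpha * W_alpha(T) from (9.1), by trapezoidal quadrature."""
    h = T / n
    num = 0.5 * (f(0.0) + math.exp(-alpha * T) * f(T))
    den = 0.5 * (F_bar(0.0) + math.exp(-alpha * T) * F_bar(T))
    for k in range(1, n):
        u, e = k * h, math.exp(-alpha * k * h)
        num += e * f(u)
        den += e * F_bar(u)
    return (c1 * num * h + c2 * math.exp(-alpha * T) * F_bar(T)) / (den * h)

theta, gamma = 0.001, 2.0                      # illustrative Weibull life distribution
F_bar = lambda t: math.exp(-theta * t ** gamma)
f = lambda t: theta * gamma * t ** (gamma - 1) * F_bar(t)
r = lambda t: theta * gamma * t ** (gamma - 1)

c1, c2, alpha = 100.0, 10.0, 0.01              # illustrative costs and discount factor
eta = lambda t: (c1 - c2) * r(t) - c2 * alpha  # marginal cost function (9.3)

lo, hi = 1.0, 200.0   # bisection for (9.5): eta_alpha crosses alpha*W_alpha from below
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if eta(mid) > alpha_W(mid, alpha, c1, c2, F_bar, f):
        hi = mid
    else:
        lo = mid
T_star = 0.5 * (lo + hi)
print(abs(eta(T_star) - alpha_W(T_star, alpha, c1, c2, F_bar, f)) < 1e-6)  # True
```

At the root, the optimal discounted cost rate αW_α(T*) equals η_α(T*) = (c_1 − c_2)r(T*) − c_2α, matching the optimality discussion above.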

10. Policy Extension: Opportunistic Preventive Maintenance

Opportunistic maintenance policies are designed to take advantage of mo-
ments of particular appropriateness for the purpose of preventive mainte-
nance. There are different variations here and one such situation occurs when
the nature of the demand is intermittent and we want to confine preventive
maintenance activities to no-demand periods. A model of this type has been
considered in Berg (1984) and the policy used there is an ARP extension,
called the modified age replacement policy (MARP). With the MARP the pre-
ventive replacement is still governed by a critical (operational) age T but
when this critical age is reached, the action is delayed until the first subsequent
no-demand instant.
Although the model setting in Berg (1984) is of an operational nature: the
objective function is the availability of the item at demand times and the
basic tradeoff is between failure and preventive replacement times (where the
former is assumed to be larger than the latter in some appropriate sense), it
can be technically transformed into a cost model with the expected cost per
unit of time as objective function and with the tradeoff related to replace-
ment costs, so that the optimal policy in one is identical with the optimal
policy in the other (for a similar transformation in the basic ARP see Bar-
low and Proschan 1965). This switch to a cost model enables an application
of the marginal cost approach and the analysis proceeds along the regular
procedure (see Berg 1984 for details). The special case of the above model,
corresponding to degenerate zero-length no-demand periods (so that the pre-
ventive replacement opportunities form a point process) has been considered
in Dekker and Dijkstra (1992) using a marginal cost analysis. The marginal
cost function there can be verified to coincide with that in Berg (1984) (in
the cost model) and the optimality analysis concurs as well. A look-ahead
type argumentation there reinforces insights into results.

11. The Block Replacement Family of Maintenance Policies

The application of the marginal cost approach to the block replacement policy
(BRP) follows similar lines. The main change is that it is now the length of
the (fixed) preventive replacement intervals, still denoted by T, which is the
argument of the MCF and that now preventive replacements constitute the
selected underlying renewal process (Berg 1995a). Thus, here

ā(x) = 1 for x < T,   ā(x) = 0 for x ≥ T                           (11.1)

and,
U(T) = T                                                           (11.2)
Once the MCF η(T) is computed, it is combined in formula (2.9) with
the distribution in (11.1) to generate D(T). Then division by U(T), in (11.2),
yields C(T). The optimality equation for T* is

C(T) = η(T).
The marginal cost analysis of the (basic) BRP is considered in detail
in Berg (1980), and the extensions to the RCL policy: in Berg and Cleroux
(1982b) for constant δ, and in Berg and Cleroux (1991) for an age-dependent
δ(·). The case of the expected total discounted costs as objective function is
also considered in Berg (1980), while the extension to opportunistic policies
has been studied in Dekker and Smeitink (1991).

12. Conclusion

The main message here is that in the study of maintenance policies, within
costing frameworks, the approach should be economics-based with probabil-
ity tools being subservient to that. This shift of orientation, compared to
commonly used ones in the literature of mathematical maintenance theory,
turns out to be not merely a conceptual change but one with a clear effect
on both the analysis and optimization of maintenance policies.
The specific mathematical economics concept used is that of the marginal
cost of (preventive) maintenance, at a given age or time as far as the main
maintenance policies of age replacement and block replacement are con-
cerned, which gives rise to the marginal cost function. This last function is
utilized for both the derivation of the cost objective function of the long-term
expected costs as well as its optimization, including the investigation of prop-
erties of the optimal solution. The method proves mathematically smooth
and effective for different model generalizations and policy extensions that
make the maintenance planning more realistic.

It is instructive to point out at this juncture that although the differ-
ent model generalizations and policy extensions were treated here separately,
combinations of them can be handled just as effectively with little extra
difficulty. For instance, the required derivations for an RCL policy with
age-dependent replacement costs and with added running costs are made
through a straightforward revision of the MCF. Needless to say, such
extensive model and policy variations are no small matter with the alterna-
tive all-out renewal approach and this is a clear and important benefit of
the approach adopted here. As a matter of general orientation, an economics
oriented approach, and the marginal cost analysis in particular, can be ap-
plied to yet other maintenance policies whether they are of the "black-box"
failure-modelling type (which both the age replacement and block replace-
ment policies, essentially are) or causal-failure models, such as those based
on the item's deterioration with age or operating conditions. Regarding this
latter case, for instance, the planning of inspections, to observe the item's
current degradation state (which then serves as a basis for possible mainte-
nance actions), is one more cost-incurring maintenance-related activity that
needs to be incorporated within an economics-based approach into the gen-
eral maintenance design.

References

Barlow, R., Proschan, F.: Mathematical Theory of Reliability. New York: Wiley
1965
Berg, M.: Optimal Replacement Policies for Two-Unit Machines with Increasing
Running Costs - I. Stochastic Processes and Their Applications 4, 89-106 (1976)
Berg, M.: Marginal Cost Analysis for Preventive Replacement Policies. European
Journal of Operational Research 4, 136-142 (1980)
Berg, M.: A Preventive Replacement Policy for Units Subject To Intermittent De-
mand. Operations Research 32, 584-595 (1984)
Berg, M.: The Marginal Cost Analysis and Its Application to Repair and Replace-
ment Policies. European Journal of Operational Research 82, 214-224 (1995a)
Berg, M.: Age-Dependent Failure Modelling: A Hazard-Function Approach. Cen-
tER Discussion Paper (No. 9569), Tilburg University (1995b)
Berg, M., Bienvenu, M., Cleroux, R.: Age Replacement Policy with Age-Dependent
Minimal Repair. Infor 24, 26-32 (1986)
Berg, M., Cleroux, R.: A Marginal Cost Analysis for an Age Replacement Policy
with Minimal Repair. Infor 20, 258-263 (1982a)
Berg, M., Cleroux, R.: The Block Replacement Problem with Minimal Repair and
Random Repair Costs. Journal of Statistical Computation and Simulation 15,
1-7 (1982b)
Berg, M., Cleroux, R.: Maintenance Policies with Jointly Optimal Repair and Re-
placement Actions. Proceedings of Relectronics 91. Eighth Symposium on Re-
liability and Electronics, Budapest (1991)
Berg, M., Epstein, B.: A Note on a Modified Block Replacement Policy for Units
with Increasing Marginal Running Costs. Naval Research Logistics Quarterly
26, 157-179 (1979)

Block, H.W., Borges, W.S., Savits, T.H.: A General Age Replacement Model with
Minimal Repair. Naval Research Logistics 35, 365-372 (1988)
Çinlar, E.: Introduction to Stochastic Processes. Englewood Cliffs: Prentice-Hall
1975
Cleroux, R., Dubuc, S., Tilquin, C.: The Age Replacement Problem with Minimal
Repair and Random Repair Costs. Operations Research 27, 1158-1167 (1979)
Cleroux, R., Hanscomb, M.: Age Replacement with Adjustment and Depreciation
Costs and Interest Charges. Technometrics 16, 235-239 (1974)
Dekker, R., Dijkstra, C.: Opportunity-Based Age Replacement: Exponentially Dis-
tributed Times Between Opportunities. Naval Research Logistics 39, 175-190
(1992)
Dekker, R., Smeitink, E.: Opportunity Based Block Replacement. European Journal
of Operational Research 53, 46-62 (1991)
Drinkwater, R., Hastings, N.: An Economic Replacement Model. Operational Re-
search Quarterly 18, 69-71 (1967)
Howard, R.: Dynamic Probabilistic Systems: Semi-Markov and Decision Processes.
Vol. II. New York: Wiley 1971
Ross, S.: Applied Probability Models with Optimization Applications. San Fran-
cisco: Holden-Day 1970
Ross, S.: Stochastic Processes. New York: Wiley 1983
Availability Analysis of Monotone Systems
Terje Aven
Rogaland University Centre, Ullandhaug, 4004 Stavanger, Norway

Summary. Consider a monotone system observed in a time interval J comprising
n components, which are repaired or replaced at failures. The problem is to anal-
yse the performance of the system. Relevant performance measures are presented
and methods for computing these are discussed. Emphasis is placed on analysing
the probability distribution of the downtime of the system (more generally the lost
throughput if a flow network is considered).
Keywords. Monotone systems, multistate systems, availability, downtime distri-
bution, interval reliability, Poisson approximation

1. Introduction

The traditional reliability theory based on a binary approach has recently
been generalized by allowing components and systems to have an arbitrary
finite number of states, see e.g. Aven (1992, 1985), Korczak (1993) and Natvig
(1984) and the references therein. For most reliability applications, binary
modeling should be sufficiently accurate, but for certain types of applications
such as production and transport systems, a multistate approach is usually
required, cf. e.g., Aven (1992, 1990), Brok (1987), Brouwers (1986, 1987),
Dekker and Groenendijk (1995), Smit et al. (1995) and Ostebo (1993). In a
gas transport system, for example, the state of the system is defined as the
rate of delivered gas, and in most cases a binary model (0%,100%) would be
a poor representation of the system.
In this chapter we consider a multistate system comprising components
having only two states: the functioning state and the not functioning state.
Such a model has been shown to be suitable in many practical applications, and it
simplifies the analysis significantly compared to a model including multistate
components. The model and many of the results presented in this chapter
can easily be extended to the general multistate case.
The system is observed in a time interval J. The performance of such a
system can be expressed in a number of ways. Some of the most commonly
used measures are presented in Aven (1993), including
- Probability distribution and mean of the system downtime (relative to a
given system level) in J
- The mean performance level in J
- The demand availability
- Probability distribution and mean of the number of times the system fails
(relative to a given system level)

In this chapter we give an overview of relevant performance measures. The
computation of these measures is discussed. Some easily computed approx-
imation formulae are established. Special attention is given to the situation
that the components are highly available, the most common situation in prac-
tice. The work is based on recent research reported in Aven and Haukaas
(1996a, 1996b), Aven and Jensen (1996), Aven (1990, 1992, 1993), Haukaas
(1995), Haukaas and Aven (1996a, 1996b) and Smith (1995). Other key ref-
erences are Birolini (1985, 1994), Gertsbakh (1984), Gnedenko and Ushakov
(1995) and Ushakov (1994).
The chapter is organized as follows. In the following section we introduce
the model. Then we present the performance measures. The final section
discusses the problem of calculating the performance measures.

2. Model
Let φ(t) be a non-negative random variable representing the state or the
performance level of the system at time t, t ∈ J = (u, u + s]. The interval J
is the time interval we observe the system. We assume that φ(t) can take one
of M + 1 values

φ_0, φ_1, ..., φ_M   (φ_0 < φ_1 < ... < φ_M).

The M + 1 states represent successive levels of performance ranging from
the perfect functioning level φ_M down to the complete failure level φ_0. In a
flow network we interpret φ(t) as the throughput rate of the system at time t.
An example of a flow network is given in Section 3. In the following we will use
the word "throughput rate" also in the general case. The system comprises
n components, numbered consecutively from 1 to n. Let X_i(t) be a random
variable representing the state of component i at time t, i = 1, 2, ..., n,
t ∈ J. We assume that X_i(t) can be in one of two states, which we denote
x_i0 and x_i1 (x_i0 < x_i1). The states x_i1 and x_i0 represent the functioning
state and the not functioning state, respectively. If a component fails, it is
repaired and put into operation again. We assume that the time to failure
of component i has a distribution F_i(t) with mean MTTF_i and the time to
repair has a distribution H_i(t) with mean MTTR_i. We assume that all times
to failure and repair times are stochastically independent. Furthermore, we
assume that for all components the distribution of the sum of an operation
period and its subsequent repair is not periodic, i.e. not a discrete distribution
with values in {τ, 2τ, 3τ, ...} for some τ, and that with probability one, two
or more components cannot fail at the same time.
Let p_i(t) = P(X_i(t) = x_i1). Thus p_i(t) represents the availability of com-
ponent i at time t. The above assumptions guarantee that p_i(t) converges
to

p_i ≡ MTTF_i / (MTTF_i + MTTR_i)

when t → ∞.
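This limiting availability is easy to check with a minimal alternating-renewal simulation. The exponential up times and deterministic repair times below anticipate the example of Section 4; the horizon and seed are arbitrary choices.

```python
import random

def simulated_availability(mttf, mttr, horizon, rng):
    """Fraction of [0, horizon] an alternating up/down process spends up:
    exponential up times with mean mttf, deterministic repairs of length mttr."""
    t, up_time = 0.0, 0.0
    while t < horizon:
        up = rng.expovariate(1.0 / mttf)
        up_time += min(up, horizon - t)
        t += up + mttr
    return up_time / horizon

est = simulated_availability(480.0, 20.0, 2_000_000.0, random.Random(1))
print(abs(est - 480.0 / 500.0) < 0.01)  # True: close to the limiting value 0.96
```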

We also assume that there exists a reference level at time t, D(t), which
expresses a desirable level of system performance at time t. The reference level
D(t) is a non-negative random variable, taking values in D = {d_0, d_1, ..., d_w}.
For a flow network system we interpret D(t) as the demand rate at time t.
In the following we will use the word "demand rate" also in the general case.
The system throughput rate is assumed to be a function of the states of
the components and the demand rate, i.e.

φ(t) = φ(X(t), D(t))

where X(t) = (X_1(t), X_2(t), ..., X_n(t)). The system is assumed to be a
monotone system, i.e.
1. φ(x_10, x_20, ..., x_n0, d_0) = 0 and φ(x_11, x_21, ..., x_n1, d) = d for all d ∈ D
2. The structure function φ(x, d) is non-decreasing in each argument
The computation of the various performance measures will be based on the
assumption that the n processes X_i(t) are stochastically independent. Thus
we are not including models for analysing standby systems. Refer to Birolini
(1985, 1994), Gnedenko and Ushakov (1995), Heijden and Schornagel (1988)
and Ushakov (1994) for theory and results related to such systems.

Notation

q = (q_1, q_2, ..., q_n)
h(q) = P(φ(X, d) ≥ k), where q_i = P(X_i = x_i1) and the X_i's are independent
(·_i, q) = (q_1, ..., q_{i−1}, ·, q_{i+1}, ..., q_n)
I(·) = indicator function, which equals 1 if the argument is true and 0 otherwise

3. Performance Measures

Below we list some relevant performance measures for the system. The mea-
sures will be denoted I_1a, I_1b, ..., I_2a, I_2b, ..., etc. For some other closely
related measures, see Haukaas (1995) and Haukaas and Aven (1994).
1. I_1a = Probability distribution of φ(t) given a demand rate D(t) = d
   I_1b = E[φ(t) | D(t) = d]
   I_1c = P(φ(t) ≥ D(t))
2. Let

   V = Number of times the process φ(t) is below k in J

   I_2a = Probability distribution of V
   I_2b = EV
   I_2c = P(V = 0) = P(φ(t) ≥ k, t ∈ J)
Some closely related measures are obtained by replacing k by D(t).
3. Let

   Y = ∫_J (D(t) − φ(t)) dt = ∫_J D(t) dt − ∫_J φ(t) dt ≡ Y^(1) − Y^(2)

   We see that Y represents the lost throughput (volume) in J, i.e. the differ-
   ence between the accumulated demand (volume), Y^(1), and the actual
   throughput (volume), Y^(2), in J.
   I_3a = Probability distribution of Y
   I_3b = EY = EY^(1) − EY^(2)
   I_3c = EY^(2) / EY^(1)
   The measure I_3c is called "throughput availability".
4. Let

   Z = (1/|J|) ∫_J I(φ(t) = D(t)) dt

   The random variable Z represents the portion of time the throughput
   rate equals the demand rate.
   I_4a = Probability distribution of Z
   I_4b = EZ
   The measure I_4b is called "demand availability".

4. Computation of the Performance Measures

The computation of the above measures can be an extremely difficult task
in practice. The need for model simplifications and efficient computation
methods is therefore vital. In the following we will discuss the problem of
computing the various measures defined above. Only analytical methods will
be considered. Refer to Aven (1993) for a discussion of the use of Monte Carlo
simulation.

4.1 Measures of Category 1

There exists a number of methods for computing the probability distribution
and the mean of the throughput rate at a fixed point in time, given a fixed
demand rate D(t) = d, see e.g. Aven (1985, 1992) and Natvig (1984) and
the references therein. Computation of these measures represents the basic
problem within multistate theory and has been given much attention in the
reliability theory literature. In most applications the component availabilities,
p_i(t), are assumed to be equal to the limiting (steady state) values as t → ∞.
Computation of the measure I_1c is trivial given that the measure I_1a has
been computed for all d-values, noting that

I_1c = P(φ(t) ≥ D(t)) = Σ_{i=0}^{w} P(φ(t) ≥ d_i | D(t) = d_i) P(D(t) = d_i)

Example
Figure 4.1 shows a simple example of a flow network model. Flow (gas/oil) is
transmitted from A to B. The system comprises four two-state components,
with x_i0 = 0, i = 1, 2, 3, x_40 = 1, x_11 = x_21 = 1, x_31 = x_41 = 2. Hence
the components 1 and 2 are binary components, component 3 has possible
states 0 and 2, and component 4 has possible states 1 and 2. The states of
the components are interpreted as flow capacity rates for the components.
The demand is assumed to be equal to 2, and the state/level of the system
is defined as the maximum flow that can be transmitted from A to B, i.e.

φ(t) = min{X_1(t) + X_2(t), X_3(t), X_4(t), 2}

If for example the component states are x_1 = 0, x_2 = 1, x_3 = 2 and x_4 = 2,
then the flow throughput equals 1, i.e. φ = φ(0, 1, 2, 2) = 1. The time unit is
hours.

Fig. 4.1. A simple example of a flow network

The possible system levels are 0, 1 and 2. We see that φ is a multistate
monotone system.
Consider now a time period J of 1 year and assume that the demand rate
equals 2 for the whole interval. Furthermore, assume that the performance
process for the binary components 1 and 2 are determined by a constant
failure rate λ = 1/480 per hour and a Time To Repair (TTR) equal to 20 hours
(the TTR is assumed to be deterministic). Component 3 is assumed to have
a constant failure rate equal to 1/990 per hour and a TTR = 10 hours, whereas
component 4 is assumed to have a constant failure rate equal to 1/490 per hour
and a TTR = 10 hours. With these parameters, the limiting availabilities
p_i = 1/(1 + λ TTR) are given by

p_i = 0.96, i = 1, 2
p_3 = 0.99
p_4 = 0.98

Consider now a fixed point in time t and assume as an approximation that
p_i(t) = p_i. Then by simple probability calculus it is easy to compute the
performance measures of category 1:

P(φ(t) = 2) = P(X_1(t) = 1, X_2(t) = 1, X_3(t) = 2, X_4(t) = 2)
            = 0.96 × 0.96 × 0.99 × 0.98 = 0.894
P(φ(t) ≥ 1) = P({X_1(t) = 1 ∪ X_2(t) = 1}, X_3(t) = 2)
            = P(X_1(t) = 1 ∪ X_2(t) = 1) P(X_3(t) = 2)
            = {1 − P(X_1(t) = 0) P(X_2(t) = 0)} P(X_3(t) = 2)
            = 0.9984 × 0.99 = 0.988
Eφ(t)/2 = (0.094 × 1 + 0.894 × 2)/2 = 0.941
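These numbers can be reproduced by enumerating the 2^4 combinations of component states. Since Fig. 4.1 is not reproduced here, the structure function below (components 1 and 2 in parallel, in series with components 3 and 4, capped by the demand) is an assumption, chosen to be consistent with the calculations above.

```python
from itertools import product

levels = [(0, 1), (0, 1), (0, 2), (1, 2)]   # (down, up) capacity of each component
p = [0.96, 0.96, 0.99, 0.98]                # limiting availabilities
demand = 2

def phi(x, d=demand):
    # assumed structure: 1 and 2 in parallel, in series with 3 and 4
    return min(x[0] + x[1], x[2], x[3], d)

dist = {}
for ups in product((0, 1), repeat=4):
    prob, x = 1.0, []
    for i, u in enumerate(ups):
        prob *= p[i] if u else 1.0 - p[i]
        x.append(levels[i][u])
    s = phi(x)
    dist[s] = dist.get(s, 0.0) + prob

print(round(dist[2], 3))                                  # 0.894
print(round(dist[1] + dist[2], 3))                        # 0.988
print(round(sum(s * q for s, q in dist.items()) / 2, 3))  # 0.941
```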

4.2 Measures of Category 2

First we state some well-known asymptotic results for the (expected) number
of system failures below state k. To simplify the computations we assume
that the demand D(t) is a constant d.
Let N(t) denote the number of system failures (relative to level k) in
[0, t]. It can be shown using results from the theory of counting processes and
renewal processes, see e.g. Aven (1992), that

lim_{t→∞} EN(t)/t = Σ_{i=1}^{n} [h(1_i, p) − h(0_i, p)] / (MTTF_i + MTTR_i)     (4.1)

lim_{u→∞} E[N(u + s) − N(u)] = s Σ_{i=1}^{n} [h(1_i, p) − h(0_i, p)] / (MTTF_i + MTTR_i)     (4.2)

Furthermore, if the process is a regenerative process, i.e. there exist time
points at which the process (probabilistically) restarts itself, then with prob-
ability one,

lim_{t→∞} N(t)/t = Σ_{i=1}^{n} [h(1_i, p) − h(0_i, p)] / (MTTF_i + MTTR_i)
From these formulae approximations for N(t), EN(t) and E[N(u+s) − N(u)]
can be established. If the up time distribution F_i(t) has a constant failure
rate λ_i (for all i), then φ is a regenerative process, and it can be shown that
the expected number of system failures in the period (u, u + s] equals:

E[N(u + s) − N(u)] = Σ_{i=1}^{n} ∫_u^{u+s} [h(1_i, p(t)) − h(0_i, p(t))] λ_i p_i(t) dt

There does not exist any general formula for the distribution of the number of
system failures V = N(u + s) − N(u). Only in some special cases is it possible
to obtain practical computation formulae. For example, if all components
have exponentially distributed life times, it is possible to derive a simple
approximation formula. Let N_i(t) denote the number of failures of component
i in the time interval [0, t]. Then if the repair times are small compared to the
life times and the life times are exponentially distributed with parameter λ_i, it
follows that the number of failures of component i in the time interval (u, u +
s], N_i(u + s) − N_i(u), is approximately Poisson distributed with parameter
λ_i s. If the system is a series system, and we make the same assumptions as
above, the number of system failures in the interval (u, u + s] is approximately
Poisson distributed with parameter Σ_{i=1}^{n} λ_i s. The number of system failures
in [0, t], N(t), is approximately governed by a Poisson process with intensity
Σ_{i=1}^{n} λ_i.
If the components are highly available and have constant failure rates, the
Poisson distribution will produce good approximations also for more general
systems. The parameter, i.e. the mean, of the Poisson distribution is in prac-
tice calculated by using the asymptotic system failure rate λ_Φ defined by
(4.1). Assuming exponential life times, X(t) is a regenerative process with
renewal cycles given by the time between consecutive visits to the best state
(x_11, x_21, ..., x_n1), and it can be shown that N(t/θ) converges in distribu-
tion to a Poisson variable with mean t when λ_i MTTR_i → 0, i = 1, 2, ..., n,
where θ is a suitable normalising factor, see Aven and Haukaas (1996a),
Aven and Jensen (1996), Gertsbakh (1984), Gnedenko and Solovyev (1975),
Solovyev (1971) and Ushakov (1994). Suitable normalising factors include
p/E_0 T, 1/ET_Φ and λ_Φ, where p equals the probability that a system failure
occurs in a renewal cycle, E_0 T equals the expected length of the renewal cycle
when it is given that a system failure does not occur in the renewal cycle,
and ET_Φ equals the expected time to the first system failure. The asymp-
totic exponentiality follows by applying a generalized version of the following
well-known result (cf. Gertsbakh 1984, Kalashnikov 1989, Keilson 1975 and
Kovalenko 1994):
Let T_j, j = 1, 2, ..., be a sequence of non-negative i.i.d. random
variables with distribution function F(t), and ν a {T_j}-indepen-
dent geometrically distributed random variable with parameter p, i.e.
P(ν = k) = p(1 − p)^{k−1}, k = 1, 2, .... Then if ET_1 = a ∈ (0, ∞),

(p/a) Σ_{j=1}^{ν} T_j

converges in distribution to an exponential distribution with param-
eter 1 as p → 0.
Refer to Aven and Haukaas (1996a) and Haukaas (1995) for a study of the
accuracy of the Poisson approximation and a study of the performance of the
different normalising factors. See also Appendix B.
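The quoted limit result is easy to probe by simulation. The choice of exponential summands T_j below is an arbitrary assumption (for which the scaled geometric sum is in fact exactly Exp(1)), and p plays the role of the small per-cycle failure probability.

```python
import math
import random

def scaled_geometric_sum(p, a, rng):
    """(p/a) * sum_{j=1}^{nu} T_j with nu ~ geometric(p); T_j here are Exp(mean a)."""
    total = rng.expovariate(1.0 / a)
    while rng.random() >= p:          # cycle ends in "system failure" with prob. p
        total += rng.expovariate(1.0 / a)
    return (p / a) * total

rng = random.Random(7)
p, a = 0.02, 3.0
sample = [scaled_geometric_sum(p, a, rng) for _ in range(20_000)]
mean = sum(sample) / len(sample)
frac = sum(s <= 1.0 for s in sample) / len(sample)
print(abs(mean - 1.0) < 0.05)                     # Exp(1) has mean 1
print(abs(frac - (1.0 - math.exp(-1.0))) < 0.02)  # and P(X <= 1) = 1 - exp(-1)
```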
It is also possible to use Markov models to compute the measures of
category 2. Refer to Muppala et al. (1996) in this volume.
The measure I_2c is a subset of the measure I_2a, the distribution of V.
The measure I_2c can therefore be computed (approximated) using the above

approach. The reader should consult the recent work by Smith (1995) for some
interesting new results related to this measure. The measure I_2c has also been
studied by Natvig (1984, 1991). He has, however, focused on finding bounds
on I_2c under various assumptions. We will not look closer into this problem
here.

Example continued
Consider first the number of times the system state is below level 2. In this
case the system can be viewed as a series system. The number of system
failures is approximately governed by a Poisson process, with an intensity
which equals the sum of the failure rates: 1/480 + 1/480 + 1/990 + 1/490 ≈
7 × 10^{−3}. The formula (4.2) gives approximately the same intensity value.
Consider now the number of times the system state is below level 1. To
compute the expectation EV, we use formula (4.2):

EV/s ≈ (1 − p_2) p_3 λ_1 p_1 + (1 − p_1) p_3 λ_2 p_2 +
       [1 − (1 − p_1)(1 − p_2)] λ_3 p_3 = 1.1 × 10^{−3}

Hence the average number of times the state of the system equals 0 is ap-
proximately 10 (8760 × 1.1 × 10^{−3}) per year. The distribution of the number
of times the system state is below level 1 can be accurately approximated by
a Poisson distribution, see Haukaas (1995).
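Both intensities in this continued example can be checked in a few lines, a sketch of the quoted arithmetic using the rates and limiting availabilities listed earlier:

```python
lam = [1 / 480, 1 / 480, 1 / 990, 1 / 490]  # component failure rates (per hour)
p = [0.96, 0.96, 0.99, 0.98]                # limiting availabilities

# Level 2: the system behaves as a series system, intensity ~ sum of the rates
series_rate = sum(lam)

# Level 1 (formula (4.2)): only components 1, 2 and 3 can push the system to state 0
ev_rate = ((1 - p[1]) * p[2] * lam[0] * p[0]
           + (1 - p[0]) * p[2] * lam[1] * p[1]
           + (1 - (1 - p[0]) * (1 - p[1])) * lam[2] * p[2])

print(round(series_rate, 4))   # 0.0072 failures per hour
print(round(8760 * ev_rate))   # 10 visits to state 0 per year
```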

4.3 Measures of Category 3

From a practical point of view, it is not possible to find the exact distribution
of Y, the lost throughput (volume) in J, using analytical methods. It is,
however, possible to obtain an approximate distribution in many cases.
Writing

Y = ∫_J (D(t) − φ(t)) dt = Σ_l ∫_{a_l}^{b_l} (D(t) − φ(t)) dt ≡ Σ_l Y_l

and assuming that the lost volume in the intervals (a_l, b_l], Y_l, are approx-
imately independent and identically distributed, it follows by the Central
Limit Theorem that Y has an approximate normal distribution for large s,
cf. Asmussen (1987), Theorem 3.2, p. 136. To guarantee independent, identical
distributions and the asymptotic normality, the process D(t) − φ(t) must
be a regenerative process.
The mean of Y and Y^(i) can be calculated using Fubini's Theorem:

E ∫_J Z(t) dt = ∫_J EZ(t) dt

where Z(t) is one of the processes D(t) - φ(t), D(t) and φ(t). Using limiting
probabilities we can easily obtain approximate values for this mean value.
Hence the problem has been reduced to computing measures of category 1.
To compute the variance, the asymptotic results in Asmussen (1987) can
be used in the case that Z(t) is a regenerative process. Here we will, how-
ever, consider some simple alternative methods. For the sake of simplicity
we assume that the demand D(t) equals the maximum system state φ_M. We
restrict attention to the variable Y. We assume exponential life times, so that
φ is a regenerative process. The methods normally give good approximations
for highly available systems. Initially we look at the case that the system has
only two states, so that a "system failure" is well-defined.
Let Y_l denote the throughput loss in the lth interval J_l ≡ (a_l, b_l], where
a_l and b_l are constants. Then by assuming that the Y_l are approximately
independent and that the probability of having two or more system failures
occurring in J_l is small, we obtain

Var(Y) ≈ Σ_l Var(Y_l)

Var(Y_l) = Var[Y_l I(system fails in J_l)]
= E[Y_l^2 I(system fails in J_l)] - (E[Y_l I(system fails in J_l)])^2
= E(Y_l^2 | system fails in J_l) P(system fails in J_l)
  - (E(Y_l | system fails in J_l))^2 (P(system fails in J_l))^2
≈ E(Y_l^2 | system fails in J_l) P(system fails in J_l)
≈ E(Y_l^2 | system fails in J_l) λ_φ |J_l|

where λ_φ is the asymptotic system failure rate defined by formula (4.1) above,
and |J_l| equals the length of the subinterval J_l. By summing up and intro-
ducing c as the throughput loss rate if a system failure occurs, we find that

Var(Y) ≈ λ_φ |J| c^2 ∫ y^2 dG(y)

where G(y) equals a "typical" downtime distribution of the system, given
system failure. If we define a random variable R having this distribution
G(y), we can write the above formula for Var(Y) in the following form:

Var(Y) ≈ λ_φ |J| c^2 ER^2 = λ_φ |J| c^2 [Var(R) + (ER)^2]   (4.3)
Alternatively, this expression could have been established using the formula
for the variance of the compound Poisson process, cf. Ross (1970), p.244.
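The compound Poisson route to (4.3) is easy to check numerically. The sketch below uses purely illustrative numbers (the failure rate, interval length, loss rate and exponential downtimes are assumptions, not taken from the example) and compares a Monte Carlo estimate of Var(Y) with λ_φ|J|c^2[Var(R) + (ER)^2].

```python
import math
import random

# Compound Poisson check of (4.3): Y = c*(R_1 + ... + R_N) with
# N ~ Poisson(lam_phi * |J|).  All numbers below are illustrative.
random.seed(2024)
lam_phi, J, c = 2.0e-4, 8760.0, 2.0   # system failure rate, interval, loss rate
mean_r = 20.0                          # exponential downtimes with mean 20

def poisson(mu):
    # Knuth's method, adequate for small mu.
    L, k, p = math.exp(-mu), 0, 1.0
    while p > L:
        p *= random.random()
        k += 1
    return k - 1

mu = lam_phi * J
nsim = 200_000
ys = [c * sum(random.expovariate(1.0 / mean_r) for _ in range(poisson(mu)))
      for _ in range(nsim)]
m = sum(ys) / nsim
var_mc = sum((y - m) ** 2 for y in ys) / (nsim - 1)

# Formula (4.3); for exponential R, Var(R) + (ER)^2 = 2*(ER)^2.
var_formula = lam_phi * J * c * c * 2 * mean_r ** 2
print(round(var_mc / var_formula, 3))
```

The printed ratio should be close to 1, which is exactly the content of (4.3) for a compound Poisson loss process.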
The steady state distribution (or an approximation of this distribution) is
normally used as the G(y) function, cf. Appendix A. Thus for e.g. a parallel
system comprising two stochastically identical components we have

G(y) = 1 - [1 - H(y)] ∫_y^∞ [1 - H(r)] dr / MTTR
Availability Analysis of Monotone Systems 215

where H(y) equals the distribution of the component repair times. The steady
state formula represents the system downtime distribution at time "t = ∞",
and does not depend on the life time distribution. The asymptotic distribu-
tion gives very good approximations if it is likely that the number of com-
ponent failures at the occurrence of the system failure is relatively large for
each component.
From formula (4.3) we see that for a series system of highly available
components, we can write

Var(Y) ≈ |J| c^2 Σ_i λ_i ER_i^2   (4.4)

where R_i is the repair time of component i.

Now if there exists more than one system failure state, i.e. states below
φ_M, we can calculate an approximate Var(Y) by establishing a formula as
above (formula (4.3)) for each system failure state, and summing up. The
calculations will be demonstrated by an example.

Example continued
We assume as an approximation that the component availabilities p_i(t) equal
the limiting availabilities for all t in the interval. Then we can calculate an
approximate distribution of the lost throughput in this time interval by using
a normal distribution with mean equal to 8760 × 2 × (1 - 0.941) =
8760 × 0.118 = 1033. It remains to calculate an approximate variance
using the approach described above. For both system failure states, we can in
this case consider the system as a series type structure and use formula (4.4),
observing that the probability of having more than one component down at
the same time is small. Some straightforward calculations give
Var(Y) ≈ 2 × 10^4
Using this value for the variance we can easily calculate an approximate
probability distribution for the lost throughput in J:

Table 4.1. Distribution of lost throughput

Lost throughput/demand   3.00-4.99%   5.00-6.99%   7.00-8.99%
Probability                 0.14         0.78         0.08

In Aven (1993) the above approximations were compared with results
obtained by Monte Carlo simulation. The distribution of the lost through-
put estimated from the simulations is approximately the same as the one
established above.
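The entries of Table 4.1 can be recomputed (up to rounding) from the normal approximation with mean 1033 and variance 2 × 10^4; the yearly demand 8760 × 2 is taken from the example, and the normal CDF is evaluated with the error function. Small deviations from the table come from rounding in the intermediate values.

```python
import math

def norm_cdf(x, mu, sd):
    # Normal CDF via the error function.
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

demand = 8760 * 2                  # total demanded volume over the interval
mu = demand * (1 - 0.941)          # mean lost throughput, about 1033
sd = math.sqrt(2e4)                # from Var(Y) ~ 2e4

for lo, hi in [(0.03, 0.05), (0.05, 0.07), (0.07, 0.09)]:
    p = norm_cdf(hi * demand, mu, sd) - norm_cdf(lo * demand, mu, sd)
    print(f"{lo:.0%}-{hi:.0%} of demand: {p:.2f}")
```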

The asymptotic properties, as s → ∞, of integrals/sums as above have
been studied by a number of researchers, see Asmussen (1987), Birolini
(1994), Csenki (1994), Gut and Janson (1983), Natvig and Streller (1984),
Smith (1995), Streller (1980) and Takacs (1957) and the references therein;
cf. also Muppala et al. (1996) in this volume. Under natural assumptions it
is proved that the integrals/sums are asymptotically (approximately) nor-
mally distributed. For the classical binary case where the state process of
the system generates an alternating renewal process, it is shown in Takacs
(1957) that ∫_0^s φ(t) dt / s is asymptotically normal as s → ∞ with mean
MTTF/(MTTF + MTTR) and variance equal to

(MTTF^2 τ^2 + MTTR^2 σ^2) / [(MTTF + MTTR)^3 s]

where σ^2 and τ^2 denote the variance of the up-times and the downtimes,
respectively.
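The Takacs result is easy to illustrate by simulation. The sketch below uses exponential up-times (so σ^2 = MTTF^2) and constant repair times (so τ^2 = 0); the MTTF/MTTR values are illustrative assumptions, and s is chosen large enough for the asymptotics to apply.

```python
import random

# Interval availability A(s) = (uptime in [0, s]) / s for an alternating
# renewal process: exponential up-times (sigma2 = MTTF^2), constant repairs
# (tau2 = 0).  Compare the sample variance of A(s) over many replications
# with the Takacs expression (MTTF^2*tau2 + MTTR^2*sigma2)/((MTTF+MTTR)^3*s).
random.seed(7)
MTTF, MTTR, s = 480.0, 20.0, 100_000.0

def interval_availability():
    t, up = 0.0, 0.0
    while t < s:
        u = random.expovariate(1.0 / MTTF)
        up += min(u, s - t)          # truncate the last up-period at s
        t += u + MTTR
    return up / s

reps = 2000
xs = [interval_availability() for _ in range(reps)]
mean_mc = sum(xs) / reps
var_mc = sum((x - mean_mc) ** 2 for x in xs) / (reps - 1)
var_th = (MTTR ** 2 * MTTF ** 2) / ((MTTF + MTTR) ** 3 * s)
print(round(mean_mc, 4), round(var_mc / var_th, 2))
```

The simulated mean should be close to MTTF/(MTTF + MTTR) = 0.96 and the variance ratio close to 1.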
The approximation to a normal distribution gives poor results if the inter-
val is small. Then alternative calculation methods must be used, see Haukaas
and Aven (1996b), Smith (1995) and Appendix B.

4.4 Measures of Category 4

With respect to computation, this category is similar to measures of cate-
gory 3.

Acknowledgement. The author would like to thank M.A.J. Smith, Erasmus Univer-
sity, A. Csenki, University of Bradford, and I. Kovalenko, Ukrainian Academy of
Sciences, for valuable discussions.

References

Asmussen, S.: Applied Probability and Queues. Chichester: Wiley 1987


Aven, T.: Reliability Evaluation of Multistate Systems of Multistate Components.
IEEE Trans. Reliability 34, 473-479 (1985)
Aven, T.: Availability Evaluation of Flow Networks with Varying Throughput-
Demand and Deferred Repairs. IEEE Trans. Reliability 38, 499-505 (1990)
Aven, T.: Reliability and Risk Analysis. London: Elsevier 1992
Aven, T.: On Performance Measures for Multistate Monotone Systems. Reliability
Engineering and System Safety 41, 259-266 (1993)
Aven, T., Haukaas, H.: Asymptotic Poisson Distribution for the Number of System
Failures of a Monotone System. Paper submitted for publication (1996a)
Aven, T., Haukaas, H.: A Note on the Steady State Downtime Distribution of a
Monotone System. Paper submitted for publication (1996b)
Aven, T., Jensen, U.: Asymptotic Distribution for the Downtime of a Monotone
System. ZOR - Methods and Models of Operations Research. Special Issue on
Stochastic Models in Reliability. To appear (1996)
Birolini, A.: Quality and Reliability of Technical Systems. Berlin: Springer 1994

Birolini, A.: On the Use of Stochastic Processes in Modeling Reliability Problems.
Lecture Notes in Economics and Mathematical Systems 252. Berlin: Springer
1985
Brok, J.F.: Availability Assessment of Oil and Gas Production Systems. Interna-
tional Journal of Quality and Reliability Management 4, 21-36, (1987)
Brouwers, J. J.: Probabilistic Descriptions of Irregular System Downtime. Reliability
Engineering 15, 263-281 (1986)
Brouwers, J.J., Verbeek, P.H., Thomson, W.R.: Analytical System Availability
Techniques. Reliability Engineering 17, 9-22 (1987)
Csenki, A.: Cumulative Operational Time Analysis of Finite Semi-Markov Relia-
bility Models. Reliability Engineering and System Safety 44, 17-25 (1994)
Csenki, A.: An Integral Equation Approach to the Interval Reliability of Systems
Modelled by Finite Semi-Markov Processes. Reliability Engineering and System
Safety 47, 37-45 (1995)
Dekker, R., Groenendijk, W.: Availability Assessment Methods and Their Appli-
cation in Practice. Microelectronics and Reliability 35, 1257-1274 (1995)
Donatiello, L., Iyer, B.R.: Closed-Form Solution for System Availability Distribu-
tion. IEEE Trans. Reliability 36, 45-47 (1987)
Funaki, K., Yoshimoto, K.: Distribution of Total Uptime During a Given Time
Interval. IEEE Trans. Reliability 43, 489-492 (1994)
Gertsbakh, I.B.: Asymptotic Methods in Reliability Theory: A Review. Adv. Appl.
Prob. 16, 147-175 (1984)
Gnedenko, D.B., Solovyev, A.D.: Estimation of the Reliability of Complex Renew-
able Systems. Engineering Cybernetics 13, 89-96 (1975)
Gnedenko, B.V., Ushakov, I. A.: Probabilistic Reliability Engineering. New York:
Wiley 1995
Gut, A., Janson, S.: The Limiting Behaviour of Certain Stopped Sums and Some
Applications. Scand. J. Statist. 10, 281-292 (1983)
Haukaas, H.: Contributions to Availability Analysis of Monotone Systems. Ph.D.
Dissertation, University of Oslo (1995)
Haukaas, H., Aven, T.: Availability Evaluation of Gas and Oil Production and
Transportation Systems. Paper presented at the PSAM II Conference, San
Diego (1994)
Haukaas, H., Aven, T.: A General Formula for the Downtime of a Parallel System.
J. Appl. Prob. To appear (1996a)
Haukaas, H., Aven, T.: Formula for the Downtime Distribution of a System Ob-
served in a Time Interval. Reliability Engineering and System Safety. To appear
(1996b)
Heijden, M.C. van der, Schornagel, A.: Interval Uneffectiveness Distribution for a
k-out-of-n Multistate Reliability System with Repair. European J. Oper. Res.
36, 66-77 (1988)
Kalashnikov, V.V.: Analytical and Simulation Estimates of Reliability for Regen-
erative Models. Syst. Anal. Model. Simul. 6, 833-851 (1989)
Keilson, J.: Markov Chain Models - Rarity and Exponentiality. Berlin: Springer
1975
Korczak, E.: Binary Methods in Reliability Analysis of Multistate Monotone
Systems. Research Report, Przemyslowego Instytutu Telekomunikacji, Warszawa
(1993)
Kovalenko, I.N.: Rare Events in Queueing Systems - A Survey. Queueing Systems
16, 1-49 (1994)
Muppala, J.K., Malhotra, M., Trivedi, K.S.: Markov Dependability Models of Com-
plex Systems: Analysis Techniques. In this volume (1996), pp. 442-486

Natvig, B.: Multistate Coherent Systems. In: Johnson and Kotz (eds.): Encyclopedia
of Statistical Science 5. New York: Wiley (1984)
Natvig, B.: Strict and Exact Bounds for the Availabilities in a Fixed Time Interval
for Multistate Monotone Systems. Research Report. University of Oslo (1991)
Natvig, B., Streller, A.: The Steady State Behaviour of Multistate Monotone Sys-
tems. J. Appl. Prob. 21, 826-835 (1984)
Ostebo, R.: System Effectiveness Assessment in Offshore Field Development Us-
ing Life Cycle Performance Simulation. Proceedings of Annual Reliability and
Maintainability Symposium (RAMS). Atlanta (1993)
Ross, S.M.: Applied Probability Models with Optimization Applications. San Fran-
cisco: Holden-Day 1970
Smith, M.A.J.: The interval availability of complex systems. Research Report. Eras-
mus University Rotterdam (1995)
Smit, A.C.J.M., van Rijn, C.F.H., Vanneste, S. G.: SPARC: A Comprehensive Re-
liability Engineering Tool. In: Flamm, J. (ed.): Proceedings of the 6th ESReDA
Seminar on Maintenance and System Effectiveness. Chamonix (1995)
Solovyev, A.D.: Asymptotic Behavior of the Time of First Occurrence of a Rare
Event. Engineering Cybernetics 9, 1038-1048 (1971)
Streller, A.: A Generalization of Cumulative Processes. Elektr. Informationsverarb.
Kybern. 16, 449-460 (1980)
Takacs, L.: On Certain Sojourn Time Problems in the Theory of Stochastic Pro-
cesses. Acta Math. Acad. Sci. Hungar. 8, 169-191 (1957)
Ushakov, I.A. (ed.): Handbook of Reliability Engineering. New York: Wiley 1994

Appendix

A. Downtime Distribution Given System Failure


In this appendix we present formulae for the distribution of the downtime of
a binary monotone system. We focus on the steady state distribution. The
model is as described in Section 2 with φ and X_i taking values in {0, 1} and
the demand equal to 1.
Assume first that φ is a parallel system of n identical components. Then
the steady state downtime distribution given system failure, is given by

G(y) = 1 - [1 - H(y)] [ ∫_y^∞ [1 - H(x)] dx / MTTR ]^{n-1}
To see this, let R* be the remaining repair time at a given point in time
of a failed component in steady state. It is well-known from the theory of
alternating renewal processes that the probability distribution of R* is given
by

P(R* > r) = ∫_r^∞ [1 - H(x)] dx / MTTR,
cf. e.g. Birolini (1994). Let Y* be the downtime of a system failure that occurs
at a given point in time in steady state, caused by the failure of component
i. Then, since the processes are stochastically independent, it follows that

P(Y* ≤ y) = 1 - P(Y* > y) = 1 - [1 - H(y)][P(R* > y)]^{n-1}
= 1 - [1 - H(y)] [ ∫_y^∞ [1 - H(x)] dx / MTTR ]^{n-1}
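As an illustration (the exponential repair assumption is ours, not part of the appendix): for H(y) = 1 - e^(-y/MTTR) the integral equals MTTR·e^(-y/MTTR), so the formula collapses to G(y) = 1 - e^(-ny/MTTR). Equivalently, by memorylessness the system downtime is the minimum of n exponential repair times, which the short simulation below confirms.

```python
import math
import random

# Steady-state system downtime distribution for a parallel system of n
# identical components with exponential repair times (illustrative numbers).
random.seed(3)
n, MTTR = 3, 20.0

def G(y):
    tail = math.exp(-y / MTTR)                 # 1 - H(y)
    integral = MTTR * tail                     # int_y^inf [1 - H(x)] dx
    return 1.0 - tail * (integral / MTTR) ** (n - 1)

# By memorylessness the remaining repair times R* are again exponential, so
# the downtime Y* is the minimum of n exponentials.
reps = 100_000
y0 = 10.0
hits = sum(min(random.expovariate(1.0 / MTTR) for _ in range(n)) <= y0
           for _ in range(reps))
print(round(G(y0), 4), round(hits / reps, 4))
```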
Next we assume that φ is a parallel system of not necessarily identical com-
ponents. Then the steady state downtime distribution given system failure,
is given by

G(y) = Σ_{j=1}^n β_j G_j(y)

where G_j(y) = 1 - [1 - H_j(y)] Π_{i≠j} P(R_i* > y) is the downtime distribution
given that component j causes the system failure, H_j is the repair time
distribution of component j, and

β_j = (1/MTTR_j) / Σ_{i=1}^n (1/MTTR_i)
denotes the steady state probability that component j causes a system failure.
This result is shown as above for identical components, the difference being
that we have to take into consideration which component causes system
failure and the probability of this event given system failure. In view of (4.1)
and (4.2) the probability that component j causes system failure equals

[1/(MTTF_j + MTTR_j)] Π_{i≠j} p̄_i / Σ_{l=1}^n [1/(MTTF_l + MTTR_l)] Π_{i≠l} p̄_i
= (1/MTTR_j) Π_{i=1}^n p̄_i / Σ_{l=1}^n (1/MTTR_l) Π_{i=1}^n p̄_i = β_j
where p̄_i equals the steady state unavailability of component i, i.e. p̄_i =
1 - p_i = MTTR_i/(MTTF_i + MTTR_i).
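The reduction to the 1/MTTR weights can be verified numerically; the MTTF/MTTR values below are illustrative assumptions.

```python
# Check that the ratio built from 1/(MTTF_j + MTTR_j) and the component
# unavailabilities reduces to beta_j = (1/MTTR_j) / sum_i (1/MTTR_i).
MTTF = [480.0, 990.0, 490.0]
MTTR = [10.0, 20.0, 40.0]
n = len(MTTF)
pbar = [MTTR[i] / (MTTF[i] + MTTR[i]) for i in range(n)]   # unavailabilities

def prod(xs):
    out = 1.0
    for x in xs:
        out *= x
    return out

num = [prod(pbar[i] for i in range(n) if i != j) / (MTTF[j] + MTTR[j])
       for j in range(n)]
beta_long = [x / sum(num) for x in num]
beta_simple = [(1.0 / MTTR[j]) / sum(1.0 / r for r in MTTR) for j in range(n)]
print([round(b, 4) for b in beta_long])
print([round(b, 4) for b in beta_simple])
```

Both lines print the same weights, illustrating that the fastest-to-repair component carries the largest probability of being the cause of a system failure.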
Consider now an arbitrary, binary monotone system comprising the min-
imal cut sets K_j, j = 1, 2, .... By looking at the system structure as a series
structure of the minimal cut parallel structures, an approximate formula can
be established using the above results. The probability that a system failure
is caused by minimal cut parallel structure j, is (approximately) given by:

m_j = λ_{K_j} / Σ_l λ_{K_l}

An approximate formula for the steady state downtime distribution, G_φ, is
thus given by

G_φ(y) ≈ Σ_j m_j G_{K_j}(y)

where G_{K_j} is the downtime distribution of minimal cut parallel structure j
given its failure.
This formula will produce good approximations for highly available compo-
nents, see Aven and Haukaas (1996b), Haukaas (1995) and Haukaas and Aven
(1996a).
The references Haukaas (1995), Haukaas and Aven (1996a) and Smith
(1995) also include a transient analysis of the downtime distribution. Formulae
are established which give improved results for the first system failure in
the time interval.

B. Interval Downtime Distribution

In this appendix we present an approximation formula for the downtime dis-
tribution in a time interval, using the results from Appendix A and Haukaas
(1995), Haukaas and Aven (1996b) and Smith (1995).

A useful lemma

Consider a unit that is put into operation at time zero. At failures the unit
is repaired, and put into operation again. Let R_j, j = 1, 2, ..., denote the
consecutive repair times (downtimes). We assume that the R_j's are stochas-
tically independent. Let H_{R_j}(r) denote the distribution of R_j. Furthermore
let N*(s) denote the number of system failures after s operational time units.
In addition define

N*(s-) = lim_{z↑s} N*(z)

Assume that the repair times are independent of the process N*(s). Let Z(s)
denote the total downtime associated with the operating time s, but not
including s, i.e.

Z(s) = Σ_{i=1}^{N*(s-)} R_i

By convention, Σ_{i=1}^0 = 0. Define

T(s) = s + Z(s)

We see that T(s) represents the calendar time after an operation time of
s time units and the completion of the repairs associated with the failures
that occurred up to s but not including s.
Now, let Y(t) denote the total downtime of the unit in the time interval
[0, t]. The following lemma gives an exact expression for the probability dis-
tribution of Y(t).

Lemma B.1. The distribution of the downtime in a time interval of length
t is given by

P(Y(t) ≤ y) = Σ_{n=0}^∞ H^{(n)}(y) P(N*((t - y)-) = n),   y ≥ 0

where H^{(n)}(y), y ≥ 0, is the convolution of H_{R_1}, H_{R_2}, ..., H_{R_n}, i.e.

H^{(n)}(y) = P(R_1 + R_2 + ... + R_n ≤ y) for n ≥ 1, and H^{(0)}(y) = 1

Proof. To prove the lemma observe that

P(Y(t) ≤ y) = P(T(t - y) ≤ t) = P(t - y + Z(t - y) ≤ t) = P(Z(t - y) ≤ y)

The first equality follows by noting that the event Y(t) ≤ y is equivalent
to the event that the uptime in the interval [0, t] is equal to or longer than
t - y. This means that the point in time when the total uptime of the system
equals t - y must occur before or at t, i.e. T(t - y) ≤ t. Now using a standard
conditional probability argument it follows that

P(Z(t - y) ≤ y) = Σ_{n=0}^∞ P(Z(t - y) ≤ y | N*((t - y)-) = n) P(N*((t - y)-) = n)
= Σ_{n=0}^∞ H^{(n)}(y) P(N*((t - y)-) = n)

We have used that the repair times are independent of the process N*(s).

Remark B.1. Different versions of the above lemma have been formulated and
proved, cf. Birolini (1985, 1994), Donatiello and Iyer (1987), Funaki and
Yoshimoto (1994) and Takacs (1957). The above proof seems to be the sim-
plest one.
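A quick numerical check of the lemma, for an illustrative single unit: with exponential up-times N*(s) is a Poisson process, and with constant repair times R the convolution H^(n)(y) is just I(nR ≤ y), so the right-hand side reduces to a Poisson CDF, which a direct simulation of the alternating process reproduces.

```python
import math
import random

# Lemma B.1 for one unit: exponential up-times (mean MTTF) and constant
# repairs R.  Then P(Y(t) <= y) = P(N*(t - y) <= floor(y/R)).
# All parameter values are illustrative.
random.seed(11)
MTTF, R, t, y = 100.0, 20.0, 1000.0, 200.0

mu = (t - y) / MTTF                           # Poisson mean of N*(t - y)
k = int(y // R)
p_lemma = sum(math.exp(-mu) * mu ** n / math.factorial(n) for n in range(k + 1))

def total_downtime():
    # Simulate the alternating process on [0, t]; repairs truncated at t.
    clock, down = 0.0, 0.0
    while clock < t:
        clock += random.expovariate(1.0 / MTTF)    # operating period
        if clock >= t:
            break
        down += min(R, t - clock)
        clock += R
    return down

reps = 20_000
p_mc = sum(total_downtime() <= y for _ in range(reps)) / reps
print(round(p_lemma, 3), round(p_mc, 3))
```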

Computing the downtime distribution of a monotone system

We consider a binary, monotone system φ. The model is as described in
Section 2 with φ and X_i taking values in {0, 1} and the demand equal to
1. We assume that the time to failure of component i has an exponential
distribution with mean MTTF_i and the time to repair has a distribution
with mean MTTR_i. Refer to Smith (1995) for an analysis covering non-
exponential life times. The components are assumed to be highly available,
so that MTTF_i is much larger than MTTR_i.
Let Y(t) denote the total downtime of the system in the time interval
[0, t]. The number of system failures after s operational time units is denoted
N*(s).
Given the above model assumptions, it can be argued that N*(s) is ap-
proximately a homogeneous Poisson process with intensity λ, where λ is given
by

λ = λ_φ / h(p)

To motivate this result, we note that the expected number of system failures
per unit of time when considering calendar time is approximately equal to
λ_φ, given by (see Section 4.2)

λ_φ = Σ_{i=1}^n [h(1_i, p) - h(0_i, p)] / (MTTF_i + MTTR_i)

Then observing that the ratio between calendar time and operational time is
approximately 1/h(p), we see that the expected number of system failures per
unit of time when considering operational time, E(N*(s+v) - N*(s))/v, is ap-
proximately equal to λ_φ/h(p). It is not difficult to see that this expectation is
approximately independent of the history of N* up to s, noting that the state
process X frequently restarts itself probabilistically (i.e. X = (1, 1, ..., 1)).
The system downtimes are approximately identically distributed with dis-
tribution G(r) (see Section 4.3 and Appendix A) independent of N*, and
approximately independent, observing that the state process X with a high
probability restarts itself just after a system failure.
As an approximation we can therefore assume that the conditions of the
lemma are satisfied, with N*(s) approximately a Poisson process with inten-
sity λ.
Now using the above lemma it follows that

P(Y(t) ≤ y) ≈ Σ_{n=0}^∞ G^{(n)}(y) [λ(t - y)]^n e^{-λ(t-y)} / n! ≡ P_t(y)   (B.1)

Note that the above lemma does not require identically distributed down-
times. Hence formula (B.1) can also be used with G^{(n)}(y) as the convolution
of not necessarily identically distributed downtimes given system failure, cf.
the analysis in Haukaas (1995) and Smith (1995).
In the case that the expected number of system failures in the interval
is small, significantly less than 1, P_t(y) can be accurately approximated by
some simple bounds:

e^{-λ(t-y)} [1 + λ(t - y) G(y)] ≤ P_t(y) ≤ e^{-λ(t-y)[1 - G(y)]}

The lower bound follows by including only the first two terms of the sum in
P_t(y), whereas the upper bound follows by using the inequality G^{(n)}(y) ≤ [G(y)]^n.

In Haukaas (1995) and Haukaas and Aven (1996b), it is demonstrated by
using Monte Carlo simulations that P_t(y) gives a good approximation to
the downtime distribution, P(Y(t) ≤ y), in the case that the components are
highly available. Table B.1 below shows the simulation results for the parallel
system analysed in Section 4.1. The system comprises 2 identical components,
with

Time to repair for a component ≡ 20, MTTF_i = 480

The length of the time interval is 8760 units of time. Both components are
assumed to be functioning at time zero. The number of simulation runs was
30000, so the standard deviation is bounded by (0.5 × 0.5/30000)^{1/2} ≈ 0.003.

Table B.1. Estimated downtime distribution

                Estimate of P(Y(8760) ≤ y)
 y    Formula (B.1)    Monte Carlo Sim.
 0        0.246            0.249
 2        0.281            0.284
 4        0.320            0.319
 6        0.361            0.358
 8        0.405            0.399
10        0.451            0.445
12        0.501            0.495
14        0.554            0.546
16        0.610            0.604
18        0.670            0.664
20        0.733            0.729
22        0.763            0.758
24        0.792            0.786
26        0.819            0.814
28        0.845            0.840
30        0.868            0.866
32        0.890            0.887
34        0.909            0.906
36        0.926            0.924
38        0.940            0.938
40        0.950            0.949
45        0.969            0.969
50        0.982            0.982
60        0.994            0.994
70        0.998            0.998
80        0.999            1.000

We see that the approximation is very good for this case. Refer to Aven
and Jensen (1996) for some formal asymptotic results related to the downtime
distribution of Y(t).
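A few entries of the Formula (B.1) column can be recomputed directly. With constant repair times and MTTR = 20, the Appendix A formula gives a uniform steady-state system downtime on (0, 20), so G^(n) is the CDF of a sum of n Uniform(0, 20) variables (a scaled Irwin-Hall distribution); for λ we take λ_φ/h(p) with λ_φ = 2(1 - p)/(MTTF + MTTR) and h(p) = 1 - (1 - p)^2 for the two-component parallel system, as in Section 4.2. This derivation of λ is our reading of the text, so treat the script as a sketch.

```python
import math

# Recompute P(Y(8760) <= y) from formula (B.1) for the parallel system of
# 2 identical components with constant repairs (= 20) and MTTF = 480.
MTTF, MTTR, t = 480.0, 20.0, 8760.0
p = MTTF / (MTTF + MTTR)
lam_phi = 2 * (1 - p) / (MTTF + MTTR)
lam = lam_phi / (1 - (1 - p) ** 2)            # lambda = lambda_phi / h(p)

def irwin_hall_cdf(x, n):
    # P(U_1 + ... + U_n <= x) for U_i ~ Uniform(0, 1); empty sum for n = 0.
    if n == 0:
        return 1.0 if x >= 0 else 0.0
    s = sum((-1) ** k * math.comb(n, k) * (x - k) ** n
            for k in range(int(math.floor(x)) + 1))
    return min(1.0, max(0.0, s / math.factorial(n)))

def P_t(y, nmax=60):
    x = lam * (t - y)
    return sum(irwin_hall_cdf(y / MTTR, n) * math.exp(-x) * x ** n / math.factorial(n)
               for n in range(nmax + 1))

for y in (0, 20, 40):
    print(y, f"{P_t(y):.3f}")   # compare with Table B.1: 0.246, 0.733, 0.950
```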
Optimal Replacement of Monotone
Repairable Systems
Terje Aven
Rogaland University Centre, Ullandhaug, 4004 Stavanger, Norway

Summary. In this chapter we study the optimal replacement problem of a mono-
tone (coherent) system comprising n components. The optimality criterion is the
long run expected cost per unit of time. Emphasis is placed on situations where the
repair actions can be modelled as "minimal" repairs.

Keywords. Optimal replacement, maintenance optimization, minimal repair, mono-
tone systems, coherent systems, shock models

1. Introduction

Multicomponent systems can be represented in various ways. One general
formulation is to use the theory of monotone systems, see e.g. Aven (1992),
Barlow and Proschan (1975). A monotone system is a multicomponent sys-
tem which has the natural property that the system cannot deteriorate by
improving the performance of a component. The monotone systems include
series systems, which function if and only if all their components function, and
parallel systems, which function if and only if at least one of their components
is functioning. There exist several papers treating replacement problems re-
lated to a monotone system. Some key references are Aven (1983, 1987, 1992,
1996), Bergman (1978) and Jensen (1990). In most of the models presented
the system is replaced at failures.
In fact, most replacement models found in the literature are concerned
with non-repairable systems, cf. the review papers by Pierskalla and Voelker
(1979) and Valdez-Flores and Feldman (1989). In recent years there has how-
ever been an increasing interest in repairable systems, motivated by the fact
that most real-life systems are repairable. Particular focus has been placed
on minimal repair models.
When considering maintenance actions, one key question is the following:
In what condition is the unit (component/system) after the maintenance
action has been performed? Some typical situations are described in the fol-
lowing:
- The maintenance action is a major repair that brings the unit to a condi-
tion which is considered to be as good as new, or equivalently, the unit is
physically replaced by a new and identical unit. This maintenance action
is referred to as a replacement. A replacement can be carried out at failure,
or as preventive maintenance.

- The maintenance action is such that the unit is considered to be as good
as it was immediately before the failure occurred. This maintenance action
is referred to as a (statistical) minimal repair. A minimal repair is only
carried out at failure, and means that the age of the unit is not changed
as a result of the repair. If the unit is minimally repaired at failures, the
failure process is modelled as a non-homogeneous Poisson process.
- The maintenance action brings the unit to a condition which is somewhere
between the result of a replacement and a minimal repair.
- The result of the maintenance action is unsuccessful, for example in the way
that the wrong part is replaced/repaired or that some damage is inflicted
on the unit during the maintenance. This could result in a higher failure
intensity, and is often referred to as an imperfect repair.
- In some occasions information about the failure causes makes it possible
to improve the unit. The result of the maintenance action could then be
an improved failure intensity of the unit.
Of course, when developing a model of a system it is important to strike a
balance between the following two desired properties: The model must be
sufficiently simple that the system can be studied by mathemat-
ical/statistical methods, and it must be a sufficiently accurate representation
of the system.
In this chapter we restrict attention to the situation where the mainte-
nance action results in a minimal repair or replacement.
Using the minimal repair concept it is possible to describe in a simple way
the fact that many repairs in real life bring the unit (component/system) to a
condition which is basically the same as it was just before the failure occurred.
Of course, the purpose of the repair action is not to bring the unit to the
exact same condition. Rather the purpose is to get the unit in operation as
soon as possible. But by looking at the condition of the unit after the repair,
it is a reasonable assumption in many cases to use the non-homogeneous
Poisson process as an approximation to describe the failure process since a
failure does not change the age of the unit in such a process.
The cost structure we consider is very simple. The results obtained in this
paper can, however, easily be extended to a more general cost structure.
The paper is organised as follows. In Section 2 we review the classical
minimal repair/replacement model for a one-component system. Then in Section 3
we consider a general condition based approach where the system is repaired
or replaced at system failures. The set-up includes monotone system models
as special cases. The approach, which is based on counting process theory,
provides a framework for analysing a large number of condition based main-
tenance models, for repairable and non-repairable systems, see Aven (1996).
Refer to Bremaud (1981) for a presentation of counting process theory. Sec-
tion 3 is to a large extent based on Aven (1983, 1987,1996). A shock damage
model is also covered. In Section 4 we discuss the situation where the com-
ponents are minimally repaired at failures.

2. Basic Minimal Repair/Replacement Model


Consider a system that is replaced at times T, 2T, 3T, .... At failures minimal
repairs are carried out. With minimal repairs the age of the system is not
affected, and the number of failures in the time interval [0, T] follows a non-
homogeneous Poisson process with an intensity function λ(t). The expected
number of minimal repairs in the time interval [0, T] is EN(T) = ∫_0^T λ(t) dt. The
cost of a minimal repair is c (c > 0) and the cost of a replacement equals K
(K > 0).
The long run average cost per unit of time when adopting this minimal
repair/replacement policy equals

B_T = [c ∫_0^T λ(t) dt + K] / T
From this expression it is straightforward to find an optimal T.
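For a concrete choice of intensity the optimisation is explicit. The sketch below uses an illustrative power-law intensity λ(t) = (β/θ)(t/θ)^(β-1), so that ∫_0^T λ(t) dt = (T/θ)^β, and checks the resulting closed-form minimiser against a grid search; all parameter values are assumptions for illustration.

```python
# Cost rate B_T = (c * int_0^T lambda(t) dt + K) / T for the illustrative
# power-law intensity lambda(t) = (beta/theta)*(t/theta)**(beta-1), so that
# int_0^T lambda(t) dt = (T/theta)**beta.
c, K = 10.0, 500.0
theta, beta = 1000.0, 2.0

def B(T):
    return (c * (T / theta) ** beta + K) / T

# Setting B'(T) = 0 gives T* = theta * (K / (c*(beta - 1)))**(1/beta).
T_star = theta * (K / (c * (beta - 1))) ** (1.0 / beta)

# Crude grid search around T* as a sanity check.
grid = [T_star * (0.5 + 0.001 * i) for i in range(1001)]
T_num = min(grid, key=B)
print(round(T_star, 1), round(T_num, 1), round(B(T_star), 4))
```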

3. Maintenance Action at System Failure

3.1 General Model

Consider a system subject to random deterioration and failure. Assume that
there is available information about the underlying condition of the system,
for example through measurements of wear characteristics and damage in-
flicted on the system, and that the proneness of the system to failure can be
characterized by a failure intensity, which is adapted to this information.
Let X(t), t ≥ 0, be an observable stochastic process, possibly a vector pro-
cess, representing the condition of the system at time t, assuming no planned
replacement in the interval [0, t]. A planned replacement of the system is
scheduled at time T, which may depend on the condition of the system, i.e.
on the process X(t). The replacement time T is a stopping time in the sense
that the event {T ≤ s} depends on the process X(t) up to time s. There is no
planned replacement if T = ∞. We define N(t) as the number of failures in
[0, t], assuming no planned replacements in this interval. The failure intensity
process, which is denoted λ(t), may depend on X(s), 0 ≤ s ≤ t. Often we can
formulate the relation in the following way:

λ(t) = v(X(t))

where v(x) is a deterministic function. The interpretation of λ(t) is that given
the history of the system up to time t the probability that the system shall
fail in the interval (t, t+h) is approximately λ(t)h. We assume that the failure
intensity process λ(t) is non-decreasing.
If the failure intensity process depends only on the state process X(t) and
not on the failure process N(t), we can interpret the repairs as minimal: a
repair which changes neither the age of the system nor the information about
the condition of the system. In this case, the running information about the
condition of the system can be thought to be related to a system which is
always functioning.
The following simple cost structure is assumed: A planned replacement
of the system costs K (> 0) and a repair/replacement at system failure costs
c (> 0).
It is assumed that the systems generated by replacements are stochasti-
cally independent and identical, the same replacement policy is used for each
system and the replacement and repairs take negligible time.
The problem is to determine a replacement time minimizing the long run
(expected) cost per unit time.
Let M_T and S_T denote the expected cost associated with a replacement
cycle and the expected length of a replacement cycle, respectively. We restrict
attention to T's having M_T < ∞ and S_T < ∞. Then using Ross (1970),
Theorem 3.16, the long run (expected) cost per unit time can be written:

B_T = M_T / S_T = [c EN(T) + K] / ET   (3.1)

Using the definition of a counting process, Bremaud (1981), it follows from
(3.1) that

B_T = [c E ∫_0^T λ(t) dt + K] / E ∫_0^T dt   (3.2)
We note that the optimality criterion is of the same form as analysed by
Aven and Bergman (1986). Below the main results obtained in Aven and
Bergman (1986) are summarised.
Define the replacement time T_δ as the first point in time the process
a(t) ≡ cλ(t) exceeds δ, i.e. a(t) ≥ δ. We assume ET_δ < ∞. It is seen that T_δ
minimizes

M_T - δS_T = E ∫_0^T [a(t) - δ] dt + K

The results of Aven and Bergman (1986) follow. Let B(δ) = B_{T_δ}.

The stopping time T_{δ*}, where δ* = inf_T B_T, minimizes B_T. The
value δ* is given as the unique solution of the equation δ = B(δ).
Moreover, if δ > δ*, then δ > B(δ); if δ < δ*, then δ < B(δ); B(δ)
is non-increasing for δ ≤ δ*, non-decreasing for δ ≥ δ*, and B(δ) is
left-continuous.
Choose any δ_1 such that P(T_{δ_1} > 0) > 0, and set iteratively

δ_{n+1} = B(δ_n), n = 1, 2, ...

Then
lim_{n→∞} δ_n = δ*

Remark 3.1. The above algorithm usually converges very fast. Stan-
dard numerical iterative methods, for example the bisection method
or modified regula falsi (see e.g. Conte and de Boor 1977, Section 2),
can in addition to the above algorithm be used to locate δ*. We must
then start with δ_a ≤ δ_b such that δ_a ≤ B(δ_a) and δ_b ≥ B(δ_b). Then
we have δ_a ≤ δ* ≤ δ_b.
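The iteration δ_{n+1} = B(δ_n) can be sketched in a few lines for the deterministic minimal-repair case a(t) = cλ(t). The power-law intensity and the cost values below are illustrative assumptions; for β = 2 the fixed point can be compared with the closed-form optimal cost rate 2√(cK)/θ.

```python
import math

# Fixed-point iteration delta_{n+1} = B(delta_n) for a deterministic,
# increasing intensity lambda(t) = (beta/theta)*(t/theta)**(beta-1).
c, K, theta, beta = 10.0, 500.0, 1000.0, 2.0

def T_delta(d):
    # First time a(t) = c*lambda(t) reaches level d (lambda is increasing).
    return theta * (d * theta / (c * beta)) ** (1.0 / (beta - 1.0))

def B(d):
    T = T_delta(d)
    return (c * (T / theta) ** beta + K) / T

d = 1.0                       # any starting level with T_delta(d) > 0
for _ in range(30):
    d = B(d)

print(round(d, 6), round(2 * math.sqrt(c * K) / theta, 6))
```

The two printed numbers agree: the fixed point δ* equals the optimal long-run cost rate, consistent with the marginal cost interpretation in Remark 3.3.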
Remark 3.2. If we restrict attention to stopping times T which are
bounded by a stopping time S, say, satisfying ES < ∞, and a(t) is
non-decreasing for t ≤ S, then the above results are valid with T_δ
replaced by min{T_δ, S}. The stopping time S could for example be
the point in time of the mth system failure.

Remark 3.3. It is possible to give a marginal cost interpretation of
the results similar to the one given by Dekker (1996) in this vol-
ume, observing that at time t the expected cost rate by deferring
replacement equals cλ(t).
Thus it is optimal to replace the system when a(t) reaches the level 6*.
It follows from (3.2) and Fubini's theorem that

    B(δ) = (c ∫_0^∞ E[I(a(t) < δ) λ(t)] dt + K) / (∫_0^∞ E I(a(t) < δ) dt)
Hence if

    a(t) = cv(X(t))

where v(x) is a deterministic function of x, and Q_t(·) is the distribution of
X(t), we may write

    B(δ) = (c ∫_0^∞ ∫ I(cv(x) < δ) v(x) Q_t(dx) dt + K) / (∫_0^∞ ∫ I(cv(x) < δ) Q_t(dx) dt)    (3.3)

Note that if X(t) is a vector process, then one of the components may be the
time.
Below we apply the above model to analyse a monotone system comprising
n components.

3.2 Monotone System

We consider a monotone system of n components. Let X_i(t) be a binary
random variable representing the state of component i at time t, t ≥ 0,
i = 1, 2, ..., n. The random variable X_i(t) equals 1 if the component is func-
tioning at t and 0 otherwise. Let φ(t) be a random variable which denotes the
state of the system at time t. The random variable φ(t) equals 1 if the sys-
tem is functioning and 0 otherwise. We assume that φ(t) can be completely
determined by the states of the components, so that we may write
Optimal Replacement of Monotone Repairable Systems 229

    φ(t) = φ(X(t))

where X(t) = (X_1(t), X_2(t), ..., X_n(t)) and φ(x) is the structure function of
the system. The structure function φ(x) is assumed to be monotone, i.e.
- φ(0) = 0 and φ(1) = 1, and
- the structure function φ(x) is non-decreasing in each argument.
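These two conditions can be verified by exhaustive search over the 2^n component-state vectors. The following sketch is our own illustration; phi_series and phi_parallel are the usual series and parallel structure functions, not examples taken from the text.

```python
from itertools import product

def phi_series(x):    # series system: functions iff all components function
    return min(x)

def phi_parallel(x):  # parallel system: functions iff at least one functions
    return max(x)

def is_monotone(phi, n):
    """Check phi(0) = 0, phi(1) = 1 and phi non-decreasing in each argument."""
    if phi((0,) * n) != 0 or phi((1,) * n) != 1:
        return False
    for x in product((0, 1), repeat=n):
        for i in range(n):
            if x[i] == 0:
                y = x[:i] + (1,) + x[i + 1:]   # raise component i from 0 to 1
                if phi(y) < phi(x):
                    return False
    return True
```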
Let N_i(t) denote the number of failures of component i in [0, t], and N(t) the
number of system failures in the same interval. The counting process N_i(t) is
assumed to have an intensity process λ_i(t). Hence the failure process of the
system N(t) has an intensity λ(t) given by

    λ(t) = Σ_{i=1}^n λ_i(t) X_i(t) (1 − φ(0_i, X(t))) φ(X(t))    (3.4)

where φ(·_i, x) = φ(x_1, ..., x_{i−1}, ·, x_{i+1}, ..., x_n).
Observe that X_i(t)(1 − φ(0_i, X(t)))φ(X(t)) is either 0 or 1, and equals 1 if
and only if the system is functioning, component i is functioning, and the
system fails if component i fails.
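The indicator structure of (3.4) can be made concrete with a few lines of code. This is our own sketch; the structure functions and rates used in the examples are hypothetical.

```python
def system_intensity(lam, x, phi):
    """Failure intensity (3.4): sum of lam[i] over components that are
    functioning and critical, given that the system is functioning."""
    n = len(x)
    total = 0.0
    for i in range(n):
        x0 = x[:i] + (0,) + x[i + 1:]            # state with component i forced down
        critical = x[i] * (1 - phi(x0)) * phi(x)  # 0 or 1
        total += lam[i] * critical
    return total
```

For a two-component parallel system (phi = max) with rates (1, 4), no component is critical in state (1, 1), so the intensity is 0; in state (1, 0) only component 1 is critical and the intensity is 1.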
Hence we have a special case of the general set-up described above pro-
vided that the intensity process is non-decreasing.
Below we look closer at two special cases.
3.2.1 Replacement at System Failures. First we consider the case that
the components are all replaced at system failure, but no repairs are carried
out before system failures. Hence in this case we have X_i(t) = I(t < R_i),
where R_i is a random variable representing the time to the first failure of
the ith component, i = 1, 2, ..., n. We assume that component i has a lifetime
distribution F_i(t) with failure rate r_i(t). The n components are
assumed to be independent.
It follows that we have a special case of the general model, with λ_i(t) =
r_i(t)X_i(t) and S (cf. Remark 3.2) equal to the failure time of the system. It
is not difficult to see that X_i(t)(1 − φ(0_i, X(t))) is non-decreasing for t < S.
Thus the failure intensity process is non-decreasing if the failure rates r_i(t)
are non-decreasing.
Let v(t, x) = Σ_{i=1}^n r_i(t) x_i (1 − φ(0_i, x)) and

    G(t, x) = P(X(t) = x) = Π_{i=1}^n [1 − F_i(t)]^{x_i} [F_i(t)]^{1−x_i}

Then it is not difficult to see that

    B(δ) = (c ∫_0^∞ Σ_x I(cv(t,x) < δ) v(t,x) φ(x) G(t,x) dt + K) / (∫_0^∞ Σ_x I(cv(t,x) < δ) φ(x) G(t,x) dt)

         = (c Σ_{x: φ(x)=1} ∫_0^∞ I(cv(t,x) < δ) v(t,x) G(t,x) dt + K) / (Σ_{x: φ(x)=1} ∫_0^∞ I(cv(t,x) < δ) G(t,x) dt)

If the failure rates are constant, i.e. r_i(t) = r_i, then v(t, x) = v(x) and

    ∫_0^∞ G(t, x) dt = expected sojourn time in state x = P(x) (Σ_{j=1}^n r_j x_j)^{−1}

where P(x) represents the probability that the process X(t) visits state x,
observing that the sojourn time in state x, given that the process visits this
state, has an exponential distribution with mean 1/Σ_{i=1}^n r_i x_i. The proba-
bility P(x) can be found using standard Markov theory, see Ross (1970),
Proposition 4.11. Note that the probability of a transition from state (1_j, y)
to state (0_j, y) equals r_j / Σ_{i=1}^n r_i y_i with y_j = 1. It follows that

    B(δ) = (c Σ_{x: φ(x)=1} I(cv(x) < δ) v(x) P(x) (Σ_{j=1}^n r_j x_j)^{−1} + K) / (Σ_{x: φ(x)=1} I(cv(x) < δ) P(x) (Σ_{j=1}^n r_j x_j)^{−1})    (3.5)

The optimal δ can now easily be determined. Note that it is sufficient to
calculate B(δ) for δ values equal to the values the function cv(x) can take,
in addition to B(∞).

Numerical example
Assume that φ(x) = 1 − (1 − x_1)(1 − x_2), i.e. the system is a parallel sys-
tem comprising two components. Assume that the components have constant
failure rates r_i given by

    r_1 = 1, r_2 = 4

Furthermore assume that K = 1 and c = 6. Then we can easily find the
optimal replacement policy. Since cv(x) can only take the values 0, 6 and 24,
it suffices to consider the following three replacement policies:
1. Replace the components at system failures only.
2. If component 1 fails before component 2, replace component 1 at failure
(due to exponentiality this action is equivalent to a system replacement).
If component 2 fails before component 1, replace both components at
system failure. This policy corresponds to a δ value equal to 24.
3. Replace each component at failure. This policy corresponds to a δ value
equal to 6.
To compute B(δ) we use formula (3.5). It is not difficult to show that

    P(1,0) = P(R_2 < R_1) = r_2/(r_1 + r_2) = 0.80
    P(0,1) = P(R_1 < R_2) = r_1/(r_1 + r_2) = 0.20

Clearly P(1,1) = 1. It follows that the expected sojourn times in the states
(1,1), (1,0) and (0,1) equal 0.20, 0.80 and 0.05, respectively. From this we find
that B(∞) = 7/(0.20 + 0.80 + 0.05) = 6.7, B(24) = (6 · 0.80 + 1)/(0.20 + 0.80) =
5.8 and B(6) = 1/0.20 = 5.0. Thus it is optimal to use policy 3: replace each
component at failure.
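The three values above can be reproduced by evaluating formula (3.5) directly over the functioning states. The following sketch is our own; it only restates the numbers already derived in this example.

```python
import math

r = {1: 1.0, 2: 4.0}   # constant failure rates r_1, r_2
c, K = 6.0, 1.0

# Functioning states of the two-component parallel system, with visit
# probability P(x) and the value v(x) from Section 3.2.1.
states = {
    (1, 1): {"P": 1.0, "v": 0.0},
    (1, 0): {"P": 0.8, "v": 1.0},   # component 2 failed first
    (0, 1): {"P": 0.2, "v": 4.0},   # component 1 failed first
}
for x, d in states.items():
    # expected sojourn time in state x: P(x) / (sum of rates of up components)
    d["S"] = d["P"] / sum(r[i + 1] for i in range(2) if x[i] == 1)

def B(delta):
    """Average cost (3.5) for replacement level delta."""
    num = c * sum(d["v"] * d["S"] for d in states.values() if c * d["v"] < delta) + K
    den = sum(d["S"] for d in states.values() if c * d["v"] < delta)
    return num / den
```

Evaluating B at ∞, 24 and 6 returns approximately 6.7, 5.8 and 5.0, matching the values above, so the minimum is attained at δ = 6 (policy 3).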

3.2.2 Minimal Repairs at System Failures. We assume now that if the
system fails, it is minimally repaired. This means that if a component fails
and causes system failure, then this component is minimally repaired. A
component which fails without causing system failure is not repaired.
As in Section 3.2.1 we assume that component i, when not being repaired,
has a lifetime R_i with distribution function F_i(t) and failure rate equal to
r_i(t). The n components are assumed to be independent.
It follows that we have a special case of the general set-up with λ(t) having
the form (3.4) with λ_i(t) = r_i(t)X_i(t). The process X_i(t), t ≥ 0, is in this case
either identical to one, or one up to R_i and then zero. If component i is in
series with the rest of the system, then X_i(t) ≡ 1.
In view of formula (3.3) it is of interest to find Q_t(x), the distribution
of X(t); for a parallel system of two components, for example, this distribution
can be written out explicitly.

3.3 Shock Damage Model

3.3.1 One-Component System. Consider first a situation where the sys-
tem is modelled as a one-component system.
Assume that shocks occur to the system at random times, each shock
causes a random amount of damage and these damages accumulate additively.
At a shock the system fails with a given probability. A system failure can
occur only at the occurrence of a shock.
Let V(t) denote the number of shocks in [0, t], and let Y_i denote the
amount of damage caused by the ith shock. We assume that V(t) is a Pois-
son process with rate ν, and that the Y_i's are independent and identically
distributed random variables with a distribution H(y). Let U(t) denote the
accumulated damage in [0, t], i.e.

    U(t) = Σ_{i=1}^{V(t)} Y_i

Now if the accumulated damage is u, the number of failures is k and
a shock occurs which causes an amount of damage y, then the system fails
with a probability p_k(u + y). Then it is not difficult to see that the system
failure counting process N(t) has failure intensity process

    λ(t) = ν ∫_0^∞ p_{N(t)}(U(t) + y) dH(y)

For a formal proof of this result, see Aven (1987). We see that if p_k(u) is
non-decreasing in u for each k and non-decreasing in k for each u, then the
failure intensity process is non-decreasing.
If p_k(u) = p(u) for all k, the model represents a kind of minimal repair
model, since the system after a repair has "forgotten that it failed".

Numerical example
Suppose the parameters of the model are

    ν = 1, K = 1, c = 2, Y_i ≡ 1, p_k(u) = 1 − e^{−u/4}

Hence V(t) = U(t) and

    λ(t) = 1 − e^{−(U(t)+1)/4}

Using formula (3.3) it is not difficult to find the optimal policy: replace the
system when the number of shocks, V(t), equals 3. The average cost function
then equals 1.1.
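This optimum can be checked directly: between the jth and (j+1)th shock the accumulated damage equals j, so the failure intensity is constant on that interval, and each inter-shock time has mean 1/ν. The sketch below is our own calculation of the average cost of the policy "replace at the mth shock".

```python
import math

nu, K, c = 1.0, 1.0, 2.0   # shock rate, replacement cost, cost per system failure

def B(m):
    """Average cost per unit time when the system is replaced at the m-th shock.
    Between shocks j and j+1 the damage is j, so the conditional failure
    intensity is nu*(1 - exp(-(j+1)/4)); each inter-shock time has mean 1/nu."""
    expected_failures = sum((1.0 / nu) * nu * (1.0 - math.exp(-(j + 1) / 4.0))
                            for j in range(m))
    return (c * expected_failures + K) / (m / nu)

m_opt = min(range(1, 20), key=B)
```

The minimum is attained at m_opt = 3 with B(3) ≈ 1.09, in agreement with the value 1.1 quoted above.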
3.3.2 Monotone System of n Components. Consider a monotone sys-
tem φ of n components.
Assume that shocks occur to the system according to a Poisson process
V(t) with rate ν; shock j causes a random amount of damage Y_{ij} on com-
ponent i. At a shock the system fails with a given probability. A compo-
nent/system failure can occur only at the occurrence of a shock.
Let U and N denote the vectors of component processes as defined in
Section 3.3.1, and let Y_i denote the vector of damages at the ith shock. We
assume that the Y_i's are independent and identically distributed. Let X(t)
denote the vector of component states as defined in Section 3.2. As in the
general set-up, N(t) and λ(t) denote the system failure process and intensity,
respectively.
If the state of the components is x, the accumulated damage is u, the
number of component failures is k and a shock occurs which causes an
amount of damage y, then the state of the system equals x′ with a prob-
ability r_{x,k}(x′, u + y).
Then it is not difficult to see that the failure counting process N(t) has
failure intensity process

    λ(t) = φ(X(t)) ν ∫ Σ_{x′: φ(x′)=0} r_{X(t),N(t)}(x′, U(t) + y) dH(y)

where H denotes the distribution of Y_i.
Based on this set-up, situations analogous to those analysed in Sections
3.2.1 and 3.2.2 can be analysed.

4. Minimal Repairs at Component Failures


4.1 Basic Model

Consider a monotone system comprising n independent components, which
are minimally repaired at failures. Let X_i(t) be a binary random variable
representing the state of component i at time t, t ≥ 0, i = 1, 2, ..., n. The
random variable X_i(t) equals 1 if the component is functioning at t and 0
otherwise. Assume X_i(0) = 1.
Let N_i(t) denote the number of failures of component i in [0, t], and let
N_i°(s) denote the associated process representing the number of failures of
component i in [0, s] when time is measured in operating time. We assume
that N_i°(s) is a non-homogeneous Poisson process with intensity function
λ_i(s). Let Λ_i(s) = ∫_0^s λ_i(u) du denote the mean value function of the process
N_i°(s).
Let Z_i(t) denote the operating time at time t. Then it is not difficult to
see that N_i(t) is a counting process with intensity process λ_i(Z_i(t))X_i(t).
Let p_i(t) = 1 − q_i(t) = P(X_i(t) = 1). Furthermore let S_{in} denote the nth
failure time of component i.
We assume that the repair/restoration times are independent with distri-
bution function G_i(t). Let Ḡ_i(t) = 1 − G_i(t).
Each component is minimally repaired at failures, which corresponds to
the assumption of a non-homogeneous Poisson process for N_i°(s).
The following cost structure is assumed:
- A system replacement cost K, K > 0.
- The cost of a minimal repair of component i is c_i, c_i ≥ 0.
- The cost of a system failure of duration t is k + bt.
The system is assumed to be replaced at the stopping time T. After a replace-
ment the system is assumed to be as good as new, i.e. the process restarts
itself.

4.2 Cost Function


Let M^T and S^T denote the expected cost associated with a replacement cycle
and the expected length of a replacement cycle, respectively. Then using Ross
(1970), Theorem 3.16, the long run (expected) cost per time unit can be
written:

    B^T = M^T / S^T = E[cost in [0, T]] / ET

It is tacitly understood that the expectations are finite. In a replacement cycle
the cost of the replacement and the minimal repairs equals K + Σ_{i=1}^n c_i N_i(T).

In addition we have a cost associated with system failures. It is not difficult
to see that this cost equals kN(T) + b ∫_0^T [1 − φ(t)] dt, where N(t) represents
the number of system failures in [0, t].

It then follows that the cost/optimization function can be written:

    B^T = (K + Σ_{i=1}^n E ∫_0^T c_i dN_i(t) + k EN(T) + E ∫_0^T b(1 − φ(t)) dt) / ET    (4.1)

Thus (4.1) expresses the expected cost per unit of time, and the problem of
finding an optimal replacement time is reduced to that of minimizing this
function with respect to T.
Using that N_i(t) is a counting process with intensity process λ_i(Z_i(t))X_i(t)
it follows that

    E ∫_0^T c_i dN_i(t) = E ∫_0^T c_i λ_i(Z_i(t)) X_i(t) dt    (4.2)
Similarly, we obtain the following expression for the expected number of
system failures in a replacement cycle:

    EN(T) = Σ_{i=1}^n E ∫_0^T [φ(1_i, X(t)) − φ(0_i, X(t))] dN_i(t)

          = Σ_{i=1}^n E ∫_0^T [φ(1_i, X(t)) − φ(0_i, X(t))] λ_i(Z_i(t)) X_i(t) dt    (4.3)

where φ(1_i, X(t)) − φ(0_i, X(t)) equals 1 if and only if component i is critical,
i.e. the state of component i determines whether the system functions or not.
Combining (4.1), (4.2) and (4.3) we get

    B^T = (E ∫_0^T a(t) dt + K) / (E ∫_0^T dt)    (4.4)

where

    a(t) = Σ_{i=1}^n [c_i + k(φ(1_i, X(t)) − φ(0_i, X(t)))] λ_i(Z_i(t)) X_i(t) + b[1 − φ(t)]    (4.5)

Observe that Z_i(t) ≈ t if the downtimes are relatively small compared to the
uptimes.
We see from the above expression for B^T that it is basically identical
to the one analysed in Aven and Bergman (1986). Unfortunately, a(t) does
not have non-decreasing sample paths. Hence we cannot apply the results of
Aven and Bergman (1986).
In theory, Markov decision processes can be used to analyse the optimiza-
tion problem. The Markov decision process is characterized by a stochastic
process Y_t, t ≥ 0, defined here by

    Y_t = (S(t), X(t), V(t), W(t))


where

    S(t)   = time since the last replacement
    X(t)   = (X_1(t), X_2(t), ..., X_n(t))
    X_i(t) = state of component i at time t
    V(t)   = (V_1(t), V_2(t), ..., V_n(t))
    V_i(t) = duration of the downtime of component i at t
             since the last failure of the component
    W(t)   = (W_1(t), W_2(t), ..., W_n(t))
    W_i(t) = accumulated downtime of component i at t
             since the last replacement

At each time t the state Y_t is observed, and based on the history of the
process up to time t an action a_t is chosen. In this case there are two possible
actions: "not replace" and "replace".
In this text we shall, however, not analyse this approach any further. From
a practical point of view the Markov decision approach is not very attractive
in this case. The state space is very large and the cost rate function is not
"monotone" , cf. Sandve (1996).
Instead we shall look at a rather simple class of replacement policies:
Replace the system at S or at the first component failure after T, whichever
comes first. Here T and S are constants with T ~ S. We refer to this policy as
a (T, S) policy. Such a policy might be appropriate if e.g. the system failure
cost is relatively large and a failure of a component often results in other
components being critical (this will be the case if the system has minimal cut
sets comprising two components). For some comments concerning this policy
and some related policies, see Section 4.4 below.

4.3 Replacement Policies (T, S)

Let η_T denote the first component failure after T. Then from (4.4) it follows
that

    B(T, S) = (∫_0^T E a(t) dt + ∫_T^S E I(t < η_T) a(t) dt + K) / (T + ∫_T^S P(t < η_T) dt)
where a(t) is defined by (4.5). To compute B(T, S) we will make use of the
approximation Z_i(t) ≈ t. This means that the downtimes are relatively small
compared to the uptimes. Using that the structure function of a monotone
system can be written as a sum of products of component states, with each
term of the sum multiplied by a constant, it is seen that

    a(t) ≈ Σ_l v_l(t) Π_{i∈A_l} X_i(t) + constant

for some deterministic functions v_l(t) and sets A_l ⊂ {1, 2, ..., n}. It suffices
therefore to calculate expressions of the form

    ∫_0^T v_l(t) E Π_{i∈A_l} X_i(t) dt    (4.6)

and

    ∫_T^S v_l(t) E [Π_{i∈A_l} X_i(t) I(t < η_T)] dt    (4.7)

To compute (4.6) we make use of the following formula for q_i(t) = 1 − p_i(t):

    q_i(t) ≈ ∫_0^t Ḡ_i(t − y) λ_i(y) e^{−(Λ_i(t) − Λ_i(y))} dy    (4.8)

To establish (4.8) we note that

    q_i(t) = ∫_0^t P(X_i(t) = 0 | S_{iN_i(t)} = y) H_i(dy, t)

where

    H_i(y, t) = P(S_{iN_i(t)} ≤ y)

It is seen that P(X_i(t) = 0 | S_{iN_i(t)} = y) ≈ Ḡ_i(t − y), and using that

    H_i(y, t) = P(S_{iN_i(t)} ≤ y) = P(N_i(t) − N_i(y) = 0) ≈ e^{−(Λ_i(t) − Λ_i(y))}

formula (4.8) follows.
The accuracy of formula (4.8) is studied in Sandve (1996).
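Formula (4.8) is a one-dimensional integral and can be evaluated by any standard quadrature rule. The sketch below is our own illustration; the linear intensity λ_i(s) = 2s (so Λ_i(s) = s²) and the exponential repair-time distribution with mean 0.1 are assumptions made for the example, not values from the text.

```python
import math

def q_approx(t, lam, Lam, G_bar, steps=2000):
    """Approximate downtime probability q_i(t) of formula (4.8) with the
    trapezoidal rule: integral_0^t G_bar(t-y) lam(y) exp(-(Lam(t)-Lam(y))) dy."""
    h = t / steps
    def f(y):
        return G_bar(t - y) * lam(y) * math.exp(-(Lam(t) - Lam(y)))
    return h * (0.5 * f(0.0) + sum(f(k * h) for k in range(1, steps)) + 0.5 * f(t))

# Assumed example: intensity lam(s) = 2s, repair times exponential with mean 0.1.
q = q_approx(1.0, lam=lambda s: 2 * s, Lam=lambda s: s * s,
             G_bar=lambda u: math.exp(-10 * u))
```

As a sanity check, instantaneous repairs (Ḡ_i ≡ 0) give q_i(t) = 0, and the result always lies between 0 and 1.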
It remains to compute (4.7). Here we shall present a very simple approx-
imation formula. Observing that I(t < η_T) = 1 means that there are no
component failures in the interval (T, t], and that the components are most
likely to be up at time T, we have

    E [Π_i X_i(t) I(t < η_T)] ≈ P(no component failures in (T, t]) = Π_{i=1}^n P(N_i(t) − N_i(T) = 0)

An approximate value of B(T, S) can now be calculated and an optimal policy
determined.

4.4 Remarks

The (T, S) policy can be improved by taking into account which component
fails. Instead of replacing the system at the first component failure after
T (assuming this occurs before S), we might replace the system at the first
component failure resulting in a critical component, or wait until the first
system failure after T.

In Aven and Bergman (1986) it is shown that the problem of minimizing
B^T can be solved by minimizing the function

    L_δ^T = E ∫_0^T [a(t) − δ] dt + K

If T* minimizes L_{δ*}^T, where δ* = inf_T B^T, then T* also minimizes B^T. Hence
we can focus on L_{δ*}^T.
It is clear from the expression of L_{δ*}^T that an optimal policy will be greater
than or equal to the stopping time

    T_{δ*} = inf{t : a(t) ≥ δ*}

Using the optimal average cost B(T, S) as an approximation for δ* we can
obtain an improved replacement policy (T_{δ*}, S).
An alternative replacement policy is obtained by considering the time
points where component failures occur as decision points. Let T_i be the point
in time of the ith component failure and let F_i denote the history up to
time T_i. Then based on F_i we determine a time R_i (∈ [0, ∞]) such that the
system is replaced at T_i + R_i if T_i + R_i < T_{i+1}. The value of R_i is determined
by minimizing the conditional expected cost from T_i until the next decision
point or replacement time, whichever occurs first, i.e. R_i minimizes

    g(r) = ∫_{T_i}^{T_i+r} E[(a(t) − δ) I(t < T_{i+1}) | F_i] dt

The performance of the above policies is studied in Sandve (1996).

Acknowledgement. The author is grateful to the reviewer for valuable comments.

References

Aven, T.: Optimal Replacement Under a Minimal Repair Strategy - A General
Failure Model. Adv. Appl. Prob. 15, 198-211 (1983)
Aven, T.: A Counting Process Approach to Replacement Models. Optimization 18,
285-296 (1987)
Aven, T.: Reliability and Risk Analysis. London: Elsevier Applied Science 1992
Aven, T.: Condition Based Replacement Policies - A Counting Process Approach.
Reliability Engineering and System Safety. To appear (1996)
Aven, T., Bergman, B.: Optimal Replacement Times - A General Set-up. J. Appl.
Prob. 23, 432-442 (1986)
Barlow, R.E. , Proschan, F.: Statistical Theory of Reliability and Life Testing. New
York: Holt, Rinehart and Winston 1975
Bergman, B.: Optimal Replacement Under a General Failure Model. Adv. Appl.
Prob. 10, 431-451 (1978)
Brémaud, P.: Point Processes and Queues. Berlin: Springer 1981

Conte, S.D., de Boor, C.: Elementary Numerical Analysis. New York: McGraw-Hill
1972
Dekker, R.: A Framework for Single-Parameter Maintenance Activities and its Use
in Optimization, Priority Setting and Combining, In this volume (1996), pp.
170-188
Jensen, U.: A General Replacement Model. ZOR - Methods and Models of Opera-
tions Research 34, 423-439 (1990)
Pierskalla, W.P., Voelker, J.A.: A Survey of Maintenance Models: the Control
and Surveillance of Deteriorating Systems. Naval Res. Log. Quart. 23, 353-388
(1976)
Ross, S.M.: Applied Probability Models with Optimization Applications. San Fran-
cisco: Holden-Day 1970
Sandve, K.: Cost Analysis and Optimal Maintenance Planning of a Monotone, Re-
pairable System. Ph.D. Thesis. Rogaland University Centre and Robert Gordon
University. In progress (1996)
Taylor, H.M.: Optimal Replacement Policy Under Additive Damage and Other
Failure Models. Naval Res. Logist. Quart. 22, 1-18 (1975)
Valdez-Flores, C., Feldman, R.M.: A Survey of Preventive Maintenance Models for
Stochastically Deteriorating Single-Unit Systems. Naval Res. Logist. Quart. 36,
419-446 (1989)
How to Determine Maintenance Frequencies
for Multi-Component Systems?
A General Approach
Rommert Dekker, Hans Frenk and Ralph E. Wildeman
Econometric Institute, Erasmus University Rotterdam, 3000 DR Rotterdam, The
Netherlands

Summary. A maintenance activity carried out on a technical system often involves
a system-dependent set-up cost that is the same for all maintenance activities car-
ried out on that system. Grouping activities thus saves costs since execution of
ried out on that system. Grouping activities thus saves costs since execution of
a group of activities requires only one set-up. By now, there are several multi-
component maintenance models available in the literature, but most of them suffer
from intractability when the number of components grows, unless a special struc-
ture is assumed. An approach that can handle many components was introduced in
the literature by Goyal et al. However, this approach requires a specific deteriora-
tion structure for components. Moreover, the authors present an algorithm that is
not optimal and there is no information on how good the obtained solutions are. In
this paper, we present an approach that solves the model of Goyal et al. to optimal-
ity. Furthermore, we extend the approach to deal with more general maintenance
models like minimal repair and inspection that can be solved to optimality as well.
Even block replacement can be incorporated, in which case our approach is a good
heuristic.

Keywords. Maintenance, multi-component, optimisation

1. Introduction
A technical system (such as a transportation fleet, a machine, a road, or
a building) mostly contains many different components. The cost of main-
taining a component of such a technical system often consists of a cost that
depends on the component involved and of a fixed cost that only depends on
the system. The system-dependent cost is called the set-up cost and is shared
by all maintenance activities carried out simultaneously on components of
the system. The set-up cost can consist of, for example, the down-time cost
due to production loss if the system cannot be used during maintenance, or
of the preparation cost associated with erecting a scaffolding or opening a
machine. Set-up costs can be saved when maintenance activities on different
components are executed simultaneously, since execution of a group of activ-
ities requires only one set-up. This can yield considerable cost savings, and
therefore the development of optimisation models for multiple components is
an important research issue.
For a literature overview of the field of maintenance of multi-component
systems, we refer to Van der Duyn Schouten (1996) in this volume. An-
other review is given by Cho and Parlar (1991). By now there are several

methods that can handle multiple components. However, most of them suffer
from intractability when the number of components grows, unless a special
structure is assumed. For instance, the maintenance of a deteriorating sys-
tem is frequently described using Markov decision theory (see, for example
Howard 1960, who was the first to use such a problem formulation). Since
the state space in such problems grows exponentially with the number of
components, the Markov decision modelling of multi-component systems is
not tractable for more than three non-identical components (see, for example
Backert and Rippin 1985). For problems with many components heuristic
methods can be applied. For instance, Dekker and Roelvink (1995) present
a heuristic replacement criterion in case always a fixed group of components
is replaced. Van der Duyn Schouten and Vanneste (1990) study structured
strategies, viz. (n, N)-strategies, but provide an algorithm for only two iden-
tical components. Summarising, these models are of limited practical use,
since reasonable numbers of components cannot be handled.
An approach that can handle many components was introduced by Goyal
and Kusy (1985) and Goyal and Gunasekaran (1992). In this approach a
basis interval for maintenance is taken and it is assumed that components can
only be maintained at integer multiples of this interval, thereby saving set-up
costs. The authors present an algorithm that iteratively determines the basis
interval and the integer multiples. The algorithm has two disadvantages. The
first is that only components with a very specific deterioration structure can
be handled, which makes it more difficult to fit practical situations and makes
it impossible to apply it to well-known maintenance models. The second
disadvantage is that the algorithm often gives solutions that are not optimal
and that there is no information on how good the solutions are (see Van
Egmond et al. 1995).
The idea of using a basic cycle time and individual integer multiples was
first applied in the definition of the joint-replenishment problem in inventory
theory, see Goyal (1973); the joint-replenishment problem can be considered
as a special case of the maintenance problem of Goyal and Kusy (1985). A
method to solve the joint-replenishment problem to optimality was presented
by Goyal (1974). However, this method is based on enumeration and is com-
putationally prohibitive. Moreover, it is not clear how this method can be
extended to deal with the more general cost functions in case of maintenance
optimisation. Many heuristics have appeared in the joint-replenishment liter-
ature (see Goyal and Satir 1989). But again, it is not clear how these heuristics
will perform in case of the more general maintenance cost functions.
In this chapter we present a general approach for the coordination of main-
tenance frequencies, thereby pursuing the idea of Goyal and Gunasekaran
(1992) and Goyal and Kusy (1985). With the approach we can easily solve
the model of Goyal et al. to optimality, but we can also incorporate other
maintenance models like minimal repair, inspection and block replacement.

We can also efficiently solve the joint-replenishment problem to optimality


(see Dekker et al. 1995).
Our solution approach is based on global optimisation of the problem.
We first apply a relaxation and find a corresponding feasible solution. This
relaxation yields a lower bound on an optimal solution so that we can decide
whether the feasible solution is good enough. If it is not good enough, we
apply a global-optimisation procedure on an interval that is obtained by the
relaxation and that contains an optimal solution. For the special cases of
Goyal et al., the minimal-repair model and the inspection model it is then
possible to apply Lipschitz optimisation to find a solution with an arbitrarily
small deviation from an optimal solution. For the block-replacement model
we will apply a good heuristic.
This chapter is structured as follows. In the next section we give the
problem formulation. In Section 3 we rewrite the problem and we introduce
a relaxation, which enables us to use solution techniques that will be discussed
in Section 4. In Section 5 we present numerical results and in Section 6 we
draw conclusions.

2. Problem Definition
Consider a multi-component system with components i, i = 1, ... , n. Cre-
ating an occasion for preventive maintenance on one or more of these com-
ponents involves a set-up cost S, independent of how many components are
maintained. The set-up cost can be due to, for example, system down-time.
Because of this set-up cost S there is an economic dependence between the
individual components.
In this chapter we consider preventive maintenance activities of the block
type, that is, the determination of the next execution time depends only on
the time passed since the latest execution. Otherwise, for example in case of
age replacement, execution of maintenance can no longer be coordinated and
one has to use opportunity or modified block-replacement policies.
On an occasion for maintenance, component i can be preventively main-
tained at an extra cost of c_i^p. Let M_i(x) be the expected cumulative deteriora-
tion costs of component i (due to failures, repairs, etc.), x time units after its
latest preventive maintenance. We assume that M_i(·) is continuous and that
after preventive maintenance a component can be considered as good as new.
Consequently, the average costs Φ_i(x) of component i, when component i is
preventively maintained on an occasion each x time units, amount to

    Φ_i(x) = (c_i^p + M_i(x)) / x    (2.1)
Since the function M_i(·) is continuous, the function Φ_i(·) is also continuous.
To reduce costs by exploiting the economic dependence between compo-
nents, maintenance on individual components can be combined. We assume

that preventive maintenance is carried out at a basis interval of T time units
(that is, each T time units an occasion for preventive maintenance is cre-
ated) and that preventive maintenance on a component can only be carried
out at integer multiples of this basis interval T. This implies that compo-
nent i is preventively maintained each k_iT time units, k_i ∈ ℕ. The idea of
modelling maintenance at fixed intervals that are integer multiples of a basis
interval originates from inventory theory, see Goyal (1973). It was introduced
in maintenance by Goyal and Kusy (1985) and further developed by Goyal
and Gunasekaran (1992).
The objective now is the minimisation of the total average costs per time
unit. The total average costs are the sum of the average set-up cost and the
individual average costs Φ_i(k_iT) of each component i. The determination of
the average set-up cost depends on how often an occasion for maintenance is
actually used.
In the context of inventory theory a discussion in the literature has taken
place on how to deal with so-called empty occasions that occur when the
smallest integer k_i is larger than one. For example, suppose that there are
two components and that k_1 = 2 and k_2 = 3; then two out of six occasions
will not be used for maintenance. Dagpunar (1982) suggests that in that case
on average only 4/6th of the set-up cost is incurred. He proposes to use a
correction factor Δ(k), k = (k_1, ..., k_n). For example, if k = (2, 3) then
Δ(k) = 4/6. Dagpunar gives the following general expression for Δ(k):

    Δ(k) = Σ_{i=1}^n (−1)^{i+1} Σ_{σ⊂{1,...,n}: |σ|=i} {lcm(k_{σ_1}, ..., k_{σ_i})}^{−1}    (2.2)

where lcm(k_{σ_1}, ..., k_{σ_i}) denotes the least common multiple of the integers
k_{σ_1}, ..., k_{σ_i}. Notice that Δ(k) ≤ 1 and that Δ(k) ≥ (min_i{k_i})^{−1}. Conse-
quently, if min_i{k_i} = 1, then Δ(k) = 1.
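Expression (2.2) is an inclusion–exclusion sum over the subsets σ of {1, ..., n} and is straightforward to implement. The sketch below is our own; it reproduces the example Δ((2, 3)) = 4/6.

```python
from itertools import combinations
from math import gcd
from functools import reduce

def lcm(nums):
    """Least common multiple of a sequence of positive integers."""
    return reduce(lambda a, b: a * b // gcd(a, b), nums)

def correction_factor(k):
    """Dagpunar's correction factor (2.2): the long-run fraction of occasions
    at which at least one component is actually maintained."""
    n = len(k)
    total = 0.0
    for i in range(1, n + 1):
        # inclusion-exclusion over all subsets sigma of size i
        total += (-1) ** (i + 1) * sum(1.0 / lcm(sigma)
                                       for sigma in combinations(k, i))
    return total
```

For k = (2, 3) this gives 1/2 + 1/3 − 1/6 = 4/6, and whenever some k_i = 1 the factor equals 1, as noted above.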
Goyal (1982), however, criticises the formulation of Dagpunar (1982).
In the maintenance context (see Goyal and Kusy 1985 and Goyal and Gu-
nasekaran 1992), but also in the formulation of the joint-replenishment prob-
lem found in the inventory literature, the correction factor is usually ne-
glected, or equivalently, assumed to be equal to 1. This is correct under the
assumption that the set-up cost is also incurred at occasions at which no
actual maintenance is carried out.
We will consider here two different problem formulations, one with the
correction factor and another without. With the correction factor we have
the following problem:

    inf { S Δ(k)/T + Σ_{i=1}^n Φ_i(k_i T) : k_i ∈ ℕ, T > 0 },    (2.3)

where Δ(k) is given by (2.2). If the correction factor Δ(k) is neglected, we
have:

    inf { S/T + Σ_{i=1}^n Φ_i(k_i T) : k_i ∈ ℕ, T > 0 }.    (2.4)

Computation of the correction factor Δ(k) is in general time consuming.
As was also pointed out by Goyal (1982) in the inventory context, minimisa-
tion of a cost function becomes considerably more complex if the correction
factor Δ(k) is included. Together with the observation that problem (2.3) is
a mixed continuous-integer programming problem, this makes (2.3) a very
difficult problem to solve.
That is why we will focus in this chapter on problem (2.4). Although
this problem is easier than problem (2.3), it is in general still difficult to
solve. Approaches published so far include only computationally prohibitive
enumeration methods or heuristics. However, in this chapter we will show
that in many cases problem (2.4) can be solved efficiently to optimality. For
the cases where this is not possible we present some heuristics that perform
better than previously published ones. We will also discuss some results for
problem (2.3). However, we will not consider a solution procedure for this
problem. Observe that a solution of problem (2.4) can always be used as a
feasible solution of problem (2.3) and we will show by numerical experiments
in Section 5 that a feasible solution thus obtained will in many cases be
sufficiently good.
In the modelling approach of Goyal and Kusy (1985) and Goyal and
Gunasekaran (1992) only a very specific deterioration-cost function M_i(·)
for component i, i = 1, ... , n, is allowed. Here we allow more general
deterioration-cost functions, so that this modelling approach can also be ap-
plied to well-known preventive-maintenance strategies of the block type, such
as minimal repair, inspection and block replacement.
By choosing the appropriate function M_i(·), the following models can be
incorporated (see also Dekker 1995, who provides an extensive list of these
models; here we only mention some important ones).
Special Case of Goyal and Kusy. Goyal and Kusy (1985) use the following
deterioration-cost function: M_i(x) = ∫_0^x (f_i + v_i t^e) dt, where f_i and v_i
are non-negative constants for component i and e ≥ 0 is the same for
all components. Notice that e = 1 represents the joint-replenishment
problem as commonly encountered in the inventory literature, see also
Dekker et al. (1995) (in that case the deterioration costs are holding
costs).
Special Case of Goyal and Gunasekaran. The deterioration-cost function
used by Goyal and Gunasekaran (1992) is slightly different from that
of Goyal and Kusy (1985). They take M_i(x) = ∫_0^{y_i(x−X_i)} (a_i + b_i t) dt,
where x must of course be larger than X_i, and a_i, b_i, X_i and y_i are
non-negative constants for component i. In this expression, y_i denotes
the average utilisation factor of component i and X_i is the time required
for maintenance of component i. Consequently, they take e = 1 in the
deterioration-cost function of Goyal and Kusy, and they take individual
down-time and utilisation factors into account.

244 Rommert Dekker et al.
Minimal-Repair Model. According to a standard minimal-repair model (see,
for example Dekker 1995), component i is preventively replaced at fixed
intervals of length x, with failure repair occurring whenever necessary.
A failure repair restores the component into a state as good as before.
Consequently, M_i(x) = c_i^f ∫_0^x r_i(t) dt, where r_i(·) denotes the rate of
occurrence of failures, and c_i^f the failure-repair cost. Here M_i(x) expresses
the expected repair costs incurred in the interval [0, x] due to failures.
Notice that this model incorporates the special case of Goyal and Kusy
if we take c_i^f = 1 and r_i(t) = f_i + v_i t^e.
Inspection Model. In a standard inspection model (see Dekker 1995), com-
ponent i is inspected at fixed intervals of length x, with a subsequent
replacement when at inspection the component turns out to have failed.
If a component fails before it is inspected, it stays inoperative until it is
inspected. After inspection, a component can be considered as good as
new. Here we have M_i(x) = c_i^f ∫_0^x F_i(t) dt, where c_i^f is the failure cost per
unit time and F_i(·) is the cdf of the failure distribution of component i.
Block-Replacement Model. According to a standard block-replacement model
(see Dekker 1995), component i is replaced upon failure and preventively
after a fixed interval of length x. Consequently, M_i(x) = c_i^f N_i(x), where
N_i(x) denotes the renewal function (expressing the expected number of
failures in [0, x]), and c_i^f the failure-replacement cost.
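For concreteness, the deterioration-cost functions M_i(·) of these models can be written in closed form. A hedged sketch in Python; the function names, parameter defaults, and the illustrative choices of a Weibull failure rate and an exponential failure distribution are ours, not from the text:

```python
import math

def M_goyal_kusy(x, f=1.0, v=0.5, e=1.0):
    # integral_0^x (f + v t^e) dt = f*x + v * x^(e+1) / (e+1)
    return f * x + v * x ** (e + 1) / (e + 1)

def M_goyal_gunasekaran(x, a=1.0, b=0.5, X=0.1, y=0.9):
    # integral_0^{y(x - X)} (a + b t) dt, defined for x > X
    u = y * (x - X)
    return a * u + b * u ** 2 / 2

def M_minimal_repair(x, cf=2.0, lam=0.5, beta=2.0):
    # cf * integral_0^x r(t) dt with a Weibull rate r(t) = lam*beta*t^(beta-1)
    return cf * lam * x ** beta

def M_inspection(x, cf=2.0, mu=1.0):
    # cf * integral_0^x F(t) dt with F exponential: F(t) = 1 - exp(-t/mu)
    return cf * (x - mu * (1.0 - math.exp(-x / mu)))

print(M_goyal_kusy(2.0))        # 3.0
print(M_minimal_repair(2.0))    # 4.0
```

The block-replacement model is omitted from the sketch because its renewal function N_i(x) generally has no closed form and must be computed numerically.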
In the following section we present a general approach to construct a relax-
ation of the optimisation problems given by (2.3) and (2.4), and to simplify
(2.4). Observe that the optimisation problems (2.3) and (2.4) allow a different
function M_i(·) for each component. Thus it is possible to mix the different
models above. It is possible, for instance, to combine the maintenance of a
component according to the minimal-repair model with the maintenance of
a component according to an inspection model.

3. Analysis of the Problem

To make optimisation problems (2.3) and (2.4) mathematically more tractable,
we substitute T by 1/T. Using this transformation, the relaxation for both
problems that will be introduced in the next subsection becomes an easily
solvable convex-programming problem if each of the individual cost func-
tions φ_i(·) is given by one of the special cases of Goyal et al., the minimal-
repair model or the inspection model. This result will be proved in Section 4
and there it will also be shown that without this transformation the relax-
ation is in general not a convex-programming problem. As will be seen later,
this result is very useful in a solution procedure to solve problem (2.4).

Clearly, by the transformation T → 1/T, the optimisation problem (2.3)
is equivalent with

(P_c)   inf { SΔ(k)T + Σ_{i=1}^n φ_i(k_i/T) : k_i ∈ N, T > 0 },

and optimisation problem (2.4) is equivalent with

(P)   inf { ST + Σ_{i=1}^n φ_i(k_i/T) : k_i ∈ N, T > 0 }
    = inf_{T>0} { ST + Σ_{i=1}^n inf{ φ_i(k_i/T) : k_i ∈ N } }.

Denote now by v(P_c), v(P) the optimal objective value of (P_c), (P) respec-
tively, and by T(P_c), T(P) an optimal T (if it exists) for these problems.
Notice that if T(P_c) and (k_1(T(P_c)), k_2(T(P_c)), ..., k_n(T(P_c))) ∈ N^n are op-
timal for (P_c), then T = 1/T(P_c) and the same values of k_i, i = 1, ..., n,
are optimal for the optimisation problem (2.3). Analogously, if T(P) and
(k_1(T(P)), k_2(T(P)), ..., k_n(T(P))) ∈ N^n are optimal for (P), then T =
1/T(P) and (k_1(T(P)), k_2(T(P)), ..., k_n(T(P))) are optimal for problem (2.4).

3.1 A Relaxation of (P_c) and (P)

We will first introduce a relaxation of problem (P). As will be shown subse-
quently, the optimal objective value of this relaxation is also a lower bound
on v(P_c).
If we replace in (P) the constraints k_i ∈ N by k_i ≥ 1, then we have the
following optimisation problem:

(P_rel)   inf_{T>0} { ST + Σ_{i=1}^n inf{ φ_i(k_i/T) : k_i ≥ 1 } }.

Let v(P_rel) be the optimal objective value of (P_rel) and let T(P_rel) be a
corresponding optimal solution of (P_rel) (if it exists).
For this relaxation it clearly follows that v(P) ≥ v(P_rel). Without any
assumptions on φ_i(·), it can be shown that v(P_rel) is also a lower bound on
v(P_c). This is established in the following lemma.
Lemma 3.1. It follows that v(P) ≥ v(P_c) ≥ v(P_rel).

Proof. Since for every vector k = (k_1, ..., k_n) it holds that Δ(k) ≤ 1, the first
inequality follows immediately. To prove the second inequality, we observe
that for every ε > 0 there exists a vector (T_ε, k_1(T_ε), ..., k_n(T_ε)) satisfying

v(P_c) > SΔ(k(T_ε))T_ε + Σ_{i=1}^n φ_i(k_i(T_ε)/T_ε) − ε
       = SΔ(k(T_ε))T_ε + Σ_{i=1}^n φ_i( k_i(T_ε)Δ(k(T_ε)) / (Δ(k(T_ε))T_ε) ) − ε.

Since Δ(k(T_ε)) ≥ (min_i{k_i(T_ε)})^{−1}, we have that k_i(T_ε)Δ(k(T_ε)) ≥ 1 for
every i, and consequently

v(P_c) ≥ SΔ(k(T_ε))T_ε + Σ_{i=1}^n inf{ φ_i(k_i/(Δ(k(T_ε))T_ε)) : k_i ≥ 1 } − ε
       ≥ inf_{T>0} { ST + Σ_{i=1}^n inf{ φ_i(k_i/T) : k_i ≥ 1 } } − ε
       = v(P_rel) − ε.

Since ε > 0 is arbitrary, the desired result follows. □
Since v(P_rel) equals inf_{T>0}{ ST + Σ_{i=1}^n inf{ φ_i(k_i/T) : k_i ≥ 1 } }, it is
natural to impose the following assumption.

Assumption 3.1. For each i = 1, ..., n the optimisation problem (P_i) given
by inf{ φ_i(x) : x > 0 } has a finite optimal solution x_i^* > 0.
The problems (P_i) introduced in Assumption 3.1 are often easy to solve.
For many single-component maintenance models the function φ_i(·) has a
unique minimum and is strictly decreasing left of this minimum and strictly
increasing right of it (i.e., the function φ_i(·) is strictly unimodal). In that case
optimisation can be carried out with, for example, golden-section search (see
Chapter 8 of Bazaraa et al. 1993). A more efficient algorithm to identify an
optimal solution for a large class of single-component models is presented by
Barros et al. (1995).
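Golden-section search needs only strict unimodality, not differentiability. A minimal sketch; the bracketing interval [a, b] must be supplied by the user, and the example cost function is an arbitrary strictly unimodal instance of the form φ(x) = (c + M(x))/x chosen by us:

```python
import math

def golden_section_search(f, a, b, tol=1e-8):
    """Minimise a strictly unimodal function f on [a, b] by golden-section search."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0   # 1/phi, about 0.618
    while b - a > tol:
        c, d = b - invphi * (b - a), a + invphi * (b - a)
        if f(c) < f(d):
            b = d               # the minimum lies in [a, d]
        else:
            a = c               # the minimum lies in [c, b]
    return 0.5 * (a + b)

# Example: phi(x) = (4 + x^2)/x is strictly unimodal with minimiser x* = 2.
xstar = golden_section_search(lambda x: (4.0 + x * x) / x, 0.1, 10.0)
print(round(xstar, 4))   # 2.0
```

Each step shrinks the bracket by the factor 1/φ ≈ 0.618, so the number of iterations grows only logarithmically in (b − a)/tol.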
To continue our analysis, if optimisation problem (P_rel) can be solved and
T(P_rel) is an optimal solution, then we can construct a feasible solution of
(P_c) and (P) in the following way. Introduce the interval I_i^(k) := [k/x_i^*, (k + 1)/x_i^*],
k = 0, 1, ..., and define the function g_i(·) as follows:

g_i(t) := φ_i(1/t)                        if t ∈ I_i^(0),
          min{ φ_i(k/t), φ_i((k+1)/t) }   if t ∈ I_i^(k), k = 1, 2, ...     (3.1)

Notice that for a given t, the value g_i(t) and the corresponding integers k_i(t)
can easily be calculated once an optimal solution x_i^* of (P_i) is known. A
given t lies within the interval I_i^(k) for which k = ⌊t x_i^*⌋, with ⌊·⌋ denoting the
lower-entier function. Consequently, if k = 0, one function evaluation (viz.
of φ_i(1/t)) is necessary to compute g_i(t), and k_i(t) equals 1. Otherwise, if
k ≥ 1, two function evaluations are necessary and k_i(t) equals k or k + 1,
depending on whether φ_i(k/t) ≤ φ_i((k+1)/t) or φ_i(k/t) ≥ φ_i((k+1)/t).
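The evaluation rule just described translates directly into code. A sketch (the names are ours; phi is assumed strictly unimodal with minimiser xstar):

```python
import math

def g(phi, xstar, t):
    """g_i(t) of (3.1) together with the minimising integer k_i(t).

    phi   -- the average-cost function phi_i of component i
    xstar -- an optimal solution x_i^* of (P_i)
    """
    k = math.floor(t * xstar)                    # t lies in the interval I_i^(k)
    if k == 0:                                   # one evaluation: phi_i(1/t)
        return phi(1.0 / t), 1
    left, right = phi(k / t), phi((k + 1) / t)   # two evaluations otherwise
    return (left, k) if left <= right else (right, k + 1)

# Example: phi(x) = (4 + x^2)/x has minimiser x* = 2.
phi = lambda x: (4.0 + x * x) / x
print(g(phi, 2.0, 0.25))    # (5.0, 1): k = 0, so g = phi(4) = 5
```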

Using (3.1), it is easy to calculate the integers k_i(T(P_rel)) corresponding
with T(P_rel), and it is clear that (T(P_rel), k_1(T(P_rel)), ..., k_n(T(P_rel))) is a
feasible solution for both (P_c) and (P).
In Section 4 we show that under certain conditions on the functions φ_i(·) it
holds that g_i(t) = inf{ φ_i(k_i/t) : k_i ∈ N }. However, without any conditions
on these functions we only have that inf{ φ_i(k_i/t) : k_i ∈ N } ≤ g_i(t) for
every t > 0 and so by the definition of (P) it follows that v(P) ≤ ST(P_rel) +
Σ_{i=1}^n g_i(T(P_rel)). Hence, if the value ST(P_rel) + Σ_{i=1}^n g_i(T(P_rel)) is close
enough to v(P_rel), we can decide, due to

v(P_rel) ≤ v(P_c) ≤ v(P) ≤ ST(P_rel) + Σ_{i=1}^n g_i(T(P_rel)),

that (T(P_rel), k_1(T(P_rel)), ..., k_n(T(P_rel))) is a reasonable feasible solution
of problem (P_c) and of problem (P).
To analyse now under Assumption 3.1 the optimisation problem (P_rel),
observe that for every T ≥ 1/x_i^* it holds that inf{ φ_i(k_i/T) : k_i ≥ 1 } =
φ_i(x_i^*). By this observation the following result is easy to prove. This result
will be used in a procedure to solve (P_rel).

Lemma 3.2. If we assume without loss of generality that 1/x_n^* ≤ 1/x_{n−1}^* ≤
... ≤ 1/x_1^*, then for any optimal T(P_rel) of (P_rel) it follows that T(P_rel) ≤
1/x_1^*.

Proof. For all T > 1/x_1^* we obtain that

ST + Σ_{i=1}^n inf{ φ_i(k_i/T) : k_i ≥ 1 } = ST + Σ_{i=1}^n φ_i(x_i^*)
                                          > S/x_1^* + Σ_{i=1}^n φ_i(x_i^*),

and so for every T > 1/x_1^* the objective function of (P_rel) evaluated in T is
larger than the objective function evaluated in the point 1/x_1^*. This implies
T(P_rel) ≤ 1/x_1^* and the desired result is proved. □
In Section 4 we will simplify the objective function of problem (P_rel)
by imposing some assumptions on the functions φ_i(·). In order to simplify
the objective function of problem (P), we also need some assumptions on
the same functions φ_i(·). However, before introducing these assumptions, we
discuss the literature on problem (P).

3.2 Literature on Problem (P)

Goyal and Kusy (1985) and Goyal and Gunasekaran (1992) apply an iterative
algorithm to solve problem (2.4) in the previous section (equivalent with (P))
for their specific deterioration-cost functions. The authors initialise each k_i =
1 and then find the corresponding optimal T by setting the derivative of
the cost function of (2.4) as a function of T equal to zero. Subsequently,
the authors find for each i a value of k_i, in two different ways. Goyal and
Kusy (1985) find for each i the optimal integer k_i belonging to T by looking
in a table that is made in advance for each component and that gives the
optimal k_i for disjoint ranges of T. Goyal and Gunasekaran (1992) find for
each i the optimal real-valued k_i by setting the derivative of the cost function
of (2.4) as a function of k_i to zero and rounding this real-valued k_i to the
nearest integer. Once a value for k_i is found, it is compared to the k_i in the
previous iteration (in this case the initialisation). When for each i the k_i
in the two iterations are equal, the algorithm terminates. Otherwise a new
optimal T is found for the current values of k_i, and subsequently new values
of k_i are determined, and so on, until for all i the k_i in two consecutive
iterations are equal.
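The alternating scheme just described can be sketched as a coordinate descent. This is our own generic, numerical rendering (golden-section search replaces the analytical derivative step of Goyal et al., and all names are ours); as noted below, such an iteration may stop in a local optimum:

```python
import math

def _argmin_T(f, a=1e-3, b=10.0, tol=1e-9):
    """Golden-section minimisation of f over [a, b] (f assumed unimodal there)."""
    invphi = (math.sqrt(5.0) - 1.0) / 2.0
    while b - a > tol:
        c, d = b - invphi * (b - a), a + invphi * (b - a)
        if f(c) < f(d):
            b = d
        else:
            a = c
    return 0.5 * (a + b)

def best_k(phi, xstar, T):
    """Best integer k for fixed T: floor or ceil of x*/T, by unimodality of phi."""
    k = max(1, math.floor(xstar / T))
    return k if phi(k * T) <= phi((k + 1) * T) else k + 1

def goyal_iteration(S, phis, xstars, max_iter=100):
    """Coordinate descent in the spirit of Goyal et al. for problem (2.4).

    Alternates the best T for fixed integers k_i with the best k_i for fixed T,
    stopping when the k_i no longer change between two consecutive iterations.
    """
    ks = [1] * len(phis)
    T = None
    for _ in range(max_iter):
        T = _argmin_T(lambda t: S / t + sum(p(k * t) for p, k in zip(phis, ks)))
        new_ks = [best_k(p, x, T) for p, x in zip(phis, xstars)]
        if new_ks == ks:
            break
        ks = new_ks
    return T, ks

# One component with phi(x) = (4 + x^2)/x (x* = 2) and a small set-up cost:
T, ks = goyal_iteration(0.01, [lambda x: (4.0 + x * x) / x], [2.0])
print(ks, round(T, 3))   # [1] 2.002
```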
The advantage of this algorithm is that it is fast. This is primarily due to
the special deterioration structure of the components in the cases of Goyal et
al., which makes it possible to find an analytical expression for the optimal T
given values of k_i, and also to find a value for the k_i in little time.
The specific deterioration structure of the components is at the same
time a great disadvantage, since there is little room for differently modelled
components. It is possible to extend the algorithm to deal with the more
general maintenance models given in the previous section, but in that case a
value for an optimal T given values of k_i has to be computed numerically, and
the same holds for the corresponding values of k_i. As a result, the algorithm
will become much slower.
The greatest disadvantage of the algorithm of Goyal et al. is, however, that
it often gets stuck in a local optimal solution (see Van Egmond et al. 1995).
There is no indication whatsoever of how good the solutions are when this
occurs. This implies that even if we extend the algorithm to deal with more
general maintenance models (which we will do anyway to study its perfor-
mance in Section 5), we do not have any guarantee concerning the quality of
the obtained solutions.
In the inventory theory literature many heuristics have appeared for the
special cost functions in the joint-replenishment problem (see Goyal and
Satir 1989). Although some heuristics can be modified to deal with the cost
functions of maintenance optimisation, the performance of these heuristics
cannot be guaranteed.
Altogether, the literature does not provide an efficient and reliable ap-
proach to solve problem (P). That is why we will focus in this chapter on an
alternative solution approach that is based on the global optimisation of (P).
In order to do so, we need to simplify the objective function of problem (P),
which will be done in the next subsection. In a solution procedure for (P)
(discussed in Section 4) we first find a solution to problem (P_rel) and by using
(3.1) we then obtain a feasible solution to (P) (and hence also to (P_c)). Since
v(P_rel) is a lower bound on both v(P_c) and v(P), we can decide whether this
feasible solution is good enough.
If this feasible solution is not good enough, we subsequently apply a
global-optimisation procedure to the simplified problem (P) in an interval
that is obtained by the relaxation and that contains an optimal T(P). For
the special cases of Goyal et al., the minimal-repair model and the inspection
model it is then possible to find in little time a solution to (P) with an objec-
tive value that has an arbitrarily small deviation from the optimal value v(P).
For the block-replacement model this is not possible, but application of a fast
golden-section search heuristic yields a good solution as well. In all cases our
approach outperforms that of Goyal et al. Our approach can also be applied
to find an optimal solution to the joint-replenishment problem, see Dekker et
al. (1995). In that case the procedure can be made even more efficient, since
the cost functions in that problem have a very simple form.
With a solution to problem (P), we then have an improved upper bound
v(P) on v(P_c). If this is close to v(P_rel), then it is by Lemma 3.1 also close
to v(P_c) and so we have a good solution of (P_c) as well.
We will now simplify under certain conditions the objective function of
problem (P).

3.3 Simplification of Problem (P)


To simplify the objective function of problem (P), we introduce the following
definition and assumption (for Definition 3.1 see also Chapter 3 of Avriel et
al. 1988).

Definition 3.1. A function f(x), x ∈ (0,∞), is called unimodal on (0,∞)
with respect to b ≥ 0 if f(x) is decreasing for x ≤ b and increasing for
x ≥ b. That is, f(y) ≥ f(x) for every y ≤ x ≤ b, and f(y) ≥ f(x) for every
y ≥ x ≥ b.

Observe that by this definition it is immediately clear that any increasing
function f(x), x ∈ (0,∞), is unimodal on (0,∞) with respect to b = 0.

Assumption 3.2. For each i = 1, ..., n the optimisation problem (P_i) given
by inf{ φ_i(x) : x > 0 } has a finite optimal solution x_i^* > 0. Furthermore, for
each i the function φ_i(·) is unimodal on (0,∞) with respect to x_i^*.
By Assumption 3.2 the objective function of problem (P) can be simplified
considerably. To this end consider the interval I_i^(k) := [k/x_i^*, (k + 1)/x_i^*], k =
0, 1, ..., introduced in Section 3.1 and observe that if t ∈ I_i^(k) and k ≥ 1, then
it holds that k/t ≤ x_i^* ≤ (k + 1)/t, so that

x_i^* ≤ (k + 1)/t ≤ (k + 2)/t ≤ (k + 3)/t ≤ ...

and

x_i^* ≥ k/t ≥ (k − 1)/t ≥ ... ≥ 1/t.

Therefore, as by Assumption 3.2 the function φ_i(·) is unimodal on (0,∞)
with respect to x_i^*, we have that

φ_i((k + 1)/t) ≤ φ_i((k + 2)/t) ≤ φ_i((k + 3)/t) ≤ ...

and

φ_i(k/t) ≤ φ_i((k − 1)/t) ≤ ... ≤ φ_i(1/t).

Analogously, if t ∈ I_i^(0) and t > 0, then it holds that x_i^* ≤ 1/t, so that

x_i^* ≤ 1/t ≤ 2/t ≤ 3/t ≤ ...,

and consequently

φ_i(1/t) ≤ φ_i(2/t) ≤ φ_i(3/t) ≤ ....
This implies that for a given t it is easy to determine an optimal integer k_i(t),
since now we have that

inf{ φ_i(k_i/t) : k_i ∈ N } = φ_i(1/t)                          if t ∈ I_i^(0),
inf{ φ_i(k_i/t) : k_i ∈ N } = min{ φ_i(k/t), φ_i((k+1)/t) }     if t ∈ I_i^(k), k = 1, 2, ...

Consequently, if we define g_i(·) as in (3.1), it follows that

g_i(t) = inf{ φ_i(k_i/t) : k_i ∈ N }.

It is not difficult to verify that by Assumption 3.2 and the fact that φ_i(·) is
continuous, the function g_i(·) is also continuous. In Figure 3.1 an example of
the function g_i(·) is given.
Under Assumption 3.2 the optimisation problem (P) has the following
simplified representation:

inf_{T>0} { ST + Σ_{i=1}^n g_i(T) },

with g_i(·) given by (3.1).
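Because this objective typically has many local minima (see Figure 3.2 below), a guaranteed method needs global optimisation, which Section 4 develops. Purely for illustration, the simplified representation can also be attacked by brute force on a grid. A sketch of ours (the search range and grid size are arbitrary choices, and g re-implements (3.1) so the snippet is self-contained):

```python
import math

def g(phi, xstar, t):
    """g_i(t) from (3.1): one phi-evaluation if t <= 1/x_i^*, two otherwise."""
    k = math.floor(t * xstar)
    if k == 0:
        return phi(1.0 / t)
    return min(phi(k / t), phi((k + 1) / t))

def solve_P_grid(S, phis, xstars, lo=1e-3, points=20000):
    """Brute-force grid minimisation of S*T + sum_i g_i(T) over a heuristic range."""
    hi = 2.0 / min(xstars)              # heuristic upper end of the search range
    best_T, best_v = None, float('inf')
    for j in range(1, points + 1):
        T = lo + (hi - lo) * j / points
        v = S * T + sum(g(p, x, T) for p, x in zip(phis, xstars))
        if v < best_v:
            best_T, best_v = T, v
    return best_T, best_v

# Two components with phi_i(x) = (c_i + x^2)/x, minimisers x_1* = 2, x_2* = 3:
phis = [lambda x: (4.0 + x * x) / x, lambda x: (9.0 + x * x) / x]
T, v = solve_P_grid(5.0, phis, [2.0, 3.0])
print(round(T, 3), round(v, 3))   # 0.333 12.0
```

Such enumeration is exactly the kind of computationally prohibitive approach the chapter sets out to replace; it serves only to make the shape of the objective concrete.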
Below we introduce a class of functions φ_i(·) that satisfy Assumption 3.2.
To this end we need the next result.

Lemma 3.3. If the function M_i(·) is convex on (b_i, ∞) for some b_i ≥ 0,
then it follows that c := lim_{x→∞} M_i(x)/x ≤ ∞ exists. Moreover, the function
M_i(x) − xc is decreasing on (b_i, ∞).

Proof. Since the function M_i(·) is convex on (b_i, ∞) it follows by applying
the well-known criterion of increasing slopes valid for convex functions (see
Proposition 1.1.4 in Chapter I of Hiriart-Urruty and Lemaréchal 1993) that
for any fixed y > b_i the function x → (M_i(x) − M_i(y))/(x − y) is increasing on
(b_i, ∞) \ {y}. This implies that ∞ ≥ lim_{x→∞}(M_i(x) − M_i(y))/(x − y) exists
and clearly this limit equals c := lim_{x→∞} M_i(x)/x. To prove the second part
Fig. 3.1. An example of the function g_i(·). The thin lines are the graphs of the func-
tions φ_i(1/t), φ_i(2/t), ..., φ_i(5/t). The (bold) graph of g_i(·) is the lower envelope
of these functions.

we only need to consider c < ∞. Observe now that for any y with b_i < y < x
it holds that

(M_i(x) − xc) − (M_i(y) − yc) = ( (M_i(x) − M_i(y))/(x − y) − c ) (x − y).

Since by the first part of this lemma we have that the function x → (M_i(x) −
M_i(y))/(x − y) is increasing on (y, ∞) and lim_{x→∞}(M_i(x) − M_i(y))/(x − y) =
c, it follows by the above equality that M_i(x) − xc ≤ M_i(y) − yc. □
Using Lemma 3.3 we can show the following result. Observe that the first
part of this lemma improves a result given by Dekker (1995).

Lemma 3.4. If M_i(·) is concave on (0, b_i) and convex on (b_i, ∞) for some
b_i ≥ 0, then the set of optimal solutions of the optimisation problem (P_i)
given by inf{ φ_i(x) : x > 0 } is nonempty and compact if and only if
lim_{x→∞} M_i(x) − xc < −c_i^p, with c := lim_{x→∞} M_i(x)/x. Moreover, it follows
for any optimal solution x_i^* of (P_i) that x_i^* ≥ b_i and that the function φ_i(·)
is unimodal on (0,∞) with respect to x_i^*.

Proof. If for some b_i > 0 the function M_i(·) is concave on (0, b_i), then the
function c_i^p + M_i(·) is also concave on (0, b_i). This implies for every 0 < z_1 <
z_2 < b_i that c_i^p + M_i(z_1) = c_i^p + M_i((z_1/z_2)z_2) > (z_1/z_2)(c_i^p + M_i(z_2)). Hence,
by equation (2.1) it follows that φ_i(z_1) > φ_i(z_2) and, consequently, that φ_i(·)
is strictly decreasing on (0, b_i). By this observation it follows that if x_i^* > 0
is an optimal solution of (P_i), then necessarily x_i^* ≥ b_i. On the other hand, if
b_i = 0, then by the feasibility of x_i^* we also have that x_i^* ≥ b_i and this proves
the second part of the lemma (that is, x_i^* ≥ b_i for any optimal solution x_i^*).
To verify the 'only-if' proposition, observe that the optimal objective value
v(P_i) of (P_i) is smaller than φ_i(∞), since the optimal solution set of (P_i) is
nonempty and compact. By the first part of the proof and the continuity of
φ_i(·) this yields the existence of some x_0 > b_i such that φ_i(x_0) < φ_i(∞).
The first part of Lemma 3.3 shows that c := lim_{x→∞} M_i(x)/x exists and
since φ_i(∞) = lim_{x→∞} M_i(x)/x = c and φ_i(x_0) < φ_i(∞), it follows that
(c_i^p + M_i(x_0))/x_0 < c or, equivalently, M_i(x_0) − x_0 c < −c_i^p. Using now that
the function M_i(x) − xc is decreasing on (b_i, ∞) (Lemma 3.3), we have that
lim_{x→∞} M_i(x) − xc < −c_i^p. To verify the other inclusion (the 'if' proposition),
observe that lim_{x→∞} M_i(x) − xc < −c_i^p implies that there exists some x_0 ∈
(0,∞) satisfying M_i(x_0) − x_0 c < −c_i^p or, equivalently, φ_i(x_0) < φ_i(∞). This
yields that v(P_i) < φ_i(∞) and since also φ_i(0) = ∞ and φ_i(·) is continuous,
this implies that the optimal solution set of (P_i) is nonempty and compact.
To verify the last result, observe since M_i(·) is convex on (b_i, ∞), that
the function c_i^p + M_i(·) is also convex on (b_i, ∞). By Theorem 3.51 of Mar-
tos (1975) this implies that φ_i(t) = (c_i^p + M_i(t))/t is a so-called quasiconvex
function. Using that inf{ φ_i(x) : x ≥ b_i } has an optimal solution x_i^* ≥ b_i, we
then obtain by Proposition 3.8 of Avriel et al. (1988) that φ_i(·) is unimodal
on (b_i, ∞) with respect to x_i^*. Together with the result that the continuous
function φ_i(·) is strictly decreasing on (0, b_i) the desired result follows, that
is, φ_i(·) is unimodal on (0,∞) with respect to x_i^*. □
Now we can show that in general the special cases of Goyal et al., the
minimal-repair model and the inspection model satisfy Assumption 3.2 when
the optimisation problem (P_i) has a finite optimal solution x_i^* > 0.

Theorem 3.1. If each (P_i), i = 1, ..., n, has a finite solution x_i^* > 0 and is
formulated according to one of the special cases of Goyal et al., the minimal-
repair model with a unimodal rate of occurrence of failures or the inspection
model, then Assumption 3.2 is satisfied.

Proof. We will prove that, if for a certain i ∈ {1, ..., n} the optimisation
problem (P_i) has a finite solution x_i^* > 0 for one of the models mentioned,
then the function φ_i(·) is unimodal with respect to x_i^*. Consider therefore an
arbitrary i ∈ {1, ..., n} and distinguish between the different models.
1. Special Case of Goyal and Kusy. It is easy to show (by setting the deriva-
tive of φ_i(·) to zero) that the optimisation problem (P_i) has an optimal
solution x_i^* = { c_i^p (e + 1)/(v_i e) }^{1/(e+1)}. This solution is finite and posi-
tive if and only if v_i, e and c_i^p are strictly larger than zero, and by the
assumption that x_i^* > 0 we can assume that this is the case.
We have that M_i(x) = ∫_0^x (f_i + v_i t^e) dt = f_i x + (v_i/(e + 1)) x^{e+1}, so that
M_i''(x) = e v_i x^{e−1} > 0 and, as a result, M_i(·) is (strictly) convex on
(0,∞). By Lemma 3.4 we then have that φ_i(·) is unimodal with respect
to x_i^*.
2. Special Case of Goyal and Gunasekaran. It is easy to show (by setting
the derivative of φ_i(·) to zero) that the optimisation problem (P_i) has an
optimal solution x_i^* = { 2(c_i^p − a_i X_i y_i)/(b_i y_i^2) + X_i^2 }^{1/2}. This solution is
finite and positive if and only if b_i and y_i are strictly larger than zero
and c_i^p > X_i y_i (a_i − b_i X_i y_i / 2), and by the assumption that x_i^* > 0 we
can assume that this is the case.
We have that M_i(x) = ∫_0^{y_i(x−X_i)} (a_i + b_i t) dt = a_i y_i (x − X_i) + b_i y_i^2 (x −
X_i)^2 / 2, so that M_i''(x) = b_i y_i^2 > 0 and, as a result, M_i(·) is (strictly)
convex on (0,∞). By Lemma 3.4 we then have that φ_i(·) is unimodal
with respect to x_i^*.
3. Minimal-Repair Model. If the rate of occurrence of failures r_i(·) is uni-
modal with respect to a value b_i ≥ 0, then as M_i(x) = c_i^f ∫_0^x r_i(t) dt it fol-
lows that M_i'(·) is decreasing on (0, b_i) and increasing on (b_i, ∞). Hence
M_i(·) is concave on (0, b_i) and convex on (b_i, ∞). Since the optimisation
problem (P_i) has a finite solution x_i^* > 0 we then have by Lemma 3.4
that φ_i(·) is unimodal with respect to x_i^*.
Notice that if b_i = 0, then r_i(·) is increasing on (0,∞) and M_i(·) is convex
on (0,∞). If r_i(·) is unimodal with respect to a b_i strictly larger than
zero, then φ_i(·) follows a bathtub pattern. In Lemma 3.4 we showed that
for this case x_i^* ≥ b_i. As the function M_i(·) is convex on (b_i, ∞), it is
a fortiori convex on (x_i^*, ∞), a result that will be used later to prove
that the relaxation (P_rel) of (P) is a convex-programming problem (see
Lemma 4.2).
4. Inspection Model. Since M_i(x) = c_i^f ∫_0^x F_i(t) dt, we have that M_i'(x) is
increasing on (0,∞), and hence that M_i(x) is convex on (0,∞). Since
the optimisation problem (P_i) has a finite solution x_i^* > 0 we then have
by Lemma 3.4 that φ_i(·) is unimodal with respect to x_i^*.
Consequently, if for each i = 1, ..., n one of the above models is used (possibly
different models for different i), then φ_i(·) is unimodal with respect to x_i^* and
so we have verified that Assumption 3.2 is satisfied. □
Observe that by Lemma 3.4 an easy necessary and sufficient condition for the
existence of only finite optimal solutions of (P_i) is presented for both cases 3
and 4 above.
In Figure 3.2 an example of the objective function of problem (P) under
Assumption 3.2 is given. In general this objective function has several local
minima, even for the simple models described above. This is due to the shape
of the functions g_i(·) and it is inherent to the fact that the k_i have to be inte-
ger. In the following section we show that when problem (P_rel) is considered,
often a much easier problem is obtained; for the special cases of Goyal et
al., the minimal-repair model and the inspection model the relaxation (P_rel)
turns out to be a single-variable convex-programming problem and so it is
easy to solve.
Fig. 3.2. An example of the objective function of problem (P); there are many
local minima.

4. Solving Problem (P)

In this section we discuss under some additional assumptions on the func-
tions φ_i(·) a computationally fast solution procedure for problem (P). This
yields an optimal solution (T(P), k_1(T(P)), ..., k_n(T(P))) of (P). With re-
spect to problem (P_c) we observe that the optimal solution of (P) is also fea-
sible for (P_c). Moreover, if there exists a (T(P_c), k_1(T(P_c)), ..., k_n(T(P_c)))
that is optimal for (P_c) with Δ(k(T(P_c))) = 1, it follows by Lemma 3.1
that v(P) = v(P_c), so that in that case (T(P), k_1(T(P)), ..., k_n(T(P)))
is also an optimal solution of (P_c). For this solution it follows automati-
cally that Δ(k(T(P))) = 1, so that if for the generated optimal solution
(T(P), k_1(T(P)), ..., k_n(T(P))) of (P) it holds that Δ(k(T(P))) < 1, then
this implies that v(P_c) < v(P).
To start our approach to tackle problem (P), we first find out under which
conditions the relaxation (P_rel) introduced in Section 3.1 is easy to solve.

4.1 Analysis of (P_rel)

To simplify problem (P_rel) we only need a much weaker assumption than
Assumption 3.2 discussed in the previous section.
Assumption 4.1. For each i = 1, ..., n the optimisation problem (P_i) given
by inf{ φ_i(x) : x > 0 } has a finite optimal solution x_i^* > 0. Furthermore, for
each i = 1, ..., n it holds that φ_i(·) is increasing on (x_i^*, ∞).

Theorem 3.1 showed for the special cases of Goyal et al., the minimal-repair
model with a unimodal rate of occurrence of failures and the inspection
model, that Assumption 3.2 is satisfied when (P_i) has a finite solution x_i^* > 0.
As a result, also Assumption 4.1 is satisfied for these models.
By Assumption 4.1 the objective function of problem (P_rel) can be sim-
plified. Analogously to equation (3.1) we have for

g_i^(R)(t) := φ_i(1/t)     if t ≤ 1/x_i^*,
             φ_i(x_i^*)    if t ≥ 1/x_i^*,       (4.1)

that g_i^(R)(t) = inf{ φ_i(k_i/t) : k_i ≥ 1 }. In Figure 4.1 an example of the
function g_i^(R)(·) is given.

Fig. 4.1. An example of the function g_i^(R)(·). Notice the similarity with the graph
of g_i(·) in Figure 3.1.

Now (P_rel) has the following simplified representation:

(R)   inf_{T>0} { ST + Σ_{i=1}^n g_i^(R)(T) }.

Denote by v(R) the optimal objective value of (R) and by T(R) an optimal T
(if it exists). In the remainder we will assume that (R) always has an optimal
solution. Notice that by Assumption 4.1 it follows that v(R) = v(P_rel) and
T(R) = T(P_rel), since (R) and (P_rel) are equivalent under this assumption.
Remember, if we use (R) we always assume that Assumption 4.1 holds.
We will now consider a class of functions φ_i(·) that satisfy Assumption 4.1.

Lemma 4.1. If the optimisation problem (P_i) given by inf{ φ_i(x) : x > 0 }
has a finite optimal solution x_i^* > 0 and the function M_i(·) is convex on
(x_i^*, ∞), then the function φ_i(·) is increasing on (x_i^*, ∞).

Proof. Since the function M_i(·) is convex on (x_i^*, ∞), it follows by Theo-
rem 3.51 of Martos (1975) that φ_i(t) = (c_i^p + M_i(t))/t is a so-called quasicon-
vex function on (x_i^*, ∞). Since inf{ φ_i(x) : x > 0 } has an optimal solution
x_i^* > 0, the desired result follows by Proposition 3.8 of Avriel et al. (1988). □
Under the same condition as imposed in Lemma 4.1, one can prove ad-
ditionally that the function g_i^(R)(·) is convex. Consequently, if the condition
of Lemma 4.1 holds for each i, the optimisation problem (R) is a univariate
convex-programming problem and so it is easy to solve. The convexity of the
function g_i^(R)(·) is established by the following lemma.

Lemma 4.2. If the function M_i(·) is convex on (x_i^*, ∞), then the function
g_i^(R)(·) is convex on (0,∞).

Proof. To show that the function g_i^(R)(·) is convex it is sufficient to show
that the function t → φ_i(1/t) is convex on (0, 1/x_i^*). If t → φ_i(1/t) is convex
on (0, 1/x_i^*) then it follows from the fact that φ_i(1/(1/x_i^*)) = φ_i(x_i^*) is the
minimal value of φ_i(·) on (0,∞), that t → φ_i(1/t) is also decreasing on
(0, 1/x_i^*) and then it follows from the definition of g_i^(R)(·) (see (4.1)) that
g_i^(R)(·) is convex on (0,∞).
So we have to prove that t → φ_i(1/t) is convex on (0, 1/x_i^*). We will
prove that t → φ_i(1/t) is convex on (0, 1/b_i) if M_i(·) is convex on (b_i, ∞).
So let M_i(·) be convex on (b_i, ∞); then c_i^p + M_i(t) = t φ_i(t) is also convex
on (b_i, ∞). Define now for a function f(·)

s_f(t, t_0) = (f(t) − f(t_0)) / (t − t_0),

and let f(t) := t φ_i(t) and g(t) := φ_i(1/t). It is easy to verify that

s_f(t, t_0) = φ_i(t_0) − (1/t_0) s_g(1/t, 1/t_0).    (4.2)

The well-known criterion of increasing slopes valid for convex functions (see,
e.g., Proposition 1.1.4 in Chapter I of Hiriart-Urruty and Lemaréchal 1993)
yields for the convex function f(t) = t φ_i(t) on (b_i, ∞) that s_f(t, t_0) is in-
creasing in t > b_i for every t_0 > b_i. By (4.2) this implies that φ_i(t_0) −
(1/t_0) s_g(1/t, 1/t_0) is increasing in t > b_i for every t_0 > b_i. Since φ_i(t_0) and
1/t_0 are constants, the function −s_g(1/t, 1/t_0) is then increasing in t > b_i for
every t_0 > b_i. Hence, s_g(1/t, 1/t_0) is increasing as a function of 1/t < 1/b_i for
every 1/t_0 < 1/b_i, which is equivalent with s_g(x, x_0) is increasing in x < 1/b_i
for every x_0 < 1/b_i. Using again the criterion of increasing slopes for convex
functions we obtain that g(t) = φ_i(1/t) is convex on (0, 1/b_i).
If M_i(·) is convex on (x_i^*, ∞), that is, if b_i = x_i^*, then we have that
t → φ_i(1/t) is convex on (0, 1/x_i^*), which completes the proof. (Notice that
if M_i(·) is convex on (0,∞), that is, if b_i = 0, then we have that t → φ_i(1/t)
is also convex on (0,∞).) □
We can now apply the above results to the special cases of Goyal et aI.,
the minimal-repair model and the inspection model.

Theorem 4.1. If each (P_i), i = 1, …, n, has a finite solution x_i^* > 0 and is
formulated according to one of the special cases of Goyal et al., the minimal-
repair model with a unimodal rate of occurrence of failures or the inspection
model, then problem (Prel) is equivalent with problem (R) and (R) is a convex-
programming problem.

Proof. In the proof of Theorem 3.1 we showed that for the minimal-repair
model with a unimodal rate of occurrence of failures the function M_i(·) is
convex on (x_i^*, ∞). In case of an increasing rate of occurrence of failures,
M_i(·) is even convex on (0, ∞), and thus a fortiori convex on (x_i^*, ∞). We
also showed that for the special cases of Goyal et al. and the inspection model
the function M_i(·) is convex on (0, ∞), so that M_i(·) is a fortiori convex
on (x_i^*, ∞). Consequently, if for each i = 1, …, n one of the above models is
used (possibly different models for different i), then by Lemma 4.2 the cor-
responding g_i^(R)(·) are convex, so that problem (R) is a convex-programming
problem. □
In Figure 4.2 an example of the objective function of problem (R) is given.
We can now explain why we applied in the previous section the trans-
formation of T into 1/T in the original optimisation problem (2.4). We saw
that (R) is a convex-programming problem if each function g_i^(R)(·) is con-
vex on (0, ∞). In the proof of Lemma 4.2 we showed that this is the case
if each function t ↦ φ_i(1/t) is convex on (0, 1/x_i^*). We showed furthermore
that the function t ↦ φ_i(1/t) is convex on (0, 1/x_i^*) if M_i(·) is convex on
(x_i^*, ∞) (which is generally the case for the models described before). If we
did not apply the transformation of T into 1/T, we would obtain that the
corresponding relaxation is a convex-programming problem only if each func-
tion φ_i(·) is convex on (x_i^*, ∞). This is a much more restrictive condition and
it is in general not true (not even for the models mentioned before). Sum-
marising, the transformation of T into 1/T causes the relaxation to be a
convex-programming problem for the models described before, a result that
otherwise does not generally hold.
If (R) is a convex-programming problem, it can easily be solved to op-
timality. When the functions g_i^(R)(·) are differentiable (which is the case if
258 Rommert Dekker et al.

Fig. 4.2. An example of the objective function of problem (R).

the functions φ_i(·) are differentiable), we can set the derivative of the cost
function in (R) equal to zero and subsequently find an optimal solution with
the bisection method. When the functions g_i^(R)(·) are not differentiable, we
can apply a golden-section search. (For a description of these methods, see
Chapter 8 of Bazaraa et al. 1993.)
To apply these procedures it is necessary to have a lower and an upper
bound on an optimal value T(R). If we assume again without loss of generality
that 1/x_n^* ≤ 1/x_{n−1}^* ≤ … ≤ 1/x_1^*, then for any optimal T(R) of (R) it follows
by Lemma 3.2 that 0 < T(R) ≤ 1/x_1^*.
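The golden-section variant of this search can be sketched in a few lines; the quadratic objective and the upper bound 1.0 below are placeholders standing in for the cost function of (R) and the bound 1/x_1^*, not part of the model:

```python
def golden_section_search(f, lo, hi, tol=1e-8):
    """Minimise a unimodal function f on [lo, hi] by golden-section search."""
    invphi = (5 ** 0.5 - 1) / 2  # 1/golden ratio, ~0.618
    while hi - lo > tol:
        c = hi - invphi * (hi - lo)
        d = lo + invphi * (hi - lo)
        if f(c) < f(d):
            hi = d  # minimiser lies in [lo, d]
        else:
            lo = c  # minimiser lies in [c, hi]
    return (lo + hi) / 2

# Dummy convex objective with known minimiser 0.3; the interval (0, 1.0]
# plays the role of (0, 1/x_1^*].
t_r = golden_section_search(lambda t: (t - 0.3) ** 2 + 1.0, 1e-6, 1.0)
```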
Once (R) is solved we have an optimal T(R). If additionally (R) is a
convex-programming problem, it is possible to derive an easy dominance
result for the optimal solution of (P). In order to do so, we first need the
following lemma.
Lemma 4.3. If T(R) ≤ 1/x_n^* is an optimal solution of problem (R) (and
Assumption 4.1 holds), then (T(R), 1, …, 1) is an optimal solution of (Pc)
and of (P). Moreover, if there does not exist an optimal solution of (R) within
the interval (0, 1/x_n^*), then any optimal solution T(P) of (P) is bounded from
below by 1/x_n^*.
Proof. Since T(R) ≤ 1/x_n^* is an optimal solution of problem (R) it follows by
Assumption 4.1 that the optimal scalars k_i(T(R)), i = 1, …, n, are equal to
one and so (T(R), k_1(T(R)), …, k_n(T(R))) is also a feasible solution of prob-
lem (Pc) and (P). Hence we obtain that v(R) = ST(R) + Σ_{i=1}^n φ_i(1/T(R)) ≥

v(P), and this yields by Lemma 3.1 that v(R) = v(Pc) = v(P), implying that
(T(R), 1, …, 1) is also an optimal solution of (Pc) and of (P).
To prove the second part, observe, since the functions g_i(·) and g_i^(R)(·)
(see (3.1) and (4.1)) are identical on (0, 1/x_n^*] (by Assumption 4.1), that
T < 1/x_n^* is a local optimal solution of problem (R) if and only if T is a local
optimal solution of problem (P). Hence, if there is no local optimal solution
of (R) within the interval (0, 1/x_n^*), then there is no local optimal solution of
problem (P) within (0, 1/x_n^*), and this yields T(P) ≥ 1/x_n^*. □
If (R) is a convex-programming problem, then Lemma 4.3 yields the fol-
lowing result. If T(R) ≤ 1/x_n^* then T(R) is an optimal solution of (Pc) and
(P) and the optimal scalars k_i, i = 1, …, n, are equal to one. If T(R) > 1/x_n^*,
we evaluate the objective function of (R) in 1/x_n^* and if this value equals v(R)
then 1/x_n^* is also an optimal solution of (R), so that by Lemma 4.3 it follows
that 1/x_n^* is an optimal solution of (Pc) and (P) as well. Finally, if the ob-
jective function of (R) in 1/x_n^* is larger than v(R), then there does not exist
a local optimal solution of (R) within (0, 1/x_n^*) (since the objective func-
tion of (R) is convex) and thus it follows by Lemma 4.3 that T(P) ≥ 1/x_n^*.
Consequently, we have shown the following corollary.
Corollary 4.1. Suppose (R) is a convex-programming problem. If T(R) >
1/x_n^* and the objective function of (R) in 1/x_n^* is larger than v(R), then for
any optimal solution T(P) of (P) it follows that T(P) ≥ 1/x_n^*. Otherwise,
an optimal T(P) is given by T(P) = min{1/x_n^*, T(R)}.
Observe for T(R) > 1/x_n^* that T(R) may not be an optimal solution
of problem (P). Besides, the values of k_i corresponding with T(R) are not
necessarily integer, implying that the optimal solution of (R) is in general not
feasible for (P) when T(R) > 1/x_n^*. Consequently, the first thing to do when
T(R) > 1/x_n^* is to find a feasible solution for (P) (which is consequently also
a feasible solution for problem (Pc)).

4.2 Feasibility Procedures for (Pc) and (P)

A straightforward way to find a feasible solution for (Pc) and (P) is
to substitute the value of T(R) in (3.1). This is specified by the following
Feasibility Procedure (FP).

Feasibility Procedure

For each i = 1, …, n do the following:

1. Compute k = ⌊T(R) x_i^*⌋. This is the value for which T(R) ∈ I_i^(k).
2. If k = 0, then k_i(FP) = 1 is the optimal k_i-value for (P) corresponding
   with T(R) (use (3.1)).
3. If k ≥ 1 then k_i(FP) = k or k_i(FP) = k + 1 is an optimal value,
   depending on whether φ_i(k/T(R)) ≤ φ_i((k + 1)/T(R)) or φ_i(k/T(R)) ≥
   φ_i((k + 1)/T(R)) (use (3.1)).
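In code the FP amounts to one floor operation and at most two cost evaluations per component. A minimal sketch, in which the arguments `x_star` and `phi` are hypothetical stand-ins for the minimisers x_i^* and the average-cost functions φ_i(·):

```python
import math

def feasibility_procedure(T, x_star, phi):
    """FP: for a given T, return an optimal integer k_i per component;
    x_star[i] is the minimiser x_i^* of phi[i] (steps follow the text)."""
    ks = []
    for xi, phi_i in zip(x_star, phi):
        k = math.floor(T * xi)              # step 1
        if k == 0:
            ks.append(1)                    # step 2
        else:                               # step 3: k or k+1, whichever is cheaper
            ks.append(k if phi_i(k / T) <= phi_i((k + 1) / T) else k + 1)
    return ks

# Single component with phi(x) = 4/x + x, so x^* = 2.
phi = [lambda x: 4.0 / x + x]
ks_small = feasibility_procedure(0.3, [2.0], phi)   # floor(0.6) = 0, so k_i = 1
ks_large = feasibility_procedure(1.2, [2.0], phi)   # floor(2.4) = 2, compare k = 2, 3
```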
If v(FP) denotes the objective value ST(R) + Σ_{i=1}^n g_i(T(R)), then clearly by
the definition of g_i(·) and Assumption 4.1 it follows that

v(FP) = ST(R) + Σ_{i=1}^n g_i(T(R)) ≥ v(P) ≥ v(Pc) ≥ v(R).

Hence we can check the quality of the solution; if v(FP) is close to v(R) then
it is also close to the optimal objective value v(Pc) or v(P).
If it is not close enough we may find a better solution by applying a pro-
cedure that is similar to the iterative approach of Goyal et al. We will call
this procedure the Improved-Feasibility Procedure (IFP).

Improved-Feasibility Procedure
1. Let k_i(IFP) = k_i(FP), i = 1, …, n, with k_i(FP) the values given by
   the feasibility procedure FP.
2. Solve the optimisation problem

   min_{T>0} {ST + Σ_{i=1}^n φ_i(k_i(IFP)/T)},   (4.3)

   and let T(IFP) be an optimal value for T.
3. Determine new constants k_i(IFP) by substitution of T(IFP) in (3.1).
   This implies the application of the feasibility procedure FP to the
   value T(IFP). Let v(IFP) be the corresponding objective value.
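The IFP can be sketched as an alternation of steps 2 and 3 until the k_i stop changing. In the sketch below the search interval for T, the iteration cap, and the example cost data are assumptions, and a golden-section search stands in for the convex solve of (4.3):

```python
import math

def improved_feasibility_procedure(T0, x_star, phi, S, T_hi=10.0, tol=1e-7):
    """IFP: start from k_i(FP) at T0, then alternate an optimal T for
    fixed k (problem (4.3)) with optimal integer k for fixed T (the FP)."""
    def fp(T):  # step 3: the feasibility procedure
        ks = []
        for xi, phi_i in zip(x_star, phi):
            k = math.floor(T * xi)
            ks.append(1 if k == 0 else
                      (k if phi_i(k / T) <= phi_i((k + 1) / T) else k + 1))
        return ks

    def cost(T, ks):
        return S * T + sum(p(k / T) for p, k in zip(phi, ks))

    def best_T(ks):  # step 2: golden-section search on (0, T_hi]
        invphi = (5 ** 0.5 - 1) / 2
        lo, hi = tol, T_hi
        while hi - lo > tol:
            c, d = hi - invphi * (hi - lo), lo + invphi * (hi - lo)
            if cost(c, ks) < cost(d, ks):
                hi = d
            else:
                lo = c
        return (lo + hi) / 2

    ks, T = fp(T0), T0
    for _ in range(100):                    # repeat until the k_i stabilise
        T = best_T(ks)
        new_ks = fp(T)
        if new_ks == ks:
            break
        ks = new_ks
    return T, ks, cost(T, ks)

# One component with phi(x) = 4/x + x and S = 1: the cost 5T + 1/T is
# minimal at T = 1/sqrt(5) with value 2*sqrt(5).
T_ifp, ks_ifp, v_ifp = improved_feasibility_procedure(
    0.4, [2.0], [lambda x: 4.0 / x + x], 1.0)
```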
Under Assumption 3.2 it follows for the value v(IFP) generated by the IFP
that

v(IFP) = ST(IFP) + Σ_{i=1}^n g_i(T(IFP))
       = ST(IFP) + Σ_{i=1}^n φ_i(k_i(IFP)/T(IFP))   (Assumption 3.2)
       ≤ ST(IFP) + Σ_{i=1}^n φ_i(k_i(FP)/T(IFP))    (Step 3)
       ≤ ST(R) + Σ_{i=1}^n φ_i(k_i(FP)/T(R))        (Step 2)
       = ST(R) + Σ_{i=1}^n g_i(T(R))
       = v(FP).

Consequently, if Assumption 3.2 holds, the solution generated by the IFP is
at least as good as the solution obtained with the FP.
The IFP can in principle be repeated with in step 1 the new constants
k_i(IFP), and this can be done until no improvement is found. This procedure
differs from the iterative algorithm of Goyal et al. in two aspects. The first
difference concerns the way integer values of k_i are found given a value of T.
We explained in Section 3 that in the algorithm of Goyal and Kusy (1985)
optimal k_i are found by searching in a table that is made in advance for
each i. This becomes inefficient when the values of k_i are large, since then
searching in the table takes much time. Besides, this has to be done in each
iteration again. Goyal and Gunasekaran (1992) find for each i an optimal real-
valued k_i that is rounded to the nearest integer. This may not be optimal.
In our procedure we can identify under Assumption 3.2 optimal values of k_i
for a given value of T immediately, by substitution of T in (3.1) (that is, by
application of the FP).
The second difference concerns the initialisation of the k_i. Goyal et al.
initialise each k_i = 1 and then find a corresponding optimal T. This often
results in a solution that cannot be improved upon by the algorithm but is
not optimal, that is, the algorithm is then stuck in a local optimal solution
(see Van Egmond et al. 1995). In the IFP we start with a value of T that is
optimal for (R) and hence might be a good solution for (P) as well; this may
be a much better initialisation for the algorithm (we will investigate this in
Section 5).
However, we cannot guarantee that with the alternative initialisation the
IFP does not suffer from local optimality. If the procedure terminates and the
generated solution v(IFP) is not close to v(R), then we cannot guarantee
that the solution is good. In that case we will apply a global-optimisation
algorithm.
Observe that for the models mentioned before (with an increasing rate of
occurrence of failures for the minimal-repair model) the IFP is easily solv-
able since (4.3) is a convex-programming problem (the functions φ_i(1/t) are
then convex). Otherwise, the IFP may not be useful since (4.3) can be a
difficult problem to solve. In that case we will not apply the IFP but we will
use a global-optimisation algorithm immediately after application of the FP
when v(FP) is not close enough to v(R).
To apply a global optimisation we first need an interval that contains an
optimal T(P).

4.3 Lower and Upper Bounds on T(P)


In this subsection we will derive lower and upper bounds on T(P). Corol-
lary 4.1 already provides a lower bound 1/x_n^* if T(R) > 1/x_n^* and (R) is a
convex-programming problem.
If the functions M_i(·) are convex and differentiable it is easy to see that
the functions g_i^(R) are differentiable and that (R) is a differentiable convex-

programming problem. Moreover, if at least one of the functions M_i(·) is
strictly convex we can prove a lower bound on T(P) that is at least as good
as 1/x_n^*. This is established by the following lemma.
Lemma 4.4. Consider the optimisation problem:

(P_1)   min_{T>0} {ST + Σ_{i=1}^n φ_i(1/T)},

with v(P_1) the optimal objective value and T(P_1) an optimal T. If for each
i = 1, …, n the function M_i(·) is convex and differentiable on (0, ∞), and
for at least one i ∈ {1, …, n} the function M_i(·) is strictly convex on (0, ∞),
and the differentiable convex-programming problem (R) has no global optimal
solution within (0, 1/x_n^*), then T(P) ≥ T(P_1) ≥ 1/x_n^*.

Proof. If there does not exist a global optimal solution of (R) in (0, 1/x_n^*),
then it can be shown analogously to Lemma 4.3 that T(P_1) ≥ 1/x_n^*.
To prove the inequality T(P) ≥ T(P_1), notice first that (P_1) equals the
optimisation problem (P) when all k_i are fixed to the value 1. Consequently,
(P_1) is a more restricted problem than (P) and it is easy to verify that
v(P) ≤ v(P_1). Furthermore, if T(P) and certain values of k_i are optimal
for (P), then it is easy to see that if the functions φ_i(·) are differentiable the
following holds:

S − Σ_{i=1}^n (k_i/T(P)²) φ_i'(k_i/T(P)) = 0,

so that

ST(P) = Σ_{i=1}^n (k_i/T(P)) φ_i'(k_i/T(P)).

Substitution of this in the optimal objective value of (P) yields:

v(P) = ST(P) + Σ_{i=1}^n φ_i(k_i/T(P))
     = Σ_{i=1}^n {(k_i/T(P)) φ_i'(k_i/T(P)) + φ_i(k_i/T(P))}.

It is easily verified that

(k_i/T(P)) φ_i'(k_i/T(P)) + φ_i(k_i/T(P)) = M_i'(k_i/T(P)),

so that

v(P) = Σ_{i=1}^n M_i'(k_i/T(P)).   (4.4)

Analogously, it can be shown for the optimal objective value of (P_1) that

v(P_1) = Σ_{i=1}^n M_i'(1/T(P_1)).   (4.5)

Suppose now that the inequality T(P) ≥ T(P_1) does not hold, that is, T(P) <
T(P_1). Since the functions M_i(·) are (strictly) convex and, consequently, the
functions M_i'(·) are (strictly) increasing, this implies that (use (4.4) and (4.5))

v(P) = Σ_{i=1}^n M_i'(k_i/T(P))
     ≥ Σ_{i=1}^n M_i'(1/T(P))
     > Σ_{i=1}^n M_i'(1/T(P_1))
     = v(P_1),

which is in contradiction with v(P) ≤ v(P_1). Hence, T(P) ≥ T(P_1). □
A rough upper bound on T(P) is obtained by the following lemma.

Lemma 4.5. For an optimal T(P) of (P) it holds that

T(P) ≤ (1/S){v(FP) − Σ_{i=1}^n φ_i(x_i^*)},

with v(FP) the objective value corresponding with the feasible solution of (P)
generated by the FP.

Proof. For every T > 0 it holds that

ST + Σ_{i=1}^n φ_i(x_i^*) ≤ ST + Σ_{i=1}^n g_i^(R)(T) ≤ ST + Σ_{i=1}^n g_i(T).

Consequently, we have for every T with ST + Σ_{i=1}^n g_i(T) ≤ v(FP) that

ST + Σ_{i=1}^n φ_i(x_i^*) ≤ v(FP),

which implies that

T ≤ (1/S){v(FP) − Σ_{i=1}^n φ_i(x_i^*)}.

Since T(P) is such a T for which ST(P) + Σ_{i=1}^n g_i(T(P)) ≤ v(FP), the lemma
follows. □

We obtain a better upper bound if (R) is a convex-programming problem.
This is established by the following lemma.

Lemma 4.6. Let T_up be the smallest T ≥ T(R) for which the objective func-
tion of (R) equals v(FP). If (R) is a convex-programming problem then T_up
is an upper bound on T(P). Moreover, this upper bound is smaller than or
equal to the upper bound according to Lemma 4.5.

Proof. For all T > 1/x_1^* it follows that

ST + Σ_{i=1}^n g_i^(R)(T) = ST + Σ_{i=1}^n φ_i(x_i^*),

where the function in the right-hand side of the equation is an increasing
function in T that tends to infinity if T tends to infinity. Consequently, there
exists a value T_up ≥ T(R) such that ST + Σ_{i=1}^n g_i^(R)(T) = v(FP). For values
of T > T_up the objective function of (R) is larger than or equal to v(FP),
since (R) is a convex-programming problem and the minimum is obtained
in T(R). Since (R) is a relaxation of (P), the objective function of (P) is also
larger than or equal to v(FP) for values of T > T_up, so that T_up is an upper
bound on T(P).
It is easy to see that this upper bound is at least as good as the upper
bound of Lemma 4.5, since for T = (1/S){v(FP) − Σ_{i=1}^n φ_i(x_i^*)} it holds
that

ST + Σ_{i=1}^n g_i^(R)(T) ≥ ST + Σ_{i=1}^n φ_i(x_i^*)
  = S[(1/S){v(FP) − Σ_{i=1}^n φ_i(x_i^*)}] + Σ_{i=1}^n φ_i(x_i^*)
  = v(FP)
  = ST_up + Σ_{i=1}^n g_i^(R)(T_up).

That is, in T = (1/S){v(FP) − Σ_{i=1}^n φ_i(x_i^*)} the objective function of (R)
is not smaller than in T_up. Since (R) is a convex-programming problem
and T_up is the smallest T ≥ T(R) for which the objective function of (R)
equals v(FP), we have that T ≥ T_up. □
Notice that the upper bound T_up can easily be found with a bisection on the
interval [T(R), (1/S){v(FP) − Σ_{i=1}^n φ_i(x_i^*)}].
It cannot generally be proved that the objective function of (R) is equal
to v(FP) for a value of T ≤ T(R), but if it is, we have a lower bound T_low
on T(P) analogously.

Lemma 4.7. If there is a T ≤ T(R) for which the objective function of (R)
is equal to v(FP), let then T_low be the largest T ≤ T(R) for which this holds.
If (R) is a convex-programming problem then T_low is a lower bound on T(P).

Proof. For values of T < T_low the objective function of (R) is larger than or
equal to v(FP), since (R) is a convex-programming problem and the min-
imum is obtained in T(R). Since (R) is a relaxation of (P), the objective
function of (P) is also larger than or equal to v(FP) for values of T < T_low,
so that T_low is a lower bound on T(P). □

Fig. 4.3. A lower bound T_low and an upper bound T_up on an optimal T(P) are
found where the objective function of relaxation (R) equals v(FP), the value of the
objective function of problem (P) in T(R).

In Figure 4.3 it is illustrated how the bounds T_low and T_up are generated.
If (R) is a convex-programming problem and the lower bound T_low exists,
then it can easily be found as follows. We first check whether T_low ≥ 1/x_n^*,
with 1/x_n^* the lower bound given by Corollary 4.1. To this end we compute the
objective function of (R) in 1/x_n^* and check whether it is smaller than v(FP).
If so, then T_low < 1/x_n^*, and otherwise T_low ≥ 1/x_n^*. In the latter case we can
easily find T_low with a bisection on the interval [1/x_n^*, T(R)].
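Both bounds amount to a bisection for the point where the convex objective of (R) crosses the level v(FP): on the increasing side of T(R) for T_up and on the decreasing side for T_low. A sketch, with a stand-in quadratic objective in place of the cost function of (R):

```python
def crossing_point(obj, level, lo, hi, tol=1e-10):
    """Bisection for the T in [lo, hi] where obj crosses `level`; obj must
    be monotone on [lo, hi], i.e. the interval must lie entirely on one
    side of the minimiser T(R) of the convex relaxation."""
    increasing = obj(hi) >= obj(lo)
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (obj(mid) < level) == increasing:
            lo = mid  # crossing lies to the right of mid
        else:
            hi = mid  # crossing lies to the left of mid
    return (lo + hi) / 2

# Stand-in for the objective of (R): minimum v(R) = 2 at T(R) = 1.
# With v(FP) = 2.5 the crossings lie at 1 -/+ sqrt(0.5).
obj = lambda t: (t - 1.0) ** 2 + 2.0
t_up = crossing_point(obj, 2.5, 1.0, 5.0)     # increasing side: T_up
t_low = crossing_point(obj, 2.5, 0.01, 1.0)   # decreasing side: T_low
```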
Notice that if (R) is a convex-programming problem, it can be useful to
apply the IFP. In that case the bounds T_up and T_low derived above may
be improved when the objective value v(FP) is replaced by v(IFP), since
v(IFP) ≤ v(FP).
In this subsection we derived a number of lower and upper bounds on
T(P). The results are summarised in Table 4.1.

Table 4.1. Lower and Upper Bounds on T(P)

Lower bound                              Condition
1/x_n^*                                  There is no optimal solution of (R) in (0, 1/x_n^*)
T(P_1)                                   Each M_i(·) is convex on (0, ∞) and at least
                                         one M_i(·) is strictly convex on (0, ∞)
T_low                                    (R) is a convex-programming problem and its
                                         objective function equals v(FP) for a T < T(R)

Upper bound                              Condition
(1/S){v(FP) − Σ_{i=1}^n φ_i(x_i^*)}      none
T_up                                     (R) is a convex-programming problem

From Table 4.1 we can find the bounds that can be used dependent on
certain conditions. For example, for the special cases of Goyal et al., the
minimal-repair model with a unimodal rate of occurrence of failures and the
inspection model, we showed in Theorem 4.1 that (R) is a convex-program-
ming problem. This is already sufficient to use all bounds of Table 4.1, except
the lower bound T(P_1). To use the bound T(P_1), each M_i(·) must be con-
vex on (0, ∞) and at least one M_i(·) must be strictly convex. We showed in
the proof of Theorem 3.1 that each M_i(·) is convex on (0, ∞) for the mod-
els described above (with an increasing rate of occurrence of failures for the
minimal-repair model). For the special cases of Goyal et al. each M_i(·) is
even strictly convex on (0, ∞), so that the bound T(P_1) can then always be
used. For the minimal-repair and inspection model at least one M_i(·) must
be strictly convex.
Let now T_l be the largest lower bound and T_u be the smallest upper bound
that can be used for a specific problem; then we have that T(P) ∈ [T_l, T_u].
Consequently, it is sufficient to apply a global-optimisation technique on the
interval [T_l, T_u] to find a value for T(P).

4.4 Global-Optimisation Techniques

What remains to be specified is the usage of a global-optimisation technique
for (P) on the interval [T_l, T_u] when the feasible solution to (P) found after
application of the FP (or the IFP) is not good enough.

Lipschitz Optimisation

Efficient global-optimisation techniques exist for the case that the objective
function of (P) is Lipschitz. A univariate function is said to be Lipschitz if
for each pair x and y the absolute difference of the function values in these
points is smaller than or equal to a constant (called the Lipschitz constant)
multiplied by the absolute distance between x and y. More formally:

Definition 4.1. A function f(x) is said to be Lipschitz on the interval [a, b]
with Lipschitz constant L, if for all x, y ∈ [a, b] it holds that |f(x) − f(y)| ≤
L|x − y|.
If the objective function of (P) is Lipschitz on the interval [T_l, T_u], then
global-optimisation techniques can be applied in this interval to obtain a so-
lution with a corresponding objective value that is arbitrarily close to the
optimal objective value v(P) (see the chapter on Lipschitz optimisation in
Horst and Pardalos 1995). For the special cases of Goyal et al., the minimal-
repair model with an increasing rate of occurrence of failures, and the in-
spection model, we can prove that the objective function of (P) is Lipschitz
on [T_l, T_u], and we can derive a Lipschitz constant (see Appendix A).
There are several Lipschitz-optimisation algorithms (see Horst and Parda-
los 1995), and we implemented some of them. The simplest one, called
the passive algorithm, evaluates the function to be minimised at the points
a + ε/L, a + 3ε/L, a + 5ε/L, …, and takes the point at which the function is
minimal. The function value in this point does not differ more than ε from the
minimal value in [a, b]. We implemented the algorithm of Evtushenko that is
based on the passive algorithm, but that takes a following step larger than
2ε/L if the current function value is larger than 2ε above the current best
known value, which makes the algorithm faster than the passive algorithm.
However, this algorithm can still be very time consuming, especially when
the Lipschitz constant L is large. The algorithm of Evtushenko and the other
algorithms described in Horst and Pardalos (1995) turned out to be too time
consuming, and were therefore not of practical use to our problem.
Fortunately, however, the shape of the objective function of problem (P) is
such that the Lipschitz constant is decreasing in T (this is shown in Appendix
A). Using this, the algorithm of Evtushenko can easily be extended to deal
with a dynamic Lipschitz constant; after a certain number of steps (going
from left to right) the Lipschitz constant is recomputed, such that larger
steps can be taken. This is repeated after the same number of steps, and
so on, until the interval [a, b] is covered. This approach turned out to work
very well for our problem; the increase in speed was sometimes a factor 1000
compared to the version of Evtushenko, and this made Lipschitz optimisation
of practical use to our problem.
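The extended algorithm can be sketched as follows. The callback `L_of(x)`, which must return a valid Lipschitz constant for the remaining interval [x, b], is our assumption for how the recomputation is supplied; the step rule follows the description of Evtushenko's algorithm above:

```python
def evtushenko_dynamic(f, a, b, L_of, eps, recompute_every=50):
    """Evtushenko's sequential algorithm: returns a point whose value is
    within eps of the minimum of f on [a, b], taking steps of at least
    2*eps/L and periodically recomputing the Lipschitz constant."""
    L = L_of(a)
    x = a + eps / L
    best_x, best_v = x, f(x)
    steps = 0
    while x <= b:
        v = f(x)
        if v < best_v:
            best_x, best_v = x, v
        # step 2*eps/L at a record, larger when f sits above the record
        x += (2 * eps + (v - best_v)) / L
        steps += 1
        if steps % recompute_every == 0:
            L = L_of(x)  # a smaller constant further right allows larger steps
    return best_x, best_v

# |t - 2| is 1-Lipschitz on [0, 4]; a constant L_of mimics the static case.
x_best, v_best = evtushenko_dynamic(lambda t: abs(t - 2.0), 0.0, 4.0,
                                    lambda x: 1.0, eps=0.01)
```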

Golden-Section Search Heuristic

We can also apply alternative methods that do not use the notion of Lipschitz
optimisation. One such method is golden-section search. Golden-section
search is usually applied (and is optimal) for functions that are strictly uni-
modal, which the objective function of (P) is generally not. However, we will
apply an approach in which the interval [T_l, T_u] is divided into a number
of subintervals of equal length, on each of which a golden-section search is
applied. The best point of these intervals is taken as solution. We then divide
the subintervals into intervals that are twice as small and we apply on each
a golden-section search again. This doubling of the number of subintervals
is repeated until no improvement is found. We refer to this approach as the
multiple-interval golden-section search heuristic, the results of which are
given in Section 5.
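A sketch of this heuristic, where the initial number of subintervals is a tunable assumption:

```python
def multi_interval_gss(f, a, b, n_intervals=8, tol=1e-7):
    """Multiple-interval golden-section search heuristic: run a
    golden-section search on each of n equal subintervals of [a, b],
    keep the best point found, then double the number of subintervals
    until a pass yields no improvement."""
    invphi = (5 ** 0.5 - 1) / 2

    def gss(lo, hi):
        while hi - lo > tol:
            c, d = hi - invphi * (hi - lo), lo + invphi * (hi - lo)
            if f(c) < f(d):
                hi = d
            else:
                lo = c
        return (lo + hi) / 2

    best_x, best_v = a, f(a)
    while True:
        edges = [a + (b - a) * j / n_intervals for j in range(n_intervals + 1)]
        improved = False
        for lo, hi in zip(edges, edges[1:]):
            x = gss(lo, hi)
            if f(x) < best_v - tol:
                best_x, best_v, improved = x, f(x), True
        if not improved:
            return best_x, best_v
        n_intervals *= 2

# Bimodal test function: local minimum 0.5 at t = 1, global minimum 0 at t = 6.
f = lambda t: min((t - 1.0) ** 2 + 0.5, (t - 6.0) ** 2)
x_star, v_star = multi_interval_gss(f, 0.0, 10.0)
```

A plain golden-section search started on [0, 10] could end in the local minimum at t = 1; the subdivision isolates each minimum in its own (locally unimodal) subinterval.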

4.5 A Solution Procedure for (P)


We are now ready to formulate a solution procedure for (P). We consider first
a solution procedure for the special cases of Goyal et al., the minimal-repair
model with a unimodal rate of occurrence of failures, and the inspection
model, in which cases problem (R) is a convex-programming problem. Sub-
sequently, we indicate the changes when, for example, block replacement is
used.
We can summarise the results in this section in the formulation of the
following solution procedure for (P):
1. Solve the convex-programming problem (R) using that T(R) ≤ 1/x_1^*. An
   optimal value T(R) can be found by application of a bisection technique
   if the objective function of (R) is differentiable, and otherwise golden-
   section search can be applied.
2. If T(R) ≤ 1/x_n^* then T(P) = T(R) is optimal for (P); stop.
3. If T(R) > 1/x_n^*, check whether the objective function of (R) in 1/x_n^*
   equals v(R). If so, T(P) = 1/x_n^* is optimal for (P); stop.
4. Otherwise, we have that T(P) ≥ 1/x_n^* and we first find a feasible solution
   for problem (P) by applying the FP or IFP. If the corresponding objective
   value is close enough to v(R), then it is also close to v(P), so that we
   have a good solution; stop.
5. If the solution is not good enough, apply a global-optimisation technique
   on the interval [T_l, T_u] to find a value for T(P).
If this solution procedure is applied to the block-replacement model, some
details have to be modified slightly. The first modification concerns the so-
lution of the relaxation (R), which is in general not a convex-programming
problem, but, since it has fewer local minima, is much easier to solve than
problem (P). Therefore, to find a solution to problem (R), we will apply
a single iteration of the multiple-interval golden-section search heuristic de-
scribed in the previous subsection, that is, the number of subintervals is fixed
and will not be doubled until no improvement is found.
Though the optimisation problem (4.3) is not a convex-programming
problem for the block-replacement model and is therefore more difficult to
solve, we will still use the IFP with a single golden-section search applied to
solve problem (4.3); even as such the IFP outperforms the approach of Goyal
and Kusy (1985), as will be shown by the experiments in the next section.
Since the nice results that we derived for the special cases of Goyal et al.,
minimal repair and inspection do not generally hold for the block-replacement
model, the determination of a Lipschitz constant becomes more difficult, if

possible at all. Therefore, we will not apply Lipschitz optimisation to deter-
mine a value of v(P). Instead, we will use the multiple-interval golden-section
search heuristic as described in the previous subsection.

5. Numerical Results
In this section the solution procedure for (P) described in the previous section
will be investigated and it will be compared with the iterative approach of
Goyal et al. This will first be done for the special case of Goyal and Kusy,
the minimal-repair model with an increasing rate of occurrence of failures,
and the inspection model, in which cases an optimal solution v(P) of (P) can
be found by Lipschitz optimisation. This makes it possible to make a good
comparison and also to investigate the performance of the multiple-interval
golden-section search heuristic. Subsequently, the performance of the solution
procedure for the block-replacement model is investigated, using the golden-
section search heuristic. All algorithms are implemented in Borland Pascal
version 7.0 on a 66 MHz personal computer.
By considering the gap between v(R) and v(P) we are by Lemma 3.1 able
to say something about the optimal objective value v(Pc) of (Pc). We will
not investigate problem (Pc) any further, since incorporation of the correction
factor Δ(k) in a solution procedure is too time consuming.
For all models we have six different values for the number n of components
and seven different values for the set-up cost S. This yields forty-two different
combinations of n and S, and for each of these combinations a hundred random
problem instances are taken by choosing random values for the remaining
parameters. For the minimal-repair, inspection and block-replacement model
the lifetime distribution for component i is given by a Weibull-(λ_i, β_i) dis-
tribution (a Weibull-(λ, β) distributed stochastic variable has a cumulative
distribution function F(t) = 1 − e^{−(t/λ)^β}). The data are summarised in Ta-
ble 5.1.

Results for the special case of Goyal and Kusy, the minimal-repair
model and the inspection model

For the special case of Goyal and Kusy, the minimal-repair model and the
inspection model, the value v(P) can be determined by Lipschitz optimisation
with an arbitrary deviation from the optimal value; we allowed a relative
deviation of 10^{−4} (i.e., 0.01%). In Table 5.2 the relevant results of the 4200
problem instances for each model are given.
Notice first that from this table it follows that the difference between the
relaxed solution v(R) and the optimal objective value v(P) of problem (P)
is not very large. On average the gap is approximately one per cent or less
and the maximum deviation is 5.566% for the model of Goyal and Kusy and

Table 5.1. Values for the Parameters in the Four Models

n = 3, 5, 7, 10, 25, 50
S = 10, 50, 100, 200, 500, 750, 1000
c_i^p ∈ [1, 500] (random)

The following parameters are taken randomly:

Special Case of Goyal and Kusy:      Minimal-Repair Model:
  f_i ∈ [15, 50]                       λ_i ∈ [1, 20]
  v_i ∈ [1, 20]                        β_i ∈ [1.5, 4]
  e ∈ [1, 4]                           c_i^r ∈ [1, 250]

Inspection Model:                    Block-Replacement Model:
  λ_i ∈ [1, 20]                        λ_i ∈ [1, 20]
  β_i ∈ [1.5, 4]                       β_i ∈ [1.5, 4]
  c_i^r ∈ [c_i^p/μ_i + 1, 1000]        c_i^f ∈ [2c_i^p/(1 − σ_i²/μ_i²) + 1, 5000]

The variables μ_i and σ_i in this table (for the block-replacement and inspection model) are
the expectation and the standard deviation of the lifetime distribution of component i.
Notice that for the inspection model we take c_i^r ≥ c_i^p/μ_i + 1 and for the block-replacement
model c_i^f ≥ 2c_i^p/(1 − σ_i²/μ_i²) + 1. This guarantees the existence of a finite minimum
x_i^* for the individual average-cost function φ_i(·). In Dekker (1995) it is shown that for
the inspection model a finite minimum for φ_i(·) exists if c_i^p < c_i^r μ_i, and, a fortiori, if
c_i^r ≥ c_i^p/μ_i + 1. For the block-replacement model it can be shown (see also Dekker 1995)
that a finite minimum exists if c_i^f > 2c_i^p/(1 − σ_i²/μ_i²). Notice finally that since β_i > 1, the
rate of occurrence of failures for the minimal-repair model is increasing.

even smaller for the other models. By Lemma 3.1 we have that the optimal
objective value v(Pc) of problem (Pc) will deviate even less from v(R). This
implies that if one wants to find a solution to problem (Pc), it is better to solve
the easier problem (P) first. Since the gap between v(P) and v(R) is often
small, this yields a solution that will in most cases suffice. Only when the
gap is considered not small enough, one can subsequently apply a heuristic
to problem (Pc) to try to find an objective value that is smaller than v(P).
From the table it can be seen that solving the relaxation takes very little
time. A subsequent application of the FP requires only one function evalu-
ation for each component and this takes a negligible amount of time, which
is why for the FP no running times are given in Table 5.2. Applying the
IFP also takes little time. (All running times in Table 5.2 are higher for
the inspection model than for the special case of Goyal and Kusy and the
minimal-repair model, since for the inspection model a numerical routine has
to be applied for each function evaluation, whereas for the other two models
the cost functions can be computed analytically.) Notice that some deviations
are negative. This is due to the relative deviation of 0.01% allowed in the op-
timal objective value determined by the Lipschitz optimisation; a heuristic
can give a solution with an objective value up to 0.01% smaller than that
according to the Lipschitz-optimisation procedure.
As can be expected, the algorithm of Goyal and Kusy outperforms the
algorithm of Goyal and Gunasekaran. This is explained from the fact that
Goyal and Kusy take the optimal k_i given a value of T, whereas Goyal and

Table 5.2. Results of 4200 Random Examples for the Special Case of Goyal and
Kusy, the Minimal-Repair Model and the Inspection Model

                                            GoyKus     MinRep     Inspec
Relaxation (R):
Average running time relaxation (sec.)        0.01       0.01       0.06
Deviation (R), (v(P) − v(R))/v(R):
  Average deviation (R)                      1.174%     0.531%     0.835%
  Minimum deviation (R)                      0.000%     0.000%     0.000%
  Maximum deviation (R)                      5.566%     3.390%     4.953%
Feasibility Procedure (FP):
Deviation FP, (v(FP) − v(P))/v(P):
  Average deviation FP                       1.294%     0.246%     0.398%
  Minimum deviation FP                       0.000%     0.000%     0.000%
  Maximum deviation FP                      13.666%     8.405%     7.616%
Improved Feasibility Procedure (IFP):
Average running time IFP (sec.)               0.07       0.05       1.32
Deviation IFP, (v(IFP) − v(P))/v(P):
  Average deviation IFP                      0.443%     0.065%     0.129%
  Minimum deviation IFP                      0.000%     0.000%     0.000%
  Maximum deviation IFP                     10.842%     4.250%     7.184%
Golden-Section Search (GSS):
Average running time GSS (sec.)               0.72       0.41      11.81
Deviation GSS, (v(GSS) − v(P))/v(P):
  Average deviation GSS                      0.001%     0.000%     0.000%
  Minimum deviation GSS                      0.000%     0.000%    −0.001%
  Maximum deviation GSS                      0.334%     0.152%     0.107%
Goyal and Kusy (GK):
Average running time GK (sec.)                0.07       0.12       4.64
Deviation GK, (v(GK) − v(P))/v(P):
  Average deviation GK                       0.829%     0.421%     1.253%
  Minimum deviation GK                       0.000%    −0.001%     0.000%
  Maximum deviation GK                      11.654%    18.289%    66.188%
Goyal and Gunasekaran (GG):
Average running time GG (sec.)                0.06       0.13       4.05
Deviation GG, (v(GG) − v(P))/v(P):
  Average deviation GG                       0.984%     0.608%     1.910%
  Minimum deviation GG                       0.000%    −0.001%     0.000%
  Maximum deviation GG                      14.027%    18.289%    66.188%

Gunasekaran take for each ki the rounded optimal real value. However, the
differences between the two algorithms are small.
The feasible solution corresponding to the relaxation (i.e., obtained by
application of the FP) is in most cases better than that of the algorithms
of Goyal et al. Only for the special case of Goyal and Kusy does the FP
perform somewhat worse. For the minimal-repair and inspection models the FP
performs much better.
In all cases the IFP (which is an intelligent modification of the approach
of Goyal et al.) outperforms the iterative algorithms of Goyal et al., while
the running times of the IFP are equal or smaller. The differences are smallest
272 Rommert Dekker et al.

for the special case of Goyal and Kusy. This can be explained by the fact
that in the model of Goyal and Kusy there is little variance possible in the
lifetime distributions of the components, mainly because the exponent e has
to be the same for all components. In the inspection model, however, there
can be large differences in the individual lifetime distributions, and this can
cause much larger deviations for the iterative algorithms of Goyal et al.; the
average deviation for Goyal and Kusy's algorithm is then 1.253% and the
maximum deviation even 66.188%, which is much higher than the deviations
for the IFP. The IFP performs well for all models.
Since for many examples the algorithms of Goyal et al. and the IFP find
the optimal solution, the average deviations of these algorithms do not differ
so much (in many cases the deviation is zero per cent). However, there is
a considerable difference in the number of times that large deviations were
generated. This is illustrated in Table 5.3, which gives the percentage of the
examples in which the IFP and the algorithm of Goyal and Kusy had a devia-
tion larger than 1% and 5% for the three models discussed in this subsection.
From this table it is clear that the IFP performs much better than the algo-

Table 5.3. Percentage of the Examples Where the IFP and the Algorithm of Goyal
and Kusy Generated Deviations of More Than 1% and 5%

Algorithm                       Deviation > 1%   Deviation > 5%
Special Case of Goyal and Kusy
  IFP                                12.86             1.79
  Goyal and Kusy                     27.50             2.10
Minimal-Repair Model
  IFP                                 1.57
  Goyal and Kusy                     12.38             1.64
Inspection Model
  IFP                                 3.12             0.05
  Goyal and Kusy                     26.50             6.69

rithm of Goyal and Kusy and that if the algorithm of Goyal and Kusy does
not give the optimal solution, the deviation can be large. The conclusion is
that solving the relaxation and subsequently the improved feasibility proce-
dure is better than and at least as fast as the iterative algorithms of Goyal
et al. This also implies that the algorithms of Goyal et al. can be improved
considerably if another initialisation of the ki and T is taken, viz. according
to the solution of the relaxation.
The deviation of 66.188% in Table 5.2 occurs for one of the problem
instances of the inspection model with n = 5 and S = 10. The parameters and
results are given in Table 5.4. The large deviation for the algorithm of Goyal
and Kusy can be explained as follows. In the first iteration of the algorithm all
ki are initialised at the value one. The corresponding T is then determined; it
equals 5.87. In the following iteration it is investigated for each component i
Table 5.4. Parameters and Results for the Problem Instance of the Inspection
Model for Which the Algorithm of Goyal and Kusy Performs Worst

S = 10

Component    c_i^p     c_i^f    λ_i    β_i    x_i*
    1        247.00    962.00    1     3.50   0.83
    2        472.00    475.00    9     3.45   5.99
    3        344.00    511.00   20     1.71   7.04
    4        459.00    528.00   14     3.90   8.45
    5        225.00    541.00   17     2.47   6.45

Optimal solution: T = 0.85
Solution of the algorithm of Goyal and Kusy: T = 5.87, ki = 1, 1, 1, 1, 1
Corresponding objective value v(GK) = 1173.77
100% × (v(GK) - v(P))/v(P) = 66.188%

whether a larger integer value for ki given T = 5.87 yields lower individual
average costs. This is not the case, as can also be expected considering the
individual x_i* in the last column of Table 5.4. Take, for example, k2 = 2 for
component 2. This implies that component 2 is inspected every 2 × 5.87 =
11.74 time units, whereas its optimal inspection interval has length x_2* =
5.99. The value 5.87 turns out to be a better alternative than 11.74, which
also turns out to be the case for the other components. Consequently, the
algorithm terminates with T = 5.87 and all ki equal to one. For component 1
this implies that it is inspected every 5.87 time units, whereas the optimal
inspection interval has length 0.83. Since for component 1 the failure cost c_1^f
per unit time is relatively large, this implies a large deviation; the individual
average-cost function of component 1 is relatively steep. It would be much
better to take a smaller T and to increase the ki for components 2, 3, 4, 5
accordingly, which is indeed reflected by the optimal T that equals 0.85.
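The iteration just described (fix all ki, determine the best T for those multipliers, then re-optimise each ki individually at that T until nothing changes) can be sketched as follows. The cost functions Phi_i(x) = a_i/x + b_i*x and all parameter values are hypothetical stand-ins chosen for illustration; they are not the inspection-model functions or the data of Table 5.4.

```python
import math

def goyal_kusy_like(phis, T_of_k, k_max=50, max_iter=100):
    """Iteratively coordinate integer multipliers k_i and a basis interval T:
    start with all k_i = 1, compute the best T for these multipliers, then
    re-optimise each k_i individually at that T, and repeat until stable."""
    k = [1] * len(phis)
    for _ in range(max_iter):
        T = T_of_k(k)
        new_k = [min(range(1, k_max + 1), key=lambda m, phi=phi: phi(m * T))
                 for phi in phis]
        if new_k == k:
            break
        k = new_k
    return T, k

# Hypothetical individual average-cost functions Phi_i(x) = a_i/x + b_i*x,
# convex with minimum at x_i* = sqrt(a_i/b_i).
S = 10.0                  # set-up cost shared by jointly executed activities
a = [5.0, 40.0, 60.0]     # stand-in "failure cost" coefficients
b = [8.0, 1.0, 0.5]       # stand-in "preventive cost" coefficients
phis = [lambda x, ai=ai, bi=bi: ai / x + bi * x for ai, bi in zip(a, b)]

def T_of_k(k):
    # closed-form minimiser of S/T + sum_i (a_i/(k_i*T) + b_i*k_i*T) over T
    return math.sqrt((S + sum(ai / ki for ai, ki in zip(a, k)))
                     / sum(bi * ki for bi, ki in zip(b, k)))

T, k = goyal_kusy_like(phis, T_of_k)
```

For this data the procedure terminates after a few iterations; component 1, whose individual optimum lies below any attainable T, keeps k_1 = 1, while the components with larger individual optima receive larger multipliers.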
From the results of Table 5.2 it can further be seen that the multiple-
interval golden-section search heuristic performs very well in all cases. The
average deviation is almost zero, and the maximum deviation is relatively
small. The heuristic is initialised with four subintervals and this number is
doubled until no improvement is found. It turned out that four subintervals
are usually sufficient. The running time of the heuristic is also quite moderate: less
than a second for the special case of Goyal and Kusy and the minimal-repair
model, and almost 12 seconds for the inspection model (where a numerical
routine has to be applied for each function evaluation). This is not much
compared to, for example, the algorithms of Goyal et al.
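A minimal sketch of such a multiple-interval variant (split the search interval into equal subintervals, run ordinary golden-section search on each, and keep the best of the local minima) might look as follows. The bimodal example function is ours and only illustrates why a single golden-section search over the whole interval would not be safe.

```python
import math

GOLDEN = (math.sqrt(5) - 1) / 2  # golden ratio conjugate, ~0.618

def golden_section(f, a, b, tol=1e-6):
    """Minimise f on [a, b], assuming f is unimodal on this interval."""
    c, d = b - GOLDEN * (b - a), a + GOLDEN * (b - a)
    while b - a > tol:
        if f(c) < f(d):
            b, d = d, c
            c = b - GOLDEN * (b - a)
        else:
            a, c = c, d
            d = a + GOLDEN * (b - a)
    return (a + b) / 2

def multi_interval_gss(f, a, b, subintervals=4, tol=1e-6):
    """Golden-section search on each subinterval; return the best minimiser."""
    edges = [a + (b - a) * j / subintervals for j in range(subintervals + 1)]
    candidates = [golden_section(f, lo, hi, tol)
                  for lo, hi in zip(edges, edges[1:])]
    return min(candidates, key=f)

# A bimodal test function with global minimum at t = 0.8 (illustrative only).
f = lambda t: min((t - 0.8) ** 2, 0.05 + (t - 2.5) ** 2)
t_best = multi_interval_gss(f, 0.0, 4.0)
```

Doubling the number of subintervals, as the heuristic in the chapter does, simply repeats this with a finer partition until the best value found stops improving.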
In general, Lipschitz optimisation can take a long time. For the special cases
in this subsection, however, it can be made much faster by application of a
dynamic Lipschitz constant, as was explained in the previous


subsection. For the special case of Goyal and Kusy, Lipschitz optimisation
then took on average 5.83 seconds, for the minimal-repair model only 0.82
seconds, and for the inspection model 23.82 seconds. This is more than the
golden-section search heuristic requires, but it is still not very much, especially
when one considers that Lipschitz optimisation is an optimal solution procedure
and compares its running times to those of the heuristics discussed here.
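For completeness, the following is a minimal branch-and-bound sketch of Lipschitz optimisation with a fixed constant L; the dynamic-constant refinement used in the chapter, and the exact break-point placement of Piyavskii's method, are not reproduced. A subinterval can be discarded as soon as its Lipschitz lower bound cannot improve on the best value found so far.

```python
import heapq

def lipschitz_minimise(f, a, b, L, tol=1e-4):
    """Return a value within tol of the global minimum of an L-Lipschitz f
    on [a, b], by repeatedly bisecting the subinterval with the smallest
    Lipschitz lower bound (f(lo) + f(hi))/2 - L*(hi - lo)/2."""
    fa, fb = f(a), f(b)
    best = min(fa, fb)
    heap = [((fa + fb) / 2 - L * (b - a) / 2, a, b, fa, fb)]
    while heap:
        bound, lo, hi, flo, fhi = heapq.heappop(heap)
        if bound >= best - tol:   # no subinterval can beat 'best' by > tol
            break
        mid = (lo + hi) / 2
        fmid = f(mid)
        best = min(best, fmid)
        for u, v, fu, fv in ((lo, mid, flo, fmid), (mid, hi, fmid, fhi)):
            heapq.heappush(heap, ((fu + fv) / 2 - L * (v - u) / 2, u, v, fu, fv))
    return best

# (x - 2)^2 has |derivative| <= 6 on [0, 5], so L = 6 is a valid constant.
val = lipschitz_minimise(lambda x: (x - 2.0) ** 2, 0.0, 5.0, L=6.0)
```

A smaller (but still valid) L, such as the dynamic constant of the chapter, makes the lower bounds tighter and lets intervals be discarded much earlier, which is where the reported speed-up comes from.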
The running time of the Lipschitz optimisation depends on the number of
components and on the set-up cost. In Table 5.5 the average running times
are given for the hundred random examples that were taken for each of the
forty-two combinations of n and S for the minimal-repair model. As can be

Table 5.5. Average Running Times (sec.) of the Lipschitz-Optimisation Procedure
for the Minimal-Repair Model

           n = 3   n = 5   n = 7   n = 10   n = 25   n = 50
S = 10      0.29    0.36    0.53    0.82     2.01     4.95
S = 50      0.13    0.22    0.36    0.58     1.28     3.35
S = 100     0.10    0.20    0.30    0.47     1.25     2.55
S = 200     0.07    0.17    0.23    0.38     0.97     2.16
S = 500     0.07    0.13    0.21    0.31     0.96     1.91
S = 750     0.05    0.13    0.19    0.31     0.91     2.14
S = 1000    0.05    0.10    0.18    0.29     0.90     1.96
average     0.11    0.19    0.29    0.45     1.18     2.71

seen from this table, the running time increases somewhat more than linearly
in the number n of components and decreases in the set-up cost S. The almost
linear growth of the running time is a nice result when it is considered that
Lipschitz optimisation is an optimal solution procedure and that alternative
optimal procedures published so far in the literature (see, for example, Goyal
1974 in the inventory context) involve only enumeration methods with exponentially
growing running times. The fact that the running time decreases as S increases
is due to a steeper objective function for larger S. A larger S causes smaller
upper bounds for T(P) and, as a result, smaller intervals on which Lips-
chitz optimisation has to be applied. The running time also depends on the
precision that is required. For less precision Lipschitz optimisation becomes
much faster. Future generations of computers will make the advantage of the
golden-section search heuristic over Lipschitz optimisation less important.
We can conclude that if a solution is required in little time, we can solve
the relaxation and apply the improved feasibility procedure to obtain a so-
lution with a deviation of less than one per cent on average. The improved
feasibility procedure outperforms the algorithms of Goyal et al. not only in
running time and average deviation; its maximum deviation is also much smaller.
When precision is more important, we can apply the golden-section search
heuristic to obtain a solution for the above problems with a deviation of
almost zero per cent on average. When optimality must be guaranteed or
when running time is less important, Lipschitz optimisation can be applied
to obtain a solution with arbitrary precision.

Results for the block-replacement model

For the solution of (R) we applied one iteration of the multiple-interval


golden-section search heuristic; that is, we did not double the number of
subintervals until no improvement was found. Since in the previous subsec-
tion it turned out that four subintervals are usually sufficient to find a solution
for problem (P), we chose the number four here as well.
In Table 5.6 the relevant results of the 4200 problem instances are given
(for the renewal function in the block-replacement model we used the approx-
imation of Smeitink and Dekker 1990). The solutions of the FP, IFP and the
algorithms of Goyal et al. are now compared with the values of v( P) obtained
by the multiple-interval golden-section search heuristic.

Table 5.6. Results of 4200 Random Examples for the Block-Replacement Model

Relaxation (R):
  Average running time relaxation (sec.)             0.23
  Average deviation (R) (v(P) - v(R))/v(R)           0.402%
  Minimum deviation (R)                              0.000%
  Maximum deviation (R)                              2.708%
Feasibility Procedure (FP):
  Average deviation FP (v(FP) - v(P))/v(P)           0.196%
  Minimum deviation FP                               0.000%
  Maximum deviation FP                              12.217%
Improved Feasibility Procedure (IFP):
  Average running time IFP (sec.)                    1.30
  Average deviation IFP (v(IFP) - v(P))/v(P)         0.051%
  Minimum deviation IFP                             -0.002%
  Maximum deviation IFP                              5.921%
Golden-Section Search (GSS):
  Average running time GSS (sec.)                   10.26
Goyal and Kusy (GK):
  Average running time GK (sec.)                     3.72
  Average deviation GK (v(GK) - v(P))/v(P)           0.658%
  Minimum deviation GK                              -0.222%
  Maximum deviation GK                              39.680%
Goyal and Gunasekaran (GG):
  Average running time GG (sec.)                     3.54
  Average deviation GG (v(GG) - v(P))/v(P)           0.943%
  Minimum deviation GG                              -0.222%
  Maximum deviation GG                              41.003%
From this table it follows again that the gap between v(R) and v(P)
is small: maximally 2.637% and only 0.399% on average. This implies that
also for the block-replacement model it is better to solve first problem (P)
than problem (Pc), since the solution thus obtained will in many cases be
sufficiently good. If the gap is not small enough, one can subsequently apply
a heuristic to problem (Pc).
The average running time of the relaxation is again very small. It is larger
than that of, for example, the inspection model, since golden-section search
is applied not once but four times, in accordance with one iteration of the
multiple-interval golden-section search heuristic.
Also in this case the algorithm of Goyal and Kusy outperforms the algo-
rithm of Goyal and Gunasekaran, though the differences are small. The FP
already outperforms the algorithms of Goyal et al. and the IFP performs even
better. The average deviation is 0.658% for the algorithm of Goyal and Kusy
and only 0.051% for the IFP. Besides, the maximum deviation for the IFP is
quite moderate, 5.921%, whereas for the algorithm of Goyal and Kusy this
can be as large as 39.680% (and for the algorithm of Goyal and Gunasekaran
even larger). The algorithms of Goyal et al. can sometimes perform slightly
better than the IFP, as is reflected in the minimum deviations of
-0.222% for the algorithms of Goyal et al. and -0.002% for the IFP.
The golden-section search heuristic applied to solve problem (P) needed
again four intervals in most cases. The average running time of the heuristic is
10.26 seconds, which is not much compared to, for example, the algorithms of
Goyal et al. Remember that the solutions of the algorithms of Goyal et al. and
of the IFP are compared with the solutions according to the golden-section
search heuristic. Notice that the negative deviations of -0.222% and -0.002%
imply that both the algorithms of Goyal et al. and the IFP can in some cases
be better than the golden-section search heuristic, though the differences are
small. This implies that the golden-section search heuristic is not optimal, but
that was already clear from the results in the previous subsection. However,
in most cases the heuristic is better than the other algorithms, given the
average deviations, relative to the heuristic, of 0.658% and 0.943% for the
algorithms of Goyal et al. and of 0.051% for the IFP.
The conclusion here is again that when a solution is required in little time,
we can solve the relaxation and apply the (improved) feasibility procedure;
this is better than the algorithms of Goyal et al. (especially the maximum
deviation is much smaller). When precision is more important, we can apply
the golden-section search heuristic, at the cost of somewhat more time.

6. Conclusions
In this chapter we presented a general approach for the coordination of main-
tenance frequencies. We extended an approach by Goyal et al. that deals
with components with a very specific deterioration structure and that does
not indicate how good the obtained solutions are. Extension of this approach
enabled incorporation of well-known maintenance models like minimal re-
pair, inspection and block replacement. We presented an alternative solu-
tion approach that can solve these models to optimality (except the block-
replacement model, for which our approach is used as a heuristic).
The solution of a relaxed problem followed by the application of a feasibility
procedure yields, in little time, a solution less than one per cent above the
minimal value on average. This approach outperforms the approach of Goyal et al.
When precision is more important, a fast heuristic based on golden-section
search can be applied to obtain a solution with a deviation of almost zero
per cent. For the special cases of Goyal et al., the minimal-repair model and
the inspection model, application of a procedure using a dynamic Lipschitz
constant yields a solution with an arbitrarily small deviation from an optimal
solution, with running times somewhat larger than those of the golden-section
search heuristic.
Many maintenance-optimisation models can be incorporated in the solution
approach of this chapter: not only the minimal-repair, inspection and
block-replacement models, but many others as well. It is
also easily possible to combine different maintenance activities, for example
to combine the inspection of a component with the replacement of another.
Altogether, the approach presented here is a flexible and powerful tool for
the coordination of maintenance frequencies for multiple components.

References

Avriel, M., Diewert, W.E., Schaible, S., Zang, I.: Generalized Concavity. New York:
Plenum Press 1988
Bäckert, W., Rippin, D.W.T.: The Determination of Maintenance Strategies for
Plants Subject to Breakdown. Computers and Chemical Engineering 9, 113-
126 (1985)
Barros, A.I., Dekker, R., Frenk, J.B.G., van Weeren, S.: Optimizing a General
Replacement Model by Fractional Programming Techniques. Technical Report.
Econometric Institute, Erasmus University Rotterdam (1995)
Bazaraa, M.S., Sherali, H.D., Shetty, C.M.: Nonlinear Programming Theory and
Algorithms. New York: Wiley 1993
Cho, D.I., Parlar, M.: A Survey of Maintenance Models for Multi-Unit Systems.
European Journal of Operational Research 51, 1-23 (1991)
Dagpunar, J.S.: Formulation of a Multi Item Single Supplier Inventory Problem.
Journal of the Operational Research Society 33, 285-286 (1982)
Dekker, R.: Integrating Optimisation, Priority Setting, Planning and Combining of
Maintenance Activities. European Journal of Operational Research 82, 225-240
(1995)
Dekker, R., Frenk, J.B.G., Wildeman, R.E. : An Efficient Optimal Solution Method
for the Joint Replenishment Problem. European Journal of Operational Re-
search. To appear (1996)
Dekker, R., Roelvink, I.F.K.: Marginal Cost Criteria for Preventive Replacement of
a Group of Components. European Journal of Operational Research 84, 467-
480 (1995)
Goyal, S.K.: Determination of Economic Packaging Frequency for Items Jointly
Replenished. Management Science 20, 293-298 (1973)
Goyal, S.K.: Determination of Optimum Packaging Frequency of Items Jointly Re-
plenished. Management Science 21, 436-443 (1974)
Goyal, S.K.: A note on Formulation of the Multi-Item Single Supplier Inventory
Problem. Journal of the Operational Research Society 33, 287-288 (1982)
Goyal, S.K., Gunasekaran, A.: Determining Economic Maintenance Frequency of a
Transport Fleet. International Journal of Systems Science 4, 655-659 (1992)
Goyal, S.K., Kusy, M.I.: Determining Economic Maintenance Frequency for a Fam-
ily of Machines. Journal of the Operational Research Society 36, 1125-1128
(1985)
Goyal, S.K., Satir, A.T.: Joint Replenishment Inventory Control: Deterministic and
Stochastic Models. European Journal of Operational Research 38, 2-13 (1989)
Hiriart-Urruty, J.-B., Lemarechal, C.: Convex Analysis and Minimization Algo-
rithms I: Fundamentals. A Series of Comprehensive Studies in Mathematics.
Vol. 305. Berlin: Springer 1993
Horst, R., Pardalos, P.M.: Handbook of Global Optimization. Dordrecht: Kluwer
1995
Howard, R.A.: Dynamic Programming and Markov Processes. New York: Wiley
1960.
Martos, B.: Nonlinear Programming: Theory and Methods. Budapest: Akadémiai
Kiadó 1975
Smeitink, E., Dekker, R.: A Simple Approximation to the Renewal Function. IEEE
Transactions on Reliability 39, 71-75 (1990)
Van der Duyn Schouten, F.A., Vanneste, S.G.: Analysis and Computation of (n, N)-
Strategies for Maintenance of a Two-Component System. European Journal of
Operational Research 48, 260-274 (1990)
Van der Duyn Schouten, F. A.: Stochastic Models of Reliability and Maintenance:
An Overview. In this volume (1996), pp. 117-136
Van Egmond, R., Dekker, R., Wildeman, R.E.: Correspondence on: Determining
economic maintenance frequency of a transport fleet. International Journal of
Systems Science 26, 1755-1757 (1995)

Appendix

A. Determination of Lipschitz Constant


We will prove here that the objective function of problem (P) is Lipschitz on
the interval [T_l, T_u] for the special cases of Goyal et al., the minimal-repair
model with an increasing rate of occurrence of failures, and the inspection
model. Furthermore, we derive a Lipschitz constant L.
It is obvious that if L_i is a Lipschitz constant for the function g_i(.)
(see (3.1)), then the Lipschitz constant L for the objective function of (P)
equals

    L = S + \sum_{i=1}^{n} L_i,                                        (A.1)

with S the set-up cost. Consequently, we have to find an expression for L_i.
To do so, consider an arbitrary i \in {1, ..., n} and determine which of the
intervals I_i^{(k)} (see Section 3) overlap with the interval [T_l, T_u]. Clearly,
this is the case for each k with \lfloor T_l x_i^* \rfloor \le k \le \lfloor T_u x_i^* \rfloor.
Now define L_i^{(k)} as the Lipschitz constant of g_i(.) on I_i^{(k)} for each
of these k \ge 1. If \lfloor T_l x_i^* \rfloor = 0, then let L_i^{(0)} be the
Lipschitz constant of g_i(.) on [1/x_i^u, 1/x_i^*]. We will show that

    L_i = \max_k { L_i^{(k)} },                                        (A.2)

where k ranges from \max{0, \lfloor T_l x_i^* \rfloor} to \lfloor T_u x_i^* \rfloor.
To prove this, observe first that if t_1, t_2 belong to the same interval I_i^{(k)},
then by definition

    |g_i(t_1) - g_i(t_2)| \le L_i^{(k)} |t_1 - t_2| \le L_i |t_1 - t_2|.

If t_1, t_2 do not belong to the same interval, then assume without loss of
generality that g_i(t_1) \ge g_i(t_2). For t_1 < t_2 with t_1 belonging to I_i^{(k)}
it then follows that

    0 \le g_i(t_1) - g_i(t_2) \le g_i(t_1) - \Phi_i(x_i^*)
                               =  g_i(t_1) - g_i((k+1)/x_i^*)
                              \le L_i^{(k)} ((k+1)/x_i^* - t_1)
                              \le L_i^{(k)} (t_2 - t_1)
                              \le L_i |t_1 - t_2|.

The other case t_2 < t_1 can be derived in a similar way, and so we have shown
that

    |g_i(t_1) - g_i(t_2)| \le L_i |t_1 - t_2|,

with L_i according to (A.2).
If we now find an expression for the Lipschitz constant L_i^{(k)}, then
with (A.1) and (A.2) we have an expression for the Lipschitz constant L.
In the proof of Lemma 4.2 we showed that if M_i(t) is convex on (0, \infty),
then \Phi_i(1/t) is also convex on (0, \infty). We saw in the proof of Theorem 3.1
that M_i(t) is convex on (0, \infty) for the special cases of Goyal et al., the
minimal-repair model with an increasing rate of occurrence of failures, and the
inspection model. Consequently, for these models \Phi_i(1/t) is convex on (0, \infty).
This implies that the derivative of the function \Phi_i(1/t) is increasing, and
consequently we obtain that for all t_1 \le t_2 \in [1/x_i^u, 1/x_i^*]:

    |g_i(t_1) - g_i(t_2)| = |\Phi_i(1/t_1) - \Phi_i(1/t_2)|
                          \le ( -d/dt \Phi_i(1/t) |_{t=t_1} ) |t_1 - t_2|,
so that

    L_i^{(0)} = (x_i^u)^2 \Phi_i'(x_i^u).                              (A.3)

By the same argument we find that for k \ge 1

    L_i^{(k)} = \max{ -d/dt \Phi_i((k+1)/t) |_{t = k/x_i^*} ,
                       d/dt \Phi_i(k/t) |_{t = (k+1)/x_i^*} },

and so

    L_i^{(k)} = \max{ ((k+1)/k^2) (x_i^*)^2 \Phi_i'(((k+1)/k) x_i^*) ,
                      -(k/(k+1)^2) (x_i^*)^2 \Phi_i'((k/(k+1)) x_i^*) }.   (A.4)

Notice that both arguments in (A.4) are decreasing in k, since \Phi_i'(.) is
increasing. This implies that L_i^{(k)} is maximal for k = 1 or k = 0.
Consequently, (A.2) becomes

    L_i = L_i^{(\lfloor T_l x_i^* \rfloor)}        if \lfloor T_l x_i^* \rfloor \ge 1,
    L_i = \max{ L_i^{(1)}, L_i^{(0)} }             if \lfloor T_l x_i^* \rfloor = 0,

with L_i^{(k)} given by (A.3) and (A.4).


This analysis also shows that the Lipschitz constant L is decreasing in the
lower endpoint T_l. That is, if L^1, L^2 are the Lipschitz constants for the
objective function of (P) on the intervals [T_l^1, T_u] and [T_l^2, T_u]
respectively, with T_l^1 \le T_l^2 \le T_u, then L^1 \ge L^2.
A Probabilistic Model for Heterogeneous
Populations and Related Burn-in
Design Problems
Fabio Spizzichino
Department of Mathematics, University "La Sapienza", 00185 Rome, Italy

Summary. We shall carry out a study of relevant probabilistic properties of
exchangeable quantities arising as lifetimes of a cohort of individuals coming from
two different subcohorts (one of which is the frail subcohort). The arguments to be
treated will then be applied to the problem of stopping burn-in for the components
of a (coherent) complex system. The central role of the distribution of the (random)
number of individuals initially belonging to the frail subcohort will be pointed out.
An example will be presented at the end of the paper.

Keywords. Substandard units, residual lifetimes, heterogeneous populations,
mixtures of distributions, exchangeability, dependence and aging properties, Schur-
convex survival functions, multivariate conditional hazard functions, early failures,
infant mortality, mixed populations, burn-in, optimal stopping, open loop feedback
optimal procedure, dynamic programming, coherent systems

1. Introduction

The aim of these lectures is to give a unifying presentation of some
statistical research falling within the intersection of fields such as survival analysis,
mixtures of distributions, burn-in procedures. In particular, we shall carry
out a study of some relevant probabilistic properties of exchangeable quanti-
ties arising as lifetimes of a cohort of individuals coming from two different
subcohorts.
The arguments to be treated will be applied to the problem of burn-in for
the components of a (coherent) complex system.
Let P be a population containing N individuals U_1, ..., U_N whose lifetimes
are denoted by T_1, ..., T_N. We are interested in the case when P is
heterogeneous in the following sense: among U_1, ..., U_N some are strong and
some are weak (or substandard). We neither know the identity of the weak
individuals nor, in general, the total number of them.
For j = 1, 2, ..., N, denote by C_j the binary variable indicating the
condition of the individual U_j:

    {C_j = 0} \equiv {U_j is strong},    {C_j = 1} \equiv {U_j is weak}.


We assume the lifetimes T_1, ..., T_N to be conditionally independent given
the vector C \equiv (C_1, ..., C_N); more precisely, given {C_j = i}, T_j is
independent of the other lifetimes and distributed according to a given
absolutely continuous distribution G_i(t) (i = 0, 1). The distributions are
such that, for their hazard rate functions, it is r_0(t) \le r_1(t),
\forall t \ge 0.
In many cases it may be natural to consider C_1, ..., C_N to be exchangeable;
in such cases T_1, ..., T_N are exchangeable as well.
The above can be seen as an appropriate probability model for describing
some of the situations in which infant mortality may be present.
In such situations, burn-in procedures are to be considered; in other words
it may be convenient to observe all the individuals for a while at the beginning
of their life in order to overcome the problem of early failures. A decision
problem related with that is the one of optimally choosing the duration of
the burn-in period.
The paper will be divided into three parts.
In the first part we study different aspects of the distribution of T_1, ...,
T_N, namely the joint survival function, their univariate and multivariate
conditional hazard rate functions, dependence properties, univariate and
multivariate aging properties, and extendibility. An important aspect is that
the joint distribution of T_1, ..., T_N is characterized by means of only N,
G_i(t) (i = 0, 1) and the distribution of M = \sum_{j=1}^{N} C_j; it is
interesting to study how the afore-mentioned properties are influenced by the
choice of G_i and of the distribution of M. It will furthermore be of interest
to study the evolution of the distribution of the number of weak individuals
in the residual population during a life-testing experiment; in turn this will
put us in a position to describe the evolution of the distribution of the
residual lifetimes of surviving individuals. This study extends the one begun
in Iovino and Spizzichino (1993). It will also achieve the goal of providing a
tutorial presentation; indeed it allows us to illustrate a number of general
concepts, by showing how they are manifested in the case at hand.
The second part is devoted to a discussion concerning critical aspects.
First of all we define the concepts of early failures and infant mortality and
formulate the problem of optimally choosing the length of the burn-in period.
The discussion aims to clarify the relationships among the present model,
what is usually referred to as mixed populations and more general situations
where infant mortality can be present. A discussion is indeed needed since
possible confusions between different situations can arise. An interesting
feature of such topics is the frequent presence of apparent paradoxes (e.g.
those connected with the observed decreasing failure rate in mixed populations).
This actually calls for a precise statement of the model and a careful use
of language. Our scheme aims to unify and put into precise terms different
models used in various fields of applications. It will particularly stress the
primary role of the probability distribution of M in heterogeneous popula-
tions.
In the third part we study in some more detail the concepts introduced
before and discuss the problem of the optimal choice of the duration of the
burn-in test, presenting results concerning the model of heterogeneous popu-
lations. In the frame of this special model we develop some arguments intro-
duced in Spizzichino (1991) with respect to sequential stopping procedures.
The computation of the optimal procedure for stopping the burn-in, however,
is a difficult task due to its complexity; for this reason it is convenient to con-
sider also the concept of open loop feedback optimal stopping procedures. A
formal definition of the latter will then be given together with some heuristic
illustration.
As we shall see, a specific burn-in decision problem is determined by both
the structure of associated costs and the structure of the probability model.
The arguments to be presented can be used flexibly and be applied in many
different areas; special forms of the costs will be imposed by applications to
any specific field.
As mentioned, we shall in particular show examples of cost functions
which describe the cases when individuals Ul , ... , UN are devices to be used
for building a coherent system.
As far as the probability-model structure is concerned, particular cases of
special interest are those with exponential G_i(t) and those with a binomial
distribution for M. The latter condition is equivalent to independence among
T_1, ..., T_N and deserves special attention for the following two reasons:
- it has been often (sometimes implicitly) assumed in the past literature on
the subject;
- its treatment provides an introduction to open loop feedback optimal pro-
cedures for stopping the burn-in.
An example will be presented at the end of Section 4.

2. Distributions of Lifetimes for Heterogeneous


Populations and Related Probabilistic Aspects

In this section we aim to carry out a study of the probability model for
observable lifetimes of the cohort of individuals coming from two different
subcohorts; in particular we shall point out that the distribution of the num-
ber M of individuals in the weak subcohort and the distributions G_i(t) of
lifetimes in the two subcohorts influence dependence and aging properties of
the lifetimes. Such properties have an impact on the form of the solution of
the burn-in stopping problem.
We start by presenting the notation that will be used in the paper.
Let C \equiv (C_1, ..., C_N) be a vector of N exchangeable binary random
variables and let M \equiv \sum_{i=1}^{N} C_i. Denote, for n = 1, ..., N and
h = 0, 1, ..., n,

    \omega^{(n)}(h) \equiv P{C_1 + ... + C_n = h}.

By the definition of exchangeability the joint distribution of C_1, ..., C_N is
characterized by the marginal distribution of M; indeed, for c \in {0, 1}^N,
it is

    P{C_1 = c_1, ..., C_N = c_N} = \omega^{(N)}(\sum_j c_j) / \binom{N}{\sum_j c_j}.   (2.1)

By a well-known fact about exchangeable events (see e.g. de Finetti 1970 and
Kendall 1967), it is

    \omega^{(n)}(h) = \sum_k \omega^{(N)}(k) \binom{k}{h} \binom{N-k}{n-h} / \binom{N}{n},
                                                        h = 0, 1, ..., n.   (2.2)

In particular, \omega^{(1)}(1) = \sum_{k=1}^{N} (k/N) \omega^{(N)}(k) = E[M]/N
and \omega^{(N)}(k) = P{M = k}.
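These relations are easy to check numerically. The sketch below assumes the standard hypergeometric form of the identity behind (2.2): the first n of N exchangeable indicators behave like a draw of size n without replacement from an urn containing M ones. The distribution chosen for M is hypothetical.

```python
from math import comb

N = 6
pM = [0.30, 0.25, 0.20, 0.10, 0.10, 0.05, 0.00]  # hypothetical P{M = k}, k = 0..N
EM = sum(k * p for k, p in enumerate(pM))         # E[M]

def omega(n, h):
    """omega^(n)(h) = P{C_1 + ... + C_n = h}: hypergeometric mixture over M."""
    return sum(pM[k] * comb(k, h) * comb(N - k, n - h) / comb(N, n)
               for k in range(N + 1) if h <= k and n - h <= N - k)
```

With this, omega(1, 1) reproduces E[M]/N and omega(N, k) reproduces P{M = k}, as stated above.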
Let G_0, G_1 be given probability distributions on [0, +\infty); we assume G_0,
G_1 to be absolutely continuous and such that their respective failure rate
functions r_i(t) \equiv g_i(t)/\bar{G}_i(t) (i = 0, 1) satisfy the inequality

    r_0(t) \le r_1(t),  \forall t \ge 0.                               (2.3)

Furthermore we assume

    \mu_i = \int_0^{\infty} t g_i(t) dt < +\infty.
We consider T_1, ..., T_N to be non-negative random variables, which will be
interpreted as lifetimes of the individuals U_1, ..., U_N in our heterogeneous
population P; we assume that, for c \equiv (c_1, ..., c_N) \in {0, 1}^N, it is

    P{T_1 > t_1, ..., T_N > t_N | C_1 = c_1, ..., C_N = c_N} = \prod_{j=1}^{N} \bar{G}_{c_j}(t_j).   (2.4)

In other words, T_1, ..., T_N are conditionally independent given C_1, ..., C_N,
each with conditional one-dimensional survival function equal to \bar{G}_0 or to
\bar{G}_1, depending on the value taken by the corresponding C_j.
Under (2.4) and the assumption that C_1, ..., C_N are exchangeable, the
joint distribution of T_1, ..., T_N turns out to be exchangeable as well, and it
is completely characterized by G_0, G_1 and by the probabilities \omega^{(N)}(k)
(k = 0, 1, ..., N); more precisely, as far as the joint survival function is
concerned, we obtain, by combining (2.1) and (2.4):

Proposition 2.1.

    (2.5)
By differentiating (2.5) with respect to $s_1,\ldots,s_N$, we obtain

Corollary 2.1. The joint probability density function of $T_1,\ldots,T_N$ is

$$f^{(N)}(s_1,\ldots,s_N) = \sum_{c \in \{0,1\}^N} p^{(N)}\Big(\sum_{j=1}^N c_j\Big) \prod_{j=1}^{N} g_{c_j}(s_j) . \qquad (2.6)$$
Of course, $T_1,\ldots,T_N$ being exchangeable, the marginal distribution of $T_{i_1},
T_{i_2},\ldots,T_{i_n}$ $(n < N)$ is an exchangeable ($n$-dimensional) distribution which
does not depend on the particular choice of the indexes $i_1 \ne i_2 \ne \cdots \ne i_n$;
its survival function has the form

$$\bar F^{(n)}(s_1,\ldots,s_n) = P\{T_1 > s_1, \ldots, T_n > s_n\} = \bar F^{(N)}(s_1,\ldots,s_n,0,\ldots,0) ; \qquad (2.7)$$

however, rather than using (2.7), it is more convenient to write it by a direct
reasoning: it must be, by analogy with (2.5),

$$\bar F^{(n)}(s_1,\ldots,s_n) = \sum_{c \in \{0,1\}^n} p^{(n)}\Big(\sum_{j=1}^n c_j\Big) \prod_{j=1}^{n} \bar G_{c_j}(s_j) , \qquad (2.8)$$

where $p^{(n)}(k) \equiv w^{(n)}(k)/\binom{n}{k}$.
As far as the one-dimensional marginal is concerned, we have the follow-
ing result, where $\bar F^{(1)}(s)$, $f^{(1)}(t)$, $\mu$ and $\lambda(t)$ respectively denote the survival
function, the density function, the expected value and the failure rate func-
tion.

Proposition 2.2. It is

$$\bar F^{(1)}(s) = \bar G_1(s)\,\frac{\mathbb{E}(M)}{N} + \bar G_0(s)\,\frac{N - \mathbb{E}(M)}{N} , \qquad s \ge 0 ; \qquad (2.9)$$

$$f^{(1)}(t) = g_1(t)\,\frac{\mathbb{E}(M)}{N} + g_0(t)\,\frac{N - \mathbb{E}(M)}{N} , \qquad \mu = \mu_1\,\frac{\mathbb{E}(M)}{N} + \mu_0\,\frac{N - \mathbb{E}(M)}{N} . \qquad (2.10)$$

Furthermore

$$\lambda(t) = \alpha(t)\, r_1(t) + [1 - \alpha(t)]\, r_0(t) \qquad (2.11)$$

where

$$\alpha(t) = \frac{\bar G_1(t)\,\mathbb{E}[M]}{\bar G_1(t)\,\mathbb{E}[M] + \bar G_0(t)\,(N - \mathbb{E}[M])} . \qquad (2.12)$$

Proof. For the special case $n = 1$ in (2.8), we obtain

$$\bar F^{(1)}(s) = \bar G_1(s)\, w^{(1)}(1) + \bar G_0(s)\, w^{(1)}(0) , \qquad s \ge 0 .$$

From (2.2), $w^{(1)}(1) = \mathbb{E}(M)/N$ and $w^{(1)}(0) = 1 - w^{(1)}(1)$, so (2.9) follows. By
differentiating $\bar F^{(1)}(s)$ we immediately obtain (2.10). Finally

$$\lambda(t) = \frac{f^{(1)}(t)}{\bar F^{(1)}(t)} = \frac{g_1(t)\,\mathbb{E}(M)/N + g_0(t)\,[N - \mathbb{E}(M)]/N}{\bar G_1(t)\,\mathbb{E}(M)/N + \bar G_0(t)\,[N - \mathbb{E}(M)]/N} = \frac{r_1(t)\,\bar G_1(t)\,\mathbb{E}(M) + r_0(t)\,\bar G_0(t)\,[N - \mathbb{E}(M)]}{\bar G_1(t)\,\mathbb{E}(M) + \bar G_0(t)\,[N - \mathbb{E}(M)]} . \qquad \Box$$
Remark 2.1. It can be easily shown that $\alpha(t)$ has the meaning $\alpha(t) = P(C_j = 1 \mid T_j > t)$; due to the inequality (2.3), $\bar G_0(t)/\bar G_1(t)$ is increasing and so $\alpha(t)$
is decreasing.
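As a numerical illustration of (2.11)-(2.12), with exponential sub-cohorts and invented rates chosen so that (2.3) holds, one can verify that $\alpha(t)$ decreases and that $\lambda(t)$ always lies between $r_0(t)$ and $r_1(t)$:

```python
import math

# Hypothetical example: exponential sub-cohorts with rates r0 < r1
r0, r1 = 0.5, 2.0          # failure rates of strong / weak units; (2.3) holds
N, E_M = 10, 3.0           # population size and expected number of weak units

def alpha(t):
    """alpha(t) of (2.12) = P(C_j = 1 | T_j > t)."""
    G1bar, G0bar = math.exp(-r1 * t), math.exp(-r0 * t)
    return G1bar * E_M / (G1bar * E_M + G0bar * (N - E_M))

def lam(t):
    """Failure rate (2.11) of the one-dimensional marginal."""
    return alpha(t) * r1 + (1 - alpha(t)) * r0

ts = [0.0, 0.5, 1.0, 2.0, 4.0]
print(all(alpha(a) > alpha(b) for a, b in zip(ts, ts[1:])))  # True: alpha decreases
print(all(r0 <= lam(t) <= r1 for t in ts))                   # True
```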
We now proceed to discuss the distribution of residual lifetimes, condi-
tional on the observed "histories of failure and survivals" .
Suppose we test the unit $U_i$ $(i = 1,2,\ldots,N)$ for a period of length $\sigma_i$,
observing the value $t_i$ taken by $T_i$ if $(T_i \le \sigma_i)$, where the $\sigma_i$ are deterministic or
random non-negative quantities; in the case that some of them are random we
admit the possibility that they are not independent. However we assume that
the two vectors $(\sigma_1, \sigma_2, \ldots, \sigma_N)$ and $(C_1,\ldots,C_N)$ are mutually independent;
in other words we assume the stopping rules of the tests of the $U_j$ to be non-
informative. At the end of the tests, our observation will be summarized by
a history of the form

$$D[n; t; s] \equiv \{T_{i_1} = t_1, \ldots, T_{i_n} = t_n ;\; T_{j_1} > s_1, \ldots, T_{j_{N-n}} > s_{N-n}\} \qquad (2.13)$$

where $(I \equiv \{i_1, i_2, \ldots, i_n\},\; J \equiv \{j_1, j_2, \ldots, j_{N-n}\})$ is an arbitrary pair of
complementary subsets of the index set $\{1, 2, \ldots, N\}$ (possibly $I = \emptyset$, or
$J = \emptyset$). The symbol $D[0; s]$ will in particular stand for $\{T_{j_1} > s_1, T_{j_2} >
s_2, \ldots, T_{j_N} > s_N\}$.
We are interested in studying the conditional distribution of the residual
lifetimes $T_{j_1} - s_1, T_{j_2} - s_2, \ldots, T_{j_{N-n}} - s_{N-n}$ given the history $D[n; t; s]$ (of
course in the case $n < N$). Taking into account conditional independence of
$T_1,\ldots,T_N$, given $C$, we readily obtain

$$\bar F^{(N-n)}(\xi \mid D[n; t; s]) \equiv P\{T_{j_1} - s_1 > \xi_1, \ldots, T_{j_{N-n}} - s_{N-n} > \xi_{N-n} \mid D[n; t; s]\}$$
$$= \sum_{c \in \{0,1\}^N} P\{C = c \mid D[n; t; s]\} \prod_{j \in J} \frac{\bar G_{c_j}(s_j + \xi_j)}{\bar G_{c_j}(s_j)} \qquad (2.14)$$

where, by use of Bayes' formula,

$$P\{C = c \mid D[n; t; s]\} \propto P\{C = c\} \prod_{j \in J} \bar G_{c_j}(s_j) \prod_{i \in I} g_{c_i}(t_i) . \qquad (2.15)$$

By combining (2.14) and (2.15), one can also write

$$\bar F^{(N-n)}(\xi \mid D[n; t; s]) = \frac{\sum_{c} P\{C = c\} \prod_{j \in J} \bar G_{c_j}(s_j + \xi_j) \prod_{i \in I} g_{c_i}(t_i)}{\sum_{c} P\{C = c\} \prod_{j \in J} \bar G_{c_j}(s_j) \prod_{i \in I} g_{c_i}(t_i)} . \qquad (2.16)$$
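For small $N$ the update (2.15) can be carried out by direct enumeration of $\{0,1\}^N$. The following sketch computes the posterior of $C$ and the residual survival probability (2.14); the exponential sub-cohorts, the uniform distribution for $M$ and the observed history are all invented for illustration:

```python
import math
from itertools import product
from math import comb

# Illustrative setup: N = 4 units, exponential sub-cohorts (rates r0 < r1),
# M uniform on {0,...,N}; these numbers are arbitrary examples.
N, r0, r1 = 4, 0.5, 2.0
w = [1 / (N + 1)] * (N + 1)                    # w^(N)(k)
p = [w[k] / comb(N, k) for k in range(N + 1)]  # p^(N)(k)

Gbar = lambda c, t: math.exp(-(r1 if c else r0) * t)  # survival function
g = lambda c, t: (r1 if c else r0) * Gbar(c, t)       # density

# History D[n; t; s]: units 0,1 failed at 0.2, 0.4; units 2,3 survive s = 1.0
failed = {0: 0.2, 1: 0.4}
surviving = {2: 1.0, 3: 1.0}

def weight(c):
    """Unnormalized posterior weight of c, as in (2.15)."""
    lik = p[sum(c)]
    for i, ti in failed.items():
        lik *= g(c[i], ti)
    for j, sj in surviving.items():
        lik *= Gbar(c[j], sj)
    return lik

total = sum(weight(c) for c in product((0, 1), repeat=N))
post = {c: weight(c) / total for c in product((0, 1), repeat=N)}

# Residual survival (2.14) of the survivors for an extra xi = 0.5 each
xi = 0.5
surv = sum(pr * math.prod(Gbar(c[j], sj + xi) / Gbar(c[j], sj)
                          for j, sj in surviving.items())
           for c, pr in post.items())
print(0.0 < surv < 1.0)  # a bona fide survival probability
```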



In many situations we are led to consider those special types of life-testing
experiments in which the test starts simultaneously at time $t = 0$ for all the
units $U_i$, and stops at a certain (generally random) time $\sigma$ for all the units still
surviving at $\sigma$. This case is of particular interest and we shall use a special
notation for it; it is commonly described by the term dynamic or longitudinal
observation of life-data and we refer the reader to the relevant literature for
general definitions and results (see e.g. Arjas 1981, 1989 and Shaked and
Shanthikumar 1990, 1991 and references therein). First note that, in such
cases, the history observed up to time $s$, for any $s > 0$, has the form

$$D[h; t; s] \equiv \{T_{i_1} = t_1, \ldots, T_{i_h} = t_h ;\; T_{j_1} > s, \ldots, T_{j_{N-h}} > s\}$$

where $0 \le t_1 \le t_2 \le \cdots \le t_h \le s$, and $h$ denotes the value at time $s$ of the
stochastic process

$$H_s \equiv \sum_{j=1}^{N} \mathbf{1}_{[T_j \le s]}$$

($H_s$ is the number of failures observed up to $s$).


All units surviving at time $s$ share the same age $s$ and their residual
lifetimes are exchangeable; their conditional joint survival function of course
does not depend on the particular identity of the indexes $j_1, j_2, \ldots, j_{N-h}$ and,
by specializing (2.14), it can be written in the form

$$P\{T_{j_1} > s + \xi_1, \ldots, T_{j_{N-h}} > s + \xi_{N-h} \mid D[h; t; s]\}
= \sum_{c \in \{0,1\}^N} P\{C = c \mid D[h; t; s]\} \prod_{j \in J} \frac{\bar G_{c_j}(s + \xi_j)}{\bar G_{c_j}(s)} \qquad (2.17)$$

However, a slightly different way to look at $\bar F^{(N-h)}(\xi \mid D[h; t; s])$ may turn out
to be more convenient. Let
$M_0 \equiv M$, the number of weak units in the population $P$ at time $s = 0$;
$M^{(s)} \equiv \sum_{r=1}^{h} C_{i_r}$, the number of weak units among the units which failed up to
time $s$;
$M_s \equiv M - M^{(s)}$, the number of weak units in the residual population at time $s$,
$\forall s > 0$;
$N_s \equiv N - H_s$, the total number of units in the residual population at time $s$,
$\forall s > 0$;
$w_s^{(N-h)}(k \mid t) \equiv P\{M_s = k \mid D[h; t; s]\}$, $k = 0,1,\ldots,N-h$;
$p_s^{(N-h)}(k \mid t) \equiv \dfrac{w_s^{(N-h)}(k \mid t)}{\binom{N-h}{k}}$, $k = 0,1,\ldots,N-h$.

Furthermore we set, $\forall s > 0$, $\forall\, 0 < t_1 \le t_2 \le \cdots \le t_h \le s$,

$$w_s^{(N-h)}(k \mid t) \equiv 0 , \qquad k = N - h + 1, \ldots, N .$$

Note now that the probability model describing the population of the units
surviving at the time $s$ is analogous to the one of the original population $P$;
only we must respectively replace

$$N \text{ with } N_s , \quad p^{(N)}(\cdot) \text{ with } p_s^{(N_s)}(\cdot \mid t) , \quad \bar G_0(\cdot) \text{ with } \bar G_0(s + \cdot)/\bar G_0(s) ,$$
$$\text{and } \bar G_1(\cdot) \text{ with } \bar G_1(s + \cdot)/\bar G_1(s) .$$

So, by analogy with (2.5),

$$\bar F^{(N-h)}(\xi \mid D[h; t; s]) = \sum_{c \in \{0,1\}^{N-h}} p_s^{(N-h)}\Big(\sum_{i=1}^{N-h} c_i \,\Big|\, t\Big) \prod_{j=1}^{N-h} \frac{\bar G_{c_j}(s + \xi_j)}{\bar G_{c_j}(s)} \qquad (2.18)$$

and, by analogy with (2.9), the one-dimensional conditional survival function
of a single residual lifetime is

$$\bar F^{(1)}(\xi \mid D[h; t; s]) = \frac{\bar G_1(s + \xi)}{\bar G_1(s)} \cdot \frac{\mathbb{E}(M_s \mid D[h; t; s])}{N - h} + \frac{\bar G_0(s + \xi)}{\bar G_0(s)} \cdot \frac{N_s - \mathbb{E}(M_s \mid D[h; t; s])}{N - h} . \qquad (2.19)$$
From (2.19) we can immediately obtain the multivariate conditional hazard
rate functions (see e.g. Shaked and Shanthikumar 1990) for the vector of
lifetimes $T_1,\ldots,T_N$:

$$\mu_s^{(h)}(t) = \lim_{\xi \to 0^+} \frac{1}{\xi}\, P\{T_j < s + \xi \mid D[h; t; s]\} = \lim_{\xi \to 0^+} \frac{1}{\xi}\,\{1 - \bar F^{(1)}(\xi \mid D[h; t; s])\}$$
$$= r_1(s)\,\frac{\mathbb{E}(M_s \mid D[h; t; s])}{N - h} + r_0(s)\,\frac{N - h - \mathbb{E}(M_s \mid D[h; t; s])}{N - h} . \qquad (2.20)$$

Remark 2.2. The equation (2.20) might also be proven in a more formal
way, by applying to our case general results about stochastic filtering of point
processes (Bremaud 1982; see also Koch 1986 and Arjas 1992). First note that
the two processes $M_s$ and $H_s$ have a crucial role. What we can observe is the
evolution of $H_s$, while we are of course interested in estimating at any time
$s$ the actual value of $M_s$, which cannot be observed; the joint distribution of
residual lifetimes, and then the future evolution of $H_s$, depend directly on
$M_s$.

Consider a probability space $(\Omega, \mathcal{F}, P)$ over which our lifetimes $T_1,\ldots,T_N$
are defined; let $\mathcal{F}_t \equiv \sigma(H_s ;\, s \le t)$ denote the sub-$\sigma$-algebra of $\mathcal{F}$ generated by
the process $H_s\,(s \le t)$ and let $\mathcal{G}_t \equiv \sigma(H_s, M_s ;\, s \le t)$ denote the sub-$\sigma$-algebra
of $\mathcal{F}$ generated by the pair $H_s, M_s\,(s \le t)$. By the assumptions we made, the
$\{\mathcal{G}_t\}$-stochastic intensity of $H_t$ is given by

$$\lambda_s^{\{\mathcal{G}_t\}} = M_s \cdot r_1(s) + (N - H_s - M_s) \cdot r_0(s)$$

whence (see the comment after Theorem T14, p. 32, in Bremaud 1982) the
$\{\mathcal{F}_t\}$-intensity of $H_t$ is

$$\lambda_s^{\{\mathcal{F}_t\}} = \mathbb{E}(M_s \mid \mathcal{F}_s)\, r_1(s) + [N - H_s - \mathbb{E}(M_s \mid \mathcal{F}_s)]\, r_0(s) . \qquad (2.21)$$

In order to obtain (2.20) from (2.21), we now note that $T_1,\ldots,T_N$,
by their exchangeability, share common conditional failure rate functions,
which are related to the stochastic intensity $\lambda_s^{\{\mathcal{F}_t\}}$ by means of the following
equation: given the observation $D[h; t; s]$, we have

$$\lambda_s^{\{\mathcal{F}_t\}} = (N - h)\,\mu_s^{(h)}(t) = N_s\,\mu_s^{(h)}(t) .$$

On the other hand, given $D[h; t; s]$, it is $\mathbb{E}(M_s \mid \mathcal{F}_s) = \mathbb{E}(M_s \mid D[h; t; s])$.

Before continuing, we pay further attention to the probabilities $p_s^{(N-h)}(k \mid t)$
$(k = 0,\ldots,N-h)$ entering in the formula (2.18). The $p_s^{(N-h)}(k \mid t)$'s are in
particular needed for the computation of $\mathbb{E}(M_s \mid D[h; t; s])$, which appears in
the expression (2.20) for the multivariate conditional hazard function and in
the definition of the open loop feedback optimal procedures to be given in
Section 4.

Proposition 2.3. For $h = 0, 1, \ldots, N - 1$ and $0 < t_1 \le t_2 \le \cdots \le t_h \le s$, it
is

$$w_s^{(N-h)}(k \mid t) = \binom{N-h}{k} [z(s)]^k \sum_{v=0}^{h} p^{(N)}(k + v)\, W(v, h)\, P\{M^{(s)} = v \mid D[h; t; s]\}$$

where we let

$$z(s) \equiv \bar G_1(s)/\bar G_0(s) \quad \text{and} \quad W(v, h) \equiv \frac{1}{\sum_{m=0}^{N-h} p^{(N)}(v + m) \binom{N-h}{m} [z(s)]^m} .$$

Proof. By definition of $w_s^{(N-h)}(k \mid t)$ and by suitably adjusting the formula of
total probabilities, we can write

$$w_s^{(N-h)}(k \mid t) = \sum_{v=0}^{h} P\{M_s = k \mid M^{(s)} = v, D[h; t; s]\}\, P\{M^{(s)} = v \mid D[h; t; s]\}$$

whence, by taking into account the assumption of conditional independence
(2.4),

$$w_s^{(N-h)}(k \mid t) = \sum_{v=0}^{h} P\{M_s = k \mid M^{(s)} = v, H_s = h\}\, P\{M^{(s)} = v \mid D[h; t; s]\} .$$

As far as the term

$$P\{M_s = k \mid M^{(s)} = v, H_s = h\} = \frac{P\{(M_s = k) \cap (M^{(s)} = v) \cap (H_s = h)\}}{P\{(M^{(s)} = v) \cap (H_s = h)\}}$$

is concerned, we note that

$$P\{(M_s = k) \cap (M^{(s)} = v) \cap (H_s = h)\} =
\binom{h}{v} \binom{N-h}{k} p^{(N)}(v + k)\, [G_1(s)]^v [G_0(s)]^{h-v} [\bar G_1(s)]^k [\bar G_0(s)]^{N-h-k} ,$$

$$P\{(M^{(s)} = v) \cap (H_s = h)\} =
\sum_{m=0}^{N-h} \binom{h}{v} \binom{N-h}{m} p^{(N)}(v + m)\, [G_1(s)]^v [G_0(s)]^{h-v} [\bar G_1(s)]^m [\bar G_0(s)]^{N-h-m} ,$$

whence

$$P\{M_s = k \mid M^{(s)} = v, H_s = h\} = W(v, h) \binom{N-h}{k} [z(s)]^k p^{(N)}(v + k) . \qquad \Box$$
From the above result we then obtain

$$\mathbb{E}(M_s \mid D[h; t; s]) = \sum_{k=0}^{N-h} k \binom{N-h}{k} [z(s)]^k \sum_{v=0}^{h} p^{(N)}(k + v)\, W(v, h)\, P\{M^{(s)} = v \mid D[h; t; s]\} . \qquad (2.22)$$
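Proposition 2.3 and formula (2.22) can be checked by brute force on a small example (again with illustrative exponential sub-cohorts, a uniform distribution for $M$ and an invented history): the distribution of $M_s$ obtained from the formula coincides with the one obtained by enumerating $\{0,1\}^N$:

```python
import math
from itertools import product
from math import comb

# Small example: N = 4, exponential sub-cohorts, M uniform (all illustrative)
N, r0, r1 = 4, 0.5, 2.0
w = [1 / (N + 1)] * (N + 1)
p = [w[k] / comb(N, k) for k in range(N + 1)]
Gbar = lambda c, t: math.exp(-(r1 if c else r0) * t)
g = lambda c, t: (r1 if c else r0) * Gbar(c, t)

# Dynamic history D[h; t; s]: units 0,1 fail at 0.3, 0.7; units 2,3 survive s = 1
h, times, s = 2, [0.3, 0.7], 1.0
failed, alive = (0, 1), (2, 3)

def weight(c):                      # unnormalized posterior, as in (2.15)
    lik = p[sum(c)]
    for i, ti in zip(failed, times):
        lik *= g(c[i], ti)
    for j in alive:
        lik *= Gbar(c[j], s)
    return lik

cs = list(product((0, 1), repeat=N))
tot = sum(weight(c) for c in cs)
post_Ms = [sum(weight(c) for c in cs if sum(c[j] for j in alive) == k) / tot
           for k in range(N - h + 1)]                    # brute-force w_s(k|t)
post_v = [sum(weight(c) for c in cs if sum(c[i] for i in failed) == v) / tot
          for v in range(h + 1)]                         # P{M^(s) = v | D}

z = Gbar(1, s) / Gbar(0, s)
W = lambda v: 1 / sum(p[v + m] * comb(N - h, m) * z**m for m in range(N - h + 1))
formula = [comb(N - h, k) * z**k *
           sum(p[k + v] * W(v) * post_v[v] for v in range(h + 1))
           for k in range(N - h + 1)]                    # Proposition 2.3

print(all(abs(a - b) < 1e-10 for a, b in zip(post_Ms, formula)))  # True
```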

We now turn to study some aging and dependence properties of the joint
distribution of the lifetimes $T_1,\ldots,T_N$; we want to point out that these prop-
erties are influenced by the distribution of $M$ and by $\bar G_i(t)$ $(i = 0,1)$. On the
other hand, as already mentioned, they have an influence on qualitative prop-
erties of the optimal procedures for stopping the burn-in test. Some precise
results in this direction may be obtained in future research.
First we consider a result concerning aging properties of the one-dimen-
sional marginal $\bar F^{(1)}$. By taking into account the condition (2.3) and
Proposition 2.2, one readily obtains

Proposition 2.4.
(a) If $\bar G_0(s)$ and $\bar G_1(s)$ are DFR (Decreasing Failure Rate) then $\bar F^{(1)}$ is DFR.
(b) If $\bar G_0(s)$ and $\bar G_1(s)$ are NWU (New Worse than Used) then $\bar F^{(1)}$ is NWU.

Note that in general $\lambda(t)$, as given by (2.11), can be decreasing in some
region of $[0, +\infty)$ even if $\bar G_0(s)$ and $\bar G_1(s)$ are IFR (Increasing Failure Rate).

Remark 2.3. Properties of the one-dimensional marginal distribution, in par-
ticular qualitative properties such as aging properties of $\bar F^{(1)}$, depend on the
pair $G_0, G_1$ and on $\mathbb{E}(M)$, but they are not affected by higher moments of the
distribution of $M$; the latter, by contrast, have an influence on properties of
dependence among $T_1,\ldots,T_N$.
This fact is, for instance, illustrated by a result concerning dependence be-
tween pairs of variables $(T_{i_1}, T_{i_2})$. Before stating such a result it is convenient
to take into account the following

Lemma 2.1. It is

$$w^{(2)}(2) - [w^{(1)}(1)]^2 = w^{(2)}(0) - [w^{(1)}(0)]^2 = -\Big[\tfrac{1}{2}\, w^{(2)}(1) - w^{(1)}(1)\, w^{(1)}(0)\Big] = \mathrm{Cov}(C_1, C_2) .$$

Proof.

$$\mathrm{Cov}(C_1, C_2) = P\{(C_1 = 1) \cap (C_2 = 1)\} - [P\{C_1 = 1\}]^2 = w^{(2)}(2) - [w^{(1)}(1)]^2$$
$$= \mathrm{Cov}(1 - C_1, 1 - C_2) = P\{(C_1 = 0) \cap (C_2 = 0)\} - [P\{C_1 = 0\}]^2 = w^{(2)}(0) - [w^{(1)}(0)]^2 .$$

On the other hand, $w^{(2)}(2) = w^{(1)}(1) - \tfrac{1}{2}\, w^{(2)}(1)$, whence

$$w^{(2)}(2) - [w^{(1)}(1)]^2 = w^{(1)}(1) - \tfrac{1}{2}\, w^{(2)}(1) - [w^{(1)}(1)]^2$$
$$= -\tfrac{1}{2}\, w^{(2)}(1) + w^{(1)}(1)\,[1 - w^{(1)}(1)] = -\Big[\tfrac{1}{2}\, w^{(2)}(1) - w^{(1)}(1)\, w^{(1)}(0)\Big] . \qquad \Box$$
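Lemma 2.1 is easy to verify numerically for an arbitrary choice of $w^{(N)}$ (the numbers below are an invented example), using the marginalization formula (2.2):

```python
from math import comb

# Any exchangeable distribution for (C1,...,CN) is fixed by w^(N); take an
# arbitrary example with N = 4
N = 4
w_N = [0.1, 0.2, 0.3, 0.25, 0.15]

def w(n, h):
    """Marginal w^(n)(h) via (2.2)."""
    return sum(w_N[k] * comb(k, h) * comb(N - k, n - h) / comb(N, n)
               for k in range(N + 1))

cov_C = w(2, 2) - w(1, 1) ** 2                                   # Cov(C1, C2)
print(abs(cov_C - (w(2, 0) - w(1, 0) ** 2)) < 1e-12)             # True
print(abs(cov_C + (0.5 * w(2, 1) - w(1, 1) * w(1, 0))) < 1e-12)  # True
```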
Proposition 2.5. It is
(a) $\mathrm{Cov}(T_1, T_2) = \mathrm{Cov}(C_1, C_2)\,(\mu_1 - \mu_0)^2$;
(b) $\bar F^{(2)}(s_1, s_2) - \bar F^{(1)}(s_1)\,\bar F^{(1)}(s_2) = \mathrm{Cov}(C_1, C_2)\,[\bar G_0(s_1) - \bar G_1(s_1)]\,[\bar G_0(s_2) - \bar G_1(s_2)]$.

Proof. (a) Equation (2.8) for $n = 2$ yields

$$\bar F^{(2)}(s_1, s_2) = w^{(2)}(0)\,\bar G_0(s_1)\,\bar G_0(s_2) + w^{(2)}(2)\,\bar G_1(s_1)\,\bar G_1(s_2) + \tfrac{1}{2}\, w^{(2)}(1)\,\{\bar G_0(s_1)\,\bar G_1(s_2) + \bar G_0(s_2)\,\bar G_1(s_1)\} ;$$

by differentiating the latter equation we obtain the joint density function
of two lifetimes $T_1, T_2$:

$$f^{(2)}(s_1, s_2) = w^{(2)}(0)\,g_0(s_1)\,g_0(s_2) + w^{(2)}(2)\,g_1(s_1)\,g_1(s_2) + \tfrac{1}{2}\, w^{(2)}(1)\,\{g_0(s_1)\,g_1(s_2) + g_0(s_2)\,g_1(s_1)\} .$$

Then

$$\mathbb{E}[T_1 \cdot T_2] = w^{(2)}(0) \int_0^{\infty}\!\!\int_0^{\infty} t_1 t_2\, g_0(t_1)\, g_0(t_2)\, dt_1\, dt_2 + w^{(2)}(2) \int_0^{\infty}\!\!\int_0^{\infty} t_1 t_2\, g_1(t_1)\, g_1(t_2)\, dt_1\, dt_2$$
$$+ \tfrac{1}{2}\, w^{(2)}(1) \Big\{ \int_0^{\infty}\!\!\int_0^{\infty} t_1 t_2\, g_0(t_1)\, g_1(t_2)\, dt_1\, dt_2 + \int_0^{\infty}\!\!\int_0^{\infty} t_1 t_2\, g_0(t_2)\, g_1(t_1)\, dt_1\, dt_2 \Big\}$$
$$= w^{(2)}(0)\,\mu_0^2 + w^{(2)}(2)\,\mu_1^2 + w^{(2)}(1)\,\mu_0\,\mu_1 .$$

By Equation (2.10),

$$\mathbb{E}[T_1] \cdot \mathbb{E}[T_2] = \Big( \mu_1\,\frac{\mathbb{E}(M)}{N} + \mu_0\,\frac{N - \mathbb{E}(M)}{N} \Big)^2 = \big( w^{(1)}(1)\,\mu_1 + w^{(1)}(0)\,\mu_0 \big)^2 ,$$

whence we obtain

$$\mathrm{Cov}(T_1, T_2) = \mu_0^2 \big\{ w^{(2)}(0) - [1 - w^{(1)}(1)]^2 \big\} + \mu_1^2 \big\{ w^{(2)}(2) - [w^{(1)}(1)]^2 \big\} + 2\,\mu_0\,\mu_1 \Big\{ \tfrac{1}{2}\, w^{(2)}(1) - w^{(1)}(1)\, w^{(1)}(0) \Big\}$$
$$= \mathrm{Cov}(C_1, C_2)\,(\mu_1 - \mu_0)^2 .$$

(b) The proof is analogous to the one for (a), by taking into account the
form of $\bar F^{(2)}(s_1, s_2)$ considered above, equation (2.9), and recalling that
$\mathbb{E}(M)/N = w^{(1)}(1)$. $\Box$
Corollary 2.2. The following conditions are equivalent:
(a) $T_1$ and $T_2$ are positively correlated;
(b) $T_1$ and $T_2$ are Upper Orthant Positively Dependent;
(c) $C_1$ and $C_2$ are positively correlated.
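Proposition 2.5(a) can likewise be checked numerically: for exponential $G_0, G_1$ the means $\mu_i$ are the reciprocal rates, and $\mathbb{E}[T_1 T_2]$ follows from the mixture form of $f^{(2)}$ computed in the proof (all numerical values below are invented):

```python
from math import comb

# Illustrative check of Proposition 2.5(a)
N = 4
w_N = [0.1, 0.2, 0.3, 0.25, 0.15]      # arbitrary example for w^(N)

def w(n, h):
    """Marginal w^(n)(h) via (2.2)."""
    return sum(w_N[k] * comb(k, h) * comb(N - k, n - h) / comb(N, n)
               for k in range(N + 1))

mu0, mu1 = 1 / 0.5, 1 / 2.0            # means of G0, G1 (exponential rates)
E_T1T2 = w(2, 0) * mu0**2 + w(2, 2) * mu1**2 + w(2, 1) * mu0 * mu1
E_T = w(1, 1) * mu1 + w(1, 0) * mu0
cov_T = E_T1T2 - E_T**2
cov_C = w(2, 2) - w(1, 1)**2
print(abs(cov_T - cov_C * (mu1 - mu0)**2) < 1e-12)  # True
```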

We now analyze multivariate aging and dependence properties. We first
consider an observed history of the form

$$D[0; s] \equiv \{T_1 > s_1, T_2 > s_2, \ldots, T_N > s_N\} .$$

Under the condition (2.3) it is intuitive that, between two different indi-
viduals, conditional on the knowledge of their respective ages, the elder one
has a greater probability of belonging to the strong subpopulation. This idea
can be formalized as follows.

Lemma 2.2. For $s_i < s_j$, $P\{C_i = 1 \mid D[0; s]\} \ge P\{C_j = 1 \mid D[0; s]\}$.

Proof. First we note that, as a special case of the formula (2.15), we can
write, for $c \in \{0,1\}^N$,

$$P\{C = c \mid D[0; s]\} = K\, P\{C = c\} \prod_{l=1}^{N} \bar G_{c_l}(s_l)$$

where $K$ is a positive normalizing constant. Thus one can obtain, splitting the
sum according to the value of $c_j$,

$$P\{C_i = 1 \mid D[0; s]\} = \sum_{c} P\{C_1 = c_1, \ldots, C_i = 1, \ldots, C_N = c_N \mid D[0; s]\}$$
$$= K \sum_{c} p^{(N)}\Big(2 + \sum_{l \ne i,\, l \ne j} c_l\Big)\, \bar G_1(s_i)\, \bar G_1(s_j) \prod_{l \ne i,\, l \ne j} \bar G_{c_l}(s_l)$$
$$+ K \sum_{c} p^{(N)}\Big(1 + \sum_{l \ne i,\, l \ne j} c_l\Big)\, \bar G_1(s_i)\, \bar G_0(s_j) \prod_{l \ne i,\, l \ne j} \bar G_{c_l}(s_l) .$$

Let us rewrite the above identity in the shortened form

$$P\{C_i = 1 \mid D[0; s]\} = \bar G_1(s_i)\, \bar G_1(s_j)\, W'(s) + \bar G_1(s_i)\, \bar G_0(s_j)\, W''(s)$$

where $W'(s)$ and $W''(s)$ are positive quantities. Similarly

$$P\{C_j = 1 \mid D[0; s]\} = \bar G_1(s_i)\, \bar G_1(s_j)\, W'(s) + \bar G_0(s_i)\, \bar G_1(s_j)\, W''(s) .$$

Whence, under the condition (2.3) and for $s_i < s_j$,

$$P\{C_i = 1 \mid D[0; s]\} - P\{C_j = 1 \mid D[0; s]\} = W''(s)\,[\bar G_1(s_i)\, \bar G_0(s_j) - \bar G_0(s_i)\, \bar G_1(s_j)] \ge 0 . \qquad \Box$$
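Lemma 2.2 can be illustrated by enumeration on an invented example with three surviving units of increasing ages; the posterior probability of being weak decreases with age:

```python
import math
from itertools import product
from math import comb

# Brute-force check of Lemma 2.2: N = 3 units, all surviving, with ages
# s = (0.5, 1.5, 3.0); exponential sub-cohorts, M uniform (all illustrative).
N, r0, r1 = 3, 0.5, 2.0
s = [0.5, 1.5, 3.0]
p = [(1 / (N + 1)) / comb(N, k) for k in range(N + 1)]   # p^(N)(k)
Gbar = lambda c, t: math.exp(-(r1 if c else r0) * t)

def post_weak(i):
    """P{C_i = 1 | D[0; s]} by enumeration of (2.15)."""
    num = den = 0.0
    for c in product((0, 1), repeat=N):
        wgt = p[sum(c)] * math.prod(Gbar(c[l], s[l]) for l in range(N))
        den += wgt
        num += wgt * c[i]
    return num / den

probs = [post_weak(i) for i in range(N)]
print(probs[0] > probs[1] > probs[2])  # True: the elder, the less likely weak
```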

Now we compare $P\{T_i - s_i > \xi \mid D[0; s]\}$ with $P\{T_j - s_j > \xi \mid D[0; s]\}$ for
two different indexes $i$ and $j$. We are in particular interested in obtaining
sufficient conditions under which the following implication holds:

$$\forall \xi > 0 ,\; s_i < s_j \;\Rightarrow\; P\{T_i - s_i > \xi \mid D[0; s]\} < P\{T_j - s_j > \xi \mid D[0; s]\} . \qquad (2.23)$$

In this respect we have the following result.

Proposition 2.6. Under the assumption (2.3), a sufficient condition for the
validity of the implication (2.23) is that one of the following sets of conditions
holds:
(a) $r_0(t)$ and $[\bar G_0(t+\xi)/\bar G_0(t) - \bar G_1(t+\xi)/\bar G_1(t)]$ are non-increasing functions
of $t$, for any $\xi > 0$;
(b) $r_1(t)$ and $[\bar G_1(t+\xi)/\bar G_1(t) - \bar G_0(t+\xi)/\bar G_0(t)]$ are non-increasing functions
of $t$, for any $\xi > 0$.
Proof. Consider the set of conditions (a). By letting $n = 0$ and $\xi_1 = \xi,\; \xi_2 =
\cdots = \xi_N = 0$ in (2.14), we can obtain

$$P\{T_i - s_i > \xi \mid D[0; s]\} = P\{C_i = 0 \mid D[0; s]\}\, \frac{\bar G_0(s_i + \xi)}{\bar G_0(s_i)} + P\{C_i = 1 \mid D[0; s]\}\, \frac{\bar G_1(s_i + \xi)}{\bar G_1(s_i)}$$

whose right hand side can be rewritten in the form

$$- P\{C_i = 1 \mid D[0; s]\} \Big\{ \bar G_0(s_i + \xi)/\bar G_0(s_i) - \bar G_1(s_i + \xi)/\bar G_1(s_i) \Big\} + \bar G_0(s_i + \xi)/\bar G_0(s_i) .$$

Now compare $P\{T_i - s_i > \xi \mid D[0; s]\}$ with $P\{T_j - s_j > \xi \mid D[0; s]\}$. By our
hypotheses and by (2.3), $\{\bar G_0(s + \xi)/\bar G_0(s) - \bar G_1(s + \xi)/\bar G_1(s)\}$ is non-negative
and non-increasing $\forall s \ge 0$, and $\bar G_0(s + \xi)/\bar G_0(s)$ is non-decreasing. Thus the
implication (2.23) is seen to be valid by taking into account Lemma 2.2.
Under the set of conditions (b), an analogous proof can be given by writing
the right hand side in the form

$$- P\{C_i = 0 \mid D[0; s]\} \Big\{ \bar G_1(s_i + \xi)/\bar G_1(s_i) - \bar G_0(s_i + \xi)/\bar G_0(s_i) \Big\} + \bar G_1(s_i + \xi)/\bar G_1(s_i) . \qquad \Box$$
Of course Proposition 2.6 only gives sufficient conditions for the implica-
tion in (2.23). It is to be stressed that these conditions are verified when Go
and G 1 are exponential distributions.

In the dynamic approach to reliability, it is natural to consider depen-
dence and aging properties such as HIF (Hazard Increasing upon Fail-
ures), WBF (Weakened By Failures), MIFR (Multivariate Increasing Fail-
ure Rate), and so on (see Norros 1985 and Shaked and Shanthikumar
1990, 1991). Such properties are defined in terms of inequalities on the
conditional distributions of residual lifetimes given histories of the form
$D[h; t; s] \equiv \{T_{i_1} = t_1, \ldots, T_{i_h} = t_h,\; T_{j_1} > s, \ldots, T_{j_{N-h}} > s\}$, or on multi-
variate conditional hazard rates, comparing different kinds of pairs of "dy-
namic" histories. In particular it is of interest to compare the multivariate
conditional hazard rates $\mu_s^{(h)}(t)$ and $\mu_{s'}^{(h')}(t')$, corresponding to two different
histories $D[h; t; s]$ and $D[h'; t'; s']$. In the present model, $\mu_s^{(h)}(t)$ and $\mu_{s'}^{(h')}(t')$
can be compared by means of the formula (2.20), which we rewrite in the form

$$\mu_s^{(h)}(t) = r_0(s) + [r_1(s) - r_0(s)]\,\frac{\mathbb{E}(M_s \mid D[h; t; s])}{N - h} . \qquad (2.24)$$

This immediately yields the following result.

Proposition 2.7. A sufficient condition for $\mu_s^{(h)}(t) \ge \mu_{s'}^{(h')}(t')$ is that the
following three inequalities simultaneously hold:
(a) $\dfrac{\mathbb{E}(M_s \mid D[h; t; s])}{N - h} \ge \dfrac{\mathbb{E}(M_{s'} \mid D[h'; t'; s'])}{N - h'}$;
(b) $r_1(s) - r_0(s) \ge r_1(s') - r_0(s')$;
(c) $r_0(s) \ge r_0(s')$.
A necessary condition for $\mu_s^{(h)}(t) \ge \mu_{s'}^{(h')}(t')$ is that at least one among
(a), (b) and (c) holds.
Proposition 2.7, possibly combined with equation (2.22), can be used for
verifying or rejecting various types of dynamic aging and dependence prop-
erties. In this respect a detailed study may be carried out by considering
different possible special cases of interest. We leave this study to future re-
search and here we only add a further definition and present a related remark.
Definition 2.1. Fix $s \ge 0$ and let $D[h; t; s]$ and $D[h'; t'; s']$ be two histories.
$D[h; t; s]$ is less favorable than $D[h'; t'; s']$ (denoted $D[h; t; s] \preceq D[h'; t'; s']$)
if $h \ge h'$, $s \le s'$ and, for a subset $\{t_{j_1}, \ldots, t_{j_{h'}}\}$ of $\{t_1, \ldots, t_h\}$, it is $t'_1 \ge
t_{j_1}, \ldots, t'_{h'} \ge t_{j_{h'}}$ (roughly, in the first history there are more failures, at
earlier times, than in the second history, and survivals are shorter).
Note that if $s = s'$, then $D[h; t; s] \preceq D[h'; t'; s']$ if and only if $D[h'; t'; s']$
is less severe than $D[h; t; s]$ (see Shaked and Shanthikumar 1990).
In some cases we can be interested in checking the validity of the impli-
cation:

$$D[h; t; s] \preceq D[h'; t'; s'] \;\Rightarrow\; \mu_s^{(h)}(t) \ge \mu_{s'}^{(h')}(t') .$$

We remark that in the special case $h = h'$, $t = t'$, the above implication is
a condition of negative aging, while, under the condition $h = h'$, $s = s'$, it can

be seen as a property of positive dependence among $T_1,\ldots,T_N$; however, in
both cases, its validity depends on one-dimensional aging properties of $G_0(\cdot)$
and of $G_1(\cdot)$ and on dependence properties among $C_1,\ldots,C_N$.
When $h = h' + 1$, $s = s'$, $t = (t'_1, \ldots, t'_{h'}, s)$, the above condition coincides
with WBF.
We now turn to consider some multidimensional dependence properties
which are not necessarily of dynamic type. This means that not only histories
of the form $D[h; t; s] \equiv \{T_{i_1} = t_1, \ldots, T_{i_h} = t_h,\; T_{j_1} > s, \ldots, T_{j_{N-h}} > s\}$ enter
into the definition, but also histories which contain survivals at different ages
$s_1, \ldots, s_{N-h}$ can be considered.
A concept of dependence for $T_1,\ldots,T_N$ strictly related with infant mor-
tality is Schur-convexity of the joint survival function $\bar F^{(N)}$ (see Barlow and
Mendel 1993, Barlow and Spizzichino 1993, Spizzichino 1992).
Recall that, given two vectors $y', y'' \in \mathbb{R}^N$, $y'$ is said to be majorized by
$y''$ if

$$\sum_{j=1}^{k} y'_{(j)} \ge \sum_{j=1}^{k} y''_{(j)} , \quad k = 1, 2, \ldots, N-1 , \qquad \sum_{j=1}^{N} y'_{(j)} = \sum_{j=1}^{N} y''_{(j)}$$

where $y'_{(1)} \le y'_{(2)} \le \cdots \le y'_{(N)}$ and $y''_{(1)} \le y''_{(2)} \le \cdots \le y''_{(N)}$ are the order
statistics of $(y'_1, \ldots, y'_N)$ and $(y''_1, \ldots, y''_N)$, respectively. This is denoted by
$y' \prec y''$. A function $\psi : \mathbb{R}^N \to \mathbb{R}$ is Schur-convex if it is non-decreasing with
respect to the majorization ordering:

$$(y'_1, \ldots, y'_N) \prec (y''_1, \ldots, y''_N) \quad \text{implies} \quad \psi(y'_1, \ldots, y'_N) \le \psi(y''_1, \ldots, y''_N) .$$

The following characterization aims to clarify the connection between the
phenomenon of infant mortality and the Schur-convexity property of $\bar F^{(N)}$.

Lemma 2.3. (Spizzichino 1992) $\bar F^{(N)}$ is Schur-convex if and only if the im-
plication (2.23) holds.

By combining Proposition 2.6 with the latter result, we immediately ob-
tain

Proposition 2.8. Under the assumptions of Proposition 2.6, $\bar F^{(N)}$ is Schur-
convex.

Remark 2.4. By choosing $\bar G_0(t)$ and $\bar G_1(t)$ satisfying the assumptions of
Proposition 2.6 and by considering different types of distributions for $M$,
we can build examples of Schur-convex survival functions $\bar F^{(N)}$ fitting with
negative or positive dependence among lifetimes (see Corollary 2.2).

Remark 2.5. Consider the relevant particular case in which $G_0$ and $G_1$ are
exponential, i.e. $r_0(s) = \lambda_0$, $r_1(s) = \lambda_1$, for given non-negative quantities
$\lambda_0 < \lambda_1$. In such a case $\bar F^{(1)}$ is DFR by Proposition 2.4, and the inequalities
(b) and (c) of Proposition 2.7 are trivially satisfied, for any pair $s$ and $s'$.
Then the inequality $\mu_s^{(h)}(t) \ge \mu_{s'}^{(h')}(t')$ is equivalent to (a); furthermore an
immediate application of Proposition 2.8 shows that $\bar F^{(N)}$ is Schur-convex.

A special concept of dependence for exchangeable random quantities is
that of extendibility, and it is of interest to introduce the related notion of
maximum rank (see e.g. de Finetti 1969 and Spizzichino 1982). For $R > N$,
the distribution $F^{(N)}$ of exchangeable random quantities $X_1, \ldots, X_N$ is $R$-
extendible if it is possible to find an $R$-dimensional exchangeable distri-
bution which admits $F^{(N)}$ as its $N$-dimensional marginal. The maximum
rank of $F^{(N)}$ is the maximum integer $R$ such that $F^{(N)}$ is $R$-extendible;
this will be denoted by $\mathcal{R}(X_1, \ldots, X_N) = R$. If $F^{(N)}$ is $R$-extendible for any
$R \ge N$, $F^{(N)}$ is infinitely extendible. Roughly speaking, de Finetti's theo-
rem (de Finetti 1937) says that $F^{(N)}$ is infinitely extendible if and only if
$X_1, \ldots, X_N$ are conditionally independent, identically distributed.
In the very special case of exchangeable binary quantities $C_1, \ldots, C_N$,
where the joint distribution is characterized in terms of the set of probabilities
$\{w^{(N)}(k),\; k = 0,1,\ldots,N\}$, $\mathcal{R}(C_1, \ldots, C_N) \ge R$ means that it is possible to
find $(R+1)$ non-negative numbers $w^{(R)}(l)$ $(l = 0,1,\ldots,R)$ such that

$$w^{(N)}(k) = \sum_{l=0}^{R} w^{(R)}(l)\, \frac{\binom{l}{k}\binom{R-l}{N-k}}{\binom{R}{N}} , \qquad k = 0,1,\ldots,N . \qquad (2.25)$$

$C_1, \ldots, C_N$ are infinitely extendible if and only if they are conditionally i.i.d.,
given a random quantity $\Theta$ taking values in $[0,1]$; in other words, if and only
if it is

$$w^{(N)}(k) = \binom{N}{k} \int_0^1 p^k (1-p)^{N-k}\, d\pi(p) , \qquad k = 0,1,\ldots,N , \qquad (2.26)$$

$\pi$ being some probability distribution over $[0,1]$.


From a statistical point of view, extendibility is related with the con-
cept of superpopulation (see Cassel et al. 1977); in the case of our variables
$T_1,\ldots,T_N$, for instance, $\mathcal{R}(T_1,\ldots,T_N)$ is the maximum size of a conceivable
superpopulation such that our population $P \equiv \{U_1, \ldots, U_N\}$ can be seen
as a random sample from it. It will be shown by the next result that, in our
special case, the property of extendibility for $T_1,\ldots,T_N$ is related to the
corresponding one for the binary random quantities $C_1, \ldots, C_N$.

Proposition 2.9.
(a) $\mathcal{R}(T_1,\ldots,T_N) \ge \mathcal{R}(C_1,\ldots,C_N)$.
(b) $T_1,\ldots,T_N$ are i.i.d. if and only if $M \sim b(N, p)$ for some $p \in [0,1]$.
(c) If (2.26) holds then $T_1,\ldots,T_N$ are conditionally i.i.d.

Proof. (a) Let the binary random quantities $C_1, \ldots, C_N$ be $R$-extendible with
$R > N$; then we can consider the $R$-dimensional survival function

$$\bar F^{(R)}(s_1, \ldots, s_R) = \sum_{c \in \{0,1\}^R} p^{(R)}\Big(\sum_{j=1}^R c_j\Big) \prod_{j=1}^{R} \bar G_{c_j}(s_j)$$

where the $w^{(R)}(l) = \binom{R}{l} p^{(R)}(l)$ are the probabilities in (2.25). The joint survival
function $\bar F^{(N)}$ of $T_1, \ldots, T_N$ is the $N$-dimensional marginal of $\bar F^{(R)}$, and $\bar F^{(R)}$
is obviously exchangeable. So $\mathcal{R}(T_1,\ldots,T_N) \ge \mathcal{R}(C_1,\ldots,C_N)$.
(b) $M \sim b(N, p)$ is equivalent to independence among $C_1, \ldots, C_N$. On the
other hand, independence among $C_1, \ldots, C_N$ is equivalent to independence
among $T_1,\ldots,T_N$. Note that in this case (2.5) becomes

$$\bar F^{(N)}(s_1, \ldots, s_N) = \prod_{j=1}^{N} \big[ p\, \bar G_1(s_j) + (1-p)\, \bar G_0(s_j) \big] .$$

(c) Under (2.26), (2.5) becomes

$$\sum_{c \in \{0,1\}^N} \Big[ \int_0^1 p^{\sum_{j=1}^N c_j} (1-p)^{N - \sum_{j=1}^N c_j}\, d\pi(p) \Big] \prod_{j=1}^{N} \bar G_{c_j}(s_j)
= \int_0^1 \prod_{j=1}^{N} \big[ p\, \bar G_1(s_j) + (1-p)\, \bar G_0(s_j) \big]\, d\pi(p) . \qquad (2.27) \qquad \Box$$
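Part (b) of Proposition 2.9 can be checked numerically: with $M \sim b(N, p)$, the sum (2.5) factorizes into the product of identical two-point mixtures (the parameters below are illustrative):

```python
import math
from itertools import product
from math import comb

# Check of Proposition 2.9(b) on an example: with M ~ b(N, p) the joint
# survival function (2.5) factorizes into a product of identical mixtures.
N, p_weak, r0, r1 = 3, 0.3, 0.5, 2.0
w_N = [comb(N, k) * p_weak**k * (1 - p_weak)**(N - k) for k in range(N + 1)]
pN = [w_N[k] / comb(N, k) for k in range(N + 1)]          # p^(N)(k)
Gbar = lambda c, t: math.exp(-(r1 if c else r0) * t)

s = [0.4, 1.1, 2.3]
joint = sum(pN[sum(c)] * math.prod(Gbar(c[j], s[j]) for j in range(N))
            for c in product((0, 1), repeat=N))           # (2.5)
prod = math.prod(p_weak * Gbar(1, sj) + (1 - p_weak) * Gbar(0, sj) for sj in s)
print(abs(joint - prod) < 1e-12)  # True
```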
We recall here that the condition $\mathrm{Cov}(X_1, X_2) < 0$, for an arbitrary pair
of exchangeable random quantities $X_1, X_2$, implies finite extendibility.

3. A Discussion on the Burn-in Problem and Heterogeneous Populations

This section will be devoted to a discussion on the probability model pre-


sented so far and to its relations with the burn-in problem. Our presentation
aims to be self-contained, however it is profitable for the reader to be familiar
with current literature about fundamental aspects of burn-in problems. For
that we refer to Bergman (1985), Bergman and Klefsjo (1985), Block et al.
(1993), Clarotti and Spizzichino (1990), Kuo and Kuo (1983), Jensen and Pe-
tersen (1982), Marcus and Blumenthal (1974) and Spizzichino (1991). Even
though theoretical and practical interest of this field has existed for a long
time, most of the developments are rather recent and the above list is far from
being complete. In particular it is to be noted that the burn-in problem has
been approached both from classical and Bayesian Statistics points of view.

Within the latter it is natural to look at the problem as one of sequential


decisions in the face of uncertainty; this is the approach followed here.
We shall start by formulating general definitions of the concepts of early
failures, infant mortality, and burn-in stopping procedures.
Individuals U1, .. ., UN are to be understood as pieces of engineering
equipment or components.
Let in general T1"'" TN be the random lifetimes of U1, ... , UN. We as-
sume T1,"" TN to be exchangeable with a joint survival function denoted
-(N)
byF (Sl, ... ,Sn).
Let moreover {-IPI, ... , N} be given functions; k : lPI.i -+ lPI. will be inter-
preted as follows: k(ft, ... , tk) is the gain obtained from putting k compo-
nents into operations (possibly they can be assembled in a working system).
k is of course a function of the components' operations times t1"'" tk and,
for any fixed k, k then is non-decreasing in t1, ... , tk. In particular k can
be negative if some of its coordinates are very small and it will be null or
negative when k is less than the minimum required for building a desired
system. It will be moreover assumed that k is permutation invariant and
that the following natural implication holds

tk+1 > max{t1, ... ,tk) => k+1{t1, ... ,tk, t k+d ~ k(t1, ... ,tk)'
Practical examples will be shown in the next section.
We will say that a set of subsequent observed failure times $t_1 \le \cdots \le t_N$
contains early failures if, for some $1 \le h < N$, one has

$$\psi_N(t_1, \ldots, t_N) < \psi_{N-h}(t_{h+1} - t_h, \ldots, t_N - t_h) . \qquad (3.1)$$

In words, the inequality (3.1) says that we have $h$ "early" failures at sub-
sequent times $t_1, \ldots, t_h$ if the times $t_1, \ldots, t_h$ are so short that the follow-
ing circumstance happens: the gain obtainable from putting, at time $t_h$, the
$(N - h)$ surviving components into operation would be greater than the gain
obtained from putting all the $N$ components into operation at time 0.
Suppose that, at time 0, we start testing simultaneously $U_1, \ldots, U_N$ (as-
sumed to be of age 0 at time 0), progressively observing possible failures and
taking records of the different failure times. In this way, up to any time $s$, we
observe a dynamic history of the form $D[h; t; s]$. Define

$$\Psi(h; t; s) \equiv \mathbb{E}\big[\psi_k(T_{j_1} - s, \ldots, T_{j_k} - s) \mid D[h; t; s]\big] \qquad (3.2)$$

where $k = N - h$ is the remaining number of components, and $T_{j_1} - s, \ldots, T_{j_k} - s$
are their residual lifetimes. $\Psi(h; t; s)$ is the expected gain from putting into
operation the components surviving a test of duration $s$, conditional on the
failure history observed in the test.
Let now $\sigma$ be a stopping time with respect to the filtration $\{\mathcal{F}_t\}$ ($\mathcal{F}_t$
generated by $\{H_s ;\; 0 \le s \le t\}$, with $H_s \equiv \sum_{j=1}^N \mathbf{1}_{[T_j \le s]}$).

In words, $\sigma$ is a random time such that, on the basis of any observation of
the type $D[h; t; s]$, we are able to check whether $(\sigma \le s)$ or $(\sigma > s)$. $\sigma$ can be
interpreted as a possible strategy for stopping a burn-in test to be conducted
before putting the components into operation. Denoting by $T_{(h)}$ the $h$-th
order statistic of $T_1,\ldots,T_N$, we associate to $\sigma$ the expected gain

$$V(\sigma) \equiv \mathbb{E}\big[\Psi\big(H_\sigma ;\; T_{(1)}, \ldots, T_{(H_\sigma)} ;\; \sigma\big)\big] . \qquad (3.3)$$

We can say that we are in the presence of infant mortality when there exists
a $\sigma$ such that the expected gain following a burn-in procedure until $\sigma$ is larger
than that without any burn-in procedure, i.e., formally,

$$V(\sigma) > V(0) = \mathbb{E}\big[\psi_N(T_1, \ldots, T_N)\big] . \qquad (3.4)$$

Roughly, the presence of infant mortality means that the probability
model for lifetimes is such to assign positive probability to the set of those
ordered vectors of lifetimes which contain early failures.
A stopping time $\sigma^*$, in the presence of infant mortality, is optimal if

$$V(\sigma^*) = \sup_{\sigma} V(\sigma) \qquad (3.5)$$

where the supremum is relative to all possible stopping times $\sigma$.
Finding the optimal stopping time for the burn-in is a sequential Bayes
decision problem which, in general, can be formulated as an optimal stopping
problem for the process $(H_t ;\; T_{(1)} \wedge t, \ldots, T_{(H_t)} \wedge t ;\; t)$ (Spizzichino 1993). A
fundamental reference for optimal stopping problems is Shyriaev (1978) (see
also Jensen and Hsu 1993).
Note that the presence of infant mortality and the definition of optimal
stopping for the burn-in procedure are relative to a fixed cost structure,
determined by the gain functions $\psi_k$. Also the concept of early failures is
relative to a fixed cost structure, but it is related to a given vector of lifetimes
and not to a probability model. We assume that costs for conducting the burn-
in test are negligible; this allows us to simplify the notation without limiting
the actual generality of our model. On the other hand, the damage deriving
from failures of components during the pre-operational test can be taken into
account in the assessment of the $\psi_k$ $(0 \le k \le N)$.
Now we come back to focusing attention on the heterogeneous populations
$P$ of components studied in Section 2. Heterogeneous populations give rise
to special cases in which infant mortality can be present. This may happen
when $G_0, G_1$ are such that, for some $1 \le h \le N$, $s > 0$,

$$\mathbb{E}\big[\psi_N(T_1, \ldots, T_N) \mid C_1 = 1, \ldots, C_h = 1, C_{h+1} = 0, \ldots, C_N = 0\big] <$$
$$< \mathbb{E}\big[\psi_{N-h}(T_{h+1} - s, \ldots, T_N - s) \mid C_{h+1} = 0, \ldots, C_N = 0\big] . \qquad (3.6)$$

This inequality translates the idea that weak components are likely to result
in early failures, so that the expected gain deriving from putting only the
strong components (even if of age $s > 0$) into operations is greater than

the expected gain deriving from putting all the components (of age 0) into
operations.
It is to be stressed that, in these cases, burn-in has a special interpretation:
it is a procedure to eliminate from $P$ substandard components (but not nec-
essarily all of them). By taking into account that $\alpha(s) \equiv P\{C_i = 1 \mid T_i > s\}$
is non-increasing in $s$ (see Remark 2.1), one can show that the distribution
of $M_s$ is stochastically non-increasing in $s$. Thus in particular we see that the
effect of burn-in is to decrease the proportion of surviving weak components.
We point out that the model of heterogeneous populations corresponds
to different situations according to different possible types of distributions
$\{w^{(N)}(k) ;\; k = 0,1,\ldots,N\}$ for $M = \sum_{i=1}^N C_i$. Such different types, in their
turn, correspond to different forms of dependence for the lifetimes $T_1,\ldots,T_N$.
To illustrate that, we shall now examine a number of special cases, while
clarifying differences between the different situations from a statistical point
of view.

(A) We start with the special case of a heterogeneous population P for which
p (0 < p < 1) is the known probability that any element chosen from P is
substandard and the conditions Gl, ... , GN are assessed to be independent;
this is equivalent to assuming that the distribution of Mis b(N,p). By (b)
in Proposition 2.9, T l , ... , TN are independent identically distributed as well
and thus (2.5) becomes

... ,SN)= ll(p- - o(sj).1


N
-(N)
F (Sl, Gl (sj)+(I-p)G
j=l
The one-dimensional failure rate function λ(·) of any lifetime T_j is given by
(2.11).
Suppose we perform a burn-in of duration s > 0, and consider the group
of components that survive at time s.
It is easily seen that the conditional probability distribution of M_s (number of surviving weak components) is b(N_s, α(s)):

w_s^(N)(k | T_(1), ..., T_(H_s)) = (N_s choose k) [α(s)]^k [1 − α(s)]^(N_s − k) ,  k = 0, 1, ..., N_s .

The following aspect is to be stressed: we already know at time 0 that
the conditional probability distribution of M_s, at any time s > 0, will be a
binomial and, if s is deterministic, the parameter α(s) can be computed at
the beginning; the value taken by N_s is, of course, random at time 0 and will
become known at time s. If s were random then α(s) would, of course, be
random at the beginning as well, but depending only on s and not on other
aspects of the history to be observed in [0, s]. We stress that the proportion
M_s/N_s of substandard components surviving at time s is a random quantity
(even at time s) with an expected value given by α(s). If N_s is very large, we
expect, by the law of large numbers, that M_s/N_s is very close to α(s).
302 Fabio Spizzichino

At time s, a component's survival probability for an extra mission time ξ
(conditional on survival at s) is

P(T_j > s + ξ | T_j > s) = exp{ − ∫_s^{s+ξ} [α(u) r_1(u) + (1 − α(u)) r_0(u)] du }

= [Ḡ_1(s + ξ) p + Ḡ_0(s + ξ)(1 − p)] / [Ḡ_1(s) p + Ḡ_0(s)(1 − p)] .  (3.7)
T_1, ..., T_N being independent, the joint survival function of the residual lifetimes depends only on the age s and is the product of the above probabilities.
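To make case (A) concrete, the following sketch computes α(s) and the conditional survival probability (3.7), assuming — purely for illustration — exponential subpopulation survival functions Ḡ_i(t) = exp(−λ_i t) with arbitrary parameter values (the function names are ours, not the paper's):

```python
import math

def alpha(s, p, lam1, lam0):
    """P{C_i = 1 | T_i > s}: posterior probability that a survivor of age s is weak."""
    w = p * math.exp(-lam1 * s)
    return w / (w + (1 - p) * math.exp(-lam0 * s))

def cond_survival(s, xi, p, lam1, lam0):
    """Formula (3.7): P(T_j > s + xi | T_j > s) for the two-point mixture."""
    num = p * math.exp(-lam1 * (s + xi)) + (1 - p) * math.exp(-lam0 * (s + xi))
    den = p * math.exp(-lam1 * s) + (1 - p) * math.exp(-lam0 * s)
    return num / den

p, lam1, lam0 = 0.3, 2.0, 0.5
# alpha(s) is non-increasing in s: burn-in thins out the weak components
assert alpha(0.0, p, lam1, lam0) == p
assert alpha(1.0, p, lam1, lam0) < alpha(0.5, p, lam1, lam0) < p
# conditional survival improves with the age s already reached (here lam1 > lam0)
assert cond_survival(1.0, 1.0, p, lam1, lam0) > cond_survival(0.0, 1.0, p, lam1, lam0)
```

The assertions check the two qualitative facts stressed above: α(s) decreases in s, and the survivors of a longer burn-in have better residual reliability.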

(B) Consider now the case in which C_1, ..., C_N are conditionally independent and identically distributed, i.e., (2.26) holds and, by (c) of Proposition 2.9,
T_1, ..., T_N are also conditionally independent and identically distributed (using
the language of frequentist probability, we could say that this case corresponds to (A) with p unknown).
Think again of a burn-in of duration s > 0, and consider the group of
components that survive at time s. The conditional probability distribution
of M_s is still of the form (2.26), where N is replaced by N_s and π is replaced
by a new mixing distribution π(·|D[h; t; s]) depending on s and on the history observed up to s. During burn-in two different processes take place: we
eliminate weak components from P and, simultaneously, we learn about p.
Of course this is a case of positive dependence among T_1, ..., T_N: the distribution of M_s conditional on a history D[h; t; s] is stochastically greater
than the distribution of M_s conditional on a different history D[h'; t'; s'] if
D[h; t; s] ⪰ D[h'; t'; s'] in the sense of Definition 2.1. The conditional distribution F̄^(N−h)(ξ|D[h; t; s]) of the residual lifetimes can be obtained by means of
a suitable modification of formula (2.27).

(C) We analyze here a case of positive dependence different from (B). Suppose we assess P{C_1 = C_2 = ... = C_N} = 1, i.e. P{C_1 = C_2 = ... = C_N =
1} = q, P{C_1 = C_2 = ... = C_N = 0} = 1 − q, namely the distribution of
M is concentrated on the two extreme values 0 and N. In words this means
that all components are in the same (unknown) condition.
It is easy to see that, at any time s, the conditional distribution of M_s
remains of this same kind: P{M_s = N_s} = 1 − P{M_s = 0}. P{M_s = N_s} of
course depends on the observed history; more precisely, by applying Bayes
formula, we have
A Probabilistic Model for Heterogeneous Populations 303

From this, we can immediately obtain F̄^(N−h)(ξ|D[h; t; s]) by applying (2.18).


In this special case burn-in cannot modify the proportion M_s/N_s (which
remains constantly equal to 0 or to 1). Burn-in could only be used to "test
the hypothesis {M_s/N_s = 1}", even though, in general, this would not be a
convenient testing procedure.

(D) The last special case of distribution for M that we consider is one of negative dependence among T_1, ..., T_N. Suppose we are sure from the beginning
about the value of M/N: for some 0 < k < N, it is

w^(N)(k) = 1 ,  w^(N)(n) = 0, for n ≠ k .

The distribution of M/N is degenerate but the distribution of M_s/N_s (s > 0)
is not, and it depends on the history observed up to s. The less favorable the
observed history, the more weak components we expect to have already failed,
and so the stochastically smaller is the conditional distribution of M_s/N_s. In this case of course there
is nothing to learn; burn-in aims to eliminate the k substandard components.
If, ∀ s > 0, Ḡ_1(s) ≪ Ḡ_0(s), we expect that T_(1), ..., T_(k) (the first k failure
times that will be observed) will be very small and correspond to failures of
substandard components (in this case the optimal stopping rule in burn-in would
trivially be σ* ≡ T_(k)).
Note that, for N becoming very large, the differences between cases (D)
and (A), in a sense, tend to disappear.
Let us now comment on the arguments presented so far. For brevity's
sake and in order to stimulate the reader's interest, we prefer to keep the
discussion on a heuristic level; however for most of the subsequent remarks a
rigorous explanation is possible.
First we note that all possible distributions for M different from those in
(A), (B), (C) exclude infinite extendibility for C_1, ..., C_N.
Even if (A) is a very particular case, it has been widely considered in
the literature on burn-in. Indeed, in most of the papers, independence
among T_1, ..., T_N is implicitly assumed. This assumption gives rise to a very
special structure of the optimal procedure for stopping the burn-in, since the
conditional survival function of a component surviving at time s does not
depend on the other components' failure history observed up to s: at any
s ≥ 0, the decision to continue the test must be reached taking into
account only s and the number N_s of surviving components; this decision must not
depend on the failure times observed in the past. In this case the structure of the
optimal burn-in time depends only on the aging properties of the one-dimensional
survival function in (2.9). In particular, in the case of an "additive structure"
of ψ_k (i.e. ψ_k(t_1, ..., t_k) = ∑_{j=1}^k ψ_1(t_j)) there is no room for considering sequential
burn-in procedures (see also the example presented at the end of Section
4). This means that the procedures to be studied must be deterministic: the
stopping instant will not depend on the history to be observed and can
therefore be established at time 0.
304 Fabio Spizzichino

Most of the existing literature deals with the following extension of case
(A): populations P are considered which are mixtures of more than two subpopulations, so that the one-dimensional density functions of the components' lifetimes are of the form

f(t) = ∫ g(t; λ) dP(λ)  (3.8)

where P can be a discrete or a continuous probability distribution. For any
component in P there is the same probability dP(λ) that it belongs to a
subpopulation for which its lifetime has density g(t; λ), independently of
the other individuals. In this situation we say that P is a mixed population.
Heterogeneity in the population gives rise to many surprising effects and
apparent paradoxes when the population is analyzed as a whole. The analysis of such aspects has been developed in particular for this case of mixed
populations (see e.g. the illuminating papers by Barlow (1985) and Vaupel
and Yashin (1985)).
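The best-known of these apparent paradoxes is that pooling subpopulations with non-decreasing failure rates can produce a population whose overall failure rate decreases. A minimal numerical sketch, using two exponential (i.e. constant-rate) subpopulations with arbitrary parameter values:

```python
import math

def mixture_failure_rate(t, p, lam1, lam0):
    """Failure rate of the mixture density p*lam1*e^{-lam1 t} + (1-p)*lam0*e^{-lam0 t}."""
    dens = p * lam1 * math.exp(-lam1 * t) + (1 - p) * lam0 * math.exp(-lam0 * t)
    surv = p * math.exp(-lam1 * t) + (1 - p) * math.exp(-lam0 * t)
    return dens / surv

p, lam1, lam0 = 0.3, 2.0, 0.5
rates = [mixture_failure_rate(t, p, lam1, lam0) for t in (0.0, 1.0, 2.0, 5.0)]
# each subpopulation has a constant rate, yet the pooled rate strictly decreases ...
assert all(a > b for a, b in zip(rates, rates[1:]))
# ... starting from the average rate at t = 0 ...
assert abs(rates[0] - (p * lam1 + (1 - p) * lam0)) < 1e-12
# ... and tending towards the strong components' rate lam0
assert rates[-1] > lam0
```

The weak components are progressively selected out, so the surviving population looks ever stronger: exactly the selection effect discussed by Barlow (1985) and Vaupel and Yashin (1985).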
Some results on the existence and computation of the optimal (deterministic)
burn-in time for the case in (A) under additive cost functions can be obtained by specializing arguments developed for one-dimensional distributions
on [0, +∞) with densities of the form (3.8) (see Clarotti and Spizzichino 1990
and Block et al. 1993).
In all other cases, in general, the optimal burn-in procedure will be sequential: whether or not it is optimal to stop the burn-in test at a time instant s
depends on the number of surviving components and on the failure times of
the ones already failed up to s.
It is important to point out that, in any case, the one-dimensional
marginal distribution function F^(1) is of the form (2.9): as already remarked,
it is then completely determined by the specification of the expected
value E(M) alone. So pairs of completely different joint distributions for T_1, ..., T_N
may give rise to the same one-dimensional marginal. This shows that, in all
the cases different from (A), the study cannot be limited to the analysis of
F^(1); as a matter of fact the structure of the N-dimensional joint distribution
is to be taken into account.
Even the decision whether a burn-in experiment is to be conducted at all
cannot be taken only on the basis of the form of F^(1). Let us try to
illustrate this important point by means of the following special case.
Suppose that Ḡ_1(s) ≪ Ḡ_0(s); this may imply that the failure rate function
λ(t) of F^(1) is decreasing in a right neighbourhood of 0 even if

Ḡ_1(s) and Ḡ_0(s) are "slightly" IFR .  (3.9)
If we are in case (A), nothing but the form of F^(1) is to be considered
and thus, if λ(t) is decreasing in a neighbourhood of 0, a burn-in procedure
is recommended despite condition (3.9). This can also be explained by
observing that, under (A), we have no other means to modify our opinion

about the hypothesis {Gj= O} than testing the component Uj(1 :::; j :::; N)
for a while.
This would be reversed in case (C). In such a case we can test the hypoth-
esis {Gj = O} without burning-in Uj: this can be done by testing a number
r of other components Gh ... Gjr up to the failures of all of them. In this
way we can learn about {Gj = O} or {Gj = I}, leaving Uj completely new
(Le., of age O)j under (3.9) the latter may turn out to be a more convenient
procedure.

4. Burn-in Procedures for Heterogeneous Populations of Components

In this section we explain in some more detail the concept of optimal se-
quential procedure for stopping the burn-in introduced in the last section
and show some fundamental facts concerning the case of a heterogeneous
population of components.
Consider a burn-in experiment according to which all the components
U_1, ..., U_N, belonging to P, are simultaneously put under test at time 0,
progressively recording all the subsequent observed failure times t_1, t_2, ... up
to a pre-fixed stopping time σ. At σ the experiment is stopped and all the
surviving components are delivered to operation or, in any case, are kept and
considered to be usable later on for assembling some wanted system. In this
case we say that we adopted the procedure σ for stopping the burn-in.
Up to any time s > 0 we observe a history of the form D[h; t; s]. As
already mentioned above, it is of course necessary for σ to be a rule such
that, at any time s, we are in a position to establish whether {σ ≤ s} or
{σ > s} based on the information carried by D[h; t; s]. If in particular we fix
σ = s, for some value s, already before starting the experiment, we say that
σ is deterministic. In other cases we say that σ is sequential. For instance
σ = T_(h) (corresponding to stopping the test as soon as the h-th failure has
been observed) is a particular sequential rule.


By applying Theorem T33 in Bremaud (1981, p. 308) to the counting
process H_s and by suitably adapting language and notation to our case, one
can see that any sequential stopping procedure σ has the following general
form:
- we start the experiment, planning to stop it at a deterministic time ρ_σ^(N) > 0, if no failure is observed in between;
- if T_(1) = t_1 < ρ_σ^(N), then, at t_1, we plan to go on with the experiment for
an extra time ρ_σ^(N−1)(t_1), under the condition that no further failure is
observed before the instant t_1 + ρ_σ^(N−1)(t_1);
- if, on the contrary, T_(2) = t_2, with t_1 < t_2 < t_1 + ρ_σ^(N−1)(t_1), we fix a
further extra burn-in time ρ_σ^(N−2)(t_1, t_2), and so on;
- ρ_σ^(N−h)(t_1, t_2, ..., t_h) will in general be a function of t_1, t_2, ..., t_h.
To denote the above, we shall write

σ ≡ {ρ_σ^(N), ρ_σ^(N−1)(t_1), ..., ρ_σ^(1)(t_1, ..., t_{N−1})}  (4.1)

The subscript σ can be omitted when unnecessary.
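The representation (4.1) can be read as a small simulator. The following sketch is ours, not the paper's: given a realization of the ordered failure times and a family of re-planning functions, it returns the instant at which the rule stops (the constant rule used in the check is a toy example):

```python
def burn_in_stop(failure_times, rho):
    """Run a sequential stopping rule of the form (4.1): after the h-th observed
    failure at t_h, burn-in is planned to end at t_h + rho[h](history), unless a
    further failure occurs first. rho must have more entries than failures used."""
    planned = rho[0]([])          # rho^(N): planned stop if no failure occurs
    history = []
    for t in sorted(failure_times):
        if t >= planned:          # no failure before the planned stop: stop there
            break
        history.append(t)         # failure observed: re-plan an extra burn-in time
        planned = t + rho[len(history)](history)
    return planned

# toy rule: always plan 5.0 extra time units, whatever the history so far
rho = [lambda history: 5.0] * 4
assert burn_in_stop([2.0, 3.0, 100.0], rho) == 8.0  # re-planned: 0 -> 5, 2 -> 7, 3 -> 8
```

Note how the stopping instant adapts to the observed history, which is exactly what distinguishes a sequential rule from a deterministic one.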
Now consider the expected gain W_σ, defined in (3.3). Before going ahead,
note that W_σ will obviously depend on the joint distribution of T_1, ..., T_N,
which is in its turn determined by the distribution of M, w^(N) ≡ {w^(N)(0),
w^(N)(1), ..., w^(N)(N)}, and by the pair of one-dimensional survival functions
Ḡ_0(s) and Ḡ_1(s), via equation (2.5).
Our task in the following is to express W_σ in terms of the representation
(4.1) and to characterize the optimal stopping procedure σ*, as defined in
(3.5). To our aims we must adopt a dynamic point of view; for that we
must in general look at the conditional expected gain, given that the history
D[h; t; s] has been observed, if continuing with a procedure σ. This may be
denoted by

W_σ(h; t; s) or W(ρ_σ^(N−h), ..., ρ_σ^(1) | D[h; t; s])

whichever of the two forms is more convenient. In order to characterize the
optimal stopping procedure the following further notation is needed:

W*(h; t; s) ≡ sup_σ W_σ(h; t; s) .  (4.2)

Consider now the conditional survival function of the variable T_(h+1) − s given
the history D[h; t; s] observed up to s. T_(h+1) − s is the waiting time up to the
next failure after the instant s and its conditional survival function is given
by

P{T_(h+1) − s > ξ | D[h; t; s]} = F̄^(N−h)(ξ, ..., ξ | D[h; t; s])  (4.3)

where F̄^(N−h)(· | D[h; t; s]) was defined by (2.18). Denote by

f(ξ | D[h; t; s])

the corresponding density function

f(ξ | D[h; t; s]) = − (d/dξ) P{T_(h+1) − s > ξ | D[h; t; s]}  (4.4)

Recalling Proposition 3.2, it is furthermore convenient to let, for 0 ≤ h ≤
N − 1, t_1, ..., t_h ≤ s, u ≥ 0 and for a function W_σ,

Φ_{W_σ}(h; t; s, u) ≡ Ψ(h; t; s + u) × P{T_(h+1) > s + u | D[h; t; s]}
+ ∫_0^u W_σ(h + 1; (t, ξ); s + ξ) f(ξ | D[h; t; s]) dξ .  (4.5)
Consider now the observed history D[h; t; s] and a stopping time σ. After the
time s, two alternative events can happen:

(A) {T_(h+1) > θ^(N−h)(t)} or (B) {T_(h+1) ≤ θ^(N−h)(t)}

where we briefly set θ^(N−h)(t) ≡ t_h + ρ_σ^(N−h)(t).
If (A) happens, then burn-in will be stopped at θ^(N−h)(t) and the expected gain will be Ψ(h; t; θ^(N−h)(t)). If (B) happens with T_(h+1) = ξ ≤
θ^(N−h)(t), then burn-in is continued after ξ and the expected gain, from that
instant on, will be W_σ(h + 1; (t, ξ); s + ξ).
The above arguments lead to the following

Proposition 4.1. W_σ(·; ·; ·) satisfies the dynamic equation

W_σ(h; t; s) = Φ_{W_σ}(h; t; s, θ^(N−h)(t) − s) .  (4.6)
Proposition 4.2. W*(h; t; s) satisfies the inequalities
(a) W*(h; t; s) ≥ Ψ(h; t; s)
(b) ∀ u ≥ 0, W*(h; t; s) ≥ Φ_{W*}(h; t; s, u)
(c) W*(h; t; s) ≥ ∫_0^{+∞} W*(h + 1; (t, ξ); s + ξ) f(ξ | D[h; t; s]) dξ =
lim_{u→∞} Φ_{W*}(h; t; s, u).

Proof. Suppose that, following a burn-in test up to s, we observed the history
D[h; t; s]. Ψ(h; t; s) is the expected gain coming from stopping the burn-in
test immediately at s. Then (a) follows from the definition (4.2).
Fix ε > 0, u > 0, and, instead of stopping immediately at s, alternatively
consider the following continuation of burn-in:
- if T_(h+1) > s + u, stop the burn-in at s + u;
- if T_(h+1) − s < u, at ξ = T_(h+1) go on with an ε-optimal procedure σ_ε such
that

W_{σ_ε}(h + 1; (t, ξ); s + ξ) ≥ W*(h + 1; (t, ξ); s + ξ) − ε .

This gives the expected gain

Ψ(h; t; s + u) × P{T_(h+1) > s + u | D[h; t; s]}
+ ∫_0^u W_{σ_ε}(h + 1; (t, ξ); s + ξ) f(ξ | D[h; t; s]) dξ .

Again by (4.2), we then obtain the inequality in (b) by letting ε → 0.
From (b), we obtain the inequality in (c) by letting u → ∞.
□

Note that, for an arbitrary stopping time σ, the following identity holds:

Ψ(h; t; s) = Φ_{W_σ}(h; t; s, 0) .

In particular (a) can be seen as a special case of (b) for u = 0.
Proposition 4.2 then substantially says that

W*(h; t; s) ≥ sup_{u≥0} Φ_{W*}(h; t; s, u)

By adapting standard arguments of Dynamic Programming, used in Spizzichino (1991), to the present setting, it can furthermore be shown that in fact

W*(h; t; s) = sup_{u≥0} Φ_{W*}(h; t; s, u) .  (4.7)

Thus an optimal stopping procedure does exist and can be described as follows: after observing a history D[h; t; s],
(i) stop at s, if

W*(h; t; s) = sup_{u≥0} Φ_{W*}(h; t; s, u) = Ψ(h; t; s) ≡ Φ_{W*}(h; t; s, 0) ;

(ii) stop at s + û, if, for some û > 0, it is

T_(h+1) > s + û and W*(h; t; s) = sup_{u≥0} Φ_{W*}(h; t; s, u) = Φ_{W*}(h; t; s, û) ;

(iii) if, ∀ u ≥ 0, it is Φ_{W*}(h; t; s, u) < W*(h; t; s), then go on up to T_(h+1)
and, from that instant on, continue with an analogous procedure.
By combining (ii) and (i) we see that the following implication holds:

W*(h; t; s) = Φ_{W*}(h; t; s, û) ⟹ W*(h; t; s + û) = Ψ(h; t; s + û)

so that the stopping procedure described above can shortly be defined by

σ* ≡ inf{s ≥ 0 : W*(H_s; T_(1), ..., T_(H_s); s) = Ψ(H_s; T_(1), ..., T_(H_s); s)} .
(4.8)
In other words

ρ_{σ*}^(N−h)(t) = inf{s ≥ t_h : W*(h; t; s) = Ψ(h; t; s)} − t_h  (4.9)

and we can conclude with the following result.

Proposition 4.3. For σ* defined by (4.9), it is, ∀ 1 ≤ h ≤ N, s ≥ 0, t_1 ≤ ... ≤ t_h ≤ s,

W_{σ*}(h; t; s) = W*(h; t; s) .

In particular σ* is optimal in the sense of the definition in (3.5).

Remark 4.1. In order to obtain the optimal stopping time σ*, one must first
compute the functions Ψ(·; ·; ·) and W*(·; ·; ·). W*(h; t; s) can be computed in terms of the functions W*(h + 1; ·; ·) and Ψ(h; ·; ·).
The stopping time σ* is optimal in the sense of Bayes optimality, and
the history of already observed failure times T_(1) = t_1, ..., T_(h) = t_h is of
course taken into account in the dynamic characterization of σ*, since it
influences the conditional distribution of the residual lifetimes of the surviving
components. Actually, for Ḡ_0(t) and Ḡ_1(t) given, such a distribution is determined by the conditional distribution of M_s (number of those substandard
components which are still surviving at s). So ρ_{σ*}^(N−h)(t) in (4.9) depends on
t only through the conditional probabilities w_s^(N−h)(k|t) (k = 0, ..., N − h).
Qualitative properties of ρ_{σ*}^(N−h)(t) are then affected by the kind of stochastic
dependence among C_1, ..., C_N.

We now turn to write down special forms of the functions ψ_k which
reasonably describe the cases when U_1, ..., U_N are components to be possibly
used for assembling a reliability system.
First of all it can be natural to assume

ψ_k(t_1, ..., t_k) = −δ , for k < n ,  (4.10)

for some non-negative quantity δ and some n ≤ N. This means that we have
a loss, or at best no gain, if fewer than n components are available.
For n ≤ k ≤ N, the following practical examples can be given.

1. ψ_k(t_1, ..., t_k) = ∑_{j=1}^k J(t_j)  (components to be used separately, one independently of another)
2. ψ_k(t_1, ..., t_k) = J(∑_{j=1}^k t_j)  (cold stand-by system)
3. ψ_k(t_1, ..., t_k) = J(max_{1≤j≤k} t_j)  (parallel system)
4. ψ_k(t_1, ..., t_k) = J[min_{1≤j≤n} (t_{l_{j,1}} + ... + t_{l_{j,r_j}})]  (series system)

J : [0, +∞] → ℝ being a given non-decreasing function.


Case 4 describes a series system of n components, each of which
is replaced by another component when it fails, if, at its failure time, there
are still spare parts available, so that {1, ..., k} = ∪_j {l_{j,1}, ..., l_{j,r_j}}, where
t_{l_{j,1}}, ..., t_{l_{j,r_j}} are the lifetimes of the components which are progressively put
into operation in the j-th position of the system.

More in general, we can consider the case of a coherent system:

5. ψ_k(t_1, ..., t_k) = 1_{[k≥n]} J(max_i min_{j∈P_i} (t_{l_{j,1}} + ... + t_{l_{j,r_j}})), where the P_i ⊂ {1, 2, ..., n}
are the path sets of the system.
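The first three gain structures translate directly into code. A sketch (the function names are ours, and the utility J used in the check is an arbitrary non-decreasing toy choice):

```python
def psi_separate(times, J):
    """Example 1: the k components are used separately, one independently of another."""
    return sum(J(t) for t in times)

def psi_standby(times, J):
    """Example 2: cold stand-by system; the gain depends on the total lifetime."""
    return J(sum(times))

def psi_parallel(times, J):
    """Example 3: parallel system; the gain depends on the longest lifetime."""
    return J(max(times))

J = lambda t: min(t, 4.0)        # toy non-decreasing utility of achieved lifetime
t = [1.0, 2.0, 3.0]
assert psi_separate(t, J) == 6.0
assert psi_standby(t, J) == 4.0  # the cap in J truncates the total lifetime 6.0
assert psi_parallel(t, J) == 3.0
```

The same lifetimes yield quite different gains under the three structures, which is the point made below: the optimal burn-in policy depends on the system to be assembled, not only on the marginal lifetime distribution.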

In the general case, finding σ* is not a feasible task. For this reason we
do not pursue further the analysis of the computation of ρ_{σ*}^(N−h)(t); rather
we prefer to concentrate attention on Open Loop Feedback Optimal (OLFO)
procedures. Open loop feedback optimality is a general concept from Optimal
Control Theory (see Runggaldier (1993) for a transposition to the burn-in
problem).
We shall denote by ρ̂^(N), ρ̂^(N−1)(t_1), ..., ρ̂^(1)(t_1, ..., t_{N−1}) the functions
characterizing the OLFO procedure.
In order to define the ρ̂'s, it is first necessary to analyze the special
case M ∼ b(N, p) (0 ≤ p ≤ 1), considered at point (A) in Section 3. As
we saw, this corresponds to the assumption that T_1, ..., T_N are independent
random quantities.
The problem of computing ρ_{σ*}^(N−h)(t) is much simpler in this case than in
the general case; indeed ρ_{σ*}^(N−h)(t) is simply a function of the arguments h
and t_h, which will be denoted by ρ_*^(N−h)(·).
In order to obtain ρ_*^(N−h)(·), the following arguments are to be taken into
account.
Conditionally on D[h; t; s], the residual lifetimes T_{i_1} − s, ..., T_{i_{N−h}} − s are
independent and their one-dimensional survival function F̄_s^(1)(ξ) is provided
by formula (3.7).
We can then write

Ψ(h; t; s) ≡ Ψ(h; s) = ∫_0^∞ ... ∫_0^∞ ψ_{N−h}(ξ_1, ..., ξ_{N−h}) ·
{ ∏_{j=1}^{N−h} [g_1(s + ξ_j) p + g_0(s + ξ_j)(1 − p)] } / [Ḡ_1(s) p + Ḡ_0(s)(1 − p)]^{N−h} dξ_1 ... dξ_{N−h}
Furthermore

P{T_(h+1) − s > ξ | D[h; t; s]} = [F̄_s^(1)(ξ)]^{N−h}

and thus the conditional density function of T_(h+1) − s defined in (4.4) takes
the form

f_(h+1)(ξ; s) = (N − h) [g_1(s + ξ) p + g_0(s + ξ)(1 − p)] ·
[Ḡ_1(s + ξ) p + Ḡ_0(s + ξ)(1 − p)]^{N−h−1} / [Ḡ_1(s) p + Ḡ_0(s)(1 − p)]^{N−h} .  (4.11)
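As a quick numerical sanity check of (4.11) — assuming, for the sketch only, exponential subpopulation survival functions Ḡ_i(t) = exp(−λ_i t) with arbitrary parameter values — the density of T_(h+1) − s integrates to one:

```python
import math

def f_next(xi, s, h, N, p, lam1, lam0):
    """Density (4.11) of the waiting time T_(h+1) - s to the next failure,
    specialized to exponential subpopulations."""
    g = p * lam1 * math.exp(-lam1 * (s + xi)) + (1 - p) * lam0 * math.exp(-lam0 * (s + xi))
    surv = p * math.exp(-lam1 * (s + xi)) + (1 - p) * math.exp(-lam0 * (s + xi))
    den = p * math.exp(-lam1 * s) + (1 - p) * math.exp(-lam0 * s)
    return (N - h) * g * surv ** (N - h - 1) / den ** (N - h)

# the density should integrate to (approximately) one over [0, infinity)
N, h, s, p, lam1, lam0 = 10, 4, 0.5, 0.3, 2.0, 0.5
dx = 0.001
total = sum(f_next(k * dx, s, h, N, p, lam1, lam0) * dx for k in range(40000))
assert abs(total - 1.0) < 1e-2
```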
In the present case, W*(h; t; s) is a function of the arguments h and s
only, which we denote by W*(h; s). It can be obtained by noting that (4.7)
becomes

W*(h; s) = sup_{u≥0} { ∫_0^u W*(h + 1; s + ξ) f_(h+1)(ξ; s) dξ
+ Ψ(h; s + u) · [F̄_s^(1)(u)]^{N−h} } .  (4.12)



It will furthermore be

ρ_*^(N−h)(t) = inf{s ≥ t : W*(h; s) = Ψ(h; s)}  (4.13)

W*(h; s) can be computed in terms of the functions W*(h + 1; s + ξ) and
Ψ(h; s + u). Due to (4.10), we have W*(h; s) ≡ −δ for N − n + 1 ≤ h ≤ N;
so we start by computing W*(N − n; s), for which (4.12) reads

W*(N − n; s) = sup_{u≥0} { −δ [1 − (F̄_s^(1)(u))^n] + Ψ(N − n; s + u) (F̄_s^(1)(u))^n }

= −δ + sup_{u≥0} { [F̄_s^(1)(u)]^n [δ + Ψ(N − n; s + u)] } .  (4.14)

Remark 4.2. In the case considered just above, T_1, ..., T_N are independent
variables distributed according to the survival function F̄^(1) given in (2.9).
For given N (initial number of components) and cost functions ψ_k, F̄^(1) completely determines the quantity ρ_*^(N) initiating the optimal burn-in procedure
σ*. For the subsequent developments we then use the symbol ρ_*^(N)(F̄^(1)).

Remark 4.3. In the case of independence, we can say that infant mortality is
present if

ρ_*^(N)(F̄^(1)) > 0 .

Thus, for the probability model defined by the assumption of independence
and by F̄^(1), we see that infant mortality depends on the structure of the
reliability system to be built, which determines the form of the ψ_k's.

Now we turn to consider the OLFO procedure for stopping the burn-in.
The functions ρ̂^(N), ρ̂^(N−1)(t_1), ..., ρ̂^(1)(t_1, ..., t_{N−1}) are defined as follows.
At time s = 0, let

ρ̂^(N) = ρ_*^(N)(F̄^(1))

as if we were in the case of stochastic independence.
For t_1 < ρ̂^(N), let

ρ̂^(N−1)(t_1) ≡ ρ_*^(N−1)(F̄^(1)(·|D[1; t_1; t_1]))

where F̄^(1)(·|D[1; t_1; t_1]) is a special case of the conditional one-dimensional
survival function defined in (2.19). This means that, if we observe T_(1) =
t_1 < ρ̂^(N), we take a posteriori into account the fact that T_1, ..., T_N are
not independent, and the observation T_(1) = t_1 is considered to update our
assessment of the probability model for the (N − 1) surviving components;
but in planning the continuation of the test we consider, once again, their

lifetimes to be independent, distributed according to the new marginal survival function F̄^(1)(·|D[1; t_1; t_1]). Continuing in this way we let in general, for
h = 1, ..., N − 1, 0 ≤ t_1 ≤ t_2 ≤ ... ≤ t_h, t_h < t_{h−1} + ρ̂^(N−h+1)(t_1, ..., t_{h−1}),

ρ̂^(N−h)(t_1, ..., t_h) = ρ_*^(N−h)(F̄^(1)(·|D[h; t; t_h])) .  (4.15)

We underline that F̄^(1)(·|D[h; t; t_h]), which can be obtained by specializing
(2.19), depends only on the functions Ḡ_1(t_h + ·)/Ḡ_1(t_h), Ḡ_0(t_h + ·)/Ḡ_0(t_h)
and on the conditional expected value

E(M_{t_h} | D[h; t; t_h]) / N_{t_h} = E(M_{t_h} | D[h; t; t_h]) / (N − h) ,

so that aging properties of the Ḡ_i's and dependence properties among C_1, ..., C_N
can be used to obtain monotonicity properties of the functions ρ̂^(N−h)
(t_1, ..., t_h).

Example. In order to illustrate the above arguments we consider a special
case defined by the following simplifying conditions:

ψ_k(t_1, ..., t_k) = −c(N − k) + C ∑_{j=1}^k 1_{[t_j > τ]} − L ∑_{j=1}^k 1_{[t_j ≤ τ]} ,  k = 0, 1, ..., N  (4.16)

where L > C > c > 0 are given quantities and τ > 0 is a fixed mission time;

Ḡ_i(t) = exp{−λ_i t} , t ≥ 0  (4.17)

where λ_1 > λ_0 > 0.
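A direct transcription of (4.16) — interpreting, as above, the arguments t_j as the lifetimes of the k delivered components — can be checked on a toy instance (all numerical values are arbitrary):

```python
def psi(times, N, c, C, L, tau):
    """Gain (4.16): cost c for each of the N - k components failed during burn-in,
    reward C for each delivered component surviving the mission time tau,
    penalty L for each delivered component failing within the mission."""
    k = len(times)
    return (-c * (N - k)
            + C * sum(1 for t in times if t > tau)
            - L * sum(1 for t in times if t <= tau))

# N = 5 components; 3 delivered, with lifetimes 0.5, 2.0, 3.0 and mission time 1.0
assert psi([0.5, 2.0, 3.0], N=5, c=1.0, C=5.0, L=10.0, tau=1.0) == -2.0
```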


First of all we note that, by (2.19) and (4.17), for any 0 ≤ h ≤ N − 1 and
t_1 ≤ t_2 ≤ ... ≤ t_h, F̄^(1)(·|D[h; t; s]) is always of the form

F̄^(1)(ξ|D[h; t; s]) = exp{−λ_1 ξ} · E(M_s|D[h; t; s])/(N − h)
+ exp{−λ_0 ξ} · [1 − E(M_s|D[h; t; s])/(N − h)] .  (4.18)

In order to obtain ρ̂^(N), ρ̂^(N−1)(t_1), ..., ρ̂^(1)(t_1, ..., t_{N−1}) we must first
consider the case of independence, characterized by

F̄^(1)(t) = p exp{−λ_1 t} + (1 − p) exp{−λ_0 t} ,  (4.19)

and obtain the quantity ρ_*^(N). As a consequence of the additive structure of
the ψ_k in (4.16), W*(h; s) and Ψ(h; s) decompose additively over the surviving
components, so that ρ_*^(N) ≡ ρ_* does not actually depend on N; for λ_1, λ_0 given, ρ_* is a
function of p (0 < p < 1) which we denote by ρ_*(p). Under the condition

ρ_*(p) is the only solution of the equation

γ(p) · exp{−λ_1 τ} + [1 − γ(p)] · exp{−λ_0 τ} = (L − c)/(L + C)

where

γ(p) = λ_1 p exp{−λ_1 ρ_*(p)} / [λ_1 p exp{−λ_1 ρ_*(p)} + λ_0 (1 − p) exp{−λ_0 ρ_*(p)}]

(see also Clarotti and Spizzichino 1990).
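Since ρ_*(p) is defined only implicitly, it has to be found numerically. The following sketch solves the fixed-point equation for ρ_*(p) by bisection; the precise form of the equation is as reconstructed here from the text, and all parameter values are arbitrary illustrations:

```python
import math

def solve_rho(p, lam1, lam0, tau, L, C, c, lo=0.0, hi=50.0):
    """Bisection for rho_*(p): find rho such that
    gamma*exp(-lam1*tau) + (1-gamma)*exp(-lam0*tau) = (L - c)/(L + C),
    where gamma is the weight of the weak subpopulation in the failure density
    at age rho. The left-hand side is increasing in rho (for lam1 > lam0)."""
    target = (L - c) / (L + C)

    def lhs(rho):
        w1 = lam1 * p * math.exp(-lam1 * rho)
        w0 = lam0 * (1 - p) * math.exp(-lam0 * rho)
        gamma = w1 / (w1 + w0)
        return gamma * math.exp(-lam1 * tau) + (1 - gamma) * math.exp(-lam0 * tau)

    for _ in range(100):
        mid = (lo + hi) / 2.0
        if lhs(mid) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

p, lam1, lam0, tau, L, C, c = 0.3, 2.0, 0.5, 1.0, 10.0, 5.0, 1.0
rho = solve_rho(p, lam1, lam0, tau, L, C, c)
assert rho > 0.0   # infant mortality present: a strictly positive burn-in is planned
# self-consistency: the returned value satisfies the equation to high accuracy
w1 = lam1 * p * math.exp(-lam1 * rho); w0 = lam0 * (1 - p) * math.exp(-lam0 * rho)
g = w1 / (w1 + w0)
assert abs(g * math.exp(-lam1 * tau) + (1 - g) * math.exp(-lam0 * tau)
           - (L - c) / (L + C)) < 1e-9
```

The check of self-consistency, rather than of a particular numerical value, keeps the sketch honest about the reconstruction.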
As far as the OLFO procedure is concerned, we then have

ρ̂^(N) = ρ_*(E(M)/N)

and, by taking into account also (4.18),

ρ̂^(N−h)(t_1, ..., t_h) = ρ_*( E(M_{t_h}|D[h; t; t_h]) / (N − h) ) .  (4.20)

It is easy to see that ρ_*(p) is increasing in p, so that monotonicity properties of
the functions ρ̂^(N−h) can be obtained by means of a study of E(M_{t_h}|D[h; t; t_h])/(N − h).
We recall that E(M_s|D[h; t; s]) can be computed by using (2.22). In
order to apply (2.22), we note that, under (4.17), it is

z(s) = exp{−(λ_1 − λ_0)s} .

Acknowledgement. I thank my colleagues Menachem Berg and Uwe Jensen for useful
discussions and comments. I would also like to thank the organizing committee of the
Antalya NATO-ASI meeting for the excellent organization and hospitality.
Partial support of the CNR Progetto Strategico Applicazioni della Matematica per la
Tecnologia e la Società is also acknowledged.

References

Aldous, D.J.: Exchangeability and Related Topics. École d'Été de Saint-Flour. Lecture
Notes in Mathematics. Berlin: Springer 1983
Arjas, E.: The Failure and Hazard Processes in Multivariate Reliability Systems.
Math. Oper. Res. 6, 551-562 (1981)
Arjas, E.: Survival Model and Martingale Dynamics. Scand. J. Statist. 16, 117-225
(1989)
Arjas, E., Haara, P., Norros, I.: Filtering the Histories of a Partially Observed
Marked Point Process. Stoch. Proc. Appl. 40, 225-250 (1992)
Barlow, R.E.: A Bayesian Explanation of an Apparent Failure Rate Paradox. IEEE
Trans. on Rel. R-34, 107-108 (1985)

Barlow, R.E., Mendel, M.B.: Similarity as a Probabilistic Characteristic of Ageing.
In: Barlow, R.E., Clarotti, C.A., Spizzichino, F. (eds.): Reliability and Decision
Making. London: Chapman and Hall 1993
Barlow, R.E., de Pereira, C., Wechsler, S.: A Bayesian Approach to Environmental
Stress Screening. Naval Research Logistics 41, 215-228 (1994)
Barlow, R.E., Proschan, F.: Statistical Theory of Reliability and Life Testing. New
York: Holt, Rinehart, Winston 1975
Barlow, R.E., Proschan, F.: Life Distributions and Incomplete Data. In: Krishnaiah,
P.R., Rao, C.R. (eds.): Handbook of Statistics 7. London: Elsevier 1988, pp.
225-249
Barlow, R.E., Spizzichino, F.: Schur-Concave Survival Functions and Survival Anal-
ysis. Journal of Computational and Applied Mathematics 46, 437-447 (1993)
Bergman, B.: On Reliability Theory and its Applications. Scand. J. Statist. 12, 1-41
(1985)
Bergman, B., Klefsjo, B.: Burn-in Models and TTT Transforms. Quality and Reli-
ability International 1, 125-130 (1985)
Bertsekas, D.P.: Dynamic Programming and Stochastic Control. New York: Aca-
demic Press 1976
Block, H.W. , Mi,J. , Savits, T.H.: Burn-in and Mixed Populations. J. Appl. Prob.
30, 692-702 (1993)
Bremaud, J.P.: Point Processes and Queues. Martingale Dynamics. New York:
Springer 1981
Cassel, D., Särndal, C.E., Wretman, J.H.: Foundations of Inference in Survey Sampling. New York: Wiley 1977
Çinlar, E., Özekici, S.: Reliability of Complex Devices in Random Environments.
Prob. Eng. Inform. Sc. 1, 97-115 (1987)
Clarotti, C.A., Spizzichino, F.: Bayes Burn-in Decision Procedures. Prob. Eng. and
Inform. Sc. 4, 437-445 (1990)
Costantini, C., Spizzichino, F.: Optimal Stopping for Life Testing: Use of Stochastic
Ordering in the Case of Conditionally Exponential Life Times. In: Mosler, K.,
Scarsini, M. (eds.): Stochastic Orders and Decision under Risk 1992
de Finetti, B.: La Prévision: ses Lois Logiques, ses Sources Subjectives. Ann. Inst.
H. Poincaré 7, 1-68 (1937)
de Finetti, B.: Teoria delle Probabilità. Torino: Einaudi. English translation: Theory
of Probability. New York: Wiley 1970
Ericson, W. A.: Subjective Bayesian Models in Sampling Finite Populations. J. Roy.
Statist. Soc. B31, 195-233 (1969)
Iovino, M.G., Spizzichino, F.: A Probabilistic Approach For an Optimal Screening
Problem. J. Ital. Stat. Soc. 2, 309-335 (1993)
Jensen, F., Petersen, N.E.: Burn-in. New York: Wiley 1982
Jensen, U., Hsu, G.: Optimal Stopping by Means of Point Process Observations with
Applications in Reliability. Math. Oper. Res. 18, 645-657 (1993)
Kendall, D.G.: On Finite and Infinite Sequences of Exchangeable Events. Studia
Scient. Math. Hung. 2, 319-327 (1967)
Koch, G.: A Dynamical Approach to Reliability Theory. In: Serra, A., Barlow,
R.E. (eds.): Proceedings of the International School of Physics "Enrico Fermi",
Course XCIV on Theory of Reliability. Amsterdam: North-Holland 1986
Kuo, W., Kuo, Y.: Facing the Headaches of Early Failures: A State-of-the-Art
Review of Burn-in Decision. Proc. IEEE 71, 1257-1266 (1983)
Lawless, J.F.: Statistical Models and Methods for Lifetime Data. New York: Wiley
1982
Lindley, D.V., Novick, M.R.: The Role of Exchangeability in Inference. Ann. of
Statist 9, 45-58 (1981)

Marcus, R., Blumenthal, S.: A Sequential Screening Procedure. Technometrics 16,
229-234 (1974)
Mi, J.: Optimal Burn-in. Ph.D. Dissertation, Department of Statistics, University
of Pittsburgh (1991)
Norros, I.: Systems Weakened by Failures. Stoch. Proc. Appl. 20, 181-196 (1985)
Proschan, F.: Theoretical Explanation of Observed Decreasing Failure Rate. Technometrics 5, 375-383 (1963)
Runggaldier, W.: On Stochastic Control Concepts for Sequential Burn-in Procedures. In: Barlow, R.E., Clarotti, C.A., Spizzichino, F. (eds.): Reliability and
Decision Making. London: Chapman and Hall 1993
Shaked, M., Shanthikumar, J.G.: Multivariate Stochastic Orderings and Positive Dependence in Reliability Theory. Math. Oper. Res. 15, 545-552 (1990)
Shaked, M., Shanthikumar, J.G.: Dynamic Multivariate Aging Notions in Reliability
Theory. Stoch. Proc. Appl. 38, 85-97 (1991)
Shiryaev, A.N.: Optimal Stopping Rules. New York: Springer 1978
Singpurwalla, N.D., Youngren, M.A.: Models for Dependent Lifelengths Induced by
Common Environment. In: Block, H.W., Sampson, A., Savits, T. (eds.): Topics
in Statistical Dependence. Lecture Notes-Monograph Series. Institute of Mathematical Statistics (1991)
Spizzichino, F.: Extendibility of Symmetric Probability Measures. In: Koch, G.,
Spizzichino, F. (eds.): Exchangeability in Probability and Statistics. Amster-
dam: North-Holland 1982
Spizzichino, F.: Sequential Burn-in Procedures. J. Stat. Plan. Inf. 29, 187-197
(1991)
Spizzichino, F.: Reliability Decision Problems under Conditions of Ageing. In:
Bernardo, J., Berger, J., Dawid, A.P., Smith, A.F.M. (eds.): Bayesian Statistics
4. Oxford: Clarendon Press 1992, pp. 803-811
Spizzichino, F.: A Unifying Approach to Optimal Design of Life-Testing and Burn-
in. In: Barlow, R.E., Clarotti, C.A., Spizzichino, F. (eds.): Reliability and De-
cision Making. London: Chapman and Hall 1993a
Spizzichino, F.: Extendibility of Schur Survival Functions and Aging Properties of
Their One-Dimensional Marginal. In: Vilaplana, J.P., Puri, M.L. (eds.): Recent
Advances in Probability and Statistics. Zeist: VSP Publishers 1993b
Vaupel, J.W., Yashin, A.I.: Heterogeneity's Ruses: Some Surprising Effects of Selection on Population Dynamics. The American Statistician 39, 176-185 (1985)
Part III

Stochastic Methods in Software Engineering


An Overview of
Software Reliability Engineering
John D. Musa
AT&T Bell Laboratories, 480 Red Hill Road, Middletown, NJ 07748-3052, USA

Summary. Software reliability engineering has become an increasingly important
part of software development and software acquisition as the dependence of society
on software has become virtually universal. This paper gives an overview of this
technology as it is currently practiced, indicating its benefits. It also indicates some
current open research questions; progress in these areas will likely affect the way in
which software reliability engineering is practiced in the future.

Keywords. Reliability, software, software reliability, software reliability engineering, software metrics, failure intensity

1. Introduction

Software reliability engineering is a rapidly spreading practice for software-
based systems. We use the term "software-based" because the real interest is
in reliability of total systems, which may have both hardware and software
components. Clearly, there is no such thing as a pure software system; some
sort of computing logic (hardware) is always needed to execute a program.
Software reliability engineering is an important subset of the larger domain
of software engineering. It has four significant defining characteristics:
1. setting quantitative reliability objectives in such a way that customer
satisfaction with the product will be maximized if they are met,
2. engineering the product and the development process to meet the objec-
tives,
3. focusing development and test on the highest-use and most critical oper-
ations, and
4. testing components and the system to meet the objectives.
This brief characterization will no doubt stimulate some questions in your
mind; the goal of this paper is to answer them.
Software reliability engineering can be applied at different levels. We will
use the term "system" throughout this paper in a generic sense; it can refer
to the entire product being developed, a subsystem of that product, or a
"supersystem" that deals with the operation of the product in the context of
a larger system or network of systems. Thus "system" can refer to distributed
systems that execute over many computers.
Software reliability engineering is practiced over the entire life cycle of a
system, from conception to field operation, usually involving multiple releases.
The term "reliability" is used in the same sense for software as it is for
hardware (Musa et al. 1987). It is the probability of failure-free execution
of a program for a specified period, use, and environment. For example, a
program may have a reliability of 0.99 for 8 hours of execution. Note that
the relevant time is execution time, the actual time that the processor is exe-
cuting the program. The definition of software reliability in analogous terms
to hardware reliability is deliberate, because we want to be able to combine
reliabilities of hardware and software components to obtain system reliability.
The cause of failure in software is different than in hardware; it is erroneous
or incomplete design rather than wear, fatigue, burnout, etc. It should not be
surprising that we use compatible definitions even though failure mechanisms
are different; we already employ a common definition across hardware even
though hardware has many different failure mechanisms. Note that hardware
can also fail from design errors; in this sense, software reliability theory could
be applied to some hardware situations.
Software reliability engineering has spread rapidly in practice because of
the substantial benefits it provides and the relatively low cost of implemen-
tation.

2. Benefits
The benefits derived from software reliability engineering start in the system
engineering phase. Quantitative expression of reliability needs enables sup-
pliers of software-based products to more precisely understand the needs of
users of these products. Assuming that a product is designed to deliver the
functionality required, user satisfaction (the concept of "quality") depends
on multiple factors, but perhaps the three salient ones are reliability, delivery
date, and cost. These quality attributes interact with each other; to obtain
increased reliability requires longer development time or greater cost or both.
If rapid delivery of a product is essential to meet a user's needs, something
must give: either reliability will suffer or cost will escalate. When you can
analyze a user's conflicting needs with respect to these quality attributes and
set more precise goals, you set the stage for a higher level of user satisfaction.
Software reliability engineering includes quantitatively determining how
users will employ a system and uses this information to both tune the system
to this pattern of use and to focus development attention on the operations
that are used the most and/or are most critical. A "critical" operation is one
whose failure will have a severe impact in terms of risk to human life, cost, or
level of service. This focus speeds up development and reduces costs because
we don't waste time and effort on infrequently used, noncritical operations.
Software reliability engineering reduces the risk of unsatisfactory reliabil-
ity by engineering and tracking reliability during development.
As an example of the benefits of software reliability engineering, consider
the International DEFINITY project (a PBX switching system) of AT&T
(Musa 1993). By applying SRE and related technologies, the project reduced
customer-reported problems and maintenance costs by a factor of 10, system
test interval and system test costs by a factor of 2, and product introduc-
tion interval by 30 percent. Customer satisfaction improved significantly, as
indicated by an increase in sales by a factor of 10.
Software reliability engineering was selected as a Best Current Practice in
AT&T in May 1991. To become a Best Current Practice, a practice must be
widespread, have a documented strong benefit to cost ratio (in this case, the
ratio exceeded 12), and pass a probing review at two different management
levels.

3. Nature of Practice
Software reliability engineering consists of seven principal activities, spread
out over the software life cycle:
1. developing the operational profile,
2. defining "failure" with severity classes,
3. setting failure intensity objectives,
4. engineering the product and the development process to meet the failure
intensity objectives,
5. certifying the failure intensities of acquired software components,
6. reducing and assuring failure intensities during test, and
7. monitoring field failure intensities against objectives.

3.1 Developing the Operational Profile

The operational profile characterizes the way in which a system is expected
to be used. It is developed primarily during the requirements definition and
high level design phases by system engineers and system architects. For a full
description, see Musa (1996) in this volume.

3.2 Defining "Failure" with Severity Classes

Since the concept of reliability depends directly on the definition of "failure,"
the implementation of software reliability engineering requires that we specify
what we mean by "failure" for the project we are dealing with. This definition
is generally accomplished by system engineers during the requirements phase.
First, we need to distinguish between the concepts of "failure" and "fault"
because they are often confused in the literature. A "failure" is a departure
1 DEFINITY is a registered trademark of AT&T.
of program operation from user requirements, while a "fault" is the defect
in the program that causes the failure when it is executed. The concept of
fault is developer-oriented. Thus when we speak of reliability, we are taking a
user viewpoint. We highlight this distinction because developers have tended
to focus on faults rather than failures. The fault concept is useful when you
are trying to understand how faults (bugs) are introduced into software and
how to find and correct them. However, there is a significant danger. A small
number of faults that occur in code that is heavily used can cause a large
number of failures and great user dissatisfaction. Thus the number of faults
is not a good indicator of reliability. On the other hand, concentrating on
faults in code that is little used is very inefficient. It can cause substantial
time delays and costs with little improvement in reliability.
Note that specifying what a user views as failures is essentially delineat-
ing the negative requirements, the program behavior the user can't tolerate.
Specifying what users don't want in addition to what they want almost always
clarifies their needs.
Failures generally differ in the impact on users. Hence we need to classify
them by severity. The most common classification criteria are risk to human
life, cost, and effect on service. Cost includes not only direct expenditures
caused by the failure but also loss of present or future business. Classes are
generally separated by an order of magnitude because impact can't usually
be computed precisely. For example, one class might include failures with es-
timated cost impact of $10,000 to $100,000; another, $100,000 to $1,000,000.
Most organizations tend to have 4 classes, with 3 or 5 also occurring rela-
tively frequently. More classes result in lower level classes whose effect is truly
negligible and hence could be ignored instead of being identified as failures;
fewer classes don't provide for the real differences that occur. An example of
severity classification based on service is shown in Table 3.1.

Table 3.1. Severity classification based on service

Severity class   Definition

1   Complete unavailability to users of services essential to them
2   Some services essential to users are unavailable
3   Some services essential to users are unavailable, but they all have workarounds
4   Some services are unavailable, but they don't affect customers
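Because classes are separated by an order of magnitude, a cost-based severity classification amounts to a decade lookup. A minimal sketch, using the illustrative dollar boundaries quoted above (the function name and exact cutoffs are hypothetical):

```python
def severity_class(cost_impact_dollars):
    """Map an estimated failure cost to a severity class, one class per
    decade of cost.  Boundaries here are illustrative examples only."""
    if cost_impact_dollars >= 1_000_000:
        return 1
    if cost_impact_dollars >= 100_000:
        return 2
    if cost_impact_dollars >= 10_000:
        return 3
    return 4

# One failure from each decade of cost impact:
classes = [severity_class(c) for c in (5_000, 50_000, 500_000, 5_000_000)]
```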
3.3 Setting Failure Intensity Objectives


Failure intensity is the number of failures per unit of execution time. It is
related to reliability R by
R = exp(−λτ),
where λ is the failure intensity and τ is the execution time duration for which
the reliability is specified. Failure intensity is commonly specified as failures
per thousand CPU hours. Since tracking the progress of reliability toward an
objective is particularly important for software, and failure intensity is more
convenient than reliability for this purpose, the failure intensity alternative
expression for reliability is the most commonly used one.
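The relation between reliability and failure intensity is easy to exercise numerically. The sketch below (illustrative numbers only) also inverts it to recover the failure intensity objective implied by the introduction's example of reliability 0.99 for 8 hours of execution:

```python
import math

def reliability(failure_intensity, exec_time):
    """R = exp(-lambda * tau): probability of failure-free execution
    over exec_time, given a constant failure intensity."""
    return math.exp(-failure_intensity * exec_time)

def required_intensity(target_reliability, exec_time):
    """Invert the relation: the failure intensity needed to achieve
    target_reliability over exec_time."""
    return -math.log(target_reliability) / exec_time

# 50 failures per 1000 CPU hr, over 8 CPU hr of execution:
r = reliability(50 / 1000, 8)                 # exp(-0.4), about 0.67

# Objective implied by R = 0.99 for 8 CPU hr:
# about 1.26 failures per 1000 CPU hr.
lam = required_intensity(0.99, 8) * 1000
```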
It is very common to have multiple failure intensity objectives. For exam-
ple, failure intensity objectives will generally be lower (more stringent) for
high failure severity classes.
Several factors are involved in setting failure intensity objectives. Fail-
ure intensity of existing and competitive systems and user satisfaction with
them is one of the most important. Comparative analysis of life cycle costs
of systems with different failure intensity objectives is often another, and
its importance is likely to grow as our ability to make accurate reliability
predictions improves. By reliability prediction (Musa et al. 1987), we mean
projection of reliability as a function of product and development process
parameters such as program size, requirements volatility, amount of require-
ments, design, and code review, length of testing, etc. This is contrasted with
reliability estimation, which refers to projection based on failure time data.
There is a clear opportunity for research in this area.

3.4 Engineering Product and Development Process


There are three principal subactivities involved in engineering the product
and the development process to meet the failure intensity objectives that we
have set. They are primarily performed by system architects or high level
designers, often with major decisions being made by the project manager.
First, the objectives must be allocated among the hardware and software
components of the system, using reliability combinatorics. The breakdown
of the system follows natural divisions of the system to some extent, but
it also involves managerial and engineering judgment. For example, we will
ordinarily identify a part of the system being developed by a subcontractor
or different organization as a separate component. The allocation of failure
intensity objectives among components usually tries to achieve balance. That
is, we want to make attaining the objectives approximately equally difficult.
Other criteria include minimizing total development cost and meeting the
overall scheduled delivery date.
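The combinatorics for components in series can be sketched as follows. The equal split shown is a simplification of the "balance" criterion (equal difficulty, not necessarily equal shares), and all numbers are illustrative:

```python
def system_failure_intensity(component_intensities):
    """For components in series (the system fails if any component
    fails), reliabilities multiply, so with the exponential reliability
    relation the component failure intensities simply add."""
    return sum(component_intensities)

def allocate_equally(system_objective, n_components):
    """Naive 'balanced' allocation sketch: split the system failure
    intensity objective into equal shares."""
    return [system_objective / n_components] * n_components

# A system objective of 60 failures per 1000 CPU hr over three components:
shares = allocate_equally(60, 3)          # each component gets 20
total = system_failure_intensity(shares)  # recombines to 60
```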
The second subactivity is determining the mix of reliability strategies we
will use. This will affect both product and development process design. The
principal strategies are fault tolerance, reviews, and test. We must determine
the contribution each strategy must make to the overall failure intensity, con-
sidering the effects on development time, development cost, and operational
efficiency. When the failure intensity objective is high, testing alone may be
sufficient. As the objective is reduced (made more stringent), we must increas-
ingly use requirements, design, and code reviews. Very low failure intensity
objectives require the use of fault tolerant features.
The third subactivity is to use the operational profile and a list of critical
operations to allocate process resources (primarily people). Allocations are
made with respect to operations, which are externally initiated tasks such
as commands or transactions. You can speed up the delivery of operations
that are heavily employed or critical to users by operational development,
the organization and scheduling of development by operation rather than by
module. You can reduce cost with the concept of reduced operation software
(ROS). This is the analog of RISC (reduced instruction set computing). You
reduce the total number of operations that must be implemented by elimi-
nating or finding other ways to accomplish the infrequently used, noncritical
operations. For example, you may replace a complex operation by a sequence
of simpler basic operations, possibly with some manual intervention. Any loss
in operational efficiency is small because the operations replaced occur only
rarely, and it is more than compensated for by development cost savings.

3.5 Certifying Failure Intensities of Acquired Software Components

Software projects frequently reuse components from other projects, employ
"off the shelf" software, or subcontract components to other development
organizations. Before proceeding to regular system test, there is a need to
certify the components to reduce the risk that they will cause problems once
integrated into the system. They must be tested with the operational profile
that will be experienced by the overall system. A simple way to do this is
to integrate them into the system and exercise the system with the opera-
tional profile, recording only failures that occur in the component in question.
Certification testing is frequently performed by system testers before system
testing.
Failure times are recorded and are plotted on a reliability demonstration
chart (Musa et al. 1987). The chart is constructed based on sequential sam-
pling theory. It is easily drawn manually or with a simple computer program.
Its precise form depends on the discrimination ratio (permissible factor of er-
ror in estimation), consumer risk (risk of accepting a bad program), supplier
risk (risk of rejecting a good program), and the failure intensity objective for
the component. An example is shown in Figure 3.1, where the discrimination
ratio is 2, consumer and supplier risks are 10 %, and the failure intensity
objective is 50 failures per 1000 CPU hr.
Note that there are three regions: reject, continue, and accept. As long as
failure times remain in the continue region, you keep testing. As soon as a
failure time crosses into a reject or accept region, you can reject or accept the
software based on the discrimination ratio, risk levels, and failure intensity
objective that have been set. For example, in Figure 3.1, the first two failures
(at 15 and 25 CPU hr) plot in the continue region. The third failure occurs
at 100 CPU hours; it is in the accept region, permitting the component to
be accepted. It is possible for software that experiences no failures to be
accepted; in this example, this would happen after 40 CPU hours of failure-
free operation.

[Figure 3.1 plots failure number against failure time in CPU hr; boundary lines divide the chart into reject, continue, and accept regions.]

Fig. 3.1. Reliability demonstration chart
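The chart's boundary lines follow from sequential sampling theory. The sketch below uses one standard sequential probability ratio test parametrization, which may differ slightly in its constants from Musa's published charts; under it, failure-free acceptance for the Figure 3.1 parameters occurs at about 44 rather than 40 CPU hr:

```python
import math

def sprt_decision(failure_times, objective, ratio=2.0, alpha=0.1, beta=0.1):
    """Sequential test of H0: intensity = objective (accept) against
    H1: intensity = ratio * objective (reject).  failure_times are the
    cumulative times (CPU hr) of successive failures."""
    upper = math.log((1 - beta) / alpha)   # crossing above: reject
    lower = math.log(beta / (1 - alpha))   # crossing below: accept
    decisions = []
    for n, t in enumerate(failure_times, start=1):
        # Log-likelihood ratio after n failures in total time t.
        llr = n * math.log(ratio) - (ratio - 1) * objective * t
        if llr >= upper:
            decisions.append("reject")
        elif llr <= lower:
            decisions.append("accept")
        else:
            decisions.append("continue")
    return decisions

# Figure 3.1 example: objective 50 failures per 1000 CPU hr, failures
# at 15, 25, and 100 CPU hr.
outcome = sprt_decision([15, 25, 100], objective=50 / 1000)
```

With these inputs the first two failures fall in the continue region and the third in the accept region, matching the narrative above.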

3.6 Reducing and Assuring Failure Intensities


The system test and beta test phases are generally both periods of reliability
growth, even though environments may differ between the two phases. They
also serve to increase the level of reliability assurance. Alternatively you may
think of these test phases as periods in which failure intensity is reduced and
we increase our assurance that it is reduced. The reduction comes about, of
course, as we experience failures and we search out and remove the faults
that are causing them.
A model of this failure intensity reduction is shown in Figure 3.2. The
actual reduction is discontinuous. The removal of each fault causes a dis-
continuity whose size depends on how often that fault is activated by the
usage pattern (operational profile) of the software. Software reliability mod-
els generally focus on test periods, they are generally nonincreasing, and they
are usually expressed in execution time (Musa et al. 1987). Most of them are
based on nonhomogeneous Poisson processes. Maximum likelihood estimation
is commonly used to determine their parameters, although this is certainly
not a requirement. The models that have been most commonly employed in
practice are the Musa-Okumoto logarithmic Poisson execution time model
(Musa and Okumoto 1984) and the Musa basic execution time model (Musa
1975).
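The failure intensity functions of these two models can be written down directly; the parameter values below are illustrative only:

```python
import math

def basic_intensity(tau, lam0, nu0):
    """Musa basic execution time model: failure intensity decays
    exponentially with execution time tau; lam0 is the initial
    intensity and nu0 the total expected number of failures."""
    return lam0 * math.exp(-(lam0 / nu0) * tau)

def log_poisson_intensity(tau, lam0, theta):
    """Musa-Okumoto logarithmic Poisson execution time model: intensity
    decays more slowly, with no finite total failure count."""
    return lam0 / (lam0 * theta * tau + 1)

# Illustrative parameters: initial intensity 10 failures per CPU hr.
basic = [basic_intensity(t, 10.0, 100.0) for t in (0, 10, 20)]
logp = [log_poisson_intensity(t, 10.0, 0.05) for t in (0, 10, 20)]
```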

[Figure 3.2 plots failure intensity declining with execution time up to the present.]

Fig. 3.2. Model of failure intensity reduction

Establishing the level of assurance of failure intensity is based on examining
the confidence interval for our estimate of failure intensity. In practice,
the 75 % confidence interval has been most commonly used; it represents a
good balance between high confidence levels and the necessarily large ranges
associated with such intervals. Figure 3.3 indicates how the confidence inter-
val typically decreases with execution time, as failure intensity estimates are
based on more and more data. Note that we are concerned principally with
the upper confidence limit; we don't care how much failure intensity might
be lower than what we have estimated.

[Figure 3.3 plots failure intensity against execution time, showing the nominal estimate and a confidence interval that narrows toward the present.]

Fig. 3.3. Confidence interval for failure intensity estimates

The procedure for estimating failure intensity during system test or beta
test is straightforward, although there are refinements for special situations
such as program evolution, absence of execution time information, etc. (Musa
et al. 1987). The system is tested by selecting runs in accordance with the
operational profile. Failures are identified and failure times are recorded. The
failure data is input to a reliability estimation program. Such programs use
reliability models and estimation techniques (as noted above) to estimate
failure intensity and its confidence interval or intervals. You compare failure
intensity with your failure intensity objective on a periodic basis. This typi-
cally occurs daily for short test periods and weekly for long ones. As noted
previously, you may have multiple failure intensity objectives to account for
such situations as failure severity classes. In this case, you have corresponding
multiple failure intensity measurements.
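As an illustration of the estimation step, the sketch below computes a point estimate and an approximate 75 % confidence interval from a failure count and execution time, using a normal approximation to the Poisson count; actual reliability estimation programs derive intervals from the fitted model instead:

```python
import math
from statistics import NormalDist

def intensity_with_interval(n_failures, exec_hours, confidence=0.75):
    """Point estimate of failure intensity (failures per CPU hr) with an
    approximate two-sided confidence interval, via the normal
    approximation to the Poisson count."""
    lam = n_failures / exec_hours
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(n_failures) / exec_hours
    return max(lam - half_width, 0.0), lam, lam + half_width

# 12 failures observed in 150 CPU hr of test:
lo, mid, hi = intensity_with_interval(12, 150)
```

In tracking, it is the upper bound `hi` that would be compared against the failure intensity objective.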
The comparison is used initially to highlight the need for corrective ac-
tions, such as changing the levels of resources devoted to testing, changing
testing schedules, or renegotiating delivery dates or failure intensity objec-
tives. When the failure intensity reaches the objective, one of the criteria
that guides release to the next phase is satisfied. We usually track the upper
confidence bound of estimated failure intensity, because we want to establish
meeting the objective at some level of confidence.
The failure intensity estimates generated by a software reliability estima-
tion program for the system test phase of a software development project
are shown in Figure 3.4. The center line is the maximum likelihood estimate;
the other two lines are the bounds of the 75 % confidence intervals. Note
that the test phase covers almost four months, during which time the failure
intensity is substantially reduced (the vertical axis is logarithmic, tending to
deemphasize the reduction). The "noise" in the plots represents not only the
discontinuous nature of failure intensity reduction but also natural random
variation (the estimates are made from relatively small sample sizes early in
test).

[Figure 3.4 plots failure intensity (failures/1000 hr, on a logarithmic scale from 10 to 10000) from August through November.]

Fig. 3.4. Failure intensity estimates and 75 % confidence intervals during system test

However, you will note a significant upward trend in September that dom-
inates the variation resulting from random effects. This was the sign of a
potential problem requiring investigation. The investigation showed that, un-
known to the testers, some developers had added additional new features to
the system, introducing additional faults and driving up the failure intensity.
This is a graphic example of how tracking failure intensity during test can
uncover problems.

3.7 Monitoring Field Failure Intensities

The last principal software reliability engineering activity is monitoring
failure intensities in the field. A primary reason for doing this is to obtain the
feedback you need to determine how well you have met your users' require-
ments. If comparison of actual field failure intensities with objectives indicates
that you have not met these objectives, you should analyze both the product
and your development process to determine where improvement is required.
Another reason for tracking field failure intensity is to provide guid-
ance to field personnel as to when they can "safely" install new features.
In Fig. 3.5, an operational system in the field was tracked for close to two
years. The failure intensity objective ("service objective" in the figure) was
50 failures per 1000 CPU hours. The utilization of the system was about 20
CPU hours/week; consequently, the objective represents about one failure
per week. Ignore the very first part of the plot; estimates here can have large
errors because they are based on small samples.

[Figure 3.5 plots failure intensity (failures/1000 hr, on a logarithmic scale) from 1/84 to 7/85; the region above the service objective of 50 failures per 1000 hr is marked unsatisfactory and the region below it satisfactory, with the 75 % confidence interval shown.]

Fig. 3.5. Timing new feature introduction in the field


In this figure, the center line again represents the maximum likelihood
estimate of failure intensity, with the two other lines representing the 75 %
confidence bounds. We will focus on the upper confidence bound. Note the
sawtooth pattern. Each release of new features causes a jump in failure in-
tensity that results from the new faults introduced. Then in the periods
between releases, failure intensity declines as the failures experienced lead
to removal of the faults causing them. Observation of this behavior leads to
a simple policy to implement in the field to stabilize field reliability. When
the upper confidence bound of failure intensity exceeds the failure intensity
objective, freeze the system (allow no new feature introduction). When the
upper confidence bound of failure intensity falls well below the objective, you
can consider adding new features. The size of the permissible addition can
be guided by how far below the failure intensity objective you are.
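This freeze/introduce policy can be sketched directly; the margin defining "well below" is a hypothetical choice, not a value from the text:

```python
def feature_introduction_policy(upper_bound, objective, margin=0.5):
    """Field policy sketch: freeze new features while the upper
    confidence bound of failure intensity exceeds the objective; allow
    introduction once the bound falls well below it ('well below' is
    taken here, arbitrarily, as half the objective)."""
    if upper_bound > objective:
        return "freeze"
    if upper_bound < margin * objective:
        return "introduce features"
    return "hold steady"

# Objective of 50 failures per 1000 CPU hr, three successive upper bounds:
actions = [feature_introduction_policy(u, 50) for u in (80, 40, 20)]
```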
Failure intensity in the field can be estimated with the same model and
estimation method, and hence the same program, as used for system and beta
test. In some cases, faults are not removed in the field between releases. In
that situation, the failure intensity is time invariant. The program will simply
yield model parameters that characterize a zero reliability growth case of the
models.

4. Research Questions

The field of software reliability engineering is very dynamic, as will be seen
by the size of the research community and the diversity of investigations
currently in process (see the Annual Proceedings of the International Sym-
posium on Software Reliability Engineering as an example). At the present
time, there are some strong needs and opportunities arising in the practice
that may shape some of the future research.
Test selection: One of the primary needs is to find ways of applying soft-
ware reliability engineering to test selection, so that we can make testing
more efficient. Users of software-based systems indicate that for them the
most important factors affecting their satisfaction (or characterizing their
view of "quality") are reliability, timely delivery, and cost. In order to im-
prove reliability (reduce failure intensity), we must expend time and money in
testing. Hence the author proposes that we define testing efficiency in terms
of the reduction factor in failure intensity per unit time. The time we use
should be execution time because we want to look at how we can improve
testing efficiency as a function of fundamental inherent factors such as test
selection strategy; we don't want to consider the obvious simple tactics of
adding testing resources such as people and computers. Thus, instantaneous
testing efficiency η can be formally expressed as

η = (1/λ) dλ/dτ,
where λ is failure intensity and τ is execution time. One would investigate
different testing strategies and their influence on η. Since η could be a function
of τ, one might evaluate the strategies with respect to η̄, the average of η
with respect to τ.
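A numerical sketch of this definition, applied to the basic execution time model, for which the efficiency works out to the constant −λ0/ν0 (negative because λ decreases during test; its magnitude is the reduction rate):

```python
import math

def efficiency(intensity, tau, d_tau=1e-6):
    """Instantaneous testing efficiency (1/lambda) dlambda/dtau,
    estimated with a central difference."""
    lam = intensity(tau)
    dlam = (intensity(tau + d_tau) - intensity(tau - d_tau)) / (2 * d_tau)
    return dlam / lam

# Basic execution time model: lambda(tau) = lam0 * exp(-(lam0/nu0) * tau).
# Its efficiency is -lam0/nu0 regardless of tau (illustrative parameters).
lam0, nu0 = 10.0, 100.0
model = lambda tau: lam0 * math.exp(-(lam0 / nu0) * tau)
eta = efficiency(model, tau=5.0)
```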
Reliability prediction: A second need is to improve reliability prediction,
the determination of reliability from product and process characteristics such
as program size, developer skill level, requirements volatility, etc. (Musa et
al. 1987). Better reliability prediction would improve the capability to make
tradeoffs in setting system failure intensity objectives and in tuning the prod-
uct and process to meet them. For these purposes, reliability prediction will
generally be used in a relative rather than absolute sense, and high accuracy
will usually not be required. This may simplify the problem somewhat.
Reliability estimation: Third, there is a need for a capability to estimate
reliability prior to program execution, using data that is directly indicative of
future failures. Possible solutions are to use data such as problem discovery
times expressed in elapsed review time for requirements, design, and code
reviews. Note the potential analogies between problem discovery and failure,
and elapsed review time and execution time. It therefore appears that one
might be able to apply software reliability models to accomplish this task.
Object certification: Finally, the growing potential of object-oriented de-
velopment using extensive object libraries creates a strong need for object
certification. Without object certification, it is unlikely that the full poten-
tial of reuse, with all its favorable effects on reliability, development time,
and cost will be realized. Developers will not reuse objects unless they have
confidence in their reliability. The sequential sampling method for certifica-
tion described in Section 3.5 shows great promise. However, it will probably
be necessary to specify the usage an object will be tested for down to the
input state rather than the operation level. An input state is the set of all
input variable values that characterize a run, where an input variable is any
variable or condition external to a program that affects it.

5. Summary

A study by the Strategic Planning Institute (Buzzell and Gale 1987) shows
that customer-perceived quality is the factor with the strongest influence on
long-term profitability of a company. Users view achieving the right balance
among reliability, delivery date, and cost as having the greatest effect on their
perception of quality. Since one of the main purposes of software reliability
engineering is achieving this balance in software-based systems, this discipline
is an extraordinarily important one. Finding solutions to some of the research
needs can stimulate rapid progress. Finally, there is a compelling need to
educate software and reliability engineers in this technology and practice.
332 John D. Musa

Acknowledgement. The author is indebted to James Cusick for his helpful com-
ments.

References

Buzzell, R.D., Gale, B.T.: The PIMS Principles - Linking Strategy to Performance.
The Free Press 1987, p. 109
Musa, J.D.: A Theory of Software Reliability and its Application. IEEE Transac-
tions on Software Engineering 1, 312-327 (1975)
Musa, J.D.: Operational Profiles in Software Reliability Engineering. IEEE Software
10 (2), 14-32 (1993)
Musa, J.D.: The Operational Profile. In this volume (1996), pp. 333-344
Musa, J.D., Iannino, A., Okumoto, K.: Software Reliability: Measurement, Predic-
tion, Application. New York: McGraw-Hill 1987
Musa, J.D., Okumoto, K.: A Logarithmic Poisson Execution Time Model for Soft-
ware Reliability Measurement. Proceedings of the 7th International Conference
on Software Engineering. Orlando 1984, pp. 230-238
The Operational Profile
John D. Musa
AT&T Bell Laboratories, 480 Red Hill Road, Middletown, NJ 07748-3052, USA

Summary. Operational profiles are an important part of the technology and prac-
tice of software reliability engineering. The concept was developed originally (Musa
et al. 1987) to make it possible to specify the nature of the use of a software-
based system so that testing could be made as realistic as possible and so that
reliability measurements would reflect that realism. However, the operational pro-
file rapidly became useful for additional purposes in software reliability engineering
(Musa 1993). In fact, it is also proving useful for purposes outside of software re-
liability engineering as well. This paper gives an overview of operational profile
practice, discussing what the operational profile is, why it is important, and how
it is developed and applied. It also presents some current open research questions;
work in these areas can be expected to affect the practice of the future.

Keywords. Reliability, software, software reliability, software reliability engineering, usage, function

1. Definition

We will first define the term "operation" and then show how this leads to the
concept of the operational profile. An operation is an externally-initiated task
performed by a system "as built." We contrast it with a function, which is
an externally-initiated task to be performed by a system, as viewed by users.
The idea or the need for the task ordinarily first arises in the minds of users,
who transmit it to system engineers as a requirement. It is sometimes first
conceived by developers, however. At this stage it is a function. As the system
is designed by system architects and developers, functions evolve into and are
implemented as operations. Functions often map one-to-one to operations,
but the mapping is also often more complex, driven by performance and other
needs. Examples of operations (and functions) include specific commands,
transactions, and processing of external events.
An operation or function is generally initiated and followed by an external
intervention, which may come from a human or another machine. Operations
(and functions) are not restricted to one machine; they may be executed over
several machines and thus can be used for distributed systems. Further, they
can be executed in segments separated in time. Thus, they are essentially
logical concepts that are not closely tied to hardware.
We will later refer to sequences of operations and functions that may be
initiated to implement a work process; these are called, respectively, opera-
tional scenarios and functional scenarios. Since these sequence patterns may
occur repetitively, and since interactions may occur between the operations,
the scenarios must be considered when testing.

Table 1.1. Sample Operational Profile

Operation               Occurrence Probability

Alarm 1 Processing      0.20
Alarm 2 Processing      0.15
Alarm 3 Processing      0.10

The operational profile is now simply a set of operations and their prob-
abilities of occurrence. For example, suppose we have a system that receives
various alarms and processes them, taking actions that depend on the partic-
ular alarms. Table 1.1 shows a possible operational profile for such a system.
A functional profile is a set of functions and their probabilities of occurrence;
it is thus the exact analog of an operational profile.
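An operational profile of this kind can be represented directly as a probability table. The following sketch (our illustration, not part of the paper) uses only the three rows shown in Table 1.1; since only part of the profile is listed, the weights need not sum to 1, and random.choices() normalizes them.

```python
import random

# Partial operational profile from Table 1.1: operation -> occurrence probability.
operational_profile = {
    "Alarm 1 Processing": 0.20,
    "Alarm 2 Processing": 0.15,
    "Alarm 3 Processing": 0.10,
}

def select_operation(profile):
    """Draw one operation in proportion to its occurrence probability."""
    operations = list(profile)
    weights = [profile[op] for op in operations]
    return random.choices(operations, weights=weights, k=1)[0]

print(select_operation(operational_profile))
```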

2. Benefits

Operational profiles benefit a wide variety of activities associated with development
of software-based systems, including system engineering, system
design, development, testing, and operational use. Operational profiles can:
1. Increase user satisfaction by capturing their needs more precisely,
2. Satisfy important user needs faster with operational development (organizing
development by operations rather than modules and scheduling
product releases so that the most used, most critical operations are
released first),
3. Reduce costs with reduced operation software (ROS is the software analog
of RISC; infrequently used, noncritical operations are either not implemented
or are handled in alternative ways),
4. Speed up development and improve productivity by allocating resources
in relation to use and criticality,
5. Guide distribution of review efforts (requirements, design, code),
6. Reduce system risk with more realistic testing,
7. Make testing faster (faster reliability growth) and more efficient,
8. Help tune the system architecture to use and criticality,
9. Make performance evaluation and management more precise, and
10. Guide development of better manuals and training.

3. Development

The development of operational profiles for a software-based system involves
four basic decisions and then three sequential activities. In order to illustrate
these, we will present a simple but realistic example software-based system.
The system, which we will call "Fone Follower," lets a user forward tele-
phone calls depending on where he/she expects to be. The user connects to a
voice response system and enters the planned telephone numbers as a func-
tion of date and time. Although this can be done at any time, users most
frequently do this between 7 and 9 AM each day. Incoming calls to the user
are forwarded as the program of planned numbers indicates. If there is no
response at a number (which occurs for about 20% of the calls), the user is
paged if he/she has a pager. If there is no pager, or if there is no response to
the page within a specified time (the latter occurs for about half the pages),
then the call is forwarded to voice mail.
The call forwarding functions of forward call (nonpaging) and forward call
(paging) will be implemented as three operations: follow, page (paging users
only), and voice mail. For simplicity, we will assume that Fone Follower
is relatively independent of the telecommunications network and that it is
developed as a single unified system.

3.1 Basic Decisions

The four basic decisions that must be made are:


1. For what systems will you develop operational profiles?
2. What system modes will you define for each system?
3. Will you use explicit or implicit profiles?
4. What granularity and accuracy guidelines will you follow?

3.1.1 What Systems? The operational profile is a very general concept,
just like the concept of "system"; hence it can be applied at different levels.
In addition to developing an operational profile for the system you are
developing as a product, you can develop an operational profile for any of
its subsystems. Or you can also focus on the context in which your system
operates; i.e., you can develop an operational profile for the supersystem or
network of which it is a part.
The key questions to ask in determining the systems for which you will
develop operational profiles are:
1. What are the systems you will test?
2. Within which systems will you allocate resources and set priorities for
different parts of the development work?
For Fone Follower, we will simply choose the entire product as the system
since it is being developed in a unified fashion (there aren't major subsystems

being developed separately by other organizations) and since it is relatively
independent of the telecommunications network (there is no particular need
to test Fone Follower in the context of the network to look for failures resulting
from potential interactions).
3.1.2 What System Modes? A system mode is a complete set of opera-
tions that are executed at the same time and the same sites. The reason for
dividing a system into system modes is that the nature of what is executing
during some periods or even at some sites may be substantially different, so
that you may wish to divide up testing to capture these different periods real-
istically. You will then develop a separate operational profile for each system
mode and test it separately. A common example of this division is a prime
hours system mode in which a system has many users performing the prin-
cipal work of the system and an off hours system mode when administrators
are performing work such as backup and maintenance. A system mode must
include all operations that can execute in the environment you define for it;
otherwise, you may miss testing some of the failure-producing interactions
that can occur.
The general principle in determining what set of system modes you should
select is to select a system mode for each time period or set of sites for which
a very different set of operations, an overload, or a different hardware con-
figuration is expected. The extent of the differences in the sets of operations
should be such that the interactions among operations are likely to be quite
different, making different failures likely.
In addition to time of day (e.g., prime hours vs. off hours), some time-related
variables that typically yield different system modes are system maturity
(startup vs. steady state use) and system capability (completely operational
vs. degraded).
There is a limit to how many system modes you want to create, because
the effort and cost of developing operational profiles and performing system
test increase with the number of system modes.
For Fone Follower, we will select four system modes:

1. Entry hours (7-9 AM each day), normal load
2. Entry hours, overload
3. Nonentry hours, normal load
4. Nonentry hours, overload

3.1.3 Explicit or Implicit Profiles. The operational profile can be developed
in two forms, explicit and implicit.¹ With an explicit profile, each
operation is completely described with the values of all of its attributes. Oc-
currence probabilities are explicitly assigned to each operation. Table 1.1
illustrates an explicit profile.

¹ Everything in Section 3.1.3 applies equally to functional as well as operational
profiles.

With an implicit profile, occurrence probabilities are implicitly assigned
to each operation by specifying occurrence probabilities for the values of each
operational attribute, taken separately. Figure 3.1 illustrates an implicit op-
erational profile for a telecommunications switching system, a PBX. It shows
two of the operational attributes of its call processing operations: dialing
type and call destination. Note that the operational attribute values and
their associated occurrence probabilities yield a subprofile for each opera-
tional attribute. Clearly, the implicit profile can be expressed as a network in
which each node is an attribute, each branch is an attribute value, and each
complete path is an operation.

[Fig. 3.1. Example implicit profile from a telecommunications switching system
(PBX). Two operational-attribute subprofiles are shown: dialing type
(Standard = 0.8, Abbreviated = 0.2) and call destination (Internal = 0.3,
External = 0.1).]

The occurrence probability for the operation may be obtained by multiplying
all the operational attribute occurrence probabilities in the path that
represents it.
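The multiplication rule can be sketched numerically as follows (our illustration, not from the paper), assuming the attribute values are selected independently; the subprofile values are the ones shown in Fig. 3.1.

```python
# Computing an operation's occurrence probability from an implicit profile:
# multiply the attribute-value probabilities along the path that represents
# the operation (attribute independence assumed here).
subprofiles = {
    "dialing type": {"standard": 0.8, "abbreviated": 0.2},
    "call destination": {"internal": 0.3, "external": 0.1},
}

def operation_probability(path):
    """path maps each operational attribute to the value selected for it."""
    p = 1.0
    for attribute, value in path.items():
        p *= subprofiles[attribute][value]
    return p

# Standard dialing to an internal destination: 0.8 * 0.3
print(round(operation_probability({"dialing type": "standard",
                                   "call destination": "internal"}), 3))
```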
The occurrence probabilities of attribute values can be conditional on the
previous attribute values selected. If this conditional property extends only to
the immediately previous attribute value, the operations form Markov chains,
resulting in some interesting implications (Whittaker 1994). Unfortunately in
practice, the conditionality dependence is often more complex.
The choice between explicit and implicit profiles depends on the nature
of the application. The explicit profile is generally preferable if one or more
of the following hold:

1. Operations are described by a very small number of attributes (e.g., the
operational profile of Table 1.1 has only one operational attribute, alarm
type),
2. Operational scenarios are important because operations are highly cor-
related with each other, producing clear sequence patterns, or
3. Criticality differences among operations are significant.
The implicit profile is generally preferable if operations are described by
multiple operational attributes, especially if the attribute values are selected
sequentially and if attribute occurrence probabilities depend on previous at-
tribute value selections. Note how awkward it would be to explicitly char-
acterize even a simple operation for PBX call processing: "Call processing
for call from manager's set, using abbreviated dialing, going out to external
network, with call answered and placed on hold but then taken off hold to
talk."
3.1.4 Setting Granularity and Accuracy Guidelines. Granularity refers
to the number of operations you decide to define. You have some control over
this because you can lump operations together and make them more general
or you can differentiate them and make them more specific. With a larger
number of operations you have more detailed control in allocating resources
and representing actual field conditions in test. However, the effort, cost, and
time to develop the profile are greater.
Accuracy in determining occurrence probabilities has two effects:
1. A percentage error in occurrence probability results in the same percent-
age error in allocating resources, and
2. A percentage error in occurrence probability results in a much lower error
in failure intensity in most cases (Musa 1994).
Since failure intensity is generally robust with respect to operational pro-
file errors, the principal concern in setting an accuracy guideline is trading
off the precision with which you need to allocate resources against the extra
cost of measuring or estimating a more accurate operational profile.

3.2 Activities in Operational Profile Development

The three activities needed to develop operational profiles, once the four basic
decisions are made, are sequential and must be done for each system mode.
They are:
1. Identify user types,
2. Develop the functional profile, and
3. Convert the functional profile to the operational profile.
The first two activities are commonly performed by system engineers. The
third activity is usually done by system designers (architects) and developers,
although system engineers may be involved.

3.2.1 Identify User Types. User types are sets of users who are expected
to employ the system in the same way. In order to identify user types, you
must first identify customer types for the system. A customer type is a set of
customers that are expected to have the same user types. For example, for
Fone Follower, educational organizations and medical organizations might
represent two different customer types. Different universities are, of course,
different customers, but they belong to the same customer type because they
can be expected to have the same user types.
Next, you consider each customer type and list all its user types. You then
create a consolidated list of user types, eliminating duplications. Continuing
with the Fone Follower example, suppose that the educational organizations
customer type only has the user type "users without pagers." Assume that
the medical organizations customer type has two user types, "users without
pagers" and "users with pagers." The consolidated user type list is:

users without pagers
users with pagers

3.2.2 Develop Functional Profile. In order to develop an explicit functional
profile, you consider each user type and list the functions that user
type needs. You then consolidate the functions and determine their occur-
rence probabilities.
Let us illustrate the case of the explicit profile for Fone Follower, develop-
ing the functional profile for the nonentry hours, normal load system mode.
For the user type "users without pagers" the function list is:

Forward call, nonpaging
Update

For the user type "users with pagers" the function list is:

Forward call, paging
Update

Then the consolidated function list is:

Forward call, nonpaging
Forward call, paging
Update

Assume that we have data on the occurrences/hour as follows:



FUNCTION                    OCCURRENCES/HOUR

Forward call, nonpaging     5400
Forward call, paging        3600
Update                      1000

If we divide occurrences/hour for each function by the total occurrences/hour
of 10,000, we obtain the explicit functional profile:

FUNCTION                    OCCUR. PROB.

Forward call, nonpaging     0.54
Forward call, paging        0.36
Update                      0.10
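The division step above can be written out in a few lines (a sketch, using the figures from the text):

```python
# Derive the explicit functional profile by dividing each function's
# occurrences/hour by the total occurrences/hour (10,000 here).
occurrences_per_hour = {
    "Forward call, nonpaging": 5400,
    "Forward call, paging": 3600,
    "Update": 1000,
}
total = sum(occurrences_per_hour.values())
functional_profile = {f: n / total for f, n in occurrences_per_hour.items()}
for function, p in functional_profile.items():
    print(f"{function}: {p:.2f}")
```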

The procedure for implicit functional profiles is analogous, but we deal
with functional attribute values rather than functions.
As an adjunct to developing the functional profile, we create lists of
frequently-occurring functional scenarios and of critical functions. By "crit-
ical" we mean successful execution of the function adds substantial safety
or value, or that any failure results in severe risk to human life, cost, or
capability.
3.2.3 Convert Functional Profile to Operational Profile. We must
now convert the user-oriented functional profile to the implementation-
oriented operational profile, because test will require the operational profile
to drive it. The same form (explicit or implicit) is used for the operational
profile that was used for the functional profile.
We will illustrate the process of mapping and converting occurrence prob-
abilities for an explicit profile by continuing with the example of Fone Follower
for the nonentry hours, normal load system mode. Refer to Figure 3.2. The
first two columns give the explicit functional profile. The network of arrows
shows how the list of three functions at the left maps to four operations at
the right. The numbers on the mapping lines indicate the proportion of functions
that map to each operation. We first convert the function occurrence
probabilities to initial operation occurrence probabilities by multiplying by
the mapping proportions. These are not real probabilities, as they do not add
to 1. We determine the "final" or true operation occurrence probabilities by
dividing each of the initial operation occurrence probabilities by the total
initial operation occurrence probabilities. The process for converting implicit
functional profiles to implicit operational profiles is analogous.
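The two-step conversion (multiply by mapping proportions, then renormalize) can be sketched in Python. The mapping proportions below are inferred from the Fone Follower description (20% of forwarded calls unanswered; half of pages unanswered), so treat the specific numbers as illustrative rather than as the paper's exact figure.

```python
# Convert an explicit functional profile to an explicit operational profile:
# multiply each function's occurrence probability by the proportion of that
# function mapped to each operation, sum per operation ("initial"
# probabilities, which need not add to 1), then divide by the total.
functional_profile = {
    "Forward call, nonpaging": 0.54,
    "Forward call, paging": 0.36,
    "Update": 0.1,
}
mapping = {  # function -> {operation: mapping proportion}
    "Forward call, nonpaging": {"Follow": 1.0, "Voice mail": 0.2},
    "Forward call, paging": {"Follow": 1.0, "Page": 0.2, "Voice mail": 0.1},
    "Update": {"Update": 1.0},
}

initial = {}
for function, p in functional_profile.items():
    for operation, proportion in mapping[function].items():
        initial[operation] = initial.get(operation, 0.0) + p * proportion

total = sum(initial.values())  # exceeds 1; initial values are not probabilities
operational_profile = {op: p / total for op, p in initial.items()}
print(operational_profile)
```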
To convert the list of functional scenarios to the list of operational scenar-
ios, we simply substitute each possible operation resulting from a function
for each function in the functional scenario. For example, if one functional
scenario were

Update; Forward call, nonpaging

we would obtain two corresponding operational scenarios

Update, Follow
Update, Voice mail
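The substitution just described amounts to a cross-product over the possible operations for each function in the scenario; a small sketch (the function-to-operation mapping is taken from the example above):

```python
from itertools import product

# Each function in a functional scenario is replaced by every operation it
# can result in, yielding one operational scenario per combination.
function_to_operations = {
    "Update": ["Update"],
    "Forward call, nonpaging": ["Follow", "Voice mail"],
}

def operational_scenarios(functional_scenario):
    alternatives = [function_to_operations[f] for f in functional_scenario]
    return [list(combo) for combo in product(*alternatives)]

print(operational_scenarios(["Update", "Forward call, nonpaging"]))
# [['Update', 'Follow'], ['Update', 'Voice mail']]
```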

[Fig. 3.2. Conversion of explicit functional profile to explicit operational
profile for Fone Follower. The three functions (occurrence probabilities:
Forward call, nonpaging 0.54; Forward call, paging 0.36; Update 0.1) map to
four operations. Mapping proportions: both forward-call functions map to
Follow with proportion 1; Forward call, nonpaging maps to Voice mail with
proportion 0.2; Forward call, paging maps to Page with proportion 0.2 and to
Voice mail with proportion 0.1; Update maps to Update with proportion 1.
Initial operation occurrence probabilities: Follow 0.9, Page 0.072, Voice
mail 0.144, Update 0.1. Final occurrence probabilities: Follow 0.740, Page
0.059, Voice mail 0.119, Update 0.082.]

To convert the critical functions list to the critical operations list, we do
a similar substitution. For example, if the critical function is

Forward call, paging

then the operations

Follow
Page
Voice Mail

will be critical.
If the critical operations occur rarely, we will need to create an additional
system mode that includes them, and devote enough test time to that system
mode to be able to assure with reasonable confidence that the failure intensity
objective for the critical operations can be met.

4. Application
During the requirements and design phases and even part of the implemen-
tation phase, one employs the functional profile because function to opera-
tion mapping is still evolving and the operational profile isn't yet ready. The
functional profiles of the system modes are averaged, the system modes being
weighted by the proportion of execution time they represent. This average
functional profile and the critical function list are used to allocate system en-
gineering, system design, and implementation resources and priorities. They
are used to manage the potentially schedule-delaying requirements, design,
and code reviews so that they are maximally effective within the deadlines
they must meet. They are used to guide operational development, where de-
velopment is divided and managed by functions and then operations rather
than modules, and releases are scheduled so that the most used and most
critical operations are delivered first. Finally, they support the system engi-
neering of reduced operation software (ROS). As previously noted, ROS is
the software analog of RISC. The functional profile and critical function list
are used to highlight infrequently used, noncritical operations in the context
of what it costs to develop them. In many cases, the goals of these opera-
tions can be attained in other ways, perhaps by combining simpler operations
or by incorporating manual interventions. In many cases, the operations are
sufficiently unimportant that they can be eliminated.
Testing is done on a system mode basis. Recall that we may have a system
mode of critical operations that we provide with extra execution time so that
we can obtain sufficient confidence that it meets its failure intensity objective.
The operational profile for each system mode is used to manage the first stage
of test selection, choice of the operation that will be executed. The probability
that an operation will be selected for test is made to match the probability
that the operation occurs in the field. We use the operational scenario list
to bring out interactions between operations that occur. When an operation
is selected that starts an identified operational scenario, we execute the rest
of the scenario some proportion of the time before returning to operational
profile selection.
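The selection procedure described above can be sketched as weighted random selection plus occasional scenario completion. The scenario list and the 50% completion proportion below are our assumptions for illustration; the profile probabilities are the Fone Follower final values.

```python
import random

# Operational profile for the nonentry hours, normal load system mode.
operational_profile = {"Follow": 0.740, "Page": 0.059,
                       "Voice mail": 0.119, "Update": 0.082}
# Hypothetical: operations that start an identified operational scenario,
# mapped to the rest of that scenario.
scenarios = {"Update": ["Follow"]}
SCENARIO_PROPORTION = 0.5  # assumed fraction of the time the scenario is run

def next_test_run(rng=random):
    """Pick the next operation(s) to execute in test."""
    operations = list(operational_profile)
    weights = [operational_profile[op] for op in operations]
    selected = rng.choices(operations, weights=weights, k=1)[0]
    run = [selected]
    if selected in scenarios and rng.random() < SCENARIO_PROPORTION:
        run.extend(scenarios[selected])  # execute the rest of the scenario
    return run

print(next_test_run())
```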

5. Research Questions
The area of operational profiles is young but very dynamic. Hence there are
many research needs and opportunities that will shape the practice of the
future. Two of the most important areas involve project trials. The concepts
of using operational profiles to system engineer reduced operation software
(ROS) and to guide operational development have been investigated to the
point of indicating feasibility and promise of substantial benefits. However,
they have not been extensively tested on projects. Project trials should de-
velop much useful information about how to best practice these two ideas.

Increasing testing efficiency (reduction in failure intensity per unit execution
time) is an important need in software reliability engineering. It is clear that
operational profiles bear on this in some manner, probably interacting with
the degree of homogeneity of run types within the operations. By homogene-
ity, we refer to whether the run types (which are characterized by their input
states) have the same failure behavior. When a set of run types are homoge-
neous, one test is sufficient to test all of them. Homogeneity appears to be
affected by commonality of execution paths and differential project history,
among other factors, and may be a stochastic quantity. How to estimate ho-
mogeneity, and how to combine this with operational profile information to
develop more efficient testing strategies, would be valuable information.
It was mentioned above that homogeneity might be affected by differen-
tial project history. By differential project history, we refer to the fact that
different parts of the system may have experienced differences in the quality
of requirements, design, and implementation activities. We need to under-
stand the degree of variability in failure probability among run types this
introduces and the effect on homogeneity.
Homogeneity probably bears on another important question: how can we
best partition the set of all possible run types into operations? As we localize
run types into operations with more run types and greater homogeneity
(these goals conflict), we increase test efficiency. What is the optimum parti-
tioning? This is likely to be a study that requires stochastic modeling, because
estimating homogeneity will probably always involve uncertainty and risk.
Both explicit and implicit profiles have been used on a number of projects.
We have been able to distill from experience some guidelines as to which form
works best in which situation, but there may be a need for a more organized
approach to answering this question.

6. Summary

Operational profiles constitute an important part of software reliability engineering
technology and practice. They put the concept of system use on a
scientific, quantitative basis. Experience has shown them to be an important
customer communication tool. The application of operational profiles makes
system engineering, development, and test of software-based systems faster,
less costly, and less risky, leading to increased competitiveness.

Acknowledgement. The author is indebted to James Cusick for his helpful com-
ments.

References

Musa, J.D.: Operational Profiles in Software Reliability Engineering. IEEE Software
10 (2), 14-32 (1993)
Musa, J.D.: Sensitivity of Field Failure Intensity to Operational Profile Errors.
Proceedings of the 5th International Symposium on Software Reliability Engi-
neering. Monterey 1994, pp. 334-337
Musa, J.D., Iannino, A., Okumoto, K.: Software Reliability: Measurement, Predic-
tion, Application. New York: McGraw-Hill 1987
Whittaker, J.A., Thomason, M.G.: A Markov Chain Model for Statistical Software
Testing. IEEE Trans. Software Engineering 20, 812-824 (1994)
Assessing the Reliability of Software:
An Overview
Nozer D. Singpurwalla¹ and Refik Soyer²
¹ Department of Operations Research, The George Washington University, Washington, DC 20052, USA
² Department of Management Science, The George Washington University, Washington, DC 20052, USA

Summary. In this overview we describe briefly several well-known probability
models for assessing the reliability of software. We motivate each model, discuss its
pros and cons, and present statistical methods for estimating the model parameters.
The paper concludes with a synopsis of some recent work which attempts to unify
the models by looking at the software failure phenomenon as a counting process
model.

Keywords. Autoregressive processes, de-eutrophication, empirical Bayes, expert
opinion, failure rate, hierarchical models, point processes, record value statistics,
state-space models

1. Introduction

1.1 Background: The Failure of Software

Over the last two decades, a considerable amount of effort has been devoted
to developing probability models for describing the failure of software. Such
models help assess software reliability, which is a measure of the quality of
software. Like hardware reliability, software reliability is defined as the prob-
ability of failure-free operation of a computer code for a specified period of
time, called the mission time, in a specified environment, called the oper-
ational profile; see, for example, Musa and Okumoto (1984). However, the
causes of software failure (a notion that will be made more precise later)
are different from those of hardware failure, and whereas hardware reliability
tends to decrease with mission time, software can, in principle, be 100%
reliable for any mission time.
Software fails because there are errors, called "flaws" or "bugs" in the logic
of a software code. These flaws are caused by human error. Hardware fails
because of material defects and/or wear, both of which initiate and propagate
microscopic cracks that lead to failure. With hardware failures, the random
element is, most often, the time taken for the dominant microscopic crack to
propagate beyond a threshold. Thus meaningful probability models for the
time to hardware failure should take cognizance of the rates at which the
cracks grow in different media and under different loadings. With the failure
of software, the situation is quite different. We first need to be more precise

as to what we mean by software failure, and then we also need to identify
the random elements in the software failure process. To do the above, the
following perspective motivated by some initial ideas of Jelinski and Moranda
(1972) is helpful.

1.2 A Conceptualization of the Software Failure Process

A program may be viewed as a "black box," or more accurately a "logic
engine" that consists of various statements and instructions that bear a logical
relationship with each other. The engine receives, over time, different types of
inputs, some of which may not be compatible with the design of the engine.
If each compatible input type traverses its logically intended path within the
engine, then the outputs of the engine are the desired ones, and the program
is said to be perfect; that is 100% reliable. If there are any errors in the logic
engine, clerical or conceptual, then it is possible that a certain (compatible)
input will not traverse its designated path, and in so doing will produce
an output that is not the desired output. When the above happens, the
software is declared as having experienced a failure. It is of course possible
that the presence of a bug may prevent the software from producing any
output whatsoever. That is, the flawed logic could lead a compatible input
through an indefinite number of loops. Thus implicit in the notion of software
failure should be the notion of a time interval within which an output should
be produced. That is, associated with each input type, there should be an
allowable service time.
With the set-up conceptualized above, it is important to bear in mind
that not every flaw in the program will lead to a software failure. This is
because the flaw may reside in a logic path which, in certain applications,
may never be traversed. Thus it is important to distinguish between software
bugs and software failures. Every software failure is caused by a bug, but
not every bug leads to a software failure. Software engineers have heeded
this distinction and many have proposed models for software failures instead
of software errors.

1.3 Random Quantities in Software Failures

We have said before, that with hardware failures the random element is the
time it takes for a crack to propagate beyond a threshold. With software
failures it is the uncertainty about the presence, the location and the en-
counter with a bug that induces randomness. There are two types of random
variables that can be conceived, the first being binary and the second being
continuous. We shall first discuss the nature of the binary random variables
and propose some plausible probability models for it.
Suppose that Xᵢ, i = 1, 2, …, k, is a binary random variable which takes
the value 1 if the i-th type of input results in a desired (correct) output within
its allowable service time; otherwise Xᵢ takes the value zero. The number of
distinct input types is assumed to be k. Let pᵢ denote the probability that
Xᵢ = 1. If pᵢ = p, i = 1, …, k, and if the Xᵢ's are assumed to be independent,
were p to be known, then a naive measure of the reliability of the software
would be p. If n ≤ k distinct input types were to be tested and X₁ + ⋯ + Xₙ
observed, then an estimator of p would be (X₁ + ⋯ + Xₙ)/n. If the number of distinct
input types can be conceptually extendible to infinity, then the sequence of
Xᵢ's, i = 1, 2, …, could be judged exchangeable and by virtue of de Finetti's
representation theorem p would have a prior distribution π(p) which would
then be a naive measure of the reliability of the software. Correspondingly,
if (X₁ + ⋯ + Xₙ)/n were available, then the posterior distribution of p would be a
naive measure of the reliability of the software. We say that p (or its prior
and posterior distributions) are naive measures of the reliability because, in
assuming the conditional independence of the Xᵢ's and the fact that pᵢ = p,
i = 1, …, k, we have de facto ignored the possibility that some input types
may be encountered more often than others, and that some input types
may not be encountered at all. A more realistic approach would be to assume
the pᵢ's are generated by a common distribution which then describes the
reliability of the software. Assuming that the pᵢ's are generated by a common
distribution entails modeling the joint distribution of the pᵢ's by a two-stage
hierarchical model, as is done by Chen and Singpurwalla (1996). The idea of
a hierarchical two-stage model for Bernoulli data on software failures remains
to be explored.
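As a concrete numerical sketch of the naive Bayesian measure just described: if the prior distribution for p is taken to be a Beta(a, b) distribution (our choice for concreteness; the text requires only some prior), then observing s correct outputs among n tested input types gives a Beta(a + s, b + n − s) posterior, whose mean summarizes the software's reliability.

```python
# Beta-Bernoulli update for the naive reliability measure p: with a
# Beta(a, b) prior and s of n distinct input types processed correctly,
# the posterior is Beta(a + s, b + n - s). The Beta prior is an
# assumption made here for illustration.
def posterior_mean(a, b, n, s):
    """Posterior mean of p under a Beta(a, b) prior."""
    return (a + s) / (a + b + n)

# Uniform prior Beta(1, 1); 95 of 100 distinct input types succeed:
print(posterior_mean(1, 1, 100, 95))  # (1 + 95) / (2 + 100) ~ 0.941
```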
The second type of random variable used for modeling software reliability
pertains to the times between software failures. It is motivated by the notion
that the arrival times, to the software, of the different input types are random.
As before, those inputs which traverse through their designated paths in the
logic engine will produce desired outputs. Those which do not, because of
bugs in the engine, will produce erroneous outputs. For assessing software
reliability, one observes T₁, T₂, …, Tᵢ, …, where Tᵢ is the time between the
(i − 1)st and the i-th software failure. With this conceptualization, even though
the failure of software is not generated stochastically, the detection of errors is
stochastic, and the end result is that there is an underlying random process
that governs the failure characteristics of software.
Most of the well known models for assessing software reliability are cen-
tered around the interfailure times T₁, T₂, …, or the point processes that they
generate; see Singpurwalla and Wilson (1994). Sections 2 and 3 of this pa-
per provide an overview. Whereas, the monitoring of time is conventional for
assessing reliability, we see several issues that arise when this convention is
applied to software reliability. For one, monitoring the times between failures
ignores the amount of time needed to process an input. Thus an input that is
executed successfully but which takes a long time to process will contribute
more to the reliability than one which takes a small time to process. Second,
also ignored is the fact that between two successive failure times there could
348 Nozer D. Singpurwalla and Refik Soyer

be several successful iterations of inputs that are of the same type. Thus,
in principle there could be an interfailure time of infinite length. Of course,
one may argue that monitoring the interfailure times takes into account the
frequency with which the different types of inputs occur, and in so doing the
assessed reliability tends to be more realistic than one which assumes that
all the input types occur with equal frequency. In view of the above
considerations, it appears that a meaningful way to model the software failure history
is by a marked point process (cf. Arjas and Haara 1984), wherein associated
with each inter-arrival time, say Z_i, i = 1, 2, ..., there is an indicator D_i,
with D_i = 1 if the i-th input is successfully processed and D_i = 0 otherwise.
Progress in this direction has been initiated by Eric Slud of the University of
Maryland at College Park (personal communication).
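As an illustration of the marked point process formulation, a failure history can be simulated as a sequence of pairs (Z_i, D_i). This is a sketch only; the exponential inter-arrival times and the constant success probability are our illustrative assumptions, not part of the formulation above:

```python
import random

def simulate_marked_process(n, rate, p_success, seed=0):
    """Sketch of the marked point process view of software failure data:
    each input contributes an inter-arrival time Z_i (here exponential
    with the given rate, an illustrative choice) and an indicator D_i,
    with D_i = 1 if the i-th input is processed successfully."""
    rng = random.Random(seed)
    return [(rng.expovariate(rate), 1 if rng.random() < p_success else 0)
            for _ in range(n)]

history = simulate_marked_process(n=1000, rate=2.0, p_success=0.95)
failures = sum(1 for _, d in history if d == 0)
```

Note that under this view an "interfailure time" is the random sum of the Z_i's between two inputs with D_i = 0, which is how an infinite interfailure time can arise in principle.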
The point process approach to software reliability modeling has also been
considered by Miller (1986), Fakhre-Zakeri and Slud (1995), Kuo and Yang
(1995a) and by Chen and Singpurwalla (1995). These authors have been able
to unify most of the existing models in software reliability by adopting a
point process perspective. Some of this work is reviewed in Section 4 of this
paper.

2. Model Classification
Many of the proposed models for software reliability that are based on
observing times between software failures can be classified into two categories;
these are:

Type I. Those that model the times between successive software failures, or
Type II. Those that count the number of software failures up to a given time.
In all of the proposed models, time is typically taken to be CPU time. As
before, let T_i denote the time between the (i-1)st and the i-th software
failure. In the first category, modeling the T_i's is often accomplished via a
specification of the failure rate of the T_i's. When this is the case, the model is
said to be of Type I-1. The failure rate, r_{T_i}(t), is specified for i = 1, 2, 3, ...,
and a probability model results. These failure rates may be thought of as
the rate at which errors are detected in the software. A distinctive feature of
software is that the successive failure rates may decrease over time, because
bugs are discovered and corrected. Of course, an attempt to debug software
may introduce more bugs, tending to increase the failure rate. Thus, the
decreasing failure rates assumption is somewhat idealized. However, most of
the models that are reviewed here have a decreasing sequence of failure rates
for the successive times between failure.
An alternative way to model time between failure is to define a stochastic
relationship between successive failure times. Models that are specified this
Assessing the Reliability of Software: An Overview 349

way are said to be of Type I-2. These models have the advantage over
Type I-1 models in that they directly model the times between failure, which are
observable quantities, and not the more abstract failure rates, which are
unobservable. For example, as a simple case, one could declare that T_{i+1} =
ρT_i + ε_i, where ρ ≥ 0 is a constant and ε_i is a disturbance term (typically some
random variable with mean 0). Then ρ < 1 would indicate decreasing times
between failure (software reliability expected to become worse), ρ = 1 would
indicate no change in software reliability, whilst ρ > 1 indicates increasing
times between failure (software reliability expected to improve). The simple
relationship of this example is known as an auto-regressive process of order 1;
in general, one could say that T_{i+1} = f(T_1, T_2, ..., T_i) + ε_i for some function
f.
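A minimal simulation of the simple autoregressive specification above; the normally distributed disturbances and the flooring at zero (so that times remain nonnegative) are our illustrative assumptions:

```python
import random

def simulate_ar1_times(t0, rho, sigma, n, seed=0):
    """Simulate the Type I-2 relationship T_{i+1} = rho*T_i + eps_i,
    with eps_i ~ N(0, sigma^2). The normal disturbance and the floor at
    zero are illustrative choices; times must stay nonnegative."""
    rng = random.Random(seed)
    times = [t0]
    for _ in range(n - 1):
        times.append(max(0.0, rho * times[-1] + rng.gauss(0.0, sigma)))
    return times

# With sigma = 0 and rho = 1.2 the interfailure times grow deterministically,
# mimicking reliability growth:
trend = simulate_ar1_times(t0=1.0, rho=1.2, sigma=0.0, n=5)
```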
The category labeled Type II, modeling the number of failures, uses a
point process to count failures. Let M(t) be the number of failures of the
software that are observed during the time interval [0, t). Often M(t) is modeled by a
Poisson process with mean value function μ(t), where μ(t) is non-decreasing
and, for the purposes of this paper, differentiable. The mean number of
failures at time t is given by μ(t). The different models of this type specify a
different function μ(t). The Poisson process is chosen because in many ways
it is the simplest point process to work with. The point process approach has
become increasingly popular in recent years. There is no reason why point
processes other than the Poisson could not be used.

3. Review of Type I and Type II Models

A common notation will be assumed throughout this section and is given
below:

i) T_i = the time between the (i-1)st and i-th failure;
ii) r_{T_i}(t) = the failure rate function of T_i;
iii) S_i = Σ_{j=1}^i T_j, the time to the i-th failure of the software;
iv) M(t) = the number of failures of the software in the time
interval [0, t);
v) λ(t) = the intensity function of M(t);
vi) μ(t) = the expected value of M(t) = ∫_0^t λ(s) ds, if M(t) is a
Poisson process.

3.1 Modeling Times between Failure: Type I Models

3.1.1 The Model of Jelinski and Moranda (1972). According to Musa
et al. (1987), the first software reliability model was that of Hudson (1967).
However, the model of Jelinski and Moranda (1972), henceforth known as the
JM model, was the first software reliability model to be widely known and
used, and it has formed the basis for many models developed after. It is a Type
I-1 model; it models times between failure by considering their failure rates.
Jelinski and Moranda reasoned as follows. Suppose that the total number of
bugs in the program is N (which can be related to the size of the code), and
suppose that each time the software fails, one bug is corrected. The failure
rate of T_i is then assumed constant, proportional to N - i + 1, which is
the number of bugs remaining in the program. In other words,

r_{T_i}(t | N, λ) = λ(N - i + 1), i = 1, 2, 3, ..., t ≥ 0, (3.1)

for some constant λ. This means that if N and λ are known, then T_i is an
exponential random variable with mean {λ(N - i + 1)}^{-1}.
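A two-line simulation makes the structure of (3.1) concrete (a Python sketch; the parameter values are illustrative):

```python
import random

def simulate_jm(N, lam, seed=0):
    """Simulate the N successive interfailure times of the JM model (3.1):
    T_i is exponential with rate lam*(N - i + 1), the rate stepping down
    by lam after each perfectly repaired failure."""
    rng = random.Random(seed)
    return [rng.expovariate(lam * (N - i + 1)) for i in range(1, N + 1)]

times = simulate_jm(N=50, lam=0.02)
```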
Langberg and Singpurwalla (1985) give an alternative, shock model,
interpretation of (3.1). Let N* be the number of distinct types of input, assumed
large or conceptually infinite, that the software can accept. Suppose that
these inputs arrive to be processed by the software as a homogeneous Poisson
process with rate Λ. They show that, given N*, N and Λ, the failure rate
of T_i is

r_{T_i}(t | N, N*, Λ) = (Λ/N*)(N - i + 1), i = 1, 2, 3, ..., t ≥ 0, (3.2)

which is precisely the JM model with λ = Λ/N*. The model makes several
assumptions which can be criticized. It assumes that each error is equal in
the sense that it contributes the same amount λ to the failure rate. In reality,
different bugs will differ in importance and so have a different effect on the
failure rate. It also assumes that at each failure there is a perfect repair and
no new errors are created; thus, the successive failure rates are decreasing. A
model like this is sometimes referred to as a de-eutrophication model, because
the process of removing bugs from software is analogous to the removal of
pollutants from rivers and lakes. Jelinski and Moranda derive equations for the
maximum likelihood estimators of N and λ, denoted N̂ and λ̂. Let t_1, ..., t_n
be the observed successive times between failure of a piece of software. Then
N̂ is the solution for N to the equation

Σ_{i=1}^n 1/(N - i + 1) = n / (N - (1/s_n) Σ_{i=1}^n (i - 1) t_i), where s_n = Σ_{j=1}^n t_j, (3.3)

with the obvious constraint that N̂ ≥ n. Having obtained N̂, λ̂ is given by

λ̂ = n / (N̂ s_n - Σ_{i=1}^n (i - 1) t_i). (3.4)
Forman and Singpurwalla (1977) show that N̂ can be a misleading
estimate in certain situations, particularly when n is small compared to
N. Joe and Reid (1985) derive the distribution of N̂ and show that it can be
infinite with positive probability, and is also median negatively biased, that
is, P(N̂ < N) > P(N̂ > N). They propose another estimator for N, called
Ñ, which is finite with probability 1 and has less median bias.
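Since the root of (3.3) can escape to infinity, a simple way to compute the JM estimates in practice is to maximize the profile log-likelihood over an integer grid for N (our illustration, not the authors' procedure; the interfailure times in the call are illustrative data exhibiting reliability growth):

```python
import math

def jm_profile_mle(t, n_max=10000):
    """Maximum likelihood for the JM model via the profile likelihood.
    For a candidate N, lambda_hat(N) = n/(N*s_n - c) with c = sum (i-1)*t_i,
    and the profile log-likelihood n*log(lambda_hat) + sum log(N-i+1) - n
    is maximized over N = n, ..., n_max. A bounded grid search is used
    rather than solving (3.3) directly, because the root of (3.3) can be
    infinite (Joe and Reid 1985)."""
    n = len(t)
    s_n = sum(t)
    c = sum((i - 1) * ti for i, ti in enumerate(t, start=1))
    best_ll, best_N, best_lam = -float("inf"), None, None
    for N in range(n, n_max + 1):
        lam = n / (N * s_n - c)
        ll = n * math.log(lam) + sum(math.log(N - i + 1) for i in range(1, n + 1)) - n
        if ll > best_ll:
            best_ll, best_N, best_lam = ll, N, lam
    return best_N, best_lam

# Interfailure times that lengthen steadily (illustrative data):
N_hat, lam_hat = jm_profile_mle([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
```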

3.1.2 Bayesian Reliability Growth Model (Littlewood and Verrall
1973). Littlewood and Verrall also looked at the times between failure of the
software. Unlike Jelinski and Moranda, they did not develop the model by
characterizing the failure rate; rather, they stated that due to the difficulty
in defining what a software error is, they would look at the time to next
failure directly. They also accepted the fact that whilst the repair of an error
is intended to improve the reliability of a program, it may have the opposite
effect. Specifically, they declared T_i to be exponential with failure rate λ_i;
that is,

f_{T_i}(t | λ_i) = λ_i e^{-λ_i t}, t ≥ 0, (3.5)

and instead of λ_i decreasing with certainty, as is assumed in the JM
model, they merely required that the sequence of λ_i's be stochastically
decreasing, i.e. P(λ_{i+1} < λ) ≥ P(λ_i < λ), for i = 1, 2, ... and λ ≥ 0.
If one assumes a gamma distribution for λ_i with shape parameter α and
scale Ψ(i), where Ψ is a monotonically increasing function of i, then

π(λ_i | α, Ψ(i)) = (Ψ(i)^α / Γ(α)) λ_i^{α-1} e^{-Ψ(i)λ_i}, λ_i ≥ 0, (3.6)

and the required ordering on the distribution of the λ_i's is achieved. The
function Ψ(i) is supposed to describe the quality of the programmer and
the programming task. The authors give equations for the distribution of T_i
from the instant of the (i-1)st repair and from an arbitrary time-point,
and give an estimate of the instantaneous failure rate. They also investigate
the possibility of an unknown Ψ(i), and consider goodness of fit tests for
deciding on a suitable family of functional forms for Ψ. It can be shown that
the reliability function for T_i is given by

R_{T_i}(t | α, Ψ(i)) = [Ψ(i) / (Ψ(i) + t)]^α, (3.7)

which is a Pareto distribution.
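The two-stage structure of the model can be checked by simulation: drawing λ_i from its gamma distribution and then T_i from the exponential reproduces the Pareto survival function (3.7). The linear form Ψ(i) = β0 + β1·i used below is one possible increasing Ψ (it anticipates the case studied by Mazzuchi and Soyer); all numerical values are illustrative:

```python
import random

def lv_reliability(t, i, alpha, beta0, beta1):
    """Marginal reliability of T_i under Littlewood-Verrall with
    Psi(i) = beta0 + beta1*i: R(t) = (Psi(i)/(Psi(i)+t))**alpha, the
    Pareto survival function (3.7)."""
    psi = beta0 + beta1 * i
    return (psi / (psi + t)) ** alpha

def lv_sample(i, alpha, beta0, beta1, rng):
    """Sample T_i in two stages: lambda_i ~ Gamma(shape=alpha, rate=Psi(i)),
    then T_i | lambda_i ~ Exponential(lambda_i)."""
    psi = beta0 + beta1 * i
    lam = rng.gammavariate(alpha, 1.0 / psi)  # gammavariate's 2nd arg is scale
    return rng.expovariate(lam)

rng = random.Random(0)
draws = [lv_sample(i=1, alpha=2.0, beta0=1.0, beta1=1.0, rng=rng)
         for _ in range(20000)]
frac = sum(1 for d in draws if d > 2.0) / len(draws)  # Monte Carlo R(2)
```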


Mazzuchi and Soyer (1988) investigated in some detail the case Ψ(i) =
β_0 + β_1 i. One can show that this makes the expected failure rate of each
T_i constant in t, and that each time a bug is discovered and fixed there is a
downward jump in the successive failure rates. In fact,

E(λ_i | α, β_0, β_1) = α / (β_0 + β_1 i). (3.8)

Because one is able to specify a failure rate for this model, it is considered
to be of Type I-1. This model has received quite a lot of attention and has been
the subject of various modifications; for example, see the model of Littlewood
(1980).
An alternate structure to the Littlewood and Verrall model was considered
in Soyer (1992), where the author considered E(λ_i | α, β) = α i^β, with values
of β < 0 (β > 0) implying that the λ_i's are decreasing (increasing). It was recognized

that the proposed model fit into the framework of general linear models
and linear Bayesian methods were used for inference. A generalization of the
model was presented by assuming that 0: and {3 be only locally constant, that
is, changing with i.
3.1.3 Imperfect Debugging Model (Goel and Okumoto 1978). This
is an attempt to improve upon the JM model by altering its assumption that
a perfect fix of a bug always occurs. Goel and Okumoto's Imperfect Debugging
Model is like the Jelinski and Moranda model, but assumes that there is a
probability p, 0 ≤ p ≤ 1, of fixing a bug when it is encountered. This means
that after i faults have been found, we expect i × p faults, instead of i faults,
to have been corrected. The failure rate of T_i is

r_{T_i}(t | N, λ, p) = λ(N - p(i - 1)), t ≥ 0. (3.9)

With p = 1 we get the Jelinski and Moranda model.
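The modification in (3.9) changes only the sequence of rates, so the JM simulation carries over with one line altered (a sketch; parameter values are illustrative):

```python
import random

def simulate_imperfect_debugging(N, lam, p, n_failures, seed=0):
    """Simulate interfailure times under the imperfect debugging model
    (3.9): the failure rate of T_i is lam*(N - p*(i-1)), reflecting that
    each encountered bug is fixed only with probability p. Simulation
    stops early if the rate would become non-positive."""
    rng = random.Random(seed)
    times = []
    for i in range(1, n_failures + 1):
        rate = lam * (N - p * (i - 1))
        if rate <= 0:
            break
        times.append(rng.expovariate(rate))
    return times

times = simulate_imperfect_debugging(N=10, lam=0.5, p=0.5, n_failures=15)
```

With p = 1 the rates reduce to the JM sequence λ(N - i + 1), as noted above.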
3.1.4 A Model by Schick and Wolverton (1978). This also makes use
of the Type I-1 strategy, but this time the failure rate is assumed proportional
to the number of bugs remaining in the system and to the time elapsed since
the last failure. Thus,

r_{T_i}(t | N, λ) = λ(N - i + 1)t, t ≥ 0. (3.10)

This model differs from the previous three models in that the failure
rate does not decrease monotonically. Immediately after the i-th failure, the
failure rate drops to 0, and then increases linearly with slope λ(N - i) until the
(i+1)th failure. The resulting distribution for T_i is the Rayleigh distribution.
3.1.5 Bayesian Differential Debugging Model (Littlewood 1980).
This model can be considered as an elaboration of the model proposed by
Littlewood and Verrall. Recall that it was assumed that λ_i, the failure rate
of the i-th time between failures, was distributed as a gamma random
variable. In this new model, Littlewood supposed that there were N bugs in the
system (a return to the bug counting phenomenon), and then proposed that
λ_i be specified as a function of the remaining bugs. In particular, he stated
λ_i = φ_1 + φ_2 + ... + φ_{N-i+1}, where the φ_j's were independent and identically
distributed gamma random variables with shape α and scale β. This
implied that λ_i would have a gamma distribution with shape α(N - i + 1)
and scale β. Thus,

E(λ_i | α, β) = α(N - i + 1)/β, (3.11)

which is a linearly decreasing function of i. In other respects its assumptions
are identical to the original Littlewood and Verrall model.

3.1.6 Random Coefficient Autoregressive Process Model (Singpurwalla
and Soyer 1985). This model uses a Type I-2 strategy, that is, one
that does not use the failure rate of times between failure. Instead it assumes
that there is some pattern between successive failure times and that this
pattern can be described by a functional relationship between them. The authors
declare this relationship to be of the form

T_i = T_{i-1}^{θ_i}, i = 1, 2, 3, ..., (3.12)

where T_0 is the time to the first failure and θ_i is some unknown coefficient.
If all the θ_i's are bigger than 1 then we expect successive life lengths to
increase, and if all the θ_i's are smaller than 1 we expect successive life lengths
to decrease.
The authors emphasize that whilst a modification is supposed to improve
a program, it can bring about a decrease in reliability because of the
introduction of more bugs. Thus the need for values of θ_i to be both greater and
smaller than 1. The final goal of their model is to predict the behavior of
the software after it has been through a test-fail-modify cycle and to decide
whether the last modification of the program (following a failure) was
beneficial. To account for uncertainty in the above relationship and any slight
deviation from it, an error term δ_i is introduced, so that

T_i = δ_i T_{i-1}^{θ_i}. (3.13)
The authors then make the following assumptions, which greatly
facilitate the analysis of this model. They assume the T_i's to be lognormally
distributed, that is to say that the log T_i's have a normal distribution, and that
they are all scaled so that T_i ≥ 1. The δ_i's are also assumed to be lognormal,
with median 1 and variance σ_δ² (the conventional notation is Λ(1, σ_δ²)). Then,
by taking logs of the relationship above, they obtain

log T_i = θ_i log T_{i-1} + log δ_i
        = θ_i log T_{i-1} + ε_i. (3.14)

Since the T_i's and the δ_i's are lognormal, the log T_i's and the ε_i's (=
log δ_i's) will be normally distributed; in particular, ε_i has mean 0 and
some variance σ_ε² (the conventional notation is N(0, σ_ε²)). Thus the log life
lengths form an autoregressive process of order 1 with random coefficients θ_i.
There is an extensive literature on such processes which can now be used on
this model.
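A forward simulation of (3.14) illustrates the mechanics (our sketch; the choice of starting value and of normal parameters for the θ_i's and ε_i's is illustrative):

```python
import math
import random

def simulate_rca(n, mean_theta, sd_theta, sd_eps, t0=math.e, seed=0):
    """Simulate the random coefficient autoregressive model (3.14):
    log T_i = theta_i * log T_{i-1} + eps_i, with theta_i ~ N(mean_theta,
    sd_theta^2) and eps_i ~ N(0, sd_eps^2). The log-lifetimes are floored
    at 0, reflecting the scaling T_i >= 1 assumed by the authors."""
    rng = random.Random(seed)
    log_t = [math.log(t0)]
    for _ in range(n - 1):
        theta = rng.gauss(mean_theta, sd_theta)
        log_t.append(max(0.0, theta * log_t[-1] + rng.gauss(0.0, sd_eps)))
    return [math.exp(x) for x in log_t]

lifetimes = simulate_rca(n=25, mean_theta=1.05, sd_theta=0.05, sd_eps=0.1)
```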
All that remains to do is to specify θ_i, and the authors consider several
alternative models. The simplest is to allow the set of θ_i's to be
exchangeable (that is, their joint distribution is invariant under permutations of the
indices). The θ_i's are assumed normally distributed with mean λ and known
variance σ_1², with the mean λ itself having a normal distribution with known
mean μ and known variance σ_2²; this ensures that the θ_i's are exchangeable but
not independent. In this case one can employ standard hierarchical Bayesian

inference techniques to predict future reliability in the light of previous
failure data. Another model for θ_i would be to let it follow an autoregressive
process:

θ_{i+1} = a θ_i + w_i, (3.15)

where w_i is N(0, W_i) with W_i known. When a is known, the expressions
for log T_i and θ_i together form a Kalman filter model, on which there is
also an extensive literature. When a is not known, an adaptive Kalman filter
model results, for which there are no closed form results; instead, the authors
propose an approximation.
In Singpurwalla and Soyer (1992), the authors discuss inference for
generalizations of these models when the variance σ_ε² is unknown, and conduct
a comparison between two of the models for the θ_i's, the exchangeable
model and the adaptive Kalman filter model, using the "System 40" data
of Musa (1979). The former model is found to be more robust to the initial
choice of its parameters and is able to track the data better. In the case of
the adaptive Kalman filter model, the paper also discusses Bayesian inference
for the parameter a.
3.1.7 Bayes Empirical Bayes or Hierarchical Model (Mazzuchi and
Soyer 1988). In 1988 Mazzuchi and Soyer proposed a Bayes empirical
Bayes, or hierarchical, extension to the Littlewood and Verrall model. As
with the original model, they assumed T_i to be exponentially distributed
with scale λ_i. Then they proposed two ideas for describing λ_i, here called
model A and model B.
Model A. Still assume that λ_i is described by a gamma distribution, but with
parameters α and β. Now assume that α and β are independent and that
they themselves are described by probability distributions; α by a uniform
and β by another gamma. In other words,

π_{λ_i}(λ | α, β) = (β^α / Γ(α)) λ^{α-1} e^{-βλ}, λ ≥ 0,
π(α | ν) = 1/ν, 0 ≤ α ≤ ν, (3.16)
π(β | a, b) = (b^a / Γ(a)) β^{a-1} e^{-bβ}, β ≥ 0, a > 0, b > 0,

where ν, a and b are known.
Model B. Assume that λ_i is described exactly as in Littlewood and Verrall,
i.e.

π_{λ_i}(λ | α, Ψ(i)) = (Ψ(i)^α / Γ(α)) λ^{α-1} e^{-Ψ(i)λ}, λ ≥ 0, (3.17)

and that Ψ(i) = β_0 + β_1 i, except now place probability distributions on α,
β_0 and β_1 as follows:

π(α | W) = 1/W, 0 ≤ α ≤ W,
π(β_1 | c, d) = (d^c / Γ(c)) β_1^{c-1} e^{-dβ_1}, β_1 ≥ 0, c > 0, d > 0. (3.18)

So α is described by a uniform distribution, β_0 by a shifted gamma, and β_1 by
another gamma, and there is dependence between β_0 and β_1. By assuming
β_1 to be degenerate at 0, model A is obtained from model B.
Inference and estimation with this model are done via a Bayesian approach.
In the light of data, the probability distributions for λ_i, α, and the other
model parameters are updated by using Bayes' law. Usually the mean of
this updated, or posterior, distribution is taken as a point estimate of the
parameter. The authors were able to find an approximation to the expectation
of T_{n+1} given that T_1 = t_1, T_2 = t_2, ..., T_n = t_n, and so use their model to
predict the future reliability of the software in the light of the previous failure
times. They applied their model to the data first used by Jelinski and Moranda
(1972) and obtained predictions of the mean time to next failure and the
failure rate, whereupon they concluded that the data showed little evidence
of the assumed software reliability growth.
Extensions of the hierarchical model were considered by Kuo and Yang
(1995b), who assumed a k-th order polynomial for Ψ(i) and used Gibbs
sampling for Bayesian computations.
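A sketch of how such a hierarchy generates data is to simulate from the prior predictive of Model A (our illustration; the Uniform(0, ν) support for α is our reading of (3.16), and the numerical floor on λ merely guards against underflow):

```python
import random

def sample_prior_predictive(nu, a, b, rng):
    """One draw of a failure time from the prior predictive of Model A:
    alpha ~ Uniform(0, nu), beta ~ Gamma(shape a, rate b),
    lambda ~ Gamma(shape alpha, rate beta), then T | lambda is
    exponential with rate lambda."""
    alpha = rng.uniform(1e-6, nu)
    beta = rng.gammavariate(a, 1.0 / b)      # gammavariate's 2nd arg is scale
    lam = max(rng.gammavariate(alpha, 1.0 / beta), 1e-300)  # underflow guard
    return rng.expovariate(lam)

rng = random.Random(0)
draws = [sample_prior_predictive(nu=5.0, a=2.0, b=1.0, rng=rng)
         for _ in range(200)]
```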
3.1.8 Non-Gaussian Kalman Filter Model (Chen and Singpurwalla
1994). The Kalman filter model of Singpurwalla and Soyer (1985) has one
disadvantage in that software failure data tend to be skewed. Chen and
Singpurwalla (1994) introduce a non-Gaussian Kalman filter model, taken
from Bather (1965), in an attempt to overcome this problem. The failure
times are now described by a gamma distribution with scale parameter θ_n,
with θ_n evolving according to a beta distribution. More precisely, for known
parameters C, σ_n, W_n and V_n such that σ_{n-1} + W_n = σ_n + V_n, the model
involves the following distributions: a gamma observation equation, a one
step ahead forecast that is of the Pearson Type VI form (with p = W_n and
q = σ_{n-1}), and a posterior for θ_n, where U_n = C U_{n-1} + t_n.



The authors set W_n = V_n = σ_n = 2, which leads to the following estimate
of the n-th failure time, given the parameter C and all the previous failure
times:

T̂_n = E(T_n | t_1, ..., t_{n-1}, C) = 2C Σ_{i=0}^{n-2} C^i t_{n-i-1}. (3.19)

This suggests that the value of the parameter C is critical to assessing whether
the times between failure are increasing or decreasing. If C is close to 1, then this
implies a substantial growth in reliability, whereas C close to zero implies
a drastic reduction in reliability. To this end, a uniform prior distribution
on (0,1) is assigned to C. A plot of the posterior distribution of C, given
failure data, should then indicate whether there is growth or decay in the
software reliability. This posterior can then be used with the above equation
to produce a prediction for T_n. Unfortunately, by placing a prior on C the
closed form nature of the equations is lost, but the authors present a quick
and accurate simulation scheme to overcome this.
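The forecast (3.19) is just a geometrically discounted sum of the observed interfailure times, and can be computed directly (a sketch for a fixed C; in the full analysis C would be averaged over its posterior):

```python
def forecast_next_time(previous_times, C):
    """One-step-ahead forecast of the next interfailure time under the
    non-Gaussian Kalman filter with W_n = V_n = sigma_n = 2 (equation
    3.19): 2C * sum over i of C**i * t_{n-i-1}, i.e. a geometrically
    discounted sum of the previous times, most recent first."""
    return 2.0 * C * sum(C ** i * previous_times[-(i + 1)]
                         for i in range(len(previous_times)))

t_hat = forecast_next_time([1.0, 1.0, 1.0], C=0.5)  # 2*0.5*(1 + 0.5 + 0.25)
```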

3.2 Modeling Number of Failures: Type II Models

3.2.1 The De-eutrophication Model of Moranda (1975). Another
de-eutrophication model, by Moranda (1975), attempted to answer some of the
criticisms of the JM model, in particular the criticism concerning the equal
effect that each bug in the code has on the failure rate. He hypothesized that
the fixing of bugs that cause early failures in the system reduces the failure
rate more than the fixing of bugs that occur later, because these early bugs
are more likely to be the bigger ones. With this in mind, he proposed that the
failure rate should remain constant for each T_i, but that it should be made
to decrease geometrically in i after each failure, i.e. for constants D and k,

r_{T_i}(t | D, k) = D k^{i-1}, t ≥ 0, D > 0, 0 < k < 1. (3.20)

Compared to the JM model, where the drop in failure rate after each failure
was always λ, the drop in failure rate here after the i-th failure is D(1-k)k^{i-1}.
The assumption of a perfect fix, with no introduction of new bugs during the
fix, is retained.
The author provides maximum likelihood estimates for the parameters
D and k in the model. Suppose t_1, ..., t_n are the observed successive times
between failure of a piece of software. Then the MLE for k, k̂, is the solution
to the polynomial equation

Σ_{i=1}^n (i - (n+1)/2) t_i k^i = 0, 0 < k < 1, (3.21)

and, having solved for k̂, D̂ is given by

D̂ = n / Σ_{i=1}^n t_i k̂^{i-1}. (3.22)
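The likelihood equation for k̂ can be solved by bisection; a sketch, assuming the data exhibit reliability growth so that a root exists in (0, 1), with illustrative data:

```python
def moranda_mle(t):
    """MLE for Moranda's geometric de-eutrophication model. k_hat solves
    sum_i (i - (n+1)/2) * t_i * k**(i-1) = 0 (equivalent to (3.21), a
    factor of k apart), by bisection on (0, 1); then, from (3.22),
    D_hat = n / sum_i t_i * k_hat**(i-1). Assumes the data show growth,
    so that a root lies inside (0, 1)."""
    n = len(t)
    def g(k):
        return sum((i - (n + 1) / 2.0) * ti * k ** (i - 1)
                   for i, ti in enumerate(t, start=1))
    lo, hi = 1e-9, 1.0 - 1e-9   # g(lo) < 0 < g(hi) under growth
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    k_hat = (lo + hi) / 2.0
    D_hat = n / sum(ti * k_hat ** (i - 1) for i, ti in enumerate(t, start=1))
    return D_hat, k_hat

D_hat, k_hat = moranda_mle([1.0, 2.0, 3.0, 4.0, 5.0])
```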

Often, data on software failures are given in terms of the number of
failures that occurred in fixed time periods. The time between each failure is
not given explicitly, and so the above model cannot be employed. Instead,
Moranda proposed a Poisson process to describe the number of failures in
each successive time period. In the spirit of his de-eutrophication model, he
proposed that the intensity function should be constant within a particular period
but form a decreasing geometric sequence over the periods: λ, λk, λk², ..., where
0 < k < 1. Thus the number of failures in the i-th time period is a
homogeneous Poisson process with intensity function λk^{i-1}. By scaling, so that the
time periods are of length 1, maximum likelihood estimates of the parameters
k and λ are given. Let m_1, ..., m_n be the observed number of failures during
the first n time periods. Then k̂ is the solution to the polynomial equation

Σ_{i=1}^n (i - 1) m_i / Σ_{i=1}^n m_i = Σ_{i=1}^n (i - 1) k^{i-1} / Σ_{i=1}^n k^{i-1}, 0 < k < 1, (3.23)

and λ̂ is given as

λ̂ = Σ_{i=1}^n m_i / Σ_{i=0}^{n-1} k̂^i. (3.24)

This is one of the first examples of the Type II class of models.
3.2.2 Time-dependent Error Detection Model (Goel and Okumoto
1979). This is the second Type II model that we will consider. First, the
authors make the assumption that the expected number of software failures to
time t, given by the mean value function μ(t), is non-decreasing and bounded
above. Specifically, μ(0) = 0 and lim_{t→∞} μ(t) = a, where a represents the
expected number of errors in the software. They also assume that the
expected number of failures in the time interval (t, t + Δt) is proportional to
the number of undetected errors, or

μ(t + Δt) - μ(t) = b(a - μ(t))Δt + o(Δt), (3.25)

where b is considered to be the fault detection rate. Dividing the above
equation by Δt, letting Δt → 0 and solving the resulting differential equation
subject to the boundary conditions, one reaches the solution

μ(t) = a(1 - e^{-bt}),
λ(t) = μ'(t) = abe^{-bt}. (3.26)

The function μ(t) is used to define a Poisson process, and the distribution
of M(t) is given by the well known formula

P(M(t) = n) = e^{-μ(t)} [μ(t)]^n / n!, n = 0, 1, 2, .... (3.27)
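The mean value function (3.26) and the Poisson probabilities (3.27) are straightforward to compute; a sketch (parameter values illustrative, with the pmf evaluated in log space for numerical stability):

```python
import math

def go_mean(t, a, b):
    """Goel-Okumoto mean value function, mu(t) = a * (1 - exp(-b*t))."""
    return a * (1.0 - math.exp(-b * t))

def go_pmf(n, t, a, b):
    """P(M(t) = n) for the NHPP: the Poisson pmf exp(-mu) * mu**n / n!,
    computed in log space to avoid overflow in mu**n and n!."""
    mu = go_mean(t, a, b)
    if mu == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))

total = sum(go_pmf(n, t=5.0, a=10.0, b=0.5) for n in range(200))
```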
Two assumptions of the JM model are modified here. First, the total number
of errors in the software is a random variable with mean a, contrasted with
the fixed but unknown number in the JM model. Secondly, the times between
successive failures are assumed dependent here, whilst the JM model assumes
independence. Goel and Okumoto claim that these modifications are a better
description of the actual occurrence of failures in software.
The authors present various relevant formulae; for instance, the
distribution of T_i, given that the time to the (i-1)th failure was t, follows from the
Poisson process and is

P(T_i > x | S_{i-1} = t) = exp{-[μ(t + x) - μ(t)]} = exp{-a(e^{-bt} - e^{-b(t+x)})}. (3.28)

Let t_1, t_2, ..., t_n be the observed times between successive failures. Maximum
likelihood estimators for a and b are the solutions to the equations

n/â = 1 - exp(-b̂ s_n),
n/b̂ = Σ_{i=1}^n s_i + â s_n e^{-b̂ s_n}, where s_i = Σ_{j=1}^i t_j, (3.29)

which must be obtained numerically.
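A numerical solution of (3.29) can be sketched as follows: â is eliminated via the first equation, and the remaining equation in b is solved by bisection. The bracket for b and the illustrative data are our assumptions; a finite root requires the failures to cluster early enough (Σ s_i < n s_n / 2):

```python
import math

def go_mle(t, b_lo=1e-4, b_hi=10.0):
    """Numerically solve the Goel-Okumoto likelihood equations (3.29).
    With a = n/(1 - exp(-b*s_n)) substituted, the equation in b becomes
    h(b) = n/b - sum_i s_i - n*s_n*e^{-b s_n}/(1 - e^{-b s_n}) = 0,
    solved here by bisection over an assumed bracket [b_lo, b_hi]."""
    n = len(t)
    s, acc = [], 0.0
    for ti in t:                 # cumulative failure times s_i
        acc += ti
        s.append(acc)
    s_n, sum_s = s[-1], sum(s)
    def h(b):
        e = math.exp(-b * s_n)
        return n / b - sum_s - n * s_n * e / (1.0 - e)
    lo, hi = b_lo, b_hi          # h(lo) > 0 > h(hi) for clustered data
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if h(mid) > 0:
            lo = mid
        else:
            hi = mid
    b_hat = (lo + hi) / 2.0
    a_hat = n / (1.0 - math.exp(-b_hat * s_n))
    return a_hat, b_hat

a_hat, b_hat = go_mle([1.0, 2.0, 3.0, 4.0, 5.0])
```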
Experience has shown that often the rate of faults in software increases
initially before eventually decreasing, and so in Goel (1983) the model was
modified to account for this by letting

μ(t) = a(1 - e^{-bt^c}), (3.30)

where a is still the total number of bugs and b and c describe the quality of
the testing. Goel and Okumoto's model has spawned a plethora of similar
non-homogeneous Poisson process models, each based on different assumptions
as to the expected detection of errors. An overview of such models may be
found in Yamada (1991).
3.2.3 Logarithmic Poisson Execution Time Model (Musa and
Okumoto 1984). The Logarithmic Poisson Execution Time Model of Musa
and Okumoto has gained much popularity in recent years. Unlike the model
of Goel and Okumoto, this model has not been motivated by directly
postulating a form for the intensity function λ(t) of a Poisson process. Rather,
λ(t) is modeled via μ(t), the expected number of failures in time [0, t),
via the relationship

λ(t) = λe^{-θμ(t)}, (3.31)

which encapsulates the belief that the intensity function decreases
exponentially in μ(t). The constants λ and θ describe the initial failure intensity and
the relative decrease in the failure intensity that follows every failure.
Observe that with (3.31) the fixing of the earlier failures has a greater effect in
lowering λ(t) than that of the later failures.
Since μ(t) = ∫_0^t λ(s) ds and μ(0) = 0, (3.31) can be solved in terms of t
to obtain

λ(t) = λ/(λθt + 1),
μ(t) = ln(λθt + 1)/θ. (3.32)

It now follows from standard Poisson process results that

P(M(t) = n) = [ln(λθt + 1)]^n / [θ^n (λθt + 1)^{1/θ} n!], n = 0, 1, ..., (3.33)

and that the density of T_{i+1}, given that S_i = τ, is

f_{T_{i+1}}(t | S_i = τ) = [λ/(λθ(τ + t) + 1)] [(λθτ + 1)/(λθ(τ + t) + 1)]^{1/θ}. (3.34)
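The quantities in (3.32)-(3.33) can be computed directly, and the defining relationship (3.31) gives a convenient internal check: λ(t) must equal λ·exp(-θμ(t)). A sketch with illustrative parameter values:

```python
import math

def mo_mean(t, lam, theta):
    """Musa-Okumoto mean value function mu(t) = ln(lam*theta*t + 1)/theta."""
    return math.log(lam * theta * t + 1.0) / theta

def mo_intensity(t, lam, theta):
    """Failure intensity lambda(t) = lam/(lam*theta*t + 1), which equals
    lam * exp(-theta * mu(t)) as in (3.31)."""
    return lam / (lam * theta * t + 1.0)

def mo_pmf(n, t, lam, theta):
    """P(M(t) = n): the Poisson pmf with mean mo_mean(t), in log space."""
    mu = mo_mean(t, lam, theta)
    if mu == 0.0:
        return 1.0 if n == 0 else 0.0
    return math.exp(-mu + n * math.log(mu) - math.lgamma(n + 1))

total = sum(mo_pmf(n, t=10.0, lam=2.0, theta=0.5) for n in range(300))
```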

Estimation of the parameters of the model (3.33) has been done via the
method of maximum likelihood and via a Bayesian approach involving the use
of expert opinion. The method of maximum likelihood is described in detail
by Musa and Okumoto (1984); some difficulties in using this approach are
given by Campodónico and Singpurwalla (1994). An outline of the Bayesian
approach, which applies to any non-homogeneous Poisson process (NHPP), is
given below; it is abstracted from Campodónico and Singpurwalla (1994).
Consider an NHPP with a mean value function μ(t). Suppose that μ(t)
contains two unknown parameters, and suppose that an analyst (a software
reliability assessor), A, asks an expert (a software developer, debugger, user,
etc.), E, to think about μ(t), and to choose two points in time, say T_1 and
T_2, 0 < T_1 < T_2, for which E can provide opinions on μ(T_1) and μ(T_2). Let
μ_1 = μ(T_1) and μ_2 = μ(T_2).
Because μ_1 and μ_2 are unknown parameters, A treats them as random
quantities, and conceptualizes their distributions as P_{μ_1}(·) and P_{μ_2}(·),
respectively. Then, for each i (i = 1, 2), E declares to A two numbers m_i and s_i,
as measures of the location and the scale of P_{μ_i}(·), respectively. For example,
m_i and s_i may be declared as the mean and the standard deviation of P_{μ_i}(·).
It is important to bear in mind that even though E has declared the m_i and
the s_i to be measures of location and scale, it is possible that in A's mind,
what E declares may not reflect the true opinions of E, and the procedures
that follow provide for this possibility.
Suppose that for A, the model (3.31) is the appropriate one to consider.
That is, for t > 0, the mean value function of the NHPP is of the form
μ(t | θ, λ) = ln(λθt + 1)/θ. A Bayesian analysis of the NHPP requires that A
construct a joint prior distribution for the parameters (θ, λ), and our goal is
to show how the information provided by E can be used by A to induce the
required prior. To do this, A may first construct a joint prior distribution of
μ_1 and μ_2. For this, we observe that the system of equations

μ_1 = ln(λθT_1 + 1)/θ,
μ_2 = ln(λθT_2 + 1)/θ (3.35)

has a solution for (θ, λ), θ, λ > 0, if and only if μ_1 < μ_2 < (T_2/T_1)μ_1.
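Under that constraint, the system (3.35) can be inverted numerically; the following sketch (our illustration) eliminates λ via the first equation and bisects on θ:

```python
import math

def solve_theta_lambda(mu1, mu2, T1, T2):
    """Recover (theta, lambda) from elicited mu1 = mu(T1), mu2 = mu(T2)
    for the Musa-Okumoto mean value function, i.e. solve (3.35). lambda
    is eliminated via lambda = (e^{theta*mu1} - 1)/(theta*T1); the
    remaining equation in theta is solved by bisection. The starting
    bracket (1e-8, 1) with doubling is a numerical convenience."""
    assert mu1 < mu2 < (T2 / T1) * mu1, "system (3.35) has no solution"
    def f(theta):
        return math.log1p(math.expm1(theta * mu1) * T2 / T1) / theta - mu2
    lo, hi = 1e-8, 1.0
    while f(hi) > 0.0:           # extend bracket until f changes sign
        hi *= 2.0
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if f(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    theta = (lo + hi) / 2.0
    lam = math.expm1(theta * mu1) / (theta * T1)
    return theta, lam

theta, lam = solve_theta_lambda(mu1=1.0, mu2=1.5, T1=1.0, T2=2.0)
```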
We next describe how A views E's inputs m_i and s_i, i = 1, 2. We suppose
that A regards the m_i's and the s_i's as data, and models the likelihood
of μ_1 and μ_2 based on two considerations: A's opinion of the biases and
the expertise of E, and A's perceived correlation between M_1 and M_2, and
between S_1 and S_2. Since it is the same individual, namely the expert E, that
provides A information about both μ_1 and μ_2, and also since μ_1 < μ_2 < (T_2/T_1)μ_1,
it is reasonable to suppose that in A's opinion, M_1 and M_2 will be dependent,
and positively so. To summarize, given m_1, m_2, s_1 and s_2, A needs to obtain
P(μ_1, μ_2 | m_1, m_2, s_1, s_2), which incorporates the above dependencies and the
expertise of the expert.
To proceed further, A uses Bayes' law and writes

P(μ_1, μ_2 | m_1, m_2, s_1, s_2) ∝ P(m_1, m_2, s_1, s_2 | μ_1, μ_2) P(μ_1, μ_2),

where ∝ denotes "proportional to". The second term on the right hand side of
the above expression is A's prior opinion about μ_1 and μ_2, and the first term
is A's likelihood which, by the product rule of probability, can be factored as

P(m_1, m_2, s_1, s_2 | μ_1, μ_2) = P_{M_2}(m_2 | m_1, s_1, s_2, μ_1, μ_2)
    × P_{S_2}(s_2 | m_1, s_1, μ_1, μ_2)
    × P_{M_1}(m_1 | s_1, μ_1, μ_2)
    × P_{S_1}(s_1 | μ_1, μ_2).

The subscripts associated with each P pertain to the random variable upon
whose distribution A's likelihood is based.
To specify each component of the above likelihood, A makes a series
of assumptions, many of which pertain to the conditional independence of
the random variables involved, and some pertaining to A's judgment of the
biases and the expertise of the expert. These assumptions are given below
and discussed in more detail in Campod6nico and Singpurwalla (1994).
Ai. M2 is independent of J.t1 given M1, J.t2, Sl and S2; thus PM2 (m2 I
m1, 81, 82, J.t1. J.t2) = PM 2 (m2 I m1,81,82,J.t2).
A2. PM,(m2 I ml, 81, 82,J.t2) is a truncated normal with mean a + bJ.t2 and
standard deviation ,82; the truncation is on the left and right. The left point
of truncation is m1 + k81, where k is specified by A. The right point of
truncation is Rm1. The parameters a, band, reflect A's view of the biases
and attitudes of in declaring m1 and 81.
A3. S2 is independent of M1 given Sl, J.t1 and J.t2; thus PS2 (82 I m1, 81, J.t1,J.t2)
= PS 2 (82 I 81,J.t1,J.t2).
A4 PS 2 (82 I 81,J.t1,J.t2) is a truncated normal with mean 81 and variance
(J.t2 - J.tt); the truncation is on the left, and the point of truncation is O.
A5. M1 is independent of μ2 given μ1 and S1; thus P_M1(m1 | s1, μ1, μ2) =
P_M1(m1 | s1, μ1).
A6. P_M1(m1 | s1, μ1) is a truncated normal with mean a + bμ1 and standard
deviation γs1. The point of truncation is on the left and is at 0. The
parameters a, b, and γ are the same as before.
Assessing the Reliability of Software: An Overview 361

A7. P_S1(s1 | μ1, μ2) is exponential with mean (μ2 - μ1). This implies that as
the disparity between μ1 and μ2 increases, the uncertainty about S1 becomes
larger and larger.
A8. P(μ1, μ2) is relatively constant over the range of μ1 and μ2 on which the
likelihood is appreciable.

The final consequence of the above operations is encapsulated in the following
result which provides the joint distribution of (μ1, μ2). Under the assumptions
A1-A8, A's joint distribution of μ1 and μ2, with T1, T2, m1, m2, s1 and s2
specified, is given as

π(μ1, μ2) ∝ P_M2(m2 | m1, s1, s2, μ2) × P_S2(s2 | s1, μ1, μ2)
            × P_M1(m1 | s1, μ1) × P_S1(s1 | μ1, μ2),     (3.36)

with each term as specified in A2, A4, A6 and A7, where 0 < μ1 < μ2 < Rμ1,
Φ(u) = ∫_{-∞}^{u} exp{-x²/2}/√(2π) dx is the standard normal distribution
function entering the truncated normal terms, and a, b, γ and k are
parameters specified by A.
Observe that A2 contains four parameters, a, b, γ and k; these are introduced
to capture A's view of the biases and the expertise of ℰ. Thus for
example, with b = 1, a denotes the amount of bias by which A believes that
ℰ overestimates μ2. If A thinks that ℰ overestimates (underestimates) μ2 by
10%, then a = 0 and b = 1.1 (0.9). If A thinks that ℰ tends to exaggerate
(is overcautious about) the precision of ℰ's assessment, then γ > 1 (γ < 1).
The parameter k describes A's views as to how cautious ℰ is in discriminating
between μ1 and μ2. These parameters do impact the resulting prior. For
instance, a large value for γ will imply a large uncertainty on the predictions.
The joint prior distribution π(μ1, μ2) given by (3.36) can be evaluated
numerically for any specified values of T1, T2, m1, m2, s1, s2, a, b, γ and
k. Note that T1, T2, m1, m2, s1 and s2 are obtained via expert opinion; the
parameters a, b, γ and k bring flexibility into the analysis by allowing the
analyst to evaluate the expert's skills in a formal manner. If the analyst has
no opinion on the expertise of the expert, or chooses not to incorporate such
opinions into the analysis, then a = 0, b = 1, γ = 1 and k = 0 or 1.
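Under A1-A8 the unnormalized prior is simply the product of the four likelihood components, so it can be evaluated pointwise using nothing more than the standard normal distribution function Φ. The sketch below is our own illustration, not the code of Campodónico (1993); the right-truncation constant R of A2 and all default parameter values are placeholder assumptions.

```python
import math

def norm_pdf(x, mu, sigma):
    """Normal density with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def norm_cdf(x, mu, sigma):
    """Normal distribution function, via the error function."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

def trunc_norm_pdf(x, mu, sigma, lo=-math.inf, hi=math.inf):
    """Density of a normal(mu, sigma) truncated to [lo, hi]."""
    if x < lo or x > hi:
        return 0.0
    mass = norm_cdf(hi, mu, sigma) - norm_cdf(lo, mu, sigma)
    return norm_pdf(x, mu, sigma) / mass if mass > 0.0 else 0.0

def prior_unnormalized(mu1, mu2, m1, m2, s1, s2,
                       a=0.0, b=1.0, gamma=1.0, k=1.0, R=3.0):
    """Unnormalized pi(mu1, mu2): the product of the likelihood terms of
    A2, A4, A6 and A7, under the flat-prior assumption A8.  R is the
    right-truncation constant of A2 (an assumed placeholder here)."""
    if not (0.0 < mu1 < mu2 < R * mu1):
        return 0.0
    like = trunc_norm_pdf(m2, a + b * mu2, gamma * s2,
                          lo=m1 + k * s1, hi=R * m1)          # A2
    like *= trunc_norm_pdf(s2, s1, math.sqrt(mu2 - mu1), lo=0.0)  # A4
    like *= trunc_norm_pdf(m1, a + b * mu1, gamma * s1, lo=0.0)   # A6
    like *= math.exp(-s1 / (mu2 - mu1)) / (mu2 - mu1)             # A7
    return like
```

Normalizing these values over a grid of (μ1, μ2) points then yields (3.36) up to quadrature error.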
The relationship between the parameters (λ, θ) of the logarithmic-Poisson
execution time model, and (μ1, μ2), is given by

μi = (1/θ) ln(λθTi + 1), i = 1, 2.
362 Nozer D. Singpurwalla and Refik Soyer

The above can be used to solve for λ and θ in terms of (μ1, μ2, T1, T2);
the solution has to be numerically obtained. Thus, given (μ1, μ2, T1, T2),
μ(t | θ, λ) = (1/θ) ln(λθt + 1), the mean value function of the logarithmic-Poisson
execution time model, can be (numerically) evaluated. We denote the mean
value function thus obtained by μ(· | μ1, μ2).
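The numerical solution reduces to a one-dimensional root-finding problem: writing c = λθ, the ratio μ1/μ2 = ln(cT1 + 1)/ln(cT2 + 1) increases monotonically in c from T1/T2 to 1, so c can be found by bisection, after which θ and λ follow in closed form. A sketch (ours, not the authors' code):

```python
import math

def solve_lambda_theta(mu1, mu2, t1, t2, tol=1e-12):
    """Recover (lambda, theta) from mu_i = (1/theta) ln(lambda*theta*t_i + 1),
    i = 1, 2, by bisecting on c = lambda*theta."""
    target = mu1 / mu2
    if not (t1 / t2 < target < 1.0):
        raise ValueError("no solution: need t1/t2 < mu1/mu2 < 1")
    ratio = lambda c: math.log(c * t1 + 1.0) / math.log(c * t2 + 1.0)
    lo, hi = 1e-12, 1.0
    while ratio(hi) < target:   # grow the bracket until it straddles the root
        hi *= 2.0
    while hi - lo > tol * hi:
        mid = 0.5 * (lo + hi)
        if ratio(mid) < target:
            lo = mid
        else:
            hi = mid
    c = 0.5 * (lo + hi)
    theta = math.log(c * t2 + 1.0) / mu2
    lam = c / theta
    return lam, theta
```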
Once the above calculations have been performed, several quantities that
are of practical interest to the software engineer can be obtained; these follow
from standard results on the probabilistic structure of the NHPP. The
parenthetical unconditioning mentioned below refers to the fact that the results
do not depend on any unknown parameters; these have been averaged out by
the joint distribution developed above.
The quantities of interest are the (unconditional) expected number of failures
in (0, t]:

∫∫_{μ1,μ2} μ(t | μ1, μ2) π(μ1, μ2) dμ1 dμ2,
the (unconditional) expected number of failures in any interval (s, t], s < t:

∫∫_{μ1,μ2} (μ(t | μ1, μ2) - μ(s | μ1, μ2)) π(μ1, μ2) dμ1 dμ2,
the (unconditional) probability of k failures in the interval (0, t], for k =
0, 1, 2, ...:

∫∫_{μ1,μ2} e^{-μ(t|μ1,μ2)} [μ(t | μ1, μ2)]^k / k! π(μ1, μ2) dμ1 dμ2,

and the (unconditional) probability of k failures in the interval (s, t], s < t,
for k = 0, 1, 2, ...:

∫∫_{μ1,μ2} e^{-[μ(t|μ1,μ2) - μ(s|μ1,μ2)]} [μ(t | μ1, μ2) - μ(s | μ1, μ2)]^k / k! π(μ1, μ2) dμ1 dμ2.

The last two quantities given above, are known as the predictive distri-
butions. These are used to provide a measure of uncertainty associated with
the predicted number of failures in a specified interval.
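The predictive integrals above average NHPP Poisson probabilities over the prior. A minimal quadrature sketch, in which π(μ1, μ2) is approximated by a finite set of weighted points (an assumption made purely for illustration):

```python
import math

def poisson_pmf(k, m):
    """Poisson probability of k events with mean m."""
    return math.exp(-m) * m**k / math.factorial(k)

def predictive_prob_k(k, s, t, mean_fn, prior_points):
    """Approximate the (unconditional) probability of k failures in (s, t].

    mean_fn(t, mu1, mu2): the mean value function mu(t | mu1, mu2).
    prior_points: list of ((mu1, mu2), weight) pairs with weights summing
    to one, standing in for pi(mu1, mu2) on a quadrature grid."""
    total = 0.0
    for (mu1, mu2), w in prior_points:
        m = mean_fn(t, mu1, mu2) - mean_fn(s, mu1, mu2)  # NHPP mean in (s, t]
        total += w * poisson_pmf(k, m)
    return total
```

With a single prior point the formula collapses to an ordinary Poisson probability, which gives a quick sanity check on any implementation.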
A computer code that facilitates computations involving the above inte-
grals has been developed by Campodónico (1993), and can be made available
to potential users.

4. Model Unification
From the material of the previous section it is apparent that unlike hardware
reliability where a few probability models like the Weibull play a dominant
role, the topic of software reliability is deluged with a plethora of models. A
question therefore arises as to whether there could be a common underlying
theme that can unite these models and simplify the task of users. Attempts
to address this question have been undertaken by many, starting with Koch
to address this question have been undertaken by many, starting with Koch
and Spreij (1983) who view the software failure process through martingale
dynamics, and in so doing set a tone for further developments, especially the
recent work of Van Pul (1992). Subsequent to this, Langberg and Singpur-
walla (1985) were able to show that the models by Jelinski and Moranda,
Littlewood and Verrall, and Goel and Okumoto could be unified by assum-
ing the exchangeability of the software inter-failure times and by adopting a
Bayesian point of view.
Further progress on unification was made by Miller (1986) who viewed
the software failure times as order statistics of independent, non-identically
distributed exponential random variables whose scale parameters λ1, λ2, ..., λn
could be deterministic or stochastic. He terms his models exponential order
statistics (EOS) models and labeled the case of the deterministic (stochastic)
scale parameter DET/EOS (DS/EOS). With λi = λ, for i = 1, ..., n, and λ
known, the DET/EOS model results in the model of Jelinski and Moranda.
With λ1, λ2, ..., modeled as the realization of a nonhomogeneous Poisson
process (NHPP), the models of Goel and Okumoto, Musa and Okumoto and
Littlewood and Verrall occur as special cases.
More recently, Fakhre-Zakeri and Slud (1995), henceforth (F-Z)S, intro-
duce the notion of mixture models for software failures, where the latter are
viewed as point processes whose intensities are conditioned on unobservable
variables. By assigning different probabilistic and deterministic structures to
the conditioning unobservables, (F-Z)S are able to pull together the models
of Jelinski and Moranda, Goel and Okumoto, Musa and Okumoto and the
model considered by Dalal and Mallows (1988). A noteworthy feature of the
work of (F-Z)S is the fact that they are able to provide a justification of the
model by Musa and Okumoto via a limiting argument on the parameters of
the model of Littlewood and Verrall.
It is interesting to note that all attempts to unify models, particularly
those by Langberg and Singpurwalla, Miller, and (F-Z)S involve distribu-
tional assumptions (prior distributions) on one or more parameters of an
underlying structure. Specifically, Langberg and Singpurwalla use a "shock-
model" (cf. Barlow and Proschan 1981, p. 135) as their basic structure,
whereas Miller uses the exponential distribution as his underlying struc-
ture. (F-Z)S (and also Koch and Spreij) follow the French school of modeling
stochastic processes via a counting process framework along the lines used
by Aalen (1978) for analyzing lifetime data.
The latest entrants in the enterprise of unifying software reliability mod-
els are the papers of Chen and Singpurwalla (1995) and of Kuo and Yang
(1995a). Chen and Singpurwalla first argue that, mathematically, it does not
make sense to talk about the failure rate of software that is evolving over
time. They then model the software failure process as a self-exciting point
process (cf. Snyder and Miller 1991, p. 287) and show that all the models dis-
cussed in Section 3 including the Kalman filter based ones by Singpurwalla
and Soyer and by Chen and Singpurwalla are special cases of such processes.
Furthermore, the intensity function of the point process is indeed what
software engineers (like Jelinski and Moranda, Littlewood and Verrall, Schick
and Wolverton, etc.) refer to as the failure rate of software. This work, plus
the preceding papers by Miller, (F-Z)S and Koch and Spreij should signal a
shift in the paradigm of software reliability modeling from its current focus on
the failure rate to that of counting process theory and martingale dynamics.
The work of Kuo and Yang (1995a) is also noteworthy, because these
authors introduce the idea of using record value statistics (cf. Glick 1978) for
modeling software failures when new faults may be introduced during the
process of correcting other faults. The unifying theme of Kuo and Yang is a
use of the non-homogeneous Poisson process (NHPP); their focus of attention
is Bayesian inference using the Gibbs sampling approach. An overview of the
main ideas of Kuo and Yang is given next.
Suppose that at the beginning of software testing there is an unknown
number of faults, say N. Then, the first n ≤ N epochs of software failure can
be modeled as the first n order statistics of N independent and identically
distributed (i.i.d.) random variables having density f. This idea parallels that
of Miller (1986), who restricts attention to the case of f being exponential.
The authors refer to their set-up as the general order statistics (GOS) model.
When f is exponential, we get the model by Jelinski and Moranda. By varying
f we can obtain analogues to the Jelinski-Moranda model.
Let M(t) be the number of software failures in (0, t], and let μ(t) =
E(M(t)) be its expectation. We assume that μ(t) is differentiable, and let
λ(t) = dμ(t)/dt. Suppose that the prior on N is a Poisson with mean θ. Then,
it can be shown (cf. Langberg and Singpurwalla 1985) that M(t) is an NHPP
with μ(t) = θF(t), where F is the cumulative distribution function of f, and
intensity function λ(t). With F(t) = 1 - e^{-βt}, μ(t) = θ(1 - e^{-βt}), and the
resulting process for M(t) is the model of Goel and Okumoto. Processes for
which N has a Poisson distribution and lim_{t→∞} μ(t) < ∞ are referred to by
Kuo and Yang as "NHPP-I" processes. Nonhomogeneous Poisson processes
with lim_{t→∞} μ(t) = ∞ are called "NHPP-II" processes. An example of an
NHPP-II process is the model by Musa and Okumoto (1984).
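The NHPP-I versus NHPP-II distinction is easy to see numerically: the Goel-Okumoto mean value function is bounded above by θ, while the Musa-Okumoto (logarithmic-Poisson) one grows without bound. A small sketch, with parameter values chosen arbitrarily for illustration:

```python
import math

def mu_goel_okumoto(t, theta=50.0, beta=0.1):
    """NHPP-I mean value function: theta * (1 - exp(-beta t)), bounded by theta."""
    return theta * (1.0 - math.exp(-beta * t))

def mu_musa_okumoto(t, lam=2.0, theta=0.5):
    """NHPP-II (logarithmic-Poisson) mean value function:
    (1/theta) * ln(lam * theta * t + 1), unbounded as t grows."""
    return math.log(lam * theta * t + 1.0) / theta
```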
We now turn attention to record value statistics. Suppose that S1, S2, ...,
are independent and identically distributed random variables with density
function f. We define the sequence of record values {Xk}, k ≥ 1, and record
times Rk, k ≥ 1, as follows:

R1 = 1,
Rk = min{i | i > Rk-1, Si > S_{Rk-1}}, k ≥ 2, and
Xk = S_{Rk}, k ≥ 1.
An example best describes the above construction of Rk and Xk. Suppose
that S1 = 4, S2 = 1, S3 = 7, S4 = 5, S5 = 9, S6 = 11, S7 = 13, S8 = 6, S9 = 18,
S10 = 14 and S11 = 15. Then the (Rk, Xk) pairs are: (1,4), (3,7), (5,9), (6,11),
(7,13), and (9,18). Even though, for large k, new records are rarely observed, it
can be shown that the sequence of record values is infinite. We therefore
model the observed epochs of software failures as the record values X1, X2, ....
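The construction of the (Rk, Xk) pairs can be sketched in a few lines; applying it to the example sequence also yields the pair (6, 11), since S6 = 11 exceeds the previous record value 9:

```python
def record_pairs(s):
    """Return the (record time, record value) pairs (R_k, X_k) of a sequence,
    with 1-based indexing as in the text: R_1 = 1, and R_k is the first index
    after R_{k-1} whose value exceeds the previous record value X_{k-1}."""
    pairs = []
    best = None
    for i, x in enumerate(s, start=1):
        if best is None or x > best:
            pairs.append((i, x))
            best = x
    return pairs
```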
An interesting theorem due to Dwass (1964) says that the record values
constructed above in (0, t] are the points of an NHPP-II process in (0, t] with
mean value function μ(t) = -ln(1 - F(t)). Consequently, the model of Musa
and Okumoto is a RVS model with f having a Pareto distribution. Since
λ(t) = d(μ(t))/dt = f(t)/(1 - F(t)), we have the result that the intensity
function of an NHPP-II process constructed from a RVS model with density
f is the failure rate of f.

References

Aalen, O.O.: Nonparametric Inference for a Family of Counting Processes. Ann. of
Stat. 6, 701-726 (1978)
Arjas, E. and Haara, P.: A Marked Point Process Approach to Censored Failure
Data with Complicated Covariates. Scand. J. Statist. 11, 193-209 (1984)
Barlow, R.E., Proschan, F.: Statistical Theory of Reliability and Life Testing: Prob-
ability Models. Silver Spring: To Begin With 1981
Bather, J.A.: Invariant Conditional Distributions. Ann. Math. Statist. 36, 829-846
(1965)
Campod6nico, S.: Software for a Bayesian Analysis of the Logarithmic-Poisson Ex-
ecution Time Model. Technical Report GWU /IRRA/TR-93/5. Institute for Re-
liability and Risk Analysis, The George Washington University (1993)
Campod6nico, S., Singpurwalla, N.D.: A Bayesian Analysis of the Logarithmic-
Poisson Execution Time Model Based on Expert Opinion and Failure Data.
IEEE Trans. Soft. Eng. SE-20, 677-683 (1994)
Chen, Y., Singpurwalla, N.D.: A Non-Gaussian Kalman Filter Model for Tracking
Software Reliability. Statistica Sinica 4, 535-548 (1994)
Chen, Y., Singpurwalla, N.D.: Unification of Software Reliability Models by Self-
Exciting Point Processes. Technical Report GWU /IRRA/TR-95/3. Institute
for Reliability and Risk Analysis, The George Washington University (1995)
Chen, J., Singpurwalla, N.D.: The Notion of "Composite Reliability" and its Hier-
archical Bayes Estimation. J. Amer. Statist. Assoc.. To appear (1996)
Dalal, S.R., Mallows, C.L.: When Should One Stop Testing Software? J. Amer.
Statist. Assoc. 83, 872-879 (1988)
Dwass, M.: Extremal Processes. Ann. of Math. Stat. 35, 1718-1725 (1964)
Fakhre-Zakeri, I., Slud, E.: Mixture Models for Reliability of Software with Imper-
fect Debugging: Identifiability of Parameters. IEEE Trans. Rel. R-44, 104-113
(1995)
Forman, E.H., Singpurwalla, N.D.: An Empirical Stopping Rule for Debugging and
Testing Computer Software. J. Amer. Statist. Assoc. 72, 750-757 (1977)
Glick, N.: Breaking Records and Breaking Boards. Amer. Math. Monthly 85, 2-26
(1978)
Goel, A.L.: A Guidebook for Software Reliability Assessment. Technical Report
RADC-TR-83-176 (1983)
Goel, A.L.: Software Reliability Models: Assumptions, Limitations, and Applica-
bility. IEEE Trans. Soft. Eng. SE-11, 1411-1423 (1985)
Goel, A.L., Okumoto, K.: An Analysis of Recurrent Software Failures on a Real-
Time Control System. Proc. ACM Annu. Tech. Conf. ACM 1978, pp. 496-500
Goel, A.L., Okumoto, K.: Time-Dependent Error Detection Rate Model for Soft-
ware Reliability and other Performance Measures. IEEE Trans. Rel. R-28,
206-211 (1979)
Hudson, A.: Program Errors as a Birth and Death Process. Technical Report SP-
3011. Systems Development Corporation (1967)
Iannino, A., Musa, J.D., Okumoto, K., Littlewood, B.: Criteria for Software Relia-
bility Model Comparison. Trans. Soft. Eng. SE-10, 687-691 (1984)
Jelinski, Z., Moranda, P.: Software Reliability Research. In : Freiberger, W. (ed.):
Statistical Computer Performance Evaluation. New York: Academic Press 1972,
pp. 465-484
Joe, H., Reid, N.: On the Software Reliability Models of Jelinski-Moranda and
Littlewood. IEEE Trans. Rel. R-34, 216-218 (1985)
Koch, G., Spreij, P.: Software Reliability as an Application of Martingale and Fil-
tering Theory. IEEE Trans. Rel. R-32, 342-345 (1983)
Kuo, L., Yang, T.Y.: Bayesian Computation for Nonhomogeneous Poisson Processes
in (Software) Reliability. Under review for publication (1995a)
Kuo, L., Yang, T. Y.: Bayesian Computation of Software Reliability. Journal of
Computational and Graphical Statistics 4, 65-82 (1995b)
Langberg, N., Singpurwalla, N.D.: A Unification of Some Software Reliability Mod-
els. SIAM J. Sci. Statist. Comput. 6, 781-790 (1985)
Littlewood, B.: A Bayesian Differential Debugging Model for Software Reliability.
Proceedings of IEEE COMPSAC (1980)
Littlewood, B., Verrall, J.L.: A Bayesian Reliability Growth Model for Computer
Software. Appl. Statist. 22, 332-346 (1973)
Mazzuchi, T.A., Soyer, R.: A Bayes Empirical-Bayes Model for Software Reliability.
IEEE Trans. Rel. R-37, 248-254 (1988)
Miller, D.R.: Exponential Order Statistic Models of Software Reliability Growth.
IEEE Trans. Soft. Eng. SE-12, 12-24 (1986)
Moranda, P.B.: Prediction of Software Reliability and its Applications. Proceed-
ings of the Annual Reliability and Maintainability Symposium. Washington
DC 1975, pp. 327-332.
Musa, J.D.: Software Reliability Data. IEEE Computing Society Repository (1979)
Musa, J.D., Iannino, A., Okumoto, K.: Software Reliability: Measurement, Predic-
tion, Application. New York: McGraw-Hill 1987
Musa, J.D., Okumoto, K.: A Logarithmic Poisson Execution Time Model for Soft-
ware Reliability Measurement. Proceedings of the 7th International Conference
on Software Engineering. Orlando 1984, pp. 230-237
Schick, G.J., Wolverton, R.W.: Assessment of Software Reliability. Proc. Oper. Res.
Würzburg-Wien: Physica-Verlag 1978, pp. 395-422
Singpurwalla, N.D., Soyer, R.: Assessing (Software) Reliability Growth Using a
Random Coefficient Autoregressive Process and its Ramifications. IEEE Trans.
Soft. Eng. SE-11, 1456-1464 (1985)
Singpurwalla, N.D., Soyer, R.: Non-Homogeneous Autoregressive Processes for
Tracking (Software) Reliability Growth, and their Bayesian Analysis. J. Roy.
Statist. Soc. B 54, 145-156 (1992)
Singpurwalla, N.D., Wilson, S.P.: Software Reliability Modeling. International Sta-
tistical Review 62, 289-317 (1994)
Snyder, D.L., Miller, M.I.: Random Point Processes in Time and Space. Second
Edition. New York: Springer 1991
Soyer, R.: Monitoring Software Reliability Using NonGaussian Dynamic Models.
Proceedings of the Engineering Systems Design and Analysis Conference 1,
419-423 (1992)
van Pul, M.C.: Asymptotic Properties of Statistical Models in Software Reliability.
Scand. J. Statist. 19, 235-254 (1992)
Yamada, S.: Software Quality/Reliability Measurement and Assessment: Software
Reliability Growth Models and Data Analysis. J. Inform. Process. 14, 254-266
(1991)
The Role of Decision Analysis in Software Engineering
Jason Merrick and Nozer D. Singpurwalla
Department of Operations Research, The George Washington University, Washington,
DC 20052, USA

Summary. There are many decisions involved in the creation of a reliable software
system. In this paper we demonstrate the use of Bayesian decision theory for mak-
ing decisions in software engineering. We give two examples of such decisions; the
first concerns the choice of a software house to use when an organization identifies a
particular software requirement. The second decision pertains to an optimal testing
strategy that a software house should adopt before releasing a piece of software.
We consider both single and multiple stage testing and utilize existing software
reliability models to determine the optimal rule.

Keywords. Bayesian methods, decision theory, experimental design, hierarchical
classifications, manufacturing science, non-Gaussian filtering, pre-posterior
analysis, quality control, sequential testing, software development process,
software reliability

1. Introduction

In the design of software systems, there are many decisions to be made.


Some have outcomes which are known and so decisions can be made by
logical deduction. However, many decisions in this field have unknown, but
predictable, outcomes. We consider two decisions in software design as exam-
ples to illustrate a framework under which these decisions should be made.
For a company with a particular software requirement, the first and most
crucial decision pertains to the best software house to choose to create such
a software system. The eventual cost due to bugs in the delivered system is
unknown. For a software house developing a software system, the optimal
system test duration is of critical importance; for this decision the reliability
of the system is the unknown entity.
The concept of a framework with which to make rational decisions has
come under much scrutiny. The use of decision trees and the principle of
maximization of expected utilities has been advocated as a coherent method
of making decisions under uncertainty, see French (1986) Chapter 1. In this
paper, we present methods for modeling uncertainty and an approach to
making the two decisions outlined above.
The main factors in choosing a software house are the price, the saving
made by using the system offered, and the quality of the system. In Section
2 we discuss the CMM model, see Paulk et al. (1993a,1993b), for classifying
the maturity of the quality control procedures practiced by a software house.
We outline a probabilistic version of this model and use it in making the
decision of which software house to choose.
The subject of optimal software testing has been examined by several
authors including Forman and Singpurwalla (1977,1979), Okumoto and Goel
(1980), Yamada et al. (1984), Ross (1985), Dalal and Mallows (1988) and
Singpurwalla (1989,1991). With the exception of the last three references,
the methods offered in these papers are not decision theoretic. In Section
3 we briefly discuss the area of software reliability modeling and use two
existing models to make the optimal decisions of when to terminate testing
in both single and multiple stage tests.

2. Choosing a Software House


When a company identifies a particular software requirement, several soft-
ware houses are approached to offer solutions. Each of these houses gives an
overview of the system they would design and bids a price to create it. Apart
from obvious monetary considerations of price and the saving from using the
software system offered, the main factor in choosing amongst these bids is
quality. Low quality software will have many bugs which may cause delays in
the operation of the company. Thus a method of classifying quality control
practices used by a software house is useful in discriminating between them.
The software development process is defined in Humphrey (1989) as the
set of actions that efficiently transform users' needs into an effective software
solution. The idea that it is possible to create virtually "bug-less" software
by simply hiring good programmers is no longer supported by experts. It
is accepted that sound quality control procedures are the key to producing
quality software. In documented case studies, see Paulk et al. (1993a), the
return on investment in software process improvements has been in the range
5:1 to 8:1.
One of the most widely used classifications of the maturity of the quality
control procedures used within a software house is the Capability Matu-
rity Model (CMM) developed at the Software Engineering Institute (SEI) of
Carnegie Mellon University. Under this scheme a software house is classified
by one of five maturity levels, indicating the maturity of their quality control
procedures.
Obviously, it is highly desirable to employ a software house of the highest
maturity. However, the procedures involved in attaining this level of maturity
cause the price of the software to be high. Thus there is a trade-off between
the quality of the delivered system and the price quoted by the company.

2.1 The Classification Problem: The CMM Model


The classification of a company under the CMM model is hierarchical. There
are five maturity levels, level one being the lowest and attained by all com-
panies. To attain a higher level, certain key process areas (or KPA's) must
be satisfied, in addition to the preceding level. To facilitate a discussion of
the model, we introduce the following notation. Let

Mi (M̄i) denote the event that the software house attains (does
not attain) maturity level i, or higher; and
Li denote the event that the highest level attained by the
company is the i-th.

The relationship between these events is given by

Li = Mi ∩ M̄i+1, if i = 1, 2, 3, 4,
Li = Mi, if i = 5.     (2.1)
For the i-th maturity level, there are ni associated KPA's, where
Ki,j = 1 (0) denotes the satisfaction (or otherwise) of the j-th KPA
associated with the i-th level.

For the j-th KPA at the i-th level, there are ri,j questions, where
Xi,j,k = 1 (0) denotes that the answer to the k-th such question is a
Yes (No).
The vector of questions for the j-th KPA associated with the i-th maturity
level, (Xi,j,1, ..., Xi,j,ri,j), is denoted by Qi,j. The set of all such vectors for
the i-th level is denoted by Qi and the responses to the entire questionnaire
are denoted by Q.
The CMM model can be represented by the tree in Figure 2.1, showing
the event hierarchy. This structure shows that for the event Mi to occur, the
event Mi-1 must occur and all of Ki,1, ..., Ki,ni must take the value one.

Fig. 2.1. The hierarchical structure of the CMM model
The decision of whether the j-th KPA of the i-th level is satisfied or not, is
based upon
(1/ri,j) Σ_{k=1}^{ri,j} Xi,j,k,

the proportion of affirmative answers.


2.1.1 A Probabilistic Approach. Under the scheme described in the pre-
vious section, if the proportion of affirmative answers is sufficiently large then
a software house is said to satisfy the key process area with probability one.
A difficulty with this approach is that a KPA is an abstract concept and the
answers to the questionnaire are just a data sample related to the question
of satisfaction of the KPA. Based on a finite amount of such data, we cannot
say for sure that the KPA has been satisfied. Another flaw in this determinis-
tic approach is that it does not incorporate prior information that may help
the classifier make a decision of whether or not the KPA's are satisfied. Such
information includes the expert opinion of people in the field and the history
of the software house.
We propose a Bayesian approach which enables the prior information,
denoted H, to be incorporated and the responses to the questionnaire to be used
to update the decision makers' beliefs. This suggests that the deterministic
relationship between the events Mi, Mi-1 and the random variables
Ki,1, ..., Ki,ni, shown in Figure 2.1, is to be weakened.
2.1.2 Assumptions. The derivation of the required probabilities is facilitated
by making certain assumptions of conditional independence. We denote
by (Z1 ⊥ Z2) | Z3 the conditional independence of the random quantities
Z1 and Z2 given the random quantity Z3. For each i, (i = 1, 2, ..., 5), we
assume that
A1 (Mi ⊥ Q) | Mi-1, Ki,1, ..., Ki,ni,
A2 (Ki,j ⊥ Ki,l) | Q, where j ≠ l,
A3 (M̄i-1 ⊥ Ki,1, ..., Ki,ni) | Qi,
A4 (Mi-1 ⊥ Ki,1, ..., Ki,ni) | Mi,
A5 (Ki,j ⊥ Qi,l) | Qi,j, where j ≠ l,
A6 (Xi,j,k ⊥ Xi,j,l) | Ki,j, where k ≠ l.
2.1.3 Derivation of the Recursive Relationship. To obtain the probability
P(Mi | Q, H), i = 2, ..., 5, we develop a recursive relationship based
on the probability P(Mi-1 | Q, H). All software houses are said to attain the
first level of maturity, i.e. P(M1 | Q, H) = P(M1) = 1.
To obtain the recursion, we first extend the conversation to include
Ki,1, ..., Ki,ni and the event Mi-1. The occurrence of the event M̄i-1 means
that the event Mi cannot occur, so we need only condition on Mi-1. Invoking
the assumptions A1, A2 and A3 gives the expression for P(Mi | Q, H) as
the sum, over the possible values of Ki,1, ..., Ki,ni, of

[P(Mi | Mi-1, Ki,1, ..., Ki,ni, H)
 × Π_{j=1}^{ni} P(Ki,j | Q, H)     (2.2)
 × P(Mi-1 | Q, H)].
The term P(Mi-1 | Q, H) is the probability calculated in the previous
step of the recursion; it equals one for the first level. It remains to find
expressions for the first and second probabilities in (2.2). The probability
P(Mi | Mi-1, Ki,1, ..., Ki,ni, H) is obtained by applying Bayes' law and
invoking the assumptions A2, A3 and A4, to obtain

Π_{j=1}^{ni} P(Ki,j | Mi) P(Mi | Mi-1, H)
 / Σ_{M ∈ {Mi, M̄i}} Π_{j=1}^{ni} P(Ki,j | M) P(M | Mi-1, H).

Once again we use Bayes' law to obtain an expression for P(Ki,j | Q, H).
Invoking assumption A5 gives

P(Qi,j | Ki,j) P(Ki,j | H)
 / Σ_{Ki,j} P(Qi,j | Ki,j) P(Ki,j | H).
The likelihood P(Ki,j | Mi) is specified by the classifier. It reflects their
belief about the link between the concepts involved in attaining the i-th
maturity level and the satisfaction of the j-th KPA for that level. The
likelihood P(Qi,j | Ki,j) can take various forms depending on the classifier's
beliefs about the relevance of the questions to the satisfaction of the KPA.
The form depends on the assumed dependence between the random variables
Xi,j,1, ..., Xi,j,ri,j, the responses to the questions for the j-th KPA of level
i. We refer the reader to the discussion of the various forms of dependence
in Landry and Singpurwalla (1995). We note that the likelihoods specified
by the classifier are independent of the software houses; they reflect beliefs
about the concepts involved in the model and the questionnaire.
The prior probabilities of the software house satisfying each of the KPA's
and of attaining a maturity level, given that the level below has been attained,
must also be specified using expert opinion.
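One step of the recursion can be sketched by enumerating the 2^ni configurations of Ki,1, ..., Ki,ni. All probability tables below (the per-KPA posteriors P(Ki,j = 1 | Q, H), the likelihoods P(Ki,j = 1 | Mi) and P(Ki,j = 1 | M̄i), and P(Mi | Mi-1, H)) are hypothetical inputs chosen for illustration, not values from the paper:

```python
from itertools import product

def p_mi_given_ks(ks, p_k1_given_mi, p_k1_given_not_mi, p_mi_given_prev):
    """Bayes' law for P(M_i | M_{i-1}, K_{i,1..n_i}, H), per the expression
    following (2.2)."""
    def lik(p_k1):
        out = 1.0
        for k, p1 in zip(ks, p_k1):
            out *= p1 if k == 1 else (1.0 - p1)
        return out
    num = lik(p_k1_given_mi) * p_mi_given_prev
    den = num + lik(p_k1_given_not_mi) * (1.0 - p_mi_given_prev)
    return num / den if den > 0.0 else 0.0

def cmm_level_posterior(p_prev, p_k_post,
                        p_k1_given_mi, p_k1_given_not_mi, p_mi_given_prev):
    """One step of the recursion (2.2): P(M_i | Q, H), given
    P(M_{i-1} | Q, H) = p_prev and p_k_post[j] = P(K_{i,j} = 1 | Q, H)."""
    total = 0.0
    for ks in product([0, 1], repeat=len(p_k_post)):
        w = 1.0                       # weight of this KPA configuration
        for k, q in zip(ks, p_k_post):
            w *= q if k == 1 else (1.0 - q)
        total += w * p_mi_given_ks(ks, p_k1_given_mi,
                                   p_k1_given_not_mi, p_mi_given_prev)
    return total * p_prev
```

Starting from P(M1 | Q, H) = 1, applying this step for i = 2, ..., 5 yields the full maturity classification.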
An example illustrating this approach to the classification of the maturity
of the quality control procedures practiced by a software house is given in
Landry and Singpurwalla (1995). In this example the actual questionnaire
completed by a local company, plus the expert opinion of people associated
with the company, was used.

2.2 The Decision Problem

For a company with a particular software requirement, suppose that k
software houses bid for the contract. Each tenders a price, denoted ps, for
s = 1, ... , k. The company must make a decision of which house to offer
the contract. A naive company may simply opt for the lowest price. However,
the lack of consideration of the quality may prove more expensive in the long
term as the house chosen may have poor quality control procedures in place;
this could lead to bugs in the delivered system causing costly delays in the
operation of the company. The decision maker must therefore have some idea
of the standard of quality control in each software house.
We define the actions

Hs denoting the decision that the s-th house is offered the
contract, for s = 1, ..., k; and
H̄ denoting the decision that the company does not buy a
software system.

The decision tree is shown in Figure 2.2.

Fig. 2.2. The decision tree for choosing a software house

The tree consists of one decision

node, V, the choice of which house to hire, and k random nodes, R1, ..., Rk.
We denote by ds the profit the company would make if they used the software
system offered by the s-th software house under the assumption of bug-free
operation. The random node for each decision, Rs for s = 1, ..., k, gives the
actual cost due to delays; this cost is denoted by the random variable Cs. The
utilities of the decisions are thus given by

U(Hs) = -(ps + Cs) + ds, for s = 1, ..., k, and
U(H̄) = 0.
The random variable Cs depends on the quality of the software delivered.
In Section 2.1, we have discussed a method to classify the quality control
maturity of a software house. Thus to obtain a distribution on Cs we "extend
the conversation" to include the events L^s_i (i = 1, ..., 5) indicating that the
highest maturity level attained by the s-th software house is the i-th and
assuming that given L^s_i, Cs is independent of the results of the questionnaire
for the s-th house, Q^s, and the prior history. Thus we have

P(Cs = cs | Q^s) = Σ_{i=1}^{5} P(Cs = cs | L^s_i) P(L^s_i | Q^s).

The distribution of Cs given L^s_i is the distribution of the costs due to
delays caused by a software system written by a software house of maturity
level i. A discussion of the reduction in the mean and variance of the delay
costs for software written by mature software houses is given in Paulk et al.
(1993a), but no attempt is made to assess the distributions.
Once the distribution of G, has been obtained and the probabilities
P(L1 I Q') (i = 1, ... ,5) have been calculated for each software house
(s = 1, ... , k), using the probabilistic CMM model outlined in Section 2.3,
the optimal decision is made using the principle of maximization of expected
utilities. The expected utility for choosing the s-th software house is given,
for s = 1, ... , k, by

-(p. + E[G, I Q6]) + d" (2.3)

where

E[G6 I Q6] = fe, C6 (E~=l P(G. = I Ln p (L1 I Q')) dc,.


C6

If all these values are below the utility for not purchasing a software system,
then the optimal decision is not to purchase a software system. Otherwise,
the optimal choice is the house with the highest expected utility.
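The selection step above reduces to computing (2.3) for each house and comparing against the zero utility of not purchasing. A minimal sketch, in which all prices, profits, delay costs and maturity probabilities are hypothetical illustrations rather than values from the chapter:

```python
def expected_delay_cost(cost_given_level, p_level):
    """E[C_s | Q^s] = sum_i E[C_s | L_i^s] * P(L_i^s | Q^s), extending the
    conversation over the five maturity levels."""
    return sum(c * p for c, p in zip(cost_given_level, p_level))

def expected_utility(price, exp_cost, profit):
    """Expected utility (2.3) of hiring a house: -(P_s + E[C_s | Q^s]) + d_s."""
    return -(price + exp_cost) + profit

# Hypothetical inputs for two houses: price P_s, profit d_s, expected delay
# cost per maturity level, and P(L_i^s | Q^s) from the probabilistic CMM model.
houses = {
    "house A": (50.0, 80.0,
                expected_delay_cost([40, 25, 15, 8, 4], [.1, .2, .4, .2, .1])),
    "house B": (40.0, 75.0,
                expected_delay_cost([40, 25, 15, 8, 4], [.4, .3, .2, .1, .0])),
}
utils = {name: expected_utility(p, c, d) for name, (p, d, c) in houses.items()}
best = max(utils, key=utils.get)
decision = best if utils[best] >= 0.0 else "do not purchase"  # 0 = no-purchase utility
```

Here house A has expected utility -(50 + 17) + 80 = 13 and house B has -(40 + 27.3) + 75 = 7.7, so house A is hired; had both been negative, the no-purchase decision would win.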

3. Optimal Testing Strategies


The generally accepted definition of software reliability is "the probability
of failure-free operation of a computer program in a specified environment
for a specified period of time", given in Musa and Okumoto (1982). The
production of software is costly and time-consuming, with much of this time
occupied by testing the system for bugs. The software producer is therefore
very interested in the benefit of time spent testing and in quantifying the
reliability of the software produced.
A software system consists of a finite number of lines of code, or in-
structions, which, given a sequence of inputs, create a specified sequence
of outputs. For a given input sequence, this output will not change unless the
code of the system is changed. When a software system is designed, the out-
put sequences are specified before the code is written. A bug can be defined
as an error in the code which causes the output sequence to differ from that
specified. As there are only a finite number of lines of code, there can only
The Role of Decision Analysis in Software Engineering 375

be a finite number of bugs. A fuller discussion of the software failure process


is given in Musa and Okumoto (1982).
In testing software the aim is to release software with as few bugs as pos-
sible. However, there is a trade-off between finding as many bugs as possible
and the cost of testing. One of the first attempts to use decision theory in
deciding when to release a piece of software was offered in Dalal and Mallows
(1988). In this paper a utility function is defined in terms of the test duration
and the number of bugs that must be fixed both before and after the soft-
ware is released. A probability model is given for the total number of bugs
in the system and the expected utility is found for a test of a given duration
during which a given number of bugs are detected. Under certain conditions
an optimal stopping rule is found. In certain cases a test of fixed duration
is optimal after which the software is released. For other cases the optimal
decision is to release the software after a predetermined number of bugs have
been detected. However, the optimality results hold only asymptotically, as
the total number of bugs in the system tends to infinity; thus the decision rule
chosen using this method will only be optimal for systems where this number
is large.
Zacks (1995) also offers an approach based on the expected utility at
any given point during testing. The software reliability model of Littlewood
and Verrall (1973) is used. A stopping rule is offered and its relationship to
the optimal stopping rule discussed. However, the rule is only shown to be
optimal asymptotically, at an infinite time in the future.
The works of Singpurwalla (1989, 1991) and Morali and Soyer (1995) define
a fixed testing strategy and do not suffer from the problem of asymptotic results.
The simplest testing strategy is the single-stage test. The software is tested for
a predetermined period, after which all bugs are corrected and the software
is released. The choice of the optimal duration of a single-stage test is
discussed in Singpurwalla (1991); the method offered is outlined in Section
3.1. Strategies for extending the testing to more than one stage are offered
in Singpurwalla (1989) and Morali and Soyer (1995). These approaches are
discussed in Section 3.2.

3.1 Single-Stage Testing

We denote the number of bugs in the system by N; this quantity is unknown
at the time of the decision. Under a single-stage test, the software is tested
for a total running time of τ, all bugs are corrected upon detection, and the
software is released upon completion of the test. The decision to be made is
the duration, τ, of the test phase. Let the number of bugs detected during
such a test be denoted by M(τ). If M(τ) = k, and we denote the successive
running times by T_1, ..., T_k, then we define

t(k) = {T_1, ..., T_k} for k ≥ 1,
t(k) = ∅ for k = 0.

3.1.1 The Decision of Optimal Test Duration. The decision tree for
the optimal test duration, τ, is shown in Figure 3.1. The tree consists of two

Fig. 3.1. The decision tree for the optimal test duration in single-stage testing

decision nodes, D_1 and D_2, and two random nodes, R_1 and R_2. At D_1 we
choose a value for τ. At D_2 the only possible action is to release the software.
The failures of the system when tested for time τ are observed at R_1, and the
failures of the system after delivery are observed at R_2.
The utility function U_R(N, M(τ), τ) reflects the cost to the software house
of the remaining (N - M(τ)) bugs in the system delivered, after testing for
time τ and discovering M(τ) = k bugs at times t(k). Obviously, to make
a useful choice of the test duration, a suitable form for this utility must be
specified. A plausible form is given in Section 3.1.2.
To obtain an optimal choice of τ, we start at the terminus of the tree and
work backwards, following the principle of maximization of expected utility.
The expected utility at the random node R_2 is found by averaging the utility
of N bugs over the distribution of N given k failures during the test period
τ and running times t(k); thus

E[U_R(N, M(τ), τ) | M(τ) = k, t(k)] = \sum_{j=0}^{\infty} U_R(k + j, k, τ) P(N = k + j | M(τ) = k, t(k)).   (3.1)
The only plausible decision at D_2 is to release, so we calculate the expected
value of the utility at R_1 for a single-stage test of duration τ, U(τ), by
averaging E[U_R(N, M(τ), τ) | M(τ) = k, t(k)] over the prior distributions of
M(τ) and t(k) given the history, H, to obtain

E[U(τ)] = \sum_{k=1}^{\infty} \int_t E[U_R(N, M(τ), τ) | M(τ) = k, t(k) = t]
          × P(M(τ) = k | H) p(t(k) = t | H) dt   (3.2)
          + E[U_R(N, M(τ), τ) | M(τ) = 0] × P(M(τ) = 0 | H).

The optimal value of τ, τ_0, is found by substituting (3.1) into (3.2) and
maximizing the expression obtained over τ.

This type of analysis is known as "pre-posterior", as we are making our
decision about τ before the data M(τ) = k and t(k) have been collected. Thus
in (3.2) we average the utility, given the possible values of the data, over the
prior predictive distribution of the data values.
3.1.2 A Plausible Utility Function. The choice of a realistic utility func-
tion is vital for the validity of the optimal decision. We require a function for the
cost to the company of having j bugs left in the system when delivered. This
utility ranges from the utility of having a virtually useless system, riddled with
bugs, denoted by a_1, to the utility of the company having a bug-free system,
denoted a_1 + a_2. Thus we define the utility of having j bugs in the
system to be a_1 + a_2 e^{-a_3 j}, where a_1, a_2 and a_3 are specified constants.
Apart from the cost of bugs in the system, there is also the cost of fixing
the bugs found during testing and the cost of testing itself. We denote by C
the cost of fixing a bug encountered during the testing phase and by f(τ)
the cost of a single-stage test of duration τ. The function f is non-negative
and monotone increasing; it includes the cost of testing and the cost of not
releasing the software until time τ.
Thus the total utility U_R(N, M(τ), τ), of a single-stage test of duration τ,
during which M(τ) = k bugs are discovered at times t(k), and after which
j = N - M(τ) bugs are left in the system upon delivery, is

U_R(N, M(τ), τ) = a_1 + a_2 e^{-a_3 (N - M(τ))} - C M(τ) - f(τ).   (3.3)
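The components of (3.3) are simple to evaluate once the constants are fixed. A sketch, using the a_1, a_2 and C values from the example of Section 3.1.3 with f(τ) = τ; the value a_3 = 1 and the inputs below are illustrative assumptions:

```python
import math

def bug_utility(j, a1=-10.0, a2=110.0, a3=1.0):
    """Utility of having j bugs left in the delivered system: a1 + a2*exp(-a3*j).
    a1 and a2 follow the example of Section 3.1.3; a3 = 1 is an assumed value."""
    return a1 + a2 * math.exp(-a3 * j)

def total_utility(n, k, tau, c=0.1, f=lambda t: t):
    """U_R(N, M(tau), tau) as in (3.3): utility of the j = n - k remaining bugs,
    less the cost c per bug fixed during testing and the testing cost f(tau)."""
    return bug_utility(n - k) - c * k - f(tau)

# Hypothetical scenario: 12 bugs in total, 10 found in a test of duration 5.5.
u = total_utility(n=12, k=10, tau=5.5)
```

Note how the first term interpolates between the bug-free utility a_1 + a_2 at j = 0 and the useless-system utility a_1 as j grows, while the last two terms charge for the testing effort itself.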
3.1.3 Application to an Error Counting Model. As the observable
quantity in the testing procedure is the number of failures in a fixed test pe-
riod, we must model the reliability of the software through an error-counting
model. There are many models that have been proposed for this purpose;
a model which has the qualities of simplicity and realism was proposed by
Jelinski and Moranda (1972). Although this model has come under much
criticism, it is sufficient for the purpose of demonstrating the optimal testing
strategy outlined in Section 3.1.
Letting T_1, T_2, ... denote the successive running times of the software
between failures, we define for t > 0

P(T_i > t | N, Λ) = e^{-Λ(N-i+1)t},   (3.4)

where Λ is an unknown constant of proportionality. The random variables
T_i are exponentially distributed with parameter Λ(N - i + 1). This is the
basic model of Jelinski and Moranda (1972). The prior on N is then assumed
to be Poisson, with parameter θ, and Λ is assumed a priori to be gamma
distributed, with shape parameter α and scale parameter μ. This setup has
been shown, in Langberg and Singpurwalla (1985), to unify several models in
software reliability and allows a wide variety of prior beliefs to be expressed.
The model assumes that each bug is corrected perfectly upon detection, with
no introduction of new bugs.

The general approach, outlined above, for finding the optimal test dura-
tion for single-stage testing can be used with this model. This involves finding
expressions for the posterior distribution of N, given M(τ) = k, t(k) and
the test duration τ, and the prior predictive probabilities that M(τ) = k and
\sum_{i=1}^{M(τ)} T_i = t, given the test duration τ, for k = 0, 1, 2, .... The distribu-
tions, see Singpurwalla (1991), are given by

P(N = k + j | M(τ) = k, t(k)) = W \frac{e^{-θ} θ^{k+j}}{j!} (μ + S + jτ)^{-(α+k)},   (3.5)

where W is a normalizing constant and S = \sum_{i=1}^{k} (k - i + 1) T_i;

P(M(τ) = k | H) = \int_0^{\infty} \frac{e^{-θ(1-e^{-λτ})} (θ(1 - e^{-λτ}))^k}{k!} \frac{e^{-μλ} (μλ)^{α-1} μ}{Γ(α)} dλ;   (3.6)

and

p(\sum_{i=1}^{M(τ)} T_i = t | M(τ) = k) = \int_0^{\infty} \frac{e^{-μλ} (μλ)^{α-1} μ}{Γ(α)} \frac{b^k λ^k}{(k-1)!} e^{-λt} \sum_{j=0}^{k_0} (-1)^j \binom{k}{j} (t - jτ)^{k-1} dλ,   (3.7)

for k_0 τ < t < (k_0 + 1)τ and k_0 = 0, 1, ..., (k - 1), where b = (1 - e^{-λτ})^{-1}.
The final expression for the expected utility given τ is given in Singpur-
walla (1991). The complexity of this expression is evident from its constituent
parts, (3.5), (3.6) and (3.7), which must be substituted into (3.2); the use of
computational methods is necessary for its calculation. A software implemen-
tation of this decision method is available. For details refer to the World Wide
Web page of The George Washington University's Institute of Reliability and
Risk Analysis (http://www.seas.gwu.edu/seas/institutes/irra).
Example. To use the above method to make an optimal choice of test duration
for a software system, the decision maker's prior beliefs must be specified. These
are the parameter θ of the Poisson distributed prior for N, and the shape, α, and
the scale, μ, of the gamma distributed prior for Λ. To illustrate the sensitivity
of the results to these input parameters, various values were chosen and the
expected utility curves were calculated.
The utility of delivering a useless, bug-ridden system, a_1, was chosen to
be -10. The utility of delivering a bug-free system, a_1 + a_2, was chosen to be
100, so a_2 was 110. The cost of fixing a bug discovered during testing, C,
was 0.1, and the cost of testing for time τ, f(τ), was chosen to be simply
τ.
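Rather than evaluating (3.5)-(3.7) analytically, the expected utility curve can also be approximated by Monte Carlo, using the equivalent formulation of Langberg and Singpurwalla (1985) in which the N bugs have i.i.d. exponential detection times with rate Λ. A sketch with the illustrative values θ = 10, α = 10, μ = 1; the simulation size and a_3 = 1 are assumptions, and the resulting curve is only a simulation estimate:

```python
import math, random

def poisson(rng, mean):
    """Knuth's Poisson sampler; adequate for the moderate means used here."""
    limit, k, p = math.exp(-mean), 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1

def expected_utility_curve(taus, theta=10.0, alpha=10.0, mu=1.0,
                           a1=-10.0, a2=110.0, a3=1.0, c=0.1, n_sim=5000):
    """Monte Carlo estimate of E[U(tau)] over a grid of test durations:
    N ~ Poisson(theta), Lambda ~ Gamma(alpha, rate mu), each bug detected at
    an i.i.d. Exp(Lambda) time; utility as in (3.3) with f(tau) = tau."""
    rng = random.Random(1)
    totals = [0.0] * len(taus)
    for _ in range(n_sim):
        lam = rng.gammavariate(alpha, 1.0 / mu)
        n = poisson(rng, theta)
        times = [rng.expovariate(lam) for _ in range(n)]
        for idx, tau in enumerate(taus):
            k = sum(1 for t in times if t <= tau)   # M(tau): bugs found by tau
            totals[idx] += a1 + a2 * math.exp(-a3 * (n - k)) - c * k - tau
    return [s / n_sim for s in totals]

taus = [0.5 * i for i in range(1, 21)]
curve = expected_utility_curve(taus)
tau_opt = taus[max(range(len(taus)), key=lambda i: curve[i])]  # estimate of tau_0
```

Maximizing the estimated curve over the grid gives an approximate optimal duration τ_0; the smoothness of the estimate improves with n_sim.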
Figure 3.2 shows the curves of the expected utilities when θ takes the
values 5, 10, 15, 20, 25 and 30, with α and μ fixed at 10 and 1 respectively.
The path of the optimal test duration, τ_0, for each value of θ is shown. As
can be seen, τ_0 increases with θ, but the expected utility of a test of duration
τ_0 decreases.

Fig. 3.2. The expected utilities for varying values of θ

Figure 3.3 shows the curves of the expected utilities for α taking the values
2, 3, 4 and 5, with θ and μ fixed at 10 and 1 respectively. Again the path of
the optimal test duration, τ_0, is shown. In this case, as α increases, both the
optimal test duration and the maximum utility increase.

3.2 Multiple-Stage Testing

The single-stage procedure discussed in Section 3.1 can be extended to multi-
ple stages of testing. Under this procedure, the test duration at each stage is
decided before any testing is performed. At the end of a given stage, any bugs
found are corrected. A decision must then be made whether to continue testing for
the predetermined time of the next stage or to release the software.
The decision tree for a double-stage test is shown in Figure 3.4. In this
figure we extend the notation of Section 3.1 to two possible test durations, τ_1
and τ_2, and two counts of the bugs detected during the testing stages, M_1(τ_1)
and M_2(τ_2). The utility of testing for the first stage and then releasing the
software, U_R(N, M_1(τ_1), τ_1), is denoted U_1. The utility of testing for both
stages before release, U_R(N, M_1(τ_1) + M_2(τ_2), τ_1 + τ_2), is denoted U_2. The
decision method for a two-stage test is outlined in Singpurwalla (1989). The
analysis outlined for the two-stage test is complex; adding further stages
increases this problem.

Fig. 3.3. The expected utilities for varying values of α

Fig. 3.4. The decision tree for a two stage test



In the multiple-stage testing strategy used by Morali and Soyer (1995), the
software is tested until a bug is detected, located and corrected. A decision is
made whether to stop testing and release the software or to continue testing
for another stage. The decision at each stage is based on our belief of whether
testing for another stage would be beneficial. We therefore set up the problem
as a sequential decision problem and give a one-stage look ahead decision rule
for a given class of utility functions.
3.2.1 The Sequential Decision Problem. We denote the life length of
the software in the i-th stage of testing by T_i, for i = 1, 2, .... A common
view of software is that it does not age or deteriorate with time; thus it
is assumed that the failure rate is constant if the code is not changed. The
failure rate of the software during the i-th stage, i.e. the failure rate after
the (i - 1)-st modification, is denoted by θ_i. Thus the random variables T_i
are exponentially distributed with parameter θ_i. At the end of the i-th stage
of testing, our decision is based upon T^{(i)} = {T^{(0)}, T_1, ..., T_i}, where T^{(0)}
denotes the prior information about the failure characteristics of the software
before testing.

Fig. 3.5. The decision tree for the software release problem

For theoretical purposes there must be a limit to the number of testing


stages. Thus after a predetermined number of stages of testing the software
is automatically released. However, as this number can be set as large as is
desired the number of stages is effectively infinite. The decision tree for the
first two stages of testing is shown in Figure 3.5; the tree is then repeated

for further stages. Thus we have at stage i a decision node, D_i, where i =
0, 1, 2, ...; the choice at this node is whether to STOP and release the software
or to TEST the software for another stage.
The utility of a test of duration t is denoted U_D(t) and the utility associ-
ated with releasing a piece of software with failure rate θ is denoted U_R(θ).
The solution of the tree follows the usual path of maximization of expected
utilities, as in the previous sections. This means that at each node we must
look at the expected utilities for the STOP and the TEST decisions and take
the maximum. After i stages of testing, the expected utility of the STOP
decision is given by E[U_R(θ_{i+1}) | T^{(i)}] and the expected utility of the TEST
decision is given by E[U_D(T_{i+1}) | T^{(i)}] + U_{i+1}, where

U_i = max { E[U_R(θ_{i+1}) | T^{(i)}], E[U_D(T_{i+1}) | T^{(i)}] + U_{i+1} }.   (3.8)


The calculation of U_i is complex, requiring the computation of many
expectations and maximizations at each stage. In van Dorp et al. (1994), it
is rewritten as

U_i = max_{δ = 0, 1, 2, ...} { U_i^{(δ)} },   (3.9)

where

U_i^{(δ)} = \sum_{j=1}^{δ} E[U_D(T_{i+j}) | T^{(i)}] + E[U_R(θ_{i+δ+1}) | T^{(i)}]

is the additional expected utility associated with testing for δ more stages
after the i-th modification to the software.
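With the horizon fixed at a finite number of stages, the recursion (3.8) can be solved by backward induction. A toy sketch in which the expected utilities are hypothetical numbers (negative values, read as costs), not outputs of any model:

```python
def optimal_values(stop_utils, test_utils):
    """Backward induction for (3.8): stop_utils[i] = E[U_R(theta_{i+1}) | T^(i)],
    test_utils[i] = E[U_D(T_{i+1}) | T^(i)]. The software must be released at
    the horizon, so U_n = stop_utils[n]."""
    n = len(stop_utils) - 1
    u = [0.0] * (n + 1)
    u[n] = stop_utils[n]                      # forced release at the horizon
    for i in range(n - 1, -1, -1):
        u[i] = max(stop_utils[i], test_utils[i] + u[i + 1])
    return u

# Hypothetical values: releasing improves with each stage, each stage costs 1.
stop = [-10.0, -6.0, -4.0, -3.5]
test = [-1.0, -1.0, -1.0, -1.0]
u = optimal_values(stop, test)   # [-6.0, -5.0, -4.0, -3.5]
```

In this toy run, TEST is preferred at stages 0 and 1, and STOP at stage 2, since -4.0 exceeds -1.0 + (-3.5).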
In Morali and Soyer (1995), a theorem is given that shows the existence of
an optimal stopping rule under certain conditions on the expected utilities.
It states that if E[U_D(T_j) | T^{(i)}] is increasing in j and E[U_R(θ_j) | T^{(i)}] is
discrete convex in j, for j = i + 1, ..., then the optimal stopping rule for the
tree in Figure 3.5 is

U_i^{(1)} - U_i^{(0)} > 0 → Continue Testing,
U_i^{(1)} - U_i^{(0)} ≤ 0 → Stop Testing and Release.   (3.10)

Thus, under the conditions of the theorem, a one-stage look ahead rule is
optimal. The software is released if the expected decrease in utility due to
testing an additional stage is greater than the expected increase in utility due
to the improvement in reliability resulting from an additional testing stage.
3.2.2 A Model for the Inter-Failure Times of Software. Under the
testing strategy described in Section 3.2.1, the observable quantities are the
inter-failure times between successive modifications of the software. Chen
and Singpurwalla (1994) use a Kalman filter model where the observation
equation links the inter-failure time, T_i, to the failure rate during a given
testing stage; specifically

T_i = ε_i / θ_i,

where ε_i is exponentially distributed with parameter one. The failure rate of


the software is constant over each testing stage. However, it changes from one
stage to another as the software is modified. This change is modeled using
the state equation. The form used permits both reliability growth and decay
of the software. The state equation is

θ_i = ρ θ_{i-1} e_i,   (3.11)

where e_i ~ Beta(γα_{i-1}, (1 - γ)α_{i-1}) and ρ, γ and α_{i-1} are known non-
negative quantities, with 0 < γ < 1. The relationship in (3.11) implies that
θ_i < ρ θ_{i-1}. It is next assumed that, given T^{(i-1)}, θ_{i-1} has a gamma distri-
bution with shape parameter α_{i-1} and scale parameter β_{i-1}.
Chen and Singpurwalla (1994) note that the parameter ρ provides infor-
mation about whether the reliability of the software is being improved or
not. When bugs are corrected, it is possible that further bugs are introduced.
If ρ < 1, then the failure rate of the software is strictly decreasing from one
stage to the next. If ρ > 1, then the failure rate may be increasing. However,
the value of ρ will be unknown, and so a prior distribution is assigned. This
prior is updated with the test data using the standard Bayesian machinery;
the likelihood can be obtained from the predictive distribution of T_i given
T_{i-1} and ρ. Thus we can track the growth or decay in the reliability of the
software through the distribution of ρ given the lengths of the test stages, T_i
for i = 1, 2, ....
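The tracking of ρ can be sketched as a grid-based Bayes update over a discretized prior. The likelihood below is a stand-in placeholder, not the model's actual predictive density of T_i given T_{i-1} and ρ, which comes from the Kalman filter recursions of Chen and Singpurwalla (1994); the grid and observed times are illustrative:

```python
import math

def update_rho(posterior, grid, likelihood, t_obs):
    """One Bayesian update of a discretized distribution over rho: the mass at
    each grid point is reweighted by the likelihood of the observed
    inter-failure time and then renormalized."""
    weights = [p * likelihood(t_obs, rho) for p, rho in zip(posterior, grid)]
    total = sum(weights)
    return [w / total for w in weights]

grid = [1.0 + 0.01 * i for i in range(100)]          # 100-point grid on [1, 2]
post = [1.0 / len(grid)] * len(grid)                 # uniform placeholder prior
stand_in = lambda t, rho: rho * math.exp(-rho * t)   # NOT the true predictive
for t in [0.5, 1.2, 2.0]:                            # hypothetical test data
    post = update_rho(post, grid, stand_in, t)
```

Each observed inter-failure time shifts mass across the grid, mimicking the way the plots of Figure 3.6 evolve over successive stages of testing.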
3.2.3 A One-Stage Look Ahead Decision Rule. To apply the model
proposed by Chen and Singpurwalla (1994) to the decision methodology out-
lined in Section 3.2.1, we must first specify the utility functions, U_D(T_i) and
U_R(θ_{i+1}).
The utility function U_D(T_i) can reasonably be assumed to be decreasing
in T_i. Defining the cost per time unit of testing as k_D, we obtain the utility
function

U_D(T_i) = -k_D T_i.   (3.12)

If a company releases an unreliable piece of software, there will be an associ-
ated loss in profits. Morali and Soyer (1995) offer the following utility function
to express this loss in terms of the failure rate of the released software:

U_R(θ) = -k_R θ.   (3.13)
To use the optimal stopping rule for the i-th stage given in (3.10), the
applicability of the theorem given in Morali and Soyer (1995) must first be
shown for this model and these utility functions; this is examined for the cases
ρ = 1 and ρ > 1. For the case ρ > 1, the sufficient conditions are γρ < 1 and
γ > 0.5, while for ρ = 1 the sufficient condition is γ > 0.5.
As ρ is treated as an unknown, we assert the optimality of the one-stage
look ahead rule in (3.10) using probability statements. The utilities for the
stopping rule conditional on ρ are given in Morali and Soyer (1995) as

U_i^{(0)} = - { k_R (α_i/β_i)(γρ) },
U_i^{(1)} = - { k_D β_i / (ρ(γα_i - 1)) + k_R (α_i/β_i)(γρ)^2 }.   (3.14)

Using the posterior distribution of ρ, we can average the utilities in (3.14) to
obtain the utilities U_i^{(δ)} unconditional on ρ, as required for the stopping rule.
Thus we proceed by giving prior distributions for the reliability of the
software system, running the software until a failure occurs, and updating
the model using this test data. The decision is then made by first checking
the conditions for optimality of the one-stage look ahead rule; if the proba-
bility that these conditions hold is sufficiently high, then the expected utilities
for releasing the software, E[U_i^{(0)}], and for testing for a further stage, E[U_i^{(1)}],
are computed. The decision rule, given in (3.10), is then applied. If the decision
is to STOP, then the software is released and our decision process is finished.
Otherwise, another stage of testing is performed and we effectively
start from the beginning of the procedure, using the posteriors obtained as
our priors.
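One pass of this decision step can be sketched as follows. The utility expressions follow (3.14), conditional on ρ and averaged over a discretized posterior for ρ; they are a reconstruction to be checked against Morali and Soyer (1995), and the two-point posterior on ρ below is purely hypothetical:

```python
def look_ahead_decision(alpha_i, beta_i, gamma, k_d, k_r, rho_dist):
    """One-stage look ahead rule (3.10): average U^(0) and U^(1) of (3.14)
    over rho_dist, a list of (rho, probability) pairs, then compare.
    Requires gamma * alpha_i > 1."""
    u0 = u1 = 0.0
    for rho, p in rho_dist:
        u0 -= p * k_r * (alpha_i / beta_i) * gamma * rho
        u1 -= p * (k_d * beta_i / (rho * (gamma * alpha_i - 1))
                   + k_r * (alpha_i / beta_i) * (gamma * rho) ** 2)
    return "TEST" if u1 - u0 > 0 else "STOP"

# Constants from the example below: gamma = 0.8, k_D = 1, k_R = 100000,
# alpha_0 = beta_0 = 2, with a hypothetical two-point posterior on rho.
decision = look_ahead_decision(alpha_i=2.0, beta_i=2.0, gamma=0.8,
                               k_d=1.0, k_r=100000.0,
                               rho_dist=[(1.1, 0.5), (1.3, 0.5)])
```

With these inputs the large k_R makes the reliability gain from one more stage outweigh the testing cost, so the sketch returns TEST.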
Example. A 100-point discretized beta distribution on the range [1, 2] was
chosen as the prior on ρ; this prior distribution is discussed in Morali and
Soyer (1995). The parameters of the beta distribution, c and d, were chosen to
be 1.25 and 5 respectively. The prior parameters α_0 and β_0 were both chosen
to be 2, and the parameter γ was given the value 0.8. The utility constants
k_D and k_R were chosen to be 1 and 100,000 respectively.
We note that this setup does not guarantee the optimality of the one-
stage look ahead decision rule, because γρ > 1 for ρ > 1.25. However, as can
be seen from the plots in Figure 3.6, the probability that ρ > 1.25 decreases
over subsequent stages of testing. The conditions for the optimality of the
rule are therefore likely to hold at the later stages.
Figure 3.7 shows the expected additional costs for further testing of the
software after having tested it for 0, 2, 3 and 5 stages. It can be seen from
the graph showing the additional expected costs of further testing after the
fifth stage that, under the one-stage look ahead decision rule given in (3.10),
one would release the software after stage 5.

4. Conclusion

We have offered a probabilistic model for classifying the maturity of the


quality control procedures of a software house. This model was then used

Fig. 3.6. Distributions of ρ at the 0th, 2nd, 3rd and 5th stage of testing

Fig. 3.7. The expected additional costs of testing for further stages after the 0th,
2nd, 3rd and 5th stage

to make a choice of the optimal software house to hire to create a software


system to fill a particular software requirement.
The area of optimal testing strategies was discussed and the decision
methods for various testing strategies outlined. The first strategy considered
was a single-stage test. An existing software reliability model, of the error
counting type, was used to make the decision of optimal system test duration
prior to release of a software system. The optimal decision was demonstrated
for various values of the prior parameters under a utility scheme. The outline
of this procedure followed the development in Singpurwalla (1991).
Extensions to multiple-stage tests were discussed. A model based on the
inter-failure times between successive modifications of the software was used
in making a one-stage look ahead decision under a testing procedure with
multiple stages. After each bug was detected, the rule decided whether to
release the software or to continue testing. An example of the use of this method
was given. A full development of the decision procedure is given in Morali
and Soyer (1995).
The methods offered for making the decisions above all have the flavor
of Bayesian modeling of the unknown factors in the decision and the use
of decision theory to make the optimal decision given the information avail-
able. This approach incorporates prior information from experts, enabling
the decisions to be made in the absence of large amounts of data.
This general approach is not limited to the decisions covered in this pa-
per. If the unknown factors in the decision can be modeled probabilistically
and the utilities of the decision maker can be quantified, then the use of a
decision tree gives a coherent method for making any such decision given the
information available.

Acknowledgement. This research was supported by the Army Research Office grant
DAAH04-93-G-0020 and the Air Force Office of Scientific Research grant AFOSR-
F49620-95-1-0107.

References

Chen, Y., Singpurwalla, N.D.: A Non-Gaussian Kalman Filter Model for Tracking
Software Reliability. Statistica Sinica 4, 535-548 (1994)
Dalal, S.R., Mallows, C.L.: When Should One Stop Testing Software? J. Amer.
Statist. Assoc. 83, 872-879 (1988)
van Dorp, J.R., Mazzuchi, T.A., Soyer, R.: Sequential Inference and Decision Mak-
ing During Product Development. Under review (1994)
Forman, E.H., Singpurwalla, N.D.: An Empirical Stopping Rule for Debugging and
Testing Computer Software. J. Amer. Statist. Assoc. 72, 750-757 (1977)
Forman, E.H., Singpurwalla, N.D.: Optimal Time Intervals for Testing Hypotheses
on Computer Software Errors. IEEE Trans. Rel. R-28, 250-253 (1979)
French, S.: Decision Theory: An Introduction to the Mathematics of Rationality.
New York: Wiley 1986

Humphrey, W.S.: Managing the Software Process. SEI (The SEI Series in Software
Engineering). Reading: Addison-Wesley 1989
Jelinski, Z., Moranda, P. B.: Software Reliability Research. Computer Performance
Evaluation. New York: Academic Press 1972, pp. 485-502
Langberg, N., Singpurwalla, N.D.: A Unification of Some Software Reliability Mod-
els. SIAM J. Sci. Statist. Comput. 6, 781-790 (1985)
Landry, C., Singpurwalla, N.D.: A Probabilistic Capability Maturity Model for
Rating Software Development Houses. Technical Report IRRA-TR-95/3. IRRA
(1995)
Littlewood, B., Verrall, J.L.: A Bayesian Reliability Growth Model for Computer
Software. J. Roy. Statist. Soc. C 22, 332-346 (1973)
Morali, N., Soyer, R.: Optimal Stopping Rules for Software Testing. Under review
(1995)
Musa, J.D., Okumoto, K.: Software Reliability Models: Concepts, Classification,
Comparisons and Practice. Electronic Systems Effectiveness and Life Cycle
Costing. New York: Springer 1982, pp. 395-423
Okumoto, K., Goel, A.L.: Optimum Release Time For Software Systems, Based on
Reliability and Cost Criteria. J. Syst. Software 1, 315-318 (1980)
Paulk, M.C., Curtis, B., Weber, C.V.: Capability Maturity Model, Version 1.1.
IEEE Software, (1993a)
Paulk, M.C., Curtis, B., Weber, C.V.: Capability Maturity Model, Version 1.1.
Technical Report CMU/SEI-93-TR-24. SEI (1993b)
Ross, S.M.: Software Reliability: The Stopping Rule Problem. IEEE Trans. Software
Eng. SE-11, 1472-1476 (1985)
Singpurwalla, N.D.: Pre-Posterior Analysis in Software Testing. Statistical Data
Analysis and Inference. Amsterdam: North-Holland 1989
Singpurwalla, N.D.: Determining an Optimal Time Interval for Testing and Debug-
ging Software. IEEE Trans. Software Eng. SE-17, 313-319 (1991)
Yamada, S., Narihisa, H., Osaki, S.: Optimum Release Policies for a Software Sys-
tem with a Scheduled Software Delivery Time. J. Roy. Statist. Soc. B 54, (1984)
Zacks, S.: Sequential Procedures in Software Reliability Testing. In: Recent Ad-
vances in Life-Testing and Reliability. Boca Raton: CRC Press 1995, pp. 107-
126
Analysis of Software Failure Data
Refik Soyer
Department of Management Science, The George Washington University,
Washington DC 20052, USA

Summary. In this chapter we discuss Bayesian analysis of software failure data


by using some of the software reliability models introduced by Singpurwalla and
Soyer (1996). In so doing, we present details concerning Bayesian inference in these
models, and discuss what insights can be obtained from the models when they are
applied to real data. We also present approximation procedures that facilitate the
Bayesian analysis and discuss model comparison.

Keywords. Autoregressive processes, Bayesian inference, data augmentation, Gibbs


sampling, hierarchical models, Kalman filtering, point processes, posterior approx-
imations

1. Introduction
Analysis of software failure data is the most practical test of the validity of
software reliability models. Implementation of the models, presented by
Singpurwalla and Soyer (1996) in this volume, requires estimation of the un-
known model parameters. In this chapter, we will adopt the Bayesian point
of view to analyze software failure data using some of the software reliability
models. The Bayesian approach provides a coherent framework for making
inference via the probability calculus and decision making via maximization of
expected utility (see Merrick and Singpurwalla 1996 in this volume). In so
doing, it also provides a formalism to incorporate expert opinion, as discussed
in Singpurwalla and Soyer (1996). In addition to these, Bayesian esti-
mation does not suffer from the well documented difficulties of maximum
likelihood estimation (see, for example, Meinhold and Singpurwalla 1983 and
Campodónico and Singpurwalla 1994).
We consider Bayesian analysis of software failure data using four different
models. For each model, we present details concerning Bayesian inference,
and discuss what insights about reliability of software can be obtained from
the models when they are applied to real data. We also discuss comparison of
the predictive performance of competing models. In Bayesian analysis of some
of the models, the relevant posterior and predictive distributions cannot be
obtained analytically. In such cases, posterior approximation methods
such as the one proposed by Lindley (1980) and Markov Chain Monte Carlo
(MCMC) methods such as the Gibbs sampler (see, for example, Gelfand and
Smith 1990) facilitate the Bayesian analysis. An overview of these methods
is also given.
In Section 2, we discuss the hierarchical Bayes setup of the Littlewood-
Verrall (1973) model proposed by Mazzuchi and Soyer (1988) and present
390 Refik Soyer

inference results. We analyze the Naval Tactical Data System of Jelinski and
Moranda (1972) and compare two competing models considered by Mazzuchi
and Soyer (1988). We also discuss the Gibbs sampling approach of Kuo and
Tang (1995) to a generalization of these models. In Section 3, we present the
analysis of the 'System 40' data of Musa (1979) using the Kalman filter types
of models of Singpurwalla and Soyer (1985, 1992) and Chen and Singpurwalla
(1994). In Section 4, we present the Bayesian analysis of the logarithmic Poisson
execution time model of Musa and Okumoto (1984), which was developed by
Campodónico and Singpurwalla (1994).

2. Analysis Using a Hierarchical Bayes Model


Mazzuchi and Soyer (1988) considered an extension to the Littlewood and
Verrall model by formulating it as a hierarchical Bayes model. The authors
pointed out that in the Littlewood and Verrall model, uncertainty about the
hyperparameters of the gamma distribution of the λ_i's is not described
probabilistically. Only the uncertainty about the shape parameter of the
gamma density is described by a uniform prior, whereas the other model
parameters are treated as unknown but fixed quantities and estimated by a
maximum likelihood procedure. Thus, in the sense of Deely and Lindley (1981),
the Littlewood-Verrall model is a parametric empirical Bayes model. However, a
fully Bayesian analysis of the Littlewood-Verrall model can be achieved by
formulating the problem as a hierarchical Bayes (or Bayes-empirical-Bayes)
model.
As before, we denote the time between the (i-1)-st and the i-th failure by
T_i, which is exponential with failure rate λ_i, denoted as (T_i | λ_i) ~
Exp(λ_i). In describing the behavior of the λ_i's, two models are considered:
Model A. To reflect the notion that the performance of the software might
improve or deteriorate as a result of an attempted removal of a fault, a
weaker dependence structure, namely, exchangeability of the λ_i's, is assumed.
As described in Singpurwalla and Soyer (1996), λ_i is described by a gamma
distribution with shape parameter α and scale parameter β, denoted as (λ_i |
α, β) ~ Gam(α, β). The exchangeability of the λ_i's is achieved by assuming
that α and β themselves are described by probability distributions. Mazzuchi
and Soyer assumed that α and β are independent, with α described by a uniform
density, denoted as (α | v) ~ Uni(0, v), and (β | a, b) ~ Gam(a, b); where v, a
and b are specified parameters.
Model B. This model considers a hierarchical Bayes formulation of the
Littlewood and Verrall model by assuming that (λ_i | α, ψ(i)) ~ Gam(α, ψ(i))
with ψ(i) = β_0 + β_1 i, and describing uncertainty about the parameters
(α, β_0, β_1) probabilistically via π(α, β_0, β_1 | a, b, c, d, w). The authors
assumed that α is independent of (β_0, β_1), with (α | w) ~ Uni(0, w),
(β_1 | c, d) ~ Gam(c, d), and (β_0 | a, b, β_1) a shifted gamma of the form
Analysis of Software Failure Data 391

π(β_0 | a, b, β_1) = [b^a / Γ(a)] (β_0 + β_1)^(a-1) e^(-b(β_0 + β_1)),   β_0 ≥ -β_1.

Note that Model A is obtained as a special case of Model B by assuming
β_1 to be degenerate at 0.
Given n times between failures, t^(n) = (t_1, t_2, ..., t_n), our objective is
to infer the failure rate λ_n and the next time to failure T_(n+1). The
posterior distribution of λ_n given t^(n) is given by

P(λ_n | t^(n)) = ∫∫∫ P(λ_n | t^(n), α, β_0, β_1) π(α, β_0, β_1 | t^(n)) dα dβ_0 dβ_1,   (2.1)

where P(λ_n | t^(n), α, β_0, β_1) is the conditional posterior distribution of
λ_n given (α, β_0, β_1), and π(α, β_0, β_1 | t^(n)) is the joint posterior
distribution of (α, β_0, β_1).
It can be shown (using the assumptions of Mazzuchi and Soyer) that, given
T_n = t_n, α, β_0, β_1, λ_n is independent of all other T_i's, with density

P(λ_n | t^(n), α, β_0, β_1) = (λ_n)^α (t_n + β_0 + β_1 n)^(α+1) e^(-λ_n(t_n + β_0 + β_1 n)) / Γ(α + 1),   (2.2)

that is, (λ_n | t^(n), α, β_0, β_1) ~ Gam(α + 1, t_n + β_0 + β_1 n).


The posterior distribution π(α, β_0, β_1 | t^(n)) is obtained via Bayes' rule,

π(α, β_0, β_1 | t^(n)) ∝ L(α, β_0, β_1 | t^(n)) π(α, β_0, β_1),   (2.3)

where L(α, β_0, β_1 | t^(n)) is the likelihood function of (α, β_0, β_1) and
π(α, β_0, β_1) is the prior, with dependence on the hyperparameters
suppressed. It can be shown that

L(α, β_0, β_1 | t^(n)) = ∏_{i=1}^n p(t_i | α, β_0, β_1) = ∏_{i=1}^n α (β_0 + β_1 i)^α / (t_i + β_0 + β_1 i)^(α+1).   (2.4)

The predictive distribution of T_(n+1) given t^(n) is obtained by

p(t_(n+1) | t^(n)) = ∫∫∫ p(t_(n+1) | t^(n), α, β_0, β_1) π(α, β_0, β_1 | t^(n)) dα dβ_0 dβ_1,   (2.5)

where p(t_(n+1) | t^(n), α, β_0, β_1) = p(t_(n+1) | α, β_0, β_1).
As pointed out by Mazzuchi and Soyer (1988), any reasonable joint prior
distribution for (α, β_0, β_1) leads to integrals in (2.1) and (2.5) which
cannot be expressed in closed form. The authors used Lindley's approximation
to evaluate these integrals.
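As a rough sketch of what a direct numerical treatment of these integrals looks like, the snippet below approximates the posterior mean of λ_n for Model A (β_1 degenerate at 0) by brute-force integration over a grid, combining the likelihood (2.4) with the conditional mean (2.8). The failure times, the grid ranges, and the reading of Gam(a, b) as shape-rate are all illustrative assumptions of this sketch, not the authors' computation.

```python
import numpy as np

# Model A sketch: alpha ~ Uni(0, v) (truncated to the grid shown),
# beta ~ Gam(a, b) read as shape a, rate b -- an assumption.
t = np.array([9.0, 12.0, 11.0, 4.0, 7.0, 2.0])   # illustrative failure times
a, b = 10.0, 0.1

alpha = np.linspace(1e-3, 20.0, 400)
beta = np.linspace(1e-3, 400.0, 400)
A, B = np.meshgrid(alpha, beta, indexing="ij")

# log-likelihood (2.4) with beta_1 = 0: prod_i alpha*beta^alpha/(t_i+beta)^(alpha+1)
loglik = sum(np.log(A) + A * np.log(B) - (A + 1.0) * np.log(ti + B) for ti in t)
logprior = (a - 1.0) * np.log(B) - b * B          # Gam(a, b) density up to a constant
logpost = loglik + logprior
w = np.exp(logpost - logpost.max())
w /= w.sum()                                      # normalized grid weights

# posterior mean of lambda_n: average (2.8), (alpha+1)/(t_n+beta), over the grid
post_mean_rate = float(np.sum(w * (A + 1.0) / (t[-1] + B)))
print(post_mean_rate)
```

On a fine enough grid this mimics (2.1) directly; Lindley's approximation, discussed next, avoids the grid altogether.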

2.1 Inference Using Lindley's Approximation

Lindley (1980) develops asymptotic expansions for the ratio of integrals of
the form

∫ U(θ) e^(A(θ)) dθ / ∫ e^(A(θ)) dθ,   (2.6)

where θ = (θ_1, θ_2, ..., θ_m) is an m-dimensional vector of parameters;

π(θ) is the prior for θ;
A(θ) = H(θ) + L(θ);
L(θ) is the log-likelihood, with the dependence on the data suppressed;
H(θ) = log π(θ);
U(θ) is some function of θ that is of interest.

For example, if U(θ) = θ, then the above integral is the mean of the
posterior distribution of θ. Lindley's approximation is concerned with the
asymptotic behavior of (2.6) as the sample size gets large. The idea is to
obtain a Taylor series expansion of all the functions of θ in (2.6) about θ̂,
the posterior mode. The approximation to (2.6) is:

Û = U(θ̂) + (1/2) [ Σ_{i,j=1}^m U_{i,j} σ_{i,j} + Σ_{i,j,k,l=1}^m A_{i,j,k} σ_{i,j} σ_{k,l} U_l ],   (2.7)

where

U_i = ∂U(θ)/∂θ_i |_{θ=θ̂},   U_{i,j} = ∂²U(θ)/∂θ_i ∂θ_j |_{θ=θ̂},   A_{i,j,k} = ∂³A(θ)/∂θ_i ∂θ_j ∂θ_k |_{θ=θ̂},

and the σ_{i,j} are the elements of minus the inverse Hessian of A at θ̂. An
alternative approximation is due to Tierney and Kadane (1986).
Using (2.6) with θ = (α, β_0, β_1), the authors were able to obtain computable
results for the distributions given by (2.1) and (2.5) and their corresponding
moments. For example, the posterior mean of the failure rate, E(λ_n | t^(n)),
can be evaluated by setting U(θ) = E(λ_n | t^(n), α, β_0, β_1), where

E(λ_n | t^(n), α, β_0, β_1) = (α + 1) / (t_n + β_0 + β_1 n).   (2.8)

Similarly, by setting U(θ) = E(T_(n+1) | t^(n), α, β_0, β_1), where

E(T_(n+1) | t^(n), α, β_0, β_1) = (β_0 + β_1(n + 1)) / (α - 1),   (2.9)

we can obtain E(T_(n+1) | t^(n)), the predictive mean of the next time between
failures.
Note that inference results for Model A can be obtained by assuming β_1
to be degenerate at 0 in the above development.
The authors applied the two models to the software failure data from
the development of the Naval Tactical Data System reported in Jelinski and
Moranda (1972). This complex system consists of 38 distinct modules, and
the data are based on trouble reports from one of the larger modules, the
A-module. The data consist of the number of days between the 26 failures that
occurred during the production phase of the software.
In analyzing the data, Mazzuchi and Soyer selected (arbitrarily) the values
a = 10, b = 0.1, v = 500 for Model A. For Model B, the values a = 10,
b = 0.1, c = 2, d = 0.25, w = 500 were selected so that, initially, the two
models were similar. In particular, the above parameters were selected so that
the prior distribution of α was the same for both models, and the prior
distribution of β_0 + β_1 for Model B was the same as the prior distribution
of β for Model A. The Lindley approximation was used by the authors to obtain
the posterior means of the λ_i's, the predictive distributions, and the
predictive means of the T_i's at each stage for both models.
Table 2.1 presents the actual times between failure along with the
predictive means of the T_i's for each model. Except for an almost uniform
difference, the behavior of the predictive means from the two models is very
similar. As pointed out by Mazzuchi and Soyer (1988), the predictive means of
the two models differ by β_1 i/(α - 1), given that β/(α - 1) for Model A is
equivalent to (β_0 + β_1)/(α - 1) for Model B. This difference is due to the
growth parameter β_1 of Model B.
The plot of the posterior means of λ_i (the posterior mean of the failure rate
of the i-th time between failures) versus i gives an impression of the
behavior of the failure rates from one stage to another. This in turn displays
the overall effect of the modifications at each stage. This is shown in
Figure 2.1. Though both models pick up an apparent reliability growth during
the initial and later stages of testing and an apparent reliability decay
during the middle stages, Model A is more responsive to the pattern changes
present in the failure data. This is indeed understandable since the
underlying structure of Model B is stronger due to the stochastic ordering
assumption, and this assumption is at odds with the data observed in the
middle stages.
Mazzuchi and Soyer (1987) analyzed the same data by using the posterior
approximation technique of Tierney and Kadane (1986) and obtained almost
identical results.

2.2 Model Comparison

The Bayesian paradigm enables us to formally compare two models in terms
of the ratio of likelihoods of the observed values based on their predictive
distributions. An overview of this approach, due to Roberts (1965), is given
below.
Consider two models A and B. Given that we have no prior preference of
one model over the other, after observing failure data t^(n) = (t_1, t_2, ..., t_n),
models A and B can be compared by computing the posterior ratio:

∏_{i=1}^n [ p(t_i | t^(i-1), A) / p(t_i | t^(i-1), B) ],   (2.10)

where p(t_i | t^(i-1), A) and p(t_i | t^(i-1), B) are obtained by replacing T_i by its

Table 2.1. Actual and predictive means of times between failure


 i    T_i     E(T_i | t^(i-1)) for Model A    E(T_i | t^(i-1)) for Model B
 1     9.00   NA       NA
 2    12.00   10.53     9.75
 3    11.00   11.84    11.36
 4     4.00   11.79    11.77
 5     7.00    9.64    10.09
 6     2.00    9.14     9.87
 7     5.00    7.85     8.74
 8     8.00    7.44     8.45
 9     5.00    7.55     8.71
10     7.00    7.27     8.50
11     1.00    7.27     8.61
12     6.00    6.66     7.92
13     1.00    6.62     7.93
14     9.00    6.16     7.35
15     4.00    6.39     7.70
16     1.00    6.23     7.50
17     3.00    5.89     7.03
18     3.00    5.71     6.78
19     6.00    5.56     6.55
20     1.00    5.59     6.61
21    11.00    5.35     6.23
22    33.00    5.64     6.68
23     7.00    6.84     8.52
24    91.00    6.86     8.57
25     2.00    9.73    13.10
26     1.00    9.39    12.66

observed value, t_i, in the predictive distribution of T_i given t^(i-1), for
models A and B respectively.
If the posterior ratio is greater than 1, then model A is preferred to model
B; otherwise the reverse is true. Equation (2.10) provides a global measure
for comparing the two models. An alternative strategy is to compare the
predictive performance of the models with respect to each observation. Such
a local measure is given by the likelihood ratio:

p(t_i | t^(i-1), A) / p(t_i | t^(i-1), B).   (2.11)

As before if the likelihood ratio is greater than 1, then model A is the preferred
model for the i-th observation.
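A minimal sketch of these two comparison measures, with the one-step-ahead predictive densities supplied as plain numbers (the values below are made up for illustration):

```python
import math

def compare_models(pred_dens_A, pred_dens_B):
    """Local likelihood ratios (2.11) and the global posterior ratio (2.10)
    from one-step-ahead predictive densities p(t_i | t^(i-1), .)."""
    local = [pa / pb for pa, pb in zip(pred_dens_A, pred_dens_B)]
    return local, math.prod(local)       # the global ratio is the product

# hypothetical predictive densities for five observations
local, global_ratio = compare_models([0.08, 0.05, 0.06, 0.04, 0.07],
                                     [0.07, 0.06, 0.05, 0.05, 0.06])
```

A local ratio above 1 favors Model A at that stage, while the product summarizes the global preference.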
Mazzuchi and Soyer (1988) compared the predictive performances of the
two models using both global and local measures as shown in Figure 2.2.
Using only the global criterion, Model B would be preferred to Model A.

[Plot: posterior means of the failure rate versus testing stage.]

Fig. 2.1. Comparison of posterior means of the failure rates.

The occurrence of two surprising observations in stages 22 and 24 has an
overpowering effect on the global measure. However, Figure 2.2 shows that,
for i < 24, the posterior ratio is either above 1.0 or only slightly less,
implying a preference in favor of Model A. This is also verified in the first
plot in Figure 2.2, where most of the likelihood ratios are above 1.0. Thus an
evaluation of the local and global measures of comparison together shows that
overall Model A has done a better job of prediction than Model B.
Extensions of the hierarchical model were considered by Kuo and Tang
(1995), who assumed a k-th order polynomial for ψ(i), that is,
ψ(i) = β_0 + β_1 i + ... + β_k i^k. The authors used Gibbs sampling for the
Bayesian computations. We will give an overview of their approach below using
ψ(i) = β_0 + β_1 i.

2.3 Inference Using Gibbs Sampler


As before, let θ = (θ_1, θ_2, ..., θ_m) denote some unknown quantities, such
as the failure rates of the software at different stages of testing. Suppose
that interest centers around the joint distribution of θ as well as the
marginal distributions of the individual θ_i's. Let π(θ | t^(n)) denote the
posterior distribution of θ. The Gibbs sampler enables the drawing of samples
from π(θ | t^(n)) without actually computing its distributional form. This is
achieved by successive drawings from the full conditional distributions
π(θ_i | θ^(-i), t^(n)), where θ^(-i) = {θ_j | j ≠ i, j = 1, 2, ..., m}.
The process starts with a vector of arbitrary starting values
θ^0 = (θ_1^0, θ_2^0, ..., θ_m^0),

[Two plots: likelihood ratios of A to B (top) and posterior ratios of A to B
(bottom) versus testing stage.]

Fig. 2.2. Likelihood and posterior ratios of model A to B.

and proceeds as follows:

θ_1^1 is drawn from π(θ_1 | θ_2^0, ..., θ_m^0, t^(n)),
θ_2^1 is drawn from π(θ_2 | θ_1^1, θ_3^0, ..., θ_m^0, t^(n)),
. . .                                                          (2.12)
θ_m^1 is drawn from π(θ_m | θ_1^1, ..., θ_(m-1)^1, t^(n)).


As a result of this single iteration of the Gibbs sampler in (2.12), a
single vector, which represents a transition from the starting value
θ^0 = (θ_1^0, θ_2^0, ..., θ_m^0) to θ^1 = (θ_1^1, θ_2^1, ..., θ_m^1), has been
generated. If this iteration is repeated k times (i.e., next starting with θ^1
and iterating to θ^2, and so on), then the Gibbs sequence

θ^1, θ^2, ..., θ^k   (2.13)

is generated and, under some mild regularity conditions, the distribution of
θ^k converges to the posterior distribution π(θ | t^(n)) as k → ∞; thus θ^k is
a sample point from π(θ | t^(n)). Hence, to generate a sample from π(θ | t^(n)),
one alternative is to generate s independent Gibbs sequences of k iterations
each and use the k-th value from each sequence as a sample point from π(θ | t^(n)).
For a more detailed discussion of the Gibbs sampler and other related Monte
Carlo methods, see Gelfand and Smith (1990). Once a sample θ^1, θ^2, ..., θ^r
is obtained from the posterior distribution π(θ | t^(n)), the marginal posterior

distributions of the θ_j's and their moments can be approximated from the
sample points θ_j^1, θ_j^2, ..., θ_j^r.
If the full conditional distributions are not of known distributional form
or if they do not exist in closed form, then to facilitate the implementation
of the Gibbs sampler, some random variable generation method such as the
adaptive rejection procedure of Gilks and Wild (1992) can be employed.
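The scheme in (2.12)-(2.13) can be illustrated on a toy target where the full conditionals are known exactly; here, a standard bivariate normal with correlation ρ, whose full conditionals are univariate normals. The function and its settings are purely illustrative:

```python
import math
import random

def gibbs_bivariate_normal(rho, k=500, s=200, seed=1):
    """Draw s sample points by running s independent Gibbs sequences of
    k iterations each and keeping only the k-th value, as in (2.13)."""
    rng = random.Random(seed)
    sd = math.sqrt(1.0 - rho * rho)
    samples = []
    for _ in range(s):
        th1, th2 = 5.0, -5.0                      # arbitrary starting values
        for _ in range(k):
            th1 = rng.gauss(rho * th2, sd)        # draw from pi(theta_1 | theta_2)
            th2 = rng.gauss(rho * th1, sd)        # draw from pi(theta_2 | theta_1)
        samples.append((th1, th2))
    return samples

draws = gibbs_bivariate_normal(rho=0.6)
mean1 = sum(d[0] for d in draws) / len(draws)     # approximates the marginal mean 0
```

Marginal moments are then approximated from the retained sample points, exactly as described above.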
In analyzing Model B, Kuo and Tang (1995) assumed independent gamma
distributions for the parameters (β_0, β_1), that is, (β_j | a_j, b_j) ~ Gam(a_j, b_j),
j = 0, 1. As before, (α | w) ~ Uni(0, w).
Let Λ = (λ_1, λ_2, ..., λ_n) and Λ^(-j) = {λ_i | i ≠ j, i = 1, 2, ..., n}. After
n stages of testing, the implementation of the Gibbs sampler requires the full
conditionals:

p(λ_j | Λ^(-j), α, β_0, β_1, t^(n)), j = 1, 2, ..., n;
p(β_j | Λ, α, β_(1-j), t^(n)), j = 0, 1; and p(α | Λ, β_0, β_1, t^(n)).

Specifying p(λ_j | Λ^(-j), α, β_0, β_1, t^(n)) is easy, but the form of
p(β_j | Λ, α, β_(1-j), t^(n)) is a complicated mixture. To alleviate this
difficulty Kuo and Tang (1995) use data augmentation by introducing a latent
variable z_j which has a binomial distribution with parameter α and cell
probability r_j = β_1 j/(β_0 + β_1 j). Defining z = (z_1, z_2, ..., z_n), it
can be shown that

(λ_j | Λ^(-j), α, β_0, β_1, z, t^(n)) ~ Gam(α + 1, t_j + β_0 + β_1 j),
(β_0 | Λ, α, β_1, z, t^(n)) ~ Gam(a_0 + Σ_{j=1}^n (α - z_j), b_0 + Σ_{j=1}^n λ_j),
(β_1 | Λ, α, β_0, z, t^(n)) ~ Gam(a_1 + Σ_{j=1}^n z_j, b_1 + Σ_{j=1}^n j λ_j),
                                                                     (2.14)

where z_j ~ Bin(α, r_j), and the distribution of α is

(2.15)

The random variable α can be easily generated using the adaptive rejection
procedure of Gilks and Wild (1992) or the Metropolis algorithm, as used by
Kuo and Tang (1995).
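A hedged sketch of one Gibbs scan over the full conditionals (2.14). To sidestep the full conditional of α in (2.15), α is treated here as a known integer (so that Bin(α, r_j) can be sampled directly); the hyperparameter values, the data, and the helper `gibbs_scan` are illustrative, not Kuo and Tang's implementation.

```python
import random

def gibbs_scan(t, lam, z, alpha, a0, b0, a1, b1, rng):
    """One in-place scan over the full conditionals (2.14); alpha is held fixed."""
    n = len(t)
    # (beta_0 | ...) ~ Gam(a0 + sum(alpha - z_j), b0 + sum(lambda_j)), a rate form
    beta0 = rng.gammavariate(a0 + sum(alpha - zj for zj in z),
                             1.0 / (b0 + sum(lam)))
    # (beta_1 | ...) ~ Gam(a1 + sum(z_j), b1 + sum(j * lambda_j))
    beta1 = rng.gammavariate(a1 + sum(z),
                             1.0 / (b1 + sum((j + 1) * lam[j] for j in range(n))))
    for j in range(n):
        psi = beta0 + beta1 * (j + 1)
        # (lambda_j | ...) ~ Gam(alpha + 1, t_j + psi(j))
        lam[j] = rng.gammavariate(alpha + 1, 1.0 / (t[j] + psi))
        r = beta1 * (j + 1) / psi                           # cell probability r_j
        z[j] = sum(rng.random() < r for _ in range(alpha))  # Bin(alpha, r_j)
    return beta0, beta1

rng = random.Random(7)
t = [9.0, 12.0, 11.0, 4.0, 7.0]            # illustrative failure times
lam, z = [0.1] * 5, [0] * 5
for _ in range(100):                       # a short Gibbs run
    beta0, beta1 = gibbs_scan(t, lam, z, alpha=3,
                              a0=10.0, b0=0.1, a1=2.0, b1=0.25, rng=rng)
```

In a full implementation, a Metropolis or adaptive rejection step for α would replace the fixed value used here.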

3. Analysis Using Kalman Filter Type Models

The second class of models we will consider for data analysis consists of
those which directly model the times between failures. These were classified
as Type 1-2 models in Singpurwalla and Soyer (1996). In the sequel we will
discuss inference for two examples of these models.

3.1 The Random Coefficient Autoregressive Process Model

Singpurwalla and Soyer (1985) introduced a first-order random coefficient
autoregressive process to model the log T_i's. The model was motivated via
the simple power law relationship T_i = T_(i-1)^(θ_i), where a value of the
unknown coefficient θ_i > (<) 1 implies growth (decay) in reliability. If we
let Y_i = log T_i, then we can write the model as

Y_i = θ_i Y_(i-1) + ε_i,   (3.1)

where the ε_i's are assumed to be normally distributed random variables with
mean 0 and variance σ_ε^2, denoted as ε_i ~ N(0, σ_ε^2). Singpurwalla and
Soyer (1992) discussed inference for generalizations of (3.1) and considered
the case where the variance σ_ε^2 was unknown. Uncertainty about σ_ε^2 was
described by letting φ = 1/σ_ε^2 and assuming φ ~ Gam(γ_0/2, δ_0/2).
The authors considered two types of models to describe the change in the
θ_i's. The first model assumes exchangeability of the θ_i's: the θ_i's are
assumed normally distributed with mean λ and variance σ_θ^2, with σ_θ^2
known. As pointed out by Singpurwalla and Soyer (1992), with σ_θ^2 small,
values of λ > (<) 1 emphasize a growth (decay) in reliability. In essence
σ_θ^2 reflects our view of the consistency of our policies regarding
modifications and changes to the software; a large (small) value of σ_θ^2
would reflect major (minor) modifications to the system. Exchangeability of
the θ_i's is achieved by assuming λ normal with mean m_0 and variance s_0/φ,
where m_0 and s_0 are known quantities. The division by φ suggests a scaling
of all variances by σ_ε^2.
The model can be represented as an ordinary Kalman filter model, and
standard Bayesian results can be used for inference. Given y^(n) = (y_1, y_2, ..., y_n),
it can be shown that the posterior distribution of φ is Gam(γ_n/2, δ_n/2), where

γ_n = γ_(n-1) + 1   and   δ_n = δ_(n-1) + (y_n - m_(n-1) y_(n-1))^2 / (1 + y_(n-1)^2 (σ_θ^2 + s_(n-1))).   (3.2)

The posterior distribution of θ_n is obtained as a Student-t density with γ_n
degrees of freedom, mean θ̄_n, and variance δ_n C_n/(γ_n - 2), where

θ̄_n = π_n m_(n-1) + (1 - π_n) y_n/y_(n-1),   (3.3)

C_n = (σ_θ^2 + s_(n-1)) / (1 + y_(n-1)^2 (σ_θ^2 + s_(n-1))),   (3.4)

π_n = 1/(1 + y_(n-1)^2 (σ_θ^2 + s_(n-1))), and y_n/y_(n-1) is the least
squares estimator of θ_n.
Similarly, the posterior of λ is also a Student-t density with γ_n degrees of
freedom, mean m_n, and variance δ_n s_n/(γ_n - 2), where

m_n = ρ_n m_(n-1) + (1 - ρ_n) y_n/y_(n-1),   (3.5)

and ρ_n = (1 + σ_θ^2 y_(n-1)^2) / (1 + (σ_θ^2 + s_(n-1)) y_(n-1)^2).
Finally, the predictive distribution of Y_(n+1) given y^(n) is a Student-t
with γ_n degrees of freedom, mean m_n y_n, and variance
δ_n (1 + σ_θ^2 + s_n)/(γ_n - 2). As noted by the authors, there is no
tractable updating procedure when σ_θ^2 is unknown. This model will be
referred to as Model A.
The second model considered for the θ_i's was the autoregression

θ_i = α θ_(i-1) + w_i,   (3.7)

where w_i ~ N(0, σ_w^2/φ), with σ_w^2 known. Values of α < (>) 1 reflect our
belief that the initial modifications show more (less) improvement than the
later ones, and α = 1 implies a maturing of the growth process. Singpurwalla
and Soyer (1992) described uncertainty about α by a beta distribution over
(a, b) with parameters β_1 and β_2. Uncertainty about θ_0 is described by a
normal distribution with mean (variance) θ̂_0 (C_0/φ), both specified, apart
from φ. As noted by the authors, when α is not known, an adaptive Kalman
filter model results, for which there are no closed form results. The authors
used Lindley's approximation for making inference in this model.
Given α, the posterior inference is obtained via the ordinary Kalman filter
solution. For example, given y^(n), the conditional distribution p(θ_n | y^(n), α)
is a Student-t distribution with degrees of freedom γ_n, variance δ_n C_n/(γ_n - 2),
and mean θ̄_n, where

θ̄_n = (α θ̄_(n-1) + R_n y_n y_(n-1)) / (1 + R_n y_(n-1)^2),   R_n = α^2 C_(n-1) + σ_w^2,   (3.8)

C_n = R_n / (1 + R_n y_(n-1)^2),  and  δ_n = δ_(n-1) + (y_n - y_(n-1) α θ̄_(n-1))^2 / (1 + R_n y_(n-1)^2),   (3.9)

all functions of α, and γ_n = γ_(n-1) + 1. Furthermore, the predictive
distributions p(y_n | y^(n-1), α) are also Student-t with degrees of freedom
γ_(n-1), variance (1 + R_n y_(n-1)^2) δ_(n-1)/(γ_(n-1) - 2), and mean
y_(n-1) α θ̄_(n-1). The recursive character of the above quantities
facilitates the computation of the derivatives that are needed for Lindley's
approximation. This model will be referred to as Model B.
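Given α, the recursions (3.8)-(3.9) can be run forward directly on the log failure times. The sketch below does so under the reading θ̄_n = (αθ̄_(n-1) + R_n y_n y_(n-1))/(1 + R_n y_(n-1)^2) with R_n = α^2 C_(n-1) + σ_w^2; the data and the starting values (θ̄_0, C_0, γ_0, δ_0) are illustrative, and the helper is not the authors' implementation.

```python
import math

def model_b_filter(y, alpha, theta0=1.0, C0=0.25, sw2=0.04, gamma0=2.5, delta0=1.0):
    """One forward pass of the conditional Kalman-filter recursions for Model B."""
    theta_bar, C, gamma, delta = theta0, C0, gamma0, delta0
    means = []
    for n in range(1, len(y)):
        R = alpha ** 2 * C + sw2                            # R_n
        denom = 1.0 + R * y[n - 1] ** 2
        # delta_n uses the previous posterior mean theta_bar_{n-1}
        delta += (y[n] - y[n - 1] * alpha * theta_bar) ** 2 / denom
        theta_bar = (alpha * theta_bar + R * y[n] * y[n - 1]) / denom
        C = R / denom
        gamma += 1.0
        means.append(theta_bar)                             # posterior mean of theta_n
    return means, gamma, delta

times = [9.0, 12.0, 11.0, 4.0, 7.0, 2.0, 5.0, 8.0]          # illustrative data
y = [math.log(t) for t in times]
means, gamma_n, delta_n = model_b_filter(y, alpha=0.96)
```

Running the pass over a grid of α values is what feeds the derivatives needed for Lindley's approximation.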
The authors applied both models to the 'System 40' data of Musa
(1979). Only the first 51 of his 101 observations were considered. For
Model A [B], they chose m_0(s_0) [θ̂_0(C_0)] = 1(.25) [1(.25)], reflecting
vague prior knowledge about reliability growth or decay. For α (in Model B),
a = 0 and β_1 = β_2 = b = 2 were chosen, implying that the most likely
structure for the θ_i's is the steady model. For both models γ_0(δ_0) = 2.5(1),
σ_θ^2 = 0.1 and σ_w^2 = 0.04.

Figure 3.1 shows the plots of the posterior mean of θ_n under Models A and
B. The plots suggest an overall growth in reliability, since the values of the
posterior mean tend to hover above 1, at least during the initial stages of
testing.
Figure 3.2 shows a plot of m_n (the mean of the posterior distribution of
λ in Model A) versus stages of testing. The plot shows that for n ≥ 25, m_n
settles down to a value of about 1.03. This suggests that the overall policy of
making changes to the software results in a consistent growth in reliability.
Figure 3.2 also shows the mode of the posterior distribution of α in Model
B. The posterior mode α̂ is below 1 for n ≥ 2, settling down to the value 0.96
for n ≥ 25; this suggests that the θ_n's stochastically decrease in n, i.e.,
that the initial phases of testing lead to a larger growth in reliability than
the later ones. Thus it appears that the conclusions about reliability growth
based on the two models are supportive of each other.
A comparison of the predictive performances of the two models was
considered by the authors using the logarithm of the posterior ratios of Model
A to B for each stage n. It was found that Model A is preferred to Model B.

[Two plots: posterior means of θ in Model A (top) and in Model B (bottom)
versus testing stage.]

Fig. 3.1. Posterior means of θ_n in Models A and B.



[Two plots: posterior means of λ in Model A (top) and posterior modes of α
in Model B (bottom) versus testing stage.]

Fig. 3.2. Posterior means of λ in Model A and posterior modes of α in Model B.

3.2 The Non-Gaussian Kalman Filter Model

As an alternative to the Kalman filter models of Singpurwalla and Soyer
(1985, 1992), Chen and Singpurwalla (1994) introduce a non-Gaussian Kalman
filter model (which was presented by Singpurwalla and Soyer 1996 in this
volume). The authors assume that the failure times are now described as
(T_n | θ_n) ~ Gam(w_n, θ_n), where the shape parameter w_n is assumed to be
known and the scale parameter θ_n evolves according to the system equation

(C θ_n/θ_(n-1) | θ_(n-1)) ~ Beta(u_(n-1), v_(n-1)),   (3.10)

which can be written as

θ_n = (θ_(n-1)/C) ε_n   (3.11)

with ε_n ~ Beta(u_(n-1), v_(n-1)). Note that (3.11) can be considered as a
product autoregression. The authors assume that C, u_n, and v_n are known
parameters such that u_(n-1) + w_n = u_n + v_n. If it is assumed that
(θ_(n-1) | t^(n-1)) ~ Gam(u_(n-1) + v_(n-1), U_(n-1)) and initially
(θ_0 | t^(0)) ~ Gam(u_0 + v_0, U_0), then it can be shown that

(θ_n | t^(n-1)) ~ Gam(u_(n-1), C U_(n-1)),
(θ_n | t^(n)) ~ Gam(u_(n-1) + w_n, U_n),   (3.12)

where U_n = C U_(n-1) + t_n. The one-step-ahead forecast distribution can also
be obtained as

p(t_n | t^(n-1)) ∝ (t_n)^(w_n - 1) / (C U_(n-1) + t_n)^(u_(n-1) + w_n).   (3.13)

For example, for the case w_n = 1 (where the observation model is an
exponential density),

p(t_n | t^(n-1)) = u_(n-1) (C U_(n-1))^(u_(n-1)) / (C U_(n-1) + t_n)^(u_(n-1) + 1),

which is a Pareto density.

Chen and Singpurwalla considered a simplification of the model by setting
w_n = v_n = u_n = 2, for all n, and showed that, given the parameter C, the
predictive mean of T_n is given by

E(T_n | t^(n-1), C) = 2C Σ_{i=0}^{n-1} C^i t_(n-i-1).   (3.14)

They noted that the value of the parameter C is critical in assessing whether
the times between failures are increasing or decreasing, with values of C
close to 1 implying a substantial growth in reliability, whereas values of C
close to zero imply a drastic reduction in reliability. Intermediate values of
C would imply a growth or decay in reliability depending on t^(n-1). The
authors described uncertainty about C by a uniform distribution over (0, 1).
As a result, the closed form nature of the inference was lost, and the authors
used a Gibbs sampler to simulate the posterior and predictive distributions.
As an alternative to the Gibbs sampler, we can use a discretization of the
uniform density over (0, 1). If we consider a k-point discretization with
points C_1, ..., C_k, then, given that T_n = t_n is observed, the posterior
distribution of C is obtained via the standard Bayesian machinery as

p(C_l | t^(n)) ∝ p(t_n | C_l, t^(n-1)) p(C_l | t^(n-1)),   l = 1, ..., k,   (3.15)

where the likelihood term p(t_n | C_l, t^(n-1)) is the predictive density
given by (3.13). Once the posterior distribution (3.15) is available, the
unconditional posterior distribution of θ_n can be obtained by averaging out
(3.12) with respect to this posterior distribution.
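The discretization (3.15) is straightforward to carry out. The sketch below does so for the simplification w_n = v_n = u_n = 2, normalizing (3.13) into a proper density via the beta function B(w, u) (that normalization, the data, and the grid placement are assumptions of this sketch), and carrying one scale U along each grid point of C:

```python
import math
import numpy as np

def discretized_posterior(t_obs, k=200, w=2.0, u=2.0, U0=500.0):
    """k-point discretized posterior (3.15) for C, with predictive (3.13)
    normalized as t^(w-1) (CU + t)^-(u+w) (CU)^u / B(w, u)."""
    C = (np.arange(k) + 0.5) / k                  # grid over (0, 1)
    logpost = np.zeros(k)                         # uniform prior on the grid
    logB = math.lgamma(w) + math.lgamma(u) - math.lgamma(w + u)
    U = np.full(k, U0)
    for tn in t_obs:
        b = C * U                                 # C * U_{n-1}, per grid point
        logpost += ((w - 1.0) * math.log(tn) + u * np.log(b)
                    - (u + w) * np.log(b + tn) - logB)
        U = b + tn                                # U_n = C U_{n-1} + t_n
    post = np.exp(logpost - logpost.max())
    return C, post / post.sum()

C, post = discretized_posterior([9.0, 12.0, 11.0, 4.0, 7.0, 2.0, 5.0, 8.0])
c_mean = float(np.sum(C * post))                  # posterior mean of C
```

With the real System 40 data and k = 200, this is the computation summarized in Figure 3.3.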
Chen and Singpurwalla (1994) analyzed the System 40 data of Musa
(1979) and compared the predictive performance of the model with that of the
exchangeable model of Singpurwalla and Soyer (1985) using posterior ratios;
they concluded that the non-Gaussian Kalman filter model outperformed
the Singpurwalla and Soyer model. In what follows, we present an analysis
of the System 40 data by using the first 51 observations and a 200-point
discretization of the uniform prior on C. Following Chen and Singpurwalla,
we choose w_n = v_n = u_n = 2 for all n and U_0 = 500.

In Figure 3.3, we present the posterior distribution of C after observing all
the times between failure, t^(51), and the plot of the posterior means of C
given t^(n), n = 1, ..., 51. We note that as a result of the data the uniform
prior on C has been revised to a posterior peaked around 0.42, and the
posterior means of C have stabilized around 0.4 after the first 15
observations. This implies neither strong evidence of growth nor of decay in
reliability.


Fig. 3.3. Posterior distribution of C given t^(51) and posterior means of C.

An alternative to this model is to assume that the failure rate of T_n is
constant for all n. This is achieved by setting w_n = 1, which implies that
the observation model at stage n is exponential with failure rate θ_n.
Assuming v_n = 1 and u_n = 2 for all n and U_0 = 500, with a 200-point
discretization of the uniform prior on C, we have obtained the posterior
distribution of C given t^(51) shown in Figure 3.4, which has a mode at 0.62.
Figure 3.4 also shows that the posterior means of C have stabilized around
0.60 - 0.65 after 15 stages of testing, implying stronger evidence for
reliability growth.
A comparison of these two models can be made by looking at the local
likelihood ratios and the global posterior ratios. We will make the comparison
using the log-likelihood ratios

l_i = log[ p(t_i | t^(i-1), A) / p(t_i | t^(i-1), B) ],

and L_n, the logarithm of the posterior ratio (or of the product of the
likelihood ratios), that is, L_n = Σ_{i=1}^n l_i.

Fig. 3.4. Posterior distribution of C given t^(51) and posterior means of C.

In so doing we will refer to the exponential model with w_n = 1 as Model A
and the model with w_n = 2 as Model B. These ratios are plotted in Figure
3.5.
Most values of the log-likelihood ratios are in the vicinity of the zero line,
except in a few cases where Model A outperforms Model B significantly.
We note that Model B does better initially (most log-likelihood ratios are
less than 0) and Model A starts to dominate at later stages. The global
behavior can be seen from the cumulative log posterior ratios, which favor
Model A starting at stage 10 (values of the log posterior ratios are greater
than 0).

4. Analysis Using a Nonhomogeneous Poisson Process Model

We consider the logarithmic Poisson execution time model of Musa and
Okumoto (1984), which was discussed in Singpurwalla and Soyer (1996), with
mean value function μ(t) = ln(λθt + 1)/θ. Following the expert opinion
framework of Campodónico and Singpurwalla (1994), we assume that a joint
prior probability distribution, π(μ_1, μ_2), is elicited for μ_1 = μ(T_1) and
μ_2 = μ(T_2).

[Two plots: log-likelihood ratios (top) and log posterior ratios (bottom) of
Model A to Model B versus testing stage.]

Fig. 3.5. Log-likelihood and log posterior ratios of Model A to Model B.

Using the relationships

μ_1 = ln(λθT_1 + 1)/θ,
μ_2 = ln(λθT_2 + 1)/θ,

λ and θ can be solved for in terms of (μ_1, μ_2, T_1, T_2), and the mean value
function can be obtained numerically. We will denote this as μ(t | μ_1, μ_2).
Note that the distribution of (θ, λ) can be induced numerically. Given the
prior distribution π(μ_1, μ_2), inferences about the (expected) number of
failures in any interval can be made using the results for nonhomogeneous
Poisson processes (NHPPs), as discussed in Singpurwalla and Soyer (1996).
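Solving the pair of relationships for (λ, θ) reduces to one scalar equation in θ, since λθT_1 = e^(θμ_1) - 1. A bisection sketch follows, with small illustrative values chosen to keep exp() in range; it is not the authors' code.

```python
import math

def solve_lambda_theta(mu1, mu2, T1, T2, lo=1e-8, hi=50.0):
    """Recover (lambda, theta) from mu1 = ln(lambda*theta*T1 + 1)/theta and
    mu2 = ln(lambda*theta*T2 + 1)/theta by bisection on theta."""
    # theta must satisfy (exp(theta*mu2) - 1)/T2 = (exp(theta*mu1) - 1)/T1
    f = lambda th: (math.exp(th * mu2) - 1.0) / T2 - (math.exp(th * mu1) - 1.0) / T1
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    theta = 0.5 * (lo + hi)
    lam = (math.exp(theta * mu1) - 1.0) / (theta * T1)
    return lam, theta

# example: recover (lambda, theta) = (3, 0.5) from exact mu values
mu1 = math.log(3.0 * 0.5 * 1.0 + 1.0) / 0.5       # T1 = 1
mu2 = math.log(3.0 * 0.5 * 10.0 + 1.0) / 0.5      # T2 = 10
lam, theta = solve_lambda_theta(mu1, mu2, 1.0, 10.0)
```

Repeating this for draws of (μ_1, μ_2) is one way to induce the distribution of (θ, λ) numerically, as described above.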
Suppose that data are observed as the number of failures in a total of N
disjoint time intervals. Let n_j denote the number of failures in the interval
(t_1j, t_2j], j = 1, ..., N. Let D = {n_j, t_1j, t_2j; j = 1, ..., N} denote
the observed data. Given data D, the joint posterior distribution of
(μ_1, μ_2), say π(μ_1, μ_2 | D), is obtained via Bayes' law as

π(μ_1, μ_2 | D) ∝ L(μ_1, μ_2 ; D) π(μ_1, μ_2),

where L(μ_1, μ_2 ; D) is the likelihood function, i.e., the joint distribution
of the data when regarded as a function of (μ_1, μ_2). For the interval
failure data, the likelihood function is given by

L(μ_1, μ_2 ; D) = ∏_{j=1}^N e^(-[μ(t_2j | μ_1, μ_2) - μ(t_1j | μ_1, μ_2)]) [μ(t_2j | μ_1, μ_2) - μ(t_1j | μ_1, μ_2)]^(n_j) / n_j!   (4.1)
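Given a mean value function, the log of the likelihood (4.1) is just a sum of Poisson interval terms. A small sketch, with the helper, the parameter values, and the interval counts all illustrative:

```python
import math

def interval_loglik(mu, data):
    """Log of (4.1); mu is a callable mean value function and
    data is a list of (n_j, t1j, t2j) tuples."""
    ll = 0.0
    for n, t1, t2 in data:
        m = mu(t2) - mu(t1)                   # expected failures in (t1j, t2j]
        ll += -m + n * math.log(m) - math.lgamma(n + 1)
    return ll

# Musa-Okumoto mean value function with illustrative (lambda, theta)
lam, theta = 30.0, 0.1
mu = lambda t: math.log(lam * theta * t + 1.0) / theta
ll = interval_loglik(mu, [(27, 0.0, 1.0), (16, 1.0, 2.0), (11, 2.0, 3.0)])
```

Evaluating this over a grid of (μ_1, μ_2) values, after mapping each pair to (λ, θ), is one route to the posterior quantities below.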

Once the posterior distribution π(μ_1, μ_2 | D) is obtained, the quantities
of interest discussed in Singpurwalla and Soyer (1996) can be evaluated by
replacing the prior π(μ_1, μ_2) by the posterior π(μ_1, μ_2 | D). For example,
the probability of k failures in the interval (s, t], s < t, k = 0, 1, 2, ...,
is given by

∫∫ e^(-[μ(t | μ_1, μ_2) - μ(s | μ_1, μ_2)]) [μ(t | μ_1, μ_2) - μ(s | μ_1, μ_2)]^k / k! · π(μ_1, μ_2 | D) dμ_1 dμ_2,   (4.2)

and the (unconditional) expected number of failures in the interval (s, t],
s < t, is:

∫∫ [μ(t | μ_1, μ_2) - μ(s | μ_1, μ_2)] π(μ_1, μ_2 | D) dμ_1 dμ_2.   (4.3)

The above integrals need to be evaluated numerically (see Campodónico 1993
for a code).
Campodónico and Singpurwalla (1994) analyzed a set of data taken from
Goel (1985). The data consist of the observed number of failures of a software
system that was tested for 25 hours of CPU time. The standard approach for
analyzing such data is based on the method of MLE; see, for example, Musa
and Okumoto (1984). As noted by Campodónico and Singpurwalla (1994),
the use of the MLE approach may lead one to some difficulties. The authors
pointed out that for the logarithmic-Poisson execution time model, the MLE
method fails to provide a meaningful answer when the data that are available
consist of just the number of failures during the first interval of testing.
This is because, when such is the case, the likelihood function for the two
model parameters (θ, λ) has no unique maximum.
In implementing the Bayesian approach of Singpurwalla and Soyer (1996),
note that the prior π(μ_1, μ_2) is a function of the expert input as described
by the location and scale measures m_1, m_2, s_1 and s_2. The authors used
some published empirical results in software testing to illustrate the
codification of expert opinion and discussed various strategies to specify
T_1, T_2, m_1, m_2, s_1 and s_2 in their paper. In their analysis they used
T_2 = 250 as the total debugging time and specified m_2 = 455 and s_2 = 200,
implying a high degree of uncertainty about m_2. For the specification of T_1,
m_1 and s_1, the authors assumed that, on the average, 10% of the system
failures occur during the first 1% of the debugging time, implying that
T_1 = 2.5 and m_1 = 45.5. To reflect the degree of uncertainty about m_1,
they chose s_1 = 4, and to reflect the fact that there is no basis for
modulation of expert input, the authors specified a = 0, b = 1, r = 1 and
k = 1 (see Singpurwalla and Soyer 1996).
Predictions for number of failures, based on the constructed prior, can be
obtained for any time interval using the expected number of failures given
by (4.3). As more data become available, the prior distribution is revised to
the posterior distributions and the corresponding posterior expectations are
calculated. Table 4.1 shows the expected number of failures in the interval
Analysis of Software Failure Data 407

(t, t + 1], for t = 0, 1, ..., 4, as obtained by Campodónico and Singpurwalla
(1994), and the MLEs; in both cases, the authors use the data up to time t to
predict the number of failures in the next hour. As pointed out by the authors,
the MLE is not available for the first two intervals. The Bayesian prediction for
the first interval is based on the prior alone. The authors also show that the
mean square errors (MSE) of the Bayesian predictions are lower than those of
the MLE for the specific choice of prior parameters.

Table 4.1. One step-ahead Bayesian and MLE predictions


CPU Hour Interval   Bayesian Prediction   MLE    Observed Failures
(0, 1]              22.8                  N/A    27
(1, 2]              16.0                  N/A    16
(2, 3]              12.6                  14.4   11
(3, 4]               9.5                   9.9   10
(4, 5]               8.6                   8.2   11

References

Campodónico, S.: Software for a Bayesian Analysis of the Logarithmic-Poisson Ex-
    ecution Time Model. Technical Report GWU/IRRA/TR-93/5. Institute for Re-
    liability and Risk Analysis, The George Washington University (1993)
Campodónico, S., Singpurwalla, N.D.: A Bayesian Analysis of the Logarithmic-
    Poisson Execution Time Model Based on Expert Opinion and Failure Data.
    IEEE Trans. Soft. Eng. SE-20, 677-683 (1994)
Chen, Y., Singpurwalla, N.D.: A Non-Gaussian Kalman Filter Model for Tracking
    Software Reliability. Statistica Sinica 4, 535-548 (1994)
Deely, J.J., Lindley, D.V.: Bayes Empirical Bayes. J. Amer. Statist. Assoc. 76,
    833-841 (1981)
Gelfand, A.E., Smith, A.F.M.: Sampling-Based Approaches to Calculating Marginal
    Densities. J. Amer. Statist. Assoc. 85, 398-409 (1990)
Gilks, W.R., Wild, P.: Adaptive Rejection Sampling for Gibbs Sampling. Appl.
    Statist. 41, 337-348 (1992)
Goel, A.L.: Software Reliability Models: Assumptions, Limitations, and Applica-
    bility. IEEE Trans. Soft. Eng. SE-11, 1411-1423 (1985)
Jelinski, Z., Moranda, P.: Software Reliability Research. In: Freiberger, W. (ed.):
    Statistical Computer Performance Evaluation. New York: Academic Press 1972,
    pp. 465-484
Kuo, L., Yang, T.Y.: Bayesian Computation of Software Reliability. J. Comp. and
    Graph. Statist. 4, 65-82 (1995)
Lindley, D.V.: Approximate Bayesian Methods. Trabajos Estadistica 31, 223-237
    (1980)
Littlewood, B., Verrall, J.L.: A Bayesian Reliability Growth Model for Computer
    Software. Appl. Statist. 22, 332-346 (1973)
Mazzuchi, T.A., Soyer, R.: Software Reliability Assessment Using Posterior Approx-
    imations. Proceedings of the 19th Symposium, Computer Science and Statistics
    1987, pp. 248-254
408 Refik Soyer

Mazzuchi, T.A., Soyer, R.: A Bayes Empirical-Bayes Model for Software Reliability.
    IEEE Trans. Rel. R-37, 248-254 (1988)
Meinhold, R.J., Singpurwalla, N.D.: Bayesian Analysis of a Commonly Used Model
    for Describing Software Failures. Statistician 32, 168-173 (1983)
Merrick, J., Singpurwalla, N.D.: The Role of Decision Analysis in Software Engi-
    neering. In this volume (1996), pp. 368-388
Musa, J.D.: Software Reliability Data. IEEE Computing Society Repository (1979)
Musa, J.D., Okumoto, K.: A Logarithmic Poisson Execution Time Model for Soft-
    ware Reliability Measurement. Proceedings of the 7th International Conference
    on Software Engineering. Orlando 1984, pp. 230-237
Roberts, H.V.: Probabilistic Prediction. J. Amer. Statist. Assoc. 60, 50-61 (1965)
Singpurwalla, N.D., Soyer, R.: Assessing (Software) Reliability Growth Using a
    Random Coefficient Autoregressive Process and its Ramifications. IEEE Trans.
    Soft. Eng. SE-11, 1456-1464 (1985)
Singpurwalla, N.D., Soyer, R.: Non-Homogeneous Autoregressive Processes for
    Tracking (Software) Reliability Growth, and Their Bayesian Analysis. J. Roy.
    Statist. Soc. B 54, 145-156 (1992)
Singpurwalla, N.D., Soyer, R.: Assessing the Reliability of Software: An Overview.
    In this volume (1996), pp. 345-367
Tierney, L., Kadane, J.B.: Accurate Approximations for Posterior Moments and
    Marginal Densities. J. Amer. Statist. Assoc. 81, 82-86 (1986)
Part IV

Computational Methods and Simulation in Reliability and Maintenance

Simulation: Runlength Selection and Variance Reduction Techniques

Jack P.C. Kleijnen

Department of Information Systems and Auditing and Center for Economic Research (CentER), School of Management and Economics, Tilburg University, 5000 LE Tilburg, The Netherlands.

Summary. This chapter gives a tutorial survey on the use of simple statistical
techniques for the control of the runlength in simulation. The object of the sim-
ulation study may be either short-term operational decision-making or long-term
strategic decision-making. These decision types correspond with two types of sim-
ulation: terminating and steady-state simulations. First, terminating simulation is
discussed. At the preliminary end of a simulation run, a confidence interval for the
simulation response can be derived, using either the Student statistic or alternative
statistics (in case of non-normal simulation responses). From the resulting interval
the definitive run length can be derived. Next, steady-state simulation is discussed.
Such a simulation may be examined through renewal analysis. Both simulation
types may have responses that are not expected values, but either proportions or
quantiles. Whatever the simulation type or simulation response, the required length
of the simulation run may be reduced through simple variance reduction techniques,
namely common pseudorandom numbers, antithetic numbers, and control variates
or regression sampling. Importance sampling is necessary in rare event simulation.
Finally, a general technique -namely, jackknifing- is presented, to reduce possible
bias of estimated simulation responses and to construct robust confidence intervals
for the responses.

Keywords. Distribution-free, non-parametric, stopping rule, run-length, regeneration, stationarity, Von Neumann statistic, median, seed, Monte Carlo, likelihood ratio, generalized least squares

1. Introduction
The objective of this chapter is to give a tutorial survey on the use of sim-
ple statistical techniques for the control of the runlength in simulation. The
following questions are addressed:
(i) How should the simulation run be initialized; for example, should a sim-
ulation of a repairman system start with all machines running?
(ii) How long should this run be continued; for instance, should 1000 machine
breakdowns or one month be simulated?
(iii) How should the accuracy (or precision) of the simulation response be as-
sessed: what is a (say) 90% confidence interval for the simulation response?
(iv) If this precision is too low, how much longer should the system be simu-
lated (with fixed inputs)?
(v) To further improve the accuracy, can 'tricks' (Variance Reduction Tech-
niques or VRTs) be used?

For didactic reasons it seems useful to consider the following repairman exam-
ple (many more examples and references can be found in the survey, Jensen
1996, in this book). There are m machines that are maintained by a crew of
r repairmen (mechanics). Machine j has a stochastic time between failures
(say) X1j with j = 1, ..., m; notice that stochastic variables are shown in
capitals, their realizations in lower-case letters. The time to repair machine j by
repairman i (with i = 1, ..., r) is X2ij; that is, mechanic i may be special-
ized in the repair of machine j. However, most analytical models assume that
X1j and X2ij do not depend on i and j, which simplifies the notation to
X1 and X2. In simulation this assumption is not necessary. Yet, to simplify
the example, let us make the same assumption as in those analytical mod-
els; that is, X1 and X2 have Negative exponential (Ne) distributions with
parameters λ (failure rate, the reciprocal of the Mean Time To Failure or
MTTF) and μ (repair rate).
Furthermore, different priority rules may be implemented: First-In-First-Out
(FIFO), Shortest-Processing-Time (SPT), and so on. A flowchart for the sim-
ulation of this model is given in Kleijnen and Van Groenendaal (1992, pp.
108-109) (that chart, however, should be corrected: replace the variable TIME
by TAT). A standard textbook on simulation is Law and Kelton (1991). Obvi-
ously this example is a Discrete-Event Dynamic System (DEDS): the system
changes state at discrete, not necessarily equidistant, points of time.
Readers familiar with Markov analysis will notice that for the repairman
system with Poisson failure and repair processes a complete state description
is given by a single state variable (say) Y with y ∈ {0, 1, ..., m}, which de-
notes the number of machines that are running or 'up'. Obviously, the number
of idle mechanics is uniquely determined by y: that number is max(r − (m − y), 0).
Notice that since Poisson processes are memoryless, it is not necessary to
know how long a particular machine has been running, or how long a par-
ticular mechanic has been working on a machine (also see the discussion on
renewal analysis in Section 2.2). Let p_y denote the steady-state probabil-
ity of the system being in state y. Obviously p_y also gives the steady-state
percentage of time that the system is in state y.
Management may be interested in several types of response (performance
measure, criterion). In a computer center they may be interested in the per-
centage of time that at least one machine is up (in the steady state, this
percentage is 1 - Po). They may also be concerned about the percentage of
time that at least two machines are up (1 - Po - pt), because customer ser-
vice is better when two computers (instead of one computer) are up: faster
turnaround time. However, for simplicity this chapter concentrates on a sin-
gle variable; for example, p = 1- Po. Multi-variate responses can be handled
through Bonferroni's inequality; see Kleijnen (1987).
Now consider the simulation of this system. Let Z denote the simulated avail-
ability, defined as the percentage of simulated time that at least one machine
is running: 0 ≤ z ≤ 1. So the response of a simulation run is Z = Z(m, r). In
other words, a simulation run is a single time path that has fixed values for all
its inputs. In this example, these inputs are m and r, and the parameters of
the input distributions λ (failure rate) and μ (repair rate). A special variable
is the pseudorandom number seed (say) R0, which has positive integers as
realizations. Alternative sources for this seed will be discussed in the section
on VRTs (Section 4).
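The repairman model above has a compact Markov structure (aggregate failure rate y·λ over the up machines, aggregate repair rate min(m − y, r)·μ over the busy mechanics), so one simulation run can be sketched in a few lines. The function name and parameters below are illustrative, not taken from the chapter:

```python
import random

def simulate_availability(m, r, lam, mu, horizon, seed=None):
    """One terminating run of the repairman model: exponential time between
    failures (rate lam per up machine) and exponential repair times (rate mu
    per busy mechanic). Returns z, the fraction of [0, horizon] during which
    at least one machine is up."""
    rng = random.Random(seed)
    t, y = 0.0, m            # start with all m machines up
    up_time = 0.0
    while t < horizon:
        fail_rate = y * lam                  # each up machine may fail
        repair_rate = min(m - y, r) * mu     # at most r machines in repair
        total = fail_rate + repair_rate
        dt = min(rng.expovariate(total), horizon - t)
        if y >= 1:                           # time credited as 'available'
            up_time += dt
        t += dt
        if t >= horizon:
            break                            # run ends mid-sojourn
        if rng.random() < fail_rate / total:
            y -= 1                           # next event is a failure
        else:
            y += 1                           # next event is a repair completion
    return up_time / horizon
```

A call such as simulate_availability(5, 2, 0.1, 10.0, 1000.0, seed=1) is one replication; repeating it with non-overlapping seeds gives the i.i.d. responses used in Section 2.1.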
This chapter is organized as follows. Section 2 covers short-term opera-
tional decisions versus long-term strategic decisions, which correspond with
terminating and steady-state simulations respectively. Section 2.1 derives
confidence intervals for terminating simulations, using either Student's statis-
tic or alternative statistics (in case of non-normal simulation responses). From
this interval the number of necessary simulation runs in a terminating simu-
lation is derived (stopping rule). Section 2.2 covers steady-state simulations,
concentrating on renewal analysis of such simulations, including approximate
renewal states. Section 3 covers proportions and quantiles, as alternatives for
the expected value. Section 4 covers VRTs. Simple VRTs are common pseu-
dorandom numbers, antithetic numbers, and control variates or regression
sampling. Importance sampling is necessary in rare event simulation. Section
5 covers jackknifing, which is a general technique for reducing possible bias
and constructing robust confidence intervals. Section 6 gives a brief summary
and conclusions. This chapter is based on Kleijnen and Van Groenendaal
(1992, pp. 187-203).
Note: Questions such as 'how many mechanics should be hired, and which
priority rule should be selected?' are addressed in Kleijnen (1996), which is
the companion chapter in this volume.

2. Short-Term Operational Versus Long-Term Strategic


Decisions: Terminating Versus Steady-State Simulations
Consider the following two examples:
(i) Management wants to make an investment analysis for a proposed new
plant site: is a given number of machines and repairmen attractive? Such
long-term strategic decisions should be based on the steady state response.
Most analytical studies of stochastic systems consider stationary responses,
because asymptotic results apply only in the steady state. However, the rel-
evance of transient behavior is also emphasized in Heidelberger et al. (1996)
and Muppala et al. (1996) in this volume.
(ii) Management considers hiring one more mechanic for next month. This
is a short-term operational decision, which should account for switching
costs, start-up effects, and so on.
Situation (ii) demonstrates that in many practical problems there is an event
that stops the simulation run; for example, the 'arrival' of the end of the
month. Such simulations are called terminating. From the viewpoint of math-
ematical statistics (not from the viewpoint of Markov analysis), terminating
simulations are easier to analyze. For didactic reasons, these simulations will
be discussed first (in Section 2.1); steady-state simulations will follow (Section
2.2).
Other examples of terminating simulations are queueing problems in a
bank that is open only between 9 a.m. and 4 p.m.; peak hours in traffic in-
tersections and telephone exchanges (simulation starts before the peak and
finishes after the rush hour); simulations of the life of a machine (simula-
tion begins with installation of the machine and ends when the machine is
scrapped).

2.1 Terminating Simulations

We shall see that it is simple to analyze terminating simulations, since
each run gives one independently and identically distributed (i.i.d.) response.
Hence, a (1 − α) confidence interval for the simulation response can be based
on standard statistics, such as the Student statistic t_v, where v denotes the
number of degrees of freedom (Section 2.1.1). It is also easy to derive the
number of runs needed to estimate the simulation response with prefixed
accuracy (Section 2.1.2).
2.1.1 Confidence Intervals. Each run of a terminating simulation pro-
vides one i.i.d. response (replication, duplication); for example, in the repair-
man example, each simulated month yields one estimate (say) z of Z (simu-
lated percentage of time that at least one machine is running). Note that we
distinguish between 'estimator' and 'estimate': an estimate is a realization of
the corresponding estimator. Each replication implies reinitialization of all
state variables; for example, the number of machines running at the start
of the month (so, if June 1995 is simulated, then the number of machines
running at the end of May 1995 may be taken). All parameters and input
variables remain the same; for example, the failure and repair rates λ and μ,
the number of repairmen r, and the number of machines m. The simulation-
ists must generate a new sequence of pseudorandom numbers that does not
overlap with a sequence of a preceding run of the same system.
Let Z_h denote replication h of the simulation response; for example, ob-
servation h on the simulated month (say) June 1995. Simulate d replications
with integer d ≥ 2. Each replication uses a non-overlapping sequence of pseu-
dorandom numbers. Then a 1 − α (for example, 90%) confidence interval for
E(Z) is derived as follows. The estimator of the standard deviation of Z,
given a sample size of d replications, is

    S_Z = [ Σ_{h=1}^{d} (Z_h − Z̄)² / (d − 1) ]^{1/2}    (2.1)
with the average of the d replications

    Z̄ = Σ_{h=1}^{d} Z_h / d.    (2.2)

Obviously, the Z_h are i.i.d. Then consider the following 1 − α one-sided con-
fidence interval for E(Z) = p:

    P[E(Z) > Z̄ − t_{α;d−1} S_Z / √d] = 1 − α    (2.3)

where Z̄ may also be denoted as p̂ and where t_{α;d−1} denotes the 1 − α quantile
of t_{d−1} (also see Section 3, which covers quantiles). The sample size (number
of simulation runs or replications) d reduces the length of the confidence
interval through the factors √d and t_{α;d−1}.
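Equations (2.1) through (2.3) can be coded directly. In the sketch below the standard normal quantile stands in for t_{α;d−1}, since the Python standard library has no Student-t quantile; this is a large-d approximation, and a tabulated t value can be passed in instead:

```python
import math
from statistics import mean, stdev, NormalDist

def one_sided_lower_bound(z, alpha=0.10, t_quantile=None):
    """Lower 1-alpha confidence bound for E(Z) as in (2.3):
    zbar - t_{alpha; d-1} * S_Z / sqrt(d).
    If t_quantile (the 1-alpha quantile of t_{d-1}) is omitted, the normal
    quantile is used as a large-d approximation."""
    d = len(z)
    zbar = mean(z)
    s = stdev(z)                      # S_Z of equation (2.1)
    if t_quantile is None:
        t_quantile = NormalDist().inv_cdf(1.0 - alpha)
    return zbar - t_quantile * s / math.sqrt(d)
```

For a two-sided interval the same half-width t·S_Z/√d is added on both sides of Z̄.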
Unfortunately, the Student statistic assumes that the simulation response
Z is Normally, Independently Distributed (n.i.d.). The Z_h are indeed i.i.d.,
but they are not Gaussian; for example, their realizations do not vary be-
tween minus and plus infinity, but only between zero and one. Fortunately,
the Student statistic is known to be not very sensitive to non-normality; the
average Z̄ is asymptotically normally distributed (Central Limit Theorem).
Nevertheless, Z̄ may not be approximately normally distributed in simula-
tions with extremely high or low values for the traffic load (λ/μ)(m/r), which
implies that Z̄ is cut off at one or zero. Then one of the following two alter-
native procedures may be applied.
If Z has an asymmetric distribution, then Johnson's modified Student
test is an alternative. His test includes an estimate for the skewness of the
distribution of Z. This procedure may provide a valid confidence interval; see
Kleijnen et al. (1985).
A second alternative is a distribution-free or non-parametric confidence
interval. To illustrate its derivation, consider hypothesis testing. For example,
management wishes the expected availability to be at least (say) 0.80:

    H0: E(Z) ≥ 0.80.    (2.4)


When most of the d runs yield values for Z that are lower than 0.80, then
this null-hypothesis is rejected. The sign test quantifies how many values
smaller than 0.80 lead to rejection of H0. And the statistic for hypothesis
testing can be transformed into a confidence procedure.
Instead of the sign test, the signed rank test may be applied, as follows:
Subtract 0.80 from each observation. Sort the resulting differences in increas-
ing order, disregarding their signs. Assign the sign of the difference to the
corresponding rank. Compute the sum of positive ranks. This sum should not
be too small or too big; the critical values are tabulated.
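The sign test sketched above needs only the binomial distribution: if the true median equals 0.80, the number of responses below 0.80 is Binomial(d, 1/2), so a large count of low responses discredits H0. A minimal sketch (the function name is illustrative, and the hypothesis is stated for the median, as is usual for this distribution-free test):

```python
from math import comb

def sign_test_below(z, threshold=0.80):
    """Distribution-free sign test of H0: median(Z) >= threshold.
    Returns the p-value P[Bin(d, 1/2) >= b], where b is the number of
    observations below the threshold (ties are discarded)."""
    below = sum(1 for v in z if v < threshold)
    d = sum(1 for v in z if v != threshold)
    # upper tail of Binomial(d, 1/2) at 'below'
    return sum(comb(d, k) for k in range(below, d + 1)) / 2 ** d
```

A small p-value (say, below α) rejects H0; inverting the test over a range of thresholds yields the distribution-free confidence interval mentioned in the text.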
Conover (1971) gives an excellent discussion of the statistical details of
distribution-free statistics. Kleijnen (1987) discusses the application in sim-
ulation.

2.1.2 Selecting the Number of Simulated Replications. After a con-
fidence interval has been computed, it may turn out that this interval is too
wide; for example, with 90% probability the availability lies between 0.60
and 1.30 (the upper limit of the interval may indeed exceed the value 1.00).
Then we increase the number of simulated replications. Suppose management
wishes the estimated percentage to be plus or minus 0.05 accurate; that is,
the desired length of the interval is 0.10 or, in general, 2c, where the constant
c is called the half-width of the confidence interval. Then (2.3) gives (say) V,
the desired number of simulated replications:

    V = (t_{α;d−1} S_Z / c)².    (2.5)
A statistical complication is that V is a random variable, whereas d (the number
of available simulated replications) is not. Fortunately, it can be proved that
(2.5) leads to acceptable results, provided Z is indeed n.i.d. Otherwise, the
coverage probability is usually smaller than 1 − α.
There are a number of variations on this approach. First compute the
standard deviation of the simulation response from a pilot sample of (say)
d0 replications: replace d by d0 in (2.1). If the sample size formula (2.5)
yields v ≤ d0, then stop the simulation: the confidence interval has acceptable
length. Otherwise, simulate V − d0 additional replications (note that the extra
sample size is random). From the total sample of random size V, estimate the
response E(Z). Compute the confidence interval in (2.3) from the first-stage
estimator of the standard deviation, and replace d by d0 in (2.3).
A variation on this two-stage approach is purely sequential: after each
simulation run h with h ≥ 2, the variance and the mean estimates are up-
dated, until the confidence interval is tight enough; see Kleijnen (1987, pp.
47-50).
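The two-stage rule can be sketched as follows, again with the normal quantile standing in for t_{α;d0−1} (a large-pilot simplification; the function name is illustrative):

```python
import math
from statistics import stdev, NormalDist

def two_stage_runs(pilot, half_width, alpha=0.10):
    """Two-stage rule based on (2.5): from a pilot of d0 replications,
    estimate S_Z and return the total number of replications needed so that
    the confidence half-width is about 'half_width'."""
    d0 = len(pilot)
    s = stdev(pilot)                         # first-stage S_Z from (2.1)
    t = NormalDist().inv_cdf(1.0 - alpha)    # stand-in for t_{alpha; d0-1}
    v = math.ceil((t * s / half_width) ** 2)  # equation (2.5), rounded up
    return max(v, d0)    # if v <= d0, the pilot sample already suffices
```

If the returned total exceeds the pilot size d0, the difference is the (random) number of additional replications to simulate.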

2.2 Steady-State Simulations

By definition, the steady state is reached only after a very long simulation
run. In practice, the start-up phase is often eliminated. Next the approach
of the preceding subsection (Section 2.1) could be applied. However, given d
runs, the transient phase is then eliminated d times: a waste of computer time.
Moreover, it is not known when exactly the transient phase is over. Therefore
suppose the simulationists execute a single long run (not several replicated
runs); also see Kleijnen and Van Groenendaal (1992, pp. 190-191).
Assume a wide-sense stationary process. Most of these processes have positive
autocorrelation; for example, if a machine must wait long for a repairman to
become available, then the next machine that breaks down must probably
wait long too. This positive correlation implies that the traditional formula
for the variance estimation based on (2.1) has large bias; for example, for an
M/M/1 model it is known that a traffic load of 0.5 gives an estimate that
is wrong by a factor 10; for a 0.9 load this factor becomes 360 (see Kleijnen
1987, p. 61). Unfortunately, in practice the incorrect formula is often applied;
software may use the wrong formula.
There are several methods for the construction of confidence intervals
in steady-state simulations: batching (or subruns), spectral analysis, stan-
dardized time series; see Kleijnen (1987, p. 79). This subsection, however,
concentrates on renewal analysis, assuming that the reader has at least a
basic knowledge of stochastic systems.
Some stochastic systems have a renewal or regenerative state, which im-
plies that the subsequent events are independent of the preceding history.
For example, in a repairman system such a state is the situation with all m
machines up (r mechanics idle), provided the time between failures is a Pois-
son process (X1 ∼ Ne(λ); see Section 1). Independent cycles start as soon as
all machines are up again. Cycle responses are i.i.d. Also see Muppala et al.
(1996) in this volume, and the discussion on nearly renewal states at the end
of Section 2.2.
Denote the length of the renewal cycle by L, and the cycle response (for
example, the availability time during a cycle) by W. Then it is well known that
the steady-state mean response (availability percentage) is

    E(Z) = E(W) / E(L).    (2.6)

From a statistical viewpoint, renewal analysis is interesting, because this
analysis uses ratio estimators; see W̄/L̄ in the next equation. Crane and
Lemoine (1977, pp. 39-46) prove that the central limit theorem yields the
following asymptotic 1 − α confidence interval for the mean response:

    P[E(Z) > W̄/L̄ − z_α S / (√d L̄)] = 1 − α    (2.7)

where z_α denotes the 1 − α quantile of the standard normal variate (for
example, α = .05 gives z_α = 1.64), d is now the number of cycles (in termi-
nating simulations d was the number of replications), and S² is a shorthand
notation:

    S² = S²_W + (W̄/L̄)² S²_L − 2 (W̄/L̄) S_{W,L}    (2.8)

where the variances are estimated analogously to (2.1):

    S²_W = Σ_{t=1}^{d} (W_t − W̄)² / (d − 1);    (2.9)

S²_L follows from (2.9) by replacing W by L, and the covariance is estimated
by

    S_{W,L} = Σ_{t=1}^{d} (W_t − W̄)(L_t − L̄) / (d − 1).    (2.10)

Obviously, the confidence interval in (2.7) becomes smaller, as

(i) the α error increases (so z_α decreases),


(ii) the estimated variances of W and L become smaller, or their (compensating) estimated covariance becomes larger,
(iii) the number of cycles d increases,
(iv) the average cycle length L̄ increases.
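Equations (2.6) through (2.10) translate into a short routine. Here w holds the cycle responses W_t and lengths the cycle lengths L_t (illustrative names), and the normal quantile z_α comes from the standard library:

```python
import math
from statistics import mean, NormalDist

def renewal_lower_bound(w, lengths, alpha=0.05):
    """Regenerative lower confidence bound for E(Z) = E(W)/E(L),
    per (2.7)-(2.10): wbar/lbar - z_alpha * S / (sqrt(d) * lbar)."""
    d = len(w)
    wbar, lbar = mean(w), mean(lengths)
    ratio = wbar / lbar                       # the ratio estimator of (2.6)
    s2_w = sum((wt - wbar) ** 2 for wt in w) / (d - 1)              # (2.9)
    s2_l = sum((lt - lbar) ** 2 for lt in lengths) / (d - 1)
    s_wl = sum((wt - wbar) * (lt - lbar)
               for wt, lt in zip(w, lengths)) / (d - 1)             # (2.10)
    s2 = s2_w + ratio ** 2 * s2_l - 2.0 * ratio * s_wl              # (2.8)
    s = math.sqrt(max(s2, 0.0))   # guard against tiny negative rounding
    z_a = NormalDist().inv_cdf(1.0 - alpha)
    return ratio - z_a * s / (math.sqrt(d) * lbar)                  # (2.7)
```

Note how strongly positively correlated W_t and L_t shrink S through the covariance term, exactly as point (ii) above describes.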
The transient phase does not create any problems in renewal analysis, because
this phase is part of the first cycle. Further, in a Markov system, any state
can be selected as the renewal state; an example is the state 'm − 1 machines
are up' (which is equivalent to the state 'r − 1 mechanics are idle'). A practical
problem, however, is that it may take long before the selected renewal state
occurs again; for example, if the work load of the repairmen is heavy, then
it takes long before all mechanics will be idle again. Also, if there are very
many states (as may be the case in network systems), then it may take long
before a specific state occurs again. In those cases nearly renewal states may
be used; for example, define 'many machines up' as the set of (say) two
states, 'all m machines up' or 'm − 1 machines up'. Obviously, this approxi-
mate renewal state implies that the cycles are not exactly i.i.d. However, for
practical purposes they may be nearly i.i.d., which may be tested through
the Von Neumann statistic for successive differences:

    Σ_{t=2}^{d} (W_t − W_{t−1})² / [(d − 1) S²_W].    (2.11)

This statistic is approximately normally distributed with mean 2 and vari-
ance 4(d − 2)/(d² − 1), provided the W_t are i.i.d. Since W is a sum, normality
may apply. However, the cycle length L probably has an asymmetric distri-
bution. Then the rank version of the Von Neumann statistic may be applied;
see Bartels (1982) and the discussion at the end of Section 2.1.1.
A disadvantage of the Von Neumann test is that at least 100 cycles are
needed to have a reasonable chance of detecting dependence: for d < 100 the
test has low power (high β-error or type-II error); see Kleijnen (1987, p. 68).
The proposed approximate renewal analysis of simulation requires more re-
search.
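Statistic (2.11) and its normal approximation above can be sketched as follows (the function name is illustrative):

```python
import math
from statistics import mean, NormalDist

def von_neumann_test(w, alpha=0.05):
    """Von Neumann successive-difference test of independence, per (2.11).
    Returns (v, reject): v is approximately Normal(2, 4(d-2)/(d^2-1))
    under i.i.d.-ness; reject if v falls outside the two-sided bounds."""
    d = len(w)
    wbar = mean(w)
    s2 = sum((wt - wbar) ** 2 for wt in w) / (d - 1)     # S^2_W
    v = sum((w[t] - w[t - 1]) ** 2
            for t in range(1, d)) / ((d - 1) * s2)       # statistic (2.11)
    sd = math.sqrt(4.0 * (d - 2) / (d * d - 1.0))
    z = NormalDist().inv_cdf(1.0 - alpha / 2)
    reject = abs(v - 2.0) > z * sd
    return v, reject
```

Values of the statistic far below 2 indicate positive dependence between successive cycles (a trending series), values far above 2 indicate negative dependence.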

3. Proportions and Quantiles

In practice, management may not be interested in the expected value of
the (simulated and real-life) response, but in one or more proportions. A
proportion is the probability of the response exceeding (or not exceeding) a
given value; for example, what is the probability of Z (availability percentage)
undershooting 80%:

    P(Z < 0.80) = p_.80    (3.1)

where p_.80 must be estimated from the simulation (this p_.80 should not be
confused with p_y in Section 1). Of course, other values besides 80% may
be of interest; for example, the 90% and 95% values. Multiple proportions
can be handled through Bonferroni's inequality. Also see Aven (1996) in this
volume.
The value p_.80 is estimated by comparing the realizations z_h (h = 1, ..., d)
with 0.80. These comparisons lead to the binomially distributed variable B,
defined as

    B = Σ_{h=1}^{d} A_h  with  A_h = 1 if Z_h < 0.80, and A_h = 0 if Z_h ≥ 0.80.    (3.2)

This binomial variable has variance p_.80 (1 − p_.80) d. Obviously, B/d is an
estimator of p_.80 with variance p_.80 (1 − p_.80)/d.
Consider so-called rare events; for example, replace 0.80 in (3.1) by 10⁻⁶.
When estimating such a small probability (say) p with fixed relative precision
(c in equation (2.5) becomes c·p), then obviously the required number of
runs goes to infinity; also see Heidelberger et al. (1996) in this volume, and
Section 4.4 on importance sampling.
A response that is closely related to a proportion is a quantile: what is
the value that is not exceeded by (say) 80% of the observed responses; for
example, which value of the availability percentage is not exceeded by 80% of the
replications? In symbols:

    P(Z < z_.80) = 0.80    (3.3)

where now z_.80 must be estimated (z_.80 should not be confused with Z_h, which
has as subscript the integer h that runs from 1 to d; z_.80 is defined analogously
to t_{α;v} and z_α, which are quantiles that form the critical values of the Student
and the normal statistics; these symbols are simplified notations for t_{1−α;v}
and z_{1−α}). Notice that the median (z_.50) is a good alternative for the mean
when quantifying the location of a stochastic variable; also see the section on
jackknifing (Section 5). Again, other quantiles besides the 80% quantile may
be of interest; for example, the 90 and 95% quantiles.
A specific quantile is estimated by sorting the d observations z_h, which yields
the order statistics z_(h); that is, z_(1) is the smallest observation, ..., z_(d) is the
largest observation. The 0.80 quantile may be estimated by z_(.80d), with the
1 − α confidence interval

    P(Z_(l) < z_.80 < Z_(u)) = 1 − α    (3.4)

where the lower limit is the l-th order statistic with

    l = 0.80 d − z_{α/2} √(0.80 (1 − 0.80) d)    (3.5)

and the upper limit is the u-th order statistic with

    u = 0.80 d + z_{α/2} √(0.80 (1 − 0.80) d);    (3.6)

to keep the notation simple, we ignore the fact that 0.80d, l, and u are not
necessarily integers; actually, these three real variables must be replaced by
their integer parts.
The estimation of proportions and quantiles in terminating and steady-
state simulations is further discussed in Kleijnen (1987, pp. 36-40) and Klei-
jnen and Van Groenendaal (1992, pp. 195-197).
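The order-statistic procedure of (3.4) through (3.6), including the truncation to integer parts, can be sketched as follows (the function name is illustrative):

```python
import math
from statistics import NormalDist

def quantile_interval(z, q=0.80, alpha=0.10):
    """Point estimate and distribution-free 1-alpha confidence interval
    for the q-quantile, via order statistics as in (3.4)-(3.6)."""
    zs = sorted(z)                  # order statistics z_(1) <= ... <= z_(d)
    d = len(zs)
    z_a2 = NormalDist().inv_cdf(1.0 - alpha / 2)
    half = z_a2 * math.sqrt(q * (1.0 - q) * d)
    point = zs[int(q * d) - 1]                   # the (qd)-th order statistic
    lo = zs[max(int(q * d - half) - 1, 0)]       # l-th order statistic, (3.5)
    hi = zs[min(int(q * d + half) - 1, d - 1)]   # u-th order statistic, (3.6)
    return point, lo, hi
```

The indices are shifted by one because Python lists are zero-based while order statistics are counted from one; the max/min clamps keep l and u inside 1, ..., d.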

4. Variance Reduction Techniques

This chapter is limited to a few simple VRTs (namely common pseudoran-
dom numbers, antithetic random numbers, control variates) and one sophis-
ticated technique (importance sampling). Details on VRTs are given in Fish-
man (1989), Kleijnen (1974/75, pp. 105-285), Kleijnen and Van Groenendaal
(1992, pp. 197-201), Tew and Wilson (1994), and in the references mentioned
in the following subsections.

4.1 Common Pseudorandom Numbers

In the what-if approach there is not so much interest in the absolute magni-
tudes of the results of the simulation as in the differences among the results
for various values of the parameters (such as λ and μ) and input variables (m
and r). Therefore it seems intuitively appropriate to examine simulated sys-
tems under equal conditions (environments). For example, when comparing
different numbers of mechanics (say) r1 and r2, the simulation may use
the same times between successive failures of machines (denote the succes-
sive realizations of X1 by X1t; that is, by x11, x12, ...). This implies the use
of the same stream of pseudorandom numbers for system variants #1 and
#2 (r1 and r2 repairmen, respectively). In that case the two responses Z(r1)
and Z(r2) are correlated. Hence

    var[Z(r1) − Z(r2)] = var[Z(r1)] + var[Z(r2)] − 2 ρ[Z(r1), Z(r2)] √(var[Z(r1)] var[Z(r2)])    (4.1)

where ρ[Z(r1), Z(r2)] denotes the linear correlation coefficient between Z(r1)
and Z(r2). So if the use of the same pseudorandom numbers results in positive
correlation, then the variance of the difference decreases.
In complicated models it may be difficult to realize a strong positive corre-
lation. Therefore separate sequences of pseudorandom numbers are used per
'process'; for example, in the repairman example use one seed for the times
between failures (X1), and a different seed for the repair times (X2). How
should these seeds be selected? One seed may be sampled through the com-
puter's internal clock. However, sampling the other seed(s) in this way may
cause overlap among the various streams (making failure and repair times de-
pendent). For certain generators, there are tables with seeds 100,000 apart.
For other generators such seeds may be generated in a separate computer run. Also see Kleijnen and Van Groenendaal (1992, pp. 29-30).
The advantage of a smaller variance comes at a price: the analysis of the
simulation results becomes more complicated, since the responses are not
independent anymore. For example, when comparing only two responses (as
in the preceding equation), which statistic should be used? In this example
the answer is simple: take the d observations on the difference

    U = Z(r1) − Z(r2)    (4.2)

and substitute this U for Z in equations (2.1) through (2.3) to find a confi-
dence interval for the mean difference.
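A toy illustration of the variance reduction from common pseudorandom numbers: the 'simulation run' below is just the average of exponential service times, and all names, rates, and seeds are invented for the illustration. Each pair of variants either shares a seed (common numbers) or uses disjoint seeds (independent sampling).

```python
import random
from statistics import mean, stdev

def response(service_rate, rng, n=100):
    """Toy stand-in for one simulation run: the average of n exponential
    'repair times' with the given rate, driven by rng's number stream."""
    return mean(rng.expovariate(service_rate) for _ in range(n))

def difference_runs(d=50, use_common=True):
    """d paired runs of two variants (rates 1.0 vs 1.2). With common
    pseudorandom numbers, both variants of a pair reuse the same seed."""
    diffs = []
    for h in range(d):
        seed1 = h
        seed2 = h if use_common else 10_000 + h   # disjoint seeds otherwise
        z1 = response(1.0, random.Random(seed1))
        z2 = response(1.2, random.Random(seed2))
        diffs.append(z1 - z2)     # the observations U_h of (4.2)
    return diffs

common = difference_runs(use_common=True)
independent = difference_runs(use_common=False)
# stdev(common) is much smaller than stdev(independent), because the
# shared stream makes Z(r1) and Z(r2) strongly positively correlated.
```

The d differences in either list can be fed straight into the Student procedure of (2.1) through (2.3).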
However, there are more than two responses when Design of Experiments
(DOE) is used. Then regression analysis or Analysis of Variance (ANOVA) is
applied. Classic ANOVA, however, assumes independent responses. Now we
can use either Generalized Least Squares (GLS) or Ordinary Least Squares
(OLS) with adjusted standard errors for the estimated regression parameters.
In practice this complication is often overlooked. For further discussion see
Kleijnen (1996), the companion chapter in this volume.

4.2 Antithetic Pseudorandom Numbers

The intuitive idea behind antithetic pseudorandom numbers (briefly 'anti-
thetics') is as follows. In the terminating repairman simulation, replication
# 1 may happen to overshoot the mean response: when most times between
failures in replication # 1 happen to be large, then the average response z
is large too. In symbols, letting x̄1,h denote the average time to failure in
replication h, we get x̄1,1 > E(X1) and E[Z1 | x̄1,1 > E(X1)] > E[Z]. In this
case it would be nice if replication # 2 would compensate this overshoot:
E[Z2 | x̄1,2 < E(X1)] < E[Z].
Statistically this 'compensation' means negative correlation between the
replications # 1 and # 2 (with a given combination of r and m). The variance
of their average (say) Z̄ follows from (4.1), taking into account that obviously
both replications have the same variance, var(Z):

var(Z̄) = var(Z)[1 + ρ(Z1, Z2)]/2   (4.3)

If ρ(Z1, Z2) is negative, then the variance of the average Z̄ decreases.
To realize such a negative correlation, use the pseudorandom numbers r
for replication # 1, and the complements or 'antithetics' 1- r for replication
# 2. Actually, the computer does not need to calculate the complements 1 - r,
if it uses a multiplicative congruential generator: simply replace the seed r0
by its complement c - r0, where c stands for the generator's modulus (that
is, ri = a·ri-1 mod c, where a denotes the generator's multiplier). Finally, the
pseudorandom numbers are used to sample X1 from Ne(λ) in replications #
1 and # 2: x1,1 = -log(r)/λ and x1,2 = -log(1 - r)/λ.
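This seed trick is easy to verify for a multiplicative congruential generator with multiplier a and modulus c; the sketch below uses the classic Lewis-Goodman-Miller constants a = 16807 and c = 2^31 - 1 purely for illustration:

```python
A, C = 16807, 2147483647   # multiplier and modulus (illustrative choice)

def mcg(seed, n):
    """Multiplicative congruential generator r_i = A*r_{i-1} mod C,
    returning the pseudorandom numbers u_i = r_i / C."""
    r, out = seed, []
    for _ in range(n):
        r = (A * r) % C
        out.append(r / C)
    return out

u = mcg(12345, 5)
u_anti = mcg(C - 12345, 5)   # complement seed gives the antithetic stream
```

Starting from the complement seed, every generator state equals C - r_i, so u_anti[i] = 1 - u[i] without any explicit complementing.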
422 Jack P.C. Kleijnen

To compute a confidence interval, observe that the d responses now give
d/2 independent pairs with averages (Z̄1, ..., Z̄d/2). Hence in (2.1) through
(2.3) replace d by d/2 and Zh by Z̄g with g = 1, ..., d/2.
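The pairing can be sketched as follows; for simplicity the 'response' of a replication is taken to be a single sampled time between failures, so its expectation is known (1/λ) and the result is easy to check. The parameter values λ = 1 and d = 1000 are illustrative:

```python
import math
import random

def antithetic_pair_averages(lam, d):
    """Return the d/2 i.i.d. pair averages used in the confidence
    interval; replication 2 of each pair uses the complements 1 - r."""
    assert d % 2 == 0
    averages = []
    for _ in range(d // 2):
        r = random.random()
        x1 = -math.log(r) / lam        # replication 1: X ~ Ne(lam)
        x2 = -math.log(1 - r) / lam    # replication 2: antithetic draw
        averages.append((x1 + x2) / 2)
    return averages

random.seed(2)
avgs = antithetic_pair_averages(lam=1.0, d=1000)
```

The negative correlation within each pair makes the variance of a pair average smaller than that of two independent replications, so the resulting interval is narrower for the same number of runs.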

4.3 Control Variates or Regression Sampling

In the terminating repairman simulation with antithetics (Section 4.2), repli-
cation # 1 happened to overshoot the mean response because most times be-
tween failures happened to be large: E[Z | x̄1,1 > E(X̄1) = E(X1)] > E(Z).
Instead of having the next replication compensate this overshoot, the re-
sponse of the present replication can be corrected, as follows.
Obviously the input X̄1 and the output Z are positively correlated:
ρ(X̄1, Z) > 0, or P[Z > E(Z) | x̄1 > E(X1)] > P[Z > E(Z)]. Hence in
case of an overshoot, z is corrected downwards. More specifically, consider
the following linear correction:

Zc = Z - β(X̄1 - E(X1))   (4.4)

where Zc is called the (linear) control variate estimator. Obviously, this new
estimator remains unbiased.
It is easy to derive that this control variate estimator has minimum vari-
ance, if the correction factor β equals

β = ρ(X̄1, Z) √(var(Z)/var(X̄1))   (4.5)
In practice, however, the correlation ρ(X̄1, Z) is unknown. So this cor-
relation is estimated. Actually, replacing the three factors in the right-hand
side of the preceding equation by their classic estimates results in the OLS
estimate (say) β̂ of the regression parameter β in the regression model

Zh = β0 + β·x̄1,h + Eh   with h = 1, ..., d   (4.6)

Therefore the technique of control variates is also called regression sampling.
Obviously, β in the latter equation is estimated from the d replications that
give d i.i.d. pairs (X̄1,h, Zh).
The OLS estimator of β (the optimal correction coefficient (4.5) for the
control variate estimator (4.4)) gives a new control variate estimator. Let
Z̄ denote the sample average of the responses Zh, X̄1 the average over d
replications of the average failure time per run, and β̂ the OLS estimator
of β in (4.6) or in (4.5), based on the d pairs (Zh, X̄1,h). Then the new control
variate estimator is

Z̄c = Z̄ - β̂(X̄1 - E(X1))   (4.7)

This formula can be easily interpreted, when we remember that the estimated
regression model goes through the point of gravity, (X̄1, Z̄).
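A minimal sketch of this estimator follows. The 'simulation' below is a hypothetical linear model with known E(X1) = 1, chosen only so that the correction can be checked; β̂ is the OLS slope estimated from the d pairs:

```python
import random
import statistics

def control_variate_estimate(z, x1, ex1):
    """Control variate estimator of E(Z): the sample mean of Z corrected
    by the OLS slope times (mean of X1 minus its known expectation)."""
    zbar = statistics.mean(z)
    xbar = statistics.mean(x1)
    beta = (sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x1, z))
            / sum((xi - xbar) ** 2 for xi in x1))
    return zbar - beta * (xbar - ex1)

# Hypothetical 'simulation': response strongly driven by the input.
random.seed(3)
x1 = [random.expovariate(1.0) for _ in range(200)]   # inputs, E(X1) = 1
z = [2 * xi + random.gauss(0, 0.1) for xi in x1]     # responses, E(Z) = 2
zc = control_variate_estimate(z, x1, ex1=1.0)
```

Because the input noise is 'regressed away', the corrected estimate is far less sensitive to a lucky or unlucky batch of sampled failure times than the raw sample mean.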
The example can be extended: take as control variates, not only the time
between failures (X1), but also the repairmen's service time (X2). This re-
quires multiple regression analysis. A better idea may be to use the traffic
load, ρ = (λ/μ)(m/r). Actually, the explanatory variables in the regression model
may be selected such that the multiple correlation coefficient R2 adjusted for
the number of control variates, is maximized.
A complication is that estimation of β leads to a biased control variate; see
(4.7). Moreover, the construction of a confidence interval for E(Z) becomes
problematic. These problems can be solved, either assuming multivariate nor-
mality for (Z, X̄1, X̄2) or using the robust technique of jackknifing (see
Section 5).

4.4 Importance Sampling

The preceding VRT relied on the correlation between the responses of com-
parable simulated systems (common seeds, Section 4.1), or between the res-
ponses of antithetic runs (Section 4.2), or between input and output of a run
(Section 4.3). The simulation program itself was not affected; only seeds were
changed or inputs were monitored. Importance sampling, however, drastically
changes the sampling process of the simulation model. This technique is more
sophisticated, but it is necessary when simulating rare events; for example,
in a dependable system unavailability occurs with a probability of (say) one
in a million replicated months. But then a million replicated months must be
simulated, to expect to see (only) one breakdown of the system!
The basic idea of importance sampling is to change the probabilities of
the inputs such that the probability of the response increases; of course the
resulting estimator must be corrected in order to get an unbiased estimator.
This idea can be explained simply in the case of non-dynamic simulation, also
known as Monte Carlo sampling, as the following example demonstrates.
Consider the integral

ξ = ∫_ν^∞ (1/x) λe^(-λx) dx   with λ > 0, ν > 0   (4.8)

The value of this integral can be estimated (other techniques are integral cal-
culus and numerical approximation). Crude Monte Carlo proceeds as follows.
(i) Sample x from Ne(λ).
(ii) Substitute the sampled value x into the 'response'

g(x) = 1/x if x > ν
     = 0   otherwise   (4.9)

Obviously, g(X) is an unbiased estimator of ξ, defined in (4.8). Notice that
g(x) > 0 in (4.9) becomes a rare event as ν → ∞.
Importance sampling does not sample x from the original distribution,
namely Ne(>.), but from a different distribution (say) h(x). The resulting x
is substituted into the response function g(x). However, g(x) is corrected by
the likelihood ratio f(x)/h(x). This gives the corrected response

g*(x) = g(x) f(x)/h(x)   (4.10)
This estimator is an unbiased estimator of ξ:

E[g*(X)] = ∫_0^∞ g(x)[f(x)/h(x)]h(x) dx = ∫_0^∞ g(x)f(x) dx = ξ   (4.11)

It is quite easy to derive the optimal form of h(x), which results in minimum
variance.
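For the integral example above, both estimators can be sketched as follows. The new density h(x) below is an exponential shifted to start at ν, so every draw hits the region x > ν; this is an assumed, convenient choice, not the variance-optimal one, and λ = 1, ν = 4 are illustrative values:

```python
import math
import random

LAM, NU = 1.0, 4.0   # lambda and nu (illustrative values)

def g(x):
    """The 'response' (4.9): 1/x if x > nu, else 0."""
    return 1.0 / x if x > NU else 0.0

def crude_mc(n):
    """Crude Monte Carlo: sample x from Ne(lambda); most draws give 0."""
    return sum(g(random.expovariate(LAM)) for _ in range(n)) / n

def importance_sampling(n, lam0=1.0):
    """Sample x from h(x) = lam0*exp(-lam0*(x - nu)) on [nu, inf) and
    correct each response by the likelihood ratio f(x)/h(x)."""
    total = 0.0
    for _ in range(n):
        x = NU + random.expovariate(lam0)
        f = LAM * math.exp(-LAM * x)           # original density Ne(lambda)
        h = lam0 * math.exp(-lam0 * (x - NU))  # sampling density
        total += g(x) * f / h
    return total / n

random.seed(4)
est = importance_sampling(10_000)
```

The exact integral here is the exponential integral E1(4) ≈ 0.00378; with crude Monte Carlo only about one draw in fifty falls above ν = 4, whereas every importance-sampling draw contributes a nonzero, correctly weighted response.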
For dynamic systems (such as the repairman simulation) a sequence of
inputs must be sampled; for example, successive times between machine fail-
ures X11, X12, .... These inputs are assumed to be i.i.d., so their joint density
function is given by

f(x11, x12, ...) = λe^(-λx11) · λe^(-λx12) ⋯   (4.12)

Suppose crude Monte Carlo and importance sampling use the same type of
input distribution (negative exponential) but with different parameters, λ
and λ0 respectively. Then the likelihood ratio becomes

f(x11, x12, ...)/h(x11, x12, ...) = [λe^(-λx11) · λe^(-λx12) ⋯]/[λ0e^(-λ0x11) · λ0e^(-λ0x12) ⋯]   (4.13)

Obviously this expression can be reformulated to make the computations
more efficient.
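One such reformulation: since the exponential densities factor, the ratio (4.13) collapses to (λ/λ0)^n · exp(-(λ - λ0) Σ x), which is best evaluated on the log scale. A sketch:

```python
import math

def likelihood_ratio(xs, lam, lam0):
    """Likelihood ratio (4.13) for n i.i.d. Ne inputs via the closed
    form (lam/lam0)**n * exp(-(lam - lam0)*sum(xs)); working on the
    log scale avoids floating-point under/overflow for long runs."""
    n, s = len(xs), sum(xs)
    log_lr = n * math.log(lam / lam0) - (lam - lam0) * s
    return math.exp(log_lr)
```

For two inputs x = (1, 2) with λ = 1 and λ0 = 0.5, this equals (1/0.5)^2 · e^(-0.5·3) = 4e^(-1.5), the same value as the term-by-term product in (4.13).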
In the simulation of dynamic systems it is much harder to obtain the op-
timal new density. Yet distributions can be derived that give drastic variance
reductions; see Heidelberger et al. (1996) in this volume, and also Rubinstein
and Shapiro (1992) and the literature mentioned at the beginning of this
section (Section 4).
Importance sampling can be extended to the score function method for
sensitivity analysis and optimization of the simulation response, with respect
to the parameters of the input distributions; see Glynn and Iglehart (1989),
Rubinstein and Shapiro (1992), Kleijnen and Rubinstein (1995), and Kleijnen
(1996), the companion chapter in this volume.

5. Jackknifing

Jackknifing is a very general technique that has two goals:

(i) reducing possible bias of an estimator;
(ii) constructing a robust confidence interval.

Efron (1982) and Miller (1974) give classic reviews of jackknifing.



Suppose there are d observations on Z, as in (2.1) through (2.3). This
yields an estimator; for example, an estimator of the median (see Section 3):

M = Z(0.5d)   (5.1)

where Z(h) still denotes the order statistic; actually, 0.5d should be replaced
by its integer part (see equation 3.4).
Now partition those d observations into (say) g groups of equal size
v (= d/g); g may be equal to d (so v = 1). We shall concentrate on the case
of groups of size one. Eliminate one observation, say, observation h (with
h = 1, ... , d). Calculate the same estimator from the remaining (d - 1) obser-
vations. For example, after dropping the first observation Z1, recalculate the
median. Denote the order statistic after eliminating observation h by Z-h;(j)
with j = 1, ..., d - 1; for example, after eliminating observation 2 the biggest
observation is Z-2;(d-1). In the example of the median, dropping observation
1 gives the estimator of the median

M-1 = Z-1;(0.5[d-1])   (5.2)

Each time, eliminate another observation. This gives d estimators. The hth
pseudovalue (say) Ph is defined as the following linear combination of the
original and the hth estimator of (say) the median:

Ph = dM - (d - 1)M-h   with h = 1, ..., d   (5.3)

The jackknifed estimator is defined as the average pseudovalue:

P̄ = (P1 + ... + Pd)/d   (5.4)

It can be proved that if the original estimator is biased, then the jackknifed
estimator has less bias.
Moreover, jackknifing gives the following robust confidence interval. Treat
the d pseudovalues Ph as d i.i.d. variables: compute the 1 - α confidence
interval from the Student statistic with d - 1 degrees of freedom, using (2.1)
through (2.3), replacing Z by P.
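The whole recipe (pseudovalues, jackknifed point estimate, Student interval) fits in a few lines of Python; the data set and the critical value t(9; 0.975) ≈ 2.262 below are hypothetical:

```python
import math
import statistics

def jackknife(estimator, data, t_crit):
    """Jackknifed point estimate and robust confidence interval:
    pseudovalues P_h = d*M - (d - 1)*M_{-h}, then a Student interval
    treating the P_h as d i.i.d. variables."""
    d = len(data)
    theta = estimator(data)
    pseudo = [d * theta - (d - 1) * estimator(data[:h] + data[h + 1:])
              for h in range(d)]
    pbar = statistics.mean(pseudo)
    half = t_crit * statistics.stdev(pseudo) / math.sqrt(d)
    return pbar, (pbar - half, pbar + half)

data = [0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4, 1.5, 2.0]  # toy responses
est, (lo, hi) = jackknife(statistics.median, data, t_crit=2.262)
```

The same function works unchanged for any estimator passed in, which is exactly why jackknifing is attractive after renewal analysis or variance reduction, where no simple distribution theory is available.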
Let us consider one more example. The VRT of control variates was based
on d i.i.d. pairs (Zh, X̄1,h); see Section 4.3. Now eliminate pair h, and calculate
the control variate estimator, using (4.7):

Z̄-h;c = Z̄-h - β̂-h(X̄-h;1 - E(X1))   (5.5)
where Z̄-h denotes the sample average of the responses after elimination
of Zh, X̄-h;1 denotes the average failure time after eliminating run h with
its average x̄1;h, and β̂-h is the OLS estimator based on the remaining
d - 1 pairs. (Note that E(X̄-h;1) = E(X1) = 1/λ.) This Z̄-h;c gives the
pseudovalue

Ph = dZ̄c - (d - 1)Z̄-h;c   (5.6)

where Z̄c is the control variate estimator based on all d pairs; see (4.7).
Jackknifed renewal analysis (Section 2.2) is discussed in Kleijnen and Van
Groenendaal (1992, pp. 202-203); jackknifed GLS (Section 4.1) is discussed
in Kleijnen et al. (1987).
Jackknifing is related to bootstrapping, which samples from the set of
d observations; see Efron (1982), Efron and Tibshirani (1993), and Cheng
(1995).

6. Conclusion

This chapter addressed the following questions (also see the introduction,
Section 1):
(i) How to initialize the simulation run?
This chapter emphasized the distinction between terminating and steady-
state simulations. In a terminating simulation we may start with the situation
at the end of last replication; in a steady-state simulation we may start with
all machines running, if that is a renewal state.
(ii) How to assess the accuracy of the simulation response at the end of the
simulation run?
Accuracy may be quantified by a 1 - α confidence interval for the simulation
response. This interval may assume normality (Student's statistic) or not
(Johnson's modified Student statistic, distribution-free statistics).
(iii) How to improve this accuracy, if it is too low: how much longer to sim-
ulate the system?
A confidence interval with a fixed length can be derived by sequential statis-
tical procedures. The resulting stopping rule selects the number of necessary
runs with the terminating simulation or the number of renewal cycles with
the steady-state simulation. The latter type of simulation may also use 'ap-
proximate' renewal states.
Further, this chapter covered proportions and quantiles, as alternatives
for the expected value.
(iv) Which 'tricks' to use, in order to improve this accuracy?
Several simple VRTs can be applied: common pseudorandom numbers, anti-
thetic numbers, and control variates. In rare-event simulation it is necessary
to apply importance sampling.
Finally this chapter covered jackknifing as a general technique for reducing
possible bias and for constructing robust confidence intervals. Jackknifing
may be needed after application of renewal analysis and VRTs.

Acknowledgement. I thank Jorg Jansen, who is a Ph.D. student at Tilburg Uni-
versity, for his comments on earlier drafts of this chapter. His comments helped to
clarify some parts and to eliminate some errors. All remaining errors are my sole
responsibility.

References

Aven, T.: Availability Analysis of Monotone Systems. In this volume (1996), pp.
206-223
Bartels, R.: The Rank Version of Von Neumann's Ratio Test for Randomness.
Journal of the American Statistical Association 77, 40-46 (1982)
Cheng, R.C.H.: Bootstrap Methods in Computer Simulation Experiments. In: Alex-
opoulos, C., Kang, K., Lilegdon, W.R., Goldsman, D. (eds.): Proceedings of the
Winter Simulation Conference (1995)
Conover, W.J.: Practical Non-parametric Statistics. New York: Wiley 1971
Crane, M.A., Lemoine, A.J.: An Introduction to the Regenerative Method for Sim-
ulation Analysis. Berlin: Springer 1977
Efron, B.: The Jackknife, the Bootstrap, and Other Resampling Plans. CBMS-NSF
Series. Philadelphia: SIAM 1982
Efron, B., Tibshirani, R.J.: Introduction to the Bootstrap. London: Chapman and
Hall 1993
Fishman, G.S.: Focussed Issue on Variance Reduction Methods in Simulation: In-
troduction. Management Science 35, 1277 (1989)
Glynn, P.W., Iglehart, D.L.: Importance Sampling for Stochastic Simulation. Man-
agement Science 35, 1367-1392 (1989)
Heidelberger, P., Shahabuddin, P., Nicola, V.: Bounded Relative Error in Estimat-
ing Transient Measures of Highly Dependable Non-Markovian Systems. In this
volume (1996), pp. 487-515
Jensen, U.: Stochastic Models of Reliability and Maintenance: An Overview. In this
volume (1996), pp. 3-36
Kleijnen, J.P.C.: Statistical Techniques in Simulation (Two Volumes). New York:
Marcel Dekker 1974/1975
Kleijnen, J.P.C.: Statistical Tools for Simulation Practitioners. New York: Marcel
Dekker 1987
Kleijnen, J.P.C.: Simulation: Sensitivity Analysis and Optimization through Re-
gression Analysis and Experimental Design. In this volume (1996), pp. 429-441
Kleijnen, J.P.C., Karremans, P.C.A., Oortwijn, W.K., Van Groenendaal, W.J.H.:
Jackknifing Estimated Weighted Least Squares: JEWLS. Communications in
Statistics, Theory and Methods 16, 747-764 (1987)
Kleijnen, J.P.C., Kloppenburg, G.L.J., Meeuwsen, F.L.: Testing the Mean of an
Asymmetric Population: Johnson's Modified t-Test Revisited. Communications
in Statistics, Simulation and Computation 15, 715-732 (1986)
Kleijnen, J.P.C., Rubinstein, R.Y.: Optimization and Sensitivity Analysis of Com-
puter Simulation Models by the Score Function Method. European Journal of
Operational Research. To appear (1996)
Kleijnen, J.P.C., Van Groenendaal, W.J.H.: Simulation: A Statistical Perspective.
Chichester: Wiley 1992
Law, A.M., Kelton, W.D.: Simulation Modeling and Analysis. Second Edition. New
York: McGraw-Hill 1991
Miller, R.G.: The Jackknife - A Review. Biometrika 61, 1-15 (1974)
Muppala, J.K., Malhotra, M., Trivedi, K.S.: Markov Dependability Models of Com-
plex Systems: Analysis Techniques. In this volume (1996), pp. 442-486
Rubinstein, R.Y., Shapiro, A.: Discrete Event Systems; Sensitivity Analysis and
Stochastic Optimization by the Score Function Method. New York: Wiley 1992
Tew, J.D., Wilson, J.R.: Estimating Simulation Metamodels Using Combined
Correlation-Based Variance Reduction Techniques. IIE Transactions 26, 2-16
(1994)
Simulation: Sensitivity Analysis and
Optimization Through Regression Analysis
and Experimental Design
Jack P.C. Kleijnen
Department of Information Systems and Auditing and Center for Economic Re-
search (CentER), School of Management and Economics, Tilburg University, 5000
LE Tilburg, The Netherlands

Summary. This chapter gives a tutorial survey on the use of statistical techniques
in sensitivity analysis, including the application of these techniques to optimization
and validation of simulation models. Sensitivity analysis is divided into two phases.
The first phase is a pilot stage, which consists of screening or searching for the im-
portant factors; a simple technique is sequential bifurcation. In the second phase,
regression analysis is used to approximate the input/output behavior of the sim-
ulation model. This regression analysis gives better results when the simulation
experiment is well designed, using classical statistical designs such as fractional fac-
torials. To optimize the simulated system, Response Surface Methodology (RSM)
is applied; RSM combines regression analysis, statistical designs, and steepest as-
cent. To validate a simulation model that lacks input/output data, again regression
analysis and statistical designs are applied. Several case studies are summarized;
they illustrate how in practice statistical techniques can make simulation studies
give more general results, in less time.

Keywords. Validation, what if, regression analysis, least-squares methods, design
of experiments

1. Introduction

The objective of this chapter is to examine the problem of sensitivity analysis
in simulation, including the related issues of optimization and validation.
To solve these problems, this chapter gives a survey of certain statistical
techniques, namely Design Of Experiments (DOE) and its analysis through
regression analysis (also known as ANOVA, Analysis Of Variance).
This chapter is a tutorial that discusses not only methodology, but also
applications. These applications come from my own experience as a consul-
tant, and from publications by others in Europe and the USA. The reader is
assumed to have a basic knowledge of mathematical statistics and simulation.
More specifically, the following questions are addressed (which should be
answered for all simulation models, including simulation models for the reli-
ability and maintenance of complex systems):
1. What if: what happens if the analysts change parameters, input variables
or modules (such as subroutines for priority rules) of the simulation model?
This question is closely related to sensitivity analysis and optimization, as
this chapter will show. The literature on statistical designs uses the term
factor to denote a parameter, input variable or module.
2. Validation: is the simulation model an adequate representation of the cor-
responding system in the real world? This chapter addresses only part of
the validation problem.
To answer these practical questions, this chapter takes techniques from the
science of mathematical statistics (briefly, statistics). It is not surprising that
statistics is so important in simulation: by definition, simulation means that
a model is 'solved' - not by mathematical analysis (see many other chapters
in this volume) or by numerical methods (see Muppala et al. 1996) - but by
experimentation. But experimentation requires a good design and a good
analysis! DOE with its concomitant analysis is a standard topic in statistics.
However, the standard statistical techniques must be adapted such that they
account for the particularities of simulation. For example, there are a great
many factors in many practical simulation models. Indeed, one application
(discussed later) has hundreds of factors, whereas standard DOE assumes
only up to (say) fifteen factors. Moreover, stochastic simulation models use
pseudorandom numbers, which means that the analysts have much more
control over the noise in their experiments than the investigators have in
standard statistical applications (for example, common and antithetic seeds
may be used; see the companion chapter, Kleijnen 1996).

The main conclusions of this chapter will be:


(i) Screening may use the simple, efficient, and effective technique of sequen-
tial bifurcation; see Bettonvil and Kleijnen (1995).
(ii) Next, regression analysis generalizes the results of the simulation ex-
periment, since it characterizes the input/output behavior of the simulation
model.
(iii) Statistical designs give good estimators of main (first-order) effects and
interactions among factors; these designs require fewer simulation runs than
intuitive designs do.
(iv) Optimization may use Response Surface Methodology or RSM, which
builds on regression analysis and statistical designs; see (ii) and (iii).
(v) Validation may use regression analysis and statistical designs, especially
if there are no data on the input/output of the simulation model or its mod-
ules.
(vi) These statistical techniques have already been applied many times in
practical simulation studies, in many domains; these techniques make simu-
lation studies give more general results, in less time.
Occasionally this chapter will use the repairman example that was also
used in the companion chapter (Kleijnen 1996), so this example is repeated
here briefly. There are m machines that are maintained by a crew of r re-
pairmen (mechanics). Machine j has a stochastic time between failures (say)
X1j with j = 1, ..., m. Notice that stochastic variables are shown in capitals;
their realizations in lower case letters or numbers. Time to repair machine
j by repairman i (with i = 1, ..., r) is X2ij; that is, mechanic i may be
specialized in the repair of machine j. Different priority rules may be imple-
mented: First-In-First-Out (FIFO), Shortest-Processing-Time (SPT), and so
on. A typical response is availability: management wishes to know the per-
centage of time that at least one machine is running, denoted by (say) Y.
(Multi-variate responses are usually handled through the application of the
techniques of this chapter per response type; also see Kleijnen 1987 and the
companion chapter, Kleijnen 1996). A simulation run is a single time path
that has fixed values for all its inputs and parameters. In this example, these
inputs are m and r, and the parameters of the distributions for the inputs X1j
and X2ij; for Negative exponential (Ne) distributions the latter parameters
are λj (λ = 1/MTTF, where MTTF stands for Mean Time To Failure) and
μij (repair rate of repairman i for machine j). A special variable is the pseu-
dorandom number seed R0. Notice that there are many parameters, namely m
failure rates and rm repair rates μij (in the companion chapter we assumed
λj = λ and μij = μ). Moreover, r, m, and the queueing priority rule may be
changed. So there is a great need for statistical designs.
The remainder of this chapter is organized as follows. Section 2 discusses
sensitivity analysis by means of DOE, which treats the simulation model as a
black box. More specifically, Section 2.1 studies the screening phase of a sim-
ulation study: which factors among the many potentially important factors
are really important? Section 2.1.1 discusses a very efficient screening tech-
nique, called sequential bifurcation. Section 2.2 discusses how to approximate
the input/output behavior of simulation models by regression analysis. First
it discusses graphical methods, namely scatter plots; see Section 2.2.1. Next
it presents regression analysis (which formalizes the graphical approach), in-
cluding standardization of factors, Generalized Least Squares (GLS), and
cross-validation; see Section 2.2.2. Next, Section 2.3 discusses statistical de-
signs. First the focus is on designs that assume only main effects (Section
2.3.1). Then follow designs that give unbiased estimators for the main ef-
fects, even if there are interactions between factors (Section 2.3.2). Further,
this section discusses designs that allow estimation of individual interactions
(Section 2.3.3). Section 2.3 ends with designs for estimating the curvature
(quadratic effects) of the input/output approximation (Section 2.3.4). Sec-
tion 3 proceeds with the role of sensitivity analysis in validation, emphasizing
the effects of data availability. Section 4 presents the optimization of simu-
lated systems through RSM. Section 5 gives a summary and conclusions.
Seventeen references conclude the chapter. This chapter is based on Kleijnen
(1995c).

2. Sensitivity Analysis
The vast literature on simulation does not provide a standard definition of
sensitivity analysis. In this chapter, sensitivity analysis is defined as the sys-
tematic investigation of the reaction of the simulation responses to extreme
values of the model's input or to drastic changes in the model's structure. For
example, what happens to the system's availability, when the MTTF doubles;
what happens if the priority rule changes from FIFO to SPT? So the focus
in this chapter is not on marginal changes (or perturbations) in the input
values.
Moreover, the simulation model is treated as a black box: the simulation
inputs and outputs are observed, and from this input/output behavior the
factor effects are estimated. This approach is standard in DOE.
DOE has advantages and disadvantages. One benefit is that this approach
can be applied to all simulation models. A drawback is that it can not take
advantage of the specific structure of a given simulation model, so it may take
many simulation runs to perform the sensitivity analysis. But DOE requires
fewer runs than the intuitive approach often followed in practice (see the
one-factor-at-a-time approach in Section 2.3.1).
Note: The intricacies of the specific simulation model at hand are con-
sidered in perturbation analysis and in modern importance sampling, also
known as score function; see Ho and Cao (1991), Glynn and Iglehart (1989),
and Rubinstein and Shapiro (1993) respectively. Perturbation analysis and
score function require only one run. Unfortunately, these methods also require
more mathematical sophistication.

2.1 Pilot or Screening Phase

In the pilot phase of a simulation study there are usually a great many poten-
tially important factors. For example, in the repairman system of Section
1 there are m failure rates and rm repair rates; r, m, and the queueing pri-
ority rule may also be factors. It is the mission of science to come up with a
short list of the most important factors; it is unacceptable to say 'everything
depends on everything else': parsimony principle.
In practice, analysts often restrict their study to a few factors, usually no
more than ten. Those factors are selected through intuition, prior knowledge,
and the like. The factors that are ignored (kept constant) are - explicitly
or implicitly - assumed to be unimportant. For example, in the repairman
example, it is traditional to assume equal MTTFs (1/λj = 1/λ) and equal
repair rates (μij = μ). Of course, such an assumption severely restricts the
generality of the simulation study!
The statistics literature includes screening designs. These designs provide
scientific methods for finding the important factors. There are several types
of screening designs: random, supersaturated, group screening designs, and
so on; see Kleijnen (1987).

Unfortunately, the statistics literature pays too little attention to screen-
ing designs. The reason for this neglect is that in standard statistical appli-
cations it is virtually impossible to control hundreds of factors; fifteen is hard
enough. In simulation, however, models may have hundreds of parameters,
and yet their control is simple: just specify which combinations of parame-
ter values to simulate. Nevertheless, screening applications in simulation are
rare, because most analysts are not familiar with these designs. Yet these
designs are simple and efficient.
Recently, screening designs have been improved and new variations have
been developed; details are given in Bettonvil and Kleijnen (1995). The next
sub-subsection covers the most promising type, namely sequential bifurcation.
2.1.1 Sequential Bifurcation. Sequential bifurcation uses the aggregation
principle, which is often applied in science when studying complicated sys-
tems. So at the start of the simulation experiment, sequential bifurcation
groups the individual factors into clusters. To make sure that individual fac-
tor effects do not cancel out, sequential bifurcation assumes that the analysts
know whether a specific individual factor has a positive or negative effect on
the simulation response: known signs. In practice this assumption is not very
restrictive. For example, in the repairman simulation it is known that in-
creasing the MTTF increases the response, availability (but it is unknown
how big this increase is; therefore the analysts use a simulation model).
In practice, sequential bifurcation was applied to an ecological simulation
with 281 parameters. The ecological experts felt comfortable specifying in
which direction a specific parameter affects the response (this response is
the future carbon-dioxide or CO2 concentration; CO2 creates the greenhouse
effect). Moreover, if a few individual factors have unknown signs, then these
factors can be investigated separately, outside the sequential bifurcation de-
sign.
Sequentialization means that factor combinations to be simulated, are
selected as the experimental results become available; that is, as simulation
runs are executed, insight into factor effects is accumulated and used to select
the next run. As the experiment proceeds, groups of factors are eliminated,
because sequential bifurcation concludes that these clusters contain no im-
portant factors.
Also, as the experiment proceeds, the groups become smaller. More specif-
ically, each group that seems to include one or more important factors, is split
into two subgroups of the same size: bifurcation. At the end of screening by
means of sequential bifurcation, individual factors are investigated.
To illustrate the technique, consider a simple example with 128 factors,
of which only 3 factors are important, namely the factors # 68, # 113, and
# 120. Then it is easy to check that in only 16 runs these important factors
are detected by sequential bifurcation.
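The bookkeeping of this 128-factor example can be sketched as follows. The black box below is hypothetical (the three important factors get assumed positive effects), a 'run' sets the first `upto` factors high and the rest low, and caching ensures each distinct run is executed only once; with this setup the three factors are indeed found in 16 runs:

```python
from functools import lru_cache

EFFECTS = {68: 5.0, 113: 3.0, 120: 1.0}   # hypothetical important factors
K = 128                                   # total number of factors

@lru_cache(maxsize=None)
def simulate(upto):
    """One run: factors 1..upto at the high level, the rest low; the
    toy black box just sums the effects of the switched-on factors."""
    return sum(w for i, w in EFFECTS.items() if i <= upto)

def bifurcate(lo, hi, found):
    """Group = factors lo+1..hi. Prune the group when its aggregated
    effect y(hi) - y(lo) is zero (this relies on the known-signs
    assumption, so individual effects cannot cancel out)."""
    if simulate(hi) - simulate(lo) == 0:
        return
    if hi - lo == 1:
        found.append(hi)       # group of size one: an important factor
        return
    mid = (lo + hi) // 2
    bifurcate(lo, mid, found)  # left subgroup
    bifurcate(mid, hi, found)  # right subgroup

found = []
bifurcate(0, K, found)
runs = simulate.cache_info().currsize   # distinct runs actually executed
```

In a stochastic simulation the exact zero test would be replaced by a statistical test on the estimated group effect, but the splitting logic stays the same.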
In the ecological case study, sequential bifurcation took 154 simulation
runs to identify and estimate the 15 most important factors among the orig-
inal 281 factors. Some of these 15 factors surprised the ecological experts, so
sequential bifurcation turns out to be a powerful statistical (black box) tech-
nique. Moreover, had the analysts assumed no interactions between factors,
then sequential bifurcation would have halved the number of runs (154/2 =
77 runs).
The ecological case study concerns a deterministic simulation model (con-
sisting of a set of non-linear difference equations). There is a need for more
research, applying sequential bifurcation to large random simulations, such
as simulations of reliability and maintenance of complex systems.

2.2 Approximating the Input/Output Behavior of Simulation
Models by Regression Analysis
2.2.1 Introduction: Graphical Methods. After the screening phase (Sec-
tion 2.1), the number of factors to be further investigated is reduced to a small
number (for example, fifteen).
Practitioners often make a scatter plot with the values of one factor (for
example, MTTF) on the x-axis and the simulation response (say, availability)
on the y-axis. This graph indicates the input/output behavior of the
simulation model, treated as a black box. It shows whether this factor has
a positive or negative effect on the response, whether that effect remains
constant over the domain (experimental area) of the factor, etc.
This scatter plot can be further analyzed: fit a curve to these (x, y) data;
for example, fit a straight line (y = β_0 + β_1 x). Of course, other curves can be
fitted: quadratic (second degree polynomial), exponential, logarithmic (using
paper with a log scale), and so on.
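For illustration, the straight-line fit can be computed with the closed-form least-squares estimates; the (MTTF, availability) pairs below are invented for the sketch:

```python
# Hypothetical (MTTF, availability) pairs read off a scatter plot.
x = [50.0, 100.0, 150.0, 200.0, 250.0]
y = [0.80, 0.88, 0.91, 0.93, 0.94]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Ordinary least squares for the straight line y = b0 + b1*x.
b1 = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
      / sum((xi - mean_x) ** 2 for xi in x))
b0 = mean_y - b1 * mean_x

print(round(b0, 3), round(b1, 5))   # -> 0.793 0.00066
```

The positive slope confirms that this (made-up) MTTF has a positive effect on availability over the domain; taking logarithms of x first would let the same formulas fit a logarithmic curve.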
To study interactions between factors, scatter plots per factor can be com-
bined. For example, the scatter plot for different MTTF values was drawn,
given a certain number of repairmen r. Plots for different numbers of me-
chanics can now be superimposed. Intuitively, the availability curve for a low
number of mechanics lies below the curve for a high number of mechanics (if
not, the simulation model is probably wrong; see the discussion on validation
in Section 3). If the response curves are not parallel, there are interactions,
by definition.
However, superimposing many plots is cumbersome. Moreover, their inter-
pretation is subjective: are the response curves really parallel straight lines?
These shortcomings are removed by regression analysis.
2.2.2 Regression Analysis. A regression model is a metamodel of the
simulation model; that is, a regression model approximates the input/output
behavior of the simulation model that generates the input/output data to
which the regression analysis is applied. Consider the second degree polynomial
Y_i = β_0 + Σ_{h=1}^{k} β_h x_ih + Σ_{h=1}^{k} Σ_{h'=h}^{k} β_hh' x_ih x_ih' + ε_i   (i = 1, ..., n)   (2.1)
with

Y_i: simulation response of factor combination i;
β_0: overall mean response or regression intercept;
β_h: main or first-order effect of factor h;
x_ih: value of the standardized factor h in combination i (see equation (2.2) below);
β_hh': interaction between factors h and h' with h ≠ h';
β_hh: quadratic effect of factor h;
ε_i: fitting error of the regression model for factor combination i;
n: number of simulated factor combinations.

Simulation: Sensitivity Analysis and Optimization 435
First ignore interactions and quadratic effects, for didactic reasons. Then
the relative importance of a factor is obtained by sorting the absolute values of
the main effects β_h, provided the factors are standardized. So let the original
(non-standardized) factor h be denoted by z_h. In the simulation experiment
z_h ranges between a lowest value l_h and an upper value u_h; that is, the
simulation model is not valid outside that range (see the discussion on validation
in Section 3), or in practice that factor can range over that domain only (for
example, the number of repairmen can vary only between one and five). The
variation (or spread) of that factor is measured by a_h = (u_h − l_h)/2; its
location (or mean) by b_h = (u_h + l_h)/2. Then the following standardization
is appropriate:

x_ih = (z_ih − b_h) / a_h   (2.2)
The classic fitting algorithm, which estimates the parameters β of the
regression model in equation (2.1), uses the ordinary least squares (OLS)
criterion. Software for this algorithm is abundant.
If statistical assumptions about the fitting error are added, then there are
better algorithms. Consider the following assumptions.
It is realistic to assume that the variance of the stochastic fitting error
ε varies with the input combination of the random simulation model:
var(ε_i) = σ_i². (So Y, the response of the stochastic simulation, has a mean
and a variance that both depend on the input.) Then weighted least squares
(with the standard deviations σ_i as weights) yields unbiased estimators of
the factor effects, but with smaller variances than OLS gives.
Common pseudorandom number seeds can be used to simulate different
factor combinations (see the companion chapter, Kleijnen 1996). Then
generalized least squares (GLS) gives minimum variance, unbiased estimators.
Unfortunately, in practice the variances and covariances of the simulation
responses Y are unknown, so they must be estimated. The following equation
gives the classic covariance estimator, assuming d independent replications
(or simulation runs) per factor combination (so Y_ig and Y_i'g are correlated,
but Y_ig and Y_ig' are not):

cov(Y_i, Y_i') = Σ_{g=1}^{d} (Y_ig − Ȳ_i)(Y_i'g − Ȳ_i') / (d − 1)   (2.3)
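Equation (2.3) is direct to implement. The d = 4 replication values below are invented, paired per replication as common seeds would pair them:

```python
# Hypothetical responses of two factor combinations, d = 4 replications each,
# simulated with common pseudorandom number seeds (replication g pairs up).
y_i = [10.2, 11.0, 9.8, 10.6]
y_j = [12.1, 13.0, 11.5, 12.6]

d = len(y_i)
mean_i = sum(y_i) / d
mean_j = sum(y_j) / d

# Classic covariance estimator of equation (2.3).
cov = sum((a - mean_i) * (b - mean_j) for a, b in zip(y_i, y_j)) / (d - 1)
print(round(cov, 4))   # -> 0.3333
```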

Fortunately, the resulting estimated GLS gives good results; see Kleijnen and
Van Groenendaal (1992).
Of course, it is necessary to check the fitted regression metamodel: is it an
adequate approximation of the underlying simulation model? Therefore the
metamodel may be used to predict the outcomes for new factor combinations
of the simulation model. So replace β in the specified metamodel by the
estimate β̂, and substitute new combinations of x (there are n old combinations).
Compare the predictions ŷ with the simulation responses y.
A refinement is cross-validation: do not add new combinations (which
require more computer time), but eliminate one old combination (say) com-
bination i and re-estimate the regression model from the remaining n - 1
combinations. Repeat this elimination for all values of i (i = 1, ..., n; see
equation (2.1)). This approach resembles jackknifing, discussed in the com-
panion chapter, Kleijnen (1996). Statistical details are discussed in Kleijnen
and Van Groenendaal (1992).
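This leave-one-out procedure can be sketched as follows, on hypothetical data and with a hypothetical helper that refits OLS on each reduced set:

```python
# Hypothetical factor values and simulation responses for n = 5 combinations.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

def ols_line(x, y):
    """Ordinary least squares for the straight line y = b0 + b1*x."""
    m = len(x)
    mx, my = sum(x) / m, sum(y) / m
    b1 = (sum((a - mx) * (b - my) for a, b in zip(x, y))
          / sum((a - mx) ** 2 for a in x))
    return my - b1 * mx, b1

errors = []
for i in range(len(xs)):
    # Eliminate combination i and re-estimate from the remaining n - 1 ...
    b0, b1 = ols_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
    # ... then predict the eliminated combination and record the error.
    errors.append(ys[i] - (b0 + b1 * xs[i]))

print(max(abs(e) for e in errors) < 0.5)   # -> True: small errors, adequate fit
```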
Applications of regression metamodeling will be discussed below (Section 2.3 through Section 4).

2.3 Statistical Designs

Section 2.2.2 used regression metamodels to approximate the input/output
behavior of simulation models. Such a metamodel has (say) q regression
parameters β, which measure the effects of the k factors; for example, q equals
k + 1 if there are no high-order effects; if there are interactions between factors,
then q increases with k(k − 1)/2; and so on.
It is obvious that to get unique, unbiased estimators of these q effects,
it is necessary to simulate at least q factor combinations. Moreover, which
n combinations to simulate (provided that n ≥ q) can be determined such
that the accuracy (or precision) of the estimated factor effects is maximized
(variance minimized). This is the goal of the statistical theory on DOE (which
Fisher started in the 1930s and Taguchi continues today).
2.3.1 Main Effects Only. Consider a first-order polynomial, which is a
model with only k main effects, besides the overall mean (see the first two
terms in the right-hand side of equation (2.1)).
In practice, analysts usually first simulate the 'base' situation, and next
they change one factor at a time; so, all together they simulate 1 + k runs.
However, DOE proves that it is better to use orthogonal designs, that is,
designs that satisfy

X'X = nI   (2.4)
with the following notation:

bold letters: matrices;
X = (x_ij): design matrix with i = 1, ..., n; j = 0, 1, ..., k; n > k;
x_i0 = 1: dummy factor;
x_ij: defined below equation (2.1);
I: identity matrix (this capital letter does not denote a stochastic variable).
Orthogonal designs give estimators of β that are unbiased and have
smaller variances than the estimators resulting from designs that change one
factor at a time.
Orthogonal designs are tabulated in many publications. The analysts may
also learn how to construct those designs; see Kleijnen (1987). Recently, soft-
ware has been developed to help the analysts specify these designs; see Oren
(1993).
A well-known class of orthogonal designs is that of 2^(k−p) fractional
factorials. An example is a simulation with k = 7 factors and n = 2^(7−4) = 8
factor combinations (runs); that is, only the fraction 2^(−p) = 2^(−4) is simulated.
Actually, these 2^(k−p) designs also require 8 runs when 4 ≤ k ≤ 7. See Kleijnen
(1987).
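Such a design can be constructed from a full 2^3 factorial in three columns plus four generated columns; the generators used below are one common choice (the references tabulate these designs). The sketch also verifies equation (2.4):

```python
from itertools import product

# 2^(7-4) fractional factorial: full 2^3 design in factors 1-3; factors 4-7
# generated as products of those columns (one common choice of generators).
design = [[x1, x2, x3, x1 * x2, x1 * x3, x2 * x3, x1 * x2 * x3]
          for x1, x2, x3 in product((-1, 1), repeat=3)]
n, k = len(design), len(design[0])

# Check orthogonality, equation (2.4): X'X = nI, with the dummy column x_i0 = 1.
X = [[1] + row for row in design]
for a in range(k + 1):
    for b in range(k + 1):
        s = sum(X[i][a] * X[i][b] for i in range(n))
        assert s == (n if a == b else 0)

print(n, k)   # -> 8 7
```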
References to many simulation applications of these designs can be found
in Kleijnen (1987) and Kleijnen and Van Groenendaal (1992).
In practice, however, it is unknown whether only main effects are important.
Therefore orthogonal designs with n ≈ k + 1 should be used only in
optimization (see Section 4). Moreover, these designs are useful as building
blocks if interactions are accounted for; see Section 2.3.2.
2.3.2 Main Effects Biased by Interactions? It seems prudent to as-
sume that interactions between pairs of factors may be important. Then the
k main effects can still be estimated without bias caused by these interactions.
However, the number of simulated factor combinations must be doubled; for
example, k = 7 requires n = 2 × 8 = 16. These designs also give an indication
of the importance of interactions; see also Section 2.3.3.
Details, including simulation applications, are presented in Kleijnen (1987)
and Kleijnen and Van Groenendaal (1992).
Recent applications include the simulation of a decision support system
(DSS) for the investment analysis of gas pipes in Indonesia, and a simulation
model for the Amsterdam police; see Van Groenendaal (1994) and Van Meel
(1994) respectively.
2.3.3 Factor Interactions. Suppose the analysts wish to estimate the
individual two-factor interactions β_hh'; see equation (2.1). There are k(k − 1)/2
such interactions. Then many more simulation runs are necessary; for
example, k = 7 factors may be studied in a fractional factorial design with
n = 2^(7−1) = 64 factor combinations (runs). Therefore only small values of k
are studied in practice. Kleijnen (1987) gives details, including applications.
Of course, if k is really small (say, k = 3), then all 2^k (say, 2^3) combinations
are simulated, so all interactions (not only two-factor interactions) can
be estimated. In practice, these full factorial designs are indeed sometimes
used (but high-order interactions are hard to interpret). See Kleijnen (1987).
2.3.4 Quadratic Effects: Curvature. If the quadratic effects β_hh in
equation (2.1) are to be estimated, then at least k extra runs are needed (since
h runs from 1 to k). Moreover, each factor must be simulated for more than
two values.
Popular in statistics and in simulation are central composite designs. They
have five values per factor, and require many runs (n > q). For example, if
there are k = 2 factors, then q = 6 effects are to be estimated, but as many as
n = 9 factor combinations are simulated. See Kleijnen (1987) and Kleijnen
and Van Groenendaal (1992).
Applications are found in the optimization of simulation models; see Sec-
tion 4.
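A central composite design for k = 2 can be sketched as the 2^2 factorial points, 2k axial points at distance α (the rotatable choice α = √2 is used here purely as an illustration), and a center point:

```python
import math

# Central composite design for k = 2 factors.
alpha = math.sqrt(2)                          # axial distance (rotatable choice)
factorial_pts = [(-1, -1), (-1, 1), (1, -1), (1, 1)]
axial_pts = [(-alpha, 0), (alpha, 0), (0, -alpha), (0, alpha)]
center_pt = [(0, 0)]
design = factorial_pts + axial_pts + center_pt

# Five distinct values per factor, and n = 9 runs for the q = 6 effects.
values = sorted({round(v, 6) for point in design for v in point})
print(len(design), len(values))   # -> 9 5
```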

3. Validation
This paper concentrates on the role of sensitivity analysis (Section 2) in
validation; other statistical techniques for validation and verification are dis-
cussed in Kleijnen (1995a). Obviously, validation is one of the first questions
that must be answered in a simulation study; for didactic reasons, validation
is discussed in this section.
True validation requires that data on the real system be available. In prac-
tice, the amount of data varies greatly: data on failures of nuclear installations
are rare, whereas electronically captured data on computer performance and
on supermarket sales are abundant.
If data are available, then many statistical techniques can be applied. For
example, simulated and real data on the response can be compared through
the Student statistic for paired observations (see the companion chapter,
Kleijnen 1996), assuming the simulation is fed with real-life input data: trace
driven simulation. A better test uses regression analysis; see Kleijnen et al.
(1996).
However, if no data are available, then the following type of sensitivity
analysis can be used. The clients of the analysts do have qualitative knowl-
edge of certain parts of the real system; that is, these clients do know in
which direction certain factors affect the response of the corresponding mod-
ule in the simulation model (also see the discussion on sequential bifurcation
in Section 2.1.1). If the regression metamodel (see Section 2.2.2) gives an
estimated factor effect with the wrong sign, this is a strong indication of a
wrong simulation model or a wrong computer program.
Applications in ecological and military modeling are given in Kleijnen
et al. (1992) and Kleijnen (1995b) respectively. These applications further
show that the validity of a simulation model is restricted to a certain domain
of factor combinations, which corresponds with the experimental frame in
Zeigler (1976), a seminal book on modeling and simulation.

Moreover, the regression metamodel shows which factors are most im-
portant. If possible, information on these factors should be collected, for
validation purposes.

4. Optimization: Response Surface Methodology (RSM)


There are many mathematical techniques for finding optimal values for the
decision variables of nonlinear implicit functions (these functions may be for-
mulated by simulation models), possibly with stochastic noise (as in random
simulation). Examples of such techniques are genetic algorithms, simulated
annealing, and tabu search (also see Nash 1995). However, this paper is lim-
ited to RSM.
First consider four general characteristics of RSM (next, some details will
follow):
(i) RSM relies on first-order and second-order polynomial regression meta-
models, now called response surfaces; see Section 2.2.2.
(ii) It uses the statistical designs of Section 2.3.
(iii) It is augmented with the mathematical (not statistical) technique of
steepest ascent, to determine in which direction the decision variables should
be changed.
(iv) It uses the mathematical technique of canonical analysis to analyze the
shape of the optimal region: does that region have a unique maximum, a
saddle point or a ridge?
Now consider some details. Suppose we wish to maximize the response.
RSM begins by selecting a starting point. Because RSM is a heuristic (no
success guaranteed), several starting points may be tried later on, if time
permits.
RSM explores the neighborhood of that point. The response surface is
approximated locally by a first-order polynomial in the decision variables
(Taylor series expansion).
The main effects β_h are estimated, using a design with n ≈ k + 1 (see
Section 2.3.1). Suppose β̂_1 > β̂_2 > 0. Then obviously the increase of decision
variable 1 (say) z_1 should be larger than that of z_2. The steepest ascent path
means Δz_1/Δz_2 = β̂_1/β̂_2 (no standardization; see also the next paragraph).
Unfortunately, the steepest ascent technique does not quantify the step
size along this path. Therefore the analysts may try a specific value for the
step size. If that value yields a lower response, then this value should be
reduced. Otherwise, one more step is taken. Ultimately, the response must
decrease, since the first-order polynomial is only an approximation. Then
the procedure is repeated: around the best point so far, a new first-order
polynomial is estimated, after simulating n ≈ k + 1 combinations of z_1 through
z_k. And so on.
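This search loop can be sketched on a toy response surface (a hypothetical quadratic with its maximum at z = (3, 5)). For brevity the local first-order fit of each iteration is replaced by central finite differences; a real RSM study would instead estimate the slopes from n ≈ k + 1 simulated combinations:

```python
def response(z1, z2):
    """Hypothetical response surface with its maximum at (3, 5)."""
    return -(z1 - 3.0) ** 2 - (z2 - 5.0) ** 2

z = [0.0, 0.0]        # starting point
step = 1.0            # trial step size along the steepest ascent path
best = response(*z)
for _ in range(100):
    h = 1e-3
    # Local main effects (the slopes a first-order polynomial would estimate).
    b1 = (response(z[0] + h, z[1]) - response(z[0] - h, z[1])) / (2 * h)
    b2 = (response(z[0], z[1] + h) - response(z[0], z[1] - h)) / (2 * h)
    trial = [z[0] + step * b1, z[1] + step * b2]
    y = response(*trial)
    if y > best:      # higher response: accept and take one more step
        z, best = trial, y
    else:             # lower response: reduce the step size
        step /= 2
        if step < 1e-6:
            break

print(round(z[0], 2), round(z[1], 2))   # -> 3.0 5.0
```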
In the neighborhood of the top, a hyperplane cannot be an adequate
representation. Cross-validation may be used to detect this lack of fit. Other
diagnostic measures are R² (where R² denotes the multiple correlation
coefficient), var(β̂_h) relative to β̂_h, and modern statistical measures such as PRESS,
discussed in Kleijnen (1987).
So when no hyperplane can approximate the local input/output behavior
well enough, then a second-order polynomial is fitted; see Section 2.3.4.
Finally, the optimal values of z_h are found by straightforward differentiation
of the fitted quadratic polynomial. A more sophisticated evaluation is
canonical analysis.
Consider the following case study. A decision support system (DSS) for
production planning in a steel tube factory is simulated and is to be op-
timized. There are fourteen decision variables, and two response variables
(namely, a production and a commercial criterion). Simulation of one combi-
nation takes six hours of computer time, so searching for the optimal combi-
nation must be performed with care. Details can be found in Kleijnen (1993).
More applications can be found in Hood and Welch (1993), Kleijnen
(1987), and Kleijnen and Van Groenendaal (1992).

5. Conclusions
In the introduction (Section 1) the following questions were raised:
1. What if: what happens if the analysts change parameters, input variables
or modules of a simulation model? This question is closely related to
sensitivity analysis and optimization.
2. Validation: is the simulation model an adequate representation of the
corresponding system in the real world?
These questions were answered as follows.
In the initial phase of a simulation it is often necessary to perform screening:
which factors among the multitude of potential factors are really important?
The goal of screening is to reduce the multitude of factors to the really
important ones, which are further explored in the next phase. The technique of sequential
bifurcation is a simple, efficient, and effective screening technique.
Once the important factors are identified, further analysis with fewer as-
sumptions (no known signs) may use regression analysis. It generalizes the
results of the simulation experiment, since it characterizes the input/output
behavior of the simulation model.
Design Of Experiments (DOE) can give good estimators of the main ef-
fects, interactions, and quadratic effects that occur in the regression model.
These designs require relatively few simulation runs.
Once these factor effects are quantified, they can be used in
(i) validation, especially if there are no data on the input/output of the sim-
ulation model or its modules;
(ii) optimization through RSM, which builds on regression analysis and ex-
perimental designs.

These statistical techniques have already been applied many times in prac-
tical simulation studies, in many domains. Hopefully, this survey will stim-
ulate even more analysts to apply these techniques. The goal is to make
simulation studies give more general results, in less time.
In the meantime, research on statistical techniques adapted to simulation
continues in Europe, America, and elsewhere.

References

Bettonvil, B., Kleijnen, J.P.C.: Searching for the Important Factors in Simulation
Models with Many Factors. Tilburg University (1995)
Glynn, P.W., Iglehart, D.L.: Importance Sampling for Stochastic Simulation. Man-
agement Science 35, 1367-1392 (1989)
Ho, Y., Cao, X.: Perturbation Analysis of Discrete Event Systems. Dordrecht:
Kluwer 1991
Hood, S.J., Welch, P.D.: Response Surface Methodology and its Application in Simulation.
Proceedings of the Winter Simulation Conference (1993)
Kleijnen, J.P.C.: Statistical Tools for Simulation Practitioners. New York: Marcel
Dekker 1987
Kleijnen, J.P.C.: Simulation and Optimization in Production Planning: A Case
Study. Decision Support Systems 9, 269-280 (1993)
Kleijnen, J.P.C.: Verification and Validation of Simulation Models. European Jour-
nal of Operational Research 82, 145-162 (1995a)
Kleijnen, J.P.C.: Case-Study: Statistical Validation of Simulation Models. European
Journal of Operational Research 87, 21-34 (1995b)
Kleijnen, J.P.C.: Sensitivity Analysis and Optimization in Simulation: Design of
Experiments and Case Studies. In: Alexopoulos, C., Kang, K., Lilegdon, W. R.,
Goldsman, D. (eds.): Proceedings of the Winter Simulation Conference (1995c)
Kleijnen, J.P.C.: Simulation: Runlength Selection and Variance Reduction Tech-
niques. In this volume (1996), pp. 411-428
Kleijnen, J.P.C., Bettonvil, B., Van Groenendaal, W.: Validation of Simulation Models:
Regression Analysis Revisited. Tilburg University (1996)
Kleijnen, J.P.C., Van Groenendaal, W.: Simulation: A Statistical Perspective. Chich-
ester: Wiley 1992
Kleijnen, J.P.C., Van Ham, G., Rotmans, J.: Techniques for Sensitivity Analysis of
Simulation Models: A Case Study of the CO2 Greenhouse Effect. Simulation
58, 410-417 (1992)
Muppala, J.K., Malhotra, M., Trivedi, K.S.: Markov Dependability Models of Complex
Systems: Analysis Techniques. In this volume (1996), pp. 442-486
Nash, S.G.: Software Survey: NLP. OR/MS Today 22, 60-71 (1995)
Oren, T.I.: Three Simulation Experimentation Environments: SIMAD, SIMGEST
and E/SLAM. In: Proceedings of the 1993 European Simulation Symposium.
La Jolla: Society for Computer Simulation 1993
Rubinstein, R.Y., Shapiro, A.: Discrete Event Systems: Sensitivity Analysis and
Stochastic Optimization via the Score Function Method. New York: Wiley 1993
Van Groenendaal, W.: Investment Analysis and DSS for Gas Transmission on Java.
Tilburg University (1994)
Van Meel, J.: The Dynamics of Business Engineering. Delft University (1994)
Zeigler, B.: Theory of Modelling and Simulation. New York: Wiley 1976
Markov Dependability Models of Complex
Systems: Analysis Techniques
Jogesh K. Muppala 1, Manish Malhotra 2, and Kishor S. Trivedi 3
1 Department of Computer Science, The Hong Kong University of Science and
Technology, Clear Water Bay, Kowloon, Hong Kong
2 AT&T Bell Laboratories, Holmdel, NJ 07733, USA
3 Center for Advanced Computing & Communication, Department of Electrical
Engineering, Duke University, Durham, NC 27708-0291, USA

Summary. Continuous time Markov chains are commonly used for modelling large
systems, in order to study their performance and dependability. In this paper, we
review solution techniques for Markov and Markov reward models. Several meth-
ods are presented for the transient analysis of Markov models, ranging from fully-
symbolic to fully-numeric. The Markov reward model is explored further, and meth-
ods for computing various reward based measures are discussed including the ex-
pected values of rewards and the distributions of accumulated rewards. We also
briefly discuss the different types of dependencies that arise in dependability mod-
elling of systems, and show how Markov models can handle some of these depen-
dencies. Finally, we briefly review the Markov regenerative process, which relaxes
some of the constraints imposed by the Markov process.

Keywords. Markov chains, dependability analysis, system dependencies, stochastic
Petri nets, performability, ODE methods, TR-BDF2, Runge-Kutta methods,
randomization

1. Introduction

Rapid advances in technology have resulted in the proliferation of complex
computer systems that are used in different applications, ranging from
spacecraft flight-control to information and financial services. These systems are
characterized by high throughput and availability requirements. It is essential
that the designed systems can be shown to meet stringent performance and
dependability requirements. Modelling and evaluation provide a good mecha-
nism for examining the behavior of these systems, right from the design stage
to implementation and final deployment.
Continuous time Markov chains (CTMCs) provide a useful modelling for-
malism for evaluating the performance (Trivedi 1982), reliability/availability
(Goyal et al. 1987), and performability (Meyer 1982 and Smith et al. 1988) of
computer systems. CTMCs can easily handle many of the interdependencies
and dynamic relationships among the system components that are charac-
teristic of current systems.
Two major problems that are encountered in the use of Markov models
are largeness and stiffness. Complex systems give rise to large and complex
Markov models. The largeness problem can be addressed by either avoiding
it through aggregation and decomposition (largeness avoidance), or by using
automated methods for generating the large and complex Markov chains
(largeness tolerance). Stiffness often results from having transition rates of
different orders of magnitude in the Markov chain or from having a large time
t at which the solution is desired. Methods for handling stiffness can again
be classified into two categories, namely, stiffness avoidance and stiffness
tolerance. The former method is aimed at circumventing the problem by
eliminating the need of generating stiff models. The latter approach is to
tolerate the stiffness in the models by using special methods that can handle
the stiffness.
Complex systems are designed to continue working, even in the presence of
faults in order to guarantee a minimum level of performance. In such cases,
pure performance or pure dependability models do not capture the entire
system behavior. Methods for combined evaluation of performance and de-
pendability are thus required. Two possible approaches for addressing these
requirements are available. The first approach is to combine the performance
and dependability behavior into an exact monolithic model. This approach,
however, is fraught with the largeness and stiffness problems, alluded to ear-
lier.
When we examine the failure-repair and the performance behaviors of
these systems closely, we notice that the failure and repair events are rare,
i.e., the rate of occurrence of these events is very small compared with the
rates ofthe performance-related events. Consequently, we can assume that the
system attains a (quasi-)steady state with respect to the performance related
events, between successive occurrences of failure-repair events. Thus, we can
compute the performance measures for the system in each of these (quasi-
) steady states. The overall system can then be characterized by weighting
these quasi-steady state performance measures by the dependability-model
state probabilities. This leads to a natural hierarchy of models: a higher level
dependability model and a set of lower level performance related models, one
for every state in the dependability model.
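In the simplest form of this hierarchy, the overall measure is the reward-weighted sum Σ_i r_i P_i; the state probabilities and per-state throughputs below are hypothetical:

```python
# Hypothetical dependability-model state probabilities and the quasi-steady-
# state performance measure (say, throughput) computed for each state.
state_prob = [0.90, 0.07, 0.02, 0.01]
throughput = [100.0, 60.0, 25.0, 0.0]

# Overall measure: per-state performance weighted by state probabilities.
overall = sum(p * r for p, r in zip(state_prob, throughput))
print(round(overall, 1))   # -> 94.7
```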
Several authors have used the latter concept in developing techniques for
combined performance and dependability analysis of computer systems. Early
and defining work in this field was done by Beaudry (1978), who computed the
computational availability until failure for a computer system. Meyer (1980,
1982) proposed the framework of performability for modelling fault-tolerant
systems. Markov reward models (MRMs) (Howard 1971) provide a natural
framework for defining such a hierarchy of models. The system performance
measures can be assigned as rewards associated with the states of a higher
level Markov dependability model. The reward framework enables us to define
and compute several interesting system measures. We will briefly review the
MRM framework, and examine the computation of various system measures
using rewards.

We are often interested in transient measures since they provide more
information than steady-state measures. For all but the simplest models,
numerical methods of transient solution (as opposed to symbolic or semi-
symbolic methods) are the only feasible alternative (Reibman et al. 1989).
There are several numerical methods based on randomization (Jensen 1953)
and solution of ordinary differential equations (ODEs) (Reibman and Trivedi
1988) that exploit the sparsity of the CTMC generator matrices to handle
these large and complex models. Some of these techniques are reviewed briefly
in this paper.
A major objection to the use of Markov models in the evaluation of the
performance and dependability behaviors of systems, is the exponential as-
sumption, implying that the holding time of the Markov chain in any state is
exponentially distributed, and that the past behavior of the system is com-
pletely represented by the current state. These assumptions can be relaxed to
obtain the Markov regenerative process (MRGP) (Kulkarni 1995), where the
regeneration points for the process need not coincide with state transitions
of the system. The MRGP has received a lot of attention in current research,
and hence we briefly review the essential details in this paper.
This paper is organized as follows. First we define Markov chains and
present the notation that we use in this paper in Section 2. Next we intro-
duce Markov reward models, and define several measures based on rewards in
Section 2.2. We then examine the two major difficulties that are encountered
in the use of Markov models, namely, largeness and stiffness in Section 3.
Several techniques for the automatic generation of large Markov models from
a high-level specification are briefly reviewed in Section 4. We briefly present
different system dependencies that are encountered in the dependability mod-
elling of complex systems, and review how Markov models can handle these
dependencies, in Section 5. We then look at several techniques for the tran-
sient and steady state analysis of Markov chains in Section 6. Methods for
solution of Markov reward models and the computation of reward measures
are examined in Section 7. We show how the Markovian constraints can
be relaxed to obtain the Markov regenerative process in Section 8. Solution
methods for Markov regenerative processes are briefly mentioned in Section 8.
Finally we give some concluding remarks in Section 9.

2. Notation and Terminology


Continuous-time Markov chains can easily represent many of the intricate
failure dependencies that arise in the modelling of computer systems (Trivedi
1982). A Markov chain is a state-space-based method composed of (1) states
which represent various conditions associated with the system such as the
number of functioning resources of each type, the number of tasks of each
type waiting at a resource, the number of concurrently executing tasks of a
job, the allocations of resources to tasks, and states of recovery for each failed
resource, and (2) transitions between states, which represent the change of
the system state due to the occurrence of a simple or a compound event such
as the failure of one or more resources, the completion of executing tasks, or
the arrival of jobs.
A Markov chain is a special case of a discrete-state stochastic process in
which the current state completely captures the past history pertaining to the
system's evolution. Markov chains can be classified into discrete-time Markov
chains (DTMCs) and continuous-time Markov chains (CTMCs), depending
on whether the events can occur at fixed intervals or at any time; that is,
whether the time variable associated with the system's evolution is discrete
or continuous. This paper is restricted to continuous-time Markov chains.
Further information on Markov chains may be found in (Trivedi 1982).
In a graphical representation of a Markov chain, states are denoted by
circles with meaningful labels attached. Transitions between states are rep-
resented by directed arcs drawn from the originating state to the destina-
tion state. Depending on whether the Markov chain is a discrete-time or a
continuous-time Markov chain, either a probability or a rate is associated
with a transition, respectively.
In this section, we present a brief introduction to the concepts and the no-
tation for Markov and Markov reward models. We shall illustrate the Markov
chain concepts using a simple example.

Fig. 2.1. The computing system

Consider a computing system consisting of a pair of workstations connected
to a file-server through a computing network, as shown in Figure 2.1.
We assume that the system is operational as long as one of the workstations
is operational and the file server is operational. We assume that the time
to failure for each component is exponentially distributed, with the parameters
being λ_w for the workstations and λ_f for the file-server respectively. We
assume that the computer network is highly reliable, and hence ignore the
failure of the network. Furthermore, we assume that failed components can
be repaired. Suppose the time to repair a workstation and the time to repair
the file-server are exponentially distributed with the parameters μ_w and μ_f
respectively. The file-server has repair priority over the workstations. We also
assume that whenever the system is down, no further failures can take place.
Hence, when the file-server is down, the workstations cannot fail. Similarly
when both the workstations are down, the file-server does not fail.

2.1 Markov Chains

Let {Z(t), t ≥ 0} represent a homogeneous finite-state continuous-time Markov chain (CTMC) with state space Ω. Without loss of generality, we will assume that Ω = {1, 2, ..., n}; see below. The infinitesimal generator matrix is given by Q = [q_ij], where q_ij (i ≠ j) represents the transition rate from state i to state j, and the diagonal elements are q_ii = -q_i = -Σ_{j≠i} q_ij. Further, let q = max_i |q_ii| and let η be the number of non-zero entries in Q.
The behavior of the example computer system can be represented by
the continuous-time Markov chain shown in Figure 2.2.

Fig. 2.2. Continuous-time Markov chain for the computer system of Fig. 2.1

In this figure the label (i, j) of each state is interpreted as follows: i represents the number of workstations that are still functioning, and j is 1 or 0 depending on whether
the file-server is up or down respectively. For the example problem, with the
states ordered as (2,1), (2,0), (1,1), (1,0), (0,1), the Q matrix is given by:

Q =
  [ -(2λ_w + λ_f)    λ_f     2λ_w                    0       0    ]
  [   μ_f           -μ_f     0                       0       0    ]
  [   μ_w            0      -(μ_w + λ_f + λ_w)       λ_f     λ_w  ]
  [   0              0       μ_f                    -μ_f     0    ]
  [   0              0       μ_w                     0      -μ_w  ]
Markov Dependability Models of Complex Systems 447

We note that states of a CTMC will most often be vectors. However, the
discrete state space of a CTMC can always be mapped into positive integers.
We will, therefore, assume a state space of {1, 2, ..., n}.
2.1.1 Instantaneous Transient Analysis. Let P_i(t) = Pr{Z(t) = i} be
the unconditional probability of the CTMC being in state i at time t. Then
the row vector P(t) = [P_1(t), P_2(t), ..., P_n(t)] represents the transient state
probability vector of the CTMC. The behavior of the CTMC can be described
by the following Kolmogorov differential equation:

(d/dt) P(t) = P(t)Q , given P(0) , (2.1)

where P(0) represents the initial probability vector (at time t = 0) of the
CTMC.
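Although later sections of the paper treat solution methods in depth, equation (2.1) can be illustrated numerically for a chain this small. The following Python sketch (an illustration added here, not part of the original text) applies plain uniformization, the idea behind Jensen's method mentioned later in the paper, using the generator and the parameter values of the section's numerical example; this brute-force version is adequate only for small, non-stiff chains.

```python
import numpy as np

# Rates from the numerical example (per hour)
lw, lf, mw, mf = 1e-4, 5e-5, 1.0, 0.5

# Generator matrix Q, states ordered (2,1), (2,0), (1,1), (1,0), (0,1)
Q = np.array([
    [-(2*lw + lf),  lf,   2*lw,             0.0,  0.0],
    [ mf,          -mf,   0.0,              0.0,  0.0],
    [ mw,           0.0, -(mw + lf + lw),   lf,   lw ],
    [ 0.0,          0.0,  mf,              -mf,   0.0],
    [ 0.0,          0.0,  mw,               0.0, -mw ],
])

def transient(P0, t, eps=1e-10):
    """Uniformization: P(t) = sum_k Pois(k; q t) * P0 M^k, with M = I + Q/q."""
    q = 1.02 * np.max(np.abs(np.diag(Q)))   # any q >= max_i |q_ii| works
    M = np.eye(len(Q)) + Q / q              # stochastic matrix
    w, total, k = np.exp(-q * t), 0.0, 0    # Poisson weight for k = 0
    v = P0.astype(float)
    acc = np.zeros_like(v)
    while total < 1.0 - eps:                # stop once Poisson mass >= 1 - eps
        acc += w * v
        total += w
        k += 1
        w *= q * t / k
        v = v @ M
    return acc

P0 = np.array([1.0, 0, 0, 0, 0])            # both workstations and server up
Pt = transient(P0, 100.0)
A100 = Pt[0] + Pt[2]                        # instantaneous availability A(100)
```

At t = 100 hours the computed availability Pt[0] + Pt[2] has essentially reached the steady-state value 0.9999 reported later in this section.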
2.1.2 Cumulative Transient Analysis. Define L(t) = ∫_0^t P(u) du. Then
Li(t) is the expected total time spent by the CTMC in state i during the
interval [0, t). L(t) satisfies the differential equation:
(d/dt) L(t) = L(t)Q + P(0) , L(0) = 0 , (2.2)

which is obtained by integrating equation (2.1).


2.1.3 Steady-State Analysis. Let π_i be the steady-state probability of state i of the CTMC, and let π = lim_{t→∞} P(t) be the steady-state probability vector. We know that in the steady state (d/dt) P(t) = 0. By substituting this into equation (2.1) we can derive the following equation for the steady-state probabilities:

π Q = 0 , Σ_{i∈Ω} π_i = 1. (2.3)

Let us return to the computing system example. Since the computing system is repairable, it is meaningful in this case to compute the availability
of the system. We note that the system is available as long as it is in the
states denoted by (2,1) and (1,1). Hence the instantaneous availability of the
system A(t), which is the probability that the system is operational at time
t, is given by
A(t) = P(2,1)(t) + P(1,1)(t).
If we consider the interval availability A_I(t), which is the fraction of the time during the interval [0, t) that the system is available, then it can be computed as

A_I(t) = (L_(2,1)(t) + L_(1,1)(t)) / t .
The steady-state availability A_ss is given by

A_ss = π_(2,1) + π_(1,1) .
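Equation (2.3) is a singular linear system; a standard way to solve it numerically is to replace one (redundant) balance equation with the normalization condition. A minimal numpy sketch for the example chain, assuming the parameter values of the numerical example below:

```python
import numpy as np

lw, lf, mw, mf = 1e-4, 5e-5, 1.0, 0.5        # rates from the numerical example

# Generator matrix Q, states ordered (2,1), (2,0), (1,1), (1,0), (0,1)
Q = np.array([
    [-(2*lw + lf),  lf,   2*lw,             0.0,  0.0],
    [ mf,          -mf,   0.0,              0.0,  0.0],
    [ mw,           0.0, -(mw + lf + lw),   lf,   lw ],
    [ 0.0,          0.0,  mf,              -mf,   0.0],
    [ 0.0,          0.0,  mw,               0.0, -mw ],
])

# pi Q = 0 with sum(pi) = 1: transpose the system and overwrite one
# (redundant) balance equation with the normalization condition.
A = Q.T.copy()
A[-1, :] = 1.0
b = np.zeros(len(Q)); b[-1] = 1.0
pi = np.linalg.solve(A, b)

Ass = pi[0] + pi[2]                          # pi_(2,1) + pi_(1,1)
print(round(Ass, 4))                         # → 0.9999
```

Overwriting the last equation is an arbitrary choice; any single balance equation may be replaced, because πQ = 0 has rank n − 1.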

Fig. 2.3. Availability for the computer system (instantaneous and interval availability plotted against time in hours)

For this example system, the availability as a function of time is plotted in Figure 2.3. For this plot, we assume that λ_w = 0.0001 hr⁻¹, λ_f = 0.00005 hr⁻¹, μ_w = 1.0 hr⁻¹, and μ_f = 0.5 hr⁻¹. We notice that the availability decreases as expected and reaches the steady-state value of the availability A_ss, which is 0.9999.
2.1.4 Up-to-Absorption Analysis. Let A represent the set of absorbing states (a state is considered absorbing if there are no outgoing transitions from that state, i.e., an absorbing state i has q_ij = 0 for all j ≠ i). Let B (= Ω − A) be the set of the transient states in the CTMC. From the matrix Q a new matrix can be constructed by restricting Q to states in B only: Q_B of size |B| × |B|, where |B| is the cardinality of the set B.
Let z_i = ∫_0^∞ P_i(τ) dτ, i ∈ B, be the mean time spent by the CTMC in state i until absorption. The row vector z = [z_i] satisfies the following equation:

z Q_B = -P_B(0) , (2.4)

where P_B(0) is the vector P(0) restricted to the states in the set B. The above equation can be obtained by taking the limit as t → ∞ of equation (2.2), with z = L_B(∞) and noting that (d/dt) L_B(∞) = 0. The mean time to absorption, MTTA, of the CTMC into an absorbing state is computed as

MTTA = Σ_{i∈B} z_i .

By assuming that the example computer system does not recover when-
ever both workstations fail, or whenever the file-server fails, we make the
states (0, 1), (1,0), and (2, 0) the absorbing states. The corresponding Markov
chain is shown in Figure 2.4. This gives the following new matrix Q_B (states ordered (2,1), (1,1)):

Q_B =
  [ -(2λ_w + λ_f)    2λ_w               ]
  [   μ_w           -(μ_w + λ_w + λ_f)  ]

Fig. 2.4. CTMC with absorbing states

The mean time to failure MTTF of the computer system, which is the same as the mean time to absorption for the Markov chain given in Figure 2.4, is obtained as

MTTF = z_(2,1) + z_(1,1) .

Assuming that λ_w = 0.0001 hr⁻¹, λ_f = 0.00005 hr⁻¹, and μ_w = 1.0 hr⁻¹, we obtain the mean time to failure as 19992 hours.
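For the example, equation (2.4) reduces to a 2×2 linear system over the transient states (2,1) and (1,1). The following sketch (an illustration added here, using the parameter values just quoted) reproduces the 19992-hour figure:

```python
import numpy as np

lw, lf, mw = 1e-4, 5e-5, 1.0     # mu_f plays no role once the down states absorb

# Q restricted to the transient states B, ordered (2,1), (1,1)
QB = np.array([
    [-(2*lw + lf),   2*lw],
    [ mw,           -(mw + lw + lf)],
])
PB0 = np.array([1.0, 0.0])       # start in state (2,1)

# Equation (2.4): z QB = -PB(0)  <=>  QB^T z^T = -PB(0)^T
z = np.linalg.solve(QB.T, -PB0)
MTTF = z.sum()
print(round(MTTF))               # → 19992 (hours)
```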
Furthermore, since this Markov chain has absorbing states, we can also
compute the reliability of the system. The reliability R(t) is the probability
that the system is functioning throughout the interval [0, t). Since all system
failure states are absorbing, it follows that if the system is functioning at
time t, it must be functioning throughout the interval [0, t). Thus,
R(t) = P(2,1)(t) + P(1,1)(t).
The reliability for the example computer system is plotted in Figure 2.5.
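Since the chain restricted to the transient states has only two states here, P_B(t), and hence R(t), can be obtained cheaply from the eigendecomposition of Q_B. A small illustrative sketch (added here; not the solution method the paper develops later):

```python
import numpy as np

lw, lf, mw = 1e-4, 5e-5, 1.0

# Q restricted to the transient states, ordered (2,1), (1,1)
QB = np.array([
    [-(2*lw + lf),   2*lw],
    [ mw,           -(mw + lw + lf)],
])
PB0 = np.array([1.0, 0.0])

def reliability(t):
    """R(t) = P_(2,1)(t) + P_(1,1)(t), with P_B(t) = P_B(0) exp(QB t).
    For a 2x2 generator, exp(QB t) = V diag(exp(d t)) V^-1."""
    d, V = np.linalg.eig(QB)
    expQt = V @ np.diag(np.exp(d * t)) @ np.linalg.inv(V)
    return float((PB0 @ expQt).sum().real)

print(round(reliability(0.0), 6))    # → 1.0
R20k = reliability(20000.0)          # close to exp(-t/MTTF): one eigenvalue,
                                     # approximately -1/19992, dominates
```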

2.2 Markov Reward Models

To obtain Markov reward models, Markov models have been extended by assigning reward rates to the states, and reward impulses to the transitions of the Markov chain (Howard 1971). Let us now define a reward rate vector r over the states of the CTMC such that a reward rate of r_i is associated with state i. A reward of r_i T_i is accumulated when the sojourn time of the stochastic process in state i is T_i. An impulse reward R_ij is associated with the transition from state i to state j of the Markov chain. Let X(t) represent

Fig. 2.5. Reliability for the computer system (reliability plotted against time in hours)

the instantaneous reward rate of the Markov reward model (MRM). Let Y(t)
denote the accumulated reward in the interval [0, t).

Y(t) = ∫_0^t X(τ) dτ .
2.2.1 Expected Rewards. The expected instantaneous reward rate E[X(t)], the expected accumulated reward E[Y(t)], and the steady-state expected reward rate E[X] = E[X(∞)] can be computed as

E[X(t)] = Σ_{i∈Ω} r_i P_i(t) + Σ_{i,j∈Ω} R_ij φ_ij(t) ,

E[Y(t)] = Σ_{i∈Ω} r_i ∫_0^t P_i(τ) dτ + Σ_{i,j∈Ω} R_ij N_ij(t) = Σ_{i∈Ω} r_i L_i(t) + Σ_{i,j∈Ω} R_ij N_ij(t) ,

and

E[X] = Σ_{i∈Ω} r_i π_i + Σ_{i,j∈Ω} R_ij φ_ij ,

where φ_ij(t) and φ_ij denote the expected frequency with which the transition from state i to state j is traversed in the Markov chain at time t, and in steady state, respectively; N_ij(t) is the expected number of such traversals of the transition from state i to state j during the interval [0, t).
For a Markov chain with absorbing states, the expected accumulated reward until absorption E[Y(∞)] can be computed as

E[Y(∞)] = Σ_{i∈Ω} r_i ∫_0^∞ P_i(τ) dτ + Σ_{i,j∈Ω} R_ij N_ij = Σ_{i∈Ω} r_i z_i + Σ_{i,j∈Ω} R_ij N_ij ,

where Nij is the expected number of traversals of the transition from state i
to state j until absorption.
Furthermore, we note that if h_i is the expected holding time for the CTMC in state i, then h_i = 1/|q_ii|. If γ_i represents the frequency with which state i is visited in steady state, then γ_i = π_i/h_i = π_i |q_ii|. Given that the CTMC is in state i, the probability ν_ij that the next transition will be to state j is given by ν_ij = q_ij/|q_ii|. Thus, we can compute φ_ij as

φ_ij = ν_ij γ_i = q_ij π_i .

Similarly, we can prove that φ_ij(t) = q_ij P_i(t). Hence the expressions for the expected instantaneous reward rate and the expected steady-state reward rate can be rewritten as

E[X(t)] = Σ_{i∈Ω} (r_i + Σ_{j∈Ω} R_ij q_ij) P_i(t)

and

E[X] = Σ_{i∈Ω} (r_i + Σ_{j∈Ω} R_ij q_ij) π_i .

By a similar argument, if n_i is the expected number of visits to state i until absorption, then n_i = z_i/h_i = z_i |q_ii|. Then

N_ij = ν_ij n_i = q_ij z_i .

Similarly, we can also prove that

N_ij(t) = ν_ij n_i(t) = q_ij L_i(t) .
Thus, the expressions for the expected accumulated reward until absorption, and the expected accumulated reward in the interval [0, t), may be rewritten as

E[Y(∞)] = Σ_{i∈Ω} (r_i + Σ_{j∈Ω} R_ij q_ij) z_i

and

E[Y(t)] = Σ_{i∈Ω} (r_i + Σ_{j∈Ω} R_ij q_ij) L_i(t) .
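As an illustration of these formulas, the sketch below (a reward assignment chosen here for illustration, not one from the original text) computes two steady-state expected reward rates for the example chain: the steady-state availability, using rate rewards only, and the steady-state repair frequency, using unit impulse rewards on the four repair transitions.

```python
import numpy as np

lw, lf, mw, mf = 1e-4, 5e-5, 1.0, 0.5
# Generator Q, states ordered (2,1), (2,0), (1,1), (1,0), (0,1)
Q = np.array([
    [-(2*lw + lf),  lf,   2*lw,             0.0,  0.0],
    [ mf,          -mf,   0.0,              0.0,  0.0],
    [ mw,           0.0, -(mw + lf + lw),   lf,   lw ],
    [ 0.0,          0.0,  mf,              -mf,   0.0],
    [ 0.0,          0.0,  mw,               0.0, -mw ],
])

# Steady-state vector, solved as in Section 2.1.3
A = Q.T.copy(); A[-1, :] = 1.0
b = np.zeros(5); b[-1] = 1.0
pi = np.linalg.solve(A, b)

# Rate rewards: r_i = 1 in the up states (2,1) and (1,1), so E[X] = A_ss
r = np.array([1.0, 0.0, 1.0, 0.0, 0.0])
Ass = r @ pi

# Impulse rewards: R_ij = 1 on every repair transition, so
# E[X] = sum_{i,j} R_ij q_ij pi_i is the steady-state repair frequency
R = np.zeros((5, 5))
R[1, 0] = R[3, 2] = 1.0     # file-server repairs, rate mu_f
R[2, 0] = R[4, 2] = 1.0     # workstation repairs, rate mu_w
repair_freq = sum(R[i, j] * Q[i, j] * pi[i]
                  for i in range(5) for j in range(5) if i != j)
```

As a sanity check, the impulse-reward measure equals the steady-state failure frequency, since in steady state the probability flow into the down states balances the flow out of them.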
iEn jEn
2.2.2 Distribution of Reward Measures. Assuming only reward rates (no impulse rewards) are assigned, the distribution of X(t), P[X(t) ≤ x], can be computed as

P[X(t) ≤ x] = Σ_{i∈Ω: r_i ≤ x} P_i(t) .

The distribution of X can be computed similarly.
The distribution of accumulated reward until absorption, P[Y(∞) ≤ y], and the distribution of accumulated reward over a finite horizon, P[Y(t) ≤ y],
on the other hand, are difficult to compute. Numerical methods for computing
these distributions will be discussed in a later section.

2.2.3 System Measures Using Rewards. Given the MRM framework, the next immediate question that arises is: what are appropriate reward rate assignments? The reward rate vector to be assigned depends on whether we are interested in performance, dependability, or composite performance and dependability measures. For example, to compute availability measures for a system, we divide the state space Ω into the set of up states, Ω_u, and the set of down states, Ω_d. We attach a reward rate of r_i = 1 for all i ∈ Ω_u and r_i = 0 for all i ∈ Ω_d. The instantaneous availability A(t) (the probability that the system is functioning at time t) is then given by A(t) = E[X(t)] = Σ_{i∈Ω} r_i P_i(t). Using the same reward rate assignment, the total expected uptime T_u in the interval [0, t) is given by T_u = E[Y(t)] = Σ_{i∈Ω} r_i L_i(t), and the interval availability A_I(t) is given as A_I(t) = (1/t) E[Y(t)]. The steady-state availability A_ss is given by A_ss = E[X] = Σ_{i∈Ω} r_i π_i.
To compute the reliability R(t) (the probability that the system is functioning throughout the interval [0, t)), we consider a Markov chain in which all the system failure states are absorbing. Then R(t) = E[X(t)] = Σ_{i∈Ω} r_i P_i(t). Using the same reward rate assignment, the mean time to failure MTTF of the system is given by MTTF = E[Y(∞)] = Σ_{i∈Ω} r_i z_i. It should be pointed out that reliability and mean time to system failure are meaningful measures only when all the system down states are absorbing states. Conversely, steady-state availability is meaningful only if no system state is an absorbing state. Instantaneous availability, on the other hand, can be computed in either case.
Pure performance measures can also be computed using the same framework. For example, let the Markov chain represent the behavior of a queueing system, and let n_i be the number of customers waiting in the queueing system when it is in state i. If we now assign a reward rate r_i = n_i for all i ∈ Ω, then the expected number of customers waiting in the queue at time t is given by N(t) = E[X(t)] = Σ_{i∈Ω} r_i P_i(t). The expected throughput of the queue can also be computed by assigning, as the reward rate in state i, the rate of the transition from state i corresponding to departures from the queue.
We can define the reward rates to be the performance levels of the system
in different configurations. Then we can compute measures such as the ex-
pected total amount of work completed in the interval [0, t), and the expected
throughput of the system with failures and repairs. A related approach is to
decide the assignment of reward rates based on some performance threshold
(Levy and Wirth 1989). We designate all states in which the performance
index is below the threshold as down states (assign a reward rate of zero);
the remaining states are up states (reward rate of 1). This approach is well
suited for degradable computer systems, where the system is not completely
unavailable due to failures, but its performance tends to degrade.
The above discussion indicates that the computation of many of the sys-
tem dependability and performance measures requires the computation of the

state probabilities of the Markov chain. We shall consider several techniques for the transient and steady-state analysis of the Markov chain in the next few sections.

3. Computational Difficulties
Two major difficulties that arise in numerical computation of transient be-
havior of Markov chains are largeness and stiffness.

3.1 Largeness

Most Markov models of real systems are very large. The actual model (reliability or performance) may be specified using a high-level description such as
stochastic Petri nets (Ajmone et al. 1984). However, these high level models
are solved after conversion to a Markov model that is typically very large.
Practical models, in general, give rise to hundreds of thousands of states (Ibe
and Trivedi 1990). Two basic approaches to overcome largeness are:
- Largeness-avoidance: One could use state-truncation techniques based on
avoiding generation of low probability states (Boyd et al. 1988, Kantz and
Trivedi 1991, Li and Silvester 1984, and Van Dijk 1991) and model-level
decomposition (Ciardo and Trivedi 1993 and Tomek and Trivedi 1991).
- Largeness-tolerance: In this approach, a concise method of description and
automated generation of the CTMC is used. Sparsity of Markov chains is
exploited to reduce the space requirements. However, no model reduction is
employed. Appropriate data structures for sparse matrix storage are used.
Sparsity preserving solution methods are used, which result in consider-
able reduction in computational complexity. CTMCs with several hundred
thousand states have been solved using this approach. We shall consider
largeness-tolerance methods in this paper.

3.2 Stiffness

Stiffness is another undesirable characteristic of many practical Markov models (especially reliability models), which adversely affects the computational efficiency of numerical solution techniques. Stiffness arises if the model solution has rates of change that differ widely. The linear system of differential equations (2.1) is considered stiff if, for i = 2, ..., m, Re(λ_i) < 0 and

max_{2≤i≤m} |Re(λ_i)| ≫ min_{2≤i≤m} |Re(λ_i)| , (3.1)

where λ_i are the eigenvalues of Q. (Note that since Q is a singular matrix, one of its eigenvalues, say λ_1, is zero.) However, the above equation misses the point that the rate of change of a solution component is directly influenced by

the length of the solution interval. To overcome that shortcoming, Miranker (1981) defined stiffness as follows: "A system of differential equations is said to be stiff on the interval [0, t) if there exists a solution component of the system that has variation on that interval that is large compared to 1/t".
Stiffness of a Markov model could cause severe instability problems in the
solution methods if the methods are not designed to handle stiffness. The
two basic approaches to overcome stiffness are:
- Stiffness-avoidance: In this case, stiffness is eliminated from a model by
solving a set of non-stiff models. One such technique based on aggregation
is described in (Bobbio and Trivedi 1986).
- Stiffness-tolerance: This approach employs solution methods that remain
stable for stiff models. We focus on this approach in this paper.
Let us consider the source of stiffness in Markov chains. In a dependability model, repair rates are several orders of magnitude (sometimes 10⁶ times)
larger than failure rates. Failure rates could also be much larger than the
reciprocal of mission time (which is the length of the solution interval). Such
Markov chains have events (failures or repairs) occurring at widely different
time scales. This results in the largest eigenvalue of Q being much larger than
the inverse of mission time (Clarotti 1986); consequently the system of differ-
ential equations (equation (2.1)) is stiff. According to the Gerschegorin circle
theorem (Golub and Loan 1989), the magnitude of the largest eigenvalue is
bounded above by twice the absolute value of the largest entry in the gen-
erator matrix. In a Markov chain, this entry corresponds to the largest total
exit rate from any of the states. Therefore, the stiffness index of a Markov
chain can be defined as qt, the product of the largest total exit rate from a
state, q, and the length of the solution interval t (Reibman and Trivedi 1988).
The above discussion suggests that stiffness can be arbitrarily increased by
increasing q or t. The largest rate q can be increased by increasing model pa-
rameters. However, this increase changes the eigen-structure of matrix Q. In
some models it results in an increase in the magnitude of the smallest non-zero
eigenvalue of the matrix. This implies that those models reach steady-state
faster. We will later define a stiffness index in terms of q, t, and λ_2, where λ_2 is the smallest (in magnitude) non-zero eigenvalue of matrix Q.
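For the example chain, the stiffness index q·t can be evaluated directly; the sketch below (added for illustration) uses the exit rates of the five states and the two time horizons appearing in the earlier plots (Figures 2.3 and 2.5):

```python
import numpy as np

lw, lf, mw, mf = 1e-4, 5e-5, 1.0, 0.5

# Total exit rates |q_ii| of the five states of the example generator,
# ordered (2,1), (2,0), (1,1), (1,0), (0,1)
exit_rates = np.array([2*lw + lf, mf, mw + lf + lw, mf, mw])

q = exit_rates.max()                  # largest total exit rate from any state
for t in (100.0, 1e5):                # availability vs. reliability horizons
    print(f"t = {t:g} h, stiffness index q*t = {q * t:g}")
```

Over the 100-hour availability horizon the index is only about 10², but over the 10⁵-hour reliability horizon of Figure 2.5 it is about 10⁵, which is why a solver not designed for stiffness would need on the order of q·t time steps.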
Numerical ODE solution methods that are not designed to handle stiffness become computationally expensive for stiff problems. The solution of a stiff model entails very small time steps, which increases the total number of time steps required, and the total computation time, manyfold. The original version of Jensen's method does not handle stiffness well either (Reibman and Trivedi 1988).
Recently, hybrid methods have been proposed (Malhotra 1996) which
combine stiff and non-stiff ODE methods to yield efficient solutions of stiff
Markov models. We shall discuss these methods briefly at the end of Section
5.3.

4. Model Specification/Generation Methods


We have shown that Markov models and Markov reward models provide
a very general framework for modelling complex systems. Furthermore, we
have mentioned that model largeness is an important problem in many cases.
This has resulted in several high-level specification languages, which ease the
burden on the modeler in specifying the Markov (reward) model explicitly. A
survey of these techniques is presented in (Haverkort and Trivedi 1993). They
point out that there are two reasons for the need of a high-level specification
language for MRMs:
1. The complexity of the systems directly translates into the complexity
of the corresponding Markov model. This in turn implies that manual
construction of the model is both cumbersome and error-prone.
2. System designers are often unfamiliar with modelling, and hence prefer
to use a language that is closer to their own system specification.
Haverkort and Trivedi (1993) give a set of criteria
for evaluating any specification language. Domain specific application lan-
guages are better suited for model specification in their respective domains,
albeit at the cost of generality. The inherent constraints imposed by the do-
main may preclude the specification of all possible MRMs. These restrictions
might also arise from the structured nature of a language, which permits
the specification of "good" models while restricting the modelling freedom.
Domain specific languages in general tend to provide a higher level of ab-
straction from the underlying mathematical model, while successfully hiding
the details. The set of output measures that can be computed may also be
determined by the domain-specific nature of the modelling language.
Several model specification languages, ranging from very general to very
specific, are in common use. A brief review of some of these languages will
be presented now (for a complete examination of these languages, the reader may refer to Haverkort and Trivedi 1993):
Queueing Networks: Queueing networks have long been used to evaluate
the performance of computer and communication systems (Lazowska 1984)
and industrial engineering systems. A Markovian queueing network satisfying some constraints has an underlying Markov chain that describes its behavior.
A class of queueing networks satisfying product-form constraints (Baskett et al. 1975) can be efficiently solved, avoiding the construction and the solution
of the underlying Markov chain. In the general case, however, the generation
and the solution of the underlying CTMC is necessary. Queueing networks
permit the specification of resources and resource contention efficiently.
Fault Trees/Reliability Block Diagrams: Fault trees and reliability block di-
agrams are generally used in the specification of the dependability behavior
of systems. In the absence of additional dependencies, such models can be
solved efficiently, avoiding the generation of the underlying state space. If

additional dependencies are specified, they can be transformed into an underlying Markov model. HARP (Dugan et al. 1986) permits the specification
of the Markov model using fault-trees/reliability block diagrams and associ-
ated dependencies induced by the fault handling behavior.
Stochastic Petri Nets: Stochastic Petri nets (SPNs) (Ajmone et al. 1984) and
their variants (Chiola 1985, Ciardo et al. 1993 and Couvillion et al. 1991) have
been successfully used to specify Markov and Markov reward models. SPNs
allow the specification of the reward rates in terms of the model structure.
Solution of these models involves construction of the underlying reachability
graph, which is then mapped onto a corresponding Markov reward model
(Ciardo et al. 1993). SPNs can easily handle intricate dependencies among
the various components of the system being modeled. Several tools based
on SPNs and their variants are available (Chiola 1985, Ciardo et al. 1993,
Couvillion et al. 1991 and Sahner et al. 1995). A brief overview of SPNs will
be presented in Section 4.1.
Production Rule Systems: This method is based on defining several state
variables that together define the state of the system. Changes to the state
variables are specified using production rules. Reward rates are defined as
expressions of state variables. Several tools such as METFAC (Carrasco and
Figueras 1986) and ASSIST (Johnson and Butler 1988) use production rule
systems.
Dynamic Queueing Networks: This method uses a two-level hybrid approach:
the performance of a system is specified as a queueing network model, while
the failure-repair behavior is modeled using stochastic Petri nets. The DyQN-
tool (Haverkort et al. 1992) is based on this concept.

4.1 Stochastic Petri Nets and Stochastic Reward Nets


In this section, we give an informal description of the features of SPNs and
stochastic reward nets (SRNs). A formal description of SRNs may be found in
(Ciardo et al. 1993), including numerical algorithms to solve the underlying
Markov reward models.
4.1.1 Basic Terminology. A Petri net (PN) is a bipartite directed graph
whose nodes are divided into two disjoint sets called places and transitions.
Directed arcs in the graph connect places to transitions (called input arcs),
and connect transitions to places (called output arcs). A cardinality may be
associated with these arcs. A marked Petri net is obtained by associating
tokens with places. A marking of a PN is the distribution of tokens in the
places of the PN. In a graphical representation of a PN, places are represented
by circles, transitions by bars, and tokens by dots or integers in the places.
Input places of a transition are the set of places that are connected to the
transition through input arcs. Similarly, output places of a transition are
those places to which output arcs are drawn from the transition.
A transition is considered enabled in the current marking, if the number
of tokens in each input place is at least equal to the cardinality of the input

arc from that place. The firing of a transition is an atomic action in which
one or more tokens are removed from each input place of the transition, and
one or more tokens are added to each output place of the transition, possibly
resulting in a new marking of the PN. Upon firing the transition, the number
of tokens deposited in each of its output places is equal to the cardinality of
the output arc. Each distinct marking of the PN constitutes a separate state of
the PN. A marking is reachable from another marking, if there is a sequence of
transition firings starting from the original marking which results in the new
marking. The reachability set (graph) of a PN is the set (graph) of markings that are reachable from the initial marking (connected by the arcs labeled by the transitions whose firing causes the corresponding change of marking). In
any marking of the PN, multiple transitions may be simultaneously enabled.
Another type of arc in a Petri net is the inhibitor arc. An inhibitor arc drawn from a place to a transition means that the transition cannot fire if the place contains at least as many tokens as the cardinality of the inhibitor arc.
Extensions to PN have been considered by associating firing times with
the transitions. By requiring exponentially distributed firing times, we ob-
tain the stochastic Petri nets. The underlying reachability graph of an SPN is isomorphic to a continuous-time Markov chain (CTMC). Further generalization of SPNs has been introduced in (Ajmone et al. 1984), allowing
transitions to have either zero firing times (immediate transitions) or ex-
ponentially distributed firing times (timed transitions), giving rise to the
generalized stochastic Petri net (GSPN). In this paper, timed transitions are
represented by hollow rectangles, whereas immediate transitions are repre-
sented by thin bars. The markings of a GSPN are classified into two types. A
marking is vanishing if any immediate transition is enabled in the marking. A
marking is tangible if only timed transitions or no transitions are enabled in
the marking. Conflicts among immediate transitions in a vanishing marking
are resolved using a random switch (Ajmone et al. 1984).
Although GSPNs provide a useful high-level language for evaluating large
systems, representation of the intricate behavior of such systems often leads
to a large and complex structure of the GSPN. To alleviate some of these
problems, several structural extensions to Petri nets are described in (Ciardo
et al. 1989), which increase the modelling power of GSPNs. These extensions
include guards (enabling functions), general marking dependency, variable
cardinality arcs, and priorities. Some of these structural constructs are also
used in stochastic activity networks (SANs) (Sanders and Meyer 1986) and
GSPNs (Chiola 1985). Stochastic extensions were also added to GSPNs to
permit the specification of reward rates at the net level, resulting in stochastic
reward nets (SRN). All these extensions will be described in the following
subsections.
To illustrate the concepts further, we consider an SRN model for the
computer system example. We consider one further extension to this model,

Fig. 4.1. SRN model for the computer system

namely imperfect coverage of the failure of the workstations. Imperfect coverage means that whenever a workstation suffers a failure, the failure is properly
detected with probability c, called the coverage probability. So with proba-
bility 1 - c, the workstation suffers an uncovered failure, wherein the failure
goes undetected. We assume that this undetected failure results in the corruption of the file-server, causing it to fail and resulting in system failure. The
corresponding SRN model is shown in Figure 4.1. In this model, place wsup indicates the number of workstations that are still functioning, wsdn indicates the number of workstations failed, fsup indicates the file-server being up, fsdn indicates the file-server being down, and wst is a temporary place holding a token while a decision is being made whether the workstation failure is covered or not. Timed transitions wsfl, wsrp, fsfl, and fsrp represent the failure and repair of the workstations and the file-server respectively. Further, in this case, the rate of firing of the transition wsfl is dependent on the number of tokens in wsup, and hence the firing rate of the transition is expressed as #(wsup, i) λ_w, where #(wsup, i) represents the number of tokens in wsup in any marking i. Immediate transitions wscv and wsuc represent the covered and uncovered nature of the workstation failure respectively.
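The enabling and firing rules of Section 4.1.1 can be sketched compactly. The fragment below is an illustration added here (writing fsup/fsdn for the file-server places of the SRN of Fig. 4.1); it fires the workstation-failure transition once and evaluates its marking-dependent rate, with timing, priorities, and guards omitted.

```python
# Markings are dicts mapping place names to token counts; arcs are dicts
# mapping place names to arc cardinalities.

def enabled(marking, inputs, inhibitors=None):
    """Enabled iff every input place meets its input-arc cardinality and
    every inhibitor place is below its inhibitor-arc cardinality."""
    inh = inhibitors or {}
    return (all(marking[p] >= k for p, k in inputs.items())
            and all(marking[p] < k for p, k in inh.items()))

def fire(marking, inputs, outputs):
    """Atomic firing: remove input-arc tokens, deposit output-arc tokens."""
    m = dict(marking)
    for p, k in inputs.items():
        m[p] -= k
    for p, k in outputs.items():
        m[p] += k
    return m

# Initial marking of the SRN: two workstations and the file-server up
m0 = {"wsup": 2, "fsup": 1, "wst": 0, "wsdn": 0, "fsdn": 0}

# Transition wsfl moves one token from wsup to wst; its firing rate in a
# marking is the marking-dependent expression #(wsup) * lambda_w
wsfl_in, wsfl_out = {"wsup": 1}, {"wst": 1}
lam_w = 1e-4
rate = m0["wsup"] * lam_w            # here 2 * lambda_w

m1 = fire(m0, wsfl_in, wsfl_out) if enabled(m0, wsfl_in) else m0
print(m1)                            # one workstation now awaits the coverage choice
```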
The reachability graph corresponding to this SRN model is shown in Figure 4.2. In this figure all the tangible markings are indicated by rounded rectangles, and all the vanishing markings by dashed rectangles. The directed arrows show how the system moves from one marking to another by the firing of the appropriate transitions. The vector <abcde> enclosed in the rectangles represents the SRN marking, such that the number of tokens is a in wsup, b in fsup, c in wst, d in wsdn, and e in fsdn.

Fig. 4.2. The reachability graph for the SRN model

The corresponding continuous time Markov chain may be derived from the
reachability graph by eliminating the vanishing markings. The corresponding
CTMC model is shown in Figure 4.3. The algorithm for converting from the
SRN to the CTMC description may be found in Ciardo et al. (1993).

Fig. 4.3. The CTMC for the SRN model

4.1.2 Marking dependency. Perhaps the most important characteristic of SRNs is the ability to allow extensive marking dependency. Parameters
(such as the rate of a timed transition, the cardinality of an input arc, or the
reward rate in a marking) can be specified as a function of the number of
tokens in some (possibly all) places. Marking dependency can lead to more
compact models of complex systems. As an example, note that the rate of
the transition wsfl in the SRN model above is marking dependent.
4.1.3 Variable cardinality arc. In the standard PN and in most SPN
definitions, the cardinality of an arc is a constant integer value (Peterson

1981). If the cardinality of the input arc from place p to transition t is k, then k tokens must be in p before t can be enabled; moreover, when t fires,
k tokens are removed from p. Often, all tokens in p must be moved to some
other place q (Dugan 1984). A constant cardinality arc cannot accomplish
this in a compact way. This behavior can be easily described in SRNs by
specifying the cardinalities of the input arc from p to t and of the output
arc from t to q as #(p), the number of tokens in p. This representation has
several advantages: it is more natural, no additional transitions or places are
required, and the execution time (to generate the reachability graph) is likely
to be shorter.
The use of variable cardinality is somewhat similar to the conditional case
construct of SANs (Sanders and Meyer 1986). We allow variable cardinality input arcs, output arcs, and inhibitor arcs.
When the cardinality of the arc is zero, the arc is considered absent. The
user of SRNs must be aware of the difference between defining the cardinality
of an input arc as "max{1, #(p)}" or as "#(p)". The former definition disables
t when p is empty; the latter does not. The correct behavior depends on the
particular application.
4.1.4 Priorities. Often, an activity must have precedence over another
when they both require the same resource. Inhibitor arcs may be used to rep-
resent such constraints, but they may clutter the model. It is more convenient
to incorporate transition priorities directly into the formalism. Traditionally,
priorities have been defined by assigning an integer priority level to each
transition, and adding the constraint that a transition may be enabled only
if no higher priority transition is enabled. This can be generalized further by
requiring only a partial order among transitions. Thus a priority relationship
between two transitions t1 and t2 can be defined, for example, as t1 > t2,
implying that t1 has higher priority than t2. This added flexibility
provides a simple way to model the situation where t1 > t2 and t3 > t4, but t1
has no priority relation with respect to t3 or t4.
4.1.5 Guards. Each transition t may have an associated (Boolean) guard
g. The function is evaluated in marking M when "there is a possibility that
t is enabled", that is, when (1) no transition with priority higher than t is
enabled in M; (2) the number of tokens in each of its input places is larger
than or equal to the (variable) cardinality of the corresponding input arc; (3)
the number of tokens in each of its inhibitor places is less than the (variable)
cardinality of the corresponding inhibitor arc. Only then is g(M) evaluated; t
is declared enabled in M iff g(M) = TRUE. The default for g is the constant
function TRUE.
The ability to express complex enabling/disabling conditions textually is
invaluable. Without it, the designer might have to add extraneous arcs or even
places and transitions to the SRN, to obtain the desired behavior. The logical
conditions that can be expressed graphically using input and inhibitor arcs,
are limited by the following semantics: a logical "AND" for input arcs (all
Markov Dependability Models of Complex Systems 461

the input conditions must be satisfied), and a logical "OR" for inhibitor arcs
(any inhibitor condition is sufficient to disable the transition). For instance,
a guard such as (#(p1) ≥ 3 ∨ #(p2) ≥ 2) ∧ (#(p3) = 5 ∨ #(p4) ≤ 1) is difficult
to represent graphically.
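The enabling rule of Sections 4.1.4 and 4.1.5 (priorities first, then arc conditions, then the guard) can be sketched as follows; the transition encoding and helper names are invented for illustration only:

```python
# Hypothetical sketch of the enabling rule: a transition is enabled iff
# (1) no higher-priority transition is enabled, (2) every input place holds
# at least the (possibly marking-dependent) arc cardinality, (3) every
# inhibitor place holds fewer tokens than its arc cardinality, and only
# then (4) its guard g(M) is evaluated.

def arcs_satisfied(t, M):
    ok_in = all(M[p] >= card(M) for p, card in t["inputs"])     # logical AND
    ok_inh = all(M[p] < card(M) for p, card in t["inhibitors"]) # none may fire
    return ok_in and ok_inh

def enabled(t, M, higher_priority):
    if any(enabled(u, M, []) for u in higher_priority):  # rule (1)
        return False
    if not arcs_satisfied(t, M):                         # rules (2), (3)
        return False
    return t["guard"](M)                                 # rule (4)

# Example: t needs max{1, #(p)} tokens in p, is inhibited by 2 tokens in r,
# and its guard requires #(p) >= 3 or #(s) <= 1.
t = {"inputs": [("p", lambda M: max(1, M["p"]))],
     "inhibitors": [("r", lambda M: 2)],
     "guard": lambda M: M["p"] >= 3 or M["s"] <= 1}

M = {"p": 3, "r": 1, "s": 5}
print(enabled(t, M, higher_priority=[]))  # True
```

The partial-order priority scheme is captured by passing each transition its own set of higher-priority transitions, rather than a global integer level.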
4.1.6 Output measures. For an SRN, all the output measures are expressed
in terms of the expected values of reward rate functions. Depending on the
quantity of interest, an appropriate reward rate is defined. In this section we
are not considering impulse rewards, but they can be easily added.
Suppose X represents the random variable corresponding to the steady-
state reward rate describing a measure of interest. A general expression for
the expected reward rate in steady-state is
E[X] = Σ_{k ∈ T} r_k π_k ,

where T is the set of tangible markings (no time is spent in the vanishing
markings), π_k is the steady-state probability of (tangible) marking k, and r_k
is the reward rate in marking k.
Analogously, let X(t) represent the random variable corresponding to the
instantaneous reward rate of interest. The expression for the expected
instantaneous reward rate at time t becomes

E[X(t)] = Σ_{k ∈ T} r_k P_k(t),

where P_k(t) is the probability of being in marking k at time t.
Similarly, let Y(t) represent the random variable corresponding to the
accumulated reward in the interval [0, t), and let Y(∞) represent the
corresponding random variable for the accumulated reward until absorption. The
expressions for the expected accumulated reward in the interval [0, t) and the
expected accumulated reward until absorption are

E[Y(t)] = Σ_{k ∈ T} r_k ∫_0^t P_k(x) dx ,

and

E[Y(∞)] = Σ_{k ∈ T} r_k ∫_0^∞ P_k(x) dx .
In the example model derived above, we assign appropriate reward rates
to the markings of the SRN to compute interesting measures. For example,
to compute the system availability, the reward rate r_i associated with the
tangible marking i is given by

r_i = 1 if #(wsup, i) > 0 ∧ #(lsup, i) = 1, and r_i = 0 otherwise.
The instantaneous availability computed for the system for three different
values of the coverage parameter c is plotted in Figure 4.4. As expected, the
availability decreases with the decrease in the coverage parameter c.
[Plot: instantaneous availability, between about 0.9998 and 1.0, versus time in hours from 0 to 20; one curve each for coverage values c = 1.0, c = 0.9, and c = 0.8]
Fig. 4.4. System availability for three coverage parameters c
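The expected-reward computations behind such availability curves are just weighted sums of marking probabilities against reward rates. A minimal sketch, with made-up marking probabilities rather than those of the SRN above:

```python
# Expected steady-state reward E[X] = sum_k r_k * pi_k, with availability as
# the measure: reward rate 1 in up markings, 0 in down markings. The marking
# probabilities below are illustrative numbers only.

pi = {"up2": 0.90, "up1": 0.08, "down": 0.02}   # steady-state marking probs
r = {"up2": 1.0, "up1": 1.0, "down": 0.0}       # availability reward rates

availability = sum(r[k] * pi[k] for k in pi)    # E[X]
print(round(availability, 6))  # 0.98

# Swapping in capacity-weighted rewards gives a performability measure,
# e.g. the expected number of operational processors:
r_cap = {"up2": 2.0, "up1": 1.0, "down": 0.0}
print(round(sum(r_cap[k] * pi[k] for k in pi), 6))  # 1.88
```

The same pattern with P_k(t) in place of π_k yields the instantaneous measure E[X(t)] plotted in Figure 4.4.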

5. System Dependencies

Earlier we mentioned that continuous time Markov chains can easily rep-
resent many of the failure and repair dependencies that arise in the modelling
of computer systems. In this section we describe the kinds of dependencies
that arise in practice and that can be handled by CTMCs. It is often assumed in
the dependability community that the failures of components are indepen-
dent. When dependencies are considered, they are usually modeled through
the use of multivariate distributions. Here we present the following
seven kinds of system behaviors that can be easily represented by CTMCs
without resorting to complex mechanisms.
1. Imperfect coverage: Common-mode failures occur occasionally in com-
plex systems; that is, the failure of a component may induce the failure
of the entire system, since the system is unable to recover from the com-
ponent failure. We can use the imperfect coverage concept to model this
behavior. As an example, consider a system composed of two identical
processors. Upon failure of one of the processors, the system may recover
and continue functioning with a single processor. Such a fault is said to
be covered. Alternatively the system may not recover from the proces-
sor failure, causing the entire system to fail; the corresponding fault is
said to be not covered. We assume that upon failure of a processor, the
system recovers with probability c (covered failure) or the system fails
to recover with probability 1 - c (uncovered failure). The system has
imperfect coverage if c < 1.0. This system can be modeled by a CTMC
with three states, as shown in Figure 5.1(a). This dependence may easily
be mapped into the shock model of failure, as shown in the three-state
[Figure: two three-state CTMCs - (a) Imperfect Coverage Model, with covered-failure rate 2λc; (b) Shock Model]

Fig. 5.1. Imperfect coverage model

CTMC in Figure 5.1(b). Here λ1 is the rate of a processor failure from which
the system is able to recover, λ2 is the rate of a processor failure from which
the system does not recover, and λ3 is the rate of processor failure when only
one processor is functioning. The advantage of the im-
perfect coverage approach is that it allows the separation of the statistics
of the failure rates from that of the coverage. The coverage factor can be
estimated through fault injection experiments (Wang and Trivedi 1995).
2. Fault detection and other related delays: Recovery from a failure is not
instantaneous. The system may require a short reconfiguration and/or
reboot time. The reconfiguration/reboot time plays a crucial role in de-
termining system dependability; see, for example, (Trivedi et al. 1990).
To adequately represent such reconfiguration/reboot delays, we need a
state space model of the system.
3. Transient/intermittent/near-coincident faults: Component failures are
not always permanent. Transient and intermittent faults account for a
significant portion of the component faults. Upon occurrence of a fault,
the fault handling mechanism has to identify the nature of the fault and
take appropriate action. This behavior can be modeled explicitly using
fault-error handling submodels (Dugan et al. 1986 and Geist and Trivedi
1990). Near-coincident faults (faults occurring while the system is recov-
ering from a previous fault) may be catastrophic. The fault-error handling
mechanism can be extended to take care of near-coincident faults (Geist
and Trivedi 1990). This can then be incorporated into the failure-repair
system model. Once again, to model such complexities, Markov models
are needed.
4. Repair dependence: Repair personnel are usually shared among the failed
components. Priority for repair among different kinds of components,
both preemptive and non-preemptive, can be considered. Field service
travel time may also be involved, where the repair personnel need to
travel to the site. However, this travel time appears only once, inde-
pendent of the number of components waiting for repairs. Furthermore,
both imperfect repair and faulty replacements can also be considered.
Once again, Markov and SPN models have been used to capture such
behavior (Ibe et al. 1989 and Muppala et al. 1992).
5. Hardware-software co-dependence: Failure of software usually does not
impact the underlying hardware, so the hardware can continue to execute
other software. However, failure of the hardware automatically implies
that the software running on the hardware will fail. This implied failure
of the software (upon failure of the underlying hardware) can also be
modeled through Markov chains.
6. Performance-dependability dependence: The system's performance and
dependability are also correlated, due to the following causes:
a) The failure of some components may in turn increase the load im-
posed upon the remaining components. Consequently the failure
rates of the functioning components might increase. This can be mod-
eled in Markov chains by making the failure rates dependent on the
number of functioning/failed components.
b) Degradable systems, which continue to function even in the presence
of failures, are best characterized by a combined evaluation of their
performance and dependability. This has led to the development of
performability concepts (Meyer 1982 and Trivedi et al. 1992) based
on Markov reward models (Howard 1971).
c) Inadequate performance behavior of a system may sometimes be con-
strued as a failure (Logothetis and Trivedi 1995). For example, in a
client-server based distributed system, a large delay in the server re-
sponding to a client request, may prompt the client to assume that
the server has failed.
7. Phased mission models: Phased mission models are common in situations
where the system's configuration and behavior change in different
phases (Dugan 1991 and Kim and Park 1994); for example, a flight control
system has at least three distinct phases: take-off, cruising, and landing.
The failure rates as well as system requirements may be dependent upon
the phase. Markov chains can be used to develop phased mission models,
such that the final state probabilities of one phase are mapped into the
initial state probabilities in another phase. Note that both the structure
of the CTMC and the set of UP and DOWN states may change with the
phase (Somani et al. 1992).
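The first of these behaviors can be made concrete: a sketch of the two-processor imperfect-coverage CTMC of Figure 5.1(a), without repair, and its mean time to failure (MTTF). The parameter values and state encoding are illustrative assumptions:

```python
# Three-state imperfect-coverage CTMC of Fig. 5.1(a), no repair: states
# {2up, 1up, failed}, with 'failed' absorbing. Rates and coverage are
# illustrative numbers, not values from the text.

lam, c = 1e-3, 0.95          # per-processor failure rate, coverage factor

Q = {("2up", "1up"): 2 * lam * c,           # covered processor failure
     ("2up", "failed"): 2 * lam * (1 - c),  # uncovered (system) failure
     ("1up", "failed"): lam}                # failure of the last processor

# For an absorbing CTMC, the MTTF from transient state i solves
#   tau_i = 1/q_i + sum_j (q_ij / q_i) * tau_j over transient successors j.
q_2up = Q[("2up", "1up")] + Q[("2up", "failed")]  # total exit rate of '2up'
p_covered = Q[("2up", "1up")] / q_2up             # jump probability to '1up'
tau_1up = 1.0 / Q[("1up", "failed")]
tau_2up = 1.0 / q_2up + p_covered * tau_1up

print(round(tau_2up, 6))  # 1450.0
```

Lowering the coverage c shortens the MTTF, since more failures bypass the degraded state and take the system down directly.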

6. Analysis Methods for Transient Behavior


We now discuss various techniques, ranging from fully symbolic to fully nu-
meric, for obtaining the state probabilities of the Markov chains. Wherever
appropriate, not only the solution methods for P(t), but also for L(t) and π
are explored.

6.1 Fully Symbolic Method

We note that the Kolmogorov differential equation (2.1) is a first order linear
differential equation that can be solved using Laplace transforms (Trivedi
1982). Taking the Laplace transform on both sides of the equation, we get
s P(s) − P(0) = P(s) Q .

Rearranging the terms,

P(s) = P(0)(sI − Q)^{−1} ,
where I is the identity matrix. The transient state probability vector is ob-
tained by computing the inverse Laplace transform of P(s). In general, com-
puting the inverse Laplace transform for this equation is extremely difficult,
except for Markov chains with very small state spaces; details may be found
in (Trivedi 1982). The advantage of this method is that the solution thus ob-
tained will be closed-form and fully symbolic in both the system parameters
and time t. In principle this approach can also be used to compute L(t).

6.2 Semi-symbolic Method

Suppose the matrix Q has m ≤ n distinct eigenvalues, say λ_1, λ_2, ..., λ_m,
arranged in non-decreasing order of magnitude. Since Q is singular, λ_1 = 0.
Let d_i be the multiplicity of λ_i. The general solution for the state probability
P_i(t) can be written as

P_i(t) = Σ_{j=1}^{m} Σ_{k=1}^{d_j} a_{jk} t^{k−1} e^{λ_j t} ,

where the a_{jk}'s are constants. The state probabilities can be easily computed
once the eigenvalues λ_j of the Q matrix and the constants a_{jk} are computed.
For an acyclic Markov chain, the diagonal elements of the Q matrix yield
the required eigenvalues. Using the convolution integration approach (Trivedi
1982), an O(n²) algorithm has been developed in Marie et al. (1987). With a
sparse Q matrix, the algorithm can be further simplified to obtain an O(η)
algorithm, where η is the number of non-zero entries in the Q matrix.
For a general Markov chain, an O(n³) algorithm has been developed in
Tardif et al. (1988) and Ramesh and Trivedi (1995). They first determine
the eigenvalues of the Q matrix, using the QR algorithm (Wilkinson and
Reinsch 1971). Subsequently, the a_{jk} constants are determined by solving a
linear system of equations.
This method yields a closed-form solution for the state probabilities, as
a function of the time variable t. In general, this method cannot be used
for Markov chains with large state spaces (≥ 400 states), because the QR
algorithm produces a full upper Hessenberg matrix causing space and time
limitations. We are thus forced to resort to fully numerical solution methods
that are discussed next.

6.3 Numerical Methods

We can write the general solution of equation (2.1) as


P(t) = P(0) e^{Qt} ,    (6.1)

where the matrix exponential e^{Qt} is given by the following Taylor series (Moler
and Loan 1978):

e^{Qt} = Σ_{i=0}^{∞} (Qt)^i / i! .

Direct evaluation of the matrix exponential is subject to severe round-off
problems, since the Q matrix contains both positive and negative entries. In
this section, we present several methods that numerically compute the state
probabilities.
6.3.1 Randomization. Randomization (Grassman 1987, Jensen 1953, Keil-
son 1979 and Reibman and Trivedi 1988) is a very popular numerical method
for computing the state probabilities. Note that in the literature randomiza-
tion has also been referred to as uniformization and Jensen's method. Using
randomization the transient state probabilities of the CTMC are computed
as

P(t) = Σ_{i=0}^{∞} Π(i) e^{−qt} (qt)^i / i! ,    (6.2)

where q ≥ max_i |q_{ii}|; Π(i) is the state probability vector of the underlying
discrete time Markov chain (DTMC) after step i. Π(i) is computed itera-
tively:

Π(0) = P(0),    (6.3)
Π(i) = Π(i − 1) Q* ,    (6.4)

where Q* = Q/q + I. In practice, the summation in equation (6.2) is carried
out up to a finite number of terms k, called the right truncation point. The
number of terms required to meet a given error tolerance ε is computed from

1 − Σ_{i=0}^{k} e^{−qt} (qt)^i / i! ≤ ε .

As qt increases, the Poisson distribution thins from the left; that is, the terms
in the summation for small i become less significant. Thus it may be profitable
to start the summation at a value l > 0, called the left truncation point (see
De Souza and Gail 1989 and Reibman and Trivedi 1988), to avoid the less
significant terms. In this case, equation (6.2) reduces to

P(t) ≈ Σ_{i=l}^{k} Π(i) e^{−qt} (qt)^i / i! .    (6.5)
We compute the values of l and k from the specified truncation error tolerance
ε, using

Σ_{i=0}^{l−1} e^{−qt} (qt)^i / i! ≤ ε/2 ,    1 − Σ_{i=0}^{k} e^{−qt} (qt)^i / i! ≤ ε/2 .
Randomization has several desirable properties. We can bound the error
due to truncation of the infinite series. Thus given a truncation error tol-
erance requirement, we can precompute the number of terms of the series
needed to satisfy this tolerance. Since this method involves only additions
and multiplications and no subtractions, it is not subject to severe roundoff
errors.
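The basic iteration of equations (6.2)-(6.4) fits in a few lines of pure Python; the two-state failure/repair model below is an illustrative example, not one of the models in the text:

```python
# Randomization (uniformization) sketch: Pi(i) = Pi(i-1) Q* with
# Q* = Q/q + I, summed with Poisson weights until the accumulated Poisson
# mass reaches 1 - eps (the right truncation point of equation (6.2)).

import math

def uniformize(Q, p0, t, eps=1e-12):
    n = len(Q)
    q = 1.1 * max(-Q[i][i] for i in range(n))       # q > max_i |q_ii|
    Qstar = [[Q[i][j] / q + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]                     # DTMC matrix Q* = Q/q + I
    pi, result = list(p0), [0.0] * n
    poisson, mass, i = math.exp(-q * t), 0.0, 0     # Poisson term for i = 0
    while mass < 1.0 - eps:
        for j in range(n):
            result[j] += poisson * pi[j]            # term i of equation (6.2)
        mass += poisson
        i += 1
        poisson *= q * t / i                        # Poisson recurrence
        pi = [sum(pi[k] * Qstar[k][j] for k in range(n)) for j in range(n)]
    return result

lam, mu = 0.5, 1.5                                  # illustrative 2-state model
Q = [[-lam, lam], [mu, -mu]]
P = uniformize(Q, [1.0, 0.0], t=2.0)

# Closed form for this chain: P_0(t) = mu/(lam+mu) + lam/(lam+mu) e^{-(lam+mu)t}
exact = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * 2.0)
print(abs(P[0] - exact) < 1e-9)  # True
```

For large qt the Poisson weights here would underflow; that is exactly the problem the Fox-Glynn computation discussed below is designed to avoid.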
One of the main problems with randomization is its O(ηqt) complexity
(Reibman and Trivedi 1988). The number of terms needed for randomization
between the left and the right truncation point is O(√(qt)). However, it is nec-
essary to obtain the DTMC state probability vector at l, the left truncation
point, and l is O(qt). Thus we need to compute O(ηqt) matrix-vector mul-
tiplications. Instead of using successive matrix-vector multiplies (MVMs) to
compute this vector, we could use the matrix squaring method and change the
complexity of computing Π(l) from O(ηqt) to O(n³ log(qt)) (Reibman and
Trivedi 1988), where n is the number of states in the Markov chain. How-
ever, the problem with this method is that squaring results in fill-in (reducing
sparsity), and hence it is not feasible for CTMCs with large state spaces.
When qt is large, computing the Poisson probabilities, especially near
the tails of the distribution, may result in underflow problems (Fox and
Glynn 1988). We thus choose to use the method suggested by Fox and Glynn
(1988) to compute l and r. This method computes the Poisson probabilities
e^{−qt} (qt)^i / i! for all i = l, l + 1, ..., r − 1, r, and is designed to
avoid the underflow problems.
We have suggested a modified randomization-based method (Malhotra
et al. 1994) that addresses some of the problems caused by large values of
qt. Our method is based on recognizing the steady-state for the underlying
DTMC. We can take advantage of this fact, and rewrite the randomization
equations in such a way that further computation is minimized. One nicety
of our method is that the computation time is now controlled by the sub-
dominant eigenvalue of the DTMC matrix rather than by qt. Thus stiffness
as seen by the new randomization algorithm is the same as that seen by the
power method (see Section 6.4.1) used for computing the steady-state solution
for the CTMC (Stewart and Goyal 1985). In our experience with a variety
of problems, we have found significant improvement in the computational
requirement for the new method over the old method.
We begin by observing that equations (6.3) and (6.4), which are used to
compute the probability vectors for the underlying DTMC, also represent the
iteration equations of the power method. If the convergence of the probability
vector to steady-state is guaranteed, then we can terminate the iteration in
equation (6.4) upon attaining the steady-state; this gives considerable sav-
ings in computation. In order to ensure convergence of the power iteration
equation (6.4), we require that
q > max_j |q_{jj}|    (6.6)

since this assures that the DTMC is aperiodic (Goyal et al. 1987). Note that
we do not require that the CTMC (or the DTMC) be irreducible. Indeed
we allow a more general structure with one or more recurrent classes and
a (possibly empty) transient class of states. Let Π* denote the steady-state
probability vector of the DTMC.
Assume that the probability vector for the underlying DTMC attains
steady-state at the S-th iteration, so that ‖Π(S) − Π*‖ is bounded above
by a given error tolerance. Three different cases arise in the computation of
the transient state probability vector of the CTMC: (1) S > k, (2) l < S ≤ k,
and (3) S ≤ l. We examine each of these cases individually. In the following
equations we will denote the transient state probability of the CTMC com-
puted by the new randomization algorithm as P(t).

Case 1 (S > k): In this case the steady-state detection has no effect, and the
probability vector is calculated using equation (6.5).
Case 2 (l < S ≤ k): Consider equation (6.5). By using Π(i) = Π(S), i > S,
the equation can be rewritten, setting the right truncation point k to ∞:

P(t) = Σ_{i=l}^{∞} Π(i) e^{−qt} (qt)^i / i!
     = Σ_{i=l}^{S} Π(i) e^{−qt} (qt)^i / i! + Π(S) Σ_{i=S+1}^{∞} e^{−qt} (qt)^i / i!
     = Σ_{i=l}^{S} Π(i) e^{−qt} (qt)^i / i! + Π(S) (1 − Σ_{i=0}^{S} e^{−qt} (qt)^i / i!) .

Case 3 (S ≤ l): The DTMC reaches steady-state before the left truncation
point. In this case, no additional computation is necessary and P(t) is set
equal to Π(S).

For stiff problems, the number of terms needed to meet the truncation
error tolerance requirements is often very large. However as shown above, if
the DTMC steady-state can be detected, large computational savings result.
In our experience, this is often true, especially when the time values are large.
The detection of steady-state for the underlying DTMC needs to be done
with extreme care. We have implemented the steady-state detection based
on the suggestions given in Stewart and Goyal (1985). The usual method
for checking the convergence is to test some norm of the difference between
successive iterates. However, if the method is converging slowly, the change in
the elements of the vector between successive iterates might be smaller than
the error tolerance required. In this situation, we might incorrectly assume
that the system has reached steady-state, even though it is far from reaching
the steady-state. To avoid this problem, we compare iterates that are spaced
m iterations apart, i.e., we check the difference between Π(i) and Π(i −
m). Ideally, m should be varied according to the convergence rate, which is
difficult to implement in practice. Instead we choose m based on the iteration
number: m = 5 when the number of iterations is less than 100, m = 10 when
it is between 100 and 1000 and m = 20 when it is greater than 1000. We also
check for steady-state every m iterations (instead of checking at the end of
each iteration), thus saving a lot of unnecessary computation.
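The detection scheme can be sketched as follows; this is an illustrative re-implementation, not the authors' code, and for simplicity it uses a fixed spacing m and stores all iterates (a real solver would vary m with the iteration count and keep only two vectors):

```python
# Steady-state detection for the DTMC power iterations Pi(i) = Pi(i-1) Q*:
# compare iterates spaced m apart and stop when the max-norm difference
# falls below a tolerance.

def dtmc_power_iterate(Qstar, p0, tol=1e-10, m=5, max_iter=100000):
    n = len(Qstar)
    history = [list(p0)]
    pi = list(p0)
    for i in range(1, max_iter + 1):
        pi = [sum(pi[k] * Qstar[k][j] for k in range(n)) for j in range(n)]
        history.append(list(pi))
        if i % m == 0:                      # check only every m iterations
            diff = max(abs(a - b) for a, b in zip(history[i], history[i - m]))
            if diff < tol:
                return i, pi                # S = i: steady-state detected
    return max_iter, pi

# Uniformized 2-state CTMC; q = 2.2 > max |q_ii| = 1.5 keeps the DTMC aperiodic.
lam, mu, q = 0.5, 1.5, 2.2
Qstar = [[1 - lam / q, lam / q], [mu / q, 1 - mu / q]]
S, pi = dtmc_power_iterate(Qstar, [1.0, 0.0])
print(abs(pi[0] - mu / (lam + mu)) < 1e-8)  # True: CTMC steady state recovered
```

Here detection occurs after a few dozen iterations at most, governed by the subdominant eigenvalue of Q* rather than by qt, which is the source of the savings described above.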
Randomization has also been extended to L(t) (Reibman and Trivedi
1989). Integrating equation (6.2) with respect to t yields

L(t) = (1/q) Σ_{i=0}^{∞} Π(i) Σ_{j=i+1}^{∞} e^{−qt} (qt)^j / j!
     = (1/q) Σ_{i=0}^{∞} Π(i) (1 − Σ_{j=0}^{i} e^{−qt} (qt)^j / j!) .    (6.7)

This is again a summation of an infinite series, which can be evaluated up to
the first k significant terms (Reibman and Trivedi 1989), resulting in

L(t) = (1/q) Σ_{i=0}^{k} Π(i) (1 − Σ_{j=0}^{i} e^{−qt} (qt)^j / j!) .    (6.8)
The error due to truncation, ε^{(k)}(t), can again be upper bounded. We note
that this error bound is dependent on time t. Given an
error tolerance requirement ε, we can compute the number of terms k needed
to satisfy the error tolerance requirement.
If we consider equation (6.8) for computing L(t) and consider the steady-
state for the underlying DTMC, two cases arise: (A) S > k and (B) S ≤ k.

Case A (S > k): In this case equation (6.8) is unaffected and the summation
is carried out up to k terms.
Case B (S ≤ k): In this case equation (6.8) is modified as follows:

L(t) = (1/q) Σ_{i=0}^{∞} Π(i) Σ_{j=i+1}^{∞} e^{−qt} (qt)^j / j!
     = (1/q) Σ_{i=0}^{S} Π(i) Σ_{j=i+1}^{∞} e^{−qt} (qt)^j / j!
       + (1/q) Π(S) Σ_{i=S+1}^{∞} Σ_{j=i+1}^{∞} e^{−qt} (qt)^j / j!
     = (1/q) Σ_{i=0}^{S} Π(i) Σ_{j=i+1}^{∞} e^{−qt} (qt)^j / j!
       + (1/q) Π(S) ( Σ_{i=0}^{∞} Σ_{j=i+1}^{∞} e^{−qt} (qt)^j / j! − Σ_{i=0}^{S} Σ_{j=i+1}^{∞} e^{−qt} (qt)^j / j! )
     = (1/q) Σ_{i=0}^{S} Π(i) Σ_{j=i+1}^{∞} e^{−qt} (qt)^j / j!
       + (1/q) Π(S) ( qt − Σ_{i=0}^{S} Σ_{j=i+1}^{∞} e^{−qt} (qt)^j / j! )
     = (1/q) Σ_{i=0}^{S} Π(i) (1 − Σ_{j=0}^{i} e^{−qt} (qt)^j / j!)
       + (1/q) Π(S) ( qt − Σ_{i=0}^{S} (1 − Σ_{j=0}^{i} e^{−qt} (qt)^j / j!) ) .
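Equation (6.8) can be evaluated with the same machinery as the transient probabilities; a sketch on an illustrative two-state failure/repair model, checked against the closed-form integral of the state probability:

```python
# Accumulated reward (expected total time in each state over [0, t)) via the
# truncated sum of equation (6.8):
# L(t) = (1/q) sum_{i=0}^{k} Pi(i) (1 - sum_{j=0}^{i} e^{-qt}(qt)^j / j!).

import math

def accumulated(Q, p0, t, k=200):
    n = len(Q)
    q = 1.1 * max(-Q[i][i] for i in range(n))
    Qstar = [[Q[i][j] / q + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    pi, L = list(p0), [0.0] * n
    poisson, cdf = math.exp(-q * t), 0.0
    for i in range(k + 1):
        cdf += poisson                       # sum_{j=0}^{i} e^{-qt}(qt)^j/j!
        for s in range(n):
            L[s] += pi[s] * (1.0 - cdf)      # term i of equation (6.8)
        poisson *= q * t / (i + 1)
        pi = [sum(pi[r] * Qstar[r][s] for r in range(n)) for s in range(n)]
    return [x / q for x in L]

lam, mu, t = 0.5, 1.5, 2.0                   # illustrative 2-state model
L = accumulated([[-lam, lam], [mu, -mu]], [1.0, 0.0], t)

# Closed form: integral over [0, t) of P_0(x) = mu/(lam+mu) + lam/(lam+mu) e^{-(lam+mu)x}
a = lam + mu
exact_up = mu / a * t + lam / a ** 2 * (1.0 - math.exp(-a * t))
print(abs(L[0] - exact_up) < 1e-9 and abs(L[0] + L[1] - t) < 1e-9)  # True
```

As a sanity check, the components of L(t) always sum to t, since the process spends the whole interval somewhere.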

6.3.2 Uniformized Power Method. Abdallah and Marie (1993) present
a variant of randomization, called the uniformized power (UP) method, to
address the stiffness problem. They observe that the error bound ε is not
achieved in practice, because of the finite precision arithmetic in the numer-
ical computation. Also, the time complexity of the randomization algorithm
grows with qt.
The randomization equation (6.2) can be rewritten as

P(t) = P(0) P̂(t)    (6.9)

where

P̂(t) = Σ_{i=0}^{∞} e^{−qt} (qt)^i / i! (Q*)^i .    (6.10)

Given a time point t at which the solution is required, the authors select a
time point t0 such that t = 2^m t0. The value t0 is chosen such that qt0 < 0.1,
to ensure that the Poisson terms e^{−qt0} (qt0)^i / i! decrease very fast, and thus the
summation can be truncated after fewer than 10 terms. Then, using Horner's algorithm, they
compute P̂(t0) with the truncated summation. The value of m is chosen to be

m = ⌊log₂[4(η + 3)qt]⌋ .

They use the randomization equations to compute P̂(t0) first. Noting that
if t_k = 2t_{k−1}, then P̂(t_k) = P̂(t_{k−1})², they use matrix squaring to compute
P̂(t_k) for different values of t_k until P̂(t) is computed. Then P(t) can be
obtained from equation (6.9).
This method also permits the solution of the Markov chain simultaneously
for different time points t_k that are 2^k multiples of t0. In their experience, this
method yields faster solution for stiff Markov chains compared with normal
randomization. However, they note that the computation of the matrix P̂(t_k)
through matrix squaring results in some fill-in, so the sparseness of the matrix
is lost. This may affect the tractability of the uniformized power method for
large problems.
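The squaring idea fits in a few lines for a dense two-state example; this is an illustrative sketch (a real implementation would exploit sparsity until squaring begins, and would not hard-code the truncation length):

```python
# Uniformized power sketch: build P_hat(t0) from the truncated randomization
# series for a small q*t0, then square it m times so that t = 2^m * t0.

import math

def matmul(A, B):
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def p_hat(Q, t0, terms=12):
    """Truncated series P_hat(t0) = sum_i e^{-q t0} (q t0)^i / i! (Q*)^i."""
    n = len(Q)
    q = 1.1 * max(-Q[i][i] for i in range(n))
    Qstar = [[Q[i][j] / q + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    power = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    result = [[0.0] * n for _ in range(n)]
    w = math.exp(-q * t0)                    # Poisson weight, term 0
    for i in range(terms):
        for r in range(n):
            for c in range(n):
                result[r][c] += w * power[r][c]
        w *= q * t0 / (i + 1)
        power = matmul(power, Qstar)         # (Q*)^{i+1}
    return result

lam, mu, t = 0.5, 1.5, 8.0                   # illustrative 2-state model
Q = [[-lam, lam], [mu, -mu]]
m = 10                                       # t0 = t / 2^m keeps q*t0 << 0.1
M = p_hat(Q, t / 2 ** m)
for _ in range(m):                           # M <- M^2 doubles the time point
    M = matmul(M, M)
P0 = M[0][0]                                 # with P(0) = (1, 0), row 0 of M

exact = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t)
print(abs(P0 - exact) < 1e-8)  # True
```

Each squaring step yields the solution at an intermediate time point 2^k t0 for free, which is the simultaneous-solution property noted above.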
6.3.3 Adaptive Uniformization. Another method based on randomiza-
tion is adaptive uniformization (AU), proposed by van Moorsel
and Sanders (1994). This method is suitable for stiff models and models
with infinite state space, even when the transition rates are not uniformly
bounded.
For the underlying DTMC matrix, they define the set of active states at
step n, (n = 0, 1, 2, ...) as the set Ω_n ⊆ Ω with

Ω_n = {i ∈ Ω | Π_i(n) > 0} .

Then for n = 0, 1, 2, ..., they define q_n = sup{q_ii | i ∈ Ω_n} as the adapted uni-
formization rate. The corresponding adapted infinitesimal generator matrix
at step n, Q(n) = [q_ij(n)], is defined as

q_ij(n) = q_ij if i ∈ Ω_n, and q_ij(n) = 0 otherwise.

Similarly, the adapted transition matrices for the DTMC are defined as

Q*(n) = I + Q(n)/q_n, n = 0, 1, 2, ...

Now, define a stochastic process T = {T_n, n = 0, 1, 2, ...} where

T_n = Exp(q_0) + Exp(q_1) + ... + Exp(q_{n−1}), and T_0 = 0,

with Exp(q_i) representing an exponentially distributed random variable with
rate q_i. Furthermore, define U_n(t) as the probability of exactly n jumps in
the interval [0, t]:

U_n(t) = P{T_n ≤ t ∧ T_{n+1} > t}, t ≥ 0, n = 0, 1, 2, ...

Adaptive uniformization then computes the transient state probabilities of
the Markov chain as

P(t) = P(0) Σ_{n=0}^{∞} U_n(t) ∏_{i=0}^{n−1} Q*(i) = Σ_{n=0}^{∞} U_n(t) Π(n)

with

Π(0) = P(0) and Π(n) = Π(n − 1) Q*(n − 1), n = 1, 2, ...

The infinite summation is truncated after N_a steps, where

Σ_{n=0}^{N_a} U_n(t) ≥ 1 − ε,

and ε is the desired accuracy. They call the pure birth process with tran-
sition rates q_0, q_1, ... the AU-jump process, and the DTMC subordinated to
the AU-jump process the AU process.
They note that in general the AU method requires fewer steps than the
standard uniformization for a given accuracy. However, each step of the AU
method requires more computation. Typically, AU is better than standard
randomization for t < t*, where t* is the turning point. For t > t* AU
becomes computationally more intensive than standard randomization.
When the state space is infinite, Grassman (1991) suggests a method
called dynamic uniformization which is also based on the concept of active
states and uses a fixed value of q. However this method does not yield accurate
results, because there exists a value of t at which one of the transition rates
out of an active state will exceed q. Adaptive uniformization does not suffer
from this problem, since the value of q is not fixed (but it is selected at each
step n based on the set of the active states).
6.3.4 ODE-based Methods. Numerical solution of Markov chains re-
quires the solution of a system of ODEs for which standard techniques are
known. Different methods can be used for different kinds of problems. For
example, stiff methods can be used for stiff systems (or stiff Markov chains).
Methods also differ in the accuracy of the solution yielded and computational
complexity.
ODE solution methods discretize the solution interval into a finite number
of time intervals {t_1, t_2, ..., t_i, ..., t_n}. Given a solution at t_i, the solution at
t_i + h (= t_{i+1}) is computed. Advancement in time is made with step size
h, until the time at which the solution is desired (we call it the mission time)
is reached. Commonly, the step-size is not constant, but varies from step to
step. ODE solution methods can be classified into two categories: explicit and
implicit.
For stiff systems, the step size of an explicit method may need to be ex-
tremely small to achieve the desired accuracy (Gear 1971). However, when
the step size becomes very small, the round-off effects become significant and
computational cost increases greatly (as many more time steps are needed).
Implicit ODE methods, on the other hand, are inherently stable as they do
not force a decrease in the step-size to maintain stability. The stability of
implicit methods can be characterized by the following definitions. A method
is said to be A-stable if all numerical approximations to the actual solution
tend to zero as n → ∞ when it is applied to the differential equation ẏ = λy
with a fixed positive h and a (complex) constant λ with a negative real part
(Gear 1971); n is the number of mesh points, which divide the solution inter-
val. For extremely stiff problems, even A-stability does not suffice to ensure
that rapidly decaying solution components decay rapidly in the numerical
approximation as well, without a large decrease in the step-size. This could
lead to a phenomenon called ringing, i.e., the successively computed values
tend to be of the same magnitude but of opposite sign (y_{i+1} = −y_i). To pre-
vent ringing, the step-size must be reduced further (Bank et al. 1985), which
leads us back to the same problem. Axelsson (1969) defined methods to be
stiffly A-stable if, for the equation ẏ = λy, y_{i+1}/y_i → 0 as Re(λh) → −∞.
This property is also known as L-stability (Lambert 1991). In this paper, we
describe two L-stable ODE methods.
TR-BDF2 Method. This is a re-starting cyclic multi-step composite method
that uses one step of TR (trapezoidal rule) and one step of BDF2 (second
order backward difference formula) (Bank et al. 1985). This method borrows
its L-stability from the L-stability of the backward difference formula, while
the TR step provides the desirable property of re-starting. A single step of
TR-BDF2 is composed of a TR step from t_i to t_i + γh and a BDF2 step
from t_i + γh to t_{i+1}, where 0 < γ < 1. For the system of equations (2.1), this
method yields

P(t + γh)(I − (γh/2) Q) = P(t)(I + (γh/2) Q)    (6.11)

for the TR step, and

P(t + h)((2 − γ)I − (1 − γ)hQ) = (1/γ) P(t + γh) − ((1 − γ)²/γ) P(t)    (6.12)

for the BDF2 step.


Most implementations of ODE methods adjust the step-size at each step,
based on the amount of error in the solution computed at the end of the
previous step. To estimate the amount of error for the TR-BDF2 method, the
principal truncation error term per step is obtained by Taylor series expansion
of terms in equations (6.11) and (6.12). For a system of differential equations,
an LTE vector is obtained, with each element corresponding to the
local truncation error (LTE) in each state probability value. The LTE vector
ε(h) for the TR-BDF2 method at time t + h is given by

ε(h) = [(−3γ² + 4γ − 2) / (12(2 − γ))] h³ P(t) Q³ ,

where h is the step size at time t. Direct estimation of the LTE vector is
perhaps most accurate, but it requires three matrix-vector multiplications. A
divided difference estimator suggested in Bank et al. (1985) is less expensive
and provides a good estimate of the LTE:

ε(h) = [(−3γ² + 4γ − 2) / (6(2 − γ))] [ −P(t)Q/γ + P(t + γh)Q/(γ(1 − γ)) − P(t + h)Q/(1 − γ) ] h .
The TR-BDF2 method provides reasonable accuracy for error tolerances
up to 10^{−8} and excellent stability for stiff Markov chains (Reibman and
Trivedi 1988); for tighter error tolerances, however, the computational com-
plexity of this method rises sharply. Computation of state probabilities with
a precision requirement as high as 10^{−10} is not uncommon in practice. In
such cases, this method becomes computationally expensive. Thus, the need
arises for L-stable methods with higher orders of accuracy.
Implicit Runge-Kutta Method. Implicit Runge-Kutta methods of different or-
ders of accuracy are possible. Axelsson (1969) showed that an r-stage approx-
imation applied to the test problem ẏ = λy would yield

y(t + h) = [P_{r−1}(hλ) / Q_r(hλ)] y(t) ,    (6.13)

where P_{r−1}(x) and Q_r(x) respectively are polynomials of degree r − 1 and r
in x. These approximations yield L-stable methods of the order of accuracy
2r − 1, that is,

e^{hλ} − P_{r−1}(hλ)/Q_r(hλ) = O((hλ)^{2r}) .    (6.14)

For the Markov chain solution, these approximations take the form of
matrix polynomials (polynomials in hQ). When computing state probabilities
using equation (2.1), these methods yield a linear algebraic system at each
time step:
P(t + h) Σ_{i=0}^r α_i (hQ)^i = P(t) Σ_{i=0}^{r−1} β_i (hQ)^i ,    (6.15)

where α_i and β_i are constants whose values are determined based upon the
order of the method desired. The 0th power of hQ is defined to be the identity
matrix I. In general, these methods involve higher powers of the generator
matrix Q. Substituting r = 2 into equation (6.15), we get a third order
L-stable method:
P(t + h)(I − (2/3)hQ + (1/6)h²Q²) = P(t)(I + (1/3)hQ) .    (6.16)
Similarly, using r = 3, a fifth order method may be derived. In principle, we
could derive methods of even higher order. However, with higher orders, we
also need to compute higher powers of the Q matrix, which means increased
computational complexity. We restrict ourselves to the third order method,
described by equation (6.16).
Various possibilities exist for solving the system in equation (6.16).
Markov Dependability Models of Complex Systems 475

- One possibility is to compute the matrix polynomial directly. This method


involves squaring the generator matrix; it is reasonable to expect that the
fill-in will not be much. In the different models we tried, we found that the
fill-in was usually not more than ten percent.
- The other option is to factorize the matrix polynomial. We then need to
solve two successive linear algebraic systems. For example, the left hand
side polynomial in equation (6.16) can be factorized as
P(t + h)(I − r₁hQ)(I − r₂hQ) = P(t)(I + (1/3)hQ) ,    (6.17)

where r₁ and r₂ are the reciprocals of the roots of the polynomial
1 − (2/3)x + (1/6)x². This system
can be solved by solving two systems:
X(I − r₂hQ) = P(t)(I + (1/3)hQ)    (6.18)

P(t + h)(I − r₁hQ) = X .    (6.19)

Unfortunately, the roots r₁ and r₂ are complex conjugates; hence this ap-
proach will require the use of complex arithmetic.
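Under the first option above (forming the matrix polynomial directly, which avoids complex arithmetic), one step of the method in equation (6.16) can be sketched as follows; this is our own illustration, and the 2-state chain and its rates are hypothetical, chosen so the linear system is a 2×2 solve.

```python
import math

def solve_row(A, b):
    """Solve x A = b for the row vector x, with A a 2x2 matrix."""
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    inv = [[ A[1][1] / det, -A[0][1] / det],
           [-A[1][0] / det,  A[0][0] / det]]
    return [b[0] * inv[0][0] + b[1] * inv[1][0],
            b[0] * inv[0][1] + b[1] * inv[1][1]]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def irk3_step(P, Q, h):
    """P(t+h)(I - (2/3) h Q + (1/6) h^2 Q^2) = P(t)(I + (1/3) h Q)."""
    I = [[1.0, 0.0], [0.0, 1.0]]
    Q2 = mat_mul(Q, Q)             # squaring Q causes the fill-in noted above
    lhs = [[I[i][j] - (2.0 / 3.0) * h * Q[i][j] + (1.0 / 6.0) * h * h * Q2[i][j]
            for j in range(2)] for i in range(2)]
    rhs = [sum(P[k] * (I[k][j] + (1.0 / 3.0) * h * Q[k][j]) for k in range(2))
           for j in range(2)]
    return solve_row(lhs, rhs)

lam, mu = 1.0, 2.0                 # hypothetical rates
Q = [[-lam, lam], [mu, -mu]]
P = [1.0, 0.0]
h, t_end = 0.05, 1.0
for _ in range(int(round(t_end / h))):
    P = irk3_step(P, Q, h)
exact = mu / (lam + mu) + lam / (lam + mu) * math.exp(-(lam + mu) * t_end)
```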
For the third order implicit Runge-Kutta method, the LTE vector at t + h
is given by

f(h) = −(1/72) h⁴ P(t) Q⁴ .    (6.20)
Direct calculation of the LTE is not as expensive as it seems. A careful look at
equation (6.16) reveals that we have already computed P(t)Q to compute
the right hand side. We have also computed Q² as part of the left hand side.
Hence, P(t)Q⁴ can be computed by two matrix-vector multiplications (one
with the matrix Q and one with Q²).
Implementation Details. ODE solution methods follow a few generic steps.
In the first step, various parameters are initialized: the initial step-size h₀,
the minimum step-size h_min, the maximum step-size h_max, and the error
tolerance for the linear system solvers, if any. The values of h_min and h_max
may be based upon the accuracy desired. For example, a very small h_min may
result in excessive computation time and round-off errors. The next step is to
form the LHS matrix and the RHS vector for the given system of equations
(e.g., the matrix (I − (2/3)hQ + (1/6)h²Q²) in equation (6.16)).
The calculations for each time step are then performed. For implicit meth-
ods, a linear system is solved at each time step. Sparse direct linear system
solvers (Duff et al. 1986) (such as Gaussian elimination) yield accurate re-
sults. Rows and columns can be reordered to minimize the fill-in that results
from using direct methods. For large Markov chains, direct methods may
be too expensive to use. In such cases, sparse iterative techniques (Gauss-
Seidel or successive over-relaxation(SOR)) can be used. We found Gauss-
Seidel to be sufficiently fast and accurate. Typically the matrices ((I − (γh/2)Q)
and ((2 − γ)I − (1 − γ)hQ) for TR-BDF2, and (I − (2/3)hQ + (1/6)h²Q²) for
the implicit Runge-Kutta method) are diagonally dominant because of the
special structure of the Q matrix, which helps faster convergence of the iterative
solvers. However, if Gauss-Seidel does not converge within a specified number
of iterations, then we switch to SOR. If convergence is still not achieved, then
either the tolerance is relaxed by a constant factor or we could switch to a
sparse direct method.
In the next step, the LTE vector at the end of the time step is calculated.
A scalar estimate of the LTE is obtained as a suitable norm (L₁, L₂, or L∞) of
the LTE vector. If the scalar LTE estimate is within the error tolerance, then
the step is accepted. If the end of the solution interval is reached, then the
procedure ends. Otherwise a new step-size is computed, based on the step-
size control technique, such that it is less than h max . The above steps are
repeated, starting from the step in which the LHS matrix is computed. If the
scalar error estimate is not within the error tolerance, then the step-size is
reduced and the above time-step is repeated. If the step-size must be reduced
below h_min, then two approaches are either to increase the error tolerance
or to switch to another ODE solver with a higher order of accuracy. Note that
since we work with local truncation errors, we require that error tolerance be
specified as the local error tolerance and not as the global error tolerance. It
is hard to estimate global error from the local errors occurring at each time
step. However, it is reasonable to assume that controlling local errors would
help bound the global error. There exist several step-size control techniques.
We use the following:
h_opt = h (local tolerance / LTE)^{1/(order+1)} ,    (6.21)
where order is the order of accuracy of the single-step method.
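A minimal helper implementing this rule (the function name, the clamping to [h_min, h_max] mentioned earlier, and the optional safety factor are our assumptions, not from the text):

```python
def next_step_size(h, lte, tol, order, h_min=1e-8, h_max=1.0, safety=1.0):
    """h_opt = h * (local tolerance / LTE)^(1/(order+1)), clamped to
    [h_min, h_max].  A safety factor slightly below 1 is common practice,
    though the formula in the text has none."""
    if lte <= 0.0:
        return h_max                       # no measurable error: grow freely
    h_opt = safety * h * (tol / lte) ** (1.0 / (order + 1))
    return min(h_max, max(h_min, h_opt))
```

For example, with h = 0.1, a measured LTE of 10⁻⁴, a tolerance of 10⁻⁶, and a second-order method, the rule shrinks the step to 0.1 · (10⁻²)^{1/3} ≈ 0.0215.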
Computational Complexity. The computational complexity of ODE solution
methods has traditionally been evaluated in terms of the number of function
evaluations. In our case each function evaluation is a matrix-vector multipli-
cation. For implicit methods, computational complexity is heavily dependent
on the linear system solver. Each iteration of an iterative linear system solver
takes O(η) time, where η is the number of non-zero entries in the Q matrix.
However, the number of iterations until convergence can not be bounded a
priori. Let s be the number of time-steps required by the ODE solver to
compute the state probability vector at the mission time.
For the TR-BDF2 method with an iterative linear system solver, the
complexity is O(Isη), where I is the average number of iterations per lin-
ear system solution. For the implicit Runge-Kutta method, we analyze the
case where the LHS matrix polynomial is computed directly. Computing the
matrix polynomial involves squaring the matrix and three matrix additions.
Squaring the matrix takes O(nη) time, where n is the number of states in the
Markov chain. The squaring of Q results in some fill-in. Suppose η′ denotes
the number of non-zeroes in the squared matrix, and f the fill-in ratio (η′/η).

We found that f increases with n. For most of the Markov chains we tried,
f was not more than 10 percent. Having computed the LHS matrix, the
remaining computation occurs in solving the linear system of n equations.
Using an iterative solver, the total time-complexity is O(nη + Isη′), where
I is the average number of iterations per linear system solution. We found
that usually not more than two to three iterations are required for iterative
methods to converge.
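The fill-in ratio f = η′/η can be measured directly when Q is stored in a sparse format; the sketch below is our own construction, with a small hypothetical birth-death chain (whose relative fill-in is necessarily much larger than for the big models discussed above) and Q stored as a dict keyed by (row, column).

```python
def sparse_square(Q):
    """Square a sparse matrix stored as {(i, j): value}."""
    by_row = {}
    for (i, k), v in Q.items():
        by_row.setdefault(i, []).append((k, v))
    out = {}
    # (Q^2)_{ij} = sum_k Q_{ik} Q_{kj}, visiting only stored entries
    for (i, k), vik in Q.items():
        for j, vkj in by_row.get(k, []):
            out[(i, j)] = out.get((i, j), 0.0) + vik * vkj
    return out

def fill_in_ratio(Q):
    Q2 = sparse_square(Q)
    eta = len(Q)
    eta_sq = sum(1 for v in Q2.values() if v != 0.0)
    return eta_sq / eta

# hypothetical 4-state birth-death chain: tridiagonal Q -> pentadiagonal Q^2
n = 4
Q = {}
for i in range(n - 1):
    Q[(i, i + 1)] = 1.0
    Q[(i + 1, i)] = 2.0
for i in range(n):
    Q[(i, i)] = -sum(v for (a, b), v in Q.items() if a == i and b != i)
f = fill_in_ratio(Q)
```

Here η = 10 and η′ = 14 (the band widens by one diagonal on each side), giving f = 1.4; for the large, very sparse chains discussed in the text the relative growth is far smaller.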
6.3.5 Hybrid Methods. These methods combine explicit (non-stiff) and
implicit (stiff) ODE methods for numerical transient analysis of Markov mod-
els. This approach (Malhotra 1996) is based on the property that stiff Markov
chains are non-stiff for an initial phase of the solution interval. A non-stiff
ODE method is used to solve the model for this phase, and a stiff ODE
method for the rest of the duration until the mission time. A formal crite-
rion to determine the length of the non-stiff phase is described. A significant
outcome of this approach is that the accuracy requirement automatically be-
comes a part of model stiffness. Two specific methods based on this approach
are implemented in (Malhotra 1996). Both methods use the fourth order
Runge-Kutta-Fehlberg method as the non-stiff method. One uses the TR-
BDF2 method as the stiff method, whereas the other uses an implicit Runge-
Kutta method. Results from solving several models show that the resulting
methods are much more efficient than the corresponding stiff methods (TR-
BDF2 and implicit Runge-Kutta). The implementation details are similar to
those of the standard ODE implementation, with some minor modifications
required to be able to switch from the non-stiff ODE method to the stiff ODE
method, upon detection of stiffness in the Markov chain.

6.4 Numerical Methods for Steady-State Analysis

6.4.1 Power Method. The equation for steady-state probabilities (equa-
tion (2.3)) may be rewritten as

π = π(I + Q/q) ,    (6.22)

where q ≥ max_i |q_ii|.
Substituting Q* = I + Q/q, we can set up an iteration by rewriting
equation (6.22), such that

π⁽ⁱ⁾ = π⁽ⁱ⁻¹⁾ Q* ,    (6.23)

where π⁽ⁱ⁾ is the value of the iterate at the end of the i-th step. We start off
the iteration by initializing π⁽⁰⁾ = P(0).
It is well known that this iteration converges to a fixed point (Stewart and
Goyal 1985) and the number of iterations, k, taken for convergence is governed
by the second largest eigenvalue of Q* raised to the power k. This method is

referred to as the power method. In order to ensure convergence of the power
iteration (6.23), we require that

q > max_i |q_ii| ,    (6.24)

since this assures that the DTMC is aperiodic (Goyal 1987).
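A sketch of the power iteration (our own illustration; the 3-state chain, the initial vector, and the choice q = 1.1 · max_i |q_ii|, which keeps the inequality strict, are all assumptions):

```python
def power_method(Q, p0, tol=1e-12, max_iter=100000):
    n = len(Q)
    q = 1.1 * max(abs(Q[i][i]) for i in range(n))   # strictly > max |q_ii|
    # Q* = I + Q/q is a stochastic matrix with positive diagonal
    Qstar = [[(1.0 if i == j else 0.0) + Q[i][j] / q for j in range(n)]
             for i in range(n)]
    pi = list(p0)
    for _ in range(max_iter):
        new = [sum(pi[i] * Qstar[i][j] for i in range(n)) for j in range(n)]
        if max(abs(new[j] - pi[j]) for j in range(n)) < tol:
            return new
        pi = new
    return pi

# hypothetical 3-state irreducible chain
Q = [[-2.0, 2.0, 0.0],
     [1.0, -2.0, 1.0],
     [0.0, 3.0, -3.0]]
pi = power_method(Q, [1.0, 0.0, 0.0])
```

For this chain the stationary vector is (3/11, 6/11, 2/11); the iterate sums stay at 1 because each row of Q* sums to 1.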


6.4.2 Successive Over Relaxation (SOR). The equation for steady-
state probabilities (equation (2.3)) defines a linear system of equations of
the form
xA=b.
Thus standard numerical techniques for the solution of a linear system of
equations will be applicable in this case. Direct methods such as Gaussian
elimination can be used to solve these equations. But for large Markov chains
with sparse generator matrices, iterative methods such as successive over-
relaxation (SOR) are suitable (Stewart and Goyal 1985). The matrix A is
split into three components (Ciardo et al. 1993):
A = (L + I + U)D ,
where L and U are strictly upper triangular and lower triangular, respectively.
Then the SOR iteration equation can be written as:
x⁽ᵏ⁺¹⁾ = x⁽ᵏ⁾[(1 − ω)I − ωD⁻¹L][I + ωD⁻¹U]⁻¹ + ωbD⁻¹[I + ωD⁻¹U]⁻¹ ,
where x⁽ᵏ⁾ is the k-th iterate for x, and ω is the relaxation parameter. Further
details of this method may be found in (Ciardo et al. 1993 and Stewart and
Goyal 1985).
This approach may also be applied to solve the up-to-absorption measures,
using equation (2.4).
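For the steady-state equations πQ = 0, the matrix iteration above is usually implemented component-wise; the sketch below is such an equivalent formulation (our own, with renormalization after every sweep). Setting ω = 1 reduces it to Gauss-Seidel, and the example chain is hypothetical.

```python
def sor_steady_state(Q, w=1.0, tol=1e-12, max_sweeps=10000):
    n = len(Q)
    pi = [1.0 / n] * n
    for _ in range(max_sweeps):
        old = list(pi)
        for j in range(n):
            # Gauss-Seidel value from the balance equation for state j,
            # using already-updated components of pi
            gs = sum(pi[i] * Q[i][j] for i in range(n) if i != j) / (-Q[j][j])
            pi[j] = (1.0 - w) * pi[j] + w * gs
        s = sum(pi)
        pi = [x / s for x in pi]       # renormalize each sweep
        if max(abs(pi[j] - old[j]) for j in range(n)) < tol:
            break
    return pi

Q = [[-2.0, 2.0, 0.0],
     [1.0, -2.0, 1.0],
     [0.0, 3.0, -3.0]]
pi = sor_steady_state(Q, w=1.0)
```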

7. Solution of Markov Reward Models

The solution of Markov reward models involves the computation of the expec-
tations and the distributions of various reward measures that were reviewed
earlier. In this section we briefly discuss some of the recent developments in
the solution of Markov reward models.

7.1 Computing the Expected Values of Reward Measures

The expressions for expected values of the reward measures (which were
derived in Section 2.2.1) show that these measures are dependent on the state
probabilities, P(t) and π, and the expected accumulated times in the states,
L(t) and z. Thus the computation of these measures is straightforward, once
the state probabilities and the expected accumulated times are computed.
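For instance (a trivial sketch with made-up numbers; the reward vector and probabilities are hypothetical), E[X(t)] = Σ_i r_i P_i(t) and E[Y(t)] = Σ_i r_i L_i(t) are just dot products:

```python
def dot(rates, weights):
    return sum(r * w for r, w in zip(rates, weights))

r = [100.0, 50.0, 0.0]     # reward rates, e.g. processing capacity per state
P_t = [0.7, 0.2, 0.1]      # state probabilities at time t
L_t = [6.5, 2.0, 1.5]      # expected accumulated times over [0, t)

E_Xt = dot(r, P_t)         # expected reward rate at time t
E_Yt = dot(r, L_t)         # expected accumulated reward by time t
```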

7.2 Computing the Distributions of Reward Measures

7.2.1 Computing P[Y(∞) ≤ y]. Beaudry (1978) first described a method
for computing P[Y(∞) ≤ y], the distribution of accumulated reward until
absorption. She assumed that all non-absorbing states have positive reward
rates assigned to them. Given a Markov chain {Z(t), t ≥ 0} with a reward
rate structure defined such that state i of the chain is assigned a reward rate
of r_i, a new Markov chain {Z̃(t), t ≥ 0} is constructed by dividing the
transition rates out of state i by r_i. It can be proved that the distribution
of the time to absorption of this new Markov chain yields the distribution of
accumulated reward until absorption, P[Y(∞) ≤ y], for the original Markov
chain. The reason this is true is that the sojourn time in state i of the original
Markov chain is speeded up or slowed down according to whether r_i is smaller
or larger than 1. Thus, for state i, a sojourn time of T in {Z(t), t ≥ 0} is
equivalent to a sojourn time of T·r_i in {Z̃(t), t ≥ 0}.
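A small sketch of the transformation (our own example with hypothetical rates and rewards): for an acyclic chain 0 → 1 → 2 with state 2 absorbing, scaling the rates out of state i by 1/r_i makes the mean time to absorption of the new chain equal to the mean accumulated reward of the original one.

```python
# original generator: state 2 is absorbing
Q = [[-2.0, 2.0, 0.0],
     [0.0, -4.0, 4.0],
     [0.0, 0.0, 0.0]]
r = [3.0, 0.5]     # reward rates in the transient states 0 and 1

# Beaudry's transformation: divide the rates out of state i by r_i
Qt = [[Q[i][j] / r[i] for j in range(3)] for i in range(2)]

def mean_time_to_absorption(QT):
    """Starting in state 0, solve (-Q_T) tau = 1 over the 2 transient states."""
    A = [[-QT[0][0], -QT[0][1]],
         [-QT[1][0], -QT[1][1]]]
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return (A[1][1] - A[0][1]) / det       # Cramer's rule for tau_0

acc_reward = mean_time_to_absorption(Qt)
# independent check: mean sojourns are 1/2 and 1/4 in states 0 and 1, so
# E[Y(inf)] = r_0 * (1/2) + r_1 * (1/4)
direct = r[0] / 2.0 + r[1] / 4.0
```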
Ciardo et al. (1990) extended Beaudry's method to allow for non-absorbing
states with zero reward rates, and also allowed for the underlying process to
be semi-Markovian. They note that when the reward rates are zero, the above
transformation yields states from which the transition rates are infinite. Such
a situation actually occurs in the solution of generalized stochastic Petri net
(GSPN) models (Ajmone et al. 1984), where vanishing states occur in the un-
derlying stochastic process describing the behavior of the GSPN. These states
are handled by eliminating them; that is, constructing a stochastic process
that contains only those states with non-zero sojourn times. The same prin-
ciple is used in this situation; that is eliminate those states with zero reward
rates. The solution of the time to absorption for the resulting stochastic pro-
cess yields the distribution of accumulated reward until absorption.
Note, however, that both solution methods consider only reward rates
assigned to the states of the (semi-)Markov process; they do not take into
account impulse rewards.
7.2.2 Computing P[Y(t) ≤ y]. The computation of P[Y(t) ≤ y] is in
general difficult. Several numerical methods to solve this problem have been
presented in the literature. Meyer (1982) obtained a solution for acyclic
Markov reward models with the reward rate r_i being a monotonic function
of the state labeling.
Considering the complement of the distribution function, let us denote
Ȳ(t, y) = P[Y(t) > y]. Kulkarni et al. (1986) derived a double Laplace trans-
form system relating Ȳ(t, y) and the reward rates:

(sI + uR − Q)Ȳ~*(u, s) = e ,

where Ȳ~*(u, s) is Ȳ(t, y) with a Laplace-Stieltjes transform (~) taken with
respect to y, followed by a Laplace transform (*) taken with respect to t;
R = diag[r₁, r₂, ..., r_n] is a diagonal matrix, and e is a column vector with
all elements equal to 1. Smith et al. (1988) developed a double-transform
inversion method to solve the above system of equations.

de Souza e Silva and Gail (1989) present a method based on randomization to


compute this distribution, but it requires exponential time in the number of
distinct reward rates. They use a concept called coloring to identify impor-
tant events in the process, and explore the various paths of the process using
randomization. Qureshi and Sanders (1994) extended this method to allow
for both reward rates associated with states and impulse rewards associated
with the transitions. They use the stochastic activity networks (SAN) (Cou-
villion et al. 1991) as the description method for the automatic generation of
the underlying Markov reward model. They propose a method to discard spe-
cific paths in the process, if the contribution of the path to the performance
variable being computed is not important. A bound on the error introduced
through this discarding is also given. Donatiello and Grassi (1991) present a
polynomial time algorithm based on randomization to compute the distribu-
tion. Recently de Souza e Silva et al. (1995) presented a polynomial time algorithm
to solve for the distribution.

8. Relaxing the Markovian Constraints: The Markov Regenerative Process

A major objection to the use of Markov processes in modelling the behavior
of contemporary computer systems is the assumption that (1) the holding
times in the states are exponentially distributed, and (2) the past behavior
of the process is completely summarized by the current state of the process.
Thus every state transition acts as a regeneration point for the process. The
first assumption can be generalized by allowing the holding time to have any
distribution, thus resulting in the semi-Markov process. The second assump-
tion can also be generalized by allowing not all state transitions to be renewal
points, thereby resulting in the Markov regenerative process. Mathematical
definitions for these stochastic processes are given now.
Definition 8.1 (Kulkarni 1995). A sequence of bivariate random vari-
ables {(Y_n, T_n), n ≥ 0} is called a Markov renewal sequence, if
1. T₀ = 0; ∀n ≥ 0, T_{n+1} > T_n; and Y_n ∈ {0, 1, 2, ...}
2. ∀i, j ∈ {Y_n, n ≥ 0},
P{Y_{n+1} = j, T_{n+1} − T_n ≤ t | Y_n = i, T_n, Y_{n−1}, T_{n−1}, ..., Y₀, T₀}
= P{Y_{n+1} = j, T_{n+1} − T_n ≤ t | Y_n = i} (Markov property)
= P{Y₁ = j, T₁ ≤ t | Y₀ = i} = K_ij(t) (time homogeneity)

The matrix K(t) = [K_ij(t)] is called the kernel of the Markov renewal
sequence. The time instants {T_n} are called the regeneration instants.
Definition 8.2 (Kulkarni 1995). Given a Markov renewal sequence
{(Y_n, T_n), n ≥ 0} with the kernel K(t), define N(t) as

N(t) = sup{n ≥ 0 : T_n ≤ t} .

Then the continuous-time discrete-state stochastic process, X(t), defined as

X(t) = Y_{N(t)} , t ≥ 0 ,

is called a semi-Markov process.


Definition 8.3 (Kulkarni 1995). A stochastic process {Z(t), t ≥ 0} is
called a Markov regenerative process (also known as a semi-regenerative pro-
cess) if there exists a Markov renewal sequence {(Y_n, T_n), n ≥ 0} of ran-
dom variables such that all the conditional finite dimensional distributions of
{Z(T_n + t), t ≥ 0}, given {Z(u), 0 ≤ u ≤ T_n, Y_n = i}, are the same as those
of {Z(t), t ≥ 0}, given Y₀ = i.
The above definition implies that
P{Z(T_n + t) = j | Z(u), 0 ≤ u ≤ T_n, Y_n = i} = P{Z(t) = j | Y₀ = i} .
For the Markov regenerative process (MRGP), each T_n is a renewal point
for the process. For a (semi-)Markov process these renewal points coincide
with the state transition instants. The matrix K(∞) defines the transition
probability matrix for the embedded discrete time Markov chain at the re-
generation points.
Let V_ij(t) be defined as
V_ij(t) = P{Z(t) = j | Z(0) = i} , t ≥ 0 .
It can be shown (Kulkarni 1995) that the V_ij(t) satisfy the following Volterra
integral equations:

V_ij(t) = e_ij(t) + Σ_m ∫₀^t dK_im(u) V_mj(t − u) ,    (8.1)

where

∫₀^t dK_im(u) V_mj(t − u) = K_im(t) * V_mj(t)

defines the Stieltjes convolution integral, and e_ij(t) = P{Z(t) = j, T₁ >
t | Y₀ = i}. Equation (8.1) can be rewritten in matrix form as

V(t) = E(t) + ∫₀^t dK(s) V(t − s) = E(t) + K(t) * V(t) ,    (8.2)

where E(t) = [e_ij(t)] is called the local kernel, while K(t) = [K_ij(t)] is called
the global kernel of the MRGP.
Given the initial probability vector P(0), we can compute the system state
probabilities at time t as
P(t) = P(0)V(t) .
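A first-order time-domain discretization of equation (8.2), V_n = E(t_n) + Σ_{m=1}^{n} [K(t_m) − K(t_{m−1})] V_{n−m}, can be sketched as below (our own construction, not a method from the text). As a sanity check it is fed the kernels of an ordinary 2-state CTMC, for which the MRGP reduces to the Markov chain itself and the transient solution is known in closed form; all rates are hypothetical.

```python
import math

lam, mu = 1.0, 1.0

def K(t):   # global kernel of the CTMC viewed as an MRGP
    return [[0.0, 1.0 - math.exp(-lam * t)],
            [1.0 - math.exp(-mu * t), 0.0]]

def E(t):   # local kernel: probability of no transition by time t
    return [[math.exp(-lam * t), 0.0],
            [0.0, math.exp(-mu * t)]]

def mat_add(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

def mat_mul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def solve_mrgp(t_end, steps):
    dt = t_end / steps
    Kv = [K(m * dt) for m in range(steps + 1)]
    dK = [mat_add(Kv[m], [[-x for x in row] for row in Kv[m - 1]])
          for m in range(1, steps + 1)]
    V = [E(0.0)]                        # V(0) = I
    for n in range(1, steps + 1):
        Vn = E(n * dt)
        for m in range(1, n + 1):       # discrete Stieltjes convolution
            Vn = mat_add(Vn, mat_mul(dK[m - 1], V[n - m]))
        V.append(Vn)
    return V[-1]

V1 = solve_mrgp(1.0, 200)
exact = 0.5 + 0.5 * math.exp(-(lam + mu))   # V_00(1) for lam = mu
```

The O(steps²) convolution cost hints at why general polynomial-complexity techniques, as noted in the conclusions, remain a research topic.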

Solving equation (8.1) in the time domain is in general a hard problem.
One alternative is to transform the equations, using Laplace transforms, and
then solve the resulting equations (Logothetis et al. 1995). In particular, if
we define K~(s) = ∫₀^∞ e^{−st} dK(t), E~(s) = ∫₀^∞ e^{−st} dE(t), and
V~(s) = ∫₀^∞ e^{−st} dV(t), then equation (8.1) can be transformed into

[I − K~(s)] V~(s) = E~(s) .    (8.3)


This linear system of equations can be solved for V~(s), and then the Laplace
transform inversion can be used to obtain the matrix V(t). In general, the
numerical inversion of the Laplace transform is fraught with problems, espe-
cially if the distributions are deterministic.
Another alternative for solving the equations in the time domain is to
construct a system of partial differential equations (PDEs), using the method
of supplementary variables (German and Lindemann 1994). These PDEs can
then be solved numerically.

9. Conclusions and Future Work


In this paper we reviewed the concepts of Markov and Markov reward models.
We presented several techniques for the transient and steady-state solution of
Markov models. These techniques include fully symbolic, semi-symbolic, and
numerical techniques. Several techniques for the computation of expected re-
ward rates and the distributions of accumulated reward were also presented.
We also briefly mentioned the different kinds of system dependencies that
arise in dependability modelling, and we showed how Markov models can
handle some of these dependencies. We then discussed the extension of the
Markov process to the Markov regenerative process, by relaxing the con-
straints on the Markov process.
As we mentioned in this paper, largeness of the state space of the Markov
model is a big problem. Stochastic Petri nets help address the model genera-
tion problem by automating the generation of the state space. Recently, fluid
stochastic Petri nets (Trivedi and Kulkarni 1993) were proposed as a means of
combating the largeness. A general approach for solving fluid stochastic Petri
nets is still an open problem. While two numerical techniques for transient
solution of Markov regenerative processes were proposed, general numerical
techniques with polynomial complexity are needed. Solution techniques for
the distribution of accumulated reward over a finite horizon P[Y(t) ::; y] are
still open for research.

References

Abdallah, H., Marie, R.: The Uniformized Power Method for Transient Solutions
of Markov Processes. Computers and Operations Research 20, 515-526 (1993)

Ajmone Marsan, M., Conte, G., Balbo, G.: A Class of Generalized Stochastic Petri Nets
for the Performance Evaluation of Multiprocessor Systems. ACM Transactions
on Computer Systems 2, 93-122 (1984)
Axelsson, O.: A Class of A-Stable Methods. BIT 9, 185-199 (1969)
Bank, R.E. et al.: Transient Simulation of Silicon Devices and Circuits. IEEE Trans-
actions on Computer-Aided Design 4, 436-451 (1985)
Baskett, F. et al.: Open, Closed and Mixed Networks of Queues with Different
Classes of Customers. Journal of the ACM 22, 248-260 (1975)
Beaudry, M.D.: Performance-Related Reliability Measures for Computing Systems.
IEEE Transactions on Computers C-27, 540-547 (1978)
Bobbio, A., Trivedi, K.S.: An Aggregation Technique for the Transient Analysis of
Stiff Markov Chains. IEEE Transactions on Computers C-35, 803-814 (1986)
Boyd, M. et al.: An approach to solving large reliability models. Proceedings of
IEEE/AIAA DASC Symposium. San Diego (1988)
Carrasco, J.A., Figueras, J.: METFAC: Design and Implementation of a Software
Tool for Modeling and Evaluation of Complex Fault-Tolerant Computing Sys-
tems. Proceedings of the IEEE International Symposium on Fault-Tolerant
Computing. Los Alamitos: IEEE Computer Society Press 1986
Chiola, G.: A Software Package for the Analysis of Generalized Stochastic Petri
Net Models. Proceedings of the International Workshop on Timed Petri Nets.
Los Alamitos: IEEE Computer Society Press 1985, pp. 136-143
Ciardo, G. et al.: SPNP: Stochastic Petri Net package. Proceedings of the Interna-
tional Workshop on Petri Nets and Performance Models. Los Alamitos: IEEE
Computer Society Press 1989, pp. 142-150
Ciardo, G. et al.: Performability Analysis Using Semi-Markov Reward Processes.
IEEE Transactions on Computers C-39, 1251-1264 (1990)
Ciardo, G. et al.: Automated Generation and Analysis of Markov Reward Models
Using Stochastic Reward Nets. In: Meyer, C., Plemmons, R.J. (eds.): Linear
Algebra, Markov Chains, and Queueing Models. IMA Volumes in Mathematics
and its Applications 48. Heidelberg: Springer 1993, pp. 145-191
Ciardo, G., Trivedi, K.S.: A Decomposition Approach for Stochastic Petri Net Mod-
els. Performance Evaluation 18, 37-59 (1993)
Clarotti, C.: The Markov Approach to Calculating System Reliability: Computa-
tional Problems. In: Serra, A., Barlow, R.E. (eds.): Proceedings of the Interna-
tional School of Physics. Course XCIV. Amsterdam: North-Holland 1986, pp.
55-66.
Couvillion, J.A. et al.: Performability Modeling with Ultrasan. IEEE Software 8,
69-80 (1991)
de Souza e Silva, E., Gail, H.R.: Calculating Availability and Performability Mea-
sures of Repairable Computer Systems Using Randomization. Journal of the
ACM 36, 171-193 (1989)
de Souza e Silva, E. et al.: Calculating Transient Distributions of Cumulative Re-
ward. Proceedings of the SIGMETRICS'95 (1995), pp. 231-240
Donatiello, L., Grassi, V.: On Evaluating the Cumulative Performance Distribu-
tion of Fault-tolerant Computer Systems. IEEE Transactions on Computers
40, 1301-1307 (1991)
Duff, I. et al.: Direct Methods for Sparse Matrices. Oxford: Oxford University Press
1986
Dugan, J.: Automated Analysis of Phased-Mission Reliability. IEEE Transactions
on Reliability 40, 45-55 (1991)
Dugan, J.B. et al.: Extended Stochastic Petri Nets: Applications and Analysis. In:
Gelenbe, E. (ed.) : Performance '84. Amsterdam: North-Holland 1984

Dugan, J.B. et al.: The Hybrid Automated Reliability Predictor. AIAA Journal of
Guidance, Control and Dynamics 9, 319-331 (1986)
Fox, B.L., Glynn, P.W.: Computing Poisson Probabilities. Commun. ACM 31,
440-445 (1988)
Gear, C.: Numerical Initial Value Problems in Ordinary Differential Equations.
Englewood Cliffs: Prentice-Hall 1971
Geist, R., Trivedi, K.S.: Reliability Estimation of Fault-Tolerant Systems: Tools
and Techniques. IEEE Computer 23, 52-61 (1990)
German, R., Lindemann, C.: Analysis of Stochastic Petri Nets by the Method of
Supplementary Variables. Performance Evaluation 20, 317-335 (1994)
Golub, G., Loan, C. F.V.: Matrix Computations. Second Edition. Baltimore: Johns
Hopkins University Press 1989
Goyal, A. et al.: Probabilistic Modeling of Computer System Availability. Annals
of Operations Research 8, 285-306 (1987)
Grassmann, W.K.: Means and Variances of Time Averages in Markovian Environ-
ments. European Journal of Operations Research 31, 132-139 (1987)
Grassmann, W.K.: Finding Transient Solutions in Markovian Event Systems
through Randomization. In: Stewart, W.J. (ed.) : Numerical Solution of Markov
Chains. New York: Marcel Dekker 1991
Haverkort, B.R. et al.: DyQNtool - A Performability Modeling Tool Based on the
Dynamic Queuing Network Concept. In: Computer Performance Evaluation:
Modelling Techniques and Tools. Amsterdam (1992), pp. 181-195
Haverkort, B.R., Trivedi, K.S.: Specification Techniques for Markov Reward Models.
Discrete Event Dynamic Systems: Theory and Applications 3, 219-247 (1993)
Howard, R.A.: Dynamic Probabilistic Systems: Semi-Markov and Decision Pro-
cesses. Vol. II. New York: Wiley 1971
Ibe, O.C., Trivedi, K.S.: Stochastic Petri Net Models of Polling Systems. IEEE
Journal on Selected Areas in Communication 8 (1990)
Ibe, O.C. et al.: Stochastic Petri Net Modeling of VAX Cluster System Availability.
In: Proceedings of the International Workshop on Petri Nets and Performance
Models. Los Alamitos: IEEE Computer Society Press 1989, pp. 112-121
Jensen, A.: Markov Chains as an Aid in the Study of Markov Processes. Skand.
Aktuarietidskr. 36, 87-91 (1953)
Johnson, S.C., Butler, R.W.: Automated Generation of Reliability Models. In: Pro-
ceedings of the Annual Reliability and Maintainability Symposium (1988), pp.
17-22
Kantz, H., Trivedi, K.S.: Reliability Modeling of the MARS System: A Case Study
in the Use of Different Tools and Techniques. In: Proceedings of the Fourth
International Workshop on Petri Nets and Performance Models. Los Alamitos:
IEEE Computer Society Press 1991
Keilson, J.: Markov Chain Models: Rarity and Exponentiality. Berlin: Springer 1979
Kim, K., Park, K.: Phased-Mission System Reliability Under Markov Environment.
IEEE Transactions on Reliability 43, 301-309 (1994)
Kulkarni, V.G.: Modeling and Analysis of Stochastic Systems. Chapman and Hall
1995
Kulkarni, V.G. et al.: On Modeling the Performance and Reliability of Multi-Mode
Computer Systems. Journal of System Software 6, 175-182 (1986)
Lambert, J.: Numerical Methods for Ordinary Differential Systems. New York:
Wiley 1991
Lazowska, E.D. et al.: Quantitative System Performance. Englewood Cliffs:
Prentice-Hall 1984

Levy, Y., Wirth, P.E.: A Unifying Approach to Performance and Reliability Objec-
tives. In: Bonatti, M. (ed.): Teletraffic Science for New Cost-Effective Systems,
Networks and Services, ITC-12. Amsterdam: North-Holland 1989, pp. 1173-
1179.
Li, V., Silvester, J.: Performance Analysis of Networks with Unreliable Components.
IEEE Transactions on Commun. COM-32, 1105-1110 (1984)
Logothetis, D. et al.: Markov Regenerative Models. In: Proceedings of the Interna-
tional Computer Performance and Dependability Symposium. Erlangen (1995)
Logothetis, D., Trivedi, K.S.: The Effect of Detection and Restoration Times for
Error Recovery in Communication Networks. In: MILCOM (1995)
Malhotra, M.: A Computationally Efficient Technique for Transient Analysis of
Repairable Markovian Systems. Performance Evaluation. To appear (1996)
Malhotra, M. et al.: Stiffness-Tolerant Methods for Transient Analysis of Stiff
Markov Chains. International Journal on Microelectronics and Reliability 34,
1825-1841 (1994)
Marie, R.A. et al.: Transient Analysis of Acyclic Markov Chains. Performance Eval-
uation 7, 175-194 (1987)
Meyer, J.F.: On Evaluating the Performability of Degradable Computing Systems.
IEEE Transactions on Computers C-29, 720-731 (1980)
Meyer, J.F.: Closed-Form Solutions of Performability. IEEE Transactions on Com-
puters C-31, 648-657 (1982)
Miranker, W.: Numerical Methods for Stiff Equations and Singular Perturbation
Problems. Dordrecht: D. Reidel 1981
Moler, C., Loan, C. F.V.: Nineteen Dubious Ways to Compute the Exponential of
a Matrix. SIAM Review 20, 801-835 (1978)
Muppala, J.K. et al.: Dependability Modeling of a Heterogeneous VAX Cluster
System Using Stochastic Reward Nets. In: Avresky, D.R. (ed.) : Hardware and
Software Fault Tolerance in Parallel Computing Systems. Ellis Horwood Ltd.
1992, pp. 33-59
Peterson, J.L.: Petri Net Theory and the Modeling of Systems. Englewood Cliffs:
Prentice-Hall 1981
Qureshi, M., Sanders, W.: Reward Model Solution Methods with Impulse and Rate
Rewards: An Algorithm and Numerical Results. Performance Evaluation 20,
413-436 (1994)
Ramesh, A.V., Trivedi, K.: Semi-Numerical Transient Analysis of Markov Models.
In: Proceedings of the 33rd ACM Southeast Conference (1995), pp. 13-23
Reibman, A. et al.: Markov and Markov Reward Model Transient Analysis: An
Overview of Numerical Approaches. European Journal of Operations Research
40, 257-267 (1989)
Reibman, A.L., Trivedi, K.S.: Numerical Transient Analysis of Markov Models.
Computers and Operations Research 15, 19-36 (1988)
Reibman, A.L., Trivedi, K.S.: Transient Analysis of Cumulative Measures of Markov
Model Behavior. Stochastic Models 5, 683-710 (1989)
Sahner, R.A. et al.: Performance and Reliability Analysis of Computer Systems:
An Example-Based Approach Using the SHARPE Software Package. Boston:
Kluwer 1995
Sanders, W.H., Meyer, J.F.: METASAN: A Performability Evaluation Tool Based
on Stochastic Activity Networks. In: Proceedings of the ACM-IEEE Computer
Society Fall Joint Computer Conference. Los Alamitos: IEEE Computer Society
Press 1986, pp. 807-816
Smith, R.M. et al.: Performability Analysis: Measures, an Algorithm, and a Case
Study. IEEE Transactions on Computers C-37, 406-417 (1988)

Somani, A. et al.: Computationally-Efficient Phased-Mission Reliability Analysis


for Systems with Variable Configurations. IEEE Transactions on Reliability
41, 504-511 (1992)
Stewart, W., Goyal, A.: Matrix Methods in Large Dependability Models. Tech. Rep.
RC-11485. IBM T.J. Watson Res. Center (1985)
Tardif, H. et al.: Closed-Form Transient Analysis of Markov Chains. Tech. Rep.
CS-1988. Dept. of Computer Science, Duke University (1988)
Tomek, L.A., Trivedi, K.S.: Fixed Point Iteration in Availability Modeling. In: Cin,
M.D., Hohl, W. (eds.) : Proceedings of the 5th International GI/ITG/GMA
Conference on Fault-Tolerant Computing Systems. Berlin: Springer 1991, pp.
229-240.
Trivedi, K.S.: Probability and Statistics with Reliability, Queueing, and Computer
Science Applications. Englewood Cliffs: Prentice-Hall 1982
Trivedi, K.S., Kulkarni, V.G.: FSPNs: Fluid Stochastic Petri Nets. In: Proceedings
of the 14th International Conference on Applications and Theory of Petri Nets
(1993), pp. 24-31
Trivedi, KS. et al.: Should I Add a Processor? In: Proceedings of the 23rd An-
nual Hawaii International Conference on System Sciences. Los Alamitos: IEEE
Computer Society Press 1990, pp. 214-221
Trivedi, KS. et al.: Composite Performance and Dependability Analysis. Perfor-
mance Evaluation 14, 197-215 (1992)
van Dijk, N.: Truncation of Markov Chains with Application to Queuing. Opera-
tions Research 39, 1018-1026 (1991)
van Moorsel, A., Sanders, W.: Adaptive uniformization. Communications in Statis-
tics - Stochastic Models 10, 619-647 (1994)
Wang, W., Trivedi, K.S.: Statistical Guidance for Simulation-Based Coverage Eval-
uation in Safety-Critical Systems. IEEE Transactions on Reliability. To appear
(1995)
Wilkinson, J.H., Reinsch, C.: Handbook for Automatic Computation: Linear Alge-
bra. Vol. II. Berlin: Springer 1971
Bounded Relative Error in Estimating
Transient Measures of Highly Dependable
Non-Markovian Systems *
Philip Heidelberger 1, Perwez Shahabuddin 1 and Victor F. Nicola 2
1 IBM T.J. Watson Research Center, Yorktown Heights, New York 10598, USA
2 Department of Computer Science, University of Twente, 7500 AE Enschede, The
Netherlands

Summary. This paper deals with fast simulation techniques for estimating tran-
sient measures in highly dependable systems. The systems we consider consist of
components with generally distributed lifetimes and repair times, with complex
interaction among components. As is well known, standard simulation of highly
dependable systems is very inefficient and importance sampling is widely used to
improve efficiency. We present two new techniques, one of which is based on the
uniformization approach to simulation, and the other is a natural extension of the
uniformization approach which we call exponential transformation. We show that
under certain assumptions, these techniques have the bounded relative error prop-
erty, i.e., the relative error of the simulation estimate remains bounded as compo-
nents become more and more reliable, unlike standard simulation in which it tends
to infinity. This implies that only a fixed number of observations are required to
achieve a given relative error, no matter how rare the failure events are.

Keywords. Simulation, highly-dependable systems, importance sampling, variance reduction

1. Introduction
Repairable systems with general repair and failure distributions are inher-
ently difficult to handle analytically or numerically, mainly because they do
not fall into the Markov, or semi-Markov, chain framework. HARP (Dugan
et al. 1986) and CARE (Stiffler and Bryant 1982) deal with methods to com-
pute dependability measures in large, but mostly non-repairable, Markovian
and non-Markovian systems. Analytical methods and numerical algorithms
for computing dependability measures of general non-Markovian repairable
systems are virtually non-existent.
An alternative approach is to use Monte Carlo simulation. Standard
Monte Carlo simulation is inefficient for highly dependable systems due to
the rarity of system failure events (Geist and Trivedi 1983). This results in
very long simulation run lengths to achieve a reasonable degree of accuracy.
One technique that is widely used to speed up simulations in highly depend-
able systems is importance sampling.* In importance sampling we change the
probabilistic dynamics of the system for simulation purposes. The new prob-
ability measure induces system failures to occur more frequently. Then we
make adjustments to the sample outputs to obtain an unbiased estimator.
The main problem in applying importance sampling to stochastic systems is
the design and implementation of specific importance sampling distributions
in order to obtain significant variance reductions, which imply significant
speed-ups of the simulation.

* This paper was originally published in ACM Transactions on Modeling and
Computer Simulation 4, 137-164 (1994). © 1994, Association for Computing
Machinery, Inc. (ACM). Reprinted with permission.
Importance sampling, combined with the theory of large deviations, has
also proven effective in estimating buffer overflow probabilities in queueing
networks (see, e.g., Parekh and Walrand 1989, Frater et al. 1991 and Sad-
owsky 1991). An approach, other than importance sampling, for variance
reduction when estimating long-run averages affected by recoveries from rare
failure events is reported in Moorsel et al. (1991). A survey on using impor-
tance sampling to estimate rare event probabilities in queueing and reliability
models is given in Heidelberger (1995), and a survey on fast simulation of rare
events in reliability models is given in Nicola et al. (1993).
A considerable amount of work has been done in using importance sam-
pling for the fast simulation of highly dependable systems that consist of
highly reliable components with exponentially distributed failure and repair
times. In this case, the system is modeled as a continuous time Markov chain
(CTMC) with transitions of two types - component failures and component
repairs. Certain combinations of failed components cause the system to fail.
Typically, in the embedded Markov chain, component failure transitions hap-
pen with a much lower probability as compared to the component repair
transitions. The new importance sampling distribution is chosen in such a
way that component failure transitions occur with a much higher probability
than in the original system. This is called failure biasing and was introduced
in Lewis and Bohm (1984) in the context of reliability estimation. In Goyal
et al. (1992), it was further adapted to the estimation of steady state unavail-
ability, mean time to failure, and expected interval availability. Modifications
to the failure biasing heuristic were introduced in Shahabuddin (1990), Goyal
et al. (1992) and Shahabuddin (1994a) (balanced failure biasing), Carrasco
(1991a) (failure distance-based failure biasing) and Juneja and Shahabud-
din (1992) (failure biasing for Markovian systems with more general repair
policies).
In the estimation of transient measures in Markovian systems, besides
increasing the component failure transition probabilities of the embedded
Markov chain, we also have to increase the rates of transition in certain states
of the CTMC (that have very low transition rates), so that a sufficient number
of transitions happen in the given time horizon. For example, a technique
called forcing (Lewis and Bohm 1984, Goyal et al. 1992) causes the first
component failure time to occur within the time horizon, thus increasing the
probability of a system failure occurring during that time. Failure biasing,
in conjunction with forcing gives good results for time horizons that are
small. However these techniques fail to work for larger time horizons. For
such cases, a method based on estimating Laplace transform functions is
studied in Carrasco (1991b) and another one based on estimating bounds to
the transient measure (rather than estimating the actual measure) is studied
in Shahabuddin (1994b) and Shahabuddin and Nakayama (1993).
Importance sampling has also been used for the fast simulation of highly
dependable systems with general component failure and repair distributions,
where the components are highly reliable. In Nicola et al. (1990), ideas for
accelerating component failure events using importance sampling have been
combined with a clock rescheduling approach to devise a technique for fast
simulation. Analogous to the Markovian case, for transient measures, the fail-
ure acceleration combines two approaches: forcing and failure biasing. The
technique seems to work well in practice and gives orders of magnitude of vari-
ance reduction. Another importance sampling approach, using different forms
of forcing and failure biasing, to estimate unreliability in semi-Markov mod-
els of highly reliable systems is described in Geist and Smotherman (1989).
Their approach also extends to certain models with global time dependency.
Theoretical work in the area of importance sampling for highly depend-
able systems was started in Shahabuddin (1994a). In this paper, a large class
of highly dependable Markovian systems (which includes systems of the type
in Goyal and Lavenberg 1987) was modeled and it was shown that for the
case of estimating steady state measures, the modification of the failure bi-
asing technique called balanced failure biasing has a desirable property of
bounded relative error. This implies that the simulation run-length for a de-
sired relative error remains bounded as component failure rates tend to zero.
This is in contrast to naive simulation in which the simulation run length
for a desired relative error tends to infinity as component failure rates tend
to zero. These bounded relative error results were extended to gradient esti-
mation (using balanced failure biasing) in Markovian systems in Nakayama
(1991) and to estimation of transient measures (using balanced failure biasing
and forcing) in Markovian systems in Shahabuddin (1994b) and Shahabud-
din and Nakayama (1993). Additional results on failure biasing for Markovian
systems are given in Nakayama (1993, 1994). However, until now, no tech-
nique has been proved to have the bounded relative error property for the
case of non-Markovian systems.
In this paper, we describe two different approaches to applying impor-
tance sampling for estimating system unreliability in non-Markovian sys-
tems. Then for a large class of highly dependable systems, we prove that
the two techniques have the property of bounded relative error. They also
seem to be easier to implement as compared to the clock rescheduling ap-
proach as they avoid rescheduling failure events and use only the exponential
distribution for failure event generation. The first approach is based on uni-
formization (Jensen 1953, Lewis and Shedler 1979, Shanthikumar 1986) and
the second uses a technique which we call exponential transformation. In
both approaches, importance sampling is used to accelerate the component
failure events. In the first approach the component failure events are gen-
erated using uniformization in which the effective component failure event
rate is much higher than that in the original system. In the second approach
the time to the next component failure is sampled from the exponential dis-
tribution with a rate that is much higher than the total failure hazard rate
of the components. Experiments with these techniques give orders of magni-
tude of variance reduction. A preliminary version of this work stating some
of the main theoretical results, along with some experimental results, has
been reported in Nicola et al. (1993).
In Section 2, we describe our mathematical model of highly dependable
systems that consist of components with general repair and failure time dis-
tributions. A description of the method of uniformization and how we use it
for importance sampling is also given in Section 2. In Section 3 we discuss
the case where both the failure and repair distributions can be uniformized.
The property of bounded relative error using this technique is also proved in
Section 3. However not all distributions are amenable to the technique of uni-
formization. In Section 4, we discuss the bounded relative error property of
a technique in which we use uniformization only for the failure distributions.
In Section 5 we give a detailed description of the exponential transforma-
tion method and prove that the property of bounded relative error holds for
this method too. Experimental results to illustrate the effectiveness of the
proposed importance sampling techniques are given in Section 6. (Additional
experimental results are reported in Nicola et al. 1992 and Heidelberger et
al. 1992.) Finally, in Section 7, we give conclusions and some directions for
future research.

2. Highly Dependable Systems, Importance Sampling
and Uniformization

The class of models that will concern us are essentially those that can be con-
structed using the SAVE (System Availability Estimator) modeling language
(see Goyal and Lavenberg 1987), except that general failure time and repair
time distributions will be allowed. However, in this paper we will consider
models that can be constructed using only a subset of the SAVE modeling
language. More specifically, we will consider models in which components
can be in one of two states: operational and failed. The SAVE modeling lan-
guage permits components to be in two additional states: spare and dormant.
(In SAVE, a component becomes dormant if its operation depends upon the
operation of some other component and that other component fails. For ex-
ample, a processor may not be operational unless its power supply is also
operational, and if the power supply fails, the processor is then considered
dormant. Different failure rates may be specified for the operational, spare
and dormant states.) While the use of these additional states can be handled
within our framework, the notation becomes more complex and so will not
be considered in this paper.
We assume that there are N components which can fail and be repaired.
Let G_i(x) denote the failure distribution of component i, and let h_i(x) be the
hazard rate (see Barlow and Proschan 1981) associated with this distribution:
h_i(x) = g_i(x)/Ḡ_i(x), where g_i(x) is the probability density function of G_i(x)
and Ḡ_i(x) = 1 - G_i(x). We will assume that g_i(x) > 0 for all x > 0. A
component can fail in several failure modes, each mode occurring with a
certain probability. Let Pij be the probability of component i failing in mode
j, given that it fails. When component i fails in mode j, with probability Pijk
it can instantaneously "affect" a subset Sijk of other components, causing
them to fail as well. This is called failure propagation. A component may
have different repair time distributions in different failure modes. However,
for the sake of notational simplicity, we will assume that all modes have the
same repair time distribution. Let r_i(x) denote the hazard rate associated
with the repair time distribution of the i-th component. There is a set of
repairmen who repair failed components according to some fairly arbitrary
priority mechanism. For the purposes of this paper, details of the repair
processes are not crucial, and so will not be described in detail. However, we
allow general repair distributions and use of the SAVE "repair depends upon"
construct which permits modeling situations in which a component cannot
be repaired unless some other specified set of components is operational. We
do assume that no repairs are instantaneous. More specific conditions will be
given in Sections 3 and 4.
Another assumption (property) is that the system is composed of highly
reliable components, so that the component failure rates are much smaller
than the repair rates. To make this precise, we assume that the component
mean repair times are of order one, and there exists a small (but positive)
parameter ε such that

    h_i(x) ≤ λ_i ε^{b_i}                                    (2.1)

for all x ≥ 0, where the λ_i's and b_i's are positive constants with b_i ≥ 1. We
also assume that the r_i(x)'s are constants, i.e., independent of ε. Finally, we
assume that the failure mode probabilities (the Pij's) and the failure propaga-
tion probabilities (the Pijk's) are also constants, though this assumption is not
essential. Inequality 2.1, which bounds the failure rates in terms of ε, is the
natural generalization of the assumption in Shahabuddin (1994a) that, with
exponential distributions, the component failure rates are given by λ_i ε^{b_i}. We
will consider the limiting behavior of the unreliability estimates as ε → 0,
i.e., as components become more reliable. In Section 3, we will consider the
case where

    r_i(x) ≤ μ_i                                            (2.2)

for all x ≥ 0, where the μ_i's are positive constants. In Sections 4 and 5 we will
remove that assumption. The bounded hazard rate implicit in Inequality 2.1
and Inequality 2.2 is satisfied for many distributions, including hyperexpo-
nential, Erlang, Weibull with an increasing failure rate (over a finite time
horizon), and more general Markovian phase type distributions. However, it
is not satisfied for the Weibull distribution with a decreasing failure rate.
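The boundedness required by Inequalities 2.1 and 2.2 is easy to check numerically for a specific distribution. A minimal sketch, with illustrative Weibull parameters (not taken from the paper):

```python
def weibull_hazard(x, shape, scale):
    """Closed-form Weibull hazard rate h(x) = g(x) / (1 - G(x))."""
    return (shape / scale) * (x / scale) ** (shape - 1)

t = 10.0  # finite time horizon
xs = [i * t / 1000 for i in range(1, 1001)]

# shape > 1 (increasing failure rate): h is bounded on (0, t] by h(t)
h_inc = [weibull_hazard(x, 2.0, 100.0) for x in xs]
assert max(h_inc) <= weibull_hazard(t, 2.0, 100.0)

# shape < 1 (decreasing failure rate): h(x) grows without bound as x -> 0,
# so no finite bound of the form required by Inequality 2.1 exists
h_dec = [weibull_hazard(x, 0.5, 100.0) for x in xs]
assert h_dec[0] > 10 * h_dec[-1]
```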
Let X_i(s) = 1 if component i is operational at time s, let X_i(s) = 0 if
component i is failed at time s, and let X(s) = (X_1(s), ..., X_N(s)). Let A(s)
denote the set of components in the operational state at time s. In the gen-
eralized semi-Markov process (GSMP) setting (see Glynn 1989 and Nicola et
al. 1990), we think of a "clock" as being associated with a component's fail-
ure time. If i ∈ A(s), then let a_i(s) denote the "age" of component i's failure
clock at time s; this is the time since the component last became operational.
We assume that all components are operational at time 0 (X_i(0) = 1 for all
i) and that all components are "new" at time 0 (a_i(0) = 0 for all i). Let λ_i(s)
denote the failure rate of component i at time s, i.e., λ_i(s) = h_i(a_i(s)) for all
i ∈ A(s) and it is 0 otherwise. Similarly, let B(s) be the set of components
that are being repaired at time s and for each i ∈ B(s) let b_i(s) denote the
age of the repair process. Then the repair rate of component i at time s is
given by μ_i(s) = r_i(b_i(s)) for all i ∈ B(s) and it is 0 otherwise.
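The clock-age bookkeeping above can be sketched as a small state record; the Weibull hazard and the two-component setup are illustrative, not the paper's:

```python
def weibull_hazard(x, shape=2.0, scale=100.0):
    # h(x) = (shape/scale) * (x/scale)**(shape-1), an increasing failure rate
    return (shape / scale) * (x / scale) ** (shape - 1)

class ComponentState:
    """Tracks X_i(s), the failure-clock age a_i(s) and the repair age b_i(s)."""
    def __init__(self):
        self.operational = True   # X_i(0) = 1
        self.age = 0.0            # a_i(0) = 0: the component is "new"
        self.repair_age = 0.0     # b_i(s), meaningful only while failed

    def failure_rate(self, hazard=weibull_hazard):
        # lambda_i(s) = h_i(a_i(s)) if i is in A(s), and 0 otherwise
        return hazard(self.age) if self.operational else 0.0

    def repair_rate(self, rate=1.0):
        # mu_i(s) = r_i(b_i(s)); here r_i is constant, as assumed in Section 2
        return rate if not self.operational else 0.0

comps = [ComponentState() for _ in range(2)]
for c in comps:
    c.age = 5.0                   # advance both failure clocks to s = 5
total_failure_rate = sum(c.failure_rate() for c in comps)   # lambda_F(5)
```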
We assume that there is a set of system configurations F such that the
system is considered to be failed at time s if X(s) ∈ F. Let T_F be the first
hitting time of F, i.e., T_F is the time to first failure. We shall be interested
in estimating the unreliability, which is defined to be

    γ(ε, t) = P_{G(ε)}(T_F ≤ t)                              (2.3)

where t is the time horizon and the subscript G(ε) denotes a system in which
the distributions of the component failure times are given by hazard rate
functions satisfying Inequality 2.1. For small ε and fixed t, γ(ε, t) ≈ 0, i.e.,
the event {T_F ≤ t} is a rare event. In fact, we show in this paper that γ(ε, t)
is Θ(ε^r) for some r > 0 (a function f(ε) is Θ(ε^r) if there exist two constants
K_1 and K_2 such that K_1 ε^r ≤ f(ε) ≤ K_2 ε^r, for all sufficiently small ε > 0)
and hence γ(ε, t) → 0 as ε → 0. Now consider the problem of estimating
γ(ε, t) = E_{G(ε)}(I(T_F ≤ t)) where I(·) is the indicator function. In standard
(naive) simulation we generate n independent replications from time 0 to time
min(T_F, t) to obtain samples of I(T_F ≤ t), say I_1, I_2, ..., I_n. Then Σ_{i=1}^n I_i/n
is an unbiased estimator of γ(ε, t). The variance of this estimator is given
by σ²_{G(ε)}(I(T_F ≤ t))/n. Note that σ²_{G(ε)}(I(T_F ≤ t)) = γ(ε, t) - γ²(ε, t) is
also Θ(ε^r). Thus, for a fixed n, the relative error (which is proportional to
σ_{G(ε)}(I(T_F ≤ t))/(√n γ(ε, t))) goes to ∞ as ε → 0. This is the main problem
in standard simulation of highly dependable systems. Importance sampling
is a well known technique to overcome this inherent difficulty. We illustrate
its basic idea by means of a simple example. (For a detailed discussion of the
concept see, for example, Hammersley and Handscomb 1964 and Glynn and
Iglehart 1989.) Let f(·) be a probability density function (pdf) on the real
line and let A be a set on the real line which is rare with respect to f(·).
Suppose we wish to estimate E_f(I(X ∈ A)) where X is a real valued random
variable and the subscript in the expectation denotes the pdf from which X
is sampled. Then we can express

    E_f(I(X ∈ A)) = ∫_{-∞}^{∞} I(x ∈ A) f(x) dx
                  = ∫_{-∞}^{∞} I(x ∈ A) [f(x)/g(x)] g(x) dx
                  = E_g(I(X ∈ A) L(X))                       (2.4)

where g(·) is another pdf (with the property that g(x) > 0 whenever I(x ∈
A)f(x) > 0), and L(x) = f(x)/g(x) is the likelihood ratio. Hence we can
generate samples X_1, X_2, ..., X_n of X using g(·), and get an unbiased estimate
of E_f(I(X ∈ A)) given by Σ_{i=1}^n I(X_i ∈ A)L(X_i)/n. How fast this estimate
converges depends on the variance σ²_g(I(X ∈ A)L(X)). Theoretically, there
exists a zero variance estimator, but it requires knowledge of the quantity we
are trying to estimate. The main task in importance sampling is finding an
easily implemented g(·) such that

    E_g(I(X ∈ A)L²(X)) = E_f(I(X ∈ A)L(X)) ≪ E_f(I(X ∈ A)).  (2.5)

This implies that the variance of the importance sampling estimate is signif-
icantly less than the original one, and therefore the new estimate converges
much faster. Notice from the above equation that one way of obtaining vari-
ance reduction is to select a g(·) such that g(x) ≫ f(x) for x ∈ A, i.e., make
{X ∈ A} more likely to occur. In the context of unreliability estimation, the
rare set of sample paths where {T_F ≤ t} is analogous to the rare set A in the
above example.
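The one-dimensional setting of equations (2.4)-(2.5) is simple enough to run directly; the exponential densities and the threshold below are illustrative choices, not from the paper:

```python
import math
import random

random.seed(1)

a = 10.0        # rare set A = {x : x > a}
rate_f = 1.0    # original pdf f: Exponential(1), so E_f(I(X in A)) = exp(-10)
rate_g = 0.1    # sampling pdf g: Exponential(0.1) makes the set A far more likely
n = 100_000

def L(x):
    # likelihood ratio L(x) = f(x)/g(x) for the two exponential densities
    return (rate_f / rate_g) * math.exp(-(rate_f - rate_g) * x)

samples = [random.expovariate(rate_g) for _ in range(n)]
est = sum(L(x) for x in samples if x > a) / n
# est should be close to exp(-10), about 4.5e-5; sampling from f directly
# would see {X > a} only a handful of times in n = 100,000 draws
```

Because g puts substantial mass on A, the likelihood-weighted hits are frequent and individually small, which is exactly what drives the variance reduction in (2.5).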
In the following sections we will describe two implementations of impor-
tance sampling, which, for a large class of systems, can be shown to yield
orders of magnitude increases in simulation efficiency over standard simula-
tion. In particular, they yield estimates in which (unlike standard simulation)
the relative error remains bounded as ε → 0. This implies that only a fixed
number of observations are required to achieve a given relative error no mat-
ter how rare system failure events are. To show bounded relative error, we
prove that the ratio of the standard deviation of the importance sampling es-
timate (which is a function of ε, as the likelihood ratio random variable is a
function of ε) to γ(ε, t) remains bounded as ε → 0.
Uniformization is a simple technique for sampling (i.e., simulating) the
event times of certain stochastic processes including nonhomogeneous Poisson
processes, renewal processes, or Markovian processes in continuous time on
either discrete or continuous state spaces (see Fox and Glynn 1990, Gross
and Miller 1984, Jensen 1953, Lewis and Shedler 1979, Shanthikumar 1986,
and Van-Dijk 1990). We describe it in the case of a nonhomogeneous Poisson
process {N(t)} with intensity function θ(t). Assume that θ(t) ≤ β for all
t ≥ 0 for some finite constant β. Let T_n denote the time of the nth event in
a time homogeneous Poisson process {N_β(t)} with a constant rate β. Then
the event times of {N(t)} can be sampled by thinning the {N_β(t)} process as
follows: for each n ≥ 1, we include (accept) T_n as an event time in {N(t)} with
probability θ(T_n)/β, otherwise the point is not included (rejected). Rejected
events are sometimes called pseudo events. (Throughout we will assume that
all rates are left continuous, i.e., θ(t) = θ(t⁻). Thus if an event occurs at
some random time T, then θ(T) is the event rate just prior to time T.)
Renewal processes can be simulated using uniformization as described above
provided θ(t) is the hazard rate of the inter-event time distribution at time
t. Uniformization can be generalized to cases in which the process being
thinned is not a time homogeneous Poisson process (see Lewis and Shedler
1979). For example, at time T_{n-1}, we can let T_n = T_{n-1} + E_n where E_n has
an exponential distribution with rate β_n. The point T_n is then accepted with
probability θ(T_n)/β_n. This requires only that θ(t) ≤ β_n for all t ≥ T_{n-1}.
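The thinning recipe just described can be sketched in a few lines; the sinusoidal intensity θ(t) = 1 + sin t and the bound β = 2 are illustrative choices:

```python
import math
import random

random.seed(2)

beta = 2.0                      # uniformization rate

def theta(t):
    # nonhomogeneous intensity, chosen so that theta(t) <= beta for all t
    return 1.0 + math.sin(t)

def sample_nhpp(horizon):
    """Sample the event times of {N(t)} on (0, horizon] by thinning {N_beta(t)}."""
    events, t = [], 0.0
    while True:
        t += random.expovariate(beta)          # next point T_n of the rate-beta process
        if t > horizon:
            return events
        if random.random() < theta(t) / beta:  # accept T_n w.p. theta(T_n)/beta
            events.append(t)                   # otherwise it is a pseudo event

events = sample_nhpp(1000.0)
# E[N(1000)] equals the integral of theta over (0, 1000), roughly 1000
```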

3. Uniformization of Failures and Repairs


In this section, we consider the use of uniformization and importance sam-
pling for simulating both component failures and repairs. Thus, as in Sec-
tion 2, we assume that both failure and repair rates are bounded as in
equations (2.1) and (2.2). Recall that λ_i(s) denotes the failure rate of com-
ponent i at time s, and μ_i(s) denotes the component i repair rate at time
s. Let λ_F(s) = Σ_{i=1}^N λ_i(s) denote the total failure rate at time s and
μ_R(s) = Σ_{i=1}^N μ_i(s) denote the total repair rate at time s. Then

    e(s) = λ_F(s) + μ_R(s)                                   (3.1)

is the total event rate at time s. We let β be a positive finite constant such
that

    e(s) ≤ β                                                 (3.2)

w.p. (with probability) one for all times s ≤ t. Equation 3.2 ensures that β
is a valid uniformization rate for simulating the system.

Consider a simulation of the system using uniformization at rate β. Let
{N_β(s)} denote a Poisson process with rate β. By equation (3.1), we can view
the system as being the superposition of failure and repair event processes.
In a uniformization-based simulation, there are three kinds of events:
- Failure events: Let N_F(t) denote the total number of failure events (a
component failure causing the instantaneous failure of other components
is treated as one event) in (0, t) and let N_F(i, t) denote the number of com-
ponent i failures in (0, t) (excluding failures caused by failure propagation)
(N_F(t) = Σ_i N_F(i, t)). Let T_ij be the time at which component i fails for
the j'th time (excluding failures caused by failure propagation).
- Repair events: Let N_R(t) denote the total number of repair events in (0, t)
and let N_R(i, t) denote the number of times that component i is repaired
in (0, t) (N_R(t) = Σ_i N_R(i, t)). Let R_ij be the time at which component i
is repaired for the j'th time.
- Pseudo events: Let N_P(t) denote the total number of pseudo events in (0, t)
and let P_j be the time of the j'th pseudo event.
In a uniformization-based simulation, events are obtained by "thinning"
the Poisson process {N_β(t)} as follows. Suppose an event of {N_β(s)} occurs
at time S. Then that event is a

    component i failure w.p. λ_i(S)/β,
    component i repair w.p. μ_i(S)/β,                        (3.3)
    pseudo event w.p. [1 - e(S)/β].

Notice that N_β(t) = N_F(t) + N_R(t) + N_P(t) and that if the upper bound of
Inequality 2.1 is satisfied for all components, then the probability of a failure
event is very low.
Implementing importance sampling within a uniformization framework
simply involves changing the thinning probabilities in equation (3.3). (We
specifically assume that all failure modes and components affected through
failure propagation are sampled from their given distributions.) This, in turn,
is accomplished by using new failure and repair rates, λ'_i(s) and μ'_i(s). In the
new system (i.e., the system simulated using importance sampling), the total
failure rate is λ'_F(s) = Σ_i λ'_i(s), the total repair rate is μ'_R(s) = Σ_i μ'_i(s) and
the total event rate is e'(s) = λ'_F(s) + μ'_R(s). We assume that e'(s) ≤ β w.p.
one for all s ≤ t, so that β is a valid uniformization rate for both the original
and the new systems (and both processes can be simulated by thinning the
same Poisson process {N_β(s)}). In the new system, an event from {N_β(s)}
at time S is a

    component i failure w.p. λ'_i(S)/β,
    component i repair w.p. μ'_i(S)/β,                       (3.4)
    pseudo event w.p. [1 - e'(S)/β].

The likelihood ratio associated with this change of measure is given by a
product of three terms:

    L_U(ε, t) = L_U(F, ε, t) × L_U(R, ε, t) × L_U(P, ε, t)    (3.5)

where the subscript U stands for uniformization and L_U(F, ε, t) is the like-
lihood ratio for failure events, L_U(R, ε, t) is the likelihood ratio for repair
events, and L_U(P, ε, t) is the likelihood ratio for pseudo events. These likeli-
hood ratios have a simple form:

    L_U(F, ε, t) = Π_{i=1}^N Π_{j=1}^{N_F(i,t)} λ_i(T_ij)/λ'_i(T_ij)    (3.6)
    L_U(R, ε, t) = Π_{i=1}^N Π_{j=1}^{N_R(i,t)} μ_i(R_ij)/μ'_i(R_ij)    (3.7)

    L_U(P, ε, t) = Π_{j=1}^{N_P(t)} [β - e(P_j)] / [β - e'(P_j)].       (3.8)

In order for importance sampling to be valid, the new measure must


be nonsingular with respect to the original measure, which, in this case,
translates into the conditions
-XHs) > 0 whenever -X;(s) > 0,
J-l~(s) > 0 whenever J-l;(s) > 0, (3.9)
,B - e'(s) > 0 whenever ,B - e(s) > O.

Whenever likelihood ratios appear in an expectation, it is assumed that the


expectation is with respect to the new measure, i.e., with importance sam-
pling.

3.1 Balanced Failure Biasing

The relationship between uniformization-based importance sampling as de-
scribed above for non-Markovian systems, and balanced failure biasing (with
approximate forcing) for Markovian systems will now be described.
In approximate forcing, when no repairs are ongoing (i.e., when μ_R(s) =
0), the rate at which component failures occur is accelerated so as to make
a component failure more likely to occur in the interval (0, t). This is ac-
complished by choosing a λ'_F(s) that is considerably higher than λ_F(s). The
λ'_i(s) is chosen to be λ'_F(s)/N for all i, i.e., there is equal probability for
any component to be the failing component. This is analogous to the Marko-
vian case, i.e., when balanced failure biasing is applied to the state where all
components are up (in a Markovian system).
Notice that the event probabilities of equation (3.4) can be rewritten as
follows:

    component i failure w.p. λ'_i(S)/β = [e'(S)/β][λ'_F(S)/e'(S)][λ'_i(S)/λ'_F(S)]
    component i repair w.p. μ'_i(S)/β = [e'(S)/β][μ'_R(S)/e'(S)][μ'_i(S)/μ'_R(S)]
    pseudo event w.p. [1 - e'(S)/β].                         (3.10)

According to equation (3.10), we can view the selection of the event
as occurring in multiple steps. For example, to get a component i fail-
ure, we first must have a "real" event (i.e., failure or repair) which occurs
w.p. e'(S)/β. Then, the event must be a failure event which occurs w.p.
p'_F(S) ≡ λ'_F(S)/e'(S), and finally, the event must be a type i failure which
occurs w.p. f'_i(S) ≡ λ'_i(S)/λ'_F(S).
In balanced failure biasing, we make the probability of a failure event
constant, say P_f, whenever repairs are ongoing. Thus, in uniformization, given
that an event is real (and there are ongoing repairs), we fix p'_F(S) = P_f.
Next, in balanced failure biasing, given that an event is a failure, we choose
the failing component uniformly from among the operational components. In
uniformization, this simply corresponds to setting f'_i(S) = 1/|O(S⁻)| (the
number of operational components just before time S).
In balanced failure biasing, if an event is a repair, the relative proba-
bilities of selecting which component gets repaired are unchanged. Thus, in
uniformization, we set

    μ'_i(s)/μ'_R(s) = μ_i(s)/μ_R(s) for all i.               (3.11)

Finally, when repairs are ongoing, balanced failure biasing does not change
the total rate at which events occur. In uniformization, this can be accom-
plished by equalizing the total event rates in the new and original systems,
i.e., by setting

    e'(s) = e(s) whenever μ_R(s) > 0.                        (3.12)

We call the above importance sampling scheme "uniformization based
balanced failure biasing". Observe that a consequence of equation (3.12) is
that L_U(P, ε, t), the pseudo event likelihood ratio, only involves times when
there are no ongoing repairs (i.e., when μ_R(P_j) = μ'_R(P_j) = 0), since the
probability of a pseudo event in both systems is otherwise the same.
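As a heavily simplified illustration of the scheme in equations (3.3)-(3.12), the sketch below simulates a small Markovian special case: identical components with constant hazard rates and a single repairman, so the clock-age bookkeeping of Section 2 disappears. The rates, horizon and biasing parameter P_f are illustrative; forcing is omitted for simplicity, so e'(s) = e(s) throughout and the pseudo-event likelihood ratio (3.8) is identically one.

```python
import random

random.seed(3)

N = 3             # identical components; the system fails when all are down
lam = 1e-3        # per-component failure rate (constant hazard)
mu = 1.0          # repair rate, single repairman
t_horizon = 10.0
beta = N * lam + mu          # uniformization rate: e(s) <= beta always
P_f = 0.5                    # biased P(real event is a failure) when repairs ongoing

def one_replication():
    """One uniformization-based sample of I(T_F < t) * L, per (3.4)-(3.8)."""
    up, t, L = N, 0.0, 1.0
    while True:
        t += random.expovariate(beta)        # next event of {N_beta(s)}
        if t > t_horizon:
            return 0.0                       # no system failure in (0, t)
        lam_F = up * lam                     # total failure rate lambda_F(s)
        mu_R = mu if up < N else 0.0         # total repair rate mu_R(s)
        if mu_R > 0.0:
            lam_F_new = P_f * (lam_F + mu_R)   # failure biasing, e'(s) = e(s)
        else:
            lam_F_new = lam_F                  # all components up: unchanged
        mu_R_new = lam_F + mu_R - lam_F_new
        u = random.random() * beta
        if u < lam_F_new:                    # failure event; components are
            L *= lam_F / lam_F_new           # identical, so the biasing is balanced
            up -= 1
            if up == 0:
                return L                     # T_F < t: contribute I * L
        elif u < lam_F_new + mu_R_new:       # repair event
            L *= mu_R / mu_R_new
            up += 1
        # else: pseudo event; since e'(s) = e(s), its likelihood factor is one

n = 20000
gamma_hat = sum(one_replication() for _ in range(n)) / n
# gamma_hat estimates the unreliability; for these rates it is of order 1e-7
```

With forcing added, e'(s) would exceed e(s) while all components are up, and the pseudo-event factors [β - e(P_j)]/[β - e'(P_j)] of (3.8) would then enter the product.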

3.2 Asymptotic Bounds for the Unreliability

In this section, we derive asymptotic order of magnitude bounds on γ(ε, t) as
ε → 0. These results generalize those of Gertsbakh (1984) which were derived
for steady-state measures in simpler systems with exponential failure times
and generally distributed repair times. Similar results have also been obtained
for certain steady-state performance measures of Markovian systems in Sha-
habuddin (1994a), for the derivatives of steady-state measures of Markovian
systems in Nakayama (1991), and for transient measures of Markovian sys-
tems in Shahabuddin (1994b) and Shahabuddin and Nakayama (1993).

Theorem 3.1. Suppose there exist positive finite constants λ̲_i, λ̄_i, μ̲ and μ̄
such that λ̲_i ε^{b_i} ≤ h_i(x) ≤ λ̄_i ε^{b_i} and μ̲ ≤ r_i(x) ≤ μ̄ for all i and 0 ≤ x ≤ t.
Then there exist positive finite constants r, a(t) and b(t) such that, as ε → 0,

    a(t) ε^r ≤ γ(ε, t) ≤ b(t) ε^r.                           (3.13)
498 Philip Heidelberger et al.

Proof. To prove the lower bound, we will exhibit a set of sample
paths whose probability is appropriately bounded from below. Consider
the set of sample paths for which T_F < t. For any such sample path,
Σ_{i=1}^N N_F(i, t) b_i > 0. This sum represents a "distance" (in terms of orders
of magnitude of ε) of the sample path to the failure set F. Let r represent
the minimum such distance over all sample paths in {T_F < t}. Corresponding
to this minimum distance is a set of components, say components i_1, …, i_K,
such that N_F(i, t) > 0 if i ∈ {i_1, …, i_K} and N_F(i, t) = 0 otherwise. This
set of components need not be unique. Also, some components may get re-
paired and fail more than once along such a minimum distance path because
of the presence of failure propagation. (A simple example of this will be given
later in this section.) Now consider such a minimum distance path which con-
sists of a given (ordered) sequence of N_F failures (excluding those which fail
through failure propagation), N_R repairs, along with corresponding failure
modes and components affected (through failure propagation) at each fail-
ure. In this path, let N_F(i) be the number of times component i fails on its
own (not through failure propagation). In a uniformization-based simulation
of the system (without using importance sampling), such a sample path is
generated when N_β(t) = N_F + N_R, and each of the Poisson events is selected
to be the corresponding event in the minimum distance path. The probability
of such a sample path is at least

P(N_β(t) = N_F + N_R) · a_m a_f · ∏_{i=1}^N (λ̲_i ε^{b_i}/β)^{N_F(i)} · (μ̲/β)^{N_R}   (3.14)
where a_m and a_f are the products of the failure mode and failure propaga-
tion probabilities. The explanation for each term in equation (3.14) is quite
evident. For example, if a Poisson event occurs at any time s in (0, t), then
the probability that it is a type i failure is λ_i(s)/β ≥ λ̲_i ε^{b_i}/β. Since
Σ_{i=1}^N N_F(i) b_i = r, this proves the lower bound for γ(ε, t).
The upper bound will be shown by deriving an upper bound on the likeli-
hood ratio L_U(ε, t) when the system is sampled using an importance sampling
distribution satisfying certain properties. Specifically, we assume that condi-
tion (3.9) is satisfied and that there exist positive finite constants λ̲', λ̄', μ̲', μ̄'
and β' such that

λ̲' ≤ λ'_i(s) ≤ λ̄' whenever λ_i(s) > 0,
μ̲' ≤ μ'_i(s) ≤ μ̄' whenever μ_i(s) > 0,   (3.15)
β' ≤ β − e'(s) whenever β − e(s) > 0.

We assume that sampling is stopped at time τ = min(t, T_F). Now for any
sample path such that T_F ≤ t, Σ_{i=1}^N N_F(i, t) b_i ≥ r, and therefore the failure
event likelihood ratio, L_U(F, ε, τ), satisfies
Bounded Relative Error in Estimating Transient Measures 499

L_U(F, ε, τ) ≤ ∏_{i=1}^N ∏_{j=1}^{N_F(i,τ)} (λ̄_i ε^{b_i} / λ̲') ≤ c_F^{N_F(τ)} ε^{Σ_{i=1}^N N_F(i,τ) b_i} ≤ c_F^{N_F(τ)} ε^r   (3.16)

where c_F = max_i{λ̄_i}/λ̲'. Similarly, L_U(R, ε, τ) ≤ c_R^{N_R(τ)} and L_U(P, ε, τ) ≤
c_P^{N_P(τ)}, where c_R = μ̄/μ̲' and c_P = β/β'. Since N_F(t) + N_R(t) + N_P(t) = N_β(t),
by the above bounds on the likelihood ratios and by equation (3.5),

L_U(ε, τ) ≤ c_F^{N_F(τ)} c_R^{N_R(τ)} c_P^{N_P(τ)} ε^r ≤ c_1^{N_β(t)} ε^r   (3.17)

where c_1 = max{c_F, c_R, c_P} ≥ 1. Recall that γ(ε, t) = E[1_{T_F ≤ t}] =
E'_U[L_U(ε, τ) 1_{T_F ≤ t}], where E'_U denotes expectation under the importance sam-
pling distribution described above. Therefore, by inequality (3.17),

γ(ε, t) ≤ E'_U[c_1^{N_β(t)} ε^r 1_{T_F ≤ t}] ≤ E'_U[c_1^{N_β(t)}] ε^r = e^{βt(c_1 − 1)} ε^r ≡ b(t) ε^r,   (3.18)

thereby completing the proof of the upper bound. □
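The explicit form of b(t) in (3.18) uses the fact that N_β(t) has the same Poisson(βt) distribution under the importance sampling measure (only the classification of events changes, not the underlying Poisson process), so its generating function gives:

```latex
\mathrm{E}'_U\bigl[c_1^{\,N_\beta(t)}\bigr]
  \;=\; \sum_{k=0}^{\infty} c_1^{\,k}\, e^{-\beta t}\,\frac{(\beta t)^{k}}{k!}
  \;=\; e^{-\beta t}\, e^{c_1 \beta t}
  \;=\; e^{\beta t (c_1 - 1)} \;=\; b(t).
```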

We conclude this section by giving a simple example of a system in which
the minimum distance path includes repairs. Consider a three-component
system such that b_1 = 1 and b_2 = b_3 = 2, i.e., component 1 fails at rate ε,
while components 2 and 3 fail at rate ε². When component 1 fails, it can fail
in one of two modes, each w.p. 0.5. Component 2 is failed through failure
propagation in the first failure mode and component 3 in the second failure
mode. The system is considered failed if both components 2 and 3 are failed.
The path where component 1 fails in mode one, gets repaired, and then fails
in mode two has probability of order ε² (i.e., N_F(1, t) b_1 = 2), since two
component 1 failures are required (each occurring w.p. of order ε). Any other
failure path (other than the one in which component 1 fails in mode two, gets
repaired and then fails in mode one) has Σ_i N_F(i, t) b_i > 2 and therefore has
much smaller probability.
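The order-of-magnitude bookkeeping in this example is easy to mechanize. The sketch below (ours, for illustration) computes the "distance" Σ_i N_F(i, t) b_i of a candidate path, where only failures occurring on a component's own clock are counted:

```python
from collections import Counter

def path_distance(own_failures, b):
    """Sum of b_i over the own-clock failures in a path; failures caused
    by failure propagation contribute nothing."""
    counts = Counter(own_failures)
    return sum(n * b[i] for i, n in counts.items())

b = {1: 1, 2: 2, 3: 2}
# Component 1 fails in mode one (propagating to 2), is repaired, then
# fails in mode two (propagating to 3): two own failures of component 1.
d_repair_path = path_distance([1, 1], b)   # 2, so probability of order eps^2
# Components 2 and 3 failing on their own clocks instead:
d_direct_path = path_distance([2, 3], b)   # 4, so probability of order eps^4
```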

3.3 Bounded Relative Error Using Uniformization

In this section, we derive an upper bound on the variance of the estimator
that uses the importance sampling distribution described in the proof of
Theorem 3.1. This upper bound and the lower bound of Theorem 3.1 together
imply that the importance sampling estimator enjoys the bounded relative
error property. Let γ(U, ε, t) = L_U(ε, τ) 1_{T_F ≤ t}. Since E'_U[γ(U, ε, t)] = γ(ε, t),
σ²[γ(U, ε, t)] = E'_U[γ(U, ε, t)²] − γ(ε, t)². The relative error of the estimator
is proportional to σ[γ(U, ε, t)]/γ(ε, t) = √(E'_U[γ(U, ε, t)²]/γ(ε, t)² − 1). Since
Theorem 3.1 provides a lower bound on γ(ε, t), showing bounded relative
error involves obtaining an appropriate upper bound on E'_U[γ(U, ε, t)²]. Such
a bound is derived in the following theorem:

Theorem 3.2. Suppose h_i(x) ≤ λ̄_i ε^{b_i} and r_i(x) ≤ μ̄ for all i and 0 ≤ x ≤ t, and
e(s) ≤ β w.p. one for all 0 ≤ s ≤ t. If the importance sampling distribution
satisfies e'(s) ≤ β w.p. one for all 0 ≤ s ≤ t, and equations (3.9) and (3.15),
then there exists a positive finite constant c(t) such that, as ε → 0,

E'_U[γ(U, ε, t)²] ≤ c(t)² ε^{2r}.   (3.19)

If, in addition, h_i(x) ≥ λ̲_i ε^{b_i} and r_i(x) ≥ μ̲ for all i and 0 ≤ x ≤ t, then

lim_{ε→0} σ[γ(U, ε, t)]/γ(ε, t) ≤ c(t)/a(t) < ∞.   (3.20)
Proof. In the proof of Theorem 3.1, an upper bound on γ(U, ε, t) = L_U(ε, τ)
1_{T_F ≤ t} is given in equation (3.17). Using this bound, we obtain

E'_U[γ(U, ε, t)²] = E'_U[L_U(ε, τ)² 1_{T_F ≤ t}] ≤ E'_U[c_1^{2N_β(t)}] ε^{2r} ≡ c(t)² ε^{2r},   (3.21)

thereby proving the first part of the theorem. The second part follows imme-
diately by combining this result with the lower bound for γ(ε, t) in Theorem
3.1. □
Note that the upper bound c(t) increases exponentially as t increases.
This is consistent with results for general CTMCs in Glynn (1992), and more
specific results for highly dependable Markovian systems in Shahabuddin
(1994b) and Shahabuddin and Nakayama (1993). It implies that this im-
portance sampling approach will only be effective when t is "not too big"
(relative to ε).
The results of this section show that uniformization-based importance
sampling is provably effective when all failure and repair rates are made to be
of the same order of magnitude, as made precise in equation (3.15). Clearly, the
generalization of balanced failure biasing with approximate forcing described
in Section 3.1 satisfies these conditions.
Theorem 3.2 remains valid under more general uniformization schemes.
Since uniformization (thinning) is a valid simulation technique when non-
homogeneous Poisson processes are thinned (see Lewis and Shedler 1979),
one can think of thinning a non-homogeneous Poisson process with rate β(s).
A careful examination of the proofs of Theorems 3.1 and 3.2 shows
that they remain valid provided there exist positive finite constants β̲ and
β̄ such that β̲ ≤ β(s) ≤ β̄, e(s) ≤ β(s), and e'(s) ≤ β(s) w.p. one for 0 ≤ s ≤ t,
and the rest of the conditions of the theorems are satisfied with the obvious
modifications (e.g., the third part of equation (3.15) becomes β' ≤ β(s) − e'(s)
whenever β(s) − e(s) > 0). This generalization permits quite a bit of flexibility
in the implementation of the importance sampling distribution. Most notably,
piecewise constant uniformization rates can be used, with the rates changing
at event times. This permits different uniformization rates for approximate
forcing (when μ_R(s) = 0) and for failure biasing (when μ_R(s) > 0). It also
permits the rate to change so as to make uniformization more efficient by
reducing the probability of pseudo events.
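The thinning construction of Lewis and Shedler (1979) invoked above can be sketched as follows; the function and parameter names are ours, and the intensity is assumed to satisfy rate_fn(s) ≤ beta_max on (0, t_end]:

```python
import math
import random

def thin_nhpp(rate_fn, beta_max, t_end, rng=None):
    """Sample the event times of a non-homogeneous Poisson process with
    intensity rate_fn(s) <= beta_max on (0, t_end], by thinning a
    homogeneous Poisson process of rate beta_max."""
    rng = rng or random.Random(1)
    times, s = [], 0.0
    while True:
        s += rng.expovariate(beta_max)        # candidate (possibly pseudo) event
        if s > t_end:
            return times
        if rng.random() < rate_fn(s) / beta_max:
            times.append(s)                   # accepted as a real event

# Illustration with a bounded time-varying intensity 2 + sin(s) <= 3:
events = thin_nhpp(lambda s: 2.0 + math.sin(s), beta_max=3.0, t_end=100.0)
```

The expected number of events is the integral of the intensity over (0, 100], roughly 200 here.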

4. Uniformization of Failures Only


As described earlier, a number of distributions cannot be directly uniformized
(although see Shanthikumar 1986 for some extensions). These include con-
stant distributions, discrete distributions, and distributions concentrated on
a finite interval, such as the uniform distribution. Since such distributions
may better represent repair distributions, the assumption that repair distri-
butions can be uniformized is both undesirable and overly restrictive. In this
section, we describe an approach that samples repairs from their natural dis-
tributions while using uniformization-based importance sampling for failure
events.
We again let {N_β(s)} be a Poisson process with rate β. This process is
used only for sampling failure events. This requires that λ_F(s) ≤ β w.p. one
for all 0 ≤ s ≤ t. An event of {N_β(s)} that occurs at time s is a

component i failure w.p. λ_i(s)/β,
pseudo event w.p. [1 − λ_F(s)/β].   (4.1)

Similarly, under importance sampling, an event at time s is a

component i failure w.p. λ'_i(s)/β,
pseudo event w.p. [1 − λ'_F(s)/β].   (4.2)

Since repairs are sampled from their given distributions, the likelihood ratio
does not contain any repair event terms. Similar to Section 3, the likelihood
ratio takes on a simple form:

L_Ū(ε, t) = L_Ū(F, ε, t) × L_Ū(P, ε, t)   (4.3)

where the subscript Ū stands for uniformization of only failure events and

L_Ū(F, ε, t) = ∏_{i=1}^N ∏_{j=1}^{N_F(i,t)} λ_i(T_{ij}) / λ'_i(T_{ij}),   (4.4)

L_Ū(P, ε, t) = ∏_{j=1}^{N_P(t)} [β − λ_F(P_j)] / [β − λ'_F(P_j)],   (4.5)

where T_{ij} denotes the time of the j-th failure of component i and P_j denotes
the time of the j-th pseudo event.
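In an implementation, the likelihood ratio (4.3)-(4.5) is accumulated multiplicatively as each event of the failure-sampling Poisson process is classified; repair events contribute no factor. A minimal sketch (function and argument names are ours):

```python
def lr_factor(event, beta):
    """Likelihood-ratio factor contributed by one event, per eqs. (4.4)-(4.5).

    event = ("failure", lam_i, lam_i_prime)  original / IS rate of the failing
                                             component at the event time
    event = ("pseudo", lam_F, lam_F_prime)   original / IS total failure rate
    event = ("repair",)                      repairs keep their own distributions
    """
    kind = event[0]
    if kind == "failure":
        _, lam_i, lam_i_prime = event
        return lam_i / lam_i_prime
    if kind == "pseudo":
        _, lam_F, lam_F_prime = event
        return (beta - lam_F) / (beta - lam_F_prime)
    return 1.0  # repair event: no factor

beta = 1.0
L = 1.0
for ev in [("failure", 1e-4, 0.25), ("repair",), ("pseudo", 2e-4, 0.5)]:
    L *= lr_factor(ev, beta)
```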

We now wish to derive conditions under which the above importance
sampling approach results in bounded relative error. Such a proof requires a
lower bound on γ(ε, t) as in Theorem 3.1, but this theorem must be derived
under conditions that do not require bounds on repair rates (from either
above or below). Unfortunately, simple conditions for such a lower bound
seem to require more specific knowledge about the structure of the system
in terms of failure propagation, queueing disciplines at the repair facilities,

etc. This complication arises from the possibility of having repair events in
the minimum distance failure path. We will describe some fairly general,
albeit somewhat indirect, conditions under which the lower bound is true, and
then give specific examples of repair queueing disciplines and repair service
distributions that satisfy these conditions. In order to do so, we need to
introduce some new notation. A sample path consists of an ordered sequence
of events (failures and repairs) and the times of those events. Let E_i denote
the type of the i-th event, i.e., E_i = f_{kj} if the event is a component k failure
in failure mode j, and E_i = r_k if the event is a repair of component k. Note
that E_1 is always a component failure event. (We could allow simultaneous
repair of components, but will not consider that here since it complicates
the notation. Also, for simplicity, we will assume that the failure modes
completely specify which components are failed through failure propagation
on each failure.) Let T_i denote the time of the i-th event (failure or repair).
As in the proof of Theorem 3.1, define r to be the minimum distance over
all possible sample paths in the set {T_F < t}. (Note that the minimum
distance r is actually a function of t, the repair disciplines, and the repair
time distributions. However, we will assume that these factors are fixed and
suppress the dependence of r on them in our notation.) The sequence of
events till system failure, in any sample path with the minimum distance r,
will be called a most-likely event sequence. (Note that, in any system, there
are only finitely many most-likely event sequences, but there are an
uncountably infinite number of sample paths corresponding to any given
most-likely event sequence.)
Assumption A: There exists a most-likely event sequence P = (e_1, e_2, …, e_n),
constants 0 = t_0 < t_1 < … < t_n < t and a constant δ > 0, with the following
property: let

P_k = {t_{j−1} < T_j < t_j, E_j = e_j for 1 ≤ j ≤ k, T_{k+1} > t_k}   (4.6)

for 1 ≤ k ≤ n (P_0 ≡ Ω, the whole sample space), and let R_k (F_k) be the set
of repair (respectively, failure) events in (t_{k−1}, t_k) for 1 ≤ k ≤ n. Assume
that, for all ε small enough and for 1 ≤ k ≤ n,

P(R_k = {e_k} | P_{k−1}, F_k = ∅) ≥ δ if e_k is a repair event,   (4.7)
P(R_k = ∅ | P_{k−1}, F_k = {e_k}) ≥ δ if e_k is a failure event.   (4.8)
Assumption A basically states that the events of P occur in the correct
sequence with positive probability, given that the preceding failure and re-
pair events (in P) occur within certain time intervals. More specifically, the
assumption implies that the interval [0, t) can be broken up into subintervals.
Equation (4.7) implies that if the k-th event is supposed to be a repair, then
there exists an interval such that a repair occurs in that interval with positive
probability. Similarly, equation (4.8) states that, if the k-th interval is sup-
posed to contain a failure event, then no repair events occur in that interval
with positive probability.

Before proving the bounded relative error property, we will verify that
these conditions hold for several cases of interest. Let R_i denote a random
variable whose distribution is the repair time distribution of the i-th compo-
nent.
Example 1: Consider systems with an arbitrary number of repairmen that
repair components with any non-preemptive priority repair discipline (with
any non-preemptive repair discipline, like FCFS, non-preemptive last come
first served (LCFS), etc., used between members of the same priority class).
Assume that at least one most-likely event sequence does not contain any
repair completion events. This condition is always true in systems that do
not have failure propagation; in such systems none of the most-likely event
sequences includes repair completion events. Repairs are assumed to be non-
instantaneous, i.e., P(R_i > 0) = 1 for all i. Hence there exists a constant
t̂_0 > 0 such that P(R_i > t̂_0) > 0 for all i. Let δ_min = min{P(R_i > t̂_0) : 1 ≤
i ≤ N} and let t_0 = min{t̂_0, t/2}. Clearly P(R_i > t_0) ≥ δ_min for all i. Let us
see why systems of this type satisfy Assumption A.
We will show that Assumption A holds if we choose P as a most-likely
event sequence with no repair completion events, t_i = i t_0/n for 1 ≤ i ≤ n,
and δ = (δ_min)^n. To see this, note that since we only have failure events
in the most-likely event sequence, we only have to check equation (4.8) for
1 ≤ k ≤ n. The failure of the i-th component in the most-likely event sequence
(at time T_i) may begin a repair process if a repairman (that repairs this
component) is free. If it does begin a repair process, then since
P(R_i > t_0) ≥ δ_min, the probability that this repair process finishes after
(absolute) time t_0 is at least δ_min. Hence the probability that all of the
repair processes started before t_0 (i.e., those that may have been started at the
times of the failure events in the most-likely event sequence) finish after t_0
is at least δ = (δ_min)^n. This in turn implies the conditions of equation
(4.8). □
Example 2: Consider systems with a single repairman, with any non-
preemptive priority repair discipline (with any non-preemptive repair dis-
cipline, like FCFS, non-preemptive LCFS, etc., used between members
of the same priority class), in which the most-likely event sequences may
contain repair completion events. Again, assume that the repairs are non-
instantaneous. Let us see now why systems of this type satisfy Assumption
A.
First consider the case where a most-likely event sequence has two repair
completions, with m_1 > 0 failure events before the first repair completion, m_2
failure events between the first and second repair completions, and m_3 > 0 failure
events after the second repair completion. First we will assume that the repair
completions are non-consecutive (i.e., m_2 > 0) and then show how to extend
the argument to the consecutive case. Without loss of generality, assume that the first
three components that start repair in this most-likely path are Component 1,
Component 2 and Component 3, respectively. Since completion of the repairs

of Component 1 and Component 2 (in the most-likely event sequence) occurs
before t, P(R_1 + R_2 < t) > 0. Hence there exist positive constants s_1 and
s_2, with s_1 + s_2 < t, such that for all Δ > 0, P(s_1 − Δ < R_1 < s_1 + Δ) > 0
and P(s_2 − Δ < R_2 < s_2 + Δ) > 0. Then the t_i's are chosen as follows.
The interval corresponding to the first failure event is chosen small enough
so that if the repair times are near s_1 and s_2 then the second repair completes
before time t. The repair times are confined sufficiently close to the respective
s_i's (i.e., the Δ in (s_i − Δ < R_i < s_i + Δ) is small enough) so that 1) the
interval corresponding to the first failure does not overlap with the interval
corresponding to the first repair completion, 2) the intervals corresponding
to the repair completions do not overlap, and 3) with positive probability,
the third repair does not complete within the interval corresponding to the
second repair completion, i.e., if s_3 > 0 is such that P(R_3 > s_3) > 0, then it
is enough that the width of the interval corresponding to the second repair
completion be chosen smaller than s_3. We choose the width for the first failure
interval and the Δ corresponding to R_1 and R_2 to be the same; call it Δ_0.
We make sure that Δ_0 is small enough so that all the above criteria are
satisfied. More formally, let

Δ_0 = min{s_1/3, s_2/5, s_3/6, (t − s_1 − s_2)/5} > 0.   (4.9)


For j = 1,2, let
OJ = P(8j - ..10 < Rj < 8j + ..10) > 0 (4.10)
and let
(4.11)
Now choose
0= min{ol,02,63} (4.12)
and t1 = ..10. By equation (4.10) (for j = 1) and equation (4.12), with proba-
bility at least 0, there is no repair in the (absolute) time interval [..1 0,81 - ..1 0]
(note that by equation (4.9), 81 - ..10 > ..10). Hence choose tml = 81 - ..10
and choose the intermediate ti's evenly between t1 and t m1 , i.e.,
(i - 1) .
ti=t1+(m1_1)(tml-t1) for 1<z<m1.

Next, choose tml+1 = 81 +2..1 0, as with probability at least 0, the first repair
completes in [81 - ..1 0,81 + 2..10]. The second repair starts as soon as the
first repair completes. By equation (4.10) (for j = 2) and equation (4.12),
with probability at least 0, the second repair does not complete in the interval
[81 +2..10,81 +82-2..10] (note that by equation (4.9), 81 +82-2..10> 81 +2..10).
Hence choose t m1 +1+m2 = 81 + 82 - 2..1 0, and the intermediate ti's evenly
between tml+1 and t m1 +1+m2' i.e.,

Since, with probability at least δ, the second repair completes in the interval
[s_1 + s_2 − 2Δ_0, s_1 + s_2 + 3Δ_0], choose t_{m_1+1+m_2+1} = s_1 + s_2 + 3Δ_0. The third
repair starts when the second repair completes. By equations (4.9), (4.11)
and (4.12),

P(R_3 > 6Δ_0) ≥ δ,   (4.13)

i.e., with probability at least δ, the third repair does not complete in the
interval [s_1 + s_2 − 2Δ_0, s_1 + s_2 + 4Δ_0]. Hence choose t_{m_1+1+m_2+1+m_3} = s_1 + s_2 +
4Δ_0, and the remaining t_i's evenly between t_{m_1+1+m_2+1} and t_{m_1+1+m_2+1+m_3},
i.e.,

t_i = t_{m_1+1+m_2+1} + ((i − (m_1 + m_2 + 2))/m_3) (t_{m_1+1+m_2+1+m_3} − t_{m_1+1+m_2+1})

for m_1 + 1 + m_2 + 1 < i < m_1 + 1 + m_2 + 1 + m_3. Note that by equation
(4.9), t_{m_1+1+m_2+1+m_3} < t.
For the case where m_2 = 0, we extend the interval corresponding to the
first repair completion from [s_1 − Δ_0, s_1 + 2Δ_0] to [s_1 − Δ_0, s_1 + s_2 − 2Δ_0]
(note that s_1 + s_2 − 2Δ_0 is the beginning of the second repair interval). The
other intervals remain unchanged.
This argument can easily be extended to cases where the most-likely
path contains more than two repair completions. Say there are l
repair completions, with m_1, m_2, …, m_l denoting the respective numbers of
intermediate failure events and m_{l+1} denoting the number of failure events af-
ter the last repair completion. We will assume that the repair completions are
non-consecutive, though (as in the two repair completion case) our arguments
can easily be extended to the consecutive case. Define s_1, s_2, …, s_l, s_{l+1} and
δ_1, δ_2, …, δ_l, δ_{l+1} analogously to the two repair completion case and choose

δ = min{δ_1, δ_2, …, δ_{l+1}}.   (4.14)

Let

Δ_0 = min{s_1/3, s_2/5, …, s_l/(2l+1), s_{l+1}/(2l+2), (t − Σ_{i=1}^l s_i)/(l+3)} > 0.   (4.15)

Choose t_1 = Δ_0. Then t_{Σ_{k=1}^j m_k + j − 1} (the start of the interval correspond-
ing to the j-th repair completion) may be chosen as Σ_{k=1}^j s_k − jΔ_0, and
t_{Σ_{k=1}^j m_k + j} may be chosen as Σ_{k=1}^j s_k + (j + 1)Δ_0. The intervals correspond-
ing to the intermediate failure events may be chosen to be evenly distributed
between the above intervals. Finally, choose t_{Σ_{k=1}^{l+1} m_k + l} as t_{Σ_{k=1}^{l} m_k + l} + Δ_0,
and choose the intervals corresponding to the remaining failure events evenly
distributed between t_{Σ_{k=1}^{l} m_k + l} and t_{Σ_{k=1}^{l+1} m_k + l}. □
It is possible to verify that other situations also satisfy these assumptions,
although it is difficult to state simple, direct conditions on the underlying
repair disciplines and distributions for which Assumption A is valid.

Theorem 4.1. Suppose there exist positive finite constants λ̲_i and λ̄_i such
that λ̲_i ε^{b_i} ≤ h_i(x) ≤ λ̄_i ε^{b_i} for all i and 0 ≤ x ≤ t, and that Assumption A
holds. Then there exist positive finite constants r, a(t) and b(t) such that, as
ε → 0,

a(t) ε^r ≤ γ(ε, t) ≤ b(t) ε^r.   (4.16)
Proof. To prove the lower bound, notice that P(T_F ≤ t) ≥ P(P_n) =
∏_{k=1}^n P(P_k | P_{k−1}) (with P(P_1 | P_0) ≡ P(P_1)). Assume that the process is simu-
lated using uniformization (at rate β) of failure events as described earlier.
Now consider P(P_k | P_{k−1}) for 1 ≤ k ≤ n. Note that given P_{k−1}, the event
P_k implies that there is only one event in the interval (t_{k−1}, t_k). Thus, if e_k
is a repair event, then

P(P_k | P_{k−1}) ≥ P(R_k = {e_k} | P_{k−1}, F_k = ∅) P(F_k = ∅ | P_{k−1}).   (4.17)

The first term on the right hand side of equation (4.17) is at least δ by
equation (4.7) of Assumption A, while the second term is at least the
probability that a Poisson process with rate β has no events in the interval
(t_{k−1}, t_k). Thus, in this case P(P_k | P_{k−1}) is bounded below by some function (of
t_{k−1}, t_k, β and δ) that is independent of ε. Similarly, if e_k is a failure event,
then

P(P_k | P_{k−1}) ≥ P(R_k = ∅ | P_{k−1}, F_k = {e_k}) P(F_k = {e_k} | P_{k−1}).   (4.18)

The first term on the right hand side of equation (4.18) is at least δ by
equation (4.8) of Assumption A, while the second term is at least the
probability that a Poisson process with rate β has exactly one event in the
interval (t_{k−1}, t_k) times the probability of accepting event e_k as the failure
event. This latter probability is at least λ̲_i ε^{b_i}/β if the event is a component
i failure. Thus P(P_n) ≥ ε^r times a function of t (and β, δ, and the failure
mode and failure propagation probabilities), as desired.
The proof of the upper bound is also similar to that in Theorem 3.1. We
assume that

λ̲' ≤ λ'_i(s) ≤ λ̄' whenever λ_i(s) > 0,
β' ≤ β − λ'_F(s) whenever β − λ_F(s) > 0   (4.19)

for positive finite constants λ̲', λ̄', β'. Since we still have Σ_i N_F(i, t) b_i ≥ r,
L_Ū(F, ε, t) ≤ ε^r c_F^{N_F(t)} and L_Ū(P, ε, t) ≤ c_P^{N_P(t)} whenever T_F ≤ t, where
c_F and c_P are defined in the proof of Theorem 3.1. Therefore, letting c_2 =
max{c_F, c_P}, we have

γ(Ū, ε, t) ≡ L_Ū(ε, τ) 1_{T_F ≤ t} ≤ c_2^{N_β(t)} ε^r   (4.20)

and the upper bound on γ(ε, t) then follows by taking expectations:

γ(ε, t) = E'_Ū[γ(Ū, ε, t)] ≤ ε^r E'_Ū[c_2^{N_β(t)}]   (4.21)

where E'_Ū denotes expectation under uniformization-based importance sampling
of failures only. □
Similar to Theorem 3.2, under suitable conditions we obtain bounded
relative error when applying uniformization-based importance sampling to
the failure times. The proof of this theorem is basically the same as that of
Theorem 3.2: combine the lower bound of Theorem 4.1 with the upper bound
of equation (4.20).
Theorem 4.2. Under the conditions of Theorem 4.1, if importance sampling
satisfying equations (4.19) is applied, then the resulting estimate has bounded
relative error.

5. Exponential Transformation
In this section, we describe an alternative importance sampling procedure
that gets around a potential computational inefficiency of uniformization:
the generation of pseudo events. The method, which we call exponential trans-
formation, is based on the following observation. Consider a uniformization-
based simulation using rate β for the Poisson process {N_β(s)}. Suppose each
event in {N_β(s)} is accepted as a failure event with fixed probability p, i.e.,
λ_F(s)/β = p for all s. Then the time between accepted failure events has an
exponential distribution with rate α = p × β. This suggests simply sampling
the time to the next failure event from an exponential distribution with rate
α; this is basically what the exponential transformation method does.
We first describe the method in more detail and present its likelihood
ratio, and then show that the method possesses the bounded relative error
property.
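The motivating observation (the gaps between accepted events of a rate-β Poisson process, each event kept independently with probability p, are exponential with rate α = pβ) can be checked by simulation. The sketch below is ours, for illustration only:

```python
import random

def accepted_gaps(beta, p, n, seed=42):
    """Gaps between successive accepted events when each event of a
    rate-beta Poisson process is independently accepted w.p. p."""
    rng = random.Random(seed)
    gaps, gap = [], 0.0
    while len(gaps) < n:
        gap += rng.expovariate(beta)   # next event of the rate-beta process
        if rng.random() < p:           # accept it as a failure event
            gaps.append(gap)
            gap = 0.0
    return gaps

gaps = accepted_gaps(beta=4.0, p=0.5, n=20000)
mean_gap = sum(gaps) / len(gaps)       # should be close to 1/(p*beta) = 0.5
```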

5.1 Description of Exponential Transformation and its Likelihood Ratio

Exponential transformation involves a change of measure in which we
sample the time to the next component failure event from an exponential
distribution. The mean of the exponential distribution is allowed to depend
upon the state of the system. The specific sampling scheme (corresponding
to the change of measure) is as follows. Repair times are again sampled from
their original distributions. We let T_n denote the time of the n-th event in
the system, where an event is either a repair or a failure. We define T_0 = 0
and δ_n = T_n − T_{n−1} to be the inter-event time. Define a repair event list
which contains the completion times of repairs that are ongoing at the time
of the current event. Obviously, this list is updated at each T_n. Let R_n be
the time of the first scheduled repair event after time T_{n−1}. (R_n is the time
of the next event on the repair event list at time T_{n−1}.) Now consider the
system just after the event that took place at time T_{n−1}. An exponential
random variable E_n with some chosen rate α_n is sampled. If T_{n−1} + E_n ≤ R_n,
then the next system event is a failure and T_n = T_{n−1} + E_n. In this case,
component i is chosen as the failing component with some chosen probability
q_i(n) (provided component i is operational). The likelihood of such a failure
event is q_i(n) α_n e^{−α_n δ_n}. On the other hand, if T_{n−1} + E_n > R_n, then the
next system event is a repair and T_n = R_n. The likelihood of such a repair
event is e^{−α_n δ_n}. Define F(n) = i if component i fails at time T_n, and let
γ_n = q_{F(n)}(n) α_n for a failure event and γ_n = 1 for a repair event. Let
N(t) denote the number of events in (0, t). Then the likelihood associated
with sampling the failure times is

P_E(t) = [∏_{n=1}^{N(t)} γ_n e^{−α_n δ_n}] e^{−α_{N(t)+1}(t − T_{N(t)})}.   (5.1)

The term on the right in equation (5.1) represents the probability that the
last inter-failure time exceeds the remainder of t. However, since sampling
stops at time τ = min(t, T_F), this term does not appear in the likelihood
if T_F < t; this can be formally accommodated in equation (5.1) by setting
α_n = 0 for n ≥ N(T_F) + 1.
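One step of this sampling scheme, together with the factor it contributes to the likelihood P_E(t), can be sketched as follows (an illustrative sketch; function and argument names are ours):

```python
import math
import random

def next_event(t_prev, r_next, alpha, q, rng):
    """One step of exponential-transformation sampling.

    t_prev : time T_{n-1} of the previous event
    r_next : time R_n of the next scheduled repair completion (math.inf if none)
    alpha  : chosen failure-sampling rate alpha_n
    q      : dict {component id: q_i(n)} over operational components (sums to 1)
    Returns (T_n, event, likelihood factor).
    """
    e_n = rng.expovariate(alpha)
    if t_prev + e_n <= r_next:
        # Next event is a component failure at T_n = T_{n-1} + E_n.
        delta = e_n
        u = rng.random()
        for comp, qi in q.items():     # choose the failing component w.p. q_i(n)
            u -= qi
            if u <= 0:
                break
        return t_prev + delta, ("failure", comp), q[comp] * alpha * math.exp(-alpha * delta)
    # Otherwise the scheduled repair occurs first, at T_n = R_n.
    delta = r_next - t_prev
    return r_next, "repair", math.exp(-alpha * delta)

rng = random.Random(7)
t1, ev, lik = next_event(0.0, math.inf, 2.0, {1: 0.5, 2: 0.5}, rng)
```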
Let N_F(i, t) denote the number of times that component i fails in the
interval (0, t). N_F(i, t) counts only the times that component i fails on its own
accord, not the times that the component fails because it is affected by
some other component. Let M_i(t) denote the number of times that component
i's failure clock is reset but does not expire on its own accord in (0, t). M_i(t)
counts the number of times that component i fails because it is affected by
some other component, plus one if component i is operational at time t. Let
X_{ij}, j = 1, …, N_F(i, t), denote the age of component i when it fails of its own
accord for the j-th time, and let Y_{ij}(t), j = 1, …, M_i(t), denote the age of
component i's clock when it is caused to fail by some other component for
the j-th time, or its age at time t. Then

P_G(t) = ∏_{i=1}^N [∏_{j=1}^{N_F(i,t)} g_i(X_{ij})] [∏_{j=1}^{M_i(t)} Ḡ_i(Y_{ij}(t))]   (5.2)

is the likelihood associated with the failure times of the sample path under the
original failure distributions, where g_i denotes the density and Ḡ_i the
complementary distribution function of component i's failure time. Defining
L_E(ε, t) = P_G(t)/P_E(t) and γ(E, ε, t) = L_E(ε, τ) 1_{T_F ≤ t}, we have γ(ε, t) =
E_E[γ(E, ε, t)], where the subscript E refers to sampling with exponential
distributions as described above.
We will assume that α_n and q_i(n) are chosen such that they have the
following property: there exist positive finite constants q̲, q̄, α̲ and ᾱ such
that

q̲ ≤ q_i(n) ≤ q̄   (5.3)

whenever component i is operational, and

α̲ ≤ α_n ≤ ᾱ   (5.4)
with probability one. We call this type of importance sampling "generalized
balanced failure biasing with exponential transformation." When q_i(n) =
1/|O(T_{n−1})|, we call the method "balanced failure biasing with exponential
transformation." As in the uniformization approach, there is considerable
flexibility in how to choose the rates α_n. Specific heuristics for doing so are
discussed in Nicola et al. (1992) and Heidelberger (1992) and will also be de-
scribed briefly in Section 6.

5.2 Bounded Relative Error Using Exponential Transformation

In this section, we show that importance sampling using exponential trans-


formation produces estimates having bounded relative error.
Theorem 5.1. Under the conditions of Theorem 4.1, if importance sam-
pling using exponential transformation satisfying equations (5.3) and (5.4) is
applied, then the resulting estimate has bounded relative error.

Proof. The required lower bound on γ(ε, t) holds by Theorem 4.1. Thus
we only need to prove that E_E[γ(E, ε, t)²] ≤ f(t) ε^{2r} for some function f(t).
We begin by establishing an upper bound for the numerator, P_G(t), of the
likelihood ratio. Notice that g_i(X_{ij}) = h_i(X_{ij}) Ḡ_i(X_{ij}) ≤ λ̄_i ε^{b_i}. Thus, on
{T_F ≤ t},

P_G(t) ≤ ∏_{i=1}^N λ̄_i^{N_F(i,t)} ε^{b_i N_F(i,t)} ≤ λ̄^{N_F(t)} ε^r   (5.5)

if λ̄_i ≤ λ̄ for all i. To complete the proof, we need to lower bound the denomina-
tor, P_E(t), of the likelihood ratio. First, by the definition of γ_n, ∏_{n=1}^{N(t)} γ_n ≥
(q̲ α̲)^{N_F(t)}. Also, on {T_F ≤ t}, ∏_{n=1}^{N(t)+1} e^{−α_n δ_n} ≥ e^{−ᾱt}. Combining these two
facts yields P_E(t) ≥ (q̲ α̲)^{N_F(t)} e^{−ᾱt}. Thus, γ(E, ε, t)² ≤ ε^{2r} e^{2ᾱt} c_E^{N_F(t)}, where
c_E = (λ̄/(q̲ α̲))². Thus E_E[γ(E, ε, t)²] ≤ ε^{2r} e^{2ᾱt} E_E[c_E^{N_F(t)}]. But N_F(t) is
stochastically smaller than N_ᾱ(t), where {N_ᾱ(t)} is a Poisson process with
rate ᾱ, thereby completing the proof. (To see this, note that the required
exponentials with rate α_n could be generated by appropriately thinning a
Poisson process with rate ᾱ.) □
Again, notice the exponential growth (in t) of the bounding function f(t);
this implies that the method will only be efficient for relatively small values
of t. When applied properly, both uniformization-based importance sampling
and exponential transformation yield estimates having bounded relative er-
ror. However, it is not clear whether one of these methods is always
guaranteed to have lower variance than the other. Notice also that,
unlike the uniformization-based methods, exponential transformation can be
used for importance sampling even when the failure distributions do not have
bounded hazard rates. However, in this case, the method is not guaranteed
to possess the bounded relative error property.

6. Experimental Results
In this section, we present the results of experiments to test the effectiveness
of the exponential transformation method. Additional experimental results
are presented in Nicola et al. (1992) and Heidelberger et al. (1992) for both the
exponential transformation and uniformization-based importance sampling
approaches.
The test model we consider has two types of components and a single
repairman. There are three components of type one and two components of
type two. The system is considered operational if at least one component
of each type is operational. The repairman fixes components according to a
preemptive priority discipline, with type two components having the highest
priority. Components of type one have a constant repair time distribution
with mean one, and components of type two have a uniformly distributed
repair time on (5, 10). The model may include "failure propagation," i.e., the
failure of one component may cause other components to fail
at the same time. Specifically, we assume that with probability a, a failure
of a component of type two causes two components of type one to also fail (and
with probability 1 − a the component affects no other components). We call
a the components-affected probability, and we consider two cases: a = 0 (no
failure propagation) and a = 0.25. The performance measure of interest is
the probability that the system fails before time t = 100.
The failure distributions are parameterized by ε, which measures the rarity
of component failures. We consider two types of failure distributions: Erlang
with two stages and Hyperexponential with two phases. We let E2(ε) denote
the Erlang distribution with two stages and failure rate 2ε in each stage. The
mean of this distribution is 1/ε, and the failure probability for small ε and fixed
t is O(ε²) (since two exponentials with rate 2ε need to occur within time t).
The Hyperexponential distribution is denoted by H2(ε) and has coefficient
of variation equal to two. The parameterization of the Hyperexponential was
chosen so as to equalize P(E2(ε) ≤ 100) and P(H2(ε) ≤ 100) for a particular
value of ε (ε = 10⁻⁶, corresponding to configurations 7 and 8 below). Specif-
ically, with probability 0.7373, H2(ε) is exponential with rate λ(ε) = 2.66ε,
and with probability 0.2727, H2(ε) is exponential with rate 12λ(ε).
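The O(ε²) behaviour can be checked directly from the closed-form distribution function of E2(ε), F(t) = 1 − e^{−2εt}(1 + 2εt): for fixed t, the ratio F(t)/ε² tends to 2t² as ε → 0 (20,000 for t = 100). A short Python sketch (our own illustration; the function name is not from the paper):

```python
import math

def erlang2_cdf(eps, t):
    """P(E2(eps) <= t) for an Erlang with two stages, each of rate 2*eps,
    so that the mean time to failure is 1/eps."""
    r = 2.0 * eps * t
    return 1.0 - math.exp(-r) * (1.0 + r)

t = 100.0
for eps in (1e-3, 1e-4, 1e-5):
    ratio = erlang2_cdf(eps, t) / eps**2      # approaches 2*t*t = 20000 as eps -> 0
    print(f"eps={eps:g}  P(fail<=100)={erlang2_cdf(eps, t):.3e}  ratio={ratio:.0f}")
```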
We consider eight different configurations of this system with two dif-
ferent values of ε (ε = 0.01 and ε = 0.0001) for each configuration. These
configurations are listed in Table 6.1. The configurations were chosen so that
a diverse range of most-likely failure paths occurs among the configurations.
Consider configuration 1, in which both component types have E2(ε) failure
distributions and there is no failure propagation. In this case, P(τ_F ≤ t) = O(ε⁴) and
Bounded Relative Error in Estimating Transient Measures 511

the most-likely failure path consists of two failures of component type two.
(These O(·) estimates should not be taken too literally; they assume that
t is fixed and ε → 0. For example, for ε = 0.01 and t = 100 (= 1/ε), this
assumption is clearly violated.) In configuration 4, component type one has
an E2(ε) distribution, component type two has an E2(ε^1.5) distribution, and
a = 0.25. This is an example of an "unbalanced" system (see Goyal et al.
1992), since component type two is much more reliable than component type
one. For configuration 4, P(τ_F ≤ t) = O(ε⁵) and the most-likely failure path
consists of one failure of component type one and one affecting failure of
component type two, i.e., a failure of type two which causes two components
of type one to fail with it. For configuration 3, P(τ_F ≤ t) = O(ε⁶) and there
are two most-likely failure paths: three failures of component type one, or
two failures of component type two. For configuration 2, P(τ_F ≤ t) = O(ε⁴)
and there are (at least) four different types of most-likely failure paths:
1. two failures of type two,
2. one failure of type one and one affecting failure of type two,
3. an affecting failure of type two, a repair of type two, and an affecting
failure of type two,
4. an affecting failure of type two, two repairs of type two, and an affecting
failure of type two.
Similar analyses can also be made for configurations 5 - 8.
Each configuration was simulated for 256,000 replications using exponen-
tial transformation. The parameter settings for the exponential transforma-
tion were based on earlier experiments described in Nicola et al. (1990) and
Heidelberger et al. (1992). The rate of the first transition, α₁, was chosen so
that an exponential with rate α₁ is less than t = 100 with probability 0.8.
(This is called approximate forcing.) When repairs are ongoing, the values of
α_n were chosen so as to make the probability of failure before repair comple-
tion approximately equal to p = 1/3 (p is called the biasing probability). This
was done as follows. Let 1/μ_n denote the mean repair time of the component
in repair at time T_{n−1} (1/μ_n = 0.5 for type one components and 1/μ_n = 7.5
for type two components). Then α_n is chosen so that α_n/(α_n + μ_n) = 1/3. For
exponentially distributed repairs, this makes the biasing probability exactly
equal to 1/3. Balancing was done by equalizing the probabilities of which
component type fails upon a failure event (since there is more than one
component of each type). Importance sampling was, in effect, "turned off"
(by making α_n small, i.e., close to the original hazard rates) whenever the
system returned to the state with all components operational.
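These parameter choices reduce to two one-line formulas: approximate forcing gives α₁ = −ln(1 − 0.8)/t, and the biasing condition α_n/(α_n + μ_n) = 1/3 gives α_n = μ_n·p/(1 − p) = μ_n/2. A small Python sketch under these assumptions (variable names are ours):

```python
import math

t, forcing_p, bias_p = 100.0, 0.8, 1.0 / 3.0

# Approximate forcing: pick alpha_1 so that P(Exp(alpha_1) <= t) = 0.8,
# i.e. 1 - exp(-alpha_1 * t) = 0.8.
alpha_1 = -math.log(1.0 - forcing_p) / t

def biased_failure_rate(mean_repair_time, p=bias_p):
    """Pick alpha_n so that alpha_n / (alpha_n + mu_n) = p,
    where 1/mu_n is the mean repair time of the component in repair."""
    mu = 1.0 / mean_repair_time
    return mu * p / (1.0 - p)

alpha_type1 = biased_failure_rate(0.5)   # 1/mu_n = 0.5 for type one repairs
alpha_type2 = biased_failure_rate(7.5)   # 1/mu_n = 7.5 for type two repairs
```

With the mean repair times quoted above, this yields α₁ ≈ 0.0161, α_n = 1.0 during type one repairs, and α_n ≈ 0.067 during type two repairs.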
The point estimates and the relative half-widths of 99% confidence in-
tervals are displayed in Table 6.1. Notice that system failures before time
t = 100 are not particularly rare for ε = 0.01, especially in configurations 1, 2,
5 and 6. Indeed, for ε = 0.01, the importance sampling is not very effective
and actually can result in some variance increase. For example, using stan-
dard simulation when ε = 0.01, the relative half-widths of 99% confidence
intervals would be approximately 0.2%, 4.0%, and 8.3% for configura-
tions 1, 3 and 7, respectively, compared to 2.2%, 6.1% and 5.4% using
exponential transformation. However, when ε = 0.0001, system failures be-
fore time t = 100 are extremely rare in all configurations and many orders
of magnitude reduction in variance is obtained; the relative half-widths are
3.4%, 11.5% and 5.8% using exponential transformation, compared to
about 729%, 933,000% and 1,255,000% using standard simulation for
configurations 1, 3 and 7, respectively. These results, and those in Nicola
et al. (1990) and Heidelberger et al. (1992), provide very good experimental
agreement with the bounded relative error theorems.
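The deterioration of standard simulation as failures become rarer can be seen from the usual binomial confidence interval: for a crude Monte Carlo estimate of a probability p from n replications, the relative half-width of a 99% confidence interval is roughly 2.576·sqrt(p(1 − p)/n)/p, which grows like 1/sqrt(p) as p → 0. A sketch (our own illustration; it is not meant to reproduce the paper's exact percentages, which are based on the estimated variances of the actual runs):

```python
import math

Z99 = 2.576  # two-sided 99% standard-normal quantile

def rel_halfwidth_standard(p, n):
    """Relative half-width of a 99% CI for a probability p estimated by
    crude Monte Carlo with n replications: z * sqrt(p(1-p)/n) / p."""
    return Z99 * math.sqrt(p * (1.0 - p) / n) / p

n = 256_000
for p in (1e-2, 1e-4, 1e-8):
    # grows roughly like 1/sqrt(p) as the event becomes rarer
    print(f"p={p:g}  relative half-width={rel_halfwidth_standard(p, n):.2%}")
```

Bounded relative error means an importance-sampling estimator's half-width stays bounded as the event probability shrinks, so a fixed n keeps working.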

Table 6.1. Estimates of the unreliability at time t = 100, γ(ε, 100), for the model
with two component types, along with estimated relative half-widths of 99% confi-
dence intervals (shown in parentheses). The estimates were obtained from 256,000
replications using exponential transformation.

No.  Type 1 Failure  Type 2 Failure  Components Affected  ε = 0.01              ε = 0.0001
     Distribution    Distribution    Probability a
1    E2(ε)           E2(ε)           0                    8.19 × 10⁻²  (2.2%)   7.34 × 10⁻⁹   (3.4%)
2    E2(ε)           E2(ε)           0.25                 1.17 × 10⁻¹  (2.0%)   1.03 × 10⁻⁸   (2.7%)
3    E2(ε)           E2(ε^1.5)       0                    2.40 × 10⁻⁴  (6.1%)   4.49 × 10⁻¹⁵  (11.5%)
4    E2(ε)           E2(ε^1.5)       0.25                 9.68 × 10⁻⁴  (14.5%)  2.86 × 10⁻¹²  (4.7%)
5    H2(ε^1.0)       E2(ε)           0                    8.16 × 10⁻²  (1.5%)   7.34 × 10⁻⁹   (3.4%)
6    H2(ε^1.5)       E2(ε)           0.25                 8.94 × 10⁻²  (1.5%)   9.60 × 10⁻⁹   (2.7%)
7    H2(ε^1.5)       E2(ε^1.5)       0                    5.69 × 10⁻⁵  (5.4%)   2.48 × 10⁻¹⁵  (5.8%)
8    H2(ε^1.0)       E2(ε^1.5)       0.25                 2.45 × 10⁻⁴  (5.3%)   2.14 × 10⁻¹²  (3.9%)

7. Summary
This paper has considered the problem of efficiently simulating the system
failure time distribution in models of highly dependable systems with non-
Markovian component failure distributions. Several importance sampling ap-
proaches were described. These approaches are a natural generalization of
approaches used in Markovian systems. We proved (under appropriate tech-
nical conditions) that these approaches are all effective as component failure
events become rarer. Specifically, we showed that for a fixed time horizon t,
estimates of P(τ_F ≤ t) have bounded relative error as the measure of rar-
ity, ε, approaches zero. In practice, this means that only a fixed number of
replications is required to obtain accurate estimates of P(τ_F ≤ t), no mat-
ter how rare system failure events are. Experimental results presented here
(and elsewhere) provided experimental confirmation of this theoretical result.
However, the method may result in some variance increase if P(τ_F ≤ t) is not
small, say greater than 10⁻⁴. This suggests that either importance sampling
should not be used in such cases, or that pilot studies be performed so as to
more carefully "tune" the parameters of the importance sampling change of
measure.
A number of problems for further research are evident. There are other
performance measures of interest besides the failure time distribution. For
example, one might be interested in estimating the expected interval un-
availability U(t), the expected fraction of time the system is failed during
some interval (0, t). It seems likely that the approach described here will be
effective for estimating U(t), but this has not yet been shown. In addition,
one is often interested in steady-state measures, e.g., U = lim_{t→∞} U(t) (as-
suming it exists). With general failure distributions, one can no longer rely
on regenerative structure, as is done in Markovian systems. However, an ex-
tension of the techniques described here can be used to estimate such steady-
state measures (see Glynn et al. 1993 and Nicola et al. 1992). This approach
appears effective in practice; however, this again has not been established
theoretically. In addition, one is often interested in estimating P(τ_F ≤ t) for
relatively large values of t. For Markovian systems, the regenerative struc-
ture can again be exploited to obtain good estimates (see Carrasco 1991b,
Shahabuddin 1994b and Shahabuddin and Nakayama 1993). In addition, for
systems with exponential failure distributions, it is known that τ_F/E[τ_F]
converges in distribution to an exponential random variable with mean one
(see Brown 1990 and Keilson 1979). However, such structure is not present
here, and effective "large t" importance sampling techniques have yet to be
devised.

Acknowledgement. Parts of Sections 1, 2 and 5.1 are taken from Nicola et al. (1992).
This material is reprinted with permission of the IEEE.

References

Barlow, R.E., Proschan, F.: Statistical Theory of Reliability and Life Testing. New
York: Holt, Rinehart and Winston 1981
Brown, M.: Error Bounds for Exponential Approximations of Geometric Convolu-
tions. The Annals of Probability 18, 1388-1402 (1990)
Carrasco, J. A.: Failure Distance-Based Simulation of Repairable Fault-Tolerant
Systems. Proceedings of the Fifth International Conference on Modelling Tech-
niques and Tools for Computer Performance Evaluation (1991a), pp. 337-351
Carrasco, J.A.: Efficient Transient Simulation of Failure/Repair Markovian Models.
Proceedings of the Tenth Symposium on Reliable and Distributed Computing.
IEEE Press 1991b, pp. 152-161
Dugan, J.B., Trivedi, K.S., Smotherman M.K., Geist, R.M.: The Hybrid Automated
Reliability Predictor. Journal of Guidance, Control and Dynamics 9, 319-331
(1986)
Fox, B.L., Glynn, P.W.: Discrete-Time Conversion for Simulating Finite-Horizon
Markov Processes. SIAM J. Appl. Math. 50, 1457-1473 (1990)
Frater, M.R., Lennon, T.M., Anderson, B.D.O.: Optimally Efficient Estimation of
the Statistics of Rare Events in Queueing Networks. IEEE Transactions on
Automatic Control 36, 1395-1405 (1991)
Geist, R.M., Smotherman, M.K.: Ultrahigh Reliability Estimates through Simu-
lation. Proceedings of the Annual Reliability and Maintainability Symposium.
IEEE Press 1989, pp. 350-355
Geist, R.M., Trivedi, K.S.: Ultra-High Reliability Prediction for Fault-Tolerant
Computer Systems. IEEE Transactions on Computers C-32, 1118-1127 (1983)
Gertsbakh, I.B.: Asymptotic Methods in Reliability Theory: A Review. Advances
in Applied Probability 16, 147-175 (1984)
Glynn, P.W.: A GSMP Formalism for Discrete Event Systems. Proceedings of the
IEEE 77, 14-23 (1989)
Glynn, P.W.: Importance Sampling for Markov Chains: Asymptotics for the Vari-
ance. Technical Report. Dept. of Operations Research, Stanford University
(1992)
Glynn, P.W., Heidelberger, P., Nicola, V.F., and Shahabuddin, P.: Efficient Esti-
mation of the Mean Time Between Failures in Non-Regenerative Dependability
Models. IBM Research Report RC 19080. Yorktown Heights, New York (1993)
Glynn, P.W., Iglehart, D.L.: Importance Sampling for Stochastic Simulations. Man-
agement Science 35, 1367-1392 (1989)
Goyal, A., Lavenberg, S.S.: Modeling and Analysis of Computer System Availability.
IBM Journal of Research and Development 31, 651-664 (1987)
Goyal, A., Shahabuddin, P., Heidelberger, P., Nicola, V.F., Glynn, P.W.: A Unified
Framework for Simulating Markovian Models of Highly Reliable Systems. IEEE
Transactions on Computers C-41, 36-51 (1992)
Gross, D., Miller, D.R.: The Randomization Technique as a Modeling Tool and
Solution Procedure for Transient Markov Processes. Operations Research 32,
343-361 (1984)
Hammersley, J.M., Handscomb, D.C.: Monte Carlo Methods. London: Methuen
1964
Heidelberger, P.: Fast Simulation of Rare Events in Queueing and Reliability Mod-
els. ACM Transactions on Modeling and Computer Simulation 5, 43-85 (1995)
Heidelberger, P., Nicola, V.F., Shahabuddin, P.: Simultaneous and Efficient Simu-
lation of Highly Dependable Systems with Different Underlying Distributions.
Proceedings of the 1992 Winter Simulation Conference. IEEE Press 1992, pp.
458-465
Jensen, A.: Markov Chains as an Aid in the Study of Markov Processes. Skand.
Aktuarietidskr. 36, 87-91 (1953)
Juneja, S., Shahabuddin, P.: Fast Simulation of Markovian Reliability/Availability
Models with General Repair Policies. Proceedings of the Twenty-Second In-
ternational Symposium on Fault-Tolerant Computing. IEEE Computer Society
Press 1992, pp. 150-159
Keilson, J.: Markov Chain Models - Rarity and Exponentiality. New York: Springer
1979
Lewis, E.E., Bohm, F.: Monte Carlo Simulation of Markov Unreliability Models.
Nuclear Engineering and Design 77, 49-62 (1984)
Lewis, P.A.W., Shedler, G.S.: Simulation of Nonhomogeneous Poisson Processes by
Thinning. Naval Research Logistics Quarterly 26, 403-413 (1979)
Moorsel, A.P.A. van, Haverkort, B.R., Niemegeers, I.G.: Fault Injection Simula-
tion: A Variance Reduction Technique for Systems with Rare Events. Depend-
able Computing for Critical Applications 2. Berlin: Springer 1991, pp. 115-134
Nakayama, M.K.: A Characterization of the Simple Failure Biasing Method for Sim-
ulations of Highly Reliable Markovian Systems. ACM Transactions on Modeling
and Computer Simulation 4, 52-88 (1994)
Nakayama, M.K.: General Conditions for Bounded Relative Error in Simulations of
Highly Reliable Markovian Systems. IBM Research Report RC 18993. Yorktown
Heights, New York (1993)
Nakayama, M.K.: Simulation of Highly Reliable Markovian and Non-Markovian
Systems. Ph.D. Dissertation, Department of Operations Research, Stanford Uni-
versity (1991)
Nicola, V.F., Heidelberger, P., Shahabuddin, P.: Uniformization and Exponential
Transformation: Techniques for Fast Simulation of Highly Dependable Non-
Markovian Systems. Proceedings of the Twenty-Second International Sympo-
sium on Fault-Tolerant Computing. IEEE Computer Society Press 1992, pp.
130-139
Nicola, V.F., Nakayama, M.K., Heidelberger, P., Goyal, A.: Fast Simulation of De-
pendability Models with General Failure, Repair and Maintenance Processes.
Proceedings of the Twentieth International Symposium on Fault-Tolerant Com-
puting. IEEE Computer Society Press 1991, pp. 491-498
Nicola, V.F., Shahabuddin, P., Heidelberger, P.: Techniques for Fast Simulation of
Highly Dependable Systems. Proceedings of the Second International Workshop
on Performability Modelling of Computer and Communication Systems (1993)
Nicola, V.F., Shahabuddin, P., Heidelberger, P., Glynn, P.W.: Fast Simulation of
Steady-State Availability in Non-Markovian Highly Dependable Systems. Pro-
ceedings of the Twenty-Third International Symposium on Fault-Tolerant Com-
puting. IEEE Computer Society Press 1992, pp. 38-47
Parekh, S., Walrand, J.: A Quick Simulation Method for Excessive Backlogs in
Networks of Queues. IEEE Transactions on Automatic Control 34, 54-56 (1989)
Sadowsky, J.S.: Large Deviations and Efficient Simulation of Excessive Backlogs
in a GI/G/m Queue. IEEE Transactions on Automatic Control 36, 1383-1394
(1991)
Shahabuddin, P.: Simulation and Analysis of Highly Reliable Systems. Ph.D. Dis-
sertation, Department of Operations Research, Stanford University (1990)
Shahabuddin, P.: Importance Sampling for the Simulation of Highly Reliable
Markovian Systems. Management Science 40, 333-352 (1994a)
Shahabuddin, P.: Fast Transient Simulation of Markovian Models of Highly De-
pendable Systems. Performance Evaluation 20, 267-286 (1994b)
Shahabuddin, P., Nakayama, M. K.: Estimation of Reliability and its Derivatives
for Large Time Horizons in Markovian Systems. Proceedings of 1993 Winter
Simulation Conference. IEEE Press 1993, pp. 422-429
Shanthikumar, J. G.: Uniformization and Hybrid Simulation/Analytic Models of
Renewal Processes. Operations Research 34, 573-580 (1986)
Stiffler, J., Bryant, L.: CARE III Phase III Report-Mathematical Description.
NASA Contractor Report 3566 (1982)
Van Dijk, N.M.: On a Simple Proof of Uniformization for Continuous and Discrete-
State Continuous-Time Markov Chains. Adv. Appl. Prob. 22, 749-750 (1990)
Part V

Maintenance Management Systems


Maintenance Management System:
Structure, Interfaces and Implementation
Wim Groenendijk
Woodside Offshore Petroleum Pty. Ltd., 1 Adelaide Terrace, Perth, WA 6000,
Australia

Summary. Recent years have seen significant development in maintenance man-
agement within the Oil & Gas industry. Consistently low oil and gas prices, and
smaller, and more remote, exploration discoveries have forced the industry to crit-
ically examine the way it is conducting its business. A much better knowledge of
the interrelationship of the various business processes and a better understanding
of the consequences of maintenance strategies and options on both reliability and
life-cycle cash flows are now seen as essential in order to sustain industry prof-
itability in the longer term. It seems that this development has not been widely
recognised yet by the research community. This paper aims to contribute to an in-
creased awareness of the current thinking on maintenance management within the
Oil & Gas industry. The role of reliability engineering in maintenance management
is discussed, and a plea is made for more dialogue between industry and R&D.
Such dialogue is required to assist academia in meeting industry requirements for
graduate engineers and to encourage further development of the methods and tools
required to support industry needs.

Keywords. Maintenance management, reliability, business process

1. Introduction
The way companies conduct their business differs even if producing identi-
cal products. Their business processes will differ as a result of their specific
business principles, policies and strategies. Management systems supporting
these processes will therefore also be different between companies.
This implies that the maintenance management system supporting one
company's business process may not be applicable to another. This
paper is therefore limited to describing some of the generic steps in structur-
ing the maintenance management system, critical success factors, where to
focus, and how to measure. The discussion is furthermore restricted to the
operational phase, i.e., maintenance input into design is not covered by this
paper.
Throughout this paper our definition of maintenance will be:
"The combination of all technical and associated administrative actions
intended to retain an item in, or restore it to, a state in which it can perform
its required function."
To set the scene, during the Operations phase the general objectives for
maintenance within the Oil & Gas industry are:
- to safeguard the technical integrity of all surface facilities;
520 Wim Groenendijk

- to responsibly optimise short-term cash flow by ensuring availability of
surface facility production capacity when required.
Technical integrity of a facility is achieved when, under specified condi-
tions, there is no foreseeable risk of failure endangering safety of personnel,
environment or asset value.
Recent years have featured consistently low oil and gas prices. Also, new
hydrocarbon discoveries tend to be smaller and in more remote areas, leading
to higher production costs. Partially as a result of these decreasing margins,
the Oil & Gas industry has been forced to critically examine the way it is
conducting its business. A much better knowledge of the interrelationship of
the various business processes and a better understanding of the consequences
of individual strategies and options on both reliability and life cycle cash flows
are now seen as essential in order to sustain industry profitability in the longer
term. This has resulted in the adoption of quality systems for maintenance
management. These provide the necessary building blocks to allow review and
improvement of all maintenance activities such that maintenance is focused
and systematic.

2. The Maintenance Process


2.1 The Process Model
The maintenance management system supports the associated maintenance
process. The management system cannot therefore be implemented effec-
tively until the process has been mapped out and modelled. Business process
analysis can be used to provide a model of the business process. For the
maintenance process, the maintenance model identifies all activities required
to satisfy the maintenance objectives and their relationship to other parts of
the business.
Most important is to have a clear description of the maintenance pro-
cess which is widely understood and accepted throughout the company. The
maintenance activities can be described in as much detail as required, but
always keeping a transparent relationship with the process to which they
belong. Thus maintenance activities can always be seen in context of their
contribution to overall business objectives.
The maintenance business model must fulfill many criteria. It must serve
a "top down" purpose and thus be directly related to a higher-level "Op-
erations" business model: maintenance must be seen as part of the larger
picture. Yet it must also be "bottom up": maintenance engineers must be
able to recognise their own tasks within it. Most importantly in the con-
text of this paper: it must be capable of acting as the starting point for the
development of the maintenance management system.
The development of a business model is the first and possibly the hardest
single step in the development of a management system. The business model
Maintenance Management System 521

acts as the framework within which the activities are defined, including a
description of their logical sequences and their interrelationships with other
activities and other processes.

2.2 Structure

Generically, the maintenance process can be described using a PLAN -
SCHEDULE - EXECUTE - ANALYZE - IMPROVE loop.
2.2.1 Plan. All five stages are self-evidently essential to an effective man-
agement system. However, the course is set in the planning stage. It ensures
that policies and strategies are consistent with those in the rest of the com-
pany and with the corporate objectives. It sets the targets and identifies the
resources to be made available. It identifies what needs to be achieved in the
years ahead.
2.2.2 Schedule. Scheduling sets when things get done. It deals with clash
avoidance, efficient use of resources and minimizing any effects on availability
in order to meet contractual production requirements.
2.2.3 Execution. This is the (only) stage when field activities take place.
In this stage the physical implementation of planned and scheduled activities
takes place. It is the part of the process which yields the return in the form
of product, where most of the resources are consumed, and also where the
biggest (physical) risks are encountered.
2.2.4 Analyse. This stage is where all the results obtained during execution
are examined and performance analysed. The aim of the analysis stage is firstly
to compare performance against plan, and secondly to point the way to do
better than the plan.
2.2.5 Improvement. The final stage (before feedback to the first two) is
improvement in which remedies or improvements are proposed and justified.
This stage is also where the capacity to react to new challenges and oppor-
tunities is established. The improvement stage and the planning stage are
closely linked, as improvements are selected by methods very similar to the
ones by which the original plans were made.

3. The Maintenance Management System

3.1 The Management System

The purpose of a management system is to ensure that the process activities
are performed in a manner which meets agreed customer requirements. The
system also provides a basis (benchmark) to facilitate improvement.
In general, management systems should cover the following aspects:
1. the description of the process, activities and tasks designed to meet cor-
porate and customer requirements with performance measurement and
feedback systems to enable control and continuous improvement;
2. policies, standards and procedures related to the process and activities;
3. controls appropriate to the risks and critical activities of the process;
4. an organisational structure that matches the process, with tasks and
responsibilities defined for each critical activity;
5. a description of the main competencies required from staff to supervise
and carry out the activity/task;
6. information and data systems to enable control and improvement.

3.2 Structure: The Management Cycle

The structure of the maintenance management system should basically follow
the PLAN-SCHEDULE-EXECUTE-ANALYSE-IMPROVE loop discussed
for the maintenance process.
3.2.1 Plan. The starting point for maintenance planning is the setting of
maintenance objectives and strategies. These are derived from the corporate
objectives and strategies. Policies and standards are developed to ensure that
key processes are implemented in line with corporate objectives and comply
with specific statutory requirements.
The next stage concerns the identification and assessment of activities
and inclusion in the long/medium term plans. This is where the maintenance
strategy is further detailed to address the type and frequency of maintenance
for the equipment to be maintained.
Routine maintenance requirements can be identified using methods such
as Reliability Centred Maintenance, a structured method to identify the
maintenance option best suited for equipment given their operating context.
Non-routine maintenance requirements can be identified from analysis of
asset requirements generally using a workshop approach to draw upon expe-
rience of a multi-discipline team. Development opportunities and expected
system changes, as well as threats and constraints, are taken into considera-
tion as different scenarios.
Once the maintenance activities have been identified and the overall ac-
tivity level has been established, resource levels are estimated. The activities
require resources, whether of services, personnel, equipment or materials.
Many of these resources require long lead times; manpower needs recruiting
and training; services and materials need contract strategies.
Finally, the activities that have been identified need to be reviewed against
the availability and/or reliability requirements. Also, the plan needs to be
integrated with all other activities and their mutual impact assessed. At this
stage, the main concern is to have windows in each other's plans to allow the
required work to take place.

3.2.2 Schedule. The activities that have been planned eventually need to
be sequenced so that the work is done as efficiently and with least impact on
availability as possible.
Special attention needs to be paid to identify concurrent activities, i.e.,
to identify whether they can safely be executed simultaneously. Also, an
appropriate change control mechanism needs to be in place to manage short-
term deviations from the plan in response to operational conditions (e.g.
breakdowns, weather conditions etc.).
Finally, detailed work packages are prepared and the required resources
called off.
3.2.3 Execute. The execution phase is where the activities identified on the
plan actually take place. This is where most of the resources are consumed,
therefore it is important to ensure that proper controls are employed to ensure
efficient utilisation of resources. Also, control of concurrent operations needs
to be in accordance with the plans and schedules. Finally, documents and
drawings need to be revised and updated to reflect implemented changes.
3.2.4 Analyse. Analysis is an aid to decision making. In turn, data is the
feedstock of analysis, and the choice of which data to collect, from a company
perspective, is driven by decision making needs. Decision making is forward
looking. Traditionally, maintenance management information systems have
collected vast amounts of information about equipment failures as an aid to
formulate maintenance options.
Analysis of maintenance effectiveness will reveal performance against the
desired state of technical integrity of the facilities. Analysis of maintenance
efficiency is used to control resource performance.
3.2.5 Improve. This is the stage where, based on the previous analysis,
improvement options are selected. These options should be selected after
studying their impact on long-term, preferably life-cycle, profitability. Dis-
counted cash flow methods are used to compare the different options with
respect to their economic viability. The various proposals should then be
ranked and prioritised according to their cost/benefit ratio and introduced
into the plans.
Appropriate change control procedures need to be in place to ensure that
all relevant systems, standards and procedures are updated to reflect the
changes being implemented.

4. Measuring Performance of the Management System

Without appropriate measurements, no activity can be managed properly.
Such measurements can be performed using performance indicators. Perfor-
mance indicators are comparative quantitative measures of actual events, set
against previously projected measures of those events, which additionally
provide a qualitative indication of future projected performance based on
current achievement. Performance indicators should only be defined for the
critical activities in the process.
The process of comparison incorporates an original projection, a current
measure and a quantitative measure of variation between the two. The process
of comparison implies:
1. the validity of the original projection;
2. a requirement to control the outcome;
3. an expected deviation from the original projection;
4. an ability to forecast and to alter current events to reduce deviations;
5. that the measure will indicate the action required to do this.

The most significant point about good performance indicators, therefore,
is that they are useful in indicating future action. Their primary role has
nothing to do with gathering historical data as such.
Performance indicators are generally designed to measure business process
and activity performance in terms of:

1. Effectiveness (an attribute indicating that the products meet specified fit
for purpose requirements);
2. Efficiency (an attribute indicating that the products are produced with
minimum use of resources);
3. Flexibility (effective and efficient in face of change).
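As a concrete (and purely illustrative, not from the paper) reduction of this idea to numbers: a comparative performance indicator needs a projected value, an actual value, and a measure of the variation between the two, which can then be judged against the effectiveness and efficiency criteria above:

```python
# Illustrative sketch only: all names and figures below are hypothetical.
def performance_indicator(projected, actual):
    """Relative deviation of an actual measure from its projection;
    the sign shows the direction of the variation to be managed."""
    return (actual - projected) / projected

# Hypothetical monthly figures for one critical maintenance activity:
availability = performance_indicator(projected=0.95, actual=0.92)        # effectiveness
maintenance_cost = performance_indicator(projected=1.8e6, actual=2.1e6)  # efficiency
```

A negative availability indicator flags an effectiveness shortfall against the projection, while a positive cost indicator flags an efficiency overrun; both point to the corrective action the indicator is meant to trigger.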

To analyse performance of the maintenance management system, per-
formance indicators are required which reflect the achievement of business
process objectives and targets. At activity level, the definition of performance
indicators reflects the efficiency, effectiveness and flexibility of activities, as
well as their contribution to the achievement of business process objectives.
Business process performance indicators should demonstrate whether the
business process is achieving its objectives and targets. Defining business
process performance indicators is a three-step procedure:

1. agree on the definition of the business process, as well as on its output and customers;
2. define a mission statement for that process clearly stating its objectives;
3. with the help of people who understand that process, define performance
indicators which indicate the achievement of those objectives and which
are both possible and practical to measure.
Starting from the principal maintenance objectives quoted in Section 1, the maintenance performance indicators should be defined in terms of safeguarding technical integrity (safety, environment, and maintaining asset value), availability, and cost-effectiveness.
Maintenance Management System 525

The performance, or measures of performance, of an activity are only meaningful if the contribution of this activity to the overall business process is clearly understood. Therefore the performance of an activity cannot be
measured in isolation, but in connection to its impact on the achievement of
the business process objectives.
One of the consequences of this approach is that performance indicators
should only be defined for activities or groups of activities for which it is
possible to define a clear output and a relevant (and significant) impact on
the business process.
For each of these activities or group of activities, the definition of perfor-
mance indicators is a process involving four steps:
1. Define, as specifically as possible, the activity's output (goods or ser-
vices);
2. Define the impact of the activity's output on the business process;
3. Establish measures of performance - performance indicators - which indicate whether the activity is carried out effectively, in light of the defined
impact and scope;
4. Map these performance indicators in relation to the performance objec-
tives defined at business process level.

5. Leverage For Improvement

Continuous improvement is necessary in order to achieve the maintenance
objectives. Below three aspects of maintenance management are singled out
that can be used to provide leverage for improvement, in increasing order of
effort required, and benefit obtained.

5.1 Equipment Reliability


It is the deployment of assets ("the equipment") that yields the return in the
form of product. It is the task of maintenance to ensure that the production
capacity of the assets is available when required. Improvement of equipment reliability/availability is an important means of influencing the production capacity of the producing assets.
Options for reliability improvement of equipment are usually based on
analysis of failures or condition monitoring data. A host of statistical tech-
niques is available to forecast remaining-life in service, time between failures,
etc. The results of such data analysis are often used for maintenance optimi-
sation purposes, e.g. optimisation of maintenance frequencies.
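The text does not name a specific frequency-optimisation model; a common choice for this purpose is the classical age-based replacement model, sketched below with illustrative (made-up) cost figures and Weibull parameters. It balances a cheap preventive replacement against an expensive failure replacement and searches for the interval with the lowest long-run cost rate.

```python
import math

def weibull_survival(t, shape, scale):
    """Probability the unit survives beyond age t (two-parameter Weibull)."""
    return math.exp(-((t / scale) ** shape))

def age_replacement_cost_rate(T, cp, cf, shape, scale, steps=1000):
    """Expected cost per unit time when preventive replacement is done at age T.
    cp: preventive replacement cost; cf: failure (corrective) cost, cf > cp."""
    # Expected cycle length = integral of the survival function from 0 to T
    dt = T / steps
    expected_cycle = sum(weibull_survival((i + 0.5) * dt, shape, scale)
                         for i in range(steps)) * dt
    # Expected cycle cost: preventive cost if the unit survives to T, failure cost otherwise
    p_survive = weibull_survival(T, shape, scale)
    expected_cost = cp * p_survive + cf * (1 - p_survive)
    return expected_cost / expected_cycle

# Search a grid of candidate intervals for the cheapest one (units, e.g. months)
candidates = [t / 10 for t in range(1, 301)]
best_T = min(candidates,
             key=lambda T: age_replacement_cost_rate(T, cp=1.0, cf=10.0,
                                                     shape=2.5, scale=12.0))
print(best_T)
```

With a wear-out failure pattern (shape > 1) and failures ten times as expensive as preventive work, the optimum falls well below the Weibull scale, as expected.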
System reliability/availability models can be used to analyse production
system performance based on the expected failure and repair patterns of the
individual equipment contained in the system. The results of such analysis
can be used to decide if and where bottlenecks exist and how these should
be addressed, e.g., by introducing redundancy, or by redirecting maintenance
effort.
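The kind of system availability model described above can be sketched as a series chain of k-out-of-n parallel subsystems; the subsystem names, redundancy levels and per-unit availabilities below are illustrative only, not taken from the text.

```python
from itertools import combinations
from math import prod

def parallel_availability(avails, k=1):
    """Availability of a subsystem that needs at least k of its n units up.
    Units fail independently; avails lists the per-unit availabilities."""
    n = len(avails)
    total = 0.0
    # Sum the probability of every state with at least k units up
    for up in range(k, n + 1):
        for up_set in combinations(range(n), up):
            p = 1.0
            for i in range(n):
                p *= avails[i] if i in up_set else 1.0 - avails[i]
            total += p
    return total

# A production system modelled as subsystems in series, each a k-out-of-n group
subsystems = {
    "power generation": ([0.95, 0.95, 0.95], 2),  # 2-out-of-3 gas turbines
    "compression":      ([0.90, 0.90], 1),        # 1-out-of-2 compressors
    "export pump":      ([0.98], 1),              # single unit, no redundancy
}
sub_avail = {name: parallel_availability(a, k) for name, (a, k) in subsystems.items()}
system_availability = prod(sub_avail.values())
bottleneck = min(sub_avail, key=sub_avail.get)
print(system_availability, bottleneck)
```

Here the single export pump, despite its high unit availability, is the bottleneck, illustrating how such models point to where redundancy or extra maintenance effort pays off.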
5.2 Maintenance Strategy Setting

Optimisation of asset maintenance can at best yield marginal benefit only,
if the strategy for maintenance has not been addressed. First the type of
maintenance to be performed on the assets has to be determined, down to
equipment level.
Structured and systematic methods exist to assist in selecting mainte-
nance options. One such method is Reliability-Centred Maintenance, which
is increasingly being used within the Oil and Gas industry. One of the im-
portant features of Reliability-Centred Maintenance is that it systematically
focuses attention on the effect of equipment failure modes on the function
of that equipment within the system. Maintenance effort is then directed at
those critical failure modes that either have an impact on safety, environment,
or technical integrity of the installation, or that directly affect the availability
of production capacity.
Setting the maintenance strategy as discussed above ensures that the
maintenance is focussed, i.e., maintenance is done where it is required to
achieve the maintenance objectives; it is not done where it is not required.

5.3 Management System Effectiveness

Efficient and effective implementation of the maintenance strategy can only
be achieved if there is a well-defined maintenance management system in
place to support and control it. For the system to be effective it is required
that the maintenance process, responsibilities, procedures and resources be
documented and deployed in a consistent manner.
As noted before, of the five stages in the management cycle discussed
above, planning is by far the most important. The course is set in the planning
stage and if this is wrong, or even non-optimal, there is little that subsequent
stages can do to put this right. Note that in the improvement stage, new
challenges and opportunities will require amendment and updating of the
plans.
This suggests that maximum leverage for improvement is obtained by fo-
cussing management attention on the Plan and Improve stages of the main-
tenance management loop.

6. R&D Contribution
6.1 Reliability Engineering

As apparent from the above discussions, maintenance management is concerned with the management of reliability/availability. It is therefore a valid
question to what extent Reliability Engineering as a discipline can provide
some of the tools/methods used in maintenance management.
6.1.1 The Role. Traditionally, reliability engineering has been concerned
mainly with equipment reliability, focusing on the analysis of various relia-
bility statistics, such as Mean Time Between Failures (MTBF), repair times,
down times, trending of condition monitoring results, troubleshooting where
necessary, etc. This in general leads to a "fire-fighting" approach, where ac-
tion is taken only once problems have occurred. However, valuable experience
is gained by the analysis of existing problems, which may lead to the future
prevention of similar problems or mitigation of their consequences.
It seems that this perspective is changing. Today's reliability engineer
working in the Oil and Gas industry is expected to take a different, more
pro-active, approach towards the reliability and availability of the production
facilities. This is motivated by the recognition that the main determinants
for the reliability of a facility are the "softer" issues, such as management
processes, planning systems, purchasing and contracting procedures, training
of staff, etc. Experience indicates that the reliability of similar equipment
in similar service under different operators may differ considerably. At the
same time, this line of thinking acknowledges that, given the proper tools,
the (mechanical, instrument, electrical) discipline engineers are the people
best placed to look after the reliability of "their" equipment. The reliability
engineer in that environment has to ensure that there is a structure in place
enforcing and facilitating the regular review of the reliability performance
of the equipment and systems, and which supports the analysis aiming at
measures taken to improve the performance. To avoid costly local "sub"-
optimisation of reliability where, from a "systems" point of view, it is not
justified, the role of the reliability engineer must be a coordinating one.
This then implies that the traditional role of the reliability engineer is
changing from a "specialist troubleshooter with statistical skills" to that of a more high-level, systems-oriented coordinator, who is one of the drivers of
the maintenance process. This new role still requires a sound knowledge of
traditional reliability engineering techniques, and familiarity with the other
engineering disciplines, but in addition requires a better understanding of
the business and an ability to consider reliability issues from an economic perspective.
6.1.2 The Tools. The traditional tools of the reliability engineer are statis-
tical packages, availability assessment tools, Failure Modes and Effects Analy-
sis (FMEA), Failure Modes, Effects and Criticality Analysis (FMECA), fault
trees, event trees etc. While all those tools are still needed to support quantifi-
cation of system performance and the enhancement of equipment reliability,
the toolbox has expanded to the point where business modelling, venture
life reference planning, Reliability Centred Maintenance principles, etc. are
actively used.
One of the more significant developments in this area is the advent of
generic, industry-wide reliability databases such as the Offshore Reliability Database (OREDA). OREDA is a joint industry project of ten operators
in the Oil and Gas Industry, who have collected reliability data for various
topsides equipment on offshore production platforms. The data present an
industry-wide benchmark opportunity for offshore reliability data. The true
potential of the data, however, is still to be discovered by many, and includes
structured feedback of equipment reliability data to manufacturers, use in selection and purchasing of equipment, etc.

6.2 Reliability Engineering Challenges


The changing role of Reliability Engineering as described above poses a num-
ber of challenges to the field. The perspective needs to change from a rather mechanistic view of failures to a consideration of the organisational and managerial systems in place to manage the consequences of failures and their possible future prevention. There is a clear business incentive to pursue the route outlined above, and it seems that this evolution is inevitable if we are to meet
the future needs of industry.
It seems that not much progress has been achieved in reliability research
over recent years. This is notwithstanding the amount of effort spent. Most
of the research seems to be directed at mathematical analysis of special con-
figurations or various duty/stand-by regimes. The results from such research,
although widening the state of the art in reliability engineering, are usually
not easily extendible or applicable to situations encountered in practice. It
seems that little effort is directed at developing techniques for solving real-size
problems. As an example, methodologies such as Reliability Centred Mainte-
nance, which at present are state of the art in industrial applications, are still
largely unsupported by professional tools to analyse the wealth of informa-
tion becoming available in the course of the Reliability-Centred Maintenance
programme.
There is an opportunity for the development of techniques for forecasting
life-cycle availability and reliability profiles, manpower profiles and economic
indicators from data derived in a typical Reliability-Centred Maintenance
study. Amongst others, this requires the development of techniques for the
transient analysis of random processes of the type occurring in our problem
areas. Many of the mathematical results required in this area already exist.
What it takes is to get those techniques into a proper framework to allow
their application in an industrial environment. Another example would be the
analysis of data from reliability data bases; such data is usually censored in a
number of ways: first of all because the observation period is usually a period
somewhere in the life of a piece of equipment, secondly because preventive
maintenance is often applied to that equipment.
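A standard nonparametric way to handle such censored observations (not named in the text) is the Kaplan-Meier estimator, sketched below; items that were preventively maintained, or still running when the observation window closed, are treated as censored. The observation values are illustrative.

```python
def kaplan_meier(observations):
    """Kaplan-Meier survival estimate from (time, is_failure) pairs.
    is_failure=False marks a censored observation (e.g. preventively
    maintained, or still in service at the end of the observation period).
    Assumes distinct observation times for simplicity."""
    data = sorted(observations)
    n_at_risk = len(data)
    survival = 1.0
    curve = []
    # At each failure time, multiply in the conditional survival probability;
    # censored items simply leave the risk set
    for time, is_failure in data:
        if is_failure:
            survival *= (n_at_risk - 1) / n_at_risk
            curve.append((time, survival))
        n_at_risk -= 1
    return curve

# Run-hours to failure (True) or to censoring (False)
obs = [(500, True), (700, False), (900, True), (1200, False), (1500, True)]
curve = kaplan_meier(obs)
print(curve)
```

The censored items at 700 and 1200 run-hours still contribute information: they raise the estimated survival at later failure times compared with simply discarding them.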
Finally, apart from some notable exceptions, it seems that reliability en-
gineering is hardly addressed in most engineering curricula at technical uni-
versities. Industry would benefit from graduate engineers having a better
understanding of at least the first principles of reliability engineering in the
above sense. Guest lecturers from industry should be used where practical
to provide students with an insight into practical and realistic problems. A dialogue between industry and academia should be actively promoted to keep
academia up to date on industry requirements thus limiting the divergence
between theory and best practice. Within industry, engineers should be kept aware of new developments by incorporating reliability engineering training into the regular training programmes.
PROMPT, A Decision Support System for
Opportunity-Based Preventive Maintenance
Rommert Dekker 1 and Cyp van Rijn 2
1 Econometric Institute, Erasmus University Rotterdam, 3000 DR Rotterdam, The
Netherlands
2 Beeckzanglaan IF, 1942 L8 Beverwijk, The Netherlands

Summary. In this paper we describe an operational decision support system, called PROMPT, for the systematic optimisation of maintenance activities and their execution at opportunities. Here, an opportunity is any event at which a unit can be maintained preventively without incurring cost penalties for the shutdown of the unit. The d.s.s. was developed to deal with the random occurrence of opportunities and their restricted duration. Moreover, it is able to handle a multitude of different maintenance activities. Finally, we describe experience gained in a field test with PROMPT.

Keywords. Maintenance, decision support system, optimisation, case study

1. Introduction

Maintenance management has been described as the last frontier of scientific
management. Whereas in many other fields, such as production, logistics,
personnel, finance and administration, management science and industrial
engineering have been active for a long time and their results have had their
impact, the maintenance manager has long been missing tools to improve his
decision making. The work of a maintenance man has been dominated by
unexpected events, making management a difficult task. The last decade, however, has seen increasing attention to maintenance management. The underlying reasons are twofold. First of all, the amount of equipment in production plants has increased over time and, despite improvements in maintainability,
large percentages of personnel and operational expenditure are in the main-
tenance area. The second reason is that trends in production technology and
concepts (like Just-In-Time) have stressed a high, timely and almost contin-
uous production, thereby making downtime costlier and maintenance more
important.
A main improvement in the last decade for maintenance management has
been the introduction of maintenance management information systems, pro-
viding the maintenance manager with up to date information. The area of
decision support however, runs behind, and although maintenance manage-
ment information systems contain a lot of data, these will be worthless, if
they do not improve decision making.
Within the area of operations research a lot of effort has been spent on
developing and analysing maintenance optimisation models. It has been such
a fruitful area, that review papers mention hundreds of articles (Sherif and
Smith 1981 list 524 papers) and many more have appeared since. The impact
on actual operations of all these papers has been marginal, and in few other fields does there seem to be such a discrepancy between theory and practice. This
gap is being narrowed by the improvements and cost reductions in computer
technology, making computers and software also available for the maintenance
function, and thereby allowing the use of sophisticated models.
In this paper we describe a decision support system, called PROMPT,
which uses operations research models to assist the maintenance manager
in optimising preventive maintenance and to support him in executing pre-
ventive maintenance at the right time. It is an attempt to bridge the gap
between theory and practice, yet most of the theory needed was only devel-
oped during the construction of the d.s.s. PROMPT addresses the preventive maintenance that is carried out to reduce downtime or to secure safety. It
offers both planning and scheduling tools to the user and is especially devel-
oped to make use of maintenance opportunities, thereby avoiding scheduled
downtime.
In this paper we will first give an overview of the problem characteris-
tics for which PROMPT was developed. Thereafter we give an overview of
PROMPT, its models, and what it is doing. Furthermore, we state our ex-
periences with a field test of the PROMPT system, which considered both
initialisation of the system as well as the effect of its advice. Finally we give
an evaluation of the system.
The PROMPT system which is described in this paper is the successor
of an earlier prototype which is described in Van Aken et al. (1984). Sim-
ilar to the present PROMPT system the prototype was directed at giving
advice for opportunity based preventive maintenance. Although this system
was considered to be successful, it had two major shortcomings. First of
all, its objective was to increase reliability, whereas in a later stage not all failures were considered to be of equal importance. Secondly, it could not
indicate how much preventive maintenance is cost effective, as the models
assumed that more preventive maintenance always implied more reliability.
The present PROMPT system, as described here, is a completely new sys-
tem, in which we took advantage of the experiences obtained with the earlier
prototype.
There are no comparable systems to PROMPT. In Dekker (1992) an
overview is given of maintenance decision support systems. Most of them are tactical tools, which address a single unit or component and allow the user to optimise a single action on it. Some maintenance management information systems
contain a reliability module, but hardly ever an optimisation module and
certainly not for opportunity maintenance.
2. Problem Description
If preventive maintenance is applied to a unit, there is a preference to carry
it out only at those moments in time when the unit is not required for pro-
duction. In some cases, where units are used continuously (e.g. in the process
industry) this may cause problems. Execution of preventive maintenance is
then restricted to costly annual shutdowns. In some systems, however, short-lasting interruptions of production occur from time to time for a variety of reasons, e.g. breakdowns of or maintenance on essential units. During these interruptions some other units are not required and can be maintained preventively,
in which case we speak of maintenance opportunities. Unfortunately, these
opportunities can usually not be predicted in advance. Because of the random occurrence of opportunities and of their limited duration, traditional
maintenance planning fails to make effective use of them.
The objective of PROMPT is to give decision support for opportunity-
based preventive maintenance. For PROMPT an opportunity is defined as
any moment in time at which preventive maintenance can be carried out with-
out adverse effects of a unit shutdown being incurred. The user of PROMPT
has to identify the opportunities and to report them to the system to get advice. PROMPT assumes that although opportunities occur randomly, they
do occur repeatedly and provide 'enough' time to use them for preventive
maintenance. PROMPT primarily focuses on routine preventive maintenance, as first-line maintenance (greasing, etc.) can be executed during normal operations and major overhauls are too large for opportunities and have to be
planned in advance.
An opportunity-based policy is of importance for continuously used equip-
ment, for which downtime costs are high. Examples of such equipment are
gas turbine driven power generators at offshore production platforms. A typical aspect of offshore platforms is that all the equipment has to be installed in a limited amount of space, and that therefore in the design phase as little equipment as possible has been installed, making downtime costs high. Another aspect is that the production of the platform has a high economic value.
Although usually production is not lost but rather deferred, there is a strong
incentive to recover the large investments as soon as possible and therefore
even deferred production has a high cost value.
To make effective use of opportunities, preventive maintenance has to
be split up into packages which can be fully carried out at an opportunity.
Mechanical, instrument and electrical maintenance are all included, and different age indicators, like runhours and numbers of starts and stops, are allowed.
The tasks PROMPT has to carry out are threefold. First of all, it should indicate how much preventive maintenance, if any, is cost-effective, and for each maintenance package it should determine an optimal policy. Secondly, PROMPT should schedule the cost-effective packages in such a way that the optimal interval is adhered to as much as possible. For safety-related maintenance PROMPT assumes that a maximal interval can be specified by
the user. In that case PROMPT tries to execute the safety-related packages at
the last possible opportunity within the required interval. Finally, PROMPT should record failures and preventive maintenance results so that in the course of time a better insight into failures can be obtained.
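The rule for safety-related packages, executing at the last possible opportunity within the required interval, can be sketched as follows. PROMPT itself reacts to opportunities as they are reported, so the list of opportunity times here is purely illustrative.

```python
def plan_safety_package(last_done, max_interval, opportunity_times):
    """Pick the execution moment for a safety-related package: the last
    opportunity that still falls within the required maximal interval.
    Falls back to the deadline itself (a forced, scheduled shutdown) if
    no opportunity arrives in time."""
    deadline = last_done + max_interval
    in_time = [t for t in opportunity_times if last_done < t <= deadline]
    return max(in_time) if in_time else deadline

# Package last executed at day 0, must be redone within 180 days;
# opportunities (shutdowns of the unit) occur at these days:
opps = [40, 95, 150, 210]
print(plan_safety_package(0, 180, opps))   # -> 150
```

Waiting for the latest feasible opportunity maximises the realised interval without ever exceeding the user-specified maximum.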

3. An Outline of PROMPT
3.1 Introduction

In order to reach the objectives set, a variety of problems have to be tackled.
Here we describe some of the ideas behind PROMPT and its main optimisa-
tion methods.

3.2 Hierarchy of Units

Any real system can be decomposed into units, parts and components. Sev-
eral hierarchies may exist, but we use the one applied by maintenance. Such a
decomposition is important since it imposes a lot of rules for maintenance and
for each level different information may be available, which has to be trans-
lated to other levels. In agreement with the operating company for which
PROMPT was developed, the following hierarchy was assumed. In PROMPT a system is defined as a set of units performing a specified task with a
measurable output on which a lost or deferred product value can be set.
PROMPT assumes that a system is built up of subsystems in a series con-
figuration. A subsystem consists of one or multiple units in parallel, fulfilling
a certain task. A unit can be a gas turbine, compressor, pump or any other
physical entity. The unit is the highest level at which PROMPT gives advice.
PROMPT assumes that if an opportunity occurs for a unit, it occurs for all parts of the unit. The planned maintenance routines are subdivided into maintenance packages, each consisting of one or more maintenance activities, which are not overlapping. Apart from this maintenance hierarchy, PROMPT also considers a physical hierarchy, in which the unit is subdivided into elements, each having one or more failure modes. The user is free to define the elements, which do not have to correspond with one maintenance activity.

3.3 Balancing Maintenance Costs and Benefits

For time-based preventive maintenance which is carried out to reduce the
number of failures and to prevent unscheduled downtime, PROMPT should
be able to determine the best frequency. This requires a balancing of the
costs and benefits of that maintenance. The costs are easily calculated, as
they consist of manhours and materials. Benefits are more difficult to quan-
tify. One can determine the number of failures prevented, but not all failures
are equally important. An alternative is to determine the amount of unscheduled downtime prevented (i.e. that downtime that would be caused by
failures), which makes more sense. This leaves two ways to determine how
much maintenance should be carried out. First of all, one can set a target for
the intrinsic availability (i.e. the availability excluding standby hours), which
comes down to setting a target value for the unscheduled downtime, and de-
termine how much planned maintenance is required to reach that target. A
second option is to set a cost penalty to unit downtime and determine for
each maintenance package whether its execution pays off against the savings
in reduced downtime. We have chosen the latter option for a number of reasons. It may seem equally difficult to set either a target for downtime or a cost penalty for downtime, and special models may be required for each. One
should realise however, that a target availability does not take the value of
production into account, which can be done for a cost penalty. Furthermore,
a cost penalty allows a balancing for each maintenance package, whereas a
target availability requires a simultaneous balancing of all maintenance pack-
ages, which is far more difficult. The latter could be simplified, by making
some artificial choices, but that would be arbitrary. A final argument is that
a target availability is more difficult to handle in case of unrevealed failures
causing no direct downtime.
As said before, PROMPT is based on cost balancing, and a special model has been set up to support the user in setting such a cost penalty.

3.4 Unit Downtime Cost Penalty


In general it is a difficult task to set a cost value to unit downtime as systems
may consist of many units, each performing specific tasks. The same was true
for the systems we considered appropriate for application of PROMPT. No
downtime cost values were available to the development team. Complicating aspects included the presence of (non-identical) standby units, variations
in demand and the fact that for utility units output has an indirect value (e.g.
what is the value of 1 MWh?). Another problem, e.g. on production platforms, is that downtime may result in deferred rather than lost production.
Hence there should be a cost value for deferred production, which may depend
on many factors (e.g. tax regimes). Luckily, this aspect had been tackled by
the company's economic department. It is in fact partly a subjective problem,
as a cost value for deferred production is in fact a statement of management
on how much they are prepared to pay for prevention of deferred production.
In fact the same holds for the unit downtime penalty: it is a statement of
how much management is prepared to pay to prevent unit downtime.
Given a cost value for deferred production at system level, we developed
a special economic model to deal with the other aspects. The model assumes
that the unit in question forms with other (not necessarily identical) parallel
units a subsystem. The effect of loss of the unit is considered at subsystem
level only and unavailability of other subsystems is neglected in this respect.
Basically the cost penalty for a given unit is calculated as follows. First an
enumeration of the states (either working or failed) of all other units in the
subsystem is performed. Next for each combined state of other units the costs
caused by losing the unit in question are determined. The cost penalty then
follows by taking a weighted summation of the costs per state multiplied with
the probability of occurrence of that state. Notice that this is a marginal cost
value, i.e., the costs incurred by one hour of extra downtime of a unit. It is
not the allocation of the actual downtime costs over the units. The model
is an extension of the so-called k-out-of-n availability models to nonidentical
machines and varying demand. It is not clear whether this cost allocation
is the best one. There is almost no literature in this respect, although the
problem arises in most systems with parallel units. Almost all papers assume
that either the cost penalty is given for a particular unit, or that the unit has
only one failure mode, which is an unrealistic assumption. For utility units
we considered the production systems sustained by it. Costs of downtime of
a utility unit then follow from the loss or deferment of production of those
production systems which have to be shut down because of the loss of the utility unit. In this assessment we take into account the availability of other utility units which are capable of taking over the duty of the unit considered.
The unit downtime cost penalty was explicitly stored in the database of
the d.s.s. with the idea that it might change in time, because of e.g. depletion
of the field from which the platform was producing.
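The state enumeration described above can be sketched as follows, under simplifying assumptions not stated in the text: a single capacity figure per unit, a constant demand, independent unit states, and a shortfall costed at a per-unit deferment value. All numbers are illustrative.

```python
from itertools import product

def downtime_cost_penalty(unit_capacity, other_units, demand, deferment_cost):
    """Marginal cost (per hour) of one extra hour of downtime of a unit,
    following the state-enumeration idea in the text.

    other_units: list of (capacity, availability) for the parallel units
    in the same subsystem. Production shortfall against demand is costed
    at deferment_cost per unit of product."""
    penalty = 0.0
    # Enumerate every up/down state of the other units in the subsystem
    for state in product([True, False], repeat=len(other_units)):
        p_state = 1.0
        cap_others = 0.0
        for (cap, avail), up in zip(other_units, state):
            p_state *= avail if up else 1.0 - avail
            cap_others += cap if up else 0.0
        # Extra shortfall caused by losing the unit in question, in this state
        short_with = max(0.0, demand - (cap_others + unit_capacity))
        short_without = max(0.0, demand - cap_others)
        penalty += p_state * (short_without - short_with) * deferment_cost
    return penalty

# Two parallel pumps of 60 units/h each, demand 100 units/h,
# deferred production valued at 10 per unit
print(downtime_cost_penalty(60, [(60, 0.9)], demand=100, deferment_cost=10.0))
```

As in the text, this is a marginal cost: the expected extra shortfall per hour of downtime of the unit, weighted over the states of its neighbours, not an allocation of actual downtime costs.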

3.5 Failure Models

After having established a cost penalty for unit downtime, we will in this
section consider the positive effects of each preventive maintenance activity
in detail. For reasons of language simplicity we will regard in this section
all elements addressed by one activity as being one component (although in
practice this is not necessarily the case). In PROMPT a failure is defined as
"any event after which a component stops functioning in a prescribed way".
In general two types of failures should be distinguished, viz. revealed and
unrevealed failures and a separate failure model should be used for each. A
failure model describes the relationships between failure and its consequences
and contains a quantitative prediction mechanism of failures. The latter oc-
curs through probability distributions, which may be in any type of condition
indicator (e.g. calendar time, runhours, etc.), as long as the indicator is pre-
dictable in time. PROMPT's failure models have been set up in such a way
that they are consistent with the findings of inspections.
3.5.1 The Revealed Failure Model. The revealed failure model assumes
that a failure is directly noticed and that an appropriate action is undertaken.
Consequences of the failure are assumed to occur directly after the failure.
As a result of the failure the unit may breakdown with a certain probability,
Pud (to be specified per component). Costs of failure are split up into indirect
cost due to unit downtime (in case it breaks down) and direct costs due to
repair of the component. If the expected downtime amounts to d hours, the
unit downtime cost penalty to Cud and the repair costs to Cr, then the total expected cost of failure cf is given by cf = Cr + Pud · d · Cud. Time to failure is modelled through a two-parameter Weibull distribution with shape parameter λ and scale parameter β. Other data the user had to specify included average
downtime in case of a unit breakdown, average time needed for repair, average
number of men required for repair and additional material costs (normal
material costs were incorporated as a surcharge on the manpower costs).
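The expected cost formula of the revealed failure model, together with the mean of the Weibull time-to-failure model, can be written down directly; the numbers below are illustrative only.

```python
import math

def expected_failure_cost(c_repair, p_unit_down, downtime_hours, c_downtime_per_hour):
    """Total expected cost of one revealed failure: direct repair cost plus,
    with probability Pud, the indirect cost of the unit being down d hours.
    Implements cf = Cr + Pud * d * Cud from the text."""
    return c_repair + p_unit_down * downtime_hours * c_downtime_per_hour

def weibull_mean_life(shape, scale):
    """Mean time to failure of a two-parameter Weibull distribution."""
    return scale * math.gamma(1.0 + 1.0 / shape)

cf = expected_failure_cost(c_repair=2000.0, p_unit_down=0.3,
                           downtime_hours=8.0, c_downtime_per_hour=5000.0)
print(cf)   # 2000 + 0.3 * 8 * 5000 = 14000.0
print(weibull_mean_life(2.0, 1000.0))
```

The split into direct and indirect costs makes explicit that most of the expected failure cost can come from downtime rather than from the repair itself.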

3.6 The Unrevealed Failure Model


The unrevealed failure model assumes that a failure of the component may remain hidden until either an inspection takes place or some severe consequences occur. As it is not the failure event itself that is important, but the time spent in the failed state, the model assumes a cost rate for being in a failed condition (like in the
Barlow and Hunter 1960 model). This cost rate is obtained by assuming that
the time between the component failure and consequences is quite long and
may be approximated by an exponential distribution. Let Cc denote the cost
value of these consequences. By specifying the probability Pud of these consequences occurring in a certain interval of length T, given a component failure halfway through that interval, one can then calculate the cost rate cfr from cfr = 2 · Pud · Cc / T. Costs of inspection and subsequent action were assumed
to be independent of the resulting action and had to be specified by the
user. Although we first assumed that the time to failure was exponentially
distributed, we later changed it to a two-parameter Weibull distribution.
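A minimal sketch of the cost-rate formula above, with illustrative numbers:

```python
def failed_state_cost_rate(p_consequence, c_consequence, interval):
    """Cost rate (per unit time) of being in an unrevealed failed state,
    implementing cfr = 2 * Pud * Cc / T from the text: Pud is the
    probability of the severe consequence occurring within an interval of
    length T, given that the component failed halfway through it."""
    return 2.0 * p_consequence * c_consequence / interval

# A hidden failure with a 5% chance of a 100000-cost incident within a
# 1000-hour interval costs this much for every hour spent in the failed state:
print(failed_state_cost_rate(0.05, 100000.0, 1000.0))   # -> 10.0
```

The factor 2 reflects the assumption that the failure occurs halfway through the interval, so the component is exposed for T/2 on average.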

3.7 Condition Indicators


Apart from calendar time, PROMPT also allows other cumulative indicators,
such as runhours and number of starts and stops. The only requirement
for a cumulative indicator was that it can be predicted in time. To this
end we used both a historical estimate of the time conversion factor for
the long run and an exponential smoothing prediction mechanism to predict
the conversion factor in the short run. We had hoped to include state condition
indicators derived from condition monitoring as well. However, none of the
state condition indicators we are aware of allows a time-related quantitative
prediction of component failure (in terms of probabilities). Instead, they
merely indicate whether some failure (often without saying which) is imminent
or not, and are therefore not suited to planning maintenance at opportunities.
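The short-run prediction of a conversion factor (e.g. runhours per calendar day) can be sketched with simple exponential smoothing. The smoothing weight and the data below are our own assumptions, not PROMPT's values:

```python
def smoothed_conversion_factor(observations, alpha=0.3):
    """Exponentially smoothed short-run estimate of a conversion factor.

    observations: recent conversion-factor measurements, oldest first
    alpha: smoothing weight (hypothetical choice)
    """
    estimate = observations[0]
    for obs in observations[1:]:
        estimate = alpha * obs + (1.0 - alpha) * estimate
    return estimate

# Runhours per calendar day observed over four recent days (hypothetical):
print(smoothed_conversion_factor([20.0, 22.0, 18.0, 24.0]))
```

The smoothed value tracks recent usage more closely than a long-run historical average, which is what makes it suitable for predicting the indicator over the short interval to the next opportunity.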

3.8 Preventive Maintenance Packages


Opportunity-based maintenance requires that preventive maintenance is split
up into packages. The larger the packages are, the more difficult it may be to
execute them at an opportunity of limited duration (say, one or two days). On
PROMPT, A DSS for Opportunity-Based Preventive Maintenance 537

the other hand, it is usually neither economical nor convenient from an
administrative point of view to execute all maintenance activities separately.
Hence activities were grouped into packages. We therefore assumed that the
user would be able to define maintenance packages. It further appeared
practical to advise only full maintenance packages, even if one of their activities
had already been carried out because of a failure. Furthermore, failures usually
provided no time for preventive maintenance, as the failed component had to
be repaired as soon as possible and no time was left over. The user had to be
given the freedom to report either a renewal of the component or a repair to
its state before the failure.

3.9 Optimisation Models


3.9.1 General Approach. The optimisation problem faced by PROMPT
can be summarised as: "plan and schedule a number of maintenance packages,
each consisting of one or more maintenance activities, at randomly occurring
opportunities of restricted duration". The planning should determine the
long-term optimal policy, whereas the scheduling should indicate, at a given
opportunity, which packages are to be carried out with what priority, given
their long-term optimal policy. Realistic numbers of maintenance packages
lie between 50 and 100.
Although PROMPT was set up with one specific application in mind, the
idea was that it should be as general as possible, and thus not focus on
one unit specifically. This resulted in the following detailed problem
characteristics and assumptions.
(i) We assumed that the occurrence of opportunities could be described by a
renewal process and that a user was able to specify both a mean and a
variance of the time between opportunities, valid in the long run.
(ii) A user should have the freedom, on the other hand, to overrule the
long-term distribution of the interval to the next opportunity if he has
more information available.
(iii) Decisions concerning the execution of maintenance packages only need
to be taken at opportunities.
(iv) The opportunity duration is not known beforehand, and even during
the actual opportunity it may change. Therefore no exact duration can be
used in the scheduling.
(v) Interactions of any kind between maintenance packages may be
neglected.
A literature search revealed some opportunity models (see e.g. Jorgenson et
al. 1967, Woodman 1967, Duncan and Scholnick 1973, Sethi 1977, Vergin and
Scriabin 1977 and Backert and Rippin 1985), but none of them was capable of
dealing fully with our problem. Some papers applied Markov decision models
in which the number of components determined the dimension of the state
space. This approach, however, is computationally intractable in the case of
more than three components. Jorgenson et al. (1967) presented a model for
multiple components with exponential times between opportunities, but did
not specify how to optimise. None of the models was able to deal with a
restricted opportunity duration, nor with different failure models.
We therefore developed novel models to deal with this complex problem.
In our case opportunities are created by causes outside the unit, and upon
failure of one of its components the unit is repaired as soon as possible, so
that no time for further preventive maintenance is available. Our first
conclusion was therefore that the only interaction between the packages consisted
of competing for the restricted time at an opportunity. Accordingly we reduced
the original problem to the following: "determine for each maintenance package
separately an optimum policy, which indicates when it should be carried
out at an opportunity, independently of all other packages. Furthermore,
determine from the outcomes of these models a priority measure with which
maintenance packages should be executed at a given opportunity". To this
end we introduced the so-called one-opportunity-look-ahead policies, which
can be considered as a generalisation of the marginal cost approach (originally
introduced by Berg 1980). At each opportunity these policies compare, for
each package, the costs of deferring the execution to the next opportunity
with the minimum long-term costs. In the next sections the approach is
discussed in more detail.
3.9.2 Maintenance Activity Optimisation Models. Consider a maintenance
activity addressing a revealed failure of a specific component. Basically
the maintenance optimisation can be tackled by the age or block replacement
model, with the extra restriction that preventive maintenance is restricted to
opportunities. We took the block replacement model, since it can be
extended to multiple activities in a package and non-exponential times between
opportunities can be handled (for age replacement only exponentially
distributed times between opportunities can be handled; non-exponential times
become very difficult, see Dekker and Dijkstra 1992). We did modify the
block policy to avoid replacing new components, but that will be explained
later. In the block replacement model a component is replaced preventively
every t time units at cost c_p and upon failure at cost c_f (> c_p). Let
F(t) be the c.d.f. of the time to failure and let M(t) be the associated renewal
function, indicating the expected number of failures in [0, t]. The long-term
average costs g(t) follow easily from renewal theory and amount to

g(t) = (c_p + c_f M(t)) / t ,    (3.1)
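For a Weibull time to failure the renewal function M(t) in equation (3.1) has no closed form, but it can be approximated numerically by discretising the renewal equation M(t) = F(t) + ∫_0^t M(t - x) dF(x). The sketch below is our own generic illustration (grid size and sanity-check numbers are assumptions), not the approximation referred to later in the text:

```python
import math

def weibull_cdf(t, shape, scale):
    """Two-parameter Weibull c.d.f. F(t)."""
    return 1.0 - math.exp(-((t / scale) ** shape)) if t > 0 else 0.0

def renewal_function(t, cdf, n=1000):
    """Approximate M(t) on an n-point grid by discretising the renewal
    equation M(t) = F(t) + int_0^t M(t - x) dF(x)."""
    h = t / n
    f_grid = [cdf(i * h) for i in range(n + 1)]
    m_grid = [0.0] * (n + 1)
    for i in range(1, n + 1):
        s = f_grid[i]
        for j in range(1, i + 1):
            s += m_grid[i - j] * (f_grid[j] - f_grid[j - 1])
        m_grid[i] = s
    return m_grid[n]

def block_cost_rate(t, c_p, c_f, cdf):
    """Long-term average costs g(t) = (c_p + c_f * M(t)) / t, equation (3.1)."""
    return (c_p + c_f * renewal_function(t, cdf)) / t

# Sanity check: shape 1, scale 1 is the unit exponential, for which M(t) = t,
# so g(1) should be close to c_p + c_f.
print(block_cost_rate(1.0, 1.0, 5.0, lambda t: weibull_cdf(t, 1.0, 1.0)))
```

The discretisation error is of order 1/n; a coarser grid trades accuracy for speed.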
For unrevealed failures we used Barlow and Hunter's (1960) model, which
goes as follows. A component is inspected every t time units and repaired
without extra costs upon a failure being found. For every time unit the
component is in the failed state a cost rate c_fr is incurred. Let F(t) and f(t)
be the c.d.f. and p.d.f. of the time to failure, respectively. The long-term
average costs g(t) then equal
g(t) = (c_p + ∫_0^t c_fr (t - x) f(x) dx) / t = (c_p + ∫_0^t c_fr F(x) dx) / t    (3.2)
Next consider the case that preventive maintenance or replacement can
only be done at opportunities. Suppose (as was the case in our problem)
that opportunities are generated independently of the component failure
processes and that their occurrence can be modelled by a renewal process.
The block policies are then extended to control-limit policies of the type:
"maintain a component at the first opportunity if more than t time units have
passed since the previous preventive maintenance". Let the random variable
Z_t denote the forward recurrence time to the next opportunity if t time
units have passed since the last preventive maintenance at an opportunity.
Notice that executions of the maintenance activity at an opportunity can
be considered as total renewals. Hence the renewal cycle has length t + Z_t.
In the case of block replacement the expected number of failures is given by
E[M(t + Z_t)], where the expectation is with respect to Z_t. This leads to
the following formula for the expected average costs g_Y(t) of executing a
maintenance activity with control limit t

g_Y(t) = (c_p + c_f ∫_0^∞ M(t + z) dP(Z_t ≤ z)) / (t + E[Z_t]).    (3.3)

Dekker and Smeitink (1991) show that the same conditions are needed
for the existence of a unique minimum t* of g_Y(t) as for the standard block
replacement model. Furthermore, t* is the unique solution of the following
optimality equation:

                                      < 0  for 0 < t < t*
c_f E[M(t + Y) - M(t)] - g_Y E[Y]     = 0  for t = t*        (3.4)
                                      > 0  for t > t*

where g_Y denotes the minimum average costs. Notice that c_f E[M(t + Y) -
M(t)] can be interpreted as the expected costs of deferring execution of the
activity from the present opportunity at time t to the next one, Y time units
ahead.
The analysis of the opportunity block replacement model does not make
use of the interpretation of the costs over an interval. In fact any other cost
function may be used as well (as is also remarked in Dekker 1995). Accordingly
the analysis carries over easily to the unrevealed failure model, with M(t)
replaced by ∫_0^t F(x) dx.
To calculate the integrals in equation (3.3) we first approximated Z_t
by a three-point distribution with reasonably chosen values and
probabilities. Later, in Dekker and Smeitink (1991), it appeared that Z_t can
be approximated by the forward recurrence time of a Coxian-2 distribution
in case the coefficient of variation is larger than 0.5, and by the stationary
excess distribution otherwise. For the renewal function a simple but
effective approximation was developed (see Smeitink and Dekker 1990).
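Once Z_t is replaced by a three-point distribution, equation (3.3) becomes a finite sum. A sketch of that computation (function names, the unit-rate renewal function and the numbers are our own assumptions):

```python
def opportunity_block_cost(t, c_p, c_f, renewal_m, z_dist):
    """g_Y(t) of equation (3.3) with Z_t approximated by a discrete
    distribution z_dist = [(z_value, probability), ...]."""
    e_z = sum(z * p for z, p in z_dist)                    # E[Z_t]
    e_m = sum(renewal_m(t + z) * p for z, p in z_dist)     # E[M(t + Z_t)]
    return (c_p + c_f * e_m) / (t + e_z)

# With exponential failures of rate 1, M(u) = u; three-point Z_t (hypothetical):
z3 = [(0.5, 0.25), (1.0, 0.5), (2.0, 0.25)]
print(opportunity_block_cost(2.0, 1.0, 4.0, lambda u: u, z3))  # -> 4.32
```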
Notice that g_Y(t) is a function of one variable, implying that optimisation
is not too difficult. We applied a fixed-step-size search combined with a
bisection procedure to determine the first minimum of g_Y(t).
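A minimal sketch of such a search (our own generic version; the step size and tolerance are assumptions, and the cost function g is treated as a black box):

```python
def first_minimum(g, t_max, step=0.1, tol=1e-6):
    """Locate the first local minimum of g on (0, t_max]: scan with a fixed
    step until g stops decreasing, then bisect on the sign of a
    finite-difference slope inside the bracketing interval."""
    t = step
    while t + step <= t_max and g(t + step) < g(t):
        t += step
    lo, hi = max(t - step, tol), t + step
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if g(mid + tol) < g(mid):  # slope still negative: minimum lies right of mid
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# Example: a quadratic cost curve with its minimum at t = 2.
print(first_minimum(lambda t: (t - 2.0) ** 2 + 1.0, 5.0))
```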
3.9.3 Maintenance Package Optimisation Models. Notice that both
the block replacement model and the inspection model are easily extended
to a package containing multiple activities. Suppose that the execution of
a package costs c_p and that n_r activities address revealed failures (with failure
time distributions F_i(t) and failure costs c_{f,i}, i = 1, ..., n_r) and n_u unrevealed
ones (with failure time distributions F_j(t) and cost rates c_{fr,j}, j = 1, ..., n_u).
The total long-term average costs g_Y(t) then amount to

g_Y(t) = (c_p + E[ Σ_{i=1}^{n_r} c_{f,i} M_i(t + Z_t) + Σ_{j=1}^{n_u} ∫_0^{t+Z_t} c_{fr,j} F_j(x) dx ]) / (t + E[Z_t])    (3.5)
The analysis is again similar to the one-component case. In principle
one could encounter multiple minima in the optimisation, but in all cases
considered we encountered no problems.
3.9.4 Ranking Criterion for Multiple Maintenance Packages. Apart
from indicating the optimal control limit, and hence an optimal long-term
frequency with which a package was to be executed, we also needed to set
priorities in case too many packages had to be carried out at an opportunity.
Notice therefore that equation (3.4) provides a means to set priorities.
Below we extend it to the package case. Let RC(t) be defined by

RC(t) = Σ_{i=1}^{n_r} c_{f,i} E[M_i(t + Y) - M_i(t)] + Σ_{j=1}^{n_u} E[ ∫_t^{t+Y} c_{fr,j} F_j(x) dx ] - g_Y E[Y]    (3.6)
with g_Y the minimum average costs of the total package. We can interpret
RC(t) as the expected costs of deferring the execution of the package to the
next opportunity, Y time units ahead, minus the long-term average costs over
that time. Hence it is an ideal candidate on which to rank packages. Notice
that at an opportunity we only have to calculate the first part of RC(t); as
g_Y can be stored in the database, we only need to calculate it upon
initialisation of the DSS.
The idea is now to execute the maintenance packages with the highest
ranking values until the opportunity is fully used. Notice that the ranking
criterion is myopic: a package may be delayed multiple times at successive
opportunities. Including that effect, however, was considered to be too
complex. The procedure was tested in Dekker and Smeitink (1994) and
performed quite well.
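The resulting advice rule, working down the ranked list while the opportunity lasts, can be sketched as follows. This is one plausible reading of "until the opportunity is fully used" (packages that do not fit are skipped), with hypothetical package data of our own:

```python
def schedule_at_opportunity(packages, available_hours):
    """List heuristic: rank packages by their deferral cost RC and execute
    each one that still fits into the remaining opportunity time.

    packages: list of (name, rc_value, duration_hours)
    """
    chosen, remaining = [], available_hours
    for name, rc_value, duration in sorted(packages, key=lambda p: -p[1]):
        if duration <= remaining:
            chosen.append(name)
            remaining -= duration
    return chosen

# Hypothetical ranking at a 12-hour opportunity:
pkgs = [("filters", 120.0, 8), ("overhaul", 90.0, 16), ("valves", 60.0, 4)]
print(schedule_at_opportunity(pkgs, 12))  # -> ['filters', 'valves']
```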
Next, we did modify the block policy to take recent failure replacements
into account. If for some revealed-failure components actual ages were known,
we replaced the renewal function in equation (3.6) by the expected number
of failures given the present age(s), using the c.d.f. and its convolutions. This
idea was elaborated in Dekker and Roelvink (1995) and appeared to cover
most of the cost-performance difference between age and block replacement,
even in the multi-component case.
Finally, we wanted to allow the user to enter a specific interval (either
as a point value or as a three-point distribution) to the next opportunity, which
could differ from the long-term distribution of the time between
opportunities. In that case we replace the r.v. Y in equation (3.6) by the interval
specified.

3.10 Type of Advice

Once we have calculated for each maintenance package a criterion indicating
its importance for being carried out, we are still left with the problem of
which maintenance packages to carry out, as each of them may require a
different man effort. In principle we considered two approaches:
(i) Support the user with a ranked list of maintenance packages, from which
he makes the final selection, taking all kinds of extra information into
account.
(ii) Provide the user with an interactive knapsack scheduling program which
determines an optimal selection given the time constraints.
Approach (ii) is to be preferred from a theoretical point of view, as it
best guarantees optimality. The disadvantage, however, is that it is far more
complex: it requires a program on the spot, the ability of the user to
run it and, furthermore, an exact specification of the problem. The latter was
not trivial, as execution times of maintenance packages can vary greatly and,
besides, the opportunity duration may not be known exactly.
So the main question became to determine the extra value of a knapsack
approach over a simple list heuristic: select the packages from the list and
carry them out, one by one, until the opportunity has been fully used.
Unfortunately, results in this respect cannot be found in the literature. We
therefore carried out a quick investigation, which indicated that the maximal
relative improvement of a knapsack optimisation over the straightforward list
procedure is small in realistic cases (usually less than 5%).
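The comparison can be illustrated with a small sketch of our own: a textbook 0/1 knapsack against the greedy ranked-list value, with hypothetical RC values and integer durations. The example is constructed to show a gap; it does not reproduce the quick investigation itself:

```python
def knapsack_value(packages, capacity):
    """0/1 knapsack over (rc_value, duration) pairs with integer durations:
    the best total RC value attainable within the opportunity."""
    best = [0.0] * (capacity + 1)
    for value, duration in packages:
        for c in range(capacity, duration - 1, -1):
            best[c] = max(best[c], best[c - duration] + value)
    return best[capacity]

def list_heuristic_value(packages, capacity):
    """Greedy ranked-list value: take packages in decreasing RC value
    as long as they fit."""
    total, remaining = 0.0, capacity
    for value, duration in sorted(packages, reverse=True):
        if duration <= remaining:
            total += value
            remaining -= duration
    return total

# A case where the list heuristic is suboptimal (hypothetical values, hours):
pkgs = [(100.0, 10), (70.0, 6), (60.0, 6)]
print(knapsack_value(pkgs, 12), list_heuristic_value(pkgs, 12))  # -> 130.0 100.0
```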
Furthermore, the knapsack procedure has the disadvantage of being
sensitive to the constraint formed by the actual opportunity duration. As the
list from which a selection has to be made will be short in practice, it is not
that difficult for a maintenance supervisor to determine the best selection.
What is more, he may be very pleased with having the freedom to take
that decision rather than having a system dictate what to do. Besides, he
also has to check whether the required spare parts are available. Therefore
we decided to give the ranked list of maintenance packages as advice. An
example is given in Appendix A.

3.11 Software

Although the company at which a field test would later take place had an
extensive maintenance management information system in use, we decided to
develop PROMPT separately from it, with the intention of making connecting
links once PROMPT had demonstrated its value. One of the reasons behind
this was that PROMPT needs more detailed information than what is in the
maintenance management information system.
The main part of the PROMPT software consists of a database, which has
been written in a 4th-generation database language in order to secure easy
reporting facilities. As language we chose FOCUS, in order to provide
compatibility between a mainframe and a PC version. The optimisation occurs
through Fortran subroutines.
The total code consists of some 20,000 lines. Although PROMPT was
originally set up for a personal computer (PC), we later switched to a mainframe,
as the complexity was too large to be handled by the then existing PCs (IBM
PC-AT) and the PC version of FOCUS.

4. Field Test of PROMPT on Major Gas Turbines

In this section we briefly describe a field test of PROMPT on three Rolls
Royce Avon gas turbines, one for main power generation and two which
served as oil pumps.

4.1 Defining Maintenance Activities and Set-up of Maintenance Packages

This was in fact a major task. Before PROMPT, routine maintenance was
lumped together in large packages of, say, 150 hours, which were executed
during the yearly platform shutdown. Each task had to be written down in
detail, with exact specifications of the equipment addressed. Thereafter all
activities had to be combined into maintenance packages. Although there are
optimisation aspects involved, this was done purely by engineering judgment,
grouping those activities which could easily be executed together. The type
of a maintenance activity could be either mechanical, instrument or
electrical. Furthermore, for each package one had to determine the best condition
indicator, being either runhours, calendar time or number of starts and stops.

4.2 Experiences with the Economic Model for Unit Downtime Penalties

Although the model developed to assess cost penalties for unit downtime was
considered to be quite general, the field test revealed that practice has many
unexpected aspects. For example, when assessing the consequences of loss
of power for the power generating system, it appeared that not every MW of
output was of equal value. In case of power loss the production systems are
shut down in order of importance. Another special feature was encountered
with pumps. The model assumed that the throughput of units in parallel
was the sum of the individual throughputs, which is not valid for pumps in a
serial configuration (the pressure build-up is non-linear in the capacity).
Using the model philosophy, however, it was not difficult to extend the model
with these new aspects and to arrive at reasonable cost penalties for unit
downtime. It does show, however, that it is difficult to build generally
applicable models and that in each case specific unmodelled factors may dominate,
which requires a good economist with reliability knowledge. Moreover,
hard-coding models in software appeared to be dangerous in case no alternative
ways of determination (e.g., hand calculation) are allowed. The experiences
did teach us that all these problems can be overcome and that in the end
realistic cost penalties for unit downtime can be obtained.

4.3 Initialisation at Component Level


The initialisation at component level was in fact the bulk of the work.
Actually, it was a learning process, since we first did an initialisation for one
unit, then changed the procedure and redid it for the other two units. Data had
to be provided for maintenance packages as well as for maintenance activities.
For each maintenance package one had to assess the man effort required to
execute it, the type of condition indicator and, optionally, special materials
costs. Although execution times may vary widely in practice, it was not
too difficult a job to give reasonable estimates. With respect to maintenance
activities the following data were required by PROMPT: first of all the type
of the dominant failure mode, being either revealed or unrevealed; next to
that the consequences of a failure in terms of costs and potential downtime;
and finally the time to failure distribution.
Severe problems were encountered in obtaining the component time to
failure distributions. The data collected so far in the maintenance management
information system were lumped over many failure modes and not registered
using the PROMPT hierarchy. Furthermore, as the maintenance packages
created for PROMPT were reasonably detailed, the amount of data per
component was low. For as many as a third of the components no data were
available over a period of two years, in which we pooled over four machines.
Therefore we decided to use expert judgment for initial estimates and to
update these with data originating later. A full description of the procedure
can be found in Dekker (1989). Below we give a short review.
As experts we used maintenance technicians having several years of
experience with the unit in question. As they were difficult to access - they
were working in weekly shifts at the offshore platform - we chose to send a
questionnaire to obtain the data. As we had to model wear-out, we needed
at least two characteristics of the time to failure distribution. Furthermore,
we decided to ask a control question in order to investigate the value of the
answers. Although the questions were formulated with care, it appeared from
the analysis that there were considerable inconsistencies within the answers.
We therefore decided to send a second questionnaire asking for additional
information. The response was not enthusiastic - people had to answer
about twenty questions with probability statements - and the experts had to
be pressed to give their answers. From the analysis we learned that some
questions are difficult to answer when components are regularly maintained:
e.g. the mean time to failure is difficult to estimate if most components are
maintained before that age. Questions concerning the failure fraction in the
historical maintenance interval and in twice that interval were considered to
give the most reliable information on the lifetime distribution.
For the other two units in the field test, for which only limited extra
data were needed, we used another method. Based on the last questions we
developed a data collection program which was able to analyse the answers
directly and to give the experts direct feedback - in terms of mean time to
failure and the optimum maintenance interval (resulting from a simplified
optimisation). An analyst from the local head office used the program for
the elicitation. This turned out to be a success and removed all of the
inconsistency between the experts' answers. Two problem aspects remained,
however, viz. the problem of combining different experts' opinions and the
problem of updating the experts' opinions with data originating later. To that
end a separate study was initiated with Cooke from Delft University, which
resulted after a year in a special method (see Van Noortwijk et al. 1989 and
Van Dorp 1989).

4.4 Operational Experiences

Operational experiences with the PROMPT system were good, although
some shortcomings were pinpointed. In the field test PROMPT was running
on mainframe computers and system operation occurred onshore by
supporting staff. The advantage of this was that software errors could be solved
faster, as it still concerned a field test. The PROMPT system did require
some effort to operate. Data on usage of the units were easy to input, or to
change if erroneous values had been inserted. Producing the ranking posed
no problem either. The main problem, however, was in the reporting of the
corrective maintenance. Not enough facilities had been provided to secure
reporting that was consistent with the PROMPT database. Remember that
a new structuring of equipment had been made in order to set up PROMPT,
in terms of failure modes and elements. The existing maintenance management
information system did not use these concepts, as it contained only
much larger entities, like a whole subunit. The onshore personnel then had
to find out which failure mode had actually occurred. In the longer term this
is considered to be too time consuming.

4.5 Experiences with Software

Experiences with the software were positive. Although we used the term
decision support system for PROMPT, it is better described as a structured
decision system, as the support it provides is always of the same form. As
the main advice is given at an opportunity, and as opportunities occur
repeatedly, there is much to be said for structured advice. Developing such large
computer systems does put a different light on mathematical optimisation.
The larger software gets, the more difficult it is to handle and check
beforehand. Software errors can produce completely wrong results and thereby
destroy all the value of optimisation.
A major problem encountered concerned database integrity. In order to
secure this, all kinds of protection mechanisms were built into the system,
next to the already existing protection mechanisms provided by FOCUS. This
made it very time consuming for the user to change data which had been
entered erroneously. The user wanted to have the flexibility of changing data
as in a spreadsheet, but that is not what database languages provide. Especially
the so-called key variables, around which the database is structured, are
extremely difficult to change. Users do not always have the right description
of their database elements beforehand, implying that difficult changes have to
be made later on, or that a user is left with a database that is difficult to
understand. The latter may be a cause of future errors.

4.6 Experiences with the Advice

Although the prototype software was running on a mainframe and advice
had to be sent by telex, its acceptance was excellent. Every two weeks a
ranking list was made and sent offshore, so that if an opportunity occurred
it was directly available. As the number of maintenance packages advised
for execution was usually small and the priorities differed widely, there was
no problem in making the actual schedule. Users found that there were far
more opportunities than expected, and that they were well equipped to make
effective use of them.

4.7 Evaluation

It is always difficult to evaluate a decision support system, as decisions are
taken by people, using information from various sources, and the decision
support system only has a supporting role. Furthermore, as operations and
circumstances change with time and differ from what was envisaged
at the start of a project, it is often not possible to make a proper comparison
between the situation before and after the introduction of the decision
support system. Finally, many advantages or disadvantages are difficult to
measure, let alone to quantify. Nevertheless, some evaluation always has to
be done, and here we give some results of the evaluation of PROMPT.
We will consider three ways of evaluation, viz. theoretical comparisons,
actual performance comparisons and, finally, management and user
acceptance. The benefits of PROMPT were classified into four aspects: execution
of preventive maintenance at opportunities rather than at forced shutdowns,
optimisation of the preventive maintenance frequencies, value as a management
tool and, finally, administrative facilities to learn about the effects of
maintenance.
In the theoretical evaluation we assume that reality is as the PROMPT
models assume. In fact, one of the advantages of a decision support system is
that one can make this kind of evaluation. Inserting a historical maintenance
interval (or an estimate of it) into PROMPT makes it possible to apply
the PROMPT programmes to evaluate the average costs under the historical
interval and the savings obtained by optimising the maintenance frequency.
Results indicate that relative savings of 20% to 30% were obtainable. The
absolute savings, however, were not that large, as the amount of maintenance
suited for execution at opportunities was limited. A calculation of the value
of executing preventive maintenance at opportunities is only possible if one
is able to describe the alternative precisely. This also requires setting a cost
penalty for unit downtime during the annual shutdown. Depending on that
value the outcome of the savings varies widely, but it can be substantial.
The last two aspects are difficult to quantify. It is a fact that preventive
maintenance is always overshadowed by corrective maintenance and that there is a
large backlog of activities. One of the problems of maintenance management
is to control this backlog. As PROMPT keeps track of what has been done
and what still has to be done, it provides management at any time with advice
on what is most important to do.
A practical evaluation of PROMPT consists of comparing the actual
behaviour, i.e., the actual availability, of the units for which it gives advice with
that of other units. This is, however, very difficult to realise. First of all, the
actual availability of a unit is a realisation of many random processes, and
one has to include many units and use a long time scale to make a
statistically sound comparison. Furthermore, only a part of the failure modes of a
unit were addressed by PROMPT. What aggravates this problem even more
is that the reporting of availability is often poor. For example, if a unit is
being repaired and it is thereafter not directly needed, then the repair may
take far more time and the restoring of its availability may be postponed to
the moment it is again needed for service. These events can have a substantial
effect on the reported availabilities. As it is very difficult to find out from
raw data whether this has occurred, the evaluation is difficult to make.
Given the data available there was no evidence that the availability during
the PROMPT field test was substantially different from that before.
Let us now turn to the final part of the evaluation, namely management
and user acceptance. The actual platform maintenance supervisors were very
enthusiastic about the PROMPT advice, and did not want to stop the field
test. The actual PROMPT advice was in fact very flexible, and provided
exactly what they needed. The time needed to do the failure mode analysis
and the assessment of the failure time distributions was considerable (as it
often is). There were some complaints about the complexity and difficulties
of managing the database. A final version of PROMPT will have to be
simplified and to require far less data input, certainly when the costs of
initialising PROMPT are compared with the amount of money involved in the part
of maintenance suitable for execution at opportunities of the units in question.
Besides, other problems may overshadow PROMPT temporarily, thereby
destroying the discipline needed to maintain it (on the platform in question
there was a lengthy shutdown caused by other reasons).

5. Conclusions
PROMPT can be considered a major step forward in applying scientific
methods to maintenance management. It has its pros and cons. Its pros are
undoubtedly the structured approach leading to an optimisation of preventive
maintenance. Its main con, however, is that it is a complex system requiring
a long initialisation effort. Future work will be directed at reducing the
initialisation effort and simplifying the system while keeping the benefits of
the structured approach.

Acknowledgement. The authors would like to thank Messrs. van Oorschot, Cooper
and Hartley from Shell Expro Aberdeen for their cooperation on the PROMPT
project. The actual development of PROMPT was done by Ernest Montagne,
Joop van Aken, Dick Turpin and the authors.

References

Barlow, R.E. , Hunter, L.C.: Optimum Preventive Maintenance Policies. Oper. Res.
8, 90-100 (1960)
Barlow, R.E. , Proschan, F.: Mathematical Theory of Reliability. New York: Wiley
1965
Backert, W., Rippin, D.W.T.: The Determination of Maintenance Strategies for
Plants Subject to Breakdown. Computers Chemical Engineering 9, 113-126
(1985)
Berg, M.B.: A Marginal Cost Analysis for Preventive Maintenance Policies. Euro-
pean Journal of Operational Research 4 , 136-142 (1980)
Cho, D.1. , ParIar, M.: A Survey of Maintenance Models for Multi-Unit Systems.
European Journal of Operational Research 51, 1-23 (1991)
Dekker, R.: Use of Expert Judgment for Maintenance Optimization. First report of
the ESRRDA Project group on expert judgment (1989)
Dekker, R.: Applications of Maintenance Optimisation Models: A Review and
Analysis. Report Econometric Institute 9228/A, Erasmus University Rotter-
dam (1992)
548 Rommert Dekker and Cyp van Rijn

Dekker, R.: Integrating Optimisation, Priority Setting, Planning and Combining of
    Maintenance Activities. European Journal of Operational Research 82, 225-240
    (1995)
Dekker, R., Dijkstra, M.C.: Opportunity-Based Age Replacement: Exponentially
Distributed Times Between Opportunities. Naval Research Logistics 39, 175-
190 (1992)
Dekker, R., Roelvink, I.F.K.: Marginal Cost Criteria for Group Replacement. Eu-
ropean Journal of Operational Research 84, 467-480 (1995)
Dekker, R., Smeitink, E.: Opportunity-Based Block Replacement: The Single Com-
ponent Case. European Journal of Operational Research 53, 46-63 (1991)
Dekker, R., Smeitink, E.: Preventive Maintenance at Opportunities of Restricted
Duration. Naval Research Logistics 41, 335-353 (1994)
Duncan, J., Scholnick, I.S.: Interrupt and Opportunistic Replacement Strategies
for Systems of Deteriorating Components. Operational Research Quarterly 24,
271-283 (1973)
Hanscom, M.A., Cleroux, R.: The Block Replacement Problem. Journal of Statis-
tical Computations and Simulations 3, 233-248 (1975)
Jorgenson, D.W., McCall, J.J., Radner, R.: Optimal Replacement Policy. Amster-
dam: North-Holland 1967
Sethi, D.P.S.: Opportunistic Replacement Policies. In: Shimi, I.N., Tsokos, C.P.
(eds.): The Theory and Applications of Reliability. Vol. 1. New York: Academic
Press 1977, pp. 433-447
Sherif, Y.S., Smith, M.L.: Optimal Maintenance Models for Systems Subject to
    Failure - A Review. Naval Research Logistics Quarterly 28, 47-74 (1981)
Smeitink, E., Dekker, R.: A Simple Approximation to the Renewal Function. IEEE
    Trans. on Rel. 39, 71-75 (1990)
Van Aken, J.A., Schmidt, A.C.G., Wolters, W.K., Vet, R.P. van der: Reliability-
Based Method for the Exploitation of Maintenance Opportunities. Proc. 8th
    Adv. in Rel. Techn. Symp. (1984), pp. B3/1/1-B3/1/8
Van Dorp, R.: Expert Opinion and Maintenance Data to Determine Lifetime Dis-
tributions. M.Sc. Thesis, Delft University of Technology (1989)
Van Noortwijk, J.M., Dekker, R., Mazzuchi, T.A., Cooke, R.M.: Expert Judgment
in Maintenance Optimization. IEEE Trans. on Rel. 41, 427-432 (1992)
Vergin, R.C., Scriabin, M.: Maintenance Scheduling for Multicomponent Equip-
    ment. AIIE Transactions 9, 297-305 (1977)
Woodman, R.C.: Replacement Policies for Components that Deteriorate. Opera-
tional Research Quarterly 18, 267-280 (1967)
PROMPT, A DSS for Opportunity-Based Preventive Maintenance 549

Appendix

A. Example of Advice

INSTALLATION: CA          SYSTEM: E.01 POWER GENERATION
SUB SYSTEM:   01          UNIT:   G1070 MAIN GENERATOR

CURRENT OPPORTUNITY: 20 05 88      0 DAYS LATER WITH PROB.   0 %
NEXT OPPORTUNITY:    20 11 88    184 DAYS LATER WITH PROB. 100 %

NO.  MPCODE  MP NAME             EFFORT   RANKING VALUE   EXECUTE
  1  M 15    COMBUS CHAMB/FUEL    27.20       11982        <---
  2  E 14    DC EMER L.O PP MT     2.00        5423        <---
  3  M 20    OIL FILTERS          11.80        2770        <---
  4  E 35    AVON CONT PANEL      30.00        2166        <---
  5  E 13    DC STARTER MOTOR      4.00        2099        <---
  6  M 25    TURBINES              6.60        1423        <---
  7  E 31    EXCITER/GEN. TERM    10.00        1389        <---
  8  E 36    EXCITATION CUBIC     12.00        1370        <---
  9  M 29    COOLERS              33.60        1340        <---
 10  M 18    GOVERNOR CABINET      3.00        1271        <---

For further information enter MP NO.
PF1      PF2   PF3      PF4   PF5   PF6      PF7         PF8   PF9   PF10
SAFETY         RETURN               NEXT     PREVIOUS                HELP
                                    SCREEN   SCREEN
Maintenance Optimisation with the Delay
Time Model
Rose Baker
Department of Mathematics and Computer Science, University of Salford, Lan-
caster, M5 4WT, United Kingdom

Summary. The delay time model is an inspection model which has been used
extensively in many case studies of the development of recommended maintenance
policies both for industrial plant and buildings. An introduction to the method-
ology is given, together with some recent modelling and inferential developments.
The focus is on the estimation of model parameters by fitting to so-called objective
data, rather than on the use of subjective data. There is a comprehensive bibliog-
raphy. Hitherto unpublished work presented here includes statistical inference for
multicomponent systems and new ways of modelling imperfect inspection.

Keywords. Mathematical modelling, maintenance and reliability, optimization,
inspection modelling, repairable systems, delay-time model, multicomponent
systems, statistical inference, Empirical Bayes technique, medical screening

1. Introduction

This chapter describes a preventive maintenance model that has been suc-
cessfully used in many case studies since 1982. The focus here is on recent
developments in the model, and the reader is referred to papers such as
Baker and Christer (1994) for a more general account of the method, and its
historical evolution.
In this chapter, a simple case study is presented to give the flavour of
the method, and after a fuller description of the model and of the estimation
of model parameters, some more complex case studies are discussed. Finally
some current ideas for model development are mentioned.
For those readers familiar with the delay-time model, the work presented
here for the first time is the derivation of the likelihood function of the NHPP
multicomponent model from the component-tracking model in Section 2.4.1,
the Empirical Bayes multicomponent model in Section 2.5, the remarks on
stochastic cost in Section 3 and the new imperfect-inspection model in Sec-
tion 7.2.2.

1.1 Background

We may speak of engineering and operational decisions in maintenance. En-
gineering decisions are decisions about which engineering actions to take and
when to take them, and operational decisions must then be made regarding
the implementation of decided engineering actions. Operational decisions typ-
ically include manpower modelling, logistics and inventory control. Whereas
operations decisions influence the efficiency of implementing a maintenance
concept, engineering decisions determine the maintenance concept itself. The
model described in this chapter is an engineering model of maintenance, and
is an attempt to encapsulate actual engineering perception, experience and
practice.
There are a great many models of preventive maintenance in the liter-
ature, and the delay-time model (DTM) along with others is reviewed by
Valdez-Flores and Feldman (1989) and Thomas et al. (1991). Much of this
modelling work makes seemingly arbitrary assumptions, and there is often
no indication of how the values of model parameters can be determined, no
evident concern for model validation, i.e. assessing the quality of the 'fit' of
the model to data, and no examples of applications or case studies or of post
modelling analysis. On the contrary, this chapter is concerned with precisely
these issues.
Work to date since the genesis of the basic model is of two kinds. The first
is model development to include factors that seem likely to be important in
practice, such as imperfection of inspection, irregular timing of inspections
and stochastically timed (opportunistic) inspections. Insight has been gained
by exploring their mathematical modelling. The other kind is that of fitting
DTMs to data in case studies, with emphasis on parameter estimation, model
validation, and post-modelling verification.

1.2 Terminology
It is necessary to define some terms.
We are concerned with delay-time modelling of one or more machines or
systems liable to (costly) failure. The model has been applied to a variety of
systems, most often industrial plant, but also to building maintenance. The
machines may be very large and complex, such as power presses or production
lines, or small items with few components, such as infusion pumps and other
items of hospital equipment.
Failure is taken here to mean a breakdown or catastrophic event, after
which the system is unusable until repaired or replaced. It may also be simply
a deterioration to a state such that the repair can no longer be postponed.
This is especially true in building maintenance. Preventive maintenance is
some activity carried out at intervals, with the intention of reducing or elim-
inating the number of failures occurring, or of reducing the consequences of
failure in terms of, say, downtime or operating cost.
The concept of failure delay time or simply delay time is central to the
DTM. Failure is regarded as a two-stage process. First, at some time u with
distribution function G(u) and pdf. g(u) a component of the system becomes
recognisable as defective, and the defective component subsequently fails after
some further interval h, with distribution function F(h), pdf. f(h). Preventive
maintenance is assumed to consist primarily of an inspection resulting in the
replacement or repair of defective components. Other elements of planned
maintenance such as oil changes in a sump are important to defect reduction,
but the timescale of any noticeable change in performance resulting from
a change in PM period is sufficiently large to be neglected in the current
context. It is for this reason that we model PM as an inspection activity.
The model adopts operational definitions of failure and defectiveness used
in the plant concerned. The judgement that a component is 'defective' is made
by the maintenance technician or production engineer. This approach enables
models to be constructed without more ado: for a possible difficulty caused
by it, see the last two paragraphs of Section 7.2.2.
Figures 1.1 and 1.2 demonstrate the fundamental role of the delay-time
concept. Figure 1.1 shows how inspections prevent failures in the component
tracking model, in which failures and replacements of individual components
are followed through time. Figure 1.2 shows how inspections prevent failures

Fig. 1.1. How inspections prevent failures in the component tracking model. The
horizontal axis represents time, and the open circles represent the origination of
defects, the closed circles represent failures, and the vertical lines inspections. The
third defect has originated but has been detected and repaired at inspection and
so has not caused a failure.

for the Nonhomogeneous Poisson process (NHPP) model, in which defects
arrive as an NHPP, and histories of individual components are not known.
There is considerable interest nowadays in condition-monitoring, with the
advent of 'hi tech' methods that can detect abnormal vibration frequencies,
high concentrations of trace metals in oil, and other correlates of wear or
damage. A simple DTM of condition-monitoring exists that is currently based

Fig. 1.2. How inspections prevent failures for the NHPP model. The horizontal
axis represents time and the open circles represent the origination of defects, the
closed circles represent failures and the vertical lines inspections. With periodic
inspections, as in the lower part of the figure, the second, fourth, fifth and eighth
defects have now been detected and repaired, and so have not caused failures.

on the simplest measurement of condition possible: OK or defective (Christer
and Wang 1992).
There is also an analogy with medical screening. Shwartz and Plough
(1984) describe a cancer screening model with three states: healthy, preclini-
cal (defective), and clinical (failed). False negative results correspond to 'im-
perfect inspection'. This model is similar to the component-tracking model
for a 1-component machine.
For another simple example, dentists now talk of policies for the main-
tenance of teeth, and the carrying out of regular dental inspections offers a
good illustration of the delay-time concept. No maintenance work is done
unless a defect is observed, such as a leaking or cracked filling. The regular
'scale and polish' by a hygienist can also be thought of as removing the defect
of a build-up of tartar.
Failure here would be any event such as toothache, that necessitated in-
terruption of one's daily routine and an unplanned visit to the dentist. The
pain experienced and the inconvenience of interrupting scheduled activities
give failure a higher cost than maintenance.
It is clear that in general there will be an optimum frequency of preventive
maintenance. If maintenance is rarely done, expensive failures will result. If
maintenance is frequent, defective components are replaced before they cause
an expensive failure, but the maintenance activity itself is costly. The ultimate
aim of modelling is to optimise such decision variables.
An example will illustrate these concepts.

1.3 A Simple Example


This example is based upon Christer and Waller (1984c), which has proved
to be an early and influential case study of a high-speed canning line. It is
further discussed in Christer and Redmond (1990), and Christer and Waller
(1984c) discuss many other aspects of the maintenance optimisation problem
besides the calculation given here.
Maintenance was carried out every Δ = 24 hours. In general, mainte-
nance was found to result on average in a downtime of 65.5/24 = 2.73 hours
per week. Assuming that the maintenance downtime period is independent
of maintenance frequency, downtime due to maintenance is 65.5/Δ per week.
The mean number of defects appearing per week was measured as 16.6, and
the fraction b(Δ) of them which resulted in failures caused an average down-
time of 0.698 hours, so that the downtime per week due to failures was mod-
elled as 11.59b(Δ). The downtime per week as a function of Δ was therefore
65.5/Δ + 11.59b(Δ).

This formula is approximate, and uses the fact that in this case manpower
for maintenance was not a problem: flexible manpower at inspection meant
that an inspection and repair took the same time however many defects had
to be fixed. To optimise Δ, we need to know b as a function of Δ.

Assuming a stationary process of defect arrival, a defect arising in the
interval (0, Δ) is assumed to arrive at time u from the last inspection with pdf.
g(u) = 1/Δ. The probability that a random defect causes a failure before it
is caught at the next inspection is

b(Δ) = ∫_0^Δ g(u) F(Δ − u) du = (1/Δ) ∫_0^Δ (1 − exp{−ξu}) du,

assuming an exponential distribution for the delay time, when F(h) = 1 −
exp(−ξh). Hence
b(Δ) = 1 − (1 − exp{−ξΔ})/ξΔ; (1.1)
this is the required formula for b(Δ).

The observed value of b was 0.396, for the current practice with Δ = 24
hours. Solving equation (1.1) gives ξ = 0.0463, which provides a delay-time
distribution estimate that is tuned to current practice.
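The arithmetic above is easy to check numerically. The sketch below (the constants 65.5, 11.59, the observed b = 0.396 and Δ = 24 hours are taken from the case study; the bisection tolerance and the whole-hour search grid are arbitrary choices, not part of the original analysis) solves equation (1.1) for ξ and scans the downtime formula for its minimising interval:

```python
import math

def b(delta, xi):
    """Probability that a defect causes a failure before the next
    inspection, for exponentially distributed delay times: equation (1.1)."""
    x = xi * delta
    return 1.0 - (1.0 - math.exp(-x)) / x

def downtime(delta, xi):
    """Modelled downtime per week: maintenance term plus failure term."""
    return 65.5 / delta + 11.59 * b(delta, xi)

# Solve b(24, xi) = 0.396 for xi by bisection (b is increasing in xi).
lo, hi = 1e-6, 1.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    if b(24.0, mid) < 0.396:
        lo = mid
    else:
        hi = mid
xi = 0.5 * (lo + hi)   # approximately 0.0463 per hour, as in the text

# Scan whole-hour inspection intervals for the minimum downtime.
best = min((downtime(d, xi), d) for d in range(8, 73))
print(f"xi = {xi:.4f}, best interval = {best[1]} h, downtime = {best[0]:.2f} h/week")
```

The minimum found this way lies near 21-22 hours, with the downtime curve very flat around it, which is consistent with daily maintenance being about optimal.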
Figure 1.3 shows estimated downtime per week as a function of Ll, and
it can be seen that daily maintenance is about optimal. This must be the
simplest possible example of maintenance optimisation using the DTM. There
are a number of points arising that are discussed further at later points in
this chapter, viz.:

Fig. 1.3. Downtime per week D (hours) as a function of the interval Δ (hours)
between inspections for a simple case study of a canning line from Christer and
Waller (1984c)

1. The model assumed a stationary process, such as a homogeneous Poisson
   process (HPP) of defect generation, and an exponential delay time dis-
   tribution. This raises the general problem of testing goodness of model
   fit: in this example, how do we know that the delay time distribution
   really is exponential?
2. There are two model parameters in the problem, the rate of defect arrival
   λ and the mean delay-time ξ⁻¹. Both were estimated here by the method
   of moments, λ by equating the observed number of defects coming to
   light to its expected value, and ξ by equating b(Δ) to its observed value.
   In fact ξ can be more accurately estimated using maximum likelihood
   estimation, if failure times are known. Parameter estimation is often more
   complicated than in this example, and then only the maximum likelihood
   approach is practicable.
3. The regularity of the inspection interval and the HPP assumption meant
   in effect that data from many inspection intervals could be pooled. The
   sequence number of the interval in which an event occurred contained no
   information about the model parameters and could be ignored. In general
   inspections are not carried out quite regularly, and parameter estimation
   is then more difficult. The maximum likelihood method can cope with
   this.
4. It is possible to calculate standard errors on model parameters, and prop-
   agate these to give a standard error on the optimum inspection interval
   Δ.
5. In the actual case-study, a different approach was adopted, and a para-
   metric form was not fitted to f(h).
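Point 4 can be illustrated with a numerical delta-method calculation. In the sketch below, the point estimate ξ = 0.0463 comes from the case study, but the standard error se_xi = 0.008 is purely hypothetical, chosen only to show the propagation; the optimum interval is found by grid search and its derivative with respect to ξ by a central finite difference:

```python
import math

def b(delta, xi):
    """Failure fraction b(Δ) = 1 − (1 − exp(−ξΔ))/(ξΔ), equation (1.1)."""
    x = xi * delta
    return 1.0 - (1.0 - math.exp(-x)) / x

def downtime(delta, xi):
    """Downtime per week from the case study: 65.5/Δ + 11.59 b(Δ)."""
    return 65.5 / delta + 11.59 * b(delta, xi)

GRID = [d / 10.0 for d in range(80, 721)]   # candidate intervals, 8.0 .. 72.0 h

def optimum(xi):
    """Grid search for the downtime-minimising inspection interval."""
    return min(GRID, key=lambda d: downtime(d, xi))

xi_hat = 0.0463      # point estimate from the case study
se_xi = 0.008        # hypothetical standard error, for illustration only

# Delta method: SE(optimum) is approximately |d optimum/d xi| * SE(xi),
# with the derivative taken by a central finite difference.
eps = 0.002
dopt_dxi = (optimum(xi_hat + eps) - optimum(xi_hat - eps)) / (2.0 * eps)
se_opt = abs(dopt_dxi) * se_xi
print(f"optimum interval {optimum(xi_hat):.1f} h, standard error about {se_opt:.1f} h")
```

The same recipe applies with more parameters: propagate the covariance matrix of the parameter estimates through the numerically computed gradient of the optimum.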

1.4 Model Assumptions

Given the delay-time concept as outlined, it is possible to derive very many
mathematical models for different inspection maintenance situations. All of
them include the following general assumptions that characterise the delay-
time concept:
time concept:
Set A
1. Failure is detectable as soon as it occurs and without the need for in-
spection.
2. A failed system must be repaired before it is again usable.
3. Before failure occurs, a component passes through one or more impaired
or defective states.
4. Whether or not a component is in a defective state can only be deter-
mined by inspection, i.e. a defective component appears to otherwise
function normally. The exception to this is the existence of operator-
reported defects (Chilcott and Christer 1991).
5. The main function of preventive maintenance is to replace or repair de-
fective components on inspection.
Typical additional assumptions for relatively simple models are:
Set B
1. The only effect of preventive maintenance on the system is the replace-
ment of defective components, and the maintenance intervention has no
other beneficial or hazardous effect.
2. Inspection and the repair or replacement of defective components (pre-
ventive maintenance) are undertaken jointly.
3. Inspections occur at equally spaced intervals.
4. All identified defects are repaired.
5. Inspections and repairs take negligible time.
6. There are no false positives, i.e. if a defect is not present one will not be
identified.
7. Every defect has the same probability r ≤ 1 of being detected at an
inspection, and this probability does not vary with time since the defect
first became detectable.
8. The delay time h of a fault is independent of its time of origin u.
9. All costs or surrogate measures such as downtime are fixed quantities,
i.e. they are not stochastic.

Typical additional assumptions for a simple model that tracks key com-
ponents are:
Set C
1. Each component has only one failure mode.
2. f and g are modelled as exponential or Weibull distributions.
3. The age of the system, as distinct from the age of the component, does
not influence the distributions G and F.
4. Repairs are statistically equivalent to replacements, so that the faulty
component is restored to an 'as-new' condition.
5. The key components of a machine are assumed independent, i.e. the
failure of one will not affect the subsequent functioning of another.
6. If more than one machine in a set is modelled, machines are assumed to
behave identically and to have uniform usage.
7. Total breakdown repair time is negligible compared to operating time.
Additional model assumptions for a simple model where individual compo-
nents are not tracked are:
1. The number of components is very large, and the probability of any given
component becoming defective in a specified period is very small, so that
defects arise in a NHPP.
2. Defects are repaired sufficiently well that the probability of any given re-
paired component again becoming defective is infinitesimally small. This
assumption is required in order not to jeopardise the NHPP of defect
arrival times. (For example, imperfect repair would cause a clustering of
defect arrival times).
In practice, the NHPP is a good approximation for any complex machine
where individual components cannot be tracked. The first set of model as-
sumptions (set A) cannot be changed without ceasing to have a recognisable
'delay-time' model. When the main function of maintenance is not the re-
placement or repair of defective parts, but is, for example, age-based replace-
ment of components, then the DTM is not applicable. However, it is possible
to include other effects of maintenance besides replacement of defective parts,
for example the 'rejuvenation' or premature ageing of machinery by beneficial
or hazardous inspection is discussed in Section 7.1.
In general, the more specific model assumptions in sets Band C can be
relaxed or varied to suit the problem at hand.
Given a model constructed according to these assumptions, the mainte-
nance activity is understood well enough to calculate optimum maintenance
policies. This will often mean simply finding the optimum frequency of main-
tenance. The development will again differ according to the criterion chosen
for optimisation, e.g. minimum cost, minimum downtime or maximum out-
put. It is possible to devise an optimum policy for component tracking models
by which maintenance occurs at irregularly-spaced epochs after renewal of a
component, and this is discussed in Christer (1991b).

2. Likelihood Functions for Useful DTMs


In general, one wishes the model to be consistent with all available objective
and subjective data. The objective data to be fitted would be the results of
inspections: the number of defects found, and the times of failures.
By 'subjective data' is meant data acquired by administering a question-
naire (Christer and Whitelaw 1983) to engineers, when a defect is found at
inspection or a component fails, or in the complete absence of any 'objective'
data. The key questions are: 'How long ago could a fault have first been no-
ticed by an inspection or operator (HLA)?', and 'If the repair were not carried
out, how much longer could it be delayed before a repair is essential (HML)?'.
At an inspection, the subjective estimate of delay-time h = HLA + HML, and
similarly for a failure, when of course HML = 0. If the inspection identifying
a defect is made at time t, then u = t - HLA. Work is currently in progress
to make subjective estimation easier and less labour-intensive, and bring it
into line with the conclusions of psychologists and Bayesian statisticians.
This chapter is concerned solely with the 'objective' approach. Subjective
information is used even here, because the type of model fitted to data will be
strongly influenced by information about the plant and maintenance practices
from engineers and technicians.
The likelihood function L(x|M) is the probability or pdf. of observations
x that were made, given the model M. Hence it is a specification of model
assumptions. Later in this section the likelihood function will be used as a
means of deriving a DTM for complex plant from the 'component tracking'
model. Given the likelihood function L for a DTM, one can obtain maxi-
mum likelihood estimates θ̂ of model parameters θ, and estimate the
standard error of θ̂.
It may not be obvious that model parameters can be estimated from data
on timings of inspections, numbers of defects found at successive inspections,
and failure times. Although one can write down a likelihood function, it does
not follow that all the parameters included in it are estimable. For example,
given data consisting only of failure times, it is not possible to estimate
parameters of the distributions G and F. The data only contain information
about their convolution, because only the sum of u and h is observed.
It is possible to see intuitively that there is enough information in the
data to at least fit simple models. If maintenance is effective, there will be a
drop in the rate of occurrence of failures (ROCOF) just after maintenance, and the rate
will then increase. Clearly there is information about G, because the rate of
increase of ROCOF depends on how quickly defects can arise. The number
of defects caught at inspection also depends on G and on F, because if the
mean delay time is short, few defects will be caught. Hence parameters of G
and F can be estimated.
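This qualitative behaviour is easy to reproduce by simulation. The sketch below is loosely based on the canning-line numbers but is otherwise illustrative: defects arrive as an HPP, delay times are exponential, inspections are perfect and periodic, and each defect is handled independently (as in the many-component NHPP view). Counting failures in the first and second halves of the inspection interval shows the ROCOF low just after an inspection and rising towards the next one:

```python
import math
import random

random.seed(1)
lam = 16.6 / 168.0     # defect arrival rate per hour (16.6 defects/week)
xi = 0.0463            # exponential delay-time rate per hour
delta = 24.0           # inspection interval (hours)
horizon = 200000.0     # simulated operating hours

# Defects arrive as an HPP; each has an exponential delay time. A perfect
# inspection every delta hours removes any defect that has not yet failed.
early, late = 0, 0     # failures in first / second half of an interval
t = 0.0
while True:
    t += random.expovariate(lam)          # defect origination time u
    if t >= horizon:
        break
    fail = t + random.expovariate(xi)     # would-be failure time u + h
    next_insp = math.ceil(t / delta) * delta
    if fail < next_insp:                  # fails before it can be caught
        phase = fail % delta              # time since the last inspection
        if phase < delta / 2.0:
            early += 1
        else:
            late += 1
print(early, late)   # far more failures fall late in the interval
```

With perfect inspection the failure rate at phase s within an interval is λF(s), so the imbalance between the two halves reflects the shape of F, which is why such data carry information about the delay-time distribution.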
If inspection is perfect however, there can never be a delay time longer
than the interval between inspections, and so the tail of the F distribution
is unknown; the behaviour of F in its tail is extrapolated from the parame-
terisation of F. This means that on recommending longer intervals between
inspections than those found in the data, there will be a large uncertainty
on the optimum interval recommended. This will appear 'automatically', on
propagating errors on model parameters through to errors on the recom-
mended inspection interval.
If inspection is perfect, the ROCOF must decrease to zero just after an
inspection unless the convolution of G and F is J-shaped. If λ is constant
or slowly varying, the expected number of failures per unit time in (0, t) is
λ ∫_0^t F(u) du/t < λF(t). This equals λ ∫_0^t ψ(x) dx for small t, where ψ is the
hazard function of F. This will be O(t), i.e. vanishingly small for small t,
unless ψ(t) → ∞ faster than 1/t as t → 0.

Thus from the value of the ROCOF soon after inspection, the perfectness
of inspection can be estimated. However, when F(t) ∝ t or f(t) is a constant,
the ROCOF will fail to increase with t. This means that inspections are
carried out so frequently that very few failures are seen. As long as there
are some failures however, F can be estimated at times small compared to
the mean delay-time. Clearly, with such highly censored data, one cannot do
more than fit an exponential distribution.
Finally, if the rate of arrival of defects is increasing with time as the
machine ages, numbers of defects and failures seen will increase with time,
so that effects due to the machine's age can be estimated also.
It is unfortunate that parameters to be estimated can only be found indi-
rectly, by for example likelihood maximisation. Although it is possible to draw
some conclusions about model parameters by 'eyeballing' data, as described
here, there is as yet no formal method for producing a graph or plot that
can suggest a sensible parameterisation. A procedure that made it possible
to 'see' model distributions directly would be invaluable.

2.1 The Component Tracking Model

In general a machine contains several components, which are replaced on failure
or if found to be defective at an inspection. An inspection of all components
may or may not be carried out on the failure of any. If such an 'opportunistic'
inspection is carried out, the state of unfailed but defective components is
altered by the failure of another component, but otherwise there is no 'com-
ponent dependency', and the failure of a component does not directly affect
the hazard of failure of another. The only other linkage between components
in this model occurs when the distributions G and F are functions of the
machine's age when the (replacement) component was inserted. This causes
times to failure etc. to be positively correlated, and is discussed in Baker and
Wang (1993).
We first introduce the necessary terminology: the possible events that can
contribute to the likelihood are
B Breakdown (failure)
N Inspection and no defect found
Y Inspection and defect found
E End of observation period.
Event N will be referred to as a negative inspection, event Y as a positive
inspection. In addition, the following event types are useful:
S Start of observation period
R Replacement (on a B or Y)
X Denotes any event
Event S is equivalent to an R event. We wish to write down the likelihood
of observing a sequence of events X_1 ... X_n of types B, E, Y and N at times
t_1 ... t_n. The key to doing this is the multiplication law of likelihood, i.e.

L = P_{X_1} × P_{X_2|X_1} × P_{X_3|X_1,X_2} × ... × P_{X_n|X_1,...,X_{n-1}} (2.1)

for the likelihood of n events. The probability of an event is P, and e.g.
P_{X_2|X_1} means the probability of event X_2 given that event X_1 has occurred.

After a replacement R, the likelihood does not depend on any event pre-
vious to R; a replacement is a regeneration point for the process. There-
fore the likelihood can be written as the product of terms conditional on
events RX_1X_2 ... starting with the last renewal. Further, P_{X|RN_1N_2N_3...N_n} =
P_{X|RN_n}, i.e. if an event X follows a sequence N_1 ... N_n of negative inspec-
tions at times t_1 ... t_n, the fact that there was no defect visible on the last
inspection of the sequence is what determines the probability of the event X.
This is a type of Markov property.
For brevity, rather than deriving probabilities from first principles, they
are simply presented with intuitive justifications. It turns out that only three
key probabilities need be considered; the likelihood can be built up from these
three, and others that are special cases of them.
- P_{NB|R}(t_n, t) is the pdf. of a sequence of negative inspections, of which the
  last occurs at time t_n from last renewal, and a breakdown at time t from
  last renewal. The sequence of negative inspections may be null, in which
  case P_{B|R}(t) is the pdf. of a breakdown at time t, given that the last event
  was a replacement at time zero. This may also be written as P_{NB|R}(0, t)
  or P_{B|R}(0, t).
  This use of notation reflects the fact that an inspection made at the instant
  of renewal must be negative with probability unity. Hence one can always
  'smuggle in' such a notional inspection without altering the likelihood, and
  hence justifiably write
  P_{B|R}(t) ≡ P_{NB|R}(0, t).
  Finally,

  P_{NB|R}(t_n, t) = ∫_{t_n}^{t} g(u) f(t − u) du. (2.2)

Since g(u) is the pdf. that a defect arises at time u, and f(h) is the pdf.
that a breakdown occurs a time h later, g(u)f(t - u) is the pdf. of a failure
at t arising from a defect at u, and the integration sums over all possible
times u. These can only occur after the last moment that there was known
to be no defect, t_n, and before the breakdown time t.
It is also true that
P_{NB|R}(t_n, t) = P_{N|R}(0, t_n) P_{B|RN}(t_n, t),
where P_{N|R}(0, t_n) is the probability of a negative inspection at time t_n
from renewal, and P_{B|RN}(t_n, t) is the pdf. of a breakdown, conditional on
that negative inspection.
- P_{NE|R}(t_n, t) is the probability of a (possibly null) sequence of negative
  inspections of which the last is at t_n, and no breakdown before observation
  ceases at time t from last renewal.

  P_{NE|R}(t_n, t) = 1 − G(t_n) − ∫_{t_n}^{t} g(u) F(t − u) du.

  This expression is simpler to interpret in its alternative form

  P_{NE|R}(t_n, t) = 1 − G(t) + ∫_{t_n}^{t} g(u)(1 − F(t − u)) du.

  The first term, 1 − G(t), is the probability that no defect arises before time
  t, and the second contribution to the probability of no failure is that a
  defect does arise at time u > t_n, but does not lead to a failure before time
  t. The product g(u)(1 − F(t − u)) is the pdf. that a defect arises at u and
  that there is no failure before time t, and the integration sums over all
  possible times u, after the last negative inspection at t_n and before time t.
  As before, the probability of no event may be written as the probability
  of a sequence of negative inspections, multiplied by the probability of no
  event given such a sequence, i.e.
  P_{NE|R}(t_n, t) = P_{N|R}(0, t_n) P_{E|RN}(t_n, t).
- P_{NY|R}(t_n, t) is the probability of a sequence of negative inspections of
  which the last occurs at t_n, followed by a positive inspection at time t
  from last renewal.

  P_{NY|R}(t_n, t) = G(t) − G(t_n) − ∫_{t_n}^{t} g(u) F(t − u) du.

  It is simpler to understand in its alternative form

  P_{NY|R}(t_n, t) = ∫_{t_n}^{t} g(u)(1 − F(t − u)) du. (2.3)

  The pdf. for a fault arising at time u is g(u), and the probability of no
  breakdown before t is (1 − F(t − u)). The integration sums over all possible
  times of fault origin u.
562 Rose Baker

As before, the probability of a positive inspection can be written as the


product of the probability of a sequence of negative inspections, and the
probability of a positive inspection given such a sequence, i.e.
PNYIR(tn, t) = PNIR(0, tn)PYIRN(tn, t).
The three key probabilities are conditional on the last renewal. With Weibull
distributions for g and f, the probabilities are calculated by substituting:
G(u) = 1 - e^{-(α1 u)^{β1}},

g(u) = β1 α1^{β1} u^{β1 - 1} e^{-(α1 u)^{β1}},

F(h) = 1 - e^{-(α2 h)^{β2}},

f(h) = β2 α2^{β2} h^{β2 - 1} e^{-(α2 h)^{β2}},
where α1, α2 are scale parameters and β1, β2 are shape parameters.
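As a numerical sanity check on these formulae, the three probabilities can be evaluated by quadrature. The sketch below uses illustrative Weibull parameter values (α1 = 0.5, β1 = 1.5, α2 = 1.0, β2 = 2.0 are assumptions, not values from the text) and verifies that the two algebraic forms of PNEIR agree:

```python
import math

# Illustrative Weibull parameters (assumed values, not from the text)
a1, b1 = 0.5, 1.5   # alpha1, beta1: defect-arrival distribution G, g
a2, b2 = 1.0, 2.0   # alpha2, beta2: delay-time distribution F, f

def G(u):
    return 1.0 - math.exp(-(a1 * u) ** b1)

def g(u):
    return b1 * a1 ** b1 * u ** (b1 - 1.0) * math.exp(-(a1 * u) ** b1)

def F(h):
    return 1.0 - math.exp(-(a2 * h) ** b2) if h > 0.0 else 0.0

def f(h):
    return b2 * a2 ** b2 * h ** (b2 - 1.0) * math.exp(-(a2 * h) ** b2) if h > 0.0 else 0.0

def integrate(fn, lo, hi, n=400):
    """Composite midpoint rule."""
    h = (hi - lo) / n
    return h * sum(fn(lo + (i + 0.5) * h) for i in range(n))

def pnbir(tn, t):
    """pdf of a breakdown at t, after a negative inspection at tn."""
    return integrate(lambda u: g(u) * f(t - u), tn, t)

def pneir(tn, t):
    """P(no breakdown before observation ceases at t)."""
    return 1.0 - G(tn) - integrate(lambda u: g(u) * F(t - u), tn, t)

def pneir_alt(tn, t):
    """Alternative form of PNEIR."""
    return 1.0 - G(t) + integrate(lambda u: g(u) * (1.0 - F(t - u)), tn, t)

def pnyir(tn, t):
    """Eq (2.3): probability of a positive inspection at t."""
    return integrate(lambda u: g(u) * (1.0 - F(t - u)), tn, t)
```

A further check is that PNEIR(tn, t) plus the breakdown pdf integrated over (tn, t) recovers 1 - G(tn): starting from a negative inspection at tn, either no failure occurs by t, or a failure occurs at some s in (tn, t).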
The likelihood is calculated by accumulating the product of these three
terms. Each renewal may be followed by a sequence of negative inspections,
and this must terminate in an event of type B, E, or Y. Event E is really
'no event'. The likelihood L for a total of nB breakdowns at times ti, nE
'no failure before observation ceases' events at times tj, and nY positive
inspections at times tk, is

L = ∏_{i=1}^{nB} PNBIR(ti*, ti) ∏_{j=1}^{nE} PNEIR(tj*, tj) ∏_{k=1}^{nY} PNYIR(tk*, tk),        (2.4)

where the notation ti*, tj*, tk* denotes the time of the latest negative
inspection or, failing that, the latest renewal, such that ti* < ti, and so on.
In the more general case of several identical machines, the likelihoods
corresponding to individual machines are multiplied together.

2.2 More Than One Component

The case of a machine comprising two components is discussed first. They are
assumed to be mutually independent in that the state of either component is
assumed not to affect that of the other. There are two possible scenarios: when
component A fails, component B is either not inspected (case 1) or inspected
and replaced if visibly defective (case 2). Happily both are tractable.
In case 1, the two components are completely independent: nothing that
happens to either of them can affect the other, and the likelihood factorizes.
The log-likelihood is the sum of log-likelihoods for each component, log C =
log CA + log CB . In case 2, they are no longer independent, because a failure
of A will cause the replacement of B, if B is visibly defective, and vice versa.
Happily, the likelihood can still be written in factored form, even though the
components are not now independent. A failure of either component (A, say)
Maintenance Optimisation with the Delay Time Model 563

simply generates an inspection event (N or Y) for the other, at the failure


time tA. These extra inspection events mean that the log-likelihood for A is
conditioned on the behaviour of B, and vice versa. The computer analysis is
simple, as the program merely has to insert these extra inspections into the
record before further analysis, and then proceed to estimate parameters for
each component separately, as long as the components have no parameters
in common.
The argument generalises immediately to arbitrarily many components.

2.3 Imperfect Inspection

So far it has been assumed that inspections always find a visible defect if it is
there. In the case of imperfect inspection, there is a probability r < 1 that a
defect is found if it exists. Successive 'trials' or inspections are independent.
This is equivalent to saying that a (perfect) inspection is carried out with
probability r, and that with probability 1- r the inspection is omitted. The
inspection is regarded as omitted merely as far as our state of knowledge
of the machine is concerned: it is not omitted as regards cost, downtime,
and other such consequences. The component-tracking model was developed
for the imperfect inspection case in Baker and Wang (1993). The logic is
complicated, and is not reproduced here.

2.4 The Nonhomogeneous Poisson Process Model

The component tracking model cannot be used for complex plant where there
are very many components, and where detailed records are lacking. Following
a Pareto analysis of failure modes, any unreliable and hence frequently
replaced components could be modelled as above, and the remainder of the
defect and failure types grouped into q classes. In what follows it is assumed
that there is a 1 : 1 correspondence between defect types and failure modes;
however, it is straightforward to generalise the model to the situation where
several types of defect lead to a single failure mode, or vice versa. Each class
of defect is assumed to arise in an NHPP (nonhomogeneous Poisson process)
with intensity λp(u) for the pth class, and to generate a failure at time t > u,
according to the distribution Fp(t - u).
The NHPP model is now derived as a limiting case of the component
tracking model, and its likelihood function found.
2.4.1 The NHPP. Assume that there are a large number M of components,
and consider the distribution function of time to a defect arising
for the mth component, Gm(u) = 1 - exp{-∫_0^u λm(x) dx}, where
λm(x) is the hazard of failure at time x. Let λm(x) → 0 and M → ∞
such that the total hazard λ(x) = Σ_{m=1}^{M} λm(x) = h(x) is finite. Then
Gm(u) → ∫_0^u λm(x) dx. Group components into q classes A1 ... Aq, so
that the expected number of defects arising by time u in the pth class is

Σ_{m∈Ap} Gm(u) → Σ_{m∈Ap} ∫_0^u λm(x) dx. Note that the LHS of this equation
is an approximation, because once a component has developed a defect,
the expected number of defects due to it shortly afterwards at some time
u' > u is not Gm(u'). However, as λm → 0, a vanishingly small fraction
of components will have developed defects by time u, so that the expected
number of defects by time u tends to the RHS expression, as long as the
hazard of failure still → 0 as M → ∞ after repair or replacement. A process
whose expected number of defects by time u is a function only of u is
an NHPP, and so this is a good model of defect arrival for complex plant.
Gamma, Weibull and log logistic distributions for G all lead to the power
law process λ(u) = au^{β-1}, and the exponential distribution to the special
case of an HPP, where β = 1. The Gompertz distribution leads to the loglinear
process λ(u) = a exp{βu}.
2.4.2 The Likelihood Function. Suppose that 'events' (failures or detection
of defects) may be observed at epochs t1 ... tn. This means that failures are
interval censored; they occur during the interval (ti-1, ti). It is straightforward
to later revert to the case where the timing of failures is known exactly.
Some of the ti will however be times when inspections are carried out.
Then

L = ∏_{m=1}^{M} ∏_{i=1}^{n} pim^{sim} × ∏'_m (1 - Σ_{i=1}^{n} pim),        (2.5)

where pim is the probability of an event of appropriate type (failure or defect
found at inspection) for the mth component at the ith time, sim is the number
of such events (either 0 or 1) and the final (primed) product runs over all
components that have not given rise to an event of either type.
The pim are proportional to λm, so as M → ∞, pim → 0. The final
product can now be taken over all M components rather than those not
suffering any event, and is equal to exp{-Σ_{i=1}^{n} Σ_{m=1}^{M} pim} in the
limit of M → ∞.
The likelihood is now a product of Poisson expressions

L = ∏_{i=1}^{n} ∏_{m=1}^{M} pim^{sim} exp{-pim}.        (2.6)

Grouping components into q classes, the number of events aip in the pth
class is aip = Σ_{m∈Ap} sim, and the mean number in that class is μip =
Σ_{m∈Ap} pim. Here aip follows a Poisson distribution, because it is the sum of
a number of Poisson random variates sim, and so the full likelihood is

L = ∏_{i=1}^{n} ∏_{p=1}^{q} μip^{aip} exp{-μip}/aip!

  = ∏_{i=1}^{n} ∏_{p=1}^{q} (μip^{aip}/aip!) × exp{-Σ_{i=1}^{n} Σ_{p=1}^{q} μip}.        (2.7)
All that remains is to split the 'events' into failures and defects found at
inspection, and to write the means μip for the imperfect-inspection case with
probabilities defined in equations (2.2) and (2.3), where g(u) is replaced by
λp(u). The resulting formulae are:
When the ith time is a time at which failures occurring during (ti-1, ti)
are noted and repaired:

μip = Σ_{j=1}^{i} ∏_{k=j}^{i-1} (1 - rk) ∫_{t_{j-1}}^{t_j} λp(x){F(ti - x) - F(ti-1 - x)} dx,        (2.8)

with the convention that F(t) = 0 if t < 0, and when the ith time is a time
when an inspection took place,

μip = ri Σ_{j=1}^{i} ∏_{k=j}^{i-1} (1 - rk) ∫_{t_{j-1}}^{t_j} λp(x){1 - F(ti - x)} dx,        (2.9)

where rk is the probability of an inspection detecting a defect at time tk.


Typically one would set rk = r when an inspection was carried out at tk, rk =
r' when a failure of another type and accompanying opportunistic inspection
took place at tk, and rk = 0 if the failure at tk was of the pth type.
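Equations (2.8) and (2.9) translate directly into code: a sum over the interval j in which the defect arose, a product of missed-detection probabilities over the intervening inspections, and a quadrature over the arrival time. The sketch below assumes a constant intensity λp and an exponential delay-time distribution (illustrative choices, not from the text):

```python
import math

lam = lambda x: 0.3   # assumed constant (HPP) intensity for one defect class
xi = 0.8              # assumed exponential delay-time rate

def F(t):
    # convention: F(t) = 0 if t < 0
    return 1.0 - math.exp(-xi * t) if t > 0.0 else 0.0

def integrate(fn, lo, hi, n=400):
    """Composite midpoint rule."""
    h = (hi - lo) / n
    return h * sum(fn(lo + (i + 0.5) * h) for i in range(n))

def mu_failures(i, t, r):
    """Eq (2.8): mean number of failures noted in (t[i-1], t[i]).
    t[0] = 0 is the renewal epoch; r[k] is the detection probability at t[k]."""
    total = 0.0
    for j in range(1, i + 1):
        miss = 1.0
        for k in range(j, i):            # defect missed at inspections j..i-1
            miss *= 1.0 - r[k]
        total += miss * integrate(
            lambda x: lam(x) * (F(t[i] - x) - F(t[i - 1] - x)), t[j - 1], t[j])
    return total

def mu_found(i, t, r):
    """Eq (2.9): mean number of defects found at the inspection at t[i]."""
    total = 0.0
    for j in range(1, i + 1):
        miss = 1.0
        for k in range(j, i):
            miss *= 1.0 - r[k]
        total += miss * integrate(
            lambda x: lam(x) * (1.0 - F(t[i] - x)), t[j - 1], t[j])
    return r[i] * total
```

With perfect inspection (r = 1) and a single epoch at t1 = Δ, every defect arising in (0, Δ) either fails during the interval or is found at the inspection, so the two means must sum to λΔ.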
It is worth noting that the likelihood is that of n independent Poisson
random variates, for each class of defect. It can be rewritten as the product
of a multinomial distribution of numbers of events occurring at n different
times, given the total number of events Σ_{i=1}^{n} aip in each class, and of a
Poisson distribution of the total number of events occurring:

L = ∏_{p=1}^{q} [ {(Σ_{i=1}^{n} aip)!/∏_{j=1}^{n} ajp!} ∏_{i=1}^{n} (μip/Σ_{j=1}^{n} μjp)^{aip}
        × {(Σ_{j=1}^{n} μjp)^{Σ_{i=1}^{n} aip}/(Σ_{j=1}^{n} ajp)!} exp{-Σ_{j=1}^{n} μjp} ].        (2.10)

It can be seen by imagining the intensities λp(u) scaled up by some factor
that this factor cancels from numerator and denominator of the multinomial
part of the likelihood, and that therefore only the likelihood of observing
the total number of events Σ_{i=1}^{n} aip contains information about the overall
rate of arrival of defects. In consequence, the estimates of any parameters
appearing only in the multinomial factor of the likelihood are statistically
independent and therefore uncorrelated with the estimates of the λp. The
example makes this point clearer.

2.4.3 A Simple Example. In the simplest case, which is still however of
practical interest, r = 1 and β = 1, giving an HPP of rate λ, inspections are
regularly spaced at an interval Δ, there are no opportunistic inspections done
after failures, and observation ceases immediately after the last inspection.
From equation (2.8), on particularising to infinitesimal time intervals for
failures, the expected number of failures in time interval (t, t + dt) is

λ ∫_0^t f(t - x) dx = λF(t)        (2.11)

at time t after an inspection, and from equation (2.9) the expected number
of defects found at any inspection is

λ ∫_0^Δ {1 - F(Δ - z)} dz = λ ∫_0^Δ {1 - F(z)} dz.

The expected total number of defects detected in any way will be found to
equal λΔ, as it must.
The downtime D per unit time is modelled as

D(Δ) = {C1 + C2 λΔ b(Δ)}/Δ,        (2.12)

where C1 is the downtime due to an inspection, C2 is the downtime due to
a failure, and b(Δ) is the fraction of defects manifesting as failures. From
equation (2.11)

b(Δ) = λ ∫_0^Δ F(z) dz/(λΔ) = ∫_0^Δ F(z) dz/Δ.

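The formula b(Δ) = ∫_0^Δ F(z) dz/Δ can be checked by simulation: defects arrive as an HPP within an inspection interval, each draws an independent delay, and a defect manifests as a failure only if arrival time plus delay falls before the next inspection. All parameter values below are illustrative:

```python
import math
import random

random.seed(1)
lam, xi, Delta = 2.0, 0.5, 1.5   # assumed rates and inspection interval
n_periods = 100_000

defects = failures = 0
for _ in range(n_periods):
    t = random.expovariate(lam)           # HPP defect arrivals on (0, Delta)
    while t < Delta:
        defects += 1
        if t + random.expovariate(xi) < Delta:
            failures += 1                 # fails before the next inspection
        t += random.expovariate(lam)

b_sim = failures / defects
b_exact = 1.0 - (1.0 - math.exp(-xi * Delta)) / (xi * Delta)   # int F / Delta
```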

Given data from M inter-inspection periods, the likelihood in equation (2.7) becomes

L = (Mλ)^d {∏_{j=1}^{k} F(tj)} [∫_0^Δ {1 - F(z)} dz]^{d-k} exp{-MλΔ}/∏_{i=1}^{M} (di - ki)!,        (2.13)

where k is the total number of failures, d - k the total number of defects
found at inspection, and di - ki the number of defects found at the ith of M
inspections. The logarithm is

ℓ = d log λ - λMΔ + Σ_{j=1}^{k} log F(tj)
        + (d - k) log ∫_0^Δ {1 - F(z)} dz + constant.        (2.14)

The simplest form for F is the exponential distribution, F(t) = 1 - exp(-ξt),
so that

ℓ = d log λ - λMΔ + Σ_{j=1}^{k} log{1 - exp(-ξtj)}
        + (d - k) log{1 - exp(-ξΔ)} - (d - k) log ξ + constant.        (2.15)

The condition ∂D/∂Δ = 0 gives the equation for Δ*,

(1 + ξΔ*) e^{-ξΔ*} = 1 - ξC1/(λC2),        (2.16)

which has a solution if ξC1 < λC2; otherwise inspections should never be
performed.
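Because the left-hand side of (2.16) decreases monotonically from 1 to 0 as Δ* increases, the root is easily found by bisection. The sketch below assumes illustrative values of ξ, λ, C1 and C2 with ξC1 < λC2, and checks the result against the downtime rate D(Δ) = {C1 + C2λΔb(Δ)}/Δ directly:

```python
import math

xi, lam, C1, C2 = 0.5, 2.0, 0.2, 1.0   # assumed values; xi*C1 < lam*C2 holds

def lhs(delta):
    """Left side of (2.16), decreasing from 1 to 0 on (0, inf)."""
    return (1.0 + xi * delta) * math.exp(-xi * delta)

target = 1.0 - xi * C1 / (lam * C2)

lo, hi = 1e-9, 1.0
while lhs(hi) > target:                 # bracket the root
    hi *= 2.0
for _ in range(200):                    # bisection
    mid = 0.5 * (lo + hi)
    if lhs(mid) > target:
        lo = mid
    else:
        hi = mid
delta_star = 0.5 * (lo + hi)

def D(delta):
    """Downtime per unit time for exponential F, b = 1 - (1-e^{-xi d})/(xi d)."""
    b = 1.0 - (1.0 - math.exp(-xi * delta)) / (xi * delta)
    return (C1 + C2 * lam * delta * b) / delta
```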
To determine Δ*, however, ξ must first be estimated from available data.
On differentiating equation (2.15) with respect to λ and setting ∂ℓ/∂λ = 0,
the MLE λ̂ = d/(MΔ) is obtained. The defect arrival rate is estimated as
the number of defects detected in some way divided by the time interval.
The covariance matrix is estimated as the inverse of the information matrix
-∂²ℓ/∂θi∂θj taken at θj = θ̂j, where θj is the jth parameter (Kendall and
Stuart 1979). Here ∂²ℓ/∂λ∂ξ = 0, so that the estimates of λ and ξ are
statistically independent. This independence of defect arrival rate estimates
from other model parameter estimates follows in general from equation (2.10).
As ∂²ℓ/∂λ² = -d/λ², the estimated variance of λ̂ is

σ²_λ̂ = λ̂²/d.
Differentiating ℓ with respect to ξ,

∂ℓ/∂ξ = (d - k)Δ/(e^{ξΔ} - 1) + Σ_{j=1}^{k} tj/(e^{ξtj} - 1) - (d - k)/ξ.
Equating ∂ℓ/∂ξ to zero, we obtain the ML estimator ξ̂ as the solution of

Σ_events ξ̂tj/(exp{ξ̂tj} - 1) = d - k,        (2.17)

in a loose notation, where tj is the time of any event (defect found at inspection
or failure). Thus

tj = Δ  if the event is detection of a defect,
tj = t  if the event is a breakdown at time t.

Each term on the LHS lies between 0 and 1, approaching zero as ξ → ∞ and
unity as ξ → 0. Thus there will always be a finite solution if 0 < k < d. If
there are no failures (k = 0), ξ̂ → 0 and the estimate of delay time is infinite.
If all defects cause failures (k = d) the estimate of delay time is zero. General
theory shows that, as the range (0, Δ] does not depend on ξ, ξ̂ is a consistent
estimator of ξ, i.e. ξ̂ → ξ as d → ∞.
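Equation (2.17) is straightforward to solve numerically: its left-hand side is decreasing in ξ, falling from d towards 0, so bisection finds the root whenever 0 < k < d. The event times below are illustrative, not data from the text:

```python
import math

# Illustrative data: Delta = 1, k = 4 breakdowns among d = 10 events
Delta = 1.0
breakdown_times = [0.3, 0.7, 0.4, 0.9]
d = 10
k = len(breakdown_times)
event_times = breakdown_times + [Delta] * (d - k)   # t_j = Delta for defects found

def term(x):
    # x/(e^x - 1), each term lying between 0 and 1
    return x / (math.exp(x) - 1.0) if x > 1e-12 else 1.0

def score(xi):
    """LHS of (2.17) minus (d - k); decreasing in xi."""
    return sum(term(xi * t) for t in event_times) - (d - k)

# score > 0 near xi = 0 (LHS -> d) and score < 0 for large xi, so bisect
lo, hi = 1e-8, 1.0
while score(hi) > 0.0:
    hi *= 2.0
for _ in range(100):
    mid = 0.5 * (lo + hi)
    if score(mid) > 0.0:
        lo = mid
    else:
        hi = mid
xi_hat = 0.5 * (lo + hi)
```

The estimated mean delay time is then 1/ξ̂.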
The variance of ξ̂ may be estimated as the inverse of

-∂²ℓ/∂ξ² = Σ_events tj² e^{ξtj}/(e^{ξtj} - 1)² - (d - k)/ξ².        (2.18)

Substituting (d - k) from equation (2.17) gives

-∂²ℓ/∂ξ² = Σ_events tj e^{ξtj} {ξtj - (1 - e^{-ξtj})}/{ξ(e^{ξtj} - 1)²}.

Since in general 1 - e^{-x} < x for x > 0, the RHS is always positive. Hence the
curvature of ℓ is always negative, and so all stationary values are maxima. It
follows that there is only one solution for ξ̂, as if there were more there would
of necessity be a minimum of ℓ also. This is a practically useful result when
maximising likelihood functions numerically, as any maximum found by the
function optimizer must be the maximum.
{_8 2i/ 8e le=e} -1 estimates the variance of f, for a particular realisation
of the random process, but it is also possible to derive the expected variance
{E { _8 2i/ 8e le=e} }-1, which it may be shown applies for large sample sizes
d. To derive this from equation (2.18), the sums are replaced by probability
integrals: in general
k . :1
l/d~f(t$) -+ 10 F(t)/,1f(t)dt.

Also, d - k is replaced by its expectation dF(t)/,1. This process yields


I(e,1) 1 - e-e..:1
e ,1
,1
e= d
2 -1 -1
U {e(ec. :1 _ 1) + ea,1 - 3 } , (2.19)

where I(z) = J;
x 2 dx/(e X - 1). The standard deviation of i" u ex d- 1 / 2. e
It is interesting to compare the variance of the ML estimator with that
of the intuitive estimator ξ̃ obtained by equating the observed and predicted
fraction of defects that manifest as failures, i.e.

(1 - e^{-ξ̃Δ})/(ξ̃Δ) = 1 - k/d.        (2.20)

Using the usual large-sample delta notation, where δξ = ξ̃ - E{ξ̃} and will
be small, differentiating equation (2.20) with respect to ξ gives

{e^{-ξΔ} - (1 - e^{-ξΔ})/(ξΔ)} δξ/ξ = -δk/d.

Squaring and taking expectations, the RHS becomes b(Δ){1 - b(Δ)}/d
as k obeys the binomial distribution, and substituting for b(Δ) from equation (2.20) gives

σ²_ξ̃ = ξ²(1 - e^{-ξΔ})(ξΔ - 1 + e^{-ξΔ})/{d(1 - (1 + ξΔ)e^{-ξΔ})²}.        (2.21)
Figure 2.1 shows the large-sample variances of the ML and naive estimators.
The naive estimator is less efficient than the ML estimator, but its efficiency
approaches 100% as ξΔ → 0, and from equations (2.19) and (2.21) both
estimators then have variance 2ξ/(dΔ). The ML estimator is intuitively better
because it uses the information about failure times in the data. This shows
the advantage of the ML approach.
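The comparison can be reproduced numerically. The sketch below computes the scaled variances dσ²/ξ² from (2.19) and (2.21) as functions of z = ξΔ; both tend to 2/z as z → 0, with the ML variance smaller elsewhere:

```python
import math

def integrate(fn, lo, hi, n=4000):
    """Composite midpoint rule."""
    h = (hi - lo) / n
    return h * sum(fn(lo + (i + 0.5) * h) for i in range(n))

def I(z):
    # I(z) = int_0^z x^2 dx / (e^x - 1)
    return integrate(lambda x: x * x / (math.exp(x) - 1.0), 0.0, z)

def scaled_var_ml(z):
    """d * sigma^2 / xi^2 for the ML estimator, from (2.19)."""
    return 1.0 / (I(z) / z + z / (math.exp(z) - 1.0)
                  - (1.0 - math.exp(-z)) / z)

def scaled_var_mom(z):
    """d * sigma^2 / xi^2 for the moment estimator, from (2.21)."""
    num = (1.0 - math.exp(-z)) * (z - 1.0 + math.exp(-z))
    den = (1.0 - (1.0 + z) * math.exp(-z)) ** 2
    return num / den
```

At z = 2 the relative efficiency of the moment estimator is already a few per cent below 100%, and it falls further as z grows — the pattern of Fig. 2.1.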

Fig. 2.1. Large-sample variances σ² for method of moments and maximum
likelihood estimates of ξ, plotted against ξΔ, where ξ⁻¹ is the mean delay time.

2.4.4 Missing Data. If the number of defects found at an inspection is
not known, the likelihood must be summed over all possible values (0, 1, 2, ...),
which effectively replaces the Poisson probability of the number of observed
defects by unity. It may be known only that some replacement work was
carried out, so that one or more components were replaced, giving a factor
in L of 1 - exp{-μip}. Similarly, if the number of failures at some time is not
known, the Poisson term for the number of failures in an interval is replaced
by unity. However, whether an inspection was carried out or not must be
known, and if opportunistic inspections are being carried out on failure, the
occurrence or not of these failures must also be known.
It often happens that equipment is studied for some period, during which
observations are collected. If the machine is not new when observation starts,
there may be defects present in the machine at t = 0. These will not be present
if inspection is perfect. This left-censoring can be dealt with if a plausible
schedule of inspections is known prior to the study. Then there is in effect
an infinite sequence of previous inspections at which results of inspections
and failures are missing. All that need be done is to extend the range of
integration over time in the definition of the μip back to -∞. One sums over
past inspections until the sum converges. A minor problem is that the age of
the equipment may or may not be known. If it is not known, the use of the
power law process to model λ(u) poses a problem, as u = 0 when the machine
is new. It is then preferable to use the loglinear model, in which time may
be measured from any equipment age, whilst retaining the same functional
form for the model, and merely changing the value of the parameter a, as
a exp{β(t - t0)} = a exp{-βt0} × exp{βt}.

There is a greater problem when opportunistic inspections of all components
are carried out when a failure of any type happens, and when it is not
known whether such failures occurred or not. This will happen particularly
before the time when observation commenced. Then μip is itself a random
variable from a distribution, and the likelihood must be integrated over the
multivariate distribution of the μip at all times i. The same problem arises when
seeking to minimise some measure of cost per unit time, in the computation
of the function b(Δ).
A practical way of approximating this likelihood is as follows: approximate
the likelihood as the average of a large number (say 1000 or 10000) of
likelihoods obtained by simulating the process of defect arrival, defect
detection at inspection, and failure, from some epoch long before observation
commenced, up to u = 0. The simulated pattern of failures for u < 0
is used in evaluating the likelihoods. When parameter values are varied, to
avoid numerical problems, fresh simulations are carried out using the same
set of random numbers. Simulation is quite adequate also for the computa-
tion of cost functions in choosing optimal strategies. The point is that this
is inelegant but works in practice, whereas there is no feasible Monte-Carlo
substitute for numerical computation of the likelihood function.
The special case of computation of b(Δ) under perfect inspection for a
multicomponent system where components are not distinguished is dealt with
in Christer and Wang (1994). Here, with an HPP process of defect arrival, each
failure or inspection is a regeneration point for the failure process. The expected
number of failures in the interval (x, x + dx) timed from the last failure,
without opportunistic inspections, is ∫_0^x λ f(x - u) du dx = λF(x) dx, and
so the expected number of failures in (0, t) would have been λ ∫_0^t F(x) dx.
This is a Poisson-distributed count, and so the probability of no failures in
(0, t) is exp{-λ ∫_0^t F(x) dx}. This is the survival function of the distribution
of interfailure periods with opportunistic inspections. As failure times form
a renewal process, the expected number of failures N in the inspection
interval (0, Δ) is the corresponding renewal function. Then b(Δ) = E(N)/(λΔ).
More general cases may also prove mathematically tractable, but already in
this simple case numerical methods, such as the discretization method of Xie
(1989), are needed to calculate b(Δ).
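A discretised version of this renewal argument (a crude scheme in the spirit of, though not taken from, Xie's method) is sketched below for exponential F and illustrative parameters: the interfailure cdf Q = 1 - S, with S(t) = exp{-λ∫_0^t F(x) dx}, is tabulated on a grid, the renewal equation m(t) = Q(t) + ∫_0^t m(t - s) dQ(s) is solved by forward substitution, and the fraction of the λΔ defects per interval manifesting as failures is m(Δ)/(λΔ):

```python
import math

lam, xi, Delta = 2.0, 1.0, 2.0    # assumed rates and inspection interval
n = 800
h = Delta / n

def S(t):
    """Survival function of interfailure periods, exponential F."""
    return math.exp(-lam * (t - (1.0 - math.exp(-xi * t)) / xi))

Q = [1.0 - S(i * h) for i in range(n + 1)]   # interfailure cdf on the grid

# renewal function m(t_i) = Q(t_i) + sum_j m(t_{i-j}) dQ_j
m = [0.0] * (n + 1)
for i in range(1, n + 1):
    conv = sum(m[i - j] * (Q[j] - Q[j - 1]) for j in range(1, i + 1))
    m[i] = Q[i] + conv

b = m[n] / (lam * Delta)   # fraction of defects manifesting as failures
```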

2.5 An Empirical Bayes Modification


With q > 1 classes of defect/failure, there are many model parameters to be
estimated. This is the more difficult, as some classes of defect may never give
rise to failures, or alternatively may never be detected at inspection. There
are two ways of coping with this problem.
The cruder is not to distinguish between defects/failures of different
classes, but to lump them all together. If the NHPP intensities λp(u) are
all proportional, so that λp(u) = gp h(u), where the intensity of any defect
arising is λ(u) = Σ_{p=1}^{q} λp(u), the Poisson-distributed expected numbers
of events, which are proportional to terms such as Σ_{p=1}^{q} ∫_0^t λp(u)Fp(t - u) du,
can be written as ∫_0^t h(u)F(t - u) du, where F(x) = Σ_{p=1}^{q} gp Fp(x). Thus the
DTM can be formulated using an NHPP of arrival of random defects, and
the distribution function F of the delay-time of a random defect.
The drawback is that useful information is lost, because differences in
delay-time due to defect type are regarded as random variation. It is not
however true that all defects are assumed to have the same delay-time in
this model. Note that if the intensities of arrival of different defect types
are not proportional, the failure delay-time of a random defect will not be
independent of the arrival time u, even if this is so for each defect type
individually.
Note that in the derivation of the NHPP model likelihood, it was also
necessary to assume either that the delay-time distribution F was the same
for all components in a class, or else that the hazards λm(u) were proportional
for all components in a class. Otherwise, the delay time t - u will be correlated
with defect arrival time u.
A better way to avoid estimating many parameters is to regard each of
the q scale parameters of Ap as itself being a random variate from some dis-
tribution of parameter values. This is then a random effects model. Similarly,
the parameters of Fp are a random sample of q parameters from another
distribution, and so on. Since scale parameters must be positive, they must
be distributed according to a lifetime distribution with pdf. v(x). Denote the
mean number of events by xμip, where now in the definition of μip ∝ α all
scale factors αp take a common value.
The likelihood is now

L = ∏_{p=1}^{q} Lp,        (2.22)

where

Lp = ∫_0^∞ {∏_{i=1}^{n} (xμip)^{aip} exp{-xμip}/aip!} v(x|γ) dx,        (2.23)

where γ is a parameter of the v distribution. It is convenient to take

v(x|γ) = γ(γx)^{γ-1} exp{-γx}/Γ(γ),        (2.24)

a gamma distribution of unit mean and variance 1/γ. The integral can then
be evaluated, and the likelihood is

Lp = (∏_{i=1}^{n} μip^{aip}/aip!) (1 + 1/γ)(1 + 2/γ) ··· (1 + (Σ_{i=1}^{n} aip - 1)/γ)
        / (1 + Σ_{i=1}^{n} μip/γ)^{γ + Σ_{i=1}^{n} aip}.        (2.25)

It can be readily seen that as γ → ∞ the likelihood function reverts to its
original form, with all scale factors equal, as it must, because then v(x) is a
Dirac delta-function δ(x - 1).

What has been described can be understood from the frequentist viewpoint
as a random-effects model, whose parameters are estimated by maximum
likelihood. From a Bayesian viewpoint, v(x|γ) is a prior distribution,
whose parameters have been (heretically) estimated from the data. The
individual scale parameters αp must also be estimated, as they are needed for
the cost model. They are the usual means of posterior distributions, i.e.

α̂p = ∫_0^∞ Lp(aip|x) v(x|γ̂) x dx / ∫_0^∞ Lp(aip|x) v(x|γ̂) dx.        (2.26)
This gives

α̂p = (Σ_{i=1}^{n} aip + γ̂)/(Σ_{i=1}^{n} μ̂ip + γ̂).        (2.27)

The MLE α̂p = Σ_{i=1}^{n} aip/Σ_{i=1}^{n} μ̂ip has been shrunk towards the prior
mean of unity, by a factor s = Σ_{i=1}^{n} μ̂ip/(γ̂ + Σ_{i=1}^{n} μ̂ip), i.e.

(Σ_{i=1}^{n} aip + γ̂)/(Σ_{i=1}^{n} μ̂ip + γ̂) = s Σ_{i=1}^{n} aip/Σ_{i=1}^{n} μ̂ip + (1 - s).        (2.28)

If there are no data available on a particular failure type, α̂p = 1, so that this
failure mode is taken as having the mean defect arrival scale factor.
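The shrinkage in (2.27) and (2.28) is a one-line computation once the totals are accumulated; the counts and fitted means below are illustrative:

```python
gamma_hat = 4.0                  # assumed estimate of the prior parameter
a = [3, 1, 0, 2]                 # event counts a_ip over n epochs, one class
mu = [1.0, 0.8, 1.2, 0.5]        # fitted means mu_ip under a common scale

A, M = sum(a), sum(mu)
alpha_mle = A / M                             # unshrunk estimate
s = M / (gamma_hat + M)                       # shrinkage factor
alpha_eb = (A + gamma_hat) / (M + gamma_hat)  # eq (2.27)
# eq (2.28): the posterior mean equals s * MLE + (1 - s) * 1
```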
Rather than using the mean of the posterior distribution as a point estimate
for αp, it may be preferable to calculate the expected cost per unit
time conditional on values of the αp, and then to take the expectation of cost
per unit time (or whatever quantity is to be optimised) with respect to the
(estimated) posterior distribution of the αp.
Given more than one machine, they may be assumed identical, in which
case likelihood functions for each machine multiply to give the total likelihood
function, or the EB method may again be used, to regard key parameters for
each machine as drawn from a population of such parameters. This accounts
nicely for random differences in operating conditions or quality of parts, and
is discussed briefly in Baker and Wang (1993).
A general account of the EB method is given in Maritz and Lwin (1989).
2.5.1 Computational Problems. Note that without the EB modification,
the likelihood factorises,

L = ∏_{p=1}^{q} Lp,
and parameters for each failure mode can be estimated in turn; the estimates
of various parameters for a given failure mode will be correlated together,
but parameter estimates will not correlate with estimates of parameters for
other failure modes. The computational burden is light with a minimum of
two parameters per failure type. It becomes greater if (say) a common shape
parameter β is assumed for each defect arrival rate, because all parameters

must then be estimated simultaneously. The EB approach forces this situation,
so that with many failure types, many parameters must be estimated
simultaneously. It would of course be possible to maximise the likelihood for
all other model parameters for each failure mode in order, for given values
of the common parameters γ and α, inside a routine that maximised the
total likelihood for α and γ. In other words, one would maximise the profile
likelihood with respect to the common parameters.
If a parameter of the distribution Fp is also taken as a random variate in
an EB analysis, the resulting integral must be evaluated numerically, e.g. by
Gauss-Laguerre quadrature. O'Hagan (1994) describes relevant techniques.
2.5.2 Flexibility of the ML Approach. The problem of coping with miss-
ing data has already been mentioned. In general the likelihood can be summed
or integrated over all happenings consistent with the observed events. Every
case study seems to have special features that must not be ignored. Some are
mentioned in Baker and Wang (1993). There the component tracking model
was used, with inspection of all components of a machine being done if any
component failed. Sometimes it was not clear from the records which of two
replaced components had failed, and which had been found to be defective.
Since the system will be in the same state after the repairs whichever even-
tuality occurred, it is straightforward to write the likelihood as the sum of
likelihoods for each eventuality. Similarly, because of missing or unintelligible
records it may not be clear whether a component was found to be defective
and was replaced, or was not found to be defective. Again, the likelihood
must be summed over both possibilities.
Acceptance tests and repairs are sometimes recorded. This raises the prob-
lem of whether a new machine can be regarded as one that has just been main-
tained, or has just had all components replaced. In the NHPP model, there
may be defects present in the machine initially. The data would then indicate
abnormally high numbers of defects found at inspection or high numbers of
failures initially, over that predicted by the fitted model. One parsimonious
way of modelling this would be to assume the intensity of defect arrival λp
for the pth component to be λ'p(u) = cλp(0)δ(u) + λp(u). Here the rate of
arrival includes a delta-function term at u = 0, of size proportional to the
defect arrival rate for that component extrapolated to u = 0. This would
work for the loglinear model, but not for the power law process with β > 1,
where λ(0) = 0 if β > 1.
Sometimes in Baker and Wang (1993) unfailed components were replaced
with a later version by the manufacturer. This is an example of reliability
growth, and in the component-tracking model it can be modelled by:
1. Inserting an E event into the likelihood, to give the probability that no
failures had occurred between the last renewal and the time of replace-
ment.
2. Inserting a renewal (R) event after the E event.

In a more elaborate model, it would be possible to use a multiplier parameter


for the scale factor of the distributions g and f for the new component to
allow for its changed reliability.
In the NHPP model, such replacements have no effect, unless some finite
fraction f of all components are replaced. One could then model such
replacement at time u0 by setting the defect arrival rate λ(u) → (1 - f)λ(u) +
fλ(u - u0) for u > u0.
Note that in the component tracking case, it becomes possible to carry
out age-based replacement of components in addition to replacing them if
defective: the criterion would be to replace if the component age is a >
Ap, or if defective. The likelihood function is then modified as described for
the reliability growth case. Calculation of the optimum policy is now more
difficult however, as there are 1 + q decision variables, where q is the number
of components; the inspection interval, and q component ages Ap. As this
policy generalises the simple replacement of defective components, to which
it reduces when Ap → ∞, it must be at least as cost-effective. However, such
a policy if based on estimated and therefore incorrect replacement ages could
be less cost-effective.
In general the ML method can be used to model many different practices.
Sometimes however vital data is missing, and modelling becomes far-fetched
or impossible. Thus maintenance data on a machine may be missing while it
was under warranty, when it was returned to the manufacturer for repairs.
Although one can attempt to model every situation, unknown biases may
be introduced, and the investigator should encourage good data collection
procedures.

3. Cost Models

The modelling process of model formulation, model fitting and model re-
finement has the benefit that it forces the investigator to examine his or
her assumptions, and to clarify the meaning of the data in discussions with
management. Some model parameters, such as the probability r of detecting
a defect that is present, are of intrinsic interest, and the modelling process
could thus lead to changes in practice.
However, the main aim of modelling the failure and inspection processes
and fitting the model to data is to be able to calculate from the model long-
term cost per unit time, or downtime per unit time, and to choose decision
variables (usually the interval Δ between inspections) to minimise one of
these measures of cost.
This is an area where more modelling effort should be applied, as existing
cost models are very simple. When the defect arrival process is a NHPP,
the rate of occurrence of failures will change (often it will increase) as the
system ages. Hence the frequency of inspection must also increase. In the cost

models given here, the NHPP must be approximated by a stepwise HPP, and
a different optimum inspection interval found for each step.
For the q failure-mode model, the cost per unit time is

c(Δ) = [Σ_{p=1}^{q} {cp^(f) E(Np^(f)) + cp^(i) E(Np^(i))} + I]/(Δ + d),        (3.1)

where cp^(f) is the average cost of a failure for the pth failure mode, E(Np^(f)) the
expected number of failures over the inspection interval, cp^(i) is the average
cost of repairing a defect at inspection for the pth failure mode, E(Np^(i))
the expected number of defects found at inspection, I is the cost of the
inspection, and d is the average downtime incurred. Expected numbers of
failures and defects are calculated using formulae (2.8) and (2.9), where now
actual inspection timings are replaced by an infinite sequence of inspections
occurring regularly every Δ time units.
Often some terms in equation (3.1) are negligible, so that d may be very
small, I may be small, or conversely all the cost of an inspection may be due
to downtime, and the extra cost incurred per defect c_p^{(i)} may be negligible.
In general c(Δ) has a minimum value, as long as the average total cost of
repairing a defect at inspection is less than that of a failure.
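The minimisation of c(Δ) in equation (3.1) can be sketched numerically. The sketch below assumes a single failure mode, perfect inspection, an HPP defect arrival rate λ and an exponential delay-time distribution, so that the expected number of failures per interval is λ∫_0^Δ F(Δ − u) du; all numerical values are illustrative, not taken from the chapter's case studies.

```python
import numpy as np
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

# Illustrative parameters (not taken from the chapter's case studies)
lam = 1.0               # HPP defect arrival rate
alpha = 0.5             # exponential delay-time rate: F(h) = 1 - exp(-alpha*h)
c_f, c_i = 100.0, 10.0  # average cost of a failure / of a repair at inspection
I_cost, d = 20.0, 0.1   # inspection cost and average downtime per inspection

F = lambda h: 1.0 - np.exp(-alpha * h)

def cost_rate(delta):
    """c(Delta) of eq. (3.1) for one failure mode with perfect inspection:
    a defect arising at time u in (0, Delta) fails before the next inspection
    iff its delay time is below Delta - u."""
    exp_failures, _ = quad(lambda u: lam * F(delta - u), 0.0, delta)
    exp_defects = lam * delta - exp_failures  # defects surviving to inspection
    return (c_f * exp_failures + c_i * exp_defects + I_cost) / (delta + d)

res = minimize_scalar(cost_rate, bounds=(0.05, 50.0), method="bounded")
print(f"optimum interval Delta* = {res.x:.3f}, minimum cost rate = {res.fun:.3f}")
```

The interior minimum exists because the inspection cost term I/(Δ + d) dominates for small Δ while the failure cost dominates for large Δ, which is the condition stated in the text.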
For the component tracking model, there must presumably either be more
than one machine under consideration, or the machine has had all major
components replaced several times. Unless one of these cases holds, the in-
vestigator will have been unable to obtain enough data to estimate the model
parameters. In both these cases, regularly spaced inspections are a reasonable
option (otherwise, for an expensive machine that was ageing, the frequency of
inspection should change with machine age). In the hospital equipment study,
machines such as infusion pumps are serviced every 6 months regardless of
age. To obtain the optimum policy, it is necessary to know the service lifetime
of machines, and the age distribution of machines in service. If machines are
purchased regularly, and old machines removed from service, equation (3.1)
still applies, where now expected numbers of failures and defects, and average
costs, are for a random machine from the number in service.
For the HPP defect arrival model, opportunistic inspections increase the
value of c_p^{(f)}. Besides repairing some part of the machine responsible for the
pth failure mode, a general inspection is carried out. If the cost of this in-
spection is proportional to the number of defects found, then c_p^{(f)} will be an
increasing function of Δ. The simplest way to find the value of c_p^{(f)} numeri-
cally for a given Δ is to simulate the process over a long time period.
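A simulation of this kind can be sketched as follows. Defect arrivals, delay times, and all costs are illustrative assumptions (HPP arrivals, exponential delays); the sketch shows only the mechanism by which the opportunistic-inspection cost per failure grows with Δ.

```python
import heapq
import numpy as np

rng = np.random.default_rng(1)

def opp_cost_per_failure(delta, horizon=20_000.0, lam=1.0, alpha=0.5,
                         c_per_defect=5.0):
    """Event-driven sketch: defects arrive as an HPP(lam) and would fail after
    an exponential(alpha) delay; scheduled inspections every `delta` clear all
    live defects, and each failure triggers an opportunistic inspection that
    also clears them, at a cost proportional to the number found."""
    live = []                                  # heap of prospective failure times
    n_fail, cost = 0, 0.0
    next_arr, next_insp = rng.exponential(1.0 / lam), delta
    while True:
        t_fail = live[0] if live else np.inf
        t = min(next_arr, next_insp, t_fail)
        if t >= horizon:
            break
        if t == t_fail:                        # failure of the oldest live defect
            heapq.heappop(live)
            n_fail += 1
            cost += c_per_defect * len(live)   # defects found opportunistically
            live.clear()
        elif t == next_insp:                   # scheduled inspection clears defects
            live.clear()
            next_insp += delta
        else:                                  # new defect with its own delay time
            heapq.heappush(live, t + rng.exponential(1.0 / alpha))
            next_arr = t + rng.exponential(1.0 / lam)
    return cost / max(n_fail, 1)

print(opp_cost_per_failure(1.0), opp_cost_per_failure(10.0))
```

With a longer interval more defects accumulate between clearings, so more are found (and charged for) at each opportunistic inspection, making the per-failure cost an increasing function of Δ as the text states.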
This highlights a general problem with equation (3.1): some of the
parameters appearing in it may be functions of Δ; if so, the value of Δ
minimising cost will change. In the recent study Christer et al. (1995) it was
found that downtime per failure was an increasing function of Δ. This was
not due to opportunistic inspections. In the model used in Christer et al.
576 Rose Baker

(1995), all failure modes were lumped together, and it was thought that the
more expensive failure modes had longer delay-times, and so were more likely
to occur during long intervals between inspections. Discriminating between
different failure modes removes this difficulty.
A likely reason why c_p^{(i)} might increase with Δ is that repair time or cost
increases with the elapsed delay-time, i.e. the time since the defect became
visible. As time passes, defects grow and require more effort to fix, e.g. a
crack might grow in size. This behaviour can be modelled by allowing cost to
be a stochastic function of elapsed delay-time, giving a distribution function
Pr(C ≤ c) for cost C:

    Pr(C ≤ c)_{i,p} = r_i Σ_{j=1}^{i} Π_{k=j}^{i−1} (1 − r_k) ∫_{t_{j−1}}^{t_j} λ_p(x) Pr(C ≤ c | t_i − x) dx.    (3.2)

Here Pr(C ≤ c | t_i − x) is the probability that the random cost does
not exceed c, given that the defect arose at time x. This can be modelled
by any survival distribution, for example a gamma distribution, whose mean
is an increasing function of t_i − x. Model parameters can be estimated by
multiplying the likelihood function by the likelihood of observing the repair
costs.
Unfortunately, when inspection has been carried out perfectly regularly
every Δ time units, and when λ_p(x) is constant, the dependence of the dis-
tribution of cost on elapsed delay time at inspection cannot be estimated from
the likelihood. Only when there is very irregular maintenance or there are
opportunistic inspections can the likelihood function based on equation (3.2)
enable us to estimate cost parameters.

4. Model Choice, Goodness of Fit, Sample Size

Typically, it will be clear from discussions with management and a prelimi-
nary examination of available data whether a delay-time model is appropri-
ate. If so, there are still many modelling options, for example the parame-
terisation of the distribution F. The model-building process consists of fitting a
model to the data, examining the goodness of fit of the model, and iterating
until one has the simplest model that fits the data acceptably. Then one can
proceed to find optimum policies.

4.1 Goodness of Fit Testing

Examining goodness of fit for these models is best done graphically, or
with simple tests. Often observed and predicted numbers of events for some
marginal distribution can be plotted. For example, with the NHPP model,
one can plot observed and predicted numbers of failures in equal intervals

of time from last inspection. Failures occurring at many different times are
lumped into one histogram class together, and this gives a large enough num-
ber of failures per class to enable the goodness of prediction to be assessed
both visually and by a chi-squared test. There should be few failures occurring
soon after an inspection, if inspection is effective at removing defects.
The observed and predicted number of faults found at inspection can
also be plotted, possibly breaking the period of observation up into several
intervals. Observed and predicted numbers of failures can also be plotted for
intervals over the period of observation, and this shows whether the function
h for NHPP models is of the right form.
The usual chi-squared can be calculated for such tests. The number of
degrees of freedom should be the number of independent counts of events, mi-
nus the number of fitted model parameters. However, the number of degrees
of freedom for the chi-squared is only known approximately: first, because
parameter estimation by maximum likelihood gives more accurate estimates
than those obtained by minimising a χ², so that fewer degrees of freedom
need be subtracted; and secondly, because only part of the data appears in any
one chi-squared, so that fewer degrees of freedom again should be subtracted.
This difficulty is not usually a real problem in practice.
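A minimal sketch of such a test, with hypothetical observed and model-predicted counts of failures classified by time since the last inspection (the counts and the parameter number are illustrative assumptions):

```python
import numpy as np
from scipy.stats import chi2

# Hypothetical counts of failures, classed by time since the last inspection
observed  = np.array([4, 9, 13, 18, 21, 25])
predicted = np.array([6.2, 10.1, 14.5, 17.8, 20.9, 20.5])  # from a fitted model

chi_sq = np.sum((observed - predicted) ** 2 / predicted)
n_params = 2                     # parameters fitted by maximum likelihood
dof = len(observed) - n_params   # approximate, for the reasons given above
p_value = chi2.sf(chi_sq, dof)
print(f"chi-squared = {chi_sq:.2f} on ~{dof} df, p = {p_value:.3f}")
```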

4.2 Model Choice


A quite different approach is to extend the model in various ways suggested
by the situation, for example a 3-parameter distribution can be fitted to F,
and the adequacy of the embedded model tested against the larger model.
With models nested in this way the likelihood must always increase with
the number of fitted parameters. The Akaike Information Criterion (AIC)
(Sakamoto et al. 1986) is defined as AIC = −2 log L + 2f, where f is the
number of fitted parameters. Taking the best model as the minimum-AIC
(MAICE) model gives (asymptotically in sample size) the model that would
have the largest expected log-likelihood if the likelihood were calculated from
fresh data similar to that used for model fitting. The MAICE model should
therefore have best predictive power. There are other more stringent criteria,
because MAICE models still tend to overfit data, so that convergence to the
correct model does not occur with probability 1, even asymptotically. This
criticism must be tempered by the reflection that we usually have only small
data samples, and also that we do not think that there is a 'correct' model.
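The AIC comparison of Table 5.2 can be reproduced directly. The parameter counts below are inferred from the distributional forms (an assumption on the reader's part), and the Log-L column is read as −log L, which makes the reported AIC values consistent:

```python
# Log-L column of Table 5.2 read as negative log-likelihoods; parameter
# counts inferred from the distributional forms plus lambda and r
neg_log_lik = {1: 104.86, 2: 103.85, 3: 101.86, 4: 101.89, 5: 101.85}
n_params    = {1: 3,      2: 4,      3: 4,      4: 4,      5: 5}

aic = {m: 2.0 * neg_log_lik[m] + 2 * n_params[m] for m in neg_log_lik}
best = min(aic, key=aic.get)     # the MAICE model
for m in sorted(aic):
    print(f"model {m}: AIC = {aic[m]:.2f}")
print("MAICE model:", best)
```

Models 3 and 4 come out within a tenth of an AIC unit of each other, matching the text's remark that either could be chosen.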

4.3 Sample Size


On fitting model parameters from small samples of data, the estimated pa-
rameters will be in error by some amount, and the maximum likelihood pro-
cedure also gives estimates of the standard error on parameter estimates.
Hence the estimated optimum period between inspections will also be in er-
ror by some amount, which can be calculated. Operating a machine with such

a suboptimal inspection policy will increase cost per unit time away from the
minimum by some amount called the excess cost. Clearly, for a given sample
size, the excess cost can be calculated, and hence the required sample size
needed for a given excess cost can be found.
These calculations are given in detail in Baker and Scarf (1995). The
important result following can be seen without doing any mathematics, and
is that because of the quadratic nature of a minimum, even large errors on
estimated optimum Δ will incur small excess cost. In fact, excess cost is
inversely proportional to sample size.

4.4 Using the Data Three Times


It has been said that modellers use data three times: to suggest a suitable
model; to fit model parameters; and to assess goodness of model fit. Although
the model-building procedure recommended here is far superior to naive use
of an untested model, it does lead to over-optimism regarding model fit and
standard errors on model parameters. The model is lovingly hand-crafted to
fit every bump and wrinkle of the data. Even use of the AIC will not lead to
the 'best' model, if many different possibilities are tried ('data dredging').
Besides being aware of this, and so opting for parsimonious models, the
modeller can bootstrap the whole modelling procedure, if a fairly standard
set of models is to be considered. Fresh 'resampled' datasets are produced
by simulation, using estimated model parameters as the 'true' parameter
values. The model-choice procedure is then carried out automatically on each
resampled dataset. Sometimes the MAICE model will not be that fitted to the
actual data set. The bootstrapped standard errors and confidence intervals on
model parameters and on the recommended optimum Δ will be more realistic
than those found by propagating the standard errors on model parameters
given by the likelihood method through to become errors on Δ, using, say,
the delta-method.
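The bootstrap of the whole model-choice procedure can be sketched as follows. For illustration the candidate set is just exponential versus Weibull, fitted to directly observed delay-time samples; this is an assumed simplification, since in a real delay-time study the refit would go through the full likelihood of Section 2, and the 'true' parameters would be the full fitted model.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def aic_choice(sample):
    """AIC choice between an exponential fit and a Weibull fit (ML, loc = 0)."""
    ll_exp = np.sum(stats.expon.logpdf(sample, scale=sample.mean()))
    c, _, scale = stats.weibull_min.fit(sample, floc=0.0)
    ll_wei = np.sum(stats.weibull_min.logpdf(sample, c, 0.0, scale))
    return "exp" if (-2 * ll_exp + 2) <= (-2 * ll_wei + 4) else "weibull"

# Hypothetical data standing in for observed delay times; the fitted
# exponential (mean = sample mean) plays the role of the 'true' parameters
data = rng.exponential(2.0, size=60)
theta_hat = data.mean()

# Bootstrap the *whole* model-choice procedure on resampled datasets
picks = [aic_choice(rng.exponential(theta_hat, size=60)) for _ in range(200)]
print("exponential re-chosen on", picks.count("exp"), "of 200 resampled datasets")
```

As the text notes, the MAICE model on a resampled dataset is sometimes not the one fitted to the actual data, and the spread of the resampled choices and estimates gives the more realistic standard errors.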

5. Case Studies
In a recent case study (Christer et al. 1995) the HPP model was used to
optimise maintenance practice for an extrusion press, a key item of plant
for a copper-products manufacturer in the NW of the UK. Fortunately, the
company had already tried various maintenance practices, such as failure
maintenance, and daily and weekly maintenance, so that total downtime per
hour of press operation could be found. Table 5.1 shows these results, and
Table 5.2 shows the results of fitting the HPP delay-time model.
As can be seen, five models for the delay-time distribution were fitted.
Model 1 is exponential, model 4 Weibull, while model 5 is a Weibull distri-
bution with the scale parameter α a random variate from a Gamma dis-
tribution. Thus model 4 generalises model 1, and model 5 generalises model
4. Model 2 is model 5 with β = 1, so that the exponential scale factor α
is a random variate from a Gamma distribution. This model 2 is a Pareto
distribution. Model 3 is a mixture of exponential distributions, in which one
distribution has zero mean; i.e. a fraction P of defects have zero delay-time.
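The five fitted forms can be written down directly. The parameterisations below are one plausible reading of the formulae in Table 5.2; whether the scale enters as αh or h/α is an assumption, and differs between models.

```python
import numpy as np

def F1(h, a):        # model 1: exponential
    return 1.0 - np.exp(-a * h)

def F2(h, a, g):     # model 2: Pareto (exponential with gamma-mixed scale)
    return 1.0 - (1.0 + h / a) ** (-g)

def F3(h, a, P):     # model 3: a fraction P of defects with zero delay time
    return 1.0 - (1.0 - P) * np.exp(-a * h)

def F4(h, a, b):     # model 4: Weibull
    return 1.0 - np.exp(-(a * h) ** b)

def F5(h, a, b, g):  # model 5: Burr (Weibull with gamma-mixed scale)
    return 1.0 - (1.0 + (a * h) ** b) ** (-g)

# Nesting checks: model 4 with b = 1 is model 1; model 5 with b = 1 is a
# Pareto of the model-2 type (up to the scale convention)
assert np.isclose(F4(2.0, 0.5, 1.0), F1(2.0, 0.5))
assert np.isclose(F3(0.0, 0.0178, 0.5546), 0.5546)
```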

Table 5.1. Percentage of downtime for an extrusion press under various mainte-
nance regimes. This includes downtime due to failures and downtime due to main-
tenance.
PM policy          percentage downtime per press hour
                   production record    objective method
no PM                    5.47                 5.53
1 week PM cycle          4.06                 4.05
1 day PM cycle           2.45                 1.85

On the basis of the AIC, models 3 or 4 would be chosen as having the
best predictive power. Model 3 was favoured as being the more intuitive, as it
was thought that many defects could not be detected by inspection. Clearly,
the simple exponential model does not fit well.
Model fit was assessed using the chi-squared method described earlier,
and was acceptable.
Using a cost model in which downtime due to failures is proportional to
the number of failures, the total downtime per unit time can be plotted.
The result of the analysis was the recommendation of a quick daily main-
tenance, in which only certain specified defects were to be checked for. Other
less urgent maintenance, such as lubrication, was to be carried out weekly.

6. Brief History of the Development of the DTM


The DTM was introduced in Christer (1982) in the context of building main-
tenance, following the first mention of the concept in 1976 in the appendix to
Christer (1976). The model was of the NHPP type. Subjective and objective
information were both to be used.
In 1984 the DTM was applied to problems of industrial plant mainte-
nance. In Christer and Waller (1984a) the DTM was extended to cater for
imperfect inspection, a NHPP of defect origination epochs over the interval
between inspections, and two cost models (maintenance performed simulta-
neously for all defects or sequentially). In a related case study paper Christer
and Waller (1984b) the DTM and snapshot analysis (an extended form of
Pareto analysis) were used to derive an optimum-cost maintenance policy
at the Pedigree Petfoods canning line, which was subsequently adopted by
management.
In another case study Christer and Waller (1984c), snapshot analysis and
the DTM were applied to modelling preventive maintenance for a vehicle fleet

Table 5.2. Fitted values of parameters for the Extrusion Press data.

model                  (1)             (2)                  (3)
F(h)                   1 - e^{-αh}     1 - (1 + h/α)^{-γ}   1 - (1 - P^a)e^{-αh}

ROCOF^b λ              1.3174          1.3277               1.3561
  CV^c                 0.0832          0.0832               0.0832
scale α                0.0355          42.65                0.0178
  CV                   0.7407          8.4161               1.1572
gamma γ                                0.1732
  CV                                   1.3576
P(perfect PM) r        0.2142          0.9036               0.9021
  CV                   0.7400          1.5006               3.4956
Log-L                  104.86          103.85               101.86
AIC                    215.73          215.70               211.72

model                  (4)                 (5)
F(h)                   1 - e^{-(αh)^β}     1 - (1 + (αh)^β)^{-γ}

ROCOF λ                1.3378              1.3378
  CV                   0.0832              0.0837
scale α                0.9247              126.855
  CV                   2.3583              30.756
shape β                0.1276              10.1335
  CV                   2.3391              3.4437
gamma γ                                    0.0177
  CV                                       0.1886
P(perfect PM) r        0.8963              0.9068
  CV                   1.1703              1.5464
Log-L                  101.89              101.85
AIC                    211.79              213.79

^a proportion of zero delay time, P = 0.5546 (CV 0.4266)
^b rate of occurrence of faults
^c coefficient of variation

of tractor units operated by Hiram Walker Ltd. Again management adopted


the recommended decrease in frequency of maintenance. This study produced
some peculiarities of practice which the model was extended to cope with,
e.g. some defects were found by drivers, who returned the vehicle for repair at
once, and the next scheduled maintenance was brought forward to coincide
with the repair. This paper also first mentions the observation that repair
times (and hence cost) and delay times h may be positively correlated. A
general account of the DTM is given in Christer (1984).
In these applications b(Δ) emerged as a key concept. The value of b(Δ)
calculated from the subjective and objective data should agree with the ob-
served fraction for the maintenance interval actually employed. In the earlier
development, when agreement was not within a few percent, detailed esti-
mates given by engineers were reassessed, and revised estimates obtained.
After the formulation of the DTM for the HPP/NHPP case, and its early
applications to real-world problems, a period of more theoretical develop-
ment started. In 1987 a perfect-inspection model of the component-tracking
type appeared in Christer (1987), and component reliability as a function of in-
spection interval was calculated using a recursive formula. Cerone (1991) later
calculated an approximate reliability measure using a simplified method. Pel-
legrin (1991) derived a graphical procedure for finding the optimum interval
between inspections under a DTM, which allows the various factors relevant
to decision-making to be emphasized.
A further version of the HPP model applicable to the building industry
followed in Christer (1988). Here a DTM was developed in which the prob-
ability p(y) of detection of a defect at time y from the defect origin time u
increased from zero at y = 0 to unity at y = h. Repair cost now varied over
the delay time as a deterministic function C(y, h). Developments of this work
are ongoing in the form of a major collaborative research project with the
Concrete Research Group at QMC, London, into the inspection and repair
modelling of concrete bridges and high-rise structures.
Later papers, Christer and Redmond (1990,1992) also on the HPP model
considered more formal methods for revising subjective estimates of delay-
time and prior forms of models so that the fraction of defects ending in failure
would agree with the observed value for the maintenance policy actually in
use.
A condition-monitoring coal mining equipment case study is given in
Chilcott and Christer (1991), and a general paper Christer (1991a) gave
a non-mathematical summary of the DTM. In Christer (1991b) the DTM
for the component-tracking case was discussed from the viewpoint of 0-1
condition-monitoring, and the asymptotic cost per unit time of irregular in-
spection policies was derived. This cost was used in Christer and Wang (1992),
where a DTM was derived for a 0-1 condition-monitoring model with regu-
larly spaced inspections for a linear pattern of wear characteristic of some

plant in the steel industry. In this DTM, a positive correlation between u


and h is induced by variability in a population of components.
All case-related models developed to this point had been based sub-
stantially upon subjective data. In Baker and Wang (1991) the component-
tracking delay-time model was introduced, as described in this chapter. This
study showed that it was possible to estimate DTM parameters without sub-
jective data. In a later paper Baker and Wang (1993) some model extensions
were derived and used in fitting data.
In Baker (1992), the use of maximum-likelihood methods for the NHPP
model was discussed. The bias of the ML estimate was derived and shown
to be small. Christer and Wang (1994) discuss an opportunistic inspection
model.

7. New Model Developments


This section describes some recent modelling developments.

7.1 Rejuvenating Effect of Maintenance


In the DTM, the only effect of maintenance is replacement of defective com-
ponents found on inspection. However, maintenance often includes activities
such as lubrication, which can be thought of as partially renewing a com-
ponent. It is only necessary to include one extra model parameter to model
this effect as the addition of some age increment δ (which can be negative)
to the component's age. The method used here was adopted in Baker and
Wang (1993).
If the hazard of developing a defect at age t is ψ(t), after i − 1 inspections
at times t_1, ..., t_{i−1}, where t_0 = 0,

    t → t_effective = t − Σ_{j=1}^{i−1} min{t_j − t_{j−1}, δ},

for t_{i−1} < t ≤ t_i, and ψ(t) → ψ(t_effective). The sum is empty if j > i − 1,
i.e. if i = 1, so that no inspection has yet occurred.
The survival function S(u) = 1 − G(u) proves unexpectedly complicated
when δ ≠ 0. Let S_0 be the survival function when δ = 0. The equation

    S_0(u) = exp(−∫_0^u ψ(t) dt)    (7.1)

is the key to calculating S(u). For t_{i−1} < t < t_i, the hazard is
ψ(t − Σ_{j=1}^{i−1} min{t_j − t_{j−1}, δ}). The integral ∫_0^u ψ(t_effective) dt
must then be carried out piecewise, and is

    ∫_0^u ψ(t_effective) dt = Σ_{i=1}^{n+1} ∫_{t_{i−1}}^{t_i} ψ(t − Σ_{j=1}^{i−1} min{t_j − t_{j−1}, δ}) dt,    (7.2)

where a total of n inspections have been carried out by time u from renewal,
t_0 = 0, t_{n+1} = u.
It is now possible to write down the survival function S, using the equation

    exp(−∫_{t_{i−1}}^{t_i} ψ(t) dt) = S_0(t_i)/S_0(t_{i−1}),

derived from equation (7.1). Treating each term in the summation in equa-
tion (7.2) in this way, and remembering that S_0(0) = 1, finally

    S(u) = Π_{i=1}^{n+1} S_0(t_i − Σ_{j=1}^{i−1} min{t_j − t_{j−1}, δ}) / S_0(t_{i−1} − Σ_{j=1}^{i−1} min{t_j − t_{j−1}, δ}),    (7.3)

where u appears on the right-hand side in the guise of t_{n+1}. Clearly, for
exponential distributions the additional terms due to δ cancel, as they must,
because when the hazard ψ is a constant, rejuvenation can have no effect
upon it.
The pdf g(u) = −dS(u)/du obtained by differentiating equation (7.3) is

    g(u) = ψ(u − Σ_{j=1}^{n} min{t_j − t_{j−1}, δ}) S(u)

for u > t_n. In terms solely of the original survival function S_0 and pdf g_0,
the pdf is

    g(u) = g_0(u − Σ_{j=1}^{n} min{t_j − t_{j−1}, δ}) S(u) / S_0(u − Σ_{j=1}^{n} min{t_j − t_{j−1}, δ}),

where S is as defined in equation (7.3).
It is now possible to compute G(u) and g(u) when δ is nonzero, if the
original distribution function G_0(u) and pdf g_0(u) can be computed.
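Equation (7.3) is straightforward to evaluate numerically; a minimal sketch, with illustrative inspection times and baseline survival functions:

```python
import numpy as np

def survival_with_rejuvenation(u, insp_times, delta, S0):
    """S(u) of eq. (7.3): inspection j knocks min(t_j - t_{j-1}, delta)
    off the component's effective age; S0 is the delta = 0 survival function."""
    t = [0.0] + [x for x in insp_times if x < u] + [u]
    s, rejuv = 1.0, 0.0
    for i in range(1, len(t)):
        # rejuv holds sum_{j < i} min(t_j - t_{j-1}, delta) at this point
        s *= S0(t[i] - rejuv) / S0(t[i - 1] - rejuv)
        rejuv += min(t[i] - t[i - 1], delta)
    return s

# Sanity checks: for a constant hazard the delta-terms cancel, so S = S0;
# for an increasing hazard (Weibull, shape 2) rejuvenation raises survival
S0_exp = lambda x: np.exp(-0.3 * x)
S0_wei = lambda x: np.exp(-(0.3 * x) ** 2)
print(survival_with_rejuvenation(5.0, [1.0, 2.0, 3.0], 0.5, S0_exp), S0_exp(5.0))
print(survival_with_rejuvenation(5.0, [1.0, 2.0, 3.0], 0.5, S0_wei), S0_wei(5.0))
```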
Whether rejuvenation would be an improvement depends on whether
the hazard of a defect developing is increasing or decreasing with age:
restoring the machine to an earlier and more unreliable state would not be
an advantage. The basic concept of changing the component's effective age is
still valid for such DFOM (decreasing force of mortality) distributions, but
here it is the increase in age that must be restricted. It is simplest to write

    t → t_effective = t + Σ_{j=1}^{i−1} min{t_j − t_{j−1}, δ},

and to define δ as the increase in age conferred by the inspection. However, for
DFOM distributions the rationale of this approach, the notion of restoration
to a younger and more reliable state, is lacking.

7.2 Other Developments

7.2.1 Growth in Defect Visibility with Time. Consider a delay-time
model where the probability of detecting a defect rises from zero when the
defect is first visible, to unity or a smaller constant at the instant of failure,
e.g. p(t) = r{(t − u)/h}^a, where r ≤ 1 and 0 ≤ a < ∞. If a is zero we have
the current model; as a increases, the probability of detection switches on
more and more slowly. The model is motivated by defects like cracks that
increase in size with time. Its difficulty is its increased computational
complexity.
7.2.2 A Simple Alternative Imperfect Inspection Model. Another
possibility is that defects are only seen after they have been in existence for
some fixed period η. Hence some will cause failures before being detected.
Existing imperfect inspection models assume that the probability r of
detecting a defect is independent at each inspection. Two inspections very
closely spaced would have more chance of finding a defect (1 − (1 − r)²) than
a single inspection. This new model assumes the opposite: both or none of
two closely spaced inspections would find the defect. Like the 'r' model it has
just one parameter. The likelihood function is again of 'Poisson' type and is
easy to write down. In this model there are three states: OK; defective, but
defect undetectable; and defective, defect detectable. In general, the period
of undetectability would be stochastic.
Under the conditions used in Section 2.4.3, an undetectable period of η
gives the expected number of defects found at inspection of

    λ ∫_η^{Δ+η} {1 − F(u)} du,

and a failure intensity at time t from inspection of λF(t + η). As η → ∞,
no defects are found at inspection, and the failure intensity approaches the
defect arrival rate λ. The log-likelihood is

    ℓ = d log λ − λMΔ + Σ_{j=1}^{k} log F(t_j + η)
        + (d − k) log ∫_η^{Δ+η} {1 − F(x)} dx + constant,    (7.4)

and b(Δ) is

    b(Δ, η) = (1/Δ) ∫_η^{Δ+η} F(u) du.    (7.5)

This simple form makes calculation of costs per unit time easier than in the
'r-model'. For the exponential distribution, the estimating equation for η,
∂ℓ/∂η = 0, reduces to

    Σ_{j=1}^{k} 1/{exp(α(t_j + η)) − 1} = d − k.
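The quantities of the η-model are easy to compute numerically. A sketch, assuming an exponential delay-time distribution with illustrative parameter values:

```python
import numpy as np
from scipy.integrate import quad

# Exponential delay times, F(u) = 1 - exp(-a*u); all values illustrative
lam, a, delta, eta = 1.0, 0.5, 4.0, 1.0
F = lambda u: 1.0 - np.exp(-a * u)

# Expected number of defects found at an inspection:
# lam * integral of (1 - F) over (eta, delta + eta)
n_insp, _ = quad(lambda u: lam * (1.0 - F(u)), eta, delta + eta)

# Fraction of defects ending as failures, eq. (7.5)
b_val, _ = quad(F, eta, delta + eta)
b_val /= delta
print(f"E[defects at inspection] = {n_insp:.4f}, b(Delta, eta) = {b_val:.4f}")
```

Increasing eta in this sketch drives n_insp towards zero and b towards one, reproducing the limiting behaviour described in the text.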

The computation of b(Δ) for perfect inspection and opportunistic in-
spections was given in Section 2.4.4. Under the 'η-model' failures again
form a renewal process, and the interfailure time now has survival func-
tion exp{−λ ∫_0^t F(x + η) dx}. Calculation of b(Δ) is thus still analytically
feasible. In contrast, the 'r-model' with r < 1 presents a more difficult case,
because failures and inspections are not regeneration points for the process;
rather, just after a failure or inspection, the number of defects that have
arrived, and their ages, depend on the 'historical' pattern of preceding failures
and inspections.
The last model can be changed slightly. Assume that technicians only
regard a component as defective (and repair it) if they think it has a non-
negligible probability P > P_0 of causing a failure before the next scheduled
inspection. Any other 'defects' are ignored. This again gives the previous
model, but now if P_0 is fixed, the period η for which a defect is ignored varies
with the interval between inspections. This new model predicts that as the
inspection interval increases, technicians will fix defects at earlier and earlier
stages in their progress towards failure. The cost of inspections will increase
but the reliability of the system will also increase, giving a different optimum
inspection interval. With a constant inspection interval in the data, the 'P_0'
and the 'η' models are indistinguishable; otherwise one or the other will give
the better fit to data. The models differ in their predictions of optimum
inspection interval, because either η or P_0 would remain constant.
Because the concept of a defect is defined operationally (Section 1.2), we
pay the penalty of being forced to model the behaviour of engineers. However,
such 'soft' problems intrude everywhere in OR, and cannot be ignored.

Acknowledgement. I would like to thank Professor Tony Christer, Dr. Philip Scarf,
and all my colleagues in the Maintenance Research Group for helpful and stimu-
lating discussions on this presentation of our joint work.

References

Baker, R.D.: Estimating Optimum Inspection Intervals for Repairable Machinery
by Fitting a Delay-Time Model. Technical Report MCS-92-08. Mathematics
Dept., Salford University (1992)
Baker, R.D., Christer, A.H.: Review of Delay-Time OR Modelling of Engineering
Aspects of Maintenance. European Journal of Operational Research 73, 407-
422 (1994)
Baker, R.D., Scarf, P.A.: Can Models Fitted to Maintenance Data with Small Sam-
ple Sizes Give Useful Maintenance Policies? IMA Journal of Mathematics in
Business and Industry 6, 3-12 (1995)

Baker, R.D., Wang, W.: Estimating the Delay-Time Distribution of Faults in Re-
pairable Machinery from Failure Data. IMA Journal of Mathematics Applied
in Business and Industry 3, 259-281 (1991)
Baker, R.D., Wang, W.: Developing and Testing the Delay-Time model. Journal of
the OR Society 44, 361-374 (1993)
Cerone, P.: On a Simplified Delay-Time Model of Reliability of Equipment Subject
to Inspection Monitoring. J. Opl. Res. Soc. 42, 505-511 (1991)
Chilcott, J.B., Christer, A.H.: Modelling of Condition-Based Maintenance at the
Coal Face. International Journal of Production Economics 22,1-11 (1991)
Christer, A.H.: Innovatory Decision Making. In: Bowen, K., White, D.J. (eds.):
Proc. NATO Conference on Role and Effectiveness of Decision Theory in Prac-
tice (1976)
Christer, A.H.: Modelling Inspection Policies for Building Maintenance. J. OpJ. Res.
Soc. 33, 723-732 (1982)
Christer, A.H.: Operational Research Applied to Industrial Maintenance and Re-
placement. In: Eglese, Rand (eds.): Developments in Operational Research. Ox-
ford: Pergamon Press 1984, pp. 31-58
Christer, A.H.: Delay-Time Model of Reliability of Equipment Subject to Inspection
Monitoring. J. Opl. Res. Soc. 38, 329-334 (1987)
Christer, A.H.: Condition-Based Inspection Models of Major Civil-Engineering
Structures. J. Opl. Res. Soc. 39, 71-82 (1988)
Christer, A.H.: Modelling for Control of Maintenance for Production. In: On-
derhoud en Logistiek (Op weg naar intergrale beheersing). Eindhoven: Sam-
som/Nive 1991a
Christer, A.H.: Prototype Modelling of Irregular Condition Monitoring of Produc-
tion Plant. IMA Journal of Mathematics Applied in Business and Industry 3,
219-232 (1991b)
Christer, A.H., Redmond, D.F.: A Recent Mathematical Development in Mainte-
nance Theory. IMA Journal of Mathematics Applied in Business and Industry
2, 97-108 (1990)
Christer, A.H., Redmond, D.F.: Revising Models of Maintenance and Inspection.
International Journal of Production Economics 24, 227-234 (1992)
Christer, A.H., Waller, W.M.: Delay Time Models of Industrial Maintenance Prob-
lems. J. Opl. Res. Soc. 35, 401-406 (1984a)
Christer, A.H., Waller, W.M.: An Operational Research Approach to Planned Main-
tenance: Modelling P.M. for a Vehicle Fleet. J. Opl. Res. Soc. 35, 967-984
(1984b)
Christer, A.H., Waller, W.M.: Reducing Production Downtime Using Delay-Time
Analysis. J. Opl. Res. Soc. 35, 499-512 (1984c)
Christer, A.H., Wang, W.: A Model of Condition Monitoring of a Production Plant.
International Journal of Production Research 9, 2199-2211 (1992)
Christer, A.H., Wang, W.: A Delay-Time Based Maintenance Model of a Multi-
component System. Technical Report MCS-94-13. Mathematics Dept., Salford
University (1994)
Christer, A.H., Whitelaw, J.: An O.R. Approach to Breakdown Maintenance Prob-
lem Recognition. J. Opl. Res. Soc. 34, 1041-1052 (1983)
Christer, A.H., Wang, W., Baker, R.D., Sharp, J.: Modelling Maintenance Practice
of Production Plants Using the Delay Time Concept. IMA Journal of Mathe-
matics in Business and Industry 6, 67-83 (1995)
Kendall, M.G., Stuart, A.: The Advanced Theory of Statistics. 4th edition. High
Wycombe: Griffin 1979
Maritz, J.S., Lwin, T.: Empirical Bayes Methods. London: Chapman and Hall 1989

O'Hagan, A.: Kendall's Advanced Theory of Statistics: Bayesian Inference. Vol. 2B.
London: Edward Arnold 1994
Pellegrin, C.: A Graphical Procedure for an On-Condition Maintenance Policy:
Imperfect-Inspection Model and Interpretation. IMA Journal of Mathematics
Applied in Business and Industry 3,177-191 (1991)
Sakamoto, Y., Ishiguro, M., Kitagawa, G.: Akaike Information Criterion Statistics.
Tokyo: KTK Publishing House 1986
Shwartz, M., Plough, A.L.: Models to Aid in Cancer Screening Programs. In: Cor-
nell, R. (ed.): Statistical Methods for Cancer Studies. New York: Marcel Dekker
1984
Thomas, L.C., Gaver, D.P., Jacobs, P.A.: Inspection Models and Their Application.
IMA Journal of Mathematics Applied in Business and Industry 3, 283-303
(1991)
Valdez-Flores, C., Feldman, R.M.: A Survey of Preventive Maintenance Models
for Stochastically Deteriorating Single-Unit Systems. Naval Research Logistics
Quarterly 36, 419-446 (1989)
Xie, M.: On the Solution of Renewal-Type Integral Equations. Commun. Statist.
B 18, 281-293 (1989)
List of Contributors

Terje Aven Hans Frenk


Rogalund University Center Econometric Institute
P.O. Box 2557 Erasmus University Rotterdam
Ullandhaug, 4004 Stavanger P.O. Box 1738
Norway 3000 DR Rotterdam
The Netherlands
Rose Baker
Department of Mathematics Prem K. Goel
and Computer Science Department of Statistics
University of Salford The Ohio State University
Lancaster M5 4WT 1958 Neil Avenue
United Kingdom Columbus, OH 43210
USA
Menachem P. Berg
Department of Statistics Wim Groenendijk
University of Haifa Woodside Offshore Petrolium
Mount Carmel Gosa Level 3
Haifa 31905 1 Adelaide Terrace
Israel Perth 6000
Australia
Erhan Qmlar
Department of Civil Engineering Philip Heidelberger
and Operations Research IBM T.J. Watson Research Center
Princeton University P.O. Box 704
Princeton, NJ 08544 Yorktown Heights, NY 10598
USA USA

Rommert Dekker Uwe Jensen


Econometric Institute Institute of Stochastics
Erasmus University Rotterdam University of Ulm
P.O. Box 1738 D-89060 VIm
3000 DR Rotterdam Germany
The Netherlands
590 List of Contributors

Jack P.C. Kleijnen
Department of Information Systems and Auditing
Tilburg University
P.O. Box 90153
5000 LE Tilburg
The Netherlands

Igor N. Kovalenko
V.M. Glushkov Inst. of Cybernetics
Ukrainian Academy of Sciences
40 Glushkov Prospect
Kiev 252207
Ukraine

Manish Malhotra
Room 2K-327
AT&T Bell Laboratories
101 Crawfords Corner Road
Holmdel, NJ 07733
USA

Max Mendel
Department of Industrial Eng. and Operations Research
University of California
Berkeley, CA 94720
USA

Jason Merrick
Department of Operations Research
The George Washington University
Washington, DC 20052
USA

Jogesh K. Muppala
Department of Computer Science
The Hong Kong University of Science and Technology
Clear Water Bay, Kowloon
Hong Kong

John D. Musa
Room HR2E031
AT&T Bell Laboratories
480 Red Hill Road
Middletown, NJ 07748-3052
USA

Victor F. Nicola
Department of Computer Science
University of Twente
P.O. Box 217
7500 AE Enschede
The Netherlands

Süleyman Özekici
Department of Industrial Eng.
Boğaziçi University
80815 Bebek-İstanbul
Turkey

Panickos N. Palettas
Department of Statistics
Virginia Polytechnic Institute and State University
Blacksburg, VA 24061-0439
USA

Perwez Shahabuddin
Department of Industrial Eng. and Operations Research
Columbia University
New York, NY 10027-6699
USA

Moshe Shaked
Department of Mathematics
University of Arizona
Tucson, AZ 85721-0001
USA
George J. Shanthikumar
The W.A. Haas School of Business
The University of California
Berkeley, CA 94720
USA

Nozer D. Singpurwalla
Department of Operations Research
The George Washington University
Washington, DC 20052
USA

Refik Soyer
Department of Management Science
The George Washington University
Washington, DC 20052
USA

Fabio Spizzichino
Department of Mathematics
University of Rome "La Sapienza"
Piazzale "Aldo Moro"
00185 Rome
Italy

Kishor S. Trivedi
Department of Electrical Eng.
Duke University
Durham, NC 27708
USA

José Benigno Valdez-Torres
Escuela de Ciencias Químicas
Universidad Autónoma de Sinaloa
Culiacán, Sinaloa
Mexico

Frank Van der Duyn Schouten
Center for Economic Research
Tilburg University
P.O. Box 90153
5000 LE Tilburg
The Netherlands

Cyp Van Rijn
Beeckzanglaan 1F
1942 LS Beverwijk
The Netherlands

Ralf E. Wildeman
Econometric Institute
Erasmus University Rotterdam
P.O. Box 1738
3000 DR Rotterdam
The Netherlands
NATO ASI Series F
Including Special Programmes on Sensory Systems for Robotic Control (ROB) and on
Advanced Educational Technology (AET)
Vol. 46: Recent Advances in Speech Understanding and Dialog Systems. Edited by H. Niemann, M.
Lang and G. Sagerer. X, 521 pages. 1988.
Vol. 47: Advanced Computing Concepts and Techniques in Control Engineering. Edited by M. J.
Denham and A. J. Laub. XI, 518 pages. 1988. (out of print)
Vol. 48: Mathematical Models for Decision Support. Edited by G. Mitra. IX, 762 pages. 1988.
Vol. 49: Computer Integrated Manufacturing. Edited by I. B. Turksen. VIII, 568 pages. 1988.
Vol. 50: CAD Based Programming for Sensory Robots. Edited by B. Ravani. IX, 565 pages. 1988.
(ROB)
Vol. 51: Algorithms and Model Formulations in Mathematical Programming. Edited by S. W. Wallace.
IX, 190 pages. 1989.
Vol. 52: Sensor Devices and Systems for Robotics. Edited by A. Casals. IX, 362 pages. 1989. (ROB)
Vol. 53: Advanced Information Technologies for Industrial Material Flow Systems. Edited by S. Y. Nof
and C. L. Moodie. IX, 710 pages. 1989.
Vol. 54: A Reappraisal of the Efficiency of Financial Markets. Edited by R. M. C. Guimarães, B. G.
Kingsman and S. J. Taylor. X, 804 pages. 1989.
Vol. 55: Constructive Methods in Computing Science. Edited by M. Broy. VII, 478 pages. 1989.
Vol. 56: Multiple Criteria Decision Making and Risk Analysis Using Microcomputers. Edited by
B. Karpak and S. Zionts. VII, 399 pages. 1989.
Vol. 57: Kinematics and Dynamic Issues in Sensor Based Control. Edited by G. E. Taylor. XI, 456
pages. 1990. (ROB)
Vol. 58: Highly Redundant Sensing in Robotic Systems. Edited by J. T. Tou and J. G. Balchen. X, 322
pages. 1990. (ROB)
Vol. 59: Superconducting Electronics. Edited by H. Weinstock and M. Nisenoff. X, 441 pages. 1989.
Vol. 60: 3D Imaging in Medicine. Algorithms, Systems, Applications. Edited by K. H. Höhne, H. Fuchs
and S. M. Pizer. IX, 460 pages. 1990. (out of print)
Vol. 61: Knowledge, Data and Computer-Assisted Decisions. Edited by M. Schader and W. Gaul. VIII,
421 pages. 1990.
Vol. 62: Supercomputing. Edited by J. S. Kowalik. X, 425 pages. 1990.
Vol. 63: Traditional and Non-Traditional Robotic Sensors. Edited by T. C. Henderson. VIII, 468 pages.
1990. (ROB)
Vol. 64: Sensory Robotics for the Handling of Limp Materials. Edited by P. M. Taylor. IX, 343 pages.
1990. (ROB)
Vol. 65: Mapping and Spatial Modelling for Navigation. Edited by L. F. Pau. VIII, 357 pages. 1990.
(ROB)
Vol. 66: Sensor-Based Robots: Algorithms and Architectures. Edited by C. S. G. Lee. X, 285 pages.
1991. (ROB)
Vol. 67: Designing Hypermedia for Learning. Edited by D. H. Jonassen and H. Mandl. XXV, 457 pages.
1990. (AET)
Vol. 68: Neurocomputing. Algorithms, Architectures and Applications. Edited by F. Fogelman Soulié
and J. Herault. XI, 455 pages. 1990.
Vol. 69: Real-Time Integration Methods for Mechanical System Simulation. Edited by E. J. Haug and
R. C. Deyo. VIII, 352 pages. 1991.
Vol. 70: Numerical Linear Algebra, Digital Signal Processing and Parallel Algorithms. Edited by
G. H. Golub and P. Van Dooren. XIII, 729 pages. 1991.
Vol. 71: Expert Systems and Robotics. Edited by T. Jordanides and B. Torby. XII, 744 pages. 1991.
Vol. 72: High-Capacity Local and Metropolitan Area Networks. Architecture and Performance Issues.
Edited by G. Pujolle. X, 536 pages. 1991.
Vol. 73: Automation and Systems Issues in Air Traffic Control. Edited by J. A. Wise, V. D. Hopkin and
M. L. Smith. XIX, 594 pages. 1991.
Vol. 74: Picture Archiving and Communication Systems (PACS) in Medicine. Edited by H. K. Huang,
O. Ratib, A. R. Bakker and G. Witte. XI, 438 pages. 1991.
Vol. 75: Speech Recognition and Understanding. Recent Advances, Trends and Applications. Edited
by P. Laface and Renato De Mori. XI, 559 pages. 1991.
Vol. 76: Multimedia Interface Design in Education. Edited by A. D. N. Edwards and S. Holland. XIV,
216 pages. 1992. (AET)
Vol. 77: Computer Algorithms for Solving Linear Algebraic Equations. The State of the Art. Edited by
E. Spedicato. VIII, 352 pages. 1991.
Vol. 78: Integrating Advanced Technology into Technology Education. Edited by M. Hacker,
A. Gordon and M. de Vries. VIII, 185 pages. 1991. (AET)
Vol. 79: Logic, Algebra, and Computation. Edited by F. L. Bauer. VII, 485 pages. 1991.
Vol. 80: Intelligent Tutoring Systems for Foreign Language Leaming. Edited by M. L. Swartz and
M. Yazdani. IX, 347 pages. 1992. (AET)
Vol. 81: Cognitive Tools for Learning. Edited by P. A. M. Kommers, D. H. Jonassen, and J. T. Mayes.
X, 278 pages. 1992. (AET)
Vol. 82: Combinatorial Optimization. New Frontiers in Theory and Practice. Edited by M. Akgül, H. W.
Hamacher, and S. Tüfekçi. XI, 334 pages. 1992.
Vol. 83: Active Perception and Robot Vision. Edited by A. K. Sood and H. Wechsler. IX, 756 pages.
1992.
Vol. 84: Computer-Based Learning Environments and Problem Solving. Edited by E. De Corte, M. C.
Linn, H. Mandl, and L. Verschaffel. XVI, 488 pages. 1992. (AET)
Vol. 85: Adaptive Learning Environments. Foundations and Frontiers. Edited by M. Jones and P. H.
Winne. VIII, 408 pages. 1992. (AET)
Vol. 86: Intelligent Learning Environments and Knowledge Acquisition in Physics. Edited by
A. Tiberghien and H. Mandl. VIII, 285 pages. 1992. (AET)
Vol. 87: Cognitive Modelling and Interactive Environments. With demo diskettes (Apple and IBM
compatible). Edited by F. L. Engel, D. G. Bouwhuis, T. Bösser, and G. d'Ydewalle. IX, 311 pages.
1992. (AET)
Vol. 88: Programming and Mathematical Method. Edited by M. Broy. VIII, 428 pages. 1992.
Vol. 89: Mathematical Problem Solving and New Information Technologies. Edited by J. P. Ponte,
J. F. Matos, J. M. Matos, and D. Fernandes. XV, 346 pages. 1992. (AET)
Vol. 90: Collaborative Learning Through Computer Conferencing. Edited by A. R. Kaye. X, 260 pages.
1992. (AET)
Vol. 91: New Directions for Intelligent Tutoring Systems. Edited by E. Costa. X, 296 pages. 1992.
(AET)
Vol. 92: Hypermedia Courseware: Structures of Communication and Intelligent Help. Edited by
A. Oliveira. X, 241 pages. 1992. (AET)
Vol. 93: Interactive Multimedia Learning Environments. Human Factors and Technical Considerations
on Design Issues. Edited by M. Giardina. VIII, 254 pages. 1992. (AET)
Vol. 94: Logic and Algebra of Specification. Edited by F. L. Bauer, W. Brauer, and H. Schwichtenberg.
VII, 442 pages. 1993.
Vol. 95: Comprehensive Systems Design: A New Educational Technology. Edited by C. M. Reigeluth,
B. H. Banathy, and J. R. Olson. IX, 437 pages. 1993. (AET)
Vol. 96: New Directions in Educational Technology. Edited by E. Scanlon and T. O'Shea. VIII, 251
pages. 1992. (AET)
Vol. 97: Advanced Models of Cognition for Medical Training and Practice. Edited by D. A. Evans and
V. L. Patel. XI, 372 pages. 1992. (AET)
Vol. 98: Medical Images: Formation, Handling and Evaluation. Edited by A. E. Todd-Pokropek and
M. A. Viergever. IX, 700 pages. 1992.
Vol. 99: Multisensor Fusion for Computer Vision. Edited by J. K. Aggarwal. XI, 456 pages. 1993. (ROB)
Vol. 100: Communication from an Artificial Intelligence Perspective. Theoretical and Applied Issues.
Edited by A. Ortony, J. Slack and O. Stock. XII, 260 pages. 1992.
Vol. 101: Recent Developments in Decision Support Systems. Edited by C. W. Holsapple and A. B.
Whinston. XI, 618 pages. 1993.
Vol. 102: Robots and Biological Systems: Towards a New Bionics? Edited by P. Dario, G. Sandini and
P. Aebischer. XII, 786 pages. 1993.
Vol. 103: Parallel Computing on Distributed Memory Multiprocessors. Edited by F. Özgüner and
F. Erçal. VIII, 332 pages. 1993.
Vol. 104: Instructional Models in Computer-Based Learning Environments. Edited by S. Dijkstra,
H. P. M. Krammer and J. J. G. van Merrienboer. X, 510 pages. 1993. (AET)
Vol. 105: Designing Environments for Constructive Learning. Edited by T. M. Duffy, J. Lowyck and
D. H. Jonassen. VIII, 374 pages. 1993. (AET)
Vol. 106: Software for Parallel Computation. Edited by J. S. Kowalik and L. Grandinetti. IX, 363 pages.
1993.
Vol. 107: Advanced Educational Technologies for Mathematics and Science. Edited by D. L.
Ferguson. XII, 749 pages. 1993. (AET)
Vol. 108: Concurrent Engineering: Tools and Technologies for Mechanical System Design. Edited by
E. J. Haug. XIII, 998 pages. 1993.
Vol. 109: Advanced Educational Technology in Technology Education. Edited by A. Gordon,
M. Hacker and M. de Vries. VIII, 253 pages. 1993. (AET)
Vol. 110: Verification and Validation of Complex Systems: Human Factors Issues. Edited by J. A.
Wise, V. D. Hopkin and P. Stager. XIII, 704 pages. 1993.
Vol. 111: Cognitive Models and Intelligent Environments for Learning Programming. Edited by
E. Lemut, B. du Boulay and G. Dettori. VIII, 305 pages. 1993. (AET)
Vol. 112: Item Banking: Interactive Testing and Self-Assessment. Edited by D. A. Leclercq and J. E.
Bruno. VIII, 261 pages. 1993. (AET)
Vol. 113: Interactive Learning Technology for the Deaf. Edited by B. A. G. Elsendoorn and F. Coninx.
XIII, 285 pages. 1993. (AET)
Vol. 114: Intelligent Systems: Safety, Reliability and Maintainability Issues. Edited by O. Kaynak,
G. Honderd and E. Grant. XI, 340 pages. 1993.
Vol. 115: Learning Electricity and Electronics with Advanced Educational Technology. Edited by
M. Caillot. VII, 329 pages. 1993. (AET)
Vol. 116: Control Technology in Elementary Education. Edited by B. Denis. IX, 311 pages. 1993. (AET)
Vol. 117: Intelligent Learning Environments: The Case of Geometry. Edited by J.-M. Laborde. VIII, 267
pages. 1996. (AET)
Vol. 118: Program Design Calculi. Edited by M. Broy. VIII, 409 pages. 1993.
Vol. 119: Automating Instructional Design, Development, and Delivery. Edited by R. D. Tennyson.
VIII, 266 pages. 1994. (AET)
Vol. 120: Reliability and Safety Assessment of Dynamic Process Systems. Edited by T. Aldemir,
N. O. Siu, A. Mosleh, P. C. Cacciabue and B. G. Göktepe. X, 242 pages. 1994.
Vol. 121: Learning from Computers: Mathematics Education and Technology. Edited by C. Keitel and
K. Ruthven. XIII, 332 pages. 1993. (AET)
Vol. 122: Simulation-Based Experiential Learning. Edited by D. M. Towne, T. de Jong and H. Spada.
XIV, 274 pages. 1993. (AET)
Vol. 123: User-Centred Requirements for Software Engineering Environments. Edited by D. J.
Gilmore, R. L. Winder and F. Detienne. VII, 377 pages. 1994.
Vol. 124: Fundamentals in Handwriting Recognition. Edited by S. Impedovo. IX, 496 pages. 1994.
Vol. 125: Student Modelling: The Key to Individualized Knowledge-Based Instruction. Edited by J. E.
Greer and G. I. McCalla. X, 383 pages. 1994. (AET)
Vol. 126: Shape in Picture. Mathematical Description of Shape in Grey-level Images. Edited by
Y.-L. O, A. Toet, D. Foster, H. J. A. M. Heijmans and P. Meer. XI, 676 pages. 1994.
Vol. 127: Real Time Computing. Edited by W. A. Halang and A. D. Stoyenko. XXII, 762 pages. 1994.
Vol. 128: Computer Supported Collaborative Learning. Edited by C. O'Malley. X, 303 pages. 1994.
(AET)
Vol. 129: Human-Machine Communication for Educational Systems Design. Edited by M. D.
Brouwer-Janse and T. L. Harrington. X, 342 pages. 1994. (AET)
Vol. 130: Advances in Object-Oriented Database Systems. Edited by A. Dogac, M. T. Özsu, A. Biliris
and T. Sellis. XI, 515 pages. 1994.
Vol. 131: Constraint Programming. Edited by B. Mayoh, E. Tyugu and J. Penjam. VII, 452 pages.
1994.
Vol. 132: Mathematical Modelling Courses for Engineering Education. Edited by Y. Ersoy and A. O.
Moscardini. X, 246 pages. 1994. (AET)
Vol. 133: Collaborative Dialogue Technologies in Distance Learning. Edited by M. F. Verdejo and
S. A. Cerri. XIV, 296 pages. 1994. (AET)
Vol. 134: Computer Integrated Production Systems and Organizations. The Human-Centred
Approach. Edited by F. Schmid, S. Evans, A. W. S. Ainger and R. J. Grieve. X, 347 pages. 1994.
Vol. 135: Technology Education in School and Industry. Emerging Didactics for Human Resource
Development. Edited by D. Blandow and M. J. Dyrenfurth. XI, 367 pages. 1994. (AET)
Vol. 136: From Statistics to Neural Networks. Theory and Pattern Recognition Applications. Edited
by V. Cherkassky, J. H. Friedman and H. Wechsler. XII, 394 pages. 1994.
Vol. 137: Technology-Based Learning Environments. Psychological and Educational Foundations.
Edited by S. Vosniadou, E. De Corte and H. Mandl. X, 302 pages. 1994. (AET)
Vol. 138: Exploiting Mental Imagery with Computers in Mathematics Education. Edited by
R. Sutherland and J. Mason. VIII, 326 pages. 1995. (AET)
Vol. 139: Proof and Computation. Edited by H. Schwichtenberg. VII, 470 pages. 1995.
Vol. 140: Automating Instructional Design: Computer-Based Development and Delivery Tools. Edited
by R. D. Tennyson and A. E. Barron. IX, 618 pages. 1995. (AET)
Vol. 141: Organizational Learning and Technological Change. Edited by C. Zucchermaglio, S.
Bagnara and S. U. Stucky. X, 368 pages. 1995. (AET)
Vol. 142: Dialogue and Instruction. Modeling Interaction in Intelligent Tutoring Systems. Edited by
R.-J. Beun, M. Baker and M. Reiner. IX, 368 pages. 1995. (AET)
Vol. 144: The Biology and Technology of Intelligent Autonomous Agents. Edited by Luc Steels. VIII,
517 pages. 1995.
Vol. 145: Advanced Educational Technology: Research Issues and Future Potential. Edited by T. T.
Liao. VIII, 219 pages. 1996. (AET)
Vol. 146: Computers and Exploratory Learning. Edited by A. A. diSessa, C. Hoyles and R. Noss.
VIII, 482 pages. 1995. (AET)
Vol. 147: Speech Recognition and Coding. New Advances and Trends. Edited by A. J. Rubio Ayuso
and J. M. López Soler. XI, 505 pages. 1995.
Vol. 148: Knowledge Acquisition, Organization, and Use in Biology. Edited by K. M. Fisher and M. R.
Kibby. X, 246 pages. 1996. (AET)
Vol. 149: Emergent Computing Methods in Engineering Design. Applications of Genetic Algorithms
and Neural Networks. Edited by D.E. Grierson and P. Hajela. VIII, 350 pages. 1996.
Vol. 152: Deductive Program Design. Edited by M. Broy. IX, 467 pages. 1996.
Vol. 153: Identification, Adaptation, Learning. Edited by S. Bittanti and G. Picci. XIV, 553 pages. 1996.
Vol. 154: Reliability and Maintenance of Complex Systems. Edited by S. Özekici. XI, 589 pages. 1996.
