
Johnson, B.W.

Fault Tolerance
The Electrical Engineering Handbook
Ed. Richard C. Dorf
Boca Raton: CRC Press LLC, 2000
2000 by CRC Press LLC
93
Fault Tolerance
93.1 Introduction
93.2 Hardware Redundancy
93.3 Information Redundancy
93.4 Time Redundancy
93.5 Software Redundancy
93.6 Dependability Evaluation
93.1 Introduction
Fault tolerance is the ability of a system to continue correct performance of its tasks after the occurrence of
hardware or software faults. A fault is simply any physical defect, imperfection, or flaw that occurs in hardware
or software. Applications of fault-tolerant computing can be categorized broadly into four primary areas:
long-life, critical computations, maintenance postponement, and high availability. The most common examples of
long-life applications are unmanned space flight and satellites. Examples of critical-computation applications
include aircraft flight control systems, military systems, and certain types of industrial controllers. Maintenance
postponement applications appear most frequently when maintenance operations are extremely costly,
inconvenient, or difficult to perform. Remote processing stations and certain space applications are good examples.
Banking and other time-shared systems are good examples of high-availability applications. Fault tolerance can
be achieved in systems by incorporating various forms of redundancy, including hardware, information, time,
and software redundancy [Johnson, 1989].
93.2 Hardware Redundancy
The physical replication of hardware is perhaps the most common form of fault tolerance used in systems. As
semiconductor components have become smaller and less expensive, the concept of hardware redundancy has
become more common and more practical. There are three basic forms of hardware redundancy. First, passive
techniques use the concept of fault masking to hide the occurrence of faults and prevent the faults from resulting
in errors. Passive approaches are designed to achieve fault tolerance without requiring any action on the part
of the system or an operator. Passive techniques, in their most basic form, do not provide for the detection of
faults but simply mask the faults. An example of a passive approach is triple modular redundancy (TMR),
which is illustrated in Fig. 93.1. In the TMR system three identical units perform identical functions, and a
majority vote is performed on the output.
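The majority vote at the heart of TMR can be sketched in a few lines. This is a minimal illustration, not a hardware design: the names `majority_vote`, `tmr`, `good`, and `faulty` are hypothetical, and the modules are modeled as plain Python functions.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value produced by a strict majority of modules, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) // 2 else None

def tmr(modules, x):
    """Run three redundant modules on the same input and vote on their outputs."""
    return majority_vote([module(x) for module in modules])

def good(x):
    return x * x

def faulty(x):
    return x * x + 1  # hypothetical module containing a fault

# One faulty module out of three is outvoted, so the fault is masked.
print(tmr([good, good, faulty], 7))  # 49
```

Note that the voter masks the fault without reporting it, which matches the statement that basic passive techniques do not provide fault detection.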
The second form of hardware redundancy is the active approach, which is sometimes called the dynamic
method. Active methods achieve fault tolerance by detecting the existence of faults and performing some action
to remove the faulty hardware from the system. In other words, active techniques require that the system
perform reconfiguration to tolerate faults. Active hardware redundancy uses fault detection, fault location, and
fault recovery in an attempt to achieve fault tolerance. An example of an active approach to hardware redundancy
is standby sparing, which is illustrated in Fig. 93.2. In standby sparing one or more units operate as spares and
replace the primary unit when it fails.
Barry W. Johnson
University of Virginia
The final form of hardware redundancy is the hybrid approach. Hybrid techniques combine the attractive
features of both the passive and active approaches. Fault masking is used in hybrid systems to prevent erroneous
results from being generated. Fault detection, fault location, and fault recovery are also used in the hybrid
approaches to improve fault tolerance by removing faulty hardware and replacing it with spares. Providing
spares is one form of providing redundancy in a system. Hybrid methods are most often used in the
critical-computation applications where fault masking is required to prevent momentary errors, and high reliability
must be achieved. The basic concept of the hybrid approach is illustrated in Fig. 93.3.
93.3 Information Redundancy
Another approach to fault tolerance is to employ redundancy of information. Information redundancy is simply
the addition of redundant information to data to allow fault detection, fault masking, or possibly fault tolerance.
Good examples of information redundancy are error detecting and error correcting codes, formed by the
addition of redundant information to data words or by the mapping of data words into new representations
containing redundant information [Lin and Costello, 1983].
In general, a code is a means of representing information, or data, using a well-defined set of rules. A code
word is a collection of symbols, often called digits if the symbols are numbers, used to represent a particular
piece of data based upon a specified code. A binary code is one in which the symbols forming each code word
consist of only the digits 0 and 1. A code word is said to be valid if the code word adheres to all of the rules
that define the code; otherwise, the code word is said to be invalid.
FIGURE 93.1 Fault masking using triple modular redundancy (TMR). (Block labels: Module 1, Module 2, Module 3, Voter, Output.)
FIGURE 93.2 General concept of standby sparing.
The encoding operation is the process of determining the corresponding code word for a particular data item.
In other words, the encoding process takes an original data item and represents it as a code word using the
rules of the code. The decoding operation is the process of recovering the original data from the code word. In
other words, the decoding process takes a code word and determines the data that it represents.
It is possible to create a binary code for which the valid code words are a subset of the total number of
possible combinations of 1s and 0s. If the code words are formed correctly, errors introduced into a code word
will force it to lie in the range of illegal, or invalid, code words, and the error can be detected. This is the basic
concept of the error detecting codes. The basic concept of the error correcting code is that the code word is
structured such that it is possible to determine the correct code word from the corrupted, or erroneous, code
word.
A fundamental concept in the characterization of codes is the Hamming distance [Hamming, 1950]. The
Hamming distance between any two binary words is the number of bit positions in which the two words differ.
For example, the binary words 0000 and 0001 differ in only one position and therefore have a Hamming distance
of 1. The binary words 0000 and 0101, however, differ in two positions; consequently, their Hamming distance
is 2. Clearly, if two words have a Hamming distance of 1, it is possible to change one word into the other simply
by modifying one bit in one of the words. If, however, two words differ in two bit positions, it is impossible
to transform one word into the other by changing one bit in one of the words.
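The two worked cases above can be checked directly. The following is a small sketch that represents binary words as strings of '0' and '1'; the function name `hamming_distance` is an illustrative choice.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the bit positions in which two equal-length binary words differ."""
    if len(a) != len(b):
        raise ValueError("words must have the same length")
    return sum(x != y for x, y in zip(a, b))

print(hamming_distance("0000", "0001"))  # 1
print(hamming_distance("0000", "0101"))  # 2
```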
The Hamming distance gives insight into the requirements of error detecting codes and error correcting
codes. We define the distance of a code as the minimum Hamming distance between any two valid code words.
If a binary code has a distance of 2, then any single-bit error introduced into a code word will result in the
erroneous word being an invalid code word because all valid code words differ in at least two bit positions. If
a code has a distance of 3, then any single-bit error or any double-bit error will result in the erroneous word
being an invalid code word because all valid code words differ in at least three positions. However, a code
distance of 3 allows any single-bit error to be corrected, if it is desired to do so, because the erroneous word
with a single-bit error will be a Hamming distance of 1 from the correct code word and at least a Hamming
distance of 2 from all others. Consequently, the correct code word can be identified from the corrupted code
word.
FIGURE 93.3 Hybrid redundancy approach. (Block labels: Inputs; Primary Module 1 to Primary Module N; Spare Module 1 to Spare Module M; Reconfiguration Unit; Active Unit Outputs; Voter; Output; Fault Detection Unit; Disagreement Detection.)
In general, a binary code can correct up to $t$ bit errors and detect an additional $d$ bit errors if and only if
$$2t + d + 1 \le H_d$$
where $H_d$ is the distance of the code [Nelson and Carroll, 1986]. For example, a code with a distance of 2 cannot
provide any error correction but can detect single-bit errors. Similarly, a code with a distance of 3 can correct
single-bit errors or detect a double-bit error.
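The inequality can be explored numerically. The sketch below computes the distance of a small code and lists every pair (t, d) that satisfies 2t + d + 1 <= H_d. The function names and the two example codes (3-bit even-parity words and a 3-bit repetition code) are illustrative assumptions, not taken from the text.

```python
from itertools import combinations

def code_distance(code_words):
    """Minimum Hamming distance between any two valid code words."""
    return min(
        sum(x != y for x, y in zip(a, b))
        for a, b in combinations(code_words, 2)
    )

def capabilities(h_d):
    """All (t, d) pairs with 2t + d + 1 <= h_d: correct t errors, detect d more."""
    return [(t, d) for t in range(h_d) for d in range(h_d) if 2 * t + d + 1 <= h_d]

even_parity = ["000", "011", "101", "110"]  # assumed example, distance 2
repetition = ["000", "111"]                 # assumed example, distance 3

print(code_distance(even_parity), capabilities(code_distance(even_parity)))
print(code_distance(repetition), capabilities(code_distance(repetition)))
```

For the distance-2 code the list contains (0, 1) but no pair with t = 1. For the distance-3 code it contains both (1, 0) and (0, 2), matching the two examples in the paragraph above.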
A second fundamental concept of codes is separability. A separable code is one in which the original
information is appended with new information to form the code word, thus allowing the decoding process to consist
of simply removing the additional information and keeping the original data. In other words, the original data
is obtained from the code word by stripping away extra bits, called the code bits or check bits, and retaining
only those associated with the original information. A nonseparable code does not possess the property of
separability and, consequently, requires more complicated decoding procedures.
Perhaps the simplest form of a code is the parity code. The basic concept of parity is very straightforward,
but there are variations on the fundamental idea. Single-bit parity codes require the addition of an extra bit to
a binary word such that the resulting code word has either an even number of 1s or an odd number of 1s. If
the extra bit results in the total number of 1s in the code word being odd, the code is referred to as odd parity.
If the resulting number of 1s in the code word is even, the code is called even parity. If a code word with odd
parity experiences a change in one of its bits, the parity will become even. Likewise, if a code word with even
parity encounters a single-bit change, the parity will become odd. Consequently, a single-bit error can be
detected by checking the number of 1s in the code words. The single-bit parity code (either odd or even) has
a distance of 2, therefore allowing any single-bit error to be detected but not corrected. Figure 93.4 illustrates
the use of parity coding in a simple memory application.
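A single-bit parity code is easy to sketch in software. The example below appends the parity bit to the data word, which makes it a separable code. Words are strings of '0' and '1', and the names `parity_bit`, `encode`, and `check` are hypothetical.

```python
def parity_bit(data_bits: str, odd: bool = False) -> str:
    """Return the extra bit that gives the code word even (default) or odd parity."""
    bit = data_bits.count("1") % 2  # 1 if the data alone has an odd number of 1s
    if odd:
        bit ^= 1
    return str(bit)

def encode(data_bits: str, odd: bool = False) -> str:
    """Append the parity bit to the data word."""
    return data_bits + parity_bit(data_bits, odd)

def check(code_word: str, odd: bool = False) -> bool:
    """Return True if the code word has the expected parity."""
    return code_word.count("1") % 2 == (1 if odd else 0)

word = encode("1011")     # "10111", four 1s, even parity
print(check(word))        # True
print(check("00111"))     # False: one bit of the word was flipped
print(check("01111"))     # True: two flipped bits go undetected
```

The last line illustrates why a distance-2 code detects single-bit errors only.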
Arithmetic codes are very useful when it is desired to check arithmetic operations such as addition,
multiplication, and division [Avizienis, 1971]. The basic concept is the same as all coding techniques. The data presented
to the arithmetic operation is encoded before the operations are performed. After completing the arithmetic
operations, the resulting code words are checked to make sure that they are valid code words. If the resulting
code words are not valid, an error condition is signaled. An arithmetic code must be invariant to a set of
arithmetic operations. An arithmetic code, $A$, has the property that $A(b * c) = A(b) * A(c)$, where $b$ and $c$ are
operands, $*$ is some arithmetic operation, and $A(b)$ and $A(c)$ are the arithmetic code words for the operands
$b$ and $c$, respectively. Stated verbally, the performance of the arithmetic operation on two arithmetic code words
will produce the arithmetic code word of the result of the arithmetic operation. To completely define an
arithmetic code, the method of encoding and the arithmetic operations for which the code is invariant must
be specified. The most common examples of arithmetic codes are the AN codes, residue codes, and the inverse
residue codes.
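As an illustration of the invariance property, the sketch below uses an AN code, in which a data value N is encoded as the product A·N. The constant A = 3 and all function names are assumptions chosen for the example, and only addition is checked.

```python
A = 3  # assumed check constant for this sketch

def an_encode(n: int) -> int:
    """Encode n as the code word A * n."""
    return A * n

def an_is_valid(word: int) -> bool:
    """A valid AN code word is an exact multiple of A."""
    return word % A == 0

def an_decode(word: int) -> int:
    """Recover the data value, refusing invalid code words."""
    if not an_is_valid(word):
        raise ValueError("invalid code word: error detected")
    return word // A

b, c = 5, 9
total = an_encode(b) + an_encode(c)      # add the code words, not the data
print(total == an_encode(b + c))         # True: the code is invariant to addition
print(an_decode(total))                  # 14

corrupted = total ^ 0b100                # flip a single bit of the result
print(an_is_valid(corrupted))            # False: the error is signaled
```

With A = 3, a single-bit error changes the word by a power of two, which is never a multiple of 3, so every single-bit error produces an invalid code word.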
FIGURE 93.4 Use of parity coding in a memory application. (Block labels: Data In, Parity Generator, Parity Bit, Memory, Data, Parity Bit, Parity Checker, Data Out, Error Signal.)
93.4 Time Redundancy
Time redundancy methods attempt to reduce the amount of extra hardware at the expense of using additional
time. In many applications, the time is of much less importance than the hardware because hardware is a
physical entity that impacts weight, size, power consumption, and cost. Time, on the other hand, may be readily
available in some applications. The basic concept of time redundancy is the repetition of computations in ways
that allow faults to be detected. Time redundancy can function in a system in several ways. The fundamental
concept is to perform the same computation two or more times and compare the results to determine if a
discrepancy exists. If an error is detected, the computations can be performed again to see if the disagreement
remains or disappears. Such approaches are often good for detecting errors resulting from transient faults but
cannot provide protection against errors resulting from permanent faults.
The main problem with many time redundancy techniques is assuring that the system has the same data to
manipulate each time it redundantly performs a computation. If a transient fault has occurred, a system's data
may be completely corrupted, making it difficult to repeat a given computation. Time redundancy has been
used primarily to detect transients in systems. One of the biggest potentials of time redundancy, however, now
appears to be the ability to detect permanent faults while using a minimum of extra hardware. The fundamental
concept is illustrated in Fig. 93.5. During the first computation or transmission, the operands are used as
presented and the results are stored in a register. Prior to the second computation or transmission, the operands
are encoded in some fashion using an encoding function. After the operations have been performed on the
encoded data, the results are then decoded and compared to those obtained during the first operation. The
selection of the encoding function is made so as to allow faults in the hardware to be detected. Example encoding
functions might include the complementation operator and an arithmetic shift.
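The idea of recomputing with encoded operands can be sketched with a left shift as the encoding function and a right shift as the decoding function. Everything here is an assumption made for illustration: the adder is modeled as a Python function, and `stuck_add` simulates a hypothetical permanent fault that forces output bit 2 to 0.

```python
def recompute_with_shifted_operands(add, a, b):
    """Return (first_result, error_flag) after two computations on the same unit."""
    first = add(a, b)                   # first pass: operands as presented
    second = add(a << 1, b << 1) >> 1   # second pass: encode, compute, decode
    return first, first != second

def good_add(a, b):
    return a + b

def stuck_add(a, b):
    return (a + b) & ~0b100  # hypothetical permanent fault: output bit 2 stuck at 0

print(recompute_with_shifted_operands(good_add, 3, 4))   # (7, False)
print(recompute_with_shifted_operands(stuck_add, 3, 4))  # (3, True)
print(stuck_add(3, 4) == stuck_add(3, 4))                # True: plain repetition misses it
```

The shift moves each result bit to a different position, so the stuck bit corrupts the two passes differently and the comparison exposes the permanent fault.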
93.5 Software Redundancy
Software faults are unusual entities. Software does not break as hardware does, but instead software faults are
the result of incorrect software designs or coding mistakes. Therefore, any technique that detects faults in
software must detect design flaws. A simple duplication and comparison procedure will not detect software
faults if the duplicated software modules are identical, because the design mistakes will appear in both modules.
The concept of N self-checking programming is to first write N unique versions of the program and to
develop a set of acceptance tests for each version. The acceptance tests are essentially checks performed on the
results produced by the program and may be created using consistency checks and capability checks, for example.
Selection logic, which may be a program itself, chooses the results from one of the programs that passes the
acceptance tests. This approach is analogous to the hardware technique known as hot standby sparing. Since
each program is running simultaneously, the reconfiguration process can be very fast. Provided that the software
faults in each version of the program are independent and the faults are detected as they occur by the acceptance
tests, then this approach can tolerate N - 1 faults. It is important to note that the assumptions of fault
independence and perfect fault coverage are very big assumptions to make in almost all applications.
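A rough sketch of the selection logic follows, assuming each version is supplied as a (program, acceptance test) pair. The integer square root task, the deliberately flawed `version_a`, and all names are hypothetical. The versions run one after another here, whereas the text describes them running simultaneously.

```python
import math

def n_self_checking(versions, x):
    """Run every (program, acceptance_test) pair, then select a result that passed."""
    outcomes = []
    for program, accept in versions:
        try:
            result = program(x)
            outcomes.append((result, accept(x, result)))
        except Exception:
            outcomes.append((None, False))  # a crash counts as a failed test
    for result, passed in outcomes:         # selection logic
        if passed:
            return result
    raise RuntimeError("no version passed its acceptance test")

def isqrt_ok(n, r):
    """Consistency check: r is the integer square root of n."""
    return r * r <= n < (r + 1) * (r + 1)

def version_a(n):
    return int(n ** 0.5) + 1  # hypothetical design flaw: off by one

def version_b(n):
    return math.isqrt(n)

# Both versions share one check here for brevity; each could have its own.
versions = [(version_a, isqrt_ok), (version_b, isqrt_ok)]
print(n_self_checking(versions, 17))  # 4
```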
FIGURE 93.5 Time redundancy concept. (Block labels: at time $t_0$, Data, Computation, Store Result; at time $t_1$, Data, Encode Data, Computation, Decode Result, Store Result; Compare Results; Error.)
The concept of N-version programming was developed to allow certain design flaws in software modules to
be tolerated [Chen and Avizienis, 1978]. The basic concept of N-version programming is to design and code
the software module N times and to vote on the N results produced by these modules. Each of the N modules
is designed and coded by a separate group of programmers. Each group designs the software from the same
set of specifications such that each of the N modules performs the same function. However, it is hoped that
by performing the N designs independently, the same mistakes will not be made by the different groups.
Therefore, when a fault occurs, the fault will either not occur in all modules or it will occur differently in each
module, so that the results generated by the modules will differ. Assuming that the faults are independent, the
approach can tolerate (N - 1)/2 faults where N is odd.
The recovery block approach to software fault tolerance is analogous to the active approaches to hardware
fault tolerance, specifically the cold standby sparing approach. N versions of a program are provided, and a
single set of acceptance tests is used. One version of the program is designated as the primary version, and the
remaining N - 1 versions are designated as spares, or secondary versions. The primary version of the software
is always used unless it fails to pass the acceptance tests. If the acceptance tests are failed by the primary version,
then the first secondary version is tried. This process continues until one version passes the acceptance tests or
the system fails because none of the versions can pass the tests.
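The control flow of a recovery block can be sketched as a loop over the versions with one shared acceptance test. The sorting task and the flawed `primary_sort` are hypothetical, and the sketch does not save or restore any system state between attempts.

```python
def recovery_block(versions, acceptance_test, x):
    """Try the primary version first, then each secondary, until one passes."""
    for version in versions:
        try:
            result = version(x)
        except Exception:
            continue  # treat a crash like a failed acceptance test
        if acceptance_test(x, result):
            return result
    raise RuntimeError("all versions failed the acceptance test")

def primary_sort(xs):
    return sorted(set(xs))  # hypothetical design flaw: drops duplicates

def secondary_sort(xs):
    return sorted(xs)

def sort_acceptance(xs, ys):
    """Single acceptance test: output is in order and has the input's length."""
    in_order = all(a <= b for a, b in zip(ys, ys[1:]))
    return in_order and len(ys) == len(xs)

print(recovery_block([primary_sort, secondary_sort], sort_acceptance, [3, 1, 3]))
# [1, 3, 3]
```

Unlike N self-checking programming, a secondary version runs only after the version before it has failed the test, which mirrors cold standby sparing.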
93.6 Dependability Evaluation
Dependability is defined as the quality of service provided by a system [Laprie, 1985]. Perhaps the most
important measures of dependability are reliability and availability. Fundamental to reliability calculations is
the concept of failure rate. Intuitively, the failure rate is the expected number of failures of a type of device or
system per a given time period [Shooman, 1968]. The failure rate is typically denoted as $\lambda$ when it is assumed
to have a constant value. To more clearly understand the mathematical basis for the concept of a failure rate,
first consider the definition of the reliability function. The reliability $R(t)$ of a component, or a system, is the
conditional probability that the component operates correctly throughout the interval $[t_0, t]$ given that it was
operating correctly at the time $t_0$.
There are a number of different ways in which the failure rate function can be expressed. For example, the
failure rate function $z(t)$ can be written strictly in terms of the reliability function $R(t)$ as
$$z(t) = -\frac{dR(t)/dt}{R(t)}$$
Similarly, $z(t)$ can be written in terms of the unreliability $Q(t)$ as
$$z(t) = -\frac{dR(t)/dt}{R(t)} = \frac{dQ(t)/dt}{1 - Q(t)}$$
where $Q(t) = 1 - R(t)$. The derivative of the unreliability, $dQ(t)/dt$, is called the failure density function.
The failure rate function is clearly dependent upon time; however, experience has shown that the failure rate
function for electronic components does have a period where the value of $z(t)$ is approximately constant. The
commonly accepted relationship between the failure rate function and time for electronic components is called
the bathtub curve and is illustrated in Fig. 93.6. The bathtub curve assumes that during the early life of systems,
failures occur frequently due to substandard or weak components. The decreasing part of the bathtub curve is
called the early-life or infant mortality region. At the opposite end of the curve is the wear-out region where
systems have been functional for a long period of time and are beginning to experience failures due to the
physical wearing of electronic or mechanical components. During the intermediate region, the failure rate
function is assumed to be a constant. The constant portion of the bathtub curve is called the useful-life phase
of the system, and the failure rate function is assumed to have a value of $\lambda$ during that period. $\lambda$ is referred to
as the failure rate and is normally expressed in units of failures per hour.
FIGURE 93.6 Bathtub curve relationship between the failure rate function and time. (Axis and region labels: Failure Rate Function versus Time; Infant Mortality Phase; Useful Life Period with Constant Failure Rate; Wear-Out Phase.)
The reliability can be expressed in terms of the failure rate function as a differential equation of the form
$$\frac{dR(t)}{dt} = -z(t)R(t)$$
The general solution of this differential equation is given by
$$R(t) = e^{-\int z(t)\,dt}$$
If we assume that the system is in the useful-life stage where the failure rate function has a constant value
of $\lambda$, the solution to the differential equation is an exponential function of the parameter $\lambda$ given by
$$R(t) = e^{-\lambda t}$$
where $\lambda$ is the constant failure rate. The exponential relationship between the reliability and time is known as
the exponential failure law, which states that for a constant failure rate function, the reliability varies
exponentially as a function of time.
In addition to the failure rate, the mean time to failure (MTTF) is a useful parameter to specify the quality
of a system. The MTTF is the expected time that a system will operate before the first failure occurs. The MTTF
can be calculated by finding the expected value of the time of failure.
From probability theory, we know that the expected value of a random variable, $X$, is
$$E[X] = \int_{-\infty}^{\infty} x f(x)\,dx$$
where $f(x)$ is the probability density function. In reliability analysis we are interested in the expected value of
the time of failure (MTTF), so
$$\mathrm{MTTF} = \int_0^{\infty} t f(t)\,dt$$
where $f(t)$ is the failure density function, and the integral runs from 0 to $\infty$ because the failure density function
is undefined for times less than 0. We know, however, that the failure density function is
$$f(t) = \frac{dQ(t)}{dt}$$
so, the MTTF can be written as
$$\mathrm{MTTF} = \int_0^{\infty} t\,\frac{dQ(t)}{dt}\,dt$$
Using integration by parts and the fact that $dQ(t)/dt = -dR(t)/dt$ we can show that
$$\mathrm{MTTF} = \int_0^{\infty} t\,\frac{dQ(t)}{dt}\,dt = -\int_0^{\infty} t\,\frac{dR(t)}{dt}\,dt = \Big[-tR(t)\Big]_0^{\infty} + \int_0^{\infty} R(t)\,dt = \int_0^{\infty} R(t)\,dt$$
Consequently, the MTTF is defined in terms of the reliability function as
$$\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt$$
which is valid for any reliability function that satisfies $R(\infty) = 0$.
The mean time to repair (MTTR) is simply the average time required to repair a system. The MTTR is
extremely difficult to estimate and is often determined experimentally by injecting a set of faults, one at a time,
into a system and measuring the time required to repair the system in each case. The MTTR is normally specified
in terms of a repair rate, $\mu$, which is the average number of repairs that occur per time period. The units of
the repair rate are normally number of repairs per hour. The MTTR and the rate, $\mu$, are related by
$$\mathrm{MTTR} = \frac{1}{\mu}$$
It is very important to understand the difference between the MTTF and the mean time between failure
(MTBF). Unfortunately, these two terms are often used interchangeably. While the numerical difference is small
in many cases, the conceptual difference is very important. The MTTF is the average time until the first failure
of a system, while the MTBF is the average time between failures of a system. If we assume that all repairs to
a system make the system perfect once again just as it was when it was new, the relationship between the MTTF
and the MTBF can be determined easily. Once successfully placed into operation, a system will operate, on the
average, a time corresponding to the MTTF before encountering the first failure. The system will then require
some time, MTTR, to repair the system and place it back into operation once again. The system will then be
perfect once again and will operate for a time corresponding to the MTTF before encountering its next failure.
The time between the two failures is the sum of the MTTF and the MTTR and is the MTBF. Thus, the difference
between the MTTF and the MTBF is the MTTR. Specifically, the MTBF is given by
$$\mathrm{MTBF} = \mathrm{MTTF} + \mathrm{MTTR}$$
In most practical applications the MTTR is a small fraction of the MTTF, so the approximation that the MTBF
and MTTF are equal is often quite good. Conceptually, however, it is crucial to understand the difference
between the MTBF and the MTTF.
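These quantities are straightforward to evaluate numerically. The sketch below assumes the exponential failure law, approximates the MTTF integral with the trapezoidal rule over a truncated range, and then forms the MTBF. The failure rate, repair rate, truncation horizon, and function names are illustrative assumptions.

```python
import math

def reliability(t, lam):
    """Exponential failure law R(t) = exp(-lam * t) for a constant failure rate lam."""
    return math.exp(-lam * t)

def mttf_numeric(lam, horizon_in_mttfs=50.0, steps=100_000):
    """Approximate the integral of R(t) from 0 to infinity with the trapezoidal rule."""
    t_max = horizon_in_mttfs / lam  # truncate the infinite upper limit
    h = t_max / steps
    total = 0.5 * (reliability(0.0, lam) + reliability(t_max, lam))
    total += sum(reliability(i * h, lam) for i in range(1, steps))
    return total * h

def mtbf(mttf, mttr):
    """MTBF = MTTF + MTTR, assuming every repair restores the system to new."""
    return mttf + mttr

lam = 1e-3  # failures per hour (illustrative value)
mu = 0.5    # repairs per hour (illustrative value)

print(reliability(1000.0, lam))        # R at t = 1/lam, equal to exp(-1)
print(mttf_numeric(lam))               # close to 1/lam = 1000 hours
print(mtbf(1.0 / lam, 1.0 / mu))       # MTTF plus an MTTR of 2 hours
```

For the exponential law the integral evaluates to 1/λ, so the numerical result serves as a check on the derivation. With these values the MTTR is 0.2% of the MTTF, which illustrates why the MTBF and MTTF are often nearly equal.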
An extremely important parameter in the design and analysis of fault-tolerant systems is fault coverage. The
fault coverage available in a system can have a tremendous impact on the reliability, safety, and other attributes
of the system. Fault coverage is mathematically defined as the conditional probability that, given the existence
of a fault, the system recovers [Bouricius et al., 1969]. The fundamental problem with fault coverage is that it
is extremely difficult to calculate. Probably the most common approach to estimating fault coverage is to develop
a list of all of the faults that can occur in a system and to form, from that list, a list of faults from which the system
can recover. The fault coverage factor is then calculated appropriately.
Reliability is perhaps one of the most important attributes of systems. The reliability of a system is generally
derived in terms of the reliabilities of the individual components of the system. The two models of systems
that are most common in practice are the series and the parallel. In a series system, each element of the system
is required to operate correctly for the system to operate correctly. In a parallel system, on the other hand, only
one of several elements must be operational for the system to perform its functions correctly.
The series system is best thought of as a system that contains no redundancy; that is, each element of the
system is needed to make the system function correctly. In general, a system may contain N elements, and in
a series system each of the N elements is required for the system to function correctly. The reliability of the
series system can be calculated as the probability that none of the elements will fail. Another way to look at
this is that the reliability of the series system is the probability that all of the elements are working properly.
The reliability of a series system is given by
$$R_{\text{series}}(t) = R_1(t) R_2(t) \cdots R_N(t)$$
or
$$R_{\text{series}}(t) = \prod_{i=1}^{N} R_i(t)$$
An interesting relationship exists in a series system if each individual component satisfies the exponential
failure law. Suppose that we have a series system made up of N components, and each component, $i$, has a
constant failure rate of $\lambda_i$. Also assume that each component satisfies the exponential failure law. The reliability
of the series system is given by
$$R_{\text{series}}(t) = e^{-\lambda_1 t} e^{-\lambda_2 t} \cdots e^{-\lambda_N t}$$
$$R_{\text{series}}(t) = e^{-\sum_{i=1}^{N} \lambda_i t}$$
The distinguishing feature of the basic parallel system is that only one of N identical elements is required
for the system to function. The reliability of the parallel system can be written as
$$R_{\text{parallel}}(t) = 1.0 - Q_{\text{parallel}}(t) = 1.0 - \prod_{i=1}^{N} Q_i(t) = 1.0 - \prod_{i=1}^{N} \left(1.0 - R_i(t)\right)$$
It should be noted that the equations for the parallel system assume that the failures of the individual elements
that make up the parallel system are independent.
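The series and parallel expressions translate directly into code. In this sketch the component reliabilities are passed as a list of values at some fixed time; the function names and the sample numbers are illustrative.

```python
import math

def series_reliability(component_reliabilities):
    """Product of the component reliabilities: every element must work."""
    return math.prod(component_reliabilities)

def parallel_reliability(component_reliabilities):
    """One minus the product of the unreliabilities: one working element suffices."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)

rs = [0.9, 0.9, 0.9]  # illustrative component reliabilities
print(series_reliability(rs))    # about 0.729
print(parallel_reliability(rs))  # about 0.999

# With exponential components, the series system behaves like one component
# whose failure rate is the sum of the individual rates.
rates, t = [1e-4, 2e-4, 3e-4], 1000.0
product = series_reliability([math.exp(-lam * t) for lam in rates])
print(math.isclose(product, math.exp(-sum(rates) * t)))  # True
```

The comparison makes the cost of having no redundancy visible: three components of reliability 0.9 give about 0.729 in series but about 0.999 in parallel, assuming independent failures.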
M-of-N systems are a generalization of the ideal parallel system. In the ideal parallel system, only one of N
modules is required to work for the system to work. In the M-of-N system, however, M of the total of N identical
modules are required to function for the system to function. A good example is the TMR configuration where
two of the three modules must work for the majority voting mechanism to function properly. Therefore, the
TMR system is a 2-of-3 system.
In general, if there are N identical modules and M of those are required for the system to function properly,
then the system can tolerate N - M module failures. The expression for the reliability of an M-of-N system can
be written as
$$R_{M\text{-of-}N}(t) = \sum_{i=0}^{N-M} \binom{N}{i} R^{N-i}(t) \left(1.0 - R(t)\right)^{i}$$
where
$$\binom{N}{i} = \frac{N!}{(N-i)!\,i!}$$
The availability, $A(t)$, of a system is defined as the probability that a system will be available to perform its
tasks at the instant of time $t$. Intuitively, we can see that the availability can be approximated as the total time
that a system has been operational divided by the total time elapsed since the system was initially placed into
operation. In other words, the availability is the percentage of time that the system is available to perform its
expected tasks. Suppose that we place a system into operation at time $t = 0$. As time moves along, the system
will perform its functions, perhaps fail, and hopefully be repaired. At some time $t_{\text{current}}$, suppose that the
system has operated correctly for a total of $t_{\text{op}}$ hours and has been in the process of repair or waiting for repair
to begin for a total of $t_{\text{repair}}$ hours. The time $t_{\text{current}}$ is then the sum of $t_{\text{op}}$ and $t_{\text{repair}}$. The availability can be
determined as
$$A(t_{\text{current}}) = \frac{t_{\text{op}}}{t_{\text{op}} + t_{\text{repair}}}$$
where $A(t_{\text{current}})$ is the availability at time $t_{\text{current}}$.
If the average system experiences N failures during its lifetime, the total time that the system will be
operational is N(MTTF) hours. Likewise, the total time that the system is down for repairs is N(MTTR) hours.
In other words, the operational time, $t_{\text{op}}$, is N(MTTF) hours and the downtime, $t_{\text{repair}}$, is N(MTTR) hours. The
average, or steady-state, availability is
$$A_{SS} = \frac{N(\mathrm{MTTF})}{N(\mathrm{MTTF}) + N(\mathrm{MTTR})} = \frac{\mathrm{MTTF}}{\mathrm{MTTF} + \mathrm{MTTR}}$$
We know, however, that the MTTF and the MTTR are related to the failure rate and the repair rate, respectively,
for simplex systems, as
$$\mathrm{MTTF} = \frac{1}{\lambda}, \qquad \mathrm{MTTR} = \frac{1}{\mu}$$
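Both the M-of-N reliability sum and the steady-state availability ratio above are easy to evaluate. The following sketch assumes identical, independent modules with reliability r at the time of interest; the function names and sample numbers are illustrative.

```python
from math import comb

def m_of_n_reliability(m, n, r):
    """Sum over i = 0..n-m tolerated failures of C(n, i) * r**(n-i) * (1-r)**i."""
    return sum(comb(n, i) * r ** (n - i) * (1.0 - r) ** i for i in range(n - m + 1))

def steady_state_availability(mttf, mttr):
    """Fraction of time the system is operational: MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

r = 0.9
print(m_of_n_reliability(2, 3, r))  # TMR as a 2-of-3 system, about 0.972
print(m_of_n_reliability(1, 3, r))  # ideal parallel system, about 0.999
print(m_of_n_reliability(3, 3, r))  # all three required, about 0.729

lam, mu = 1e-3, 0.1                 # illustrative failure and repair rates per hour
print(steady_state_availability(1.0 / lam, 1.0 / mu))  # about 0.9901
```

For M = 1 the sum reduces to the parallel-system reliability, and for M = N it reduces to the series product, which is a useful sanity check on the expression.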
Substituting MTTF = $1/\lambda$ and MTTR = $1/\mu$, the steady-state availability is given by
$$A_{SS} = \frac{1/\lambda}{1/\lambda + 1/\mu} = \frac{1}{1 + \lambda/\mu}$$
Defining Terms
Availability, $A(t)$: The probability that a system is operating correctly and is available to perform its functions
at the instant of time $t$.
Dependability: The quality of service provided by a particular system.
Error: The occurrence of an incorrect value in some unit of information within a system.
Failure: A deviation in the expected performance of a system.
Fault: A physical defect, imperfection, or flaw that occurs in hardware or software.
Fault avoidance: A technique that attempts to prevent the occurrence of faults.
Fault tolerance: The ability to continue the correct performance of functions in the presence of faults.
Maintainability, $M(t)$: The probability that an inoperable system will be restored to an operational state
within the time $t$.
Performability, $P(L, t)$: The probability that a system is performing at or above some level of performance,
$L$, at the instant of time $t$.
Reliability, $R(t)$: The conditional probability that a system has functioned correctly throughout an interval
of time, $[t_0, t]$, given that the system was performing correctly at time $t_0$.
Safety, $S(t)$: The probability that a system will either perform its functions correctly or will discontinue its
functions in a well-defined, safe manner.
Related Topics
98.1 Introduction • 98.4 Relationship between Reliability and Failure Rate
References
A. Avizienis, "Arithmetic error codes: Cost and effectiveness studies for application in digital system design,"
IEEE Transactions on Computers, vol. C-20, no. 11, pp. 1322-1331, November 1971.
W. G. Bouricius, W. C. Carter, and P. R. Schneider, "Reliability modeling techniques for self-repairing computer
systems," in Proceedings of the 24th ACM Annual Conference, pp. 295-309, 1969.
L. Chen and A. Avizienis, "N-version programming: A fault tolerant approach to reliability of software
operation," in Proceedings of the International Symposium on Fault Tolerant Computing, pp. 3-9, 1978.
R. W. Hamming, "Error detecting and error correcting codes," Bell System Technical Journal, vol. 26, no. 2, pp.
147-160, April 1950.
B. W. Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Reading, Mass.: Addison-Wesley, 1989.
J.-C. Laprie, "Dependable computing and fault tolerance: Concepts and terminology," in Proceedings of the 15th
Annual International Symposium on Fault-Tolerant Computing, Ann Arbor, Mich.: pp. 2-11, June 19-21,
1985.
S. Lin and D. J. Costello, Jr., Error Control Coding: Fundamentals and Applications, Englewood Cliffs, N.J.:
Prentice-Hall, 1983.
V. P. Nelson and B. D. Carroll, Tutorial: Fault-Tolerant Computing, Washington, D.C.: IEEE Computer Society
Press, 1986.
M. L. Shooman, Probabilistic Reliability: An Engineering Approach, New York: McGraw-Hill, 1968.
Further Information
The IEEE Transactions on Computers, IEEE Computer magazine, and the Proceedings of the IEEE have published
numerous special issues dealing exclusively with fault tolerance technology. Also, the IEEE International
Symposium on Fault-Tolerant Computing has been held each year since 1971. Finally, the following textbooks are
available, in addition to those referenced above:
P. K. Lala, Fault Tolerant and Fault Testable Hardware, Englewood Cliffs, N.J.: Prentice-Hall, 1985.
D. K. Pradhan, Fault-Tolerant Computing: Theory and Techniques, Englewood Cliffs, N.J.: Prentice-Hall, 1986.
D. P. Siewiorek and R. S. Swarz, The Theory and Practice of Reliable Systems Design, 2nd ed., Bedford, Mass.:
Digital Press, 1992.