
IEEE TRANSACTIONS ON RELIABILITY, VOL. 49, NO. 3, SEPTEMBER 2000


Common-Mode Failures in Redundant VLSI
Systems: A Survey
Subhasish Mitra, Student Member, IEEE, Nirmal R. Saxena, Senior Member, IEEE, and
Edward J. McCluskey, Life Fellow, IEEE
Abstract: This paper presents a survey of CMF (common-mode failures) in redundant systems, with emphasis on VLSI (very large scale integration) systems. The paper discusses CMF in redundant systems, their possible causes, and techniques to analyze the reliability of redundant systems in the presence of CMF. Current practice and recent results on the use of design-diversity techniques against CMF are reviewed. By revisiting the CMF problem in the context of VLSI systems, this paper augments earlier surveys on CMF in nuclear and power-supply systems. The need for quantifiable metrics and effective models for CMF in VLSI systems is re-emphasized. These metrics and models are extremely useful in designing reliable systems: using them, system designers and synthesis tools can incorporate diversity in redundant systems to maximize protection against CMF.

Index Terms: Common-mode failures, concurrent error detection, data integrity, design diversity, redundancy.
ACRONYMS
(The singular and plural of an acronym are spelled the same.)

ALU    arithmetic-logic unit
ASIC   application-specific integrated circuit
CAD    computer-aided design
CASE   computer-aided software engineering
CCF    common-cause failure
CED    concurrent error detection
CMF    common-mode failure
EMI    electromagnetic interference
FPGA   field-programmable gate array
HDL    hardware description language
IC     integrated circuit
s-     implies the statistical definition
TMR    triple modular redundancy
VLSI   very large scale integration
I. INTRODUCTION

Redundancy techniques are widely used for enhancing system reliability, availability, and data integrity. Redundancy can be either temporal or physical. In temporal redundancy, the same task is repeated multiple times and the final result is calculated from the individual results obtained from all the runs. For systems with physical redundancy, a module is replicated and the results from the individual implementations are used to calculate the final result. Duplication in the form of self-checking pairs (duplex systems) and TMR are classical examples of redundancy techniques. There is a large literature on redundancy techniques and on reliability modeling of systems with redundancy [43], [48], [50].

(Manuscript received November 1, 1999; revised April 1, 2000. This work was supported by the U.S. Defense Advanced Research Projects Agency (DARPA) under Contract DABT63-97-C-0024, Dependable Adaptive Computing Systems (ROAR) project. The authors are with the Center for Reliable Computing, Stanford University, Stanford, CA 94305 USA; e-mail: {smitra; saxena; ejm}@crc.stanford.edu. Publisher Item Identifier S 0018-9529(00)11753-1.)
In a redundant system, CMF result from failures that affect
more than one module at the same time, generally due to a
common cause [30]. CMF can appear due to external (such
as EMI, power-supply disturbances, and radiation) or internal
causes. Design mistakes also constitute an important source of
CMF. As stated in [4], although the use of redundant copies
of hardware has proven to be quite effective in the detection of
physical faults and subsequent system recovery, design faults
are reproduced when redundant copies are made; thus, simple
replication fails to enhance the fault tolerance of the system with
respect to design faults. Common-mode (common-cause) fail-
ures have been discussed extensively in publications related to
safety and reliability of nuclear reactors and power supply sys-
tems. It is well-known that CMF make the classical reliability
expressions for redundant systems optimistic. As observed in
[30], the addition of redundant modules is not a solution to CMF.
The importance of these failures can be understood from the ob-
servation [16]: "The system unavailability may be increased by more than a factor of 10 in varying common cause contribution from zero to 1%" [sic].
However, most of the publications related to this subject con-
sider nuclear reactors and power supply systems. Very few of
them consider CMF in dependable computing systems designed
using redundancy techniques. This paper surveys the work on
CMF in redundant systems in general with special emphasis on
computing systems.
A natural component of the study of CMF is the study of di-
versity. As early as 1970, diversity was identified as an effective
antidote for CMF [25]. However, the major thrust was on having
diversity in methodologies at various steps of the design of a
nuclear reactor. Design-diversity was proposed in [4] to protect
redundant computing systems against CMF. This paper reviews
both prior art and recent results on design diversity.
The main contributions of this paper are:
• surveying research on CMF;
• bringing into perspective the issues related to these failures in digital IC systems;
• presenting results from recent publications that help understand design diversity in IC systems;
• addressing the issue of safety in redundant systems in the presence of CMF.
0018-9529/00$10.00 © 2000 IEEE
Fig. 1. Example duplex system.
Section II introduces the basic notion of CMF including the
causes and a classification of these kinds of failures.
Section III examines techniques to perform reliability
analysis.
Section IV discusses methodologies to handle these failures
at various stages of the design process.
Section V discusses metrics to quantify design diversity and
techniques to design diverse redundant systems.
II. COMMON-MODE FAILURES
A common-mode failure (CMF) is "the result of an event(s) which, because of dependencies, causes a coincidence of failure states of components in two or more separate channels of a redundancy system, leading to the defined system failing to perform its intended function" [53]. These types of failures have
also been referred to as CCF [38]. The advantage of using the
definition of [53] is: it defines CMF as those generated by a
single source rather than those having exactly identical effects
in the design. Thus, this definition holds for redundant systems
where the copies are s-identical or different. For systems with
s-identical copies, the CMF effect can be the same for both
the copies; for nonidentical copies, the effects can be different.
This paper uses this definition of CMF and calls these failures
common-mode or common-cause failures interchangeably. The
CMF problem is explained using the duplex system in Fig. 1.
In the duplex system of Fig. 1, there are two copies of
the same unit (Copy 1 and Copy 2). The outputs of the two
units are compared. Any mismatch at the outputs of the two
copies prompts a corrective action (maintenance, replacement
with standby spares, etc.). If the failures in the units are
s-independent, then the addition of simple redundancy like
duplication reduces the system failure rate. However, for CMF,
mere addition of redundancy might not help reduce the system
failure rate. The following example illustrates this.
Example: Let the probability that any one of the 2 copies fails be p. If the failures are s-independent, the probability that the system fails (classical analysis assumes that the system fails when both copies fail) is p^2. However, for CMF, due to a single cause, both copies can fail; if the probability of that cause of failure is p_cm, then the probability that both copies fail is p_cm rather than p^2. Thus, simple addition of redundancy through replication does not help protect the system
against these CMF. This example implicitly assumes that since
the redundant copies are s-identical, the CMF effects are also
s-identical. This motivates using diverse copies of the different
modules in a redundant system. With diverse copies, it is pos-
sible that the error effects of a particular CMF are different for
each copy. Thus, there is a possibility of detecting the CMF and
taking corrective action.
Use of design diversity to protect redundant systems against
CMF is discussed in more detail in Section V.
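For illustration, the arithmetic of this example can be sketched as follows; the probability values, and the assumption that the common cause is s-independent of the individual failures, are illustrative and not from the original paper.

```python
# Minimal sketch: effect of CMF on a duplex system (illustrative values only).

def duplex_failure_prob(p: float, p_cm: float = 0.0) -> float:
    """Probability that both copies fail (classical duplex failure criterion).

    Independent contribution: p**2.  Common-mode contribution: p_cm (a single
    cause fails both copies), assumed independent of the individual failures.
    """
    return p_cm + (1.0 - p_cm) * p * p

if __name__ == "__main__":
    p = 1e-3      # per-copy failure probability (illustrative)
    p_cm = 1e-5   # common-cause failure probability (illustrative)
    print("s-independent failures only:", duplex_failure_prob(p))        # ~1e-6
    print("with common-mode failures  :", duplex_failure_prob(p, p_cm))  # ~1.1e-5
```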
A. Causes of CMF
Extensive studies related to causes of CMF in nuclear
reactors, power-supply systems, avionics, etc. have been con-
ducted. In contrast, there are very few publications on CMF in
VLSI systems. This paper focuses on causes of CMF in VLSI
systems. Design faults constitute a major portion of CMF in
replicated systems [4]; they can occur in the hardware or in
the software. Design faults can be human-made (or due to the
presence of bugs in the tools used, incorrect or incomplete or
imprecise specifications, incorrect understanding, etc.) and are
mainly introduced during the phase of creation of redundant
systems [30]. Design faults can be permanent (hardware or
software bugs) or intermittent (e.g., weak signals). CMF can
also occur due to external disturbances when the system is
operating. These kinds of disturbances include fluctuations in
the power supply and radiation; the fault effects can be transient
or permanent. In [9], power-supply disturbances were analyzed
and it was shown that dips in the power-supply voltage cause
delay faults in the circuits; these effects are transient. On the
other hand, some literature claims that a single radiation source
can cause multiple-event upsets in logic circuits and memories
[46]. If the memory locations are written frequently, then these
upsets have a temporary effect on the system. However, in
SRAM-based programmable systems (e.g., field programmable
gate arrays, FPGA), for example, upsets from radiation can
have a permanent effect (unless the FPGA configuration is
loaded again). In addition to these, CMF in information systems
have been studied [10]; these CMF events are viruses (affecting
applications, compilers, operating systems), power supply
disturbances, manufacturer defects affecting compilers, OS,
monitors, disk drives, RAM, power supply, etc.
B. Classification of CMF
The literature on CMF has proposed various classifications
of CMF [25], [30], [53]. Reference [25] classifies CMF into
4 groups:
• functional deficiency,
• maintenance error,
• design deficiency,
• external event.
Functional deficiency is not a hardware failure, but is a
misapplication of hardware or an inability to predict the true
nature of the system under consideration. A maintenance error
is defined as consistent mis-calibration or mis-service of all instruments monitoring a given system. Design deficiency is an unrecognized dependence on a single, common element or a common deficiency in all elements of a particular type. External events are failures resulting from disturbances in the external environment.

Fig. 2. IC development flow and CMF.
The classification in [53] is somewhat similar to that in [25]
and is mainly based on possible causes of CMF. The causes are
split into two categories: engineering and operations. Design
errors, incomplete design specifications, inadequate instrumen-
tation, lack of standards, inadequate testing, etc., are in the engi-
neering category. CMF arising from imperfect repair, operator
error, environmental condition, etc., are in the operations class.
This kind of a classification is useful for identifying the causes
of CMF. Although these classifications are related to nuclear re-
actors and power supply systems, they extend fairly well to IC
systems, as described next.
Fig. 2 shows the flow of the IC development process:
1) There is a specification of the anticipated functionality of
the IC that should be designed.
2) Given the specification, the designer designs the IC.
3) The developed design is fabricated on silicon.
4) The fabricated parts are tested and shipped to the
customer.
CMF can be generated at different points of IC development;
potential CMF causes at different levels must be handled in dif-
ferent ways. This section briefly addresses these causes (see
Fig. 2).
A potential source of CMF is that specifications are often am-
biguous or incomplete. Even with correct specifications, CMF
can be generated during the design phase. This includes bugs
in the CAD tools that are used for designing the IC chips and
incorrect interpretation and human errors (unintentional or in-
tentional due to sabotage) incurred during the design process.
Incomplete design verification is a potential source of CMF at
this stage.
In the fabrication stage, CMF are introduced due to inaccu-
racies in the manufacturing process, leading to manufacturing
defects. Some of the defective chips can be screened by
thoroughly testing the manufactured parts. However, due to
inadequate testing and low fault-coverage, some defective and
weak chips (that can cause early-life failures) can escape.
In the field, radiation, EMI and power supply disturbances
can cause CMF.
This kind of CMF classification identifies the possible
sources and the steps needed to eliminate CMF. In an IC devel-
opment project, depending on experience and past data, each
stage can be improved so that the introduction or occurrence
of CMF is minimized. However, there is a cost associated with
each of the CMF elimination steps. The challenge is to apply
CMF elimination techniques cost-effectively.
CMF can also be classified based on other properties. For ex-
ample, it might be of concern whether a particular CMF affects
the system only temporarily or permanently [30]. Reference
[30] classifies CMF according to their nature, origin, and persis-
tence. The nature of the CMF can be accidental or intentional.
The origin of CMF can be due to some adverse physical failures
or human made (e.g., operator errors) or because of imperfect
specifications (e.g., design errors). The CMF can persist tem-
porarily or have a permanent effect on the system. Thus [30],
any CMF belongs to 1 of the following categories:
• transient external,
• permanent external,
• intermittent design-fault,
• permanent design-fault,
• arising from interaction of the system with the external environment (e.g., operator errors).
CMF can be classified according to their effects on the system:
• catastrophic (e.g., life-critical support, fly-by-wire),
• noncatastrophic (e.g., on-line transaction processing, where the data can be resent),
• negligible.
CMF can also be classified as:
1) CMF that do not affect the system output; the system
output is always correct.
2) CMF that can be detected in a redundant system through
disagreement of multiple modules; however, it is not
guaranteed that the system will always produce correct
outputs.
3) CMF that have s-identical effects on different modules of
a redundant system; thus these CMF are not detectable
through comparison of the outputs of the different modules.
4) CMF in the presence of which the duplex system ei-
ther produces correct output or detects disagreement
prompting a repair or corrective action.
For safety critical systems, cases 2 and 3 are not desirable. For
other systems, where tests can be applied periodically, CMF in
case 2 might be acceptable. However, CMF of case 3 are not
desirable in any system.
III. RELIABILITY ANALYSIS WITH CMF
This section surveys techniques for estimating the reliability
of redundant systems in the presence of CMF. Research in this
area has focused on developing models for reliability analysis.
Techniques for analysis of fault trees with CMF are in [15],
[37], [44]. A summary of techniques is in [54]. Ref [38] advo-
cates explicit inclusion of different events for each component in
a common-cause group that fails all the members of the group.
This is illustrated using the fault-tree for a simple TMR system.
Fig. 3. Fault-tree analysis.

Consider a TMR system with three modules A, B, and C. To consider common-cause failures involving component A, [38] modifies the fault-tree of component A, as shown in Fig. 3, so that

A_T = A_I + C_AB + C_AC + C_ABC,

where
A_T     total failure of component A,
A_I     failure of component A from s-independent causes,
C_AB    failure of components A and B (but not C) from common causes,
C_AC    failure of components A and C (but not B) from common causes,
C_ABC   failure of components A, B, and C from common causes.
The basic events that cause system failure (at least 2 of the 3 modules failing) are then combinations such as A_I·B_I, A_I·C_I, B_I·C_I, C_AB, C_AC, C_BC, and C_ABC. Different probability values have to be assigned to these individual events to estimate the probability of failure of the TMR system. This is the Basic Parameter Model. Other models (the Beta-factor model, the Multiple Greek Letter model, etc.) are explained in [38]. All these models try to address the issue of incompleteness of the data used to estimate the individual failure probabilities.
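For illustration, the basic-parameter calculation for a TMR system can be sketched as follows; the event probabilities and the rare-event approximation are illustrative assumptions, not data from [38].

```python
# Minimal sketch of the basic parameter model for a TMR system.
# Event probabilities are illustrative assumptions, not data from [38].

from itertools import combinations

def tmr_failure_prob(q_ind: float, q2: float, q3: float) -> float:
    """TMR fails when at least 2 of the 3 modules fail.

    q_ind : probability of an s-independent failure of one module (A_I, B_I, C_I)
    q2    : probability of a common-cause event failing a specific pair (C_AB, ...)
    q3    : probability of a common-cause event failing all three (C_ABC)
    Rare-event approximation: the basic-event probabilities are summed.
    """
    pairs = list(combinations("ABC", 2))             # AB, AC, BC
    p_two_independent = len(pairs) * q_ind ** 2      # A_I*B_I + A_I*C_I + B_I*C_I
    p_pair_common = len(pairs) * q2                  # C_AB + C_AC + C_BC
    return p_two_independent + p_pair_common + q3    # + C_ABC

if __name__ == "__main__":
    print(tmr_failure_prob(q_ind=1e-3, q2=1e-5, q3=1e-6))  # ~3.4e-5, dominated by CCF
```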
Fig. 4. Reliability modeling of a parallel system with CMF.

Beta Factor Model: A preliminary reliability analysis of redundant systems, considering CMF, can be obtained from [11]. Reference [11] assumes that the failure rate of a simplex system is λ = λ_i + λ_c, where
λ_i    failure rate of s-independent failures,
λ_c    failure rate of CMF.
Reference [11] assumes λ_c = β·λ; β is determined from experimental data or experience. With this background, a parallel system of n redundant modules is modeled. In a parallel system, the system performs the intended operation as long as at least one of the redundant modules is fault-free. In Fig. 4, R_i is the reliability of each individual module with respect to s-independent failures, and R_c is the reliability with respect to CMF.
This analysis assumes that, in the presence of a CMF, the system cannot perform its intended function. Thus, the system works correctly (performs its intended function) when:
a) there is no CMF affecting the system, and
b) at least 1 of the redundant modules works correctly in the presence of s-independent failures.
Thus, the system reliability is:

R_system = R_c [1 - (1 - R_i)^n].

When β = 0, then R_c = 1 and the system acts as an ordinary parallel system without any CMF. When β = 1, the system acts as a simplex system. For this model, the real problem is in estimating β, which can be extremely difficult. While this model might seem to be reasonable for replicated systems, it is not clear how effective it is for redundant systems with nonidentical copies. This modeling technique has been extended to calculate the reliability of other systems in the presence of CMF.
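For illustration, the beta-factor expression above can be evaluated as in the following sketch; the exponential failure laws and the numerical values are illustrative assumptions, not results from [11].

```python
# Minimal sketch of the beta-factor reliability model for an n-module
# parallel system.  Exponential failure laws and the numbers are assumptions
# made here for illustration; they are not taken from the paper.

import math

def parallel_reliability(n: int, lam: float, beta: float, t: float) -> float:
    """R_system(t) = R_c(t) * [1 - (1 - R_i(t))**n].

    lam  : total failure rate of one module (lambda = lambda_i + lambda_c)
    beta : fraction of the failure rate attributed to CMF (lambda_c = beta*lam)
    """
    r_i = math.exp(-(1.0 - beta) * lam * t)   # reliability w.r.t. s-independent failures
    r_c = math.exp(-beta * lam * t)           # reliability w.r.t. common-mode failures
    return r_c * (1.0 - (1.0 - r_i) ** n)

if __name__ == "__main__":
    for beta in (0.0, 0.01, 0.1):
        r = parallel_reliability(n=2, lam=1e-5, beta=beta, t=1e4)
        print(f"beta={beta:4.2f}  R(10^4 h) = {r:.6f}")   # reliability drops as beta grows
```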
The beta-factor model assumes that the system becomes
nonfunctional (or does not function correctly) once failures
occur. This is not necessarily true for digital systems. For these
systems, the CMF can have several error effects in the copies
(especially if the implementations are different) and the errors
at the outputs might not always occur simultaneously at the
same cycle. For example, in a duplex system, the failure effect
can be such that the 2 modules do not produce the same set of
erroneous outputs simultaneously; it is then guaranteed that
the system either produces correct outputs or indicates error.
However, if the 2 modules are s-identical, then chances are
high that the CMF has the same effect in both modules. As an
example, consider a CMF design fault: if both copies are exact
replicas of each other, they are affected in the same way. In this
case, both copies produce the same set of erroneous outputs
simultaneously; thus, there is no way the presence of error can
be detected. On the other hand, there can be CMF that cause
the system to behave in such a way that there exists at least
one set of inputs for which the failure is detected (from the
error signal produced by the duplex system). The reliability
modeling technique in [35] addresses these problems. This
allows treating redundant systems (with s-identical or noniden-
tical copies) in a uniform way and derives simple relationships
among mission times, failure rates, and the characteristics
of the response of individual modules to failures. Section V
describes the modeling technique.
Other papers on reliability analysis in the presence of CMF
include [12][14], [20], [22], [28], [51], [52]. Reference [40]
states that most of the models have very little relationship with
the possible CMF causes. The binomial failure-rate model pro-
poses a mechanism in which agents, called shocks, impact all
components in the group. In [51], [52], the CMF have been mod-
eled as external shocks having constant occurrence-rates; in par-
ticular, a multivariate exponential model has been assumed to
perform the reliability analysis in the presence of CMF.
Reference [20] gives a trinomial failure-rate model for relia-
bility analysis in the presence of CMF. In this model, any system
component is in 1 of 3 states:
1) success: the component is working correctly,
2) failed: the component has failed,
3) gray: it is too ambiguous to be declared as a failed or a
success state, such as a partial failure, potential failure or
incipient failure.
Other models for correlated failures and CMF are discussed in
[28].
Although many models have been proposed for reliability
analysis of CMF, for IC and computing systems, there is not
enough real experimental data to demonstrate their effective-
ness. There has been some initiative in this direction for an-
alyzing CMF in nuclear reactors. Thus, real experiments and
CMF models are necessary for progress in research on protec-
tion of redundant IC systems against CMF.
IV. TECHNIQUES TO HANDLE CMF
As mentioned in Section II, CMF are considered a potential source of problems in redundant systems: nuclear reactors, power-supply systems, avionics, and redundant VLSI and computer systems (hardware and software). There are many ap-
proaches for handling CMF in redundant systems [30]. These
approaches include CMF avoidance, CMF removal, and CMF
tolerance.
CMF avoidance techniques are applied during the 3 phases:
1) specification,
2) design,
3) implementation.
The CMF-removal techniques are applied mainly during the test
and validation phases, while the CMF-tolerance techniques are
primarily intended to handle CMF while the system is in oper-
ation. This section explains each of these techniques in detail.
It is not yet known whether these techniques cover all possible
CMF sources exhaustively. The relative coverage of these indi-
vidual techniques is a subject of further research, and thorough
experimental data are needed to estimate the coverage numbers.
Reference [30] argues that the occurrence-probability of CMF
that are not covered in any of these three stages might be of the
same order as the probability of multiple random faults in the
multiple modules. Reference [30] uses the following argument
to quantify this claim.
Consider a pessimistic CMF arrival rate of λ per hour. If a 99% CMF coverage is obtained at each of the 3 phases (specification, design, implementation), then the probability that a CMF is not detected by any of these techniques is (1 - 0.99)^3 = 10^-6. Thus, the probability that a system provides faulty outputs in the presence of a CMF is of the order of 10^-6·λ. This number is obtained by multiplying the probability of CMF occurrence (λ) and the probability that the CMF is not detected in any of the 3 phases. This shows orders of magnitude improvement in protection against CMF.
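For illustration, this multiplication can be sketched as follows; the arrival rate used is an illustrative assumption, since the surviving text does not give the original number.

```python
# Minimal sketch of the coverage argument of [30].  The CMF arrival rate is an
# illustrative assumption; the per-phase coverage follows the text.

def residual_cmf_rate(arrival_rate_per_hour: float, coverage: float, phases: int = 3) -> float:
    """Rate of CMF that escape detection in all development phases."""
    escape_prob = (1.0 - coverage) ** phases      # e.g. (1 - 0.99)**3 = 1e-6
    return arrival_rate_per_hour * escape_prob

if __name__ == "__main__":
    print(residual_cmf_rate(1e-4, 0.99))          # 1e-10 per hour for the assumed rate
```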
A. CMF Avoidance
Steps to avoid CMF must be adopted from the very begin-
ning of the design and development processes, because, during
all project stages, CMF (or situations that can lead to CMF in
the future) can be introduced. Later in the project, it might be
difficult to detect CMF (or situations that can lead to CMF) in-
troduced at the earlier stages. The main aim of CMF avoidance
techniques is to reduce the number of permanent and intermit-
tent design CMF introduced in computer systems. Seven CMF
avoidance techniques are listed here; many of them also appear
in [30].
1) Mature and Verified Components: The importance of
reuse for building redundant computing systems is stressed in
[30]. By using components which have been verified formally
(microprocessors, operating system kernels, etc.) and stable
products that have been extensively tested (and verified), the
probability of design-flaws can be reduced. There is a very high
chance that design flaws are introduced if one begins designing
everything from scratch.
2) Conformance to Standards: While the standards are
mainly meant to ease interoperability of various techniques,
logistics, maintainability, etc., they can also reduce design
errors. This is because design errors are often introduced due to
incomplete, ambiguous, and/or incorrect understanding of how
different systems operate and interact with other systems. Con-
formance to standards reduces the probability of design errors
arising from ambiguous interpretation of system operation.
3) Use of Formal Methods: Formal methods can be used for
specifying, developing, and verifying computer systems with
strong emphasis on consistency, completeness, and correctness
of the properties. Formal specification and verification tech-
niques have been used for such systems. However, the major
concern regarding these formal methods is: these techniques do
not scale proportionately with increasing complexities in the de-
sign (e.g., size) and often explode (in time and memory space)
for large designs.
4) Design Automation to Eliminate Human
Errors: Reference [30] advocates the use of design-automation
tools to automate parts of the hardware and software design
cycles. With automated tools, the probability of human errors
can be reduced. CASE tools and hardware design automation
tools can be used for this purpose. However, complete design
automation can lead to design bugs that might be overlooked. In
this context, [30] discusses the design methodology, integrated
with formal methods, adopted in the Draper Laboratory.
5) Performance CMF: Performance CMF arise mostly in
real-time systems. In the presence of these failures, the system
fails to deliver the required services on time under various work-
load conditions. To avoid these kinds of performance CMF, it is
important to develop an accurate, complete model of the system;
with such a model, analysis (through simulation of benchmarks
and numerical calculations) should be performed a priori, to
find out whether timing faults can appear in a system under var-
ious conditions.
6) Design Rules and Design Techniques: Design rules
are important in the design of VLSI systems, and are mainly
guided by the capabilities (precision) of the underlying fabri-
cation process, signal integrity, electromigration problems, etc.
Design rules for reducing chances of CMF (e.g., increasing
the spacing between two signal lines) can possibly be devised.
However, the resulting design rules can be too conservative
and might not be suitable for achieving high signal-speeds in
high-performance VLSI systems.
Design techniques, like shielding and radiation hardening,
can be used to avoid failures caused by the external environ-
ment when a system is used in the field. The main drawbacks
are that these techniques add appreciable extra development and
manufacturing cost and increase the development time.
7) Design Diversity: Design diversity is an avoidance
technique as well as a tolerance technique for CMF. It is an
avoidance technique for design faults, and a tolerance technique
for other kinds of faults. The concept behind design diversity is
to implement various copies in a redundant system in different
ways, starting from a common set of specifications. It applies to
all levels: hardware, software, programming language, design
development environment, etc. This approach can eliminate
many common-mode design faults since each redundant copy
uses a different design. However, incorrect interpretation of
ambiguous specifications can still lead to faults in multiple
copies; thus, design diversity cannot provide 100% coverage
of all design faults [30]. From the viewpoint of design faults,
mature verification techniques can be more useful in avoiding
CMF arising from design faults. However, diversity might
help in tolerating some other kinds of CMF in the field. With
appropriate diversity, modeled failures in the field can have
different effects on the different copies.
Design diversity has some costs associated with it. By defi-
nition, it is required to design a given module at least twice to
achieve diversity; thus, the extra development time is an extra
cost. In addition, the two designs must be manufactured and
this increases the manufacturing cost. For example, one might
have to manufacture two different ASIC for diversity. However
[35], with reconfigurable computing systems, the costs associ-
ated with diversity can be reduced. For implementing diversity
on reconfigurable hardware (like Field Programmable Gate Ar-
rays), one synthesizes and downloads different configurations.
Thus, there is no need to manufacture two different ASIC. The
design cost can be reduced with use of CAD tools for synthesis,
placement and routing of designs on FPGA. Thus, the paradigm
of reconfigurable computing can be regarded as an enabling
technology for design diversity. Design diversity is explained
in more detail in Section V.
B. CMF Removal
The CMF avoidance techniques, in Section IV-A, are not
fool-proof. Thus, the faults that slip past the design process
must be detected and removed in the later stages of system
development. CMF removal techniques include design reviews,
extensive simulation/verification, testing, and fault injection.
Design reviews, simulation, and verification are mainly meant
for removing design faults which constitute a major fraction of
CMF. Testing, on the other hand, detects mainly manufacturing
defects and weak chips. Testing and fault injection are discussed in the following paragraphs.
1) Testing: Testing is performed mainly to screen out
chips with manufacturing defects. There is a large literature on
testing techniques [1], [39]. To ensure high quality, chips are
required to work properly not only at the time of production,
but also throughout the anticipated lifetime. Hence, screening
techniques are also aimed at identifying weak parts. These are
the chips that work correctly just after being manufactured, but
have some latent defects. As a result, these parts can fail in
the field as early-life failures. There are many ways to identify
these weak parts. For VLSI systems, burn-in is common
practice to ensure high reliability of the chips that are shipped
to the customers. By exercising the chips at high temperature
and/or high supply voltage, burn-in screens out chips with
defects that can cause early life failures and reliability problems
[23]. However, burn-in is very expensive and hence finding
alternatives to burn-in is a very important research problem.
Some of these techniques are Iddq testing [18], VLV (very low
voltage) [21], and SHOVE (Short voltage elevation) [6] testing
that can detect certain classes of defects that cause reliability
failures.
2) Fault Injection: Fault injection inserts faults in an other-
wise fault-free system (designed to tolerate faults) to evaluate
the system's ability to tolerate these kinds of faults in the real
environment. Faults can be injected either in a system proto-
type or in the software simulation model of the system. Fault
injection enables studying the behavior of a redundant system in
the actual environment. For fault-injection to be a success, thor-
ough studies of all failure mechanisms and modes that occur in
real-life (through experiments or actual field data) are needed.
As mentioned in [30], fault injection techniques can also be
used to operate the system in various degraded modes which
the system can encounter in the real life.
There are several ways to perform fault injection [8], [24].
Fault injection can be performed in software or in hardware.
For software fault-injection, errors are injected into the HDL
model of the system, and simulation can be performed to see
the response of the system to the errors introduced. The errors
introduced can be in the function of the module or in the netlist
of a particular module (if such a netlist is available). Thus, it is
possible to simulate the system-level manifestation of gate-level
faults. DEPEND [17] is an example of such an integrated design
and fault injection environment.
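The following is a minimal sketch of software fault injection into a tiny gate-level model; the circuit, the stuck-at fault, and all names are illustrative assumptions, and the code is not DEPEND or any specific tool.

```python
# Minimal sketch of software fault injection into a gate-level model.
# The netlist, fault model, and names are illustrative assumptions.

from itertools import product

def good_circuit(a: int, b: int, c: int) -> int:
    """Fault-free reference: z = (a AND b) OR c."""
    return (a & b) | c

def faulty_circuit(a: int, b: int, c: int, stuck_node: str, stuck_val: int) -> int:
    """Same circuit with one internal node forced to a stuck-at value."""
    n1 = a & b
    if stuck_node == "n1":
        n1 = stuck_val                  # injected fault: node n1 stuck at stuck_val
    return n1 | c

def inject_and_compare(stuck_node: str, stuck_val: int) -> list:
    """Return the input patterns on which the injected fault produces a wrong output."""
    failures = []
    for a, b, c in product((0, 1), repeat=3):
        if good_circuit(a, b, c) != faulty_circuit(a, b, c, stuck_node, stuck_val):
            failures.append((a, b, c))
    return failures

if __name__ == "__main__":
    print("patterns exposing n1 stuck-at-1:", inject_and_compare("n1", 1))
```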
For hardware fault-injection, the system (or a prototype) is
first built and errors are introduced in the hardware. These can
include disturbance of signals on the pins of the circuit, putting
the chips under heavy-ion radiation [19], [27], [32], or power
supply disturbances [9].
For validation and verification, emulators are often used
[45]. An emulator contains programmable elements. One
way of implementing these programmable elements is to use
FPGA. An HDL model of a given system is mapped to the
FPGA in the emulator, and the emulator can be connected to
the environment in which the system will run in real life. The
advantage of emulators is the emulation speed (several orders
of magnitude improvement over simulation time). Emulators
can also be used to inject faults in the programmable elements
(change the configuration of the FPGA) and evaluate the effect
of the fault on the system.
C. CMF Tolerance
CMF can manifest themselves as transient or permanent faults
due to external causes like environmental disturbances, power
supply disturbances, and radiation. These failures occur in the
field and the only way to handle them is to detect them in the
field and take corrective actions once the failures are detected.
This is why CMF tolerance and recovery are very important.
For CMF detection, one can use watchdog timers, exception
handlers, run-time checks, and presence tests [30]. Concurrent
error detection can also be used to make systems secure against
CMF. For example, consider the duplex system in Fig. 5; a CED
circuit (CED1 or CED2) is associated with each module; if a
CMF affects the two modules (Copy 1 and Copy 2), then the
individual concurrent error detectors might be able to detect it
and report an error. If a common-mode failure affects a partic-
ular module and its concurrent error detection circuit (Copy 1
and CED1, for example), then the comparator circuit can detect
it.

Fig. 5. A duplex system with CED.
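A minimal sketch of this arrangement follows; the function, the parity-based CED check, and all names are illustrative assumptions rather than a scheme from the literature.

```python
# Minimal sketch of the duplex-with-CED arrangement of Fig. 5.
# The function, the parity-based CED, and all names are illustrative assumptions.

def module(x: int) -> int:
    """Functional copy: an arbitrary 4-bit operation."""
    return (x * 3 + 1) & 0xF

def parity(v: int) -> int:
    return bin(v).count("1") & 1

def ced_check(x: int, out: int) -> bool:
    """Per-copy concurrent error detector: compare output parity against a
    predicted parity (here the prediction simply recomputes module(x) as a
    stand-in for a real parity predictor)."""
    return parity(out) == parity(module(x))

def duplex_step(x: int, out1: int, out2: int) -> str:
    """Combine the per-copy CED flags with the cross-comparator."""
    if not ced_check(x, out1) or not ced_check(x, out2):
        return "error detected by CED"
    if out1 != out2:
        return "error detected by comparator"
    return "outputs agree"

if __name__ == "__main__":
    x = 0b0101
    good = module(x)
    print(duplex_step(x, good, good))           # outputs agree
    print(duplex_step(x, good ^ 0b0001, good))  # corrupted copy is flagged
```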
Design diversity can be used to detect CMF in redundant
systems. For example, consider a duplex system with 2
s-identical copies of the same hardware. A common-mode
failure that affects the s-identical leads of both the copies
will never be detected. However, with diversity it is possible
to detect many CMF in multiple copies. Use of a hardware
implementation and its dual is recommended in [49]. Reference
[35] shows, by theory and simulation, that for CMF there is
a distinct advantage in using diversity for detection purposes.
Reference [36] presents techniques to synthesize redundant
systems that detect modeled CMF. These results are discussed
in Section V-B.
Recovery from CMF is tied very closely to CMF fault
tolerance. Once a CMF is detected, it is necessary to restore
the system-state to a previously known correct point from
which further computation can resume. This translates to
deciding checkpoint intervals when the system state is saved
in nonvolatile memory, e.g., hard disks, for which the RAID
(Redundant Array of Inexpensive Disks) architecture [42] can
be used. When the error is detected, the system can be rolled
back to the checkpointed state. Checkpointing and recovery
are inter-related. Checkpointing and recovery schemes are
described in [43].
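The following minimal sketch illustrates checkpointing and rollback in a duplex computation; the computation, the checkpoint interval, and the injected transient are illustrative assumptions, not a specific scheme from [43].

```python
# Minimal sketch of checkpointing and rollback on error detection in a duplex
# system.  The computation, interval, and fault are illustrative assumptions.

def step(state: int, x: int) -> int:
    """One unit of computation (illustrative)."""
    return (state + x) & 0xFFFF

def run(inputs, inject_at=None, checkpoint_every=4):
    s1 = s2 = 0
    ckpt = (0, 0, 0)
    i = 0
    while i < len(inputs):
        if i % checkpoint_every == 0:
            ckpt = (s1, s2, i)          # save known-good state to stable storage
        o1 = step(s1, inputs[i])
        o2 = step(s2, inputs[i])
        if i == inject_at:
            o2 ^= 1                     # transient fault in copy 2 (injected once)
            inject_at = None
        if o1 != o2:                    # comparator detects disagreement
            s1, s2, i = ckpt            # roll back to the checkpointed state and retry
            continue
        s1, s2 = o1, o2
        i += 1
    return s1

if __name__ == "__main__":
    data = list(range(10))
    assert run(data) == run(data, inject_at=6)  # recovery yields the fault-free result
    print("recovered correctly after rollback")
```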
V. DESIGN DIVERSITY
In design diversity, the hardware and software elements that
are used for multiple computations (in a redundant system) are
not just replicated, but are s-independently generated to meet
system requirements [4]. The basic idea is that, with redundant
systems, it is possible to tolerate s-independent physical faults.
However, to tolerate design faults, direct replication of the
copies in a redundant system is of no help; the same design fault
is reproduced in all 3 copies; hence, the system fails in response
to an input that invokes this design fault. However, if the designs
are generated s-independently (e.g., by different designers and
design tools), chances are low that the exact same design fault
appears in all 3 copies. Reference [4] gives 3 conditions for the
s-independence of design faults.
1) Different algorithms, programming languages, transla-
tors, design automation tools, machine languages, etc.,
should be used.
2) s-Independent programmers or designers, with diversity
in their training and experience, should be used.
3) The most critical condition is the existence of a complete,
correct initial statement of requirements that should be
satisfied by all the diverse designs. The use of formal
methods of requirements specification in order to achieve
the third goal is necessary [4].
Design diversity has been used in the context of N-version programming to handle design faults efficiently. Experimental results supporting the use of diversity in N-version programming are in [4]. Reference [29] observes, through an experiment, that sometimes independently generated programs may be individually extremely reliable, but in a large number of cases more than one of the programs can fail. Thus, the claim [4] is that N-version programming must be used with care, since there might not be any reliability improvement from diversity.
There are other diversity approaches for handling CMF in
software. A comprehensive report of these techniques is avail-
able in [26]. A method to avoid s-identical errors caused by de-
sign faults in multiple computing systems by diversifying the
input data space is in [3]; the claim is that data diversity requires
an algorithm for the re-expression of input data, but does not re-
quire design diversity. Reference [7] observes that data diversity
can be useful for certain applications.
Functional diversity is another technique for handling CMF
in redundant systems. Functional diversity exploits the fact that
some problems have multiple ways (e.g., multiple algorithms)
of achieving the same result. Details of functional diversity are
in [2]. The main idea is that, with functional diversity, the
system may fail to achieve a goal in its standard way, but may
be able to reach that goal in some different way. Functional-
richness of the system is one of the necessary conditions to
achieve functional diversity. Functional richness is a property of
a system such that it is possible to achieve the same end-result
in several different ways using that system.
Diversity is not confined to redundant software systems. As
early as 1970, in a study on CMF in nuclear systems, diversity
was observed to be a common antidote for CMF [25]. Five kinds
of diversity are classified in this work: functional diversity, oper-
ational administrative diversity, design administrative diversity,
equipment diversity, and physical diversity.
• Functional diversity provides protection against design deficiency, maintenance errors, and external sources.
• Operational administrative diversity requires different persons to do certain tasks or a second person to check on the first.
• Equipment diversity provides different equipment (possibly of different precision) to measure the same parameter.
• Physical diversity relates to physical separation of instrumentation components measuring the various key parameters.
Reference [34] uses diversity in the context of time redundancy.
Instead of structural redundancy, sequential execution of
various implementations of a software task on a single computer is proposed to detect software and hardware faults in a safe system. This technique has been termed systematic diversity.
Techniques for designing redundant systems protected against
CMF are in [36]. The techniques of RESO [41] and RERO [31]
are also in this category.
Hardware design diversity has been used to design redundant
hardware systems. Examples of systems using hardware design
diversity include: the Primary Flight Computer (PFC) system
of Boeing 777 [47], the space shuttle, Airbus 320 [5], and many
other commercial systems. For the Boeing 777, three processors
with different architectures (from AMD, Intel, Motorola) are
used in the PFC system.
A. Quantifying Design Diversity
Diversity can bring benefits to a redundant system; however,
these benefits are extremely difficult to quantify. Moreover [29],
not all kinds of diversity are useful; there are several instances in
which many of their 27 versions of the same software program
shared common faults. Thus [33], there is a need to answer ques-
tions such as:
What is diversity?
Are these designs more diverse than those?
How diverse are these two designs?
In the literature, these questions are not answered clearly; the
need for answering these questions is also expressed in [49].
Reference [33] tries to answer some of these questions. It used probability analysis to reach the conclusion: "Suppose we know that components A and B are different, but we are indifferent between 1-out-of-2 systems consisting of components A only and components B only; then we should always build a 1-out-of-2 system made of component A and component B" [sic]. Reference [33] claims that this observation can be generalized for a 1-out-of-n system. However, [33] assumes that, given a particular environment in which the components operate, the probabilities of failure of the components are independent, which might not be valid in general.
Reference [35] presents a metric for quantifying diversity among different designs. Assume that we are given two implementations (logic networks) of a logic function, an input probability distribution, and faults f_i and f_j that occur in the first and second implementations, respectively. The diversity d_{i,j} with respect to the fault pair (f_i, f_j) is the conditional probability that the two implementations do not produce identical errors, given that faults f_i and f_j have occurred.
For a given fault model, the design diversity metric D between two designs is the mean value of the diversity with respect to the different fault pairs:

D = Σ_{(f_i, f_j)} P(f_i, f_j) d_{i,j},

where P(f_i, f_j) is the probability of the fault pair (f_i, f_j).
1) Example 1: Consider any combinational logic function with n inputs and 1 output. The fault model assumes that a combinational circuit remains combinational in the presence of the fault. Consider 2 implementations, N_1 and N_2, of the given combinational logic function.
The joint-detectability, k_{i,j}, of a fault pair (f_i, f_j) is the number of input patterns that detect both f_i and f_j. This definition follows from the idea of detectability developed in [55].
Assume that all the input patterns are equally likely; then

d_{i,j} = 1 - k_{i,j} / 2^n.
TABLE I
BEHAVIOR OF FAULTY MULTIPLE-OUTPUT CIRCUITS
The d_{i,j} values generate a diversity profile for N_1 and N_2 with respect to a fault model. Consider a duplex system consisting of N_1 and N_2. In response to any input combination, N_1 and N_2 can produce 1 of 3 cases at their outputs.
1) Both of them produce correct outputs.
2) One of them produces the correct output and the other produces an incorrect output.
3) Both of them produce the same incorrect value.
For case 1, the duplex system produces correct outputs. For case 2, the system reports a mismatch so that appropriate recovery actions can be taken. For case 3, the system produces an incorrect output without reporting a mismatch; thus the integrity of the system is lost due to the presence of faults in N_1 and N_2. In the literature on fault tolerance [43], [48], this system integrity is referred to as the fault-secure property.
If all fault pairs are equally probable and there are m fault pairs (f_i, f_j), then the diversity D for N_1 and N_2 is:

D = (1/m) Σ_{(f_i, f_j)} d_{i,j}.
2) Example 2: Extend Example 1 to consider multiple-output combinational logic circuits. For a fault pair (f_i, f_j) affecting N_1 and N_2, define k_{i,j} as the number of input patterns in response to each of which both N_1 and N_2 produce the same erroneous output pattern. Use the same formulas as in Example 1.
For example, consider a combinational logic function with 2 inputs and 2 outputs (Table I). Let f_i and f_j affect N_1 and N_2, respectively. Table I shows the responses of N_1 and N_2 in the presence of the faults; the faulty output bits are highlighted in columns 3 and 4. To calculate d_{i,j}, consider only the input patterns 10 and 11.
This illustration of D can be extended to sequential circuits and software programs. For small or medium-sized systems, the exact value of D can be calculated manually or using computer programs. For large systems, the value can be estimated with simulation techniques.
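As an illustration of how D can be computed by enumeration, the sketch below evaluates the single-output formula for two tiny implementations; the circuits, the fault list, and the equal fault-pair probabilities are illustrative assumptions, not taken from [35].

```python
# Minimal sketch of the design diversity metric D of Section V-A.
# The implementations, faults, and equal probabilities are illustrative.

from itertools import product

N_INPUTS = 2

def impl1(a, b, fault=None):
    """Implementation 1: z = a AND b, with an optional output stuck-at fault."""
    z = a & b
    return fault if fault is not None else z

def impl2(a, b, fault=None):
    """Implementation 2: z = NOT(NOT a OR NOT b), same function, different structure."""
    z = 1 - ((1 - a) | (1 - b))
    return fault if fault is not None else z

def d_ij(f1, f2):
    """d_{i,j} = 1 - k_{i,j}/2^n for a single-output pair of implementations."""
    k = 0
    for a, b in product((0, 1), repeat=N_INPUTS):
        good = a & b
        err1 = impl1(a, b, f1) != good
        err2 = impl2(a, b, f2) != good
        if err1 and err2:            # identical error for single-output circuits
            k += 1
    return 1.0 - k / 2 ** N_INPUTS

def diversity(fault_pairs):
    """D = average of d_{i,j} over equally likely fault pairs."""
    return sum(d_ij(f1, f2) for f1, f2 in fault_pairs) / len(fault_pairs)

if __name__ == "__main__":
    pairs = [(0, 0), (0, 1), (1, 0), (1, 1)]   # output stuck-at faults in each copy
    print("D =", diversity(pairs))
```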
3) Reliability Analysis Using the Design Diversity
Metric: Reliability analysis for redundant systems has
been performed in [35] using the design diversity metric D.
D is useful because it is closely related to the causal structure of the CMF; as mentioned in Section III, this causal structure is missing in all the previous work on reliability analysis of CMF. Theoretical analysis using D and simulation results lead to two conclusions about diversity [35].
1) For s-independent multiple-module failures, mere use of different implementations does not always guarantee higher reliability compared to redundant systems with s-identical implementations. It is important to analyze the reliability of redundant systems using D.
2) For CMF and design faults, there is an important gain in using different implementations. However, our analysis shows that the gain diminishes as the mission time increases. Our simulation results demonstrate the usefulness of diversity for enhancing the self-testing properties of redundant systems.
B. Designing for Diversity: Synthesis Problems
This section discusses 2 categories of techniques that can
be used to achieve sufficient diversity in a given hardware or
software system:
A) Techniques that do not consider any fault model,
B) Techniques that consider an underlying fault model.
The concept of -version programming was proposed in [4]
to achieve diversity in software systems. This technique can
be used at various levels of abstraction. For example, entirely
different algorithms can be used for performing a particular
computation. On the other hand, different implementations of
the same algorithm (possibly by different s-independent pro-
grammers) can achieve diversity. However, it is not easy to quan-
tify the diversity that is really obtained with this category [29].
This paper has already discussed the commercial use of hard-
ware design diversity as used in Boeing 777 [47], the space
shuttle, Airbus 320 [5], etc. All these are examples of imple-
menting diversity without any underlying fault model. We hope
that, with different implementations, the errors in the different
copies will be different.
Although the majority of the design diversity techniques in
computing systems are focused on techniques that do not rely
on any fault model, some work has been done on diversity with
some underlying fault model in mind. RESO (Recomputation
using Shifted Operands) [41] and RERO [31] are two such error
detection techniques that are targeted toward ALU using the
concept of time redundancy. In RESO, during the initial com-
putation, the operands are passed to the inputs of the ALU and
the result is stored in a register. During the recomputation step,
the operands are shifted left (by k bits) and then applied to the
inputs of the ALU under consideration. The computed result is
right shifted, and then compared with the result obtained from
the initial step (stored in a register). If these two results mis-
match, an error signal is turned on. Reference [41] shows that
for most practical ALU implementations, RESO detects all er-
rors caused by faults in a bit-slice or a specific subcircuit of the
bit slice. RESO has been extended to RERO (Recomputation
using Rotated Operands) [31].
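A minimal sketch of the RESO idea for an integer adder follows; the shift distance k, the word-level model, and the injected fault are illustrative assumptions, and [41] should be consulted for the circuit-level guarantees.

```python
# Minimal sketch of the RESO (recomputation with shifted operands) idea
# for an integer adder.  The shift distance k and the fault are illustrative.

def alu_add(a: int, b: int, fault_bit: int = -1) -> int:
    """Adder under test; optionally corrupt one result bit to model a fault."""
    s = a + b
    if fault_bit >= 0:
        s ^= 1 << fault_bit
    return s

def reso_add(a: int, b: int, k: int = 2, fault_bit: int = -1) -> tuple:
    """Compute a+b twice: once directly, once with operands shifted left by k."""
    first = alu_add(a, b, fault_bit)
    recomputed = alu_add(a << k, b << k, fault_bit) >> k   # shift the result back
    return first, first != recomputed                      # (result, error flag)

if __name__ == "__main__":
    print(reso_add(13, 29))                 # (42, False): fault-free, results agree
    print(reso_add(13, 29, fault_bit=3))    # the corrupted bit hits different result
                                            # positions in the two runs -> mismatch
```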
For duplex systems using hardware redundancy, Tohma pro-
posed to use the implementations of logic functions in true and
complemented forms [56]. The use of a particular circuit and its
dual was proposed [49] to achieve diversity in order to handle
CMF. The basic idea is, with different implementations, failures
that affect the two circuits in the same way will probably cause
different error effects. Ref [36] introduces a common-mode
fault model involving the register bits of the inputs of the indi-
vidual copies, and proposes synthesis techniques for designing
redundant systems that are protected against the modeled CMF.
Conventional (high-level, logic, layout) synthesis techniques
can be adapted to generate multiple designs such that the design
diversity metric in Section V-A can be maximized. Adapting
these synthesis techniques for generating diverse designs leads
to interesting (and important) open problems for researchers in
this field.
For example, in [57], a technique for synthesizing diverse
combinational logic circuits has been described. This technique
maximizes the data integrity of the resulting diverse duplex
system against multiple failures and CMF while minimizing
the area overhead.
REFERENCES
[1] M. Abramovici, M. A. Breuer, and A. D. Friedman, Digital Systems Testing and Testable Design, 1990.
[2] R. L. Abbott, "Resourceful systems for fault tolerance, reliability and safety," ACM Computing Surveys, vol. 22, no. 1, pp. 35-68, 1990.
[3] P. E. Ammann and J. C. Knight, "Data diversity: An approach to software fault tolerance," in Proc. Int. Symp. Fault-Tolerant Computing, 1987, pp. 122-126.
[4] A. Avizienis and J. P. J. Kelly, "Fault tolerance by design diversity: Concepts and experiments," IEEE Computer, pp. 67-80, Aug. 1984.
[5] D. Briere and P. Traverse, "Airbus A320/A330/A340 electrical flight controls: A family of fault-tolerant systems," in Proc. Int. Symp. Fault-Tolerant Computing, 1993, pp. 616-623.
[6] J. T.-Y. Chang and E. J. McCluskey, "SHOrt Voltage Elevation (SHOVE) test," in Proc. Int. Test Conf., 1996, pp. 45-49.
[7] J. Christmansson, Z. Kalbarczyk, and J. Torin, "Dependable flight control system by data diversity and self-checking components," Microprocessor and Microprogramming, vol. 40, no. 2 and 3, pp. 207-222, 1994.
[8] J. A. Clark and D. K. Pradhan, "Fault injection: A method for validating computer system dependability," IEEE Computer, vol. 28, no. 6, pp. 47-56, Jun. 1995.
[9] M. L. Cortes, "Temporary failures in digital circuits: Experimental results and fault modeling," Ph.D. dissertation, Center for Reliable Computing, Stanford Univ., 1987.
[10] C. Davis, "Common-mode failure in information systems," SciTech J., vol. 6, no. 5, pp. 13-15, Jul.-Aug. 1996.
[11] B. S. Dhillon and C. L. Proctor, "Common-mode failure analysis of reliability networks," in Proc. Ann. Reliability and Maintainability Symp., 1977, pp. 404-408.
[12] B. S. Dhillon and O. C. Anude, "Common-cause failure analysis of a redundant system with repairable units," Int. J. Syst. Sci., vol. 25, no. 3, pp. 527-540, Mar. 1994.
[13] B. S. Dhillon and O. C. Anude, "Common-cause failure analysis of a k-out-of-n: G system with repairable units," Microelectronics and Reliability, vol. 34, no. 3, pp. 429-442, Mar. 1994.
[14] B. S. Dhillon and O. C. Anude, "Common-cause failure analysis of a k-out-of-n: G system with nonrepairable units," Int. J. Syst. Sci., vol. 26, no. 10, pp. 2029-2042, Oct.
[15] Easterling, "Probabilistic analysis of common-mode failures," in Proc. ANS Conf. Probabilistic Analysis of Nuclear Safety, 1978.
[16] K. N. Fleming, A. Mosleh, and A. P. Kelly, "On the analysis of dependent failures in risk assessment and reliability evaluation," Nuclear Safety, vol. 24, pp. 637-657, 1983.
[17] K. K. Goswami et al., "DEPEND: A simulation-based environment for system level dependability analysis," IEEE Trans. Computers, vol. 46, no. 1, pp. 60-74, Jan. 1997.
[18] R. K. Gulati and C. F. Hawkins, Iddq Testing of VLSI Circuits. Kluwer Academic Publishers, 1993.
[19] U. Gunneflo, J. Karlsson, and J. Torin, "Evaluation of error detection schemes using fault injection by heavy-ion radiation," in Proc. Int. Symp. Fault-Tolerant Computing, 1989, pp. 340-347.
[20] S. G. Han and W. H. Yoon, "The trinomial failure rate model for treating common-mode failures," Reliability Engineering and System Safety, vol. 25, no. 2, pp. 131-146, 1989.
[21] H. Hao and E. J. McCluskey, "Very-low-voltage testing for weak CMOS logic ICs," in Proc. Int. Test Conf., 1993, pp. 275-284.
[22] K. Harada and T. Hidaka, "Probability analysis of a 2-out-of-n: F system with common cause failure," Microelectronics and Reliability, vol. 34, no. 2, pp. 289-296, Feb. 1994.
[23] E. R. Hnatek, Integrated Circuit Quality and Reliability, 1995.
[24] R. K. Iyer and D. Tang, "Experimental analysis of computer system dependability," Center for Reliable and High-Performance Computing, Univ. Illinois at Urbana-Champaign, Tech. Rep. CRHC-93-15, 1993.
[25] I. M. Jacobs, "The common mode failure study discipline," IEEE Trans. Nucl. Sci., vol. 17, no. 1, pp. 594-598, Feb. 1970.
[26] Z. Kalbarczyk and J. Christmansson, "Technical approaches for reducing the probability of common-cause/common-mode failures: A survey," Lab. Dependable Computing, Department of Computer Engineering, Chalmers Univ. of Technology, Sweden, Tech. Rep. 237, May 1995.
[27] J. Karlsson et al., "Using heavy-ion radiation to validate fault-handling mechanisms," IEEE Micro, vol. 14, no. 1, pp. 8-23, Feb. 1994.
[28] H. Kim and K. G. Shin, "Modeling of externally-induced/common-cause faults in fault-tolerant systems," in Proc. AIAA/IEEE Digital Avionics Systems Conf., 1994, pp. 402-407.
[29] J. C. Knight and N. G. Leveson, "A large scale experiment in N-version programming," in Proc. Int. Symp. Fault-Tolerant Computing, 1985, pp. 135-139.
[30] J. H. Lala and R. E. Harper, "Architectural principles for safety-critical real-time applications," Proc. IEEE, vol. 82, pp. 25-40, 1994.
[31] J. Li and E. E. Swartzlander, "Concurrent error detection in ALUs by recomputing with rotated operands," in Proc. IEEE Int. Workshop on Defect and Fault Tolerance in VLSI Systems, 1992, pp. 109-116.
[32] P. Liden et al., "On latching probability of particle-induced transients in combinational networks," in Proc. FTCS, 1994, pp. 340-349.
[33] B. Littlewood, "The impact of diversity upon common-mode failures," Reliability Engineering and System Safety, vol. 51, no. 1, pp. 101-113, 1996.
[34] T. Lovric, "Detecting hardware-faults with systematic and design diversity: Experimental results," Computer Systems Science and Engineering, vol. 11, no. 2, pp. 83-92, 1996.
[35] S. Mitra, N. R. Saxena, and E. J. McCluskey, "A design diversity metric and reliability analysis for redundant systems," in Proc. Int. Test Conf., 1999, pp. 662-671.
[36] S. Mitra and E. J. McCluskey, "Design of redundant systems protected against common-mode failures," Center for Reliable Computing, Stanford Univ., http://crc.stanford.edu, 2000.
[37] B. M. E. Moret et al., "Boolean difference techniques for time-sequence and common-cause analysis of fault-trees," IEEE Trans. Reliability, vol. R-33, no. 5, pp. 399-405, Dec. 1984.
[38] A. Mosleh, "Common cause failures: An analysis methodology and examples," Reliability Engineering and System Safety, vol. 34, no. 3, pp. 249-292, 1991.
[39] W. Needham, Designer's Guide to Testable ASIC Devices, 1991.
[40] G. W. Parry, "Common cause failure analysis: A critique and some suggestions," Reliability Engineering and System Safety, vol. 34, no. 3, pp. 309-326, 1991.
[41] J. H. Patel and L. Y. Fung, "Concurrent error detection in ALUs by recomputing with shifted operands," IEEE Trans. Computers, vol. C-31, no. 7, pp. 589-595, Jul. 1982.
[42] D. A. Patterson, P. Chen, G. Gibson, and R. H. Katz, "Introduction to redundant array of inexpensive disks," in Proc. COMPCON, 1989, pp. 112-117.
[43] D. K. Pradhan, Fault-Tolerant Computer System Design. Prentice Hall, 1996.
[44] B. Putney, "A common-cause evaluation methodology for large fault trees," Trans. American Nuclear Soc., vol. 33, pp. 574-575, 1979.
[45] Quickturn Design Systems (A Cadence Company), http://www.quickturn.com.
[46] R. Reed et al., "Heavy ion and proton-induced single event multiple upset," IEEE Trans. Nucl. Sci., vol. 44, no. 6, pp. 2224-2229, Jul. 1997.
[47] R. Riter, "Modeling and testing a critical fault-tolerant multi-process system," in Proc. Int. Symp. Fault-Tolerant Computing, 1995, pp. 516-521.
[48] D. P. Siewiorek and R. S. Swarz, Reliable Computer Systems: Design and Evaluation. Digital Press, 1992.
[49] Y. Tamir and C. H. Sequin, "Reducing common mode failures in duplicate modules," in Proc. IEEE Int. Conf. Computer Design, 1984, pp. 302-307.
[50] K. S. Trivedi, Probability and Statistics with Reliability, Queuing, and Computer Science Applications. Prentice Hall, 1982.
[51] J. K. Vaurio, "The probabilistic modeling of external common cause failure shocks in redundant systems," Reliability Engineering and System Safety, vol. 50, no. 1, pp. 97-107, 1995.
[52] J. K. Vaurio, "An implicit method for incorporating common-cause failures in system analysis," IEEE Trans. Reliability, vol. 47, no. 2, pp. 173-180, Jun. 1998.
[53] I. A. Watson and G. T. Edwards, "Common-mode failures in redundancy systems," Nuclear Technology, vol. 46, pp. 183-191, Dec. 1979.
[54] R. B. Worrell and G. R. Burdick, "Qualitative analysis in reliability and safety studies," IEEE Trans. Reliability, vol. R-25, no. 3, pp. 164-170, Aug. 1976.
[55] E. J. McCluskey, S. Makar, S. Mourad, and K. D. Wagner, "Probability models for pseudo-random test sequences," IEEE Trans. Computers, vol. 37, no. 2, pp. 160-174, Feb. 1988.
[56] Y. Tohma and S. Aoyagi, "Failure-tolerant sequential machines with past information," IEEE Trans. Computers, vol. C-20, no. 4, pp. 392-396, Apr. 1971.
[57] S. Mitra and E. J. McCluskey, "Combinational logic synthesis for diversity in duplex systems," in Proc. Int. Test Conf., 2000, pp. 179-188.
Subhasish Mitra is an Assistant Director at the Stanford Center for Reliable
Computing (CRC). He received the B.E. (1994) in computer science and
engineering from Jadavpur University, Calcutta, India, the M.Tech. (1996) in
computer science and engineering from the Indian Institute of Technology,
Kharagpur, and the Ph.D. (2000) from Stanford University, California. Prof. E.
J. McCluskey was his Ph.D. Thesis Adviser. Dr. Mitra is a Research Associate
in the DARPA sponsored ROAR Project at Stanford CRC, and provides
part-time consulting in various areas of VLSI design and test. His research
interests include digital testing, logic synthesis, and fault-tolerant computing.
Dr. Mitra received gold medals for being the top student in the School of
Engineering at the undergraduate and M.Tech. levels.
Nirmal R. Saxena is an Associate Director at Stanford CRC. His research in-
terests include computer architecture, fault-tolerant computing, combinatorial
mathematics, probability theory and VLSI design/test. He received the B.E.
(1982) in electronics and communication engineering from Osmania University, India; the M.S. (1984) in electrical engineering from the University of Iowa; and
the Ph.D. (1991) in electrical engineering from Stanford University. He is a Se-
nior Member of the IEEE.
Edward J. McCluskey received the A.B. (summa cum laude, 1953) in mathe-
matics and physics from Bowdoin College, B.S. (1953), M.S. (1953), and Sc.D.
(1956) in electrical engineering from MIT. The degree of Doctor Honoris Causa
(1994) was awarded by the Institut National Polytechnique de Grenoble.
He worked on electronic switching systems at the Bell Telephone Laborato-
ries from 1955 to 1959. In 1959, he moved to Princeton University, where he
was Professor of Electrical Engineering and Director of the University Com-
puter Center. In 1966, he joined Stanford University, where he is Professor of
Electrical Engineering and Computer Science, and Director of the Center for
Reliable Computing. He founded the Stanford Digital Systems Laboratory (now
the Computer Systems Laboratory) in 1969 and the Stanford Computer Engi-
neering Program (now the Computer Science M.S. Degree Program) in 1970.
The Stanford Computer Forum (an Industrial Affiliates Program) was started by
Dr. McCluskey and two colleagues in 1970 and he was its Director until 1978.
McCluskey developed the first algorithm for designing combinational circuits, the Quine-McCluskey logic minimization procedure, as a doctoral student at MIT. At Bell Labs and Princeton, he developed the modern theory of
transients (hazards) in logic networks and formulated the concept of operating
modes of sequential circuits. His Stanford research focuses on logic testing, syn-
thesis, design for testability, and fault-tolerant computing. Prof. McCluskey and
his students at CRC worked out many key ideas for fault-equivalence, proba-
bilistic modeling of logic networks, pseudo-exhaustive testing, and watchdog
processors. He collaborated with Signetics researchers in developing one of the
first practical multivalued logic implementations and then worked out a design
technique for such circuitry.
Dr. McCluskey was the first President of the IEEE Computer Society. He re-
ceived the 1996 IEEE Emanuel R. Piore Award. He is a Fellow of the IEEE,
AAAS, and ACM; and a member of the NAE. He has published several books
(including two widely used texts) and book chapters as well as hundreds of pa-
pers. His most recent book is Logic Design Principles with Emphasis on Testable Semicustom Circuits (Prentice-Hall, 1986). His other recent honors include election to the National Academy of Engineering in 1998 and designation as an IEEE Computer Society Golden Core Member. In 1984 he received the IEEE Centennial Medal
and the IEEE Computer Society Technical Achievement Award in Testing. In
1990 he received the EURO ASIC 90 Prize for Fundamental Outstanding Con-
tribution to Logic Synthesis. The IEEE Computer Society honored him with the
1991 Taylor L. Booth Education Award.
