DESIGN FOR RELIABILITY
Vincent R. Lalli
National Aeronautics and Space Administration
Lewis Research Center
Cleveland, Ohio 44135
The following format is used for reliability practices:

Implementation Method: A brief technical discussion, not intended to give the full details of the process but to provide a design engineer with adequate information to understand how the practice should be used

Technical Rationale: A brief technical justification for use of the practice

Impact of Nonpractice: A brief statement of what can be expected if the practice is not followed

References: Publications that contain additional information and the manuals that contain related information

Reliability practices (by sponsor of practice) include:

PD-ED-1201  EEE Parts Derating
PD-ED-1202  High-Voltage Power Supply Design and Manufacturing Practices
PD-ED-1203  Class-S Parts in High-Reliability Applications
PD-ED-1204  Part Junction Temperature
PD-ED-1206  Power Line Filters
PD-ED-1207  Magnetic Design Control for Science Instruments
Technical Rationale (concluded): ...and projects. Unlike a reliability design practice, a guideline lacks specific operational experience or data to validate its contribution to mission success. However, a guideline does contain information that represents current "best thinking" on a particular subject.

The following format is used for reliability guidelines:

Technical Rationale: A brief technical justification for use of the guideline

[Table (only partially recoverable in this copy): environmental effects and the failures they induce during spaceflight operations.]

Environment       Principal effects                 Typical failures induced
Chemical          Chemical reactions                Loss of mechanical strength;
                                                    interference with moving parts;
                                                    loss of electrical strength
Solar             Actinic and physicochemical       Surface deterioration; loss of
                  reactions                         strength; alteration of electrical
                                                    properties; interference with function
High-speed        Heating; compression stress      Thermal aging; structural collapse;
particles                                           structural weakening; increased
                                                    conductivity
Vibration         Mechanical stress                 Loss of mechanical strength;
                                                    interference with function;
                                                    increased wear
Magnetic fields   Induced magnetization             Alteration of electrical properties;
                                                    induced heating
Salvatore Bavuso
Langley Research Center
Reliability engineering is necessary because, as users of rapidly changing technology and as members of large complex systems, we cannot ensure that essential details affecting reliability are not overlooked.

(2) Engage the material technologists to determine the flaw failure mechanisms.

(3) Develop flaw control techniques and send information back to the engineers responsible for design, manufacture, and support planning.
1.1 Period of Awakening: Failure Analysis
[Figure: flaw (failure) information flow. Design information, environmental stress, and information on operational conditions feed the design package; flaw control information circulates among design, manufacturing, support planning, and the user; manufacturing flaw information and flaw (failure) mechanisms feed back through material technology; the loop closes at the completed systems/equipment.]
  Concluding Remarks
  Reliability Training
Using Failure Rate Data
  Variables Affecting Failure Rates
    Operating Life Test
    Storage Test
    Summary of Variables Affecting Failure Rates
  Part Failure Rate Data
  Improving System Reliability Through Part Derating
  Predicting Reliability by Rapid Techniques
  Use of Failure Rates in Tradeoffs
  Nonoperating Failures
  Applications of Reliability Predictions to Equipment Reliability
  Standardization as a Means of Reducing Failure Rates
  Allocation of Failure Rates and Reliability
  Importance of Learning From Each Failure
  Failure Reporting, Analysis, Corrective Action, and Concurrence
  Case Study--Achieving Launch Vehicle Reliability
    Design Challenge
    Subsystem Description
    Approach to Achieving Reliability Goals
    Launch and Flight Reliability
    Field Failure Problem
    Mechanical Tests
    Runup and Rundown Tests
    Summary of Case Study
  Concluding Remarks
  Reliability Training
Applying Probability Density Functions
  Probability Density Functions
  Application of Density Functions
  Cumulative Probability Distribution
  Normal Distribution
    Normal Density Function
    Properties of Normal Distribution
    Symmetrical Two-Limit Problems
    One-Limit Problems
    Nonsymmetrical Two-Limit Problems
  Application of Normal Distribution to Test Analyses and Reliability Predictions
  Effects of Tolerance on a Product
  Notes on Tolerance Accumulation: A How-To-Do-It Guide
  Estimating Effects of Tolerance
  Concluding Remarks
  Reliability Training
  Other Models
  Trends and Conclusions
Software
  Categories of Software
  Processing Environments
  Severity of Software Defects
  Software Bugs Compared With Software Defects
  Hardware and Software Failures
  Manifestations of Software Bugs
  Reliability Training
Software Quality Assurance
  Concept of Quality
  Control of Software Quality
  Software Quality Characteristics
  Software Quality Metrics
  Overall Software Quality Metrics
  Software Quality Standards
  Concluding Remarks
  Reliability Training
Reliability Management
  Roots of Reliability Management
  Planning a Reliability Management Organization
  General Management Considerations
  Program Establishment
  Goals and Objectives
  Symbolic Representation
  Logistics Support and Repair Philosophy
  Reliability Management Activities
  Performance Requirements
  Specification Targets
  Field Studies
  Human Reliability
    Analysis Methods
    Human Errors
    Example
  Presentation of Reliability
    Engineering and Manufacturing
    User or Customer
  Reliability Training
Appendixes
  A--Reliability Information
  B--Project Manager's Guide on Product Assurance
  C--Reliability Testing Examples
Bibliography
Reliability Training Answers
where added failures occur, the observed failures will be greater than the inherent failures of the design.

Pc  probability that catastrophic part failures will not occur
Pt  probability that tolerance failures will not occur
Pw  probability that wearout failures will not occur

As in the resistor example, these probabilities are multiplied together because they are considered to be independent of each other. However, this may not always be true because an out-of-tolerance failure, for example, may evolve into or result from a catastrophic part failure. Nevertheless, in this tutorial they are considered independent and exceptions are pointed out as required.

2.3.2 Inherent product reliability.--Consider the inherent reliability Ri of a product. Think of the expression Ri = PcPtPw as representing the potential reliability of a product as described by its documentation, or let it represent the reliability inherent in the design drawings instead of the reliability of the manufactured hardware. This inherent reliability is predicated on the decisions and actions of many people. If they change, the inherent reliability could change.

2.4 K-Factors

The other contributors to product failure just mentioned are called K-factors; they have a value between 0 and 1 and modify the inherent reliability:

    Rproduct = Ri(KqKmKrKlKu)

• K-factors denote probabilities that inherent reliability will not be degraded by
  - Kq  quality test methods and acceptance criteria
  - Km  manufacturing and fabrication and assembly techniques
  - Kr  reliability engineering activities
  - Kl  logistics activities
  - Ku  the user or customer
• Any K-factor can cause reliability to go to zero.
• If each K-factor equals 1 (the goal), Rproduct = Ri.
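The two expressions above, Ri = PcPtPw and Rproduct = Ri(KqKmKrKlKu), can be collected into a small numerical sketch. The probability and K values used here are illustrative assumptions, not data from the tutorial:

```python
def inherent_reliability(p_c, p_t, p_w):
    """Ri = Pc * Pt * Pw, assuming the three failure classes are independent."""
    return p_c * p_t * p_w

def product_reliability(r_i, k_factors):
    """Rproduct = Ri multiplied by every K-factor; each K must lie in [0, 1]."""
    r = r_i
    for name, k in k_factors.items():
        if not 0.0 <= k <= 1.0:
            raise ValueError(f"K-factor {name} must be between 0 and 1")
        r *= k
    return r

# Illustrative values only:
r_i = inherent_reliability(p_c=0.999, p_t=0.995, p_w=0.998)
ks = {"Kq": 0.99, "Km": 0.98, "Kr": 1.0, "Kl": 1.0, "Ku": 0.97}
r_prod = product_reliability(r_i, ks)
```

The sketch mirrors the two bullet points: setting every K to 1 returns Ri unchanged, and any single K of 0 drives the product reliability to zero.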
have two design concepts for performing some function. If the failure rate of concept A is 10 times higher than that of concept B, one can expect concept B to fail one-tenth as often as concept A. If it is desirable to use concept A for other reasons, such as cost, size, performance, or weight, the derating failure rate curves can be used to improve concept A's failure rate (e.g., select components with a lower failure rate, derate the components more, or both). An even better approach is to find ways to reduce the complexity and thus the failure rate of concept A. Figure 3 illustrates the use of failure rate data in tradeoffs. This figure gives a failure-rate-versus-temperature curve for the electronics of a complex (over 35 000 parts) piece of ground support equipment. The curve was developed as follows:

(1) A failure rate prediction was performed by using component failure rates and their application factors K_A for an operating temperature of 25 °C. The resulting failure rate was chosen as a reference point.

(2) Predictions were then made by using the same method for temperatures of 50, 75, and 100 °C. The ratios of these predictions to the reference point were plotted versus component operating temperature, with the resulting curve for the equipment. This curve was then used to provide tradeoff criteria for using air-conditioning versus blowers to cool the equipment. To illustrate, suppose the maximum operating temperatures expected are 50 °C with air-conditioning and 75 °C with blowers. Suppose further that the required failure rate for the equipment, if the equipment is to meet its reliability goal, is one failure per 50 hr. A failure rate prediction at 25 °C might indicate a failure rate of 1 per 100 hr. From the figure, note that the maximum allowable operating temperature is therefore 60 °C, since the maximum allowable failure rate ratio is A = 2; that is, at 60 °C the equipment failure rate will be (1/100) × 2 = 1/50, which is the required failure rate. If blowers are used for cooling, the equipment must operate at temperatures as high as 75 °C; if air-conditioning is used, the temperature need not exceed 50 °C. Therefore, air-conditioning must be used if we are to meet the reliability requirement.

Other factors must be examined before we make a final decision. Whatever type of cooling equipment is selected, the total system reliability now becomes

    R_T = Re Rc

Therefore, the effect on the system of the cooling equipment's reliability must be calculated. A more important consideration is the effect on system reliability should the cooling equipment fail. Because temperature control appears to be critical, loss of it may have serious system consequences. Therefore, it is too soon to rule out blowers entirely. A failure mode, effects, and criticality analysis (FMECA) must be made on both cooling methods to examine all possible failure modes and their effects on the system. Only then will we have sufficient information to make a sound decision.

2.7 Importance of Learning From Each Failure

When a product fails, a valuable piece of information about it has been generated because we have the opportunity to learn how to improve the product if we take the right actions.

Failures can be classified as:

(1) Catastrophic (a shorted transistor or an open wire-wound resistor)
(2) Degradation (change in transistor gain or the resistor value)
(3) Wearout (brush wear in an electric motor)
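The figure 3 tradeoff can be replayed numerically. The 25 °C reference rate (1/100 hr), the required rate (1/50 hr), and the ratio A = 2 at 60 °C all come from the text; the other points on the ratio curve are assumptions standing in for the figure:

```python
# Hypothetical failure-rate ratio curve (temperature °C -> predicted rate divided
# by the 25 °C reference rate), shaped like the tutorial's figure 3. Only the
# 25 °C and 60 °C entries are taken from the text; the rest are assumed.
ratio_curve = {25: 1.0, 50: 1.5, 60: 2.0, 75: 3.2, 100: 6.0}

reference_rate = 1 / 100   # failures per hour, predicted at 25 °C
required_rate = 1 / 50     # reliability goal: one failure per 50 hr

# Maximum allowable ratio A before the goal is violated:
max_ratio = required_rate / reference_rate          # A = 2.0

# Highest tabulated temperature whose ratio stays within A:
max_temp = max(t for t, a in ratio_curve.items() if a <= max_ratio)

# Equipment failure rate at that temperature: (1/100) * 2 = 1/50
rate_at_max = reference_rate * ratio_curve[max_temp]
```

With these values the sketch reproduces the text's conclusion: 60 °C is the maximum allowable operating temperature, so blowers (75 °C) fail the requirement and air-conditioning (50 °C) meets it.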
Corrective action ensures that the cause is dealt with. Concurrence informs management of actions being taken to avoid another failure. These data enable all personnel to compare the part ratings with the use stresses and verify that the part is being used with a known margin.

2.8 Effects of Tolerance on a Product

Because tolerances must be expected in all manufacturing, electrical circuits are often affected by part tolerances (circuit gains can shift up or down, and transfer function poles or zeros can shift into the right-hand s-plane, causing oscillations). Mechanical components may not fit together or may be so loose that excessive vibration causes trouble (refs. 1 to 3).

(3) What will affect the term Pt in the product reliability model?

3.0 TESTING FOR RELIABILITY

3.1 Test Objectives

It can be inferred that 1000 test samples are required to demonstrate a reliability requirement of 0.999. Because of cost and time, this approach is impractical. Furthermore, the total production of a product often may not even approach 1000 items. Because we usually cannot test the total production of a product (called the product population), we must demonstrate reliability on a few samples. Thus, the main objective of a reliability test is to test an available device so that the data will allow a statistical estimate of the reliability of similar devices that cannot be tested and that often have not yet been manufactured.

To know how reliable a product is, one must know how many ways it can fail and the types and magnitudes of the stresses that produce such failures. This premise leads to a secondary objective of a reliability test: to produce failures in the product so that the types and magnitudes of the stresses causing such failures can be identified. Reliability tests that result in no failures provide some measure of reliability but little information about the population failure mechanisms of like devices. (The exceptions to this are not dealt with at this time.)

In subsequent sections, we discuss confidence levels, attribute test, test-to-failure, and life test methods, explain how well these methods meet the two test objectives, show how the test results can be statistically analyzed, and introduce the subject and use of confidence limits.

We know that statistical estimates are more likely to be close to the true value as the sample size increases. Thus, there is a close correlation between the accuracy of an estimate and the size of the sample from which it was obtained. Only an infinitely large sample size could give us 100 percent confidence or certainty that a measured statistical parameter coincides with the true value. In this context, confidence is a mathematical probability relating the mutual positions of the true value of a parameter and its estimate.

When the estimate of a parameter is obtained from a reasonably sized sample, we may logically assume that the true value of that parameter will be somewhere in the neighborhood of the estimate, to the right or to the left. Therefore, it would be more meaningful to express statistical estimates in terms of a range or interval with an associated probability or confidence that the true value lies within such an interval than to express them as point estimates. This is exactly what we are doing when we assign confidence limits to point estimates obtained from statistical measurements.

In other words, rather than express statistical estimates as point estimates, it would be more meaningful to express them as a range (or interval), with an associated probability (or confidence) that the true value lies within such an interval. Confidence is a statistical term that depends on supporting data and reflects the amount of risk to be taken when stating the reliability.

Qualification tests are categorized as attribute tests (refs. 5 and 6). They are usually go/no-go and demonstrate that a device is good or bad without showing how good or how bad. In a typical test, two samples are subjected to a selected level of environmental stress, usually the maximum anticipated operational limit. If both samples pass, the device is considered qualified, preflight certified, or verified for use in the particular environment involved (refs. 7 and 8). Occasionally, such tests are called tests to
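The 1000-sample figure quoted above follows from the standard success-run (zero-failure) relation n = ln(1 − C)/ln(R), which the text does not derive; at a confidence of about 63 percent, demonstrating R = 0.999 does indeed take roughly 1000 failure-free samples. A sketch:

```python
import math

def success_run_samples(reliability, confidence):
    """Number of consecutive failure-free test samples needed to demonstrate
    `reliability` at `confidence` (standard success-run relation)."""
    return math.ceil(math.log(1.0 - confidence) / math.log(reliability))

n_63 = success_run_samples(0.999, 0.632)   # about 1000 samples
n_90 = success_run_samples(0.999, 0.90)    # over twice as many
```

Raising the confidence level raises the sample count sharply, which is why, as the text notes, demonstrating high reliability by testing alone is impractical for most products.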
success because the true objective is to have the device pass the test.

In summary, an attribute test is not a satisfactory method of testing for reliability because it can only identify gross design and manufacturing problems; it is an adequate method of testing for reliability only when sufficient samples are tested to establish an acceptable level of statistical confidence.

3.4 Test-To-Failure Methods

In summary, test-to-failure methods develop a strength distribution that provides a good estimate of the Pt and Pw product reliability terms without the need for the large samples required for attribute tests; the results of a test-to-failure exposure of a device can be used to predict the reliability of similar devices that cannot or will not be tested; testing to failure provides a means of evaluating the failure modes and mechanisms of devices so that improvements can be made; confidence levels can be applied to the safety margins and to the resulting population reliability estimates; the accuracy of a safety factor can be known only if the associated safety margin is known.

[Figure 4. Two structures with identical safety factors (SF = 13/10 = 1.3) but different safety margins: (a) structure A, 9.68 percent defective; (b) structure B, SM = 4.0, 0.003 percent defective.]

3.5 Life Test Methods

Life tests are conducted to illustrate how the failure rate of a typical system or complex subsystem varies during its operating life. Such data provide valuable guidelines for controlling product reliability. They help to establish burn-in requirements, to predict spare part requirements, and to understand the need for or lack of need for a system overhaul program. Such data are obtained through laboratory life tests or from the normal operation of a fielded system.
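Figure 4's point, the same safety factor but very different percent defective, can be reproduced by assuming a normally distributed strength and a fixed load. The mean strength of 13 and load of 10 match the figure's SF = 13/10; the standard deviations below are assumptions chosen to hit the quoted safety margins:

```python
import math

def normal_cdf(z):
    """Standard normal cumulative distribution via the error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))

def safety_factor(mean_strength, load):
    """SF: ratio of mean strength to load."""
    return mean_strength / load

def safety_margin(mean_strength, load, sigma):
    """SM: number of strength standard deviations between mean strength and load."""
    return (mean_strength - load) / sigma

def percent_defective(sm):
    """Fraction of parts whose strength falls below the load, in percent."""
    return 100.0 * normal_cdf(-sm)

# Both structures: mean strength 13, load 10 -> SF = 13/10 = 1.3
# Structure A: wide strength scatter (sigma assumed so that SM = 1.3)
sm_a = safety_margin(13.0, 10.0, sigma=3.0 / 1.3)
# Structure B: tight scatter (sigma assumed so that SM = 4.0, as in the figure)
sm_b = safety_margin(13.0, 10.0, sigma=0.75)

pd_a = percent_defective(sm_a)   # about 9.68 percent
pd_b = percent_defective(sm_b)   # about 0.003 percent
```

The identical SF hides a three-orders-of-magnitude difference in defect fraction, which is exactly the summary's warning that a safety factor is meaningful only together with its safety margin.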
In summary, life tests are performed to evaluate product failure rate characteristics; if failures include all causes of system failure, the failure rate of the system is the only true factor available for evaluating the system's performance; life tests at the part level require large sample sizes if realistic failure rate characteristics are to be identified; laboratory life tests must simulate the major factors that influence failure rates in a device during field operations; the use of running averages in the analysis of life data will identify burn-in and wearout regions if such exist; and failure rates are statistics and therefore are subject to confidence levels when used in making predictions.

Figure 5 illustrates what might be called a failure surface for a typical product. It shows system failure rate versus operating time and environmental stress, three parameters that describe a surface such that, given an environmental stress and an operating time, the failure rate is a point on the surface.

Test-to-failure methods generate lines on the surface parallel to the stress axis; life tests generate lines on the surface parallel to the time axis. Therefore, these tests provide a good description of the failure surface and, consequently, the reliability of a product.

Attribute tests result only in a point on the surface if failures occur and a point somewhere within the volume if failures do not occur. For this reason, attribute testing is the least desirable method for ascertaining reliability.

Of course, in the case of missile flights or other events that produce go/no-go results, an attribute analysis is the only way to determine product reliability.

4.0 SOFTWARE RELIABILITY

Software reliability management is highly dependent on how the relationship between quality and reliability is perceived. For the purposes of this tutorial, quality is closely related to the process, and reliability is closely related to the product. Thus, both span the life cycle.

Before we can stratify software reliability, the progress of hardware reliability will be reviewed. Over the past 25 years, the industry observed (1) the initial assignment of "wizard status" to hardware reliability for theory, modeling, and analysis, (2) the growth of the field, and (3) the final establishment of hardware reliability as a science. One of the major problems was aligning reliability predictions and field performance. Once that was accomplished, the wizard status was removed from hardware reliability. The emphasis in hardware reliability from now to the year 2000 will be on system failure modes and effects.

Software reliability became classified as a science for many reasons. The difficulty in assessing software reliability is analogous to the problem of assessing the reliability of a new hardware device with unknown reliability characteristics. The existence of 30 to 50 different software reliability models indicates the state of organization in this area. Hardware reliability began at a few companies and later became the focus of the Advisory Group on Reliability of Electronic Equipment. The field then logically progressed through different models in sequence over the years. Similarly, numerous people and companies simultaneously entered the software reliability field in their major areas: cost, complexity, and reliability. The difference is that at least 100 times as many people are now studying software reliability as those who initially studied hardware reliability. The existence of so many models and their purports tends to mask the fact that several of these models showed excellent correlations between software performance predictions and actual software field performance: the Musa model as applied to communications systems and the Xerox model as applied to office copiers. There are also reasons for not accepting software reliability as a science, and they are discussed next.
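The Musa model cited above, in its basic execution-time form (a standard formulation, not spelled out in this text), is a simple exponential fault-count model. A sketch, with illustrative parameter values:

```python
import math

def musa_expected_failures(tau, nu0, lam0):
    """Basic execution-time model: expected cumulative failures mu(tau) after
    tau units of execution time. nu0 is the total expected number of failures;
    lam0 is the initial failure intensity."""
    return nu0 * (1.0 - math.exp(-lam0 * tau / nu0))

def musa_intensity(tau, nu0, lam0):
    """Failure intensity lambda(tau) = lam0 * exp(-lam0 * tau / nu0)."""
    return lam0 * math.exp(-lam0 * tau / nu0)

# Illustrative parameters: 100 expected total failures, 5 failures/CPU-hr initially
mu_10 = musa_expected_failures(10.0, nu0=100.0, lam0=5.0)
```

As execution time accumulates, the predicted intensity decays and the cumulative failure count approaches nu0, which is the model's expression of reliability growth as bugs are found and removed.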
(3) "Quality is the same as reliability and is measured by the number of defects in a program and not by its reliability." All of these philosophies tend to eliminate probabilistic measures because the managers consider a programmer to be a software factory whose quality output is controllable, adjustable, or both. In actuality, hardware design can be controlled for reliability characteristics better than software design can. Design philosophy experiments that failed to enhance hardware reliability are again being formulated for software design (ref. 9). Quality and reliability are not the same. Quality is characteristic and reliability is probabilistic. Our approach draws the line between quality and reliability because quality is concerned with the development process and reliability is concerned with the operating product. Many models have been developed and a number of the measurement models show great promise. Predictive models have been far less successful partly because a data base (such as MIL-HDBK-217E, ref. 10) is not yet available for software. Software reliability often has to use other methods; it must be concerned with the process of software product development.

4.1 Hardware and Software Failures

Microprocessor-based products have more refined definitions. Four types of failure may be considered: (1) hardware catastrophic, (2) hardware transient, (3) software catastrophic, and (4) software transient. In general, the catastrophic failures require a physical or remote hardware replacement, a manual or remote unit restart, or a software program patch. The transient failure categories can result in either restarts or reloads for the microprocessor-based systems, subsystems, or individual units and may or may not require further correction. A recent reliability analysis of such a system assigned ratios for these categories. Hardware transient faults were assumed to occur at 10 times the hardware catastrophic rate, and software transient faults were assumed to occur at 100 to 500 times the software catastrophic rate.

The time of day is of great concern in reliability modeling and analysis. Although hardware catastrophic failures occur at any time of the day, they often manifest themselves during busier system processing times. On the other hand, hardware and software transient failures generally occur during the busy hours. When a system's predicted reliability is close to the specified reliability, a sensitivity analysis must be performed.

4.2 Manifestations of Software Bugs

Many theories, models, and methods are available for quantifying software reliability. Nathan (ref. 11) stated, "It is contrary to the definition of reliability to apply reliability analysis to a system that never really works. This means that the software which still has bugs in it really has never worked in the true sense of reliability in the hardware sense." Large complex software programs used in the communications industry are usually operating with some software bugs. Thus, a reliability analysis of such software is different from a reliability analysis of established hardware. Software reliability is not alone in the need for establishing qualitative and quantitative models.

In the early 1980's, work was done on a combined hardware/software reliability model. A theory for combining well-known hardware and software models in a Markov process was developed. A consideration was the topic of software bugs and errors based on experience in the telecommunications field. To synthesize the manifestations of software bugs, some of the following hardware trends for these systems should be noted: (1) hardware transient failures increase as integrated circuits become denser; (2) hardware transient failures tend to remain constant or increase slightly with time after the burn-in; and (3) hardware (integrated circuit) catastrophic failures decrease with time after the burn-in phase. These trends affect the operational software of communications systems. If the transient failures increase, the error analysis and system security software are called into action more often. This increases the risk of misprocessing a given transaction in the communications system. A decrease in the catastrophic failure rate of integrated circuits can be significant (ref. 12). An order-of-magnitude decrease in the failure rate of 4K memory devices between the first year and the twentieth year is predicted. We also tend to oversimplify the actual situations. Even with five vendors of these 4K devices, the manufacturing quality control person may have to set up different screens to eliminate the defective devices from different vendors. Thus, the system software will see many different transient memory problems and combinations of them in operation.

Central control technology has prevailed in communications systems for 25 years. The industry has used many of its old modeling tools and applied them directly to distributed control structures. Most modeling research was performed on large duplex processors. With an evolution through forms of multiple duplex processors and load-sharing processors and on to the present forms of distributed processing architectures, the modeling tools need to be verified. With fully distributed control systems, the software reliability model must be conceptually matched to the software design in order to achieve valid predictions of reliability.

The following trends can be formulated for software transient failures: (1) software transient failures decrease
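The category ratios quoted above (hardware transient at roughly 10 times the hardware catastrophic rate; software transient at 100 to 500 times the software catastrophic rate) can be turned into a small rate-budgeting sketch. The base rates and the chosen software ratio are illustrative assumptions, not figures from the analysis the text describes:

```python
def category_rates(hw_cat, sw_cat, sw_ratio=100.0):
    """Build the four failure-category rates from the two catastrophic base
    rates, using the ratios quoted in the text. Units are arbitrary
    (e.g., failures per million hours)."""
    if not 100.0 <= sw_ratio <= 500.0:
        raise ValueError("text quotes a software transient ratio of 100 to 500")
    return {
        "hardware catastrophic": hw_cat,
        "hardware transient": 10.0 * hw_cat,
        "software catastrophic": sw_cat,
        "software transient": sw_ratio * sw_cat,
    }

# Illustrative base rates: 2.0 and 1.0 failures per million hours
rates = category_rates(hw_cat=2.0, sw_cat=1.0, sw_ratio=250.0)
total = sum(rates.values())
```

Even at the low end of the quoted range, the transient categories dominate the total, which is consistent with the text's point that transients mainly drive restarts and reloads rather than repairs.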
as the system architecture approaches a fully distributed control structure, and (2) software transient failures increase as the processing window decreases (i.e., less time allowed per function, fast timing mode entry, removal of error checking, removal of system ready checks).

A fully distributed control structure can be configured ...

If such a modeling tool could additionally combine the effects of hardware, software, and operator faults, it would be a powerful tool for making design tradeoff decisions. Table I, an example of the missing link, presents a five-level criticality index for defects. These examples indicate the flexibility of such an approach to criticality classification.
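Table I itself does not survive in this copy. Purely as a hypothetical illustration of what a five-level criticality index for defects might look like (the level wordings below are invented, not the table's), such a classification can be encoded directly:

```python
# Hypothetical five-level defect criticality index; the wording of each level
# is an invented placeholder, not the actual content of table I.
CRITICALITY = {
    1: "no observable effect on service",
    2: "degraded performance, self-recovered",
    3: "partial loss of function, manual recovery",
    4: "loss of a major function",
    5: "total system outage",
}

def classify(level):
    """Map a defect's criticality index (1..5) to its description."""
    if level not in CRITICALITY:
        raise ValueError("criticality index must be 1..5")
    return CRITICALITY[level]
```

The flexibility the text mentions comes from the fact that the level definitions can be tailored per product while the index itself stays a simple 1-to-5 scale.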
Quality appears to have a third major factor in addition to product and process: the environment. People are important. They make the process or the product successful.

The next step is to discuss what the process of achieving quality in software consists of and how quality management is involved. The purpose of quality management for programming products is to ensure that a preselected software quality level has been achieved on schedule and in a cost-effective manner. In developing a quality management system, the programming product's critical life-cycle-phase reviews provide the reference base for tracking the achievement of quality objectives. The International Electrotechnical Commission (IEC) system life-cycle phases presented in their guidelines for reliability and maintainability management are (1) concept and definition, (2) design and development, (3) manufacturing, installation, and acceptance, (4) operation and maintenance, and (5) disposal.

In general, a phase-cost study shows the increasing cost of correcting programming defects in later phases of a programming product's life. Also, the higher the level of software quality, the more life-cycle costs are reduced.

4.4 Software Quality

The next step is to look at specific software quality items. Software quality is defined as "the achievement of a preselected software quality level within the costs, schedule, and productivity boundaries established by management" (ref. 10). However, agreement on such a definition is often difficult to achieve because metrics vary more than those for hardware, software reliability management has focused on the product, and software quality management has focused on the process. In practice, the quality emphasis can change with respect to the specific product application environment. Different perspectives of software product quality have been presented over the years. However, in today's literature there is general agreement that the proper quality level for a particular software product should be determined in the concept and definition phase and that quality managers should monitor the project during the remaining life-cycle phases to ensure the proper quality level.

The developer of a methodology for assessing the quality of a software product must respond to the specific characteristics of the product. There can be no single quality metric. The process of assessing the quality of a software product begins with the selection of specific characteristics, quality metrics, and performance criteria.

With respect to software quality, several areas of interest are (1) characteristics, (2) metrics, (3) overall metrics, and (4) standards. Areas (1) and (2) are applicable during both the design and development phase and the operation and maintenance phase. In general, area (2) is used during the design and development phase before the acceptance phase for a given software product. The following discussion will concern area (2).

4.5 Software Quality Metrics

The entire area of software measurements and metrics has been widely discussed and the subject of many publications. Notable is the guide for software reliability measurement developed by the Institute of Electrical and Electronics Engineers (IEEE) Computer Society's working group on metrics. A basis for software quality standardization was also issued by the IEEE. Software metrics cannot be developed before the cause and effect of a defect have been established for a given product with relation to its product life cycle. A typical cause-and-effect chart for a software product includes the process indicator. At the testing stage of product development, the evolution of software quality levels can be assessed by characteristics such as freedom from error, successful test case completion, and estimate of the software bugs remaining. For example, these process indicators can be used to predict slippage of the product delivery date and the inability to meet original design goals.

When the programming product enters the qualification, installation, and acceptance phase and continues into the maintenance and enhancements phase, the concept of performance is important in the quality characteristic activity. This concept is shown in table II, where the 5 IEC system life-cycle phases have been expanded to 10 software life-cycle phases.

4.6 Concluding Remarks

This section presented a snapshot of software quality assurance today. Continuing research is concerned with the use of overall software quality metrics and better software prediction tools for determining the defect population. In addition, simulators and code generators are being further developed so that high-quality software can be produced.

Process indicators are closely related to software quality and some include them as a stage in software development. In general, such measures as (1) test cases completed versus test cases planned and (2) the number of lines of code developed versus the number expected give an indication of the overall company or corporate progress toward a quality software product. Too often,
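The two process indicators named above can be sketched directly; the counts below are invented for illustration:

```python
def process_indicators(tests_completed, tests_planned, loc_developed, loc_expected):
    """The two progress measures named in the text, expressed as ratios:
    test cases completed vs. planned, and lines of code developed vs. expected."""
    return {
        "test_case_completion": tests_completed / tests_planned,
        "code_production": loc_developed / loc_expected,
    }

# Invented project snapshot:
ind = process_indicators(tests_completed=180, tests_planned=240,
                         loc_developed=9000, loc_expected=12000)
# Both ratios at 0.75: the project is about three-quarters of the way
# to its planned test and code volume.
```

Tracked over successive reviews, a drop in either ratio is exactly the kind of early slippage signal the text says process indicators can provide.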
TABLE II.--MEASUREMENTS AND PROGRAMMING PRODUCT LIFE CYCLE
[The 5 International Electrotechnical Commission (IEC) life-cycle phases have been expanded to 10 software phases.]
personnel are moved from one project to another, and thus the lagging projects improve but the leading projects decline in their process indicators. The life cycle for programming products should not be disrupted.

Performance measures, including such criteria as the percentage of proper transactions, the number of system restarts, the number of system reloads, and the percentage of uptime, should reflect the user's viewpoint.

In general, the determination of applicable quality measures for a given software product development is viewed as a specific task of the software quality assurance function. The determination of the process indicators and performance measures is a task of the software quality standards function.

5.0 RELIABILITY MANAGEMENT

To design for successful reliability and continue to provide customers with a reliable product, the following steps are necessary:

(1) Determine the reliability goals to be met.
(2) Construct a symbolic representation.
(3) Determine the logistics support and repair philosophy.
(4) Select the reliability analysis procedure.
(5) Select the data sources for failure rates and repair rates.
(6) Determine the failure rates and the repair rates.
(7) Perform the necessary calculations.
(8) Validate and verify the reliability.
(9) Measure reliability until customer shipment.

5.1 Goals and Objectives

Goals must be placed into the proper perspective. Because they are often examined by using models that the producer develops, one of the weakest links in the reliability process is the modeling. Dr. John D. Spragins, an editor for the IEEE Transactions on Computers, corroborates this fact with the following statement (ref. 13):

Some standard definitions of reliability or availability, such as those based on the probability that all components of a system are operational at a given time, can be dismissed as irrelevant when studying large telecommunication networks. Many telecommunication networks are so large that the probability they are operational according to this criterion may be very nearly zero; at least one item of equipment may be down essentially all of the time. The typical user, however, does not see this unless he or she happens to
be the unlucky person whose equipment fails; the system may still operate perfectly from this user's point of view. A more meaningful criterion is one based on the reliability seen by typical system users. The reliability apparent to system operators is another valid, but distinct, criterion. (Since system operators commonly consider systems down only after failures have been reported to them, and may not hear of short self-clearing outages, their estimates of reliability are often higher than the values seen by users.)

Reliability objectives can be defined differently for various systems. An example from the telecommunications industry (ref. 14) is presented in table III.

5.2 Specification Targets

A system can have a detailed performance or reliability specification that is based on customer requirements. The survivability of a telecommunications network is defined as the ability of the network to perform under stress caused by cable cuts or sudden and lengthy traffic overloads and after failures including equipment breakdowns. Thus, performance and availability have been combined into a unified metric. One area of telecommunications where these principles have been applied is the design and implementation of fiber-based networks. Roohy-Laleh et al. (ref. 15) state "...the statistical observation that on the average 56 percent of the pairs in a copper cable are cut when the cable is dug up, makes the copper network 'structurally survivable.'"

5.3 Human Reliability

The major objectives of reliability management are to ensure that a selected reliability level for a product can be achieved on schedule in a cost-effective manner and that the customer perceives the selected reliability level. The current emphasis in reliability management is on meeting or exceeding customer expectations. We can view this as a challenge, but it should be viewed as the bridge between the user and the producer or provider. This bridge is actually "human reliability." In the past, the producer was concerned with the process and the product and found reliability measurements that addressed both. Often there was no correlation between field data, the customer's perception of reliability, and the producer's reliability metrics. Surveys then began to indicate that the customer distinguished between reliability performance, response to order placement, technical support, service quality, etc.

Human reliability is defined (ref. 17) as "...the probability of accomplishing a job or task successfully by humans at any required stage in system operations within a specified minimum time limit (if the time requirement is specified)." Although customers generally are not yet requiring human reliability models in addition to the requested hardware and software reliability models, the science of human reliability is well established.

[Figure 6.--Specification targets (ref. 16): availability targets (percent) for Class 5 (system outage), Class 4 (loss of service), and Class 3 (service degradation) offices.]
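Steps (6) and (7) of the reliability procedure in section 5.0 can be sketched numerically. The sketch below is illustrative only: the constant failure-rate (exponential) model for units in series and the example rate values are my assumptions, not figures from this tutorial.

```python
import math

# Illustrative sketch of steps (6)-(7): given per-unit failure rates
# (lambda, failures per hour) and repair rates (mu, repairs per hour)
# from a chosen data source, compute mission reliability and
# steady-state availability for independent units in series.

def series_reliability(failure_rates, mission_hours):
    """R(t) = exp(-sum(lambda_i) * t) for independent units in series."""
    return math.exp(-sum(failure_rates) * mission_hours)

def series_availability(failure_rates, repair_rates):
    """Product of per-unit steady-state availabilities mu / (lambda + mu)."""
    a = 1.0
    for lam, mu in zip(failure_rates, repair_rates):
        a *= mu / (lam + mu)
    return a

lam = [2e-6, 5e-6]   # assumed failure rates, failures per hour
mu = [0.5, 0.25]     # assumed repair rates, repairs per hour
print(f"R(1000 h) = {series_reliability(lam, 1000):.4f}")      # 0.9930
print(f"A_ss      = {series_availability(lam, mu):.6f}")       # 0.999976
```

The exponential model is the same assumption behind handbook failure-rate predictions such as MIL-HDBK-217 (ref. 1); other distributions would change the calculation in step (7) but not the structure of the procedure.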
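The point of the Spragins quotation in section 5.1 can be illustrated with a few lines of arithmetic. The availability figures below are invented for illustration; they are not data from the tutorial or from ref. 13.

```python
# In a large network the probability that *every* component is up is
# near zero, yet the availability seen by a typical user, whose traffic
# touches only a few components, stays high. All numbers are assumed.

unit_availability = 0.999   # assumed availability of one component
n_network = 10_000          # assumed components in the whole network
n_user_path = 5             # assumed components one user's traffic touches

all_up = unit_availability ** n_network       # "all components operational"
user_sees = unit_availability ** n_user_path  # availability seen by one user

print(f"P(all components up)          = {all_up:.2e}")      # ~4.5e-05
print(f"Availability seen by one user = {user_sees:.5f}")   # ~0.99501
```

This is why a criterion based on the reliability seen by typical users is more meaningful than one based on every component being operational at once.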
can be quite different from the reliability that a customer will observe with a unit or system produced 5 years later, or with the last shipment. Because the customer's experience can vary with the maturity of a system, reliability growth is an important concept to customers and should be considered in their purchasing decisions.

One key to reliability growth is the ability to define the goals for the product or service from the customer's perspective while reflecting the actual situation in which the customer obtains the product or service. For large telecommunications switching systems, the rule of thumb for determining reliability growth has been that often systems have been allowed to operate at a lower availability than the specified availability goal for the first 6 months to 1 year of operation (ref. 18). In addition, component part replacement rates have often been allowed to be 50 percent higher than specified for the first 6 months of operation. These allowances accommodated craftspersons' learning patterns, software patches, design errors, etc.

Another key to reliability growth is to have its measurement encompass the entire life cycle of the product. The concept is not new; only here the emphasis is placed on the customer's perspective.

Reliability growth can be specified from "day 1" in product development and can be measured or controlled with a 10-year life until "day 5000." We can apply the philosophy of reliability knowledge generation principles, which is to generate reliability knowledge at the earliest possible time in the planning process and to add to this base for the duration of the product's useful life. To accurately measure and control reliability growth, we must examine the entire manufacturing life cycle. One method is the construction of a production life-cycle reliability growth chart.

In certain large telecommunications systems, the long installation time allows the electronic part reliability to grow so that the customer observes both the design and the production growth. Large complex systems often offer an environment unique to each product installation, which dictates that a significant reliability growth will occur. Yet, with the difference that size and complexity impose on resultant product reliability growth, corporations with large product lines should not present overall reliability growth curves on a corporate basis but must present individual product-line reliability growth pictures to achieve total customer satisfaction.
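One way to construct the production life-cycle reliability growth chart mentioned above is to fit a growth model to cumulative failure data. The Duane (log-log) model used below is a common choice, but it is my assumption; the tutorial does not prescribe a model, and the parameter values are invented for illustration.

```python
# Duane reliability growth model (assumed, not from the tutorial):
# cumulative MTBF grows as a power law of cumulative operating time T,
#   MTBF_c(T) = T**alpha / a,
# which plots as a straight line on log-log axes.

def duane_cumulative_mtbf(T, a, alpha):
    """Cumulative MTBF (hours) after T cumulative operating hours."""
    return T ** alpha / a

# Invented parameters: a scales the initial MTBF; alpha (0 < alpha < 1)
# is the growth rate, normally fitted from test or field failure data.
for T in (100, 1_000, 10_000):
    mtbf = duane_cumulative_mtbf(T, a=0.5, alpha=0.4)
    print(f"T = {T:6d} h   cumulative MTBF = {mtbf:7.1f} h")
```

Plotting such points from "day 1" of production through the end of the product's useful life gives the kind of product-line growth picture the text calls for.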
APPENDIX--COURSE EVALUATION

NASA SAFETY TRAINING CENTER (NSTC) COURSE EVALUATION

[Course evaluation form; items are rated on a 5-point scale from 5 (Excellent) through 3 (Fair) to 1 (Poor).]

3. How will the skills/knowledge you gained in this course/workshop help you to perform better in your job?

7. As a customer of the NASA Safety Training Center (NSTC), how would you rate our services?
REFERENCES

1. Reliability Prediction of Electronic Equipment. MIL-HDBK-217E, Jan. 1990.

2. Electronic Reliability Design Handbook. MIL-HDBK-338, vols. 1 and 2, Oct. 1988.

6. Laubach, C.H.: Environmental Acceptance Testing. NASA SP-T-0023, 1975.

7. Laube, R.B.: Methods to Assess the Success of Test Programs. J. Environ. Sci., vol. 26, no. 2, Mar.-Apr. 1983, pp. 54-58.

8. Test Requirements for Space Vehicles. MIL-STD-1540B, Oct. 1982.

11. [first part of entry not recovered] Workshop on Quantitative Software Models, IEEE, New York, 1979.

12. Schick, G.J.; and Wolverton, R.W.: An Analysis of Computing Software Reliability Models. IEEE Trans. Software Eng., vol. SE-4, no. 2, Mar. 1978, pp. 104-120.

15. Roohy-Laleh, E., et al.: A Procedure for Designing a Low Connected Survivable Fiber Network. IEEE J. Sel. Topics Commun., vol. SAC-4, no. 7, Oct. 1986, pp. 1112-1117.

16. Jones, D.R.; and Malec, H.A.: Communications Systems Performability: New Horizons. 1989 IEEE International Conference on Communications, vol. 1, IEEE, 1989, pp. 1.4.1-1.4.9.
REPORT DOCUMENTATION PAGE (Form Approved, OMB No. 0704-0188)

Author: Vincent R. Lalli. Work unit: WU-323-44-19. Distribution: Unclassified - Unlimited; Subject Category 18.
This tutorial summarizes reliability experience from both NASA and industry and reflects engineering practices that support current and future civil space programs. These practices were collected from various NASA field centers and were reviewed by a committee of senior technical representatives from the participating centers (members are listed at the end). The material for this tutorial was taken from the publication issued by the NASA Reliability and Maintainability Steering Committee (NASA Reliability Preferred Practices for Design and Test, NASA TM-4322, 1991). Reliability must be an integral part of the systems engineering process. Although both disciplines must be weighted equally with other technical and programmatic demands, the application of sound reliability principles will be the key to the effectiveness and affordability of America's space program. Our space programs have shown that reliability efforts must focus on the design characteristics that affect the frequency of failure. Herein, we emphasize that these identified design characteristics must be controlled by applying conservative engineering principles.
Subject terms: Design; Test; Practices; Reliability; Training; Flight proven.