GEG8124 Understanding Reliability

Study Notes 2006


Matthew McLeod #19303432

Contents
Contents.................................................................................................................................................1
STUDY GUIDE 1 – Introduction to Reliability....................................................................................8
Objectives...........................................................................................................................................8
Study Guide 1 Notes...........................................................................................................................8
Assurance Technologies...................................................................................................................8
Reliability Fundamentals...............................................................................................................10
System Effectiveness.....................................................................................................................10
Quality And Reliability..................................................................................................................10
Determination Of Cost Drivers......................................................................................................11
Introduction To Cost Effective Analysis........................................................................................11
Profitability Of Reliability.............................................................................................................11
O’Connor Chapter 1 - Introduction to Reliability Engineering........................................................12
Why Do Engineering Items Fail?..................................................................................................12
Probabilistic Reliability.................................................................................................................13
Repairable and Non Repairable Items...........................................................................................13
Non-Repairable Items....................................................................................................................13
Repairable Items............................................................................................................................14
Development of Reliability Engineering.......................................................................................14
Reliability As An Effectiveness Parameter....................................................................................14
Reliability Programme Activities...................................................................................................14
Reliability Economics And Management......................................................................................15
Smith Chapter 1 - The history of reliability and safety technology.................................................15
Definitions.....................................................................................................................................16
Failure Data....................................................................................................................................16
Hazardous Failures........................................................................................................................17
Reliability and Risk Prediction......................................................................................................17
Achieving Reliability and Safety-Integrity....................................................................................17
RAMS Cycle..................................................................................................................................18
Contractual Pressures.....................................................................................................................18

Last saved 10/31/2006 10:51:00 PM 1 Last printed 5/9/2006 01:59:00 PM


Smith Chapter 2 - Understanding terms and jargon.........................................................................19
Defining Failure And Failure Modes.............................................................................................19
Failure Rate And MTBF................................................................................................................19
Interrelationship Of Terms.............................................................................................................19
Bathtub Distribution......................................................................................................................19
Down Time And Repair Time........................................................................................................19
Availability, Unavailability And Probability Of Failure On Demand............................................20
Hazard And Risk Related Terms....................................................................................................20
Choosing The Appropriate Parameter............................................................................................20
Smith Chapter 3 - A cost-effective approach to quality, reliability and safety.................................21
Reliability And Cost.......................................................................................................................21
Costs And Safety............................................................................................................................21
Cost Of Quality..............................................................................................................................21
Study Guide 1 – Self Assessment Questions....................................................................................23
STUDY GUIDE 2 – Reliability in Management and Quality Control................................................24
Objectives.........................................................................................................................................24
AS2561-1982 Guide to the determination and use of quality costs.................................................24
O’Connor Chapter 15 – Reliability Management.............................................................................25
Corporate policy for reliability......................................................................................................25
Integrated reliability programmes..................................................................................................25
Reliability and Costs......................................................................................................................25
Safety and Product Liability..........................................................................................................25
Standards for reliability, quality and safety...................................................................................26
Specifying Reliability....................................................................................................................26
Smith Chapter 18 - Project Management.........................................................................................28
Setting Objectives and Specifications............................................................................................28
Planning, Feasibility and Allocation..............................................................................................28
Programme Activities....................................................................................................................28
Responsibilities..............................................................................................................................29
Standards and Guidance Documents.............................................................................................30
Smith Chapter 19 – Contract clauses and their pitfalls....................................................................31
Essential Areas...............................................................................................................................31
Other Areas....................................................................................................................................32
Pitfalls............................................................................................................................................33
Penalties.........................................................................................................................................34
Subcontracted Reliability Assessments.........................................................................................34

Smith Chapter 20 – Product liability and safety legislation.............................................................35
The general situation......................................................................................................................35
Strict Liability................................................................................................................................35
Insurance and Product Recall........................................................................................................35
Smith Chapter 21 – Major Incident Legislation...............................................................................36
Problem Areas................................................................................................................................36
Smith Chapter 22 – Integrity of safety-related systems...................................................................37
Safety-related or safety critical?....................................................................................................37
Study Guide 2 Self Assessment Questions.......................................................................................37
STUDY GUIDE 3 - Reliability in Design...........................................................................................39
Objectives.........................................................................................................................................39
AS2529-1982 Collection of Reliability, Availability and Maintainability Data for Electronics and
Similar Engineering Use...................................................................................................................39
1 Scope...........................................................................................................................................39
2 Application and Purpose.............................................................................................................39
3 Data Required.............................................................................................................................39
4 Guidelines...................................................................................................................................39
5 Reports........................................................................................................................................39
6 Field Performance Reports.........................................................................................................40
AS2530-1982 Presentation of Reliability Data on Electronic and Similar Components.................40
1 Scope...........................................................................................................................................40
2 Identification of Components Tested..........................................................................................41
3 Test Conditions...........................................................................................................................41
5 Data on changes in Characteristics.............................................................................................41
AS3960-1990 Guide to Reliability and Maintainability Program Management..............................42
1 Scope and General......................................................................................................................42
2 Reliability and Maintainability Program....................................................................................42
3 Specification of Reliability and Maintainability.........................................................................42
4 Assessment and prediction of Reliability and Maintainability...................................................43
5 Production, Flow, Analysis and Interpretation of Reliability and Maintainability Data............44
O’Connor Chapter 6 – Reliability Prediction and Modelling..............................................................46
Introduction....................................................................................................................................46
Fundamental Limitations of Reliability Prediction.......................................................................46
Reliability Databases.....................................................................................................................48
The Practical Approach..................................................................................................................48
System Reliability Models.............................................................................................................48
Availability of Repairable Systems................................................................................................49
Modular Design.............................................................................................................................50
Block Diagram Analysis................................................................................................................50
Fault Tree Analysis........................................................................................................................51
Petri Nets........................................................................................................................................51
State Space Analysis (Markov Analysis).......................................................................................51
Monte Carlo Simulation.................................................................................................................52
Reliability Apportionment.............................................................................................................52
Standard Methods for Reliability Prediction and Modelling.........................................................52
Conclusions....................................................................................................................................52
O’Connor Chapter 7 – Reliability in Design (not required).............................................................53
Introduction....................................................................................................................................53
Computer-Aided Engineering........................................................................................................53
Environments.................................................................................................................................53
Design Analysis Methods..............................................................................................................53
Quality Function Deployment.......................................................................................................53
Load Strength Analysis..................................................................................................................53
Failure Modes, Effects and Criticality Analysis (FMECA)...........................................................53
Reliability Predictions for FMECA...............................................................................................53
Hazard and Operability Study (HAZOPS)....................................................................................53
Parts, Materials and Processes Review (PMP)..............................................................................53
Non-Material Failure Modes.........................................................................................................53
Human Reliability..........................................................................................................................53
Design analysis for processes........................................................................................................53
Critical Items List..........................................................................................................................53
Summary........................................................................................................................................53
Management of Design Review.....................................................................................................53
Configuration Control....................................................................................................................53
O’Connor Chapter 8 – Reliability of Mechanical Components and Systems..................................54
Introduction....................................................................................................................................54
Mechanical Stress, Strength and Fracture......................................................................................54
Fatigue...........................................................................................................................................54
O’Connor Chapter 9 – Electronic System Reliability......................................................................55
Introduction....................................................................................................................................55
Reliability of Electronic Components..........................................................................................55
Component Types and Failure Mechanisms..................................................................................55
Summary of device failure modes.................................................................................................55

Circuit and System aspects............................................................................................................55
Electronic System Reliability Prediction.......................................................................................55
Reliability in electronic system design..........................................................................................55
Parameter variation and tolerances................................................................................................55
Design for production, test and maintenance.................................................................................55
O’Connor Chapter 10 – Software Reliability...................................................................................56
Introduction....................................................................................................................................56
Software in engineering systems...................................................................................................56
Software Errors..............................................................................................................................56
Preventing Errors...........................................................................................................................56
Software Structure and Modularity................................................................................................56
Programming Style........................................................................................................................56
Fault Tolerance...............................................................................................................................56
Redundancy/Diversity...................................................................................................................56
Languages......................................................................................................................................56
Data Reliability..............................................................................................................................56
Software Checking.........................................................................................................................56
Software Design Analysis Methods...............................................................................................56
Software Testing............................................................................................................................56
Error Reporting..............................................................................................................................56
Software Reliability Prediction and Measurement........................................................................56
Hardware/Software Interfaces.......................................................................................................56
Conclusions....................................................................................................................................56
Smith Chapter 17 – Systematic Failures, especially software..........................................................58
Programmable Devices..................................................................................................................58
Software-related Failures...............................................................................................................58
Software Failure Modelling...........................................................................................................58
Software Quality Assurance...........................................................................................................58
Modern/Formal Methods...............................................................................................................58
Software Checklists.......................................................................................................................58
Study Guide 3 Self Assessment Questions.......................................................................................58
STUDY GUIDE 4 - Reliability, Maintainability and Availability......................................................59
Objectives.........................................................................................................................................59
O’Connor Chapter 14 – Maintainability, Maintenance and Availability..........................................59
Introduction....................................................................................................................................59
Maintenance Time Distributions....................................................................................................59

Preventative Maintenance Strategy...............................................................................................60
FMECA and FTA in Maintenance Planning..................................................................................60
Maintenance Schedules..................................................................................................................60
Technology Aspects.......................................................................................................................61
Calibration.....................................................................................................................................62
Maintainability Predictions............................................................................................................62
Maintainability Demonstrations....................................................................................................62
Design for Maintainability.............................................................................................................62
Integrated Logistic Support...........................................................................................................62
O’Connor pages xxv – xxvi..............................................................................................................63
MIL-HDBK-472...............................................................................................................................63
SAE J817..........................................................................................................................................63
[Reader] Collacott Chapter 13 “Fault Analysis Planning and System Availability”...........................63
[Reader] Patton Chapter 8 “Reliability, Availability and Maintainability”......................................63
Study Guide 4 Self Assessment Questions.......................................................................................63
STUDY GUIDE 5 - Reliability Prediction and Modelling.................................................................64
Objectives.........................................................................................................................................64
O’Connor, Chapter 6 Conclusion (limitations for reliability modelling).........................................64
O’Connor Chapter 12 (pgs 341-346)................................................................................................64
Reliability Analysis of Repairable Systems...................................................................................64
Smith Chapter 8 – Methods of Modelling........................................................................................65
Block Diagrams and Repairable Systems......................................................................................65
Common Cause (Dependent) Failure (CCF).................................................................................66
Fault Tree Analysis........................................................................................................................66
Event Tree Diagrams.....................................................................................................................66
Smith Chapter 9................................................................................................................................67
Duane, J.T., Learning Curve Approach to Reliability Monitoring, IEEE Transactions on Aerospace,
Volume 2, Number 2, April 1964......................................................................................................67
Summary........................................................................................................................................67
The Learning Curve.......................................................................................................................67
Analysis.........................................................................................................................................67
Discussion......................................................................................................................................67
Crow, Larry, Evaluating the Reliability of Repairable Systems, Proceedings Annual Reliability and
Maintainability Symposium, 1990...................................................................................................68
Abstract..........................................................................................................................................68
Introduction....................................................................................................................................68
http://www.weibull.com/RelGrowthWeb/Crow-AMSAA_(N.H.P.P.).htm......................................69

Minitab Help File..............................................................................................................................69
Study Guide 5 Self Assessment Questions.......................................................................................69
STUDY GUIDE 6 - Reliability Testing...............................................................................................71
Objectives.........................................................................................................................................71
O’Connor Chapter 11 – Reliability Testing......................................................................................71
Introduction....................................................................................................................................71
Planning Reliability Testing...........................................................................................................72
Test Environments.........................................................................................................................72
Testing for Reliability and Durability: Accelerated Testing..........................................................72
Smith, Chapter 12 – .........................................................................................................................73
AS3960 Section 2.............................................................................................................................73
AS3960 Page 26...............................................................................................................................73
Self-Assessment Questions...............................................................................................................73
STUDY GUIDE 7 - Managing & Solving Reliability Problems........................................................75
O’Connor Chapter 12 – Analysing Reliability Data.........................................................................75
Introduction....................................................................................................................................75
Pareto Analysis..............................................................................................................................75
Accelerated Test Data Analysis.....................................................................................................75
Reliability Analysis of Repairable Systems...................................................................................75
CUSUM Charts..............................................................................................................................76
Exploratory Data Analysis and Proportional Hazards Modelling.................................................77
Reliability Demonstration..............................................................................................................77
Combining Results Using Bayesian Statistics...............................................................................77
Non-Parametric Methods...............................................................................................................77
Reliability Growth Modelling........................................................................................................78
O’Conner, Cautionary Note, page 22...............................................................................................79
Smith, Chapter 3...............................................................................................................................79
Study Guide 7: Self Assessment Questions......................................................................................79

Last saved 10/31/2006 10:51:00 PM 7 Last printed 5/9/2006 01:59:00 PM


STUDY GUIDE 1 – Introduction to Reliability
Objectives
Start developing answers to the following key questions:
• What is the practical significance of different Assurance Technologies?
• How are reliability issues addressed at each stage in the life cycle of a product or project? Are
there any shifts in focus? Are there nonetheless underlying principles that should remain
unshakeable through the life cycle?
• How might the primary cost drivers at the different stages in the life cycle be identified?
• Can system effectiveness and the life cycle costs be optimised or at least rationalised?
• At what stage in the life cycle can we most profitably focus our reliability efforts?

Study Guide 1 Notes


Assurance Technologies
We are interested in assuring ourselves that a product will perform within specified parameters during
its Life Cycle.

(Life Cycle
Life Cycle of a product includes the following typical phases:
• Concept
• Research & development
• Full scale development
• Production
• Operation and Support, and
• Disposal
In the Cat world, this is called NPI (New Product Introduction).
In most cases, 50-80% of the total costs are incurred during the operation and support phases,
which makes them an important focus for control of costs and losses.
New Product Introduction
https://npi.cat.com/
The New Product Introduction process simply builds on the 6 Sigma product and process
creation methodology, DMEDI, with which most Caterpillar employees are familiar. DMEDI
methodology is embedded within the NPI process, so any NPI program that follows the NPI
process meets 6 Sigma criteria. The NPI process is structured into phases, like DMEDI, but
includes more phases.
First, there is the Strategy phase, in which the groundwork is laid for all future decision-making
and all relevant strategies are aligned. Customers must be identified and segmented, and the program
charter must be drafted. The NPI program is registered and officially launched, and the NPI
team is commissioned at the very end of this phase.
Second, there is the Concept phase, in which the program elements outlined by the strategy
team are refined and solidified. The Concept phase is divided into three sub-phases, each of
which corresponds with a similarly named part of the DMEDI process: Define, in which the
program charter is refined by the newly commissioned NPI team and the alignment of relevant
strategies is reaffirmed, Measure, in which the market and customer research is completed and
prioritized, and Explore, in which the market and customer research is made tangible through
high-level conceptual designs of processes and product features.
Third, there is the Development phase, in which the high-level designs are further developed
and verified. The Development phase, which corresponds with the DMEDI Develop phase, is
divided into two sub-phases: Design, in which the detailed designs are constructed, and Verify,
in which the detailed designs are confirmed to meet program requirements. These sub-phases
are not sequential steps but continually cycling steps in which the designs are verified using
the appropriate combination of virtual and physical processes into an ever improving product.
Fourth, there is the Pilot phase, in which the verified designs are further validated through pilot
testing in actual customer situations, processes are validated and preparations for production
are made.
Finally, there is the Production phase, in which the product is produced, delivered and
supported worldwide)
The key action is to assure us about product performance. A family of processes generally known as
assurance technologies can establish the assurance we require. The primary assurance technologies
include:

Quality Assurance
This includes all those planned and systematic actions necessary to provide adequate confidence that a
product or service will satisfy given requirements for quality.

Human Factors Engineering


Sometimes called Ergonomics. The goal is to optimise the man-machine interface. Humans are
required to operate and maintain machines, and their ability to detect and respond to failure conditions
must be taken into account in the design of the product.

System Analysis

Product Safety
The chances of safety-related incidents must be eliminated - for example, those that might be caused
by misuse or design oversights. We are trying to eliminate design-induced defects.

Logistic Engineering
Includes the support-related activities that deal with system design and development. It covers the
support of the primary equipment and the support infrastructure.

Maintainability Analysis
Maintainability is the ease with which a machine can be repaired, i.e. how long it takes to repair. Analysis
includes the assessment of accessibility, interchangeability, modularity, standardisation, operator and
maintainer requirements, test and maintenance requirements, spares provisioning, and maintenance
policy. Refer AS3969-1990 Para 2.2.3.(m)

Reliability Analysis
Reliability is how often a machine needs maintenance or repair. Reliability depends upon product
design, component quality control, manufacturing processes and maintenance skills. Reliability
analysis aims to lower product failure rate over its life and reduce warranty costs. Early analysis of
hardware or software requirements and a clear understanding of requirements

Reliability Fundamentals
Two terms that are often used in describing Reliability are Failure Rate and Hazard Rate. Others are
MTTF and MTBF.
Reliability is the probability that the product will perform a specified function for a specified
operating interval under a specified set of conditions. The important criteria are thus probability,
function, interval and conditions. These four criteria must be defined and quantified,
otherwise reliability cannot be described.
Failure rate is the number of failures per unit time, and it can change over the life of the product.
Mean Time To Failure - MTTF is used to measure the average life of an item that is not usually
repaired (e.g. light bulb, circuit board). Note this is an average life and is often subject to wide
variation.
Mean Time Between Failures - MTBF is used to measure the average life of an item that is usually
repaired. MTBF is the reciprocal of failure rate when the failure rate is constant.
E.g. analysis of data shows 20 failures during 10000 operating hours.
Failure rate (lambda) = 20/10000 = 0.002 failures per hour
MTBF = 1/lambda = 1/0.002 = 500 hours between failures
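
The worked example above can be reproduced in a few lines of Python. This is only a sketch: the function names are made up for illustration, and the last step (reliability over a 100 hour mission) assumes a constant failure rate, so that R(t) = exp(-lambda * t). That exponential assumption is not stated in the example itself.

```python
import math

def failure_rate(failures: int, operating_hours: float) -> float:
    """Observed failure rate in failures per operating hour."""
    return failures / operating_hours

def mtbf(rate: float) -> float:
    """MTBF as the reciprocal of a (constant) failure rate."""
    return 1.0 / rate

def reliability(rate: float, hours: float) -> float:
    """Probability of no failure in `hours`, ASSUMING a constant failure rate."""
    return math.exp(-rate * hours)

lam = failure_rate(20, 10_000)      # 0.002 failures per hour
print(lam, mtbf(lam))               # 500 hours between failures
print(round(reliability(lam, 100), 4))  # chance of surviving a 100 h mission
```

Note that an MTBF of 500 hours does not mean every item lasts 500 hours: under the constant-rate assumption only about 37% of items would reach that age without a failure.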

System Effectiveness
When assessing a system, the fundamental principle is that the parts should be optimised as a
composite set, not as individual parts.
System Effectiveness is the probability that a system can effectively meet an operational demand
within a given time when operated under specific conditions. It is usually considered in terms of
technical performance, capability, availability and dependability.
• Capability is a measure of how well a product performs
• Availability is the probability the product is ready for use when needed
• Dependability is the probability of successful performance
• Durability is the point at which system wear-out starts to increase
Reliability is an inherent characteristic of design and cannot be altered without modification to the
design. Additional maintenance cannot make the system more reliable; it will simply make the system
more dependable.

Quality And Reliability


Quality can be described as a range of attributes inherent in a product, all of which influence its ability
to satisfy stated or implied needs. Reliability is just one attribute that impacts on most aspects of the
product throughout its life.

Determination Of Cost Drivers
There are many factors that impact on system cost. Reliability determines how often a part will need
repair, and maintainability will determine how long the maintenance will take. It is these two factors
that will determine the cost of maintenance.
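
A rough sketch of how the two factors combine into a maintenance cost is given below. The cost model and all figures (operating hours, labour rate, parts cost) are hypothetical and chosen only for illustration; they do not come from the study guide. Reliability enters through the MTBF (how often a repair is needed) and maintainability through the MTTR (how long each repair takes).

```python
def annual_maintenance_cost(hours_per_year: float, mtbf_hours: float,
                            mttr_hours: float, labour_rate: float,
                            parts_cost: float) -> float:
    """Expected yearly repair cost under a simple, hypothetical cost model."""
    failures_per_year = hours_per_year / mtbf_hours      # driven by reliability
    cost_per_repair = mttr_hours * labour_rate + parts_cost  # driven by maintainability
    return failures_per_year * cost_per_repair

# Hypothetical machine: 5000 h/year, MTBF 500 h, MTTR 4 h, $100/h labour, $600 parts
base = annual_maintenance_cost(5000, 500, 4, 100, 600)
more_reliable = annual_maintenance_cost(5000, 1000, 4, 100, 600)    # MTBF doubled
more_maintainable = annual_maintenance_cost(5000, 500, 2, 100, 600)  # MTTR halved
print(base, more_reliable, more_maintainable)
```

With these made-up numbers, doubling the MTBF halves the cost, while halving the MTTR only reduces the labour part of each repair.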

Introduction To Cost Effective Analysis


There are three key requirements for developing cost effective analysis.
1. The systems being evaluated must meet the same objectives
2. At least two feasible solutions must exist
3. Sufficient reliable data about the systems must be available to perform the analysis.
The following steps form a standardised approach to analysis:
• Clearly define all goals
• Determine evaluation criteria
• Select basis for selection (fixed cost or fixed effectiveness)
• Prepare analysis report
o Main Headings
   ▪ Objectives
   ▪ Assumptions
   ▪ Evaluation Criteria
   ▪ Analysis Techniques
   ▪ Conclusion
   ▪ Recommendations

Profitability Of Reliability
Profit = Revenue – Expense
Profit margins are sensitive to reliability-related costs, for example:
• Maintenance costs
• Inventory holding costs
• Warranty
• Product recall
• Product reject
• Down time
• Product liability

O’Conner Chapter 1 - Introduction to Reliability Engineering
The simplest view of reliability is that in which a product is assessed against a specification or set of
attributes, and when passed is delivered to the customer. The customer, having accepted the product,
accepts that it might fail at some future time. Reliability is usually concerned with failures in the time
domain. We come to the need for a time-based concept of quality. This distinction marks the
difference between quality control and reliability engineering. Whether a failure occurs or not and its
time to occurrence can seldom be forecast accurately. Reliability is therefore an aspect of engineering
uncertainty. Whether an item will work or not is usually answered as a probability. Thus the usual
engineering definition of reliability is:
"The probability that an item will perform a required function without failure under stated conditions
for a stated period of time".
"Durability" is a particular aspect of reliability, related to the ability to withstand the effects of
time-dependent mechanisms such as fatigue, wear, corrosion. Durability is usually expressed as a minimum
time before the occurrence of wear out failures.
Mathematical and statistical methods can be used for quantifying and analysing reliability data despite
significant uncertainty. In practice, the uncertainty is often in orders of magnitude, and appreciation of
the uncertainty is important in order to minimise the chances of performing inappropriate analysis and
generating misleading results.
Variability and chance play a vital role in determining the reliability of most products. Basic
parameters like mass, dimensions, friction coefficients, strengths and stresses are never absolute, but
are in practice subject to variability due to process and material variations, human factors and
applications.
Understanding the laws of chance and the causes and effects of variability is therefore necessary for
the creation of reliable products and for the solution of problems of unreliability.

Why Do Engineering Items Fail?


Knowing the potential causes of failure is fundamental to preventing them.
1. The design might be inherently incapable. It might be too weak, consume too much, suffer
resonance etc.
2. The item might be overstressed in some way. If the applied stress exceeds the strength then
failure will occur. Factors of safety and de-rating are two methods of providing some margin
between the strength of the component and the applied stress.
3. Failures might be caused by variation. In the situations above, the values of strength and load
are fixed and known. If, for example, the known load never exceeds the known strength, then
failure will not occur. However, in most cases there is uncertainty about both. The actual
strength of the population of components will vary, and the loads may be variable. If
there is an overlap between the distributions of load and strength, then there is potential for
failure to occur.
4. Failures can be caused by wear out. This includes any mechanism or process that causes an
item to degrade or become weaker with age.
5. Failures can be caused by other time-dependent mechanisms, such as battery run down, creep
caused by temperature and applied stress, and drift of electronic components.
6. “Sneaks” can cause failures. A "sneak" is a condition where the system does not work properly,
even though every part does. Sneaks can occur in software designs.

7. Failures can be caused by errors, such as incorrect specification, designs or software coding,
faulty assembly or test, inadequate or incorrect maintenance or by incorrect use.
8. There are many others, such as noisy parts, leaks, incorrect instructions, electromagnetic
interference etc.
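
Cause 3 in the list above (overlap between the load and strength distributions) can be put into numbers with a small sketch. It assumes that load and strength are independent and normally distributed, which the notes do not state, and all the figures are hypothetical. Under that assumption the probability of failure is the probability that strength minus load is negative.

```python
import math

def normal_cdf(z: float) -> float:
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def interference_failure_probability(mean_load: float, sd_load: float,
                                     mean_strength: float, sd_strength: float) -> float:
    """P(load > strength) for independent normal load and strength (an assumption)."""
    margin = (mean_strength - mean_load) / math.sqrt(sd_strength**2 + sd_load**2)
    return normal_cdf(-margin)

# Same mean load (300) and mean strength (400), different amounts of variation
p_wide = interference_failure_probability(300, 40, 400, 30)   # margin = 2 standard deviations
p_tight = interference_failure_probability(300, 10, 400, 10)  # margin of about 7
print(p_wide, p_tight)
```

The means are identical in both cases, so a factor of safety based on mean values alone would not distinguish them; only the spread of the two distributions changes the chance of failure.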

Probabilistic Reliability
The concept of reliability as a probability means any attempt to quantify it must involve statistical
methods. Reliability statistics are concerned with reliability values that are very high or very low.
Quantifying such numbers brings increased uncertainty, since we need correspondingly more
information. The application of statistics in reliability is less straightforward than in other areas. In
reliability we are concerned with the behaviour in the extreme tails of distributions, where variation is
hard to quantify and data is expensive. Further difficulties arise in application of statistics owing to
the fact that variation is often a function of time (cycles, seasons, maintenance periods etc). Therefore,
the reliability data from any past situation cannot be used to make credible forecasts of
future behaviour without taking into account non-statistical factors such as design changes, maintainer
training, or even production or service problems. The statistician working in reliability engineering
needs to be aware of these realities.

Repairable and Non Repairable Items


It is important to distinguish between repairable and non-repairable items when predicting or
measuring reliability.
For a non-repairable item such as a light bulb, reliability is the survival probability over the item's
expected life or for a period during its life when only one failure can occur. During an item's life the
instantaneous probability of the first and only failure is called the hazard rate. When a part fails in a
non-repairable system, the system fails, and the system reliability is a function of the time to the first
part failure.
For repairable items, reliability is the probability that failure will not occur in the period of interest,
when more than one failure can occur. It can be expressed as the failure rate or rate of occurrence of
failure (ROCOF). The failure rate expresses the instantaneous probability of failure per unit time
when several failures can occur in a time continuum.
Repairable system reliability can be described by the Mean Time Between Failure (MTBF), but only
under the particular condition of constant failure rate. We are also concerned with the availability of
the repairable item, since repair takes time. Availability is affected by the ROCOF and by maintenance
time. We therefore need to understand the relationship between reliability and maintenance, and how
reliability and maintainability affect availability.

Non-Repairable Items
There are three ways the pattern of failures can change with time.
1. The hazard rate may be decreasing
2. The hazard rate may be constant
3. The hazard rate may be increasing
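The hazard function h(t) is a function such that the probability that an item which has survived to age
t fails in the small interval t to t+δt is h(t) δt. This is the function, known loosely as the “failure rate”,
which is represented in the bathtub curve (BTC). So, h(t) = f(t)/R(t), where f(t) is the probability
density function of the time to failure and R(t) is the reliability (survival) function.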
Constant hazard rate is characteristic of failures caused by loads in excess of the design strength, at a
constant average rate. For example, overstress failures or maintenance-induced failures typically occur
randomly and at a generally constant rate. Material fatigue brought about by strength deterioration due
to cyclic loading is a failure mode which does not occur for a finite time, and then exhibits an
increasing probability of occurrence. Decreasing hazard rates are observed in items less likely to fail
as their survival time increases. This is often observed in electronic parts. The combined effect
generates the so-called bathtub curve. This shows an initial decreasing hazard rate or infant mortality
period, an intermediate useful life period, and a final wear out period.
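
The three hazard patterns can be illustrated with a short sketch that evaluates h(t) = f(t)/R(t) directly. The Weibull distribution is used here purely as a convenient assumption because a single shape parameter (beta) produces all three patterns; the parameter values are hypothetical.

```python
import math

def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """R(t): probability of surviving to age t."""
    return math.exp(-((t / eta) ** beta))

def weibull_pdf(t: float, beta: float, eta: float) -> float:
    """f(t): probability density of failure at age t."""
    return (beta / eta) * (t / eta) ** (beta - 1) * weibull_reliability(t, beta, eta)

def hazard(t: float, beta: float, eta: float) -> float:
    """h(t) = f(t) / R(t)."""
    return weibull_pdf(t, beta, eta) / weibull_reliability(t, beta, eta)

ages = (100.0, 500.0, 900.0)
for beta, pattern in [(0.5, "decreasing"), (1.0, "constant"), (3.0, "increasing")]:
    rates = [hazard(t, beta, 1000.0) for t in ages]
    print(pattern, [f"{r:.2e}" for r in rates])
```

A shape parameter below 1 gives the infant mortality region, a value of 1 gives the constant useful-life region, and a value above 1 gives the wear out region of the bathtub curve.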

Repairable Items
ROCOF can also vary with time, and important implications can be derived from these trends.
A constant failure rate (CFR) is indicative of externally induced failures, as in the constant hazard rate
situation for non-repairable items. A CFR is also typical of complex systems subject to repair and
overhaul, where different parts exhibit different patterns of failure with time and parts have different
ages due to repair or replacement.
Repairable systems can also show a decreasing failure rate (DFR) when reliability is improved by
progressive repair, as defective parts, which fail relatively early, are replaced by good parts. An increasing
failure rate (IFR) occurs in repairable systems when wear out failure modes of parts begin to
predominate.

Development of Reliability Engineering


Reliability engineering originated as a separate engineering discipline in the USA in the 1950s. The
increasing complexity of military electronic systems was generating failure rates which resulted in
greatly reduced availability and increased costs.
Against this background, the US DoD and the electronics industry jointly set up the Advisory Group
on Reliability of Electronic Equipment (AGREE) in 1952. The AGREE report concluded that
disciplines must be laid down as integral activities in the development cycle for electronic equipment.
The report also recommended that formal demonstrations of reliability, in terms of statistical
confidence of MTBF, be instituted as a condition for acceptance of equipment by the procuring
agency. AGREE testing soon became an accepted practice.
It became evident that designers, often working at the fringes of advanced technology, could not
produce highly reliable equipment without it being subjected to a test regime to show up its
weaknesses. The US DoD released the AGREE report as MIL-STD 781 "Reliability Qualification and
Production Approval Tests". Engineering reliability effort in the USA developed quickly, and the
AGREE and reliability program concepts were adopted by NASA and other suppliers/purchasers of
high tech equipment.
In 1965, the DoD issued MIL-STD 785 "Reliability Programs for Systems and Equipment". This
document mandated a program of reliability engineering, as it was realised that potential problems would
be detected and eliminated at the earliest and therefore cheapest stage in the development cycle. The
concept of life cycle cost (LCC) originated around this time. In the UK, Defence Standard 00-40 "The Management
of Reliability and Maintainability" and BS5760 "Guide on Reliability of Systems, Equipment and
Components" were issued. In the 1980s, the reliability of Japanese industrial and commercial goods
took Western competitors by surprise. The Japanese "quality revolution" had been driven by lessons
from American teachers Juran and Deming in the post-WWII recovery, and their teachings were based
on those of Peter Drucker.

Reliability As An Effectiveness Parameter

Reliability Programme Activities


What actions can managers and engineers take to influence reliability?

Quality Control is essential, and often sufficient to ensure reliability of simple products, such as
matches, or simple die-castings. Risks are low when safety margins can be made high, such as in
structural engineering.
Formal reliability programs are necessary when risks are high. Risk normally increases in proportion
to complexity or number of components in a system.
A reliability program must begin at the earliest (conceptual) phase of a project. It is at this stage that
fundamental decisions are made that significantly affect reliability.
As development proceeds from initial to detailed design, risks are controlled by a formal documented
approach to the review of design and imposition of design rules relating to components, materials, de-
rating, tolerance etc.
The program continues through initial prototype manufacturing and test stages, by planning and
executing tests to generate confidence in the design.
During production, QC ensures that the proven design is repeated.
Throughout the product life cycle, performance is fed back to generate corrective action and to
provide data and guidelines for the future.

Reliability Economics And Management


Since less than perfect reliability is the result of failures, and all failures have causes, we should ask,
"What is the cost of preventing or correcting the cause, compared with the cost of doing nothing?"
It is not easy to quantify the effects of a given reliability program. However Deming (Out of the
Crisis) showed that effort expended on a reliability program is an investment, demonstrated by the
success of the companies that adopted this teaching.
There are three kinds of engineering product:
1. Intrinsically reliable components (electronic components, mechanical non-moving
components, software).
2. Intrinsically unreliable components (light bulbs, turbine blades, parts in contact like gears and
bearings).
3. Systems (of many components and interfaces with many possibilities for failure to occur.)
The essential points are:
1. Failures are caused primarily by people (designers, suppliers, assemblers, users, maintainers).
Therefore, the achievement of reliability is essentially a management task.
2. Reliability and quality are not separate functions that can by themselves ensure the prevention of failures.
3. There is no fundamental limit to the extent to which failures can be prevented. We can design
and build for ever-increasing reliability.

Smith Chapter 1 - The history of reliability and safety technology


Since no human activity can enjoy zero risk, and no equipment a zero rate of failure, there has grown
a safety technology for optimizing risk.
This attempts to balance the risk against the benefits of the activities and the cost of further risk
reduction.

Definitions
Reliability Probability Density Function
The distribution of times to failure (for a PDF in general, the distribution of whatever value is in question). If
we measure a large number of points and further reduce the measurement interval, the frequency
histogram tends to a curve that describes the population probability density function (pg 32). The pdf
is a plot such that the area under the curve between any two ages is equal to the probability that a new
item fails in the given age interval. (This differs from the bathtub curve (BTC), in which the probability of failure is
conditional on the item having survived to the current age.)
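
The difference between the pdf (fraction of new items failing in an interval) and the conditional probability shown on the bathtub curve (fraction of survivors failing in an interval) can be seen in a small life-table sketch. The failure counts are invented for illustration, and the helper name is hypothetical.

```python
def life_table(failures_per_interval, population):
    """Return (interval, survivors at start, pdf estimate, hazard estimate) rows."""
    rows = []
    survivors = population
    for interval, failed in enumerate(failures_per_interval):
        pdf_estimate = failed / population    # relative to all NEW items
        hazard_estimate = failed / survivors  # relative to items still working
        rows.append((interval, survivors, pdf_estimate, hazard_estimate))
        survivors -= failed
    return rows

# Hypothetical test: 100 new items, 10 failures in each of four age intervals
rows = life_table([10, 10, 10, 10], population=100)
for interval, survivors, pdf_estimate, hazard_estimate in rows:
    print(interval, survivors, round(pdf_estimate, 3), round(hazard_estimate, 3))
```

In this made-up data the pdf estimate stays flat at 0.1, yet the conditional failure probability rises from interval to interval because the same number of failures is drawn from a shrinking group of survivors.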

Reliability Function
Corresponds to the probability that an item will survive to any given age.

Distribution Function (=cumulative distribution function cdf)


The probability of failure at or before age t.

ETA Value
Weibull scale parameter, also known as the characteristic life: the age by which 63.2% of the population has
failed. 63.2% (1 - 1/e) is the proportion failed by the mean life of an exponential distribution, which is
the model for random events.
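
As a quick check on the characteristic life, the sketch below evaluates the two-parameter Weibull distribution function F(t) = 1 - exp(-(t/eta)^beta) at t = eta. The eta and beta values are arbitrary; the point is that the fraction failed at t = eta is 1 - 1/e (about 0.632) whatever the shape parameter.

```python
import math

def weibull_cdf(t: float, beta: float, eta: float) -> float:
    """F(t): fraction of the population failed by age t (two-parameter Weibull)."""
    return 1.0 - math.exp(-((t / eta) ** beta))

eta = 2000.0  # hypothetical characteristic life in hours
for beta in (0.5, 1.0, 3.5):
    print(beta, round(weibull_cdf(eta, beta, eta), 4))
```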

Suspended item
When a test is run and ceases before a given item fails, it is a suspended item.

Random Failure
Beta = 1 in a Weibull distribution. As the item’s age increases there is no increasing risk of failure,
and the component should only be replaced on failure. This is typical of many electronic components,
where the risk of failure is constant over their lifetime.

Hazard Function
Hazard Function is the failure rate: more specifically, h(t) dt is the probability that an item that has
survived to age t fails in the small interval t to t+dt. This is the function that is represented by the
bathtub curve (BTC).

Life Cycle Cost


Sum of the acquisition, ownership and disposal costs for a product over its entire life cycle.

Failure Data
Reliability growth (reliability improvement), arising as a natural consequence of the analysis of
failure, has long been a central feature of product development. "Test and correct" was practiced long before
the development of formal processes for data collection and analysis, because failure is usually
self-evident and inevitably leads to design modifications. Nineteenth- and early twentieth-century designs
were less constrained by the cost and schedule pressures of today. In many cases, reliability was the result
of over-design. Quantified reliability assessment was not needed. Thus failure rates were
not required, and consequently there was little incentive for the formal collection of failure data. The
advent of the electronic age, and the experience with poor field reliability of military equipment in the
1940s and 1950s led to the need for more complex mass-produced component parts. This gave rise to
the collection of failure information from the field and from interpretation of test data. This activity
was stimulated by the development of reliability prediction techniques that require component failure
rates as inputs to the prediction equations.

Hazardous Failures
In the 1970s, process plants with large inventories of hazardous materials realised that the process of
learning from mistakes was no longer acceptable. Methods were developed for identifying hazards
and for quantifying the consequences of failure. These were evolved largely to assist decision-making
when developing or modifying plant. Pressure to identify and quantify risk came later. The techniques
for quantifying the predicted frequency of failures had previously been applied mostly in the domain of
availability, where the cost of failure was the prime concern. These techniques are now used in the field of
hazard assessment.

Reliability and Risk Prediction


The subject of reliability prediction, based on the concept of validly repeatable component failure
rates has become controversial. First, the extremely wide variability of failure rates of allegedly
identical components under supposedly identical environmental and operating conditions is now
acknowledged. The apparent precision of the reliability prediction tool is thus not compatible with the
accuracy of the failure rate parameter. As a result, it can be concluded that simple assessments of rates
and the use of simple models suffice. In any case, more accurate predictions can be misleading and a
waste of time and money.
The main benefit in reliability prediction of complex systems lies not in the absolute figure but in the
ability to repeat the assessment for different repair times, redundancy arrangements in the design
configuration and different values of component failure rates. Thus judgements can be made on the
basis of relative predictions with more confidence than can be placed on the absolute values.
In practice, prediction addresses the component-based "design reliability" and it is necessary to take
account of the additional factors when assessing the integrity of the system.
"Design reliability" is likely to be the figure suggested by a prediction exercise; however there will be
many sources of failure in addition to simple random hardware failures predicted in this way. Thus the
"achieved reliability" of a new product or system is likely to be an order of magnitude or more lower than
"design reliability". Reliability growth is the improvement that takes place as modifications are made as a
result of field failure information. A well-established design with tens of thousands of field hours
might start to approach the "design reliability".
As a result, because systematic failures cannot necessarily be quantified, two separate approaches
might be taken side by side:
1. Quantitative assessment - predict frequency of hardware failures.
2. Qualitative assessment - attempt to minimise the occurrence of systematic failures (e.g.
software) by applying a variety of defences and design disciplines appropriate to the severity
of the target.

Achieving Reliability and Safety-Integrity


If we try to identify the characteristics of design or construction which have secured longevity, then
three factors emerge:
1. Complexity: the fewer the components and the fewer the types of materials involved, then the
greater the likelihood of a reliable item.
2. Duplication/Replication: the use of additional, redundant, parts whereby a single failure does not
cause overall system failure is a frequent method of achieving reliability.
3. Excess strength: deliberate design to withstand stresses higher than anticipated will reduce failure
rates. Modern commercial pressures lead to the optimisation of tolerance and stress margins,
which just meet the functional spec.
The last two methods are costly, and the cost of reliability improvements needs to be paid for by a
reduction in failure rates or reduction in operating cost. We see therefore that reliability and safety are
"built-in" features of a product.
Maintainability also contributes to availability, since it is the combination of failure rate and
repair/down time that determines it.
Achieving reliability, safety and maintainability results from activities in three main areas:
1. Design: complexity, duplication, stress, testing, feedback
2. Manufacture: materials, methods, standards
3. Field use: operation, maintenance, feedback
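
The Duplication/Replication factor described earlier in this section can be illustrated with a minimal sketch. It assumes identical units that fail independently, all running at once, with the system working as long as at least one unit works. Those assumptions, the function name and the 0.90 figure are illustrative only.

```python
def parallel_reliability(unit_reliability: float, n_units: int) -> float:
    """Reliability of n redundant units, ASSUMING independent failures
    and that any single working unit keeps the system working."""
    return 1.0 - (1.0 - unit_reliability) ** n_units

single = parallel_reliability(0.90, 1)
duplicated = parallel_reliability(0.90, 2)
triplicated = parallel_reliability(0.90, 3)
print(round(single, 4), round(duplicated, 4), round(triplicated, 4))
```

Each extra unit adds cost but a smaller gain than the one before it, which is the trade-off noted above. Common cause failures, which break the independence assumption, would reduce the benefit further.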

RAMS Cycle
Loops shown in Figure 1.2 represent RAMS activities as follows:
1. Review of the system RAMS feasibility calculations against the initial RAMS targets
2. Review of the conceptual design RAMS predictions against the RAMS target
3. Review of the detailed design against the RAMS target
4. Review of the RAMS test, at the end of design and development, against the requirements
5. Review of the acceptance demonstration against requirements
6. Review of the field RAMS performance against the targets

Contractual Pressures
It is now common for reliability parameters to be specified in tender and other contractual documents.
There are problems arising from:
• Ambiguity of definition
• Hidden statistical risks
• Inadequate coverage of the requirements
• Unrealistic requirements
• Unmeasurable requirements

Smith Chapter 2 - Understanding terms and jargon
Defining Failure And Failure Modes
Failure: Non-conformance to some defined performance criterion
Quality: Conformance to specification
Reliability: The probability that an item will perform a required function under stated conditions for a
stated period of time. Reliability is the extension of quality into the time domain.
Maintainability: The probability that a failed item will be restored to operational effectiveness within a
given period of time when the repair action is performed in accordance with the prescribed
procedures.

Failure Rate And MTBF


The observed failure rate: For a stated period in the life of an item, the ratio of the total number of
failures to the total cumulative observed time. Failure rate is only meaningful for situations where it is
constant. Most failure rates are stated to two significant figures. It is seldom justified to exceed this
level of accuracy.
The observed mean time between failures: For a stated period in the life of an item, the mean value of
the length of time between consecutive failures, computed as a ratio of the cumulative observed time
to the total number of failures.
The equality MTBF = 1/failure rate must be treated with caution since it is inappropriate to compute
failure rate unless it is constant.
The observed mean time to fail: For a stated period in the life of an item the ratio of cumulative time
to the total number of failures.
Mean Life: The mean of the times to failure where each item is allowed to fail over the entire life
period.
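
The observed failure rate and observed MTBF definitions above can be sketched in a few lines. This is only an illustration: the function names and the sample figures (7 failures in 52,000 cumulative hours) are assumptions made for the example and do not come from Smith.

```python
from math import floor, log10


def observed_failure_rate(failures: int, cumulative_hours: float) -> float:
    """Ratio of total failures to total cumulative observed time."""
    if cumulative_hours <= 0:
        raise ValueError("cumulative observed time must be positive")
    return failures / cumulative_hours


def observed_mtbf(failures: int, cumulative_hours: float) -> float:
    """Ratio of cumulative observed time to total number of failures."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for an observed MTBF")
    return cumulative_hours / failures


def round_sig(value: float, figures: int = 2) -> float:
    """Round to the given number of significant figures."""
    if value == 0:
        return 0.0
    return round(value, figures - 1 - floor(log10(abs(value))))


# Hypothetical data: 7 failures in 52,000 cumulative operating hours.
rate = round_sig(observed_failure_rate(7, 52_000))
mtbf = round_sig(observed_mtbf(7, 52_000))
print(rate, mtbf)
```

Rounding to two significant figures follows the note that greater accuracy is seldom justified. Both results are only meaningful if the failure rate is roughly constant over the observed period.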

Interrelationship Of Terms

Bathtub Distribution
1. Decreasing Failure Rate (infant mortality, burn-in, early failures): Related to manufacture (welds,
joints, connections, dirt, impurities, cracks)
2. Constant Failure Rate (random failures, useful life, stress-related failures, stochastic failures):
Assumed to be stress related, random fluctuations of stress exceeding component strength.
3. Increasing Failure Rate (wear out failures): Owing to corrosion, oxidation, breakdown of
insulation, atomic migration, friction wear, shrinkage, fatigue.

Down Time And Repair Time


Elements of downtime and repair time:
a) Realisation Time
b) Access Time
c) Diagnosis Time
d) Spare Part Procurement
e) Replacement Time
f) Checkout Time
g) Alignment Time
h) Logistic Time
i) Administrative Time

Availability, Unavailability And Probability Of Failure On Demand


A = MTBF / (MTBF + MTTR) is known as the steady state availability.
Usually it is more convenient to use Unavailability or Probability of Failure on Demand (PFD)
PFD = 1 – A
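
A minimal sketch of the two formulas above, assuming MTBF and MTTR are expressed in the same time unit. The function names and the example values (MTBF of 4000 hours, MTTR of 8 hours) are hypothetical.

```python
def steady_state_availability(mtbf: float, mttr: float) -> float:
    """A = MTBF / (MTBF + MTTR)."""
    if mtbf <= 0 or mttr < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf / (mtbf + mttr)


def unavailability(mtbf: float, mttr: float) -> float:
    """PFD = 1 - A, treating unavailability and PFD as the same quantity."""
    return 1.0 - steady_state_availability(mtbf, mttr)


# Hypothetical item: MTBF of 4000 hours, MTTR of 8 hours.
a = steady_state_availability(4000.0, 8.0)
pfd = unavailability(4000.0, 8.0)
print(f"A = {a:.6f}, PFD = {pfd:.6f}")
```

Because A is usually very close to 1, the small unavailability figure is easier to compare between designs, which is why it tends to be the more convenient parameter.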

Hazard And Risk Related Terms


Hazard: potential for injury or fatality to occur
Risk: probability of an event occurring, in conjunction with the consequence (severity)
Hazard rate is the failure rate at any instant in time, that is, the rate of failure among items that have
survived to that instant. In practical terms, for a mature system, the hazard rate and constant failure
rate are assumed to be equal. The difference is: failure rate is an average and hazard rate is instantaneous.

Choosing The Appropriate Parameter

Smith Chapter 3 - A cost-effective approach to quality, reliability and safety
Reliability And Cost
Total costs incurred over the period of ownership of equipment are often referred to as Life-Cycle
Costs. These can be separated into:
• Acquisition costs
• Ownership Costs
• Operating Costs
• Administration Costs
They will be influenced by:
• Reliability - determines the frequency of repair
• Maintainability – affects training, equipment, downtime, and manpower
• Safety Factors – affects operating efficiency and maintainability.
Costs of carrying out RAMS-cycle predictions will usually be small compared with the potential
safety or life cycle cost savings.
Cost of carrying out RAMS prediction activities is in the order of 3-5% of total project cost. It is
credible that the assessment procedure will lead to savings, which exceed this outlay.

Costs And Safety


Once a hazardous event has been assessed, the costs of measures to reduce that risk are inevitably
considered. If the risk is sufficiently low, then reduction in risk for a given expenditure can be
examined to see if it can be justified. At this point, the concept of As Low As Reasonably Practicable
becomes relevant.

Cost Of Quality
Attempts to set budget levels for various elements of quality costs are rare. Quality costs can be
grouped under three headings:
1. Prevention Costs
- Design review
- Quality and reliability training
- Vendor quality planning
- Audits
- Installation prevention activities
- Product qualification
- Quality engineering
2. Appraisal Costs
- Test and inspect
- Maintenance & calibration
- Test equipment depreciation
- Line quality engineering
- Installation testing
3. Failure Costs
- Design changes
- Vendor rejects
- Rework
- Scrap & material renovation
- Warranty
- Commissioning failures
- Fault finding in test

Study Guide 1 – Self Assessment Questions
1. What are the key elements in the reliability definition?
Reliability can be defined as: “The probability that a product will perform a specified function for a
specified operating interval under a specified set of conditions.”

The key elements are thus probability, function, interval and conditions, and they need to be defined
and quantified otherwise reliability cannot be adequately described.
2. Failure rate is defined as?
Failure rate is the number of failures per unit time and its subsequent change over the life of a product.
3. Maintenance data shows that a component has 25 failures during the last 100,000 system
operating hours. The MTBF for the component is?
The Mean Time Between Failures can be calculated by simply dividing the total operating hours by
the number of failures.
In this case, 100000 hours/25 failures = 4000 hours, or stated in plain English, on average, we will
experience a failure of this component every 4000 system operating hours.
4. The failure rate of equipment will most likely vary in three distinct phases during its life.
What are these phases?
Decreasing Failure Rate (Weibull shape factor <1) (also known as infant mortality, burn-in, early
failures): related to manufacture (welds, joints, connections, dirt, impurities, cracks)
Constant Failure Rate (Weibull shape factor = 1) (also known as random failures, useful life,
stress-related failures, stochastic failures): assumed to be stress related, random fluctuations of stress
exceeding component strength.
Increasing Failure Rate (Weibull shape factor >1) (also known as wearout failures): corrosion,
oxidation, breakdown of insulation, atomic migration, friction wear, shrinkage, fatigue.
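
The link between the Weibull shape factor and the three phases can be checked with the standard two-parameter Weibull hazard function, h(t) = (shape/scale) * (t/scale)^(shape - 1). The sketch below is illustrative only; the function name, the scale of 1000 hours and the chosen shape values are assumptions.

```python
def weibull_hazard(t: float, shape: float, scale: float) -> float:
    """Instantaneous failure rate of a two-parameter Weibull distribution."""
    if t <= 0 or shape <= 0 or scale <= 0:
        raise ValueError("t, shape and scale must all be positive")
    return (shape / scale) * (t / scale) ** (shape - 1)


# Hypothetical scale of 1000 hours, evaluated at two points in time.
for shape in (0.5, 1.0, 3.0):
    early = weibull_hazard(100.0, shape, 1000.0)
    late = weibull_hazard(200.0, shape, 1000.0)
    trend = "decreasing" if late < early else "increasing" if late > early else "constant"
    print(f"shape {shape}: {trend}")
```

A shape factor below 1 gives a falling hazard rate, exactly 1 gives a constant rate equal to 1/scale, and above 1 gives a rising rate.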
5. The ratio of tolerance to process variation is called? What is another name for this process
variation?
This ratio is denoted Cp, and is called the process capability. If a product has a tolerance, and it is to
be produced by a process which generates variation in the product, it is obviously important that the
process variation be less than the tolerance.
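
As a rough sketch, Cp can be computed from measured data. It assumes the common convention that process variation is taken as six standard deviations of the measured characteristic; the function name and the sample measurements are hypothetical.

```python
from statistics import stdev


def process_capability(lower_limit: float, upper_limit: float, measurements: list[float]) -> float:
    """Cp = tolerance width / (6 * sample standard deviation)."""
    if upper_limit <= lower_limit:
        raise ValueError("upper limit must exceed lower limit")
    sigma = stdev(measurements)
    if sigma == 0:
        raise ValueError("measurements show no variation")
    return (upper_limit - lower_limit) / (6 * sigma)


# Hypothetical shaft diameters (mm) against a tolerance of 9.4 to 10.6 mm.
cp = process_capability(9.4, 10.6, [9.9, 10.0, 10.1])
print(f"Cp = {cp:.2f}")
```

A Cp above 1 means the tolerance band is wider than the assumed process spread. Cp says nothing about whether the process is centred within the tolerance.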

STUDY GUIDE 2 – Reliability in Management and Quality Control
Objectives
Discuss:
1. Why management must view reliability as a key attribute of the inputs, processes and outputs of
production
2. The close relationship between the methodologies of reliability and quality control
3. The costs and benefits of reliability and quality programs

AS2561-1982 Guide to the determination and use of quality costs

O’Conner Chapter 15 – Reliability Management
Corporate policy for reliability
A really effective reliability function can exist only in an organisation where the achievement of high
reliability is recognised as part of the corporate strategy and is given top management attention. If not,
reliability effort will be cut back whenever cost or time pressures arise.

Integrated reliability programmes


Reliability effort should be treated as an integral part of the product development, not a parallel
activity unresponsive to the rest of the development program. This is the justification for putting
reliability with the project manager.
Since production quality will be the final determinant of reliability, quality control is an integral part
of the reliability program. Quality control cannot make up for design shortfalls, but poor quality can
negate much of the reliability effort. QC can contribute effectively to reliability effort if:
1. QC procedures are related to factors that affect reliability
2. QC test and inspection data are integrated with other reliability data
3. QC personnel are trained to recognise the relevance of their work to reliability and trained and
motivated to contribute

Reliability and Costs


Achieving high reliability is expensive, especially when the product is complex or contains untried
technology. It requires trained engineers, management time, test equipment and products for testing,
but there are practical limits on what can be spent. The earlier in the development program the failure
mode is identified and corrected the cheaper it will be.
Obviously it is necessary to minimise the sum of quality and reliability costs over the longer term.
Thus the immediate costs of prevention must be related to the anticipated effects on failure costs.
Investment analysis related to Q & R is an uncertain business, because of the impossibility of
accurately predicting and quantifying the results. Therefore the analysis should be performed using a
range of assumptions to determine the sensitivity of the results to the assumed effects.
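
One way to picture such a sensitivity analysis is to vary the assumed effect of the prevention spend on failure costs and compare the totals. This is only a sketch under a very simple assumed cost model (total = prevention cost + remaining failure cost); the function names and all figures are hypothetical.

```python
def total_cost(prevention_cost: float, baseline_failure_cost: float, failure_reduction: float) -> float:
    """Prevention spend plus the failure cost remaining after the assumed reduction."""
    if not 0.0 <= failure_reduction <= 1.0:
        raise ValueError("failure_reduction must be a fraction between 0 and 1")
    return prevention_cost + baseline_failure_cost * (1.0 - failure_reduction)


def sensitivity(prevention_cost: float, baseline_failure_cost: float, reductions: list[float]) -> dict[float, float]:
    """Total cost for each assumed fractional reduction in failure costs."""
    return {r: total_cost(prevention_cost, baseline_failure_cost, r) for r in reductions}


# Hypothetical programme: 50,000 of prevention spend against 200,000 of failure costs.
results = sensitivity(50_000.0, 200_000.0, [0.1, 0.3, 0.5])
for reduction, cost in results.items():
    verdict = "pays off" if cost < 200_000.0 else "does not pay off"
    print(f"assumed reduction {reduction:.0%}: total {cost:,.0f} ({verdict})")
```

Under these assumed figures the investment only pays off if failure costs fall by more than a quarter, which shows how strongly the conclusion depends on the assumed effect.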

Cost of unreliability
The cost of unreliability in service should be evaluated early in the development phase, so that the
effort on reliability can be justified and requirements set, related to expected costs. There are other
costs, such as goodwill and market share. These can be hard to quantify. In extreme cases
unreliability can lead to litigation if damage or injury occurs.

Safety and Product Liability


Product liability was an outgrowth of the Ralph Nader campaigns in the USA, and it makes the
manufacturer of a product liable as a result of failure of his product. A designer can now be held
liable even if the product is old and the user did not maintain or operate it correctly. Claims can only
be defended successfully when the producer demonstrates he has taken all the practical steps towards
identifying and eliminating the risk.

Standards for reliability, quality and safety
US MIL-STD-785 – Reliability Programs for Systems and Equipments, Development and Production
Best known, covering all development programs in the US DoD.

UK Defence Standards 00-40 and 00-41


Covers reliability program management and methods for defence equipment

BS5760
Published for commercial use

ARMP-1
NATO standard on reliability and maintainability

ISO-IEC60300 Dependability
Covers reliability, maintainability and safety (“dependability”). Describes management and methods
related to product design and development. Covers reliability prediction, design analysis, reliability
demonstration tests, maths/statistical techniques. Manufacturing not included. Methods are
inconsistent with modern best practice, in particular sections on reliability testing define rigid
environmental and other conditions to be applied, and for pass/fail criteria based on statistical methods
described and rejected in O’Conner Ch 2 and 12.

ISO9000 Quality Systems


Framework for assessing the “quality management system” which an organisation operates in relation
to the goods or services produced. Developed from US MIL STD MIL-Q-9858 in the 1950s. Does
not specifically address the quality of the products or services, nor prescribe methods how one might
achieve quality. It describes the system, vaguely, that should be in place to assure quality.
Registration cannot be taken as assurance of quality.

Specifying Reliability
How NOT to do it:
1. Do not write vague requirements, such as “reliable as possible”. Such statements do not provide
assurance against reliability being compromised.
2. Do not write unrealistic requirements, such as “will not fail under the specified operating
conditions”. An unrealistically high reliability requirement will not be accepted as a credible design
parameter, and is therefore likely to be ignored.
The reliability specification must contain:
1. A definition of the failure related to the product's function. The definition should cover all failure
modes relevant to its function.
2. A full description of the environments in which the product will be stored, transported, operated
and maintained.
3. A statement of the reliability requirement, and/or a statement of the failure modes and effects
which are particularly critical and which must therefore have a very low probability of occurrence.

Definition of failure
Failure should always be related to a measurable parameter or a clear indication.

Environmental Specifications
The environment spec must cover aspects of the loads and other effects that can influence the product's
strength or probability of failure.

Stating the reliability requirement


The reliability requirement should be specified in a way that can be verified, and that makes sense in
relation to the use of the product. The simplest requirement is that no failure will occur under the stated
conditions. The requirement should not include statements on the s-confidence levels of the measured
reliability. The requirement relates to the population; s-confidence levels apply to the results of tests or
other limited sample data. S-Confidence limits may be used for pass/fail decision-making and test
planning, but should not be included with the requirement.
Reliability specifications based on life parameters must be framed in relation to the appropriate life
distributions. Two common parameters are MTBF (when a constant failure rate is assumed), and B-
life, related to Weibull life distributions. MTBF should not be specified if a constant failure rate
assumption cannot be justified. This assumption can usually be made for complex, repairable
systems. Otherwise a B-life should be specified.
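
For a two-parameter Weibull life distribution, the B-life is the time by which a given percentage of items is expected to have failed (B10 is the time to 10% failures). A minimal sketch, with a hypothetical function name and hypothetical parameter values:

```python
from math import exp, log


def b_life(percent_failed: float, shape: float, scale: float) -> float:
    """Time by which percent_failed % of items fail, for a two-parameter Weibull."""
    if not 0.0 < percent_failed < 100.0:
        raise ValueError("percent_failed must be between 0 and 100")
    if shape <= 0 or scale <= 0:
        raise ValueError("shape and scale must be positive")
    fraction = percent_failed / 100.0
    return scale * (-log(1.0 - fraction)) ** (1.0 / shape)


# Hypothetical wear-out item: shape 2.0, characteristic life 5000 hours.
b10 = b_life(10.0, 2.0, 5000.0)
# Check: the Weibull cumulative failure fraction at b10 should be 10%.
fraction_at_b10 = 1.0 - exp(-((b10 / 5000.0) ** 2.0))
print(f"B10 = {b10:.0f} hours, fraction failed = {fraction_at_b10:.2f}")
```

With a shape factor of 1 the distribution reduces to the constant failure rate case, where the characteristic life equals the MTBF.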
Specified life parameters must clearly state the life characteristic related to the duty cycle. The life
parameter may be stated as some time-independent function, e.g. miles travelled, or it may be stated as
a time, with a stipulated operating cycle.

Smith Chapter 18 - Project Management
Setting Objectives and Specifications
Realistic reliability and maintainability objectives need to be set with due regard for customer
requirements and cost constraints. Some discussion may be required to establish economic reliability
values which meet requirements and are achievable with proposed technology at the costs allowed for.
When specifying MTBF it is a common mistake to state a confidence level. In fact the MTBF
requirement stands alone. Addition of a confidence level implies a statistical demonstration would
suffice.
Vague statements should be avoided at all costs as they are subjective and cannot be measured and
thus cannot be demonstrated or proved.
Engineering requirements should include:
1. Functional description: speeds, functions, human interfaces and operating periods.
2. Environment: temperature, humidity, etc.
3. Design life: related to wearout and replacement policy.
4. Physical Parameters: size and weight restrictions, power supply limits.
5. Standards: BS, US MIL, Def Con, etc., standards for materials, components and tests.
6. Finishes: appearance and materials.
7. Ergonomics: human limitations and safety considerations.
8. Reliability, availability and maintainability: module reliability and MTTR objectives. Equipment
R and M related to module levels.
9. Manufacturing quantity: Projected manufacturing levels – First off, Batch, Flow.
10. Maintenance philosophy: Type and frequency of preventive maintenance. Repair level, method of
diagnosis, method of second-line repair.

Planning, Feasibility and Allocation


The design and assurance activities in this book simply will not take place unless there is real
management understanding and commitment to a reliability and maintainability program with specific
resources allocated.
There are three levels of RAM measurement:
Prediction: a modelling exercise which relies on the validity of historical failure rates to the design in
question. This provides the lowest level of confidence.
Statistical Demonstration Test: provides sample failure information, normally from a test environment
rather than the field. Provides more confidence than paper Prediction but still subject to statistical risk
and limitation of the test environment.
Field Data: Except in the case of a very high reliability system, realistic numbers of failures are
obtained and can be used in a reliability growth program as well as for comparison to the original
target.

Programme Activities
Extent of activities will depend upon:
• The severity of the requirement.
• The complexity of the product.
• Time and cost constraints.
• Safety considerations.
• The number of items to be produced.
A safety and reliability plan must be produced for each project/product. Without this there is nothing
to audit progress against, and no formal measure of progress.
Activities might include:
• Feasibility Study
• Setting objectives
• Contract requirements
• Design reviews
o Electrical factors
o Software reliability
o Mechanical features
o Quality & reliability, testing, RAM predictions & demonstrations, FMEA, test
equipment, procedures
o Maintenance philosophy, policy, MTTR prediction, resource forecasts, training and
manuals
o Purchased items
o Manufacturing and installation, tolerancing, burn-in, packaging and transport, costs
o Other, patents, value engineering, safety, documentation standards and product
reliability
• RAM predictions
• Design trade offs
• Prototype tests
• Parts selection and approval
• Demonstrations
• Spares Provisioning
• Data Collection and Failure Analysis
• Reliability growth
• Training

Responsibilities
Reliability and maintainability are engineering parameters and the responsibility for their achievement
is therefore primarily with the design team. Quality assurance techniques play a vital role in achieving
the goals but cannot be used to ‘test in’ reliability to a design which has its own inherent level.

Standards and Guidance Documents
BS5760 Reliability of systems, equipment and components
Part 1 is Guide to Reliability Programme Management and outlines the reliability activities such as
those above. Other parts deal with prediction, data, practices and so on.

UK Ministry of Defence 00-40 Reliability and Maintainability


Parts 1 and 2 are concerned with project requirements and the remainder with documents, training,
procurement and so on.

US MIL-STD-785A Reliability Program for Systems and Equipment Development and Production
Specifies programme plans, reviews, predictions and so on.

US MIL-STD-470 Maintainability Programme Requirements


Program plan and activities for design criteria, design review, trade offs, data collection predictions
and status reporting.

Smith Chapter 19 – Contract clauses and their pitfalls
Essential Areas
Two types of pitfalls arise from contractual conditions:
1. Those due to the omission of essential conditions or definitions
2. Those due to inadequately worded conditions which present ambiguities, concealed risk,
eventualities unforeseen by one or both parties etc
The following headings are essential if reliability or maintainability is to be specified.

Definitions
If MTTR is specified then the meaning of repair time must be defined in detail. MTTR is often used
when mean down time is intended.
Failure itself must be thoroughly defined at system and module levels. It may be necessary to define
more than one type of failure, or failures for different operating modes (eg in flight or on the ground)
in order to describe all the requirements. MTBFs might then be ascribed to different failure types.
MTBF and failure rates often require clarification of “failure” and “time”.
The bathtub curve depicts early, random and wearout failures. Reliability parameters usually refer to
random failure unless stated to the contrary, it being assumed that burn-in failures are removed by
screening and wearout is eliminated as far as possible by preventative replacement.
Parameters should not be used without due regard to their meaning and applicability. Failure rate, for
example, has little meaning except when describing random failures. Availability, MTBF or reliability
should be specified in preference.
Reliability and maintainability are often combined by specifying Availability. This can be defined in
more than one way, and should thus be clearly specified. The usual form is Steady State Availability
(MTBF/(MDT+MTBF)).

Environment
A common mistake is to fail to specify the environmental conditions under which the product is to
work. The spec is often confined to temp range and humidity, which may not be sufficient. Other
parameters include pressure, vibration and shock, chemical attack, power supply
variations/interference, radiation, human factors and many others. The combination or cycling of
parameters may have significant results.
Where equipment is used as standby or held as spares, the conditions will be different to those
experienced by operating units. It is often assumed that because a unit is not powered or is stored, it
will not fail. In fact this environment might be more conducive to failure. Transport environmental
conditions and liabilities for component failures should also be considered.
Maintainability can also be influenced by environment. Conditions can influence repair times, since
the use of particular protective clothing, remote handling devices and safety precautions increases the
active elements of repair time.

Maintenance Support
The provision of spares, test equipment, personnel, transport and the maintenance of such is a
responsibility that must be described in the contract and the supplier must be conscious of the risks
involved in the customer not meeting their side of the agreement.
Levels of skill and training should be specified.

Maintenance philosophy must be defined as it plays a part in defining reliability when under the
client's control.
MTTR and any automatic fault identification need to be specified up front, as the cost and delay
involved in designing the product to incorporate such features are likely to be considerable.

Demonstration and prediction


In the case of a maintainability demonstration, it is essential to define the tools and equipment, the
maintenance instructions, test environment, technician level, task selection, spares, and level of repair.
In the case of a reliability demonstration, it is essential to define environmental conditions, allowable
failures (eg maintenance induced), operating mode, preventive maintenance, burn-in, testing costs.
Statistical risks apply and the supplier needs to calculate the probability of failing the test with good
equipment and the customer that of passing inadequate goods.
Consider that if 100 items of equipment meet their stated MTBF under random failure conditions,
then after operating for a period equal to one MTBF, 63 of them, on average, will have failed.
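
The figure of 63 follows from the exponential reliability function for a constant failure rate, R(t) = exp(-t/MTBF), which leaves only about 37% of items surviving at t = MTBF. A short check, with a hypothetical function name and an arbitrary MTBF value:

```python
from math import exp


def fraction_failed(operating_time: float, mtbf: float) -> float:
    """Expected fraction failed by operating_time, assuming a constant failure rate."""
    if mtbf <= 0 or operating_time < 0:
        raise ValueError("MTBF must be positive and operating time non-negative")
    return 1.0 - exp(-operating_time / mtbf)


# 100 items run for one MTBF (4000 hours is an arbitrary illustrative value).
expected_failed = 100 * fraction_failed(4000.0, 4000.0)
print(f"Expected number failed: {expected_failed:.0f}")
```

The result does not depend on the MTBF value chosen, only on the ratio of operating time to MTBF.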
From a supplier's point of view, a warranty period is a form of reliability demonstration since, having
calculated the expected number of failures during the warranty period, there is a probability that more
will occur.
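
The warranty risk can be roughed out with a Poisson model. The sketch assumes a constant failure rate and that failed units are promptly repaired or replaced, so the expected number of failures is units x warranty hours / MTBF. The function names and figures are hypothetical.

```python
from math import exp, factorial


def expected_failures(units: int, warranty_hours: float, mtbf: float) -> float:
    """Expected warranty failures under a constant failure rate."""
    return units * warranty_hours / mtbf


def prob_more_than(count: int, expected: float) -> float:
    """Poisson probability of observing more than `count` failures."""
    cumulative = sum(exp(-expected) * expected**k / factorial(k) for k in range(count + 1))
    return 1.0 - cumulative


# Hypothetical contract: 100 units, 1000-hour warranty, MTBF of 50,000 hours.
m = expected_failures(100, 1000.0, 50_000.0)
risk = prob_more_than(2, m)
print(f"Expected failures: {m:.1f}, chance of more than 2: {risk:.2f}")
```

Under these assumed figures, equipment that exactly meets its MTBF still has roughly a one in three chance of exceeding the expected number of warranty failures.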

Liability
The exact nature of a supplier’s liability must be spelt out, including the maximum penalty that can
be incurred.
If part of the liability for failure or repair is to fall to some other subcontractor, then care must be
taken in defining each party’s area.

Other Areas
Reliability and maintainability programme
Sometimes the R&M activities are specified in the contract. In a development contract this allows the
customer to monitor activities against agreed milestones. Sometimes standard programs are used:

BS5760 Reliability of systems, equipment and components


Part 1 is Guide to Reliability Programme Management and outlines the reliability activities such as
those above. Other parts deal with prediction, data, practices and so on.

BS5760 Reliability of systems, equipment and components


Part 5 Reliability programs for equipment

US MIL-STD-785 Reliability Program for Systems and Equipment Development and Production
Specifies programme plans, reviews, predictions and so on.

US MIL-STD-470 Maintainability Programme Requirements


Program plan and activities for design criteria, design review, trade offs, data collection predictions
and status reporting.

Reliability and maintainability analysis


The supplier may be required to offer a detailed reliability or maintainability prediction together with
an explanation of the techniques and data used.

Storage
If equipment is stored for some time then the storage conditions and durations will have to be defined.
The same applies to storage and transport of spares and test equipment.

Design standards
Specific standards are sometimes described or referenced. A problem exists that these standards are
very detailed and most manufacturers have their own version. The fine detail can be overlooked until
some formal acceptance inspection takes place, by which time retrospective action is difficult, time
consuming and costly.

Pitfalls
The following lists those aspects of Reliability and Maintainability likely to be mentioned in an
invitation to tender or in a contract.

Definitions
The most likely area of dispute is the definition of what constitutes a failure and whether or not a
particular incident ranks as one. There are levels of failure, types of failure, causes of failure
and effects of failure. Careful definition of failure types covered by the contract is therefore
important.

Repair Time
Repair times can be grouped into active and passive elements. Broadly speaking, the active elements
are dictated by system design and passive by maintenance and operating arrangements. For this
reason, the supplier should never guarantee any part of the repair time that is influenced by the user.

Statistical risks
In both maintainability and reliability tests, producer and consumer risks apply.

                  Conclusion Drawn
True State        Accept Ho                                Reject Ho
Ho True           Correct                                  Type I (α-risk, producer risk)
Ho False          Type II (β-risk, consumer risk)          Correct
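
As an illustration of the two risks, consider a hypothetical fixed-time reliability test that is passed if no more than an allowed number of failures occurs in the total test time. Assuming a constant failure rate, the number of failures is Poisson distributed. The test plan, function names and MTBF values below are assumptions made for the example.

```python
from math import exp, factorial


def prob_accept(test_hours: float, allowed_failures: int, true_mtbf: float) -> float:
    """Probability of seeing no more than allowed_failures in test_hours."""
    m = test_hours / true_mtbf  # expected failures during the test
    return sum(exp(-m) * m**k / factorial(k) for k in range(allowed_failures + 1))


def producer_risk(test_hours: float, allowed_failures: int, good_mtbf: float) -> float:
    """Alpha: chance that equipment meeting the target still fails the test."""
    return 1.0 - prob_accept(test_hours, allowed_failures, good_mtbf)


def consumer_risk(test_hours: float, allowed_failures: int, poor_mtbf: float) -> float:
    """Beta: chance that inadequate equipment passes the test."""
    return prob_accept(test_hours, allowed_failures, poor_mtbf)


# Hypothetical plan: 3000 test hours, pass with 2 failures or fewer.
alpha = producer_risk(3000.0, 2, good_mtbf=3000.0)
beta = consumer_risk(3000.0, 2, poor_mtbf=1000.0)
print(f"producer risk = {alpha:.2f}, consumer risk = {beta:.2f}")
```

Lengthening the test or tightening the allowed number of failures shifts the balance between the two risks, which is why both parties need to calculate them before agreeing to a test plan.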

Quoted specifications
Sometimes a reliability or maintainability program or test plan is specified by calling up a published
standard. The danger is the possibility that not all the quoted terms are suitable and the standard will
not be studied in every detail.

Environment
If environmental factors are likely to be present in the field then they must be specifically allowed for
in the design and price. It may not be desirable to specify every parameter possible since this leads
to over-design.

Liability
When stating the supplier’s liability it is important to establish its limit in terms of both cost and time.
Suppliers must ensure they know when they are finally free of liability.

In summary
The biggest pitfall is to assume either party wins any advantage from ambiguity or looseness in the
conditions of a contract. The effort expended in a dispute far outweighs any advantage that might have
been secured. If every effort is made to cover all the areas as clearly and simply as possible then both
parties will gain.

Penalties
Any cash penalty must be a genuine and reasonable pre-estimate of the damages thought to result
from a system outage.

Apportionment of costs during guarantee


The customer should never be permitted to benefit from poor maintenance, therefore any arrangement
by which the supplier pays for maintenance the customer undertakes should be avoided.

Payment according to down time

In summary

Subcontracted Reliability Assessments

Smith Chapter 20 – Product liability and safety legislation
Product liability is the liability of a supplier, designer or manufacturer to the customer for injury or
loss resulting from a defect in that product.

The general situation


Legislation generally requires that goods are of merchantable quality and are reasonably fit for the
purpose intended.
Common law relates to the Tort of Negligence, for which a claim for damages can be made. The onus
is on the plaintiff to prove negligence, which requires proof:
1. That the product was defective
2. That the defect was the cause of the injury
3. That this was foreseeable and that the defendant failed in his or her duty of care.
The present situation involves a form of strict liability but:
• Privity of Contract excludes third parties in contract claims
• The onus is to prove negligence unless the loss results from a breach of contract
• Exclusion clauses involving death and personal injury are void

Strict Liability
This concept hinges on the idea that liability exists for no other reason than the mere existence of a
defect. No breach of contract or act of negligence is required in order to incur responsibility.

Insurance and Product Recall


Product liability trends tend to:
• Increase the number of claims
• Increase premiums
• Generate separate Product Liability Policies
• Involve insurance companies in defining quality and reliability standards and procedures
• Require the designer to insure the customer against genuine and frivolous consumer claims.
A design defect causing a potential hazard to life, health or safety may become evident when a number
of products are already in use. It may then become necessary to recall a batch of items. The extent
will be determined by the nature of the defect. A full evaluation of the hazard must be made and a
report prepared.

Smith Chapter 21 – Major Incident Legislation
Problem Areas
Reports must be site specific and the use of generic procedures and justifications is to be discouraged.
Adopting a similar report is valid provided care is taken to ensure the end result is site specific.
The hazards from a dangerous substance may be various and it is necessary to consider secondary as
well as primary hazards.
The events which could lead to the major accident scenario have to be identified fully. The fault tree
approach (Ch 8) needs to identify all the initiators of the tree. This is an open ended problem in that
it is a subjective judgement as to when they have all been listed. An obvious checklist would include
(in addition to hardware failure):
• Earthquake
• Human error
• Software
• Vandalism, terrorism
• External collision
• Meteorology
• Out of spec substances

Smith Chapter 22 – Integrity of safety-related systems
Safety-related or safety critical?
“Safety critical” has tended to be used where the hazard leads to a fatality, whereas “safety related”
has been used in a broader context. There are many definitions, all of which vary slightly:
• Some distinguish between multiple and single deaths
• Some include injury, illness and incapacity without death
• Some include effects on the environment
• Some include system damage
Safety-related systems are those which, singly or together with other safety-related systems, achieve
or maintain a safe state for equipment under their control.
Safety critical systems are those which on their own achieve or maintain a safe state for the equipment
under their control.

Study Guide 2 Self Assessment Questions


1. Quality operations cost categories are:
These quality costs are the costs of activities directed at reliability and quality control and the cost of
failure. Costs are usually considered in these three categories:
a. Prevention Costs
b. Appraisal Costs
c. Failure Costs
Prevention Costs are those related to activities that prevent failures from occurring, such as reliability
activities, quality control of purchased goods and materials, training and management.
Appraisal costs are those related to test and measurement, process control and quality audit.
Failure costs include internal and external costs. Internal costs include scrap and rework incurred
during manufacture. External costs include warranty.
2. Product liability
Product liability is the liability of a supplier, designer or manufacturer to the customer for injury or
loss resulting from a defect in that product. Product liability was an outgrowth of the Ralph Nader
campaigns in the USA, and it makes the manufacturer of a product liable as a result of failure of his
product. A designer can now be held liable even if the product is old and the user did not maintain or
operate it correctly. Claims can only be defended successfully when the producer demonstrates he has
taken all the practical steps towards identifying and eliminating the risk.
3. Reliability is the responsibility of the:
Reliability and maintainability are engineering parameters and the responsibility for their achievement
is therefore primarily with the design team. Quality assurance techniques play a vital role in achieving
the goals but cannot be used to ‘test in’ reliability to a design which has its own inherent level.
4. A reliability specification must contain:
The reliability specification must contain:
a. A definition of the failure related to the product's function. The definition should cover all
failure modes relevant to its function.
b. A full description of the environments in which the product will be stored, transported, operated and
maintained.
c. A statement of the reliability requirement, and/or a statement of the failure modes and
effects which are particularly critical and which must therefore have a very low probability
of occurrence.
Engineering requirements should include:
a. Functional description: speeds, functions, human interfaces and operating periods.
b. Environment: temperature, humidity, etc.
c. Design life: related to wearout and replacement policy.
d. Physical Parameters: size and weight restrictions, power supply limits.
e. Standards: BS, US MIL, Def Con, etc., standards for materials, components and tests.
f. Finishes: appearance and materials.
g. Ergonomics: human limitations and safety considerations.
h. Reliability, availability and maintainability: module reliability and MTTR objectives.
Equipment R and M related to module levels.
i. Manufacturing quantity: Projected manufacturing levels – First off, Batch, Flow.
j. Maintenance philosophy: Type and frequency of preventive maintenance. Repair level,
method of diagnosis, method of second-line repair.
5. When and why should the costs of unreliability be evaluated?
The cost of unreliability in service should be evaluated early in the development phase, so that the
effort on reliability can be justified and requirements set, related to expected costs. There are other
costs, such as goodwill and market share. These can be hard to quantify. In extreme cases
unreliability can lead to litigation if damage or injury occurs.

STUDY GUIDE 3 - Reliability in Design
Objectives
You should be able to:
• Choose the appropriate Australian or international standard to use for reliability design
• Recognise the specific techniques for each of the different design processes
• Recognise the importance of design to cost goal setting

AS2529-1982 Collection of Reliability, Availability and Maintainability Data for Electronics and Similar Engineering Use
1 Scope
This standard provides guidance for the collection of reliability data relating to the field performance
of electronic items but may also be applicable to other engineering items.

2 Application and Purpose


The specific objectives of the collection of reliability data are:
a. To provide for a survey of the actual reliability,
b. To provide data for improving reliability
c. To provide data for the organisation and management of any maintenance operation.

3 Data Required
Consideration of the foregoing objectives defines the need for a system which provides for the
collection of documented data covering:
a. The total population under observation
b. Operational conditions
c. Failures of the items
d. Maintenance operations

4 Guidelines
It is the intention of this standard to provide guidelines for setting up data collection.

5 Reports
General Comments: the relative content of use and failure reports will vary markedly with the items
considered and the type of operation.
Use Reporting: Data reporting should be supported by information on the use of the items.
Failure Reporting: Failure reports should cover all the failures which have been observed. They
should also contain sufficient information to identify misuse failures. Failures considered to be
attributable to any maintenance action should be so noted.
Preventative Maintenance Reporting: Essentially, preventative maintenance is scheduled so as to
forestall failure or eliminate failure entirely. When no replacements or repairs are made, the action
can be classified as a “Use” report. When the preventative maintenance action results in a
replacement or repair, the report may be treated as a “Failure” report even though the item has in fact
not failed in operation.

6 Field Performance Reports


General: Sufficiently detailed reports permit the estimation of reliability not only of the items but of
the devices of which they are component parts.
Content of Field Performance Reports
• Number and date of report
• Name and address of user, location of item
• Nature of report (Use, failure, PM)
• Item Identification
• Number of items considered: it is possible to cover more than one item of basically identical
design
• History of Item
o Date of manufacture
o Original or modified state
o Date first placed into use
o Cumulative operating time
o Storage or transportation conditions and cumulative time prior to last use
o Nature and date of last maintenance task, operating time since this date.
o Cumulative time non-operational but believed serviceable
o Cumulative time non-operational but believed unserviceable.
o Cumulative time on standby
• General operating conditions
• Item Failure description
• Item Failure analysis
• Action taken
• Assessment by field or maintenance engineer

AS2530-1982 Presentation of Reliability Data on Electronic and Similar Components
1 Scope
This standard is intended to provide guidance on the presentation of data necessary to distinguish the
reliability characteristics of an electronic component, but may also be applicable to other engineering
items.

2 Identification of Components Tested
Information identifying the components shall be in accordance with the relevant Australian Standards
or other component specifications.

3 Test Conditions
The test conditions to be used should be those given in the relevant component standards.

4 Data on Failures
The following information shall be supplied:
a. The number of failures observed, categorised by test conditions and type of failure
b. The times at which the failures occurred or were verified
c. Particular incidents which occurred during testing which might have affected the results
d. Statement of failure mechanism
e. Discarded test data and the reasons why they were not used in the presentation of the results
Additional requirements:
a. Failure criteria
b. Failure rate which can be assumed to be constant
c. Failure rate which cannot be assumed to be constant
d. Influence of stress
Presentation of data:
a. The failure rates of components failing in the sample tested shall preferably be supplied in
terms of the test period, eg 4 × 10⁻⁶ in 2000 hours, rather than the failure rate alone.
b. The upper confidence level (and the lower, where appropriate), shall be stated. Preferred
confidence levels are 60% and 90%. It shall be stated whether the failure rate is observed,
assessed or extrapolated.
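
As an illustration of item b, the sketch below estimates a one-sided upper confidence limit on a failure rate at the two preferred confidence levels. It is not taken from the standard: it assumes a constant failure rate and a time-terminated test, so that the limit is the rate at which the Poisson probability of seeing no more than the observed number of failures equals 1 minus the confidence level (equivalent to the usual chi-square method). The function names and the test figures are hypothetical.

```python
import math

def poisson_cdf(r, mean):
    """P(X <= r) for a Poisson variable with the given mean."""
    term = math.exp(-mean)
    total = term
    for k in range(1, r + 1):
        term *= mean / k
        total += term
    return total

def failure_rate_upper_limit(failures, total_hours, confidence):
    """One-sided upper confidence limit on a constant failure rate (per hour).

    Assumes a time-terminated test with a constant failure rate.
    """
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    # Widen the bracket until the Poisson CDF falls below the target.
    while poisson_cdf(failures, hi) > target:
        hi *= 2.0
    # Bisection on the expected number of failures.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if poisson_cdf(failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return hi / total_hours

# Hypothetical test result: 2 failures in 500 000 component-hours.
for level in (0.60, 0.90):
    limit = failure_rate_upper_limit(2, 500_000, level)
    print(f"{level:.0%} upper limit: {limit:.3e} per hour")
```

Stating the limit together with the test period and the confidence level, as the standard asks, lets the reader see how much test time lies behind the figure.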

5 Data on changes in Characteristics


The following information shall be supplied as a part of the test data:
a. Primary characteristics data
b. A graphical representation of changes in characteristics
c. A tabular representation of changes in characteristics
d. Particular incidents which occurred during testing which might have affected the results
e. Discarded test data and the reasons why they were not used in the presentation of the results,
shall be separately stated.
Additional requirements:
a. If the changes can be satisfactorily approximated by a mathematical function, that function
should be stated together with time duration for which it applies.
b. If the drift of the characteristic depends on type and magnitude of stress, these should be
stated together with the data.
Presentation of the data:
The various forms of data presentation are:
a. Primary methods
b. Graphical methods
c. Numerical methods

AS3960-1990 Guide to Reliability and Maintainability Program Management
1 Scope and General
This standard provides guidance on reliability and maintainability program management of
manufactured and constructed products. In management terms, it is concerned with what has to be
done and why, and when and how it is to be done.

2 Reliability and Maintainability Program


General
Life Cycle concept
Aim of a Reliability and Maintainability Program
General considerations on maintainability
Cost Considerations
Relative effectiveness of program activities
Training

Program Activities
Definition
Design and Development phase
Production phase
Installation and Commissioning phase
Operation-usage and maintenance phase

3 Specification of Reliability and Maintainability


General
Types of specification
Purpose of reliability and maintainability clauses
Qualitative versus quantitative approach to reliability and maintainability
Quantitative reliability clauses
Problems in applying the quantitative approach
Qualitative approach
Quantitative maintainability clauses
Quantitative maintainability requirements

Writing Reliability and Maintainability Clauses in a Specification


Necessary clauses
Function of an item
Criteria for failure
Choice of a reliability characteristic
Required value of the reliability characteristic
Choice of a maintainability characteristic
Required value of the maintainability characteristic
Operating conditions and regime
Reliability and Maintainability assurance

Specification of Reliability and Maintainability in Practice

4 Assessment and prediction of Reliability and Maintainability


General
Aims of reliability assessment
Reliability and Maintainability characteristics

Reliability Assessment

Reliability Prediction by Modelling

Provision of Reliability Data

Reliability Growth Testing


General
Preparation
Results of reliability growth testing
Factors governing reliability growth testing effectiveness

Reliability Demonstration and Testing


General
Aims of a test program
Choice of a test program
Evaluation of test data using Bayesian methods
Proof test
Suitability of statistical methods for analysis of test results

Maintainability Prediction
Maintainability prediction
Prediction advantages
Techniques
Basic Assumptions and Interpretations
Elements of maintainability prediction techniques
Maintainability Demonstration and Testing
General requirements
Maintainability testing program
Maintainability demonstration
Test Conditions
Maintenance task selection

Compliance illustration by means other than testing.

5 Production, Flow, Analysis and Interpretation of Reliability and Maintainability Data
General
Benefits
Organisation
Effectiveness of communication

Data Input
Reporting systems
Specification and description
Operating history
Failure history

Data Sources
Guidelines
Past Experience
Design and development
Production
Factory test
Guarantee or warranty reports – product liability test reporting
Supply of replacement parts
Material or component supply
Repair department
Field installation, demonstration or commissioning tests
User reporting system
Field surveys

Designing the Data Collection Form

Validity of Data
Product manufacturer
Materials or component supplier
Field data retrieval programs

Collection and Flow of Reliability Data

Analysis of Data
Quantitative data
Qualitative data
Requirement specifications

Failure Classification

Interpretation and Presentation of Data

O’Conner Chapter 6 – Reliability Prediction and Modelling
Introduction
Accurate prediction of the reliability of a new product before it is manufactured is obviously highly
desirable. Advance knowledge of reliability can allow forecasts of support costs, spares requirements,
warranty costs and marketability.
It can be argued that prediction acknowledges causes of failure that could then be eliminated. In fact
a reliability prediction can rarely be made with high accuracy or confidence.
Nevertheless it can often provide the basis for forecasting of dependent factors such as life cycle costs,
and be valuable as part of the study and design processes, for comparing options and highlighting
critical reliability features of designs.
Eventually, in principle, it is necessary to consider the reliability contributions of individual parts.
However, the lower the level of analysis, the greater is the uncertainty. The great majority of modern
engineering components are sufficiently reliable that, for practical purposes, they generate no inherent
quantifiable failure rate.
Since reliability is affected strongly by human-related factors such as training and motivation of
design and test engineers, quality of production and maintenance skills, these factors must also be
taken into account.

Fundamental Limitations of Reliability Prediction


In engineering and science we use mathematical models for prediction. These laws and models are
valid within the appropriate domain (eg Ohm’s law does not hold at temps near absolute zero).
However for everyday, practical purposes such deterministic laws serve our purposes well, and we use
them to make predictions, taking account of such practical aspects as measurement errors in initial
conditions.
While most laws in physics can be considered deterministic, the underlying mechanisms can be
stochastic. It is only at the level of individual or very few actions and interactions that physicists take
into account the uncertainty due to underlying stochastic processes. For practical purposes, we ignore
infinitesimal variations.
Of course some physical systems are comprised of very few components, actions and interactions with
no underlying stochastic mechanism, and can be treated as deterministic because of the small numbers
involved. If however, the numbers become larger, the computational problems become significant and
begin to degrade the credibility of the predictions. Also, small errors and variations progressively
accumulate, leading to increased uncertainty and divergent behaviour.
Physical laws are thus a useful predictor of system behaviour either when very small or very large
numbers are involved. For systems involving moderately large numbers the predictive power in
empirical and deterministic laws diminishes. Very fast computing is applied and new empirical
relationships are derived, but the results often fall short of our requirements. Prediction power is
increased if we can simplify the problem to a few variables or complicate it to a very large number.
Prediction power is greater in time-invariant or cyclical systems. Systems such as weather,
aerodynamics, and explosives are time-variant, and this leads to divergent behaviour. Thus predictions
of the instantaneous state become progressively less credible as time proceeds.
It follows from the arguments presented that one or a few measurements of a physical quantity can
usually provide us with sufficient information on which to make a prediction of the future state of a
physical time-invariant or cyclical system. When the quantity is time-variant, more measurements
might need to be taken before we can predict confidently from the data. It is only when the system is
moderately complex that one or a few measurements do not enable us to predict the future state.
For a mathematical model to be accepted as a basis for scientific prediction, it must be based on a
theory which explains the relationship. Finally, we expect the predictions made to be repeatable.

Predictions in Reliability
The concept of deriving mathematical models, which can be used to predict reliability, is intuitively
appealing. Sometimes these models are as simple as a single fixed value for failure rate or reliability.
However, some of the models derived are quite complex, taking account of many factors likely to
affect reliability.
Like other predictive models in science and engineering, these have been based upon consideration of
what might affect the parameter of interest, in this case reliability, or in other words, failure. Thus
there have been attempts to create theories. However, this approach is of severely limited validity for
predicting reliability.
Whilst an engineering component might have properties such as conductance and mass, it is very
unlikely to have an intrinsic reliability that meets such criteria.
Failure or the absence of failure is heavily dependent upon human actions and perceptions. This is
never true of laws of nature. This represents a fundamental limitation of the concept of reliability
prediction using mathematical models.
Onset of failure is nearly always a discontinuous function, subject to the predictive difficulties described
for models of the behaviour of a system which contains a moderately large number of factors and
interactions, and whose progression to a failed state is time-variant.
We saw in Chapter 4 how reliability can vary by orders of magnitude with small changes in load and
strength distributions, and the large amount of uncertainty inherent in estimating reliability from the
load-strength model. These real uncertainties must be borne in mind when synthesizing the reliability
of a system by considering the likely failure rates of its parts.
Another limitation arises from the fact reliability models are usually based upon statistical analysis of
past data. Much more data is required to derive a statistical relationship, and even then there will be
uncertainty because the sample can seldom be taken to represent the population. Sometimes we can
say that the likelihood increases but we can very rarely predict the time of failure. A statistically
derived relationship can never be proof of a causal connection – it must be supported by theory based
on an understanding of the underlying cause and effect relationship.
It is never sensible to make a prediction based on past data unless we can be sure the underlying
conditions that affect future behaviour will not change. However, since engineering is concerned with
deliberate changes, predictions of reliability based on past data ignore the fact that changes might be
made to improve reliability. The use of past data to predict the future can be very misleading and
unduly pessimistic.
A reliability prediction for a system containing many parts is likely to be more accurate than for a
small system. It is important to remember the variances in reliability at the part level can be orders of
magnitude greater than the variances at system level.
Therefore any reliability prediction based on mathematical models or growth models must be treated
with some scepticism.
A designer cannot design for an MTBF unless he places as much faith in the reliability math model as
he does in, say, Ohm’s law. The MTBF cannot be measured as can, say, power consumption, and there
is no reason or logic to believe that all units will show the same MTBF or patterns of failure over a period of
time.

Unfortunately the mathematical modelling approach to reliability prediction has been given undue and
insufficiently critical attention in the literature and in reliability standards. The naïve presentation of
reliability predictions has done much to undermine the credibility of reliability engineering,
particularly since so often systems achieve reliability levels far higher than the predictions.

Reliability Databases
There are several published databases that give reliability information on engineering components and
subsystems.
The best known is MIL-HDBK-217 for electronic components.

The Practical Approach


It is possible to make credible reliability predictions for systems under certain conditions:
1. The system is similar to systems developed, built and used previously
2. The new system does not involve significant technological risk
3. The system will be manufactured in large quantities, or is very complex (ie many parts, or the
parts are complex), or will be used over a long time, or a combination of these, ie there is an asymptotic
property
4. There is a strong commitment to the achievement of the reliability predicted, as an overriding
priority
Reliability prediction for new high tech products must be based upon identification of objectives and
assessment of risks, in that order. This assessment can be aided by the educated use of appropriate
models and data which help to quantify the risks.
Reliability predictions for systems should be made “top down” not synthesized from the parts level.
The purpose to which the prediction will be applied should also influence the methods used and the
estimates provided.
The predictions should always take account of objectives and related management aspects, such as
commitment and risk. If management does not drive the reliability effort, the prediction can become a
meaningless exercise. As overriding considerations, it must be remembered there is no theoretical
limit to the reliability that can be attained, and this does not necessarily entail higher costs.

System Reliability Models


The basic series reliability model
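In general for a series of n, s-independent components:

$$R = \prod_{i=1}^{n} R_i$$

where $R_i$ is the reliability of the $i$th component. This is known as the product rule or series rule. For components with constant failure rates $\lambda_i$:

$$\lambda = \sum_{i=1}^{n} \lambda_i \quad \text{and} \quad R = e^{-\lambda t}$$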

This is the simplest basic model on which parts count reliability prediction is based.
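
The series rule can be sketched in a few lines of Python. This is only an illustration, assuming s-independent components with constant failure rates; the function names and the failure rates are hypothetical.

```python
import math

def series_reliability(reliabilities):
    """Product rule: every component must work for the system to work."""
    return math.prod(reliabilities)

def series_reliability_from_rates(failure_rates, t):
    """Constant failure rates add in series, so R = exp(-sum(rates) * t)."""
    return math.exp(-sum(failure_rates) * t)

# Hypothetical failure rates (per hour) for three parts, over 1000 hours.
rates = [2e-6, 5e-6, 1e-6]
t = 1000.0
parts = [math.exp(-lam * t) for lam in rates]
print(series_reliability(parts))
print(series_reliability_from_rates(rates, t))
```

Both calls give the same value, which is why a parts count prediction can simply sum the part failure rates.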

Active Redundancy
In this system, composed of two s-independent parts with reliabilities R1 and R2, satisfactory operation
occurs if either one or both parts function:

$$R = R_1 + R_2 - R_1 R_2$$

The general expression for active parallel redundancy is:

$$R = 1 - \prod_{i=1}^{n} (1 - R_i)$$
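
A minimal sketch of active parallel redundancy, assuming s-independent parts: the system fails only if every part fails, so the system reliability is one minus the product of the part unreliabilities. The function name and the values are hypothetical.

```python
import math

def parallel_reliability(reliabilities):
    """Active redundancy: the system fails only if all parts fail."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Two-part case: agrees with R1 + R2 - R1*R2.
r1, r2 = 0.9, 0.8
print(parallel_reliability([r1, r2]))
print(r1 + r2 - r1 * r2)
```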

m-out-of-n redundancy
In some active parallel redundant configurations, m out of n units may be required to be working for
the system to function. The reliability of an m/n system, with n, s-independent components in which
all the unit reliabilities are equal, is the binomial reliability function.
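
The binomial reliability function can be written as the probability that at least m of the n units work. The sketch below assumes equal unit reliability r and s-independence; the function name and the values are hypothetical.

```python
import math

def m_out_of_n_reliability(m, n, r):
    """Probability that at least m of n identical, s-independent units work."""
    return sum(
        math.comb(n, k) * r**k * (1.0 - r) ** (n - k)
        for k in range(m, n + 1)
    )

# Hypothetical 2-out-of-3 system with unit reliability 0.9.
print(m_out_of_n_reliability(2, 3, 0.9))
```

With m = 1 this reduces to the active parallel case, and with m = n to the series case.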

Standby redundancy
Standby redundancy is achieved when one unit does not operate continuously but is switched on when
the primary unit fails. The standby unit and the sensing and switching system may be considered to
have a “one-shot” reliability of starting and maintaining system function until the primary component
is repaired.

Further redundancy considerations


For systems where very high safety or reliability is required, more complex redundancy is frequently
applied:
1. In aircraft, dual or triple active redundant hydraulic power systems are used.
2. Aircraft electronic flying controls typically utilize triple voting active redundancy. A sensing system
automatically switches off one system if it transmits signals which do not match those transmitted
by the other two.
3. Fire detection and suppression systems consist of detectors, which may be in parallel active
redundant configuration.

Availability of Repairable Systems


Availability is defined as the probability that an item will be available when required, or as the
proportion of total time that the item is available for use. Therefore the availability of a repairable
item is a function of its failure rate, λ, and its repair or replacement rate, μ. For a simple unit with a
constant failure rate λ, and a constant repair rate, μ, where μ = 1/MTTR, the steady state availability is
equal to:

$$A = \frac{\mu}{\lambda + \mu} = \frac{MTBF}{MTBF + MTTR}$$

The instantaneous availability is equal to:

$$A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu} e^{-(\lambda + \mu) t}$$
Steady state unavailability = 1 – A
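
The two availability expressions can be checked numerically with a short sketch. It assumes a single unit with constant failure rate λ = 1/MTBF and constant repair rate μ = 1/MTTR that starts in the working state; the function names and the MTBF and MTTR values are hypothetical.

```python
import math

def steady_state_availability(mtbf, mttr):
    """A = MTBF / (MTBF + MTTR), equivalent to mu / (lambda + mu)."""
    return mtbf / (mtbf + mttr)

def instantaneous_availability(lam, mu, t):
    """Availability at time t for a unit that is working at t = 0."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)

# Hypothetical unit: MTBF 1000 hours, MTTR 10 hours.
mtbf, mttr = 1000.0, 10.0
lam, mu = 1.0 / mtbf, 1.0 / mttr
for hours in (0.0, 5.0, 20.0, 100.0):
    print(hours, instantaneous_availability(lam, mu, hours))
print("steady state:", steady_state_availability(mtbf, mttr))
```

The instantaneous value starts at 1 and decays towards the steady state value at a rate set by λ + μ.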
If scheduled maintenance is necessary and involves taking the system out of action, this must be
included in the availability formula.
Availability is an important consideration in relatively complex systems. In such systems, high
reliability by itself is not sufficient to ensure that the system will be available when needed. It is
also necessary to ensure that it can be repaired quickly and that essential scheduled maintenance can
be performed quickly. Therefore maintainability is an important aspect of design for maximum
availability.

Availability is also affected by redundancy. Large gains in reliability and steady-state availability can
be provided by redundancy. However, these are relatively simple situations, particularly as constant
failure rate is assumed.

Modular Design
Availability and the cost of maintaining the system can be influenced by the way the design is
partitioned. Modular design is used in many complex products to ensure a failure can be corrected by
a relatively easy replacement of the defective module, rather than by replacement of a complete unit.

Block Diagram Analysis


The failure logic of a system can be shown with a reliability block diagram (RBD), which shows the
logical connections between components of the system.
Block diagram analysis consists of reducing the overall RBD to a simple system that can then be
analysed using the formulae for series and parallel arrangements. It is necessary to assume
s-independence of block reliabilities.
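
The reduction of a series-parallel RBD can be sketched as a recursive calculation. The nested tuple representation below is an assumption made for illustration, not a notation from the text, and the block reliabilities are hypothetical. S-independence of the blocks is assumed throughout.

```python
import math

def rbd_reliability(block):
    """Reduce a nested series/parallel RBD to a single reliability value.

    A block is either a number (a single block's reliability) or a tuple
    (kind, children) where kind is "series" or "parallel".
    """
    if isinstance(block, (int, float)):
        return float(block)
    kind, children = block
    values = [rbd_reliability(child) for child in children]
    if kind == "series":
        return math.prod(values)
    if kind == "parallel":
        return 1.0 - math.prod(1.0 - v for v in values)
    raise ValueError(f"unknown block type: {kind}")

# Hypothetical system: a unit in series with a redundant pair and a third unit.
system = ("series", [0.99, ("parallel", [0.9, 0.9]), 0.95])
print(rbd_reliability(system))
```

Structures that cannot be reduced this way, such as bridge networks, need the cut set or tie set methods instead.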

Cut and Tie Sets


Complex RBDs can be analysed using the cut set or tie set methods. A cut set is produced by drawing
a line through blocks in the system to show the minimum number of failed blocks which would lead to a
system failure. Tie sets are produced by drawing lines through blocks which, if all were working,
would allow the system to work.
Their use is appropriate for the analysis of large systems in which various configurations are possible,
such as aircraft controls.
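
As a small illustration, the sketch below evaluates a five-block bridge network from its tie sets by enumerating every combination of working and failed blocks. The bridge layout, the block names and the reliabilities are hypothetical, s-independence is assumed, and brute-force enumeration is used here only because it is easy to verify; it is not practical for large systems.

```python
from itertools import product

def reliability_from_tie_sets(tie_sets, reliabilities):
    """Sum the probability of every block state in which some tie set is intact."""
    names = sorted(reliabilities)
    total = 0.0
    for states in product([True, False], repeat=len(names)):
        working = {name for name, up in zip(names, states) if up}
        if any(tie <= working for tie in tie_sets):
            prob = 1.0
            for name, up in zip(names, states):
                prob *= reliabilities[name] if up else 1.0 - reliabilities[name]
            total += prob
    return total

def is_cut_set(failed_blocks, tie_sets):
    """A set of failed blocks is a cut set if it breaks every tie set."""
    return all(tie & failed_blocks for tie in tie_sets)

# Hypothetical bridge: A and B on the input side, C and D on the output side,
# E linking the two branches.
bridge_ties = [{"A", "C"}, {"B", "D"}, {"A", "E", "D"}, {"B", "E", "C"}]
block_r = {name: 0.9 for name in "ABCDE"}
print(reliability_from_tie_sets(bridge_ties, block_r))
print(is_cut_set({"A", "B"}, bridge_ties))
print(is_cut_set({"A", "E"}, bridge_ties))
```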

Common Mode Failures


Examples of common mode failures are:
1. Changeover systems to activate standby redundant units
2. Sensor systems to detect failure of a path
3. Indicator systems to alert personnel to failure of a path
4. Power or fuel supplies which are common to different paths.

Enabling Events
An enabling event is one which, whilst not necessarily a failure or a direct cause of failure, will cause
a higher level failure event when accompanied by a failure. Examples are:
1. Warning systems disabled for maintenance
2. Controls incorrectly set
3. Personnel following procedures incorrectly or not at all.

Practical Aspects
It is essential that practical engineering considerations are applied to system reliability analysis.
Examples of situations in which practical and logical error can occur are:
1. Two diodes connected in series. If either fails open circuit there will be no current flow, so for
open circuit failures they are in series. If either fails short circuit, the other will provide the required
system function, so for short circuit failures they are in parallel from a reliability point of view.
2. Common mode failures are often difficult to predict, but can dominate the real reliability or
safety of a system.
3. Unexpected combinations of events can occur.
4. System failures can be caused by events other than failure of components or subsystems.
These illustrate the need for reliability and safety analysis to be performed by engineers with practical
knowledge and experience of the system design, manufacture, operation and maintenance.

Fault Tree Analysis


Fault Tree Analysis (FTA) is a reliability/safety design technique which starts from consideration of
system failure effects, referred to as top events.
An FTA shows the logical connections between failure events in relation to defined top events. A top
event can be caused by different failure modes or by different logical combinations of failure events.
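
A fault tree built from AND and OR gates can be evaluated with a short recursive sketch. This assumes the basic events are s-independent and that each basic event appears only once in the tree; repeated events need cut set methods instead. The tree representation, the event names and the probabilities are all hypothetical.

```python
import math

def top_event_probability(node, basic_events):
    """Evaluate a fault tree of AND/OR gates over s-independent basic events.

    A node is either the name of a basic event or a tuple (gate, inputs).
    """
    if isinstance(node, str):
        return basic_events[node]
    gate, inputs = node
    probs = [top_event_probability(child, basic_events) for child in inputs]
    if gate == "AND":
        return math.prod(probs)
    if gate == "OR":
        return 1.0 - math.prod(1.0 - p for p in probs)
    raise ValueError(f"unknown gate: {gate}")

# Hypothetical top event: loss of flow, caused by loss of power
# or by failure of both redundant pumps.
tree = ("OR", ["power_loss", ("AND", ["pump_a_fails", "pump_b_fails"])])
events = {"power_loss": 0.001, "pump_a_fails": 0.01, "pump_b_fails": 0.01}
print(top_event_probability(tree, events))
```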

Petri Nets
A Petri net is a general-purpose graphical and mathematical tool for describing relations existing
between conditions and events.
Owing to the variety of logical relations that can be represented with Petri nets, they are a powerful tool
for modelling systems. Petri nets can be used not only for simulation, reliability analysis and failure
monitoring, but also for dynamic behaviour observation.

State Space Analysis (Markov Analysis)


A system or component can be in one of two states (failed or non-failed). The probability of being in
one or the other at a future time can be evaluated using state-space (or state-time) analysis. In
reliability and availability analysis, failure rate and repair rate are the variables of interest.
The best known state-space analysis technique is Markov analysis, which can be applied under the
following major constraints:
1. The probabilities of changing from one state to another must remain constant. Thus the
method can only be used when a constant hazard or failure rate assumption can be justified.
2. Future states of the system are independent of all past states except the immediately
preceding one. This is an important constraint in the analysis of repairable systems, since it
implies that the repair returns the system to an “as new” condition.
The tree diagram approach quickly becomes intractable if the system is much more complex than the
one-component system described, and analysed over just a few increments. For more complex
systems, matrix methods can be used, particularly as these can be readily solved using computer
programs.
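
The matrix method can be sketched for the one-component repairable system as follows. The state vector holds the probabilities of being available and failed, and each time increment multiplies it by a constant transition matrix. The transition probabilities per increment are hypothetical, and plain Python lists are used so that the sketch has no dependencies.

```python
def step(state, matrix):
    """One time increment: new_state[j] = sum over i of state[i] * matrix[i][j]."""
    size = len(state)
    return [sum(state[i] * matrix[i][j] for i in range(size)) for j in range(size)]

def state_probabilities(initial, matrix, increments):
    """State probabilities after the given number of time increments."""
    state = list(initial)
    for _ in range(increments):
        state = step(state, matrix)
    return state

# Hypothetical probabilities per time increment.
p_fail, p_repair = 0.01, 0.2
# Rows are the current state, columns the next state: [available, failed].
matrix = [
    [1.0 - p_fail, p_fail],
    [p_repair, 1.0 - p_repair],
]
for n in (1, 5, 50, 500):
    available, failed = state_probabilities([1.0, 0.0], matrix, n)
    print(n, available, failed)
```

After many increments the availability settles at p_repair / (p_fail + p_repair), which matches the steady state result for constant failure and repair rates.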

Continuous Markov Processes


So far we have considered discrete Markov processes. We can also use Markov analysis to evaluate the
availability of systems in which the failure rate and the repair rate are assumed to be constant in a time
continuum. Markov analysis can also be used for availability, taking account of the holding and repair
rate for spares.

Limitations
The Markov analysis method suffers one major disadvantage: it is necessary to assume constant rates for
both failures and repairs. It is also necessary to assume events are s-independent, which is hardly ever
the case in the real world. The extent to which these assumptions might affect the situation should be carefully
considered when evaluating a Markov analysis.

Monte Carlo Simulation
In a Monte Carlo simulation, a logical model of the system being analysed is repeatedly evaluated,
each run using different values of the distributed parameters. The selection of parameters is made
randomly, but with probabilities governed by the relevant distribution function.
Monte Carlo simulations can be used for reliability and availability modelling using computer
programs. Since MC involves no complex mathematical analysis it is an attractive alternative
approach. There are no constraints regarding the nature of input assumptions on parameters such as
failure and repair rates, so non-constant values can be used.
One problem is the expensive use of computer time. Also, since the simulation of probabilistic events
generates variable results, in effect simulating real life, it is usually necessary to perform a number of
runs in order to obtain estimates of the means and variances of output parameters such as availability,
number of repairs arising and facility utilization.
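A minimal sketch of such a simulation for one repairable item is given below. The distributions are
assumptions chosen only to show that non-constant rates can be used: Weibull times to failure (shape 2)
and lognormal repair times. The function names are hypothetical.

```python
import random


def simulate_availability(mission_time, draw_time_to_failure, draw_repair_time, rng):
    """One run: alternate up and down periods until mission_time is reached."""
    clock = 0.0
    uptime = 0.0
    while clock < mission_time:
        up = draw_time_to_failure(rng)
        uptime += min(up, mission_time - clock)  # only count uptime inside the mission
        clock += up
        if clock >= mission_time:
            break
        clock += draw_repair_time(rng)
    return uptime / mission_time


def monte_carlo(n_runs, mission_time, seed=1):
    """Repeat the run n_runs times and return the mean and sample variance of availability."""
    rng = random.Random(seed)
    # Assumed inputs: Weibull scale 1000 h, shape 2; lognormal repair with mu=2, sigma=0.5.
    time_to_failure = lambda r: r.weibullvariate(1000.0, 2.0)
    repair_time = lambda r: r.lognormvariate(2.0, 0.5)
    results = [
        simulate_availability(mission_time, time_to_failure, repair_time, rng)
        for _ in range(n_runs)
    ]
    mean = sum(results) / n_runs
    variance = sum((x - mean) ** 2 for x in results) / (n_runs - 1)
    return mean, variance


mean_availability, variance = monte_carlo(n_runs=200, mission_time=10000.0)
print(mean_availability, variance)
```

Increasing n_runs narrows the uncertainty on the estimated mean at the cost of computer time, which is
the trade-off described above.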

Reliability Apportionment
Sometimes it is necessary to break an overall system reliability requirement down into individual
sub-system reliability requirements.
The starting point for apportionment is an RBD for the system drawn to show the appropriate system
structure. It is important to take account of the uncertainty inherent in any early prediction.
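One simple illustration, which is an assumption here and not a method stated in these notes, is equal
apportionment: if the RBD shows n sub-systems in series and each is given the same target, each must
achieve the n-th root of the system requirement.

```python
def equal_apportionment(system_target, n_subsystems):
    """Reliability each of n series sub-systems must achieve to meet the system target."""
    return system_target ** (1.0 / n_subsystems)


# Hypothetical requirement: system reliability 0.95 across four series sub-systems.
target = 0.95
n = 4
r_sub = equal_apportionment(target, n)
print(r_sub)
```

In practice the split would be weighted by complexity, criticality and the uncertainty of early
predictions rather than shared equally.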

Standard Methods for Reliability Prediction and Modelling


MIL-STD-756
IEEE 1413
NASA-CR-1129

Conclusions
System reliability prediction and modelling can be a frustrating exercise, since even quite simple
systems can lead to complex logic when redundancy, repair times, testing and monitoring are taken
into account.
In real life, availability is often determined more by spares holdings and administrative times than by
predictable factors such as mean repair time.
Prediction and modelling are concepts which have generated much attention, literature and controversy
in the reliability field.
Much of this work, however, is of only abstruse interest, since reliability is not a parameter which is
inherently predictable on the basis of the laws of nature or of statistical extrapolation.



O’Conner Chapter 7 – Reliability in Design (not required)
Introduction

Computer-Aided Engineering

Environments

Design Analysis Methods

Quality Function Deployment

Load Strength Analysis

Failure Modes, Effects and Criticality Analysis (FMECA)

Reliability Predictions for FMECA

Hazard and Operability Study (HAZOPS)

Parts, Materials and Processes Review (PMP)

Non-Material Failure Modes

Human Reliability

Design analysis for processes

Critical Items List

Summary

Management of Design Review

Configuration Control



O’Conner Chapter 8 – Reliability of Mechanical Components and
Systems
Introduction
Mechanical components can fail due to two causes:
1. Overstress leading to fracture
2. Degradation of strength
Mechanical components and systems can also fail for many other reasons:
• Backlash in controls, linkages and gears
• Incorrect adjustments
• Seizing of moving parts
• Leaking of seals
• Loose fasteners
• Excessive vibration or noise
Designers must be aware of these and other potential causes of failure, and must design to prevent or
minimise their occurrence.

Mechanical Stress, Strength and Fracture


Mechanical stress can be either tensile, compressive or shear. The amount of deformation is called the
strain. The relationship between stress (σ) and strain (ε) is described by Hooke’s Law:
σ = Eε
where E is Young’s Modulus. A high value of E indicates the material is stiff. A low value means that
the material is soft or ductile.
Another important material property is toughness. Toughness is the opposite of brittleness. It is the
resistance to fracture, measured as the energy input per unit volume required to cause fracture.
Compressive strength is much more difficult to analyse and predict. It depends upon the mode of
failure and the shape of the component.
Stress can also be applied in shear.

Fatigue
Fatigue damage within engineering materials is caused when a repeated mechanical stress is applied,
the stress being above a limiting value called the fatigue limit. Fatigue damage is cumulative, so that
repeated stress above the fatigue limit will eventually result in failure.
The initiation and growth rate of fatigue cracks vary depending upon the material properties and on
surface and internal conditions. The material property that imparts resistance to fatigue damage is the
toughness.
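The cumulative nature of fatigue damage is often illustrated with Miner's rule, which is not stated in
these notes and is used here only as an assumed model: each block of n cycles at a stress level with a
life of N cycles consumes a fraction n/N of the fatigue life, and failure is expected when the fractions
sum to 1. The cycle counts below are hypothetical.

```python
def miner_damage(blocks):
    """Sum of n/N over (cycles_applied, cycles_to_failure) pairs, each above the fatigue limit."""
    return sum(applied / to_failure for applied, to_failure in blocks)


# Hypothetical load history: (cycles applied, cycles to failure at that stress level).
history = [(20000, 100000), (5000, 20000), (1000, 10000)]
damage = miner_damage(history)
print(damage, "failure expected" if damage >= 1.0 else "life remaining")
```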



O’Conner Chapter 9 – Electronic System Reliability
Introduction

Reliability of Electronic Components

Component Types and Failure Mechanisms

Summary of device failure modes

Circuit and System aspects

Electronic System Reliability Prediction

Reliability in electronic system design

Parameter variation and tolerances

Design for production, test and maintenance



O’Conner Chapter 10 – Software Reliability
Introduction

Software in engineering systems

Software Errors

Preventing Errors

Software Structure and Modularity

Programming Style

Fault Tolerance

Redundancy/Diversity

Languages

Data Reliability

Software Checking

Software Design Analysis Methods

Software Testing

Error Reporting

Software Reliability Prediction and Measurement


Sneak Analysis
Sneak conditions are:
1. Sneak Output – the wrong output is generated.
2. Sneak inhibit – undesired inhibit of an input or an output.
3. Sneak timing – the wrong output is generated because of its timing or incorrect input timing.
4. Sneak message – a program message incorrectly reports the state of the system.

Hardware/Software Interfaces

Conclusions
The versatility and economy offered by software control can lead to an under-estimation of the
difficulty and cost of software generation. To ensure the program will operate satisfactorily under all
conditions that might exist requires an effort greater than that required for the basic design and first-
program preparation. The cost and effort of debugging a large, unstructured program containing many
errors can be so high that it is cheaper to scrap the whole program and start again.
The essential elements of a software development program to ensure a reliable project are:
1. Specify the requirements completely and in detail
2. Make sure that all the project staff understand the requirements
3. Check the specifications thoroughly
4. Design a structured program and specify each module fully
5. Check the design and the module specifications against the system specifications
6. Check the written program for errors, line by line
7. Plan module and system tests to cover important input combinations, particularly at extreme
values
8. Ensure full recording of all development notes, tests, checks, errors and program changes.



Smith Chapter 17 – Systematic Failures, especially software
Systematic failures are generally considered to be in addition to those we quantify by means of failure
rate. Since they do not relate to past failure data, it follows that it is very difficult to justify their being
predicted by conventional techniques. Qualitative measures have been developed in the hope they
will minimise systematic failures. The following sections summarise these defences with particular
reference to software-related failure.

Programmable Devices

Software-related Failures

Software Failure Modelling

Software Quality Assurance

Modern/Formal Methods

Software Checklists

Study Guide 3 Self Assessment Questions


1. Reliability analysis

2. A sensitivity analysis

3. A trade-off analysis

4. What are four software sneak circuit conditions?


Sneak conditions are:
1. Sneak Output – the wrong output is generated.
2. Sneak inhibit – undesired inhibit of an input or an output.
3. Sneak timing – the wrong output is generated because of its timing or incorrect input timing.
4. Sneak message – a program message incorrectly reports the state of the system.
5. What are two primary causes for mechanical failure?
Mechanical components can fail due to two causes:
1. Overstress leading to fracture
2. Degradation of strength



STUDY GUIDE 4 - Reliability, Maintainability and Availability
Objectives
You should be able to
• Compare your definitions of these concepts with those presented in the various resources, and
point out any differences in emphasis
• Outline the significance of reliability, maintainability and availability to equipment design
• Discuss the trade-off between the cost of reliability, maintainability and performance
• Describe the principles of fault tree analysis and failure mode effect and criticality analysis

O’Conner Chapter 14 – Maintainability, Maintenance and Availability


Introduction
Most engineering systems are maintained. The ease with which repair and other maintenance work
can be carried out determines a system’s maintainability.
Maintained systems may be subject to corrective and preventative maintenance. Corrective
maintenance includes all action to return a system from a failed to an operating or available state. The
amount of corrective maintenance is therefore determined by reliability.
Corrective maintenance can be quantified as the mean time to repair (MTTR), and this time can be
divided into three groups:
1. Preparation time
2. Active maintenance time
3. Delay, or Logistics time.
Corrective maintenance is also specified as mean active corrective maintenance time (MACMT) since
it is only the active time that the equipment designer can influence.
Preventative maintenance seeks to retain the system in an operational or available state by preventing
failures from occurring. Preventative maintenance affects reliability directly. It is planned and should
be performed when we want it to be performed. Preventative maintenance is measured by the time
taken to perform the specified maintenance tasks and their specified frequency.
Maintainability affects availability directly. The time taken to repair failures and to carry out routine
preventative maintenance removes the system from the available state. There is thus a close
relationship between reliability and maintainability, one affecting the other, and both affecting
availability and costs.
The maintainability of a system is clearly governed by the design. The design determines features
such as accessibility, ease of test, diagnosis and repair and requirements for calibration, lubrication
and other preventative maintenance actions.

Maintenance Time Distributions


Maintenance times tend to be lognormally distributed. In addition to job-to-job time variability,
leading typically to this lognormal distribution, there is also variability due to learning. However,
both the mean time and the variance should reduce with experience and learning.
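A short sketch of what a lognormal repair-time distribution implies: if ln(t) is normal with mean mu and
standard deviation sigma, the median is exp(mu) and the mean is exp(mu + sigma²/2), so the mean always
exceeds the median because of the long right-hand tail. The parameter values below are hypothetical.

```python
import math
from statistics import NormalDist


def lognormal_summary(mu, sigma):
    """Median, mean and 95th percentile of a lognormal repair-time distribution."""
    median = math.exp(mu)
    mean = math.exp(mu + sigma ** 2 / 2.0)
    p95 = math.exp(mu + NormalDist().inv_cdf(0.95) * sigma)
    return median, mean, p95


# Hypothetical parameters of ln(repair time in hours).
median, mean, p95 = lognormal_summary(mu=0.5, sigma=0.8)
print(median, mean, p95)
```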



Preventative Maintenance Strategy
The effectiveness and economy of preventative maintenance can be maximised by taking account of
the time-to-failure distributions of the maintained parts and of the failure rate trend of the system.
In general, if a part has a decreasing hazard rate, any replacement will increase the probability of
failure. If the hazard rate is constant, replacement will make no difference to the probability of
failure. If a part has an increasing hazard rate, then scheduled replacement at any time will, in theory,
improve reliability of the system.
These are theoretical considerations. They assume that the replacement action does not introduce any
other defects and that the time-to-failure distributions of parts are exactly defined. These assumptions
must not be made without question. However, it is obviously of prime importance to take account of
the time-to-failure distributions in planning a preventative maintenance strategy.
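The effect of the hazard rate trend on replacement can be sketched with a Weibull time-to-failure model
(an assumption made here for illustration): shape β < 1 gives a decreasing hazard rate, β = 1 a constant
one and β > 1 an increasing one. The code compares the probability of failure over the next interval for
a part that has already survived to a given age against that of a new replacement part.

```python
import math


def reliability(t, eta, beta):
    """Weibull survival probability with scale eta and shape beta."""
    return math.exp(-((t / eta) ** beta))


def conditional_failure_prob(age, interval, eta, beta):
    """Probability of failing in the next interval, given survival to age."""
    return 1.0 - reliability(age + interval, eta, beta) / reliability(age, eta, beta)


# Hypothetical values: scale 1000 h, part aged 500 h, look-ahead interval 100 h.
ETA, AGE, INTERVAL = 1000.0, 500.0, 100.0

for beta in (0.5, 1.0, 2.0):
    keep = conditional_failure_prob(AGE, INTERVAL, ETA, beta)
    replace = conditional_failure_prob(0.0, INTERVAL, ETA, beta)
    print(beta, keep, replace)
```

With β = 0.5 the new part is the more likely to fail, with β = 1 there is no difference, and with β = 2
replacement reduces the probability of failure, matching the three cases described above.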
In addition to the effect of replacement on reliability as theoretically determined by considering the
failure distributions of the replaced parts, we must also take account of the effects of maintenance
action on reliability.
The effects of failures, both in terms of effects on the system and of costs of downtime and repair,
must also be considered.
In order to optimise preventative replacement, it is therefore necessary to know the following for each
part:
1. The time-to-failure distribution parameters for the main failure modes
2. The effect of all failure modes
3. The cost of failure
4. The cost of scheduled replacement
5. The likely effect of maintenance on reliability.
We have considered so far parts which do not give any warning of the onset of failure. If incipient
failure can be detected, we must also consider:
6. The rate at which defects propagate to cause failure
7. The cost of inspection and test.
Note that from point 2, a Failure Modes, Effects and Criticality Analysis (FMECA) is therefore an
essential input to maintenance planning. This systematic approach to maintenance planning, taking
account of reliability aspects, is called reliability centred maintenance (RCM).

FMECA and FTA in Maintenance Planning


The FMECA is an important prerequisite to effective maintenance planning and maintainability
analysis. The FMECA is also a very useful input for preparation of the diagnostic procedures and
checklists, since the likely causes of the failure symptoms can be traced back using the FMECA
results. When a fault tree analysis (FTA) has been performed it can also be used for this purpose.

Maintenance Schedules
When any maintenance activity is determined to be necessary, we must determine the most suitable
intervals between its performance.
The most appropriate base is the one that best accounts for the equipment’s utilisation in terms of the
causes of degradation (wear, fatigue, parameter change, etc.) and that is measured in service.



Technology Aspects
Mechanical
Monitoring methods are used to provide periodic or continuous indications of the condition of
mechanical components and systems. These include:
• Non-destructive test (NDT) for detection of cracks
• Temperature and vibration monitoring of bearings, gears and other rotating machinery
• Oil analysis, to detect signs of wear or break-up in lubrication and hydraulic systems

Electronic and Electrical


Electronic components and assemblies generally do not degrade in service, so long as they are
protected from environments such as corrosion. Therefore, apart from calibration for items like
measuring instruments, scheduled tests are seldom appropriate.

“No Fault Found”


A large proportion of the reported failures of many electronic systems are not confirmed on later test.
There are several causes of these, including:
• Intermittent failures, such as components that fail under certain conditions.
• Tolerance effects, which can cause a unit to operate correctly in one system or environment but
not another.
• Connector troubles
• Built-in test (BIT) systems which falsely indicate failures that have not occurred (see below)
• Failures that have not been correctly diagnosed and repaired, so that the symptoms reoccur.
• Inconsistent test criteria between the in-service test and the test applied during diagnosis
elsewhere such as the repair depot.
• Human error or inexperience
• In some systems the diagnosis of which item failed might be ambiguous.

Software
As discussed in chapter 10, software does not fail in the ways hardware can, so there is no
“maintenance”. If it is found to be necessary to change a program for any reason, this is really
redesign of the program, not repair.

Built-In Test (BIT)


Complex electronic systems now frequently include built-in test (BIT) facilities. BIT consists of
additional hardware (and often software) which is used for carrying out functional test on the system.
BIT can be very effective in increasing system availability and user confidence in the system.
However BIT inevitably adds complexity and cost and can therefore increase the probability of
failure.
BIT can also adversely affect apparent reliability by falsely indicating that the system is in a failed
condition.
It is important to optimise the design of BIT in relation to reliability, availability and cost.



Calibration
Calibration is the regular check or test of equipment for measuring physical parameters, by making
comparisons with standard sources.
Whether an item needs to be calibrated or not depends primarily on its application, and also upon
whether or not any inaccuracy would be apparent during normal use.

Maintainability Predictions
Maintainability prediction is the estimation of the maintenance workload which will be imposed by
scheduled and unscheduled maintenance. A standard method used for this work is MIL-HDBK-472,
which contains four methods for predicting the Mean Time To Repair (MTTR) of a system. Method II is
most frequently used and is based simply on summing the products of the expected repair times and
failure rates of the individual failure modes and dividing by the sum of the individual failure rates, i.e.:

$$\mathrm{MTTR} = \frac{\sum (\lambda\, t_r)}{\sum \lambda}$$

where λ is the failure rate of each failure mode and t_r is its expected repair time.
The same approach is used for predicting the mean preventative maintenance time, with λ replaced by
the frequency of occurrence of the preventative maintenance action.
MIL-HDBK-472 describes the methods to be used for predicting individual task times based upon
design considerations such as accessibility, skills levels, etc
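The failure-rate-weighted MTTR calculation described above can be sketched as follows. The failure
rates and repair times are hypothetical figures chosen only to show the arithmetic.

```python
def predicted_mttr(failure_modes):
    """Failure-rate-weighted mean repair time: sum(lambda * t_r) / sum(lambda)."""
    weighted = sum(rate * repair_time for rate, repair_time in failure_modes)
    total_rate = sum(rate for rate, _ in failure_modes)
    return weighted / total_rate


# Hypothetical failure modes: (failure rate per hour, expected repair time in hours).
modes = [(0.001, 2.0), (0.0005, 4.0), (0.002, 1.0)]
print(predicted_mttr(modes))
```

For the mean preventative maintenance time, the same function applies with each failure rate replaced by
the frequency of the corresponding preventative maintenance action.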

Maintainability Demonstrations
A standard approach to maintainability demonstration is MIL-HDBK-470. The technique is the same
as maintainability prediction using method III of MIL-HDBK-472, except that the individual task
times are measured rather than estimated from design.

Design for Maintainability


It is obviously important that maintained systems are designed so that maintenance tasks are easily
performed, and that the skill levels required are not too high, considering the experience and training
of likely maintenance personnel and users. As far as is practicable, the need for scheduled
maintenance should be eliminated.
Design rules and checklists should include guidance to aid design for maintainability and to guide
design review teams.
Design for maintainability is closely related to design for ease of production. If a product is easy to
assemble and test, maintenance will usually be easier.
Interchangeability is another important aspect of design for ease of maintenance of repairable systems.
Replaceable components and assemblies must be designed so that no adjustment or recalibration is
necessary after replacement.

Integrated Logistic Support


Integrated logistic support (ILS) is a concept developed by the military, in which all aspects of design
and of support and maintenance planning are brought together, to ensure that the design and the
support system are optimised. The approach is described in MIL-HDBK-1388.
ILS requires inputs of reliability and maintainability data and forecasts, as well as data on costs,
weights, special tools and test equipment, training requirements etc.
ILS outputs are thus very sensitive to the accuracy of the inputs. In particular, reliability forecasts can
be highly uncertain (refer Chapter 6). Therefore, such analyses and the decisions based upon them,
should take full account of these uncertainties.
O’Conner pages xxv – xxvi
MIL-HDBK-472
SAE J817
[Reader] Collcot Chapter 13 “Fault Analysis Planning and System
Availability”
Overview of paper, understand concepts in paras 13.1 to 13.1.5

[Reader] Patton Chapter 8 “Reliability, Availability and Maintainability”
pages 76-101

Study Guide 4 Self Assessment Questions


1. Maintainability
Most engineering systems are maintained. The ease with which repair and other maintenance work
can be carried out determines a system’s maintainability.
Maintained systems may be subject to corrective and preventative maintenance. Corrective
maintenance includes all action to return a system from a failed to an operating or available state. The
amount of corrective maintenance is therefore determined by reliability.
2. Integrated logistic support
Integrated logistic support (ILS) is a concept developed by the military, in which all aspects of design
and of support and maintenance planning are brought together, to ensure that the design and the
support system are optimised. The approach is described in MIL-HDBK-1388.
ILS requires inputs of reliability and maintainability data and forecasts, as well as data on costs,
weights, special tools and test equipment, training requirements etc.
ILS outputs are thus very sensitive to the accuracy of the inputs. In particular, reliability forecasts can
be highly uncertain. Therefore, such analyses and the decisions based upon them, should take full
account of these uncertainties.
3. Availability
4. What are the three groups of activities used to quantify MTTR?
Corrective maintenance can be quantified as the mean time to repair (MTTR), and this time can be
divided into three groups:
1. Preparation time
2. Active maintenance time
3. Delay, or Logistics time.
5. What is the main objective of a reliability and maintainability program?



STUDY GUIDE 5 - Reliability Prediction and Modelling
Objectives
You should be able to:
• Identify the various modelling approaches
• Identify the strengths and weaknesses of each of them
• Determine if any of them would be relevant and useful to your work situation

O’Conner, Chapter 6 Conclusion (limitations for reliability modelling)


System reliability prediction and modelling can be a frustrating exercise, since even quite simple
systems can lead to complex logic when redundancy, repair times, testing and monitoring are taken
into account.
In real life, availability is often determined more by spares holdings and administrative times than by
predictable factors such as mean repair time.
Prediction and modelling are concepts that have generated much attention, literature and controversy
in the reliability field.
Much of this work, however, is of only abstruse interest, since reliability is not a parameter which is
inherently predictable on the basis of the laws of nature or of statistical extrapolation.

O’Conner Chapter 12 (pgs 341-346)


Reliability Analysis of Repairable Systems
Chapter 3 described methods for analysing data related to time to first failure. However, for
repairable systems, which really represent the great majority of everyday reliability experience, the
distribution of times to first failure is much less important than the failure rate or rate of
occurrence of failure (ROCOF) of the system.
Any repairable system may be considered as an assembly of parts, the parts being replaced when they
fail. If we ignore replacement (repair) times, which are usually small in comparison with standby or
operating times, and if we assume that the time to failure of any part is independent of any repair
actions, then we can use the methods of event series analysis (shown in chapter 2) to analyse system
reliability.
If we do not perform a centroid test and assume the data were independently and identically
distributed (IID), we might order the data in rank order and plot on probability paper.
This example shows how important it is for failure data to be analysed correctly, depending on
whether we need to understand the reliability of a non-repairable part or of a repairable system. The
presence of a trend when the data are ordered chronologically shows that times to failure are not IID,
and ordering by magnitude, which implies IID, will therefore give misleading results.
We can derive system reliability over a period by plotting the cumulative times to failure in
chronological order rather than in rank order.
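A sketch of the centroid (Laplace) test for trend in chronologically ordered failure data is given below.
It assumes a time-truncated observation period T with n failures at cumulative times t_i and uses the
commonly quoted statistic U = (Σt_i/n − T/2) / (T·√(1/(12n))). A value near zero is consistent with no
trend, a positive value suggests failures are bunching towards the end of the period (increasing ROCOF)
and a negative value suggests improvement. The data are hypothetical.

```python
import math


def centroid_test(failure_times, observation_period):
    """Centroid (Laplace) statistic for failures observed over (0, observation_period]."""
    n = len(failure_times)
    mean_time = sum(failure_times) / n
    return (mean_time - observation_period / 2.0) / (
        observation_period * math.sqrt(1.0 / (12.0 * n))
    )


# Hypothetical cumulative failure times (hours) over a 1000 h observation period.
times = [100, 250, 450, 600, 750, 850, 920, 970]
u = centroid_test(times, 1000.0)
print(u)
```

The statistic can be compared with standard normal percentage points to judge whether the trend is
statistically significant before any IID-based analysis is attempted.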
If there are no perturbations, the failure rate will tend to a constant value after most parts have been
replaced at least once, regardless of the failure trends of individual parts. This is one of the main
reasons the Constant Failure Rate (CFR) assumption has become so widely used for systems, and why
part hazard rates have been confused with failure rate.



If part times to failure are independently and identically exponentially distributed (IID exponential) the
system will have a CFR which will be the sum of the reciprocals of the part mean times to failure, i.e.:

$$\lambda_s = \sum_{i=1}^{n} \frac{1}{x_i}$$

where x_i is the mean time to failure of part i.
The assumption of IID exponential for part times to failure for a repairable system can be misleading
for the following reasons:
1. The most important failure modes of a system are usually caused by parts which have failure
probabilities which increase with time (wearout failures)
2. Failure and repair of one part may cause damage to others. Therefore times between failures are
not necessarily independent.
3. Repairs often do not “renew” the system.
4. Repairs might be made by adjustment, lubrication, thus providing a new lease of life, but not
“renewal”
5. Replacement parts can make subsequent failure initially more likely to occur
6. Repair personnel learn by experience, so diagnostic ability improves with time. Conversely,
changes in personnel can lead to reduced diagnostic ability and therefore repeated failures.
7. Not all part failures will produce system failures
8. Factors (such as cycling) are often more important than operating times
9. Reported failures are nearly always subject to human bias and emotion
10. Failure probability is affected by scheduled maintenance or overhaul
11. Replacement parts are not necessarily drawn from the same population – they may be better or
worse.
12. System failures might be caused by parts whose combined tolerances cause the system to fail.
13. Many reported failures are not caused by part failures at all, but by events such as intermittent
connections, improper use, maintainers using opportunities to replace “suspect” parts etc.
14. Within a system, not all parts operate to the overall system cycle.
The factors above often predominate in systems to be modelled and in collected reliability data.
A CFR is often a practicable and measurable first-order assumption, particularly when data are not
sufficient to allow more detailed analysis.

Smith Chapter 8 – Methods of Modelling


Block Diagrams and Repairable Systems
Reliability Block Diagrams
Steps to creating a reliability block diagram:
1. Establish failure criteria – define what constitutes a failure since this will determine which failure
modes at the component level actually cause a system to fail.
2. Establish a reliability block diagram – it is necessary to describe the system as a number of
functional blocks which are interconnected according to the effects of each block failure on the
overall system reliability



3. Failure Mode Analysis – complete an FMEA by examining individual component failure modes
and failure rates
4. Calculation of system reliability – relating the block failure rates to the system reliability is a
question of mathematical modelling
5. Reliability allocation – the block failure rates are taken as a measure of the complexity, and
improved, suitably weighted, objectives are set for each block.
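Step 4 can be sketched for the two basic block arrangements, assuming that block failures are
independent: blocks in series all have to work, while blocks in parallel (full active redundancy) fail
only if every block fails. The block reliabilities below are hypothetical.

```python
import math


def series(reliabilities):
    """System works only if every block works."""
    return math.prod(reliabilities)


def parallel(reliabilities):
    """System fails only if every redundant block fails."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)


# Hypothetical system: two redundant pumps (0.9 each) in series with a controller (0.99).
system_reliability = series([parallel([0.9, 0.9]), 0.99])
print(system_reliability)
```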

Repairable Systems
It is now generally acknowledged that traditional Markov modelling does not correctly represent the
normal repair activities for redundant repairable systems when calculating the probability of failure
on demand (PFD). The Journal of The Safety and Reliability Society, Vol 22, No 2, 2002 published
papers by Gulland and Simpson, both of which agree with those findings.

Common Cause (Dependent) Failure (CCF)


CCFs often dominate the unreliability of redundant systems by virtue of defeating the random co-
incident failure features of the redundant protection.
Whereas simple models of redundancy assume that failures are both random and independent,
common cause failure modelling takes account of the failures which are linked, due to some
dependency, and therefore occur simultaneously or, at least, within a sufficiently short interval as to be
perceived as simultaneous.
Defences against CCF involve design and operating features.
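One widely used way to show why CCF dominates is the beta-factor model, which is an assumption here
and is not described in these notes: a fraction β of each unit's failure probability is taken to be common
to both units of a duplicated (one-out-of-two) system. The approximation below adds the independent
coincident-failure term to the common cause term; the values of p and β are hypothetical.

```python
def duplicated_system_failure_prob(p_unit, beta):
    """Approximate failure probability of a 1-out-of-2 system under the beta-factor model."""
    independent = ((1.0 - beta) * p_unit) ** 2  # both units fail independently
    common_cause = beta * p_unit                # one cause fails both units together
    return independent + common_cause


P_UNIT = 0.01
print(duplicated_system_failure_prob(P_UNIT, beta=0.0))
print(duplicated_system_failure_prob(P_UNIT, beta=0.1))
```

Even a modest β makes the common cause term roughly ten times larger than the independent term in this
example, so adding further redundancy gives little benefit unless the dependency itself is addressed.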

Fault Tree Analysis


A fault tree is a graphical method of describing the combinations of events leading to a defined system
failure. The system failure mode is known as the top event.
A fault tree uses three types of logic gate:
1. the OR gate, whereby any input causes the output to occur
2. the AND gate, whereby all inputs need to occur for the output to occur
3. the Voted gate, similar to the AND gate, whereby two or more inputs are required for the
output to occur.
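The three gates can be quantified as follows, under the assumption that the input events are
independent. The voted gate is written as a general "at least k of the inputs" gate; the input
probabilities are hypothetical.

```python
import math


def or_gate(probs):
    """Output occurs if any input occurs."""
    return 1.0 - math.prod(1.0 - p for p in probs)


def and_gate(probs):
    """Output occurs only if all inputs occur."""
    return math.prod(probs)


def voted_gate(probs, k):
    """Output occurs if at least k of the independent inputs occur."""
    # dist[j] is the probability that exactly j of the inputs processed so far have occurred.
    dist = [1.0]
    for p in probs:
        nxt = [0.0] * (len(dist) + 1)
        for j, q in enumerate(dist):
            nxt[j] += q * (1.0 - p)
            nxt[j + 1] += q * p
        dist = nxt
    return sum(dist[k:])


# Hypothetical basic event probabilities.
print(or_gate([0.1, 0.2]))
print(and_gate([0.1, 0.2]))
print(voted_gate([0.1, 0.1, 0.1], k=2))
```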

Event Tree Diagrams


Whereas fault tree analysis is probably the most widely used technique for quantitative analysis, it is
limited to combinations of AND/OR events which contribute to a single defined failure (the top event).
Event trees resemble decision trees which show the likely train of events between an initiating event
and any number of outcomes. The main element in an ET is the decision box, which contains a
question/condition with a YES/NO outcome.
The main difference between fault trees and event trees is that event trees model the order in which the
elements fail.
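A minimal sketch of event tree quantification is shown below. It assumes that every decision box is
asked on every path and that the YES probabilities are independent of earlier answers, which is a
simplification; the initiating frequency and probabilities are hypothetical.

```python
from itertools import product


def event_tree(initiating_frequency, yes_probabilities):
    """Map each YES/NO path (tuple of booleans, in order) to its outcome frequency."""
    outcomes = {}
    for path in product((True, False), repeat=len(yes_probabilities)):
        frequency = initiating_frequency
        for answered_yes, p_yes in zip(path, yes_probabilities):
            frequency *= p_yes if answered_yes else (1.0 - p_yes)
        outcomes[path] = frequency
    return outcomes


# Hypothetical tree: initiating event 0.1 per year; "alarm works?" 0.99; "shutdown works?" 0.9.
tree = event_tree(0.1, [0.99, 0.9])
for path, frequency in tree.items():
    print(path, frequency)
```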



Smith Chapter 9
Duane, J.T., Learning Curve Approach to Reliability Monitoring, IEEE
Transactions on Aerospace, Volume 2, Number 2, April 1964
Summary
Several different and complex electromechanical and mechanical systems are shown to have
remarkably similar rates of reliability improvement. These similarities provide the basis for a learning
curve, which can be used to monitor development programs, predict growth patterns, and plan
programs for reliability improvement.

The Learning Curve


In an effort to determine the manner in which reliability performance changes during development and
design improvement activity, data was analysed for a total of five different products. A remarkably
consistent pattern emerged when cumulative failure rate (defined as total malfunctions since program
start, divided by total operating hours since start) was plotted on log-log-paper as a function of
cumulative operating hours.
Considering the wide range of equipment types and complexity represented by the data, a remarkable
similarity in trends is evident. The fact that the curves are parallel indicates uniformity in rate of
reliability improvement. Relative positions of the curves in the vertical direction are evidently a
measure of inherent design reliability.

Analysis
It can be seen that all the curves form reasonably straight lines that are similar in slope. In general, it
appears that cumulative failure rate will vary in a manner directly proportional to some negative
power of cumulative operating hours. This can be expressed mathematically as:
$$\lambda_{\Sigma} = K \left( \sum H \right)^{-\alpha}$$

where:
λ_Σ ≡ ∑F / ∑H (cumulative failure rate)
∑F = cumulative failures
∑H = cumulative operating hours
K = constant
α = exponent determined by slope
This equation implies that reliability will continually increase as operating experience is gained. This
may not be true as operating time reaches extremely high levels, but the evidence presented does
indicate that the relationship is valid over a long period. This relationship probably applies as long as
active programs are in place to improve equipment reliability.
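Because the relationship is a straight line on log-log paper, K and α can be estimated by a least-squares
fit of ln(cumulative failure rate) against ln(cumulative hours). The sketch below does this in plain
Python; the data are synthetic, generated from assumed values K = 2 and α = 0.4, and are not Duane's
data.

```python
import math


def fit_duane(cum_hours, cum_failures):
    """Least-squares fit of ln(F/H) = ln(K) - alpha * ln(H); returns (K, alpha)."""
    xs = [math.log(h) for h in cum_hours]
    ys = [math.log(f / h) for h, f in zip(cum_hours, cum_failures)]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return math.exp(y_bar - slope * x_bar), -slope


def cumulative_failure_rate(hours, k, alpha):
    """Cumulative failure rate predicted by the fitted learning curve."""
    return k * hours ** (-alpha)


# Synthetic data from an assumed curve (K = 2, alpha = 0.4): F = K * H**(1 - alpha).
hours = [100.0, 1000.0, 10000.0]
failures = [2.0 * h ** 0.6 for h in hours]

k, alpha = fit_duane(hours, failures)
print(k, alpha, cumulative_failure_rate(50000.0, k, alpha))
```

Evaluating the fitted curve beyond the last data point is the straight-line extrapolation used to predict
reliability later in the development cycle.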

Discussion
The techniques proposed here assume that a “normal” rate of reliability improvement exists. Such a
normal growth rate can be useful as a standard against which to compare actual performance, but it
must not be viewed as an absolute limit on the performance of a given product.
Since the proposed procedure is intended primarily for use in conjunction with development
programs, it is important to note that test conditions have a major effect on data validity.

Extrapolation of failure rate data by straight-line extension of experimental curves provides an
obvious way of predicting reliability at given points in the development cycle.

Crow, Larry, Evaluating the Reliability of Repairable Systems, Proceedings Annual Reliability and Maintainability Symposium, 1990.
Abstract
Most complex systems, for example automobiles, aircraft, communication systems, etc., are repaired
and not replaced when they fail. This paper discusses the Weibull process, or power law non-
homogeneous Poisson process model for analysing the reliability of repairable systems. Estimation
and other statistical procedures are given for this model which are appropriate when failure data are
generated by multiple systems. It is shown that in a special case this repairable systems model
reduces to a model for reliability growth.

Introduction
When a complex system with new technology is fielded or subjected to a customer use environment,
there is often considerable interest in assessing its reliability and other related performance parameters
such as availability. Although operating tests are conducted for many systems during development, it
is generally recognised that in many cases these tests may not yield complete data representative of
actual use environments. Other interests in measuring the reliability of a fielded system may centre
on, for example, logistics and maintenance policies, quality and manufacturing issues, burn-in, wear
out, mission reliability or warranties.
Most complex systems are repaired, not replaced, when they fail. A number of books and papers in the
literature have stressed that the usual non-repairable reliability methodologies, such as the Weibull
distribution, are not appropriate for repairable system reliability analyses and have suggested the
use of the non-homogeneous Poisson process models.
The homogeneous Poisson process is equivalent to the widely used Poisson distribution and
exponential times-between-failures model, and is appropriate when the system’s failure intensity is not
affected by the system’s age. However, to realistically consider burn-in, wearout, useful life,
maintenance policies, warranties, mission reliability, etc. will often require an approach that recognises
that the failure intensity of these systems may not be constant over the operating life of interest but
may change with system age. A useful and generally practical extension of the homogeneous Poisson
process is the non-homogeneous Poisson process, which allows for the system failure intensity to
change with system age.
Typically, the reliability analyses of a repairable system under customer use will involve data
generated by multiple systems. The Weibull process or power law non-homogeneous Poisson process
is appropriate for this type of analysis. This paper will discuss the specific application of these
methods under several situations which are common in practice and will illustrate the numerical
calculations by examples.
In this paper it is strongly recommended that the reliability characteristics for each system under the
study be analysed separately before the failure data are combined. The techniques described in this
paper for combined failure data assume that each system is governed by the same Weibull process
failure intensity model.
The Model
In this paper we assume that the failures for each system under study occur according to a
non-homogeneous Poisson process with intensity function
$$u(t) = \lambda \beta t^{\beta - 1}, \quad t > 0$$

where λ > 0, β > 0 and t is the age of the system. This particular mathematical form for the
intensity u(t) is the

http://www.weibull.com/RelGrowthWeb/Crow-AMSAA_(N.H.P.P.).htm
In "Reliability Analysis for Complex, Repairable Systems" (1974), Dr. Larry H. Crow noted that the
Duane model could be stochastically represented as a Weibull process, allowing for statistical
procedures to be used in the application of this model in reliability growth. This statistical extension
became what is known as the Crow-AMSAA (N.H.P.P.) model. This method was first developed at the
U.S. Army Materiel Systems Analysis Activity (AMSAA). It is frequently used on systems when
usage is measured on a continuous scale. It can also be applied for high reliability, a large number of
trials and one-shot items. Test programs are generally conducted on a phase by phase basis. The
Crow-AMSAA model is designed for tracking the reliability within a test phase and not across test
phases.

A development testing program may consist of several separate test phases. If corrective actions are
introduced during a particular test phase then this type of testing and the associated data are appropriate
for analysis by the Crow-AMSAA model. The model analyzes the reliability growth progress within
each test phase and can aid in determining the following:

• Reliability of the configuration currently on test
• Reliability of the configuration on test at the end of the test phase
• Expected reliability if the test time for the test phase is extended
• Growth rate
• Available confidence intervals
• Applicable goodness-of-fit tests

The reliability growth pattern for the Crow-AMSAA model is exactly the same pattern as for the
Duane postulate. That is, the cumulative number of failures is linear when plotted on ln-ln scale.
Unlike the Duane postulate the Crow-AMSAA model is statistically based. Under the Duane postulate
the failure rate is linear on ln-ln scale. However for the Crow-AMSAA model statistical structure, the
failure intensity of the underlying non-homogeneous Poisson process (NHPP) is linear when plotted
on ln-ln scale.
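
As a rough illustration of the statistical side of the model, the sketch below computes the standard maximum likelihood estimates for a single system tested for a fixed time T (a time-terminated test), using the intensity u(t) = λβt^(β−1). The function names and the failure times are hypothetical, and confidence intervals and goodness-of-fit tests are not shown.

```python
import math

def crow_amsaa_mle(failure_times, end_time):
    """MLEs for a power-law NHPP observed from 0 to end_time (time-terminated)."""
    n = len(failure_times)
    beta = n / sum(math.log(end_time / t) for t in failure_times)
    lam = n / end_time ** beta
    return lam, beta

def intensity(lam, beta, t):
    """Failure intensity u(t) = lam * beta * t**(beta - 1)."""
    return lam * beta * t ** (beta - 1)

# Hypothetical cumulative failure times (hours) from a 1000-hour test phase
times = [25.0, 60.0, 140.0, 310.0, 520.0, 880.0]
lam, beta = crow_amsaa_mle(times, end_time=1000.0)

current_intensity = intensity(lam, beta, 1000.0)
print(f"beta = {beta:.3f}, lambda = {lam:.4f}")
print(f"MTBF of configuration at end of phase = {1.0 / current_intensity:.0f} h")
```

An estimate of β below 1 indicates a decreasing failure intensity (reliability growth), β above 1 indicates deterioration, and β equal to 1 reduces to the homogeneous Poisson process.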

Minitab Help File


Power-law process
A non-homogeneous Poisson process with an intensity function that represents the rate of failures or
repairs. The power-law process can model a system that is improving, deteriorating, or remaining
stable.
With the default (maximum likelihood) estimation method, the power-law model is also known as the
AMSAA model. With the least squares estimation method, the power-law process model is also
known as the Duane model.

Study Guide 5 Self Assessment Questions


A basic reliability model includes what parameters?

Four methods of reliability modelling are:


Reliability block diagrams
Common Cause Failure (CCF) modelling
Fault Tree Analysis
Event Tree Analysis
Four methods of reliability prediction are:

Define a common failure mode

When developing a reliability block diagram, a general approach should include the following
steps:
1. Establish failure criteria – define what constitutes a failure since this will determine which
failure modes at the component level actually cause a system to fail.
2. Establish a reliability block diagram – it is necessary to describe the system as a number of
functional blocks which are interconnected according to the effects of each block failure on the
overall system reliability
3. Failure Mode Analysis – complete an FMEA by examining individual component failure
modes and failure rates
4. Calculation of system reliability – relating the block failure rates to the system reliability is a
question of mathematical modelling
5. Reliability allocation – the block failure rates are taken as a measure of the complexity, and
improved, suitably weighted objectives are set.
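
Step 4 can be sketched for the simplest case of series and parallel (active redundant) blocks. The example assumes independent blocks with constant failure rates; the function names, the system layout and the failure rates are all hypothetical.

```python
import math

def block_reliability(failure_rate, hours):
    """Reliability of one block with a constant failure rate."""
    return math.exp(-failure_rate * hours)

def series(*reliabilities):
    """All blocks must work: multiply the reliabilities."""
    return math.prod(reliabilities)

def parallel(*reliabilities):
    """At least one block must work: 1 minus the product of unreliabilities."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Hypothetical system: two redundant pumps feeding one controller, 1000 h mission
r_pump = block_reliability(1e-4, 1000.0)
r_controller = block_reliability(2e-5, 1000.0)
r_system = series(parallel(r_pump, r_pump), r_controller)
print(f"System reliability = {r_system:.4f}")
```

Common cause failures would break the independence assumption used here, which is why they are modelled separately.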

STUDY GUIDE 6 - Reliability Testing
Objectives
You should be able to:
 Recognise the importance of reliability testing
 Choose the approach that is best suited to your situation.

O’Connor Chapter 11 – Reliability Testing


Introduction
Testing is an essential part of any engineering development programme. Reliability testing is
necessary because designs are seldom perfect and because designers cannot be aware of, or be able to
analyse, all the likely causes of failures of their designs in service.
Reliability testing should be considered as part of an integrated test programme, which should include:
1. Functional Testing (to confirm the design meets the basic performance requirements)
2. Environmental Testing (to ensure the design is capable of operating under the expected range
of environments)
3. Statistical Tests (as described in Chapter 5, to optimise the design of the product and
production processes)
4. Reliability testing (to ensure the product will operate without failure during its expected life)
5. Safety Testing (when appropriate)
To provide the basis for a properly integrated development test programme, the design specifications
should cover all criteria to be tested (function, environment, reliability, safety).
The development test programme should include:
1. Model allocations (components, sub-assemblies, system)
2. Requirements for facilities such as test equipment
3. A common test and failure reporting system
4. Test schedule
One person should be put in charge of the entire programme, with the responsibility and authority for
ensuring that all specification criteria will be demonstrated.
There is one conflict inherent in reliability testing. To obtain information about reliability in a
cost-effective way (i.e. quickly) it is necessary to generate failures. Only then can safety margins be
ascertained. On the other hand, failures interfere with functional and environmental testing. The
development test programme must address this dilemma.
The development test dilemma should be addressed by dividing tests into two main categories:
1. Tests in which failures are undesirable
2. Tests which deliberately generate failures
Statistical, functional and most environmental testing are in category 1. Most reliability testing is in
category 2. There must be a common reporting system for test results and failures, and for action to
be taken to analyse and correct failure modes.

Category 2 tests should be started as soon as the hardware is available. Tests should be planned to
show up failure modes as early as is practicable.

Planning Reliability Testing


Using design analysis data
The design analyses performed during the design phase should be used in preparing the reliability test
plan. These should have highlighted the risks and uncertainties in the design, and the reliability test
programme should specifically address these.

Considering Variability
We have seen (Chapters 4 and 5) how variability affects the probability of failure. A major source of
variability is the range of production processes that convert designs into hardware. Therefore the
reliability test programme must cover the effects of variability on the expected and unexpected failure
modes.

Durability
The reliability test programme must take account of the pattern of the main failure modes with respect
to time.
If the failure modes have increasing hazard rates, testing must be directed to assuring adequate
reliability during the expected life.
Generally speaking, mechanical components and assemblies are subject to increasing hazard rates
when wear, fatigue, corrosion or other deterioration processes can cause failure. Systems subject to
repair and overhaul can also become less reliable with age, due to the effects of maintenance, so the
appropriate maintenance actions must be included in the test plan.

Test Environments
The reliability test programme must cover the range of environmental conditions which the product is
likely to have to endure. The main reliability-affecting environmental factors are:
 Temperature
 Vibration
 Shock
 Humidity
 Power input and output
 Dirt
 People
In addition, electronic equipment might be subjected to:
 Electromagnetic effects (EMI)
 Voltage transients, including Electrostatic Discharge (ESD)

Testing for Reliability and Durability: Accelerated Testing


In Chapters 8 and 9 we reviewed how mechanical, electrical and other stresses can lead to failures,
and in Chapter 4 how variations of strength, stresses and other conditions can influence the likelihood
of failure or duration to failure. In this section we describe how tests should be designed and
conducted to provide assurance that designs and products are reliable and durable in service.

For most engineering designs, we do not know the size of the “uncertainty gap” between the theoretical
and real capabilities of the design and of products made to it, for the whole population, over their
operating lives and environments.
The conventional approach to this problem has been to treat reliability as a functional performance
characteristic that can be measured, by testing items over a period of time whilst applying simulated
or actual in-service conditions, then calculating the reliability achieved on the test. These methods
are fundamentally inadequate for providing assurance of reliability. The main reason is that they are
based on measuring the reliability achieved during the application of simulated or actual stresses that
are within the specified service environments, in the expectation that the number of failures will be
below some criterion for the test.
The correct approach is straightforward: we must test to cause failures, not test to demonstrate
successful achievement. If the design is simple and there is an adequate margin between stress and
strength, we might decide that no further testing is necessary. If however, constraints such as weight
force us to design with smaller margins, and if the component’s function is critical, we might well
consider it prudent to test some quantity to failure.
When failures occur on test we should ask whether they could occur in use. The questions that must
be asked are:
1. Could this failure occur in use (on other items, after longer times, at other stresses)?
2. Could we prevent it from happening in use?
The stresses that were applied are relevant only in so far as they were tools to provide the evidence
that an opportunity exists to improve the design. We have obtained information on how to reduce the
uncertainty gap.
For even simple and common failure situations like these, there is not just one distribution that is
important, but a number of possible distributions and interactions. This reasoning leads to the main
principle of development testing for reliability: we should increase the stresses so that we cause
failures to occur, then use the information to improve reliability and durability.
The logic that justifies the use of very high “unrepresentative” stresses is based upon four aspects of
engineering reality:
1. The causes of failures that will occur in the future are often very uncertain
2. The probabilities of and durations to failures are also highly uncertain.
3. Time spent testing is expensive, so the more quickly we can reduce the uncertainty gap the
better
4. Finding causes of failures during development and preventing recurrence is far less expensive
than finding new failure causes during use/service.
It cannot be emphasised too strongly: testing at “representative” stresses, in the hope that failures will
not occur, is very expensive in time and money and is mostly a waste of resources.

Smith, Chapter 12 –
AS3960 Section 2
AS3960 Page 26
Self-Assessment Questions
1. List the five elements of reliability testing
 Functional Testing (to confirm the design meets the basic performance requirements)
 Environmental Testing (to ensure the design is capable of operating under the expected range
of environments)
 Statistical Tests (as described in Chapter 5, to optimise the design of the product and
production processes)
 Reliability testing (to ensure the product will operate without failure during its expected life)
 Safety Testing (when appropriate)
2. The widest range of reliability-affecting environmental categories are:

3. This is not one of the main types of test program?

4. Name four categories of testing


 Temperature
 Vibration/shock
 Electromagnetic
5. How many systems to be tested can be determined by considering what three issues?

STUDY GUIDE 7 - Managing & Solving Reliability Problems
O’Connor Chapter 12 – Analysing Reliability Data
Introduction
This chapter describes a number of techniques, further to the probability plotting methods described in
chapter 3, which can be used to analyse reliability data derived from development tests or field units,
with the objectives of monitoring trends, identifying causes of unreliability, and measuring or
demonstrating reliability.

Pareto Analysis
As a first step in reliability data analysis we can use the Pareto principle of the ‘significant few and the
insignificant many’. It is often found that a large proportion of failures in a product are due to a small
number of causes. Therefore if we analyse the failure data, we can determine how to solve the largest
proportion of the overall reliability problem with the most economical use of resources.
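
A Pareto analysis amounts to counting failures by cause, ranking the causes, and accumulating percentages. The sketch below shows one way to do this; the function name and the failure records are hypothetical.

```python
from collections import Counter

def pareto_table(failure_causes):
    """Return (cause, count, cumulative %) rows, most frequent cause first."""
    counts = Counter(failure_causes)
    total = sum(counts.values())
    rows, running = [], 0
    for cause, count in counts.most_common():
        running += count
        rows.append((cause, count, 100.0 * running / total))
    return rows

# Hypothetical failure records: one entry per reported failure
records = ["connector"] * 12 + ["seal"] * 5 + ["bearing"] * 2 + ["software"] * 1
rows = pareto_table(records)
for cause, count, cum_pct in rows:
    print(f"{cause:<10} {count:>3} {cum_pct:6.1f}%")
```

In this made-up data set two causes account for 85% of the failures, which is where corrective effort would be directed first.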

Accelerated Test Data Analysis


Failure and life data from accelerated stress tests can be analysed using the methods described in
Chapters 2-5. If the mechanism is well understood, for example material fatigue, then the model for
the process can be applied to interpret the results and to derive reliability or life values at different
stress levels.
Extrapolation of accelerated test results to expected in-service conditions can be misleading if the test
stresses are much higher, since different failure mechanisms might be stimulated. It is important
that the primary objective of the test is understood; whether it is to determine or confirm a life
characteristic, or to help create designs that are inherently failure free.
These methods are not appropriate for failures of assemblies or systems, when several different failure
modes might be present.
Probability plotting methods (Chapter 3) can also be used for analysing such data, when sufficient
data are available.

Reliability Analysis of Repairable Systems


Chapter 3 described methods for analysing data related to time to first failure. However, for
repairable systems, which really represent the great majority of everyday reliability experience, the
distributions of times to first failures are much less important than is the failure rate or rate of
occurrence of failure (ROCOF) of the system.
Any repairable system may be considered as an assembly of parts, the parts being replaced when they
fail. If we ignore replacement (repair) times, which are usually small in comparison with standby or
operating times, and if we assume that the time to failure of any part is independent of any repair
actions, then we can use the methods of event series analysis (shown in chapter 2) to analyse system
reliability.
If we do not perform a centroid test and assume the data were independently and identically
distributed (IID), we might order the data in rank order and plot on probability paper.
This example shows how important it is for failure data to be analysed correctly, depending on
whether we need to understand the reliability of a non-repairable part or of a repairable system. The
presence of a trend when the data are ordered chronologically shows that times to failure are not IID,
and ordering by magnitude, which implies IID, will therefore give misleading results.
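
A trend check of this kind can be sketched with the centroid (Laplace) test. The form used below assumes the system is observed for a fixed period (time-truncated data) and that the failure times are cumulative times since the start of observation; the function name and the data are hypothetical.

```python
import math

def centroid_test(failure_times, observation_end):
    """Laplace statistic U: approximately standard normal if there is no trend."""
    n = len(failure_times)
    centroid = sum(failure_times) / n
    return (centroid - observation_end / 2.0) / (
        observation_end * math.sqrt(1.0 / (12.0 * n))
    )

# Hypothetical cumulative failure times (hours) over 1000 h of observation
times = [150.0, 400.0, 620.0, 780.0, 870.0, 940.0, 985.0]
u = centroid_test(times, 1000.0)
print(f"U = {u:.2f}")
```

A clearly positive U means the failures are bunching towards the end of the period (failures arriving more often), a clearly negative U means they are becoming less frequent, and a value near zero is consistent with no trend.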

We can derive system reliability over a period by plotting the cumulative times to failure in
chronological order rather than in rank order.
If there are no perturbations, the failure rate will tend to a constant value after most parts have been
replaced at least once, regardless of the failure trends of individual parts. This is one of the main
reasons the Constant Failure Rate (CFR) assumption has become so widely used for systems, and why
part hazard rate has been confused with failure rate.
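If part times to failure are independently and identically exponentially distributed (IID exponential) the
system will have a CFR which will be the sum of the reciprocals of the part mean times to failure, i.e.:
$$\lambda_s = \sum_{i=1}^{n} \frac{1}{x_i}$$
where $x_i$ is the mean time to failure of part $i$ and $n$ is the number of parts.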
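
A small numerical example of this sum of reciprocals, using hypothetical part mean times to failure:

```python
def system_failure_rate(part_mttfs):
    """System CFR as the sum of the reciprocals of the part mean times to failure."""
    return sum(1.0 / m for m in part_mttfs)

# Hypothetical part mean times to failure, in hours
mttfs = [2000.0, 5000.0, 10000.0]
lam_s = system_failure_rate(mttfs)   # 0.0005 + 0.0002 + 0.0001 per hour
system_mtbf = 1.0 / lam_s
print(f"System failure rate = {lam_s:.4f} per hour, MTBF = {system_mtbf:.0f} h")
```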
The assumption of IID exponential for part times to failure for a repairable system can be misleading
for the following reasons:
1. The most important failure modes of a system are usually caused by parts which have failure
probabilities which increase with time (wear out failures)
2. Failure and repair of one part may cause damage to others. Therefore times between failures are
not necessarily independent.
3. Repairs often do not “renew” the system.
4. Repairs might be made by adjustment, lubrication, etc., thus providing a new lease of life, but not
“renewal”
5. Replacement parts can make subsequent failure initially more likely to occur
6. Repair personnel learn by experience, so diagnostic ability improves with time. Conversely,
changes in personnel can lead to reduced diagnostic ability and therefore repeated failures.
7. Not all part failures will produce system failures
8. Factors (such as cycling) are often more important than operating times
9. Reported failures are nearly always subject to human bias and emotion
10. Failure probability is affected by scheduled maintenance or overhaul
11. Replacement parts are not necessarily drawn from the same population – they may be better or
worse.
12. System failures might be caused by parts whose combined tolerances cause the system to fail.
13. Many reported failures are not caused by part failures at all, but by events such as intermittent
connections, improper use, maintainers using opportunities to replace “suspect” parts etc.
14. Within a system, not all parts operate to the overall system cycle.
The factors above often predominate in systems to be modelled and in collected reliability data.
A CFR is often a practicable and measurable first-order assumption, particularly when data are not
sufficient to allow more detailed analysis.

CUSUM Charts
The ‘cumulative sum’, or CUSUM, chart, is an effective graphical technique for monitoring trends in
quality control and reliability. The principle is that, instead of monitoring the measured value of
interest, we plot the divergence, plus or minus, from the target value. The method enables us to report
progress simply and in a way that is very easily comprehended.

The CUSUM chart also provides a sensitive indication of trends and changes. Instead of indicating
measured values against the sample number, the plot shows the CUSUM, and the slope provides a
sensitive indicator of the trend, and of points at which the trend changes.
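
The CUSUM calculation itself is just a running total of the deviations from target. A minimal sketch, with a hypothetical target and hypothetical monthly failure counts:

```python
from itertools import accumulate

def cusum(values, target):
    """Running sum of (value - target) for each sample in order."""
    return list(accumulate(v - target for v in values))

# Hypothetical failures per month against a target of 4 per month
monthly_failures = [3, 5, 4, 4, 6, 7, 7, 8]
chart = cusum(monthly_failures, target=4)
print(chart)
```

While the process runs on target the CUSUM stays roughly level; the steady upward slope in the last four points shows the failure count has moved above target from the fifth month.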

Exploratory Data Analysis and Proportional Hazards Modelling


Exploratory data analysis is a simple graphical technique for searching for connections between time
series data and explanatory factors. In the reliability context, the failure data are plotted as a time
series chart, along with other information.
This method of presenting data can be very useful for showing up causes of unreliability in systems
such as vehicle fleets, process plants etc.
Proportional hazards modelling (PHM) is a mathematical extension of EDA. In the proportional
hazards model, the covariates are assumed to have a multiplicative effect on the total hazard rate. In
standard regression analysis or analysis of variance the effects are assumed to be additive. The
proportional hazards approach can be applied to failure rate data from repairable and non-repairable
systems.
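
The notes do not give the model equation. The form most commonly used for the multiplicative assumption (offered here as an illustration rather than as the textbook's own notation) is:

```latex
h(t; z_1, \dots, z_k) = h_0(t)\,\exp\!\left(\beta_1 z_1 + \beta_2 z_2 + \dots + \beta_k z_k\right)
```

Here $h_0(t)$ is the baseline hazard rate, $z_1, \dots, z_k$ are the covariates (explanatory factors) and $\beta_1, \dots, \beta_k$ are coefficients estimated from the data. A one-unit increase in $z_j$ multiplies the hazard rate by $\exp(\beta_j)$, whatever the value of $t$.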

Reliability Demonstration
It is often necessary to measure the reliability of equipment and systems during development,
production and use. Two basic forms of reliability measurement are used. A sample of equipment may
be subjected to a formal reliability test, with the condition specified in detail. Reliability may also be
monitored during development and use, as test and utilization proceed, without tests being set up
specifically for reliability measurement. This section describes standard methods of test and analysis
which are used to demonstrate compliance with reliability requirements.

Probability ratio sequential test (PRST) (US MIL-HDBK-781)


MIL-HDBK-781 testing is based on probability ratio sequential testing (PRST). Testing continues
until the ‘staircase’ plot of failures versus time crosses a decision line. The reject line indicates a
boundary beyond which the equipment will have failed to meet the test criteria. Crossing the accept
line denotes that the test criteria have been met. Test time is stated as multiples of the specified
MTBF.
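
The accept and reject lines can be illustrated with Wald's sequential probability ratio test for exponentially distributed times between failures. This is only a sketch under that assumption: the function name, the MTBF values and the risk values are hypothetical, and the published MIL-HDBK-781 test plans use their own tabulated, truncated boundaries rather than the untruncated approximations shown here.

```python
import math

def prst_decision(failures, test_time, upper_mtbf, lower_mtbf,
                  producer_risk, consumer_risk):
    """Compare the log likelihood ratio (lower vs upper MTBF) with Wald's limits."""
    log_lr = (failures * math.log(upper_mtbf / lower_mtbf)
              - (1.0 / lower_mtbf - 1.0 / upper_mtbf) * test_time)
    reject_limit = math.log((1.0 - consumer_risk) / producer_risk)
    accept_limit = math.log(consumer_risk / (1.0 - producer_risk))
    if log_lr >= reject_limit:
        return "reject"
    if log_lr <= accept_limit:
        return "accept"
    return "continue"

# Hypothetical plan: upper MTBF 200 h, lower MTBF 100 h, both risks 10%
plan = dict(upper_mtbf=200.0, lower_mtbf=100.0,
            producer_risk=0.1, consumer_risk=0.1)
print(prst_decision(0, 500.0, **plan))   # long failure-free running
print(prst_decision(5, 200.0, **plan))   # many early failures
print(prst_decision(2, 400.0, **plan))   # still between the two lines
```

Each failure moves the plot one step towards the reject line and each failure-free hour moves it towards the accept line, which gives the staircase shape described above.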

Combining Results Using Bayesian Statistics


It can be argued that the result of a reliability demonstration test is not the only information available
on a product, but that information is available prior to the start of the test, from component and
subassembly tests, previous tests on the product and even intuition based on experience. Why should
this information not be used to supplement the formal test result? Bayes’ theorem enables us to
combine such probabilities.
The Bayesian approach is very controversial in reliability engineering, particularly as it has been
argued that it provides a justification for less reliability testing. Choosing a prior distribution based
on subjective judgement or other test experience can also be very contentious. Combining
subassembly test results in this way also ignores the possibility of interface problems. The Bayesian
approach is not normally recommended and has not been approved in any formal national standards.
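
To show the mechanics only, the sketch below combines prior information with a test result for a constant failure rate, assuming a gamma prior distribution on the failure rate (the conjugate prior for exponential times to failure). The function name and all the numbers are hypothetical, and the choice of prior is exactly the contentious step described above.

```python
def update_gamma_prior(prior_shape, prior_rate, failures, test_hours):
    """Gamma(shape, rate) prior on the failure rate -> posterior after the test."""
    return prior_shape + failures, prior_rate + test_hours

# Hypothetical prior equivalent to 2 failures in 1000 h of earlier experience,
# combined with a formal test giving 1 failure in 2000 h
shape, rate = update_gamma_prior(2.0, 1000.0, failures=1, test_hours=2000.0)
posterior_mean_rate = shape / rate
print(f"Posterior mean failure rate = {posterior_mean_rate:.4f} per hour")
print(f"Test result alone           = {1 / 2000.0:.4f} per hour")
```

In this example the prior pulls the estimate away from the test result alone, which shows why the choice of prior matters so much.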

Non-Parametric Methods
Non-parametric statistical techniques (see page 65) can be applied to reliability measurement. They
are arithmetically very simple and so can be useful as quick tests in advance of more detailed analysis,
particularly when no assumption is made of the underlying failure distribution.

Reliability Growth Modelling
The Duane Method
It is common for new products to be less reliable during early development than later in a program,
when improvements have been incorporated as a result of failures observed and corrected. This was
first analysed by J.T. Duane, who derived an empirical relationship based upon observation of the
MTBF improvement of a range of items used on aircraft. Duane observed that the cumulative MTBF
plotted against total time on log-log paper gave a straight line.
The slope of the lines gives an indication of the rate of MTBF growth and hence the effectiveness of
the reliability programme in correcting failure modes. Duane observed that typical slopes ranged
between 0.2 and 0.4, and that the value was correlated with the intensity of the effort on reliability
improvement.
The Duane method is applicable to a population with a number of failure modes which are
progressively corrected, and in which a number of items contribute different running times to the total
time. Therefore it is not appropriate for monitoring early development testing. The method is also not
consistent with the use of accelerated tests during development, since the objective of these is to force
failures, not to generate reliability statistics.
After the end of a development programme, the anticipated MTBF of production items is measured
assuming the development testing accurately simulated the expected in-use stresses of the production
items and that the standard of items being tested at the end of the development program fully
represents production items. The empirical Duane method provides a reasonable approach to
monitoring and planning MTBF growth for complex systems.
The Duane method can also be used in principle to assess the amount of test time required to attain a
target MTBF. If the MTBF is known at some early stage, the test time required can be estimated if a
value is assumed for the growth slope.
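
The test time estimate can be sketched from the usual form of the Duane relationship, in which the cumulative MTBF grows as (total time)^α and the instantaneous MTBF equals the cumulative MTBF divided by (1 − α). The function names and the numbers are hypothetical.

```python
def duane_cumulative_mtbf(theta0, t0, t, alpha):
    """Cumulative MTBF at time t, given cumulative MTBF theta0 at time t0."""
    return theta0 * (t / t0) ** alpha

def duane_instantaneous_mtbf(theta0, t0, t, alpha):
    """Instantaneous MTBF = cumulative MTBF / (1 - alpha)."""
    return duane_cumulative_mtbf(theta0, t0, t, alpha) / (1.0 - alpha)

def duane_test_time(theta0, t0, target_mtbf, alpha):
    """Total test time at which the instantaneous MTBF reaches target_mtbf."""
    return t0 * (target_mtbf * (1.0 - alpha) / theta0) ** (1.0 / alpha)

# Hypothetical programme: cumulative MTBF of 50 h after 200 h of testing,
# assumed growth slope 0.4, target instantaneous MTBF of 200 h
t_needed = duane_test_time(50.0, 200.0, 200.0, 0.4)
print(f"Total test time required = {t_needed:.0f} h")
```

Because the slope appears as an exponent of 1/α, the estimated test time is very sensitive to the value assumed for α, so it is worth repeating the calculation for a range of slopes.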
The Duane method is criticised as being empirical and subject to wide variation. It is also argued that
reliability improvement in development is not usually progressive but occurs in steps as
modifications are made. However, the model is simple to use and it can provide a useful planning and
monitoring method for reliability growth. As with any other failure data, trend tests as described in
Chapter 2 should be performed to ascertain whether the assumption of a constant failure rate is valid.
Statistical tests for MTBF or success rate changes can also be used to confirm reliability growth as
described in Chapter 2.

The M(t) Method


The M(t) method of plotting failure data is a simple and effective way of monitoring reliability
changes over time. It is most suitable for analysing the reliability performance of equipment in
service.
M(t) is the mean accumulated number of failures as a function of operating time. The slope of the line
indicates the proportion per time unit failing, or the failure intensity. Reliability improvement will
reduce the slope. A straight line indicates a constant (random) pattern. An increasing slope indicates
an increasing pattern. Changes in slope indicate changing trends.
The M(t) method can be used to monitor reliability trends such as the effectiveness of improvement
actions.
The M(t) method can be useful for identifying and interpreting failure trends. It can also be used for
evaluating logistics and warranty policies.
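
M(t) can be computed directly from field failure records. The sketch below assumes every system in the group has been in service for at least the time t being evaluated; the function name and the fleet data are hypothetical.

```python
def mean_cumulative_failures(failure_times_by_system, t):
    """M(t): failures up to time t across all systems, divided by number of systems."""
    n_systems = len(failure_times_by_system)
    total = sum(1 for times in failure_times_by_system for ft in times if ft <= t)
    return total / n_systems

# Hypothetical failure times (operating hours) for three systems, each run 1000 h
fleet = [
    [120.0, 400.0, 950.0],
    [300.0, 800.0],
    [90.0, 500.0, 700.0, 990.0],
]
for t in (250.0, 500.0, 750.0, 1000.0):
    print(f"M({t:.0f}) = {mean_cumulative_failures(fleet, t):.2f}")
```

Plotting these values against t gives the M(t) line, and the slope between successive points estimates the failure intensity over that interval.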

O’Connor, Cautionary Note, page 22
Whilst statistical methods can be very powerful, economic and effective in reliability engineering
applications, they must be used in the knowledge that variation in engineering is in important ways
different from variation in most natural processes or in repetitive engineering processes such as
controlled machining or diffusion processes. Such processes are usually:
 Constant in time, in terms of the nature (average, spread) of the variation
 Distributed in a particular way, describable by a mathematical function known as the s-normal
distribution
In fact, these conditions often do not apply in engineering. For example:
 Past data cannot be used to forecast future reliability, using purely statistical methods. A
change in a process might affect reliability and the change might be deliberate or accidental,
known or unknown
 Components might be selected according to criteria such as dimension or other measured
parameters. This can invalidate the s-normal distribution assumption on which many of the
statistical methods are based. This might or might not be important in assessing results.
 A process or parameter might vary in time, continuously or cyclically, so that statistics derived
at one time might not be relevant at others.
 Variation is often deterministic by nature, for example spring deflection as a function of force,
and it would not always be appropriate to apply statistical techniques to this sort of situation.
 Variation in engineering can arise from factors that defy mathematical treatment.
 Variation can be catastrophic, not only continuous.
These points highlight the fact that variation in engineering is caused to a large extent by people, as
designers, makers, operators and maintainers. Therefore the human element must always be
considered and statistical analysis must not be relied on without appropriate allowances being made
for the effects of factors such as motivation, training, management and the many other factors that can
influence reliability.
In any application of statistical methods, ultimately all cause and effect relationships have explanations.
Statistical techniques can be very useful in helping us to understand and control engineering
situations; however they do not provide explanations on their own. We must seek to understand the
causes of variation, since only then can we really be in control.

Smith, Chapter 3
Study Guide 7: Self Assessment Questions
1. The usual indices of quality costs include
Prevention Costs
Appraisal Costs
Failure Costs
2. Reliability growth modelling

3. Profit from quality is measured by the difference between

4. What are four components of life cycle cost?
I. Acquisition Costs
II. Ownership Cost
III. Operating Cost
IV. Administration Cost
5. How does reliability contribute to life cycle?
Determines the frequency of repair, fixes spares requirements, determines loss of revenue (with
maintainability)
