Contents
STUDY GUIDE 1 – Introduction to Reliability
Objectives
Study Guide 1 Notes
Assurance Technologies
Reliability Fundamentals
System Effectiveness
Quality And Reliability
Determination Of Cost Drivers
Introduction To Cost Effective Analysis
Profitability Of Reliability
O’Connor Chapter 1 - Introduction to Reliability Engineering
Why Do Engineering Items Fail?
Probabilistic Reliability
Repairable and Non Repairable Items
Non-Repairable Items
Repairable Items
Development of Reliability Engineering
Reliability As An Effectiveness Parameter
Reliability Programme Activities
Reliability Economics And Management
Smith Chapter 1 - The history of reliability and safety technology
Definitions
Failure Data
Hazardous Failures
Reliability and Risk Prediction
Achieving Reliability and Safety-Integrity
RAMS Cycle
Contractual Pressures
Life Cycle
Life Cycle of a product includes the following typical phases:
• Concept
• Research & development
• Full scale development
• Production
• Operation and Support, and
• Disposal
In the Cat world, this is called NPI (New Product Introduction).
In most cases, 50-80% of the total costs are incurred during the operation and support phases,
which makes them an important focus for control of costs and losses.
New Product Introduction
https://npi.cat.com/
The New Product Introduction process simply builds on the 6 Sigma product and process
creation methodology, DMEDI, with which most Caterpillar employees are familiar. DMEDI
methodology is embedded within the NPI process, so any NPI program that follows the NPI
process meets 6 Sigma criteria. The NPI process is structured into phases, like DMEDI, but
includes more phases.
First, there is the Strategy phase, in which the groundwork is laid for all future decision-
making and all
Quality Assurance
This includes all those planned and systematic actions necessary to provide adequate confidence that a
product or service will satisfy given requirements for quality
System Analysis
Product Safety
The chances of safety-related incidents must be eliminated - for example, those that might be caused
by misuse or design oversights. We are trying to eliminate design-induced defects.
Logistic Engineering
Includes the support-related activities that deal with system design and development. It covers the
support of the primary equipment and the support infrastructure.
Maintainability Analysis
Maintainability is the ease with which a machine can be repaired, i.e. how long it takes to repair. Analysis
includes the assessment of accessibility, interchangeability, modularity, standardisation, operator and
maintainer requirements, test and maintenance requirements, spares provisioning, and maintenance
policy. Refer to AS3969-1990 Para 2.2.3.(m).
Reliability Fundamentals
Two terms that are often used in describing Reliability are Failure Rate and Hazard Rate. Others are
MTTF and MTBF
Reliability is the probability that the product will perform a specified function for a specified
operating interval under a specified set of conditions. The important criteria are thus probability,
function, interval and conditions. It is important that these four criteria are defined and quantified
otherwise reliability cannot be described.
Failure rate is the number of failures per unit time, and it changes over the life of the product.
Mean Time To Failure - MTTF is used to measure the average life of an item that is not usually
repaired (e.g. light bulb, circuit board). Note this is an average life and is often subject to wide
variation.
Mean Time Between Failures - MTBF is used to measure the average life of an item that is usually
repaired. MTBF is the reciprocal of failure rate (when the failure rate is constant).
E.g. analysis of data shows 20 failures during 10000 operating hours.
Failure rate (lambda) = 20/10000 = 0.002 failures per hour
MTBF = 1/lambda = 1/0.002 = 500 hours between failures
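The arithmetic above can be sketched in a few lines of Python. This is only an illustration, assuming a constant failure rate over the observation period; the function names are invented for this example.

```python
def failure_rate(failures: int, operating_hours: float) -> float:
    """Average failure rate (failures per hour) over the observed period."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failures / operating_hours


def mtbf(failures: int, operating_hours: float) -> float:
    """Mean time between failures: total operating hours per failure."""
    if failures <= 0:
        raise ValueError("at least one failure is needed to estimate MTBF")
    return operating_hours / failures


# Figures from the example: 20 failures in 10000 operating hours.
lam = failure_rate(20, 10_000)
print(f"failure rate = {lam} per hour, MTBF = {mtbf(20, 10_000)} hours")
```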
System Effectiveness
When assessing a system, the fundamental principle is that the parts should be optimised as a
composite set, not as individual parts.
System Effectiveness is the probability that a system can effectively meet an operational demand
within a given time when operated under specific conditions. It is usually considered in terms of
technical performance, capability, availability and dependability.
• Capability is a measure of how well a product performs
• Availability is the probability the product is ready for use when needed
• Dependability is the probability of successful performance
• Durability is the point at which the system's wear-out failures start to increase
Reliability is an inherent characteristic of design and cannot be altered without modification to the
design. Additional maintenance cannot make the system more reliable; it will simply make the system
more dependable.
Profitability Of Reliability
Profit = Revenue – Expense
Profitability margins are sensitive to costs incurred through attention to reliability: for example,
• Maintenance costs
• Inventory holding costs
• Warranty
• Product recall
• Product reject
• Down time
• Product liability
Probabilistic Reliability
The concept of reliability as a probability means any attempt to quantify it must involve statistical
methods. Reliability statistics are concerned with reliability values that are very high or very low.
Quantifying such numbers brings increased uncertainty, since we need correspondingly more
information. The application of statistics in reliability is less straightforward than in other areas. In
reliability we are concerned with the behaviour in the extreme tails of distributions, where variation is
hard to quantify and data is expensive. Further difficulties arise in application of statistics owing to
the fact that variation is often a function of time (cycles, seasons, maintenance periods etc). Therefore,
the reliability data from any past situation cannot be used to make credible forecasts of the future
behaviours, without taking into account non-statistical factors such as design changes, maintainer
training, or even production or service problems. The statistician working in reliability engineering
needs to be aware of these realities.
Non-Repairable Items
There are three ways the pattern of failures can change with time.
1. The hazard rate may be decreasing
2. The hazard rate may be constant
3. The hazard rate may be increasing
The hazard function h(t) is a function such that the probability that an item which has survived to age
t fails in the small interval t to t+δt is h(t) δt. This is the function, known loosely as the “failure rate”,
which is represented in the bathtub curve (BTC). So, h(t) = f(t)/R(t), where f(t) is the failure probability
density function and R(t) is the reliability function.
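As a hedged illustration of h(t) = f(t)/R(t) and of the three patterns listed above, the sketch below assumes a Weibull life distribution with shape parameter beta and scale parameter eta. The Weibull choice and the numerical values are assumptions made for this example only.

```python
import math


def weibull_reliability(t: float, beta: float, eta: float) -> float:
    """R(t): probability of surviving to age t."""
    return math.exp(-((t / eta) ** beta))


def weibull_pdf(t: float, beta: float, eta: float) -> float:
    """f(t): failure probability density at age t."""
    return (beta / eta) * (t / eta) ** (beta - 1) * math.exp(-((t / eta) ** beta))


def hazard(t: float, beta: float, eta: float) -> float:
    """h(t) = f(t) / R(t)."""
    return weibull_pdf(t, beta, eta) / weibull_reliability(t, beta, eta)


# Hypothetical scale of 1000 hours; three shapes give the three patterns.
for beta in (0.5, 1.0, 3.0):
    rates = [hazard(t, beta, eta=1000.0) for t in (100.0, 500.0, 1500.0)]
    print(f"beta = {beta}: h(t) at 100, 500, 1500 h = {rates}")
```

With a shape below 1 the printed hazard falls with age, with a shape of 1 it stays constant at 1/eta, and with a shape above 1 it rises.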
Constant hazard rate is characteristic of failures caused by loads in excess of the design strength, at a
constant average rate. For example, overstress failures or maintenance-induced failures typically occur
randomly and at a generally constant rate. Material fatigue brought about by strength deterioration due
to cyclic loading is a failure mode which does not occur for a finite time, and then exhibits an
increasing probability of occurrence. Decreasing hazard rates are observed in items less likely to fail
as their survival time increases. This is often observed in electronic parts. The combined effect
generates the so-called bathtub curve. This shows an initial decreasing hazard rate or infant mortality
period, an intermediate useful life period, and a final wear out period.
Last saved 10/31/2006 10:51:00 PM 13 Last printed 5/9/2006 01:59:00 PM
Repairable Items
The rate of occurrence of failures (ROCOF) can also vary with time, and important implications can be derived from these trends.
A constant failure rate (CFR) is indicative of externally induced failures, as in the constant hazard rate
situation for non-repairable items. A CFR is also typical of complex systems subject to repair and
overhaul, where different parts exhibit different patterns of failure with time and parts have different
ages due to repair or replacement.
Repairable systems can also show a decreasing failure rate (DFR) when reliability is improved by
progressive repair, as defective parts, which fail relatively early, are replaced by good parts. An increasing
failure rate (IFR) occurs in repairable systems when wear out failure modes of parts begin to
predominate.
Reliability Function
Corresponds to the probability that an item will survive to any given age.
ETA Value
Weibull scale parameter (η), also known as the characteristic life: the age by which 63.2% of the
population has failed. For an exponential distribution, which is a model for random events, 63.2% of
the population has failed by the mean life.
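A short check of the characteristic-life property, using the standard two-parameter Weibull cumulative distribution F(t) = 1 - exp(-(t/eta)^beta). At t = eta the fraction failed is 1 - 1/e, about 63.2%, whatever the shape parameter. The eta and beta values below are hypothetical.

```python
import math


def weibull_cdf(t: float, beta: float, eta: float) -> float:
    """Fraction of the population that has failed by age t."""
    return 1.0 - math.exp(-((t / eta) ** beta))


eta = 1000.0  # hypothetical characteristic life in hours
for beta in (0.5, 1.0, 3.5):
    print(f"beta = {beta}: F(eta) = {weibull_cdf(eta, beta, eta):.4f}")
```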
Suspended item
When a test is run and ceases before a given item fails, it is a suspended item.
Random Failure
Beta = 1 in a Weibull distribution. As the item’s age increases there is no increasing risk of failure,
and the component should only be replaced on failure. This is typical of many electronic components,
where the risk of failure is constant over their lifetime.
Hazard Function
Hazard Function is the failure rate – more specifically, the conditional probability per unit time that an
item that has survived to age t fails in the small interval t to t+dt. This is the function that is
represented by the bathtub curve (BTC).
Failure Data
Reliability growth / reliability improvement arising as a natural consequence of the analysis of
failure has long been a central feature of product development. "Test and correct" was practiced long before
the development of formal processes for data collection and analysis, because failure is usually self-evident
and inevitably leads to design modifications. Nineteenth- and early twentieth-century designs
were less constrained by the cost and schedule pressures of today. In many cases, reliability was the result
of over-design. Quantified reliability assessment was not needed. Thus failure rates were
not required, and consequently there was little incentive for the formal collection of failure data. The
advent of the electronic age, and the experience with poor field reliability of military equipment in the
1940s and 1950s led to the need for more complex mass-produced component parts. This gave rise to
the collection of failure information from the field and from interpretation of test data. This activity
was stimulated by the development of reliability prediction techniques that require component failure
rates as inputs to the prediction equations.
RAMS Cycle
Loops shown in Figure 1.2 represent RAMS activities as follows:
1. Review of the system RAMS feasibility calculations against the initial RAMS targets
2. Review of the conceptual design RAMS predictions against the RAMS target
3. Review of the detailed design against the RAMS target
4. Review of the RAMS test, at the end of design and development, against the requirements
5. Review of the acceptance demonstration against requirements
6. Review of the field RAMS performance against the targets
Contractual Pressures
It is now common for reliability parameters to be specified in tender and other contractual documents.
There are problems arising from:
• Ambiguity of definition
• Hidden statistical risks
• Inadequate coverage of the requirements
• Unrealistic requirements
• Unmeasurable requirements
Interrelationship Of Terms
Bathtub Distribution
1. Decreasing Failure Rate (infant mortality, burn-in, early failures): Related to manufacture (welds,
joints, connections, dirt, impurities, cracks)
2. Constant Failure Rate (random failures, useful life, stress-related failures, stochastic failures):
Assumed to be stress related, random fluctuations of stress exceeding component strength.
3. Increasing Failure Rate (wear out failures): Owing to corrosion, oxidation, breakdown of
insulation, atomic migration, friction wear, shrinkage, fatigue.
Cost Of Quality
Attempts to set budget levels for various elements of quality costs are rare. Quality costs can be
grouped under three headings:
1. Prevention Costs
- Design review
- Quality and reliability training
- Vendor quality planning
- Audits
- Installation prevention activities
- Product qualification
- Quality engineering
2. Appraisal Costs
- Test and inspect
- Maintenance & calibration
The key elements are thus probability, function, interval and conditions, and they need to be defined
and quantified otherwise reliability cannot be adequately described.
2. Failure rate is defined as?
Failure rate is the number of failures per unit time and its subsequent change over the life of a product.
3. Maintenance data shows that a component has 25 failures during the last 100,000 system
operating hours. The MTBF for the component is?
The Mean Time Between Failures can be calculated by simply dividing the total operating hours by
the number of failures.
In this case, 100000 hours/25 failures = 4000 hours, or stated in plain English, on average, we will
experience a failure of this component every 4000 system operating hours.
4. The failure rate of equipment will most likely vary in three distinct phases during its life.
What are these phases?
Decreasing Failure Rate (Weibull shape factor <1) (also known as infant mortality, burn-in, early
failures): related to manufacture (welds, joints, connections, dirt, impurities, cracks)
Constant Failure Rate (Weibull shape factor = 1) (also known as random failures, useful life, stress-related
failures, stochastic failures): assumed to be stress related, random fluctuations of stress
exceeding component strength.
Increasing Failure Rate (Weibull shape factor >1) (also known as wearout failures): corrosion,
oxidation, breakdown of insulation, atomic migration, friction wear, shrinkage, fatigue.
5. The ratio of tolerance to process variation is called? What is another name for this process
variation?
This ratio is denoted Cp, and is called the process capability. If a product has a tolerance, and it is to
be produced by a process which generates variation in the product, it is obviously important that the
process variation be less than the tolerance.
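A minimal sketch of the Cp ratio, assuming the common convention that process variation is taken as six sample standard deviations (this convention is not stated in the notes). The specification limits and measurements are hypothetical.

```python
import statistics


def process_capability(lower_limit: float, upper_limit: float, measurements) -> float:
    """Cp = tolerance width / process variation, with variation assumed to be 6 sigma."""
    sigma = statistics.stdev(measurements)  # sample standard deviation
    if sigma == 0:
        raise ValueError("measurements show no variation")
    return (upper_limit - lower_limit) / (6.0 * sigma)


# Hypothetical part: nominal 10.0 mm with a tolerance of +/- 0.6 mm.
sample = [9.9, 10.0, 10.1]
print(f"Cp = {process_capability(9.4, 10.6, sample):.2f}")
```

A Cp above 1 means the assumed process spread fits inside the tolerance band; a Cp below 1 means some product would be expected to fall outside it.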
Cost of unreliability
The cost of unreliability in service should be evaluated early in the development phase, so that the
effort on reliability can be justified and requirements set, related to expected costs. There are other
costs, such as goodwill and market share. These can be hard to quantify. In extreme cases
unreliability can lead to litigation if damage or injury occurs.
BS5760
Published for commercial use
ARMP-1
NATO standard on reliability and maintainability
ISO-IEC60300 Dependability
Covers reliability, maintainability and safety (“dependability”). Describes management and methods
related to product design and development. Covers reliability prediction, design analysis, reliability
demonstration tests, maths/statistical techniques. Manufacturing not included. Methods are
inconsistent with modern best practice; in particular, the sections on reliability testing define rigid
environmental and other conditions to be applied, and pass/fail criteria based on statistical methods
described and rejected in O’Connor Ch 2 and 12.
Specifying Reliability
How NOT to do it:
1. Do not write vague requirements, such as “reliable as possible”. Such statements do not provide
assurance against reliability being compromised.
2. Do not write unrealistic requirements, such as “Will not fail under the specified operating conditions”.
An unrealistically high reliability requirement will not be accepted as a credible design
parameter, and is therefore likely to be ignored.
The reliability specification must contain:
1. A definition of failure related to the product’s function. The definition should cover all failure
modes relevant to its function.
2. A full description of the environments in which the product will be stored, transported, operated and
maintained.
3. A statement of the reliability requirement, and/or a statement of the failure modes and effects
which are particularly critical and which must therefore have a very low probability of occurrence.
Definition of failure
Failure should always be related to a measurable parameter or a clear indication.
Programme Activities
Extent of activities will depend upon:
• The severity of the requirement.
Responsibilities
Reliability and maintainability are engineering parameters and the responsibility for their achievement
is therefore primarily with the design team. Quality assurance techniques play a vital role in achieving
the goals but cannot be used to ‘test in’ reliability to a design which has its own inherent level.
Definitions
If MTTR is specified then the meaning of repair time must be defined in detail. MTTR is often used
when mean down time is intended.
Failure itself must be thoroughly defined at system and module levels. It may be necessary to define
more than one type of failure, or failures for different operating modes (eg in flight or on the ground)
in order to describe all the requirements. MTBFs might then be ascribed to different failure types.
MTBF and failure rates often require clarification of “failure” and “time”.
The bathtub curve depicts early, random and wearout failures. Reliability parameters usually refer to
random failure unless stated to the contrary, it being assumed that burn-in failures are removed by
screening and wearout is eliminated as far as possible by preventative replacement.
Parameters should not be used without due regard to their meaning and applicability. Failure rate, for
example, has little meaning except when describing random failures. Availability, MTBF or reliability
should be specified in preference.
Reliability and maintainability are often combined by specifying Availability. This can be defined in
more than one way, and should thus be clearly specified. The usual form is Steady State Availability
(MTBF/(MDT+MTBF)), where MDT is the mean down time.
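The steady state form can be sketched as follows. The function name and the input figures are hypothetical and serve only to show the calculation.

```python
def steady_state_availability(mtbf_hours: float, mdt_hours: float) -> float:
    """Steady state availability = MTBF / (MDT + MTBF)."""
    if mtbf_hours <= 0 or mdt_hours < 0:
        raise ValueError("MTBF must be positive and MDT non-negative")
    return mtbf_hours / (mdt_hours + mtbf_hours)


# Hypothetical item: 500 hours between failures, 10 hours mean down time.
print(f"availability = {steady_state_availability(500.0, 10.0):.4f}")
```

Because down time appears in the denominator, the same MTBF can yield very different availability figures depending on whether MDT covers active repair only or also waiting and logistics time, which is why the definition needs to be spelt out in the specification.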
Environment
A common mistake is to fail to specify the environmental conditions under which the product is to
work. The spec is often confined to temp range and humidity, which may not be sufficient. Other
parameters include pressure, vibration and shock, chemical attack, power supply
variations/interference, radiation, human factors and many others. The combination or cycling of
parameters may have significant results.
Where equipment is used as standby or held as spares, the conditions will be different to those
experienced by operating units. It is often assumed that because a unit is not powered or is stored, it
will not fail. In fact this environment might be more conducive to failure. Transport environmental
conditions and liabilities for component failures should also be considered.
Maintainability can also be influenced by environment. Conditions can influence repair times, since
the use of particular protective clothing, remote handling devices and safety precautions increases the
active elements of repair time.
Maintenance Support
The provision of spares, test equipment, personnel, transport and the maintenance of such is a
responsibility that must be described in the contract and the supplier must be conscious of the risks
involved in the customer not meeting their side of the agreement.
Levels of skill and training should be specified.
Liability
The exact nature of a supplier’s liability must be spelt out, including the maximum penalty that can
be incurred.
If part of the liability for failure or repair is to fall to some other subcontractor, then care must be
taken in defining each party’s area.
Other Areas
Reliability and maintainability programme
Sometimes the R&M activities are specified in the contract. In a development contract this allows the
customer to monitor activities against agreed milestones. Sometimes standard programs are used:
US MIL-STD-785 Reliability Program for Systems and Equipment Development and Production
Specifies programme plans, reviews, predictions and so on.
Design standards
Specific standards are sometimes described or referenced. A problem exists that these standards are
very detailed and most manufacturers have their own version. The fine detail can be overlooked until
some formal acceptance inspection takes place, by which time retrospective action is difficult, time
consuming and costly.
Pitfalls
The following lists those aspects of Reliability and Maintainability likely to be mentioned in an
invitation to tender or in a contract.
Definitions
The most likely area of dispute is the definition of what constitutes a failure and whether or not a
particular incident ranks as one. There are levels of failure, types of failure, causes of failure
and effects of failure. Careful definition of the failure types covered by the contract is therefore
important.
Repair Time
Repair times can be grouped into active and passive elements. Broadly speaking, the active elements
are dictated by system design and passive by maintenance and operating arrangements. For this
reason, the supplier should never guarantee any part of the repair time that is influenced by the user.
Statistical risks
In both maintainability and reliability tests, producer and consumer risks apply.
Conclusion Drawn
Accept Ho Reject Ho
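To make the two risks concrete, the sketch below assumes a simple pass/fail demonstration test: n independent trials, with the product accepted if no more than c failures are seen. The test plan and the "good" and "bad" failure probabilities are hypothetical and are not taken from the notes.

```python
from math import comb


def prob_accept(n: int, c: int, p: float) -> float:
    """Probability of at most c failures in n independent trials (binomial)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))


# Hypothetical plan: 20 trials, accept if at most 1 failure occurs.
n, c = 20, 1
p_good = 0.02  # hypothetical failure probability of an acceptable product
p_bad = 0.20   # hypothetical failure probability of an unacceptable product

producer_risk = 1.0 - prob_accept(n, c, p_good)  # acceptable product rejected
consumer_risk = prob_accept(n, c, p_bad)         # unacceptable product accepted
print(f"producer risk = {producer_risk:.3f}, consumer risk = {consumer_risk:.3f}")
```

Tightening the acceptance number c lowers the consumer risk but raises the producer risk, which is why both figures should be agreed before the test is run.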
Quoted specifications
Sometimes a reliability or maintainability program or test plan is specified by calling up a published
standard. The danger is the possibility that not all the quoted terms are suitable and the standard will
not be studied in every detail.
Environment
If environmental factors are likely to be present in the field then they must be specifically allowed for
in the design and price. It may not be desirable to specify every parameter possible since this leads
to over-design.
Liability
When stating the supplier’s liability it is important to establish its limit in terms of both cost and time.
Suppliers must ensure they know when they are finally free of liability.
In summary
The biggest pitfall is to assume that either party wins any advantage from ambiguity or looseness in the
conditions of a contract. The effort expended in a dispute far outweighs any advantage that might have
been secured. If every effort is made to cover all the areas as clearly and simply as possible then both
parties will gain.
Penalties
Any cash penalty must be a genuine and reasonable pre-estimate of the damages thought to result
from a system outage.
In summary
Strict Liability
This concept hinges on the idea that liability exists for no other reason than the mere existence of a
defect. No breach of contract or act of negligence is required in order to incur responsibility.
3 Data Required
Consideration of the foregoing objectives defines the need for a system which provides for the
collection of documented data covering:
a. The total population under observation
b. Operational conditions
c. Failures of the items
d. Maintenance operations
4 Guidelines
It is the intention of this standard to provide guidelines for setting up data collection.
5 Reports
General Comments: the relative content of use and failure reports will vary markedly with the items
considered and the type of operation.
Use Reporting: Data reporting should be supported by information on the use of the items
Failure Reporting: Failure reports should cover all the failures which have been observed. They
should also contain sufficient information to identify misuse failures. Failures considered to be
attributable to any maintenance action should be so noted.
Preventative Maintenance Reporting: Essentially, preventative maintenance is scheduled so as to
forestall failure or eliminate failure entirely. When no replacements or repairs are made, the action
can be classified as a “Use” report. When the preventative maintenance actions results in a
3 Test Conditions
The test conditions to be used should be those given in the relevant component standards.
4 Data on Failures
The following information shall be supplied:
a. The number of failures observed, categorised by test conditions and type of failure
b. The times at which the failures occurred or were verified
c. Particular incidents which occurred during testing which might have affected the results
d. Statement of failure mechanism
e. Discarded test data and the reasons why they were not used in the presentation of results
Additional requirements:
a. Failure criteria
b. Failure rate which can be assumed to be constant
c. Failure rate which cannot be assumed to be constant
d. Influence of stress
Presentation of data:
a. The failure rates of components failing in the sample tested shall preferably be supplied in
terms of the test period, e.g. 4 x 10^-6 in 2000 hours, rather than the failure rate alone.
b. The upper confidence level (and the lower, where appropriate), shall be stated. Preferred
confidence levels are 60% and 90%. It shall be stated whether the failure rate is observed,
assessed or extrapolated.
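One way an upper confidence limit on failure rate might be computed is sketched below. It assumes a constant failure rate and a time-terminated test, so that the number of failures is Poisson distributed; these assumptions, the function names and the test figures are illustrative and are not taken from the standard.

```python
import math


def poisson_cdf(k: int, mu: float) -> float:
    """P(X <= k) for a Poisson variable with mean mu."""
    term = math.exp(-mu)
    total = term
    for i in range(1, k + 1):
        term *= mu / i
        total += term
    return total


def upper_failure_rate(failures: int, hours: float, confidence: float) -> float:
    """One-sided upper confidence limit on a constant failure rate (per hour)."""
    target = 1.0 - confidence
    lo, hi = 0.0, 1.0
    while poisson_cdf(failures, hi) > target:  # widen the bracket until it contains the root
        hi *= 2.0
    for _ in range(100):  # bisection on the expected number of failures
        mid = (lo + hi) / 2.0
        if poisson_cdf(failures, mid) > target:
            lo = mid
        else:
            hi = mid
    return hi / hours


# Hypothetical test: 2 failures observed in 500000 component-hours.
for level in (0.60, 0.90):
    print(f"{level:.0%} upper limit: {upper_failure_rate(2, 500_000.0, level):.3e} per hour")
```

The 90% limit is always higher than the 60% limit for the same data, which is why a quoted failure rate is hard to interpret unless the confidence level is stated alongside it.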
Program Activities
Definition
Design and Development phase
Production phase
Installation and Commissioning phase
Operation-usage and maintenance phase
Reliability Assessment
Maintainability Prediction
Maintainability prediction
Prediction advantages
Techniques
Basic Assumptions and Interpretations
Elements of maintainability prediction techniques
Maintainability Demonstration and Testing
General requirements
Maintainability testing program
Maintainability demonstration
Test Conditions
Maintenance task selection
Data Input
Reporting systems
Specification and description
Operating history
Failure history
Data Sources
Guidelines
Past Experience
Design and development
Production
Factory test
Guarantee or warranty reports – product liability test reporting
Supply of replacement parts
Material or component supply
Repair department
Field installation, demonstration or commissioning tests
User reporting system
Field surveys
Validity of Data
Product manufacturer
Materials or component supplier
Field data retrieval programs
Analysis of Data
Quantitative data
Qualitative data
Requirement specifications
Failure Classification
Predictions in Reliability
The concept of deriving mathematical models, which can be used to predict reliability, is intuitively
appealing. Sometimes these models are as simple as a single fixed value for failure rate or reliability.
However, some of the models derived are quite complex, taking account of many factors likely to
affect reliability.
Like other predictive models in science and engineering, these have been based upon consideration of
what might affect the parameter of interest, in this case reliability, or in other words, failure. Thus
there have been attempts to create theories. However, this approach is of severely limited validity for
predicting reliability.
Whilst an engineering component might have properties such as conductance and mass, it is very
unlikely to have an intrinsic reliability that meets such criteria.
Failure or the absence of failure is heavily dependent upon human actions and perceptions. This is
never true of laws of nature. This represents a fundamental limitation of the concept of reliability
prediction using mathematical models.
Onset of failure is nearly always a discontinuous function, subject to the predictive difficulties described
for models of the behaviour of a system which contains a moderately large number of factors and
interactions, and whose progression to a failed state is time-variant.
We saw in Chapter 4 how reliability can vary by orders of magnitude with small changes in load and
strength distributions, and the large amount of uncertainty inherent in estimating reliability from the
load-strength model. These real uncertainties must be borne in mind when synthesizing the reliability
of a system by considering the likely failure rates of its parts.
Another limitation arises from the fact reliability models are usually based upon statistical analysis of
past data. Much more data is required to derive a statistical relationship, and even then there will be
uncertainty because the sample can seldom be taken to represent the population. Sometimes we can
say that the likelihood increases but we can very rarely predict the time of failure. A statistically
derived relationship can never be proof of a causal connection – it must be supported by theory based
on an understanding of the underlying cause and effect relationship.
It is never sensible to make a prediction based on past data unless we can be sure the underlying
conditions that affect future behaviour will not change. However, since engineering is concerned with
deliberate changes, predictions of reliability based on past data ignore the fact that changes might be
made to improve reliability. The use of past data to predict the future can be very misleading and
unduly pessimistic.
A reliability prediction for a system containing many parts is likely to be more accurate than for a
small system. It is important to remember the variances in reliability at the part level can be orders of
magnitude greater than the variances at system level.
Therefore any reliability prediction based on mathematical models or growth models must be treated
with some scepticism.
A designer cannot design for an MTBF unless he places as much faith in the reliability math model as
he does in, say, Ohm’s law. The MTBF cannot be measured as can, say, power consumption, and there
is no reason or logic to believe that all items will show the same MTBF or patterns of failure over a period of
time.
Reliability Databases
There are several published databases that give reliability information on engineering components and
subsystems.
Best known would be MIL-HDBK-217 for electronic components.
$R = \prod_{i=1}^{n} R_i$
where $R_i$ is the reliability of the ith component. This is known as the product rule or series rule.
$\lambda = \sum_{i=1}^{n} \lambda_i$ and $R = e^{-\lambda t}$
This is the simplest basic model on which parts count reliability prediction is based.
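The series rule can be sketched in a few lines of Python. This is a minimal illustration that assumes constant part failure rates; the function names and the failure-rate values are hypothetical and are not taken from any database.

```python
import math

def series_failure_rate(part_failure_rates):
    """Parts-count rule: the system failure rate is the sum of the part rates."""
    return sum(part_failure_rates)

def series_reliability(part_failure_rates, t):
    """Reliability at time t of a series system of constant-failure-rate parts."""
    return math.exp(-series_failure_rate(part_failure_rates) * t)

# Hypothetical failure rates in failures per hour (illustrative values only).
rates = [2e-6, 5e-6, 1e-6]
system_rate = series_failure_rate(rates)
r_1000h = series_reliability(rates, 1000.0)

# The product rule applied part by part gives the same result.
product_of_parts = math.prod(math.exp(-lam * 1000.0) for lam in rates)
```

Summing the rates and multiplying the part reliabilities are equivalent here only because every part is assumed to have a constant failure rate.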
Active Redundancy
In this system, composed of two s-independent parts with reliability R1 and R2, satisfactory operation
occurs if either one or both parts function.
m-out-of-n redundancy
In some active parallel redundant configurations, m out of n units may be required to be working for
the system to function. The reliability of an m/n system with n s-independent components, in which
all the unit reliabilities are equal, is the binomial reliability function.
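As a rough sketch of the two redundancy cases above, the binomial reliability function can be written directly in Python. The function names are hypothetical, and the unit reliability of 0.9 is an illustrative value only.

```python
from math import comb

def m_out_of_n_reliability(m, n, r):
    """Probability that at least m of n s-independent units, each of reliability r, work."""
    return sum(comb(n, k) * r**k * (1 - r)**(n - k) for k in range(m, n + 1))

def active_parallel_reliability(r1, r2):
    """Two-unit active redundancy: the system works if either or both units work."""
    return r1 + r2 - r1 * r2

# Illustrative values: a 2-out-of-3 system and a two-unit active parallel pair.
r_2oo3 = m_out_of_n_reliability(2, 3, 0.9)
r_pair = active_parallel_reliability(0.9, 0.9)
```

With equal unit reliabilities, the 1-out-of-2 binomial case reduces to the two-unit active redundancy expression.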
Standby redundancy
Standby redundancy is achieved when one unit does not operate continuously but is switched on when
the primary unit fails. The standby unit and the sensing and switching system may be considered to
have a “one-shot” reliability of starting and maintaining system function until the primary component
is repaired.
Modular Design
Availability and the cost of maintaining the system can be influenced by the way the design is
partitioned. Modular design is used in many complex products to ensure a failure can be corrected by
a relatively easy replacement of the defective module, rather than by replacement of a complete unit.
Enabling Events
An enabling event is one which, whilst not necessarily a failure or a direct cause of failure, will cause
a higher level failure event when accompanied by a failure. Examples are:
1. Warning systems disabled for maintenance
2. Controls incorrectly set
3. Personnel following procedures incorrectly or not at all.
Practical Aspects
It is essential that practical engineering considerations are applied to system reliability analysis.
Examples of situations in which practical and logical errors can occur are:
1. Two diodes connected in series. If either fails open circuit there will be no current flow, so for
this failure mode they are in series from a reliability point of view. If either fails short circuit,
the other will provide the required system function, so for this failure mode they are in
parallel from a reliability point of view.
2. Common mode failures are often difficult to predict, but can dominate the real reliability or
safety of a system.
Petri Nets
A Petri net is a general-purpose graphical and mathematical tool for describing relations existing
between conditions and events.
Owing to the variety of logical relations that can be represented with Petri nets, it is a powerful tool
for modelling systems. Petri nets can be used not only for simulation, reliability analysis and failure
monitoring, but also for dynamic behaviour observation.
Limitations
The Markov analysis method suffers from one major disadvantage. It is necessary to assume constant rates for
both failures and repairs. It is also necessary to assume events are s-independent, which is hardly ever
the case in the real world. The extent to which these assumptions might affect the situation should be carefully
considered when evaluating a Markov analysis.
Reliability Apportionment
Sometimes it is necessary to break an overall system reliability requirement down to individual
subsystem reliabilities.
The starting point for apportionment is an RBD for the system drawn to show the appropriate system
structure. It is important to take account of the uncertainty inherent in any early prediction.
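One simple way to illustrate apportionment, which is an assumption here rather than a method given in these notes, is equal apportionment across the blocks of a series RBD: each of n subsystems receives the nth root of the system target. The function name and the target value are hypothetical.

```python
def equal_apportionment(system_target, n_subsystems):
    """Equal apportionment for a series system: each block gets the nth root of the target."""
    if not 0.0 < system_target <= 1.0 or n_subsystems < 1:
        raise ValueError("target must be in (0, 1] and n_subsystems must be at least 1")
    return system_target ** (1.0 / n_subsystems)

# Illustrative requirement: a system reliability of 0.95 split over 4 series subsystems.
subsystem_target = equal_apportionment(0.95, 4)
```

In practice the allocation would normally be weighted by complexity, criticality or achievable reliability, and the result should be treated with the same caution as any early prediction.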
Conclusions
System reliability prediction and modelling can be a frustrating exercise, since even quite simple
systems can lead to complex logic when redundancy, repair times, testing and monitoring are taken
into account.
In real life, availability is often determined more by the holding of spares and administrative times than by
predictable factors such as mean repair time.
Prediction and modelling are concepts which have generated much attention, literature and controversy
in the reliability field.
Much of this work, however, is only of abstruse interest, since reliability is not a parameter which is
inherently predictable on the basis of the laws of nature or of statistical extrapolation.
Computer-Aided Engineering
Environments
Human Reliability
Summary
Configuration Control
Fatigue
Fatigue damage within engineering materials is caused when a repeated mechanical stress is applied,
the stress being above a limiting value called the fatigue limit. Fatigue damage is cumulative, so that
repeated stress above the fatigue limit will eventually result in failure.
Initiation and growth rate of the cracks vary depending upon the material properties and on surface
and internal conditions. The material property that imparts resistance to fatigue damage is the
toughness.
Software Errors
Preventing Errors
Programming Style
Fault Tolerance
Redundancy/Diversity
Languages
Data Reliability
Software Checking
Software Testing
Error Reporting
Hardware/Software Interfaces
Conclusions
The versatility and economy offered by software control can lead to an under-estimation of the
difficulty and cost of software generation. To ensure the program will operate satisfactorily under all
conditions that might exist requires an effort greater than that required for the basic design and first-
program preparation. The cost and effort of debugging a large, unstructured program containing many
errors can be so high that it is cheaper to scrap the whole program and start again.
The essential elements of a software development program to ensure a reliable project are:
1. Specify the requirements completely and in detail
2. Make sure that all the project staff understand the requirements
3. Check the specifications thoroughly
4. Design a structured program and specify each module fully
5. Check the design and the module specifications against the system specifications
6. Check the written program for errors, line by line
7. Plan module and system tests to cover important input combinations, particularly at extreme
values
8. Ensure full recording of all development notes, tests, checks, errors and program changes.
Programmable Devices
Software-related Failures
Modern/Formal Methods
Software Checklists
2. A sensitivity analysis
3. A trade-off analysis
Maintenance Schedules
When any maintenance activity is determined to be necessary, we must determine the most suitable
intervals between its performance.
The most appropriate base is the one that best accounts for the equipment’s utilisation in terms of the
causes of degradation (wear, fatigue, parameter change, etc.) and that is actually measured.
Software
As discussed in Chapter 10, software does not fail in the ways hardware can, so there is no
“maintenance”. If it is found to be necessary to change a program for any reason, this is really
redesign of the program, not repair.
Maintainability Predictions
Maintainability prediction is the estimation of the maintenance workload which will be imposed by
scheduled and unscheduled maintenance. A standard method used for this work is MIL-HDBK-472,
which contains four methods for predicting Mean Time To Repair (MTTR) of a system. Method II is
most frequently used and is based simply on summing the products of the expected repair times and
failure rates of the individual failure modes and dividing by the sum of the individual failure rates, e.g.:
$\mathrm{MTTR} = \frac{\sum (\lambda t_r)}{\sum \lambda}$
The same approach is used for predicting the mean preventative maintenance time, with λ replaced by
the frequency of occurrence of the preventative maintenance action.
MIL-HDBK-472 describes the methods to be used for predicting individual task times based upon
design considerations such as accessibility, skill levels, etc.
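The failure-rate-weighted MTTR calculation can be sketched as follows. This is a minimal illustration only: the function name and the failure-rate and repair-time values are hypothetical, and it does not reproduce the task-time estimation procedures of MIL-HDBK-472.

```python
def predicted_mttr(failure_modes):
    """Failure-rate-weighted mean repair time.

    failure_modes: iterable of (failure_rate, repair_time) pairs.
    """
    modes = list(failure_modes)
    total_rate = sum(lam for lam, _ in modes)
    if total_rate <= 0:
        raise ValueError("the sum of the failure rates must be positive")
    return sum(lam * t_r for lam, t_r in modes) / total_rate

# Hypothetical failure modes: (failures per hour, repair time in hours).
modes = [(1e-4, 2.0), (3e-4, 0.5), (1e-4, 4.0)]
mttr = predicted_mttr(modes)
```

Replacing each failure rate by the frequency of a preventative maintenance action gives the mean preventative maintenance time in the same way.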
Maintainability Demonstrations
A standard approach to maintainability demonstration is MIL-HDBK-470. The technique is the same
as maintainability prediction using method III of MIL-HDBK-472, except that the individual task
times are measured rather than estimated from design.
Repairable Systems
It is now generally acknowledged that traditional Markov modelling does not correctly represent the
normal repair activities for redundant repairable systems when calculating the probability of failure
on demand (PFD). The Journal of The Safety and Reliability Society, Vol 22, No 2, 2002 published
papers by Gulland and Simpson, both of which agree with those findings.
Analysis
It can be seen that all the curves form reasonably straight lines that are similar in slope. In general, it
appears that cumulative failure rate will vary in a manner directly proportional to some negative
power of cumulative operating hours. This can be expressed mathematically as:
$\lambda_{\Sigma} = K \left( \sum H \right)^{-\alpha}$
where:
$\lambda_{\Sigma} \equiv \frac{\sum F}{\sum H}$
$\sum F$ = cumulative failures
$\sum H$ = cumulative operating hours
$K$ = constant
$\alpha$ = exponent determined by slope
This equation implies that reliability will continually increase as operating experience is gained. This
may not be true as operating time reaches extremely high levels, but the evidence presented does
indicate that the relationship is valid over a long period. This relationship probably applies as long as
active programs are in place to improve equipment reliability.
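Because the relationship is a straight line on log-log axes, K and the exponent can be estimated by an ordinary least-squares fit of ln(cumulative failure rate) against ln(cumulative hours). The sketch below assumes that approach; the function name is hypothetical and the data are synthetic, generated from assumed values of K = 0.5 and alpha = 0.4 purely to show that the fit recovers them.

```python
import math

def fit_duane(cum_hours, cum_failures):
    """Least-squares fit of ln(F/H) = ln K - alpha * ln H; returns (K, alpha)."""
    xs = [math.log(h) for h in cum_hours]
    ys = [math.log(f / h) for h, f in zip(cum_hours, cum_failures)]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    alpha = -slope                          # the slope on log-log axes is -alpha
    k = math.exp(y_bar - slope * x_bar)     # the intercept is ln K
    return k, alpha

# Synthetic data lying exactly on a Duane line (assumed K = 0.5, alpha = 0.4).
hours = [100.0, 300.0, 1000.0, 3000.0]
failures = [0.5 * h ** 0.6 for h in hours]
k_fit, alpha_fit = fit_duane(hours, failures)
```

Real test data will scatter about the line, so the fitted slope should be read alongside a plot of the points rather than on its own.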
Discussion
The techniques proposed here assume that a “normal” rate of reliability improvement exists. Such a
normal growth rate can be useful as a standard against which to compare actual performance, but it
must not be viewed as an absolute limit on the performance of a given product.
Since the proposed procedure is intended primarily for use in conjunction with development
programs, it is important to note that test conditions have a major effect on data validity.
Introduction
When a complex system with new technology is fielded or subjected to a customer use environment,
there is often considerable interest in assessing its reliability and other related performance parameters
such as availability. Although operating tests are conducted for many systems during development, it
is generally recognised that in many cases these tests may not yield complete data representative of an
actual use environment. Other interests in measuring the reliability of a fielded system may centre
on, for example, logistics and maintenance policies, quality and manufacturing issues, burn-in, wear
out, mission reliability or warranties.
Most complex systems are repaired, not replaced, when they fail. A number of books and papers in the
literature have stressed that the usual non-repairable reliability methodologies, such as the Weibull
distribution, are not appropriate for repairable system reliability analyses and have suggested the
use of the non-homogeneous Poisson process models.
The homogeneous Poisson process is equivalent to the widely used Poisson distribution and
exponential times between system failures model, and is appropriate when the system’s failure intensity is not
affected by the system’s age. However, realistic consideration of burn-in, wearout, useful life,
maintenance policies, warranties, mission reliability etc. will often require an approach that recognises
that the failure intensity of these systems may not be constant over the operating life of interest but
may change with system age. The non-homogeneous Poisson process is a useful and generally practical
extension of the Poisson process which allows for the system failure intensity to change with system age.
Typically, the reliability analyses of a repairable system under customer use will involve data
generated by multiple systems. The Weibull process or power law non-homogeneous Poisson process
is appropriate for this type of analysis. This paper will discuss the specific application of these
methods under several situations which are common in practice and will illustrate the numerical
calculations by examples.
In this paper it is strongly recommended that the reliability characteristics for each system under the
study be analysed separately before the failure data are combined. The techniques described in this
paper for combined failure data assume that each system is governed by the same Weibull process
failure intensity model.
The Model
In this paper we assume that the failures for each system under study are occurring according to a
non-homogeneous Poisson process with intensity function
$u(t) = \lambda \beta t^{\beta - 1}, \quad t > 0$
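The intensity function can be evaluated directly, and integrating it from 0 to t gives the expected number of failures, lambda times t to the power beta. The sketch below uses hypothetical function names and illustrative parameter values.

```python
def failure_intensity(t, lam, beta):
    """Power-law NHPP intensity u(t) = lam * beta * t**(beta - 1), for t > 0."""
    if t <= 0:
        raise ValueError("t must be positive")
    return lam * beta * t ** (beta - 1)

def expected_failures(t, lam, beta):
    """Expected cumulative failures by age t (the integral of the intensity)."""
    if t <= 0:
        raise ValueError("t must be positive")
    return lam * t ** beta

# With beta = 1 the intensity is constant, i.e. a homogeneous Poisson process.
constant_intensity = failure_intensity(50.0, 0.02, 1.0)
# With beta < 1 the intensity falls as the system ages.
early = failure_intensity(10.0, 0.5, 0.6)
late = failure_intensity(100.0, 0.5, 0.6)
```

A beta greater than 1 gives an intensity that rises with age, which is the wearout case.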
http://www.weibull.com/RelGrowthWeb/Crow-AMSAA_(N.H.P.P.).htm
In "Reliability Analysis for Complex, Repairable Systems" (1974), Dr. Larry H. Crow noted that the
Duane model could be stochastically represented as a Weibull process, allowing for statistical
procedures to be used in the application of this model in reliability growth. This statistical extension
became what is known as the Crow-AMSAA (N.H.P.P.) model. This method was first developed at the
U.S. Army Materiel Systems Analysis Activity (AMSAA). It is frequently used on systems when
usage is measured on a continuous scale. It can also be applied for high reliability, a large number of
trials and one-shot items. Test programs are generally conducted on a phase by phase basis. The
Crow-AMSAA model is designed for tracking the reliability within a test phase and not across test
phases.
A development testing program may consist of several separate test phases. If corrective actions are
introduced during a particular test phase then this type of testing and the associated data are appropriate
for analysis by the Crow-AMSAA model. The model analyzes the reliability growth progress within
each test phase and can aid in determining the following:
The reliability growth pattern for the Crow-AMSAA model is exactly the same pattern as for the
Duane postulate. That is, the cumulative number of failures is linear when plotted on ln-ln scale.
Unlike the Duane postulate the Crow-AMSAA model is statistically based. Under the Duane postulate
the failure rate is linear on ln-ln scale. However for the Crow-AMSAA model statistical structure, the
failure intensity of the underlying non-homogeneous Poisson process (NHPP) is linear when plotted
on ln-ln scale.
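Because the Crow-AMSAA model is statistically based, its parameters can be estimated by maximum likelihood. The sketch below assumes a single system on a time-terminated test and uses the commonly quoted estimates for that case; these formulas are not given in the notes above, and the function name and failure times are hypothetical.

```python
import math

def crow_amsaa_mle(failure_times, test_end):
    """Maximum likelihood estimates (lam, beta) for a time-terminated test.

    failure_times: cumulative times of each failure, all in (0, test_end].
    """
    n = len(failure_times)
    if n == 0 or any(t <= 0 or t > test_end for t in failure_times):
        raise ValueError("need at least one failure time in (0, test_end]")
    log_sum = sum(math.log(test_end / t) for t in failure_times)
    if log_sum == 0:
        raise ValueError("all failures at test_end; beta cannot be estimated")
    beta_hat = n / log_sum
    lam_hat = n / test_end ** beta_hat
    return lam_hat, beta_hat

# Hypothetical cumulative failure times (hours) from a 1000-hour test phase.
times = [25.0, 90.0, 210.0, 480.0, 850.0]
lam_hat, beta_hat = crow_amsaa_mle(times, 1000.0)
```

An estimated beta below 1 indicates a decreasing failure intensity, that is, reliability growth within the test phase.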
When developing a reliability block diagram, a general approach should include the following
steps:
1. Establish failure criteria – define what constitutes a failure since this will determine which
failure modes at the component level actually cause a system to fail.
2. Establish a reliability block diagram – it is necessary to describe the system as a number of
functional blocks which are interconnected according to the effects of each block failure on the
overall system reliability
3. Failure Mode Analysis – complete an FMEA by examining individual component failure
modes and failure rates
4. Calculation of system reliability – relating the block failure rates to the system reliability is a
question of mathematical modelling
5. Reliability allocation – the block failure rates are taken as a measure of the complexity, and
improved, suitably weighted, objectives are set.
Considering Variability
We have seen (Chapters 4 and 5) how variability affects the probability of failure. A major source of
variability is the range of production processes that convert designs into hardware. Therefore the
reliability test programme must cover the effects of variability on the expected and unexpected failure
modes.
Durability
The reliability test programme must take account of the pattern of the main failure modes with respect
to time.
If the failure modes have increasing hazard rates, testing must be directed to assuring adequate
reliability during the expected life.
Generally speaking, mechanical components and assemblies are subject to increasing hazard rates
when wear, fatigue, corrosion or other deterioration processes can cause failure. Systems subject to
repair and overhaul can also become less reliable with age, due to the effects of maintenance, so the
appropriate maintenance actions must be included in the test plan.
Test Environments
The reliability test programme must cover the range of environmental conditions which the product is
likely to have to endure. The main reliability-affecting environmental factors are:
Temperature
Vibration
Shock
Humidity
Power input and output
Dirt
People
In addition, electronic equipment might be subjected to:
Electromagnetic effects (EMI)
Voltage transients, including Electrostatic Discharge (ESD)
Smith, Chapter 12 –
AS3960 Section 2
AS3960 Page 26
Self-Assessment Questions
1. List the five elements of reliability testing
Pareto Analysis
As a first step in reliability data analysis we can use the Pareto principle of the ‘significant few and the
insignificant many’. It is often found that a large proportion of failures in a product are due to a small
number of causes. Therefore if we analyse the failure data, we can determine how to solve the largest
proportion of the overall reliability problem with the most economical use of resources.
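A Pareto analysis amounts to counting failures by cause, ranking the causes, and accumulating the percentages. The sketch below shows one way to do this; the function name and the failure records are hypothetical.

```python
from collections import Counter

def pareto_table(failure_causes):
    """Return (cause, count, cumulative percent) rows, most frequent cause first."""
    counts = Counter(failure_causes)
    total = sum(counts.values())
    table = []
    cumulative = 0
    for cause, count in counts.most_common():
        cumulative += count
        table.append((cause, count, 100.0 * cumulative / total))
    return table

# Hypothetical failure records, one entry per reported failure.
records = (["connector"] * 12 + ["solder joint"] * 5
           + ["capacitor"] * 2 + ["software"] * 1)
table = pareto_table(records)
```

In this made-up data set the top cause alone accounts for 60 percent of the failures, which is the kind of concentration the Pareto principle leads us to look for.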
CUSUM Charts
The ‘cumulative sum’, or CUSUM, chart, is an effective graphical technique for monitoring trends in
quality control and reliability. The principle is that, instead of monitoring the measured value of
interest, we plot the divergence, plus or minus, from the target value. The method enables us to report
progress simply and in a way that is very easily comprehended.
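The CUSUM values are simply the running total of the deviations from the target. The sketch below uses a hypothetical function name, made-up monthly failure counts, and an assumed target of 5 failures per month.

```python
from itertools import accumulate

def cusum(values, target):
    """Running sum of the deviations (plus or minus) of each value from the target."""
    return list(accumulate(v - target for v in values))

# Hypothetical monthly failure counts against a target of 5 per month.
monthly_failures = [4, 6, 5, 7, 8, 9]
chart_points = cusum(monthly_failures, 5)
```

A run of points that climbs steadily, as in the later months here, shows performance drifting above target even when individual months look unremarkable.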
Reliability Demonstration
It is often necessary to measure the reliability of equipment and systems during development,
production and use. Two basic forms of reliability measurement are used. A sample of equipment may
be subjected to a formal reliability test, with the conditions specified in detail. Reliability may also be
monitored during development and use, as test and utilization proceed, without tests being set up
specifically for reliability measurement. This section describes standard methods of test and analysis
which are used to demonstrate compliance with reliability requirements.
Non-Parametric Methods
Non-parametric statistical techniques (see page 65) can be applied to reliability measurement. They
are arithmetically very simple and so can be useful as quick tests in advance of more detailed analysis,
particularly when no assumption is made of the underlying failure distribution.
Smith, Chapter 3
Study Guide 7: Self Assessment Questions
1. The usual indices of quality costs include
Prevention Costs
Appraisal Costs
Failure Costs
2. Reliability growth modelling