

Most people will have some concept of what reliability is from everyday life. For
example, people may discuss how reliable their washing machine has been over the length of
time they have owned it. Similarly, a car that doesn't need to go to the garage for repairs
often during its lifetime would be said to have been reliable. It can be said that reliability is
quality over time. Quality is associated with workmanship and manufacturing; therefore, if
a product doesn't work or breaks as soon as you buy it, you would consider the product to
have poor quality. However, if over time parts of the product wear out before you expect them
to, this would be termed poor reliability. The difference between quality and reliability is
therefore concerned with time, and more specifically with product lifetime.
Reliability is the probability of performing without failure, a specific function, under given
conditions for a specified period of time.
The five elements are:




Probability: Reliability is a probability, a probability of performing without failure; thus,
reliability is a number between zero and one.
Failure: What constitutes a failure must be agreed upon in advance of the testing and use
of the component or system under study. For example if the function of a pump is to
deliver at least 200 gallons of fluid per minute and it is now delivering 150 gallons/per
minute, the pump has failed, by this definition.
The main reasons why failures occur include: the product is not fit for purpose or, more
specifically, the design is inherently incapable; the item is overstressed in some way;
wear-out; variation; wrong specifications; misuse of the item; and use of the item outside
the specific operating environment for which it was designed.
Function: The device whose reliability is in question must perform a specific function. For
example, if I use my gasoline-powered lawn mower to trim my hedges and a blade
breaks, this should not be charged as a failure.
Conditions: The device must perform its function under given conditions. For example, if
my company builds and sells small gasoline-powered electrical generators intended for
use in ambient temperatures of 0-120 degrees Fahrenheit and several are brought to
Nome, Alaska and fail to operate in the winter, we should not charge failures to these
generators.
Time: The device must perform for a period of time. One should never cite a reliability
figure without specifying the time in question. The exception to this rule is for one-shot
devices such as munitions, rockets, automobile air-bags, and the like. In this case we
think of the reliability as the probability that the device will operate properly (once) when
deployed or used. Or equivalently one-shot reliability may be thought of as the proportion
of all identical devices which will operate properly (once) when deployed or used. In
reliability, unless otherwise specified, time begins at zero. We treat conditional probability
of failure and conditional reliability separately and label them as such.

The load and strength of an item may be generally known; however, there will always be an
element of uncertainty. The actual strength values of any population of components will vary;
there will be some that are relatively strong, others that are relatively weak, but most will
be of nearly average strength. Similarly, there will be some loads greater than others, but
mostly they will be average. Figure 1 shows the load-strength relationship with no overlap.

Figure 1: Load-strength relationship, no overlap

However, if there is an overlap of the two distributions, as shown in Figure 2, then failures
will occur. A safety margin is therefore needed to ensure that these distributions do not overlap.

Figure 2: Load strength relationship - overlaps
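
The overlap between the load and strength distributions can be turned into a number. The sketch below is only an illustration under an added assumption that is not stated in the text: load and strength are taken to be independent and normally distributed, so that the margin (strength minus load) is also normal and the failure probability is the chance that the margin is negative. The function name and all numeric values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def interference_failure_probability(load_mean, load_sd, strength_mean, strength_sd):
    """P(load > strength), assuming independent normal load and strength."""
    # The margin (strength - load) is normal with the difference of the means
    # and the root-sum-square of the standard deviations.
    margin = NormalDist(strength_mean - load_mean,
                        sqrt(strength_sd ** 2 + load_sd ** 2))
    # Failure occurs when the margin falls below zero.
    return margin.cdf(0.0)


# Hypothetical values chosen only to contrast the two figures.
well_separated = interference_failure_probability(100.0, 5.0, 150.0, 5.0)
overlapping = interference_failure_probability(100.0, 15.0, 120.0, 15.0)
print(f"well separated: {well_separated:.2e}, overlapping: {overlapping:.4f}")
```

With widely separated distributions the failure probability is negligible, while the overlapping case gives a failure probability of roughly 17%, which is why a safety margin is needed.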

It is clear that to ensure good reliability the causes of failure need to be identified and
eliminated. Indeed the objectives of reliability engineering are:

To apply engineering knowledge to prevent or reduce the likelihood or frequency of failures;
To identify and correct the causes of failures that do occur;
To determine ways of coping with failures that do occur;
To apply methods of estimating the likely reliability of new designs, and for
analysing reliability data.


The so-called bath-tub curve represents the pattern of failure for many products
especially complex products such as cars and washing machines. The vertical axis in the
figure is the failure rate at each point in time. Higher values here indicate higher probabilities
of failure.
The life of a product or a population of units can be divided into three distinct periods.
Figure 3 shows the reliability bathtub curve, which models the cradle-to-grave instantaneous
failure rate vs. time. The first period runs from the start of the curve to where it begins to
flatten out, and is characterized by a decreasing failure rate. It is what occurs during the
early life of a product or population of units. The weaker units die off, leaving a population
that is more robust. This first period is also called the infant mortality period. The next
period is the flat portion of the graph. It is called the normal life. Failures occur in a more
random sequence during this time. It is difficult to predict which failure mode will manifest,
but the rate of failures is predictable. Notice that the curve is flat here, i.e. the failure
rate is constant. The third period begins at the point where the slope begins to increase and
extends to the end of the graph. This is what happens when units become old and begin to fail
at an increasing rate.

Figure 3: Bathtub Curve

Infant Mortality: This stage is also called early failure or debugging stage. The failure
rate is high but decreases gradually with time. During this period, failures occur because
engineering did not test products or systems or devices sufficiently, or manufacturing
made some defective products. Therefore the failure rate at the beginning of infant
mortality stage is high and then it decreases with time after early failures are removed by
burn-in or other stress screening methods. Some of the typical early failures are: poor
welds, poor connections, contamination on surfaces or in materials, incorrect positioning of
parts, etc.
Useful Life Period: As the product matures, the weaker units die off, the failure rate
becomes nearly constant, and modules have entered what is considered the normal life
period. This period is characterized by a relatively constant failure rate. The length of this
period is referred to as the system life of a product or component. It is during this period
of time that the lowest failure rate occurs. Notice how the amplitude on the bathtub curve
is at its lowest during this time. The useful life period is the most common time frame for
making reliability predictions.
Wear-out Period: This is the final stage where the failure rate increases as the products
begin to wear out because of age or lack of maintenance. When the failure rate becomes
high, repair, replacement of parts etc., should be done.

Reliability is the probability that a product or part will operate properly for a specified
period of time (design life) under the design operating conditions (such as temperature, voltage,
etc.) without failure. In other words, reliability may be used as a measure of the system's
success in providing its function properly. Reliability is one of the quality characteristics that
consumers require from the manufacturer of products.
Many mathematical concepts apply to reliability engineering, particularly from the
areas of probability and statistics. Likewise, many mathematical distributions can be used for
various purposes, including the Gaussian (normal) distribution, the log-normal distribution,
the exponential distribution, the Weibull distribution and a host of others.
Failure rate: The purpose of quantitative reliability measurements is to define the rate of
failure relative to time and to model that failure rate in a mathematical distribution for the
purpose of understanding the quantitative aspects of failure. The most basic building block is
the failure rate, which is estimated using the following equation:

$$\lambda = \frac{r}{T}$$

$\lambda$ = Failure rate (sometimes referred to as the hazard rate)
T = Total running time/cycles/miles/etc. during an investigation period for both failed
and non-failed items.
r = the total number of failures occurring during the investigation period.
For example, if five electric motors operate for a collective total time of 50 years with
five functional failures during the period, then the failure rate, $\lambda$, is 5/50 = 0.1
failures per year.
Another very basic concept is the mean time between/to failure (MTBF/MTTF). The
only difference between MTBF and MTTF is that we employ MTBF when referring to items
that are repaired when they fail. For items that are simply thrown away and replaced, we use
the term MTTF. The computations are the same.

The basic calculation to estimate mean time between failure (MTBF) and mean time
to failure (MTTF), both measures of central tendency, is simply the reciprocal of the failure
rate function. It is calculated using the following equation:

$$\theta = \frac{T}{r}$$

$\theta$ = Mean time between/to failure
T = Total running time/cycles/miles/etc. during an investigation period for
both failed and non-failed items.
r = the total number of failures occurring during the investigation period.
The MTBF for the industrial electric motor mentioned in the previous example is 10
years, which is the reciprocal of the failure rate for the motors. Incidentally, we would
estimate MTBF for electric motors that are rebuilt upon failure. For smaller motors that are
considered disposable, we would state the measure of central tendency as MTTF. The failure
rate is a basic component of many more complex reliability calculations. Depending upon the
mechanical/electrical design, operating context, environment and/or maintenance
effectiveness, a machine's failure rate as a function of time may decline, remain constant,
increase linearly or increase geometrically.
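
The failure rate and MTBF estimates for the electric motor example can be reproduced in a few lines. This is a minimal sketch; the function names are illustrative and not taken from any library.

```python
def failure_rate(failures, total_time):
    """Estimate the failure rate as failures per unit of total running time."""
    if total_time <= 0:
        raise ValueError("total_time must be positive")
    return failures / total_time


def mean_time_between_failures(failures, total_time):
    """Estimate MTBF (or MTTF) as total running time per failure."""
    if failures <= 0:
        raise ValueError("at least one failure is needed for this estimate")
    return total_time / failures


# Five motors, 50 motor-years of collective running time, five failures.
rate = failure_rate(5, 50.0)                 # 0.1 failures per year
mtbf = mean_time_between_failures(5, 50.0)   # 10 years
print(rate, mtbf)
```

The two estimates are reciprocals of each other, so their product is 1.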

Figure 4: Different Failure Rates vs. Time Scenarios

Failure rate calculations are based on complex models which include factors using
specific component data such as temperature, environment, and stress. In the prediction
model, assembled components are structured serially. Thus, calculated failure rates for
assemblies are a sum of the individual failure rates for components within the assembly.
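
As a small illustration of the serial prediction model, the sketch below sums component failure rates to obtain an assembly failure rate. The component values are hypothetical, and converting the result to an MTBF additionally assumes constant failure rates.

```python
def assembly_failure_rate(component_rates):
    """Failure rate of a serially structured assembly: the sum of its parts."""
    return sum(component_rates)


# Hypothetical component failure rates, in failures per million hours.
component_rates = [2.0, 0.5, 1.5]
assembly_rate = assembly_failure_rate(component_rates)   # 4.0 per million hours

# Under a constant failure rate, MTBF is the reciprocal of the rate.
assembly_mtbf_hours = 1_000_000 / assembly_rate          # 250,000 hours
print(assembly_rate, assembly_mtbf_hours)
```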
There are three common basic categories of failure rates:
a) Mean Time Between Failures (MTBF): MTBF is a basic measure of reliability for
repairable items. MTBF can be described as the time that passes before a component,
assembly, or system fails, under the condition of a constant failure rate. Another way of
stating MTBF is the expected value of the time between two consecutive failures, for
repairable systems. It is a commonly used variable in reliability and maintainability
analyses. MTBF can be calculated as the inverse of the failure rate, $\lambda$, for constant
failure rate systems. For example, for a component with a failure rate of 2 failures per
million hours, the MTBF would be the inverse of that failure rate, i.e.:

$$\text{MTBF} = \frac{1}{\lambda} = \frac{10^{6}\ \text{hours}}{2\ \text{failures}} = 500{,}000\ \text{hours}$$

In general,

$$\text{MTBF} = \theta = \frac{T}{r} = \frac{\text{total device hours}}{\text{number of failures}}, \qquad \lambda = \frac{1}{\text{MTBF}}$$

where $\theta$ = MTBF; T = total time; r = number of failures.
MTBF = MTTF + MTTR (see Figure 5 below)

b) Mean time to failure (MTTF): MTTF is a basic measure of reliability for non-repairable
systems. It is the mean time expected until the first failure of a piece of equipment. MTTF
is a statistical value and is intended to be the mean over a long period of time and with a
large number of units. For constant failure rate systems, MTTF is the inverse of the
failure rate, $\lambda$. If the failure rate, $\lambda$, is in failures/million hours,
MTTF = 1,000,000/$\lambda$ hours, for components with exponential distributions.
MTTF is the total number of hours of service of all devices divided by the number of
devices. It is only when all the parts fail with the same failure mode that MTBF
converges to MTTF.

$$\text{MTTF} = \theta = \frac{1}{\lambda} = \frac{T}{N}$$

($\theta$ = MTTF; T = total time; N = number of units under test.)
For example, an item that fails, on average, once every 4,000 hours has a probability of
failure in each hour of approximately 1/4,000 = 0.00025. This depends on the failure
rate being constant, which is the condition for the exponential distribution.
Read the other way round, a failure rate of $\lambda$ = 0.00025 per hour gives
MTBF (or MTTF) = 1/$\lambda$ = 1/0.00025 = 4,000 hours.
c) Mean Time to Repair (MTTR): Mean time to repair (MTTR) is defined as the total
amount of time spent performing all corrective or preventative maintenance repairs
divided by the total number of those repairs. It is the expected span of time from a failure
(or shut down) to the repair or maintenance completion. This term is typically only used
with repairable systems.

Figure 5: MTBF, MTTF & MTTR
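
The MTTR definition and the relationship MTBF = MTTF + MTTR can be sketched as follows. The repair durations and the MTTF figure are hypothetical values used only for illustration.

```python
def mean_time_to_repair(repair_hours):
    """Total time spent on repairs divided by the number of repairs."""
    if not repair_hours:
        raise ValueError("at least one repair is needed")
    return sum(repair_hours) / len(repair_hours)


# Hypothetical repair log (hours per repair) and mean time to failure (hours).
repair_log = [2.0, 4.0, 3.0]
mttr = mean_time_to_repair(repair_log)   # 3.0 hours
mttf = 497.0

# Relationship stated in the text: MTBF = MTTF + MTTR.
mtbf = mttf + mttr                       # 500.0 hours
print(mttr, mtbf)
```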


If you take a large number of measurements, you can draw a histogram to show how the
measurements vary. A more useful diagram, for continuous data, is the probability
density function. The y-axis is the percentage of measurements falling in a range (shown on
the x-axis) rather than the frequency, as in a histogram. If you reduce the ranges (or
intervals), the histogram becomes a curve which describes the distribution of the measurements
or values. This distribution is the probability density function or PDF.
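
The step from a frequency histogram to a density can be sketched as below: each bin count is divided by the number of measurements and by the bin width, so that the bar areas sum to one. The measurements and bin edges are hypothetical.

```python
def density_histogram(samples, bin_edges):
    """Return, for each bin, the fraction of samples per unit of bin width."""
    n = len(samples)
    last = len(bin_edges) - 2
    densities = []
    for i, (lo, hi) in enumerate(zip(bin_edges, bin_edges[1:])):
        # Bins are half-open [lo, hi); the last bin also includes its upper edge.
        count = sum(1 for x in samples if lo <= x < hi or (i == last and x == hi))
        densities.append(count / (n * (hi - lo)))
    return densities


# Hypothetical measurements and bin edges.
measurements = [1.2, 1.9, 2.1, 2.4, 2.6, 2.8, 3.1, 3.3, 3.9, 4.5]
edges = [1.0, 2.0, 3.0, 4.0, 5.0]

density = density_histogram(measurements, edges)
# The area under a density histogram is 1, as for any pdf.
area = sum(d * (hi - lo) for d, lo, hi in zip(density, edges, edges[1:]))
print(density, area)
```

Narrowing the bins while collecting more measurements makes the stepped outline approach the smooth pdf curve.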

In reliability engineering, the normal distribution primarily applies to measurements of
product susceptibility and external stress. This two-parameter distribution is used to describe
many mechanical systems in which failure results from some wear-out effect.
Normal distributions are applied to single-variable continuous data (e.g. heights of plants,
weights of lambs, lengths of time etc.). The normal distribution is the most important
distribution in statistics, since it arises naturally in numerous applications. The key reason is
that large sums of (small) random variables often turn out to be normally distributed.

Normal distribution curve

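The normal distribution takes the well-known bell shape. This distribution is symmetrical
about the mean, and the spread is measured by the variance. The larger the variance, the
flatter the distribution. The pdf is given by

$$f(t) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{t-\mu}{\sigma}\right)^{2}\right], \qquad -\infty < t < \infty$$

where $\mu$ is the mean value and $\sigma$ is the standard deviation. The cumulative
distribution function (cdf) is

$$F(t) = \int_{-\infty}^{t} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{s-\mu}{\sigma}\right)^{2}\right] ds$$

The reliability function is

$$R(t) = \int_{t}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{s-\mu}{\sigma}\right)^{2}\right] ds$$
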
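There is no closed-form solution for the above equation. However, tables for the
standard normal distribution are readily available and can be used to find probabilities
for any normal distribution. If

$$Z = \frac{T-\mu}{\sigma}$$

is substituted into the normal pdf, we obtain

$$f(z) = \frac{1}{\sqrt{2\pi}} e^{-z^{2}/2}, \qquad -\infty < z < \infty$$

This is the so-called standard normal pdf, with a mean value of 0 and a standard
deviation of 1. The standardized cdf is given by

$$\Phi(t) = \int_{-\infty}^{t} \frac{1}{\sqrt{2\pi}} e^{-s^{2}/2}\, ds$$

where $\Phi$ is the standard normal distribution function. Thus, for a normal random
variable T, with mean $\mu$ and standard deviation $\sigma$,

$$P(T \le t) = P\left(Z \le \frac{t-\mu}{\sigma}\right) = \Phi\left(\frac{t-\mu}{\sigma}\right)$$

which yields the relationship necessary if standard normal tables are to be used.
The hazard function for a normal distribution is a monotonically increasing function of t. This
can be shown by proving that $h'(t) \ge 0$ for all t, since

$$h(t) = \frac{f(t)}{R(t)}, \qquad h'(t) = \frac{R(t)\,f'(t) + f^{2}(t)}{R^{2}(t)} \ge 0$$
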
The normal distribution is flexible enough to make it a very useful empirical model. It
can be theoretically derived under assumptions matching many failure mechanisms. Some of
these are corrosion, migration, crack growth, and in general, failures resulting from chemical
reactions or processes. That does not mean that the normal is always the correct model for
these mechanisms, but it does perhaps explain why it has been empirically successful in so
many of these cases.
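Example: A component has a normal distribution of failure times with $\mu$ = 2000 hours and
$\sigma$ = 100 hours. Find the reliability of the component and the hazard function at 1900 hours.
Solution: The reliability function is related to the standard normal deviate z by

$$R(t) = P\left(Z > \frac{t-\mu}{\sigma}\right)$$

From the standard normal table, we obtain

$$R(1900) = P\left(Z > \frac{1900-2000}{100}\right) = P(Z > -1) = \Phi(1) = 0.8413$$

The value of the hazard function is found from the relationship

$$h(t) = \frac{f(t)}{R(t)} = \frac{\phi\left(\frac{t-\mu}{\sigma}\right)}{\sigma R(t)}$$

where $\phi$ is the pdf of the standard normal distribution. Here

$$h(1900) = \frac{\phi(-1)}{\sigma R(1900)} = \frac{0.2420}{100 \times 0.8413} \approx 0.0029 \text{ failures per hour}$$

Example: A part has a normal distribution of failure times with $\mu$ = 40000 cycles and
$\sigma$ = 2000 cycles. Find the reliability of the part at 38000 cycles.
Solution: The reliability at 38000 cycles is

$$R(38000) = P\left(Z > \frac{38000-40000}{2000}\right) = P(Z > -1) = \Phi(1) = 0.8413$$
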
The resulting reliability plot is shown in the figure below,

Normal reliability plot vs. time
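
Instead of reading standard normal tables, the two normal examples can be checked numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the helper function names are illustrative only.

```python
from statistics import NormalDist


def normal_reliability(t, mu, sigma):
    """R(t) = 1 - F(t) for normally distributed failure times."""
    return 1.0 - NormalDist(mu, sigma).cdf(t)


def normal_hazard(t, mu, sigma):
    """h(t) = f(t) / R(t) for normally distributed failure times."""
    return NormalDist(mu, sigma).pdf(t) / normal_reliability(t, mu, sigma)


# Component example: mean 2000 h, standard deviation 100 h, evaluated at 1900 h.
r_component = normal_reliability(1900, 2000, 100)   # about 0.8413
h_component = normal_hazard(1900, 2000, 100)        # about 0.0029 per hour

# Part example: mean 40000 cycles, standard deviation 2000 cycles, at 38000 cycles.
r_part = normal_reliability(38000, 40000, 2000)     # about 0.8413

print(round(r_component, 4), round(h_component, 5), round(r_part, 4))
```

Both examples sit exactly one standard deviation below the mean, which is why they share the same reliability.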


The exponential distribution, the most basic and widely used reliability prediction
formula, models machines with a constant failure rate, i.e. the flat section of the bathtub
curve. Most industrial machines spend most of their lives in the constant failure rate region,
so it is widely applicable. Below is the basic equation for estimating the reliability of a
machine that follows the exponential distribution, where the failure rate is constant as a
function of time.

$$R(t) = e^{-\lambda t} \qquad \text{and} \qquad F(t) = 1 - e^{-\lambda t}$$

R(t) = Reliability estimate for a period of time, cycles, miles, etc. (t).
e = Base of the natural logarithms (2.718281828)
$\lambda$ = Failure rate (1/MTBF, or 1/MTTF)
F(t) = Unreliability (the probability that the component or system
experiences the first failure or has failed one or more times during the
time interval zero to time t, given that it was operating or repaired to a
like-new condition at time zero; R(t) + F(t) = 1)
That is, the pdf, cdf and survival function are given as:

$$f(t) = \lambda e^{-\lambda t}, \qquad F(t) = 1 - e^{-\lambda t}, \qquad R(t) = e^{-\lambda t}, \qquad t \ge 0$$

In the electric motor example, if you assume a constant failure rate, the likelihood of
running a motor for six years without a failure, or the projected reliability, is 55 percent.
This is calculated as follows:

$$R(6) = e^{-0.1 \times 6} = 0.5488 \approx 55\%$$

In other words, after six years, about 45% of the population of identical motors
operating in an identical application can probabilistically be expected to fail. It is worth
reiterating at this point that these calculations project the probability for a population. Any
given individual from the population could fail on the first day of operation while another
individual could last 30 years. That is the nature of probabilistic reliability projections.
A characteristic of the exponential distribution is that the MTBF occurs at the point at
which the calculated reliability is $e^{-1}$ = 36.79%, or the point at which 63.21% of the
machines have already failed. In our motor example, after 10 years, 63.21% of the motors from
a population of identical motors serving in identical applications can be expected to fail. In
other words, the survival rate is 36.79% of the population.
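
The exponential reliability figures for the motor example can be reproduced directly. This is a minimal sketch with illustrative function names, using the failure rate of 0.1 per year from the example.

```python
from math import exp


def exponential_reliability(t, failure_rate):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return exp(-failure_rate * t)


def exponential_unreliability(t, failure_rate):
    """F(t) = 1 - R(t): probability of at least one failure by time t."""
    return 1.0 - exponential_reliability(t, failure_rate)


motor_rate = 0.1   # failures per year, from the electric motor example

r_six_years = exponential_reliability(6, motor_rate)             # about 0.5488
f_six_years = exponential_unreliability(6, motor_rate)           # about 0.4512
r_at_mtbf = exponential_reliability(1 / motor_rate, motor_rate)  # e**-1, about 0.3679

print(round(r_six_years, 4), round(f_six_years, 4), round(r_at_mtbf, 4))
```

Evaluating the reliability at t = MTBF always gives exp(-1), regardless of the failure rate chosen.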
The probability density function (pdf), or life distribution, is a mathematical equation
that approximates the failure frequency distribution. It is the pdf, or life frequency
distribution, that yields the familiar bell-shaped curve in the Gaussian, or normal,
distribution. Below is the pdf for the exponential distribution.

$$f(t) = \lambda e^{-\lambda t}$$

f(t) = Life frequency distribution for a given time (t) (failure density)
e = Base of the natural logarithms (2.718281828)
$\lambda$ = Failure rate
In our electric motor example, the failure density at three years is calculated as follows:

$$f(3) = 0.1\, e^{-0.1 \times 3} = 0.07408 \approx 7.4\% \text{ per year}$$

In the example, if we assume a constant failure rate, which follows the exponential
distribution, the life distribution, or pdf for the industrial electric motors, is shown in
Figure 6. The failure rate is constant, but the pdf mathematically assumes failure without
replacement, so the population from which failures can occur is continuously reducing,
asymptotically approaching zero.

Figure 6: The Probability Density Function (pdf)

The cumulative distribution function (cdf) is simply the cumulative fraction of the
population that can be expected to have failed by a given time. For the exponential
distribution, the failure rate is constant, so the relative rate at which failed components
are added to the cdf remains constant. However, as the population declines as a result of
failure, the actual number of mathematically estimated failures decreases as a function of the
declining population. Much as the pdf asymptotically approaches zero, the cdf asymptotically
approaches one.

Failure Rate and the Cumulative Distribution Function (cdf)

The declining failure rate portion of the bathtub curve, which is often called the infant
mortality region, and the wear out region will be discussed in the following section
addressing the versatile Weibull distribution.
Hazard Rate: Sometimes it is difficult to specify the distribution function of T directly from
the physical information that is available. A function found useful in clarifying the
relationship between physical modes of failure and the probability distribution of T is the
conditional density function h(t), called the hazard function or failure rate. The hazard
function for the exponential distribution is given as:

$$h(t) = \frac{f(t)}{R(t)} = \frac{\lambda e^{-\lambda t}}{e^{-\lambda t}}$$

For a constant failure rate, the hazard rate is also constant and is equal to the failure rate:

$$h(t) = \lambda$$

Notice that the hazard function is not a function of time and is in fact a constant equal to $\lambda$.
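
The constancy of the exponential hazard can be verified numerically by dividing the pdf by the reliability at several times. The sketch below uses the motor failure rate of 0.1 per year; the function names are illustrative.

```python
from math import exp


def exponential_pdf(t, failure_rate):
    """f(t) = lambda * exp(-lambda * t)."""
    return failure_rate * exp(-failure_rate * t)


def exponential_hazard(t, failure_rate):
    """h(t) = f(t) / R(t), evaluated numerically rather than simplified."""
    reliability = exp(-failure_rate * t)
    return exponential_pdf(t, failure_rate) / reliability


motor_rate = 0.1   # failures per year

density_at_three_years = exponential_pdf(3, motor_rate)   # about 0.0741 per year
hazards = [exponential_hazard(t, motor_rate) for t in (0, 3, 10, 30)]

print(round(density_at_three_years, 5), hazards)
```

The density falls with time because fewer units remain to fail, while the hazard, which is conditional on survival, stays at 0.1 per year.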


In probability theory and statistics, the Weibull distribution is a continuous probability
distribution. It is named after Waloddi Weibull, who described it in detail in 1951, although it
was first identified by Fréchet (1927) and first applied by Rosin & Rammler (1933) to
describe a particle size distribution.
The Weibull distribution is easily the most versatile distribution employed by reliability
engineers. While it is called a distribution, in practice it is a tool that enables the
reliability engineer to characterize the probability density function (failure frequency
distribution) of a set of failure data as early life, constant (exponential) or wear out
(Gaussian or lognormal). This is done by plotting time-to-failure data on special plotting
paper, with the times/cycles/miles to failure on a log-scaled X-axis versus the cumulative
percent of the population represented by each failure on a log-log scaled Y-axis.
Once plotted, the linear slope of the resultant curve is an important variable, called the
shape parameter, represented by $\beta$, which is used to adjust the exponential distribution
to fit a wide range of failure distributions. In general, if the $\beta$ coefficient, or shape
parameter, is less than 1, the distribution exhibits early life, or infant mortality failures.
If the shape parameter exceeds about 3.5, the data are time dependent and indicate wear-out
failures. Such a data set typically approximates the Gaussian, or normal, distribution. As the
$\beta$ coefficient increases above 3.5, the bell-shaped distribution tightens, exhibiting
increasing kurtosis (peakedness at the top of the curve) and a smaller standard deviation.
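
The graphical procedure described above can be approximated numerically. The sketch below is one possible way to do it, not necessarily the method intended by the text: it assigns each ordered failure a cumulative fraction using Benard's median rank approximation (an assumption introduced here), linearizes the two-parameter Weibull cdf as ln(-ln(1 - F)) = beta ln t - beta ln eta, and fits a straight line by least squares. The slope is the shape parameter. The failure times are hypothetical.

```python
from math import exp, log


def weibull_fit_median_rank(failure_times):
    """Estimate (beta, eta) of a 2-parameter Weibull by median rank regression."""
    times = sorted(failure_times)
    n = len(times)
    xs, ys = [], []
    for i, t in enumerate(times, start=1):
        # Benard's approximation to the median rank (assumed plotting position).
        fraction_failed = (i - 0.3) / (n + 0.4)
        xs.append(log(t))
        ys.append(log(-log(1.0 - fraction_failed)))
    # Ordinary least squares for y = slope * x + intercept.
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    beta = sxy / sxx
    intercept = y_mean - beta * x_mean   # equals -beta * ln(eta)
    eta = exp(-intercept / beta)
    return beta, eta


# Hypothetical times to failure, in hours.
beta_hat, eta_hat = weibull_fit_median_rank([105, 240, 390, 560, 810, 1150])
print(round(beta_hat, 2), round(eta_hat, 1))
```

A fitted slope below 1 would point to early-life failures, a slope near 1 to a constant failure rate, and a larger slope to wear-out.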

Many data sets will exhibit two or even three distinct regions. It is common for
reliability engineers to plot, for example, one curve representing the shape parameter during
run in (infant mortality period), another curve to represent the constant or gradually
increasing failure rate and a third distinct linear slope emerges to identify a third shape, the
wear out region. In these instances, the pdf of the failure data do in fact assume the familiar
bathtub curve shape.
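The 3-parameter Weibull pdf is given by:

$$f(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1} \exp\left[-\left(\frac{t-\gamma}{\eta}\right)^{\beta}\right]$$

where
$f(t) \ge 0$; $t \ge 0$ or $\gamma$; $\beta > 0$; $\eta > 0$; $-\infty < \gamma < +\infty$
$\eta$: scale parameter, or characteristic life
$\beta$: shape parameter (or slope)
$\gamma$: location parameter (or failure free life)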

The 2-Parameter Weibull

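The 2-parameter Weibull pdf is obtained by setting $\gamma$ = 0, and is given by:

$$f(t) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1} \exp\left[-\left(\frac{t}{\eta}\right)^{\beta}\right]$$

Frequently, the location parameter is not used, and the value for this parameter can be set to
zero.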
There is also a form of the Weibull distribution known as the 1-parameter Weibull
distribution. This in fact takes the same form as the 2-parameter Weibull pdf, the only
difference being that the value of $\beta$ is assumed to be known beforehand. This assumption
means that only the scale parameter needs to be estimated, allowing for analysis of small data
sets. It is recommended that the analyst have a very good and justifiable estimate for $\beta$
before using the 1-parameter Weibull distribution for analysis.
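Weibull reliability and CDF functions are:

$$R(t) = \exp\left[-\left(\frac{t-\gamma}{\eta}\right)^{\beta}\right]$$

$$F(t) = 1 - \exp\left[-\left(\frac{t-\gamma}{\eta}\right)^{\beta}\right]$$

The characteristic life, $\eta$, is the life at which 63.2% of the population will have failed
(measured from $\gamma$ when the location parameter is not zero).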
Figure 8: Bathtub curve and the Weibull distribution
When $\beta$ = 1, the hazard function is constant and therefore the data can be modeled by an
exponential distribution with $\lambda = 1/\eta$.
When $\beta$ < 1, we get a decreasing hazard function, and
when $\beta$ > 1, we get an increasing hazard function.
Figure 8 shows the Weibull shape parameters superimposed on the bathtub curve.

The Weibull hazard function is:

$$h(t) = \frac{\beta}{\eta}\left(\frac{t-\gamma}{\eta}\right)^{\beta-1}$$

Example: The failure time of a component follows a Weibull distribution with shape
parameter $\beta$ = 1.5 and scale parameter $\eta$ = 10,000 h. When should the component be
replaced if the minimum required reliability for the component is 0.95?

$$0.95 = \exp\left[-\left(\frac{t}{10000}\right)^{1.5}\right]$$

$$t = 10000\,(-\ln 0.95)^{1/1.5} \approx 1380.5 \text{ h}$$
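
The replacement-time calculation amounts to inverting the Weibull reliability function for t. The sketch below shows this inversion with an illustrative function name, followed by a back-substitution check.

```python
from math import exp, log


def weibull_time_at_reliability(reliability, beta, eta, gamma=0.0):
    """Solve R(t) = exp(-((t - gamma) / eta) ** beta) for t."""
    if not 0.0 < reliability < 1.0:
        raise ValueError("reliability must lie strictly between 0 and 1")
    return gamma + eta * (-log(reliability)) ** (1.0 / beta)


# Shape 1.5, scale 10,000 h, required reliability 0.95 (values from the example).
replace_at = weibull_time_at_reliability(0.95, beta=1.5, eta=10_000)

# Back-substitute into the reliability function to confirm the result.
reliability_check = exp(-(replace_at / 10_000) ** 1.5)

print(round(replace_at, 1), round(reliability_check, 4))
```

The component should be replaced after roughly 1,380 hours to keep its reliability at or above 0.95.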
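Example 2.8: The failure time of a certain component has a Weibull distribution with
$\beta$ = 4, $\eta$ = 2000, and $\gamma$ = 1000. Find the reliability of the component and the
hazard rate for an operating time of 1500 hours.
Solution: A direct substitution into the reliability function yields

$$R(1500) = \exp\left[-\left(\frac{1500-1000}{2000}\right)^{4}\right] = e^{-0.0039} = 0.996$$

The desired hazard function is given by

$$h(1500) = \frac{4}{2000}\left(\frac{1500-1000}{2000}\right)^{3} = 3.13 \times 10^{-5} \text{ failures per hour}$$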
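
For the three-parameter case with shape 4, scale 2000 and location 1000, the reliability and hazard at 1500 hours can be evaluated as sketched below. The function names are illustrative, and the handling of times at or before the location parameter is a design choice made for this sketch.

```python
from math import exp


def weibull_reliability(t, beta, eta, gamma=0.0):
    """R(t) = exp(-((t - gamma) / eta) ** beta); no failures occur before gamma."""
    if t <= gamma:
        return 1.0
    return exp(-((t - gamma) / eta) ** beta)


def weibull_hazard(t, beta, eta, gamma=0.0):
    """h(t) = (beta / eta) * ((t - gamma) / eta) ** (beta - 1), for t > gamma."""
    if t <= gamma:
        raise ValueError("the hazard is evaluated only for t > gamma in this sketch")
    return (beta / eta) * ((t - gamma) / eta) ** (beta - 1)


r_1500 = weibull_reliability(1500, beta=4, eta=2000, gamma=1000)   # about 0.996
h_1500 = weibull_hazard(1500, beta=4, eta=2000, gamma=1000)        # 3.125e-05 per hour

print(round(r_1500, 4), h_1500)
```

Setting `beta=1` and `gamma=0` in the same functions gives the exponential case, with a constant hazard of 1/eta.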

Note that the Rayleigh and exponential distributions are special cases of the Weibull
distribution at $\beta$ = 2, $\gamma$ = 0, and $\beta$ = 1, $\gamma$ = 0, respectively. For
example, when $\beta$ = 1 and $\gamma$ = 0, the reliability of the Weibull distribution
function reduces to

$$R(t) = e^{-t/\eta}$$

and the hazard function reduces to $1/\eta$, a constant. Thus, the exponential is a special
case of the Weibull distribution. Similarly, when $\gamma$ = 0 and $\beta$ = 2, the Weibull
probability density function becomes the Rayleigh density function. That is,

$$f(t) = \frac{2t}{\eta^{2}} \exp\left[-\left(\frac{t}{\eta}\right)^{2}\right], \qquad t \ge 0$$


The gamma distribution can be used as a failure probability function for components whose
distribution is skewed. The failure density function for a gamma distribution is

$$f(t) = \frac{t^{\alpha-1}}{\beta^{\alpha}\,\Gamma(\alpha)}\, e^{-t/\beta}, \qquad t \ge 0,\ \alpha, \beta > 0$$

where $\alpha$ is the shape parameter and $\beta$ is the scale parameter. Hence,

$$R(t) = \int_{t}^{\infty} \frac{s^{\alpha-1}}{\beta^{\alpha}\,\Gamma(\alpha)}\, e^{-s/\beta}\, ds$$

If $\alpha$ is an integer, it can be shown by successive integration by parts that

$$R(t) = e^{-t/\beta} \sum_{i=0}^{\alpha-1} \frac{(t/\beta)^{i}}{i!}$$

The gamma density function has shapes that are very similar to the Weibull
distribution. At $\alpha$ = 1, the gamma distribution becomes the exponential distribution
with the constant failure rate $1/\beta$. The gamma distribution can also be used to model the
time to the nth failure of a system if the underlying failure distribution is exponential.
Thus, if $X_i$ is exponentially distributed with parameter $\theta = 1/\beta$, then
$T = X_1 + X_2 + \dots + X_n$ is gamma distributed with parameters $\beta$ and n.
The gamma model is a flexible lifetime model that may offer a good fit to some sets
of failure data. It is not, however, widely used as a lifetime distribution model for common
failure mechanisms. A common use of the gamma lifetime model occurs in Bayesian
reliability applications.
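Example: The time to failure of a component has a gamma distribution with $\alpha$ = 3 and
$\beta$ = 5. Determine the reliability of the component and the hazard rate at 10 time-units.
Solution: Using

$$R(t) = e^{-t/\beta} \sum_{i=0}^{\alpha-1} \frac{(t/\beta)^{i}}{i!}$$

we compute

$$R(10) = e^{-10/5} \sum_{i=0}^{2} \frac{(10/5)^{i}}{i!} = e^{-2}\,(1 + 2 + 2) = 0.6767$$

and, since $f(10) = \frac{10^{2}}{5^{3}\,\Gamma(3)}\, e^{-2} = 0.0541$,

$$h(10) = \frac{f(10)}{R(10)} = \frac{0.0541}{0.6767} \approx 0.08 \text{ per time unit}$$
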
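The other form of the gamma probability density function can be written as follows:

$$f(t) = \frac{\beta^{\alpha}\, t^{\alpha-1}}{\Gamma(\alpha)}\, e^{-\beta t}, \qquad t > 0$$

This pdf is characterized by two parameters: shape parameter $\alpha$ and scale parameter
$\beta$. When 0 < $\alpha$ < 1, the failure rate monotonically decreases; when $\alpha$ > 1,
the failure rate monotonically increases; when $\alpha$ = 1 the failure rate is constant. The
mean, variance and reliability of the density function in the above equation are, respectively,

$$\text{Mean} = \frac{\alpha}{\beta}, \qquad \text{Variance} = \frac{\alpha}{\beta^{2}}, \qquad R(t) = \int_{t}^{\infty} \frac{\beta^{\alpha}\, x^{\alpha-1}}{\Gamma(\alpha)}\, e^{-\beta x}\, dx$$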

Example: The time to failure of a mechanical system follows a gamma distribution with
$\alpha$ = 3 and $1/\beta$ = 120. Find the system reliability at 280 hours.
Solution: The system reliability at 280 hours is given by

$$R(280) = e^{-280/120} \sum_{k=0}^{2} \frac{(280/120)^{k}}{k!} \approx 0.5872$$

and the resulting reliability plot is shown in the figure below.

Gamma reliability function vs. time
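
For integer shape parameters, the gamma reliability reduces to a finite sum that is easy to evaluate. The sketch below covers both gamma examples (shape 3 with scale 5 at 10 time units, and shape 3 with scale 120 at 280 hours). The function names are illustrative, and the code handles integer shapes only.

```python
from math import exp, factorial


def gamma_reliability(t, shape, scale):
    """R(t) for an integer shape: exp(-t/scale) * sum of (t/scale)**i / i!."""
    x = t / scale
    return exp(-x) * sum(x ** i / factorial(i) for i in range(shape))


def gamma_pdf(t, shape, scale):
    """f(t) for an integer shape, using Gamma(shape) = (shape - 1)!."""
    return t ** (shape - 1) * exp(-t / scale) / (scale ** shape * factorial(shape - 1))


def gamma_hazard(t, shape, scale):
    """h(t) = f(t) / R(t)."""
    return gamma_pdf(t, shape, scale) / gamma_reliability(t, shape, scale)


r_10 = gamma_reliability(10, shape=3, scale=5)      # about 0.6767
h_10 = gamma_hazard(10, shape=3, scale=5)           # 0.08 per time unit
r_280 = gamma_reliability(280, shape=3, scale=120)  # about 0.587

print(round(r_10, 4), round(h_10, 4), round(r_280, 4))
```

For a non-integer shape the sum no longer applies and the incomplete gamma function would be needed, which this sketch does not attempt.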



The log normal lifetime distribution is a very flexible model that can empirically fit many
types of failure data. This distribution, with its applications in maintainability engineering, is
able to model failure probabilities of repairable systems and to model the uncertainty in
failure rate information. The log normal density function is given by

$$f(t) = \frac{1}{\sigma t \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln t - \mu}{\sigma}\right)^{2}\right], \qquad t > 0$$

where $\mu$ and $\sigma$ are parameters such that $-\infty < \mu < \infty$ and $\sigma > 0$.
Note that $\mu$ and $\sigma$ are not the mean and standard deviation of the distribution, as
they are in the normal distribution.
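The relationship to the normal distribution (just take natural logarithms of all the data and
time points and you have normal data) makes the log normal easy to work with, since many good
software analysis programs are available to treat normal data.
Mathematically, if a random variable X is defined as X = ln T, then X is normally
distributed with a mean of $\mu$ and a variance of $\sigma^{2}$. That is,
E(X) = E(ln T) = $\mu$
V(X) = V(ln T) = $\sigma^{2}$
Since $T = e^{X}$, the mean of the log normal distribution can be found by using the
normal distribution. Consider that

$$E(T) = E(e^{X}) = \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[x - \frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}\right] dx$$

and by rearrangement of the exponent, this integral becomes

$$E(T) = e^{\mu + \sigma^{2}/2} \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}} \exp\left[-\frac{1}{2\sigma^{2}}\left(x - (\mu + \sigma^{2})\right)^{2}\right] dx$$

Thus, the mean of the log normal distribution is

$$E(T) = e^{\mu + \sigma^{2}/2}$$

Proceeding in a similar manner,

$$E(T^{2}) = E(e^{2X}) = e^{2(\mu + \sigma^{2})}$$

thus, the variance for the log normal is

$$V(T) = e^{2\mu + \sigma^{2}}\left(e^{\sigma^{2}} - 1\right)$$
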
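The cumulative distribution function for the log normal is

$$F(t) = \int_{0}^{t} \frac{1}{\sigma s \sqrt{2\pi}} \exp\left[-\frac{1}{2}\left(\frac{\ln s - \mu}{\sigma}\right)^{2}\right] ds$$

and this can be related to the standard normal deviate Z by

$$F(t) = P(T \le t) = P(\ln T \le \ln t) = P\left(Z \le \frac{\ln t - \mu}{\sigma}\right)$$

Therefore, the reliability function is given by

$$R(t) = P\left(Z > \frac{\ln t - \mu}{\sigma}\right)$$

and the hazard function would be

$$h(t) = \frac{f(t)}{R(t)} = \frac{\phi\left(\frac{\ln t - \mu}{\sigma}\right)}{\sigma t\, R(t)}$$

where $\phi$ is the pdf of the standard normal distribution.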

Log normal reliability plot vs. time

The log normal lifetime model, like the normal, is flexible enough to make it a very
useful empirical model. The figure above shows the reliability of the log normal vs. time. It
can be theoretically derived under assumptions matching many failure mechanisms. Some of
these are corrosion and crack growth, and in general, failures resulting from chemical
reactions or processes.
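Example: The failure time of a certain component is log normal distributed with $\mu$ = 5 and
$\sigma$ = 1. Find the reliability of the component and the hazard rate for a life of 50 time units.
Solution: Substituting the numerical values of $\mu$, $\sigma$, and t into the reliability
function, we compute

$$R(50) = P\left(Z > \frac{\ln 50 - 5}{1}\right) = P(Z > -1.088) \approx 0.862$$

Similarly, the hazard function is given by

$$h(50) = \frac{\phi\left(\frac{\ln 50 - 5}{1}\right)}{1 \times 50 \times 0.862} = \frac{0.2207}{43.1} \approx 0.0051 \text{ failures per time unit}$$

Thus, values for the log normal distribution are easily computed by using the standard
normal tables.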
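
As an alternative to table lookups, the log normal reliability and hazard can be evaluated with the standard normal distribution from the Python standard library. This is a minimal sketch with illustrative function names, shown for a log normal with parameters 5 and 1 at 50 time units.

```python
from math import log
from statistics import NormalDist

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1


def lognormal_reliability(t, mu, sigma):
    """R(t) = P(Z > (ln t - mu) / sigma)."""
    z = (log(t) - mu) / sigma
    return 1.0 - STANDARD_NORMAL.cdf(z)


def lognormal_hazard(t, mu, sigma):
    """h(t) = phi(z) / (sigma * t * R(t)), with phi the standard normal pdf."""
    z = (log(t) - mu) / sigma
    density = STANDARD_NORMAL.pdf(z) / (sigma * t)
    return density / lognormal_reliability(t, mu, sigma)


r_50 = lognormal_reliability(50, mu=5, sigma=1)   # about 0.862
h_50 = lognormal_hazard(50, mu=5, sigma=1)        # about 0.0051 per time unit

print(round(r_50, 4), round(h_50, 4))
```

The same two functions apply to any other parameter pair, for example `lognormal_reliability(200, mu=6, sigma=2)`.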
Example: The failure time of a part is log normal distributed with $\mu$ = 6 and $\sigma$ = 2.
Find the part reliability for a life of 200 time units.
Solution: The reliability of the part at 200 time units is

$$R(200) = P\left(Z > \frac{\ln 200 - 6}{2}\right) = P(Z > -0.351) \approx 0.637$$


The Availability, A(t), of a component or system is defined as the probability that the
component or system is operating at time t, given that it was operating at time zero.
The Unavailability, Q(t), of a component or system is defined as the probability that
the component or system is not operating at time t, given that it was operating at time zero.
A(t) + Q(t) = 1

Maintainability is defined as the probability that a device will be restored to its
operational effectiveness within the given period when maintenance action is performed in
accordance with the prescribed procedure. Maintenance action is the prescribed operation to
correct an equipment failure.
Repairable and Non-repairable Items
It is important to distinguish between repairable and non-repairable items when predicting or
measuring reliability.

Non-repairable items: Non-repairable items are components or systems such as a
light bulb, transistor, rocket motor, etc. Their reliability is the survival probability
over the item's expected life, or over a specific period of time during its life, when only
one failure can occur. During the component's or system's life, the instantaneous
probability of the first and only failure is called the hazard rate or failure rate. Life
values such as MTTF are used to define non-repairable items.
Repairable Items: For repairable items, reliability is the probability that failure will
not occur in the time period of interest; or, when more than one failure can occur,
reliability can be expressed as the failure rate, $\lambda$, or the Rate of Occurrence of
Failures (ROCOF). In the case of repairable items, reliability can be characterized by MTBF,
described above, but only under the condition of a constant failure rate.
Some systems are considered both repairable and non-repairable, such as a missile. It is
repairable while under test on the ground, but becomes a non-repairable system when launched.
Failure Patterns (Non-repairable Items)
There are three patterns of failures for non-repairable items, which can change with time. The
failure rate (hazard rate) may be decreasing, increasing or constant.

Decreasing Failure Rate (Non-repairable Items): A decreasing failure rate (DFR) can be
caused by an item that becomes less likely to fail as its survival time increases. This is
demonstrated by electronic equipment during its early life or burn-in period, which
corresponds to the first part of the traditional bath tub curve for electronic components
or equipment, where the failure rate is decreasing.
Constant Failure Rate (Non-repairable Items): A constant failure rate (CFR) can be
caused by the application of loads at a constant average rate in excess of the design
specifications or strength. These are typically externally induced failures.
Increasing Failure Rate (Non-repairable Items): An increasing failure rate (IFR) can be
caused by material fatigue or by strength deterioration due to cyclic loading. Its failure
mode does not accrue for a finite time, and then exhibits an increasing probability of



Failure Patterns (Repairable Items)

There are three patterns of failures for repairable items, which can change with time. The
failure rate (hazard rate) may be decreasing, increasing or constant.


Decreasing Failure Rate (Repairable Items): An item whose reliability is improved by
progressive repair and/or burn-in can cause a decreasing failure rate (DFR) pattern.
Constant Failure Rate (Repairable Items): A constant failure rate (CFR) is indicative
of externally induced failures as in the constant failure rate of non-repairable items.
This is typical of complex systems subject to repair and overhaul.
Increasing Failure Rate (Repairable Items): This increasing failure rate (IFR) pattern
is demonstrated by repairable equipment when wear out modes begin to predominate
or electronic equipment that has aged beyond its useful life (right hand side of the
bath tub curve) and the failure rate is increasing with time.


Operating Characteristic (OC) curves are powerful tools in the field of quality control,
as they display the discriminatory power of a sampling plan. In quality control, the OC curve
plots the probability of accepting the lot on the Y-axis versus the lot fraction or percent
defectives (p) on the X-axis. Based on the number of defectives in a sample, the quality
engineer can decide to accept the lot, to reject the lot or even, for multiple or sequential
sampling schemes, to take another sample and then repeat the decision process.
In reliability engineering, the OC curve shows the probability of acceptance (i.e. the
probability of passing the test) versus a chosen test parameter. This parameter can be the true
or designed-in mean life (MTTF) or the reliability (R), as shown in the figure below. Program
Managers, Evaluators, Testers, and other key acquisition personnel need to know the
probability of acceptance for a test plan in order to design test plans that will ensure
demonstration of the reliability requirement at the desired confidence level. The most commonly
used tool for this purpose is the Operating Characteristic (OC) Curve. The figure below provides
a sample OC Curve. This OC curve is generated for a fixed configuration test and displays
the relationship between the probability of acceptance and MTBF based on test duration and
acceptable number of failures. The OC curve is a tool to determine the probability of
acceptance of a test plan corresponding to a given reliability requirement. The OC curve is
used to quantify the consumer risk and producer risk associated with a given MTBF value for
the associated test plan.

Reliability Risks: There are two types of decision risks which are of significant importance
during the demonstration of reliability requirements. These risks are called Consumer Risk
and Producer Risk.


Consumer risk: The probability that a level of system reliability at or below the
requirement will be found to be acceptable due to statistical chance. This is depicted
on the operating characteristic curve. We should endeavor to quantify and manage
consumer risk because reliability below the requirement results in reduced mission
reliability and increased support costs.
Producer risk: The probability that a level of system reliability that meets or exceeds
the reliability goal will be deemed unacceptable due to statistical chance. This risk is
also depicted in the figure above. If the system is incorrectly deemed unsuitable,
major cost and schedule impacts to the acquisition program may result.

An appropriate balance between the consumer risk and the producer risk is important to
determine test duration/number of trials. If the consumer risk and producer risk are not
balanced appropriately, the test duration/number of trials may be too short/small or too
long/large. If the test duration/number of trials is too short/small, the reliability goal (target)
for the test will be higher (test reliability requirement is inversely proportional to the test
duration/number of trials). For short/small test duration/number of trials, one or both risks
may be too high. If the test duration/number of trials is too long/large, it may be very costly
to perform the test. The cost factor may lead to an unacceptable program burden.
The probability of acceptance, P(A), can be represented by the cumulative binomial
distribution:
P(A) = Σ (f = 0 to c) C(n, f) (1 − R)^f R^(n − f)
where: C(n, f) = n! / (f! (n − f)!)

This gives the probability that the number of failures observed during the test, f, is
less than or equal to the acceptance number, c, which is the number of allowable failures in n
trials. Each trial has a probability of succeeding of R, where R is the reliability of each unit
under test. The reliability OC curve is developed by evaluating the above equation for various
values of R.
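A reliability OC curve of this kind can be tabulated with a few lines of Python. This is a minimal sketch of the cumulative binomial calculation described above; the plan used in the loop (20 trials, 1 allowable failure) is a hypothetical illustration and is not taken from the text.

```python
from math import comb

def prob_acceptance_binomial(n, c, R):
    """P(A) = sum over f = 0..c of C(n, f) * (1 - R)^f * R^(n - f).

    n: number of trials, c: acceptance number (allowable failures),
    R: reliability (probability of success) of each unit under test.
    """
    return sum(comb(n, f) * (1 - R) ** f * R ** (n - f) for f in range(c + 1))

# Hypothetical plan: n = 20 trials, at most c = 1 failure allowed
for R in (0.80, 0.90, 0.95, 0.99):
    print(f"R = {R:.2f}  P(A) = {prob_acceptance_binomial(20, 1, R):.4f}")
```

Evaluating the function over a range of R values gives the points of the OC curve. The consumer risk is P(A) evaluated at a reliability at or below the requirement, and the producer risk is 1 − P(A) evaluated at a reliability that meets the goal.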
The Poisson distribution can be used for large values of n:
P(X = x) = e^(−λT) (λT)^x / x!
Here x = acceptance number c, and you will have to find P(X ≤ x),
i.e., if c = 2, then Pa = P(X ≤ 2) = P(X = 0) + P(X = 1) + P(X = 2) for the
corresponding expected number of failures, λT (T: number of hours of test).
Also, note that the notation λ in reliability is not equal to np as in
acceptance sampling.

The OC curve represents the probability of acceptance for a given mean life. An OC
curve may be constructed showing the probability of acceptance as a function of average life,
θ. In this case, the sampling plan may be defined with:
Number of hours of test and
an acceptance number
A major assumption is that the failed item will be replaced by a good item.
Consider a sampling plan with:
Number of hours of test, T
an acceptance number, c
For each average life, θ,
Compute the failure rate per hour, λ = 1/θ
Compute the expected number of failures during the test, λT
Compute Pa = P(c or fewer failures) = 1 − P(c + 1 or more failures when the mean number of failures
is λT). This can be obtained using the Poisson equation or the table from statistical data.
Example: In one of the plans, 10 items were to be tested for 5000 hours with replacement and
with an acceptance number of 1. Plot an OC curve showing probability of acceptance as a
function of average life.
Solution: Given:
Duration of the test, T = 5000 hours
Acceptance number, c = 1
Step 1: Create a column for mean life, θ.
(You can also assume Rt values and create the first column. Example, 0.05,
0.10, 0.15 etc. up to 8 to 10 rows)
Step 2: Calculate the failure rate, λ = 1/θ
Step 3: Calculate the expected average number of failures, λT
Step 4: Calculate Pa using the Poisson distribution for c = 1
For example, when θ = 1000, λ = 0.001 and λT = 5, so
Pa = P(X ≤ 1) = P(X = 0) + P(X = 1) = e^(−5) 5^0/0! + e^(−5) 5^1/1! = 0.040 = 4.0%
Step 5: Plot the graph, Y axis: Pa and X axis: θ
Table columns: Mean Life (θ); Failure Rate, λ; Average no. of failures, λT; Probability of
acceptance, Pa
It is evident from the curve that the probability of acceptance approaches 1 (i.e. Pa =
100%) as the mean life increases.
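The five steps above can be sketched in Python. The test duration (5000 hours) and acceptance number (1) come from the example; the list of mean-life values is an assumption chosen only to produce a readable table, and the function names are illustrative.

```python
import math

def poisson_cdf(c, mean):
    """P(X <= c) for a Poisson variable with the given mean."""
    return sum(math.exp(-mean) * mean ** x / math.factorial(x)
               for x in range(c + 1))

def oc_curve(mean_lives, T, c):
    """Return rows of (mean life, failure rate, expected failures, Pa)."""
    rows = []
    for theta in mean_lives:
        lam = 1.0 / theta          # Step 2: failure rate per hour
        expected = lam * T         # Step 3: expected number of failures
        rows.append((theta, lam, expected, poisson_cdf(c, expected)))  # Step 4
    return rows

T, c = 5000, 1
mean_lives = [1000, 2000, 5000, 10000, 20000, 50000]  # assumed values
table = oc_curve(mean_lives, T, c)
for theta, lam, expected, pa in table:
    print(f"{theta:>6}  {lam:.6f}  {expected:6.2f}  {pa:.4f}")
```

Plotting the last column against the first gives the OC curve of Step 5. For a mean life of 1000 hours the sketch gives a probability of acceptance of about 0.04, and the value rises towards 1 as the mean life grows.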


Usually multiple components make up a system, and we often want to know the
reliability of a system that uses more than one component. How the components are
connected together determines what type of system reliability model is used.
There are different types of system reliability models and these are typically used to
analyse items such as an aircraft completing its flight successfully. Once the reliability of
components or machines has been established relative to the operating context and required
mission time, plant engineers must assess the reliability of a system or process.

Series systems: The simplest reliability model is a serial model where all the components
must be working for the system to be successful.

To calculate the system reliability for a serial process, you only need to multiply the
estimated reliability of Subsystem A at time (t) by the estimated reliability of Subsystem B at time (t), and so on for each subsystem. The basic equation for calculating the system reliability of a
simple series system is:
RS = RA × RB × … × RZ

The failure rate of the system is calculated by adding the failure rates together, i.e.
λS = λA + λB + … + λZ

Example: So, for a simple system with three subsystems, or sub-functions, each
having an estimated reliability of 0.90 (90%) at time (t), the system reliability is
calculated as 0.90 X 0.90 X 0.90 = 0.729, or about 73%.
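The series calculation can be sketched as follows. The function names are illustrative, and independence of the subsystems is assumed, as in the series model itself.

```python
import math
from math import prod

def series_reliability(reliabilities):
    """System reliability of independent subsystems in series: product of R_i."""
    return prod(reliabilities)

def series_failure_rate(failure_rates):
    """System failure rate for constant-failure-rate subsystems in series: sum of lambda_i."""
    return sum(failure_rates)

# Three subsystems, each with reliability 0.90 at time t (the example above)
rs = series_reliability([0.90, 0.90, 0.90])
print(f"Series reliability = {rs:.3f}")

# With constant failure rates, multiplying reliabilities and adding rates agree
rates, t = [0.001, 0.002, 0.0005], 100   # hypothetical values
assert math.isclose(math.exp(-series_failure_rate(rates) * t),
                    series_reliability([math.exp(-lam * t) for lam in rates]))
```

The final check illustrates why the failure rates add: the product of exponential reliabilities is an exponential whose rate is the sum of the individual rates.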

Active redundancy or Parallel Systems: One of the most common forms of
redundancy is the parallel reliability model where two independent items are
operating but the system can successfully operate as long as one of them is working.

To calculate the reliability of an active parallel system, where both machines are
running, use the following simple equation:

Rs(t) = 1 − [{1 − R1(t)} × {1 − R2(t)} × … × {1 − Rn(t)}]

Rs(t) = System reliability for given time (t)
Rn(t) = Subsystem or sub-function reliability for given time (t)
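A minimal sketch of the active parallel calculation, assuming independent items; the function name and the sample values are illustrative.

```python
def parallel_reliability(reliabilities):
    """Reliability of independent items in active parallel.

    The system fails only if every item fails, so
    Rs = 1 - product of (1 - R_i).
    """
    all_fail = 1.0
    for r in reliabilities:
        all_fail *= (1.0 - r)
    return 1.0 - all_fail

# Two items, each with reliability 0.90 at time t (hypothetical values)
rp = parallel_reliability([0.90, 0.90])
print(f"Parallel reliability = {rp:.2f}")
```

Two 90% items in parallel give 99%, whereas the same two items in series would give only 81%.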

M-out-of-N redundancy (m/n Systems): In some active parallel redundant
configurations, m out of the n items may be required to be working for the system to
function. The reliability of an m-out-of-n system, with n identical independent
items each of reliability R, is given by:
Rs = Σ (i = m to n) C(n, i) R^i (1 − R)^(n − i)
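For identical independent items, the m-out-of-n reliability is usually written as a binomial sum: the probability that at least m of the n items are working. The sketch below assumes that standard form; the function name and the 2-out-of-3 example are illustrative.

```python
from math import comb

def m_out_of_n_reliability(m, n, R):
    """Probability that at least m of n identical independent items work.

    Rs = sum over i = m..n of C(n, i) * R^i * (1 - R)^(n - i)
    """
    return sum(comb(n, i) * R ** i * (1 - R) ** (n - i) for i in range(m, n + 1))

# Hypothetical 2-out-of-3 system with item reliability 0.90
r_2oo3 = m_out_of_n_reliability(2, 3, 0.90)
print(f"2-out-of-3 reliability = {r_2oo3:.3f}")
```

Setting m = n reproduces the series result and m = 1 reproduces the active parallel result, which is a quick way to sanity-check the formula.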

Problem 1: A certain type of electronic component has a uniform failure rate of 0.00001 per
hour. What is the reliability for a specified period of service of 10000 hours?
λ = 0.00001 per hour
t = 10000 hours
Rt = e^(−λt) = e^(−0.00001 × 10000) = 0.90483 = 90.483%
Problem 2: Given a mean time to failure (MTTF) of 5000 hours and a uniform failure rate, what is the
reliability associated with a specified service period of 200 hours?
θ = 5000 hours
t = 200 hours
λ = 1/θ = 1/5000 = 0.0002 per hour
Rt = e^(−λt) = e^(−0.0002 × 200) = 0.96079 = 96.079%
Problem 3: The following reliability requirements have been set on the sub-systems of a
communication system:
Reliability (for a 4 hour period)
Control system
Power supply
What is the expected reliability of the overall system?
Solution: Rt(system) = Rt1 × Rt2 × Rt3 × Rt4 = 0.970 × 0.989 × 0.995 × 0.996 = 0.951 (about 95%)
The chance that the overall system will perform its function without failure for
a 4 hour period is about 95%.
Problem 4: A unit has a reliability of 0.99 for a specified mission time. If 2 identical units are
used in parallel redundancy, what overall reliability will be obtained?
Rs(t) = 1 − {1 − R1(t)}^n = 1 − {1 − 0.99}^2 = 0.9999 or 99.99%
Problem 5: An industrial machine compresses natural gas into an interstate gas pipeline. The
compressor is on line 24 hours a day. (If the machine is down, a gas field has to be shutdown
until the natural gas can be compressed, so down time is very expensive.) The vendor knows
that the compressor has a constant failure rate of 0.000001 failures/hr. What is the operational
reliability after 2500 hours of continuous service?
The compressor has a constant failure rate and therefore the reliability follows
the exponential distribution: Rt = e^(−λt)
Failure rate λ = 0.000001 failures/hr
Operational time t = 2500 hours
Reliability = e^(−0.000001 × 2500) = 0.9975 or 99.75%
Problem 6: Suppose that a component we wish to model has a constant failure rate with a
mean time between failures of 25 hours. Find:
(a) The reliability function.
(b) The reliability of the item at 30 hours.

Since the failure rate is constant, we will use the exponential distribution.
Also, the MTBF = 25 hours. We know, for an exponential distribution, MTBF
= 1/λ.
Therefore λ = 1/25 = 0.04 per hour
(a) The reliability function is given by: R(t) = e^(−λt) = e^(−0.04t)
(b) The reliability of the item at 30 hours = e^(−0.04 × 30) = 0.3012

Problem 7: A certain electronic component has an exponential failure time with a mean of 50
hours.
(a) What is the failure rate of this component?
(b) What is the reliability of this component at 100 hours?
(c) What is the minimum number of these components that should be placed in parallel if we
desire a reliability of 0.90 at 100 hours? (The idea of placing extra components in parallel is
to provide a backup if the first component fails.)
(a) λ = 1/50 = 0.02 per hour
(b) R(100) = e^(−0.02 × 100) = 0.1353 (which is not very good)
(c) The parallel system will only fail if all components fail. The probability of
each failing is 1 − 0.1353 = 0.8647.
If there are n parallel components needed
1 − 0.8647^n = 0.9
0.8647^n = 0.1
By trial and error, n = 16, so we need 16 components in parallel.
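The trial-and-error step in Problem 7 can be automated. This is a minimal sketch with illustrative function names; it assumes identical, independent components with a constant failure rate, and a target reliability strictly below 1.

```python
import math

def exponential_reliability(failure_rate, t):
    """R(t) = exp(-lambda * t) for a constant failure rate."""
    return math.exp(-failure_rate * t)

def min_parallel_units(unit_reliability, target):
    """Smallest n such that 1 - (1 - r)^n >= target.

    Assumes 0 < unit_reliability < 1 and target < 1, otherwise the
    loop would not terminate.
    """
    n = 1
    while 1.0 - (1.0 - unit_reliability) ** n < target:
        n += 1
    return n

r_100 = exponential_reliability(1 / 50, 100)   # mean life of 50 hours
n_units = min_parallel_units(r_100, 0.90)
print(f"R(100) = {r_100:.4f}, units needed = {n_units}")
```

The same answer follows in closed form from n ≥ ln(0.1) / ln(0.8647) ≈ 15.8, which rounds up to 16.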
Some of the tools that are useful during the design stage can be thought of as tools for
fault avoidance. They fall into two general methods, bottom-up and top-down.

Top-down method
Undesirable single event or system success at the highest level of interest (the top event)
should be defined.
Contributory causes of that event at all levels are then identified and analysed.
Start at highest level of interest to successively lower levels
Event-oriented method
Useful during the early conceptual phase of system design
Used for evaluating multiple failures including sequentially related failures and common-cause events

Some examples of top-down methods include: Fault tree analysis (FTA) & Reliability
block diagram (RBD)
a. Fault tree analysis
Fault tree analysis is a systematic way of identifying all possible faults that could lead
to system fail-danger failure. The FTA provides a concise description of the various
combinations of possible occurrences within the system that can result in predetermined
critical output events. The FTA helps identify and evaluate critical components, fault paths,
and possible errors. It is both a reliability and safety engineering task, and it is a critical data
item that is submitted to the customer for their approval and their use in their higher-level
FTA and safety analysis. The key elements of a FTA include:
Gates represent the outcome
Events represent input to the gates
Cut sets are groups of events that would cause a system to fail

FTA is used to:
investigate potential faults, their modes and causes;
quantify their contribution to system unreliability in the course of product design.

FTA can be done qualitatively by drawing the tree and identifying all the basic events.
However, to identify the probability of the top event, probabilities or reliability figures
must be input for the basic events. Using logic, the probabilities are worked up to give a
probability that the top event will occur. Often the data from an FMEA are used in
conjunction with an FTA.
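Working basic-event probabilities up through the gates can be sketched as below. The sketch assumes the basic events are statistically independent, which the text does not state, and the small tree and its probabilities are hypothetical.

```python
def and_gate(probabilities):
    """Output occurs only if all independent input events occur."""
    p = 1.0
    for q in probabilities:
        p *= q
    return p

def or_gate(probabilities):
    """Output occurs if at least one independent input event occurs."""
    none_occur = 1.0
    for q in probabilities:
        none_occur *= (1.0 - q)
    return 1.0 - none_occur

# Hypothetical tree: the top event occurs if power is lost OR both pumps fail
p_power_loss = 0.01
p_pump_a_fails = 0.05
p_pump_b_fails = 0.05

p_both_pumps = and_gate([p_pump_a_fails, p_pump_b_fails])
p_top = or_gate([p_power_loss, p_both_pumps])
print(f"P(top event) = {p_top:.6f}")
```

The AND gate mirrors parallel redundancy (all inputs must fail) and the OR gate mirrors a series arrangement (any input failing is enough), so the same arithmetic as the system reliability models applies, expressed in failure probabilities rather than reliabilities.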
The following table shows the flowchart symbols that are used in fault tree analysis in
order to aid with the correct reading of the fault tree.
Rectangle: signifies a fault or undesired event caused by one or more preceding causes acting through logic gates.
Circle: signifies a primary failure or basic fault that
Diamond: denotes a secondary failure or undesired event but not developed further
And gate: denotes that a failure will occur if all inputs fail (parallel redundancy)
Or gate: denotes a failure will occur if any input fails (series
FTA example

Transfer event

b. Reliability block diagram

The RBD is discussed and shown in the section "Modelling system reliability" above. It
is, however, among the first tasks to be completed. It models system success and gives
results for the total system. It deals with different system configurations, including
parallel, redundant, standby and alternative functional paths. It does not provide any fault
analysis and uses probabilistic measures to calculate system reliability.


Bottom-up method
Identify fault modes at the component level.
For each fault mode the corresponding effect on performance is deduced for the
next higher system level.
The resulting fault effect becomes the fault mode at the next higher system level,
and so on.
Successive iterations result in the eventual identification of the fault effects at all
functional levels up to the system level.

Rigorous in identifying all single fault modes.

Initially may be qualitative.

Some examples of bottom-up methods include: Event tree analysis (ETA); FMEA and
Hazard and operability study (HAZOP).
a. Event tree analysis
Considers a number of possible consequences of an initiating event or a system
May be combined with a fault tree.
Used when it is essential to investigate all possible paths of consequent events their
Analysis can become very involved and complicated when analysing larger systems.

b. Failure Modes and Effects Analysis (FMEA)

Failure mode and effect analysis (FMEA) is a bottom-up, qualitative dependability
analysis method, which is particularly suited to the study of material, component and
equipment failures and their effects on the next higher functional system level. Iterations of
this step (identification of single Failure modes and the evaluation of their effects on the next
higher system level) result in the eventual identification of all the system single failure
modes. FMEA lends itself to the analysis of systems of different technologies (electrical,
mechanical, hydraulic, software, etc.) with simple functional structures. FMECA extends the
FMEA to include criticality analysis by quantifying failure effects in terms of probability of
occurrence and the severity of any effects. The severity of effects is assessed by reference to
a specified scale.
FMEAs or FMECAs are generally done where a level of risk is anticipated in a
program early in product or process development. Factors that may be considered are new
technology, new processes, new designs, or changes in the environment, loads, or regulations.
FMEAs or FMECAs can be done on components or systems that make up products,
processes, or manufacturing equipment. They can also be done on software systems.

The FMEA or FMECA analysis generally follows these steps:

Identification of how the component or system should perform;

Identification of potential failure modes, effects, and causes;
Identification of risk related to failure modes and effects;
Identification of recommended actions to eliminate or reduce the risk;
Follow-up actions to close out the recommended actions.

Benefits include:

Identifies systematically the cause and effect relationships.

Gives an initial indication of those failure modes that are likely to be
critical, especially single failures that may propagate.
Identifies outcomes arising from specific causes or initiating events that
are believed to be important.
Provides a framework for identification of measures to mitigate risk.
Useful in the preliminary analysis of new or untried systems or processes.

Limitations include:

The output data may be large even for relatively simple systems.
May become complicated and unmanageable unless there is a fairly direct
(or "single-chain") relationship between cause and effect.
May not easily deal with time sequences, restoration processes, environmental conditions,
maintenance aspects, etc.
Prioritizing mode criticality is complicated by competing factors involved.

c. Hazard & Operability Analysis (HAZOP)

Hazard and Operability Analysis (HAZOP) is a structured and systematic technique
for system examination and risk management. In particular, HAZOP is often used as a
technique for identifying potential hazards in a system and identifying operability
problems likely to lead to nonconforming products.
HAZOP is based on a theory that assumes risk events are caused by deviations from
design or operating intentions. Identification of such deviations is facilitated by using sets
of guide words as a systematic list of deviation perspectives. This approach is a unique
feature of the HAZOP methodology that helps stimulate the imagination of team members when exploring potential deviations.
As a risk assessment tool, HAZOP is often described as:
A brainstorming technique
A qualitative risk assessment tool
An inductive risk assessment tool, meaning that it is a bottom-up risk
identification approach, where success relies on the ability of subject matter
experts (SMEs) to predict deviations based on past experiences and general
subject matter expertise
HAZOP is a powerful communication tool. Once the HAZOP analysis is complete,
the study outputs and conclusions should be documented commensurate with the nature of
risks assessed in the study and per individual company documentation policies. As part of
closure for the HAZOP analysis, it should be verified that a process exists to ensure that
assigned actions are closed in a satisfactory manner.

The HAZOP analysis process is executed in four phases as illustrated below:


Life testing is concerned with measuring the pertinent characteristics of the life of the unit
under study. Often this is accomplished by making statistical inferences about probability
distributions or their parameters.
In general, units are put on test, observed and the times of failure recorded as they occur.
For example, a group of similar components are placed on test and the failure times observed.
Obviously, the times at which individual units fail will vary. Sometimes, assignable causes
can be found that contribute to that variation. For example, components tested in a high
temperature environment may fail sooner than those tested at ambient temperature.
However, the components at the high temperature will still have different failure times from
one another, and the same is true when no assignable causes are in operation. It is therefore
always assumed that the failure time of a component has some random element and
can be treated as a random variable with a probability distribution.
To make statistical inferences about the probability distribution of the failure time random
variable, one uses the failure times that have been observed from a life test, ideally a test that
has been statistically designed for the purpose of the study. If the failure times of a particular
component under a given set of conditions, can be adequately described by a probability
distribution, there are considerable practical benefits. The failure times can then be used to
estimate the parameters of the distribution and to perhaps study the relationship of these
parameters to associated explanatory variables. The estimates can be used to make
predictions, determine component configurations in systems, determine replacement
procedures, specify guarantee periods and make other decisions about the use of the
1) Accelerated life testing
The concept of accelerated testing is to compress time and accelerate the failure
mechanisms in a reasonable test period so that product reliability can be assessed. The only
way to accelerate time is to stress potential failure modes. These include electrical and
mechanical failures. Failure occurs when the stress exceeds the product's strength. In a
product population, the strength is generally distributed and usually degrades over time.
Applying stress simply simulates aging. Increasing stress increases the unreliability and
improves the chances of failure occurring in a shorter period of time. This also means that a
smaller sample population of devices can be tested with an increased probability of finding
failure. Stress testing amplifies unreliability so failure can be detected sooner. Accelerated
life tests are also used extensively to help make predictions. Predictions can be limited when
testing small sample sizes. Predictions can be erroneously based on the assumption that life-test results are representative of the entire population. Therefore, it can be difficult to design
an efficient experiment that yields enough failures so that the measures of uncertainty in the
predictions are not too large. Stresses can also be unrealistic. Fortunately, it is generally rare
for an increased stress to cause anomalous failures, especially if common sense guidelines are
Anomalous testing failures can occur when testing pushes the limits of the material out of
the region of the intended design capability. The natural question to ask is: What should the
guidelines be for designing proper accelerated tests and evaluating failures? The answer is:
Judgment is required by management and engineering staff to make the correct decisions in
this regard. To aid such decisions, the following guidelines are provided:

Always refer to the literature to see what has been done in the area of accelerated
Avoid accelerated stresses that cause nonlinearities, unless such stresses are
plausible in product-use conditions. Anomalous failures occur when accelerated stress
causes nonlinearities in the product. For example, material changing phases from
solid to liquid, as in a chemical nonlinear phase transition (e.g., solder melting,
inter-metallic changes, etc.); an electric spark in a material is an electrical
nonlinearity; material breakage compared to material flexing is a mechanical
Tests can be designed in two ways: by avoiding high stresses or by allowing them,
which may or may not cause nonlinear stresses. In the latter test design, a concurrent
engineering design team reviews all failures and decides if a failure is anomalous or
not. Then a decision is made whether or not to fix the problem. Conservative
decisions may result in fixing some anomalous failures. This is not a concern when
time and money permit fixing all problems. The problem occurs when normal failures
are labeled incorrectly as anomalous and no corrective action is taken.

Accelerated life testing is normally done early in the design process as a method for
testing for fit for purpose. It can be done at the component level or the sub-assembly level
but is rarely done at a system level as there are usually too many parts and factors that can
cause failures and these can be difficult to control and monitor.
Step-Stress Testing is an alternative test; it usually involves a small sample of devices
exposed to a series of successively higher and higher steps of stress. At the end of each
stress level, measurements are made to assess the results to the device. The measurements
could be simply to assess if a catastrophic failure has occurred or to measure the resulting
parameter shift due to the step's stress. Constant time periods are commonly used for
each step-stress period. This provides for simpler data analysis. There are a number of
reasons for performing a step-stress test, including:

Aging information can be obtained in a relatively short period of time. Common step-stress tests take about 1 to 2 weeks, depending on the objective.
Step-stress tests establish a baseline for future tests. For example, if a process
changes, quick comparisons can be made between the old process and the new
process. Accuracy can be enhanced when parametric change can be used as a measure
for comparison. Otherwise, catastrophic information is used.

Failure mechanisms and design weaknesses can be identified along with material
limitations. Failure-mode information can provide opportunities for reliability growth.
Fixes can then be put back on test and compared to previous test results to assess fix
Data analysis can provide accurate information on the stress distribution in which the
median-failure stress and stress standard deviation can be obtained.

2) Reliability enhancement testing or HALT

The goal of Reliability enhancement testing (RET) is to identify any potential failure
modes that are inherent in a design early in the design process. Identifying the root cause of
the failure mode and then incorporating a fix to the design can achieve reliability growth.
This is accomplished by designing out the possibility of potential failure modes occurring
with the customer and reducing the inherent risk associated with new product development.
RET at the unit or subassembly level utilizes step-stress testing as its primary test method. It
should be noted that Highly Accelerated Life Testing (HALT) is not meant to be a simulation
of the real world but a rapid way to stimulate failure modes. These methods commonly
employ sequential testing, such as step-stressing the units with temperature and then
vibration. These two stresses can be combined so that temperature and vibration are applied
simultaneously. This speeds up testing, and if an interactive vibration/temperature failure
mode is present, this combined testing may be the only way to find it. Other stresses used
may be power step-stress, power cycling, package preconditioning with infrared (IR) reflow,
electrostatic-discharge (ESD) simulation, and so forth. The choice depends on the intended
type of unit under test and the unit's potential failure modes.
HALT is primarily for assemblies and subassemblies. The HALT test method utilizes a
HALT chamber. Today, these multi-stress environmental systems are produced by a large
number of suppliers. The chamber is unique and can perform both temperature and vibration
step-stress testing.
3) Demonstration testing
Demonstration of reliability may be required as part of a development and production
contract, or prior to release to production, to ensure that the requirements have been met.
Two basic forms of reliability measurement are used:
a. A sample of units may be subjected to a formal reliability test, with
conditions specified in detail.
b. Reliability may be monitored during development and use.
The first method has been shown to be problematic and subject to severe limitations and
practical problems. The limitations include:

PRST (Probability ratio sequential test) assumes a constant hazard

It implies that MTBF is an inherent parameter of a system;
Extremely costly
It is an acceptance test
Objective is to have no or very few failures

It has been shown that a well-managed reliability growth programme, as discussed earlier,
would avoid the need for demonstration testing, as it concentrates on how to improve
products. It has also been argued that the benefit to the product in terms of improved
reliability is sometimes questionable when PRST methods have been used.

4) Environmental Stress Screening

If all processes were under complete control, product screening or monitoring would be
unnecessary. If products were perfect, there would be no field returns or infant mortality
problems, and customers would be satisfied with product reliability and quality. However, in
the real world, unacceptable process and material variations exist. Product flaws need to be
anticipated before customers receive final products and use them. This is the primary reason
that a good screening and monitoring program is needed to provide high quality products.
Screening and monitoring programs are a major factor in achieving customer satisfaction.
Parts are screened in the early production stage until the process is under control and any
material problems have been resolved. Once this occurs, a monitoring program can ensure
that the process has not changed and that any deviations have been stabilized. Here, the term
screening implies 100% product testing while monitoring indicates a sample test. Screens
are based upon a product's potential failure modes. Screening may be simple, such as on-off
cycling of the unit, or it may be more involved, requiring one or more powered
environmental stress screens. Usually, screens that power up the unit, compared with non-powered screens, provide the best opportunity to precipitate failure-mode problems. Screens
are constantly reviewed and may be modified based on screening yield results. For example,
if field returns are low and the screen yields are high (near 100 percent), the screen should be
changed to find all the field issues. If yields are high with acceptable part per million (PPM)
field returns, then a monitoring program will replace the screen. In general, monitoring is
preferred for low-cost/high-volume jobs. A major caution for selecting the correct screening
program is to ensure that the process of screening out early life failures does not remove too
much of a products useful life. Manufacturers have noted that, in the attempt to drive out
early life failure, the useful life of some products can become reduced. If this occurs,
customers will find wear-out failure mechanisms during early field use.
5) Reliability Growth/Enhancement Planning
Traditionally, the need for Reliability Growth planning has been for large subsystems or
systems. This is simply because of the greater risk in new product development at that level
compared to the component level. Also, in programs where one wishes to push mature
products or complex systems to new reliability milestones, inadequate strategies will be
costly. A program manager must know if Reliability Growth can be achieved under required
time and cost constraints. A plan of attack is required for each major subsystem so that
system-level reliability goals can be met. However, Reliability Growth planning is
recommended for all new platforms, whether they are complex subsystems or simple
components. In a commercial environment with numerous product types, the emphasis must
be on platforms rather than products. Often there may be little time to validate, let alone
assess, reliability. Yet, without some method of assessment, platforms could be jeopardized.
Accelerated testing is, without question, the featured Reliability Growth tool for industry. It is
important to devise reliability planning during development that incorporates the most
time- and cost-effective testing techniques available.
Reliability growth can occur at the design and development stage of a project, but most of
the growth should occur in the first accelerated testing stage, early in design. Generally, there
are two basic kinds of Reliability Growth test methods: constant stress testing and
step-stress testing. Constant stress testing applies an elevated stress maintained at a
particular level over time, such as isothermal aging, in which parts are subjected to the same
temperature for the entire test (similar to a burn-in). Step-stress testing raises the stress
in successive steps. It can be applied to stresses such as temperature, shock, and vibration,
and it includes the Highly Accelerated Life Test (HALT). These tests stimulate potential
failure modes, and Reliability Growth occurs when failure modes are fixed. No matter what the
method, Reliability Growth planning is essential to avoid wasting time and money when
accelerated testing is attempted without an organized program plan.
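
The difference between the two test methods can be sketched as stress profiles. The example below is only an illustration: all levels, dwell times and unit failure levels are hypothetical, and it makes the simplifying assumption that a unit fails as soon as the applied stress reaches that unit's own failure level, ignoring damage accumulated during earlier steps.

```python
def constant_stress_profile(level, total_hours, interval_hours):
    """Return (start_hour, stress_level) pairs with the same level throughout."""
    n_intervals = int(total_hours // interval_hours)
    return [(i * interval_hours, level) for i in range(n_intervals)]


def step_stress_profile(start_level, step_size, n_steps, dwell_hours):
    """Return (start_hour, stress_level) pairs; the level rises after each dwell."""
    return [(i * dwell_hours, start_level + i * step_size) for i in range(n_steps)]


def failures_after_each_step(profile, failure_levels):
    """Count, per step, the units whose failure level is first reached at that step.

    Simplifying assumption: a unit fails as soon as the applied stress reaches
    its failure level; cumulative damage from earlier steps is ignored.
    """
    counts = []
    surviving = list(failure_levels)
    for _, level in profile:
        counts.append(sum(1 for f in surviving if f <= level))
        surviving = [f for f in surviving if f > level]
    return counts


# Hypothetical values, for illustration only (temperatures in degrees C).
isothermal = constant_stress_profile(level=125, total_hours=48, interval_hours=24)
steps = step_stress_profile(start_level=60, step_size=20, n_steps=4, dwell_hours=2)
hypothetical_failure_levels = [95, 110, 130]  # one value per unit in a small sample

print(isothermal)
print(steps)
print(failures_after_each_step(steps, hypothetical_failure_levels))
```

In this sketch the failure counts recorded after each step are what identify the stress level at which failure modes begin to appear.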
The table below summarizes how different tests fit into the product life cycle.

Accelerated test or method: Reliability Growth or Reliability Enhancement
Stage of product life cycle: Design and Development
Definition and use: Reliability Growth is the positive improvement in a reliability
parameter over a period of time due to changes in product design or the manufacturing
process. A Reliability Growth program is commonly established to help systematically plan
for reliability achievement over a program's duration so that resources and reliability
risks can be managed.

Accelerated test or method: HALT (Highly Accelerated Life Test)
Stage of product life cycle: Design and Development
Definition and use: HALT is a type of step-stress test that often combines two stresses,
such as temperature and vibration. This highly accelerated stress test is used for finding
failure modes as fast as possible and assessing product risks. Frequently it exceeds the
equipment-specified limits.

Accelerated test or method: Step-Stress Test
Stage of product life cycle: Design and Development of units or [...]
Definition and use: Exposing small samples of product to a series of successively higher
steps of a stress (like temperature), with a measurement of failures after each step. This
test is used to find failures in a short period of time and to perform risk studies.

Accelerated test or method: Failure-Free Test or demonstration test
Stage of product life cycle: Post Design
Definition and use: This is also termed zero failure testing. This is a statistically
significant reliability test used to demonstrate that a particular reliability objective
can be met at a certain level of confidence. For example, the reliability objective may be
1000 FITs (1 million hours MTTF) at the 90 percent confidence level. The most efficient
statistical sample size is calculated when no failures are expected during the test period.
Hence the name.

Accelerated test or method: ESS (Environmental Stress Screening)
Stage of product life cycle: Production
Definition and use: This is an environmental screening test or tests used in production to
weed out latent and infant mortality failures.

Accelerated test or method: HASS (Highly Accelerated Stress Screen)
Stage of product life cycle: Production
Definition and use: This is a screening test or tests used in production to weed out infant
mortality failures. This is an aggressive test since it implements stresses that are higher
than common ESS screens. When aggressive levels are used, the screening should be
established in HALT testing.
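
The failure-free test example above (1000 FITs at the 90 percent confidence level) can be turned into a required amount of test time. The text does not give the calculation, so the following is a sketch of one common approach, which assumes a constant failure rate (exponential life): if T device-hours pass with zero failures, the failure rate lambda is demonstrated at confidence C when exp(-lambda * T) = 1 - C. The function names and the 1000-hour test duration are illustrative assumptions.

```python
import math


def fits_to_failure_rate(fits):
    """Convert FITs to failures per device-hour (1 FIT = 1 failure per 10**9 hours)."""
    return fits / 1e9


def zero_failure_device_hours(failure_rate_per_hour, confidence):
    """Device-hours of failure-free testing needed to demonstrate the failure rate.

    Assumes a constant failure rate, so that exp(-rate * T) = 1 - confidence.
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be between 0 and 1")
    if failure_rate_per_hour <= 0:
        raise ValueError("failure rate must be positive")
    return -math.log(1.0 - confidence) / failure_rate_per_hour


def units_needed(device_hours, hours_per_unit):
    """Smallest whole number of units that accumulates the required device-hours."""
    return math.ceil(device_hours / hours_per_unit)


rate = fits_to_failure_rate(1000)                 # 1e-6 failures per hour
mttf_hours = 1.0 / rate                           # about 1 million hours
required = zero_failure_device_hours(rate, 0.90)  # about 2.3 million device-hours
print(mttf_hours, required, units_needed(required, 1000))
```

Under these assumptions, roughly 2.3 million failure-free device-hours are needed, for instance about 2,303 units tested for 1,000 hours each without acceleration. This illustrates why such demonstrations are usually combined with accelerated stress.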