
Reliability-centered Knowledge
Using Maintenance Databases for Reliability Analysis and Improvement

Part 1. A Reliability-centered Knowledge Base 6

Part 2. Using Maintenance Data 21

Part 3. Condition Based Maintenance 49

Part 4. Reliability Centered Maintenance 110

by:
Murray Wiseman
Technical V.P.
Optimal Maintenance Decisions (OMDEC) Inc.
Mar 2004

Optimal Maintenance Decisions (OMDEC) Inc 2004
Preface

This book provides the course notes for a CBM (condition based maintenance)
training session that consists of 3 parts:

1. Database attributes that are required for reliability analysis,
2. Using the information from such databases, and particularly,
3. Interpreting in an optimal way the data generated by condition based
maintenance activities.

Material for a fourth part, Reliability-centered maintenance (RCM), is
appended because the principles of EXAKT are founded in RCM theory. Hence
the course draws liberally from concepts such as failure modes, decision
analysis, and age exploration that are examined in Part 4.

Parts 1 and 2 usually fill the first morning of the course, providing the introduction
and background for EXAKT. Part 3 begins with a theoretical development of
CBM, a history of CBM, and a discussion of the reasons for selecting CBM as a
proactive task. The second section of Part 3 presents the anatomy of CBM,
specifically its three sub-processes: data acquisition, signal processing, and
decision making. The last of these leads naturally into the introduction of EXAKT CBM
decision optimization. The fundamentals of CBM are explored further and the
RCM concept of the P-F interval [1] is reconciled with the methodology of
EXAKT. The development of the relationship among data, risk, and cost ensues,
using a time-based maintenance example. This approach is then shown to be
extendable to CBM using the Weibull PHM [2] model. The need for automated
decision making, a consequence of the growing volumes of data and the
diminishing resources that characterize today's maintenance departments, is
then explained.

At this point participants (or readers) are invited to work through an introductory
exercise during which they encounter most of the basic features of EXAKT. This
includes the 5 principal database tables and their table structure. They proceed to
build a decision model using a reduced set of haul truck data. In the exercise that
follows, they deploy the model that they have previously created. That is, they set
up an (EXAKTd) intelligent agent and examine its automated analysis, reporting,
and database functionality.

Next, the issue of data validation is explored. The example is from a CBM project
at the Cardinal River Coals mine in which invalid data, missing data, faulty
failure definition, the impact of oil changes on oil analysis data, and cost
sensitivity analysis are all encountered, and their respective EXAKT functions
explored. This discussion is then reinforced by an exercise in which all of the
results of the Cardinal River Coals project are replicated by the class working in
pairs on their own laptops. The exercise includes an introduction to general (data)
transformations [3] in EXAKT.

At this time, an advanced topic is introduced: the analysis of complex items [4]. A
complex item is defined and the data structure for representing complex items in a
model is described. The necessity to map database fields for a variety of
components to EXAKT's key fields of B, EF, and ES (Beginning, Ending by
Failure, and Ending by Suspension) is elaborated at some length and then
reinforced immediately with an exercise using a two-failure-mode gearbox as an
example.

The final exercise provides an introduction to, and practice in, the use of
history-specific (data) transformations for the purpose of smoothing erratic data.
Additional sophistication is demonstrated by eliminating a drooping
artifact produced by the basic smoothing algorithm. This final
exercise also introduces the testing of the shape-factor-equal-to-one [5] hypothesis, and
the reasoning behind its use in this specific case. This ends the formal part of the
course. The attendees are then asked to search their respective records and
databases for potentially good CBM optimization projects. The criteria for a "good"
project are articulated in the form of a balanced compromise between the availability of
inspection and event data on the one hand, and the gravity of the consequences of
failure on the other.

Contents:
Part 1. A Reliability-centered Knowledge Base 6
Chapter 1. 6
Introduction 6
The Work Order UML Class Diagram 7
Incorporating RCM attributes 8
The Seven RCM Questions 9
The failure code problem 10
Chapter 2. Requirements of Information 11
Data Structure 13
Implementing a Reliability Knowledge Base 14
Step 4 Extending the Use Case if no record is found 16
Conclusions 19
Part 2. Using Maintenance Data 21
Chapter 3. Analyzing data 21
Introduction 21
The problem with failure rates 22
How to use maintenance data? 23
Age Exploration Procedures 25
Random Failure 26
Failure Finding Intervals 26
Measuring Reliability Improvement 29
Refining the maintenance program 30
Extending inspection intervals where no experience is available: opportunity sampling 31
Assessing the effectiveness of a CBM Program 31
Improving the program through failure mode assessment 32
Software analytic tools 33
CBM (on-condition maintenance) benefits analysis 36
Engineering Change Assessment 40
Recording Events 41
Component age 41
Significant components 42
Chapter 4. Case based reasoning 43
Results of case-based reasoning 47
Part 3. Condition Based Maintenance 49
Chapter 5. Deciding on CBM? 49
Introduction 49
Why do CBM? 50
History of CBM 53
Chapter 6. Anatomy of CBM 57
Data Acquisition 57
Signal Processing 57
Decision Making 62
Chapter 7. CBM Fundamentals 64

The fundamental premise of CBM 64
CBM Program Criteria 64
CBM Monitoring Frequency 64
Estimating the PF Interval 65
Chapter 8. The Elusive P-F Interval 66
Are failures required? Multiple levels of intrusiveness 68
Chapter 9. Optimizing CBM 69
Developing a Maintenance Risk Model 69
The traditional risk model 69
Combining Data and Risk 71
The Optimal Risk 72
A Time Based Maintenance Model 73
Blending in Cost 78
A Condition Based Maintenance Model 80
Automated CBM Decision Making 81
Example 1 Creating an intelligent agent 82
Example 2 Data validation 88
Example 3 Complex Items 103
Summary 109
References 109
Part 4. Reliability Centered Maintenance 110
Chapter 10. Pillars of RCM 110
Introduction 110
Chapter 11. Failure Modes and Effects Analysis 113
Question 1 Functional Analysis 113
The process 113
Example 1 114
Example 2 118
Example 3 120
Example 4 121
Question 2 Failure analysis 121
The process 121
Example 1 122
Example 2 122
Example 3 123
Question 3 Failure modes analysis 123
The process 123
Example 1 125
Example 2 127
Example 3 127
Question 4 Effects analysis 128
The process 128
Example 1 129
Example 2 138
Example 3 139
Chapter 12. Decision Algorithm 140
Questions 5, 6, and 7 140

The process 140
Example 1 141
Example 2 144
Example 3 147
Example 4 150
Chapter 13. Can RCM and Streamlined RCM peacefully co-exist? 156
Introduction 156
Why streamline RCM? 156
RCM/RCM Turbo dictionary 157
Example 1 162
Conclusions 164
Chapter 14. Appendices 166
Appendix 1. 166
The role of the RCM Facilitator 166
Appendix 2. 171
Sizing the analysis 171
Selecting the significant items 173
Appendix 3. 173
Failure finding intervals for complex items (multiple failure modes and devices) 173
Appendix 4. 174
Truck description 174
Appendix 5. 184
Terminology used: 184
Various definitions of Life 185
Appendix 6. 186
Relationship between hazard, reliability, and density functions 186
Appendix 7. 187
Random failure survival curve 187
Appendix 8. 187
Inherent reliability characteristics 187
Appendix 9. 188
Failure mode depth of causality 188
Appendix 10. 189
Expected failure time 189
Appendix 11. 189
Exercise (Example 2 Data validation) 189
Exercise 4 data smoothing and fixing shape factor to 1 193
Appendix 12. 195
Data for RCM Turbo 195
Appendix 13. 196
Default decision diagram answers in the absence of operating experience 196
Appendix 14. 198
Additional Relcode examples 198

Part 1. A Reliability-centered Knowledge Base

Chapter 1.
Introduction
Website: www.mimosa.org
Slogan: a non-profit trade association which develops and encourages adoption of open information standards for Operations and Maintenance
Principal sponsors/members: Emerson Process Management CSI, The DEI Group, DLI Engineering Corp., Indus International, Inc., Rockwell Automation, SAIC, Siemens AG

Website: www.osacbm.org
Slogan: the Open System Architecture for Condition Based Maintenance
Principal sponsors/members: Boeing, Caterpillar, Oceana Sensor Technologies, Rockwell Automation, ARL, Office of Naval Research

Website: www.opcfoundation.org
Slogan: Dedicated to interoperability in automation
Principal sponsors/members: ABB Automation, Acsis, Inc, Advanced Engineering, Inc., Advanced Measurement & Analysis Group Inc., Advancis Software & Services GmbH, AFCON Software & Electronics Ltd, ALSTOM Transportation

The Work Order UML Class Diagram
Incorporating RCM attributes
The Seven RCM Questions
The failure code problem
Chapter 2. Requirements of Information
Data Structure
Implementing a Reliability Knowledge Base
Step 4 Extending the Use Case if no record is found
Conclusions
Part 2. Using Maintenance Data
Chapter 3. Analyzing data
Introduction
The problem with failure rates
How to use maintenance data?
Age Exploration Procedures
The purpose of age exploration is to establish equipment failure
characteristics by analyzing the data collected from previous maintenance
work. Data sources for this will come from the results of proactive
scheduled maintenance tasks and from maintenance tasks that have
been provoked by a failure.

Valid data will include the results of CBM inspections, the records of
functional and potential failures, and the records of preventive renewals.
The quality of data will determine the validity (confidence level) of the
conclusions drawn from age exploration.

Random Failure
Failure Finding Intervals
Required safety device availability    Inspection interval as a % of MTBF (I/M x 100)
99.999%                                0.002%
99.99%                                 0.02%
99.97%                                 0.06%
99%                                    2%
98.5%                                  3%
98%                                    4%
96%                                    8%
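
The tabulated values are consistent with the linear approximation I/M = 2 x (1 - A), where A is the required availability of the safety device, I is the inspection (failure finding) interval, and M is the device's MTBF. The following minimal Python sketch reproduces the table under that assumption. The function name is illustrative only, and the approximation is normally regarded as reasonable only when I is a small fraction of M.

```python
def ffi_percent_of_mtbf(required_availability_pct: float) -> float:
    """Inspection interval as a percentage of MTBF, assuming I/M = 2 * (1 - A)."""
    unavailability = 1.0 - required_availability_pct / 100.0
    return 2.0 * unavailability * 100.0


# Availabilities taken from the table above.
for availability in (99.999, 99.99, 99.97, 99.0, 98.5, 98.0, 96.0):
    interval = ffi_percent_of_mtbf(availability)
    print(f"{availability}% availability -> inspect every {interval:.3g}% of MTBF")
```

For example, a device with an MTBF of 10 years and a required availability of 99% would, under this approximation, be checked at intervals of about 2% of 10 years, or roughly 10 weeks.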
Measuring Reliability Improvement
Refining the maintenance program
Extending inspection intervals where no experience is available: opportunity sampling
Assessing the effectiveness of a CBM Program
Improving the program through failure mode assessment
Truck 1 Truck 2
51220 45380
68060 103510
At present the odometer readings are:
Truck 1 Truck 2
105680 132720
CBM (on-condition maintenance) benefits analysis
Policy     Sample size   Failed   Replaced   Undecided [60]   % Suspended
Current    13            6        3          4                30.8
Applied    13            1 [61]   6          6                46.2
Fitted A   13            1,1      5          7,4              53.8

Policy                     Cost/unit time   Compared     Preventive     Compared     MTBR      Compared
                           (risk level)     to Current   Replacements   to Current             to Current
Current                    0.391            100%         53.85%         100%         8458.92   100%
Applied                    0.195 (0.638)    49.78%       92.31%         171.43%      7113.54   84.10%
Theoretical                0.157 (0.638)    40.26%       97.74%         181.53%      7070.09   83.58%
Fitted                     0.182 (1.259)    46.43%       92.31%         171.43%      7627.00   90.17%
No Scheduled Maintenance   0.638            163.14%      0.0%           0.0%         9405.25   111.19%
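
The MTBR "Compared to Current" figures appear to be plain ratios to the current policy's MTBR, and the %Suspended column appears to be the undecided count divided by the sample size. Both readings are assumptions inferred from the numbers, not definitions given in the text. A short Python sketch under those assumptions:

```python
# MTBR values copied from the benefits table above.
MTBR = {
    "Current": 8458.92,
    "Applied": 7113.54,
    "Theoretical": 7070.09,
    "Fitted": 7627.00,
    "No Scheduled Maintenance": 9405.25,
}


def percent_of_current(value: float, current: float) -> float:
    """Express a policy's value as a percentage of the current policy's value."""
    return round(100.0 * value / current, 2)


def percent_suspended(undecided: int, sample_size: int) -> float:
    """Assumed definition: undecided (suspended) units as a share of the sample."""
    return round(100.0 * undecided / sample_size, 1)


for policy, value in MTBR.items():
    print(f"{policy}: {percent_of_current(value, MTBR['Current'])}% of current MTBR")

for policy, undecided in (("Current", 4), ("Applied", 6), ("Fitted", 7)):
    print(f"{policy}: {percent_suspended(undecided, 13)}% suspended")
```

The cost and preventive replacement comparisons are not recomputed here because the table shows them only as rounded values.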
Engineering Change Assessment
Recording Events
Component age
Significant components
Chapter 4. Case based reasoning

Intelligent reasoning agents can use reliability-centered knowledge for automated
diagnostics and prognostics. The last decade and a half has seen the introduction
and growing use of case based reasoning (CBR) in maintenance. CBR extends
the concept of reliability-centered knowledge to automated diagnostics and
troubleshooting support systems. A case based reasoning system uses a structured
classification of knowledge and experience in a continuous improvement cycle of
maintenance response to problems. Assisted by CBR software, a maintainer
quickly retrieves information relevant to a particular situation, reuses that
information appropriately to deal with the current problem, and collaboratively
revises and adapts it for retention in the incrementally enriched knowledge (case)
repository.

Figure 4-1. The case based reasoning troubleshooting process. [73]

Figure 4-1 represents the CBR process: identify the current problem situation,
find a past case similar to the new one, use that case to suggest a solution to the
current problem, evaluate the proposed solution, and update the system by
learning from this experience. The solution from a previous case may be directly
applied to the present problem, or modified according to differences between the
two cases.
CBR cycle:
RETRIEVE the most similar case or cases
REUSE the information and knowledge in that case to solve the problem
REVISE the proposed solution
RETAIN the parts of this experience likely to be useful for future problem solving
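
The four steps of the cycle can be sketched in a few lines of Python. This is an illustration only, not the implementation of any particular CBR product. The `Case` structure, the use of Jaccard similarity between symptom sets, and the example symptoms and fixes are all assumptions made for the sketch.

```python
from dataclasses import dataclass


@dataclass
class Case:
    symptoms: frozenset
    solution: str


def similarity(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity: shared symptoms over all symptoms (an assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def retrieve(case_base, symptoms):
    """RETRIEVE: return the stored case most similar to the observed symptoms."""
    return max(case_base, key=lambda case: similarity(case.symptoms, symptoms))


def solve(case_base, symptoms, confirmed_fix=None):
    best = retrieve(case_base, symptoms)                # RETRIEVE
    proposed = best.solution                            # REUSE
    final = confirmed_fix or proposed                   # REVISE with technician feedback
    case_base.append(Case(frozenset(symptoms), final))  # RETAIN
    return proposed, final


# Hypothetical seed cases.
case_base = [
    Case(frozenset({"high oil temperature", "low oil pressure"}), "replace oil pump"),
    Case(frozenset({"vibration", "noise at idle"}), "realign shaft"),
]

proposed, final = solve(
    case_base,
    frozenset({"low oil pressure", "high oil temperature", "smoke"}),
    confirmed_fix="replace oil pump and filter",
)
print(proposed, "->", final, "| cases stored:", len(case_base))
```

Each solved session adds a case, which is how the case repository becomes incrementally richer over time.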

Figure 4-2. Case-based reasoning [74]

Intelligent troubleshooting poses the right questions in the
best order. A case based reasoning system guides the
maintenance technician by asking for information based on
the most likely and least costly path to a solution. It poses
questions and suggests solutions by considering these
factors:
Similarity of cases to the current symptoms
Frequency of occurrence of cases
Cost and time to get an answer
Cost and time of repairs
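
One way to picture how these four factors might be combined is a single priority score per candidate case. The scoring formula below is hypothetical, chosen only to illustrate the idea of "most likely and least costly first". It is not the ranking rule of any specific CBR system, and all names and numbers are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    similarity: float    # match between the case and current symptoms, 0..1
    occurrences: int     # how often this case has been recorded
    check_cost: float    # cost and time to get an answer (arbitrary units)
    repair_cost: float   # cost and time of the repair (arbitrary units)


def priority(candidate: Candidate, total_occurrences: int) -> float:
    """Hypothetical score: likelihood of the case per unit of total cost."""
    likelihood = candidate.similarity * candidate.occurrences / total_occurrences
    return likelihood / (candidate.check_cost + candidate.repair_cost)


def rank(candidates):
    """Order candidates so the most likely, least costly ones come first."""
    total = sum(c.occurrences for c in candidates)
    return sorted(candidates, key=lambda c: priority(c, total), reverse=True)


candidates = [
    Candidate("failed pump", 0.9, 10, 4.0, 20.0),
    Candidate("clogged filter", 0.8, 30, 1.0, 2.0),
    Candidate("sensor fault", 0.5, 5, 0.5, 1.0),
]

for c in rank(candidates):
    print(c.name)
```

In this example the frequent, cheap-to-check filter case is ranked ahead of the closer-matching but expensive pump case.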

Figure 4-3: A typical CBR session

Figure 4-3 shows a typical CBR session. The maintenance technician need
not answer every question in the order suggested. The CBR program
orders questions and suggested solutions based on all information known
up to that point. The information may be entered manually and/or
extracted directly from relevant databases. As the technician probes the symptoms,
guided by CBR, the system reconsiders the data and poses new questions
until the problem is solved.

Figure 4-4: CBR performance results

CBR measures its own performance by tracking the hit rate and monthly
average number of solved sessions. Figure 4-4 depicts a report for a
Honeywell jet propulsion engine over one year.

Figure 4-5: Managing the knowledge base [75]

Figure 4-5 describes the most significant feature of a case based reasoning
system. CBR provides tools and methods for the continuous enhancement
of the knowledge base. Knowledge integrity is assured through expert
review and classification of all completed sessions.

The seed case base

How do we begin benefiting from CBR? Figure 4-6
illustrates the development of the seed case base from
existing work order and troubleshooting records, failure
modes and effects analysis records, and OEM maintenance
and troubleshooting manuals.

Figure 4-6: The seed case base

Results of case-based reasoning
Better first-time fix
Cost reduction / Cost avoidance
o Reduce troubleshooting time
o Plan rapidly for unscheduled maintenance events
o Reduce no fault found parts replacements
o Reduce unscheduled service interruptions
Improved asset availability
Preserve and use intellectual assets
o Capture walking knowledge prior to retirement or
attrition
o Maximize utility of new staff
o Focus efforts of expert staff on toughest problems


Part 3. Condition Based Maintenance
Chapter 5. Deciding on CBM?
Introduction
Why do CBM?
Physical asset managers attempt to implement policies that maintain the
functionality of machinery and other production assets at a level required
by their users, owners, and by society at large. They select "proactive
maintenance" as their first line of defense against the causes of equipment
failure. By applying routine inspection (condition based maintenance aka
CBM, on-condition maintenance, and predictive maintenance) or periodic
renewal (preventive maintenance aka PM, scheduled overhaul), they seek
to avoid the consequences of failure. Of the two approaches they prefer
the former because it is usually less expensive and less intrusive. Although
data is plentiful and can be collected and processed in every situation,
CBM is appropriate only when it is both applicable (technically feasible)
and effective (economically justifiable). Applicability implies that a
non-ambiguous indicator of failure initiation exists and that it leaves
sufficient time to act.

Preventive maintenance is the routine renewal of physical assets or their
components. Condition based maintenance is the routine inspection of a
physical asset to determine whether a failure process is underway. If
failure has begun, the goal is to take an action which will somehow avoid
or reduce the consequences of failure. If the remedial action (for example
a cleaning or adjustment) can be performed on the spot, at the time of the
inspection, most companies consider the inspection activity as belonging
to their preventive maintenance (PM) program.

Condition based maintenance (aka on-condition maintenance, predictive
maintenance, and others) first appeared in the late 1940s in the Rio
Grande Railway Company, to detect coolant and fuel leaks in a diesel
engine's lubricating oil. They achieved outstanding economic success in
reducing engine failure by performing maintenance whenever "any" glycol
or fuel was detected in the engine oil. The U.S. Army, impressed by the
relative ease with which physical asset availability could be improved,
adopted those techniques and developed others. During the 1950s, 1960s, and
early 1970s CBM grew in popularity and a vibrant CBM technology industry
emerged, providing training, products, and services which came to be
known as "predictive maintenance".

Commercialization of CBM coincided with the dawn of the "information
age" and CBM took on a new "flavor". Technology entrepreneurs
conjectured that, if simple physical measurements, such as vibration
amplitude or oil viscosity, could provide such useful benefits, then
collecting the data in computers and trending it over time would likely
provide a far deeper insight into the state of a machine's health. Hence the
1980s and 1990s witnessed a soaring rise in the use of computers,
software, and data collectors in maintenance shops throughout the
industrial world.

In reality, even in the midst of impressive information technology growth,
most day-to-day CBM success stories still derive from the basic
application of the original, uncomplicated form of CBM. For example, the
detection of unbalance in a rotating machine, of glycol or fuel in an engine
oil, or of mechanical looseness, soft foot, or shaft misalignment seldom
requires the degree of sophistication (and related expense) of the variety of
technology bells and whistles happily proffered by the CBM industry.

At the same time (as the growth of CBM), the information technology
revolution impacted another part of physical asset management - the
computerized control of maintenance materials, labor, and historical
records. These products became known as computerized maintenance
management systems (CMMS). There was, however, a striking difference
between the CBM and CMMS approaches.

While CBM technology vendors required their clients to adhere to highly
structured procedures for data collection and storage, CMMS vendors, on
the other hand, hailed the concept of 'flexibility' and emphasized their
products' "ease of adaptation" to their clients' existing business processes.
As a consequence of their much vaunted "user friendliness" no common
practices of data classification gathered sufficient critical mass to achieve
standardization - not even within a given organization, let alone in an
industry or in the physical asset management community at large.

It is in this context that the third millennium, the age of connectivity,
finds the state of maintenance information. Maintenance technology
vendors are poised to inject the latest generation of "integration
technology" into their traditional market. But the lack of a common data
model impedes smooth penetration.

The Maintenance Information Management Open Systems Alliance
(MIMOSA) was formed in 1994 by key CBM and maintenance
technology vendors to address the problem. The result of their labors in
the past 7 years is the impressive common relational information system
(CRIS) and associated enabling tools. The CRIS accommodates many
physical asset management concepts within its data structure and has the
flexibility to adapt as required. It is continuously maintained and updated
by MIMOSA (www.mimosa.org).

Hence we may foretell the day when disparate production and physical
asset management systems will communicate seamlessly thanks to
MIMOSA and other standardized information protocols such as OSA-CBM
(Open System Architecture for Condition Based Maintenance), STEP
(Standard for the Exchange of Product model data), OPC (OLE for process
control), OAG (Open Applications Group), and others.

Connectivity to this degree of intimacy implies that process and
maintenance information from multiple platforms will materialize in a
universally accessible format (CRIS) and, in that homogenized form, may
be intelligently processed for optimum decision making. Optimization
seeks to achieve some objective: the lowest average cost of maintenance,
highest asset availability, or a specified effective reliability. It is onto this
stage that the "CBM Optimizing Intelligent Agent" enters.

EXAKT, a CBM optimizing software package developed by the CBM Laboratory
at the University of Toronto, is an intelligent agent. More precisely, it is a
platform for developing intelligent agents that are designed to interpret
condition data (CBM measurements) in combination with corresponding
historical data from the CMMS. The agent reduces both data sets to a clear
decision, i.e. whether to intervene and perform maintenance at this time
or to allow the equipment to continue operating. It does so by considering
the economic consequences of failure, the cost of repair, and the risk of
failure in an upcoming period. It generates a recommendation that
supports a currently stated management objective: either to minimize cost,
or to maximize the asset's availability, or to achieve a particular desired
key performance indicator (KPI) such as the ratio of planned-to-breakdown
maintenance.

What does the future have in store for CBM? The CBM process consists
of three sub-processes: data acquisition, signal processing, and decision
making. Data acquisition is already highly technologically advanced.
"Signal processing" in CBM filters operational and environmental effects out of
the data so that what is left is a "condition indicator" that
reflects the degree of deterioration of some targeted failure mode. New
signal processing methodologies based on a variety of disciplines (wavelet
analysis, principal component analysis, inference engines, and neural net
classifiers, to name a few) are being developed in research institutions and
universities around the world. (Chapter 6 describes a few such
techniques.) Their effect will be to make it technically feasible to track
and manage ever increasing numbers of failure modes.

Chapter 6. Anatomy of CBM
Data Acquisition
Signal Processing
Decision Making

Chapter 7. CBM Fundamentals
The fundamental premise of CBM
CBM Program Criteria
CBM Monitoring Frequency
Estimating the PF Interval

Chapter 8. The Elusive P-F Interval

J. Moubray coined the phrase "P-F interval". He used it to highlight two essential
pre-requisites of CBM, namely:
1. A clear indicator of decreased failure resistance - the potential failure, and
2. A reasonably consistent warning period prior to functional failure - the P-F
interval

Both these requirements are captured in the well known empirical graph of failure
resistance versus working age (Figure 8-1).

Figure 8-1 The P-F interval

The P-F interval is a deceptively simple idea. Deceptive, because it takes for
granted that we have previously defined "P" (the potential failure). Of the two
concepts, P and P-F Interval, it is the former, however, that poses the greater
challenge. Therefore, before addressing the P-F interval, we need to determine
when and how to declare a potential failure.

Figure 8-1 implies that if we could monitor a condition indicator that tracks the
resistance to failure, then declaring the potential failure level would be an easy
matter. Two stumbling blocks, unfortunately, arise and obstruct our plan. The
obstacles to the implementation of Figure 8-1 are:

1. A condition indicator that faithfully tracks the resistance-to-failure curve
is rare, and
2. The resistance-to-failure curve itself is rarely available.

Condition data, on the other hand, is abundant. How may we overcome obstacles
1 and 2? That is, how may we apply CBM to the numerous physical assets where
condition monitoring data abounds, yet where few alert limits have so far been
defined?

This (setting of the declaration level of the potential failure) is the problem
encountered by many asset managers deluged with condition monitoring data.
The unavoidable question facing any implementer of a CBM program is where to
set the potential failure. Which indicator, among the many monitored variables, should
the implementer use? When the physics of the situation are not well known (as is often the
case), a policy for declaring a potential failure is far from obvious.

Why does Figure 8-1 so stubbornly elude our grasp? The reason is that this graph
is really not 2-dimensional, but multi-dimensional. There is one dimension for
each significant monitored variable. In that setting the simple curve of Figure 8-1 loses its
geometric clarity. This is where software comes to the rescue.

EXAKT summarizes the risk factors associated with working age and monitored
variables and creates a new kind of graph by transforming the significant risk
information onto a 2-dimensional optimal decision graph. Dragan Banjevic,
CBM Lab director, captured the multi-dimensionality of Figure 8-1 in two ways.
First, he combined the significant monitored variables (other than age) into a risk-
weighted sum. That became the y-axis. Then he transformed the age-related risk
factor into the shape of the limit boundary. The 2-dimensional graph of Figure
8-2 tracks the composite of all the key risk variables determined by the software.
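
The construction described above can be sketched with the Weibull proportional hazards model (PHM) named in the preface. The sketch assumes the usual Weibull PHM hazard, h(t, Z) = (beta/eta) * (t/eta)^(beta-1) * exp(sum of gamma_i * Z_i), and assumes that a decision is triggered when the hazard reaches some threshold. All parameter values, the two covariates, and the threshold are hypothetical. In practice the threshold would come from a cost or availability optimization, which is not shown here, and this is not a description of EXAKT's internal code.

```python
import math

# Hypothetical model parameters (not taken from any EXAKT data set).
BETA, ETA = 2.0, 10000.0     # Weibull shape and scale (working-age units)
GAMMAS = (0.05, 0.2)         # risk weights for two monitored variables
HAZARD_THRESHOLD = 0.001     # assumed decision threshold on the hazard


def composite(z):
    """Risk-weighted sum of the monitored variables (the y-axis of the graph)."""
    return sum(g * v for g, v in zip(GAMMAS, z))


def hazard(t, z):
    """Weibull PHM hazard at working age t with covariate readings z."""
    return (BETA / ETA) * (t / ETA) ** (BETA - 1) * math.exp(composite(z))


def limit(t):
    """Composite value at which the hazard equals the threshold at age t.

    This is the age-dependent limit boundary: for BETA > 1 it falls as t grows.
    """
    return math.log(HAZARD_THRESHOLD * ETA / BETA) - (BETA - 1) * math.log(t / ETA)


def recommend(t, z):
    return "intervene" if composite(z) >= limit(t) else "continue"


readings = (20.0, 5.0)  # e.g. two oil analysis values; composite = 2.0
for age in (6000.0, 8000.0):
    print(age, round(limit(age), 3), recommend(age, readings))
```

With identical readings, the recommendation changes from "continue" to "intervene" as working age increases, because the age-related risk is carried by the shape of the boundary rather than by the y-axis value.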

Figure 8-2

EXAKT handles the probabilistic nature of P and the P-F interval properly.
EXAKT does not assume a deterministic P or P-F interval. Instead it draws (from
historical records) a probabilistic relationship among all significant factors
(including working age). It uses that relationship to estimate the remaining useful
life at any given moment. One of the benefits of this approach is the ability to deal
with noisy data, illustrated in Figure 8-3. On the left side of Figure 8-3 are 3
examples of ideal data. The monitored values increase monotonically, with the
red alarm set conveniently to the potential failure declaration level (presumably
the potential failure). On the right side of Figure 8-3 is data from the nasty world.
It contains random fluctuations and contradicting trends. In other words the usual
situation! EXAKT alleviates such randomness (see Example 4 on page 193) and
conflicting trend data (see Example 2 on page 88).

Figure 8-3

Summarizing, EXAKT overcomes both obstacles to the application of Figure 8-1:

1. It uncovers the combination of monitored variables that most reflect
degraded failure resistance, and
2. It provides a virtual failure resistance curve that accounts for multiple
risk factors.

Are failures required? Multiple levels of intrusiveness
By definition, a potential failure has no dire consequences. Often a
less intrusive form of CBM is used to decide when a more
intrusive inspection is required. For example, oil analysis results
often indicate that a problem is occurring in a complex system,
such as an engine, but do not specify which component is failing,
nor which failure mode is occurring. In those cases a physical
inspection requiring additional forms of testing is desirable [80].
Should the physical inspection (a more intrusive form of CBM)
uncover a potential failure, then a model relating the less intrusive
measurements to the findings of the more intrusive inspections is
desirable. Still, a functional failure will not yet have occurred.
With ever increasing amounts of data being captured from the
control platform, two (or more) levels of CBM are often desirable.
Hence we may build decision models that predict potential failures
and avoid functional failures altogether.

Chapter 9. Optimizing CBM
Developing a Maintenance Risk Model
When the physics of a failure are not completely understood (as is often
the case), we cannot specify a potential failure as a single alarm level.
Nevertheless, we often possess multiple indicators that we know relate to
an item's remaining useful life. But we do not know the nature of that
relationship. Additionally, measurable external factors, such as duty cycle,
operating environment, and minor maintenance, likewise, influence the
propensity of an item to fail. The best we can do under these
circumstances is to deduce those key risk variables in order to specify the
probability of failure occurring in a given interval.

The foregoing appears to rule out the use of CBM, according to a criterion
we stipulated in Chapter 7 (CBM Program Criteria, page 64) that "an
unambiguous potential failure must be detectable".

We recognize, nonetheless, that asset management, as does business in
general, requires us to manage risk. Seldom do we have complete
information, yet we still must make the best decisions possible. Therefore,
in this chapter, we broaden the meaning of the word "unambiguous" to
include the ability to specify a probability that failure will occur in an
interval given a set of observations. Rather than precluding CBM, the
precondition of Chapter 7 now requires us to:

1. develop applicable signal processing methods, and
2. establish a CBM data interpretation policy
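
As a simple illustration of "a probability that failure will occur in an interval", the sketch below uses a plain Weibull age model with no condition data, where the reliability is R(t) = exp(-(t/eta)^beta) and the probability of failing in the next interval, given survival to age t, is 1 - R(t + delta) / R(t). The parameter values are hypothetical. A model that also conditions on the observations would add covariates, which this sketch omits.

```python
import math


def reliability(t: float, beta: float, eta: float) -> float:
    """Weibull reliability (survival probability) at working age t."""
    return math.exp(-((t / eta) ** beta))


def conditional_failure_probability(t: float, delta: float, beta: float, eta: float) -> float:
    """Probability of failure in (t, t + delta], given survival to age t."""
    return 1.0 - reliability(t + delta, beta, eta) / reliability(t, beta, eta)


# Hypothetical parameters: shape 2 (wear-out behaviour), scale 10000 hours.
for age in (1000.0, 5000.0, 9000.0):
    p = conditional_failure_probability(age, 500.0, beta=2.0, eta=10000.0)
    print(f"age {age:.0f} h: P(failure in next 500 h) = {p:.4f}")
```

With a shape factor of 1 the same function returns the same probability at every age, which is the random failure case; with a shape factor above 1 the probability rises with age.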

The traditional risk model
Combining Data and Risk

The Optimal Risk
A Time Based Maintenance Model

Blending in Cost

A Condition Based Maintenance Model
Convention used: Meaning:
X instruction to close the current sub-window (or pane)
This tutorial can be run using the EXAKT program and data on the CD
distributed with this course material [93]. The instructions in the right column
of the following table are minimal so as to keep them simple. The left
column provides more detailed explanation. Whenever an EXAKT menu
option or icon is mentioned, it should be clicked in the EXAKT program.
When database tables are mentioned, they should be double clicked.
You will learn the basic functions of the EXAKT model building
platform and the EXAKT decision agent software. Example 1 uses a
reduced set of oil analysis data from a fleet of haul truck transmissions to
build a proportional hazards model. You will create and deploy this
model as an intelligent agent that silently and automatically monitors
future condition monitoring data, returning an optimized decision
(whether or not to remove and repair the transmission) as each new set of
condition monitoring readings is received. A long term policy of making
optimized decisions will, on average, minimize some undesirable
feature, such as cost, or maximize some wanted feature, such as
availability. The agent provides a remaining useful life estimate based
on the current condition of the equipment, its age, and all relevant
maintenance and operational events that have occurred.
Building the CBM Optimal Decision Model
Detailed Explanation Steps to follow
1
Install the EXAKT program from the Flash player user
interface on the CD (or from the downloaded zip file).
EXAKT, Install Exakt, follow
prompts
2
Install the data files from the CD's Flash player user
interface. (Alternatively download and place them in a
folder on your hard drive. Modify the path given in step 4
and step 2 (of Section 2)accordingly.)
EXAKT, Install data files
3
Launch EXAKT for Modeling. This is the program for
validating and analyzing condition monitoring and event
data and for building the optimized CBM (condition based
maintenance) model
Start, Exakt for Modeling
4
Load the working model database
(Transmission_WMOD.mdb).
File, Open, navigate to c:\Program
Files\Exakt\data,
Transmission_WMOD.mdb, Open
5
From the EXAKT Modeling program, attach the sample
measurements and events database (Transmission_MES.mdb)
to the EXAKT working model database. After
executing the steps to the right you may examine the
attachment script by again selecting Modeling, Data Set-up.
You will note that it creates an ODBC (open database
connectivity) link to an external database called
Transmission_MES.mdb and attaches a number of
tables. It applies its own internal names to two of the
tables using the A=B syntax, but the other tables are attached
directly since their names are already consistent with
EXAKT's internal names for those tables.
Modeling (on the Menu bar), Data
setup, type in the attachment script
(actually it is already keyed in for
you), Execute, Save

6
Notice that the attached tables have now become visible
and accessible in the tree structure of the left pane.
In the next steps you will examine each one of those tables
to become familiar with their content and structure,
starting with Inspections. Open the Inspections table.
Note the column names and content. Ident, Date, and
WorkingAge are key words used by EXAKT. Ident is the
unique name of each unit of a specific type of Item to be
analyzed. An item is a significant system, subsystem, or
component upon which it is convenient and desirable to
conduct a reliability analysis. An item may consist of
several components and may undergo several failure
modes. But in this introductory section of the tutorial we
will keep it simple and assume that the item is a simple
item. The Date may be in date or date/time format. If
condition monitoring inspections are more frequent than
once every 24 hours, the date/time format must be used.
The WorkingAge is a measure such as hours of operation,
fuel consumed, thousands of feet of steel rolled, or any
other measurement that reflects the accumulated usage or
stress on the item. Calendar time can only be used if the
units operate regularly in time, a rare situation.
Databases of production records, hour meters, or counters
must be used to acquire useful WorkingAge data. The
remaining columns contain the condition monitoring data
which we refer to as condition data.
Inspections, X
7
Now examine the Events data table. Contrasted with the
Inspections table, its information represents the other side
of the coin. Both Event and Inspection data are required
for CBM optimization. The EXAKT modeling process is one
of correlation of Events (of all kinds) and Inspections (that
is, condition data). Condition data often comes from
specialized databases provided by CBM product or service
vendors. Common examples are oil analysis and
vibration analysis. These databases are invariably well
organized and consistently populated. The Events data, on
the other hand, often comes from the organization's CMMS
(computerized maintenance management system) and
from production databases. (The records in the CMMS,
typically, have been less rigorously kept than the others.
Hence EXAKT contains tools and techniques to validate and
get the CMMS data into shape.) The basic required Events
are: 1) Beginning (an item has been placed into service),
designated by B; 2) Ending by Failure (EF); and 3) Ending
by Suspension (ES). By suspension we mean that the
item has been taken out of service for any reason other
than failure. For example, it may have been preventively
replaced. Once again the Ident, the Date of the Event, and
WorkingAge are required fields. The Event itself is recorded
in the fourth column. OC in this example represents an
oil change event. Any event which affects the condition
data (in this case it would initialize the wear metals and
contaminant elements to zero) must be included in the
model.
Events, X
8
Examine the CovariatesOnEvent table. We must provide
the initialization values for each event. Note that in this
case we are initializing wear metals and contaminants to
zero and additives to their new-condition levels. We may
also establish calendar periods for which these initialized
values are to be used. (For example, the brand or grade of
lubricating oil may be changed periodically.)
CovariatesOnEvent, X

9
Examine the EventsDescription table. The column P (for
precedence) tells the EXAKT program in which order to
consider separate events that occur at the same date/time.
For example, if an oil sample is drawn from an oil drain, we
would wish that the sequence of the Inspection precede
that of the oil change. The inspection event is implicitly
given the precedence 0.
EventsDescription, X
10
Examine the Models table. It contains no records yet. That
is because you have not yet begun building a model. This
table is populated automatically by EXAKT as you proceed.
The only time you might access this table manually would
be to delete certain sub-model(s) that you do not wish to
retain. A sub-model is one of any number of models that
are tested in the modeling process. The sub-model that is
considered the best is then exported to become the
intelligent agent that will provide decision optimization on a
particular item's condition data.
Models, X
11
Now that we have examined the internal and external
database tables we are ready to proceed with the
development of a rudimentary CBM optimization model.
We turn our attention to the right hand window pane
containing buttons arranged in a flow chart of activities.
We enter the general project data.
Data Preparation, General Event
Data, Project Title: Haul Trucks,
CBM Model: Trans Oil Anal,
Description: 350 T Transmission Oil
Analysis, Time Unit: Hrs., OK
12
Next we instruct EXAKT to assemble the Events and
Inspections into a single table C_Inspections to be used
for subsequent calculations. Depending on which version of
EXAKT you are using there are a number of alternative
buttons we may hit. But for this exercise please choose the
option similar to Covariates Complete. After hitting this
button two more tables will appear in the left pane,
C_Events and C_Inspections.
With Covariates (Complete)
13
Examine the C_Inspections table. Note that the records of
both tables (Events and Inspections) have been combined
and arranged in chronological order in the single table
C_Inspections. Inspection (condition monitoring) record
events are designated by an *. The other event records
have monitored data (covariate) values set to their
initialized levels according to the CovariatesOnEvent table
discussed previously.
C_Inspections, X
14
Now let's begin the modeling phase of the analysis. Hit
the Modeling button in the Transmissions Oil
Analysis(*):2 window, not the Modeling menu item. After
executing steps A on the right, the Trans Oil Anal (ilcm)
report window appears. Examine the report. The
Summary of Events and Censored Values presents the
overall summary of the data being analyzed. A Sample
Size of 13 means that there are 13 histories or lifetimes
having a beginning and some kind of ending event. Of the
13 histories 6 ended in failure, 3 (Censored (Def)) ended
prior to a failure, and 4 (Censored (Temp)) belong to units
that are in operation at the time of building this model.
The latter are referred to in EXAKT as temporary suspensions
and are identified automatically by the software. The next
tabulation, Summary of Estimated Parameters, provides
the results of our first sub-model, ilcm. The column
Sign. indicates whether the Parameter is significant,
that is, whether it has been found to be statistically related
to failure. The Shape (i.e. WorkingAge), Iron, and Lead are
designated as significant (at this point in the analysis)
while Calcium and Magnesium are not. Note that
Magnesium has the highest p-Value; a high p-value
means that the data give little evidence that Magnesium has
any impact on the risk of failure. The next step is to try
a different model by eliminating the lowest impact variable,
Magnesium. Close the window and execute steps B and C
to create 2 more sub-models. Notice that we are
successively removing the covariate with the highest
reported p-Value. After hitting OK you will receive an
alert message from EXAKTm telling you that the
procedure is over. This is normal for samples of small size
(low number of histories ending in failure). You may safely
ignore this message by hitting OK in the message box.
Each of the reports produced from the different models
may be printed (Ctrl-P). The columns in the reports are
explained in the Exakt Manual accessible from the Windows
Start menu.
A. Modeling, Weibull PHM, Select
Covariates, sub-model Name: ilcm,
Iron, , , , , OK, X
B. Modeling, Weibull PHM, Select
Covariates, sub-model Name: ilc,
Magnesium, , OK, X
C. Modeling, Weibull PHM, Select
Covariates, sub-model Name: il,
Calcium, , OK, X
15
At this point we have a sub-model with covariates and
shape parameter that are all significant. We may
conclude that this, therefore, is potentially an acceptable
model for failure risk prediction. To be rigorous, we should
test one last possible combination: a sub-model with iron
alone. (We choose Iron as it is the variable with the lowest
p-value and thus is likely to have the strongest relationship
to failure.) The report tells us that this is also a potentially
good predictive model (i.e. iron alone is still significant). In
the next step we decide which of the two sub-models
should be retained and later deployed.
Modeling, Weibull PHM, sub-model
Name: i, Lead, , OK, X
16
After executing the steps on the right the PHM Parameter
Estimation - Comparison report is displayed. The N in
the second column is telling you that the sub-model i is
not close to the base sub-model il. This means that this
simpler sub-model is not as good as il and that we would
be losing confidence by using it rather than the more
complete model il.
Comparative Report, Compare: il, i,
, OK, OK
17
In this step we examine the results of statistical testing
performed by EXAKT on the retained sub-model, il.
Reactivate this model with the steps on the right. Use the
menu item Modeling
Modeling (menu item), Select
Current Model, Sub model: il, OK.
18
Now hit the Modeling button (not the Modeling menu item).
The third table of the PHM Goodness of Fit Test tells us
that the proportional hazards model we constructed for risk
as a function of working age and the two significant
covariates fits the data well enough for it to be used with
a confidence of 95%. The test used for this is known as the
Kolmogorov-Smirnov test and is well accepted as a
statistical tool. The test shows that the model is not
rejected at the 5% significance level - i.e. it is accepted at
a 95% confidence level.
Modeling, Weibull PHM, Summary
Report, X
19
After executing the steps of (A) on the right we see that
EXAKT has created a set of bands (listed under Interval
Start Points) or transition states for Lead with which to
build a transition probability model. The transition
probability model calculates the probability of jumping to
another state at the next inspection interval. (An example
of what we mean by jumping to another state will be given
below in step 20.) Execute step (B) and notice that the
transition bands provided for Iron are quite different. This
is because historical iron measurements are scattered
throughout an entirely different range of values. This can
be ascertained using EXAKT's cross-graph function (see
user guide). Execute step (C) to close the window.
(A) Transition Probability Model,
Covariate Bands Covariate: Lead
(B) select Covariate Iron
(C) OK
20
Execute step A. Notice that the two buttons Display
Matrix and Display Survival become active. Let's
examine the Display Survival function report. Set
WorkingAge to, say, 8000 hours, and Observation Interval
to, say, 200 hours (assuming, for example, that our asset
is currently at age 8000 and we are interested in knowing
its risk of failure in the next 200 hours). The Markov
Chain Model Survival Probability matrix report is
displayed. The probabilities of Iron values jumping to
another state and the probability of failure in the upcoming
interval are displayed in a tabular format. (This table
represents only a part of the entire set of transition
probabilities taken into account by the model, since we
have chosen to ignore the other significant covariate, Lead,
in this report. To include more than one covariate in the
visual report would require the representation of
multi-dimensional matrices. Instead this report allows us
to see how a single variable changes irrespective of the
others.) Looking at the table we see, for example, that the
cell for "0-4.004" and "4.004-9.009" has the entry 0.301615.
This means that there is a 30.1615% probability that iron
will be in that state at the next monitoring interval. Hence
this report provides the probabilities of being in any state
at some future time. (Of course, this report is provided for
analysis purposes only while building the model. The
transition probabilities are fully integrated into the final
decision model that will be deployed in section 2.)
(A) Transition Rates Display
Survival, Working Age: 8000,
Observation Interval: 200, Report
Close the report and the Display
Survival Probabilities dialog.
21
Now for the final step in developing a decision optimization
model. We blend into the model the economics governing
the failure and repair of this item. That is, we apply the
average cost of a preventive repair C and the average cost
(including consequential costs) of a failure C+K. (It is
rarely necessary to have great precision in these amounts
for relative costs. The cost sensitivity function of EXAKT
allows us to confirm this for the decision model in question.
Its usage is described in the EXAKT help file guide.) After
hitting the Report Icon (which you'll find to the left of the
Print Icon on the Tool Bar), the Condition Based
Replacement Policy Cost Analysis report appears.
Examine the Summary of Cost Analysis table below the
Cost Function graph. It is telling you that by adhering to
the interpretive decisions of the model, an optimal long run
ratio of preventive to failure replacements will be 98.8:1.2
which will result in a cost savings of 75.1% relative to a
replacement-only-at-failure policy. (The cost comparison
reporting function similarly compares the optimal EXAKT
policy with existing practice. Its usage is described in the
EXAKT help file guide.)
Decision Model, Decision Model
Parameters, Replacement (C):
1200, Failure (C+K): 6000, Cost
Unit: $, Inspection Interval: 250,
OK, Report Icon (to the left of the
Print Icon)
22
We have been, up to now, building a model based on the
historical data from the entire fleet. We may now test the
model on any individual unit either for the current situation
(i.e. the latest data available in the database, called "LH"
for last history) or we may look at any other history
retroactively. The steps on the right display the reports of
the latest monitored values of each unit. Four graphs are
shown, one for each of the four units 17-66, 17-67, 17-77
and 17-79. By examining the four graphs we see that none
are in alarm at the current moment when this snapshot of
the data has been made. If the weighted sum of the
significant covariates (i.e. the y-axis plotted variable) falls
in the green region, no action is necessary; in the yellow,
the item should be renewed before the next monitoring
interval; in the red, the item should be repaired or
replaced immediately. It should be noted that these
boundaries vary with working age, which reflects the
analysis finding that working age, as well as Iron and
Lead, is a significant failure risk factor. At some point in
the past the values for 17-67 hit the red zone. This may
indicate a spurious laboratory result that was corrected in a
follow-up verification. (For modeling, known incorrect data
should be removed from consideration.) Note that the
x-axis scale differs from graph to graph depending on the
current age of the unit.
Decisions, 17-66, shift+17-79, OK,
Report Icon , PgDn, PgDn,
PgDn
23
The analysis and model building phase is complete. We are
now ready to export the optimal decision model we
created into our maintenance system environment (where
it has access to continuously renewing data) so that it can
do its job. Activate the pane on the left by clicking it. By
hitting save as instructed on the right, you are sending the
model to a database located on the network. But before
you do so, we will, for expedience, copy the script onto the
clipboard as instructed. Then hit save. You will notice that
several new table links to an external database have been
added to the tree in the left pane. Now that the ODBC links
have been set up, we proceed to the actual export step
next.
ModelDbase, Connect to Database
Script, key in the script for
exporting the model (actually it has
been keyed in for you in this
sample), select the entire script,
ctrl-c, Save
24
After executing the steps on the right you may examine
the tables DecModels, UnitToModel,
DecCovariatesOnEvent, DecEventsDescription (by
double clicking on the file names in the tree view of the left
pane) to see just what information has been exported to
the external database. Please proceed to Section II of this
tutorial in order to deploy the decision model that you have
just created. You may close the EXAKT Modeling (EXAKTm)
program
ModelDbase, Store the Decision
model
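The cost figures of step 21 can be cross-checked with a little arithmetic.
The sketch below is not EXAKT's own calculation. It computes the average
cost per replacement implied by the 98.8:1.2 ratio with C = 1200 and
C+K = 6000, and then, assuming the 75.1% saving is expressed per unit of
working age (as is usual for replacement cost functions), the ratio of mean
cycle lengths that such a saving would imply. The function names are
illustrative only.

```python
def cost_per_replacement(c_prev, c_fail, frac_prev):
    """Average cost of one replacement when a fraction frac_prev is preventive."""
    return frac_prev * c_prev + (1.0 - frac_prev) * c_fail


def implied_cycle_ratio(avg_cost_opt, c_fail, saving):
    """Mean cycle length under the optimal policy divided by the mean cycle
    length under run-to-failure, implied by a saving in cost per unit time.

    cost rate = cost per cycle / cycle length, so
    avg_cost_opt / T_opt = (1 - saving) * c_fail / T_fail
    """
    return avg_cost_opt / ((1.0 - saving) * c_fail)


# Figures from step 21: C = 1200, C+K = 6000, ratio 98.8:1.2, saving 75.1%.
avg = cost_per_replacement(1200.0, 6000.0, 0.988)
ratio = implied_cycle_ratio(avg, 6000.0, 0.751)
print(f"average cost per replacement: {avg:.2f}")
print(f"implied T_opt / T_fail:       {ratio:.3f}")
```

With the tutorial's figures the average works out to about $1257.60 per
replacement. Under the stated assumption, the reported saving would
correspond to a mean replacement cycle roughly 0.84 times as long as the
run-to-failure cycle.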
Deploying the Decision Model as an Intelligent Agent
1
In this section we run the agent
manually. (It can also be set up to run
automatically). After you execute the steps
on the right the user interface of the
EXAKTd decision agent appears.
Start, Programs, Exakt, Exakt for Decisions
2
Execute the steps on the right, to create a
working database for decisions
(Transmission_WDEC.mdb).
File, New, Navigate to c:\Program Files\Exakt\data,
Transmission_WDEC.mdb, Create
3
Now we will link (via ODBC) to the
database where we previously exported
our model (Step 24 of Section I.). After
executing the steps on the right you will
see the name of the Model you created,
Trans Oil Anal in the top left pane.
Setup, Connect to model database script, ctrl-v (or
copy and paste this script:
DATABASE="Transmission_DMDR.mdb";
ATTACH DecModels,
UnitToModel,
DecCovariatesOnEvent,
DecEventsDescription,
Decisions
), hit Save
4
After executing this step, you will see each
of the units whose optimal decisions for oil
analysis will be governed by this model.
(New units may be added easily in the
EXAKTd program.)
Expand Trans Oil Anal
5
By selecting any unit in the top left pane,
we see a list of properties but no values.
We will next run the agent manually on the
latest available set of condition monitoring
oil analysis data.
17-66
6
Now you will re-select the Model Trans Oil
Anal and execute the decision agent by
following the steps on the right.
Trans Oil Anal, Reports, Create reports, Calculate
time to replace
7
The results of the entire fleet have been
analyzed and decisions have been returned
for each unit. You may examine the reports
of each fleet member by following the
steps on the right.
Report icon , expand report window, PgDn,
PgDn, PgDn, X (of the sub-window or pane, not the
main window)
8
With Trans Oil Anal selected you can
conveniently examine the optimal
decisions for the entire fleet on one list in
the right window. You are actually
examining the contents of the Decisions
table of the Transmission_DMDR.mdb
database. This database can be accessed
easily by any program, such as your
CMMS. This implies that the decision
model's operation and its results may be
integrated within existing maintenance
system software. In other words, the
EXAKTd program need not be used at all.
However, it does have a very convenient
user interface and several useful functions,
some of which are described in the
following steps.
Reports, Create new report list, New Report List
Name: Indoor trucks, OK. Reports, Create new
report list, New Report List Name: Outdoor trucks,
OK. Select Trans Oil Anal, Select 17-66 + 17-67,
ctrl-c, Select Indoor Trucks, ctrl-v. Select Trans Oil
Anal, Select 17-77 + 17-79, ctrl-c, Select Outdoor
Trucks, ctrl-v
9
Now we will use the new report lists to help
manage our trucks by department.
Select Indoor Trucks, Reports, Create Reports,
Calculate time to replace. Select Outdoor Trucks,
Reports, Create Reports, Calculate time to replace
10
This completes this section of the Tutorial.
This has been a minimal exercise to
demonstrate a small portion of the EXAKT
functionality. Please refer to the On-line
guide (available on your Start | Programs |
EXAKT menu) for a much more detailed
treatment of the subject of CBM
optimization.

Example 2 Data validation
Most maintenance departments have yet to adopt standard data
management procedures such as those described in Part 1 (page
6). Hence data validation, after the fact, will necessarily form part
of the CBM automated data interpretation modeling process.
Cardinal River Coals Ltd. was a 50/50 joint venture between
Luscar Ltd. and Consol of Canada, Inc. The mine is located
approximately 50 km south of Hinton, Alberta on the eastern
slopes of the Rocky Mountains. The coal produced from the mine
was low sulphur, high quality coking coal used for steel making.

Cardinal River Coals Ltd. opened in 1970 as a multiple open pit
mine using the truck and shovel mining method. Annual
production at the mine called for the removal of 21 million cubic
meters of rock and 2.8 million tons of coal. The mine won multiple
awards for land reclamation and for creating wildlife habitat. Oil
analysis results from a fleet of 55 haul truck wheel motors were
analyzed along with their respective failures and repairs over a
nine-year period.

Extensive planetary gear or sun gear (Figure 9-15) damage
necessitates replacement of one or more major internal components
in a general overhaul. There were 26 haul trucks at the mine site,
each having two wheel motors. With 3 spare wheel motors the fleet
numbered 55. Oil analysis was carried out monthly.

Figure 9-15: Wheelmotor

The mine's computerized maintenance management system
(CMMS) recorded wheel motor removals either due to failure or as
the result of a decision to carry out preventive maintenance. The
decision to perform PM was made by assessing oil analysis data
and by taking into account the wheel motor's age. Costs and details
associated with past removals were available. Additionally, an oil
analysis database contained a vast history of condition monitoring
test results, some 50,000 records, covering the same time period as
the removal history.

The Existing CBM Program
In most maintenance departments, oil analysis reports are
received from a commercial laboratory and are summarily
reviewed by a maintenance planner or supervisor. The
laboratory usually points out sudden increases in
concentration of wear metals or contaminant elements such
as silicon (Si) in the oil sample. At Cardinal, staff noted
the amount of sediment (weight filtrate on a filter patch)
and parts per million (ppm) of five elements: iron, silicon,
chrome, nickel, and titanium. A decision to remove the unit
(for overhaul) was based on a visual perusal of the reported
values of these elements in conjunction with the unit's age.
The policy was to rebuild the units after about 20,000
hours of operation regardless of the CBM data, or earlier if
the metal levels were abnormally high. Our objective is to
assess the applicability and effectiveness of the existing
CBM program and to improve it by using the EXAKT age
exploration methodology.

Validating Event Data
The first and most important step in any CBM optimization
project is to ascertain the validity of the data available both
in the CMMS and in the CBM databases. We begin by
applying EXAKT's DataCheck function to the records of
both databases. The result of this operation is a report
similar to that shown in Figure 9-16.

Figure 9-16: DataCheck report in synchronized view with
Inspections and Events tables.
The DataCheck report addresses a common problem in
historical CMMS databases. Often work order records omit
the description of what was found when examining the item
prior to its repair [94]. The report of Figure 9-16 issues the
warning,

Check whether this history is temporary suspended or EF/ES is
missing.

whenever it deduces that an ending event, either EF (ending
with failure) or ES (ending by suspension) may be missing.
The analyst must investigate the actual work orders or the
comments in the work order record in order to ascertain
whether a failure or a preventive renewal of the item
occurred, or whether the item is currently in operation.
Each valid history for a wheel motor must have a
Beginning event (B), an Ending event (EF for failure, or ES
for suspension (such as a preventive removal)) and
Inspection events in between.

The DataCheck report of Figure 9-16 may issue additional
comments and warnings. For example:

The first event of this history is not a B event, or
This record has the same WorkingAge as the previous record, or
This record has a larger WorkingAge than the next record, and
so on.

By methodically investigating each of the warning
messages, with the automatically synchronized records in
the Events and Inspection tables (Figure 9-16), the analyst
eventually corrects all of the logical errors and omissions
in the database. The output of the data-checking tool
points out errors of inconsistency of events and dates in the
CMMS and condition databases. The software deduces,
from the dates and working ages, the sets of data that
comprise individual histories. For each history that it finds
without an ending, it asks whether the ending event should
be designated as a suspension (ES), a temporary suspension
(TS, which is denoted by *ES in the software) or a failure
(EF). (Temporary suspension means that the item is
currently in operation).

The DataCheck function also points out anomalies that may
indicate data problems such as two inspections on the same
day, or working ages and calendar dates out of
synchronization. All of these logical errors would have
compromised the model's accuracy. Most of these types of
errors can easily be corrected by inserting the missing
Beginning and Ending events for each history.
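
The kinds of logical checks described above can be sketched in a few lines.
This is not the DataCheck function itself, only a minimal illustration of
the rules quoted in this section. It assumes that one history is given as a
list of (WorkingAge, event code) pairs in recorded order, with "*" standing
for an inspection; the function name and warning texts are hypothetical.

```python
def check_history(events):
    """Return a list of warnings for one history.

    events: list of (working_age, code) tuples in recorded order.
    Codes: 'B' beginning, 'EF' ending by failure, 'ES' ending by
    suspension, '*' inspection (assumed convention for this sketch).
    """
    warnings = []
    if not events or events[0][1] != "B":
        warnings.append("first event of this history is not a B event")
    for i in range(1, len(events)):
        prev_age, cur_age = events[i - 1][0], events[i][0]
        if cur_age == prev_age:
            warnings.append(
                f"record {i} has the same WorkingAge as the previous record")
        elif cur_age < prev_age:
            warnings.append(
                f"record {i - 1} has a larger WorkingAge than the next record")
    if not events or events[-1][1] not in ("EF", "ES"):
        warnings.append(
            "no EF/ES ending: history is temporarily suspended or ending is missing")
    return warnings


history = [(0, "B"), (500, "*"), (1000, "*"), (900, "*")]
for message in check_history(history):
    print(message)
```

A real check would also need to allow legitimate same-age records, such as
an oil sample drawn at an oil change, which this sketch flags
indiscriminately.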

Validating Inspection data
Inspection records can be examined graphically using
various combinations of covariates, dates, ranges, and
scales. For example, the cross graph of Figure 9-17 reveals
statistically unusual values of silicon forming a
horizontal line at exactly 900 parts per million (PPM).

Figure 9-17 Cross graph of Si and Working age for the entire fleet
over 9 years
Investigating the commercial laboratory that performed the
oil analyses, it was discovered that, for a period of time, the
photo-multiplier tube on the spectrometer was saturating at
exactly 900 PPM. In other words all values of silicon above
900 were truncated to 900 PPM. A similar situation
occurred for iron above 2500 PPM. If not detected, this
could play havoc with the building of the PHM
(proportional hazard model). To solve this problem we call
up the cross graph of Silicon versus Iron displayed in
Figure 9-18.

Figure 9-18: Cross graph of correlation between Si and Fe showing
data errors
Figure 9-18 reveals strong correlation between Silicon and
Iron, as well as an obvious dog leg in the graph where Si
plateaus at 900 ppm. We note too that a few points appear after
the spectrometer was repaired and that they fall on the
correlation line. It is reasonable, therefore, to correct the
values of 900 ppm by substituting the values of iron multiplied by
the slope of the correlation line, as was done in Figure 9-19.
Figure 9-19 Corrected values of silicon

In this instance, knowing the errors in the laboratory test
data, it was possible to compensate for them in the database
used to build the model. For example, the truncated values of
Si were replaced with 1.2 × Fe.
The factor of 1.2 was determined from the initial slope of
the cross graph (a correlation graph) of Fe-Si and values
obtained after the saturation defect was corrected. The
truncated Fe values were not corrected since there were too
few of them to influence the model.
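
The correction just described amounts to a simple substitution rule. The
sketch below shows one way it might be written, assuming that a silicon
reading of exactly 900 ppm marks a truncated value and using the factor of
1.2 quoted above. The function and constant names are illustrative. In
practice the rule would be restricted to the period during which the
spectrometer was known to saturate.

```python
SI_CEILING_PPM = 900.0   # saturation level reported for silicon
SI_FE_SLOPE = 1.2        # slope of the Si versus Fe correlation line


def correct_silicon(si_ppm, fe_ppm,
                    ceiling=SI_CEILING_PPM, slope=SI_FE_SLOPE):
    """Replace a truncated Si reading with slope * Fe; pass others through."""
    if si_ppm == ceiling:
        return slope * fe_ppm
    return si_ppm


readings = [(350.0, 290.0), (900.0, 1000.0), (900.0, 1400.0)]
corrected = [correct_silicon(si, fe) for si, fe in readings]
print(corrected)
```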

Determining correlation between covariates is useful both
to provide insight into the data, and in understanding the
models generated by the software. For example, if Fe and
Ni are highly correlated the software would confirm that
there is no point in including nickel in the model since it
has been determined to provide no additional information
regarding the probability of failure. Thus, if the software
concludes that nickel is insignificant, then by inspecting
the correlation graphs one could therefore understand the
reasonableness of such an indication. These correlations are
the result of wear of a metallic alloy component present in
the unit.

The effects of minor maintenance or equipment calibrations

Figure 9-20 The effect of an oil change
When building the PHM it is necessary to take account
of any minor maintenance work that is done, such as
changing the oil in the wheel motor. For example, Figure
9-20 illustrates that the actual transition path of oil
measurements was from A to B to C to D. If we did not
account for the oil change, then the software would assume
that the transition was A to B to D. This would be
misleading and would tend to overestimate the risk of
failure [95]. EXAKT properly handles minor maintenance
events that impact monitored variables.

In the EXAKT data preparation phase we set up
initialization conditions associated with certain events. The
model is told what covariate values should be associated
with those minor corrective events, such as an oil change
(OC). By the same token, events such as balancing a rotor,
or aligning a shaft should be recorded whenever they occur.
During model setup, approximate initialization vibration
levels will have been assigned to these events in the
CovariatesOnEvent table, so that the model can properly
recognize that decreases in covariate values are the result of
a minor maintenance event.
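
The idea of inserting initialization values at minor maintenance events
can be sketched as follows. This is an illustration only, not EXAKT's
internal procedure. It assumes inspections and events are held in simple
Python lists, that a dictionary plays the role of the CovariatesOnEvent
table, and that a second dictionary plays the role of the precedence column
P, with inspections given precedence 0. All names are hypothetical.

```python
def combine(inspections, events, init_values, precedence):
    """Merge inspections and events into one chronological sequence.

    inspections: list of (working_age, {covariate: value})
    events:      list of (working_age, code)
    init_values: {code: {covariate: value}}, covariate levels set by an event
    precedence:  {code: int}, order among records at the same working age
                 (inspections are implicitly 0)
    """
    rows = []
    for age, values in inspections:
        rows.append((age, 0, "*", dict(values)))
    for age, code in events:
        rows.append((age, precedence.get(code, 0), code,
                     dict(init_values.get(code, {}))))
    # Sort by working age, then by precedence for simultaneous records.
    rows.sort(key=lambda r: (r[0], r[1]))
    return [(age, code, values) for age, _, code, values in rows]


inspections = [(250, {"Fe": 12}), (500, {"Fe": 30}), (750, {"Fe": 8})]
events = [(500, "OC")]                       # oil change at 500 hours
combined = combine(inspections, events, {"OC": {"Fe": 0}}, {"OC": 1})
for row in combined:
    print(row)
```

In this small example the sample at 500 hours precedes the oil change, so
the sequence runs from 12 to 30, drops to the initialized value 0, and then
rises to 8, rather than appearing to fall directly from 30 to 8.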

Figure 9-21 shows missing or irregular oil changes and
obvious gaps due to incomplete records [96]. Oil ages of
7000-8000 hours are indicated, which is quite unlikely with the
use of mineral oils in this application. The site changed to
synthetics about two years earlier to eliminate the need for
regular oil changes. However, most histories containing
missing oil changes occurred prior to 1997. It was thought
that this information needed to be recovered from the
commercial laboratory's files. Unfortunately these files,
too, were incomplete and inconsistent with the dates and
working ages in the work order database.

Figure 9-21 Missing oil change events
Fortunately, however, these 'missing' oil changes did not
significantly affect the model since they were relatively few
in number with respect to all of the known oil changes.
That is, there were a sufficient number of known oil
changes in the database for the model to account for their
effect on the measured data.

Building the Proportional Hazards Model (PHM)
After all the obvious data errors are eliminated or corrected
by using the DataCheck function and the rich assortment of
graphical analytical tools in the EXAKT toolbox, the
proportional hazard model may be generated using the
techniques of Example 1 (page 82). Figure 9-13 on page 80
shows that the risk of failure is a function of both working
age and the significant condition data. By following the
iterative procedure learned in Example 1, which is based on
Cox [ref 1972], the insignificant covariates are
systematically eliminated, and potentially good models are
tested to see how well they represent the actual data. One of
the testing methods used by the software is known as residual
graphical analysis, illustrated in Figure 9-22.
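
As a minimal sketch of how working age and condition data combine
into a single risk figure, the following Python function evaluates a
hazard of the Weibull form commonly used for proportional hazards
models, h(t, Z) = (beta/eta) (t/eta)^(beta-1) exp(sum of gamma_i Z_i).
The parameter values and covariate names are hypothetical and are not
the fitted values of the wheel motor model.

```python
import math

def weibull_phm_hazard(age, covariates, beta, eta, gammas):
    """Hazard rate at a given working age and set of covariate readings."""
    baseline = (beta / eta) * (age / eta) ** (beta - 1)  # age effect
    score = sum(gammas[name] * value for name, value in covariates.items())
    return baseline * math.exp(score)                    # condition effect

# Hypothetical parameters, for illustration only.
params = {"beta": 2.0, "eta": 15000.0, "gammas": {"Fe": 0.002, "Sed": 0.01}}

low = weibull_phm_hazard(8000.0, {"Fe": 100.0, "Sed": 10.0}, **params)
high = weibull_phm_hazard(8000.0, {"Fe": 900.0, "Sed": 10.0}, **params)
print(f"hazard ratio at equal age: {high / low:.2f}")
```

At the same working age, the unit with the higher iron reading
carries the higher hazard. A covariate judged insignificant during
model building corresponds to dropping its gamma term.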

Figure 9-22 Graphical analysis of maximum likelihood estimation
(MLE) residuals

Each point on the residual graph of Figure 9-22 represents a
history, that is, a lifetime of a wheel motor from its
installation to its removal. The sample used to build the
model consists of many histories drawn from the entire
fleet. The graph shows an unusual point that is well above
the 95% upper limit. This leads one to investigate the
underlying data corresponding to this residual (i.e. this
particular life cycle). It was discovered (Figure 9-23) that
some unusual data were included in that history, which
appear to violate the model that we are attempting to
build.

Figure 9-23 Unusually high values of Fe and Si unexplained by a
failure event

The Fe values in the left-circled region of Figure 9-23 have
an inexplicable pattern. Fe jumps to high values (truncated
at 2500 PPM due to instrument saturation) and remains in
the same range for a few more inspections. Then the
readings fall back to low values. No events were recorded
to explain these sudden jumps.

Having no event data to support such high values of Fe and
Si, the model was regenerated and the fit tested again after
removing that history from the working data set. Statistical
and graphical goodness-of-fit testing procedures are applied
by the software as part of the modeling procedure. The
model's fit to the data improved immediately. The model
building algorithms no longer had to accommodate
obviously contradictory and misleading information.

Validating Human Decisions
The foregoing describes data-related problems that were
encountered and that were relatively easy to correct using
the statistical and graphical tools available in the software's
function arsenal.

However, a different (and more fundamental) problem
occurred regarding the definition of wheel motor failure.
These units seldom failed functionally. That is, no haul
truck needed to be taken out of service immediately while it
was hauling a load of rock or coal. Nevertheless, to
develop a CBM policy (model) we must have some
objective definition of failure. Initially, the mechanic's
remarks (on the work order) were used for this purpose. For
example,
"High iron in oil sample and high hours, removed and replaced wheel
motor."

This event was then classified as a failure. However, on
reviewing the re-builder's report attached to each invoice it
became clear that some events initially classified as
failures should be treated as suspensions, and vice versa. For
example, if the gears had been replaced because they failed
an ultrasonic test or they were obviously in a failed state
then that event should be classified as a failure. But if the
gears were replaced simply because it was expedient to do
so, or if the unit was only generally rebuilt with no real
internal damage, then that event should be considered a
suspension.
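
The classification rule just described can be written down as a small
Python sketch. The function name and the two boolean inputs are
hypothetical. They stand for facts an analyst would read off the
re-builder's report. EF and ES are the failure and suspension codes
used elsewhere in this chapter.

```python
def classify_removal(failed_ultrasonic_test, found_in_failed_state):
    """Return "EF" (failure) or "ES" (suspension) for one removal event."""
    if failed_ultrasonic_test or found_in_failed_state:
        return "EF"  # the gears really had failed
    return "ES"      # expedient replacement or general rebuild only

# Gears replaced after failing an ultrasonic test: a failure.
print(classify_removal(failed_ultrasonic_test=True, found_in_failed_state=False))
# General rebuild with no real internal damage: a suspension.
print(classify_removal(failed_ultrasonic_test=False, found_in_failed_state=False))
```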

With the definitions of suspension and of failure thus
clarified, a proportional-hazards model was found which
the software's report (Figure 9-24) showed to be a good fit,
as indicated by the statement "not rejected".

Figure 9-24: Results of the "Goodness of Fit" test
The model containing the covariates iron and sediments
was found to be good, both by graphical residual analysis
(such as that of Figure 9-22 on page 96) and by the
Kolmogorov-Smirnov statistical test (Figure 9-24) applied
automatically by the software. The results of the analysis
and the proportional hazard model are displayed in Figure
9-25.

Figure 9-25: The proportional hazard model (note 97) for a haul truck
wheel motor

Obtaining the CBM Optimal Decision Model
After finding the PHM we are ready to establish the
optimal decision model [ref Jardine et al 1997] that
incorporates economic considerations along with the risk
estimate obtained from the PHM. Using the decision model
building methodology that we learned in Example 1 (page
82), a cost ratio of 3:1 ($20K for preventive replacement
cost, $60K for failure replacement cost, based on the
invoices of past repairs by outside contractors) was blended
into the model. The resulting decision model applied to a
wheel motor is depicted in Figure 9-26. Using EXAKT's
cost comparison function (described in Part 2, in the section
"Assessing the effectiveness of a CBM Program", page 31),
the software calculates the expected savings. These are
shown in Figure 9-27 for various economic conditions,
represented by three possible cost ratios.
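
To make the role of the two costs concrete, the Python sketch below
computes an expected cost per hour for a simple age-based replacement
policy and compares it with replacing only on failure. This is a
stand-in chosen for brevity, not EXAKT's decision model, which also
uses the covariates. The $20K and $60K costs come from the text. The
Weibull shape and scale are hypothetical.

```python
import math

C_PREVENTIVE = 20_000.0     # preventive replacement cost (from the text)
C_FAILURE = 60_000.0        # failure replacement cost (from the text)
BETA, ETA = 2.5, 12_000.0   # hypothetical Weibull shape and scale (hours)

def reliability(t):
    return math.exp(-((t / ETA) ** BETA))

def mean_cycle_length(T, steps=1000):
    """Expected hours to replacement when replacing at age T or on failure:
    the integral of R(t) from 0 to T, by the trapezoidal rule."""
    h = T / steps
    area = 0.5 * (reliability(0.0) + reliability(T))
    area += sum(reliability(i * h) for i in range(1, steps))
    return area * h

def cost_rate(T):
    """Expected cost per hour of the policy 'replace at age T'."""
    r = reliability(T)
    return (C_PREVENTIVE * r + C_FAILURE * (1.0 - r)) / mean_cycle_length(T)

def failure_only_cost_rate():
    """Expected cost per hour when units are replaced only on failure."""
    mttf = ETA * math.gamma(1.0 + 1.0 / BETA)
    return C_FAILURE / mttf

candidates = range(500, 30_001, 250)       # candidate replacement ages
best_age = min(candidates, key=cost_rate)
savings = 1.0 - cost_rate(best_age) / failure_only_cost_rate()
print(f"replace at {best_age} h, saving {savings:.0%} versus failure-only")
```

The saving printed depends entirely on the assumed Weibull
parameters, so it should not be compared with the figures reported
for the wheel motors.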

Figure 9-26: Optimal CBM decision model applied to a set of oil
analysis data for a wheel motor
Application of decision model
Once the decision model was built, data from previous
histories were examined to see what the decision model
would have recommended for situations in which the wheel
motor failed. One illustration of such a history is shown in
Figure 9-26. This graph provides a recommended decision
based on inspection data (covariates and working age).

The decision "Replace immediately" was suggested by the
model (as illustrated in Figure 9-26) for the first time at the
inspection point at working age = 11384 hrs, 286 hours
(about two weeks) prior to failure (reported at 11660 hrs).
The following inspection at working age = 11653 hours, 7
hours prior to failure, also suggested the replacement of the
wheel motor. The first warning may have been sufficient,
given a sample turnaround time of 48 hours, to prevent the
consequences of failure. Even prior to 11384 hours it can
be seen from the decision graph that the measurements
indicated that a replacement recommendation was
imminent. Note that the zero points on the graph
indicate default measurement values of zero (imputed by
the software) immediately following oil changes.

The economic benefit associated with basing the
maintenance policy on the Decision Policy Graph of Figure
9-26 is revealed through an economic investigation using
EXAKT's sensitivity analysis function. Under current
economic conditions, Figure 9-27 indicates a potential
saving of between 20% and 30% compared to current practice.

Figure 9-27: Expected savings for various coal market conditions

It is to be noted that for the cost ratio of 3:1 (first section
of Figure 9-27) no operational savings were accounted for,
since at the time of this study unfavorable coal market
conditions caused the mine to operate below its capacity.
However, as market conditions improve, higher cost ratios
would be used since the capital assets of the mine would be
used at maximum capacity. Current strip ratios (total
material removed versus sellable material) would also
affect the savings associated with increased availability and
reliability of the units. The sensitivity analysis function of
EXAKT, illustrated in Figure 9-28, demonstrates the
sensitivity of the overall savings to changes or inaccuracies
in the cost ratio.

Sensitivity analysis

Figure 9-28: Sensitivity of the CBM model to economic and
geological conditions affecting the cost consequences of haul truck
failure

In real situations, the actual ratio of failure and preventive
replacement costs may not be well known. Furthermore, the
dynamics of industry are such that costs can change with
changing technology, production, and market conditions.
Therefore one would like to know to what degree the true
total cost per unit time and the optimal policy would
change with changes in cost ratio. The software enables
sensitivity analysis to be undertaken and generates the graph
and corresponding tabular data of Figure 9-28.

The curves on the graph are interpreted as follows.

Solid line: If the actual cost ratio (CR) of today differs from
that specified when the model was built, the current policy
(as dictated by the Optimal Replacement Graph of Figure
9-26 on page 100) may no longer be optimal. The line
indicates the percentage increase above the optimal
cost/unit time that will be incurred by adhering to the
current (no longer optimal) policy. For example, if the
actual cost ratio is 5 and our model was built with CR=3,
then the increase in the cost incurred by following that
(original optimal) policy is around 6% (5.98%). In other
words, the solid line represents the sensitivity of costs to
changes in CR.

Dashed line: Again, assume the actual cost ratio has strayed
from what was used when the model was built. If the model
is rebuilt using the new ratio, the dashed line tells how much
the new optimal cost would differ from that of the original
model. In other words, the dashed line represents the
sensitivity (note 98) of the optimal policy to changes in CR.

The graph indicates that moderate overestimation of the cost
ratio does not significantly affect the average long run cost
but provides a more conservative policy from the point of
view of risk of failure. In a frequently (perhaps seasonally)
changing cost situation it could be worthwhile to dynamically
rebuild the CBM optimization model each time it is applied,
using a cost ratio fed from an ERP (enterprise resource
planning) system that takes account of current market
conditions.

The cost analysis summary shown in Figure 9-27 (page
101) indicates a saving of 25%, when CR=3, over the
replace only on failure (ROOF) policy, whose costs
approximate those of the site's past policy. Decision model
results are also calculated for cost ratios of 5 and 6. As the
cost ratio increases we can observe an increase in both the
optimal policy cost and the savings. The optimal decision
models in these cases indicate that more frequent preventive
replacements (from 74% to 91%) will result from applying
the optimal decision policy in order to avoid costly
failures. (Note: There is a slight discrepancy between the
expected time between replacements for the ROOF policy,
when CR=3 and CR=5 and 6. This is due to the numerical
calculation procedure.)
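
The two sensitivity measures can be sketched in Python for a simple
age-based replacement policy. This is an assumption-laden stand-in
for EXAKT's calculation: the Weibull parameters are hypothetical, the
preventive cost is normalized to 1 so that the failure cost equals
the cost ratio, and the percentages it prints will not match those of
Figure 9-28.

```python
import math
from functools import lru_cache

BETA, ETA = 2.5, 12_000.0            # hypothetical Weibull shape and scale
AGES = tuple(range(500, 30_001, 250))  # candidate replacement ages (hours)

def reliability(t):
    return math.exp(-((t / ETA) ** BETA))

@lru_cache(maxsize=None)
def mean_cycle_length(T, steps=500):
    """Integral of R(t) from 0 to T by the trapezoidal rule."""
    h = T / steps
    area = 0.5 * (1.0 + reliability(T))
    area += sum(reliability(i * h) for i in range(1, steps))
    return area * h

def cost_rate(T, cr):
    """Cost per hour with preventive cost 1 and failure cost cr."""
    r = reliability(T)
    return (r + cr * (1.0 - r)) / mean_cycle_length(T)

def optimal_age(cr):
    return min(AGES, key=lambda T: cost_rate(T, cr))

def sensitivity(cr_built, cr_actual):
    """Return (solid, dashed) percentages for one actual cost ratio."""
    old_age = optimal_age(cr_built)      # policy built with the old ratio
    new_age = optimal_age(cr_actual)     # policy rebuilt with the new ratio
    best_now = cost_rate(new_age, cr_actual)
    # Solid line: penalty for keeping the old policy under the new ratio.
    solid = 100.0 * (cost_rate(old_age, cr_actual) / best_now - 1.0)
    # Dashed line: change in optimal cost after rebuilding the model.
    dashed = 100.0 * (best_now / cost_rate(old_age, cr_built) - 1.0)
    return solid, dashed

solid, dashed = sensitivity(cr_built=3.0, cr_actual=5.0)
print(f"solid: {solid:.2f}%  dashed: {dashed:.2f}%")
```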

The steps contained in Appendix 11 (page 189) include a
hyperlink to a database file with which the reader may
reproduce the analyses and graphs of Example 2.

Example 3 Complex Items
A simple item has a dominant failure mode, while a complex item
has several failure modes. A CBM program acquires inspection
data (e.g. oil analysis, vibration, performance data) for an entire
system, such as an engine. Thus, a single system identifier labels
inspection data records from which more than one failure mode
can be deduced. Each failure mode will have its own age-reliability
relationship and, hence, its own CBM decision model. Figure 3-10
on page 33 illustrated the conditional probability of failure
(age-reliability relationship) curves for a complex item.

In 2003, the Condition Based Maintenance Laboratory at the
University of Toronto developed a data structure and methodology
for the predictive analysis of complex systems containing multiple
components and subjected to a variety of failure modes.

This example is of a single reduction gearbox that contains two
gears, a pinion gear and a driven gear, referred to as GearOne and
GearTwo respectively. We concern ourselves, in this example, with
the failure mode "tooth fails due to root crack", which can occur
on either gear. We treat this unit, therefore, as a complex item
having two failure modes (note 99). A CBM policy must distinguish
data patterns characterizing one failure mode from another.

EXAKT uses the term "Marginal Analysis" (note 100) to designate
that a complex item is being analyzed. The extension "_MA" on the
following table names tells the software to perform marginal
analysis, which will yield individual CBM decision models
corresponding to specific components or specific failure modes:

Table                  Fields
Inspections_MA         Ident, Date, WorkingAge, Covariate1Name,
                       Covariate2Name, Comment
Events_MA              Ident, Date, WorkingAge, Event, Comment
EventsDescription_MA   EventName, P, Comment
VarDescription_MA      VariableName, MeasureUnit, WarnLimit1,
                       WarnLimit2, Comment
CovariatesOnEvent_     Event, StartingDate, EndingDate,
                       Covariate1Name, Covariate2Name, Comment

The above tables are identical (except for the _MA table name
suffix) to their counterparts in the analysis of simple items. The
following additional tables, however, are required for complex
items. Each component or failure mode in a complex item will
behave according to its own model. The supplementary
tables shown below relate a model to an Ident (unit of a fleet), an
event, and a variable (covariate):

Table          Fields
IdentToModel   ModelName, IdentName, Date
EventToModel   ModelName, InputEventName, OutputEventName, InputP,
               OutputP
VarToModel     ModelName, InputVariableName, OutputVariableName,
               VariableDataType, MeasureUnit, WarnLimit1, WarnLimit2

Notice the words "Input" and "Output" in several field
names of the last two tables. We use these fields to map database
values from the component to the model. The model itself must use
the keywords "B", "EF", and "ES". However, in the database we
may use any labels for these events. Using the EventToModel table
we tell a model which events in the database to use as beginning,
failure, and suspension events in the model. For example, we can
have two types of beginning events, B1 and B2, in the database.
They correspond to our two components, which can begin their
lives at different times. Similarly, we may have two types of
failure events, EF1 and EF2, and two types of suspensions, ES1
and ES2. We need to tell a particular model (of a particular failure
mode or component) which event records (for example, those with
the values ES1 or ES2 in the database) to use as the suspension
event for the failure mode currently being modeled or predicted.
That is, we need to tell the model for GearOne to use the events
B1, EF1, and ES1 as the beginning, failure, and suspension events
B, EF, and ES. We perform this mapping in the EventToModel
table.
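
The idea behind the EventToModel mapping can be sketched in Python as
a lookup from database labels to model keywords. The dictionary
layout and the function name are hypothetical. Only the event labels
and the keywords B, EF, and ES come from the text.

```python
# One input-to-output mapping per model, as EventToModel rows would define.
EVENT_TO_MODEL = {
    "GearOne": {"B1": "B", "EF1": "EF", "ES1": "ES"},
    "GearTwo": {"B2": "B", "EF2": "EF", "ES2": "ES"},
}

def events_for_model(model_name, event_records):
    """Keep only the events that belong to one model, renamed to B/EF/ES."""
    mapping = EVENT_TO_MODEL[model_name]
    return [(age, mapping[code]) for age, code in event_records if code in mapping]

# Hypothetical database records: (working age in hours, event label).
records = [(0.0, "B1"), (0.0, "B2"), (5200.0, "EF1"), (6100.0, "ES2")]

print(events_for_model("GearOne", records))
print(events_for_model("GearTwo", records))
```

Each model sees only its own beginning, failure, and suspension
events, whatever labels the database happens to use.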

For convenience and flexibility, we use the same mapping
technique in the VarToModel table to change the names of the
original variables in the database to other names in the particular
model. For example, we might call the fault growth parameter for
GearOne "FGP1" in the database. But in the model for GearOne,
we might like to call it simply "FGP" or just "Health_Indicator"
(since we already know that this model is for GearOne and it is
indicating health degradation). When the intelligent agent is
deployed, it will run through each model of the complex item and
provide a decision and remaining useful life corresponding to that
component or failure mode.
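
A similar Python sketch illustrates the VarToModel renaming and the
pass through each model of the complex item. The names FGP1 and
Health_Indicator come from the text. FGP2, the readings, and the
function name are assumptions made for the example, and no decision
logic is shown.

```python
VAR_TO_MODEL = {
    "GearOne": {"FGP1": "Health_Indicator"},
    "GearTwo": {"FGP2": "Health_Indicator"},  # FGP2 is a hypothetical name
}

def variables_for_model(model_name, inspection):
    """Select and rename the database variables used by one model."""
    mapping = VAR_TO_MODEL[model_name]
    return {mapping[name]: value
            for name, value in inspection.items() if name in mapping}

# One inspection record for the whole gearbox (hypothetical readings).
inspection = {"FGP1": 0.42, "FGP2": 0.07}

# Run through each model of the complex item in turn.
for model_name in VAR_TO_MODEL:
    print(model_name, variables_for_model(model_name, inspection))
```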

The table IdentToModel allows us to include selected individual
units (from a fleet) in the model. We may wish to do this when a
particular component is present in some units of the fleet but not in
others. For example, we would not model a failure mode
associated with a turbocharger on those engines that are not
equipped with one.

Let us look then at the Events_MA table (Figure 9-29) for the
Gearbox under analysis.

Figure 9-29: Events_MA table for a gearbox with two failure modes.
Note that, in Figure 9-29, there is no B1 or B2 to designate the
beginnings of the individual components. That is because, in the
particular case being modeled, when one gear fails, both are
replaced. Therefore we have decided to use the event "B" to mark
the life beginnings of either component (note 101). We have chosen
to use "EF" to designate the failure of GearOne and "EF2" for that
of GearTwo. The suffix "_nn" of the Ident (e.g. GearOne_01)
designates one component life. GearOne had 11 lives, and so did
GearTwo. GearOne failed 8 times, and GearTwo failed twice.
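
Counts of lives and failures such as these can be derived from the
Ident suffix and the event codes. The Python sketch below shows one
way to do so. The four sample lives are hypothetical and are not the
contents of Figure 9-29.

```python
from collections import defaultdict

FAILURE_EVENTS = {"EF", "EF2"}  # failure codes named in the text

def summarize_lives(event_rows):
    """Return {component: (number of lives, number of failures)}."""
    lives = defaultdict(set)
    failures = defaultdict(int)
    for ident, event in event_rows:
        # "GearOne_01" -> component "GearOne", life number "01"
        component, _, life_number = ident.rpartition("_")
        lives[component].add(life_number)
        if event in FAILURE_EVENTS:
            failures[component] += 1
    return {name: (len(lives[name]), failures[name]) for name in lives}

# Hypothetical extract of (Ident, Event) pairs.
rows = [
    ("GearOne_01", "B"), ("GearOne_01", "EF"),
    ("GearTwo_01", "B"), ("GearTwo_01", "ES"),
    ("GearOne_02", "B"), ("GearOne_02", "ES"),
    ("GearTwo_02", "B"), ("GearTwo_02", "EF2"),
]

summary = summarize_lives(rows)
print(summary)
```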

Figure 9-30 is a view of two extracts from the Inspections_MA
table.

Figure 9-30: Partial views of Inspections_MA table

Once the decision models have been built and deployed (following
the general methods presented in Example 1, page 82), a typical
optimized CBM decision at a point in time might resemble that of
Figure 9-31.

Figure 9-31: EXAKT output for two failure modes, GearOne and GearTwo

The above results for the decision models of a complex system
may be achieved by the following steps using the EXAKT
program.
1. Start, Exakt for Modeling
2. File, Open, navigate to ComplexItemsDemo_WMOD.mdb, Open
3. Modeling (on the Menu bar), Data setup, type in the
   attachment script (actually it is already keyed in for you),
   Execute, Save
4. Data Preparation, Enter General Data, Project Title:
   Complex Items Demo, CBM Model: Gear One, Description:
   Vibration Analysis for Gear One, Time Unit: Hrs., OK
5. Marginal Analysis
6. Idents Selection, GearOne_01 to GearOne_11: check,
   Events Selection, B, Select Event: B, Precedence: 5, Apply,
   EF, Select Event: EF, Precedence: 2, Apply, ES, Select
   Event: ES, Precedence: 3, Apply, Variable Selection,
   Health_Indicator, Select Variable: Health_Indicator, Apply,
   OK.
7. A. Modeling, Weibull PHM, Select Covariates, sub-model
   Name: HI, Health_Indicator, OK, X
   B. Select Covariates, sub-model Name: HI_1, Fix shape
   parameter=1: check, OK, OK, X
8. Transition Probability Model, Transition Rates, OK, Decision
   Model, Decision Model Parameters, Replacement (C): 1000,
   Failure (C+K): 6000, Cost Unit: $, Inspection Interval: 500,
   OK, X
9. Decisions, GearOne_01, shift+GearOne_11, Report, Full
   Report Icon, PgDn, PgDn, PgDn

For the second model of our complex system, please repeat
steps 4-9 with the following substitutions.
Step | Change this: | To:
4 | Model: Gear One | Model: Gear Two
4 | Description: Vibration Analysis for Gear One | Description: Vibration Analysis for Gear Two
6 | GearOne_01 to GearOne_11: check | GearTwo_01 to GearTwo_11: check
6 | EF, Select Event: EF | EF1, Select Event: EF
7* | sub-model Name: HI | sub-model Name: HI
8* | Replacement (C): 1000, Failure (C+K): 6000, Cost Unit: $, Inspection Interval: 500 | Replacement (C): 1000, Failure (C+K): 6000, Cost Unit: $, Inspection Interval: 500
9 | GearOne_01, shift+GearOne_11 | GearTwo_01, shift+GearTwo_11

Summary

Chapter 9 extended the usefulness of CBM by providing a systematic
methodology for declaring a potential failure and for determining the
remaining useful life (or P-F interval) of an item. Example 1, "Creating
an intelligent agent" (page 82), provided a tutorial to familiarize you
with the functionality of the EXAKT software for the automated
interpretation of condition data. Example 2, "Data validation" (page 88),
was a case study from a mining application showing realizable savings
from CBM optimization and providing several techniques for data
cleaning. The step-by-step tutorial that reproduces in EXAKT many of the
lessons of Example 2 is provided in Appendix 11 on page 189. Example 3,
"Complex Items" (page 103), extended the methodology to items having
multiple failure modes. Exercise 4, on data smoothing, is provided in
Appendix 11 on page 193.

References
Cox, D.R. (1972) "Regression models and life tables (with discussion)",
J. Roy. Stat. Soc. B, Vol. 34, pp. 187-220.

Jardine, A.K.S., Banjevic, D. and Makis, V. (1997) "Optimal replacement
policy and the structure of software for condition-based maintenance",
Journal of Quality in Maintenance Engineering, Vol. 3, No. 2, pp. 109-119.

Campbell, J.D. and Jardine, A.K.S. (Editors) (2001) Maintenance
Excellence: Optimizing Equipment Life-Cycle Decisions, Marcel Dekker
(Chapter 12: "Optimizing Condition Based Maintenance", by M. Wiseman).

Part 4. Reliability Centered Maintenance
Chapter 10. Pillars of RCM
Introduction

Chapter 11. Failure Modes and Effects Analysis

Question 1 Functional Analysis
The process
Example 1

Function | Functional Failure | Failure mode | Failure effects

2 To insulate passengers from shocks caused by crossing rail joints,
  bumps and to minimize transient oscillations after crossing such
  bumps.
3 To insulate passengers from jerks during acceleration and braking
4 To control the roll angle of the car body relative to the truck
5 To ensure that the carriage floor is level with the platforms when
  train stops at a station
6 To assist in stopping the train at up to 0.88 m/s^2
7 To prevent direct contact between axle box and truck frame under
  severe bounce conditions
8 To permit the truck to be lifted and/or the car to be towed easily
9 To ensure that wheel sets remain attached to truck while truck is
  being lifted
10 To insulate the car from shocks to some extent if the air bag fails
11 To limit lateral movement of car relative to truck
12 To prevent traction link retaining nut from coming undone

Example 2

Item description:
Pack delivers temperature-controlled air to conditioned-air
distribution ducts of airplane. Major assemblies are heat exchanger,
air-cycle machine, anti-ice valve, water separator, and bulkhead
check valve.

Redundancies and protective features (include instrumentation):
The three packs are completely independent. Each pack has a check
valve to prevent loss of cabin pressure in case of duct failure in
unpressurized nose-wheel compartment. Flow to each pack is modulated
by a flow-control valve which provides automatic over-temperature
protection backed up by an over-temperature trip off. Full cockpit
instrumentation for each pack includes indicators for pack flow,
turbine inlet temperature, pack-temperature valve position, and pack
discharge temperature.

Reliability data:
Built-in test equipment (describe): none

Can aircraft be dispatched with item inoperative? If so list any
limitations which must be observed:
Yes. No operating restrictions with one pack inoperative.

Hidden functions: Yes

Functions | Functional failures | Failure modes | Failure effects

1 To supply air to conditioned air distribution ducts at the
  temperature called for by pack temperature controller

2 To be capable of preventing loss of cabin pressure by backflow if
  the duct fails in unpressurized nose-wheel compartment

Example 3
Item description: Distributed control system (DCS)

Redundancies and protective features (include instrumentation):

Built-in test equipment (describe):

Operating context: Continuous process. Unionized. 500 employees. See
business plan. Biggest product: ethylene. Can also produce gasoline.
Two lines: 1. Material flow 2. Olefins. Raw material safely stored at
high pressure (6000 MPa) in underground storage caverns. It is
pipelined to production facilities. Ethylene converted to
polyethylene. There is a "hot side" and a "cold side". Raw material
undergoes cracking (breaking carbon chains) and becomes ethylene. The
plant extends over several acres (a square kilometer). The DCS
(distributed control system) is integral to the entire production
line. There are 3 different types of DCS. Recently there has been a
benzene spill. Environmental excursions occur occasionally. Installed
in 1996. Capital expenditures have been curtailed recently.
Individual heaters can be shut down for maintenance.

Hidden functions: Yes, UPS

Functions | Functional failures | Failure modes | Failure effects

To provide safe, secure, uninterrupted, redundant, cost effective,
  continuous process control and monitoring according to the target
  product of the day, within the parameters specified by product
  specification and by current environmental regulations, in the
  presence of a UPS (uninterruptible power supply)
To alarm on abnormal conditions in the process in real time
To allow manual intervention
To interface with other control systems
To graphically present the process to the operators
To exchange data with other control systems
To capture historical data
To provide the means to alter control logic
To backup/restore configuration data
To execute batch recipes within the continuous process, for example
  cleaning cycles
To provide safe shutdown in the event of a hardware failure.
To alert the operator, in real time, when some part of the DCS
  hardware or a field device fails.
To be immune from physical, electromagnetic, electronic,
  environmental intrusion
To be ergonomic
To conform to NEMA standards
Example 4
1 To provide a renewable surface that protects the carcass of the
  tire so that it can be retreaded

Question 2 Failure analysis
The process
Example 1
Ctrl. No. | Function Statements (Quantitative Performance Requirements) | Failed States (Ways Performance is Lost) | Failure Causes
1 | To provide smooth rolling support for half the weight of a passenger car (up to 26.5 tons) on the rails at speeds up to 120 kph | Fails to provide support |
5 | | Unable to support the car on the rails at 120 kph |
16 | | Fails to provide rolling support |
21 | | Fails to provide a smooth ride |

Example 2

Functions | Functional failures
1 To supply air to conditioned air distribution ducts at the temperature called for by pack temperature controller | A Conditioned air is not supplied at called-for temperature
2 To be capable of preventing loss of cabin pressure by backflow if the duct fails in unpressurized nose-wheel compartment | A No protection against backflow

Example 3
Functions | Functional failures | Failure modes

Function: To provide safe, secure, uninterrupted, redundant, cost
effective, continuous process control and monitoring according to the
target product of the day, within the parameters specified by product
specification and by current environmental regulations, in the
presence of a UPS (uninterruptible power supply)

Functional failure: Fails to provide security
    Unauthorized usage of console either when unattended or if
    password stolen

Functional failure: Unable to log in
    Password forgotten

Functional failure: Unable to protect against loss of control
    UPS has failed

Functional failure: Control lost
    Complete loss of communication with ring
    Complete loss of communication with controller node
    All consoles fail
    Complete loss of communication on module bus
    Complete loss of communication on slave bus
    Console LAN fails

Functional failure: Redundancy lost
    Console hardware or software fails
    Controller hardware or software fails

Question 3 Failure modes analysis
The process
Why? | Why? | Why? | Why? | Why? | Why? | Why? | Why?
(Each level of indentation answers one further "Why?")

Ventilation system fails
    Fan fails
        Motor fails
            Motor trips
                Airways clogged with dirt
                    Inadequate design
                Defective sensor
            Bearing seized
                Lubricant allowed to run dry
                Wrong lubricant
                    Improperly labeled
                        Stores error
                    Label misread
                        Inattention
                        Insufficient training
        Power drive fails
            Belts failed
                Incorrectly installed
                    Insufficient training
                        Employee turnover
                            Poor working conditions
                    Missing documentation
                        Inadequate document control
                    Inadequate tools
                Incorrectly specified
    Distribution system fails
        Duct fails
            Duct clogged
            Duct pierced
        Damper failed

Example 1
Function 1: To provide smooth rolling support for half the weight of a
passenger car (up to 26.5 tons) on the rails at speeds up to 120 kph

Functional failure: Fails to provide support
Failure modes:
    Weld in frame fails due to fatigue
    Wheel collapses due to fatigue
    Axle fails due to fatigue
    Truck frame component fails due to fatigue

Functional failure: Unable to support the car on the rails at 120 kph
Failure modes:
    Differential wear of steel treads on the same axle
    Spalling on wheel tread
    Wheel flange shears off
    Chevron rubber shears
    Tie bar rod axle rod slackens off
    Chevron rubber settles
    Chevron rubber elastically yields
    Traction link bolt comes adrift
    Traction link falls off due to fatigue

Functional failure: Fails to provide rolling support
Failure modes:
    Bearing collapses due to fatigue failure of cage, rollers, spacer
    or inner or outer race
    excessive clearing in housing
    bumpy rails
    Bearing fails due to under lubrication
    Plug falls out of axle box cover
    Bearing fails due to over lubrication
    Moisture in lubricant causes bearing to fail

Functional failure: Fails to provide a smooth ride
Failure modes:
    Flats worn on wheel tread
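
The hierarchy of function, functional failure, and failure mode used
in these worksheets can be held in a small data structure. The Python
sketch below is one possible layout, with hypothetical class and
function names, populated with two of the functional failures listed
above.

```python
from dataclasses import dataclass, field

@dataclass
class FunctionalFailure:
    description: str
    failure_modes: list = field(default_factory=list)

@dataclass
class Function:
    number: int
    statement: str
    functional_failures: list = field(default_factory=list)

truck = Function(
    1,
    "To provide smooth rolling support for half the weight of a "
    "passenger car (up to 26.5 tons) on the rails at speeds up to 120 kph",
    [
        FunctionalFailure("Fails to provide support", [
            "Weld in frame fails due to fatigue",
            "Wheel collapses due to fatigue",
            "Axle fails due to fatigue",
            "Truck frame component fails due to fatigue",
        ]),
        FunctionalFailure("Fails to provide a smooth ride", [
            "Flats worn on wheel tread",
        ]),
    ],
)

def worksheet_rows(function):
    """Flatten the hierarchy into one worksheet row per failure mode."""
    for failure in function.functional_failures:
        for mode in failure.failure_modes:
            yield (function.number, failure.description, mode)

rows = list(worksheet_rows(truck))
for row in rows:
    print(row)
```

Each row is then ready to receive the failure effects recorded in
answer to Question 4.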
Example 2
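Function 1: To supply air to conditioned air distribution ducts at the
temperature called for by pack temperature controller
Functional failure A: Conditioned air is not supplied at called-for
temperature
Failure modes:
    air-cycle machine seized
    ram-air passages in heat exchanger blocked
    anti-ice valve fails
    water separator fails

Function 2: To prevent loss of cabin pressure by backflow if the duct
fails in unpressurized nose-wheel compartment
Functional failure A: No protection against backflow
Failure modes:
    bulkhead check valve fails

Example 3
Functions | Functional failures | Failure modes

Function: To provide safe, secure, uninterrupted, redundant, cost
effective, continuous process control and monitoring according to the
target product of the day, within the parameters specified by product
specification and by current environmental regulations, in the
presence of a UPS (uninterruptible power supply)

Functional failure: Fails to provide security
    Unauthorized usage of console either when unattended or if
    password stolen

Functional failure: Unable to log in

Functional failure: Unable to protect against loss of control

Functional failure: Control lost

Functional failure: Redundancy lost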

Question 4 Effects analysis
The process

Example 1
Function Statement Failure Failure
mode
Effects
1 To provide smooth
the weight of a passenger
car (up to 26.5 tons) on
the rails at speeds up to
120 kph
Fails to provide
support
Weld in
frame fails
due to
fatigue
The truck as a whole collapses. This is most likely
to occur when the car is most heavily loaded - in
other words when it is full of passengers, and
probably while the train is going round a corner. As
a result, it would almost certainly be derailed. At
present, the truck is replaced when a crack longer
than 100 mm is found. (Such a crack would be
found during course of other inspections that occur
often enough to detect it). Downtime to replace
truck on its own 16 hours.

mode
Effects
1 To provide smooth
120 kph
Fails to provide
support
Wheel
collapses due
to fatigue
a result, it would almost certainly be derailed. Only
one cracked wheel has been found to date. It takes 8
hours to replace a wheel

Page 62
1 To provide smooth
120 kph
Fails to provide
support
Axle fails
due to
fatigue
a result, it would almost certainly be derailed. No
axles have failed so far.
1 To provide smooth
120 kph
Fails to provide
support
Truck frame
component
fails due to
fatigue
Initial cracking is likely to lead to frame distortion,
which could make the truck unstable enough to
derail the train. As before, this is most likely to
happen when heavily loaded - in other words, when
it is full of passengers, and probably while the train
is going round a corner. So far, the only frame
component which has shown signs of failing has
been the transom, which cracked and has since been
reinforced with a steel plate. Downtime to replace a
truck is 16 hours.
1 To provide smooth
120 kph
Unable to
support the car
on the rails at
120 kph
Differential
wear of steel
treads on the
same axle
If the difference between wheel diameters is greater
than 2 mm, the possibility of derailment at speeds
near 120 kph increases. Downtime to re-profile a
pair of wheels is 3 hours.
1 To provide smooth
120 kph
Unable to
support the car
on the rails at
120 kph
Spalling on
wheel tread
This could lead to differential wear. If the
difference between wheel diameters is greater than
2 mm, the possibility of derailment at speeds near
120 kph increases. Downtime to re-profile a pair of
wheels is 3 hours.
1 To provide smooth
120 kph
Unable to
support the car
on the rails at
120 kph
Wheel flange
shears off
This failure is only likely to a flange which has
been weakened by excessive wear. It is most likely
to happen on a heavily loaded train going round a
corner at high speed, which would almost certainly
lead to a derailment. Downtime to replace a set of
wheels 3 hours.
1 To provide smooth
120 kph
Unable to
support the car
on the rails at
120 kph
Chevron
rubber shears
Truck frame rests directly on the axle box bump
stop. Wheel loading is unevenly distributed and
wheels are prevented from moving off-axis during
curving - both of these conditions may cause
derailment under adverse conditions of load and
speed. Downtime to replace the chevron rubber
about 16 hours. (The clearance between the bump
stop and the truck frame should be 30 +1-0 mm)
1 To provide smooth
120 kph
Unable to
support the car
on the rails at
120 kph
Tie bar rod
axle rod
slackens off
Wheel arch could distort and chevron rubber could
shear. Truck frame rests directly on the axle box
bump stop. Wheel loading is unevenly
distributed and wheels are prevented from moving
off-axis during curving - both of these conditions
may cause derailment under adverse conditions of
load and speed. Time to tighten axle rod nut in
Depot 15 minutes.
1 To provide smooth
120 kph
Unable to
support the car
on the rails at
120 kph
Chevron
rubber settles
Settling could cause excessive contact between
vertical bump stop and wheel arch. This would
restrict wheel set movement during curving, and
could cause derailment under severely adverse
conditions of load and speed. Clearance should be
30 +1/-0 mm. Time to replace chevron rubber 4
hours. See also function 2.
1 To provide smooth
120 kph
Unable to
support the car
on the rails at
120 kph
Chevron
rubber
elastically
yields
Settling could cause excessive contact between
vertical bump stop and wheel arch. This would
restrict wheel set movement during curving, and
could cause derailment under severely adverse
conditions of load and speed. Clearance should be
30 +1/-0 mm. Time to replace chevron rubber 4
hours. See also function 2.
1 To provide smooth
120 kph
Unable to
support the car
on the rails at
120 kph
Traction link
bolt comes
adrift
The traction link falls off at one end, so the traction
center is connected to the truck by only one link.
Asymmetric load on the remaining link damages
the bushes, interfering with ride comfort and
possibly twisting the link mounting plates. This in
turn causes the second traction link to shear off,
which would mean that the truck is only connected
to the car by the air bags. A twisted mounting could
also restrict truck movement during curving, which
may lead to derailment under adverse conditions of
load and speed. One end of the traction link could
also hit the ground in such a way that the truck
frame or traction center has to vault over it, causing
a spectacularly nasty derailment. Time to replace a
traction link bolt two hours. (Note that the nuts on
the traction link bolts are held in place by split pins,
which means that this failure should not occur if the
split pin is in place. See also function 11.)
1 To provide smooth
120 kph
Unable to
support the car
on the rails at
120 kph
Traction link
falls off due
to fatigue
The traction link falls off at one end, so the traction
center is connected to the truck by only one link.
Asymmetric load on the remaining link damages
the bushes, interfering with ride comfort and
possibly twisting the link mounting plates. This in
turn causes the second traction link to shear off,
which would mean that the truck is only connected
to the car by the air bags. A twisted mounting could
also restrict truck movement during curving, which
may lead to derailment under adverse conditions of
load and speed. One end of the traction link could
also hit the ground in such a way that the truck
frame or traction center has to vault over it, causing
a spectacularly nasty derailment. Time to replace a
traction link five hours.
1 To provide smooth
120 kph
Fails to provide
rolling support
Bearing
collapses due
to fatigue
failure of
cage, rollers,
spacer or
inner or
outer race
Collapsed bearing causes a "hot box", and train
must stop at the next station to evacuate passengers
which causes a traffic delay of 20-60 minutes. It is
also possible that a failed bearing could cause a
derailment. The hot box melts the chevron causing
it to emit smoke. The chevron also collapses,
damaging the tie-bar and axle. Time to replace a
wheel set complete with bearing and axle box 8
hours.

1 To provide smooth
120 kph
Fails to provide
rolling support
Bearing
collapses due
to excessive
clearance in
housing
If the axle box liner bore exceeds the bearing outer
race external diameter by more than 0.6 mm,
relative movement between the liner and outer race
causes excessive vibration and collapse of the
bearing. This causes a hot box, and train must stop
at the next station to evacuate passengers which
causes a traffic delay of 20-60 minutes. It is also
possible that a failed bearing could cause a
hours.
1 To provide smooth
120 kph
Fails to provide
rolling support
Bearing
collapses due
to bumpy
rails
Excessive interaction between railhead and wheel
sets applies shock loads to bearings, leading to
either fracture of bearing components or accelerated
fatigue failure. This causes a hot box, and train
must stop at the next station to evacuate passengers
which causes a traffic delay of 20-60 minutes. It is
also possible that a failed bearing could cause a
hours. Rails to be analyzed separately.
1 To provide smooth
120 kph
Fails to provide
rolling support
Bearing fails
due to under
lubrication
Seized bearing causes a hot box, and train must stop
at the next station to evacuate passengers which
causes a traffic delay of 20-60 minutes. It is also
possible that a failed bearing could cause a
damaging the tie-bar and axle. Time to grease an
axle box 30 mins.
1 To provide smooth
120 kph
Fails to provide
rolling support
Plug falls out
of axle box
cover
Lubricant drains out, causing bearing to seize
resulting in a hot box. Train must stop at the next
station to evacuate passengers which causes a
traffic delay of 20-60 minutes. It is also possible
that a failed bearing could cause a derailment. The
hot box melts the chevron causing it to emit smoke.
The chevron also collapses, damaging the tie-bar
and axle. Wheel set would be replaced if plug was
found to be missing. Time required to do so 8
hours.
1 To provide smooth
120 kph
Fails to provide
rolling support
Bearing fails
due to over
lubrication
Over-lubrication leads to excessive churning and
eventual breakdown of lubricant, causing bearing to
seize resulting in a hot box. Train must stop at the
next station to evacuate passengers which causes a
and axle. It is felt that this failure is unlikely to
occur because the amount of lubricant is controlled.

1 To provide smooth
120 kph
Fails to provide
rolling support
Moisture in
lubricant
causes
bearing to
fail
Moisture in lubricant reduces its lubricating
effectiveness and may also cause the bearing to
corrode, in both cases leading to bearing failure
resulting in a hot box. Train must stop at the next
station to evacuate passengers which causes a
and axle. Time to replace wheel set is 8 hours.
1 To provide smooth
120 kph
Fails to provide
a smooth ride
Flats worn
on wheel
tread
A wheel flat longer than 40 mm is likely to affect
ride comfort. It will also damage the railhead. The
noise and vibration caused by a flat wheel tread are
usually detected quickly by Operations. Time to
re-profile a wheel set on the under floor lathe is 3
hours.
2 To insulate passengers
from shocks caused by
crossing rail joints,
bumps and to minimize
transient oscillations
after crossing such
bumps.
Fails to insulate
passengers
adequately
Air bag leaks
via top plate
of car bolster
faster than it
can be
pumped in
Air bag deflates, so forces are transmitted between
truck and car through the layer and emergency
springs only. This causes a sharper ride, but train
does not have to be withdrawn from service
immediately. Time to replace air bag 8 hours. See
also function 5.
after crossing such
bumps.
Fails to insulate
passengers
adequately
Steel wire
inside airbag
fails
Air bag fabric cannot contain the air pressure on its
own, so bag bursts causing forces to be transmitted
through layer and emergency springs only. This
causes a sharper ride, but train does not have to be
withdrawn from service immediately. Time to
replace air bag 8 hours. See also 44 and 45.
after crossing such
bumps.
Fails to insulate
passengers
adequately
Chevron
spring rubber
settles
Reduced clearance causes more frequent contact
between vertical bump stop and wheel arch over
bumps. This reduces ride quality and increases
stresses on all truck components. See also 10 above.
Time to replace chevron 8 hours.
after crossing such
bumps.
Fails to insulate
passengers
adequately
Chevron
elastically
yields
Reduced clearance causes more frequent contact
between vertical bump stop and wheel arch over
bumps. This reduces ride quality and increases
stresses on all truck components. See also 11 above.
Time to replace chevron 8 hours.
after crossing such
bumps.
Fails to insulate
passengers
adequately
Damper non-
return valve
fails in
closed
position
Damper "seizes" and transmits shocks directly from
truck frame to underside of car (in the case of the
vertical damper) or to traction center (in the case of
the horizontal damper). This reduces ride quality
and increases stresses on all truck components.
Time to replace a defective damper in Depot 1
hour.
Fails to insulate
passengers
adequately
Damper oil
viscosity
increased by
dirt or
oxidation
Damper becomes steadily stiffer until it eventually
seizes altogether, transmitting shocks directly from

after crossing such
bumps.
hour.
after crossing such
bumps.
Fails to insulate
passengers
adequately
Excessive
metal-to-
metal contact
between
damper
piston and
cylinder
Damper becomes steadily stiffer until it eventually
seizes altogether, transmitting shocks directly from
hour.
after crossing such
bumps.
Fails to insulate
passengers
adequately
Layer spring
stiffness
decreases
Serious loss of stiffness means that secondary
suspension is provided by the air bag only. This
reduces ride comfort and increases shock loads
especially on the air bag itself. Time to replace
layer spring at Depot 8 hours. See also 45.
after crossing such
bumps.
Fails to insulate
passengers
adequately
Air bag,
layer spring,
and
emergency
spring all fail
Car has no secondary suspension at all, so all
shocks which pass through the primary suspension
are transmitted directly to the car. Ride becomes
very rough and stresses on local truck components
are severely increased. Replacement of the three
suspension components takes 8 hours at the Depot.
after crossing such
bumps.
Fails to
minimize
oscillations
Oil leaks out
of damper
seals
(vertical or
horizontal
damper)
In the case of the vertical damper, full damping
capability would have to be provided by the damper
opposite, which might not be able to cope and
hence which might also fail rapidly itself. Even if
the opposite damper did not fail, damping
efficiency is impaired so oscillations are not
effectively damped, which could cause discomfort
on longer journeys. There is only one horizontal
damper, so the effect of loss of this damper is
immediate. Under damping also increases cyclic
stresses on other suspension components, especially
the torsion bar, which could shorten the life of these
components. Time to replace a defective damper in
Depot 1 hour.
after crossing such
bumps.
Fails to
minimize
oscillations
Damper non
return valve
fails in open
position
In the case of the vertical damper, full damping
capability would have to be provided by the damper
opposite, which might not be able to cope and
hence which might also fail rapidly itself. Even if
the opposite damper did not fail, damping
efficiency is impaired so oscillations are not
effectively damped, which could cause discomfort
on longer journeys. There is only one horizontal
damper, so the effect of loss of this damper is
immediate. Under damping also increases cyclic
stresses on other suspension components, especially
the torsion bar, which could shorten the life of these
components. Time to replace a defective damper in
Depot 1 hour.

after crossing such
bumps.
Fails to
minimize
oscillations
Damper
mounting
bolts become
detached
Dampers come adrift and oscillations are not
effectively damped, which causes discomfort and
may induce motion sickness on longer journeys.
Horizontal damper could be dragged along a rail. It
may also drop off in front of a wheel, possibly
leading to derailment. Time to replace a defective
damper in Depot 1 hour.
from jerks during
acceleration and braking
Fails to insulate
passengers
from jerky
stops and starts
Compound
spring
retaining nut
fails, leading
to
dislocation
of the
compound
spring
The car body is still supported by the secondary
suspension, but the center pivot crashes back and
forth against the traction center when starting and
stopping. This causes a jerky ride and considerably
increases shock loads on the truck and local car
components (especially the center pivot, traction
center and air bags). A dislocated spring could also
prevent the truck from curving correctly, which
may lead to a derailment under adverse conditions
of load and speed. Time to rectify this defect 2
hours at the Depot. (Note that the retaining nut is
held in place by the split pin, so this failure would
not occur if the split pin is in place)
from jerks during
Fails to insulate
passengers
from jerky
stops and starts
Compound
spring rubber
deteriorates
The car body is still supported by the secondary
suspension, but the center pivot crashes back and
forth against the traction center when starting and
stopping. This causes a jerky ride and considerably
increases shock loads on the truck and local car
components (especially the center pivot, traction
center and air bags). A dislocated spring could also
prevent the truck from curving correctly, which
may lead to a derailment under adverse conditions
of load and speed. Time to rectify this defect 2
hours at the Depot.
from jerks during
Fails to insulate
passengers
from jerky
stops and starts
Traction link
rubber bush
fails
Starting and stopping forces are damped only by the
compound spring, which leads to a jerky ride and a
general increase in shock loads. Time to replace
bush 2 hours.
4 To control the roll
angle of the car body
relative to the truck
Fails to control
the roll angle of
the car body at
all
Torsion bar
shears
If the torsion bar shears, one end of the car body
lurches from side to side during cornering. This
could disturb and possibly frighten passengers. The
car also becomes highly unstable and the resulting
loss of balance could lead to derailment, especially
if a heavily loaded car was going at high speed
round a corner. Time to replace the torsion bar in
Depot 4 hours.
Fails to control
the roll angle of
the car body at
all
Torsion bar
retaining key
fails
The torsion bar would rotate by itself and cause
noise and vibration. However, the torsion bar would
not be sheared, so derailment is unlikely to occur.
Time to replace the torsion bar in Depot 4 hours.
Fails to control
the roll angle of
the car body at
all
Torsion bar
turnbuckle
fastening
comes
undone
Torsion bar has nothing to act against, causing one
end of the car to lurch from side to side during
cornering, disturbing and possibly frightening
passengers. The car also becomes highly unstable
and the resulting loss of balance could lead to
derailment, especially if a heavily loaded car was
going at high speed round a corner. Time to
reconnect the turnbuckle in Depot 4 hours.

Fails to control
the roll angle of
the car body at
all
Torsion bar
bearing worn
due to lack
of
lubrication
Excessive clearance means that the torsion bar rests
directly on the edge of the bearing housing. The
resulting point load on the torsion bar greatly
increases the chances of the bar shearing, causing
instability and a possible derailment. Time to
replace this bearing at Depot 4 hours.
5 To ensure that the
carriage floor is level
with the platforms when
Unable to
ensure that
carriage floor is
level with the
platform
Air bag leaks
via top plate
of car bolster
faster than it
can be
pumped in
If the step is not level with the platform, a
passenger could trip and fall. Time to replace air
bag at Depot 8 hours. See also 22 above.
Unable to
ensure that
carriage floor is
level with the
platform
Air bag
bursts
If the step is not level with the platform, a
passenger could trip and fall. Time to replace air
bag at Depot 8 hours.
Unable to
ensure that
carriage floor is
level with the
platform
Leveling
valve
turnbuckle
loose
Air bag cannot be charged efficiently so carriage
floor cannot be aligned with platform before
passengers start moving on and off the train. This
means that a passenger could trip and fall. This
failure occurred quite often in the past, but the
locknut and spring washer were replaced by a nylon
washer, and it has not happened for a year.
Unable to
ensure that
carriage floor is
level with the
platform
Layer spring
stiffness
decreases
Car body sags, which can be compensated for
initially by adding adjustment shims. Serious loss
of stiffness means that shims can no longer
compensate. Time to replace layer spring at depot 8
hours.
6 To assist in stopping
the train at up to 0.88
m/s2
Completely
unable to assist
in stopping the
train
Brake pad
worn more
than 10 mm
One worn pad is unlikely to affect the stopping
performance of the whole train, but a number of
worn pads could do so. Pads are usually replaced
when wear exceeds 7 mm and it takes 20 minutes to
repair a pad in the Depot.
m/s2
Completely
unable to assist
in stopping the
train
Brake disk
wear exceeds
2.5 mm
One worn disk would not have a significant
effect on the stopping performance of the whole
train, but several worn disks would do so. Disks are
re-profiled on the under floor wheel lathe when
wear exceeds 2 mm. This takes 2 hours.
m/s2
Completely
unable to assist
in stopping the
train
Brake pad
falls off
Brake pad holder scratches the disk, so the disk has
to be re-profiled (2 hours) and brake pad replaced
(20 minutes). One worn disc would not have a
significant effect on the braking performance but
several worn discs would do so.
7 To prevent direct
contact between axle box
and truck frame under
severe bounce conditions
Unable to
prevent contact
between axle
box and truck
under severe
bounce
conditions
Vertical
bump stop
missing
The axle box could hammer against the truck frame
when passing over bumps, leading to deformation
of the axle box and possible accelerated failure of
the axle bearings. Time to replace the bump stop in
Depot up to 8 hours.
8 To permit the truck to
be lifted and/or the car to
be towed easily
Truck cannot
be lifted or car
towed easily
Lifting point
fails due to
wear or
corrosion
This failure could occur while the truck is
suspended in mid-air, which means that it could fall
onto somebody. Time to repair eye by welding 3
hours.

be towed easily
Truck cannot
be lifted or car
towed easily
Lifting point
damaged by
external
force
Eye could be weakened or the truck could be
improperly secured for lifting, causing a suspended
truck to fall, possibly onto somebody. Time to fit
new eye 3 hours.
be towed easily
Truck cannot
be lifted or car
towed easily
Lifting point
sheared off
by external
force
Truck could not be lifted at all using the eye, so
alternative arrangements would have to be made.
9 To ensure that wheel
sets remain attached to
truck while truck is
being lifted
Wheel set falls
off truck while
truck is being
lifted
Tie bar
fractures
Wheel set could drop onto somebody while the
truck is suspended in mid-air. Time to replace the
tie bar up to 8 hours in the Depot.
10 To insulate the car
from shocks to some
extent if the air bag fails
Incapable of
insulating the
car if the air
bag fails
Emergency
spring fails
This failure on its own has no effect. If the air bag
and the emergency spring both fail, secondary
suspension has to be provided by the layer spring
on its own. 30 above explains what happens if air
bag, layer spring and emergency spring all fail.
Time to replace the emergency spring at Depot 8
hours.
11 To limit lateral
movement of car relative
to truck
Unable to limit
lateral
movement of
car relative to
truck
Lateral bump
stop rubbers
worn away
Under extreme conditions of lateral load, car bolster
stool could hit truck frame, reducing ride comfort
and generally increasing shock loads. Time to
replace lateral bump stop rubber at Depot 8 hours.
11 To limit lateral
movement of car relative
to truck
Unable to limit
lateral
movement of
car relative to
truck
Lateral bump
stop falls off
Under extreme conditions of lateral load, car bolster
stool could hit truck frame, reducing ride comfort
and generally increasing shock loads. Time to
replace lateral bump stop rubber at Depot 8 hours.
12 To prevent traction
link retaining nut from
coming undone
Unable to
prevent traction
link retaining
nut from falling
off bolt
Split pin falls
out
This failure only matters if the retaining nut starts
coming loose. If the retaining bolt falls out, effects
are described in 12 above. Time to replace split pin
at Depot 1 hour.
13 To prevent compound
spring retaining nut from
coming undone
Unable to
prevent the
compound
spring retaining
nut from falling
off
Split pin falls
out
This failure only matters if the retaining nut starts
coming loose. If the retaining nut falls off, the
compound spring would fall off. Large clearance
between the center pivot and the center plate would
cause fierce vibrations in the car compartment and
further damage to the bolster stool. Time to replace
split pin in Depot 1 hour.

Example 2
1 To supply air to
conditioned air
pack temperature
controller
temperature
1 air-cycle machine
seized
Reduced pack flow,
anomalous readings on
pack-flow indicator and
other instruments
2 blocked ram-air
passages in heat
exchanger
High turbine-inlet
temperature and partial
closure of flow-control
valve by over-temperature
protection, with resulting
reduction in pack airflow
3 failure of anti-ice
valve
If valve fails in open
position, increase
in pack discharge
temperature; if valve
fails in closed position,
reduced pack airflow
4 failure of water
separator
Condensation (water
drops, fog, or ice
crystals) in cabin
2 To be able to prevent
loss of cabin pressure by
backflow if the duct
fails in unpressurized
nose-wheel compartment
backflow
1 failure of bulkhead
check valve
None (hidden function);
if duct and or connectors
fail in pack bay, loss of
cabin pressure by
backflow, and airplane
must descend to lower
altitude
failures
effective, continuous process
control and monitoring
according to the target
product of the day, within the
parameters specified by
product specification and by
current environmental
regulations, in the presence of
a UPS (uninterruptible power
supply)
Fails to
provide
security
Unauthorized
usage of
console either
when
unattended or
if password
stolen
An unauthorized and untrained person
gains access to an operating console or an
engineering console. This may lead to a
condition where loss of life or
environmental disaster can occur. In this
eventuality legal or civil proceedings
will likely be brought against the
Company.
Unable to
log in
Password
forgotten
Operator is unable to control the plant.
Operator would look for another console
which has a log in. In a worst case
scenario all consoles would be locked
out and emergency shutdown would be
initiated if the operator suspects
abnormal operation at that particular
time.
Unable to
protect
against loss
of control
UPS has failed
Under normal conditions this failure
would be noticed by the operator who
checks the alarms in the normal
execution of his daily tasks.
Control lost
Complete loss
of
communication
with ring
Unreliable or no data shown on console.
Operator loses ability to control the
plant. Emergency shutdown initiated.
The most common cause of this failure
in the past has been contractors
inadvertently cutting cables. This is
likely to take at least 2 hours, and up to one day,
to fix, entailing a loss of production. This
failure mode is considered to be a rare
event.
Complete loss
of
communication
with controller
node
One node goes off line. This could be
preceded by any of dirt fouling of fan,
moisture penetration, RF interference,
electronic component failure. Partial or
complete shutdown depending on
importance of node. Unreliable or no
data shown on console. Operator loses
ability to control the plant. Emergency
shutdown initiated. The most common
cause of this failure in the past has been
contractors inadvertently cutting cables.
This is likely to take at least 2 hours, and up to
one day, to fix, entailing a loss of
production. This has happened
occasionally in the past.

Chapter 12. Decision Algorithm
Questions 5, 6, and 7
The process
4. Nonoperational maintenance (M) consequences, which
involve only the direct cost of repair

Example 1 shows several of the records from the full analysis of
the rail passenger car truck. In the column H S O M we decide,
from the effects description, whether the consequences are hidden,
safety or environmental, operational (production), or maintenance
(nonoperational). We test each of the four possible consequences
in this order, and we stop as soon as we ascertain that the
circumstances (effects) of the failure mode provoke the
consequence being tested.
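
The ordered test described above can be sketched in a few lines of Python. This is only an illustrative sketch: the `Effects` record and its three yes/no fields are hypothetical names introduced here to summarize an effects description, and they are not part of the RCM worksheet itself.

```python
from dataclasses import dataclass

# Order in which the consequence categories are tested (taken from the text).
CATEGORY_ORDER = ("H", "S", "O", "M")


@dataclass
class Effects:
    """Hypothetical yes/no summary of one failure mode's effects description."""
    hidden: bool = False
    safety_or_environment: bool = False
    operational: bool = False


def classify_consequence(effects: Effects) -> str:
    """Return the first category (H, S, O, M) provoked by the effects."""
    tests = {
        "H": effects.hidden,
        "S": effects.safety_or_environment,
        "O": effects.operational,
        "M": True,  # if nothing else applies, only the direct cost of repair remains
    }
    for category in CATEGORY_ORDER:
        if tests[category]:
            return category  # stop at the first consequence that applies
    raise AssertionError("unreachable: M always applies")


# A derailment is evident to the crew and affects safety, so the test stops at S
# even though the failure would also disrupt operations.
weld_failure = Effects(safety_or_environment=True, operational=True)
print(classify_consequence(weld_failure))  # S
```

Because the categories are tested in a fixed order, a failure that is both a safety and an operational concern is recorded once, under the more serious heading.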

Example 1

Function Statements | Failed States | Failure Causes | Local and Global Effects from the Failure Cause | H S O M | C C C C | T T T T | D 2 N N | M M M M | Maintenance Tasks | Interval | By

1 2 3 4

To provide smooth
rolling support for
half the weight of a
passenger car (up
to 26.5 tons) on the
rails at speeds up to
120 kph
Fails to provide
support
Weld in
frame
fails due
to fatigue
The truck as a whole collapses. This is most
likely to occur when the car is most heavily
loaded - in other words when it is full of
passengers, and probably while the train is
going round a corner. As a result, it would
almost certainly be derailed. At present, the
truck is replaced when a crack longer than
100 mm is found. (Such a crack would be
found during course of other inspections
that occur often enough to detect it).
Downtime to replace truck on its own 16
hours.
S C Inspect frame for
cracks greater than
100 mm
To be included
with other
scheduled
tasks


The RCM decision algorithm is represented by the matrix of
Figure 12-1, which also appears in the heading of the decision half
of the RCM worksheet.

H C T D R
S C T 2 R
O C T N R
M C T N R
Figure 12-1 RCM Decision Diagram. Redesign, R, is mandatory in rows
H and S if no proactive task reduces the consequences of failure to a
tolerable level.
We execute the RCM decision logic by beginning at the top left
of Figure 12-1. We descend the first column until we reach the
consequence that applies, and then work to the right along that
row. The letter in each cell of the matrix represents a
question (step) in the RCM decision algorithm. The full text of the
questions (below) should be explicitly recited as the decision
diagram is being traversed. Avoid the tendency to abbreviate the
questions so much that their meaning is lost or distorted.
Full text of decision diagram questions
H. Is the function's failed state hidden? That is, will the
failure go unnoticed until another function fails or some
extraordinary event occurs?
S. Does the failure affect safety, health, or the
environment?
O. Can the failure provoke operational (production)
consequences? These include cost, quality, and customer
service.
M. Are the only consequences those that affect
maintenance or the maintenance budget?
C. Is a condition based maintenance (CBM) task
applicable? Can it reliably detect the 'failing' state early
enough to reduce the multiple failure's probability and/or
its consequences to a tolerable level? Is it effective? Does it
make economic sense to perform this task at the frequency
required?
T. Is a time based maintenance task applicable? Is there an
age (useful life) at which the probability of failure due to
this failure mode increases rapidly, and do most items
survive to this age? Effective: Can a routine (TBM) task
reduce the multiple failure's probability and/or its
consequences to a tolerable level? Two types of time based
tasks are considered under this heading: 1) Scheduled
Overhaul, and 2) Scheduled Discard, the latter being
mandatory for a safe-life item
104
.
D. Is a detection task applicable? Will it reduce the
multiple failure's probability to a tolerable level? Is it
effective? Is it practical to do the task at the required
interval?
2. Can a combination of 2 or more TBM and CBM tasks be
applicable (avoid or reduce the safety consequences to a
tolerable level)? Are they effective (practical)?
N. No time based or condition based activities need be
scheduled.
R. A hardware, software, or procedural modification that
will reduce the failure's probability and/or its consequences
to a tolerable level is mandatory (H or S) or may be
desirable (O or M).

For the failure mode (cause) "Weld in frame fails due to
fatigue" we first ask whether the failure is hidden. Since the
failure's direct effects will be clearly visible (probably
catastrophic) to operating personnel, this failure is not
hidden. We therefore descend to the next row and ask
whether the failure affects safety. A derailment clearly
does, so we enter S in the first column. We then proceed to
the next cell to the right and ask whether there is a CBM
task that is applicable and effective. We need search no
further than the effects description to learn that it is
entirely feasible to detect a crack at the potential failure
stage of 100 mm length. The task will be effective
(economically feasible) because there will be ample
opportunity to perform this inspection often enough during
other routine work (to be described in subsequent rows of
the analysis). Hence we stop at that point and enter C in
the second column of the matrix.
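
As a rough illustration of how one row of Figure 12-1 is traversed, the following Python sketch walks the cells of a row from left to right and stops at the first question answered "yes". The names `DECISION_ROWS` and `select_task`, and the idea of passing the analyst's answers as a dictionary, are assumptions made for this sketch; the judgment behind each answer still comes from reciting the full questions.

```python
# Rows of Figure 12-1: consequence category -> cells to the right of it.
DECISION_ROWS = {
    "H": ["C", "T", "D", "R"],
    "S": ["C", "T", "2", "R"],
    "O": ["C", "T", "N", "R"],
    "M": ["C", "T", "N", "R"],
}


def select_task(category, answers):
    """Walk one row left to right and return the cell where the analysis stops.

    `answers` maps a question letter (C, T, D or 2) to True or False for the
    failure mode being analyzed; questions not listed count as "no".
    """
    for cell in DECISION_ROWS[category]:
        if cell in ("N", "R"):
            # N: no scheduled task (rows O and M). R: redesign (rows H and S).
            return cell
        if answers.get(cell, False):
            return cell
    raise ValueError("every row is expected to end in N or R")


# Weld in frame fails due to fatigue: safety consequence, CBM task applies.
print(select_task("S", {"C": True}))  # C
```

In rows O and M the sketch stops at N, which reflects that redesign is only optional there, whereas rows H and S fall through to the mandatory R when no proactive task qualifies.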

Example 2
H C T D M
S C T 2 M
O C T N M
M C T N M
Example 4


Figure 12-2 The shock-strut assembly on the main landing gear of the
Douglas DC-10. The outer cylinder is a structurally significant item.

Structures Worksheet: type of Aircraft Douglas DC-10-10
Item Number: 101 No. per aircraft: 2
Item Name: Shock-strut outer cylinder Major area: main landing gear
Vendor part/model no: PN ARG 7002-505 Zones: 144, 145
Design criterion:
Damage tolerant element: __
Safe-life element: Yes
Inspection access:
Description/location details:
Shock-strut assembly is located on main landing gear; SSI
consists of outer cylinder (both faces)
Internal: Yes
External: Yes
Material (include manufacturer's trade name): Steel alloy
4330 MOD (Douglas TRICENT 300 M)
Redundancy and external
detectability:
No redundancies; only one cylinder
each landing gear, left and right
wings. No external detectability of
internal corrosion.
Fatigue-test data Is element inspected via a
related SSI? If so, list SSI no.: No
Expected fatigue life:

Classification of item
(significant/nonsignificant):
significant
Crack propagation:
Established safe-life: 46,800 landings 70,200 oper. hours
Design conversion ratio: 1.5 operating hours/flight cycle
Residual strength | Fatigue life | Crack growth | Corrosion | Accidental damage | Class no. | Controlling factor | Inspection (int./ext.) | Proposed task | Initial interval
- | - | - | 1 | 4 | 1 | Corrosion | Internal | Magnetic-particle inspection for cracking and detailed visual inspection for corrosion | Sample at 6,000 to 9,000 hours and at 12,000 to 15,000 hours to establish best interval
 | | | | | | | External | General inspection of outer surface | During preflight walkarounds and at A checks
 | | | | | | | | Detailed visual inspection for corrosion and cracking | Not to exceed 1,000 hours (C check)
 | | | | | | | | Remove and discard at life limit | 34,800 hours
Figure 12-3 RCM Worksheet for structurally significant items
The worksheet of Figure 12-3 differs from that of the previous
examples. This form applies to the analysis of structurally
significant items. All structurally significant items fall into one of
two categories:

1. Damage-tolerant item: A monolithic or multiple load path
item in which a crack or complete failure of an element will not
reduce residual strength below the safety level prior to detection,
or
2. Safe-life item: A structurally significant item whose
potential failure is not reliably detectable.
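
For a safe-life item the life limit may be stated in landings or in operating hours. The two figures on the structures worksheet (46,800 landings and 70,200 operating hours) are related by the design conversion ratio of 1.5 operating hours per flight cycle. The short Python sketch below checks that arithmetic; it assumes one landing per flight cycle, and the function name is introduced here for illustration only.

```python
LANDINGS_SAFE_LIFE = 46_800     # established safe-life, from the worksheet
HOURS_PER_FLIGHT_CYCLE = 1.5    # design conversion ratio, from the worksheet


def landings_to_hours(landings, hours_per_cycle=HOURS_PER_FLIGHT_CYCLE):
    """Convert a life limit in landings to operating hours.

    Assumes one landing per flight cycle.
    """
    return landings * hours_per_cycle


print(landings_to_hours(LANDINGS_SAFE_LIFE))  # 70200.0
```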

Table 12-1 explains the rating system for the first 5 columns of
Figure 12-3. The analysis shows the treatment of a safe-life item in
an airline context. Because the shock-strut outer cylinder on the
main landing gear of the Douglas DC-10 has been classified a safe-
life item it must be discarded before a fatigue crack is expected to
occur. Hence it is not rated for residual strength, fatigue life, or
crack propagation characteristics (the first three columns of Figure
12-3). The class number in column 6 is set to the minimum of the
ratings in columns 1 to 5. The controlling factor is the factor whose
rating is that minimum.
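
The class-number rule can be sketched in a few lines of Python. This is only an illustration of the "minimum rating" rule described above, not a reproduction of any published worksheet tool; the function name and dictionary keys are hypothetical, and the sample ratings are those shown in Figure 12-3 (unrated factors are represented by None).

```python
from typing import Optional


def class_number(ratings: dict[str, Optional[int]]) -> tuple[int, str]:
    """Return (class number, controlling factor) for one item.

    Ratings run from 1 to 4; None marks a factor that is not rated.
    """
    rated = {factor: r for factor, r in ratings.items() if r is not None}
    if not rated:
        raise ValueError("at least one factor must be rated")
    # The controlling factor is the one carrying the lowest rating.
    controlling = min(rated, key=rated.get)
    return rated[controlling], controlling


# Ratings taken from Figure 12-3 (safe-life item: first three not rated).
shock_strut = {
    "residual strength": None,
    "fatigue life": None,
    "crack growth": None,
    "corrosion": 1,
    "accidental damage": 4,
}

print(class_number(shock_strut))
```

For the shock-strut cylinder the corrosion rating of 1 is the lowest, so it sets the class number and is reported as the controlling factor.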

Safe-life limits are only effective, however, if nothing prevents the
item from reaching them. In the case of structural items, two factors
introduce this possibility: corrosion and accidental damage.
Experience has shown that landing-gear cylinders of this type are
subject to two corrosion problems. First, the outer cylinder is
susceptible to corrosion from moisture that enters the joints at which
other components are attached; second, high-strength steels such as
4330 MOD are subject to stress corrosion in some of the same areas.
The item is therefore given a corrosion rating of 1, which results in
an overall class number of 1.

In addition to the corrosion rating, the shock-strut cylinder is rated
for susceptibility to accidental damage. The cylinder is exposed to
relatively infrequent damage from rocks and other debris thrown
up by the wheels. The material is also hard enough to resist most
such damage. Its susceptibility is therefore very low, and the rating
is 4. However, because the damage is random and cannot be
predicted, a general check of the outer cylinder, along with the
other landing-gear parts, is included in the walkaround inspections
and the A check, with a detailed inspection of the outer cylinder
scheduled at the C-check interval.

Table 12-1
Reduction in residual strength: no. of elements that can fail without reducing strength below damage-tolerant level | Rating | Fatigue life of element: ratio of fatigue life to design goal | Crack-propagation rate: ratio of interval to fatigue-life design goal | Rating | Susceptibility to corrosion: ratio of corrosion-free age to fatigue-life design goal | Rating | Susceptibility to accidental damage: exposure as a result of location | Rating
One | 1 | | 1/8 | 1 | 1/8 | 1 | High | 1
Two or more [106] | 2 | | | 2 | | 2 | Moderate | 2
Two or more [107] | 3 | | 3/8 | 3 | 3/8 | 3 | Low | 3
Two or more [108] | 4 | | | 4 | | 4 | Very low | 4

Chapter 13. Can RCM and Streamlined RCM peacefully co-exist?
Introduction
Religious or political zealots confront one another, often, not on the basis
of the mores of their respective doctrines, but rather from superficial
differences in the details surrounding each other's cultural reference
points. Mathematicians take pride in their ability to adopt a new set of
definitions and symbols as effortlessly as they would don a fresh suit of
clothes. Thus they proceed, unfettered by prior points of view, to build
new theorems upon old. The world of maintenance has, not dissimilarly,
spawned a multitude of cultures and languages for formulating solutions to
real problems.

In the preceding chapters we conducted RCM on several diverse item
types. We systematically answered each of the seven RCM questions
about the item, in the order stipulated by the SAE JA1011 standard:
1) functions, 2) failures, 3) failure modes, 4) failure effects, 5)
consequences, 6) scheduled tasks, and 7) default tasks. We entered
the answers in an electronic spreadsheet (for example, MS Excel, or a
database form) formatted as the RCM Worksheet illustrated in
Figure 10-2 on page 111.

This chapter explores streamlined RCM software. We begin with an
examination of what is meant by "streamlining". We illustrate the
streamlined approach by describing a popular representative RCM
software package called RCM Turbo [109]. We set up a cross-reference
dictionary of terms describing similar-sounding but sometimes
differently applied concepts in the two languages. Finally, we summarize
the relative advantages and potential drawbacks of the streamlined RCM
and the RCM processes. Through this process, we discover how the
juxtaposition of the two approaches may benefit the proponents of both.
Why streamline RCM?

Chapter 10 (page 110) cited the SAE standard "Evaluation Criteria for
Reliability-Centered Maintenance (RCM) Processes", which defines RCM
as:

a specific process used to identify the policies which must be
implemented to manage the failure modes which could cause the
functional failure of any physical asset in a given operating context.

It goes on to define the process by adding:

Any RCM process shall ensure that all the following seven questions
are answered satisfactorily and are answered in the sequence shown as
follows:
a. What are the functions and associated desired standards of
performance of the asset in its present operating context (functions)?
b. In what ways can it fail to fulfill its functions (functional failures)?
c. What causes each functional failure (failure modes)?
d. What happens when each failure occurs (failure effects)?
e. In what way does each failure matter (failure consequences)?
f. What should be done to predict or prevent each failure (proactive tasks
and task intervals)?
g. What should be done if a suitable proactive task cannot be found
(default actions)?
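
As a minimal illustration of the "answered in the sequence shown" requirement, the sketch below models one worksheet row whose fields follow questions a to g and checks that no later question has been answered while an earlier one is still blank. The class, field, and function names are hypothetical and are not taken from the standard or from any RCM software; the sample text comes from Example 1 of this chapter. A team that decides no default action is needed would, under this simple model, record that decision as text rather than leave the field blank.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class WorksheetRow:
    """One worksheet row; field order follows questions a to g."""
    function: Optional[str] = None
    functional_failure: Optional[str] = None
    failure_mode: Optional[str] = None
    failure_effects: Optional[str] = None
    failure_consequences: Optional[str] = None
    proactive_task: Optional[str] = None
    default_action: Optional[str] = None


def first_unanswered(row: WorksheetRow) -> Optional[str]:
    """Name of the earliest blank question, or None if all are answered."""
    for f in fields(row):
        if not getattr(row, f.name):
            return f.name
    return None


def answered_in_sequence(row: WorksheetRow) -> bool:
    """True if every answered question precedes every blank one."""
    answered = [bool(getattr(row, f.name)) for f in fields(row)]
    return answered == sorted(answered, reverse=True)


row = WorksheetRow(
    function="To feed material 24 hours/day",
    functional_failure="Does not feed at all",
    failure_mode="Shaft fails",
)
skipped = WorksheetRow(
    function="To feed material 24 hours/day",
    failure_mode="Shaft fails",  # functional failure was skipped
)

print(first_unanswered(row), answered_in_sequence(row))
print(first_unanswered(skipped), answered_in_sequence(skipped))
```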

Were we to consider the process (of answering the 7 RCM questions in the
sequence stipulated) unacceptably resource intensive, we would,
understandably, seek to replace it with one that consumes less time and
fewer resources but provides a no less responsible (sufficiently
rigorous) analysis. We emphasize that the SAE JA1011 standard
stipulates a minimal set of criteria for a process to be called "RCM".
Therefore, it is to be expected that most commercially packaged RCM
software systems and methodologies will add a considerable number of
features that enhance and facilitate the experience.

The "original" [110] as well as the various "streamlined" RCM methods all
demand that the assembled team of analysts (operational, process, and
maintenance specialists) possess, collectively, the knowledge necessary to
make informed decisions regarding the maintenance characteristics of the
item under scrutiny. The process chosen (either original or streamlined)
must, therefore, encourage the maximum contribution by each participant
so that RCM decisions will carry the force of all knowledge and
experience available on the team. The success of any RCM
methodology, therefore, depends heavily on its ability to gain true
consensus throughout every stage of the analysis. The group, guided by a
well-trained facilitator, exercises its best judgment when visualizing the
typical worst case scenario (TWCS) surrounding each functional failure
analyzed.

With these objectives in mind, we compare the two processes, starting
with a dictionary of some of their respective terms of reference.
RCM/RCM Turbo dictionary
Table 13-1 Relationship between RCM and RCM Turbo terminology
RCM: Item: a collection of parts or systems that is convenient to
analyze as a group. It has been selected at a high enough level of
indenture that its failure may easily be related to that of the
equipment as a whole, but at a low enough level that the analysis is of
manageable size (i.e. having a manageable number of failure modes).
RCM Turbo: Maintainable item (MI): same meaning.

RCM: No equivalent terminology is specified by the RCM minimum criteria
standard. (Any convenient or existing equipment hierarchy naming system
may be used.) Operating context is recorded in a flexible structure at
the top of the RCM worksheet.
RCM Turbo: Productive unit (PU): A system that includes several
maintainable items. A convenient place to record the operating context
of the MI. A productive unit belongs to a Major Unit, and a Plant is the
highest level in the RCM Turbo hierarchy.

RCM: Worksheet: A document (conveniently an electronic spreadsheet or
simple database application) onto which the answers to the 7 RCM
questions are recorded during the RCM team session.
RCM Turbo: The RCM Turbo software product is not meant to be populated
during the sessions, but afterwards by the facilitator or another person
trained in the use of the software. An MS Excel form (Figure 13-2 page
163) is provided for use during the sessions.

RCM: The RCM minimum criteria standard does not specify a criticality or
priority scale with which to schedule the order of items to be analyzed.
Nowlan and Heap developed a simple priority system for the aviation
industry that has only two criticality ratings: 1) significant item
[111], and 2) non-significant item. This classification system has
proved useful in a variety of other industries. For structurally
significant items (SSI) Nowlan and Heap apply a further classification
of one to four for each of five rating categories: 1) Residual strength
after failure, 2) Fatigue life, 3) Crack growth, 4) Corrosion, and 5)
Accidental damage. The minimum class (for all 5) determines task
frequency. There are two categories of SSI: 1) Damage-tolerant and 2)
Safe-life. Rating categories 1 to 5 apply to damage-tolerant items, but
only rating categories 4 and 5 apply to safe-life items. (See Example 4
of Chapter 12 on page 150.)
RCM Turbo: Criticality/Priority: values used to set priorities for PUs
and MIs. It is derived by question and answer sessions driven by the
program. (Criticality calculations in no way detract from RCM. They
merely add another dimension to the analysis.)

RCM: Failure: Describes the way in which a specified function no longer
performs as required. It distinguishes (for example) "full" from
"partial" failure of a function. The RCM Worksheet enforces a
one-to-many integrity constraint between Function and Failure.
RCM Turbo: Failure: same basic definition. However, RCM Turbo does not
constrain a one-to-many relationship between Function and Failure.

RCM: Failure Mode: A reasonably likely cause of a specified failure.
Consists of a noun, a verb (active or passive form) and a phrase such as
"due to ...". For example, "bolt cracks due to stress corrosion
fatigue". The number of failure modes to list and their depth of
causality depend on operating context. RCM enforces a one-to-many
integrity constraint between failure and failure mode. RCM Turbo does
not.
RCM Turbo: Failure Mode: A superset of the RCM definition. Structured in
3 parts as follows: 1) a component reference, 2) a Failure Mode & Effect
field, a single field that includes both RCM concepts (Failure Mode and
Failure Effects), and 3) a Root cause reference. An example of an RCM
Turbo failure mode is: "Bearings" + "wear between rolling elements and
races leading to increased vibration levels, localized heating and
eventual seizure and total stoppage of process due to" + "normal wear
and tear".

RCM: Failure Mode: In RCM, the terms "Root Cause", "Failure Mode",
"Failure Mechanism", "Failure Reason", etc. are all equivalent and
represented by the term "Failure Mode". It is an event in the causality
chain that leads to the failed state. The link in the causality chain
selected as the Failure Mode is the one that the organization can manage
effectively and practically by whichever means (proactive, detective, or
redesign).
RCM Turbo: Root cause: related to Failure Mode. Same definition. That
is, Root Cause in RCM Turbo is equivalent to Failure Mode in RCM.

RCM: Failure Effects: Text answering the following:
- What sequence of events (considering a TWCS [112] in the component, in
  the system, organization wide, and in the external world) could be
  touched off by the failure mode?
- How does the failure make itself known? What observable events lead up
  to the failure?
- How is safety or the environment impacted? (without mentioning the
  words "safety" or "environment")
- How is production impacted? (quality, cost, customer service)
- Is there any additional damage caused by the failure?
- How long will it take and what actions must be accomplished to correct
  the failure?
- How does the likelihood of this failure depend on deeper causes? Has
  it happened before? How often? Under what circumstances?
RCM Turbo: Same definition, but it is structurally embedded in the
Failure Mode & Effect field. In addition, the following Failure Mode
fields (with sample data) contribute to the Effects narrative:
Unit Output Reduction: Total stoppage
PU Downtime Cost: $11,390 / hour
MI Downtime Cost: $11,390 / hour
F/mode&Effects: Shaft failure - chemical corrosion, overtorque,
indicated by cracks, increase in vibration leading to shutdown of
Brownstock washer
Characteristic: Definitive life / wear out characteristics
Measurability: Moderately easy to monitor
Category: Normal Operation
Typical Warn Time: 4 Weeks
Root cause: Normal wear & tear
MTBF: 5 years
Consequence: Total stoppage
Strategy: CBM

RCM: Hidden Function: A Function whose failure will not be detected
under normal circumstances. Identified by RCM during functional analysis
when examining each component (from schematics, P&IDs, photographs, and
physical walkaround) and listing the functions they suggest. Code
phrases (such as "able to", "in the presence of", etc.) are used to
point out that a function is hidden or protected by a hidden function.
Subsequent questions address the hidden function. The hidden consequence
supplants all other possible failure consequences in the RCM logic for
determining a mitigating task.
RCM Turbo: Hidden Failure Mode: Same meaning. Consists of the fields:
Component, Failure Mode & Effects, Task Description, Frequency,
Duration, Initiate Date, Job Group ID, Service Period, No. of Units in
Service, No. of failures, and MTBF of the protective device
(calculated).

RCM: RCM records this information in the free text answer to question 4,
"Failure Effects".
RCM Turbo: MTBF: related to the Failure Mode.

RCM: RCM records this information in the answer to question 6 "Tasks"
when following one of the four branches (H, S, O, N) in the RCM decision
logic tree.
RCM Turbo: Strategy: related to Failure Mode. Takes one of three
possible values: 1) fixed time maintenance, 2) condition based
maintenance, or 3) operate to failure.

RCM: Same definition. RCM records this information in the free text
answer to question 4, "Failure effects".
RCM Turbo: P-F Interval: related to Failure Mode. Estimated interval
(measured in working age units) between the appearance of a potential
failure and a functional failure.

RCM: Potential failure: An indicator that a failure mode has initiated.
RCM Turbo: S/A (secondary action) Indicator: same meaning.

RCM: No equivalent concept in RCM. If a failure mode is due to design,
lubrication, overload, or maintenance practices, they would each
constitute a separate failure mode, and this information would be
included in the failure mode description itself. The word "Safety" or
"Environment" is not mentioned until the consequence phase of the RCM
logic diagram.
RCM Turbo: Category: related to Failure Mode. Takes one of six possible
values: 1) Design, 2) Lubrication, 3) Normal Operation, 4) Overload
Condition, 5) Maintenance practices, or 6) Safety.

RCM: RCM records this information in the answer to question 4, "Failure
effects".
RCM Turbo: Characteristic: related to Failure Mode. Takes one of three
possible values: 1) Definitive life/wearout, 2) General degradation, and
3) Random.

RCM: Consequences: Question 5. Takes one of four possible values: 1)
Hidden, 2) Safety/Environmental, 3) Operational, and 4) Non-operational.
RCM records RCM Turbo's Consequence in the free text answer to question
4 "Failure effects" and in the third or fourth option of Question 5
"Consequences".
RCM Turbo: Consequence: related to Failure Mode. Takes one of four
possible values: 1) Total stoppage, 2) Partial stoppage/quality, 3) No
immediate effect, or 4) No effect.

RCM: RCM records this information in the answer to Question 4 "Failure
effects" and in the answer to Question 6 "Tasks". Q6 asks whether there
is an applicable CBM task. Once a (CBM or other) task is found to be
applicable (practical), RCM then asks whether it will be effective. That
is, will it sufficiently reduce or entirely avoid the consequences of
failure at acceptable cost?
RCM Turbo: Measurability: related to Failure Mode. Takes one of three
possible values: 1) Easy, 2) Moderate, or 3) Impossible.

RCM: Redesign: RCM records this information in the free text answer to
question 7, "Default Tasks". Differs from RCM Turbo only in the sequence
in which this question appears (i.e. following a determination that no
proactive or failure-finding task adequately mitigates the consequences
of the failure).
RCM Turbo: Design Notes: related to the Failure Mode. Records the
decision/recommendation to design out the failure mode. (Strictly
speaking it is presented out of RCM sequence.)

RCM: RCM provides no specific field for this information, leaving its
provision up to the implementer or commercial packager.
RCM Turbo: Strategy Notes: related to Failure Mode. A free text field
used to store comments or notes on the chosen maintenance strategy.
Useful where a second or alternative strategy has been considered and
rejected.

RCM: RCM records this information in the answer to question 4, "Failure
Effects". However, in far less detail.
RCM Turbo: Breakdown Action: related to Failure Mode. Describes what
must be done to repair the functional failure. Also has the specific
fields: Work Order No., SOP, Duration, Downtime, MI Status, S/A
Initiator, Resources (up to six steps), Assumptions, Materials, Spares.

RCM: RCM records this information in the answer to question 6, "Tasks".
However, in far less detail.
RCM Turbo: Primary Action: related to Failure Mode. Describes what
should be done to prevent the failure mode. Also has the specific
fields: Work Order No., SOP, Duration, Downtime, MI Status, S/A
Initiator, Resources (up to six steps), Assumptions, Materials, Spares.

RCM: RCM records this information in the answer to question 6, "Tasks".
However, in far less detail.
RCM Turbo: Secondary Action: related to Failure Mode. Describes what
must be done following the detection of a potential failure. Also has
the specific fields: Work Order No., SOP, Duration, Downtime, MI Status,
S/A Initiator, Resources (up to six steps), Assumptions, Materials,
Spares.

RCM: RCM records this information in the answer to question 4, "Failure
Effects". However, in far less detail.
RCM Turbo: Overhaul Action: related to Failure Mode. Records Overhaul
Maintenance actions. For example, where the Secondary Action was the
change-out of a rotable item which itself requires subsequent overhaul.
Also has the specific fields: Work Order No., SOP, Duration, Downtime,
MI Status, O/H Venue, S/A Initiator, Resources (for up to six steps),
Assumptions, Materials, Spares.

RCM: Not called a library. However, the records are equally accessible
(structured as answers to the seven questions) in the RCM worksheets
comprising the global RCM table. No corporate harmonizing process need
be applied because every record is a one-off development. However,
tools, training, supervision and support are required to validate and
maintain the knowledge base. Templating of an entire item is,
nonetheless, possible by copying any or all records of an item after
carefully comparing their respective operating context descriptions.
RCM Turbo: Failure Data Library: a table of 3-part failure modes
referenced by Machine Type. An administration process is used to control
the quality of data from multiple sites and harmonize it for the purpose
of providing "templates" where applicable in future analyses of other
MIs or PUs. The ease of templating justifies the appellation
"Streamlined" in the case of RCM Turbo.

We may conclude from Table 13-1 that, although RCM Turbo refers to
itself as a "streamlined" process and some of its terminology differs
from that of RCM, it does not omit any vital knowledge element specified
by the SAE RCM minimum criteria standard. RCM Turbo does deviate
from the sequence stipulated in the standard. As pointed out in Chapter 10
(page 110), however, RCM is in practice not a sequential process. RCM
analysts anticipate the answers to subsequent questions while working on
the current question. Furthermore, the RCM process is iterative. That is,
the analysts often return to a previous answer and adjust it in the light of
revelations further on in the process. The iterative and non-sequential
nature of the RCM process tends to render the differences between the
two approaches less important.

The terminology comparisons of Table 13-1 show that RCM Turbo
extends the information elements of RCM into greater structural detail.
Such data structuring facilitates the post-RCM processes (included in the
RCM Turbo software package) of workload smoothing, frequency
calculations, and CMMS integration, as well as integration with an
optional spares optimization package.
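
The one-to-many constraints that Table 13-1 attributes to the RCM worksheet (one function to many failures, one failure to many failure modes) can be sketched as nested records. The following is a minimal Python illustration only; it is not the data model of RCM Turbo or of any RCM product, the class and function names are hypothetical, and the sample text is borrowed from Example 1 of this chapter.

```python
from dataclasses import dataclass, field


@dataclass
class FailureMode:
    description: str


@dataclass
class Failure:
    description: str
    # Each failure mode hangs off exactly one failure.
    modes: list[FailureMode] = field(default_factory=list)


@dataclass
class Function:
    statement: str
    # Each failure hangs off exactly one function.
    failures: list[Failure] = field(default_factory=list)


def count_failure_modes(functions: list[Function]) -> int:
    """Total number of failure modes recorded for an item."""
    return sum(len(fl.modes) for fn in functions for fl in fn.failures)


feed = Function("To feed material 24 hours/day")
no_feed = Failure("Does not feed at all")
no_feed.modes.append(FailureMode("Shaft fails"))
feed.failures.append(no_feed)

print(count_failure_modes([feed]))
```

Because each record is stored inside its parent, a failure mode cannot be attached to two failures at once, which is one simple way of expressing the integrity constraint. A flat table in which every row repeats its own function and failure text, by contrast, would not enforce it.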

Figure 13-1 of Example 1 shows how the RCM Worksheet of Chapter 10
(Figure 10-2, page 111) may be combined with the extended data fields of
RCM Turbo.

Example 1

PU Code: Repulper, MI Code: Repulper screw
Function: To feed material 24 hours/day
Failure: Does not feed at all
Failure mode: Shaft fails
Effects:
Unit Output Reduction: Total stoppage
PU Downtime Cost: $11,390 / hour
MI Downtime Cost: $11,390 / hour
F/mode&Effects: Shaft failure - chemical corrosion, overtorque,
indicated by cracks, increase in vibration leading to shutdown of
Brownstock washer
Characteristic: Definitive life / wear out characteristics
Measurability: Moderately easy to monitor
Category: Normal Operation
Typical Warn Time: 4 Weeks
Root cause: Normal wear & tear
MTBF: 5 years
Consequence: Total stoppage
Strategy: CBM
Figure 13-1 RCM Worksheet applied to a RCM Turbo example
In the RCM worksheet of Figure 13-1 we note that most of the
RCM Turbo failure mode fields (in bold) fall quite readily into
the RCM Effects column, with the possible exception of the field
"Strategy". The latter appears to preempt the RCM decision logic
of question 6. We view this, nonetheless, as an insignificant
departure from RCM, given that RCM analysts consider the
mitigating task in the normal course of describing the effects of
failure. It is essential that the RCM consequences (H, S, O, or N)
be determined and the complete decision logic of RCM (Figure
12-1 on page 143) be applied immediately following this RCM
Turbo step.
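
The folding of RCM Turbo fields into the RCM Effects column can be sketched as follows. This is a hedged illustration, not RCM Turbo's export format: the dictionary simply restates some of the sample fields of Figure 13-1, and the function name and the idea of a separate "deferred" set for decision fields such as Strategy are assumptions made for the example.

```python
# A subset of the sample RCM Turbo failure mode fields from Figure 13-1.
turbo_fields = {
    "Unit Output Reduction": "Total stoppage",
    "PU Downtime Cost": "$11,390 / hour",
    "MI Downtime Cost": "$11,390 / hour",
    "Typical Warn Time": "4 Weeks",
    "Root cause": "Normal wear & tear",
    "MTBF": "5 years",
    "Consequence": "Total stoppage",
    "Strategy": "CBM",
}


def split_for_worksheet(fields, decision_fields=("Strategy",)):
    """Return (effects text, deferred fields).

    Fields named in decision_fields are held back until the RCM decision
    logic (question 6) has been applied; the rest join the Effects text.
    """
    effects = "; ".join(
        f"{name}: {value}"
        for name, value in fields.items()
        if name not in decision_fields
    )
    deferred = {n: v for n, v in fields.items() if n in decision_fields}
    return effects, deferred


effects, deferred = split_for_worksheet(turbo_fields)
print(effects)
print(deferred)
```

Holding Strategy back in this way mirrors the point made above: the field is recorded, but the task decision is confirmed only after the consequence category has been determined.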

RCM Turbo facilitates data entry with a convenient Visual Basic
MS Excel form illustrated in Figure 13-2.

Figure 13-2 MS Excel failure mode entry form in RCM Turbo

RCM Turbo then performs a primary (i.e. CBM) task frequency
calculation and displays the result: a recommended task interval of
14 days (i.e. half the warning interval).
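
The half-interval figure quoted above can be checked with a line of arithmetic. The sketch below assumes only what the text states (a task interval of half the warning time) together with the 4-week "Typical Warn Time" of Example 1; the function name and the adjustable fraction are hypothetical, and RCM Turbo's own calculation is not reproduced here.

```python
def cbm_task_interval_days(warn_time_weeks: float, fraction: float = 0.5) -> float:
    """Inspection interval in days, taken as a fraction of the warning time."""
    if warn_time_weeks <= 0:
        raise ValueError("warning time must be positive")
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return warn_time_weeks * 7 * fraction


# Typical Warn Time of 4 weeks, inspected at half that interval.
print(cbm_task_interval_days(4))
```

With a 4-week warning time the function returns 14 days, in agreement with the value displayed by the software in this example.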

RCM Turbo has blended risk and cost (in much the same way as
described in Chapter 9, page 78) to estimate an optimal CBM
inspection frequency. It performs analogous calculations for time-
based and failure-finding tasks. The complete set of RCM Turbo's
data fields is given in Appendix 12 on page 195.
Conclusions
1. Table 13-1 illustrates that "streamlined" RCM (as embodied in
RCM Turbo) is not streamlined in the sense of being abridged or
reduced. Rather, it encompasses the principles of RCM, adding features
that address CMMS integration, quantitative reliability assessment and
task frequency calculations, spares, workload scheduling and balancing,
and other considerations.
2. RCM Turbo does address the 7 RCM questions, although not in
the sequence stipulated by the RCM Standard.
3. The RCM Turbo software expands the 7 information elements of
RCM into multiple database fields. For example, MTBF, P-F Interval,
Repair time, etc are all explicit fields related to a Failure Mode.
4. Combining the RCM Worksheet with RCM Turbo can reduce the
workload in RCM Turbo. For example, if a failure mode has Safety
Consequences, there is no need to bother filling in the Turbo field
Consequence.
5. In this author's view (which some proponents of both camps may
challenge), an RCM Worksheet along the lines described in Figure 10-2 on
page 111 provides excellent team focus regardless of the methodology
adopted. If populated (perhaps adapted as in Figure 13-1 page 162) with
RCM Turbo's needs (see Turbo's data sheet page ) in mind, the worksheet
will benefit both streamlined and original RCM users.
6. Both RCM and RCM Turbo demand that the persons (primarily
maintainers and operators) directly impacted by the decisions flowing
from either process participate fully in the process. Indeed they must
drive it. External consultants can only teach the principles and techniques
of RCM. The organization must select its analysts from among its most
experienced and competent operators and maintainers. It must choose a
facilitator who will learn the process fluently and will elicit and
faithfully record the technical knowledge of the analysts.

References:

1. RCM Turbo Maintenance Plan Development System Quick
Reference Guide
2. RCM Turbo V9.2 User Guide
3. RCM Turbo V9 desktop guide rev 2
4. RCMT92 Installation Instructions

Chapter 14. Appendices
Appendix 1.
The role of the RCM Facilitator
1. Administration 2. Animation 3. Clarity 4. Time Management
5. Focus

The quality and success of the RCM analysis will depend
on how well the facilitator has mastered and exercises the
skills outlined in Table 14-1: RCM facilitator's checklist.
The facilitator's skill and vigilance will prevent the analysis
from being dangerously superficial or, conversely, from
becoming bogged down and stalled in unnecessary detail.
The novice facilitator should refer often to this scorecard
throughout the RCM project, and continually self-evaluate
h(is)(er) performance (initially under the watchful eye of
an experienced RCM practitioner) with respect to each of
the items in Table 14-1.

Table 14-1: RCM facilitator's checklist
1.0 Administration Score

Shortly after the RCM analysis has been completed, assemble the
worksheets and supporting documentation (drawings, photographs) into
a coherent, readable dossier for review and authorization by a
designated auditor.
1 2 3 4 5

In the planning phase, before an RCM analysis begins, ensure that
potentially useful documentation (drawings, schematics, etc.) is readily
accessible to the team. Discuss the general RCM objectives beforehand
with resource people outside the team, so that they may respond quickly
when called upon to provide clarification or information in the course
of the analysis.
1 2 3 4 5

Assist in the selection of the appropriately skilled RCM team members. 1 2 3 4 5

Assist in the initial decomposition of the asset/plant into manageable
significant items for individual RCM analyses. Position each item's
boundaries so that it can be analyzed in 6 to 14 three-hour sessions.
Ensure that an item has not been defined at too low a level of indenture,
where failure modes would be difficult to relate to the failure of the
equipment as a whole. During the analysis, decide how the failure modes
of a subsystem should be handled: whether to 1) break out the subsystem
for more convenient, separate analysis later, or 2) consider each of the
subsystem's failure modes as part of the main analysis, or 3) consider
the subsystem's failure modes as a single failure mode, or 4) consider
(as part of the main analysis) each of the subsystem's dominant failure
mode(s) singly and the other failure modes lumped under the title
"others".
1 2 3 4 5

Assist in the development of the item's operating context 1 2 3 4 5

Assist in the scheduling of the RCM sessions 1 2 3 4 5

Report regularly on progress to the RCM sponsor. Call upon h(im)(er)
for help in resolving technical, organizational, or human issues as they
arise
1 2 3 4 5

Assist in the preparation of the presentation (by a team member) to
management at the end of the analysis
1 2 3 4 5

Provide team members access to the evolving RCM worksheet as the
analysis unfolds from session to session.
1 2 3 4 5
2.0 Animation Score
Recognize and be sensitive to each personality type. Help each team
member contribute fully to the RCM process by using one or more of
these techniques: Gently discourage the extrovert from monopolizing
the floor by (following a tirade) putting a question to another team
member ("George, what do you think about that?"). Encourage the
introvert by asking h(im)(er) questions and by assigning short research
tasks between sessions on unclear issues (calling a vendor, checking a
log sheet, etc.). Ask h(im)(er) to report on h(is)(er) findings at the
beginning of the next meeting. Be careful not to harass h(im)(er).
1 2 3 4 5
Recognize when true consensus is achieved. Never permit a vote. Keep
in mind that a lone dissenter may be right. Record h(is)(er) position and
ask h(im)(er) to agree to disagree until further elucidating information
comes along.
1 2 3 4 5
Sustain the morale of the group by summarizing progress at the
beginning of each session, and by always being positive about the
process. Express praise and gratitude when someone makes a
noteworthy contribution to the analysis.
1 2 3 4 5
At the beginning of the first session of the RCM analysis, help the team
set and agree upon the ground rules (smoking, punctuality, etc)
1 2 3 4 5
Recognize when the team simply does not know (about some aspect
of the asset) by being alert to statements beginning with "I think ..." or "I
believe ...". Assign short research tasks to team members to find out.
1 2 3 4 5
Remind participants of the objectives and importance of the analysis and
that they have been chosen to participate because of their knowledge
and experience.
1 2 3 4 5
With an inexperienced team be alert to misunderstandings of the process
and the meanings of questions. Use timeouts to clarify points of RCM
procedure when required. Common misunderstandings are a) mixing up
failed states and failure modes, b) mixing up average life (MTBF), useful
life, Bn life, etc., c) distinguishing potential failure from functional
failure, d) recognizing the difference between a failure-finding task and
an on-condition task.
1 2 3 4 5
Be alert to the team answering the wrong question. This can occur at any
time throughout the RCM process. An example is the raising of an
operational consequence when the process has moved onto the safety
and environmental branch of the decision diagram.
1 2 3 4 5
Safeguard the self-esteem of each team member. Persons considered
knowledgeable may lose face. Under all circumstances
emphasize (in timeouts and anecdotes) that RCM is, above all else, a
learning forum to bridge the discontinuities in the knowledge of
individuals by gaining advantage from the collective perspectives of the
team.
1 2 3 4 5
3.0 Clarity Score

Input the answers to the RCM questions into the RCM worksheet. 1 2 3 4 5

While entering the answers, retain team members' wording as much as
possible. Occasionally, when necessary, suggest ways of expressing the
answers more succinctly in written form. Revise and correct the text
outside the meeting without altering what was said and meant during the
session. When in doubt obtain approval from the team for extensive
word-smithing. Avoid jargon. Ensure that the technical terms used on
the worksheet will be understood by everyone on the site.
1 2 3 4 5
4.0 Time Management Score
Following an RCM decision to modify an asset or operating procedure,
resist the enticement to redesign the asset (or operating procedure)
during the RCM meeting. Allow the team to go only so far as to
elaborate the redesign requirement. Do NOT embark on a design
brainstorming process at this time.
1 2 3 4 5
Remind the team of the time allotted to the current analysis and the rate
of progress necessary to attain that goal.
1 2 3 4 5
Keep the pace of analysis (all 7 steps) at an average rate of 6 failure
modes per hour.
1 2 3 4 5
Indicate that about 1/3 of the time will be dedicated to defining the
functions, 1/3 to failures, modes, and effects (FMEA), and 1/3 to
consequences, decisions, and task definition and assignment.
1 2 3 4 5
5.0 Focus on the process Score
Ask the RCM questions. Never answer them. (If the team may have
made a technical error or omission, rephrase the questions to probe in a
particular direction or ask that a particular point be checked between
sessions.)
1 2 3 4 5
Call a timeout when necessary to explain a pertinent point of the RCM process. 1 2 3 4 5

Elaborate the asset's operating context at the beginning of the analysis.
Keep it in the team's mind throughout the analysis.
1 2 3 4 5
Ensure that the 7 RCM questions are asked completely, in the manner
and the order prescribed by SAE JA1011 [113].
1 2 3 4 5
Resist the tendency to skip questions, or parts of questions by taking
their answers for granted. In particular ask, explicitly, each question
(page 143) along the appropriate logic branch of the decision diagram.
The RCM process must be performed rigorously. In spite of the
repetitious nature of the process do not abbreviate the questions so much
that their meaning is lost or distorted.
1 2 3 4 5
Pay strict attention to the following issues with respect to each of the
SAE JA1011 RCM questions (5.1 to 5.7)...
1 2 3 4 5
5.1 What are the functions and associated desired standards of
performance of the asset in its present operating context
(functions)?

Ask the team to uncover the primary functions, the secondary functions,
including all hidden functions. Afterwards invoke the PEACHES
mnemonic to double check that all functions have been listed.
1 2 3 4 5
Direct the team to include as many quantitative performance
requirements as practical in each function statement to fully describe the
user's (owner's, society's) objectives for the asset. The function statement
usually begins with "To" or "Not to". Avoid the use of "and"
between two verbs.
1 2 3 4 5
Simplify the function list by deciding when certain functions may be
more conveniently included as a failure mode of another functional
failure. For example, the function "Not to trip when the liquid level is
below 100 hectoliters" preferably should be included as the failure mode
"pump trips due to grounded electrical contact" of the primary function
"To pump x liters ... "
1 2 3 4 5
Have the team use code phrases to imply a hidden function (e.g. "to be
capable of", "to be able to", "to heat to 140°C in the presence of a
standby heater").
1 2 3 4 5
5.2 In what ways can it fail to fulfill its functions (functional failures)?
Ensure that each quantitative performance requirement within an
individual function statement is addressed. Separate partial and total loss
with respect to each requirement.
1 2 3 4 5
5.3 What causes each functional failure (failure modes)?

Pay close attention to the number of failure modes to be included
and to their depth of causality. The list should be tempered by the
reasonable likelihood of occurrence and by the gravity of the
consequences (always keeping the operating context in mind). More
serious consequences would tend to lengthen the list of failure modes to
be addressed. The depth of causality (the number of times to ask "why")
at which to specify a failure mode is likewise sensitive to the operating
context. The depth should be that at which the organization can do
something about the failure or its consequences.
1 2 3 4 5
5.4 What happens when each failure occurs (failure effects)?

Extract from the team the sequence of events (internally and
organization-wide) that could be touched off by the failure mode. Also
describe:
- How does the failure make itself known?
- How is safety or the environment impacted? (without mentioning the
  words "safety" or "environment")
- How is production impacted? (quality, cost, customer service)
- Is there any additional damage caused by the failure?
- How long will it take, and what actions must be accomplished, to
  correct the failure?
- How does the likelihood of this failure depend on deeper causes? Has
  it happened before? Under what circumstances?
1 2 3 4 5
5.5 In what way does each failure matter (failure consequences)?

Carefully examine the failure effects as elaborated in 5.4 above and
select one of the four possible consequences.
1 2 3 4 5
5.6 What should be done to predict or prevent each failure (proactive
tasks and task intervals)?

For CBM tasks, explore alternative technologies, and expose the true
costs of the proposed program. For all proactive tasks, consider the
long-run costs of the task and those of the failure consequences it is
designed to reduce or prevent.
1 2 3 4 5
Set the proactive task intervals. For CBM, estimate the P-F interval by
consensus, or apply a risk-based, non-deterministic approach such as
EXAKT when the failure mechanism is not clearly understood. For
TBM, estimate the useful life with respect to the failure mode in question.
1 2 3 4 5
5.7 What should be done if a suitable proactive task cannot be found
(default actions)?

The three possible default actions (run-to-failure, failure detection, and
redesign) must be considered when so directed by the decision diagram.
For hidden failures, the detection interval must take into account the
acceptable level of risk of a multiple failure.
1 2 3 4 5
Ensure that the group has considered all practical aspects of the task that
has been selected. The task descriptions must contain the necessary
detail to ensure that no misunderstanding is possible when it is
transcribed into the maintenance system.
1 2 3 4 5


Appendix 2.
Sizing the analysis
The RCM facilitator, at the outset, makes a most important
decision: defining the boundaries of the item being analyzed.
RCM can be applied at almost any level of the asset hierarchy.
However, Figure 14-1 implies that there are compromises that we
must weigh when selecting the level at which to define our item.
The advantage of a high level is that the item's functions and
functional failures are more clearly related to the performance
requirements of the equipment as a whole, a desirable
characteristic.

Figure 14-1

Time is the facilitator's prime consideration. The more failure
modes that need to be considered, the longer the analysis will take.
Experience tells us that we should size the item so that it may be
analyzed in 5 to 15 three-hour sessions. A well-run analysis
averages 6 failure modes per hour. Hence a small analysis would
contain about 90 failure modes while a large one would analyze
about 270. These figures make it apparent that the facilitator must
carefully control the process, lest it flounder by not achieving the
target item analysis in the allotted amount of time. This could
jeopardize the entire RCM initiative.
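The sizing arithmetic above can be checked with a few lines of Python. This is only a sketch of the rule of thumb quoted in the text (6 failure modes per hour, three-hour sessions); the function names are illustrative and do not come from the course material.

```python
# Rule-of-thumb figures quoted in the text above.
FAILURE_MODES_PER_HOUR = 6
HOURS_PER_SESSION = 3


def failure_modes_covered(sessions: int) -> int:
    """Failure modes a team can analyze in the given number of sessions."""
    return sessions * HOURS_PER_SESSION * FAILURE_MODES_PER_HOUR


def sessions_needed(failure_modes: int) -> int:
    """Whole sessions required to analyze the given number of failure modes."""
    per_session = HOURS_PER_SESSION * FAILURE_MODES_PER_HOUR
    return -(-failure_modes // per_session)  # ceiling division


print(failure_modes_covered(5))   # small analysis
print(failure_modes_covered(15))  # large analysis
```

Running the inverse calculation (`sessions_needed`) on a draft failure mode list gives the facilitator an early warning when an item has been sized too large for the 5 to 15 session window.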


Selecting the significant items

Figure 14-2: Selecting the significant items for analysis
Figure 14-2 depicts the initial significant item selection process.
The criteria of significance and hiddenness dictate which items
need to be analyzed within the RCM project. Prioritization of the
analyses lies outside the scope of RCM because it varies according
to industry. Variants of RCM (such as Turbo RCM, see Chapter 13,
page 156) provide structured priority systems. Whatever priority
sequence has been chosen, the analyses are scheduled and team
members assigned, taking into account operational and personnel
constraints. The schedule provides a concrete set of objectives and
milestones for the RCM project.
Appendix 3.
Failure finding intervals for complex items (multiple failure
modes and devices)
Failure finding interval for devices with more than one failure
mode.
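$$I_{ff} = \frac{2\,M_{pf}}{M_{mf}\left(\dfrac{1}{M_{sd1}} + \dfrac{1}{M_{sd2}} + \dfrac{1}{M_{sd3}}\right)}$$

where:
$I_{ff}$ = failure finding interval
$M_{pf}$ = reliability (mean time between failures) of the
protected function
$M_{mf}$ = tolerable mean time between multiple failures
$M_{sd1}$ = mean time between failures due to failure mode 1 of
the safety device
$M_{sd2}$ = mean time between failures due to failure mode 2 of
the safety device
$M_{sd3}$ = mean time between failures due to failure mode 3 of
the safety device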
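As a numerical illustration, the sketch below evaluates the failure finding interval for a safety device with several failure modes, reading the formula as twice the protected function's mean time between failures, divided by the tolerable mean time between multiple failures times the sum of the reciprocal mean times between failures of the individual modes. The function name and the sample figures are hypothetical; all times must be in the same unit.

```python
def ffi_multiple_failure_modes(m_pf, m_mf, m_sd_modes):
    """Failure finding interval for a device with several failure modes.

    m_pf       -- mean time between failures of the protected function
    m_mf       -- tolerable mean time between multiple failures
    m_sd_modes -- mean times between failures, one per failure mode
                  of the safety device
    """
    if not m_sd_modes:
        raise ValueError("at least one failure mode is required")
    # The failure rates of the individual modes add up.
    combined_rate = sum(1.0 / m for m in m_sd_modes)
    return 2.0 * m_pf / (m_mf * combined_rate)


# Hypothetical figures, in years.
interval = ffi_multiple_failure_modes(50, 10_000, [10, 20, 20])
print(f"failure finding interval: {interval:.3f} years")
```

With a single failure mode the function reduces to the familiar single-device form, 2 x M_sd x M_pf / M_mf.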

Failure finding interval for redundant devices (based on the
linear approximation).

$$I_{ff} = M_{sd}\left[\frac{(n+1)\,M_{pf}}{M_{mf}}\right]^{1/n}$$

where:
n = number of redundant devices of the same kind.

Failure finding interval for voting systems.
$$I_{ff} = M_{sd}\left[\frac{(r+1)!\,(n-r)!\,M_{pf}}{n!\,M_{mf}}\right]^{1/r}$$

Voting systems are usually called k out of n systems,
where:
n = number of sensors in parallel
k = number of sensors needed to activate the safety action
r = number of sensors which must be failed for the safety
system to fail
so: r = n - k + 1
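The two preceding cases can be sketched together in Python. The sketch assumes the linear approximation mentioned above and reads the formulas as I_ff = M_sd x [(n+1) M_pf / M_mf]^(1/n) for n redundant devices and I_ff = M_sd x [(r+1) M_pf / (C(n,r) M_mf)]^(1/r) for a k-out-of-n voting system, where C(n,r) is the binomial coefficient. Function names and sample values are hypothetical.

```python
from math import comb


def ffi_redundant(m_sd, m_pf, m_mf, n):
    """Failure finding interval for n redundant devices of the same kind."""
    return m_sd * ((n + 1) * m_pf / m_mf) ** (1.0 / n)


def ffi_voting(m_sd, m_pf, m_mf, n, k):
    """Failure finding interval for a k-out-of-n voting system."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    r = n - k + 1  # sensors that must be failed for the safety system to fail
    return m_sd * ((r + 1) * m_pf / (comb(n, r) * m_mf)) ** (1.0 / r)


# Hypothetical figures, in years: a 2-out-of-3 voting arrangement.
print(ffi_voting(m_sd=10, m_pf=50, m_mf=1_000_000, n=3, k=2))
```

A useful sanity check on this reading: a 1-out-of-n system fails only when all n sensors have failed (r = n), so `ffi_voting` with k = 1 returns the same interval as `ffi_redundant`.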

Optimal failure finding interval for parallel redundant devices
where only cost is a factor
$$I_{off} = \left[\frac{(n+1)\,M_{sd}^{\,n}\,M_{pd}\,C_{ff}}{n\,C_{mf}}\right]^{\frac{1}{n+1}}$$

where:
$C_{mf}$ = average cost of a multiple failure
n = number of redundant safety devices of the same kind.
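The cost-only case can be evaluated the same way. The sketch below reads the formula as I_off = [(n+1) M_sd^n M_pd C_ff / (n C_mf)]^(1/(n+1)). The text defines only C_mf and n; treating C_ff as the cost of one failure-finding task and M_pd as the mean time between failures of the protected device is an assumption made here for illustration, as are the function name and the sample values.

```python
def ffi_cost_optimal(m_sd, m_pd, c_ff, c_mf, n):
    """Cost-optimal failure finding interval for n parallel redundant devices.

    m_sd -- mean time between failures of one safety device
    m_pd -- mean time between failures of the protected device (assumed meaning)
    c_ff -- cost of one failure-finding task (assumed meaning)
    c_mf -- average cost of a multiple failure
    n    -- number of redundant safety devices of the same kind
    """
    numerator = (n + 1) * m_sd ** n * m_pd * c_ff
    return (numerator / (n * c_mf)) ** (1.0 / (n + 1))


# Hypothetical figures: times in years, costs in the same currency.
print(ffi_cost_optimal(m_sd=8, m_pd=100, c_ff=1, c_mf=1600, n=1))
```

For n = 1 the expression reduces to the square root of 2 M_sd M_pd C_ff / C_mf, the usual trade-off between the cost of checking and the expected cost of an undetected failure.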

Appendix 4.
Truck description
Appendix 5.

Terminology used:
Appendix 6.
Relationship between hazard, reliability, and density
functions
Appendix 8.
Inherent reliability characteristics
120

| Inherent reliability characteristic | Impact on PM applicability and effectiveness |
|---|---|
| Failure consequences | Determine the significance of items for scheduled maintenance; establish the definition of task effectiveness; determine default strategy when no applicable and effective PM task can be found |
| Visibility of functional failure to operating crew under normal circumstances | Determines the need for a failure-finding task to ensure that failure is detected |
| Ability to measure/detect reduced resistance to failure | Determines applicability of on-condition tasks |
| Rate at which failure resistance decreases with operating age once a potential failure[121] occurs | Determines interval for on-condition tasks |
| Age-reliability relationship | Determines applicability of rework and discard tasks |
| Age-reliability-covariate relationship | Determines the key risk factors for interpreting on-condition data |
| Cost of corrective maintenance | Helps establish PM task effectiveness (except for safety and environment impacting failures) |
| Cost of preventive maintenance | Helps establish PM task effectiveness (except for safety and environment impacting failures) |
| Need for safe-life limits to prevent safety or environment failures | Determines applicability and interval of safe-life discard tasks |
| Need for servicing and lubrication | Determines applicability and interval of servicing and lubrication tasks |

Appendix 9.
Failure mode depth of causality
Why? | Why? | Why? | Why? | Why? | Why? | Why? | Why?
Ventilation system fails | Fan fails | Motor fails | Motor trips | Airways clogged with dirt | Inadequate design
Defective sensor
Bearing seized | Lubricant allowed to run dry
Wrong lubricant | Improperly labeled | Stores error
Label misread | Inattention
Insufficient training
Power drive fails | Belts failed | Incorrectly installed
Incorrectly specified
Distribution system fails | Duct fails | Duct clogged
Duct pierced
Damper failed
Appendix 10.
Expected failure time
Appendix 11.
Exercise (Example 2: Data validation)
1
In this exercise we will examine some of the
data validation tools in EXAKT.
Download the wheelmotor oil analysis data
from
www.omdec.com/reliability/wheelmotor.zip.
2
Check for logical (chronological sequencing)
errors. Examine the Data Check report. It will
give you an overall picture of the sample, and
indicate errors such as missing beginning or
ending events.
Start EXAKT for Modeling, Maximize
EXAKT Modeling window, File, Open,
Navigate to locate the file
Mar2004CRC_WMOD, Modeling, Select
Current Model, CBM Model: PHM(no OC),
OK, Activate Left pane (Database explorer
pane), Edit, Check Database, Data, Look at
the report, Reduce and Close the Report
3
Executing the instructions on the right should
give you a screen that looks like Figure 9-16
on page 91.

A) Left pane (Database explorer pane),
Open DataCheck table, View, Inspections,
Include Events View, OK

B) Arrange windows and panes so that the
Inspections and Events window covers the
top two-thirds of the screen and the
DataCheck window the bottom third.
Spread the windows so that they span the
entire width of the screen. Spread the
Description column of the DataCheck
window so that you can see as much of it
as possible. (It could take a few tries as the
edge of the column seems to stick and
spring back, so do it slowly.) The top
window should have four panes.

4
The tables and views are all in automatic
synchronization. This makes it easy to find and
correct errors, as we shall see in subsequent
steps.

EXAKT has no way of distinguishing between
missing ending events and temporary
suspensions. Therefore you will see many
requests to "Check whether this history is
temporary suspended or "EF/ES" is missing."
The user must make sure that all records so
flagged correspond to units that are currently
operating. EXAKT then assumes that they are
indeed temporary suspensions, and the
message may be ignored. Otherwise the
message means that an ending event, either an
EF or an ES, is missing, and you must add the
missing record manually.
DataCheck Window
5
The 5th record of the DataCheck table has the
description "This record can't be properly
identified. It has the same Ident, Date, WAge,
and Event as the previous record: Id=5503R 2,
Date=..."
DataCheck window, Record 5
6
Note that the synchronized corresponding
record (819) is flagged in the Inspections
window and the Events window likewise has its
pointer positioned at record 404.
Inspections window, widen the Date
column so the full date is visible, scroll up 1
row on the scroll bar so that record 818 is
visible
7
Note that record 818 corresponds to an oil
sample taken on the same equipment on the
same day. EXAKT is suspicious about this and is
asking you to verify the dates and working ages
for these two. Maintenance planning personnel
tell us that record 819 must be an error.
Therefore we may delete it.[122]

Delete record 819 (by selecting it and
hitting del).
8
Here is a similar type of problem. But in this
case two samples have the same working age
but different calendar dates. EXAKT is not
pleased with this situation and is asking you to
do something about it.
DataCheck window, record 6, Inspections
window, scroll up one row so that records
822 and 823 are visible.
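The two kinds of anomaly flagged in steps 7 and 8 can be illustrated outside EXAKT. The sketch below is not EXAKT's Data Check routine; it is a minimal stand-in, with hypothetical field names and made-up records, that scans consecutive inspection records of the same unit and reports pairs that share a date and working age, or that share a working age but carry different dates.

```python
from collections import namedtuple

# Hypothetical record layout; the real tables have more fields.
Inspection = namedtuple("Inspection", "ident date working_age")


def find_suspect_pairs(records):
    """Return (index, index, reason) for consecutive clashing records.

    The records are assumed to be sorted by unit and then by date.
    """
    suspects = []
    for i in range(1, len(records)):
        prev, cur = records[i - 1], records[i]
        if prev.ident != cur.ident:
            continue  # different unit, nothing to compare
        if prev.working_age != cur.working_age:
            continue  # working age advanced, no clash
        if prev.date == cur.date:
            suspects.append((i - 1, i, "same date and working age"))
        else:
            suspects.append((i - 1, i, "same working age, different dates"))
    return suspects


# Made-up sample data for illustration only.
sample = [
    Inspection("UNIT-A", "2003-01-10", 1200),
    Inspection("UNIT-A", "2003-01-10", 1200),
    Inspection("UNIT-B", "2003-02-01", 800),
    Inspection("UNIT-B", "2003-02-15", 800),
    Inspection("UNIT-B", "2003-03-01", 1100),
]
for first, second, reason in find_suspect_pairs(sample):
    print(first, second, reason)
```

As in the exercise, the check only points at suspicious pairs; deciding which record is wrong still requires someone who knows the maintenance history.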
9
In this way one goes systematically through the
database records, as indicated by the
DataCheck table, correcting the anomalies that
are pointed out by EXAKT.
Do not bother making any more corrections
for purposes of this exercise. Close the
Inspections, and DataCheck windows.
10
After following the instructions on the right you
will have reproduced Figure 9-17 on page 92.
View, Cross Graph, maximize window,
Table: Inspections, Horizontal:
WorkingAge, Vertical: SI, Condition:
Si<1000, Show
11
Horizontal: Fe, Vertical Si, delete
Si<1000, Show, reduce, X
12
Examine the OutputVarScript. It uses a succinct
data query language to conveniently transform
combinations of existing covariates into new
covariates for building and testing risk models.
The *(condition), shown on several lines of
this program, is read "where condition is true".
The statement of interest is the next to last:
CorrSi=Si*(Si<>900)+1.2*Fe*(Si=900);
It is telling the program to return the actual
value of Si where Si<>900 and to use 1.2*Fe
where Si=900.
Database explorer pane (left pane),
OutputVarScript, X
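For readers more familiar with general-purpose languages, the CorrSi statement can be mimicked in Python, where a comparison multiplied into an expression likewise acts as a 0/1 switch. This is only an illustration of the "where condition is true" idiom; it is not EXAKT code, and the function name is hypothetical.

```python
def corr_si(si: float, fe: float) -> float:
    """Return Si unless it equals 900, in which case substitute 1.2 * Fe.

    In Python, True and False behave as 1 and 0 in arithmetic, so each
    term is switched on or off by its condition, as in the script line.
    """
    return si * (si != 900) + 1.2 * fe * (si == 900)


print(corr_si(500, 100))  # Si is kept because it differs from 900
print(corr_si(900, 100))  # Si equals 900, so 1.2 * Fe is used instead
```

Exactly one of the two conditions is true for any reading, so exactly one term contributes to the result.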
13
Modelling (on menu bar), Create Model
Input tables, Complete data, View, Cross
Graph, Table: C_Inspections, Horizontal:
Fe, Vertical CorrSI, Show, reduce, X
14
EXAKT handles events (such as oil changes,
adjustments, alignments, calibrations and other
minor maintenance) that impact condition data
in a correct manner. The instructions on the
right will display Figure 9-21 on page 95. It is
often useful to display the events and
inspections in a single table. Not the regularity
of the oil change events.
Modeling (on menu bar), Select Current
Model, CBM Model: PHM(with OC), OK,
Activate Left pane (Database explorer
pane), Modelling (on menu bar), Create
Model Input tables, Complete data,
Database pane, C_Inspections, Scroll to
record 345, reduce and close the
C_Inspections table
15
Executing the instructions on the right will
display a graph similar to that of Figure 9-22
on page 96
Modeling (on menu bar), Select Current
Model, CBM Model: PHM(noHistExcl),
Submodel: FeCorrSed, OK, Procedures
panel, Modeling, Weibull PHM, In Order of
Appearance, close the graph
16
Follow the instructions on the right and when
we scroll down to the last row, we see the
history number of the offending history in
Figure 9-22. The number is found to be 64.
Database pane, Residuals:
PHM(noHistExcl)(FeCorrSed) #1, click on
the Residual column header to order the
records by Residual, scroll down to last
row, note the History Number of 64, close
the table
17
We must identify which history of which unit is
the offending one. Following the instructions on
the right, we can find the history is the 2nd
history of unit 5509R.
Procedures panel, Decisions, All Histories,
Select History 5501L[1] (That is the first
lifetime of the left wheelmotor of haul truck
5501), hit the DnArrow key 63 times, Close

We need to examine the cause of the offending
history. The instructions on the right reproduce
Figure 9-23 on page 97. From this Figure, we
observe that the cause of offending history is
the unusually high values of Fe and Si not
explained by a failure event. A reasonable
solution to obtain a better fit model is to
assume that a maintenance event was not
properly recorded and to exclude this history
from the model.
Database pane, Inspections, scroll down to
row 2768, X


Exercise 4: data smoothing and fixing the shape factor to 1
Random fluctuation of monitored condition data characterizes
many otherwise straight-forward CBM applications. In this
exercise we use the monitored pressure test data, which reflects the
deterioration of a sealing system in a nuclear fuel rod manipulating
mechanism. For additional background and details on this
application, you may refer to the document
www.omdec.com/articles/p_paperCandu.html.

1
Download the database files from
www.omdec.com/publications/reliabilty/candu.zip
or from the OMDEC CD.
2
Start EXAKT for Modeling, Maximize EXAKT
Modeling window, File, Open, Navigate to locate
the file candu_WMOD, Data
3
Note the randomness yet increasing
nature (generally rising slope) of the data.
Although it is obvious that the item ages
in a fairly linear fashion, how does one
make a decision at any given inspection if
the data is so erratic? How do we know if
a high reading is due to noise or to a
deteriorating failure mode? The following
steps in EXAKT provide a solution to this
problem.
Activate left (database explorer view) pane,
View, Inspections, OK, Ident drop down list, hit
various idents and observe their corresponding
sets of inspection data, reduce the inspections
window, close (X) the inspections window.
4
We wish to get rid of any randomness
that is irrelevant to risk of failure. EXAKT
provides a way to perform smoothing
transformations of the data. In the
OutputVarScript window you will see a
small program that transforms the
original variable LeakRate into the
transformed variables leakSmooth and
leakSmoothAve. EXAKT's programming
language provides several smoothing
functions. Smooth() and SmoothAve() are
smoothing functions that take parameters
to adjust the way in which they transform
the variables.
Database pane, OutputVarScript, X

(Note that we have defined 4 new variables from
the original LeakRate and WorkingAge variables:
leakSmooth0, leakSmooth, leakSmoothAve0, and
leakSmoothAve.

By studying, in the Guide and Manual, the
definitions of the various EXAKT transformation
functions such as Smooth(), SmoothAve(), Last()
and NonDecr(), you will soon come
to understand how this transformation works.)
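To make the idea of a smoothing transformation concrete, here is a minimal sketch of one common technique, a trailing moving average. It is not the algorithm behind EXAKT's Smooth() or SmoothAve() functions, whose definitions are given in the manual; the function name, the window parameter, and the sample readings are all hypothetical.

```python
def trailing_moving_average(values, window=3):
    """Replace each reading by the mean of itself and the readings before it.

    At most `window` readings are averaged; early readings use as many
    predecessors as are available.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    smoothed = []
    for i in range(len(values)):
        start = max(0, i - window + 1)
        chunk = values[start:i + 1]
        smoothed.append(sum(chunk) / len(chunk))
    return smoothed


# Made-up leak-rate readings with a rising trend and random scatter.
readings = [1, 5, 3, 7]
print(trailing_moving_average(readings, window=2))
```

A wider window removes more scatter but also delays the response to a genuine change in condition, which is the trade-off any smoothing parameter has to balance.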
5
The instructions on the right generate the
decision graphs of the model built directly
on the original (untransformed) data.
Observe how much randomness there is
in the inspection data. Such randomness
may bias the model and may make it
difficult to clearly apply an optimal
decision.
A) Modeling (on menu bar), Select Current
Model, CBM Model: Seals, Submodel: LR_b1, OK,
Procedures panel, Decisions, Select Ident: 5EH1,
scroll down to last row, shift+8WH4, Report,
Close, maximize the decision graph window, click
full report icon, PageDown or PageUp, reduce
the decision graph window, X

B) Modeling (on Procedures panel), Weibull PHM,
Select Covariates, (note the variable used for
this model LR_b1 is LeakRate), Cancel, Seals
(LR_b1):3, X, Seals (LR_b1):2 X
6
The model LR_Smooth0 uses a variable
that has been smoothed by the Smooth()
function in EXAKT. On the decision
graphs, we observe that we have
eliminated the randomness of the
previous submodel. But we have another
problem. We observe a drooping
artifact[123] at the end of every history. This
causes a poor model and a poor decision
recommendation because the current
value of the condition indicator
leakSmooth0 is erroneously low! In step 7
we will correct this problem with a further
transformation.
Repeat Step 5A but select the submodel
LR_Smooth0 instead of LR_b1

Repeat Step 5B but note the variable used for
this model LR_Smooth0 is leakSmooth0, Cancel,
Seals (LR_b1):3, X, Seals (LR_b1):2 X
7
The adjusted smoothed variable produces
a better model and a better decision
recommendation. Note that the
randomness of the data is further reduced
and the drooping artifact has been
corrected.
Repeat Step 5A but this time use the submodel
LR_Smooth

Repeat Step 5B but this time note that the
variable used in the submodel LR_Smooth is
leakSmooth
9
Now that we have seen some techniques
for pre-processing data to eliminate
confusing noise, we may look more
closely at the model itself. You may be
wondering about the naming convention
we used for the model LR_Smooth_b1. The
b1 part of the name indicates that we
have fixed Beta, the shape factor, to 1.
We will proceed to learn why we did this.
Activate left (explorer) pane, Modeling (on menu
bar), Select Current Model, LR_Smooth, OK
10
We note, in carrying out the steps on the
right, that this Submodel LR_Smooth
uses the transformed variable leakSmooth
and that the Fix shape factor to 1
checkbox is unchecked.
Modeling (on Procedures panel), Weibull PHM,
Select Covariates, Cancel
11
Upon executing the steps at the right, we
note that the model is rejected by the
Kolmogorov-Smirnov test. The test is
telling us that the hypothesis that the
model is good (fits the data) must be
rejected.
Residual Analysis, Summary Report, expand and
scroll down. (note that the goodness of fit
hypothesis is rejected), reduce window, X
13
EXAKT has told us in step 8 that working
age is not significant. In fact it is highly
significant, so much so that it correlates
closely with the LeakRate. Thus EXAKT is
really telling us that the LeakRate itself
contains all the information we need, to
establish a good predictive model, and it
is telling us that we should remove the
WorkingAge factor from the model by
setting Shape to 1.
Modeling (on menu bar), Select Current Model,
LR_Smooth_b1, Modeling (on Procedures panel),
Weibull PHM, (note that the shape parameter has
been fixed to 1 for this submodel), Cancel

Residual Analysis, Summary Report, expand and
scroll down. (note that the goodness of fit
hypothesis is not rejected), reduce window, X
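The effect of fixing the shape factor to 1 can be seen from the hazard function. In the standard form of the Weibull proportional hazards model (the notation below is the conventional one and is assumed here, with t the working age, beta the shape, eta the scale, and Z_i(t) the covariates such as the smoothed leak rate):

```latex
h(t, Z(t)) = \frac{\beta}{\eta}\left(\frac{t}{\eta}\right)^{\beta-1}
             \exp\!\Big(\sum_i \gamma_i Z_i(t)\Big)

% Setting \beta = 1 removes the working-age term:
h(t, Z(t)) = \frac{1}{\eta}\,\exp\!\Big(\sum_i \gamma_i Z_i(t)\Big)
```

With beta fixed to 1, the factor containing t equals 1, so the hazard depends only on the covariate values. This is the sense in which setting Shape to 1 removes the WorkingAge factor from the model.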
14
Similar results can be found for models
LR_SmoothAve0_b1 and
LR_SmoothAve_b1. You may go ahead and
examine these models using the techniques
you have learned in this exercise.

Appendix 12.
Data for RCM Turbo
Table 14-4 RCM Turbo FAILURE MODE ANALYSIS DATA SHEET

Productive Unit Code:
Name:

Maintainable Item Code:
Name:
Unit Output Reduction: Total Stoppage Partial Stoppage/Quality No Immediate Effect
No Effect

FM #
Component/Part:
Failure Mode & Effect is:

Root Cause is:
MTBF = Confident Warning Time (98%) = Early Warning
Time (70%)=
Life/Wear: Early Life Mid Life End Life PA Effectiveness (%) =
Category: Design Lubrication Normal Operation Overload
Condition
Review Maintenance Practice Safety/Environmental
Consequence: Total Stoppage Partial Stoppage/Quality No Immediate Effect
No Effect
Characteristic: Definitive Life/Wear Out General Degradation Random-
Constant Probability
Measurability: Easy-Can monitor in fail degrade time Moderately Easy
Impossible
Strategy: CBM FTM OTF

Primary Action Description:

Job Duration: Downtime: Secondary Action Initiator:
Maintainable Item Status: Running Stop Downday
Res: Hrs: Crew size: Res: Hrs: Crew size: Res: Hrs: Crew size:
Material Cost: Consequential Damage Cost:
Estimated Cost of Downtime (if any):

Secondary Action Description:

Job Duration: Downtime:

Breakdown Action Description:

Job Duration: Downtime:

Spares (PA, SA, O/H & BD):

Strategy Notes:

Design Notes:

Materials Notes (PA, SA, O/H & BD):

Maintenance Actions/Assumptions (PA, SA, O/H & BD):

Appendix 13.
Default decision diagram answers in the absence of
operating experience
Table 14-5 The default answer to be used in developing an initial
scheduled-maintenance program in the absence of data from actual
operating experience.
The two "Stage" columns show the stage at which the question can be answered.

| Decision question | Default answer to be used in case of uncertainty | Stage: initial program (with default) | Stage: ongoing program (operating data) | Possible adverse consequences of default condition | Default consequences eliminated with subsequent operating information |
|---|---|---|---|---|---|
| IDENTIFICATION OF SIGNIFICANT ITEMS | | | | | |
| Is the item clearly nonsignificant? | No: classify item as significant | X | X | Unnecessary analysis | No |
| EVALUATION OF FAILURE CONSEQUENCES | | | | | |
| Is the occurrence of a failure evident to the operating crew during performance of normal duties? | No (except for critical secondary damage): classify function as hidden | X | X | Unnecessary inspections that are not cost-effective | Yes |
| Does the failure cause a loss of function or secondary damage that could have a direct adverse effect on operating safety and the environment? | Yes: classify consequences as critical | X | X | Unnecessary redesign or scheduled maintenance that is not cost-effective | No for the redesign; yes for scheduled maintenance |
| Does the failure have a direct adverse effect on operational capability? | Yes: classify consequences as operational (production) | X | X | Scheduled maintenance that is not cost-effective | Yes |
| EVALUATION OF PROPOSED TASKS | | | | | |
| Is an on-condition task to detect potential failures technically feasible? | Yes: include on-condition task in the program | X | X | Scheduled maintenance that is not cost-effective | Yes |
| If an on-condition task is technically feasible (effective), is it worthwhile? | Yes: assign inspection intervals short enough to make the task effective | X | X | Scheduled maintenance that is not cost-effective | Yes |
| Is a rework task to reduce the failure rate applicable? | No (unless there are real and applicable data): assign item to no scheduled maintenance | -- | X | Delay in exploiting opportunity to reduce costs | Yes |
| If a rework task is applicable, is it effective? | No (unless there are real and applicable data): assign item to no scheduled maintenance | -- | X | Unnecessary redesign (safety) or delay in exploiting opportunity | No for redesign; yes for scheduled maintenance |
| Is a discard task to avoid failures or reduce the failure rate applicable? | No (except for safe-life items): assign item to no scheduled maintenance | X (safe life only) | X (economic life) | Delay in exploiting opportunity to reduce costs | Yes |
| If a discard task is applicable, is it effective? | No (except for safe-life items): assign item to no scheduled maintenance | X (safe life only) | X (economic life) | Delay in exploiting opportunity to reduce costs | Yes |
Appendix 14.
Additional Relcode examples
Exercise 3
The cloth filter on a sugar centrifuge is currently replaced
on a preventive basis if a suitable opportunity occurs and
the cloth has been in use for at least 20 hours. The cloth is
also replaced on failure. The following data are available
for 10-hour time intervals of cloth life.

| Age in Hours | Failure Replacements | Preventive Replacements |
|---|---|---|
| 0-9.99 | 14 | 0 |
| 10-19.99 | 5 | 0 |
| 20-29.99 | 2 | 4 |
| 30-39.99 | 1 | 8 |
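Before entering grouped data such as the table above into an analysis package, it can help to tabulate how many cloths were at risk in each interval and the fraction that failed. The sketch below uses a simple actuarial life-table estimate, in which preventive replacements are treated as withdrawals spread evenly over the interval. This is a generic technique chosen here for illustration and is not necessarily the method Relcode applies; the function and variable names are hypothetical.

```python
# Counts per 10-hour interval, taken from the table above.
failures = [14, 5, 2, 1]
preventive = [0, 0, 4, 8]


def life_table(failures, preventive):
    """Return (at_risk, conditional_failure_fraction) for each interval."""
    at_risk = sum(failures) + sum(preventive)  # cloths entering interval 1
    rows = []
    for failed, withdrawn in zip(failures, preventive):
        # Actuarial assumption: withdrawals are exposed for half the interval.
        exposed = at_risk - withdrawn / 2.0
        rows.append((at_risk, failed / exposed))
        at_risk -= failed + withdrawn
    return rows


for (at_risk, fraction), label in zip(
    life_table(failures, preventive),
    ["0-9.99", "10-19.99", "20-29.99", "30-39.99"],
):
    print(f"{label:>9} h: at risk {at_risk:2d}, failure fraction {fraction:.3f}")
```

Comparing the failure fractions across intervals gives a rough first impression of whether the risk of failure rises or falls with cloth age, which is the question the preventive replacement policy depends on.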

Figure 14-12: Relcode data entry for cloth filters

Figure 14-13

Exercise 4
A metropolitan transport company operates a fleet of
similar buses. Engine failures necessitating replacement
have occurred in the kilometer ranges shown in the
following table which also shows the number of engines
currently running in each age range.

| Age Range (Kilometers) | Failure Replacements | Survivors |
|---|---|---|
| 0-49,999 | 2 | 35 |
| 50,000-99,999 | 8 | 27 |
| 100,000-149,999 | 33 | 12 |
| 150,000-199,999 | 44 | 62 |

Figure 14-14: Relcode data entry for engines

Figure 14-15
Exercise 5
A new type of car has recently been released and is subject
to warranty. An analysis of warranty claims shows several
alternator failures, although, as a proportion of the whole
population the numbers are quite small.

The available data are as follows:
| Age Range (Kilometers) | Failure Replacements | Survivors |
|---|---|---|
| 0-49,999 | 1 | 48 |
| 50,000-99,999 | 2 | 123 |
| 100,000-149,999 | 3 | 56 |
| 150,000-199,999 | 4 | 44 |

Figure 14-16: Relcode data entry for alternator failure warranties


Figure 14-17
