Introduction To RCM

Reliability Centered Maintenance
SUMMARY
This report summarizes the main elements of reliability centered maintenance (RCM). The presentation is to a great extent based on the outline of the RCM methodology by Rausand & Vatn (1998). In this presentation we have made an effort to include ideas and examples from railway applications. RCM is a method for maintenance planning developed within the aircraft industry and later adapted to several other industries and military branches. This report presents a structured approach to RCM and discusses the various steps in the approach. The availability of reliability data and operating experience is of vital importance for the RCM method, which provides a means to utilize operating experience in a more systematic way. Aspects related to the utilization of operating experience are therefore addressed specifically. In this paper, RCM is put into a risk analysis framework, taking advantage of reliability modelling in a more structured way than in more traditional RCM approaches.
TABLE OF CONTENT
SUMMARY
TABLE OF CONTENT
1 INTRODUCTION
2 A CONCEPTUAL MODEL FOR RCM
3 MAIN STEPS OF AN RCM ANALYSIS
Step 1: Study preparation
Step 2: System selection and definition
Step 3: Functional failure analysis (FFA)
Step 4: Critical item selection
Step 5: Data collection and analysis
Step 6: Failure modes, effects and criticality analysis
Step 7: Selection of Maintenance Actions
Step 8: Determination of Maintenance Intervals
Step 9: Preventive maintenance comparison analysis
Step 10: Treatment of non-MSIs
Step 11: Implementation
Step 12: In-service data collection and updating
4 DISCUSSIONS AND CONCLUSIONS
General benefits
Problem areas in the analysis
Conclusions
REFERENCES
M. Rausand and J. Vatn. Reliability Centered Maintenance. In C. G. Soares, editor, Risk and Reliability in Marine Technology. Balkema, Holland, 1998
1 INTRODUCTION
The reliability centered maintenance (RCM) concept has been on the scene for more than 20 years, and has been applied with considerable success within the aircraft industry, the military forces, the nuclear power industry, and more recently within the offshore oil and gas industry. Experiences from the use of RCM within these industries (see e.g. Sandtorv & Rausand 1991) show significant reductions in preventive maintenance (PM) costs while maintaining, or even improving, the availability of the systems. According to the Electric Power Research Institute (EPRI), RCM is "a systematic consideration of system functions, the way functions can fail, and a priority-based consideration of safety and economics that identifies applicable and effective PM tasks." The main focus of RCM is hence on the system functions, and not on the system hardware.
Several textbooks and reports presenting the RCM concept have been published. The most important books are Nowlan and Heap (1978), Moubray (1991), Smith (1993), Anderson & Neri (1990), and Moss (1985). These textbooks provide a good introduction to RCM, but most of them are somewhat imprecise in their definitions of the basic concepts. The main ideas presented in these textbooks are more or less the same, but the detailed procedures are rather different.
In all these books there is generally more focus on maintenance than on reliability. The use of reliability data sources like OREDA (1997) is not emphasized at all. Smith (1993), for example, states on page 103 that: ". . . , it is the author's experience that any introduction of quantitative reliability data or models into the RCM process only clouds the PM issue and raises credibility questions that are of no constructive value." The main objective of this paper is to present a structured approach to RCM, and to put more focus on reliability models and methods in the RCM process.
2 A CONCEPTUAL MODEL FOR RCM

Most of the available PM models (Valdez-Flores and Feldman 1989) are based on the assumptions that (1) only single units are considered, and (2) the cost of a single unit failure can easily be quantified in (discounted) monetary units. In RCM we have to consider the entire PM program, i.e. several units simultaneously. It is further required to consider failure consequences which cannot be measured directly in monetary units. In the present paper, we will split the possible failure consequences into the following four consequence classes:
S: Safety of personnel
E: Environmental impact
A: Production availability
C: Material losses (costs)
Figure 1 A conceptual model for RCM based on risk analysis
[Diagram labels: maintenance activities M1, M2, M3, ..., Mk; basic events B1, B2, B3, ...; fault tree analysis; undesired event; barriers; event tree analysis; consequences C1, C2, C3, ...; total loss; risk analysis.]
To measure all the consequences in monetary units, we have to define economic values of the life and health of persons, and of different environmental aspects. This is at best a difficult and controversial task. A new conceptual model for the RCM approach is illustrated in Figure 1. This model is based on the ideas presented by Vatn et al. (1996). The basis for the new conceptual model is a traditional risk analysis. The risk analysis approach is based on a number of so-called undesired events in the system. An undesired event is typically the outset of a possible accident, for example a gas leakage or an unintended stop of a compressor. In this context the term accident is defined in a very broad sense, including all events causing a loss related to one of the consequence classes defined above.
An undesired event may be caused by a number of basic events B1, B2, . . . . The basic events may comprise failures of technical items, human errors, and environmental impacts. The basic events are often identified and modeled by fault tree analysis. If failure rates and other necessary data are available for the basic events, the fault tree analysis will provide estimates of the frequency of occurrence of the various undesired events.
The consequences of an undesired event will normally depend on the barriers that are established to prevent escalation of the undesired event. On an oil production platform the barriers may comprise emergency shutdown (ESD) systems, pressure relief systems, fire walls, fire fighting systems, etc. The use of two or more parallel tracks in the railway infrastructure can be considered as a barrier that limits the consequences of a turnout failure. The possible consequence chains starting from an undesired event are often identified and modeled by event tree analysis, supplemented by various physical models, like fire and explosion models. The output of the event tree analysis will be a set of possible consequences C1, C2, . . . . If the necessary input data are available for the barriers and physical models, the event tree analysis will provide frequencies or probabilities of the various consequences. In order to build models for railway applications, a key element will be to know the time table, the configuration of the line, etc.
By analyzing all the undesired events in this way we will in principle end up with a complete consequence spectrum for the system, i.e. a listing of all possible consequences together with an estimated frequency of each consequence. A traditional risk analysis stops with the consequence spectrum. If possible we should, however, combine the effects of the consequences into a total loss measure (loss function). This can in some cases be done without too much controversy.
The system maintenance activities M1, M2, . . . will affect the frequencies of both the basic events and the barrier failures. An effective PM task may prevent a failure of a process unit or a barrier (e.g. a safety valve). On the other hand, failures may also be caused by human or procedural errors during maintenance. Experience has shown that many major accidents have occurred either during maintenance or because of wrongly executed maintenance.
The overall objective of RCM is to establish PM tasks that are applicable and effective with respect to the consequence classes defined above. To be effective, a PM task must therefore provide a reduced expected loss related to one or more of the four consequence classes.
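The chain from basic events to a total loss measure can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the frequencies, the barrier failure probabilities, the loss values, the choice of an OR gate with a rare-event approximation, and the simple two-barrier event tree are not taken from the paper.

```python
# Hypothetical basic-event frequencies (per year). They feed one undesired
# event through an OR gate: any single basic event triggers the event.
basic_event_freq = {"B1": 0.02, "B2": 0.01, "B3": 0.005}

# Hypothetical barriers with their probability of failing on demand.
barrier_pfd = {"first barrier": 0.01, "second barrier": 0.1}

# Hypothetical loss (arbitrary units) attached to each consequence.
loss = {"C1": 1.0, "C2": 50.0, "C3": 1000.0}


def undesired_event_frequency(freqs):
    """Fault tree with one OR gate, rare-event approximation: frequencies add."""
    return sum(freqs.values())


def consequence_spectrum(event_freq, pfd):
    """Event tree with two barriers in sequence.

    C1: first barrier works; C2: first fails, second works; C3: both fail.
    """
    p1, p2 = pfd["first barrier"], pfd["second barrier"]
    return {
        "C1": event_freq * (1 - p1),
        "C2": event_freq * p1 * (1 - p2),
        "C3": event_freq * p1 * p2,
    }


def expected_loss(spectrum, loss_per_consequence):
    """Total loss measure: frequency-weighted sum over the consequence spectrum."""
    return sum(spectrum[c] * loss_per_consequence[c] for c in spectrum)


baseline = expected_loss(
    consequence_spectrum(undesired_event_frequency(basic_event_freq), barrier_pfd),
    loss,
)

# A PM task assumed to halve the frequency of basic event B1.
with_pm = dict(basic_event_freq, B1=0.01)
improved = expected_loss(
    consequence_spectrum(undesired_event_frequency(with_pm), barrier_pfd),
    loss,
)
print(baseline, improved)
```

In this sketch a PM task counts as effective when `improved` is lower than `baseline`. A real analysis would keep the four consequence classes separate unless a common loss function can be agreed on.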
3 MAIN STEPS OF AN RCM ANALYSIS
The RCM analysis may be carried out as a sequence of activities. Some of these activities, or steps, are overlapping in time, as illustrated in Figure 2. The RCM process comprises the following steps:
1. Study preparation
2. System selection and definition
3. Functional failure analysis (FFA)
4. Critical item selection
5. Data collection and analysis
6. Failure modes, effects and criticality analysis (FMECA)
7. Selection of maintenance actions
8. Determination of maintenance intervals
9. Preventive maintenance comparison analysis
10. Treatment of non-critical items
11. Implementation
12. In-service data collection and updating
The various steps are discussed in the following sections with a focus on Steps 1-8. The time sequence of Steps 1-8 is illustrated in Figure 2. The sizes of the boxes do not reflect the required workload in the steps.
Figure 2 The RCM process
[Overlapping boxes along a time axis, in order: 1 Study prep.; 2 System selection; 3 FFA; 4 Critical item selection; 5 Data collection and analysis; 6 FMECA; 7 Maintenance tasks; 8 Maintenance intervals.]
Step 1: Study preparation
The main objectives of an RCM analysis are:
1. to identify effective maintenance tasks,
2. to evaluate these tasks by some cost-benefit analysis, and
3. to prepare a plan for carrying out the identified maintenance tasks at optimal intervals.
If a maintenance program already exists, the result of an RCM analysis will often be to eliminate inefficient maintenance tasks. Before an actual RCM analysis is initiated, an RCM project group should be established, see e.g. Moubray (1991) pp. 16-17. The RCM project group should include at least one person from the maintenance function and one from the operations function, in addition to an RCM specialist. In Step 1, Study preparation, the RCM project group should define and clarify the objectives and the scope of the analysis. Requirements, policies, and acceptance criteria with respect to safety and environmental protection should be made visible as boundary conditions for the RCM analysis. The part of the plant to be analyzed is selected in Step 2. The type of consequences to be considered should, however, be discussed and settled on a general basis in Step 1. Possible consequences to be evaluated may comprise:
(i) risk to humans,
(ii) environmental damages,
(iii) delays and cancellation of travels,
(iv) material losses or equipment damage,
(v) loss of market shares, etc.
The possible consequence classes cannot be measured in one common unit. It is therefore necessary to prioritize between means affecting the various consequence classes. Such a prioritization is not an easy task and will not be discussed in this presentation. The tradeoff problems can to some extent be solved within a decision theoretical framework (Vatn 1995 and Vatn et al. 1996).
RCM analyses have traditionally concentrated on PM strategies. It is, however, possible to extend the scope of the analysis to cover topics like corrective maintenance strategies, spare part inventories, logistic support problems, etc. The RCM project group must decide what should be part of the scope and what should be outside. The resources that are available for the analysis are usually limited. The RCM group should therefore be selective about what to look into, realizing that the analysis cost should not dominate the potential benefits.
In many RCM applications the plant already has effective maintenance programs. The RCM project will therefore be an upgrade project to identify and select the most effective PM tasks, to recommend new tasks or revisions, and to eliminate ineffective tasks. These changes are then applied within the existing programs in a way that allows the most efficient allocation of resources. When applying RCM to an existing PM program, it is best to utilize, to the greatest extent possible, established plant administrative and control procedures in order to maintain the structure and format of the current program. This approach provides at least three additional benefits:
(i) It preserves the effectiveness and success of the current program.
(ii) It facilitates acceptance and implementation of the project's recommendations when they are processed.
(iii) It allows incorporation of improvements as soon as they are discovered, without the necessity of waiting for major changes to the PM program or analysis of every system.
Step 2: System selection and definition
Before a decision to perform an RCM analysis at a plant is taken, two questions should be considered:
For which systems is an RCM analysis beneficial compared with more traditional maintenance planning?
At what level of assembly (plant, system, subsystem . . . ) should the analysis be conducted?
Regarding the first question, all systems may in principle benefit from an RCM analysis. With limited resources, we must, however, usually make priorities, at least when introducing the RCM approach in a new plant. We should start with the systems that we assume will benefit most from the analysis. The following criteria may be used to prioritize systems for an RCM analysis:
(i) The failure effects of potential system failures must be significant in terms of safety, environmental consequences, production loss, or maintenance costs.
(ii) The system complexity must be above average.
(iii) Reliability data or operating experience from the actual system, or similar systems, should be available.
Most operating plants have developed an assembly hierarchy, i.e. an organization of the system hardware elements into a structure that looks like the root system of a tree. In the offshore oil and gas industry this hierarchy is usually referred to as the tag number system. Several other names are also used. Moubray (1991), for example, refers to the assembly hierarchy as the plant register. The following terms will be used in this paper for the levels of the assembly hierarchy:
Plant: A logical grouping of systems that function together to provide an output or product by processing and manipulating various input raw materials and feed stock. An offshore gas production platform may e.g. be considered as a plant. For railway applications a plant might be a maintenance area, where the main function of that plant is to ensure satisfactory infrastructure functionality in that area. Moubray (1991) refers to the plant as a cost center.
System: A logical grouping of subsystems that will perform a series of key functions, which often can be summarized as one main function, that are required of a plant (e.g. feed water, steam supply, and water injection). The compression system on an offshore gas production platform may e.g. be considered as a system. Note that the compression system may consist of several compressors with a high degree of redundancy. Redundant units performing the same main function should be included in the same system. It is usually easy to identify the systems in a plant, since they are used as logical building blocks in the design process.
The system level is usually recommended as the starting point for the RCM process. This is further discussed and justified for example by Smith (1993) and in MIL-STD 2173. This means that on an offshore oil/gas platform the starting point of the analysis should be for example the compression system, the water injection system or the fire water system, and not the whole platform. The systems may be further broken down into subsystems, sub-subsystems, etc. For the purpose of the RCM process the lowest level of the hierarchy should be what we will call an RCM analysis item:
RCM analysis item: A grouping or collection of components which together form some identifiable package that will perform at least one significant function as a stand-alone item (e.g. pumps, valves, and electric motors). For brevity, an RCM analysis item will in the following be called an analysis item. By this definition a shutdown valve, for example, is classified as an analysis item, while the valve actuator is not. The actuator is supporting equipment to the shutdown valve, and only has a function as a part of the valve. The importance of distinguishing the analysis items from their supporting equipment is clearly seen in the FMECA in Step 6. If an analysis item is found to have no significant failure modes, then none of the failure modes or causes of the supporting equipment are important, and they therefore do not need to be addressed. Similarly, if an analysis item has only one significant failure mode, then the supporting equipment only needs to be analyzed to determine if there are failure causes that can affect that particular failure mode (Paglia et al. 1991). Therefore only the failure modes and effects of the analysis items need to be analyzed in the FMECA in Step 6. An analysis item is usually repairable, meaning that it can be repaired without replacing the whole item. In the offshore reliability database OREDA (1992) the analysis item is called an equipment unit. The various analysis items of a system may be at different levels of assembly. On an offshore platform, for example, a huge pump may be defined as an analysis item in the same way as a small gas detector. If we have redundant items, e.g. two parallel pumps, each of them should be classified as an analysis item. When we in Step 6 of the RCM process identify causes of analysis item failures, we will often find it suitable to attribute these failure causes to failures of items on an even lower level of indenture. The lowest level is normally referred to as components.
Component: The lowest level at which equipment can be disassembled without damage or destruction to the items involved. Smith (1993) refers to this lowest level as Least Replaceable Assembly (LRA), while OREDA (1997) uses the term maintainable item.
It is very important that the analysis items are selected and defined in a clear and unambiguous way in this initial phase of the RCM process, since the following analysis will be based on these analysis items. If the OREDA database is to be used in later phases of the RCM process, it is recommended as far as possible to define the analysis items in compliance with the equipment units in OREDA.
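The assembly hierarchy defined above (plant, system, analysis item, with supporting equipment kept apart) can be sketched as a small data structure. This is only an illustrative Python sketch: the class names, the field names, and the wording of the main function are assumptions; the example items (two redundant compressors, a shutdown valve with its actuator) follow the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnalysisItem:
    """Lowest level of the hierarchy used in the RCM process."""
    name: str
    # Supporting equipment only has a function as part of the analysis item,
    # so it is stored with the item and never analyzed on its own.
    supporting_equipment: List[str] = field(default_factory=list)


@dataclass
class System:
    name: str
    main_function: str  # hypothetical free-text summary of the key functions
    analysis_items: List[AnalysisItem] = field(default_factory=list)


@dataclass
class Plant:
    name: str
    systems: List[System] = field(default_factory=list)


def analysis_items(plant: Plant) -> List[str]:
    """Return 'system / analysis item' labels; supporting equipment is excluded."""
    return [
        f"{system.name} / {item.name}"
        for system in plant.systems
        for item in system.analysis_items
    ]


platform = Plant(
    name="Offshore gas production platform",
    systems=[
        System(
            name="Compression system",
            main_function="compress gas",
            analysis_items=[
                # Redundant units belong to the same system, one item each.
                AnalysisItem("Compressor A"),
                AnalysisItem("Compressor B"),
                AnalysisItem("Shutdown valve", supporting_equipment=["Valve actuator"]),
            ],
        )
    ],
)

for label in analysis_items(platform):
    print(label)
```

Keeping the actuator inside the shutdown valve record mirrors the rule that supporting equipment is only examined for failure causes affecting a significant failure mode of its analysis item.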
Step 3: Functional failure analysis (FFA)
The objectives of this step are:
(i) to identify and describe the system's required functions,
(ii) to describe input interfaces required for the system to operate, and
(iii) to identify the ways in which the system might fail to function.
Step 3(i): Identification of system functions
The objective of this step is to identify and describe all the required functions of the system. In many guidelines and textbooks (e.g. Cross 1994), it is recommended that the various functions are expressed in the same way, as a statement comprising a verb plus a noun, for example: close flow, contain fluid, transmit signal. A complex system will usually have a high number of different functions. It is often difficult to identify all these functions without a checklist. The checklist or classification scheme of the various functions presented below may help the analyst in identifying the functions. The same scheme will be used in Step 6 to identify functions of analysis items. The term item is therefore used in the classification scheme to denote either a system or an analysis item.
1. Essential functions: These are the functions required to fulfill the intended purpose of the item. The essential functions are simply the reasons for installing the item. Often an essential function is reflected in the name of the item. An essential function of a pump is for example to pump a fluid.
2. Auxiliary functions: These are the functions that are required to support the essential functions. The auxiliary functions are usually less obvious than the essential functions, but may in many cases be as important as the essential functions. Failure of an auxiliary function may in many cases be more critical than a failure of an essential function. An auxiliary function of a pump is for example containment of the fluid.
3. Protective functions: The functions intended to protect people, equipment and the environment from damage and injury. The protective functions may be classified according to what they protect, as:
safety functions
environment functions
hygiene functions
Safety protective functions are further discussed e.g. by Moubray (1991) pp. 40-42. An example of a protective function is the protection provided by a rupture disk on a pressure vessel (e.g. a separator).
4. Information functions: These functions comprise condition monitoring, various gauges and alarms, etc.
5. Interface functions: These functions apply to the interfaces between the item in question and other items. The interfaces may be active or passive. A passive interface is for example present when an item is a support or a base for another item.
6. Superfluous functions: According to Moubray (1991), "Items or components are sometimes encountered which are completely superfluous. This usually happens when equipment has been modified frequently over a period of years, or when new equipment has been overspecified." Superfluous functions are sometimes present when the item has been designed for an operational context that is different from the actual operational context. In some cases failures of a superfluous function may cause failure of other functions.
For analysis purposes the various functions of an item may also be classified as:
(a) On-line functions: These are functions operated either continuously or so often that the user has current knowledge about their state. The termination of an on-line function is called an evident failure.
(b) Off-line functions: These are functions that are used intermittently or so infrequently that their availability is not known by the user without some special check or test. The protective functions are very often off-line functions. An example of an off-line function is the essential function of an emergency shutdown (ESD) system on an oil platform. The termination of an off-line function is called a hidden failure.
Note that this classification of functions should only be used as a checklist to ensure that all relevant functions are revealed. Discussions about whether a function should be classified as essential or auxiliary etc. should be avoided. Also note that the classification of functions here is used at the system level. Later the same classification of functions is used in the failure modes, effects and criticality analysis (FMECA) in Step 6 at the analysis item level.
The system may in general have several operational modes (e.g. running, and standby), and several functions for each operational mode. The essential functions are often obvious and easy to establish, while the other functions may be rather difficult to reveal.
Step 3(ii): Functional block diagrams
The various system functions identified in Step 3(i) may be represented by functional diagrams of various types. The most common diagram is the so-called functional block diagram. A simple functional block diagram of a pump is shown in Figure 3.
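The function checklist of Step 3(i) and the on-line/off-line distinction can be captured in a few lines of Python. The sketch below is illustrative only: the names `FunctionClass`, `ItemFunction` and `unused_classes` are hypothetical, and the statement "shut down process" is an assumed verb-plus-noun wording for the ESD function.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class FunctionClass(Enum):
    """Checklist classes from Step 3(i)."""
    ESSENTIAL = "essential"
    AUXILIARY = "auxiliary"
    PROTECTIVE = "protective"
    INFORMATION = "information"
    INTERFACE = "interface"
    SUPERFLUOUS = "superfluous"


@dataclass(frozen=True)
class ItemFunction:
    statement: str               # verb plus noun, e.g. "contain fluid"
    function_class: FunctionClass
    online: bool                 # True if the user has current knowledge of its state

    def failure_visibility(self) -> str:
        """Termination of an on-line function is evident, of an off-line one hidden."""
        return "evident" if self.online else "hidden"


def unused_classes(functions: List[ItemFunction]) -> List[FunctionClass]:
    """Checklist classes for which no function has been recorded yet."""
    used = {f.function_class for f in functions}
    return [c for c in FunctionClass if c not in used]


pump_functions = [
    ItemFunction("pump fluid", FunctionClass.ESSENTIAL, online=True),
    ItemFunction("contain fluid", FunctionClass.AUXILIARY, online=True),
]
esd_function = ItemFunction("shut down process", FunctionClass.ESSENTIAL, online=False)

print(esd_function.failure_visibility())
print([c.value for c in unused_classes(pump_functions)])
```

The `unused_classes` helper reflects the intended use of the scheme as a checklist: it prompts the analyst to look for functions in classes not yet covered, without forcing a debate about which class a given function belongs to.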
Figure 3 Functional block diagram for a pump
[Diagram labels: function block "Pump fluid" inside the system boundary; inputs "Fluid in" and "El. power"; "Control system"; "Environment"; output "Fluid out".]
The necessary inputs to a function are illustrated in the functional block diagram together with the necessary control signals and the various environmental stressors that may influence the function. It is generally not required to establish functional block diagrams for all the system functions. The diagrams are, however, often considered as efficient tools to illustrate the input interfaces to a function. The functional block diagram is recommended for RCM by Smith (1993). A detailed description of this type of diagram is given by e.g. Pahl and Beitz (1984). In some cases we may want to split system functions into subfunctions on an increasing level of detail, down to functions of analysis items. The functional block diagrams may be used to establish this functional hierarchy in a pictorial manner, illustrating series-parallel relationships, possible feedbacks, and functional interfaces (Blanchard & Fabrycky 1981). Alternatives to the functional block diagram are reliability block diagrams and fault trees. Functional block diagrams are also recommended by IEC 812 as a basis for failure modes, effects and criticality analysis (FMECA) and will therefore be a basis for Step 6 in the RCM procedure.
Step 3(iii): System failure modes
The next step of the FFA is to identify and describe how the various system functions may fail. Since we will need the following concepts also in the FMECA in Step 6, we will use the term item to denote both the system and the analysis items. According to accepted standards (IEC 50(191)) failure is defined as "the termination of the ability of an item to perform a required function". British Standard BS 5760, Part 5 defines failure mode as "the effect by which a failure is observed on a failed item". It is important to realize that a failure mode is a manifestation of the failure as seen from the outside, i.e. the termination of one or more functions. In most of the RCM references the system failure modes are denoted functional failures. Failure modes may be classified in three main groups related to the function of the item:
(i) Total loss of function: In this case a function is not achieved at all, or the quality of the function is far outside what is considered acceptable.
(ii) Partial loss of function: This group may be very wide, and may range from the nuisance category almost to the total loss of function.
(iii) Erroneous function: This means that the item performs an action that was not intended, often the opposite of the intended function.
A variety of classification schemes for failure modes have been published. Some of these schemes, e.g. Blache & Shrivastava (1994), may be used in combination with the function classification scheme in Step 3(i) to ensure that all relevant system failure modes (functional failures) are identified. In the following we will need to classify failures as:
Sudden failures: Failures that could not be forecast by prior testing or examination.
Gradual failures: Failures that could be forecast by testing or examination. A gradual failure will represent a gradual drifting out of the specified range of performance values. The recognition of gradual failures requires comparison of actual device performance with a performance specification, and may in some cases be a difficult task.
An example of a gradual failure situation is illustrated in Figure 4. The specified performance is illustrated by the target value, together with the acceptable deviation from this target value. As soon as the actual performance drifts outside the acceptable deviation, we have a failure. An important type of failure is the so-called ageing failure:
Figure 4 Example of a gradual failure
[Plot of performance against time, showing the target value, the acceptable deviation around it, and the point of failure.]
Ageing failures: Failures whose probability of occurrence increases with the passage of time, as a result of processes inherent in the item. Ageing failures are also sometimes called wearout failures. An ageing failure is normally caused by some physical, chemical or other process that is deteriorating the item. These processes are usually referred to as failure mechanisms. The ageing failure is sometimes a gradual failure, meaning that the performance of the item is gradually drifting out of the specified range. In other cases the ageing failure will be sudden. The inherent resistance of the item may gradually be reduced until a failure occurs. The performance of the item may in such cases be perfect until the failure occurs.
The system failure modes (functional failures) may be recorded on a specially designed FFA-form that is rather similar to a standard FMECA form. An example of an FFA-form is presented in Figure 5.
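The gradual-failure criterion of Figure 4 (failure as soon as the performance leaves the band given by the target value and the acceptable deviation) can be written as a short Python check. The function name and the sample readings are hypothetical, and a symmetric band around the target is assumed.

```python
from typing import Optional, Sequence


def first_failure_index(
    performance: Sequence[float],
    target: float,
    acceptable_deviation: float,
) -> Optional[int]:
    """Index of the first reading outside target +/- acceptable deviation.

    Returns None when every reading stays inside the acceptable band.
    """
    for index, value in enumerate(performance):
        if abs(value - target) > acceptable_deviation:
            return index
    return None


# Hypothetical readings of a slowly degrading item, taken at regular intervals.
readings = [100.0, 99.5, 98.7, 97.9, 96.8, 95.4]
print(first_failure_index(readings, target=100.0, acceptable_deviation=4.0))
```

A sudden failure would not show up as a drift in such a series: the readings could stay at the target value until the item fails, which is why the comparison above only supports forecasting of gradual failures.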
Figure 5 Example of an FFA-form
[Form header fields: System; Ref. drawing no.; Performed by; Date; Page ... of .... Columns: Operational mode; Function; Function requirements; System failure mode; Criticality (S, E, ...).]
Rausand&Vatn In the first column of Figure 5 the various operational modes of the system are recorded. For each operational mode, all the relevant functions of the system are recorded in column 2. The performance requirements to the functions, like target values and acceptable deviations (ref. Figure 4) are listed in column 3. For each system function (in column 2) all the relevant system failure modes are listed in column 4. In column 5 a criticality ranking of each system failure mode (functional failure) in that particular operational mode is given. The reason for including the criticality ranking is to be able to limit the extent of the further analysis by disregarding insignificant system failure modes. For complex systems such a screening is often very important in order not to waste time and money. The criticality ranking depends on both the frequency/probability of the occurrence of the system failure mode, and the severity of the failure. The severity must be judged at the plant level. In the conceptual RCM model in Figure 1 the system failure modes will be undesired events. In addition the undesired events will also include accidental events (like external impacts) that are not normally identified as a loss of system function. Such events are usually identified by using various risk identification checklists. The severity ranking should be given in the four consequence classes; (S) safety of personnel, (E) environmental impact, (A) production availability, and (C) economic losses. For each of these consequence classes the severity should be ranked as for example (H) high, (M) medium, or (L) low. How we should define the borderlines between these classes, will depend on the specific application. If at least one of the four entries are (M) medium or (H) high, the severity of the system failure mode should be classified as significant, and the system failure mode should be subject to further analysis. 
The frequency of the system failure mode may also be classified in the same three classes. (H) high may for example be defined as more than once per 5 years, and (L) low as less than once per 50 years. As above, the specific borderlines will depend on the application. The frequency classes may be used to prioritize between the significant system failure modes. If all four severity entries of a system failure mode are (L) low, and the frequency is also (L) low, the criticality is classified as insignificant, and the system failure mode is disregarded in the further analysis. If, however, the frequency is (M) medium or (H) high, the system failure mode should be included in the further analysis even if all the severity ranks are (L) low, but with a lower priority than the significant system failure modes.

If we were able to define a total loss function in the conceptual model in Figure 1, the criticality of the various system failure modes (undesired events) could be assessed explicitly. This approach will, however, not be cost-efficient in most practical applications.
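The screening rules for system failure modes can be sketched in a few lines of code. This is only an illustration: the function names are hypothetical, and the default class borders (5 and 50 years) are the example values mentioned in the text, which in practice depend on the application.

```python
# Illustrative sketch of the FFA criticality screening (names are hypothetical).
SEVERITY_CLASSES = ("S", "E", "A", "C")  # safety, environment, availability, cost


def frequency_class(years_between_events, high_below=5.0, low_above=50.0):
    """Map a mean time between occurrences (years) to 'H', 'M' or 'L'.

    The borders 5 and 50 years are the example values from the text.
    """
    if years_between_events < high_below:
        return "H"
    if years_between_events > low_above:
        return "L"
    return "M"


def classify_failure_mode(severity, frequency):
    """Return 'significant', 'lower priority' or 'insignificant'.

    severity  -- dict mapping 'S', 'E', 'A', 'C' to 'H', 'M' or 'L'
    frequency -- 'H', 'M' or 'L'
    """
    ranks = [severity[c] for c in SEVERITY_CLASSES]
    if any(rank in ("M", "H") for rank in ranks):
        return "significant"        # subject to further analysis
    if frequency in ("M", "H"):
        return "lower priority"     # analysed, but after the significant ones
    return "insignificant"          # disregarded in the further analysis


all_low = {"S": "L", "E": "L", "A": "L", "C": "L"}
print(classify_failure_mode(all_low, frequency_class(100.0)))
print(classify_failure_mode(all_low, frequency_class(2.0)))
print(classify_failure_mode(dict(all_low, A="M"), frequency_class(100.0)))
```

Keeping the frequency borders as parameters reflects the remark that the borderlines depend on the specific application.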
Step 4: Critical item selection
The objective of this step is to identify the analysis items that are potentially critical with respect to the system failure modes (functional failures) identified in Step 3(iii). These analysis items are denoted functional significant items (FSI). Note that some of the less critical system failure modes have been disregarded at this stage of the analysis. Further, the two failure modes "total loss of function" and "partial loss of function" will often be affected by the same items (FSIs).

For simple systems the FSIs may be identified without any formal analysis. In many cases it is obvious which analysis items influence the system functions. For complex systems with an ample degree of redundancy or with buffers, we may need a formal approach to identify the functional significant items.

In the conceptual model in Figure 1 the analysis item failures are classified as basic events. This means that the causal analysis in the conceptual model should be pursued down to the analysis item level and not further. As explained in section 2, the basic events will also comprise events that are not classified as analysis item failures, like human errors and environmental impacts. In the conceptual model, fault tree analysis is suggested as a suitable technique for identification and modeling of basic events. Depending on the complexity of the system, other techniques like reliability block diagrams or Monte Carlo simulation (see e.g. Høyland and Rausand 1994) may be more suitable. In a petroleum production plant there are often a variety of buffers and rerouting possibilities. Rerouting will also be possible in railway applications. For such systems, Monte Carlo next event simulation may often be the only feasible approach.

If failure rates and other necessary input data are available for the various analysis items, it is usually a straightforward task to calculate the relative importance of the various analysis items based on a fault tree model or a reliability block diagram. A number of importance measures are discussed by Høyland and Rausand (1994). In a Monte Carlo model it is also rather straightforward to rank the various analysis items according to criticality. The main reason for performing this task is to screen out items that are more or less irrelevant for the main system functions, i.e. in order not to waste time and money analyzing irrelevant items.

In addition to the FSIs, we should also identify items with high failure rate, high repair costs, low maintainability, long lead time for spare parts, or items requiring external maintenance
personnel. These analysis items are denoted maintenance cost significant items (MCSI). The functional significant items and the maintenance cost significant items are together denoted maintenance significant items (MSI).

Some authors, e.g. Smith (1993), claim that such a screening of critical items should not be done; others, e.g. Paglia et al. (1991), claim that the selection of critical items is very important in order not to waste time and money. We tend to agree with both. In some cases it may be beneficial to focus on critical items, in other cases we should analyze all items.

In the FMECA analysis of Step 6, each of the MSIs will be analyzed to identify the possible impact of its failure on the four consequence classes: (S) safety of personnel, (E) environmental impact, (A) production availability, and (C) economic losses. This analysis is partly inductive and will focus on both local and system level effects. From the present step we know that a failure of an MSI may have impact on one or more of the system functions. In addition, the failure of an MSI may have several local effects and also effects on system level not involving the identified system functions. There may also be analysis items that are not classified as MSIs but that have negative effects on the system level not involving the identified system functions. This observation may be seen as an argument against screening out so-called non-critical items.
[Figure 6: system functions I, II and III obtained by functional analysis, plus a further system function X; MSIs A, B, C and N (the MSIs considered) and analysis item M.]
Figure 6 Relation between top level system functions and analysis items
In Step 6 a complete FMECA is carried out for all the MSIs. The FMECA is partly an inductive analysis that identifies all the local and system level consequences of the MSI failure modes. This means that other (top level) functions than those identified may be considered in the FMECA. This is illustrated in Figure 6, where the system function X is affected by analysis item N. On the other hand, there might be important items which are omitted from the FMECA because the corresponding top level functions were overlooked. This is the case for analysis item M in Figure 6, which has an impact on system function X. The only way to ensure that all functions are considered is to include all items in the FMECA. However, this will often lead to an overly comprehensive analysis.
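As a small illustration of ranking analysis items by importance from a reliability block diagram, the sketch below computes Birnbaum's measure by exhaustive enumeration. Birnbaum's measure is used here only as one common choice; the text does not prescribe a particular measure. The three-item structure, the item reliabilities and all function names are assumptions made for the example, and independent items are assumed.

```python
from itertools import product


def system_works(state):
    """Hypothetical structure: item A in series with the parallel pair B, C."""
    return state["A"] and (state["B"] or state["C"])


def system_reliability(p, structure=system_works):
    """System reliability by enumerating all item states (independent items)."""
    items = sorted(p)
    total = 0.0
    for combo in product([True, False], repeat=len(items)):
        state = dict(zip(items, combo))
        prob = 1.0
        for item, up in state.items():
            prob *= p[item] if up else 1.0 - p[item]
        if structure(state):
            total += prob
    return total


def birnbaum_importance(p, item, structure=system_works):
    """I_B(item) = h(item works, p) - h(item failed, p)."""
    works = dict(p, **{item: 1.0})
    failed = dict(p, **{item: 0.0})
    return system_reliability(works, structure) - system_reliability(failed, structure)


# Assumed item reliabilities for the example.
p = {"A": 0.99, "B": 0.90, "C": 0.90}
ranking = sorted(p, key=lambda item: birnbaum_importance(p, item), reverse=True)
print(ranking)
```

Exhaustive enumeration grows as 2^n and is only practical for small structures; for larger systems a fault tree tool or Monte Carlo simulation, as mentioned in the text, would be the natural choice.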
Step 5: Data collection and analysis
The data necessary for the RCM analysis may, according to Sandtorv & Rausand (1991), be categorized in the following three groups:

1. Design data
- System definition: a description of the system boundaries including all subsystems and equipment to fulfill the main functions of the system.
- System breakdown: the assembly hierarchy as described in Step 2.
- A technical description of each subsystem, such as the structure of the subsystem, capacity and functions (e.g. input and output).
- System performance requirements, e.g. desired system availability, environmental requirements.
- Requirements related to maintenance/testing, e.g. according to rules and regulations.

2. Operational data
- Performance requirements
- Operating profile (continuous or intermittent operation)
- Control philosophy (remote/local and automatic/manual)
- Environmental conditions
- Maintainability
- Calendar and accumulated operating time for overhauls
- Maintenance and downtime costs
- Recommended maintenance for each analysis item based on manufacturer specification, general guidelines or standards, or in-house recommended practice.
- Failure information. When a failure occurs the following registrations are relevant:
  - System number (tag number) of the analysis item
  - Calendar time
  - Accumulated operating time to the failure
  - Failure event
  - Failure mode
  - Failure cause
  - Failure consequences
  - Repair time (active and passive)
  - Downtime

3. Reliability data
Reliability data may be derived from the operational data. The reliability data is used to decide the criticality, to mathematically describe the failure process and to optimize the time between PM tasks. The reliability data includes:
- Mean time to failure (MTTF).
- Mean time to repair (MTTR).
- Failure rate function z(t).
- A functional relation between the value of condition monitoring information and the failure rate $z(t)$.

The failure rate function is briefly described in the following. Let $T$ denote the time from when an item is put into operation at time $t = 0$ until a potential failure occurs. The item may be either new or used when it is put into operation. In many cases the item will be put back into operation after a refurbishment or after a failure has been corrected. The uncertainty in the time to failure $T$ may be described by the distribution function $F(t) = \Pr(T \le t)$, or by the probability density function $f(t) = F'(t)$. The probability density function $f(t)$ may be expressed as:

$$f(t)\,\Delta t \approx \Pr(t < T \le t + \Delta t)$$

Hence, $f(t)\,\Delta t$ is approximately equal to the probability that the item will fail in the time interval $(t, t + \Delta t]$. The life distribution is often most effectively characterized by the so-called failure rate, or force of mortality (FOM). The failure rate function $z(t)$ may be expressed as:

$$z(t)\,\Delta t \approx \Pr(t < T \le t + \Delta t \mid T > t)$$

If we consider an item that has survived the time interval $(0, t]$, i.e. $T > t$, then the probability that the item will fail in the time interval $(t, t + \Delta t]$ is approximately $z(t)\,\Delta t$. In many cases the failure rate will be an increasing function of time, indicating that the item is deteriorating. In other cases the failure rate may be decreasing, indicating that the item is improving. There are even cases where the failure rate is decreasing in one time interval and increasing in another. In some cases we may predict the form of the failure rate curve based on knowledge about the relevant failure mechanisms. An example of a failure rate function for a deteriorating item is given in Figure 7.

A popular class of life distributions is the Weibull distribution, where the failure rate function is given by:

$$z_W(t) = \alpha\lambda(\lambda t)^{\alpha - 1}$$

where $\alpha > 0$ is the shape parameter and $\lambda > 0$ is the scale parameter.
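As a small numerical illustration, the Weibull failure rate can be evaluated for a few shape parameters. The sketch assumes the parameterization $z_W(t) = \alpha\lambda(\lambda t)^{\alpha-1}$ with shape parameter $\alpha$ and scale parameter $\lambda$; the function name and the numbers are chosen for the example only.

```python
def weibull_failure_rate(t, alpha, lam):
    """Weibull failure rate z_W(t) = alpha * lam * (lam * t)**(alpha - 1), t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1.0)


lam = 0.1  # assumed scale parameter (per year), for illustration only
for alpha in (0.5, 1.0, 2.0):
    early = weibull_failure_rate(5.0, alpha, lam)
    late = weibull_failure_rate(20.0, alpha, lam)
    if late > early:
        trend = "increasing"
    elif late < early:
        trend = "decreasing"
    else:
        trend = "constant"
    print(f"alpha={alpha}: z(5)={early:.4f}, z(20)={late:.4f} ({trend})")
```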
When $\alpha > 1$, the failure rate function $z_W(t)$ is seen to be increasing, meaning that the item is deteriorating. It is also seen that the degree of deterioration increases with $\alpha$. When $\alpha < 1$, the failure rate function $z_W(t)$ is decreasing, meaning that the item is improving. When $\alpha = 1$, the failure rate is constant, meaning that failures are truly random.

[Figure 7: failure rate plotted against time, with the wearout limit marked.]
Figure 7 Failure rate with identifiable wearout limit

In some cases the value of the shape parameter $\alpha$ may be estimated based on knowledge about the relevant failure mechanisms, i.e. based on expert judgment. Several other life distributions are available. Among the most popular are the lognormal distribution, the Birnbaum-Saunders distribution, and the inverse Gaussian distribution. All of these distributions are rather flexible, and may be used for detailed modeling of specific failure mechanisms. For the purpose of this paper, however, the class of Weibull distributions is sufficiently flexible to be the preferred distribution. In the rest of this paper we will therefore assume that the time to failure follows a Weibull distribution.

The various reliability parameters may be estimated from relevant operational data. Estimation techniques are thoroughly discussed in Høyland and Rausand (1994). The operational and reliability data are collected from available operating experience and from external files where reliability information from systems with similar design and operating conditions can be found (e.g. data banks, data handbooks, field data from own data storage, manufacturers' recommendations). The external information available should be considered carefully before it is used, because such information is generally available at a much
coarser level than what is indicated in points (2) and (3) above. The following three points should be considered before reliability data is used:
- What are the system boundaries for the system (analysis item) from which the data originate?
- What are the specific operating and maintenance features that may influence the validity of the data?
- Is the time scale used calendar time, operating time, or some other time concept?

A most valuable source of reliability data is the OREDA handbook (1997) and the OREDA database. OREDA contains data from a wide range of offshore equipment. The data has been collected mainly from platform maintenance records from the whole North Sea area and from the Mediterranean Sea. The handbook presents generic data, while more detailed, manufacturer specific data is available in the database. The OREDA database is, however, available only to the participants in the OREDA project.

At the outset of the analysis, the relevant reliability data may often be scarce, because of little or no operating experience. The initial information used may, however, later be adjusted based on updated information and experience. In some situations there is a complete lack of reliability data. This is the case when developing a maintenance program for new systems. The maintenance program development starts long before the equipment enters service. Helpful sources of information can then be experience data from similar equipment, directions from manufacturers and results from testing. The RCM method will even in this situation provide useful information.

A successful application of RCM requires an extensive amount of information. Both qualitative and quantitative data are required. A systematic approach to the collection phase is essential. The results of the total RCM process depend highly on the quality of the input data.
Step 6: Failure modes, effects and criticality analysis
The objective of this step is to identify the dominant failure modes of the MSIs identified during Step 4.
[Figure 8: FMECA form. Columns: MSI; Operational mode; Function; Failure mode; Effect of failure (consequence class S, E, A, C); Worst case probability (S, E, A, C); MTTF; Criticality; Failure cause; Failure mechanism; %MTTF; Failure characteristic; Maintenance action; Failure characteristic measure; Recommended interval.]
Figure 8 RCM FMECA-form

A wide variety of different FMECA forms are used in the main RCM references. The FMECA form used in our approach is presented in Figure 8. The various columns in this FMECA form are discussed below:
- MSI: This will typically be the analysis item number in the assembly hierarchy (tag number), optionally with a descriptive text.
- Operational mode: The MSI may have various operational modes, for example running and standby.
- Function: For each operational mode, the MSI may have several functions. A function of a standby water supply pump is for example to start upon demand.
- Failure mode: A failure mode is the manner by which a failure is observed, and is defined as non-fulfillment of one of the equipment functions.
- Effect of failure/Severity class: The effect of a failure is described in terms of the worst case outcome with respect to safety (S), environmental impact (E), production availability (A), and direct economic cost (C). The effect can be specified either by means of consequence classes or by some numerical severity measure. A failure of an MSI will not necessarily give a worst case outcome, due to e.g. redundancy, buffer capacities, etc. A conditional likelihood field is therefore introduced.
- Worst case probability: The worst case probability is defined as the probability that an equipment failure will give the worst case outcome. To obtain a numerical probability measure, a system model is required. This will often be inappropriate at this stage of the analysis, and a descriptive measure may be used. Proposed classes are serial, redundancy, cold standby, hot standby, and buffer.
- MTTF: Mean time to failure for each failure mode is recorded. Either a numerical measure or likelihood classes may be used.
- Criticality: The criticality field is used to mark the dominant failure modes according to some criticality measure. A criticality measure should take failure effect, worst case probability and MTTF into account. "Yes" is used to mark the dominant failure modes.

The information described so far should be entered for all failure modes. A screening may now be appropriate, giving only dominant failure modes, i.e. items with high criticality. For the dominant failure modes the following fields are required:
- Failure cause: For each failure mode there may be several failure causes. An MSI failure mode will typically be caused by one or more component failures. Note that supporting equipment to the MSIs entered in the FMECA form is considered for the first time at this step. In this context a failure cause may therefore be a failure mode of a supporting equipment. A "fail to close" failure of a safety valve may for example be caused by a broken spring in the fail-safe actuator.
- Failure mechanism: For each failure cause, there are one or several failure mechanisms. Examples of failure mechanisms are fatigue, corrosion, and wear.
- %MTTF: The MTTF was entered on an MSI failure mode level. It is also relevant to enter the MTTF for each failure mechanism. To simplify, a percentage is given, and the MTTF can be calculated for each failure mechanism. The %MTTF will obviously be only an approximation, since the failure mechanisms usually are strongly interdependent.
- Failure characteristic: Failure propagation may be divided into three classes:
  1. The failure propagation can be measured by one or several (condition monitoring) indicators. The failure is referred to as a gradual failure.
  2. The failure probability is age-dependent, i.e. there is a predictable wearout limit. The failure is referred to as an ageing failure.
  3. Complete randomness. The failure cannot be predicted by either condition monitoring indicators or by measuring the age of the item. The time to failure can only be described by an exponential distribution, and the failure is referred to as a sudden failure.
- Maintenance action: For each failure mechanism, an appropriate maintenance action may hopefully be found by the decision logic in Step 7. This field can thus not be completed until Step 7 is performed.
- Failure characteristic measure: For gradual failures, the condition monitoring indicators are listed by name. Ageing failures are described by an ageing parameter, i.e. the shape parameter ($\alpha$) in the Weibull distribution is recorded.
- Recommended maintenance interval: The identified maintenance action is performed at intervals of fixed length. The length of the interval is found in Step 8.
Step 7: Selection of Maintenance Actions
This phase is the most novel compared to other maintenance planning techniques. A decision logic is used to guide the analyst through a question-and-answer process. The input to the RCM decision logic is the dominant failure modes from the FMECA in Step 6. The main idea is, for each dominant failure mode, to decide whether a preventive maintenance task is suitable, or whether it is best to let the item deliberately run to failure and afterwards carry out a corrective maintenance task. There are generally three reasons for doing a preventive maintenance task:
(a) to prevent a failure
(b) to detect the onset of a failure
(c) to discover a hidden failure

Only the dominant failure modes are subjected to preventive maintenance. To obtain appropriate maintenance tasks, the failure causes or failure mechanisms should be considered. The idea of performing a maintenance task is to prevent a failure mechanism from causing a failure. Hence, the failure mechanisms behind each of the dominant failure modes should be entered into the RCM decision logic to decide which of the following basic maintenance tasks is applicable:
1. Continuous on-condition task (CCT)
2. Scheduled on-condition task (SCT)
3. Scheduled overhaul (SOH)
4. Scheduled replacement (SRP)
5. Scheduled function test (SFT)
6. Run to failure (RTF)

Continuous on-condition task (CCT) is a continuous monitoring of an item to find any potential failures. An on-condition task is applicable only if it is possible to detect reduced failure resistance for a specific failure mode from the measurement of some quantity.
Example:
A distance gauge might be used to measure the distance between the switch point and stock rail to detect that the 3 mm limit will be reached. At a predefined level (i.e. 2.7 mm), the system alerts the maintenance crew, which carries out an appropriate maintenance action.
Scheduled on-condition task (SCT) is a scheduled inspection of an item at regular intervals to find any potential failures. There are three criteria that must be met for an on-condition task to be applicable:
1. It must be possible to detect reduced failure resistance for a specific failure mode.
2. It must be possible to define a potential failure condition that can be detected by an explicit task.
3. There must be a reasonably consistent age interval between the time of potential failure and the time of failure.
Example: A manual inspection every second month will reveal whether the 3 mm limit is about to be reached. Appropriate maintenance action can then be issued.
There are two disadvantages of a scheduled versus a continuous on-condition task:
- The man-hour cost of inspection is often larger than the cost of installing the sensor.
- Since the scheduled inspection is carried out at fixed points of time, one might miss situations where the degradation is faster than anticipated.
An advantage of a scheduled on-condition task is that the human operator is able to sense information that a physical sensor will not be able to detect. This means that traditional walk-around checks should not be skipped entirely even if sensors are installed. Condition monitoring is discussed in Nowlan & Heap (1978), and statistical models are presented in e.g. Aven (1992) and Valdez-Flores & Feldman (1989).
Scheduled overhaul (SOH) is a scheduled overhaul of an item at or before some specified age limit, and is often called "hard time maintenance". An overhaul task can be considered applicable to an item only if the following criteria are met (Nowlan & Heap 1978):
1. There must be an identifiable age at which the item shows a rapid increase in the item's failure rate function.
2. A large proportion of the units must survive to that age.
3. It must be possible to restore the original failure resistance of the item by reworking it.
Examples:
- Rehabilitation of wooden sleeper borings every three years.
- Lubrication of the chair/slideplate every three days.
- Cleaning every month.
Example: Replacement of the motor every year. The motor is then either overhauled to a good-as-new condition, or replaced in the maintenance depot.
Scheduled function test (SFT) is a scheduled inspection of a hidden function to identify any failure. A scheduled function test task is applicable to an item under the following conditions (Nowlan & Heap 1978):
1. The item must be subject to a functional failure that is not evident to the operating crew during the performance of normal duties.
2. The item must be one for which no other type of task is applicable and effective.
Example: Sighting or hammer blow every year to detect loose lockspikes fastening chairs/baseplates on wooden sleepers.
Scheduled replacement (SRP) is scheduled discard of an item (or one of its parts) at or before some specified age limit. A scheduled replacement task is applicable only under the following circumstances (Nowlan & Heap 1978): 1. The item must be subject to a critical failure. 2. Test data must show that no failures are expected to occur below the specified life limit. 3. The item must be subject to a failure that has major economic (but not safety) consequences. 4. There must be an identifiable age at which the item shows a rapid increase in the failure rate function. 5. A large proportion of the units must survive to that age.
Run to failure (RTF) is a deliberate decision to run to failure because the other tasks are not possible or the economics are less favorable. In many situations one maintenance task may prevent several failure mechanisms. For example function testing of an ESD-valve (with an offline function) will reveal any failure mechanisms causing a hidden failure. Hence in some situations it is better to put failure modes rather than failure mechanisms into the RCM decision logic. Note also that if a failure cause for a dominant failure mode corresponds to a supporting equipment, the supporting equipment should be defined as the item to be entered into the RCM decision logic. The criteria given for using the various tasks should only be considered as guidelines for selecting an appropriate task. A task might be found appropriate even if some of the criteria are not fulfilled.
The RCM decision logic is shown in Figure 9. Note that this logic is much simpler than those found in standard RCM references, e.g. Moubray (1991). It should be emphasized that such a logic can never cover all situations. For example, in the situation of a hidden function with ageing failures, a combination of scheduled replacements and function tests is required.
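The question-and-answer logic of Figure 9 can also be written as a short function. This is a sketch of the simplified logic only: the function and parameter names are hypothetical, and, as emphasized above, such a logic cannot cover every situation (for example the combination of replacements and function tests for hidden functions with ageing failures).

```python
def select_maintenance_task(indicator_exists, continuous_monitoring_feasible,
                            ageing_parameter, overhaul_feasible, function_hidden):
    """Return the task code suggested by the simplified RCM decision logic.

    indicator_exists               -- a failure alerting measurable indicator exists
    continuous_monitoring_feasible -- the indicator can be monitored continuously
    ageing_parameter               -- Weibull shape parameter (alpha)
    overhaul_feasible              -- the item can be restored by overhaul
    function_hidden                -- the failure is not evident during normal duties
    """
    if indicator_exists:
        return "CCT" if continuous_monitoring_feasible else "SCT"
    if ageing_parameter > 1:
        return "SOH" if overhaul_feasible else "SRP"
    if function_hidden:
        return "SFT"
    return "RTF"  # no PM activity found


# Example: no indicator, wear-out behaviour (alpha = 2.5), overhaul not feasible.
print(select_maintenance_task(False, False, 2.5, False, False))
```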
[Figure 9: decision tree, summarized below.]
Does a failure alerting measurable indicator exist?
  Yes: Is continuous monitoring feasible?
    Yes: Continuous on-condition task (CCT)
    No: Scheduled on-condition task (SCT)
  No: Is ageing parameter $\alpha > 1$?
    Yes: Is overhaul feasible?
      Yes: Scheduled overhaul (SOH)
      No: Scheduled replacement (SRP)
    No: Is the function hidden?
      Yes: Scheduled function test (SFT)
      No: No PM activity found (RTF)
Figure 9 Maintenance Task Assignment/Decision logic

Step 8: Determination of Maintenance Intervals
The RCM decision logic was qualitatively used to establish preventive maintenance tasks. These tasks are performed at times $k\tau$, $k = 1, 2, \ldots$ Hence, for each task, the optimal interval $\tau$ should be decided. When balancing costs, we realize that the preventive maintenance cost increases with decreasing $\tau$, and the cost of unplanned failures decreases with decreasing $\tau$. In this presentation, only a few simple models are discussed. Model 1 and Model 2 are appropriate for scheduled rework/replacement tasks, Model 3 may be used for scheduled function testing, and Models 4 and 5 address scheduled and continuous on-condition tasks. For more general models, see Valdez-Flores & Feldman (1989).

To determine the optimal interval $\tau$ some crucial information is required. First we need information about cost structures, i.e. the total cost of the preventive maintenance action and the total cost of a failure which the maintenance action was supposed to prevent. Note that the models are developed for single unit systems; for redundant systems a unit failure need not give a system failure. If the cost of a system failure is $c_s$, then the cost element to use in the model should be $p \cdot c_s + c_r$, where $p$ is the probability that a failure of the (redundant) unit will cause a system failure, and $c_r$ is the repair/replacement cost of the unit.

In addition to cost structures, information about the actual failure distribution is necessary. This information will typically be the mean time to failure (MTTF), and the shape parameter $\alpha$ for units where ageing, wear, corrosion etc. are
present. Note that the failure information should be obtained at a failure cause level, i.e. corresponding to the failure cause the preventive maintenance task is designed for.

Model 1 - Minimal repair policy
The minimal repair policy describes a single unit system subjected to preventive replacement at periods of fixed length $\tau$. To be formal, the unit is put into operation at time $t = 0$, and replaced at times $k\tau$ for $k = 1, 2, \ldots$ If the unit fails in an interval $((k-1)\tau, k\tau]$, a minimal repair, see e.g. Høyland and Rausand (1994), is performed. The situation where the unit is replaced upon a failure in $((k-1)\tau, k\tau]$, i.e. a block replacement policy, is discussed in Model 2. The total cost of a minimal repair is denoted $c_m$ and the total cost of a replacement is $c_p$. It will be convenient to introduce the cost ratio $c_p/c_m$. Typically $c_p/c_m \gg 1$, or at least $c_p/c_m > 1$. In special situations we can even have $c_p/c_m < 1$. The expected cost per unit time is:

$$C(\tau) = \frac{c_p + c_m W(\tau)}{\tau}$$
where $W(t) = E(N(t))$ is the expected number of failures in $(0, t]$. We will consider a so-called Weibull process with $W(t) = (\lambda t)^{\alpha}$. In this case the time from $t = 0$ until the first failure has a Weibull distribution with survivor function $R(t) = \exp(-(\lambda t)^{\alpha})$. It can be shown that the expected cost per unit time, $C(\tau)$, is minimized when:

$$\tau = \frac{1}{\lambda}\left(\frac{c_p}{c_m(\alpha - 1)}\right)^{1/\alpha}$$

provided $\alpha > 1$. Hence, to optimize the replacement interval, estimates for the parameters $c_m$, $c_p$, $\lambda$ and $\alpha$ are required. $c_m$ is the total cost of a minimal repair, including any harm to material, personnel and environment. Assessing a value of $c_m$ may therefore cause controversies. $\lambda$ and $\alpha$ are the parameters in the failure distribution of the item. Often it is more convenient to specify the failure distribution in terms of mean time to failure (MTTF) and the shape parameter $\alpha$, yielding:

$$\tau = \frac{\mathrm{MTTF}}{\Gamma(1/\alpha + 1)}\left(\frac{c_p}{c_m(\alpha - 1)}\right)^{1/\alpha}$$

Table 1 Optimal replacement interval relative to MTTF. For a given value of the shape parameter $\alpha$ and of the cost ratio $c_u/c_p$, the table entry should be multiplied by MTTF to give the optimum replacement interval length.

Cost ratio                    Shape parameter α
c_u/c_p      1.2      1.5      1.7      2.0      2.5      3.0      4.0
      2    13.40     8.19     6.60     4.84     0.99     0.82     0.75
      3    12.20     7.97     1.59     0.86     0.71     0.67     0.66
      4    12.95     1.22     0.83     0.67     0.60     0.59     0.61
      5    12.62     0.85     0.66     0.57     0.54     0.54     0.57
      7    2.050    0.590    0.503    0.464    0.461    0.478    0.523
     10    0.897    0.432    0.389    0.377    0.394    0.421    0.476
     20    0.393    0.253    0.247    0.259    0.294    0.331    0.398
     50    0.165    0.133    0.141    0.161    0.202    0.242    0.316
    100    0.090    0.083    0.093    0.113    0.152    0.192    0.265
    200    0.050    0.052    0.061    0.080    0.115    0.152    0.223
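The closed-form optimum of the minimal repair policy (Model 1) is easy to evaluate numerically. The sketch below assumes the Weibull process $W(t) = (\lambda t)^{\alpha}$ and the relation $\mathrm{MTTF} = \Gamma(1/\alpha + 1)/\lambda$; the function names and the example figures are assumptions for illustration.

```python
import math


def cost_rate_minimal_repair(tau, mttf, alpha, c_p, c_m):
    """Expected cost per unit time, C(tau) = (c_p + c_m * (lam*tau)**alpha) / tau."""
    lam = math.gamma(1.0 / alpha + 1.0) / mttf
    return (c_p + c_m * (lam * tau) ** alpha) / tau


def optimal_interval_minimal_repair(mttf, alpha, c_p, c_m):
    """Replacement interval minimizing C(tau); only defined for alpha > 1."""
    if alpha <= 1.0:
        raise ValueError("an optimum exists only for alpha > 1")
    lam = math.gamma(1.0 / alpha + 1.0) / mttf
    return (c_p / (c_m * (alpha - 1.0))) ** (1.0 / alpha) / lam


# Assumed example: MTTF = 10 years, alpha = 2, replacement costs 4 minimal repairs.
tau_opt = optimal_interval_minimal_repair(mttf=10.0, alpha=2.0, c_p=4.0, c_m=1.0)
print(round(tau_opt, 2), round(cost_rate_minimal_repair(tau_opt, 10.0, 2.0, 4.0, 1.0), 3))
```

Note that Table 1 refers to the cost ratio $c_u/c_p$ of the block replacement policy, so its entries are not expected to coincide with the values returned by this Model 1 formula.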
Model 2 Block replacement policy The block replacement policy describes a singleunit system put into operation at time t =
0. The unit is replaced at times kt for k=1,2,. . . and at failures. The cost of a planned replacement is denoted cp, and the total cost of
19
Rausand&Vatn an unplanned replacement, i.e. a failure is cu. Let W(t) denote the renewal function, see e.g. Hyland and Rausand (1994), for the lifetime distribution of the unit. The average cost per unit time is: &( ) = F S + FX: (W )
Reliability Centered Maintenance degradation in performance, or some indicator variable is alerting about the failure.
Performance/ Condition Point where we can find out that vvshvyvt("potential failure") Target value Acceptable deviation F Failure Time
P-F interval
If the times between failures are Weibull distributed, W(t) can be found by the algorithm given by Smith and Leadbetter (1963). Numerical methods are, however, required to find the optimal interval $\tau$. In Table 1, numerical values for the optimal replacement interval are given relative to MTTF. In order to use Table 1, the value of the ageing parameter $\alpha$ must be specified. During the data analysis in Step 5 the value of $\alpha$ should have been found, e.g. by expert judgment.

Model 3 - Functional testing

This model is appropriate for scheduled function testing. Consider a protective device with a constant failure rate $\lambda$. A functional test of the device is performed at times $k\tau$ for k = 1, 2, ... The cost of a functional test is $c_t$. If a failure is detected upon a test, the device is replaced at a cost of $c_r$. Further assume that the device is demanded with a frequency f, i.e. the rate of critical situations. A hazardous situation occurs if the protective device fails upon a demand. The total cost of such a situation is $c_h$. The expected cost per unit time is:

$$C(\tau) \approx \frac{c_t}{\tau} + c_r\left(\lambda - \frac{\lambda^2 \tau}{2}\right) + \frac{f c_h \lambda \tau}{2}$$

yielding an optimal interval $\tau$:

$$\tau = \sqrt{\frac{2 c_t}{\lambda\,(f c_h - \lambda c_r)}} = \sqrt{\frac{2\,\mathrm{MTTF}\,c_t}{f c_h - c_r/\mathrm{MTTF}}}$$

Model 4 - Scheduled on-condition tasks, the concept of P-F intervals

The idea behind a scheduled on-condition task is that a coming failure is announced by some detectable deterioration before the failure occurs.

Figure 10 P-F interval

In Figure 10 the performance is viewed as a function of time. The point P is the first point in time where we are able to reveal the onset of a failure. When the performance falls below some limiting value, a failure occurs. The time from when a potential failure becomes detectable until the failure occurs is denoted the P-F interval. The length of the P-F interval is assumed to vary from time to time, and is therefore modeled as a random variable. In order to establish an optimal maintenance interval, $\tau$, the following quantities must be defined:

D: Delay time, i.e. the time from a potential failure is revealed until an appropriate corrective action is completed. For simplicity the delay time is considered as a deterministic quantity.
$c_i$: Cost of (manual) inspection.
$c_u$: Cost of (unplanned) failure.
$T_{PF}$: P-F interval (random variable).
$\mu$: $E(T_{PF})$ = Mean value of P-F interval.
$\sigma$: $SD(T_{PF})$ = Standard deviation of P-F interval.
MTTF: Mean time to failure if no corrective maintenance is carried out.

The expected cost per unit time is given by:

$$C(\tau) = \frac{1}{\tau}\left[c_i + c_u \int_0^{\tau} \lambda e^{-\lambda t}\,\Pr(T_{PF} < \tau - t + D)\,dt\right] \qquad (1)$$

where $\lambda = 1/(\mathrm{MTTF} - \mu)$, and we have assumed that the time from the component is in a perfect state until a potential failure is revealed is exponentially distributed. In order to optimize Eq. (1), numerical values are required for $c_i$, $c_u$, MTTF, D, $\mu$ and $\sigma$. Numerical methods are usually required to optimize Eq. (1). The calculations will be simplified if we choose a distribution for $T_{PF}$ with a closed form of the cumulative distribution function.

Model 5 - Continuous on-condition tasks

The idea of continuous on-condition monitoring is to measure one or more indicator variables. The readings obtained in this manner can be used to detect a coming failure. The variable being monitored is denoted X(t) in Figure 11.
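As a numerical illustration of Model 3 above, the following minimal sketch evaluates the cost rate and the test interval that minimises it. It assumes the approximate cost form (test cost, plus replacement of failures found at tests, plus demands that arrive while the device is failed), which is only reasonable when the failure rate times the test interval is small. The function names and all numbers are hypothetical and are not taken from the report.

```python
import math


def model3_cost(tau, c_t, c_r, c_h, f, lam):
    """Approximate expected cost per unit time for test interval tau.

    Assumed form (valid when lam * tau is small):
    test cost + replacement cost of failures found at tests
    + cost of demands that arrive while the device is failed.
    """
    test_cost = c_t / tau
    replacement_cost = c_r * (lam - lam ** 2 * tau / 2)
    hazard_cost = f * c_h * lam * tau / 2
    return test_cost + replacement_cost + hazard_cost


def model3_optimal_interval(c_t, c_r, c_h, f, lam):
    """Interval that minimises model3_cost, from setting dC/dtau = 0."""
    margin = f * c_h - lam * c_r
    if margin <= 0:
        raise ValueError("no finite optimum: need f * c_h > lam * c_r")
    return math.sqrt(2 * c_t / (lam * margin))


# Hypothetical illustration values (not taken from the report).
params = dict(c_t=100.0, c_r=1000.0, c_h=1.0e6, f=1.0e-3, lam=1.0e-4)
tau_opt = model3_optimal_interval(**params)
print(f"optimal test interval: {tau_opt:.1f} time units")
print(f"cost rate at optimum:  {model3_cost(tau_opt, **params):.3f}")
```

Because the test cost falls with the interval while the hazard cost grows with it, the cost curve is convex and the closed-form interval can be checked by evaluating the cost slightly to either side of it.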
Figure 11 Continuous monitoring (X(t) plotted against time, with an "Action Limit" and a "Failure Limit"; a failure occurs when X(t) reaches the failure limit)

In Figure 11 the deteriorating process is shown. Here X(t) can be interpreted as the cumulative damage at time t. When the damage exceeds some limit, a failure occurs. In Figure 11 we have also shown an action limit, at which a maintenance action is to be taken. The challenge here is to decide the optimal action limit. No general approach seems applicable here since the solution is highly dependent on how X(t) is modeled. Aven (1992) discusses one method where an underlying shock model is assumed.

Step 9: Preventive maintenance comparison analysis

Two overriding criteria for selecting maintenance tasks are used in RCM. Each task selected must meet two requirements:

- It must be applicable
- It must be effective

Applicability: meaning that the task is applicable in relation to our reliability knowledge and in relation to the consequences of failure. If a task is found based on the preceding analysis, it should satisfy the applicability criterion. A PM task will be applicable if it can eliminate a failure, or at least reduce the probability of occurrence to an acceptable level (Hoch 1990), or reduce the impact of failures.

Cost-effectiveness: meaning that the task does not cost more than the failure(s) it is going to prevent. The PM task's effectiveness is a measure of how well it accomplishes that purpose and whether it is worth doing. Clearly, when evaluating the effectiveness of a task, we are balancing the cost of performing the maintenance against the cost of not performing it. In this context, we may refer to the cost as follows (Hoch 1990):

1. The cost of a PM task may include:
- the risk of maintenance personnel error, e.g. maintenance introduced failures
- the risk of increasing the effect of a failure of another component while the one is out of service
- the use and cost of physical resources
- the unavailability of physical resources elsewhere while in use on this task
- production unavailability during maintenance
- unavailability of protective functions during maintenance of these
- the increased risk to maintenance personnel: the more maintenance you do, the more risk you will expose your maintenance personnel to

2. On the other hand, the cost of a failure may include:
- the consequences of the failure should it occur (i.e. loss of production, possible violation of laws or regulations, reduction in plant or personnel safety, or damage to other equipment)
- the consequences of not performing the PM task even if a failure does not occur (i.e. loss of warranty)
- increased premiums for emergency repairs (such as overtime, expediting costs, or high replacement power cost)

Balancing the various cost elements to achieve a global optimum will always be a challenge. The conceptual RCM model in Figure 1 may be a starting point. If such a model could be established, and the various cost elements incorporated, the trade-off analysis is reduced to an optimization problem with a precisely defined mathematical model. Often the resources available for the RCM analysis do not permit building such an overall model, hence we cannot expect to achieve a global optimum. Sub-optimization can to some extent be achieved by simplifying the model in Figure 1. For example, one could consider only one consequence at a time and/or only one maintenance task at a time.
Step 10: Treatment of non-MSIs

In Step 4 critical items (MSIs) were selected for further analysis. A remaining question is what to do with the items which are not analyzed. For plants already having a maintenance program it is reasonable to continue this program for the non-MSIs. If a maintenance program is not in effect, maintenance should be carried out according to vendor specifications if they exist, else no maintenance should be performed. See Paglia et al. (1991) for further discussion.

Step 11: Implementation

A necessary basis for implementing the result of the RCM analysis is that the organizational and technical maintenance support functions are available. A major issue is therefore to ensure the availability of the maintenance support functions. The maintenance actions are typically grouped into maintenance packages, each package describing what to do, and when to do it. As indicated in the outset of this paper, many accidents are related to maintenance work. When implementing a maintenance program it is therefore of vital importance to consider the risk associated with the execution of the maintenance work. Checklists could be used to identify potential risks involved with maintenance work:

- Can maintenance people be injured during the maintenance work?
- Is a work permit required for execution of the maintenance work?
- Are means taken to avoid problems related to re-routing, by-passing etc.?
- Can failures be introduced during maintenance work?
- etc.

Task analysis, see e.g. Kirwan & Ainsworth (1992), may be used to reveal the risk involved with each maintenance job. See Hoch (1990) for a further discussion on implementing the RCM analysis results.

Step 12: In-service data collection and updating

As mentioned earlier, the reliability data we have access to at the outset of the analysis may be scarce, or even next to none. In our opinion, one of the most significant advantages of RCM is that we systematically analyze and document the basis for our initial decisions, and, hence, can better utilize operating experience to adjust those decisions as operating experience data is collected. The full benefit of RCM is therefore only achieved when operation and maintenance experience is fed back into the analysis process. The process of updating the analysis results is also important because nothing remains constant, as seen from the following arguments (Smith 1993):

- The system analysis process is not perfect and requires periodic adjustments.
- The plant itself is not a constant since design, equipment and operating procedures may change over time.
- Knowledge grows, both in terms of understanding how the plant equipment behaves and how technology can increase availability and reduce costs.

Reliability trends are often measured in terms of a non-constant ROCOF (rate of occurrence of failures), see e.g. Høyland & Rausand (1994). The ROCOF measures the probability of failure as a function of calendar time, or global time since the plant was put into operation. The ROCOF may change over time, but within one cycle the ROCOF is assumed to be constant. This means that analysis updates should be so frequent that the ROCOF is fairly constant within one period. In contrast to the ROCOF, the failure rate, or FOM (force of mortality), measures the probability of failure as a function of local time, i.e. the time elapsed since the last repair/replacement. The FOM, however, cannot be considered constant; if it were, there would be no rationale for performing scheduled replacement/repair. The updating process should be concentrated on three major time perspectives (Sandtorv & Rausand 1991):

- Short term interval adjustments
- Medium term task evaluation
- Long term revision of the initial strategy
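The requirement above that the ROCOF be fairly constant within one update period can be checked roughly from a failure log. The sketch below is only an illustration under simple assumptions: failures are counted in equal-length calendar periods and the ROCOF in each period is estimated as the count divided by the period length. The function name and the failure times are hypothetical.

```python
def rocof_per_period(failure_times, period_length, n_periods):
    """Estimate the ROCOF in each period as (number of failures) / (period length).

    failure_times are measured in calendar (global) time from start of operation.
    """
    counts = [0] * n_periods
    for t in failure_times:
        k = int(t // period_length)
        if 0 <= k < n_periods:
            counts[k] += 1
    return [count / period_length for count in counts]


# Hypothetical failure log in operating hours (illustration only).
failures = [100, 900, 1500, 1800, 2100, 2500, 2700, 2900]
rates = rocof_per_period(failures, period_length=1000, n_periods=3)
for k, rate in enumerate(rates):
    print(f"period {k}: {rate:.4f} failures per hour")
```

A clear jump in the estimate from one period to the next, as in the last period of this made-up log, would suggest that the reliability figures should be updated before the next interval calculation.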
The short term update can be considered as a revision of previous analysis results. The input to such an analysis is updated reliability figures, either due to more data or due to data updated because of reliability trends. This analysis should not require many resources, as the framework for the analysis is already established. Only Step 5 and Step 8 in the RCM process will be affected by short term updates.

The medium term update will also review the basis for the selection of maintenance actions in Step 7. Analysis of maintenance experience may identify significant failure causes not considered in the initial analysis, requiring an updated FMECA in Step 6. The medium term update therefore affects Steps 5 to 8.

The long term revision will consider all steps in the analysis. It is not sufficient to consider only the system being analyzed; it is required to consider the entire plant with its relations to the outside world, e.g. contractual considerations, new laws regulating environmental protection etc.

4 DISCUSSIONS AND CONCLUSIONS

The following summarizes some main benefits, drawbacks and problems encountered during application of the RCM method in some offshore case studies.

General benefits

Cross-discipline utilization of knowledge: To fully utilize the benefits of the RCM concept, one needs contributions from a wider scope of disciplines than what is common practice. This means that an RCM analysis requires contributions from the following three discipline categories working closely together:

1. System/reliability analyst
2. Maintenance/operation specialist
3. Designer/manufacturer

Not all of these categories need to take part in the analysis on a full time basis. They should, however, be deeply involved in the process during pre- and post-analysis review meetings, and in the quality review of final results. The result of this is that knowledge is extracted and commingled across traditional discipline borders. It may, however, cost more at the outset to engage all these personnel categories.

Traceability of decisions: Traditionally, PM programs tend to be cemented. After some time one hardly knows on what basis the initial decisions were made, and therefore one does not want to change those decisions. In the RCM concept all decisions are taken based on a set of analytical steps, all of which should be documented in the analysis. When operating experience accumulates, one may go back and see on what basis the initial decisions were taken, and adjust the tasks and intervals as required based on the operating experience. This is especially important for initial decisions based on scarce data.

Recruitment of skilled personnel for maintenance planning and execution: The RCM way of planning and updating maintenance requires more professional skills, and is therefore a greater challenge for skilled engineers. It also provides the engineers with a broader and more attractive way of working with maintenance than what is sometimes common today.

Cost aspects: As indicated, RCM will require more effort both in skills and man-hours when first being introduced in a company. It is, however, documented by many companies and organizations that the long term benefits will far outweigh the initial extra costs. One problem is that the return on investment has to be looked upon in a long term perspective, something that management is not always willing to take a chance on.

Benefits related to PM-program achievement: Based on the case studies we have carried out, and experience published by others, the general achievements of RCM in relation to traditional PM programs may be summarized as follows:

- By careful analysis of the failure consequences, the number of PM tasks can often be reduced, or the tasks replaced by corrective tasks or more dedicated tasks. We have therefore chosen to include corrective maintenance as a possible outcome of the RCM analysis.
- Emphasis has been changed from periodic rework or overhaul tasks of the large assemblies/units to more dedicated object oriented tasks. Consequently, condition monitoring has been more frequently used to detect specific failure modes.
- Requirement for spare parts has been reduced as a result of a better justification for replacements.
- Design solutions have been discovered that were not optimal from a safety and plant economic point of view.
Problem areas in the analysis

Identification of Maintenance Significant Items: In some cases there may be very little to achieve by limiting the analysis to only include the MSIs. Smith (1993) argues that concentrating on critical components (MSIs) is directly wrong and that it in most analyses excludes important equipment from appropriate attention. He writes (page 82): ". . . we should be very careful not to prematurely discard components as non-critical until we have truly identified their proper tie and priority status to the functions and functional failures." Other authors argue that the main objective of the RCM process is to create a basis for maintenance evaluation and task adjustment. The selection of MSIs will reduce this basis and result in an insufficient evaluation process. The rationale for working with the MSIs only was to reduce the analysis work. Thus there is always a risk of an insufficient analysis when the non-MSIs are not subjected to a formal analysis. In our presentation the criterion for classifying an analysis item as non-MSI is: The item should not affect any of the critical system failure modes (Step 4). By using such an approach the criterion for disregarding an item is traceable, and may be reevaluated later. Further, we believe that this criterion makes very good sense.

Lack of reliability data: As indicated, the full benefit of the RCM concept can only be achieved when we have access to reliability data for the items being analyzed. Is RCM then worthless if we have no or very poor data at the outset? The answer to this question is no; even in this case the RCM approach will provide some useful information for assessing maintenance tasks. PM intervals will, however, not be available. As a result of the analysis, we should at least have identified the following:

- We know whether the failure involves a safety hazard to personnel, environment or equipment
- We know whether the failure affects production availability
- We know whether the failure is evident or hidden
- We have a better criterion for evaluating cost-effectiveness

Lack of reliability data will always be a problem. First of all there are problems with getting access to operational data of sufficient quality. Next, even if we have data, it is not straightforward to obtain reliability data from the operational data. Before we discuss some problems with collecting and using operational data, it should be emphasized that there will never be a complete lack of reliability figures. Even if no operational data is available, expert judgment will be available. However, the uncertainty in the reliability figures can be very large. Based on our various engagements in the OREDA project and other data collection projects on offshore installations, we have experienced the following common difficulties related to acquisition of failure data:

- Data is generally very repair oriented and not directed towards describing failure causes, modes and effects.
- How the failure was detected is rarely stated (e.g. by inspection, monitoring, PM, tests, casual observation). This information is very useful in order to select applicable tasks.
- Failure modes can sometimes be deduced, but this is generally left to the data collector to interpret.
- The true failure cause is rarely found, but the failure symptom can to some extent be traced.
- Failure effect on the lower indenture level is reasonably well described, but may often be missing on higher indenture level (system level).
- Operating conditions when the failure occurred are frequently missing or vaguely stated.

As mentioned above, there are often problems with estimating reliability data from the operational data. Reliability data comprises MTTR and MTTF figures together with the failure rate function. Reasonable estimates for MTTR and MTTF may be found by various averaging techniques. The failure rate function, i.e. the ageing parameter, is much harder to obtain. Available estimation techniques require no reliability trend (in calendar time) for the unit(s) being considered. Further, if several units are used to enlarge the data set, these units should be operated identically under the same environmental conditions. The requirements above are very seldom fulfilled, hence the estimation techniques may collapse. We therefore recommend use of expert judgment to establish appropriate ageing parameters. The ageing parameter is a measure of how deterministic a failure is, and it is reasonable to believe that this measure is relatively constant for each failure cause. On the other hand, it seems meaningless to establish a general set of recommended MTTF-figures for the various failure mechanisms.

Trade-off analyses: There are four major criteria for the assessment of the consequences of a failure: safety, environment, production availability, and economic losses. During the analysis, we have to quantify these measures to some extent to be able to use them as decision criteria. Further, a trade-off analysis is required to balance each means against the different consequences. Referring to Figure 1, we need to consider the effect of the maintenance tasks M1, M2, ... on the consequences C1, C2, ... This will require comprehensive reliability models. Further, the transformation of the consequences C1, C2, ... into a unidimensional loss function is at best a difficult and controversial task. A framework for dealing with these problems is given in Vatn et al. (1996).

Assessing proper interval: The RCM concept is very valuable in assessing the proper type of PM task, but traditionally RCM does not include any tool for deciding optimal intervals. The "new" framework for RCM given in Figure 1, together with the standard PM models listed under Step 8, is believed to form a very sound basis for deciding optimal intervals.
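The remark that the ageing parameter measures how deterministic a failure is can be made concrete with a small sketch. It assumes, as a labeled assumption, that the ageing parameter is the shape parameter $\alpha$ of a Weibull life distribution; the coefficient of variation (standard deviation divided by mean) then depends on $\alpha$ only and shrinks as $\alpha$ grows. The function names and the MTTF value are hypothetical.

```python
import math


def weibull_cv(alpha):
    """Coefficient of variation of a Weibull lifetime with shape alpha.

    Uses E[T] = scale * Gamma(1 + 1/alpha) and E[T^2] = scale^2 * Gamma(1 + 2/alpha),
    so the scale cancels out of the ratio.
    """
    m1 = math.gamma(1 + 1 / alpha)
    m2 = math.gamma(1 + 2 / alpha)
    return math.sqrt(m2 / m1 ** 2 - 1)


def weibull_scale_from_mttf(mttf, alpha):
    """Weibull scale parameter that gives the requested MTTF for shape alpha."""
    return mttf / math.gamma(1 + 1 / alpha)


# Hypothetical MTTF of 10 000 hours, combined with a few ageing parameters.
for alpha in (1.0, 2.0, 4.0):
    scale = weibull_scale_from_mttf(10_000.0, alpha)
    print(f"alpha={alpha}: scale={scale:.0f} h, CV={weibull_cv(alpha):.3f}")
```

With $\alpha = 1$ the lifetime is exponential and the spread equals the mean; larger values give lifetimes clustered more tightly around the MTTF, which is what makes a scheduled replacement worthwhile.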
Conclusions

RCM is not a simple and straightforward way of optimizing maintenance, but it ensures that one does not jump to conclusions before all the right questions are asked and answers given. RCM can in many respects be compared with Quality Assurance (QA). By rephrasing the definition of QA, RCM can be defined as:

All systematic actions required to plan and verify that the efforts spent on preventive maintenance are applicable and cost-effective.

Thus, RCM does not contain any fundamentally new method. Rather, RCM is a more structured way of utilizing the best of several methods and disciplines. Malik (1990) postulates: ". . . there is more isolation between practitioners of maintenance and the researchers than in any other professional activity." We see the RCM concept as a way to reduce this isolation by closing the gap between the traditionally more design related reliability methods and the practically oriented operating and maintenance personnel.
REFERENCES

R. T. Anderson and L. Neri. Reliability-Centered Maintenance. Management and Engineering Methods. Elsevier Applied Science, London, 1990.
T. Aven. Reliability and Risk Analysis. Elsevier Science Publishers, London, 1992.
K. M. Blache and A. B. Shrivastava. Defining failure of manufacturing machinery & equipment. Proceedings Annual Reliability and Maintainability Symposium, pages 69-75, 1994.
B. S. Blanchard and W. J. Fabrycky. System Engineering and Analysis. Prentice-Hall, Inc., Englewood Cliffs, New Jersey 07632, 1981.
BS 5760-5. Reliability of systems, equipments and components; Part 5: Guide to failure modes, effects and criticality analysis (FMEA and FMECA). British Standards Institution, London, 1991.
N. Cross. Engineering Design Methods: Strategies for Product Design. John Wiley & Sons, Chichester, 1994.
R. R. Hoch. A Practical Application of Reliability Centered Maintenance. The American Society of Mechanical Engineers, 90-JPGC/Pwr-51, Joint ASME/IEEE Power Gen. Conf., Boston, MA, 21-25 Oct., 1990.
A. Høyland and M. Rausand. System Reliability Theory; Models and Statistical Methods. John Wiley & Sons, New York, 1994.
IEC 50(191). International Electrotechnical Vocabulary (IEV) - Chapter 191 - Dependability and quality of service. International Electrotechnical Commission, Geneva, 1990.
IEC 812. Analysis Techniques for System Reliability - Procedures for Failure Modes and Effects Analysis (FMEA). International Electrotechnical Commission, Geneva, 1985.
B. Kirwan and L. K. Ainsworth. A Guide to Task Analysis. Taylor & Francis, London, 1992.
M. A. Malik. Reliable preventive maintenance scheduling. AIEE Trans., 11:221-228, 1990.
M. A. Moss. Designing for Minimal Maintenance Expense. The Practical Application of Reliability and Maintainability. Marcel Dekker, Inc., New York, 1985.
J. Moubray. Reliability-centred Maintenance. Butterworth-Heinemann, Oxford, 1991.
F. S. Nowlan and H. F. Heap. Reliability-centered Maintenance. Technical Report AD/A066579, National Technical Information Service, US Department of Commerce, Springfield, Virginia, 1978.
NPD. Regulations concerning implementation and use of risk analyses in the petroleum activities. Norwegian Petroleum Directorate, P.O.Box 600, N-4001 Stavanger, Norway, 1991.
OREDA-97. Offshore Reliability Data. Distributed by Det Norske Veritas, P.O.Box 300, N-1322 Høvik, Norway, 3rd edition, 1997. Prepared by SINTEF Industrial Management, N-7034 Trondheim, Norway.
A. M. Paglia, D. D. Barnard, and D. E. Sonnett. A Case Study of the RCM Project at V.C. Summer Nuclear Generating Station. 4th International Power Generation Exhibition and Conference, Tampa, Florida, US, 5:1003-1013, 1991.
G. Pahl and W. Beitz. Engineering Design. The Design Council, London, 1984.
M. Rausand and J. Vatn. Reliability Centered Maintenance. In C. G. Soares, editor, Risk and Reliability in Marine Technology. Balkema, Holland, 1997.
H. Sandtorv and M. Rausand. RCM - closing the loop between design and operation reliability. Maintenance, 6, No. 1:13-21, 1991.
A. M. Smith. Reliability-Centered Maintenance. McGraw-Hill, Inc., New York, 1993.
D. J. Smith. Reliability, Maintainability and Risk, Practical methods for engineers. Butterworth Heinemann, Oxford, 4th edition, 1993.
W. L. Smith and M. R. Leadbetter. On the renewal function for the Weibull distribution. Technometrics, 5:393-396, 1963.
Weapon Systems and Support Equipment. Reliability-Centered Maintenance. Requirements for Naval Aircraft. US Department of Defense, Washington DC 20301, 1986.
C. Valdez-Flores and R. M. Feldman. A survey of preventive maintenance models for stochastically deteriorating single-unit systems. Naval Research Logistics Quarterly, 36:419-446, 1989.
J. Vatn. Maintenance Optimization from a Decision Theoretical Point of View. In Proceedings, ESREL'95, pages 273-285, London, 1995. Chameleon Press Limited.
J. Vatn, P. Hokstad, and L. Bodsberg. An overall model for maintenance optimization. Reliability Engineering and System Safety, 51:241-257, 1996.
