
Safety in Integrated Systems

Abstract: This paper describes the state of the art in system safety engineering and management along with new models of accident causation, based on systems theory, that may allow us to greatly expand the power of the techniques and tools we use. The new models consider hardware, software, humans, management decision-making, and organizational design as an integrated whole. New hazard analysis techniques based on these expanded models of causation provide a means for obtaining the information necessary to design safety into the system and to determine which are the most critical parameters to monitor during operations and how to respond to them. The paper first describes and contrasts the current system safety and reliability engineering approaches to safety and the traditional methods used in both these fields. It then outlines the new systems-theoretic approach being developed in Europe and the U.S. and the application of the new approach to aerospace systems, including a recent risk analysis and health assessment of the NASA manned space program management structure and safety culture that used the new approach.
Reliability Engineering vs. System Safety
Although safety has been a concern of engineering for centuries and references to designing for safety appear in the 19th century (Cooper, 1891), modern approaches to engineering safety appeared as a result of changes that occurred in the mid-1900s.
"n the years following -orld -ar "", the growth in military electronics ga!e rise to reliability
engineering. .eliability was also important to A%A and our space efforts, as e!idenced by the high
failure rate of space missions in the late &'/,s and early &'0,s and was adopted as a basic approach to
achie!ing mission success. .eliability engineering is concerned primarily with failures and failure rate
reduction. The reliability engineering approach to safety thus concentrates on failure as the cause of
accidents. A !ariety of techniques are used to minimize component failures and thereby the failures of
complex systems caused by component failure. These reliability engineering techniques include parallel
redundancy, standby sparing, built-in safety factors and margins, derating, screening, and timed
replacements.
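The effect of the first of these techniques, parallel redundancy, can be illustrated with a short calculation: a system of n independent redundant components fails only if all n fail. The sketch below is illustrative only; the component reliability values are hypothetical and independence of failures is assumed.

```python
# Reliability of n independent components in parallel: the system
# works if at least one component works, i.e., it fails only when
# all components fail. Values below are hypothetical.

def parallel_reliability(r, n):
    """System reliability for n independent parallel components of reliability r."""
    return 1.0 - (1.0 - r) ** n

for n in (1, 2, 3):
    print(n, parallel_reliability(0.9, n))
```

Each added component multiplies the failure probability by (1 - r), so redundancy can raise reliability sharply, which is exactly why (as discussed below) it does not by itself address accidents that occur without component failure.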
System Safety arose around the same time, primarily in the defense industry. Although many of the basic concepts of system safety, such as anticipating hazards and accidents and building in safety, predate the post-World War II period, much of the early development of system safety as a separate discipline began with flight engineers immediately after the War and was developed into a mature discipline in the early ballistic missile programs of the 1950s and 1960s. The space program was the second major application area to apply system safety approaches in a disciplined manner. After the Apollo 204 fire in 1967, NASA hired Jerome Lederer, the head of the Flight Safety Foundation, to head manned space-flight safety and, later, all safety activities. Through his leadership, an extensive program of system safety was established for space projects, much of it patterned after the Air Force and DoD programs.
While traditional reliability engineering techniques are often effective in increasing reliability, they do not necessarily increase safety. In fact, their use under some conditions may actually reduce safety. For example, increasing the burst-pressure to working-pressure ratio of a tank often introduces new dangers of an explosion or chemical reaction in the event of a rupture.
System safety, in contrast to the reliability engineering focus on preventing failure, is primarily concerned with the management of hazards: their identification, evaluation, elimination, and control through analysis, design, and management procedures. In the case of the tank rupture, system safety would look at the interactions among the system components rather than just at failures or engineering strengths.
Reliability engineers often assume that reliability and safety are synonymous, but this assumption is true only in special cases. In general, safety has a broader scope than failures, and failures may not compromise safety. There is obviously an overlap between reliability and safety, but many accidents occur without any component failure, e.g., the Mars Polar Lander loss, where the individual components were operating exactly as specified or intended, that is, without failure. The opposite is also true: components may fail without a resulting accident. In addition, accidents may be caused by equipment operation outside the parameters and time limits upon which the reliability analyses are based. Therefore, a system may have high reliability and still have accidents. For example, the Therac-25 (a medical device that massively overdosed several patients) worked safely tens of thousands of times before the peculiar conditions arose that triggered the software flaw that killed a patient. Chernobyl-style nuclear power plants had a calculated mean time between failure of ten thousand years.
System safety deals with systems as a whole rather than with subsystems or components. In system safety, safety is treated as an emergent property that arises at the system level when components are operating together. The events leading to an accident may be a complex combination of equipment failure, faulty maintenance, instrumentation and control problems, human actions, and design errors. Reliability analysis considers only the possibility of accidents related to failures; it does not investigate the potential damage that could result from successful operation of the individual components.
Another unique feature of system safety engineering is its inclusion of non-technical aspects of systems. Jerome Lederer wrote in 1968:

"System safety covers the entire spectrum of risk management. It goes beyond the hardware and associated procedures to system safety engineering. It involves: attitudes and motivation of designers and production people, employee/management rapport, the relation of industrial associations among themselves and with government, human factors in supervision and quality control, documentation on the interfaces of industrial and public safety with design and operations, the interest and attitudes of top management, the effects of the legal system on accident investigations and exchange of information, the certification of critical workers, political considerations, resources, public sentiment and many other non-technical but vital influences on the attainment of an acceptable level of risk control. These non-technical aspects of system safety cannot be ignored" (quoted in Lederer, 1986).
Traditional Accident Models and Approaches to Safety Engineering
Models provide a means for understanding phenomena and recording that understanding so we can communicate with others. Accident models underlie all attempts to engineer for safety; they are used to explain how accidents occur. All accident models assume there exist common patterns in accidents, that is, that they are not just random events. Imposing a pattern on accidents influences the factors that are considered in investigating and analyzing accidents, preventing accidents through hazard analysis and design techniques, assessing risk, and performance auditing. While you may not be aware that you are using an accident model, it underlies all causal identification, countermeasures taken, and risk evaluation.
The most common model used today explains accidents in terms of multiple failure events, sequenced as a forward chain over time. The relationships between the events are simple and direct, e.g., if A hadn't occurred, B would not have happened. Figure 1 shows an example of an accident chain model for the rupture of a tank. The events in the chain almost always involve component failure, human error, or energy-related events such as an explosion.
Chain-of-events models form the basis for most system safety and reliability engineering analyses used today, such as fault tree analysis, failure modes and effects analysis, event trees, and probabilistic risk assessment. Note that hazard analysis is simply the investigation of an accident before it occurs: techniques such as fault tree analysis or failure modes and effects analysis attempt to delineate all the relevant chains of events that can lead to the loss event being investigated.

The event chain models that result from this analysis form the basis for most reliability and safety engineering design techniques, such as redundancy, overdesign, safety margins, interlocks, etc. The designer attempts to interrupt the chain of events by preventing the occurrence of events in the chain. Figure 1 is annotated with some potential design techniques that might be used to interrupt the flow of events to the loss state. Not shown, but very common, is to introduce "and" relationships in the chain, i.e., to design such that two or more failures must happen for the chain to continue toward the loss state, thus reducing the probability of the loss occurring.
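The effect of introducing an "and" relationship can be sketched numerically with the OR/AND gate logic used in fault trees. The following fragment is illustrative only: the event probabilities are invented, and the failure events are assumed to be independent.

```python
# Fault-tree style gate arithmetic for independent events.
# An OR gate: the loss occurs if ANY of the input events occurs.
# An AND gate: the loss occurs only if ALL input events occur.
# Probabilities below are hypothetical, for illustration.

def or_gate(probs):
    """P(at least one of several independent events occurs)."""
    p_none = 1.0
    for p in probs:
        p_none *= (1.0 - p)
    return 1.0 - p_none

def and_gate(probs):
    """P(all of several independent events occur)."""
    p_all = 1.0
    for p in probs:
        p_all *= p
    return p_all

# A chain that continues on a single failure of probability 1e-3:
p_single = or_gate([1e-3])

# Redesigned so two independent failures must both occur ("and"):
p_and = and_gate([1e-3, 1e-3])

print(p_single, p_and)
```

Under the independence assumption, requiring two failures drops the probability from about 10^-3 to about 10^-6; the caveat, developed later in the paper, is that this arithmetic says nothing about losses that involve no component failure at all.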

Reliability engineering relies on redundancy, increasing component integrity (e.g., incorporating safety margins for physical components and attempting to achieve error-free behavior of the logical and human components), and using "safety" functions during operations to recover from failures (e.g., shutdown and other types of protection systems). System safety uses many of the same techniques, but focuses them on eliminating or preventing hazards. System safety engineers also tend to use a wider variety of design approaches, including various types of interlocks to prevent the system from entering a hazardous state or leaving a safe state.
"n summary, reliability engineering focuses on failures while system safety focuses on hazards. These
are not equi!alent. ).B.9iller, of the founders of system safety in the &'/,s, cautioned that
=distinguishing hazards from failures is implicit in understanding the difference between safety and
reliabilityA (9iller, &'*/+.
While both of these approaches work well with respect to their different goals for the relatively simple systems for which they were designed in the 1950s and 1960s, they are not as effective for the complex systems and new technology common today. The basic hazard analysis techniques have not changed significantly since Fault Tree Analysis was introduced in 1962. The introduction of digital systems and software control, in particular, has had a profound effect in terms of technology that does not satisfy the underlying assumptions of the traditional hazard and reliability analysis and safety engineering techniques. It also allows levels of complexity in our system designs that overwhelm the traditional approaches (Leveson 1995, Leveson 2005). A related new development is the introduction of distributed human-automation control and the changing role of human operators from controller to high-level supervisor and decision-maker. The simple slips and operational mistakes of the past are being eliminated by substituting automation, resulting in a change in the role humans play in accidents and the substitution of cognitively complex decision-making errors for the simple slips of the past. In the most technically advanced aircraft, the types of errors pilots make have changed but not been eliminated (Billings, 1996).
Another important limitation of the chain-of-events model is that it ignores the social and organizational factors in accidents. Both the Challenger and Columbia accident reports stressed the importance of these factors in accident causation. Poor engineering and management decision-making and processes can outweigh and undo good analysis and design of the physical system.
Finally, any accident model that includes the social system and human error must account for adaptation. Systems and organizations continually change as adaptations are made in response to local pressures and short-term productivity and cost goals. People adapt to their environment or they change their environment to better suit their purposes. A corollary of this propensity for systems and people to adapt over time is that safety defenses are likely to degenerate systematically through time, particularly when pressure toward cost-effectiveness and increased productivity is the dominant element in decision-making (Rasmussen 1997). Major accidents are often caused not merely by a coincidence of independent failures but instead reflect a systematic migration of the organizational behavior to the boundaries of safe behavior under pressure toward cost-effectiveness in an aggressive, competitive environment.

For an accident model to handle system adaptation over time, it must consider the processes involved in accidents and not simply events and conditions: processes control a sequence of events and describe system and human behavior over time rather than considering events and human actions individually. Rasmussen argues that accident causation must be viewed as a complex process involving the entire socio-technical system, including legislators, government agencies, industry associations and insurance companies, company management, technical and engineering personnel, operations, etc.
All of these factors are giving rise to a new type of accident, labeled by Perrow as a "system accident" (Perrow, 1984). In these accidents, losses occur due to dysfunctional interactions among components rather than component failures. Perrow hypothesized that they were related to the increased interactive complexity and tight coupling in modern high technology systems. Perrow was pessimistic about whether such accidents can be prevented, but his pessimism was based on the use of traditional reliability engineering techniques, particularly redundancy. System Safety does not rely on redundancy, and system safety engineers have been working on developing more effective new techniques based on systems theory to prevent system accidents.
New System-Theoretic Approaches to Safety Engineering
"n parallel to the rise of system safety in the $nited %tates, some #uropeans began to re1ect the basic
chain of e!ents models and suggested that systems theory be used as the basis for a new, more powerful
model of accidents. The leader of this group was 5ens .asmussen, a researcher in the nuclear power field
at .iso 6abs in 8enmark.
Rasmussen and Svedung have described a hierarchical model of the socio-technical system involved in risk management (see Figure 2) (Rasmussen and Svedung 2000). At the social and organizational levels of their model, Rasmussen and Svedung use a control-based model of accidents, and at all levels they focus on information flow. At each level, however, and between levels, they model the events and their initiation and flow of effects using an event-chain modeling language similar to cause-consequence diagrams (which combine fault trees and event trees). In addition, they focus on the downstream part of the chain following the occurrence of the hazard. This downstream emphasis is common in the process industry, where Rasmussen has done most of his work. Finally, their model focuses on operations; engineering design activities are treated as input to the model but not as a central part of the model itself.
Leveson (2005) has taken the systems approach one step further by developing a pure systems-theoretic model of accident causation called STAMP (Systems-Theoretic Accident Modeling and Processes). In many ways, STAMP is a return to the original roots of early System Safety, but it extends the scope of what can be handled to include the new factors common to engineered systems today.

STAMP

"n %TA9:, accidents are not !iewed as resulting from component failures but rather as a result of
flawed processes in!ol!ing interactions among physical system components, people, societal and
organizational structures, and engineering acti!ities. %afety is treated as a control problem: accidents
occur when component failures, external disturbances, and>or dysfunctional interactions among system
components are not adequately handled. "n the %pace %huttle )hallenger loss, for example, the B-rings
did not adequately control propellant gas release by sealing a tiny gap in the field 1oint. "n the 9ars :olar
6ander loss, the software did not adequately control the descent speed of the spacecraft;it misinterpreted
noise from a sensor as an indication the spacecraft had reached the surface of the planet.
Accidents such as these, involving engineering design errors and misunderstanding of the functional requirements (in the case of the Mars Polar Lander), may in turn stem from inadequate control over the development process, i.e., risk is not adequately managed in the design, implementation, and manufacturing processes. Control is also imposed by the management functions in an organization (the Challenger accident involved inadequate controls in the launch-decision process, for example) and by the social and political system within which the organization exists. The role of all of these factors must be considered in hazard and risk analysis.
Note that the use of the term "control" does not imply a strict military-style command and control structure. Behavior is controlled or influenced not only by direct management intervention, but also indirectly by policies, procedures, shared values, and other aspects of the organizational culture. All behavior is influenced and at least partially "controlled" by the social and organizational context in which the behavior occurs. Engineering this context can be an effective way of creating and changing a safety culture.
Systems are viewed in STAMP as interrelated components that are kept in a state of dynamic equilibrium by feedback loops of information and control. A system is not treated as a static design, but as a dynamic process that is continually adapting to achieve its ends and to react to changes in itself and its environment. The original design must not only enforce appropriate constraints on behavior to ensure safe operation, but it must continue to operate safely as changes and adaptations occur over time. Accidents, then, are considered to result from dysfunctional interactions among the system components (including both the physical system components and the organizational and human components) that violate the system safety constraints. The process leading up to an accident can be described in terms of an adaptive feedback function that fails to maintain safety as performance changes over time to meet a complex set of goals and values. The accident or loss itself results not simply from component failure (which is treated as a symptom of the problems) but from inadequate control of safety-related constraints on the development, design, construction, and operation of the socio-technical system.
While events reflect the effects of dysfunctional interactions and inadequate enforcement of safety constraints, the inadequate control itself is only indirectly reflected by the events; the events are the result of the inadequate control. The system control structure itself, therefore, must be examined to determine how unsafe events might occur and whether the controls are adequate to maintain the required constraints on safe behavior.
STAMP has three fundamental concepts: constraints, hierarchical levels of control, and process models. These concepts, in turn, give rise to a classification of control flaws that can lead to accidents. The most basic component of STAMP is not an event, but a constraint. In systems theory and control theory, systems are viewed as hierarchical structures where each level imposes constraints on the activity of the level below it; that is, constraints or lack of constraints at a higher level allow or control lower-level behavior.
Safety-related constraints specify those relationships among system variables that constitute the non-hazardous or safe system states: for example, the power must never be on when the access to the high-voltage power source is open, the descent engines on the lander must remain on until the spacecraft reaches the planet surface, and two aircraft must never violate minimum separation requirements.
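Constraints of this kind can be written as executable predicates over the system state. The sketch below uses the high-voltage interlock and descent-engine examples just given; the dictionary-based state representation and the field names are assumptions made purely for illustration.

```python
# Safety constraints as predicates over a (hypothetical) system state.
# Each predicate returns True when the constraint is satisfied.

def interlock_constraint(state):
    """Power must never be on while the high-voltage access is open."""
    return not (state["power_on"] and state["access_open"])

def descent_constraint(state):
    """Descent engines must stay on until the spacecraft reaches the surface."""
    return state["on_surface"] or state["engines_on"]

def violated(state, constraints):
    """Return the names of the constraints a given state violates."""
    return [c.__name__ for c in constraints if not c(state)]

safe = {"power_on": False, "access_open": True,
        "on_surface": False, "engines_on": True}
unsafe = {"power_on": True, "access_open": True,
          "on_surface": False, "engines_on": False}

print(violated(safe, [interlock_constraint, descent_constraint]))
print(violated(unsafe, [interlock_constraint, descent_constraint]))
```

Expressed this way, a constraint is a property of a state, not an event in a chain, which matches the shift of viewpoint described in the next paragraph.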
"nstead of !iewing accidents as the result of an initiating (root cause+ e!ent in a chain of e!ents leading
to a loss, accidents are !iewed as resulting from interactions among components that !iolate the system
safety constraints. The control processes that enforce these constraints must limit system beha!ior to the
safe changes and adaptations implied by the constraints. :re!enting accidents requires designing a
control structure, encompassing the entire socio-technical system, that will enforce the necessary
constraints on de!elopment and operations. 7igure C shows a generic hierarchical safety control structure.
Accidents result from inadequate enforcement of constraints on beha!ior (e.g., the physical system,
engineering design, management, and regulatory beha!ior+ at each le!el of the socio-technical system.
"nadequate control may result from missing safety constraints, inadequately communicated constraints, or
from constraints that are not enforced correctly at a lower le!el. 7eedback during operations is critical
here. 7or example, the safety analysis process that generates constraints always in!ol!es some basic
assumptions about the operating en!ironment of the process. -hen the en!ironment changes such that
those assumptions are no longer true, the controls in place may become inadequate.



The model in Figure 3 has two basic hierarchical control structures, one for system development (on the left) and one for system operation (on the right), with interactions between them. A spacecraft manufacturer, for example, might only have system development under its immediate control, but safety involves both development and operational use of the spacecraft, and neither can be accomplished successfully in isolation: safety must be designed into the physical system, and safety during operation depends partly on the original system design and partly on effective control over operations. Manufacturers must communicate to their customers the assumptions about the operational environment upon which their safety analysis and design was based, as well as information about safe operating procedures. The operational environment, in turn, provides feedback to the manufacturer about the performance of the system during operations.
Between the hierarchical levels of each control structure, effective communication channels are needed, both a downward reference channel providing the information necessary to impose constraints on the level below and a measuring channel to provide feedback about how effectively the constraints were enforced. For example, company management in the development process structure may provide a safety policy, standards, and resources to project management and in return receive status reports, risk assessments, and incident reports as feedback about the status of the project with respect to the safety constraints.
The safety control structure often changes over time, which accounts for the observation that accidents in complex systems frequently involve a migration of the system toward a state where a small deviation (in the physical system or in human behavior) can lead to a catastrophe. The foundation for an accident is often laid years before. One event may trigger the loss, but if that event had not happened, another one would have. As an example, Figure 4 shows the changes over time that led to a water contamination accident in Canada in which 2,400 people became ill and 7 died (most of them children). The reasons why this accident occurred would take too many pages to explain, and only a small part of the overall STAMP model is shown. Each component of the water quality control structure played a role in the accident. The model at the top shows the control structure for water quality in Ontario, Canada, as designed. The figure at the bottom shows the control structure as it existed at the time of the accident. One of the important changes that contributed to the accident was the elimination of a government water-testing laboratory. The private companies that were substituted were not required to report instances of bacterial contamination to the appropriate government ministries. Essentially, the elimination of the feedback loops made it impossible for the government agencies and public utility managers to perform their oversight duties effectively. Note that the goal here is not to identify individuals to blame for the accident but to understand why they made the mistakes they made (none were evil or wanted children to die) and what changes are needed in the culture and water quality control structure to reduce risk in the future.
"n this accident, and in most accidents, degradation in the safety margin occurred o!er time and
without any particular single decision to do so but simply as a series of decisions that indi!idually seemed
safe but together resulted in mo!ing the water quality control system structure slowly toward a situation
where any slight error would lead to a ma1or accident. :re!enting accidents requires ensuring that
controls do not degrade despite the ine!itable changes that occur o!er time or that such degradation is
detected and corrected before a loss occurs.
Figure 4 shows static models of the safety control structure. But models are also needed to understand why the structure changed over time in order to build in protection against unsafe changes. For this goal, we use system dynamics models. System dynamics models are formal and can be executed, like our other models. The field of system dynamics, created at MIT in the 1950s by Forrester (Forrester, 1961), is designed to help decision makers learn about the structure and dynamics of complex systems, to design high-leverage policies for sustained improvement, and to catalyze successful implementation and change. System dynamics provides a framework for dealing with dynamic complexity, where cause and effect are not obviously related. Like the other models used in a STAMP analysis, it is grounded in the theory of non-linear dynamics and feedback control, but also draws on cognitive and social psychology, organization theory, economics, and other social sciences (Senge, 1990; Sterman, 2000).



The control loop in the lower left corner of Figure 5, labeled R1 or Pushing the Limit, shows how, as external pressures increased, performance pressure increased, which led to increased launch rates and thus success in meeting the launch rate expectations, which in turn led to increased expectations and increasing performance pressures. This, of course, is an unstable system and cannot be maintained indefinitely; note that the larger control loop, B1, in which this loop is embedded, is labeled Limits to Success. The upper left loop represents part of the safety program loop. The external influences of budget cuts and increasing performance pressures that reduced the priority of safety procedures led to a decrease in system safety efforts. The combination of this decrease along with loop B2, in which fixing problems increased complacency, which also contributed to reduction of system safety efforts, eventually led to a situation of (unrecognized) high risk. There is one other important factor shown in the model: increasing system safety efforts led to launch delays, another reason for reduction in priority of the safety efforts in the face of increasing launch pressures.
One thing not shown in this simplified model is the delays in the system. While reduction in safety efforts and lower prioritization of safety concerns may lead to accidents, accidents usually do not occur for a while, so false confidence is created that the reductions are having no impact on safety, and therefore pressures increase to reduce the efforts and priority even further as the external performance pressures mount.
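The role of delay can be illustrated with a toy simulation. The sketch below is not one of the executable system dynamics models discussed in the paper; it is a deliberately simplified loop with invented coefficients and time constants, showing only the qualitative point that perceived risk lags actual risk when the feedback is delayed.

```python
# Toy rendering of the "Pushing the Limit" dynamic: performance
# pressure ratchets up, safety effort is squeezed, actual risk rises,
# but the risk is *perceived* only after a delay, so the erosion
# looks cost-free while it is happening. All numbers are hypothetical.

def simulate(steps=20, delay=10):
    pressure, effort = 1.0, 1.0
    actual, perceived = [], []
    for t in range(steps):
        pressure *= 1.05                              # expectations ratchet upward
        effort = max(0.1, effort - 0.02 * pressure)   # safety effort eroded
        actual.append(1.0 / effort)                   # risk rises as effort falls
        perceived.append(actual[max(0, t - delay)])   # risk felt only after a delay
    return actual, perceived

actual, perceived = simulate()
# Perceived risk understates actual risk at every step of the run.
print(actual[-1], perceived[-1])
```

The gap between the two series is the "false confidence" described above: by the time perceived risk catches up, the system has already migrated well toward the boundary of safe behavior.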
Models of this sort can be used to redesign the system to eliminate or reduce risk or to evaluate the impact of policy decisions. For example, one way to avoid the dynamics shown in Figure 5 is to "anchor" the safety efforts by external means, i.e., Agency-wide standards and review processes that cannot be watered down by program/project managers when performance pressures build.
Often degradation of the control structure involves asynchronous evolution, where one part of a system changes without the related necessary changes in other parts. Changes to subsystems may be carefully designed, but consideration of their effects on other parts of the system, including the control aspects, may be neglected or inadequate. Asynchronous evolution may also occur when one part of a properly designed system deteriorates. The Ariane 5 trajectory changed from that of the Ariane 4, but the inertial reference system software did not. One factor in the loss of contact with the SOHO (Solar Heliospheric Observatory) spacecraft in 1998 was the failure to communicate to operators that a functional change had been made in a procedure to perform gyro spin-down.
Besides constraints and hierarchical levels of control, a third basic concept in STAMP is that of process models. Any controller, human or automated, must contain a model of the system being controlled. For humans, this model is generally referred to as their mental model of the process being controlled.
For effective control, the process models must contain the following: (1) the current state of the system being controlled, (2) the required relationship between system variables, and (3) the ways the process can change state. Accidents, particularly system accidents, frequently result from inconsistencies between the model of the process used by the controllers and the actual process state: for example, the lander software thinks the lander has reached the surface and shuts down the descent engine; the Minister of Health has received no reports about water quality problems and believes the state of water quality in the town is better than it actually is; or a mission manager believes that foam shedding is a maintenance or turnaround issue only. Part of our modeling efforts involves creating the process models, examining the ways that they can become inconsistent with the actual state (e.g., missing or incorrect feedback), and determining what feedback loops are necessary to maintain the safety constraints.
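How a process model diverges from the actual process state can be sketched in code. The fragment below is a simplified, hypothetical rendering of the lander example just given; the class, field, and signal names are invented for illustration and are not taken from any flight software.

```python
# A controller acts on its *model* of the process, not on the process
# itself. Here a single spurious touchdown signal updates the model,
# and the control action (engine shutdown) follows from the model
# even though the actual process state disagrees. Hypothetical sketch.

class DescentController:
    def __init__(self):
        # The controller's process model: what it believes about the process.
        self.model_on_surface = False
        self.engines_on = True

    def update(self, touchdown_signal):
        # The model accepts any touchdown signal without checking it
        # against the rest of the descent profile.
        if touchdown_signal:
            self.model_on_surface = True
        if self.model_on_surface:
            self.engines_on = False   # control action driven by the model

# Actual process state: the lander is still descending.
actually_on_surface = False

ctrl = DescentController()
ctrl.update(touchdown_signal=True)   # transient noise from a leg sensor

# Model and process now disagree, and the resulting control action
# is unsafe for the actual state.
print(ctrl.model_on_surface, actually_on_surface, ctrl.engines_on)
```

The unsafe behavior here involves no component failure: sensor, software, and engine each did what they were built to do, which is precisely the kind of inconsistency between model and process state that the analysis must expose.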
When there are multiple controllers and decision makers, system accidents may also involve inadequate control actions and unexpected side effects of decisions or actions, again often the result of inconsistent process models. For example, two controllers may both think the other is making the required control action (both Canadian government ministries responsible for the water supply and public health thought the other had followed up on the previous poor water quality reports), or they make control actions that conflict with each other. Communication plays an important role here: accidents are most likely in the boundary or overlap areas where two or more controllers control the same process (Leplat, 1987).
A STAMP modeling and analysis effort involves creating a model of the system safety control structure: the safety requirements and constraints that each component (both technical and organizational) is responsible for maintaining; controls and feedback channels; process models representing the view of the controlled process by those controlling it; and a model of the dynamics and pressures that can lead to degradation of this structure over time. These models and the analysis procedures defined for them can be used (1) to investigate accidents and incidents to determine the role played by the different components of the safety control structure and learn how to prevent related accidents in the future, (2) to proactively perform hazard analysis and design to reduce risk throughout the life of the system, and (3) to support a continuous risk management program where risk is monitored and controlled.
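As an illustration of what such a control-structure model contains, its elements can be captured in a simple data structure. This is a hypothetical sketch, not the actual STAMP tool set: the component names, constraints, and the feedback check are invented for the example:

```python
# Hypothetical sketch of a STAMP-style safety control structure: each
# component carries its safety constraints, and control/feedback channels
# connect controllers to the processes they control.
from dataclasses import dataclass, field

@dataclass
class Component:
    name: str
    safety_constraints: list = field(default_factory=list)

@dataclass
class Channel:
    source: str
    target: str
    kind: str      # "control" or "feedback"

components = [
    Component("Program Management", ["allocate adequate safety resources"]),
    Component("Engineering", ["enforce design safety constraints"]),
    Component("Physical Process", []),
]
channels = [
    Channel("Program Management", "Engineering", "control"),
    Channel("Engineering", "Program Management", "feedback"),
    Channel("Engineering", "Physical Process", "control"),
]

def missing_feedback(channels):
    """Control channels with no feedback channel in the opposite direction."""
    feedback = {(c.target, c.source) for c in channels if c.kind == "feedback"}
    return [(c.source, c.target) for c in channels
            if c.kind == "control" and (c.source, c.target) not in feedback]

print(missing_feedback(channels))   # Engineering controls the process blindly
```

A check of this kind mechanizes one of the analysis questions above: which control channels lack the feedback needed to keep the controller's process model consistent with the controlled process.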
STAMP-Based Hazard Analysis (STPA)
Current hazard analysis techniques, such as fault tree analysis, do not provide the power necessary to handle software-intensive systems, system accidents (whose roots lie in dysfunctional interactions among system components rather than individual component failures, such as occurred in the Columbia and Mars Polar Lander losses), and systems involving complex human-automation interaction and distributed decision-making. To handle these types of systems, which will be common in future space exploration and advanced aviation, more powerful hazard analysis techniques will be needed that include more than simply component failure as an accident cause. We have developed a new approach to hazard analysis, based on STAMP, called STPA (STamP Analysis) that handles hardware, software, human decision-making, and organizational factors in accidents.
STPA starts in the early concept development stages of a project and continues throughout the life of the system. Its use during system design supports a safety-driven design process where the hazard analysis influences and shapes the early system design decisions and then is iterated and refined as the design evolves and more information becomes available.
STPA has the same general goals as any hazard analysis: (1) identification of the system hazards and the related safety constraints necessary to ensure acceptable risk and (2) accumulation of information about how those hazards could be violated so it can be used in eliminating, reducing, and controlling hazards. Thus, the process starts with identifying the system requirements and design constraints necessary to maintain safety. In later steps, STPA assists in the top-down refinement of the safety-related system requirements and constraints into requirements and constraints on the individual system components. We note that this is exactly the process required in the new NASA Software Safety Standard, NASA-STD-8719.13B. STPA provides a method to implement that new standard. The same approach can, of course, be applied to hardware and human activities, not just software. The overall STPA process provides the information and documentation necessary to ensure the safety constraints are enforced in system design, development, manufacturing, and operations.

In general, safety-driven design involves first attempting to eliminate hazards from the design and, if that is not possible or requires unacceptable tradeoffs, reducing the probability the hazard will occur, reducing the negative consequences of the hazard, and implementing contingency plans for dealing with the hazard. As design decisions are made, an STPA-based hazard analysis is used to inform those decisions.
STPA is used to identify the safety constraints that must be enforced, the control required to enforce those constraints, and the reasons the control may be enforced inadequately. That information is used to select designs and make design decisions at each step of the design process, starting with the early architectural trades and concept design. As the design is refined, the STPA analysis is refined in parallel in a step-by-step process. Control requirements that cannot be resolved in the design itself (either because of infeasibility or because it would require unacceptable tradeoffs with other goals) are then used as the basis for designing performance monitoring and safety management requirements during operations. An example application of STPA to the safety-driven design of a hypothetical industrial robot for inspection and waterproofing of the thermal tiles on the Space Shuttle can be found in Dulac and Leveson (2004).
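The partitioning described above, in which control requirements that cannot be resolved in the design become operational monitoring requirements, can be sketched minimally as follows (the two requirement entries are invented for illustration):

```python
# Hypothetical sketch: requirements that cannot be enforced by the design
# itself are carried forward as operational monitoring requirements.
requirements = [
    {"id": "R1", "text": "interlock prevents engine shutdown above 40 m",
     "resolvable_in_design": True},
    {"id": "R2", "text": "foam shedding stays within certified limits",
     "resolvable_in_design": False},   # infeasible to guarantee by design alone
]

design_constraints = [r for r in requirements if r["resolvable_in_design"]]
operational_monitoring = [r for r in requirements if not r["resolvable_in_design"]]

for r in operational_monitoring:
    # each becomes a performance-monitoring/safety-management requirement
    print(f"{r['id']}: define metric and response for: {r['text']}")
```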
A unique aspect of STPA is its dynamic nature. An effective hazard control approach at the beginning of a system lifecycle may become less effective at enforcing the safety constraints as the result of evolution of and changes to the system and eroding control mechanisms. Traditional hazard analysis techniques are static in nature, focusing on the ability of the system to avoid unsafe states given the current system design and its environment. In contrast, STPA assumes that systems are dynamic in nature and will evolve and adapt based on changes within the system and in the operating environment. A complete hazard analysis must therefore identify the possible changes in the safety controls over time that could lead to a high-risk state. System dynamics models can be used for this purpose. The information derived from the analysis can be used (1) to prevent such changes through system design or, if that is not possible, (2) to generate operational metrics and monitoring procedures to detect such degradation and (3) to design controls on maintenance, system changes, and upgrade activities.
Health Monitoring and Organizational Risk Management
When hazards cannot be totally eliminated from the design, some type of monitoring is often required to back up other types of hazard control in the system design. An infeasibly large number of parameters, however, can potentially be monitored by safety management systems in complex, human-machine systems. A major problem is to determine which parameters will detect errors that are likely to lead to human or mission loss. One way to accomplish this goal is to use the information obtained in the hazard analysis. STPA (and other hazard analysis techniques to a lesser extent, depending on what is being monitored) can provide this information. The use of STPA on physical systems was described above. A more unusual application is in the monitoring of the safety culture and organizational decision-making that can lead to unacceptable system risk. The rest of this paper describes an application of the new systems approach to organizational risk management.
This approach rests on the hypothesis that safety culture and organizational risk factors can be modeled, formally analyzed, and engineered. Models of the organizational safety control structure and dynamic decision-making and review processes can potentially be used for: (1) designing and validating improvements to the risk management and safety culture; (2) evaluating and analyzing organizational risk factors; (3) detecting when risk is increasing to unacceptable levels (a virtual "canary in the coal mine"); (4) evaluating the potential impact of changes and policy decisions on risk; (5) performing "root cause" analysis that identifies systemic factors and not just symptoms of the underlying organizational and culture problems; and (6) determining the information each decision-maker needs to manage risk effectively and the communication requirements for coordinated decision-making across large projects.
One of the advantages of using formal models in organizational and project risk analysis is that analytical tools can be used to identify the most important leading indicators of increasing system risk. Most major accidents do not result simply from a unique set of proximal, physical events but from the drift of the organization to a state of heightened risk over time as safeguards and controls are relaxed due to conflicting goals and tradeoffs. In this state, some events are bound to occur that will trigger an accident. In both the Challenger and Columbia losses, organizational risk had been increasing to unacceptable levels for quite some time as behavior and decision-making evolved in response to a variety of internal and external pressures. Because risk increased slowly, nobody noticed it, i.e., the "boiled frog" phenomenon. In fact, confidence and complacency were increasing at the same time as risk due to the lack of accidents.
The challenge in preventing accidents is to establish safeguards and metrics to prevent and detect migration to a state of unacceptable risk before an accident occurs. The process of tracking leading indicators of increasing risk (the virtual "canary in the coal mine") can play an important role in preventing accidents. Note that our goal is not quantitative risk assessment, but rather to identify the factors involved in risk, their relationship to each other, and their impact on overall system risk.
This approach to risk management was developed during an assessment of the new Independent Technical Authority (ITA) program in the NASA Manned Space Program (Leveson et al., 2005). It uses a traditional system engineering and System Safety approach, but built on STAMP and STPA and adapted to the task at hand, i.e., organizational risk analysis. Figure 6 shows the overall process, which involves six steps:
1. Perform a high-level system hazard analysis, i.e., identify the system-level hazards to be the focus of the analysis and then the system requirements and design constraints necessary to avoid those hazards.
2. Create the STAMP hierarchical safety control structure, using either the organizational design as it exists or creating a new design that satisfies the system requirements and constraints. This control structure will include the roles and responsibilities of each component with respect to safety.
3. Identify any gaps in the control structure that might lead to a lack of fulfillment of the system safety requirements and design constraints, and places where changes or enhancements in the control structure might be helpful.
4. Use STPA to identify the inadequate controls for each of the control structure components that could lead to the component's responsibilities not being fulfilled. These are the system risks.
5. Categorize the risks as to whether they need to be assessed immediately or whether they are longer-term risks that require monitoring over time. Identify some potential metrics or measures of effectiveness for each of the risks.
6. Create a system dynamics model of the non-linear dynamics of the system and use the models to identify the most important leading indicators of risk that need to be monitored during the lifetime of the system or program. Perform other types of analysis on the models to identify additional risk factors.
The first step in the risk analysis is a preliminary hazard analysis to identify the high-level hazard the system safety program must be designed to control and the general requirements and constraints necessary to eliminate that hazard. For the NASA ITA, that hazard is:

System Hazard: Poor engineering and management decision-making leading to an accident (loss)

From this hazard, high-level system requirements and constraints can be derived, for example:
System Safety Requirements and Constraints

1. Safety considerations must be first and foremost in technical decision-making.
   a. State-of-the-art safety standards and requirements for NASA missions must be established, implemented, enforced, and maintained that protect the astronauts, the workforce, and the public.
   b. Safety-related technical decision-making must be independent from programmatic considerations, including cost and schedule.
   c. Safety-related decision-making must be based on correct, complete, and up-to-date information.
   d. Overall (final) decision-making must include transparent and explicit consideration of both safety and programmatic concerns.
   e. The Agency must provide for effective assessment and improvement in safety-related decision-making.
2. Safety-related technical decision-making must be done by eminently qualified experts with broad participation of the full workforce.
   a. Technical decision-making must be credible (executed using credible personnel, technical requirements, and decision-making tools).
   b. Technical decision-making must be clear and unambiguous with respect to authority, responsibility, and accountability.
   c. All safety-related technical decisions, before being implemented by the Program, must have the approval of the technical decision-maker assigned responsibility for that class of decisions.
   d. Mechanisms and processes must be created that allow and encourage all employees and contractors to contribute to safety-related decision-making.
3. Safety analyses must be available and used starting in the early acquisition, requirements development, and design processes and continuing through the system lifecycle.
   a. High-quality system hazard analyses must be created.
   b. Personnel must have the capability to produce high-quality safety analyses.
   c. Engineers and managers must be trained to use the results of hazard analyses in their decision-making.
   d. Adequate resources must be applied to the hazard analysis process.
   e. Hazard analysis results must be communicated in a timely manner to those who need them. A communication structure must be established that includes contractors and allows communication downward, upward, and sideways (e.g., among those building subsystems).
   f. Hazard analyses must be elaborated (refined and extended) and updated as the design evolves and test experience is acquired.
   g. During operations, hazard logs must be maintained and used as experience is acquired. All in-flight anomalies must be evaluated for their potential to contribute to hazards.
4. The Agency must provide avenues for the full expression of technical conscience (for safety-related technical concerns) and provide a process for full and adequate resolution of technical conflicts as well as conflicts between programmatic and technical concerns.
   a. Communication channels, resolution processes, and adjudication procedures must be created to handle expressions of technical conscience.
   b. Appeals channels must be established to surface complaints and concerns about aspects of the safety-related decision-making and technical conscience structures that are not functioning appropriately.
The next step is to create a model of the safety control structure, in this case for the NASA manned space program. This model includes the roles and responsibilities of each organizational component with respect to safety. Each of the above system safety requirements and constraints is then traced to those components responsible for their implementation and enforcement. This process may identify omissions in the organizational design and places where overlapping control responsibilities could lead to conflicts or require careful coordination and communication.
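The tracing step lends itself to mechanization. In this hypothetical sketch (the requirement labels and component assignments are illustrative, not the actual ITA traceability data), each requirement is mapped to the components responsible for it, and the analysis flags both omissions and overlapping responsibilities:

```python
# Hypothetical requirement-to-component traceability map: flag requirements
# with no responsible component (omissions in the organizational design) and
# requirements with shared responsibility (need explicit coordination).
responsibility = {
    "1b independent technical decision-making": ["Chief Engineer"],
    "3a high-quality hazard analyses": ["Engineering", "S&MA"],
    "4a technical conscience channels": [],
}

omissions = [req for req, comps in responsibility.items() if not comps]
overlaps = {req: comps for req, comps in responsibility.items() if len(comps) > 1}

print("unassigned requirements:", omissions)
print("shared responsibility (coordinate):", overlaps)
```

The overlap report connects back to the earlier observation (Leplat, 1987) that accidents concentrate in boundary areas where more than one controller controls the same process.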
The fourth step is to perform a hazard analysis on the safety control structure, using STPA. As noted above, STPA works on both the technical (physical) and the organizational (social) aspects of systems. There are four general types of ITA risks:

1. Unsafe decisions are made by or approved by the ITA warrant holders.
2. Safe decisions are disallowed (i.e., overly conservative decision-making that undermines the goals of NASA and long-term support for ITA).
3. Decision-making takes too long, minimizing impact and also reducing support for ITA.
4. Good decisions are made by ITA, but they do not have adequate impact on system design, construction, and operation.
The hazard analysis applied each of these types of risks to the NASA organizational components and functions involved in safety-related decision-making and identified the risks (inadequate controls) associated with each. For example, one responsibility of the NASA Chief Engineer is to develop, monitor, and maintain the NASA technical standards and policy. The risks associated with this responsibility are:

1. General technical and safety standards and policy are not created.
2. Inadequate standards are created.
3. The standards degrade as they are changed over time due to external pressures to weaken them (as happened before the Columbia accident). The process for approving changes is flawed.
4. The standards are not changed or updated over time as the environment changes.
The resulting list of risks is quite long (250), but most appear to be important and not easily dismissed. To reduce the list to one that can be feasibly assessed, we categorize each risk as an immediate and substantial concern, a longer-term concern, or a concern that can be handled through standard processes and does not need a special assessment.
The next step in the risk analysis process is to create a system dynamics model, in this case a model of the NASA manned space program. Figure 5 shows a greatly simplified example; the model we used in the ITA risk analysis had several hundred variables and consists of nine sub-models related to safety culture and system safety engineering: Risk; Perceived Success by Administration; System Safety Resource Allocation; System Safety Efforts and Efficacy; Incident Learning and Corrective Action; Vehicle Aging and Maintenance; System Safety Knowledge, Skills, and Staffing; Launch Rate; and Independent Technical Authority. Because of the size of the complete model, the individual sub-models were first validated separately and then put together.

In the final step, the system dynamics model is used to identify which risks are the most important to measure and assess, i.e., which provide the best measure of the current level of organizational risk and are the most likely to detect increasing risk early enough to prevent significant losses. This analysis led to a list of the best leading indicators of increasing and unacceptable risk in the manned space program. The analysis also identified important structural changes to the current NASA safety control structure, and planned evolution of the safety-related decision-making structure over time, that could strengthen the efforts to avoid migration to unacceptable levels of organizational risk and avoid flawed management and engineering decision-making leading to an accident. Some examples of the results of the risk analysis follow.

Impact of the ITA on Risk
The ITA program was introduced in response to a CAIB recommendation and is intended to be a major contributor to maintaining acceptable risk in the manned space program. We found that the introduction of ITA has the potential to significantly reduce risk and to sustain an acceptable risk level, countering some of the natural tendency for risk to increase over time due to complacency generated by success, aging vehicles and infrastructures, etc. However, we also found significant risk of unsuccessful implementation of ITA, leading to a migration toward states of unacceptably high risk.

In order to investigate the effect of ITA parameters on the system-level dynamics, a 200-run Monte Carlo sensitivity analysis was performed. Random variations representing ±30% of the baseline ITA exogenous parameter values were used in the analysis. Figures 7 and 8 show the results.
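The structure of such an analysis can be sketched as follows. The model below is a deliberately simple stand-in for the actual ITA system dynamics model, with invented parameters and dynamics; only the Monte Carlo procedure itself (±30% variation of exogenous parameters, classification of each run's behavior mode) mirrors the analysis described:

```python
# Monte Carlo sensitivity sketch over a toy stand-in for the system dynamics
# model: vary an exogenous parameter +/-30% around its baseline, simulate,
# and classify each run's behavior mode.
import random

GAIN = 0.1                 # strength of the credibility-reinforcing loop
BASELINE_EROSION = 0.023   # complacency-driven erosion of safety effort

def simulate(erosion, steps=400):
    eff = 0.9              # initial ITA effectiveness/credibility
    for _ in range(steps):
        # logistic reinforcement minus a constant erosion term
        eff = max(eff + GAIN * eff * (1.0 - eff) - erosion, 0.0)
    return eff

def classify(final_eff):
    # mode 1: risk mitigated over the long run; mode 2: rise, then collapse
    return "successful" if final_eff > 0.5 else "collapse"

random.seed(1)
results = [classify(simulate(BASELINE_EROSION * random.uniform(0.7, 1.3)))
           for _ in range(200)]
print(results.count("successful"), "of", len(results), "runs sustain low risk")
```

In this toy model the reinforcing term peaks at GAIN/4, so runs whose sampled erosion rate exceeds that value have no stable equilibrium and collapse, qualitatively reproducing the split into two behavior modes described in the text.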
The initial sensitivity analysis identified two qualitatively different behavior modes: 75% of the simulation runs showed a successful ITA program implementation where risk is adequately mitigated for a relatively long period of time (labeled "1" in Figures 7 and 8); the other runs identified a behavior mode with an initial rapid rise in effectiveness and then a collapse into an unsuccessful ITA program implementation where risk increases rapidly and accidents occur (labeled "2").
The ITA support structure is self-sustaining in both behavior modes for a short period of time if the conditions are in place for its early acceptance. This early behavior is representative of an initial excitement phase when ITA is implemented and shows great promise to reduce the level of risk. This short-term reinforcing loop provides the foundation for a solid, sustainable ITA program and safety control structure for the manned space program under the right conditions.
Even in the successful scenarios, after a period of very high success, the effectiveness and credibility of the ITA slowly start to decline, mainly due to the effects of complacency: safety efforts begin to erode as the program is highly successful and safety is increasingly seen as a solved problem. When this decline occurs, resources are reallocated to more urgent performance-related matters. However, in the successful implementations, risk is still at acceptable levels, and an extended period of nearly steady-state equilibrium ensues where risk remains low.

In the unsuccessful ITA implementation scenarios, the effectiveness and credibility of the ITA quickly start to decline after the initial increase and eventually reach unacceptable levels. Conditions arise that limit the ability of ITA to have a sustained effect on the system. Hazardous events start to occur, and safety is increasingly perceived as an urgent problem. More resources are allocated to safety efforts, but at this point the Technical Authority (TA) and the Technical Warrant Holders (TWHs) have lost so much credibility that they can no longer contribute significantly to risk mitigation. As a result, risk increases dramatically, the ITA personnel and the Safety and Mission Assurance staff become overwhelmed with safety problems, and an increasing number of waivers are approved in order to continue flying.
As the number of problems identified increases along with their investigation requirements, corners may be cut to compensate, resulting in lower-quality investigation resolutions and corrective actions. If investigation requirements continue to increase, TWHs and Trusted Agents become saturated and simply cannot attend to each investigation in a timely manner. A bottleneck effect is created by requiring the TWHs to authorize all safety-related decisions, making things worse. Examining the factors in these unsuccessful scenarios can assist in making changes to the program to prevent them and, if that is not possible or desirable, to identify leading indicator metrics to detect rising risk while effective interventions are still possible and not overly costly in terms of resources and downtime.
Leading Indicators of Increasing Risk
Results from a metrics analysis using our system dynamics model show that many model variables may provide good indications of risk. However, many of these indicators will only show an increase in risk after it has happened, limiting their role in preventing accidents. For example, the number of waivers issued over time is a good indicator of increasing risk, but its effectiveness is limited by the fact that waivers start to accumulate after risk has started to increase rapidly (see Figure 9).
Other lagging indicators include the amount of resources available for safety activities; the schedule pressure, which will only be reduced when managers believe the system to be unsafe; and the perception of the risk level by management, which is primarily affected by events such as accidents and close calls. Figure 10 shows an example of a leading indicator.
Finding leading indicators that can be used to monitor the program and detect increasing risk early is extremely important because of the non-linear tipping point in the technical risk level. At this tipping point, risk increases slowly at first and then very rapidly (i.e., the reinforcing loop has a gain > 1). The system can be prevented from reaching this point, but once it is reached, multiple serious problems occur rapidly and overwhelm the problem-solving capacity of ITA and the program management. When the system reaches that state, risk starts to increase rapidly, and a great deal of effort and resources will be necessary to bring the risk level back down to acceptable levels.
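The tipping-point dynamic can be illustrated with a toy model (all rates and capacities below are invented): while the problem arrival rate stays within the organization's resolution capacity, open problems simply track arrivals; once capacity is exceeded, each unresolved problem breeds follow-on work and the reinforcing loop's gain exceeds 1:

```python
# Toy illustration of the tipping point: linear growth while capacity holds,
# super-linear growth (reinforcing loop gain > 1) once it is exceeded.
CAPACITY = 10.0   # problems the program's staff can resolve per period

def step(open_problems: float, new_arrivals: float) -> float:
    resolved = min(open_problems, CAPACITY)
    # each unresolved problem spawns 1.2 follow-on problems next period
    carried = (open_problems - resolved) * 1.2
    return carried + new_arrivals

trace, open_problems = [], 0.0
for t in range(30):
    open_problems = step(open_problems, new_arrivals=2.0 + 0.5 * t)
    trace.append(open_problems)

# Early periods grow linearly with the arrival rate; once capacity is
# exceeded (around t = 17 here), growth compounds and accelerates.
```

The same qualitative shape, slow growth followed by rapid escalation once problem-solving capacity saturates, is what makes early leading indicators so much more valuable than lagging ones.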
The modeling and analysis identified five leading indicators of increasing and unacceptable risk levels in the NASA Manned Space Program that should be tracked: (1) knowledge, skills, and quality of the TWHs, Trusted Agents, and safety staff (e.g., experience, technical knowledge, communication skills, reputation, social network, difficulty in recruiting replacements, and amount of training); (2) investigation activity (fraction of problem reports under investigation, number of unresolved or unhandled problems); (3) quality of the safety analyses (knowledge and skills of the analysis staff, resources applied to analyses, availability of lessons learned); (4) quality of incident (hazardous event and anomaly) investigation (number and quality of those involved, resources and workload, independence and work balance, systemic factor fixes vs. symptom removal); and (5) power and authority of the Trusted Agents and Technical Warrant Holders.
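A monitoring program built on these five indicators might combine them as in the following hypothetical sketch (the indicator names echo the list above, but the scores, trend window, and alert rule are invented for illustration):

```python
# Hypothetical leading-indicator watch: each tracked area reports a
# normalized score in [0, 1] (1 = healthy); a sustained decline across
# several areas at once triggers an early warning.
indicators = {
    "staff knowledge/skills":  [0.9, 0.9, 0.8, 0.7],
    "investigation backlog":   [0.9, 0.8, 0.7, 0.6],
    "safety analysis quality": [0.9, 0.9, 0.9, 0.8],
    "incident investigation":  [0.8, 0.8, 0.8, 0.8],
    "TWH power/authority":     [0.9, 0.8, 0.6, 0.5],
}

def declining(series, window=3):
    """True if the last `window` observations are non-increasing and net down."""
    recent = series[-window:]
    return all(b <= a for a, b in zip(recent, recent[1:])) and recent[-1] < recent[0]

warnings = [name for name, series in indicators.items() if declining(series)]
alert = len(warnings) >= 2   # several areas degrading together: investigate
print(warnings, alert)
```

Requiring several indicators to degrade together reflects the point made earlier: the goal is not a quantitative risk number but early detection of the organization drifting toward a high-risk state.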
Conclusions
While the reliability and system safety engineering approaches developed after World War II have served us well for a long time, technology and complexity are starting to overwhelm their capabilities. This paper has distinguished the approaches of Reliability Engineering and System Safety to building safe and reliable systems and described their limitations. A new approach, based on systems theory, has the potential to overcome these limitations and is being proven on a wide variety of industrial applications. The new approach has the advantage of not only applying to new types of technology but also extending the boundaries of traditional risk and hazard analysis to the social and organizational aspects of systems considered in the large.
References
Charles Billings (1996). Aviation Automation: The Search for a Human-Centered Approach, Lawrence Erlbaum Associates, New York.

J.H. Cooper (1891). "Accident-prevention devices applied to machines," Transactions of the ASME, 12:249-264.

Nicolas Dulac and Nancy Leveson (2004). "An Approach to Design for Safety in Complex Systems," International Symposium on Systems Engineering (INCOSE '04), Toulouse, France, June.

Jay W. Forrester (1961). Industrial Dynamics, Pegasus Communications.

Jerome Lederer (1986). "How far have we come? A look back at the leading edge of system safety eighteen years ago," Hazard Prevention, page 8, May/June.

Jacques Leplat (1987). "Occupational accident research and systems approach," in Jens Rasmussen, Keith Duncan, and Jacques Leplat (editors), New Technology and Human Error, pp. 181-191, New York: John Wiley & Sons.

Nancy Leveson (1995). Safeware, Reading, MA: Addison-Wesley.

Nancy Leveson (2005). A New Approach to System Safety Engineering, incomplete draft (downloadable from http://sunnyday.mit.edu/book2.html).

Nancy Leveson, Nicolas Dulac, Betty Barrett, John Carroll, Joel Cutcher-Gershenfeld, and Stephen Friedenthal (2005). Risk Analysis of NASA Independent Technical Authority, July, http://sunnyday.mit.edu/ITA-Risk-Analysis.doc.

C.O. Miller (1985). "A comparison of military and civil approaches to aviation system safety," Hazard Prevention, pages 29-34, May/June.

Charles Perrow (1984). Normal Accidents, Basic Books (reprinted in 1999 by Princeton University Press).

Jens Rasmussen (1997). "Risk management in a dynamic society: A modelling problem," Safety Science, Vol. 27, No. 2/3, Elsevier Science Ltd., pp. 183-213.

Jens Rasmussen and Inge Svedung (2000). Proactive Risk Management in a Dynamic Society, Swedish Rescue Services Agency.

Peter Senge (1990). The Fifth Discipline: The Art and Practice of the Learning Organization, Doubleday Currency, New York.

John Sterman (2000). Business Dynamics: Systems Thinking and Modeling for a Complex World, McGraw-Hill, New York.