
The Resilient Organisation

Erik Hollnagel
Professor & Industrial Safety Chair
MINES ParisTech Crisis and Risk Research Centre
Sophia Antipolis, France
E-mail: erik.hollnagel@crc.ensmp.fr
Erik Hollnagel, 2009

Controlling the unpredictable

Safety is the freedom from unacceptable risk.

[Figure: a functional model of aircraft pitch control and its oversight. Functions such as FAA certification, maintenance oversight, aircraft design, jackscrew lubrication and replacement, end-play checking, limiting and controlling stabilizer movement, and horizontal stabilizer movement are coupled through aspects marked I (input), O (output), R (resource), C (control), and T (time). Conditions such as high workload, procedures, expertise, redundant design, mechanics, grease, equipment, allowable end-play, and the accident/risk model link the functions together.]

What may happen? How should we respond?

Models and methods must be able to capture the functional complexity of the system being analysed.
Erik Hollnagel, 2009

Three ages of industrial safety

Hale & Hovden (1998)

Things can go wrong because technology fails: the age of technology.

[Timeline, 1850-2000: 1769 Industrial Revolution; 1893 Railroad Safety Appliance Act; 1931 Industrial accident prevention; 1961 Fault tree analysis; the IT Revolution.]
Erik Hollnagel, 2009

Technical analysis methods

[Timeline, 1900-2010, marking when technical analysis methods appeared: FMEA, Fault tree analysis, FMECA, and HAZOP.]
Erik Hollnagel, 2009

How do we know technology is safe?

Design principles: clear and explicit
Architecture and components: known
Models: formal, explicit
Analysis methods: standardised, validated
Mode of operation: well-defined (simple)
Structural stability: high (permanent)
Functional stability: high
Erik Hollnagel, 2009

Human factors wake-up call

Three Mile Island Unit 2 (TMI-2), March 28, 1979: partial meltdown of the core, due to a combination of equipment malfunctions, design problems, and worker errors.

Human factors became recognised as a critical part of plant safety (operator training and staffing requirements, instrumentation and controls, instructions and procedures).
Erik Hollnagel, 2009

Three ages of industrial safety

Things can go wrong because the human factor fails: the age of human factors, following the age of technology.

[Timeline, 1850-2000: 1769 Industrial Revolution; 1893 Railroad Safety Appliance Act; 1931 Industrial accident prevention; 1961 Fault tree analysis; IT Revolution; 1979 Three Mile Island.]
Erik Hollnagel, 2009

Human factors analysis methods

[Timeline, 1900-2010, adding human factors methods to the technical ones: Domino, Root cause analysis (RCA), THERP, HCR, CSNI, HPES, HEAT, Swiss Cheese, ATHEANA, HERA, AEB, and TRACEr, alongside HAZOP, FMEA, Fault tree analysis, and FMECA. Legend: Technical, Human Factors.]
Erik Hollnagel, 2009

How do we know humans are safe?

Design principles: unknown, inferred
Architecture and components: partly known, partly unknown
Models: mainly analogies
Analysis methods: ad hoc, unproven
Mode of operation: vaguely defined, complex
Structural stability: variable
Functional stability: usually reliable
Erik Hollnagel, 2009

Safety culture / organisational failures

Several very serious accidents, Chernobyl (1986) and Challenger (1986) among them, made it clear that safety could not be ensured by addressing technical and human factors alone.

Safety culture: "That assembly of characteristics and attitudes in organizations and individuals which establishes that, as an overriding priority, nuclear plant safety issues receive the attention warranted by their significance." (IAEA, INSAG-1, 1986)
Erik Hollnagel, 2009

Normal accident theory (Perrow, 1984)

"On the whole, we have complex systems because we don't know how to produce the output through linear systems."
Erik Hollnagel, 2009

Three ages of industrial safety

Things can go wrong because organisations fail: the age of safety management, following the ages of human factors and technology.

[Timeline, 1850 onwards: 1769 Industrial Revolution; 1893 Railroad Safety Appliance Act; 1931 Industrial accident prevention; 1961 Fault tree analysis; IT Revolution; 1979 Three Mile Island; 2003 Columbia; 2009 AF 447.]
Erik Hollnagel, 2009

Organisational analysis methods

[Timeline, 1900-2010, adding organisational methods to the technical and human factors ones: MORT, MTO, TRIPOD, STEP, AcciMap, MERMOS, and CREAM, alongside Domino, Root cause analysis (RCA), THERP, HCR, CSNI, HPES, HEAT, Swiss Cheese, ATHEANA, HERA, AEB, TRACEr, HAZOP, FMEA, Fault tree analysis, and FMECA. Legend: Technical, Human Factors, Organisational.]
Erik Hollnagel, 2009

How do we know organisations are safe?

Design principles: high-level, programmatic
Architecture and components: partly known, partly unknown
Models: semi-formal
Analysis methods: ad hoc, unproven
Mode of operation: partly defined, complex
Structural stability: stable (formal), volatile (informal)
Functional stability: good, hysteretic (lagging)
Erik Hollnagel, 2009

Simple causal thinking

Starting from the effect, you can reason backwards to find the cause. Starting from the cause, you can reason forwards to find the effect.
Erik Hollnagel, 2009

Complex causal thinking

In epidemiology, injury/damage is due to the interaction among host, agent, and environment (Suchman, 1961).

[Figure: host, agent, and environment, with the host's defences in between.]

Defences of a host may be weakened by an unsupportive environment.
Erik Hollnagel, 2009

Nature of technical (formal) systems

There are many identical systems. They can be described bottom-up in terms of components and subsystems. Decomposition works for technical systems, because they have been designed. Risks and failures can therefore be analysed relative to individual components and events.

Outputs (effects) are proportional to inputs (causes) and predictable from knowledge of the components. Technical systems are linear and event outcomes are tractable.
Erik Hollnagel, 2009

Nature of socio-technical systems

All systems are unique. They must be described top-down in terms of functions and objectives. Decomposition does not work for socio-technical systems, because they are emergent. Risks and failures must therefore be described relative to functional wholes.

Complex relations between inputs (causes) and outputs (effects) give rise to unexpected and disproportionate consequences. Socio-technical systems are non-linear and event outcomes are intractable.
Erik Hollnagel, 2009

Theories and models of the negative

Accidents are caused by people, due to carelessness, inexperience, and/or wrong attitudes. Technology and materials are imperfect, so failures are inevitable. Organisations are complex but brittle, with limited memory and an unclear distribution of authority.
Erik Hollnagel, 2009

Changes in attribution of causes

[Chart: percentage of accidents attributed to each kind of cause, 1960-2010. Attribution to technology declines over the period, attribution to human factors ("human error") rises from the 1960s, and attribution to the organisation emerges and grows from the late 1980s.]
Erik Hollnagel, 2009

Safety as risk reduction

Safety is normally measured by the absence of negative outcomes. This can be achieved in three different ways:
- eliminating hazards (design),
- preventing initiating events (constraints),
- protecting against consequences (barriers).

But what happens when there is no measurable change?

Safety, as commonly practised, implies a distinction between:
- Normal operations, which ensure the system works as it should and produces the intended outcomes.
- Abnormal operations, which disrupt or disturb normal operations or otherwise render them ineffective.

The purpose of safety management is to maintain normal operations by preventing disruptions or disturbances. Safety efforts are usually driven by what has happened in the past, and are therefore reactive.
Erik Hollnagel, 2009

Safety = (1 - Risk)

"By 2020 a new safety paradigm will have been widely adopted in European industry. Safety is seen as a key factor for successful business and an inherent element of business performance. As a result, industrial safety performance will have progressively and measurably improved in terms of reduction of
- reportable accidents at work,
- occupational diseases,
- environmental incidents and
- accident-related production losses.
It is expected that an incident elimination culture will develop where safety is embedded in design, maintenance, operation and management at all levels in enterprises. This will be identifiable as an output from this Technology Platform meeting its quantified objectives."

Note that the measurements are all negative or unwanted outcomes.
Erik Hollnagel, 2009

Focus on operation (sharp end, 1984)

[Figure: operations at the sharp end, with the organisation (management) above, technology below, design and maintenance along the lifecycle, and upstream and downstream processes on either side.]

Work has clear objectives and takes place in well-defined situations. Systems and technologies are loosely coupled and tractable.
Erik Hollnagel, 2009

Vertical and horizontal extensions (2008)

[Figure: the same diagram extended in three directions: a vertical extension to cover the entire system, from technology to organisation; one horizontal extension to cover the whole lifecycle, from design to maintenance; and a second horizontal extension to cover upstream and downstream processes.]

Work is underspecified. Systems and technologies are tightly coupled and intractable.
Erik Hollnagel, 2009

Tractable and intractable systems

                    Tractable system               Intractable system
                    (loosely coupled)              (tightly coupled)
Complicacy:         Descriptions are simple,       Descriptions are elaborate,
                    with few details               with many details
Comprehensibility:  All principles of              Some principles of
                    functioning are known          functioning are unknown
Stability:          System does not change         System changes before a
                    while being described          description is completed
Specification:      Fully specified                Underspecified, partly specified
Erik Hollnagel, 2009
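Read as a checklist, the three dimensions of the table above suggest a crude classification rule. The Python sketch below is purely illustrative; the function name and the boolean tests are a paraphrase of the table, not a validated instrument from the slides.

```python
# Crude, illustrative classification of a system as tractable or intractable,
# paraphrasing the three dimensions of the table above.

def classify_system(description_is_simple: bool,
                    principles_are_known: bool,
                    stable_while_described: bool) -> str:
    """Classify a system along the tractable/intractable distinction."""
    if description_is_simple and principles_are_known and stable_while_described:
        # All three dimensions hold: the system can be fully specified.
        return "tractable (loosely coupled)"
    # Any failing dimension means the description will be underspecified.
    return "intractable (tightly coupled)"

# A typical socio-technical system fails at least one test:
print(classify_system(description_is_simple=False,
                      principles_are_known=True,
                      stable_while_described=False))
# -> intractable (tightly coupled)
```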

Performance variability is necessary

Systems are so complex that work situations are always underspecified, hence partly unpredictable. Few if any tasks can successfully be carried out unless procedures and tools are adapted to the situation. Performance variability is both normal and necessary.

Many socio-technical systems are intractable. The conditions of work therefore never completely match what has been specified or prescribed. Individuals, groups, and organisations normally adjust their performance to meet existing conditions, specifically actual resources and requirements. Because resources (time, manpower, information, etc.) are always finite, such adjustments will always be approximate rather than exact.

[Figure: performance variability leads to success as well as to failure.]
Erik Hollnagel, 2009

Range of event outcomes

[Figure: outcomes plotted from negative to positive against predictability from very low to very high. At low predictability, positive outcomes are serendipity and good luck, neutral outcomes are random events and near misses, and negative outcomes range from incidents and accidents to disasters. At high predictability lie the normal outcomes (things that go right) and, at the negative end, mishaps (outcomes that should have been eliminated).]
Erik Hollnagel, 2009

Range of event outcomes

[The same figure, highlighting the focus of safety management: the negative, unpredictable outcomes (near misses, incidents, accidents, disasters) and the mishaps that should have been eliminated.]
Erik Hollnagel, 2009

Range of event outcomes

[The same figure, highlighting the focus of resilience engineering: the normal outcomes, the things that go right.]
Erik Hollnagel, 2009

Range of event outcomes

[The same figure, showing both: safety management focuses on the negative outcomes (incidents, accidents, mishaps), while resilience engineering also covers the normal outcomes that go right.]
Erik Hollnagel, 2009

Frequency of outcomes

[The same figure with frequency added as a further dimension: the predictable normal outcomes are by far the most frequent, while near misses, incidents, accidents, and disasters are progressively rarer.]
Erik Hollnagel, 2009

Why only look at what goes wrong?

Safety = reduced number of adverse events (10^-4 := 1 failure in 10,000 events).
- Focus is on what goes wrong.
- Look for the underlying failures and malfunctions.
- Try to eliminate causes and improve barriers.
- Safety and core business compete for resources.
- Learning only uses a fraction of the data available.

Safety = ability to succeed under varying conditions (1 - 10^-4 := 9,999 non-failures in 10,000 events).
- Focus is on what goes right.
- Understand why normal performance succeeds.
- Use that to perform better and safer.
- Safety and core business help each other.
- Learning uses most of the data available.
Erik Hollnagel, 2009
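The arithmetic behind the two views can be made concrete. The following Python sketch is illustrative only, using the hypothetical 10^-4 failure rate from the slide to show how much of the available data each perspective learns from.

```python
# Illustrative arithmetic for the slide's hypothetical 10^-4 failure rate:
# how much of the available operational data does each perspective use?

n_events = 10_000
p_failure = 1e-4                        # assumed probability of an adverse event

failures = round(p_failure * n_events)  # -> 1 failure
successes = n_events - failures         # -> 9,999 non-failures

# Traditional safety management studies only the adverse events:
print(f"traditional view: {failures / n_events:.2%} of the data")   # 0.01%

# A resilience perspective also asks why the other events went right:
print(f"resilience view:  {successes / n_events:.2%} of the data")  # 99.99%
```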

Failures or successes?

When something goes wrong, e.g. 1 event out of 10,000 (10^-4), humans are assumed to be responsible in 80-90% of the cases. Who or what is responsible for the remaining 10-20%?

When something goes right, e.g. 9,999 events out of 10,000, are humans also responsible in 80-90% of the cases? Who or what is responsible for the remaining 10-20%?

Investigation of failures is accepted as important. Investigation of successes is rarely undertaken.
Erik Hollnagel, 2009

From the negative to the positive

Negative outcomes are caused by failures and malfunctions. Safety = reduced number of adverse events. The remedy is to eliminate failures and malfunctions as far as possible.

Safety = ability to respond when something fails. The remedy is to improve the ability to respond to adverse events.

All outcomes (positive and negative) are due to performance variability. Safety = ability to succeed under varying conditions. The remedy is to improve resilience.
Erik Hollnagel, 2009

Are failures different from successes?

Things that go wrong (disasters, accidents, incidents, near misses) are explained by failures, malfunctions, violations, error mechanisms, slips, and unsafe acts.

Things that go right (normal actions) are explained by... ???
Erik Hollnagel, 2009

Premises for Resilience Engineering

- Performance conditions are always underspecified. Individuals and organisations must therefore always adjust their performance to match current demands and resources; because resources and time are finite, such adjustments will inevitably be approximate.
- Many adverse events can be attributed to a breakdown or malfunctioning of components and normal system functions, but many cannot. These events can be understood as the result of unexpected combinations of performance variability.
- Safety management cannot be based on hindsight, nor rely on error tabulation and the calculation of failure probabilities. Safety management must be proactive as well as reactive.
- Safety cannot be isolated from the core (business) process, nor vice versa. Safety is the prerequisite for productivity, and productivity is the prerequisite for safety. Safety is achieved by improvements rather than by constraints.
Erik Hollnagel, 2009

The resilient organisation

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions.

A practice of Resilience Engineering / Proactive Safety Management requires that all levels of the organisation are able to:
- Respond to regular and irregular conditions in an effective, flexible manner (the actual);
- Monitor short-term developments and threats, and revise risk models (the critical);
- Learn from past events, and understand correctly what happened and why (the factual);
- Anticipate long-term threats and opportunities (the potential).
Erik Hollnagel, 2009

Designing for resilience

- Responding (the actual): knowing what to do, and being capable of doing it.
- Monitoring (the critical): knowing what to look for (indicators).
- Learning (the factual): knowing what has happened.
- Anticipating (the potential): finding out and knowing what to expect.

An increased availability and reliability of functioning on all levels will both improve safety and enhance control, hence the ability to predict, plan, and produce.
Erik Hollnagel, 2009
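The four abilities can also be read as an architectural sketch: a loop in which monitoring feeds the risk model, learning builds readiness, and responding draws on both. The Python below is a hypothetical rendering of that loop, not anything the slides specify; every class, method, and attribute name is invented for illustration.

```python
# Hypothetical sketch of the four resilience abilities as one management
# loop. All names are invented for illustration.

class ResilientOrganisation:
    def __init__(self):
        self.risk_model = {}    # revised by monitoring (the critical)
        self.lessons = {}       # built up by learning (the factual)

    def respond(self, event):
        """The actual: act effectively and flexibly, even on irregular events."""
        return self.lessons.get(event, "improvise within safe limits")

    def monitor(self, indicators):
        """The critical: track short-term indicators and revise the risk model."""
        self.risk_model.update(indicators)

    def learn(self, event, what_worked):
        """The factual: learn from what happened, successes included."""
        self.lessons[event] = what_worked

    def anticipate(self):
        """The potential: look ahead for longer-term threats and opportunities."""
        return [name for name, level in self.risk_model.items() if level > 0.5]
```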

Time to think or time to do?

THINK! Work is carefully planned and monitored. Demands match capacity; control is kept.

DO! Work is paced by technology and external events. Demands exceed capacity; it is easy to lose control.

Efficient performance requires a balance between thinking and doing.
Erik Hollnagel, 2009

Efficiency-Thoroughness Trade-Off

Thoroughness (time to think): recognising the situation, choosing and planning. If thoroughness dominates, there may be too little time to carry out the actions: pending actions are neglected and new events are missed.

Efficiency (time to do): implementing plans, executing actions. If efficiency dominates, actions may be badly prepared or wrong: pre-conditions are missed and only the expected results are looked for.

The balance depends on the time and resources needed versus the time and resources available.
Erik Hollnagel, 2009
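As a toy decision rule, the trade-off might be sketched as below. This is illustrative only; the function name, arguments, and messages are invented, not part of the ETTO literature.

```python
# Toy sketch of the efficiency-thoroughness trade-off as a resource check.
# Function name, arguments, and messages are invented for illustration.

def etto(resources_needed: float, resources_available: float,
         dominant_concern: str) -> str:
    """Decide what gets cut when demands exceed capacity."""
    if resources_needed <= resources_available:
        return "balanced: enough time both to think and to do"
    if dominant_concern in ("throughput", "output"):
        # Efficiency wins: checks are skipped, prior results trusted.
        return "thoroughness is cut ('normally OK, no need to check')"
    # Safety or quality dominant: thoroughness wins, actions are deferred.
    return "efficiency is cut (pending actions are postponed)"

print(etto(8.0, 5.0, "throughput"))
# -> thoroughness is cut ('normally OK, no need to check')
```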

Some ETTO heuristics

Cognitive (individual): judgement under uncertainty; cognitive primitives (similarity matching, frequency gambling); reactions to information input overload and underload; cognitive style; confirmation bias.

Idiosyncratic (work related): "Looks fine"; "Not really important"; "Normally OK, no need to check"; "I've done it millions of times before"; "Will be checked by someone else"; "Has been checked by someone else"; "This way is much quicker"; "No time (or resources) to do it now"; "Can't remember how to do it"; "We always do it this way"; "It looks like X (so it probably is X)"; "We must get this done"; "Must be ready in time"; "Must not use too much of X".

Collective (organisation): negative reporting; reduce redundancy; meet production targets; reduce unnecessary cost; double-bind; reject conflicting information.
Erik Hollnagel, 2009

Efficiency-Thoroughness Trade-Off

For distributed work it is necessary to trust what others do; it is impossible to check everything.

Thoroughness: confirm that input is correct; consider secondary outcomes and side-effects.
Efficiency: trust that input is correct; assume someone else takes care of outcomes.

One way of managing time and resource limitations is to think only one step back and/or one step ahead.
Erik Hollnagel, 2009

Mutual optimism

[Figure: a chain of people, each thinking the same thing: "I can allow myself to be effective because the others will be thorough."]
Erik Hollnagel, 2009

The ETTO principle

The ETTO principle describes the fact that people (and organisations), as part of their activities, practically always must make a trade-off between the resources (time and effort) they spend on preparing an activity and the resources (time, effort and materials) they spend on doing it.

ETTOing favours efficiency over thoroughness if throughput and output are the dominant concerns, and thoroughness over efficiency if safety and quality are the dominant concerns.

It follows from the ETTO principle that it is impossible to maximise efficiency and thoroughness at the same time. Nor can an activity be expected to succeed if there is not a minimum of either.
Erik Hollnagel, 2009

Resilience and what individuals do

[Figure: the four abilities (responding, monitoring, learning, anticipating) with their objects (the actual, the critical, the factual, the potential), shaped by individual ETTO rules.]

A single individual cannot address all four abilities equally well, and will normally prioritise the ability to respond. ETTOing will affect the trade-off between the four abilities.

Individual ETTO rules: "Looks fine"; "Not really important"; "Normally OK, no need to check now"; "I've done it hundreds of times before"; "Will be checked by someone else"; "Has been checked by someone else"; "This way is much quicker"; "No time (or resources) to do it now"; "Can't remember how to do it"; "We always do it this way"; "It looks like X (so it probably is X)"; "We must get this done (ready in time)"; "We must not use too much of X".
Erik Hollnagel, 2009

Resilience and what organisations do

[Figure: the same four abilities, shaped by organisational ETTO rules.]

An organisation can allocate resources to all four abilities, according to its priorities. ETTOing will affect how well each ability is addressed (a trade-off within abilities).

Organisational or collective ETTO rules: reduce redundancy; double-bind (DO and DON'T); reject conflicting information; reduce unnecessary cost and effort; meet production targets ("safety first, but..."); negative reporting (only report when something is wrong).
Erik Hollnagel, 2009

The ETTO-TETO paradox

Efficiency in the present requires thoroughness in the past. Efficiency in the future requires thoroughness in the present.

[Figure: alternating phases of thoroughness and efficiency over time; each phase of learning and preparing indicators and responses enables the efficiency of the next.]

In order to be resilient it is necessary that short-term ETTO is balanced by long-term TETO (Thoroughness-Efficiency Trade-Off).
Erik Hollnagel, 2009
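One way to see the paradox is as a two-period dependency: the efficiency achievable now is bounded by the thoroughness invested earlier. The toy model below is only a sketch; the numbers and the bounding rule are invented.

```python
# Toy model of the ETTO-TETO paradox: efficiency achievable in one period
# is bounded by the thoroughness invested in the period before. The numbers
# and the bounding rule are invented for illustration.

def achievable_efficiency(previous_thoroughness: float) -> float:
    """Efficiency now cannot exceed the preparation done earlier (0..1)."""
    return min(1.0, previous_thoroughness)

thoroughness_by_period = [0.9, 0.2, 0.8]   # invested preparation per period
for t in range(1, len(thoroughness_by_period)):
    cap = achievable_efficiency(thoroughness_by_period[t - 1])
    print(f"period {t}: efficiency capped at {cap:.1f}")
# Cutting thoroughness in period 1 (0.2) caps the efficiency of period 2.
```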

As high as reasonably practicable

- Responding (the actual): Which events? How were they found? Is the list revised? How is readiness ensured and maintained?
- Monitoring (the critical): How are indicators defined? Lagging or leading? How are they measured? Are effects transient or permanent? Who looks where and when? How, and when, are they revised?
- Learning (the factual): What is learned, and when: continuously or event-driven (successes or failures)? How (qualitatively, quantitatively)? By the individual or by the organisation?
- Anticipating (the potential): What is our model of the future? How long do we look ahead? What risks are we willing to take? Who believes what and why?
Erik Hollnagel, 2009

"We attribute the success of HROs in managing the unexpected to their determined efforts to act mindfully. By this we mean that they organize themselves in such a way that they are better able to notice the unexpected in the making and halt its development. If they have difficulty halting the development of the unexpected, they focus on containing it. And if some of the unexpected breaks through the containment, they focus on resilience and swift restoration of system functioning.

When we call this approach mindful, we mean that HROs strive to maintain an underlying style of mental functioning that is distinguished by continuous updating and deepening of increasingly plausible interpretations of what the context is, what problems define it, and what remedies it contains. The key difference between HROs and other organizations in managing the unexpected often occurs in the earliest stages, when the unexpected may give off only weak signals of trouble. The overwhelming tendency is to respond to weak signals with a weak response. Mindfulness preserves the capability to see the significant meaning of weak signals and to give strong responses to weak signals. This counterintuitive act holds the key to managing the unexpected."

Karl E. Weick & Kathleen M. Sutcliffe (2001), Managing the Unexpected: Assuring High Performance in an Age of Complexity. Jossey-Bass.

Thank you for your attention

Erik Hollnagel, 2009
