
The Resilient Organisation

Erik Hollnagel
Professor & Industrial Safety Chair
MINES ParisTech Crisis and Risk Research Centre
Sophia Antipolis, France
E-mail: erik.hollnagel@crc.ensmp.fr
Erik Hollnagel, 2009

Controlling the unpredictable

Safety is the freedom from unacceptable risk.

[Figure: a functional model of aircraft pitch control and its oversight. Functions such as FAA certification, maintenance oversight, aircraft design, jackscrew lubrication and replacement, end-play checking, limiting and controlling stabilizer movement, and horizontal stabilizer movement are coupled through aspects marked I (input), O (output), R (resource), C (control), and T (time). Conditions such as high workload, procedures, expertise, redundant design, mechanics, grease, equipment, allowable end-play, and the accident/risk model link the functions together.]

What may happen? How should we respond?

Models and methods must be able to capture the functional complexity of the system being analysed.
Erik Hollnagel, 2009

Three ages of industrial safety

Hale & Hovden (1998)

Things can go wrong because technology fails: the age of technology.

[Timeline, 1850-2000: 1769 Industrial Revolution; 1893 Railroad Safety Appliance Act; 1931 Industrial accident prevention; 1961 Fault tree analysis; the IT Revolution.]
Erik Hollnagel, 2009

Technical analysis methods

[Timeline, 1900-2010, marking when technical analysis methods appeared: FMEA, Fault tree analysis, FMECA, and HAZOP.]
Erik Hollnagel, 2009

How do we know technology is safe?

Design principles: clear and explicit
Architecture and components: known
Models: formal, explicit
Analysis methods: standardised, validated
Mode of operation: well-defined (simple)
Structural stability: high (permanent)
Functional stability: high
Erik Hollnagel, 2009

Human factors wake-up call

Three Mile Island Unit 2 (TMI-2), March 28, 1979: partial meltdown of the core, due to a combination of equipment malfunctions, design problems, and worker errors.

Human factors became recognised as a critical part of plant safety (operator training and staffing requirements, instrumentation and controls, instructions and procedures).
Erik Hollnagel, 2009

Three ages of industrial safety

Things can go wrong because the human factor fails: the age of human factors, following the age of technology.

[Timeline, 1850-2000: 1769 Industrial Revolution; 1893 Railroad Safety Appliance Act; 1931 Industrial accident prevention; 1961 Fault tree analysis; IT Revolution; 1979 Three Mile Island.]
Erik Hollnagel, 2009

Human factors analysis methods

[Timeline, 1900-2010, adding human factors methods to the technical ones: Domino, Root cause analysis (RCA), THERP, HCR, CSNI, HPES, HEAT, Swiss Cheese, ATHEANA, HERA, AEB, and TRACEr, alongside HAZOP, FMEA, Fault tree analysis, and FMECA. Legend: Technical, Human Factors.]
Erik Hollnagel, 2009

How do we know humans are safe?

Design principles: unknown, inferred
Architecture and components: partly known, partly unknown
Models: mainly analogies
Analysis methods: ad hoc, unproven
Mode of operation: vaguely defined, complex
Structural stability: variable
Functional stability: usually reliable
Erik Hollnagel, 2009

Safety culture / organisational failures

Several very serious accidents, Chernobyl (1986) and Challenger (1986) among them, made it clear that safety could not be ensured by addressing technical and human factors alone.

Safety culture: "That assembly of characteristics and attitudes in organizations and individuals which establishes that, as an overriding priority, nuclear plant safety issues receive the attention warranted by their significance." (IAEA, INSAG-1, 1986)
Erik Hollnagel, 2009

Normal accident theory (Perrow, 1984)

"On the whole, we have complex systems because we don't know how to produce the output through linear systems."
Erik Hollnagel, 2009

Three ages of industrial safety

Things can go wrong because organisations fail: the age of safety management, following the ages of human factors and technology.

[Timeline, 1850 onwards: 1769 Industrial Revolution; 1893 Railroad Safety Appliance Act; 1931 Industrial accident prevention; 1961 Fault tree analysis; IT Revolution; 1979 Three Mile Island; 2003 Columbia; 2009 AF 447.]
Erik Hollnagel, 2009

Organisational analysis methods

[Timeline, 1900-2010, adding organisational methods to the technical and human factors ones: MORT, MTO, TRIPOD, STEP, AcciMap, MERMOS, and CREAM, alongside Domino, Root cause analysis (RCA), THERP, HCR, CSNI, HPES, HEAT, Swiss Cheese, ATHEANA, HERA, AEB, TRACEr, HAZOP, FMEA, Fault tree analysis, and FMECA. Legend: Technical, Human Factors, Organisational.]
Erik Hollnagel, 2009

How do we know organisations are safe?

Design principles: high-level, programmatic
Architecture and components: partly known, partly unknown
Models: semi-formal
Analysis methods: ad hoc, unproven
Mode of operation: partly defined, complex
Structural stability: stable (formal), volatile (informal)
Functional stability: good, hysteretic (lagging)
Erik Hollnagel, 2009

Simple causal thinking

Starting from the effect, you can reason backwards to find the cause. Starting from the cause, you can reason forwards to find the effect.
Erik Hollnagel, 2009

Complex causal thinking

In epidemiology, injury/damage is due to the interaction among host, agent, and environment (Suchman, 1961).

[Figure: host, agent, and environment, with the host's defences in between.]

Defences of a host may be weakened by an unsupportive environment.
Erik Hollnagel, 2009

Nature of technical (formal) systems

There are many identical systems. They can be described bottom-up in terms of components and subsystems. Decomposition works for technical systems, because they have been designed. Risks and failures can therefore be analysed relative to individual components and events.

Outputs (effects) are proportional to inputs (causes) and predictable from knowledge of the components. Technical systems are linear and event outcomes are tractable.
Erik Hollnagel, 2009

Nature of socio-technical systems

All systems are unique. They must be described top-down in terms of functions and objectives. Decomposition does not work for socio-technical systems, because they are emergent. Risks and failures must therefore be described relative to functional wholes.

Complex relations between inputs (causes) and outputs (effects) give rise to unexpected and disproportionate consequences. Socio-technical systems are non-linear and event outcomes are intractable.
Erik Hollnagel, 2009

Theories and models of the negative

Accidents are caused by people, due to carelessness, inexperience, and/or wrong attitudes. Technology and materials are imperfect, so failures are inevitable. Organisations are complex but brittle, with limited memory and an unclear distribution of authority.
Erik Hollnagel, 2009

Changes in attribution of causes

[Chart: percentage of accidents attributed to each kind of cause, 1960-2010. Attribution to technology declines over the period, attribution to human factors ("human error") rises from the 1960s, and attribution to the organisation emerges and grows from the late 1980s.]
Erik Hollnagel, 2009

Safety as risk reduction

Safety is normally measured by the absence of negative outcomes. This can be achieved in three different ways:
- eliminating hazards (design),
- preventing initiating events (constraints),
- protecting against consequences (barriers).

But what happens when there is no measurable change?

Safety, as commonly practised, implies a distinction between:
- Normal operations, which ensure the system works as it should and produces the intended outcomes.
- Abnormal operations, which disrupt or disturb normal operations or otherwise render them ineffective.

The purpose of safety management is to maintain normal operations by preventing disruptions or disturbances. Safety efforts are usually driven by what has happened in the past, and are therefore reactive.
Erik Hollnagel, 2009

Safety = (1 - Risk)

"By 2020 a new safety paradigm will have been widely adopted in European industry. Safety is seen as a key factor for successful business and an inherent element of business performance. As a result, industrial safety performance will have progressively and measurably improved in terms of reduction of
- reportable accidents at work,
- occupational diseases,
- environmental incidents and
- accident-related production losses.
It is expected that an incident elimination culture will develop where safety is embedded in design, maintenance, operation and management at all levels in enterprises. This will be identifiable as an output from this Technology Platform meeting its quantified objectives."

Note that the measurements are all negative or unwanted outcomes.
Erik Hollnagel, 2009

Focus on operation (sharp end, 1984)

[Figure: operations at the sharp end, with the organisation (management) above, technology below, design and maintenance along the lifecycle, and upstream and downstream processes on either side.]

Work has clear objectives and takes place in well-defined situations. Systems and technologies are loosely coupled and tractable.
Erik Hollnagel, 2009

Vertical and horizontal extensions (2008)

[Figure: the same diagram extended in three directions: a vertical extension to cover the entire system, from technology to organisation; one horizontal extension to cover the whole lifecycle, from design to maintenance; and a second horizontal extension to cover upstream and downstream processes.]

Work is underspecified. Systems and technologies are tightly coupled and intractable.
Erik Hollnagel, 2009

Tractable and intractable systems

                    Tractable system               Intractable system
                    (loosely coupled)              (tightly coupled)
Complicacy:         Descriptions are simple,       Descriptions are elaborate,
                    with few details               with many details
Comprehensibility:  All principles of              Some principles of
                    functioning are known          functioning are unknown
Stability:          System does not change         System changes before a
                    while being described          description is completed
Specification:      Fully specified                Underspecified, partly specified
Erik Hollnagel, 2009
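Read as a checklist, the three dimensions of the table above suggest a crude classification rule. The Python sketch below is purely illustrative; the function name and the boolean tests are a paraphrase of the table, not a validated instrument from the slides.

```python
# Crude, illustrative classification of a system as tractable or intractable,
# paraphrasing the three dimensions of the table above.

def classify_system(description_is_simple: bool,
                    principles_are_known: bool,
                    stable_while_described: bool) -> str:
    """Classify a system along the tractable/intractable distinction."""
    if description_is_simple and principles_are_known and stable_while_described:
        # All three dimensions hold: the system can be fully specified.
        return "tractable (loosely coupled)"
    # Any failing dimension means the description will be underspecified.
    return "intractable (tightly coupled)"

# A typical socio-technical system fails at least one test:
print(classify_system(description_is_simple=False,
                      principles_are_known=True,
                      stable_while_described=False))
# -> intractable (tightly coupled)
```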

Performance variability is necessary

Systems are so complex that work situations are always underspecified, hence partly unpredictable. Few if any tasks can successfully be carried out unless procedures and tools are adapted to the situation. Performance variability is both normal and necessary.

Many socio-technical systems are intractable. The conditions of work therefore never completely match what has been specified or prescribed. Individuals, groups, and organisations normally adjust their performance to meet existing conditions, specifically actual resources and requirements. Because resources (time, manpower, information, etc.) are always finite, such adjustments will always be approximate rather than exact.

[Figure: performance variability leads to success as well as to failure.]
Erik Hollnagel, 2009

Range of event outcomes

[Figure: outcomes plotted from negative to positive against predictability from very low to very high. At low predictability, positive outcomes are serendipity and good luck, neutral outcomes are random events and near misses, and negative outcomes range from incidents and accidents to disasters. At high predictability lie the normal outcomes (things that go right) and, at the negative end, mishaps (outcomes that should have been eliminated).]
Erik Hollnagel, 2009

Range of event outcomes

[The same figure, highlighting the focus of safety management: the negative, unpredictable outcomes (near misses, incidents, accidents, disasters) and the mishaps that should have been eliminated.]
Erik Hollnagel, 2009

Range of event outcomes

[The same figure, highlighting the focus of resilience engineering: the normal outcomes, the things that go right.]
Erik Hollnagel, 2009

Range of event outcomes

[The same figure, showing both: safety management focuses on the negative outcomes (incidents, accidents, mishaps), while resilience engineering also covers the normal outcomes that go right.]
Erik Hollnagel, 2009

Frequency of outcomes

[The same figure with frequency added as a further dimension: the predictable normal outcomes are by far the most frequent, while near misses, incidents, accidents, and disasters are progressively rarer.]
Erik Hollnagel, 2009

Why only look at what goes wrong?

Safety = reduced number of adverse events (10^-4 := 1 failure in 10,000 events).
- Focus is on what goes wrong.
- Look for the underlying failures and malfunctions.
- Try to eliminate causes and improve barriers.
- Safety and core business compete for resources.
- Learning only uses a fraction of the data available.

Safety = ability to succeed under varying conditions (1 - 10^-4 := 9,999 non-failures in 10,000 events).
- Focus is on what goes right.
- Understand why normal performance succeeds.
- Use that to perform better and safer.
- Safety and core business help each other.
- Learning uses most of the data available.
Erik Hollnagel, 2009
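The arithmetic behind the two views can be made concrete. The following Python sketch is illustrative only, using the hypothetical 10^-4 failure rate from the slide to show how much of the available data each perspective learns from.

```python
# Illustrative arithmetic for the slide's hypothetical 10^-4 failure rate:
# how much of the available operational data does each perspective use?

n_events = 10_000
p_failure = 1e-4                        # assumed probability of an adverse event

failures = round(p_failure * n_events)  # -> 1 failure
successes = n_events - failures         # -> 9,999 non-failures

# Traditional safety management studies only the adverse events:
print(f"traditional view: {failures / n_events:.2%} of the data")   # 0.01%

# A resilience perspective also asks why the other events went right:
print(f"resilience view:  {successes / n_events:.2%} of the data")  # 99.99%
```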

Failures or successes?

When something goes wrong, e.g. 1 event out of 10,000 (10^-4), humans are assumed to be responsible in 80-90% of the cases. Who or what is responsible for the remaining 10-20%?

When something goes right, e.g. 9,999 events out of 10,000, are humans also responsible in 80-90% of the cases? Who or what is responsible for the remaining 10-20%?

Investigation of failures is accepted as important. Investigation of successes is rarely undertaken.
Erik Hollnagel, 2009

From the negative to the positive

Negative outcomes are caused by failures and malfunctions. Safety = reduced number of adverse events. The remedy is to eliminate failures and malfunctions as far as possible.

Safety = ability to respond when something fails. The remedy is to improve the ability to respond to adverse events.

All outcomes (positive and negative) are due to performance variability. Safety = ability to succeed under varying conditions. The remedy is to improve resilience.
Erik Hollnagel, 2009

Are failures different from successes?

Things that go wrong (disasters, accidents, incidents, near misses) are explained by failures, malfunctions, violations, error mechanisms, slips, and unsafe acts.

Things that go right (normal actions) are explained by... ???
Erik Hollnagel, 2009

Premises for Resilience Engineering

- Performance conditions are always underspecified. Individuals and organisations must therefore always adjust their performance to match current demands and resources; because resources and time are finite, such adjustments will inevitably be approximate.
- Many adverse events can be attributed to a breakdown or malfunctioning of components and normal system functions, but many cannot. These events can be understood as the result of unexpected combinations of performance variability.
- Safety management cannot be based on hindsight, nor rely on error tabulation and the calculation of failure probabilities. Safety management must be proactive as well as reactive.
- Safety cannot be isolated from the core (business) process, nor vice versa. Safety is the prerequisite for productivity, and productivity is the prerequisite for safety. Safety is achieved by improvements rather than by constraints.
Erik Hollnagel, 2009

The resilient organisation

Resilience is the intrinsic ability of a system to adjust its functioning prior to, during, or following changes and disturbances, so that it can sustain required operations under both expected and unexpected conditions.

A practice of Resilience Engineering / Proactive Safety Management requires that all levels of the organisation are able to:
- Respond to regular and irregular conditions in an effective, flexible manner (the actual);
- Monitor short-term developments and threats, and revise risk models (the critical);
- Learn from past events, and understand correctly what happened and why (the factual);
- Anticipate long-term threats and opportunities (the potential).
Erik Hollnagel, 2009

Designing for resilience

- Responding (the actual): knowing what to do, and being capable of doing it.
- Monitoring (the critical): knowing what to look for (indicators).
- Learning (the factual): knowing what has happened.
- Anticipating (the potential): finding out and knowing what to expect.

An increased availability and reliability of functioning on all levels will both improve safety and enhance control, hence the ability to predict, plan, and produce.
Erik Hollnagel, 2009
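The four abilities can also be read as an architectural sketch: a loop in which monitoring feeds the risk model, learning builds readiness, and responding draws on both. The Python below is a hypothetical rendering of that loop, not anything the slides specify; every class, method, and attribute name is invented for illustration.

```python
# Hypothetical sketch of the four resilience abilities as one management
# loop. All names are invented for illustration.

class ResilientOrganisation:
    def __init__(self):
        self.risk_model = {}    # revised by monitoring (the critical)
        self.lessons = {}       # built up by learning (the factual)

    def respond(self, event):
        """The actual: act effectively and flexibly, even on irregular events."""
        return self.lessons.get(event, "improvise within safe limits")

    def monitor(self, indicators):
        """The critical: track short-term indicators and revise the risk model."""
        self.risk_model.update(indicators)

    def learn(self, event, what_worked):
        """The factual: learn from what happened, successes included."""
        self.lessons[event] = what_worked

    def anticipate(self):
        """The potential: look ahead for longer-term threats and opportunities."""
        return [name for name, level in self.risk_model.items() if level > 0.5]
```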

Time to think or time to do?

THINK! Work is carefully planned and monitored. Demands match capacity; control is kept.

DO! Work is paced by technology and external events. Demands exceed capacity; it is easy to lose control.

Efficient performance requires a balance between thinking and doing.
Erik Hollnagel, 2009

Efficiency-Thoroughness Trade-Off

Thoroughness (time to think): recognising the situation, choosing and planning. If thoroughness dominates, there may be too little time to carry out the actions: pending actions are neglected and new events are missed.

Efficiency (time to do): implementing plans, executing actions. If efficiency dominates, actions may be badly prepared or wrong: pre-conditions are missed and only the expected results are looked for.

The balance depends on the time and resources needed versus the time and resources available.
Erik Hollnagel, 2009
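As a toy decision rule, the trade-off might be sketched as below. This is illustrative only; the function name, arguments, and messages are invented, not part of the ETTO literature.

```python
# Toy sketch of the efficiency-thoroughness trade-off as a resource check.
# Function name, arguments, and messages are invented for illustration.

def etto(resources_needed: float, resources_available: float,
         dominant_concern: str) -> str:
    """Decide what gets cut when demands exceed capacity."""
    if resources_needed <= resources_available:
        return "balanced: enough time both to think and to do"
    if dominant_concern in ("throughput", "output"):
        # Efficiency wins: checks are skipped, prior results trusted.
        return "thoroughness is cut ('normally OK, no need to check')"
    # Safety or quality dominant: thoroughness wins, actions are deferred.
    return "efficiency is cut (pending actions are postponed)"

print(etto(8.0, 5.0, "throughput"))
# -> thoroughness is cut ('normally OK, no need to check')
```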

Some ETTO heuristics

Cognitive (individual): judgement under uncertainty; cognitive primitives (similarity matching, frequency gambling); reactions to information input overload and underload; cognitive style; confirmation bias.

Idiosyncratic (work related): "Looks fine"; "Not really important"; "Normally OK, no need to check"; "I've done it millions of times before"; "Will be checked by someone else"; "Has been checked by someone else"; "This way is much quicker"; "No time (or resources) to do it now"; "Can't remember how to do it"; "We always do it this way"; "It looks like X (so it probably is X)"; "We must get this done"; "Must be ready in time"; "Must not use too much of X".

Collective (organisation): negative reporting; reduce redundancy; meet production targets; reduce unnecessary cost; double-bind; reject conflicting information.
Erik Hollnagel, 2009

Efficiency-Thoroughness Trade-Off

For distributed work it is necessary to trust what others do; it is impossible to check everything.

Thoroughness: confirm that input is correct; consider secondary outcomes and side-effects.
Efficiency: trust that input is correct; assume someone else takes care of outcomes.

One way of managing time and resource limitations is to think only one step back and/or one step ahead.
Erik Hollnagel, 2009

Mutual optimism

[Figure: a chain of people, each thinking the same thing: "I can allow myself to be effective because the others will be thorough."]
Erik Hollnagel, 2009

The ETTO principle

The ETTO principle describes the fact that people (and organisations), as part of their activities, practically always must make a trade-off between the resources (time and effort) they spend on preparing an activity and the resources (time, effort and materials) they spend on doing it.

ETTOing favours efficiency over thoroughness if throughput and output are the dominant concerns, and thoroughness over efficiency if safety and quality are the dominant concerns.

It follows from the ETTO principle that it is impossible to maximise efficiency and thoroughness at the same time. Nor can an activity be expected to succeed if there is not a minimum of either.
Erik Hollnagel, 2009

Resilience and what individuals do

[Figure: the four abilities (responding, monitoring, learning, anticipating) with their objects (the actual, the critical, the factual, the potential), shaped by individual ETTO rules.]

A single individual cannot address all four abilities equally well, and will normally prioritise the ability to respond. ETTOing will affect the trade-off between the four abilities.

Individual ETTO rules: "Looks fine"; "Not really important"; "Normally OK, no need to check now"; "I've done it hundreds of times before"; "Will be checked by someone else"; "Has been checked by someone else"; "This way is much quicker"; "No time (or resources) to do it now"; "Can't remember how to do it"; "We always do it this way"; "It looks like X (so it probably is X)"; "We must get this done (ready in time)"; "We must not use too much of X".
Erik Hollnagel, 2009

Resilience and what organisations do

[Figure: the same four abilities, shaped by organisational ETTO rules.]

An organisation can allocate resources to all four abilities, according to its priorities. ETTOing will affect how well each ability is addressed (a trade-off within abilities).

Organisational or collective ETTO rules: reduce redundancy; double-bind (DO and DON'T); reject conflicting information; reduce unnecessary cost and effort; meet production targets ("safety first, but..."); negative reporting (only report when something is wrong).
Erik Hollnagel, 2009

The ETTO-TETO paradox

Efficiency in the present requires thoroughness in the past. Efficiency in the future requires thoroughness in the present.

[Figure: alternating phases of thoroughness and efficiency over time; each phase of learning and preparing indicators and responses enables the efficiency of the next.]

In order to be resilient it is necessary that short-term ETTO is balanced by long-term TETO (Thoroughness-Efficiency Trade-Off).
Erik Hollnagel, 2009
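One way to see the paradox is as a two-period dependency: the efficiency achievable now is bounded by the thoroughness invested earlier. The toy model below is only a sketch; the numbers and the bounding rule are invented.

```python
# Toy model of the ETTO-TETO paradox: efficiency achievable in one period
# is bounded by the thoroughness invested in the period before. The numbers
# and the bounding rule are invented for illustration.

def achievable_efficiency(previous_thoroughness: float) -> float:
    """Efficiency now cannot exceed the preparation done earlier (0..1)."""
    return min(1.0, previous_thoroughness)

thoroughness_by_period = [0.9, 0.2, 0.8]   # invested preparation per period
for t in range(1, len(thoroughness_by_period)):
    cap = achievable_efficiency(thoroughness_by_period[t - 1])
    print(f"period {t}: efficiency capped at {cap:.1f}")
# Cutting thoroughness in period 1 (0.2) caps the efficiency of period 2.
```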

As high as reasonably practicable

- Responding (the actual): Which events? How were they found? Is the list revised? How is readiness ensured and maintained?
- Monitoring (the critical): How are indicators defined? Lagging or leading? How are they measured? Are effects transient or permanent? Who looks where and when? How, and when, are they revised?
- Learning (the factual): What is learned, and when: continuously or event-driven (successes or failures)? How (qualitatively, quantitatively)? By the individual or by the organisation?
- Anticipating (the potential): What is our model of the future? How long do we look ahead? What risks are we willing to take? Who believes what and why?
Erik Hollnagel, 2009

"We attribute the success of HROs in managing the unexpected to their determined efforts to act mindfully. By this we mean that they organize themselves in such a way that they are better able to notice the unexpected in the making and halt its development. If they have difficulty halting the development of the unexpected, they focus on containing it. And if some of the unexpected breaks through the containment, they focus on resilience and swift restoration of system functioning.

When we call this approach mindful, we mean that HROs strive to maintain an underlying style of mental functioning that is distinguished by continuous updating and deepening of increasingly plausible interpretations of what the context is, what problems define it, and what remedies it contains. The key difference between HROs and other organizations in managing the unexpected often occurs in the earliest stages, when the unexpected may give off only weak signals of trouble. The overwhelming tendency is to respond to weak signals with a weak response. Mindfulness preserves the capability to see the significant meaning of weak signals and to give strong responses to weak signals. This counterintuitive act holds the key to managing the unexpected."

Karl E. Weick & Kathleen M. Sutcliffe (2001), Managing the Unexpected: Assuring High Performance in an Age of Complexity. Jossey-Bass.

Thank you for your attention

Erik Hollnagel, 2009
