Class 304: Fantastic Failures

Embedded Systems Conference


Wednesday, 31 March 2004
By Kim R. Fowler

Historical Case Studies

The De Havilland Comet

(Courtesy of Marc Schaeffer; find this photo at the following website: www.geocities.com/CapeCanaveral/Lab/8803/)

Comet recounted
• First jet airliner – speed and comfort
• 3 crashes between May 1953 and April 1954
• Extensive testing
• Catastrophic cracking from metal fatigue
• Fixes – rounded corners, reinforcing plates
• New understanding of metal fatigue

Ariane 5

(Photographic source is ESA/CNES. You can find these photos at the following website:
www.mssl.ucl.ac.uk/www_plasma/missions/cluster/about_cluster/cluster1/cluster1_images.html)

Ariane 5 recounted
• Dual-redundant processors
• 3 unprotected variables that overflowed
• Processors reset on overflow, no graceful
recovery
• Software reused from Ariane 4, no check against Ariane 5 flight dynamics
• Ariane 5 had greater horizontal drift velocities
• Reuse is tricky, end-to-end system test
necessary
• Find report at:
www.esa.int/export/esaLA/Pr_33_1996_p_EN.html
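
The overflow problem in the bullets above can be illustrated with a small sketch. The inquiry report cited on this slide attributes the failure to a floating-point value converted to a 16-bit signed integer without range protection. The Python below shows one protected alternative: a saturating conversion that flags the out-of-range case instead of raising an unhandled exception. The function name, the NaN handling, and the choice to clamp (rather than switch to a degraded mode) are illustrative assumptions, not the flight code.

```python
from typing import Tuple

INT16_MIN = -32768
INT16_MAX = 32767


def to_int16_saturating(value: float) -> Tuple[int, bool]:
    """Convert a float to the signed 16-bit range without overflowing.

    Returns (converted_value, saturated). The flag lets the caller log the
    event or degrade gracefully instead of resetting the processor.
    Hypothetical helper for illustration only.
    """
    if value != value:  # NaN never compares equal to itself
        return 0, True
    if value > INT16_MAX:
        return INT16_MAX, True
    if value < INT16_MIN:
        return INT16_MIN, True
    return int(value), False  # int() truncates toward zero
```

A caller would check the flag on every conversion of a value whose range depends on the flight profile, which is exactly the kind of check that reuse on a new vehicle should trigger.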

Therac 25
• Medical linear accelerator for treating tumors
• Overdosed six patients in the mid-1980s
• Problems
– Quick editing by operator caused race condition
– Cryptic error messages ignored
– No explanation of error codes in the user’s manual
– 50 times full dose but displayed “no dose given”
– No mechanical interlocks
– No software reviews or audits, little
documentation

Therac 25 Lessons
• Need general plan for system development
• The operator interface must be clear, intuitive,
and explained
• Hardware safeguards must limit software faults
• Good design, not testing, makes a safe system
• See Appendix A – Medical Devices: The Therac-
25 from Nancy Leveson, Safeware: System
Safety and Computers, Addison-Wesley, 1995.

Chernobyl

(Courtesy of the U.S. Department of Energy, http://insp.pnl.gov, photographs UK-CH-002, UK-CH-003, UK-CH-015, UK-CH-100)

Chernobyl recounted
• Chernobyl reactor 4 exploded, April 1986
• Released clouds of radioactive material for
10 days
• 100 times the radiation exposure of the Hiroshima bomb
• Background
– Graphite block reactors unstable at low
reactivity
– Safety rules require power > 20% capacity at
all times

Chernobyl events
– Experiment called for by engineers in Moscow
– Manual shutdown, automatic control turned off
– Power dropped to 1% capacity
– Removed more control rods
– Power crept up to 7%
– Turned on more water to produce more steam
– Water cooled reactor, dropping steam and reactivity
– Removed even more control rods
– Steam production rose until 1:22 a.m. when operators
shut off water flow
– Heat built up quickly, control rod sleeves bent
– Could not insert control rods
– Steam explosion

Chernobyl Lessons
• Theoretical knowledge vs. hands-on
• Humans “over-steer” dynamic systems
• Humans don’t handle interacting,
nonlinear problems well
• “Groupthink”
• Understand human nature
– Clarity of function
– Reduce confounding problems
– Accommodate in system design

Apple Lisa

(Part of the computer collection of Giorgio Ungarelli, photograph used with permission.)

Apple Lisa Legacy

• Brilliant concept before its time


– Mouse
– Graphical file management
• People not ready for paradigm shift

Apple Lisa Lessons
• Prohibitive price for unappreciated
capability
• Cost-effective solutions rely on users’
understanding
• Failure falls into business/political arena –
difficult to predict and avoid

Navy Terrier/LEAP

Terrier LEAP outline

• Concept for ballistic missile intercept


• Use current (early-mid 1990s) technology
• Prepare and test quickly
• Target launched from Wallops Island
• Interceptor launched from cruiser in
Atlantic
• Basic human error foiled success

LEAP Target

(Photograph courtesy of Raytheon, Inc.)

LEAP General Operation
• High-resolution radars at Wallops Island tracked the
target (shipboard radars were insufficient)
• Wallops Island processor collected data from the
radars, filtered the target track with a six-state
Kalman filter, and transmitted the track to the
ship.
• Sent target tracks to ship via redundant telephone
landlines and Inmarsat satellite links
• Ship processor received the data, predicted the
intercept time and point, and indicated when to
launch the interceptor missile.
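
The track filtering and intercept prediction described above can be sketched in a few lines of Python. The slide says only that the filter had six states; the sketch assumes those are position and velocity on three axes, that the radars deliver Cartesian position fixes, and that the axes are decoupled, so the six-state filter reduces to three independent two-state filters. The class names, noise variances, and the straight-line extrapolation used for prediction are hypothetical, not the fielded design.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class AxisFilter:
    """Two-state (position, velocity) Kalman filter for a single axis."""
    pos: float = 0.0
    vel: float = 0.0
    p00: float = 1.0e6  # position variance; large value means "unknown"
    p01: float = 0.0    # position/velocity covariance
    p11: float = 1.0e6  # velocity variance

    def predict(self, dt: float, accel_var: float) -> None:
        """Propagate state and covariance forward by dt seconds."""
        self.pos += self.vel * dt
        p00 = self.p00 + 2.0 * dt * self.p01 + dt * dt * self.p11
        p01 = self.p01 + dt * self.p11
        # Process noise for unmodeled acceleration (white-noise acceleration).
        self.p00 = p00 + accel_var * dt ** 4 / 4.0
        self.p01 = p01 + accel_var * dt ** 3 / 2.0
        self.p11 = self.p11 + accel_var * dt ** 2

    def update(self, measured_pos: float, meas_var: float) -> None:
        """Blend in one position measurement (H = [1, 0])."""
        innovation = measured_pos - self.pos
        s = self.p00 + meas_var
        k0 = self.p00 / s
        k1 = self.p01 / s
        self.pos += k0 * innovation
        self.vel += k1 * innovation
        p00, p01 = self.p00, self.p01
        self.p00 = (1.0 - k0) * p00
        self.p01 = (1.0 - k0) * p01
        self.p11 = self.p11 - k1 * p01


class SixStateTracker:
    """Three decoupled axis filters: six states in total (assumed layout)."""

    def __init__(self) -> None:
        self.axes = [AxisFilter() for _ in range(3)]

    def step(self, measured_xyz: Sequence[float], dt: float,
             accel_var: float = 1.0, meas_var: float = 100.0) -> None:
        """Process one radar fix (x, y, z) taken dt seconds after the last."""
        for axis, z in zip(self.axes, measured_xyz):
            axis.predict(dt, accel_var)
            axis.update(z, meas_var)

    def extrapolate(self, seconds_ahead: float) -> List[float]:
        """Straight-line position prediction, as a stand-in for the
        ship-side intercept-point calculation."""
        return [a.pos + a.vel * seconds_ahead for a in self.axes]
```

Calling `step` once per radar fix and `extrapolate` on the receiving side mirrors the split between the Wallops Island processor and the ship processor; a real target would need a ballistic motion model rather than constant velocity.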

LEAP Missile & Intercept

(Photograph courtesy of Raytheon, Inc.)

LEAP Testing Finds Problems
• End-to-end tests of the system
– simulated a target launch,
– transmitted the simulated data through the entire
system to the ship,
– calculated an intercept as if we were at sea.
• Redundant landlines – switch maintenance in
New Jersey cut off early test
• Separate landlines
– one through New Jersey
– other through Pennsylvania

Richmond K. Turner, CG-20

(Photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Testing Finds Problems (cont’d.)
• Two shipboard radars caused problems
– SPS-49 jammed the Inmarsat receivers
– SPS-20 jammed the GPS receivers
• Inmarsat situated on port and starboard
bridge to reduce superstructure blockage
• Too many dropouts with commercial
modems, switched to cell phone modems

LEAP Targeting Processor and laboratory test set

(Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)

LEAP: Lessons Learned
• Technical failure
• Simple, human error can interrupt the best
designs
• Careful development and thorough testing
necessary
• All components must be tested within the
system to uncover interactions

Aegis LEAP
• A success story
• Three successful intercepts in 2002, more
in 2003
• Carefully planned development

Aegis LEAP Flight Profile

(Figure courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Aegis LEAP Missile

(Photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)


Kinetic Kill Vehicle and Target Image

(Figure and photograph courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Aegis LEAP Launch

(Photographs courtesy of the Johns Hopkins University Applied Physics Laboratory.)


Thorough Ground Test Program
• Separation tests – squibs, batteries, explosive bolts
• Kinetic warhead (KW) hover test for the closed-loop pointing
• Air bearing tests of maneuvers: pitch-to-ditch, IR seeker
calibration, and pointing before separation
• Hardware-in-the-loop simulation and test of avionics
• KW tests for the IR seeker characterization, stabilization,
third stage interfaces
• Vacuum tests – PCB delamination, arcing, and outgassing
• Aerothermal testing in a hypersonic wind tunnel for
nosecone heating and outgassing, seeker shield function,
strake heating and insulation

Types of Failure

Examples: Product Recalls
• [. . .] recalled 45,000 heaters for defective, improperly
positioned thermostats, which could lead to overheating.
• [. . .] recalled 3.1 million dishwashers. The slide switch (the
lever that selects between heat drying and energy saving)
can melt and ignite over time, posing a fire hazard.
• [. . .] recalled 5,500 toy flashlights because the batteries
may overheat or leak and children can suffer burns from
the leaking battery.
• [. . .] recalled upright vacuum cleaners because the power
cord may break inside of the handle posing electrical
shock and burn injury hazards.
• http://www.matthewslawfirm.com

Examples: Automotive Recalls


• March 12, 2002 [. . .] recalled the [. . .] trailer hitch –
circuitry in the converter is inadequate to properly
manage voltage spikes that can lead to an electrical
short or open circuit within the converter, causing a
failure and an inoperative trailer light.
• September 11, 2000 [. . .] recalled about 270,000 [cars]
– air bags that may deploy unexpectedly because of
corrosion in the inflator.
• During 2000 [. . .] recalled ignition modules that could
cause a car to stall. When the ignition module’s temperature
rises above a certain level, the chance of the module
cutting out increases.
• http://www.crash-worthiness.com

Examples: More Automotive Recalls
• [. . .] recalled 263,000 1995-97 [vehicles] . . . The airbag
electronic control module (AECM) could corrode from
water or road salt and then accidentally fire the driver side
airbag.
• [. . .] recalled 757,000 1992-97 [vehicles] because higher
than specified electrical load through accessory power
feed circuit may cause a short circuit and allow current to
flow through ground wiring. This could cause overheating
and an electrical fire.
• [. . .] recalled 1995-97 [vehicles] because improperly
routed wire harness for the air-conditioner may permit
wires to rub together and short circuit, resulting in a blown
fuse, dead battery, or fire.
• http://www.matthewslawfirm.com

Examples: More Automotive Recalls

• December 11, 1998 [. . .] recalled 226 [electric vehicles] to reprogram
the logic in the motor electronic control unit (ECU), which can
mistakenly detect a failure of an electrical current sensor at
speeds above 50 mph. This can cause a sudden loss of power and
unexpected deceleration.
• http://autorepair.about.com/library/recalls/

Elements of Unintended
Consequences in Previous Examples
• Passage of time – usually fielded units
• Nonobvious or obscure causes
• Environmental interactions, e.g., corrosion,
overheating
• Failure modes with significant effects, e.g.,
fire or injury

The Nature of Problems

• Confounding complexity
– unforeseen circumstances
– multiple causes
• Human error
– nonobviousness to user
– improper use
– design oversight – even if it appears to be
a manufacturing problem

Example: Complexity or Oversight?
• September 2003, Hurricane Isabel
• Power outages – trees down on power lines.
• NIST experienced 180 VAC for 20 minutes, which
destroyed thousands of fluorescent lamp ballasts
• Protective mechanisms for AC power were
controlled over telephone lines.
• Guess what was also knocked down by wind-
blown trees?

Causes and Factors


• Dishonest portrayal of capabilities – expertise,
schedule estimation, unreasonable professional
relationships, e.g., management/engineering
• Inadequate schedule for review and testing
• Reinventing the wheel
– building your own custom design
• Creeping featurism
– the continual addition of new capabilities
• Perception is reality

Remedies
• Truth in advertising – expertise, schedule estimation,
management style/employee responses
• Work hard to develop reasonable schedules
– review and testing
– plan for contingencies
• Continuous learning
– lessons learned, your own experience
– others’ experiences
• Reduce complexity
– understand and define interactions
– do not “reinvent the wheel”
– limit features
• Teamwork

Integrity
• The “Big Picture”
• Truth in advertising (your capability and
skills)
• Estimation and scheduling
• Plan for the long term
– your success and reputation
– your product’s viability
– your company’s reputation

Failure and How to Handle It
• Types of failure (a progression toward less control)
– technical
– professional
– political/societal
• Embrace failure
– admit and accept responsibility
– understand and learn
– put past behind you because others won’t
– forgive others’ failures; help them to rebound

Personal Examples

Technical Failure
• Ultraviolet satellite camera with image
intensifier
• Automatic gain control for image intensifier
• Nonlinear control problem
• First version – blooming/collapsing picture
• Second version – unreliable transmission
of gain value

Technical Failure – 1st Version

(Block diagram of the first-version automatic gain control. Labels: camera, video signal, frame sync, reset, hi-threshold comparator, Up, Dn, up-down counter, pixel clock, DAC, image intensifier.)

(© 2002, Figure courtesy of the Johns Hopkins University Applied Physics Laboratory.)

Technical Failure – 1st Version
• Problem: blooming/collapsing picture
• Background:
– Discrete logic, up-down counters
– Unstable for bright objects
– Not fully simulated or analyzed
– Short development time (flew breadboards)
• Should’a: analyzed/simulated expected
scenes during design
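
The kind of scene simulation recommended above can be small. The Python sketch below is a toy model, not the flown circuit: it assumes a comparator steps an up-down counter once per pixel (down when the amplified pixel exceeds the threshold, up otherwise) and that the counter value sets the intensifier gain for the next frame. The scene values, threshold, and 8-bit gain range are invented for illustration.

```python
from typing import List


def simulate_agc(scene: List[int], frames: int, start_gain: int = 255,
                 threshold: int = 128, gain_max: int = 255) -> List[int]:
    """Return the gain after each frame for a fixed scene.

    scene holds one radiance value per pixel. A pixel counts as bright when
    radiance * gain / gain_max exceeds threshold (integer form used below).
    """
    gain = start_gain
    history = []
    for _ in range(frames):
        above = sum(1 for radiance in scene
                    if radiance * gain > threshold * gain_max)
        below = len(scene) - above
        # One count down per bright pixel, one count up per dim pixel.
        gain = max(0, min(gain_max, gain + below - above))
        history.append(gain)
    return history


def peak_to_peak(values: List[int]) -> int:
    """Spread of the gain over a window; zero means the loop has settled."""
    return max(values) - min(values)


# Hypothetical test scenes: a graded scene and a uniform bright object.
graded_scene = [100 + 5 * i for i in range(64)]
bright_object = [400] * 64

if __name__ == "__main__":
    print("graded:", simulate_agc(graded_scene, 20))
    print("bright:", simulate_agc(bright_object, 20))
```

In this model the graded scene settles to a steady gain, while the uniform bright object flips every pixel across the threshold at once and the gain swings by the full pixel count each frame, a limit cycle that would look like a blooming and collapsing picture.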

Technical Failure – 2nd Version

(© 1996, Oxford University Press, used with permission.)

Technical Failure – 2nd Version
• Problem: unreliable transmission of gain value
• Background:
– Microcontroller implementation of AGC
– AGC stable for all scenes
– Readout of gain by ground equipment unreliable
– Analog encoding of gain into video frame
• Should’a:
– Used digital encoding into video frame for noise margin
– Gained a better understanding of the noise environment
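
The noise-margin argument for digital encoding can be shown with a short Python sketch. It assumes an 8-bit gain value, 8-bit pixel levels, and one pixel per bit written as full black or full white and read back against a mid-scale threshold. The function names and the frame layout are hypothetical; the slides do not describe the actual encoding.

```python
from typing import List

WHITE, BLACK, MID_SCALE = 255, 0, 128


def clamp_pixel(level: int) -> int:
    """Limit a level to the 8-bit pixel range."""
    return max(0, min(255, level))


def encode_gain_digital(gain: int) -> List[int]:
    """Write the gain as 8 pixels, most significant bit first."""
    return [WHITE if (gain >> bit) & 1 else BLACK for bit in range(7, -1, -1)]


def decode_gain_digital(pixels: List[int]) -> int:
    """Threshold each pixel at mid-scale and reassemble the bits."""
    gain = 0
    for level in pixels:
        gain = (gain << 1) | (1 if level >= MID_SCALE else 0)
    return gain


def add_noise(pixels: List[int], noise: List[int]) -> List[int]:
    """Add a per-pixel disturbance, clamped to the pixel range."""
    return [clamp_pixel(p + n) for p, n in zip(pixels, noise)]
```

With this layout each bit tolerates a disturbance of up to about half of full scale, whereas an analog encoding (one pixel whose level equals the gain) is corrupted by any disturbance larger than one count.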

Professional Failure
• Asked to finish programming effort while
original designer moved on to other
projects
• False starts and procrastination
• Finally removed myself from project

Professional Failure
• Problem: did not complete assignment
• Background:
– Mounds of documentation to plow through
– Early realization of no-win situation
• Lost motivation
• No real recognition of work obvious to me

• Should’a:
– Either not taken the job in the first place
– Or if no choice, plow through assignment while finding
another job (setting precedent)

Professional/Business Failure
• Business deal
• My personal performance
– Technical excellence
– Professional excellence
– Maintained integrity
• Accused of bad stuff, which I did not do
• Deal fell through

Professional/Business Failure
• Problem: business politics outside my control
• Background:
– Interesting proposition and product
– Long-term relationships
– Unknown quantities introduced early in deal
– Weirdnesses grew
• Should’a:
– Either not have made the deal in the first place
– Or have left earlier, before weirdness got out of hand
• Note: always deal with integrity or don’t deal

Political Failure
• Satellite subsystem
• Team’s performance
– Technical excellence
– Professional excellence
• NASA sponsor pulled project in-house

Political Failure
• Problem: politics outside my company’s control
• Background:
– 6-month-long set of trade studies to define architecture
– Thorough studies and review
– Schedule well understood, team prepared to build
system
– Groups at NASA out of work
– NASA pulled project in-house to feed their own
• Should’a:
– None, politics happen

A Success Story

The Sidewinder Missile – A Success Story

(Courtesy of the U.S. Navy. All U.S. Navy photos are public domain.
http://library.thinkquest.org/jo113065/citations.htm)

Sidewinder recounted
• Goal: simple, sturdy, cheap missile
• Small development team, 1949 – 1953
• Simple, clever combination of ideas
– Rollerons: simple but important control
– Proportional navigation simplified circuitry
– Torque-balance servo for maneuvering
– Canard control fins reduced wiring and connectors
– Simple data acquisition equipment
• Extensive testing and prototyping

Sidewinder Lessons
• Breakthroughs require vision
• Small teams facilitate commitment and
communications
• Simple and robust design
• Careful, thorough, and extensive testing
and integration
