
Fault-Tolerant Computing

Motivation, Background, and Tools

Terminology, Models, and Measures


Oct. 2006

Slide 1

About This Presentation


This presentation has been prepared for the graduate
course ECE 257A (Fault-Tolerant Computing) by
Behrooz Parhami, Professor of Electrical and Computer
Engineering at University of California, Santa Barbara.
The material contained herein can be used freely in
classroom teaching or any other educational setting.
Unauthorized uses are prohibited. © Behrooz Parhami

Edition    Released     Revised    Revised
First      Oct. 2006


Terminology, Models, and Measures for Dependability


Impairments to Dependability

[Figure: collage of terms for impairments to dependability: Flaw, Fault, Error, Failure, Hazard, Bug, Degradation, Intrusion, Defect, Malfunction, Crash]

The Fault-Error-Failure Cycle


Aspect       Impairment
Structure    Fault
State        Error
Behavior     Failure

[Figure annotations: "Includes both components and design"; "Fault"; "Correct signal"; "0"; "Replaced with NAND?"]

Schematic diagram of the Newcastle hierarchical model and the impairments within one level.

The Four-Universe Model


Universe         Impairment
Physical         Failure
Logical          Fault
Informational    Error
External         Crash

Cause-effect diagram for Avižienis' four-universe model of impairments to dependability.

Unrolling the Fault-Error-Failure Cycle

Aspect       Impairment
Structure    Fault
State        Error
Behavior     Failure

Unrolling the cycle (first cycle, second cycle) yields six levels of abstraction:

Level         Abstraction    Impairment
Low-level     Component      Defect
Low-level     Logic          Fault
Mid-level     Information    Error
Mid-level     System         Malfunction
High-level    Service        Degradation
High-level    Result         Failure

Cause-effect diagram for an extended six-level view of impairments to dependability.

Multilevel Model
State                  Abstraction    System condition
Ideal
Low-level impaired     Component      Defective
Low-level impaired     Logic          Faulty
Mid-level impaired     Information    Erroneous
Mid-level impaired     System         Malfunctioning
High-level impaired    Service        Degraded
High-level impaired    Result         Failed

Legend (transition types): initial entry, deviation, remedy, tolerance

Analogy for the Multilevel Model


An analogy for our
multi-level model of
dependable computing.
Defects, faults, errors,
malfunctions,
degradations, and
failures are
represented by pouring
water from above.
Valves represent
avoidance and
tolerance techniques.
The goal is to avoid
overflow.

Wall heights represent interlevel latencies
Inlet valves represent avoidance techniques
Drain valves represent tolerance techniques
Concentric reservoirs are analogs of the six model levels, with defect being innermost

Why Our Concern with Dependability?


Reliability of an n-transistor system, each transistor having failure rate $\lambda$:
$R(t) = e^{-n\lambda t}$
There are only 3 ways of making systems more reliable:
Reduce $\lambda$
Reduce n
Reduce t
Alternative: Change the reliability formula by introducing redundancy in the system

[Plot: $e^{-n\lambda t}$ (vertical axis, 0.0 to 1.0) versus nt (horizontal axis, logarithmic scale); marked reliability values .9999, .9990, .9900, .9048, .3679]
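As a quick check of the series-system formula on this slide, the following minimal sketch evaluates $e^{-n\lambda t}$ at a few values of the product $n\lambda t$. The function name `series_reliability` and the chosen products (powers of ten from $10^{-4}$ to 1) are illustrative assumptions; they happen to reproduce the reliability values marked on the plot.

```python
import math


def series_reliability(n: float, lam: float, t: float) -> float:
    """R(t) = exp(-n * lam * t): all n components must survive to time t."""
    return math.exp(-n * lam * t)


# Assumed sample points: the product n*lam*t swept over powers of ten.
for exponent in (-4, -3, -2, -1, 0):
    product = 10.0 ** exponent
    r = series_reliability(1, product, 1.0)
    print(f"n*lam*t = {product:g}: R = {r:.4f}")
```

Only the product $n\lambda t$ matters, which is why reducing any one of the three factors improves reliability equally.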

Highly Dependable Computer Systems


Long-life systems: Fail-slow, Rugged, High-reliability
Spacecraft with multiyear missions, systems in inaccessible locations
Methods: Replication (spares), error coding, monitoring, shielding
Safety-critical systems: Fail-safe, Sound, High-integrity
Flight control computers, nuclear-plant shutdown, medical monitoring
Methods: Replication with voting, time redundancy, design diversity
High-availability: Fail-soft, Robust, High-availability
Telephone switching centers, transaction processing, e-commerce
Methods: HW/info redundancy, backup schemes, hot-swap, recovery
Just as performance-enhancement techniques gradually migrate from
supercomputers to desktops, so too do dependability-enhancement
methods find their way from exotic systems into personal computers.


Aspects of Dependability
[Figure: collage of dependability attributes and measures. Legible terms include Reliability (MTTF); Availability (pointwise av., interval av., MTTR, MTBF); Maintainability; Performability (MCBF); Safety (Risk, Consequence); Security; Integrity; Resilience; Controllability; Observability]

Concepts from Probability Theory


Probability density function: pdf
$f(t) = \mathrm{prob}[t \le x \le t + dt] / dt = dF(t)/dt$
Cumulative distribution function: CDF
$F(t) = \mathrm{prob}[x \le t] = \int_0^t f(x)\,dx$
Expected value of x
$E_x = \int_{-\infty}^{+\infty} x f(x)\,dx = \sum_k x_k f(x_k)$
Variance of x
$\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x)\,dx = \sum_k (x_k - E_x)^2 f(x_k)$
Covariance of x and y
$\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[xy] - E_x E_y$

[Figure: lifetimes of 20 identical systems (time axis 0 to 50), with the corresponding CDF F(t) (0.0 to 1.0) and pdf f(t) (0.00 to 0.05) plotted against time]
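The discrete forms of the expected value, variance, and covariance on this slide can be sketched in a few lines. The function names and the sample data (a fair die and three correlated pairs) are illustrative assumptions, not taken from the slides.

```python
def expected_value(xs, ps):
    """E_x = sum over k of x_k * f(x_k) for a discrete distribution."""
    return sum(x * p for x, p in zip(xs, ps))


def variance(xs, ps):
    """sigma_x^2 = sum over k of (x_k - E_x)^2 * f(x_k)."""
    mean = expected_value(xs, ps)
    return sum((x - mean) ** 2 * p for x, p in zip(xs, ps))


def covariance(pairs):
    """E[xy] - E_x * E_y over equally likely (x, y) pairs."""
    n = len(pairs)
    e_x = sum(x for x, _ in pairs) / n
    e_y = sum(y for _, y in pairs) / n
    e_xy = sum(x * y for x, y in pairs) / n
    return e_xy - e_x * e_y


# Hypothetical data: a fair six-sided die and three perfectly correlated pairs.
faces = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6
print(expected_value(faces, probs))          # close to 3.5
print(variance(faces, probs))                # close to 35/12
print(covariance([(1, 2), (2, 4), (3, 6)]))  # close to 4/3
```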

Some Simple Probability Distributions


[Figure: sketches of the CDF F(x) and the pdf f(x) for the uniform, exponential, normal, and binomial distributions]

Reliability and MTTF


Reliability: R(t)
Probability that system remains in the Good state through the interval [0, t]
$R(t + dt) = R(t)\,[1 - z(t)\,dt]$, where z(t) is the hazard function
$R(t) = 1 - F(t)$, where F(t) is the CDF of the system lifetime, or its unreliability

Two-state nonrepairable system: Start state Good; Failure takes Good to Failed

Constant hazard function $z(t) = \lambda$ leads to $R(t) = e^{-\lambda t}$ (exponential reliability law)
(system failure rate is independent of its age)

Mean time to failure: MTTF
$MTTF = \int_0^{\infty} t f(t)\,dt = \int_0^{\infty} R(t)\,dt$
The first integral is the expected value of the lifetime; the second is the area under the reliability curve (equality easily provable)
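The claim that MTTF equals the area under the reliability curve can be checked numerically for the exponential law. This is a minimal sketch: the trapezoidal integrator `mttf_numeric`, the failure rate `lam`, and the truncation point of 20 mean lifetimes are all assumptions chosen for illustration.

```python
import math


def mttf_numeric(reliability, t_max, steps=100_000):
    """Trapezoidal estimate of the area under reliability(t) on [0, t_max]."""
    dt = t_max / steps
    total = 0.5 * (reliability(0.0) + reliability(t_max))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt


lam = 0.001  # hypothetical failure rate (failures per hour)
estimate = mttf_numeric(lambda t: math.exp(-lam * t), t_max=20 / lam)
print(f"numeric MTTF = {estimate:.3f}, closed form 1/lam = {1 / lam:.3f}")
```

The integral is truncated at a finite `t_max`, so the estimate falls slightly short of $1/\lambda$; the omitted tail is $e^{-20}/\lambda$ here.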

Failure Distributions of Interest


Exponential: $z(t) = \lambda$
$R(t) = e^{-\lambda t}$, $MTTF = 1/\lambda$
Rayleigh: $z(t) = 2\lambda(\lambda t)$
$R(t) = e^{-(\lambda t)^2}$, $MTTF = (1/\lambda)\,\sqrt{\pi}/2$
Weibull: $z(t) = \alpha\lambda(\lambda t)^{\alpha - 1}$
$R(t) = e^{-(\lambda t)^{\alpha}}$, $MTTF = (1/\lambda)\,\Gamma(1 + 1/\alpha)$
Erlang: $MTTF = k/\lambda$
Gamma: Erlang and exponential are special cases
Normal: Reliability and MTTF formulas are complicated

Discrete versions
Geometric: $R(k) = q^k$
Discrete Weibull
Binomial
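The Weibull formulas on this slide contain the exponential ($\alpha = 1$) and Rayleigh ($\alpha = 2$) cases, which the sketch below checks numerically. The function names and parameter values are illustrative assumptions; `math.gamma` is the standard-library gamma function.

```python
import math


def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))


def weibull_hazard(t, lam, alpha):
    """z(t) = alpha * lam * (lam * t) ** (alpha - 1)."""
    return alpha * lam * (lam * t) ** (alpha - 1)


def weibull_mttf(lam, alpha):
    """MTTF = (1 / lam) * Gamma(1 + 1 / alpha)."""
    return (1.0 / lam) * math.gamma(1.0 + 1.0 / alpha)


lam = 0.5  # hypothetical scale parameter
print(weibull_mttf(lam, 1))  # exponential case, close to 1 / lam
print(weibull_mttf(lam, 2))  # Rayleigh case, close to (1 / lam) * sqrt(pi) / 2
print(weibull_hazard(3.0, lam, 1))  # constant hazard, equal to lam
```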

Comparing Reliabilities
Reliability difference: $R_2 - R_1$
Reliability gain: $R_2 / R_1$
Reliability improvement factor:
$RIF_{2/1} = [1 - R_1(t_M)] / [1 - R_2(t_M)]$
Example: $[1 - 0.9] / [1 - 0.99] = 10$
Reliability improv. index:
$RII = \log R_1(t_M) / \log R_2(t_M)$
Mission time extension:
$MTE_{2/1}(r_G) = T_2(r_G) - T_1(r_G)$
Mission time improv. factor:
$MTIF_{2/1}(r_G) = T_2(r_G) / T_1(r_G)$

[Figure: reliability functions for Systems 1 and 2. System reliability R (0.0 to 1.0) versus time t, showing curves $R_1(t)$ and $R_2(t)$, the values $R_1(t_M)$ and $R_2(t_M)$ at mission time $t_M$, the goal reliability $r_G$ with the corresponding times $T_1(r_G)$ and $T_2(r_G)$, and $MTTF_1$, $MTTF_2$ on the time axis]
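The comparison measures on this slide are simple enough to compute directly. The sketch below assumes, purely for illustration, that both systems follow the exponential reliability law, so that the time $T(r_G)$ at which reliability drops to $r_G$ is $-\ln r_G / \lambda$; the function names and failure rates are hypothetical.

```python
import math


def rif(r1, r2):
    """Reliability improvement factor of system 2 over system 1."""
    return (1.0 - r1) / (1.0 - r2)


def rii(r1, r2):
    """Reliability improvement index: log R1 / log R2."""
    return math.log(r1) / math.log(r2)


def mission_time(lam, r_goal):
    """T(r_goal) for an exponential system: solve exp(-lam * T) = r_goal."""
    return -math.log(r_goal) / lam


lam1, lam2 = 2e-4, 1e-4  # hypothetical failure rates; system 2 is better
r_goal = 0.95            # hypothetical goal reliability r_G
t1, t2 = mission_time(lam1, r_goal), mission_time(lam2, r_goal)

print(f"RIF for the slide's example = {rif(0.9, 0.99):.2f}")
print(f"MTE = {t2 - t1:.1f}, MTIF = {t2 / t1:.2f}")
```

For two exponential systems, MTIF reduces to the ratio of failure rates and does not depend on the goal reliability.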

Availability, MTTR, and MTBF


(Interval) Availability: A(t)
Fraction of time that system is in the Up state during the interval [0, t]
Steady-state availability: $A = \lim_{t \to \infty} A(t)$
Pointwise availability: a(t)
Probability that system is available at time t
$A(t) = (1/t) \int_0^t a(x)\,dx$

Two-state repairable system: Start state Up; Failure takes Up to Down; Repair takes Down to Up

Availability = Reliability, when there is no repair
Availability is a function not only of how rarely a system fails (reliability)
but also of how quickly it can be repaired (time to repair)
$A = \frac{MTTF}{MTTF + MTTR} = \frac{MTTF}{MTBF} = \frac{\mu}{\lambda + \mu}$ (will justify this equation later)
Repair rate $\mu$, with $1/\mu = MTTR$
In general, $\mu \gg \lambda$, leading to $A \approx 1$
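The two equivalent forms of the steady-state availability formula can be compared with a short sketch. The function names and the sample MTTF and MTTR values are illustrative assumptions; the yearly downtime figure simply multiplies unavailability by 8760 hours.

```python
def availability_from_times(mttf, mttr):
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)


def availability_from_rates(lam, mu):
    """A = mu / (lam + mu), with lam = 1 / MTTF and mu = 1 / MTTR."""
    return mu / (lam + mu)


mttf_hours, mttr_hours = 999.0, 1.0  # hypothetical values
a_times = availability_from_times(mttf_hours, mttr_hours)
a_rates = availability_from_rates(1.0 / mttf_hours, 1.0 / mttr_hours)
downtime_per_year = (1.0 - a_times) * 8760.0  # expected hours down per year

print(f"A = {a_times:.4f} (from times), {a_rates:.4f} (from rates)")
print(f"expected downtime = {downtime_per_year:.2f} hours/year")
```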

System Up and Down Times


Short repair time implies
good Maintainability (serviceability)
[Figure: two-state repairable system (Up, Down; Failure, Repair) and a timeline of Up and Down periods with instants t1, t'1, t2, t'2, marking the time to first failure, time between failures, and repair time]
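Interval availability can be read off such a timeline by subtracting the total repair time from the observation interval. The sketch below assumes that $t_i$ is the instant of the i-th failure and $t'_i$ the instant its repair completes; the function name and the sample timeline are hypothetical.

```python
def interval_availability(failures, repairs, t_end):
    """Fraction of [0, t_end] spent in the Up state.

    failures[i] is the assumed failure instant t_i and repairs[i] the assumed
    repair-completion instant t'_i, with t_i < t'_i < t_(i+1).
    """
    down = 0.0
    for t_fail, t_repair in zip(failures, repairs):
        if t_fail < t_end:
            # Count only the part of the outage that falls inside [0, t_end].
            down += min(t_repair, t_end) - t_fail
    return (t_end - down) / t_end


# Hypothetical timeline: failures at 40 and 90, repairs done at 45 and 100.
print(interval_availability([40.0, 90.0], [45.0, 100.0], t_end=100.0))
```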

Performability and MCBF


Performability: P
Composite measure, incorporating both performance and reliability

Three-state degradable system: Start state Up 2; states Up 2, Up 1, Down; transitions Partial failure, Failure, Partial repair, Repair

Simple example
Worth of Up2 twice that of Up1
$p_{Upi}$ = probability system is in state Upi
$P = 2p_{Up2} + p_{Up1}$
$p_{Up2} = 0.92$, $p_{Up1} = 0.06$, $p_{Down} = 0.02$, $P = 1.90$
(system performance equiv. to that of 1.9 processors on average)

Question: What is system availability here?

Performability improvement factor of this system (akin to RIF) relative
to a fail-hard system that goes down when either processor fails:
$PIF = (2 - 2 \times 0.92) / (2 - 1.90) = 1.6$
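The arithmetic of the simple example can be reproduced as follows. The dictionary layout and function name are illustrative assumptions; the probabilities and state worths are the ones given on the slide.

```python
def performability(state_probs, worths):
    """P = sum over states of (worth of state) * (probability of state)."""
    return sum(worths[state] * p for state, p in state_probs.items())


probs = {"Up2": 0.92, "Up1": 0.06, "Down": 0.02}
worths = {"Up2": 2.0, "Up1": 1.0, "Down": 0.0}

p_degradable = performability(probs, worths)

# Fail-hard system: worth 2 while both processors work, otherwise down.
p_fail_hard = worths["Up2"] * probs["Up2"]

# Improvement factor relative to the ideal performability of 2.
pif = (2.0 - p_fail_hard) / (2.0 - p_degradable)

print(f"P = {p_degradable:.2f}, PIF = {pif:.1f}")
```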

System Up, Partially Up, and Down Times


Important to prevent
direct transitions to the
Down state (coverage)

[Figure: three-state degradable system (Start state Up 2; states Up 2, Up 1, Down; transitions Partial failure, Failure, Partial repair, Repair) and a timeline of Up, Partially Up, and Down periods with instants t1, t'1, t2, t'2, t3, t'3, marking partial failure, total failure, and partial repair events; MCBF is labeled on the timeline]

Integrity and Safety


Risk: Prob. of being in Unsafe Failed state
There may be multiple unsafe states,
each with a different consequence (cost)

Three-state fail-safe system: Start state Good; Failure transitions lead from Good to Safe Failed and from Good to Unsafe Failed

Simple analysis
Lump Safe Failed state with Good
state; proceed as in reliability analysis

More detailed analysis
Even though Safe Failed state is more
desirable than Unsafe Failed, it is still
not as desirable as the Good state;
so keeping it separate makes sense

For example, if a repair transition is introduced between Safe Failed
and Good states, we can tackle questions such as the expected
outage of the system in safe mode, and thus its availability
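One way to make the three-state fail-safe model concrete is to assume constant transition rates out of the Good state, which the slides do not specify. Under that assumption (rates `lam_safe` and `lam_unsafe`, both hypothetical, with no repair), the state probabilities have the closed form sketched below, and risk is the probability of the Unsafe Failed state.

```python
import math


def fail_safe_state_probs(lam_safe, lam_unsafe, t):
    """Return (p_good, p_safe_failed, p_unsafe_failed) at time t.

    Assumes constant rates out of Good and no repair, so the system leaves
    Good at total rate lam_safe + lam_unsafe and each failure is unsafe
    with probability lam_unsafe / (lam_safe + lam_unsafe).
    """
    total = lam_safe + lam_unsafe
    p_good = math.exp(-total * t)
    p_failed = 1.0 - p_good
    p_safe = (lam_safe / total) * p_failed
    p_unsafe = (lam_unsafe / total) * p_failed
    return p_good, p_safe, p_unsafe


# Hypothetical rates: 90% of failures are safe.
p_good, p_safe, p_unsafe = fail_safe_state_probs(9e-4, 1e-4, t=1000.0)
print(f"Good = {p_good:.4f}, Safe Failed = {p_safe:.4f}, risk = {p_unsafe:.4f}")
```

As t grows, risk approaches the fraction of failures that are unsafe, which is why increasing the share of safe failures lowers the long-run risk even when the overall failure rate is unchanged.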
