f33 FT Computing Lec02 Measures

Fault-Tolerant Computing
Motivation,
Background,
and Tools
Terminology, Models, and Measures

Oct. 2006
Slide 1
About This Presentation

This presentation has been prepared for the graduate
course ECE 257A (Fault-Tolerant Computing) by
Behrooz Parhami, Professor of Electrical and Computer
Engineering at University of California, Santa Barbara.
The material contained herein can be used freely in
classroom teaching or any other educational setting.
Unauthorized uses are prohibited. Behrooz Parhami
Edition   Released    Revised   Revised
First     Oct. 2006

Impairments to Dependability
[Word-cloud figure. Terms in use for impairments: flaw, fault, error, failure, hazard, bug, degradation, intrusion, defect, malfunction, crash]

The Fault-Error-Failure Cycle

Aspect       Impairment
Structure    Fault
State        Error
Behavior     Failure

Note: "Structure" includes both components and design.

[Figure: Schematic diagram of the Newcastle hierarchical model and the impairments within one level. Figure labels: Fault; Correct signal; 0; Replaced with NAND?]

The Four-Universe Model

Universe        Impairment
Physical        Failure
Logical         Fault
Informational   Error
External        Crash

[Figure: Cause-effect diagram for Avižienis' four-universe model of impairments to dependability.]

Unrolling the Fault-Error-Failure Cycle
Aspect       Impairment
Structure    Fault
State        Error
Behavior     Failure

The cycle is unrolled twice (First Cycle, Second Cycle), giving six levels:

Abstraction   Impairment    Level
Component     Defect        Low-level
Logic         Fault         Low-level
Information   Error         Mid-level
System        Malfunction   Mid-level
Service       Degradation   High-level
Result        Failure       High-level

[Figure: Cause-effect diagram for an extended six-level view of impairments to dependability.]

Multilevel Model
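The unimpaired state is Ideal. Each level has one impaired state:

Level         State            Category
Component     Defective        Low-level impaired
Logic         Faulty           Low-level impaired
Information   Erroneous        Mid-level impaired
System        Malfunctioning   Mid-level impaired
Service       Degraded         High-level impaired
Result        Failed           High-level impaired

Legend (transition types in the figure): Initial entry, Deviation, Remedy, Tolerance
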
Analogy for the Multilevel Model

An analogy for our
multi-level model of
dependable computing.
Defects, faults, errors,
malfunctions,
degradations, and
failures are
represented by pouring
water from above.
Valves represent
avoidance and
tolerance techniques.
The goal is to avoid
overflow.
Figure annotations:
Wall heights represent interlevel latencies
Inlet valves represent avoidance techniques
Drain valves represent tolerance techniques
Concentric reservoirs are analogs of the six model levels, with defect being innermost

Why Our Concern with Dependability?

Reliability of an n-transistor system, each transistor having failure rate $\lambda$:
$R(t) = e^{-n \lambda t}$
There are only 3 ways of making systems more reliable:
Reduce $\lambda$
Reduce n
Reduce t
Alternative: Change the reliability formula by introducing redundancy in system

[Figure: plot of $e^{-n \lambda t}$ (vertical axis, 0.0 to 1.0) versus nt (horizontal axis, $10^4$ to $10^{10}$); values marked on the curve: .9999, .9990, .9900, .9048, .3679]

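As a rough illustration of the formula on this slide, the following Python sketch evaluates $R(t) = e^{-n \lambda t}$. The function name `series_reliability` and the sample products of n, $\lambda$, and t are assumptions chosen for illustration; they are picked so that the results match the values marked on the slide's curve.

```python
import math

def series_reliability(n: int, lam: float, t: float) -> float:
    """R(t) = exp(-n * lam * t): n components, each with constant failure rate lam."""
    return math.exp(-n * lam * t)

# Only the product n * lam * t matters. The products below are illustrative
# choices that reproduce the reliability values marked on the slide.
for product in (1e-4, 1e-3, 1e-2, 1e-1, 1.0):
    print(f"n*lam*t = {product:g}  ->  R = {math.exp(-product):.4f}")
```

Reducing $\lambda$, n, or t by a factor of 10 each moves the system one step up this list, which is why redundancy (changing the formula itself) is offered as the alternative.
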
Highly Dependable Computer Systems

Long-life systems: Fail-slow, Rugged, High-reliability
Spacecraft with multiyear missions, systems in inaccessible locations
Methods: Replication (spares), error coding, monitoring, shielding
Safety-critical systems: Fail-safe, Sound, High-integrity
Flight control computers, nuclear-plant shutdown, medical monitoring
Methods: Replication with voting, time redundancy, design diversity
High-availability systems: Fail-soft, Robust, High-availability
Telephone switching centers, transaction processing, e-commerce
Methods: HW/info redundancy, backup schemes, hot-swap, recovery
Just as performance enhancement techniques gradually migrate from
supercomputers to desktops, so too dependability enhancement
methods find their way from exotic systems into personal computers

Aspects of Dependability
[Word-art figure, only partly legible. Terms that can be made out include: Reliability (MTTF); Availability (pointwise av., interval av.; MTTR, MTBF); Performability (MCBF); Maintainability; Controllability, Observability; Risk, Consequence; Security; Resilience]

Concepts from Probability Theory

Probability density function: pdf
$f(t) = \mathrm{prob}[t \le x \le t + dt] / dt = dF(t) / dt$
Cumulative distribution function: CDF
$F(t) = \mathrm{prob}[x \le t] = \int_0^t f(x)\,dx$
Expected value of x
$E_x = \int_{-\infty}^{+\infty} x f(x)\,dx = \sum_k x_k f(x_k)$
Variance of x
$\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x)\,dx = \sum_k (x_k - E_x)^2 f(x_k)$
Covariance of x and y
$\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[xy] - E_x E_y$

[Figure: lifetimes of 20 identical systems (time axis 0 to 50), with the corresponding CDF F(t) (0.0 to 1.0) and pdf f(t) (0.00 to 0.05) plotted against time]

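The discrete forms of these definitions can be sketched in a few lines of Python. The lifetimes and probabilities below are hypothetical values chosen so that the arithmetic is easy to follow by hand; the helper names are likewise illustrative.

```python
def mean(xs, ps):
    """E_x = sum_k x_k f(x_k) for a discrete distribution."""
    return sum(x * p for x, p in zip(xs, ps))

def variance(xs, ps):
    """sigma_x^2 = sum_k (x_k - E_x)^2 f(x_k)."""
    m = mean(xs, ps)
    return sum((x - m) ** 2 * p for x, p in zip(xs, ps))

def cdf(xs, ps, t):
    """F(t) = prob[x <= t]."""
    return sum(p for x, p in zip(xs, ps) if x <= t)

# Hypothetical discrete lifetime distribution (not taken from the slide).
lifetimes = [10, 20, 30, 40]
probs = [0.1, 0.2, 0.3, 0.4]

e_x = mean(lifetimes, probs)
var_x = variance(lifetimes, probs)
# Setting y = x in the covariance identity gives var = E[x^2] - E_x^2.
var_via_identity = mean([x * x for x in lifetimes], probs) - e_x ** 2

print(e_x, var_x, var_via_identity, cdf(lifetimes, probs, 25))
```

Both routes to the variance agree, which is the covariance identity $E[xy] - E_x E_y$ specialized to y = x.
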
Some Simple Probability Distributions

[Figure: CDF F(x) and pdf f(x) panels for the uniform, exponential, normal, and binomial distributions]

Reliability and MTTF

Reliability: R(t)
Probability that system remains in the Good state through the interval [0, t]
Two-state nonrepairable system: Start state Good; a Failure transition leads to Failed
$R(t + dt) = R(t)\,[1 - z(t)\,dt]$, where z(t) is the hazard function
$R(t) = 1 - F(t)$, where F(t) is the CDF of the system lifetime, or its unreliability
Constant hazard function $z(t) = \lambda$ $\Rightarrow$ $R(t) = e^{-\lambda t}$ (exponential reliability law)
(system failure rate is independent of its age)
Mean time to failure: MTTF
$\mathrm{MTTF} = \int_0^{\infty} t f(t)\,dt = \int_0^{\infty} R(t)\,dt$
The first integral is the expected value of lifetime; the second is the area under the reliability curve (easily provable)

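The claim that MTTF equals the area under the reliability curve can be checked numerically for the exponential law. This is a minimal sketch: the failure rate `lam`, the integration horizon, and the helper name `mttf_by_integration` are assumptions for illustration, and the trapezoidal rule stands in for the exact integral.

```python
import math

def mttf_by_integration(reliability, horizon, steps=100_000):
    """Trapezoidal approximation of the area under R(t) on [0, horizon]."""
    dt = horizon / steps
    total = 0.5 * (reliability(0.0) + reliability(horizon))
    for i in range(1, steps):
        total += reliability(i * dt)
    return total * dt

lam = 0.001  # hypothetical constant failure rate, per hour

# Integrate far enough out (20 mean lifetimes) that the truncated tail is negligible.
mttf_numeric = mttf_by_integration(lambda t: math.exp(-lam * t), horizon=20 / lam)
mttf_exact = 1 / lam  # closed form for the exponential law

print(mttf_numeric, mttf_exact)
```
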
Failure Distributions of Interest

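Exponential: $z(t) = \lambda$; $R(t) = e^{-\lambda t}$; $\mathrm{MTTF} = 1/\lambda$
  Discrete version: Geometric, $R(k) = q^k$
Rayleigh: $z(t) = 2\lambda(\lambda t)$; $R(t) = e^{-(\lambda t)^2}$; $\mathrm{MTTF} = (1/\lambda)\,\sqrt{\pi}/2$
Weibull: $z(t) = \alpha\lambda(\lambda t)^{\alpha - 1}$; $R(t) = e^{-(\lambda t)^{\alpha}}$; $\mathrm{MTTF} = (1/\lambda)\,\Gamma(1 + 1/\alpha)$
  Discrete version: Discrete Weibull
Erlang: $\mathrm{MTTF} = k/\lambda$
Gamma: Erlang and exponential are special cases
Normal: Reliability and MTTF formulas are complicated
  Discrete version: Binomial
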
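The Weibull formulas contain the exponential and Rayleigh laws as the special cases $\alpha = 1$ and $\alpha = 2$. The sketch below, with illustrative function names and parameter values, uses Python's `math.gamma` to confirm that the Weibull MTTF reduces to $1/\lambda$ and $(1/\lambda)\sqrt{\pi}/2$ in those cases.

```python
import math

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))

def weibull_hazard(t, lam, alpha):
    """z(t) = alpha * lam * (lam * t) ** (alpha - 1)."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_mttf(lam, alpha):
    """MTTF = (1 / lam) * Gamma(1 + 1 / alpha)."""
    return (1 / lam) * math.gamma(1 + 1 / alpha)

lam = 0.5  # hypothetical scale parameter
print(weibull_mttf(lam, 1), 1 / lam)                         # exponential case
print(weibull_mttf(lam, 2), math.sqrt(math.pi) / (2 * lam))  # Rayleigh case
print(weibull_hazard(3.0, lam, 1))                           # constant hazard: lam
```
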
Comparing Reliabilities
Reliability difference: $R_2 - R_1$
Reliability gain: $R_2 / R_1$
Reliability improvement factor:
$\mathrm{RIF}_{2/1} = [1 - R_1(t_M)] / [1 - R_2(t_M)]$
Example: $[1 - 0.9] / [1 - 0.99] = 10$
Reliability improvement index:
$\mathrm{RII} = \log R_1(t_M) / \log R_2(t_M)$
Mission time extension:
$\mathrm{MTE}_{2/1}(r_G) = T_2(r_G) - T_1(r_G)$
Mission time improvement factor:
$\mathrm{MTIF}_{2/1}(r_G) = T_2(r_G) / T_1(r_G)$

[Figure: reliability functions $R_1(t)$ and $R_2(t)$ for Systems 1 and 2; vertical axis System Reliability (R), 0.0 to 1.0; horizontal axis Time (t). Marked on the plot: $R_1(t_M)$, $R_2(t_M)$, $r_G$, $T_1(r_G)$, $t_M$, $T_2(r_G)$, $\mathrm{MTTF}_1$, $\mathrm{MTTF}_2$]

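These comparison measures are simple enough to compute directly. The sketch below assumes, purely for illustration, that both systems follow the exponential reliability law with hypothetical failure rates `lam1` and `lam2`; the function names are not from the slides. Under that assumption, $T(r_G) = -\ln(r_G)/\lambda$ is the time at which reliability drops to the goal $r_G$.

```python
import math

def rif(r1, r2):
    """Reliability improvement factor of system 2 over system 1."""
    return (1 - r1) / (1 - r2)

def rii(r1, r2):
    """Reliability improvement index."""
    return math.log(r1) / math.log(r2)

def mission_time(lam, r_goal):
    """Time at which an exponential R(t) = exp(-lam * t) falls to r_goal."""
    return -math.log(r_goal) / lam

print(rif(0.9, 0.99))  # the slide's example, approximately 10

lam1, lam2, r_goal = 2e-4, 1e-4, 0.9  # hypothetical rates and reliability goal
t1, t2 = mission_time(lam1, r_goal), mission_time(lam2, r_goal)
mte = t2 - t1    # mission time extension
mtif = t2 / t1   # mission time improvement factor
print(mte, mtif)
```

For two exponential systems, MTIF equals $\lambda_1/\lambda_2$ regardless of the reliability goal, whereas MTE grows as the goal is relaxed.
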
Availability, MTTR, and MTBF

(Interval) Availability: A(t)
Fraction of time that system is in the Up state during the interval [0, t]
Two-state repairable system: Start state Up; Failure leads to Down; Repair leads back to Up
Steady-state availability: $A = \lim_{t \to \infty} A(t)$
Pointwise availability: a(t)
Probability that system is available at time t
$A(t) = (1/t) \int_0^t a(x)\,dx$
Availability = Reliability, when there is no repair

Availability is a function not only of how rarely a system fails (reliability)
but also of how quickly it can be repaired (time to repair)
$A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR}) = \mathrm{MTTF} / \mathrm{MTBF} = \mu / (\lambda + \mu)$
where $\mu$ is the repair rate and $1/\mu = \mathrm{MTTR}$ (Will justify this equation later)
In general, $\mu \gg \lambda$, leading to $A \approx 1$

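The two forms of the steady-state availability formula can be compared with a small Python sketch. The MTTF and MTTR figures are hypothetical, and the function names are illustrative.

```python
def availability_from_times(mttf, mttr):
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

def availability_from_rates(lam, mu):
    """A = mu / (lam + mu), with lam = 1 / MTTF and mu = 1 / MTTR."""
    return mu / (lam + mu)

mttf, mttr = 1000.0, 10.0  # hypothetical values, in hours
a_times = availability_from_times(mttf, mttr)
a_rates = availability_from_rates(1 / mttf, 1 / mttr)
print(a_times, a_rates)  # both forms give the same value, close to 1
```

With a repair rate 100 times the failure rate, availability is already about 0.99, which illustrates why $\mu \gg \lambda$ leads to $A \approx 1$.
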
System Up and Down Times

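Short repair time implies good Maintainability (serviceability)

[Figure: two-state repairable system (Start state Up; Failure leads to Down; Repair leads back to Up) and a timeline of Up and Down periods, marked 0, t1, t'1, t2, t'2, with the intervals "Time to first failure", "Time between failures", and "Repair time" labeled]
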
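The quantities labeled on the timeline can be extracted from a log of failure and repair instants. The sketch below assumes that t1, t2 are failure times and t'1, t'2 are the corresponding repair-completion times; that reading of the figure, the sample numbers, and the observation window are all assumptions made for illustration.

```python
# Hypothetical event log, in hours (not taken from the slide).
failures = [100.0, 350.0]   # t1, t2: instants at which the system goes Down
repairs = [120.0, 360.0]    # t'1, t'2: instants at which it returns to Up
window_end = 500.0          # end of the observation interval [0, window_end]

time_to_first_failure = failures[0]
times_between_failures = [b - a for a, b in zip(failures, failures[1:])]
repair_times = [r - f for f, r in zip(failures, repairs)]

# Observed fraction of the window spent in the Up state.
downtime = sum(repair_times)
observed_availability = (window_end - downtime) / window_end

print(time_to_first_failure, times_between_failures, repair_times, observed_availability)
```

The observed fraction is a single-run estimate; the interval availability A(t) defined earlier is a property of the system's probabilistic model, not of one log.
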
Performability and MCBF

Performability: P
Composite measure, incorporating both performance and reliability
Three-state degradable system: Start state Up 2; Partial failure leads to Up 1; Failure leads to Down; Partial repair and Repair transitions lead back up
Simple example
Worth of Up2 twice that of Up1
$p_{Upi}$ = probability system is in state Upi
$P = 2 p_{Up2} + p_{Up1}$
Question: What is system availability here?
$p_{Up2} = 0.92$, $p_{Up1} = 0.06$, $p_{Down} = 0.02$, $P = 1.90$
(system performance equivalent to that of 1.9 processors on average)
Performability improvement factor of this system (akin to RIF) relative
to a fail-hard system that goes down when either processor fails:
$\mathrm{PIF} = (2 - 2 \times 0.92) / (2 - 1.90) = 1.6$

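The arithmetic in this example can be reproduced in a few lines of Python. The variable names are illustrative; the probabilities and state worths are those given on the slide.

```python
p_up2, p_up1, p_down = 0.92, 0.06, 0.02  # state probabilities from the slide

# Up2 is worth 2 units of performance, Up1 is worth 1, Down is worth 0.
performability = 2 * p_up2 + 1 * p_up1

# A fail-hard system delivers 2 units only while both processors are up.
fail_hard_performability = 2 * p_up2

# Ratio of the shortfalls from the ideal value of 2 (akin to RIF).
pif = (2 - fail_hard_performability) / (2 - performability)

print(round(performability, 2), round(pif, 2))
```
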
System Up, Partially Up, and Down Times

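Important to prevent direct transitions to the Down state (coverage)

[Figure: three-state degradable system (Start state Up 2; Partial failure leads to Up 1; Failure leads to Down; Partial repair and Repair transitions lead back up) and a timeline of Up, Partially Up, and Down periods, marked 0, t1, t'1, t2, t'2, t3, t'3, with Partial Failure, Partial Repair, Total Failure, and MCBF labeled]
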
Integrity and Safety

Risk: Prob. of being in Unsafe Failed state
There may be multiple unsafe states,
each with a different consequence (cost)
Three-state fail-safe system: Start state Good; Failure transitions lead from Good to Safe Failed and to Unsafe Failed
Simple analysis
Lump Safe Failed state with Good
state; proceed as in reliability analysis
More detailed analysis
Even though Safe Failed state is more
desirable than Unsafe Failed, it is still
not as desirable as the Good state;
so keeping it separate makes sense
For example, if a repair transition is introduced between Safe Failed
and Good states, we can tackle questions such as the expected
outage of the system in safe mode, and thus its availability
