
Availability and Reliability

Availability and reliability, 2013

Slide 1

Principal dependability
properties


Slide 2

Reliability
The probability of failure-free
system operation over a specified
time in a given environment for a
given purpose


Slide 3

Availability
The probability that a system, at a
point in time, will be operational and
able to deliver the requested services


Slide 4

Availability specification
Both reliability and availability
attributes can be expressed as
numbers:
Availability of 0.999 means that the
system is up and running for 99.9% of
the time.
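
To make the 0.999 figure concrete, an availability value can be turned into expected downtime. The following is a minimal sketch, assuming a 365-day year and continuous (24-hour) operation; the function name is illustrative and not part of the original slides.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years (assumption)


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for a given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


# An availability of 0.999 leaves 0.1% of the year as downtime.
print(round(downtime_hours_per_year(0.999), 2))  # 8.76
```

Under these assumptions, 99.9% availability still allows almost nine hours of outage per year.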

Slide 5

Reliability specification
Probability of failure on demand
(POFOD) of 0.0001 means that on
average 1 in 10,000 demands for
service from a system will fail in
some way
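
The POFOD figure can be related to a number of demands with a short calculation. This sketch assumes that demands fail independently of each other, which the slides do not state; the function names are illustrative.

```python
def expected_failures(pofod: float, demands: int) -> float:
    """Average number of failed demands out of `demands` requests."""
    return pofod * demands


def prob_at_least_one_failure(pofod: float, demands: int) -> float:
    """Chance of one or more failures, assuming independent demands."""
    return 1.0 - (1.0 - pofod) ** demands


# With POFOD = 0.0001, 10,000 demands give one failure on average.
print(expected_failures(0.0001, 10_000))                    # 1.0
print(round(prob_at_least_one_failure(0.0001, 10_000), 3))  # 0.632
```

One failure "on average" does not mean a failure is certain: under the independence assumption, there is roughly a 63% chance of seeing at least one failure in 10,000 demands.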


Slide 6

Availability and reliability


Availability and reliability are closely
related.
Obviously, if a system is unavailable, it is
not delivering the specified system
services.


Slide 7

However, it is possible to have
systems with low reliability that
must be available.
So long as system failures can be
repaired quickly and do not damage
data, some system failures may not
be a problem.

Slide 8

Availability is therefore best
considered as a separate attribute
reflecting whether or not the
system can deliver its services.
Availability takes repair time into
account, if the system has to be
taken out of service to repair
faults.
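
The effect of repair time on availability can be sketched with the commonly used steady-state relation availability = MTTF / (MTTF + MTTR). This formula and the figures below are assumptions brought in for illustration; they do not appear in the slides.

```python
def steady_state_availability(mttf_hours: float, mttr_hours: float) -> float:
    """Availability from mean time to failure and mean time to repair."""
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be positive and MTTR non-negative")
    return mttf_hours / (mttf_hours + mttr_hours)


# Two hypothetical systems that fail equally often (every 100 hours on
# average) but differ in how long a repair takes.
quick_repair = steady_state_availability(100.0, 0.1)   # 6-minute repair
slow_repair = steady_state_availability(100.0, 10.0)   # 10-hour repair

print(round(quick_repair, 4))  # 0.999
print(round(slow_repair, 4))   # 0.9091
```

Both hypothetical systems are equally unreliable, yet the one that is repaired quickly is far more available.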


Slide 9

Availability perception
Availability is usually expressed as
a percentage of the time that the
system is available to deliver
services, e.g. 99.9%.


Slide 10


Slide 11

Subjective availability
The number of users affected by
the service outage.
Loss of service in the middle of the
night is less important for many
systems than loss of service during
peak usage periods.

Slide 12

The length of the outage.
The longer the outage, the more the
disruption. Several short outages are
less likely to be disruptive than one long
outage. Long repair times are a
particular problem.

Slide 13

Reliability metrics
Probability of failure on demand
(POFOD)
Probability that a system will not
deliver a service correctly when
requested
Used for systems where demands are
infrequent and intermittent

Slide 14

Rate of occurrence of failure
(ROCOF)
Number of system failures in a given
time period
Used for transaction processing
systems with frequent and regular
transactions
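
ROCOF can be estimated from observed operation as failures divided by operating time. The sketch below also derives a mean time to failure as the reciprocal of ROCOF, which assumes a roughly constant failure rate; the names and the sample figures are hypothetical.

```python
def rocof(failure_count: int, operating_hours: float) -> float:
    """Observed failures per hour of operation."""
    if operating_hours <= 0:
        raise ValueError("operating_hours must be positive")
    return failure_count / operating_hours


def mttf_from_rocof(rate_per_hour: float) -> float:
    """Mean time to failure in hours, assuming a constant failure rate."""
    if rate_per_hour <= 0:
        raise ValueError("rate_per_hour must be positive")
    return 1.0 / rate_per_hour


# Hypothetical log: 2 failures observed over 1,000 hours of operation.
rate = rocof(2, 1000.0)
print(rate)                             # 0.002 failures per hour
print(round(mttf_from_rocof(rate), 1))  # 500.0 hours between failures
```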

Slide 15

Fault
A characteristic of a software system that can lead to a
system error.

Error
An erroneous system state that can lead to system behavior
that is unexpected by system users.

Failure
An event that occurs at some point in time when the system
does not deliver a service as expected by its users.


Slide 16

Faults-errors-failures
Fault
Error
Failure

Slide 17

Faults and failures


Failures are usually a result of
system errors.
The incorrect state causes
undesirable system behaviour.
Incorrect state is a consequence of
executing faulty code.

Slide 18

However, faults do not necessarily
result in system errors.
The erroneous system state resulting
from the fault may be transient and
corrected before an error arises.
The faulty code may never be
executed.

Slide 19

Errors do not necessarily lead to
system failures.
The error can be corrected by built-in
error detection and recovery.
The failure can be protected against
by built-in protection facilities. These
may, for example, protect system
resources from system errors.
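
The chain from fault to error to failure, and the way built-in detection and recovery can break it, can be illustrated with a small sketch. The functions below are hypothetical and the fault is deliberate; this is one possible illustration, not a prescribed technique.

```python
def mean_response_time(samples: list[float]) -> float:
    """Deliberate fault: there is no check for an empty sample list."""
    return sum(samples) / len(samples)


def safe_mean_response_time(samples: list[float], fallback: float = 0.0) -> float:
    """Detect the erroneous condition and recover rather than fail."""
    try:
        return mean_response_time(samples)
    except ZeroDivisionError:
        # Recovery: return an agreed fallback so the caller still gets a
        # usable answer and no failure is visible to users.
        return fallback


# The fault is present but never triggered for non-empty input.
print(mean_response_time([1.0, 3.0]))  # 2.0
# An empty list triggers the fault; the wrapper recovers from it.
print(safe_mean_response_time([]))     # 0.0
```

The unchecked division is the fault. It stays dormant for non-empty input, and when an empty list does trigger it, the wrapper recovers so that users never observe a failure.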

Slide 20

Reliability achievement
Fault avoidance
Development techniques are used
that either minimise the
possibility of mistakes or trap
mistakes before they result in the
introduction of system faults.

Slide 21

Fault detection and removal
Verification and validation
techniques are used that increase
the probability of detecting and
correcting errors before the
system goes into service.

Slide 22

Fault tolerance
Run-time techniques are used to
ensure that system faults do not
result in system errors and/or
that system errors do not lead to
system failures.

Slide 23

Summary
Availability is the probability that a
system will be available when a
service request is made
Reliability is the probability that a
system will deliver a service as
expected by users

Slide 24

Summary
Software faults lead to state errors,
which lead to operational failures
Fault avoidance, detection and
tolerance are strategies for
achieving reliability

Slide 25