
Lecture IX

Fault-Tolerance in Real-Time Systems


Sidra Rashid, Bahria University, Islamabad Campus

Fault Tolerance
What is Fault Tolerance?
Ability of an operational system to tolerate the presence of faults

Why tolerate faults?


It is practically impossible to completely test a system of realistic size. Therefore, it is important to implement techniques that allow a system to detect and tolerate faults during normal operation.

4 phases of fault tolerance:


Error detection: detection of an erroneous state.
Damage assessment: computes the severity of the fault.
Error processing: substitutes an error-free state for the erroneous one.
Fault treatment: determines the cause of the error, then applies fault passivation to ensure it doesn't happen again.


Fault Classification
Faults are classified along three dimensions (classification tree):

Nature: accidental | intentional
Origin:
  Phenomenon: physical | human-made
  Extent: internal | external
  Phase: design | operation
Persistence: permanent | temporary

Nature of a fault distinguishes the intention behind the fault.
Origin of a fault is categorized along three axes:
Phenomenon: is the fault caused by a physical or a human phenomenon?
Extent: does the internal or the external environment cause the fault?
Phase: is the fault introduced during design or during operation of the system?
Persistence of a fault determines the duration of the fault state.


Software Fault Tolerance Techniques


The key to fault tolerance is redundancy.
Three domains:
Space
Several hardware channels, each executing the same task

Information
The system is recovered via data structures that store redundant copies of the system contents

Repetition (time)
A module is restarted (re-executed) when it is found to be faulty

Two major schemes have evolved


Recovery Block (RB)
1H/NdS/nT system
There is only one hardware channel (1H), and faults are tolerated by executing several diverse software modules (NdS) sequentially, i.e. with redundancy in time (nT)

N-Version Programming (NVP)

NH/NdS/1T system
The system has a number of (identical) hardware channels (NH), each executing one of the diverse software versions (NdS), hence no redundancy in time (1T)

Software Fault Tolerance Recovery Block

[Figure: Recovery Block structure]
A checkpoint is taken and the primary module executes; its result is submitted to the acceptance test. If the test passes, the result is delivered. If the test fails, the state is restored from the checkpoint and the switch selects the next alternate (Alternate 1, Alternate 2, ..., Alternate N-1), as long as untried alternates remain and the deadline has not been exceeded; otherwise a fault is signalled.

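To make the control flow above concrete, here is a minimal Python sketch of a recovery block. It is an illustration only: the module list, the state dictionary, the acceptance_test callable, and the deadline handling are all hypothetical, and a real implementation would also checkpoint I/O and task state.

import copy
import time

def recovery_block(modules, acceptance_test, state, deadline):
    """Try the primary and then each alternate until one passes the acceptance test.

    modules: callables (primary first, then alternates), each operating on a state dict.
    acceptance_test: returns True if a result is acceptable (single test for all modules).
    state: mutable system state; a checkpoint is taken before the first attempt.
    deadline: absolute wall-clock time (seconds since the epoch) by which a result is due.
    """
    checkpoint = copy.deepcopy(state)              # establish the recovery point
    for module in modules:
        if time.time() >= deadline:                # deadline exceeded: stop trying alternates
            break
        result = module(state)
        if acceptance_test(result):
            return result                          # accepted: discard the checkpoint
        state.clear()
        state.update(copy.deepcopy(checkpoint))    # failed: restore from checkpoint, try next alternate
    raise RuntimeError("recovery block failed: no module passed within the deadline")

For example, recovery_block([fast_sort, simple_sort], is_sorted, {}, time.time() + 0.01) would fall back to the simpler alternate if the optimized primary produced an unacceptable result (the function names here are hypothetical).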

Software Fault Tolerance Recovery Block


Considerations
Software diversity
Idea: different teams, one specification, different products
Hope that the failure domains do not overlap

Difficulty in designing the acceptance test
A single test must serve all modules of the recovery block
The test is the most crucial element in improving reliability

Design of the recovery cache
Must be sufficiently simple to ensure it introduces no faults of its own

Increased system overhead

Domino effect
Recovery blocks can push concurrent tasks that communicate into uncontrolled rollback

Software Fault Tolerance N-Version Programming


[Figure: N-Version Programming structure]
An input switch feeds N diverse versions (Version 1, Version 2, ..., Version N), which run synchronized on separate hardware channels. A voter collects the results and delivers the majority agreement as the output; if no agreement is reached, a failure is signalled.

N-Version Programming ( NH/NdS/1T )
Several hardware channels
Software-diverse versions of the code
Results are voted upon
The initial specification is crucial
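A minimal sketch of the majority voter, assuming the versions return results that can be compared exactly (the function names and the strict-majority rule are illustrative, not a prescribed implementation):

from collections import Counter

def n_version_vote(versions, inputs):
    """Run N diverse versions on the same inputs and deliver the majority result."""
    results = [version(inputs) for version in versions]   # one result per hardware channel
    winner, count = Counter(results).most_common(1)[0]    # results assumed hashable and exactly comparable
    if count > len(versions) // 2:                        # strict majority agreement
        return winner
    raise RuntimeError("no majority agreement among versions")

Exact comparison is the weak point noted in the considerations below: real voters often have to accept a range of valid answers rather than identical values.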

Software Fault Tolerance N-Version Programming


Considerations
Software diversity! It is difficult to create a good specification

Decision mechanism
Some results will not always be identical (both valid and invalid ones)
Defining a range of valid solutions helps, but it decreases the distance from the acceptance-test approach

System overhead
Temporal: synchronization and the decision algorithm
Spatial: multiple hardware channels and space for multiple software versions

Extensions
Community Error Recovery (forward recovery)
Uses enough information from the good versions to recover the failed versions

Software Fault Tolerance Consensus Recovery Block(CRB)


[Figure: Consensus Recovery Block structure]
An input switch feeds N versions (Version 1, Version 2, ..., Version N); a voter looks for agreement among their results. If there is agreement, the agreed result is the output. If there is no agreement, the results are submitted to an acceptance test (AT), as long as untried versions remain and the time limit has not expired; otherwise a failure is signalled.

NH/NdS/1T
Synthesis of N-Version Programming and the Recovery Block
Basic assumptions:
No similar errors will occur (i.e. no erroneous results that resemble each other)
If two or more versions agree, the result is considered correct

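The hybrid control flow can be sketched as follows. This is purely illustrative: the two-votes-agree rule comes from the assumption above, results are assumed exactly comparable, and the time-limit handling is simplified.

import time
from collections import Counter

def consensus_recovery_block(versions, acceptance_test, inputs, time_limit):
    """Vote first; if no two versions agree, fall back to an acceptance test."""
    results = [version(inputs) for version in versions]   # versions run (conceptually in parallel)
    winner, count = Counter(results).most_common(1)[0]
    if count >= 2:                                         # two or more agreeing versions: accept
        return winner
    deadline = time.time() + time_limit
    for result in results:                                 # no agreement: RB-style acceptance testing
        if time.time() >= deadline:
            break
        if acceptance_test(result):
            return result
    raise RuntimeError("consensus recovery block failed")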

Software Fault Tolerance Distributed Recovery Block


[Figure: Distributed Recovery Block structure]
The input is delivered to two nodes. On the primary node, Version A executes first and its result is checked by an acceptance test; if the test fails, the node switches to its alternate version, as long as untried alternates remain and the deadline has not been exceeded. The secondary node runs the versions in the opposite order (Version B first), so an already-checked alternate result is available if the primary node fails. If both nodes exhaust their alternates, the scheme fails.

NH/NS/1T or NH/NdS/1T
Reproduces the RB scheme on multiple network nodes
Considerations
Synchronization between the nodes, especially during rollback
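A sequential simulation of the idea (in a real DRB the two nodes run concurrently on separate hardware; the function names here are hypothetical):

def distributed_recovery_block(version_a, version_b, acceptance_test, inputs):
    """Primary node tries Version A first; the shadow node tries Version B first."""
    primary_result = version_a(inputs)            # primary node's first try
    if acceptance_test(primary_result):
        return primary_result                     # normal case: the primary node delivers the output
    shadow_result = version_b(inputs)             # the shadow node has (conceptually) already computed this
    if acceptance_test(shadow_result):
        return shadow_result                      # shadow node takes over without re-executing from scratch
    raise RuntimeError("both nodes failed the acceptance test")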

Extended Distributed Recovery Block


[Figure: Extended Distributed Recovery Block structure]
A supervisor node monitors an active node and a shadow node through a heartbeat scheme. Each node runs a recovery manager that exchanges heartbeat, reset-request, and consent messages with the supervisor.

Three roles: active node, shadow node, supervisor node.

Each node contains:
Primary version
Alternate version
Acceptance test
Device drivers

On the active node the primary version executes and drives the output to the system; on the shadow node the alternate version executes first. Both nodes are connected to the system.
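A toy sketch of the supervisor's side of the heartbeat scheme (the timeout value, the role names, and the promotion decision are assumptions made for illustration):

import time

class Supervisor:
    """Monitor heartbeats from the active and shadow nodes and decide on role switches."""

    def __init__(self, timeout=0.5):
        self.timeout = timeout
        self.last_beat = {"active": time.time(), "shadow": time.time()}

    def heartbeat(self, node):
        self.last_beat[node] = time.time()      # called periodically by each node's recovery manager

    def check(self):
        now = time.time()
        if now - self.last_beat["active"] > self.timeout:
            return "promote shadow to active"   # the shadow node takes over driving the system output
        if now - self.last_beat["shadow"] > self.timeout:
            return "reset shadow"               # request a reset of the silent shadow node
        return "ok"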

Roll-Forward Checkpointing Scheme


Used for multiprocessor systems
Pool of active processing modules, each with:
Processor
Volatile storage
Stable storage

Checkpoint processor
The checkpoint processor detects module failures by comparing the state of each pair of processing modules that perform the same task. The two processors execute their tasks, checkpoint their states, and send the checkpoints to the checkpoint processor.

The checkpoint processor compares the states; if they match, the new checkpoint is considered correct and replaces the old checkpoint.
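A minimal sketch of the comparison step just described (the state representation and the failure signalling are assumptions; the scheme's roll-forward recovery after a mismatch is not shown):

def compare_checkpoints(old_checkpoint, state_a, state_b):
    """Return the checkpoint to keep and whether a failure was detected.

    state_a, state_b: checkpointed states sent by the two modules running the same task.
    """
    if state_a == state_b:             # states match: the new checkpoint is considered correct
        return state_a, False          # it replaces the old checkpoint
    return old_checkpoint, True        # mismatch: keep the old checkpoint and flag a failure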


N Self-Checking Program
Made up of several Self Checking Components
Made up of different variants (equivalent to the alternates in RB and the versions in NVP). A self-checking component is built in one of two ways: a) a variant is associated with an acceptance test that checks its results, or b) two variants are paired and associated with a comparison algorithm. The components execute in parallel, and this parallel execution is what provides the fault tolerance. Each component is responsible for determining whether the result it delivers is acceptable.
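An illustrative sketch of the two component styles (the helper names are invented for this example, and delivery of the first acceptable result is a simplification of the parallel scheme):

def variant_with_test(variant, acceptance_test):
    """Case a): a single variant whose result is checked by an acceptance test."""
    def component(inputs):
        result = variant(inputs)
        return result, acceptance_test(result)
    return component

def variant_pair(variant_a, variant_b):
    """Case b): two variants whose results are checked by a comparison algorithm."""
    def component(inputs):
        result_a, result_b = variant_a(inputs), variant_b(inputs)
        return result_a, result_a == result_b
    return component

def n_self_checking(components, inputs):
    """The components run (conceptually in parallel); deliver the first acceptable result."""
    for component in components:
        result, acceptable = component(inputs)
        if acceptable:
            return result
    raise RuntimeError("no self-checking component delivered an acceptable result")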

Data Diversity
Retry Block
The algorithm executes normally on the original input and its result is checked by an acceptance test.
If the result is accepted by the test, execution is complete.
If the result is not accepted, the input data is restated (re-expressed) and the block runs again on the new data (sketched together with N-copy programming below).

N-copy Programming
Upon entry to the block, the input data is restated in N-1 different ways, creating N different data sets in total.
The N copies of the program execute in parallel, one on each data set.
The output is selected with a voting scheme.
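A combined sketch of the two data-diversity schemes. The re-expression function is entirely application-specific; the numeric perturbation used here, the retry limit, and the exact-match vote are all illustrative assumptions.

import random
from collections import Counter

def re_express(data, seed):
    """Toy data re-expression for numeric inputs: perturb values without changing their meaning."""
    rng = random.Random(seed)
    return [x + rng.uniform(-1e-9, 1e-9) for x in data]

def retry_block(algorithm, acceptance_test, data, max_retries=3):
    """Retry block: re-run the same algorithm on restated data until the test passes."""
    candidate = data
    for attempt in range(max_retries + 1):
        result = algorithm(candidate)
        if acceptance_test(result):
            return result
        candidate = re_express(data, seed=attempt)   # restate the input and try again
    raise RuntimeError("retry block failed")

def n_copy_programming(algorithm, data, n=3):
    """N-copy programming: run N copies on N restated data sets and vote on the outputs."""
    data_sets = [data] + [re_express(data, seed=i) for i in range(n - 1)]
    results = [algorithm(d) for d in data_sets]      # the copies would run in parallel
    winner, _ = Counter(results).most_common(1)[0]   # exact-match plurality vote over hashable results
    return winner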


Summary
Fault tolerant design considerations
Anticipated faults: in most cases, a simple acceptance test is all that is needed.
Unanticipated faults: designers must decide what the most practical solution is.
Most of the techniques in this report are hardware based, and many designers will not be able to use them.
This leaves designers with:
Recovery Blocks (software design diversity)
Retry Blocks (data diversity)


Fault-Tolerance in Real-Time Databases


Overview
The causes of downtime
Availability solutions
CASE 1: Clustra
CASE 2: TelORB
CASE 3: RODAIN


The Causes of Downtime


Planned downtime
Hardware expansion
Database software upgrades
Operating system upgrades

Unplanned downtime
Hardware failure
OS failure
Database software bugs
Power failure
Disaster
Human error

Traditional Availability Solutions


Replication:
The standby system needs to duplicate transactions as they occur on the primary system. Ideally, this replication is done in near-real time, so the standby system is very close to current in the event of a primary system failure.

Failover
Failover is the moment of truth. When a failure occurs on the primary system, all connections must be re-established on the standby, and all active transactions must be rolled back and restarted. Because everything must be transferred, typical failover times are measured in minutes at best, during which time the database is unavailable.

Primary restart
Once the standby system takes over, there is no longer a standby. This is an especially vulnerable period, so the primary must be restarted as quickly as possible. In some schemes the primary becomes the new standby; in other schemes processing must, at some point, be switched back to the primary.
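A toy sketch of the replicate/failover/restart cycle described above (the class, its data layout, and the transaction handling are invented for illustration and ignore networking, logging, and consistency details):

class PrimaryStandbyPair:
    """Minimal primary/standby pair: replicate commits, fail over, then restore a standby."""

    def __init__(self):
        self.primary = {"data": {}, "active_txns": set()}
        self.standby = {"data": {}, "active_txns": set()}

    def commit(self, key, value):
        self.primary["data"][key] = value
        if self.standby is not None:
            self.standby["data"][key] = value     # near-real-time replication of committed work

    def failover(self):
        to_restart = set(self.primary["active_txns"])     # active transactions are rolled back and restarted
        self.primary, self.standby = self.standby, None   # the standby takes over; no standby remains
        return to_restart

    def restart_old_primary(self):
        # The restarted machine re-joins as the new standby, copying the current data
        self.standby = {"data": dict(self.primary["data"]), "active_txns": set()}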


CASE 1: Clustra
Developed for telephony applications such as mobility management and intelligent networks.
Relational database with location and replication transparency.
Real-time data is locked in main memory, and the API provides precompiled transactions.
NOT a real-time database!


Clustra hardware architecture


Data distribution and replication


How Clustra Handles Failures


Real-time failover: hot-standby data is up to date, so failover occurs in milliseconds.
Automatic restart and takeback: restart of the failed node and takeback of operations are automatic, and again transparent to users and operators.
Self-repair: if a node fails completely, data is copied from the complementary node to a standby. This is also automatic and transparent.
Limited failure effects.


How Clustra Handles Upgrades


Hardware, operating system, and database software upgrades are performed without ever going down.
The process is called a rolling upgrade:
Required changes are performed node by node. Each upgraded node catches up to the status of its complementary node; when this is complete, the operation moves on to the next node.
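A schematic sketch of the node-by-node loop (the node interface, the upgrade and catch_up callbacks, and the pairing of complementary nodes are hypothetical):

def rolling_upgrade(node_pairs, upgrade, catch_up):
    """Upgrade one node at a time while its complementary node keeps serving traffic."""
    for node, complementary in node_pairs:
        node.take_offline()               # the complementary node carries the load meanwhile
        upgrade(node)                     # hardware, OS, or database software change
        catch_up(node, complementary)     # replay missed transactions until the node is current
        node.bring_online()               # then proceed to the next node; service is never interrupted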


CASE 2: TelORB
Characteristics
Very high availability (HA), robustness implemented in SW
(Soft) real time
Scalability by using loosely coupled processors
Openness:
Hardware: Intel/Pentium
Language: C++, Java
Interoperability: CORBA/IIOP, TCP/IP, Java RMI
3rd-party SW: Java

TelORB Availability
Real-time object-oriented DBMS supporting:
Distributed transactions
ACID properties expected from a DBMS
Data replication (providing redundancy)
Network redundancy

Software configuration control:
In-service upgrade of software with no disturbance to operation
Hot replacement of faulty processors

Automatic self-healing:
Restart of processes that originally executed on a faulty processor on the processors that are still working

Automatic reconfiguration and reloading

Software upgrade
Smooth software upgrade when the old and the new version of the same process can coexist.
The application can arrange a state transfer between the old and the new static process (needed if the important state is not already stored in the database).


Partitioning: Types and Data

[Figure: example partitioning of types and data (partitions 17-22)]

Advantages
Standard interfaces through CORBA
Standard languages: C++, Java
Based on commercial hardware
(Soft) real-time OS
Fault tolerance implemented in software
Fully scalable architecture
Includes powerful middleware: a database management system and functions for software management

Fully compatible simulated environment for development on Unix/Linux/NT workstations


CASE 3: RODAIN
Real-Time Object-Oriented Database Architecture for Intelligent Networks
Real-time main-memory database system
Runs on a real-time OS: Linux


Rodain Cluster


Rodain Database Node


[Figure: RODAIN database node architecture]

Database Primary Unit:
User Request Interpreter Subsystem
Object-Oriented Database Management Subsystem
Distributed Database Subsystem
Watchdog Subsystem
Fault-Tolerance and Recovery Subsystem

Database Mirror Unit (sharing a disk with the primary unit):
User Request Interpreter Subsystem
Object-Oriented Database Management Subsystem
Distributed Database Subsystem
Watchdog Subsystem
Fault-Tolerance and Recovery Subsystem

