
Lecture IX

Fault-Tolerance in Real-Time Systems


Sidra Rashid, Bahria University, Islamabad Campus

Fault Tolerance
What is Fault Tolerance?
Ability of an operational system to tolerate the presence of faults

Why tolerate faults?


It is practically impossible to completely test a system of realistic size. Therefore, it is important to implement techniques that allow a system to detect and tolerate faults during normal operation.

4 phases of fault tolerance:


Error detection: detection of an erroneous state.
Damage assessment: computes the severity of the fault.
Error processing: substitutes an error-free state for the erroneous one.
Fault treatment: determines the cause of the error, then applies fault passivation to ensure it doesn't happen again.


Fault Classification
Faults are classified along three dimensions (classification tree):

Nature: accidental | intentional
Origin:
  Phenomenon: physical | human-made
  Extent: internal | external
  Phase: design | operation
Persistence: permanent | temporary

Nature of a fault distinguishes the intention behind the fault.
Origin of a fault is categorized along three axes:
Phenomenon: is the fault caused by a physical or a human phenomenon?
Extent: does the internal or the external environment cause the fault?
Phase: is the fault introduced during design or during operation of the system?
Persistence of a fault determines the duration of the fault state.


Software Fault Tolerance Techniques


The key to fault tolerance is redundancy.
Three domains:
Space
Several hardware channels, each executing the same task

Information
The system is recovered via data structures that store redundant copies of the system contents

Repetition (time)
A module is restarted (re-executed) when it is found to be faulty

Two major schemes have evolved


Recovery Block (RB)
1H/NdS/nT system
There is only one hardware channel (1H), and faults are tolerated by executing several diverse software modules (NdS) sequentially, i.e. with redundancy in time (nT)

N-Version Programming (NVP)

NH/NdS/1T system
The system has a number of (identical) hardware channels (NH), each executing one of the diverse software versions (NdS), hence no redundancy in time (1T)

Software Fault Tolerance Recovery Block

[Figure: Recovery Block structure]
A checkpoint is taken and the primary module executes; its result is submitted to the acceptance test. If the test passes, the result is delivered. If the test fails, the state is restored from the checkpoint and the switch selects the next alternate (Alternate 1, Alternate 2, ..., Alternate N-1), as long as untried alternates remain and the deadline has not been exceeded; otherwise a fault is signalled.

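To make the control flow above concrete, here is a minimal Python sketch of a recovery block. It is an illustration only: the module list, the state dictionary, the acceptance_test callable, and the deadline handling are all hypothetical, and a real implementation would also checkpoint I/O and task state.

import copy
import time

def recovery_block(modules, acceptance_test, state, deadline):
    """Try the primary and then each alternate until one passes the acceptance test.

    modules: callables (primary first, then alternates), each operating on a state dict.
    acceptance_test: returns True if a result is acceptable (single test for all modules).
    state: mutable system state; a checkpoint is taken before the first attempt.
    deadline: absolute wall-clock time (seconds since the epoch) by which a result is due.
    """
    checkpoint = copy.deepcopy(state)              # establish the recovery point
    for module in modules:
        if time.time() >= deadline:                # deadline exceeded: stop trying alternates
            break
        result = module(state)
        if acceptance_test(result):
            return result                          # accepted: discard the checkpoint
        state.clear()
        state.update(copy.deepcopy(checkpoint))    # failed: restore from checkpoint, try next alternate
    raise RuntimeError("recovery block failed: no module passed within the deadline")

For example, recovery_block([fast_sort, simple_sort], is_sorted, {}, time.time() + 0.01) would fall back to the simpler alternate if the optimized primary produced an unacceptable result (the function names here are hypothetical).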

Software Fault Tolerance Recovery Block


Considerations
Software diversity
Idea: different teams, one specification, different products
Hope that the failure domains do not overlap

Difficulty in designing the acceptance test
A single test must serve all modules of the recovery block
The test is the most crucial element in improving reliability

Design of the recovery cache
Must be sufficiently simple to ensure it introduces no faults of its own

Increased system overhead

Domino effect
Recovery blocks can push concurrent tasks that communicate into uncontrolled rollback

Software Fault Tolerance N-Version Programming


[Figure: N-Version Programming structure]
An input switch feeds N diverse versions (Version 1, Version 2, ..., Version N), which run synchronized on separate hardware channels. A voter collects the results and delivers the majority agreement as the output; if no agreement is reached, a failure is signalled.

N-Version Programming ( NH/NdS/1T )
Several hardware channels
Software-diverse versions of the code
Results are voted upon
The initial specification is crucial
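A minimal sketch of the majority voter, assuming the versions return results that can be compared exactly (the function names and the strict-majority rule are illustrative, not a prescribed implementation):

from collections import Counter

def n_version_vote(versions, inputs):
    """Run N diverse versions on the same inputs and deliver the majority result."""
    results = [version(inputs) for version in versions]   # one result per hardware channel
    winner, count = Counter(results).most_common(1)[0]    # results assumed hashable and exactly comparable
    if count > len(versions) // 2:                        # strict majority agreement
        return winner
    raise RuntimeError("no majority agreement among versions")

Exact comparison is the weak point noted in the considerations below: real voters often have to accept a range of valid answers rather than identical values.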

Software Fault Tolerance N-Version Programming


Considerations
Software diversity! It is difficult to create a good specification

Decision mechanism
Some results will not always be identical (both valid and invalid ones)
Defining a range of valid solutions helps, but it decreases the distance from the acceptance-test approach

System overhead
Temporal: synchronization and the decision algorithm
Spatial: multiple hardware channels and space for multiple software versions

Extensions
Community Error Recovery (forward recovery)
Uses enough information from the good versions to recover the failed versions

Software Fault Tolerance Consensus Recovery Block(CRB)


[Figure: Consensus Recovery Block structure]
An input switch feeds N versions (Version 1, Version 2, ..., Version N); a voter looks for agreement among their results. If there is agreement, the agreed result is the output. If there is no agreement, the results are submitted to an acceptance test (AT), as long as untried versions remain and the time limit has not expired; otherwise a failure is signalled.

NH/NdS/1T
Synthesis of N-Version Programming and the Recovery Block
Basic assumptions:
No similar errors will occur (i.e. no erroneous results that resemble each other)
If two or more versions agree, the result is considered correct

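The hybrid control flow can be sketched as follows. This is purely illustrative: the two-votes-agree rule comes from the assumption above, results are assumed exactly comparable, and the time-limit handling is simplified.

import time
from collections import Counter

def consensus_recovery_block(versions, acceptance_test, inputs, time_limit):
    """Vote first; if no two versions agree, fall back to an acceptance test."""
    results = [version(inputs) for version in versions]   # versions run (conceptually in parallel)
    winner, count = Counter(results).most_common(1)[0]
    if count >= 2:                                         # two or more agreeing versions: accept
        return winner
    deadline = time.time() + time_limit
    for result in results:                                 # no agreement: RB-style acceptance testing
        if time.time() >= deadline:
            break
        if acceptance_test(result):
            return result
    raise RuntimeError("consensus recovery block failed")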

Software Fault Tolerance Distributed Recovery Block


[Figure: Distributed Recovery Block structure]
The input is delivered to two nodes. On the primary node, Version A executes first and its result is checked by an acceptance test; if the test fails, the node switches to its alternate version, as long as untried alternates remain and the deadline has not been exceeded. The secondary node runs the versions in the opposite order (Version B first), so an already-checked alternate result is available if the primary node fails. If both nodes exhaust their alternates, the scheme fails.

NH/NS/1T or NH/NdS/1T
Reproduces the RB scheme on multiple network nodes
Considerations
Synchronization between the nodes, especially during rollback
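A sequential simulation of the idea (in a real DRB the two nodes run concurrently on separate hardware; the function names here are hypothetical):

def distributed_recovery_block(version_a, version_b, acceptance_test, inputs):
    """Primary node tries Version A first; the shadow node tries Version B first."""
    primary_result = version_a(inputs)            # primary node's first try
    if acceptance_test(primary_result):
        return primary_result                     # normal case: the primary node delivers the output
    shadow_result = version_b(inputs)             # the shadow node has (conceptually) already computed this
    if acceptance_test(shadow_result):
        return shadow_result                      # shadow node takes over without re-executing from scratch
    raise RuntimeError("both nodes failed the acceptance test")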

Extended Distributed Recovery Block


[Figure: Extended Distributed Recovery Block structure]
A supervisor node monitors an active node and a shadow node through a heartbeat scheme. Each node runs a recovery manager that exchanges heartbeat, reset-request, and consent messages with the supervisor.

Three roles: active node, shadow node, supervisor node.

Each node contains:
Primary version
Alternate version
Acceptance test
Device drivers

On the active node the primary version executes and drives the output to the system; on the shadow node the alternate version executes first. Both nodes are connected to the system.
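A toy sketch of the supervisor's side of the heartbeat scheme (the timeout value, the role names, and the promotion decision are assumptions made for illustration):

import time

class Supervisor:
    """Monitor heartbeats from the active and shadow nodes and decide on role switches."""

    def __init__(self, timeout=0.5):
        self.timeout = timeout
        self.last_beat = {"active": time.time(), "shadow": time.time()}

    def heartbeat(self, node):
        self.last_beat[node] = time.time()      # called periodically by each node's recovery manager

    def check(self):
        now = time.time()
        if now - self.last_beat["active"] > self.timeout:
            return "promote shadow to active"   # the shadow node takes over driving the system output
        if now - self.last_beat["shadow"] > self.timeout:
            return "reset shadow"               # request a reset of the silent shadow node
        return "ok"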

Roll-Forward Checkpointing Scheme


Used for multiprocessor systems
Pool of active processing modules, each with:
Processor
Volatile storage
Stable storage

Checkpoint processor
The checkpoint processor detects module failures by comparing the state of each pair of processing modules that perform the same task. The two processors execute their tasks, checkpoint their states, and send the checkpoints to the checkpoint processor.

The checkpoint processor compares the states; if they match, the new checkpoint is considered correct and replaces the old checkpoint.
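A minimal sketch of the comparison step just described (the state representation and the failure signalling are assumptions; the scheme's roll-forward recovery after a mismatch is not shown):

def compare_checkpoints(old_checkpoint, state_a, state_b):
    """Return the checkpoint to keep and whether a failure was detected.

    state_a, state_b: checkpointed states sent by the two modules running the same task.
    """
    if state_a == state_b:             # states match: the new checkpoint is considered correct
        return state_a, False          # it replaces the old checkpoint
    return old_checkpoint, True        # mismatch: keep the old checkpoint and flag a failure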


N Self-Checking Program
Made up of several Self Checking Components
Made up of different variants (equivalent to the alternates in RB and the versions in NVP). A self-checking component is built in one of two ways: a) a variant is associated with an acceptance test that checks its results, or b) two variants are paired and associated with a comparison algorithm. The components execute in parallel, and this parallel execution is what provides the fault tolerance. Each component is responsible for determining whether the result it delivers is acceptable.
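An illustrative sketch of the two component styles (the helper names are invented for this example, and delivery of the first acceptable result is a simplification of the parallel scheme):

def variant_with_test(variant, acceptance_test):
    """Case a): a single variant whose result is checked by an acceptance test."""
    def component(inputs):
        result = variant(inputs)
        return result, acceptance_test(result)
    return component

def variant_pair(variant_a, variant_b):
    """Case b): two variants whose results are checked by a comparison algorithm."""
    def component(inputs):
        result_a, result_b = variant_a(inputs), variant_b(inputs)
        return result_a, result_a == result_b
    return component

def n_self_checking(components, inputs):
    """The components run (conceptually in parallel); deliver the first acceptable result."""
    for component in components:
        result, acceptable = component(inputs)
        if acceptable:
            return result
    raise RuntimeError("no self-checking component delivered an acceptable result")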

Data Diversity
Retry Block
The algorithm executes normally on the original input and its result is checked by an acceptance test.
If the result is accepted by the test, execution is complete.
If the result is not accepted, the input data is restated (re-expressed) and the block runs again on the new data (sketched together with N-copy programming below).

N-copy Programming
Upon entry to the block, the input data is restated in N-1 different ways, creating N different data sets in total.
The N copies of the program execute in parallel, one on each data set.
The output is selected with a voting scheme.
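A combined sketch of the two data-diversity schemes. The re-expression function is entirely application-specific; the numeric perturbation used here, the retry limit, and the exact-match vote are all illustrative assumptions.

import random
from collections import Counter

def re_express(data, seed):
    """Toy data re-expression for numeric inputs: perturb values without changing their meaning."""
    rng = random.Random(seed)
    return [x + rng.uniform(-1e-9, 1e-9) for x in data]

def retry_block(algorithm, acceptance_test, data, max_retries=3):
    """Retry block: re-run the same algorithm on restated data until the test passes."""
    candidate = data
    for attempt in range(max_retries + 1):
        result = algorithm(candidate)
        if acceptance_test(result):
            return result
        candidate = re_express(data, seed=attempt)   # restate the input and try again
    raise RuntimeError("retry block failed")

def n_copy_programming(algorithm, data, n=3):
    """N-copy programming: run N copies on N restated data sets and vote on the outputs."""
    data_sets = [data] + [re_express(data, seed=i) for i in range(n - 1)]
    results = [algorithm(d) for d in data_sets]      # the copies would run in parallel
    winner, _ = Counter(results).most_common(1)[0]   # exact-match plurality vote over hashable results
    return winner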


Summary
Fault tolerant design considerations
Anticipated faults: in most cases, a simple acceptance test is all that is needed.
Unanticipated faults: designers must decide what the most practical solution is.
Most of the techniques in this report are hardware based, and many designers will not be able to use them.
This leaves designers with:
Recovery Blocks (software design diversity)
Retry Blocks (data diversity)


Fault-Tolerance in Real-Time Databases


Overview
The causes of downtime
Availability solutions
CASE 1: Clustra
CASE 2: TelORB
CASE 3: RODAIN


The Causes of Downtime


Planned downtime
Hardware expansion
Database software upgrades
Operating system upgrades

Unplanned downtime
Hardware failure
OS failure
Database software bugs
Power failure
Disaster
Human error

Traditional Availability Solutions


Replication:
The standby system needs to duplicate transactions as they occur on the primary system. Ideally, this replication is done in near-real time, so the standby system is very close to current in the event of a primary system failure.

Failover
Failover is the moment of truth. When a failure occurs on the primary system, all connections must be re-established on the standby, and all active transactions must be rolled back and restarted. Because everything must be transferred, typical failover times are measured in minutes at best, during which time the database is unavailable.

Primary restart
Once the standby system takes over, there is no longer a standby. This is an especially vulnerable period, so the primary must be restarted as quickly as possible. In some schemes the primary becomes the new standby; in other schemes processing must, at some point, be switched back to the primary.
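A toy sketch of the replicate/failover/restart cycle described above (the class, its data layout, and the transaction handling are invented for illustration and ignore networking, logging, and consistency details):

class PrimaryStandbyPair:
    """Minimal primary/standby pair: replicate commits, fail over, then restore a standby."""

    def __init__(self):
        self.primary = {"data": {}, "active_txns": set()}
        self.standby = {"data": {}, "active_txns": set()}

    def commit(self, key, value):
        self.primary["data"][key] = value
        if self.standby is not None:
            self.standby["data"][key] = value     # near-real-time replication of committed work

    def failover(self):
        to_restart = set(self.primary["active_txns"])     # active transactions are rolled back and restarted
        self.primary, self.standby = self.standby, None   # the standby takes over; no standby remains
        return to_restart

    def restart_old_primary(self):
        # The restarted machine re-joins as the new standby, copying the current data
        self.standby = {"data": dict(self.primary["data"]), "active_txns": set()}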


CASE 1: Clustra
Developed for telephony applications such as mobility management and intelligent networks.
Relational database with location and replication transparency.
Real-time data is locked in main memory, and the API provides precompiled transactions.
NOT a real-time database!


Clustra hardware architecture


Data distribution and replication


How Clustra Handles Failures


Real-time failover: hot-standby data is up to date, so failover occurs in milliseconds.
Automatic restart and takeback: restart of the failed node and takeback of operations are automatic, and again transparent to users and operators.
Self-repair: if a node fails completely, data is copied from the complementary node to a standby. This is also automatic and transparent.
Limited failure effects.


How Clustra Handles Upgrades


Hardware, operating system, and database software upgrades are performed without ever going down.
The process is called a rolling upgrade:
Required changes are performed node by node. Each upgraded node catches up to the status of its complementary node; when this is complete, the operation moves on to the next node.
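A schematic sketch of the node-by-node loop (the node interface, the upgrade and catch_up callbacks, and the pairing of complementary nodes are hypothetical):

def rolling_upgrade(node_pairs, upgrade, catch_up):
    """Upgrade one node at a time while its complementary node keeps serving traffic."""
    for node, complementary in node_pairs:
        node.take_offline()               # the complementary node carries the load meanwhile
        upgrade(node)                     # hardware, OS, or database software change
        catch_up(node, complementary)     # replay missed transactions until the node is current
        node.bring_online()               # then proceed to the next node; service is never interrupted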


CASE 2: TelORB
Characteristics
Very high availability (HA), robustness implemented in SW
(Soft) real time
Scalability by using loosely coupled processors
Openness:
Hardware: Intel/Pentium
Language: C++, Java
Interoperability: CORBA/IIOP, TCP/IP, Java RMI
3rd-party SW: Java

TelORB Availability
Real-time object-oriented DBMS supporting:
Distributed transactions
ACID properties expected from a DBMS
Data replication (providing redundancy)
Network redundancy

Software configuration control:
In-service upgrade of software with no disturbance to operation
Hot replacement of faulty processors

Automatic self-healing:
Restart of processes that originally executed on a faulty processor on the processors that are still working

Automatic reconfiguration and reloading

Software upgrade
Smooth software upgrade when the old and the new version of the same process can coexist.
The application can arrange a state transfer between the old and the new static process (needed if the important state is not already stored in the database).


Partitioning: Types and Data

[Figure: example partitioning of types and data (partitions 17-22)]

Advantages
Standard interfaces through CORBA
Standard languages: C++, Java
Based on commercial hardware
(Soft) real-time OS
Fault tolerance implemented in software
Fully scalable architecture
Includes powerful middleware: a database management system and functions for software management

Fully compatible simulated environment for development on Unix/Linux/NT workstations


CASE 3: RODAIN
Real-Time Object-Oriented Database Architecture for Intelligent Networks
Real-time main-memory database system
Runs on a real-time OS: Linux


Rodain Cluster


Rodain Database Node


[Figure: RODAIN database node architecture]

Database Primary Unit:
User Request Interpreter Subsystem
Object-Oriented Database Management Subsystem
Distributed Database Subsystem
Watchdog Subsystem
Fault-Tolerance and Recovery Subsystem

Database Mirror Unit (sharing a disk with the primary unit):
User Request Interpreter Subsystem
Object-Oriented Database Management Subsystem
Distributed Database Subsystem
Watchdog Subsystem
Fault-Tolerance and Recovery Subsystem

