Fault Tolerance
What is Fault Tolerance?
Ability of an operational system to tolerate the presence of faults
5/3/13
Fault Classification
Faults are classified by:
Nature: Accidental, Intentional
Origin:
Phenomenon: Physical, Human-made
Extent: External, Internal
Phase: Design, Operation
Persistence: Permanent, Temporary
Nature distinguishes whether the fault was accidental or intentional.
Origin is divided into three categories:
Phenomenon: is the fault caused by a physical or a human-made phenomenon?
Extent: is the fault caused by the internal or the external environment?
Phase: is the fault introduced during the design or the operation of the system?
Persistence determines the duration of the fault state.
Information
Recover the system via data structures storing system contents
Repetition
Restarts a module when it is found to be faulty
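[Figure: recovery block. After a checkpoint, a switch selects the Primary or one of Alternate 1 to Alternate N-1, and the result goes to the Acceptance Test. Passed: the result is delivered. Failed: if more alternates remain and the deadline is not exceeded, restore from the checkpoint and try the next alternate; otherwise signal a fault.]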
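The recovery block flow in the figure can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the names `recovery_block`, `RecoveryBlockFault`, `buggy_sort`, and `is_sorted` are hypothetical, the checkpoint is modeled as a deep copy of the input state, and the deadline is an optional timestamp compared against an injectable clock.

```python
import copy
import time


class RecoveryBlockFault(Exception):
    """Raised when no variant passes the acceptance test (the 'Fault' exit)."""


def recovery_block(state, variants, acceptance_test, deadline=None, clock=time.monotonic):
    """Run variants[0] (the primary), then each alternate, until one is accepted."""
    checkpoint = copy.deepcopy(state)            # checkpoint taken on entry
    for variant in variants:
        if deadline is not None and clock() > deadline:
            break                                # deadline exceeded: stop trying
        working = copy.deepcopy(checkpoint)      # restore from checkpoint
        try:
            result = variant(working)
        except Exception:
            continue                             # a crash counts as a failed attempt
        if acceptance_test(result):
            return result                        # acceptance test passed
    raise RecoveryBlockFault("no variant passed the acceptance test")


def buggy_sort(xs):
    """Hypothetical faulty primary: reverses the list instead of sorting it."""
    xs.reverse()
    return xs


def is_sorted(xs):
    """Acceptance test: the output must be in non-decreasing order."""
    return all(a <= b for a, b in zip(xs, xs[1:]))


data = [3, 1, 2]
result = recovery_block(data, [buggy_sort, sorted], is_sorted)
```

Because every attempt works on a fresh copy of the checkpoint, the faulty primary cannot corrupt the input seen by the alternate.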
Several hardware channels
Diverse software versions of the code
Results are voted upon
The initial specification is crucial
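[Figure: versions up to Version N run behind a switch with synchronization; if there is no agreement among the results, the outcome is a failure.]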
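Voting over diverse versions can be illustrated with a small majority voter. This is only a sketch under stated assumptions: the versions are called one after another rather than on separate hardware channels, results must be hashable and compared for exact equality, and the names `n_version_execute`, `NoAgreement`, and the three sample versions are hypothetical.

```python
from collections import Counter


class NoAgreement(Exception):
    """Raised when no result is backed by a strict majority of versions."""


def n_version_execute(versions, x):
    """Run every version on the same input and return the majority result."""
    results = [version(x) for version in versions]
    winner, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):                # strict majority required
        raise NoAgreement(f"results disagree: {results}")
    return winner


# Three hypothetical versions of "square a number"; the third is faulty.
def version_a(x):
    return x * x


def version_b(x):
    return x ** 2


def version_c(x):
    return x + x


voted = n_version_execute([version_a, version_b, version_c], 3)
```

A strict majority is one possible decision algorithm; other voters (plurality, median, tolerance-based comparison) trade accuracy against the temporal overhead of the decision step.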
System Overhead
Temporal: synchronization and the decision algorithm
Space: multiple hardware channels and storage for multiple software versions
Extensions
Community Error Recovery (forward recovery)
Uses enough information from the good versions to recover the failed versions
[Figure: versions up to Version N feed a Voter. Agreement: Output. No agreement: AT (acceptance test), whose failure ends in Failure.]
[Figure: Input, Version A, Version B, Acceptance Test; outcomes Accepted and Failed.]
Heartbeat scheme
[Figure: Active Node, Shadow Node, and Supervisor Node. The active and shadow nodes each have a node executive, a primary version, an alternate version, an acceptance test, and device drivers, and they exchange heartbeats. The supervisor node hosts the Recovery Manager and handles heartbeat, reset request, and consent messages. Output goes to the system.]
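The detection side of a heartbeat scheme can be sketched as follows. This is a simplified illustration: the `Supervisor` class, its method names, and the timeout value are assumptions made for the example, timestamps are passed in explicitly so the behavior is deterministic, and the reset request and consent exchange from the figure is not modeled.

```python
class Supervisor:
    """Tracks the last heartbeat of each node and flags the silent ones."""

    def __init__(self, timeout):
        self.timeout = timeout      # seconds of silence tolerated
        self.last_seen = {}         # node name -> time of last heartbeat

    def heartbeat(self, node, now):
        """Record a heartbeat received from `node` at time `now`."""
        self.last_seen[node] = now

    def failed_nodes(self, now):
        """Return the nodes whose last heartbeat is older than the timeout."""
        return sorted(
            node for node, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


supervisor = Supervisor(timeout=3.0)
supervisor.heartbeat("active", now=0.0)
supervisor.heartbeat("shadow", now=0.0)
supervisor.heartbeat("shadow", now=2.0)      # the active node has gone silent
suspects = supervisor.failed_nodes(now=4.0)
```

In a real deployment the timeout has to be chosen against network jitter: too short gives false alarms, too long delays recovery.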
Checkpoint processor
The checkpoint processor detects module failures by comparing the state of each pair of processing modules that perform the same task. The two processors execute their tasks, checkpoint their states, and send the checkpoints to the checkpoint processor.
The checkpoint processor compares the states. If they match, the new checkpoint is considered correct and replaces the old checkpoint.
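The comparison step can be sketched as below. This is a minimal model, assuming states are plain Python values that can be compared with `==`; the class and method names are hypothetical, and what the system does after a mismatch is left to the caller because the text does not specify it.

```python
class CheckpointProcessor:
    """Keeps the last checkpoint on which both processing modules agreed."""

    def __init__(self, initial_state):
        self.checkpoint = initial_state

    def submit(self, state_a, state_b):
        """Compare the states sent by a module pair.

        Returns True and stores the new checkpoint when they match.
        Returns False and keeps the old checkpoint when they differ,
        which signals a module failure.
        """
        if state_a == state_b:
            self.checkpoint = state_a
            return True
        return False


processor = CheckpointProcessor(initial_state={"total": 0})
first_ok = processor.submit({"total": 5}, {"total": 5})      # states match
second_ok = processor.submit({"total": 9}, {"total": 8})     # one module is faulty
```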
N Self-Checking Program
The system is divided into several self-checking components
Each component is made up of different variants of the software (equivalent to alternates in RB and versions in NVP)
Components execute in parallel, and this parallel execution provides the fault tolerance
Each component is responsible for determining whether a delivered result is acceptable
A self-checking component is built in one of two ways:
a) each variant is associated with an acceptance test that tests the results of the variant (Figure a), or
b) the variants are paired together and associated with a comparison algorithm
Data Diversity
Retry Block
Executes the algorithm normally and checks the results with an acceptance test
If the results are accepted by the test, execution is complete
If the results are not accepted, the algorithm runs again once the input data has been restated (re-expressed)
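A retry block can be sketched as a loop over re-expressions of the input. This is a minimal example under assumptions: `fragile_sin` is a made-up faulty routine that is only correct on one period, and the re-expression shifts the input by one period, which leaves the true sine value unchanged. All names are hypothetical.

```python
import math


class RetryBlockFault(Exception):
    """Raised when every re-expression of the input is rejected."""


def retry_block(algorithm, acceptance_test, x, re_expressions):
    """Run the algorithm on x, then on each re-expression of x, until accepted."""
    attempts = [lambda v: v] + list(re_expressions)   # original input first
    for re_express in attempts:
        result = algorithm(re_express(x))
        if acceptance_test(result):
            return result
    raise RetryBlockFault("all re-expressions were rejected")


def fragile_sin(x):
    """Hypothetical faulty routine: only correct for 0 <= x < 2*pi."""
    return math.sin(x) if 0.0 <= x < 2 * math.pi else float("nan")


def in_range(y):
    """Acceptance test: a sine must lie in [-1, 1]; NaN fails the comparison."""
    return -1.0 <= y <= 1.0


def shift_period(x):
    """Re-expression: sin(x) equals sin(x - 2*pi)."""
    return x - 2 * math.pi


value = retry_block(fragile_sin, in_range, 7.0, [shift_period])
```

The same algorithm runs on every attempt; only the data changes, which is what separates data diversity from design diversity.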
N-copy Programming
Upon entry to the block, the data is restated (re-expressed) in N-1 ways
Together with the original input, this creates N different data sets
The copies execute in parallel
Output is selected with a voting scheme
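N-copy programming can be sketched by combining re-expression with a voter. This is a simplified illustration: the copies run sequentially here rather than in parallel, a strict majority vote is assumed as the voting scheme, and `fragile_total` is a made-up routine with an input-dependent bug. Reordering a list is used as the re-expression because the true sum does not depend on order.

```python
from collections import Counter


class NoAgreement(Exception):
    """Raised when no result is backed by a strict majority of copies."""


def n_copy_execute(algorithm, x, re_expressions):
    """Run one algorithm on N data sets (the input plus N-1 re-expressions) and vote."""
    data_sets = [x] + [re_express(x) for re_express in re_expressions]
    results = [algorithm(data) for data in data_sets]
    winner, votes = Counter(results).most_common(1)[0]
    if votes * 2 <= len(results):
        raise NoAgreement(f"results disagree: {results}")
    return winner


def fragile_total(xs):
    """Hypothetical faulty sum: drops the last element when the list starts with 0."""
    if xs and xs[0] == 0:
        return sum(xs[:-1])
    return sum(xs)


def reverse(xs):
    return xs[::-1]


def descending(xs):
    return sorted(xs, reverse=True)


total = n_copy_execute(fragile_total, [0, 5, 7], [reverse, descending])
```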
Summary
Fault tolerant design considerations
Anticipated faults: in most cases, a simple acceptance test is all that is needed
Unanticipated faults: designers must decide what the most practical solution is
Most of the techniques in this report are hardware based, and many designers will not be able to use them
This leaves designers with:
Recovery Blocks (Software Design Diversity)
Retry Blocks (Data Diversity)
Overview
The causes of downtime
Availability solutions
CASE 1: Clustra
CASE 2: TelORB
CASE 3: RODAIN
Unplanned downtime
Hardware failure
OS failure
Database software bugs
Power failure
Disaster
Human error
Failover
Failover is the moment of truth. When a failure occurs on the primary system, all connections must be re-established on the standby, and all active transactions must be rolled back and restarted. Because everything must be transferred, typical failover times are measured in minutes at best, and the database is unavailable during that time.
Primary restart
Once the standby system takes over, there is no longer a standby. This is an especially vulnerable period, so the primary must be restarted as quickly as possible. In some schemes the primary becomes the new standby, and in other schemes processing must, at some point, be switched back to the primary.
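The failover and primary restart steps can be modeled with a toy example. This is only a sketch of the bookkeeping, assuming the scheme in which the restarted primary becomes the new standby; the `Server` class, the role strings, and both functions are hypothetical and do not represent any particular database product.

```python
from dataclasses import dataclass, field


@dataclass
class Server:
    name: str
    role: str                                   # "primary", "standby", or "down"
    connections: set = field(default_factory=set)


def fail_over(primary, standby, active_transactions):
    """Promote the standby and return the transactions that must be restarted."""
    to_restart = list(active_transactions)      # rolled back, then restarted
    standby.connections = set(primary.connections)   # re-establish connections
    primary.connections.clear()
    primary.role = "down"
    standby.role = "primary"                    # from here on there is no standby
    return to_restart


def restart_as_standby(old_primary):
    """Scheme in which the restarted primary becomes the new standby."""
    old_primary.role = "standby"


node_a = Server("A", "primary", {"client-1", "client-2"})
node_b = Server("B", "standby")
restarted = fail_over(node_a, node_b, ["txn-17", "txn-18"])
restart_as_standby(node_a)
```

Between `fail_over` and `restart_as_standby` the model has no standby at all, which is the vulnerable period the text describes.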
CASE 1: Clustra
Developed for telephony applications such as mobility management and intelligent networks
Relational database with location and replication transparency
Real-time data is locked in main memory, and the API provides precompiled transactions
NOT a real-time database!
CASE 2: TelORB
Characteristics
Very high availability (HA), with robustness implemented in software
(Soft) real time
Scalability by using loosely coupled processors
Openness
Hardware: Intel/Pentium
Language: C++, Java
Interoperability: CORBA/IIOP, TCP/IP, Java RMI
3rd-party SW: Java
TelORB Availability
Real-time object-oriented DBMS supporting
Distributed Transactions
ACID Data
Network Redundancy
Restart of processes that originally executed on a faulty processor on the processors that are still working (healing)
In-service upgrade of software with no disturbance to operation
Hot replacement of faulty processors
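Restarting processes from a faulty processor on the working ones can be pictured as a re-placement step. The sketch below is purely illustrative: the text does not describe how TelORB chooses target processors, so the round-robin policy, the `reassign` function, and the process and processor names are all assumptions.

```python
def reassign(placement, failed, working):
    """Return a new placement with the failed processor's processes moved.

    placement: dict mapping process name -> processor name
    failed:    name of the faulty processor
    working:   non-empty list of processors that are still running
    """
    new_placement = dict(placement)
    orphans = sorted(p for p, cpu in placement.items() if cpu == failed)
    for index, process in enumerate(orphans):
        # Assumed policy: spread orphaned processes round-robin.
        new_placement[process] = working[index % len(working)]
    return new_placement


before = {"p1": "cpu0", "p2": "cpu1", "p3": "cpu0"}
after = reassign(before, failed="cpu0", working=["cpu1", "cpu2"])
```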
Automatic Reconfiguration
reloading
Software upgrade
Smooth software upgrade when the old and new versions of the same process can coexist
Possibility for the application to arrange for state transfer between the old and new static process (unless the important states are already stored in the database)
Advantages
Standard interfaces through CORBA
Standard languages: C++, Java
Based on commercial hardware
(Soft) real-time OS
Fault tolerance implemented in software
Fully scalable architecture
Includes powerful middleware: a database management system and functions for software management
CASE 3: RODAIN
Real-Time Object-Oriented Database Architecture for Intelligent Networks
Real-Time Main-Memory Database System
Runs on Real-Time OS: Linux
Rodain Cluster
[Figure: Rodain cluster, showing the Watchdog Subsystems and the shared disk.]