CMP3201: Embedded Systems

Reliability & Dependability Engineering

Introduction
Zero-defect systems are impossible to design. Even nature, the author of the most complex engineering techniques, often produces faulty plants and animals. Each time you set off on a long journey, you would be wise to carry a spare tyre or battery, because one of your car's tyres could fail. In other words, a seasoned long-distance driver takes the possibility of component failure into account so as to lower the probability of system failure.

Elevators are another good example. Any elevator component can fail, including the cable from which the elevator car is suspended. But the elevator system as a whole is designed in such a way that even when the cable breaks, the car will not come crashing down. For the road vehicle, redundancy (spare tyre, battery) can be used, but in the case of the elevator, redundancy does not necessarily solve the problem. Multiple cables may help address one specific form of component failure, but not another, such as power failure. Traditional high-reliability systems have used hardware redundancy (for example, two engines on an airplane instead of one). But cost-sensitive everyday embedded systems often do not have a price structure that permits redundancy.

It is worth contemplating how deeply ingrained the discipline of reliable system design is outside software engineering. If your kitchen sink leaks, you can close a valve that stops the flow of water to that sink. The valve is there because experience has shown that sinks do occasionally leak, no matter how carefully they are constructed to prevent just that. If you short-circuit an electrical outlet in your home, a fuse will blow. The fuse is there to prevent greater disaster in case the unimaginable happens. The presence of the fuse and the valve does not signify an implicit acceptance of sloppy workmanship; they are an essential part of reliable system design.

Building Reliable Safety-Critical Systems from Unreliable Parts
Your pen is unreliable. Your memory is unreliable. Cheating is very unreliable. So, how does a student ensure that they pass an exam seamlessly?
The techniques that you, as an individual, have developed to ensure this are a good example of how unreliable parts can be used to build a reliable system.

In the embedded systems market, the reliability of an embedded application is one of the most critical competitive factors, given that even the simplest system failure may stop entire factory assembly lines or other expensive processes. Likewise, military and medical systems in particular are intolerant of failures, as lives may depend on the reliable functioning of equipment. In industrial or scientific applications, embedded systems may be out of human reach, making failures costly in terms of system downtime and the human labor involved in getting to the board. Embedded computers must therefore be designed to run continuously for years without errors, often in hostile environments that no desktop PC could endure. Meeting these higher quality requirements of the industrial and consumer markets calls for sophisticated testing processes and new performance evaluation techniques which provide a
company the capability to estimate how its design would hold up over the years, without having to wait that long. Before going any further, we should define what reliability actually is and what exactly makes a system dependable.

Definitions
Let $S$ be the set of all possible states of a system. We can divide $S$ into two disjoint subsets $S_S$ and $S_F$, where $S_S$ is the subset of states in which the system is operating successfully and $S_F$ the subset in which the system has failed, so that $S = S_S \cup S_F$. We can then define reliability as $R = P(S_S)$.

It is the probability of a system staying in the operating state without failure.
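
As a minimal sketch of this definition, the snippet below treats a small system as a set of states with known probabilities and sums the probability of the successful subset. The state names and the numbers are invented for illustration; they do not come from any real device.

```python
# Hypothetical state model of a small controller. The state names and
# probabilities are assumptions made up for this illustration.
state_probability = {
    "idle": 0.60,
    "sampling": 0.25,
    "transmitting": 0.13,
    "sensor_fault": 0.015,
    "hung": 0.005,
}

# S_S: states in which the system is operating successfully.
successful_states = {"idle", "sampling", "transmitting"}
# S_F: every remaining state counts as a failure, so S = S_S U S_F.
failed_states = set(state_probability) - successful_states


def reliability(probabilities, successful):
    """Return R = P(S_S), the total probability of the successful states."""
    return sum(p for state, p in probabilities.items() if state in successful)


R = reliability(state_probability, successful_states)
print(f"R = {R:.3f}, probability of failure = {1 - R:.3f}")
```

In practice the state probabilities are not known in advance and would have to be estimated from testing or field data.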


When unreliable parts are put together to constitute a system, the system design is what determines how reliable the system will be. When we say that a system is reliable, we mean that the probability of failure is small. A system is reliable if and only if it is fault tolerant. If, when the system fails, it remains safe, we say that the system is failsafe.

We further define the dependability of a computing system as the trustworthiness of the system which allows reliance (reliability) to be justifiably placed on the service it delivers. Dependability includes reliability performance, maintainability performance and safety performance. These definitions are by the IFIP (International Federation for Information Processing) and Electropedia (the IEC's online electrotechnical vocabulary).

As an aside, we may define a foolproof system as one that can be operated successfully by any individual. Such systems are usually called user-friendly, and achieving this is the job of the project manager rather than the designer.

The following can cause a system to fail:
- A required event that does not occur
- An incorrect sequence of desired events
- Two incompatible events occurring simultaneously
- Timing failures in event sequences
- Failing to ensure minimum time constraints between events

All of these events are caused by an unreliable part or subsystem. How, then, can we minimize the probability of failure with such parts? There are a number of strategies for achieving system reliability. The first two listed below are the primary ones.

i. Simplicity
The first strategy is to use a design that emphasizes simplicity and robustness. A simple design is easier to understand, easier to test or verify, and easier to operate.

ii. Redundancy
The second strategy is to exploit redundancy. Failures of individual components are usually statistically independent, so the chance of having both a prime and a backup component fail at the same time can be made very small. If, for instance, all components have the same probability $p$ of failure, then the probability that all $N$ components fail in an $N$-redundant system is $p^N$. There is both hardware and software redundancy. For software redundancy, we could execute two or more versions of the same thread to carry out a given task. We shall speak more about the issues concerning redundancy in embedded systems.

iii.

iii. Self-repairing systems
Just as you can cross out an error and write the correct answer on your answer sheet, computers can fix themselves. The repair will usually happen after the error has occurred (which, strictly speaking, violates our definition of reliability), but the time it takes to fix the error may be short enough not to cause any significant consequences. Self-repairing hardware is a new trend, especially in aviation. An EPSRC-funded project at the University of Bristol is in the process of completing body designs for a self-healing aircraft. Many companies (Microsoft, HP) are currently investing in self-healing software systems using object-oriented design and AI. Through machine learning, high-end embedded systems will be better able to optimize, configure, heal and protect themselves.

iv. Error-predicting systems
Some errors in a system are gradual. The onset of an error may be characterized by small irregularities in the program functionality. One design example is as follows. When a power IC is gradually burning out, its quiescent current will generally increase. We can measure this current regularly and make sure it is in the right range. If we note the rate of increase, we can predict when the IC will fail and alert the users soon enough.

v. Designing for worst case scenario
This is only an option when you have sufficient resources. If your microcontroller keeps saving 10MB of data every 10 minutes, you could keep erasing a 10MB disc within this time interval to free the memory for further write operations. If, however, it does this only 5 times a day, you could simply get a 50MB disc, just in case it fails to erase.
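
The error-predicting strategy (the quiescent-current example) can be sketched as follows. This is a minimal illustration and not a production method: it assumes the current drifts roughly linearly, fits a straight line to logged samples by least squares, and extrapolates to a failure threshold. The function name `predict_time_to_limit`, the sample values and the 8.0 mA limit are all hypothetical.

```python
def predict_time_to_limit(samples, limit):
    """Estimate the hours left until the current reaches `limit`.

    `samples` is a list of (time_in_hours, current_in_mA) pairs. A straight
    line is fitted by least squares and extrapolated. Returns None when the
    fitted current is not rising, since this simple model then predicts
    no failure.
    """
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_i = sum(i for _, i in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must cover more than one time instant")
    # Slope is the estimated rate of increase in mA per hour.
    slope = sum((t - mean_t) * (i - mean_i) for t, i in samples) / var_t
    if slope <= 0:
        return None
    intercept = mean_i - slope * mean_t
    time_at_limit = (limit - intercept) / slope
    latest_time = max(t for t, _ in samples)
    # Never report a negative remaining time if the limit is already passed.
    return max(0.0, time_at_limit - latest_time)


# Hypothetical log: quiescent current creeping up over 300 hours.
log = [(0, 5.0), (100, 5.5), (200, 6.0), (300, 6.5)]
remaining = predict_time_to_limit(log, limit=8.0)
if remaining is not None and remaining < 500:
    print(f"warning: limit expected in about {remaining:.0f} hours")
```

A real design would also have to deal with measurement noise and non-linear drift, which this sketch ignores.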


Hardware and Software Reliability
An embedded system is incomplete with only hardware or only software; the two must co-exist. The hardware gives the software a platform to run on, and the software manipulates the hardware at the lowest levels. We therefore need both of them not to fail. Hardware reliability is easier to understand. Software reliability requires all modules in a program to operate together seamlessly.

Why hardware fails
Some reasons include:
- Physical trauma
- Short circuits
- Wrong connections
- Poor connections
- Buggy software

Why software fails
- Logic errors
- Poor inter-process communication
- Faulty hardware

Many documented cases of embedded systems failures have been ascribed to software malfunction, such as hybrid cars stalling suddenly at highway speeds (2005).

Problems when Designing for Reliability
Addressing the concerns of reliability will require techniques beyond the traditional fault tolerance techniques of simplicity and redundancy. Further, designing for the worst-case scenario would be overly pessimistic, expensive and likely to cause performance loss. Consequently, developers need new techniques.

Redundancy always increases cost, whether monetary or in terms of time or human labor. It is sometimes simply not possible, and it could cause the price of a project to double. Embedded systems are cost sensitive and often work with limited resources, such as smaller memory sizes or diskless designs. These constraints can make it difficult to apply design methodologies like redundancy to improve embedded systems reliability. For example, executing multiple redundant versions of the same thread to ensure reliable operation, while common in server environments, can be quite costly in embedded systems.

Providing reliable embedded systems operation while satisfying other stringent constraints, such as power consumption and real-time throughput, is essential. Consequently, reliability-oriented designs that rely on aggressive redundancy, such as triple-module redundancy, might not be viable from an energy viewpoint. Similarly, approaches such as transaction rollback, used in many server environments, might have no relevance in real-time embedded applications if the recovery happens after a deadline.
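
To put rough numbers on this trade-off, the sketch below compares the failure probability of a single module with that of triple-module redundancy using a majority voter. It assumes that replicas fail independently with the same probability p (as in the Redundancy strategy above) and that the voter itself never fails. The function names and p = 0.01 are illustrative assumptions.

```python
from math import comb


def all_fail_probability(p, n):
    """Probability that all n independent replicas fail: p**n."""
    return p ** n


def voting_failure_probability(p, n=3):
    """Probability that a majority of n independent replicas fail.

    With a majority voter this is the chance that the voted result is
    wrong, assuming a failed replica always produces a bad output.
    """
    majority = n // 2 + 1
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(majority, n + 1)
    )


def majority_vote(outputs):
    """Return the value produced by a strict majority of replicas, else None."""
    for candidate in set(outputs):
        if outputs.count(candidate) * 2 > len(outputs):
            return candidate
    return None


p = 0.01  # assumed failure probability of one replica
print(f"single module fails: {p:.6f}")
print(f"TMR voter is wrong:  {voting_failure_probability(p):.6f}")
print(f"all three fail:      {all_fail_probability(p, 3):.6f}")
print("voted output:", majority_vote([42, 42, 7]))
```

The gain in reliability comes at roughly three times the computation, and therefore energy, per result, which is the cost the text above warns about.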

Bibliography
[a fraction of these notes has been obtained from the following]
- Brian Blum (2010). Self-Healing Software on Its Way. Article.
- Reliability Concerns in Embedded System Designs. Computer Magazine, 2006.
- Pimentel, J.R. Safety-Reliability of Distributed Embedded System Fault Tolerant Units. Kettering University, Michigan.
- Holzmann, J., Joshi, R. Reliable Software Systems Design: Defect Prevention, Detection, and Containment.
